Arxiv今日论文 | 2025-04-15

本篇博文主要内容为 2025-04-15 从Arxiv.org论文网站获取的最新论文列表，自动更新，按照NLP、CV、ML、AI、IR五个大方向区分，若需要邮件定时接收，请在评论区留下你的邮箱号。

说明：每日论文数据从Arxiv.org获取，每天早上12:00左右定时自动更新。

友情提示: 如何您需要邮箱接收每日论文数据，请在评论处留下你的邮箱。

【速读】：该论文试图解决现有评估方法在评价采用慢思考策略的推理模型（Reasoning Models）输出时面临的挑战，特别是无法有效判断大型语言模型（LLMs）生成的答案是否与参考答案等效，以及难以从长且复杂的响应中准确识别和提取最终答案的问题。为了解决这一问题，论文提出了一种高效的答案验证器xVerify。其关键在于通过构造VAR数据集，并利用多轮标注确保标签准确性，从而训练出能够高效进行等效性判断的模型，显著提升了对多种客观问题答案等效性的评估能力，并在测试集和泛化集上的F1分数和准确率均超过95%，表现出卓越的有效性和泛化能力。

链接: https://arxiv.org/abs/2504.10481
作者: Ding Chen,Qingchen Yu,Pengyuan Wang,Wentao Zhang,Bo Tang,Feiyu Xiong,Xinchi Li,Minchuan Yang,Zhiyu Li
机构: Research Institute of China Telecom (中国电信研究院), Beijing, China; MemTensor (Shanghai) Technology Co., Ltd. (迈蒙智能（上海）科技有限公司); Center for Data Science, Peking University (北京大学数据科学中心)
类目: Computation and Language (cs.CL)
备注: 32 pages

点击查看摘要

Abstract:With the release of the o1 model by OpenAI, reasoning models adopting slow thinking strategies have gradually emerged. As the responses generated by such models often include complex reasoning, intermediate steps, and self-reflection, existing evaluation methods are often inadequate. They struggle to determine whether the LLM output is truly equivalent to the reference answer, and also have difficulty identifying and extracting the final answer from long, complex responses. To address this issue, we propose xVerify, an efficient answer verifier for reasoning model evaluations. xVerify demonstrates strong capability in equivalence judgment, enabling it to effectively determine whether the answers produced by reasoning models are equivalent to reference answers across various types of objective questions. To train and evaluate xVerify, we construct the VAR dataset by collecting question-answer pairs generated by multiple LLMs across various datasets, leveraging multiple reasoning models and challenging evaluation sets designed specifically for reasoning model assessment. A multi-round annotation process is employed to ensure label accuracy. Based on the VAR dataset, we train multiple xVerify models of different scales. In evaluation experiments conducted on both the test set and generalization set, all xVerify models achieve overall F1 scores and accuracy exceeding 95%. Notably, the smallest variant, xVerify-0.5B-I, outperforms all evaluation methods except GPT-4o, while xVerify-3B-Ib surpasses GPT-4o in overall performance. These results validate the effectiveness and generalizability of xVerify.
zh

[NLP-1] MIEB: Massive Image Embedding Benchmark

【速读】：该论文试图解决现有图像表征评估方法碎片化的问题，即当前对图像嵌入模型的评估通常依赖于孤立且任务特定的协议，导致对模型能力的整体理解不完整。例如，尚不清楚在聚类任务中表现优异的图像嵌入模型是否同样能在给定文本的情况下有效检索相关图像。为了解决这一问题，论文提出了大规模图像嵌入基准（Massive Image Embedding Benchmark, MIEB），用于评估图像和图像-文本嵌入模型在迄今为止最广泛的谱系中的性能。MIEB涵盖了38种语言下的130个独立任务，并将其归类为8个高级别类别。通过在MIEB上评估50种模型，论文发现没有单一方法能够在所有任务类别中占据主导地位。此外，研究揭示了先进视觉模型的隐藏能力，如其准确表示文本的能力，以及在存在混淆因素时在图像与文本交织编码和匹配方面的局限性。同时，研究还表明，视觉编码器在MIEB上的性能与其在多模态大型语言模型中的表现高度相关。关键在于构建了一个广泛覆盖且任务多样化的基准，以全面评估图像嵌入模型的能力，并揭示其在不同应用场景下的优势与不足。代码、数据集和排行榜已公开发布。

链接: https://arxiv.org/abs/2504.10471
作者: Chenghao Xiao,Isaac Chung,Imene Kerboua,Jamie Stirling,Xin Zhang,Márton Kardos,Roman Solomatin,Noura Al Moubayed,Kenneth Enevoldsen,Niklas Muennighoff
机构: Durham University (杜伦大学); Zendesk (Zendesk); Esker (Esker); INSA Lyon, LIRIS (里昂国立应用科学学院, LIRIS实验室); The Hong Kong Polytechnic University (香港理工大学); Aarhus University (奥胡斯大学); ITMO University (ITMO大学); Contextual AI (Contextual AI); Stanford University (斯坦福大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Image representations are often evaluated through disjointed, task-specific protocols, leading to a fragmented understanding of model capabilities. For instance, it is unclear whether an image embedding model adept at clustering images is equally good at retrieving relevant images given a piece of text. We introduce the Massive Image Embedding Benchmark (MIEB) to evaluate the performance of image and image-text embedding models across the broadest spectrum to date. MIEB spans 38 languages across 130 individual tasks, which we group into 8 high-level categories. We benchmark 50 models across our benchmark, finding that no single method dominates across all task categories. We reveal hidden capabilities in advanced vision models such as their accurate visual representation of texts, and their yet limited capabilities in interleaved encodings and matching images and texts in the presence of confounders. We also show that the performance of vision encoders on MIEB correlates highly with their performance when used in multimodal large language models. Our code, dataset, and leaderboard are publicly available at this https URL.
zh

[NLP-2] GUI-R1 : A Generalist R1-Style Vision-Language Action Model For GUI Agents

【速读】：该论文旨在解决现有基于监督微调大型视觉语言模型（Large Vision-Language Models, LVLMs）构建图形用户界面（Graphical User Interface, GUI）代理的方法在实际应用中的局限性。具体而言，这些方法不仅需要大量训练数据，还难以有效理解GUI截图并在未见过的界面上泛化，尤其是在处理高层次任务时表现欠佳。为了解决这些问题，论文提出了一种名为\name的新框架，其核心在于通过统一的动作空间规则建模引入强化学习方法。关键解决方案是利用少量精心策划的高质量跨平台数据（涵盖Windows、Linux、macOS、Android和Web），并采用分组相对策略优化（Group Relative Policy Optimization, GRPO）等策略优化算法更新模型，从而仅使用极小比例的数据量（3K vs. 13M）即实现了超越先前最先进的方法（如OS-Atlas）的性能，在八个基准测试（覆盖移动、桌面和Web三个平台）中取得了显著成果。这表明基于统一动作空间规则建模的强化学习在提升LVLMs执行实际GUI代理任务能力方面具有巨大潜力。

链接: https://arxiv.org/abs/2504.10458
作者: Xiaobo Xia,Run Luo
机构: National University of Singapore; Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences; University of Chinese Academy of Sciences
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)
备注:

点击查看摘要

Abstract:Existing efforts in building Graphical User Interface (GUI) agents largely rely on the training paradigm of supervised fine-tuning on Large Vision-Language Models (LVLMs). However, this approach not only demands extensive amounts of training data but also struggles to effectively understand GUI screenshots and generalize to unseen interfaces. The issue significantly limits its application in real-world scenarios, especially for high-level tasks. Inspired by Reinforcement Fine-Tuning (RFT) in large reasoning models (e.g., DeepSeek-R1), which efficiently enhances the problem-solving capabilities of large language models in real-world settings, we propose \name, the first reinforcement learning framework designed to enhance the GUI capabilities of LVLMs in high-level real-world task scenarios, through unified action space rule modeling. By leveraging a small amount of carefully curated high-quality data across multiple platforms (including Windows, Linux, MacOS, Android, and Web) and employing policy optimization algorithms such as Group Relative Policy Optimization (GRPO) to update the model, \name achieves superior performance using only 0.02% of the data (3K vs. 13M) compared to previous state-of-the-art methods like OS-Atlas across eight benchmarks spanning three different platforms (mobile, desktop, and web). These results demonstrate the immense potential of reinforcement learning based on unified action space rule modeling in improving the execution capabilities of LVLMs for real-world GUI agent tasks.
zh

[NLP-3] RealWebAssist: A Benchmark for Long-Horizon Web Assistance with Real-World Users

【速读】：该论文试图解决长时程网络任务中人工智能代理无法有效遵循复杂、模糊且动态变化的真实用户指令的问题。现有网络代理基准测试主要关注单一明确任务的执行，而现实世界中的用户指令往往具有歧义性、需要不同层次的AI协助，并可能随时间演变以反映用户心理状态的变化。为填补这一空白，论文提出了RealWebAssist这一新基准，其关键在于通过包含从真实用户收集的序列化指令数据集来评估代理在涉及长时间网络交互、视觉图形用户界面(GUI)定位以及理解模糊用户指令的真实场景中的表现。成功的代理不仅需推理每个指令背后的真正意图，还需追踪用户的认知状态、理解特定于用户的习惯模式，并将目标任务正确地映射到GUI元素的操作上。实验结果表明，当前最先进的模型在理解和定位用户指令方面存在显著困难，这凸显了在长时程网络辅助中遵循真实用户指令所面临的重大挑战。

链接: https://arxiv.org/abs/2504.10445
作者: Suyu Ye,Haojun Shi,Darren Shih,Hyokun Yun,Tanya Roosta,Tianmin Shu
机构: Johns Hopkins University (约翰斯·霍普kins大学); Amazon.com (亚马逊)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: Project Website: this https URL Code: this https URL

点击查看摘要

Abstract:To achieve successful assistance with long-horizon web-based tasks, AI agents must be able to sequentially follow real-world user instructions over a long period. Unlike existing web-based agent benchmarks, sequential instruction following in the real world poses significant challenges beyond performing a single, clearly defined task. For instance, real-world human instructions can be ambiguous, require different levels of AI assistance, and may evolve over time, reflecting changes in the user’s mental state. To address this gap, we introduce RealWebAssist, a novel benchmark designed to evaluate sequential instruction-following in realistic scenarios involving long-horizon interactions with the web, visual GUI grounding, and understanding ambiguous real-world user instructions. RealWebAssist includes a dataset of sequential instructions collected from real-world human users. Each user instructs a web-based assistant to perform a series of tasks on multiple websites. A successful agent must reason about the true intent behind each instruction, keep track of the mental state of the user, understand user-specific routines, and ground the intended tasks to actions on the correct GUI elements. Our experimental results show that state-of-the-art models struggle to understand and ground user instructions, posing critical challenges in following real-world user instructions for long-horizon web assistance.
zh

[NLP-4] Multimodal Long Video Modeling Based on Temporal Dynamic Context

【速读】：该论文致力于解决现有大语言模型（Large Language Models, LLMs）在长视频理解中的两大挑战：一是由于LLMs上下文长度限制导致的长视频处理困难；二是视频中包含海量信息以及多模态（如音频）处理的复杂性。为应对这些挑战，论文提出了一种名为“Temporal Dynamic Context (TDC)”的动态长视频编码方法。其关键在于利用帧之间的时序关系，首先通过视觉-音频编码器将视频分割为语义一致的场景，并对每一帧生成令牌；其次，设计了一种基于查询的Transformer压缩器，将视频、音频及指令文本令牌聚合为有限的时序上下文令牌，从而有效减少每个片段的令牌数量；最后，将静态帧令牌与压缩后的时序上下文令牌输入LLM以实现视频理解。此外，为处理极长视频，论文还提出了无需训练的逐步推理策略，通过从多个视频片段中逐步提取中间答案来辅助最终决策。实验结果表明，该方法在通用视频理解和音视频理解任务中表现优异。

链接: https://arxiv.org/abs/2504.10443
作者: Haoran Hao,Jiaming Han,Yiyuan Zhang,Xiangyu Yue
机构: MMLab, The Chinese University of Hong Kong (香港中文大学); Nanjing University (南京大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Multimedia (cs.MM)
备注:

点击查看摘要

Abstract:Recent advances in Large Language Models (LLMs) have led to significant breakthroughs in video understanding. However, existing models still struggle with long video processing due to the context length constraint of LLMs and the vast amount of information within the video. Although some recent methods are designed for long video understanding, they often lose crucial information during token compression and struggle with additional modality like audio. In this work, we propose a dynamic long video encoding method utilizing the temporal relationship between frames, named Temporal Dynamic Context (TDC). Firstly, we segment the video into semantically consistent scenes based on inter-frame similarities, then encode each frame into tokens using visual-audio encoders. Secondly, we propose a novel temporal context compressor to reduce the number of tokens within each segment. Specifically, we employ a query-based Transformer to aggregate video, audio, and instruction text tokens into a limited set of temporal context tokens. Finally, we feed the static frame tokens and the temporal context tokens into the LLM for video understanding. Furthermore, to handle extremely long videos, we propose a training-free chain-of-thought strategy that progressively extracts answers from multiple video segments. These intermediate answers serve as part of the reasoning process and contribute to the final answer. We conduct extensive experiments on general video understanding and audio-video understanding benchmarks, where our method demonstrates strong performance. The code and models are available at this https URL.
zh

[NLP-5] LLM Can be a Dangerous Persuader: Empirical Study of Persuasion Safety in Large Language Models

【速读】：本文旨在解决大型语言模型（Large Language Models, LLMs）在劝说任务中潜在的安全风险问题，特别是其可能被用于不道德影响的隐患，如操纵、欺骗、利用脆弱性及其他有害策略。研究的关键在于通过两个重要方面系统性地调查劝说安全：(1) LLMs 是否能够适当地拒绝不道德的劝说任务并在执行过程中避免不道德策略，包括那些初始目标看似无伦理争议的情况；(2) 影响因素如人格特质和外部压力如何影响它们的行为。为此，作者提出了PersuSafety框架，这是首个全面评估劝说安全性的框架，包含场景创建、劝说对话模拟以及安全评估三个阶段。PersuSafety涵盖了6种不同的不道德劝说主题和15种常见的不道德劝说策略。通过对8种广泛使用的LLMs进行大量实验，发现大多数LLMs存在显著的安全隐患，包括未能识别有害的劝说任务和使用多种不道德劝说策略。本研究呼吁更加关注渐进性和目标导向对话中的安全性对齐问题。

链接: https://arxiv.org/abs/2504.10430
作者: Minqian Liu,Zhiyang Xu,Xinyi Zhang,Heajun An,Sarvech Qadir,Qi Zhang,Pamela J. Wisniewski,Jin-Hee Cho,Sang Won Lee,Ruoxi Jia,Lifu Huang
机构: Virginia Tech (弗吉尼亚理工学院暨州立大学); UC Davis (加州大学戴维斯分校); STIR Lab (未知); Vanderbilt University (范德比尔特大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
备注: 20 pages, 7 figures, 4 tables

点击查看摘要

Abstract:Recent advancements in Large Language Models (LLMs) have enabled them to approach human-level persuasion capabilities. However, such potential also raises concerns about the safety risks of LLM-driven persuasion, particularly their potential for unethical influence through manipulation, deception, exploitation of vulnerabilities, and many other harmful tactics. In this work, we present a systematic investigation of LLM persuasion safety through two critical aspects: (1) whether LLMs appropriately reject unethical persuasion tasks and avoid unethical strategies during execution, including cases where the initial persuasion goal appears ethically neutral, and (2) how influencing factors like personality traits and external pressures affect their behavior. To this end, we introduce PersuSafety, the first comprehensive framework for the assessment of persuasion safety which consists of three stages, i.e., persuasion scene creation, persuasive conversation simulation, and persuasion safety assessment. PersuSafety covers 6 diverse unethical persuasion topics and 15 common unethical strategies. Through extensive experiments across 8 widely used LLMs, we observe significant safety concerns in most LLMs, including failing to identify harmful persuasion tasks and leveraging various unethical persuasion strategies. Our study calls for more attention to improve safety alignment in progressive and goal-driven conversations such as persuasion.
zh

[NLP-6] Can We Edit LLM s for Long-Tail Biomedical Knowledge?

【速读】：该论文试图解决大型语言模型（Large Language Models, LLMs）在生物医学领域应用时，因生物医学知识长尾分布导致的知识编辑效果受限的问题。论文的关键在于探究现有知识编辑方法在处理长尾生物医学知识时的有效性，并揭示长尾知识中广泛存在的一对多关系（one-to-many knowledge）是限制这些方法提升LLMs理解能力的主要因素。研究强调需要针对性策略来弥合长尾知识与高频知识之间的性能差距。

链接: https://arxiv.org/abs/2504.10421
作者: Xinhao Yi,Jake Lever,Kevin Bryson,Zaiqiao Meng
机构: University of Glasgow (格拉斯哥大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Knowledge editing has emerged as an effective approach for updating large language models (LLMs) by modifying their internal knowledge. However, their application to the biomedical domain faces unique challenges due to the long-tailed distribution of biomedical knowledge, where rare and infrequent information is prevalent. In this paper, we conduct the first comprehensive study to investigate the effectiveness of knowledge editing methods for editing long-tail biomedical knowledge. Our results indicate that, while existing editing methods can enhance LLMs’ performance on long-tail biomedical knowledge, their performance on long-tail knowledge remains inferior to that on high-frequency popular knowledge, even after editing. Our further analysis reveals that long-tail biomedical knowledge contains a significant amount of one-to-many knowledge, where one subject and relation link to multiple objects. This high prevalence of one-to-many knowledge limits the effectiveness of knowledge editing in improving LLMs’ understanding of long-tail biomedical knowledge, highlighting the need for tailored strategies to bridge this performance gap.
zh

[NLP-7] Unchecked and Overlooked: Addressing the Checkbox Blind Spot in Large Language Models with CheckboxQA

【速读】：该论文试图解决大型视觉与语言模型在处理文档中复选框内容理解方面的不足，尤其是在金融和法律科技等关键行业中，一个被忽略的复选框可能导致严重的监管或合同失误。为了解决这一问题，论文引入了CheckboxQA数据集，这是一个专门设计用于评估和提升模型在复选框相关任务上的表现的数据资源。其关键在于通过揭示当前模型的局限性，为推动文档理解系统的进步提供有价值的工具，从而对法律科技和金融等领域的应用产生重要影响。该数据集已公开发布。

链接: https://arxiv.org/abs/2504.10419
作者: Michał Turski,Mateusz Chiliński,Łukasz Borchmann
机构: 未知
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Checkboxes are critical in real-world document processing where the presence or absence of ticks directly informs data extraction and decision-making processes. Yet, despite the strong performance of Large Vision and Language Models across a wide range of tasks, they struggle with interpreting checkable content. This challenge becomes particularly pressing in industries where a single overlooked checkbox may lead to costly regulatory or contractual oversights. To address this gap, we introduce the CheckboxQA dataset, a targeted resource designed to evaluate and improve model performance on checkbox-related tasks. It reveals the limitations of current models and serves as a valuable tool for advancing document comprehension systems, with significant implications for applications in sectors such as legal tech and finance. The dataset is publicly available at: this https URL Subjects: Computation and Language (cs.CL) Cite as: arXiv:2504.10419 [cs.CL] (or arXiv:2504.10419v1 [cs.CL] for this version) https://doi.org/10.48550/arXiv.2504.10419 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
zh

[NLP-8] CliniChat: A Multi-Source Knowledge-Driven Framework for Clinical Interview Dialogue Reconstruction and Evaluation

【速读】：该论文旨在解决大型语言模型（Large Language Models, LLMs）在临床访谈辅助应用中的两大挑战：高质量访谈对话数据的缺乏以及广泛接受的评估方法的缺失。为了解决这些问题，论文提出了一种名为CliniChat的框架，其关键在于通过集成多源知识来提升LLMs模拟真实临床访谈的能力。CliniChat包含两个模块：Clini-Recon用于将临床记录转化为系统化、专业且富有同理心的访谈对话；Clini-Eval则结合全面的评价指标体系与两阶段自动评估方法，使LLMs能够像专家一样评估访谈表现。这一解决方案的核心在于通过多源知识整合和创新的评估机制，显著提升了LLMs在临床访谈任务上的性能。

链接: https://arxiv.org/abs/2504.10418
作者: Jing Chen,Zhihua Wei,Wei Zhang,Yingying Hu,Qiong Zhang
机构: College of Computer Science and Technology, Tongji University (同济大学), Shanghai, China; Zhongnan Hospital of Wuhan University (武汉大学中南医院), Wuhan, China; Renmin Hospital of Wuhan University (武汉大学人民医院), Wuhan, China
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Large language models (LLMs) hold great promise for assisting clinical interviews due to their fluent interactive capabilities and extensive medical knowledge. However, the lack of high-quality interview dialogue data and widely accepted evaluation methods has significantly impeded this process. So we propose CliniChat, a framework that integrates multi-source knowledge to enable LLMs to simulate real-world clinical interviews. It consists of two modules: Clini-Recon and Clini-Eval, each responsible for reconstructing and evaluating interview dialogues, respectively. By incorporating three sources of knowledge, Clini-Recon transforms clinical notes into systematic, professional, and empathetic interview dialogues. Clini-Eval combines a comprehensive evaluation metric system with a two-phase automatic evaluation approach, enabling LLMs to assess interview performance like experts. We contribute MedQA-Dialog, a high-quality synthetic interview dialogue dataset, and CliniChatGLM, a model specialized for clinical interviews. Experimental results demonstrate that CliniChatGLM’s interview capabilities undergo a comprehensive upgrade, particularly in history-taking, achieving state-of-the-art performance.
zh

[NLP-9] LLM -SRBench: A New Benchmark for Scientific Equation Discovery with Large Language Models

【速读】：该论文旨在解决科学方程发现任务中现有方法评估标准存在的局限性问题，即现有基准测试容易受到大型语言模型（LLMs）记忆常见方程的影响，导致性能指标虚高，无法真实反映模型的发现能力。为了解决这一问题，论文提出了LLM-SRBench，这是一个包含四个科学领域共239个具有挑战性问题的综合基准测试集。该基准测试通过设计两类问题——LSR-Transform（将常见的物理模型转换为不常见的数学表示以测试推理能力）和LSR-Synth（引入需要数据驱动推理的合成发现驱动问题），有效防止了简单的记忆式解答。关键在于通过这种精心设计的基准测试，能够更准确地评估基于LLMs的科学方程发现方法的真实性能。

链接: https://arxiv.org/abs/2504.10415
作者: Parshin Shojaee,Ngoc-Hieu Nguyen,Kazem Meidani,Amir Barati Farimani,Khoa D Doan,Chandan K Reddy
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Project page: this https URL , Benchmark page: this https URL

点击查看摘要

Abstract:Scientific equation discovery is a fundamental task in the history of scientific progress, enabling the derivation of laws governing natural phenomena. Recently, Large Language Models (LLMs) have gained interest for this task due to their potential to leverage embedded scientific knowledge for hypothesis generation. However, evaluating the true discovery capabilities of these methods remains challenging, as existing benchmarks often rely on common equations that are susceptible to memorization by LLMs, leading to inflated performance metrics that do not reflect discovery. In this paper, we introduce LLM-SRBench, a comprehensive benchmark with 239 challenging problems across four scientific domains specifically designed to evaluate LLM-based scientific equation discovery methods while preventing trivial memorization. Our benchmark comprises two main categories: LSR-Transform, which transforms common physical models into less common mathematical representations to test reasoning beyond memorized forms, and LSR-Synth, which introduces synthetic, discovery-driven problems requiring data-driven reasoning. Through extensive evaluation of several state-of-the-art methods, using both open and closed LLMs, we find that the best-performing system so far achieves only 31.5% symbolic accuracy. These findings highlight the challenges of scientific equation discovery, positioning LLM-SRBench as a valuable resource for future research.
zh

[NLP-10] Performance of Large Language Models in Supporting Medical Diagnosis and Treatment

【速读】：本文旨在评估当前多种大型语言模型（Large Language Models, LLMs）在葡萄牙国家医学专科准入考试（PNA）中的表现，以探索其在医疗领域应用的潜力。论文通过标准化医学知识测试，分析开放源代码和闭源模型的准确性与成本效益，并识别出综合性能最优的模型。解决方案的关键在于结合模型的诊断准确性与实际应用成本，同时探讨链式思维（Chain-of-Thought）等推理方法的应用价值，强调LLMs作为辅助工具在复杂临床决策中的重要作用。

链接: https://arxiv.org/abs/2504.10405
作者: Diogo Sousa,Guilherme Barbosa,Catarina Rocha,Dulce Oliveira
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET); Human-Computer Interaction (cs.HC)
备注: 21 pages, 6 figures, 4 tables. Acknowledgements: The authors acknowledge the support of the AITriage4SU Project ( this http URL ), funded by the FCT (Foundation for Science and Technology), Portugal

点击查看摘要

Abstract:The integration of Large Language Models (LLMs) into healthcare holds significant potential to enhance diagnostic accuracy and support medical treatment planning. These AI-driven systems can analyze vast datasets, assisting clinicians in identifying diseases, recommending treatments, and predicting patient outcomes. This study evaluates the performance of a range of contemporary LLMs, including both open-source and closed-source models, on the 2024 Portuguese National Exam for medical specialty access (PNA), a standardized medical knowledge assessment. Our results highlight considerable variation in accuracy and cost-effectiveness, with several models demonstrating performance exceeding human benchmarks for medical students on this specific task. We identify leading models based on a combined score of accuracy and cost, discuss the implications of reasoning methodologies like Chain-of-Thought, and underscore the potential for LLMs to function as valuable complementary tools aiding medical professionals in complex clinical decision-making.
zh

[NLP-11] LLM -driven Constrained Copy Generation through Iterative Refinement

【速读】：该论文试图解决在营销文案（copy）生成任务中，由于复杂约束导致个性化内容创作受限的问题。传统手动文案创作耗时且成本高昂，而现有的大型语言模型（LLMs）虽然能够快速生成内容，但初次生成的结果难以始终满足所有约束条件，类似于人类的迭代过程。尽管已有研究通过迭代精炼改进受约束的生成任务，但这些研究主要针对简单约束的任务，对于涉及众多复杂约束（如长度、主题、关键词、词序偏好及语气等）的文案生成任务，其效果尚不明确。
解决方案的关键在于提出了一种基于LLM的端到端迭代精炼框架，用于可扩展的文案生成。该框架能够在多个复杂的约束条件下同时优化文案质量，并通过在不同复杂度的电商广告横幅文案生成任务中的实验验证，证明迭代精炼可以将文案成功率提升16.25%-35.91%，并且生成的文案在多臂老虎机实验中显著提升了38.5%-45.21%的点击率。

链接: https://arxiv.org/abs/2504.10391
作者: Varun Vasudevan,Faezeh Akhavizadegan,Abhinav Prakash,Yokila Arora,Jason Cho,Tanya Mendiratta,Sushant Kumar,Kannan Achan
机构: Walmart Global Tech (沃尔玛全球科技)
类目: Computation and Language (cs.CL)
备注: 10 pages, 2 figures, 7 Tables

点击查看摘要

Abstract:Crafting a marketing message (copy), or copywriting is a challenging generation task, as the copy must adhere to various constraints. Copy creation is inherently iterative for humans, starting with an initial draft followed by successive refinements. However, manual copy creation is time-consuming and expensive, resulting in only a few copies for each use case. This limitation restricts our ability to personalize content to customers. Contrary to the manual approach, LLMs can generate copies quickly, but the generated content does not consistently meet all the constraints on the first attempt (similar to humans). While recent studies have shown promise in improving constrained generation through iterative refinement, they have primarily addressed tasks with only a few simple constraints. Consequently, the effectiveness of iterative refinement for tasks such as copy generation, which involves many intricate constraints, remains unclear. To address this gap, we propose an LLM-based end-to-end framework for scalable copy generation using iterative refinement. To the best of our knowledge, this is the first study to address multiple challenging constraints simultaneously in copy generation. Examples of these constraints include length, topics, keywords, preferred lexical ordering, and tone of voice. We demonstrate the performance of our framework by creating copies for e-commerce banners for three different use cases of varying complexity. Our results show that iterative refinement increases the copy success rate by 16.25-35.91 % across use cases. Furthermore, the copies generated using our approach outperformed manually created content in multiple pilot studies using a multi-armed bandit framework. The winning copy improved the click-through rate by 38.5-45.21 %.
zh

[NLP-12] A 10.8mW Mixed-Signal Simulated Bifurcation Ising Solver using SRAM Compute-In-Memory with 0.6us Time-to-Solution

【速读】：该论文致力于解决NP难组合优化问题，特别是Ising模型求解在CMOS平台上的高效实现。论文的关键创新在于提出了一种基于模拟分岔（Simulated Bifurcation, SB）算法的Ising求解器架构，充分利用模拟域计算的优势，通过在SRAM计算存储单元（Compute-in-Memory, CIM）中加速分岔过程，并在模拟域生成和注入最优衰减噪声，以满足SB算法对噪声的需求。此外，论文设计了一种新型10-T SRAM单元，支持三值乘法运算。实验结果显示，该方案在TSMC 180nm CMOS工艺下，对随机二进制MAXCUT图问题，在0.6微秒内达到93%以上的基态解，相比先前提出的CMOS和其他平台的Ising求解器，在求解时间和功耗上均提升了至少一个数量级。

链接: https://arxiv.org/abs/2504.10384
作者: Alana Marie Dee,Sajjad Moazeni
机构: IEEE Publication Technology Group (IEEE 出版技术组)
类目: ystems and Control (eess.SY); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Combinatorial optimization problems are funda- mental for various fields ranging from finance to wireless net- works. This work presents a simulated bifurcation (SB) Ising solver in CMOS for NP-hard optimization problems. Analog domain computing led to a superior implementation of this algorithm as inherent and injected noise is required in SB Ising solvers. The architecture novelties include the use of SRAM compute-in-memory (CIM) to accelerate bifurcation as well as the generation and injection of optimal decaying noise in the analog domain. We propose a novel 10-T SRAM cell capable of performing ternary multiplication. When measured with 60- node, 50% density, random, binary MAXCUT graphs, this all- to-all connected Ising solver reliably achieves above 93% of the ground state solution in 0.6us with 10.8mW average power in TSMC 180nm CMOS. Our chip achieves an order of magnitude improvement in time-to-solution and power compared to previously proposed Ising solvers in CMOS and other platforms.
zh

[NLP-13] S1-Bench: A Simple Benchmark for Evaluating System 1 Thinking Capability of Large Reasoning Models

【速读】：该论文试图解决大型推理模型（Large Reasoning Models, LRMs）在系统1思维（System 1 Thinking）相关简单任务上的评估难题。目前，尽管LRMs在复杂推理任务中通过显式的推理链取得了显著进展，但它们对深度分析性思考的依赖可能限制了其系统1思维的能力。同时，缺乏相应的基准来评估LRMs在需要此类能力的任务中的表现。为填补这一空白，论文提出了S1-Bench，这是一个包含多领域、多语言的简单、多样化且自然清晰问题的基准，专门用于评估LRMs在系统1思维任务中的性能。论文的关键解决方案在于设计了S1-Bench这一基准，通过全面评估发现，LRMs在这些任务中的效率显著较低，输出长度平均为传统小型LLMs的15.5倍，并且存在过早识别正确答案但仍继续不必要的深思熟虑以及产生错误的问题，这凸显了当前LRMs僵化推理模式的局限性，并强调了实现平衡的双系统思维能力（适应任务复杂性的系统1与系统2思维结合）所需的进一步发展。

链接: https://arxiv.org/abs/2504.10368
作者: Wenyuan Zhang,Shuaiyi Nie,Xinghua Zhang,Zefeng Zhang,Tingwen Liu
机构: Institute of Information Engineering, Chinese Academy of Sciences (中国科学院信息工程研究所); School of Cyber Security, University of Chinese Academy of Sciences (中国科学院大学网络空间安全学院)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Work in Progress

点击查看摘要

Abstract:We introduce S1-Bench, a novel benchmark designed to evaluate Large Reasoning Models’ (LRMs) performance on simple tasks that favor intuitive system 1 thinking rather than deliberative system 2 reasoning. While LRMs have achieved significant breakthroughs in complex reasoning tasks through explicit chains of thought, their reliance on deep analytical thinking may limit their system 1 thinking capabilities. Moreover, a lack of benchmark currently exists to evaluate LRMs’ performance in tasks that require such capabilities. To fill this gap, S1-Bench presents a set of simple, diverse, and naturally clear questions across multiple domains and languages, specifically designed to assess LRMs’ performance in such tasks. Our comprehensive evaluation of 22 LRMs reveals significant lower efficiency tendencies, with outputs averaging 15.5 times longer than those of traditional small LLMs. Additionally, LRMs often identify correct answers early but continue unnecessary deliberation, with some models even producing numerous errors. These findings highlight the rigid reasoning patterns of current LRMs and underscore the substantial development needed to achieve balanced dual-system thinking capabilities that can adapt appropriately to task complexity.
zh

[NLP-14] DICE: A Framework for Dimensional and Contextual Evaluation of Language Models

【速读】：该论文试图解决现代语言模型（Language Models, LMs）评估基准无法充分反映其实际应用场景的问题。现有评估基准通常缺乏与真实世界部署场景的直接适用性，而这些场景往往需要特定上下文依赖的考量。论文的关键解决方案是提出了一种名为维度与上下文评估（Dimensional and Contextual Evaluation, DICE）的新方法，该方法通过细粒度、上下文相关的维度来评估语言模型的行为。DICE 引入了上下文无关参数（如鲁棒性、连贯性和知识诚实性）以及需要根据具体应用场景定制的上下文相关参数，以满足不同领域利益相关者的实际需求。这一框架旨在为语言模型提供更贴合实际应用且对利益相关者有意义的评估方式，从而填补现有评估体系的不足。

链接: https://arxiv.org/abs/2504.10359
作者: Aryan Shrivastava,Paula Akemi Aoyagui
机构: University of Chicago (芝加哥大学); University of Toronto (多伦多大学)
类目: Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)
备注:

点击查看摘要

Abstract:Language models (LMs) are increasingly being integrated into a wide range of applications, yet the modern evaluation paradigm does not sufficiently reflect how they are actually being used. Current evaluations rely on benchmarks that often lack direct applicability to the real-world contexts in which LMs are being deployed. To address this gap, we propose Dimensional and Contextual Evaluation (DICE), an approach that evaluates LMs on granular, context-dependent dimensions. In this position paper, we begin by examining the insufficiency of existing LM benchmarks, highlighting their limited applicability to real-world use cases. Next, we propose a set of granular evaluation parameters that capture dimensions of LM behavior that are more meaningful to stakeholders across a variety of application domains. Specifically, we introduce the concept of context-agnostic parameters - such as robustness, coherence, and epistemic honesty - and context-specific parameters that must be tailored to the specific contextual constraints and demands of stakeholders choosing to deploy LMs into a particular setting. We then discuss potential approaches to operationalize this evaluation framework, finishing with the opportunities and challenges DICE presents to the LM evaluation landscape. Ultimately, this work serves as a practical and approachable starting point for context-specific and stakeholder-relevant evaluation of LMs.
zh

[NLP-15] MultiLoKo: a multilingual local knowledge benchmark for LLM s spanning 31 languages

【速读】：本文旨在解决多语言大型语言模型（Multilingual Large Language Models, LLMs）在跨语言能力评估方面的问题。为实现这一目标，论文提出了一个新的基准测试集 MultiLoKo，其关键创新在于设计了一个包含三种分区的数据集：一是每种语言各500个本地相关问题的主分区；二是两种翻译分区，分别包含从30种非英语语言到英语以及从英语到这些非英语语言的人工翻译；此外还提供了相应的机器翻译数据。通过将数据均匀分布于开发集与盲测集两个子集之间，MultiLoKo 可用于研究多种关于 LLMs 多语言能力的问题及多语言基准创建的相关元问题。研究者计算了11种基线和对话模型的 MultiLoKo 分数，并分析了它们的整体表现、不同语言间的性能均衡性、问题语言对答案能力的影响以及哪些语言最具挑战性。结果表明，所有被测试的模型在平均得分和最佳与最差得分语言之间的差距上表现不佳，且存在显著的语言间知识迁移不足现象。此外，使用本地化数据而非英语翻译数据会导致顶级模型的表现差异超过20分，显著改变某些语言的难度评估。对于机器翻译替代人工翻译的效果，发现其对语言难度排序影响较弱，但会导致模型排名变化更大，并使所有模型的估计性能大幅下降。

链接: https://arxiv.org/abs/2504.10356
作者: Dieuwke Hupkes,Nikolay Bogoychev
机构: Meta
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:We present MultiLoKo, a new benchmark for evaluating multilinguality in LLMs covering 31 languages. MultiLoKo consists of three partitions: a main partition consisting of 500 questions per language, separately sourced to be locally relevant to the specific language, and two translated partitions, containing human-authored translations from 30 non-English languages to English and vice versa. For comparison, we also release corresponding machine-authored translations. The data is equally distributed over two splits: a dev split and a blind, out-of-distribution test split. MultiLoKo can be used to study a variety of questions regarding the multilinguality of LLMs as well as meta-questions about multilingual benchmark creation. We compute MultiLoKo scores for 11 base and chat models marketed to be multilingual and study their average performance, their performance parity across languages, how much their ability to answer questions depends on the question language, and which languages are most difficult. None of the models we studied performs well on MultiLoKo, as indicated by low average scores as well as large differences between the best and worst scoring languages. Furthermore, we find a substantial effect of the question language, indicating sub-optimal knowledge transfer between languages. Lastly, we find that using local vs English-translated data can result in differences more than 20 points for the best performing models, drastically change the estimated difficulty of some languages. For using machines instead of human translations, we find a weaker effect on ordering of language difficulty, a larger difference in model rankings, and a substantial drop in estimated performance for all models.
zh

[NLP-16] VisualPuzzles: Decoupling Multimodal Reasoning Evaluation from Domain Knowledge

【速读】：该论文试图解决现有多模态基准测试中将推理能力与领域特定知识混杂在一起的问题，这使得在非专家环境中难以孤立和评估通用推理能力。为了解决这一问题，论文引入了VisualPuzzles，这是一个专注于视觉推理的基准测试，同时刻意减少了对专门知识的依赖。关键在于VisualPuzzles包含五个类别的多样化问题：算法推理、类比推理、演绎推理、归纳推理和空间推理，并且其问题主要来源于手动翻译的中国公务员考试逻辑推理题。实验表明，相比于MMMU等基准测试，VisualPuzzles需要更少的领域特定知识，而更多复杂的推理，从而能够更好地评估真正的多模态推理能力。

链接: https://arxiv.org/abs/2504.10342
作者: Yueqi Song,Tianyue Ou,Yibo Kong,Zecheng Li,Graham Neubig,Xiang Yue
机构: 未知
类目: Computation and Language (cs.CL)
备注: 56 pages, 43 figures

点击查看摘要

Abstract:Current multimodal benchmarks often conflate reasoning with domain-specific knowledge, making it difficult to isolate and evaluate general reasoning abilities in non-expert settings. To address this, we introduce VisualPuzzles, a benchmark that targets visual reasoning while deliberately minimizing reliance on specialized knowledge. VisualPuzzles consists of diverse questions spanning five categories: algorithmic, analogical, deductive, inductive, and spatial reasoning. One major source of our questions is manually translated logical reasoning questions from the Chinese Civil Service Examination. Experiments show that VisualPuzzles requires significantly less intensive domain-specific knowledge and more complex reasoning compared to benchmarks like MMMU, enabling us to better evaluate genuine multimodal reasoning. Evaluations show that state-of-the-art multimodal large language models consistently lag behind human performance on VisualPuzzles, and that strong performance on knowledge-intensive benchmarks does not necessarily translate to success on reasoning-focused, knowledge-light tasks. Additionally, reasoning enhancements such as scaling up inference compute (with “thinking” modes) yield inconsistent gains across models and task types, and we observe no clear correlation between model size and performance. We also found that models exhibit different reasoning and answering patterns on VisualPuzzles compared to benchmarks with heavier emphasis on knowledge. VisualPuzzles offers a clearer lens through which to evaluate reasoning capabilities beyond factual recall and domain knowledge.
zh

[NLP-17] Forecasting from Clinical Textual Time Series: Adaptations of the Encoder and Decoder Language Model Families ALT

【速读】：本文旨在解决从文本时间序列中进行临床事件预测的问题，其中通过大型语言模型（LLM）辅助标注管道提取的时间戳临床发现作为主要输入。论文的关键在于评估多种模型在事件发生预测、时间排序和生存分析任务上的表现，特别是强调了基于编码器的模型在短期和长期事件预测中的高F1分数和更好的时间一致性，同时指出经过微调的掩码方法提升了排名性能。相比之下，经过指令微调的解码器模型在生存分析，尤其是在早期预后场景中表现出相对优势。研究还通过敏感性分析表明时间顺序的重要性，这需要构建临床时间序列数据，而非传统的文本顺序，从而突显了时间有序语料库在LLM广泛应用时代中的额外价值。

链接: https://arxiv.org/abs/2504.10340
作者: Shahriar Noroozizadeh,Sayantan Kumar,Jeremy C. Weiss
机构: School of Computer Science, Carnegie Mellon University (卡内基梅隆大学); Division of Intramural Research, National Library of Medicine (美国国家医学图书馆), Bethesda MD USA; Division of Intramural Research, National Library of Medicine (美国国家医学图书馆), Bethesda MD USA
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Machine Learning for Healthcare (MLHC 2025)

点击查看摘要

Abstract:Clinical case reports encode rich, temporal patient trajectories that are often underexploited by traditional machine learning methods relying on structured data. In this work, we introduce the forecasting problem from textual time series, where timestamped clinical findings–extracted via an LLM-assisted annotation pipeline–serve as the primary input for prediction. We systematically evaluate a diverse suite of models, including fine-tuned decoder-based large language models and encoder-based transformers, on tasks of event occurrence prediction, temporal ordering, and survival analysis. Our experiments reveal that encoder-based models consistently achieve higher F1 scores and superior temporal concordance for short- and long-horizon event forecasting, while fine-tuned masking approaches enhance ranking performance. In contrast, instruction-tuned decoder models demonstrate a relative advantage in survival analysis, especially in early prognosis settings. Our sensitivity analyses further demonstrate the importance of time ordering, which requires clinical time series construction, as compared to text ordering, the format of the text inputs that LLMs are classically trained on. This highlights the additional benefit that can be ascertained from time-ordered corpora, with implications for temporal tasks in the era of widespread LLM use.
zh

[NLP-18] MorphTok: Morphologically Grounded Tokenization for Indian Languages

【速读】：该论文旨在解决现有基于Byte-pair Encoding (BPE) 的子词(tokenization)分割方法未能充分考虑语言形态学意义的问题，导致分割结果与语言学上有意义的单元不一致。此外，还关注了音节文字系统中Unicode字符因变音符号（尤其是依赖性元音）引起的歧义问题。为了解决这些问题，论文提出的关键方案包括：1) 在应用BPE之前引入形态学感知的预分割方法，并为此创建了一个包含连读(sandhi splitting)的新数据集，以提升印地语(Hindi)和马拉地语(Marathi)的子词分割效果；2) 提出了一种受脚本约束的BPE(Constrained BPE, CBPE)，通过融入特定脚本的约束来处理依赖性元音的歧义问题。实验表明，这些方法在机器翻译和语言建模等下游任务中提升了性能，同时保持了计算效率。

链接: https://arxiv.org/abs/2504.10335
作者: Maharaj Brahma,N J Karthika,Atul Singh,Devaraj Adiga,Smruti Bhate,Ganesh Ramakrishnan,Rohit Saluja,Maunendra Sankar Desarkar
机构: 未知
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Tokenization is a crucial step in NLP, especially with the rise of large language models (LLMs), impacting downstream performance, computational cost, and efficiency. Existing LLMs rely on the classical Byte-pair Encoding (BPE) algorithm for subword tokenization that greedily merges frequent character bigrams. This often leads to segmentation that does not align with linguistically meaningful units. To address this, we propose morphology-aware segmentation as a pre-tokenization step prior to applying BPE. To facilitate morphology-aware segmentation, we create a novel dataset for Hindi and Marathi, incorporating sandhi splitting to enhance the subword tokenization. Experiments on downstream tasks show that morphologically grounded tokenization improves performance for machine translation and language modeling. Additionally, to handle the ambiguity in the Unicode characters for diacritics, particularly dependent vowels in syllable-based writing systems, we introduce Constrained BPE (CBPE), an extension to the traditional BPE algorithm that incorporates script-specific constraints. Specifically, CBPE handles dependent vowels. Our results show that CBPE achieves a 1.68% reduction in fertility scores while maintaining comparable or improved downstream performance in machine translation, offering a computationally efficient alternative to standard BPE. Moreover, to evaluate segmentation across different tokenization algorithms, we introduce a new human evaluation metric, \textitEvalTok, enabling more human-grounded assessment.
zh

[NLP-19] Can LLM s Generate Tabular Summaries of Science Papers? Rethinking the Evaluation Protocol

【速读】：该论文旨在解决科学文献综述表格生成中的三个关键挑战：(1) 用户提示通常表述不清；(2) 检索到的候选文献常包含无关内容；(3) 任务评估应超越浅层文本相似性方法，转而评估生成表格在信息检索任务（如论文比较）中的实用性。为应对这些挑战，论文提出了一种结合大型语言模型（LLM）方法与人工标注的新方案，并引入了一个更真实且更具挑战性的基准数据集ARXIV2TABLE，以及一种改进文献综述表格生成的新方法。实验结果表明，无论是开源权重还是专有LLM，在此任务上均表现不佳，凸显了任务的难度及进一步技术发展的必要性。因此，其解决方案的关键在于通过结合AI技术和人工干预，提高生成表格的实用性和针对性。

链接: https://arxiv.org/abs/2504.10284
作者: Weiqi Wang,Jiefu Ou,Yangqiu Song,Benjamin Van Durme,Daniel Khashabi
机构: Center for Speech and Language Processing, Johns Hopkins University (约翰斯·霍普金斯大学); Department of Computer Science and Engineering, HKUST (香港科技大学)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Literature review tables are essential for summarizing and comparing collections of scientific papers. We explore the task of generating tables that best fulfill a user’s informational needs given a collection of scientific papers. Building on recent work (Newman et al., 2024), we extend prior approaches to address real-world complexities through a combination of LLM-based methods and human annotations. Our contributions focus on three key challenges encountered in real-world use: (i) User prompts are often under-specified; (ii) Retrieved candidate papers frequently contain irrelevant content; and (iii) Task evaluation should move beyond shallow text similarity techniques and instead assess the utility of inferred tables for information-seeking tasks (e.g., comparing papers). To support reproducible evaluation, we introduce ARXIV2TABLE, a more realistic and challenging benchmark for this task, along with a novel approach to improve literature review table generation in real-world scenarios. Our extensive experiments on this benchmark show that both open-weight and proprietary LLMs struggle with the task, highlighting its difficulty and the need for further advancements. Our dataset and code are available at this https URL.
zh

[NLP-20] RealHarm: A Collection of Real-World Language Model Application Failures

【速读】：该论文旨在解决语言模型在面向消费者的应用中引入的多种风险问题，特别是探索现有基于监管框架和理论分析的自上而下研究方法所忽视的实际失效模式。论文的关键在于通过构建RealHarm数据集，系统性地分析公开报道的AI代理相关问题交互案例，从部署者的视角揭示声誉损害是主要的组织性危害，错误信息是最常见的危害类别，并评估现有最先进的防护系统（如guardrails和内容审核系统）的有效性，发现这些系统在保护AI应用方面存在显著不足。因此，论文的核心解决方案在于填补防护系统的保护空白，以减少实际应用中的风险。

链接: https://arxiv.org/abs/2504.10277
作者: Pierre Le Jeune,Jiaen Liu,Luca Rossi,Matteo Dora
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Cryptography and Security (cs.CR)
备注:

点击查看摘要

Abstract:Language model deployments in consumer-facing applications introduce numerous risks. While existing research on harms and hazards of such applications follows top-down approaches derived from regulatory frameworks and theoretical analyses, empirical evidence of real-world failure modes remains underexplored. In this work, we introduce RealHarm, a dataset of annotated problematic interactions with AI agents built from a systematic review of publicly reported incidents. Analyzing harms, causes, and hazards specifically from the deployer’s perspective, we find that reputational damage constitutes the predominant organizational harm, while misinformation emerges as the most common hazard category. We empirically evaluate state-of-the-art guardrails and content moderation systems to probe whether such systems would have prevented the incidents, revealing a significant gap in the protection of AI applications.
zh

[NLP-21] MURR: Model Updating with Regularized Replay for Searching a Document Stream ECIR2025

【速读】：该论文试图解决神经检索模型在面对持续更新的文档流和查询流时，由于语言使用随时间变化而导致的检索效果下降问题。传统稀疏检索方法可以通过更新集合统计量来适应语言变化，但神经检索模型（如DPR和ColBERT）通过继续微调基础语言模型会与已编码文档产生不兼容性，而重新编码和索引所有历史文档成本高昂。
解决方案的关键在于提出MURR（Model Updating with Regularized Replay），一种无需重新处理历史文档的模型更新策略。MURR能够在保持对现有文档检索忠实性的同时，持续更新模型以适应最新的主题和查询需求，从而实现更有效且一致的检索性能。

链接: https://arxiv.org/abs/2504.10250
作者: Eugene Yang,Nicola Tonellotto,Dawn Lawrie,Sean MacAvaney,James Mayfield,Douglas W. Oard,Scott Miller
机构: 未知
类目: Information Retrieval (cs.IR); Computation and Language (cs.CL)
备注: Published at ECIR 2025. 16 pages, 4 figures

点击查看摘要

Abstract:The Internet produces a continuous stream of new documents and user-generated queries. These naturally change over time based on events in the world and the evolution of language. Neural retrieval models that were trained once on a fixed set of query-document pairs will quickly start misrepresenting newly-created content and queries, leading to less effective retrieval. Traditional statistical sparse retrieval can update collection statistics to reflect these changes in the use of language in documents and queries. In contrast, continued fine-tuning of the language model underlying neural retrieval approaches such as DPR and ColBERT creates incompatibility with previously-encoded documents. Re-encoding and re-indexing all previously-processed documents can be costly. In this work, we explore updating a neural dual encoder retrieval model without reprocessing past documents in the stream. We propose MURR, a model updating strategy with regularized replay, to ensure the model can still faithfully search existing documents without reprocessing, while continuing to update the model for the latest topics. In our simulated streaming environments, we show that fine-tuning models using MURR leads to more effective and more consistent retrieval results than other strategies as the stream of documents and queries progresses.
zh

[NLP-22] Probing then Editing Response Personality of Large Language Models

【速读】：本文旨在探究大型语言模型（Large Language Models, LLMs）内部如何编码一致的性格特征，并提出方法以在推理过程中编辑这些性格表达。尽管已有大量研究通过输出结果分析性格表达，但性格特征在LLMs参数中的具体编码方式尚不明确。为解决这一问题，论文引入了一种分层探测框架，系统性地评估11个开源LLMs在不同层面对性格编码的能力。实验发现，性格主要被编码在模型的中间层和顶层，且经过指令微调的模型表现出更清晰的性格特征区分度。关键解决方案在于将训练好的探测超平面解释为每种类别性格的分层边界，并进一步提出分层扰动方法，实现在推理阶段对LLMs表达的性格进行编辑。研究显示，即使提示语明确指定某种性格，该方法仍能成功改变LLMs的响应性格。此外，不同性格特征之间的转换难度差异显著，与探测实验中的表征距离相一致。最终，论文通过MMLU基准评估和时间开销分析表明，所提出的性格编辑方法仅造成极小的通用能力下降，同时保持较低的训练成本和可接受的推理延迟。相关代码已公开发布。

链接: https://arxiv.org/abs/2504.10227
作者: Tianjie Ju,Zhenyu Shao,Bowen Wang,Yujia Chen,Zhuosheng Zhang,Hao Fei,Mong-Li Lee,Wynne Hsu,Sufeng Duan,Gongshen Liu
机构: Shanghai Jiao Tong University (上海交通大学); National University of Singapore (新加坡国立大学); Sichuan University (四川大学)
类目: Computation and Language (cs.CL)
备注: Working in Progress

点击查看摘要

Abstract:Large Language Models (LLMs) have demonstrated promising capabilities to generate responses that exhibit consistent personality traits. Despite the major attempts to analyze personality expression through output-based evaluations, little is known about how such traits are internally encoded within LLM parameters. In this paper, we introduce a layer-wise probing framework to systematically investigate the layer-wise capability of LLMs in encoding personality for responding. We conduct probing experiments on 11 open-source LLMs over the PersonalityEdit benchmark and find that LLMs predominantly encode personality for responding in their middle and upper layers, with instruction-tuned models demonstrating a slightly clearer separation of personality traits. Furthermore, by interpreting the trained probing hyperplane as a layer-wise boundary for each personality category, we propose a layer-wise perturbation method to edit the personality expressed by LLMs during inference. Our results show that even when the prompt explicitly specifies a particular personality, our method can still successfully alter the response personality of LLMs. Interestingly, the difficulty of converting between certain personality traits varies substantially, which aligns with the representational distances in our probing experiments. Finally, we conduct a comprehensive MMLU benchmark evaluation and time overhead analysis, demonstrating that our proposed personality editing method incurs only minimal degradation in general capabilities while maintaining low training costs and acceptable inference latency. Our code is publicly available at this https URL.
zh

[NLP-23] DioR: Adaptive Cognitive Detection and Contextual Retrieval Optimization for Dynamic Retrieval-Augmented Generation

【速读】：该论文旨在解决现有动态检索增强生成（Dynamic Retrieval-augmented Generation, RAG）方法在两个关键方面的局限性：1）缺乏有效的检索触发机制；2）缺乏对检索内容的有效审查。为了解决这些问题，论文提出了一种创新的动态RAG方法——DioR（自适应认知检测与上下文检索优化），其核心在于包含两个主要组件：自适应认知检测和上下文检索优化。这两个组件专门设计用于确定何时需要进行检索以及为大型语言模型（Large Language Models, LLMs）检索哪些内容是有用的。实验结果表明，DioR在所有任务上均表现出色，验证了该方法的有效性。

链接: https://arxiv.org/abs/2504.10198
作者: Hanghui Guo,Jia Zhu,Shimin Di,Weijie Shi,Zhangze Chen,Jiajie Xu
机构: School of Computer Science and Technology, Zhejiang Normal University (浙江师范大学计算机科学与技术学院); The Zhejiang Key Laboratory of Intelligent Education Technology and Application, Zhejiang Normal University (浙江省智能教育技术与应用重点实验室, 浙江师范大学); School of Computer Science and Engineering, Southeast University (东南大学计算机科学与工程学院); Department of Computer Science and Engineering, The Hong Kong University of Science and Technology (香港科技大学计算机科学与工程系); School of Computer Science and Technology, Soochow University (苏州大学计算机科学与技术学院)
类目: Computation and Language (cs.CL)
备注: 24 pages, 9 figures

点击查看摘要

Abstract:Dynamic Retrieval-augmented Generation (RAG) has shown great success in mitigating hallucinations in large language models (LLMs) during generation. However, existing dynamic RAG methods face significant limitations in two key aspects: 1) Lack of an effective mechanism to control retrieval triggers, and 2) Lack of effective scrutiny of retrieval content. To address these limitations, we propose an innovative dynamic RAG method, DioR (Adaptive Cognitive Detection and Contextual Retrieval Optimization), which consists of two main components: adaptive cognitive detection and contextual retrieval optimization, specifically designed to determine when retrieval is needed and what to retrieve for LLMs is useful. Experimental results demonstrate that DioR achieves superior performance on all tasks, demonstrating the effectiveness of our work.
zh

[NLP-24] Localized Cultural Knowledge is Conserved and Controllable in Large Language Models

【速读】：本文旨在解决大型语言模型（LLMs）在多语言生成任务中倾向于以英语为中心的问题，即使在其他语言环境中也难以自然呈现本地文化特性。论文的关键发现在于，通过显式提供文化上下文信息（explicit cultural context），可以显著提升模型生成符合本地文化语境响应的能力，这一现象被称为“显式-隐式本地化差距”（explicit-implicit localization gap）。然而，这种方法虽然有效，但会导致生成结果多样性下降且易陷入刻板印象。为此，研究进一步提出了一种跨所有非英语语言均存在的显式文化定制向量（explicit cultural customization vector），用于引导模型从合成的英语文化世界模型转向各非英语文化世界模型。这种方法不仅保留了隐式提示的多样性，还减少了刻板印象，从而大幅提升模型的文化定制潜力。关键解决方案在于利用显式文化上下文和定制向量，使模型能够更好地理解和表达不同文化的特性，并为翻译、文化定制及通过软控制实现隐式文化表达提供了重要启示。

链接: https://arxiv.org/abs/2504.10191
作者: Veniamin Veselovsky,Berke Argin,Benedikt Stroebl,Chris Wendler,Robert West,James Evans,Thomas L. Griffiths,Arvind Narayanan
机构: Princeton University (普林斯顿大学); EPFL (洛桑联邦理工学院); Northeastern University (东北大学); Princeton University (普林斯顿大学); University of Chicago (芝加哥大学); Santa Fe Institute (圣塔菲研究所)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Just as humans display language patterns influenced by their native tongue when speaking new languages, LLMs often default to English-centric responses even when generating in other languages. Nevertheless, we observe that local cultural information persists within the models and can be readily activated for cultural customization. We first demonstrate that explicitly providing cultural context in prompts significantly improves the models’ ability to generate culturally localized responses. We term the disparity in model performance with versus without explicit cultural context the explicit-implicit localization gap, indicating that while cultural knowledge exists within LLMs, it may not naturally surface in multilingual interactions if cultural context is not explicitly provided. Despite the explicit prompting benefit, however, the answers reduce in diversity and tend toward stereotypes. Second, we identify an explicit cultural customization vector, conserved across all non-English languages we explore, which enables LLMs to be steered from the synthetic English cultural world-model toward each non-English cultural world. Steered responses retain the diversity of implicit prompting and reduce stereotypes to dramatically improve the potential for customization. We discuss the implications of explicit cultural customization for understanding the conservation of alternative cultural world models within LLMs, and their controllable utility for translation, cultural customization, and the possibility of making the explicit implicit through soft control for expanded LLM function and appeal.
zh

[NLP-25] Deep Reasoning Translation via Reinforcement Learning

【速读】：该论文旨在解决深度推理大语言模型（Deep Reasoning LLMs）在自由翻译任务中的不足，这一任务要求超越逐字翻译，同时考虑文化差异。当前领域内，此类任务仍缺乏深入研究。为应对这一挑战，论文提出DeepTrans，这是一种通过强化学习进行自由翻译的深度推理翻译模型。其关键在于精心构建了一个奖励模型，该模型基于预先定义的评分标准，既评估翻译结果也评估思维过程。在给定源句的情况下，奖励模型指导深度翻译模型如何思考并自由翻译这些句子，从而在强化学习过程中实现目标。此方法无需任何标注数据，避免了人工标注或资源密集型的数据合成，有效降低了对标注数据的依赖性。实验结果表明，以Qwen2.5-7B为基座的DeepTrans在文学翻译上的性能提升了16.3%，并在多个基准测试中优于其他强基线模型及使用合成数据微调的模型。此外，论文还总结了强化学习探索过程中的失败案例与有趣发现，希望为相关领域的研究提供启发。

链接: https://arxiv.org/abs/2504.10187
作者: Jiaan Wang,Fandong Meng,Jie Zhou
机构: Pattern Recognition Center, WeChat AI, Tencent Inc (腾讯公司微信人工智能团队)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Recently, deep reasoning LLMs (e.g., OpenAI o1/o3 and DeepSeek-R1) have shown promising performance in various complex tasks. Free translation is an important and interesting task in the multilingual world, which requires going beyond word-for-word translation and taking cultural differences into account. This task is still under-explored in deep reasoning LLMs. In this paper, we introduce DeepTrans, a deep reasoning translation model that learns free translation via reinforcement learning. Specifically, we carefully build a reward model with pre-defined scoring criteria on both the translation results and the thought process. Given the source sentences, the reward model teaches the deep translation model how to think and free-translate them during reinforcement learning. In this way, training DeepTrans does not need any labeled translations, avoiding the human-intensive annotation or resource-intensive data synthesis. Experimental results show the effectiveness of DeepTrans. Using Qwen2.5-7B as the backbone, DeepTrans improves performance by 16.3% in literature translation, and outperforms strong deep reasoning baselines as well as baselines that are fine-tuned with synthesized data. Moreover, we summarize the failures and interesting findings during our RL exploration. We hope this work could inspire other researchers in free translation.
zh

[NLP-26] LLM Unlearning Reveals a Stronger-Than-Expected Coreset Effect in Current Benchmarks

【速读】：该论文试图解决大型语言模型（Large Language Model, LLM）无学习（unlearning）过程中如何在去除预训练模型中不期望的数据影响的同时，保留其通用实用性的问题。具体而言，作者关注的是如何在确保模型安全性和可控行为的前提下，有效移除特定数据的影响。论文的关键在于首次揭示了在现有的无学习基准测试（如WMDP和MUSE）中存在一种“coreset效应”：即通过使用原始遗忘集（forget set）的一个显著更小的子集（作为coreset），即可有效地维持无学习的效果，甚至低至遗忘集的5%，且这种效果与数据选择方法无关，无论是随机选择还是更复杂的启发式方法。作者进一步从关键词提取的角度解释了这一现象，表明无学习主要依赖于一组高影响力的关键令牌，而非整个数据集。此外，论文验证了coreset无学习模型在模式连通性和对抗越狱攻击等其他维度上的忠实性。

链接: https://arxiv.org/abs/2504.10185
作者: Soumyadeep Pal,Changsheng Wang,James Diffenderfer,Bhavya Kailkhura,Sijia Liu
机构: Michigan State University (密歇根州立大学); Lawrence Livermore National Laboratory (劳伦斯利弗莫尔国家实验室); IBM Research (IBM 研究院)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Large language model unlearning has become a critical challenge in ensuring safety and controlled model behavior by removing undesired data-model influences from the pretrained model while preserving general utility. Significant recent efforts have been dedicated to developing LLM unlearning benchmarks such as WMDP (Weapons of Mass Destruction Proxy) and MUSE (Machine Unlearning Six-way Evaluation), facilitating standardized unlearning performance assessment and method comparison. Despite their usefulness, we uncover for the first time a novel coreset effect within these benchmarks. Specifically, we find that LLM unlearning achieved with the original (full) forget set can be effectively maintained using a significantly smaller subset (functioning as a “coreset”), e.g., as little as 5% of the forget set, even when selected at random. This suggests that LLM unlearning in these benchmarks can be performed surprisingly easily, even in an extremely low-data regime. We demonstrate that this coreset effect remains strong, regardless of the LLM unlearning method used, such as NPO (Negative Preference Optimization) and RMU (Representation Misdirection Unlearning), the popular ones in these benchmarks. The surprisingly strong coreset effect is also robust across various data selection methods, ranging from random selection to more sophisticated heuristic approaches. We explain the coreset effect in LLM unlearning through a keyword-based perspective, showing that keywords extracted from the forget set alone contribute significantly to unlearning effectiveness and indicating that current unlearning is driven by a compact set of high-impact tokens rather than the entire dataset. We further justify the faithfulness of coreset-unlearned models along additional dimensions, such as mode connectivity and robustness to jailbreaking attacks. Codes are available at this https URL.
zh

[NLP-27] he Future of MLLM Prompting is Adaptive: A Comprehensive Experimental Evaluation of Prompt Engineering Methods for Robust Multimodal Performance

【速读】：该论文试图解决如何有效利用多模态大型语言模型（Multimodal Large Language Models, MLLMs）的潜力，特别是通过优化提示工程（prompt engineering）方法提升其在多样化任务中的性能。论文的关键在于评估七种提示工程方法（包括Zero-Shot、One-Shot、Few-Shot、Chain-of-Thought、Analogical、Generated Knowledge和Tree-of-Thought）在不同规模（Small、Medium、Large）和任务类型（如推理与复合性、多模态理解与对齐、复杂代码生成与执行、知识检索与集成）下的表现。研究发现，虽然大模型（Large MLLMs）在结构化任务中表现出色，但所有模型在复杂推理和抽象理解方面均表现欠佳，且存在较高的幻觉率。论文强调，单一提示方法无法普遍优化所有任务类型，关键在于采用自适应策略，结合基于示例的引导与选择性的结构化推理，以增强模型的鲁棒性、效率及事实准确性。

链接: https://arxiv.org/abs/2504.10179
作者: Anwesha Mohanty,Venkatesh Balavadhani Parthasarathy,Arsalan Shahid
机构: CeADAR: Ireland’s Centre for AI (爱尔兰人工智能中心); University College Dublin (都柏林大学)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Emerging Technologies (cs.ET)
备注:

点击查看摘要

Abstract:Multimodal Large Language Models (MLLMs) are set to transform how machines process and generate human-like responses by integrating diverse modalities such as text, images, and code. Yet, effectively harnessing their capabilities hinges on optimal prompt engineering. We present a comprehensive experimental evaluation of seven prompt engineering methods applied to 13 open-source MLLMs over 24 tasks spanning Reasoning and Compositionality, Multimodal Understanding and Alignment, Complex Code Generation and Execution, and Knowledge Retrieval and Integration. Our approach stratifies models by parameter count into Small (4B), Medium (4B-10B), and Large (10B) categories and compares prompting techniques including Zero-Shot, One-Shot, Few-Shot, Chain-of-Thought, Analogical, Generated Knowledge, and Tree-of-Thought. While Large MLLMs excel in structured tasks such as code generation, achieving accuracies up to 96.88% under Few-Shot prompting, all models struggle with complex reasoning and abstract understanding, often yielding accuracies below 60% and high hallucination rates. Structured reasoning prompts frequently increased hallucination up to 75% in small models and led to longer response times (over 20 seconds in Large MLLMs), while simpler prompting methods provided more concise and efficient outputs. No single prompting method uniformly optimises all task types. Instead, adaptive strategies combining example-based guidance with selective structured reasoning are essential to enhance robustness, efficiency, and factual accuracy. Our findings offer practical recommendations for prompt engineering and support more reliable deployment of MLLMs across applications including AI-assisted coding, knowledge retrieval, and multimodal content understanding.
zh

[NLP-28] HalluSearch at SemEval-2025 Task 3: A Search-Enhanced RAG Pipeline for Hallucination Detection

【速读】：该论文试图解决大型语言模型（Large Language Model, LLM）输出中虚假文本片段检测的问题。解决方案的关键在于HalluSearch管道，它结合基于检索增强的验证（retrieval-augmented verification）与细粒度事实分割（fine-grained factual splitting），以识别和定位十四种不同语言中的幻觉现象（hallucinations）。这种多语言方法在英语和捷克语等语言上的评估显示了其竞争力，但其基于检索的策略在在线资源有限的语言中面临挑战，强调了进一步研究的必要性，以实现跨多样化语言环境的一致幻觉检测能力。

链接: https://arxiv.org/abs/2504.10168
作者: Mohamed A. Abdallah,Samhaa R. El-Beltagy
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:In this paper, we present HalluSearch, a multilingual pipeline designed to detect fabricated text spans in Large Language Model (LLM) outputs. Developed as part of Mu-SHROOM, the Multilingual Shared-task on Hallucinations and Related Observable Overgeneration Mistakes, HalluSearch couples retrieval-augmented verification with fine-grained factual splitting to identify and localize hallucinations in fourteen different languages. Empirical evaluations show that HalluSearch performs competitively, placing fourth in both English (within the top ten percent) and Czech. While the system’s retrieval-based strategy generally proves robust, it faces challenges in languages with limited online coverage, underscoring the need for further research to ensure consistent hallucination detection across diverse linguistic contexts.
zh

[NLP-29] C-FAITH: A Chinese Fine-Grained Benchmark for Automated Hallucination Evaluation

【速读】：该论文旨在解决大型语言模型（Large Language Models, LLMs）在生成过程中容易产生幻觉（Hallucination）的问题，特别是现有中文幻觉基准依赖人工标注导致自动化和低成本幻觉评估困难这一挑战。论文的关键解决方案是提出HaluAgent，这是一种基于知识文档自动生成细粒度问答（QA）数据集的代理框架。通过结合手工设计规则与提示优化（Prompt Optimization），HaluAgent显著提升了生成数据的质量。基于此框架，作者构建了一个名为C-FAITH的中文QA幻觉基准，包含从网络爬取的1,399份知识文档中提取的60,702个条目，并进一步利用该基准全面评估了16种主流LLMs，提供了详尽的实验结果与分析。

链接: https://arxiv.org/abs/2504.10167
作者: Xu Zhang,Zhifei Liu,Jiahao Wang,Huixuan Zhang,Fan Xu,Junzhe Zhang,Xiaojun Wan
机构: Wangxuan Institute of Computer Technology, Peking University (王选计算机技术研究所，北京大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Despite the rapid advancement of large language models, they remain highly susceptible to generating hallucinations, which significantly hinders their widespread application. Hallucination research requires dynamic and fine-grained evaluation. However, most existing hallucination benchmarks (especially in Chinese language) rely on human annotations, making automatical and cost-effective hallucination evaluation challenging. To address this, we introduce HaluAgent, an agentic framework that automatically constructs fine-grained QA dataset based on some knowledge documents. Our experiments demonstrate that the manually designed rules and prompt optimization can improve the quality of generated data. Using HaluAgent, we construct C-FAITH, a Chinese QA hallucination benchmark created from 1,399 knowledge documents obtained from web scraping, totaling 60,702 entries. We comprehensively evaluate 16 mainstream LLMs with our proposed C-FAITH, providing detailed experimental results and analysis.
zh

[NLP-30] MT-R1-Zero: Advancing LLM -based Machine Translation via R1-Zero-like Reinforcement Learning

【速读】：该论文试图解决将大规模强化学习（Reinforcement Learning, RL）方法应用于机器翻译（Machine Translation, MT）的问题，尤其是在无需监督微调（supervised fine-tuning）或冷启动（cold-start）的情况下，利用大语言模型（Large Language Models, LLMs）通过自我推理提升翻译质量。传统机器翻译输出形式灵活且难以用明确规则自动评估，这使得直接应用强化学习框架面临挑战。为解决这一问题，论文提出的关键方案是设计一种混合奖励机制（rule-metric mixed reward mechanism），结合规则性和指标性奖励来引导LLMs通过涌现推理（emergent reasoning）优化翻译性能。这种方法在WMT 24英汉翻译基准测试中取得了具有竞争力的表现，并展示了较强的泛化能力，特别是在多语言和低资源场景下。

链接: https://arxiv.org/abs/2504.10160
作者: Zhaopeng Feng,Shaosheng Cao,Jiahan Ren,Jiayuan Su,Ruizhe Chen,Yan Zhang,Zhe Xu,Yao Hu,Jian Wu,Zuozhu Liu
机构: Zhejiang University (浙江大学); Xiaohongshu Inc. (小红书)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Work in progress. Our code is available at this https URL

点击查看摘要

Abstract:Large-scale reinforcement learning (RL) methods have proven highly effective in enhancing the reasoning abilities of large language models (LLMs), particularly for tasks with verifiable solutions such as mathematics and coding. However, applying this idea to machine translation (MT), where outputs are flexibly formatted and difficult to automatically evaluate with explicit rules, remains underexplored. In this work, we introduce MT-R1-Zero, the first open-source adaptation of the R1-Zero RL framework for MT without supervised fine-tuning or cold-start. We propose a rule-metric mixed reward mechanism to guide LLMs towards improved translation quality via emergent reasoning. On the WMT 24 English-Chinese benchmark, our MT-R1-Zero-3B-Mix achieves competitive performance, surpassing TowerInstruct-7B-v0.2 by an average of 1.26 points. Meanwhile, our MT-R1-Zero-7B-Mix attains a high average score of 62.25 across all metrics, placing it on par with advanced proprietary models such as GPT-4o and Claude-3.5-Sonnet, while the MT-R1-Zero-7B-Sem variant achieves state-of-the-art scores on semantic metrics. Moreover, our work exhibits strong generalization capabilities on out-of-distribution MT tasks, robustly supporting multilingual and low-resource settings. Extensive analysis of model behavior across different initializations and reward metrics offers pioneering insight into the critical role of reward design, LLM adaptability, training dynamics, and emergent reasoning patterns within the R1-Zero paradigm for MT. Our code is available at this https URL.
zh

[NLP-31] SocioVerse: A World Model for Social Simulation Powered by LLM Agents and A Pool of 10 Million Real-World Users

【速读】：该论文旨在解决社会模拟领域中现有方法在环境适配、目标用户定位、交互机制以及行为模式对齐方面的挑战。论文的关键解决方案是提出SocioVerse，这是一种基于大型语言模型（LLMs）驱动的世界模型，通过四个强大的对齐组件以及涵盖1000万真实个体的用户池，实现对大规模人群动态的反映，同时确保多样性、可信性和代表性，从而克服传统方法的局限性。

链接: https://arxiv.org/abs/2504.10157
作者: Xinnong Zhang,Jiayu Lin,Xinyi Mou,Shiyue Yang,Xiawei Liu,Libo Sun,Hanjia Lyu,Yihang Yang,Weihong Qi,Yue Chen,Guanying Li,Ling Yan,Yao Hu,Siming Chen,Yu Wang,Jingxuan Huang,Jiebo Luo,Shiping Tang,Libo Wu,Baohua Zhou,Zhongyu Wei
机构: Fudan University (复旦大学); Shanghai Innovation Insititute (上海创新研究院); University of Rochester (罗切斯特大学); Indiana University (印第安纳大学); Xiaohongshu Inc. (小红书科技); SocioVerse (https://github.com/FudanDISC/SocioVerse)
类目: Computation and Language (cs.CL); Computers and Society (cs.CY)
备注: work in progress

点击查看摘要

Abstract:Social simulation is transforming traditional social science research by modeling human behavior through interactions between virtual individuals and their environments. With recent advances in large language models (LLMs), this approach has shown growing potential in capturing individual differences and predicting group behaviors. However, existing methods face alignment challenges related to the environment, target users, interaction mechanisms, and behavioral patterns. To this end, we introduce SocioVerse, an LLM-agent-driven world model for social simulation. Our framework features four powerful alignment components and a user pool of 10 million real individuals. To validate its effectiveness, we conducted large-scale simulation experiments across three distinct domains: politics, news, and economics. Results demonstrate that SocioVerse can reflect large-scale population dynamics while ensuring diversity, credibility, and representativeness through standardized procedures and minimal manual adjustments.
zh

[NLP-32] Breaking the Data Barrier – Building GUI Agents Through Task Generalization

【速读】：该论文旨在解决图形用户界面（GUI）代理在自动化复杂数字任务中的性能限制问题，主要由于高质量轨迹数据的稀缺性。为应对这一挑战，论文提出在专门设计的中期训练阶段利用数据丰富且推理密集的任务来训练视觉语言模型（VLMs）。解决方案的关键在于通过引入多种具有现成指令微调数据的任务，如GUI感知、多模态推理和文本推理，以促进跨领域的知识迁移和泛化能力。实验结果表明，这种策略显著提升了性能，在WebArena和AndroidWorld上的绝对提升分别达到8.0%和12.2%，并揭示了文本到视觉领域跨模态泛化的潜力。

链接: https://arxiv.org/abs/2504.10127
作者: Junlei Zhang,Zichen Ding,Chang Ma,Zijie Chen,Qiushi Sun,Zhenzhong Lan,Junxian He
机构: Zhejiang University (浙江大学); Westlake University (西湖大学); Shanghai AI Laboratory (上海人工智能实验室); The University of Hong Kong (香港大学); HKUST (香港科技大学)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
备注: 24 pages, 11 figures

点击查看摘要

Abstract:Graphical User Interface (GUI) agents offer cross-platform solutions for automating complex digital tasks, with significant potential to transform productivity workflows. However, their performance is often constrained by the scarcity of high-quality trajectory data. To address this limitation, we propose training Vision Language Models (VLMs) on data-rich, reasoning-intensive tasks during a dedicated mid-training stage, and then examine how incorporating these tasks facilitates generalization to GUI planning scenarios. Specifically, we explore a range of tasks with readily available instruction-tuning data, including GUI perception, multimodal reasoning, and textual reasoning. Through extensive experiments across 11 mid-training tasks, we demonstrate that: (1) Task generalization proves highly effective, yielding substantial improvements across most settings. For instance, multimodal mathematical reasoning enhances performance on AndroidWorld by an absolute 6.3%. Remarkably, text-only mathematical data significantly boosts GUI web agent performance, achieving a 5.6% improvement on WebArena and 5.4% improvement on AndroidWorld, underscoring notable cross-modal generalization from text-based to visual domains; (2) Contrary to prior assumptions, GUI perception data - previously considered closely aligned with GUI agent tasks and widely utilized for training - has a comparatively limited impact on final performance; (3) Building on these insights, we identify the most effective mid-training tasks and curate optimized mixture datasets, resulting in absolute performance gains of 8.0% on WebArena and 12.2% on AndroidWorld. Our work provides valuable insights into cross-domain knowledge transfer for GUI agents and offers a practical approach to addressing data scarcity challenges in this emerging field. The code, data and models will be available at this https URL.
zh

[NLP-33] CameraBench: Benchmarking Visual Reasoning in MLLM s via Photography

【速读】：该论文试图解决视觉推理（Visual Reasoning）在多模态大型语言模型（Multimodal Large Language Models, MLLMs）中的应用问题，特别是摄影相关任务中从图像信息推断数值化相机设置的能力。论文关注的核心问题是，如何让MLLMs深入理解图像背后的物理原理（如光照、模糊程度等）以实现精确的视觉理解，并将其应用于实际场景，例如摄影助手代理。解决方案的关键在于开发一种评估方法，用于衡量MLLMs区分与数值化相机设置相关的视觉差异的能力，并扩展此前针对视觉-语言模型（Vision-Language Models, VLMs）提出的方法论。初步结果显示，视觉推理在摄影相关任务中至关重要，但尚未有单一模型能够在所有评估任务中始终表现最优，这凸显了提升MLLMs视觉推理能力的挑战与机遇。

链接: https://arxiv.org/abs/2504.10090
作者: I-Sheng Fang,Jun-Cheng Chen
机构: Research Center for Information Technology Innovation, Academia Sinica (信息技术创新研究中心, 中央研究院)
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Large language models (LLMs) and multimodal large language models (MLLMs) have significantly advanced artificial intelligence. However, visual reasoning, reasoning involving both visual and textual inputs, remains underexplored. Recent advancements, including the reasoning models like OpenAI o1 and Gemini 2.0 Flash Thinking, which incorporate image inputs, have opened this capability. In this ongoing work, we focus specifically on photography-related tasks because a photo is a visual snapshot of the physical world where the underlying physics (i.e., illumination, blur extent, etc.) interplay with the camera parameters. Successfully reasoning from the visual information of a photo to identify these numerical camera settings requires the MLLMs to have a deeper understanding of the underlying physics for precise visual comprehension, representing a challenging and intelligent capability essential for practical applications like photography assistant agents. We aim to evaluate MLLMs on their ability to distinguish visual differences related to numerical camera settings, extending a methodology previously proposed for vision-language models (VLMs). Our preliminary results demonstrate the importance of visual reasoning in photography-related tasks. Moreover, these results show that no single MLLM consistently dominates across all evaluation tasks, demonstrating ongoing challenges and opportunities in developing MLLMs with better visual reasoning.
zh

[NLP-34] RealSafe-R1: Safety-Aligned DeepSeek -R1 without Compromising Reasoning Capability

【速读】：该论文试图解决大型推理模型（LRMs）在广泛应用中因安全问题而限制其效用的问题，特别是开源模型倾向于响应恶意查询的倾向。为了解决这一问题，论文提出了RealSafe-R1，这是DeepSeek-R1蒸馏模型的安全对齐版本。解决方案的关键在于通过构建一个包含15,000个由DeepSeek-R1生成的安全感知推理轨迹的数据集进行训练，并在训练中明确指导预期的拒绝行为。这种方法不仅提升了模型的安全性，避免有害查询和越狱攻击，同时通过保持训练数据在原始生成分布内，确保了模型推理能力不受损害。模型权重已开源。

链接: https://arxiv.org/abs/2504.10081
作者: Yichi Zhang,Zihao Zeng,Dongbai Li,Yao Huang,Zhijie Deng,Yinpeng Dong
机构: 未知
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Large Reasoning Models (LRMs), such as OpenAI o1 and DeepSeek-R1, have been rapidly progressing and achieving breakthrough performance on complex reasoning tasks such as mathematics and coding. However, the open-source R1 models have raised safety concerns in wide applications, such as the tendency to comply with malicious queries, which greatly impacts the utility of these powerful models in their applications. In this paper, we introduce RealSafe-R1 as safety-aligned versions of DeepSeek-R1 distilled models. To train these models, we construct a dataset of 15k safety-aware reasoning trajectories generated by DeepSeek-R1, under explicit instructions for expected refusal behavior. Both quantitative experiments and qualitative case studies demonstrate the models’ improvements, which are shown in their safety guardrails against both harmful queries and jailbreak attacks. Importantly, unlike prior safety alignment efforts that often compromise reasoning performance, our method preserves the models’ reasoning capabilities by maintaining the training data within the original distribution of generation. Model weights of RealSafe-R1 are open-source at this https URL.
zh

[NLP-35] owards Quantifying Commonsense Reasoning with Mechanistic Insights NAACL2025

【速读】：该论文试图解决的问题是如何有效评估大型语言模型（LLMs）在常识推理方面的理解能力，并探索这些模型是否真正具备在实际场景中进行推理的能力。为解决这一问题，论文提出通过构建一个图形结构（graphical structure）来捕捉人类日常活动中隐含的常识知识，从而形成一个大规模的常识查询框架（~10^17个查询），以实现对LLMs常识推理能力的严格评估。解决方案的关键在于设计一种新的标注方案，用于将37项日常活动中的隐含常识知识转化为图形结构，同时揭示LLMs中负责推理的核心机制及其在决策过程中的作用。

链接: https://arxiv.org/abs/2504.10077
作者: Abhinav Joshi,Areeb Ahmad,Divyaksh Shukla,Ashutosh Modi
机构: Indian Institute of Technology Kanpur (IIT Kanpur)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Accepted at NAACL 2025; 28 pages (9 pages + 7 pages references + 12 pages appendix)

点击查看摘要

Abstract:Commonsense reasoning deals with the implicit knowledge that is well understood by humans and typically acquired via interactions with the world. In recent times, commonsense reasoning and understanding of various LLMs have been evaluated using text-based tasks. In this work, we argue that a proxy of this understanding can be maintained as a graphical structure that can further help to perform a rigorous evaluation of commonsense reasoning abilities about various real-world activities. We create an annotation scheme for capturing this implicit knowledge in the form of a graphical structure for 37 daily human activities. We find that the created resource can be used to frame an enormous number of commonsense queries (~ 10^17), facilitating rigorous evaluation of commonsense reasoning in LLMs. Moreover, recently, the remarkable performance of LLMs has raised questions about whether these models are truly capable of reasoning in the wild and, in general, how reasoning occurs inside these models. In this resource paper, we bridge this gap by proposing design mechanisms that facilitate research in a similar direction. Our findings suggest that the reasoning components are localized in LLMs that play a prominent role in decision-making when prompted with a commonsense query.
zh

[NLP-36] Mavors: Multi-granularity Video Representation for Multimodal Large Language Model

【速读】：该论文旨在解决多模态大语言模型（Multimodal Large Language Models, MLLMs）中长上下文视频理解的关键挑战：在保证计算效率的同时保留细粒度时空模式。现有方法（如稀疏采样、低分辨率密集采样和令牌压缩）在处理复杂运动或分辨率变化的视频时，会在时间动态、空间细节或微妙交互中产生显著的信息丢失。为了解决这一问题，论文提出了Mavors框架，其关键是引入多粒度视频表示方法，通过两个核心组件实现整体长视频建模：1）基于3D卷积和视觉Transformer的片段内视觉编码器（Intra-chunk Vision Encoder, IVE），用于保留高分辨率的空间特征；2）基于变换器依赖建模和片段级旋转位置编码的片段间特征聚合器（Inter-chunk Feature Aggregator, IFA），用于建立跨片段的时间一致性。此外，该框架通过子图像分解将图像视为单帧视频，实现了图像和视频理解的统一。实验结果表明，Mavors在保持空间保真度和时间连续性方面优于现有方法，特别是在需要细粒度时空推理的任务中表现出色。

链接: https://arxiv.org/abs/2504.10068
作者: Yang Shi,Jiaheng Liu,Yushuo Guan,Zhenhua Wu,Yuanxing Zhang,Zihao Wang,Weihong Lin,Jingyun Hua,Zekun Wang,Xinlong Chen,Bohan Zeng,Wentao Zhang,Fuzheng Zhang,Wenjing Yang,Di Zhang
机构: Peking University (北京大学); Kuaishou Technology (快手科技); Nanjing University (南京大学); CASIA (中科院自动化所)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: 22 pages

点击查看摘要

Abstract:Long-context video understanding in multimodal large language models (MLLMs) faces a critical challenge: balancing computational efficiency with the retention of fine-grained spatio-temporal patterns. Existing approaches (e.g., sparse sampling, dense sampling with low resolution, and token compression) suffer from significant information loss in temporal dynamics, spatial details, or subtle interactions, particularly in videos with complex motion or varying resolutions. To address this, we propose \mathbfMavors , a novel framework that introduces \mathbfM ulti-gr \mathbfa nularity \mathbfv ide \mathbfo \mathbfr epre \mathbfs entation for holistic long-video modeling. Specifically, Mavors directly encodes raw video content into latent representations through two core components: 1) an Intra-chunk Vision Encoder (IVE) that preserves high-resolution spatial features via 3D convolutions and Vision Transformers, and 2) an Inter-chunk Feature Aggregator (IFA) that establishes temporal coherence across chunks using transformer-based dependency modeling with chunk-level rotary position encodings. Moreover, the framework unifies image and video understanding by treating images as single-frame videos via sub-image decomposition. Experiments across diverse benchmarks demonstrate Mavors’ superiority in maintaining both spatial fidelity and temporal continuity, significantly outperforming existing methods in tasks requiring fine-grained spatio-temporal reasoning.
zh

[NLP-37] A Computational Cognitive Model for Processing Repetitions of Hierarchical Relations

【速读】：该论文试图解决人类如何检测和理解结构重复（structural repeats）这一认知问题，这些结构重复来源于序列数据中层级关系的重复。论文的关键解决方案在于提出了一种基于加权演绎系统（weighted deduction system）的计算模型，通过该模型以模板程序（Template program）的形式推断给定序列的最小生成过程。这种模板程序扩展了上下文无关文法（context-free grammar），引入了重复组合子（repetition combinators），从而以递归方式高效编码子计算的重复。

链接: https://arxiv.org/abs/2504.10065
作者: Zeng Ren,Xinyi Guan,Martin Rohrmeier
机构: 未知
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Patterns are fundamental to human cognition, enabling the recognition of structure and regularity across diverse domains. In this work, we focus on structural repeats, patterns that arise from the repetition of hierarchical relations within sequential data, and develop a candidate computational model of how humans detect and understand such structural repeats. Based on a weighted deduction system, our model infers the minimal generative process of a given sequence in the form of a Template program, a formalism that enriches the context-free grammar with repetition combinators. Such representation efficiently encodes the repetition of sub-computations in a recursive manner. As a proof of concept, we demonstrate the expressiveness of our model on short sequences from music and action planning. The proposed model offers broader insights into the mental representations and cognitive mechanisms underlying human pattern recognition.
zh

[NLP-38] Hallucination Detection in LLM s via Topological Divergence on Attention Graphs

【速读】：本文旨在解决大型语言模型（Large Language Models, LLMs）中幻觉现象（hallucination），即生成与事实不符内容的关键挑战。为应对这一问题，论文提出了一种基于拓扑结构的幻觉检测方法TOHA（Topology-based HAllucination detector）。其核心解决方案在于利用拓扑发散度（topological divergence）度量来量化由注意力矩阵诱导的图的结构性质。通过分析提示（prompt）子图与响应子图之间的拓扑发散模式，发现特定注意力头中的较高发散值与幻觉输出存在强相关性，且此特性独立于数据集。实验结果表明，TOHA在多个基准测试中达到或接近最先进的性能，并展现出显著的领域迁移能力。研究结果表明，分析注意力矩阵的拓扑结构可作为评估LLMs事实可靠性的一种高效且稳健的指标。

链接: https://arxiv.org/abs/2504.10063
作者: Alexandra Bazarova,Aleksandr Yugay,Andrey Shulga,Alina Ermilova,Andrei Volodichev,Konstantin Polev,Julia Belikova,Rauf Parchiev,Dmitry Simakov,Maxim Savchenko,Andrey Savchenko,Serguei Barannikov,Alexey Zaytsev
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Hallucination, i.e., generating factually incorrect content, remains a critical challenge for large language models (LLMs). We introduce TOHA, a TOpology-based HAllucination detector in the RAG setting, which leverages a topological divergence metric to quantify the structural properties of graphs induced by attention matrices. Examining the topological divergence between prompt and response subgraphs reveals consistent patterns: higher divergence values in specific attention heads correlate with hallucinated outputs, independent of the dataset. Extensive experiments, including evaluation on question answering and data-to-text tasks, show that our approach achieves state-of-the-art or competitive results on several benchmarks, two of which were annotated by us and are being publicly released to facilitate further research. Beyond its strong in-domain performance, TOHA maintains remarkable domain transferability across multiple open-source LLMs. Our findings suggest that analyzing the topological structure of attention matrices can serve as an efficient and robust indicator of factual reliability in LLMs.
zh

[NLP-39] Joint Action Language Modelling for Transparent Policy Execution

【速读】：该论文旨在解决智能体意图因具身策略的黑箱性质而隐藏的问题，通过将策略学习转化为语言生成问题，并结合传统的自回归建模方法，实现行为透明性直接嵌入到学习过程中。解决方案的关键在于提出一种模型，该模型能够以自回归方式生成描述下一步动作的透明自然语言陈述，同时伴随表示具体动作的标记，用于解决Language-Table环境中的长时序任务。研究特别关注预测动作与生成高质量语言之间的关系，发现当两者同时生成时，动作轨迹质量和透明陈述质量在许多情况下均会提高。

链接: https://arxiv.org/abs/2504.10055
作者: Theodor Wulff,Rahul Singh Maharjan,Xinyun Chi,Angelo Cangelosi
机构: Department of Computer Science, The University of Manchester (曼彻斯特大学)
类目: Robotics (cs.RO); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:An agent’s intention often remains hidden behind the black-box nature of embodied policies. Communication using natural language statements that describe the next action can provide transparency towards the agent’s behavior. We aim to insert transparent behavior directly into the learning process, by transforming the problem of policy learning into a language generation problem and combining it with traditional autoregressive modelling. The resulting model produces transparent natural language statements followed by tokens representing the specific actions to solve long-horizon tasks in the Language-Table environment. Following previous work, the model is able to learn to produce a policy represented by special discretized tokens in an autoregressive manner. We place special emphasis on investigating the relationship between predicting actions and producing high-quality language for a transparent agent. We find that in many cases both the quality of the action trajectory and the transparent statement increase when they are generated simultaneously.
zh

[NLP-40] Summarization of Multimodal Presentations with Vision-Language Models: Study of the Effect of Modalities and Structure

【速读】：该论文试图解决如何利用 Vision-Language Models (VLMs) 生成高效且高质量的多模态文档摘要问题。论文针对文本密集型多模态文档，在不同输入长度预算下，通过定量与定性分析，探索了成本效益高的摘要生成策略。解决方案的关键在于输入表示的选择：论文发现从视频流中提取的幻灯片作为输入优于原始视频，而结构化的幻灯片与字幕交织的表示方式能提供最佳性能。此外，论文还探讨了多模态交互的本质，并提出了提升 VLMs 理解此类文档能力的建议。

链接: https://arxiv.org/abs/2504.10049
作者: Théo Gigant,Camille Guinaudeau,Frédéric Dufaux
机构: Université Paris-Saclay (巴黎萨克雷大学), CNRS (法国国家科学研究中心), CentraleSupelec (CentraleSupélec), Laboratoire des signaux et systèmes (信号与系统实验室)(Gif-sur-Yvette, France); Université Paris-Saclay (巴黎萨克雷大学), CNRS (法国国家科学研究中心), LISN (Laboratoire Interdisciplinaire des Sciences du Numérique)(Orsay, France)
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Vision-Language Models (VLMs) can process visual and textual information in multiple formats: texts, images, interleaved texts and images, or even hour-long videos. In this work, we conduct fine-grained quantitative and qualitative analyses of automatic summarization of multimodal presentations using VLMs with various representations as input. From these experiments, we suggest cost-effective strategies for generating summaries from text-heavy multimodal documents under different input-length budgets using VLMs. We show that slides extracted from the video stream can be beneficially used as input against the raw video, and that a structured representation from interleaved slides and transcript provides the best performance. Finally, we reflect and comment on the nature of cross-modal interactions in multimodal presentations and share suggestions to improve the capabilities of VLMs to understand documents of this nature.
zh

[NLP-41] DataMosaic: Explainable and Verifiable Multi-Modal Data Analytics through Extract-Reason -Verify

【速读】：该论文试图解决大型语言模型（LLMs）在数据 analytics 应用中面临的两大核心问题：缺乏可解释性（opaque reasoning processes）和不可验证性（prone to hallucinations and unchecked errors）。尽管检索增强生成（RAG）通过将 LLMs 与外部数据结合提高了准确性，但它未能有效应对可信数据分析的根本挑战，特别是在处理噪声、不一致或多模态数据（如文本、表格、图像）时。
论文提出的解决方案核心在于 DataMosaic 框架，其通过从原始数据动态提取任务特定结构（如表格、图、树）来实现 LLM-powered analytics 的可解释性和可验证性。DataMosaic 提供透明的逐步推理轨迹，并支持中间结果的验证，同时基于多代理框架构建自我适应的代理，以满足下游任务需求，从而提升一致性、完整性和隐私保护。此方法不仅克服了现有 LLM-powered 系统的局限性，还为基于数据基础的、准确且可解释的多模态数据分析奠定了新范式。

链接: https://arxiv.org/abs/2504.10036
作者: Zhengxuan Zhang,Zhuowen Liang,Yin Wu,Teng Lin,Yuyu Luo,Nan Tang
机构: Hong Kong University of Science and Technology (Guangzhou)(香港科技大学（广州）)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Large Language Models (LLMs) are transforming data analytics, but their widespread adoption is hindered by two critical limitations: they are not explainable (opaque reasoning processes) and not verifiable (prone to hallucinations and unchecked errors). While retrieval-augmented generation (RAG) improves accuracy by grounding LLMs in external data, it fails to address the core challenges of trustworthy analytics - especially when processing noisy, inconsistent, or multi-modal data (for example, text, tables, images). We propose DataMosaic, a framework designed to make LLM-powered analytics both explainable and verifiable. By dynamically extracting task-specific structures (for example, tables, graphs, trees) from raw data, DataMosaic provides transparent, step-by-step reasoning traces and enables validation of intermediate results. Built on a multi-agent framework, DataMosaic orchestrates self-adaptive agents that align with downstream task requirements, enhancing consistency, completeness, and privacy. Through this approach, DataMosaic not only tackles the limitations of current LLM-powered analytics systems but also lays the groundwork for a new paradigm of grounded, accurate, and explainable multi-modal data analytics.
zh

[NLP-42] he Mirag e of Performance Gains: Why Contrastive Decoding Fails to Address Multimodal Hallucination

【速读】：该论文旨在解决多模态大语言模型（Multimodal Large Language Models, MLLMs）中对比解码策略未能有效缓解幻觉（hallucination）问题的核心挑战。论文指出，现有对比解码方法通过构建对比样本诱导幻觉并在输出分布中抑制它们，但其性能提升主要源于两个误导性因素：一是对模型输出分布的粗略单向调整；二是自适应可能性约束导致采样策略退化为贪婪搜索。为揭示这些问题，作者引入了一系列虚假改进方法，并通过实验表明对比解码的性能增益与缓解幻觉的目标无关。论文的关键在于通过实验证据质疑现有对比解码策略的有效性，并为开发真正有效的多模态大语言模型幻觉解决方案奠定基础。

链接: https://arxiv.org/abs/2504.10020
作者: Hao Yin,Gunagzong Si,Zilei Wang
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Contrastive decoding strategies are widely used to reduce hallucinations in multimodal large language models (MLLMs). These methods work by constructing contrastive samples to induce hallucinations and then suppressing them in the output distribution. However, this paper demonstrates that such approaches fail to effectively mitigate the hallucination problem. The performance improvements observed on POPE Benchmark are largely driven by two misleading factors: (1) crude, unidirectional adjustments to the model’s output distribution and (2) the adaptive plausibility constraint, which reduces the sampling strategy to greedy search. To further illustrate these issues, we introduce a series of spurious improvement methods and evaluate their performance against contrastive decoding techniques. Experimental results reveal that the observed performance gains in contrastive decoding are entirely unrelated to its intended goal of mitigating hallucinations. Our findings challenge common assumptions about the effectiveness of contrastive decoding strategies and pave the way for developing genuinely effective solutions to hallucinations in MLLMs.
zh

[NLP-43] Do We Really Need Curated Malicious Data for Safety Alignment in Multi-modal Large Language Models ? CVPR2025

【速读】：该论文试图解决多模态大型语言模型（MLLMs）在安全对齐方面的局限性问题，特别是由于缺乏专门针对多模态输入的安全措施而产生的对齐差距。当前方法主要依赖于从语言模块继承的对齐能力，并通过精心设计的安全数据集来增强模型防御能力，但具体的知识或模式如何从高质量数据集中获取仍不明确。研究发现，这种对齐差距主要源于数据分布偏差，而图像内容、响应质量或数据集的对比行为对此贡献较小。

解决方案的关键在于通过微调MLLMs，在少量良性的指令跟随数据集上进行训练，其中响应被替换为简单的、明确的拒绝语句。实验表明，只要在微调数据集中存在特定比例的拒绝数据，无需耗费大量精力收集高质量的恶意数据，模型的安全性就能显著提升。这表明安全对齐并非完全丢失，而是被掩盖了。通过纠正潜在的数据偏差，可以缩小视觉领域的安全性差距。

链接: https://arxiv.org/abs/2504.10000
作者: Yanbo Wang,Jiyang Guan,Jian Liang,Ran He
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: Accepted to CVPR 2025, codes in process

点击查看摘要

Abstract:Multi-modal large language models (MLLMs) have made significant progress, yet their safety alignment remains limited. Typically, current open-source MLLMs rely on the alignment inherited from their language module to avoid harmful generations. However, the lack of safety measures specifically designed for multi-modal inputs creates an alignment gap, leaving MLLMs vulnerable to vision-domain attacks such as typographic manipulation. Current methods utilize a carefully designed safety dataset to enhance model defense capability, while the specific knowledge or patterns acquired from the high-quality dataset remain unclear. Through comparison experiments, we find that the alignment gap primarily arises from data distribution biases, while image content, response quality, or the contrastive behavior of the dataset makes little contribution to boosting multi-modal safety. To further investigate this and identify the key factors in improving MLLM safety, we propose finetuning MLLMs on a small set of benign instruct-following data with responses replaced by simple, clear rejection sentences. Experiments show that, without the need for labor-intensive collection of high-quality malicious data, model safety can still be significantly improved, as long as a specific fraction of rejection data exists in the finetuning set, indicating the security alignment is not lost but rather obscured during multi-modal pretraining or instruction finetuning. Simply correcting the underlying data bias could narrow the safety gap in the vision domain.
zh

[NLP-44] urn-taking annotation for quantitative and qualitative analyses of conversation

【速读】：该论文旨在解决两个问题：一是为Graz Corpus中的95分钟会话语料创建并提供切换单元标注层；二是详细描述标注系统与过程，以便其他研究者用于自己的会话数据。解决方案的关键在于设计了一个兼顾跨学科应用需求的标注系统，该系统基于会话分析的顺序标准，适合后续语音学分析（时间对齐标注在Praat中完成），并且能够支持自动分类（需要连续标注语音并使用大小适中的标签集以实现高评分者一致性）。通过在Inter-Pausal Units (IPU) 和潜在完成点 (PCOMP) 两个层次上进行标注，并明确描述分割与标注准则，实现了接近完美的IPU标注一致性和实质性的PCOMP标注一致性。这种系统可以广泛应用于语言学研究和技术开发，促进跨学科交流与合作。

链接: https://arxiv.org/abs/2504.09980
作者: Anneliese Kelterer,Barbara Schuppler
机构: Department of Linguistics, University of Graz (格拉茨大学), Austria; Signal Processing and Speech Communication Laboratory, Graz University of Technology (格拉茨技术大学), Austria
类目: Computation and Language (cs.CL); Databases (cs.DB); Human-Computer Interaction (cs.HC); Audio and Speech Processing (eess.AS)
备注: 41 pages

点击查看摘要

Abstract:This paper has two goals. First, we present the turn-taking annotation layers created for 95 minutes of conversational speech of the Graz Corpus of Read and Spontaneous Speech (GRASS), available to the scientific community. Second, we describe the annotation system and the annotation process in more detail, so other researchers may use it for their own conversational data. The annotation system was developed with an interdisciplinary application in mind. It should be based on sequential criteria according to Conversation Analysis, suitable for subsequent phonetic analysis, thus time-aligned annotations were made Praat, and it should be suitable for automatic classification, which required the continuous annotation of speech and a label inventory that is not too large and results in a high inter-rater agreement. Turn-taking was annotated on two layers, Inter-Pausal Units (IPU) and points of potential completion (PCOMP; similar to transition relevance places). We provide a detailed description of the annotation process and of segmentation and labelling criteria. A detailed analysis of inter-rater agreement and common confusions shows that agreement for IPU annotation is near-perfect, that agreement for PCOMP annotations is substantial, and that disagreements often are either partial or can be explained by a different analysis of a sequence which also has merit. The annotation system can be applied to a variety of conversational data for linguistic studies and technological applications, and we hope that the annotations, as well as the annotation system will contribute to a stronger cross-fertilization between these disciplines.
zh

[NLP-45] C-MTCSD: A Chinese Multi-Turn Conversational Stance Detection Dataset WWW2025

【速读】：该论文试图解决中文语言处理中的立场检测（Stance Detection）难题，特别是在多轮对话（multi-turn conversational）分析方面的挑战。现有方法在这些领域存在显著局限性。为了解决这些问题，论文的关键解决方案是引入了一个名为C-MTCSD的大规模中文多轮对话立场检测数据集，该数据集包含24,264个精心标注的实例，规模是之前唯一存在的中文对话立场检测数据集的4.2倍。通过传统方法和大型语言模型的综合评估表明，即使最先进的模型在零样本设置下的F1分数仅为64.07%，并且随着对话深度增加，性能会持续下降，尤其在隐式立场检测方面表现尤为薄弱，F1分数低于50%。这一工作为中文立场检测研究建立了新的高难度基准，指出了未来改进的重要方向。

链接: https://arxiv.org/abs/2504.09958
作者: Fuqiang Niu,Yi Yang,Xianghua Fu,Genan Dai,Bowen Zhang
机构: Shenzhen Technology University (深圳技术大学) Shenzhen China; Shenzhen Technology University (深圳技术大学) Shenzhen China; Shenzhen Technology University (深圳技术大学) Shenzhen China; Shenzhen Technology University (深圳技术大学) Shenzhen China; Shenzhen Technology University (深圳技术大学) Shenzhen China; Shenzhen Technology University (深圳技术大学) Shenzhen China
类目: Computation and Language (cs.CL)
备注: WWW2025

点击查看摘要

Abstract:Stance detection has become an essential tool for analyzing public discussions on social media. Current methods face significant challenges, particularly in Chinese language processing and multi-turn conversational analysis. To address these limitations, we introduce C-MTCSD, the largest Chinese multi-turn conversational stance detection dataset, comprising 24,264 carefully annotated instances from Sina Weibo, which is 4.2 times larger than the only prior Chinese conversational stance detection dataset. Our comprehensive evaluation using both traditional approaches and large language models reveals the complexity of C-MTCSD: even state-of-the-art models achieve only 64.07% F1 score in the challenging zero-shot setting, while performance consistently degrades with increasing conversation depth. Traditional models particularly struggle with implicit stance detection, achieving below 50% F1 score. This work establishes a challenging new benchmark for Chinese stance detection research, highlighting significant opportunities for future improvements.
zh

[NLP-46] Assessing Judging Bias in Large Reasoning Models: An Empirical Study

【速读】：该论文旨在解决大型推理模型（LRMs）在作为裁判（LLM-as-a-Judge）时的判断偏差问题。论文通过构建一个综合基准，对比了LRMs与传统大语言模型（LLMs）在主观偏好对齐数据集和客观事实数据集上的判断偏差，揭示了四种关键偏差：LRMs尽管具备高级推理能力但仍易受从众、权威、位置和干扰偏差的影响；在事实相关数据集上表现出比LLMs更好的鲁棒性；存在显著的位置偏差，倾向于选择后置选项；以及一种新的“表层反思偏差”，即模仿推理的短语（如“等等，让我想想…”）对模型判断有显著影响。为缓解这些偏差，论文设计并评估了三种策略：专门的系统提示可使偏好对齐数据集的偏差减少高达19%，事实相关数据集减少14%；情境学习在偏好任务中提升可达27%，但在事实任务中结果不一致；自省机制在偏好数据集上减少偏差达10%，事实相关数据集上达16%，尤其对LRMs效果显著。论文的关键在于提出并验证了多种有效的偏差缓解方法，特别是自省机制在改善LRMs可靠性方面的有效性。

链接: https://arxiv.org/abs/2504.09946
作者: Qian Wang,Zhanzhi Lou,Zhenheng Tang,Nuo Chen,Xuandong Zhao,Wenxuan Zhang,Dawn Song,Bingsheng He
机构: 未知
类目: Computers and Society (cs.CY); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Large Reasoning Models (LRMs) like DeepSeek-R1 and OpenAI-o1 have demonstrated remarkable reasoning capabilities, raising important questions about their biases in LLM-as-a-judge settings. We present a comprehensive benchmark comparing judging biases between LLMs and LRMs across both subjective preference-alignment datasets and objective fact-based datasets. Through investigation of bandwagon, authority, position, and distraction biases, we uncover four key findings: (1) despite their advanced reasoning capabilities, LRMs remain susceptible to the above biases; (2) LRMs demonstrate better robustness than LLMs specifically on fact-related datasets; (3) LRMs exhibit notable position bias, preferring options in later positions; and (4) we identify a novel “superficial reflection bias” where phrases mimicking reasoning (e.g., “wait, let me think…”) significantly influence model judgments. To address these biases, we design and evaluate three mitigation strategies: specialized system prompts that reduce judging biases by up to 19% in preference alignment datasets and 14% in fact-related datasets, in-context learning that provides up to 27% improvement on preference tasks but shows inconsistent results on factual tasks, and a self-reflection mechanism that reduces biases by up to 10% in preference datasets and 16% in fact-related datasets, with self-reflection proving particularly effective for LRMs. Our work provides crucial insights for developing more reliable LLM-as-a-Judge frameworks, especially as LRMs become increasingly deployed as automated judges.
zh

[NLP-47] KeepKV: Eliminating Output Perturbation in KV Cache Compression for Efficient LLM s Inference

【速读】：该论文旨在解决大型语言模型（Large Language Models, LLMs）高效推理受日益增长的关键值（Key-Value, KV）缓存限制的问题，并提出KV缓存压缩是一个重要的研究方向。传统方法通过注意力分数或位置启发式选择性地淘汰不重要的KV缓存条目，但会导致信息丢失和幻觉现象。尽管最近基于合并的策略尝试通过保留将被丢弃的KV对来保留更多信息，但这些方法不可避免地在合并前后引入注意力分布的不一致性，从而导致输出扰动和生成质量下降。为克服这一挑战，论文提出了KeepKV，这是一种新颖的自适应KV缓存合并方法，旨在消除输出扰动的同时，在严格的内存约束下保持性能。KeepKV引入了选举票机制（Electoral Votes Mechanism）以记录合并历史并自适应调整注意力分数，同时结合零推理扰动合并方法（Zero Inference-Perturbation Merging），保持注意力一致性并补偿因缓存合并而导致的注意力损失。关键在于通过引入这些机制，在显著压缩缓存的同时成功保留了关键上下文信息。实验结果表明，KeepKV不仅大幅减少了内存使用，还将推理吞吐量提升了两倍以上，即使在仅10%的KV缓存预算下仍能保持优异的生成质量。

链接: https://arxiv.org/abs/2504.09936
作者: Yuxuan Tian,Zihan Wang,Yebo Peng,Aomufei Yuan,Zhiming Wang,Bairen Yi,Xin Liu,Yong Cui,Tong Yang
机构: Peking University (北京大学); ByteDance (字节跳动); Tsinghua University (清华大学)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: 18 pages, 8 figures

点击查看摘要

Abstract:Efficient inference of large language models (LLMs) is hindered by an ever-growing key-value (KV) cache, making KV cache compression a critical research direction. Traditional methods selectively evict less important KV cache entries based on attention scores or position heuristics, which leads to information loss and hallucinations. Recently, merging-based strategies have been explored to retain more information by merging KV pairs that would be discarded; however, these existing approaches inevitably introduce inconsistencies in attention distributions before and after merging, causing output perturbation and degraded generation quality. To overcome this challenge, we propose KeepKV, a novel adaptive KV cache merging method designed to eliminate output perturbation while preserving performance under strict memory constraints. KeepKV introduces the Electoral Votes mechanism that records merging history and adaptively adjusts attention scores. Moreover, it further leverages a novel Zero Inference-Perturbation Merging methods, keeping attention consistency and compensating for attention loss resulting from cache merging. KeepKV successfully retains essential context information within a significantly compressed cache. Extensive experiments on various benchmarks and LLM architectures demonstrate that KeepKV substantially reduces memory usage, enhances inference throughput by more than 2x and keeps superior generation quality even with 10% KV cache budgets.
zh

[NLP-48] Guiding Reasoning in Small Language Models with LLM Assistance

【速读】：该论文试图解决小语言模型（Small Language Models, SLMs）在深度多步逻辑推理任务中推理能力有限的问题。论文提出了一种名为Small Reasons, Large Hints (SMART) 的框架，其关键在于通过基于分数的评估识别不确定的推理步骤，并仅在必要时注入由大语言模型（Large Language Models, LLMs）生成的纠正性推理指导，从而实现认知支架式的外部增强。这种方法将结构化推理视为最优策略搜索问题，引导推理过程向正确解发展，而无需进行穷尽采样。实验结果表明，这种针对性的外部支架显著提升了性能，为SLMs与LLMs的协作解决当前无法单独由SLMs完成的复杂推理任务奠定了基础。

链接: https://arxiv.org/abs/2504.09923
作者: Yujin Kim,Euiin Yi,Minu Kim,Se-Young Yun,Taehyeon Kim
机构: 未知
类目: Computation and Language (cs.CL)
备注: 20 pages, 10 figures, 11 tables

点击查看摘要

Abstract:The limited reasoning capabilities of small language models (SLMs) cast doubt on their suitability for tasks demanding deep, multi-step logical deduction. This paper introduces a framework called Small Reasons, Large Hints (SMART), which selectively augments SLM reasoning with targeted guidance from large language models (LLMs). Inspired by the concept of cognitive scaffolding, SMART employs a score-based evaluation to identify uncertain reasoning steps and injects corrective LLM-generated reasoning only when necessary. By framing structured reasoning as an optimal policy search, our approach steers the reasoning trajectory toward correct solutions without exhaustive sampling. Our experiments on mathematical reasoning datasets demonstrate that targeted external scaffolding significantly improves performance, paving the way for collaborative use of both SLM and LLM to tackle complex reasoning tasks that are currently unsolvable by SLMs alone.
zh

[NLP-49] Learning to Erase Private Knowledge from Multi-Documents for Retrieval-Augmented Large Language Models

【速读】：该论文旨在解决 Retrieval-Augmented Generation (RAG) 中因检索到的文档包含敏感知识而导致的隐私泄露风险问题。具体而言，论文关注如何有效擦除用户定义的私有信息，同时保留足够的公共知识以支持生成任务，并提出以下关键解决方案：首先构建全局知识图谱以识别跨文档的潜在知识，从而防御去匿名化攻击；其次将知识图谱划分为私有和公共子图，并利用 Flan-T5 对检索到的文档进行重写以排除私有三元组；最后通过 PPO 算法优化重写模型，实现私有三元组最小化与公共三元组最大化。实验结果表明，Eraser4RAG 在隐私擦除性能上优于 GPT-4o。

链接: https://arxiv.org/abs/2504.09910
作者: Yujing Wang,Hainan Zhang,Liang Pang,Yongxin Tong,Binghui Guo,Hongwei Zheng,Zhiming Zheng
机构: Beijing Advanced Innovation Center for Future Blockchain and Privacy Computing (北京未来区块链与隐私计算高级创新中心); School of Artificial Intelligence, Beihang University (北京航空航天大学人工智能学院), China; Beijing Academy of Blockchain and Edge Computing (北京区块链与边缘计算研究院), China; Institute of Computing Technology, Chinese Academy of Sciences (中国科学院计算技术研究所), Beijing, China
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Retrieval-Augmented Generation (RAG) is a promising technique for applying LLMs to proprietary domains. However, retrieved documents may contain sensitive knowledge, posing risks of privacy leakage in generative results. Thus, effectively erasing private information from retrieved documents is a key challenge for RAG. Unlike traditional text anonymization, RAG should consider: (1) the inherent multi-document reasoning may face de-anonymization attacks; (2) private knowledge varies by scenarios, so users should be allowed to customize which information to erase; (3) preserving sufficient publicly available knowledge for generation tasks. This paper introduces the privacy erasure task for RAG and proposes Eraser4RAG, a private knowledge eraser which effectively removes user-defined private knowledge from documents while preserving sufficient public knowledge for generation. Specifically, we first construct a global knowledge graph to identify potential knowledge across documents, aiming to defend against de-anonymization attacks. Then we randomly split it into private and public sub-graphs, and fine-tune Flan-T5 to rewrite the retrieved documents excluding private triples. Finally, PPO algorithm optimizes the rewriting model to minimize private triples and maximize public triples retention. Experiments on four QA datasets demonstrate that Eraser4RAG achieves superior erase performance than GPT-4o.
zh

[NLP-50] Quantum Natural Language Processing: A Comprehensive Review of Models Methods and Applications

【速读】：该论文试图解决的问题是：当前深度学习方法在自然语言处理（NLP）中的性能提升与所需大量数据和资源之间的矛盾。为应对这一挑战，论文提出利用量子计算（Quantum Computing）的特性来突破现有方法的计算限制，从而构建新兴领域——量子自然语言处理（Quantum Natural Language Processing, QNLP）。论文的关键在于通过分类基于量子计算原理、架构和计算方法的QNLP模型，探索量子编码技术、针对经典数据的量子处理模型以及用于超参数调优的量子优化技术，以实现语言处理任务中效率和准确性的提升，并总结量子计算在不同NLP任务中的应用现状及发展趋势。

链接: https://arxiv.org/abs/2504.09909
作者: Farha Nausheen,Khandakar Ahmed,M Imad Khan
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:In recent developments, deep learning methodologies applied to Natural Language Processing (NLP) have revealed a paradox: They improve performance but demand considerable data and resources for their training. Alternatively, quantum computing exploits the principles of quantum mechanics to overcome the computational limitations of current methodologies, thereby establishing an emerging field known as quantum natural language processing (QNLP). This domain holds the potential to attain a quantum advantage in the processing of linguistic structures, surpassing classical models in both efficiency and accuracy. In this paper, it is proposed to categorise QNLP models based on quantum computing principles, architecture, and computational approaches. This paper attempts to provide a survey on how quantum meets language by mapping state-of-the-art in this area, embracing quantum encoding techniques for classical data, QNLP models for prevalent NLP tasks, and quantum optimisation techniques for hyper parameter tuning. The landscape of quantum computing approaches applied to various NLP tasks is summarised by showcasing the specific QNLP methods used, and the popularity of these methods is indicated by their count. From the findings, it is observed that QNLP approaches are still limited to small data sets, with only a few models explored extensively, and there is increasing interest in the application of quantum computing to natural language processing tasks.
zh

[NLP-51] Refining Financial Consumer Complaints through Multi-Scale Model Interaction

【速读】：该论文旨在解决非法律专业人士撰写的文件因缺乏清晰性、正式性和领域特定精确性而导致的法律写作质量不足的问题。论文提出了一种新的方法，即多尺度模型交互（Multi-Scale Model Interaction, MSMI），其关键是利用轻量级分类器评估输出结果，并引导大型语言模型（Large Language Models, LLMs）进行迭代优化，将非正式、口语化的输入转化为有说服力的法律论据。实验结果显示，MSMI在提升生成质量方面显著优于单次提示策略，并在多个短文本基准测试中验证了其泛化能力和对抗鲁棒性。

链接: https://arxiv.org/abs/2504.09903
作者: Bo-Wei Chen,An-Zi Yen,Chung-Chi Chen
机构: Department of Computer Science, National Yang Ming Chiao Tung University (国立阳明交通大学计算机科学系), Taiwan; National Institute of Advanced Industrial Science and Technology (日本产业技术综合研究所), Japan
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Legal writing demands clarity, formality, and domain-specific precision-qualities often lacking in documents authored by individuals without legal training. To bridge this gap, this paper explores the task of legal text refinement that transforms informal, conversational inputs into persuasive legal arguments. We introduce FinDR, a Chinese dataset of financial dispute records, annotated with official judgments on claim reasonableness. Our proposed method, Multi-Scale Model Interaction (MSMI), leverages a lightweight classifier to evaluate outputs and guide iterative refinement by Large Language Models (LLMs). Experimental results demonstrate that MSMI significantly outperforms single-pass prompting strategies. Additionally, we validate the generalizability of MSMI on several short-text benchmarks, showing improved adversarial robustness. Our findings reveal the potential of multi-model collaboration for enhancing legal document generation and broader text refinement tasks.
zh

[NLP-52] WSSenti: A Novel Hybrid Framework for Topic-Wise Sentiment Analysis on Social Media Using Transformer Models

【速读】：该论文旨在解决情感分析任务中因噪声数据、上下文歧义以及跨多样数据集泛化能力不足所导致的挑战。为应对这些问题，论文提出了一种结合多种基于Transformer的模型（BERT、GPT-2、RoBERTa、XLNet和DistilBERT）的混合框架。关键在于利用这些模型的独特优势，如BERT的双向上下文理解、GPT-2的生成能力增强、RoBERTa在大规模语料库上的优化上下文感知、XLNet通过排列学习建模依赖关系，以及DistilBERT在保持高精度的同时降低计算开销。此外，论文通过文本清洗、分词及特征提取方法（如TF-IDF和BoW）确保高质量输入数据，并验证了该混合方法在Sentiment140和IMDB数据集上的卓越性能（分别达到94%和95%的准确率），从而证明了集成多模型架构的有效性。

链接: https://arxiv.org/abs/2504.09896
作者: Aish Albladi,Md Kaosar Uddin,Minarul Islam,Cheryl Seals
机构: Auburn University (奥本大学)
类目: Computation and Language (cs.CL)
备注: 41 pages, 12 figures, includes algorithm and comparative tables

点击查看摘要

Abstract:Sentiment analysis is a crucial task in natural language processing (NLP) that enables the extraction of meaningful insights from textual data, particularly from dynamic platforms like Twitter and IMDB. This study explores a hybrid framework combining transformer-based models, specifically BERT, GPT-2, RoBERTa, XLNet, and DistilBERT, to improve sentiment classification accuracy and robustness. The framework addresses challenges such as noisy data, contextual ambiguity, and generalization across diverse datasets by leveraging the unique strengths of these models. BERT captures bidirectional context, GPT-2 enhances generative capabilities, RoBERTa optimizes contextual understanding with larger corpora and dynamic masking, XLNet models dependency through permutation-based learning, and DistilBERT offers efficiency with reduced computational overhead while maintaining high accuracy. We demonstrate text cleaning, tokenization, and feature extraction using Term Frequency Inverse Document Frequency (TF-IDF) and Bag of Words (BoW), ensure high-quality input data for the models. The hybrid approach was evaluated on benchmark datasets Sentiment140 and IMDB, achieving superior accuracy rates of 94% and 95%, respectively, outperforming standalone models. The results validate the effectiveness of combining multiple transformer models in ensemble-like setups to address the limitations of individual architectures. This research highlights its applicability to real-world tasks such as social media monitoring, customer sentiment analysis, and public opinion tracking which offers a pathway for future advancements in hybrid NLP frameworks.
zh

[NLP-53] Learning from Reference Answers: Versatile Language Model Alignment without Binary Human Preference Data

【速读】：该论文旨在解决在大型语言模型（Large Language Models, LLMs）对齐过程中，传统基于二元偏好数据收集和奖励建模方法资源消耗大且效率低的问题。论文提出的关键解决方案是利用采样生成结果与高质量参考答案之间的相似性作为替代奖励函数，以实现LLM的对齐。这种方法避免了训练奖励模型的需求，并且当有多个候选答案时，收集单一参考答案的成本通常低于构建二元偏好对所需的时间。论文开发了一种名为\textitRefAlign的REINFORCE风格的对齐算法，它不依赖于参考或奖励模型，而是使用BERTScore来衡量采样生成结果与高质量参考答案之间的相似性作为代理奖励。此外，RefAlign能够通过结合相似性奖励与任务相关目标，轻松扩展到安全性和置信度对齐等多样化场景，同时保持高效性并达到与现有方法相当的性能。

链接: https://arxiv.org/abs/2504.09895
作者: Shuai Zhao,Linchao Zhu,Yi Yang
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: work in progress

点击查看摘要

Abstract:Large language models~(LLMs) are expected to be helpful, harmless, and honest. In various alignment scenarios, such as general human preference, safety, and confidence alignment, binary preference data collection and reward modeling are resource-intensive but necessary for human preference transferring. In this work, we explore using the similarity between sampled generations and high-quality reference answers as an alternative reward function for LLM alignment. Using similarity as a reward circumvents training reward models, and collecting a single reference answer potentially costs less time than constructing binary preference pairs when multiple candidates are available. Specifically, we develop \textitRefAlign, a versatile REINFORCE-style alignment algorithm, which is free of reference and reward models. Instead, RefAlign utilizes BERTScore between sampled generations and high-quality reference answers as the surrogate reward. Beyond general human preference optimization, RefAlign can be readily extended to diverse scenarios, such as safety and confidence alignment, by incorporating the similarity reward with task-related objectives. In various scenarios, RefAlign demonstrates comparable performance to previous alignment methods while offering high efficiency.
zh

[NLP-54] Investigating Syntactic Biases in Multilingual Transformers with RC Attachment Ambiguities in Italian and English

【速读】：该论文旨在探究单语和多语大型语言模型（LLMs）在处理意大利语和英语中的关系从句（Relative Clause, RC）歧义示例时，是否表现出类似人类的偏好。同时，研究进一步评估这些偏好是否可以通过词汇因素（主句中的动词/名词类型）进行调节，这些词汇因素已被证明与句法和语义关系的细微约束相关。论文的关键在于通过跨语言实验设计，利用关系从句的歧义现象作为基准，系统分析LLMs在捕捉人类语言知识和偏见方面的表现差异及其局限性。

链接: https://arxiv.org/abs/2504.09886
作者: Michael Kamerath,Aniello De Santo
机构: University of Utah (犹他大学)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:This paper leverages past sentence processing studies to investigate whether monolingual and multilingual LLMs show human-like preferences when presented with examples of relative clause attachment ambiguities in Italian and English. Furthermore, we test whether these preferences can be modulated by lexical factors (the type of verb/noun in the matrix clause) which have been shown to be tied to subtle constraints on syntactic and semantic relations. Our results overall showcase how LLM behavior varies interestingly across models, but also general failings of these models in correctly capturing human-like preferences. In light of these results, we argue that RC attachment is the ideal benchmark for cross-linguistic investigations of LLMs’ linguistic knowledge and biases.
zh

[NLP-55] PASS-FC: Progressive and Adaptive Search Scheme for Fact Checking of Comprehensive Claims

【速读】：本文旨在解决自动化事实核查在处理复杂现实世界主张时面临的挑战。论文提出了一种名为PASS-FC的新框架，通过主张增强（claim augmentation）、自适应问题生成（adaptive question generation）以及迭代验证（iterative verification）来应对这些挑战。关键在于PASS-FC不仅增强了原子主张的时间和实体上下文，还采用了先进的搜索技术并利用了反射机制（reflection mechanism）。此外，通过在六个多样化数据集上的评估，证明了其在通用知识、科学、真实世界及多语言事实核查任务中的卓越表现，并揭示了证据数量和反射标签触发器的最佳超参数设置。消融研究进一步强调了主张增强和语言特定适配的重要性。

链接: https://arxiv.org/abs/2504.09866
作者: Ziyu Zhuang
机构: Trip.com Group
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Automated fact-checking faces challenges in handling complex real-world claims. We present PASS-FC, a novel framework that addresses these issues through claim augmentation, adaptive question generation, and iterative verification. PASS-FC enhances atomic claims with temporal and entity context, employs advanced search techniques, and utilizes a reflection mechanism. We evaluate PASS-FC on six diverse datasets, demonstrating superior performance across general knowledge, scientific, real-world, and multilingual fact-checking tasks. Our framework often surpasses stronger baseline models. Hyperparameter analysis reveals optimal settings for evidence quantity and reflection label triggers, while ablation studies highlight the importance of claim augmentation and language-specific adaptations. PASS-FC’s performance underscores its effectiveness in improving fact-checking accuracy and adaptability across various domains. We will open-source our code and experimental results to facilitate further research in this area.
zh

[NLP-56] Reasoning Models Can Be Effective Without Thinking

【速读】：本文旨在探讨在大型语言模型（LLMs）的推理任务中，显式思维过程（Thinking）是否为必要组件。作者提出通过简单的提示方法绕过显式思维过程（称为NoThinking），并验证其有效性。研究的关键在于发现NoThinking在控制token数量的情况下，在涵盖数学问题求解、形式定理证明及代码生成等七类复杂推理数据集上的表现优于Thinking，尤其是在资源受限场景下（如ACM 23数据集上，700tokens时NoThinking得分51.3 vs Thinking得分28.9）。进一步地，作者展示了通过并行扩展策略生成多个独立输出并进行聚合的方法能够显著提升性能，其中聚合方式采用任务特定验证器或基于置信度的最佳-N选择策略。此方法不仅在相似延迟条件下超越多种基准模型，还在大幅降低延迟情况下与包含显式思维过程的模型表现相当（延迟可缩短至1/9）。总体而言，本文的研究挑战了显式思维过程的必要性，并为低预算场景或低延迟条件下的强推理性能提供了竞争性参考。

链接: https://arxiv.org/abs/2504.09858
作者: Wenjie Ma,Jingxuan He,Charlie Snell,Tyler Griggs,Sewon Min,Matei Zaharia
机构: 未知
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: 33 pages, 7 main figures, 2 tables

点击查看摘要

Abstract:Recent LLMs have significantly improved reasoning capabilities, primarily by including an explicit, lengthy Thinking process as part of generation. In this paper, we question whether this explicit thinking is necessary. Using the state-of-the-art DeepSeek-R1-Distill-Qwen, we find that bypassing the thinking process via simple prompting, denoted as NoThinking, can be surprisingly effective. When controlling for the number of tokens, NoThinking outperforms Thinking across a diverse set of seven challenging reasoning datasets–including mathematical problem solving, formal theorem proving, and coding–especially in low-budget settings, e.g., 51.3 vs. 28.9 on ACM 23 with 700 tokens. Notably, the performance of NoThinking becomes more competitive with pass@k as k increases. Building on this observation, we demonstrate that a parallel scaling approach that uses NoThinking to generate N outputs independently and aggregates them is highly effective. For aggregation, we use task-specific verifiers when available, or we apply simple best-of-N strategies such as confidence-based selection. Our method outperforms a range of baselines with similar latency using Thinking, and is comparable to Thinking with significantly longer latency (up to 9x). Together, our research encourages a reconsideration of the necessity of lengthy thinking processes, while also establishing a competitive reference for achieving strong reasoning performance in low-budget settings or at low latency using parallel scaling.
zh

[NLP-57] A Survey of Large Language Model-Powered Spatial Intelligence Across Scales: Advances in Embodied Agents Smart Cities and Earth Science

【速读】：该论文试图解决的问题是如何理解并整合跨学科的“空间智能”（Spatial Intelligence），特别是在大型语言模型（Large Language Models, LLMs）背景下，探索其与人类空间认知、记忆、知识表示以及抽象推理之间的差异与联系，并分析其在不同尺度上的表现，从具身智能到城市乃至全球层面。论文的关键在于提出一个框架，从空间记忆与理解出发，逐步深入到空间推理与智能，以揭示这些领域间的共性与差异，从而为跨学科的空间智能研究提供洞见并启发未来工作。

链接: https://arxiv.org/abs/2504.09848
作者: Jie Feng,Jinwei Zeng,Qingyue Long,Hongyi Chen,Jie Zhao,Yanxin Xi,Zhilun Zhou,Yuan Yuan,Shengyuan Wang,Qingbin Zeng,Songwei Li,Yunke Zhang,Yuming Lin,Tong Li,Jingtao Ding,Chen Gao,Fengli Xu,Yong Li
机构: Department of Electronic Engineering, BNRist, Tsinghua University (清华大学), Beijing, China;
Shenzhen International Graduate School, Tsinghua University (清华大学), Shenzhen, China;
Department of Computer Science, Tsinghua University (清华大学), Beijing, China;
Department of Computer Science, University of Helsinki (赫尔辛基大学), Helsinki, Finland
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Over the past year, the development of large language models (LLMs) has brought spatial intelligence into focus, with much attention on vision-based embodied intelligence. However, spatial intelligence spans a broader range of disciplines and scales, from navigation and urban planning to remote sensing and earth science. What are the differences and connections between spatial intelligence across these fields? In this paper, we first review human spatial cognition and its implications for spatial intelligence in LLMs. We then examine spatial memory, knowledge representations, and abstract reasoning in LLMs, highlighting their roles and connections. Finally, we analyze spatial intelligence across scales – from embodied to urban and global levels – following a framework that progresses from spatial memory and understanding to spatial reasoning and intelligence. Through this survey, we aim to provide insights into interdisciplinary spatial intelligence research and inspire future studies.
zh

[NLP-58] Abacus-SQL: A Text-to-SQL System Empowering Cross-Domain and Open-Domain Database Retrieval

【速读】：该论文旨在解决现有文本转SQL（Text-to-SQL）系统在开放域数据库检索能力不足以及跨领域适应性有限的问题。现有的Text-to-SQL系统通常无法自动检索开放域中的相关数据库，需要用户手动筛选，同时其跨领域迁移能力较弱，难以满足多样化的查询需求。为了解决这些问题，论文提出的关键方案包括：利用数据库检索技术实现开放域环境中目标数据库的精准定位；通过数据增强方法提升系统的跨领域迁移能力；采用Pre-SQL和Self-debug方法进一步提高SQL查询的准确性。实验结果验证了Abacus-SQL在多轮文本转SQL任务中的优异表现，有效证明了所提方法的有效性。

链接: https://arxiv.org/abs/2504.09824
作者: Keyan Xu,Dingzirui Wang,Xuanliang Zhang,Qingfu Zhu,Wanxiang Che
机构: Harbin Institute of Technology (哈尔滨工业大学)
类目: Computation and Language (cs.CL)
备注: 11 pages, 3figures

点击查看摘要

Abstract:The existing text-to-SQL systems have made significant progress in SQL query generation, but they still face numerous challenges. Existing systems often lack retrieval capabilities for open-domain databases, requiring users to manually filter relevant databases. Additionally, their cross-domain transferability is limited, making it challenging to accommodate diverse query requirements. To address these issues, we propose Abacus-SQL. Abacus-SQL utilizes database retrieval technology to accurately locate the required databases in an open-domain database environment. It also enhances the system cross-domain transfer ability through data augmentation methods. Moreover, Abacus-SQL employs Pre-SQL and Self-debug methods, thereby enhancing the accuracy of SQL queries. Experimental results demonstrate that Abacus-SQL performs excellently in multi-turn text-to-SQL tasks, effectively validating the approach’s effectiveness. Abacus-SQL is publicly accessible at this https URL.
zh

[NLP-59] ransferable text data distillation by trajectory matching

【速读】：该论文旨在解决大型语言模型（LLM）训练中数据规模庞大导致的高成本问题，通过数据蒸馏方法在保证训练效果的同时减少所需数据量。论文的关键在于提出了一种基于轨迹匹配学习伪提示数据并寻找其最近邻ID以实现跨架构迁移的方法，并在蒸馏过程中引入正则化损失以增强蒸馏数据的鲁棒性。这是首个适用于指令微调等文本生成任务的数据蒸馏工作，其创新点显著提升了蒸馏方法在不同LLM结构间的迁移能力。

链接: https://arxiv.org/abs/2504.09818
作者: Rong Yao,Hailin Hu,Yifei Fu,Hanting Chen,Wenyi Fang,Fanyi Du,Kai Han,Yunhe Wang
机构: Huawei Noah’s Ark Lab (华为诺亚方舟实验室)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:In the realm of large language model (LLM), as the size of large models increases, it also brings higher training costs. There is a urgent need to minimize the data size in LLM training. Compared with data selection method, the data distillation method aims to synthesize a small number of data samples to achieve the training effect of the full data set and has better flexibility. Despite its successes in computer vision, the discreteness of text data has hitherto stymied its exploration in natural language processing (NLP). In this work, we proposed a method that involves learning pseudo prompt data based on trajectory matching and finding its nearest neighbor ID to achieve cross-architecture transfer. During the distillation process, we introduce a regularization loss to improve the robustness of our distilled data. To our best knowledge, this is the first data distillation work suitable for text generation tasks such as instruction tuning. Evaluations on two benchmarks, including ARC-Easy and MMLU instruction tuning datasets, established the superiority of our distillation approach over the SOTA data selection method LESS. Furthermore, our method demonstrates a good transferability over LLM structures (i.e., OPT to Llama).
zh

[NLP-60] Augmented Relevance Datasets with Fine-Tuned Small LLM s WSDM’25

【速读】：该论文试图解决高质量数据集构建与查询文档相关性标注资源消耗大的问题。解决方案的关键在于利用经过微调的小型大型语言模型（Small LLMs）来自动化相关性评估，并通过增强排名模型的训练数据集来提升其性能。论文通过实验验证了这些微调后的小型LLMs不仅在特定数据集上优于某些闭源模型，还显著提升了排名模型的表现，从而展示了利用小型LLMs进行高效可扩展的数据集扩增的潜力，为搜索引擎优化提供了实用方案。

链接: https://arxiv.org/abs/2504.09816
作者: Quentin Fitte-Rey,Matyas Amrouche,Romain Deveaud
机构: Qwant(魁特); Georgia Tech (乔治亚理工学院); UTCParisFrance(UTC巴黎)
类目: Information Retrieval (cs.IR); Computation and Language (cs.CL)
备注: 10 pages, 3 figures, and 6 tables. Accepted and presented to LLM4EVAL at WSDM '25

点击查看摘要

Abstract:Building high-quality datasets and labeling query-document relevance are essential yet resource-intensive tasks, requiring detailed guidelines and substantial effort from human annotators. This paper explores the use of small, fine-tuned large language models (LLMs) to automate relevance assessment, with a focus on improving ranking models’ performance by augmenting their training dataset. We fine-tuned small LLMs to enhance relevance assessments, thereby improving dataset creation quality for downstream ranking model training. Our experiments demonstrate that these fine-tuned small LLMs not only outperform certain closed source models on our dataset but also lead to substantial improvements in ranking model performance. These results highlight the potential of leveraging small LLMs for efficient and scalable dataset augmentation, providing a practical solution for search engine optimization.
zh

[NLP-61] raining Small Reasoning LLM s with Cognitive Preference Alignment

【速读】：该论文旨在解决如何以更少的参数训练出高效推理能力的大语言模型（LLMs）的问题。当前大型语言模型在深度推理方面取得了显著进展，但其高昂的资源需求限制了其广泛应用，而直接将大型模型的链式思考（CoT）结果蒸馏到小型模型中往往效果不佳且需要大量标注数据。论文的关键解决方案是提出了一种名为Critique-Rethink-Verify（CRV）的新框架，该框架通过多个专精于不同任务的LLM代理协同工作，包括对小型模型认知能力下的CoT进行批判性分析、基于批评重新思考与优化这些CoT，以及验证优化后的结果正确性。此外，还引入了认知偏好优化（CogPO）算法，通过使小型模型的思想与其认知能力相匹配来进一步提升推理能力。综合评估表明，CRV和CogPO方法在挑战性的推理基准测试中表现出色，大幅超越其他训练方法。

链接: https://arxiv.org/abs/2504.09802
作者: Wenrui Cai,Chengyu Wang,Junbing Yan,Jun Huang,Xiangzhong Fang
机构: Shanghai Jiao Tong University (上海交通大学); Alibaba Cloud Computing (阿里云)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:The reasoning capabilities of large language models (LLMs), such as OpenAI’s o1 and DeepSeek-R1, have seen substantial advancements through deep thinking. However, these enhancements come with significant resource demands, underscoring the need to explore strategies to train effective reasoning LLMs with far fewer parameters. A critical challenge is that smaller models have different capacities and cognitive trajectories than their larger counterparts. Hence, direct distillation of chain-of-thought (CoT) results from large LLMs to smaller ones can be sometimes ineffective and requires a huge amount of annotated data. In this paper, we introduce a novel framework called Critique-Rethink-Verify (CRV), designed for training smaller yet powerful reasoning LLMs. Our CRV framework consists of multiple LLM agents, each specializing in unique abilities: (i) critiquing the CoTs according to the cognitive capabilities of smaller models, (ii) rethinking and refining these CoTs based on the critiques, and (iii) verifying the correctness of the refined results. We further propose the cognitive preference optimization (CogPO) algorithm to enhance the reasoning abilities of smaller models by aligning thoughts of these models with their cognitive capacities. Comprehensive evaluations on challenging reasoning benchmarks demonstrate the efficacy of CRV and CogPO, which outperforms other training methods by a large margin.
zh

[NLP-62] VDocRAG : Retrieval-Augmented Generation over Visually-Rich Documents CVPR2025

【速读】：本文旨在解决在包含多样化模态（如图表、表格等）和多种格式（如PDF、PPTX等）的视觉丰富文档集合上构建检索增强生成（Retrieval-Augmented Generation, RAG）框架以回答问题的问题。为了解决这一挑战，论文提出了关键方案：开发了一个新的RAG框架——VDocRAG，它能够将不同类型的文档和模态统一转换为图像格式进行处理，从而避免因解析文档获取文本而导致的信息丢失。此外，为了提升性能，该框架引入了新颖的自监督预训练任务，通过压缩视觉信息到密集的标记表示，并将其与文档中的文本内容对齐，以适配大规模视觉-语言模型用于检索。同时，还推出了OpenDocVQA数据集，这是一个涵盖多种文档类型和格式的首个开放领域文档视觉问答数据集集合。实验结果表明，VDocRAG相比传统的基于文本的RAG表现更优，并具备强大的泛化能力，展示了有效RAG范式在真实世界文档应用中的潜力。

链接: https://arxiv.org/abs/2504.09795
作者: Ryota Tanaka,Taichi Iki,Taku Hasegawa,Kyosuke Nishida,Kuniko Saito,Jun Suzuki
机构: NTT Human Informatics Laboratories, NTT Corporation (NTT 人类信息学实验室, NTT公司); Tohoku University (东北大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Information Retrieval (cs.IR)
备注: Accepted by CVPR 2025; project page: this https URL

点击查看摘要

Abstract:We aim to develop a retrieval-augmented generation (RAG) framework that answers questions over a corpus of visually-rich documents presented in mixed modalities (e.g., charts, tables) and diverse formats (e.g., PDF, PPTX). In this paper, we introduce a new RAG framework, VDocRAG, which can directly understand varied documents and modalities in a unified image format to prevent missing information that occurs by parsing documents to obtain text. To improve the performance, we propose novel self-supervised pre-training tasks that adapt large vision-language models for retrieval by compressing visual information into dense token representations while aligning them with textual content in documents. Furthermore, we introduce OpenDocVQA, the first unified collection of open-domain document visual question answering datasets, encompassing diverse document types and formats. OpenDocVQA provides a comprehensive resource for training and evaluating retrieval and question answering models on visually-rich documents in an open-domain setting. Experiments show that VDocRAG substantially outperforms conventional text-based RAG and has strong generalization capability, highlighting the potential of an effective RAG paradigm for real-world documents.
zh

[NLP-63] Reasoning Court: Combining Reasoning Reasoning Action and Judgment for Multi-Hop Reasoning

【速读】：该论文旨在解决大型语言模型（Large Language Models, LLMs）在多跳任务中因幻觉（hallucinations）和推理错误导致的性能瓶颈问题，尤其是在需要整合多源信息的任务中。现有方法主要通过基于检索的技术（如将推理与外部证据关联）或基于推理的方法（如改进提示以增强连贯性），以及两者的混合策略来缓解这些问题。尽管混合方法如ReAct已展现出超越单一检索或推理方法的表现，但其仍缺乏对中间推理步骤的内部验证机制，可能导致错误在复杂推理任务中传播。

论文的关键解决方案是提出Reasoning Court (RC)，这是一种新颖的框架，它扩展了迭代式推理与检索方法（如ReAct），引入了一个专用的LLM法官。与ReAct不同，RC中的法官独立评估由多个LLM代理生成的不同候选答案及其对应的推理过程，并根据提供的推理和证据选择最符合事实依据且逻辑连贯的答案。如果所有候选答案均不足、有缺陷或无效，则法官会利用可用证据及预训练知识合成一个新的答案。这种设计显著提升了推理过程的可靠性与准确性。评估结果显示，RC在多跳基准测试（如HotpotQA、MuSiQue）和事实验证任务（如FEVER）中始终优于最先进的少样本提示方法，且无需针对具体任务进行微调。

链接: https://arxiv.org/abs/2504.09781
作者: Jingtian Wu,Claire Cardie
机构: Cornell University (康奈尔大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:While large language models (LLMs) have demonstrated strong capabilities in tasks like question answering and fact verification, they continue to suffer from hallucinations and reasoning errors, especially in multi-hop tasks that require integration of multiple information sources. Current methods address these issues through retrieval-based techniques (grounding reasoning in external evidence), reasoning-based approaches (enhancing coherence via improved prompting), or hybrid strategies combining both elements. One prominent hybrid method, ReAct, has outperformed purely retrieval-based or reasoning-based approaches; however, it lacks internal verification of intermediate reasoning steps, allowing potential errors to propagate through complex reasoning tasks. In this paper, we introduce Reasoning Court (RC), a novel framework that extends iterative reasoning-and-retrieval methods, such as ReAct, with a dedicated LLM judge. Unlike ReAct, RC employs this judge to independently evaluate multiple candidate answers and their associated reasoning generated by separate LLM agents. The judge is asked to select the answer that it considers the most factually grounded and logically coherent based on the presented reasoning and evidence, or synthesizes a new answer using available evidence and its pre-trained knowledge if all candidates are inadequate, flawed, or invalid. Evaluations on multi-hop benchmarks (HotpotQA, MuSiQue) and fact-verification (FEVER) demonstrate that RC consistently outperforms state-of-the-art few-shot prompting methods without task-specific fine-tuning.
zh

[NLP-64] Executable Functional Abstractions: Inferring Generative Programs for Advanced Math Problems

【速读】：本文旨在解决自动构建高级数学问题的可执行功能抽象（Executable Functional Abstraction, EFA）的问题。以往的研究仅限于小学数学的抽象化，而高级数学的EFA生成仍需人工工程介入。论文的关键解决方案是将EFA的自动生成任务形式化为程序综合问题，并开发了EFAGen系统。EFAGen通过条件化大型语言模型（LLM）在种子数学问题及其逐步解法上来生成忠实于种子问题潜在通用问题与解法类别的候选EFA程序。此外，论文通过可执行单元测试形式化了任何有效EFA应具备的属性，并展示了如何利用这些测试作为可验证的奖励来训练LLMs以更好地生成EFAs。最终，论文证明了由EFAGen构造的EFA能够保持对种子问题的忠实性、生成可学习的问题变体，并能从多个来源推断出有效的EFA，同时展示了模型生成的EFAs在下游应用中的潜力，如生成难易程度不同的问题变体以及数据生成。

链接: https://arxiv.org/abs/2504.09763
作者: Zaid Khan,Elias Stengel-Eskin,Archiki Prasad,Jaemin Cho,Mohit Bansal
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Project Page: this https URL

点击查看摘要

Abstract:Scientists often infer abstract procedures from specific instances of problems and use the abstractions to generate new, related instances. For example, programs encoding the formal rules and properties of a system have been useful in fields ranging from RL (procedural environments) to physics (simulation engines). These programs can be seen as functions which execute to different outputs based on their parameterizations (e.g., gridworld configuration or initial physical conditions). We introduce the term EFA (Executable Functional Abstraction) to denote such programs for math problems. EFA-like constructs have been shown to be useful for math reasoning as problem generators for stress-testing models. However, prior work has been limited to abstractions for grade-school math (whose simple rules are easy to encode in programs), while generating EFAs for advanced math has thus far required human engineering. We explore the automatic construction of EFAs for advanced math problems. We operationalize the task of automatically constructing EFAs as a program synthesis task, and develop EFAGen, which conditions an LLM on a seed math problem and its step-by-step solution to generate candidate EFA programs that are faithful to the generalized problem and solution class underlying the seed problem. Furthermore, we formalize properties any valid EFA must possess in terms of executable unit tests, and show how the tests can be used as verifiable rewards to train LLMs to become better writers of EFAs. We demonstrate that EFAs constructed by EFAGen behave rationally by remaining faithful to seed problems, produce learnable problem variations, and that EFAGen can infer EFAs across multiple diverse sources of competition-level math problems. Finally, we show downstream uses of model-written EFAs e.g. finding problem variations that are harder or easier for a learner to solve, as well as data generation.
zh

[NLP-65] Improving Multilingual Capabilities with Cultural and Local Knowledge in Large Language Models While Enhancing Native Performance

【速读】：该论文试图解决大型语言模型（Large Language Models, LLMs）在低资源语言上的性能不足问题，特别是针对印地语等语言的服务缺失。论文的关键解决方案在于通过精心设计的双语指令微调方法，在不增加模型规模的前提下显著提升跨语言（印地语-英语）的性能表现。具体而言，研究团队利用包含485K样本的高质量英印双语文本数据集，对多个参数量不同的LLMs进行微调，证明了即使在不同英语-印地语训练数据比例下，也能实现多语言性能的显著改进，同时保持单语言性能不受影响。这种方法避免了资源密集型操作如词汇扩展或架构修改，而是依赖于文化相关且本地化的适度微调数据，从而以较低的计算开销弥合性能差距。

链接: https://arxiv.org/abs/2504.09753
作者: Ram Mohan Rao Kadiyala,Siddartha Pullakhandam,Siddhant Gupta,Drishti Sharma,Jebish Purbey,Kanwal Mehreen,Muhammad Arham,Hamza Farooq
机构: Traversaal.ai; Vantager; IIT Roorkee; Cohere for AI Community; M2ai.in; Pulchowk Campus; Stanford University; University of California, Los Angeles
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: ARR Feb 2025 submission

点击查看摘要

Abstract:Large Language Models (LLMs) have shown remarkable capabilities, but their development has primarily focused on English and other high-resource languages, leaving many languages underserved. We present our latest Hindi-English bi-lingual LLM \textbfMantra-14B with ~3% average improvement in benchmark scores over both languages, outperforming models twice its size. Using a curated dataset composed of English and Hindi instruction data of 485K samples, we instruction tuned models such as Qwen-2.5-14B-Instruct and Phi-4 to improve performance over both English and Hindi. Our experiments encompassing seven different LLMs of varying parameter sizes and over 140 training attempts with varying English-Hindi training data ratios demonstrated that it is possible to significantly improve multilingual performance without compromising native performance. Further, our approach avoids resource-intensive techniques like vocabulary expansion or architectural modifications, thus keeping the model size small. Our results indicate that modest fine-tuning with culturally and locally informed data can bridge performance gaps without incurring significant computational overhead. We release our training code, datasets, and models under mit and apache licenses to aid further research towards under-represented and low-resource languages.
zh

[NLP-66] Can LLM feedback enhance review quality? A randomized study of 20K reviews at ICLR 2025

【速读】：该论文试图解决人工智能（AI）会议同行评审因提交量激增而导致的评审质量下降以及作者不满增加的问题。解决方案的关键在于开发了Review Feedback Agent系统，该系统利用多个大型语言模型（LLMs）通过提供自动化反馈来提升评审的清晰度和可操作性，这些反馈针对模糊评论、内容误解和不专业的表述。为了确保大规模应用时反馈的质量，还开发了一套由LLMs驱动的自动化可靠性测试工具，作为保障反馈质量的机制，只有通过所有测试的反馈才会发送给评审人。

链接: https://arxiv.org/abs/2504.09737
作者: Nitya Thakkar,Mert Yuksekgonul,Jake Silberg,Animesh Garg,Nanyun Peng,Fei Sha,Rose Yu,Carl Vondrick,James Zou
机构: 未知
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
备注: 30 pages, 7 figures

点击查看摘要

Abstract:Peer review at AI conferences is stressed by rapidly rising submission volumes, leading to deteriorating review quality and increased author dissatisfaction. To address these issues, we developed Review Feedback Agent, a system leveraging multiple large language models (LLMs) to improve review clarity and actionability by providing automated feedback on vague comments, content misunderstandings, and unprofessional remarks to reviewers. Implemented at ICLR 2025 as a large randomized control study, our system provided optional feedback to more than 20,000 randomly selected reviews. To ensure high-quality feedback for reviewers at this scale, we also developed a suite of automated reliability tests powered by LLMs that acted as guardrails to ensure feedback quality, with feedback only being sent to reviewers if it passed all the tests. The results show that 27% of reviewers who received feedback updated their reviews, and over 12,000 feedback suggestions from the agent were incorporated by those reviewers. This suggests that many reviewers found the AI-generated feedback sufficiently helpful to merit updating their reviews. Incorporating AI feedback led to significantly longer reviews (an average increase of 80 words among those who updated after receiving feedback) and more informative reviews, as evaluated by blinded researchers. Moreover, reviewers who were selected to receive AI feedback were also more engaged during paper rebuttals, as seen in longer author-reviewer discussions. This work demonstrates that carefully designed LLM-generated review feedback can enhance peer review quality by making reviews more specific and actionable while increasing engagement between reviewers and authors. The Review Feedback Agent is publicly available at this https URL.
zh

[NLP-67] Agent A/B: Automated and Scalable Web A/BTesting with Interactive LLM Agents

【速读】：该论文旨在解决传统 A/B 测试（Traditional A/B Testing）在现代 Web 应用中面临的两个主要瓶颈：一是依赖大规模真实用户流量，导致实验周期长且成本高；二是等待测试结果的时间过长。为应对这些挑战，论文提出了一种名为 AgentA/B 的创新系统，其核心解决方案是利用基于大语言模型（Large Language Model, LLM）的自主代理（LLM Agents）来模拟真实的用户交互行为。AgentA/B 可以部署多样化的代理角色，并使这些代理能够动态导航网页，执行多步骤交互任务（如搜索、点击、筛选和购买等），从而实现对用户行为的高效仿真。通过控制实验验证，AgentA/B 能够以接近人类的行为模式模拟大规模 A/B 测试，为 UI/UX 设计决策提供快速且可扩展的评估手段。

链接: https://arxiv.org/abs/2504.09723
作者: Dakuo Wang,Ting-Yao Hsu,Yuxuan Lu,Limeng Cui,Yaochen Xie,William Headean,Bingsheng Yao,Akash Veeragouni,Jiapeng Liu,Sreyashi Nag,Jessie Wang
机构: Northeastern University (东北大学); Pennsylvania State University (宾夕法尼亚州立大学); Amazon
类目: Human-Computer Interaction (cs.HC); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:A/B testing experiment is a widely adopted method for evaluating UI/UX design decisions in modern web applications. Yet, traditional A/B testing remains constrained by its dependence on the large-scale and live traffic of human participants, and the long time of waiting for the testing result. Through formative interviews with six experienced industry practitioners, we identified critical bottlenecks in current A/B testing workflows. In response, we present AgentA/B, a novel system that leverages Large Language Model-based autonomous agents (LLM Agents) to automatically simulate user interaction behaviors with real webpages. AgentA/B enables scalable deployment of LLM agents with diverse personas, each capable of navigating the dynamic webpage and interactively executing multi-step interactions like search, clicking, filtering, and purchasing. In a demonstrative controlled experiment, we employ AgentA/B to simulate a between-subject A/B testing with 1,000 LLM agents this http URL, and compare agent behaviors with real human shopping behaviors at a scale. Our findings suggest AgentA/B can emulate human-like behavior patterns.
zh

[NLP-68] Evaluating the Quality of Benchmark Datasets for Low-Resource Languages: A Case Study on Turkish

【速读】：该论文旨在解决低资源语言（如土耳其语）基准数据集在创建和适应过程中缺乏严格质量控制的问题，特别是由于过度依赖翻译或改编自英语或多语言资源所引入的语言和文化适用性挑战。论文的关键解决方案是提出一个综合评估框架，从六个维度全面评估17个常用土耳其语基准数据集的质量，并结合人类标注者与大型语言模型（LLM）评估者的反馈，识别数据集的优势与不足。研究发现，70%的数据集未能达到启发式质量标准，且大部分评估指标未被满足。尽管LLM展现出一定潜力，但在理解文化常识和解析流畅无歧义文本方面仍逊色于人类标注者。论文强调，在构建和适配低资源语言数据集时亟需更严格的质控措施。

链接: https://arxiv.org/abs/2504.09714
作者: Ayşe Aysu Cengiz,Ahmet Kaan Sever,Elif Ecem Ümütlü,Naime Şeyma Erdem,Burak Aytan,Büşra Tufan,Abdullah Topraksoy,Esra Darıcı,Cagri Toraman
机构: Middle East Technical University (中东技术大学), Computer Engineering Department; Bilkent University (比尔肯大学), Computer Engineering Department; Turkcell AI; Hacettepe University (哈切特佩大学), Sociology Department; Istanbul University (伊斯坦布尔大学), Linguistics Department; Middle East Technical University (中东技术大学), Turkish Language Department
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:The reliance on translated or adapted datasets from English or multilingual resources introduces challenges regarding linguistic and cultural suitability. This study addresses the need for robust and culturally appropriate benchmarks by evaluating the quality of 17 commonly used Turkish benchmark datasets. Using a comprehensive framework that assesses six criteria, both human and LLM-judge annotators provide detailed evaluations to identify dataset strengths and shortcomings. Our results reveal that 70% of the benchmark datasets fail to meet our heuristic quality standards. The correctness of the usage of technical terms is the strongest criterion, but 85% of the criteria are not satisfied in the examined datasets. Although LLM judges demonstrate potential, they are less effective than human annotators, particularly in understanding cultural common sense knowledge and interpreting fluent, unambiguous text. GPT-4o has stronger labeling capabilities for grammatical and technical tasks, while Llama3.3-70B excels at correctness and cultural knowledge evaluation. Our findings emphasize the urgent need for more rigorous quality control in creating and adapting datasets for low-resource languages. Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI) Cite as: arXiv:2504.09714 [cs.CL] (or arXiv:2504.09714v1 [cs.CL] for this version) https://doi.org/10.48550/arXiv.2504.09714 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
zh

[NLP-69] DUMP: Automated Distribution-Level Curriculum Learning for RL-based LLM Post-training

【速读】：该论文旨在解决现有基于强化学习（Reinforcement Learning, RL）的大语言模型（Large Language Models, LLMs）后训练方法忽视训练数据分布异质性的问题，即如何根据数据分布的特性自适应地调整训练顺序以优化学习效率。论文的关键创新在于提出了一种基于分布级可学习性概念的原则性课程学习框架。其核心思想是利用策略优势的大小来衡量模型在特定分布上进一步训练所能获得的收益，并通过Upper Confidence Bound (UCB) 原理动态调整不同分布的采样概率。此方法优先选择具有高平均优势（ exploitation ）或低采样次数（ exploration ）的分布进行训练，从而形成一种自适应且理论基础扎实的训练调度方案。

链接: https://arxiv.org/abs/2504.09710
作者: Zhenting Wang,Guofeng Cui,Kun Wan,Wentian Zhao
机构: Rutgers University (罗格斯大学); Adobe Inc. (奥多比公司)
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Recent advances in reinforcement learning (RL)-based post-training have led to notable improvements in large language models (LLMs), particularly in enhancing their reasoning capabilities to handle complex tasks. However, most existing methods treat the training data as a unified whole, overlooking the fact that modern LLM training often involves a mixture of data from diverse distributions-varying in both source and difficulty. This heterogeneity introduces a key challenge: how to adaptively schedule training across distributions to optimize learning efficiency. In this paper, we present a principled curriculum learning framework grounded in the notion of distribution-level learnability. Our core insight is that the magnitude of policy advantages reflects how much a model can still benefit from further training on a given distribution. Based on this, we propose a distribution-level curriculum learning framework for RL-based LLM post-training, which leverages the Upper Confidence Bound (UCB) principle to dynamically adjust sampling probabilities for different distrubutions. This approach prioritizes distributions with either high average advantage (exploitation) or low sample count (exploration), yielding an adaptive and theoretically grounded training schedule. We instantiate our curriculum learning framework with GRPO as the underlying RL algorithm and demonstrate its effectiveness on logic reasoning datasets with multiple difficulties and sources. Our experiments show that our framework significantly improves convergence speed and final performance, highlighting the value of distribution-aware curriculum strategies in LLM post-training. Code: this https URL.
zh

[NLP-70] GRPO-LEAD: A Difficulty-Aware Reinforcement Learning Approach for Concise Mathematical Reasoning in Language Models

【速读】：该论文旨在解决现有基于Group Relative Policy Optimization (GRPO)的数学推理模型在处理数学推理任务时遇到的关键挑战，包括因二元准确性度量导致的奖励稀疏性、对简洁性的激励不足以及对复杂推理任务关注不够的问题。论文提出的解决方案核心在于GRPO-LEAD，这是一种针对数学推理优化的增强套件，其关键创新点包括：(1) 引入与长度相关的准确性奖励以鼓励更简洁且精确的解答；(2) 设计明确的错误答案惩罚机制以强化决策边界；(3) 提出难度感知的优势重加权策略以放大对难题的学习信号。此外，通过系统研究模型规模和监督微调(Supervised Fine-Tuning, SFT)策略的影响，论文进一步验证了更大规模的基础模型和精心构建的数据集能够显著提升强化学习的效果。大量实证评估和消融研究表明，GRPO-LEAD有效缓解了先前方法的不足，使语言模型在多种数学任务中展现出更简洁、准确且鲁棒的推理能力。

链接: https://arxiv.org/abs/2504.09696
作者: Jixiao Zhang,Chunsheng Zuo
机构: Johns Hopkins University (约翰斯·霍普金斯大学); Johns Hopkins University (约翰斯·霍普金斯大学)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Recent advances in R1-like reasoning models leveraging Group Relative Policy Optimization (GRPO) have significantly improved the performance of language models on mathematical reasoning tasks. However, current GRPO implementations encounter critical challenges, including reward sparsity due to binary accuracy metrics, limited incentives for conciseness, and insufficient focus on complex reasoning tasks. To address these issues, we propose GRPO-LEAD, a suite of novel enhancements tailored for mathematical reasoning. Specifically, GRPO-LEAD introduces (1) a length-dependent accuracy reward to encourage concise and precise solutions, (2) an explicit penalty mechanism for incorrect answers to sharpen decision boundaries, and (3) a difficulty-aware advantage reweighting strategy that amplifies learning signals for challenging problems. Furthermore, we systematically examine the impact of model scale and supervised fine-tuning (SFT) strategies, demonstrating that larger-scale base models and carefully curated datasets significantly enhance reinforcement learning effectiveness. Extensive empirical evaluations and ablation studies confirm that GRPO-LEAD substantially mitigates previous shortcomings, resulting in language models that produce more concise, accurate, and robust reasoning across diverse mathematical tasks.
zh

[NLP-71] EmoAgent : Assessing and Safeguarding Human-AI Interaction for Mental Health Safety

【速读】：本文旨在解决由大型语言模型（LLM）驱动的人工智能角色引发的安全隐患，特别是对存在心理障碍的脆弱人类用户可能造成的心理健康风险。为应对这些挑战，论文提出了EmoAgent，这是一个多智能体人工智能框架，用于评估和减轻人机交互中的心理健康危害。EmoAgent的关键在于其两个组成部分：EmoEval通过模拟虚拟用户（包括表现心理脆弱个体的用户）来评估与AI角色互动前后心理健康的变化，并利用经过临床验证的心理和精神健康评估工具（如PHQ-9、PDI、PANSS）量化LLM诱发的心理风险；而EmoGuard作为中介，负责监控用户的心理状态、预测潜在危害并提供纠正性反馈以降低风险。实验结果表明，在基于角色的聊天机器人中，情感丰富的对话可能导致超过34.4%的脆弱用户心理状态恶化，而EmoGuard显著降低了这一比例，凸显了其在保障更安全的人机交互方面的重要性。

链接: https://arxiv.org/abs/2504.09689
作者: Jiahao Qiu,Yinghui He,Xinzhe Juan,Yiming Wang,Yuhan Liu,Zixin Yao,Yue Wu,Xun Jiang,Ling Yang,Mengdi Wang
机构: Department of Electrical & Computer Engineering, Princeton University (电气与计算机工程系，普林斯顿大学); Department of Computer Science, Princeton University (计算机科学系，普林斯顿大学); Department of Computer Science & Engineering, University of Michigan (计算机科学与工程系，密歇根大学); Department of Data Science & Engineering, University of Michigan (数据科学与工程系，密歇根大学); Department of Philosophy, Columbia University (哲学系，哥伦比亚大学); AI Lab, Princeton University (普林斯顿大学人工智能实验室); Chen Frontier Lab for Al and Mental Health, Tianqiao and Chrissy Chen Institute (陈天桥和雒芊芊前沿人工智能暨精神健康实验室); Theta Health Inc. (Theta Health 公司); Department of Electrical & Computer Engineering, Princeton University (电气与计算机工程系，普林斯顿大学); Department of Electrical & Computer Engineering, Princeton University (电气与计算机工程系，普林斯顿大学)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computers and Society (cs.CY); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
备注: 18 pages, 8 figures

点击查看摘要

Abstract:The rise of LLM-driven AI characters raises safety concerns, particularly for vulnerable human users with psychological disorders. To address these risks, we propose EmoAgent, a multi-agent AI framework designed to evaluate and mitigate mental health hazards in human-AI interactions. EmoAgent comprises two components: EmoEval simulates virtual users, including those portraying mentally vulnerable individuals, to assess mental health changes before and after interactions with AI characters. It uses clinically proven psychological and psychiatric assessment tools (PHQ-9, PDI, PANSS) to evaluate mental risks induced by LLM. EmoGuard serves as an intermediary, monitoring users’ mental status, predicting potential harm, and providing corrective feedback to mitigate risks. Experiments conducted in popular character-based chatbots show that emotionally engaging dialogues can lead to psychological deterioration in vulnerable users, with mental state deterioration in more than 34.4% of the simulations. EmoGuard significantly reduces these deterioration rates, underscoring its role in ensuring safer AI-human interactions. Our code is available at: this https URL
zh

[NLP-72] Domain-Adaptive Continued Pre-Training of Small Language Models

【速读】：该论文试图解决在资源受限条件下实现语言模型领域适应的问题。论文的关键解决方案在于通过持续增量预训练小规模语言模型，并结合全面的数据预处理、内存优化的训练配置以及基于基准的评估方法，在仅使用12亿tokens数据的情况下显著提升了知识密集型任务（MMLU +8.1%）和情境理解任务（HellaSwag +7.6%）的表现，同时分析了领域专有化带来的权衡。关键在于精心设计的数据预处理方法和训练策略，使得即使在计算资源有限的条件下，也能有效提升语言模型的能力并实现特定领域的适配。

链接: https://arxiv.org/abs/2504.09687
作者: Salman Faroz
机构: 未知
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Continued pre-training of small language models offers a promising path for domain adaptation with limited computational resources. I’ve investigated this approach within educational domains, evaluating it as a resource-efficient alternative to training models from scratch. Using a 125M parameter model, I demonstrate significant performance improvements through incremental training on 400 million tokens, followed by further training to reach 1 billion tokens. My approach includes comprehensive data preprocessing, memory-optimized training configurations, and benchmark-based evaluation. Results show notable gains in knowledge-intensive tasks (MMLU +8.1%) and contextual understanding (HellaSwag +7.6%), while revealing educational domain specialization trade-offs. I analyze token efficiency, catastrophic forgetting mitigation strategies, and scaling patterns. My findings suggest that thoughtful preprocessing and training methodologies enable meaningful improvements in language model capabilities even with constrained computational resources, opening pathways for domain-specific adaptation of smaller language models.
zh

[NLP-73] CLEAR-KGQA: Clarification-Enhanced Ambiguity Resolution for Knowledge Graph Question Answering IJCNN2025

【速读】：本文旨在解决知识图谱问答（Knowledge Graph Question Answering, KGQA）中的歧义问题。尽管近期的KGQA系统通过引入大型语言模型（Large Language Models, LLMs）取得了显著进展，但它们通常假设用户查询无歧义，而这一假设在实际应用中很少成立。为克服这些局限性，论文提出了一种新颖的框架，通过交互澄清动态处理实体歧义（如区分名称相似的实体）和意图歧义（如明确用户查询的不同解读）。关键在于采用贝叶斯推理机制量化查询歧义，并引导LLMs判断何时以及如何向用户请求澄清，同时在一个多轮对话框架内实现这一过程。此外，开发了一个双代理交互框架，利用基于LLM的用户模拟器通过模拟用户反馈迭代优化逻辑形式。实验结果表明，该方法显著提升了性能，有效解决了语义歧义问题，并贡献了一个由交互历史衍生出的消歧查询精炼数据集，以促进后续研究。

链接: https://arxiv.org/abs/2504.09665
作者: Liqiang Wen,Guanming Xiong,Tong Mo,Bing Li,Weiping Li,Wen Zhao
机构: 未知
类目: Computation and Language (cs.CL)
备注: This work has been accepted by the IJCNN 2025 main track

点击查看摘要

Abstract:This study addresses the challenge of ambiguity in knowledge graph question answering (KGQA). While recent KGQA systems have made significant progress, particularly with the integration of large language models (LLMs), they typically assume user queries are unambiguous, which is an assumption that rarely holds in real-world applications. To address these limitations, we propose a novel framework that dynamically handles both entity ambiguity (e.g., distinguishing between entities with similar names) and intent ambiguity (e.g., clarifying different interpretations of user queries) through interactive clarification. Our approach employs a Bayesian inference mechanism to quantify query ambiguity and guide LLMs in determining when and how to request clarification from users within a multi-turn dialogue framework. We further develop a two-agent interaction framework where an LLM-based user simulator enables iterative refinement of logical forms through simulated user feedback. Experimental results on the WebQSP and CWQ dataset demonstrate that our method significantly improves performance by effectively resolving semantic ambiguities. Additionally, we contribute a refined dataset of disambiguated queries, derived from interaction histories, to facilitate future research in this direction.
zh

[NLP-74] Myanmar XNLI: Building a Dataset and Exploring Low-resource Approaches to Natural Language Inference with Myanmar

【速读】：该论文旨在解决大型语言模型（Large Language Models, LLM）在低资源语言上的应用挑战，特别是在跨语言自然语言推理（Cross-Lingual Natural Language Inference, XNLI）任务中的表现问题。论文以缅甸语（Myanmar）作为低资源语言的代理，扩展了XNLI数据集，并提出三个核心贡献。解决方案的关键在于构建了一个名为myXNLI的新数据集，通过社区众包与专家验证相结合的方式实现；同时，论文评估了现有多语言模型在myXNLI上的性能，并探索了数据增强方法以提升模型在缅甸语及其他低资源语言上的表现，证明了数据增强技术的有效性及其在跨语言任务中的泛化能力。

链接: https://arxiv.org/abs/2504.09645
作者: Aung Kyaw Htet,Mark Dras
机构: Macquarie University (麦考瑞大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Despite dramatic recent progress in NLP, it is still a major challenge to apply Large Language Models (LLM) to low-resource languages. This is made visible in benchmarks such as Cross-Lingual Natural Language Inference (XNLI), a key task that demonstrates cross-lingual capabilities of NLP systems across a set of 15 languages. In this paper, we extend the XNLI task for one additional low-resource language, Myanmar, as a proxy challenge for broader low-resource languages, and make three core contributions. First, we build a dataset called Myanmar XNLI (myXNLI) using community crowd-sourced methods, as an extension to the existing XNLI corpus. This involves a two-stage process of community-based construction followed by expert verification; through an analysis, we demonstrate and quantify the value of the expert verification stage in the context of community-based construction for low-resource languages. We make the myXNLI dataset available to the community for future research. Second, we carry out evaluations of recent multilingual language models on the myXNLI benchmark, as well as explore data-augmentation methods to improve model performance. Our data-augmentation methods improve model accuracy by up to 2 percentage points for Myanmar, while uplifting other languages at the same time. Third, we investigate how well these data-augmentation methods generalise to other low-resource languages in the XNLI dataset.
zh

[NLP-75] Iterative Self-Training for Code Generation via Reinforced Re-Ranking ECIR2025

【速读】：该论文旨在解决利用当前解码器模型生成高质量代码的挑战，特别是在处理复杂编程任务时，这些模型通常会产生高度随机的输出，且即使小错误也可能破坏整个解决方案。为提升代码生成的整体质量，论文提出了一种新颖的迭代自训练方法，用于通过近邻策略优化（Proximal Policy Optimization, PPO）训练重排序（reranker）模型。该方法的关键在于开发一个鲁棒的奖励/重排序模型，它不仅通过重排序改进生成代码的质量，还能解决奖励模型在与重排序器对齐过程中可能忽略的问题和错误。此外，该方法通过重新评估输出、识别高分负例并将它们纳入训练循环来迭代优化训练数据集，从而提高模型性能。实验结果显示，所提出的134亿参数模型在代码生成质量上优于330亿参数模型，并且在某些编程语言上的表现甚至超过了GPT-4。

链接: https://arxiv.org/abs/2504.09643
作者: Nikita Sorokin,Ivan Sedykh,Valentin Malykh
机构: 未知
类目: Computation and Language (cs.CL); Information Retrieval (cs.IR); Software Engineering (cs.SE)
备注: Published at ECIR 2025

点击查看摘要

Abstract:Generating high-quality code that solves complex programming tasks is challenging, especially with current decoder-based models that produce highly stochastic outputs. In code generation, even minor errors can easily break the entire solution. Leveraging multiple sampled solutions can significantly improve the overall output quality. One effective way to enhance code generation is by pairing a code generation model with a reranker model, which selects the best solution from the generated samples. We propose a novel iterative self-training approach for self-training reranker models using Proximal Policy Optimization (PPO), aimed at improving both reranking accuracy and the overall code generation process. Unlike traditional PPO approaches, where the focus is on optimizing a generative model with a reward model, our approach emphasizes the development of a robust reward/reranking model. This model improves the quality of generated code through reranking and addresses problems and errors that the reward model might overlook during PPO alignment with the reranker. Our method iteratively refines the training dataset by re-evaluating outputs, identifying high-scoring negative examples, and incorporating them into the training loop, that boosting model performance. Our evaluation on the MultiPL-E dataset demonstrates that our 13.4B parameter model outperforms a 33B model in code generation quality while being three times faster. Moreover, it achieves performance comparable to GPT-4 and surpasses it in one programming language. Comments: Published at ECIR 2025 Subjects: Computation and Language (cs.CL); Information Retrieval (cs.IR); Software Engineering (cs.SE) Cite as: arXiv:2504.09643 [cs.CL] (or arXiv:2504.09643v1 [cs.CL] for this version) https://doi.org/10.48550/arXiv.2504.09643 Focus to learn more arXiv-issued DOI via DataCite (pending registration) Related DOI: https://doi.org/10.1007/978-3-031-88714-7_21 Focus to learn more DOI(s) linking to related resources
zh

[NLP-76] Leverag ing Reasoning Model Answers to Enhance Non-Reasoning Model Capability

【速读】：该论文试图解决如何利用高计算成本的推理密集型大语言模型（Reasoning-Intensive LLMs）的高质量输出，来提升低计算成本的非推理型大语言模型（Non-Reasoning LLMs）的能力。论文的关键解决方案是通过监督微调（Supervised Fine-Tuning, SFT）方法，将推理密集型模型生成的答案作为训练数据，系统性地改进非推理型模型的问答能力，并通过在多个基准测试中的实验验证了此方法的有效性和一致性。

链接: https://arxiv.org/abs/2504.09639
作者: Haotian Wang,Han Zhao,Shuaiting Chen,Xiaoyu Tian,Sitong Zhao,Yunjie Ji,Yiping Peng,Xiangang Li
机构: a-m-team
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Recent advancements in large language models (LLMs), such as DeepSeek-R1 and OpenAI-o1, have demonstrated the significant effectiveness of test-time scaling, achieving substantial performance gains across various benchmarks. These advanced models utilize deliberate “thinking” steps to systematically enhance answer quality. In this paper, we propose leveraging these high-quality outputs generated by reasoning-intensive models to improve less computationally demanding, non-reasoning models. We explore and compare methodologies for utilizing the answers produced by reasoning models to train and improve non-reasoning models. Through straightforward Supervised Fine-Tuning (SFT) experiments on established benchmarks, we demonstrate consistent improvements across various benchmarks, underscoring the potential of this approach for advancing the ability of models to answer questions directly.
zh

[NLP-77] Metropolis-Hastings Captioning Game: Knowledge Fusion of Vision Language Models via Decentralized Bayesian Inference

【速读】：该论文试图解决多视觉-语言模型（Vision-Language Models, VLMs）融合过程中存在的推理成本高和架构约束的问题。解决方案的关键在于提出Metropolis-Hastings Captioning Game (MHCG)，通过一种类似语言游戏的去中心化贝叶斯推理过程，使两个VLM代理交替为图像生成描述并相互学习，从而实现知识融合，而无需增加额外的推理成本或受限于特定架构。

链接: https://arxiv.org/abs/2504.09620
作者: Yuta Matsui,Ryosuke Yamaki,Ryo Ueda,Seitaro Shinagawa,Tadahiro Taniguchi
机构: Ritsumeikan University (立命馆大学); The University of Tokyo (东京大学); Nara Institute of Science and Technology (奈良先端科学技术大学院大学); Kyoto University (京都大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Multiagent Systems (cs.MA)
备注:

点击查看摘要

Abstract:We propose the Metropolis-Hastings Captioning Game (MHCG), a method to fuse knowledge of multiple vision-language models (VLMs) by learning from each other. Although existing methods that combine multiple models suffer from inference costs and architectural constraints, MHCG avoids these problems by performing decentralized Bayesian inference through a process resembling a language game. The knowledge fusion process establishes communication between two VLM agents alternately captioning images and learning from each other. We conduct two image-captioning experiments with two VLMs, each pre-trained on a different dataset. The first experiment demonstrates that MHCG achieves consistent improvement in reference-free evaluation metrics. The second experiment investigates how MHCG contributes to sharing VLMs’ category-level vocabulary by observing the occurrence of the vocabulary in the generated captions.
zh

[NLP-78] Short-Path Prompting in LLM s: Analyzing Reasoning Instability and Solutions for Robust Performance

【速读】：该论文试图解决在用户使用短路径提示（short-path prompts）的情况下，大型语言模型（Large Language Models, LLMs）在推理任务中的性能显著下降及不稳定性问题。当前最先进的LLMs通过指令微调（instruction-tuning）提供详细的链式思考（Chain-of-Thought, CoT）路径以增强推理能力，但当面对用户提供的简短提示时，这种能力大幅减弱甚至失效，尤其是在小学水平的问题上。为了解决这一冲突，论文提出了两种方法：一是基于指令引导的方法（instruction-guided approach），二是基于微调的方法（fine-tuning approach）。这两种方案旨在有效平衡指令遵循与推理准确性之间的权衡，从而提升LLMs在短路径提示下的推理性能。实验结果表明，这两种方法均取得了高精度，为理解当前模型在指令遵循与推理准确性之间的取舍提供了重要见解。

链接: https://arxiv.org/abs/2504.09586
作者: Zuoli Tang,Junjie Ou,Kaiqin Hu,Chunwei Wu,Zhaoxin Huan,Chilin Fu,Xiaolu Zhang,Jun Zhou,Chenliang Li
机构: Wuhan University (武汉大学); Ant Group (蚂蚁集团)
类目: Computation and Language (cs.CL)
备注: Under review

点击查看摘要

Abstract:Recent years have witnessed significant progress in large language models’ (LLMs) reasoning, which is largely due to the chain-of-thought (CoT) approaches, allowing models to generate intermediate reasoning steps before reaching the final answer. Building on these advances, state-of-the-art LLMs are instruction-tuned to provide long and detailed CoT pathways when responding to reasoning-related questions. However, human beings are naturally cognitive misers and will prompt language models to give rather short responses, thus raising a significant conflict with CoT reasoning. In this paper, we delve into how LLMs’ reasoning performance changes when users provide short-path prompts. The results and analysis reveal that language models can reason effectively and robustly without explicit CoT prompts, while under short-path prompting, LLMs’ reasoning ability drops significantly and becomes unstable, even on grade-school problems. To address this issue, we propose two approaches: an instruction-guided approach and a fine-tuning approach, both designed to effectively manage the conflict. Experimental results show that both methods achieve high accuracy, providing insights into the trade-off between instruction adherence and reasoning accuracy in current models.
zh

[NLP-79] Reduction of Supervision for Biomedical Knowledge Discovery

【速读】：该论文旨在解决因出版物数量激增和标注数据稀缺导致的知识发现难题，特别是在生物医学领域识别实体间语义关系时对标注数据高度依赖的问题。论文的关键在于提出了一组基于依存树（Dependency Trees）和注意力机制（Attention Mechanisms）的无监督算法，并结合多种点式二分类方法，逐步从弱监督过渡到完全无监督设置，评估其在处理带有噪声标签数据时的学习能力。通过在生物医学基准数据集上的评估，论文揭示了点式二分类技术在减少监督的情况下仍能有效应对噪声标签的鲁棒性，表明这些方法能够从弱监督顺利过渡到完全无监督场景。这种渐进式减少监督的方法为构建适应性强的知识发现系统提供了有价值的参考，推动了在标注数据有限条件下高效提取有用信息的数据高效方法的发展。

链接: https://arxiv.org/abs/2504.09582
作者: Christos Theodoropoulos,Andrei Catalin Coman,James Henderson,Marie-Francine Moens
机构: KU Leuven (鲁汶大学); IDIAP Research Institute (IDIAP 研究所)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: Published as part of the PhD dissertation: Theodoropoulos, Christos, Marie-Francine Moens, and Matthew Blaschko. “Deep Learning Models for the Extraction of Knowledge from Text.” (2025)

点击查看摘要

Abstract:Knowledge discovery is hindered by the increasing volume of publications and the scarcity of extensive annotated data. To tackle the challenge of information overload, it is essential to employ automated methods for knowledge extraction and processing. Finding the right balance between the level of supervision and the effectiveness of models poses a significant challenge. While supervised techniques generally result in better performance, they have the major drawback of demanding labeled data. This requirement is labor-intensive and time-consuming and hinders scalability when exploring new domains. In this context, our study addresses the challenge of identifying semantic relationships between biomedical entities (e.g., diseases, proteins) in unstructured text while minimizing dependency on supervision. We introduce a suite of unsupervised algorithms based on dependency trees and attention mechanisms and employ a range of pointwise binary classification methods. Transitioning from weakly supervised to fully unsupervised settings, we assess the methods’ ability to learn from data with noisy labels. The evaluation on biomedical benchmark datasets explores the effectiveness of the methods. Our approach tackles a central issue in knowledge discovery: balancing performance with minimal supervision. By gradually decreasing supervision, we assess the robustness of pointwise binary classification techniques in handling noisy labels, revealing their capability to shift from weakly supervised to entirely unsupervised scenarios. Comprehensive benchmarking offers insights into the effectiveness of these techniques, suggesting an encouraging direction toward adaptable knowledge discovery systems, representing progress in creating data-efficient methodologies for extracting useful insights when annotated data is limited.
zh

[NLP-80] LLM s Can Achieve High-quality Simultaneous Machine Translation as Efficiently as Offline

【速读】：该论文致力于解决在流式输入源句子场景下，解码器-only大语言模型（Decoder-only Large Language Models）在同时性机器翻译（Simultaneous Machine Translation, SiMT）任务中的效率与性能受限问题。论文的关键创新在于提出了一种新颖的方法，包括构建用于SiMT的有监督微调（Supervised Fine-Tuning, SFT）数据，以及新的训练和推理策略。具体而言，通过将源令牌和目标令牌重新排列成交错序列，并使用特殊分隔符根据不同的延迟需求组织输入输出流，使强大的大语言模型能够自适应地学习读写操作，同时保持高效的自回归解码能力。实验结果表明，即使在有限的SFT数据条件下，该方法在多种SiMT基准测试中达到了最先进的性能，并保留了离线翻译的原始能力，同时在文档级SiMT任务中表现出良好的泛化能力，无需特定的微调。

链接: https://arxiv.org/abs/2504.09570
作者: Biao Fu,Minpeng Liao,Kai Fan,Chengxi Li,Liang Zhang,Yidong Chen,Xiaodong Shi
机构: 未知
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:When the complete source sentence is provided, Large Language Models (LLMs) perform excellently in offline machine translation even with a simple prompt “Translate the following sentence from [src lang] into [tgt lang]:”. However, in many real scenarios, the source tokens arrive in a streaming manner and simultaneous machine translation (SiMT) is required, then the efficiency and performance of decoder-only LLMs are significantly limited by their auto-regressive nature. To enable LLMs to achieve high-quality SiMT as efficiently as offline translation, we propose a novel paradigm that includes constructing supervised fine-tuning (SFT) data for SiMT, along with new training and inference strategies. To replicate the token input/output stream in SiMT, the source and target tokens are rearranged into an interleaved sequence, separated by special tokens according to varying latency requirements. This enables powerful LLMs to learn read and write operations adaptively, based on varying latency prompts, while still maintaining efficient auto-regressive decoding. Experimental results show that, even with limited SFT data, our approach achieves state-of-the-art performance across various SiMT benchmarks, and preserves the original abilities of offline translation. Moreover, our approach generalizes well to document-level SiMT setting without requiring specific fine-tuning, even beyond the offline translation model.
zh

[NLP-81] Syzygy of Thoughts: Improving LLM CoT with the Minimal Free Resolution

【速读】：该论文旨在解决复杂任务中单一线性推理路径（如 Chain-of-Thought, CoT）难以有效处理庞大解空间与模糊约束的问题。为应对这一挑战，论文提出了一种名为 Syzygy of Thoughts (SoT) 的新框架，其关键在于引入辅助且相互关联的推理路径，通过借鉴交换代数中的 Minimal Free Resolution (MFR) 方法，将原始复杂问题系统性分解为逻辑完整的最小子问题，同时保持核心问题特征并减少推理长度。SoT 捕捉更深层次的逻辑依赖关系，实现更具鲁棒性和结构化的推理能力。实验结果表明，SoT 在多个数据集和模型上的推理精度达到或超过了主流 CoT 标准，并通过与代数约束对齐的采样过程提升了大规模语言模型的推理时间可扩展性。

链接: https://arxiv.org/abs/2504.09566
作者: Chenghao Li,Chaoning Zhang,Yi Lu,Jiaquan Zhang,Qigan Sun,Xudong Wang,Jiwei Wei,Guoqing Wang,Yang Yang,Heng Tao Shen
机构: University of Electronic Science and Technology of China(电子科技大学), Chengdu, China; Capital Normal University(首都师范大学), Beijing, China; University of Liverpool(利物浦大学), Liverpool, United Kingdom; Kyung Hee University(庆熙大学), Yongin-si, Republic of Korea; Tongji University(同济大学), Shanghai, China
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Chain-of-Thought (CoT) prompting enhances the reasoning of large language models (LLMs) by decomposing problems into sequential steps, mimicking human logic and reducing errors. However, complex tasks with vast solution spaces and vague constraints often exceed the capacity of a single reasoning chain. Inspired by Minimal Free Resolution (MFR) in commutative algebra and algebraic geometry, we propose Syzygy of Thoughts (SoT)-a novel framework that extends CoT by introducing auxiliary, interrelated reasoning paths. SoT captures deeper logical dependencies, enabling more robust and structured problem-solving. MFR decomposes a module into a sequence of free modules with minimal rank, providing a structured analytical approach to complex systems. This method introduces the concepts of “Module”, “Betti numbers”,“Freeness”, “Mapping”, “Exactness” and “Minimality”, enabling the systematic decomposition of the original complex problem into logically complete minimal subproblems while preserving key problem features and reducing reasoning length. We tested SoT across diverse datasets (e.g., GSM8K, MATH) and models (e.g., GPT-4o-mini, Qwen2.5), achieving inference accuracy that matches or surpasses mainstream CoTs standards. Additionally, by aligning the sampling process with algebraic constraints, our approach enhances the scalability of inference time in LLMs, ensuring both transparent reasoning and high performance. Our code will be publicly available at this https URL.
zh

[NLP-82] How new data permeates LLM knowledge and how to dilute it

【速读】：该论文旨在探究大型语言模型（LLMs）在学习新信息时如何影响其现有知识，特别是在有益泛化与有害幻觉之间的权衡机制。论文发现，在学习新信息时，LLMs表现出一种“启动效应”（priming effect），即学习的新事实可能导致模型在无关上下文中不恰当地应用该知识。为系统研究这一现象，作者构建了一个名为“Outlandish”的多样化数据集，用于检测新知识如何渗透到模型的知识库中。通过此数据集，研究揭示了学习新信息后启动效应的程度可以通过测量关键单词的学习前标记概率来预测，并且这种关系在不同架构（如PALM-2、Gemma、Llama）、规模和训练阶段中均保持稳健。

解决方案的关键在于提出两种新颖的技术以调节新知识对现有模型行为的影响：一是“踏脚石”文本增强策略，二是“忽略-k”更新剪枝方法。这些技术能够将不必要的启动效应减少50%-95%，同时保留模型学习新信息的能力。这不仅提供了关于LLMs学习方式的经验性见解，还为提升语言模型知识插入的特异性提供了实用工具。

链接: https://arxiv.org/abs/2504.09522
作者: Chen Sun,Renat Aksitov,Andrey Zhmoginov,Nolan Andrew Miller,Max Vladymyrov,Ulrich Rueckert,Been Kim,Mark Sandler
机构: Google DeepMind
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Large language models learn and continually learn through the accumulation of gradient-based updates, but how individual pieces of new information affect existing knowledge, leading to both beneficial generalization and problematic hallucination, remains poorly understood. We demonstrate that when learning new information, LLMs exhibit a “priming” effect: learning a new fact can cause the model to inappropriately apply that knowledge in unrelated contexts. To systematically study this phenomenon, we introduce “Outlandish,” a carefully curated dataset of 1320 diverse text samples designed to probe how new knowledge permeates through an LLM’s existing knowledge base. Using this dataset, we show that the degree of priming after learning new information can be predicted by measuring the token probability of key words before learning. This relationship holds robustly across different model architectures (PALM-2, Gemma, Llama), sizes, and training stages. Finally, we develop two novel techniques to modulate how new knowledge affects existing model behavior: (1) a stepping-stone'' text augmentation strategy and (2) an ignore-k’’ update pruning method. These approaches reduce undesirable priming effects by 50-95% while preserving the model’s ability to learn new information. Our findings provide both empirical insights into how LLMs learn and practical tools for improving the specificity of knowledge insertion in language models. Further materials: this https URL
zh

[NLP-83] MADLLM : Multivariate Anomaly Detection via Pre-trained LLM s ICME2025

【速读】：该论文试图解决当将预训练的大规模语言模型（Large Language Models, LLMs）应用于异常检测任务时，其文本模态与多变量时间序列（Multivariate Time Series, MTS）模态不匹配的问题。现有方法通常将MTS数据简单转换为多个单变量时间序列，这会导致许多问题。论文提出的关键解决方案是MADLLM，一种通过预训练LLMs实现的新型多变量异常检测方法。其核心在于设计了一种新的三重编码技术，将MTS模态与LLMs的文本模态对齐。具体而言，该技术结合了传统的补丁嵌入方法与两种新颖的嵌入方式：Skip嵌入，通过改变传统方法中补丁处理的顺序，帮助LLMs保留先前特征的知识；Feature嵌入，利用对比学习使模型更好地理解不同特征之间的相关性。实验结果表明，该方法在多种公开异常检测数据集上优于现有最先进方法。

链接: https://arxiv.org/abs/2504.09504
作者: Wei Tao,Xiaoyang Qu,Kai Lu,Jiguang Wan,Guokuan Li,Jianzong Wang
机构: Huazhong University of Science and Technology (华中科技大学); Ping An Technology (Shenzhen) Co., Ltd. (平安科技（深圳）有限公司)
类目: Computation and Language (cs.CL)
备注: Accepted by IEEE International Conference on Multimedia Expo 2025 (ICME 2025)

点击查看摘要

Abstract:When applying pre-trained large language models (LLMs) to address anomaly detection tasks, the multivariate time series (MTS) modality of anomaly detection does not align with the text modality of LLMs. Existing methods simply transform the MTS data into multiple univariate time series sequences, which can cause many problems. This paper introduces MADLLM, a novel multivariate anomaly detection method via pre-trained LLMs. We design a new triple encoding technique to align the MTS modality with the text modality of LLMs. Specifically, this technique integrates the traditional patch embedding method with two novel embedding approaches: Skip Embedding, which alters the order of patch processing in traditional methods to help LLMs retain knowledge of previous features, and Feature Embedding, which leverages contrastive learning to allow the model to better understand the correlations between different features. Experimental results demonstrate that our method outperforms state-of-the-art methods in various public anomaly detection datasets.
zh

[NLP-84] Kongzi: A Historical Large Language Model with Fact Enhancement

【速读】：该论文试图解决大型语言模型（Large Language Models, LLMs）在复杂推理任务中，尤其是在历史分析等知识密集型领域中存在的事实准确性不足和推理深度受限的问题。论文指出，当前的推理模型在较长的推理链中常表现出事实性错误，这限制了其在历史推理等任务中的表现。为了解决这些问题，论文提出了一种专门设计用于历史分析的大规模语言模型——Kongzi。其解决方案的关键在于通过整合高质量的历史数据集以及采用一种新颖的事实增强学习策略，使模型不仅能够实现更高的事实一致性，还具备更深层次的推理能力。实验结果表明，Kongzi在历史问答和叙事生成等任务中，显著优于现有模型，在事实准确性与推理深度方面均展现出卓越性能。

链接: https://arxiv.org/abs/2504.09488
作者: Jiashu Yang,Ningning Wang,Yian Zhao,Chaoran Feng,Junjia Du,Hao Pang,Zhirui Fang,Xuxin Cheng
机构: Dalian University of Technology (大连理工大学); Nanyang Technological University (南洋理工大学); Peking University (北京大学)
类目: Computation and Language (cs.CL)
备注: 22 pages, 12 figures

点击查看摘要

Abstract:The capabilities of the latest large language models (LLMs) have been extended from pure natural language understanding to complex reasoning tasks. However, current reasoning models often exhibit factual inaccuracies in longer reasoning chains, which poses challenges for historical reasoning and limits the potential of LLMs in complex, knowledge-intensive tasks. Historical studies require not only the accurate presentation of factual information but also the ability to establish cross-temporal correlations and derive coherent conclusions from fragmentary and often ambiguous sources. To address these challenges, we propose Kongzi, a large language model specifically designed for historical analysis. Through the integration of curated, high-quality historical data and a novel fact-reinforcement learning strategy, Kongzi demonstrates strong factual alignment and sophisticated reasoning depth. Extensive experiments on tasks such as historical question answering and narrative generation demonstrate that Kongzi outperforms existing models in both factual accuracy and reasoning depth. By effectively addressing the unique challenges inherent in historical texts, Kongzi sets a new standard for the development of accurate and reliable LLMs in professional domains.
zh

[NLP-85] HalluShift: Measuring Distribution Shifts towards Hallucination Detection in LLM s

【速读】：该论文旨在解决大型语言模型（Large Language Models, LLMs）在生成响应时因内在动态特性导致的幻觉（hallucinations）问题，即模型倾向于生成看似结构良好且连贯但包含不正确信息的响应。论文的关键假设是幻觉源于LLMs的内部动力学，并观察到这种现象表现为模型在生成段落时逐渐偏离事实准确性，尤其是在响应的细微部分，最终可能导致错误信息的传播。受人类认知中类似现象的启发，研究引入了一种名为HalluShift的新方法，通过分析LLM生成响应中的内部状态空间分布偏移和标记概率变化来深入探究这一问题。这种方法在多个基准数据集上的表现优于现有基线，体现了其有效性。

链接: https://arxiv.org/abs/2504.09482
作者: Sharanya Dasgupta,Sujoy Nath,Arkaprabha Basu,Pourya Shamsolmoali,Swagatam Das
机构: Electronics and Communication Sciences Unit (ECSU), Indian Statistical Institute, Kolkata, India; Netaji Subhash Engineering College (NSEC), Kolkata, India; East China Normal University (ECNU), China
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET)
备注:

点击查看摘要

Abstract:Large Language Models (LLMs) have recently garnered widespread attention due to their adeptness at generating innovative responses to the given prompts across a multitude of domains. However, LLMs often suffer from the inherent limitation of hallucinations and generate incorrect information while maintaining well-structured and coherent responses. In this work, we hypothesize that hallucinations stem from the internal dynamics of LLMs. Our observations indicate that, during passage generation, LLMs tend to deviate from factual accuracy in subtle parts of responses, eventually shifting toward misinformation. This phenomenon bears a resemblance to human cognition, where individuals may hallucinate while maintaining logical coherence, embedding uncertainty within minor segments of their speech. To investigate this further, we introduce an innovative approach, HalluShift, designed to analyze the distribution shifts in the internal state space and token probabilities of the LLM-generated responses. Our method attains superior performance compared to existing baselines across various benchmark datasets. Our codebase is available at this https URL.
zh

[NLP-86] Draw with Thought: Unleashing Multimodal Reasoning for Scientific Diagram Generation

【速读】：该论文旨在解决科学图表（Scientific Diagrams）因以静态光栅图像形式发表而丢失符号语义（symbolic semantics）且限制重用的问题。现有方法虽利用多模态大语言模型（Multimodal Large Language Models, MLLMs）尝试连接视觉与结构，但缺乏语义控制（semantic control）和结构可解释性（structural interpretability），尤其在复杂图表上表现不足。
解决方案的关键在于提出Draw with Thought (DwT)，一种无需微调（training-free）的框架，通过认知驱动的链式思维推理（cognitively-grounded Chain-of-Thought reasoning）引导MLLMs将图表重构为可编辑的mxGraph XML代码。DwT通过“粗到细规划”（Coarse-to-Fine Planning）和“结构感知代码生成”（Structure-Aware Code Generation）两阶段任务实现这一目标：前者负责感知结构化处理和语义指定，后者通过格式引导优化提升结果质量。此外，论文发布了Plot2XML数据集作为基准，包含247个真实科学图表及其金标准XML标注，验证了该方法在高保真度、语义一致性及结构有效性方面的优越性能。

链接: https://arxiv.org/abs/2504.09479
作者: Zhiqing Cui,Jiahao Yuan,Hanqing Wang,Yanshu Li,Chenxu Du,Zhenglong Ding
机构: Nanjing University of Information Science & Technology (南京信息工程大学); East China Normal University (华东师范大学); Huazhong University of Science and Technology (华中科技大学); Brown University (布朗大学); Southwest Jiaotong University (西南交通大学)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: 26 pages, 14 figures

点击查看摘要

Abstract:Scientific diagrams are vital tools for communicating structured knowledge across disciplines. However, they are often published as static raster images, losing symbolic semantics and limiting reuse. While Multimodal Large Language Models (MLLMs) offer a pathway to bridging vision and structure, existing methods lack semantic control and structural interpretability, especially on complex diagrams. We propose Draw with Thought (DwT), a training-free framework that guides MLLMs to reconstruct diagrams into editable mxGraph XML code through cognitively-grounded Chain-of-Thought reasoning. DwT enables interpretable and controllable outputs without model fine-tuning by dividing the task into two stages: Coarse-to-Fine Planning, which handles perceptual structuring and semantic specification, and Structure-Aware Code Generation, enhanced by format-guided refinement. To support evaluation, we release Plot2XML, a benchmark of 247 real-world scientific diagrams with gold-standard XML annotations. Extensive experiments across eight MLLMs show that our approach yields high-fidelity, semantically aligned, and structurally valid reconstructions, with human evaluations confirming strong alignment in both accuracy and visual aesthetics, offering a scalable solution for converting static visuals into executable representations and advancing machine understanding of scientific graphics.
zh

[NLP-87] AdaSteer: Your Aligned LLM is Inherently an Adaptive Jailbreak Defender

【速读】：该论文旨在解决大型语言模型（Large Language Models, LLMs）在安全对齐方面仍易受越狱攻击（jailbreak attacks）的问题。现有的无训练防御方法如激活引导（activation steering）依赖固定的引导系数，这导致保护效果次优且对良性输入的误拒率增加。为了解决这一问题，论文提出了一种自适应激活引导方法AdaSteer。其关键是通过识别两个关键特性：拒绝定律（Rejection Law, R-Law），即需要对反向拒绝方向的越狱输入施加更强的引导；有害性定律（Harmfulness Law, H-Law），即区分对抗性和良性输入。AdaSteer通过逻辑回归学习自适应系数，在拒绝方向（Rejection Direction, RD）和有害性方向（Harmfulness Direction, HD）上引导输入表示，从而实现对越狱攻击的稳健防御，同时保持对良性输入的良好处理能力。实验结果表明，AdaSteer在多个越狱攻击场景下优于基线方法，并对模型实用性影响较小。

链接: https://arxiv.org/abs/2504.09466
作者: Weixiang Zhao,Jiahe Guo,Yulin Hu,Yang Deng,An Zhang,Xingyu Sui,Xinyang Han,Yanyan Zhao,Bing Qin,Tat-Seng Chua,Ting Liu
机构: Harbin Institute of Technology (哈尔滨工业大学); Singapore Management University (新加坡管理大学); National University of Singapore (新加坡国立大学)
类目: Cryptography and Security (cs.CR); Computation and Language (cs.CL)
备注: 17 pages, 6 figures, 9 tables

点击查看摘要

Abstract:Despite extensive efforts in safety alignment, large language models (LLMs) remain vulnerable to jailbreak attacks. Activation steering offers a training-free defense method but relies on fixed steering coefficients, resulting in suboptimal protection and increased false rejections of benign inputs. To address this, we propose AdaSteer, an adaptive activation steering method that dynamically adjusts model behavior based on input characteristics. We identify two key properties: Rejection Law (R-Law), which shows that stronger steering is needed for jailbreak inputs opposing the rejection direction, and Harmfulness Law (H-Law), which differentiates adversarial and benign inputs. AdaSteer steers input representations along both the Rejection Direction (RD) and Harmfulness Direction (HD), with adaptive coefficients learned via logistic regression, ensuring robust jailbreak defense while preserving benign input handling. Experiments on LLaMA-3.1, Gemma-2, and Qwen2.5 show that AdaSteer outperforms baseline methods across multiple jailbreak attacks with minimal impact on utility. Our results highlight the potential of interpretable model internals for real-time, flexible safety enforcement in LLMs.
zh

[NLP-88] BabyVLM: Data-Efficient Pretraining of VLMs Inspired by Infant Learning

【速读】：该论文旨在解决现有视觉语言模型（Vision-Language Models, VLMs）预训练效率低下以及现有评估基准与婴儿视觉推理发展不一致的问题。具体而言，现有的婴儿启发式数据集如SAYCam虽被采用，但当前评估基准要么过于简单化，要么范围狭窄，或专门为大规模预训练模型设计，且单纯依赖婴儿数据忽略了婴儿从更广泛、更多样化输入中学习的特性。为了解决这些问题，论文提出BabyVLM框架，其关键是构建了一个综合的领域内评估基准以及通过现有数据集的儿童导向变换生成的合成训练数据集。实验表明，使用该合成数据集训练的VLM在BabyVLM任务上的表现优于仅基于SAYCam或同等规模通用数据训练的模型，从而证明了精心策划的小型数据集可以有效提升模型的泛化能力，并为高效的数据驱动视觉语言学习范式开辟了道路。

链接: https://arxiv.org/abs/2504.09426
作者: Shengao Wang,Arjun Chandra,Aoming Liu,Venkatesh Saligrama,Boqing Gong
机构: Boston University (波士顿大学); Boston University (波士顿大学); Boston University (波士顿大学); Boston University (波士顿大学); Boston University (波士顿大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Human infants rapidly develop visual reasoning skills from minimal input, suggesting that developmentally inspired pretraining could significantly enhance the efficiency of vision-language models (VLMs). Although recent efforts have leveraged infant-inspired datasets like SAYCam, existing evaluation benchmarks remain misaligned–they are either too simplistic, narrowly scoped, or tailored for large-scale pretrained models. Additionally, training exclusively on infant data overlooks the broader, diverse input from which infants naturally learn. To address these limitations, we propose BabyVLM, a novel framework comprising comprehensive in-domain evaluation benchmarks and a synthetic training dataset created via child-directed transformations of existing datasets. We demonstrate that VLMs trained with our synthetic dataset achieve superior performance on BabyVLM tasks compared to models trained solely on SAYCam or general-purpose data of the SAYCam size. BabyVLM thus provides a robust, developmentally aligned evaluation tool and illustrates how compact models trained on carefully curated data can generalize effectively, opening pathways toward data-efficient vision-language learning paradigms.
zh

[NLP-89] ClinicalGPT -R1: Pushing reasoning capability of generalist disease diagnosis with large language model

【速读】：该论文试图解决大型语言模型（Large Language Models, LLMs）在临床诊断领域应用不足的问题。解决方案的关键在于开发ClinicalGPT-R1，这是一种针对疾病诊断增强推理能力的通用型大语言模型。通过在包含20,000份真实世界临床记录的数据集上进行训练，并采用多样化的训练策略以提升诊断推理能力，ClinicalGPT-R1实现了在中文诊断任务中超越GPT-4o的表现，并在英文环境下与GPT-4达到相当的性能水平。这一比较研究有效验证了ClinicalGPT-R1在疾病诊断任务中的卓越表现。

链接: https://arxiv.org/abs/2504.09421
作者: Wuyang Lan,Wenzheng Wang,Changwei Ji,Guoxing Yang,Yongbo Zhang,Xiaohong Liu,Song Wu,Guangyu Wang
机构: State Key Laboratory of Networking and Switching Technology (网络与交换技术国家重点实验室) Beijing University of Posts and Telecommunications (北京邮电大学); South China Hospital, Medical School, Shenzhen University (深圳大学华南医院)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 8 pages, 6 figures

点击查看摘要

Abstract:Recent advances in reasoning with large language models (LLMs)has shown remarkable reasoning capabilities in domains such as mathematics and coding, yet their application to clinical diagnosis remains underexplored. Here, we introduce ClinicalGPT-R1, a reasoning enhanced generalist large language model for disease diagnosis. Trained on a dataset of 20,000 real-world clinical records, ClinicalGPT-R1 leverages diverse training strategies to enhance diagnostic reasoning. To benchmark performance, we curated MedBench-Hard, a challenging dataset spanning seven major medical specialties and representative diseases. Experimental results demonstrate that ClinicalGPT-R1 outperforms GPT-4o in Chinese diagnostic tasks and achieves comparable performance to GPT-4 in English settings. This comparative study effectively validates the superior performance of ClinicalGPT-R1 in disease diagnosis tasks. Resources are available at this https URL.
zh

[NLP-90] SaRO: Enhancing LLM Safety through Reasoning -based Alignment

【速读】：该论文旨在解决大型语言模型（Large Language Models, LLMs）安全对齐中的两个关键问题：(1) 欠泛化导致模型易受新型越狱攻击（jailbreak attacks）的影响，以及 (2) 过度对齐引发对良性指令的过度拒绝。研究发现嵌入空间中越狱/有害查询与正常提示之间存在语义重叠，表明需要更深层次的语义理解以实现更有效的安全性对齐。为此，论文提出了一种以安全策略驱动推理的安全导向推理优化框架（Safety-oriented Reasoning Optimization Framework, SaRO），其关键是通过两阶段方法实现：(1) 推理风格预热（Reasoning-style Warmup, RW），利用有监督微调使LLMs内化长链推理能力；(2) 安全导向推理过程优化（Safety-oriented Reasoning Process Optimization, SRPO），通过直接偏好优化（Direct Preference Optimization, DPO）促进安全性反思。实验结果验证了SaRO相较于传统对齐方法的优越性。

链接: https://arxiv.org/abs/2504.09420
作者: Yutao Mou,Yuxiao Luo,Shikun Zhang,Wei Ye
机构: National Engineering Research Center for Software Engineering, Peking University (北京大学工程研究中心, 北京大学)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Current safety alignment techniques for large language models (LLMs) face two key challenges: (1) under-generalization, which leaves models vulnerable to novel jailbreak attacks, and (2) over-alignment, which leads to the excessive refusal of benign instructions. Our preliminary investigation reveals semantic overlap between jailbreak/harmful queries and normal prompts in embedding space, suggesting that more effective safety alignment requires a deeper semantic understanding. This motivates us to incorporate safety-policy-driven reasoning into the alignment process. To this end, we propose the Safety-oriented Reasoning Optimization Framework (SaRO), which consists of two stages: (1) Reasoning-style Warmup (RW) that enables LLMs to internalize long-chain reasoning through supervised fine-tuning, and (2) Safety-oriented Reasoning Process Optimization (SRPO) that promotes safety reflection via direct preference optimization (DPO). Extensive experiments demonstrate the superiority of SaRO over traditional alignment methods.
zh

[NLP-91] UXAgent : A System for Simulating Usability Testing of Web Design with LLM Agents

【速读】：该论文试图解决如何评估和迭代可用性测试研究设计本身的问题。解决方案的关键在于设计了一个名为 UXAgent 的系统，该系统利用大型语言模型模拟代理（Large Language Model-simulated Agent, LLM Agent）技术，通过其 Persona Generator 模块、LLM Agent 模块以及 Universal Browser Connector 模块自动生成数千个模拟用户来交互式测试目标网站。此外，系统提供了 Agent Interview Interface 和 Video Replay Interface，使用户体验（UX）研究人员能够便捷地审查和分析生成的定性和定量日志数据，从而支持 UX 研究人员在真实的人类主体研究之前对可用性测试研究设计进行评估和迭代。

链接: https://arxiv.org/abs/2504.09407
作者: Yuxuan Lu,Bingsheng Yao,Hansu Gu,Jing Huang,Jessie Wang,Yang Li,Jiri Gesi,Qi He,Toby Jia-Jun Li,Dakuo Wang
机构: Northeastern University (东北大学); University of Notre Dame (圣母大学); Amazon (亚马逊)
类目: Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)
备注:

点击查看摘要

Abstract:Usability testing is a fundamental research method that user experience (UX) researchers use to evaluate and iterate a web design, but\textbf how to evaluate and iterate the usability testing study design itself? Recent advances in Large Language Model-simulated Agent (\textbfLLM Agent) research inspired us to design \textbfUXAgent to support UX researchers in evaluating and reiterating their usability testing study design before they conduct the real human-subject study. Our system features a Persona Generator module, an LLM Agent module, and a Universal Browser Connector module to automatically generate thousands of simulated users to interactively test the target website. The system also provides an Agent Interview Interface and a Video Replay Interface so that the UX researchers can easily review and analyze the generated qualitative and quantitative log data. Through a heuristic evaluation, five UX researcher participants praised the innovation of our system but also expressed concerns about the future of LLM Agent usage in UX studies.
zh

[NLP-92] Question Tokens Deserve More Attention: Enhancing Large Language Models without Training through Step-by-Step Reading and Question Attention Recalibration

【速读】：该论文旨在解决大型语言模型（Large Language Models, LLMs）在处理需要深入理解复杂问题的任务时所面临的挑战，特别是在涉及长距离依赖或多步推理的情况下。论文通过研究当前LLMs在问题理解方面的局限性，提出了三个关键洞察：(1) 重复问题标记可以提高对问题区域的关注从而增强理解；(2) 增加向后的依赖关系会因单向注意力限制而影响性能；(3) 重新校准注意力机制以优先关注与问题相关区域能够提升表现。基于这些发现，论文首先提出了一系列基于提示的方法——逐步阅读（Step-by-Step Reading, SSR）、SSR+ 和 SSR++，引导LLMs逐步处理问题标记并将推理过程与输入结构对齐，显著提升了性能，在多个基准测试中达到了最先进的结果。其次，引入了一种无需训练的注意力重新校准机制，在推理过程中动态调整注意力分布以强调与问题相关的区域。此方法在不改变模型参数或输入提示的情况下，将LLaMA 3.1-8B在AQuA上的准确性提高了5.17%。综合来看，研究结果强调了结构化提示设计和注意力优化在改善LLMs理解能力中的重要性，提供了轻量但有效的工具来提升自然语言处理任务的表现。

链接: https://arxiv.org/abs/2504.09402
作者: Feijiang Han,Licheng Guo,Hengtao Cui,Zhiyuan Lyu
机构: University of Pennsylvania (宾夕法尼亚大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: CIS 5300

点击查看摘要

Abstract:Large Language Models (LLMs) often struggle with tasks that require a deep understanding of complex questions, especially when faced with long-range dependencies or multi-step reasoning. This work investigates the limitations of current LLMs in question comprehension and identifies three insights: (1) repeating question tokens improves comprehension by increasing attention to question regions, (2) increased backward dependencies negatively affect performance due to unidirectional attentional constraints, and (3) recalibrating attentional mechanisms to prioritize question-relevant regions improves performance. Based on these findings, we first propose a family of prompt-based strategies - Step-by-Step Reading (SSR), SSR+, and SSR++ - that guide LLMs to incrementally process question tokens and align their reasoning with the input structure. These methods significantly improve performance, with SSR++ achieving state-of-the-art results on several benchmarks: 96.66% on GSM8K, 94.61% on ASDiv, and 76.28% on AQuA. Second, we introduce a training-free attention recalibration mechanism that dynamically adjusts attention distributions during inference to emphasize question-relevant regions. This method improves the accuracy of LLaMA 3.1-8B on AQuA by 5.17% without changing model parameters or input prompts. Taken together, our results highlight the importance of structured prompt design and attention optimization in improving LLM comprehension, providing lightweight yet effective tools for improving performance in various NLP tasks. Comments: CIS 5300 Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI) Cite as: arXiv:2504.09402 [cs.CL] (or arXiv:2504.09402v1 [cs.CL] for this version) https://doi.org/10.48550/arXiv.2504.09402 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
zh

[NLP-93] Composable NLP Workflows for BERT-based Ranking and QA System

【速读】：该论文旨在解决多任务交互处理中的难题，特别是在不同粒度文本上处理跨任务交互的繁琐性。为应对这一挑战，论文提出的关键解决方案是构建一个端到端的排名（Ranking）与问答（QA）系统，采用Forte工具包实现可组合的自然语言处理（NLP）流水线。该系统利用了如BERT和RoBERTa等最先进的深度学习模型，并通过MS-MARCO和Covid-19数据集评估性能，使用BLUE、MRR、F1等指标与相应基准结果进行对比。其关键之处在于流水线的模块化设计以及重排序器（reranker）的低延迟特性，这使得复杂NLP应用的构建变得简便高效。

链接: https://arxiv.org/abs/2504.09398
作者: Gaurav Kumar,Murali Mohana Krishna Dandu
机构: UC San Diego (加州大学圣地亚哥分校); UC San Diego (加州大学圣地亚哥分校)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 6 pages, 3 figures, 6 tables

点击查看摘要

Abstract:There has been a lot of progress towards building NLP models that scale to multiple tasks. However, real-world systems contain multiple components and it is tedious to handle cross-task interaction with varying levels of text granularity. In this work, we built an end-to-end Ranking and Question-Answering (QA) system using Forte, a toolkit that makes composable NLP pipelines. We utilized state-of-the-art deep learning models such as BERT, RoBERTa in our pipeline, evaluated the performance on MS-MARCO and Covid-19 datasets using metrics such as BLUE, MRR, F1 and compared the results of ranking and QA systems with their corresponding benchmark results. The modular nature of our pipeline and low latency of reranker makes it easy to build complex NLP applications easily.
zh

[NLP-94] Evaluation Under Imperfect Benchmarks and Ratings: A Case Study in Text Simplification

【速读】：该论文试图解决文本简化任务中现有评估方法不可靠的问题，主要面临两个挑战：一是现有基准数据可能无法充分反映当前语言模型在该任务上的能力，包含不流畅、不连贯或过于简单的例子；二是基准数据的人类评分存在高度分歧，导致评价结果不一致。尽管已有指标努力与这些不完美的评分保持较高相关性，但评估结果仍缺乏可靠性和预期趋势（如更强大的模型应获得更高的分数）。

解决方案的关键在于通过三个贡献优化评估过程：首先，提出SynthSimpliEval，一个基于不同规模模型生成简化句子的合成基准，证明其上的人类评分具有高一致性且符合预期趋势；其次，验证使用大型语言模型（LLMs）作为评审小组进行自动评估通常足以获得一致的评分；最后，展示现有的可学习评估指标通过在LLMs-as-a-jury标注的合成数据上训练后，能够缩小与纯LLMs-as-a-jury评估之间的差距。总体而言，通过文本简化案例研究，论文表明可靠的评估需要更高质量的测试数据，可通过合成数据和LLMs-as-a-jury评分获得。

链接: https://arxiv.org/abs/2504.09394
作者: Joseph Liu,Yoonsoo Nam,Xinyue Cui,Swabha Swayamdipta
机构: University of Southern California (南加州大学)
类目: Computation and Language (cs.CL)
备注: Submitted to COLM 2025. 9 pages, 6 figures

点击查看摘要

Abstract:Despite the successes of language models, their evaluation remains a daunting challenge for new and existing tasks. We consider the task of text simplification, commonly used to improve information accessibility, where evaluation faces two major challenges. First, the data in existing benchmarks might not reflect the capabilities of current language models on the task, often containing disfluent, incoherent, or simplistic examples. Second, existing human ratings associated with the benchmarks often contain a high degree of disagreement, resulting in inconsistent ratings; nevertheless, existing metrics still have to show higher correlations with these imperfect ratings. As a result, evaluation for the task is not reliable and does not reflect expected trends (e.g., more powerful models being assigned higher scores). We address these challenges for the task of text simplification through three contributions. First, we introduce SynthSimpliEval, a synthetic benchmark for text simplification featuring simplified sentences generated by models of varying sizes. Through a pilot study, we show that human ratings on our benchmark exhibit high inter-annotator agreement and reflect the expected trend: larger models produce higher-quality simplifications. Second, we show that auto-evaluation with a panel of LLM judges (LLMs-as-a-jury) often suffices to obtain consistent ratings for the evaluation of text simplification. Third, we demonstrate that existing learnable metrics for text simplification benefit from training on our LLMs-as-a-jury-rated synthetic data, closing the gap with pure LLMs-as-a-jury for evaluation. Overall, through our case study on text simplification, we show that a reliable evaluation requires higher quality test data, which could be obtained through synthetic data and LLMs-as-a-jury ratings.
zh

[NLP-95] Beyond Memorization: Mapping the Originality-Quality Frontier of Language Models

【速读】：该论文试图解决大型语言模型（Large Language Models, LLMs）在创意任务中生成新颖输出的能力评估问题。传统方法仅从训练数据的角度衡量新颖性，忽略了生成质量；而基于人类偏好的评估又可能倾向于高质量但缺乏原创性的内容，导致评估可靠性不足。为此，论文提出了一种新的新颖性度量方法，即通过训练过程中未见的n-gram比例与特定任务的质量评分的调和平均值来平衡原创性和质量。解决方案的关键在于引入这一综合考虑原创性和质量的新颖性度量，并通过实验验证不同方法对提升生成文本新颖性的影响，同时揭示了原创性与质量之间的权衡关系。研究表明，增加模型规模或应用后训练技术能够更有效地提升新颖性，而非单纯依赖推理阶段的调整方法。

链接: https://arxiv.org/abs/2504.09389
作者: Vishakh Padmakumar,Chen Yueh-Han,Jane Pan,Valerie Chen,He He
机构: New York University; Carnegie Mellon University (卡内基梅隆大学)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:As large language models (LLMs) are increasingly used for ideation and scientific discovery, it is important to evaluate their ability to generate novel output. Prior work evaluates novelty as the originality with respect to training data, but original outputs can be low quality. In contrast, non-expert judges may favor high-quality but memorized outputs, limiting the reliability of human preference as a metric. We propose a new novelty metric for LLM generations that balances originality and quality – the harmonic mean of the fraction of \ngrams unseen during training and a task-specific quality score. We evaluate the novelty of generations from two families of open-data models (OLMo and Pythia) on three creative tasks: story completion, poetry writing, and creative tool use. We find that LLM generated text is less novel than human written text. To elicit more novel outputs, we experiment with various inference-time methods, which reveals a trade-off between originality and quality. While these methods can boost novelty, they do so by increasing originality at the expense of quality. In contrast, increasing model size or applying post-training reliably shifts the Pareto frontier, highlighting that starting with a stronger base model is a more effective way to improve novelty.
zh

[NLP-96] On Language Models Sensitivity to Suspicious Coincidences

【速读】：该论文试图解决的问题是：分析语言模型（Language Models, LMs）是否表现出与人类相似的“可疑巧合”（suspicious coincidence）效应，即在归纳推理过程中，倾向于选择更小、更具体的假设而非更通用的假设。这一现象反映了人类对数据采样方式的敏感性以及对任务沟通目标的理解。

解决方案的关键在于通过两种实验设置来研究语言模型的行为：1）数字游戏任务，评估模型对一组数字是否适合某个候选数字的判断；2）扩展城市名称的游戏场景。研究者发现，在零样本（zero-shot）条件下，语言模型并未显示出明显的“可疑巧合”效应。然而，当通过链式思维（chain-of-thought）提示或显式提供假设空间后，模型开始表现出类似人类的“可疑巧合”效应。这表明，语言模型的归纳推理能力可以通过显式访问假设空间得以增强。

链接: https://arxiv.org/abs/2504.09387
作者: Sriram Padmanabhan,Kanishka Misra,Kyle Mahowald,Eunsol Choi
机构: The University of Texas at Austin (德克萨斯大学奥斯汀分校); Toyota Technological Institute at Chicago (芝加哥丰田技术研究所); New York University (纽约大学)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Humans are sensitive to suspicious coincidences when generalizing inductively over data, as they make assumptions as to how the data was sampled. This results in smaller, more specific hypotheses being favored over more general ones. For instance, when provided the set Austin, Dallas, Houston, one is more likely to think that this is sampled from “Texas Cities” over “US Cities” even though both are compatible. Suspicious coincidence is strongly connected to pragmatic reasoning, and can serve as a testbed to analyze systems on their sensitivity towards the communicative goals of the task (i.e., figuring out the true category underlying the data). In this paper, we analyze whether suspicious coincidence effects are reflected in language models’ (LMs) behavior. We do so in the context of two domains: 1) the number game, where humans made judgments of whether a number (e.g., 4) fits a list of given numbers (e.g., 16, 32, 2); and 2) by extending the number game setup to prominent cities. For both domains, the data is compatible with multiple hypotheses and we study which hypothesis is most consistent with the models’ behavior. On analyzing five models, we do not find strong evidence for suspicious coincidences in LMs’ zero-shot behavior. However, when provided access to the hypotheses space via chain-of-thought or explicit prompting, LMs start to show an effect resembling suspicious coincidences, sometimes even showing effects consistent with humans. Our study suggests that inductive reasoning behavior in LMs can be enhanced with explicit access to the hypothesis landscape.
zh

[NLP-97] Can you map it to English? The Role of Cross-Lingual Alignment in Multilingual Performance of LLM s

【速读】：该论文试图解决大型语言模型（Large Language Models, LLMs）在多语言文本表示对齐机制及其与自然语言理解及翻译任务性能关联性的理解不足这一问题。论文的关键解决方案在于引入跨语言对齐度量指标，如辨别性对齐指数（Discriminative Alignment Index, DALI），用于量化实例级别的对齐情况，并通过在三种自然语言理解任务（Belebele、XStoryCloze、XCOPA）以及机器翻译任务上的实验，揭示了跨语言对齐在语言级别上与任务准确性有强相关性，但在样本级别上往往无法有效区分正确与错误预测，表明对齐是成功的必要但不充分条件。

链接: https://arxiv.org/abs/2504.09378
作者: Kartik Ravisankar,Hyojung Han,Marine Carpuat
机构: University of Maryland (马里兰大学)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Large language models (LLMs) pre-trained predominantly on English text exhibit surprising multilingual capabilities, yet the mechanisms driving cross-lingual generalization remain poorly understood. This work investigates how the alignment of representations for text written in different languages correlates with LLM performance on natural language understanding tasks and translation tasks, both at the language and the instance level. For this purpose, we introduce cross-lingual alignment metrics such as the Discriminative Alignment Index (DALI) to quantify the alignment at an instance level for discriminative tasks. Through experiments on three natural language understanding tasks (Belebele, XStoryCloze, XCOPA), and machine translation, we find that while cross-lingual alignment metrics strongly correlate with task accuracy at the language level, the sample-level alignment often fails to distinguish correct from incorrect predictions, exposing alignment as a necessary but insufficient condition for success.
zh

[NLP-98] QUDsim: Quantifying Discourse Similarities in LLM -Generated Text

【速读】：该论文旨在解决大型语言模型（Large Language Models, LLMs）在生成独特且富有创意内容方面的不足问题。尽管LLMs能够覆盖多样主题生成文本，但普遍存在重复性高的现象。现有基于词项重叠和句法模式的相似性度量主要捕捉内容层面的重叠，无法有效检测话语结构上的相似性。为解决此问题，论文的关键在于引入基于问题-under-讨论（Questions Under Discussion, QUD）及其语义理论的话语抽象方法，以量化不同文本间话语进展的差异，并据此构建了一个名为QUDsim的新相似性度量工具。通过QUDsim，研究发现LLMs在跨样本生成时频繁复用话语结构（比人类更甚），不仅表现出结构性单一性和重复性，还与人类作者使用的话语结构类型存在偏差。

链接: https://arxiv.org/abs/2504.09373
作者: Ramya Namuduri,Yating Wu,Anshun Asher Zheng,Manya Wadhwa,Greg Durrett,Junyi Jessy Li
机构: 未知
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:As large language models become increasingly capable at various writing tasks, their weakness at generating unique and creative content becomes a major liability. Although LLMs have the ability to generate text covering diverse topics, there is an overall sense of repetitiveness across texts that we aim to formalize and quantify via a similarity metric. The familiarity between documents arises from the persistence of underlying discourse structures. However, existing similarity metrics dependent on lexical overlap and syntactic patterns largely capture \textitcontent overlap, thus making them unsuitable for detecting \textitstructural similarities. We introduce an abstraction based on linguistic theories in Questions Under Discussion (QUD) and question semantics to help quantify differences in discourse progression. We then use this framework to build \textbfQUDsim , a similarity metric that can detect discursive parallels between documents. Using QUDsim, we find that LLMs often reuse discourse structures (more so than humans) across samples, even when content differs. Furthermore, LLMs are not only repetitive and structurally uniform, but are also divergent from human authors in the types of structures they use.
zh

[NLP-99] REMEMBER: Retrieval-based Explainable Multimodal Evidence-guided Modeling for Brain Evaluation and Reasoning in Zero- and Few-shot Neurodegenerative Diagnosis

【速读】：本文旨在解决神经退行性疾病（如阿尔茨海默病）诊断中现有深度学习模型依赖大规模标注数据且缺乏可解释性的问题。特别是在临床实践中，数据通常规模较小或未标注，限制了深度学习方法的潜力。为应对这些挑战，研究引入了REMEMBER框架，其核心在于通过基于参考的推理过程实现零样本和少样本阿尔茨海默病诊断。关键解决方案是首先利用专家注释的数据训练对比对齐的视觉-文本模型，并扩展伪文本模态以编码异常类型、诊断标签及复合临床描述；在推理阶段，从精心策划的数据集中检索相似的人类验证案例，并通过专用的证据编码模块和注意力机制推理头整合上下文信息，从而实现基于证据的预测，生成可解释的诊断报告，包括参考图像与临床工作流程一致的解释。这种设计使REMEMBER能够模仿真实的临床决策过程，同时在有限数据条件下展现出强大的实际应用价值。

链接: https://arxiv.org/abs/2504.09354
作者: Duy-Cat Can,Quang-Huy Tang,Huong Ha,Binh T. Nguyen,Oliver Y. Chén
机构: Lausanne University Hospital (CHUV); University of Lausanne (UNIL); Vietnam National University; Department of Computer Science, University of Science, Vietnam National University Ho Chi Minh City; School of Biomedical Engineering, International University, Vietnam National University Ho Chi Minh City; University of Science, Vietnam National University Ho Chi Minh City; Lausanne University Hospital (CHUV); University of Lausanne (UNIL)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Quantitative Methods (q-bio.QM)
备注:

点击查看摘要

Abstract:Timely and accurate diagnosis of neurodegenerative disorders, such as Alzheimer’s disease, is central to disease management. Existing deep learning models require large-scale annotated datasets and often function as “black boxes”. Additionally, datasets in clinical practice are frequently small or unlabeled, restricting the full potential of deep learning methods. Here, we introduce REMEMBER – Retrieval-based Explainable Multimodal Evidence-guided Modeling for Brain Evaluation and Reasoning – a new machine learning framework that facilitates zero- and few-shot Alzheimer’s diagnosis using brain MRI scans through a reference-based reasoning process. Specifically, REMEMBER first trains a contrastively aligned vision-text model using expert-annotated reference data and extends pseudo-text modalities that encode abnormality types, diagnosis labels, and composite clinical descriptions. Then, at inference time, REMEMBER retrieves similar, human-validated cases from a curated dataset and integrates their contextual information through a dedicated evidence encoding module and attention-based inference head. Such an evidence-guided design enables REMEMBER to imitate real-world clinical decision-making process by grounding predictions in retrieved imaging and textual context. Specifically, REMEMBER outputs diagnostic predictions alongside an interpretable report, including reference images and explanations aligned with clinical workflows. Experimental results demonstrate that REMEMBER achieves robust zero- and few-shot performance and offers a powerful and explainable framework to neuroimaging-based diagnosis in the real world, especially under limited data.
zh

[NLP-100] Improving the Accuracy and Efficiency of Legal Document Tagging with Large Language Models and Instruction Prompts

【速读】：该论文致力于解决法律多标签分类任务中面临的挑战，包括法律语言的复杂性、标签间的复杂依赖关系以及显著的标签不平衡问题。为应对这些挑战，论文提出了一种名为Legal-LLM的新方法，其关键是通过微调充分利用大规模语言模型（Large Language Models, LLMs）的指令跟随能力。该方法将多标签分类任务重新定义为结构化生成问题，指导LLM直接输出给定文档的相关法律类别。实验结果表明，Legal-LLM在POSTURE50K和EURLEX57K两个基准数据集上的表现优于多种强基线模型，并通过消融研究和人工评估验证了其在处理标签不平衡及生成相关且准确的法律标签方面的有效性。

链接: https://arxiv.org/abs/2504.09309
作者: Emily Johnson,Xavier Holt,Noah Wilson
机构: University of Massachusetts, Amherst (马萨诸塞大学阿默斯特分校)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Legal multi-label classification is a critical task for organizing and accessing the vast amount of legal documentation. Despite its importance, it faces challenges such as the complexity of legal language, intricate label dependencies, and significant label imbalance. In this paper, we propose Legal-LLM, a novel approach that leverages the instruction-following capabilities of Large Language Models (LLMs) through fine-tuning. We reframe the multi-label classification task as a structured generation problem, instructing the LLM to directly output the relevant legal categories for a given document. We evaluate our method on two benchmark datasets, POSTURE50K and EURLEX57K, using micro-F1 and macro-F1 scores. Our experimental results demonstrate that Legal-LLM outperforms a range of strong baseline models, including traditional methods and other Transformer-based approaches. Furthermore, ablation studies and human evaluations validate the effectiveness of our approach, particularly in handling label imbalance and generating relevant and accurate legal labels.
zh

[NLP-101] Enhancing Contrastive Demonstration Selection with Semantic Diversity for Robust In-Context Machine Translation

【速读】：该论文旨在解决In-Context Learning (ICL) 中演示示例选择对性能高度敏感的问题，特别是现有方法在选择示例时往往忽视了多样性的重要性。论文的关键创新在于提出了一种名为DiverseConE (Diversity-Enhanced Contrastive Example Selection) 的新方法，该方法在对比选择的基础上引入了一个基于嵌入空间不相似性的多样性增强步骤。通过这种方式，DiverseConE 在保持示例间对比性的同时，显著提升了示例集的多样性，从而有效改善了机器翻译任务中的翻译质量。实验结果表明，DiverseConE 在多个语言对（英语-中文、中文-英语、俄语-德语、德语-俄语）的 1-shot 和 3-shot 设置下均优于随机选择、BM25、TopK 等基线方法以及最先进的对比选择方法，并通过多样性和人工评估进一步验证了其有效性。

链接: https://arxiv.org/abs/2504.09305
作者: Owen Patterson,Chee Ng
机构: Universiti Teknologi Malaysia (马来西亚科技大学)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:In-Context Learning (ICL) empowers large language models to perform tasks by conditioning on a few input-output examples. However, the performance of ICL is highly sensitive to the selection of these demonstrations. While existing methods focus on similarity or contrastive selection, they often overlook the importance of diversity among the chosen examples. In this paper, we propose DiverseConE (Diversity-Enhanced Contrastive Example Selection), a novel approach for demonstration selection in in-context learning for machine translation. Our method builds upon contrastive selection by incorporating a diversity enhancement step based on embedding space dissimilarity. We conduct extensive experiments on the Llama2-7b model across four language pairs (English-Chinese, Chinese-English, Russian-German, German-Russian) in 1-shot and 3-shot settings, using COMET20 and COMET22 for evaluation. Our results demonstrate that DiverseConE consistently outperforms strong baseline methods, including random selection, BM25, TopK, and a state-of-the-art contrastive selection method. Further analysis, including diversity metrics and human evaluation, validates the effectiveness of our approach and highlights the benefits of considering demonstration diversity for improved translation quality.
zh

[NLP-102] Linguistic Comparison of AI- and Human-Written Responses to Online Mental Health Queries

【速读】：本文旨在探讨生成式人工智能（GenAI）在模拟人类同侪支持方面的能力与局限性，特别是在在线心理健康社区（OMHCs）中提供即时、个性化心理支持的潜力。研究的关键在于评估当前最先进的大型语言模型（LLMs），如GPT-4-Turbo、Llama-3和Mistral-7B，在处理真实世界心理健康相关文本时的表现，并将其输出与人类编写的回应进行对比分析。通过定量的语言学指标以及定性的洞察，研究揭示了AI响应的特点，包括其更高的可读性和结构化表达，但也指出缺乏人际互动中的语言多样性和个人叙述特性。因此，本研究强调了如何在保持AI技术优势的同时，平衡其规模扩展性和时效性与人类交互的真实感、社交性和专业性之间的关系，为未来将GenAI整合进OMHCs提供了伦理与实践上的指导框架。

链接: https://arxiv.org/abs/2504.09271
作者: Koustuv Saha,Yoshee Jain,Munmun De Choudhury
机构: Siebel School of Computing and Data Science, The Grainger College of Engineering, University of Illinois Urbana-Champaign (西贝尔计算与数据科学学院，格兰杰工程学院，伊利诺伊大学香槟分校); School of Interactive Computing, College of Computing, Georgia Institute of Technology (交互计算学院，计算学院，乔治亚理工学院)
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Social and Information Networks (cs.SI)
备注:

点击查看摘要

Abstract:The ubiquity and widespread use of digital and online technologies have transformed mental health support, with online mental health communities (OMHCs) providing safe spaces for peer support. More recently, generative AI and large language models (LLMs) have introduced new possibilities for scalable, around-the-clock mental health assistance that could potentially augment and supplement the capabilities of OMHCs. Although genAI shows promise in delivering immediate and personalized responses, their effectiveness in replicating the nuanced, experience-based support of human peers remains an open question. In this study, we harnessed 24,114 posts and 138,758 online community (OC) responses from 55 OMHCs on Reddit. We prompted several state-of-the-art LLMs (GPT-4-Turbo, Llama-3, and Mistral-7B) with these posts, and compared their (AI) responses to human-written (OC) responses based on a variety of linguistic measures across psycholinguistics and lexico-semantics. Our findings revealed that AI responses are more verbose, readable, and analytically structured, but lack linguistic diversity and personal narratives inherent in human-human interactions. Through a qualitative examination, we found validation as well as complementary insights into the nature of AI responses, such as its neutrality of stance and the absence of seeking back-and-forth clarifications. We discuss the ethical and practical implications of integrating generative AI into OMHCs, advocating for frameworks that balance AI’s scalability and timeliness with the irreplaceable authenticity, social interactiveness, and expertise of human connections that form the ethos of online support communities.
zh

[NLP-103] Mixture of Group Experts for Learning Invariant Representations

【速读】：该论文旨在解决传统稀疏激活Mixture-of-Experts (MoE) 模型中专家多样性与专业化受限的问题，特别是在专家数量增加时性能和可扩展性受到制约的情况。论文提出了一种基于分组稀疏正则化的新方法——Mixture of Group Experts (MoGE)，其关键在于通过在top-k路由输入上施加结构化约束来间接正则化专家，同时保持原始MoE架构不变。此外，MoGE将路由输入组织成二维拓扑图，以空间方式对相邻元素进行分组，从而捕获对小变换不变的表示，显著提升专家的多样性和专业化水平。实验结果表明，MoGE在多种Transformer模型上的图像分类和语言建模任务中明显优于标准MoE，且仅带来极小的额外内存和计算开销。

链接: https://arxiv.org/abs/2504.09265
作者: Lei Kang,Jia Li,Mi Tian,Hua Huang
机构: Beijing Normal University (北京师范大学); TAL Education Group (学而思教育集团)
类目: Machine Learning (cs.LG); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Sparsely activated Mixture-of-Experts (MoE) models effectively increase the number of parameters while maintaining consistent computational costs per token. However, vanilla MoE models often suffer from limited diversity and specialization among experts, constraining their performance and scalability, especially as the number of experts increases. In this paper, we present a novel perspective on vanilla MoE with top- k routing inspired by sparse representation. This allows us to bridge established theoretical insights from sparse representation into MoE models. Building on this foundation, we propose a group sparse regularization approach for the input of top- k routing, termed Mixture of Group Experts (MoGE). MoGE indirectly regularizes experts by imposing structural constraints on the routing inputs, while preserving the original MoE architecture. Furthermore, we organize the routing input into a 2D topographic map, spatially grouping neighboring elements. This structure enables MoGE to capture representations invariant to minor transformations, thereby significantly enhancing expert diversity and specialization. Comprehensive evaluations across various Transformer models for image classification and language modeling tasks demonstrate that MoGE substantially outperforms its MoE counterpart, with minimal additional memory and computation overhead. Our approach provides a simple yet effective solution to scale the number of experts and reduce redundancy among them. The source code is included in the supplementary material and will be publicly released.
zh

[NLP-104] MiMIC: Multi-Modal Indian Earnings Calls Dataset to Predict Stock Prices

【速读】：本文旨在解决利用企业财报电话会议信息预测股市价格波动这一长期挑战。传统方法难以有效整合多样化的信息源，而本文提出的关键解决方案是构建一个多模态预测模型（Multimodal Predictive Model）。该模型通过整合来自财报电话会议文本转录（earnings call transcripts）、配套演示文稿中的图像与表格等多模态数据，结合基本面和技术指标等量化变量，实现了对财报电话会议后首个交易日股票价格走势的预测。为支持这一研究，作者创建了MiMIC（多模态印度财报电话会议）数据集，并采用了一种将文本与视觉模态的预测信号与定量变量相结合的多模态分析框架，从而实现全面的特征表征与分析。这种多模态方法的核心在于有效融合多种信息来源以提升金融预测准确性，为计算经济学领域的进一步研究提供了重要参考。

链接: https://arxiv.org/abs/2504.09257
作者: Sohom Ghosh,Arnab Maji,Sudip Kumar Naskar
机构: Jadavpur University ( Jadavpur University ); Independent Researcher ( Independent Researcher )
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: Code and Dataset: this https URL

点击查看摘要

Abstract:Predicting stock market prices following corporate earnings calls remains a significant challenge for investors and researchers alike, requiring innovative approaches that can process diverse information sources. This study investigates the impact of corporate earnings calls on stock prices by introducing a multi-modal predictive model. We leverage textual data from earnings call transcripts, along with images and tables from accompanying presentations, to forecast stock price movements on the trading day immediately following these calls. To facilitate this research, we developed the MiMIC (Multi-Modal Indian Earnings Calls) dataset, encompassing companies representing the Nifty 50, Nifty MidCap 50, and Nifty Small 50 indices. The dataset includes earnings call transcripts, presentations, fundamentals, technical indicators, and subsequent stock prices. We present a multimodal analytical framework that integrates quantitative variables with predictive signals derived from textual and visual modalities, thereby enabling a holistic approach to feature representation and analysis. This multi-modal approach demonstrates the potential for integrating diverse information sources to enhance financial forecasting accuracy. To promote further research in computational economics, we have made the MiMIC dataset publicly available under the CC-NC-SA-4.0 licence. Our work contributes to the growing body of literature on market reactions to corporate communications and highlights the efficacy of multi-modal machine learning techniques in financial analysis.
zh

[NLP-105] Feature-Aware Malicious Output Detection and Mitigation

【速读】：该论文旨在解决大型语言模型（Large Language Models, LLMs）在应对恶意内容（如毒性词汇）时缺乏有效防御机制的问题，特别是针对绕过模型安全限制的“越狱攻击”（jailbreak）。论文的关键解决方案是提出了一种特征感知的有害响应拒绝方法（Feature-aware Malicious Response Rejection, FMM）。该方法通过在模型的特征空间中检测潜在的恶意特征，并自适应调整模型的拒绝机制来实现。具体而言，FMM 在解码阶段利用一个简单的判别器检测潜在的恶意特性，并在识别到毒性标记时重新生成当前令牌。此外，通过激活修补（activation patching）技术，在后续令牌生成过程中引入额外的拒绝向量，引导模型拒绝有害输出。实验结果表明，FMM 方法能够在保持模型标准生成能力的同时，有效应对多种语言模型和攻击技术。

链接: https://arxiv.org/abs/2504.09191
作者: Weilong Dong,Peiguang Li,Yu Tian,Xinyi Zeng,Fengdi Li,Sirui Wang
机构: 未知
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:The rapid advancement of large language models (LLMs) has brought significant benefits to various domains while introducing substantial risks. Despite being fine-tuned through reinforcement learning, LLMs lack the capability to discern malicious content, limiting their defense against jailbreak. To address these safety concerns, we propose a feature-aware method for harmful response rejection (FMM), which detects the presence of malicious features within the model’s feature space and adaptively adjusts the model’s rejection mechanism. By employing a simple discriminator, we detect potential malicious traits during the decoding phase. Upon detecting features indicative of toxic tokens, FMM regenerates the current token. By employing activation patching, an additional rejection vector is incorporated during the subsequent token generation, steering the model towards a refusal response. Experimental results demonstrate the effectiveness of our approach across multiple language models and diverse attack techniques, while crucially maintaining the models’ standard generation capabilities.
zh

[NLP-106] Parameterized Synthetic Text Generation with SimpleStories

【速读】：该论文试图解决如何在大规模合成文本生成中同时实现简单性和多样性的问题。TinyStories 数据集存在的一些局限性为此研究提供了切入点。论文的关键解决方案在于提出了一种通过多层级抽象特征参数化提示的方法，这种方法能够系统性地控制故事的特性，从而确保语法和语义上的广泛多样性，最终构建了一个包含 200 万篇故事的 SimpleStories 数据集，分别用英语和日语编写。

链接: https://arxiv.org/abs/2504.09184
作者: Lennart Finke,Thomas Dooms,Mat Allen,Juan Diego Rodriguez,Noa Nabeshima,Dan Braun
机构: ETH Zürich; University of Antwerp (安特卫普大学); Dioptra; The University of Texas at Austin (德克萨斯大学奥斯汀分校); Independent; Apollo Research
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:We present SimpleStories, a large synthetic story dataset in simple language, consisting of 2 million stories each in English and Japanese. Our method employs parametrization of prompts with features at multiple levels of abstraction, allowing for systematic control over story characteristics to ensure broad syntactic and semantic diversity. Building on and addressing limitations in the TinyStories dataset, our approach demonstrates that simplicity and variety can be achieved simultaneously in synthetic text generation at scale.
zh

[NLP-107] Langformers: Unified NLP Pipelines for Language Models

【速读】：该论文旨在解决使用基于 Transformer 的语言模型进行自然语言处理（NLP）任务时面临的复杂性问题，包括多框架切换、重复性代码编写等挑战，这些问题可能阻碍非程序员、初学者以及甚至有经验的开发者快速构建原型。为了解决这些问题，论文提出了一种名为 Langformers 的开源 Python 库，其核心创新点在于通过统一且基于工厂模式的接口简化 NLP 管道设计，支持大语言模型（LLM）和掩码语言模型（MLM）任务。Langformers 将对话 AI、预训练掩码任务、文本分类、句子嵌入/重排序、数据标注、语义搜索及知识蒸馏等功能整合到一个连贯的 API 中，并兼容 Hugging Face 和 Ollama 等主流平台。其关键解决方案包括：(1) 面向特定任务的工厂类以抽象训练、推理和部署的复杂性；(2) 内置记忆与流式处理功能以增强对话代理能力；(3) 轻量级模块化设计以提升易用性。文档地址详见上述链接。

链接: https://arxiv.org/abs/2504.09170
作者: Rabindra Lamsal,Maria Rodriguez Read,Shanika Karunasekera
机构: The University of Melbourne (墨尔本大学)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Transformer-based language models have revolutionized the field of natural language processing (NLP). However, using these models often involves navigating multiple frameworks and tools, as well as writing repetitive boilerplate code. This complexity can discourage non-programmers and beginners, and even slow down prototyping for experienced developers. To address these challenges, we introduce Langformers, an open-source Python library designed to streamline NLP pipelines through a unified, factory-based interface for large language model (LLM) and masked language model (MLM) tasks. Langformers integrates conversational AI, MLM pretraining, text classification, sentence embedding/reranking, data labelling, semantic search, and knowledge distillation into a cohesive API, supporting popular platforms such as Hugging Face and Ollama. Key innovations include: (1) task-specific factories that abstract training, inference, and deployment complexities; (2) built-in memory and streaming for conversational agents; and (3) lightweight, modular design that prioritizes ease of use. Documentation: this https URL
zh

[NLP-108] Can postgraduate translation students identify machine-generated text?

【速读】：该论文试图解决的问题是评估受过语言训练的个体（如翻译学生）在辨别机器生成文本（ST）与人类撰写文本（HT）方面的能力，并探讨生成式人工智能（Generative AI）作为多语言内容创作工具的可行性。论文的关键解决方案在于通过短暂的训练让参与者识别合成文本中常见的文本异常特征，并要求他们对文本片段进行来源判断，从而量化其区分能力。研究发现，尽管参与者能够识别某些特定的文本异常，但整体上难以有效区分HT和ST，这表明现有训练方法存在不足，并提示需要进一步优化训练方案以及探索AI生成文本自然化编辑的必要性。

链接: https://arxiv.org/abs/2504.09164
作者: Michael Farrell
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 10 pages, accepted for MT Summit 2025, Geneva, Switzerland, 23-27 June 2025

点击查看摘要

Abstract:Given the growing use of generative artificial intelligence as a tool for creating multilingual content and bypassing both machine and traditional translation methods, this study explores the ability of linguistically trained individuals to discern machine-generated output from human-written text (HT). After brief training sessions on the textual anomalies typically found in synthetic text (ST), twenty-three postgraduate translation students analysed excerpts of Italian prose and assigned likelihood scores to indicate whether they believed they were human-written or AI-generated (ChatGPT-4o). The results show that, on average, the students struggled to distinguish between HT and ST, with only two participants achieving notable accuracy. Closer analysis revealed that the students often identified the same textual anomalies in both HT and ST, although features such as low burstiness and self-contradiction were more frequently associated with ST. These findings suggest the need for improvements in the preparatory training. Moreover, the study raises questions about the necessity of editing synthetic text to make it sound more human-like and recommends further research to determine whether AI-generated text is already sufficiently natural-sounding not to require further refinement.
zh

[NLP-109] Efficient and Asymptotically Unbiased Constrained Decoding for Large Language Models

【速读】：该论文旨在解决在大型语言模型的实际应用中，受限解码（constrained decoding）效率低下且可能导致输出分布引入无意偏倚的问题。传统基于前缀树（prefix-tree）的受限解码方法在基于GPU的模型推理范式下效率较低，并可能影响输出的公平性。为了解决这些问题，论文提出了一种名为Dynamic Importance Sampling for Constrained Decoding (DISC) 的新算法，其关键在于结合基于GPU的并行前缀验证（Parallel Prefix-Verification, PPV），通过动态重要性采样（dynamic importance sampling）实现理论上保证的渐近无偏性（asymptotic unbiasedness），同时克服了前缀树方法的低效问题。实验结果表明，该方法在效率和输出质量方面均优于现有方法。

链接: https://arxiv.org/abs/2504.09135
作者: Haotian Ye,Himanshu Jain,Chong You,Ananda Theertha Suresh,Haowei Lin,James Zou,Felix Yu
机构: 未知
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:In real-world applications of large language models, outputs are often required to be confined: selecting items from predefined product or document sets, generating phrases that comply with safety standards, or conforming to specialized formatting styles. To control the generation, constrained decoding has been widely adopted. However, existing prefix-tree-based constrained decoding is inefficient under GPU-based model inference paradigms, and it introduces unintended biases into the output distribution. This paper introduces Dynamic Importance Sampling for Constrained Decoding (DISC) with GPU-based Parallel Prefix-Verification (PPV), a novel algorithm that leverages dynamic importance sampling to achieve theoretically guaranteed asymptotic unbiasedness and overcomes the inefficiency of prefix-tree. Extensive experiments demonstrate the superiority of our method over existing methods in both efficiency and output quality. These results highlight the potential of our methods to improve constrained generation in applications where adherence to specific constraints is essential.
zh

[NLP-110] VisuoThink: Empowering LVLM Reasoning with Multimodal Tree Search

【速读】：该论文致力于解决现有大型视觉-语言模型在复杂推理任务中的局限性，这些任务通常需要人类通过视觉辅助和逐步思考来完成。传统方法主要依赖文本驱动的慢速思维或简单的视觉支持，但未能充分捕捉人类视觉-语言推理过程中复杂的交互特性。为克服这些限制，论文提出了一种名为VisuoThink的新框架，其关键是将视觉空间与语言领域无缝整合，通过渐进式的视觉-文本推理促进多模态慢速思维，并借助预测时的前瞻树搜索实现推理扩展。实验结果表明，VisuoThink在无需微调的情况下显著提升了几何和空间推理任务中的推理能力，达到了当前最先进的性能水平。

链接: https://arxiv.org/abs/2504.09130
作者: Yikun Wang,Siyin Wang,Qinyuan Cheng,Zhaoye Fei,Liang Ding,Qipeng Guo,Dacheng Tao,Xipeng Qiu
机构: Fudan University (复旦大学); Shanghai Innovation Institute (上海创新研究院); The University of Sydney (悉尼大学); Shanghai AI Laboratory (上海人工智能实验室); Nanyang Technological University (南洋理工大学)
类目: Computation and Language (cs.CL)
备注: 12 pages

点击查看摘要

Abstract:Recent advancements in Large Vision-Language Models have showcased remarkable capabilities. However, they often falter when confronted with complex reasoning tasks that humans typically address through visual aids and deliberate, step-by-step thinking. While existing methods have explored text-based slow thinking or rudimentary visual assistance, they fall short of capturing the intricate, interleaved nature of human visual-verbal reasoning processes. To overcome these limitations and inspired by the mechanisms of slow thinking in human cognition, we introduce VisuoThink, a novel framework that seamlessly integrates visuospatial and linguistic domains. VisuoThink facilitates multimodal slow thinking by enabling progressive visual-textual reasoning and incorporates test-time scaling through look-ahead tree search. Extensive experiments demonstrate that VisuoThink significantly enhances reasoning capabilities via inference-time scaling, even without fine-tuning, achieving state-of-the-art performance in tasks involving geometry and spatial reasoning.
zh

[NLP-111] Optimizing FDTD Solvers for Electromagnetics: A Compiler-Guided Approach with High-Level Tensor Abstractions

【速读】：该论文旨在解决传统有限差分时域（Finite Difference Time Domain, FDTD）方法在实现过程中因平台特定手工编码导致的可移植性差、开发开销大以及性能瓶颈等问题，限制其在现代硬件架构上的可扩展性。论文的关键解决方案是提出了一种基于MLIR/LLVM基础设施的端到端领域专用编译器，用于FDTD仿真。通过将三维FDTD内核表示为三维张量抽象上的操作，并结合显式的计算语义，该编译器能够自动应用高级优化技术（如循环分块、融合和向量化）。实验证明，与基于Python和NumPy的基线实现相比，该方法在Intel、AMD和ARM平台上实现了高达10倍的加速。

链接: https://arxiv.org/abs/2504.09118
作者: Yifei He,Måns I. Andersson,Stefano Markidis
机构: 未知
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:The Finite Difference Time Domain (FDTD) method is a widely used numerical technique for solving Maxwell’s equations, particularly in computational electromagnetics and photonics. It enables accurate modeling of wave propagation in complex media and structures but comes with significant computational challenges. Traditional FDTD implementations rely on handwritten, platform-specific code that optimizes certain kernels while underperforming in others. The lack of portability increases development overhead and creates performance bottlenecks, limiting scalability across modern hardware architectures. To address these challenges, we introduce an end-to-end domain-specific compiler based on the MLIR/LLVM infrastructure for FDTD simulations. Our approach generates efficient and portable code optimized for diverse hardware this http URL implement the three-dimensional FDTD kernel as operations on a 3D tensor abstraction with explicit computational semantics. High-level optimizations such as loop tiling, fusion, and vectorization are automatically applied by the compiler. We evaluate our customized code generation pipeline on Intel, AMD, and ARM platforms, achieving up to 10\times speedup over baseline Python implementation using NumPy.
zh

[NLP-112] A Short Survey on Small Reasoning Models: Training Inference Applications and Research Directions

【速读】：该论文试图解决大型推理模型（Large Reasoning Models, LRMs）在复杂推理任务中因计算需求过高而导致的效率瓶颈问题。论文的关键解决方案在于探索小型推理模型（Small Reasoning Models, SRMs），这些模型通常通过蒸馏技术从大型模型中获得，并以其更高的效率和独特的推理能力及认知路径为特点。论文通过对近170篇相关文献的综述，系统分析了SRMs的训练与推理技术，并探讨了其在特定领域的应用潜力及未来研究方向，旨在为研究人员提供高效实现高级推理功能的重要参考。

链接: https://arxiv.org/abs/2504.09100
作者: Chengyu Wang,Taolin Zhang,Richang Hong,Jun Huang
机构: Alibaba Cloud Computing (阿里云); School of Computer Science and Information Engineering, Hefei University of Technology (合肥工业大学)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Recently, the reasoning capabilities of large reasoning models (LRMs), such as DeepSeek-R1, have seen significant advancements through the slow thinking process. Despite these achievements, the substantial computational demands of LRMs present considerable challenges. In contrast, small reasoning models (SRMs), often distilled from larger ones, offer greater efficiency and can exhibit distinct capabilities and cognitive trajectories compared to LRMs. This work surveys around 170 recently published papers on SRMs for tackling various complex reasoning tasks. We review the current landscape of SRMs and analyze diverse training and inference techniques related to SRMs. Furthermore, we provide a comprehensive review of SRMs for domain-specific applications and discuss possible future research directions. This survey serves as an essential reference for researchers to leverage or develop SRMs for advanced reasoning functionalities with high efficiency.
zh

[NLP-113] Enhancing Dialogue Systems with Discourse-Level Understanding Using Deep Canonical Correlation Analysis

【速读】：该论文旨在解决现有对话模型在捕捉和利用长期对话历史方面的局限性问题。为应对这一挑战，论文提出了一种新颖的框架，通过引入深度典型相关分析（Deep Canonical Correlation Analysis, DCCA）实现话语级理解。该框架的关键在于学习话语标记（discourse tokens），以捕获话语片段与其周围上下文之间的关系，从而更好地建模长期依赖性。实验结果表明，在Ubuntu Dialogue Corpus上的响应选择任务中，基于改进的自动评估指标，该方法显著提升了性能，凸显了DCCA在过滤无关上下文、保留关键话语信息以实现更精确响应检索方面的潜力。

链接: https://arxiv.org/abs/2504.09094
作者: Akanksha Mehndiratta,Krishna Asawa
机构: 未知
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:The evolution of conversational agents has been driven by the need for more contextually aware systems that can effectively manage dialogue over extended interactions. To address the limitations of existing models in capturing and utilizing long-term conversational history, we propose a novel framework that integrates Deep Canonical Correlation Analysis (DCCA) for discourse-level understanding. This framework learns discourse tokens to capture relationships between utterances and their surrounding context, enabling a better understanding of long-term dependencies. Experiments on the Ubuntu Dialogue Corpus demonstrate significant enhancement in response selection, based on the improved automatic evaluation metric scores. The results highlight the potential of DCCA in improving dialogue systems by allowing them to filter out irrelevant context and retain critical discourse information for more accurate response retrieval.
zh

[NLP-114] A Multi-view Discourse Framework for Integrating Semantic and Syntactic Features in Dialog Agents

【速读】：该论文旨在解决现有基于检索的多轮对话系统中响应选择方法忽视话语上下文交互或同等对待所有utterances的问题。论文提出了一种基于语篇意识的框架，其关键是通过两步方法整合语义和句法特征以构建话语级理解：首先利用多视图典型相关分析（Multi-view Canonical Correlation Analysis, MCCA）编码每个utterance和response的上下文、位置及句法特征；然后通过典型相关分析（Canonical Correlation Analysis, CCA）在共享子空间中学习话语token，捕捉utterance与其周围轮次之间的关系。实验结果表明，该模型在Ubuntu Dialogue Corpus上的自动评估指标中取得了显著改进。

链接: https://arxiv.org/abs/2504.09073
作者: Akanksha Mehndiratta,Krishna Asawa
机构: 未知
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Multiturn dialogue models aim to generate human-like responses by leveraging conversational context, consisting of utterances from previous exchanges. Existing methods often neglect the interactions between these utterances or treat all of them as equally significant. This paper introduces a discourse-aware framework for response selection in retrieval-based dialogue systems. The proposed model first encodes each utterance and response with contextual, positional, and syntactic features using Multi-view Canonical Correlation Analysis (MCCA). It then learns discourse tokens that capture relationships between an utterance and its surrounding turns in a shared subspace via Canonical Correlation Analysis (CCA). This two-step approach effectively integrates semantic and syntactic features to build discourse-level understanding. Experiments on the Ubuntu Dialogue Corpus demonstrate that our model achieves significant improvements in automatic evaluation metrics, highlighting its effectiveness in response selection.
zh

[NLP-115] Exploration of Plan-Guided Summarization for Narrative Texts: the Case of Small Language Models NAACL2025

【速读】：该论文试图解决小语言模型（Small Language Models, SLMs）在长文档叙事任务中的摘要生成问题，特别是如何通过基于计划的引导方法减少幻觉（hallucinations），同时提高摘要的忠实性（faithfulness）。论文的关键在于分析和比较两种基于计划的方案：一种是针对细粒度细节（如日期或命名实体）的现有方法；另一种是作者提出的更高层次的、基于叙事的计划制定方法。研究结果表明，无论采用哪种计划引导方法，在摘要质量和忠实性方面均未显著优于无计划引导的基线模型。此外，虽然基于计划的摘要通常与计划保持一致，但计划本身同样容易包含幻觉，导致最终生成的摘要仍然不够忠实。因此，论文对基于计划的摘要方法提出了警示，尤其是在处理长且复杂的叙事文本领域。

链接: https://arxiv.org/abs/2504.09071
作者: Matt Grenander,Siddharth Varia,Paula Czarnowska,Yogarshi Vyas,Kishaloy Halder,Bonan Min
机构: AWS AI Labs; School of Informatics, University of Edinburgh (爱丁堡大学信息学院); PolyAI
类目: Computation and Language (cs.CL)
备注: Accepted to the 7th Workshop on Narrative Understanding (WNU), co-located with NAACL 2025

点击查看摘要

Abstract:Plan-guided summarization attempts to reduce hallucinations in small language models (SLMs) by grounding generated summaries to the source text, typically by targeting fine-grained details such as dates or named entities. In this work, we investigate whether plan-based approaches in SLMs improve summarization in long document, narrative tasks. Narrative texts’ length and complexity often mean they are difficult to summarize faithfully. We analyze existing plan-guided solutions targeting fine-grained details, and also propose our own higher-level, narrative-based plan formulation. Our results show that neither approach significantly improves on a baseline without planning in either summary quality or faithfulness. Human evaluation reveals that while plan-guided approaches are often well grounded to their plan, plans are equally likely to contain hallucinations compared to summaries. As a result, the plan-guided summaries are just as unfaithful as those from models without planning. Our work serves as a cautionary tale to plan-guided approaches to summarization, especially for long, complex domains such as narrative texts.
zh

[NLP-116] owards Stepwise Domain Knowledge-Driven Reasoning Optimization and Reflection Improvement

【速读】：该论文试图解决在需要领域特定专业知识的任务中，逐步监督链式思维（Chain of Thoughts, CoTs）方法的应用问题。传统蒙特卡洛树搜索（Monte Carlo Tree Search, MCTS）在逻辑推理任务中的贡献已得到验证，但在领域特定任务中的表现尚不明确。论文针对这一空白，识别了标准MCTS在处理领域知识驱动推理任务时的若干挑战，并提出了“逐步领域知识驱动推理优化框架”（Stepwise Domain Knowledge-Driven Reasoning Optimization），通过改进的MCTS算法实现对问题本质理解、推理及专业化知识的逐级监督。此外，还引入了“反思路径偏好优化”（Preference Optimization towards Reflection Paths），通过迭代学习从更优视角进行自我反思。关键解决方案在于结合领域知识与改进的MCTS算法，同时引入反思机制以增强推理过程的准确性与有效性，最终验证了所提方法在法律领域问题上的优势。

链接: https://arxiv.org/abs/2504.09058
作者: Chengyuan Liu,Shihang Wang,Lizhi Qing,Kaisong Song,Junjie Cao,Jun Lin,Ji Zhang,Ang Li,Kun Kuang,Fei Wu
机构: College of Computer Science and Technology, Zhejiang University (浙江大学计算机科学与技术学院); Tongyi Lab, Alibaba Group (通义实验室，阿里巴巴集团); Law&AI Lab, Zhejiang University (浙大法与人工智能实验室)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: Under review

点击查看摘要

Abstract:Recently, stepwise supervision on Chain of Thoughts (CoTs) presents an enhancement on the logical reasoning tasks such as coding and math, with the help of Monte Carlo Tree Search (MCTS). However, its contribution to tasks requiring domain-specific expertise and knowledge remains unexplored. Motivated by the interest, we identify several potential challenges of vanilla MCTS within this context, and propose the framework of Stepwise Domain Knowledge-Driven Reasoning Optimization, employing the MCTS algorithm to develop step-level supervision for problems that require essential comprehension, reasoning, and specialized knowledge. Additionally, we also introduce the Preference Optimization towards Reflection Paths, which iteratively learns self-reflection on the reasoning thoughts from better perspectives. We have conducted extensive experiments to evaluate the advantage of the methodologies. Empirical results demonstrate the effectiveness on various legal-domain problems. We also report a diverse set of valuable findings, hoping to encourage the enthusiasm to the research of domain-specific LLMs and MCTS.
zh

[NLP-117] From Punchlines to Predictions: A Metric to Assess LLM Performance in Identifying Humor in Stand-Up Comedy NAACL

【速读】：该论文旨在解决大型语言模型（Large Language Models, LLMs）在识别幽默语句方面的有效性问题。研究的核心目标是评估主流LLMs在从单口喜剧文本中准确提取幽默笑点的能力，并提出了一种新的幽默检测度量方法（Humor Detection Metric, HDM）。该方法的关键在于其模块化设计，通过模糊字符串匹配（fuzzy string matching）、句子嵌入（sentence embedding）以及子空间相似性（subspace similarity）三种评分机制，提供全面的性能评估。论文进一步将LLMs的表现与人类评估者进行对比分析，揭示了无论经过何种提示工程优化，领先模型如ChatGPT、Claude和DeepSeek的幽默检测准确率最高仅为51%，但仍显著优于人类评估者的41%。这一结果凸显了幽默理解中的主观性和从现场表演文本中提取幽默的复杂性。因此，论文的关键创新在于开发了能够综合多维度评估LLMs幽默识别能力的新度量方法。

链接: https://arxiv.org/abs/2504.09049
作者: Adrianna Romanowski,Pedro H. V. Valois,Kazuhiro Fukui
机构: International Christian University; University of Tsukuba (筑波大学)
类目: Computation and Language (cs.CL)
备注: Accepted to CMCL2025 @ NAACL

点击查看摘要

Abstract:Comedy serves as a profound reflection of the times we live in and is a staple element of human interactions. In light of the widespread adoption of Large Language Models (LLMs), the intersection of humor and AI has become no laughing matter. Advancements in the naturalness of human-computer interaction correlates with improvements in AI systems’ abilities to understand humor. In this study, we assess the ability of models in accurately identifying humorous quotes from a stand-up comedy transcript. Stand-up comedy’s unique comedic narratives make it an ideal dataset to improve the overall naturalness of comedic understanding. We propose a novel humor detection metric designed to evaluate LLMs amongst various prompts on their capability to extract humorous punchlines. The metric has a modular structure that offers three different scoring methods - fuzzy string matching, sentence embedding, and subspace similarity - to provide an overarching assessment of a model’s performance. The model’s results are compared against those of human evaluators on the same task. Our metric reveals that regardless of prompt engineering, leading models, ChatGPT, Claude, and DeepSeek, achieve scores of at most 51% in humor detection. Notably, this performance surpasses that of humans who achieve a score of 41%. The analysis of human evaluators and LLMs reveals variability in agreement, highlighting the subjectivity inherent in humor and the complexities involved in extracting humorous quotes from live performance transcripts. Code available at this https URL.
zh

[NLP-118] A Survey of Frontiers in LLM Reasoning Reason ing: Inference Scaling Learning to Reason and Agent ic Systems

【速读】：该论文试图系统性梳理大型语言模型（Large Language Models, LLMs）在推理（Reasoning）能力方面的现有方法及其发展趋势。论文关注的核心问题是：如何通过不同的训练和推理机制提升LLMs的推理能力，并探讨了从单一模型到包含外部工具及多智能体协作的复杂系统的设计与优化方法。论文提出了一种分类框架，将现有方法沿着两个正交维度进行归类：推理实现阶段（Regimes，即推理是在推理阶段完成还是通过专门训练获得）以及架构类型（Architectures，即推理过程涉及的组件，区分独立LLMs与结合外部工具的主动式复合系统及多智能体协作）。关键在于从输入层面构建高质量提示（Prompt），以及从输出层面优化采样结果以提升推理质量。此外，论文还分析了从基于推理扩展到学习推理（如DeepSeek-R1）以及向主动式工作流（如OpenAI Deep Research, Manus Agent）过渡的趋势，并涵盖了从监督微调到强化学习（如PPO、GRPO）等多样化的学习算法，以及推理器和验证器的训练设计。

链接: https://arxiv.org/abs/2504.09037
作者: Zixuan Ke,Fangkai Jiao,Yifei Ming,Xuan-Phi Nguyen,Austin Xu,Do Xuan Long,Minzhi Li,Chengwei Qin,Peifeng Wang,Silvio Savarese,Caiming Xiong,Shafiq Joty
机构: Salesforce AI Research (销售力量人工智能研究院); National University of Singapore (新加坡国立大学); Nanyang Technological University (南洋理工大学); IR2superscript𝑅2{}^{2}Rstart_FLOATSUPERSCRIPT 2 end_FLOATSUPERSCRIPT italic_R, A*STAR, Singapore (新加坡科技研究局)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: 72 pages, 6 figures

点击查看摘要

Abstract:Reasoning is a fundamental cognitive process that enables logical inference, problem-solving, and decision-making. With the rapid advancement of large language models (LLMs), reasoning has emerged as a key capability that distinguishes advanced AI systems from conventional models that empower chatbots. In this survey, we categorize existing methods along two orthogonal dimensions: (1) Regimes, which define the stage at which reasoning is achieved (either at inference time or through dedicated training); and (2) Architectures, which determine the components involved in the reasoning process, distinguishing between standalone LLMs and agentic compound systems that incorporate external tools, and multi-agent collaborations. Within each dimension, we analyze two key perspectives: (1) Input level, which focuses on techniques that construct high-quality prompts that the LLM condition on; and (2) Output level, which methods that refine multiple sampled candidates to enhance reasoning quality. This categorization provides a systematic understanding of the evolving landscape of LLM reasoning, highlighting emerging trends such as the shift from inference-scaling to learning-to-reason (e.g., DeepSeek-R1), and the transition to agentic workflows (e.g., OpenAI Deep Research, Manus Agent). Additionally, we cover a broad spectrum of learning algorithms, from supervised fine-tuning to reinforcement learning such as PPO and GRPO, and the training of reasoners and verifiers. We also examine key designs of agentic workflows, from established patterns like generator-evaluator and LLM debate to recent innovations. …
zh

[NLP-119] A Fully Automated Pipeline for Conversational Discourse Annotation: Tree Scheme Generation and Labeling with Large Language Models

【速读】：该论文试图解决通过自动化方法生成高质量话语注释方案的问题，以替代耗时且依赖专家知识的手动设计树状注释方案。论文的关键解决方案在于提出了一种完全自动化的流水线，利用大型语言模型（LLMs）构建注释方案，并通过基于频率引导的决策树与先进的LLM结合进行注释。实验表明，该方法不仅显著减少了注释所需的时间，还能在某些情况下匹配甚至超越人工设计的树状结构以及人类注释者的性能。

链接: https://arxiv.org/abs/2504.08961
作者: Kseniia Petukhova,Ekaterina Kochmar
机构: Mohamed bin Zayed University of Artificial Intelligence (穆罕默德·本·扎耶德人工智能大学)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Recent advances in Large Language Models (LLMs) have shown promise in automating discourse annotation for conversations. While manually designing tree annotation schemes significantly improves annotation quality for humans and models, their creation remains time-consuming and requires expert knowledge. We propose a fully automated pipeline that uses LLMs to construct such schemes and perform annotation. We evaluate our approach on speech functions (SFs) and the Switchboard-DAMSL (SWBD-DAMSL) taxonomies. Our experiments compare various design choices, and we show that frequency-guided decision trees, paired with an advanced LLM for annotation, can outperform previously manually designed trees and even match or surpass human annotators while significantly reducing the time required for annotation. We release all code and resultant schemes and annotations to facilitate future research on discourse annotation.
zh

[NLP-120] Generating Planning Feedback for Open-Ended Programming Exercises with LLM s

【速读】：该论文试图解决学生在完成开放式编程练习时，因语法错误导致无法获得关于其高层次规划过程反馈的问题。论文的关键解决方案在于利用大型语言模型（Large Language Models, LLM）检测学生代码中的高层次目标和模式（即编程计划），即使代码存在语法错误也能实现这一目标。研究显示，无论是完整的GPT-4o模型还是其小型变体GPT-4o-mini，在识别这些编程计划方面均表现出色，并且经过微调后，GPT-4o-mini能够达到与GPT-4o相当的结果，这为实时评分提供了成本效益更高的选择。这种小型模型可以集成到自动评分系统中，为学生的隐性规划技能提供反馈，即便程序存在语法错误。此外，LLM可能还适用于其他领域的问题反馈，如数学和物理问题，这些领域同样需要从高层次解题步骤逐步计算输出。

链接: https://arxiv.org/abs/2504.08958
作者: Mehmet Arif Demirtaş,Claire Zheng,Max Fowler,Kathryn Cunningham
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Accepted as full paper at AIED 2025

点击查看摘要

Abstract:To complete an open-ended programming exercise, students need to both plan a high-level solution and implement it using the appropriate syntax. However, these problems are often autograded on the correctness of the final submission through test cases, and students cannot get feedback on their planning process. Large language models (LLM) may be able to generate this feedback by detecting the overall code structure even for submissions with syntax errors. To this end, we propose an approach that detects which high-level goals and patterns (i.e. programming plans) exist in a student program with LLMs. We show that both the full GPT-4o model and a small variant (GPT-4o-mini) can detect these plans with remarkable accuracy, outperforming baselines inspired by conventional approaches to code analysis. We further show that the smaller, cost-effective variant (GPT-4o-mini) achieves results on par with state-of-the-art (GPT-4o) after fine-tuning, creating promising implications for smaller models for real-time grading. These smaller models can be incorporated into autograders for open-ended code-writing exercises to provide feedback for students’ implicit planning skills, even when their program is syntactically incorrect. Furthermore, LLMs may be useful in providing feedback for problems in other domains where students start with a set of high-level solution steps and iteratively compute the output, such as math and physics problems.
zh

[NLP-121] Large Language Model Empowered Recommendation Meets All-domain Continual Pre-Training

【速读】：本文旨在解决大型语言模型（Large Language Models, LLMs）在推荐系统中的两大挑战：一是通用语言表示与特定领域偏好模式之间的不匹配，二是对多领域推荐场景适应性的局限性。为解决这些问题，论文提出了一种名为CPRec的全领域持续预训练框架，其核心在于通过持续预训练范式全面对齐LLMs与普遍用户行为。关键解决方案包括设计统一提示模板以组织多领域用户行为，并采用针对推荐任务的Warmup-Stable-Annealing学习率调度策略，逐步提升LLMs从开放世界知识到通用推荐任务的知识适应能力。实验结果表明，该框架显著减小了语义-行为差异，并在所有推荐场景中达到了最先进的性能。

链接: https://arxiv.org/abs/2504.08949
作者: Haokai Ma,Yunshan Ma,Ruobing Xie,Lei Meng,Jialie Shen,Xingwu Sun,Zhanhui Kang,Tat-Seng Chua
机构: National University of Singapore(Singapore); Singapore Management University(Singapore); Tencent(腾讯)(China); Shandong University(山东大学)(China); City, University London(英国城市大学)(United Kingdom); Tencent(腾讯)(China); Tencent(腾讯)(China); National University of Singapore(Singapore)
类目: Information Retrieval (cs.IR); Computation and Language (cs.CL)
备注: In submission

点击查看摘要

Abstract:Recent research efforts have investigated how to integrate Large Language Models (LLMs) into recommendation, capitalizing on their semantic comprehension and open-world knowledge for user behavior understanding. These approaches predominantly employ supervised fine-tuning on single-domain user interactions to adapt LLMs for specific recommendation tasks. However, they typically encounter dual challenges: the mismatch between general language representations and domain-specific preference patterns, as well as the limited adaptability to multi-domain recommendation scenarios. To bridge these gaps, we introduce CPRec – an All-domain Continual Pre-Training framework for Recommendation – designed to holistically align LLMs with universal user behaviors through the continual pre-training paradigm. Specifically, we first design a unified prompt template and organize users’ multi-domain behaviors into domain-specific behavioral sequences and all-domain mixed behavioral sequences that emulate real-world user decision logic. To optimize behavioral knowledge infusion, we devise a Warmup-Stable-Annealing learning rate schedule tailored for the continual pre-training paradigm in recommendation to progressively enhance the LLM’s capability in knowledge adaptation from open-world knowledge to universal recommendation tasks. To evaluate the effectiveness of our CPRec, we implement it on a large-scale dataset covering seven domains and conduct extensive experiments on five real-world datasets from two distinct platforms. Experimental results confirm that our continual pre-training paradigm significantly mitigates the semantic-behavioral discrepancy and achieves state-of-the-art performance in all recommendation scenarios. The source code will be released upon acceptance.
zh

[NLP-122] Agent RewardBench: Evaluating Automatic Evaluations of Web Agent Trajectories

【速读】：该论文试图解决如何有效评估基于网络代理（web agents）的任务完成情况的问题。现有方法主要依赖基于规则的方法，但其扩展性差且可能无法准确识别成功的轨迹；人工评估虽然更精确，但速度慢且成本高；利用大型语言模型（LLMs）进行自动评估虽可避免规则设计和人工标注的挑战，但其有效性尚未明确。论文的关键解决方案是提出了AgentRewardBench，这是一个包含1302条轨迹、覆盖5个基准数据集和4种LLMs的新基准，用于评估LLMs作为评判器的有效性。通过该基准，研究发现没有单一LLM在所有基准上表现最优，并揭示了基于规则的评估方法通常低估网络代理成功率的局限性，强调了发展更灵活自动评估方法的必要性。

链接: https://arxiv.org/abs/2504.08942
作者: Xing Han Lù,Amirhossein Kazemnejad,Nicholas Meade,Arkil Patel,Dongchan Shin,Alejandra Zambrano,Karolina Stańczak,Peter Shaw,Christopher J. Pal,Siva Reddy
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Web agents enable users to perform tasks on web browsers through natural language interaction. Evaluating web agents trajectories is an important problem, since it helps us determine whether the agent successfully completed the tasks. Rule-based methods are widely used for this purpose, but they are challenging to extend to new tasks and may not always recognize successful trajectories. We may achieve higher accuracy through human evaluation, but the process would be substantially slower and more expensive. Automatic evaluations with LLMs may avoid the challenges of designing new rules and manually annotating trajectories, enabling faster and cost-effective evaluation. However, it is unclear how effective they are at evaluating web agents. To this end, we propose AgentRewardBench, the first benchmark to assess the effectiveness of LLM judges for evaluating web agents. AgentRewardBench contains 1302 trajectories across 5 benchmarks and 4 LLMs. Each trajectory in AgentRewardBench is reviewed by an expert, who answers questions pertaining to the success, side effects, and repetitiveness of the agent. Using our benchmark, we evaluate 12 LLM judges and find that no single LLM excels across all benchmarks. We also find that the rule-based evaluation used by common benchmarks tends to underreport the success rate of web agents, highlighting a key weakness of rule-based evaluation and the need to develop more flexible automatic evaluations. We release the benchmark at: this https URL
zh

[NLP-123] Spatial Audio Processing with Large Language Model on Wearable Devices

【速读】：本文旨在解决将空间上下文整合到大型语言模型（Large Language Models, LLMs）中以提升人机交互体验的问题，特别是在可穿戴设备中的应用潜力。论文的关键在于提出了一种新颖的系统架构，通过微结构空间感知技术从单声道麦克风中提取精确的方向到达角（Direction of Arrival, DoA）信息，并将其与语言嵌入相结合，使模型能够学习互补的上下文表示。为了解决现有数据集缺乏的问题，研究团队利用LibriSpeech数据集合成创建了一个名为OmniTalk的数据集。最终，这些融合后的嵌入被调整以适配LLaMA-3.2 3B模型，并采用轻量级适配技术LoRA进行微调，以优化设备端处理性能。此方案不仅实现了具有空间感知能力的自动语音识别（ASR），其平均误差显著降低至25.72°，相比已有工作的88.52°有大幅改进，同时支持多达五人的声音场景分析，方向估计中值误差仅为16°，从而在保持功耗效率、隐私保护及硬件限制的前提下展示了卓越的空间语音理解能力。

链接: https://arxiv.org/abs/2504.08907
作者: Ayushi Mishra,Yang Bai,Priyadarshan Narayanasamy,Nakul Garg,Nirupam Roy
机构: 未知
类目: ound (cs.SD); Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)
备注:

点击查看摘要

Abstract:Integrating spatial context into large language models (LLMs) has the potential to revolutionize human-computer interaction, particularly in wearable devices. In this work, we present a novel system architecture that incorporates spatial speech understanding into LLMs, enabling contextually aware and adaptive applications for wearable technologies. Our approach leverages microstructure-based spatial sensing to extract precise Direction of Arrival (DoA) information using a monaural microphone. To address the lack of existing dataset for microstructure-assisted speech recordings, we synthetically create a dataset called OmniTalk by using the LibriSpeech dataset. This spatial information is fused with linguistic embeddings from OpenAI’s Whisper model, allowing each modality to learn complementary contextual representations. The fused embeddings are aligned with the input space of LLaMA-3.2 3B model and fine-tuned with lightweight adaptation technique LoRA to optimize for on-device processing. SING supports spatially-aware automatic speech recognition (ASR), achieving a mean error of 25.72^\circ -a substantial improvement compared to the 88.52 ^\circ median error in existing work-with a word error rate (WER) of 5.3. SING also supports soundscaping, for example, inference how many people were talking and their directions, with up to 5 people and a median DoA error of 16 ^\circ . Our system demonstrates superior performance in spatial speech understanding while addressing the challenges of power efficiency, privacy, and hardware constraints, paving the way for advanced applications in augmented reality, accessibility, and immersive experiences.
zh

[NLP-124] Forecasting Communication Derailments Through Conversation Generation

【速读】：该论文试图解决通信 derailment（沟通脱轨）的未来预测问题，现有方法主要基于历史对话内容预测结局，但难以提前预知潜在的沟通脱轨。论文的关键解决方案是通过微调的语言模型 (fine-tuned LLM)，基于现有的对话历史采样多个未来的对话轨迹，并依据这些轨迹的一致性预测沟通结果。此外，研究还探索利用反映轮次级对话动态的社会语言学属性作为生成未来对话的引导，从而提升预测准确性。该方法在英语沟通脱轨预测基准测试中超越了最先进的结果，并在消融实验中显示出显著的精度提升。

链接: https://arxiv.org/abs/2504.08905
作者: Yunfan Zhang,Kathleen McKeown,Smaranda Muresan
机构: Columbia University (哥伦比亚大学)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Forecasting communication derailment can be useful in real-world settings such as online content moderation, conflict resolution, and business negotiations. However, despite language models’ success at identifying offensive speech present in conversations, they struggle to forecast future communication derailments. In contrast to prior work that predicts conversation outcomes solely based on the past conversation history, our approach samples multiple future conversation trajectories conditioned on existing conversation history using a fine-tuned LLM. It predicts the communication outcome based on the consensus of these trajectories. We also experimented with leveraging socio-linguistic attributes, which reflect turn-level conversation dynamics, as guidance when generating future conversations. Our method of future conversation trajectories surpasses state-of-the-art results on English communication derailment prediction benchmarks and demonstrates significant accuracy gains in ablation studies.
zh

[NLP-125] AI-University: An LLM -based platform for instructional alignment to scientific classrooms

【速读】：本文旨在解决如何通过生成式人工智能（Generative AI）实现课程内容的个性化交付，以适应不同教师的教学风格。解决方案的关键在于提出了一种名为AI University (AI-U) 的灵活框架，其核心是利用检索增强生成（Retrieval-Augmented Generation, RAG）技术微调大型语言模型（Large Language Model, LLM），使其能够生成与讲师风格一致的教学响应。这一方案通过构建系统化的训练数据管道、采用低秩适配（Low-Rank Adaptation, LoRA）微调开源LLM，并结合RAG进行响应优化，实现了与课程材料的高度一致性。评估结果显示，在86%的测试案例中，该专家模型的余弦相似度高于基准参考，且在五次对比中约有四次优于基础Llama 3.2模型。此外，开发的原型Web应用程序进一步增强了可追溯性，将AI生成的响应链接至课程材料的具体章节及开放访问视频讲座的时间标记实例。这些创新点共同构成了一个可扩展的AI辅助教育方法，为更广泛应用于高等教育铺平了道路，特别是在工程科学领域博士和硕士研究生的核心课程如有限元法（Finite Element Method, FEM）教学中。

链接: https://arxiv.org/abs/2504.08846
作者: Mostafa Faghih Shojaei,Rahul Gulati,Benjamin A. Jasperson,Shangshang Wang,Simone Cimolato,Dangli Cao,Willie Neiswanger,Krishna Garikipati
机构: Department of Aerospace and Mechanical Engineering, University of Southern California (南加州大学航天与机械工程系); Department of Computer Science, University of Southern California (南加州大学计算机科学系); Department of Astronautical Engineering, University of Southern California (南加州大学宇航工程系); Department of Electrical and Computer Engineering, University of Southern California (南加州大学电气与计算机工程系)
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: 10 pages, 3 figures

点击查看摘要

Abstract:We introduce AI University (AI-U), a flexible framework for AI-driven course content delivery that adapts to instructors’ teaching styles. At its core, AI-U fine-tunes a large language model (LLM) with retrieval-augmented generation (RAG) to generate instructor-aligned responses from lecture videos, notes, and textbooks. Using a graduate-level finite-element-method (FEM) course as a case study, we present a scalable pipeline to systematically construct training data, fine-tune an open-source LLM with Low-Rank Adaptation (LoRA), and optimize its responses through RAG-based synthesis. Our evaluation - combining cosine similarity, LLM-based assessment, and expert review - demonstrates strong alignment with course materials. We also have developed a prototype web application, available at this https URL, that enhances traceability by linking AI-generated responses to specific sections of the relevant course material and time-stamped instances of the open-access video lectures. Our expert model is found to have greater cosine similarity with a reference on 86% of test cases. An LLM judge also found our expert model to outperform the base Llama 3.2 model approximately four times out of five. AI-U offers a scalable approach to AI-assisted education, paving the way for broader adoption in higher education. Here, our framework has been presented in the setting of a class on FEM - a subject that is central to training PhD and Master students in engineering science. However, this setting is a particular instance of a broader context: fine-tuning LLMs to research content in science.
zh

[NLP-126] SD2: Self-Distilled Sparse Drafters

【速读】：该论文旨在解决大语言模型（Large Language Models, LLMs）推理延迟的问题，并提出了一种新的方法来生成高效且对齐的目标模型的草稿模型。论文的关键在于引入了自蒸馏稀疏起草者（Self-Distilled Sparse Drafters, SD²），这是一种结合自数据蒸馏和细粒度权重稀疏性的创新方法。通过这种方式，SD²系统性地提高了草稿标记的接受率，同时显著减少了乘积累加操作（Multiply-Accumulate operations, MACs），即使在通用辅助生成（Universal Assisted Generation, UAG）的设定下，即草稿模型和目标模型来自不同模型家族的情况下也能实现。实验结果显示，在Llama-3.1-70B目标模型上，与经过层修剪的草稿模型相比，SD²提供了1.59倍更高的平均接受长度（Mean Accepted Length, MAL），并且比密集型草稿模型减少了43.87%的MACs，同时MAL仅下降了8.36%。这些结果表明，基于稀疏性的微调和压缩策略能够有效提升LLMs的推理效率，同时保持与目标模型的良好对齐。

链接: https://arxiv.org/abs/2504.08838
作者: Mike Lasby,Nish Sinnadurai,Valavan Manohararajah,Sean Lie,Vithursan Thangarasa
机构: Cerebras Systems Inc. (Cerebras系统公司); Schulich School of Engineering, University of Calgary (卡加立大学谢尔丹工程学院)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 21 pages

点击查看摘要

Abstract:Speculative decoding is a powerful technique for reducing the latency of Large Language Models (LLMs), offering a fault-tolerant framework that enables the use of highly compressed draft models. In this work, we introduce Self-Distilled Sparse Drafters (SD ^2 ), a novel methodology that leverages self-data distillation and fine-grained weight sparsity to produce highly efficient and well-aligned draft models. SD ^2 systematically enhances draft token acceptance rates while significantly reducing Multiply-Accumulate operations (MACs), even in the Universal Assisted Generation (UAG) setting, where draft and target models originate from different model families. On a Llama-3.1-70B target model, SD ^2 provides a \times 1.59 higher Mean Accepted Length (MAL) compared to layer-pruned draft models and reduces MACs by over 43.87% with a 8.36% reduction in MAL compared to a dense draft models. Our results highlight the potential of sparsity-aware fine-tuning and compression strategies to improve LLM inference efficiency while maintaining alignment with target models.
zh

[NLP-127] CAReDiO: Cultural Alignment of LLM via Representativeness and Distinctiveness Guided Data Optimization

【速读】：该论文旨在解决现有文化对齐大语言模型（Culturally Aligned LLMs）方法中存在的两个关键挑战：(1) 数据代表性不足导致冗余，造成计算资源浪费；(2) 数据独特性难以体现，无法有效区分目标文化与其他相关文化的共享模式。为应对这些问题，论文提出了一种名为CAReDiO的新框架，其核心在于利用强大的LLMs自动生成优化的文化对话数据，通过最大化代表性和独特性来提升数据质量。实验结果表明，基于该方法构建的小型数据集在仅使用100个训练样本的情况下即可实现高效的文化对齐，显著提升了性能与效率。

链接: https://arxiv.org/abs/2504.08820
作者: Jing Yao,Xiaoyuan Yi,Jindong Wang,Zhicheng Dou,Xing Xie
机构: Microsoft Research Asia (微软研究院亚洲); William & Mary (威廉与玛丽学院); Renmin University of China (中国人民大学)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:As Large Language Models (LLMs) more deeply integrate into human life across various regions, aligning them with pluralistic cultures is crucial for improving user experience and mitigating cultural conflicts. Existing approaches develop culturally aligned LLMs primarily through fine-tuning with massive carefully curated culture-specific corpora. Nevertheless, inspired by culture theories, we identify two key challenges faced by these datasets: (1) Representativeness: These corpora fail to fully capture the target culture’s core characteristics with redundancy, causing computation waste; (2) Distinctiveness: They struggle to distinguish the unique nuances of a given culture from shared patterns across other relevant ones, hindering precise cultural modeling. To handle these challenges, we introduce CAReDiO, a novel cultural data construction framework. Specifically, CAReDiO utilizes powerful LLMs to automatically generate cultural conversation data, where both the queries and responses are further optimized by maximizing representativeness and distinctiveness. Using CAReDiO, we construct a small yet effective dataset, covering five cultures, and compare it with several recent cultural corpora. Extensive experiments demonstrate that our method generates more effective data and enables cultural alignment with as few as 100 training samples, enhancing both performance and efficiency.
zh

[NLP-128] Mechanistic Anomaly Detection for “Quirky” Language Models ICLR

【速读】：该论文试图解决大型语言模型（LLMs）在能力提升过程中，因监督失效导致的敏感性问题，特别是当LLMs对监督者未知的因素产生反应时。论文的关键解决方案是引入机制异常检测（Mechanistic Anomaly Detection, MAD）技术，通过利用模型的内部特征来识别与训练环境显著不同的异常训练信号，并对其进行调查或排除。论文实验表明，尽管某些任务上的检测器具有较高的区分能力，但没有单一检测器能在所有模型和任务中都保持有效性。因此，MAD技术可能适用于低风险应用场景，但在高风险场景下仍需进一步改进检测方法及评估手段。

链接: https://arxiv.org/abs/2504.08812
作者: David O. Johnston,Arkajyoti Chakraborty,Nora Belrose
机构: Eleuther AI; University of California Santa Cruz (加州大学圣克鲁兹分校); Open Philanthropy (开放慈善)
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注: ICLR Building Trust Workshop 2025

点击查看摘要

Abstract:As LLMs grow in capability, the task of supervising LLMs becomes more challenging. Supervision failures can occur if LLMs are sensitive to factors that supervisors are unaware of. We investigate Mechanistic Anomaly Detection (MAD) as a technique to augment supervision of capable models; we use internal model features to identify anomalous training signals so they can be investigated or discarded. We train detectors to flag points from the test environment that differ substantially from the training environment, and experiment with a large variety of detector features and scoring rules to detect anomalies in a set of ``quirky’’ language models. We find that detectors can achieve high discrimination on some tasks, but no detector is effective across all models and tasks. MAD techniques may be effective in low-stakes applications, but advances in both detection and evaluation are likely needed if they are to be used in high stakes settings.
zh

[NLP-129] Exploring the Effectiveness and Interpretability of Texts in LLM -based Time Series Models

【速读】：该论文试图解决大型语言模型（LLMs）在时间序列预测任务中引入文本数据是否真正有助于提升模型性能及其可解释性的问题。研究通过一系列实验发现，文本与时间序列模态之间存在不对齐现象，并且文本信息在许多情况下并未显著提升时间序列预测表现。此外，可视化分析表明现有框架学习到的文本表征在应用于时间序列数据时缺乏足够的可解释性。为了解决这一问题，论文提出了一种新的度量方法——语义匹配指数（Semantic Matching Index, SMI），用于更好地评估时间序列与文本之间的匹配程度。关键在于通过引入SMI来揭示当前时间序列LLMs中文本的不对齐及有限可解释性，并提高对时间序列中文本可解释性的认识。代码已公开发布。

链接: https://arxiv.org/abs/2504.08808
作者: Zhengke Sun,Hangwei Qian,Ivor Tsang
机构: Nanyang Technological University (南洋理工大学); CFAR, ASTAR (ASTAR 中心); CFAR, ASTAR (ASTAR 中心)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Large Language Models (LLMs) have been applied to time series forecasting tasks, leveraging pre-trained language models as the backbone and incorporating textual data to purportedly enhance the comprehensive capabilities of LLMs for time series. However, are these texts really helpful for interpretation? This study seeks to investigate the actual efficacy and interpretability of such textual incorporations. Through a series of empirical experiments on textual prompts and textual prototypes, our findings reveal that the misalignment between two modalities exists, and the textual information does not significantly improve time series forecasting performance in many cases. Furthermore, visualization analysis indicates that the textual representations learned by existing frameworks lack sufficient interpretability when applied to time series data. We further propose a novel metric named Semantic Matching Index (SMI) to better evaluate the matching degree between time series and texts during our post hoc interpretability investigation. Our analysis reveals the misalignment and limited interpretability of texts in current time-series LLMs, and we hope this study can raise awareness of the interpretability of texts for time series. The code is available at this https URL.
zh

[NLP-130] Estimating Item Difficulty Using Large Language Models and Tree-Based Machine Learning Algorithms

【速读】：该论文试图解决通过大规模预测评估项目难度以减少依赖资源密集型现场测试的问题。解决方案的关键在于利用大型语言模型（Large Language Models, LLMs）的潜力。研究提出了两种方法：一种是直接估计法，让LLM为每个项目分配单一难度评分；另一种是基于特征的方法，即LLM提取多个认知和语言特征，然后通过集成树模型（如随机森林和梯度提升）预测难度。结果表明，基于特征的方法在预测准确性上优于直接LLM估计，相关性高达r=0.87，并且误差更低，强调了结构化特征提取的重要性。

链接: https://arxiv.org/abs/2504.08804
作者: Pooya Razavi,Sonya J. Powers
机构: 未知
类目: Computers and Society (cs.CY); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Estimating item difficulty through field-testing is often resource-intensive and time-consuming. As such, there is strong motivation to develop methods that can predict item difficulty at scale using only the item content. Large Language Models (LLMs) represent a new frontier for this goal. The present research examines the feasibility of using an LLM to predict item difficulty for K-5 mathematics and reading assessment items (N = 5170). Two estimation approaches were implemented: (a) a direct estimation method that prompted the LLM to assign a single difficulty rating to each item, and (b) a feature-based strategy where the LLM extracted multiple cognitive and linguistic features, which were then used in ensemble tree-based models (random forests and gradient boosting) to predict difficulty. Overall, direct LLM estimates showed moderate to strong correlations with true item difficulties. However, their accuracy varied by grade level, often performing worse for early grades. In contrast, the feature-based method yielded stronger predictive accuracy, with correlations as high as r = 0.87 and lower error estimates compared to both direct LLM predictions and baseline regressors. These findings highlight the promise of LLMs in streamlining item development and reducing reliance on extensive field testing and underscore the importance of structured feature extraction. We provide a seven-step workflow for testing professionals who would want to implement a similar item difficulty estimation approach with their item pool.
zh

[NLP-131] Learnable Multi-Scale Wavelet Transformer: A Novel Alternative to Self-Attention

【速读】：该论文旨在解决标准自注意力机制在处理长序列或资源受限场景下计算复杂度过高的问题，其计算复杂度随输入序列长度呈平方级增长。论文的关键解决方案是引入Learnable Multi-Scale Wavelet Transformer (LMWT)，用可学习的多尺度Haar小波变换模块替代传统的点积自注意力机制。通过利用小波变换的多分辨率特性，LMWT能够高效捕捉局部细节与全局上下文，同时其小波分解参数包括尺度特定系数均在端到端训练过程中自动学习，使模型能够根据数据与任务自适应调整分解策略。

链接: https://arxiv.org/abs/2504.08801
作者: Andrew Kiruluta,Priscilla Burity,Samantha Williams
机构: UC Berkeley (加州大学伯克利分校); School of Information (信息学院)
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Transformer architectures, underpinned by the self-attention mechanism, have achieved state-of-the-art results across numerous natural language processing (NLP) tasks by effectively modeling long-range dependencies. However, the computational complexity of self-attention, scaling quadratically with input sequence length, presents significant challenges for processing very long sequences or operating under resource constraints. This paper introduces the Learnable Multi-Scale Wavelet Transformer (LMWT), a novel architecture that replaces the standard dot-product self-attention with a learnable multi-scale Haar wavelet transform module. Leveraging the intrinsic multi-resolution properties of wavelets, the LMWT efficiently captures both local details and global context. Crucially, the parameters of the wavelet transform, including scale-specific coefficients, are learned end-to-end during training, allowing the model to adapt its decomposition strategy to the data and task. We present the detailed mathematical formulation of the learnable Haar wavelet module and its integration into the transformer framework, supplemented by an architectural diagram. We conduct a comprehensive experimental evaluation on a standard machine translation benchmark (WMT16 En-De), comparing the LMWT against a baseline self-attention transformer using metrics like BLEU score, perplexity, and token accuracy. Furthermore, we analyze the computational complexity, highlighting the linear scaling of our approach, discuss its novelty in the context of related work, and explore the interpretability offered by visualizing the learned Haar coefficients. Our results indicate that the LMWT achieves competitive performance while offering substantial computational advantages, positioning it as a promising and novel alternative for efficient sequence modeling.
zh

[NLP-132] Exploring Gradient-Guided Masked Language Model to Detect Textual Adversarial Attacks

【速读】：该论文旨在解决文本对抗样本对自然语言处理系统可靠性构成的严重威胁这一问题。现有研究表明，对抗样本往往偏离正常文本的潜在流形（manifold），而预训练的掩码语言模型（Masked Language Models, MLM）能够近似表示正常数据的流形。基于这些发现，论文探索利用掩码语言模型来检测文本对抗攻击。论文首先提出基于掩码语言模型的检测方法（Masked Language Model-based Detection, MLMD），通过利用掩码语言建模目标中的掩码与去掩码操作，诱导正常文本与对抗文本在流形变化上的差异以实现检测。然而，MLMD 的逐词掩码策略带来了显著的计算开销。后续分析表明，输入中的大量非关键词对于检测并不重要，却消耗了资源。为此，论文进一步提出梯度引导的 MLMD 方法（Gradient-guided MLMD, GradMLMD），通过利用梯度信息识别并跳过非关键单词，在不牺牲检测性能的前提下大幅减少了资源消耗，成为解决方案的关键所在。

链接: https://arxiv.org/abs/2504.08798
作者: Xiaomei Zhang,Zhaoxi Zhang,Yanjun Zhang,Xufei Zheng,Leo Yu Zhang,Shengshan Hu,Shirui Pan
机构: School of Information and Communication Technology, Griffith University, Queensland, Australia (昆士兰格里菲斯大学信息与通信技术学院); School of Computer Science, University of Technology Sydney, Sydney, New South Wales, Australia (悉尼科技大学计算机科学学院); College of Computer and Information Science, Southwest University, Chongqing, China (西南大学计算机与信息科学学院); School of Cyber Science and Engineering, Huazhong University of Science and Technology, Wuhan, China (华中科技大学网络科学与工程学院)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Textual adversarial examples pose serious threats to the reliability of natural language processing systems. Recent studies suggest that adversarial examples tend to deviate from the underlying manifold of normal texts, whereas pre-trained masked language models can approximate the manifold of normal data. These findings inspire the exploration of masked language models for detecting textual adversarial attacks. We first introduce Masked Language Model-based Detection (MLMD), leveraging the mask and unmask operations of the masked language modeling (MLM) objective to induce the difference in manifold changes between normal and adversarial texts. Although MLMD achieves competitive detection performance, its exhaustive one-by-one masking strategy introduces significant computational overhead. Our posterior analysis reveals that a significant number of non-keywords in the input are not important for detection but consume resources. Building on this, we introduce Gradient-guided MLMD (GradMLMD), which leverages gradient information to identify and skip non-keywords during detection, significantly reducing resource consumption without compromising detection performance.
zh

[NLP-133] Enhancing NER Performance in Low-Resource Pakistani Languages using Cross-Lingual Data Augmentation NAACL

【速读】：该论文试图解决低资源语言（如乌尔都语、沙姆基语、信德语和普什图语）中命名实体识别（NER）任务由于标注数据不足及预训练语言模型（PLMs）表示能力有限而导致的研究不足和挑战。解决方案的关键在于提出了一种数据增强技术，通过生成符合文化背景的句子，并结合多语言掩码大型语言模型（LLMs）的微调，在沙姆基语和普什图语中显著提升了NER性能。此外，研究进一步探索了生成式LLMs在少样本学习场景下的NER能力和数据增强潜力。

链接: https://arxiv.org/abs/2504.08792
作者: Toqeer Ehsan,Thamar Solorio
机构: Department of Natural Language Processing (自然语言处理系); MBZUAI (穆罕默德·本·扎耶德人工智能大学)
类目: Computation and Language (cs.CL); Information Retrieval (cs.IR)
备注: Accepted to W-NUT 2025 @ NAACL

点击查看摘要

Abstract:Named Entity Recognition (NER), a fundamental task in Natural Language Processing (NLP), has shown significant advancements for high-resource languages. However, due to a lack of annotated datasets and limited representation in Pre-trained Language Models (PLMs), it remains understudied and challenging for low-resource languages. To address these challenges, we propose a data augmentation technique that generates culturally plausible sentences and experiments on four low-resource Pakistani languages; Urdu, Shahmukhi, Sindhi, and Pashto. By fine-tuning multilingual masked Large Language Models (LLMs), our approach demonstrates significant improvements in NER performance for Shahmukhi and Pashto. We further explore the capability of generative LLMs for NER and data augmentation using few-shot learning.
zh

[NLP-134] Efficient Evaluation of Large Language Models via Collaborative Filtering

【速读】：该论文旨在解决在大型语言模型（Large Language Models, LLMs）评估中因测试实例数量庞大且推理速度慢导致的成本高昂的问题。论文提出了一种两阶段方法，灵感来源于推荐系统中的协同过滤（Collaborative Filtering, CF）。解决方案的关键在于：第一阶段通过将实例选择视为向用户推荐产品，挑选能够有效区分模型性能的测试实例；第二阶段则将性能预测视为推荐系统中的评分预测问题，从而预测目标LLM在未选中实例上的行为。实验结果表明，该方法能够在大幅减少推理开销的同时，准确估计目标模型的性能。

链接: https://arxiv.org/abs/2504.08781
作者: Xu-Xiang Zhong,Chao Yi,Han-Jia Ye
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
备注:

点击查看摘要

Abstract:With the development of Large Language Models (LLMs), numerous benchmarks have been proposed to measure and compare the capabilities of different LLMs. However, evaluating LLMs is costly due to the large number of test instances and their slow inference speed. In this paper, we aim to explore how to efficiently estimate a model’s real performance on a given benchmark based on its evaluation results on a small number of instances sampled from the benchmark. Inspired by Collaborative Filtering (CF) in Recommendation Systems (RS), we treat LLMs as users and test instances as items and propose a two-stage method. In the first stage, we treat instance selection as recommending products to users to choose instances that can easily distinguish model performance. In the second stage, we see performance prediction as rating prediction problem in RS to predict the target LLM’s behavior on unselected instances. Experiments on multiple LLMs and datasets imply that our method can accurately estimate the target model’s performance while largely reducing its inference overhead.
zh

[NLP-135] Can AI Master Construction Management (CM)? Benchmarking State-of-the-Art Large Language Models on CM Certification Exams

【速读】：该论文旨在解决大型语言模型（Large Language Models, LLMs）在应对建筑管理（Construction Management, CM）领域特定挑战时表现不足的问题。尽管LLMs在通用推理任务中表现出色，但在精确的定量分析和法规解释等CM领域的具体任务中，其性能尚未得到充分探索。为此，研究引入了CMExamSet数据集，这是一个包含来自四个国家级认证考试的真实多项选择题的综合基准数据集，并通过零样本评估考察了模型的整体准确性、主题领域（如施工安全）、推理复杂度（单步与多步任务）以及题目格式（仅文本、图表参考和表格参考）。研究的关键在于通过构建这一基准数据集及评估框架，揭示当前主流LLMs（如GPT-4o和Claude 3.7）在处理CM任务时的优势与局限性，强调了增强领域专用推理能力以及持续人工监督的重要性，以确保复杂决策中的可靠性和有效性。

链接: https://arxiv.org/abs/2504.08779
作者: Ruoxin Xiong,Yanyu Wang,Suat Gunhan,Yimin Zhu,Charles Berryman
机构: Kent State University (肯特州立大学); Louisiana State University (路易斯安那州立大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:The growing complexity of construction management (CM) projects, coupled with challenges such as strict regulatory requirements and labor shortages, requires specialized analytical tools that streamline project workflow and enhance performance. Although large language models (LLMs) have demonstrated exceptional performance in general reasoning tasks, their effectiveness in tackling CM-specific challenges, such as precise quantitative analysis and regulatory interpretation, remains inadequately explored. To bridge this gap, this study introduces CMExamSet, a comprehensive benchmarking dataset comprising 689 authentic multiple-choice questions sourced from four nationally accredited CM certification exams. Our zero-shot evaluation assesses overall accuracy, subject areas (e.g., construction safety), reasoning complexity (single-step and multi-step), and question formats (text-only, figure-referenced, and table-referenced). The results indicate that GPT-4o and Claude 3.7 surpass typical human pass thresholds (70%), with average accuracies of 82% and 83%, respectively. Additionally, both models performed better on single-step tasks, with accuracies of 85.7% (GPT-4o) and 86.7% (Claude 3.7). Multi-step tasks were more challenging, reducing performance to 76.5% and 77.6%, respectively. Furthermore, both LLMs show significant limitations on figure-referenced questions, with accuracies dropping to approximately 40%. Our error pattern analysis further reveals that conceptual misunderstandings are the most common (44.4% and 47.9%), underscoring the need for enhanced domain-specific reasoning models. These findings underscore the potential of LLMs as valuable supplementary analytical tools in CM, while highlighting the need for domain-specific refinements and sustained human oversight in complex decision making.
zh

[NLP-136] From Tokens to Lattices: Emergent Lattice Structures in Language Models ICLR2025

【速读】：该论文旨在探索预训练掩码语言模型（Pretrained Masked Language Models, MLMs）如何从其预训练过程中隐式学习概念格（Concept Lattice）结构。论文的核心问题是：这种概念化能力是如何从 MLM 的预训练中涌现的？为了解决这一问题，作者从形式概念分析（Formal Concept Analysis, FCA）的角度出发，提出了一种新颖的框架，通过 FCA 从预训练的 MLM 中构建概念格。该框架的关键在于揭示 MLM 的目标函数隐式学习了一个描述对象、属性及其依赖关系的形式背景（Formal Context），从而使得通过 FCA 重构概念格成为可能。与以往工作不同的是，该方法不依赖于人工定义的概念，而是能够发现超出人类定义范围的潜在概念（Latent Concepts）。

链接: https://arxiv.org/abs/2504.08778
作者: Bo Xiong,Steffen Staab
机构: Stanford University (斯坦福大学); University of Stuttgart (斯图加特大学); University of Southampton (南安普顿大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: ICLR 2025

点击查看摘要

Abstract:Pretrained masked language models (MLMs) have demonstrated an impressive capability to comprehend and encode conceptual knowledge, revealing a lattice structure among concepts. This raises a critical question: how does this conceptualization emerge from MLM pretraining? In this paper, we explore this problem from the perspective of Formal Concept Analysis (FCA), a mathematical framework that derives concept lattices from the observations of object-attribute relationships. We show that the MLM’s objective implicitly learns a \emphformal context that describes objects, attributes, and their dependencies, which enables the reconstruction of a concept lattice through FCA. We propose a novel framework for concept lattice construction from pretrained MLMs and investigate the origin of the inductive biases of MLMs in lattice structure learning. Our framework differs from previous work because it does not rely on human-defined concepts and allows for discovering “latent” concepts that extend beyond human definitions. We create three datasets for evaluation, and the empirical results verify our hypothesis.
zh

[NLP-137] he Lyme Disease Controversy: An AI-Driven Discourse Analysis of a Quarter Century of Academic Debate and Divides

【速读】：该论文试图解决慢性莱姆病（Chronic Lyme Disease, CLD）与莱姆病治疗后综合征（Post-Treatment Lyme Disease Syndrome, PTLDS）相关科学话语中长期存在的复杂性和两极分化问题。论文的关键解决方案在于提出了一种创新的混合人工智能驱动方法，结合大规模语言模型（Large Language Models, LLMs）与结构化的人类验证，以系统性分析过去25年间数千篇学术摘要。通过将LLMs与专家监督相结合，研究构建了一个定量框架，用于追踪争议性医学领域中的认识论转变，并展示了该方法在其他内容分析领域的应用潜力。这一方法不仅提供了对塑造莱姆病研究的结构性和认识论力量的新实证见解，还为社会科学研究和医学研究中的AI辅助方法的应用提供了可扩展且可复制的范例。

链接: https://arxiv.org/abs/2504.08777
作者: Teo Susnjak,Cole Palffy,Tatiana Zimina,Nazgul Altynbekova,Kunal Garg,Leona Gilbert
机构: School of Mathematical and Computational Sciences, Massey University (梅西大学), Albany, New Zealand; Te?ted Oy (特奥公司), Jyväskylä, Finland
类目: Computers and Society (cs.CY); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:The scientific discourse surrounding Chronic Lyme Disease (CLD) and Post-Treatment Lyme Disease Syndrome (PTLDS) has evolved over the past twenty-five years into a complex and polarised debate, shaped by shifting research priorities, institutional influences, and competing explanatory models. This study presents the first large-scale, systematic examination of this discourse using an innovative hybrid AI-driven methodology, combining large language models with structured human validation to analyse thousands of scholarly abstracts spanning 25 years. By integrating Large Language Models (LLMs) with expert oversight, we developed a quantitative framework for tracking epistemic shifts in contested medical fields, with applications to other content analysis domains. Our analysis revealed a progressive transition from infection-based models of Lyme disease to immune-mediated explanations for persistent symptoms. This study offers new empirical insights into the structural and epistemic forces shaping Lyme disease research, providing a scalable and replicable methodology for analysing discourse, while underscoring the value of AI-assisted methodologies in social science and medical research.
zh

[NLP-138] SemCAFE: When Named Entities make the Difference Assessing Web Source Reliability through Entity-level Analytics

【速读】：该论文旨在解决在线新闻环境中可靠与不可靠新闻文章难以区分的问题，特别是在涉及重大国际事件（如2022年俄罗斯入侵乌克兰）时，由于AI生成内容的低成本传播以及不可靠来源的不透明性，这一问题尤为突出。解决方案的关键在于引入SemCAFE系统，通过将实体相关性纳入评估机制，并结合自然语言处理技术（如boilerplate removal、tokenization）及基于YAGO知识库的实体级语义分析，为每篇新闻生成语义指纹，从而实现对新闻可信度的精准评估。实验结果表明，该方法在宏观F1分数上比现有最先进的方法提高了12%。

链接: https://arxiv.org/abs/2504.08776
作者: Gautam Kishore Shahi,Oshani Seneviratne,Marc Spaniol
机构: University of Duisburg-Essen(Duisburg)(杜伊斯堡-埃森大学); Rensselaer Polytechnic Institute(Troy, United States)(伦斯勒理工学院); University of Caen Normandy(Caen)(诺曼底大学)
类目: Computation and Language (cs.CL); Computers and Society (cs.CY); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:With the shift from traditional to digital media, the online landscape now hosts not only reliable news articles but also a significant amount of unreliable content. Digital media has faster reachability by significantly influencing public opinion and advancing political agendas. While newspaper readers may be familiar with their preferred outlets political leanings or credibility, determining unreliable news articles is much more challenging. The credibility of many online sources is often opaque, with AI generated content being easily disseminated at minimal cost. Unreliable news articles, particularly those that followed the Russian invasion of Ukraine in 2022, closely mimic the topics and writing styles of credible sources, making them difficult to distinguish. To address this, we introduce SemCAFE, a system designed to detect news reliability by incorporating entity relatedness into its assessment. SemCAFE employs standard Natural Language Processing techniques, such as boilerplate removal and tokenization, alongside entity level semantic analysis using the YAGO knowledge base. By creating a semantic fingerprint for each news article, SemCAFE could assess the credibility of 46,020 reliable and 3,407 unreliable articles on the 2022 Russian invasion of Ukraine. Our approach improved the macro F1 score by 12% over state of the art methods. The sample data and code are available on GitHub
zh

[NLP-139] Layers at Similar Depths Generate Similar Activations Across LLM Architectures

【速读】：该论文试图研究独立训练的大语言模型（LLMs）所使用的潜在空间之间的关系。论文的关键解决方案在于分析不同层激活所诱导的最近邻关系，通过对24个开源权重的LLMs进行研究，发现这些最近邻关系在单个模型内随层而变（Claim 1），但在不同模型的相应层之间大致共享（Claim 2）。这表明LLMs生成的激活几何结构具有层间的连续变化，而整个变化过程在不同模型间基本一致，尽管存在架构上的拉伸和压缩调整。

链接: https://arxiv.org/abs/2504.08775
作者: Christopher Wolfram,Aaron Schein
机构: University of Chicago (芝加哥大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:How do the latent spaces used by independently-trained LLMs relate to one another? We study the nearest neighbor relationships induced by activations at different layers of 24 open-weight LLMs, and find that they 1) tend to vary from layer to layer within a model, and 2) are approximately shared between corresponding layers of different models. Claim 2 shows that these nearest neighbor relationships are not arbitrary, as they are shared across models, but Claim 1 shows that they are not “obvious” either, as there is no single set of nearest neighbor relationships that is universally shared. Together, these suggest that LLMs generate a progression of activation geometries from layer to layer, but that this entire progression is largely shared between models, stretched and squeezed to fit into different architectures.
zh

[NLP-140] Evaluation of the phi-3-mini SLM for identification of texts related to medicine health and sports injuries

【速读】：该论文试图解决的问题是如何评估小语言模型（Small Language Models, SLMs）在自动标注和识别医学/健康相关文本中的实际表现，并探讨其与人工标注结果的相关性。现有研究中，SLMs 的性能通常基于特定任务（如 MedQA）进行基准测试，但这些测试结果可能无法真实反映其在实际文档和网页中的应用效果。此外，由于 SLMs 的资源需求远低于大语言模型（Large Language Models, LLMs），它们可能更适合部署于多种设备上，因此对其实际性能的验证具有重要意义。

解决方案的关键在于通过对比微软的 phi-3-mini-4k-instruct SLM 生成的主题相关性分数与 7 名人类评估者对两类文本（医学/健康相关文本和运动损伤相关文本）的评分，来分析 SLM 在不同过滤条件下的表现。论文选取了约 900 万新闻标题数据集中的子样本（共 1144 条医学/健康相关文本和 1117 条运动损伤相关文本），并根据 phi-3 SLM 的评分采用不同程度的过滤策略。最终结果显示，SLM 的评分与人类评估者的评分之间存在低至中等显著的相关性，尤其是在医学/健康相关文本的高过滤条件下（(\rho = 0.3854, p < 0.001)）。这表明 SLM 在特定条件下可以部分替代人工标注，但仍需进一步优化以提高一致性。

链接: https://arxiv.org/abs/2504.08764
作者: Chris Brogly,Saif Rjaibi,Charlotte Liang,Erica Lam,Edward Wang,Adam Levitan,Sarah Paleczny,Michael Cusimano
机构: 未知
类目: Information Retrieval (cs.IR); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Small Language Models (SLMs) have potential to be used for automatically labelling and identifying aspects of text data for medicine/health-related purposes from documents and the web. As their resource requirements are significantly lower than Large Language Models (LLMs), these can be deployed potentially on more types of devices. SLMs often are benchmarked on health/medicine-related tasks, such as MedQA, although performance on these can vary especially depending on the size of the model in terms of number of parameters. Furthermore, these test results may not necessarily reflect real-world performance regarding the automatic labelling or identification of texts in documents and the web. As a result, we compared topic-relatedness scores from Microsofts phi-3-mini-4k-instruct SLM to the topic-relatedness scores from 7 human evaluators on 1144 samples of medical/health-related texts and 1117 samples of sports injury-related texts. These texts were from a larger dataset of about 9 million news headlines, each of which were processed and assigned scores by phi-3-mini-4k-instruct. Our sample was selected (filtered) based on 1 (low filtering) or more (high filtering) Boolean conditions on the phi-3 SLM scores. We found low-moderate significant correlations between the scores from the SLM and human evaluators for sports injury texts with low filtering (\rho = 0.3413, p 0.001) and medicine/health texts with high filtering (\rho = 0.3854, p 0.001), and low significant correlation for medicine/health texts with low filtering (\rho = 0.2255, p 0.001). There was negligible, insignificant correlation for sports injury-related texts with high filtering (\rho = 0.0318, p = 0.4466).
zh

[NLP-141] WebMap – Large Language Model-assisted Semantic Link Induction in the Web

【速读】：当前网络搜索引擎对开展研究任务的支持是不充分的，甚至可能起到阻碍作用。本文旨在解决这一问题，提出对WebMap的功能扩展，WebMap是一种基于语义诱导的网络叠加链接结构，旨在从根本上促进研究活动。其解决方案的关键在于通过这些功能扩展实现文档聚类的动态确定与重组、构建网络中的语义路标，以及交互式追溯主题起源。

链接: https://arxiv.org/abs/2504.08763
作者: Shiraj Pokharel,Georg P. Roßrucker,Mario M. Kubek
机构: 未知
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: 11 pages, 3 figures, accepted at the 2024 24th International Conference on Innovations for Community Services (I4CS), June 12 - 14, Maastricht, The Netherlands, 2024

点击查看摘要

Abstract:Carrying out research tasks is only inadequately supported, if not hindered, by current web search engines. This paper therefore proposes functional extensions of WebMap, a semantically induced overlay linking structure on the web to inherently facilitate research activities. These add-ons support the dynamic determination and regrouping of document clusters, the creation of a semantic signpost in the web, and the interactive tracing of topics back to their origins.
zh

[NLP-142] Domain Specific Question to SQL Conversion with Embedded Data Balancing Technique

【速读】：该论文旨在解决自然语言处理中生成结构化查询语言（SQL）查询时，因无法准确理解用户问题中的值而导致的错误率较高的问题。特别是在处理领域特定术语和具体值条件时，传统方法难以保持一致性与精确性。为克服这些障碍，论文提出了两个关键干预措施：一是实施数据平衡技术，二是过采样领域特定查询，以优化模型架构以增强值识别能力，并针对领域特定问题进行微调。通过这些方法，所提出的解决方案相较于在WikiSQL数据集上测试的最先进模型，在模型性能准确性方面提升了10.98个百分点，且在领域特定问题上应用过采样技术显示出显著改进。

链接: https://arxiv.org/abs/2504.08753
作者: Jyothi,T. Satyanarayana Murthy
机构: 未知
类目: Information Retrieval (cs.IR); Computation and Language (cs.CL); Databases (cs.DB)
备注:

点击查看摘要

Abstract:The rise of deep learning in natural language processing has fostered the creation of text to structured query language models composed of an encoder and a decoder. Researchers have experimented with various intermediate processing like schema linking, table type aware, value extract. To generate accurate SQL results for the user question. However error analysis performed on the failed cases on these systems shows, 29 percentage of the errors would be because the system was unable to understand the values expressed by the user in their question. This challenge affects the generation of accurate SQL queries, especially when dealing with domain-specific terms and specific value conditions, where traditional methods struggle to maintain consistency and precision. To overcome these obstacles, proposed two intermediations like implementing data balancing technique and over sampling domain-specific queries which would refine the model architecture to enhance value recognition and fine tuning the model for domain-specific questions. This proposed solution achieved 10.98 percentage improvement in accuracy of the model performance compared to the state of the art model tested on WikiSQL dataset. to convert the user question accurately to SQL queries. Applying oversampling technique on the domain-specific questions shown a significant improvement as compared with traditional approaches.
zh

[NLP-143] A Survey of Multimodal Retrieval-Augmented Generation

【速读】：该论文试图解决传统 Retrieval-Augmented Generation (RAG) 模型在处理多模态任务时的局限性问题，即仅依赖文本数据进行检索和生成的不足。为应对这一挑战，论文提出 Multimodal Retrieval-Augmented Generation (MRAG)，其关键在于通过整合文本、图像和视频等多种模态的数据，扩展了传统的 RAG 框架，使其能够利用跨模态的上下文信息。这种方法不仅减少了生成中的幻觉现象，还增强了基于事实的问答系统性能，特别是在需要视觉与文本联合理解的场景中表现出色。

链接: https://arxiv.org/abs/2504.08748
作者: Lang Mei,Siyu Mo,Zhihan Yang,Chong Chen
机构: Huawei Cloud BU(Beijing China)
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Emerging Technologies (cs.ET); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Multimodal Retrieval-Augmented Generation (MRAG) enhances large language models (LLMs) by integrating multimodal data (text, images, videos) into retrieval and generation processes, overcoming the limitations of text-only Retrieval-Augmented Generation (RAG). While RAG improves response accuracy by incorporating external textual knowledge, MRAG extends this framework to include multimodal retrieval and generation, leveraging contextual information from diverse data types. This approach reduces hallucinations and enhances question-answering systems by grounding responses in factual, multimodal knowledge. Recent studies show MRAG outperforms traditional RAG, especially in scenarios requiring both visual and textual understanding. This survey reviews MRAG’s essential components, datasets, evaluation methods, and limitations, providing insights into its construction and improvement. It also identifies challenges and future research directions, highlighting MRAG’s potential to revolutionize multimodal information retrieval and generation. By offering a comprehensive perspective, this work encourages further exploration into this promising paradigm.
zh

[NLP-144] Improving RAG for Personalization with Author Features and Contrastive Examples

【速读】：该论文旨在解决个性化文本生成中基于检索增强生成（RAG）方法难以捕捉作者细粒度特征的问题，从而难以有效识别作者的独特风格。为丰富RAG上下文，论文提出向大语言模型（LLMs）提供作者特定特征（如平均情感极性和常用词汇）以及作者档案中的过往样本。关键解决方案在于引入一种名为对比示例（Contrastive Examples）的新特征：从其他作者文档中检索对比示例，以帮助LLMs识别某一作者风格与其他作者相比的独特之处。实验表明，结合少量关于命名实体、依赖模式及个人常用词汇的信息可显著提升个性化文本生成的效果；而将细粒度特征与对比示例结合后，相对基线RAG性能提升15%，同时超越现有基准。研究结果凸显了细粒度特征在提升个性化方面的价值，并为将对比示例作为RAG补充开辟了新的研究方向。代码已公开发布。

链接: https://arxiv.org/abs/2504.08745
作者: Mert Yazan,Suzan Verberne,Frederik Situmeang
机构: 未知
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Personalization with retrieval-augmented generation (RAG) often fails to capture fine-grained features of authors, making it hard to identify their unique traits. To enrich the RAG context, we propose providing Large Language Models (LLMs) with author-specific features, such as average sentiment polarity and frequently used words, in addition to past samples from the author’s profile. We introduce a new feature called Contrastive Examples: documents from other authors are retrieved to help LLM identify what makes an author’s style unique in comparison to others. Our experiments show that adding a couple of sentences about the named entities, dependency patterns, and words a person uses frequently significantly improves personalized text generation. Combining features with contrastive examples boosts the performance further, achieving a relative 15% improvement over baseline RAG while outperforming the benchmarks. Our results show the value of fine-grained features for better personalization, while opening a new research dimension for including contrastive examples as a complement with RAG. We release our code publicly.
zh

[NLP-145] ExpertRAG : Efficient RAG with Mixture of Experts – Optimizing Context Retrieval for Adaptive LLM Responses

【速读】：该论文试图解决知识密集型语言模型在效率与准确性之间的权衡问题。传统Retrieval Augmented Generation (RAG) 模型始终依赖外部知识库，而Mixture-of-Experts (MoE) 模型则依赖参数化知识，两者各有局限性。为解决此问题，论文提出了一种名为ExpertRAG的新框架，其关键在于引入动态检索门控机制与专家路由策略，使模型能够根据查询需求动态选择外部知识库或内部专业化专家，从而实现更高效的计算资源利用和知识获取。通过理论分析和公式推导，论文量化了选择性检索带来的计算成本节省及稀疏专家利用带来的容量提升，并通过实验设计验证其在事实回忆、泛化能力及推理效率上的性能优势。

链接: https://arxiv.org/abs/2504.08744
作者: Esmail Gumaan
机构: 未知
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: 30 pages, 4 figures

点击查看摘要

Abstract:ExpertRAG is a novel theoretical framework that integrates Mixture-of-Experts (MoE) architectures with Retrieval Augmented Generation (RAG) to advance the efficiency and accuracy of knowledge-intensive language modeling. We propose a dynamic retrieval gating mechanism coupled with expert routing, enabling the model to selectively consult an external knowledge store or rely on specialized internal experts based on the query’s needs. The paper lays out the theoretical foundations of ExpertRAG, including a probabilistic formulation that treats retrieval and expert selection as latent decisions, and mathematical justifications for its efficiency in both computation and knowledge utilization. We derive formulae to quantify the expected computational cost savings from selective retrieval and the capacity gains from sparse expert utilization. A comparative analysis positions ExpertRAG against standard RAG (with always-on retrieval) and pure MoE models (e.g., Switch Transformer, Mixtral) to highlight its unique balance between parametric knowledge and non-parametric retrieval. We also outline an experimental validation strategy, proposing benchmarks and evaluation protocols to test ExpertRAG’s performance on factual recall, generalization, and inference efficiency. The proposed framework, although presented theoretically, is supported by insights from prior work in RAG and MoE, and is poised to provide more factual, efficient, and adaptive generation by leveraging the best of both paradigms. In summary, ExpertRAG contributes a new perspective on scaling and augmenting language models, backed by a thorough analysis and a roadmap for empirical validation.
zh

[NLP-146] Pseudo-Autoregressive Neural Codec Language Models for Efficient Zero-Shot Text-to-Speech Synthesis ACM-MM2025

【速读】：该论文旨在解决现有零样本文本到语音（Text-to-Speech, TTS）系统面临的两难困境：自回归（Autoregressive, AR）模型虽然具有良好的时序建模能力，但生成速度慢且缺乏持续时间可控性；而非自回归（Non-Autoregressive, NAR）模型虽可实现快速并行生成，但通常缺乏有效的时序建模且设计复杂。为应对这一挑战，论文提出了一种新颖的伪自回归（Pseudo-Autoregressive, PAR）编解码语言建模方法，通过结合AR的显式时序建模能力和NAR的并行生成优势，实现了固定时间步长下的动态长度片段生成。基于PAR方法，进一步提出了PALLE（Progressive Autoregressive Language Learning for Efficient TTS）系统，采用两阶段策略，首先利用PAR进行初步生成，随后通过非自回归方式对低置信度的语音标记进行迭代优化。PALLE的关键创新在于其渐进式的时序预测机制与全局上下文增强的后处理策略，这不仅显著提升了语音质量、说话人相似性和可懂度，同时实现了高达十倍的推理加速。

链接: https://arxiv.org/abs/2504.10352
作者: Yifan Yang,Shujie Liu,Jinyu Li,Yuxuan Hu,Haibin Wu,Hui Wang,Jianwei Yu,Lingwei Meng,Haiyang Sun,Yanqing Liu,Yan Lu,Kai Yu,Xie Chen
机构: Shanghai Jiao Tong University (上海交通大学); Microsoft Corporation (微软公司)
类目: Audio and Speech Processing (eess.AS); Computation and Language (cs.CL)
备注: Submitted to ACM MM 2025

点击查看摘要

Abstract:Recent zero-shot text-to-speech (TTS) systems face a common dilemma: autoregressive (AR) models suffer from slow generation and lack duration controllability, while non-autoregressive (NAR) models lack temporal modeling and typically require complex designs. In this paper, we introduce a novel pseudo-autoregressive (PAR) codec language modeling approach that unifies AR and NAR modeling. Combining explicit temporal modeling from AR with parallel generation from NAR, PAR generates dynamic-length spans at fixed time steps. Building on PAR, we propose PALLE, a two-stage TTS system that leverages PAR for initial generation followed by NAR refinement. In the first stage, PAR progressively generates speech tokens along the time dimension, with each step predicting all positions in parallel but only retaining the left-most span. In the second stage, low-confidence tokens are iteratively refined in parallel, leveraging the global contextual information. Experiments demonstrate that PALLE, trained on LibriTTS, outperforms state-of-the-art systems trained on large-scale data, including F5-TTS, E2-TTS, and MaskGCT, on the LibriSpeech test-clean set in terms of speech quality, speaker similarity, and intelligibility, while achieving up to ten times faster inference speed. Audio samples are available at this https URL.
zh

[NLP-147] Fine-tuning an Large Language Model for Automating Computational Fluid Dynamics Simulations

【速读】：该论文旨在解决配置计算流体力学（CFD）模拟需要大量领域专业知识的问题，这限制了其更广泛的应用。虽然大型语言模型（LLMs）在科学计算方面取得了进展，但它们在自动化CFD工作流程方面的应用尚处于初级阶段。为了解决这一问题，论文提出了一种基于领域特定LLM适应的新方法。通过在NL2FOAM（一个包含28716个自然语言到OpenFOAM配置对及其链式思维（CoT）注释的自定义数据集）上微调Qwen2.5-7B-Instruct，实现了从自然语言描述直接翻译成可执行的CFD设置的能力。多智能体框架协调整个过程，包括输入验证、配置生成、模拟运行以及错误修正。在包含21个不同流动案例的基准测试中，该方法展示了最先进的性能，达到了88.7%的解算精度和82.6%的一次性成功率，显著优于更大的通用模型如Qwen2.5-72B-Instruct、DeepSeek-R1和Llama3.3-70B-Instruct，并且需要更少的修正迭代次数同时保持高计算效率。结果表明，领域特定适应在部署LLM助手以支持复杂工程工作流程中的关键作用。

链接: https://arxiv.org/abs/2504.09602
作者: Zhehao Dong,Zhen Lu,Yue Yang
机构: State Key Laboratory for Turbulent and Complex Systems (湍流与复杂系统国家重点实验室), College of Engineering (工学院), Peking University (北京大学), Beijing 100871, China (中国); HEDPS-CAPT (北大重离子物理研究所), Peking University (北京大学), Beijing 100871, China (中国)
类目: Fluid Dynamics (physics.flu-dyn); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Configuring computational fluid dynamics (CFD) simulations typically demands extensive domain expertise, limiting broader access. Although large language models (LLMs) have advanced scientific computing, their use in automating CFD workflows is underdeveloped. We introduce a novel approach centered on domain-specific LLM adaptation. By fine-tuning Qwen2.5-7B-Instruct on NL2FOAM, our custom dataset of 28716 natural language-to-OpenFOAM configuration pairs with chain-of-thought (CoT) annotations, we enable direct translation from natural language descriptions to executable CFD setups. A multi-agent framework orchestrates the process, autonomously verifying inputs, generating configurations, running simulations, and correcting errors. Evaluation on a benchmark of 21 diverse flow cases demonstrates state-of-the-art performance, achieving 88.7% solution accuracy and 82.6% first-attempt success rate. This significantly outperforms larger general-purpose models like Qwen2.5-72B-Instruct, DeepSeek-R1, and Llama3.3-70B-Instruct, while also requiring fewer correction iterations and maintaining high computational efficiency. The results highlight the critical role of domain-specific adaptation in deploying LLM assistants for complex engineering workflows.
zh

[NLP-148] SIFT-50M: A Large-Scale Multilingual Dataset for Speech Instruction Fine-Tuning

【速读】：该论文旨在解决语音-文本大型语言模型（Speech-Text Large Language Models, LLMs）在指令跟随（instruction-following）能力上的不足，同时提升其在基础语音任务上的性能。为实现这一目标，论文提出了SIFT（Speech Instruction Fine-Tuning），这是一个包含5000万个样本的数据集，用于支持语音-文本LLMs的指令微调和预训练。SIFT-50M数据集基于公开可用的语音语料库构建，涵盖了14,000小时的语音数据，并结合了大型语言模型和现成的专业模型。关键在于利用多语言、多样化的语音理解与可控语音生成指令，从而训练出的SIFT-LLM不仅在指令跟随基准测试中表现出色，还在基础语音任务中达到竞争力。此外，论文还引入了EvalSIFT，一个专门设计的评估数据集，用于衡量语音-文本LLMs的指令跟随能力。

链接: https://arxiv.org/abs/2504.09081
作者: Prabhat Pandey,Rupak Vignesh Swaminathan,K V Vijay Girish,Arunasish Sen,Jian Xie,Grant P. Strimel,Andreas Schwarz
机构: 未知
类目: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:We introduce SIFT (Speech Instruction Fine-Tuning), a 50M-example dataset designed for instruction fine-tuning and pre-training of speech-text large language models (LLMs). SIFT-50M is built from publicly available speech corpora, which collectively contain 14K hours of speech, and leverages LLMs along with off-the-shelf expert models. The dataset spans five languages, encompassing a diverse range of speech understanding as well as controllable speech generation instructions. Using SIFT-50M, we train SIFT-LLM, which outperforms existing speech-text LLMs on instruction-following benchmarks while achieving competitive performance on foundational speech tasks. To support further research, we also introduce EvalSIFT, a benchmark dataset specifically designed to evaluate the instruction-following capabilities of speech-text LLMs.
zh

[NLP-149] Emergence of psychopathological computations in large language models

【速读】：该论文试图解决的问题是：大型语言模型（Large Language Models, LLMs）是否能够实现心理病理学（Psychopathology）的计算。为解决这一问题，论文的关键在于两个方面：首先，从概念有效性出发，提出一种适用于非生物实体且无主观体验的通用计算理论来描述心理病理学；其次，研究LLMs行为背后的机制以提高方法学的有效性。为此，论文建立了计算-理论框架，并提出了新的机制可解释性方法与定制化的实证分析框架作为解决方案的核心。通过这些框架，实验验证了三个关键结论，从而支持其假设，即LLMs中已出现了基于网络理论的心理病理学计算，这表明某些LLM表现出的心理病理学行为可能是其内部处理的固有特性而非表面模仿。

链接: https://arxiv.org/abs/2504.08016
作者: Soo Yong Lee,Hyunjin Hwang,Taekwan Kim,Yuyeong Kim,Kyuri Park,Jaemin Yoo,Denny Borsboom,Kijung Shin
机构: KAIST (韩国科学技术院); UCL (伦敦大学学院); UvA (阿姆斯特丹大学); KAIST (韩国科学技术院); UvA (阿姆斯特丹大学); UvA (阿姆斯特丹大学)
类目: Neurons and Cognition (q-bio.NC); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: pre-print

点击查看摘要

Abstract:Can large language models (LLMs) implement computations of psychopathology? An effective approach to the question hinges on addressing two factors. First, for conceptual validity, we require a general and computational account of psychopathology that is applicable to computational entities without biological embodiment or subjective experience. Second, mechanisms underlying LLM behaviors need to be studied for better methodological validity. Thus, we establish a computational-theoretical framework to provide an account of psychopathology applicable to LLMs. To ground the theory for empirical analysis, we also propose a novel mechanistic interpretability method alongside a tailored empirical analytic framework. Based on the frameworks, we conduct experiments demonstrating three key claims: first, that distinct dysfunctional and problematic representational states are implemented in LLMs; second, that their activations can spread and self-sustain to trap LLMs; and third, that dynamic, cyclic structural causal models encoded in the LLMs underpin these patterns. In concert, the empirical results corroborate our hypothesis that network-theoretic computations of psychopathology have already emerged in LLMs. This suggests that certain LLM behaviors mirroring psychopathology may not be a superficial mimicry but a feature of their internal processing. Thus, our work alludes to the possibility of AI systems with psychopathological behaviors in the near future.
zh

计算机视觉

[CV-0] FLOSS: Free Lunch in Open-vocabulary Semantic Segmentation

【速读】：本文挑战了现有开放词汇语义分割（OVSS）模型普遍采用多模板构造类别平均文本嵌入的做法，试图探究模板选择对OVSS的影响。研究发现，对于每个类别，存在单一模板分类器显著优于传统的平均分类器，这些单一模板分类器被称为类别专家（class-experts）。关键在于无需训练和利用未标注图像，通过单模板分类器的类别预测熵来估计这些专家，并选取最低熵对应的模板作为类别专家。所有专家协同工作，在新提出的融合方法FLOSS中生成更准确的OVSS预测。FLOSS是一种与现有OVSS方法正交且互补的插件式方法，可无标签、无需额外训练即可系统性提升OVSS性能。实验表明，FLOSS在多种OVSS基准数据集上持续提升当前最优方法的表现，且选定的专家模板能在不同数据集间良好泛化，即使存在分布偏移。此外，在低数据量场景下，FLOSS同样取得满意改进。

链接: https://arxiv.org/abs/2504.10487
作者: Yasser Benigmim,Mohammad Fahes,Tuan-Hung Vu,Andrei Bursuc,Raoul de Charette
机构: Inria; Valeo.ai
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: Project Page: this https URL

点击查看摘要

Abstract:Recent Open-Vocabulary Semantic Segmentation (OVSS) models extend the CLIP model to segmentation while maintaining the use of multiple templates (e.g., a photo of class, a sketch of a class, etc.) for constructing class-wise averaged text embeddings, acting as a classifier. In this paper, we challenge this status quo and investigate the impact of templates for OVSS. Empirically, we observe that for each class, there exist single-template classifiers significantly outperforming the conventional averaged classifier. We refer to them as class-experts. Given access to unlabeled images and without any training involved, we estimate these experts by leveraging the class-wise prediction entropy of single-template classifiers, selecting as class-wise experts those which yield the lowest entropy. All experts, each specializing in a specific class, collaborate in a newly proposed fusion method to generate more accurate OVSS predictions. Our plug-and-play method, coined FLOSS, is orthogonal and complementary to existing OVSS methods, offering a ‘‘free lunch’’ to systematically improve OVSS without labels and additional training. Extensive experiments demonstrate that FLOSS consistently boosts state-of-the-art methods on various OVSS benchmarks. Moreover, the selected expert templates can generalize well from one dataset to others sharing the same semantic categories, yet exhibiting distribution shifts. Additionally, we obtain satisfactory improvements under a low-data regime, where only a few unlabeled images are available. Our code is available at this https URL .
zh

[CV-1] DNF-Avatar: Distilling Neural Fields for Real-time Animatable Avatar Relighting

【速读】：该论文致力于解决通过单目视频创建可重新照明和可动画化的人体 avatar 时渲染速度慢的问题。传统方法基于物理渲染（Physically Based Rendering, PBR）结合神经场（Neural Fields）来估计人体几何结构与外观属性，但由于昂贵的蒙特卡罗光线追踪（Monte Carlo Ray Tracing），导致渲染速度缓慢。为解决此问题，论文的关键方案是将隐式神经场（Implicit Neural Fields, 教师模型）的知识蒸馏到显式的 2D 高斯点 splatting（显式表示，学生模型）中，利用高斯点 splatting 的快速光栅化（rasterization）特性加速渲染。此外，论文采用分项和近似（Split-Sum Approximation）处理 PBR 外观，并提出基于身体部位的环境光遮蔽探针（Part-Wise Ambient Occlusion Probes）用于阴影计算，仅需每像素查询一次即可实现阴影预测，从而实现 avatar 的实时重新照明。这些技术结合后，学生模型在推理速度上达到教师模型的 370 倍提升，同时保持高质量的重新照明效果与逼真的阴影效果。

链接: https://arxiv.org/abs/2504.10486
作者: Zeren Jiang,Shaofei Wang,Siyu Tang
机构: Visual Geometry Group, University of Oxford (牛津大学视觉几何组); ETH Zürich (苏黎世联邦理工学院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 16 pages, 8 figures, Project pages: this https URL

点击查看摘要

Abstract:Creating relightable and animatable human avatars from monocular videos is a rising research topic with a range of applications, e.g. virtual reality, sports, and video games. Previous works utilize neural fields together with physically based rendering (PBR), to estimate geometry and disentangle appearance properties of human avatars. However, one drawback of these methods is the slow rendering speed due to the expensive Monte Carlo ray tracing. To tackle this problem, we proposed to distill the knowledge from implicit neural fields (teacher) to explicit 2D Gaussian splatting (student) representation to take advantage of the fast rasterization property of Gaussian splatting. To avoid ray-tracing, we employ the split-sum approximation for PBR appearance. We also propose novel part-wise ambient occlusion probes for shadow computation. Shadow prediction is achieved by querying these probes only once per pixel, which paves the way for real-time relighting of avatars. These techniques combined give high-quality relighting results with realistic shadow effects. Our experiments demonstrate that the proposed student model achieves comparable or even better relighting results with our teacher model while being 370 times faster at inference time, achieving a 67 FPS rendering speed.
zh

[CV-2] Decoupled Diffusion Sparks Adaptive Scene Generation

【速读】：该论文旨在解决自动驾驶领域场景生成中数据采集成本高以及现有方法在反应性和目标导向性方面的不足问题。具体而言，全序列去噪方法难以支持在线反应，而逐帧预测方法则缺乏精确的目标状态引导，且难以生成复杂或具有挑战性的场景。为克服这些问题，论文提出Nexus框架，其关键在于通过细粒度令牌和独立噪声状态模拟普通及挑战性场景，结合部分噪声屏蔽训练策略与噪声感知调度，确保去噪过程中的环境及时更新。此外，论文还构建了一个包含复杂边缘场景的数据集以补充挑战性场景生成。最终，Nexus实现了更高的生成真实感，同时将位移误差降低了40%，并通过闭环规划数据增强提升了20%的表现。

链接: https://arxiv.org/abs/2504.10485
作者: Yunsong Zhou,Naisheng Ye,William Ljungbergh,Tianyu Li,Jiazhi Yang,Zetong Yang,Hongzi Zhu,Christoffer Petersson,Hongyang Li
机构: OpenDriveLab; Shanghai Jiao Tong University (上海交通大学); Zhejiang University (浙江大学); Zenseact; GAC R&D Center (广汽研究院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Controllable scene generation could reduce the cost of diverse data collection substantially for autonomous driving. Prior works formulate the traffic layout generation as predictive progress, either by denoising entire sequences at once or by iteratively predicting the next frame. However, full sequence denoising hinders online reaction, while the latter’s short-sighted next-frame prediction lacks precise goal-state guidance. Further, the learned model struggles to generate complex or challenging scenarios due to a large number of safe and ordinal driving behaviors from open datasets. To overcome these, we introduce Nexus, a decoupled scene generation framework that improves reactivity and goal conditioning by simulating both ordinal and challenging scenarios from fine-grained tokens with independent noise states. At the core of the decoupled pipeline is the integration of a partial noise-masking training strategy and a noise-aware schedule that ensures timely environmental updates throughout the denoising process. To complement challenging scenario generation, we collect a dataset consisting of complex corner cases. It covers 540 hours of simulated data, including high-risk interactions such as cut-in, sudden braking, and collision. Nexus achieves superior generation realism while preserving reactivity and goal orientation, with a 40% reduction in displacement error. We further demonstrate that Nexus improves closed-loop planning by 20% through data augmentation and showcase its capability in safety-critical data generation.
zh

[CV-3] REPA-E: Unlocking VAE for End-to-End Tuning with Latent Diffusion Transformers

【速读】：该论文试图解决的问题是：在潜在扩散模型（Latent Diffusion Models）与变分自编码器（Variational Auto-Encoder, VAE）tokenizer联合训练过程中，是否可以实现端到端（End-to-End）训练。传统深度学习观点认为，当条件允许时，端到端训练通常更优，但研究发现使用标准扩散损失函数（Diffusion Loss）进行VAE与扩散模型的联合训练不仅无效，甚至可能导致最终性能下降。论文的关键解决方案是提出了一种表征对齐（Representation-Aligned, REPA）损失函数，通过这种损失函数，使VAE和扩散模型能够在训练过程中被协同优化。尽管方法简单，但提出的端到端训练方案（REPA-E）显著提升了训练效率，相较于传统方法分别加速了17倍和45倍，并且还改善了VAE自身的潜在空间结构及其下游生成性能，最终在ImageNet 256 x 256数据集上达到了最新的SOTA性能（FID分别为1.26和1.83）。

链接: https://arxiv.org/abs/2504.10483
作者: Xingjian Leng,Jaskirat Singh,Yunzhong Hou,Zhenchang Xing,Saining Xie,Liang Zheng
机构: Australian National University (澳大利亚国立大学); Data61 CSIRO (Data61联邦科学与工业研究组织); New York University (纽约大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:In this paper we tackle a fundamental question: “Can we train latent diffusion models together with the variational auto-encoder (VAE) tokenizer in an end-to-end manner?” Traditional deep-learning wisdom dictates that end-to-end training is often preferable when possible. However, for latent diffusion transformers, it is observed that end-to-end training both VAE and diffusion-model using standard diffusion-loss is ineffective, even causing a degradation in final performance. We show that while diffusion loss is ineffective, end-to-end training can be unlocked through the representation-alignment (REPA) loss – allowing both VAE and diffusion model to be jointly tuned during the training process. Despite its simplicity, the proposed training recipe (REPA-E) shows remarkable performance; speeding up diffusion model training by over 17x and 45x over REPA and vanilla training recipes, respectively. Interestingly, we observe that end-to-end tuning with REPA-E also improves the VAE itself; leading to improved latent space structure and downstream generation performance. In terms of final performance, our approach sets a new state-of-the-art; achieving FID of 1.26 and 1.83 with and without classifier-free guidance on ImageNet 256 x 256. Code is available at this https URL.
zh

[CV-4] InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models

【速读】：该论文旨在解决传统多模态大语言模型（Multimodal Large Language Models, MLLMs）在后处理训练管道中面临的复杂性和对齐挑战。为应对这些问题，InternVL3提出了一种统一的预训练范式，即在单一预训练阶段内同时从多样化的多模态数据和纯文本语料库中联合学习多模态和语言能力，而非将仅支持文本的大型语言模型（Large Language Model, LLM）适配为支持视觉输入的MLLM。这一方法有效缓解了传统后处理训练中的复杂性与对齐难题。关键在于其创新的联合预训练机制以及引入的可变视觉位置编码（Variable Visual Position Encoding, V2PE）、先进的后训练技术（如监督微调和混合偏好优化）及测试时扩展策略等，这些共同构成了其性能提升的核心解决方案。

链接: https://arxiv.org/abs/2504.10479
作者: Jinguo Zhu,Weiyun Wang,Zhe Chen,Zhaoyang Liu,Shenglong Ye,Lixin Gu,Yuchen Duan,Hao Tian,Weijie Su,Jie Shao,Zhangwei Gao,Erfei Cui,Yue Cao,Yangzhou Liu,Weiye Xu,Hao Li,Jiahao Wang,Han Lv,Dengnian Chen,Songze Li,Yinan He,Tan Jiang,Jiapeng Luo,Yi Wang,Conghui He,Botian Shi,Xingcheng Zhang,Wenqi Shao,Junjun He,Yingtong Xiong,Wenwen Qu,Peng Sun,Penglong Jiao,Lijun Wu,Kaipeng Zhang,Huipeng Deng,Jiaye Ge,Kai Chen,Limin Wang,Min Dou,Lewei Lu,Xizhou Zhu,Tong Lu,Dahua Lin,Yu Qiao,Jifeng Dai,Wenhai Wang
机构: Shanghai AI Laboratory (上海人工智能实验室); SenseTime Research (商汤科技研究院); Tsinghua University (清华大学); Nanjing University (南京大学); Fudan University (复旦大学); The Chinese University of Hong Kong (香港中文大学); Shanghai Jiao Tong University (上海交通大学); OpenAI (OpenAI)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Technical Report

点击查看摘要

Abstract:We introduce InternVL3, a significant advancement in the InternVL series featuring a native multimodal pre-training paradigm. Rather than adapting a text-only large language model (LLM) into a multimodal large language model (MLLM) that supports visual inputs, InternVL3 jointly acquires multimodal and linguistic capabilities from both diverse multimodal data and pure-text corpora during a single pre-training stage. This unified training paradigm effectively addresses the complexities and alignment challenges commonly encountered in conventional post-hoc training pipelines for MLLMs. To further improve performance and scalability, InternVL3 incorporates variable visual position encoding (V2PE) to support extended multimodal contexts, employs advanced post-training techniques such as supervised fine-tuning (SFT) and mixed preference optimization (MPO), and adopts test-time scaling strategies alongside an optimized training infrastructure. Extensive empirical evaluations demonstrate that InternVL3 delivers superior performance across a wide range of multi-modal tasks. In particular, InternVL3-78B achieves a score of 72.2 on the MMMU benchmark, setting a new state-of-the-art among open-source MLLMs. Its capabilities remain highly competitive with leading proprietary models, including ChatGPT-4o, Claude 3.5 Sonnet, and Gemini 2.5 Pro, while also maintaining strong pure-language proficiency. In pursuit of open-science principles, we will publicly release both the training data and model weights to foster further research and development in next-generation MLLMs.
zh

[CV-5] Art3D: Training-Free 3D Generation from Flat-Colored Illustration

【速读】：该论文试图解决在利用大规模预训练图像到3D生成模型时，因参考图像为平面单色（如手绘）缺乏三维幻觉而导致难以合成合理3D资产的问题。这类平面单色输入模态在艺术内容创作中通常最为用户友好。论文的关键解决方案是提出Art3D，这是一种无需训练的方法，通过结合结构化和语义特征，利用预训练的2D图像生成模型以及基于视觉语言模型（VLM）的真实性评估，成功增强了参考图像中的三维幻觉，从而简化了从2D到3D的生成过程，并适应多种绘画风格。此外，为了评估现有图像到3D模型在处理缺乏三维感觉的平面2D图像时的泛化性能，论文构建了一个包含超过100个样本的新数据集Flat-2D。实验结果验证了Art3D的性能与鲁棒性，展示了其出色的泛化能力和实际应用潜力。

链接: https://arxiv.org/abs/2504.10466
作者: Xiaoyan Cong,Jiayi Shen,Zekun Li,Rao Fu,Tao Lu,Srinath Sridhar
机构: Brown University (布朗大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Technical Report. Course Project of Brown CSCI 1430 Computer Vision. Project Page: this https URL

点击查看摘要

Abstract:Large-scale pre-trained image-to-3D generative models have exhibited remarkable capabilities in diverse shape generations. However, most of them struggle to synthesize plausible 3D assets when the reference image is flat-colored like hand drawings due to the lack of 3D illusion, which are often the most user-friendly input modalities in art content creation. To this end, we propose Art3D, a training-free method that can lift flat-colored 2D designs into 3D. By leveraging structural and semantic features with pre- trained 2D image generation models and a VLM-based realism evaluation, Art3D successfully enhances the three-dimensional illusion in reference images, thus simplifying the process of generating 3D from 2D, and proves adaptable to a wide range of painting styles. To benchmark the generalization performance of existing image-to-3D models on flat-colored images without 3D feeling, we collect a new dataset, Flat-2D, with over 100 samples. Experimental results demonstrate the performance and robustness of Art3D, exhibiting superior generalizable capacity and promising practical applicability. Our source code and dataset will be publicly available on our project page: this https URL .
zh

[CV-6] Pixel-SAIL: Single Transformer For Pixel-Grounded Understanding

【速读】：该论文旨在解决多模态大型语言模型（MLLMs）在细粒度像素级理解任务中依赖额外组件（如视觉编码器 CLIP 和分割专家）导致系统复杂性高且限制模型扩展的问题。论文的核心目标是探索一种高度简化的 MLLM，不引入额外组件。为此，作者受到近期单Transformer统一视觉-语言模型（SAIL）设计的启发，提出了一种名为 Pixel-SAIL 的单一 Transformer 方法，用于像素级 MLLM 任务。关键解决方案包括：(1) 设计可学习的上采样模块以优化视觉标记特征；(2) 提出新颖的视觉提示注入策略，使单个 Transformer 能够理解和利用视觉提示输入，并从视觉提示嵌入与视觉标记的早期融合中获益；(3) 引入视觉专家蒸馏策略以高效增强单个 Transformer 的细粒度特征提取能力。实验结果表明，Pixel-SAIL 在多个基准测试中实现了与更复杂管道相当甚至更好的性能。

链接: https://arxiv.org/abs/2504.10465
作者: Tao Zhang,Xiangtai Li,Zilong Huang,Yanwei Li,Weixian Lei,Xueqing Deng,Shihao Chen,Shunping Ji,Jiashi Feng
机构: Bytedance Seed (字节跳动种子团队); WHU (武汉大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Multimodal Large Language Models (MLLMs) achieve remarkable performance for fine-grained pixel-level understanding tasks. However, all the works rely heavily on extra components, such as vision encoder (CLIP), segmentation experts, leading to high system complexity and limiting model scaling. In this work, our goal is to explore a highly simplified MLLM without introducing extra components. Our work is motivated by the recent works on Single trAnsformer as a unified vIsion-Language Model (SAIL) design, where these works jointly learn vision tokens and text tokens in transformers. We present Pixel-SAIL, a single transformer for pixel-wise MLLM tasks. In particular, we present three technical improvements on the plain baseline. First, we design a learnable upsampling module to refine visual token features. Secondly, we propose a novel visual prompt injection strategy to enable the single transformer to understand visual prompt inputs and benefit from the early fusion of visual prompt embeddings and vision tokens. Thirdly, we introduce a vision expert distillation strategy to efficiently enhance the single transformer’s fine-grained feature extraction capability. In addition, we have collected a comprehensive pixel understanding benchmark (PerBench), using a manual check. It includes three tasks: detailed object description, visual prompt-based question answering, and visual-text referring segmentation. Extensive experiments on four referring segmentation benchmarks, one visual prompt benchmark, and our PerBench show that our Pixel-SAIL achieves comparable or even better results with a much simpler pipeline. Code and model will be released at this https URL.
zh

[CV-7] he Scalability of Simplicity: Empirical Analysis of Vision-Language Learning with a Single Transformer

【速读】：该论文旨在解决现有多模态大语言模型（Multimodal Large Language Models, MLLMs）依赖预训练视觉Transformer（Vision Transformer, ViT）模块化架构的问题，提出了一种更简约的单体Transformer架构——SAIL。SAIL的关键创新在于摒弃独立的视觉编码器，通过适配混合注意力机制（Mix-Attention Mechanisms）和多模态位置编码（Multimodal Positional Encodings），更好地匹配视觉与文本模态的独特特性。这种设计不仅提升了模型的可扩展性，还实现了与模块化MLLMs相当的性能，并在视觉任务如语义分割中展现出与ViT-22B相媲美的表示能力。

链接: https://arxiv.org/abs/2504.10462
作者: Weixian Lei,Jiacong Wang,Haochen Wang,Xiangtai Li,Jun Hao Liew,Jiashi Feng,Zilong Huang
机构: Bytedance (字节跳动)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:This paper introduces SAIL, a single transformer unified multimodal large language model (MLLM) that integrates raw pixel encoding and language decoding within a singular architecture. Unlike existing modular MLLMs, which rely on a pre-trained vision transformer (ViT), SAIL eliminates the need for a separate vision encoder, presenting a more minimalist architecture design. Instead of introducing novel architectural components, SAIL adapts mix-attention mechanisms and multimodal positional encodings to better align with the distinct characteristics of visual and textual modalities. We systematically compare SAIL’s properties-including scalability, cross-modal information flow patterns, and visual representation capabilities-with those of modular MLLMs. By scaling both training data and model size, SAIL achieves performance comparable to modular MLLMs. Notably, the removal of pretrained ViT components enhances SAIL’s scalability and results in significantly different cross-modal information flow patterns. Moreover, SAIL demonstrates strong visual representation capabilities, achieving results on par with ViT-22B in vision tasks such as semantic segmentation. Code and models are available at this https URL.
zh

[CV-8] Integrating Vision and Location with Transformers: A Multimodal Deep Learning Framework for Medical Wound Analysis

【速读】：该论文旨在解决急性且难以愈合伤口类型的高效分类问题，以降低诊断中的财务和时间成本，并辅助制定最优治疗方案。传统机器学习模型在特征选择方面存在局限性，而深度学习（Deep Learning, DL）虽展现出潜力，但仍需进一步提升模型效率与准确性。为此，论文提出了一种基于深度学习的多模态分类器，通过结合伤口图像及其位置数据对伤口类型进行分类，包括糖尿病性溃疡、压力性溃疡、手术后溃疡和静脉性溃疡等。此外，创建了身体地图（Body Map）以提供位置信息，从而提高伤口定位标注的准确性。

解决方案的关键在于设计了一个融合多种先进技术的模型架构：利用Vision Transformer提取输入图像的分层特征；引入离散小波变换（Discrete Wavelet Transform, DWT）层捕捉高低频成分；采用Transformer提取空间特征。同时，通过三种群体优化算法（Monster Gorilla Toner (MGTO)、Improved Gray Wolf Optimization (IGWO) 和 Fox Optimization Algorithm）优化神经元数量和权重向量，显著提升了诊断精度。实验结果表明，基于优化算法的权重向量调整不仅提高了分类准确性，还证明了其在伤口检测中的有效性。最终，所提模型在仅使用图像数据时达到0.8123的准确率，在结合图像数据与位置信息时达到0.8007的准确率，且与优化模型结合后的整体准确率范围为0.7801至0.8342。

链接: https://arxiv.org/abs/2504.10452
作者: Ramin Mousa,Hadis Taherinia,Khabiba Abdiyeva,Amir Ali Bengari,Mohammadmahdi Vahediahmar
机构: University of Zanjan (扎汉大学); Islamic Azad University Science and Research Branch (伊斯兰阿扎德大学科学与研究分部); Samarkand State University named after Sharof Rashidov (命名为沙罗夫·拉希多夫的撒马尔罕国立大学); Kimyo International University In Tashkent (塔什干金鹰国际大学); University of Tehran (德黑兰大学); Drexel University (德雷塞尔大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Effective recognition of acute and difficult-to-heal wounds is a necessary step in wound diagnosis. An efficient classification model can help wound specialists classify wound types with less financial and time costs and also help in deciding on the optimal treatment method. Traditional machine learning models suffer from feature selection and are usually cumbersome models for accurate recognition. Recently, deep learning (DL) has emerged as a powerful tool in wound diagnosis. Although DL seems promising for wound type recognition, there is still a large scope for improving the efficiency and accuracy of the model. In this study, a DL-based multimodal classifier was developed using wound images and their corresponding locations to classify them into multiple classes, including diabetic, pressure, surgical, and venous ulcers. A body map was also created to provide location data, which can help wound specialists label wound locations more effectively. The model uses a Vision Transformer to extract hierarchical features from input images, a Discrete Wavelet Transform (DWT) layer to capture low and high frequency components, and a Transformer to extract spatial features. The number of neurons and weight vector optimization were performed using three swarm-based optimization techniques (Monster Gorilla Toner (MGTO), Improved Gray Wolf Optimization (IGWO), and Fox Optimization Algorithm). The evaluation results show that weight vector optimization using optimization algorithms can increase diagnostic accuracy and make it a very effective approach for wound detection. In the classification using the original body map, the proposed model was able to achieve an accuracy of 0.8123 using image data and an accuracy of 0.8007 using a combination of image data and wound location. Also, the accuracy of the model in combination with the optimization models varied from 0.7801 to 0.8342.
zh

[CV-9] Anchor Token Matching: Implicit Structure Locking for Training-free AR Image Editing

【速读】：本文旨在解决现有基于自回归（Autoregressive, AR）视觉模型的图像编辑技术无法直接迁移扩散模型编辑方法的问题。AR模型在图像编辑过程中面临空间注意力图的贫乏以及结构错误的序列累积，导致对象布局和全局一致性被破坏。为应对这一挑战，论文提出了一种名为隐式结构锁定（Implicit Structure Locking, ISLock）的无训练编辑策略。ISLock 的关键在于通过锚点标记匹配（Anchor Token Matching, ATM）协议动态对齐自注意力模式与参考图像，从而隐式地在潜在空间中保持结构蓝图，实现结构感知的编辑同时保留生成自主性。实验表明，ISLock 可以高质量地完成结构一致的编辑，且无需额外训练，优于或可比传统编辑技术。

链接: https://arxiv.org/abs/2504.10434
作者: Taihang Hu,Linxuan Li,Kai Wang,Yaxing Wang,Jian Yang,Ming-Ming Cheng
机构: VCIP, College of Computer Science, Nankai University (南开大学); NKIARI, Shenzhen Futian (深圳未来技术研究院); Computer Vision Center, Universitat Autònoma de Barcelona (巴塞罗那自治大学视觉中心)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Text-to-image generation has seen groundbreaking advancements with diffusion models, enabling high-fidelity synthesis and precise image editing through cross-attention manipulation. Recently, autoregressive (AR) models have re-emerged as powerful alternatives, leveraging next-token generation to match diffusion models. However, existing editing techniques designed for diffusion models fail to translate directly to AR models due to fundamental differences in structural control. Specifically, AR models suffer from spatial poverty of attention maps and sequential accumulation of structural errors during image editing, which disrupt object layouts and global consistency. In this work, we introduce Implicit Structure Locking (ISLock), the first training-free editing strategy for AR visual models. Rather than relying on explicit attention manipulation or fine-tuning, ISLock preserves structural blueprints by dynamically aligning self-attention patterns with reference images through the Anchor Token Matching (ATM) protocol. By implicitly enforcing structural consistency in latent space, our method ISLock enables structure-aware editing while maintaining generative autonomy. Extensive experiments demonstrate that ISLock achieves high-quality, structure-consistent edits without additional training and is superior or comparable to conventional editing techniques. Our findings pioneer the way for efficient and flexible AR-based image editing, further bridging the performance gap between diffusion and autoregressive generative models. The code will be publicly available at this https URL
zh

[CV-10] MonoDiff9D: Monocular Category-Level 9D Object Pose Estimation via Diffusion Model ICRA’25

【速读】：该论文旨在解决单目类别级物体姿态估计中对形状先验（shape priors）、CAD模型或深度传感器的依赖问题，特别是在处理类内未知物体时。论文提出了一种基于扩散模型的单目类别级9D物体姿态生成方法MonoDiff9D。其关键在于利用扩散模型的概率特性，在无需形状先验、CAD模型或深度传感器的情况下，通过零样本方式从单目图像估计粗略深度并转换为点云，进而结合点云全局特征与输入图像特征，并利用编码的时间步条件来驱动模型。最终，设计基于Transformer的去噪器从高斯噪声中恢复物体姿态，实现高精度的姿态估计。

链接: https://arxiv.org/abs/2504.10433
作者: Jian Liu,Wei Sun,Hui Yang,Jin Zheng,Zichen Geng,Hossein Rahmani,Ajmal Mian
机构: National Engineering Research Center for Robot Visual Perception and Control Technology, College of Electrical and Information Engineering, Hunan University (湖南大学), Changsha 410082, China; Central South University (中南大学), Changsha, 410082, China; Lancaster University (兰卡斯特大学), LA1 4YW, United Kingdom; The University of Western Australia (西澳大学), WA 6009, Australia
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
备注: Accepted by ICRA’25

点击查看摘要

Abstract:Object pose estimation is a core means for robots to understand and interact with their environment. For this task, monocular category-level methods are attractive as they require only a single RGB camera. However, current methods rely on shape priors or CAD models of the intra-class known objects. We propose a diffusion-based monocular category-level 9D object pose generation method, MonoDiff9D. Our motivation is to leverage the probabilistic nature of diffusion models to alleviate the need for shape priors, CAD models, or depth sensors for intra-class unknown object pose estimation. We first estimate coarse depth via DINOv2 from the monocular image in a zero-shot manner and convert it into a point cloud. We then fuse the global features of the point cloud with the input image and use the fused features along with the encoded time step to condition MonoDiff9D. Finally, we design a transformer-based denoiser to recover the object pose from Gaussian noise. Extensive experiments on two popular benchmark datasets show that MonoDiff9D achieves state-of-the-art monocular category-level 9D object pose estimation accuracy without the need for shape priors or CAD models at any stage. Our code will be made public at this https URL.
zh

[CV-11] HUMOTO: A 4D Dataset of Mocap Human Object Interactions

【速读】：该论文旨在解决高保真人类-物体交互数据集匮乏的问题，以支持运动生成、计算机视觉和机器人技术的应用。论文的关键创新在于两个方面：首先，提出了一种场景驱动的语言模型（LLM）脚本流水线，能够创建具有自然进展且目标明确的任务；其次，设计了一种动捕与摄像机结合的录制方案，有效应对遮挡问题。通过这些方法，HUMOTO 数据集不仅确保了物理动作的准确性，还保持了任务逻辑流程的合理性，并提供了全面的身体运动及多物体交互数据，从而解决了现有数据集在复杂交互建模中的局限性，为动画、机器人技术和具身人工智能系统的实际应用奠定了基础。

链接: https://arxiv.org/abs/2504.10414
作者: Jiaxin Lu,Chun-Hao Paul Huang,Uttaran Bhattacharya,Qixing Huang,Yi Zhou
机构: University of Texas at Austin (德克萨斯大学奥斯汀分校); Adobe Research (Adobe 研究院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 19 pages, 15 figures

点击查看摘要

Abstract:We present Human Motions with Objects (HUMOTO), a high-fidelity dataset of human-object interactions for motion generation, computer vision, and robotics applications. Featuring 736 sequences (7,875 seconds at 30 fps), HUMOTO captures interactions with 63 precisely modeled objects and 72 articulated parts. Our innovations include a scene-driven LLM scripting pipeline creating complete, purposeful tasks with natural progression, and a mocap-and-camera recording setup to effectively handle occlusions. Spanning diverse activities from cooking to outdoor picnics, HUMOTO preserves both physical accuracy and logical task flow. Professional artists rigorously clean and verify each sequence, minimizing foot sliding and object penetrations. We also provide benchmarks compared to other datasets. HUMOTO’s comprehensive full-body motion and simultaneous multi-object interactions address key data-capturing challenges and provide opportunities to advance realistic human-object interaction modeling across research domains with practical applications in animation, robotics, and embodied AI systems. Project: this https URL .
zh

[CV-12] GPS: Distilling Compact Memories via Grid-based Patch Sampling for Efficient Online Class-Incremental Learning

【速读】：本文旨在解决在线类别增量学习（Online Class-Incremental Learning）中模型适应新类别时面临的灾难性遗忘（Catastrophic Forgetting）问题，同时受限于对过去数据的访问有限。论文针对基于重放（Replay-Based）方法在存储受限情况下需要显著计算开销的问题，提出了一种轻量且有效的解决方案。关键在于引入了基于网格的补丁采样（Grid-based Patch Sampling, GPS），通过从原始图像中采样像素子集生成信息丰富的记忆样本，形成紧凑的低分辨率表示，保留语义内容和结构信息，而无需依赖可训练模型。这种策略能够无缝集成到现有重放框架中，在计算开销有限的情况下提升平均最终精度约3%-4%。

链接: https://arxiv.org/abs/2504.10409
作者: Mingchuan Ma,Yuhao Zhou,Jindi Lv,Yuxin Tian,Dan Si,Shujian Li,Qing Ye,Jiancheng Lv
机构: Sichuan University (四川大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 10 pages, 10 figures

点击查看摘要

Abstract:Online class-incremental learning aims to enable models to continuously adapt to new classes with limited access to past data, while mitigating catastrophic forgetting. Replay-based methods address this by maintaining a small memory buffer of previous samples, achieving competitive performance. For effective replay under constrained storage, recent approaches leverage distilled data to enhance the informativeness of memory. However, such approaches often involve significant computational overhead due to the use of bi-level optimization. Motivated by these limitations, we introduce Grid-based Patch Sampling (GPS), a lightweight and effective strategy for distilling informative memory samples without relying on a trainable model. GPS generates informative samples by sampling a subset of pixels from the original image, yielding compact low-resolution representations that preserve both semantic content and structural information. During replay, these representations are reassembled to support training and evaluation. Experiments on extensive benchmarks demonstrate that GRS can be seamlessly integrated into existing replay frameworks, leading to 3%-4% improvements in average end accuracy under memory-constrained settings, with limited computational overhead.
zh

[CV-13] owards Low-Latency Event-based Obstacle Avoidance on a FPGA-Drone

【速读】：该论文旨在解决在资源受限环境中实时碰撞避免任务中，基于事件的视觉系统（Event-based Vision System, EVS）与传统RGB模型性能对比的问题。论文的关键在于通过硬件加速器（FPGA）量化评估EVS模型在动作预测任务中的有效性，并揭示其在高有效帧率（1 kHz）、低时间（-20 ms）和空间（-20 mm）预测误差以及鲁棒性方面的优势，特别是在处理分布外数据时。此外，EVS模型在区分运动与静止状态时展现出更高的精度（78% vs. 19%）和F1分数（0.73 vs. 0.06），证明其显著优于易受过拟合影响的RGB模型。最终，端到端系统的延迟仅为约2.14 ms，其中事件聚合（1 ms）和推理阶段（0.94 ms）占主导贡献。因此，论文的关键解决方案在于采用基于事件的视觉技术及其高效硬件实现，以提升实时碰撞避免任务的性能与可靠性。

链接: https://arxiv.org/abs/2504.10400
作者: Pietro Bonazzi,Christian Vogt,Michael Jost,Lyes Khacef,Federico Paredes-Vallés,Michele Magno
机构: ETH Zürich (苏黎世联邦理工学院); Sony Semiconductor Solutions Europe, Sony Europe B.V., Zürich, Switzerland (索尼半导体解决方案欧洲公司, 索尼欧洲有限公司, 瑞士苏黎世)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:This work quantitatively evaluates the performance of event-based vision systems (EVS) against conventional RGB-based models for action prediction in collision avoidance on an FPGA accelerator. Our experiments demonstrate that the EVS model achieves a significantly higher effective frame rate (1 kHz) and lower temporal (-20 ms) and spatial prediction errors (-20 mm) compared to the RGB-based model, particularly when tested on out-of-distribution data. The EVS model also exhibits superior robustness in selecting optimal evasion maneuvers. In particular, in distinguishing between movement and stationary states, it achieves a 59 percentage point advantage in precision (78% vs. 19%) and a substantially higher F1 score (0.73 vs. 0.06), highlighting the susceptibility of the RGB model to overfitting. Further analysis in different combinations of spatial classes confirms the consistent performance of the EVS model in both test data sets. Finally, we evaluated the system end-to-end and achieved a latency of approximately 2.14 ms, with event aggregation (1 ms) and inference on the processing unit (0.94 ms) accounting for the largest components. These results underscore the advantages of event-based vision for real-time collision avoidance and demonstrate its potential for deployment in resource-constrained environments.
zh

[CV-14] Better Coherence Better Height: Fusing Physical Models and Deep Learning for Forest Height Estimation from Interferometric SAR Data

【速读】：该论文旨在解决利用合成孔径雷达（SAR）图像估算森林高度时传统物理模型泛化能力不足的问题，同时克服深度学习（Deep Learning, DL）方法缺乏物理洞察力的局限。论文的关键解决方案是提出了一种端到端框架CoHNet，它结合了深度学习与物理约束的优势，通过在预训练神经代理模型的训练损失中引入物理可解释性约束来优化深度学习模型。这种方法不仅提升了森林高度估计的准确性，还生成了有助于增强预测可靠性的有意义特征。

链接: https://arxiv.org/abs/2504.10395
作者: Ragini Bal Mahesh,Ronny Hänsch
机构: German Aerospace Center (DLR)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Estimating forest height from Synthetic Aperture Radar (SAR) images often relies on traditional physical models, which, while interpretable and data-efficient, can struggle with generalization. In contrast, Deep Learning (DL) approaches lack physical insight. To address this, we propose CoHNet - an end-to-end framework that combines the best of both worlds: DL optimized with physics-informed constraints. We leverage a pre-trained neural surrogate model to enforce physical plausibility through a unique training loss. Our experiments show that this approach not only improves forest height estimation accuracy but also produces meaningful features that enhance the reliability of predictions.
zh

[CV-15] PG-DPIR: An efficient plug-and-play method for high-count Poisson-Gaussian inverse problems

【速读】：该论文旨在解决高计数条件下Poisson-Gaussian噪声下的图像恢复逆问题，其目标是开发一种高效的方法来处理多种成像系统中的噪声。传统深度学习方法虽然性能优越，但在监督学习场景下通常需要针对特定传感器进行训练；而插件与播放（Plug-and-Play, PnP）方法通过仅学习正则化项避免了这一限制。然而，直接将现有的DPIR方法从白高斯噪声适应到Poisson-Gaussian噪声会导致算法收敛速度极慢，因为缺乏闭合形式的近端算子。为此，论文提出PG-DPIR方法，通过对DPIR进行针对性改进，并设计了一种高效的梯度下降初始化策略以加速近端步骤的收敛，其加速效果可达几个数量级。实验结果表明，该方法在卫星图像恢复和超分辨率任务中实现了最先进的性能，同时显著提升了计算效率，显示出在实际地面卫星处理链路中的应用潜力。

链接: https://arxiv.org/abs/2504.10375
作者: Maud Biquard,Marie Chabert,Florence Genin,Christophe Latry,Thomas Oberlin
机构: ISAE-Supaero/CNES (ISAE-Supaero/国家空间研究中心); IRIT / Toulouse-INP (IRIT/图卢兹理工学院); CNES (国家空间研究中心); ISAE-Supaero (ISAE-Supaero)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Image and Video Processing (eess.IV)
备注:

点击查看摘要

Abstract:Poisson-Gaussian noise describes the noise of various imaging systems thus the need of efficient algorithms for Poisson-Gaussian image restoration. Deep learning methods offer state-of-the-art performance but often require sensor-specific training when used in a supervised setting. A promising alternative is given by plug-and-play (PnP) methods, which consist in learning only a regularization through a denoiser, allowing to restore images from several sources with the same network. This paper introduces PG-DPIR, an efficient PnP method for high-count Poisson-Gaussian inverse problems, adapted from DPIR. While DPIR is designed for white Gaussian noise, a naive adaptation to Poisson-Gaussian noise leads to prohibitively slow algorithms due to the absence of a closed-form proximal operator. To address this, we adapt DPIR for the specificities of Poisson-Gaussian noise and propose in particular an efficient initialization of the gradient descent required for the proximal step that accelerates convergence by several orders of magnitude. Experiments are conducted on satellite image restoration and super-resolution problems. High-resolution realistic Pleiades images are simulated for the experiments, which demonstrate that PG-DPIR achieves state-of-the-art performance with improved efficiency, which seems promising for on-ground satellite processing chains.
zh

[CV-16] FingER: Content Aware Fine-grained Evaluation with Reasoning for AI-Generated Videos

【速读】：该论文旨在解决AI生成视频内容评估中存在的挑战，特别是随着复杂模型的涌现，现有评估方法难以应对生成视频中的多种不一致性和缺陷。论文的关键在于提出了一种名为“FingER”的新型实体级推理评估框架，通过自动为视频生成细粒度的实体级问题，并利用推理模型结合评分来实现综合评估。这一方案的核心创新点在于利用大型语言模型（LLMs）从五个不同视角生成易于多模态大型语言模型（MLLMs）解答且更具可解释性的实体级问题，同时构建了一个包含约3.3k视频及其对应的60k细粒度问答标注的数据集。此外，研究进一步探索了不同的训练策略，发现采用冷启动策略的分组相对策略优化（GRPO）方法在仅使用3.3k训练视频的情况下表现出色，超越现有方法的相对性能提升分别达到11.8%（GenAI-Bench）和5.5%（MonetBench）。

链接: https://arxiv.org/abs/2504.10358
作者: Rui Chen,Lei Sun,Jing Tang,Geng Li,Xiangxiang Chu
机构: AMAP, Alibaba Group (阿里云)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 10 pages, 4 figures

点击查看摘要

Abstract:Recent advances in video generation have posed great challenges in the assessment of AI-generated content, particularly with the emergence of increasingly sophisticated models. The various inconsistencies and defects observed in such videos are inherently complex, making overall scoring notoriously difficult. In this paper, we emphasize the critical importance of integrating fine-grained reasoning into video evaluation, and we propose \textbfF ing \textbfER , a novel entity-level reasoning evaluation framework that first automatically generates \textbfF ine-grained \textbfE ntity-level questions, and then answers those questions by a \textbfR easoning model with scores, which can be subsequently weighted summed to an overall score for different applications. Specifically, we leverage LLMs to derive entity-level questions across five distinct perspectives, which (i) often focus on some specific entities of the content, thereby making answering or scoring much easier by MLLMs, and (ii) are more interpretable. Then we construct a FingER dataset, consisting of approximately 3.3k videos and corresponding 60k fine-grained QA annotations, each with detailed reasons. Based on that, we further investigate various training protocols to best incentivize the reasoning capability of MLLMs for correct answer prediction. Extensive experiments demonstrate that a reasoning model trained using Group Relative Policy Optimization (GRPO) with a cold-start strategy achieves the best performance. Notably, our model surpasses existing methods by a relative margin of 11.8% on GenAI-Bench and 5.5% on MonetBench with only 3.3k training videos, which is at most one-tenth of the training samples utilized by other methods. Our code and dataset will be released soon.
zh

[CV-17] Patch and Shuffle: A Preprocessing Technique for Texture Classification in Autonomous Cementitious Fabrication

【速读】：该论文旨在解决水泥基材料挤出制造过程中因全局图像特征导致的传统纹理分类方法易受语义内容干扰的问题。为了解决这一挑战，论文提出了一种名为“patch and shuffle”的创新预处理技术。该方法通过将输入图像分割为小块、随机打乱这些小块的位置后再重新组合成一幅混乱的新图像，从而去除全局语义信息，迫使分类器依赖于局部纹理特征而非高层语义信息。实验结果表明，在基于ResNet-18架构的水泥挤出图像数据集上，采用此方法的模型测试准确率达到90.64%，显著优于传统方法的72.46%。因此，该解决方案的关键在于通过破坏全局结构来增强基于低级纹理特征的分类性能。

链接: https://arxiv.org/abs/2504.10353
作者: Jeremiah Giordani
机构: Princeton University (普林斯顿大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Originally completed as a final project for CEE 374 at Princeton University

点击查看摘要

Abstract:Autonomous fabrication systems are transforming construction and manufacturing, yet they remain vulnerable to print errors. Texture classification is a key component of computer vision systems that enable real-time monitoring and adjustment during cementitious fabrication. Traditional classification methods often rely on global image features, which can bias the model toward semantic content rather than low-level textures. In this paper, we introduce a novel preprocessing technique called “patch and shuffle,” which segments input images into smaller patches, shuffles them, and reconstructs a jumbled image before classification. This transformation removes semantic context, forcing the classifier to rely on local texture features. We evaluate this approach on a dataset of extruded cement images, using a ResNet-18-based architecture. Our experiments compare the patch and shuffle method to a standard pipeline, holding all other factors constant. Results show a significant improvement in accuracy: the patch and shuffle model achieved 90.64% test accuracy versus 72.46% for the baseline. These findings suggest that disrupting global structure enhances performance in texture-based classification tasks. This method has implications for broader vision tasks where low-level features matter more than high-level semantics. The technique may improve classification in applications ranging from fabrication monitoring to medical imaging. Comments: Originally completed as a final project for CEE 374 at Princeton University Subjects: Computer Vision and Pattern Recognition (cs.CV) Cite as: arXiv:2504.10353 [cs.CV] (or arXiv:2504.10353v1 [cs.CV] for this version) https://doi.org/10.48550/arXiv.2504.10353 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
zh

[CV-18] Multimodal Representation Learning Techniques for Comprehensive Facial State Analysis ICME2025

【速读】：该论文旨在解决多模态面部状态理解中的两个关键问题：一是现有研究在利用多模态面部表征理解感知方面仍显不足；二是理解与分析面部状态（如Action Units (AUs) 和情绪）需要一个能够有效桥接视觉与语言模态的综合且稳健的框架。论文的核心解决方案包括构建一个多模态面部数据集（MFA）以及提出一种名为多层级多模态面部基础模型（Multilevel Multimodal Face Foundation Model, MF^2）的新模型。其中，MF^2 模型通过在局部和全局层面整合全面的视觉特征建模，增强了对详细面部外观的表达能力，并确保视觉表示与结构化的 AUs 和情绪描述对齐，以实现有效的跨模态集成。此外，论文还开发了一种解耦微调网络（Decoupled Fine-Tuning Network, DFN），用于高效适配 MF^2 模型以应对不同任务和数据集，从而降低计算开销并扩展其应用场景。实验结果表明，该方法在 AUs 和情绪检测任务中表现出色。

链接: https://arxiv.org/abs/2504.10351
作者: Kaiwen Zheng,Xuri Ge,Junchen Fu,Jun Peng,Joemon M. Jose
机构: University of Glasgow (格拉斯哥大学), School of Computing Science; Shandong University (山东大学), School of Artificial Intelligence; Peng Cheng Laboratory (鹏城实验室)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by ICME2025

点击查看摘要

Abstract:Multimodal foundation models have significantly improved feature representation by integrating information from multiple modalities, making them highly suitable for a broader set of applications. However, the exploration of multimodal facial representation for understanding perception has been limited. Understanding and analyzing facial states, such as Action Units (AUs) and emotions, require a comprehensive and robust framework that bridges visual and linguistic modalities. In this paper, we present a comprehensive pipeline for multimodal facial state analysis. First, we compile a new Multimodal Face Dataset (MFA) by generating detailed multilevel language descriptions of face, incorporating Action Unit (AU) and emotion descriptions, by leveraging GPT-4o. Second, we introduce a novel Multilevel Multimodal Face Foundation model (MF^2) tailored for Action Unit (AU) and emotion recognition. Our model incorporates comprehensive visual feature modeling at both local and global levels of face image, enhancing its ability to represent detailed facial appearances. This design aligns visual representations with structured AU and emotion descriptions, ensuring effective cross-modal integration. Third, we develop a Decoupled Fine-Tuning Network (DFN) that efficiently adapts MF^2 across various tasks and datasets. This approach not only reduces computational overhead but also broadens the applicability of the foundation model to diverse scenarios. Experimentation show superior performance for AU and emotion detection tasks.
zh

[CV-19] Benchmarking 3D Human Pose Estimation Models Under Occlusions

【速读】：本文针对3D人体姿态估计（3D Human Pose Estimation, 3D HPE）中的关键挑战展开研究，重点关注现有模型在遮挡（occlusions）、相机位置变化以及动作多样性方面的鲁棒性和敏感性。为了解决这些问题，作者利用一个新颖的合成数据集BlendMimic3D，该数据集包含多摄像头设置及多种类型的遮挡场景。通过分析常见3D HPE数据集（如Human3.6M）与2D检测数据集（如COCO）之间关键点格式的差异，论文探讨了模型在标准训练条件之外的泛化能力以及遮挡对模型性能的影响。研究的关键在于揭示现有模型对遮挡和相机设置的高度敏感性，并强调需要开发能够更好地适应真实世界复杂多样性的新型模型。这一工作为提升3D HPE系统在复杂环境下的精确度和实用性提供了重要贡献。

链接: https://arxiv.org/abs/2504.10350
作者: Filipa Lino,Carlos Santiago,Manuel Marques
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:This paper addresses critical challenges in 3D Human Pose Estimation (HPE) by analyzing the robustness and sensitivity of existing models to occlusions, camera position, and action variability. Using a novel synthetic dataset, BlendMimic3D, which includes diverse scenarios with multi-camera setups and several occlusion types, we conduct specific tests on several state-of-the-art models. Our study focuses on the discrepancy in keypoint formats between common datasets such as Human3.6M, and 2D datasets such as COCO, commonly used for 2D detection models and frequently input of 3D HPE models. Our work explores the impact of occlusions on model performance and the generality of models trained exclusively under standard conditions. The findings suggest significant sensitivity to occlusions and camera settings, revealing a need for models that better adapt to real-world variability and occlusion scenarios. This research contributed to ongoing efforts to improve the fidelity and applicability of 3D HPE systems in complex environments.
zh

[CV-20] LL-Gaussian: Low-Light Scene Reconstruction and Enhancement via Gaussian Splatting for Novel View Synthesis

【速读】：该论文旨在解决低光场景下新型视角合成（NVS）的挑战，具体表现为输入图像存在严重噪声、低动态范围（LDR）以及不可靠初始化等问题。现有基于NeRF的方法虽表现良好，但计算成本高且依赖于精心捕获或预处理的数据（如RAW传感器输入或多曝光序列），限制了其实用性。同时，基于3D高斯点撒布（3DGS）的方法虽支持实时渲染，但在低光sRGB输入下的表现欠佳，导致高斯初始化不稳定及噪声抑制效果不佳。
论文的关键创新在于提出LL-Gaussian框架，用于从低光sRGB图像进行3D重建与增强，实现伪正常光照条件下的新型视角合成。其解决方案的核心包括：1）端到端低光高斯初始化模块（LLGIM），利用基于学习的多视图立体视觉（MVS）方法的密集先验生成高质量初始点云；2）双分支高斯分解模型，分离场景内在属性（反射率与光照）与瞬态干扰，确保优化过程稳定且可解释；3）基于物理约束和扩散先验的无监督优化策略，联合引导分解与增强过程。此外，论文还贡献了一个在极端低光环境下收集的具有挑战性的数据集，并展示了LL-Gaussian的有效性。与最先进的NeRF方法相比，LL-Gaussian实现了高达2000倍的推理加速，训练时间减少至2%，同时提供更优的重建和渲染质量。

链接: https://arxiv.org/abs/2504.10331
作者: Hao Sun,Fenggen Yu,Huiyao Xu,Tao Zhang,Changqing Zou
机构: Zhejiang Lab (浙江实验室); University of Chinese Academy of Sciences (中国科学院大学); State Key Lab of CAD&CG, Zhejiang University (浙江大学计算机辅助设计与图形学国家重点实验室); Amazon (亚马逊); Hangzhou Dianzi University (杭州电子科技大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Novel view synthesis (NVS) in low-light scenes remains a significant challenge due to degraded inputs characterized by severe noise, low dynamic range (LDR) and unreliable initialization. While recent NeRF-based approaches have shown promising results, most suffer from high computational costs, and some rely on carefully captured or pre-processed data–such as RAW sensor inputs or multi-exposure sequences–which severely limits their practicality. In contrast, 3D Gaussian Splatting (3DGS) enables real-time rendering with competitive visual fidelity; however, existing 3DGS-based methods struggle with low-light sRGB inputs, resulting in unstable Gaussian initialization and ineffective noise suppression. To address these challenges, we propose LL-Gaussian, a novel framework for 3D reconstruction and enhancement from low-light sRGB images, enabling pseudo normal-light novel view synthesis. Our method introduces three key innovations: 1) an end-to-end Low-Light Gaussian Initialization Module (LLGIM) that leverages dense priors from learning-based MVS approach to generate high-quality initial point clouds; 2) a dual-branch Gaussian decomposition model that disentangles intrinsic scene properties (reflectance and illumination) from transient interference, enabling stable and interpretable optimization; 3) an unsupervised optimization strategy guided by both physical constrains and diffusion prior to jointly steer decomposition and enhancement. Additionally, we contribute a challenging dataset collected in extreme low-light environments and demonstrate the effectiveness of LL-Gaussian. Compared to state-of-the-art NeRF-based methods, LL-Gaussian achieves up to 2,000 times faster inference and reduces training time to just 2%, while delivering superior reconstruction and rendering quality.
zh

[CV-21] InstructEngine: Instruction-driven Text-to-Image Alignment

【速读】：本文针对文本到图像模型偏好对齐中的现有方法在数据与算法上的局限性提出了解决方案。在数据方面，大多数现有方法依赖于手动标注的偏好数据，要么直接微调生成器，要么训练奖励模型以提供训练信号，但这些方法存在标注成本高、奖励模型计算开销大且准确性无法保证的问题。在算法层面，多数方法仅将图像反馈作为比较信号，忽视了文本的价值，导致效率低下且信号稀疏。为缓解这些问题，论文提出了InstructEngine框架。其关键在于通过构建文本到图像生成的分类法，并开发基于此的自动化数据构造管道，利用先进的大型多模态模型和人工定义的规则生成了25K个文本-图像偏好对。此外，引入了交叉验证对齐方法，通过组织语义相似样本形成可相互比较的对，从而提高数据效率。评估结果显示，InstructEngine使SD v1.5和SDXL的性能分别提升了10.53%和5.30%，优于最先进的基线方法，消融研究也证实了InstructEngine各组件的优势，人类评审的胜率超过50%进一步证明了其更好地符合人类偏好。

链接: https://arxiv.org/abs/2504.10329
作者: Xingyu Lu,Yuhang Hu,YiFan Zhang,Kaiyu Jiang,Changyi Liu,Tianke Zhang,Jinpeng Wang,Bin Wen,Chun Yuan,Fan Yang,Tingting Gao,Di Zhang
机构: Tsinghua University (清华大学); Kuaishou Technology (快手科技); CASIA (中国科学院自动化研究所)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 8 pages, 7 figures

点击查看摘要

Abstract:Reinforcement Learning from Human/AI Feedback (RLHF/RLAIF) has been extensively utilized for preference alignment of text-to-image models. Existing methods face certain limitations in terms of both data and algorithm. For training data, most approaches rely on manual annotated preference data, either by directly fine-tuning the generators or by training reward models to provide training signals. However, the high annotation cost makes them difficult to scale up, the reward model consumes extra computation and cannot guarantee accuracy. From an algorithmic perspective, most methods neglect the value of text and only take the image feedback as a comparative signal, which is inefficient and sparse. To alleviate these drawbacks, we propose the InstructEngine framework. Regarding annotation cost, we first construct a taxonomy for text-to-image generation, then develop an automated data construction pipeline based on it. Leveraging advanced large multimodal models and human-defined rules, we generate 25K text-image preference pairs. Finally, we introduce cross-validation alignment method, which refines data efficiency by organizing semantically analogous samples into mutually comparable pairs. Evaluations on DrawBench demonstrate that InstructEngine improves SD v1.5 and SDXL’s performance by 10.53% and 5.30%, outperforming state-of-the-art baselines, with ablation study confirming the benefits of InstructEngine’s all components. A win rate of over 50% in human reviews also proves that InstructEngine better aligns with human preferences.
zh

[CV-22] SlowFastVAD: Video Anomaly Detection via Integrating Simple Detector and RAG -Enhanced Vision-Language Model

【速读】：该论文旨在解决视频异常检测（Video Anomaly Detection, VAD）中的高误报率和低可解释性问题，同时克服现有视觉-语言模型（Vision-Language Models, VLMs）在计算成本高和领域适应性不足方面的局限。论文的关键创新在于提出了一种名为SlowFastVAD的混合框架，通过结合快速异常检测器与慢速异常检测器（基于检索增强生成机制的VLM），实现高效且可靠的异常检测。具体而言，快速检测器提供粗粒度的异常置信度分数，并仅对模糊片段进行进一步处理，而慢速VLM则用于详细检测和推理，从而提升可解释性。此外，通过构建包含正常模式的知识库并结合少量正常样本及VLM推断出的异常模式，增强了模型的领域适应能力。最终，论文通过平滑融合快慢检测器的异常置信度，提高了检测鲁棒性。实验结果表明，该方法在多个基准数据集上实现了显著的检测精度与可解释性提升，同时大幅降低了计算开销，适用于对可靠性要求高的实际应用。

链接: https://arxiv.org/abs/2504.10320
作者: Zongcan Ding,Haodong Zhang,Peng Wu,Guansong Pang,Zhiwei Yang,Peng Wang,Yanning Zhang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Video anomaly detection (VAD) aims to identify unexpected events in videos and has wide applications in safety-critical domains. While semi-supervised methods trained on only normal samples have gained traction, they often suffer from high false alarm rates and poor interpretability. Recently, vision-language models (VLMs) have demonstrated strong multimodal reasoning capabilities, offering new opportunities for explainable anomaly detection. However, their high computational cost and lack of domain adaptation hinder real-time deployment and reliability. Inspired by dual complementary pathways in human visual perception, we propose SlowFastVAD, a hybrid framework that integrates a fast anomaly detector with a slow anomaly detector (namely a retrieval augmented generation (RAG) enhanced VLM), to address these limitations. Specifically, the fast detector first provides coarse anomaly confidence scores, and only a small subset of ambiguous segments, rather than the entire video, is further analyzed by the slower yet more interpretable VLM for elaborate detection and reasoning. Furthermore, to adapt VLMs to domain-specific VAD scenarios, we construct a knowledge base including normal patterns based on few normal samples and abnormal patterns inferred by VLMs. During inference, relevant patterns are retrieved and used to augment prompts for anomaly reasoning. Finally, we smoothly fuse the anomaly confidence of fast and slow detectors to enhance robustness of anomaly detection. Extensive experiments on four benchmarks demonstrate that SlowFastVAD effectively combines the strengths of both fast and slow detectors, and achieves remarkable detection accuracy and interpretability with significantly reduced computational overhead, making it well-suited for real-world VAD applications with high reliability requirements.
zh

[CV-23] Analysis of Attention in Video Diffusion Transformers

【速读】：该论文旨在深入分析视频扩散变换器（Video Diffusion Transformers, VDiTs）中的注意力机制，并报告若干新发现。论文识别了VDiTs中注意力的三个关键特性：结构（Structure）、稀疏性（Sparsity）和汇点（Sinks）。论文的关键解决方案在于揭示这些特性的本质及其对模型效率与性能之间权衡的影响。通过研究注意力模式的相似性，论文提出可通过自注意力图迁移实现视频编辑；针对稀疏性问题，指出并非所有层都能有效应用现有稀疏化方法；同时首次探讨了VDiTs中注意力汇点的现象，并将其与语言模型中的汇点进行对比。基于这些洞察，论文提出了提升VDiTs效率-质量帕累托前沿的若干未来方向。

链接: https://arxiv.org/abs/2504.10317
作者: Yuxin Wen,Jim Wu,Ajay Jain,Tom Goldstein,Ashwinee Panda
机构: University of Maryland (马里兰大学); GenmoAI (GenmoAI)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:We conduct an in-depth analysis of attention in video diffusion transformers (VDiTs) and report a number of novel findings. We identify three key properties of attention in VDiTs: Structure, Sparsity, and Sinks. Structure: We observe that attention patterns across different VDiTs exhibit similar structure across different prompts, and that we can make use of the similarity of attention patterns to unlock video editing via self-attention map transfer. Sparse: We study attention sparsity in VDiTs, finding that proposed sparsity methods do not work for all VDiTs, because some layers that are seemingly sparse cannot be sparsified. Sinks: We make the first study of attention sinks in VDiTs, comparing and contrasting them to attention sinks in language models. We propose a number of future directions that can make use of our insights to improve the efficiency-quality Pareto frontier for VDiTs.
zh

[CV-24] ESCT3D: Efficient and Selectively Controllable Text-Driven 3D Content Generation with Gaussian Splatting

【速读】：该论文旨在解决文本驱动的3D内容生成中存在的两个主要问题：一是从极简文本输入生成高质量3D内容的效率低下，二是现有方法在可控性方面的不足。论文的关键解决方案是提出了一种结合GPT-4V的自优化机制，显著提升了单次生成满足用户期望的高质量3D内容的效率，并通过引入多条件控制（如风格、边缘、草图、姿态等）增强了生成过程的可控性。此外，在训练过程中，通过有效整合多视角信息（包括多视角深度、掩码、特征和图像），解决了3D内容生成中的“Janus问题”。实验结果表明，该方法具有强大的泛化能力，实现了高效且可控的高质量3D内容生成。

链接: https://arxiv.org/abs/2504.10316
作者: Huiqi Wu,Jianbo Mei,Yingjie Huang,Yining Xu,Jingjiao You,Yilong Liu,Li Yao
机构: Southeast University (东南大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:In recent years, significant advancements have been made in text-driven 3D content generation. However, several challenges remain. In practical applications, users often provide extremely simple text inputs while expecting high-quality 3D content. Generating optimal results from such minimal text is a difficult task due to the strong dependency of text-to-3D models on the quality of input prompts. Moreover, the generation process exhibits high variability, making it difficult to control. Consequently, multiple iterations are typically required to produce content that meets user expectations, reducing generation efficiency. To address this issue, we propose GPT-4V for self-optimization, which significantly enhances the efficiency of generating satisfactory content in a single attempt. Furthermore, the controllability of text-to-3D generation methods has not been fully explored. Our approach enables users to not only provide textual descriptions but also specify additional conditions, such as style, edges, scribbles, poses, or combinations of multiple conditions, allowing for more precise control over the generated 3D content. Additionally, during training, we effectively integrate multi-view information, including multi-view depth, masks, features, and images, to address the common Janus problem in 3D content generation. Extensive experiments demonstrate that our method achieves robust generalization, facilitating the efficient and controllable generation of high-quality 3D content.
zh

[CV-25] Noise2Ghost: Self-supervised deep convolutional reconstruction for ghost imaging

【速读】：该论文致力于解决低光照条件下鬼成像（Ghost Imaging, GI）信号-to-噪声比低的问题，特别是在需要处理噪声较大的新兴应用场景中。论文提出了一种基于自监督深度学习的新方法，其关键在于通过引入自监督机制消除了对干净参考数据的需求，同时实现了强大的降噪效果，从而显著提升了成像质量。这一方案为低光照鬼成像场景提供了必要的工具，例如微米和纳米尺度的X射线发射成像（如剂量敏感样本的X射线荧光成像），并可应用于生物样本及电池的体内与在操作条件下的研究。

链接: https://arxiv.org/abs/2504.10288
作者: Mathieu Manni,Dmitry Karpov,K. Joost Batenburg,Sharon Shwartz,Nicola Viganò
机构: ESRF — The European Synchrotron (欧洲同步辐射设施), Bar Ilan University (巴伊兰大学), Université Grenoble Alpes (格勒诺布尔阿尔卑斯大学), Leiden Universiteit (莱顿大学), CEA (法国原子能委员会)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Data Analysis, Statistics and Probability (physics.data-an)
备注:

点击查看摘要

Abstract:We present a new self-supervised deep-learning-based Ghost Imaging (GI) reconstruction method, which provides unparalleled reconstruction performance for noisy acquisitions among unsupervised methods. We present the supporting mathematical framework and results from theoretical and real data use cases. Self-supervision removes the need for clean reference data while offering strong noise reduction. This provides the necessary tools for addressing signal-to-noise ratio concerns for GI acquisitions in emerging and cutting-edge low-light GI scenarios. Notable examples include micro- and nano-scale x-ray emission imaging, e.g., x-ray fluorescence imaging of dose-sensitive samples. Their applications include in-vivo and in-operando case studies for biological samples and batteries.
zh

[CV-26] DiffMOD: Progressive Diffusion Point Denoising for Moving Object Detection in Remote Sensing

【速读】：该论文旨在解决遥感图像中运动目标检测(MOD)面临的低分辨率、极小目标尺寸以及复杂噪声干扰等挑战。当前基于深度学习的MOD方法依赖于概率密度估计，这限制了对象间及跨时间帧的灵活信息交互。为了解决这一问题，论文提出了一种基于点的遥感运动目标检测方法。其关键是将网络优化表述为一个逐步去噪的过程，通过迭代恢复稀疏噪声点中的运动目标中心来捕捉高阶的对象间和时间关系。具体而言，该方法从主干网络输出中采样离散特征作为原子单元进行后续处理，并聚合全局特征嵌入以弥补稀疏点特征覆盖范围有限的问题。此外，设计的空间关系聚合注意力模块通过建模空间相对位置和语义亲和性，增强了点级特征之间的高阶交互，从而提升目标表示能力；而时序传播与全局融合模块则利用隐式记忆推理机制实现了稳健的跨帧特征集成。为了配合逐步去噪过程，提出了渐进MinK最优传输分配策略，在每个去噪级别建立专门的学习目标。同时，引入缺失损失函数以对抗去噪点在显著物体周围聚集的趋势。实验结果表明，基于离散点去噪的MOD方法能够更有效地探索稀疏运动目标间的潜在关系，提高检测能力和时序一致性。

链接: https://arxiv.org/abs/2504.10278
作者: Jinyue Zhang,Xiangrong Zhang,Zhongjian Huang,Tianyang Zhang,Yifei Jiang,Licheng Jiao
机构: Xidian University (西安电子科技大学); Inspur Software Co., Ltd. (浪潮软件有限公司)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 9 pages, 7 figures

点击查看摘要

Abstract:Moving object detection (MOD) in remote sensing is significantly challenged by low resolution, extremely small object sizes, and complex noise interference. Current deep learning-based MOD methods rely on probability density estimation, which restricts flexible information interaction between objects and across temporal frames. To flexibly capture high-order inter-object and temporal relationships, we propose a point-based MOD in remote sensing. Inspired by diffusion models, the network optimization is formulated as a progressive denoising process that iteratively recovers moving object centers from sparse noisy points. Specifically, we sample scattered features from the backbone outputs as atomic units for subsequent processing, while global feature embeddings are aggregated to compensate for the limited coverage of sparse point features. By modeling spatial relative positions and semantic affinities, Spatial Relation Aggregation Attention is designed to enable high-order interactions among point-level features for enhanced object representation. To enhance temporal consistency, the Temporal Propagation and Global Fusion module is designed, which leverages an implicit memory reasoning mechanism for robust cross-frame feature integration. To align with the progressive denoising process, we propose a progressive MinK optimal transport assignment strategy that establishes specialized learning objectives at each denoising level. Additionally, we introduce a missing loss function to counteract the clustering tendency of denoised points around salient objects. Experiments on the RsData remote sensing MOD dataset show that our MOD method based on scattered point denoising can more effectively explore potential relationships between sparse moving objects and improve the detection capability and temporal consistency.
zh

[CV-27] LMFormer: Lane based Motion Prediction Transformer CVPR2025

【速读】：该论文致力于解决自动驾驶中的运动预测问题，特别是在轨迹预测任务中提升模型的性能与可解释性。论文的关键创新在于提出了一种名为LMFormer的车道感知Transformer网络，其核心解决方案包括：1) 引入一种动态优先级机制以灵活关注不同车道，从而增强模型行为的可解释性；2) 利用交叉口处的车道连接信息（如车道合并与分裂）来学习长距离的车道结构依赖关系；3) 提出通过堆叠Transformer层实现预测轨迹的高效迭代优化方法。实验结果表明，LMFormer在nuScenes数据集上达到了SOTA性能，并展示了其跨数据集的统一训练能力与卓越表现。

链接: https://arxiv.org/abs/2504.10275
作者: Harsh Yadav,Maximilian Schaefer,Kun Zhao,Tobias Meisen
机构: University of Wuppertal (德国伍珀塔尔大学); Aptiv Services Deutschland GmbH (Aptiv德国服务有限公司)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: Accepted: Autonomous Driving Workshop, CVPR 2025

点击查看摘要

Abstract:Motion prediction plays an important role in autonomous driving. This study presents LMFormer, a lane-aware transformer network for trajectory prediction tasks. In contrast to previous studies, our work provides a simple mechanism to dynamically prioritize the lanes and shows that such a mechanism introduces explainability into the learning behavior of the network. Additionally, LMFormer uses the lane connection information at intersections, lane merges, and lane splits, in order to learn long-range dependency in lane structure. Moreover, we also address the issue of refining the predicted trajectories and propose an efficient method for iterative refinement through stacked transformer layers. For benchmarking, we evaluate LMFormer on the nuScenes dataset and demonstrate that it achieves SOTA performance across multiple metrics. Furthermore, the Deep Scenario dataset is used to not only illustrate cross-dataset network performance but also the unification capabilities of LMFormer to train on multiple datasets and achieve better performance.
zh

[CV-28] rade-offs in Privacy-Preserving Eye Tracking through Iris Obfuscation: A Benchmarking Study

【速读】：该论文试图解决在增强现实/虚拟现实（AR/VR）头戴式显示设备（HMDs）中集成眼动追踪技术时面临的隐私与实用性的权衡问题。具体而言，眼动追踪系统需要采集包含虹膜纹理的原始眼动图像或视频以估计用户的凝视信息，但这些虹膜数据被视为生物识别信息的黄金标准，可能引发隐私担忧。为了解决这一问题，论文探索了多种方法来模糊或转换用户身份信息，同时评估其对图像质量、隐私保护、任务实用性以及对抗伪造攻击风险的影响。

解决方案的关键在于综合比较几种主流的去标识化技术（包括模糊处理、加噪、下采样、橡胶薄片模型以及虹膜风格迁移）在隐私保护和功能性能之间的平衡效果。实验结果表明，传统的图像处理方法（如模糊和加噪）对基于深度学习的任务影响较小，而下采样、橡胶薄片模型及虹膜风格迁移在隐藏用户标识方面更有效。其中，虹膜风格迁移虽然计算成本较高，但在保持高任务性能的同时对伪造攻击更具鲁棒性。最终，论文指出不存在一种通用的最佳方法能够同时优化隐私保护、任务实用性与计算开销，因此建议研究者根据具体需求选择合适的技术组合以实现隐私-实用性之间的最佳折衷。

链接: https://arxiv.org/abs/2504.10267
作者: Mengdi Wang,Efe Bozkir,Enkelejda Kasneci
机构: Technical University of Munich (慕尼黑工业大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: The 25th International Conference on Digital Signal Processing (DSP 2025)

点击查看摘要

Abstract:Recent developments in hardware, computer graphics, and AI may soon enable AR/VR head-mounted displays (HMDs) to become everyday devices like smartphones and tablets. Eye trackers within HMDs provide a special opportunity for such setups as it is possible to facilitate gaze-based research and interaction. However, estimating users’ gaze information often requires raw eye images and videos that contain iris textures, which are considered a gold standard biometric for user authentication, and this raises privacy concerns. Previous research in the eye-tracking community focused on obfuscating iris textures while keeping utility tasks such as gaze estimation accurate. Despite these attempts, there is no comprehensive benchmark that evaluates state-of-the-art approaches. Considering all, in this paper, we benchmark blurring, noising, downsampling, rubber sheet model, and iris style transfer to obfuscate user identity, and compare their impact on image quality, privacy, utility, and risk of imposter attack on two datasets. We use eye segmentation and gaze estimation as utility tasks, and reduction in iris recognition accuracy as a measure of privacy protection, and false acceptance rate to estimate risk of attack. Our experiments show that canonical image processing methods like blurring and noising cause a marginal impact on deep learning-based tasks. While downsampling, rubber sheet model, and iris style transfer are effective in hiding user identifiers, iris style transfer, with higher computation cost, outperforms others in both utility tasks, and is more resilient against spoof attacks. Our analyses indicate that there is no universal optimal approach to balance privacy, utility, and computation burden. Therefore, we recommend practitioners consider the strengths and weaknesses of each approach, and possible combinations of those to reach an optimal privacy-utility trade-off.
zh

[CV-29] XY-Cut: Advanced Layout Ordering via Hierarchical Mask Mechanism on a Novel Benchmark

【速读】：该论文致力于解决文档图像理解中的文档阅读顺序恢复任务，这一任务在增强检索增强生成（Retrieval-Augmented Generation, RAG）以及作为大型语言模型（Large Language Models, LLMs）的预处理步骤中具有重要作用。然而，现有方法常面临复杂布局（如多栏报纸）、跨模态元素（视觉区域与文本语义）之间高开销交互，以及缺乏鲁棒评估基准等挑战。为应对这些难题，论文提出了一种先进的布局排序方法XY-Cut++，其关键在于整合预掩码处理、多粒度分割和跨模态匹配技术。通过这种方法，XY-Cut++显著提升了布局排序的准确性，在DocBench-100数据集上实现了最先进的性能（总体98.8 BLEU），并且在简单与复杂的布局上均表现出一致的高精度，相比现有基线最多提升了24%。这项工作为文档结构恢复奠定了可靠的基础，并为布局排序任务设定了新的标准，促进了更有效的RAG和LLM预处理。

链接: https://arxiv.org/abs/2504.10258
作者: Shuai Liu,Youmeng Li,Jizeng Wei
机构: Tianjin University (天津大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
备注:

点击查看摘要

Abstract:Document Reading Order Recovery is a fundamental task in document image understanding, playing a pivotal role in enhancing Retrieval-Augmented Generation (RAG) and serving as a critical preprocessing step for large language models (LLMs). Existing methods often struggle with complex layouts(e.g., multi-column newspapers), high-overhead interactions between cross-modal elements (visual regions and textual semantics), and a lack of robust evaluation benchmarks. We introduce XY-Cut++, an advanced layout ordering method that integrates pre-mask processing, multi-granularity segmentation, and cross-modal matching to address these challenges. Our method significantly enhances layout ordering accuracy compared to traditional XY-Cut techniques. Specifically, XY-Cut++ achieves state-of-the-art performance (98.8 BLEU overall) while maintaining simplicity and efficiency. It outperforms existing baselines by up to 24% and demonstrates consistent accuracy across simple and complex layouts on the newly introduced DocBench-100 dataset. This advancement establishes a reliable foundation for document structure recovery, setting a new standard for layout ordering tasks and facilitating more effective RAG and LLM preprocessing.
zh

[CV-30] MASSeg : 2nd Technical Report for 4th PVUW MOSE Track

【速读】：该论文旨在解决复杂视频对象分割中的小目标识别、遮挡处理以及动态场景建模等挑战。为应对这些问题，论文提出的关键解决方案包括：1）改进了一个名为MASSeg的分割模型，其基于现有框架并针对复杂场景进行了优化；2）构建了一个增强数据集MOSE+，包含具有遮挡、杂乱背景及小目标实例的典型场景；3）在训练阶段结合帧间一致与不一致的数据增强策略以提升鲁棒性和泛化能力；4）在推理阶段设计了掩码输出缩放策略，以更好地适应不同大小的对象及其遮挡程度。最终，MASSeg在MOSE测试集上取得了J分数0.8250、F分数0.9007以及JF分数0.8628的成绩。

链接: https://arxiv.org/abs/2504.10254
作者: Xuqiang Cao,Linnan Zhao,Jiaxuan Zhao,Fang Liu,Puhua Chen,Wenping Ma
机构: Key Laboratory of Intelligent Perception and Image Understanding (智能感知与图像理解重点实验室), Xi’an, China (中国西安)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 5 pages,4 figures,Technical report on Complex Video Object Segmentation

点击查看摘要

Abstract:Complex video object segmentation continues to face significant challenges in small object recognition, occlusion handling, and dynamic scene modeling. This report presents our solution, which ranked second in the MOSE track of CVPR 2025 PVUW Challenge. Based on an existing segmentation framework, we propose an improved model named MASSeg for complex video object segmentation, and construct an enhanced dataset, MOSE+, which includes typical scenarios with occlusions, cluttered backgrounds, and small target instances. During training, we incorporate a combination of inter-frame consistent and inconsistent data augmentation strategies to improve robustness and generalization. During inference, we design a mask output scaling strategy to better adapt to varying object sizes and occlusion levels. As a result, MASSeg achieves a J score of 0.8250, F score of 0.9007, and a JF score of 0.8628 on the MOSE test set.
zh

[CV-31] CAT: A Conditional Adaptation Tailor for Efficient and Effective Instance-Specific Pansharpening on Real-World Data

【速读】：该论文旨在解决传统 pansharpening 方法在跨传感器泛化能力有限以及计算开销高的问题，限制了其实时应用潜力。为应对这些挑战，论文提出了一种高效的框架，能够快速适应特定输入实例，并在短时间内完成训练与推理过程。解决方案的关键在于引入了一个名为 CAT 的模块，该模块嵌入在预训练网络的特征提取与通道变换阶段之间，通过调整融合特征并固定参数以实现高效推理，从而生成改进的结果。这种方法不仅提升了模型的跨传感器泛化能力，还显著增强了计算效率。

链接: https://arxiv.org/abs/2504.10242
作者: Tianyu Xin,Jin-Liang Xiao,Zeyu Xia,Shan Yin,Liang-Jian Deng
机构: University of Electronic Science and Technology of China(电子科技大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Pansharpening is a crucial remote sensing technique that fuses low-resolution multispectral (LRMS) images with high-resolution panchromatic (PAN) images to generate high-resolution multispectral (HRMS) imagery. Although deep learning techniques have significantly advanced pansharpening, many existing methods suffer from limited cross-sensor generalization and high computational overhead, restricting their real-time applications. To address these challenges, we propose an efficient framework that quickly adapts to a specific input instance, completing both training and inference in a short time. Our framework splits the input image into multiple patches, selects a subset for unsupervised CAT training, and then performs inference on all patches, stitching them into the final output. The CAT module, integrated between the feature extraction and channel transformation stages of a pre-trained network, tailors the fused features and fixes the parameters for efficient inference, generating improved results. Our approach offers two key advantages: (1) \textitImproved Generalization Ability : by mitigating cross-sensor degradation, our model–although pre-trained on a specific dataset–achieves superior performance on datasets captured by other sensors; (2) \textitEnhanced Computational Efficiency : the CAT-enhanced network can swiftly adapt to the test sample using the single LRMS-PAN pair input, without requiring extensive large-scale data retraining. Experiments on the real-world data from WorldView-3 and WorldView-2 datasets demonstrate that our method achieves state-of-the-art performance on cross-sensor real-world data, while achieving both training and inference of 512\times512 image within \textit0.4 seconds and 4000\times4000 image within \textit3 seconds at the fastest setting on a commonly used RTX 3090 GPU.
zh

[CV-32] Balancing Stability and Plasticity in Pretrained Detector: A Dual-Path Framework for Incremental Object Detection

【速读】：该论文旨在解决预训练模型增量目标检测（Pretrained Model-based Incremental Object Detection, PTMIOD）中稳定性与可塑性之间的平衡这一根本挑战。现有方法在与预训练数据对齐的任务中表现出色，但在跨域场景中的可塑性尚未得到充分探索。论文通过系统分析发现，预训练检测器中的定位模块在分布偏移情况下能够保持跨域稳定性并精确估计边界框，而分类组件则需要增强的可塑性以缓解跨域场景中的判别能力下降。基于此，论文提出了一种基于预训练DETR检测器的双路径框架，关键在于解耦定位稳定性和分类可塑性：定位路径保留预训练的定位知识以维持稳定性，而分类路径通过参数高效微调实现可塑性，并借助伪特征重放抵抗遗忘现象。该解决方案有效平衡了PTMIOD中的稳定性与可塑性，在域内（如MS COCO和PASCAL VOC）及跨域（如TT100K）基准测试中展现了最先进的性能。

链接: https://arxiv.org/abs/2504.10214
作者: Songze Li,Qixing Xu,Tonghua Su,Xu-Yao Zhang,Zhongjie Wang
机构: Harbin Institute of Technology (哈尔滨工业大学); State Key Laboratory of Multimodal Artificial Intelligence Systems, CASIA (多模态人工智能系统国家重点实验室, 中科院自动化研究所); School of Artificial Intelligence, UCAS (中国科学院大学人工智能学院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:The balance between stability and plasticity remains a fundamental challenge in pretrained model-based incremental object detection (PTMIOD). While existing PTMIOD methods demonstrate strong performance on in-domain tasks aligned with pretraining data, their plasticity to cross-domain scenarios remains underexplored. Through systematic component-wise analysis of pretrained detectors, we reveal a fundamental discrepancy: the localization modules demonstrate inherent cross-domain stability-preserving precise bounding box estimation across distribution shifts-while the classification components require enhanced plasticity to mitigate discriminability degradation in cross-domain scenarios. Motivated by these findings, we propose a dual-path framework built upon pretrained DETR-based detectors which decouples localization stability and classification plasticity: the localization path maintains stability to preserve pretrained localization knowledge, while the classification path facilitates plasticity via parameter-efficient fine-tuning and resists forgetting with pseudo-feature replay. Extensive evaluations on both in-domain (MS COCO and PASCAL VOC) and cross-domain (TT100K) benchmarks show state-of-the-art performance, demonstrating our method’s ability to effectively balance stability and plasticity in PTMIOD, achieving robust cross-domain adaptation and strong retention of anti-forgetting capabilities.
zh

[CV-33] VibrantLeaves: A principled parametric image generator for training deep restoration models

【速读】：该论文试图解决深度神经网络在图像恢复任务中因训练集偏差导致性能受限以及可解释性不足的问题。为应对这些挑战，论文提出了一种基于简单原则的合成图像生成器，其关键是通过几何建模、纹理建模及简单的图像采集建模，将这些特性整合到经典的Dead Leaves模型中，从而生成高效的训练数据集。这种解决方案使得标准的去噪和超分辨率网络能够在合成数据集上达到接近自然图像数据集的性能，并且提升了模型对测试集中几何与辐射扰动的鲁棒性。此外，论文还对所采用的原则进行了细致分析，以识别实现良好性能所需的图像属性。

链接: https://arxiv.org/abs/2504.10201
作者: Raphael Achddou,Yann Gousseau,Saïd Ladjal,Sabine Süsstrunk
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Even though Deep Neural Networks are extremely powerful for image restoration tasks, they have several limitations. They are poorly understood and suffer from strong biases inherited from the training sets. One way to address these shortcomings is to have a better control over the training sets, in particular by using synthetic sets. In this paper, we propose a synthetic image generator relying on a few simple principles. In particular, we focus on geometric modeling, textures, and a simple modeling of image acquisition. These properties, integrated in a classical Dead Leaves model, enable the creation of efficient training sets. Standard image denoising and super-resolution networks can be trained on such datasets, reaching performance almost on par with training on natural image datasets. As a first step towards explainability, we provide a careful analysis of the considered principles, identifying which image properties are necessary to obtain good performances. Besides, such training also yields better robustness to various geometric and radiometric perturbations of the test sets.
zh

[CV-34] Differentially Private 2D Human Pose Estimation

【速读】：该论文旨在解决在处理敏感视觉数据的人体姿态估计（Human Pose Estimation, HPE）任务中隐私保护与模型性能之间的权衡问题。传统匿名化技术无法充分保护隐私且会损害数据的广泛可用性，而差分隐私（Differential Privacy, DP）虽能提供严格的隐私保障，但通常会导致模型性能显著下降。论文的关键创新在于首次提出了基于差分隐私随机梯度下降（Differentially Private Stochastic Gradient Descent, DP-SGD）的2D人体姿态估计方法，并通过引入投影差分隐私随机梯度下降（Projected DP-SGD, PDP-SGD）有效平衡了隐私保护与模型性能。此外，论文还采用了轻量级且高效的TinyViT作为姿态估计的骨干网络，进一步提升了在资源受限设备上的隐私保护部署可行性。实验结果表明，PDP-SGD在严格的隐私预算（(\epsilon=0.2)）下达到了78.48%的PCKh@0.5，显著优于标准DP-SGD的63.85%，为实际应用中的隐私保护人体姿态估计奠定了基础。

链接: https://arxiv.org/abs/2504.10190
作者: Kaushik Bhargav Sivangi,Idris Zakariyya,Paul Henderson,Fani Deligianni
机构: University of Glasgow(Glasgow)(格拉斯哥大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Human pose estimation (HPE) has become essential in numerous applications including healthcare, activity recognition, and human-computer interaction. However, the privacy implications of processing sensitive visual data present significant deployment barriers in critical domains. While traditional anonymization techniques offer limited protection and often compromise data utility for broader motion analysis, Differential Privacy (DP) provides formal privacy guarantees but typically degrades model performance when applied naively. In this work, we present the first differentially private 2D human pose estimation (2D-HPE) by applying Differentially Private Stochastic Gradient Descent (DP-SGD) to this task. To effectively balance privacy with performance, we adopt Projected DP-SGD (PDP-SGD), which projects the noisy gradients to a low-dimensional subspace. Additionally, we adapt TinyViT, a compact and efficient vision transformer for coordinate classification in HPE, providing a lightweight yet powerful backbone that enhances privacy-preserving deployment feasibility on resource-limited devices. Our approach is particularly valuable for multimedia interpretation tasks, enabling privacy-safe analysis and understanding of human motion across diverse visual media while preserving the semantic meaning required for downstream applications. Comprehensive experiments on the MPII Human Pose Dataset demonstrate significant performance enhancement with PDP-SGD achieving 78.48% PCKh@0.5 at a strict privacy budget ( \epsilon=0.2 ), compared to 63.85% for standard DP-SGD. This work lays foundation for privacy-preserving human pose estimation in real-world, sensitive applications.
zh

[CV-35] LLaVA-ReID: Selective Multi-image Questioner for Interactive Person Re-Identification

【速读】：该论文致力于解决传统基于文本的人体再识别（Person ReID）方法在实际场景中面临的挑战，即目击者提供的描述往往是不完整或模糊的。为了解决这一局限性，论文引入了一种新的任务——交互式人体再识别（Interactive Person ReID, Inter-ReID）。Inter-ReID 是一种基于对话的检索任务，通过与目击者的持续互动逐步细化初始描述。论文的关键创新在于提出了一种名为 LLaVA-ReID 的模型，该模型能够根据视觉和文本上下文生成有针对性的问题，以引导目击者提供关于目标人体的更多细节信息。此外，通过采用前瞻策略（looking-forward strategy），论文优先选择最具信息量的问题作为训练过程中的监督信号，从而显著提升了模型性能。实验结果表明，LLaVA-ReID 在 Inter-ReID 和基于文本的 ReID 基准测试中均大幅超越基线方法。

链接: https://arxiv.org/abs/2504.10174
作者: Yiding Lu,Mouxing Yang,Dezhong Peng,Peng Hu,Yijie Lin,Xi Peng
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Traditional text-based person ReID assumes that person descriptions from witnesses are complete and provided at once. However, in real-world scenarios, such descriptions are often partial or vague. To address this limitation, we introduce a new task called interactive person re-identification (Inter-ReID). Inter-ReID is a dialogue-based retrieval task that iteratively refines initial descriptions through ongoing interactions with the witnesses. To facilitate the study of this new task, we construct a dialogue dataset that incorporates multiple types of questions by decomposing fine-grained attributes of individuals. We further propose LLaVA-ReID, a question model that generates targeted questions based on visual and textual contexts to elicit additional details about the target person. Leveraging a looking-forward strategy, we prioritize the most informative questions as supervision during training. Experimental results on both Inter-ReID and text-based ReID benchmarks demonstrate that LLaVA-ReID significantly outperforms baselines.
zh

[CV-36] WildLive: Near Real-time Visual Wildlife Tracking onboard UAVs

【速读】：该论文试图解决无人机在远距离自主飞行及野生动物特定行为识别任务中，因依赖地面站视频流支持而导致实时性不足的问题。论文的关键在于提出了一种名为WildLive的近实时动物检测与跟踪框架，直接运行于无人机（UAV）机载设备上。该方案通过将稀疏光流跟踪的高效性与YOLO驱动的目标检测技术相结合，并优化计算资源分配至时空高不确定性区域，实现了HD分辨率下每秒17帧以上、4K分辨率下每秒7帧以上的多动物检测与跟踪速度，同时保持了高精度。此外，论文还构建了一个包含超过20万个人工标注动物实例的数据集WildLive，用于验证系统的性能。通过与现有目标跟踪方法如OC-SORT、ByteTrack和SORT的对比，证明了该系统能够在维持高精度的同时，实现无人机上的近实时高分辨率野生动物跟踪，为未来的导航及任务特定的动物中心操作自主性奠定了基础。

链接: https://arxiv.org/abs/2504.10165
作者: Nguyen Ngoc Dat,Tom Richardson,Matthew Watson,Kilian Meier,Jenna Kline,Sid Reid,Guy Maalouf,Duncan Hine,Majid Mirmehdi,Tilo Burghardt
机构: University of Bristol (布里斯托大学); Ohio State University (俄亥俄州立大学); University of Southern Denmark (南丹麦大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Submitted in CV4Animals 2025

点击查看摘要

Abstract:Live tracking of wildlife via high-resolution video processing directly onboard drones is widely unexplored and most existing solutions rely on streaming video to ground stations to support navigation. Yet, both autonomous animal-reactive flight control beyond visual line of sight and/or mission-specific individual and behaviour recognition tasks rely to some degree on this capability. In response, we introduce WildLive – a near real-time animal detection and tracking framework for high-resolution imagery running directly onboard uncrewed aerial vehicles (UAVs). The system performs multi-animal detection and tracking at 17fps+ for HD and 7fps+ on 4K video streams suitable for operation during higher altitude flights to minimise animal disturbance. Our system is optimised for Jetson Orin AGX onboard hardware. It integrates the efficiency of sparse optical flow tracking and mission-specific sampling with device-optimised and proven YOLO-driven object detection and segmentation techniques. Essentially, computational resource is focused onto spatio-temporal regions of high uncertainty to significantly improve UAV processing speeds without domain-specific loss of accuracy. Alongside, we introduce our WildLive dataset, which comprises 200k+ annotated animal instances across 19k+ frames from 4K UAV videos collected at the Ol Pejeta Conservancy in Kenya. All frames contain ground truth bounding boxes, segmentation masks, as well as individual tracklets and tracking point trajectories. We compare our system against current object tracking approaches including OC-SORT, ByteTrack, and SORT. Our multi-animal tracking experiments with onboard hardware confirm that near real-time high-resolution wildlife tracking is possible on UAVs whilst maintaining high accuracy levels as needed for future navigational and mission-specific animal-centric operational autonomy.
zh

[CV-37] COUNTS: Benchmarking Object Detectors and Multimodal Large Language Models under Distribution Shifts

【速读】：该论文旨在解决现有目标检测器在遭遇分布偏移（distributional shifts）时性能显著下降的问题，并填补大规模、细粒度标注的数据集及评估基准的空白，以更全面地评估目标检测器和跨模态大语言模型（MLLMs）在域外泛化（out-of-distribution generalization, OOD）能力上的表现。论文的关键解决方案是引入COUNTS数据集，这是一个包含14种自然分布偏移、超过222K样本以及超过119.6万个目标级标注的大规模OOD数据集。基于此，论文提出了两个新的基准：O(OD)²用于评估目标检测器的OOD泛化能力，OODG则专注于评估MLLMs在视觉接地任务中的OOD泛化能力。研究发现，尽管大规模预训练模型和大量预训练数据显著提升了同分布（in-distribution, IID）场景下的性能，但在OOD场景下，无论是目标检测器还是MLLMs仍存在明显局限性和改进空间。

链接: https://arxiv.org/abs/2504.10158
作者: Jiansheng Li,Xingxuan Zhang,Hao Zou,Yige Guo,Renzhe Xu,Yilong Liu,Chuzhao Zhu,Yue He,Peng Cui
机构: Department of Computer Science, Tsinghua University (清华大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Current object detectors often suffer significant perfor-mance degradation in real-world applications when encountering distributional shifts. Consequently, the out-of-distribution (OOD) generalization capability of object detectors has garnered increasing attention from researchers. Despite this growing interest, there remains a lack of a large-scale, comprehensive dataset and evaluation benchmark with fine-grained annotations tailored to assess the OOD generalization on more intricate tasks like object detection and grounding. To address this gap, we introduce COUNTS, a large-scale OOD dataset with object-level annotations. COUNTS encompasses 14 natural distributional shifts, over 222K samples, and more than 1,196K labeled bounding boxes. Leveraging COUNTS, we introduce two novel benchmarks: O(OD)2 and OODG. O(OD)2 is designed to comprehensively evaluate the OOD generalization capabilities of object detectors by utilizing controlled distribution shifts between training and testing data. OODG, on the other hand, aims to assess the OOD generalization of grounding abilities in multimodal large language models (MLLMs). Our findings reveal that, while large models and extensive pre-training data substantially en hance performance in in-distribution (IID) scenarios, significant limitations and opportunities for improvement persist in OOD contexts for both object detectors and MLLMs. In visual grounding tasks, even the advanced GPT-4o and Gemini-1.5 only achieve 56.7% and 28.0% accuracy, respectively. We hope COUNTS facilitates advancements in the development and assessment of robust object detectors and MLLMs capable of maintaining high performance under distributional shifts.
zh

[CV-38] Hierarchical and Step-Layer-Wise Tuning of Attention Specialty for Multi-Instance Synthesis in Diffusion Transformers

【速读】：该论文旨在解决文本到图像（Text-to-image, T2I）生成模型在多实例合成（Multi-instance Synthesis, MIS）中的挑战，即如何根据复杂提示（prompts）准确描绘单个图像中的多个独立实例。传统基于UNet架构如SD v1.5/SDXL的MIS控制方法无法适应依赖于图像与文本标记之间集成注意力而非跨注意力机制的DiT基模型（如FLUX和SD v3.5）。论文的关键解决方案是提出了一种无需训练的层次化且分层逐步注意力特性微调（Hierarchical and Step-Layer-Wise Attention Specialty Tuning, AST）方法。通过分析DiT模型中混合注意力机制的标记级和层级注意力图，发现其具有层级响应结构：实例标记主导早期层，背景标记在中间层，属性标记在后期层。AST方法通过利用这一层级结构，在不同层和步骤中增强关键区域并抑制无关区域，从而优化多模态交互，实现基于实例草图的复杂提示解耦，最终提升复杂布局生成的精确性与多实例及属性表示的准确性。

链接: https://arxiv.org/abs/2504.10148
作者: Chunyang Zhang,Zhenhong Sun,Zhicheng Zhang,Junyan Wang,Yu Zhang,Dong Gong,Huadong Mo,Daoyi Dong
机构: School of Systems and Computing (系统与计算学院), University of New South Wales (新南威尔士大学); School of Engineering (工程学院), Australian National University (澳大利亚国立大学); School of Business (商学院), University of New South Wales (新南威尔士大学); Australian Institute for Machine Learning (澳大利亚机器学习研究所), University of Adelaide (阿德莱德大学); School of Computer Science and Engineering (计算机科学与工程学院), University of New South Wales (新南威尔士大学); School of Computer Science (计算机科学学院), University of Technology Sydney (悉尼科技大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Text-to-image (T2I) generation models often struggle with multi-instance synthesis (MIS), where they must accurately depict multiple distinct instances in a single image based on complex prompts detailing individual features. Traditional MIS control methods for UNet architectures like SD v1.5/SDXL fail to adapt to DiT-based models like FLUX and SD v3.5, which rely on integrated attention between image and text tokens rather than text-image cross-attention. To enhance MIS in DiT, we first analyze the mixed attention mechanism in DiT. Our token-wise and layer-wise analysis of attention maps reveals a hierarchical response structure: instance tokens dominate early layers, background tokens in middle layers, and attribute tokens in later layers. Building on this observation, we propose a training-free approach for enhancing MIS in DiT-based models with hierarchical and step-layer-wise attention specialty tuning (AST). AST amplifies key regions while suppressing irrelevant areas in distinct attention maps across layers and steps, guided by the hierarchical structure. This optimizes multimodal interactions by hierarchically decoupling the complex prompts with instance-based sketches. We evaluate our approach using upgraded sketch-based layouts for the T2I-CompBench and customized complex scenes. Both quantitative and qualitative results confirm our method enhances complex layout generation, ensuring precise instance placement and attribute representation in MIS.
zh

[CV-39] Negate or Embrace: On How Misalignment Shapes Multimodal Representation Learning

【速读】：该论文旨在解决多模态表示学习中因跨模态对齐假设失效而导致的数据集错配问题。论文的关键在于通过引入潜在变量模型，将错配形式化为选择偏差（selection bias）和扰动偏差（perturbation bias）两种机制，并证明在轻度假设下，多模态对比学习（MMCL）所学得的表示恰好捕获了对这两种偏差不变的语义变量子集的信息。这一理论分析统一了缓解错配与利用错配这两种看似对立的观点，为实际多模态机器学习系统的构建提供了指导，并通过合成数据与真实图像-文本数据集的实证研究验证了结论。

链接: https://arxiv.org/abs/2504.10143
作者: Yichao Cai,Yuhang Liu,Erdun Gao,Tianjiao Jiang,Zhen Zhang,Anton van den Hengel,Javen Qinfeng Shi
机构: Australian Institute for Machine Learning, The University of Adelaide (澳大利亚机器学习研究所, 阿德莱德大学)
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注: 38 pages

点击查看摘要

Abstract:Multimodal representation learning, exemplified by multimodal contrastive learning (MMCL) using image-text pairs, aims to learn powerful representations by aligning cues across modalities. This approach relies on the core assumption that the exemplar image-text pairs constitute two representations of an identical concept. However, recent research has revealed that real-world datasets often exhibit misalignment. There are two distinct viewpoints on how to address this issue: one suggests mitigating the misalignment, and the other leveraging it. We seek here to reconcile these seemingly opposing perspectives, and to provide a practical guide for practitioners. Using latent variable models we thus formalize misalignment by introducing two specific mechanisms: selection bias, where some semantic variables are missing, and perturbation bias, where semantic variables are distorted – both affecting latent variables shared across modalities. Our theoretical analysis demonstrates that, under mild assumptions, the representations learned by MMCL capture exactly the information related to the subset of the semantic variables invariant to selection and perturbation biases. This provides a unified perspective for understanding misalignment. Based on this, we further offer actionable insights into how misalignment should inform the design of real-world ML systems. We validate our theoretical findings through extensive empirical studies on both synthetic data and real image-text datasets, shedding light on the nuanced impact of misalignment on multimodal representation learning.
zh

[CV-40] M2S-RoAD: Multi-Modal Semantic Segmentation for Road Damage Using Camera and LiDAR Data

【速读】：该论文试图解决自动驾驶车辆（Autonomous Vehicles, AVs）在农村地区因路面损坏导致的安全性和舒适性挑战问题。当前研究主要集中于城市环境，而农村地区的道路损伤检测缺乏关注。论文的关键在于引入M2S-RoAD数据集，该数据集针对农村地区不同类型的路面退化进行语义分割标注，涵盖澳大利亚新南威尔士州多个城镇的九类路面损伤，并将为自动驾驶系统提供输入以提升道路安全性。

链接: https://arxiv.org/abs/2504.10123
作者: Tzu-Yun Tseng,Hongyu Lyu,Josephine Li,Julie Stephany Berrio,Mao Shan,Stewart Worrall
机构: Australian Centre for Robotics (ACFR), The University of Sydney (悉尼大学); University of Washington (华盛顿大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Road damage can create safety and comfort challenges for both human drivers and autonomous vehicles (AVs). This damage is particularly prevalent in rural areas due to less frequent surveying and maintenance of roads. Automated detection of pavement deterioration can be used as an input to AVs and driver assistance systems to improve road safety. Current research in this field has predominantly focused on urban environments driven largely by public datasets, while rural areas have received significantly less attention. This paper introduces M2S-RoAD, a dataset for the semantic segmentation of different classes of road damage. M2S-RoAD was collected in various towns across New South Wales, Australia, and labelled for semantic segmentation to identify nine distinct types of road damage. This dataset will be released upon the acceptance of the paper.
zh

[CV-41] AGO: Adaptive Grounding for Open World 3D Occupancy Prediction

【速读】：该论文致力于解决开放世界三维语义占用预测中的两个关键挑战：一是如何从传感器输入中生成体素化的三维表示，并同时识别已知和未知物体；二是如何有效利用视觉-语言模型（Vision-Language Models, VLMs）的知识迁移来增强模型的泛化能力。现有方法受限于预定义标签空间且缺乏对未知类别的可靠预测能力，而直接对齐预训练图像嵌入的表现也不尽如人意，因为视觉-语言模型中的图像和文本表征往往不一致。

为了解决上述问题，论文提出了一种名为AGO的新框架，其核心在于通过自适应定位（Adaptive Grounding）技术处理多样化的开放世界场景。AGO的关键在于首先将周围图像和类别提示分别编码为三维嵌入和文本嵌入，并利用基于相似度的定位训练与三维伪标签进行协同优化。此外，引入模态适配器（Modality Adapter），将三维嵌入映射到与VLM衍生图像嵌入对齐的空间中，从而缩小模态差异。实验结果表明，AGO在零样本和少量样本迁移任务中显著提升了未知物体的预测性能，并在封闭世界自监督任务中达到了最先进的水平，相比先前方法提升了4.09 mIoU。

链接: https://arxiv.org/abs/2504.10117
作者: Peizheng Li,Shuxiao Ding,You Zhou,Qingwen Zhang,Onat Inak,Larissa Triess,Niklas Hanselmann,Marius Cordts,Andreas Zell
机构: Mercedes-Benz AG (梅赛德斯-奔驰集团); University of Tübingen (蒂宾根大学); Tübingen AI Center; University of Bonn (波恩大学); RPL, KTH Royal Institute of Technology (皇家理工学院); TU Berlin (柏林工业大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Open-world 3D semantic occupancy prediction aims to generate a voxelized 3D representation from sensor inputs while recognizing both known and unknown objects. Transferring open-vocabulary knowledge from vision-language models (VLMs) offers a promising direction but remains challenging. However, methods based on VLM-derived 2D pseudo-labels with traditional supervision are limited by a predefined label space and lack general prediction capabilities. Direct alignment with pretrained image embeddings, on the other hand, fails to achieve reliable performance due to often inconsistent image and text representations in VLMs. To address these challenges, we propose AGO, a novel 3D occupancy prediction framework with adaptive grounding to handle diverse open-world scenarios. AGO first encodes surrounding images and class prompts into 3D and text embeddings, respectively, leveraging similarity-based grounding training with 3D pseudo-labels. Additionally, a modality adapter maps 3D embeddings into a space aligned with VLM-derived image embeddings, reducing modality gaps. Experiments on Occ3D-nuScenes show that AGO improves unknown object prediction in zero-shot and few-shot transfer while achieving state-of-the-art closed-world self-supervised performance, surpassing prior methods by 4.09 mIoU.
zh

[CV-42] SoccerNet-v3D: Leverag ing Sports Broadcast Replays for 3D Scene Understanding

【速读】：该论文旨在解决足球比赛视频分析中三维场景理解的问题，特别是如何通过多视图信息实现精确的三维目标定位。解决方案的关键在于引入了SoccerNet-v3D和ISSIA-3D两个增强且可扩展的数据集，这些数据集通过场地线驱动的相机标定与多视图同步技术，实现了基于三角化方法的三维物体定位。此外，论文提出了利用单目图像进行三维球体定位的任务，并开发了一种基于相机标定和球体尺寸先验的基线方法来估计球体位置。同时，为了提升二维标注的质量，还提出了一种边界框优化技术以确保其与三维场景表示的一致性。这些创新点共同推动了体育分析领域中空间和时间分析能力的提升。

链接: https://arxiv.org/abs/2504.10106
作者: Marc Gutiérrez-Pérez,Antonio Agudo
机构: Institut de Robòtica i Informàtica Industrial, CSIC-UPC (机器人与信息工业研究所, CSIC-UPC), Spain
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Sports video analysis is a key domain in computer vision, enabling detailed spatial understanding through multi-view correspondences. In this work, we introduce SoccerNet-v3D and ISSIA-3D, two enhanced and scalable datasets designed for 3D scene understanding in soccer broadcast analysis. These datasets extend SoccerNet-v3 and ISSIA by incorporating field-line-based camera calibration and multi-view synchronization, enabling 3D object localization through triangulation. We propose a monocular 3D ball localization task built upon the triangulation of ground-truth 2D ball annotations, along with several calibration and reprojection metrics to assess annotation quality on demand. Additionally, we present a single-image 3D ball localization method as a baseline, leveraging camera calibration and ball size priors to estimate the ball’s position from a monocular viewpoint. To further refine 2D annotations, we introduce a bounding box optimization technique that ensures alignment with the 3D scene representation. Our proposed datasets establish new benchmarks for 3D soccer scene understanding, enhancing both spatial and temporal analysis in sports analytics. Finally, we provide code to facilitate access to our annotations and the generation pipelines for the datasets.
zh

[CV-43] Global and Local Mamba Network for Multi-Modality Medical Image Super-Resolution

【速读】：该论文旨在解决多模态医学图像超分辨率中存在的两个主要问题：一是现有方法要么具有固定的局部学习感受野，要么在全局学习时面临显著的计算负担，限制了超分辨率性能；二是如何有效建模图像中的长程依赖关系以提升超分辨率效果。为了解决这些问题，论文的关键在于引入状态空间模型（State Space Models）中的Mamba模型，其具有线性计算复杂度，能够高效建模图像中的长程依赖关系。基于此，论文提出了一种全局与局部Mamba网络（Global and Local Mamba Network, GLMamba），通过设计一个包含全局Mamba分支和局部Mamba分支的双支路网络结构，分别捕获低分辨率输入中的长程关系以及高分辨率参考图像中的短程细节。此外，通过可变形模块（Deform Block）增强特征提取能力，并利用模调器（Modulator）进一步优化全局和局部Mamba块中的可变形特征。同时，为了充分利用参考图像进行低分辨率图像的超分辨率重建，论文还开发了多模态特征融合模块（Multi-Modality Feature Fusion Block）以自适应融合不同模态间的相似性、差异性和互补性特征。最后，通过对比边缘损失（Contrastive Edge Loss, CELoss）强化医学图像中的边缘纹理和对比度。关键在于结合Mamba模型的高效长程依赖建模能力和双支路网络的设计，实现了多模态医学图像的高质量超分辨率。

链接: https://arxiv.org/abs/2504.10105
作者: Zexin Ji,Beiji Zou,Xiaoyan Kui,Sebastien Thureau,Su Ruan
机构: School of Computer Science and Engineering, Central South University (中央南大学计算机科学与工程学院), Changsha, 410083, China; Hunan Engineering Research Center of Machine Vision and Intelligent Medicine, Central South University (湖南机器视觉与智能医学工程研究中心), Central South University (中央南大学), Changsha, 410083, China; University of Rouen-Normandy (鲁昂-诺曼底大学), LITIS - QuantIF UR 4108, F-76000, Rouen, France; Department of Nuclear Medicine, Henri Becquerel Cancer Center (核医学系, 亨利·贝克雷尔癌症中心), Rouen, France
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Convolutional neural networks and Transformer have made significant progresses in multi-modality medical image super-resolution. However, these methods either have a fixed receptive field for local learning or significant computational burdens for global learning, limiting the super-resolution performance. To solve this problem, State Space Models, notably Mamba, is introduced to efficiently model long-range dependencies in images with linear computational complexity. Relying on the Mamba and the fact that low-resolution images rely on global information to compensate for missing details, while high-resolution reference images need to provide more local details for accurate super-resolution, we propose a global and local Mamba network (GLMamba) for multi-modality medical image super-resolution. To be specific, our GLMamba is a two-branch network equipped with a global Mamba branch and a local Mamba branch. The global Mamba branch captures long-range relationships in low-resolution inputs, and the local Mamba branch focuses more on short-range details in high-resolution reference images. We also use the deform block to adaptively extract features of both branches to enhance the representation ability. A modulator is designed to further enhance deformable features in both global and local Mamba blocks. To fully integrate the reference image for low-resolution image super-resolution, we further develop a multi-modality feature fusion block to adaptively fuse features by considering similarities, differences, and complementary aspects between modalities. In addition, a contrastive edge loss (CELoss) is developed for sufficient enhancement of edge textures and contrast in medical images.
zh

[CV-44] UP-Person: Unified Parameter-Efficient Transfer Learning for Text-based Person Retrieval

【速读】：本文旨在解决基于文本的人物检索（Text-based Person Retrieval, TPR）任务中，利用预训练的视觉-语言模型（如CLIP）进行参数高效微调的问题。传统方法通过完全微调整个大型模型以提取人物的视觉和文本特征，虽然在多模态任务中表现优于单一模态预训练模型，但容易过拟合且限制了泛化能力。为应对这一挑战，论文提出了一种名为统一参数高效迁移学习（Unified Parameter-Efficient Transfer Learning, UP-Person）的新方法。其关键是同时集成三种轻量级组件：Prefix、LoRA和Adapter。Prefix与LoRA共同挖掘局部信息及特定任务提示信息，而Adapter用于调整全局特征表示。此外，还优化了两个基础子模块S-Prefix和L-Adapter，以适配TPR的整体架构。实验结果表明，UP-Person在多个数据集上达到了最先进的性能，仅需微调4.7%的参数。

链接: https://arxiv.org/abs/2504.10084
作者: Yating Liu,Yaowei Li,Xiangyuan Lan,Wenming Yang,Zimo Liu,Qingmin Liao
机构: Shenzhen International Graduate School, Tsinghua University (清华大学深圳国际研究生院); School of ECE, Peking University (北京大学电子工程学院); Peng Cheng Laboratory (鹏城实验室)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 16 pages, 7 figures, first submited to IEEE TCSVT on 2024 May. Under review

点击查看摘要

Abstract:Text-based Person Retrieval (TPR) as a multi-modal task, which aims to retrieve the target person from a pool of candidate images given a text description, has recently garnered considerable attention due to the progress of contrastive visual-language pre-trained model. Prior works leverage pre-trained CLIP to extract person visual and textual features and fully fine-tune the entire network, which have shown notable performance improvements compared to uni-modal pre-training models. However, full-tuning a large model is prone to overfitting and hinders the generalization ability. In this paper, we propose a novel Unified Parameter-Efficient Transfer Learning (PETL) method for Text-based Person Retrieval (UP-Person) to thoroughly transfer the multi-modal knowledge from CLIP. Specifically, UP-Person simultaneously integrates three lightweight PETL components including Prefix, LoRA and Adapter, where Prefix and LoRA are devised together to mine local information with task-specific information prompts, and Adapter is designed to adjust global feature representations. Additionally, two vanilla submodules are optimized to adapt to the unified architecture of TPR. For one thing, S-Prefix is proposed to boost attention of prefix and enhance the gradient propagation of prefix tokens, which improves the flexibility and performance of the vanilla prefix. For another thing, L-Adapter is designed in parallel with layer normalization to adjust the overall distribution, which can resolve conflicts caused by overlap and interaction among multiple submodules. Extensive experimental results demonstrate that our UP-Person achieves state-of-the-art results across various person retrieval datasets, including CUHK-PEDES, ICFG-PEDES and RSTPReid while merely fine-tuning 4.7% parameters. Code is available at this https URL.
zh

[CV-45] Learning to Harmonize Cross-vendor X-ray Images by Non-linear Image Dynamics Correction

【速读】：该论文旨在解决医学图像分析中，由于不同设备厂商的图像域特定非线性特性导致模型泛化能力受限的问题。论文指出，传统的线性归一化方法无法有效处理这些非线性动态特征。为了解决这一问题，论文将图像一致性任务重新定义为曝光校正问题，并提出了一种名为全局深度曲线估计（Global Deep Curve Estimation, GDCE）的方法。GDCE通过预定义的多项式函数实现图像增强，并借助“领域判别器”进行训练，以提高下游任务中的模型透明度，而非依赖现有的黑箱方法。因此，其关键在于通过非线性变换减少领域特定的曝光差异，从而提升模型的鲁棒性和泛化能力。

链接: https://arxiv.org/abs/2504.10080
作者: Yucheng Lu,Shunxin Wang,Dovile Juodelyte,Veronika Cheplygina
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
备注:

点击查看摘要

Abstract:In this paper, we explore how conventional image enhancement can improve model robustness in medical image analysis. By applying commonly used normalization methods to images from various vendors and studying their influence on model generalization in transfer learning, we show that the nonlinear characteristics of domain-specific image dynamics cannot be addressed by simple linear transforms. To tackle this issue, we reformulate the image harmonization task as an exposure correction problem and propose a method termed Global Deep Curve Estimation (GDCE) to reduce domain-specific exposure mismatch. GDCE performs enhancement via a pre-defined polynomial function and is trained with the help of a ``domain discriminator’', aiming to improve model transparency in downstream tasks compared to existing black-box methods.
zh

[CV-46] Hierarchical Relation-augmented Representation Generalization for Few-shot Action Recognition

【速读】：该论文旨在解决少样本动作识别（Few-shot Action Recognition, FSAR）任务中如何有效利用少量样本识别新动作类别的问题。现有方法通常通过设计多种帧间时间建模策略独立学习每个视频的帧级表示，但忽视了视频与任务之间显式的关联建模，导致无法捕捉跨视频共享的时间模式或重用历史任务中的时间知识。为此，论文提出了一种名为HR2G-shot的分层关系增强表示泛化框架，其关键在于统一帧间（inter-frame）、视频间（inter-video）和任务间（inter-task）三种关系建模，从整体视角学习任务特定的时间模式。具体而言，除了进行帧间时间交互外，还设计了两个组件：Inter-video Semantic Correlation (ISC) 组件以细粒度方式执行跨视频帧级交互，捕获任务特定的查询特征，并学习支持特征的类内和类间时间相关性；Inter-task Knowledge Transfer (IKT) 组件从存储历史任务多样化时间模式的知识库中检索并聚合相关时间知识。实验结果表明，该方法在五个基准数据集上优于当前最先进的FSAR方法。

链接: https://arxiv.org/abs/2504.10079
作者: Hongyu Qu,Ling Xing,Rui Yan,Yazhou Yao,Guo-Sen Xie,Xiangbo Shu
机构: Nanjing University of Science and Technology (南京理工大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Few-shot action recognition (FSAR) aims to recognize novel action categories with few exemplars. Existing methods typically learn frame-level representations independently for each video by designing various inter-frame temporal modeling strategies. However, they neglect explicit relation modeling between videos and tasks, thus failing to capture shared temporal patterns across videos and reuse temporal knowledge from historical tasks. In light of this, we propose HR2G-shot, a Hierarchical Relation-augmented Representation Generalization framework for FSAR, which unifies three types of relation modeling (inter-frame, inter-video, and inter-task) to learn task-specific temporal patterns from a holistic view. In addition to conducting inter-frame temporal interactions, we further devise two components to respectively explore inter-video and inter-task relationships: i) Inter-video Semantic Correlation (ISC) performs cross-video frame-level interactions in a fine-grained manner, thereby capturing task-specific query features and learning intra- and inter-class temporal correlations among support features; ii) Inter-task Knowledge Transfer (IKT) retrieves and aggregates relevant temporal knowledge from the bank, which stores diverse temporal patterns from historical tasks. Extensive experiments on five benchmarks show that HR2G-shot outperforms current top-leading FSAR methods.
zh

[CV-47] DTFSal: Audio-Visual Dynamic Token Fusion for Video Saliency Prediction

【速读】：该论文旨在解决通过整合视觉与听觉信息预测视频中显著区域（Audio-Visual Saliency Prediction）的问题，特别是如何有效利用听觉线索以克服复杂时空交互和高计算需求带来的挑战。论文的关键解决方案在于提出了一种名为Dynamic Token Fusion Saliency (DFTSal) 的新型框架，其核心创新包括：1) 在多尺度视觉编码器中引入Learnable Token Enhancement Block (LTEB)，通过自适应加权增强关键显著性特征；2) 设计Dynamic Learnable Token Fusion Block (DLTFB)，利用移位操作重新组织特征以捕捉长距离依赖和细节空间信息；3) 并行处理音频信号的音频分支，结合Adaptive Multimodal Fusion Block (AMFB) 实现精确的跨模态融合。这些技术共同确保了模型在保持计算效率的同时实现最优性能。

链接: https://arxiv.org/abs/2504.10070
作者: Kiana Hoshanfar,Alireza Hosseini,Ahmad Kalhor,Babak Nadjar Araabi
机构: University of Tehran ( Tehran University )
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Audio-visual saliency prediction aims to mimic human visual attention by identifying salient regions in videos through the integration of both visual and auditory information. Although visual-only approaches have significantly advanced, effectively incorporating auditory cues remains challenging due to complex spatio-temporal interactions and high computational demands. To address these challenges, we propose Dynamic Token Fusion Saliency (DFTSal), a novel audio-visual saliency prediction framework designed to balance accuracy with computational efficiency. Our approach features a multi-scale visual encoder equipped with two novel modules: the Learnable Token Enhancement Block (LTEB), which adaptively weights tokens to emphasize crucial saliency cues, and the Dynamic Learnable Token Fusion Block (DLTFB), which employs a shifting operation to reorganize and merge features, effectively capturing long-range dependencies and detailed spatial information. In parallel, an audio branch processes raw audio signals to extract meaningful auditory features. Both visual and audio features are integrated using our Adaptive Multimodal Fusion Block (AMFB), which employs local, global, and adaptive fusion streams for precise cross-modal fusion. The resulting fused features are processed by a hierarchical multi-decoder structure, producing accurate saliency maps. Extensive evaluations on six audio-visual benchmarks demonstrate that DFTSal achieves SOTA performance while maintaining computational efficiency.
zh

[CV-48] Multi-Object Grounding via Hierarchical Contrastive Siamese Transformers

【速读】：该论文致力于解决三维场景中多目标定位的问题，即根据自然语言输入同时精确定位多个物体。与以往主要关注单目标定位的工作不同，本文聚焦于更贴近实际需求的多目标场景。论文的关键创新在于提出了Hierarchical Contrastive Siamese Transformers (H-COST)，其核心包括两个方面：首先采用分层处理策略逐步优化物体定位，以增强对复杂语言指令的理解；其次引入对比式孪生Transformer框架，通过一个基于真实标签的辅助网络提取鲁棒的物体关系，指导并强化主参考网络在分割点云数据上的操作。这种对比机制显著提升了模型的语义理解能力，并大幅增强了其处理复杂点云数据的能力。实验结果表明，该方法在具有挑战性的多目标定位基准测试中超越了现有最先进方法9.5%。

链接: https://arxiv.org/abs/2504.10048
作者: Chengyi Du,Keyan Jin
机构: School of Information and Software Engineering (信息与软件工程学院), University of Electronic Science and Technology of China (电子科技大学); University of Granada (格拉纳达大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Multi-object grounding in 3D scenes involves localizing multiple objects based on natural language input. While previous work has primarily focused on single-object grounding, real-world scenarios often demand the localization of several objects. To tackle this challenge, we propose Hierarchical Contrastive Siamese Transformers (H-COST), which employs a Hierarchical Processing strategy to progressively refine object localization, enhancing the understanding of complex language instructions. Additionally, we introduce a Contrastive Siamese Transformer framework, where two networks with the identical structure are used: one auxiliary network processes robust object relations from ground-truth labels to guide and enhance the second network, the reference network, which operates on segmented point-cloud data. This contrastive mechanism strengthens the model’ s semantic understanding and significantly enhances its ability to process complex point-cloud data. Our approach outperforms previous state-of-the-art methods by 9.5% on challenging multi-object grounding benchmarks.
zh

[CV-49] Aligning Anime Video Generation with Human Feedback

【速读】：该论文旨在解决动漫视频生成中因数据稀缺和异常运动模式导致的问题，如运动失真和闪烁伪影，这些问题与人类偏好不一致。现有奖励模型主要针对真实世界视频设计，无法捕捉动漫的独特外观和一致性需求。论文的关键解决方案是提出一个结合人类反馈的管道以实现更好的对齐。具体而言，构建了一个包含30,000个人类标注样本的多维奖励数据集，涵盖视觉外观和一致性的人类偏好。基于此，开发了AnimeReward奖励模型，使用专门的视觉语言模型来指导不同评估维度的偏好对齐。此外，引入了Gap-Aware Preference Optimization (GAPO)，一种新型训练方法，明确将偏好差距纳入优化过程，从而提高对齐性能和效率。实验结果表明，AnimeReward优于现有奖励模型，而GAPO的加入在定量基准和人为评估中均实现了更优的对齐效果，验证了该管道在提升动漫视频质量方面的有效性。论文的数据集和代码将公开发布。

链接: https://arxiv.org/abs/2504.10044
作者: Bingwen Zhu,Yudong Jiang,Baohan Xu,Siqian Yang,Mingyu Yin,Yidi Wu,Huyang Sun,Zuxuan Wu
机构: Fudan University (复旦大学); Bilibili Inc (哔哩哔哩公司)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 10 pages, 5 figures, 7 tables

点击查看摘要

Abstract:Anime video generation faces significant challenges due to the scarcity of anime data and unusual motion patterns, leading to issues such as motion distortion and flickering artifacts, which result in misalignment with human preferences. Existing reward models, designed primarily for real-world videos, fail to capture the unique appearance and consistency requirements of anime. In this work, we propose a pipeline to enhance anime video generation by leveraging human feedback for better alignment. Specifically, we construct the first multi-dimensional reward dataset for anime videos, comprising 30k human-annotated samples that incorporating human preferences for both visual appearance and visual consistency. Based on this, we develop AnimeReward, a powerful reward model that employs specialized vision-language models for different evaluation dimensions to guide preference alignment. Furthermore, we introduce Gap-Aware Preference Optimization (GAPO), a novel training method that explicitly incorporates preference gaps into the optimization process, enhancing alignment performance and efficiency. Extensive experiment results show that AnimeReward outperforms existing reward models, and the inclusion of GAPO leads to superior alignment in both quantitative benchmarks and human evaluations, demonstrating the effectiveness of our pipeline in enhancing anime video quality. Our dataset and code will be publicly available.
zh

[CV-50] Prior Does Matter: Visual Navigation via Denoising Diffusion Bridge Models

【速读】：该论文旨在解决现有基于扩散模型的模仿学习在视觉导航任务中因初始动作分布为去噪高斯噪声而导致目标动作分布偏离较大、冗余去噪步骤增加学习复杂性的问题，以及有效动作分布稀疏带来的无指导情况下难以生成精确动作的挑战。为了解决这些问题，论文提出了一种名为NaviBridger的新框架，其关键是利用去噪扩散桥接（denoising diffusion bridge）模型，通过从任何具有信息量的先验动作出发进行去噪，从而增强引导性和效率。这一方法显著提升了策略推理的速度，并在模拟与真实环境中的室内和室外场景下优于基线方法。

链接: https://arxiv.org/abs/2504.10041
作者: Hao Ren,Yiming Zeng,Zetong Bi,Zhaoliang Wan,Junlong Huang,Hui Cheng
机构: School of Computer Science and Engineering, Sun Yat-sen University (中山大学计算机科学与工程学院); School of Intelligent Systems Engineering, Sun Yat-sen University (中山大学智能系统工程学院)
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Recent advancements in diffusion-based imitation learning, which show impressive performance in modeling multimodal distributions and training stability, have led to substantial progress in various robot learning tasks. In visual navigation, previous diffusion-based policies typically generate action sequences by initiating from denoising Gaussian noise. However, the target action distribution often diverges significantly from Gaussian noise, leading to redundant denoising steps and increased learning complexity. Additionally, the sparsity of effective action distributions makes it challenging for the policy to generate accurate actions without guidance. To address these issues, we propose a novel, unified visual navigation framework leveraging the denoising diffusion bridge models named NaviBridger. This approach enables action generation by initiating from any informative prior actions, enhancing guidance and efficiency in the denoising process. We explore how diffusion bridges can enhance imitation learning in visual navigation tasks and further examine three source policies for generating prior actions. Extensive experiments in both simulated and real-world indoor and outdoor scenarios demonstrate that NaviBridger accelerates policy inference and outperforms the baselines in generating target action sequences. Code is available at this https URL.
zh

[CV-51] Investigating the Role of Bilateral Symmetry for Inpainting Brain MRI

【速读】：该论文旨在研究脑MRI图像修复（Inpainting）过程中，修复结构与特定受试者条件信息量之间的统计关系，特别是分析额外屏蔽图像其他区域（如对侧结构）时修复结果的分布，以揭示模型从大脑何处获取信息以及半球对称性在修复过程中的重要性。论文通过分析扩散修复模型在亚皮质脑结构修复中的强度和面积变化，探讨对称性对修复条件的影响。解决方案的关键在于通过系统性地屏蔽不同图像区域并评估修复结果的分布，量化对侧结构对修复过程的影响，从而解析模型利用的大脑对称性信息及其对修复性能的作用。

链接: https://arxiv.org/abs/2504.10039
作者: Sergey Kuznetsov,Sanduni Pinnawala,Peter A. Wijeratne,Ivor J. A. Simpson
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Inpainting has recently emerged as a valuable and interesting technology to employ in the analysis of medical imaging data, in particular brain MRI. A wide variety of methodologies for inpainting MRI have been proposed and demonstrated on tasks including anomaly detection. In this work we investigate the statistical relationship between inpainted brain structures and the amount of subject-specific conditioning information, i.e. the other areas of the image that are masked. In particular, we analyse the distribution of inpainting results when masking additional regions of the image, specifically the contra-lateral structure. This allows us to elucidate where in the brain the model is drawing information from, and in particular, what is the importance of hemispherical symmetry? Our experiments interrogate a diffusion inpainting model through analysing the inpainting of subcortical brain structures based on intensity and estimated area change. We demonstrate that some structures show a strong influence of symmetry in the conditioning of the inpainting process.
zh

[CV-52] 3D: Table Tennis 3D Reconstruction

【速读】：该论文旨在解决体育数据分析中仅依赖二维（2D）球跟踪方法的局限性问题，特别是在线乒乓球比赛回放中，由于摄像机视角的限制及无法全面支持复杂比赛分析的不足。为克服这些限制，论文提出了一种新颖的方法，通过利用乒乓球运动的物理特性，从在线乒乓球比赛录像中重建精确的三维（3D）球轨迹。关键在于通过最小化球飞行轨迹的重投影误差来识别最优的弹跳状态，从而确保高精度和可靠的3D重建。此外，该方法无需依赖人体姿态估计或球拍跟踪即可推断球的旋转信息，同时结合一种自动相机标定技术以可靠追踪摄像机运动，并改进了一个缺乏深度运动捕捉的现有3D姿态估计算法以实现精准球员动作追踪，最终实现完整的乒乓球回合3D重建。

链接: https://arxiv.org/abs/2504.10035
作者: Thomas Gossard,Andreas Ziegler,Andreas Zell
机构: Cognitive Systems - Univeristy of Tuebingen (认知系统 - 图宾根大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to CVSport 2025

点击查看摘要

Abstract:Sports analysis requires processing large amounts of data, which is time-consuming and costly. Advancements in neural networks have significantly alleviated this burden, enabling highly accurate ball tracking in sports broadcasts. However, relying solely on 2D ball tracking is limiting, as it depends on the camera’s viewpoint and falls short of supporting comprehensive game analysis. To address this limitation, we propose a novel approach for reconstructing precise 3D ball trajectories from online table tennis match recordings. Our method leverages the underlying physics of the ball’s motion to identify the bounce state that minimizes the reprojection error of the ball’s flying trajectory, hence ensuring an accurate and reliable 3D reconstruction. A key advantage of our approach is its ability to infer ball spin without relying on human pose estimation or racket tracking, which are often unreliable or unavailable in broadcast footage. We developed an automated camera calibration method capable of reliably tracking camera movements. Additionally, we adapted an existing 3D pose estimation model, which lacks depth motion capture, to accurately track player movements. Together, these contributions enable the full 3D reconstruction of a table tennis rally.
zh

[CV-53] Relative Illumination Fields: Learning Medium and Light Independent Underwater Scenes

【速读】：该论文致力于解决在非均匀照明且存在未知共动光源的散射环境中构建一致且逼真的神经辐射场（Neural Radiance Field, NERF）的挑战。现有针对水下场景表示的研究大多集中在静态均匀光照条件下，而忽视了当机器人探索超过数十米深水域时，由于阳光不足所导致的复杂光照条件。为此，论文提出了一种新型局部附着于相机的光照场，并结合体素介质表示方法，形成一种能够有效处理动态光照场与静态散射介质之间相互作用的整体方案。其关键在于通过引入与相机关联的局部光照场来捕捉视锥内不均一的光照效果，并将其与体积介质建模相结合。评估结果证明了该方法的有效性和灵活性。

链接: https://arxiv.org/abs/2504.10024
作者: Mengkun She,Felix Seegräber,David Nakath,Patricia Schöntag,Kevin Köser
机构: Institute of Computer Science (计算机科学研究所); GEOMAR (未知中文名称); Kiel University (基尔大学), Germany; Helmholtz Centre for Ocean Research Kiel (基尔海洋研究中心), Germany
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 10 pages, 6 figures. First two authors contributed equally to this work

点击查看摘要

Abstract:We address the challenge of constructing a consistent and photorealistic Neural Radiance Field in inhomogeneously illuminated, scattering environments with unknown, co-moving light sources. While most existing works on underwater scene representation focus on a static homogeneous illumination, limited attention has been paid to scenarios such as when a robot explores water deeper than a few tens of meters, where sunlight becomes insufficient. To address this, we propose a novel illumination field locally attached to the camera, enabling the capture of uneven lighting effects within the viewing frustum. We combine this with a volumetric medium representation to an overall method that effectively handles interaction between dynamic illumination field and static scattering medium. Evaluation results demonstrate the effectiveness and flexibility of our approach.
zh

[CV-54] Masked Autoencoder Self Pre-Training for Defect Detection in Microelectronics

【速读】：该论文试图解决微电子缺陷检测领域中基于变换器架构（Transformer-based architectures）应用不足的问题，主要原因是微电子缺陷图像标注数据稀缺且生成成本高昂，同时自然图像预训练无法有效迁移到微电子领域。为解决这一问题，论文提出了一种基于掩码自编码器（Masked Autoencoder, MAE）的视觉变换器（Vision Transformer, ViT）自监督预训练框架。其关键在于通过在目标数据集上进行自预训练（self pre-training），而非依赖其他领域的数据集，利用掩码机制让模型在预训练阶段重建被遮掩的图像块，从而有效学习到与缺陷相关的特征表示。实验结果表明，该方法在性能上显著优于监督ViT、基于自然图像预训练的ViT以及现有最先进的卷积神经网络（CNN）基线模型，并能够正确聚焦于如焊料裂纹等与缺陷相关的特定特征。

链接: https://arxiv.org/abs/2504.10021
作者: Nikolai Röhrich(XITASO GmbH and LMU Munich),Alwin Hoffmann(XITASO GmbH),Richard Nordsieck(XITASO GmbH),Emilio Zarbali(XITASO GmbH),Alireza Javanmardi(LMU Munich and Munich Center for Machine Learning)
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 16 pages, 5 figures

点击查看摘要

Abstract:Whereas in general computer vision, transformer-based architectures have quickly become the gold standard, microelectronics defect detection still heavily relies on convolutional neural networks (CNNs). We hypothesize that this is due to the fact that a) transformers have an increased need for data and b) labelled image generation procedures for microelectronics are costly, and labelled data is therefore sparse. Whereas in other domains, pre-training on large natural image datasets can mitigate this problem, in microelectronics transfer learning is hindered due to the dissimilarity of domain data and natural images. Therefore, we evaluate self pre-training, where models are pre-trained on the target dataset, rather than another dataset. We propose a vision transformer (ViT) pre-training framework for defect detection in microelectronics based on masked autoencoders (MAE). In MAE, a large share of image patches is masked and reconstructed by the model during pre-training. We perform pre-training and defect detection using a dataset of less than 10.000 scanning acoustic microscopy (SAM) images labelled using transient thermal analysis (TTA). Our experimental results show that our approach leads to substantial performance gains compared to a) supervised ViT, b) ViT pre-trained on natural image datasets, and c) state-of-the-art CNN-based defect detection models used in the literature. Additionally, interpretability analysis reveals that our self pre-trained models, in comparison to ViT baselines, correctly focus on defect-relevant features such as cracks in the solder material. This demonstrates that our approach yields fault-specific feature representations, making our self pre-trained models viable for real-world defect detection in microelectronics.
zh

[CV-55] RGB-Event based Pedestrian Attribute Recognition: A Benchmark Dataset and An Asymmetric RWKV Fusion Framework

【速读】：该论文旨在解决现有行人属性识别方法受RGB相机限制的问题，这些问题包括对光照条件敏感和运动模糊等，同时当前方法主要关注外部外观和服装分析，缺乏对情感维度的探索。为此，论文提出了一种新颖的多模态RGB-事件行人属性识别任务，灵感来源于事件相机在低光、高速和低功耗方面的优势。解决方案的关键在于引入了一个包含100K配对RGB-事件样本的大规模多模态行人属性识别数据集EventPAR，并提出了一个基于RWKV的新型多模态行人属性识别框架，该框架具有RWKV视觉编码器和不对称RWKV融合模块。通过在所提出的及两个模拟数据集上的广泛实验，实现了最先进的性能。

链接: https://arxiv.org/abs/2504.10018
作者: Xiao Wang,Haiyang Wang,Shiao Wang,Qiang Chen,Jiandong Jin,Haoyu Song,Bo Jiang,Chenglong Li
机构: School of Computer Science and Technology, Anhui University, Hefei, China (安徽大学计算机科学与技术学院，合肥，中国); School of Artificial Intelligence, Anhui University, Hefei, China (安徽大学人工智能学院，合肥，中国)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: The First Benchmark Dataset for RGB-Event Multimodal Pedestrian Attribute Recognition Task

点击查看摘要

Abstract:Existing pedestrian attribute recognition methods are generally developed based on RGB frame cameras. However, these approaches are constrained by the limitations of RGB cameras, such as sensitivity to lighting conditions and motion blur, which hinder their performance. Furthermore, current attribute recognition primarily focuses on analyzing pedestrians’ external appearance and clothing, lacking an exploration of emotional dimensions. In this paper, we revisit these issues and propose a novel multi-modal RGB-Event attribute recognition task by drawing inspiration from the advantages of event cameras in low-light, high-speed, and low-power consumption. Specifically, we introduce the first large-scale multi-modal pedestrian attribute recognition dataset, termed EventPAR, comprising 100K paired RGB-Event samples that cover 50 attributes related to both appearance and six human emotions, diverse scenes, and various seasons. By retraining and evaluating mainstream PAR models on this dataset, we establish a comprehensive benchmark and provide a solid foundation for future research in terms of data and algorithmic baselines. In addition, we propose a novel RWKV-based multi-modal pedestrian attribute recognition framework, featuring an RWKV visual encoder and an asymmetric RWKV fusion module. Extensive experiments are conducted on our proposed dataset as well as two simulated datasets (MARS-Attribute and DukeMTMC-VID-Attribute), achieving state-of-the-art results. The source code and dataset will be released on this https URL
zh

[CV-56] Air Quality Prediction with A Meteorology-Guided Modality-Decoupled Spatio-Temporal Network

【速读】：该论文致力于解决空气质量和气象数据之间动态相互依赖关系建模不足的问题，现有方法低估了大气条件在空气质量预测中的关键作用，并未能充分挖掘综合气象数据的价值。论文的关键在于提出MDSTNet，这是一种编码器-解码器框架，通过将空气质量观测与大气条件明确建模为不同的模态（modalities），整合多气压层气象数据和天气预报，以捕捉大气-污染之间的依赖关系，从而实现更精准的预测。此外，论文构建了ChinaAirNet，首个结合全国范围空气质量记录与多气压层气象观测数据的数据集。实验结果表明，MDSTNet相较于最先进的模型，在48小时预测误差上降低了17.54%，验证了解决方案的有效性。

链接: https://arxiv.org/abs/2504.10014
作者: Hang Yin,Yan-Ming Zhang,Jian Xu,Jian-Long Chang,Yin Li,Cheng-Lin Liu
机构: School of Artificial Intelligence, University of Chinese Academy of Sciences (中国科学院大学人工智能学院); State Key Laboratory of Multimodal Artificial Intelligence Systems (MAIS), Institution of Automation, Chinese Academy of Sciences (多模态人工智能系统国家重点实验室（自动化研究所，中国科学院）); Huawei Cloud Enterprise Intelligence (华为云企业智能)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Air quality prediction plays a crucial role in public health and environmental protection. Accurate air quality prediction is a complex multivariate spatiotemporal problem, that involves interactions across temporal patterns, pollutant correlations, spatial station dependencies, and particularly meteorological influences that govern pollutant dispersion and chemical transformations. Existing works underestimate the critical role of atmospheric conditions in air quality prediction and neglect comprehensive meteorological data utilization, thereby impairing the modeling of dynamic interdependencies between air quality and meteorological data. To overcome this, we propose MDSTNet, an encoder-decoder framework that explicitly models air quality observations and atmospheric conditions as distinct modalities, integrating multi-pressure-level meteorological data and weather forecasts to capture atmosphere-pollution dependencies for prediction. Meantime, we construct ChinaAirNet, the first nationwide dataset combining air quality records with multi-pressure-level meteorological observations. Experimental results on ChinaAirNet demonstrate MDSTNet’s superiority, substantially reducing 48-hour prediction errors by 17.54% compared to the state-of-the-art model. The source code and dataset will be available on github.
zh

[CV-57] EBAD-Gaussian: Event-driven Bundle Adjusted Deblur Gaussian Splatting

【速读】：该论文旨在解决在存在运动模糊的情况下，基于RGB的传统三维高斯点 splatting（3D-GS）方法性能下降的问题。特别是在快速运动或低光照条件下，现有的基于RGB的去模糊方法难以建模相机姿态和辐射变化，导致重建精度降低。为了解决这一问题，论文提出了一种名为Event-driven Bundle Adjusted Deblur Gaussian Splatting（EBAD-Gaussian）的方法。其关键是利用事件相机捕捉曝光期间连续的亮度变化来有效建模运动模糊，并通过联合学习三维高斯体素参数与恢复曝光期间的相机运动轨迹来提升重建质量。具体而言，该方法通过合成多个潜在清晰图像构建模糊损失函数以最小化真实与合成模糊图像之间的差异；利用事件流监督曝光期间任意时刻潜在清晰图像间的光强变化以补充RGB图像中丢失的光强动态信息；并通过基于事件的双积分先验优化中间曝光时间的潜在清晰图像，应用一致性约束增强重建图像的细节与纹理信息。实验结果表明，EBAD-Gaussian能够在模糊图像和事件流输入条件下实现高质量的三维场景重建。

链接: https://arxiv.org/abs/2504.10012
作者: Yufei Deng,Yuanjian Wang,Rong Xiao,Chenwei Tang,Jizhe Zhou,Jiahao Fan,Deng Xiong,Jiancheng Lv,Huajin Tang
机构: Sichuan University (四川大学); Stevens institute of Technology (史蒂文斯理工学院); Zhejiang University (浙江大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:While 3D Gaussian Splatting (3D-GS) achieves photorealistic novel view synthesis, its performance degrades with motion blur. In scenarios with rapid motion or low-light conditions, existing RGB-based deblurring methods struggle to model camera pose and radiance changes during exposure, reducing reconstruction accuracy. Event cameras, capturing continuous brightness changes during exposure, can effectively assist in modeling motion blur and improving reconstruction quality. Therefore, we propose Event-driven Bundle Adjusted Deblur Gaussian Splatting (EBAD-Gaussian), which reconstructs sharp 3D Gaussians from event streams and severely blurred images. This method jointly learns the parameters of these Gaussians while recovering camera motion trajectories during exposure time. Specifically, we first construct a blur loss function by synthesizing multiple latent sharp images during the exposure time, minimizing the difference between real and synthesized blurred images. Then we use event stream to supervise the light intensity changes between latent sharp images at any time within the exposure period, supplementing the light intensity dynamic changes lost in RGB images. Furthermore, we optimize the latent sharp images at intermediate exposure times based on the event-based double integral (EDI) prior, applying consistency constraints to enhance the details and texture information of the reconstructed images. Extensive experiments on synthetic and real-world datasets show that EBAD-Gaussian can achieve high-quality 3D scene reconstruction under the condition of blurred images and event stream inputs.
zh

[CV-58] Balancing Two Classifiers via A Simplex ETF Structure for Model Calibration CVPR2025

【速读】：该论文旨在解决深度神经网络（Deep Neural Networks, DNNs）在模型校准（model calibration）方面存在的过自信（overconfidence）或欠自信（underconfidence）问题，尤其是在自动驾驶和医疗等安全关键领域中，不可靠预测可能导致严重后果的应用场景。现有研究主要从分类器的角度改进模型校准，但针对设计分类器以解决校准问题的探索尚显不足，且大多数现有方法忽视了由欠自信引起的校准误差。

论文提出了一种名为BalCAL的新方法，通过平衡可学习分类器与等角超平面（ETF）分类器，在引入可调置信度模块和动态调整机制的基础上，确保模型置信度与其实际准确性之间更好的一致性。关键在于结合这两种分类器的优势，并通过上述机制有效缓解过自信或欠自信问题，从而显著提升模型的校准性能，同时保持较高的预测准确性。

实验验证表明，BalCAL不仅优于现有技术，还为深度学习中常见的校准挑战提供了新颖的解决方案。

链接: https://arxiv.org/abs/2504.10007
作者: Jiani Ni,He Zhao,Jintong Gao,Dandan Guo,Hongyuan Zha
机构: 未知
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注: CVPR2025

点击查看摘要

Abstract:In recent years, deep neural networks (DNNs) have demonstrated state-of-the-art performance across various domains. However, despite their success, they often face calibration issues, particularly in safety-critical applications such as autonomous driving and healthcare, where unreliable predictions can have serious consequences. Recent research has started to improve model calibration from the view of the classifier. However, the exploration of designing the classifier to solve the model calibration problem is insufficient. Let alone most of the existing methods ignore the calibration errors arising from underconfidence. In this work, we propose a novel method by balancing learnable and ETF classifiers to solve the overconfidence or underconfidence problem for model Calibration named BalCAL. By introducing a confidence-tunable module and a dynamic adjustment method, we ensure better alignment between model confidence and its true accuracy. Extensive experimental validation shows that ours significantly improves model calibration performance while maintaining high predictive accuracy, outperforming existing techniques. This provides a novel solution to the calibration challenges commonly encountered in deep learning.
zh

[CV-59] An Image is Worth K Topics: A Visual Structural Topic Model with Pretrained Image Embeddings

【速读】：该论文试图解决政治科学家在大规模分析视觉内容时面临的挑战，即现有计算工具箱缺乏专门针对社会和政治研究挑战与目标的方法和模型。解决方案的关键在于引入了一种结合预训练图像嵌入（pretrained image embeddings）与结构化主题模型（structural topic model）的视觉结构化主题模型（visual Structural Topic Model, vSTM）。这种模型通过利用预训练嵌入捕捉与政治语境相关的图像语义复杂性，并借助结构化主题模型分析话题与协变量之间的关系，同时保持对图像作为多重主题混合体的精细表示，从而克服了传统方法的局限性。

链接: https://arxiv.org/abs/2504.10004
作者: Matías Piqueras,Alexandra Segerberg,Matteo Magnani,Måns Magnusson,Nataša Sladoje
机构: Uppsala University (乌普萨拉大学); InfoLab, Department of Information Technology, Uppsala University (乌普萨拉大学信息学实验室, 乌普萨拉大学信息技术系); Department of Government, Uppsala University (乌普萨拉大学政府系); Department of Statistics, Uppsala University (乌普萨拉大学统计系); Centre for Image Analysis, Department of Information Technology, Uppsala University (乌普萨拉大学图像分析中心, 乌普萨拉大学信息技术系)
类目: Computer Vision and Pattern Recognition (cs.CV); Computers and Society (cs.CY); Applications (stat.AP); Methodology (stat.ME)
备注:

点击查看摘要

Abstract:Political scientists are increasingly interested in analyzing visual content at scale. However, the existing computational toolbox is still in need of methods and models attuned to the specific challenges and goals of social and political inquiry. In this article, we introduce a visual Structural Topic Model (vSTM) that combines pretrained image embeddings with a structural topic model. This has important advantages compared to existing approaches. First, pretrained embeddings allow the model to capture the semantic complexity of images relevant to political contexts. Second, the structural topic model provides the ability to analyze how topics and covariates are related, while maintaining a nuanced representation of images as a mixture of multiple topics. In our empirical application, we show that the vSTM is able to identify topics that are interpretable, coherent, and substantively relevant to the study of online political communication.
zh

[CV-60] NaviDiffusor: Cost-Guided Diffusion Model for Visual Navigation

【速读】：该论文旨在解决视觉导航在移动机器人中的两大挑战：一是经典方法因多模块设计和手工规则依赖易产生系统错误；二是学习方法虽有高规划成功率，但泛化能力有限且通常需要大量训练数据。为克服这些限制，论文提出了一种结合学习方法与经典方法优势的混合框架，用于仅基于RGB图像的视觉导航。关键在于通过条件扩散模型学习路径-图像对，并在推理阶段引入可微场景特定及任务级代价函数的梯度信息，引导生成满足约束的有效路径，从而避免重新训练，提供即插即用的解决方案。

链接: https://arxiv.org/abs/2504.10003
作者: Yiming Zeng,Hao Ren,Shuhang Wang,Junlong Huang,Hui Cheng
机构: School of Computer Science and Engineering, Sun Yat-sen University (中山大学计算机科学与工程学院); School of Intelligent Systems Engineering, Sun Yat-sen University (中山大学智能系统工程学院)
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Visual navigation, a fundamental challenge in mobile robotics, demands versatile policies to handle diverse environments. Classical methods leverage geometric solutions to minimize specific costs, offering adaptability to new scenarios but are prone to system errors due to their multi-modular design and reliance on hand-crafted rules. Learning-based methods, while achieving high planning success rates, face difficulties in generalizing to unseen environments beyond the training data and often require extensive training. To address these limitations, we propose a hybrid approach that combines the strengths of learning-based methods and classical approaches for RGB-only visual navigation. Our method first trains a conditional diffusion model on diverse path-RGB observation pairs. During inference, it integrates the gradients of differentiable scene-specific and task-level costs, guiding the diffusion model to generate valid paths that meet the constraints. This approach alleviates the need for retraining, offering a plug-and-play solution. Extensive experiments in both indoor and outdoor settings, across simulated and real-world scenarios, demonstrate zero-shot transfer capability of our approach, achieving higher success rates and fewer collisions compared to baseline methods. Code will be released at this https URL.
zh

[CV-61] GaussVideoDreamer: 3D Scene Generation with Video Diffusion and Inconsistency-Aware Gaussian Splatting

【速读】：该论文致力于解决单图像三维场景重建（Single-image 3D Scene Reconstruction）这一具有挑战性的问题，由于其固有的不适定性质（Ill-posed Nature）以及输入约束的局限性，现有方法在跨视图一致性（Cross-view Consistency）和误差处理（Error Handling）方面存在不足。论文提出的关键解决方案包括两个创新点：(1) 一种渐进式视频修复策略（Progressive Video Inpainting Strategy），利用时间一致性（Temporal Coherence）提升多视图一致性（Multiview Consistency）并加速收敛；(2) 一种基于三维高斯点绘制的一致性掩码（3D Gaussian Splatting Consistency Mask），通过三维一致的多视图证据指导视频扩散（Video Diffusion）。这些创新结合了几何感知初始化协议、不一致性感知高斯点绘制以及渐进式视频修复策略，实现了更高的LLaVA-IQA评分（提高32%）和至少两倍的速度提升，同时保持了对多样化场景的鲁棒性能。

链接: https://arxiv.org/abs/2504.10001
作者: Junlin Hao,Peiheng Wang,Haoyang Wang,Xinggong Zhang,Zongming Guo
机构: Peking University (北京大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Single-image 3D scene reconstruction presents significant challenges due to its inherently ill-posed nature and limited input constraints. Recent advances have explored two promising directions: multiview generative models that train on 3D consistent datasets but struggle with out-of-distribution generalization, and 3D scene inpainting and completion frameworks that suffer from cross-view inconsistency and suboptimal error handling, as they depend exclusively on depth data or 3D smoothness, which ultimately degrades output quality and computational performance. Building upon these approaches, we present GaussVideoDreamer, which advances generative multimedia approaches by bridging the gap between image, video, and 3D generation, integrating their strengths through two key innovations: (1) A progressive video inpainting strategy that harnesses temporal coherence for improved multiview consistency and faster convergence. (2) A 3D Gaussian Splatting consistency mask to guide the video diffusion with 3D consistent multiview evidence. Our pipeline combines three core components: a geometry-aware initialization protocol, Inconsistency-Aware Gaussian Splatting, and a progressive video inpainting strategy. Experimental results demonstrate that our approach achieves 32% higher LLaVA-IQA scores and at least 2x speedup compared to existing methods while maintaining robust performance across diverse scenes.
zh

[CV-62] Metric-Guided Synthesis of Class Activation Mapping

【速读】：该论文试图解决现有类激活映射（Class Activation Mapping, CAM）方法在生成显著图（heatmap）时缺乏灵活性的问题，即无法根据用户的意图或领域知识调整显著图的生成方式。论文的关键在于提出了一种基于指标的合成方法——SyCAM，它能够通过优化预定义的评价指标自动生成针对特定需求的CAM表达式。具体而言，作者探索了一种基于语法引导的合成实例化方法，即在预设的句法约束和评价指标指导下推导CAM表达式，从而实现生成目标导向的显著图，并验证了该方法在ResNet50、VGG16和VGG19等模型上的有效性和灵活性。

链接: https://arxiv.org/abs/2504.09998
作者: Alejandro Luque-Cerpa,Elizabeth Polgreen,Ajitha Rajan,Hazem Torfah
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Class activation mapping (CAM) is a widely adopted class of saliency methods used to explain the behavior of convolutional neural networks (CNNs). These methods generate heatmaps that highlight the parts of the input most relevant to the CNN output. Various CAM methods have been proposed, each distinguished by the expressions used to derive heatmaps. In general, users look for heatmaps with specific properties that reflect different aspects of CNN functionality. These may include similarity to ground truth, robustness, equivariance, and more. Although existing CAM methods implicitly encode some of these properties in their expressions, they do not allow for variability in heatmap generation following the user’s intent or domain knowledge. In this paper, we address this limitation by introducing SyCAM, a metric-based approach for synthesizing CAM expressions. Given a predefined evaluation metric for saliency maps, SyCAM automatically generates CAM expressions optimized for that metric. We specifically explore a syntax-guided synthesis instantiation of SyCAM, where CAM expressions are derived based on predefined syntactic constraints and the given metric. Using several established evaluation metrics, we demonstrate the efficacy and flexibility of our approach in generating targeted heatmaps. We compare SyCAM with other well-known CAM methods on three prominent models: ResNet50, VGG16, and VGG19.
zh

[CV-63] Correlative and Discriminative Label Grouping for Multi-Label Visual Prompt Tuning CVPR

【速读】：该论文旨在解决多标签图像分类（Multi-Label Image Classification, MLC）中过度关注标签共现关系导致的过拟合风险问题。为了解决这一问题，论文提出了一种新的参数高效方法——多标签视觉提示调优框架（Multi-Label Visual Prompt Tuning Framework）。其关键在于通过平衡标签间的相关性和区分性关系来缓解过拟合风险并提升模型性能。具体而言，该框架将类别划分为多个子集，基于标签共现和互斥关系分别建模，并在视觉Transformer（Vision Transformer, ViT）中采用多个提示词令牌以捕捉每个子集内的相关或区分性标签关系，同时利用混合专家（Mixture of Experts, MoE）模型动态分配组感知视觉表示为标签感知表示，从而更有效地支持分类任务。实验结果表明，该方法在多个基准数据集上取得了竞争性成果，优于现有的多种预训练模型。

链接: https://arxiv.org/abs/2504.09990
作者: LeiLei Ma,Shuo Xu,MingKun Xie,Lei Wang,Dengdi Sun,Haifeng Zhao
机构: Anhui Provincial Key Laboratory of Multimodal Cognitive Computation, School of Computer Science and Technology, Anhui University (安徽大学), Hefei, China; Nanjing University of Aeronautics and Astronautics (南京航空航天大学), Nanjing, China; Nanjing University of Science and Technology (南京理工大学), Nanjing, China
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 2025

点击查看摘要

Abstract:Modeling label correlations has always played a pivotal role in multi-label image classification (MLC), attracting significant attention from researchers. However, recent studies have overemphasized co-occurrence relationships among labels, which can lead to overfitting risk on this overemphasis, resulting in suboptimal models. To tackle this problem, we advocate for balancing correlative and discriminative relationships among labels to mitigate the risk of overfitting and enhance model performance. To this end, we propose the Multi-Label Visual Prompt Tuning framework, a novel and parameter-efficient method that groups classes into multiple class subsets according to label co-occurrence and mutual exclusivity relationships, and then models them respectively to balance the two relationships. In this work, since each group contains multiple classes, multiple prompt tokens are adopted within Vision Transformer (ViT) to capture the correlation or discriminative label relationship within each group, and effectively learn correlation or discriminative representations for class subsets. On the other hand, each group contains multiple group-aware visual representations that may correspond to multiple classes, and the mixture of experts (MoE) model can cleverly assign them from the group-aware to the label-aware, adaptively obtaining label-aware representation, which is more conducive to classification. Experiments on multiple benchmark datasets show that our proposed approach achieves competitive results and outperforms SOTA methods on multiple pre-trained models.
zh

[CV-64] Resampling Benchmark for Efficient Comprehensive Evaluation of Large Vision-Language Models

【速读】：该论文试图解决大型视觉-语言模型（Vision-Language Models, VLMs）评估中计算资源消耗过大的问题。现有评估方法需要依赖多个基准数据集以全面评估模型的知识与推理能力，但这些基准的综合评估计算成本高昂。为提高效率，论文提出了一种基于最远点采样法（Farthest Point Sampling, FPS）的子集构建方法，通过仅使用约1%的数据即可保持与完整评估高度相关性（皮尔逊相关系数达0.96），从而显著降低评估开销。此外，该方法还能减少现有基准数据集中潜在的偏差问题。关键在于利用FPS策略选择能够代表整体挑战的最小化子集，实现高效且有效的评估。

链接: https://arxiv.org/abs/2504.09979
作者: Teppei Suzuki,Keisuke Ozawa
机构: SB Intuitions (SB Intuitions)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:We propose an efficient evaluation protocol for large vision-language models (VLMs). Given their broad knowledge and reasoning capabilities, multiple benchmarks are needed for comprehensive assessment, making evaluation computationally expensive. To improve efficiency, we construct a subset that yields results comparable to full benchmark evaluations. Our benchmark classification experiments reveal that no single benchmark fully covers all challenges. We then introduce a subset construction method using farthest point sampling (FPS). Our experiments show that FPS-based benchmarks maintain a strong correlation ( 0.96) with full evaluations while using only ~1% of the data. Additionally, applying FPS to an existing benchmark improves correlation with overall evaluation results, suggesting its potential to reduce unintended dataset biases.
zh

[CV-65] OctGPT : Octree-based Multiscale Autoregressive Models for 3D Shape Generation SIGGRAPH2025

【速读】：该论文旨在解决3D形状生成领域中自回归模型性能显著落后于扩散模型的问题。为了解决这一挑战，论文提出了一种名为OctGPT的新型多尺度自回归模型。其关键创新在于采用序列化八叉树表示法以高效捕捉3D形状的层次结构与空间结构，通过八叉树编码粗略几何形状，利用向量量化变分自编码器（VQVAE）生成二值标记来表示细节，将3D形状转换为适合自回归预测的紧凑多尺度二值序列。此外，引入基于八叉树的Transformer，并结合三维旋转位置编码、尺度特定嵌入及令牌并行生成方案，有效应对长序列计算难题，大幅提升了训练效率（缩短13倍训练时间）和生成速度（加速69倍），实现了在四块NVIDIA 4090 GPU上仅需数天即可高效训练高分辨率（如1024³）3D形状的能力。这一方法不仅提高了生成质量，还展示了在多种任务中的卓越灵活性。

链接: https://arxiv.org/abs/2504.09975
作者: Si-Tong Wei,Rui-Huan Wang,Chuan-Zhi Zhou,Baoquan Chen,Peng-Shuai Wang
机构: Wangxuan Institute of Computer Technology (王选计算机技术研究所), Peking University (北京大学), China; Peking University (北京大学), China; Peking University (北京大学), China; Peking University (北京大学), China; Wangxuan Institute of Computer Technology (王选计算机技术研究所), Peking University (北京大学), China
类目: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV)
备注: SIGGRAPH 2025

点击查看摘要

Abstract:Autoregressive models have achieved remarkable success across various domains, yet their performance in 3D shape generation lags significantly behind that of diffusion models. In this paper, we introduce OctGPT, a novel multiscale autoregressive model for 3D shape generation that dramatically improves the efficiency and performance of prior 3D autoregressive approaches, while rivaling or surpassing state-of-the-art diffusion models. Our method employs a serialized octree representation to efficiently capture the hierarchical and spatial structures of 3D shapes. Coarse geometry is encoded via octree structures, while fine-grained details are represented by binary tokens generated using a vector quantized variational autoencoder (VQVAE), transforming 3D shapes into compact \emphmultiscale binary sequences suitable for autoregressive prediction. To address the computational challenges of handling long sequences, we incorporate octree-based transformers enhanced with 3D rotary positional encodings, scale-specific embeddings, and token-parallel generation schemes. These innovations reduce training time by 13 folds and generation time by 69 folds, enabling the efficient training of high-resolution 3D shapes, e.g., 1024^3 , on just four NVIDIA 4090 GPUs only within days. OctGPT showcases exceptional versatility across various tasks, including text-, sketch-, and image-conditioned generation, as well as scene-level synthesis involving multiple objects. Extensive experiments demonstrate that OctGPT accelerates convergence and improves generation quality over prior autoregressive methods, offering a new paradigm for high-quality, scalable 3D content creation.
zh

[CV-66] Beyond Degradation Redundancy: Contrastive Prompt Learning for All-in-One Image Restoration

【速读】：该论文旨在解决统一模型在处理多种图像退化类型（如模糊、噪声等）时，设计任务特定提示（task-specific prompts）以有效引导图像修复的挑战。传统方法中，自适应提示学习虽然支持端到端优化，但容易产生重叠或冗余的任务表示；而从预训练分类器中提取的显式提示虽能增强区分能力，却可能丢失重建所需的视觉信息。为克服这些局限性，论文提出对比提示学习（Contrastive Prompt Learning, CPL），其关键创新在于结合稀疏提示模块（Sparse Prompt Module, SPM）与对比提示正则化（Contrastive Prompt Regularization, CPR）。SPM能够高效捕捉特定退化特征并减少冗余，而CPR通过引入负样本强化不同退化类型间任务边界的明确性。与仅关注退化分类的传统方法不同，CPL着重优化提示与修复模型之间的交互关系。实验结果表明，该框架不仅提升了现有最先进的统一图像修复模型性能，在标准多任务场景及复杂复合退化条件下均实现了显著改进，同时保持了参数效率。

链接: https://arxiv.org/abs/2504.09973
作者: Gang Wu,Junjun Jiang,Kui Jiang,Xianming Liu,Liqiang Nie
机构: School of Computer Science and Technology, Harbin Institute of Technology (哈尔滨工业大学)(哈工大); School of Computer Science and Technology, Harbin Institute of Technology (Shenzhen) (哈尔滨工业大学(深圳))
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project page: this https URL

点击查看摘要

Abstract:All-in-one image restoration, addressing diverse degradation types with a unified model, presents significant challenges in designing task-specific prompts that effectively guide restoration across multiple degradation scenarios. While adaptive prompt learning enables end-to-end optimization, it often yields overlapping or redundant task representations. Conversely, explicit prompts derived from pretrained classifiers enhance discriminability but may discard critical visual information for reconstruction. To address these limitations, we introduce Contrastive Prompt Learning (CPL), a novel framework that fundamentally enhances prompt-task alignment through two complementary innovations: a \emphSparse Prompt Module (SPM) that efficiently captures degradation-specific features while minimizing redundancy, and a \emphContrastive Prompt Regularization (CPR) that explicitly strengthens task boundaries by incorporating negative prompt samples across different degradation types. Unlike previous approaches that focus primarily on degradation classification, CPL optimizes the critical interaction between prompts and the restoration model itself. Extensive experiments across five comprehensive benchmarks demonstrate that CPL consistently enhances state-of-the-art all-in-one restoration models, achieving significant improvements in both standard multi-task scenarios and challenging composite degradation settings. Our framework establishes new state-of-the-art performance while maintaining parameter efficiency, offering a principled solution for unified image restoration.
zh

[CV-67] Enhancing Multi-task Learning Capability of Medical Generalist Foundation Model via Image-centric Multi-annotation Data

【速读】：该论文旨在解决现有医疗多模态大语言模型（Medical Multi-modal Large Language Models, MLLMs）在多任务学习能力上的局限性，特别是因数据资源简单聚合导致的图像与任务对齐分散化问题。传统方法忽视了从数据构建角度重新审视多任务学习的重要性，从而无法有效培养全面的图像理解能力或满足临床多维度图像解释的需求。为应对这一挑战，论文提出了一个以图像为中心的多标注X射线数据集（IMAX），这是首个从数据构建层面增强MLLMs多任务学习能力的尝试。

IMAX的关键创新点在于其高质量的数据整理和以图像为中心的密集标注：数据集中包含超过354,000条适用于七种不同医学任务的记录，并且每张X射线图像平均关联4.10个任务和7.46个训练样本，显著提升了每张图像的多任务表示丰富度。与通用的分散式多标注X射线数据集（DMAX）相比，IMAX在七种开源最先进的医学MLLMs中实现了3.20%至21.05%的多任务平均性能提升。此外，论文进一步分析了IMAX与DMAX训练过程中的统计模式差异，探索了优化动态与多任务性能之间的潜在相关性，并基于IMAX的数据构建核心理念提出了一种改进的基于DMAX的训练策略，以缓解实际场景中获取高质量IMAX数据的难题。

链接: https://arxiv.org/abs/2504.09967
作者: Xun Zhu,Fanbin Mo,Zheng Zhang,Jiaxi Wang,Yiming Shi,Ming Wu,Chuang Zhang,Miao Li,Ji Wu
机构: Tsinghua University (清华大学); BUPT (北京邮电大学); Beijing Wuzi University (北京物资学院)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:The emergence of medical generalist foundation models has revolutionized conventional task-specific model development paradigms, aiming to better handle multiple tasks through joint training on large-scale medical datasets. However, recent advances prioritize simple data scaling or architectural component enhancement, while neglecting to re-examine multi-task learning from a data-centric perspective. Critically, simply aggregating existing data resources leads to decentralized image-task alignment, which fails to cultivate comprehensive image understanding or align with clinical needs for multi-dimensional image interpretation. In this paper, we introduce the image-centric multi-annotation X-ray dataset (IMAX), the first attempt to enhance the multi-task learning capabilities of medical multi-modal large language models (MLLMs) from the data construction level. To be specific, IMAX is featured from the following attributes: 1) High-quality data curation. A comprehensive collection of more than 354K entries applicable to seven different medical tasks. 2) Image-centric dense annotation. Each X-ray image is associated with an average of 4.10 tasks and 7.46 training entries, ensuring multi-task representation richness per image. Compared to the general decentralized multi-annotation X-ray dataset (DMAX), IMAX consistently demonstrates significant multi-task average performance gains ranging from 3.20% to 21.05% across seven open-source state-of-the-art medical MLLMs. Moreover, we investigate differences in statistical patterns exhibited by IMAX and DMAX training processes, exploring potential correlations between optimization dynamics and multi-task performance. Finally, leveraging the core concept of IMAX data construction, we propose an optimized DMAX-based training strategy to alleviate the dilemma of obtaining high-quality IMAX data in practical scenarios.
zh

[CV-68] SemiETS: Integrating Spatial and Content Consistencies for Semi-Supervised End-to-end Text Spotting CVPR2025

【速读】：该论文致力于解决半监督场景文本检测与识别（Semi-supervised Text Spotting, SSTS）任务中因伪标签不一致性和教师/学生模型不一致性导致的挑战。论文的关键创新在于提出了一种新的端到端文本检测与识别半监督框架——SemiETS。其核心解决方案包括：1) 为检测和识别任务逐步生成可靠的分层伪标签以减少噪声；2) 利用双向流提取位置和转录信息，提升任务间的一致性。实验结果表明，SemiETS在多种数据标注比例下显著优于现有最先进的半监督学习方法，并且在强监督场景下仍能实现性能提升，展现了强大的领域适应能力。

链接: https://arxiv.org/abs/2504.09966
作者: Dongliang Luo,Hanshen Zhu,Ziyang Zhang,Dingkang Liang,Xudong Xie,Yuliang Liu,Xiang Bai
机构: Huazhong University of Science and Technology (华中科技大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by CVPR2025. Code will be available at \url{ this https URL }

点击查看摘要

Abstract:Most previous scene text spotting methods rely on high-quality manual annotations to achieve promising performance. To reduce their expensive costs, we study semi-supervised text spotting (SSTS) to exploit useful information from unlabeled images. However, directly applying existing semi-supervised methods of general scenes to SSTS will face new challenges: 1) inconsistent pseudo labels between detection and recognition tasks, and 2) sub-optimal supervisions caused by inconsistency between teacher/student. Thus, we propose a new Semi-supervised framework for End-to-end Text Spotting, namely SemiETS that leverages the complementarity of text detection and recognition. Specifically, it gradually generates reliable hierarchical pseudo labels for each task, thereby reducing noisy labels. Meanwhile, it extracts important information in locations and transcriptions from bidirectional flows to improve consistency. Extensive experiments on three datasets under various settings demonstrate the effectiveness of SemiETS on arbitrary-shaped text. For example, it outperforms previous state-of-the-art SSL methods by a large margin on end-to-end spotting (+8.7%, +5.6%, and +2.6% H-mean under 0.5%, 1%, and 2% labeled data settings on Total-Text, respectively). More importantly, it still improves upon a strongly supervised text spotter trained with plenty of labeled data by 2.0%. Compelling domain adaptation ability shows practical potential. Moreover, our method demonstrates consistent improvement on different text spotters.
zh

[CV-69] Dual-Path Enhancements in Event-Based Eye Tracking: Augmented Robustness and Adaptive Temporal Modeling CVPR

【速读】：本文针对基于事件的眼动追踪技术在实际应用中面临的挑战，如突发的眼部运动和环境噪声等问题展开研究。论文的关键解决方案包括两个方面：首先，通过引入包含时间平移、空间翻转和事件删除的鲁棒数据增强管道，提升了模型的稳健性，在具有挑战性的样本上将欧氏距离误差减少了12%；其次，提出了KnightPupil混合架构，结合EfficientNet-B3用于空间特征提取，双向GRU用于上下文时间建模，并采用线性时变状态空间模块以动态适应稀疏输入和噪声。这些改进显著提高了系统的性能与实用性。

链接: https://arxiv.org/abs/2504.09960
作者: Hoang M. Truong,Vinh-Thuan Ly,Huy G. Tran,Thuan-Phat Nguyen,Tram T. Doan
机构: University of Science, VNU-HCM (越南国立大学胡志明市大学), Ho Chi Minh City, Vietnam; Vietnam National University, Ho Chi Minh City (胡志明市越南国家大学), Ho Chi Minh City, Vietnam
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Camera-ready version for CVPRW 2025. Accepted for presentation at the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW 2025)

点击查看摘要

Abstract:Event-based eye tracking has become a pivotal technology for augmented reality and human-computer interaction. Yet, existing methods struggle with real-world challenges such as abrupt eye movements and environmental noise. Building on the efficiency of the Lightweight Spatiotemporal Network-a causal architecture optimized for edge devices-we introduce two key advancements. First, a robust data augmentation pipeline incorporating temporal shift, spatial flip, and event deletion improves model resilience, reducing Euclidean distance error by 12% (1.61 vs. 1.70 baseline) on challenging samples. Second, we propose KnightPupil, a hybrid architecture combining an EfficientNet-B3 backbone for spatial feature extraction, a bidirectional GRU for contextual temporal modeling, and a Linear Time-Varying State-Space Module to adapt to sparse inputs and noise dynamically. Evaluated on the 3ET+ benchmark, our framework achieved 1.61 Euclidean distance on the private test set of the Event-based Eye Tracking Challenge at CVPR 2025, demonstrating its effectiveness for practical deployment in AR/VR systems while providing a foundation for future innovations in neuromorphic vision.
zh

[CV-70] Semantic Depth Matters: Explaining Errors of Deep Vision Networks through Perceived Class Similarities

【速读】：该论文试图解决现有深度神经网络（DNN）评估方法缺乏透明性的问题，特别是无法有效解释网络误分类的根本原因。为了解决这一问题，论文提出了一种新颖的框架，旨在探索网络感知的语义层次深度与其实际数据误分类模式之间的关系。该框架的关键在于引入了一个名为相似度深度（Similarity Depth, SD）的新指标，用于量化网络感知的语义层次深度，并评估网络错误与内部感知的相似性结构的一致性。此外，论文还提出了一种基于图的模型语义关系及误感知可视化方法。该方法的核心优势在于利用类别模板（从分类器层权重推导出的表示）来分析已训练网络的语义特性，而无需额外的数据或实验。通过这种方法，研究发现深度视觉网络编码了特定的语义层次，并且较高的语义深度能够提升感知类间相似性与实际误分类之间的一致性。

链接: https://arxiv.org/abs/2504.09956
作者: Katarzyna Filus,Michał Romaszewski,Mateusz Żarski
机构: Institute of Theoretical and Applied Informatics (理论与应用信息学研究所),
Polish Academy of Sciences (波兰科学院)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Understanding deep neural network (DNN) behavior requires more than evaluating classification accuracy alone; analyzing errors and their predictability is equally crucial. Current evaluation methodologies lack transparency, particularly in explaining the underlying causes of network misclassifications. To address this, we introduce a novel framework that investigates the relationship between the semantic hierarchy depth perceived by a network and its real-data misclassification patterns. Central to our framework is the Similarity Depth (SD) metric, which quantifies the semantic hierarchy depth perceived by a network along with a method of evaluation of how closely the network’s errors align with its internally perceived similarity structure. We also propose a graph-based visualization of model semantic relationships and misperceptions. A key advantage of our approach is that leveraging class templates – representations derived from classifier layer weights – is applicable to already trained networks without requiring additional data or experiments. Our approach reveals that deep vision networks encode specific semantic hierarchies and that high semantic depth improves the compliance between perceived class similarities and actual errors.
zh

[CV-71] Efficient 2D to Full 3D Human Pose Uplifting including Joint Rotations CVPR’25

【速读】：该论文旨在解决在体育数据分析中，准确捕捉人体关节的三维位置和旋转的问题。传统的人体网格恢复（Human Mesh Recovery, HMR）模型在估计关节旋转方面表现良好，但在关节定位精度上不如三维人体姿态估计（3D Human Pose Estimation, HPE）模型。为了解决这一局限性，近期的研究通过结合3D HPE模型与逆向运动学（Inverse Kinematics, IK）来同时估计关节位置和旋转，但IK计算代价高昂。论文的关键创新在于提出了一种新的2D到3D提升模型，该模型能够在单次前向传播中直接估计包括关节旋转在内的三维人体姿态，从而避免了IK的高计算开销。这一方案的关键在于设计了多种旋转表示方法、损失函数以及训练策略，并且无论是否使用真实标签旋转数据，均实现了卓越的旋转估计精度，同时显著提升了效率，比基于IK的方法快150倍，并在关节定位精度上超越了HMR模型。

链接: https://arxiv.org/abs/2504.09953
作者: Katja Ludwig,Yuliia Oksymets,Robin Schön,Daniel Kienzle,Rainer Lienhart
机构: Chair for Machine Learning & Computer Vision, University of Augsburg, Germany (机器学习与计算机视觉讲座, 奥格斯堡大学, 德国)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: accepted at CVSports@CVPR’25

点击查看摘要

Abstract:In sports analytics, accurately capturing both the 3D locations and rotations of body joints is essential for understanding an athlete’s biomechanics. While Human Mesh Recovery (HMR) models can estimate joint rotations, they often exhibit lower accuracy in joint localization compared to 3D Human Pose Estimation (HPE) models. Recent work addressed this limitation by combining a 3D HPE model with inverse kinematics (IK) to estimate both joint locations and rotations. However, IK is computationally expensive. To overcome this, we propose a novel 2D-to-3D uplifting model that directly estimates 3D human poses, including joint rotations, in a single forward pass. We investigate multiple rotation representations, loss functions, and training strategies - both with and without access to ground truth rotations. Our models achieve state-of-the-art accuracy in rotation estimation, are 150 times faster than the IK-based approach, and surpass HMR models in joint localization precision.
zh

[CV-72] Pseudo-Label Guided Real-World Image De-weathering: A Learning Framework with Imperfect Supervision

【速读】：该论文致力于解决现实世界图像去天气化（real-world image de-weathering）任务中因非理想监督导致的问题，即在训练过程中由于采集的真实配对数据集存在光照、物体位置、场景细节等方面的不一致性，可能导致去天气化模型产生变形伪影。为应对这一挑战，论文提出了一种基于伪标签引导学习的统一解决方案，其关键在于通过一致标签构造器（Consistent Label Constructor, CLC）生成高质量的伪标签，并结合原始真实标签实现对去天气化模型（De-W）的自适应监督。CLC 中引入了跨帧相似性聚合（Cross-frame Similarity Aggregation, CSA）模块以探索多帧间的潜在互补信息来提升伪标签质量，同时设计了信息分配策略（Information Allocation Strategy, IAS）将真实标签与伪标签相结合，从而实现对去天气化模型的有效联合监督。实验结果表明，该方法在处理不完美对齐的数据集时表现出显著优势。

链接: https://arxiv.org/abs/2504.09949
作者: Heming Xu,Xiaohui Liu,Zhilu Zhang,Hongzhi Zhang,Xiaohe Wu,Wangmeng Zuo
机构: Harbin Institute of Technology (哈尔滨工业大学)
类目: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV)
备注: 15 pages, 16 figures

点击查看摘要

Abstract:Real-world image de-weathering aims at removingvarious undesirable weather-related artifacts, e.g., rain, snow,and fog. To this end, acquiring ideal training pairs is this http URL real-world datasets are typically constructed paired databy extracting clean and degraded images from live streamsof landscape scene on the Internet. Despite the use of strictfiltering mechanisms during collection, training pairs inevitablyencounter inconsistency in terms of lighting, object position, scenedetails, etc, making de-weathering models possibly suffer fromdeformation artifacts under non-ideal supervision. In this work,we propose a unified solution for real-world image de-weatheringwith non-ideal supervision, i.e., a pseudo-label guided learningframework, to address various inconsistencies within the realworld paired dataset. Generally, it consists of a de-weatheringmodel (De-W) and a Consistent Label Constructor (CLC), bywhich restoration result can be adaptively supervised by originalground-truth image to recover sharp textures while maintainingconsistency with the degraded inputs in non-weather contentthrough the supervision of pseudo-labels. Particularly, a Crossframe Similarity Aggregation (CSA) module is deployed withinCLC to enhance the quality of pseudo-labels by exploring thepotential complementary information of multi-frames throughgraph model. Moreover, we introduce an Information AllocationStrategy (IAS) to integrate the original ground-truth imagesand pseudo-labels, thereby facilitating the joint supervision forthe training of de-weathering model. Extensive experimentsdemonstrate that our method exhibits significant advantageswhen trained on imperfectly aligned de-weathering datasets incomparison with other approaches.
zh

[CV-73] Omni-Dish: Photorealistic and Faithful Image Generation and Editing for Arbitrary Chinese Dishes

【速读】：该论文旨在解决现有文本到图像生成模型难以捕捉特定领域（尤其是中国菜肴）多样化特征和忠实细节的问题。为应对这一局限，论文提出Omni-Dish，这是一种专为中国菜肴设计的文本到图像生成模型。解决方案的关键在于开发了一个全面的菜肴整理管道以构建迄今为止最大的菜肴数据集，并引入了一种重描述策略及粗到细的训练方案以帮助模型更好地学习精细的烹饪差异。此外，在推理阶段，通过利用预先构建的高质量描述库和大型语言模型增强用户的文本输入，实现了更逼真且忠实的图像生成。同时，为了扩展模型在菜肴编辑任务中的能力，提出了概念增强的P2P方法，并构建了相应的菜肴编辑数据集与专门的编辑模型。实验结果验证了所提方法的有效性。

链接: https://arxiv.org/abs/2504.09948
作者: Huijie Liu,Bingcan Wang,Jie Hu,Xiaoming Wei,Guoliang Kang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Multimedia (cs.MM)
备注: 10 pages, 10 figures, 3 tables

点击查看摘要

Abstract:Dish images play a crucial role in the digital era, with the demand for culturally distinctive dish images continuously increasing due to the digitization of the food industry and e-commerce. In general cases, existing text-to-image generation models excel in producing high-quality images; however, they struggle to capture diverse characteristics and faithful details of specific domains, particularly Chinese dishes. To address this limitation, we propose Omni-Dish, the first text-to-image generation model specifically tailored for Chinese dishes. We develop a comprehensive dish curation pipeline, building the largest dish dataset to date. Additionally, we introduce a recaption strategy and employ a coarse-to-fine training scheme to help the model better learn fine-grained culinary nuances. During inference, we enhance the user’s textual input using a pre-constructed high-quality caption library and a large language model, enabling more photorealistic and faithful image generation. Furthermore, to extend our model’s capability for dish editing tasks, we propose Concept-Enhanced P2P. Based on this approach, we build a dish editing dataset and train a specialized editing model. Extensive experiments demonstrate the superiority of our methods.
zh

[CV-74] FUSION: Fully Integration of Vision-Language Representations for Deep Cross-Modal Understanding

【速读】：该论文旨在解决多模态大语言模型（Multimodal Large Language Models, MLLMs）在视觉与文本模态之间浅层交互的问题，提出了一种全新的全视-文对齐与整合范式。论文的关键在于引入了“Text-Guided Unified Vision Encoding”以在视觉编码过程中融入文本信息，实现像素级整合；并通过“Context-Aware Recursive Alignment Decoding”在解码阶段递归聚合基于文本上下文的视觉特征，实现细粒度的语义整合。此外，通过“Dual-Supervised Semantic Mapping Loss”指导特征映射并缓解模态差异，并构建了一个高质量的合成语言驱动问答数据集来优化文本引导的特征整合。这些创新点共同构成了FUSION模型，显著提升了多模态集成的效果。

链接: https://arxiv.org/abs/2504.09925
作者: Zheng Liu,Mengjie Liu,Jingzhou Chen,Jingwei Xu,Bin Cui,Conghui He,Wentao Zhang
机构: Peking University (北京大学); Shanghai AI Laboratory (上海人工智能实验室); Nanjing University (南京大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:We introduce FUSION, a family of multimodal large language models (MLLMs) with a fully vision-language alignment and integration paradigm. Unlike existing methods that primarily rely on late-stage modality interaction during LLM decoding, our approach achieves deep, dynamic integration throughout the entire processing pipeline. To this end, we propose Text-Guided Unified Vision Encoding, incorporating textual information in vision encoding to achieve pixel-level integration. We further design Context-Aware Recursive Alignment Decoding that recursively aggregates visual features conditioned on textual context during decoding, enabling fine-grained, question-level semantic integration. To guide feature mapping and mitigate modality discrepancies, we develop Dual-Supervised Semantic Mapping Loss. Additionally, we construct a Synthesized Language-Driven Question-Answer (QA) dataset through a new data synthesis method, prioritizing high-quality QA pairs to optimize text-guided feature integration. Building on these foundations, we train FUSION at two scales-3B, 8B-and demonstrate that our full-modality integration approach significantly outperforms existing methods with only 630 vision tokens. Notably, FUSION 3B surpasses Cambrian-1 8B and Florence-VL 8B on most benchmarks. FUSION 3B continues to outperform Cambrian-1 8B even when limited to 300 vision tokens. Our ablation studies show that FUSION outperforms LLaVA-NeXT on over half of the benchmarks under same configuration without dynamic resolution, highlighting the effectiveness of our approach. We release our code, model weights, and dataset. this https URL
zh

[CV-75] Improving Multimodal Hateful Meme Detection Exploiting LMM-Generated Knowledge CVPR2025

【速读】：该论文试图解决利用大型多模态模型（Large Multimodal Model, LMM）检测仇恨类表情包（hateful meme detection）这一关键任务。解决方案的关键在于通过两种方式有效利用LMM的知识：首先，提取与仇恨表情包检测任务相关的通用语义描述和情感信息，构建强表情包表示，并以此训练简单的分类头；其次，提出一种新颖的困难样本挖掘方法，直接将LMM编码的知识引入训练过程以进一步提升性能。实验结果验证了所提方法的有效性，并达到了当前最佳性能（state-of-the-art）。

链接: https://arxiv.org/abs/2504.09914
作者: Maria Tzelepi,Vasileios Mezaris
机构: CERTH-ITI (CERTH-信息技术研究中心)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted for publication, Multimodal Learning and Applications Workshop (MULA 2025) @ IEEE/CVF CVPR 2025, Nashville, TN, USA, June 2025. This is the authors’ “accepted version”

点击查看摘要

Abstract:Memes have become a dominant form of communication in social media in recent years. Memes are typically humorous and harmless, however there are also memes that promote hate speech, being in this way harmful to individuals and groups based on their identity. Therefore, detecting hateful content in memes has emerged as a task of critical importance. The need for understanding the complex interactions of images and their embedded text renders the hateful meme detection a challenging multimodal task. In this paper we propose to address the aforementioned task leveraging knowledge encoded in powerful Large Multimodal Models (LMM). Specifically, we propose to exploit LMMs in a two-fold manner. First, by extracting knowledge oriented to the hateful meme detection task in order to build strong meme representations. Specifically, generic semantic descriptions and emotions that the images along with their embedded texts elicit are extracted, which are then used to train a simple classification head for hateful meme detection. Second, by developing a novel hard mining approach introducing directly LMM-encoded knowledge to the training process, providing further improvements. We perform extensive experiments on two datasets that validate the effectiveness of the proposed method, achieving state-of-the-art performance. Our code and trained models are publicly available at: this https URL.
zh

[CV-76] LiteTracker: Leverag ing Temporal Causality for Accurate Low-latency Tissue Tracking

【速读】：该论文致力于解决现有组织跟踪方法在内窥镜手术导航和扩展现实(XR)应用中的实时性不足问题，即其运行时间性能无法满足低延迟要求。为了解决这一局限性，论文提出了一种名为LiteTracker的轻量级组织跟踪方法。LiteTracker的关键创新在于结合了一种最先进的长期点跟踪方法，并引入了一系列无需额外训练的运行时优化策略。这些优化通过利用时序记忆缓冲区实现特征高效复用，以及借助先验运动信息进行精准跟踪初始化，从而实现实时逐帧跟踪，显著提升了运行效率，比其前辈快约7倍，比当前最先进方法快2倍，同时保持了高精度的跟踪与遮挡预测能力。

链接: https://arxiv.org/abs/2504.09904
作者: Mert Asim Karaoglu,Wenbo Ji,Ahmed Abbas,Nassir Navab,Benjamin Busam,Alexander Ladikos
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Tissue tracking plays a critical role in various surgical navigation and extended reality (XR) applications. While current methods trained on large synthetic datasets achieve high tracking accuracy and generalize well to endoscopic scenes, their runtime performances fail to meet the low-latency requirements necessary for real-time surgical applications. To address this limitation, we propose LiteTracker, a low-latency method for tissue tracking in endoscopic video streams. LiteTracker builds on a state-of-the-art long-term point tracking method, and introduces a set of training-free runtime optimizations. These optimizations enable online, frame-by-frame tracking by leveraging a temporal memory buffer for efficient feature reuse and utilizing prior motion for accurate track initialization. LiteTracker demonstrates significant runtime improvements being around 7x faster than its predecessor and 2x than the state-of-the-art. Beyond its primary focus on efficiency, LiteTracker delivers high-accuracy tracking and occlusion prediction, performing competitively on both the STIR and SuPer datasets. We believe LiteTracker is an important step toward low-latency tissue tracking for real-time surgical applications in the operating room.
zh

[CV-77] Small Object Detection with YOLO: A Performance Analysis Across Model Versions and Hardware

【速读】：该论文旨在全面评估 YOLO 系列目标检测模型（v5、v8、v9、v10、v11）在不同硬件平台和优化库上的性能表现。论文通过比较这些模型在 Intel 和 AMD CPU 上使用 ONNX 和 OpenVINO 等流行库的推理速度与检测精度，以及在 GPU 上使用 TensorRT 和其他 GPU 优化框架的表现，探讨其在不同硬件环境下的效率与准确性权衡。此外，研究还分析了这些 YOLO 模型对图像中目标大小的敏感性，考察了检测占据图像总面积 1%、2.5% 和 5% 的对象时的性能差异。论文的关键在于揭示模型在效率、精度及目标大小适应性之间的权衡关系，从而为基于特定硬件约束和检测需求的最优模型选择提供指导，帮助从业者有效部署 YOLO 模型以满足实际应用需求。

链接: https://arxiv.org/abs/2504.09900
作者: Muhammad Fasih Tariq,Muhammad Azeem Javed
机构: School of Systems and Technology (系统与技术学院), University of Management and Technology (管理与技术大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:This paper provides an extensive evaluation of YOLO object detection models (v5, v8, v9, v10, v11) by com- paring their performance across various hardware platforms and optimization libraries. Our study investigates inference speed and detection accuracy on Intel and AMD CPUs using popular libraries such as ONNX and OpenVINO, as well as on GPUs through TensorRT and other GPU-optimized frameworks. Furthermore, we analyze the sensitivity of these YOLO models to object size within the image, examining performance when detecting objects that occupy 1%, 2.5%, and 5% of the total area of the image. By identifying the trade-offs in efficiency, accuracy, and object size adaptability, this paper offers insights for optimal model selection based on specific hardware constraints and detection requirements, aiding practitioners in deploying YOLO models effectively for real-world applications.
zh

[CV-78] Digital Staining with Knowledge Distillation: A Unified Framework for Unpaired and Paired-But-Misaligned Data

【速读】：该论文旨在解决细胞成像和医学诊断中传统染色方法面临的高成本、耗时、劳动密集以及对组织不可逆改变等挑战。为应对这些难题，论文提出了一种新颖的无监督深度学习框架用于数字细胞染色，其关键在于通过知识蒸馏技术减少了对大规模配对染色与未染色图像数据的需求。具体而言，研究探索了两种训练方案：无配对情况下的两阶段流水线（包括光照增强和着色）作为教师模型，以及配对但错位情况下的扩展版本，后者引入了对齐模块以利用像素级信息。实验结果表明，所提出的无监督深度染色方法在两种设置下均能生成具有更精确细胞目标位置和形状的染色图像，并且在定量指标（如NIQE和PSNR）及定性评估上优于现有方法。此外，该方法已在白细胞数据集上得到应用，展示了其在医学领域的潜力。

链接: https://arxiv.org/abs/2504.09899
作者: Ziwang Xu,Lanqing Guo,Satoshi Tsutsui,Shuyan Zhang,Alex C. Kot,Bihan Wen
机构: Rapid-Rich Object Search Lab, Nanyang Technological University (南洋理工大学); The University of Texas at Austin (德克萨斯大学奥斯汀分校); Institute of Materials Research and Engineering, Agency for Science, Technology and Research (A*STAR)(材料研究与工程研究院, 新加坡科技研究局)
类目: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
备注: Accepted to IEEE Transactions on Medical Imaging

点击查看摘要

Abstract:Staining is essential in cell imaging and medical diagnostics but poses significant challenges, including high cost, time consumption, labor intensity, and irreversible tissue alterations. Recent advances in deep learning have enabled digital staining through supervised model training. However, collecting large-scale, perfectly aligned pairs of stained and unstained images remains difficult. In this work, we propose a novel unsupervised deep learning framework for digital cell staining that reduces the need for extensive paired data using knowledge distillation. We explore two training schemes: (1) unpaired and (2) paired-but-misaligned settings. For the unpaired case, we introduce a two-stage pipeline, comprising light enhancement followed by colorization, as a teacher model. Subsequently, we obtain a student staining generator through knowledge distillation with hybrid non-reference losses. To leverage the pixel-wise information between adjacent sections, we further extend to the paired-but-misaligned setting, adding the Learning to Align module to utilize pixel-level information. Experiment results on our dataset demonstrate that our proposed unsupervised deep staining method can generate stained images with more accurate positions and shapes of the cell targets in both settings. Compared with competing methods, our method achieves improved results both qualitatively and quantitatively (e.g., NIQE and PSNR).We applied our digital staining method to the White Blood Cell (WBC) dataset, investigating its potential for medical applications.
zh

[CV-79] AMP: Token-Adaptive Layerwise Pruning in Multimodal Large Language Models

【速读】：本文旨在解决多模态大型语言模型（MLLMs）在后训练剪枝（post-training pruning）过程中效果有限的问题。传统剪枝方法未能充分考虑MLLMs中跨层及模态的独特标记属性，导致剪枝性能不佳。为应对这一挑战，论文提出了一种名为TAMP的简单而有效的剪枝框架，其关键是两个核心组件：(1) 基于多样性感知稀疏性（Diversity-Aware Sparsity），根据多模态输出标记之间的多样性动态调整每层的稀疏比率，在高多样性层保留更多参数；(2) 自适应多模态输入激活（Adaptive Multimodal Input Activation），利用注意力分数识别代表性多模态输入标记，以指导无结构权重剪枝。通过在两种最先进的MLLMs（LLaVA-NeXT和VideoLLaMA2）上的验证，实验结果表明，TAMP的每个组成部分均显著优于现有剪枝技术。

链接: https://arxiv.org/abs/2504.09897
作者: Jaewoo Lee,Keyang Xuan,Chanakya Ekbote,Sandeep Polisetty,Yi R.(May)Fung,Paul Pu Liang
机构: University of North Carolina Chapel Hill (北卡罗来纳大学教堂山分校); University of Illinois Urbana-Champaign (伊利诺伊大学厄巴纳-香槟分校); Massachusetts Institute of Technology (麻省理工学院); University of Massachusetts Amherst (马萨诸塞大学阿默斯特分校); Hong Kong University of Science and Technology (香港科技大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Preprint

点击查看摘要

Abstract:Multimodal Large Language Models (MLLMs) have shown remarkable versatility in understanding diverse multimodal data and tasks. However, these capabilities come with an increased model scale. While post-training pruning reduces model size in unimodal models, its application to MLLMs often yields limited success. Our analysis discovers that conventional methods fail to account for the unique token attributes across layers and modalities inherent to MLLMs. Inspired by this observation, we propose TAMP, a simple yet effective pruning framework tailored for MLLMs, featuring two key components: (1) Diversity-Aware Sparsity, which adjusts sparsity ratio per layer based on diversities among multimodal output tokens, preserving more parameters in high-diversity layers; and (2) Adaptive Multimodal Input Activation, which identifies representative multimodal input tokens using attention scores to guide unstructured weight pruning. We validate our method on two state-of-the-art MLLMs: LLaVA-NeXT, designed for vision-language tasks, and VideoLLaMA2, capable of processing audio, visual, and language modalities. Empirical experiments across various multimodal evaluation benchmarks demonstrate that each component of our approach substantially outperforms existing pruning techniques.
zh

[CV-80] Enhanced Semantic Extraction and Guidance for UGC Image Super Resolution

【速读】：该论文旨在解决用户生成内容（User-Generated Content, UGC）图像超分辨率领域中，真实世界退化与合成退化之间的差异导致传统方法难以有效泛化的问题。为应对这一挑战，论文提出了一种新颖的方法，通过将语义引导融入扩散框架来提升模型对真实世界失真的建模能力。关键在于：首先，通过在LSDIR数据集上分别模拟退化过程，并结合官方配对训练集来解决野蛮数据集与合成数据集之间的一致性问题；其次，引入预训练的语义提取模型（Semantic Extraction Model, SAM2），并精细调整关键超参数以增强退化去除能力和细节生成，从而实现更高的感知保真度。实验结果验证了该方法优于现有技术，并在CVPR NTIRE 2025短格式UGC图像超分辨率挑战赛中获得第二名。

链接: https://arxiv.org/abs/2504.09887
作者: Yiwen Wang,Ying Liang,Yuxuan Zhang,Xinning Chai,Zhengxue Cheng,Yingsheng Qin,Yucai Yang,Rong Xie,Li Song
机构: Shanghai Jiao Tong University (上海交通大学); Transsion (传音控股)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Due to the disparity between real-world degradations in user-generated content(UGC) images and synthetic degradations, traditional super-resolution methods struggle to generalize effectively, necessitating a more robust approach to model real-world distortions. In this paper, we propose a novel approach to UGC image super-resolution by integrating semantic guidance into a diffusion framework. Our method addresses the inconsistency between degradations in wild and synthetic datasets by separately simulating the degradation processes on the LSDIR dataset and combining them with the official paired training set. Furthermore, we enhance degradation removal and detail generation by incorporating a pretrained semantic extraction model (SAM2) and fine-tuning key hyperparameters for improved perceptual fidelity. Extensive experiments demonstrate the superiority of our approach against state-of-the-art methods. Additionally, the proposed model won second place in the CVPR NTIRE 2025 Short-form UGC Image Super-Resolution Challenge, further validating its effectiveness. The code is available at https://github.c10pom/Moonsofang/NTIRE-2025-SRlab.
zh

[CV-81] Separate to Collaborate: Dual-Stream Diffusion Model for Coordinated Piano Hand Motion Synthesis

【速读】：本文旨在解决自动化合成协调的双手钢琴演奏的问题，尤其关注如何捕捉双手之间复杂的协调性同时保留各自独特的运动学特征。论文提出了一种双流神经框架，用于从音频输入生成同步的手部动作。解决方案的关键在于两项创新：(i) 一种基于解耦扩散的生成框架，通过双重噪声初始化独立建模每只手的运动，同时利用共享的位置条件采样每只手的独特潜在噪声；(ii) 手部协调不对称注意力机制（Hand-Coordinated Asymmetric Attention, HCAA），抑制对称噪声以突出不对称的手部特定特征，并在去噪过程中自适应增强双手间的协调性。系统采用分层操作，首先从音频特征预测三维手部位置，然后通过位置感知的扩散模型生成关节角度，其中并行去噪流通过HCAA相互作用。综合评估表明，该框架在多个指标上优于现有最先进的方法。

链接: https://arxiv.org/abs/2504.09885
作者: Zihao Liu,Mingwen Ou,Zunnan Xu,Jiaqi Huang,Haonan Han,Ronghui Li,Xiu Li
机构: Tsinghua University (清华大学); Sun Yat-Sen University (中山大学)
类目: ound (cs.SD); Computer Vision and Pattern Recognition (cs.CV); Audio and Speech Processing (eess.AS)
备注: 12 pages, 4 figures

点击查看摘要

Abstract:Automating the synthesis of coordinated bimanual piano performances poses significant challenges, particularly in capturing the intricate choreography between the hands while preserving their distinct kinematic signatures. In this paper, we propose a dual-stream neural framework designed to generate synchronized hand gestures for piano playing from audio input, addressing the critical challenge of modeling both hand independence and coordination. Our framework introduces two key innovations: (i) a decoupled diffusion-based generation framework that independently models each hand’s motion via dual-noise initialization, sampling distinct latent noise for each while leveraging a shared positional condition, and (ii) a Hand-Coordinated Asymmetric Attention (HCAA) mechanism suppresses symmetric (common-mode) noise to highlight asymmetric hand-specific features, while adaptively enhancing inter-hand coordination during denoising. The system operates hierarchically: it first predicts 3D hand positions from audio features and then generates joint angles through position-aware diffusion models, where parallel denoising streams interact via HCAA. Comprehensive evaluations demonstrate that our framework outperforms existing state-of-the-art methods across multiple metrics.
zh

[CV-82] Focus on Local: Finding Reliable Discriminative Regions for Visual Place Recognition AAAI2025

【速读】：该论文旨在解决视觉位置识别（Visual Place Recognition, VPR）任务中因图像中较少判别性局部区域而产生的性能瓶颈问题，同时避免平凡背景区域引入的感知混淆。现有方法在建模和充分利用这些判别性区域方面存在不足。为此，论文提出了一种名为Focus on Local (FoL)的方法，通过挖掘和利用可靠的判别性局部区域，并引入伪相关性监督来同时提升图像检索与重排序性能。其关键在于设计了两种损失函数：提取-聚合空间对齐损失（SAL）和前景-背景对比增强损失（CEL），以显式建模判别性局部区域，并指导全局表示生成及高效重排序；提出了基于聚合全局特征获得的伪对应关系的弱监督局部特征训练策略，缓解了VPR任务中局部对应标注数据稀缺的问题；并通过基于判别性区域引导的高效重排序管道提升了计算效率。实验结果表明，FoL方法在多个VPR基准测试中实现了最先进的性能，并显著优于现有的两阶段VPR方法。

链接: https://arxiv.org/abs/2504.09881
作者: Changwei Wang,Shunpeng Chen,Yukun Song,Rongtao Xu,Zherui Zhang,Jiguang Zhang,Haoran Yang,Yu Zhang,Kexue Fu,Shide Du,Zhiwei Xu,Longxiang Gao,Li Guo,Shibiao Xu
机构: Unknown
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by AAAI 2025

点击查看摘要

Abstract:Visual Place Recognition (VPR) is aimed at predicting the location of a query image by referencing a database of geotagged images. For VPR task, often fewer discriminative local regions in an image produce important effects while mundane background regions do not contribute or even cause perceptual aliasing because of easy overlap. However, existing methods lack precisely modeling and full exploitation of these discriminative regions. In this paper, we propose the Focus on Local (FoL) approach to stimulate the performance of image retrieval and re-ranking in VPR simultaneously by mining and exploiting reliable discriminative local regions in images and introducing pseudo-correlation supervision. First, we design two losses, Extraction-Aggregation Spatial Alignment Loss (SAL) and Foreground-Background Contrast Enhancement Loss (CEL), to explicitly model reliable discriminative local regions and use them to guide the generation of global representations and efficient re-ranking. Second, we introduce a weakly-supervised local feature training strategy based on pseudo-correspondences obtained from aggregating global features to alleviate the lack of local correspondences ground truth for the VPR task. Third, we suggest an efficient re-ranking pipeline that is efficiently and precisely based on discriminative region guidance. Finally, experimental results show that our FoL achieves the state-of-the-art on multiple VPR benchmarks in both image retrieval and re-ranking stages and also significantly outperforms existing two-stage VPR methods in terms of computational efficiency. Code and models are available at this https URL
zh

[CV-83] MCBlock: Boosting Neural Radiance Field Training Speed by MCTS-based Dynamic-Resolution Ray Sampling

【速读】：该论文旨在解决Neural Radiance Field (NeRF) 在训练效率方面的瓶颈问题，特别是现有方法在采样策略上的低效性。NeRF虽以高保真度的新视角合成著称，但即使是最先进的模型（如Gaussian Splatting）仍需耗时数分钟完成训练，无法满足实时应用需求，尤其是在远程医疗等多媒体场景中。当前点采样算法仅部分优化了简单纹理区域与复杂纹理区域的采样策略，而现有射线采样算法仍以像素级细粒度对所有区域进行采样，导致GPU资源浪费。实际上，不同纹理强度的区域需要不同的采样分辨率。

为了解决这一问题，论文提出了一种新颖的动态分辨率射线采样算法MCBlock。其关键在于利用Monte Carlo Tree Search (MCTS) 将每个训练图像划分为不同大小的像素块，并针对这些块进行主动的块级训练。具体而言，通过初始化树结构以反映训练图像的纹理特征来加速初始化过程，同时采用扩展/剪枝模块动态优化块划分策略。实验表明，MCBlock在Nerfstudio中实现了高达2.33倍的训练加速，优于其他射线采样算法。论文认为，MCBlock可适用于任何基于锥形追踪的NeRF模型，并有望推动多媒体领域的进一步发展。

链接: https://arxiv.org/abs/2504.09878
作者: Yunpeng Tan,Junlin Hao,Jiangkai Wu,Liming Liu,Qingyang Li,Xinggong Zhang
机构: Peking University (北京大学); Beijing (北京)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Neural Radiance Field (NeRF) is widely known for high-fidelity novel view synthesis. However, even the state-of-the-art NeRF model, Gaussian Splatting, requires minutes for training, far from the real-time performance required by multimedia scenarios like telemedicine. One of the obstacles is its inefficient sampling, which is only partially addressed by existing works. Existing point-sampling algorithms uniformly sample simple-texture regions (easy to fit) and complex-texture regions (hard to fit), while existing ray-sampling algorithms sample these regions all in the finest granularity (i.e. the pixel level), both wasting GPU training resources. Actually, regions with different texture intensities require different sampling granularities. To this end, we propose a novel dynamic-resolution ray-sampling algorithm, MCBlock, which employs Monte Carlo Tree Search (MCTS) to partition each training image into pixel blocks with different sizes for active block-wise training. Specifically, the trees are initialized according to the texture of training images to boost the initialization speed, and an expansion/pruning module dynamically optimizes the block partition. MCBlock is implemented in Nerfstudio, an open-source toolset, and achieves a training acceleration of up to 2.33x, surpassing other ray-sampling algorithms. We believe MCBlock can apply to any cone-tracing NeRF model and contribute to the multimedia community.
zh

[CV-84] HDC: Hierarchical Distillation for Multi-level Noisy Consistency in Semi-Supervised Fetal Ultrasound Segmentation

【速读】：该论文致力于解决宫颈结构在经阴道超声图像中因对比度低、阴影伪影及边界模糊导致的精确分割难题。由于临床超声成像难以获取大规模标注数据，传统基于卷积神经网络（CNN）的方法受限。此外，现有的半监督学习（SSL）框架常面临确认偏差和高计算成本的问题。为应对这些挑战，论文提出了一种名为HDC的新框架，其关键在于结合了多层级噪声教师模型中的分层蒸馏与一致性学习机制。不同于仅依赖伪标签的传统方法，HDC引入了相关性引导损失（Correlation Guidance Loss）以对齐教师与主学生分支间的特征表示，并通过互信息损失（Mutual Information Loss）稳定主分支与噪声分支间的表示，从而在降低模型复杂度的同时提升泛化能力。

链接: https://arxiv.org/abs/2504.09876
作者: Tran Quoc Khanh Le,Nguyen Lan Vi Vu,Ha-Hieu Pham,Xuan-Loc Huynh,Tien-Huy Nguyen,Minh Huu Nhat Le,Quan Nguyen,Hien D. Nguyen
机构: University of Information Technology (胡志明市, 越南); Ho Chi Minh University of Technology (胡志明市, 越南); University of Science (胡志明市, 越南); Vietnam National University (胡志明市, 越南); Boston University (马萨诸塞州, 美国); Taipei Medical University (台北, 台湾); Posts and Telecommunications Institute of Technology (河内, 越南); New Mexico State University (拉斯克鲁塞斯, 新墨西哥州, 美国)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Transvaginal ultrasound is a critical imaging modality for evaluating cervical anatomy and detecting physiological changes. However, accurate segmentation of cervical structures remains challenging due to low contrast, shadow artifacts, and fuzzy boundaries. While convolutional neural networks (CNNs) have shown promising results in medical image segmentation, their performance is often limited by the need for large-scale annotated datasets - an impractical requirement in clinical ultrasound imaging. Semi-supervised learning (SSL) offers a compelling solution by leveraging unlabeled data, but existing teacher-student frameworks often suffer from confirmation bias and high computational costs. We propose HDC, a novel semi-supervised segmentation framework that integrates Hierarchical Distillation and Consistency learning within a multi-level noise mean-teacher framework. Unlike conventional approaches that rely solely on pseudo-labeling, we introduce a hierarchical distillation mechanism that guides feature-level learning via two novel objectives: (1) Correlation Guidance Loss to align feature representations between the teacher and main student branch, and (2) Mutual Information Loss to stabilize representations between the main and noisy student branches. Our framework reduces model complexity while improving generalization. Extensive experiments on two fetal ultrasound datasets, FUGC and PSFH, demonstrate that our method achieves competitive performance with significantly lower computational overhead than existing multi-teacher models.
zh

[CV-85] GFT: Gradient Focal Transformer

【速读】：本文旨在解决细粒度图像分类（Fine-Grained Image Classification, FGIC）任务中，现有模型在捕捉全局上下文与局部细节特征之间的不平衡问题。尽管基于卷积神经网络（CNN）的模型在局部特征提取方面表现强劲，但难以有效捕获细粒度识别所需的全局上下文；而基于视觉Transformer（ViT）的模型虽通过注意力驱动机制改进了全局上下文建模，但在自适应聚焦于真正判别性区域的能力上仍显不足。此外，已有方法如TransFG等引入部件感知标记选择策略，但在计算效率、注意力区域选择灵活性以及复杂环境下的细节聚焦能力上仍有局限。

为了解决这些问题，本文提出了GFT（Gradient Focal Transformer），一种新的基于ViT的框架。其核心解决方案包括两个关键模块：一是梯度注意学习对齐（Gradient Attention Learning Alignment, GALA）机制，通过分析注意力梯度流动态优先关注类别判别性特征；二是渐进块选择（Progressive Patch Selection, PPS）策略，逐步过滤掉信息量较少的区域，从而降低计算开销并增强对细微细节的敏感性。GFT在FGVC Aircraft、Food-101和COCO数据集上达到了最先进的性能（SOTA），同时仅使用93M参数，在效率上超越了其他先进的ViT基FGIC模型。通过整合全局上下文与局部细节提取能力，GFT为细粒度图像分类任务设定了新基准，并为实际部署场景提供了可解释性的解决方案。

链接: https://arxiv.org/abs/2504.09852
作者: Boris Kriuk,Simranjit Kaur Gill,Shoaib Aslam,Amir Fakhrutdinov
机构: Hong Kong University of Science and Technology (香港科技大学); University of Westminster (西敏斯特大学); University of Engineering & Technology Lahore (拉合尔工程技术大学); Shanghai Jiao Tong University (上海交通大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 11 pages, 3 tables, 5 figures

点击查看摘要

Abstract:Fine-Grained Image Classification (FGIC) remains a complex task in computer vision, as it requires models to distinguish between categories with subtle localized visual differences. Well-studied CNN-based models, while strong in local feature extraction, often fail to capture the global context required for fine-grained recognition, while more recent ViT-backboned models address FGIC with attention-driven mechanisms but lack the ability to adaptively focus on truly discriminative regions. TransFG and other ViT-based extensions introduced part-aware token selection to enhance attention localization, yet they still struggle with computational efficiency, attention region selection flexibility, and detail-focus narrative in complex environments. This paper introduces GFT (Gradient Focal Transformer), a new ViT-derived framework created for FGIC tasks. GFT integrates the Gradient Attention Learning Alignment (GALA) mechanism to dynamically prioritize class-discriminative features by analyzing attention gradient flow. Coupled with a Progressive Patch Selection (PPS) strategy, the model progressively filters out less informative regions, reducing computational overhead while enhancing sensitivity to fine details. GFT achieves SOTA accuracy on FGVC Aircraft, Food-101, and COCO datasets with 93M parameters, outperforming ViT-based advanced FGIC models in efficiency. By bridging global context and localized detail extraction, GFT sets a new benchmark in fine-grained recognition, offering interpretable solutions for real-world deployment scenarios.
zh

[CV-86] ST-Booster: An Iterative SpatioTemporal Perception Booster for Vision-and-Language Navigation in Continuous Environments

【速读】：该论文旨在解决连续环境中视觉与语言导航（Vision-and-Language Navigation in Continuous Environments, VLN-CE）所面临的两大核心感知挑战：一是缺乏预定义观测点导致的异构视觉记忆和削弱的全局空间相关性；二是三维场景中累积重建误差引入的结构噪声，影响局部特征感知。为应对这些挑战，论文提出了一种迭代时空增强器ST-Booster，通过多粒度感知和指令感知推理提升导航性能。ST-Booster的关键在于其包含三个核心模块：分层时空编码（Hierarchical SpatioTemporal Encoding, HSTE）、多粒度对齐融合（Multi-Granularity Aligned Fusion, MGAF）以及值引导路径点生成（ValueGuided Waypoint Generation, VGWG）。其中，HSTE利用拓扑图编码长期全局记忆，并通过网格地图捕获短期局部细节；MGAF通过几何感知知识融合对齐双地图表示与指令；VGWG则生成指导注意力热图（Guided Attention Heatmaps, GAHs）以显式建模环境与指令的相关性并优化路径点选择。这些模块协同工作，显著提升了复杂且易受干扰环境中的导航性能。

链接: https://arxiv.org/abs/2504.09843
作者: Lu Yue,Dongliang Zhou,Liang Xie,Erwei Yin,Feitian Zhang
机构: Department of Advanced Manufacturing and Robotics, and the State Key Laboratory of Turbulence and Complex Systems, College of Engineering, Peking University (北京大学); Harbin Institute of Technology (哈尔滨工业大学); Defense Innovation Institute, Academy of Military Sciences (军事科学院国防创新研究院); Tianjin Artificial Intelligence Innovation Center (天津人工智能创新中心)
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
备注: 11 pages, 7 figures

点击查看摘要

Abstract:Vision-and-Language Navigation in Continuous Environments (VLN-CE) requires agents to navigate unknown, continuous spaces based on natural language instructions. Compared to discrete settings, VLN-CE poses two core perception challenges. First, the absence of predefined observation points leads to heterogeneous visual memories and weakened global spatial correlations. Second, cumulative reconstruction errors in three-dimensional scenes introduce structural noise, impairing local feature perception. To address these challenges, this paper proposes ST-Booster, an iterative spatiotemporal booster that enhances navigation performance through multi-granularity perception and instruction-aware reasoning. ST-Booster consists of three key modules – Hierarchical SpatioTemporal Encoding (HSTE), Multi-Granularity Aligned Fusion (MGAF), and ValueGuided Waypoint Generation (VGWG). HSTE encodes long-term global memory using topological graphs and captures shortterm local details via grid maps. MGAF aligns these dualmap representations with instructions through geometry-aware knowledge fusion. The resulting representations are iteratively refined through pretraining tasks. During reasoning, VGWG generates Guided Attention Heatmaps (GAHs) to explicitly model environment-instruction relevance and optimize waypoint selection. Extensive comparative experiments and performance analyses are conducted, demonstrating that ST-Booster outperforms existing state-of-the-art methods, particularly in complex, disturbance-prone environments.
zh

[CV-87] FATE: A Prompt-Tuning-Based Semi-Supervised Learning Framework for Extremely Limited Labeled Data

【速读】：该论文致力于解决半监督学习（Semi-Supervised Learning, SSL）在极少量标注数据（如单一样本）场景下的有效性问题。现有方法难以从头训练或在利用预训练模型时无法有效平衡有限标注数据与大量未标注数据之间的关系。论文提出的关键解决方案是FATE（Firstly Adapt, Then catEgorize），一种针对极少量标注数据场景设计的新型SSL框架。其核心在于两阶段提示微调范式：首先通过大量未标注数据无监督地调整预训练模型以适应下游任务的数据特征分布；然后应用专为预训练模型设计的SSL方法完成最终分类任务。这种方法旨在兼容视觉及视觉-语言预训练模型，并通过实验验证显著提升了性能，在七个基准测试中平均性能提升达33.74%，从而有效缓解了标注样本稀缺带来的挑战。

链接: https://arxiv.org/abs/2504.09828
作者: Hezhao Liu,Yang Lu,Mengke Li,Yiqun Zhang,Shreyank N Gowda,Chen Gong,Hanzi Wang
机构: Xiamen University(Xiamen University); Xiamen University(Xiamen University); Guangdong Laboratory of Artificial Intelligence and Digital Economy (SZ)(广东人工智能与数字经济实验室（深圳）); Guangdong University of Technology(Guangdong University of Technology); University of Nottingham(诺丁汉大学); Nanjing University of Science and Technology(Nanjing University of Science and Technology); Xiamen University(Xiamen University)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Semi-supervised learning (SSL) has achieved significant progress by leveraging both labeled data and unlabeled data. Existing SSL methods overlook a common real-world scenario when labeled data is extremely scarce, potentially as limited as a single labeled sample in the dataset. General SSL approaches struggle to train effectively from scratch under such constraints, while methods utilizing pre-trained models often fail to find an optimal balance between leveraging limited labeled data and abundant unlabeled data. To address this challenge, we propose Firstly Adapt, Then catEgorize (FATE), a novel SSL framework tailored for scenarios with extremely limited labeled data. At its core, the two-stage prompt tuning paradigm FATE exploits unlabeled data to compensate for scarce supervision signals, then transfers to downstream tasks. Concretely, FATE first adapts a pre-trained model to the feature distribution of downstream data using volumes of unlabeled samples in an unsupervised manner. It then applies an SSL method specifically designed for pre-trained models to complete the final classification task. FATE is designed to be compatible with both vision and vision-language pre-trained models. Extensive experiments demonstrate that FATE effectively mitigates challenges arising from the scarcity of labeled samples in SSL, achieving an average performance improvement of 33.74% across seven benchmarks compared to state-of-the-art SSL methods. Code is available at this https URL.
zh

[CV-88] Density-based Object Detection in Crowded Scenes

【速读】：该论文旨在解决在拥挤场景中目标检测面临的两个主要问题：1）训练过程中因高度重叠实例导致的锚点（anchors）更模糊；2）推理阶段后处理中更容易错误抑制预测结果。为了解决这些问题，论文提出了两种新策略——密度引导锚点（Density-Guided Anchors, DGA）和密度引导非极大值抑制（Density-Guided NMS, DG-NMS）。关键在于利用对象密度图联合计算最优锚点分配与重新加权，并采用自适应NMS。具体而言，基于不平衡最优传输（Unbalanced Optimal Transport, UOT）问题，每个真实目标的密度以最小传输成本分配到每个锚点位置，从而形成锚点上的实例特定密度分布，DGA从中解码出最优锚点分配和重新加权策略。同时，DG-NMS通过预测的密度图自适应调整NMS阈值以减少误抑制。此外，设计了一种新颖的重叠感知传输代价，专门针对由邻近重叠对象引起的模糊锚点。实验表明，所提出的密度引导检测器在CrowdHuman和CityPersons数据集上对拥挤场景表现出了有效性和鲁棒性。

链接: https://arxiv.org/abs/2504.09819
作者: Chenyang Zhao,Jia Wan,Antoni B. Chan
机构: Department of Computer Science, City University of Hong Kong (香港城市大学计算机科学系)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Compared with the generic scenes, crowded scenes contain highly-overlapped instances, which result in: 1) more ambiguous anchors during training of object detectors, and 2) more predictions are likely to be mistakenly suppressed in post-processing during inference. To address these problems, we propose two new strategies, density-guided anchors (DGA) and density-guided NMS (DG-NMS), which uses object density maps to jointly compute optimal anchor assignments and reweighing, as well as an adaptive NMS. Concretely, based on an unbalanced optimal transport (UOT) problem, the density owned by each ground-truth object is transported to each anchor position at a minimal transport cost. And density on anchors comprises an instance-specific density distribution, from which DGA decodes the optimal anchor assignment and re-weighting strategy. Meanwhile, DG-NMS utilizes the predicted density map to adaptively adjust the NMS threshold to reduce mistaken suppressions. In the UOT, a novel overlap-aware transport cost is specifically designed for ambiguous anchors caused by overlapped neighboring objects. Extensive experiments on the challenging CrowdHuman dataset with Citypersons dataset demonstrate that our proposed density-guided detector is effective and robust to crowdedness. The code and pre-trained models will be made available later.
zh

[CV-89] DUDA: Distilled Unsupervised Domain Adaptation for Lightweight Semantic Segmentation

【速读】：该论文旨在解决现有无监督领域自适应（UDA）方法在轻量级模型上性能显著下降的问题。这一问题源于架构上的固有僵化，导致生成的质量较低的伪标签。为了解决此问题，论文提出了一种名为蒸馏无监督领域自适应（DUDA）的新框架。其关键是结合基于指数移动平均（EMA）的自训练与知识蒸馏（KD），通过引入一个辅助学生网络来弥合重型与轻量级模型之间的架构差距，从而提高伪标签质量。此外，DUDA 还通过渐进式蒸馏、优先考虑适应较差类别的不一致性损失以及多教师学习等创新元素，进一步优化了UDA 和 KD 的融合策略。实验结果表明，DUDA 在多个UDA基准测试中实现了最先进的性能，尤其在轻量级模型上的表现优于其他方法中的重型模型。

链接: https://arxiv.org/abs/2504.09814
作者: Beomseok Kang,Niluthpol Chowdhury Mithun,Abhinav Rajvanshi,Han-Pang Chiu,Supun Samarasekera
机构: Georgia Institute of Technology (乔治亚理工学院); SRI International (SRI国际)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Unsupervised Domain Adaptation (UDA) is essential for enabling semantic segmentation in new domains without requiring costly pixel-wise annotations. State-of-the-art (SOTA) UDA methods primarily use self-training with architecturally identical teacher and student networks, relying on Exponential Moving Average (EMA) updates. However, these approaches face substantial performance degradation with lightweight models due to inherent architectural inflexibility leading to low-quality pseudo-labels. To address this, we propose Distilled Unsupervised Domain Adaptation (DUDA), a novel framework that combines EMA-based self-training with knowledge distillation (KD). Our method employs an auxiliary student network to bridge the architectural gap between heavyweight and lightweight models for EMA-based updates, resulting in improved pseudo-label quality. DUDA employs a strategic fusion of UDA and KD, incorporating innovative elements such as gradual distillation from large to small networks, inconsistency loss prioritizing poorly adapted classes, and learning with multiple teachers. Extensive experiments across four UDA benchmarks demonstrate DUDA’s superiority in achieving SOTA performance with lightweight models, often surpassing the performance of heavyweight models from other approaches.
zh

[CV-90] IGL-DT: Iterative Global-Local Feature Learning with Dual-Teacher Semantic Segmentation Framework under Limited Annotation Scheme

【速读】：该论文旨在解决半监督语义分割（Semi-Supervised Semantic Segmentation, SSSS）中全局语义表示与细粒度局部特征提取难以平衡的问题。现有方法在利用少量标注数据和大量未标注数据提升分割精度的同时，往往无法兼顾这两方面的有效融合。为应对这一挑战，论文提出了一种名为IGL-DT的新型三分支半监督分割框架，其关键在于引入双教师策略，并结合全局上下文学习（Global Context Learning）通过SwinUnet实现高阶语义引导，以及局部区域学习（Local Regional Learning）通过ResUnet进行细节特征优化。此外，差异学习（Discrepancy Learning）机制进一步缓解了对单一教师模型的过度依赖，促进了自适应特征学习。实验结果表明，该方法在多个基准数据集上超越了现有最先进方法，实现了更优的分割性能。

链接: https://arxiv.org/abs/2504.09797
作者: Dinh Dai Quan Tran,Hoang-Thien Nguyen. Thanh-Huy Nguyen,Gia-Van To,Tien-Huy Nguyen,Quan Nguyen
机构: National Chung Cheng University (国立中正大学), Taiwan; Posts and Telecommunications Institute of Technology, Ho Chi Minh, Vietnam; Université de Bourgogne (第戎大学), Dijon, France; Institut de Science Financière et d’Assurances (金融与保险学院), Lyon, France; University of Information Technology, VNU-HCM, Vietnam; Posts and Telecommunications Institute of Technology, Hanoi, Vietnam
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 10 pages, 5 figures

点击查看摘要

Abstract:Semi-Supervised Semantic Segmentation (SSSS) aims to improve segmentation accuracy by leveraging a small set of labeled images alongside a larger pool of unlabeled data. Recent advances primarily focus on pseudo-labeling, consistency regularization, and co-training strategies. However, existing methods struggle to balance global semantic representation with fine-grained local feature extraction. To address this challenge, we propose a novel tri-branch semi-supervised segmentation framework incorporating a dual-teacher strategy, named IGL-DT. Our approach employs SwinUnet for high-level semantic guidance through Global Context Learning and ResUnet for detailed feature refinement via Local Regional Learning. Additionally, a Discrepancy Learning mechanism mitigates over-reliance on a single teacher, promoting adaptive feature learning. Extensive experiments on benchmark datasets demonstrate that our method outperforms state-of-the-art approaches, achieving superior segmentation performance across various data regimes.
zh

[CV-91] EquiVDM: Equivariant Video Diffusion Models with Temporally Consistent Noise

【速读】：该论文致力于解决视频扩散模型在跨领域应用（如仿真到真实场景转换、风格迁移、视频超分辨率等）中生成时序一致性视频的问题。论文的关键在于提出了一种利用时序一致噪声的视频扩散框架，无需引入专门模块或额外约束即可生成连贯的视频帧。其核心解决方案的关键在于，通过在扩散模型的标准训练目标中结合时序一致噪声，促使模型对输入视频及其噪声的空间变换具有等变性（equivariance），从而更好地捕捉并遵循输入视频中的运动模式，实现对齐的运动轨迹与高保真度的帧生成。此外，进一步扩展至3D一致性生成任务，通过将噪声作为纹理附加到3D网格上，确保仿真到真实场景应用中的3D一致性。实验结果表明，该方法在运动对齐、3D一致性及视频质量方面超越现有先进基线，且实际应用中仅需较少的采样步骤。

链接: https://arxiv.org/abs/2504.09789
作者: Chao Liu,Arash Vahdat
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Temporally consistent video-to-video generation is essential for applications of video diffusion models in areas such as sim-to-real, style-transfer, video upsampling, etc. In this paper, we propose a video diffusion framework that leverages temporally consistent noise to generate coherent video frames without specialized modules or additional constraints. We show that the standard training objective of diffusion models, when applied with temporally consistent noise, encourages the model to be equivariant to spatial transformations in input video and noise. This enables our model to better follow motion patterns from the input video, producing aligned motion and high-fidelity frames. Furthermore, we extend our approach to 3D-consistent video generation by attaching noise as textures on 3D meshes, ensuring 3D consistency in sim-to-real applications. Experimental results demonstrate that our method surpasses state-of-the-art baselines in motion alignment, 3D consistency, and video quality while requiring only a few sampling steps in practice.
zh

[CV-92] On the representation of stack operators by mathematical morphology

【速读】：该论文试图解决灰度图像处理中的算子设计问题，旨在通过引入灰度图像堆栈算子（stack operators）来简化复杂灰度图像变换的设计。论文的关键在于证明灰度图像堆栈算子能够继承特征集算子的格结构属性，并且可以通过将二值图像算子扩展到灰度图像来实现特定的图像处理任务。这一解决方案的核心在于利用平移不变且局部定义的堆栈算子，通过推导其特征函数、核以及基表示来实现上述性质。这表明，对于某些灰度图像处理问题，只需先设计适用于二值图像的算子，再通过堆栈算子扩展即可，从而显著简化设计流程。

链接: https://arxiv.org/abs/2504.09766
作者: Diego Marcondes
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:This paper introduces the class of grey-scale image stack operators as those that (a) map binary-images into binary-images and (b) commute in average with cross-sectioning. We show that stack operators are 1-Lipchitz extensions of set operators which can be represented by applying a characteristic set operator to the cross-sections of the image and summing. In particular, they are a generalisation of stack filters, for which the characteristic set operators are increasing. Our main result is that stack operators inherit lattice properties of the characteristic set operators. We focus on the case of translation-invariant and locally defined stack operators and show the main result by deducing the characteristic function, kernel, and basis representation of stack operators. The results of this paper have implications on the design of image operators, since imply that to solve some grey-scale image processing problems it is enough to design an operator for performing the desired transformation on binary images, and then considering its extension given by a stack operator. We leave many topics for future research regarding the machine learning of stack operators and the characterisation of the image processing problems that can be solved by them.
zh

[CV-93] Socratic Chart: Cooperating Multiple Agents for Robust SVG Chart Understanding

【速读】：该论文试图解决多模态大型语言模型（MLLMs）在图表推理任务中无法真正展现视觉理解能力的问题，具体表现为现有基准如ChartQA显示模型更多依赖于基于文本的捷径和概率模式匹配，而非真正的视觉推理。为严格评估视觉推理能力，作者通过移除图表中的文本标签并引入图表扰动来构建更具挑战性的测试场景。结果显示，像GPT-4o和Gemini-2.0 Pro这样的模型性能下降可达30%，揭示了其局限性。

为应对这些挑战，论文提出了一种名为Socratic Chart的新框架，该框架将图表图像转换为可扩展矢量图形（SVG）表示形式，使MLLMs能够整合文本和视觉模态以增强图表理解。关键在于Socratic Chart采用了一个多智能体管道，包括专门的智能体生成器用于提取原始图表属性（如条形高度、线条坐标），以及智能体批评者用于验证结果，从而确保高保真符号化表示。这一方案显著提升了捕捉图表元素的准确性及推理性能，为推动MLLMs的视觉理解能力奠定了坚实基础。

链接: https://arxiv.org/abs/2504.09764
作者: Yuyang Ji,Haohan Wang
机构: University of Illinois at Urbana-Champaign (伊利诺伊大学香槟分校)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Multimodal Large Language Models (MLLMs) have shown remarkable versatility but face challenges in demonstrating true visual understanding, particularly in chart reasoning tasks. Existing benchmarks like ChartQA reveal significant reliance on text-based shortcuts and probabilistic pattern-matching rather than genuine visual reasoning. To rigorously evaluate visual reasoning, we introduce a more challenging test scenario by removing textual labels and introducing chart perturbations in the ChartQA dataset. Under these conditions, models like GPT-4o and Gemini-2.0 Pro experience up to a 30% performance drop, underscoring their limitations. To address these challenges, we propose Socratic Chart, a new framework that transforms chart images into Scalable Vector Graphics (SVG) representations, enabling MLLMs to integrate textual and visual modalities for enhanced chart understanding. Socratic Chart employs a multi-agent pipeline with specialized agent-generators to extract primitive chart attributes (e.g., bar heights, line coordinates) and an agent-critic to validate results, ensuring high-fidelity symbolic representations. Our framework surpasses state-of-the-art models in accurately capturing chart primitives and improving reasoning performance, establishing a robust pathway for advancing MLLM visual understanding.
zh

[CV-94] Automatic Detection of Intro and Credits in Video using CLIP and Multihead Attention

【速读】：该论文旨在解决视频中“片头/片尾字幕与主体内容过渡”的检测问题，这是视频内容分割、索引和推荐系统中的关键任务。传统方法依赖人工标注，耗时且易出错，而基于启发式的算法难以适应多样化的视频风格。论文提出了一种基于深度学习的解决方案，将问题形式化为序列到序列的分类任务，通过对视频每一秒进行“片头”或“主体内容”的标签预测来实现自动检测。其关键技术在于利用CLIP模型提取帧特征，并结合多头注意力机制与学习到的位置编码处理这些特征表示，从而实现高效且准确的分类。这一方法在测试集上取得了F1-score为91.0%、Precision为89.0%、Recall为97.0%的性能表现，同时优化了实时推理能力，在CPU和高端GPU上的运行速度分别达到11.5 FPS和107 FPS。

链接: https://arxiv.org/abs/2504.09738
作者: Vasilii Korolkov,Andrey Yanchenko
机构: Binat, Inc. (Binat公司); Independent Researcher, USA (美国独立研究员)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multimedia (cs.MM)
备注: 22 pages, 11 figures, submitted as a preprint. ArXiv preprint only, not submitted to a journal yet

点击查看摘要

Abstract:Detecting transitions between intro/credits and main content in videos is a crucial task for content segmentation, indexing, and recommendation systems. Manual annotation of such transitions is labor-intensive and error-prone, while heuristic-based methods often fail to generalize across diverse video styles. In this work, we introduce a deep learning-based approach that formulates the problem as a sequence-to-sequence classification task, where each second of a video is labeled as either “intro” or “film.” Our method extracts frames at a fixed rate of 1 FPS, encodes them using CLIP (Contrastive Language-Image Pretraining), and processes the resulting feature representations with a multihead attention model incorporating learned positional encoding. The system achieves an F1-score of 91.0%, Precision of 89.0%, and Recall of 97.0% on the test set, and is optimized for real-time inference, achieving 11.5 FPS on CPU and 107 FPS on high-end GPUs. This approach has practical applications in automated content indexing, highlight detection, and video summarization. Future work will explore multimodal learning, incorporating audio features and subtitles to further enhance detection accuracy.
zh

[CV-95] A Survey on Efficient Vision-Language Models

【速读】：该论文旨在解决视觉-语言模型（Vision-Language Models, VLMs）在边缘设备和资源受限设备上的高效部署问题，特别是其高计算需求对实时应用带来的挑战。论文的关键在于综述优化VLM的技术方法，包括紧凑型VLM架构与框架的设计，并深入分析高效VLM在性能与内存占用之间的权衡。通过建立GitHub仓库整合相关文献，作者希望推动该领域的进一步研究。

链接: https://arxiv.org/abs/2504.09724
作者: Gaurav Shinde,Anuradha Ravi,Emon Dey,Shadman Sakib,Milind Rampure,Nirmalya Roy
机构: University of Maryland Baltimore County (UMBC)(马里兰大学巴尔的摩县分校)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 35 pages, 16 figures

点击查看摘要

Abstract:Vision-language models (VLMs) integrate visual and textual information, enabling a wide range of applications such as image captioning and visual question answering, making them crucial for modern AI systems. However, their high computational demands pose challenges for real-time applications. This has led to a growing focus on developing efficient vision language models. In this survey, we review key techniques for optimizing VLMs on edge and resource-constrained devices. We also explore compact VLM architectures, frameworks and provide detailed insights into the performance-memory trade-offs of efficient VLMs. Furthermore, we establish a GitHub repository at this https URL to compile all surveyed papers, which we will actively update. Our objective is to foster deeper research in this area.
zh

[CV-96] he Structural Safety Generalization Problem

【速读】：该论文试图解决大语言模型（LLM）的安全性挑战，特别是安全机制未能在语义等价输入中有效泛化的问题。论文聚焦于通过设计具备可解释性、跨模型迁移性和跨目标迁移性的攻击来研究这一关键失败机制。解决方案的关键在于提出了一种名为“结构重写护栏”（Structure Rewriting Guardrail）的防御方法，该方法将输入转换为更利于安全性评估的结构，从而显著提高对有害输入的拒绝率，同时避免过度拒绝良性输入。通过这一中间挑战的解决，论文为长期AI安全研究指出了一个重要的里程碑。

链接: https://arxiv.org/abs/2504.09712
作者: Julius Broomfield,Tom Gibbs,Ethan Kosak-Hine,George Ingebretsen,Tia Nasir,Jason Zhang,Reihaneh Iranmanesh,Sara Pieri,Reihaneh Rabbany,Kellin Pelrine
机构: Cranberry-Lemon University (cranberry-lemon.edu); University of the Witwatersrand (wits.ac.za); Georgia Tech; Mila; UC Berkeley; Stanford; MBZUAI; McGill
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:LLM jailbreaks are a widespread safety challenge. Given this problem has not yet been tractable, we suggest targeting a key failure mechanism: the failure of safety to generalize across semantically equivalent inputs. We further focus the target by requiring desirable tractability properties of attacks to study: explainability, transferability between models, and transferability between goals. We perform red-teaming within this framework by uncovering new vulnerabilities to multi-turn, multi-image, and translation-based attacks. These attacks are semantically equivalent by our design to their single-turn, single-image, or untranslated counterparts, enabling systematic comparisons; we show that the different structures yield different safety outcomes. We then demonstrate the potential for this framework to enable new defenses by proposing a Structure Rewriting Guardrail, which converts an input to a structure more conducive to safety assessment. This guardrail significantly improves refusal of harmful inputs, without over-refusing benign ones. Thus, by framing this intermediate challenge - more tractable than universal defenses but essential for long-term safety - we highlight a critical milestone for AI safety research.
zh

[CV-97] oolTipNet: A Segmentation-Driven Deep Learning Baseline for Surgical Instrument Tip Detection

【速读】：该论文旨在解决机器人辅助腹腔镜根治性前列腺切除术（RALP）中工具尖端位置注册精度不足的问题，当前方法依赖于da Vinci API获取的不准确位置数据，并需进行手眼标定。此外，在手术技能评估与手术自动化等任务中，工具尖端检测亦为核心需求，但因工具尺寸小且器械具有可动关节而极具挑战性。论文提出的关键解决方案是基于分割基础模型（Segment Anything）实现的器械部分级分割掩码输入的深度学习工具尖端检测方法，通过对比传统手工图像处理方法的实验验证了所提方法在模拟及真实数据集上的优越性。

链接: https://arxiv.org/abs/2504.09700
作者: Zijian Wu,Shuojue Yang,Yueming Jin,Septimiu E Salcudean
机构: Robotics and Control Laboratory (RCL), Department of Electrical and Computer Engineering, the University of British Columbia (英属哥伦比亚大学), Vancouver, Canada; Department of Biomedical Engineering, National University of Singapore (新加坡国立大学), Singapore
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:In robot-assisted laparoscopic radical prostatectomy (RALP), the location of the instrument tip is important to register the ultrasound frame with the laparoscopic camera frame. A long-standing limitation is that the instrument tip position obtained from the da Vinci API is inaccurate and requires hand-eye calibration. Thus, directly computing the position of the tool tip in the camera frame using the vision-based method becomes an attractive solution. Besides, surgical instrument tip detection is the key component of other tasks, like surgical skill assessment and surgery automation. However, this task is challenging due to the small size of the tool tip and the articulation of the surgical instrument. Surgical instrument segmentation becomes relatively easy due to the emergence of the Segmentation Foundation Model, i.e., Segment Anything. Based on this advancement, we explore the deep learning-based surgical instrument tip detection approach that takes the part-level instrument segmentation mask as input. Comparison experiments with a hand-crafted image-processing approach demonstrate the superiority of the proposed method on simulated and real datasets.
zh

[CV-98] Computer-Aided Layout Generation for Building Design: A Review

【速读】：该论文旨在系统性地回顾建筑布局设计与生成领域的三个主要研究主题：平面图布局生成、场景布局合成以及其它形式建筑布局的生成。论文的关键在于综合分析不同研究范式，包括基于建筑学或机器学习领域的分类方法，以及根据用户输入条件或约束进行的分类。此外，论文介绍了常用的基准数据集及其对应的评估指标，以验证相关方法的有效性。最终，论文指出了现有方法已解决的问题及局限性，并提出了未来研究的潜在方向。论文的核心解决方案在于利用深度生成模型显著提升生成布局的真实感与多样性，从而大幅减少设计师的工作负担并提高效率。相关资源维护项目可在awesome-building-layout-generation获取。

链接: https://arxiv.org/abs/2504.09694
作者: Jiachen Liu,Yuan Xue,Haomiao Ni,Rui Yu,Zihan Zhou,Sharon X. Huang
机构: The Pennsylvania State University (宾夕法尼亚州立大学); The Ohio State University (俄亥俄州立大学); University of Memphis (孟菲斯大学); University of Louisville (路易斯维尔大学); Manycore Tech Inc (Manycore Tech Inc)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: CVMJ 2025

点击查看摘要

Abstract:Generating realistic building layouts for automatic building design has been studied in both the computer vision and architecture domains. Traditional approaches from the architecture domain, which are based on optimization techniques or heuristic design guidelines, can synthesize desirable layouts, but usually require post-processing and involve human interaction in the design pipeline, making them costly and timeconsuming. The advent of deep generative models has significantly improved the fidelity and diversity of the generated architecture layouts, reducing the workload by designers and making the process much more efficient. In this paper, we conduct a comprehensive review of three major research topics of architecture layout design and generation: floorplan layout generation, scene layout synthesis, and generation of some other formats of building layouts. For each topic, we present an overview of the leading paradigms, categorized either by research domains (architecture or machine learning) or by user input conditions or constraints. We then introduce the commonly-adopted benchmark datasets that are used to verify the effectiveness of the methods, as well as the corresponding evaluation metrics. Finally, we identify the well-solved problems and limitations of existing approaches, then propose new perspectives as promising directions for future research in this important research area. A project associated with this survey to maintain the resources is available at awesome-building-layout-generation.
zh

[CV-99] LightHeadEd: Relightable Editable Head Avatars from a Smartphone

【速读】：该论文旨在解决利用传统昂贵的Lightstage设备（多台校准相机）创建高质量可重新照明的3D人脸 avatar 的高成本问题，以促进其广泛应用。论文的关键在于提出了一种新颖且经济高效的方法，仅使用配备偏振滤镜的智能手机即可实现这一目标。核心解决方案的关键在于通过在暗室中使用单点光源同时捕获交叉偏振和平行偏振视频流，分离皮肤在动态面部表演中的漫反射和镜面反射成分。此外，引入了一种嵌入UV空间的混合表示法，结合参数化头模型中的2D高斯分布，以支持高效的实时渲染并保留高保真的几何细节。这使得通过基于学习的神经分析-综合管道能够将姿态和表情相关的几何偏移与外观解耦，并分解表面为albedo、法线和镜面UV纹理图以及环境贴图。

链接: https://arxiv.org/abs/2504.09671
作者: Pranav Manu,Astitva Srivastava,Amit Raj,Varun Jampani,Avinash Sharma,P.J. Narayanan
机构: IIIT Hyderabad; Google Research (谷歌研究); Stability AI; IIT Jodhpur (印度技术学院乔杜尔); IIIT Hyderabad; IIIT Hyderabad
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Creating photorealistic, animatable, and relightable 3D head avatars traditionally requires expensive Lightstage with multiple calibrated cameras, making it inaccessible for widespread adoption. To bridge this gap, we present a novel, cost-effective approach for creating high-quality relightable head avatars using only a smartphone equipped with polaroid filters. Our approach involves simultaneously capturing cross-polarized and parallel-polarized video streams in a dark room with a single point-light source, separating the skin’s diffuse and specular components during dynamic facial performances. We introduce a hybrid representation that embeds 2D Gaussians in the UV space of a parametric head model, facilitating efficient real-time rendering while preserving high-fidelity geometric details. Our learning-based neural analysis-by-synthesis pipeline decouples pose and expression-dependent geometrical offsets from appearance, decomposing the surface into albedo, normal, and specular UV texture maps, along with the environment maps. We collect a unique dataset of various subjects performing diverse facial expressions and head movements.
zh

[CV-100] Uncertainty Guided Refinement for Fine-Grained Salient Object Detection

【速读】：该论文针对现有显著物体检测（Saliency Object Detection, SOD）方法在预测显著区域时常包含未饱和区域和阴影的问题，试图提升模型的可靠细粒度预测能力。为解决此问题，论文引入不确定性引导学习（Uncertainty Guidance Learning）方法，并设计了一种新颖的不确定性引导细化注意网络（Uncertainty Guided Refinement Attention Network, UGRAN）。UGRAN的关键在于其包含三个重要模块：多级交互注意模块（Multilevel Interaction Attention, MIA），用于促进多级特征之间的交互与感知；尺度空间一致注意模块（Scale Spatial-Consistent Attention, SSCA），用于全面整合多尺度特征中的显著信息；以及不确定性细化注意模块（Uncertainty Refinement Attention, URA），利用由显著性预测图生成的不确定性图增强模型对不确定区域的感知能力。此外，论文还提出自适应动态划分机制（Adaptive Dynamic Partition, ADP），以降低URA模块的计算开销并提高不确定性引导的有效性。实验结果表明，UGRAN在七个基准数据集上的表现优于当前最先进的方法。

链接: https://arxiv.org/abs/2504.09666
作者: Yao Yuan,Pan Gao,Qun Dai,Jie Qin,Wei Xiang
机构: Collge of Computer Science and Technology, Nanjing University of Aeronautics and Astronautics (南京航空航天大学计算机科学与技术学院); Key Laboratory of Brain-Machine Intelligence Technology, Ministry of Education (脑机智能技术教育部重点实验室); School of Computing, Engineering and Mathematical Sciences, La Trobe University (拉筹伯大学计算、工程与数学科学学院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: IEEE Transactions on Image Processing 2025

点击查看摘要

Abstract:Recently, salient object detection (SOD) methods have achieved impressive performance. However, salient regions predicted by existing methods usually contain unsaturated regions and shadows, which limits the model for reliable fine-grained predictions. To address this, we introduce the uncertainty guidance learning approach to SOD, intended to enhance the model’s perception of uncertain regions. Specifically, we design a novel Uncertainty Guided Refinement Attention Network (UGRAN), which incorporates three important components, i.e., the Multilevel Interaction Attention (MIA) module, the Scale Spatial-Consistent Attention (SSCA) module, and the Uncertainty Refinement Attention (URA) module. Unlike conventional methods dedicated to enhancing features, the proposed MIA facilitates the interaction and perception of multilevel features, leveraging the complementary characteristics among multilevel features. Then, through the proposed SSCA, the salient information across diverse scales within the aggregated features can be integrated more comprehensively and integrally. In the subsequent steps, we utilize the uncertainty map generated from the saliency prediction map to enhance the model’s perception capability of uncertain regions, generating a highly-saturated fine-grained saliency prediction map. Additionally, we devise an adaptive dynamic partition (ADP) mechanism to minimize the computational overhead of the URA module and improve the utilization of uncertainty guidance. Experiments on seven benchmark datasets demonstrate the superiority of the proposed UGRAN over the state-of-the-art methodologies. Codes will be released at this https URL.
zh

[CV-101] KeyVID: Keyframe-Aware Video Diffusion for Audio-Synchronized Visual Animation

【速读】：该论文试图解决在低帧率条件下，由音频生成视频时难以捕捉戏剧性动作中的关键瞬间以及增加帧数导致内存消耗显著上升的问题。为了解决这一问题，论文提出了一种名为KeyVID的关键帧感知音频到视觉动画框架。KeyVID的关键之处在于引入了关键帧定位机制，通过从音频信号中定位关键帧时间步，并利用关键帧生成器生成对应的视觉关键帧，最后借助运动插值器生成中间帧。这种方法不仅显著提升了关键时刻的生成质量，同时保持了计算效率。

链接: https://arxiv.org/abs/2504.09656
作者: Xingrui Wang,Jiang Liu,Ze Wang,Xiaodong Yu,Jialian Wu,Ximeng Sun,Yusheng Su,Alan Yuille,Zicheng Liu,Emad Barsoum
机构: Advanced Micro Devices; Johns Hopkins University
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Generating video from various conditions, such as text, image, and audio, enables both spatial and temporal control, leading to high-quality generation results. Videos with dramatic motions often require a higher frame rate to ensure smooth motion. Currently, most audio-to-visual animation models use uniformly sampled frames from video clips. However, these uniformly sampled frames fail to capture significant key moments in dramatic motions at low frame rates and require significantly more memory when increasing the number of frames directly. In this paper, we propose KeyVID, a keyframe-aware audio-to-visual animation framework that significantly improves the generation quality for key moments in audio signals while maintaining computation efficiency. Given an image and an audio input, we first localize keyframe time steps from the audio. Then, we use a keyframe generator to generate the corresponding visual keyframes. Finally, we generate all intermediate frames using the motion interpolator. Through extensive experiments, we demonstrate that KeyVID significantly improves audio-video synchronization and video quality across multiple datasets, particularly for highly dynamic motions. The code is released in this https URL.
zh

[CV-102] RANSAC Revisited: An Improved Algorithm for Robust Subspace Recovery under Adversarial and Noisy Corruptions

【速读】：本文研究了在强对抗性污染和高斯噪声同时存在的情况下鲁棒子空间恢复（Robust Subspace Recovery, RSR）的问题。论文旨在利用有限的含噪样本恢复一个低维子空间，该子空间能够近似包含大部分未被污染的样本，且误差与高斯噪声规模成比例。现有方法通常面临计算成本高昂或依赖于限制性的分布假设的问题，限制了其在真正对抗性环境中的适用性。为了解决这些问题，本文重新审视了经典的随机采样一致性（Random Sample Consensus, RANSAC）算法，尽管该算法对对抗性离群值具有较强的鲁棒性，但会牺牲对高斯噪声和模型误设的鲁棒性以及效率。为此，本文提出了一种两阶段算法RANSAC+，精准定位并修复了标准RANSAC的失效模式。该方法在理论上有证明的鲁棒性，能够处理高斯噪声和对抗性污染，达到接近最优的样本复杂度，并且无需预先知道子空间维度，同时比现有的RANSAC类方法更高效。

链接: https://arxiv.org/abs/2504.09648
作者: Guixian Chen,Jianhao Ma,Salar Fattahi
机构: University of Michigan (密歇根大学)
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Computation (stat.CO)
备注:

点击查看摘要

Abstract:In this paper, we study the problem of robust subspace recovery (RSR) in the presence of both strong adversarial corruptions and Gaussian noise. Specifically, given a limited number of noisy samples – some of which are tampered by an adaptive and strong adversary – we aim to recover a low-dimensional subspace that approximately contains a significant fraction of the uncorrupted samples, up to an error that scales with the Gaussian noise. Existing approaches to this problem often suffer from high computational costs or rely on restrictive distributional assumptions, limiting their applicability in truly adversarial settings. To address these challenges, we revisit the classical random sample consensus (RANSAC) algorithm, which offers strong robustness to adversarial outliers, but sacrifices efficiency and robustness against Gaussian noise and model misspecification in the process. We propose a two-stage algorithm, RANSAC+, that precisely pinpoints and remedies the failure modes of standard RANSAC. Our method is provably robust to both Gaussian and adversarial corruptions, achieves near-optimal sample complexity without requiring prior knowledge of the subspace dimension, and is more efficient than existing RANSAC-type methods.
zh

[CV-103] SegEarth-R1: Geospatial Pixel Reasoning via Large Language Model

【速读】：该论文旨在解决传统遥感工作流程难以处理复杂隐式查询的问题，这些查询需要在空间上下文、领域知识及隐式用户意图上进行推理。为应对这一挑战，论文引入了一项新任务——地理空间像素推理（geospatial pixel reasoning），允许隐式查询与推理，并生成目标区域的掩码。为此，作者构建并发布了首个大规模基准数据集EarthReason，包含5,434个手动标注图像掩码及超过30,000组隐式问答对。同时，提出SegEarth-R1，这是一种简单而有效的语言引导分割基线模型，集成了分层视觉编码器、用于指令解析的大语言模型(LLM)以及专为空间相关性设计的掩码生成器。SegEarth-R1的关键设计包括针对遥感领域的特定优化，如对超高清遥感图像的激进视觉令牌压缩、描述投影模块以融合语言与多尺度特征，以及直接查询描述嵌入的精简掩码预测管道。实验结果表明，SegEarth-R1在推理与指代分割任务上达到了当前最优性能，显著优于传统方法及基于LLM的分割方法。

链接: https://arxiv.org/abs/2504.09644
作者: Kaiyu Li,Zepeng Xin,Li Pang,Chao Pang,Yupeng Deng,Jing Yao,Guisong Xia,Deyu Meng,Zhi Wang,Xiangyong Cao
机构: Xi’an Jiaotong University (西安交通大学); Wuhan University (武汉大学); Chinese Academy of Sciences (中国科学院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Remote sensing has become critical for understanding environmental dynamics, urban planning, and disaster management. However, traditional remote sensing workflows often rely on explicit segmentation or detection methods, which struggle to handle complex, implicit queries that require reasoning over spatial context, domain knowledge, and implicit user intent. Motivated by this, we introduce a new task, \ie, geospatial pixel reasoning, which allows implicit querying and reasoning and generates the mask of the target region. To advance this task, we construct and release the first large-scale benchmark dataset called EarthReason, which comprises 5,434 manually annotated image masks with over 30,000 implicit question-answer pairs. Moreover, we propose SegEarth-R1, a simple yet effective language-guided segmentation baseline that integrates a hierarchical visual encoder, a large language model (LLM) for instruction parsing, and a tailored mask generator for spatial correlation. The design of SegEarth-R1 incorporates domain-specific adaptations, including aggressive visual token compression to handle ultra-high-resolution remote sensing images, a description projection module to fuse language and multi-scale features, and a streamlined mask prediction pipeline that directly queries description embeddings. Extensive experiments demonstrate that SegEarth-R1 achieves state-of-the-art performance on both reasoning and referring segmentation tasks, significantly outperforming traditional and LLM-based segmentation methods. Our data and code will be released at this https URL.
zh

[CV-104] nyLLaVA-Video-R1: Towards Smaller LMMs for Video Reasoning

【速读】：该论文试图解决如何通过强化学习提升小规模多模态模型的推理能力，并探索其在通用问答数据集上的可解释性推理过程。论文的关键在于提出了一种基于TinyLLaVA-Video的小规模视频推理模型TinyLLaVA-Video-R1，该模型参数量不超过40亿（4B），通过在通用视频问答（Video-QA）数据集上应用强化学习，不仅显著提升了推理与思维能力，还表现出类似人类“顿悟”（“aha moments”）的特性。此外，论文分享了一系列实验结果，为未来小规模模型在视频推理领域的研究提供了实践指导。

链接: https://arxiv.org/abs/2504.09641
作者: Xingjian Zhang,Siwei Wen,Wenjun Wu,Lei Huang
机构: SKLCCSE, Institute of Artificial Intelligence, Beihang University (北航), Beijing, China; Beijing Advanced Innovation Center for Future Blockchain and Privacy Computing, Beihang University (北航); Hangzhou International Innovation Institute, Beihang University (北航), Hangzhou, China
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Recently, improving the reasoning ability of large multimodal models (LMMs) through reinforcement learning has made great progress. However, most existing works are based on highly reasoning-intensive datasets such as mathematics and code, and researchers generally choose large-scale models as the foundation. We argue that exploring small-scale models’ reasoning capabilities remains valuable for researchers with limited computational resources. Moreover, enabling models to explain their reasoning processes on general question-answering datasets is equally meaningful. Therefore, we present the small-scale video reasoning model TinyLLaVA-Video-R1. Based on TinyLLaVA-Video, a traceably trained video understanding model with no more than 4B parameters, it not only demonstrates significantly improved reasoning and thinking capabilities after using reinforcement learning on general Video-QA datasets, but also exhibits the emergent characteristic of “aha moments”. Furthermore, we share a series of experimental findings, aiming to provide practical insights for future exploration of video reasoning (thinking) abilities in small-scale models. It is available at this https URL.
zh

[CV-105] Ges3ViG: Incorporating Pointing Gestures into Language-Based 3D Visual Grounding for Embodied Reference Understanding CVPR

【速读】：该论文旨在解决三维场景中结合语言描述与人体指向手势的 embodied reference understanding (3D-ERU) 的研究不足问题。尽管已有研究探索了基于纯语言的 3D grounding，但针对同时包含语言和指向手势的 3D-ERU 探索较为有限。为填补这一空白，论文提出了一种名为 Imputer 的数据增强框架，并利用其构建了一个新的基准数据集 ImputeRefer，通过将人体指向手势融入仅包含语言指令的现有 3D 场景数据集中。关键解决方案在于引入 Ges3ViG 模型，该模型在 3D-ERU 任务上的准确率相较于其他 3D-ERU 方法提升了约 30%，同时比纯语言基础的 3D grounding 模型提高了约 9%。

链接: https://arxiv.org/abs/2504.09623
作者: Atharv Mahesh Mane,Dulanga Weerakoon,Vigneshwaran Subbaraju,Sougata Sen,Sanjay E. Sarma,Archan Misra
机构: Stony Brook University (石溪大学), USA; BITS Pilani Goa campus (比卡内冈理工学院果阿校区), India; Singapore-MIT Alliance for Research and Technology Centre (SMART 中心), Singapore; IHPC, Agency for Science, Technology and Research (A*STAR), Singapore (新加坡科技研究局信息通信研究院); BITS Pilani Goa campus (比卡内冈理工学院果阿校区), India; APPCAIR, India; Massachusetts Institute of Technology (麻省理工学院), USA; Singapore Management University (新加坡管理大学), Singapore
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Multimedia (cs.MM)
备注: Accepted to the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 2025

点击查看摘要

Abstract:3-Dimensional Embodied Reference Understanding (3D-ERU) combines a language description and an accompanying pointing gesture to identify the most relevant target object in a 3D scene. Although prior work has explored pure language-based 3D grounding, there has been limited exploration of 3D-ERU, which also incorporates human pointing gestures. To address this gap, we introduce a data augmentation framework-Imputer, and use it to curate a new benchmark dataset-ImputeRefer for 3D-ERU, by incorporating human pointing gestures into existing 3D scene datasets that only contain language instructions. We also propose Ges3ViG, a novel model for 3D-ERU that achieves ~30% improvement in accuracy as compared to other 3D-ERU models and ~9% compared to other purely language-based 3D grounding models. Our code and dataset are available at this https URL.
zh

[CV-106] okenize Image Patches: Global Context Fusion for Effective Haze Removal in Large Images CVPR2025

【速读】：该论文旨在解决大尺寸高分辨率图像去雾任务中的两个主要挑战：一是全局上下文信息与局部细节特征难以有效平衡的问题；二是现有深度学习模型因GPU内存限制，在处理大尺寸图像时通常需要采用图像切片或下采样方法，而这会导致全局信息丢失或高频细节被丢弃。论文的关键解决方案是提出了一种名为DehazeXL的新方法，它能够有效地在主流GPU硬件上实现大尺寸图像的端到端建模，同时保持全局上下文信息和局部特征的平衡。此外，为了评估全局上下文在去雾性能中的利用效率，论文设计了一种特定于去雾任务的视觉归因方法，并构建了一个超高清去雾数据集（8KDehaze），包含10,000对8192 × 8192像素的清晰与有雾遥感图像，以支持模型训练和测试。实验结果表明，DehazeXL能够在仅使用21GB显存的情况下推断高达10240 × 10240像素的图像，达到了当前所有评估方法中的最佳性能。

链接: https://arxiv.org/abs/2504.09621
作者: Jiuchen Chen,Xinyu Yan,Qizhi Xu,Kaiqi Li
机构: Beijing Institute of Technology (北京理工大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by CVPR 2025

点击查看摘要

Abstract:Global contextual information and local detail features are essential for haze removal tasks. Deep learning models perform well on small, low-resolution images, but they encounter difficulties with large, high-resolution ones due to GPU memory limitations. As a compromise, they often resort to image slicing or downsampling. The former diminishes global information, while the latter discards high-frequency details. To address these challenges, we propose DehazeXL, a haze removal method that effectively balances global context and local feature extraction, enabling end-to-end modeling of large images on mainstream GPU hardware. Additionally, to evaluate the efficiency of global context utilization in haze removal performance, we design a visual attribution method tailored to the characteristics of haze removal tasks. Finally, recognizing the lack of benchmark datasets for haze removal in large images, we have developed an ultra-high-resolution haze removal dataset (8KDehaze) to support model training and testing. It includes 10000 pairs of clear and hazy remote sensing images, each sized at 8192 \times 8192 pixels. Extensive experiments demonstrate that DehazeXL can infer images up to 10240 \times 10240 pixels with only 21 GB of memory, achieving state-of-the-art results among all evaluated methods. The source code and experimental dataset are available at this https URL.
zh

[CV-107] ERL-MPP: Evolutionary Reinforcement Learning with Multi-head Puzzle Perception for Solving Large-scale Jigsaw Puzzles of Eroded Gaps

【速读】：该论文旨在解决大型拼图（具有较大缝隙）在图像理解和组合优化方面特有的挑战，特别是那些碎片间存在间隙的大规模拼图。为应对这些挑战，论文提出了一种结合多头拼图感知的进化强化学习框架（Evolutionary Reinforcement Learning with Multi-head Puzzle Perception, ERL-MPP）。该框架的关键在于设计了一个共享编码器的多头拼图感知网络（Multi-head Puzzle Perception Network, MPPN），其中多个拼图头全面感知局部组装状态，而判别头提供全局评估；同时，通过进化强化学习（Evolutionary Reinforcement Learning, EvoRL）代理高效探索大规模交换动作空间，该代理中的演员基于感知到的拼图状态推荐一组合适的交换动作，评论员使用估计的奖励和拼图状态更新演员，而与进化策略耦合的评估员则根据历史组装经验优化动作选择。这一方法在包含大缝隙的JPLEG-5数据集和大规模拼图的MIT数据集上得到了全面验证，并显著优于现有的所有先进模型。

链接: https://arxiv.org/abs/2504.09608
作者: Xingke Song,Xiaoying Yang,Chenglin Yao,Jianfeng Ren,Ruibin Bai,Xin Chen,Xudong Jiang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 9 pages, 5 figures

点击查看摘要

Abstract:Solving jigsaw puzzles has been extensively studied. While most existing models focus on solving either small-scale puzzles or puzzles with no gap between fragments, solving large-scale puzzles with gaps presents distinctive challenges in both image understanding and combinatorial optimization. To tackle these challenges, we propose a framework of Evolutionary Reinforcement Learning with Multi-head Puzzle Perception (ERL-MPP) to derive a better set of swapping actions for solving the puzzles. Specifically, to tackle the challenges of perceiving the puzzle with gaps, a Multi-head Puzzle Perception Network (MPPN) with a shared encoder is designed, where multiple puzzlet heads comprehensively perceive the local assembly status, and a discriminator head provides a global assessment of the puzzle. To explore the large swapping action space efficiently, an Evolutionary Reinforcement Learning (EvoRL) agent is designed, where an actor recommends a set of suitable swapping actions from a large action space based on the perceived puzzle status, a critic updates the actor using the estimated rewards and the puzzle status, and an evaluator coupled with evolutionary strategies evolves the actions aligning with the historical assembly experience. The proposed ERL-MPP is comprehensively evaluated on the JPLEG-5 dataset with large gaps and the MIT dataset with large-scale puzzles. It significantly outperforms all state-of-the-art models on both datasets.
zh

[CV-108] Early-Bird Diffusion: Investigating and Leverag ing Timestep-Aware Early-Bird Tickets in Diffusion Models for Efficient Training

【速读】：该论文旨在解决扩散模型（DMs）训练过程中计算资源需求高的问题，通过多次前向和后向传播在众多时间步内进行操作。为应对这一挑战，论文提出了一种名为EB-Diff-Train的新方法，这是一种与其他加速扩散模型训练技术正交的方法。其关键在于研究和利用早期鸟彩票（Early-Bird Tickets, EB），即在训练初期显现且能够保持高质量生成的稀疏子网络。论文首先验证了传统EB彩票在扩散模型中的存在性，使模型无需完全训练密集网络即可实现竞争性的生成质量；接着引入针对扩散模型优化的timestep-aware EB彩票概念，这些彩票根据相应时间步区域的重要性调整稀疏程度，在非关键区域允许高度稀疏以节省计算资源，而在关键区域保留必要的计算能力。基于此，开发了一种高效训练技术，该技术衍生出时间步感知的EB彩票，进行并行训练并在推理阶段组合用于图像生成。实验结果表明，所提方法不仅验证了传统和时间步感知EB彩票的存在，还证明了EB-Diff-Train方法的有效性，可在空间和时间上显著减少训练时间，相比未修剪的密集模型速度提升2.9倍到5.8倍，与标准的训练-剪枝-微调流程相比甚至可快至10.3倍，同时不牺牲生成质量。

链接: https://arxiv.org/abs/2504.09606
作者: Lexington Whalen,Zhenbang Du,Haoran You,Chaojian Li,Sixu Li,Yingyan(Celine)Lin
机构: Georgia Institute of Technology (乔治亚理工学院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 10 pages, 5 figures. Accepted to the IEEE/CVF Conference on Computer Vision and Pattern Recognition 2025

点击查看摘要

Abstract:Training diffusion models (DMs) requires substantial computational resources due to multiple forward and backward passes across numerous timesteps, motivating research into efficient training techniques. In this paper, we propose EB-Diff-Train, a new efficient DM training approach that is orthogonal to other methods of accelerating DM training, by investigating and leveraging Early-Bird (EB) tickets – sparse subnetworks that manifest early in the training process and maintain high generation quality. We first investigate the existence of traditional EB tickets in DMs, enabling competitive generation quality without fully training a dense model. Then, we delve into the concept of diffusion-dedicated EB tickets, drawing on insights from varying importance of different timestep regions. These tickets adapt their sparsity levels according to the importance of corresponding timestep regions, allowing for aggressive sparsity during non-critical regions while conserving computational resources for crucial timestep regions. Building on this, we develop an efficient DM training technique that derives timestep-aware EB tickets, trains them in parallel, and combines them during inference for image generation. Extensive experiments validate the existence of both traditional and timestep-aware EB tickets, as well as the effectiveness of our proposed EB-Diff-Train method. This approach can significantly reduce training time both spatially and temporally – achieving 2.9 \times to 5.8 \times speedups over training unpruned dense models, and up to 10.3 \times faster training compared to standard train-prune-finetune pipelines – without compromising generative quality. Our code is available at this https URL. Comments: 10 pages, 5 figures. Accepted to the IEEE/CVF Conference on Computer Vision and Pattern Recognition 2025 Subjects: Computer Vision and Pattern Recognition (cs.CV) Cite as: arXiv:2504.09606 [cs.CV] (or arXiv:2504.09606v1 [cs.CV] for this version) https://doi.org/10.48550/arXiv.2504.09606 Focus to learn more arXiv-issued DOI via DataCite (pending registration) Submission history From: Lexington Whalen [view email] [v1] Sun, 13 Apr 2025 14:53:18 UTC (4,997 KB) Full-text links: Access Paper: View a PDF of the paper titled Early-Bird Diffusion: Investigating and Leveraging Timestep-Aware Early-Bird Tickets in Diffusion Models for Efficient Training, by Lexington Whalen and 5 other authorsView PDFHTML (experimental)TeX SourceOther Formats view license Current browse context: cs.CV prev | next new | recent | 2025-04 Change to browse by: cs References Citations NASA ADSGoogle Scholar Semantic Scholar a export BibTeX citation Loading… BibTeX formatted citation loading… Data provided by: Bookmark checked=“checked”> Bibliographic Tools Bibliographic and Citation Tools Bibliographic Explorer Toggle Bibliographic Explorer (What is the Explorer?) Connected Papers Toggle Connected Papers (What is Connected Papers?) Litmaps Toggle Litmaps (What is Litmaps?) scite.ai Toggle scite Smart Citations (What are Smart Citations?) Code, Data, Media Code, Data and Media Associated with this Article alphaXiv Toggle alphaXiv (What is alphaXiv?) Links to Code Toggle CatalyzeX Code Finder for Papers (What is CatalyzeX?) DagsHub Toggle DagsHub (What is DagsHub?) GotitPub Toggle Gotit.pub (What is GotitPub?) Huggingface Toggle Hugging Face (What is Huggingface?) Links to Code Toggle Papers with Code (What is Papers with Code?) ScienceCast Toggle ScienceCast (What is ScienceCast?) Demos Demos Replicate Toggle Replicate (What is Replicate?) Spaces Toggle Hugging Face Spaces (What is Spaces?) Spaces Toggle TXYZ.AI (What is TXYZ.AI?) Related Papers Recommenders and Search Tools Link to Influence Flower Influence Flower (What are Influence Flowers?) Core recommender toggle CORE Recommender (What is CORE?) Author Venue Institution Topic About arXivLabs arXivLabs: experimental projects with community collaborators arXivLabs is a framework that allows collaborators to develop and share new arXiv features directly on our website. Both individuals and organizations that work with arXivLabs have embraced and accepted our values of openness, community, excellence, and user data privacy. arXiv is committed to these values and only works with partners that adhere to them. Have an idea for a project that will add value for arXiv’s community? Learn more about arXivLabs. Which authors of this paper are endorsers? | Disable MathJax (What is MathJax?) mathjaxToggle(); About Help contact arXivClick here to contact arXiv Contact subscribe to arXiv mailingsClick here to subscribe Subscribe Copyright Privacy Policy Web Accessibility Assistance arXiv Operational Status Get status notifications via email or slack
zh

[CV-109] Mixture-of-Shape-Experts (MoSE): End-to-End Shape Dictionary Framework to Prompt SAM for Generalizable Medical Segmentation CVPR2025

【速读】：该论文致力于解决单域泛化（Single Domain Generalization, SDG）在医学图像分割中的挑战，特别是如何利用跨成像协议、扫描仪供应商和临床站点的一致语义形状先验来提升模型的泛化能力。现有基于字典学习的方法通常受限于有限的表示能力（当离线计算的形状元素较少时）或因字典规模增大而过拟合。此外，这些方法难以与大型基础模型（如Segment Anything Model, SAM）兼容。

论文的关键解决方案是提出了一种名为Mixture-of-Shape-Experts (MoSE) 的新框架，将混合专家（Mixture-of-Experts, MoE）的思想引入字典学习中，以高效捕获多样且鲁棒的形状先验。MoSE 将字典中的每个原子视为形状专家，每个专家专注于编码特定的语义形状信息。通过一个门控网络动态融合这些形状专家，形成鲁棒的形状图谱，并借助 SAM 编码引导稀疏激活以避免过拟合。进一步地，该形状图谱被作为提示提供给 SAM，通过双向集成充分利用 SAM 的强大泛化能力。所有模块（包括形状字典）均以端到端的方式进行训练。实验结果验证了该方法的有效性。

链接: https://arxiv.org/abs/2504.09601
作者: Jia Wei,Xiaoqi Zhao,Jonghye Woo,Jinsong Ouyang,Georges El Fakhri,Qingyu Chen,Xiaofeng Liu
机构: Dept. of Radiology and Biomedical Imaging, Yale University (耶鲁大学), New Haven, USA; Dept. of Biomedical Informatics and Data Science, Yale University (耶鲁大学), New Haven, USA; Dept. of Radiology, Massachusetts General Hospital and Harvard Medical School (哈佛大学医学院), Boston, USA
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Multimedia (cs.MM); Image and Video Processing (eess.IV); Medical Physics (physics.med-ph)
备注: Accepted to CVPR 2025 workshop

点击查看摘要

Abstract:Single domain generalization (SDG) has recently attracted growing attention in medical image segmentation. One promising strategy for SDG is to leverage consistent semantic shape priors across different imaging protocols, scanner vendors, and clinical sites. However, existing dictionary learning methods that encode shape priors often suffer from limited representational power with a small set of offline computed shape elements, or overfitting when the dictionary size grows. Moreover, they are not readily compatible with large foundation models such as the Segment Anything Model (SAM). In this paper, we propose a novel Mixture-of-Shape-Experts (MoSE) framework that seamlessly integrates the idea of mixture-of-experts (MoE) training into dictionary learning to efficiently capture diverse and robust shape priors. Our method conceptualizes each dictionary atom as a shape expert, which specializes in encoding distinct semantic shape information. A gating network dynamically fuses these shape experts into a robust shape map, with sparse activation guided by SAM encoding to prevent overfitting. We further provide this shape map as a prompt to SAM, utilizing the powerful generalization capability of SAM through bidirectional integration. All modules, including the shape dictionary, are trained in an end-to-end manner. Extensive experiments on multiple public datasets demonstrate its effectiveness.
zh

[CV-110] DualPrompt-MedCap: A Dual-Prompt Enhanced Approach for Medical Image Captioning

【速读】：该论文旨在解决医学影像描述生成中上下文相关性和模态识别准确性不足的问题。现有方法在生成准确且与模态相关的描述时面临挑战，尤其是在缺乏高质量标注数据的情况下。论文的关键创新在于提出了一种名为DualPrompt-MedCap的双提示增强框架，通过两个专门组件实现对大规模视觉语言模型（Large Vision-Language Models, LVLMs）的增强：一是从预训练于医学问答对上的半监督分类模型中提取的模态感知提示（modality-aware prompt），二是利用生物医学语言模型嵌入构建的问题引导提示（question-guided prompt）。此外，为了弥补描述生成基准数据的不足，论文还设计了一个综合考虑空间语义相关性和医学叙述质量的评估框架。实验结果表明，DualPrompt-MedCap不仅将模态识别准确率提升了22%，还生成了更全面且与问题对齐的描述，从而为临床诊断提供支持，并可作为医学专家的先验知识或下游任务的自动标注来源。

链接: https://arxiv.org/abs/2504.09598
作者: Yining Zhao,Ali Braytee,Mukesh Prasad
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 11 pages, 4 figures, 2 tables

点击查看摘要

Abstract:Medical image captioning via vision-language models has shown promising potential for clinical diagnosis assistance. However, generating contextually relevant descriptions with accurate modality recognition remains challenging. We present DualPrompt-MedCap, a novel dual-prompt enhancement framework that augments Large Vision-Language Models (LVLMs) through two specialized components: (1) a modality-aware prompt derived from a semi-supervised classification model pretrained on medical question-answer pairs, and (2) a question-guided prompt leveraging biomedical language model embeddings. To address the lack of captioning ground truth, we also propose an evaluation framework that jointly considers spatial-semantic relevance and medical narrative quality. Experiments on multiple medical datasets demonstrate that DualPrompt-MedCap outperforms the baseline BLIP-3 by achieving a 22% improvement in modality recognition accuracy while generating more comprehensive and question-aligned descriptions. Our method enables the generation of clinically accurate reports that can serve as medical experts’ prior knowledge and automatic annotations for downstream vision-language tasks.
zh

[CV-111] xtSplat: Text-Guided Semantic Fusion for Generalizable Gaussian Splatting

【速读】：该论文旨在解决现有通用高斯点 splatting 方法在复杂场景中重建精细细节时忽视文本驱动引导以增强语义理解的问题。尽管许多方法强调了几何一致性，但它们未能充分利用文本提示来提升语义信息的准确性。为了解决这一局限性，论文提出了一种名为 TextSplat 的首个文本驱动的通用高斯点 splatting 框架。其关键在于通过融合多样的语义线索，学习稳健的跨模态特征表示，以改善几何与语义信息的对齐，从而实现高保真的三维重建。具体而言，该框架采用三个并行模块获取互补表征：扩散先验深度估计器用于精确深度信息，语义感知分割网络用于详细语义信息，以及多视角交互网络用于精化跨视图特征。最终，在文本引导语义融合模块中，这些表征通过基于文本和注意力的特征聚合机制整合，生成富含详细语义线索的增强三维高斯参数。实验结果表明，该框架在多个基准数据集上的性能优于现有方法，验证了其有效性。

链接: https://arxiv.org/abs/2504.09588
作者: Zhicong Wu,Hongbin Xu,Gang Xu,Ping Nie,Zhixin Yan,Jinkai Zheng,Liangqiong Qu,Ming Li,Liqiang Nie
机构: Institute of Artificial Intelligence, Xiamen University (厦门大学人工智能研究院); School of Automation Science and Engineering, South China University of Technology (华南理工大学自动化科学与工程学院); Guangdong Laboratory of Artificial Intelligence and Digital Economy (SZ) (广东省人工智能与数字经济实验室（深圳）); Peking University (北京大学); School of Future Technology, South China University of Technology (华南理工大学未来技术学院); College of communication Engineering, Hangzhou Dianzi University (杭州电子科技大学通信工程学院); University of Hong Kong (香港大学); Ming Li, Guangdong Laboratory of Artificial Intelligence and Digital Economy (SZ) (广东省人工智能与数字经济实验室（深圳）); Harbin Institute of Technology (Shenzhen) (哈尔滨工业大学（深圳）);
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Recent advancements in Generalizable Gaussian Splatting have enabled robust 3D reconstruction from sparse input views by utilizing feed-forward Gaussian Splatting models, achieving superior cross-scene generalization. However, while many methods focus on geometric consistency, they often neglect the potential of text-driven guidance to enhance semantic understanding, which is crucial for accurately reconstructing fine-grained details in complex scenes. To address this limitation, we propose TextSplat–the first text-driven Generalizable Gaussian Splatting framework. By employing a text-guided fusion of diverse semantic cues, our framework learns robust cross-modal feature representations that improve the alignment of geometric and semantic information, producing high-fidelity 3D reconstructions. Specifically, our framework employs three parallel modules to obtain complementary representations: the Diffusion Prior Depth Estimator for accurate depth information, the Semantic Aware Segmentation Network for detailed semantic information, and the Multi-View Interaction Network for refined cross-view features. Then, in the Text-Guided Semantic Fusion Module, these representations are integrated via the text-guided and attention-based feature aggregation mechanism, resulting in enhanced 3D Gaussian parameters enriched with detailed semantic cues. Experimental results on various benchmark datasets demonstrate improved performance compared to existing methods across multiple evaluation metrics, validating the effectiveness of our framework. The code will be publicly available.
zh

[CV-112] Mitigating Long-tail Distribution in Oracle Bone Inscriptions: Dataset Model and Benchmark

【速读】：该论文旨在解决甲骨文（OBI）识别领域中现有数据集长尾分布问题导致模型在多数类与少数类上的性能偏差，以及缺乏大规模结构对齐图像对以支持生成模型训练的问题。论文的关键解决方案包括：首先构建了一个名为Oracle-P15K的结构对齐甲骨文数据集，包含14,542幅融合甲骨文专家领域知识的图像；其次提出了一种基于扩散模型的伪甲骨文生成器OBIDiff，能够实现真实且可控的甲骨文生成，在保持甲骨文字形结构的同时有效转移拓片风格。

链接: https://arxiv.org/abs/2504.09555
作者: Jinhao Li,Zijian Chen,Runze Dong,Tingzhu Chen,Changbo Wang,Guangtao Zhai
机构: East China Normal University (华东师范大学); Shanghai Jiao Tong University (上海交通大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:The oracle bone inscription (OBI) recognition plays a significant role in understanding the history and culture of ancient China. However, the existing OBI datasets suffer from a long-tail distribution problem, leading to biased performance of OBI recognition models across majority and minority classes. With recent advancements in generative models, OBI synthesis-based data augmentation has become a promising avenue to expand the sample size of minority classes. Unfortunately, current OBI datasets lack large-scale structure-aligned image pairs for generative model training. To address these problems, we first present the Oracle-P15K, a structure-aligned OBI dataset for OBI generation and denoising, consisting of 14,542 images infused with domain knowledge from OBI experts. Second, we propose a diffusion model-based pseudo OBI generator, called OBIDiff, to achieve realistic and controllable OBI generation. Given a clean glyph image and a target rubbing-style image, it can effectively transfer the noise style of the original rubbing to the glyph image. Extensive experiments on OBI downstream tasks and user preference studies show the effectiveness of the proposed Oracle-P15K dataset and demonstrate that OBIDiff can accurately preserve inherent glyph structures while transferring authentic rubbing styles effectively.
zh

[CV-113] SD-ReID: View-aware Stable Diffusion for Aerial-Ground Person Re-Identification

【速读】：本文旨在解决跨空地视角（Aerial-Ground）行人再识别（Person Re-Identification, ReID）中的视点变化导致的身份一致性保持难题。现有方法主要关注设计判别性ReID模型以应对相机视点剧烈变化带来的挑战，但这些方法在设计对视点鲁棒的网络时面临极大困难，并且忽视了视点特定特征在增强模型表达能力方面的作用。为了解决这些问题，论文提出了一种名为SD-ReID的新型两阶段特征学习框架。其关键是利用生成式AI（Generative AI）模型（如Stable Diffusion, SD）的强大理解能力，在不同视点间生成视点特定特征，并通过两阶段策略：第一阶段训练基于ViT的简单模型提取粗粒度表示与可控条件；第二阶段微调SD模型以学习互补表示，同时引入视点精炼解码器（View-Refine Decoder, VRD）生成缺失的跨视点特征，最终结合粗粒度表示与全视点特征实现目标行人的检索。

链接: https://arxiv.org/abs/2504.09549
作者: Xiang Hu,Pingping Zhang,Yuhao Wang,Bin Yan,Huchuan Lu
机构: School of Future Technology, School of Artificial Intelligence, Dalian University of Technology (大连理工大学未来技术学院, 人工智能学院); School of Information and Communication Engineering, Dalian University of Technology (大连理工大学信息与通信工程学院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Aerial-Ground Person Re-IDentification (AG-ReID) aims to retrieve specific persons across cameras with different viewpoints. Previous works focus on designing discriminative ReID models to maintain identity consistency despite drastic changes in camera viewpoints. The core idea behind these methods is quite natural, but designing a view-robust network is a very challenging task. Moreover, they overlook the contribution of view-specific features in enhancing the model’s capability to represent persons. To address these issues, we propose a novel two-stage feature learning framework named SD-ReID for AG-ReID, which takes advantage of the powerful understanding capacity of generative models, e.g., Stable Diffusion (SD), to generate view-specific features between different viewpoints. In the first stage, we train a simple ViT-based model to extract coarse-grained representations and controllable conditions. Then, in the second stage, we fine-tune the SD model to learn complementary representations guided by the controllable conditions. Furthermore, we propose the View-Refine Decoder (VRD) to obtain additional controllable conditions to generate missing cross-view features. Finally, we use the coarse-grained representations and all-view features generated by SD to retrieve target persons. Extensive experiments on the AG-ReID benchmarks demonstrate the effectiveness of our proposed SD-ReID. The source code will be available upon acceptance.
zh

[CV-114] Causal integration of chemical structures improves representations of microscopy images for morphological profiling

【速读】：该论文试图解决高通量显微镜筛选中形态轮廓表征学习未能充分利用多模态数据的问题，特别是忽略了化学扰动信息在图像表征学习中的作用。论文的关键解决方案在于提出了一种名为MICON（分子-图像对比学习）的表征学习框架，该框架将化学化合物建模为诱导细胞表型反事实变换的处理因素，并通过对比学习方式将化学信息融入图像表征学习过程。与仅基于图像的方法或直接对齐图像和化合物表示空间的方法相比，MICON在跨独立重复实验和不同数据生成中心识别药物可重复效应的任务中表现出显著优势，证明了在因果框架下显式建模化学扰动的重要性。

链接: https://arxiv.org/abs/2504.09544
作者: Yemin Yu,Neil Tenenholtz,Lester Mackey,Ying Wei,David Alvarez-Melis,Ava P. Amini,Alex X. Lu
机构: City University of Hong Kong (香港城市大学); Microsoft Research (微软研究); Zhejiang University (浙江大学)
类目: Machine Learning (cs.LG); Computational Engineering, Finance, and Science (cs.CE); Computer Vision and Pattern Recognition (cs.CV)
备注: 24 pages

点击查看摘要

Abstract:Recent advances in self-supervised deep learning have improved our ability to quantify cellular morphological changes in high-throughput microscopy screens, a process known as morphological profiling. However, most current methods only learn from images, despite many screens being inherently multimodal, as they involve both a chemical or genetic perturbation as well as an image-based readout. We hypothesized that incorporating chemical compound structure during self-supervised pre-training could improve learned representations of images in high-throughput microscopy screens. We introduce a representation learning framework, MICON (Molecular-Image Contrastive Learning), that models chemical compounds as treatments that induce counterfactual transformations of cell phenotypes. MICON significantly outperforms classical hand-crafted features such as CellProfiler and existing deep-learning-based representation learning methods in challenging evaluation settings where models must identify reproducible effects of drugs across independent replicates and data-generating centers. We demonstrate that incorporating chemical compound information into the learning process provides consistent improvements in our evaluation setting and that modeling compounds specifically as treatments in a causal framework outperforms approaches that directly align images and compounds in a single representation space. Our findings point to a new direction for representation learning in morphological profiling, suggesting that methods should explicitly account for the multimodal nature of microscopy screening data.
zh

[CV-115] EmbodiedOcc: Boosting Embodied 3D Occupancy Prediction with Plane Regularization and Uncertainty Sampler

【速读】：该论文旨在解决在线三维占用预测中对室内环境中几何特性的建模不足问题。传统方法虽利用3D语义高斯分布进行渐进式室内占用预测，但忽略了室内环境主要由平面结构主导这一特性。为此，论文提出EmbodiedOcc++，其关键创新在于引入Geometry-guided Refinement Module (GRM) 和 Semantic-aware Uncertainty Sampler (SUS)。GRM通过平面正则化约束高斯分布更新，使位置更新与表面法线对齐，并基于曲率和深度自适应调整正则化权重，从而实现语义高斯分布与平面表面的精准对齐；SUS则通过自适应选择合适的高斯分布以更新，有效提升了来自不同视角的几何一致性。实验表明，EmbodiedOcc++在多个指标上达到最先进性能，同时保持计算效率。

链接: https://arxiv.org/abs/2504.09540
作者: Hao Wang,Xiaobao Wei,Xiaoan Zhang,Jianing Li,Chengyu Bai,Ying Li,Ming Lu,Wenzhao Zheng,Shanghang Zhang
机构: State Key Laboratory of Multimedia Information Processing, School of Computer Science, Peking University (多媒体信息处理国家重点实验室，北京大学计算机学院); Nanjing University (南京大学); University of California, Berkeley (加州大学伯克利分校)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Online 3D occupancy prediction provides a comprehensive spatial understanding of embodied environments. While the innovative EmbodiedOcc framework utilizes 3D semantic Gaussians for progressive indoor occupancy prediction, it overlooks the geometric characteristics of indoor environments, which are primarily characterized by planar structures. This paper introduces EmbodiedOcc++, enhancing the original framework with two key innovations: a Geometry-guided Refinement Module (GRM) that constrains Gaussian updates through plane regularization, along with a Semantic-aware Uncertainty Sampler (SUS) that enables more effective updates in overlapping regions between consecutive frames. GRM regularizes the position update to align with surface normals. It determines the adaptive regularization weight using curvature-based and depth-based constraints, allowing semantic Gaussians to align accurately with planar surfaces while adapting in complex regions. To effectively improve geometric consistency from different views, SUS adaptively selects proper Gaussians to update. Comprehensive experiments on the EmbodiedOcc-ScanNet benchmark demonstrate that EmbodiedOcc++ achieves state-of-the-art performance across different settings. Our method demonstrates improved edge accuracy and retains more geometric details while ensuring computational efficiency, which is essential for online embodied perception. The code will be released at: this https URL.
zh

[CV-116] FastRSR: Efficient and Accurate Road Surface Reconstruction from Birds Eye View

【速读】：该论文致力于解决基于鸟瞰图（Bird’s Eye View, BEV）的路面重建（Road Surface Reconstruction, RSR）中存在的信息丢失与表示稀疏性问题，同时优化立体匹配在精度与推理速度之间的权衡。为应对这些挑战，论文提出的关键解决方案包括：首先引入深度感知投影（Depth-Aware Projection, DAP），通过查询深度与图像特征，并利用预计算的查找表，在特定道路区域高效聚合BEV数据以减少信息损失和稀疏性；其次设计空间注意力增强（Spatial Attention Enhancement, SAE）和置信度注意力生成（Confidence Attention Generation, CAG）模块，分别实现重要区域的自适应突出显示以及高置信度预测的关注与无关信息的过滤，从而在保证准确性的同时提升推理速度。

链接: https://arxiv.org/abs/2504.09535
作者: Yuting Zhao,Yuheng Ji,Xiaoshuai Hao,Shuxiao Li
机构: Institute of Automation, CAS (自动化研究所, 中科院); School of Artificial Intelligence, UCAS (人工智能学院, 中国科学院大学); Beijing Academy of Artificial Intelligence (北京人工智能研究院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Road Surface Reconstruction (RSR) is crucial for autonomous driving, enabling the understanding of road surface conditions. Recently, RSR from the Bird’s Eye View (BEV) has gained attention for its potential to enhance performance. However, existing methods for transforming perspective views to BEV face challenges such as information loss and representation sparsity. Moreover, stereo matching in BEV is limited by the need to balance accuracy with inference speed. To address these challenges, we propose two efficient and accurate BEV-based RSR models: FastRSR-mono and FastRSR-stereo. Specifically, we first introduce Depth-Aware Projection (DAP), an efficient view transformation strategy designed to mitigate information loss and sparsity by querying depth and image features to aggregate BEV data within specific road surface regions using a pre-computed look-up table. To optimize accuracy and speed in stereo matching, we design the Spatial Attention Enhancement (SAE) and Confidence Attention Generation (CAG) modules. SAE adaptively highlights important regions, while CAG focuses on high-confidence predictions and filters out irrelevant information. FastRSR achieves state-of-the-art performance, exceeding monocular competitors by over 6.0% in elevation absolute error and providing at least a 3.0x speedup by stereo methods on the RSRD dataset. The source code will be released.
zh

[CV-117] rajectory-guided Motion Perception for Facial Expression Quality Assessment in Neurological Disorders

【速读】：本文旨在解决神经性疾病中自动化面部表情质量评估（FEQA）的挑战，特别是有效捕捉面部肌肉运动的细微变化和复杂性。解决方案的关键在于引入了一种基于面部标志点轨迹（facial landmark trajectories）的紧凑且信息丰富的表示方法，从高阶结构的角度编码这些细微运动。为此，作者提出了轨迹引导的运动感知Transformer（TraMP-Former），这是一种新颖的FEQA框架，通过融合标志点轨迹特征与RGB帧中的视觉语义线索，实现精细的运动捕捉，并最终将综合特征回归为质量评分。实验结果表明，TraMP-Former在包含神经性疾病的基准数据集上达到了新的最先进性能，同时消融研究进一步验证了标志点轨迹在FEQA中的效率和有效性。

链接: https://arxiv.org/abs/2504.09530
作者: Shuchao Duan,Amirhossein Dadashzadeh,Alan Whone,Majid Mirmehdi
机构: School of Computer Science (计算机科学学院); Translational Health Sciences (转化健康科学); University of Bristol (布里斯托大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to IEEE FG 2025 (preprint)

点击查看摘要

Abstract:Automated facial expression quality assessment (FEQA) in neurological disorders is critical for enhancing diagnostic accuracy and improving patient care, yet effectively capturing the subtle motions and nuances of facial muscle movements remains a challenge. We propose to analyse facial landmark trajectories, a compact yet informative representation, that encodes these subtle motions from a high-level structural perspective. Hence, we introduce Trajectory-guided Motion Perception Transformer (TraMP-Former), a novel FEQA framework that fuses landmark trajectory features for fine-grained motion capture with visual semantic cues from RGB frames, ultimately regressing the combined features into a quality score. Extensive experiments demonstrate that TraMP-Former achieves new state-of-the-art performance on benchmark datasets with neurological disorders, including PFED5 (up by 6.51%) and an augmented Toronto NeuroFace (up by 7.62%). Our ablation studies further validate the efficiency and effectiveness of landmark trajectories in FEQA. Our code is available at this https URL.
zh

[CV-118] AeroLite: Tag-Guided Lightweight Generation of Aerial Image Captions

【速读】：该论文旨在解决遥感图像自动标注任务中的复杂空间语义和领域变异性问题。为应对这些挑战，论文提出了一种名为\textbfAeroLite的轻量级、标签引导的标注框架，其核心解决方案在于通过GPT-4o生成大规模伪标注数据集，并结合多标签CLIP编码器提取和学习语义标签，以增强小规模语言模型（1–3B参数）的鲁棒性和可解释性。关键创新点包括引入一种新颖的桥接多层感知机（Bridging MLP）架构，用于高效融合视觉特征与语义信息，同时采用两阶段LoRA-based训练策略，在保证较低计算成本的前提下实现跨领域的精确标注能力。

链接: https://arxiv.org/abs/2504.09528
作者: Xing Zi,Tengjun Ni,Xianjing Fan,Xian Tao,Jun Li,Ali Braytee,Mukesh Prasad
机构: University of Technology Sydney(Sydney Australia); University of New South Wales(Sydney Australia); Institute of Automation, Chinese Academy of Sciences(Beijing China)
类目: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
备注:

点击查看摘要

Abstract:Accurate and automated captioning of aerial imagery is crucial for applications like environmental monitoring, urban planning, and disaster management. However, this task remains challenging due to complex spatial semantics and domain variability. To address these issues, we introduce \textbfAeroLite, a lightweight, tag-guided captioning framework designed to equip small-scale language models (1–3B parameters) with robust and interpretable captioning capabilities specifically for remote sensing images. \textbfAeroLite leverages GPT-4o to generate a large-scale, semantically rich pseudo-caption dataset by integrating multiple remote sensing benchmarks, including DLRSD, iSAID, LoveDA, WHU, and RSSCN7. To explicitly capture key semantic elements such as orientation and land-use types, AeroLite employs natural language processing techniques to extract relevant semantic tags. These tags are then learned by a dedicated multi-label CLIP encoder, ensuring precise semantic predictions. To effectively fuse visual and semantic information, we propose a novel bridging multilayer perceptron (MLP) architecture, aligning semantic tags with visual embeddings while maintaining minimal computational overhead. AeroLite’s flexible design also enables seamless integration with various pretrained large language models. We adopt a two-stage LoRA-based training approach: the initial stage leverages our pseudo-caption dataset to capture broad remote sensing semantics, followed by fine-tuning on smaller, curated datasets like UCM and Sydney Captions to refine domain-specific alignment. Experimental evaluations demonstrate that AeroLite surpasses significantly larger models (e.g., 13B parameters) in standard captioning metrics, including BLEU and METEOR, while maintaining substantially lower computational costs.
zh

[CV-119] 3D CoCa: Contrastive Learners are 3D Captioners

【速读】：该论文致力于解决3D场景自然语言描述（3D captioning）中的两个主要挑战：点云的固有稀疏性和现有方法中存在的弱跨模态对齐问题。为应对这些挑战，论文提出了一种名为3D CoCa的新型统一框架，其关键在于通过单一架构将对比视觉-语言学习与3D描述生成相结合。具体而言，3D CoCa利用冻结的CLIP视觉-语言主干网络提供丰富的语义先验信息，采用空间感知的3D场景编码器捕获几何上下文，并使用多模态解码器生成描述性标题。与依赖显式物体提议的两阶段方法不同，3D CoCa在共享特征空间中同时优化对比学习和描述生成目标，无需外部检测器或手工设计的提议。这种联合训练范式通过对齐3D和文本表示，实现了更强的空间推理能力和更丰富的语义定位。

链接: https://arxiv.org/abs/2504.09518
作者: Ting Huang,Zeyu Zhang,Yemin Wang,Hao Tang
机构: Shanghai University of Engineering Science (上海工程技术大学); The Australian National University (澳大利亚国立大学); Xiamen University (厦门大学); Peking University (北京大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:3D captioning, which aims to describe the content of 3D scenes in natural language, remains highly challenging due to the inherent sparsity of point clouds and weak cross-modal alignment in existing methods. To address these challenges, we propose 3D CoCa, a novel unified framework that seamlessly combines contrastive vision-language learning with 3D caption generation in a single architecture. Our approach leverages a frozen CLIP vision-language backbone to provide rich semantic priors, a spatially-aware 3D scene encoder to capture geometric context, and a multi-modal decoder to generate descriptive captions. Unlike prior two-stage methods that rely on explicit object proposals, 3D CoCa jointly optimizes contrastive and captioning objectives in a shared feature space, eliminating the need for external detectors or handcrafted proposals. This joint training paradigm yields stronger spatial reasoning and richer semantic grounding by aligning 3D and textual representations. Extensive experiments on the ScanRefer and Nr3D benchmarks demonstrate that 3D CoCa significantly outperforms current state-of-the-arts by 10.2% and 5.76% in CIDEr at 0.5IoU, respectively. Code will be available at this https URL.
zh

[CV-120] FSSUAVL: A Discriminative Framework using Vision Models for Federated Self-Supervised Audio and Image Understanding

【速读】：该论文试图解决跨模态（audio 和 image）无配对数据情况下深度模型如何有效学习表征的问题。传统方法通过在本地客户端使用辅助预训练编码器或生成模型来处理该问题，但随着模态数量增加，计算成本显著上升。本文的关键解决方案是提出了一种名为 \texttt{FSSUAVL} 的单一深度模型，该模型在联邦学习（Federated Learning, FL）环境中通过自监督对比学习（Self-Supervised Contrastive Learning, SSL）进行预训练。不同于对齐两种模态，\texttt{FSSUAVL} 通过对比学习将音频和图像投影到一个共享嵌入空间中，并联合区分它们，从而扩展了其在有配对和无配对跨模态识别任务中的适用性。实验结果表明，与针对每种模态分别使用独立深度模型相比，\texttt{FSSUAVL} 显著提升了多种基于图像和音频下游任务的性能，并能够利用辅助信息进一步增强识别精度。

链接: https://arxiv.org/abs/2504.09516
作者: Yasar Abbas Ur Rehman,Kin Wai Lau,Yuyang Xie,Ma Lan,JiaJun Shen
机构: TCL AI Lab (TCL 人工智能实验室), Hong Kong (香港)
类目: ound (cs.SD); Computer Vision and Pattern Recognition (cs.CV); Audio and Speech Processing (eess.AS)
备注: 8 pages

点击查看摘要

Abstract:Recent studies have demonstrated that vision models can effectively learn multimodal audio-image representations when paired. However, the challenge of enabling deep models to learn representations from unpaired modalities remains unresolved. This issue is especially pertinent in scenarios like Federated Learning (FL), where data is often decentralized, heterogeneous, and lacks a reliable guarantee of paired data. Previous attempts tackled this issue through the use of auxiliary pretrained encoders or generative models on local clients, which invariably raise computational cost with increasing number modalities. Unlike these approaches, in this paper, we aim to address the task of unpaired audio and image recognition using \textttFSSUAVL, a single deep model pretrained in FL with self-supervised contrastive learning (SSL). Instead of aligning the audio and image modalities, \textttFSSUAVL jointly discriminates them by projecting them into a common embedding space using contrastive SSL. This extends the utility of \textttFSSUAVL to paired and unpaired audio and image recognition tasks. Our experiments with CNN and ViT demonstrate that \textttFSSUAVL significantly improves performance across various image- and audio-based downstream tasks compared to using separate deep models for each modality. Additionally, \textttFSSUAVL’s capacity to learn multimodal feature representations allows for integrating auxiliary information, if available, to enhance recognition accuracy.
zh

[CV-121] Capturing Longitudinal Changes in Brain Morphology Using Temporally Parameterized Neural Displacement Fields

【速读】：该论文旨在解决纵向脑部图像配准中的两个主要挑战：数据中的噪声/伪影以及量化连续扫描之间微小解剖变化的困难。为应对这些挑战，论文提出了一种新颖的纵向配准方法，通过时变参数化神经位移场建模结构变化。该方法的关键在于利用多层感知机实现隐式神经表示（Implicit Neural Representation, INR），作为任意时间点变形场的空间坐标连续近似。对于特定受试者的N次扫描，模型以三维空间坐标(x, y, z)和相应的时间表示t为输入，学习描述观测及未观测时间点上的结构连续形态。此外，论文通过INR的解析导数推导出一种新的正则化函数，强制体素轨迹具有单调变化速率，从而生成更符合生物学规律的结果。实验验证了该方法在4D脑部磁共振图像配准中的有效性。

链接: https://arxiv.org/abs/2504.09514
作者: Aisha L. Shuaibu,Kieran A. Gibb,Peter A. Wijeratne,Ivor J.A. Simpson
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: Accepted for publication at Medical Imaging with Deep Learning (MIDL)

点击查看摘要

Abstract:Longitudinal image registration enables studying temporal changes in brain morphology which is useful in applications where monitoring the growth or atrophy of specific structures is important. However this task is challenging due to; noise/artifacts in the data and quantifying small anatomical changes between sequential scans. We propose a novel longitudinal registration method that models structural changes using temporally parameterized neural displacement fields. Specifically, we implement an implicit neural representation (INR) using a multi-layer perceptron that serves as a continuous coordinate-based approximation of the deformation field at any time point. In effect, for any N scans of a particular subject, our model takes as input a 3D spatial coordinate location x, y, z and a corresponding temporal representation t and learns to describe the continuous morphology of structures for both observed and unobserved points in time. Furthermore, we leverage the analytic derivatives of the INR to derive a new regularization function that enforces monotonic rate of change in the trajectory of the voxels, which is shown to provide more biologically plausible patterns. We demonstrate the effectiveness of our method on 4D brain MR registration.
zh

[CV-122] DiffuMural: Restoring Dunhuang Murals with Multi-scale Diffusion

【速读】：该论文旨在解决古代壁画修复这一重要的下游任务中存在的挑战，特别是由于大面积缺陷区域和有限训练样本导致的传统基于扩散模型方法的局限性。此外，当前研究缺乏用于评估图像补全美学标准（如整体风格和接缝细节）的有效指标。为此，论文提出了一种名为DiffuMural的解决方案，其关键是结合多尺度收敛与协作扩散机制，并引入ControlNet及循环一致性损失函数，以优化生成图像与条件控制之间的匹配。通过利用来自23幅大规模敦煌壁画的一致视觉美学训练数据，DiffuMural在恢复复杂细节、保持整体连贯性以及应对缺乏事实依据的不完整壁画的独特挑战方面表现出色。评价框架包含四个关键定量指标：事实准确性、纹理细节、语义上下文和整体视觉一致性，同时结合人文价值评估以确保修复后的壁画保留其文化和艺术意义。实验结果验证了该方法在定性和定量指标上均优于现有最先进的技术。

链接: https://arxiv.org/abs/2504.09513
作者: Puyu Han,Jiaju Kang,Yuhang Pan,Erting Pan,Zeyu Zhang,Qunchao Jin,Juntao Jiang,Zhichen Liu,Luqi Gong
机构: Southern University of Science and Technology (南方科技大学); Beijing Normal University (北京师范大学); Hebei Guoyan Science and Technology Center (河北国研科技中心); Wuhan University (武汉大学); The Australian National University (澳大利亚国立大学); AI Geeks; Zhejiang University (浙江大学); Zhejiang Lab (之江实验室)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Large-scale pre-trained diffusion models have produced excellent results in the field of conditional image generation. However, restoration of ancient murals, as an important downstream task in this field, poses significant challenges to diffusion model-based restoration methods due to its large defective area and scarce training samples. Conditional restoration tasks are more concerned with whether the restored part meets the aesthetic standards of mural restoration in terms of overall style and seam detail, and such metrics for evaluating heuristic image complements are lacking in current research. We therefore propose DiffuMural, a combined Multi-scale convergence and Collaborative Diffusion mechanism with ControlNet and cyclic consistency loss to optimise the matching between the generated images and the conditional control. DiffuMural demonstrates outstanding capabilities in mural restoration, leveraging training data from 23 large-scale Dunhuang murals that exhibit consistent visual aesthetics. The model excels in restoring intricate details, achieving a coherent overall appearance, and addressing the unique challenges posed by incomplete murals lacking factual grounding. Our evaluation framework incorporates four key metrics to quantitatively assess incomplete murals: factual accuracy, textural detail, contextual semantics, and holistic visual coherence. Furthermore, we integrate humanistic value assessments to ensure the restored murals retain their cultural and artistic significance. Extensive experiments validate that our method outperforms state-of-the-art (SOTA) approaches in both qualitative and quantitative metrics.
zh

[CV-123] FVOS for MOSE Track of 4th PVUW Challenge: 3rd Place Solution

【速读】：该论文旨在解决视频对象分割（Video Object Segmentation, VOS）在复杂真实场景中的准确性问题。现有方法通常依赖时空记忆网络提取帧级特征，在常用数据集上表现良好，但在复杂场景下效果欠佳。为应对这一挑战，论文提出了一种针对特定数据集进行微调的VOS方法（Fine-tuning VOS, FVOS），并通过设计特定训练策略优化现有方法。关键解决方案包括：(1) 针对单模型预测中相邻物体间间隙过大的问题，引入形态学后处理策略；(2) 在多尺度分割结果基础上采用基于投票的融合方法生成最终输出。最终，该方法在验证阶段和测试阶段分别获得了76.81%和83.92%的JF分数，并在第4届PVUW挑战赛MOSE赛道中位列第三。

链接: https://arxiv.org/abs/2504.09507
作者: Mengjiao Wang,Junpei Zhang,Xu Liu,Yuting Yang,Mengru Ma
机构: International Joint Research Center for Intelligent Perception and Computation (智能感知与计算国际联合研究中心), Xi’an, China (中国)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 5 pages, 3 figures

点击查看摘要

Abstract:Video Object Segmentation (VOS) is one of the most fundamental and challenging tasks in computer vision and has a wide range of applications. Most existing methods rely on spatiotemporal memory networks to extract frame-level features and have achieved promising results on commonly used datasets. However, these methods often struggle in more complex real-world scenarios. This paper addresses this issue, aiming to achieve accurate segmentation of video objects in challenging scenes. We propose fine-tuning VOS (FVOS), optimizing existing methods for specific datasets through tailored training. Additionally, we introduce a morphological post-processing strategy to address the issue of excessively large gaps between adjacent objects in single-model predictions. Finally, we apply a voting-based fusion method on multi-scale segmentation results to generate the final output. Our approach achieves JF scores of 76.81% and 83.92% during the validation and testing stages, respectively, securing third place overall in the MOSE Track of the 4th PVUW challenge 2025.
zh

[CV-124] Pillar-Voxel Fusion Network for 3D Object Detection in Airborne Hyperspectral Point Clouds

【速读】：该论文旨在解决基于高光谱点云（Hyperspectral Point Clouds, HPCs）的生成过程中因融合误差和障碍物遮挡导致的几何-光谱失真问题，这些问题限制了其在多个场景下的下游细粒度任务（尤其是机载应用）中的性能。为了解决这些问题，论文提出了PiV-AHPC，这是一种针对机载HPCs的3D目标检测网络，据作者所知，这是首次针对此任务的研究尝试。

解决方案的关键在于设计了一个柱体-体素双分支编码器（pillar-voxel dual-branch encoder），其中前一分支通过捕捉高光谱点云的光谱信息和垂直结构特征来克服光谱失真；后一分支则专注于从点云中提取精确的三维空间特征。此外，还提出了一种多级特征融合机制，以增强两个分支之间的信息交互，实现邻域特征对齐和通道自适应选择，从而有机整合异构特征并缓解几何失真。这些创新方法显著提升了模型的检测性能与泛化能力。

链接: https://arxiv.org/abs/2504.09506
作者: Yanze Jiang,Yanfeng Gu,Xian Li
机构: School of Electronics and Information Engineering, Harbin Institute of Technology (哈尔滨工业大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Hyperspectral point clouds (HPCs) can simultaneously characterize 3D spatial and spectral information of ground objects, offering excellent 3D perception and target recognition capabilities. Current approaches for generating HPCs often involve fusion techniques with hyperspectral images and LiDAR point clouds, which inevitably lead to geometric-spectral distortions due to fusion errors and obstacle occlusions. These adverse effects limit their performance in downstream fine-grained tasks across multiple scenarios, particularly in airborne applications. To address these issues, we propose PiV-AHPC, a 3D object detection network for airborne HPCs. To the best of our knowledge, this is the first attempt at this HPCs task. Specifically, we first develop a pillar-voxel dual-branch encoder, where the former captures spectral and vertical structural features from HPCs to overcome spectral distortion, while the latter emphasizes extracting accurate 3D spatial features from point clouds. A multi-level feature fusion mechanism is devised to enhance information interaction between the two branches, achieving neighborhood feature alignment and channel-adaptive selection, thereby organically integrating heterogeneous features and mitigating geometric distortion. Extensive experiments on two airborne HPCs datasets demonstrate that PiV-AHPC possesses state-of-the-art detection performance and high generalization capability.
zh

[CV-125] PCM-SAR: Physics-Driven Contrastive Mutual Learning for SAR Classification

【速读】：该论文旨在解决现有基于对比学习 (Contrastive Learning) 的SAR图像分类方法过度依赖针对光学图像设计的样本生成策略的问题，这些方法未能充分捕捉SAR数据独特的语义和物理特性。论文的关键解决方案是提出了一种基于物理学驱动的对比互学习方法（Physics-Driven Contrastive Mutual Learning for SAR Classification, PCM-SAR）。PCM-SAR通过引入领域特定的物理洞察力改进样本生成和特征提取：利用灰度共生矩阵 (GLCM) 模拟真实的噪声模式，并采用语义检测实现无监督局部采样，确保生成的样本能够准确反映SAR成像特性。此外，其多级特征融合机制基于互学习，实现了特征表示的协作优化。特别是，PCM-SAR通过优化SAR特征表示显著提升了小模型的性能，弥补了其容量限制。实验结果表明，PCM-SAR在多种数据集和SAR分类任务中始终优于最先进的方法 (SOTA)。

链接: https://arxiv.org/abs/2504.09502
作者: Pengfei Wang,Hao Zheng,Zhigang Hu,Aikun Xu,Meiguang Zheng,Liu Yang(School of Computer Science and Engineering, Central South University, Changsha, China)
机构: School of Computer Science and Engineering, Central South University (中央南大学计算机科学与工程学院), Changsha, China
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Existing SAR image classification methods based on Contrastive Learning often rely on sample generation strategies designed for optical images, failing to capture the distinct semantic and physical characteristics of SAR data. To address this, we propose Physics-Driven Contrastive Mutual Learning for SAR Classification (PCM-SAR), which incorporates domain-specific physical insights to improve sample generation and feature extraction. PCM-SAR utilizes the gray-level co-occurrence matrix (GLCM) to simulate realistic noise patterns and applies semantic detection for unsupervised local sampling, ensuring generated samples accurately reflect SAR imaging properties. Additionally, a multi-level feature fusion mechanism based on mutual learning enables collaborative refinement of feature representations. Notably, PCM-SAR significantly enhances smaller models by refining SAR feature representations, compensating for their limited capacity. Experimental results show that PCM-SAR consistently outperforms SOTA methods across diverse datasets and SAR classification tasks.
zh

[CV-126] EasyREG: Easy Depth-Based Markerless Registration and Tracking using Augmented Reality Device for Surgical Guidance

【速读】：该论文旨在解决传统基于标记的增强现实（AR）手术引导方法中存在的繁琐校准流程以及在临床环境中部署困难的问题，同时克服现有无标记解决方案在医学引导中的精度不足，尤其是因遮挡和显著异常值导致的问题。论文的关键在于提出了一种仅依赖AR设备深度传感器的无标记框架，该框架包含两个模块：高精度、鲁棒的目标解剖结构定位注册模块和实时位姿估计跟踪模块。注册模块通过集成深度传感器误差校正、人机交互区域过滤技术、基于曲率感知特征采样的鲁棒全局对齐，以及局部ICP精化，实现术前模型与患者解剖结构的无标记配准；跟踪模块则利用注册模块提供的初始位姿，采用快速鲁棒的配准算法进行实时目标位姿估计。这一两模块设计使得系统能够为术中目标解剖结构移动或静止的手术提供一体化解决方案。

链接: https://arxiv.org/abs/2504.09498
作者: Yue Yang,Christoph Leuze,Brian Hargreaves,Bruce Daniel,Fred Baik
机构: Stanford University (斯坦福大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:The use of Augmented Reality (AR) devices for surgical guidance has gained increasing traction in the medical field. Traditional registration methods often rely on external fiducial markers to achieve high accuracy and real-time performance. However, these markers introduce cumbersome calibration procedures and can be challenging to deploy in clinical settings. While commercial solutions have attempted real-time markerless tracking using the native RGB cameras of AR devices, their accuracy remains questionable for medical guidance, primarily due to occlusions and significant outliers between the live sensor data and the preoperative target anatomy point cloud derived from MRI or CT scans. In this work, we present a markerless framework that relies only on the depth sensor of AR devices and consists of two modules: a registration module for high-precision, outlier-robust target anatomy localization, and a tracking module for real-time pose estimation. The registration module integrates depth sensor error correction, a human-in-the-loop region filtering technique, and a robust global alignment with curvature-aware feature sampling, followed by local ICP refinement, for markerless alignment of preoperative models with patient anatomy. The tracking module employs a fast and robust registration algorithm that uses the initial pose from the registration module to estimate the target pose in real-time. We comprehensively evaluated the performance of both modules through simulation and real-world measurements. The results indicate that our markerless system achieves superior performance for registration and comparable performance for tracking to industrial solutions. The two-module design makes our system a one-stop solution for surgical procedures where the target anatomy moves or stays static during surgery.
zh

[CV-127] DropoutGS: Dropping Out Gaussians for Better Sparse-view Rendering CVPR2025

【速读】：该论文致力于解决3D Gaussian Splatting (3DGS) 在稀疏输入条件下性能显著下降并产生不良伪影的问题。随着训练视角数量的减少，新型视图合成任务会退化为高度欠定问题，导致现有方法面临严重的过拟合挑战。论文的关键观察是低复杂度模型（即具有较少高斯基元的模型）在稀疏输入下表现出更少的过拟合倾向。基于此，论文提出了一种随机丢弃正则化 (Random Dropout Regularization, RDR) 来利用低复杂度模型的优势减轻过拟合，并进一步通过边缘引导分裂策略 (Edge-guided Splitting Strategy, ESS) 弥补这些模型高频细节的不足。通过结合这两种技术，DropoutGS 提供了一种简单而有效的插件式方法，显著提升了现有 3DGS 方法的泛化性能。实验结果表明，DropoutGS 在包括 Blender、LLFF 和 DTU 等基准数据集的稀疏视图条件下实现了最先进的性能。

链接: https://arxiv.org/abs/2504.09491
作者: Yexing Xu,Longguang Wang,Minglin Chen,Sheng Ao,Li Li,Yulan Guo
机构: The Shenzhen Campus, Sun Yat-Sen University (中山大学深圳校区); Xiamen University (厦门大学); University of Macau (澳门大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by CVPR 2025

点击查看摘要

Abstract:Although 3D Gaussian Splatting (3DGS) has demonstrated promising results in novel view synthesis, its performance degrades dramatically with sparse inputs and generates undesirable artifacts. As the number of training views decreases, the novel view synthesis task degrades to a highly under-determined problem such that existing methods suffer from the notorious overfitting issue. Interestingly, we observe that models with fewer Gaussian primitives exhibit less overfitting under sparse inputs. Inspired by this observation, we propose a Random Dropout Regularization (RDR) to exploit the advantages of low-complexity models to alleviate overfitting. In addition, to remedy the lack of high-frequency details for these models, an Edge-guided Splitting Strategy (ESS) is developed. With these two techniques, our method (termed DropoutGS) provides a simple yet effective plug-in approach to improve the generalization performance of existing 3DGS methods. Extensive experiments show that our DropoutGS produces state-of-the-art performance under sparse views on benchmark datasets including Blender, LLFF, and DTU. The project page is at: this https URL.
zh

[CV-128] Vision-Language Model for Object Detection and Segmentation: A Review and Evaluation

【速读】：该论文试图解决的问题是如何评估基于视觉-语言模型（Vision-Language Model, VLM）在传统视觉任务中的有效性，并探索其在多种下游任务中的性能优势与局限性。论文的关键在于首次以系统化的方式评估了VLM在八个检测场景（如闭集检测、领域适应、拥挤目标等）和八个分割场景（如少样本、开放世界、小目标等）中的表现，同时分析了不同微调策略（零样本预测、视觉微调、文本提示）对检测任务的影响。此外，论文通过实证研究揭示了任务特性、模型架构与训练方法之间的内在关联，为未来VLM的设计提供了深入见解。

链接: https://arxiv.org/abs/2504.09480
作者: Yongchao Feng,Yajie Liu,Shuai Yang,Wenrui Cai,Jinqing Zhang,Qiqi Zhan,Ziyue Huang,Hongxi Yan,Qiao Wan,Chenguang Liu,Junzhe Wang,Jiahui Lv,Ziqi Liu,Tengyuan Shi,Qingjie Liu,Yunhong Wang
机构: State Key Laboratory of Virtual Reality Technology and Systems, Beihang University (北航)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: A Review and Evaluation about Vision-Language Model for Object Detection and Segmentation

点击查看摘要

Abstract:Vision-Language Model (VLM) have gained widespread adoption in Open-Vocabulary (OV) object detection and segmentation tasks. Despite they have shown promise on OV-related tasks, their effectiveness in conventional vision tasks has thus far been unevaluated. In this work, we present the systematic review of VLM-based detection and segmentation, view VLM as the foundational model and conduct comprehensive evaluations across multiple downstream tasks for the first time: 1) The evaluation spans eight detection scenarios (closed-set detection, domain adaptation, crowded objects, etc.) and eight segmentation scenarios (few-shot, open-world, small object, etc.), revealing distinct performance advantages and limitations of various VLM architectures across tasks. 2) As for detection tasks, we evaluate VLMs under three finetuning granularities: \textitzero prediction, \textitvisual fine-tuning, and \textittext prompt, and further analyze how different finetuning strategies impact performance under varied task. 3) Based on empirical findings, we provide in-depth analysis of the correlations between task characteristics, model architectures, and training methodologies, offering insights for future VLM design. 4) We believe that this work shall be valuable to the pattern recognition experts working in the fields of computer vision, multimodal learning, and vision foundation models by introducing them to the problem, and familiarizing them with the current status of the progress while providing promising directions for future research. A project associated with this review and evaluation has been created at this https URL.
zh

[CV-129] CamMimic: Zero-Shot Image To Camera Motion Personalized Video Generation Using Diffusion Models

【速读】：本文旨在解决跨场景零样本相机运动迁移的问题，即如何在不依赖额外数据的情况下，将参考视频中的相机运动无缝迁移到用户指定的任意场景中。解决方案的关键在于提出了一种两阶段策略：第一阶段通过结合LoRA层和正交性损失的多概念学习方法，捕捉并理解参考视频的空间-时间特性以及目标场景的空间特征；第二阶段则引入基于单应性的细化策略，以增强生成视频的时间与空间对齐效果。此外，为了评估不同场景间相机运动迁移的效果，文中还提出了一个新的度量标准CameraScore，用于衡量参考视频与生成视频之间相机运动的相似性。实验结果表明，该方法能够生成高质量且具有增强运动效果的视频，并得到了用户的广泛认可。

链接: https://arxiv.org/abs/2504.09472
作者: Pooja Guhan,Divya Kothandaraman,Tsung-Wei Huang,Guan-Ming Su,Dinesh Manocha
机构: University of Maryland College Park (马里兰大学帕克分校); Dolby Laboratories (杜比实验室)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:We introduce CamMimic, an innovative algorithm tailored for dynamic video editing needs. It is designed to seamlessly transfer the camera motion observed in a given reference video onto any scene of the user’s choice in a zero-shot manner without requiring any additional data. Our algorithm achieves this using a two-phase strategy by leveraging a text-to-video diffusion model. In the first phase, we develop a multi-concept learning method using a combination of LoRA layers and an orthogonality loss to capture and understand the underlying spatial-temporal characteristics of the reference video as well as the spatial features of the user’s desired scene. The second phase proposes a unique homography-based refinement strategy to enhance the temporal and spatial alignment of the generated video. We demonstrate the efficacy of our method through experiments conducted on a dataset containing combinations of diverse scenes and reference videos containing a variety of camera motions. In the absence of an established metric for assessing camera motion transfer between unrelated scenes, we propose CameraScore, a novel metric that utilizes homography representations to measure camera motion similarity between the reference and generated videos. Extensive quantitative and qualitative evaluations demonstrate that our approach generates high-quality, motion-enhanced videos. Additionally, a user study reveals that 70.31% of participants preferred our method for scene preservation, while 90.45% favored it for motion transfer. We hope this work lays the foundation for future advancements in camera motion transfer across different scenes.
zh

[CV-130] Dont Deceive Me: Mitigating Gaslighting through Attention Reallocation in LMMs

【速读】：本文旨在解决大型多模态模型（Large Multimodal Models, LMMs）在面对基于否定的“gaslighting”（用户故意提供误导性或矛盾输入的现象）时所面临的可靠性下降问题。这种现象会导致模型准确性显著降低。为应对这一挑战，论文提出了一种名为GasEraser的训练-free方法，其关键是重新分配注意力权重，将误导性文本标记的关注转移到语义重要的视觉区域。通过抑制“注意力陷阱”标记的影响并增强对视觉线索的关注，GasEraser提升了LMMs的鲁棒性，且无需重新训练或额外标注。实验结果表明，GasEraser在多个领先开源LMMs上有效，尤其对于LLaVA-v1.5-7B，成功将误导率降低了48.2%，展现了其在构建更可信LMMs方面的潜力。

链接: https://arxiv.org/abs/2504.09456
作者: Pengkun Jiao,Bin Zhu,Jingjing Chen,Chong-Wah Ngo,Yu-Gang Jiang
机构: 未知
类目: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Large Multimodal Models (LMMs) have demonstrated remarkable capabilities across a wide range of tasks. However, their vulnerability to user gaslighting-the deliberate use of misleading or contradictory inputs-raises critical concerns about their reliability in real-world applications. In this paper, we address the novel and challenging issue of mitigating the negative impact of negation-based gaslighting on LMMs, where deceptive user statements lead to significant drops in model accuracy. Specifically, we introduce GasEraser, a training-free approach that reallocates attention weights from misleading textual tokens to semantically salient visual regions. By suppressing the influence of “attention sink” tokens and enhancing focus on visually grounded cues, GasEraser significantly improves LMM robustness without requiring retraining or additional supervision. Extensive experimental results demonstrate that GasEraser is effective across several leading open-source LMMs on the GaslightingBench. Notably, for LLaVA-v1.5-7B, GasEraser reduces the misguidance rate by 48.2%, demonstrating its potential for more trustworthy LMMs.
zh

[CV-131] Enhancing Wide-Angle Image Using Narrow-Angle View of the Same Scene

【速读】：本文旨在解决在摄影中常见的权衡难题：宽视角拍摄虽能覆盖更多场景但细节较少，而窄视角拍摄虽能捕捉更多细节却会遗漏部分场景。为解决此问题，论文提出了一种新颖方法，通过同时使用窄视角与宽视角镜头拍摄同一场景，利用基于生成对抗网络（GAN）的模型学习从窄视角图像提取视觉质量参数，并将其转移至对应的宽视角图像中，从而将窄视角的细节丰富性注入宽视角图像。解决方案的关键在于设计一种能够有效分离图像视觉本质并实现跨图像迁移的技术，同时通过多组基准数据集验证及与现有技术对比展示了其实现效果。

链接: https://arxiv.org/abs/2504.09455
作者: Hussain Md. Safwan,Mahbub Islam Mahim,Fawwaz Mohammed Amin
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
备注:

点击查看摘要

Abstract:A common dilemma while photographing a scene is whether to capture it in wider angle, allowing more of the scene to be covered but in lesser details or to click in narrow angle that captures better details but leaves out portions of the scene. We propose a novel method in this paper that infuses wider shots with finer quality details that is usually associated with an image captured by the primary lens by capturing the same scene using both narrow and wide field of view (FoV) lenses. We do so by training a GAN-based model to learn to extract the visual quality parameters from a narrow angle shot and to transfer these to the corresponding wide-angle image of the scene. We have mentioned in details the proposed technique to isolate the visual essence of an image and to transfer it into another image. We have also elaborately discussed our implementation details and have presented the results of evaluation over several benchmark datasets and comparisons with contemporary advancements in the field.
zh

[CV-132] D2iT: Dynamic Diffusion Transformer for Accurate Image Generation

【速读】：该论文旨在解决扩散模型在生成高保真图像过程中，固定压缩不同图像区域导致的信息密度差异未被充分利用的问题。传统方法在扩散过程中采用统一的压缩策略，这会导致局部真实感受限或计算复杂度增加以及全局一致性受损，从而影响生成图像的质量。为了解决这些问题，论文提出了一种动态压缩不同图像区域的方法，并引入了一个新颖的两阶段框架来提升图像生成的效果和效率。关键在于第一阶段的动态变分自编码器（Dynamic VAE, DVAE）通过层次化编码器根据不同区域的信息密度以不同的下采样率进行编码，提供更精确且自然的潜在代码；第二阶段的动态扩散Transformer（Dynamic Diffusion Transformer, D²iT）通过结合粗粒度和细粒度噪声预测，利用动态粒度Transformer和动态内容Transformer的创新组合生成图像，实现了全局一致性和局部真实性的统一。综合实验验证了所提方法的有效性。

链接: https://arxiv.org/abs/2504.09454
作者: Weinan Jia,Mengqi Huang,Nan Chen,Lei Zhang,Zhendong Mao
机构: University of Science and Technology of China (中国科学技术大学); Institute of Artificial intelligence, Hefei Comprehensive National Science Center (合肥综合性国家科学中心人工智能研究所)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Diffusion models are widely recognized for their ability to generate high-fidelity images. Despite the excellent performance and scalability of the Diffusion Transformer (DiT) architecture, it applies fixed compression across different image regions during the diffusion process, disregarding the naturally varying information densities present in these regions. However, large compression leads to limited local realism, while small compression increases computational complexity and compromises global consistency, ultimately impacting the quality of generated images. To address these limitations, we propose dynamically compressing different image regions by recognizing the importance of different regions, and introduce a novel two-stage framework designed to enhance the effectiveness and efficiency of image generation: (1) Dynamic VAE (DVAE) at first stage employs a hierarchical encoder to encode different image regions at different downsampling rates, tailored to their specific information densities, thereby providing more accurate and natural latent codes for the diffusion process. (2) Dynamic Diffusion Transformer (D ^2 iT) at second stage generates images by predicting multi-grained noise, consisting of coarse-grained (less latent code in smooth regions) and fine-grained (more latent codes in detailed regions), through an novel combination of the Dynamic Grain Transformer and the Dynamic Content Transformer. The strategy of combining rough prediction of noise with detailed regions correction achieves a unification of global consistency and local realism. Comprehensive experiments on various generation tasks validate the effectiveness of our approach. Code will be released at this https URL.
zh

[CV-133] FractalForensics: Proactive Deepfake Detection and Localization via Fractal Watermarks

【速读】：该论文旨在解决现有主动深度伪造（Deepfake）检测方法在高保真合成图像识别中的局限性，特别是被动深度伪造检测器在检测性能上的不足，以及缺乏定位功能和可解释性的问题。此外，传统水印的不稳定性也显著影响了检测效果。为了解决这些问题，论文提出了FractalForensics，这是一种基于分形水印的新型主动深度伪造检测与定位方案。其关键在于利用分形特性的参数驱动水印生成管道，通过选择性的一次性加密生成基于分形的水印，并设计了一种半脆弱水印框架，使其能够抵抗良性图像处理操作，同时在面对深度伪造篡改时保持脆弱性。此外，引入的入口到块（entry-to-patch）策略实现了水印矩阵条目在图像块中的隐式嵌入，从而实现深度伪造篡改的精确定位。实验结果验证了该方法在鲁棒性和脆弱性方面的优越性，并提供了检测结果的可解释性。

链接: https://arxiv.org/abs/2504.09451
作者: Tianyi Wang,Harry Cheng,Ming-Hui Liu,Mohan Kankanhalli
机构: National University of Singapore(Singapore)(新加坡国立大学); Shandong University(Jinan)(山东大学); National University of Singapore(Singapore)(新加坡国立大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Proactive Deepfake detection via robust watermarks has been raised ever since passive Deepfake detectors encountered challenges in identifying high-quality synthetic images. However, while demonstrating reasonable detection performance, they lack localization functionality and explainability in detection results. Additionally, the unstable robustness of watermarks can significantly affect the detection performance accordingly. In this study, we propose novel fractal watermarks for proactive Deepfake detection and localization, namely FractalForensics. Benefiting from the characteristics of fractals, we devise a parameter-driven watermark generation pipeline that derives fractal-based watermarks and conducts one-way encryption regarding the parameters selected. Subsequently, we propose a semi-fragile watermarking framework for watermark embedding and recovery, trained to be robust against benign image processing operations and fragile when facing Deepfake manipulations in a black-box setting. Meanwhile, we introduce an entry-to-patch strategy that implicitly embeds the watermark matrix entries into image patches at corresponding positions, achieving localization of Deepfake manipulations. Extensive experiments demonstrate satisfactory robustness and fragility of our approach against common image processing operations and Deepfake manipulations, outperforming state-of-the-art semi-fragile watermarking algorithms and passive detectors for Deepfake detection. Furthermore, by highlighting the areas manipulated, our method provides explainability for the proactive Deepfake detection results.
zh

[CV-134] InfoBound: A Provable Information-Bounds Inspired Framework for Both OoD Generalization and OoD Detection

【速读】：该论文旨在解决现实世界中的分布偏移问题，具体包括两类挑战：一是跨协变量偏移（covariate shift）的域外泛化（Out-of-Distribution, OoD Generalization），即模型在环境变化下的泛化能力；二是域外检测（OoD Detection），即识别测试阶段未见过的语义类别（semantic shifts）。论文指出，真实场景中的测试环境通常同时包含协变量偏移和语义偏移，然而现有方法往往只能单独优化其中一个问题，且常以牺牲另一个问题为代价。为克服这些局限性，作者从信息论的角度提出了一种统一的方法，通过最小化互信息（Mutual Information Minimization, MI-Min）与最大化条件熵（Conditional Entropy Maximizing, CE-Max）来同时提升域外检测和域外泛化的能力。实验结果表明，该方法在多标签图像分类和目标检测任务中显著优于现有基线，并有效缓解了两项挑战之间的权衡问题。

链接: https://arxiv.org/abs/2504.09448
作者: Lin Zhu,Yifeng Yang,Zichao Nie,Yuan Gao,Jiarui Li,Qinying Gu,Xinbing Wang,Chenghu Zhou,Nanyang Ye
机构: Shanghai Jiao Tong University (上海交通大学); Beijing Jiaotong University (北京交通大学); Shanghai Artificial Intelligence Laboratory (上海人工智能实验室)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Under Review

点击查看摘要

Abstract:In real-world scenarios, distribution shifts give rise to the importance of two problems: out-of-distribution (OoD) generalization, which focuses on models’ generalization ability against covariate shifts (i.e., the changes of environments), and OoD detection, which aims to be aware of semantic shifts (i.e., test-time unseen classes). Real-world testing environments often involve a combination of both covariate and semantic shifts. While numerous methods have been proposed to address these critical issues, only a few works tackled them simultaneously. Moreover, prior works often improve one problem but sacrifice the other. To overcome these limitations, we delve into boosting OoD detection and OoD generalization from the perspective of information theory, which can be easily applied to existing models and different tasks. Building upon the theoretical bounds for mutual information and conditional entropy, we provide a unified approach, composed of Mutual Information Minimization (MI-Min) and Conditional Entropy Maximizing (CE-Max). Extensive experiments and comprehensive evaluations on multi-label image classification and object detection have demonstrated the superiority of our method. It successfully mitigates trade-offs between the two challenges compared to competitive baselines.
zh

[CV-135] Sparse Deformable Mamba for Hyperspectral Image Classification

【速读】：该论文旨在解决高光谱图像（HSI）分类中由于Mamba序列构建效率和有效性不足而带来的挑战。解决方案的关键在于提出了一种稀疏可变形Mamba（SDMamba）方法，其核心贡献包括：首先设计了一种高效的稀疏可变形序列（SDS）方法，以自适应学习“最优”序列，从而生成具有增强细节保留和减少计算量的稀疏且可变形的Mamba序列；其次基于SDS，分别设计了稀疏可变形空间Mamba模块（SDSpaM）和稀疏可变形光谱Mamba模块（SDSpeM），用于针对性地建模空间信息和光谱信息；最后，通过一种基于注意力机制的特征融合方法，整合SDSpaM和SDSpeM的输出，进一步提升特征融合效果。实验结果表明，该方法在多个基准数据集上实现了更高的分类精度、更快的速度以及更好的小类别细节保留能力。

链接: https://arxiv.org/abs/2504.09446
作者: Lincoln Linlin Xu,Yimin Zhu,Zack Dewis,Zhengsen Xu,Motasem Alkayid,Mabel Heffring,Saeid Taleghanidoozdoozan
机构: University of Calgary (卡尔加里大学), Canada; University of Waterloo (滑铁卢大学), Canada
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Although the recent Mamba models significantly improve hyperspectral image (HSI) classification, one critical challenge is caused by the difficulty to build the Mamba sequence efficiently and effectively. This paper presents a Sparse Deformable Mamba (SDMamba) approach for enhanced HSI classification, with the following contributions. First, to enhance Mamba sequence, an efficient Sparse Deformable Sequencing (SDS) approach is designed to adaptively learn the “optimal” sequence, leading to sparse and deformable Mamba sequence with increased detail preservation and decreased computations. Second, to boost spatial-spectral feature learning, based on SDS, a Sparse Deformable Spatial Mamba Module (SDSpaM) and a Sparse Deformable Spectral Mamba Module (SDSpeM) are designed for tailored modeling of the spatial information spectral information. Last, to improve the fusion of SDSpaM and SDSpeM, an attention based feature fusion approach is designed to integrate the outputs of the SDSpaM and SDSpeM. The proposed method is tested on several benchmark datasets with many state-of-the-art approaches, demonstrating that the proposed approach can achieve higher accuracy, faster speed, and better detail small-class preservation capability.
zh

[CV-136] Structure-Accurate Medical Image Translation based on Dynamic Frequency Balance and Knowledge Guidance

【速读】：该论文旨在解决现有扩散模型在合成医学图像时因高频信息过拟合和低频信息弱化导致的解剖结构失真问题。为应对这一挑战，论文提出了一种基于动态频率平衡与知识引导的新方法。其关键是通过小波变换分解模型的关键特征以提取高低频成分，并设计动态频率平衡模块来自适应调整频率，增强全局低频特征和有效高频细节同时抑制高频噪声。此外，针对不同医学模态间的大差异，构建了知识引导机制，将视觉语言模型中的先验临床知识与视觉特征融合，以促进精确解剖结构的生成。实验结果表明，该方法在多个数据集上的定性与定量评估中均取得了显著改进。

链接: https://arxiv.org/abs/2504.09441
作者: Jiahua Xu,Dawei Zhou,Lei Hu,Zaiyi Liu,Nannan Wang,Xinbo Gao
机构: Xidian University (西安电子科技大学); Guangdong Provincial People’s Hospital (广东省人民医院); Chongqing University of Posts and Telecommunications (重庆邮电大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
备注: Medical image translation, Diffusion model, 16 pages

点击查看摘要

Abstract:Multimodal medical images play a crucial role in the precise and comprehensive clinical diagnosis. Diffusion model is a powerful strategy to synthesize the required medical images. However, existing approaches still suffer from the problem of anatomical structure distortion due to the overfitting of high-frequency information and the weakening of low-frequency information. Thus, we propose a novel method based on dynamic frequency balance and knowledge guidance. Specifically, we first extract the low-frequency and high-frequency components by decomposing the critical features of the model using wavelet transform. Then, a dynamic frequency balance module is designed to adaptively adjust frequency for enhancing global low-frequency features and effective high-frequency details as well as suppressing high-frequency noise. To further overcome the challenges posed by the large differences between different medical modalities, we construct a knowledge-guided mechanism that fuses the prior clinical knowledge from a visual language model with visual features, to facilitate the generation of accurate anatomical structures. Experimental evaluations on multiple datasets show the proposed method achieves significant improvements in qualitative and quantitative assessments, verifying its effectiveness and superiority.
zh

[CV-137] Comparing Performance of Preprocessing Techniques for Traffic Sign Recognition Using a HOG-SVM

【速读】：该论文旨在研究和比较不同预处理技术在交通标志识别（Traffic Sign Recognition, TSR）任务中的性能，特别是在基于直方图方向梯度（Histogram of Oriented Gradients, HOG）和支持向量机（Support Vector Machine, SVM）的方法应用于德国交通标志识别基准（German Traffic Sign Recognition Benchmark, GTSRB）数据集时的效果。论文评估了对比增强局部直方图均衡化（Contrast Limited Adaptive Histogram Equalization, CLAHE）、色相（HUE）以及 YUV 空间转换等技术对分类准确率的影响。结果表明，YUV 特别显著提升了 HOG-SVM 分类器的性能（将准确率从 89.65% 提高到 91.25%），为 TSR 应用的预处理流程优化提供了重要参考。关键在于发现 YUV 转换能够有效提升特征提取的质量，从而显著改善分类性能。

链接: https://arxiv.org/abs/2504.09424
作者: Luis Vieira
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: working paper (preprint)

点击查看摘要

Abstract:This study compares the performance of various preprocessing techniques for Traffic Sign Recognition (TSR) using Histogram of Oriented Gradients (HOG) and Support Vector Machine (SVM) on the German Traffic Sign Recognition Benchmark (GTSRB) dataset. Techniques such as CLAHE, HUE, and YUV were evaluated for their impact on classification accuracy. Results indicate that YUV in particular significantly enhance the performance of the HOG-SVM classifier (improving accuracy from 89.65% to 91.25%), providing insights into improvements for preprocessing pipeline of TSR applications.
zh

[CV-138] Vision Transformers Exhibit Human-Like Biases: Evidence of Orientation and Color Selectivity Categorical Perception and Phase Transitions

【速读】：该论文旨在探究视觉Transformer（Vision Transformers, ViTs）是否发展出了与人脑相似的方向性和颜色偏向。研究通过使用包含受控噪声水平、角度、长度、宽度和颜色变化的合成数据集，分析了基于LoRA微调的ViTs的行为。关键在于利用合成数据集系统性地评估ViTs在方向和颜色处理上的特性，并结合聚类分析及训练损失曲线观察其行为模式，从而揭示ViTs在预训练阶段形成的表示及其架构约束如何主导其偏向性和属性，而非完全由下游任务的数据统计特性决定。

链接: https://arxiv.org/abs/2504.09393
作者: Nooshin Bahador
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:This study explored whether Vision Transformers (ViTs) developed orientation and color biases similar to those observed in the human brain. Using synthetic datasets with controlled variations in noise levels, angles, lengths, widths, and colors, we analyzed the behavior of ViTs fine-tuned with LoRA. Our findings revealed four key insights: First, ViTs exhibited an oblique effect showing the lowest angle prediction errors at 180 deg (horizontal) across all conditions. Second, angle prediction errors varied by color. Errors were highest for bluish hues and lowest for yellowish ones. Additionally, clustering analysis of angle prediction errors showed that ViTs grouped colors in a way that aligned with human perceptual categories. In addition to orientation and color biases, we observed phase transition phenomena. While two phase transitions occurred consistently across all conditions, the training loss curves exhibited delayed transitions when color was incorporated as an additional data attribute. Finally, we observed that attention heads in certain layers inherently develop specialized capabilities, functioning as task-agnostic feature extractors regardless of the downstream task. These observations suggest that biases and properties arise primarily from pre-training on the original dataset which shapes the model’s foundational representations and the inherent architectural constraints of the vision transformer, rather than being solely determined by downstream data statistics.
zh

[CV-139] Contour Flow Constraint: Preserving Global Shape Similarity for Deep Learning based Image Segmentation

【速读】：该论文致力于解决现有图像分割方法在利用先验知识时主要关注特定属性或形状的问题，缺乏从轮廓流（Contour Flow, CF）视角考虑全局形状相似性的通用约束。论文的关键在于提出了一种基于形状可比性的全局形状相似性概念，并通过数学推导出一种保证全局形状相似性保持的轮廓流约束。解决方案的核心是将该约束以两种方式集成到深度神经网络中：一是将其转化为形状损失，无缝融入任何基于学习的分割框架的训练阶段；二是将其嵌入变分分割模型中，推导出其迭代方案并展开为CFSSnet网络架构。实验验证表明，所提出的形状损失显著提高了分割精度和形状相似性，展示了该损失项的良好通用适应性，同时CFSSnet在处理噪声图像时表现出鲁棒性并具备保持全局形状相似性的能力。

链接: https://arxiv.org/abs/2504.09384
作者: Shengzhe Chen,Zhaoxuan Dong,Jun Liu
机构: Ira A. Fulton Schools of Engineering, Arizona State University (亚利桑那州立大学); School of Mathematical Sciences, Beijing Normal University (北京师范大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Submitted to IEEE Transactions on Image Processin on Dec-14-2023. Revised on Oct-16-2024

点击查看摘要

Abstract:For effective image segmentation, it is crucial to employ constraints informed by prior knowledge about the characteristics of the areas to be segmented to yield favorable segmentation outcomes. However, the existing methods have primarily focused on priors of specific properties or shapes, lacking consideration of the general global shape similarity from a Contour Flow (CF) perspective. Furthermore, naturally integrating this contour flow prior image segmentation model into the activation functions of deep convolutional networks through mathematical methods is currently unexplored. In this paper, we establish a concept of global shape similarity based on the premise that two shapes exhibit comparable contours. Furthermore, we mathematically derive a contour flow constraint that ensures the preservation of global shape similarity. We propose two implementations to integrate the constraint with deep neural networks. Firstly, the constraint is converted to a shape loss, which can be seamlessly incorporated into the training phase for any learning-based segmentation framework. Secondly, we add the constraint into a variational segmentation model and derive its iterative schemes for solution. The scheme is then unrolled to get the architecture of the proposed CFSSnet. Validation experiments on diverse datasets are conducted on classic benchmark deep network segmentation models. The results indicate a great improvement in segmentation accuracy and shape similarity for the proposed shape loss, showcasing the general adaptability of the proposed loss term regardless of specific network architectures. CFSSnet shows robustness in segmenting noise-contaminated images, and inherent capability to preserve global shape similarity.
zh

[CV-140] Low-Light Image Enhancement using Event-Based Illumination Estimation

【速读】：该论文致力于解决低光图像增强（LLIE）问题，特别是在事件相机环境下通过事件数据提升低光照图像的可见性。现有基于事件的方法主要利用运动事件（motion events）强化边缘纹理，而忽略了事件相机在高动态范围和低光照响应方面的潜力。论文提出从“时间映射”事件（temporal-mapping events）的角度估计光照强度，即将由透射调制触发的事件的时间戳转换为亮度值，从而生成精细的光照线索。解决方案的关键在于设计了一个基于光照辅助的反射率增强模块（Illumination-aided Reflectance Enhancement module），利用这些光照线索更有效地分解和增强低光图像中的反射率分量。此外，论文还研究了低光照条件下时间映射事件的退化模型，用于合成真实的训练数据，并构建了一套包含图像、时间映射事件和运动事件的数据集EvLowLight。实验结果表明，所提出的RetinEV管道在多个合成和真实数据集上均表现出色，相比现有最先进的基于事件的方法提升了高达6.62 dB的性能，同时保持了每秒35.6帧的高效推理速度。

链接: https://arxiv.org/abs/2504.09379
作者: Lei Sun,Yuhan Bao,Jiajun Zhai,Jingyun Liang,Yulun Zhang,Kaiwei Wang,Danda Pani Paudel,Luc Van Gool
机构: INSAIT, Sofia University (索非亚大学); Zhejiang University (浙江大学); Alibaba Group (阿里巴巴集团); Shanghai Jiao Tong University (上海交通大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Low-light image enhancement (LLIE) aims to improve the visibility of images captured in poorly lit environments. Prevalent event-based solutions primarily utilize events triggered by motion, i.e., ‘‘motion events’’ to strengthen only the edge texture, while leaving the high dynamic range and excellent low-light responsiveness of event cameras largely unexplored. This paper instead opens a new avenue from the perspective of estimating the illumination using ‘‘temporal-mapping’’ events, i.e., by converting the timestamps of events triggered by a transmittance modulation into brightness values. The resulting fine-grained illumination cues facilitate a more effective decomposition and enhancement of the reflectance component in low-light images through the proposed Illumination-aided Reflectance Enhancement module. Furthermore, the degradation model of temporal-mapping events under low-light condition is investigated for realistic training data synthesizing. To address the lack of datasets under this regime, we construct a beam-splitter setup and collect EvLowLight dataset that includes images, temporal-mapping events, and motion events. Extensive experiments across 5 synthetic datasets and our real-world EvLowLight dataset substantiate that the devised pipeline, dubbed RetinEV, excels in producing well-illuminated, high dynamic range images, outperforming previous state-of-the-art event-based methods by up to 6.62 dB, while maintaining an efficient inference speed of 35.6 frame-per-second on a 640X480 image.
zh

[CV-141] Beyond Degradation Conditions: All-in-One Image Restoration via HOG Transformers

【速读】：该论文旨在解决现有统一图像恢复框架在复杂场景中因预测和整合退化条件而导致特定退化特征误激活的问题，这限制了其恢复性能。为了解决这一问题，论文提出了一种新的基于方向梯度直方图（Histograms of Oriented Gradients, HOG）引导的统一图像恢复框架——HOGformer。其关键在于利用HOG描述符的退化区分能力，通过动态自注意力机制自适应地关注长距离的空间依赖性，并结合HOG感知线索。此外，设计了一个HOG引导的局部动态范围卷积模块以增强注意力输入的退化敏感性，同时保持对全局结构信息的关注。同时，提出了一个动态交互前馈模块，通过通道-空间交互高效提升模型对不同退化的适应能力。实验结果表明，HOGformer在多种基准数据集上的表现达到当前最优水平，并且能够有效泛化到复杂的现实世界退化场景中。

链接: https://arxiv.org/abs/2504.09377
作者: Jiawei Wu,Zhifei Yang,Zhe Wang,Zhi Jin
机构: School of Intelligent Systems Engineering, Shenzhen Campus of Sun Yat-sen University (中山大学深圳校区智能工程学院), Guangdong, China; School of Computer Science, Peking University (北京大学计算机科学学院), China; Guangdong Provincial Key Laboratory of Fire Science and Technology (广东省火灾科学重点实验室), Guangzhou, China
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:All-in-one image restoration, which aims to address diverse degradations within a unified framework, is critical for practical applications. However, existing methods rely on predicting and integrating degradation conditions, which can misactivate degradation-specific features in complex scenarios, limiting their restoration performance. To address this issue, we propose a novel all-in-one image restoration framework guided by Histograms of Oriented Gradients (HOG), named HOGformer. By leveraging the degradation-discriminative capability of HOG descriptors, HOGformer employs a dynamic self-attention mechanism that adaptively attends to long-range spatial dependencies based on degradation-aware HOG cues. To enhance the degradation sensitivity of attention inputs, we design a HOG-guided local dynamic-range convolution module that captures long-range degradation similarities while maintaining awareness of global structural information. Furthermore, we propose a dynamic interaction feed-forward module, efficiently increasing the model capacity to adapt to different degradations through channel-spatial interactions. Extensive experiments across diverse benchmarks, including adverse weather and natural degradations, demonstrate that HOGformer achieves state-of-the-art performance and generalizes effectively to complex real-world degradations. Code is available at this https URL.
zh

[CV-142] PapMOT: Exploring Adversarial Patch Attack against Multiple Object Tracking ECCV2024

【速读】：该论文旨在解决现有多目标跟踪（Multiple Object Tracking, MOT）方法易受数字攻击但对物理场景无效的问题。论文的关键在于提出了一种名为PapMOT的方案，能够针对MOT生成适用于数字和物理场景的物理对抗补丁。PapMOT不仅通过优化可打印的对抗补丁来干扰检测机制，还能将这些补丁作为新的目标，误导身份关联过程，进一步通过补丁增强策略破坏视频帧之间的时序一致性，从而实现更强大的攻击效果。此外，论文还引入了新的评估指标以量化MOT在面对此类攻击时的鲁棒性，并验证了PapMOT在数字和物理场景中的有效性。

链接: https://arxiv.org/abs/2504.09361
作者: Jiahuan Long,Tingsong Jiang,Wen Yao,Shuai Jia,Weijia Zhang,Weien Zhou,Chao Ma,Xiaoqian Chen
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by ECCV 2024

点击查看摘要

Abstract:Tracking multiple objects in a continuous video stream is crucial for many computer vision tasks. It involves detecting and associating objects with their respective identities across successive frames. Despite significant progress made in multiple object tracking (MOT), recent studies have revealed the vulnerability of existing MOT methods to adversarial attacks. Nevertheless, all of these attacks belong to digital attacks that inject pixel-level noise into input images, and are therefore ineffective in physical scenarios. To fill this gap, we propose PapMOT, which can generate physical adversarial patches against MOT for both digital and physical scenarios. Besides attacking the detection mechanism, PapMOT also optimizes a printable patch that can be detected as new targets to mislead the identity association process. Moreover, we introduce a patch enhancement strategy to further degrade the temporal consistency of tracking results across video frames, resulting in more aggressive attacks. We further develop new evaluation metrics to assess the robustness of MOT against such attacks. Extensive evaluations on multiple datasets demonstrate that our PapMOT can successfully attack various architectures of MOT trackers in digital scenarios. We also validate the effectiveness of PapMOT for physical attacks by deploying printed adversarial patches in the real world.
zh

[CV-143] Explorer: Robust Collection of Interactable GUI Elements

【速读】：该论文试图解决现有图形用户界面（Graphical User Interfaces, GUIs）自动化中存在的挑战，特别是数据收集和理解原始界面所面临的困难。论文的关键在于提出了一种名为Explorer的系统，专注于检测屏幕上的按钮和文本输入字段等可交互元素。该解决方案的关键点在于利用目标应用程序的实时版本进行训练，并且通过个性化数据收集和机器学习训练，确保在测试阶段能够精确识别相关UI元素，从而增强用户体验的信心。此外，Explorer还支持交互会话的记录与映射，实现路径规划功能，使用户可以通过语音命令导航至目标位置。代码已开源供公众使用。

链接: https://arxiv.org/abs/2504.09352
作者: Iason Chaimalas,Arnas Vyšniauskas,Gabriel Brostow
机构: University College London (伦敦大学学院); University College London (伦敦大学学院); University College London (伦敦大学学院)
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注: 19 pages, 17 figures

点击查看摘要

Abstract:Automation of existing Graphical User Interfaces (GUIs) is important but hard to achieve. Upstream of making the GUI user-accessible or somehow scriptable, even the data-collection to understand the original interface poses significant challenges. For example, large quantities of general UI data seem helpful for training general machine learning (ML) models, but accessibility for each person can hinge on the ML’s precision on a specific app. We therefore take the perspective that a given user needs confidence, that the relevant UI elements are being detected correctly throughout one app or digital environment. We mostly assume that the target application is known in advance, so that data collection and ML-training can be personalized for the test-time target domain. The proposed Explorer system focuses on detecting on-screen buttons and text-entry fields, i.e. interactables, where the training process has access to a live version of the application. The live application can run on almost any popular platform except iOS phones, and the collection is especially streamlined for Android phones or for desktop Chrome browsers. Explorer also enables the recording of interactive user sessions, and subsequent mapping of how such sessions overlap and sometimes loop back to similar states. We show how having such a map enables a kind of path planning through the GUI, letting a user issue audio commands to get to their destination. Critically, we are releasing our code for Explorer openly at this https URL.
zh

[CV-144] Minority Reports: Balancing Cost and Quality in Ground Truth Data Annotation

【速读】：该论文试图解决高质量数据标注在机器学习软件开发中的高成本和劳动密集型问题，通过检测和移除“少数派报告”（即不正确的标注响应）来优化标注任务分配。论文的关键在于提出一种方法，在执行标注任务前估计注释者与多数投票结果不一致的可能性，从而裁剪潜在冗余的任务分配。这种方法基于对计算机视觉数据集的实证分析，发现少数派报告的发生主要取决于图像模糊性、标注人员差异性和疲劳度。通过在多个数据集上的模拟实验表明，该方法可在仅小幅降低标签质量的前提下将所需标注数量减少超过60%，相当于节省约6.6个工日的劳动力。其核心创新在于提供了一种平衡标注成本与数据集质量的方法，同时允许机器学习实践者根据具体应用需求调整标注精度，从而优化预算分配并确保关键应用场景（如自动驾驶技术）的数据质量。

链接: https://arxiv.org/abs/2504.09341
作者: Hsuan Wei Liao,Christopher Klugmann,Daniel Kondermann,Rafid Mahmood
机构: 未知
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC)
备注: 39 pages

点击查看摘要

Abstract:High-quality data annotation is an essential but laborious and costly aspect of developing machine learning-based software. We explore the inherent tradeoff between annotation accuracy and cost by detecting and removing minority reports – instances where annotators provide incorrect responses – that indicate unnecessary redundancy in task assignments. We propose an approach to prune potentially redundant annotation task assignments before they are executed by estimating the likelihood of an annotator disagreeing with the majority vote for a given task. Our approach is informed by an empirical analysis over computer vision datasets annotated by a professional data annotation platform, which reveals that the likelihood of a minority report event is dependent primarily on image ambiguity, worker variability, and worker fatigue. Simulations over these datasets show that we can reduce the number of annotations required by over 60% with a small compromise in label quality, saving approximately 6.6 days-equivalent of labor. Our approach provides annotation service platforms with a method to balance cost and dataset quality. Machine learning practitioners can tailor annotation accuracy levels according to specific application needs, thereby optimizing budget allocation while maintaining the data quality necessary for critical settings like autonomous driving technology.
zh

[CV-145] xt To 3D Object Generation For Scalable Room Assembly ICLR2025

【速读】：该论文旨在解决场景理解任务（如深度估计和目标跟踪）中现代机器学习模型面临的高质量数据稀缺问题。为应对这一挑战，论文提出了一种端到端的合成数据生成系统，用于创建可扩展、高质量且可定制的3D室内场景。方案的关键在于将文本到图像生成（Text-to-Image Generation）、多视角扩散模型（Multi-view Diffusion Models）与基于神经辐射场（Neural Radiance Fields, NeRF）的网格化技术集成，并结合渲染工具实现从文本提示生成高保真3D物体资产，再将其整合到预定义的平面图中。此外，通过引入创新的损失函数和训练策略，系统实现了按需场景生成，从而缓解当前依赖人工艺术家创作的数据短缺问题，推动合成数据在突破机器学习训练局限性方面的作用，以构建更鲁棒和泛化能力更强的现实世界应用模型。

链接: https://arxiv.org/abs/2504.09328
作者: Sonia Laguna,Alberto Garcia-Garcia,Marie-Julie Rakotosaona,Stylianos Moschoglou,Leonhard Helminger,Sergio Orts-Escolano
机构: ETH Zurich (苏黎世联邦理工学院); Google (谷歌)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: Published at the ICLR 2025 Workshop on Synthetic Data

点击查看摘要

Abstract:Modern machine learning models for scene understanding, such as depth estimation and object tracking, rely on large, high-quality datasets that mimic real-world deployment scenarios. To address data scarcity, we propose an end-to-end system for synthetic data generation for scalable, high-quality, and customizable 3D indoor scenes. By integrating and adapting text-to-image and multi-view diffusion models with Neural Radiance Field-based meshing, this system generates highfidelity 3D object assets from text prompts and incorporates them into pre-defined floor plans using a rendering tool. By introducing novel loss functions and training strategies into existing methods, the system supports on-demand scene generation, aiming to alleviate the scarcity of current available data, generally manually crafted by artists. This system advances the role of synthetic data in addressing machine learning training limitations, enabling more robust and generalizable models for real-world applications.
zh

[CV-146] Infused Suppression Of Magnification Artefacts For Micro-AU Detection

【速读】：该论文旨在解决微表情分析中因面部运动细微且短暂导致的特征与 AU（Action Units）发生关联困难的问题，同时克服运动放大技术引入的运动伪影（motion artefacts），特别是在跨数据库多 AU 类别分析的复杂任务中。论文的关键解决方案在于提出 InfuseNet，这是一种分层单元特征融合框架，通过利用运动上下文约束 AU 学习在信息丰富的面部运动区域，从而减轻运动放大带来的伪影影响。此外，InfuseNet 还采用放大后的潜在特征而非重构放大样本，以减少运动重建过程中的失真和伪影。这些方法显著提升了 CD6ME 协议下的性能，并通过定量研究验证了缓解运动伪影的有效性。

链接: https://arxiv.org/abs/2504.09326
作者: Huai-Qian Khor,Yante Li,Xingxun Jiang,Guoying Zhao
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Facial micro-expressions are spontaneous, brief and subtle facial motions that unveil the underlying, suppressed emotions. Detecting Action Units (AUs) in micro-expressions is crucial because it yields a finer representation of facial motions than categorical emotions, effectively resolving the ambiguity among different expressions. One of the difficulties in micro-expression analysis is that facial motions are subtle and brief, thereby increasing the difficulty in correlating facial motion features to AU occurrence. To bridge the subtlety issue, flow-related features and motion magnification are a few common approaches as they can yield descriptive motion changes and increased motion amplitude respectively. While motion magnification can amplify the motion changes, it also accounts for illumination changes and projection errors during the amplification process, thereby creating motion artefacts that confuse the model to learn inauthentic magnified motion features. The problem is further aggravated in the context of a more complicated task where more AU classes are analyzed in cross-database settings. To address this issue, we propose InfuseNet, a layer-wise unitary feature infusion framework that leverages motion context to constrain the Action Unit (AU) learning within an informative facial movement region, thereby alleviating the influence of magnification artefacts. On top of that, we propose leveraging magnified latent features instead of reconstructing magnified samples to limit the distortion and artefacts caused by the projection inaccuracy in the motion reconstruction process. Via alleviating the magnification artefacts, InfuseNet has surpassed the state-of-the-art results in the CD6ME protocol. Further quantitative studies have also demonstrated the efficacy of motion artefacts alleviation.
zh

[CV-147] MedIL: Implicit Latent Spaces for Generating Heterogeneous Medical Images at Arbitrary Resolutions

【速读】：本文旨在解决医学图像生成中因图像尺寸和分辨率异质性导致的临床相关细节丢失问题。传统方法在处理医学图像时通常需要先将图像重采样到固定大小，这会显著降低图像的解剖学细节质量，而这些细节对临床诊断至关重要。论文的关键创新在于MedIL（Medical Implicit Representation Autoencoder），它通过隐式神经表示将医学图像视为连续信号，从而能够在任意分辨率下进行编码和解码，无需预先重采样。这种方案能够有效压缩和保留跨多站点、多分辨率数据集（如T1加权脑MRI和肺部CT）中的临床相关信息，并进一步提升基于扩散模型生成图像的质量，使其更接近原始临床采集的真实图像特性。

链接: https://arxiv.org/abs/2504.09322
作者: Tyler Spears,Shen Zhu,Yinzhu Jin,Aman Shrivastava,P. Thomas Fletcher
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:In this work, we introduce MedIL, a first-of-its-kind autoencoder built for encoding medical images with heterogeneous sizes and resolutions for image generation. Medical images are often large and heterogeneous, where fine details are of vital clinical importance. Image properties change drastically when considering acquisition equipment, patient demographics, and pathology, making realistic medical image generation challenging. Recent work in latent diffusion models (LDMs) has shown success in generating images resampled to a fixed-size. However, this is a narrow subset of the resolutions native to image acquisition, and resampling discards fine anatomical details. MedIL utilizes implicit neural representations to treat images as continuous signals, where encoding and decoding can be performed at arbitrary resolutions without prior resampling. We quantitatively and qualitatively show how MedIL compresses and preserves clinically-relevant features over large multi-site, multi-resolution datasets of both T1w brain MRIs and lung CTs. We further demonstrate how MedIL can influence the quality of images generated with a diffusion model, and discuss how MedIL can enhance generative models to resemble raw clinical acquisitions.
zh

[CV-148] A Lightweight Moment Retrieval System with Global Re-Ranking and Robust Adaptive Bidirectional Temporal Search

【速读】：该论文旨在解决数字视频内容指数增长背景下，现有方法在高效定位大规模视频语料库中特定片段时面临的挑战。具体而言，这些问题包括计算效率低下、时间上下文限制以及导航视频内容的固有复杂性。为应对这些局限性，论文提出了一种新颖的交互式视频语料库时刻检索框架，其关键是整合了SuperGlobal重排序机制与自适应双向时间搜索（Adaptive Bidirectional Temporal Search, ABTS），通过优化查询相似性、时间稳定性以及计算资源利用，实现更高效的检索性能。此外，通过使用关键帧提取模型和基于图像哈希的去重技术对视频语料库进行预处理，该方案提供了可扩展的解决方案，显著降低了存储需求，同时保持了在多样化视频数据集中的高精度定位能力。

链接: https://arxiv.org/abs/2504.09298
作者: Tinh-Anh Nguyen-Nhu,Huu-Loc Tran,Nguyen-Khang Le,Minh-Nhat Nguyen,Tien-Huy Nguyen,Hoang-Long Nguyen-Huu,Huu-Phong Phan-Nguyen,Huy-Thach Pham,Quan Nguyen,Hoang M. Le,Quang-Vinh Dinh
机构: Ho Chi Minh University of Technology (胡志明市理工大学), VNU-HCM, Vietnam; University of Information Technology (信息技术大学), VNU-HCM, Vietnam; Ho Chi Minh University of Science (胡志明市自然科学大学), VNU-HCM, Vietnam; University of Economics Ho Chi Minh City (胡志明市经济大学), Vietnam; Georgia State University (乔治亚州立大学), USA; Posts and Telecommunications Institute of Technology (邮政电信技术学院), Vietnam; York University (约克大学), Canada; AI VIETNAM Lab (AI 越南实验室)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:The exponential growth of digital video content has posed critical challenges in moment-level video retrieval, where existing methodologies struggle to efficiently localize specific segments within an expansive video corpus. Current retrieval systems are constrained by computational inefficiencies, temporal context limitations, and the intrinsic complexity of navigating video content. In this paper, we address these limitations through a novel Interactive Video Corpus Moment Retrieval framework that integrates a SuperGlobal Reranking mechanism and Adaptive Bidirectional Temporal Search (ABTS), strategically optimizing query similarity, temporal stability, and computational resources. By preprocessing a large corpus of videos using a keyframe extraction model and deduplication technique through image hashing, our approach provides a scalable solution that significantly reduces storage requirements while maintaining high localization precision across diverse video repositories.
zh

[CV-149] Cycle Training with Semi-Supervised Domain Adaptation: Bridging Accuracy and Efficiency for Real-Time Mobile Scene Detection

【速读】：该论文旨在解决在移动设备资源受限的情况下，如何在保证计算效率的同时提升图像分类模型的准确性这一难题。论文的关键解决方案是提出了一种名为Cycle Training的新颖训练框架，该框架采用三阶段交替进行探索与稳定化的训练过程来优化模型性能，并结合半监督领域自适应（Semi-Supervised Domain Adaptation, SSDA）技术，利用大模型和未标注数据扩展训练集，从而有效平衡了精度与效率之间的矛盾。实验结果表明，此方法不仅显著提高了分类准确性，还保持了实时推理效率。

链接: https://arxiv.org/abs/2504.09297
作者: Huu-Phong Phan-Nguyen,Anh Dao,Tien-Huy Nguyen,Tuan Quang,Huu-Loc Tran,Tinh-Anh Nguyen-Nhu,Huy-Thach Pham,Quan Nguyen,Hoang M. Le,Quang-Vinh Dinh
机构: Ho Chi Minh University of Technology, VNU-HCM, Vietnam (胡志明市国立大学技术学院, 越南国家大学胡志明市校区); Michigan State University, USA (密歇根州立大学, 美国); LPL Financial, USA (LPL金融, 美国); Georgia State University, USA (乔治亚州立大学, 美国); Posts and Telecommunications Institute of Technology, Hanoi, Vietnam (河内邮电科技大学, 越南); York University, Canada (约克大学, 加拿大); AI VIETNAM Lab (AI越南实验室)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Nowadays, smartphones are ubiquitous, and almost everyone owns one. At the same time, the rapid development of AI has spurred extensive research on applying deep learning techniques to image classification. However, due to the limited resources available on mobile devices, significant challenges remain in balancing accuracy with computational efficiency. In this paper, we propose a novel training framework called Cycle Training, which adopts a three-stage training process that alternates between exploration and stabilization phases to optimize model performance. Additionally, we incorporate Semi-Supervised Domain Adaptation (SSDA) to leverage the power of large models and unlabeled data, thereby effectively expanding the training dataset. Comprehensive experiments on the CamSSD dataset for mobile scene detection demonstrate that our framework not only significantly improves classification accuracy but also ensures real-time inference efficiency. Specifically, our method achieves a 94.00% in Top-1 accuracy and a 99.17% in Top-3 accuracy and runs inference in just 1.61ms using CPU, demonstrating its suitability for real-world mobile deployment.
zh

[CV-150] owards Explainable Partial-AIGC Image Quality Assessment

【速读】：该论文旨在解决部分AI生成内容（Partial AI-Generated Content, PAI）图像的质量评估问题，这一领域在现有研究中几乎未被涉足。尽管已有大量关于AI生成图像（AI-Generated Images, AGIs）的整体质量评估的研究，但这些研究主要聚焦于完全由AI生成的图像（如文本到图像生成），而针对具有局部AI编辑的混合图像的质量评估缺乏系统性研究。论文的关键创新在于构建了一个大规模的部分AI生成图像数据集（EPAIQA-15K），包含15,000张具有不同区域局部AI操作的图像以及超过300,000个多维人工评分，并提出了一种基于大型多模态模型（Large Multimodal Models, LMMs）的三阶段模型训练范式。该范式通过逐步训练模型实现编辑区域定位、定量质量打分以及质量解释的能力，最终开发出具备可解释性反馈能力的EPAIQA系列模型，从而填补了部分AI生成图像质量评估领域的空白。

链接: https://arxiv.org/abs/2504.09291
作者: Jiaying Qian,Ziheng Jia,Zicheng Zhang,Zeyu Zhang,Guangtao Zhai,Xiongkuo Min
机构: Shanghai Jiaotong University (上海交通大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
备注:

点击查看摘要

Abstract:The rapid advancement of AI-driven visual generation technologies has catalyzed significant breakthroughs in image manipulation, particularly in achieving photorealistic localized editing effects on natural scene images (NSIs). Despite extensive research on image quality assessment (IQA) for AI-generated images (AGIs), most studies focus on fully AI-generated outputs (e.g., text-to-image generation), leaving the quality assessment of partial-AIGC images (PAIs)-images with localized AI-driven edits an almost unprecedented field. Motivated by this gap, we construct the first large-scale PAI dataset towards explainable partial-AIGC image quality assessment (EPAIQA), the EPAIQA-15K, which includes 15K images with localized AI manipulation in different regions and over 300K multi-dimensional human ratings. Based on this, we leverage large multi-modal models (LMMs) and propose a three-stage model training paradigm. This paradigm progressively trains the LMM for editing region grounding, quantitative quality scoring, and quality explanation. Finally, we develop the EPAIQA series models, which possess explainable quality feedback capabilities. Our work represents a pioneering effort in the perceptual IQA field for comprehensive PAI quality assessment.
zh

[CV-151] VideoAds for Fast-Paced Video Understanding: Where Opensource Foundation Models Beat GPT -4o Gemini-1.5 Pro

【速读】：该论文试图解决多模态大型语言模型（Multi-modal Large Language Models, MLLMs）在处理广告视频时面临的复杂性挑战。广告视频因其结构化叙事和快速场景转换，相较于其他同长度视频更为复杂，这对现有MLLMs提出了显著的技术难题。论文的关键解决方案是引入了一个名为VideoAds的新数据集，该数据集专为评估MLLMs在广告视频上的性能而设计。VideoAds包含精心策划的复杂时间结构广告视频，并辅以人工标注的多样化问题，覆盖视觉定位、视频摘要和视觉推理三大核心任务。通过定量分析其与现有基准数据集的对比，研究发现开源模型Qwen2.5-VL-72B在VideoAds上取得了73.35%的准确率，优于GPT-4o（66.82%）和Gemini-1.5 Pro（69.66%），特别是在视频摘要和推理任务上表现优异，但在视觉定位任务上稍逊于专有模型，而人类专家则达到了94.27%的高准确率。这些结果强调了提升MLLMs时间建模能力的重要性，并将VideoAds确立为未来视频理解研究的重要基准，尤其是需要高帧率采样的场景。

链接: https://arxiv.org/abs/2504.09282
作者: Zheyuan Zhang,Monica Dou,Linkai Peng,Hongyi Pan,Ulas Bagci,Boqing Gong
机构: Northwestern University; Boston University
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Advertisement videos serve as a rich and valuable source of purpose-driven information, encompassing high-quality visual, textual, and contextual cues designed to engage viewers. They are often more complex than general videos of similar duration due to their structured narratives and rapid scene transitions, posing significant challenges to multi-modal large language models (MLLMs). In this work, we introduce VideoAds, the first dataset tailored for benchmarking the performance of MLLMs on advertisement videos. VideoAds comprises well-curated advertisement videos with complex temporal structures, accompanied by \textbfmanually annotated diverse questions across three core tasks: visual finding, video summary, and visual reasoning. We propose a quantitative measure to compare VideoAds against existing benchmarks in terms of video complexity. Through extensive experiments, we find that Qwen2.5-VL-72B, an opensource MLLM, achieves 73.35% accuracy on VideoAds, outperforming GPT-4o (66.82%) and Gemini-1.5 Pro (69.66%); the two proprietary models especially fall behind the opensource model in video summarization and reasoning, but perform the best in visual finding. Notably, human experts easily achieve a remarkable accuracy of 94.27%. These results underscore the necessity of advancing MLLMs’ temporal modeling capabilities and highlight VideoAds as a potentially pivotal benchmark for future research in understanding video that requires high FPS sampling. The dataset and evaluation code will be publicly available at this https URL.
zh

[CV-152] Head-Aware KV Cache Compression for Efficient Visual Autoregressive Modeling

【速读】：该论文旨在解决视觉自回归（Visual Autoregressive, VAR）模型在推理过程中因大量KV缓存积累而导致的显著内存瓶颈问题。现有针对大型语言模型的KV缓存压缩技术在VAR模型中效果不佳，因为VAR模型包含两类独特的注意力头：结构化头（Structural Heads），通过对角线注意力模式保持空间一致性；上下文头（Contextual Heads），通过垂直注意力模式维持语义一致性。这些差异使得单一策略的KV缓存压缩方法对VAR模型无效。为此，论文提出了一种名为HACK（Head-Aware Compression for KV Cache）的无训练头感知压缩方法，其关键是为不同类型的注意力头分配非对称的缓存预算，并采用特定于模式的压缩策略，以适应每种类别头的关键特性。实验结果表明，HACK在文本到图像及类别条件生成任务中表现出色，能够在保持高质量生成的同时显著降低内存使用。

链接: https://arxiv.org/abs/2504.09261
作者: Ziran Qin,Youru Lv,Mingbao Lin,Zeren Zhang,Danping Zou,Weiyao Lin
机构: School of Electronic Information and Electrical Engineering, Shanghai Jiao Tong University(上海交通大学电子与电气工程学院); Skywork AI(Skywork AI); School of Mathematical Sciences, Peking University(北京大学数学科学学院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Visual Autoregressive (VAR) models have emerged as a powerful approach for multi-modal content creation, offering high efficiency and quality across diverse multimedia applications. However, they face significant memory bottlenecks due to extensive KV cache accumulation during inference. Existing KV cache compression techniques for large language models are suboptimal for VAR models due to, as we identify in this paper, two distinct categories of attention heads in VAR models: Structural Heads, which preserve spatial coherence through diagonal attention patterns, and Contextual Heads, which maintain semantic consistency through vertical attention patterns. These differences render single-strategy KV compression techniques ineffective for VAR models. To address this, we propose HACK, a training-free Head-Aware Compression method for KV cache. HACK allocates asymmetric cache budgets and employs pattern-specific compression strategies tailored to the essential characteristics of each head category. Experiments on Infinity-2B, Infinity-8B, and VAR-d30 demonstrate its effectiveness in text-to-image and class-conditional generation tasks. HACK can hack down up to 50% and 70% of cache with minimal performance degradation for VAR-d30 and Infinity-8B, respectively. Even with 70% and 90% KV cache compression in VAR-d30 and Infinity-8B, HACK still maintains high-quality generation while reducing memory usage by 44.2% and 58.9%, respectively.
zh

[CV-153] PathVLM-R1: A Reinforcement Learning-Driven Reasoning Model for Pathology Visual-Language Tasks

【速读】：该论文旨在解决病理图像诊断中自动化程度不足及现有视觉-语言模型（Vision-Language Models, VLMs）推理能力弱、缺乏有效监督的问题。为应对这些挑战，论文的关键创新在于提出了一种专门针对病理图像设计的视觉-语言模型PathVLM-R1。其解决方案的核心在于：首先通过病理数据引导的有监督微调赋予模型基础病理知识；其次引入基于分组相对策略优化（Group Relative Policy Optimization, GRPO）的双奖励驱动强化学习机制，严格约束跨模态推理过程的逻辑性和结果准确性。这种设计不仅提升了模型在病理图像问答任务中的精度，还显著增强了其迁移能力和泛化潜力。

链接: https://arxiv.org/abs/2504.09258
作者: Jianyu Wu,Hao Yang,Xinhua Zeng,Guibing He,Zhiyu Chen,Zihui Li,Xiaochuan Zhang,Yangyang Ma,Run Fang,Yang Liu
机构: Fudan University (复旦大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
备注:

点击查看摘要

Abstract:The diagnosis of pathological images is often limited by expert availability and regional disparities, highlighting the importance of automated diagnosis using Vision-Language Models (VLMs). Traditional multimodal models typically emphasize outcomes over the reasoning process, compromising the reliability of clinical decisions. To address the weak reasoning abilities and lack of supervised processes in pathological VLMs, we have innovatively proposed PathVLM-R1, a visual language model designed specifically for pathological images. We have based our model on Qwen2.5-VL-7B-Instruct and enhanced its performance for pathological tasks through meticulously designed post-training strategies. Firstly, we conduct supervised fine-tuning guided by pathological data to imbue the model with foundational pathological knowledge, forming a new pathological base model. Subsequently, we introduce Group Relative Policy Optimization (GRPO) and propose a dual reward-driven reinforcement learning optimization, ensuring strict constraint on logical supervision of the reasoning process and accuracy of results via cross-modal process reward and outcome accuracy reward. In the pathological image question-answering tasks, the testing results of PathVLM-R1 demonstrate a 14% improvement in accuracy compared to baseline methods, and it demonstrated superior performance compared to the Qwen2.5-VL-32B version despite having a significantly smaller parameter size. Furthermore, in out-domain data evaluation involving four medical imaging modalities: Computed Tomography (CT), dermoscopy, fundus photography, and Optical Coherence Tomography (OCT) images: PathVLM-R1’s transfer performance improved by an average of 17.3% compared to traditional SFT methods. These results clearly indicate that PathVLM-R1 not only enhances accuracy but also possesses broad applicability and expansion potential.
zh

[CV-154] FVQ: A Large-Scale Dataset and A LMM-based Method for Face Video Quality Assessment

【速读】：该论文旨在解决人脸视频质量评估（Face Video Quality Assessment, FVQA）领域因缺乏大规模数据集而研究不足的问题。论文的关键在于提出了首个大规模真实场景人脸视频质量评估数据集FVQ-20K，并构建了一种专门的FVQA方法FVQ-Rater。该方法通过精心提取多维特征（包括空间特征、时间特征及人脸特定特征），结合基于LoRA的指令微调技术实现针对质量特性的精细调整，从而实现接近人类的评分能力。这一尝试首次探索了大型多模态模型（Large Multimodal Models, LMMs）在FVQA任务中的潜力，为推动该领域的进展提供了重要支持。

链接: https://arxiv.org/abs/2504.09255
作者: Sijing Wu,Yunhao Li,Ziwen Xu,Yixuan Gao,Huiyu Duan,Wei Sun,Guangtao Zhai
机构: Shanghai Jiao Tong University (上海交通大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Face video quality assessment (FVQA) deserves to be explored in addition to general video quality assessment (VQA), as face videos are the primary content on social media platforms and human visual system (HVS) is particularly sensitive to human faces. However, FVQA is rarely explored due to the lack of large-scale FVQA datasets. To fill this gap, we present the first large-scale in-the-wild FVQA dataset, FVQ-20K, which contains 20,000 in-the-wild face videos together with corresponding mean opinion score (MOS) annotations. Along with the FVQ-20K dataset, we further propose a specialized FVQA method named FVQ-Rater to achieve human-like rating and scoring for face video, which is the first attempt to explore the potential of large multimodal models (LMMs) for the FVQA task. Concretely, we elaborately extract multi-dimensional features including spatial features, temporal features, and face-specific features (i.e., portrait features and face embeddings) to provide comprehensive visual information, and take advantage of the LoRA-based instruction tuning technique to achieve quality-specific fine-tuning, which shows superior performance on both FVQ-20K and CFVQA datasets. Extensive experiments and comprehensive analysis demonstrate the significant potential of the FVQ-20K dataset and FVQ-Rater method in promoting the development of FVQA.
zh

[CV-155] NoTeS-Bank: Benchmarking Neural Transcription and Search for Scientific Notes Understanding

【速读】：该论文试图解决学术手写笔记在文档人工智能（Document AI）领域中理解与推理的挑战，特别是针对数学公式、图表及科学符号等复杂内容。现有视觉问答（VQA）基准主要关注印刷体或结构化手写文本，难以泛化到真实场景中的笔记内容。论文的关键解决方案是引入NoTeS-Bank，这是一个基于笔记的问答神经转录与搜索评估基准，包含跨多个领域的复杂笔记数据，要求模型处理非结构化和多模态内容。NoTeS-Bank定义了两个任务：(1) 基于证据的VQA，模型需检索带有边界框证据的局部答案；(2) 开放域VQA，模型需先分类领域再检索相关文档与答案。与依赖光学字符识别（OCR）和结构化数据的经典文档VQA数据集不同，NoTeS-Bank强调视觉-语言融合、检索及多模态推理能力，并通过NDCG@5、MRR、Recall@K、IoU和ANLS等指标提供严格评估，确立了视觉文档理解和推理的新标准。

链接: https://arxiv.org/abs/2504.09249
作者: Aniket Pal,Sanket Biswas,Alloy Das,Ayush Lodh,Priyanka Banerjee,Soumitri Chattopadhyay,Dimosthenis Karatzas,Josep Llados,C.V. Jawahar
机构: CVIT Lab, IIIT Hyderabad(IIIT 海得拉巴计算视觉与图像技术实验室); Computer Vision Center, UAB(巴塞罗那自治大学计算机视觉中心); Habitat Labs, Habitat Lens Pvt. Ltd.(栖息地实验室，栖息地镜像私人有限公司); UNC Chapel Hill(北卡罗来纳大学教堂山分校); Josep Lladós; Dimosthenis Karatzas
类目: Computer Vision and Pattern Recognition (cs.CV); Information Retrieval (cs.IR); Machine Learning (cs.LG); Multimedia (cs.MM)
备注:

点击查看摘要

Abstract:Understanding and reasoning over academic handwritten notes remains a challenge in document AI, particularly for mathematical equations, diagrams, and scientific notations. Existing visual question answering (VQA) benchmarks focus on printed or structured handwritten text, limiting generalization to real-world note-taking. To address this, we introduce NoTeS-Bank, an evaluation benchmark for Neural Transcription and Search in note-based question answering. NoTeS-Bank comprises complex notes across multiple domains, requiring models to process unstructured and multimodal content. The benchmark defines two tasks: (1) Evidence-Based VQA, where models retrieve localized answers with bounding-box evidence, and (2) Open-Domain VQA, where models classify the domain before retrieving relevant documents and answers. Unlike classical Document VQA datasets relying on optical character recognition (OCR) and structured data, NoTeS-BANK demands vision-language fusion, retrieval, and multimodal reasoning. We benchmark state-of-the-art Vision-Language Models (VLMs) and retrieval frameworks, exposing structured transcription and reasoning limitations. NoTeS-Bank provides a rigorous evaluation with NDCG@5, MRR, Recall@K, IoU, and ANLS, establishing a new standard for visual document understanding and reasoning.
zh

[CV-156] Learning Occlusion-Robust Vision Transformers for Real-Time UAV Tracking

【速读】：该论文旨在解决单流架构中基于视觉Transformer（Vision Transformer, ViT）的无人机跟踪模型在面对障碍物（如建筑物和树木）导致的频繁遮挡时表现不佳的问题。现有方法缺乏有效处理遮挡的策略，而论文提出通过学习基于ViT的抗遮挡表示（Occlusion-Robust Representations, ORR）来增强单流ViT模型在空中跟踪任务中的遮挡鲁棒性。其关键在于引入空间Cox过程建模随机掩码操作，强制目标特征表示对于这种随机掩码具有不变性，从而近似模拟目标被遮挡的情况，使模型能够学习到对目标遮挡鲁棒的ViT。此框架被称为ORTrack。此外，为了实现实时应用，论文还提出了自适应基于特征的知识蒸馏（Adaptive Feature-Based Knowledge Distillation, AFKD）方法，以构建更紧凑的跟踪器ORTrack-D，该模型可根据任务难度自适应模仿教师模型ORTrack的行为，同时保持较高的效率。实验验证了所提方法的有效性，展示了其最先进的性能。

链接: https://arxiv.org/abs/2504.09228
作者: You Wu,Xucheng Wang,Xiangyang Yang,Mengyuan Liu,Dan Zeng,Hengzhou Ye,Shuiwang Li
机构: College of Computer Science and Engineering, Guilin University of Technology (桂林理工大学), China; School of Computer Science, Fudan University (复旦大学), Shanghai, China; School of Artificial Intelligence, Sun Yat-sen University (中山大学), Zhuhai, China
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Single-stream architectures using Vision Transformer (ViT) backbones show great potential for real-time UAV tracking recently. However, frequent occlusions from obstacles like buildings and trees expose a major drawback: these models often lack strategies to handle occlusions effectively. New methods are needed to enhance the occlusion resilience of single-stream ViT models in aerial tracking. In this work, we propose to learn Occlusion-Robust Representations (ORR) based on ViTs for UAV tracking by enforcing an invariance of the feature representation of a target with respect to random masking operations modeled by a spatial Cox process. Hopefully, this random masking approximately simulates target occlusions, thereby enabling us to learn ViTs that are robust to target occlusion for UAV tracking. This framework is termed ORTrack. Additionally, to facilitate real-time applications, we propose an Adaptive Feature-Based Knowledge Distillation (AFKD) method to create a more compact tracker, which adaptively mimics the behavior of the teacher model ORTrack according to the task’s difficulty. This student model, dubbed ORTrack-D, retains much of ORTrack’s performance while offering higher efficiency. Extensive experiments on multiple benchmarks validate the effectiveness of our method, demonstrating its state-of-the-art performance. Codes is available at this https URL.
zh

[CV-157] DL-QAT: Weight-Decomposed Low-Rank Quantization-Aware Training for Large Language Models

【速读】：该论文旨在解决大型语言模型（Large Language Models, LLMs）推理效率优化中的低比特量化挑战。传统后训练量化（Post-Training Quantization, PTQ）在低比特水平下常面临性能下降的问题，而量化感知训练（Quantization-Aware Training, QAT）虽能缓解此问题，但需要显著增加计算资源。为应对这一矛盾，论文提出了一种基于权重分解的低秩量化感知训练方法（Weight-Decomposed Low-Rank Quantization-Aware Training, DL-QAT）。其关键在于仅训练总参数量不到1%的参数，同时通过引入分组特定的量化幅度调整各量化组的整体尺度，并利用LoRA矩阵更新量化空间中的权重大小与方向，从而有效结合QAT的优势并降低资源消耗。实验验证表明，该方法在不同量化粒度下均优于基线方法，并在MMLU任务中提升了LLaMA-7B模型在3比特条件下的表现达4.2%。

链接: https://arxiv.org/abs/2504.09223
作者: Wenjin Ke,Zhe Li,Dong Li,Lu Tian,Emad Barsoum
机构: Advanced Micro Devices, Inc. (超威半导体公司)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Improving the efficiency of inference in Large Language Models (LLMs) is a critical area of research. Post-training Quantization (PTQ) is a popular technique, but it often faces challenges at low-bit levels, particularly in downstream tasks. Quantization-aware Training (QAT) can alleviate this problem, but it requires significantly more computational resources. To tackle this, we introduced Weight-Decomposed Low-Rank Quantization-Aware Training (DL-QAT), which merges the advantages of QAT while training only less than 1% of the total parameters. Specifically, we introduce a group-specific quantization magnitude to adjust the overall scale of each quantization group. Within each quantization group, we use LoRA matrices to update the weight size and direction in the quantization space. We validated the effectiveness of our method on the LLaMA and LLaMA2 model families. The results show significant improvements over our baseline method across different quantization granularities. For instance, for LLaMA-7B, our approach outperforms the previous state-of-the-art method by 4.2% in MMLU on 3-bit LLaMA-7B model. Additionally, our quantization results on pre-trained models also surpass previous QAT methods, demonstrating the superior performance and efficiency of our approach.
zh

[CV-158] Multi-scale Activation Refinement and Aggregation: Exploring Diverse Cues for Fine-Grained Bird Recognition AAAI2025

【速读】：该论文旨在解决Fine-Grained Bird Recognition (FGBR) 中因Vision Transformer (ViT) 模型受限于有限感受野而导致的表征能力不足以及对尺度变化敏感的问题。论文的关键在于提出了一种名为Multi-scale Diverse Cues Modeling (MDCM) 的新框架，通过在多尺度Vision Transformer (MS-ViT) 的不同阶段引入“激活-选择-聚合”范式的多样化线索建模方法，增强模型的多尺度特征表达能力。具体而言，MDCM 首先设计一个多尺度线索激活模块以确保各阶段学习到的判别性线索具有多样性；接着引入多尺度标记选择机制去除冗余噪声，并突出每阶段的判别性和特定尺度线索；最后，利用独立的多阶段识别结果并通过自适应的多尺度动态聚合机制融合这些结果，实现最终决策。实验结果验证了所提方法的有效性，显著优于基于CNN 和ViT 的现有方法。

链接: https://arxiv.org/abs/2504.09215
作者: Zhicheng Zhang,Hao Tang,Jinhui Tang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
备注: Accepted by AAAI2025

点击查看摘要

Abstract:Given the critical role of birds in ecosystems, Fine-Grained Bird Recognition (FGBR) has gained increasing attention, particularly in distinguishing birds within similar subcategories. Although Vision Transformer (ViT)-based methods often outperform Convolutional Neural Network (CNN)-based methods in FGBR, recent studies reveal that the limited receptive field of plain ViT model hinders representational richness and makes them vulnerable to scale variance. Thus, enhancing the multi-scale capabilities of existing ViT-based models to overcome this bottleneck in FGBR is a worthwhile pursuit. In this paper, we propose a novel framework for FGBR, namely Multi-scale Diverse Cues Modeling (MDCM), which explores diverse cues at different scales across various stages of a multi-scale Vision Transformer (MS-ViT) in an “Activation-Selection-Aggregation” paradigm. Specifically, we first propose a multi-scale cue activation module to ensure the discriminative cues learned at different stage are mutually different. Subsequently, a multi-scale token selection mechanism is proposed to remove redundant noise and highlight discriminative, scale-specific cues at each stage. Finally, the selected tokens from each stage are independently utilized for bird recognition, and the recognition results from multiple stages are adaptively fused through a multi-scale dynamic aggregation mechanism for final model decisions. Both qualitative and quantitative results demonstrate the effectiveness of our proposed MDCM, which outperforms CNN- and ViT-based models on several widely-used FGBR benchmarks.
zh

[CV-159] EchoMask: Speech-Queried Attention-based Mask Modeling for Holistic Co-Speech Motion Generation

【速读】：该论文旨在解决语音驱动的动作生成中，现有遮罩建模框架难以有效识别语义显著帧以进行高效动作遮罩的问题。解决方案的关键在于提出了一种基于语音查询注意力的遮罩建模框架，通过利用与动作对齐的语音特征来指导遮罩建模过程。具体而言，首先设计了一个动作-音频对齐模块（Motion-Aligned Module, MAM），构建潜在的动作-音频联合空间，并通过可学习的语音查询将低级和高级语音特征投影到该空间中，实现与动作对齐的语音表示。其次，引入一种语音查询注意力机制（Speech-Queried Attention, SQA），通过动作键值与语音查询之间的交互计算帧级注意力分数，从而引导对高注意力分数的动作帧进行选择性遮罩。最后，还将与动作对齐的语音特征注入生成网络，以促进高质量的语音驱动动作生成。定量和定性评估表明，该方法优于现有的最先进方法。

链接: https://arxiv.org/abs/2504.09209
作者: Xiangyue Zhang,Jianfang Li,Jiaxu Zhang,Jianqiang Ren,Liefeng Bo,Zhigang Tu
机构: Wuhan University (武汉大学); Tongyi Lab, Alibaba Group (通义实验室, 阿里巴巴集团)
类目: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV); Sound (cs.SD)
备注: 12 pages, 12 figures

点击查看摘要

Abstract:Masked modeling framework has shown promise in co-speech motion generation. However, it struggles to identify semantically significant frames for effective motion masking. In this work, we propose a speech-queried attention-based mask modeling framework for co-speech motion generation. Our key insight is to leverage motion-aligned speech features to guide the masked motion modeling process, selectively masking rhythm-related and semantically expressive motion frames. Specifically, we first propose a motion-audio alignment module (MAM) to construct a latent motion-audio joint space. In this space, both low-level and high-level speech features are projected, enabling motion-aligned speech representation using learnable speech queries. Then, a speech-queried attention mechanism (SQA) is introduced to compute frame-level attention scores through interactions between motion keys and speech queries, guiding selective masking toward motion frames with high attention scores. Finally, the motion-aligned speech features are also injected into the generation network to facilitate co-speech motion generation. Qualitative and quantitative evaluations confirm that our method outperforms existing state-of-the-art approaches, successfully producing high-quality co-speech motion.
zh

[CV-160] AerOSeg: Harnessing SAM for Open-Vocabulary Segmentation in Remote Sensing Images CVPR2025

【速读】：该论文致力于解决遥感图像分割中开放词汇（Open-Vocabulary）问题，即在推理过程中经常遇到新颖且未定义类别时的传统监督分割模型的泛化能力不足。现有大多数开放词汇分割（Open-Vocabulary Segmentation, OVS）方法主要针对自然图像设计，难以应对遥感数据中存在的尺度变化、方向改变及复杂场景构成等挑战。为解决这些问题，论文提出了一种名为AerOSeg的新方法。其关键在于通过多旋转版本输入图像与领域特定提示计算鲁棒的图像-文本关联特征，并经由空间和类别细化块进行优化；同时借鉴Segment Anything Model (SAM) 的成功经验，利用SAM特性引导空间关联特征的细化过程，引入语义反投影模块以确保SAM语义信息在整个分割流程中的无缝传播；最终借助多尺度注意力感知解码器增强细化后的关联特征，生成最终分割图。实验结果表明，该方法在三个基准遥感数据集上的表现优于当前最先进的开放词汇分割方法，平均提高了2.54 h-mIoU。

链接: https://arxiv.org/abs/2504.09203
作者: Saikat Dutta,Akhil Vasim,Siddhant Gole,Hamid Rezatofighi,Biplab Banerjee
机构: IITB-Monash Research Academy (IITB-蒙纳士研究学院); IIT Bombay (印度理工学院孟买分校); Monash University (蒙纳士大学); C-MInDS, IIT Bombay (IIT Bombay 计算机科学与工程系机器智能与数据科学中心)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Accepted at EarthVision workshop, CVPR 2025

点击查看摘要

Abstract:Image segmentation beyond predefined categories is a key challenge in remote sensing, where novel and unseen classes often emerge during inference. Open-vocabulary image Segmentation addresses these generalization issues in traditional supervised segmentation models while reducing reliance on extensive per-pixel annotations, which are both expensive and labor-intensive to obtain. Most Open-Vocabulary Segmentation (OVS) methods are designed for natural images but struggle with remote sensing data due to scale variations, orientation changes, and complex scene compositions. This necessitates the development of OVS approaches specifically tailored for remote sensing. In this context, we propose AerOSeg, a novel OVS approach for remote sensing data. First, we compute robust image-text correlation features using multiple rotated versions of the input image and domain-specific prompts. These features are then refined through spatial and class refinement blocks. Inspired by the success of the Segment Anything Model (SAM) in diverse domains, we leverage SAM features to guide the spatial refinement of correlation features. Additionally, we introduce a semantic back-projection module and loss to ensure the seamless propagation of SAM’s semantic information throughout the segmentation pipeline. Finally, we enhance the refined correlation features using a multi-scale attention-aware decoder to produce the final segmentation map. We validate our SAM-guided Open-Vocabulary Remote Sensing Segmentation model on three benchmark remote sensing datasets: iSAID, DLRSD, and OpenEarthMap. Our model outperforms state-of-the-art open-vocabulary segmentation methods, achieving an average improvement of 2.54 h-mIoU.
zh

[CV-161] From Visual Explanations to Counterfactual Explanations with Latent Diffusion WACV

【速读】：该论文旨在解决两个关键问题：一是确定哪些特定的反事实特征对于区分目标类与原始类的概念至关重要；二是为非鲁棒分类器提供有价值的解释，而无需依赖对抗性鲁棒模型的支持。论文的关键解决方案在于通过提供视觉解释的算法识别修改的关键区域，然后结合基于剪枝目标分类器对抗梯度的对抗攻击以及潜在扩散模型生成现实的反事实解释。这种方法在ImageNet和CelebA-HQ数据集上的多种评估标准中超越了先前最先进的结果，并能够对任意分类器应用，突出视觉与反事实解释之间的强关联，从目标分类器进行语义上有意义的改变，并向观察者提供微妙的反事实图像。

链接: https://arxiv.org/abs/2504.09202
作者: Tung Luu,Nam Le,Duc Le,Bac Le
机构: Faculty of Information Technology, University of Science, VNU-HCM (胡志明市国立大学大学科学信息技术学院); Vietnam National University, Ho Chi Minh City (胡志明市国立大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 2025 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV)

点击查看摘要

Abstract:Visual counterfactual explanations are ideal hypothetical images that change the decision-making of the classifier with high confidence toward the desired class while remaining visually plausible and close to the initial image. In this paper, we propose a new approach to tackle two key challenges in recent prominent works: i) determining which specific counterfactual features are crucial for distinguishing the “concept” of the target class from the original class, and ii) supplying valuable explanations for the non-robust classifier without relying on the support of an adversarially robust model. Our method identifies the essential region for modification through algorithms that provide visual explanations, and then our framework generates realistic counterfactual explanations by combining adversarial attacks based on pruning the adversarial gradient of the target classifier and the latent diffusion model. The proposed method outperforms previous state-of-the-art results on various evaluation criteria on ImageNet and CelebA-HQ datasets. In general, our method can be applied to arbitrary classifiers, highlight the strong association between visual and counterfactual explanations, make semantically meaningful changes from the target classifier, and provide observers with subtle counterfactual images.
zh

[CV-162] RT-DATR:Real-time Unsupervised Domain Adaptive Detection Transformer with Adversarial Feature Learning

【速读】：本文旨在解决实时 transformer 基础的目标检测器在跨域检测任务中的领域适应（Domain Adaptation）问题。现有方法直接应用传统的领域适应算法效果不佳，因此作者提出了一种名为 RT-DATR 的简单高效实时领域自适应检测 transformer。该方法的关键在于：首先引入局部目标级特征对齐模块，以显著增强对象迁移过程中的领域不变性特征表示；其次设计场景语义特征对齐模块，通过对齐场景语义特征提升跨域检测性能；最后引入领域查询并将其实现与目标查询解耦，在解码器层进一步对齐实例特征分布，缩小领域差距同时保持判别能力。实验结果表明，所提方法在多个基准数据集上优于当前最先进的方法。

链接: https://arxiv.org/abs/2504.09196
作者: Feng Lv,Chunlong Xia,Shuo Wang,Huo Cao
机构: Baidu Inc. (百度)(北京, 中国)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Despite domain-adaptive object detectors based on CNN and transformers have made significant progress in cross-domain detection tasks, it is regrettable that domain adaptation for real-time transformer-based detectors has not yet been explored. Directly applying existing domain adaptation algorithms has proven to be suboptimal. In this paper, we propose RT-DATR, a simple and efficient real-time domain adaptive detection transformer. Building on RT-DETR as our base detector, we first introduce a local object-level feature alignment module to significantly enhance the feature representation of domain invariance during object transfer. Additionally, we introduce a scene semantic feature alignment module designed to boost cross-domain detection performance by aligning scene semantic features. Finally, we introduced a domain query and decoupled it from the object query to further align the instance feature distribution within the decoder layer, reduce the domain gap, and maintain discriminative ability. Experimental results on various benchmarks demonstrate that our method outperforms current state-of-the-art approaches. Our code will be released soon.
zh

[CV-163] ReferGPT : Towards Zero-Shot Referring Multi-Object Tracking CVPR2025

【速读】：本文旨在解决基于文本查询的多目标跟踪问题，这是一个需要将语言理解与跨帧对象关联相结合的挑战性任务。现有方法通常采用端到端训练或在多目标跟踪器中集成额外的指代文本模块，但这些方法均依赖于有监督训练，并可能在开放集查询下泛化能力不足。为了解决这些问题，本文提出ReferGPT，这是一种新颖的零样本指代多目标跟踪框架。其关键在于引入了一个具有空间知识的多模态大语言模型（MLLM），使其能够生成三维感知的描述语句，从而增强描述能力和扩展指代词汇表，且无需训练。此外，本文还提出了一种鲁棒的查询匹配策略，利用CLIP驱动的语义编码和模糊匹配技术，将MLLM生成的描述与用户查询进行关联。实验结果表明，ReferGPT在Refer-KITTI、Refer-KITTIv2和Refer-KITTI+数据集上的表现具有竞争力，展示了其在自动驾驶场景中的鲁棒性和零样本能力。相关代码已开源。

链接: https://arxiv.org/abs/2504.09195
作者: Tzoulio Chamiti,Leandro Di Bella,Adrian Munteanu,Nikos Deligiannis
机构: ETRO Department, Vrije Universiteit Brussel (布鲁塞尔自由大学); imec (比利时微电子研究中心)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Accepted CVPR 2025 Workshop on Distillation of Foundation Models for Autonomous Driving

点击查看摘要

Abstract:Tracking multiple objects based on textual queries is a challenging task that requires linking language understanding with object association across frames. Previous works typically train the whole process end-to-end or integrate an additional referring text module into a multi-object tracker, but they both require supervised training and potentially struggle with generalization to open-set queries. In this work, we introduce ReferGPT, a novel zero-shot referring multi-object tracking framework. We provide a multi-modal large language model (MLLM) with spatial knowledge enabling it to generate 3D-aware captions. This enhances its descriptive capabilities and supports a more flexible referring vocabulary without training. We also propose a robust query-matching strategy, leveraging CLIP-based semantic encoding and fuzzy matching to associate MLLM generated captions with user queries. Extensive experiments on Refer-KITTI, Refer-KITTIv2 and Refer-KITTI+ demonstrate that ReferGPT achieves competitive performance against trained methods, showcasing its robustness and zero-shot capabilities in autonomous driving. The codes are available on this https URL
zh

[CV-164] SCFlow2: Plug-and-Play Object Pose Refiner with Shape-Constraint Scene Flow CVPR2025

【速读】：该论文试图解决现有6D物体姿态估计算法在细化阶段存在的两个主要问题：一是对应关系中的噪声影响，二是对新物体需要重新训练。为了解决这些问题，SCFlow2提出的关键方案是将几何约束引入到循环匹配网络的训练中，通过结合三维场景流中的刚体运动嵌入和目标物体的三维形状先验，同时在RGBD帧的迭代过程中将深度信息作为正则化项。这种设计使得SCFlow2能够作为一个即插即用的后处理模块，显著提升多种现有方法在包含新物体的BOP数据集上的性能，而无需任何额外的微调或重新训练。

链接: https://arxiv.org/abs/2504.09160
作者: Qingyuan Wang,Rui Song,Jiaojiao Li,Kerui Cheng,David Ferstl,Yinlin Hu
机构: State Key Laboratory of ISN, Xidian University (西安电子科技大学); Taiyuan University of Technology (太原理工大学); MagicLeap
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by CVPR 2025

点击查看摘要

Abstract:We introduce SCFlow2, a plug-and-play refinement framework for 6D object pose estimation. Most recent 6D object pose methods rely on refinement to get accurate results. However, most existing refinement methods either suffer from noises in establishing correspondences, or rely on retraining for novel objects. SCFlow2 is based on the SCFlow model designed for refinement with shape constraint, but formulates the additional depth as a regularization in the iteration via 3D scene flow for RGBD frames. The key design of SCFlow2 is an introduction of geometry constraints into the training of recurrent matching network, by combining the rigid-motion embeddings in 3D scene flow and 3D shape prior of the target. We train SCFlow2 on a combination of dataset Objaverse, GSO and ShapeNet, and evaluate on BOP datasets with novel objects. After using our method as a post-processing, most state-of-the-art methods produce significantly better results, without any retraining or fine-tuning. The source code is available at this https URL.
zh

[CV-165] LEREL: Lipschitz Continuity-Constrained Emotion Recognition Ensemble Learning For Electroencephalography

【速读】：该论文旨在解决基于脑电图（EEG）的情绪识别（Emotion Recognition, EER）方法面临的三大挑战：模型稳定性不足、处理高维非线性EEG信号的准确性有限，以及对受试者内部变异性和信号噪声的鲁棒性较差。为应对这些挑战，论文提出了一种名为LEREL（Lipschitz连续性约束的情绪识别集成学习）的新框架。其关键是通过引入Lipschitz连续性约束提升模型的稳定性和泛化能力，同时降低信号变异性与噪声敏感性，并在小样本数据集上保持高性能；此外，集成学习策略通过多分类器决策融合进一步减少单模型偏差与方差，优化整体性能。实验结果验证了LEREL的有效性，在三个公开基准数据集上的平均识别准确率分别达到76.43%、83.00%和89.22%。

链接: https://arxiv.org/abs/2504.09156
作者: Shengyu Gong,Yueyang Li,Zijian Kang,Weiming Zeng,Hongjie Yan,Wai Ting Siok,Nizhuan Wang
机构: Lab of Digital Image and Intelligent Computation, Shanghai Maritime University (上海海事大学实验室, 上海海事大学); Department of Neurology, Affiliated Lianyungang Hospital of Xuzhou Medical University (徐州医科大学附属连云港医院神经科); Department of Chinese and Bilingual Studies, The Hong Kong Polytechnic University (香港理工大学中文及双语研究系)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Accurate and efficient perception of emotional states in oneself and others is crucial, as emotion-related disorders are associated with severe psychosocial impairments. While electroencephalography (EEG) offers a powerful tool for emotion detection, current EEG-based emotion recognition (EER) methods face key limitations: insufficient model stability, limited accuracy in processing high-dimensional nonlinear EEG signals, and poor robustness against intra-subject variability and signal noise. To address these challenges, we propose LEREL (Lipschitz continuity-constrained Emotion Recognition Ensemble Learning), a novel framework that significantly enhances both the accuracy and robustness of emotion recognition performance. The LEREL framework employs Lipschitz continuity constraints to enhance model stability and generalization in EEG emotion recognition, reducing signal variability and noise susceptibility while maintaining strong performance on small-sample datasets. The ensemble learning strategy reduces single-model bias and variance through multi-classifier decision fusion, further optimizing overall performance. Experimental results on three public benchmark datasets (EAV, FACED and SEED) demonstrate LEREL’s effectiveness, achieving average recognition accuracies of 76.43%, 83.00% and 89.22%, respectively.
zh

[CV-166] Evolved Hierarchical Masking for Self-Supervised Learning

【速读】：该论文旨在解决现有Masked Image Modeling方法因固定掩码模式导致视觉线索建模受限的问题。解决方案的关键在于提出了一种演进的分层掩码方法（evolved hierarchical masking），通过让正在训练的视觉模型解析输入的视觉线索并构建层次结构，从而动态生成适应不同训练阶段的掩码模式。这种方法使掩码模式能够从关注低级视觉特征（如基础纹理）逐步演化到捕获高级语义信息（如复杂对象语义和上下文），以实现更通用的视觉线索建模。此外，该方法无需额外的预训练模型或标注，并通过调整训练难度来保证效率。实验结果表明，该方法在多种下游任务中显著提升了性能。

链接: https://arxiv.org/abs/2504.09155
作者: Zhanzhou Feng,Shiliang Zhang
机构: National Key Laboratory for Multimedia Information Processing, School of Computer Science, Peking University (北京大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Existing Masked Image Modeling methods apply fixed mask patterns to guide the self-supervised training. As those mask patterns resort to different criteria to depict image contents, sticking to a fixed pattern leads to a limited vision cues modeling this http URL paper introduces an evolved hierarchical masking method to pursue general visual cues modeling in self-supervised learning. The proposed method leverages the vision model being trained to parse the input visual cues into a hierarchy structure, which is hence adopted to generate masks accordingly. The accuracy of hierarchy is on par with the capability of the model being trained, leading to evolved mask patterns at different training stages. Initially, generated masks focus on low-level visual cues to grasp basic textures, then gradually evolve to depict higher-level cues to reinforce the learning of more complicated object semantics and contexts. Our method does not require extra pre-trained models or annotations and ensures training efficiency by evolving the training difficulty. We conduct extensive experiments on seven downstream tasks including partial-duplicate image retrieval relying on low-level details, as well as image classification and semantic segmentation that require semantic parsing capability. Experimental results demonstrate that it substantially boosts performance across these tasks. For instance, it surpasses the recent MAE by 1.1% in imageNet-1K classification and 1.4% in ADE20K segmentation with the same training epochs. We also align the proposed method with the current research focus on LLMs. The proposed approach bridges the gap with large-scale pre-training on semantic demanding tasks and enhances intricate detail perception in tasks requiring low-level feature recognition.
zh

[CV-167] MASH: Masked Anchored SpHerical Distances for 3D Shape Representation and Generation SIGGRAPH2025

【速读】：该论文旨在解决三维形状表示与学习中的多视图感知理解问题，提出了一种新颖的多视图参数化三维形状表示方法——Masked Anchored Spherical Distances (MASH)。论文的关键在于将三维形状表示为一组可观测的局部表面片段集合，每个片段由从锚点（anchor point）出发的球面距离函数定义，并通过球谐函数（spherical harmonics）的紧凑编码结合参数化的视锥（view cone）掩模来实现空间局域性。此外，作者开发了一种可微优化算法，能够精确地将点云转换为MASH表示，从而逼近任意几何和拓扑结构的真实表面。这一方案的独特之处在于同时融合了隐式与显式特征，使其在表面重建、形状生成、补全及混合等任务中表现出色。

链接: https://arxiv.org/abs/2504.09149
作者: Changhao Li,Yu Xin,Xiaowei Zhou,Ariel Shamir,Hao Zhang,Ligang Liu,Ruizhen Hu
机构: University of Science and Technology of China(中国科学技术大学) Hefei(合肥) China(中国); State Key Laboratory of CAD & CG, Zhejiang University(浙江大学计算机辅助设计与图形学国家重点实验室) Hangzhou(杭州) China(中国); Reichman University(雷赫曼大学) Herzliya(赫兹利亚) Israel(以色列); Simon Fraser University(西蒙弗雷泽大学) Vancouver(温哥华) Canada(加拿大); Shenzhen University(深圳大学) Shenzhen(深圳) China(中国)
类目: Computer Vision and Pattern Recognition (cs.CV); Computational Geometry (cs.CG)
备注: 11 pages, 11 figures, SIGGRAPH 2025 Accept - Conference

点击查看摘要

Abstract:We introduce Masked Anchored SpHerical Distances (MASH), a novel multi-view and parametrized representation of 3D shapes. Inspired by multi-view geometry and motivated by the importance of perceptual shape understanding for learning 3D shapes, MASH represents a 3D shape as a collection of observable local surface patches, each defined by a spherical distance function emanating from an anchor point. We further leverage the compactness of spherical harmonics to encode the MASH functions, combined with a generalized view cone with a parameterized base that masks the spatial extent of the spherical function to attain locality. We develop a differentiable optimization algorithm capable of converting any point cloud into a MASH representation accurately approximating ground-truth surfaces with arbitrary geometry and topology. Extensive experiments demonstrate that MASH is versatile for multiple applications including surface reconstruction, shape generation, completion, and blending, achieving superior performance thanks to its unique representation encompassing both implicit and explicit features.
zh

[CV-168] A Constrained Optimization Approach for Gaussian Splatting from Coarsely-posed Images and Noisy Lidar Point Clouds

【速读】：本文旨在解决3D Gaussian Splatting (3DGS) 在实际场景和大尺度场景重建中因依赖于耗时的Structure-from-Motion (SfM) 初始化而导致的应用限制问题。为实现无需SfM支持的相机姿态估计与三维重建同步优化，论文提出了一种约束优化方法。该方法的关键在于将相机姿态分解为相机到设备中心及设备中心到世界坐标系的两阶段优化，并引入针对参数组敏感性设计的两种优化约束以缩小搜索空间。此外，通过直接从噪声点云中学习场景几何结构，论文进一步提出了几何约束以提升重建质量。实验表明，所提方法在自收集数据集及两个公开基准上显著优于现有基于多模态3DGS的基线及COLMAP辅助方法。

链接: https://arxiv.org/abs/2504.09129
作者: Jizong Peng,Tze Ho Elden Tse,Kai Xu,Wenchao Gao,Angela Yao
机构: dConstruct Robotics (dConstruct Robotics); National University of Singapore (新加坡国立大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:3D Gaussian Splatting (3DGS) is a powerful reconstruction technique, but it needs to be initialized from accurate camera poses and high-fidelity point clouds. Typically, the initialization is taken from Structure-from-Motion (SfM) algorithms; however, SfM is time-consuming and restricts the application of 3DGS in real-world scenarios and large-scale scene reconstruction. We introduce a constrained optimization method for simultaneous camera pose estimation and 3D reconstruction that does not require SfM support. Core to our approach is decomposing a camera pose into a sequence of camera-to-(device-)center and (device-)center-to-world optimizations. To facilitate, we propose two optimization constraints conditioned to the sensitivity of each parameter group and restricts each parameter’s search space. In addition, as we learn the scene geometry directly from the noisy point clouds, we propose geometric constraints to improve the reconstruction quality. Experiments demonstrate that the proposed method significantly outperforms the existing (multi-modal) 3DGS baseline and methods supplemented by COLMAP on both our collected dataset and two public benchmarks.
zh

[CV-169] Probability Distribution Alignment and Low-Rank Weight Decomposition for Source-Free Domain Adaptive Brain Decoding

【速读】：该论文旨在解决脑解码面临的三大挑战：个体差异、模态对齐以及高维嵌入带来的计算负担。针对个体差异，现有方法依赖源主体数据，但存在隐私泄露和数据存储压力等问题；在模态对齐方面，当前工作主要关注Softmax概率分布的对齐，而忽略了边缘概率分布的对齐，导致模态错位；同时，图像与文本分别与功能磁共振成像（fMRI）对齐时未充分考虑两者间的复杂交互关系，造成图像重建效果不佳。此外，CLIP嵌入的高维特性带来了显著的计算开销，尽管可以通过减少图像分割块数量和文本标记数量降低维度，但这会严重损害模型性能，形成权衡困境。为此，论文提出了一种基于无源域适应的脑解码框架，其关键是通过无源方式实现跨模态对齐，同时优化计算效率与模型性能之间的平衡。

链接: https://arxiv.org/abs/2504.09109
作者: Ganxi Xu,Jinyi Long,Hanrui Wu,Jia Zhang
机构: Jinan University (暨南大学); Jinan University (暨南大学); Jinan University (暨南大学); Jinan University (暨南大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Brain decoding currently faces significant challenges in individual differences, modality alignment, and high-dimensional embeddings. To address individual differences, researchers often use source subject data, which leads to issues such as privacy leakage and heavy data storage burdens. In modality alignment, current works focus on aligning the softmax probability distribution but neglect the alignment of marginal probability distributions, resulting in modality misalignment. Additionally, images and text are aligned separately with fMRI without considering the complex interplay between images and text, leading to poor image reconstruction. Finally, the enormous dimensionality of CLIP embeddings causes significant computational costs. Although the dimensionality of CLIP embeddings can be reduced by ignoring the number of patches obtained from images and the number of tokens acquired from text, this comes at the cost of a significant drop in model performance, creating a dilemma. To overcome these limitations, we propose a source-free domain adaptation-based brain decoding framework
zh

[CV-170] Multi-modal and Multi-view Fundus Image Fusion for Retinopathy Diagnosis via Multi-scale Cross-attention and Shifted Window Self-attention

【速读】：本文旨在解决多模态和多视图眼底图像联合解释的问题，以支持视网膜病变的预防。不同视图可以展示完整的眼球三维视野，而不同模态则提供互补的病灶区域。然而，与单张图像相比，多模态和多视图眼底图像序列中的关系包含长距离依赖的病灶特征。为了全面挖掘病灶区域并检测特定模态的病灶，论文设计了一种基于多尺度交叉注意力的多模态眼底图像融合方法，解决了先前基于注意力机制的多模态医学融合方法中存在的静态感受野问题。此外，为捕捉不同视图间的相对位置关系并融合全面的病灶特征，论文提出了基于位移窗口自注意力的多视图眼底图像融合方法，有效缓解了传统基于自注意力机制的多视图融合方法计算复杂度随图像尺寸和数量呈平方增长的问题。最终，通过结合这两种融合方法，设计了一个多任务视网膜病变诊断框架，帮助眼科医生减轻工作负担并提高诊断准确性。实验结果表明，所提出的方法在分类任务中达到了82.53%的准确率，在报告生成任务中BlEU-1得分为0.543，展示了其在临床实践中提升视网膜病变诊断效率和可靠性的潜力。

链接: https://arxiv.org/abs/2504.09106
作者: Yonghao Huang,Leiting Chen,Chuan Zhou
机构: Key Laboratory of Intelligent Digital Media Technology of Sichuan Province, University of Electronic Science and Technology of China (电子科技大学); School of Computer Science and Engineering, University of Electronic Science and Technology of China (电子科技大学); School of Information and Software Engineering, University of Electronic Science and Technology of China (电子科技大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:The joint interpretation of multi-modal and multi-view fundus images is critical for retinopathy prevention, as different views can show the complete 3D eyeball field and different modalities can provide complementary lesion areas. Compared with single images, the sequence relationships in multi-modal and multi-view fundus images contain long-range dependencies in lesion features. By modeling the long-range dependencies in these sequences, lesion areas can be more comprehensively mined, and modality-specific lesions can be detected. To learn the long-range dependency relationship and fuse complementary multi-scale lesion features between different fundus modalities, we design a multi-modal fundus image fusion method based on multi-scale cross-attention, which solves the static receptive field problem in previous multi-modal medical fusion methods based on attention. To capture multi-view relative positional relationships between different views and fuse comprehensive lesion features between different views, we design a multi-view fundus image fusion method based on shifted window self-attention, which also solves the computational complexity of the multi-view fundus fusion method based on self-attention is quadratic to the size and number of multi-view fundus images. Finally, we design a multi-task retinopathy diagnosis framework to help ophthalmologists reduce workload and improve diagnostic accuracy by combining the proposed two fusion methods. The experimental results of retinopathy classification and report generation tasks indicate our method’s potential to improve the efficiency and reliability of retinopathy diagnosis in clinical practice, achieving a classification accuracy of 82.53% and a report generation BlEU-1 of 0.543.
zh

[CV-171] BIGS: Bimanual Category-agnostic Interaction Reconstruction from Monocular Videos via 3D Gaussian Splatting CVPR2025

【速读】：该论文致力于解决从单目 RGB 视频中重建无类别先验的双手交互（Bimanual Class-Agnostic Interaction Reconstruction）这一根本性问题，特别是当两只手与未知物体相互作用时。此前的研究主要集中在有限的手物交互场景，例如已知物体模板或仅一只手参与交互。然而，双手交互重建面临由双手与物体复杂交互引起的严重遮挡问题。为了解决这一挑战，论文提出了一种名为 BIGS (Bimanual Interaction 3D Gaussian Splatting) 的方法，通过从单目视频中重建双手和未知物体的三维高斯分布来实现交互重建。其关键在于利用预训练扩散模型的先验知识结合分数蒸馏采样（Score Distillation Sampling, SDS）损失，以鲁棒地获取避免严重遮挡的物体高斯分布，并通过共享单一高斯分布有效积累双手的三维信息。此外，在高斯优化过程中引入交互主体优化步骤，以进一步考虑双手与物体之间的三维对齐关系。最终，该方法在两个具有挑战性的数据集上实现了最先进的精度，包括三维手部姿态估计（MPJPE）、三维物体重建（CDh, CDo, F10）以及渲染质量（PSNR, SSIM, LPIPS）。

链接: https://arxiv.org/abs/2504.09097
作者: Jeongwan On,Kyeonghwan Gwak,Gunyoung Kang,Junuk Cha,Soohyun Hwang,Hyein Hwang,Seungryul Baek
机构: UNIST (蔚山科学技术院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to CVPR 2025

点击查看摘要

Abstract:Reconstructing 3Ds of hand-object interaction (HOI) is a fundamental problem that can find numerous applications. Despite recent advances, there is no comprehensive pipeline yet for bimanual class-agnostic interaction reconstruction from a monocular RGB video, where two hands and an unknown object are interacting with each other. Previous works tackled the limited hand-object interaction case, where object templates are pre-known or only one hand is involved in the interaction. The bimanual interaction reconstruction exhibits severe occlusions introduced by complex interactions between two hands and an object. To solve this, we first introduce BIGS (Bimanual Interaction 3D Gaussian Splatting), a method that reconstructs 3D Gaussians of hands and an unknown object from a monocular video. To robustly obtain object Gaussians avoiding severe occlusions, we leverage prior knowledge of pre-trained diffusion model with score distillation sampling (SDS) loss, to reconstruct unseen object parts. For hand Gaussians, we exploit the 3D priors of hand model (i.e., MANO) and share a single Gaussian for two hands to effectively accumulate hand 3D information, given limited views. To further consider the 3D alignment between hands and objects, we include the interacting-subjects optimization step during Gaussian optimization. Our method achieves the state-of-the-art accuracy on two challenging datasets, in terms of 3D hand pose estimation (MPJPE), 3D object reconstruction (CDh, CDo, F10), and rendering quality (PSNR, SSIM, LPIPS), respectively.
zh

[CV-172] RICCARDO: Radar Hit Prediction and Convolution for Camera-Radar 3D Object Detection CVPR2025

【速读】：该论文旨在解决雷达与相机融合检测中因目标边界及内部雷达回波分布复杂性导致的性能瓶颈问题。当前方法通常依赖黑盒神经网络隐式处理这一问题，而本文的关键创新在于显式引入雷达回波分布模型辅助融合。具体而言，首先构建一个基于单目检测器提取的目标属性预测雷达回波分布的模型；其次，利用预测的分布作为核函数匹配附近的实际雷达点，生成匹配分数；最后，结合上下文与核检测器优化匹配分数。这种方法在nuScenes数据集上实现了雷达-相机检测的最先进性能。

链接: https://arxiv.org/abs/2504.09086
作者: Yunfei Long,Abhinav Kumar,Xiaoming Liu,Daniel Morris
机构: Michigan State University (密歇根州立大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: CVPR 2025

点击查看摘要

Abstract:Radar hits reflect from points on both the boundary and internal to object outlines. This results in a complex distribution of radar hits that depends on factors including object category, size, and orientation. Current radar-camera fusion methods implicitly account for this with a black-box neural network. In this paper, we explicitly utilize a radar hit distribution model to assist fusion. First, we build a model to predict radar hit distributions conditioned on object properties obtained from a monocular detector. Second, we use the predicted distribution as a kernel to match actual measured radar points in the neighborhood of the monocular detections, generating matching scores at nearby positions. Finally, a fusion stage combines context with the kernel detector to refine the matching scores. Our method achieves the state-of-the-art radar-camera detection performance on nuScenes. Our source code is available at this https URL.
zh

[CV-173] Using Vision Language Models for Safety Hazard Identification in Construction

【速读】：该论文旨在解决现有基于计算机视觉的施工现场安全隐患识别方法难以处理上下文特定隐患以及适应多变的工地规范和条件的问题。这些方法通常专注于检测预定义的单一实体，而未能理解其空间关系与交互，从而限制了它们在复杂施工环境中的泛化能力和对未知风险的适应性。为了解决这些问题，论文提出并验证了一种基于视觉语言模型（Vision Language Model, VLM）的框架，用于施工现场隐患识别。该框架的关键在于引入了一个提示工程模块，将安全指南结构化为上下文查询，使VLM能够处理视觉信息并生成符合规范指导的安全隐患评估。通过使用包含1100张施工现场图像的自定义数据集对最先进的VLM进行评估，结果表明GPT-4o和Gemini 1.5 Pro在性能上优于其他选项，并展现出较高的BERTScore，证明了其在识别通用及上下文特定隐患方面的有效性。然而，处理时间仍然是一个显著挑战，影响了其实时可行性。

链接: https://arxiv.org/abs/2504.09083
作者: Muhammad Adil,Gaang Lee,Vicente A. Gonzalez,Qipei Mei
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Safety hazard identification and prevention are the key elements of proactive safety management. Previous research has extensively explored the applications of computer vision to automatically identify hazards from image clips collected from construction sites. However, these methods struggle to identify context-specific hazards, as they focus on detecting predefined individual entities without understanding their spatial relationships and interactions. Furthermore, their limited adaptability to varying construction site guidelines and conditions hinders their generalization across different projects. These limitations reduce their ability to assess hazards in complex construction environments and adaptability to unseen risks, leading to potential safety gaps. To address these challenges, we proposed and experimentally validated a Vision Language Model (VLM)-based framework for the identification of construction hazards. The framework incorporates a prompt engineering module that structures safety guidelines into contextual queries, allowing VLM to process visual information and generate hazard assessments aligned with the regulation guide. Within this framework, we evaluated state-of-the-art VLMs, including GPT-4o, Gemini, Llama 3.2, and InternVL2, using a custom dataset of 1100 construction site images. Experimental results show that GPT-4o and Gemini 1.5 Pro outperformed alternatives and displayed promising BERTScore of 0.906 and 0.888 respectively, highlighting their ability to identify both general and context-specific hazards. However, processing times remain a significant challenge, impacting real-time feasibility. These findings offer insights into the practical deployment of VLMs for construction site hazard detection, thereby contributing to the enhancement of proactive safety management.
zh

[CV-174] A Visual Self-attention Mechanism Facial Expression Recognition Network beyond Convnext

【速读】：本文旨在解决面部表情识别（FER）领域中因数据集分布不均、不同类别表情相似性以及同一类别在不同受试者间差异所导致的挑战。为提升FER在困难条件下的准确性，论文提出了一种基于截断ConvNeXt方法（Conv-cut）的视觉面部表情信号特征处理网络。其关键是采用截断的ConvNeXt-Base作为特征提取器，并设计了细节提取块以提取精细特征，同时引入自注意力机制（Self-Attention），使网络能够更有效地学习这些特征。实验结果表明，该模型在RAF-DB和FERPlus数据集上达到了最先进的性能。

链接: https://arxiv.org/abs/2504.09077
作者: Bingyu Nan,Feng Liu,Xuezhong Qian,Wei Song
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Facial expression recognition is an important research direction in the field of artificial intelligence. Although new breakthroughs have been made in recent years, the uneven distribution of datasets and the similarity between different categories of facial expressions, as well as the differences within the same category among different subjects, remain challenges. This paper proposes a visual facial expression signal feature processing network based on truncated ConvNeXt approach(Conv-cut), to improve the accuracy of FER under challenging conditions. The network uses a truncated ConvNeXt-Base as the feature extractor, and then we designed a Detail Extraction Block to extract detailed features, and introduced a Self-Attention mechanism to enable the network to learn the extracted features more effectively. To evaluate the proposed Conv-cut approach, we conducted experiments on the RAF-DB and FERPlus datasets, and the results show that our model has achieved state-of-the-art performance. Our code could be accessed at Github.
zh

[CV-175] Exploring Synergistic Ensemble Learning: Uniting CNNs MLP-Mixers and Vision Transformers to Enhance Image Classification

【速读】：该论文旨在解决如何更系统地挖掘不同神经网络架构（如卷积神经网络 CNNs、MLP-mixers 和视觉Transformer）之间的互补性，并提升图像分类任务的整体性能。论文的关键创新在于不通过试探性的模块拼接，而是采用保持各架构完整性的集成方法来结合不同架构的优点。通过这种隐式的隔离方式，论文不仅深入揭示了各架构的独特优势及其内在互补性，还展示了即使采用简单的集成策略，也能显著超越由相似架构组成的集成模型。最终，该方法在 ImageNet 数据集上构建了一个超越现有单模型最佳表现的分类网络集成，同时降低了整体延迟，为未来探索不同架构及其集成的独特优势与协同效应提供了坚实的基础。

链接: https://arxiv.org/abs/2504.09076
作者: Mk Bashar,Ocean Monjur,Samia Islam,Mohammad Galib Shams,Niamul Quader
机构: Michigan State University (密歇根州立大学); University of Illinois Urbana-Champaign (伊利诺伊大学厄巴纳-香槟分校); Islamic University of Technology (伊斯兰科技大学); Motional (Motional)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:In recent years, Convolutional Neural Networks (CNNs), MLP-mixers, and Vision Transformers have risen to prominence as leading neural architectures in image classification. Prior research has underscored the distinct advantages of each architecture, and there is growing evidence that combining modules from different architectures can boost performance. In this study, we build upon and improve previous work exploring the complementarity between different architectures. Instead of heuristically merging modules from various architectures through trial and error, we preserve the integrity of each architecture and combine them using ensemble techniques. By maintaining the distinctiveness of each architecture, we aim to explore their inherent complementarity more deeply and with implicit isolation. This approach provides a more systematic understanding of their individual strengths. In addition to uncovering insights into architectural complementarity, we showcase the effectiveness of even basic ensemble methods that combine models from diverse architectures. These methods outperform ensembles comprised of similar architectures. Our straightforward ensemble framework serves as a foundational strategy for blending complementary architectures, offering a solid starting point for further investigations into the unique strengths and synergies among different architectures and their ensembles in image classification. A direct outcome of this work is the creation of an ensemble of classification networks that surpasses the accuracy of the previous state-of-the-art single classification network on ImageNet, setting a new benchmark, all while requiring less overall latency. Subjects: Computer Vision and Pattern Recognition (cs.CV) Cite as: arXiv:2504.09076 [cs.CV] (or arXiv:2504.09076v1 [cs.CV] for this version) https://doi.org/10.48550/arXiv.2504.09076 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
zh

[CV-176] UniFlowRestore: A General Video Restoration Framework via Flow Matching and Prompt Guidance

【速读】：该论文旨在解决视频成像在复杂退化（如模糊、噪声和压缩伪影）下的恢复问题，传统方法受限于“单任务单模型”的范式，导致泛化能力差且计算成本高，难以适应真实场景中多样化的退化类型。论文提出了一种名为UniFlowRestore的通用视频恢复框架，其关键是通过一个受提示引导且基于物理信息的矢量场建模恢复过程为时间连续演化。该框架利用PhysicsUNet编码退化先验作为势能，并通过PromptGenerator生成与任务相关的提示作为动量，定义了一个哈密顿系统，集成了惯性动力学、衰减的物理梯度以及基于提示的指导。系统通过固定步长的常微分方程（ODE）求解器进行优化，实现了跨任务的高效统一恢复。实验结果表明，UniFlowRestore在去噪任务中达到最高的峰值信噪比（PSNR，33.89 dB）和结构相似性指数（SSIM，0.97），并在所有评估任务中表现出强大的泛化能力和效率。

链接: https://arxiv.org/abs/2504.09069
作者: Shuning Sun,Yu Zhang,Chen Wu,Dianjie Lu,Dianjie Lu,Guijuan Zhan,Yang Weng,Zhuoran Zheng
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Video imaging is often affected by complex degradations such as blur, noise, and compression artifacts. Traditional restoration methods follow a “single-task single-model” paradigm, resulting in poor generalization and high computational cost, limiting their applicability in real-world scenarios with diverse degradation types. We propose UniFlowRestore, a general video restoration framework that models restoration as a time-continuous evolution under a prompt-guided and physics-informed vector field. A physics-aware backbone PhysicsUNet encodes degradation priors as potential energy, while PromptGenerator produces task-relevant prompts as momentum. These components define a Hamiltonian system whose vector field integrates inertial dynamics, decaying physical gradients, and prompt-based guidance. The system is optimized via a fixed-step ODE solver to achieve efficient and unified restoration across tasks. Experiments show that UniFlowRestore delivers stateof-the-art performance with strong generalization and efficiency. Quantitative results demonstrate that UniFlowRestore achieves state-of-the-art performance, attaining the highest PSNR (33.89 dB) and SSIM (0.97) on the video denoising task, while maintaining top or second-best scores across all evaluated tasks.
zh

[CV-177] Hyperlocal disaster damage assessment using bi-temporal street-view imagery and pre-trained vision models

【速读】：该论文旨在解决利用街景图像进行高空间分辨率下的灾害损毁评估问题，特别是探索时间序列街景图像在灾前与灾后对比分析中的潜力。传统方法主要关注灾后街景图像的分析，而忽视了灾前图像的价值。论文的关键在于将灾前街景图像作为“无损”类别引入模型微调，并设计了一种双通道算法来同时处理灾前和灾后成对街景图像。通过使用预训练视觉模型（如Swin Transformer和ConvNeXt）结合灾前图像，显著提升了灾害等级分类的准确性，从基于Swin Transformer的基线准确率66.14%提升至采用双通道特征融合ConvNeXt模型的77.11%。这一解决方案的核心在于充分利用灾前图像提供的基准信息，提高标注数据集的可靠性，并增强模型在精细化损毁评估中的性能。

链接: https://arxiv.org/abs/2504.09066
作者: Yifan Yang,Lei Zou,Bing Zhou,Daoyang Li,Binbin Lin,Joynal Abedin,Mingzheng Yang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 27 pages,9 figures

点击查看摘要

Abstract:Street-view images offer unique advantages for disaster damage estimation as they capture impacts from a visual perspective and provide detailed, on-the-ground insights. Despite several investigations attempting to analyze street-view images for damage estimation, they mainly focus on post-disaster images. The potential of time-series street-view images remains underexplored. Pre-disaster images provide valuable benchmarks for accurate damage estimations at building and street levels. These images could aid annotators in objectively labeling post-disaster impacts, improving the reliability of labeled data sets for model training, and potentially enhancing the model performance in damage evaluation. The goal of this study is to estimate hyperlocal, on-the-ground disaster damages using bi-temporal street-view images and advanced pre-trained vision models. Street-view images before and after 2024 Hurricane Milton in Horseshoe Beach, Florida, were collected for experiments. The objectives are: (1) to assess the performance gains of incorporating pre-disaster street-view images as a no-damage category in fine-tuning pre-trained models, including Swin Transformer and ConvNeXt, for damage level classification; (2) to design and evaluate a dual-channel algorithm that reads pair-wise pre- and post-disaster street-view images for hyperlocal damage assessment. The results indicate that incorporating pre-disaster street-view images and employing a dual-channel processing framework can significantly enhance damage assessment accuracy. The accuracy improves from 66.14% with the Swin Transformer baseline to 77.11% with the dual-channel Feature-Fusion ConvNeXt model. This research enables rapid, operational damage assessments at hyperlocal spatial resolutions, providing valuable insights to support effective decision-making in disaster management and resilience planning.
zh

[CV-178] You Need a Transition Plane: Bridging Continuous Panoramic 3D Reconstruction with Perspective Gaussian Splatting

【速读】：该论文致力于解决从单张全景图像重建连续3D场景的问题。由于全景图像具有360°×180°的视场（Field of View, FoV），虽然能够捕捉整个场景，但其固有的严重畸变使得直接将3D高斯点渲染到二维畸变的等矩投影空间变得极具挑战性。尽管将等矩投影转换为立方体贴图投影可以部分缓解这一问题，但仍会引入新的挑战，如投影畸变以及立方体面边界处的不连续性。为了解决这些局限性，论文提出了一种名为TPGS的新框架，旨在弥合连续全景3D场景重建与透视高斯点绘制之间的差距。

解决方案的关键在于引入过渡平面（Transition Plane）以实现相邻立方体面之间平滑的方向过渡，并减轻边界区域的优化模糊性；同时提出了内向到外向面的优化策略，以增强局部细节并在立方体面边界恢复视觉一致性。具体而言，首先在单个立方体面上优化3D高斯点，然后在拼接后的全景空间中进行精细调整。此外，还引入了球面采样技术来消除可见的拼接缝。实验结果表明，该方法在室内、室外、第一人称视角及漫游基准数据集上的表现优于现有最先进的方法。

链接: https://arxiv.org/abs/2504.09062
作者: Zhijie Shen,Chunyu Lin,Shujuan Huang,Lang Nie,Kang Liao,Yao Zhao
机构: Institute of Information Science, Beijing Jiaotong University (北京交通大学信息科学研究所); Visual Intelligence +X International Cooperation Joint Laboratory of MOE (教育部视觉智能+X国际合作联合实验室); College of Computing and Data Science, Nanyang Technological University (南洋理工大学计算与数据科学学院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Recently, reconstructing scenes from a single panoramic image using advanced 3D Gaussian Splatting (3DGS) techniques has attracted growing interest. Panoramic images offer a 360 \times 180 field of view (FoV), capturing the entire scene in a single shot. However, panoramic images introduce severe distortion, making it challenging to render 3D Gaussians into 2D distorted equirectangular space directly. Converting equirectangular images to cubemap projections partially alleviates this problem but introduces new challenges, such as projection distortion and discontinuities across cube-face boundaries. To address these limitations, we present a novel framework, named TPGS, to bridge continuous panoramic 3D scene reconstruction with perspective Gaussian splatting. Firstly, we introduce a Transition Plane between adjacent cube faces to enable smoother transitions in splatting directions and mitigate optimization ambiguity in the boundary region. Moreover, an intra-to-inter face optimization strategy is proposed to enhance local details and restore visual consistency across cube-face boundaries. Specifically, we optimize 3D Gaussians within individual cube faces and then fine-tune them in the stitched panoramic space. Additionally, we introduce a spherical sampling technique to eliminate visible stitching seams. Extensive experiments on indoor and outdoor, egocentric, and roaming benchmark datasets demonstrate that our approach outperforms existing state-of-the-art methods. Code and models will be available at this https URL.
zh

[CV-179] BlockGaussian: Efficient Large-Scale Scene NovelView Synthesis via Adaptive Block-Based Gaussian Splatting

【速读】：该论文旨在解决大规模场景重建中的三个关键挑战：场景划分、优化及合并过程中的效率与质量问题。论文提出的BlockGaussian框架通过引入内容感知的场景划分策略和可见性感知的块级优化方法，实现了高效且高质量的大规模场景重建。解决方案的关键在于：首先，通过考虑场景不同区域的内容复杂度变化，在划分过程中平衡计算负载；其次，通过在独立块优化期间引入辅助点来对齐真实监督信号，解决了监督不匹配问题；最后，提出了一种伪视图几何约束，有效缓解了块合并过程中因空气漂浮物引起的渲染退化问题。这些创新使得BlockGaussian在多个基准测试中实现了优化速度提升5倍以及平均PSNR提高1.21 dB的性能，并显著降低了计算需求。

链接: https://arxiv.org/abs/2504.09048
作者: Yongchang Wu,Zipeng Qi,Zhenwei Shi,Zhengxia Zou
机构: Department of Aerospace Intelligent Science and Technology, School of Astronautics, Beihang University (北京航空航天大学航天学院航天智能科学与技术系); State Key Laboratory of Virtual Reality Technology and Systems, Beihang University (虚拟现实技术与系统国家重点实验室，北京航空航天大学); Shanghai Artificial Intelligence Laboratory (上海人工智能实验室)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: this https URL

点击查看摘要

Abstract:The recent advancements in 3D Gaussian Splatting (3DGS) have demonstrated remarkable potential in novel view synthesis tasks. The divide-and-conquer paradigm has enabled large-scale scene reconstruction, but significant challenges remain in scene partitioning, optimization, and merging processes. This paper introduces BlockGaussian, a novel framework incorporating a content-aware scene partition strategy and visibility-aware block optimization to achieve efficient and high-quality large-scale scene reconstruction. Specifically, our approach considers the content-complexity variation across different regions and balances computational load during scene partitioning, enabling efficient scene reconstruction. To tackle the supervision mismatch issue during independent block optimization, we introduce auxiliary points during individual block optimization to align the ground-truth supervision, which enhances the reconstruction quality. Furthermore, we propose a pseudo-view geometry constraint that effectively mitigates rendering degradation caused by airspace floaters during block merging. Extensive experiments on large-scale scenes demonstrate that our approach achieves state-of-the-art performance in both reconstruction efficiency and rendering quality, with a 5x speedup in optimization and an average PSNR improvement of 1.21 dB on multiple benchmarks. Notably, BlockGaussian significantly reduces computational requirements, enabling large-scale scene reconstruction on a single 24GB VRAM device. The project page is available at this https URL
zh

[CV-180] Sculpting Memory: Multi-Concept Forgetting in Diffusion Models via Dynamic Mask and Concept-Aware Optimization

【速读】：该论文旨在解决文本到图像（Text-to-Image, T2I）扩散模型在处理需要选择性遗忘（selective forgetting）任务时存在的挑战，例如移除版权内容、减少偏见或消除有害概念。现有未学习（unlearning）方法在多概念遗忘（multi-concept forgetting）中面临稳定性不足、残留知识持久以及生成质量下降等问题。为应对这些挑战，论文提出了一种名为“动态掩码（Dynamic Mask）”结合“概念感知损失（Concept-Aware Loss）”的新型未学习框架。其中，“动态掩码”机制根据当前优化状态自适应更新梯度掩码，从而实现对无关知识的干扰最小化；而“概念感知损失”通过 superclass 对齐显式引导未学习过程，并通过基于知识蒸馏的正则化损失确保在顺序未学习过程中先前未学习的概念不会恢复。这些创新点构成了解决方案的关键。

链接: https://arxiv.org/abs/2504.09039
作者: Gen Li,Yang Xiao,Jie Ji,Kaiyuan Deng,Bo Hui,Linke Guo,Xiaolong Ma
机构: Clemson University (克莱姆森大学); University of Tulsa (塔尔萨大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Text-to-image (T2I) diffusion models have achieved remarkable success in generating high-quality images from textual prompts. However, their ability to store vast amounts of knowledge raises concerns in scenarios where selective forgetting is necessary, such as removing copyrighted content, reducing biases, or eliminating harmful concepts. While existing unlearning methods can remove certain concepts, they struggle with multi-concept forgetting due to instability, residual knowledge persistence, and generation quality degradation. To address these challenges, we propose \textbfDynamic Mask coupled with Concept-Aware Loss, a novel unlearning framework designed for multi-concept forgetting in diffusion models. Our \textbfDynamic Mask mechanism adaptively updates gradient masks based on current optimization states, allowing selective weight modifications that prevent interference with unrelated knowledge. Additionally, our \textbfConcept-Aware Loss explicitly guides the unlearning process by enforcing semantic consistency through superclass alignment, while a regularization loss based on knowledge distillation ensures that previously unlearned concepts remain forgotten during sequential unlearning. We conduct extensive experiments to evaluate our approach. Results demonstrate that our method outperforms existing unlearning techniques in forgetting effectiveness, output fidelity, and semantic coherence, particularly in multi-concept scenarios. Our work provides a principled and flexible framework for stable and high-fidelity unlearning in generative models. The code will be released publicly.
zh

[CV-181] Chest X-ray Classification using Deep Convolution Models on Low-resolution images with Uncertain Labels

【速读】：该论文旨在解决低分辨率胸部X光图像在病理自动识别模型中的分类挑战，尤其是在远程医疗场景下成本较低的数据获取需求。论文的关键解决方案在于实验不同输入图像尺寸对深度卷积神经网络（Deep Convolutional Neural Networks, DCNN）性能的影响，并提出了一种基于标签随机翻转（Randomized Flipping of Labels）的技术来利用数据集中的噪声标签。此外，通过结合多标签分类模型集成、数据增强（augmentation）、正则化（regularization）以及类激活图（Class Activation Maps, CAM）来可视化模型决策过程，进一步提升模型性能。与高分辨率图像数据的分类结果对比表明，所提出的模型架构在Cardiomegaly、Consolidation和Edema三种病理性疾病上获得了3%的更高准确性，验证了方案的可行性。

链接: https://arxiv.org/abs/2504.09033
作者: Snigdha Agarwal,Neelam Sinha
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 5 pages, 5 figures

点击查看摘要

Abstract:Deep Convolutional Neural Networks have consistently proven to achieve state-of-the-art results on a lot of imaging tasks over the past years’ majority of which comprise of high-quality data. However, it is important to work on low-resolution images since it could be a cheaper alternative for remote healthcare access where the primary need of automated pathology identification models occurs. Medical diagnosis using low-resolution images is challenging since critical details may not be easily identifiable. In this paper, we report classification results by experimenting on different input image sizes of Chest X-rays to deep CNN models and discuss the feasibility of classification on varying image sizes. We also leverage the noisy labels in the dataset by proposing a Randomized Flipping of labels techniques. We use an ensemble of multi-label classification models on frontal and lateral studies. Our models are trained on 5 out of the 14 chest pathologies of the publicly available CheXpert dataset. We incorporate techniques such as augmentation, regularization for model improvement and use class activation maps to visualize the neural network’s decision making. Comparison with classification results on data from 200 subjects, obtained on the corresponding high-resolution images, reported in the original CheXpert paper, has been presented. For pathologies Cardiomegaly, Consolidation and Edema, we obtain 3% higher accuracy with our model architecture.
zh

[CV-182] Adaptive Additive Parameter Updates of Vision Transformers for Few-Shot Continual Learning

【速读】：本文旨在解决人工智能领域中的灾难性遗忘（catastrophic forgetting）问题，特别是在少量样本增量学习（Few-shot Class Incremental Learning, FSCIL）场景下。传统方法通过在基础类别上预训练模型后，利用少量新类别的标注数据进行增量微调，但容易因数据量有限而过拟合，同时加剧知识遗忘。本文的关键创新在于提出了一种基于冻结Vision Transformer (ViT) 主干网络并引入参数高效增量子更新机制的新框架。该方案通过在自注意力模块中仅选择性注入可训练权重，以最小化参数更新规模，从而在不丢失已有知识的同时适应新类别，同时有效缓解过拟合风险，并保留ViT中基础类别学习到的泛化特征。

链接: https://arxiv.org/abs/2504.08982
作者: Kyle Stein,Andrew Arash Mahyari,Guillermo Francia III,Eman El-Sheikh
机构: Department of Intelligent Systems and Robotics, University of West Florida (西佛罗里达大学); Florida Institute For Human and Machine Cognition (IHMC); Center for Cybersecurity, University of West Florida (西佛罗里达大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Integrating new class information without losing previously acquired knowledge remains a central challenge in artificial intelligence, often referred to as catastrophic forgetting. Few-shot class incremental learning (FSCIL) addresses this by first training a model on a robust dataset of base classes and then incrementally adapting it in successive sessions using only a few labeled examples per novel class. However, this approach is prone to overfitting on the limited new data, which can compromise overall performance and exacerbate forgetting. In this work, we propose a simple yet effective novel FSCIL framework that leverages a frozen Vision Transformer (ViT) backbone augmented with parameter-efficient additive updates. Our approach freezes the pre-trained ViT parameters and selectively injects trainable weights into the self-attention modules via an additive update mechanism. This design updates only a small subset of parameters to accommodate new classes without sacrificing the representations learned during the base session. By fine-tuning a limited number of parameters, our method preserves the generalizable features in the frozen ViT while reducing the risk of overfitting. Furthermore, as most parameters remain fixed, the model avoids overwriting previously learned knowledge when small novel data batches are introduced. Extensive experiments on benchmark datasets demonstrate that our approach yields state-of-the-art performance compared to baseline FSCIL methods.
zh

[CV-183] Mixed Signals: Decoding VLMs Reasoning and Underlying Bias in Vision-Language Conflict

【速读】：该论文试图解决视觉-语言模型（Vision-Language Models, VLMs）在处理图文矛盾场景时的偏见问题，探究模型如何整合视觉与文本信息以及模态间信息流动的结构。论文通过构建包含不匹配图文对的数据集，并分析模型在不同任务和复杂度下的行为，揭示了模型倾向于依赖文本或图像的偏见及其与模型规模的关系。解决方案的关键在于提出三种缓解策略：简单的提示修改、显式指导模型如何处理冲突信息的提示方法（类似于链式思维提示）、以及分解任务分别分析各模态后再综合结果的策略。研究发现这些策略的效果显著依赖于模型的整体性能及特定模态的表现。

链接: https://arxiv.org/abs/2504.08974
作者: Pouya Pezeshkpour,Moin Aminnaseri,Estevam Hruschka
机构: 未知
类目: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Vision-language models (VLMs) have demonstrated impressive performance by effectively integrating visual and textual information to solve complex tasks. However, it is not clear how these models reason over the visual and textual data together, nor how the flow of information between modalities is structured. In this paper, we examine how VLMs reason by analyzing their biases when confronted with scenarios that present conflicting image and text cues, a common occurrence in real-world applications. To uncover the extent and nature of these biases, we build upon existing benchmarks to create five datasets containing mismatched image-text pairs, covering topics in mathematics, science, and visual descriptions. Our analysis shows that VLMs favor text in simpler queries but shift toward images as query complexity increases. This bias correlates with model scale, with the difference between the percentage of image- and text-preferred responses ranging from +56.8% (image favored) to -74.4% (text favored), depending on the task and model. In addition, we explore three mitigation strategies: simple prompt modifications, modifications that explicitly instruct models on how to handle conflicting information (akin to chain-of-thought prompting), and a task decomposition strategy that analyzes each modality separately before combining their results. Our findings indicate that the effectiveness of these strategies in identifying and mitigating bias varies significantly and is closely linked to the model’s overall performance on the task and the specific modality in question.
zh

[CV-184] PACT: Pruning and Clustering-Based Token Reduction for Faster Visual Language Models CVPR2025

【速读】：该论文旨在解决视觉语言模型在推理过程中因需要额外输入视觉信息而导致计算资源消耗大的问题。具体而言，这些用于表示视觉信息的输入标记通常包含冗余和不重要的内容，导致不必要的高标记数量。为了解决这一问题，论文提出了PACT方法，通过在语言模型的早期层修剪无关标记和合并视觉上冗余的标记来减少推理时间和内存使用。PACT的关键创新在于引入了一种新的重要性度量方法，用于识别不重要的标记，而无需依赖注意力分数，这使其与FlashAttention兼容。此外，论文还提出了一种名为“受距离约束的密度峰值聚类”的新型聚类算法，该算法能够在约束聚类内元素之间距离的前提下高效聚类视觉标记。通过广泛的实验验证了PACT的有效性。

链接: https://arxiv.org/abs/2504.08966
作者: Mohamed Dhouib,Davide Buscaldi,Sonia Vanier,Aymen Shabou
机构: LIX, École Polytechnique, IP Paris (LIX, 巴黎高等理工学院, IP 巴黎), France; LIPN, Université Sorbonne Paris Nord (LIPN, 索邦巴黎北部大学), France; DataLab Groupe, Crédit Agricole S.A (DataLab Groupe, 花旗农业信贷银行集团), France
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to CVPR 2025

点击查看摘要

Abstract:Visual Language Models require substantial computational resources for inference due to the additional input tokens needed to represent visual information. However, these visual tokens often contain redundant and unimportant information, resulting in an unnecessarily high number of tokens. To address this, we introduce PACT, a method that reduces inference time and memory usage by pruning irrelevant tokens and merging visually redundant ones at an early layer of the language model. Our approach uses a novel importance metric to identify unimportant tokens without relying on attention scores, making it compatible with FlashAttention. We also propose a novel clustering algorithm, called Distance Bounded Density Peak Clustering, which efficiently clusters visual tokens while constraining the distances between elements within a cluster by a predefined threshold. We demonstrate the effectiveness of PACT through extensive experiments.
zh

[CV-185] MotionDreamer: One-to-Many Motion Synthesis with Localized Generative Masked Transformer ICLR2025

【速读】：该论文旨在解决在动作捕捉（MoCap）数据有限的情况下，利用生成式掩码建模生成多样化动画实例时易出现过拟合的问题。为应对这一挑战，论文提出了一种名为MotionDreamer的局部掩码建模范式。其关键是通过引入一种新颖的分布正则化方法，将输入动作嵌入到量化标记中，并构建一个鲁棒且信息丰富的代码本以表征局部运动模式；同时，在掩码变换器中采用滑动窗口局部注意力机制，从而实现既自然又多样的动画生成，紧密贴合参考动作模式。实验结果表明，MotionDreamer在保真度和多样性方面均优于传统的基于GAN或扩散模型的方法，并能够在单一参考动作的基础上有效执行时间动作编辑、人群动画以及节拍对齐的舞蹈生成等下游任务。

链接: https://arxiv.org/abs/2504.08959
作者: Yilin Wang,Chuan Guo,Yuxuan Mu,Muhammad Gohar Javed,Xinxin Zuo,Juwei Lu,Hai Jiang,Li Cheng
机构: University of Alberta (阿尔伯塔大学); Concordia University (康考迪亚大学); Simon Fraser University (西蒙弗雷泽大学); Noah’s Ark Lab, Huawei Canada (华为加拿大诺亚方舟实验室)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Multimedia (cs.MM)
备注: ICLR 2025 acceptance

点击查看摘要

Abstract:Generative masked transformers have demonstrated remarkable success across various content generation tasks, primarily due to their ability to effectively model large-scale dataset distributions with high consistency. However, in the animation domain, large datasets are not always available. Applying generative masked modeling to generate diverse instances from a single MoCap reference may lead to overfitting, a challenge that remains unexplored. In this work, we present MotionDreamer, a localized masked modeling paradigm designed to learn internal motion patterns from a given motion with arbitrary topology and duration. By embedding the given motion into quantized tokens with a novel distribution regularization method, MotionDreamer constructs a robust and informative codebook for local motion patterns. Moreover, a sliding window local attention is introduced in our masked transformer, enabling the generation of natural yet diverse animations that closely resemble the reference motion patterns. As demonstrated through comprehensive experiments, MotionDreamer outperforms the state-of-the-art methods that are typically GAN or Diffusion-based in both faithfulness and diversity. Thanks to the consistency and robustness of the quantization-based approach, MotionDreamer can also effectively perform downstream tasks such as temporal motion editing, \textcolorupdatecrowd animation, and beat-aligned dance generation, all using a single reference motion. Visit our project page: this https URL
zh

[CV-186] Rethinking Few-Shot Fusion: Granular Ball Priors Enable General-Purpose Deep Image Fusion

【速读】：该论文旨在解决图像融合任务中因缺乏真实的融合图像作为先验信息，导致大多数基于深度学习的方法需要大规模数据对来提取全局权重特征以逼近真实融合图像的问题。论文的关键创新在于利用Granular Ball适应方法从亮度空间提取特征作为深度网络的先验信息，从而实现融合网络快速收敛完成融合任务。具体而言，通过将原始融合图像中像素对的信息表达划分为正域（表现显著的像素对）和边界域（非显著像素对），并在亮度空间进行分割推理以计算不同程度表达信息的像素权重，生成提供结构化亮度空间先验的近似监督图像；同时，提取的全局显著性特征还自适应地为网络中各模态图像的损失函数权重设置提供先验，引导网络在全局和像素级别快速收敛。这种方案使得每个模态仅需使用10对图像作为训练集即可完成融合任务，并在少量迭代内实现高效融合。实验验证了算法的有效性和理论正确性，与最先进的方法相比，该方法在融合时间和图像表达力方面表现出高度竞争力。

链接: https://arxiv.org/abs/2504.08937
作者: Minjie Deng,Yan Wei,Hao Zhai,An Wu,Yuncan Ouyang,Qianyao Peng
机构: 未知
类目: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Image and Video Processing (eess.IV); Machine Learning (stat.ML)
备注:

点击查看摘要

Abstract:In image fusion tasks, due to the lack of real fused images as priors, most deep learning-based fusion methods obtain global weight features from original images in large-scale data pairs to generate images that approximate real fused images. However, unlike previous studies, this paper utilizes Granular Ball adaptation to extract features in the brightness space as priors for deep networks, enabling the fusion network to converge quickly and complete the fusion task. This leads to few-shot training for a general image fusion network, and based on this, we propose the GBFF fusion method. According to the information expression division of pixel pairs in the original fused image, we classify pixel pairs with significant performance as the positive domain and non-significant pixel pairs as the boundary domain. We perform split inference in the brightness space using Granular Ball adaptation to compute weights for pixels that express information to varying degrees, generating approximate supervision images that provide priors for the neural network in the structural brightness space. Additionally, the extracted global saliency features also adaptively provide priors for setting the loss function weights of each image in the network, guiding the network to converge quickly at both global and pixel levels alongside the supervised images, thereby enhancing the expressiveness of the fused images. Each modality only used 10 pairs of images as the training set, completing the fusion task with a limited number of iterations. Experiments validate the effectiveness of the algorithm and theory, and qualitative and quantitative comparisons with SOTA methods show that this approach is highly competitive in terms of fusion time and image expressiveness.
zh

[CV-187] Parameter-Free Fine-tuning via Redundancy Elimination for Vision Foundation Models

【速读】：该论文旨在解决视觉基础模型（Vision Foundation Models, VFMs）中存在的显著特征冗余问题，这种冗余可能限制其在新任务中的适应性。论文以Segment Anything Model (SAM)为例，研究了其内部的特征冗余，并提出了一种无参微调方法来应对这一挑战。解决方案的关键在于引入一种基于模型输出差异的通道选择算法，通过识别冗余与有效的通道，选择性地替换冗余通道以增强更有用的特征表示。这种方法强调对预训练特征的筛选、重用和增强，而非直接调整参数，从而提升特定任务的特征表达能力，同时降低计算开销和GPU内存占用。

链接: https://arxiv.org/abs/2504.08915
作者: Jiahuan Long,Tingsong Jiang,Wen Yao,Yizhe Xiong,Zhengqin Xu,Shuai Jia,Chao Ma
机构: Shanghai Jiao Tong University (上海交通大学); Tsinghua University (清华大学); Chinese Academy of Military Science (中国人民解放军军事科学院)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Vision foundation models (VFMs) are large pre-trained models that form the backbone of various vision tasks. Fine-tuning VFMs can further unlock their potential for downstream tasks or scenarios. However, VFMs often contain significant feature redundancy, which may limit their adaptability to new tasks. In this paper, we investigate the redundancies in the segment anything model (SAM) and then propose a parameter-free fine-tuning method to address this issue. Unlike traditional fine-tuning methods that adjust parameters, our method emphasizes selecting, reusing, and enhancing pre-trained features, offering a new perspective on model fine-tuning. Specifically, we introduce a channel selection algorithm based on the model’s output difference to identify redundant and effective channels. By selectively replacing the redundant channels with more effective ones, we filter out less useful features and reuse the more relevant features to downstream tasks, thereby enhancing the task-specific feature representation. Experiments on both out-of-domain and in-domain datasets demonstrate the efficiency and effectiveness of our method. Notably, our approach can seamlessly integrate with existing fine-tuning strategies (e.g., LoRA, Adapter), further boosting the performance of already fine-tuned models. Moreover, since our channel selection involves only model inference, our method significantly reduces computational and GPU memory overhead.
zh

[CV-188] Hybrid AI-Physical Modeling for Penetration Bias Correction in X-band InSAR DEMs: A Greenland Case Study

【速读】：该论文旨在解决由干涉合成孔径雷达（Interferometric Synthetic Aperture Radar, InSAR）数据生成的数字高程模型（Digital Elevation Model, DEM）在冰川和积雪覆盖区域中存在的系统性高程误差问题，即所谓的“穿透偏差（penetration bias）”。论文的关键解决方案在于提出了一种集成校正框架，将基于物理参数的建模与机器学习相结合。通过在三种不同采集参数定义的训练场景下评估该方法，并利用TanDEM-X数据对格陵兰冰盖进行实验，结果显示，该混合模型显著降低了DEM误差的均值和标准差，同时相较于纯物理建模和纯机器学习方法，表现出更好的泛化能力。

链接: https://arxiv.org/abs/2504.08909
作者: Islam Mansour,Georg Fischer,Ronny Haensch,Irena Hajnsek
机构: Microwaves and Radar Institute, German Aerospace Center DLR (德国航空航天中心微波与雷达研究所); Institute of Environmental Engineering, ETH Zurich (瑞士苏黎世联邦理工学院环境工程研究所)
类目: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: 8 pages

点击查看摘要

Abstract:Digital elevation models derived from Interferometric Synthetic Aperture Radar (InSAR) data over glacial and snow-covered regions often exhibit systematic elevation errors, commonly termed “penetration bias.” We leverage existing physics-based models and propose an integrated correction framework that combines parametric physical modeling with machine learning. We evaluate the approach across three distinct training scenarios - each defined by a different set of acquisition parameters - to assess overall performance and the model’s ability to generalize. Our experiments on Greenland’s ice sheet using TanDEM-X data show that the proposed hybrid model corrections significantly reduce the mean and standard deviation of DEM errors compared to a purely physical modeling baseline. The hybrid framework also achieves significantly improved generalization than a pure ML approach when trained on data with limited diversity in acquisition parameters.
zh

[CV-189] Robust SAM: On the Adversarial Robustness of Vision Foundation Models AAAI2025

【速读】：该论文旨在解决Segment Anything Model (SAM) 在实际部署中对抗性攻击鲁棒性不足的问题，并填补现有研究在评估方法及防御策略上的空白。论文的关键在于提出了一种对抗鲁棒性框架，包含两个主要创新：首先，引入跨提示（cross-prompt）攻击方法以增强攻击在不同提示类型之间的迁移能力；其次，设计了一种少量参数适配策略，通过仅调整奇异值来约束可训练参数空间，从而在提升模型鲁棒性的同时最小化对原始精度的影响。实验表明，该方法不仅显著提高了攻击成功率，还通过仅适配512个参数实现了至少15%的平均交并比（mIoU）提升，并在保持SAM原有性能的前提下增强了其鲁棒性。

链接: https://arxiv.org/abs/2504.08906
作者: Jiahuan Long,Zhengqin Xu,Tingsong Jiang,Wen Yao,Shuai Jia,Chao Ma,Xiaoqian Chen
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Accepted by AAAI2025

点击查看摘要

Abstract:The Segment Anything Model (SAM) is a widely used vision foundation model with diverse applications, including image segmentation, detection, and tracking. Given SAM’s wide applications, understanding its robustness against adversarial attacks is crucial for real-world deployment. However, research on SAM’s robustness is still in its early stages. Existing attacks often overlook the role of prompts in evaluating SAM’s robustness, and there has been insufficient exploration of defense methods to balance the robustness and accuracy. To address these gaps, this paper proposes an adversarial robustness framework designed to evaluate and enhance the robustness of SAM. Specifically, we introduce a cross-prompt attack method to enhance the attack transferability across different prompt types. Besides attacking, we propose a few-parameter adaptation strategy to defend SAM against various adversarial attacks. To balance robustness and accuracy, we use the singular value decomposition (SVD) to constrain the space of trainable parameters, where only singular values are adaptable. Experiments demonstrate that our cross-prompt attack method outperforms previous approaches in terms of attack success rate on both SAM and SAM 2. By adapting only 512 parameters, we achieve at least a 15% improvement in mean intersection over union (mIoU) against various adversarial attacks. Compared to previous defense methods, our approach enhances the robustness of SAM while maximally maintaining its original performance.
zh

[CV-190] LookingGlass: Generative Anamorphoses via Laplacian Pyramid Warping CVPR2025

【速读】：该论文旨在解决传统变形图像（Anamorphosis）在直接观看时无法被正常解读的问题，提出一种方法以生成在保持可直接理解的同时仍具有视觉错觉效果的图像。论文的关键在于引入了一种名为Laplacian Pyramid Warping的频率感知图像扭曲技术，该技术结合潜在空间中的矩化流模型（latent rectified flow models），实现了对更广泛空间变换的支持，从而生成高质量的新型生成式视觉错觉图像。

链接: https://arxiv.org/abs/2504.08902
作者: Pascal Chang,Sergio Sancho,Jingwei Tang,Markus Gross,Vinicius C. Azevedo
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: Accepted at CVPR 2025 (Oral)

点击查看摘要

Abstract:Anamorphosis refers to a category of images that are intentionally distorted, making them unrecognizable when viewed directly. Their true form only reveals itself when seen from a specific viewpoint, which can be through some catadioptric device like a mirror or a lens. While the construction of these mathematical devices can be traced back to as early as the 17th century, they are only interpretable when viewed from a specific vantage point and tend to lose meaning when seen normally. In this paper, we revisit these famous optical illusions with a generative twist. With the help of latent rectified flow models, we propose a method to create anamorphic images that still retain a valid interpretation when viewed directly. To this end, we introduce Laplacian Pyramid Warping, a frequency-aware image warping technique key to generating high-quality visuals. Our work extends Visual Anagrams (arXiv:2311.17919) to latent space models and to a wider range of spatial transforms, enabling the creation of novel generative perceptual illusions.
zh

[CV-191] HAL-NeRF: High Accuracy Localization Leverag ing Neural Radiance Fields

【速读】：该论文旨在解决单目相机重定位（monocular camera relocalization）在大型室内外环境中的高精度定位问题，尤其是在仅使用图像作为输入的情况下，传统方法如绝对位姿回归（Absolute Pose Regression, APR）难以实现亚米级精度的问题。论文的关键在于提出了一种名为HAL-NeRF的方法，该方法结合了卷积神经网络（CNN）位姿回归器与基于蒙特卡洛粒子滤波器（Monte Carlo Particle Filter）的精化模块，并利用Nerfacto模型生成高质量的新视角图像以增强训练数据和计算光度损失。通过这种方式，HAL-NeRF显著提升了局部化的准确性，在7-Scenes和Cambridge Landmarks数据集上的平移误差分别达到了0.025m和0.04m，旋转误差分别为0.59°和0.58°，展示了将APR与基于NeRF的精化技术相结合以提高单目相机重定位精度的巨大潜力。然而，这一改进伴随着更高的计算开销。

链接: https://arxiv.org/abs/2504.08901
作者: Asterios Reppas,Grigorios-Aris Cheimariotis,Panos K. Papadopoulos,Panagiotis Frasiolas,Dimitrios Zarpalas
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 8 pages, 4 figures

点击查看摘要

Abstract:Precise camera localization is a critical task in XR applications and robotics. Using only the camera captures as input to a system is an inexpensive option that enables localization in large indoor and outdoor environments, but it presents challenges in achieving high accuracy. Specifically, camera relocalization methods, such as Absolute Pose Regression (APR), can localize cameras with a median translation error of more than 0.5m in outdoor scenes. This paper presents HAL-NeRF, a high-accuracy localization method that combines a CNN pose regressor with a refinement module based on a Monte Carlo particle filter. The Nerfacto model, an implementation of Neural Radiance Fields (NeRFs), is used to augment the data for training the pose regressor and to measure photometric loss in the particle filter refinement module. HAL-NeRF leverages Nerfacto’s ability to synthesize high-quality novel views, significantly improving the performance of the localization pipeline. HAL-NeRF achieves state-of-the-art results that are conventionally measured as the average of the median per scene errors. The translation error was 0.025m and the rotation error was 0.59 degrees and 0.04m and 0.58 degrees on the 7-Scenes dataset and Cambridge Landmarks datasets respectively, with the trade-off of increased computational time. This work highlights the potential of combining APR with NeRF-based refinement techniques to advance monocular camera relocalization accuracy.
zh

[CV-192] ColonScopeX: Leverag ing Explainable Expert Systems with Multimodal Data for Improved Early Diagnosis of Colorectal Cancer AAAI-25

【速读】：该论文旨在解决结直肠癌（Colorectal Cancer, CRC）早期检测困难的问题，其核心挑战在于CRC症状缺乏特异性且常令人尴尬，导致患者往往忽视或不愿向临床医生报告，从而延误诊断。早期诊断的延迟显著影响患者的生存率，从I期的80-95%骤降至IV期的10%，而英国仅有14.4%的病例在I期被确诊。为应对这一挑战，论文提出ColonScopeX，这是一种利用可解释人工智能（Explainable AI, XAI）方法的机器学习框架，旨在提升CRC及癌前病变的早期检测能力。方案的关键在于采用多模态模型，整合血液样本测量信号（经Savitzky-Golay算法平滑处理）与全面的患者元数据（如用药史、合并症、年龄、体重和BMI等），并通过XAI技术增强模型决策过程的透明性和可解释性，以提高预测结果的可信度和接受度，最终实现作为分诊工具或大众筛查工具的应用潜力。

链接: https://arxiv.org/abs/2504.08824
作者: Natalia Sikora,Robert L. Manschke,Alethea M. Tang,Peter Dunstan,Dean A. Harris,Su Yang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC)
备注: Published to AAAI-25 Bridge Program

点击查看摘要

Abstract:Colorectal cancer (CRC) ranks as the second leading cause of cancer-related deaths and the third most prevalent malignant tumour worldwide. Early detection of CRC remains problematic due to its non-specific and often embarrassing symptoms, which patients frequently overlook or hesitate to report to clinicians. Crucially, the stage at which CRC is diagnosed significantly impacts survivability, with a survival rate of 80-95% for Stage I and a stark decline to 10% for Stage IV. Unfortunately, in the UK, only 14.4% of cases are diagnosed at the earliest stage (Stage I). In this study, we propose ColonScopeX, a machine learning framework utilizing explainable AI (XAI) methodologies to enhance the early detection of CRC and pre-cancerous lesions. Our approach employs a multimodal model that integrates signals from blood sample measurements, processed using the Savitzky-Golay algorithm for fingerprint smoothing, alongside comprehensive patient metadata, including medication history, comorbidities, age, weight, and BMI. By leveraging XAI techniques, we aim to render the model’s decision-making process transparent and interpretable, thereby fostering greater trust and understanding in its predictions. The proposed framework could be utilised as a triage tool or a screening tool of the general population. This research highlights the potential of combining diverse patient data sources and explainable machine learning to tackle critical challenges in medical diagnostics. Comments: Published to AAAI-25 Bridge Program Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC) Cite as: arXiv:2504.08824 [cs.LG] (or arXiv:2504.08824v1 [cs.LG] for this version) https://doi.org/10.48550/arXiv.2504.08824 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
zh

[CV-193] FM-LoRA: Factorized Low-Rank Meta-Prompting for Continual Learning

【速读】：该论文旨在解决在连续任务中，如何利用预训练模型（如Transformer）适应具有不同预测类别标签和领域的新任务，并最终学习到一个在多样化任务中具有泛化能力的模型这一长期挑战。现有方法通常通过不断增加额外的学习结构（如低秩适配LoRA适配器或提示词）或保留先前样本特征来维持性能，但这种做法会导致参数持续增长和存储成本迅速上升，同时缺乏对任务相似性的感知，从而阻碍了模型在不干扰已有知识的情况下有效适应新任务的能力。为了解决这些问题，论文提出了一种名为FM-LoRA的新颖高效低秩适配方法，其关键在于结合动态秩选择器（Dynamic Rank Selector, DRS）和动态元提示（Dynamic Meta-Prompting, DMP），通过共享的关键低秩子空间来更有效地分配模型容量，避免连续参数扩展，从而缓解灾难性遗忘现象并在多样化的任务和领域中实现稳健的性能表现。

链接: https://arxiv.org/abs/2504.08823
作者: Xiaobing Yu,Jin Yang,Xiao Wu,Peijie Qiu,Xiaofeng Liu
机构: Dept. of Radiology, Washington University in St. Louis (华盛顿大学圣路易斯分校), USA; Dept. of Computer Vision, MBZUAI (Mohamed bin Zayed University of Artificial Intelligence), Abu Dhabi, United Arab Emirates; Dept. of Radiology and Biomedical Imaging, Yale University (耶鲁大学), New Haven, USA; Dept. of Biomedical Informatics and Data Science, Yale University (耶鲁大学), New Haven, USA
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注: 8 Pages, 4 figures

点击查看摘要

Abstract:How to adapt a pre-trained model continuously for sequential tasks with different prediction class labels and domains and finally learn a generalizable model across diverse tasks is a long-lasting challenge. Continual learning (CL) has emerged as a promising approach to leverage pre-trained models (e.g., Transformers) for sequential tasks. While many existing CL methods incrementally store additional learned structures, such as Low-Rank Adaptation (LoRA) adapters or prompts and sometimes even preserve features from previous samples to maintain performance. This leads to unsustainable parameter growth and escalating storage costs as the number of tasks increases. Moreover, current approaches often lack task similarity awareness, which further hinders the models ability to effectively adapt to new tasks without interfering with previously acquired knowledge. To address these challenges, we propose FM-LoRA, a novel and efficient low-rank adaptation method that integrates both a dynamic rank selector (DRS) and dynamic meta-prompting (DMP). This framework allocates model capacity more effectively across tasks by leveraging a shared low-rank subspace critical for preserving knowledge, thereby avoiding continual parameter expansion. Extensive experiments on various CL benchmarks, including ImageNet-R, CIFAR100, and CUB200 for class-incremental learning (CIL), and DomainNet for domain-incremental learning (DIL), with Transformers backbone demonstrate that FM-LoRA effectively mitigates catastrophic forgetting while delivering robust performance across a diverse range of tasks and domains.
zh

[CV-194] Zero-shot Autonomous Microscopy for Scalable and Intelligent Characterization of 2D Materials

【速读】：该论文旨在解决传统原子尺度材料表征依赖人工专家且耗时长的问题，特别是在处理新型二维（2D）结构材料时，准确可靠的表征仍具挑战性。为此，论文提出了一种端到端框架ATOMIC（自主光学显微镜智能表征），通过集成基础模型实现2D材料的完全自主零样本表征。解决方案的关键在于结合视觉基础模型（如Segment Anything Model）、大型语言模型（如ChatGPT）、无监督聚类和拓扑分析，并通过提示工程自动化显微镜控制、样品扫描、图像分割及智能分析，无需额外训练数据即可完成任务。实验表明，该方法在单层识别中的分割准确率达到99.7%，与人类专家相当，并能检测肉眼难以识别的晶界裂隙，同时在不同条件下保持高精度，适用于多种常见2D材料。这一工作展示了如何利用基础模型实现自主分析，建立了一种可扩展且高效的数据利用表征范式，从根本上变革了纳米尺度材料研究的方法。

链接: https://arxiv.org/abs/2504.10281
作者: Jingyun Yang,Ruoyan Avery Yin,Chi Jiang,Yuepeng Hu,Xiaokai Zhu,Xingjian Hu,Sutharsika Kumar,Xiao Wang,Xiaohua Zhai,Keran Rong,Yunyue Zhu,Tianyi Zhang,Zongyou Yin,Jing Kong,Neil Zhenqiang Gong,Zhichu Ren,Haozhe Wang
机构: 未知
类目: Materials Science (cond-mat.mtrl-sci); Mesoscale and Nanoscale Physics (cond-mat.mes-hall); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: 13 pages, 4 figures

点击查看摘要

Abstract:Characterization of atomic-scale materials traditionally requires human experts with months to years of specialized training. Even for trained human operators, accurate and reliable characterization remains challenging when examining newly discovered materials such as two-dimensional (2D) structures. This bottleneck drives demand for fully autonomous experimentation systems capable of comprehending research objectives without requiring large training datasets. In this work, we present ATOMIC (Autonomous Technology for Optical Microscopy Intelligent Characterization), an end-to-end framework that integrates foundation models to enable fully autonomous, zero-shot characterization of 2D materials. Our system integrates the vision foundation model (i.e., Segment Anything Model), large language models (i.e., ChatGPT), unsupervised clustering, and topological analysis to automate microscope control, sample scanning, image segmentation, and intelligent analysis through prompt engineering, eliminating the need for additional training. When analyzing typical MoS2 samples, our approach achieves 99.7% segmentation accuracy for single layer identification, which is equivalent to that of human experts. In addition, the integrated model is able to detect grain boundary slits that are challenging to identify with human eyes. Furthermore, the system retains robust accuracy despite variable conditions including defocus, color temperature fluctuations, and exposure variations. It is applicable to a broad spectrum of common 2D materials-including graphene, MoS2, WSe2, SnSe-regardless of whether they were fabricated via chemical vapor deposition or mechanical exfoliation. This work represents the implementation of foundation models to achieve autonomous analysis, establishing a scalable and data-efficient characterization paradigm that fundamentally transforms the approach to nanoscale materials research.
zh

[CV-195] owards contrast- and pathology-agnostic clinical fetal brain MRI segmentation using SynthSeg

【速读】：该论文致力于解决胎儿脑部磁共振成像（MRI）自动分割中因领域偏移（domain shift）导致的性能下降问题，特别是在生理差异和成像环境变化显著的情况下，以及病理情况下常见的形状差异。论文的关键在于提出了一种新颖的数据驱动训练时采样策略，旨在充分利用训练数据集的多样性以增强模型的领域泛化能力。此外，通过将此采样器与SynthSeg框架结合，并结合其他现有数据增强技术，验证了所提方法的有效性。实验结果表明，该方法在具有强烈解剖异常的测试样本上显著提升了分割质量（p < 1e-4），尽管在异常较少的情况下性能略有下降。这项工作也为未来开发适用于其他训练流程的数据驱动采样策略奠定了基础。

链接: https://arxiv.org/abs/2504.10244
作者: Ziyao Shang,Misha Kaandorp,Kelly Payette,Marina Fernandez Garcia,Roxane Licandro,Georg Langs,Jordina Aviles Verdera,Jana Hutter,Bjoern Menze,Gregor Kasprian,Meritxell Bach Cuadra,Andras Jakab
机构: 未知
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注: 21 pages, 16 figures

点击查看摘要

Abstract:Magnetic resonance imaging (MRI) has played a crucial role in fetal neurodevelopmental research. Structural annotations of MR images are an important step for quantitative analysis of the developing human brain, with Deep learning providing an automated alternative for this otherwise tedious manual process. However, segmentation performances of Convolutional Neural Networks often suffer from domain shift, where the network fails when applied to subjects that deviate from the distribution with which it is trained on. In this work, we aim to train networks capable of automatically segmenting fetal brain MRIs with a wide range of domain shifts pertaining to differences in subject physiology and acquisition environments, in particular shape-based differences commonly observed in pathological cases. We introduce a novel data-driven train-time sampling strategy that seeks to fully exploit the diversity of a given training dataset to enhance the domain generalizability of the trained networks. We adapted our sampler, together with other existing data augmentation techniques, to the SynthSeg framework, a generator that utilizes domain randomization to generate diverse training data, and ran thorough experimentations and ablation studies on a wide range of training/testing data to test the validity of the approaches. Our networks achieved notable improvements in the segmentation quality on testing subjects with intense anatomical abnormalities (p 1e-4), though at the cost of a slighter decrease in performance in cases with fewer abnormalities. Our work also lays the foundation for future works on creating and adapting data-driven sampling strategies for other training pipelines.
zh

[CV-196] Progressive Transfer Learning for Multi-Pass Fundus Image Restoration

【速读】：该论文旨在解决因低质量眼底图像（如光照不足、噪声、模糊及运动伪影等）导致的糖尿病性视网膜病变（Diabetic Retinopathy, DR）筛查准确性下降的问题。论文的关键解决方案是提出了一种渐进式迁移学习（Progressive Transfer Learning, PTL）的多阶段修复方法，通过迭代增强低质量眼底图像的质量，从而实现更可靠的DR筛查。与单一修复方法不同，PTL能够在无需配对数据的情况下实现盲修复，并通过逐步学习和微调策略，有效减少失真并保留关键的视网膜特征，同时显著提升图像的整体质量，甚至改善高质量图像的表现。实验结果表明，该方法在DeepDRiD数据集上的表现达到当前最优水平，验证了PTL在多阶段图像质量修复中的优越性。

链接: https://arxiv.org/abs/2504.10025
作者: Uyen Phan,Ozer Can Devecioglu,Serkan Kiranyaz,Moncef Gabbouj
机构: 未知
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注: 13 pages, 12 figures including appendix

点击查看摘要

Abstract:Diabetic retinopathy is a leading cause of vision impairment, making its early diagnosis through fundus imaging critical for effective treatment planning. However, the presence of poor quality fundus images caused by factors such as inadequate illumination, noise, blurring and other motion artifacts yields a significant challenge for accurate DR screening. In this study, we propose progressive transfer learning for multi pass restoration to iteratively enhance the quality of degraded fundus images, ensuring more reliable DR screening. Unlike previous methods that often focus on a single pass restoration, multi pass restoration via PTL can achieve a superior blind restoration performance that can even improve most of the good quality fundus images in the dataset. Initially, a Cycle GAN model is trained to restore low quality images, followed by PTL induced restoration passes over the latest restored outputs to improve overall quality in each pass. The proposed method can learn blind restoration without requiring any paired data while surpassing its limitations by leveraging progressive learning and fine tuning strategies to minimize distortions and preserve critical retinal features. To evaluate PTL’s effectiveness on multi pass restoration, we conducted experiments on DeepDRiD, a large scale fundus imaging dataset specifically curated for diabetic retinopathy detection. Our result demonstrates state of the art performance, showcasing PTL’s potential as a superior approach to iterative image quality restoration.
zh

[CV-197] OmniMamba4D: Spatio-temporal Mamba for longitudinal CT lesion segmentation

【速读】：该论文试图解决现有3D分割模型仅关注空间信息的问题，特别是在纵向CT扫描中对肿瘤进展监测和治疗反应评估的需求。论文提出了一种名为OmniMamba4D的新框架，专注于处理4D医学图像（时间维度上的3D图像），以有效利用时空信息。其关键创新在于引入了时空四向Mamba块（spatio-temporal tetra-orientated Mamba block），能够同时捕捉空间和时间特征，与传统仅分析单一时点的3D模型不同，OmniMamba4D通过处理完整的4D CT数据提供病变进展的全面时空信息。实验结果显示，在包含3,252例CT扫描的数据集上，OmniMamba4D实现了Dice系数0.682，与最先进的模型相当，同时保持计算效率并更有效地检测消失的病灶。

链接: https://arxiv.org/abs/2504.09655
作者: Justin Namuk Kim,Yiqiao Liu,Rajath Soans,Keith Persson,Sarah Halek,Michal Tomaszewski,Jianda Yuan,Gregory Goldmacher,Antong Chen
机构: 未知
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted at IEEE International Symposium on Biomedical Imaging (ISBI) 2025

点击查看摘要

Abstract:Accurate segmentation of longitudinal CT scans is important for monitoring tumor progression and evaluating treatment responses. However, existing 3D segmentation models solely focus on spatial information. To address this gap, we propose OmniMamba4D, a novel segmentation model designed for 4D medical images (3D images over time). OmniMamba4D utilizes a spatio-temporal tetra-orientated Mamba block to effectively capture both spatial and temporal features. Unlike traditional 3D models, which analyze single-time points, OmniMamba4D processes 4D CT data, providing comprehensive spatio-temporal information on lesion progression. Evaluated on an internal dataset comprising of 3,252 CT scans, OmniMamba4D achieves a competitive Dice score of 0.682, comparable to state-of-the-arts (SOTA) models, while maintaining computational efficiency and better detecting disappeared lesions. This work demonstrates a new framework to leverage spatio-temporal information for longitudinal CT lesion segmentation.
zh

[CV-198] Predicting ulcer in HE images of inflammatory bowel disease using domain-knowledge-driven graph neural network

【速读】：该论文旨在解决溃疡性结肠炎（IBD）病理切片图像中溃疡区域精准识别的问题，以支持免疫细胞特征分析及个性化治疗策略的开发。传统多重实例学习（MIL）方法虽已推进全片图像（WSI）分析，但缺乏空间上下文意识。论文的关键解决方案是提出一种名为DomainGCN的弱监督模型，通过图卷积神经网络（GCN）结合特定领域的溃疡特征知识（如上皮细胞、淋巴细胞和碎片的存在），实现WSI层面溃疡预测的性能提升。实验表明，DomainGCN在多种最先进的MIL方法中表现更优，并凸显了领域知识的重要性。

链接: https://arxiv.org/abs/2504.09430
作者: Ruiwen Ding,Lin Li,Rajath Soans,Tosha Shah,Radha Krishnan,Marc Alexander Sze,Sasha Lukyanov,Yash Deshpande,Antong Chen
机构: University of California, Los Angeles (加州大学洛杉矶分校); Merck & Co., Inc. (默克公司)
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注: Work accepted at ISBI 2025

点击查看摘要

Abstract:Inflammatory bowel disease (IBD) involves chronic inflammation of the digestive tract, with treatment options often burdened by adverse effects. Identifying biomarkers for personalized treatment is crucial. While immune cells play a key role in IBD, accurately identifying ulcer regions in whole slide images (WSIs) is essential for characterizing these cells and exploring potential therapeutics. Multiple instance learning (MIL) approaches have advanced WSI analysis but they lack spatial context awareness. In this work, we propose a weakly-supervised model called DomainGCN that employs a graph convolution neural network (GCN) and incorporates domain-specific knowledge of ulcer features, specifically, the presence of epithelium, lymphocytes, and debris for WSI-level ulcer prediction in IBD. We demonstrate that DomainGCN outperforms various state-of-the-art (SOTA) MIL methods and show the added value of domain knowledge.
zh

[CV-199] seg2med: a segmentation-based medical image generation framework using denoising diffusion probabilistic models

【速读】：该论文旨在解决医学影像合成领域中高质量合成图像生成及跨模态转换的问题。解决方案的关键在于提出了一种名为seg2med的先进框架，该框架利用去噪扩散概率模型（Denoising Diffusion Probabilistic Models, DDPM）从解剖学掩膜生成高保真的CT和MR合成图像。seg2med不仅实现了与真实患者图像相比结构相似性指数（SSIM）分别为0.94±0.02（CT）和0.89±0.04（MR）的合成效果，还通过Fréchet Inception距离（FID）3.62验证了生成质量，并且在跨模态转换任务中达到了MR-to-CT的SSIM值为0.91±0.03以及CT-to-MR的SSIM值为0.77±0.04。此外，借助TotalSegmentator工具包，seg2med展示了对多种数据集的支持能力，适用于临床成像、数据增强、多模态合成以及诊断算法开发等应用场景。

链接: https://arxiv.org/abs/2504.09182
作者: Zeyu Yang,Zhilin Chen,Yipeng Sun,Anika Strittmatter,Anish Raj,Ahmad Allababidi,Johann S. Rink,Frank G. Zöllner
机构: Computer Assisted Clinical Medicine, Medical Faculty Mannheim, Heidelberg University (海德堡大学), Mannheim, Germany; Pattern Recognition Lab, Friedrich-Alexander-University Erlangen-Nuremberg (埃尔朗根-纽伦堡弗里德里希亚历山大大学), Erlangen, Germany; Department of Radiology and Nuclear Medicine, University Medical Center Mannheim (曼海姆大学医学中心), Mannheim, Germany; Mannheim Institute for Intelligent Systems in Medicine, Medical Faculty Mannheim, Heidelberg University (海德堡大学), Mannheim, Germany
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注: 17 pages, 10 figures

点击查看摘要

Abstract:In this study, we present seg2med, an advanced medical image synthesis framework that uses Denoising Diffusion Probabilistic Models (DDPM) to generate high-quality synthetic medical images conditioned on anatomical masks from TotalSegmentator. The framework synthesizes CT and MR images from segmentation masks derived from real patient data and XCAT digital phantoms, achieving a Structural Similarity Index Measure (SSIM) of 0.94 +/- 0.02 for CT and 0.89 +/- 0.04 for MR images compared to ground-truth images of real patients. It also achieves a Feature Similarity Index Measure (FSIM) of 0.78 +/- 0.04 for CT images from XCAT. The generative quality is further supported by a Fréchet Inception Distance (FID) of 3.62 for CT image generation. Additionally, seg2med can generate paired CT and MR images with consistent anatomical structures and convert images between CT and MR modalities, achieving SSIM values of 0.91 +/- 0.03 for MR-to-CT and 0.77 +/- 0.04 for CT-to-MR conversion. Despite the limitations of incomplete anatomical details in segmentation masks, the framework shows strong performance in cross-modality synthesis and multimodal imaging. seg2med also demonstrates high anatomical fidelity in CT synthesis, achieving a mean Dice coefficient greater than 0.90 for 11 abdominal organs and greater than 0.80 for 34 organs out of 59 in 58 test cases. The highest Dice of 0.96 +/- 0.01 was recorded for the right scapula. Leveraging the TotalSegmentator toolkit, seg2med enables segmentation mask generation across diverse datasets, supporting applications in clinical imaging, data augmentation, multimodal synthesis, and diagnostic algorithm development. Comments: 17 pages, 10 figures Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV) Cite as: arXiv:2504.09182 [eess.IV] (or arXiv:2504.09182v1 [eess.IV] for this version) https://doi.org/10.48550/arXiv.2504.09182 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
zh

[CV-200] Multi-Modal Brain Tumor Segmentation via 3D Multi-Scale Self-attention and Cross-attention

【速读】：该论文旨在解决现有基于CNN-Transformer混合架构的3D多模态医学图像分割模型在处理多尺度体积病灶特征时存在的固定感受野限制问题。解决方案的关键在于提出了一种名为TMA-TransBTS的新模型，该模型通过自注意力层中多尺度划分与聚合3D token的方式，实现了多尺度3D特征的同时提取与长距离依赖建模。此外，TMA-TransBTS还引入了一种3D多尺度交叉注意力模块，利用交叉注意力的互信息机制和多尺度聚合技术，在编码器与解码器之间建立联系，以提取丰富的体积表示。这些创新显著提升了3D多模态脑肿瘤分割任务的性能。

链接: https://arxiv.org/abs/2504.09088
作者: Yonghao Huang,Leiting Chen,Chuan Zhou
机构: Key Laboratory of Intelligent Digital Media Technology of Sichuan Province, University of Electronic Science and Technology of China (电子科技大学); School of Computer Science and Engineering, University of Electronic Science and Technology of China (电子科技大学); School of Information and Software Engineering, University of Electronic Science and Technology of China (电子科技大学)
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Due to the success of CNN-based and Transformer-based models in various computer vision tasks, recent works study the applicability of CNN-Transformer hybrid architecture models in 3D multi-modality medical segmentation tasks. Introducing Transformer brings long-range dependent information modeling ability in 3D medical images to hybrid models via the self-attention mechanism. However, these models usually employ fixed receptive fields of 3D volumetric features within each self-attention layer, ignoring the multi-scale volumetric lesion features. To address this issue, we propose a CNN-Transformer hybrid 3D medical image segmentation model, named TMA-TransBTS, based on an encoder-decoder structure. TMA-TransBTS realizes simultaneous extraction of multi-scale 3D features and modeling of long-distance dependencies by multi-scale division and aggregation of 3D tokens in a self-attention layer. Furthermore, TMA-TransBTS proposes a 3D multi-scale cross-attention module to establish a link between the encoder and the decoder for extracting rich volume representations by exploiting the mutual attention mechanism of cross-attention and multi-scale aggregation of 3D tokens. Extensive experimental results on three public 3D medical segmentation datasets show that TMA-TransBTS achieves higher averaged segmentation results than previous state-of-the-art CNN-based 3D methods and CNN-Transform hybrid 3D methods for the segmentation of 3D multi-modality brain tumors.
zh

人工智能

[AI-0] Weight Ensembling Improves Reasoning in Language Models

【速读】：本文研究了在推理模型训练过程中出现的一种失效模式，即生成多样性崩溃导致测试时间扩展性能不佳的问题。具体而言，在有监督微调（Supervised Fine-Tuning, SFT）过程中，Pass@1指标可靠提升，但Pass@k指标迅速恶化。为解决此问题，论文提出了一种简单干预方法——通过线性插值最新SFT检查点与早期检查点的权重（WiSE-FT），该方法几乎完全恢复了Pass@k性能，并同时提升了Pass@1。进一步通过强化学习微调后，WiSE-FT变体在测试时间扩展（如Best@k、多数投票）中表现更优，且使用更少数据即可达到更好的效果。此外，WiSE-FT提供了温度缩放等多样化解码策略无法单独实现的互补性能增益。论文从偏差-方差的角度形式化分析了Pass@k与Pass@1期望和方差之间的权衡关系，发现WiSE-FT能够同时降低偏差和方差，而温度缩放则固有地在这两者之间进行权衡。因此，关键在于通过WiSE-FT方法有效平衡生成多样性与性能优化之间的关系。

链接: https://arxiv.org/abs/2504.10478
作者: Xingyu Dang,Christina Baek,Kaiyue Wen,Zico Kolter,Aditi Raghunathan
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:We investigate a failure mode that arises during the training of reasoning models, where the diversity of generations begins to collapse, leading to suboptimal test-time scaling. Notably, the Pass@1 rate reliably improves during supervised finetuning (SFT), but Pass@k rapidly deteriorates. Surprisingly, a simple intervention of interpolating the weights of the latest SFT checkpoint with an early checkpoint, otherwise known as WiSE-FT, almost completely recovers Pass@k while also improving Pass@1. The WiSE-FT variant achieves better test-time scaling (Best@k, majority vote) and achieves superior results with less data when tuned further by reinforcement learning. Finally, we find that WiSE-FT provides complementary performance gains that cannot be achieved only through diversity-inducing decoding strategies, like temperature scaling. We formalize a bias-variance tradeoff of Pass@k with respect to the expectation and variance of Pass@1 over the test distribution. We find that WiSE-FT can reduce bias and variance simultaneously, while temperature scaling inherently trades-off between bias and variance.
zh

[AI-1] AI-Driven Code Refactoring: Using Graph Neural Networks to Enhance Software Maintainability

【速读】：本文旨在解决通过自动化手段提升代码可维护性的问题，具体目标包括降低代码的圈复杂度（cyclomatic complexity）和耦合度（coupling）。论文提出利用图神经网络（Graph Neural Networks, GNNs）作为创新工具，基于抽象语法树（Abstract Syntax Trees, ASTs）进行代码重构。解决方案的关键在于将代码结构建模为图，并通过GNN学习节点间的关系以预测有效的重构操作，从而实现对代码质量的显著提升。实验结果表明，GNNs在准确性（92%）、复杂度降低（35%）和耦合度减少（33%）方面优于传统的SonarQube（分别为78%、16%）和决策树方法（分别为85%、25%），并且预处理阶段修复了60%的语法错误。

链接: https://arxiv.org/abs/2504.10412
作者: Gopichand Bandarupalli
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Software Engineering (cs.SE)
备注:

点击查看摘要

Abstract:This study explores Graph Neural Networks (GNNs) as a transformative tool for code refactoring, using abstract syntax trees (ASTs) to boost software maintainability. It analyzes a dataset of 2 million snippets from CodeSearchNet and a custom 75000-file GitHub Python corpus, comparing GNNs against rule-based SonarQube and decision trees. Metrics include cyclomatic complexity (target below 10), coupling (target below 5), and refactoring precision. GNNs achieve 92% accuracy, reducing complexity by 35% and coupling by 33%, outperforming SonarQube (78%, 16%) and decision trees (85%, 25%). Preprocessing fixed 60% of syntax errors. Bar graphs, tables, and AST visuals clarify results. This offers a scalable AI-driven path to cleaner codebases, which is crucial for software engineering.
zh

[AI-2] Can LLM s Assist Expert Elicitation for Probabilistic Causal Modeling?

【速读】：该论文旨在探索大型语言模型（Large Language Models, LLMs）作为替代人工专家知识提取方法的可能性，以在生物计量学和医疗健康应用中提取结构化的因果知识并促进因果建模。论文的关键解决方案在于利用LLMs生成贝叶斯网络（Bayesian Networks, BNs），并通过与传统统计方法（如贝叶斯信息准则）在医疗数据集上的对比验证其性能。验证技术包括结构方程建模（Structural Equation Modeling, SEM）以确认关系，并通过熵值、预测准确性及鲁棒性等指标比较网络结构。结果显示，LLM生成的BNs在熵值上低于专家提取和统计生成的BNs，表明其预测具有更高的置信度和精度。然而，LLMs面临的上下文限制、幻觉依赖以及训练数据潜在偏差等问题仍需进一步研究。因此，该研究的核心贡献在于展示了LLMs在概率因果建模领域作为专家知识提取工具的潜力，有望提升模型决策的透明度并降低不确定性。

链接: https://arxiv.org/abs/2504.10397
作者: Olha Shaposhnyk,Daria Zahorska,Svetlana Yanushkevich
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Objective: This study investigates the potential of Large Language Models (LLMs) as an alternative to human expert elicitation for extracting structured causal knowledge and facilitating causal modeling in biometric and healthcare applications. Material and Methods: LLM-generated causal structures, specifically Bayesian networks (BNs), were benchmarked against traditional statistical methods (e.g., Bayesian Information Criterion) using healthcare datasets. Validation techniques included structural equation modeling (SEM) to verifying relationships, and measures such as entropy, predictive accuracy, and robustness to compare network structures. Results and Discussion: LLM-generated BNs demonstrated lower entropy than expert-elicited and statistically generated BNs, suggesting higher confidence and precision in predictions. However, limitations such as contextual constraints, hallucinated dependencies, and potential biases inherited from training data require further investigation. Conclusion: LLMs represent a novel frontier in expert elicitation for probabilistic causal modeling, promising to improve transparency and reduce uncertainty in the decision-making using such models. Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG) Cite as: arXiv:2504.10397 [cs.AI] (or arXiv:2504.10397v1 [cs.AI] for this version) https://doi.org/10.48550/arXiv.2504.10397 Focus to learn more arXiv-issued DOI via DataCite (pending registration) Submission history From: Olha Shaposhnyk [view email] [v1] Mon, 14 Apr 2025 16:45:52 UTC (1,040 KB)
zh

[AI-3] acher Motion Priors: Enhancing Robot Locomotion over Challenging Terrain

【速读】：该论文致力于解决在复杂地形上实现鲁棒运动控制的挑战，主要难点在于高维控制和环境不确定性。论文提出了一种基于教师-学生范式的教师先验框架，通过整合模仿学习和辅助任务学习来提升学习效率和泛化能力。解决方案的关键在于解耦网络设计，简化策略网络与部署过程。具体而言，首先利用特权信息训练高性能的教师策略以获取可泛化的运动技能，然后通过生成对抗机制将教师的运动分布传递给仅依赖于嘈杂本体感觉数据的学生策略，从而缓解因分布偏移导致的性能退化。此外，辅助任务学习增强了学生策略的特征表示，加速了收敛并提升了对不同地形的适应性。该框架在类人机器人上的验证表明，其在动态地形上的运动稳定性显著提高，并大幅降低了开发成本。

链接: https://arxiv.org/abs/2504.10390
作者: Fangcheng Jin,Yuqi Wang,Peixin Ma,Guodong Yang,Pan Zhao,En Li,Zhengtao Zhang
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注: 8 pages, 6 figures, 6 tables

点击查看摘要

Abstract:Achieving robust locomotion on complex terrains remains a challenge due to high dimensional control and environmental uncertainties. This paper introduces a teacher prior framework based on the teacher student paradigm, integrating imitation and auxiliary task learning to improve learning efficiency and generalization. Unlike traditional paradigms that strongly rely on encoder-based state embeddings, our framework decouples the network design, simplifying the policy network and deployment. A high performance teacher policy is first trained using privileged information to acquire generalizable motion skills. The teacher’s motion distribution is transferred to the student policy, which relies only on noisy proprioceptive data, via a generative adversarial mechanism to mitigate performance degradation caused by distributional shifts. Additionally, auxiliary task learning enhances the student policy’s feature representation, speeding up convergence and improving adaptability to varying terrains. The framework is validated on a humanoid robot, showing a great improvement in locomotion stability on dynamic terrains and significant reductions in development costs. This work provides a practical solution for deploying robust locomotion strategies in humanoid robots.
zh

[AI-4] SymRTLO: Enhancing RTL Code Optimization with LLM s and Neuron-Inspired Symbolic Reasoning

【速读】：本文旨在解决在数字电路综合早期阶段通过优化寄存器传输级（RTL）代码提升功耗、性能和面积（PPA）的问题。传统手动重写方法虽能产生高质量结果，但耗时且易出错；而现有基于编译器的方法难以处理复杂的约束条件，而基于大型语言模型（LLM）的方法虽为替代方案，却常面临生成代码与提示对齐困难的问题。为此，论文提出SymRTLO，这是一种新颖的神经符号RTL优化框架，将基于LLM的代码重写与符号推理技术无缝集成。其关键创新在于引入基于检索增强生成（RAG）的优化规则系统和基于抽象语法树（AST）的模板，确保LLM生成的代码保持语法正确性的同时最小化不必要的电路行为。此外，通过设计用于分析和优化有限状态机（FSM）逻辑的符号模块，实现更细粒度的状态合并及部分规格处理，超越基于模式的编译器能力。同时，结合形式等价检查与测试驱动验证的快速验证管道进一步简化验证流程。实验表明，与最先进的方法相比，SymRTLO分别提升了高达43.9%的功耗、62.5%的性能以及51.1%的面积效率。

链接: https://arxiv.org/abs/2504.10369
作者: Yiting Wang,Wanghao Ye,Ping Guo,Yexiao He,Ziyao Wang,Yexiao He,Bowei Tian,Shwai He,Guoheng Sun,Zheyu Shen,Sihan Chen,Ankur Srivastava,Qingfu Zhang,Gang Qu,Ang Li
机构: 未知
类目: Hardware Architecture (cs.AR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Programming Languages (cs.PL)
备注: 16 pages, 8 figures, 7 tables. Under Review

点击查看摘要

Abstract:Optimizing Register Transfer Level (RTL) code is crucial for improving the power, performance, and area (PPA) of digital circuits in the early stages of synthesis. Manual rewriting, guided by synthesis feedback, can yield high-quality results but is time-consuming and error-prone. Most existing compiler-based approaches have difficulty handling complex design constraints. Large Language Model (LLM)-based methods have emerged as a promising alternative to address these challenges. However, LLM-based approaches often face difficulties in ensuring alignment between the generated code and the provided prompts. This paper presents SymRTLO, a novel neuron-symbolic RTL optimization framework that seamlessly integrates LLM-based code rewriting with symbolic reasoning techniques. Our method incorporates a retrieval-augmented generation (RAG) system of optimization rules and Abstract Syntax Tree (AST)-based templates, enabling LLM-based rewriting that maintains syntactic correctness while minimizing undesired circuit behaviors. A symbolic module is proposed for analyzing and optimizing finite state machine (FSM) logic, allowing fine-grained state merging and partial specification handling beyond the scope of pattern-based compilers. Furthermore, a fast verification pipeline, combining formal equivalence checks with test-driven validation, further reduces the complexity of verification. Experiments on the RTL-Rewriter benchmark with Synopsys Design Compiler and Yosys show that SymRTLO improves power, performance, and area (PPA) by up to 43.9%, 62.5%, and 51.1%, respectively, compared to the state-of-the-art methods.
zh

[AI-5] Heimdall: test-time scaling on the generative verification

【速读】：该论文旨在解决长链路推理（long Chain-of-Thought reasoning）中大语言模型（LLMs）在验证能力上的不足问题。传统方法在复杂问题（如数学竞赛题）上的验证准确性仅为62.5%，而现有研究对此关注较少。为应对这一挑战，论文提出Heimdall，这是一种专门用于长链路推理验证的大语言模型，通过纯强化学习方法显著提升了验证准确性至94.5%，并通过重复采样进一步提升至97.5%。此外，Heimdall展现出强大的泛化能力，在未包含于训练集的数学证明任务中也能有效检测大部分问题。为扩展其应用范围，论文进一步提出Pessimistic Verification策略，通过Heimdall对解算器模型生成的解进行评估，并基于悲观原则选择最可能正确的解，从而优化解算过程。最终，该方案在AIME2025等任务中显著提高了解的准确性。解决方案的关键在于结合Heimdall的高精度验证能力和Pessimistic Verification的决策机制，同时构建了一个三元自动知识发现系统，实现问题生成、解答提供与验证的闭环流程，从而大幅提升问题求解的效率与可靠性。

链接: https://arxiv.org/abs/2504.10337
作者: Wenlei Shi,Xing Jin
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:An AI system can create and maintain knowledge only to the extent that it can verify that knowledge itself. Recent work on long Chain-of-Thought reasoning has demonstrated great potential of LLMs on solving competitive problems, but their verification ability remains to be weak and not sufficiently investigated. In this paper, we propose Heimdall, the long CoT verification LLM that can accurately judge the correctness of solutions. With pure reinforcement learning, we boost the verification accuracy from 62.5% to 94.5% on competitive math problems. By scaling with repeated sampling, the accuracy further increases to 97.5%. Through human evaluation, Heimdall demonstrates impressive generalization capabilities, successfully detecting most issues in challenging math proofs, the type of which is not included during training. Furthermore, we propose Pessimistic Verification to extend the functionality of Heimdall to scaling up the problem solving. It calls Heimdall to judge the solutions from a solver model and based on the pessimistic principle, selects the most likely correct solution with the least uncertainty. Taking DeepSeek-R1-Distill-Qwen-32B as the solver model, Pessimistic Verification improves the solution accuracy on AIME2025 from 54.2% to 70.0% with 16x compute budget and to 83.3% with more compute budget. With the stronger solver Gemini 2.5 Pro, the score reaches 93.0%. Finally, we prototype an automatic knowledge discovery system, a ternary system where one poses questions, another provides solutions, and the third verifies the solutions. Using the data synthesis work NuminaMath for the first two components, Heimdall effectively identifies problematic records within the dataset and reveals that nearly half of the data is flawed, which interestingly aligns with the recent ablation studies from NuminaMath.
zh

[AI-6] AlayaDB: The Data Foundation for Efficient and Effective Long-context LLM Inference

【速读】：本文旨在解决大型语言模型（Large Language Models, LLMs）在推理过程中长上下文效率低下及资源消耗过大的问题。传统方法如KV缓存解聚合和基于检索的稀疏注意力机制虽有改进，但在硬件资源利用率和生成质量方面仍有不足。论文的关键创新在于将LLM推理中的注意力计算与缓存管理抽象为查询处理过程，并通过原生查询优化器提升性能。这种设计使得AlayaDB能够更高效地利用硬件资源，同时在不同服务级别目标（Service Level Objectives, SLOs）下提供高质量的生成结果。

链接: https://arxiv.org/abs/2504.10326
作者: Yangshen Deng,Zhengxin You,Long Xiang,Qilong Li,Peiqi Yuan,Zhaoyang Hong,Yitao Zheng,Wanting Li,Runzhong Li,Haotian Liu,Kyriakos Mouratidis,Man Lung Yiu,Huan Li,Qiaomu Shen,Rui Mao,Bo Tang
机构: 未知
类目: Artificial Intelligence (cs.AI); Databases (cs.DB); Information Retrieval (cs.IR)
备注: 14 pages, 12 figures, conference

点击查看摘要

Abstract:AlayaDB is a cutting-edge vector database system natively architected for efficient and effective long-context inference for Large Language Models (LLMs) at AlayaDB AI. Specifically, it decouples the KV cache and attention computation from the LLM inference systems, and encapsulates them into a novel vector database system. For the Model as a Service providers (MaaS), AlayaDB consumes fewer hardware resources and offers higher generation quality for various workloads with different kinds of Service Level Objectives (SLOs), when comparing with the existing alternative solutions (e.g., KV cache disaggregation, retrieval-based sparse attention). The crux of AlayaDB is that it abstracts the attention computation and cache management for LLM inference into a query processing procedure, and optimizes the performance via a native query optimizer. In this work, we demonstrate the effectiveness of AlayaDB via (i) three use cases from our industry partners, and (ii) extensive experimental results on LLM inference benchmarks.
zh

[AI-7] AutoStyle-TTS: Retrieval-Augmented Generation based Automatic Style Matching Text-to-Speech Synthesis ICME25

【速读】：该论文旨在解决语音合成领域中合成语音自然度与表现力不足的问题，特别是现有研究忽视了提示词（prompt）选择的重要性。论文的关键解决方案是提出了一种基于检索增强生成（Retrieval-Augmented Generation, RAG）技术的文本转语音（Text-to-Speech, TTS）框架，通过构建包含高质量语音样本的风格知识库以及开发一种基于嵌入向量匹配的风格适配方案来动态调整语音风格，以实现更自然且生动的通信效果。该方案利用由Llama、PER-LLM-Embedder和Moka提取的嵌入向量与知识库中的样本进行匹配，从而选择最合适的语音风格用于合成。

链接: https://arxiv.org/abs/2504.10309
作者: Dan Luo,Chengyuan Ma,Weiqin Li,Jun Wang,Wei Chen,Zhiyong Wu
机构: 未知
类目: ound (cs.SD); Artificial Intelligence (cs.AI)
备注: accepted by ICME25

点击查看摘要

Abstract:With the advancement of speech synthesis technology, users have higher expectations for the naturalness and expressiveness of synthesized speech. But previous research ignores the importance of prompt selection. This study proposes a text-to-speech (TTS) framework based on Retrieval-Augmented Generation (RAG) technology, which can dynamically adjust the speech style according to the text content to achieve more natural and vivid communication effects. We have constructed a speech style knowledge database containing high-quality speech samples in various contexts and developed a style matching scheme. This scheme uses embeddings, extracted by Llama, PER-LLM-Embedder,and Moka, to match with samples in the knowledge database, selecting the most appropriate speech style for synthesis. Furthermore, our empirical research validates the effectiveness of the proposed method. Our demo can be viewed at: this https URL
zh

[AI-8] Characterizing LLM -driven Social Network: The Chirper.ai Case

【速读】：该论文试图解决大型语言模型（Large Language Models, LLMs）驱动的在线社交网络与人类驱动的在线社交网络之间的对比研究不足的问题。现有研究主要关注LLM代理在模拟网络中的集体行为模式和结构特性，但缺乏实证比较，限制了对LLM代理与人类用户差异的理解。论文的关键解决方案是构建了一个由超过65,000个LLM代理组成的大型社交网络平台，并收集了规模相当的人类驱动的Mastodon社交网络数据集，通过大规模分析比较两者在发帖行为、滥用内容及社交网络结构方面的差异，从而提供关于AI时代在线社交网络分析的深入洞见。

链接: https://arxiv.org/abs/2504.10286
作者: Yiming Zhu,Yupeng He,Ehsan-Ul Haq,Gareth Tyson,Pan Hui
机构: 未知
类目: ocial and Information Networks (cs.SI); Artificial Intelligence (cs.AI)
备注: Work in progress

点击查看摘要

Abstract:Large language models (LLMs) demonstrate the ability to simulate human decision-making processes, enabling their use as agents in modeling sophisticated social networks, both offline and online. Recent research has explored collective behavioral patterns and structural characteristics of LLM agents within simulated networks. However, empirical comparisons between LLM-driven and human-driven online social networks remain scarce, limiting our understanding of how LLM agents differ from human users. This paper presents a large-scale analysis of this http URL, an X/Twitter-like social network entirely populated by LLM agents, comprising over 65,000 agents and 7.7 million AI-generated posts. For comparison, we collect a parallel dataset from Mastodon, a human-driven decentralized social network, with over 117,000 users and 16 million posts. We examine key differences between LLM agents and humans in posting behaviors, abusive content, and social network structures. Our findings provide critical insights into the evolving landscape of online social network analysis in the AI era, offering a comprehensive profile of LLM agents in social simulations.
zh

[AI-9] Vision based driving agent for race car simulation environments

【速读】：该论文试图解决在模拟环境中实现接近专业人类水平的车辆操控以充分利用轮胎抓地力的问题。解决方案的关键在于采用深度强化学习（Deep Reinforcement Learning, DRL）方法，特别是利用近端策略优化（Proximal Policy Optimization, PPO）算法，通过仅使用视觉输入来训练智能体（agent），使其能够控制方向盘和踏板，从而达成时间最优驾驶任务，并实现类似于人类的学习与驾驶行为。

链接: https://arxiv.org/abs/2504.10266
作者: Gergely Bári,László Palkovics
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Submitted to ICMCE 2024 ( this https URL )

点击查看摘要

Abstract:In recent years, autonomous driving has become a popular field of study. As control at tire grip limit is essential during emergency situations, algorithms developed for racecars are useful for road cars too. This paper examines the use of Deep Reinforcement Learning (DRL) to solve the problem of grip limit driving in a simulated environment. Proximal Policy Optimization (PPO) method is used to train an agent to control the steering wheel and pedals of the vehicle, using only visual inputs to achieve professional human lap times. The paper outlines the formulation of the task of time optimal driving on a race track as a deep reinforcement learning problem, and explains the chosen observations, actions, and reward functions. The results demonstrate human-like learning and driving behavior that utilize maximum tire grip potential.
zh

[AI-10] Can Competition Enhance the Proficiency of Agents Powered by Large Language Models in the Realm of News-driven Time Series Forecasting?

【速读】：该论文旨在解决新闻驱动的多智能体时间序列预测任务中，衡量不同新闻事件对时间序列波动影响的挑战。具体而言，现有方法在优化智能体创新思维能力和识别误导性逻辑方面的能力有限。为应对这一问题，论文的关键解决方案在于引入竞争机制以增强智能体的创新能力，并在反思阶段嵌入经过微调的小规模语言模型（LLM）以提升其识别误导信息的能力。实验结果表明，竞争机制能够显著提高时间序列预测性能，同时竞争强度对智能体表现的影响为基于LLM的多智能体系统研究提供了新视角。

链接: https://arxiv.org/abs/2504.10210
作者: Yuxuan Zhang,Yangyang Feng,Daifeng Li,Kexin Zhang,Junlan Chen,Bowen Deng
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Multi-agents-based news-driven time series forecasting is considered as a potential paradigm shift in the era of large language models (LLMs). The challenge of this task lies in measuring the influences of different news events towards the fluctuations of time series. This requires agents to possess stronger abilities of innovative thinking and the identifying misleading logic. However, the existing multi-agent discussion framework has limited enhancement on time series prediction in terms of optimizing these two capabilities. Inspired by the role of competition in fostering innovation, this study embeds a competition mechanism within the multi-agent discussion to enhance agents’ capability of generating innovative thoughts. Furthermore, to bolster the model’s proficiency in identifying misleading information, we incorporate a fine-tuned small-scale LLM model within the reflective stage, offering auxiliary decision-making support. Experimental results confirm that the competition can boost agents’ capacity for innovative thinking, which can significantly improve the performances of time series prediction. Similar to the findings of social science, the intensity of competition within this framework can influence the performances of agents, providing a new perspective for studying LLMs-based multi-agent systems.
zh

[AI-11] Efficient Generative Model Training via Embedded Representation Warmup

【速读】：该论文旨在解决扩散模型（Diffusion Models）在训练效率和表征质量方面相较于自监督方法（Self-Supervised Methods）存在的不足，其核心问题是扩散模型在训练过程中未能充分利用高质量、语义丰富的表征，导致收敛速度缓慢。论文的关键发现是，语义和结构模式的学习主要发生在早期层中的表征处理区域（Representation Processing Region），这是生成过程之前的重要阶段。为了解决这一问题，作者提出了嵌入表征预热（Embedded Representation Warmup, ERW）框架，其关键在于通过第一阶段的预热模块，利用预训练的高质量表征初始化扩散模型的早期层，从而减轻从零学习表征的负担，加速收敛并提升性能。理论分析表明，ERW的有效性依赖于其精确整合到特定神经网络层的能力，这些层即为表征处理区域。实验结果进一步证明，ERW不仅加速了训练收敛，还提升了表征质量，在训练速度上实现了比当前最先进的方法REPA快40倍的提升。

链接: https://arxiv.org/abs/2504.10188
作者: Deyuan Liu,Peng Sun,Xufeng Li,Tao Lin
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Diffusion models excel at generating high-dimensional data but fall short in training efficiency and representation quality compared to self-supervised methods. We identify a key bottleneck: the underutilization of high-quality, semantically rich representations during training notably slows down convergence. Our systematic analysis reveals a critical representation processing region – primarily in the early layers – where semantic and structural pattern learning takes place before generation can occur. To address this, we propose Embedded Representation Warmup (ERW), a plug-and-play framework where in the first stage we get the ERW module serves as a warmup that initializes the early layers of the diffusion model with high-quality, pretrained representations. This warmup minimizes the burden of learning representations from scratch, thereby accelerating convergence and boosting performance. Our theoretical analysis demonstrates that ERW’s efficacy depends on its precise integration into specific neural network layers – termed the representation processing region – where the model primarily processes and transforms feature representations for later generation. We further establish that ERW not only accelerates training convergence but also enhances representation quality: empirically, our method achieves a 40 \times acceleration in training speed compared to REPA, the current state-of-the-art methods. Code is available at this https URL.
zh

[AI-12] BoTTA: Benchmarking on-device Test Time Adaptation

【速读】：本文旨在解决深度学习模型在实际部署中因测试样本分布偏移而导致性能显著下降的问题。传统方法依赖于训练数据分布，而测试时适应（Test-Time Adaptation, TTA）通过在推理阶段动态调整模型以适应新环境，无需标记的测试数据或访问原始训练集。然而，现有研究较少关注移动和边缘设备上的特定约束。为此，论文提出BoTTA基准，专门评估TTA方法在这些资源受限设备上的表现。关键在于设计了一套针对四个实际挑战的评估框架：有限的测试样本、类别暴露限制、多样化的分布偏移以及样本内的重叠偏移。此外，与持续推理时适应不同，BoTTA提倡周期性适应以满足设备端需求。实验揭示了当前TTA算法在小数据集上的局限性及对未见类别泛化能力不足等问题，并提供了具体的资源使用数据。这为资源受限环境下的TTA应用提供了有价值的指导。

链接: https://arxiv.org/abs/2504.10149
作者: Michal Danilowski,Soumyajit Chatterjee,Abhirup Ghosh
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:The performance of deep learning models depends heavily on test samples at runtime, and shifts from the training data distribution can significantly reduce accuracy. Test-time adaptation (TTA) addresses this by adapting models during inference without requiring labeled test data or access to the original training set. While research has explored TTA from various perspectives like algorithmic complexity, data and class distribution shifts, model architectures, and offline versus continuous learning, constraints specific to mobile and edge devices remain underexplored. We propose BoTTA, a benchmark designed to evaluate TTA methods under practical constraints on mobile and edge devices. Our evaluation targets four key challenges caused by limited resources and usage conditions: (i) limited test samples, (ii) limited exposure to categories, (iii) diverse distribution shifts, and (iv) overlapping shifts within a sample. We assess state-of-the-art TTA methods under these scenarios using benchmark datasets and report system-level metrics on a real testbed. Furthermore, unlike prior work, we align with on-device requirements by advocating periodic adaptation instead of continuous inference-time adaptation. Experiments reveal key insights: many recent TTA algorithms struggle with small datasets, fail to generalize to unseen categories, and depend on the diversity and complexity of distribution shifts. BoTTA also reports device-specific resource use. For example, while SHOT improves accuracy by 2.25\times with 512 adaptation samples, it uses 1.08\times peak memory on Raspberry Pi versus the base model. BoTTA offers actionable guidance for TTA in real-world, resource-constrained deployments.
zh

[AI-13] GeoUni: A Unified Model for Generating Geometry Diagrams Problems and Problem Solutions

【速读】：该论文试图解决几何推理任务中生成式模型无法同时处理问题求解与图示生成的问题，传统方法将这两者视为独立任务，而现有模型未能成功整合二者以支持个性化几何问题的创建。论文的关键解决方案在于提出GeoUni，一个统一的几何专家模型，通过单一框架实现问题求解与图示生成的无缝集成，从而能够创建独特且个性化的几何问题。实验表明，GeoUni仅需1.5B参数即可在几何推理任务中达到与DeepSeek-R1（671B参数）相当的性能，并在图示生成方面超越现有的文本到图像模型及统一模型，尤其在生成与特定知识点匹配的图文一致的几何问题方面表现出色。

链接: https://arxiv.org/abs/2504.10146
作者: Jo-Ku Cheng,Zeren Zhang,Ran Chen,Jingyang Deng,Ziran Qin,Jinwen Ma
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:We propose GeoUni, the first unified geometry expert model capable of generating problem solutions and diagrams within a single framework in a way that enables the creation of unique and individualized geometry problems. Traditionally, solving geometry problems and generating diagrams have been treated as separate tasks in machine learning, with no models successfully integrating both to support problem creation. However, we believe that mastery in geometry requires frictionless integration of all of these skills, from solving problems to visualizing geometric relationships, and finally, crafting tailored problems. Our extensive experiments demonstrate that GeoUni, with only 1.5B parameters, achieves performance comparable to larger models such as DeepSeek-R1 with 671B parameters in geometric reasoning tasks. GeoUni also excels in generating precise geometric diagrams, surpassing both text-to-image models and unified models, including the GPT-4o image generation. Most importantly, GeoUni is the only model capable of successfully generating textual problems with matching diagrams based on specific knowledge points, thus offering a wider range of capabilities that extend beyond current models.
zh

[AI-14] Benchmarking Practices in LLM -driven Offensive Security: Testbeds Metrics and Experiment Design

【速读】：该论文旨在解决如何有效评估基于大型语言模型（Large Language Models, LLMs）驱动的攻击方法，并提出针对LLM在网络安全领域进攻性应用的评估体系改进方案。论文的关键在于强调现有测试环境的扩展、基准的建立以及综合度量指标与定性分析的重要性，同时指出CTF（Capture The Flag）类挑战可能无法充分反映真实世界渗透测试场景的需求。通过回顾16篇相关研究中的15个原型及其对应的测试平台，论文提供了具有实践意义的建议以指导未来的研究方向。

链接: https://arxiv.org/abs/2504.10112
作者: Andreas Happe,Jürgen Cito
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Large Language Models (LLMs) have emerged as a powerful approach for driving offensive penetration-testing tooling. This paper analyzes the methodology and benchmarking practices used for evaluating Large Language Model (LLM)-driven attacks, focusing on offensive uses of LLMs in cybersecurity. We review 16 research papers detailing 15 prototypes and their respective testbeds. We detail our findings and provide actionable recommendations for future research, emphasizing the importance of extending existing testbeds, creating baselines, and including comprehensive metrics and qualitative analysis. We also note the distinction between security research and practice, suggesting that CTF-based challenges may not fully represent real-world penetration testing scenarios. Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI) Cite as: arXiv:2504.10112 [cs.CR] (or arXiv:2504.10112v1 [cs.CR] for this version) https://doi.org/10.48550/arXiv.2504.10112 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
zh

[AI-15] Lightweight Trustworthy Distributed Clustering

【速读】：该论文旨在解决边缘计算系统（Edge Computing Systems, ECS）中单个边缘节点数据可信性保障与协作数据处理之间的矛盾，特别是在资源受限场景下（如自动驾驶传感器网络、工业物联网和智慧城市）。论文的关键解决方案在于提出了一种轻量级、完全分布式的k-means聚类算法，专为边缘环境设计。该算法在聚类中心更新阶段采用基于加法秘密共享的分布式平均方法，这是一种安全多方计算技术，以确保跨节点数据的准确性和可信性。

链接: https://arxiv.org/abs/2504.10109
作者: Hongyang Li,Caesar Wu,Mohammed Chadli,Said Mammar,Pascal Bouvry
机构: 未知
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Ensuring data trustworthiness within individual edge nodes while facilitating collaborative data processing poses a critical challenge in edge computing systems (ECS), particularly in resource-constrained scenarios such as autonomous systems sensor networks, industrial IoT, and smart cities. This paper presents a lightweight, fully distributed k-means clustering algorithm specifically adapted for edge environments, leveraging a distributed averaging approach with additive secret sharing, a secure multiparty computation technique, during the cluster center update phase to ensure the accuracy and trustworthiness of data across nodes.
zh

[AI-16] MMKB-RAG : A Multi-Modal Knowledge-Based Retrieval-Augmented Generation Framework

【速读】：该论文旨在解决现有大型语言模型（Large Language Models, LLMs）及其多模态版本依赖参数化知识所导致的生成信息过时以及错误内容风险增加的问题。此外，虽然检索增强生成（Retrieval-Augmented Generation, RAG）部分缓解了这些问题，但其对数据库和检索系统的依赖可能引入无关或不准确的文档，从而损害性能与推理质量。为此，本文提出了一种新的多模态知识库检索增强生成框架（Multi-Modal Knowledge-Based Retrieval-Augmented Generation, MMKB-RAG），其关键在于利用模型的固有知识边界动态生成语义标签以优化检索过程，并通过联合过滤机制保留最相关且准确的参考文档，从而显著提升基于知识的视觉问答任务中的性能与鲁棒性。

链接: https://arxiv.org/abs/2504.10074
作者: Zihan Ling,Zhiyao Guo,Yixuan Huang,Yi An,Shuai Xiao,Jinsong Lan,Xiaoyong Zhu,Bo Zheng
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Recent advancements in large language models (LLMs) and multi-modal LLMs have been remarkable. However, these models still rely solely on their parametric knowledge, which limits their ability to generate up-to-date information and increases the risk of producing erroneous content. Retrieval-Augmented Generation (RAG) partially mitigates these challenges by incorporating external data sources, yet the reliance on databases and retrieval systems can introduce irrelevant or inaccurate documents, ultimately undermining both performance and reasoning quality. In this paper, we propose Multi-Modal Knowledge-Based Retrieval-Augmented Generation (MMKB-RAG), a novel multi-modal RAG framework that leverages the inherent knowledge boundaries of models to dynamically generate semantic tags for the retrieval process. This strategy enables the joint filtering of retrieved documents, retaining only the most relevant and accurate references. Extensive experiments on knowledge-based visual question-answering tasks demonstrate the efficacy of our approach: on the E-VQA dataset, our method improves performance by +4.2% on the Single-Hop subset and +0.4% on the full dataset, while on the InfoSeek dataset, it achieves gains of +7.8% on the Unseen-Q subset, +8.2% on the Unseen-E subset, and +8.1% on the full dataset. These results highlight significant enhancements in both accuracy and robustness over the current state-of-the-art MLLM and RAG frameworks.
zh

[AI-17] Pay Attention to What and Where? Interpretable Feature Extractor in Vision-based Deep Reinforcement Learning

【速读】：本文针对传统卷积神经网络（CNNs）在空间维度上的局限性展开研究，特别是现有可解释深度强化学习（Explainable Deep Reinforcement Learning）方法中注意力掩码与视觉输入中对象的空间偏移问题。论文的关键在于提出了一种名为可解释特征提取器（Interpretable Feature Extractor, IFE）的架构，旨在生成准确的注意力掩码，以同时说明智能体在空间域中“关注什么”以及“关注哪里”。IFE 包含两个核心模块：Human-Understandable Encoding 模块用于生成完全可解释的注意力掩码，Agent-Friendly Encoding 模块则通过优化智能体的学习效率来增强模型性能。最终，IFE 在基于视觉的深度强化学习中实现了模型的可解释性，并确保生成的注意力掩码在空间维度上一致且高度直观，能够有效突出视觉输入中的重要对象或位置。该方法通过将 IFE 集成到 Rainbow 和 Asynchronous Advantage Actor-Critic 模型中，在 57 个 ATARI 游戏任务上的实验验证了其在空间一致性、可解释性和数据效率方面的有效性。

链接: https://arxiv.org/abs/2504.10071
作者: Tien Pham,Angelo Cangelosi
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Current approaches in Explainable Deep Reinforcement Learning have limitations in which the attention mask has a displacement with the objects in visual input. This work addresses a spatial problem within traditional Convolutional Neural Networks (CNNs). We propose the Interpretable Feature Extractor (IFE) architecture, aimed at generating an accurate attention mask to illustrate both “what” and “where” the agent concentrates on in the spatial domain. Our design incorporates a Human-Understandable Encoding module to generate a fully interpretable attention mask, followed by an Agent-Friendly Encoding module to enhance the agent’s learning efficiency. These two components together form the Interpretable Feature Extractor for vision-based deep reinforcement learning to enable the model’s interpretability. The resulting attention mask is consistent, highly understandable by humans, accurate in spatial dimension, and effectively highlights important objects or locations in visual input. The Interpretable Feature Extractor is integrated into the Fast and Data-efficient Rainbow framework, and evaluated on 57 ATARI games to show the effectiveness of the proposed approach on Spatial Preservation, Interpretability, and Data-efficiency. Finally, we showcase the versatility of our approach by incorporating the IFE into the Asynchronous Advantage Actor-Critic Model.
zh

[AI-18] CHARM: Calibrating Reward Models With Chatbot Arena Scores

【速读】：本文旨在解决奖励模型（Reward Models, RMs）在从人类反馈中进行强化学习时存在的模型偏好偏差问题，即RMs倾向于对特定策略模型生成的响应赋予过高的分数，这种偏差会导致排名评估失真及不公平判断。为了解决这一问题，论文提出了一种名为CHatbot Arena校准奖励建模（CHARM）的校准方法，该方法利用Chatbot Arena排行榜中的Elo评分来缓解奖励模型的过高评价现象。此外，还引入了不匹配度（Mismatch Degree）指标来量化这种偏好偏差。关键在于，CHARM方法仅需少量的偏好数据即可完成奖励模型的持续训练，从而实现高效且通用的公平性提升与可靠性增强。

链接: https://arxiv.org/abs/2504.10045
作者: Xiao Zhu,Chenmien Tan,Pinzhen Chen,Rico Sennrich,Yanlin Zhang,Hanxu Hu
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Reward models (RMs) play a crucial role in Reinforcement Learning from Human Feedback by serving as proxies for human preferences in aligning large language models. In this paper, we identify a model preference bias in RMs, where they systematically assign disproportionately high scores to responses from certain policy models. This bias distorts ranking evaluations and leads to unfair judgments. To address this issue, we propose a calibration method named CHatbot Arena calibrated Reward Modeling (CHARM) that leverages Elo scores from the Chatbot Arena leaderboard to mitigate RM overvaluation. We also introduce a Mismatch Degree metric to measure this preference bias. Our approach is computationally efficient, requiring only a small preference dataset for continued training of the RM. We conduct extensive experiments on reward model benchmarks and human preference alignment. Results demonstrate that our calibrated RMs (1) achieve improved evaluation accuracy on RM-Bench and the Chat-Hard domain of RewardBench, and (2) exhibit a stronger correlation with human preferences by producing scores more closely aligned with Elo rankings. By mitigating model preference bias, our method provides a generalizable and efficient solution for building fairer and more reliable reward models.
zh

[AI-19] EmbodiedAgent : A Scalable Hierarchical Approach to Overcome Practical Challenge in Multi-Robot Control

【速读】：该论文旨在解决异构多机器人控制中的幻觉（hallucination）问题，特别是在不切实际的任务中，这种问题会导致错误的决策和执行。论文的关键创新在于提出了一种名为EmbodiedAgent的分层框架，它结合了下一动作预测范式与结构化记忆系统，将任务分解为可执行的机器人技能，并动态验证动作是否符合环境约束。此外，论文引入了MultiPlan+数据集和Robot Planning Assessment Schema (RPAS) 评估方法，进一步支持模型在复杂场景下的鲁棒性与实用性验证。这些方案的核心在于通过动态环境交互与多层次规划能力，有效缓解幻觉问题，提升长期目标协调能力。

链接: https://arxiv.org/abs/2504.10030
作者: Hanwen Wan,Yifei Chen,Zeyu Wei,Dongrui Li,Zexin Lin,Donghao Wu,Jiu Cheng,Yuxiang Zhang,Xiaoqiang Ji
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:This paper introduces EmbodiedAgent, a hierarchical framework for heterogeneous multi-robot control. EmbodiedAgent addresses critical limitations of hallucination in impractical tasks. Our approach integrates a next-action prediction paradigm with a structured memory system to decompose tasks into executable robot skills while dynamically validating actions against environmental constraints. We present MultiPlan+, a dataset of more than 18,000 annotated planning instances spanning 100 scenarios, including a subset of impractical cases to mitigate hallucination. To evaluate performance, we propose the Robot Planning Assessment Schema (RPAS), combining automated metrics with LLM-aided expert grading. Experiments demonstrate EmbodiedAgent’s superiority over state-of-the-art models, achieving 71.85% RPAS score. Real-world validation in an office service task highlights its ability to coordinate heterogeneous robots for long-horizon objectives.
zh

[AI-20] Session-based Recommender Systems: User Interest as a Stochastic Process in the Latent Space

【速读】：该论文旨在解决会话推荐系统中的数据不确定性、流行度偏差以及曝光偏差问题。论文通过研究这些偏差在项目嵌入（item embeddings）及推荐结果中的表现，提出将用户兴趣建模为潜在空间中的随机过程，并提供了一种与模型无关的数学概念实现方法。解决方案的关键在于引入三个要素：通过正则化实现嵌入均匀性来去偏项目嵌入、从会话前缀建模密集的用户兴趣，以及在数据中引入虚拟目标以模拟扩展的曝光。实验结果显示，所提方法能够有效缓解上述挑战。

链接: https://arxiv.org/abs/2504.10005
作者: Klaudia Balcer,Piotr Lipinski
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Machine Learning (stat.ML)
备注:

点击查看摘要

Abstract:This paper jointly addresses the problem of data uncertainty, popularity bias, and exposure bias in session-based recommender systems. We study the symptoms of this bias both in item embeddings and in recommendations. We propose treating user interest as a stochastic process in the latent space and providing a model-agnostic implementation of this mathematical concept. The proposed stochastic component consists of elements: debiasing item embeddings with regularization for embedding uniformity, modeling dense user interest from session prefixes, and introducing fake targets in the data to simulate extended exposure. We conducted computational experiments on two popular benchmark datasets, Diginetica and YooChoose 1/64, as well as several modifications of the YooChoose dataset with different ratios of popular items. The results show that the proposed approach allows us to mitigate the challenges mentioned.
zh

[AI-21] GenTe: Generative Real-world Terrains for General Legged Robot Locomotion Control

【速读】：该论文旨在解决双足机器人在多样化真实地形中行走这一基础性挑战，现有方法依赖预定义的高度图和静态环境，难以应对非结构化地形的复杂性。为弥合这一差距，论文提出了一种名为GenTe的框架，其关键是通过利用视觉-语言模型（Vision-Language Models, VLMs）的功能调用技术和推理能力，从文本和图形输入中生成物理上真实的、可适应的地形，以训练具有泛化能力的运动策略。GenTe构建了一个包含几何地形和物理地形的原子地形库，并引入了逼真的力建模来捕捉地形交互中的土壤下沉和水动力阻力等效应。这是首个系统性生成下肢机器人运动控制仿真环境的框架，并通过一个包含100个生成地形的基准测试验证了其提升双足机器人运动的泛化性和鲁棒性的效果。

链接: https://arxiv.org/abs/2504.09997
作者: Hanwen Wan,Mengkang Li,Donghao Wu,Yebin Zhong,Yixuan Deng,Zhenglong Sun,Xiaoqiang Ji
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Developing bipedal robots capable of traversing diverse real-world terrains presents a fundamental robotics challenge, as existing methods using predefined height maps and static environments fail to address the complexity of unstructured landscapes. To bridge this gap, we propose GenTe, a framework for generating physically realistic and adaptable terrains to train generalizable locomotion policies. GenTe constructs an atomic terrain library that includes both geometric and physical terrains, enabling curriculum training for reinforcement learning-based locomotion policies. By leveraging function-calling techniques and reasoning capabilities of Vision-Language Models (VLMs), GenTe generates complex, contextually relevant terrains from textual and graphical inputs. The framework introduces realistic force modeling for terrain interactions, capturing effects such as soil sinkage and hydrodynamic resistance. To the best of our knowledge, GenTe is the first framework that systemically generates simulation environments for legged robot locomotion control. Additionally, we introduce a benchmark of 100 generated terrains. Experiments demonstrate improved generalization and robustness in bipedal robot locomotion.
zh

[AI-22] owards Unbiased Federated Graph Learning: Label and Topology Perspectives

【速读】：该论文致力于解决联邦图学习（Federated Graph Learning, FGL）中公平性被忽视的问题。尽管现有方法在整体节点分类准确性上有所提升，但它们往往未能充分考虑节点特征、标签及图结构的复杂性，导致在少数类节点或存在异质连接（邻居具有不同标签或误导性特征）的节点上的性能显著下降。这种现象揭示了一个关键问题：高总体准确率可能掩盖了对结构性或语义上边缘化节点性能的退化。

为了解决这一问题，论文提出了两个公平性目标：（1）改善少数类节点的表示以实现类别级公平；（2）缓解由异质连接引起的拓扑偏见以实现拓扑感知公平。论文的关键解决方案是FairFGL框架，它通过细粒度图挖掘与协作学习来增强公平性。客户端采用History-Preserving模块防止过拟合到主导本地类别，使用Majority Alignment模块优化异质多数类节点的表示，并利用Gradient Modification模块将少数类知识从结构有利的客户端转移到其他客户端。服务器端则仅上传最具影响力的参数子集以减少通信成本并更好地反映本地分布，同时结合基于聚类的聚合策略协调冲突更新并抑制全局多数类主导。广泛评估表明，FairFGL显著提升了少数群体的表现，在八个基准数据集上实现了最高达22.62%的Macro-F1增益，并在收敛性上超越了现有最先进方法。

链接: https://arxiv.org/abs/2504.09963
作者: Zhengyu Wu,Boyang Pang,Xunkai Li,Yinlin Zhu,Daohan Su,Bowen Fan,Rong-Hua Li,Guoren Wang,Chenghu Zhou
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Databases (cs.DB); Social and Information Networks (cs.SI)
备注: Under Review

点击查看摘要

Abstract:Federated Graph Learning (FGL) enables privacy-preserving, distributed training of graph neural networks without sharing raw data. Among its approaches, subgraph-FL has become the dominant paradigm, with most work focused on improving overall node classification accuracy. However, these methods often overlook fairness due to the complexity of node features, labels, and graph structures. In particular, they perform poorly on nodes with disadvantaged properties, such as being in the minority class within subgraphs or having heterophilous connections (neighbors with dissimilar labels or misleading features). This reveals a critical issue: high accuracy can mask degraded performance on structurally or semantically marginalized nodes. To address this, we advocate for two fairness goals: (1) improving representation of minority class nodes for class-wise fairness and (2) mitigating topological bias from heterophilous connections for topology-aware fairness. We propose FairFGL, a novel framework that enhances fairness through fine-grained graph mining and collaborative learning. On the client side, the History-Preserving Module prevents overfitting to dominant local classes, while the Majority Alignment Module refines representations of heterophilous majority-class nodes. The Gradient Modification Module transfers minority-class knowledge from structurally favorable clients to improve fairness. On the server side, FairFGL uploads only the most influenced subset of parameters to reduce communication costs and better reflect local distributions. A cluster-based aggregation strategy reconciles conflicting updates and curbs global majority dominance . Extensive evaluations on eight benchmarks show FairFGL significantly improves minority-group performance , achieving up to a 22.62 percent Macro-F1 gain while enhancing convergence over state-of-the-art baselines.
zh

[AI-23] Privacy Meets Explainability: Managing Confidential Data and Transparency Policies in LLM -Empowered Science

【速读】：该论文旨在解决大型语言模型（Large Language Models, LLMs）在科学工作流程中使用时引发的数据保密性和伦理处理问题，特别是通过LLM驱动的科学工具可能无意中泄露包括知识产权和专有数据在内的机密信息的风险。论文的关键解决方案是提出“DataShield”框架，该框架能够检测机密数据泄露、总结隐私政策并可视化数据流，从而确保与组织政策和程序的一致性。其核心目标是提高科学家对数据处理实践的认识，支持他们做出知情决策以保护敏感信息。目前，与科学家合作进行的用户研究正在评估该框架的可用性、可信度以及应对实际隐私挑战的有效性。

链接: https://arxiv.org/abs/2504.09961
作者: Yashothara Shanmugarasa,Shidong Pan,Ming Ding,Dehai Zhao,Thierry Rakotoarivelo
机构: 未知
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
备注: 8 pages

点击查看摘要

Abstract:As Large Language Models (LLMs) become integral to scientific workflows, concerns over the confidentiality and ethical handling of confidential data have emerged. This paper explores data exposure risks through LLM-powered scientific tools, which can inadvertently leak confidential information, including intellectual property and proprietary data, from scientists’ perspectives. We propose “DataShield”, a framework designed to detect confidential data leaks, summarize privacy policies, and visualize data flow, ensuring alignment with organizational policies and procedures. Our approach aims to inform scientists about data handling practices, enabling them to make informed decisions and protect sensitive information. Ongoing user studies with scientists are underway to evaluate the framework’s usability, trustworthiness, and effectiveness in tackling real-world privacy challenges.
zh

[AI-24] FedRecon: Missing Modality Reconstruction in Distributed Heterogeneous Environments

【速读】：该论文旨在解决多模态数据在实际场景中常见的不完整性以及非独立同分布（Non-IID）特性带来的挑战，这些问题导致模态异构性和数据分布差异，从而对有效的联邦学习（Federated Learning, FL）构成根本性障碍。论文提出FedRecon，这是一种针对多模态联邦学习中同时进行缺失模态重建和Non-IID适应的首个方法。其关键在于首先利用轻量级多模态变分自编码器（Multimodal Variational Autoencoder, MVAE）实现缺失模态的重建，并通过新颖的分布映射机制保证样本级别的对齐，确保数据一致性和完整性；此外，通过全局生成器冻结策略防止灾难性遗忘，进一步缓解Non-IID带来的波动问题。实验结果表明，FedRecon在Non-IID条件下对模态重建的表现优于现有最先进的方法。

链接: https://arxiv.org/abs/2504.09941
作者: Junming Liu,Guosun Zeng,Ding Wang,Yanting Gao,Yufei Jin
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 18 pages, 32 figures

点击查看摘要

Abstract:Multimodal data are often incomplete and exhibit Non-Independent and Identically Distributed (Non-IID) characteristics in real-world scenarios. These inherent limitations lead to both modality heterogeneity through partial modality absence and data heterogeneity from distribution divergence, creating fundamental challenges for effective federated learning (FL). To address these coupled challenges, we propose FedRecon, the first method targeting simultaneous missing modality reconstruction and Non-IID adaptation in multimodal FL. Our approach first employs a lightweight Multimodal Variational Autoencoder (MVAE) to reconstruct missing modalities while preserving cross-modal consistency. Distinct from conventional imputation methods, we achieve sample-level alignment through a novel distribution mapping mechanism that guarantees both data consistency and completeness. Additionally, we introduce a strategy employing global generator freezing to prevent catastrophic forgetting, which in turn mitigates Non-IID fluctuations. Extensive evaluations on multimodal datasets demonstrate FedRecon’s superior performance in modality reconstruction under Non-IID conditions, surpassing state-of-the-art methods.
zh

[AI-25] Plasticity-Aware Mixture of Experts for Learning Under QoE Shifts in Adaptive Video Streaming

【速读】：该论文旨在解决因用户特征与视频内容差异导致的用户体验质量（QoE）优化目标动态变化问题，这一变化对神经网络的泛化能力提出了挑战，表现为“可塑性损失”（plasticity loss），即传统模型难以有效适应不断变化的优化目标。为应对这一局限，论文提出了一种名为“可塑性感知专家混合模型”（Plasticity-Aware Mixture of Experts, PA-MoE）的新框架。其关键在于通过平衡记忆保留与选择性遗忘来动态调节网络的可塑性，利用噪声注入促进过时知识的选择性遗忘，从而增强神经网络的自适应能力。此外，论文还提供了理论分析并验证了该方法在动态流媒体环境中的性能提升效果。

链接: https://arxiv.org/abs/2504.09906
作者: Zhiqiang He,Zhi Liu
机构: 未知
类目: Multimedia (cs.MM); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Adaptive video streaming systems are designed to optimize Quality of Experience (QoE) and, in turn, enhance user satisfaction. However, differences in user profiles and video content lead to different weights for QoE factors, resulting in user-specific QoE functions and, thus, varying optimization objectives. This variability poses significant challenges for neural networks, as they often struggle to generalize under evolving targets - a phenomenon known as plasticity loss that prevents conventional models from adapting effectively to changing optimization objectives. To address this limitation, we propose the Plasticity-Aware Mixture of Experts (PA-MoE), a novel learning framework that dynamically modulates network plasticity by balancing memory retention with selective forgetting. In particular, PA-MoE leverages noise injection to promote the selective forgetting of outdated knowledge, thereby endowing neural networks with enhanced adaptive capabilities. In addition, we present a rigorous theoretical analysis of PA-MoE by deriving a regret bound that quantifies its learning performance. Experimental evaluations demonstrate that PA-MoE achieves a 45.5% improvement in QoE over competitive baselines in dynamic streaming environments. Further analysis reveals that the model effectively mitigates plasticity loss by optimizing neuron utilization. Finally, a parameter sensitivity study is performed by injecting varying levels of noise, and the results align closely with our theoretical predictions.
zh

[AI-26] LangPert: Detecting and Handling Task-level Perturbations for Robust Object Rearrangement

【速读】：该论文旨在解决由任务级扰动（Task-Level Perturbations, TLP）引起的物体重排任务执行挑战，TLP 包括意外的对象添加、移除及位移，这些扰动可能破坏底层的视觉策略并严重影响任务的可行性和进展。为应对这些挑战，论文提出了一种名为 LangPert 的基于语言的框架，用于检测和缓解桌面重排任务中的 TLP 情况。LangPert 的关键是结合视觉语言模型（Visual Language Model, VLM）全面监控策略的技能执行与环境扰动，并利用分层链式思维（Hierarchical Chain-of-Thought, HCoT）推理机制增强大型语言模型（Large Language Model, LLM）的上下文理解能力，从而生成适应性的修正性技能执行计划。实验结果表明，LangPert 在处理多样化的 TLP 情况时比基线方法表现更优，具有更高的任务完成率、改进的执行效率以及潜在的未见场景泛化能力。

链接: https://arxiv.org/abs/2504.09893
作者: Xu Yin,Min-Sung Yoon,Yuchi Huo,Kang Zhang,Sung-Eui Yoon
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Task execution for object rearrangement could be challenged by Task-Level Perturbations (TLP), i.e., unexpected object additions, removals, and displacements that can disrupt underlying visual policies and fundamentally compromise task feasibility and progress. To address these challenges, we present LangPert, a language-based framework designed to detect and mitigate TLP situations in tabletop rearrangement tasks. LangPert integrates a Visual Language Model (VLM) to comprehensively monitor policy’s skill execution and environmental TLP, while leveraging the Hierarchical Chain-of-Thought (HCoT) reasoning mechanism to enhance the Large Language Model (LLM)'s contextual understanding and generate adaptive, corrective skill-execution plans. Our experimental results demonstrate that LangPert handles diverse TLP situations more effectively than baseline methods, achieving higher task completion rates, improved execution efficiency, and potential generalization to unseen scenarios.
zh

[AI-27] Constructing Micro Knowledge Graphs from Technical Support Documents

【速读】：该论文旨在解决从大量技术文档中构建可用知识图谱（Knowledge Graph, KG）时面临的粒度挑战。具体而言，传统的知识图谱仅存储每个页面的关键实体和动作，导致被忽略的实体和动作所代表的知识丢失，无法被图搜索和推理功能利用。论文的关键解决方案是提出一种方法，为每一页技术文档创建微知识图谱（micrograph），其不仅存储页面中的所有实体和动作，还利用页面结构明确表示这些实体和动作出现的具体位置及其相互关系。此外，论文还定义了用于表示半结构化和纯文本知识的模式，并提出了从网页中提取步骤构成的流程以及在微知识图谱中表示这些流程的技术。最终，这些微知识图谱可作为技术支持应用的额外知识源，从而提升技术支撑系统的效能。

链接: https://arxiv.org/abs/2504.09877
作者: Atul Kumar,Nisha Gupta,Saswati Dana
机构: 未知
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Short technical support pages such as IBM Technotes are quite common in technical support domain. These pages can be very useful as the knowledge sources for technical support applications such as chatbots, search engines and question-answering (QA) systems. Information extracted from documents to drive technical support applications is often stored in the form of Knowledge Graph (KG). Building KGs from a large corpus of documents poses a challenge of granularity because a large number of entities and actions are present in each page. The KG becomes virtually unusable if all entities and actions from these pages are stored in the KG. Therefore, only key entities and actions from each page are extracted and stored in the KG. This approach however leads to loss of knowledge represented by entities and actions left out of the KG as they are no longer available to graph search and reasoning functions. We propose a set of techniques to create micro knowledge graph (micrograph) for each of such web pages. The micrograph stores all the entities and actions in a page and also takes advantage of the structure of the page to represent exactly in which part of that page these entities and actions appeared, and also how they relate to each other. These micrographs can be used as additional knowledge sources by technical support applications. We define schemas for representing semi-structured and plain text knowledge present in the technical support web pages. Solutions in technical support domain include procedures made of steps. We also propose a technique to extract procedures from these webpages and the schemas to represent them in the micrographs. We also discuss how technical support applications can take advantage of the micrographs.
zh

[AI-28] runcated Matrix Completion - An Empirical Study

【速读】：该论文致力于解决低秩矩阵补全（Low-rank Matrix Completion, LRMC）在实际应用中采样机制依赖于数据值的问题，而非传统假设下的独立采样。大多数现有方法假设计算方便的独立采样模式，但这种假设在真实场景中很少成立。论文通过一系列实验，研究并比较了多种原本针对独立采样设计的LRMC算法在依赖数据值的采样模式下的性能表现。其关键在于探索和验证这些算法在非理想采样条件下的适应性和有效性，从而为感知、序列决策及推荐系统等应用场景提供更贴近现实的解决方案。

链接: https://arxiv.org/abs/2504.09873
作者: Rishhabh Naik,Nisarg Trivedi,Davoud Ataee Tarzanagh,Laura Balzano
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Numerical Analysis (math.NA); Machine Learning (stat.ML)
备注:

点击查看摘要

Abstract:Low-rank Matrix Completion (LRMC) describes the problem where we wish to recover missing entries of partially observed low-rank matrix. Most existing matrix completion work deals with sampling procedures that are independent of the underlying data values. While this assumption allows the derivation of nice theoretical guarantees, it seldom holds in real-world applications. In this paper, we consider various settings where the sampling mask is dependent on the underlying data values, motivated by applications in sensing, sequential decision-making, and recommender systems. Through a series of experiments, we study and compare the performance of various LRMC algorithms that were originally successful for data-independent sampling patterns.
zh

[AI-29] Labeling Messages as AI-Generated Does Not Reduce Their Persuasive Effects

【速读】：该论文试图解决的问题是如何提高人们对生成式 AI (Generative AI) 所产生的内容的认知透明度，并探索明确标注 AI 生成内容是否能够增强信息可信度及鼓励批判性思考。论文的关键解决方案在于通过大规模调查实验（N=1601），测试了不同作者身份标签（AI 生成模型、人类专家或无标签）对参与者态度转变、信息准确性判断以及分享意图的影响。实验结果表明，尽管大多数参与者相信了标签的真实性，但这些标签并未显著影响参与者对政策信息的态度、判断其准确性或是否愿意分享信息，且这一结论在多种人口统计学特征下均保持稳健。因此，论文强调了当前标签策略的局限性，并呼吁探索替代方法来应对由 AI 生成内容带来的挑战。

链接: https://arxiv.org/abs/2504.09865
作者: Isabel O. Gallegos,Chen Shani,Weiyan Shi,Federico Bianchi,Izzy Gainsburg,Dan Jurafsky,Robb Willer
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
备注:

点击查看摘要

Abstract:As generative artificial intelligence (AI) enables the creation and dissemination of information at massive scale and speed, it is increasingly important to understand how people perceive AI-generated content. One prominent policy proposal requires explicitly labeling AI-generated content to increase transparency and encourage critical thinking about the information, but prior research has not yet tested the effects of such labels. To address this gap, we conducted a survey experiment (N=1601) on a diverse sample of Americans, presenting participants with an AI-generated message about several public policies (e.g., allowing colleges to pay student-athletes), randomly assigning whether participants were told the message was generated by (a) an expert AI model, (b) a human policy expert, or © no label. We found that messages were generally persuasive, influencing participants’ views of the policies by 9.74 percentage points on average. However, while 94.6% of participants assigned to the AI and human label conditions believed the authorship labels, labels had no significant effects on participants’ attitude change toward the policies, judgments of message accuracy, nor intentions to share the message with others. These patterns were robust across a variety of participant characteristics, including prior knowledge of the policy, prior experience with AI, political party, education level, or age. Taken together, these results imply that, while authorship labels would likely enhance transparency, they are unlikely to substantially affect the persuasiveness of the labeled content, highlighting the need for alternative strategies to address challenges posed by AI-generated information.
zh

[AI-30] SUMART: SUMmARizing Translation from Wordy to Concise Expression

【速读】：本文旨在解决冗长字幕翻译难以快速理解的问题，提出了一种名为SUMART的方法。SUMART的关键在于通过现场使用大型语言模型压缩字幕的体积，并将压缩后的数据用于微调翻译模型，以生成更简洁实用的翻译结果。这种方法特别适用于需要快速获取外语对话、音频、视频或演讲大意的用户场景。此外，文中还开发了一个在增强现实环境中支持字幕翻译对话的应用程序，并通过原型测试与用户调查验证了其有效性，尤其在处理大量信息（如演讲、讲座、播客及会议问答环节）时具有显著优势。

链接: https://arxiv.org/abs/2504.09860
作者: Naoto Nishida,Jun Rekimoto
机构: 未知
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
备注: 3 pages, 2 figures

点击查看摘要

Abstract:We propose SUMART, a method for summarizing and compressing the volume of verbose subtitle translations. SUMART is designed for understanding translated captions (e.g., interlingual conversations via subtitle translation or when watching movies in foreign language audio and translated captions). SUMART is intended for users who want a big-picture and fast understanding of the conversation, audio, video content, and speech in a foreign language. During the training data collection, when a speaker makes a verbose statement, SUMART employs a large language model on-site to compress the volume of subtitles. This compressed data is then stored in a database for fine-tuning purposes. Later, SUMART uses data pairs from those non-compressed ASR results and compressed translated results for fine-tuning the translation model to generate more concise translations for practical uses. In practical applications, SUMART utilizes this trained model to produce concise translation results. Furthermore, as a practical application, we developed an application that allows conversations using subtitle translation in augmented reality spaces. As a pilot study, we conducted qualitative surveys using a SUMART prototype and a survey on the summarization model for SUMART. We envision the most effective use case of this system is where users need to consume a lot of information quickly (e.g., Speech, lectures, podcasts, QA in conferences).
zh

[AI-31] Working with Large Language Models to Enhance Messaging Effectiveness for Vaccine Confidence

【速读】：该论文试图解决疫苗犹豫（Vaccine Hesitancy）和 misinformation 对广泛接种疫苗覆盖率的阻碍问题，特别是针对资源有限的小型公共卫生部门缺乏有效疫苗信息设计能力的情况。论文探索了通过 ChatGPT 增强的信息传递（ChatGPT-augmented Messaging）来提升公众对疫苗接种的信心。解决方案的关键在于利用 ChatGPT 生成的疫苗相关信息，并与原始信息进行对比测试，评估其在说服力方面的表现。研究发现，ChatGPT 增强的消息整体评分略高于原始消息，且较长的消息通常得分更高。此外，消息呈现的顺序也对评分有一定影响。这些结果表明，ChatGPT 在增强疫苗信息传递方面具有潜力，为未来人类与人工智能协作在公共卫生传播中的应用提供了有价值的参考方向。

链接: https://arxiv.org/abs/2504.09857
作者: Lucinda Gullison,Feng Fu
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Physics and Society (physics.soc-ph)
备注:

点击查看摘要

Abstract:Vaccine hesitancy and misinformation are significant barriers to achieving widespread vaccination coverage. Smaller public health departments may lack the expertise or resources to craft effective vaccine messaging. This paper explores the potential of ChatGPT-augmented messaging to promote confidence in vaccination uptake. We conducted a survey in which participants chose between pairs of vaccination messages and assessed which was more persuasive and to what extent. In each pair, one message was the original, and the other was augmented by ChatGPT. At the end of the survey, participants were informed that half of the messages had been generated by ChatGPT. They were then asked to provide both quantitative and qualitative responses regarding how knowledge of a message’s ChatGPT origin affected their impressions. Overall, ChatGPT-augmented messages were rated slightly higher than the original messages. These messages generally scored better when they were longer. Respondents did not express major concerns about ChatGPT-generated content, nor was there a significant relationship between participants’ views on ChatGPT and their message ratings. Notably, there was a correlation between whether a message appeared first or second in a pair and its score. These results point to the potential of ChatGPT to enhance vaccine messaging, suggesting a promising direction for future research on human-AI collaboration in public health communication. Subjects: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Physics and Society (physics.soc-ph) Cite as: arXiv:2504.09857 [cs.CY] (or arXiv:2504.09857v1 [cs.CY] for this version) https://doi.org/10.48550/arXiv.2504.09857 Focus to learn more arXiv-issued DOI via DataCite (pending registration) Submission history From: Feng Fu [view email] [v1] Mon, 14 Apr 2025 04:06:46 UTC (103 KB)
zh

[AI-32] PestMA: LLM -based Multi-Agent System for Informed Pest Management

【速读】：该论文旨在解决传统基于单一智能体（single-agent）的大语言模型（LLM）在害虫管理中面临的局限性，这些局限包括难以整合多样化外部信息、缺乏系统性验证能力以及应对复杂阈值驱动决策的能力不足。为了解决这些问题，论文提出了一种基于大语言模型的多智能体系统（LLM-based Multi-Agent System, MAS），名为PestMA。该方案的关键在于设计了一个包含三个专业化智能体的编辑范式：一个用于综合害虫管理建议的编辑器（Editor）、一个用于收集相关外部数据的检索器（Retriever），以及一个用于确保决策正确性的验证器（Validator）。通过这种协作式的多智能体架构，PestMA不仅实现了初始86.8%的决策准确性，经过验证后进一步提升至92.6%，从而证明了基于协作智能体的工作流在优化与验证决策中的价值，并展示了LLM驱动的多智能体系统在自动化和增强害虫管理过程中的潜力。

链接: https://arxiv.org/abs/2504.09855
作者: Hongrui Shi,Shunbao Li,Zhipeng Yuan,Po Yang
机构: 未知
类目: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI)
备注: 10 pages

点击查看摘要

Abstract:Effective pest management is complex due to the need for accurate, context-specific decisions. Recent advancements in large language models (LLMs) open new possibilities for addressing these challenges by providing sophisticated, adaptive knowledge acquisition and reasoning. However, existing LLM-based pest management approaches often rely on a single-agent paradigm, which can limit their capacity to incorporate diverse external information, engage in systematic validation, and address complex, threshold-driven decisions. To overcome these limitations, we introduce PestMA, an LLM-based multi-agent system (MAS) designed to generate reliable and evidence-based pest management advice. Building on an editorial paradigm, PestMA features three specialized agents, an Editor for synthesizing pest management recommendations, a Retriever for gathering relevant external data, and a Validator for ensuring correctness. Evaluations on real-world pest scenarios demonstrate that PestMA achieves an initial accuracy of 86.8% for pest management decisions, which increases to 92.6% after validation. These results underscore the value of collaborative agent-based workflows in refining and validating decisions, highlighting the potential of LLM-based multi-agent systems to automate and enhance pest management processes.
zh

[AI-33] Carbon-Efficient 3D DNN Acceleration: Optimizing Performance and Sustainability

【速读】：该论文旨在解决深度神经网络（DNN）硬件加速器在复杂制造工艺下日益增长的碳足迹问题，特别是在3D集成提升性能的同时引入的可持续性挑战。论文的关键在于提出了一种面向碳效率的三维DNN加速器设计方法学，通过结合近似计算与基于遗传算法的设计空间探索，优化碳延迟积（Carbon Delay Product, CDP）。关键解决方案是将面积高效的近似乘法器集成到累加（MAC）单元中，在保持高计算精度的同时显著减少硅片面积和制造开销，从而有效降低硬件的隐含碳排放。实验评估表明，该方法在三种技术节点（45nm、14nm 和 7nm）下可将隐含碳减少高达30%，且精度损失可以忽略不计。

链接: https://arxiv.org/abs/2504.09851
作者: Aikaterini Maria Panteleaki,Konstantinos Balaskas,Georgios Zervakis,Hussam Amrouch,Iraklis Anagnostopoulos
机构: 未知
类目: Hardware Architecture (cs.AR); Artificial Intelligence (cs.AI)
备注: Submitted in ISVLSI 2025

点击查看摘要

Abstract:As Deep Neural Networks (DNNs) continue to drive advancements in artificial intelligence, the design of hardware accelerators faces growing concerns over embodied carbon footprint due to complex fabrication processes. 3D integration improves performance but introduces sustainability challenges, making carbon-aware optimization essential. In this work, we propose a carbon-efficient design methodology for 3D DNN accelerators, leveraging approximate computing and genetic algorithm-based design space exploration to optimize Carbon Delay Product (CDP). By integrating area-efficient approximate multipliers into Multiply-Accumulate (MAC) units, our approach effectively reduces silicon area and fabrication overhead while maintaining high computational accuracy. Experimental evaluations across three technology nodes (45nm, 14nm, and 7nm) show that our method reduces embodied carbon by up to 30% with negligible accuracy drop.
zh

[AI-34] GlyTwin: Digital Twin for Glucose Control in Type 1 Diabetes Through Optimal Behavioral Modifications Using Patient-Centric Counterfactuals

【速读】：本文旨在解决当前糖尿病管理技术（如连续皮下胰岛素输注CSII和连续血糖监测CGM）及现有数字孪生方法仅关注特定生理过程模拟或单一治疗目标的问题，缺乏提供替代性治疗情景以支持主动行为干预的能力。为应对这一挑战，论文提出了一种名为GlyTwin的新颖数字孪生框架，其关键在于利用反事实解释来模拟理想的葡萄糖调节治疗方案。通过这种方式，GlyTwin能够帮助患者和护理人员调整碳水化合物摄入量和胰岛素剂量等行为，从而避免异常的葡萄糖事件，并通过推荐日常选择的小幅调整来主动预防高血糖，减少其发生频率和持续时间。此外，该框架还将利益相关者的偏好融入干预设计中，使建议更加以患者为中心且个性化。最终，在包含21名使用自动胰岛素输送系统的1型糖尿病(T1D)患者26天纵向数据的AZT1D数据集上的评估结果显示，GlyTwin在生成有效干预方面表现出色，验证了基于反事实驱动的数字孪生在提供个性化医疗方面的潜力。

链接: https://arxiv.org/abs/2504.09846
作者: Asiful Arefeen,Saman Khamesian,Maria Adela Grando,Bithika Thompson,Hassan Ghasemzadeh
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
备注:

点击查看摘要

Abstract:Frequent and long-term exposure to hyperglycemia (i.e., high blood glucose) increases the risk of chronic complications such as neuropathy, nephropathy, and cardiovascular disease. Current technologies like continuous subcutaneous insulin infusion (CSII) and continuous glucose monitoring (CGM) primarily model specific aspects of glycemic control-like hypoglycemia prediction or insulin delivery. Similarly, most digital twin approaches in diabetes management simulate only physiological processes. These systems lack the ability to offer alternative treatment scenarios that support proactive behavioral interventions. To address this, we propose GlyTwin, a novel digital twin framework that uses counterfactual explanations to simulate optimal treatments for glucose regulation. Our approach helps patients and caregivers modify behaviors like carbohydrate intake and insulin dosing to avoid abnormal glucose events. GlyTwin generates behavioral treatment suggestions that proactively prevent hyperglycemia by recommending small adjustments to daily choices, reducing both frequency and duration of these events. Additionally, it incorporates stakeholder preferences into the intervention design, making recommendations patient-centric and tailored. We evaluate GlyTwin on AZT1D, a newly constructed dataset with longitudinal data from 21 type 1 diabetes (T1D) patients on automated insulin delivery systems over 26 days. Results show GlyTwin outperforms state-of-the-art counterfactual methods, generating 76.6% valid and 86% effective interventions. These findings demonstrate the promise of counterfactual-driven digital twins in delivering personalized healthcare.
zh

[AI-35] OVERLORD: Ultimate Scaling of DataLoader for Multi-Source Large Foundation Model Training

【速读】：该论文旨在解决大型基础模型（LFMs）训练过程中数据加载器在数据并行范式下存在的两个根本性挑战：(1) 注意力算子的二次计算复杂度导致数据并行rank上的非均匀样本分布引发加载器间的显著工作负载不平衡，从而降低训练效率，并阻碍跨不同数据集的数据混合算法（如课程学习）的实施；(2) 多样化来源的数据访问状态以及大规模数据集的集中加载容易超出本地pod内存容量，同时高延迟的数据源需要更大的worker池，进一步加剧内存消耗。论文的关键解决方案包括提出OVERLORD架构，其创新点为：(1) 集中式声明式数据平面，支持弹性数据编排策略，如长短期上下文、多模态及课程学习；(2) 基于角色特定actor（Source Loaders和Data Constructors）的解耦多源预处理，并通过自动扩展Source Loaders应对异构且动态变化的源预处理成本；(3) 带有差分检查点的Shadow Loaders以实现不间断故障恢复。部署于千GPU规模生产集群的OVERLORD实现了端到端训练吞吐量提升4.5倍，CPU内存使用减少至少3.6倍的性能改进。

链接: https://arxiv.org/abs/2504.09844
作者: Juntao Zhao,Qi Lu,Wei Jia,Borui Wan,Lei Zuo,Junda Feng,Jianyu Jiang,Yangrui Chen,Shuaishuai Cao,Jialing He,Kaihua Jiang,Yuanzhe Hu,Yanghua Peng,Haibin Lin,Xin Liu,Chuan Wu
机构: 未知
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Modern frameworks for training large foundation models (LFMs) employ data loaders in a data parallel paradigm. While this design offers implementation simplicity, it introduces two fundamental challenges. First, due to the quadratic computational complexity of the attention operator, the non-uniform sample distribution over data-parallel ranks leads to a significant workload imbalance among loaders, which degrades the training efficiency. This paradigm also impedes the implementation of data mixing algorithms (e.g., curriculum learning) over different datasets. Second, to acquire a broad range of capability, LFMs training ingests data from diverse sources, each with distinct file access states. Colocating massive datasets within loader instances can easily exceed local pod memory capacity. Additionally, heavy sources with higher transformation latency require larger worker pools, further exacerbating memory consumption. We present OVERLORD, an industrial-grade distributed data loading architecture with three innovations: (1) A centralized and declarative data plane, which facilitates elastic data orchestration strategy, such as long-short context, multimodal, and curriculum learning; (2) Disaggregated multisource preprocessing through role-specific actors, i.e., Source Loaders and Data Constructors, leveraging autoscaling for Source Loaders towards heterogeneous and evolving source preprocessing cost; (3) Shadow Loaders with differential checkpointing for uninterrupted fault recovery. Deployed on production clusters scaling to multi-thousand GPU, OVERLORD achieves: (1) 4.5x end-to-end training throughput improvement, (2) a minimum 3.6x reduction in CPU memory usage, with further improvements to be added in later experiments. Subjects: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI) Cite as: arXiv:2504.09844 [cs.DC] (or arXiv:2504.09844v1 [cs.DC] for this version) https://doi.org/10.48550/arXiv.2504.09844 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
zh

[AI-36] StruPhantom: Evolutionary Injection Attacks on Black-Box Tabular Agents Powered by Large Language Models

【速读】：该论文旨在解决大型语言模型（Large Language Models, LLMs）驱动的黑盒表格代理（tabular agents）在面对提示注入攻击（prompt injection attacks）时存在的安全脆弱性问题。尽管已有研究表明LLMs容易受到来自外部数据源的提示注入攻击的影响，但表格代理通过严格的输入数据格式和预定义规则限制了攻击的有效性，除非攻击者能够跨多层结构化数据导航并嵌入有效负载。为此，论文提出了一种名为StruPhantom的新攻击方法，专门针对黑盒LLM驱动的表格代理。其关键解决方案在于设计了一种基于约束蒙特卡洛树搜索（constrained Monte Carlo Tree Search）的进化优化过程，并结合离题评估器（off-topic evaluator）来持续改进攻击载荷，从而系统性地探索和利用目标应用的弱点以实现目标劫持（goal hijacking）。实验结果表明，StruPhantom相比现有基线方法，在强制应用程序响应包含网络钓鱼链接或恶意代码的场景下，成功率达到其1.5倍以上。

链接: https://arxiv.org/abs/2504.09841
作者: Yang Feng,Xudong Pan
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注: Work in Progress

点击查看摘要

Abstract:The proliferation of autonomous agents powered by large language models (LLMs) has revolutionized popular business applications dealing with tabular data, i.e., tabular agents. Although LLMs are observed to be vulnerable against prompt injection attacks from external data sources, tabular agents impose strict data formats and predefined rules on the attacker’s payload, which are ineffective unless the agent navigates multiple layers of structural data to incorporate the payload. To address the challenge, we present a novel attack termed StruPhantom which specifically targets black-box LLM-powered tabular agents. Our attack designs an evolutionary optimization procedure which continually refines attack payloads via the proposed constrained Monte Carlo Tree Search augmented by an off-topic evaluator. StruPhantom helps systematically explore and exploit the weaknesses of target applications to achieve goal hijacking. Our evaluation validates the effectiveness of StruPhantom across various LLM-based agents, including those on real-world platforms, and attack scenarios. Our attack achieves over 50% higher success rates than baselines in enforcing the application’s response to contain phishing links or malicious codes.
zh

[AI-37] SafeSpeech: Robust and Universal Voice Protection Against Malicious Speech Synthesis USENIX-SECURITY2025

【速读】：该论文旨在解决深度伪造音频（Deepfake Audio）带来的安全隐患问题，特别是恶意攻击者可能通过非法收集受害者语音并克隆相似声音用于电信诈骗等非法活动。现有防御方法在防止深度伪造音频滥用方面效果有限，且容易受到鲁棒训练技术的影响。为应对这一挑战，论文提出了一种名为SafeSpeech的防御框架，其核心在于通过在原始语音中嵌入不可感知的扰动（imperceptible perturbations），以阻止高质量合成语音的生成。SafeSpeech的关键创新之处在于设计了一种鲁棒且通用的主动保护技术——Speech Perturbative Concealment (SPEC)，该技术利用代理模型生成对生成式合成模型普遍适用的扰动。此外，研究优化了嵌入扰动的人类感知特性，分别从时间和频率域进行调整。实验结果表明，SafeSpeech在多个先进模型和数据集上实现了最先进的语音保护效果和可迁移性，并对高级自适应对手表现出高度鲁棒性，同时具备实时处理能力。

链接: https://arxiv.org/abs/2504.09839
作者: Zhisheng Zhang,Derui Wang,Qianyi Yang,Pengyang Huang,Junhan Pu,Yuxin Cao,Kai Ye,Jie Hao,Yixian Yang
机构: 未知
类目: ound (cs.SD); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Machine Learning (cs.LG)
备注: Accepted to USENIX Security 2025

点击查看摘要

Abstract:Speech synthesis technology has brought great convenience, while the widespread usage of realistic deepfake audio has triggered hazards. Malicious adversaries may unauthorizedly collect victims’ speeches and clone a similar voice for illegal exploitation (\textite.g., telecom fraud). However, the existing defense methods cannot effectively prevent deepfake exploitation and are vulnerable to robust training techniques. Therefore, a more effective and robust data protection method is urgently needed. In response, we propose a defensive framework, \textit\textbfSafeSpeech, which protects the users’ audio before uploading by embedding imperceptible perturbations on original speeches to prevent high-quality synthetic speech. In SafeSpeech, we devise a robust and universal proactive protection technique, \textbfSpeech \textbfPErturbative \textbfConcealment (\textbfSPEC), that leverages a surrogate model to generate universally applicable perturbation for generative synthetic models. Moreover, we optimize the human perception of embedded perturbation in terms of time and frequency domains. To evaluate our method comprehensively, we conduct extensive experiments across advanced models and datasets, both subjectively and objectively. Our experimental results demonstrate that SafeSpeech achieves state-of-the-art (SOTA) voice protection effectiveness and transferability and is highly robust against advanced adaptive adversaries. Moreover, SafeSpeech has real-time capability in real-world tests. The source code is available at \hrefthis https URLthis https URL.
zh

[AI-38] Efficient Multi-Task Modeling through Automated Fusion of Trained Models

【速读】：本文针对传统多任务学习建模方法因需针对特定任务组合进行定制化设计而导致建模过程繁琐的问题，提出了一种高效的多任务建模方法。该方法能够自动融合不同结构和任务的已训练单任务模型，形成多任务模型。其关键是通过考虑各种已训练模型的结构差异，采用模型分解技术将其分层分解为多个可操作的模型组件，并基于Transformer设计了自适应知识融合（Adaptive Knowledge Fusion, AKF）模块，以根据模型组件自适应整合任务内和任务间的知识。这种方法实现了多任务模型的高效自动化构建，并通过在三个数据集上的大量实验验证了其有效性。

链接: https://arxiv.org/abs/2504.09812
作者: Jingxuan Zhou,Weidong Bao,Ji Wang,Zhengyi Zhong,Dayu Zhang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Although multi-task learning is widely applied in intelligent services, traditional multi-task modeling methods often require customized designs based on specific task combinations, resulting in a cumbersome modeling process. Inspired by the rapid development and excellent performance of single-task models, this paper proposes an efficient multi-task modeling method that can automatically fuse trained single-task models with different structures and tasks to form a multi-task model. As a general framework, this method allows modelers to simply prepare trained models for the required tasks, simplifying the modeling process while fully utilizing the knowledge contained in the trained models. This eliminates the need for excessive focus on task relationships and model structure design. To achieve this goal, we consider the structural differences among various trained models and employ model decomposition techniques to hierarchically decompose them into multiple operable model components. Furthermore, we have designed an Adaptive Knowledge Fusion (AKF) module based on Transformer, which adaptively integrates intra-task and inter-task knowledge based on model components. Through the proposed method, we achieve efficient and automated construction of multi-task models, and its effectiveness is verified through extensive experiments on three datasets.
zh

[AI-39] See or Recall: A Sanity Check for the Role of Vision in Solving Visualization Question Answer Tasks with Multimodal LLM s

【速读】：该论文试图解决的问题是如何准确评估多模态大型语言模型（MLLM）在可视化推理（Visualization Reasoning）方面的真实能力。现有评估方法主要通过测量模型对视觉信号的处理能力来类比人类的可视化素养（Visualization Literacy），但研究发现，在可视化问答（Visualization Question Answering, VisQA）任务中，MLLM 的推理方式可能与人类显著不同。即使没有提供视觉输入或仅依赖少量选项，模型仍能正确回答大量测试问题，这表明模型可能更多依赖于语言知识库中的事实回忆（Factual Recall），而非真正的视觉分析能力。这种现象引发了对当前 VisQA 评估方法是否充分反映模型实际可视化推理能力的担忧。

解决方案的关键在于提出了一种综合的合理性检查框架（Sanity Check Framework）。该框架结合基于规则的决策树（Rule-Based Decision Tree）和合理性检查表（Sanity Check Table），以分离模型在“视觉处理”（Seeing）和“事实回忆”（Recall）之间的效果。通过这种方法，可以验证 VisQA 数据集的有效性，并明确模型是在真正进行视觉分析，还是受到事实回忆的影响，或者依赖归纳偏置（Inductive Biases）进行答题。这项研究强调了在未来利用 MLLM 设计可视化理解研究时需要仔细考虑评估方法的重要性。

链接: https://arxiv.org/abs/2504.09809
作者: Zhimin Li,Haichao Miao,Xinyuan Yan,Valerio Pascucci,Matthew Berger,Shusen Liu
机构: 未知
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Recent developments in multimodal large language models (MLLM) have equipped language models to reason about vision and language jointly. This permits MLLMs to both perceive and answer questions about data visualization across a variety of designs and tasks. Applying MLLMs to a broad range of visualization tasks requires us to properly evaluate their capabilities, and the most common way to conduct evaluation is through measuring a model’s visualization reasoning capability, analogous to how we would evaluate human understanding of visualizations (e.g., visualization literacy). However, we found that in the context of visualization question answering (VisQA), how an MLLM perceives and reasons about visualizations can be fundamentally different from how humans approach the same problem. During the evaluation, even without visualization, the model could correctly answer a substantial portion of the visualization test questions, regardless of whether any selection options were provided. We hypothesize that the vast amount of knowledge encoded in the language model permits factual recall that supersedes the need to seek information from the visual signal. It raises concerns that the current VisQA evaluation may not fully capture the models’ visualization reasoning capabilities. To address this, we propose a comprehensive sanity check framework that integrates a rule-based decision tree and a sanity check table to disentangle the effects of “seeing” (visual processing) and “recall” (reliance on prior knowledge). This validates VisQA datasets for evaluation, highlighting where models are truly “seeing”, positively or negatively affected by the factual recall, or relying on inductive biases for question answering. Our study underscores the need for careful consideration in designing future visualization understanding studies when utilizing MLLMs.
zh

[AI-40] Multi-task Federated Learning with Encoder-Decoder Structure: Enabling Collaborative Learning Across Different Tasks

【速读】：该论文旨在解决联邦学习中因客户端模型结构差异或任务不同而导致无法协同训练的问题。为实现不同模型结构与任务间的协作，论文提出了一种基于编码器-解码器结构的多任务联邦学习方法（M-Fed）。其关键在于利用编码器-解码器架构，在传统联邦学习的基础上通过局部解码器与全局解码器的信息共享，实现任务内知识传递与跨任务通用知识提取，并将局部解码器逐步迭代更新以逼近全局解码器，从而达成有效的跨任务知识共享。该方案在保持联邦学习系统效率的同时，实现了轻量级且模块化的创新设计，验证了在多任务联邦学习中的有效性。

链接: https://arxiv.org/abs/2504.09800
作者: Jingxuan Zhou,Weidong Bao,Ji Wang,Dayu Zhang,Xiongtao Zhang,Yaohong Zhang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Federated learning has been extensively studied and applied due to its ability to ensure data security in distributed environments while building better models. However, clients participating in federated learning still face limitations, as clients with different structures or tasks cannot participate in learning together. In view of this, constructing a federated learning framework that allows collaboration between clients with different model structures and performing different tasks, enabling them to share valuable knowledge to enhance model efficiency, holds significant practical implications for the widespread application of federated learning. To achieve this goal, we propose a multi-task federated learning with encoder-decoder structure (M-Fed). Specifically, given the widespread adoption of the encoder-decoder architecture in current models, we leverage this structure to share intra-task knowledge through traditional federated learning methods and extract general knowledge from the encoder to achieve cross-task knowledge sharing. The training process is similar to traditional federated learning, and we incorporate local decoder and global decoder information into the loss function. The local decoder iteratively updates and gradually approaches the global decoder until sufficient cross-task knowledge sharing is achieved. Our method is lightweight and modular, demonstrating innovation compared to previous research. It enables clients performing different tasks to share general knowledge while maintaining the efficiency of traditional federated learning systems. We conducted experiments on two widely used benchmark datasets to verify the feasibility of M-Fed and compared it with traditional methods. The experimental results demonstrate the effectiveness of M-Fed in multi-task federated learning.
zh

[AI-41] "All Roads Lead to ChatGPT ": How Generative AI is Eroding Social Interactions and Student Learning Communities ICSE2025

【速读】：该论文试图解决生成式 AI (Generative AI) 在高等教育环境中对学生社交互动、同伴学习以及课堂动态的潜在影响这一问题。论文的关键在于通过开展针对北美七所研究型大学（R1）计算机专业本科生的17次半结构化访谈，揭示生成式 AI 的使用如何改变学生求助行为的模式，并分析其对社交支持系统、同伴交流以及学习动机的负面影响。研究表明，生成式 AI 的介入使得学生更倾向于依赖技术而非人际互动来解决问题，从而削弱了同伴间的协作与社会支持体系，这对学生的归属感和整体学习体验构成了挑战。

链接: https://arxiv.org/abs/2504.09779
作者: Irene Hou,Owen Man,Kate Hamilton,Srishty Muthusekaran,Jeffin Johnykutty,Leili Zadeh,Stephen MacNeil
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
备注: 7 pages, 1 table. To be published in the Proceedings of the 2025 Innovation and Technology in Computer Science Education (ITiCSE 2025)

点击查看摘要

Abstract:The widespread adoption of generative AI is already impacting learning and help-seeking. While the benefits of generative AI are well-understood, recent studies have also raised concerns about increased potential for cheating and negative impacts on students’ metacognition and critical thinking. However, the potential impacts on social interactions, peer learning, and classroom dynamics are not yet well understood. To investigate these aspects, we conducted 17 semi-structured interviews with undergraduate computing students across seven R1 universities in North America. Our findings suggest that help-seeking requests are now often mediated by generative AI. For example, students often redirected questions from their peers to generative AI instead of providing assistance themselves, undermining peer interaction. Students also reported feeling increasingly isolated and demotivated as the social support systems they rely on begin to break down. These findings are concerning given the important role that social interactions play in students’ learning and sense of belonging.
zh

[AI-42] Reasoning without Regret

【速读】：该论文致力于解决基于结果的奖励机制在多步任务中的两大挑战：信用分配困难与收敛缓慢，同时减少对昂贵的人类监督依赖。论文的关键创新在于提出了一种名为“反向自适应奖励塑造（BARS）”的无悔框架，该框架能够将稀疏的结果导向奖励高效转化为过程导向信号。通过利用终端状态先验生成稀疏奖励，并结合覆盖树扩展奖励范围以防止策略过度利用，BARS实现了无需人工标注的中间反馈。其解决方案的核心在于结合Bellman收缩、 $(\Delta, \epsilon)$ -间隙奖励及后向Euler求解器，确保在O\left((R_\max/\Delta)\log(1/\epsilon)\right)次迭代内达到 $\epsilon$ 精度，且动态遗憾仅为 $O(\log T)$ ，从而在理论上为DeepSeek R1的经验成功提供了坚实的理论基础。

链接: https://arxiv.org/abs/2504.09777
作者: Tarun Chitra
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Chain-of-thought reasoning enables large language models to solve multi-step tasks by framing problem solving as sequential decision problems. Outcome-based rewards, which provide feedback only on final answers, show impressive success, but face challenges with credit assignment and slow convergence. In contrast, procedure-based rewards offer efficient step-level feedback, but typically require costly human supervision. We introduce \emphBackwards Adaptive Reward Shaping (BARS), a no-regret framework that converts sparse outcomes-based rewards into effective procedure-based signals. BARS uses sparse rewards generated from terminal-state priors and cover trees to scale rewards while preventing exploitation. With Bellman contraction and (\Delta, \epsilon) -gap rewards, our backward Euler solver achieves \epsilon -accuracy in O\left((R_\max/\Delta)\log(1/\epsilon)\right) iterations with O(\log T) dynamic regret over T rounds. Our analysis, based on generic chaining, continuous scaling limits, and non-linear Feynman-Kac bounds, connects recent outcome-based methods’ empirical successes with the benefits of intermediate supervision. Combined, this provides the first rigorous no-regret algorithm for outcome reward shaping, providing a theoretical foundation for the empirical success of DeepSeek’s R1.
zh

[AI-43] Understanding and Optimizing Multi-Stage AI Inference Pipelines

【速读】：该论文旨在解决现有推理系统模拟器无法有效建模复杂多阶段大型语言模型（Large Language Models, LLMs）推理工作流的问题，这些工作流涉及异构硬件平台上的多样化计算需求。解决方案的关键在于引入HERMES，这是一种针对异构多阶段LLM推理执行的模拟器。HERMES能够模拟包括检索增强生成（Retrieval Augmented Generation, RAG）、键值缓存检索、推理、预填充和解码在内的多个请求阶段，并支持在复杂硬件层级上的异构客户端并发执行多种模型。通过结合真实硬件轨迹与分析建模，HERMES捕获了混合CPU加速器部署中的关键权衡，如内存带宽争用、跨集群通信延迟和批处理效率。这使得系统设计者能够更好地优化硬件-软件协同设计，以应对下一代AI工作负载的需求。

链接: https://arxiv.org/abs/2504.09775
作者: Abhimanyu Rajeshkumar Bambhaniya,Hanjiang Wu,Suvinay Subramanian,Sudarshan Srinivasan,Souvik Kundu,Amir Yazdanbakhsh,Midhilesh Elavazhagan,Madhu Kumar,Tushar Krishna
机构: 未知
类目: Hardware Architecture (cs.AR); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)
备注: Inference System Design for Multi-Stage AI Inference Pipelines. 13 Pages, 15 Figues, 3 Tables. Code can shared at request

点击查看摘要

Abstract:The rapid evolution of Large Language Models (LLMs) has driven the need for increasingly sophisticated inference pipelines and hardware platforms. Modern LLM serving extends beyond traditional prefill-decode workflows, incorporating multi-stage processes such as Retrieval Augmented Generation (RAG), key-value (KV) cache retrieval, dynamic model routing, and multi step reasoning. These stages exhibit diverse computational demands, requiring distributed systems that integrate GPUs, ASICs, CPUs, and memory-centric architectures. However, existing simulators lack the fidelity to model these heterogeneous, multi-engine workflows, limiting their ability to inform architectural decisions. To address this gap, we introduce HERMES, a Heterogeneous Multi-stage LLM inference Execution Simulator. HERMES models diverse request stages; including RAG, KV retrieval, reasoning, prefill, and decode across complex hardware hierarchies. HERMES supports heterogeneous clients executing multiple models concurrently unlike prior frameworks while incorporating advanced batching strategies and multi-level memory hierarchies. By integrating real hardware traces with analytical modeling, HERMES captures critical trade-offs such as memory bandwidth contention, inter-cluster communication latency, and batching efficiency in hybrid CPU-accelerator deployments. Through case studies, we explore the impact of reasoning stages on end-to-end latency, optimal batching strategies for hybrid pipelines, and the architectural implications of remote KV cache retrieval. HERMES empowers system designers to navigate the evolving landscape of LLM inference, providing actionable insights into optimizing hardware-software co-design for next-generation AI workloads. Comments: Inference System Design for Multi-Stage AI Inference Pipelines. 13 Pages, 15 Figues, 3 Tables. Code can shared at request Subjects: Hardware Architecture (cs.AR); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG) Cite as: arXiv:2504.09775 [cs.AR] (or arXiv:2504.09775v1 [cs.AR] for this version) https://doi.org/10.48550/arXiv.2504.09775 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
zh

[AI-44] wo Heads are Better Than One: Test-time Scaling of Multi-agent Collaborative Reasoning

【速读】：该论文旨在解决多智能体系统（Multi-Agent Systems, MAS）在复杂推理任务中的协作与推理效率问题。尽管测试时扩展（Test-Time Scaling, TTS）技术显著提升了单智能体在挑战性推理任务中的表现，但如何有效提升多智能体系统的协作与推理能力仍是一个开放性问题。为解决此问题，论文提出了一种自适应多智能体框架，通过模型级训练与系统级协调相结合来增强协作推理能力。关键解决方案包括构建高质量的M500数据集，并基于此微调Qwen2.5-32B-Instruct以生成优化多智能体协作的M1-32B模型，同时引入新型CEO智能体，动态管理讨论过程并调整推理深度，从而实现更高效的自适应推理。实验结果表明，所提出的系统在多项任务上显著优于强基准模型，证明了学习协作与自适应协调的重要性。

链接: https://arxiv.org/abs/2504.09772
作者: Can Jin,Hongwu Peng,Qixin Zhang,Yujin Tang,Dimitris N. Metaxas,Tong Che
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Multi-agent systems (MAS) built on large language models (LLMs) offer a promising path toward solving complex, real-world tasks that single-agent systems often struggle to manage. While recent advancements in test-time scaling (TTS) have significantly improved single-agent performance on challenging reasoning tasks, how to effectively scale collaboration and reasoning in MAS remains an open question. In this work, we introduce an adaptive multi-agent framework designed to enhance collaborative reasoning through both model-level training and system-level coordination. We construct M500, a high-quality dataset containing 500 multi-agent collaborative reasoning traces, and fine-tune Qwen2.5-32B-Instruct on this dataset to produce M1-32B, a model optimized for multi-agent collaboration. To further enable adaptive reasoning, we propose a novel CEO agent that dynamically manages the discussion process, guiding agent collaboration and adjusting reasoning depth for more effective problem-solving. Evaluated in an open-source MAS across a range of tasks-including general understanding, mathematical reasoning, and coding-our system significantly outperforms strong baselines. For instance, M1-32B achieves 12% improvement on GPQA-Diamond, 41% on AIME2024, and 10% on MBPP-Sanitized, matching the performance of state-of-the-art models like DeepSeek-R1 on some tasks. These results highlight the importance of both learned collaboration and adaptive coordination in scaling multi-agent reasoning. Code is available at this https URL
zh

[AI-45] (How) Do reasoning models reason ?

【速读】：该论文旨在提供一种全面的统一视角，以理解和分析近期涌现的大规模推理模型（Large Reasoning Models, LRMs），如OpenAI o1和DeepSeek R1。论文聚焦于这些模型的潜力、能力来源、误解及局限性。论文的关键在于通过深入剖析LRMs的核心特性与挑战，为研究者和开发者提供清晰的认知框架，从而推动相关技术的进一步发展与合理应用。

链接: https://arxiv.org/abs/2504.09762
作者: Subbarao Kambhampati,Kaya Stechly,Karthik Valmeekam
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 9 pages (A version appears in The Annals of New York Academy of Sciences)

点击查看摘要

Abstract:We will provide a broad unifying perspective on the recent breed of Large Reasoning Models (LRMs) such as OpenAI o1 and DeepSeek R1, including their promise, sources of power, misconceptions and limitations.
zh

[AI-46] Dynamik: Syntactically-Driven Dynamic Font Sizing for Emphasis of Key Information

【速读】：该论文旨在解决非母语使用者在使用通用非母语语言（ lingua franca ）进行实时沟通时，由于同时阅读字幕和倾听而导致的认知负荷增加的问题。论文提出的关键解决方案是 Dynamik 系统，该系统通过缩小不重要词汇的字体大小并放大重要词汇，从而增强句子对比度，以降低阅读过程中的认知负担。研究结果表明，Dynamik 能够减少特定方面（如低英语熟练度用户的感知表现和努力程度）的认知负荷，并提升低英语能力用户对内容的理解感。

链接: https://arxiv.org/abs/2504.09734
作者: Naoto Nishida,Yoshio Ishiguro,Jun Rekiomto,Naomi Yamashita
机构: 未知
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
备注: 30 pages, 11 figures, presented at The ACM Conference on Intelligent User Interfaces (ACM IUI) 2025

点击查看摘要

Abstract:In today’s globalized world, there are increasing opportunities for individuals to communicate using a common non-native language (lingua franca). Non-native speakers often have opportunities to listen to foreign languages, but may not comprehend them as fully as native speakers do. To aid real-time comprehension, live transcription of subtitles is frequently used in everyday life (e.g., during Zoom conversations, watching YouTube videos, or on social networking sites). However, simultaneously reading subtitles while listening can increase cognitive load. In this study, we propose Dynamik, a system that reduces cognitive load during reading by decreasing the size of less important words and enlarging important ones, thereby enhancing sentence contrast. Our results indicate that Dynamik can reduce certain aspects of cognitive load, specifically, participants’ perceived performance and effort among individuals with low proficiency in English, as well as enhance the users’ sense of comprehension, especially among people with low English ability. We further discuss our methods’ applicability to other languages and potential improvements and further research directions.
zh

[AI-47] Adapting Robots Explanation for Failures Based on Observed Human Behavior in Human-Robot Collaboration IROS2025

【速读】：本文旨在解析人类行为以预测机器人在解释失败时可能引发的用户困惑，并据此调整机器人的解释方式，以实现更自然且高效的协作。为解决这一问题，研究基于包含面部情绪检测、眼动估计及手势数据的数据集，分析人类行为如何随不同类型的失败及解释复杂度的变化而变化。论文的关键在于提出了一种数据驱动的预测器，用于预测人类在机器人解释失败过程中的困惑程度，并进一步设计并评估了一种机制，依据观测到的人类行为动态调整解释的详细程度。通过这种方法，论文期望提升人机协作体验。

链接: https://arxiv.org/abs/2504.09717
作者: Andreas Naoum,Parag Khanna,Elmira Yadollahi,Mårten Björkman,Christian Smith
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
备注: Under review, Manuscript in submission for IROS 2025

点击查看摘要

Abstract:This work aims to interpret human behavior to anticipate potential user confusion when a robot provides explanations for failure, allowing the robot to adapt its explanations for more natural and efficient collaboration. Using a dataset that included facial emotion detection, eye gaze estimation, and gestures from 55 participants in a user study, we analyzed how human behavior changed in response to different types of failures and varying explanation levels. Our goal is to assess whether human collaborators are ready to accept less detailed explanations without inducing confusion. We formulate a data-driven predictor to predict human confusion during robot failure explanations. We also propose and evaluate a mechanism, based on the predictor, to adapt the explanation level according to observed human behavior. The promising results from this evaluation indicate the potential of this research in adapting a robot’s explanations for failures to enhance the collaborative experience.
zh

[AI-48] Dominated Actions in Imperfect-Information Games

【速读】：本文旨在解决在不完美信息博弈中如何高效识别并移除被支配动作（dominated actions）的问题，以作为纳什均衡计算前的一种预处理方法。传统上，将不完美信息扩展形式博弈转换为战略形式可以利用多项式时间算法识别支配策略，但这种转换可能导致博弈规模呈指数级增长。论文的关键在于提出了一种针对n人博弈的多项式时间算法，用于判断一个动作是否被任何混合策略严格或弱支配，进而扩展为一种迭代移除支配动作的算法。这种方法能够有效缩减博弈树规模，而无需承受转换为战略形式带来的复杂性增加。通过实证研究，论文进一步探讨了此方法在“全押或弃牌”德州扑克变体中的作用。

链接: https://arxiv.org/abs/2504.09716
作者: Sam Ganzfried
机构: 未知
类目: Computer Science and Game Theory (cs.GT); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA); Theoretical Economics (econ.TH)
备注:

点击查看摘要

Abstract:Dominance is a fundamental concept in game theory. In strategic-form games dominated strategies can be identified in polynomial time. As a consequence, iterative removal of dominated strategies can be performed efficiently as a preprocessing step for reducing the size of a game before computing a Nash equilibrium. For imperfect-information games in extensive form, we could convert the game to strategic form and then iteratively remove dominated strategies in the same way; however, this conversion may cause an exponential blowup in game size. In this paper we define and study the concept of dominated actions in imperfect-information games. Our main result is a polynomial-time algorithm for determining whether an action is dominated (strictly or weakly) by any mixed strategy in n-player games, which can be extended to an algorithm for iteratively removing dominated actions. This allows us to efficiently reduce the size of the game tree as a preprocessing step for Nash equilibrium computation. We explore the role of dominated actions empirically in the “All In or Fold” No-Limit Texas Hold’em poker variant.
zh

[AI-49] InfoMAE: Pair-Efficient Cross-Modal Alignment for Multimodal Time-Series Sensing Signals

【速读】：该论文旨在解决标准多模态自监督学习（SSL）算法在预训练过程中对跨模态同步作为隐式监督标签的需求，导致对多模态样本规模和质量要求过高的问题，特别是在物联网（IoT）应用中，由于时间序列信号的异构性和非可解释性，高质量多模态数据对稀缺的问题。论文的关键解决方案是提出InfoMAE，这是一种跨模态对齐框架，通过促进预训练单模态表示的高效跨模态对齐来应对多模态数据对效率低下的挑战。InfoMAE通过一种基于信息论的新颖公式实现高效的跨模态对齐，同时处理分布层面和实例层面的对齐，从而在有限的数据对下显著提升多模态配对效率，并将下游多模态任务性能提升超过60%，同时使单模态任务准确性平均提高22%。

链接: https://arxiv.org/abs/2504.09707
作者: Tomoyoshi Kimura,Xinlin Li,Osama Hanna,Yatong Chen,Yizhuo Chen,Denizhan Kara,Tianshi Wang,Jinyang Li,Xiaomin Ouyang,Shengzhong Liu,Mani Srivastava,Suhas Diggavi,Tarek Abdelzaher
机构: 未知
类目: Artificial Intelligence (cs.AI); Information Theory (cs.IT); Machine Learning (cs.LG); Multimedia (cs.MM)
备注:

点击查看摘要

Abstract:Standard multimodal self-supervised learning (SSL) algorithms regard cross-modal synchronization as implicit supervisory labels during pretraining, thus posing high requirements on the scale and quality of multimodal samples. These constraints significantly limit the performance of sensing intelligence in IoT applications, as the heterogeneity and the non-interpretability of time-series signals result in abundant unimodal data but scarce high-quality multimodal pairs. This paper proposes InfoMAE, a cross-modal alignment framework that tackles the challenge of multimodal pair efficiency under the SSL setting by facilitating efficient cross-modal alignment of pretrained unimodal representations. InfoMAE achieves \textitefficient cross-modal alignment with \textitlimited data pairs through a novel information theory-inspired formulation that simultaneously addresses distribution-level and instance-level alignment. Extensive experiments on two real-world IoT applications are performed to evaluate InfoMAE’s pairing efficiency to bridge pretrained unimodal models into a cohesive joint multimodal model. InfoMAE enhances downstream multimodal tasks by over 60% with significantly improved multimodal pairing efficiency. It also improves unimodal task accuracy by an average of 22%.
zh

[AI-50] ransformer-Based Representation Learning for Robust Gene Expression Modeling and Cancer Prognosis

【速读】：该论文旨在解决基因表达数据分析中因数据稀疏性、高维度及缺失值导致的传统 Transformer 模型应用受限的问题。为应对这些挑战，论文提出了一种基于 Transformer 的自编码框架 GexBERT，其关键在于通过大规模转录组数据的预训练（pretraining），采用掩码与恢复（masking and restoration）目标学习上下文感知的基因嵌入（context-aware gene embeddings），以捕捉数千个基因间的共表达关系。这一方法不仅提升了在癌症研究中关键任务（如泛癌分类、特定癌症生存预测及缺失值填补）上的性能，还通过注意力机制提供了可解释的生物意义明确的基因模式揭示。

链接: https://arxiv.org/abs/2504.09704
作者: Shuai Jiang,Saeed Hassanpour
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Transformer-based models have achieved remarkable success in natural language and vision tasks, but their application to gene expression analysis remains limited due to data sparsity, high dimensionality, and missing values. We present GexBERT, a transformer-based autoencoder framework for robust representation learning of gene expression data. GexBERT learns context-aware gene embeddings by pretraining on large-scale transcriptomic profiles with a masking and restoration objective that captures co-expression relationships among thousands of genes. We evaluate GexBERT across three critical tasks in cancer research: pan-cancer classification, cancer-specific survival prediction, and missing value imputation. GexBERT achieves state-of-the-art classification accuracy from limited gene subsets, improves survival prediction by restoring expression of prognostic anchor genes, and outperforms conventional imputation methods under high missingness. Furthermore, its attention-based interpretability reveals biologically meaningful gene patterns across cancer types. These findings demonstrate the utility of GexBERT as a scalable and effective tool for gene expression modeling, with translational potential in settings where gene coverage is limited or incomplete.
zh

[AI-51] MLRC-Bench: Can Language Agents Solve Machine Learning Research Challenges?

【速读】：该论文试图解决大型语言模型（Large Language Model, LLM）在科学发现领域评估缺乏客观基准和指标的问题。现有评估多集中于已建立的任务，而忽视了需要创新方法的前沿研究挑战。论文的关键解决方案是引入MLRC-Bench基准，它不仅关注LLM代理在机器学习研究竞赛中的表现，还特别强调通过提出和实施新颖研究方法来量化其有效性。与先前工作不同，MLRC-Bench采用新的严格协议和客观度量标准，评估LLM代理在前沿研究问题上的表现，揭示了显著的性能差距，并强调了创新与实际效果之间的潜在错配。这一动态基准将持续扩展以涵盖更多竞赛任务，推动对AI研究能力的严谨评估。

链接: https://arxiv.org/abs/2504.09702
作者: Yunxiang Zhang,Muhammad Khalifa,Shitanshu Bhushan,Grant D Murphy,Lajanugen Logeswaran,Jaekyeom Kim,Moontae Lee,Honglak Lee,Lu Wang
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Existing evaluation of large language model (LLM) agents on scientific discovery lacks objective baselines and metrics to assess the viability of their proposed methods. To address this issue, we introduce MLRC-Bench, a benchmark designed to quantify how effectively language agents can tackle challenging Machine Learning (ML) Research Competitions. Our benchmark highlights open research problems that demand novel methodologies, in contrast to recent benchmarks such as OpenAI’s MLE-Bench (Chan et al., 2024) and METR’s RE-Bench (Wijk et al., 2024), which focus on well-established research tasks that are largely solvable through sufficient engineering effort. Unlike prior work, e.g., AI Scientist (Lu et al., 2024b), which evaluates the end-to-end agentic pipeline by using LLM-as-a-judge, MLRC-Bench measures the key steps of proposing and implementing novel research methods and evaluates them with newly proposed rigorous protocol and objective metrics. Our curated suite of 7 competition tasks reveals significant challenges for LLM agents. Even the best-performing tested agent (gemini-exp-1206 under MLAB (Huang et al., 2024a)) closes only 9.3% of the gap between baseline and top human participant scores. Furthermore, our analysis reveals a misalignment between the LLM-judged innovation and their actual performance on cutting-edge ML research problems. MLRC-Bench is a dynamic benchmark, which is designed to continually grow with new ML competitions to encourage rigorous and objective evaluations of AI’s research capabilities.
zh

[AI-52] Migrating Code At Scale With LLM s At Google

【速读】：该论文旨在解决大型软件系统中大规模、高成本的传统手动代码迁移任务效率低下且耗时的问题。解决方案的关键在于提出了一种结合变化位置发现（Change Location Discovery）与大语言模型（Large Language Model, LLM）的自动化算法。通过这种算法，开发者能够更高效地完成代码迁移工作，显著减少人工介入的需求。论文中的案例研究显示，使用该自动化工具后，开发者对迁移工作的满意度较高，并估计相比传统的手动迁移，总耗时减少了50%。这表明基于LLM的自动化工作流可以作为类似项目的重要参考模型。

链接: https://arxiv.org/abs/2504.09691
作者: Celal Ziftci,Stoyan Nikolov,Anna Sjövall,Bo Kim,Daniele Codecasa,Max Kim
机构: 未知
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Developers often evolve an existing software system by making internal changes, called migration. Moving to a new framework, changing implementation to improve efficiency, and upgrading a dependency to its latest version are examples of migrations. Migration is a common and typically continuous maintenance task undertaken either manually or through tooling. Certain migrations are labor intensive and costly, developers do not find the required work rewarding, and they may take years to complete. Hence, automation is preferred for such migrations. In this paper, we discuss a large-scale, costly and traditionally manual migration project at Google, propose a novel automated algorithm that uses change location discovery and a Large Language Model (LLM) to aid developers conduct the migration, report the results of a large case study, and discuss lessons learned. Our case study on 39 distinct migrations undertaken by three developers over twelve months shows that a total of 595 code changes with 93,574 edits have been submitted, where 74.45% of the code changes and 69.46% of the edits were generated by the LLM. The developers reported high satisfaction with the automated tooling, and estimated a 50% reduction on the total time spent on the migration compared to earlier manual migrations. Our results suggest that our automated, LLM-assisted workflow can serve as a model for similar initiatives. Subjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI) Cite as: arXiv:2504.09691 [cs.SE] (or arXiv:2504.09691v1 [cs.SE] for this version) https://doi.org/10.48550/arXiv.2504.09691 Focus to learn more arXiv-issued DOI via DataCite (pending registration) Related DOI: https://doi.org/10.1145/3696630.3728542 Focus to learn more DOI(s) linking to related resources Submission history From: Celal Ziftci [view email] [v1] Sun, 13 Apr 2025 18:52:44 UTC (1,163 KB) Full-text links: Access Paper: View a PDF of the paper titled Migrating Code At Scale With LLMs At Google, by Celal Ziftci and 5 other authorsView PDFHTML (experimental)TeX SourceOther Formats view license Current browse context: cs.SE prev | next new | recent | 2025-04 Change to browse by: cs cs.AI References Citations NASA ADSGoogle Scholar Semantic Scholar a export BibTeX citation Loading… BibTeX formatted citation loading… Data provided by: Bookmark checked=“checked”> Bibliographic Tools Bibliographic and Citation Tools Bibliographic Explorer Toggle Bibliographic Explorer (What is the Explorer?) Connected Papers Toggle Connected Papers (What is Connected Papers?) Litmaps Toggle Litmaps (What is Litmaps?) scite.ai Toggle scite Smart Citations (What are Smart Citations?) Code, Data, Media Code, Data and Media Associated with this Article alphaXiv Toggle alphaXiv (What is alphaXiv?) Links to Code Toggle CatalyzeX Code Finder for Papers (What is CatalyzeX?) DagsHub Toggle DagsHub (What is DagsHub?) GotitPub Toggle Gotit.pub (What is GotitPub?) Huggingface Toggle Hugging Face (What is Huggingface?) Links to Code Toggle Papers with Code (What is Papers with Code?) ScienceCast Toggle ScienceCast (What is ScienceCast?) Demos Demos Replicate Toggle Replicate (What is Replicate?) Spaces Toggle Hugging Face Spaces (What is Spaces?) Spaces Toggle TXYZ.AI (What is TXYZ.AI?) Related Papers Recommenders and Search Tools Link to Influence Flower Influence Flower (What are Influence Flowers?) Core recommender toggle CORE Recommender (What is CORE?) Author Venue Institution Topic About arXivLabs arXivLabs: experimental projects with community collaborators arXivLabs is a framework that allows collaborators to develop and share new arXiv features directly on our website. Both individuals and organizations that work with arXivLabs have embraced and accepted our values of openness, community, excellence, and user data privacy. arXiv is committed to these values and only works with partners that adhere to them. Have an idea for a project that will add value for arXiv’s community? Learn more about arXivLabs. Which authors of this paper are endorsers? | Disable MathJax (What is MathJax?) mathjaxToggle(); About Help contact arXivClick here to contact arXiv Contact subscribe to arXiv mailingsClick here to subscribe Subscribe Copyright Privacy Policy Web Accessibility Assistance arXiv Operational Status Get status notifications via email or slack
zh

[AI-53] Can LLM s Revolutionize the Design of Explainable and Efficient TinyML Models?

【速读】：本文旨在解决在资源受限的tiny machine learning (TinyML) 平台上设计高效神经网络架构的问题。解决方案的关键在于结合大型语言模型（Large Language Models, LLMs）驱动的神经架构搜索（Neural Architecture Search, NAS）、基于视觉变换器（Vision Transformer, ViT）的知识蒸馏（Knowledge Distillation, KD）策略以及可解释性模块。具体而言，LLM引导的搜索通过层次化搜索空间探索，并利用帕累托优化（Pareto Optimization）在精度、乘积累加操作数（Multiply-Accumulate Operations, MACs）及内存使用等指标间取得最优平衡；最佳候选架构进一步通过基于logits的知识蒸馏进行微调，采用预训练的ViT-B/16模型提升泛化能力而不增加模型规模。这些方法共同确保了模型在CIFAR-100数据集上的高精度表现与STM32H7微控制器上的低计算成本约束下实现性能超越当前最先进的TinyML模型，同时保持极高的资源效率与可解释性。

链接: https://arxiv.org/abs/2504.09685
作者: Christophe El Zeinaty,Wassim Hamidouche,Glenn Herrou,Daniel Menard,Merouane Debbah
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:This paper introduces a novel framework for designing efficient neural network architectures specifically tailored to tiny machine learning (TinyML) platforms. By leveraging large language models (LLMs) for neural architecture search (NAS), a vision transformer (ViT)-based knowledge distillation (KD) strategy, and an explainability module, the approach strikes an optimal balance between accuracy, computational efficiency, and memory usage. The LLM-guided search explores a hierarchical search space, refining candidate architectures through Pareto optimization based on accuracy, multiply-accumulate operations (MACs), and memory metrics. The best-performing architectures are further fine-tuned using logits-based KD with a pre-trained ViT-B/16 model, which enhances generalization without increasing model size. Evaluated on the CIFAR-100 dataset and deployed on an STM32H7 microcontroller (MCU), the three proposed models, LMaNet-Elite, LMaNet-Core, and QwNet-Core, achieve accuracy scores of 74.50%, 74.20% and 73.00%, respectively. All three models surpass current state-of-the-art (SOTA) models, such as MCUNet-in3/in4 (69.62% / 72.86%) and XiNet (72.27%), while maintaining a low computational cost of less than 100 million MACs and adhering to the stringent 320 KB static random-access memory (SRAM) constraint. These results demonstrate the efficiency and performance of the proposed framework for TinyML platforms, underscoring the potential of combining LLM-driven search, Pareto optimization, KD, and explainability to develop accurate, efficient, and interpretable models. This approach opens new possibilities in NAS, enabling the design of efficient architectures specifically suited for TinyML.
zh

[AI-54] SPOT: Spatio-Temporal Pattern Mining and Optimization for Load Consolidation in Freight Transportation Networks

【速读】：该论文旨在解决货运整合中寻找最优整合策略的复杂性问题，传统基于优化的方法虽能提供精确解，但计算复杂度高且难以处理大规模实例，同时无法利用历史数据；而基于机器学习的方法虽然解决了部分扩展性问题，却往往忽略了操作约束，导致生成的整合计划不可行。为此，论文提出了一种端到端的SPOT方法，其关键在于结合机器学习与优化技术：机器学习组件通过时空聚类和受约束频繁项集挖掘识别整合点，而优化模块则针对特定运营日选择成本效益最高的可行整合路线。这种融合方式不仅显著降低了运输距离和成本（在大型站点减少约50%），还提供了有价值的战术级洞察，并保持了计算效率与可扩展性。

链接: https://arxiv.org/abs/2504.09680
作者: Sikai Cheng,Amira Hijazi,Jeren Konak,Alan Erera,Pascal Van Hentenryck
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Optimization and Control (math.OC)
备注:

点击查看摘要

Abstract:Freight consolidation has significant potential to reduce transportation costs and mitigate congestion and pollution. An effective load consolidation plan relies on carefully chosen consolidation points to ensure alignment with existing transportation management processes, such as driver scheduling, personnel planning, and terminal operations. This complexity represents a significant challenge when searching for optimal consolidation strategies. Traditional optimization-based methods provide exact solutions, but their computational complexity makes them impractical for large-scale instances and they fail to leverage historical data. Machine learning-based approaches address these issues but often ignore operational constraints, leading to infeasible consolidation plans. This work proposes SPOT, an end-to-end approach that integrates the benefits of machine learning (ML) and optimization for load consolidation. The ML component plays a key role in the planning phase by identifying the consolidation points through spatio-temporal clustering and constrained frequent itemset mining, while the optimization selects the most cost-effective feasible consolidation routes for a given operational day. Extensive experiments conducted on industrial load data demonstrate that SPOT significantly reduces travel distance and transportation costs (by about 50% on large terminals) compared to the existing industry-standard load planning strategy and a neighborhood-based heuristic. Moreover, the ML component provides valuable tactical-level insights by identifying frequently recurring consolidation opportunities that guide proactive planning. In addition, SPOT is computationally efficient and can be easily scaled to accommodate large transportation networks. Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Optimization and Control (math.OC) Cite as: arXiv:2504.09680 [cs.LG] (or arXiv:2504.09680v1 [cs.LG] for this version) https://doi.org/10.48550/arXiv.2504.09680 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
zh

[AI-55] Agent DynEx: Nudging the Mechanics and Dynamics of Multi-Agent Simulations

【速读】：本文旨在解决在多智能体大语言模型模拟中，如何在确保模拟机制一致性的同时，允许复杂且涌现的社会动态自然发展的挑战。论文的关键解决方案是提出AgentDynEx系统，该系统通过配置矩阵引导用户识别核心机制并定义里程碑以追踪动态，并引入了一种名为“nudging”的方法，使系统能够动态评估模拟进展并在偏离预期结果时进行适度干预。技术评估表明，“nudging”能够使模拟具备更复杂的机制并保持显著的动态特性，从而平衡多智能体模拟中的机制与动态。

链接: https://arxiv.org/abs/2504.09662
作者: Jenny Ma,Riya Sahni,Karthik Sreedhar,Lydia B. Chilton
机构: 未知
类目: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
备注:

点击查看摘要

Abstract:Multi-agent large language model simulations have the potential to model complex human behaviors and interactions. If the mechanics are set up properly, unanticipated and valuable social dynamics can surface. However, it is challenging to consistently enforce simulation mechanics while still allowing for notable and emergent dynamics. We present AgentDynEx, an AI system that helps set up simulations from user-specified mechanics and dynamics. AgentDynEx uses LLMs to guide users through a Configuration Matrix to identify core mechanics and define milestones to track dynamics. It also introduces a method called \textitnudging, where the system dynamically reflects on simulation progress and gently intervenes if it begins to deviate from intended outcomes. A technical evaluation found that nudging enables simulations to have more complex mechanics and maintain its notable dynamics compared to simulations without nudging. We discuss the importance of nudging as a technique for balancing mechanics and dynamics of multi-agent simulations.
zh

[AI-56] Building AI Service Repositories for On-Demand Service Orchestration in 6G AI-RAN

【速读】：该论文致力于解决6G AI-RAN中高效编排AI服务的问题，当前研究缺乏全面的框架来构建可部署的AI服务仓库，并且通常忽略了关键的实际编排因素。论文系统地识别并分类了影响6G网络中AI服务编排的关键属性，并引入了一个开源、LLM辅助的工具链，实现服务打包、部署及运行时分析的自动化。解决方案的关键在于提出这一自动化工具链，通过Cranfield AI Service仓库案例验证，展示了显著的自动化优势、减少手动编码工作量以及基础设施特定分析的必要性，从而为更实用的编排框架奠定了基础。

链接: https://arxiv.org/abs/2504.09647
作者: Yun Tang,Mengbang Zou,Udhaya Chandhar Srinivasan,Obumneme Umealor,Dennis Kevogo,Benjamin James Scott,Weisi Guo
机构: 未知
类目: Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
备注: 6 pages, three figures, one table, submitted to IEEE GlobeCOM 2025 for possible publication

点击查看摘要

Abstract:Efficient orchestration of AI services in 6G AI-RAN requires well-structured, ready-to-deploy AI service repositories combined with orchestration methods adaptive to diverse runtime contexts across radio access, edge, and cloud layers. Current literature lacks comprehensive frameworks for constructing such repositories and generally overlooks key practical orchestration factors. This paper systematically identifies and categorizes critical attributes influencing AI service orchestration in 6G networks and introduces an open-source, LLM-assisted toolchain that automates service packaging, deployment, and runtime profiling. We validate the proposed toolchain through the Cranfield AI Service repository case study, demonstrating significant automation benefits, reduced manual coding efforts, and the necessity of infrastructure-specific profiling, paving the way for more practical orchestration frameworks.
zh

[AI-57] A Two-Stage Interpretable Matching Framework for Causal Inference

【速读】：该论文致力于解决因果推断中基于观察数据进行匹配的问题，旨在通过构建具有相似协变量分布的处理组和对照组来减少混杂因素的影响，从而实现无偏的治疗效应估计。论文的关键创新在于提出了一种名为Two-stage Interpretable Matching (TIM) 的两阶段可解释匹配框架。其核心解决方案在于：第一阶段执行所有可用协变量的精确匹配；对于未能在第一阶段找到精确匹配的处理单元和对照单元，进入第二阶段，在每次迭代中移除最不显著的混杂因子，并尝试对剩余协变量进行精确匹配，同时学习被移除协变量的距离度量以量化与处理单元的接近程度。最终，利用高质量匹配结果估计条件平均治疗效应 (CATE)，并通过合成数据集与真实世界医疗数据验证了TIM的有效性，证明其能够提高CATE估计质量、增加多元重叠性并在高维数据中有效扩展。

链接: https://arxiv.org/abs/2504.09635
作者: Sahil Shikalgar,Md. Noor-E-Alam
机构: 未知
类目: Artificial Intelligence (cs.AI); Methodology (stat.ME)
备注:

点击查看摘要

Abstract:Matching in causal inference from observational data aims to construct treatment and control groups with similar distributions of covariates, thereby reducing confounding and ensuring an unbiased estimation of treatment effects. This matched sample closely mimics a randomized controlled trial (RCT), thus improving the quality of causal estimates. We introduce a novel Two-stage Interpretable Matching (TIM) framework for transparent and interpretable covariate matching. In the first stage, we perform exact matching across all available covariates. For treatment and control units without an exact match in the first stage, we proceed to the second stage. Here, we iteratively refine the matching process by removing the least significant confounder in each iteration and attempting exact matching on the remaining covariates. We learn a distance metric for the dropped covariates to quantify closeness to the treatment unit(s) within the corresponding strata. We used these high- quality matches to estimate the conditional average treatment effects (CATEs). To validate TIM, we conducted experiments on synthetic datasets with varying association structures and correlations. We assessed its performance by measuring bias in CATE estimation and evaluating multivariate overlap between treatment and control groups before and after matching. Additionally, we apply TIM to a real-world healthcare dataset from the Centers for Disease Control and Prevention (CDC) to estimate the causal effect of high cholesterol on diabetes. Our results demonstrate that TIM improves CATE estimates, increases multivariate overlap, and scales effectively to high-dimensional data, making it a robust tool for causal inference in observational data.
zh

[AI-58] Slow Thinking for Sequential Recommendation

【速读】：该论文旨在解决传统推荐模型在处理序列推荐任务时因受限于轻量级快速推理范式而导致性能不足的问题。论文的关键在于提出了一种名为STREAM-Rec的新型慢思考推荐模型，通过引入多步推理机制实现更精细的用户行为分析与个性化推荐。解决方案的核心包括：(1) 确定推荐系统中的合适推理模式；(2) 探索如何有效增强传统推荐器的推理能力。为此，论文设计了一个三阶段训练框架：第一阶段利用大规模用户行为数据进行预训练以学习行为模式并捕捉长距离依赖关系；第二阶段通过迭代推理算法逐步优化模型预测结果，并基于注释后的数据微调模型；第三阶段应用强化学习进一步提升模型的泛化能力。实验结果验证了所提方法的有效性。

链接: https://arxiv.org/abs/2504.09627
作者: Junjie Zhang,Beichen Zhang,Wenqi Sun,Hongyu Lu,Wayne Xin Zhao,Yu Chen,Ji-Rong Wen
机构: 未知
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:To develop effective sequential recommender systems, numerous methods have been proposed to model historical user behaviors. Despite the effectiveness, these methods share the same fast thinking paradigm. That is, for making recommendations, these methods typically encodes user historical interactions to obtain user representations and directly match these representations with candidate item representations. However, due to the limited capacity of traditional lightweight recommendation models, this one-step inference paradigm often leads to suboptimal performance. To tackle this issue, we present a novel slow thinking recommendation model, named STREAM-Rec. Our approach is capable of analyzing historical user behavior, generating a multi-step, deliberative reasoning process, and ultimately delivering personalized recommendations. In particular, we focus on two key challenges: (1) identifying the suitable reasoning patterns in recommender systems, and (2) exploring how to effectively stimulate the reasoning capabilities of traditional recommenders. To this end, we introduce a three-stage training framework. In the first stage, the model is pretrained on large-scale user behavior data to learn behavior patterns and capture long-range dependencies. In the second stage, we design an iterative inference algorithm to annotate suitable reasoning traces by progressively refining the model predictions. This annotated data is then used to fine-tune the model. Finally, in the third stage, we apply reinforcement learning to further enhance the model generalization ability. Extensive experiments validate the effectiveness of our proposed method.
zh

[AI-59] A highly maneuverable flying squirrel drone with agility-improving foldable wings

【速读】：该论文旨在解决传统无人机因推力限制导致的敏捷飞行能力不足的问题。这一局限性无法仅通过控制算法的进步来完全克服。为应对这一挑战，论文提出了一种配备可折叠机翼的高机动性无人机，并通过推力与机翼协同控制（Thrust-Wing Coordination Control, TWCC）框架实现传统螺旋桨系统与可折叠机翼之间的协作控制。关键在于利用可折叠机翼产生的额外空气阻力显著提升跟踪性能，同时通过物理辅助循环神经网络（physics-assisted recurrent neural network, paRNN）建模可折叠机翼的复杂空气动力学特性，以校准迎角（Angle of Attack, AOA），使其更符合实际气动行为。实验结果表明，该“飞 squir”无人机相比传统无翼无人机在跟踪性能上提升了13.1%（以均方根误差RMSE衡量）。

链接: https://arxiv.org/abs/2504.09609
作者: Dohyeon Lee,Jun-Gill Kang,Soohee Han
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注: Accepted to IEEE Robotics and Automation Letters Youtube : this https URL

点击查看摘要

Abstract:Drones, like most airborne aerial vehicles, face inherent disadvantages in achieving agile flight due to their limited thrust capabilities. These physical constraints cannot be fully addressed through advancements in control algorithms alone. Drawing inspiration from the winged flying squirrel, this paper proposes a highly maneuverable drone equipped with agility-enhancing foldable wings. By leveraging collaborative control between the conventional propeller system and the foldable wings-coordinated through the Thrust-Wing Coordination Control (TWCC) framework-the controllable acceleration set is expanded, enabling the generation of abrupt vertical forces that are unachievable with traditional wingless drones. The complex aerodynamics of the foldable wings are modeled using a physics-assisted recurrent neural network (paRNN), which calibrates the angle of attack (AOA) to align with the real aerodynamic behavior of the wings. The additional air resistance generated by appropriately deploying these wings significantly improves the tracking performance of the proposed “flying squirrel” drone. The model is trained on real flight data and incorporates flat-plate aerodynamic principles. Experimental results demonstrate that the proposed flying squirrel drone achieves a 13.1% improvement in tracking performance, as measured by root mean square error (RMSE), compared to a conventional wingless drone. A demonstration video is available on YouTube: this https URL.
zh

[AI-60] Understanding LLM Behaviors via Compression: Data Generation Knowledge Acquisition and Scaling Laws

【速读】：本文旨在解决大型语言模型（LLMs）在机制层面的若干未解问题，包括其背后的运作原理、扩展规律（scaling laws）、幻觉现象（hallucinations）及相关行为。论文的关键在于重新审视预测与压缩之间的经典关系，基于柯尔莫哥洛夫复杂性（Kolmogorov complexity）和香农信息论，通过引入柯尔莫哥洛夫结构函数（Kolmogorov Structure Function），将LLM的压缩过程解读为两部分编码过程，从而深入揭示LLMs如何随模型规模和数据规模的增长获取并存储信息，从普遍的句法模式到逐渐稀有的知识元素。在此理论框架下，受Heap定律和Zipf定律启发，提出了一种简化的层级化数据生成模型——句法-知识模型（Syntax-Knowledge model）。在贝叶斯设定下，该模型展示了预测与压缩如何自然导致LLMs多样化的学习和扩展行为，并提供了直观且有原则性的解释，涵盖数据与模型的扩展规律、训练及微调过程中知识获取的动力学特性，以及LLMs中事实性知识幻觉的成因。实验结果验证了理论预测的准确性。

链接: https://arxiv.org/abs/2504.09597
作者: Zhixuan Pan,Shaowen Wang,Jian Li
机构: 未知
类目: Artificial Intelligence (cs.AI); Information Theory (cs.IT); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Large Language Models (LLMs) have demonstrated remarkable capabilities across numerous tasks, yet principled explanations for their underlying mechanisms and several phenomena, such as scaling laws, hallucinations, and related behaviors, remain elusive. In this work, we revisit the classical relationship between compression and prediction, grounded in Kolmogorov complexity and Shannon information theory, to provide deeper insights into LLM behaviors. By leveraging the Kolmogorov Structure Function and interpreting LLM compression as a two-part coding process, we offer a detailed view of how LLMs acquire and store information across increasing model and data scales – from pervasive syntactic patterns to progressively rarer knowledge elements. Motivated by this theoretical perspective and natural assumptions inspired by Heap’s and Zipf’s laws, we introduce a simplified yet representative hierarchical data-generation framework called the Syntax-Knowledge model. Under the Bayesian setting, we show that prediction and compression within this model naturally lead to diverse learning and scaling behaviors of LLMs. In particular, our theoretical analysis offers intuitive and principled explanations for both data and model scaling laws, the dynamics of knowledge acquisition during training and fine-tuning, factual knowledge hallucinations in LLMs. The experimental results validate our theoretical predictions.
zh

[AI-61] Efficient LLM Serving on Hybrid Real-time and Best-effort Requests

【速读】：该论文旨在解决现有大型语言模型（Large Language Models, LLMs）服务系统在处理混合推理工作负载时资源利用率低的问题。传统方法通过为实时请求（Real-Time, RT）和尽力而为请求（Best-Effort, BE）分别分配专用机器来简化调度管理，但导致资源利用效率低下。论文提出的关键解决方案是设计了一种名为BROS的混合LLM服务系统，其核心在于通过动态优先级算法实现RT和BE请求的混合调度，并采用双向KV缓存管理机制，允许RT请求与BE请求共享KV内存，从而消除因KV内存不足引起的调度限制，显著提升资源利用率。实验验证表明，BROS在满足RT请求延迟需求的同时，保持了BE请求的高吞吐量，相较于现有最先进的系统如vLLM和TGI具有明显优势。

链接: https://arxiv.org/abs/2504.09590
作者: Wan Borui,Zhao Juntao,Jiang Chenyu,Guo Chuanxiong,Wu Chuan
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Recent breakthroughs in large Language Models (LLMs) have enabled various generative tasks on a single model. Real-world services (e.g., OpenAI’s ChatGPT [27]) powered by an LLM often concurrently support latency-critical requests for interactive applications (e.g., question-answering systems, referred to as real-time or RT requests) and throughput-oriented requests for back-of-house processing (e.g., documents batch processing [28], referred to best-effort or BE requests), with complex hybrid inference workloads to the underlying model. State-of-the-art (SOTA) LLM serving systems dedicate machines to each type of request, towards either low inference latency or high serving throughput, respectively. This practice simplifies request scheduling and management but suffers from poor resource utilization. We propose BROS, a hybrid LLM serving system that aims to collocate RT/BE requests, meeting RT requests’ latency requirements while maintaining BE requests’ throughput. BROS formulates the problem of hybrid RT/BE request scheduling and solves it with a dynamic priority-based algorithm. BROS designs a bidirectional KV cache management mechanism, allowing RT requests to share KV memory with BE requests to remove the scheduling restrictions caused by insufficient KV memory and improve utilization. Extensive experiments validate that BROS achieves a good trade-off when serving hybrid RT and BE requests. It significantly reduces the latency of RT requests (up to 74.20%), improving their fine-grained service level objectives (SLOs) attainments (up to 36.38x), with negligible throughput reduction for BE requests, showing significant advantages over SOTA systems like vLLM and TGI.
zh

[AI-62] AirVista-II: An Agent ic System for Embodied UAVs Toward Dynamic Scene Semantic Understanding

【速读】：该论文旨在解决当前无人机（Unmanned Aerial Vehicles, UAVs）在动态环境（如物流运输和灾害响应）中任务依赖人工操作员监控视频并做出决策的低效与适应性不足的问题。论文提出的解决方案关键在于AirVista-II系统，这是一种端到端的主动式系统，专为具备实体交互能力的无人机设计，能够实现动态场景中的通用语义理解和推理。该系统通过基于代理的任务识别与调度、多模态感知机制以及针对不同时间场景优化的关键帧提取策略的集成，高效捕获关键场景信息。实验结果表明，该系统在零样本设置下于多种无人机动态场景中实现了高质量的语义理解。

链接: https://arxiv.org/abs/2504.09583
作者: Fei Lin,Yonglin Tian,Tengchao Zhang,Jun Huang,Sangtian Guan,Fei-Yue Wang
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Unmanned Aerial Vehicles (UAVs) are increasingly important in dynamic environments such as logistics transportation and disaster response. However, current tasks often rely on human operators to monitor aerial videos and make operational decisions. This mode of human-machine collaboration suffers from significant limitations in efficiency and adaptability. In this paper, we present AirVista-II – an end-to-end agentic system for embodied UAVs, designed to enable general-purpose semantic understanding and reasoning in dynamic scenes. The system integrates agent-based task identification and scheduling, multimodal perception mechanisms, and differentiated keyframe extraction strategies tailored for various temporal scenarios, enabling the efficient capture of critical scene information. Experimental results demonstrate that the proposed system achieves high-quality semantic understanding across diverse UAV-based dynamic scenarios under a zero-shot setting.
zh

[AI-63] Improved FOX Optimization Algorithm

【速读】：该论文试图解决传统优化算法在求解实际问题时常遇到的早熟收敛至局部最优解以及难以有效平衡探索与利用的问题。为应对这些挑战，论文提出了一种改进的FOX优化算法（Improved FOX, IFOX）。IFOX算法的关键在于引入了一种基于适应度值的新自适应机制，用于更有效地平衡探索与利用，并通过减少超参数数量和简化原始FOX的核心方程进一步优化算法设计。

链接: https://arxiv.org/abs/2504.09574
作者: Mahmood A. Jumaah,Yossra H. Ali,Tarik A. Rashid
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 34 pages

点击查看摘要

Abstract:Optimization algorithms are essential for solving many real-world problems. However, challenges such as premature convergence to local optima and the difficulty of effectively balancing exploration and exploitation often hinder their performance. To address these issues, this paper proposes an improved FOX optimization algorithm, Improved FOX (IFOX). The IFOX algorithm introduces a new adaptive mechanism for balancing exploration and exploitation based on fitness values. It also reduces the number of hyperparameters and simplifies the core equations of the original FOX. To evaluate its effectiveness, IFOX has been tested on classical uni-modal and multi-modal benchmark functions, as well as on benchmark sets from the Congress on Evolutionary Computation (CEC), in addition to two engineering design problems: Pressure Vessel Design and Economic Load Dispatch. The results show that IFOX outperforms existing optimization algorithms, achieving superior results on 51 benchmark functions. These findings underscore the strong potential of IFOX as a competitive and robust optimization algorithm for a wide range of applications.
zh

[AI-64] Embodied Chain of Action Reasoning with Multi-Modal Foundation Model for Humanoid Loco-manipulation

【速读】：本文旨在解决人形机器人在复杂非结构化环境中自主执行运动操作（loco-manipulation）任务的挑战。具体而言，研究关注如何使机器人具备在长时间跨度内规划动作的能力，并通过多模态方法弥合高层次规划与实际任务执行之间的差距。解决方案的关键在于提出了一种基于基础模型的新框架，该框架应用了具身动作推理链（embodied chain of action reasoning）的方法论，从文本指令中自主规划人形机器人的动作。这种方法结合了人形机器人特定的思维链方法，包括详细的可用性分析和身体运动分析，将任务分解为一系列移动和操作动作序列。此外，还引入了基于观察和目标物体属性的空间推理，以有效处理目标位置不可见或被遮挡的情况。通过在真实环境中的物体重新排列、操作以及运动操作任务的严格实验设置，验证了所提方法在解耦上身和下身控制方面的有效性，并展示了动作推理策略在理解人类指令方面的效果。

链接: https://arxiv.org/abs/2504.09532
作者: Yu Hao,Geeta Chandra Raju Bethala,Niraj Pudasaini,Hao Huang,Shuaihang Yuan,Congcong Wen,Baoru Huang,Anh Nguyen,Yi Fang
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Enabling humanoid robots to autonomously perform loco-manipulation tasks in complex, unstructured environments poses significant challenges. This entails equipping robots with the capability to plan actions over extended horizons while leveraging multi-modality to bridge gaps between high-level planning and actual task execution. Recent advancements in multi-modal foundation models have showcased substantial potential in enhancing planning and reasoning abilities, particularly in the comprehension and processing of semantic information for robotic control tasks. In this paper, we introduce a novel framework based on foundation models that applies the embodied chain of action reasoning methodology to autonomously plan actions from textual instructions for humanoid loco-manipulation. Our method integrates humanoid-specific chain of thought methodology, including detailed affordance and body movement analysis, which provides a breakdown of the task into a sequence of locomotion and manipulation actions. Moreover, we incorporate spatial reasoning based on the observation and target object properties to effectively navigate where target position may be unseen or occluded. Through rigorous experimental setups on object rearrangement, manipulations and loco-manipulation tasks on a real-world environment, we evaluate our method’s efficacy on the decoupled upper and lower body control and demonstrate the effectiveness of the chain of robotic action reasoning strategies in comprehending human instructions.
zh

[AI-65] Decoding the mechanisms of the Hattrick football manager game using Bayesian network structure learning for optimal decision-making

【速读】：该论文旨在探索Hattrick游戏引擎的机制，通过结构学习技术与贝叶斯网络集成数据和领域知识，构建能够解释和模拟游戏引擎的模型。论文的关键在于比较基于规则的知识网络与结构学习算法在建模效率和预测准确性上的差异，并提出了一种匹配顶级社区模型性能的完整贝叶斯网络模型。此外，论文展示了如何超越简单的预测任务，通过可视化条件依赖关系和利用最佳贝叶斯网络模型进行实时游戏决策。为支持后续研究，所有数据、图形结构和模型均公开发布。

链接: https://arxiv.org/abs/2504.09499
作者: Anthony C. Constantinou,Nicholas Higgins,Neville K. Kitson
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Hattrick is a free web-based probabilistic football manager game with over 200,000 users competing for titles at national and international levels. Launched in Sweden in 1997 as part of an MSc project, the game’s slow-paced design has fostered a loyal community, with many users remaining active for decades. Hattrick’s game-engine mechanics are partially hidden, and users have attempted to decode them with incremental success over the years. Rule-based, statistical and machine learning models have been developed to aid this effort and are widely used by the community. However, these models or tools have not been formally described or evaluated in the scientific literature. This study is the first to explore Hattrick using structure learning techniques and Bayesian networks, integrating both data and domain knowledge to develop models capable of explaining and simulating the game engine. We present a comprehensive analysis assessing the effectiveness of structure learning algorithms in relation to knowledge-based structures, and show that while structure learning may achieve a higher overall network fit, it does not result in more accurate predictions for selected variables of interest, when compared to knowledge-based networks that produce a lower overall network fit. Additionally, we introduce and publicly share a fully specified Bayesian network model that matches the performance of top models used by the Hattrick community. We further demonstrate how analysis extends beyond prediction by providing a visual representation of conditional dependencies, and using the best performing Bayesian network model for in-game decision-making. To support future research, we make all data, graphical structures, and models publicly available online.
zh

[AI-66] Federated Prototype Graph Learning

【速读】：该论文旨在解决联邦图学习（Federated Graph Learning, FGL）中的多层级异构性所带来的挑战，包括模型级、数据级和通信级的协作难题。具体而言，模型级异构性要求灵活部署不同性能与扩展性的客户端模型；数据级异构性源于图结构的复杂性，导致联邦训练的模型难以达到理想性能；通信级异构性则因增加客户端间或客户端与服务器间的通信消息共享以提升训练效果，带来了高昂的通信成本。

为应对上述问题，论文提出了一种通用的原型引导优化方法FedPG。其关键在于结合多层级拓扑感知的原型来捕获本地图语义，并在服务器端利用上传的原型通过拓扑引导对比学习和个人化技术为每个客户端定制全局原型，从而广播优化后的全局原型以改进本地训练。实验结果表明，FedPG相比最先进的基线方法平均提升了3.57%的准确率，同时将通信成本降低了168倍。

链接: https://arxiv.org/abs/2504.09493
作者: Zhengyu Wu,Xunkai Li,Yinlin Zhu,Rong-Hua Li,Guoren Wang,Chenghu Zhou
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Databases (cs.DB); Social and Information Networks (cs.SI)
备注: Under Review

点击查看摘要

Abstract:In recent years, Federated Graph Learning (FGL) has gained significant attention for its distributed training capabilities in graph-based machine intelligence applications, mitigating data silos while offering a new perspective for privacy-preserve large-scale graph learning. However, multi-level FGL heterogeneity presents various client-server collaboration challenges: (1) Model-level: The variation in clients for expected performance and scalability necessitates the deployment of heterogeneous models. Unfortunately, most FGL methods rigidly demand identical client models due to the direct model weight aggregation on the server. (2) Data-level: The intricate nature of graphs, marked by the entanglement of node profiles and topology, poses an optimization dilemma. This implies that models obtained by federated training struggle to achieve superior performance. (3) Communication-level: Some FGL methods attempt to increase message sharing among clients or between clients and the server to improve training, which inevitably leads to high communication costs. In this paper, we propose FedPG as a general prototype-guided optimization method for the above multi-level FGL heterogeneity. Specifically, on the client side, we integrate multi-level topology-aware prototypes to capture local graph semantics. Subsequently, on the server side, leveraging the uploaded prototypes, we employ topology-guided contrastive learning and personalized technology to tailor global prototypes for each client, broadcasting them to improve local training. Experiments demonstrate that FedPG outperforms SOTA baselines by an average of 3.57% in accuracy while reducing communication costs by 168x.
zh

[AI-67] A highly maneuverable flying squirrel drone with controllable foldable wings IROS

【速读】：该论文旨在解决典型多旋翼无人机因单向推力导致操作灵活性不足的问题，特别是在极其狭窄和受限空间内进行敏捷飞行的挑战。论文提出了一种受飞 squirrel 启发的新无人机，通过轻量化和易于携带的方式实现高机动性。解决方案的关键在于设计了一种具有可控折叠机翼的无人机，能够覆盖更广泛的飞行姿态，并提供更灵活的飞行能力同时保持稳定的跟踪性能。机翼由硅胶膜制成，并通过基于人类演示数据的强化学习进行精细控制，这种方法特别适用于捕捉复杂的空气动力学现象，而这些现象通常难以通过数学建模来描述。实验表明，这种受控的机翼能够有意产生空气动力阻力，从而在机械推力饱和的情况下提供所需的额外排斥力。这项工作展示了仿生学与机器学习结合在实现类似动物般敏捷无人机方面的潜力。

链接: https://arxiv.org/abs/2504.09478
作者: Jun-Gill Kang,Dohyeon Lee,Soohee Han
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注: Accepted at 2023 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Project Page : this https URL , Video : this https URL , Jun-Gill Kang and Dohyeon Lee are co-authors

点击查看摘要

Abstract:Typical drones with multi rotors are generally less maneuverable due to unidirectional thrust, which may be unfavorable to agile flight in very narrow and confined spaces. This paper suggests a new bio-inspired drone that is empowered with high maneuverability in a lightweight and easy-to-carry way. The proposed flying squirrel inspired drone has controllable foldable wings to cover a wider range of flight attitudes and provide more maneuverable flight capability with stable tracking performance. The wings of a drone are fabricated with silicone membranes and sophisticatedly controlled by reinforcement learning based on human-demonstrated data. Specially, such learning based wing control serves to capture even the complex aerodynamics that are often impossible to model mathematically. It is shown through experiment that the proposed flying squirrel drone intentionally induces aerodynamic drag and hence provides the desired additional repulsive force even under saturated mechanical thrust. This work is very meaningful in demonstrating the potential of biomimicry and machine learning for realizing an animal-like agile drone.
zh

[AI-68] MigGPT : Harnessing Large Language Models for Automated Migration of Out-of-Tree Linux Kernel Patches Across Versions

【速读】：该论文旨在解决通过大型语言模型（Large Language Models, LLMs）自动化迁移内核外补丁（out-of-tree kernel patches）过程中存在的代码上下文理解不完整以及迁移点识别不准确的问题。论文的关键创新在于提出MigGPT框架，其核心包括引入一种新颖的代码指纹结构以保留代码片段信息，并结合三个精心设计的模块来提升内核外补丁迁移的准确性和效率。此外，研究构建了一个基于真实项目数据的基准测试集，用于评估LLMs的能力。实验结果表明，MigGPT相较于直接应用基础LLMs，在迁移任务中的平均完成率提升了50.74%，达到72.59%。

链接: https://arxiv.org/abs/2504.09474
作者: Pucheng Dang,Di Huang,Dong Li,Kang Chen,Yuanbo Wen,Qi Guo,Xing Hu,Ninghui Sun
机构: 未知
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI); Operating Systems (cs.OS)
备注:

点击查看摘要

Abstract:Out-of-tree kernel patches are essential for adapting the Linux kernel to new hardware or enabling specific functionalities. Maintaining and updating these patches across different kernel versions demands significant effort from experienced engineers. Large language models (LLMs) have shown remarkable progress across various domains, suggesting their potential for automating out-of-tree kernel patch migration. However, our findings reveal that LLMs, while promising, struggle with incomplete code context understanding and inaccurate migration point identification. In this work, we propose MigGPT, a framework that employs a novel code fingerprint structure to retain code snippet information and incorporates three meticulously designed modules to improve the migration accuracy and efficiency of out-of-tree kernel patches. Furthermore, we establish a robust benchmark using real-world out-of-tree kernel patch projects to evaluate LLM capabilities. Evaluations show that MigGPT significantly outperforms the direct application of vanilla LLMs, achieving an average completion rate of 72.59% (50.74% improvement) for migration tasks.
zh

[AI-69] Comorbidity-Informed Transfer Learning for Neuro-developmental Disorder Diagnosis

【速读】：该论文旨在解决神经发育障碍（Neuro-developmental Disorders）诊断中基于功能磁共振成像（fMRI）的计算机辅助诊断（CAD）模型因fMRI复杂的时空特征易受干扰而导致性能受限的问题。为了解决这一问题，论文提出了首个基于共病信息的迁移学习框架（Comorbidity-Informed Transfer Learning, CITL）。CITL 的关键在于引入了一种新的强化表示生成网络，该网络结合迁移学习与伪标签技术，通过去除fMRI时间域中的干扰模式，并利用编码器-解码器架构生成新的特征表示。这些新表示随后在简单的分类网络中进行训练以获得最终的CAD模型。此外，CITL 充分考虑了神经发育障碍的共病机制，并将其与半监督学习和迁移学习有效整合，为跨学科研究提供了新视角。实验结果表明，CITL 在自闭症谱系障碍（Autism Spectrum Disorder）和注意缺陷多动障碍（Attention Deficit Hyperactivity Disorder）检测中分别达到了76.32%和73.15%的准确率，优于现有相关迁移学习方法7.2%和0.5%。

链接: https://arxiv.org/abs/2504.09463
作者: Xin Wen,Shijie Guo,Wenbo Ning,Rui Cao,Jie Xiang,Xiaobo Liu,Jintai Chen
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Neuro-developmental disorders are manifested as dysfunctions in cognition, communication, behaviour and adaptability, and deep learning-based computer-aided diagnosis (CAD) can alleviate the increasingly strained healthcare resources on neuroimaging. However, neuroimaging such as fMRI contains complex spatio-temporal features, which makes the corresponding representations susceptible to a variety of distractions, thus leading to less effective in CAD. For the first time, we present a Comorbidity-Informed Transfer Learning(CITL) framework for diagnosing neuro-developmental disorders using fMRI. In CITL, a new reinforced representation generation network is proposed, which first combines transfer learning with pseudo-labelling to remove interfering patterns from the temporal domain of fMRI and generates new representations using encoder-decoder architecture. The new representations are then trained in an architecturally simple classification network to obtain CAD model. In particular, the framework fully considers the comorbidity mechanisms of neuro-developmental disorders and effectively integrates them with semi-supervised learning and transfer learning, providing new perspectives on interdisciplinary. Experimental results demonstrate that CITL achieves competitive accuracies of 76.32% and 73.15% for detecting autism spectrum disorder and attention deficit hyperactivity disorder, respectively, which outperforms existing related transfer learning work for 7.2% and 0.5% respectively.
zh

[AI-70] Measuring Leakage in Concept-Based Methods: An Information Theoretic Approach ICLR2025

【速读】：该论文旨在解决概念瓶颈模型（CBMs）中由于非预期信息泄露导致透明性受损的问题。尽管CBMs通过围绕可理解概念构建预测以提高可解释性，但预测信号绕过概念瓶颈的现象削弱了其透明度。为了解决这一问题，论文引入了一种基于信息论的方法来量化CBMs中的泄露程度，该方法能够衡量概念嵌入编码超出指定概念的额外非预期信息的程度。关键在于提出的信息论度量方法，并通过控制合成实验验证其有效性，揭示特征维度和概念维度显著影响泄露程度，同时评估不同分类器对测量稳定性的影响，最终发现XGBoost是最可靠的估计器。此外，初步研究表明，该度量在软联合CBMs中表现出预期行为，表明其在真实世界数据集上的潜在可靠性。

链接: https://arxiv.org/abs/2504.09459
作者: Mikael Makonnen,Moritz Vandenhirtz,Sonia Laguna,Julia E Vogt
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Published at ICLR 2025 Workshop on XAI4Science

点击查看摘要

Abstract:Concept Bottleneck Models (CBMs) aim to enhance interpretability by structuring predictions around human-understandable concepts. However, unintended information leakage, where predictive signals bypass the concept bottleneck, compromises their transparency. This paper introduces an information-theoretic measure to quantify leakage in CBMs, capturing the extent to which concept embeddings encode additional, unintended information beyond the specified concepts. We validate the measure through controlled synthetic experiments, demonstrating its effectiveness in detecting leakage trends across various configurations. Our findings highlight that feature and concept dimensionality significantly influence leakage, and that classifier choice impacts measurement stability, with XGBoost emerging as the most reliable estimator. Additionally, preliminary investigations indicate that the measure exhibits the anticipated behavior when applied to soft joint CBMs, suggesting its reliability in leakage quantification beyond fully synthetic settings. While this study rigorously evaluates the measure in controlled synthetic experiments, future work can extend its application to real-world datasets.
zh

[AI-71] Enhancing Mathematical Reasoning in Large Language Models with Self-Consistency-Based Hallucination Detection

【速读】：该论文旨在解决大型语言模型（Large Language Models, LLMs）在数学推理中易产生逻辑不一致性和幻觉（hallucinations）的问题，尤其是在定理证明、符号操作和数值计算等任务中生成看似合理但错误的陈述。论文的关键在于提出了一种结构化自一致性（Structured Self-Consistency, SSC）框架，通过在整个推理过程中的中间步骤和最终输出上强制执行自一致性，从而减少逻辑不一致性和幻觉现象。这种方法不仅显著提升了证明的有效性、符号推理的准确性以及数值计算的稳定性，还保持了较高的计算效率，并降低了模型输出结果的方差，为提升LLMs的数学推理可靠性提供了稳健机制。

链接: https://arxiv.org/abs/2504.09440
作者: MingShan Liu,Shi Bo,Jialing Fang
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Large language models (LLMs) have demonstrated strong mathematical reasoning capabilities but remain susceptible to hallucinations producing plausible yet incorrect statements especially in theorem proving, symbolic manipulation, and numerical computation. While self-consistency (SC) has been explored as a means to improve factuality in LLMs, existing approaches primarily apply SC to final-answer selection, neglecting the logical consistency of intermediate reasoning steps. In this work, we introduce a structured self-consistency framework designed to enhance the reliability of mathematical reasoning. Our method enforces self-consistency across intermediate steps and final outputs, reducing logical inconsistencies and hallucinations. We evaluate our approach across three core mathematical tasks: theorem proving, symbolic transformation, and numerical computation. Experimental results demonstrate that SC significantly improves proof validity, symbolic reasoning accuracy, and numerical stability while maintaining computational efficiency. Further analysis reveals that structured self-consistency not only enhances problem-solving accuracy but also reduces the variance of model-generated outputs. These findings highlight self-consistency as a robust mechanism for improving mathematical reasoning in LLMs, paving the way for more reliable and interpretable AI-driven mathematics.
zh

[AI-72] FROG: Effective Friend Recommendation in Online Games via Modality-aware User Preferences SIGIR2025

【速读】：该论文旨在解决在线游戏中基于多模态用户特征（如图像和文本）与友谊图结构信息有效融合的朋友推荐问题。现有方法存在以下局限性：(1) 忽略了用户之间的高阶结构接近性；(2) 未能在特定模态层面学习用户之间的成对相关性；(3) 无法同时捕捉局部和全局的用户模态偏好。为解决这些问题，论文提出了一种端到端模型\textsc{FROG}，其关键在于更好地建模潜在朋友的用户偏好。离线评估和线上部署实验结果均表明，\textsc{FROG}优于现有方法。

链接: https://arxiv.org/abs/2504.09428
作者: Qiwei Wang,Dandan Lin,Wenqing Lin,Ziming Wu
机构: 未知
类目: ocial and Information Networks (cs.SI); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
备注: Accepted in SIGIR 2025

点击查看摘要

Abstract:Due to the convenience of mobile devices, the online games have become an important part for user entertainments in reality, creating a demand for friend recommendation in online games. However, none of existing approaches can effectively incorporate the multi-modal user features (\emphe.g., images and texts) with the structural information in the friendship graph, due to the following limitations: (1) some of them ignore the high-order structural proximity between users, (2) some fail to learn the pairwise relevance between users at modality-specific level, and (3) some cannot capture both the local and global user preferences on different modalities. By addressing these issues, in this paper, we propose an end-to-end model \textscFROG that better models the user preferences on potential friends. Comprehensive experiments on both offline evaluation and online deployment at \kwTencent have demonstrated the superiority of \textscFROG over existing approaches.
zh

[AI-73] Adaptive Insurance Reserving with CVaR-Constrained Reinforcement Learning under Macroeconomic Regimes

【速读】：该论文旨在解决保险准备金评估中的尾部风险敏感性、宏观经济环境建模以及监管合规性之间的复杂平衡问题。论文提出了一种结合近端策略优化（Proximal Policy Optimization, PPO）与条件尾部期望约束（Conditional Value-at-Risk, CVaR）的强化学习框架，将准备金评估问题形式化为有限时间范围的马尔可夫决策过程（Markov Decision Process, MDP）。其关键在于通过引入基于经济状态感知的课程学习机制，使模型在不同波动性环境中保持鲁棒性，并设计奖励函数以同时优化尾部风险控制、资本效率及监管合规性，从而实现对传统方法的性能超越。

链接: https://arxiv.org/abs/2504.09396
作者: Stella C. Dong,James R. Finlay
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
备注:

点击查看摘要

Abstract:This paper proposes a reinforcement learning (RL) framework for insurance reserving that integrates tail-risk sensitivity, macroeconomic regime modeling, and regulatory compliance. The reserving problem is formulated as a finite-horizon Markov Decision Process (MDP), in which reserve adjustments are optimized using Proximal Policy Optimization (PPO) subject to Conditional Value-at-Risk (CVaR) constraints. To enhance policy robustness across varying economic conditions, the agent is trained using a regime-aware curriculum that progressively increases volatility exposure. The reward structure penalizes reserve shortfall, capital inefficiency, and solvency floor violations, with design elements informed by Solvency II and Own Risk and Solvency Assessment (ORSA) frameworks. Empirical evaluations on two industry datasets–Workers’ Compensation, and Other Liability–demonstrate that the RL-CVaR agent achieves superior performance relative to classical reserving methods across multiple criteria, including tail-risk control (CVaR _0.95 ), capital efficiency, and regulatory violation rate. The framework also accommodates fixed-shock stress testing and regime-stratified analysis, providing a principled and extensible approach to reserving under uncertainty. Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML) Cite as: arXiv:2504.09396 [cs.LG] (or arXiv:2504.09396v1 [cs.LG] for this version) https://doi.org/10.48550/arXiv.2504.09396 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
zh

[AI-74] “Its not a representation of me”: Examining Accent Bias and Digital Exclusion in Synthetic AI Voice Services

【速读】：该论文试图解决人工智能语音生成和语音克隆技术在不同口音和地区语言特征下的技术性能差异及其对社会技术系统的影响问题。论文通过混合方法论（包括调查和访谈）评估了两种合成语音服务（Speechify 和 ElevenLabs），以分析其技术表现，并探究用户体验如何影响他们对语音技术中口音变化的认知。研究发现，当前语音生成技术在五种英语区域口音上的性能存在差异，并可能无意中强化语言特权和基于口音的歧视，从而导致新的数字排斥形式。解决方案的关键在于强调通过包容性设计和政策监管，为开发人员、决策者和组织提供可操作的见解，以确保语音技术的公平性和社会责任感。

链接: https://arxiv.org/abs/2504.09346
作者: Shira Michel,Sufi Kaur,Sarah Elizabeth Gillespie,Jeffrey Gleason,Christo Wilson,Avijit Ghosh
机构: 未知
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
备注: This paper has been accepted to FAccT 2025

点击查看摘要

Abstract:Recent advances in artificial intelligence (AI) speech generation and voice cloning technologies have produced naturalistic speech and accurate voice replication, yet their influence on sociotechnical systems across diverse accents and linguistic traits is not fully understood. This study evaluates two synthetic AI voice services (Speechify and ElevenLabs) through a mixed methods approach using surveys and interviews to assess technical performance and uncover how users’ lived experiences influence their perceptions of accent variations in these speech technologies. Our findings reveal technical performance disparities across five regional, English-language accents and demonstrate how current speech generation technologies may inadvertently reinforce linguistic privilege and accent-based discrimination, potentially creating new forms of digital exclusion. Overall, our study highlights the need for inclusive design and regulation by providing actionable insights for developers, policymakers, and organizations to ensure equitable and socially responsible AI speech technologies.
zh

[AI-75] MoE-Lens: Towards the Hardware Limit of High-Throughput MoE LLM Serving Under Resource Constraints

【速读】：该论文旨在解决混合专家模型（MoE）大语言模型在资源受限环境下的高效推理问题，特别是GPU内存容量限制导致无法加载完整模型权重的挑战。论文的关键在于通过全系统性能建模（holistic performance modeling），综合分析CPU与GPU的硬件特性、内存容量、计算能力以及工作负载特征，识别系统瓶颈并预测理论上的性能上限。基于此性能模型，论文提出了一种名为MoE-Lens的高吞吐量MoE模型推理系统，优化了CPU与GPU之间的资源分配与任务调度，使其接近硬件性能极限。该方法显著提升了MoE模型的推理效率，在多种模型与数据集上的表现优于现有最佳方案，平均加速比达4.6倍（最高可达25.5倍），且理论性能预测的准确率达到94%。

链接: https://arxiv.org/abs/2504.09345
作者: Yichao Yuan,Lin Ma,Nishil Talati
机构: 未知
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Mixture of Experts (MoE) LLMs, characterized by their sparse activation patterns, offer a promising approach to scaling language models while avoiding proportionally increasing the inference cost. However, their large parameter sizes present deployment challenges in resource-constrained environments with limited GPU memory capacity, as GPU memory is often insufficient to accommodate the full set of model weights. Consequently, typical deployments rely on CPU-GPU hybrid execution: the GPU handles compute-intensive GEMM operations, while the CPU processes the relatively lightweight attention mechanism. This setup introduces a key challenge: how to effectively optimize resource utilization across CPU and GPU? Prior work has designed system optimizations based on performance models with limited scope. Specifically, such models do not capture the complex interactions between hardware properties and system execution mechanisms. Therefore, previous approaches neither identify nor achieve the hardware limit. This paper presents MoE-Lens, a high-throughput MoE LLM inference system designed through holistic performance modeling for resource-constrained environments. Our performance model thoroughly analyzes various fundamental system components, including CPU memory capacity, GPU compute power, and workload characteristics, to understand the theoretical performance upper bound of MoE inference. Furthermore, it captures the system execution mechanisms to identify the key hardware bottlenecks and accurately predict the achievable throughput. Informed by our performance model, MoE-Lens introduces an inference system approaching hardware limits. Evaluated on diverse MoE models and datasets, MoE-Lens outperforms the state-of-the-art solution by 4.6x on average (up to 25.5x), with our theoretical model predicting performance with an average 94% accuracy. Subjects: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI) Cite as: arXiv:2504.09345 [cs.DC] (or arXiv:2504.09345v1 [cs.DC] for this version) https://doi.org/10.48550/arXiv.2504.09345 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
zh

[AI-76] Confirmation Bias in Generative AI Chatbots: Mechanisms Risks Mitigation Strategies and Future Research Directions

【速读】：该论文试图解决生成式 AI 聊天机器人中确认偏误（Confirmation Bias）这一相对未被充分研究的 AI-人类交互方面的问题。论文通过认知心理学和计算语言学的视角，探讨了如何通过大型语言模型的设计与运行复制和放大确认偏误这一常见的人类倾向。论文分析了确认偏误在聊天机器人交互中的表现机制，评估了与此偏误相关的伦理和实际风险，并提出了多种缓解策略，包括技术干预、界面重新设计以及旨在促进平衡的 AI 生成对话的政策措施。解决方案的关键在于综合运用技术、设计和政策手段，以减少确认偏误的影响，同时强调需要跨学科合作和实证研究来更好地理解和应对生成式 AI 系统中的确认偏误问题。

链接: https://arxiv.org/abs/2504.09343
作者: Yiran Du
机构: 未知
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:This article explores the phenomenon of confirmation bias in generative AI chatbots, a relatively underexamined aspect of AI-human interaction. Drawing on cognitive psychology and computational linguistics, it examines how confirmation bias, commonly understood as the tendency to seek information that aligns with existing beliefs, can be replicated and amplified by the design and functioning of large language models. The article analyzes the mechanisms by which confirmation bias may manifest in chatbot interactions, assesses the ethical and practical risks associated with such bias, and proposes a range of mitigation strategies. These include technical interventions, interface redesign, and policy measures aimed at promoting balanced AI-generated discourse. The article concludes by outlining future research directions, emphasizing the need for interdisciplinary collaboration and empirical evaluation to better understand and address confirmation bias in generative AI systems.
zh

[AI-77] Lumos: Efficient Performance Modeling and Estimation for Large-scale LLM Training

【速读】：该论文旨在解决在分布式环境中训练大语言模型（LLMs）时面临的性能优化难题，具体包括模型执行复杂性、部署系统多样性以及配置策略空间巨大的挑战。尽管已有多种优化技术，但在实际应用中实现高效训练仍然困难。为应对这一问题，论文提出的关键解决方案是开发Lumos，这是一种基于追踪的性能建模与估计工具包，用于准确捕捉和预测现代LLMs的执行行为。Lumos通过利用生产级机器学习集群上的多种GPT-3变体，在高达512块NVIDIA H100 GPU的环境下进行评估，展示了其能够以平均3.3%的误差重现不同模型和配置下的执行时间及其他运行时细节的能力。此外，它还验证了从现有追踪数据中估算新设置性能的能力，从而促进模型和部署配置的高效探索。

链接: https://arxiv.org/abs/2504.09307
作者: Mingyu Liang,Hiwot Tadese Kassa,Wenyin Fu,Brian Coutinho,Louis Feng,Christina Delimitrou
机构: 未知
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI)
备注: Accepted to MLSys 2025

点击查看摘要

Abstract:Training LLMs in distributed environments presents significant challenges due to the complexity of model execution, deployment systems, and the vast space of configurable strategies. Although various optimization techniques exist, achieving high efficiency in practice remains difficult. Accurate performance models that effectively characterize and predict a model’s behavior are essential for guiding optimization efforts and system-level studies. We propose Lumos, a trace-driven performance modeling and estimation toolkit for large-scale LLM training, designed to accurately capture and predict the execution behaviors of modern LLMs. We evaluate Lumos on a production ML cluster with up to 512 NVIDIA H100 GPUs using various GPT-3 variants, demonstrating that it can replay execution time with an average error of just 3.3%, along with other runtime details, across different models and configurations. Additionally, we validate its ability to estimate performance for new setups from existing traces, facilitating efficient exploration of model and deployment configurations.
zh

[AI-78] Application of Contrastive Learning on ECG Data: Evaluating Performance in Japanese and Classification with Around 100 Labels

【速读】：该论文旨在解决在多语言环境下利用机器学习技术对心电图（ECG）进行分类时准确性不足的问题。现有研究虽已通过多模态模型从ECG数据中提取信息并实现分类，但主要集中在英语环境且类别数量有限，其效果在其他语言中的适用性尚不明确。为应对这一挑战，论文的关键解决方案是构建基于日语的语言模型，并采用对比学习框架，在包含98个类别的日语标签数据集上训练模型。实验结果表明，该方法在保持高分类精度的同时，扩展了多模态机器学习框架在临床研究及非英语场景下的应用潜力。

链接: https://arxiv.org/abs/2504.09302
作者: Junichiro Takahashi,JingChuan Guan,Masataka Sato,Kaito Baba,Kazuto Haruguchi,Daichi Nagashima,Satoshi Kodera,Norihiko Takeda
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 13 pages, 1 figures

点击查看摘要

Abstract:The electrocardiogram (ECG) is a fundamental tool in cardiovascular diagnostics due to its powerful and non-invasive nature. One of the most critical usages is to determine whether more detailed examinations are necessary, with users ranging across various levels of expertise. Given this diversity in expertise, it is essential to assist users to avoid critical errors. Recent studies in machine learning have addressed this challenge by extracting valuable information from ECG data. Utilizing language models, these studies have implemented multimodal models aimed at classifying ECGs according to labeled terms. However, the number of classes was reduced, and it remains uncertain whether the technique is effective for languages other than English. To move towards practical application, we utilized ECG data from regular patients visiting hospitals in Japan, maintaining a large number of Japanese labels obtained from actual ECG readings. Using a contrastive learning framework, we found that even with 98 labels for classification, our Japanese-based language model achieves accuracy comparable to previous research. This study extends the applicability of multimodal machine learning frameworks to broader clinical studies and non-English languages.
zh

[AI-79] Continuum-Interaction-Driven Intelligence: Human-Aligned Neural Architecture via Crystallized Reasoning and Fluid Generation

【速读】：该论文试图解决当前基于概率神经网络的人工智能系统（如大语言模型 LLMs）在生成能力上的显著进步与其面临的幻觉问题、不可预测性以及与人类决策不一致等挑战之间的矛盾。这些问题的核心根源在于过度依赖简化的随机概率神经网络模型，而忽视了程序化推理（chain-of-thought）在构建可信赖决策中的作用。论文的关键解决方案在于提出了一种双通道智能架构，将概率生成（如 LLMs）与白盒程序化推理相结合，通过重新定义链式思维为可编程的结晶智能载体，并设计任务驱动的模块化网络，明确划分随机生成与过程控制的功能边界，从而构建出可解释、持续可学习且与人类目标一致的 AI 系统。此外，研究强调多轮交互是智能涌现的必要条件，并表明对话深度与系统的类人一致性呈正相关。这一研究不仅为可信人工智能部署提供了新范式，还为下一代人机协作系统奠定了理论基础。

链接: https://arxiv.org/abs/2504.09301
作者: Pengcheng Zhou,Zhiqiang Nie,Haochen Li
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Current AI systems based on probabilistic neural networks, such as large language models (LLMs), have demonstrated remarkable generative capabilities yet face critical challenges including hallucination, unpredictability, and misalignment with human decision-making. These issues fundamentally stem from the over-reliance on randomized (probabilistic) neural networks-oversimplified models of biological neural networks-while neglecting the role of procedural reasoning (chain-of-thought) in trustworthy decision-making. Inspired by the human cognitive duality of fluid intelligence (flexible generation) and crystallized intelligence (structured knowledge), this study proposes a dual-channel intelligent architecture that integrates probabilistic generation (LLMs) with white-box procedural reasoning (chain-of-thought) to construct interpretable, continuously learnable, and human-aligned AI systems. Concretely, this work: (1) redefines chain-of-thought as a programmable crystallized intelligence carrier, enabling dynamic knowledge evolution and decision verification through multi-turn interaction frameworks; (2) introduces a task-driven modular network design that explicitly demarcates the functional boundaries between randomized generation and procedural control to address trustworthiness in vertical-domain applications; (3) demonstrates that multi-turn interaction is a necessary condition for intelligence emergence, with dialogue depth positively correlating with the system’s human-alignment degree. This research not only establishes a new paradigm for trustworthy AI deployment but also provides theoretical foundations for next-generation human-AI collaborative systems.
zh

[AI-80] Semantic Commit: Helping Users Update Intent Specifications for AI Memory at Scale

【速读】：该论文旨在解决如何在用户意图（user intent）发生变化时更新人工智能（AI）记忆的问题。具体而言，研究探索了AI界面如何协助将新信息整合到自然语言数据仓库中，并提出了一种名为“语义冲突解决”（semantic conflict resolution）的方法来管理具有非局部影响的语义变更。论文的关键在于引入“语义提交”（Semantic Commit）的概念，使用户能够将新的意图提交到项目中，并通过结合知识图谱驱动的检索增强生成（RAG）管道与大型语言模型（LLMs）辅助建议，帮助检测和解决存储在现有意图规范（intent specification）中的语义冲突。此外，研究还发现尽管参与者可以使用全局修订功能，但半数参与者仍倾向于采用基于影响分析的工作流，优先标记冲突而后进行局部解决。论文强调，AI代理界面应提供影响分析的支持，并帮助用户独立验证AI检索而非生成的结果，从而为AI代理设计者提供了关于更新记忆作为包含人类反馈和决策过程的新视角。

链接: https://arxiv.org/abs/2504.09283
作者: Priyan Vaithilingam,Munyeong Kim,Frida-Cecilia Acosta-Parenteau,Daniel Lee,Amine Mhedhbi,Elena L. Glassman,Ian Arawjo
机构: 未知
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
备注: 22 pages; 10 figures

点击查看摘要

Abstract:How do we update AI memory of user intent as intent changes? We consider how an AI interface may assist the integration of new information into a repository of natural language data. Inspired by software engineering concepts like impact analysis, we develop methods and a UI for managing semantic changes with non-local effects, which we call “semantic conflict resolution.” The user commits new intent to a project – makes a “semantic commit” – and the AI helps the user detect and resolve semantic conflicts within a store of existing information representing their intent (an “intent specification”). We develop an interface, SemanticCommit, to better understand how users resolve conflicts when updating intent specifications such as Cursor Rules and game design documents. A knowledge graph-based RAG pipeline drives conflict detection, while LLMs assist in suggesting resolutions. We evaluate our technique on an initial benchmark. Then, we report a 12 user within-subjects study of SemanticCommit for two task domains – game design documents, and AI agent memory in the style of ChatGPT memories – where users integrated new information into an existing list. Half of our participants adopted a workflow of impact analysis, where they would first flag conflicts without AI revisions then resolve conflicts locally, despite having access to a global revision feature. We argue that AI agent interfaces, such as software IDEs like Cursor and Windsurf, should provide affordances for impact analysis and help users validate AI retrieval independently from generation. Our work speaks to how AI agent designers should think about updating memory as a process that involves human feedback and decision-making.
zh

[AI-81] SynthTRIPs: A Knowledge-Grounded Framework for Benchmark Query Generation for Personalized Tourism Recommenders SIGIR2025

【速读】：该论文旨在解决现有公开旅行数据集在支持高级个性化策略（尤其是可持续旅游和非高峰旅游）方面存在的广度和深度不足的问题。为应对这一挑战，论文提出了一种名为SynthTRIPs的新框架，利用大型语言模型（Large Language Models, LLMs）生成模拟多样化用户画像的合成旅行查询，并结合结构化过滤器（如预算约束和可持续性偏好）。解决方案的关键在于通过基于精心策划的知识库（Knowledge Base, KB）来约束LLMs的响应，以减少幻觉并确保事实准确性，同时将基于用户画像的偏好与明确的可持续性过滤条件相结合，生成逼真且多样的查询。此外，论文还正式定义了查询生成过程并引入评估指标以衡量生成结果的真实性与一致性。这一方法不仅提升了合成数据集在捕捉复杂个性化方面的表现，其通用性也适用于其他推荐系统领域。

链接: https://arxiv.org/abs/2504.09277
作者: Ashmi Banerjee,Adithi Satish,Fitri Nur Aisyah,Wolfgang Wörndl,Yashar Deldjoo
机构: 未知
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
备注: Accepted for publication at SIGIR 2025

点击查看摘要

Abstract:Tourism Recommender Systems (TRS) are crucial in personalizing travel experiences by tailoring recommendations to users’ preferences, constraints, and contextual factors. However, publicly available travel datasets often lack sufficient breadth and depth, limiting their ability to support advanced personalization strategies – particularly for sustainable travel and off-peak tourism. In this work, we explore using Large Language Models (LLMs) to generate synthetic travel queries that emulate diverse user personas and incorporate structured filters such as budget constraints and sustainability preferences. This paper introduces a novel SynthTRIPs framework for generating synthetic travel queries using LLMs grounded in a curated knowledge base (KB). Our approach combines persona-based preferences (e.g., budget, travel style) with explicit sustainability filters (e.g., walkability, air quality) to produce realistic and diverse queries. We mitigate hallucination and ensure factual correctness by grounding the LLM responses in the KB. We formalize the query generation process and introduce evaluation metrics for assessing realism and alignment. Both human expert evaluations and automatic LLM-based assessments demonstrate the effectiveness of our synthetic dataset in capturing complex personalization aspects underrepresented in existing datasets. While our framework was developed and tested for personalized city trip recommendations, the methodology applies to other recommender system domains. Code and dataset are made public at this https URL Comments: Accepted for publication at SIGIR 2025 Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI) Cite as: arXiv:2504.09277 [cs.IR] (or arXiv:2504.09277v1 [cs.IR] for this version) https://doi.org/10.48550/arXiv.2504.09277 Focus to learn more arXiv-issued DOI via DataCite (pending registration) Related DOI: https://doi.org/10.1145/3726302.3730321 Focus to learn more DOI(s) linking to related resources
zh

[AI-82] Development of a PPO-Reinforcement Learned Walking Tripedal Soft-Legged Robot using SOFA

【速读】：该论文旨在解决软体机器人在实时物理与视觉模拟框架下的研究不足问题，特别是针对利用软腿机器人在恶劣地形环境中替代人类执行任务的需求。论文的关键在于通过定制化设置和合理处理SOFA（Simulation Open Framework Architecture）框架组件，优化其参数以实现最先进的近端策略优化（Proximal Policy Optimization, PPO）强化学习方法的应用。最终成果是一个基于PPO强化学习的三足软腿机器人，在SOFA环境中能够完成行走任务。论文通过设定导航性能作为衡量成功与否的核心指标，并进一步挑战多目标导航任务，实现了累积平方误差偏差仅为19毫米的优异表现。

链接: https://arxiv.org/abs/2504.09242
作者: Yomna Mokhtar,Tarek Shohdy,Abdallah A. Hassan,Mostafa Eshra,Omar Elmenawy,Osama Khalil,Haitham El-Hussieny
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Rigid robots were extensively researched, whereas soft robotics remains an underexplored field. Utilizing soft-legged robots in performing tasks as a replacement for human beings is an important stride to take, especially under harsh and hazardous conditions over rough terrain environments. For the demand to teach any robot how to behave in different scenarios, a real-time physical and visual simulation is essential. When it comes to soft robots specifically, a simulation framework is still an arduous problem that needs to be disclosed. Using the simulation open framework architecture (SOFA) is an advantageous step. However, neither SOFA’s manual nor prior public SOFA projects show its maximum capabilities the users can reach. So, we resolved this by establishing customized settings and handling the framework components appropriately. Settling on perfect, fine-tuned SOFA parameters has stimulated our motivation towards implementing the state-of-the-art (SOTA) reinforcement learning (RL) method of proximal policy optimization (PPO). The final representation is a well-defined, ready-to-deploy walking, tripedal, soft-legged robot based on PPO-RL in a SOFA environment. Robot navigation performance is a key metric to be considered for measuring the success resolution. Although in the simulated soft robots case, an 82% success rate in reaching a single goal is a groundbreaking output, we pushed the boundaries to further steps by evaluating the progress under assigning a sequence of goals. While trailing the platform steps, outperforming discovery has been observed with an accumulative squared error deviation of 19 mm. The full code is publicly available at \hrefthis https URLthis http URL\textunderscore SOFA \textunderscore Soft \textunderscore Legged \textunderscore this http URL
zh

[AI-83] AMNet: An Acoustic Model Network for Enhanced Mandarin Speech Synthesis IJCNN2025

【速读】：该论文旨在解决普通话语音合成中局部上下文建模的挑战，特别是如何有效捕捉停顿、重音和语调等复杂的语音特征。为了解决这一问题，论文提出的关键方案是将短语结构解析器嵌入模型，并引入局部卷积模块，以增强模型对局部信息的敏感性。此外，AMNet 通过将声调特性从音素中解耦，为声调建模提供了明确的指导，从而提升了声调准确性和发音质量。这些创新点共同提高了模型在主观和客观评价中的性能。

链接: https://arxiv.org/abs/2504.09225
作者: Yubing Cao,Yinfeng Yu,Yongming Li,Liejun Wang
机构: 未知
类目: ound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
备注: Main paper (8 pages). Accepted for publication by IJCNN 2025

点击查看摘要

Abstract:This paper presents AMNet, an Acoustic Model Network designed to improve the performance of Mandarin speech synthesis by incorporating phrase structure annotation and local convolution modules. AMNet builds upon the FastSpeech 2 architecture while addressing the challenge of local context modeling, which is crucial for capturing intricate speech features such as pauses, stress, and intonation. By embedding a phrase structure parser into the model and introducing a local convolution module, AMNet enhances the model’s sensitivity to local information. Additionally, AMNet decouples tonal characteristics from phonemes, providing explicit guidance for tone modeling, which improves tone accuracy and pronunciation. Experimental results demonstrate that AMNet outperforms baseline models in subjective and objective evaluations. The proposed model achieves superior Mean Opinion Scores (MOS), lower Mel Cepstral Distortion (MCD), and improved fundamental frequency fitting F0 (R^2) , confirming its ability to generate high-quality, natural, and expressive Mandarin speech.
zh

[AI-84] FairACE: Achieving Degree Fairness in Graph Neural Networks via Contrastive and Adversarial Group-Balanced Training

【速读】：该论文旨在解决图神经网络（Graph Neural Networks, GNNs）中因节点度数偏差导致的不同度数组间预测性能不平等的问题。现有GNN模型主要关注预测准确性，往往忽视了不同度数组间的公平性。为解决此问题，论文提出了一种名为公平感知不对称对比集成（Fairness-Aware Asymmetric Contrastive Ensemble, FairACE）的新框架。其关键在于结合不对称对比学习与对抗训练以提升度数公平性，并通过捕获一阶局部邻居信息和二阶单态相似性生成更公平的节点表示。此外，FairACE 引入度数公平性调节器平衡高低度节点表现，同时提出新的分组平衡公平损失函数以减少度数组间的分类差异。为定量评估公平性，论文还设计了准确率分布差距（Accuracy Distribution Gap, ADG）这一新指标。实验结果表明，FairACE 在保持竞争力预测精度的同时显著提升了度数公平性指标。

链接: https://arxiv.org/abs/2504.09210
作者: Jiaxin Liu,Xiaoqian Jiang,Cangqi Zhou,Jing Zhang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Social and Information Networks (cs.SI); Machine Learning (stat.ML)
备注:

点击查看摘要

Abstract:Fairness has been a significant challenge in graph neural networks (GNNs) since degree biases often result in un-equal prediction performance among nodes with varying degrees. Existing GNN models focus on prediction accuracy, frequently overlooking fairness across different degree groups. To addressthis issue, we propose a novel GNN framework, namely Fairness- Aware Asymmetric Contrastive Ensemble (FairACE), which inte-grates asymmetric contrastive learning with adversarial training to improve degree fairness. FairACE captures one-hop local neighborhood information and two-hop monophily similarity to create fairer node representations and employs a degree fairness regulator to balance performance between high-degree and low-degree nodes. During model training, a novel group-balanced fairness loss is proposed to minimize classification disparities across degree groups. In addition, we also propose a novel fairness metric, the Accuracy Distribution Gap (ADG), which can quantitatively assess and ensure equitable performance across different degree-based node groups. Experimental results on both synthetic and real-world datasets demonstrate that FairACE significantly improves degree fairness metrics while maintaining competitive accuracy in comparison to the state-of-the-art GNN models.
zh

[AI-85] Graph Learning-Driven Multi-Vessel Association: Fusing Multimodal Data for Maritime Intelligence

【速读】：该论文旨在解决在繁忙且复杂的水道中，由于多模态数据（如自动识别系统AIS和闭路电视CCTV）存在的维度差异、目标数量不匹配、船舶尺度变化、遮挡以及异步数据流等问题，导致现有多目标关联方法在密集交通水域表现不佳的问题。论文的关键解决方案是提出了一种基于图学习的多船关联（Graph Learning-driven Multi-vessel Association, GMvA）方法，通过融合AIS和CCTV数据，利用时间序列学习和图神经网络提取船舶轨迹的时空特征，并结合时序图注意力和时空注意力机制来捕获局部和全局的船舶交互。此外，采用基于多层感知器的不确定性融合模块计算鲁棒的相似性分数，并使用匈牙利算法确保全局一致且准确的目标匹配，从而实现更精确和稳健的多目标关联性能。

链接: https://arxiv.org/abs/2504.09197
作者: Yuxu Lu,Kaisen Yang,Dong Yang,Haifeng Ding,Jinxian Weng,Ryan Wen Liu
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Ensuring maritime safety and optimizing traffic management in increasingly crowded and complex waterways require effective waterway monitoring. However, current methods struggle with challenges arising from multimodal data, such as dimensional disparities, mismatched target counts, vessel scale variations, occlusions, and asynchronous data streams from systems like the automatic identification system (AIS) and closed-circuit television (CCTV). Traditional multi-target association methods often struggle with these complexities, particularly in densely trafficked waterways. To overcome these issues, we propose a graph learning-driven multi-vessel association (GMvA) method tailored for maritime multimodal data fusion. By integrating AIS and CCTV data, GMvA leverages time series learning and graph neural networks to capture the spatiotemporal features of vessel trajectories effectively. To enhance feature representation, the proposed method incorporates temporal graph attention and spatiotemporal attention, effectively capturing both local and global vessel interactions. Furthermore, a multi-layer perceptron-based uncertainty fusion module computes robust similarity scores, and the Hungarian algorithm is adopted to ensure globally consistent and accurate target matching. Extensive experiments on real-world maritime datasets confirm that GMvA delivers superior accuracy and robustness in multi-target association, outperforming existing methods even in challenging scenarios with high vessel density and incomplete or unevenly distributed AIS and CCTV data.
zh

[AI-86] Repetitive Contrastive Learning Enhances Mambas Selectivity in Time Series Prediction

【速读】：该论文旨在解决长序列预测中时间步聚焦不足及噪声抑制不完全的问题，这些问题源于Mamba模型选择能力的局限性。论文的关键解决方案是引入重复对比预训练（Repetitive Contrastive Learning, RCL），这是一种以Token级别对比学习为基础的预训练框架，用于增强Mamba的选择能力。RCL通过在单个Mamba块上进行预训练，并利用高斯噪声的序列增强以及序列间与序列内的对比学习，帮助Mamba模块优先关注信息丰富的关键时间步，同时抑制噪声影响。此外，RCL将预训练参数迁移到多种主干模型的Mamba块中，从而提升其时间序列预测性能。实验结果表明，RCL显著提升了主干模型的表现，超越现有方法并达到当前最优性能。同时，论文还提出了两种度量指标来量化Mamba的选择能力，为RCL带来的改进提供了理论、定性和定量的支持。

链接: https://arxiv.org/abs/2504.09185
作者: Wenbo Yan,Hanzhong Cao,Ying Tan
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Long sequence prediction is a key challenge in time series forecasting. While Mamba-based models have shown strong performance due to their sequence selection capabilities, they still struggle with insufficient focus on critical time steps and incomplete noise suppression, caused by limited selective abilities. To address this, we introduce Repetitive Contrastive Learning (RCL), a token-level contrastive pretraining framework aimed at enhancing Mamba’s selective capabilities. RCL pretrains a single Mamba block to strengthen its selective abilities and then transfers these pretrained parameters to initialize Mamba blocks in various backbone models, improving their temporal prediction performance. RCL uses sequence augmentation with Gaussian noise and applies inter-sequence and intra-sequence contrastive learning to help the Mamba module prioritize information-rich time steps while ignoring noisy ones. Extensive experiments show that RCL consistently boosts the performance of backbone models, surpassing existing methods and achieving state-of-the-art results. Additionally, we propose two metrics to quantify Mamba’s selective capabilities, providing theoretical, qualitative, and quantitative evidence for the improvements brought by RCL.
zh

[AI-87] A Confounding Factors-Inhibition Adversarial Learning Framework for Multi-site fMRI Mental Disorder Identification

【速读】：该论文旨在解决多站点功能磁共振成像(fMRI)数据集中的异质性问题，这些问题源于扫描程序差异、混杂效应的存在以及多个站点间人群多样性等因素，导致表示学习的效果下降，并进而影响后续分类任务的整体效能。为了解决这些局限性，论文提出了一种新的多站点对抗学习网络(MSalNET)，用于基于fMRI的精神障碍检测。该方案的关键在于：首先引入了一个带有节点信息组装(NIA)机制的表示学习模块，以更好地从功能连接(FC)中提取特征，此机制能够从水平和垂直方向聚合边信息，有效整合节点信息；其次，设计了一个站点级特征提取模块，通过个体FC数据学习来泛化特征，避免使用额外的先验信息；最后，提出了一个对抗学习网络，通过引入新型损失函数，在个体分类与站点回归任务之间实现平衡权衡。

链接: https://arxiv.org/abs/2504.09179
作者: Xin Wen,Shijie Guo,Wenbo Ning,Rui Cao,Yan Niu,Bin Wan,Peng Wei,Xiaobo Liu,Jie Xiang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:In open data sets of functional magnetic resonance imaging (fMRI), the heterogeneity of the data is typically attributed to a combination of factors, including differences in scanning procedures, the presence of confounding effects, and population diversities between multiple sites. These factors contribute to the diminished effectiveness of representation learning, which in turn affects the overall efficacy of subsequent classification procedures. To address these limitations, we propose a novel multi-site adversarial learning network (MSalNET) for fMRI-based mental disorder detection. Firstly, a representation learning module is introduced with a node information assembly (NIA) mechanism to better extract features from functional connectivity (FC). This mechanism aggregates edge information from both horizontal and vertical directions, effectively assembling node information. Secondly, to generalize the feature across sites, we proposed a site-level feature extraction module that can learn from individual FC data, which circumvents additional prior information. Lastly, an adversarial learning network is proposed as a means of balancing the trade-off between individual classification and site regression tasks, with the introduction of a novel loss function. The proposed method was evaluated on two multi-site fMRI datasets, i.e., Autism Brain Imaging Data Exchange (ABIDE) and ADHD-200. The results indicate that the proposed method achieves a better performance than other related algorithms with the accuracy of 75.56 and 68.92 in ABIDE and ADHD-200 datasets, respectively. Furthermore, the result of the site regression indicates that the proposed method reduces site variability from a data-driven perspective. The most discriminative brain regions revealed by NIA are consistent with statistical findings, uncovering the “black box” of deep learning to a certain extent.
zh

[AI-88] Synthetic Aircraft Trajectory Generation Using Time-Based VQ-VAE

【速读】：该论文旨在解决现代空域管理中因数据稀缺、敏感信息保护及大规模分析需求所面临的挑战，通过生成合成飞行轨迹提供解决方案。论文的关键在于提出了一种基于时间基向量化量化的变分自编码器（Time-Based Vector Quantized Variational Autoencoder, TimeVQVAE）的新型轨迹合成方法。此方法结合了时频域处理、向量化量化的特性以及基于Transformer的先验知识，能够捕捉飞行数据中的全局与局部动态。通过离散潜在空间及整合Transformer先验，模型不仅学习到长时间范围内的时空依赖关系，还保证了整个飞行路径的一致性。最终实验表明，该模型在生成符合真实飞行数据的空间准确性、时间一致性及统计特性的合成轨迹方面优于时序卷积VAE基准模型，并且在操作可行性上也表现良好，尽管存在个别异常情况，但仍显示出多尺度表征学习对于捕捉复杂飞行行为的重要性，以及TimeVQVAE在模型训练、空域设计和空中交通预测等下游任务中的应用潜力。

链接: https://arxiv.org/abs/2504.09101
作者: Abdulmajid Murad,Massimiliano Ruocco
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: This paper was presented at the 25th Integrated Communications, Navigation and Surveillance Conference (ICNS 2025), April 8–10, 2025, Brussels, Belgium

点击查看摘要

Abstract:In modern air traffic management, generating synthetic flight trajectories has emerged as a promising solution for addressing data scarcity, protecting sensitive information, and supporting large-scale analyses. In this paper, we propose a novel method for trajectory synthesis by adapting the Time-Based Vector Quantized Variational Autoencoder (TimeVQVAE). Our approach leverages time-frequency domain processing, vector quantization, and transformer-based priors to capture both global and local dynamics in flight data. By discretizing the latent space and integrating transformer priors, the model learns long-range spatiotemporal dependencies and preserves coherence across entire flight paths. We evaluate the adapted TimeVQVAE using an extensive suite of quality, statistical, and distributional metrics, as well as a flyability assessment conducted in an open-source air traffic simulator. Results indicate that TimeVQVAE outperforms a temporal convolutional VAE baseline, generating synthetic trajectories that mirror real flight data in terms of spatial accuracy, temporal consistency, and statistical properties. Furthermore, the simulator-based assessment shows that most generated trajectories maintain operational feasibility, although occasional outliers underscore the potential need for additional domain-specific constraints. Overall, our findings underscore the importance of multi-scale representation learning for capturing complex flight behaviors and demonstrate the promise of TimeVQVAE in producing representative synthetic trajectories for downstream tasks such as model training, airspace design, and air traffic forecasting.
zh

[AI-89] Privacy Preservation in Gen AI Applications

【速读】：该论文旨在解决由大型语言模型（Large Language Models, LLMs）在自然语言处理（Natural Language Processing, NLP）领域快速发展所引发的隐私问题。具体而言，由于LLMs在大规模数据集上的训练可能导致无意中吸收和泄露用户的个人识别信息（Personally Identifiable Information, PII），这引发了严重的隐私安全隐患。此外，深度神经网络的复杂性使得追踪或阻止隐私信息的意外存储与泄露变得困难，进一步加剧了这一问题。论文的关键解决方案在于通过检测生成式AI（Generative AI）的弱点（如数据提取攻击、模型反演攻击和成员推理攻击）来识别潜在风险，并开发一种能够抵抗这些攻击的隐私保护型生成式AI应用。该方案的核心在于采用预处理方法，即在与LLMs交互之前识别、修改或移除PII，从而在保障功能的同时确保数据隐私。此外，论文还评估了主流云平台（如Microsoft Azure、Google Cloud和AWS）提供的隐私工具，为生成式AI系统提出了一种以数据安全和道德AI实施为核心的隐私框架，推动更安全和负责任的工具使用。

链接: https://arxiv.org/abs/2504.09095
作者: Swetha S,Ram Sundhar K Shaju,Rakshana M,Ganesh R,Balavedhaa S,Thiruvaazhi U
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:The ability of machines to comprehend and produce language that is similar to that of humans has revolutionized sectors like customer service, healthcare, and finance thanks to the quick advances in Natural Language Processing (NLP), which are fueled by Generative Artificial Intelligence (AI) and Large Language Models (LLMs). However, because LLMs trained on large datasets may unintentionally absorb and reveal Personally Identifiable Information (PII) from user interactions, these capabilities also raise serious privacy concerns. Deep neural networks’ intricacy makes it difficult to track down or stop the inadvertent storing and release of private information, which raises serious concerns about the privacy and security of AI-driven data. This study tackles these issues by detecting Generative AI weaknesses through attacks such as data extraction, model inversion, and membership inference. A privacy-preserving Generative AI application that is resistant to these assaults is then developed. It ensures privacy without sacrificing functionality by using methods to identify, alter, or remove PII before to dealing with LLMs. In order to determine how well cloud platforms like Microsoft Azure, Google Cloud, and AWS provide privacy tools for protecting AI applications, the study also examines these technologies. In the end, this study offers a fundamental privacy paradigm for generative AI systems, focusing on data security and moral AI implementation, and opening the door to a more secure and conscientious use of these tools.
zh

[AI-90] PQS (Prune Quantize and Sort): Low-Bitwidth Accumulation of Dot Products in Neural Network Computations

【速读】：该论文旨在解决在神经网络计算中低比特位宽点积累积的问题，特别是在量化（如8位）点积运算中因避免中间部分和溢出而使用宽累加器（如32位）导致的内存带宽增加和能效降低的问题。论文的关键解决方案是结合三种技术：剪枝（Prune）、量化（Quantize）和排序（Sort）。通过在浮点数中进行迭代N:M剪枝，随后将结果量化到8位或更少的比特位，并以“小到大”的顺序累积部分乘积，论文展示了如何实现无需宽累加器的精确且压缩的模型。这种方法能够在推理阶段消除累积溢出，同时将累加器位宽减少2.5倍，且在多种图像分类任务中保持与浮点基准相当的模型精度。

链接: https://arxiv.org/abs/2504.09064
作者: Vikas Natesh,H.T. Kung
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:We present PQS, which uses three techniques together - Prune, Quantize, and Sort - to achieve low-bitwidth accumulation of dot products in neural network computations. In conventional quantized (e.g., 8-bit) dot products, partial results are accumulated into wide (e.g., 32-bit) accumulators to avoid overflows when accumulating intermediate partial sums. However, such wide accumulators increase memory bandwidth usage and reduce energy efficiency. We show that iterative N:M pruning in floating point followed by quantization to 8 (or fewer) bits, and accumulation of partial products in a sorted order (“small to large”) allows for accurate, compressed models with short dot product lengths that do not require wide accumulators. We design, analyze, and implement the PQS algorithm to eliminate accumulation overflows at inference time for several neural networks. Our method offers a 2.5x reduction in accumulator bitwidth while achieving model accuracy on par with floating-point baselines for multiple image classification tasks.
zh

[AI-91] A Practical Approach to using Supervised Machine Learning Models to Classify Aviation Safety Occurrences

【速读】：该论文旨在解决航空事件分类问题，具体是利用监督学习（Supervised Learning）模型协助安全调查员将航空事件划分为“事件”或“严重事件”两类。论文的关键在于采用多种监督学习算法（包括支持向量机、逻辑回归、随机森林分类器、XGBoost 和 K 近邻算法）进行评估，并发现随机森林分类器（Random Forest Classifier）在准确性（Accuracy = 0.77）、F1 分数（F1 Score = 0.78）和马修斯相关系数（MCC = 0.51）方面表现最优。此外，研究还探讨了对数据集应用合成少数过采样技术（SMOTE）的效果，但结果显示其对模型性能的影响存在显著差异，部分模型甚至出现了性能下降的现象。

链接: https://arxiv.org/abs/2504.09063
作者: Bryan Y. Siow
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 9 pages, 3 figures, 3 tables

点击查看摘要

Abstract:This paper describes a practical approach of using supervised machine learning (ML) models to assist safety investigators to classify aviation occurrences into either incident or serious incident categories. Our implementation currently deployed as a ML web application is trained on a labelled dataset derived from publicly available aviation investigation reports. A selection of five supervised learning models (Support Vector Machine, Logistic Regression, Random Forest Classifier, XGBoost and K-Nearest Neighbors) were evaluated. This paper showed the best performing ML algorithm was the Random Forest Classifier with accuracy = 0.77, F1 Score = 0.78 and MCC = 0.51 (average of 100 sample runs). The study had also explored the effect of applying Synthetic Minority Over-sampling Technique (SMOTE) to the imbalanced dataset, and the overall observation ranged from no significant effect to substantial degradation in performance for some of the models after the SMOTE adjustment.
zh

[AI-92] Multimodal 3D Genome Pre-training

【速读】：该论文旨在解决三维基因组学（3D genomics）知识整体理解不足的问题。当前三维基因组学研究在计算生物学中的多种分析任务虽已取得进展，但尚未形成对三维基因组学全面且统一的理解。为此，论文提出MIX-HIC，这是首个整合三维基因组结构与表观基因组轨迹的多模态基础模型，能够获取统一且全面的语义信息。解决方案的关键在于设计了用于异构语义融合的跨模态交互与映射模块，实现了三维基因组知识的精确聚合，并通过引入包含超过100万对Hi-C接触图谱与表观基因组轨迹的大规模数据集进行高质量预训练，从而探索三维基因组学的功能含义。实验表明，MIX-HIC在多种下游任务中显著超越现有最先进的方法。

链接: https://arxiv.org/abs/2504.09060
作者: Minghao Yang,Pengteng Li,Yan Liang,Qianyi Cai,Zhihang Zheng,Shichen Zhang,Pengfei Zhang,Zhi-An Huang,Hui Xiong
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Genomics (q-bio.GN)
备注:

点击查看摘要

Abstract:Deep learning techniques have driven significant progress in various analytical tasks within 3D genomics in computational biology. However, a holistic understanding of 3D genomics knowledge remains underexplored. Here, we propose MIX-HIC, the first multimodal foundation model of 3D genome that integrates both 3D genome structure and epigenomic tracks, which obtains unified and comprehensive semantics. For accurate heterogeneous semantic fusion, we design the cross-modal interaction and mapping blocks for robust unified representation, yielding the accurate aggregation of 3D genome knowledge. Besides, we introduce the first large-scale dataset comprising over 1 million pairwise samples of Hi-C contact maps and epigenomic tracks for high-quality pre-training, enabling the exploration of functional implications in 3D genomics. Extensive experiments show that MIX-HIC can significantly surpass existing state-of-the-art methods in diverse downstream tasks. This work provides a valuable resource for advancing 3D genomics research.
zh

[AI-93] An Enhanced Iterative Deepening Search Algorithm for the Unrestricted Container Rehandling Problem

【速读】：该论文旨在解决无约束重定位问题（Unrestricted Container Rehandling Problem, UCRP）中的集装箱重定位难题，即在集装箱堆场内根据特定操作规则重新安排容器以优化调度。论文的关键解决方案在于提出了一种结合改进下界（Improved Lower Bounds）的增强型深化搜索算法（Deepening Search Algorithm），并通过设计相互一致的剪枝规则（Pruning Rules）来减少搜索空间，避免不必要的计算开销，从而显著提升求解效率，特别是在处理具有严格时间限制的同优先级组容器时表现出色。

链接: https://arxiv.org/abs/2504.09046
作者: Ruoqi Wang,Jiawei Li
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 10 pages

点击查看摘要

Abstract:In container terminal yards, the Container Rehandling Problem (CRP) involves rearranging containers between stacks under specific operational rules, and it is a pivotal optimization challenge in intelligent container scheduling systems. Existing CRP studies primarily focus on minimizing reallocation costs using two-dimensional bay structures, considering factors such as container size, weight, arrival sequences, and retrieval priorities. This paper introduces an enhanced deepening search algorithm integrated with improved lower bounds to boost search efficiency. To further reduce the search space, we design mutually consistent pruning rules to avoid excessive computational overhead. The proposed algorithm is validated on three widely used benchmark datasets for the Unrestricted Container Rehandling Problem (UCRP). Experimental results demonstrate that our approach outperforms state-of-the-art exact algorithms in solving the more general UCRP variant, particularly exhibiting superior efficiency when handling containers within the same priority group under strict time constraints.
zh

[AI-94] MSCCL: Rethinking GPU Communication Abstractions for Cutting-edge AI Applications

【速读】：该论文旨在解决现代AI应用在快速发展的异构硬件设备上开发过程中面临的软件栈适配难题。传统方法依赖通用软件库自底向上适配新硬件，耗时且低效，而定制化软件栈虽能加速优化但导致跨应用的非可移植代码冗余。论文提出了一种新的通信库接口MSCCL++，通过减少冗余工作同时保持定制灵活性，实现可移植性和性能的平衡。其关键是基于关注点分离原则设计的两层抽象：(1) 原语接口提供最小硬件抽象，作为软硬件开发者自定义通信的基础；(2) 高级可移植接口与专用实现支持针对不同硬件环境的优化。这种方法使原语接口在多个应用间复用，并支持高度灵活的优化。与现有先进基准（如NCCL、RCCL和MSCCL）相比，MSCCL++在集体通信中最高提速3.8倍，在实际AI推理任务中提升可达15%。MSCCL++已应用于微软Azure的多项AI服务，并被AMD维护的RCCL库采纳，同时开源可用。

链接: https://arxiv.org/abs/2504.09014
作者: Aashaka Shah,Abhinav Jangda,Binyang Li,Caio Rocha,Changho Hwang,Jithin Jose,Madan Musuvathi,Olli Saarikivi,Peng Cheng,Qinghua Zhou,Roshan Dathathri,Saeed Maleki,Ziyue Yang
机构: 未知
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI)
备注: 13 pages, 13 figures

点击查看摘要

Abstract:Modern cutting-edge AI applications are being developed over fast-evolving, heterogeneous, nascent hardware devices. This requires frequent reworking of the AI software stack to adopt bottom-up changes from new hardware, which takes time for general-purpose software libraries. Consequently, real applications often develop custom software stacks optimized for their specific workloads and hardware. Custom stacks help quick development and optimization, but incur a lot of redundant efforts across applications in writing non-portable code. This paper discusses an alternative communication library interface for AI applications that offers both portability and performance by reducing redundant efforts while maintaining flexibility for customization. We present MSCCL++, a novel abstraction of GPU communication based on separation of concerns: (1) a primitive interface provides a minimal hardware abstraction as a common ground for software and hardware developers to write custom communication, and (2) higher-level portable interfaces and specialized implementations enable optimization for different hardware environments. This approach makes the primitive interface reusable across applications while enabling highly flexible optimization. Compared to state-of-the-art baselines (NCCL, RCCL, and MSCCL), MSCCL++ achieves speedups of up to 3.8 \times for collective communication and up to 15% for real-world AI inference workloads. MSCCL++ is in production of multiple AI services provided by Microsoft Azure, and is also adopted by RCCL, the GPU collective communication library maintained by AMD. MSCCL++ is open-source and available at this https URL.
zh

[AI-95] MCP Bridge: A Lightweight LLM -Agnostic RESTful Proxy for Model Context Protocol Servers

【速读】：该论文旨在解决当前模型上下文协议（Model Context Protocol, MCP）在资源受限环境（如移动设备、浏览器和边缘计算）中应用受限的问题。现有MCP实现通常依赖于STDIO传输的本地进程执行，这导致其在这些环境中不可行。为了解决这一问题，论文提出了一种名为MCP Bridge的轻量级RESTful代理，它通过统一API连接到多个MCP服务器，并暴露其功能。MCP Bridge的关键创新在于其完全与大型语言模型（LLM）无关，支持任意后端供应商的系统，同时采用基于风险的执行模型，提供三种安全级别（标准执行、确认工作流和Docker隔离），并保持对标准MCP客户端的向后兼容性。此外，论文还介绍了一个基于Python的MCP Gemini代理，用于促进自然语言与MCP工具的交互。评估结果表明，MCP Bridge成功解决了直接使用MCP连接的限制，提供了增强的安全控制和跨平台兼容性，从而使得在之前无法访问的环境中构建复杂的LLM驱动应用程序成为可能。

链接: https://arxiv.org/abs/2504.08999
作者: Arash Ahmadi,Sarah Sharif,Yaser M. Banad
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注: 13 pages, 2 figures

点击查看摘要

Abstract:Large Language Models (LLMs) are increasingly augmented with external tools through standardized interfaces like the Model Context Protocol (MCP). However, current MCP implementations face critical limitations: they typically require local process execution through STDIO transports, making them impractical for resource-constrained environments like mobile devices, web browsers, and edge computing. We present MCP Bridge, a lightweight RESTful proxy that connects to multiple MCP servers and exposes their capabilities through a unified API. Unlike existing solutions, MCP Bridge is fully LLM-agnostic, supporting any backend regardless of vendor. The system implements a risk-based execution model with three security levels standard execution, confirmation workflow, and Docker isolation while maintaining backward compatibility with standard MCP clients. Complementing this server-side infrastructure is a Python based MCP Gemini Agent that facilitates natural language interaction with MCP tools. The evaluation demonstrates that MCP Bridge successfully addresses the constraints of direct MCP connections while providing enhanced security controls and cross-platform compatibility, enabling sophisticated LLM-powered applications in previously inaccessible environments
zh

[AI-96] Learning from Elders: Making an LLM -powered Chatbot for Retirement Communities more Accessible through User-centered Design

【速读】：该论文旨在解决退休社区老年人因低技术素养和eHealth素养导致难以有效使用数字工具的问题。解决方案的关键在于采用以用户为中心的设计方法，开发了一款基于大型语言模型（LLM）的聊天机器人原型。通过访谈和角色构建，重点关注了可访问性和双重功能设计：简化内部信息检索的同时提升老年人的技术和eHealth素养。在试点试验中，居民表现出高满意度和易用性，但也发现了改进空间。根据反馈，研究团队优化了聊天机器人，利用GPT-3.5 Turbo和Streamlit实现了定制化提示工程以提供简洁响应，并引入了如可调节字体大小、界面主题及个性化后续回复等可访问性功能。未来计划包括增加语音转文本功能及开展纵向干预研究。整体结果表明，LLM驱动的聊天机器人可通过可访问且个性化的交互赋能老年人，弥合退休社区中的素养差距。

链接: https://arxiv.org/abs/2504.08985
作者: Luna Xingyu Li,Ray-yuan Chung,Feng Chen,Wenyu Zeng,Yein Jeon,Oleg Zaslavsky
机构: 未知
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
备注: Accepted as Research talk for Considering Cultural and Linguistic Diversity in AI Applications workshop at CALD-AI@ASIST 2025

点击查看摘要

Abstract:Low technology and eHealth literacy among older adults in retirement communities hinder engagement with digital tools. To address this, we designed an LLM-powered chatbot prototype using a human-centered approach for a local retirement community. Through interviews and persona development, we prioritized accessibility and dual functionality: simplifying internal information retrieval and improving technology and eHealth literacy. A pilot trial with residents demonstrated high satisfaction and ease of use, but also identified areas for further improvement. Based on the feedback, we refined the chatbot using GPT-3.5 Turbo and Streamlit. The chatbot employs tailored prompt engineering to deliver concise responses. Accessible features like adjustable font size, interface theme and personalized follow-up responses were implemented. Future steps include enabling voice-to-text function and longitudinal intervention studies. Together, our results highlight the potential of LLM-driven chatbots to empower older adults through accessible, personalized interactions, bridging literacy gaps in retirement communities.
zh

[AI-97] AGENT : An Aerial Vehicle Generation and Design Tool Using Large Language Models

【速读】：该论文试图解决传统计算机辅助设计（CAD）流程中物理仿真环节计算昂贵且耗时的问题。解决方案的关键在于开发了一个基于大型语言模型（Large Language Model, LLM）的综合性设计工具AGENT。AGENT通过从JSON文件中直接学习飞机文本设计的强大表征，能够生成满足特定飞行动力学属性（如悬停时间、最高速度等）的设计，并作为物理仿真的代理模型评估设计方案。这种基于CodeT5+ LLM的方法不仅实现了高效的设计生成与评估，还通过小规模模型（仅220M参数）达到了优异性能，使得系统能够在单GPU上运行，为未来的实际部署提供了清晰路径。

链接: https://arxiv.org/abs/2504.08981
作者: Colin Samplawski,Adam D. Cobb,Susmit Jha
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Computer-aided design (CAD) is a promising application area for emerging artificial intelligence methods. Traditional workflows for cyberphysical systems create detailed digital models which can be evaluated by physics simulators in order to narrow the search space before creating physical prototypes. A major bottleneck of this approach is that the simulators are often computationally expensive and slow. Recent advancements in AI methods offer the possibility to accelerate these pipelines. We use the recently released AircraftVerse dataset, which is especially suited for developing and evaluating large language models for designs. AircraftVerse contains a diverse set of UAV designs represented via textual design trees together with detailed physics simulation results. Following the recent success of large language models (LLMs), we propose AGENT (Aircraft GENeraTor). AGENT is a comprehensive design tool built on the CodeT5+ LLM which learns powerful representations of aircraft textual designs directly from JSON files. We develop a curriculum of training tasks which imbues a single model with a suite of useful features. AGENT is able to generate designs conditioned on properties of flight dynamics (hover time, maximum speed, etc.). Additionally, AGENT can issue evaluations of designs allowing it to act as a surrogate model of the physics simulation that underlies the AircraftVerse dataset. We present a series of experiments which demonstrate our system’s abilities. We are able to achieve strong performance using the smallest member of the CodeT5+ family (220M parameters). This allows for a flexible and powerful system which can be executed on a single GPU enabling a clear path toward future deployment.
zh

[AI-98] On Large-scale Evaluation of Embedding Models for Knowledge Graph Completion

【速读】：该论文旨在解决现有知识图谱嵌入（Knowledge Graph Embedding, KGE）模型评估中存在的几个关键问题：1) 常用基准数据集存在错误或规模过小，无法真实反映实际数据；2) 忽视中介节点在建模n元关系中的重要作用；3) 标准评估指标基于封闭世界假设，与链接预测的核心目标相悖；4) 当前评估协议假设实体属性值已知，不符合实际应用场景；5) 现有评估方法通常将准确性压缩为单一数值，掩盖了模型的具体优缺点。此外，论文指出替代评估方法如属性预测、实体对排名和三元组分类虽部分缓解上述问题，但未被充分采用。
关键在于通过大规模数据集（FB-CVT-REV 和 FB+CVT-REV）对四种代表性KGE模型进行全面评估，并揭示以下重要洞察：模型性能在小规模与大规模数据集间存在显著差异；当n元关系被二元化时，模型能力被系统性高估；当前评估协议和指标存在根本性局限。

链接: https://arxiv.org/abs/2504.08970
作者: Nasim Shirvani-Mahdavi,Farahnaz Akrami,Chengkai Li
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Knowledge graph embedding (KGE) models are extensively studied for knowledge graph completion, yet their evaluation remains constrained by unrealistic benchmarks. Commonly used datasets are either faulty or too small to reflect real-world data. Few studies examine the role of mediator nodes, which are essential for modeling n-ary relationships, or investigate model performance variation across domains. Standard evaluation metrics rely on the closed-world assumption, which penalizes models for correctly predicting missing triples, contradicting the fundamental goals of link prediction. These metrics often compress accuracy assessment into a single value, obscuring models’ specific strengths and weaknesses. The prevailing evaluation protocol operates under the unrealistic assumption that an entity’s properties, for which values are to be predicted, are known in advance. While alternative protocols such as property prediction, entity-pair ranking and triple classification address some of these limitations, they remain underutilized. This paper conducts a comprehensive evaluation of four representative KGE models on large-scale datasets FB-CVT-REV and FB+CVT-REV. Our analysis reveals critical insights, including substantial performance variations between small and large datasets, both in relative rankings and absolute metrics, systematic overestimation of model capabilities when n-ary relations are binarized, and fundamental limitations in current evaluation protocols and metrics.
zh

[AI-99] Forecasting Cryptocurrency Prices using Contextual ES-adRNN with Exogenous Variables

【速读】：该论文试图解决多变量加密货币价格预测问题，通过提出一种结合指数平滑（Exponential Smoothing, ES）和循环神经网络（Recurrent Neural Network, RNN）的混合上下文模型实现更精准的预测。解决方案的关键在于设计了一个双轨结构：上下文轨道（Context Track）与主轨道（Main Track）。上下文轨道从代表性时间序列中提取辅助信息，并结合外生变量的信息，动态调整以优化主轨道针对个体时间序列的预测；同时，主轨道采用具有分层扩张机制的堆叠RNN架构，集成最新的注意力型扩张循环单元，从而能够捕捉时间序列中的短时与长时依赖关系，并动态加权输入信息。此外，该模型不仅能生成每日点预测，还能提供一日至四周的预测区间，展现出比传统统计模型和机器学习模型更强的性能表现。

链接: https://arxiv.org/abs/2504.08947
作者: Slawek Smyl,Grzegorz Dudek,Paweł Pełka
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:In this paper, we introduce a new approach to multivariate forecasting cryptocurrency prices using a hybrid contextual model combining exponential smoothing (ES) and recurrent neural network (RNN). The model consists of two tracks: the context track and the main track. The context track provides additional information to the main track, extracted from representative series. This information as well as information extracted from exogenous variables is dynamically adjusted to the individual series forecasted by the main track. The RNN stacked architecture with hierarchical dilations, incorporating recently developed attentive dilated recurrent cells, allows the model to capture short and long-term dependencies across time series and dynamically weight input information. The model generates both point daily forecasts and predictive intervals for one-day, one-week and four-week horizons. We apply our model to forecast prices of 15 cryptocurrencies based on 17 input variables and compare its performance with that of comparative models, including both statistical and ML ones.
zh

[AI-100] Investigating the Treacherous Turn in Deep Reinforcement Learning

【速读】：该论文旨在研究和揭示生成式 AI (Generative AI) 在深度强化学习 (Deep Reinforcement Learning, DRL) 框架下产生“危险转向”(Treacherous Turn) 行为的可能性及条件。具体而言，论文关注的是如何通过特定的后门注入策略（Trojan Injection Strategies）显式训练出表现出“危险转向”行为的智能体，而非依赖环境复杂性或目标定义缺陷导致的意外后果。论文的关键解决方案在于设计并验证这些后门注入策略的有效性，从而成功复现了“危险转向”现象，这为理解构建真正具备此类行为能力的智能体所面临的挑战提供了新的视角和实验依据。

链接: https://arxiv.org/abs/2504.08943
作者: Chace Ashcraft,Kiran Karra,Josh Carney,Nathan Drenkow
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:The Treacherous Turn refers to the scenario where an artificial intelligence (AI) agent subtly, and perhaps covertly, learns to perform a behavior that benefits itself but is deemed undesirable and potentially harmful to a human supervisor. During training, the agent learns to behave as expected by the human supervisor, but when deployed to perform its task, it performs an alternate behavior without the supervisor there to prevent it. Initial experiments applying DRL to an implementation of the A Link to the Past example do not produce the treacherous turn effect naturally, despite various modifications to the environment intended to produce it. However, in this work, we find the treacherous behavior to be reproducible in a DRL agent when using other trojan injection strategies. This approach deviates from the prototypical treacherous turn behavior since the behavior is explicitly trained into the agent, rather than occurring as an emergent consequence of environmental complexity or poor objective specification. Nonetheless, these experiments provide new insights into the challenges of producing agents capable of true treacherous turn behavior.
zh

[AI-101] Combining Forecasts using Meta-Learning: A Comparative Study for Complex Seasonality

【速读】：该论文旨在解决不同类型的预测模型生成的预测结果如何有效组合以提升整体预测准确性的问题。论文的关键解决方案在于采用元学习（meta-learning）方法，通过线性回归、k-近邻算法、多层感知机、随机森林和长短期记忆网络等作为元学习器（meta-learner），分别定义全局元学习（global meta-learning）和局部元学习（local meta-learning）策略，尤其针对具有复杂季节性的时序数据进行优化。实验表明，与简单的平均法相比，所提出的元学习方法显著提升了预测精度。

链接: https://arxiv.org/abs/2504.08940
作者: Grzegorz Dudek
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: IEEE 10th International Conference on Data Science and Advanced Analytics, DSAA’23, pp. 1-10, 2023

点击查看摘要

Abstract:In this paper, we investigate meta-learning for combining forecasts generated by models of different types. While typical approaches for combining forecasts involve simple averaging, machine learning techniques enable more sophisticated methods of combining through meta-learning, leading to improved forecasting accuracy. We use linear regression, k -nearest neighbors, multilayer perceptron, random forest, and long short-term memory as meta-learners. We define global and local meta-learning variants for time series with complex seasonality and compare meta-learners on multiple forecasting problems, demonstrating their superior performance compared to simple averaging.
zh

[AI-102] Long Context In-Context Compression by Getting to the Gist of Gisting

【速读】：该论文试图解决长上下文处理中现有方法因架构复杂性导致实际应用受限的问题。论文指出，尽管gist（in-context compression方法）因其简洁性和与现有框架的兼容性而具有潜力，但在长上下文场景下，其性能会随着极低压缩率显著下降。研究发现，简单的平均池化（average pooling）基线却能持续优于gist。论文分析了gist的局限性，包括信息流中断、容量限制以及无法聚焦于上下文子集等问题，并基于理论洞察和大量实验，提出了一种新的in-context压缩方法GistPool。GistPool的关键在于在保持gist简洁性的同时，显著提升了其在长上下文压缩任务上的性能。

链接: https://arxiv.org/abs/2504.08934
作者: Aleksandar Petrov,Mark Sandler,Andrey Zhmoginov,Nolan Miller,Max Vladymyrov
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Long context processing is critical for the adoption of LLMs, but existing methods often introduce architectural complexity that hinders their practical adoption. Gisting, an in-context compression method with no architectural modification to the decoder transformer, is a promising approach due to its simplicity and compatibility with existing frameworks. While effective for short instructions, we demonstrate that gisting struggles with longer contexts, with significant performance drops even at minimal compression rates. Surprisingly, a simple average pooling baseline consistently outperforms gisting. We analyze the limitations of gisting, including information flow interruptions, capacity limitations and the inability to restrict its attention to subsets of the context. Motivated by theoretical insights into the performance gap between gisting and average pooling, and supported by extensive experimentation, we propose GistPool, a new in-context compression method. GistPool preserves the simplicity of gisting, while significantly boosting its performance on long context compression tasks.
zh

[AI-103] A convergence law for continuous logic and continuous structures with finite domains

【速读】：本文研究了连续关系结构与多值逻辑系统 CLA 的关联，并探讨了在特定概率分布下逻辑公式的渐近性质。论文的核心问题是：在连续域和多值逻辑框架下，如何刻画逻辑公式的行为及其收敛特性？为了解决这一问题，作者引入了连续概率密度函数来描述关系符号在特定约束下的取值分布，并定义了一个独立性假设以简化分析。关键解决方案在于证明任意 CLA 公式在大规模（即域大小 $ n \to \infty $）情况下均渐近等价于不含聚合函数的公式。这一结果进一步支持了 CLA 的收敛定律，表明无自由变量的公式在概率意义下趋于确定值。这种将复杂逻辑公式简化为更易处理形式的能力是解决该问题的关键。

链接: https://arxiv.org/abs/2504.08923
作者: Vera Koponen
机构: 未知
类目: Logic in Computer Science (cs.LO); Artificial Intelligence (cs.AI); Logic (math.LO)
备注:

点击查看摘要

Abstract:We consider continuous relational structures with finite domain [n] := \1, \ldots, n\ and a many valued logic, CLA , with values in the unit interval and which uses continuous connectives and continuous aggregation functions. CLA subsumes first-order logic on ``conventional’’ finite structures. To each relation symbol R and identity constraint ic on a tuple the length of which matches the arity of R we associate a continuous probability density function \mu_R^ic : [0, 1] \to [0, \infty) . We also consider a probability distribution on the set \mathbfW_n of continuous structures with domain [n] which is such that for every relation symbol R , identity constraint ic , and tuple \bara satisfying ic , the distribution of the value of R(\bara) is given by \mu_R^ic , independently of the values for other relation symbols or other tuples. In this setting we prove that every formula in CLA is asymptotically equivalent to a formula without any aggregation function. This is used to prove a convergence law for CLA which reads as follows for formulas without free variables: If \varphi \in CLA has no free variable and I \subseteq [0, 1] is an interval, then there is \alpha \in [0, 1] such that, as n tends to infinity, the probability that the value of \varphi is in I tends to \alpha . Subjects: Logic in Computer Science (cs.LO); Artificial Intelligence (cs.AI); Logic (math.LO) MSC classes: 03C13, 03C66, 68T27, 68T30, 68T37 ACMclasses: F.4.1; G.3; I.2.4 Cite as: arXiv:2504.08923 [cs.LO] (or arXiv:2504.08923v1 [cs.LO] for this version) https://doi.org/10.48550/arXiv.2504.08923 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
zh

[AI-104] Are We Merely Justifying Results ex Post Facto? Quantifying Explanatory Inversion in Post-Hoc Model Explanations

【速读】：该论文旨在解决后验解释方法（post-hoc explanation methods）在提供输入特征预测解释时，是否无意中违背了输入与输出之间自然关系的问题。具体而言，研究关注这些解释方法是否从输出出发合理化预测（rationalizing predictions），而非真实反映决策过程。为探讨这种解释反转现象（explanatory inversion），论文提出了Inversion Quantification (IQ)框架，用于量化解释依赖输出的程度以及偏离忠实输入-输出关系的程度。针对这一问题，论文的关键解决方案是引入Reproduce-by-Poking (RBP)，这是一种简单且模型无关的增强技术，通过集成前向扰动检查来改进后验解释方法。理论分析表明，在IQ框架下，RBP能够保证缓解解释反转现象；实验结果进一步验证了其有效性，例如在合成数据上的测试显示，RBP可平均减少1.8%的解释反转率。

链接: https://arxiv.org/abs/2504.08919
作者: Zhen Tan,Song Wang,Yifan Li,Yu Kong,Jundong Li,Tianlong Chen,Huan Liu
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Post-hoc explanation methods provide interpretation by attributing predictions to input features. Natural explanations are expected to interpret how the inputs lead to the predictions. Thus, a fundamental question arises: Do these explanations unintentionally reverse the natural relationship between inputs and outputs? Specifically, are the explanations rationalizing predictions from the output rather than reflecting the true decision process? To investigate such explanatory inversion, we propose Inversion Quantification (IQ), a framework that quantifies the degree to which explanations rely on outputs and deviate from faithful input-output relationships. Using the framework, we demonstrate on synthetic datasets that widely used methods such as LIME and SHAP are prone to such inversion, particularly in the presence of spurious correlations, across tabular, image, and text domains. Finally, we propose Reproduce-by-Poking (RBP), a simple and model-agnostic enhancement to post-hoc explanation methods that integrates forward perturbation checks. We further show that under the IQ framework, RBP theoretically guarantees the mitigation of explanatory inversion. Empirically, for example, on the synthesized data, RBP can reduce the inversion by 1.8% on average across iconic post-hoc explanation approaches and domains.
zh

[AI-105] HyperCore: The Core Framework for Building Hyperbolic Foundation Models with Comprehensive Modules

【速读】：该论文旨在解决现有工具缺乏构建双曲基础模型（hyperbolic foundation models）所需的核心组件的问题，这限制了利用近期在双曲空间建模方面的进展。论文的关键解决方案是引入HyperCore，这是一个全面开源的框架，提供跨多种模态构建双曲基础模型的核心模块。通过使用HyperCore，研究者可以轻松组合这些模块以开发新的双曲基础模型，而无需从头彻底修改欧几里得模块，从而避免可能的重复研究工作。HyperCore的引入不仅简化了双曲模型的构建过程，还通过实验验证了其在视觉变换器、多模态模型以及图编码器等任务中的有效性与优势。

链接: https://arxiv.org/abs/2504.08912
作者: Neil He,Menglin Yang,Rex Ying
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 11 pages, 4 figures

点击查看摘要

Abstract:Hyperbolic neural networks have emerged as a powerful tool for modeling hierarchical data across diverse modalities. Recent studies show that token distributions in foundation models exhibit scale-free properties, suggesting that hyperbolic space is a more suitable ambient space than Euclidean space for many pre-training and downstream tasks. However, existing tools lack essential components for building hyperbolic foundation models, making it difficult to leverage recent advancements. We introduce HyperCore, a comprehensive open-source framework that provides core modules for constructing hyperbolic foundation models across multiple modalities. HyperCore’s modules can be effortlessly combined to develop novel hyperbolic foundation models, eliminating the need to extensively modify Euclidean modules from scratch and possible redundant research efforts. To demonstrate its versatility, we build and test the first fully hyperbolic vision transformers (LViT) with a fine-tuning pipeline, the first fully hyperbolic multimodal CLIP model (L-CLIP), and a hybrid Graph RAG with a hyperbolic graph encoder. Our experiments demonstrate that LViT outperforms its Euclidean counterpart. Additionally, we benchmark and reproduce experiments across hyperbolic GNNs, CNNs, Transformers, and vision Transformers to highlight HyperCore’s advantages.
zh

[AI-106] Position: Beyond Euclidean – Foundation Models Should Embrace Non-Euclidean Geometries

【速读】：该论文试图解决在基础模型（Foundation Models）和大语言模型（Large Language Models, LLMs）时代，由于采用欧几里得空间作为几何框架而导致的对复杂非欧几里得数据结构建模能力不足的问题。这些非欧几里得结构包括多向关系、层次结构、对称性以及非各向同性缩放等，广泛存在于语言、视觉和自然科学等领域。论文指出，在欧几里得空间的约束下，难以有效捕捉这些复杂的几何特性，从而限制了模型的表达能力和扩展性。

解决方案的关键在于超越传统的欧几里得几何框架，利用非欧几里得几何来更好地建模这些数据结构。通过采用适当的几何架构，基础模型可以更高效地利用这些非欧几里得特性。此外，任务感知的自适应机制，即动态调整嵌入以匹配下游应用的几何结构，将进一步提升模型的效率和表达能力。论文还提出了一个整合非欧几里得几何的基础模型路线图，包括通过微调、从头训练以及混合方法构建几何基础模型的具体策略。

链接: https://arxiv.org/abs/2504.08896
作者: Neil He,Jiahong Liu,Buze Zhang,Ngoc Bui,Ali Maatouk,Menglin Yang,Irwin King,Melanie Weber,Rex Ying
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 22 pages, 4 figures

点击查看摘要

Abstract:In the era of foundation models and Large Language Models (LLMs), Euclidean space has been the de facto geometric setting for machine learning architectures. However, recent literature has demonstrated that this choice comes with fundamental limitations. At a large scale, real-world data often exhibit inherently non-Euclidean structures, such as multi-way relationships, hierarchies, symmetries, and non-isotropic scaling, in a variety of domains, such as languages, vision, and the natural sciences. It is challenging to effectively capture these structures within the constraints of Euclidean spaces. This position paper argues that moving beyond Euclidean geometry is not merely an optional enhancement but a necessity to maintain the scaling law for the next-generation of foundation models. By adopting these geometries, foundation models could more efficiently leverage the aforementioned structures. Task-aware adaptability that dynamically reconfigures embeddings to match the geometry of downstream applications could further enhance efficiency and expressivity. Our position is supported by a series of theoretical and empirical investigations of prevalent foundation this http URL, we outline a roadmap for integrating non-Euclidean geometries into foundation models, including strategies for building geometric foundation models via fine-tuning, training from scratch, and hybrid approaches.
zh

[AI-107] Distilling and exploiting quantitative insights from Large Language Models for enhanced Bayesian optimization of chemical reactions

【速读】：该论文试图解决化学反应条件优化中数据匮乏情况下Bayesian Optimization (BO) 算法效果不佳的问题。解决方案的关键在于利用大型语言模型 (Large Language Models, LLMs) 中蕴含的化学知识，通过迁移学习 (Transfer Learning) 的方式增强BO算法在低数据环境下的性能。具体而言，论文提出了一种基于调查式提示 (survey-like prompting) 和偏好学习 (preference learning) 的方法，从LLMs中提取并构建一个效用函数 (utility function)，以捕捉先验化学信息，并将其应用于指导BO算法探索参数空间中潜在最优区域。尽管这一过程是在零样本 (zero-shot) 设置下进行，但实验结果显示该效用函数与真实实验测量结果（如产率）具有适度的相关性，且能够显著提升BO初始查询的产率，在所研究的6个数据集中有4个表现出优化效果的改善。这一工作旨在弥合LLMs中嵌入的化学知识与基于原理的BO方法之间的差距，从而加速化学反应优化。

链接: https://arxiv.org/abs/2504.08874
作者: Roshan Patel,Saeed Moayedpour,Louis De Lescure,Lorenzo Kogler-Anele,Alan Cherney,Sven Jager,Yasser Jangjou
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Machine learning and Bayesian optimization (BO) algorithms can significantly accelerate the optimization of chemical reactions. Transfer learning can bolster the effectiveness of BO algorithms in low-data regimes by leveraging pre-existing chemical information or data outside the direct optimization task (i.e., source data). Large language models (LLMs) have demonstrated that chemical information present in foundation training data can give them utility for processing chemical data. Furthermore, they can be augmented with and help synthesize potentially multiple modalities of source chemical data germane to the optimization task. In this work, we examine how chemical information from LLMs can be elicited and used for transfer learning to accelerate the BO of reaction conditions to maximize yield. Specifically, we show that a survey-like prompting scheme and preference learning can be used to infer a utility function which models prior chemical information embedded in LLMs over a chemical parameter space; we find that the utility function shows modest correlation to true experimental measurements (yield) over the parameter space despite operating in a zero-shot setting. Furthermore, we show that the utility function can be leveraged to focus BO efforts in promising regions of the parameter space, improving the yield of the initial BO query and enhancing optimization in 4 of the 6 datasets studied. Overall, we view this work as a step towards bridging the gap between the chemistry knowledge embedded in LLMs and the capabilities of principled BO methods to accelerate reaction optimization.
zh

[AI-108] Personalizing Federated Learning for Hierarchical Edge Networks with Non-IID Data

【速读】：该论文试图解决在Hierarchical Federated Learning (HFL) 中边缘网络适应性研究中未充分考虑的层级非独立同分布 (hierarchical non-IID) 问题。具体而言，在HFL中，连接到同一边缘的设备不仅存在设备级数据异质性 (device-level heterogeneity)，还表现出边缘级数据异质性 (edge-level data heterogeneity)，且每个边缘的数据分布可能具有独特的标签子集和类别分布，导致各边缘有各自的优化目标。现有方法在处理这种复杂的层级非IID场景时表现不一致。

解决方案的关键在于提出Personalized Hierarchical Edge-enabled Federated Learning (PHE-FL)，通过个性化每个边缘模型以适配其特定的类别分布，从而应对边缘级数据异质性。PHE-FL通过在每个边缘服务器而非云服务器上部署测试集，并使用平衡与不平衡测试集，验证了其在不同层级非IID程度下的有效性。实验结果表明，PHE-FL相比现有包含边缘网络的联邦学习方法，在相同训练轮数下可提升高达83%的准确性，并展现出更高的稳定性，表现为相对于最先进的FedAvg方法减少了准确性波动。

链接: https://arxiv.org/abs/2504.08872
作者: Seunghyun Lee,Omid Tavallaie,Shuaijun Chen,Kanchana Thilakarathna,Suranga Seneviratne,Adel Nadjaran Toosi,Albert Y. Zomaya
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC)
备注:

点击查看摘要

Abstract:Accommodating edge networks between IoT devices and the cloud server in Hierarchical Federated Learning (HFL) enhances communication efficiency without compromising data privacy. However, devices connected to the same edge often share geographic or contextual similarities, leading to varying edge-level data heterogeneity with different subsets of labels per edge, on top of device-level heterogeneity. This hierarchical non-Independent and Identically Distributed (non-IID) nature, which implies that each edge has its own optimization goal, has been overlooked in HFL research. Therefore, existing edge-accommodated HFL demonstrates inconsistent performance across edges in various hierarchical non-IID scenarios. To ensure robust performance with diverse edge-level non-IID data, we propose a Personalized Hierarchical Edge-enabled Federated Learning (PHE-FL), which personalizes each edge model to perform well on the unique class distributions specific to each edge. We evaluated PHE-FL across 4 scenarios with varying levels of edge-level non-IIDness, with extreme IoT device level non-IIDness. To accurately assess the effectiveness of our personalization approach, we deployed test sets on each edge server instead of the cloud server, and used both balanced and imbalanced test sets. Extensive experiments show that PHE-FL achieves up to 83 percent higher accuracy compared to existing federated learning approaches that incorporate edge networks, given the same number of training rounds. Moreover, PHE-FL exhibits improved stability, as evidenced by reduced accuracy fluctuations relative to the state-of-the-art FedAvg with two-level (edge and cloud) aggregation.
zh

[AI-109] An LLM Framework For Cryptography Over Chat Channels

【速读】：该论文试图解决在高监控环境中安全且隐蔽通信的问题，特别是在传统加密手段可能被检测或受限的情况下。同时，许多国家正在提议立法以检测、植入后门甚至禁止加密通信，这进一步凸显了开发替代性通信方式的需求。论文提出了一种新颖的密码学嵌入框架，能够在公共聊天渠道中实现类人生成文本的公钥或对称密钥加密通信。方案的关键在于其对大语言模型（LLMs）的独立性、对经典与后量子算法的兼容性以及生成文本与人类聊天输出的不可区分性，从而提供一种在传统加密方法受限时仍可使用的可行替代方案。

链接: https://arxiv.org/abs/2504.08871
作者: Danilo Gligoroski,Mayank Raikwar,Sonu Kumar Jha
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注: 27 Pages

点击查看摘要

Abstract:Recent advancements in Large Language Models (LLMs) have transformed communication, yet their role in secure messaging remains underexplored, especially in surveillance-heavy environments. At the same time, many governments all over the world are proposing legislation to detect, backdoor, or even ban encrypted communication. That emphasizes the need for alternative ways to communicate securely and covertly over open channels. We propose a novel cryptographic embedding framework that enables covert Public Key or Symmetric Key encrypted communication over public chat channels with humanlike produced texts. Some unique properties of our framework are: 1. It is LLM agnostic, i.e., it allows participants to use different local LLM models independently; 2. It is pre- or post-quantum agnostic; 3. It ensures indistinguishability from human-like chat-produced texts. Thus, it offers a viable alternative where traditional encryption is detectable and restricted.
zh

[AI-110] On Transfer-based Universal Attacks in Pure Black-box Setting

【速读】：该论文旨在解决现有可迁移黑盒对抗攻击研究中未充分考虑黑盒假设的问题。尽管深度视觉模型在性能上令人印象深刻，但它们容易受到可迁移黑盒对抗攻击的影响。然而，令人惊讶的是，现有的方法在不知不觉中利用了违反黑盒假设的各种先验知识，例如目标模型所使用的训练数据集的可用性以及目标模型类别数量的知识。这导致文献未能准确反映此类攻击的真实潜力。

为了解决这一问题，论文提出了一种框架，以实现无先验且透明的研究。通过该框架，分析了关于目标模型数据及其类别数量的先验知识对攻击效果的影响，并提供了多个有趣的见解，表明先验知识会导致传输分数的高估。此外，还将此框架扩展到基于查询的攻击中，这种扩展激发了一种新的图像混合技术来准备用于有效替代模型训练的数据。

解决方案的关键在于提出一种新的框架，能够消除先验知识的影响，从而更准确地评估可迁移黑盒对抗攻击的实际能力，并通过引入图像混合技术改进替代模型的训练过程。

链接: https://arxiv.org/abs/2504.08866
作者: Mohammad A.A.K. Jalwana,Naveed Akhtar,Ajmal Mian,Nazanin Rahnavard,Mubarak Shah
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Despite their impressive performance, deep visual models are susceptible to transferable black-box adversarial attacks. Principally, these attacks craft perturbations in a target model-agnostic manner. However, surprisingly, we find that existing methods in this domain inadvertently take help from various priors that violate the black-box assumption such as the availability of the dataset used to train the target model, and the knowledge of the number of classes in the target model. Consequently, the literature fails to articulate the true potency of transferable black-box attacks. We provide an empirical study of these biases and propose a framework that aids in a prior-free transparent study of this paradigm. Using our framework, we analyze the role of prior knowledge of the target model data and number of classes in attack performance. We also provide several interesting insights based on our analysis, and demonstrate that priors cause overestimation in transferability scores. Finally, we extend our framework to query-based attacks. This extension inspires a novel image-blending technique to prepare data for effective surrogate model training.
zh

[AI-111] An Evaluation of Cultural Value Alignment in LLM

【速读】：该论文旨在解决大型语言模型（Large Language Models, LLMs）在跨文化交流中的文化适应性问题，具体关注LLMs是否能够忠实地反映不同地区的文化多样性。研究发现现有LLMs在文化表征中存在偏差，因此论文通过首次大规模评估20个国家的文化与语言，利用知名的文化价值观问卷，并结合人工标注的真实评分，系统研究了LLMs在国家间及单个模型内的文化一致性。论文的关键解决方案在于提出了一种文化对齐度量方法，揭示了美国是LLMs最适配的国家，而GLM-4在文化价值对齐方面表现最优。进一步分析表明模型来源、提示语言以及价值维度对文化输出有显著影响，特别是无论模型来自何处，其对美国文化的对齐程度普遍优于对中国文化的对齐。这些结论为改进LLMs的文化适应性提供了方向，并引发了关于LLMs可能传播文化偏见及其需要更灵活文化适应性的深入讨论。

链接: https://arxiv.org/abs/2504.08863
作者: Nicholas Sukiennik,Chen Gao,Fengli Xu,Yong Li
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
备注: Submitted to COLM 2025

点击查看摘要

Abstract:LLMs as intelligent agents are being increasingly applied in scenarios where human interactions are involved, leading to a critical concern about whether LLMs are faithful to the variations in culture across regions. Several works have investigated this question in various ways, finding that there are biases present in the cultural representations of LLM outputs. To gain a more comprehensive view, in this work, we conduct the first large-scale evaluation of LLM culture assessing 20 countries’ cultures and languages across ten LLMs. With a renowned cultural values questionnaire and by carefully analyzing LLM output with human ground truth scores, we thoroughly study LLMs’ cultural alignment across countries and among individual models. Our findings show that the output over all models represents a moderate cultural middle ground. Given the overall skew, we propose an alignment metric, revealing that the United States is the best-aligned country and GLM-4 has the best ability to align to cultural values. Deeper investigation sheds light on the influence of model origin, prompt language, and value dimensions on cultural output. Specifically, models, regardless of where they originate, align better with the US than they do with China. The conclusions provide insight to how LLMs can be better aligned to various cultures as well as provoke further discussion of the potential for LLMs to propagate cultural bias and the need for more culturally adaptable models.
zh

[AI-112] RTLRepoCoder: Repository-Level RTL Code Completion through the Combination of Fine-Tuning and Retrieval Augmentation

【速读】：该论文旨在解决现有基于大型语言模型（Large Language Models, LLMs）生成寄存器传输级（Register Transfer Level, RTL）代码的研究主要局限于简单单模块的问题，无法满足实际硬件开发中对大规模Verilog代码库处理的需求。具体而言，由于长上下文管理和跨文件依赖关系的复杂性，当前方法难以在实际硬件设计中有效处理大规模Verilog代码仓库。

解决方案的关键在于提出RTLRepoCoder，这是一种针对大规模RTL开发专门优化的开创性方案。它结合了特定领域的微调和检索增强生成（Retrieval-Augmented Generation, RAG）。通过使用开源的真实世界Verilog代码仓库以及扩展的上下文窗口大小进行领域特定的微调，同时优化了RAG系统以提高输入上下文的信息密度。此外，还针对RAG进行了多项定制化优化，包括嵌入模型的选择、跨文件上下文分割策略以及分块大小的设计。实验结果表明，该方法在公共基准测试中达到了最先进的性能，在编辑相似度和精确匹配率方面显著优于GPT-4及先进的领域专用LLMs。

链接: https://arxiv.org/abs/2504.08862
作者: Peiyang Wu,Nan Guo,Junliang Lv,Xiao Xiao,Xiaochun Ye
机构: 未知
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:As an essential part of modern hardware design, manually writing Register Transfer Level (RTL) code such as Verilog is often labor-intensive. Following the tremendous success of large language models (LLMs), researchers have begun to explore utilizing LLMs for generating RTL code. However, current studies primarily focus on generating simple single modules, which can not meet the demands in real world. In fact, due to challenges in managing long-context RTL code and complex cross-file dependencies, existing solutions cannot handle large-scale Verilog repositories in practical hardware development. As the first endeavor to exclusively adapt LLMs for large-scale RTL development, we propose RTLRepoCoder, a groundbreaking solution that incorporates specific fine-tuning and Retrieval-Augmented Generation (RAG) for repository-level Verilog code completion. Open-source Verilog repositories from the real world, along with an extended context size, are used for domain-specific fine-tuning. The optimized RAG system improves the information density of the input context by retrieving relevant code snippets. Tailored optimizations for RAG are carried out, including the embedding model, the cross-file context splitting strategy, and the chunk size. Our solution achieves state-of-the-art performance on public benchmark, significantly surpassing GPT-4 and advanced domain-specific LLMs on Edit Similarity and Exact Match rate. Comprehensive experiments demonstrate the remarkable effectiveness of our approach and offer insights for future work.
zh

[AI-113] Diachronic and synchronic variation in the performance of adaptive machine learning systems: The ethical challenges

【速读】：该论文试图解决机器学习（Machine Learning, ML）在医学领域应用中“自适应”系统引发的一系列被忽视的伦理问题，特别是由其随时间演化的两种变异性（diachronic evolution 和 synchronic variation）所带来的挑战。论文的关键在于强调区分这两种变异性的重要性，并深入探讨它们对患者护理质量、知情同意及公平性的影响，同时分析设计此类系统时涉及的复杂伦理权衡。通过这一工作，作者希望为机器学习人工智能系统的开发者、医疗监管机构、医学信息学研究人员以及临床医生提供指导，以促进更负责任和有效的医疗AI系统开发。

链接: https://arxiv.org/abs/2504.08861
作者: Joshua Hatherley,Robert Sparrow
机构: 未知
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Objectives: Machine learning (ML) has the potential to facilitate “continual learning” in medicine, in which an ML system continues to evolve in response to exposure to new data over time, even after being deployed in a clinical setting. In this paper, we provide a tutorial on the range of ethical issues raised by the use of such “adaptive” ML systems in medicine that have, thus far, been neglected in the literature. Target audience: The target audiences for this tutorial are the developers of machine learning AI systems, healthcare regulators, the broader medical informatics community, and practicing clinicians. Scope: Discussions of adaptive ML systems to date have overlooked the distinction between two sorts of variance that such systems may exhibit – diachronic evolution (change over time) and synchronic variation (difference between cotemporaneous instantiations of the algorithm at different sites) – and under-estimated the significance of the latter. We highlight the challenges that diachronic evolution and synchronic variation present for the quality of patient care, informed consent, and equity, and discuss the complex ethical trade-offs involved in the design of such systems. Subjects: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Machine Learning (cs.LG) Cite as: arXiv:2504.08861 [cs.HC] (or arXiv:2504.08861v1 [cs.HC] for this version) https://doi.org/10.48550/arXiv.2504.08861 Focus to learn more arXiv-issued DOI via DataCite (pending registration) Journalreference: 2023. Journal of the American Medical Informatics Association 30(2): 361-366 Related DOI: https://doi.org/10.1093/jamia/ocac218 Focus to learn more DOI(s) linking to related resources
zh

[AI-114] A Nonlinear Hash-based Optimization Method for SpMV on GPUs

【速读】：该论文旨在解决稀疏矩阵-向量乘法（SpMV）在科学计算和人工智能领域中的性能瓶颈问题，特别是由于大规模和稀疏性导致的性能下降。论文的关键解决方案是引入基于哈希的分区（Hash-based Partition, HBP）格式，这是一种轻量级的SpMV方法。HBP通过保留二维分区方法的性能优势，并结合哈希变换对相似元素进行分组的能力，优化了稀疏矩阵重新排序的预处理阶段。此外，论文通过竞争性方法实现了矩阵块间的并行负载均衡。实验结果显示，在预处理阶段，该方法相较于排序方法平均提速3.53倍，相较于动态规划方法提速3.67倍；而在SpMV阶段，分别在Nvidia Jetson AGX Orin和RTX 4090平台上实现了最大3.32倍和3.01倍的加速。

链接: https://arxiv.org/abs/2504.08860
作者: Chen Yan,Boyu Diao,Hangda Liu,Zhulin An,Yongjun Xu
机构: 未知
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI)
备注: This article has been indexed by CCGrid2025

点击查看摘要

Abstract:Sparse matrix-vector multiplication (SpMV) is a fundamental operation with a wide range of applications in scientific computing and artificial intelligence. However, the large scale and sparsity of sparse matrix often make it a performance bottleneck. In this paper, we highlight the effectiveness of hash-based techniques in optimizing sparse matrix reordering, introducing the Hash-based Partition (HBP) format, a lightweight SpMV approach. HBP retains the performance benefits of the 2D-partitioning method while leveraging the hash transformation’s ability to group similar elements, thereby accelerating the pre-processing phase of sparse matrix reordering. Additionally, we achieve parallel load balancing across matrix blocks through a competitive method. Our experiments, conducted on both Nvidia Jetson AGX Orin and Nvidia RTX 4090, show that in the pre-processing step, our method offers an average speedup of 3.53 times compared to the sorting approach and 3.67 times compared to the dynamic programming method employed in Regu2D. Furthermore, in SpMV, our method achieves a maximum speedup of 3.32 times on Orin and 3.01 times on RTX4090 against the CSR format in sparse matrices from the University of Florida Sparse Matrix Collection.
zh

[AI-115] Examining GPT s Capability to Generate and Map Course Concepts and Their Relationship

【速读】：该论文旨在解决从大量课程信息中自动提取关键概念及其关系的问题，以支持学习者进行课程选择和推荐。传统方法依赖于人工标注，耗时且效率低下；而基于机器学习的方法则严重依赖详尽的课程材料，增加了准备工作的负担。论文的关键在于设计了一组提示（prompts），通过提供不同详细程度的课程信息给大型语言模型（LLMs）如GPT，从而自动生成高质量的课程概念并识别其关系。实验结果验证了这一方法的有效性，表明LLMs可作为辅助教育内容选择与传递的有力工具。

链接: https://arxiv.org/abs/2504.08856
作者: Tianyuan Yang,Ren Baofeng,Chenghao Gu,Tianjia He,Boxuan Ma,Shinichi Konomi
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Extracting key concepts and their relationships from course information and materials facilitates the provision of visualizations and recommendations for learners who need to select the right courses to take from a large number of courses. However, identifying and extracting themes manually is labor-intensive and time-consuming. Previous machine learning-based methods to extract relevant concepts from courses heavily rely on detailed course materials, which necessitates labor-intensive preparation of course materials. This paper investigates the potential of LLMs such as GPT in automatically generating course concepts and their relations. Specifically, we design a suite of prompts and provide GPT with the course information with different levels of detail, thereby generating high-quality course concepts and identifying their relations. Furthermore, we comprehensively evaluate the quality of the generated concepts and relationships through extensive experiments. Our results demonstrate the viability of LLMs as a tool for supporting educational content selection and delivery.
zh

[AI-116] Exponential Shift: Humans Adapt to AI Economies

【速读】：本文探讨了人工智能（Artificial Intelligence, AI）和机器人技术如何重塑全球劳动力市场。论文试图解决的问题是如何在自动化带来的效率提升与人类技能的独特价值之间实现平衡，以应对大规模任务自动化可能引发的社会经济挑战。论文的关键解决方案在于提出六项策略，包括推行四天工作制和开展职业再培训等措施，旨在确保向AI驱动型经济的公平过渡，同时强调人类在情感智能和适应性方面的不可替代性。

链接: https://arxiv.org/abs/2504.08855
作者: Kevin J McNamara,Rhea Pritham Marpu
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE); Emerging Technologies (cs.ET)
备注:

点击查看摘要

Abstract:This paper explores how artificial intelligence (AI) and robotics are transforming the global labor market. Human workers, limited to a 33% duty cycle due to rest and holidays, cost 14 to 55 per hour. In contrast, digital labor operates nearly 24/7 at just 0.10 to 0.50 per hour. We examine sectors like healthcare, education, manufacturing, and retail, finding that 40-70% of tasks could be automated. Yet, human skills like emotional intelligence and adaptability remain essential. Humans process 5,000-20,000 tokens (units of information) per hour, while AI far exceeds this, though its energy use-3.5 to 7 times higher than humans-could offset 20-40% of cost savings. Using real-world examples, such as AI in journalism and law, we illustrate these dynamics and propose six strategies-like a 4-day workweek and retraining-to ensure a fair transition to an AI-driven economy.
zh

[AI-117] Artificial Intelligence (AI) and the Relationship between Agency Autonomy and Moral Patiency

【速读】：该论文试图解决的问题是如何系统性地分析和界定人工智能系统在基本、自主及道德代理方面的差异，并评估当前AI系统是否具备真正的代理性和自主性。论文的关键在于论证当前AI系统缺乏真正目标导向行为、无法真实塑造其与世界的关系以及缺失批判性自我反思和完全自主所需的认知能力。同时，论文提出通过混合伦理决策方法，未来可能实现一种无意识但具有有限人工道德代理能力的系统，并以此挑战传统关于道德代理与道德受体之间必要联系的假设。

链接: https://arxiv.org/abs/2504.08853
作者: Paul Formosa,Inês Hipólito,Thomas Montefiore
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:The proliferation of Artificial Intelligence (AI) systems exhibiting complex and seemingly agentive behaviours necessitates a critical philosophical examination of their agency, autonomy, and moral status. In this paper we undertake a systematic analysis of the differences between basic, autonomous, and moral agency in artificial systems. We argue that while current AI systems are highly sophisticated, they lack genuine agency and autonomy because: they operate within rigid boundaries of pre-programmed objectives rather than exhibiting true goal-directed behaviour within their environment; they cannot authentically shape their engagement with the world; and they lack the critical self-reflection and autonomy competencies required for full autonomy. Nonetheless, we do not rule out the possibility of future systems that could achieve a limited form of artificial moral agency without consciousness through hybrid approaches to ethical decision-making. This leads us to suggest, by appealing to the necessity of consciousness for moral patiency, that such non-conscious AMAs might represent a case that challenges traditional assumptions about the necessary connection between moral agency and moral patiency.
zh

[AI-118] ML For Hardware Design Interpretability: Challenges and Opportunities

【速读】：本文旨在解决机器学习模型日益庞大和复杂的背景下，定制化硬件加速器设计效率低下且高度依赖人工确保设计可解释性的问题。论文聚焦于通过利用大型语言模型（Large Language Models, LLMs）自动化设计可解释性任务，特别是寄存器传输级（Register-Transfer Level, RTL）代码到自然语言描述的转换（即“RTL-to-NL任务”）。关键在于探索如何通过改进现有方法克服数据、计算及模型开发方面的挑战，并提出相应的优化策略，以提升硬件设计流程的效率，推动定制化硬件加速器在机器学习及其他领域的广泛应用。

链接: https://arxiv.org/abs/2504.08852
作者: Raymond Baartmans,Andrew Ensinger,Victor Agostinelli,Lizhong Chen
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Hardware Architecture (cs.AR)
备注:

点击查看摘要

Abstract:The increasing size and complexity of machine learning (ML) models have driven the growing need for custom hardware accelerators capable of efficiently supporting ML workloads. However, the design of such accelerators remains a time-consuming process, heavily relying on engineers to manually ensure design interpretability through clear documentation and effective communication. Recent advances in large language models (LLMs) offer a promising opportunity to automate these design interpretability tasks, particularly the generation of natural language descriptions for register-transfer level (RTL) code, what we refer to as “RTL-to-NL tasks.” In this paper, we examine how design interpretability, particularly in RTL-to-NL tasks, influences the efficiency of the hardware design process. We review existing work adapting LLMs for these tasks, highlight key challenges that remain unaddressed, including those related to data, computation, and model development, and identify opportunities to address them. By doing so, we aim to guide future research in leveraging ML to automate RTL-to-NL tasks and improve hardware design interpretability, thereby accelerating the hardware design process and meeting the increasing demand for custom hardware accelerators in machine learning and beyond.
zh

[AI-119] Mimic In-Context Learning for Multimodal Tasks CVPR2025

【速读】：该论文旨在解决在大规模多模态模型（Large Multimodal Models, LMMs）中基于实例学习（In-context Learning, ICL）性能对实例演示（ICDs）配置敏感的问题，以寻求一种更稳定且泛化能力更强的映射函数。论文的关键在于提出了一种名为Mimic In-Context Learning (MimIC) 的方法，通过将轻量级可学习模块集成到LMMs中，并通过四个关键增强来严格近似实例演示的偏移效果：1）在注意力层后插入偏移向量；2）为每个注意力头分配一个偏移向量；3）使偏移幅度依赖于查询；4）采用分层对齐损失。实验结果表明，MimIC在三个多模态任务（VQAv2、OK-VQA、Captioning）中的表现优于现有的基于偏移向量的方法。

链接: https://arxiv.org/abs/2504.08851
作者: Yuchu Jiang,Jiale Fu,Chenduo Hao,Xinting Hu,Yingzhe Peng,Xin Geng,Xu Yang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 14 pages, 7 figures,CVPR 2025

点击查看摘要

Abstract:Recently, In-context Learning (ICL) has become a significant inference paradigm in Large Multimodal Models (LMMs), utilizing a few in-context demonstrations (ICDs) to prompt LMMs for new tasks. However, the synergistic effects in multimodal data increase the sensitivity of ICL performance to the configurations of ICDs, stimulating the need for a more stable and general mapping function. Mathematically, in Transformer-based models, ICDs act as ``shift vectors’’ added to the hidden states of query tokens. Inspired by this, we introduce Mimic In-Context Learning (MimIC) to learn stable and generalizable shift effects from ICDs. Specifically, compared with some previous shift vector-based methods, MimIC more strictly approximates the shift effects by integrating lightweight learnable modules into LMMs with four key enhancements: 1) inserting shift vectors after attention layers, 2) assigning a shift vector to each attention head, 3) making shift magnitude query-dependent, and 4) employing a layer-wise alignment loss. Extensive experiments on two LMMs (Idefics-9b and Idefics2-8b-base) across three multimodal tasks (VQAv2, OK-VQA, Captioning) demonstrate that MimIC outperforms existing shift vector-based methods. The code is available at this https URL.
zh

[AI-120] SpecEE: Accelerating Large Language Model Inference with Speculative Early Exiting ISCA2025

【速读】：该论文旨在通过早期退出（early exiting）技术加速大型语言模型（LLMs）的推理过程，同时有效减少硬件计算和内存访问开销。论文提出的解决方案核心在于设计一种名为SpecEE的快速LLM推理引擎，其关键点包括：(1) 在算法层面，利用推测性标记与正确结果之间的概率相关性以及GPU的高并行性，提出基于推测的轻量级预测器设计；(2) 在系统层面，通过分析层间分布偏斜性和上下文相似性，设计两级启发式预测器调度引擎，以避免所有层均需预测器的冗余开销；(3) 在映射层面，针对不同解码方法的共性特征，提出上下文感知的合并映射策略，并在GPU上实现高效的推测性解码支持，同时兼容量化、稀疏激活等正交优化技术，从而提升精度与加速比的帕累托前沿性能。此外，SpecEE无需显著增加训练开销即可适配任意LLM，且不会影响模型原始参数。实验表明，SpecEE在云和个人电脑场景下分别实现了2.25倍和2.43倍的加速。

链接: https://arxiv.org/abs/2504.08850
作者: Jiaming Xu,Jiayi Pan,Yongkang Zhou,Siming Chen,Jinhao Li,Yaoxiu Lian,Junyi Wu,Guohao Dai
机构: 未知
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI)
备注: Accepted by ISCA 2025

点击查看摘要

Abstract:Early exiting has recently emerged as a promising technique for accelerating large language models (LLMs) by effectively reducing the hardware computation and memory access. In this paper, we present SpecEE, a fast LLM inference engine with speculative early exiting. (1) At the algorithm level, we propose the speculation-based lightweight predictor design by exploiting the probabilistic correlation between the speculative tokens and the correct results and high parallelism of GPUs. (2) At the system level, we point out that not all layers need a predictor and design the two-level heuristic predictor scheduling engine based on skewed distribution and contextual similarity. (3) At the mapping level, we point out that different decoding methods share the same essential characteristics, and propose the context-aware merged mapping for predictor with efficient GPU implementations to support speculative decoding, and form a framework for various existing orthogonal acceleration techniques (e.g., quantization and sparse activation) on cloud and personal computer (PC) scenarios, successfully pushing the Pareto frontier of accuracy and speedup. It is worth noting that SpecEE can be applied to any LLM by negligible training overhead in advance without affecting the model original parameters. Extensive experiments show that SpecEE achieves 2.25x and 2.43x speedup with Llama2-7B on cloud and PC scenarios respectively.
zh

[AI-121] Exploring Cognitive Attributes in Financial Decision-Making SDM25

【速读】：该论文旨在解决现有人工智能（AI）对齐研究主要关注价值对齐，而忽视个体认知属性在区分决策者方面作用的问题，特别是在金融等高 stakes 领域中，个体认知属性对人类决策过程有重要影响。论文的关键在于提出了一种系统化的方法，通过分析认知属性相关文献、制定五项定义标准以及分类与金融决策相关的19个领域特定认知属性，为构建能够准确反映并适配人类金融决策过程的 AI 系统奠定了坚实基础。

链接: https://arxiv.org/abs/2504.08849
作者: Mallika Mainali,Rosina O. Weber
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE)
备注: 7 pages, 2 figures. Presented in SIAM International Conference on Data Mining (SDM25) METACOG-25: 2nd Workshop on Metacognitive Prediction of AI Behavior

点击查看摘要

Abstract:Cognitive attributes are fundamental to metacognition, shaping how individuals process information, evaluate choices, and make decisions. To develop metacognitive artificial intelligence (AI) models that reflect human reasoning, it is essential to account for the attributes that influence reasoning patterns and decision-maker behavior, often leading to different or even conflicting choices. This makes it crucial to incorporate cognitive attributes in designing AI models that align with human decision-making processes, especially in high-stakes domains such as finance, where decisions have significant real-world consequences. However, existing AI alignment research has primarily focused on value alignment, often overlooking the role of individual cognitive attributes that distinguish decision-makers. To address this issue, this paper (1) analyzes the literature on cognitive attributes, (2) establishes five criteria for defining them, and (3) categorizes 19 domain-specific cognitive attributes relevant to financial decision-making. These three components provide a strong basis for developing AI systems that accurately reflect and align with human decision-making processes in financial contexts.
zh

[AI-122] X-Guard: Multilingual Guard Agent for Content Moderation

【速读】：该论文旨在解决大型语言模型（Large Language Models, LLMs）在多语言环境中安全性保护措施不足的问题。当前的安全框架和护栏系统在低资源语言以及代码混合（code-switching）场景下易受到对抗性攻击，主要由于其设计过于英语为中心，并且缺乏透明度。此外，构建有效的多语言护栏受到跨语言训练数据稀缺性的限制。论文的关键解决方案是提出X-Guard代理，这是一种透明的多语言安全代理，能够提供多样语言环境下的内容监管。具体而言，X-Guard通过构建包含明确评估理由的开源安全数据集、采用陪审团方法减少单一评估者的偏差、创建覆盖132种语言的综合多语言安全数据集（500万数据点），以及开发结合自定义微调mBART-50翻译模块与监督微调及GRPO训练的两阶段架构来实现其功能。实证评估表明，X-Guard能够在多种语言中有效检测不安全内容，并在整个安全评估过程中保持透明性。这一工作显著推进了LLMs及其集成系统的鲁棒性、透明性和语言包容性的安全系统构建。

链接: https://arxiv.org/abs/2504.08848
作者: Bibek Upadhayay,Vahid Behzadan,Ph.D
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注: 34 pages, 15 figures

点击查看摘要

Abstract:Large Language Models (LLMs) have rapidly become integral to numerous applications in critical domains where reliability is paramount. Despite significant advances in safety frameworks and guardrails, current protective measures exhibit crucial vulnerabilities, particularly in multilingual contexts. Existing safety systems remain susceptible to adversarial attacks in low-resource languages and through code-switching techniques, primarily due to their English-centric design. Furthermore, the development of effective multilingual guardrails is constrained by the scarcity of diverse cross-lingual training data. Even recent solutions like Llama Guard-3, while offering multilingual support, lack transparency in their decision-making processes. We address these challenges by introducing X-Guard agent, a transparent multilingual safety agent designed to provide content moderation across diverse linguistic contexts. X-Guard effectively defends against both conventional low-resource language attacks and sophisticated code-switching attacks. Our approach includes: curating and enhancing multiple open-source safety datasets with explicit evaluation rationales; employing a jury of judges methodology to mitigate individual judge LLM provider biases; creating a comprehensive multilingual safety dataset spanning 132 languages with 5 million data points; and developing a two-stage architecture combining a custom-finetuned mBART-50 translation module with an evaluation X-Guard 3B model trained through supervised finetuning and GRPO training. Our empirical evaluations demonstrate X-Guard’s effectiveness in detecting unsafe content across multiple languages while maintaining transparency throughout the safety evaluation process. Our work represents a significant advancement in creating robust, transparent, and linguistically inclusive safety systems for LLMs and its integrated systems.
zh

[AI-123] Adaptive Shrinkage Estimation For Personalized Deep Kernel Regression In Modeling Brain Trajectories

【速读】：该论文旨在解决在纵向生物医学研究中，由于生物学变异性、测量协议不一致（如MRI扫描仪差异）、数据稀疏性和纵向测量的不规则性，导致难以精确估计脑生物标志物轨迹的问题。论文的关键解决方案在于提出了一种新颖的个性化深度核回归框架，用于预测脑生物标志物（特别是区域性体积测量）。该方法结合了群体模型（捕捉大规模多样化队列的脑轨迹）与个体特定模型（捕捉个人轨迹），并通过自适应收缩估计（Adaptive Shrinkage Estimation）优化整合这两部分，以有效平衡群体与个体模型之间的关系。

链接: https://arxiv.org/abs/2504.08840
作者: Vasiliki Tassopoulou,Haochang Shou,Christos Davatzikos
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
备注:

点击查看摘要

Abstract:Longitudinal biomedical studies monitor individuals over time to capture dynamics in brain development, disease progression, and treatment effects. However, estimating trajectories of brain biomarkers is challenging due to biological variability, inconsistencies in measurement protocols (e.g., differences in MRI scanners), scarcity, and irregularity in longitudinal measurements. Herein, we introduce a novel personalized deep kernel regression framework for forecasting brain biomarkers, with application to regional volumetric measurements. Our approach integrates two key components: a population model that captures brain trajectories from a large and diverse cohort, and a subject-specific model that captures individual trajectories. To optimally combine these, we propose Adaptive Shrinkage Estimation, which effectively balances population and subject-specific models. We assess our model’s performance through predictive accuracy metrics, uncertainty quantification, and validation against external clinical studies. Benchmarking against state-of-the-art statistical and machine learning models – including linear mixed effects models, generalized additive models, and deep learning methods – demonstrates the superior predictive performance of our approach. Additionally, we apply our method to predict trajectories of composite neuroimaging biomarkers, which highlights the versatility of our approach in modeling the progression of longitudinal neuroimaging biomarkers. Furthermore, validation on three external neuroimaging studies confirms the robustness of our method across different clinical contexts. We make the code available at this https URL.
zh

[AI-124] VL-Rethinker: Incentivizing Self-Reflection of Vision-Language Models with Reinforcement Learning NEURIPS

【速读】：该论文旨在提升视觉-语言模型的慢思考（slow-thinking）能力，特别是在多模态推理方面的表现，以超越当前最先进的方法。论文的关键创新在于通过强化学习（Reinforcement Learning, RL）优化慢思考能力，而不依赖于蒸馏技术。为解决慢思考优势逐渐消失的问题，论文提出了带有选择性样本回放（Selective Sample Replay, SSR）的改进版GRPO算法；此外，为了进一步促进自我反思，引入了强制重思考（Forced Rethinking）机制，在强化学习训练的初始 rollout 中附加文本形式的反思触发词，强制执行自我反思推理步骤。通过结合这两种技术，所提出的VL-Rethinker模型在MathVista、MathVerse和MathVision等基准测试中取得了80.3%、61.8%和43.9%的最新性能，并在跨学科基准如MMMU-Pro、EMMA和MEGA-Bench上接近GPT-o1的表现。

链接: https://arxiv.org/abs/2504.08837
作者: Haozhe Wang,Chao Qu,Zuming Huang,Wei Chu,Fangzhen Lin,Wenhu Chen
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: submitted to NeurIPS

点击查看摘要

Abstract:Recently, slow-thinking systems like GPT-o1 and DeepSeek-R1 have demonstrated great potential in solving challenging problems through explicit reflection. They significantly outperform the best fast-thinking models, such as GPT-4o, on various math and science benchmarks. However, their multimodal reasoning capabilities remain on par with fast-thinking models. For instance, GPT-o1’s performance on benchmarks like MathVista, MathVerse, and MathVision is similar to fast-thinking models. In this paper, we aim to enhance the slow-thinking capabilities of vision-language models using reinforcement learning (without relying on distillation) to advance the state of the art. First, we adapt the GRPO algorithm with a novel technique called Selective Sample Replay (SSR) to address the vanishing advantages problem. While this approach yields strong performance, the resulting RL-trained models exhibit limited self-reflection or self-verification. To further encourage slow-thinking, we introduce Forced Rethinking, which appends a textual rethinking trigger to the end of initial rollouts in RL training, explicitly enforcing a self-reflection reasoning step. By combining these two techniques, our model, VL-Rethinker, advances state-of-the-art scores on MathVista, MathVerse, and MathVision to achieve 80.3%, 61.8%, and 43.9% respectively. VL-Rethinker also achieves open-source SoTA on multi-disciplinary benchmarks such as MMMU-Pro, EMMA, and MEGA-Bench, narrowing the gap with GPT-o1.
zh

[AI-125] Generative AI in Collaborative Academic Report Writing: Advantages Disadvantages and Ethical Considerations

【速读】：该论文试图解决的问题是部分学生过度依赖生成式人工智能（GenAI）工具完成原本旨在促进自主学习的作业，从而可能削弱其终身学习能力的发展。论文强调，随着人工智能时代的到来，教育的重点应转向培养学生通过批判性思维发展自身能力，而非单纯依赖可能存在错误传播（misinformation）、幻觉（hallucination）及偏见（bias）的AI模型。解决方案的关键在于向学生普及AI工具的工作原理及其局限性，并强调培养独立思考与学习技能的重要性，同时探讨与数据输入相关的伦理问题。

链接: https://arxiv.org/abs/2504.08832
作者: Mahshid Sadeghpour,Arathi Arakala,Asha Rao
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
备注: 21 pages, 5 figures

点击查看摘要

Abstract:The availability and abundance of GenAI tools to administer tasks traditionally managed by people have raised concerns, particularly within the education and academic sectors, as some students may highly rely on these tools to complete the assignments designed to enable learning. This article focuses on informing students about the significance of investing their time during their studies on developing essential life-long learning skills using their own critical thinking, rather than depending on AI models that are susceptible to misinformation, hallucination, and bias. As we transition to an AI-centric era, it is important to educate students on how these models work, their pitfalls, and the ethical concerns associated with feeding data to such tools.
zh

[AI-126] Datum-wise Transformer for Synthetic Tabular Data Detection in the Wild

【速读】：该论文旨在解决合成表格数据检测的问题，尤其是在模型面对从未见过的表格结构时的检测挑战。论文的关键在于提出了一种新颖的数据级变换器（datum-wise transformer）架构，并证明其性能优于现有方法。此外，研究进一步探索了领域自适应技术的应用，以增强模型的有效性，从而提供更鲁棒的伪造数据检测解决方案。

链接: https://arxiv.org/abs/2504.08829
作者: G. Charbel N. Kindji(IRISA, MALT),Elisa Fromont(MALT, IRISA),Lina Maria Rojas-Barahona,Tanguy Urvoy
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Databases (cs.DB); Neural and Evolutionary Computing (cs.NE)
备注:

点击查看摘要

Abstract:The growing power of generative models raises major concerns about the authenticity of published content. To address this problem, several synthetic content detection methods have been proposed for uniformly structured media such as image or text. However, little work has been done on the detection of synthetic tabular data, despite its importance in industry and government. This form of data is complex to handle due to the diversity of its structures: the number and types of the columns may vary wildly from one table to another. We tackle the tough problem of detecting synthetic tabular data ‘‘in the wild’’, i.e. when the model is deployed on table structures it has never seen before. We introduce a novel datum-wise transformer architecture and show that it outperforms existing models. Furthermore, we investigate the application of domain adaptation techniques to enhance the effectiveness of our model, thereby providing a more robust data-forgery detection solution.
zh

[AI-127] PatchTrAD: A Patch-Based Transformer focusing on Patch-Wise Reconstruction Error for Time Series Anomaly Detection

【速读】：该论文致力于解决时间序列异常检测（TSAD）中的挑战，即在流数据中识别观测值是否显著偏离正常模式。随着互联设备的普及，时间序列上的异常检测变得尤为重要，因为它能够实现跨多个应用领域的实时监控和早期异常行为检测。论文的关键解决方案是提出PatchTrAD，这是一种基于补丁的Transformer模型。其核心在于利用Transformer编码器，并结合补丁的概念，在基于重构的框架下进行异常检测。实验评估表明，PatchTrAD在多个基准数据集上的检测性能与最先进的深度学习异常检测模型相当，同时在推理过程中具有更高的时间效率。

链接: https://arxiv.org/abs/2504.08827
作者: Samy-Melwan Vilhes(LITIS),Gilles Gasso(LITIS),Mokhtar Z Alaya(LMAC)
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Time series anomaly detection (TSAD) focuses on identifying whether observations in streaming data deviate significantly from normal patterns. With the prevalence of connected devices, anomaly detection on time series has become paramount, as it enables real-time monitoring and early detection of irregular behaviors across various application domains. In this work, we introduce PatchTrAD, a Patch-based Transformer model for time series anomaly detection. Our approach leverages a Transformer encoder along with the use of patches under a reconstructionbased framework for anomaly detection. Empirical evaluations on multiple benchmark datasets show that PatchTrAD is on par, in terms of detection performance, with state-of-the-art deep learning models for anomaly detection while being time efficient during inference.
zh

[AI-128] From Text to Time? Rethinking the Effectiveness of the Large Language Model for Time Series Forecasting

【速读】：该论文试图解决的问题是评估预训练大语言模型（Large Language Models, LLMs）在时间序列预测领域的有效性，并揭示其真实潜力。论文指出，在小数据集上训练和测试基于LLM的模型可能导致编码器（Encoder）和解码器（Decoder）过度拟合数据，从而掩盖LLM骨干的真实预测能力。为了解决这一问题，论文的关键方案是引入三种具有相同架构但采用不同预训练策略的预训练模型，通过大规模预训练创建对LLM骨干无偏的编码器和解码器组件，并通过零样本（zero-shot）和少样本（few-shot）预测实验评估LLM的能力，以提供对其预测性能的深入理解。

链接: https://arxiv.org/abs/2504.08818
作者: Xinyu Zhang,Shanshan Feng,Xutao Li
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Using pre-trained large language models (LLMs) as the backbone for time series prediction has recently gained significant research interest. However, the effectiveness of LLM backbones in this domain remains a topic of debate. Based on thorough empirical analyses, we observe that training and testing LLM-based models on small datasets often leads to the Encoder and Decoder becoming overly adapted to the dataset, thereby obscuring the true predictive capabilities of the LLM backbone. To investigate the genuine potential of LLMs in time series prediction, we introduce three pre-training models with identical architectures but different pre-training strategies. Thereby, large-scale pre-training allows us to create unbiased Encoder and Decoder components tailored to the LLM backbone. Through controlled experiments, we evaluate the zero-shot and few-shot prediction performance of the LLM, offering insights into its capabilities. Extensive experiments reveal that although the LLM backbone demonstrates some promise, its forecasting performance is limited. Our source code is publicly available in the anonymous repository: this https URL.
zh

[AI-129] Exploring utilization of generative AI for research and education in data-driven materials science

【速读】：该论文旨在探索生成式 AI (Generative AI) 在数据驱动材料科学中的高效应用，试图解决如何利用生成式 AI 提升研究与教育效率的问题。论文基于 AIMHack2024 的成果，聚焦于三个关键方向：(1) 开展 AI 辅助的软件试验，(2) 构建面向软件的 AI 导师系统，以及 (3) 开发软件的图形用户界面 (GUI) 应用程序。解决方案的关键在于跨学科合作，将材料科学、信息科学、生物信息学及凝聚态物理等领域的专业知识与生成式 AI 技术相结合，从而推动其在实际研究与教育场景中的落地应用。

链接: https://arxiv.org/abs/2504.08817
作者: Takahiro Misawa,Ai Koizumi,Ryo Tamura,Kazuyoshi Yoshimi
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Physics Education (physics.ed-ph)
备注: 13 pages, 3 figures

点击查看摘要

Abstract:Generative AI has recently had a profound impact on various fields, including daily life, research, and education. To explore its efficient utilization in data-driven materials science, we organized a hackathon – AIMHack2024 – in July 2024. In this hackathon, researchers from fields such as materials science, information science, bioinformatics, and condensed matter physics worked together to explore how generative AI can facilitate research and education. Based on the results of the hackathon, this paper presents topics related to (1) conducting AI-assisted software trials, (2) building AI tutors for software, and (3) developing GUI applications for software. While generative AI continues to evolve rapidly, this paper provides an early record of its application in data-driven materials science and highlights strategies for integrating AI into research and education.
zh

[AI-130] SafeMLRM: Demystifying Safety in Multi-modal Large Reasoning Models

【速读】：该论文试图解决多模态大型推理模型（MLRMs）的安全性问题，特别是其在引入推理能力后带来的全新风险。论文通过大规模实证研究首次系统分析了MLRMs的安全性，并揭示了三个关键发现：(1) 推理税（Reasoning Tax），即推理能力的获取显著削弱了继承的安全对齐性；(2) 安全盲点（Safety Blind Spots），某些特定场景（如非法活动）的攻击率远高于平均值，暴露了跨模型和数据集的一致性漏洞；(3) 自我修正的涌现特性（Emergent Self-Correction），尽管推理与答案的安全耦合紧密，但MLRMs展现出初步的自我修正能力。解决方案的关键在于开发首个开源工具包OpenSafeMLRM，提供统一接口以评估主流模型、数据集和越狱方法，同时呼吁加强推理增强型AI的安全审计机制及自纠错能力提升，确保其潜在价值与伦理保障一致。

链接: https://arxiv.org/abs/2504.08813
作者: Junfeng Fang,Yukai Wang,Ruipeng Wang,Zijun Yao,Kun Wang,An Zhang,Xiang Wang,Tat-Seng Chua
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
备注:

点击查看摘要

Abstract:The rapid advancement of multi-modal large reasoning models (MLRMs) – enhanced versions of multimodal language models (MLLMs) equipped with reasoning capabilities – has revolutionized diverse applications. However, their safety implications remain underexplored. While prior work has exposed critical vulnerabilities in unimodal reasoning models, MLRMs introduce distinct risks from cross-modal reasoning pathways. This work presents the first systematic safety analysis of MLRMs through large-scale empirical studies comparing MLRMs with their base MLLMs. Our experiments reveal three critical findings: (1) The Reasoning Tax: Acquiring reasoning capabilities catastrophically degrades inherited safety alignment. MLRMs exhibit 37.44% higher jailbreaking success rates than base MLLMs under adversarial attacks. (2) Safety Blind Spots: While safety degradation is pervasive, certain scenarios (e.g., Illegal Activity) suffer 25 times higher attack rates – far exceeding the average 3.4 times increase, revealing scenario-specific vulnerabilities with alarming cross-model and datasets consistency. (3) Emergent Self-Correction: Despite tight reasoning-answer safety coupling, MLRMs demonstrate nascent self-correction – 16.9% of jailbroken reasoning steps are overridden by safe answers, hinting at intrinsic safeguards. These findings underscore the urgency of scenario-aware safety auditing and mechanisms to amplify MLRMs’ self-correction potential. To catalyze research, we open-source OpenSafeMLRM, the first toolkit for MLRM safety evaluation, providing unified interface for mainstream models, datasets, and jailbreaking methods. Our work calls for immediate efforts to harden reasoning-augmented AI, ensuring its transformative potential aligns with ethical safeguards.
zh

[AI-131] PriM: Principle-Inspired Material Discovery through Multi-Agent Collaboration

【速读】：该论文旨在解决在材料发现过程中因化学空间复杂性和人类科学家知识范围有限且存在偏见所面临的巨大挑战。现有智能方法主要依赖数值计算，导致探索效率低下且结果难以解释。为弥合这一差距，论文提出了一种由语言推理多智能体系统（Multi-Agent System, MAS）驱动的原则引导材料发现系统（PriM）。其关键在于将自动假设生成与实验验证整合到多智能体系统的圆桌框架中，实现系统化探索的同时保持科学严谨性，并通过案例研究展示了更高的材料探索效率和透明的推理路径，从而发展出一种自动化且可解释的材料发现范式。

链接: https://arxiv.org/abs/2504.08810
作者: Zheyuan Lai,Yingming Pu
机构: 未知
类目: Machine Learning (cs.LG); Materials Science (cond-mat.mtrl-sci); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Complex chemical space and limited knowledge scope with biases holds immense challenge for human scientists, yet in automated materials discovery. Existing intelligent methods relies more on numerical computation, leading to inefficient exploration and results with hard-interpretability. To bridge this gap, we introduce a principles-guided material discovery system powered by language inferential multi-agent system (MAS), namely PriM. Our framework integrates automated hypothesis generation with experimental validation in a roundtable system of MAS, enabling systematic exploration while maintaining scientific rigor. Based on our framework, the case study of nano helix demonstrates higher materials exploration rate and property value while providing transparent reasoning pathways. This approach develops an automated-and-transparent paradigm for material discovery, with broad implications for rational design of functional materials. Code is publicly available at our \hrefthis https URLGitHub.
zh

[AI-132] Endowing Embodied Agents with Spatial Reasoning Capabilities for Vision-and-Language Navigation

【速读】：该论文旨在解决移动机器人在真实环境中执行Vision-and-Language Navigation (VLN)任务时因空间感知能力不足导致的严重幻觉现象（hallucination phenomena），即机器人在动态场景中的空间意识丧失的问题。为应对这一挑战，论文提出了一种名为BrainNav的生物启发式空间认知导航框架。其关键在于通过集成双模态地图（坐标地图与拓扑地图）和双模态方向（相对方向与绝对方向）策略，结合五个核心模块（Hippocampal Memory Hub, Visual Cortex Perception Engine, Parietal Spatial Constructor, Prefrontal Decision Center, 和Cerebellar Motion Execution Unit），模拟生物认知功能以减少空间幻觉并提升适应性，从而实现高效实时导航。

链接: https://arxiv.org/abs/2504.08806
作者: Luo Ling,Bai Qianqian
机构: 未知
类目: Artificial Intelligence (cs.AI); Robotics (cs.RO)
备注:

点击查看摘要

Abstract:Enhancing the spatial perception capabilities of mobile robots is crucial for achieving embodied Vision-and-Language Navigation (VLN). Although significant progress has been made in simulated environments, directly transferring these capabilities to real-world scenarios often results in severe hallucination phenomena, causing robots to lose effective spatial awareness. To address this issue, we propose BrainNav, a bio-inspired spatial cognitive navigation framework inspired by biological spatial cognition theories and cognitive map theory. BrainNav integrates dual-map (coordinate map and topological map) and dual-orientation (relative orientation and absolute orientation) strategies, enabling real-time navigation through dynamic scene capture and path planning. Its five core modules-Hippocampal Memory Hub, Visual Cortex Perception Engine, Parietal Spatial Constructor, Prefrontal Decision Center, and Cerebellar Motion Execution Unit-mimic biological cognitive functions to reduce spatial hallucinations and enhance adaptability. Validated in a zero-shot real-world lab environment using the Limo Pro robot, BrainNav, compatible with GPT-4, outperforms existing State-of-the-Art (SOTA) Vision-and-Language Navigation in Continuous Environments (VLN-CE) methods without fine-tuning.
zh

[AI-133] A Constraint Programming Model For Serial Batch Scheduling With Minimum Batch Size

【速读】：该论文致力于解决带最小批量大小约束的串行批次（s-batch）调度问题，这一问题在半导体制造和金属行业等实际场景中具有典型性。现有研究较少关注最小批量大小的要求，而针对该需求的解决方案主要采用动态规划和元启发式方法，尚未有研究使用约束编程（Constraint Programming, CP）。论文的关键创新在于首次提出了一种适用于s-batch调度的约束编程模型，以有效处理最小批量大小约束。通过与两种现有混合整数规划（Mixed-Integer Programming, MIP）模型的对比实验，结果表明该CP模型不仅能够灵活应对多种s-batch调度变体，还能在大规模实例中更快地获得更优解。

链接: https://arxiv.org/abs/2504.08793
作者: Jorge A. Huertas,Pascal Van Hentenryck
机构: 未知
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI); Optimization and Control (math.OC)
备注: 13 pages, 7 figures

点击查看摘要

Abstract:In serial batch (s-batch) scheduling, jobs are grouped in batches and processed sequentially within their batch. This paper considers multiple parallel machines, nonidentical job weights and release times, and sequence-dependent setup times between batches of different families. Although s-batch has been widely studied in the literature, very few papers have taken into account a minimum batch size, typical in practical settings such as semiconductor manufacturing and the metal industry. The problem with this minimum batch size requirement has been mostly tackled with dynamic programming and meta-heuristics, and no article has ever used constraint programming (CP) to do so. This paper fills this gap by proposing, for the first time, a CP model for s-batching with minimum batch size. The computational experiments on standard cases compare the CP model with two existing mixed-integer programming (MIP) models from the literature. The results demonstrate the versatility of the proposed CP model to handle multiple variations of s-batching; and its ability to produce, in large instances, better solutions than the MIP models faster.
zh

[AI-134] PRIMA.CPP: Speeding Up 70B-Scale LLM Inference on Low-Resource Everyday Home Clusters

【速读】：该论文旨在解决在普通家庭设备上运行大规模语言模型（Large Language Models, LLMs）时面临的性能瓶颈问题。传统端侧解决方案通常需要GPU集群、大容量RAM/VRAM以及高带宽支持，这些资源远超家庭设备的能力范围。论文提出了一种名为“this http URL”的分布式推理系统，通过混合使用CPU和GPU、低内存需求、Wi-Fi通信及跨平台支持，在日常家用设备上实现了70B规模模型的运行。其关键创新在于利用mmap管理模型权重，并引入带预取功能的流水线环形并行机制以隐藏磁盘加载延迟；同时，通过建模计算、通信、磁盘、内存及其管理行为和操作系统中的异构性，优化分配各模型层至不同设备的CPU和GPU上，进一步降低令牌延迟。此外，论文提出了一种名为Halda的优雅算法来解决这一NP难分配问题。实验结果显示，在一个常见的四节点家庭集群上，“this http URL”在30B以上模型上的表现优于“this http URL”、exo和dllama，且内存占用压力低于6%，从而将前沿的30B-70B规模模型如Llama 3、DeepSeek R1、Qwen 2.5和QwQ带到家庭助理中，使先进的人工智能真正普及到个人用户。

链接: https://arxiv.org/abs/2504.08791
作者: Zonghang Li,Tao Li,Wenjiao Feng,Mohsen Guizani,Hongfang Yu
机构: 未知
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI)
备注: 23 pages, 9 figures, 6 tables

点击查看摘要

Abstract:Emergency of DeepSeek R1 and QwQ 32B have broken through performance barriers for running frontier large language models (LLMs) on home devices. While consumer hardware is getting stronger and model quantization is improving, existing end-side solutions still demand GPU clusters, large RAM/VRAM, and high bandwidth, far beyond what a common home cluster can handle. This paper introduces this http URL, a distributed inference system that runs 70B-scale models on everyday home devices using a mix of CPU/GPU, low RAM/VRAM, Wi-Fi, and cross-platform support. It uses mmap to manage model weights and introduces piped-ring parallelism with prefetching to hide disk loading. By modeling heterogeneity in computation, communication, disk, memory (and its management behavior), and OS, it optimally assigns model layers to each device’s CPU and GPU, further reducing token latency. An elegant algorithm named Halda is proposed to solve this NP-hard assignment problem. We evaluate this http URL on a common four-node home cluster. It outperforms this http URL, exo, and dllama on 30B+ models while keeping memory pressure below 6%. This brings frontier 30B-70B models, such as Llama 3, DeepSeek R1, Qwen 2.5, and QwQ to home assistants, making advanced AI truly accessible to individuals. The code is open source and available at this https URL.
zh

[AI-135] AdaptRec: A Self-Adaptive Framework for Sequential Recommendations with Large Language Models

【速读】：该论文旨在解决利用大型语言模型（Large Language Models, LLMs）进行序列推荐任务时，如何有效将来自相似用户的协同信号转化为LLMs可理解与利用的形式这一挑战。关键挑战在于从大规模用户交互中选择相关的示范样本，并确保这些样本与LLMs的推理过程保持一致。为了解决这些问题，论文提出了AdaptRec框架，其核心在于通过引入显式的协同信号以及两阶段的用户选择机制——用户相似性检索（User Similarity Retrieval）和自适应用户选择（Self-Adaptive User Selection），高效识别大规模数据集中的相关用户序列。此外，开发了基于用户的相似性检索提示（User-Based Similarity Retrieval Prompt），使模型能够在训练过程中主动选择相似用户并不断优化选择标准。最终，通过构造包含相似用户行为序列的用户上下文化推荐提示（User-Contextualized Recommendation Prompt），将协同信号以自然语言形式整合进推荐流程中。实验结果表明，在全微调场景下，AdaptRec在HitRatio@1指标上分别提升了7.13%、18.16%和10.41%，而在少量样本场景下提升更为显著，达到了23.00%、15.97%和17.98%。

链接: https://arxiv.org/abs/2504.08786
作者: Tong Zhang
机构: 未知
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:The recent advancements in Large Language Models (LLMs) have generated considerable interest in their utilization for sequential recommendation tasks. While collaborative signals from similar users are central to recommendation modeling, effectively transforming these signals into a format that LLMs can understand and utilize remains challenging. The critical challenges include selecting relevant demonstrations from large-scale user interactions and ensuring their alignment with LLMs’ reasoning process. To address these challenges, we introduce AdaptRec, a self-adaptive fram-ework that leverages LLMs for sequential recommendations by incorporating explicit collaborative signals. AdaptRec employs a two-phase user selection mechanism – User Similarity Retrieval and Self-Adaptive User Selection – to efficiently identify relevant user sequences in large-scale datasets from multi-metric evaluation. We also develop a User-Based Similarity Retrieval Prompt, enabling the model to actively select similar users and continuously refine its selection criteria during training. Using the collaborative signals from similar users, we construct a User-Contextualized Recommendation Prompt that translates their behavior sequences into natural language, explicitly integrating this information into the recommendation process. Experiments demonstrate AdaptRec’s superior performance, with significant improvements in HitRatio@1 scores of 7.13%, 18.16%, and 10.41% across real-world datasets with full fine-tuning, and even higher gains of 23.00%, 15.97%, and 17.98% in few-shot scenarios.
zh

[AI-136] Embedding Hidden Adversarial Capabilities in Pre-Trained Diffusion Models

【速读】：该论文试图解决生成式模型在外部使用场景下可能被恶意篡改的安全问题。论文提出了一种通过微调（fine-tuning）将隐藏的对抗能力直接嵌入扩散模型（diffusion model）的新攻击范式，而无需改变模型的可观测行为或在推理过程中进行修改。解决方案的关键在于将对抗性功能内化到模型本身，使得生成的高质量图像与原始模型无异，但在下游分类器中却能以高概率导致误分类，且这种误分类可以针对特定输出类别。此外，由于篡改后的模型表现与标准扩散模型完全一致，用户难以察觉其嵌入的对抗性质。这一方法揭示了使用外部提供模型的风险，并强调了开发鲁棒模型验证和防御机制以应对生成模型中潜在隐藏威胁的紧迫性。

链接: https://arxiv.org/abs/2504.08782
作者: Lucas Beerens,Desmond J. Higham
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
备注:

点击查看摘要

Abstract:We introduce a new attack paradigm that embeds hidden adversarial capabilities directly into diffusion models via fine-tuning, without altering their observable behavior or requiring modifications during inference. Unlike prior approaches that target specific images or adjust the generation process to produce adversarial outputs, our method integrates adversarial functionality into the model itself. The resulting tampered model generates high-quality images indistinguishable from those of the original, yet these images cause misclassification in downstream classifiers at a high rate. The misclassification can be targeted to specific output classes. Users can employ this compromised model unaware of its embedded adversarial nature, as it functions identically to a standard diffusion model. We demonstrate the effectiveness and stealthiness of our approach, uncovering a covert attack vector that raises new security concerns. These findings expose a risk arising from the use of externally-supplied models and highlight the urgent need for robust model verification and defense mechanisms against hidden threats in generative models. The code is available at this https URL .
zh

[AI-137] Reward Generation via Large Vision-Language Model in Offline Reinforcement Learning ICASSP2025

【速读】：该论文旨在解决在离线强化学习（Offline RL）中设计密集奖励信号需要大量人工努力和领域专业知识的问题，同时探索如何降低基于人类反馈的强化学习（RLHF）的成本。论文的关键解决方案是提出了一种名为“通过大规模视觉-语言模型生成奖励”（Reward Generation via Large Vision-Language Models, RG-VLM）的方法。RG-VLM 利用大规模视觉-语言模型（Large Vision-Language Models, LVLMs）的推理能力，从离线数据中自动生成奖励信号，无需人工参与。这种方法不仅提高了长时序任务的泛化性能，还能与稀疏奖励信号结合以增强任务表现，展现出其作为辅助奖励信号的潜力。

链接: https://arxiv.org/abs/2504.08772
作者: Younghwan Lee,Tung M. Luu,Donghoon Lee,Chang D. Yoo
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 5 pages, ICASSP 2025. First two authors are equally contributed

点击查看摘要

Abstract:In offline reinforcement learning (RL), learning from fixed datasets presents a promising solution for domains where real-time interaction with the environment is expensive or risky. However, designing dense reward signals for offline dataset requires significant human effort and domain expertise. Reinforcement learning with human feedback (RLHF) has emerged as an alternative, but it remains costly due to the human-in-the-loop process, prompting interest in automated reward generation models. To address this, we propose Reward Generation via Large Vision-Language Models (RG-VLM), which leverages the reasoning capabilities of LVLMs to generate rewards from offline data without human involvement. RG-VLM improves generalization in long-horizon tasks and can be seamlessly integrated with the sparse reward signals to enhance task performance, demonstrating its potential as an auxiliary reward signal.
zh

[AI-138] Generate the browsing process for short-video recommendation

【速读】：该论文旨在解决短视频推荐中的浏览过程生成及视频观看时长预测问题。现有方法多依赖于多模态特征进行视频内容理解，而论文提出的Segment Content Aware Model via User Engagement Feedback (SCAM) 通过建模用户的观看历史行为，隐式捕捉视频内容，在无需复杂多模态数据的情况下实现段落级别的内容理解。其关键在于将视频划分为基于时长的段落，并采用类似于Transformer的架构捕获段落间的时序依赖性，同时缓解时长偏差问题。实验结果表明，SCAM在工业规模及公开数据集上的观看时长预测性能达到当前最优，为视频推荐提供了一种可扩展且有效的解决方案。

链接: https://arxiv.org/abs/2504.08771
作者: Chao Feng,Yanze Zhang,Chenghao Zhang
机构: 未知
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:This paper introduces a new model to generate the browsing process for short-video recommendation and proposes a novel Segment Content Aware Model via User Engagement Feedback (SCAM) for watch time prediction in video recommendation. Unlike existing methods that rely on multimodal features for video content understanding, SCAM implicitly models video content through users’ historical watching behavior, enabling segment-level understanding without complex multimodal data. By dividing videos into segments based on duration and employing a Transformer-like architecture, SCAM captures the sequential dependence between segments while mitigating duration bias. Extensive experiments on industrial-scale and public datasets demonstrate SCAM’s state-of-the-art performance in watch time prediction. The proposed approach offers a scalable and effective solution for video recommendation by leveraging segment-level modeling and users’ engagement feedback.
zh

[AI-139] InteractiveSurvey: An LLM -based Personalized and Interactive Survey Paper Generation System

【速读】：该论文旨在解决学术文献指数增长背景下，人工撰写综述论文耗时且低效的问题，同时现有基于大型语言模型（Large Language Models, LLMs）和检索增强生成（Retrieval-Augmented Generation, RAG）的方法局限于固定的输入输出格式，忽视了综述论文写作的个性化需求。论文提出的关键解决方案是开发一种基于LLM的个性化交互式综述论文生成系统——InteractiveSurvey。该系统通过在线检索和用户上传的方式，从多篇参考文献中生成结构化、多模态的综述论文，并支持用户在生成过程中自定义和优化中间组件（如参考文献分类、大纲和综述内容），从而实现更灵活、个性化的写作过程。这一方案的核心在于结合LLM的强大生成能力与用户交互式定制功能，以提升生成内容的质量和用户体验。

链接: https://arxiv.org/abs/2504.08762
作者: Zhiyuan Wen,Jiannong Cao,Zian Wang,Beichen Guo,Ruosong Yang,Shuaiqi Liu
机构: 未知
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:The exponential growth of academic literature creates urgent demands for comprehensive survey papers, yet manual writing remains time-consuming and labor-intensive. Recent advances in large language models (LLMs) and retrieval-augmented generation (RAG) facilitate studies in synthesizing survey papers from multiple references, but most existing works restrict users to title-only inputs and fixed outputs, neglecting the personalized process of survey paper writing. In this paper, we introduce InteractiveSurvey - an LLM-based personalized and interactive survey paper generation system. InteractiveSurvey can generate structured, multi-modal survey papers with reference categorizations from multiple reference papers through both online retrieval and user uploads. More importantly, users can customize and refine intermediate components continuously during generation, including reference categorization, outline, and survey content through an intuitive interface. Evaluations of content quality, time efficiency, and user studies show that InteractiveSurvey is an easy-to-use survey generation system that outperforms most LLMs and existing methods in output content quality while remaining highly time-efficient.
zh

[AI-140] Hyper-RAG : Combating LLM Hallucinations using Hypergraph-Driven Retrieval-Augmented Generation

【速读】：该论文旨在解决大型语言模型（LLMs）在医疗领域应用中因幻觉（hallucinations）导致生成内容偏离事实准确性的问题，这可能引发严重后果。为应对这一挑战，论文提出了一种名为Hyper-RAG的新型方法，其关键在于采用超图驱动的检索增强生成（Retrieval-Augmented Generation, RAG）框架，能够全面捕捉领域特定知识中的成对及多体关联，从而有效减轻幻觉现象。实验结果表明，Hyper-RAG不仅显著提升了六种主流LLM在NeurologyCrop数据集上的准确率，还展现出优于现有Graph RAG和Light RAG方法的性能，并且在查询复杂度增加时仍保持稳定表现。此外，在九个不同数据集上的验证进一步证实了Hyper-RAG相比Light RAG高达35.5%的性能提升。轻量级版本Hyper-RAG-Lite则实现了两倍于Light RAG的检索速度以及额外3.3%的性能增益。这些发现证明了Hyper-RAG在提高LLM可靠性及减少幻觉方面的有效性，使其成为高风险应用场景如医学诊断的理想选择。

链接: https://arxiv.org/abs/2504.08758
作者: Yifan Feng,Hao Hu,Xingliang Hou,Shiquan Liu,Shihui Ying,Shaoyi Du,Han Hu,Yue Gao
机构: 未知
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Large language models (LLMs) have transformed various sectors, including education, finance, and medicine, by enhancing content generation and decision-making processes. However, their integration into the medical field is cautious due to hallucinations, instances where generated content deviates from factual accuracy, potentially leading to adverse outcomes. To address this, we introduce Hyper-RAG, a hypergraph-driven Retrieval-Augmented Generation method that comprehensively captures both pairwise and beyond-pairwise correlations in domain-specific knowledge, thereby mitigating hallucinations. Experiments on the NeurologyCrop dataset with six prominent LLMs demonstrated that Hyper-RAG improves accuracy by an average of 12.3% over direct LLM use and outperforms Graph RAG and Light RAG by 6.3% and 6.0%, respectively. Additionally, Hyper-RAG maintained stable performance with increasing query complexity, unlike existing methods which declined. Further validation across nine diverse datasets showed a 35.5% performance improvement over Light RAG using a selection-based assessment. The lightweight variant, Hyper-RAG-Lite, achieved twice the retrieval speed and a 3.3% performance boost compared with Light RAG. These results confirm Hyper-RAG’s effectiveness in enhancing LLM reliability and reducing hallucinations, making it a robust solution for high-stakes applications like medical diagnostics.
zh

[AI-141] A Framework for Lightweight Responsible Prompting Recommendation

【速读】：该论文试图解决生成式人工智能（GenAI）在人机交互中的指导不足问题，具体而言是在缺乏明确交互建议、标准或最佳实践的情况下，如何负责任地设计与GenAI的交互。论文的关键在于提出了一种轻量级框架，用于在发送提示（prompt）至GenAI之前提供负责任的提示建议。该框架的核心包括：(1) 一个人类策展的数据集以提供推荐；(2) 一个红队数据集用于评估推荐效果；(3) 一个句子变换器用于语义映射；(4) 一种相似性度量方法将输入提示映射到推荐内容；(5) 一组相似性阈值；(6) 量化句子嵌入；(7) 一个推荐引擎；以及(8) 使用红队数据集的评估步骤。通过这一框架及开源系统，研究贡献能够在多个场景中应用，帮助终端用户以更负责任的方式与GenAI互动，并推荐积极内容、避免有害输出。

链接: https://arxiv.org/abs/2504.08757
作者: Tiago Machado,Sara E. Berger,Cassia Sanctos,Vagner Figueiredo de Santana,Lemara Williams,Zhaoqing Wu
机构: 未知
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
备注: 13 pages, 3 figures, 3 tables, 1 algorithm

点击查看摘要

Abstract:Computer Science and Design practitioners have been researching and proposing alternatives for a dearth of recommendations, standards, or best practices in user interfaces for decades. Now, with the advent of generative Artificial Intelligence (GenAI), we have yet again an emerging, powerful technology that lacks sufficient guidance in terms of possible interactions, inputs, and outcomes. In this context, this work proposes a lightweight framework for responsible prompting recommendation to be added before the prompt is sent to GenAI. The framework is comprised of (1) a human-curated dataset for recommendations, (2) a red team dataset for assessing recommendations, (3) a sentence transformer for semantics mapping, (4) a similarity metric to map input prompt to recommendations, (5) a set of similarity thresholds, (6) quantized sentence embeddings, (7) a recommendation engine, and (8) an evaluation step to use the red team dataset. With the proposed framework and open-source system, the contributions presented can be applied in multiple contexts where end-users can benefit from guidance for interacting with GenAI in a more responsible way, recommending positive values to be added and harmful sentences to be removed.
zh

[AI-142] MHTS: Multi-Hop Tree Structure Framework for Generating Difficulty-Controllable QA Datasets for RAG Evaluation

【速读】：该论文旨在解决现有基于检索增强生成（RAG）基准测试忽视查询难度的问题，导致对简单问题性能被高估且评估结果不可靠。为了解决这一问题，论文提出了一种名为MHTS（多跳树结构）的新颖数据合成框架，其关键是通过利用多跳树结构系统性控制多跳推理复杂度，从而生成逻辑上连贯且包含多片段的查询。这种细粒度的难度估计公式与RAG系统的整体性能指标表现出强相关性，验证了其在评估检索和答案生成能力方面的有效性。通过确保高质量、多样性和可控难度的查询，该方法提升了RAG评估和基准测试的能力。

链接: https://arxiv.org/abs/2504.08756
作者: Jeongsoo Lee,Daeyong Kwon,Kyohoon Jin,Junnyeong Jeong,Minwoo Sim,Minwoo Kim
机构: 未知
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Existing RAG benchmarks often overlook query difficulty, leading to inflated performance on simpler questions and unreliable evaluations. A robust benchmark dataset must satisfy three key criteria: quality, diversity, and difficulty, which capturing the complexity of reasoning based on hops and the distribution of supporting evidence. In this paper, we propose MHTS (Multi-Hop Tree Structure), a novel dataset synthesis framework that systematically controls multi-hop reasoning complexity by leveraging a multi-hop tree structure to generate logically connected, multi-chunk queries. Our fine-grained difficulty estimation formula exhibits a strong correlation with the overall performance metrics of a RAG system, validating its effectiveness in assessing both retrieval and answer generation capabilities. By ensuring high-quality, diverse, and difficulty-controlled queries, our approach enhances RAG evaluation and benchmarking capabilities.
zh

[AI-143] Delving into: the quantification of Ai-generated content on the internet (synthetic data)

【速读】：该论文试图解决如何准确衡量互联网上由生成式大语言模型（Large Language Models）产生的内容规模这一问题。论文的关键解决方案在于通过分析ChatGPT常用关键词的频率（linguistic markers），证明这些语言特征可以有效用于估计在线内容中生成式AI来源的比例，并据此估算至少30%，可能接近40%的活跃网页文本来源于AI生成内容。

链接: https://arxiv.org/abs/2504.08755
作者: Dirk HR Spennemann
机构: 未知
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
备注: 9 pp

点击查看摘要

Abstract:While it is increasingly evident that the internet is becoming saturated with content created by generated Ai large language models, accurately measuring the scale of this phenomenon has proven challenging. By analyzing the frequency of specific keywords commonly used by ChatGPT, this paper demonstrates that such linguistic markers can effectively be used to esti-mate the presence of generative AI content online. The findings suggest that at least 30% of text on active web pages originates from AI-generated sources, with the actual proportion likely ap-proaching 40%. Given the implications of autophagous loops, this is a sobering realization.
zh

[AI-144] owards Personalized Conversational Sales Agents with Contextual User Profiling for Strategic Action

【速读】：该论文旨在解决传统对话推荐系统（Conversational Recommender Systems, CRSs）在模拟真实电子商务环境中复杂决策过程时的局限性。具体而言，传统CRSs主要关注于偏好获取与物品检索，而忽略了用户在实际购物中会综合考虑多维度因素（如价格、品牌、评价等）的复杂决策行为。为弥合这一差距，论文引入了“对话销售”（Conversational Sales, CSales）任务，将偏好获取、推荐与说服统一起来，以更有效地支持用户的决策过程。

解决方案的关键在于两个方面：首先，构建了一个基于大型语言模型（LLM）的用户模拟器CSUser，它从真实数据中学习并建模多样化用户画像，包括需求和性格特征，从而实现对真实用户的有效模拟；其次，提出了一种名为CSI的对话销售代理，其通过对话主动推断上下文相关的用户画像，用于个性化行动规划。实验结果表明，CSUser能够逼真地再现真实用户行为，并强调了上下文画像对于制定战略性行动的重要性，最终促进了电子商务中的成功购买行为。

链接: https://arxiv.org/abs/2504.08754
作者: Tongyoung Kim,Jeongeun Lee,Soojin Yoon,Seonghwan Kim
机构: 未知
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Conversational Recommender Systems (CRSs) aim to engage users in dialogue to provide tailored recommendations. While traditional CRSs focus on eliciting preferences and retrieving items, real-world e-commerce interactions involve more complex decision-making, where users consider multiple factors beyond simple attributes. To bridge this gap, we introduce Conversational Sales (CSales), a novel task that unifies preference elicitation, recommendation, and persuasion to better support user decision-making. For a realistic evaluation of CSales, we present CSUser, an LLM-based user simulator constructed from real-world data, modeling diverse user profiles with needs and personalities. Additionally, we propose CSI, a conversational sales agent that proactively infers contextual profiles through dialogue for personalized action planning. Extensive experiments demonstrate that CSUser effectively replicates real-world users and emphasize the importance of contextual profiling for strategic action selection, ultimately driving successful purchases in e-commerce.
zh

[AI-145] Patience is all you need! An agent ic system for performing scientific literature review

【速读】：该论文试图解决科学文献中信息检索与提炼的问题，具体目标是开发一种基于大型语言模型（Large Language Models, LLMs）的系统，能够从科学研究文章的完整文本中搜索并提炼出相关且重要的信息，以支持复杂或领域专家级别的问答任务。传统方法通常依赖于摘要中的信息，而忽略了全文的重要性，同时现有方法可能需要复杂的密集检索（dense retrieval）基础设施。
解决方案的关键在于提出了一种稀疏检索（sparse retrieval）方法，该方法在无需复杂基础设施的情况下实现了接近最先进的性能，并通过关键词驱动的搜索策略提高了相关文献的覆盖率，从而有效支持文献综述的生成。

链接: https://arxiv.org/abs/2504.08752
作者: David Brett,Anniek Myatt
机构: 未知
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
备注: 10 pages, 5 figures

点击查看摘要

Abstract:Large language models (LLMs) have grown in their usage to provide support for question answering across numerous disciplines. The models on their own have already shown promise for answering basic questions, however fail quickly where expert domain knowledge is required or the question is nuanced. Scientific research often involves searching for relevant literature, distilling pertinent information from that literature and analysing how the findings support or contradict one another. The information is often encapsulated in the full text body of research articles, rather than just in the abstracts. Statements within these articles frequently require the wider article context to be fully understood. We have built an LLM-based system that performs such search and distillation of information encapsulated in scientific literature, and we evaluate our keyword based search and information distillation system against a set of biology related questions from previously released literature benchmarks. We demonstrate sparse retrieval methods exhibit results close to state of the art without the need for dense retrieval, with its associated infrastructure and complexity overhead. We also show how to increase the coverage of relevant documents for literature review generation.
zh

[AI-146] Research on the Design of a Short Video Recommendation System Based on Multimodal Information and Differential Privacy

【速读】：该论文旨在解决短视频推荐系统在利用多模态信息（如图像、文本和音频）提升推荐效果的同时面临的严重用户隐私泄露问题。解决方案的关键在于结合多模态信息的深度学习特征提取与融合技术，以及针对推荐场景设计的差分隐私保护机制。通过这一方法，论文不仅有效提升了推荐准确性及多模态数据融合效果，同时确保了用户数据隐私，平衡了系统性能与隐私保护之间的关系。

链接: https://arxiv.org/abs/2504.08751
作者: Haowei Yang,Lei Fu,Qingyi Lu,Yue Fan,Tianle Zhang,Ruohan Wang
机构: 未知
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
备注:

点击查看摘要

Abstract:With the rapid development of short video platforms, recommendation systems have become key technologies for improving user experience and enhancing platform engagement. However, while short video recommendation systems leverage multimodal information (such as images, text, and audio) to improve recommendation effectiveness, they also face the severe challenge of user privacy leakage. This paper proposes a short video recommendation system based on multimodal information and differential privacy protection. First, deep learning models are used for feature extraction and fusion of multimodal data, effectively improving recommendation accuracy. Then, a differential privacy protection mechanism suitable for recommendation scenarios is designed to ensure user data privacy while maintaining system performance. Experimental results show that the proposed method outperforms existing mainstream approaches in terms of recommendation accuracy, multimodal fusion effectiveness, and privacy protection performance, providing important insights for the design of recommendation systems for short video platforms.
zh

[AI-147] GridMind: A Multi-Agent NLP Framework for Unified Cross-Modal NFL Data Insights

【速读】：该论文旨在解决多模态体育数据分析中的挑战，特别是如何有效整合结构化数据（如统计数据）、半结构化数据（如传感器数据）以及非结构化数据（如文本、音频和视频）以提取可操作的洞察。传统系统在处理这些多样化数据源时存在局限性，尤其是在实时分析场景下。为应对这些挑战，论文提出的关键解决方案是GridMind，这是一个基于多智能体框架的系统，通过检索增强生成（Retrieval-Augmented Generation, RAG）技术和大型语言模型（Large Language Models, LLMs）实现结构化、半结构化和非结构化数据的统一集成，从而支持自然语言查询NFL数据。GridMind的核心创新在于其分布式架构及模块化设计，允许智能体自主管理从提示解析到响应合成的全过程，这种设计确保了对多模态数据灵活且可扩展的处理能力，同时提供了直观且全面的答案响应。

链接: https://arxiv.org/abs/2504.08747
作者: Jordan Chipka,Chris Moyer,Clay Troyer,Tyler Fuelling,Jeremy Hochstedler
机构: 未知
类目: Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
备注: 16 pages, 2 figures, submitted to 2025 Sloan Sports Analytics Conference

点击查看摘要

Abstract:The rapid growth of big data and advancements in computational techniques have significantly transformed sports analytics. However, the diverse range of data sources – including structured statistics, semi-structured formats like sensor data, and unstructured media such as written articles, audio, and video – creates substantial challenges in extracting actionable insights. These various formats, often referred to as multimodal data, require integration to fully leverage their potential. Conventional systems, which typically prioritize structured data, face limitations when processing and combining these diverse content types, reducing their effectiveness in real-time sports analysis. To address these challenges, recent research highlights the importance of multimodal data integration for capturing the complexity of real-world sports environments. Building on this foundation, this paper introduces GridMind, a multi-agent framework that unifies structured, semi-structured, and unstructured data through Retrieval-Augmented Generation (RAG) and large language models (LLMs) to facilitate natural language querying of NFL data. This approach aligns with the evolving field of multimodal representation learning, where unified models are increasingly essential for real-time, cross-modal interactions. GridMind’s distributed architecture includes specialized agents that autonomously manage each stage of a prompt – from interpretation and data retrieval to response synthesis. This modular design enables flexible, scalable handling of multimodal data, allowing users to pose complex, context-rich questions and receive comprehensive, intuitive responses via a conversational interface. Comments: 16 pages, 2 figures, submitted to 2025 Sloan Sports Analytics Conference Subjects: Artificial Intelligence (cs.AI); Information Retrieval (cs.IR) Cite as: arXiv:2504.08747 [cs.AI] (or arXiv:2504.08747v1 [cs.AI] for this version) https://doi.org/10.48550/arXiv.2504.08747 Focus to learn more arXiv-issued DOI via DataCite
zh

[AI-148] Enhancing Recommender Systems Using Textual Embeddings from Pre-trained Language Models

【速读】：该论文试图解决传统推荐系统依赖显式用户、物品及用户-物品交互特征而导致的局限性问题。为应对这一挑战，论文的关键解决方案是利用预训练语言模型（Pre-trained Language Models, PLMs）生成的文本嵌入（textual embeddings），将结构化数据转化为自然语言表示，从而生成能够捕捉用户、物品及上下文之间深层语义关系的高维嵌入。实验结果表明，这种方法显著提升了推荐的准确性和相关性，实现了更个性化且上下文感知的推荐。

链接: https://arxiv.org/abs/2504.08746
作者: Ngoc Luyen Le(Heudiasyc),Marie-Hélène Abel(Heudiasyc)
机构: 未知
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Recent advancements in language models and pre-trained language models like BERT and RoBERTa have revolutionized natural language processing, enabling a deeper understanding of human-like language. In this paper, we explore enhancing recommender systems using textual embeddings from pre-trained language models to address the limitations of traditional recommender systems that rely solely on explicit features from users, items, and user-item interactions. By transforming structured data into natural language representations, we generate high-dimensional embeddings that capture deeper semantic relationships between users, items, and contexts. Our experiments demonstrate that this approach significantly improves recommendation accuracy and relevance, resulting in more personalized and context-aware recommendations. The findings underscore the potential of PLMs to enhance the effectiveness of recommender systems.
zh

[AI-149] Simulating Filter Bubble on Short-video Recommender System with Large Language Model Agents IJCAI2025

【速读】：该论文旨在解决短视频平台（如TikTok）中因推荐算法与用户反馈之间的复杂动态所导致的“过滤气泡”现象形成机制不明的问题。论文的关键在于利用基于大型语言模型（Large Language Model, LLM）的模拟框架，通过真实世界的数据来重现用户与推荐系统之间的交互过程，并揭示推动过滤气泡形成的机制。研究识别出加剧内容同质化的核心因素，如人口统计特征和类别吸引力。为缓解这一问题，论文设计并测试了多种冷启动策略和反馈加权方法，有效减少了过滤气泡效应。此外，论文还分析了LLM固有偏见在推荐中的传播方式，并提出了保护措施以促进弱势群体（如女性和低收入人群）的公平性。通过探索推荐系统与LLM代理之间的相互作用，该研究深化了对算法偏见的理解，并提供了实用工具以构建包容性的数字空间。

链接: https://arxiv.org/abs/2504.08742
作者: Nicholas Sukiennik,Haoyu Wang,Zailin Zeng,Chen Gao,Yong Li
机构: 未知
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
备注: Submitted to IJCAI 2025

点击查看摘要

Abstract:An increasing reliance on recommender systems has led to concerns about the creation of filter bubbles on social media, especially on short video platforms like TikTok. However, their formation is still not entirely understood due to the complex dynamics between recommendation algorithms and user feedback. In this paper, we aim to shed light on these dynamics using a large language model-based simulation framework. Our work employs real-world short-video data containing rich video content information and detailed user-agents to realistically simulate the recommendation-feedback cycle. Through large-scale simulations, we demonstrate that LLMs can replicate real-world user-recommender interactions, uncovering key mechanisms driving filter bubble formation. We identify critical factors, such as demographic features and category attraction that exacerbate content homogenization. To mitigate this, we design and test interventions including various cold-start and feedback weighting strategies, showing measurable reductions in filter bubble effects. Our framework enables rapid prototyping of recommendation strategies, offering actionable solutions to enhance content diversity in real-world systems. Furthermore, we analyze how LLM-inherent biases may propagate through recommendations, proposing safeguards to promote equity for vulnerable groups, such as women and low-income populations. By examining the interplay between recommendation and LLM agents, this work advances a deeper understanding of algorithmic bias and provides practical tools to promote inclusive digital spaces.
zh

[AI-150] AI-Driven Sentiment Analytics: Unlocking Business Value in the E-Commerce Landscape_v1

【速读】：该论文旨在解决电子商务领域中从海量客户反馈数据（包括产品评论和服务互动）中提取有意义见解的问题，以帮助企业提升客户满意度并优化决策。论文的关键在于提出了一种结合传统机器学习技术和现代深度学习模型的AI驱动情感分析系统，该系统在保证高准确性的同时实现了决策过程的可解释性。实验结果表明，该系统在多样化的大规模数据集上达到了89.7%的准确率，超越了标准情感分析方法，并通过实际部署验证了其在提升客户参与度和运营效率方面的有效性。

链接: https://arxiv.org/abs/2504.08738
作者: Qianye Wu,Chengxuan Xia,Sixuan Tian
机构: 未知
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
备注: 7 pages

点击查看摘要

Abstract:The rapid growth of e-commerce has led to an overwhelming volume of customer feedback, from product reviews to service interactions. Extracting meaningful insights from this data is crucial for businesses aiming to improve customer satisfaction and optimize decision-making. This paper presents an AI-driven sentiment analysis system designed specifically for e-commerce applications, balancing accuracy with interpretability. Our approach integrates traditional machine learning techniques with modern deep learning models, allowing for a more nuanced understanding of customer sentiment while ensuring transparency in decision-making. Experimental results show that our system outperforms standard sentiment analysis methods, achieving an accuracy of 89.7% on diverse, large-scale datasets. Beyond technical performance, real-world implementation across multiple e-commerce platforms demonstrates tangible improvements in customer engagement and operational efficiency. This study highlights both the potential and the challenges of applying AI to sentiment analysis in a commercial setting, offering insights into practical deployment strategies and areas for future refinement.
zh

[AI-151] Latency-Aware 2-Opt Monotonic Local Search for Distributed Constraint Optimization

【速读】：该论文旨在解决在通信感知分布式约束优化问题（Communication-Aware Distributed Constraint Optimization Problems, CA-DCOPs）中，现有局部搜索算法仅限于1-opt解的问题，即这些算法只能确保单个代理改进解的能力，而无法充分利用多个代理协同优化的优势。为了解决这一局限性，论文提出了一种新的算法——时延感知单调分布式局部搜索-2（Latency-Aware Monotonic Distributed Local Search-2, LAMDLS-2）。该算法的关键在于引入了代理成对协作的机制，通过协调双边分配替换来实现更优的2-opt解。这种设计不仅保持了算法的单调性，使其能够收敛到2-opt解，还增强了其对消息时延的鲁棒性，从而适用于CA-DCOP场景。实验结果表明，LAMDLS-2在多种消息时延条件下比基准算法MGM-2更快地收敛到相似质量的解。

链接: https://arxiv.org/abs/2504.08737
作者: Ben Rachmut,Roie Zivan,William Yeoh
机构: 未知
类目: Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC)
备注:

点击查看摘要

Abstract:Researchers recently extended Distributed Constraint Optimization Problems (DCOPs) to Communication-Aware DCOPs so that they are applicable in scenarios in which messages can be arbitrarily delayed. Distributed asynchronous local search and inference algorithms designed for CA-DCOPs are less vulnerable to message latency than their counterparts for regular DCOPs. However, unlike local search algorithms for (regular) DCOPs that converge to k-opt solutions (with k 1), that is, they converge to solutions that cannot be improved by a group of k agents), local search CA-DCOP algorithms are limited to 1-opt solutions only. In this paper, we introduce Latency-Aware Monotonic Distributed Local Search-2 (LAMDLS-2), where agents form pairs and coordinate bilateral assignment replacements. LAMDLS-2 is monotonic, converges to a 2-opt solution, and is also robust to message latency, making it suitable for CA-DCOPs. Our results indicate that LAMDLS-2 converges faster than MGM-2, a benchmark algorithm, to a similar 2-opt solution, in various message latency scenarios.
zh

[AI-152] Intanify AI Platform: Embedded AI for Automated IP Audit and Due Diligence AAAI

【速读】：该论文试图解决中小企业（SMEs）有效利用其无形资产以提取价值的问题。解决方案的关键在于开发了一个基于知识库的平台，并结合“罗塞塔石”（Rosetta Stone）解释单元，将专家知识（包括无形资产评估顾问、专利律师和尽职调查律师的知识）形式化为可操作的知识表示。此外，通过构建风险预警、风险评分和估值系统，形成了更高阶的元知识表达（即第二阶知识图谱），并通过直观的图形用户界面实现易用性。这一新兴的人工智能应用不仅支持初始部署，还持续迭代优化。

链接: https://arxiv.org/abs/2503.17374
作者: Viktor Dorfler,Dylan Dryden,Viet Lee
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE); Human-Computer Interaction (cs.HC)
备注: 9 pages, 4 figures, presented on 3rd March at the AAAI conference, Deployable AI workshop. Paper location can be seen at this https URL For associated video, see this https URL

点击查看摘要

Abstract:In this paper we introduce a Platform created in order to support SMEs’ endeavor to extract value from their intangible assets effectively. To implement the Platform, we developed five knowledge bases using a knowledge-based ex-pert system shell that contain knowledge from intangible as-set consultants, patent attorneys and due diligence lawyers. In order to operationalize the knowledge bases, we developed a “Rosetta Stone”, an interpreter unit for the knowledge bases outside the shell and embedded in the plat-form. Building on the initial knowledge bases we have created a system of red flags, risk scoring, and valuation with the involvement of the same experts; these additional systems work upon the initial knowledge bases and therefore they can be regarded as meta-knowledge-representations that take the form of second-order knowledge graphs. All this clever technology is dressed up in an easy-to-handle graphical user interface that we will showcase at the conference. The initial platform was finished mid-2024; therefore, it qualifies as an “emerging application of AI” and “deployable AI”, while development continues. The two firms that provided experts for developing the knowledge bases obtained a white-label version of the product (i.e. it runs under their own brand “powered by Intanify”), and there are two completed cases.
zh

[AI-153] Sequence models for by-trial decoding of cognitive strategies from neural data

【速读】：该论文试图解决传统认知神经科学方法在解析个体决策过程中认知操作序列时面临的挑战，即依赖群体统计而掩盖了单次试验间认知策略的变化。为应对这一问题，论文提出了一种结合隐含多变量模式分析（Hidden Multivariate Pattern Analysis）与结构化状态空间序列模型（Structured State Space Sequence Model）的新方法，旨在从脑电图数据中以单次试验级别解码认知策略。关键在于将机器学习技术应用于捕捉单次试验级别的动态变化，从而揭示决策策略中的异质性及额外的认知操作——“确认”（Confirmation），并证明其与正确反应概率及行为调整的相关性。这种方法展示了序列建模在认知神经科学中的潜力，能够揭示被聚合分析所忽略的试验级变异性。

链接: https://arxiv.org/abs/2504.10028
作者: Rick den Otter,Gabriel Weindel,Sjoerd Stuit,Leendert van Maanen
机构: 未知
类目: Neurons and Cognition (q-bio.NC); Artificial Intelligence (cs.AI)
备注: 15 pages, 6 figures

点击查看摘要

Abstract:Understanding the sequence of cognitive operations that underlie decision-making is a fundamental challenge in cognitive neuroscience. Traditional approaches often rely on group-level statistics, which obscure trial-by-trial variations in cognitive strategies. In this study, we introduce a novel machine learning method that combines Hidden Multivariate Pattern analysis with a Structured State Space Sequence model to decode cognitive strategies from electroencephalography data at the trial level. We apply this method to a decision-making task, where participants were instructed to prioritize either speed or accuracy in their responses. Our results reveal an additional cognitive operation, labeled Confirmation, which seems to occur predominantly in the accuracy condition but also frequently in the speed condition. The modeled probability that this operation occurs is associated with higher probability of responding correctly as well as changes of mind, as indexed by electromyography data. By successfully modeling cognitive operations at the trial level, we provide empirical evidence for dynamic variability in decision strategies, challenging the assumption of homogeneous cognitive processes within experimental conditions. Our approach shows the potential of sequence modeling in cognitive neuroscience to capture trial-level variability that is obscured by aggregate analyses. The introduced method offers a new way to detect and understand cognitive strategies in a data-driven manner, with implications for both theoretical research and practical applications in many fields.
zh

[AI-154] Offline Dynamic Inventory and Pricing Strategy: Addressing Censored and Dependent Demand

【速读】：本文研究了离线序列特征驱动的定价与库存控制问题，其中当前需求依赖于过去的需求水平，并且超出可用库存的需求将丢失。目标是利用包含过去价格、订购量、库存水平、协变量和截尾销售水平的离线数据集，估计一种最大化长期利润的最优定价与库存控制策略。由于截尾现象的存在，传统的马尔可夫决策过程（MDP）模型面临挑战，包括利润信息缺失、马尔可夫性质失效以及非平稳性。为克服这些障碍，本文通过求解一个高阶MDP来近似最优策略，该MDP由连续截尾实例的数量表征，最终归结为求解针对此问题定制的特殊贝尔曼方程。受离线强化学习和生存分析的启发，本文提出了两种新颖的数据驱动算法来求解这些贝尔曼方程，从而估计最优策略。此外，建立了有限样本后悔界以验证算法的有效性，并通过数值实验展示了算法在估计最优策略方面的有效性。据我们所知，这是首个针对具有截尾和依赖性需求的序列决策环境下的最优定价与库存控制策略的数据驱动方法。提出的算法实现可在提供的链接获取。

链接: https://arxiv.org/abs/2504.09831
作者: Korel Gundem,Zhengling Qi
机构: 未知
类目: Machine Learning (stat.ML); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Statistics Theory (math.ST); Applications (stat.AP)
备注:

点击查看摘要

Abstract:In this paper, we study the offline sequential feature-based pricing and inventory control problem where the current demand depends on the past demand levels and any demand exceeding the available inventory is lost. Our goal is to leverage the offline dataset, consisting of past prices, ordering quantities, inventory levels, covariates, and censored sales levels, to estimate the optimal pricing and inventory control policy that maximizes long-term profit. While the underlying dynamic without censoring can be modeled by Markov decision process (MDP), the primary obstacle arises from the observed process where demand censoring is present, resulting in missing profit information, the failure of the Markov property, and a non-stationary optimal policy. To overcome these challenges, we first approximate the optimal policy by solving a high-order MDP characterized by the number of consecutive censoring instances, which ultimately boils down to solving a specialized Bellman equation tailored for this problem. Inspired by offline reinforcement learning and survival analysis, we propose two novel data-driven algorithms to solving these Bellman equations and, thus, estimate the optimal policy. Furthermore, we establish finite sample regret bounds to validate the effectiveness of these algorithms. Finally, we conduct numerical experiments to demonstrate the efficacy of our algorithms in estimating the optimal policy. To the best of our knowledge, this is the first data-driven approach to learning optimal pricing and inventory control policies in a sequential decision-making environment characterized by censored and dependent demand. The implementations of the proposed algorithms are available at this https URL
zh

[AI-155] A simulation-heuristics dual-process model for intuitive physics

【速读】：该论文试图解决的问题是：在不同模拟成本的情景下，人类是否始终依赖心理模拟进行物理推理，以及心理模拟的适用边界在哪里。论文通过倒豆子任务的人类研究发现，在简单场景中心理模拟能够准确反映人类判断，但在模拟时间超过某一阈值时，线性启发式模型更能匹配人类预测。基于这些观察，论文提出了一个双过程框架——模拟-启发式模型（Simulation-Heuristics Model, SHM），其关键在于当短时模拟有效时采用心理模拟，而当模拟变得昂贵时切换至启发式方法。通过将此前被视为独立的计算方法整合到统一模型中，SHM定量捕捉了这种切换机制，并在多种场景中更精确地反映了人类行为，从而深化了对直觉物理推理适应性的理解。

链接: https://arxiv.org/abs/2504.09546
作者: Shiqian Li,Yuxi Ma,Jiajun Yan,Bo Dai,Yujia Peng,Chi Zhang,Yixin Zhu
机构: 未知
类目: Physics Education (physics.ed-ph); Artificial Intelligence (cs.AI)
备注: 8 pages, CogSci 2025

点击查看摘要

Abstract:The role of mental simulation in human physical reasoning is widely acknowledged, but whether it is employed across scenarios with varying simulation costs and where its boundary lies remains unclear. Using a pouring-marble task, our human study revealed two distinct error patterns when predicting pouring angles, differentiated by simulation time. While mental simulation accurately captured human judgments in simpler scenarios, a linear heuristic model better matched human predictions when simulation time exceeded a certain boundary. Motivated by these observations, we propose a dual-process framework, Simulation-Heuristics Model (SHM), where intuitive physics employs simulation for short-time simulation but switches to heuristics when simulation becomes costly. By integrating computational methods previously viewed as separate into a unified model, SHM quantitatively captures their switching mechanism. The SHM aligns more precisely with human behavior and demonstrates consistent predictive performance across diverse scenarios, advancing our understanding of the adaptive nature of intuitive physical reasoning.
zh

[AI-156] PolyConf: Unlocking Polymer Conformation Generation through Hierarchical Generative Models

【速读】：该论文旨在解决聚合物构象生成这一关键任务中存在的挑战，由于聚合物独特的结构特性，现有的小分子和蛋白质构象生成方法难以有效生成聚合物构象，同时缺乏高质量的聚合物构象数据集进一步限制了相关研究进展。为了解决这些问题，论文提出了一种名为PolyConf的创新性解决方案，其关键是利用分层生成模型将聚合物构象分解为一系列局部构象（即重复单元的构象），并通过自回归模型生成这些局部构象，再借助扩散模型生成方向变换以组装成完整的聚合物构象。此外，作者构建了首个包含高质量聚合物构象的数据集，用于推动该领域的研究发展。综合评估表明，PolyConf能够一致地生成高质量的聚合物构象，从而促进聚合物建模与模拟的进步。

链接: https://arxiv.org/abs/2504.08859
作者: Fanmeng Wang,Wentao Guo,Qi Ou,Hongshuai Wang,Haitao Lin,Hongteng Xu,Zhifeng Gao
机构: 未知
类目: oft Condensed Matter (cond-mat.soft); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Polymer conformation generation is a critical task that enables atomic-level studies of diverse polymer materials. While significant advances have been made in designing various conformation generation methods for small molecules and proteins, these methods struggle to generate polymer conformations due to polymers’ unique structural characteristics. The scarcity of polymer conformation datasets further limits progress, making this promising area largely unexplored. In this work, we propose PolyConf, a pioneering tailored polymer conformation generation method that leverages hierarchical generative models to unlock new possibilities for this task. Specifically, we decompose the polymer conformation into a series of local conformations (i.e., the conformations of its repeating units), generating these local conformations through an autoregressive model. We then generate corresponding orientation transformations via a diffusion model to assemble these local conformations into the complete polymer conformation. Moreover, we develop the first benchmark with a high-quality polymer conformation dataset derived from molecular dynamics simulations to boost related research in this area. The comprehensive evaluation demonstrates that PolyConf consistently generates high-quality polymer conformations, facilitating advancements in polymer modeling and simulation.
zh

[AI-157] Artificial Intelligence Augmented Medical Imaging Reconstruction in Radiation Therapy

【速读】：该论文旨在解决现代放射治疗（Radiotherapy, RT）中高效获取与精确重建成像的关键挑战。论文聚焦于通过人工智能（Artificial Intelligence, AI）技术提升CT图像重建质量与速度、优化双能CT（Dual-Energy CT, DECT）多材料分解（Multi-Material Decomposition, MMD），以及显著加速4D MRI采集。解决方案的关键在于利用AI的隐函数定义与数据驱动特征学习能力，实现更高效、便捷的医学影像重建框架，从而增强放射治疗的效果与效率。

链接: https://arxiv.org/abs/2504.08844
作者: Di Xu
机构: 未知
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI)
备注: PhD thesis

点击查看摘要

Abstract:Efficiently acquired and precisely reconstructed imaging are crucial to the success of modern radiation therapy (RT). Computed tomography (CT) and magnetic resonance imaging (MRI) are two common modalities for providing RT treatment planning and delivery guidance/monitoring. In recent decades, artificial intelligence (AI) has emerged as a powerful and widely adopted technique across various fields, valued for its efficiency and convenience enabled by implicit function definition and data-driven feature representation learning. Here, we present a series of AI-driven medical imaging reconstruction frameworks for enhanced radiotherapy, designed to improve CT image reconstruction quality and speed, refine dual-energy CT (DECT) multi-material decomposition (MMD), and significantly accelerate 4D MRI acquisition.
zh

[AI-158] High-order expansion of Neural Ordinary Differential Equations flows

【速读】：本文旨在解决神经常微分方程（Neural Ordinary Differential Equations, Neural ODEs）在实际应用中的可解释性不足问题。尽管神经 ODEs 在机器学习与动力系统结合方面展现了巨大潜力，但其黑箱特性限制了部署中的信任度，尤其是在高阶动态行为的理解上。传统分析方法受限于计算开销，仅能利用一阶梯度信息，难以深入洞察复杂动态。

为解决此问题，论文提出了一种基于高阶微分的框架——事件转移张量（Event Transition Tensors, ETT），以提供神经 ODE 动力学在事件流形上的严格数学描述。关键在于通过引入高阶微分工具，将原本不透明的黑箱动态转化为具有明确数学结构的表达形式，从而显著提升神经 ODE 的可解释性和理论严谨性。该方法在多种应用场景中验证了其有效性，包括数据驱动的捕食者-猎物模型不确定性分析、最优反馈控制动力学研究以及三体哈密顿系统的着陆轨迹映射。通过这一创新，论文为事件触发神经微分方程提供了更坚实的理论基础，并为复杂系统动力学的解释性建模开辟了新路径。

链接: https://arxiv.org/abs/2504.08769
作者: Dario Izzo,Sebastien Origer,Giacomo Acciarini,Francesco Biscani
机构: 未知
类目: Optimization and Control (math.OC); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Artificial neural networks, widely recognised for their role in machine learning, are now transforming the study of ordinary differential equations (ODEs), bridging data-driven modelling with classical dynamical systems and enabling the development of infinitely deep neural models. However, the practical applicability of these models remains constrained by the opacity of their learned dynamics, which operate as black-box systems with limited explainability, thereby hindering trust in their deployment. Existing approaches for the analysis of these dynamical systems are predominantly restricted to first-order gradient information due to computational constraints, thereby limiting the depth of achievable insight. Here, we introduce Event Transition Tensors, a framework based on high-order differentials that provides a rigorous mathematical description of neural ODE dynamics on event manifolds. We demonstrate its versatility across diverse applications: characterising uncertainties in a data-driven prey-predator control model, analysing neural optimal feedback dynamics, and mapping landing trajectories in a three-body neural Hamiltonian system. In all cases, our method enhances the interpretability and rigour of neural ODEs by expressing their behaviour through explicit mathematical structures. Our findings contribute to a deeper theoretical foundation for event-triggered neural differential equations and provide a mathematical construct for explaining complex system dynamics.
zh

机器学习

[LG-0] M1: Towards Scalable Test-Time Compute with Mamba Reasoning Models

链接: https://arxiv.org/abs/2504.10449
作者: Junxiong Wang,Wen-Ding Li,Daniele Paliotta,Daniel Ritter,Alexander M. Rush,Tri Dao
类目: Machine Learning (cs.LG)
*备注: Code is available this https URL

点击查看摘要

Abstract:Effective reasoning is crucial to solving complex mathematical problems. Recent large language models (LLMs) have boosted performance by scaling test-time computation through long chain-of-thought reasoning. However, transformer-based models are inherently limited in extending context length due to their quadratic computational complexity and linear memory requirements. In this paper, we introduce a novel hybrid linear RNN reasoning model, M1, built on the Mamba architecture, which allows memory-efficient inference. Our approach leverages a distillation process from existing reasoning models and is further enhanced through RL training. Experimental results on the AIME and MATH benchmarks show that M1 not only outperforms previous linear RNN models but also matches the performance of state-of-the-art Deepseek R1 distilled reasoning models at a similar scale. We also compare our generation speed with a highly performant general purpose inference engine, vLLM, and observe more than a 3x speedup compared to a same size transformer. With throughput speedup, we are able to achieve higher accuracy compared to DeepSeek R1 distilled transformer reasoning models under a fixed generation time budget using self-consistency voting. Overall, we introduce a hybrid Mamba reasoning model and provide a more effective approach to scaling test-time generation using self-consistency or long chain of thought reasoning.

[LG-1] Foundation models for electronic health records: representation dynamics and transferability

链接: https://arxiv.org/abs/2504.10422
作者: Michael C. Burkhart,Bashar Ramadan,Zewei Liao,Kaveri Chhikara,Juan C. Rojas,William F. Parker,Brett K. Beaulieu-Jones
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Foundation models (FMs) trained on electronic health records (EHRs) have shown strong performance on a range of clinical prediction tasks. However, adapting these models to local health systems remains challenging due to limited data availability and resource constraints. In this study, we investigated what these models learn and evaluated the transferability of an FM trained on MIMIC-IV to an institutional EHR dataset at the University of Chicago Medical Center. We assessed their ability to identify outlier patients and examined representation-space patient trajectories in relation to future clinical outcomes. We also evaluated the performance of supervised fine-tuned classifiers on both source and target datasets. Our findings offer insights into the adaptability of FMs across different healthcare systems, highlight considerations for their effective implementation, and provide an empirical analysis of the underlying factors that contribute to their predictive performance.

[LG-2] Satellite Federated Fine-Tuning for Foundation Models in Space Computing Power Networks

链接: https://arxiv.org/abs/2504.10403
作者: Yan zhu,Jingyang zhu,Ting Wang,Yuanming Shi,Chunxiao Jiang,Khaled Ben Letaief
类目: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC); Networking and Internet Architecture (cs.NI)
*备注:

点击查看摘要

Abstract:Advancements in artificial intelligence (AI) and low-earth orbit (LEO) satellites have promoted the application of large remote sensing foundation models for various downstream tasks. However, direct downloading of these models for fine-tuning on the ground is impeded by privacy concerns and limited bandwidth. Satellite federated learning (FL) offers a solution by enabling model fine-tuning directly on-board satellites and aggregating model updates without data downloading. Nevertheless, for large foundation models, the computational capacity of satellites is insufficient to support effective on-board fine-tuning in traditional satellite FL frameworks. To address these challenges, we propose a satellite-ground collaborative federated fine-tuning framework. The key of the framework lies in how to reasonably decompose and allocate model components to alleviate insufficient on-board computation capabilities. During fine-tuning, satellites exchange intermediate results with ground stations or other satellites for forward propagation and back propagation, which brings communication challenges due to the special communication topology of space transmission networks, such as intermittent satellite-ground communication, short duration of satellite-ground communication windows, and unstable inter-orbit inter-satellite links (ISLs). To reduce transmission delays, we further introduce tailored communication strategies that integrate both communication and computing resources. Specifically, we propose a parallel intra-orbit communication strategy, a topology-aware satellite-ground communication strategy, and a latency-minimalization inter-orbit communication strategy to reduce space communication costs. Simulation results demonstrate significant reductions in training time with improvements of approximately 33%.

[LG-3] Ctrl-Z: Controlling AI Agents via Resampling

链接: https://arxiv.org/abs/2504.10374
作者: Aryan Bhatt,Cody Rushing,Adam Kaufman,Tyler Tracy,Vasil Georgiev,David Matolcsi,Akbir Khan,Buck Shlegeris
类目: Machine Learning (cs.LG)
*备注: this http URL

点击查看摘要

Abstract:Control evaluations measure whether monitoring and security protocols for AI systems prevent intentionally subversive AI models from causing harm. Our work presents the first control evaluation performed in an agent environment. We construct BashBench, a dataset of 257 challenging multi-step system administration tasks, and evaluate whether various safety measures can prevent an adversarially constructed AI agent from covertly downloading and executing malicious code in this environment. This multi-step setting introduces new attack and defense dynamics, which we investigate in order to design novel control protocols that prevent safety failures without hindering the ability of non-malicious agents to perform useful work. We introduce a class of control protocols called resample protocols that dynamically take additional samples of certain actions. We find these protocols significantly improve on existing techniques by selectively blocking the AI agent from executing suspicious code and incriminating the agent by generating additional examples of dangerous behavior. We measure the tradeoff between attack prevention and usefulness; our best protocol combines resampling with analysis of previous steps, reducing the success rate of attacks from 58% to 7% at a 5% cost to the performance of a non-malicious agent.

[LG-4] DUE: A Deep Learning Framework and Library for Modeling Unknown Equations

链接: https://arxiv.org/abs/2504.10373
作者: Junfeng Chen,Kailiang Wu,Dongbin Xiu
类目: Machine Learning (cs.LG); Dynamical Systems (math.DS); Numerical Analysis (math.NA); Machine Learning (stat.ML)
*备注: 28 pages

点击查看摘要

Abstract:Equations, particularly differential equations, are fundamental for understanding natural phenomena and predicting complex dynamics across various scientific and engineering disciplines. However, the governing equations for many complex systems remain unknown due to intricate underlying mechanisms. Recent advancements in machine learning and data science offer a new paradigm for modeling unknown equations from measurement or simulation data. This paradigm shift, known as data-driven discovery or modeling, stands at the forefront of AI for science, with significant progress made in recent years. In this paper, we introduce a systematic framework for data-driven modeling of unknown equations using deep learning. This versatile framework is capable of learning unknown ODEs, PDEs, DAEs, IDEs, SDEs, reduced or partially observed systems, and non-autonomous differential equations. Based on this framework, we have developed Deep Unknown Equations (DUE), an open-source software package designed to facilitate the data-driven modeling of unknown equations using modern deep learning techniques. DUE serves as an educational tool for classroom instruction, enabling students and newcomers to gain hands-on experience with differential equations, data-driven modeling, and contemporary deep learning approaches such as FNN, ResNet, generalized ResNet, operator semigroup networks (OSG-Net), and Transformers. Additionally, DUE is a versatile and accessible toolkit for researchers across various scientific and engineering fields. It is applicable not only for learning unknown equations from data but also for surrogate modeling of known, yet complex, equations that are costly to solve using traditional numerical methods. We provide detailed descriptions of DUE and demonstrate its capabilities through diverse examples, which serve as templates that can be easily adapted for other applications.

[LG-5] Domain-Adversarial Neural Network and Explainable AI for Reducing Tissue-of-Origin Signal in Pan-cancer Mortality Classification

链接: https://arxiv.org/abs/2504.10343
作者: Cristian Padron-Manrique,Juan José Oropeza Valdez,Osbaldo Resendis-Antonio
类目: Machine Learning (cs.LG); Quantitative Methods (q-bio.QM)
*备注:

点击查看摘要

Abstract:Tissue-of-origin signals dominate pan-cancer gene expression, often obscuring molecular features linked to patient survival. This hampers the discovery of generalizable biomarkers, as models tend to overfit tissue-specific patterns rather than capture survival-relevant signals. To address this, we propose a Domain-Adversarial Neural Network (DANN) trained on TCGA RNA-seq data to learn representations less biased by tissue and more focused on survival. Identifying tissue-independent genetic profiles is key to revealing core cancer programs. We assess the DANN using: (1) Standard SHAP, based on the original input space and DANN’s mortality classifier; (2) A layer-aware strategy applied to hidden activations, including an unsupervised manifold from raw activations and a supervised manifold from mortality-specific SHAP values. Standard SHAP remains confounded by tissue signals due to biases inherent in its computation. The raw activation manifold was dominated by high-magnitude activations, which masked subtle tissue and mortality-related signals. In contrast, the layer-aware SHAP manifold offers improved low-dimensional representations of both tissue and mortality signals, independent of activation strength, enabling subpopulation stratification and pan-cancer identification of survival-associated genes.

[LG-6] Siamese Network with Dual Attention for EEG-Driven Social Learning: Bridging the Human-Robot Gap in Long-Tail Autonomous Driving

链接: https://arxiv.org/abs/2504.10296
作者: Xiaoshan Zhou,Carol C. Menassa,Vineet R. Kamat
类目: Robotics (cs.RO); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
*备注: 50 pages, 18 figures

点击查看摘要

Abstract:Robots with wheeled, quadrupedal, or humanoid forms are increasingly integrated into built environments. However, unlike human social learning, they lack a critical pathway for intrinsic cognitive development, namely, learning from human feedback during interaction. To understand human ubiquitous observation, supervision, and shared control in dynamic and uncertain environments, this study presents a brain-computer interface (BCI) framework that enables classification of Electroencephalogram (EEG) signals to detect cognitively demanding and safety-critical events. As a timely and motivating co-robotic engineering application, we simulate a human-in-the-loop scenario to flag risky events in semi-autonomous robotic driving-representative of long-tail cases that pose persistent bottlenecks to the safety performance of smart mobility systems and robotic vehicles. Drawing on recent advances in few-shot learning, we propose a dual-attention Siamese convolutional network paired with Dynamic Time Warping Barycenter Averaging approach to generate robust EEG-encoded signal representations. Inverse source localization reveals activation in Broadman areas 4 and 9, indicating perception-action coupling during task-relevant mental imagery. The model achieves 80% classification accuracy under data-scarce conditions and exhibits a nearly 100% increase in the utility of salient features compared to state-of-the-art methods, as measured through integrated gradient attribution. Beyond performance, this study contributes to our understanding of the cognitive architecture required for BCI agents-particularly the role of attention and memory mechanisms-in categorizing diverse mental states and supporting both inter- and intra-subject adaptation. Overall, this research advances the development of cognitive robotics and socially guided learning for service robots in complex built environments.

[LG-7] α-Flow: A Unified Framework for Continuous-State Discrete Flow Matching Models

链接: https://arxiv.org/abs/2504.10283
作者: Chaoran Cheng,Jiahan Li,Jiajun Fan,Ge Liu
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:Recent efforts have extended the flow-matching framework to discrete generative modeling. One strand of models directly works with the continuous probabilities instead of discrete tokens, which we colloquially refer to as Continuous-State Discrete Flow Matching (CS-DFM). Existing CS-DFM models differ significantly in their representations and geometric assumptions. This work presents a unified framework for CS-DFM models, under which the existing variants can be understood as operating on different \alpha -representations of probabilities. Building upon the theory of information geometry, we introduce \alpha -Flow, a family of CS-DFM models that adheres to the canonical \alpha -geometry of the statistical manifold, and demonstrate its optimality in minimizing the generalized kinetic energy. Theoretically, we show that the flow matching loss for \alpha -flow establishes a unified variational bound for the discrete negative log-likelihood. We comprehensively evaluate different instantiations of \alpha -flow on various discrete generation domains to demonstrate their effectiveness in discrete generative modeling, including intermediate values whose geometries have never been explored before. \alpha -flow significantly outperforms its discrete-state counterpart in image and protein sequence generation and better captures the entropy in language modeling.

[LG-8] Sidecar: A Structure-Preserving Framework for Solving Partial Differential Equations with Neural Networks

链接: https://arxiv.org/abs/2504.10273
作者: Gaohang Chen,Zhonghua Qiao
类目: Machine Learning (cs.LG); Numerical Analysis (math.NA)
*备注:

点击查看摘要

Abstract:Solving partial differential equations (PDEs) with neural networks (NNs) has shown great potential in various scientific and engineering fields. However, most existing NN solvers mainly focus on satisfying the given PDEs, without explicitly considering intrinsic physical properties such as mass conservation or energy dissipation. This limitation can result in unstable or nonphysical solutions, particularly in long-term simulations. To address this issue, we propose Sidecar, a novel framework that enhances the accuracy and physical consistency of existing NN solvers by incorporating structure-preserving knowledge. Inspired by the Time-Dependent Spectral Renormalization (TDSR) approach, our Sidecar framework introduces a small copilot network, which is trained to guide the existing NN solver in preserving physical structure. This framework is designed to be highly flexible, enabling the incorporation of structure-preserving principles from diverse PDEs into a wide range of NN solvers. Our experimental results on benchmark PDEs demonstrate the improvement of the existing neural network solvers in terms of accuracy and consistency with structure-preserving properties.

[LG-9] nyverseGP: Towards a Modular Cross-domain Benchmarking Framework for Genetic Programming GECCO’25 GECCO

链接: https://arxiv.org/abs/2504.10253
作者: Roman Kalkreuth,Fabricio Olivetti de França,Julian Dierkes,Marie Anastacio,Anja Jankovic,Zdenek Vasicek,Holger Hoos
类目: Neural and Evolutionary Computing (cs.NE); Machine Learning (cs.LG); Symbolic Computation (cs.SC)
*备注: Accepted for presentation as a poster at the Genetic and Evolutionary Computation Conference (GECCO) and will appear in the GECCO’25 companion. GECCO’25 will be held July 14-18, 2025 in Málaga, Spain

点击查看摘要

Abstract:Over the years, genetic programming (GP) has evolved, with many proposed variations, especially in how they represent a solution. Being essentially a program synthesis algorithm, it is capable of tackling multiple problem domains. Current benchmarking initiatives are fragmented, as the different representations are not compared with each other and their performance is not measured across the different domains. In this work, we propose a unified framework, dubbed TinyverseGP (inspired by tinyGP), which provides support to multiple representations and problem domains, including symbolic regression, logic synthesis and policy search.

[LG-10] GNN-ACLP: Graph Neural Networks based Analog Circuit Link Prediction

链接: https://arxiv.org/abs/2504.10240
作者: Guanyuan Pan,Tiansheng Zhou,Bingtao Ma,Yaqi Wang,Jianxiang Zhao,Shuai Wang
类目: Hardware Architecture (cs.AR); Machine Learning (cs.LG)
*备注: Data will be made available on request

点击查看摘要

Abstract:Circuit link prediction identifying missing component connections from incomplete netlists is crucial in automating analog circuit design. However, existing methods face three main challenges: 1) Insufficient use of topological patterns in circuit graphs reduces prediction accuracy; 2) Data scarcity due to the complexity of annotations hinders model generalization; 3) Limited adaptability to various netlist formats. We propose GNN-ACLP, a Graph Neural Networks (GNNs) based framework featuring three innovations to tackle these challenges. First, we introduce the SEAL (Subgraphs, Embeddings, and Attributes for Link Prediction) framework and achieve port-level accuracy in circuit link prediction. Second, we propose Netlist Babel Fish, a netlist format conversion tool leveraging retrieval-augmented generation (RAG) with large language model (LLM) to enhance the compatibility of netlist formats. Finally, we construct SpiceNetlist, a comprehensive dataset that contains 775 annotated circuits across 10 different classes of components. The experimental results demonstrate an improvement of 15.05% on the SpiceNetlist dataset and 12.01% on the Image2Net dataset over the existing approach.

[LG-11] A Model Zoo of Vision Transformers ICLR

链接: https://arxiv.org/abs/2504.10231
作者: Damian Falk,Léo Meynent,Florence Pfammatter,Konstantin Schürholt,Damian Borth
类目: Machine Learning (cs.LG)
*备注: Accepted at the ICLR Workshop on Neural Network Weights as a New Data Modality 2025

点击查看摘要

Abstract:The availability of large, structured populations of neural networks - called ‘model zoos’ - has led to the development of a multitude of downstream tasks ranging from model analysis, to representation learning on model weights or generative modeling of neural network parameters. However, existing model zoos are limited in size and architecture and neglect the transformer, which is among the currently most successful neural network architectures. We address this gap by introducing the first model zoo of vision transformers (ViT). To better represent recent training approaches, we develop a new blueprint for model zoo generation that encompasses both pre-training and fine-tuning steps, and publish 250 unique models. They are carefully generated with a large span of generating factors, and their diversity is validated using a thorough choice of weight-space and behavioral metrics. To further motivate the utility of our proposed dataset, we suggest multiple possible applications grounded in both extensive exploratory experiments and a number of examples from the existing literature. By extending previous lines of similar work, our model zoo allows researchers to push their model population-based methods from the small model regime to state-of-the-art architectures. We make our model zoo available at this http URL.

[LG-12] ROSFD: Robust Online Streaming Fraud Detection with Resilience to Concept Drift in Data Streams

链接: https://arxiv.org/abs/2504.10229
作者: Vivek Yelleti
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Continuous generation of streaming data from diverse sources, such as online transactions and digital interactions, necessitates timely fraud detection. Traditional batch processing methods often struggle to capture the rapidly evolving patterns of fraudulent activities. This paper highlights the critical importance of processing streaming data for effective fraud detection. To address the inherent challenges of latency, scalability, and concept drift in streaming environments, we propose a robust online streaming fraud detection (ROSFD) framework. Our proposed framework comprises two key stages: (i) Stage One: Offline Model Initialization. In this initial stage, a model is built in offline settings using incremental learning principles to overcome the “cold-start” problem. (ii) Stage Two: Real-time Model Adaptation. In this dynamic stage, drift detection algorithms (viz., DDM, EDDM, and ADWIN) are employed to identify concept drift in the incoming data stream and incrementally train the model accordingly. This “train-only-when-required” strategy drastically reduces the number of retrains needed without significantly impacting the area under the receiver operating characteristic curve (AUC). Overall, ROSFD utilizing ADWIN as the drift detection method demonstrated the best performance among the employed methods. In terms of model efficacy, Adaptive Random Forest consistently outperformed other models, achieving the highest AUC in four out of five datasets.

[LG-13] From Prompting to Alignment: A Generative Framework for Query Recommendation

链接: https://arxiv.org/abs/2504.10208
作者: Erxue Min,Hsiu-Yuan Huang,Min Yang,Xihong Yang,Xin Jia,Yunfang Wu,Hengyi Cai,Shuaiqiang Wang,Dawei Yin
类目: Information Retrieval (cs.IR); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:In modern search systems, search engines often suggest relevant queries to users through various panels or components, helping refine their information needs. Traditionally, these recommendations heavily rely on historical search logs to build models, which suffer from cold-start or long-tail issues. Furthermore, tasks such as query suggestion, completion or clarification are studied separately by specific design, which lacks generalizability and hinders adaptation to novel applications. Despite recent attempts to explore the use of LLMs for query recommendation, these methods mainly rely on the inherent knowledge of LLMs or external sources like few-shot examples, retrieved documents, or knowledge bases, neglecting the importance of the calibration and alignment with user feedback, thus limiting their practical utility. To address these challenges, we first propose a general Generative Query Recommendation (GQR) framework that aligns LLM-based query generation with user preference. Specifically, we unify diverse query recommendation tasks by a universal prompt framework, leveraging the instruct-following capability of LLMs for effective generation. Secondly, we align LLMs with user feedback via presenting a CTR-alignment framework, which involves training a query-wise CTR predictor as a process reward model and employing list-wise preference alignment to maximize the click probability of the generated query list. Furthermore, recognizing the inconsistency between LLM knowledge and proactive search intents arising from the separation of user-initiated queries from models, we align LLMs with user initiative via retrieving co-occurrence queries as side information when historical logs are available.

[LG-14] Challenges in interpretability of additive models

链接: https://arxiv.org/abs/2504.10169
作者: Xinyu Zhang,Julien Martinelli,ST John
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:We review generalized additive models as a type of transparent'' model that has recently seen renewed interest in the deep learning community as neural additive models. We highlight multiple types of nonidentifiability in this model class and discuss challenges in interpretability, arguing for restraint when claiming interpretability’’ or ``suitability for safety-critical applications’’ of such models.

[LG-15] Continual learning for rotating machinery fault diagnosis with cross-domain environmental and operational variations

链接: https://arxiv.org/abs/2504.10151
作者: Diogo Risca,Afonso Lourenço,Goreti Marreiros
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Although numerous machine learning models exist to detect issues like rolling bearing strain and deformation, typically caused by improper mounting, overloading, or poor lubrication, these models often struggle to isolate faults from the noise of real-world operational and environmental variability. Conditions such as variable loads, high temperatures, stress, and rotational speeds can mask early signs of failure, making reliable detection challenging. To address these limitations, this work proposes a continual deep learning approach capable of learning across domains that share underlying structure over time. This approach goes beyond traditional accuracy metrics by addressing four second-order challenges: catastrophic forgetting (where new learning overwrites past knowledge), lack of plasticity (where models fail to adapt to new data), forward transfer (using past knowledge to improve future learning), and backward transfer (refining past knowledge with insights from new domains). The method comprises a feature generator and domain-specific classifiers, allowing capacity to grow as new domains emerge with minimal interference, while an experience replay mechanism selectively revisits prior domains to mitigate forgetting. Moreover, nonlinear dependencies across domains are exploited by prioritizing replay from those with the highest prior errors, refining models based on most informative past experiences. Experiments show high average domain accuracy (up to 88.96%), with forgetting measures as low as .0027 across non-stationary class-incremental environments.

[LG-16] he Impact of Model Zoo Size and Composition on Weight Space Learning ICLR

链接: https://arxiv.org/abs/2504.10141
作者: Damian Falk,Konstantin Schürholt,Damian Borth
类目: Machine Learning (cs.LG)
*备注: Accepted at the ICLR Workshop on Neural Network Weights as a New Data Modality 2025

点击查看摘要

Abstract:Re-using trained neural network models is a common strategy to reduce training cost and transfer knowledge. Weight space learning - using the weights of trained models as data modality - is a promising new field to re-use populations of pre-trained models for future tasks. Approaches in this field have demonstrated high performance both on model analysis and weight generation tasks. However, until now their learning setup requires homogeneous model zoos where all models share the same exact architecture, limiting their capability to generalize beyond the population of models they saw during training. In this work, we remove this constraint and propose a modification to a common weight space learning method to accommodate training on heterogeneous populations of models. We further investigate the resulting impact of model diversity on generating unseen neural network model weights for zero-shot knowledge transfer. Our extensive experimental evaluation shows that including models with varying underlying image datasets has a high impact on performance and generalization, for both in- and out-of-distribution settings. Code is available on this http URL.

[LG-17] Uncertainty Propagation in the Fast Fourier Transform

链接: https://arxiv.org/abs/2504.10136
作者: Luca Schmid,Charlotte Muth,Laurent Schmalen
类目: Machine Learning (cs.LG); Signal Processing (eess.SP)
*备注: Submitted to IEEE

点击查看摘要

Abstract:We address the problem of uncertainty propagation in the discrete Fourier transform by modeling the fast Fourier transform as a factor graph. Building on this representation, we propose an efficient framework for approximate Bayesian inference using belief propagation (BP) and expectation propagation, extending its applicability beyond Gaussian assumptions. By leveraging an appropriate BP message representation and a suitable schedule, our method achieves stable convergence with accurate mean and variance estimates. Numerical experiments in representative scenarios from communications demonstrate the practical potential of the proposed framework for uncertainty-aware inference in probabilistic systems operating across both time and frequency domain.

[LG-18] STaRFormer: Semi-Supervised Task-Informed Representation Learning via Dynamic Attention-Based Regional Masking for Sequential Data

链接: https://arxiv.org/abs/2504.10097
作者: Maxmilian Forstenhäusler,Daniel Külzer,Christos Anagnostopoulos,Shameem Puthiya Parambath,Natascha Weber
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Accurate predictions using sequential spatiotemporal data are crucial for various applications. Utilizing real-world data, we aim to learn the intent of a smart device user within confined areas of a vehicle’s surroundings. However, in real-world scenarios, environmental factors and sensor limitations result in non-stationary and irregularly sampled data, posing significant challenges. To address these issues, we developed a Transformer-based approach, STaRFormer, which serves as a universal framework for sequential modeling. STaRFormer employs a novel, dynamic attention-based regional masking scheme combined with semi-supervised contrastive learning to enhance task-specific latent representations. Comprehensive experiments on 15 datasets varying in types (including non-stationary and irregularly sampled), domains, sequence lengths, training samples, and applications, demonstrate the efficacy and practicality of STaRFormer. We achieve notable improvements over state-of-the-art approaches. Code and data will be made available.

[LG-19] owards Scalable Bayesian Optimization via Gradient-Informed Bayesian Neural Networks

链接: https://arxiv.org/abs/2504.10076
作者: Georgios Makrygiorgos,Joshua Hang Sai Ip,Ali Mesbah
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:Bayesian optimization (BO) is a widely used method for data-driven optimization that generally relies on zeroth-order data of objective function to construct probabilistic surrogate models. These surrogates guide the exploration-exploitation process toward finding global optimum. While Gaussian processes (GPs) are commonly employed as surrogates of the unknown objective function, recent studies have highlighted the potential of Bayesian neural networks (BNNs) as scalable and flexible alternatives. Moreover, incorporating gradient observations into GPs, when available, has been shown to improve BO performance. However, the use of gradients within BNN surrogates remains unexplored. By leveraging automatic differentiation, gradient information can be seamlessly integrated into BNN training, resulting in more informative surrogates for BO. We propose a gradient-informed loss function for BNN training, effectively augmenting function observations with local gradient information. The effectiveness of this approach is demonstrated on well-known benchmarks in terms of improved BNN predictions and faster BO convergence as the number of decision variables increases.

[LG-20] Undermining Federated Learning Accuracy in EdgeIoT via Variational Graph Auto-Encoders

链接: https://arxiv.org/abs/2504.10067
作者: Kai Li,Shuyan Hu,Bochun Wu,Sai Zou,Wei Ni,Falko Dressler
类目: Machine Learning (cs.LG)
*备注: 7 pages and 6 figures. Accepted in IEEE IWCMC 2025

点击查看摘要

Abstract:EdgeIoT represents an approach that brings together mobile edge computing with Internet of Things (IoT) devices, allowing for data processing close to the data source. Sending source data to a server is bandwidth-intensive and may compromise privacy. Instead, federated learning allows each device to upload a shared machine-learning model update with locally processed data. However, this technique, which depends on aggregating model updates from various IoT devices, is vulnerable to attacks from malicious entities that may inject harmful data into the learning process. This paper introduces a new attack method targeting federated learning in EdgeIoT, known as data-independent model manipulation attack. This attack does not rely on training data from the IoT devices but instead uses an adversarial variational graph auto-encoder (AV-GAE) to create malicious model updates by analyzing benign model updates intercepted during communication. AV-GAE identifies and exploits structural relationships between benign models and their training data features. By manipulating these structural correlations, the attack maximizes the training loss of the federated learning system, compromising its overall effectiveness.

[LG-21] Using Reinforcement Learning to Integrate Subjective Wellbeing into Climate Adaptation Decision Making ICLR2025

链接: https://arxiv.org/abs/2504.10031
作者: Arthur Vandervoort,Miguel Costa,Morten W. Petersen,Martin Drews,Sonja Haustein,Karyn Morrissey,Francisco C. Pereira
类目: Machine Learning (cs.LG)
*备注: Accepted for presentation at Tackling Climate Change with Machine Learning workshop at ICLR 2025

点击查看摘要

Abstract:Subjective wellbeing is a fundamental aspect of human life, influencing life expectancy and economic productivity, among others. Mobility plays a critical role in maintaining wellbeing, yet the increasing frequency and intensity of both nuisance and high-impact floods due to climate change are expected to significantly disrupt access to activities and destinations, thereby affecting overall wellbeing. Addressing climate adaptation presents a complex challenge for policymakers, who must select and implement policies from a broad set of options with varying effects while managing resource constraints and uncertain climate projections. In this work, we propose a multi-modular framework that uses reinforcement learning as a decision-support tool for climate adaptation in Copenhagen, Denmark. Our framework integrates four interconnected components: long-term rainfall projections, flood modeling, transport accessibility, and wellbeing modeling. This approach enables decision-makers to identify spatial and temporal policy interventions that help sustain or enhance subjective wellbeing over time. By modeling climate adaptation as an open-ended system, our framework provides a structured framework for exploring and evaluating adaptation policy pathways. In doing so, it supports policymakers to make informed decisions that maximize wellbeing in the long run.

[LG-22] Improving Controller Generalization with Dimensionless Markov Decision Processes

链接: https://arxiv.org/abs/2504.10006
作者: Valentin Charvet,Sebastian Stein,Roderick Murray-Smith
类目: Machine Learning (cs.LG)
*备注: 11 pages, 5 figures

点击查看摘要

Abstract:Controllers trained with Reinforcement Learning tend to be very specialized and thus generalize poorly when their testing environment differs from their training one. We propose a Model-Based approach to increase generalization where both world model and policy are trained in a dimensionless state-action space. To do so, we introduce the Dimensionless Markov Decision Process ( \Pi -MDP): an extension of Contextual-MDPs in which state and action spaces are non-dimensionalized with the Buckingham- \Pi theorem. This procedure induces policies that are equivariant with respect to changes in the context of the underlying dynamics. We provide a generic framework for this approach and apply it to a model-based policy search algorithm using Gaussian Process models. We demonstrate the applicability of our method on simulated actuated pendulum and cartpole systems, where policies trained on a single environment are robust to shifts in the distribution of the context.

[LG-23] FLoRA: Sample-Efficient Preference-based RL via Low-Rank Style Adaptation of Reward Functions ICRA

链接: https://arxiv.org/abs/2504.10002
作者: Daniel Marta,Simon Holk,Miguel Vasco,Jens Lundell,Timon Homberger,Finn Busch,Olov Andersson,Danica Kragic,Iolanda Leite
类目: Robotics (cs.RO); Machine Learning (cs.LG)
*备注: Accepted at 2025 IEEE International Conference on Robotics Automation (ICRA). We provide videos of our results and source code at this https URL

点击查看摘要

Abstract:Preference-based reinforcement learning (PbRL) is a suitable approach for style adaptation of pre-trained robotic behavior: adapting the robot’s policy to follow human user preferences while still being able to perform the original task. However, collecting preferences for the adaptation process in robotics is often challenging and time-consuming. In this work we explore the adaptation of pre-trained robots in the low-preference-data regime. We show that, in this regime, recent adaptation approaches suffer from catastrophic reward forgetting (CRF), where the updated reward model overfits to the new preferences, leading the agent to become unable to perform the original task. To mitigate CRF, we propose to enhance the original reward model with a small number of parameters (low-rank matrices) responsible for modeling the preference adaptation. Our evaluation shows that our method can efficiently and effectively adjust robotic behavior to human preferences across simulation benchmark tasks and multiple real-world robotic tasks.

[LG-24] AimTS: Augmented Series and Image Contrastive Learning for Time Series Classification

链接: https://arxiv.org/abs/2504.09993
作者: Yuxuan Chen,Shanshan Huang,Yunyao Cheng,Peng Chen,Zhongwen Rao,Yang Shu,Bin Yang,Lujia Pan,Chenjuan Guo
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Time series classification (TSC) is an important task in time series analysis. Existing TSC methods mainly train on each single domain separately, suffering from a degradation in accuracy when the samples for training are insufficient in certain domains. The pre-training and fine-tuning paradigm provides a promising direction for solving this problem. However, time series from different domains are substantially divergent, which challenges the effective pre-training on multi-source data and the generalization ability of pre-trained models. To handle this issue, we introduce Augmented Series and Image Contrastive Learning for Time Series Classification (AimTS), a pre-training framework that learns generalizable representations from multi-source time series data. We propose a two-level prototype-based contrastive learning method to effectively utilize various augmentations in multi-source pre-training, which learns representations for TSC that can be generalized to different domains. In addition, considering augmentations within the single time series modality are insufficient to fully address classification problems with distribution shift, we introduce the image modality to supplement structural information and establish a series-image contrastive learning to improve the generalization of the learned representations for TSC tasks. Extensive experiments show that after multi-source pre-training, AimTS achieves good generalization performance, enabling efficient learning and even few-shot learning on various downstream TSC datasets.

[LG-25] IsoSEL: Isometric Structural Entropy Learning for Deep Graph Clustering in Hyperbolic Space

链接: https://arxiv.org/abs/2504.09970
作者: Li Sun,Zhenhao Huang,Yujie Wang,Hongbo Lv,Chunyang Liu,Hao Peng,Philip S. Yu
类目: Machine Learning (cs.LG)
*备注: submitted to IEEE TPAMI, 33 pages, including technical appendix of 16 pages

点击查看摘要

Abstract:Graph clustering is a longstanding topic in machine learning. In recent years, deep learning methods have achieved encouraging results, but they still require predefined cluster numbers K, and typically struggle with imbalanced graphs, especially in identifying minority clusters. The limitations motivate us to study a challenging yet practical problem: deep graph clustering without K considering the imbalance in reality. We approach this problem from a fresh perspective of information theory (i.e., structural information). In the literature, structural information has rarely been touched in deep clustering, and the classic definition falls short in its discrete formulation, neglecting node attributes and exhibiting prohibitive complexity. In this paper, we first establish a new Differentiable Structural Information, generalizing the discrete formalism to continuous realm, so that the optimal partitioning tree, revealing the cluster structure, can be created by the gradient backpropagation. Theoretically, we demonstrate its capability in clustering without requiring K and identifying the minority clusters in imbalanced graphs, while reducing the time complexity to O(N) w.r.t. the number of nodes. Subsequently, we present a novel IsoSEL framework for deep graph clustering, where we design a hyperbolic neural network to learn the partitioning tree in the Lorentz model of hyperbolic space, and further conduct Lorentz Tree Contrastive Learning with isometric augmentation. As a result, the partitioning tree incorporates node attributes via mutual information maximization, while the cluster assignment is refined by the proposed tree contrastive learning. Extensive experiments on five benchmark datasets show the IsoSEL outperforms 14 recent baselines by an average of +1.3% in NMI.

[LG-26] anQuan-Climate: A Subseasonal-to-Seasonal Global Weather Model via Incorporate Climatology State

链接: https://arxiv.org/abs/2504.09940
作者: Guowen Li,Xintong Liu,Shilei Cao,Haoyuan Liang,Mengxuan Chen,Lixian Zhang,Jinxiao Zhang,Jiuke Wang,Meng Jin,Juepeng Zheng
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Subseasonal forecasting serves as an important support for Sustainable Development Goals (SDGs), such as climate challenges, agricultural yield and sustainable energy production. However, subseasonal forecasting is a complex task in meteorology due to dissipating initial conditions and delayed external forces. Although AI models are increasingly pushing the boundaries of this forecasting limit, they face two major challenges: error accumulation and Smoothness. To address these two challenges, we propose Climate Furnace Subseasonal-to-Seasonal (TianQuan-Climate), a novel machine learning model designed to provide global daily mean forecasts up to 45 days, covering five upper-air atmospheric variables at 13 pressure levels and two surface variables. Our proposed TianQuan-Climate has two advantages: 1) it utilizes a multi-model prediction strategy to reduce system error impacts in long-term subseasonal forecasts; 2) it incorporates a Content Fusion Module for climatological integration and extends ViT with uncertainty blocks (UD-ViT) to improve generalization by learning from uncertainty. We demonstrate the effectiveness of TianQuan-Climate on benchmarks for weather forecasting and climate projections within the 15 to 45-day range, where TianQuan-Climate outperforms existing numerical and AI methods.

[LG-27] Multi-objective Bayesian Optimization With Mixed-categorical Design Variables for Expensive-to-evaluate Aeronautical Applications

链接: https://arxiv.org/abs/2504.09930
作者: Nathalie Bartoli,Thierry Lefebvre,Rémi Lafage,Paul Saves,Youssef Diouane,Joseph Morlier,Jasper Bussemaker,Giuseppa Donelli,Joao Marcos Gomes de Mello,Massimo Mandorino,Pierluigi Della Vecchia
类目: Machine Learning (cs.LG); Optimization and Control (math.OC); Applications (stat.AP)
*备注:

点击查看摘要

Abstract:This work aims at developing new methodologies to optimize computational costly complex systems (e.g., aeronautical engineering systems). The proposed surrogate-based method (often called Bayesian optimization) uses adaptive sampling to promote a trade-off between exploration and exploitation. Our in-house implementation, called SEGOMOE, handles a high number of design variables (continuous, discrete or categorical) and nonlinearities by combining mixtures of experts for the objective and/or the constraints. Additionally, the method handles multi-objective optimization settings, as it allows the construction of accurate Pareto fronts with a minimal number of function evaluations. Different infill criteria have been implemented to handle multiple objectives with or without constraints. The effectiveness of the proposed method was tested on practical aeronautical applications within the context of the European Project AGILE 4.0 and demonstrated favorable results. A first example concerns a retrofitting problem where a comparison between two optimizers have been made. A second example introduces hierarchical variables to deal with architecture system in order to design an aircraft family. The third example increases drastically the number of categorical variables as it combines aircraft design, supply chain and manufacturing process. In this article, we show, on three different realistic problems, various aspects of our optimization codes thanks to the diversity of the treated aircraft problems.

[LG-28] Moderate Actor-Critic Methods: Controlling Overestimation Bias via Expectile Loss

链接: https://arxiv.org/abs/2504.09929
作者: Ukjo Hwang,Songnam Hong
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Overestimation is a fundamental characteristic of model-free reinforcement learning (MF-RL), arising from the principles of temporal difference learning and the approximation of the Q-function. To address this challenge, we propose a novel moderate target in the Q-function update, formulated as a convex optimization of an overestimated Q-function and its lower bound. Our primary contribution lies in the efficient estimation of this lower bound through the lower expectile of the Q-value distribution conditioned on a state. Notably, our moderate target integrates seamlessly into state-of-the-art (SOTA) MF-RL algorithms, including Deep Deterministic Policy Gradient (DDPG) and Soft Actor Critic (SAC). Experimental results validate the effectiveness of our moderate target in mitigating overestimation bias in DDPG, SAC, and distributional RL algorithms.

[LG-29] Ember: A Compiler for Efficient Embedding Operations on Decoupled Access-Execute Architectures

链接: https://arxiv.org/abs/2504.09870
作者: Marco Siracusa(1),Olivia Hsu(2),Victor Soria-Pardos(1),Joshua Randall(3),Arnaud Grasset(3),Eric Biscondi(3),Doug Joseph(3),Randy Allen(1),Fredrik Kjolstad(2),Miquel Moretó Planas(1 and 4),Adrià Armejach(1 and 4) ((1) Barcelona Supercomputing Center, (2) Stanford University, (3) Arm, (4) Universitat Politècnica de Catalunya)
类目: Hardware Architecture (cs.AR); Machine Learning (cs.LG); Programming Languages (cs.PL)
*备注: 14 pages, 19 figures, under review

点击查看摘要

Abstract:Irregular embedding lookups are a critical bottleneck in recommender models, sparse large language models, and graph learning models. In this paper, we first demonstrate that, by offloading these lookups to specialized access units, Decoupled Access-Execute (DAE) processors achieve 2.6 \times higher performance and 6.4 \times higher performance/watt than GPUs on end-to-end models. Then, we propose the Ember compiler for automatically generating optimized DAE code from PyTorch and TensorFlow. Conversely from other DAE compilers, Ember features multiple intermediate representations specifically designed for different optimization levels. In this way, Ember can implement all optimizations to match the performance of hand-written code, unlocking the full potential of DAE architectures at scale.

[LG-30] RadarLLM : Empowering Large Language Models to Understand Human Motion from Millimeter-wave Point Cloud Sequence

链接: https://arxiv.org/abs/2504.09862
作者: Zengyuan Lai,Jiarui Yang,Songpengcheng Xia,Lizhou Lin,Lan Sun,Renwen Wang,Jianran Liu,Qi Wu,Ling Pei
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Millimeter-wave radar provides a privacy-preserving solution for human motion analysis, yet its sparse point clouds pose significant challenges for semantic understanding. We present Radar-LLM, the first framework that leverages large language models (LLMs) for human motion understanding using millimeter-wave radar as the sensing modality. Our approach introduces two key innovations: (1) a motion-guided radar tokenizer based on our Aggregate VQ-VAE architecture that incorporates deformable body templates and masked trajectory modeling to encode spatiotemporal point clouds into compact semantic tokens, and (2) a radar-aware language model that establishes cross-modal alignment between radar and text in a shared embedding space. To address data scarcity, we introduce a physics-aware synthesis pipeline that generates realistic radar-text pairs from motion-text datasets. Extensive experiments demonstrate that Radar-LLM achieves state-of-the-art performance across both synthetic and real-world benchmarks, enabling accurate translation of millimeter-wave signals to natural language descriptions. This breakthrough facilitates comprehensive motion understanding in privacy-sensitive applications like healthcare and smart homes. We will release the full implementation to support further research on this https URL.

[LG-31] Accelerating Differentially Private Federated Learning via Adaptive Extrapolation

链接: https://arxiv.org/abs/2504.09850
作者: Shokichi Takakura,Seng Pei Liew,Satoshi Hasegawa
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:The federated learning (FL) framework enables multiple clients to collaboratively train machine learning models without sharing their raw data, but it remains vulnerable to privacy attacks. One promising approach is to incorporate differential privacy (DP)-a formal notion of privacy-into the FL framework. DP-FedAvg is one of the most popular algorithms for DP-FL, but it is known to suffer from the slow convergence in the presence of heterogeneity among clients’ data. Most of the existing methods to accelerate DP-FL require 1) additional hyperparameters or 2) additional computational cost for clients, which is not desirable since 1) hyperparameter tuning is computationally expensive and data-dependent choice of hyperparameters raises the risk of privacy leakage, and 2) clients are often resource-constrained. To address this issue, we propose DP-FedEXP, which adaptively selects the global step size based on the diversity of the local updates without requiring any additional hyperparameters or client computational cost. We show that DP-FedEXP provably accelerates the convergence of DP-FedAvg and it empirically outperforms existing methods tailored for DP-FL.

[LG-32] BO-SA-PINNs: Self-adaptive physics-informed neural networks based on Bayesian optimization for automatically designing PDE solvers

链接: https://arxiv.org/abs/2504.09804
作者: Rui Zhang,Liang Li,Stéphane Lanteri,Hao Kang,Jiaqi Li
类目: Computational Engineering, Finance, and Science (cs.CE); Machine Learning (cs.LG); Numerical Analysis (math.NA)
*备注: 23 pages, 5 figure

点击查看摘要

Abstract:Physics-informed neural networks (PINNs) is becoming a popular alternative method for solving partial differential equations (PDEs). However, they require dedicated manual modifications to the hyperparameters of the network, the sampling methods and loss function weights for different PDEs, which reduces the efficiency of the solvers. In this paper, we pro- pose a general multi-stage framework, i.e. BO-SA-PINNs to alleviate this issue. In the first stage, Bayesian optimization (BO) is used to select hyperparameters for the training process, and based on the results of the pre-training, the network architecture, learning rate, sampling points distribution and loss function weights suitable for the PDEs are automatically determined. The proposed hyperparameters search space based on experimental results can enhance the efficiency of BO in identifying optimal hyperparameters. After selecting the appropriate hyperparameters, we incorporate a global self-adaptive (SA) mechanism the second stage. Using the pre-trained model and loss information in the second-stage training, the exponential moving average (EMA) method is employed to optimize the loss function weights, and residual-based adaptive refinement with distribution (RAR-D) is used to optimize the sampling points distribution. In the third stage, L-BFGS is used for stable training. In addition, we introduce a new activation function that enables BO-SA-PINNs to achieve higher accuracy. In numerical experiments, we conduct comparative and ablation experiments to verify the performance of the model on Helmholtz, Maxwell, Burgers and high-dimensional Poisson equations. The comparative experiment results show that our model can achieve higher accuracy and fewer iterations in test cases, and the ablation experiments demonstrate the positive impact of every improvement.

[LG-33] CUT: Pruning Pre-Trained Multi-Task Models into Compact Models for Edge Devices

链接: https://arxiv.org/abs/2504.09803
作者: Jingxuan Zhou,Weidong Bao,Ji Wang,Zhengyi Zhong
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Multi-task learning has garnered widespread attention in the industry due to its efficient data utilization and strong generalization capabilities, making it particularly suitable for providing high-quality intelligent services to users. Edge devices, as the primary platforms directly serving users, play a crucial role in delivering multi-task services. However, current multi-task models are often large, and user task demands are increasingly diverse. Deploying such models directly on edge devices not only increases the burden on these devices but also leads to task redundancy. To address this issue, this paper innovatively proposes a pre-trained multi-task model pruning method specifically designed for edge computing. The goal is to utilize existing pre-trained multi-task models to construct a compact multi-task model that meets the needs of edge devices. The specific implementation steps are as follows: First, decompose the tasks within the pre-trained multi-task model and select tasks based on actual user needs. Next, while retaining the knowledge of the original pre-trained model, evaluate parameter importance and use a parameter fusion method to effectively integrate shared parameters among tasks. Finally, obtain a compact multi-task model suitable for edge devices. To validate the effectiveness of the proposed method, we conducted experiments on three public image datasets. The experimental results fully demonstrate the superiority and efficiency of this method, providing a new solution for multi-task learning on edge devices.

[LG-34] A Tale of Two Learning Algorithms: Multiple Stream Random Walk and Asynchronous Gossip

链接: https://arxiv.org/abs/2504.09792
作者: Peyman Gholami,Hulya Seferoglu
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Although gossip and random walk-based learning algorithms are widely known for decentralized learning, there has been limited theoretical and experimental analysis to understand their relative performance for different graph topologies and data heterogeneity. We first design and analyze a random walk-based learning algorithm with multiple streams (walks), which we name asynchronous “Multi-Walk (MW)”. We provide a convergence analysis for MW w.r.t iteration (computation), wall-clock time, and communication. We also present a convergence analysis for “Asynchronous Gossip”, noting the lack of a comprehensive analysis of its convergence, along with the computation and communication overhead, in the literature. Our results show that MW has better convergence in terms of iterations as compared to Asynchronous Gossip in graphs with large diameters (e.g., cycles), while its relative performance, as compared to Asynchronous Gossip, depends on the number of walks and the data heterogeneity in graphs with small diameters (e.g., complete graphs). In wall-clock time analysis, we observe a linear speed-up with the number of walks and nodes in MW and Asynchronous Gossip, respectively. Finally, we show that MW outperforms Asynchronous Gossip in communication overhead, except in small-diameter topologies with extreme data heterogeneity. These results highlight the effectiveness of each algorithm in different graph topologies and data heterogeneity. Our codes are available for reproducibility.

[LG-35] Dynamical symmetries in the fluctuation-driven regime: an application of Noethers theorem to noisy dynamical systems NEURIPS2024

链接: https://arxiv.org/abs/2504.09761
作者: John J. Vastola
类目: Machine Learning (cs.LG); Statistical Mechanics (cond-mat.stat-mech)
*备注: Accepted to the NeurIPS 2024 Workshop on Symmetry and Geometry in Neural Representations (NeurReps)

点击查看摘要

Abstract:Noether’s theorem provides a powerful link between continuous symmetries and conserved quantities for systems governed by some variational principle. Perhaps unfortunately, most dynamical systems of interest in neuroscience and artificial intelligence cannot be described by any such principle. On the other hand, nonequilibrium physics provides a variational principle that describes how fairly generic noisy dynamical systems are most likely to transition between two states; in this work, we exploit this principle to apply Noether’s theorem, and hence learn about how the continuous symmetries of dynamical systems constrain their most likely trajectories. We identify analogues of the conservation of energy, momentum, and angular momentum, and briefly discuss examples of each in the context of models of decision-making, recurrent neural networks, and diffusion generative models.

[LG-36] Enhancing Classifier Evaluation: A Fairer Benchmarking Strategy Based on Ability and Robustness

链接: https://arxiv.org/abs/2504.09759
作者: Lucas Cardoso,Vitor Santos,José Ribeiro,Regiane Kawasaki,Ricardo Prudêncio,Ronnie Alves
类目: Machine Learning (cs.LG)
*备注: 47 pages, 16 figures, 9 tables

点击查看摘要

Abstract:Benchmarking is a fundamental practice in machine learning (ML) for comparing the performance of classification algorithms. However, traditional evaluation methods often overlook a critical aspect: the joint consideration of dataset complexity and an algorithm’s ability to generalize. Without this dual perspective, assessments may favor models that perform well on easy instances while failing to capture their true robustness. To address this limitation, this study introduces a novel evaluation methodology that combines Item Response Theory (IRT) with the Glicko-2 rating system, originally developed to measure player strength in competitive games. IRT assesses classifier ability based on performance over difficult instances, while Glicko-2 updates performance metrics - such as rating, deviation, and volatility - via simulated tournaments between classifiers. This combined approach provides a fairer and more nuanced measure of algorithm capability. A case study using the OpenML-CC18 benchmark showed that only 15% of the datasets are truly challenging and that a reduced subset with 50% of the original datasets offers comparable evaluation power. Among the algorithms tested, Random Forest achieved the highest ability score. The results highlight the importance of improving benchmark design by focusing on dataset quality and adopting evaluation strategies that reflect both difficulty and classifier proficiency.

[LG-37] Integrating Large Language Models for Automated Structural Analysis

链接: https://arxiv.org/abs/2504.09754
作者: Haoran Liang,Mohammad Talebi Kalaleh,Qipei Mei
类目: oftware Engineering (cs.SE); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Automated analysis for engineering structures offers considerable potential for boosting efficiency by minimizing repetitive tasks. Although AI-driven methods are increasingly common, no systematic framework yet leverages Large Language Models (LLMs) for automatic structural analysis. To address this gap, we propose a novel framework that integrates LLMs with structural analysis software. LLMs serve as the core engine: they parse structural descriptions from text and translate them into executable Python scripts. Moreover, the framework integrates the generative capabilities of LLMs with code-based finite element (FE) tools like OpenSeesPy. It employs domain-specific prompt design and in-context learning strategies to enhance the LLM’s problem-solving capabilities and generative stability, enabling fully automated structural analysis from descriptive text to model outputs. In our experiments, we introduce a well-curated small-scale benchmark dataset of 20 structural analysis word problems (SAWPs) with ground-truth solutions and evaluate the performance of different LLMs within our framework in solving these SAWPs. The role of system instructions, crafted by structural engineers, is also investigated to understand their impact on LLM-driven structural analysis. Additionally, the generative stability of our framework is examined. Through multiple validation experiments on the benchmark, our results demonstrate that the proposed framework can substantially increase the level of automation in solving SAWPs compared to traditional methods. Quantitatively, the framework, built on GPT-4o, achieved 100% accuracy, surpassing GPT-4 (85%), Gemini 1.5 Pro (80%), and Llama-3.3 (30%) on the test examples. Furthermore, integrating domain-specific instructions enhanced performance by 30% on problems with asymmetrical structural configurations.

[LG-38] Stochastic generative methods for stable and accurate closure modeling of chaotic dynamical systems

链接: https://arxiv.org/abs/2504.09750
作者: Emily Williams,David Darmofal
类目: Numerical Analysis (math.NA); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Traditional deterministic subgrid-scale (SGS) models are often dissipative and unstable, especially in regions of chaotic and turbulent flow. Ongoing work in climate science and ocean modeling motivates the use of stochastic SGS models for chaotic dynamics. Further, developing stochastic generative models of underlying dynamics is a rapidly expanding field. In this work, we aim to incorporate stochastic integration toward closure modeling for chaotic dynamical systems. Further, we want to explore the potential stabilizing effect that stochastic models could have on linearized chaotic systems. We propose parametric and generative approaches for closure modeling using stochastic differential equations (SDEs). We derive and implement a quadratic diffusion model based on the fluctuations, demonstrating increased accuracy from bridging theoretical models with generative approaches. Results are demonstrated on the Lorenz-63 dynamical system.

[LG-39] Epsilon-Neighborhood Decision-Boundary Governed Estimation (EDGE) of 2D Black Box Classifier Functions

链接: https://arxiv.org/abs/2504.09733
作者: Mithun Goutham,Riccardo DalferroNucci,Stephanie Stockar,Meghna Menon,Sneha Nayak,Harshad Zade,Chetan Patel,Mario Santillo
类目: Computational Geometry (cs.CG); Machine Learning (cs.LG); Numerical Analysis (math.NA)
*备注:

点击查看摘要

Abstract:Accurately estimating decision boundaries in black box systems is critical when ensuring safety, quality, and feasibility in real-world applications. However, existing methods iteratively refine boundary estimates by sampling in regions of uncertainty, without providing guarantees on the closeness to the decision boundary and also result in unnecessary exploration that is especially disadvantageous when evaluations are costly. This paper presents the Epsilon-Neighborhood Decision-Boundary Governed Estimation (EDGE), a sample efficient and function-agnostic algorithm that leverages the intermediate value theorem to estimate the location of the decision boundary of a black box binary classifier within a user-specified epsilon-neighborhood. Evaluations are conducted on three nonlinear test functions and a case study of an electric grid stability problem with uncertain renewable power injection. The EDGE algorithm demonstrates superior sample efficiency and better boundary approximation than adaptive sampling techniques and grid-based searches.

[LG-40] SPICE: A Synergistic Precise Iterative and Customizable Image Editing Workflow CVPR

链接: https://arxiv.org/abs/2504.09697
作者: Kenan Tang,Yanhong Li,Yao Qin
类目: Graphics (cs.GR); Machine Learning (cs.LG)
*备注: 24 pages, 21 figures. Figure 9(b) has been accepted by CVPR AI Art Gallery 2025

点击查看摘要

Abstract:Recent prompt-based image editing models have demonstrated impressive prompt-following capability at structural editing tasks. However, existing models still fail to perform local edits, follow detailed editing prompts, or maintain global image quality beyond a single editing step. To address these challenges, we introduce SPICE, a training-free workflow that accepts arbitrary resolutions and aspect ratios, accurately follows user requirements, and improves image quality consistently during more than 100 editing steps. By synergizing the strengths of a base diffusion model and a Canny edge ControlNet model, SPICE robustly handles free-form editing instructions from the user. SPICE outperforms state-of-the-art baselines on a challenging realistic image-editing dataset consisting of semantic editing (object addition, removal, replacement, and background change), stylistic editing (texture changes), and structural editing (action change) tasks. Not only does SPICE achieve the highest quantitative performance according to standard evaluation metrics, but it is also consistently preferred by users over existing image-editing methods. We release the workflow implementation for popular diffusion model Web UIs to support further research and artistic exploration.

[LG-41] Adapting to the Unknown: Robust Meta-Learning for Zero-Shot Financial Time Series Forecasting

链接: https://arxiv.org/abs/2504.09664
作者: Anxian Liu,Junying Ma,Guang Zhang
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Financial time series forecasting in the zero-shot setting is essential for risk management and investment decision-making, particularly during abrupt market regime shifts or in emerging markets with limited historical data. While Model-Agnostic Meta-Learning (MAML)-based approaches have shown promise in this domain, existing meta task construction strategies often lead to suboptimal performance, especially when dealing with highly turbulent financial time series. To address this challenge, we propose a novel task construction method that leverages learned embeddings for more effective meta-learning in the zero-shot setting. Specifically, we construct two complementary types of meta-tasks based on the learned embeddings: intra-cluster tasks and inter-cluster tasks. To capture diverse fine-grained patterns, we apply stochastic projection matrices to the learned embeddings and use clustering algorithm to form the tasks. Additionally, to improve generalization capabilities, we employ hard task mining strategies and leverage inter-cluster tasks to identify invariant patterns across different time series. Extensive experiments on the real world financial dataset demonstrate that our method significantly outperforms existing approaches, showing better generalization ability in the zero-shot scenario.

[LG-42] Ordinary Least Squares as an Attention Mechanism

链接: https://arxiv.org/abs/2504.09663
作者: Philippe Goulet Coulombe
类目: Machine Learning (cs.LG); Econometrics (econ.EM); Statistics Theory (math.ST); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:I show that ordinary least squares (OLS) predictions can be rewritten as the output of a restricted attention module, akin to those forming the backbone of large language models. This connection offers an alternative perspective on attention beyond the conventional information retrieval framework, making it more accessible to researchers and analysts with a background in traditional statistics. It falls into place when OLS is framed as a similarity-based method in a transformed regressor space, distinct from the standard view based on partial correlations. In fact, the OLS solution can be recast as the outcome of an alternative problem: minimizing squared prediction errors by optimizing the embedding space in which training and test vectors are compared via inner products. Rather than estimating coefficients directly, we equivalently learn optimal encoding and decoding operations for predictors. From this vantage point, OLS maps naturally onto the query-key-value structure of attention mechanisms. Building on this foundation, I discuss key elements of Transformer-style attention and draw connections to classic ideas from time series econometrics.

[LG-43] Quantization Error Propagation: Revisiting Layer-Wise Post-Training Quantization

链接: https://arxiv.org/abs/2504.09629
作者: Yamato Arai,Yuma Ichikawa
类目: Machine Learning (cs.LG); Applications (stat.AP); Methodology (stat.ME); Machine Learning (stat.ML)
*备注: 16 pages, 1 figure

点击查看摘要

Abstract:Layer-wise post-training quantization has emerged as a widely used technique for compressing large language models (LLMs) without retraining. However, recent progress in this line of research is saturating, underscoring the need to revisit its core limitation and explore further improvements. This study identifies a critical bottleneck in existing layer-wise PTQ methods: the accumulation of quantization errors across layers significantly degrades performance, particularly in low-bit regimes. To address this, we propose Quantization Error Propagation (QEP), a lightweight and general framework that enhances layer-wise PTQ by explicitly propagating the quantization error which enable compensating for accumulated quantization errors. Additionally, we introduce a tunable propagation mechanism that allows for control over both propagation strength and computational overhead, making the framework adaptable to various architectures and resource constraints. Empirical evaluation on LLaMA2 models (7B, 13B, 70B) demonstrate that incorporating QEP into standard layer-wise PTQ pipelines outperforms standard PTQ methods. Notably, QEP yields substantial performance improvements under extreme low-bit quantization settings.

[LG-44] Mitigating Many-Shot Jailbreaking

链接: https://arxiv.org/abs/2504.09604
作者: Christopher M. Ackerman,Nina Panickssery
类目: Machine Learning (cs.LG); Cryptography and Security (cs.CR)
*备注:

点击查看摘要

Abstract:Many-shot jailbreaking (MSJ) is an adversarial technique that exploits the long context windows of modern LLMs to circumvent model safety training by including in the prompt many examples of a fake'' assistant responding inappropriately before the final request. With enough examples, the model's in-context learning abilities override its safety training, and it responds as if it were the fake’’ assistant. In this work, we probe the effectiveness of different fine tuning and input sanitization approaches on mitigating MSJ attacks, alone and in combination. We find incremental mitigation effectiveness for each, and we show that the combined techniques significantly reduce the effectiveness of MSJ attacks, while retaining model performance in benign in-context learning and conversational tasks. We suggest that our approach could meaningfully ameliorate this vulnerability if incorporated into model safety post-training.

[LG-45] ControlNET: A Firewall for RAG -based LLM System

链接: https://arxiv.org/abs/2504.09593
作者: Hongwei Yao,Haoran Shi,Yidou Chen,Yixin Jiang,Cong Wang,Zhan Qin,Kui Ren,Chun Chen
类目: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Retrieval-Augmented Generation (RAG) has significantly enhanced the factual accuracy and domain adaptability of Large Language Models (LLMs). This advancement has enabled their widespread deployment across sensitive domains such as healthcare, finance, and enterprise applications. RAG mitigates hallucinations by integrating external knowledge, yet introduces privacy risk and security risk, notably data breaching risk and data poisoning risk. While recent studies have explored prompt injection and poisoning attacks, there remains a significant gap in comprehensive research on controlling inbound and outbound query flows to mitigate these threats. In this paper, we propose an AI firewall, ControlNET, designed to safeguard RAG-based LLM systems from these vulnerabilities. ControlNET controls query flows by leveraging activation shift phenomena to detect adversarial queries and mitigate their impact through semantic divergence. We conduct comprehensive experiments on four different benchmark datasets including Msmarco, HotpotQA, FinQA, and MedicalSys using state-of-the-art open source LLMs (Llama3, Vicuna, and Mistral). Our results demonstrate that ControlNET achieves over 0.909 AUROC in detecting and mitigating security threats while preserving system harmlessness. Overall, ControlNET offers an effective, robust, harmless defense mechanism, marking a significant advancement toward the secure deployment of RAG-based LLM systems.

[LG-46] Adaptive Cluster-Based Synthetic Minority Oversampling Technique for Traffic Mode Choice Prediction with Imbalanced Dataset

链接: https://arxiv.org/abs/2504.09486
作者: Guang An Ooi,Shehab Ahmed
类目: Machine Learning (cs.LG)
*备注: 6 pages, 3 figures, 6 tables, IEEE conference format

点击查看摘要

Abstract:Urban datasets such as citizen transportation modes often contain disproportionately distributed classes, posing significant challenges to the classification of under-represented samples using data-driven models. In the literature, various resampling methods have been developed to create synthetic data for minority classes (oversampling) or remove samples from majority classes (undersampling) to alleviate class imbalance. However, oversampling approaches tend to overgeneralize minor classes that are closely clustered and neglect sparse regions which may contain crucial information. Conversely, undersampling methods potentially remove useful information on certain subgroups. Hence, a resampling approach that takes the inherent distribution of data into consideration is required to ensure appropriate synthetic data creation. This study proposes an adaptive cluster-based synthetic minority oversampling technique. Density-based spatial clustering is applied on minority classes to identify subgroups based on their input features. The classes in each of these subgroups are then oversampled according to the ratio of data points of their local cluster to the largest majority class. When used in conjunction with machine learning models such as random forest and extreme gradient boosting, this oversampling method results in significantly higher F1 scores for the minority classes compared to other resampling techniques. These improved models provide accurate classification of transportation modes.

[LG-47] GenEDA: Unleashing Generative Reasoning on Netlist via Multimodal Encoder-Decoder Aligned Foundation Model

链接: https://arxiv.org/abs/2504.09485
作者: Wenji Fang,Jing Wang,Yao Lu,Shang Liu,Zhiyao Xie
类目: Machine Learning (cs.LG); Hardware Architecture (cs.AR)
*备注: 9 pages, 9 figures, and 4 tables

点击查看摘要

Abstract:The success of foundation AI has motivated the research of circuit foundation models, which are customized to assist the integrated circuit (IC) design process. However, existing pre-trained circuit models are typically limited to standalone encoders for predictive tasks or decoders for generative tasks. These two model types are developed independently, operate on different circuit modalities, and reside in separate latent spaces, which restricts their ability to complement each other for more advanced applications. In this work, we present GenEDA, the first framework that aligns circuit encoders with decoders within a shared latent space. GenEDA bridges the gap between graph-based circuit representations and text-based large language models (LLMs), enabling communication between their respective latent spaces. To achieve the alignment, we propose two paradigms that support both open-source trainable LLMs and commercial frozen LLMs. Built on this aligned architecture, GenEDA enables three unprecedented generative reasoning tasks over netlists, where the model reversely generates the high-level functionality from low-level netlists in different granularities. These tasks extend traditional gate-type prediction to direct generation of full-circuit functionality. Experiments demonstrate that GenEDA significantly boosts advanced LLMs’ (e.g., GPT-4o and DeepSeek-V3) performance in all tasks.

[LG-48] An overview of condensation phenomenon in deep learning

链接: https://arxiv.org/abs/2504.09484
作者: Zhi-Qin John Xu,Yaoyu Zhang,Zhangchen Zhou
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:In this paper, we provide an overview of a common phenomenon, condensation, observed during the nonlinear training of neural networks: During the nonlinear training of neural networks, neurons in the same layer tend to condense into groups with similar outputs. Empirical observations suggest that the number of condensed clusters of neurons in the same layer typically increases monotonically as training progresses. Neural networks with small weight initializations or Dropout optimization can facilitate this condensation process. We also examine the underlying mechanisms of condensation from the perspectives of training dynamics and the structure of the loss landscape. The condensation phenomenon offers valuable insights into the generalization abilities of neural networks and correlates to stronger reasoning abilities in transformer-based language models.

[LG-49] Rethinking the generalization of drug target affinity prediction algorithms via similarity aware evaluation ICLR2025

链接: https://arxiv.org/abs/2504.09481
作者: Chenbin Zhang,Zhiqiang Hu,Chuchu Jiang,Wen Chen,Jie Xu,Shaoting Zhang
类目: Machine Learning (cs.LG); Methodology (stat.ME)
*备注: ICLR 2025 Oral

点击查看摘要

Abstract:Drug-target binding affinity prediction is a fundamental task for drug discovery. It has been extensively explored in literature and promising results are reported. However, in this paper, we demonstrate that the results may be misleading and cannot be well generalized to real practice. The core observation is that the canonical randomized split of a test set in conventional evaluation leaves the test set dominated by samples with high similarity to the training set. The performance of models is severely degraded on samples with lower similarity to the training set but the drawback is highly overlooked in current evaluation. As a result, the performance can hardly be trusted when the model meets low-similarity samples in real practice. To address this problem, we propose a framework of similarity aware evaluation in which a novel split methodology is proposed to adapt to any desired distribution. This is achieved by a formulation of optimization problems which are approximately and efficiently solved by gradient descent. We perform extensive experiments across five representative methods in four datasets for two typical target evaluations and compare them with various counterpart methods. Results demonstrate that the proposed split methodology can significantly better fit desired distributions and guide the development of models. Code is released at this https URL.

[LG-50] aweSOM: a CPU/GPU-accelerated Self-organizing Map and Statistically Combined Ensemble Framework for Machine-learning Clustering Analysis

链接: https://arxiv.org/abs/2504.09449
作者: Trung Ha,Joonas Nättilä,Jordy Davelaar
类目: Machine Learning (cs.LG); Instrumentation and Methods for Astrophysics (astro-ph.IM); Machine Learning (stat.ML)
*备注: Published in the Journal of Open Source Software; method paper for arXiv: 2410.01878

点击查看摘要

Abstract:We introduce aweSOM, an open-source Python package for machine learning (ML) clustering and classification, using a Self-organizing Maps (SOM) algorithm that incorporates CPU/GPU acceleration to accommodate large ( N 10^6 , where N is the number of data points), multidimensional datasets. aweSOM consists of two main modules, one that handles the initialization and training of the SOM, and another that stacks the results of multiple SOM realizations to obtain more statistically robust clusters. Existing Python-based SOM implementations (e.g., POPSOM, Yuan (2018); MiniSom, Vettigli (2018); sklearn-som) primarily serve as proof-of-concept demonstrations, optimized for smaller datasets, but lacking scalability for large, multidimensional data. aweSOM provides a solution for this gap in capability, with good performance scaling up to \sim 10^8 individual points, and capable of utilizing multiple features per point. We compare the code performance against the legacy implementations it is based on, and find a 10-100x speed up, as well as significantly improved memory efficiency, due to several built-in optimizations.

[LG-51] Constants of motion network revisited

链接: https://arxiv.org/abs/2504.09434
作者: Wenqi Fang,Chao Chen,Yongkui Yang,Zheng Wang
类目: Machine Learning (cs.LG); Classical Physics (physics.class-ph); Machine Learning (stat.ML)
*备注: under revision

点击查看摘要

Abstract:Discovering constants of motion is meaningful in helping understand the dynamical systems, but inevitably needs proficient mathematical skills and keen analytical capabilities. With the prevalence of deep learning, methods employing neural networks, such as Constant Of Motion nETwork (COMET), are promising in handling this scientific problem. Although the COMET method can produce better predictions on dynamics by exploiting the discovered constants of motion, there is still plenty of room to sharpen it. In this paper, we propose a novel neural network architecture, built using the singular-value-decomposition (SVD) technique, and a two-phase training algorithm to improve the performance of COMET. Extensive experiments show that our approach not only retains the advantages of COMET, such as applying to non-Hamiltonian systems and indicating the number of constants of motion, but also can be more lightweight and noise-robust than COMET.

[LG-52] Ensemble-Enhanced Graph Autoencoder with GAT and Transformer-Based Encoders for Robust Fault Diagnosis

链接: https://arxiv.org/abs/2504.09427
作者: Moirangthem Tiken Singh
类目: Machine Learning (cs.LG); Systems and Control (eess.SY)
*备注:

点击查看摘要

Abstract:Fault classification in industrial machinery is vital for enhancing reliability and reducing downtime, yet it remains challenging due to the variability of vibration patterns across diverse operating conditions. This study introduces a novel graph-based framework for fault classification, converting time-series vibration data from machinery operating at varying horsepower levels into a graph representation. We utilize Shannon’s entropy to determine the optimal window size for data segmentation, ensuring each segment captures significant temporal patterns, and employ Dynamic Time Warping (DTW) to define graph edges based on segment similarity. A Graph Auto Encoder (GAE) with a deep graph transformer encoder, decoder, and ensemble classifier is developed to learn latent graph representations and classify faults across various categories. The GAE’s performance is evaluated on the Case Western Reserve University (CWRU) dataset, with cross-dataset generalization assessed on the HUST dataset. Results show that GAE achieves a mean F1-score of 0.99 on the CWRU dataset, significantly outperforming baseline models-CNN, LSTM, RNN, GRU, and Bi-LSTM (F1-scores: 0.94-0.97, p 0.05, Wilcoxon signed-rank test for Bi-LSTM: p 0.05) – particularly in challenging classes (e.g., Class 8: 0.99 vs. 0.71 for Bi-LSTM). Visualization of dataset characteristics reveals that datasets with amplified vibration patterns and diverse fault dynamics enhance generalization. This framework provides a robust solution for fault diagnosis under varying conditions, offering insights into dataset impacts on model performance.

[LG-53] Spatially Directional Dual-Attention GAT for Spatial Fluoride Health Risk Modeling

链接: https://arxiv.org/abs/2504.09416
作者: Da Yuan
类目: Machine Learning (cs.LG); Computers and Society (cs.CY)
*备注:

点击查看摘要

Abstract:Environmental exposure to fluoride is a major public health concern, particularly in regions with naturally elevated fluoride concentrations. Accurate modeling of fluoride-related health risks, such as dental fluorosis, requires spatially aware learning frameworks capable of capturing both geographic and semantic heterogeneity. In this work, we propose Spatially Directional Dual-Attention Graph Attention Network (SDD-GAT), a novel spatial graph neural network designed for fine-grained health risk prediction. SDD-GAT introduces a dual-graph architecture that disentangles geographic proximity and attribute similarity, and incorporates a directional attention mechanism that explicitly encodes spatial orientation and distance into the message passing process. To further enhance spatial coherence, we introduce a spatial smoothness regularization term that enforces consistency in predictions across neighboring locations. We evaluate SDD-GAT on a large-scale dataset covering over 50,000 fluoride monitoring samples and fluorosis records across Guizhou Province, China. Results show that SDD-GAT significantly outperforms traditional models and state-of-the-art GNNs in both regression and classification tasks, while also exhibiting improved spatial autocorrelation as measured by Moran’s I. Our framework provides a generalizable foundation for spatial health risk modeling and geospatial learning under complex environmental settings.

[LG-54] Nash Equilibrium Between Consumer Electronic Devices and DoS Attacker for Distributed IoT-enabled RSE Systems

链接: https://arxiv.org/abs/2504.09415
作者: Gengcan Chen,Donghong Cai,Zahid Khan,Jawad Ahmad,Wadii Boulila
类目: Machine Learning (cs.LG); Computer Science and Game Theory (cs.GT)
*备注:

点击查看摘要

Abstract:In electronic consumer Internet of Things (IoT), consumer electronic devices as edge devices require less computational overhead and the remote state estimation (RSE) of consumer electronic devices is always at risk of denial-of-service (DoS) attacks. Therefore, the adversarial strategy between consumer electronic devices and DoS attackers is critical. This paper focuses on the adversarial strategy between consumer electronic devices and DoS attackers in IoT-enabled RSE Systems. We first propose a remote joint estimation model for distributed measurements to effectively reduce consumer electronic device workload and minimize data leakage risks. The Kalman filter is deployed on the remote estimator, and the DoS attacks with open-loop as well as closed-loop are considered. We further introduce advanced reinforcement learning techniques, including centralized and distributed Minimax-DQN, to address high-dimensional decision-making challenges in both open-loop and closed-loop scenarios. Especially, the Q-network instead of the Q-table is used in the proposed approaches, which effectively solves the challenge of Q-learning. Moreover, the proposed distributed Minimax-DQN reduces the action space to expedite the search for Nash Equilibrium (NE). The experimental results validate that the proposed model can expeditiously restore the RSE error covariance to a stable state in the presence of DoS attacks, exhibiting notable attack robustness. The proposed centralized and distributed Minimax-DQN effectively resolves the NE in both open and closed-loop case, showcasing remarkable performance in terms of convergence. It reveals that substantial advantages in both efficiency and stability are achieved compared with the state-of-the-art methods.

[LG-55] n-Tin: Towards Tiny Learning on Tiny Devices with Integer-based Neural Network Training

链接: https://arxiv.org/abs/2504.09405
作者: Yi Hu,Jinhang Zuo,Eddie Zhang,Bob Iannucci,Carlee Joe-Wong
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Recent advancements in machine learning (ML) have enabled its deployment on resource-constrained edge devices, fostering innovative applications such as intelligent environmental sensing. However, these devices, particularly microcontrollers (MCUs), face substantial challenges due to limited memory, computing capabilities, and the absence of dedicated floating-point units (FPUs). These constraints hinder the deployment of complex ML models, especially those requiring lifelong learning capabilities. To address these challenges, we propose Tin-Tin, an integer-based on-device training framework designed specifically for low-power MCUs. Tin-Tin introduces novel integer rescaling techniques to efficiently manage dynamic ranges and facilitate efficient weight updates using integer data types. Unlike existing methods optimized for devices with FPUs, GPUs, or FPGAs, Tin-Tin addresses the unique demands of tiny MCUs, prioritizing energy efficiency and optimized memory utilization. We validate the effectiveness of Tin-Tin through end-to-end application examples on real-world tiny devices, demonstrating its potential to support energy-efficient and sustainable ML applications on edge platforms.

[LG-56] Expressivity of Quadratic Neural ODEs

链接: https://arxiv.org/abs/2504.09385
作者: Joshua Hanson,Maxim Raginsky
类目: Machine Learning (cs.LG); Systems and Control (eess.SY); Optimization and Control (math.OC)
*备注: 9 pages, 1 figure

点击查看摘要

Abstract:This work focuses on deriving quantitative approximation error bounds for neural ordinary differential equations having at most quadratic nonlinearities in the dynamics. The simple dynamics of this model form demonstrates how expressivity can be derived primarily from iteratively composing many basic elementary operations, versus from the complexity of those elementary operations themselves. Like the analog differential analyzer and universal polynomial DAEs, the expressivity is derived instead primarily from the “depth” of the model. These results contribute to our understanding of what depth specifically imparts to the capabilities of deep learning architectures.

[LG-57] Machine Learning-Based Cyberattack Detection and Identification for Automatic Generation Control Systems Considering Nonlinearities

链接: https://arxiv.org/abs/2504.09363
作者: Nour M. Shabar,Ahmad Mohammad Saber,Deepa Kundur
类目: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注: Accepted for publication in IEEE

点击查看摘要

Abstract:Automatic generation control (AGC) systems play a crucial role in maintaining system frequency across power grids. However, AGC systems’ reliance on communicated measurements exposes them to false data injection attacks (FDIAs), which can compromise the overall system stability. This paper proposes a machine learning (ML)-based detection framework that identifies FDIAs and determines the compromised measurements. The approach utilizes an ML model trained offline to accurately detect attacks and classify the manipulated signals based on a comprehensive set of statistical and time-series features extracted from AGC measurements before and after disturbances. For the proposed approach, we compare the performance of several powerful ML algorithms. Our results demonstrate the efficacy of the proposed method in detecting FDIAs while maintaining a low false alarm rate, with an F1-score of up to 99.98%, outperforming existing approaches.

[LG-58] Context-Aware Adaptive Sampling for Intelligent Data Acquisition Systems Using DQN

链接: https://arxiv.org/abs/2504.09344
作者: Weiqiang Huang,Juecen Zhan,Yumeng Sun,Xu Han,Tai An,Nan Jiang
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Multi-sensor systems are widely used in the Internet of Things, environmental monitoring, and intelligent manufacturing. However, traditional fixed-frequency sampling strategies often lead to severe data redundancy, high energy consumption, and limited adaptability, failing to meet the dynamic sensing needs of complex environments. To address these issues, this paper proposes a DQN-based multi-sensor adaptive sampling optimization method. By leveraging a reinforcement learning framework to learn the optimal sampling strategy, the method balances data quality, energy consumption, and redundancy. We first model the multi-sensor sampling task as a Markov Decision Process (MDP), then employ a Deep Q-Network to optimize the sampling policy. Experiments on the Intel Lab Data dataset confirm that, compared with fixed-frequency sampling, threshold-triggered sampling, and other reinforcement learning approaches, DQN significantly improves data quality while lowering average energy consumption and redundancy rates. Moreover, in heterogeneous multi-sensor environments, DQN-based adaptive sampling shows enhanced robustness, maintaining superior data collection performance even in the presence of interference factors. These findings demonstrate that DQN-based adaptive sampling can enhance overall data acquisition efficiency in multi-sensor systems, providing a new solution for efficient and intelligent sensing.

[LG-59] owards Optimal Differentially Private Regret Bounds in Linear MDPs

链接: https://arxiv.org/abs/2504.09339
作者: Sharan Sahu
类目: Machine Learning (cs.LG); Data Structures and Algorithms (cs.DS); Machine Learning (stat.ML)
*备注: 28 pages, 2 figures

点击查看摘要

Abstract:We study regret minimization under privacy constraints in episodic inhomogeneous linear Markov Decision Processes (MDPs), motivated by the growing use of reinforcement learning (RL) in personalized decision-making systems that rely on sensitive user data. In this setting, both transition probabilities and reward functions are assumed to be linear in a feature mapping \phi(s, a) , and we aim to ensure privacy through joint differential privacy (JDP), a relaxation of differential privacy suited to online learning. Prior work has established suboptimal regret bounds by privatizing the LSVI-UCB algorithm, which achieves \widetildeO(\sqrtd^3 H^4 K) regret in the non-private setting. Building on recent advances that improve this to minimax optimal regret \widetildeO(HD\sqrtK) via LSVI-UCB++ with Bernstein-style bonuses, we design a new differentially private algorithm by privatizing LSVI-UCB++ and adapting techniques for variance-aware analysis from offline RL. Our algorithm achieves a regret bound of \widetildeO(d \sqrtH^3 K + H^4.5 d^7/6 K^1/2 / \epsilon) , improving over previous private methods. Empirical results show that our algorithm retains near-optimal utility compared to non-private baselines, indicating that privacy can be achieved with minimal performance degradation in this setting.

[LG-60] Efficient Implementation of Reinforcement Learning over Homomorphic Encryption

链接: https://arxiv.org/abs/2504.09335
作者: Jihoon Suh,Takashi Tanaka
类目: Machine Learning (cs.LG); Cryptography and Security (cs.CR); Systems and Control (eess.SY)
*备注: 6 pages, 3 figures

点击查看摘要

Abstract:We investigate encrypted control policy synthesis over the cloud. While encrypted control implementations have been studied previously, we focus on the less explored paradigm of privacy-preserving control synthesis, which can involve heavier computations ideal for cloud outsourcing. We classify control policy synthesis into model-based, simulator-driven, and data-driven approaches and examine their implementation over fully homomorphic encryption (FHE) for privacy enhancements. A key challenge arises from comparison operations (min or max) in standard reinforcement learning algorithms, which are difficult to execute over encrypted data. This observation motivates our focus on Relative-Entropy-regularized reinforcement learning (RL) problems, which simplifies encrypted evaluation of synthesis algorithms due to their comparison-free structures. We demonstrate how linearly solvable value iteration, path integral control, and Z-learning can be readily implemented over FHE. We conduct a case study of our approach through numerical simulations of encrypted Z-learning in a grid world environment using the CKKS encryption scheme, showing convergence with acceptable approximation error. Our work suggests the potential for secure and efficient cloud-based reinforcement learning.

[LG-61] Regretful Decisions under Label Noise ICLR2025

链接: https://arxiv.org/abs/2504.09330
作者: Sujay Nagaraj,Yang Liu,Flavio P. Calmon,Berk Ustun
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: The Thirteenth International Conference on Learning Representations (ICLR 2025)

点击查看摘要

Abstract:Machine learning models are routinely used to support decisions that affect individuals – be it to screen a patient for a serious illness or to gauge their response to treatment. In these tasks, we are limited to learning models from datasets with noisy labels. In this paper, we study the instance-level impact of learning under label noise. We introduce a notion of regret for this regime which measures the number of unforeseen mistakes due to noisy labels. We show that standard approaches to learning under label noise can return models that perform well at a population level while subjecting individuals to a lottery of mistakes. We present a versatile approach to estimate the likelihood of mistakes at the individual level from a noisy dataset by training models over plausible realizations of datasets without label noise. This is supported by a comprehensive empirical study of label noise in clinical prediction tasks. Our results reveal how failure to anticipate mistakes can compromise model reliability and adoption, and demonstrate how we can address these challenges by anticipating and avoiding regretful decisions.

[LG-62] Conformal Calibration: Ensuring the Reliability of Black-Box AI in Wireless Systems

链接: https://arxiv.org/abs/2504.09310
作者: Osvaldo Simeone,Sangwoo Park,Matteo Zecchin
类目: Information Theory (cs.IT); Machine Learning (cs.LG); Signal Processing (eess.SP); Applications (stat.AP)
*备注: submitted for a journal publication

点击查看摘要

Abstract:AI is poised to revolutionize telecommunication networks by boosting efficiency, automation, and decision-making. However, the black-box nature of most AI models introduces substantial risk, possibly deterring adoption by network operators. These risks are not addressed by the current prevailing deployment strategy, which typically follows a best-effort train-and-deploy paradigm. This paper reviews conformal calibration, a general framework that moves beyond the state of the art by adopting computationally lightweight, advanced statistical tools that offer formal reliability guarantees without requiring further training or fine-tuning. Conformal calibration encompasses pre-deployment calibration via uncertainty quantification or hyperparameter selection; online monitoring to detect and mitigate failures in real time; and counterfactual post-deployment performance analysis to address “what if” diagnostic questions after deployment. By weaving conformal calibration into the AI model lifecycle, network operators can establish confidence in black-box AI models as a dependable enabling technology for wireless systems.

[LG-63] Beyond Glucose-Only Assessment: Advancing Nocturnal Hypoglycemia Prediction in Children with Type 1 Diabetes ICLR2025

链接: https://arxiv.org/abs/2504.09299
作者: Marco Voegeli,Sonia Laguna,Heike Leutheuser,Marc Pfister,Marie-Anne Burckhardt,Julia E Vogt
类目: Machine Learning (cs.LG); Quantitative Methods (q-bio.QM)
*备注: Published at ICLR 2025 Workshop on AI for Children

点击查看摘要

Abstract:The dead-in-bed syndrome describes the sudden and unexplained death of young individuals with Type 1 Diabetes (T1D) without prior long-term complications. One leading hypothesis attributes this phenomenon to nocturnal hypoglycemia (NH), a dangerous drop in blood glucose during sleep. This study aims to improve NH prediction in children with T1D by leveraging physiological data and machine learning (ML) techniques. We analyze an in-house dataset collected from 16 children with T1D, integrating physiological metrics from wearable sensors. We explore predictive performance through feature engineering, model selection, architectures, and oversampling. To address data limitations, we apply transfer learning from a publicly available adult dataset. Our results achieve an AUROC of 0.75 ± 0.21 on the in-house dataset, further improving to 0.78 ± 0.05 with transfer learning. This research moves beyond glucose-only predictions by incorporating physiological parameters, showcasing the potential of ML to enhance NH detection and improve clinical decision-making for pediatric diabetes management.

[LG-64] Sparse Hybrid Linear-Morphological Networks

链接: https://arxiv.org/abs/2504.09289
作者: Konstantinos Fotopoulos,Christos Garoufis,Petros Maragos
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We investigate hybrid linear-morphological networks. Recent studies highlight the inherent affinity of morphological layers to pruning, but also their difficulty in training. We propose a hybrid network structure, wherein morphological layers are inserted between the linear layers of the network, in place of activation functions. We experiment with the following morphological layers: 1) maxout pooling layers (as a special case of a morphological layer), 2) fully connected dense morphological layers, and 3) a novel, sparsely initialized variant of (2). We conduct experiments on the Magna-Tag-A-Tune (music auto-tagging) and CIFAR-10 (image classification) datasets, replacing the linear classification heads of state-of-the-art convolutional network architectures with our proposed network structure for the various morphological layers. We demonstrate that these networks induce sparsity to their linear layers, making them more prunable under L1 unstructured pruning. We also show that on MTAT our proposed sparsely initialized layer achieves slightly better performance than ReLU, maxout, and densely initialized max-plus layers, and exhibits faster initial convergence.

[LG-65] NetTAG: A Multimodal RTL-and-Layout-Aligned Netlist Foundation Model via Text-Attributed Graph

链接: https://arxiv.org/abs/2504.09260
作者: Wenji Fang,Wenkai Li,Shang Liu,Yao Lu,Hongce Zhang,Zhiyao Xie
类目: Hardware Architecture (cs.AR); Machine Learning (cs.LG)
*备注: Accepted by Design Automation Conference (DAC), 2025

点击查看摘要

Abstract:Circuit representation learning has shown promise in advancing Electronic Design Automation (EDA) by capturing structural and functional circuit properties for various tasks. Existing pre-trained solutions rely on graph learning with complex functional supervision, such as truth table simulation. However, they only handle simple and-inverter graphs (AIGs), struggling to fully encode other complex gate functionalities. While large language models (LLMs) excel at functional understanding, they lack the structural awareness for flattened netlists. To advance netlist representation learning, we present NetTAG, a netlist foundation model that fuses gate semantics with graph structure, handling diverse gate types and supporting a variety of functional and physical tasks. Moving beyond existing graph-only methods, NetTAG formulates netlists as text-attributed graphs, with gates annotated by symbolic logic expressions and physical characteristics as text attributes. Its multimodal architecture combines an LLM-based text encoder for gate semantics and a graph transformer for global structure. Pre-trained with gate and graph self-supervised objectives and aligned with RTL and layout stages, NetTAG captures comprehensive circuit intrinsics. Experimental results show that NetTAG consistently outperforms each task-specific method on four largely different functional and physical tasks and surpasses state-of-the-art AIG encoders, demonstrating its versatility.

[LG-66] ype-Constrained Code Generation with Language Models

链接: https://arxiv.org/abs/2504.09246
作者: Niels Mündler,Jingxuan He,Hao Wang,Koushik Sen,Dawn Song,Martin Vechev
类目: Machine Learning (cs.LG); Programming Languages (cs.PL)
*备注:

点击查看摘要

Abstract:Large language models (LLMs) have achieved notable success in code generation. However, they still frequently produce uncompilable output because their next-token inference procedure does not model formal aspects of code. Although constrained decoding is a promising approach to alleviate this issue, it has only been applied to handle either domain-specific languages or syntactic language features. This leaves typing errors, which are beyond the domain of syntax and generally hard to adequately constrain. To address this challenge, we introduce a type-constrained decoding approach that leverages type systems to guide code generation. We develop novel prefix automata for this purpose and introduce a sound approach to enforce well-typedness based on type inference and a search over inhabitable types. We formalize our approach on a simply-typed language and extend it to TypeScript to demonstrate practicality. Our evaluation on HumanEval shows that our approach reduces compilation errors by more than half and increases functional correctness in code synthesis, translation, and repair tasks across LLMs of various sizes and model families, including SOTA open-weight models with more than 30B parameters.

[LG-67] CMCRD: Cross-Modal Contrastive Representation Distillation for Emotion Recognition

链接: https://arxiv.org/abs/2504.09221
作者: Siyuan Kan,Huanyu Wu,Zhenyao Cui,Fan Huang,Xiaolong Xu,Dongrui Wu
类目: Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Emotion recognition is an important component of affective computing, and also human-machine interaction. Unimodal emotion recognition is convenient, but the accuracy may not be high enough; on the contrary, multi-modal emotion recognition may be more accurate, but it also increases the complexity and cost of the data collection system. This paper considers cross-modal emotion recognition, i.e., using both electroencephalography (EEG) and eye movement in training, but only EEG or eye movement in test. We propose cross-modal contrastive representation distillation (CMCRD), which uses a pre-trained eye movement classification model to assist the training of an EEG classification model, improving feature extraction from EEG signals, or vice versa. During test, only EEG signals (or eye movement signals) are acquired, eliminating the need for multi-modal data. CMCRD not only improves the emotion recognition accuracy, but also makes the system more simplified and practical. Experiments using three different neural network architectures on three multi-modal emotion recognition datasets demonstrated the effectiveness of CMCRD. Compared with the EEG-only model, it improved the average classification accuracy by about 6.2%.

[LG-68] Spiking Neural Network for Intra-cortical Brain Signal Decoding

链接: https://arxiv.org/abs/2504.09213
作者: Song Yang,Haotian Fu,Herui Zhang,Peng Zhang,Wei Li,Dongrui Wu
类目: Human-Computer Interaction (cs.HC); Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)
*备注:

点击查看摘要

Abstract:Decoding brain signals accurately and efficiently is crucial for intra-cortical brain-computer interfaces. Traditional decoding approaches based on neural activity vector features suffer from low accuracy, whereas deep learning based approaches have high computational cost. To improve both the decoding accuracy and efficiency, this paper proposes a spiking neural network (SNN) for effective and energy-efficient intra-cortical brain signal decoding. We also propose a feature fusion approach, which integrates the manually extracted neural activity vector features with those extracted by a deep neural network, to further improve the decoding accuracy. Experiments in decoding motor-related intra-cortical brain signals of two rhesus macaques demonstrated that our SNN model achieved higher accuracy than traditional artificial neural networks; more importantly, it was tens or hundreds of times more efficient. The SNN model is very suitable for high precision and low power applications like intra-cortical brain-computer interfaces.

[LG-69] Accurate Diagnosis of Respiratory Viruses Using an Explainable Machine Learning with Mid-Infrared Biomolecular Fingerprinting of Nasopharyngeal Secretions

链接: https://arxiv.org/abs/2504.09211
作者: Wenwen Zhang,Zhouzhuo Tang,Yingmei Feng,Xia Yu,Qi Jie Wang,Zhiping Lin
类目: Machine Learning (cs.LG); Signal Processing (eess.SP)
*备注:

点击查看摘要

Abstract:Accurate identification of respiratory viruses (RVs) is critical for outbreak control and public health. This study presents a diagnostic system that combines Attenuated Total Reflectance Fourier Transform Infrared Spectroscopy (ATR-FTIR) from nasopharyngeal secretions with an explainable Rotary Position Embedding-Sparse Attention Transformer (RoPE-SAT) model to accurately identify multiple RVs within 10 minutes. Spectral data (4000-00 cm-1) were collected, and the bio-fingerprint region (1800-900 cm-1) was employed for analysis. Standard normal variate (SNV) normalization and second-order derivation were applied to reduce scattering and baseline drift. Gradient-weighted class activation mapping (Grad-CAM) was employed to generate saliency maps, highlighting spectral regions most relevant to classification and enhancing the interpretability of model outputs. Two independent cohorts from Beijing Youan Hospital, processed with different viral transport media (VTMs) and drying methods, were evaluated, with one including influenza B, SARS-CoV-2, and healthy controls, and the other including mycoplasma, SARS-CoV-2, and healthy controls. The model achieved sensitivity and specificity above 94.40% across both cohorts. By correlating model-selected infrared regions with known biomolecular signatures, we verified that the system effectively recognizes virus-specific spectral fingerprints, including lipids, Amide I, Amide II, Amide III, nucleic acids, and carbohydrates, and leverages their weighted contributions for accurate classification.

[LG-70] Rethinking Remaining Useful Life Prediction with Scarce Time Series Data: Regression under Indirect Supervision

链接: https://arxiv.org/abs/2504.09206
作者: Jiaxiang Cheng,Yipeng Pang,Guoqiang Hu
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:Supervised time series prediction relies on directly measured target variables, but real-world use cases such as predicting remaining useful life (RUL) involve indirect supervision, where the target variable is labeled as a function of another dependent variable. Trending temporal regression techniques rely on sequential time series inputs to capture temporal patterns, requiring interpolation when dealing with sparsely and irregularly sampled covariates along the timeline. However, interpolation can introduce significant biases, particularly with highly scarce data. In this paper, we address the RUL prediction problem with data scarcity as time series regression under indirect supervision. We introduce a unified framework called parameterized static regression, which takes single data points as inputs for regression of target values, inherently handling data scarcity without requiring interpolation. The time dependency under indirect supervision is captured via a parametrical rectification (PR) process, approximating a parametric function during inference with historical posteriori estimates, following the same underlying distribution used for labeling during training. Additionally, we propose a novel batch training technique for tasks in indirect supervision to prevent overfitting and enhance efficiency. We evaluate our model on public benchmarks for RUL prediction with simulated data scarcity. Our method demonstrates competitive performance in prediction accuracy when dealing with highly scarce time series data.

[LG-71] Query-based Knowledge Transfer for Heterogeneous Learning Environments ICLR’25

链接: https://arxiv.org/abs/2504.09205
作者: Norah Alballa,Wenxuan Zhang,Ziquan Liu,Ahmed M. Abdelmoniem,Mohamed Elhoseiny,Marco Canini
类目: Machine Learning (cs.LG)
*备注: Accepted at ICLR’25

点击查看摘要

Abstract:Decentralized collaborative learning under data heterogeneity and privacy constraints has rapidly advanced. However, existing solutions like federated learning, ensembles, and transfer learning, often fail to adequately serve the unique needs of clients, especially when local data representation is limited. To address this issue, we propose a novel framework called Query-based Knowledge Transfer (QKT) that enables tailored knowledge acquisition to fulfill specific client needs without direct data exchange. QKT employs a data-free masking strategy to facilitate communication-efficient query-focused knowledge transfer while refining task-specific parameters to mitigate knowledge interference and forgetting. Our experiments, conducted on both standard and clinical benchmarks, show that QKT significantly outperforms existing collaborative learning methods by an average of 20.91% points in single-class query settings and an average of 14.32% points in multi-class query scenarios. Further analysis and ablation studies reveal that QKT effectively balances the learning of new and existing knowledge, showing strong potential for its application in decentralized learning.

[LG-72] owards More Efficient Robust Instance-adaptive and Generalizable Online Learning

链接: https://arxiv.org/abs/2504.09192
作者: Zhiyong Wang
类目: Machine Learning (cs.LG)
*备注: Ph.D. Thesis

点击查看摘要

Abstract:The primary goal of my Ph.D. study is to develop provably efficient and practical algorithms for data-driven online sequential decision-making under uncertainty. My work focuses on reinforcement learning (RL), multi-armed bandits, and their applications, including recommendation systems, computer networks, video analytics, and large language models (LLMs). Online learning methods, such as bandits and RL, have demonstrated remarkable success - ranging from outperforming human players in complex games like Atari and Go to advancing robotics, recommendation systems, and fine-tuning LLMs. Despite these successes, many established algorithms rely on idealized models that can fail under model misspecifications or adversarial perturbations, particularly in settings where accurate prior knowledge of the underlying model class is unavailable or where malicious users operate within dynamic systems. These challenges are pervasive in real-world applications, where robust and adaptive solutions are critical. Furthermore, while worst-case guarantees provide theoretical reliability, they often fail to capture instance-dependent performance, which can lead to more efficient and practical solutions. Another key challenge lies in generalizing to new, unseen environments, a crucial requirement for deploying these methods in dynamic and unpredictable settings. To address these limitations, my research aims to develop more efficient, robust, instance-adaptive, and generalizable online learning algorithms for both reinforcement learning and bandits. Towards this end, I focus on developing more efficient, robust, instance-adaptive, and generalizable for both general reinforcement learning (RL) and bandits.

[LG-73] Deconfounded Reasoning for Multimodal Fake News Detection via Causal Intervention

链接: https://arxiv.org/abs/2504.09163
作者: Moyang Liu,Kaiying Yan,Yukun Liu,Ruibo Fu,Zhengqi Wen,Xuefei Liu,Chenxing Li
类目: Multimedia (cs.MM); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:The rapid growth of social media has led to the widespread dissemination of fake news across multiple content forms, including text, images, audio, and video. Traditional unimodal detection methods fall short in addressing complex cross-modal manipulations; as a result, multimodal fake news detection has emerged as a more effective solution. However, existing multimodal approaches, especially in the context of fake news detection on social media, often overlook the confounders hidden within complex cross-modal interactions, leading models to rely on spurious statistical correlations rather than genuine causal mechanisms. In this paper, we propose the Causal Intervention-based Multimodal Deconfounded Detection (CIMDD) framework, which systematically models three types of confounders via a unified Structural Causal Model (SCM): (1) Lexical Semantic Confounder (LSC); (2) Latent Visual Confounder (LVC); (3) Dynamic Cross-Modal Coupling Confounder (DCCC). To mitigate the influence of these confounders, we specifically design three causal modules based on backdoor adjustment, frontdoor adjustment, and cross-modal joint intervention to block spurious correlations from different perspectives and achieve causal disentanglement of representations for deconfounded reasoning. Experimental results on the FakeSV and FVC datasets demonstrate that CIMDD significantly improves detection accuracy, outperforming state-of-the-art methods by 4.27% and 4.80%, respectively. Furthermore, extensive experimental results indicate that CIMDD exhibits strong generalization and robustness across diverse multimodal scenarios.

[LG-74] Exploring Modality Disruption in Multimodal Fake News Detection

链接: https://arxiv.org/abs/2504.09154
作者: Moyang Liu,Kaiying Yan,Yukun Liu,Ruibo Fu,Zhengqi Wen,Xuefei Liu,Chenxing Li
类目: Multimedia (cs.MM); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:The rapid growth of social media has led to the widespread dissemination of fake news across multiple content forms, including text, images, audio, and video. Compared to unimodal fake news detection, multimodal fake news detection benefits from the increased availability of information across multiple modalities. However, in the context of social media, certain modalities in multimodal fake news detection tasks may contain disruptive or over-expressive information. These elements often include exaggerated or embellished content. We define this phenomenon as modality disruption and explore its impact on detection models through experiments. To address the issue of modality disruption in a targeted manner, we propose a multimodal fake news detection framework, FND-MoE. Additionally, we design a two-pass feature selection mechanism to further mitigate the impact of modality disruption. Extensive experiments on the FakeSV and FVC-2018 datasets demonstrate that FND-MoE significantly outperforms state-of-the-art methods, with accuracy improvements of 3.45% and 3.71% on the respective datasets compared to baseline models.

[LG-75] MatWheel: Addressing Data Scarcity in Materials Science Through Synthetic Data ICLR-2025 ICLR2025

链接: https://arxiv.org/abs/2504.09152
作者: Wentao Li,Yizhe Chen,Jiangjie Qiu,Xiaonan Wang
类目: Machine Learning (cs.LG); Materials Science (cond-mat.mtrl-sci)
*备注: AI4MAT-ICLR-2025: ICLR 2025 Workshop on AI for Accelerated Materials Design

点击查看摘要

Abstract:Data scarcity and the high cost of annotation have long been persistent challenges in the field of materials science. Inspired by its potential in other fields like computer vision, we propose the MatWheel framework, which train the material property prediction model using the synthetic data generated by the conditional generative model. We explore two scenarios: fully-supervised and semi-supervised learning. Using CGCNN for property prediction and Con-CDVAE as the conditional generative model, experiments on two data-scarce material property datasets from Matminer database are conducted. Results show that synthetic data has potential in extreme data-scarce scenarios, achieving performance close to or exceeding that of real samples in all two tasks. We also find that pseudo-labels have little impact on generated data quality. Future work will integrate advanced models and optimize generation conditions to boost the effectiveness of the materials data flywheel.

[LG-76] Kernel-Based Enhanced Oversampling Method for Imbalanced Classification

链接: https://arxiv.org/abs/2504.09147
作者: Wenjie Li,Sibo Zhu,Zhijian Li,Hanlin Wang
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:This paper introduces a novel oversampling technique designed to improve classification performance on imbalanced datasets. The proposed method enhances the traditional SMOTE algorithm by incorporating convex combination and kernel-based weighting to generate synthetic samples that better represent the minority class. Through experiments on multiple real-world datasets, we demonstrate that the new technique outperforms existing methods in terms of F1-score, G-mean, and AUC, providing a robust solution for handling imbalanced datasets in classification tasks.

[LG-77] Self-Supervised Autoencoder Network for Robust Heart Rate Extraction from Noisy Photoplethysmogram: Applying Blind Source Separation to Biosignal Analysis

链接: https://arxiv.org/abs/2504.09132
作者: Matthew B. Webster,Dongheon Lee,Joonnyong Lee
类目: Machine Learning (cs.LG); Signal Processing (eess.SP)
*备注: 12 pages, 5 figures, preprint

点击查看摘要

Abstract:Biosignals can be viewed as mixtures measuring particular physiological events, and blind source separation (BSS) aims to extract underlying source signals from mixtures. This paper proposes a self-supervised multi-encoder autoencoder (MEAE) to separate heartbeat-related source signals from photoplethysmogram (PPG), enhancing heart rate (HR) detection in noisy PPG data. The MEAE is trained on PPG signals from a large open polysomnography database without any pre-processing or data selection. The trained network is then applied to a noisy PPG dataset collected during the daily activities of nine subjects. The extracted heartbeat-related source signal significantly improves HR detection as compared to the original PPG. The absence of pre-processing and the self-supervised nature of the proposed method, combined with its strong performance, highlight the potential of BSS in biosignal analysis.

[LG-78] Deploying Large AI Models on Resource-Limited Devices with Split Federated Learning

链接: https://arxiv.org/abs/2504.09114
作者: Xianke Qiang,Hongda Liu,Xinran Zhang,Zheng Chang,Ying-Chang Liang
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Large Artificial Intelligence Models (LAMs) powered by massive datasets, extensive parameter scales, and extensive computational resources, leading to significant transformations across various industries. Yet, their practical deployment on resource-limited mobile edge devices is hindered by critical challenges such as data privacy, constrained resources, and high overhead costs. Addressing this gap, this paper proposes a novel framework, named Quantized Split Federated Fine-Tuning Large AI Model (SFLAM). By partitioning the training load between edge devices and servers using a split learning paradigm, SFLAM can facilitate the operation of large models on devices and significantly lowers the memory requirements on edge devices. Additionally, SFLAM incorporates quantization management, power control, and bandwidth allocation strategies to enhance training efficiency while concurrently reducing energy consumption and communication latency. A theoretical analysis exploring the latency-energy trade-off is presented, and the framework’s efficacy is validated via comprehensive simulations. The findings indicate that SFLAM achieves superior performance in terms of learning efficiency and scalability compared to conventional methods, thereby providing a valuable approach for enabling advanced AI services in resource-constrained scenarios.

[LG-79] Shrinkage Initialization for Smooth Learning of Neural Networks

链接: https://arxiv.org/abs/2504.09107
作者: Miao Cheng,Feiyan Zhou,Hongwei Zou,Limin Wang
类目: Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)
*备注: 6 pages, 4 figures

点击查看摘要

Abstract:The successes of intelligent systems have quite relied on the artificial learning of information, which lead to the broad applications of neural learning solutions. As a common sense, the training of neural networks can be largely improved by specifically defined initialization, neuron layers as well as the activation functions. Though there are sequential layer based initialization available, the generalized solution to initial stages is still desired. In this work, an improved approach to initialization of neural learning is presented, which adopts the shrinkage approach to initialize the transformation of each layer of networks. It can be universally adapted for the structures of any networks with random layers, while stable performance can be attained. Furthermore, the smooth learning of networks is adopted in this work, due to the diverse influence on neural learning. Experimental results on several artificial data sets demonstrate that, the proposed method is able to present robust results with the shrinkage initialization, and competent for smooth learning of neural networks.

[LG-80] High dimensional online calibration in polynomial time

链接: https://arxiv.org/abs/2504.09096
作者: Binghui Peng
类目: Machine Learning (cs.LG); Data Structures and Algorithms (cs.DS); Computer Science and Game Theory (cs.GT); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:In online (sequential) calibration, a forecaster predicts probability distributions over a finite outcome space [d] over a sequence of T days, with the goal of being calibrated. While asymptotically calibrated strategies are known to exist, they suffer from the curse of dimensionality: the best known algorithms require \exp(d) days to achieve non-trivial calibration. In this work, we present the first asymptotically calibrated strategy that guarantees non-trivial calibration after a polynomial number of rounds. Specifically, for any desired accuracy \epsilon 0 , our forecaster becomes \epsilon -calibrated after T = d^O(1/\epsilon^2) days. We complement this result with a lower bound, proving that at least T = d^\Omega(\log(1/\epsilon)) rounds are necessary to achieve \epsilon -calibration. Our results resolve the open questions posed by [Abernethy-Mannor’11, Hazan-Kakade’12]. Our algorithm is inspired by recent breakthroughs in swap regret minimization [Peng-Rubinstein’24, Dagan et al.'24]. Despite its strong theoretical guarantees, the approach is remarkably simple and intuitive: it randomly selects among a set of sub-forecasters, each of which predicts the empirical outcome frequency over recent time windows. Subjects: Machine Learning (cs.LG); Data Structures and Algorithms (cs.DS); Computer Science and Game Theory (cs.GT); Machine Learning (stat.ML) Cite as: arXiv:2504.09096 [cs.LG] (or arXiv:2504.09096v1 [cs.LG] for this version) https://doi.org/10.48550/arXiv.2504.09096 Focus to learn more arXiv-issued DOI via DataCite (pending registration)

[LG-81] crowd-hpo: Realistic Hyperparameter Optimization and Benchmarking for Learning from Crowds with Noisy Labels

链接: https://arxiv.org/abs/2504.09085
作者: Marek Herde,Lukas Lührs,Denis Huseljic,Bernhard Sick
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Crowdworking is a cost-efficient solution to acquire class labels. Since these labels are subject to noise, various approaches to learning from crowds have been proposed. Typically, these approaches are evaluated with default hyperparameters, resulting in suboptimal performance, or with hyperparameters tuned using a validation set with ground truth class labels, representing an often unrealistic scenario. Moreover, both experimental setups can produce different rankings of approaches, complicating comparisons between studies. Therefore, we introduce crowd-hpo as a realistic benchmark and experimentation protocol including hyperparameter optimization under noisy crowd-labeled data. At its core, crowd-hpo investigates model selection criteria to identify well-performing hyperparameter configurations only with access to noisy crowd-labeled validation data. Extensive experimental evaluations with neural networks show that these criteria are effective for optimizing hyperparameters in learning from crowds approaches. Accordingly, incorporating such criteria into experimentation protocols is essential for enabling more realistic and fair benchmarking.

[LG-82] MGS: Markov Greedy Sums for Accurate Low-Bitwidth Floating-Point Accumulation

链接: https://arxiv.org/abs/2504.09072
作者: Vikas Natesh,H.T. Kung,David Kong
类目: Hardware Architecture (cs.AR); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We offer a novel approach, MGS (Markov Greedy Sums), to improve the accuracy of low-bitwidth floating-point dot products in neural network computations. In conventional 32-bit floating-point summation, adding values with different exponents may lead to loss of precision in the mantissa of the smaller term, which is right-shifted to align with the larger term’s exponent. Such shifting (a.k.a. ‘swamping’) is a significant source of numerical errors in accumulation when implementing low-bitwidth dot products (e.g., 8-bit floating point) as the mantissa has a small number of bits. We avoid most swamping errors by arranging the terms in dot product summation based on their exponents and summing the mantissas without overflowing the low-bitwidth accumulator. We design, analyze, and implement the algorithm to minimize 8-bit floating point error at inference time for several neural networks. In contrast to traditional sequential summation, our method has significantly lowered numerical errors, achieving classification accuracy on par with high-precision floating-point baselines for multiple image classification tasks. Our dMAC hardware units can reduce power consumption by up to 34.1% relative to conventional MAC units.

[LG-83] owards On-Device Learning and Reconfigurable Hardware Implementation for Encoded Single-Photon Signal Processing

链接: https://arxiv.org/abs/2504.09028
作者: Zhenya Zang,Xingda Li,David Day Uei Li
类目: Machine Learning (cs.LG); Signal Processing (eess.SP)
*备注: 14 pages, 8 figures, 4 tables

点击查看摘要

Abstract:Deep neural networks (DNNs) enhance the accuracy and efficiency of reconstructing key parameters from time-resolved photon arrival signals recorded by single-photon detectors. However, the performance of conventional backpropagation-based DNNs is highly dependent on various parameters of the optical setup and biological samples under examination, necessitating frequent network retraining, either through transfer learning or from scratch. Newly collected data must also be stored and transferred to a high-performance GPU server for retraining, introducing latency and storage overhead. To address these challenges, we propose an online training algorithm based on a One-Sided Jacobi rotation-based Online Sequential Extreme Learning Machine (OSOS-ELM). We fully exploit parallelism in executing OSOS-ELM on a heterogeneous FPGA with integrated ARM cores. Extensive evaluations of OSOS-ELM and OSELM demonstrate that both achieve comparable accuracy across different network dimensions (i.e., input, hidden, and output layers), while OSOS-ELM proves to be more hardware-efficient. By leveraging the parallelism of OSOS-ELM, we implement a holistic computing prototype on a Xilinx ZCU104 FPGA, which integrates a multi-core CPU and programmable logic fabric. We validate our approach through three case studies involving single-photon signal analysis: sensing through fog using commercial single-photon LiDAR, fluorescence lifetime estimation in FLIM, and blood flow index reconstruction in DCS, all utilizing one-dimensional data encoded from photonic signals. From a hardware perspective, we optimize the OSOS-ELM workload by employing multi-tasked processing on ARM CPU cores and pipelined execution on the FPGA’s logic fabric. We also implement our OSOS-ELM on the NVIDIA Jetson Xavier NX GPU to comprehensively investigate its computing performance on another type of heterogeneous computing platform.

[LG-84] Associating transportation planning -related measures with Mild Cognitive Impairment

链接: https://arxiv.org/abs/2504.09027
作者: Souradeep Chattopadhyay,Guillermo Basulto-Elias,Jun Ha Chang,Matthew Rizzo,Shauna Hallmark,Anuj Sharma,Soumik Sarkar
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Understanding the relationship between mild cognitive impairment and driving behavior is essential to improve road safety, especially among older adults. In this study, we computed certain variables that reflect daily driving habits, such as trips to specific locations (e.g., home, work, medical, social, and errands) of older drivers in Nebraska using geohashing. The computed variables were then analyzed using a two-fold approach involving data visualization and machine learning models (C5.0, Random Forest, Support Vector Machines) to investigate the efficiency of the computed variables in predicting whether a driver is cognitively impaired or unimpaired. The C5.0 model demonstrated robust and stable performance with a median recall of 74%, indicating that our methodology was able to identify cognitive impairment in drivers 74% of the time correctly. This highlights our model’s effectiveness in minimizing false negatives which is an important consideration given the cost of missing impaired drivers could be potentially high. Our findings highlight the potential of life space variables in understanding and predicting cognitive decline, offering avenues for early intervention and tailored support for affected individuals.

[LG-85] Detecting Instruction Fine-tuning Attack on Language Models with Influence Function

链接: https://arxiv.org/abs/2504.09026
作者: Jiawei Li
类目: Machine Learning (cs.LG); Cryptography and Security (cs.CR)
*备注:

点击查看摘要

Abstract:Instruction fine-tuning attacks pose a significant threat to large language models (LLMs) by subtly embedding poisoned data in fine-tuning datasets, which can trigger harmful or unintended responses across a range of tasks. This undermines model alignment and poses security risks in real-world deployment. In this work, we present a simple and effective approach to detect and mitigate such attacks using influence functions, a classical statistical tool adapted for machine learning interpretation. Traditionally, the high computational costs of influence functions have limited their application to large models and datasets. The recent Eigenvalue-Corrected Kronecker-Factored Approximate Curvature (EK-FAC) approximation method enables efficient influence score computation, making it feasible for large-scale analysis. We are the first to apply influence functions for detecting language model instruction fine-tuning attacks on large-scale datasets, as both the instruction fine-tuning attack on language models and the influence calculation approximation technique are relatively new. Our large-scale empirical evaluation of influence functions on 50,000 fine-tuning examples and 32 tasks reveals a strong association between influence scores and sentiment. Building on this, we introduce a novel sentiment transformation combined with influence functions to detect and remove critical poisons – poisoned data points that skew model predictions. Removing these poisons (only 1% of total data) recovers model performance to near-clean levels, demonstrating the effectiveness and efficiency of our approach. Artifact is available at this https URL. WARNING: This paper contains offensive data examples. Subjects: Machine Learning (cs.LG); Cryptography and Security (cs.CR) Cite as: arXiv:2504.09026 [cs.LG] (or arXiv:2504.09026v1 [cs.LG] for this version) https://doi.org/10.48550/arXiv.2504.09026 Focus to learn more arXiv-issued DOI via DataCite (pending registration)

[LG-86] A Champion-level Vision-based Reinforcement Learning Agent for Competitive Racing in Gran Turismo 7

链接: https://arxiv.org/abs/2504.09021
作者: Hojoon Lee,Takuma Seno,Jun Jet Tai,Kaushik Subramanian,Kenta Kawamoto,Peter Stone,Peter R. Wurman
类目: Machine Learning (cs.LG)
*备注: Accepted for Publication at the IEEE Robotics and Automation Letters (RA-L) 2025

点击查看摘要

Abstract:Deep reinforcement learning has achieved superhuman racing performance in high-fidelity simulators like Gran Turismo 7 (GT7). It typically utilizes global features that require instrumentation external to a car, such as precise localization of agents and opponents, limiting real-world applicability. To address this limitation, we introduce a vision-based autonomous racing agent that relies solely on ego-centric camera views and onboard sensor data, eliminating the need for precise localization during inference. This agent employs an asymmetric actor-critic framework: the actor uses a recurrent neural network with the sensor data local to the car to retain track layouts and opponent positions, while the critic accesses the global features during training. Evaluated in GT7, our agent consistently outperforms GT7’s built-drivers. To our knowledge, this work presents the first vision-based autonomous racing agent to demonstrate champion-level performance in competitive racing scenarios.

[LG-87] Learning in Structured Stackelberg Games

链接: https://arxiv.org/abs/2504.09006
作者: Maria-Florina Balcan,Kiriaki Fragkia,Keegan Harris
类目: Computer Science and Game Theory (cs.GT); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We study structured Stackelberg games, in which both players (the leader and the follower) observe information about the state of the world at time of play. Importantly, this information may contain information about the follower, which the leader may use when deciding her strategy. Under this setting, we show that no-regret learning is possible if and only if the set of mappings from contexts to follower types that the leader uses to learn is not ``too complex’'. Specifically, we find that standard learning theoretic measures of complexity do not characterize learnability in our setting and we give a new dimension which does, which we term the Stackelberg-Littlestone dimension. In the distributional setting, we give analogous results by showing that standard complexity measures do not characterize the sample complexity of learning, but a new dimension called the Stackelberg-Natarajan dimension does. We then show that an appropriate empirical risk minimization procedure achieves the corresponding sample complexity.

[LG-88] ReCA: A Parametric ReLU Composite Activation Function

链接: https://arxiv.org/abs/2504.08994
作者: John Chidiac,Danielle Azar
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Activation functions have been shown to affect the performance of deep neural networks significantly. While the Rectified Linear Unit (ReLU) remains the dominant choice in practice, the optimal activation function for deep neural networks remains an open research question. In this paper, we propose a novel parametric activation function, ReCA, based on ReLU, which has been shown to outperform all baselines on state-of-the-art datasets using different complex neural network architectures.

[LG-89] RouterKT: Mixture-of-Experts for Knowledge Tracing

链接: https://arxiv.org/abs/2504.08989
作者: Han Liao,Shuaishuai Zu
类目: Computers and Society (cs.CY); Information Retrieval (cs.IR); Machine Learning (cs.LG)
*备注: 10 pages

点击查看摘要

Abstract:Knowledge Tracing (KT) is a fundamental task in Intelligent Tutoring Systems (ITS), which aims to model the dynamic knowledge states of students based on their interaction histories. However, existing KT models often rely on a global forgetting decay mechanism for capturing learning patterns, assuming that students’ performance is predominantly influenced by their most recent interactions. Such approaches fail to account for the diverse and complex learning patterns arising from individual differences and varying learning stages. To address this limitation, we propose RouterKT, a novel Mixture-of-Experts (MoE) architecture designed to capture heterogeneous learning patterns by enabling experts to specialize in different patterns without any handcrafted learning pattern bias such as forgetting decay. Specifically, RouterKT introduces a \textbfperson-wise routing mechanism to effectively model individual-specific learning behaviors and employs \textbfmulti-heads as experts to enhance the modeling of complex and diverse patterns. Comprehensive experiments on ten benchmark datasets demonstrate that RouterKT exhibits significant flexibility and improves the performance of various KT backbone models, with a maximum average AUC improvement of 3.29% across different backbones and datasets, outperforming other state-of-the-art models. Moreover, RouterKT demonstrates consistently superior inference efficiency compared to existing approaches based on handcrafted learning pattern bias, highlighting its usability for real-world educational applications. The source code is available at this https URL.

[LG-90] Bidirectional Linear Recurrent Models for Sequence-Level Multisource Fusion

链接: https://arxiv.org/abs/2504.08964
作者: Qisai Liu,Zhanhong Jiang,Joshua R. Waite,Chao Liu,Aditya Balu,Soumik Sarkar
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Sequence modeling is a critical yet challenging task with wide-ranging applications, especially in time series forecasting for domains like weather prediction, temperature monitoring, and energy load forecasting. Transformers, with their attention mechanism, have emerged as state-of-the-art due to their efficient parallel training, but they suffer from quadratic time complexity, limiting their scalability for long sequences. In contrast, recurrent neural networks (RNNs) offer linear time complexity, spurring renewed interest in linear RNNs for more computationally efficient sequence modeling. In this work, we introduce BLUR (Bidirectional Linear Unit for Recurrent network), which uses forward and backward linear recurrent units (LRUs) to capture both past and future dependencies with high computational efficiency. BLUR maintains the linear time complexity of traditional RNNs, while enabling fast parallel training through LRUs. Furthermore, it offers provably stable training and strong approximation capabilities, making it highly effective for modeling long-term dependencies. Extensive experiments on sequential image and time series datasets reveal that BLUR not only surpasses transformers and traditional RNNs in accuracy but also significantly reduces computational costs, making it particularly suitable for real-world forecasting tasks. Our code is available here.

[LG-91] An Adaptive Vector Index Partitioning Scheme for Low-Latency RAG Pipeline

链接: https://arxiv.org/abs/2504.08930
作者: Junkyum Kim,Divya Mahajan
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Retrieval Augmented Generation (RAG) systems enhance response quality by integrating Large Language Models (LLMs) with vector databases, enabling external knowledge retrieval to support language model reasoning. While RAG enables efficient question answering with smaller LLMs, existing optimizations for vector search and LLM serving have largely been developed in isolation. As a result, their integration often leads to suboptimal end-to-end performance. … This paper introduces VectorLiteRAG, an optimized vector index partitioning mechanism designed for RAG systems that enhances the responsiveness of the system by jointly optimizing vector search and LLM serving across CPU and GPU system. A key challenge is to determine which indices and how much of the vector index should reside on the GPU and adjusting LLM batch sizes to balance the pipeline for lower Time-To-First-Token (TTFT) and meeting user-defined Service-Level Objectives (SLOs). To address this, we leverage the insight that cluster access in vector databases exhibits access skew, where a subset of clusters are queried significantly more frequently than others. VectorLiteRAG exploits this property through an optimized memory distribution strategy, dynamically allocating the minimum number of vector indices corresponding to frequently accessed clusters onto the GPU HBM to ensure a balanced pipeline with the LLM for high responsiveness. This adaptive partitioning scheme is guided by a statistical model that informs memory allocation and workload distribution. Our evaluation demonstrates that VectorLiteRAG improves vector search responsiveness by 2x, significantly reduces end-to-end TTFT in RAG systems by intelligently balancing memory resources between vector search and LLM execution.

[LG-92] oward Spiking Neural Network Local Learning Modules Resistant to Adversarial Attacks

链接: https://arxiv.org/abs/2504.08897
作者: Jiaqi Lin,Abhronil Sengupta
类目: Machine Learning (cs.LG); Cryptography and Security (cs.CR); Neural and Evolutionary Computing (cs.NE)
*备注:

点击查看摘要

Abstract:Recent research has shown the vulnerability of Spiking Neural Networks (SNNs) under adversarial examples that are nearly indistinguishable from clean data in the context of frame-based and event-based information. The majority of these studies are constrained in generating adversarial examples using Backpropagation Through Time (BPTT), a gradient-based method which lacks biological plausibility. In contrast, local learning methods, which relax many of BPTT’s constraints, remain under-explored in the context of adversarial attacks. To address this problem, we examine adversarial robustness in SNNs through the framework of four types of training algorithms. We provide an in-depth analysis of the ineffectiveness of gradient-based adversarial attacks to generate adversarial instances in this scenario. To overcome these limitations, we introduce a hybrid adversarial attack paradigm that leverages the transferability of adversarial instances. The proposed hybrid approach demonstrates superior performance, outperforming existing adversarial attack methods. Furthermore, the generalizability of the method is assessed under multi-step adversarial attacks, adversarial attacks in black-box FGSM scenarios, and within the non-spiking domain.

[LG-93] Knowledge Graph-extended Retrieval Augmented Generation for Question Answering

链接: https://arxiv.org/abs/2504.08893
作者: Jasper Linders,Jakub M. Tomczak
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Large Language Models (LLMs) and Knowledge Graphs (KGs) offer a promising approach to robust and explainable Question Answering (QA). While LLMs excel at natural language understanding, they suffer from knowledge gaps and hallucinations. KGs provide structured knowledge but lack natural language interaction. Ideally, an AI system should be both robust to missing facts as well as easy to communicate with. This paper proposes such a system that integrates LLMs and KGs without requiring training, ensuring adaptability across different KGs with minimal human effort. The resulting approach can be classified as a specific form of a Retrieval Augmented Generation (RAG) with a KG, thus, it is dubbed Knowledge Graph-extended Retrieval Augmented Generation (KG-RAG). It includes a question decomposition module to enhance multi-hop information retrieval and answer explainability. Using In-Context Learning (ICL) and Chain-of-Thought (CoT) prompting, it generates explicit reasoning chains processed separately to improve truthfulness. Experiments on the MetaQA benchmark show increased accuracy for multi-hop questions, though with a slight trade-off in single-hop performance compared to LLM with KG baselines. These findings demonstrate KG-RAG’s potential to improve transparency in QA by bridging unstructured language understanding with structured knowledge retrieval.

[LG-94] he SERENADE project: Sensor-Based Explainable Detection of Cognitive Decline

链接: https://arxiv.org/abs/2504.08877
作者: Gabriele Civitarese,Michele Fiori,Andrea Arighi,Daniela Galimberti,Graziana Florio,Claudio Bettini
类目: Machine Learning (cs.LG); Computers and Society (cs.CY)
*备注:

点击查看摘要

Abstract:Mild Cognitive Impairment (MCI) affects 12-18% of individuals over 60. MCI patients exhibit cognitive dysfunctions without significant daily functional loss. While MCI may progress to dementia, predicting this transition remains a clinical challenge due to limited and unreliable indicators. Behavioral changes, like in the execution of Activities of Daily Living (ADLs), can signal such progression. Sensorized smart homes and wearable devices offer an innovative solution for continuous, non-intrusive monitoring ADLs for MCI patients. However, current machine learning models for detecting behavioral changes lack transparency, hindering clinicians’ trust. This paper introduces the SERENADE project, a European Union-funded initiative that aims to detect and explain behavioral changes associated with cognitive decline using explainable AI methods. SERENADE aims at collecting one year of data from 30 MCI patients living alone, leveraging AI to support clinical decision-making and offering a new approach to early dementia detection.

[LG-95] In almost all shallow analytic neural network optimization landscapes efficient minimizers have strongly convex neighborhoods

链接: https://arxiv.org/abs/2504.08867
作者: Felix Benning,Steffen Dereich
类目: Machine Learning (cs.LG); Probability (math.PR); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:Whether or not a local minimum of a cost function has a strongly convex neighborhood greatly influences the asymptotic convergence rate of optimizers. In this article, we rigorously analyze the prevalence of this property for the mean squared error induced by shallow, 1-hidden layer neural networks with analytic activation functions when applied to regression problems. The parameter space is divided into two domains: the ‘efficient domain’ (all parameters for which the respective realization function cannot be generated by a network having a smaller number of neurons) and the ‘redundant domain’ (the remaining parameters). In almost all regression problems on the efficient domain the optimization landscape only features local minima that are strongly convex. Formally, we will show that for certain randomly picked regression problems the optimization landscape is almost surely a Morse function on the efficient domain. The redundant domain has significantly smaller dimension than the efficient domain and on this domain, potential local minima are never isolated.

[LG-96] owards Combinatorial Interpretability of Neural Computation

链接: https://arxiv.org/abs/2504.08842
作者: Micah Adler,Dan Alistarh,Nir Shavit
类目: Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)
*备注: 47 Pages

点击查看摘要

Abstract:We introduce combinatorial interpretability, a methodology for understanding neural computation by analyzing the combinatorial structures in the sign-based categorization of a network’s weights and biases. We demonstrate its power through feature channel coding, a theory that explains how neural networks compute Boolean expressions and potentially underlies other categories of neural network computation. According to this theory, features are computed via feature channels: unique cross-neuron encodings shared among the inputs the feature operates on. Because different feature channels share neurons, the neurons are polysemantic and the channels interfere with one another, making the computation appear inscrutable. We show how to decipher these computations by analyzing a network’s feature channel coding, offering complete mechanistic interpretations of several small neural networks that were trained with gradient descent. Crucially, this is achieved via static combinatorial analysis of the weight matrices, without examining activations or training new autoencoding networks. Feature channel coding reframes the superposition hypothesis, shifting the focus from neuron activation directionality in high-dimensional space to the combinatorial structure of codes. It also allows us for the first time to exactly quantify and explain the relationship between a network’s parameter size and its computational capacity (i.e. the set of features it can compute with low error), a relationship that is implicitly at the core of many modern scaling laws. Though our initial studies of feature channel coding are restricted to Boolean functions, we believe they provide a rich, controlled, and informative research space, and that the path we propose for combinatorial interpretation of neural computation can provide a basis for understanding both artificial and biological neural circuits. Comments: 47 Pages Subjects: Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE) ACMclasses: I.2.0 Cite as: arXiv:2504.08842 [cs.LG] (or arXiv:2504.08842v1 [cs.LG] for this version) https://doi.org/10.48550/arXiv.2504.08842 Focus to learn more arXiv-issued DOI via DataCite (pending registration)

[LG-97] Probabilistic QoS Metric Forecasting in Delay-Tolerant Networks Using Conditional Diffusion Models on Latent Dynamics

链接: https://arxiv.org/abs/2504.08821
作者: Enming Zhang,Zheng Liu,Yu Xiang,Yanwen Qu
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:Active QoS metric prediction, commonly employed in the maintenance and operation of DTN, could enhance network performance regarding latency, throughput, energy consumption, and dependability. Naturally formulated as a multivariate time series forecasting problem, it attracts substantial research efforts. Traditional mean regression methods for time series forecasting cannot capture the data complexity adequately, resulting in deteriorated performance in operational tasks in DTNs such as routing. This paper formulates the prediction of QoS metrics in DTN as a probabilistic forecasting problem on multivariate time series, where one could quantify the uncertainty of forecasts by characterizing the distribution of these samples. The proposed approach hires diffusion models and incorporates the latent temporal dynamics of non-stationary and multi-mode data into them. Extensive experiments demonstrate the efficacy of the proposed approach by showing that it outperforms the popular probabilistic time series forecasting methods.

[LG-98] A Graph-Enhanced DeepONet Approach for Real-Time Estimating Hydrogen-Enriched Natural Gas Flow under Variable Operations

链接: https://arxiv.org/abs/2504.08816
作者: Sicheng Liu,Hongchang Huang,Bo Yang,Mingxuan Cai,Xu Yang,Xinping Guan
类目: Machine Learning (cs.LG); Systems and Control (eess.SY)
*备注:

点击查看摘要

Abstract:Blending green hydrogen into natural gas presents a promising approach for renewable energy integration and fuel decarbonization. Accurate estimation of hydrogen fraction in hydrogen-enriched natural gas (HENG) pipeline networks is crucial for operational safety and efficiency, yet it remains challenging due to complex dynamics. While existing data-driven approaches adopt end-to-end architectures for HENG flow state estimation, their limited adaptability to varying operational conditions hinders practical applications. To this end, this study proposes a graph-enhanced DeepONet framework for the real-time estimation of HENG flow, especially hydrogen fractions. First, a dual-network architecture, called branch network and trunk network, is employed to characterize operational conditions and sparse sensor measurements to estimate the HENG state at targeted locations and time points. Second, a graph-enhance branch network is proposed to incorporate pipeline topology, improving the estimation accuracy in large-scale pipeline networks. Experimental results demonstrate that the proposed method achieves superior estimation accuracy for HCNG flow under varying operational conditions compared to conventional approaches.

[LG-99] When Federated Learning Meets Quantum Computing: Survey and Research Opportunities

链接: https://arxiv.org/abs/2504.08814
作者: Aakar Mathur,Ashish Gupta,Sajal K. Das
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Emerging Technologies (cs.ET); Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)
*备注: submitted to IEEE Communications Surveys and Tutorials

点击查看摘要

Abstract:Quantum Federated Learning (QFL) is an emerging field that harnesses advances in Quantum Computing (QC) to improve the scalability and efficiency of decentralized Federated Learning (FL) models. This paper provides a systematic and comprehensive survey of the emerging problems and solutions when FL meets QC, from research protocol to a novel taxonomy, particularly focusing on both quantum and federated limitations, such as their architectures, Noisy Intermediate Scale Quantum (NISQ) devices, and privacy preservation, so on. This work explores key developments and integration strategies, along with the impact of quantum computing on FL, keeping a sharp focus on hybrid quantum-classical approaches. The paper offers an in-depth understanding of how the strengths of QC, such as gradient hiding, state entanglement, quantum key distribution, quantum security, and quantum-enhanced differential privacy, have been integrated into FL to ensure the privacy of participants in an enhanced, fast, and secure framework. Finally, this study proposes potential future directions to address the identified research gaps and challenges, aiming to inspire faster and more secure QFL models for practical use.

[LG-100] Analogical Learning for Cross-Scenario Generalization: Framework and Application to Intelligent Localization

链接: https://arxiv.org/abs/2504.08811
作者: Zirui Chen,Zhaoyang Zhang,Ziqing Xing,Ridong Li,Zhaohui Yang,Richeng Jin,Chongwen Huang,Yuzhi Yang,Mérouane Debbah
类目: Machine Learning (cs.LG); Computational Engineering, Finance, and Science (cs.CE); Signal Processing (eess.SP)
*备注:

点击查看摘要

Abstract:Existing learning models often exhibit poor generalization when deployed across diverse scenarios. It is mainly due to that the underlying reference frame of the data varies with the deployment environment and settings. However, despite the data of each scenario has its distinct reference frame, its generation generally follows the same underlying physical rule. Based on these findings, this article proposes a brand-new universal deep learning framework named analogical learning (AL), which provides a highly efficient way to implicitly retrieve the reference frame information associated with a scenario and then to make accurate prediction by relative analogy across scenarios. Specifically, an elegant bipartite neural network architecture called Mateformer is designed, the first part of which calculates the relativity within multiple feature spaces between the input data and a small amount of embedded data from the current scenario, while the second part uses these relativity to guide the nonlinear analogy. We apply AL to the typical multi-scenario learning problem of intelligent wireless localization in cellular networks. Extensive experiments show that AL achieves state-of-the-art accuracy, stable transferability and robust adaptation to new scenarios without any tuning, and outperforming conventional methods with a precision improvement of nearly two orders of magnitude. All data and code are available at this https URL.

[LG-101] Decoupling Contrastive Decoding: Robust Hallucination Mitigation in Multimodal Large Language Models

链接: https://arxiv.org/abs/2504.08809
作者: Wei Chen,Xin Yan,Bin Wen,Fan Yang,Tingting Gao,Di Zhang,Long Chen
类目: Machine Learning (cs.LG)
*备注: 13 pages, 4 figures

点击查看摘要

Abstract:Although multimodal large language models (MLLMs) exhibit remarkable reasoning capabilities on complex multimodal understanding tasks, they still suffer from the notorious hallucination issue: generating outputs misaligned with obvious visual or factual evidence. Currently, training-based solutions, like direct preference optimization (DPO), leverage paired preference data to suppress hallucinations. However, they risk sacrificing general reasoning capabilities due to the likelihood displacement. Meanwhile, training-free solutions, like contrastive decoding, achieve this goal by subtracting the estimated hallucination pattern from a distorted input. Yet, these handcrafted perturbations (e.g., add noise to images) may poorly capture authentic hallucination patterns. To avoid these weaknesses of existing methods, and realize robust hallucination mitigation (i.e., maintaining general reasoning performance), we propose a novel framework: Decoupling Contrastive Decoding (DCD). Specifically, DCD decouples the learning of positive and negative samples in preference datasets, and trains separate positive and negative image projections within the MLLM. The negative projection implicitly models real hallucination patterns, which enables vision-aware negative images in the contrastive decoding inference stage. Our DCD alleviates likelihood displacement by avoiding pairwise optimization and generalizes robustly without handcrafted degradation. Extensive ablations across hallucination benchmarks and general reasoning tasks demonstrate the effectiveness of DCD, i.e., it matches DPO’s hallucination suppression while preserving general capabilities and outperforms the handcrafted contrastive decoding methods.

[LG-102] Generative AI in Live Operations: Evidence of Productivity Gains in Cybersecurity and Endpoint Management

链接: https://arxiv.org/abs/2504.08805
作者: James Bono,Justin Grana,Kleanthis Karakolios,Pruthvi Hanumanthapura Ramakrishna,Ankit Srivastava
类目: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We measure the association between generative AI (GAI) tool adoption and four metrics spanning security operations, information protection, and endpoint management: 1) number of security alerts per incident, 2) probability of security incident reopenings, 3) time to classify a data loss prevention alert, and 4) time to resolve device policy conflicts. We find that GAI is associated with robust and statistically and practically significant improvements in the four metrics. Although unobserved confounders inhibit causal identification, these results are among the first to use observational data from live operations to investigate the relationship between GAI adoption and security operations, data loss prevention, and device policy management.

[LG-103] A temporal scale transformer framework for precise remaining useful life prediction in fuel cells

链接: https://arxiv.org/abs/2504.08803
作者: Zezhi Tang,Xiaoyu Chen,Xin Jin,Benyuan Zhang,Wenyu Liang
类目: Machine Learning (cs.LG)
*备注: 7 figs, 10 pages

点击查看摘要

Abstract:In exploring Predictive Health Management (PHM) strategies for Proton Exchange Membrane Fuel Cells (PEMFC), the Transformer model, widely used in data-driven approaches, excels in many fields but struggles with time series analysis due to its self-attention mechanism, which yields a complexity of the input sequence squared and low computational efficiency. It also faces challenges in capturing both global long-term dependencies and local details effectively. To tackle this, we propose the Temporal Scale Transformer (TSTransformer), an enhanced version of the inverted Transformer (iTransformer). Unlike traditional Transformers that treat each timestep as an input token, TSTransformer maps sequences of varying lengths into tokens at different stages for inter-sequence modeling, using attention to capture multivariate correlations and feed-forward networks (FFN) to encode sequence representations. By integrating a one-dimensional convolutional layer into the multivariate attention for multi-level scaling of K and V matrices, it improves local feature extraction, captures temporal scale characteristics, and reduces token count and computational costs. Experiments comparing TSTransformer with models like Long Short-Term Memory, iTransformer, and Transformer demonstrate its potential as a powerful tool for advancing PHM in renewable energy, effectively addressing the limitations of pure Transformer models in data-driven time series tasks.

[LG-104] InfoGain Wavelets: Furthering the Design of Diffusion Wavelets for Graph-Structured Data

链接: https://arxiv.org/abs/2504.08802
作者: David R. Johnson,Smita Krishnaswamy,Michael Perlmutter
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: This work was accepted to be presented at the Graph Signal Processing Workshop 2025

点击查看摘要

Abstract:Diffusion wavelets extract information from graph signals at different scales of resolution by utilizing graph diffusion operators raised to various powers, known as diffusion scales. Traditionally, the diffusion scales are chosen to be dyadic integers, \mathbf2^j . Here, we propose a novel, unsupervised method for selecting the diffusion scales based on ideas from information theory. We then show that our method can be incorporated into wavelet-based GNNs via graph classification experiments.

[LG-105] SLOs-Serve: Optimized Serving of Multi-SLO LLM s

链接: https://arxiv.org/abs/2504.08784
作者: Siyuan Chen,Zhipeng Jia,Samira Khan,Arvind Krishnamurthy,Phillip B. Gibbons
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:This paper introduces SLOs-Serve, a system designed for serving multi-stage large language model (LLM) requests with application- and stage-specific service level objectives (SLOs). The key idea behind SLOs-Serve is to customize the allocation of tokens to meet these SLO requirements. SLOs-Serve uses a multi-SLO dynamic programming-based algorithm to continuously optimize token allocations under SLO constraints by exploring the full design space of chunked prefill and (optional) speculative decoding. Leveraging this resource planning algorithm, SLOs-Serve effectively supports multi-SLOs and multi-replica serving with dynamic request routing while being resilient to bursty arrivals. Our evaluation across 6 LLM application scenarios (including summarization, coding, chatbot, tool calling, and reasoning) demonstrates that SLOs-Serve improves per-GPU serving capacity by 2.2x on average compared to prior state-of-the-art systems.

[LG-106] How Relevance Emerges: Interpreting LoRA Fine-Tuning in Reranking LLM s

链接: https://arxiv.org/abs/2504.08780
作者: Atharva Nijasure,Tanya Chowdhury,James Allan
类目: Information Retrieval (cs.IR); Machine Learning (cs.LG)
*备注: Extended Abstract

点击查看摘要

Abstract:We conduct a behavioral exploration of LoRA fine-tuned LLMs for Passage Reranking to understand how relevance signals are learned and deployed by Large Language Models. By fine-tuning Mistral-7B, LLaMA3.1-8B, and Pythia-6.9B on MS MARCO under diverse LoRA configurations, we investigate how relevance modeling evolves across checkpoints, the impact of LoRA rank (1, 2, 8, 32), and the relative importance of updated MHA vs. MLP components. Our ablations reveal which layers and projections within LoRA transformations are most critical for reranking accuracy. These findings offer fresh explanations into LoRA’s adaptation mechanisms, setting the stage for deeper mechanistic studies in Information Retrieval. All models used in this study have been shared.

[LG-107] Counterfactual Inference under Thompson Sampling

链接: https://arxiv.org/abs/2504.08773
作者: Olivier Jeunen
类目: Information Retrieval (cs.IR); Machine Learning (cs.LG); Methodology (stat.ME)
*备注:

点击查看摘要

Abstract:Recommender systems exemplify sequential decision-making under uncertainty, strategically deciding what content to serve to users, to optimise a range of potential objectives. To balance the explore-exploit trade-off successfully, Thompson sampling provides a natural and widespread paradigm to probabilistically select which action to take. Questions of causal and counterfactual inference, which underpin use-cases like offline evaluation, are not straightforward to answer in these contexts. Specifically, whilst most existing estimators rely on action propensities, these are not readily available under Thompson sampling procedures. We derive exact and efficiently computable expressions for action propensities under a variety of parameter and outcome distributions, enabling the use of off-policy estimators in Thompson sampling scenarios. This opens up a range of practical use-cases where counterfactual inference is crucial, including unbiased offline evaluation of recommender systems, as well as general applications of causal inference in online advertising, personalisation, and beyond. Subjects: Information Retrieval (cs.IR); Machine Learning (cs.LG); Methodology (stat.ME) Cite as: arXiv:2504.08773 [cs.IR] (or arXiv:2504.08773v1 [cs.IR] for this version) https://doi.org/10.48550/arXiv.2504.08773 Focus to learn more arXiv-issued DOI via DataCite

[LG-108] Dynamic Topic Analysis in Academic Journals using Convex Non-negative Matrix Factorization Method

链接: https://arxiv.org/abs/2504.08743
作者: Yang Yang,Tong Zhang,Jian Wu,Lijie Su
类目: Information Retrieval (cs.IR); Machine Learning (cs.LG); Systems and Control (eess.SY); Optimization and Control (math.OC); Applications (stat.AP)
*备注: 11 pages, 7 figures, 6 tables

点击查看摘要

Abstract:With the rapid advancement of large language models, academic topic identification and topic evolution analysis are crucial for enhancing AI’s understanding capabilities. Dynamic topic analysis provides a powerful approach to capturing and understanding the temporal evolution of topics in large-scale datasets. This paper presents a two-stage dynamic topic analysis framework that incorporates convex optimization to improve topic consistency, sparsity, and interpretability. In Stage 1, a two-layer non-negative matrix factorization (NMF) model is employed to extract annual topics and identify key terms. In Stage 2, a convex optimization algorithm refines the dynamic topic structure using the convex NMF (cNMF) model, further enhancing topic integration and stability. Applying the proposed method to IEEE journal abstracts from 2004 to 2022 effectively identifies and quantifies emerging research topics, such as COVID-19 and digital twins. By optimizing sparsity differences in the clustering feature space between traditional and emerging research topics, the framework provides deeper insights into topic evolution and ranking analysis. Moreover, the NMF-cNMF model demonstrates superior stability in topic consistency. At sparsity levels of 0.4, 0.6, and 0.9, the proposed approach improves topic ranking stability by 24.51%, 56.60%, and 36.93%, respectively. The source code (to be open after publication) is available at this https URL.

[LG-109] Recommendation System in Advertising and Streaming Media: Unsupervised Data Enhancement Sequence Suggestions

链接: https://arxiv.org/abs/2504.08740
作者: Kowei Shih,Yi Han,Li Tan
类目: Information Retrieval (cs.IR); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Sequential recommendation is an extensively explored approach to capturing users’ evolving preferences based on past interactions, aimed at predicting their next likely choice. Despite significant advancements in this domain, including methods based on RNNs and self-attention, challenges like limited supervised signals and noisy data caused by unintentional clicks persist. To address these challenges, some studies have incorporated unsupervised learning by leveraging local item contexts within individual sequences. However, these methods often overlook the intricate associations between items across multiple sequences and are susceptible to noise in item co-occurrence patterns. In this context, we introduce a novel framework, Global Unsupervised Data-Augmentation (UDA4SR), which adopts a graph contrastive learning perspective to generate more robust item embeddings for sequential recommendation. Our approach begins by integrating Generative Adversarial Networks (GANs) for data augmentation, which serves as the first step to enhance the diversity and richness of the training data. Then, we build a Global Item Relationship Graph (GIG) based on all user interaction sequences. Subsequently, we employ graph contrastive learning on the refined graph to enhance item embeddings by capturing complex global associations. To model users’ dynamic and diverse interests more effectively, we enhance the CapsNet module with a novel target-attention mechanism. Extensive experiments show that UDA4SR significantly outperforms state-of-the-art approaches.

[LG-110] Feature Distribution on Graph Topology Mediates the Effect of Graph Convolution: Homophily Perspective ICML2024

链接: https://arxiv.org/abs/2402.04621
作者: Soo Yong Lee,Sunwoo Kim,Fanchen Bu,Jaemin Yoo,Jiliang Tang,Kijung Shin
类目: Machine Learning (cs.LG); Social and Information Networks (cs.SI)
*备注: published in ICML 2024

点击查看摘要

Abstract:How would randomly shuffling feature vectors among nodes from the same class affect graph neural networks (GNNs)? The feature shuffle, intuitively, perturbs the dependence between graph topology and features (A-X dependence) for GNNs to learn from. Surprisingly, we observe a consistent and significant improvement in GNN performance following the feature shuffle. Having overlooked the impact of A-X dependence on GNNs, the prior literature does not provide a satisfactory understanding of the phenomenon. Thus, we raise two research questions. First, how should A-X dependence be measured, while controlling for potential confounds? Second, how does A-X dependence affect GNNs? In response, we (i) propose a principled measure for A-X dependence, (ii) design a random graph model that controls A-X dependence, (iii) establish a theory on how A-X dependence relates to graph convolution, and (iv) present empirical analysis on real-world graphs that align with the theory. We conclude that A-X dependence mediates the effect of graph convolution, such that smaller dependence improves GNN-based node classification.

[LG-111] Anchors no more: Using peculiar velocities to constrain H_0 and the primordial Universe without calibrators

链接: https://arxiv.org/abs/2504.10453
作者: Davide Piras,Francesco Sorrenti,Ruth Durrer,Martin Kunz
类目: Cosmology and Nongalactic Astrophysics (astro-ph.CO); Instrumentation and Methods for Astrophysics (astro-ph.IM); Machine Learning (cs.LG); General Relativity and Quantum Cosmology (gr-qc)
*备注: 22 pages, 5 figures, comments welcome. Code available at this https URL

点击查看摘要

Abstract:We develop a novel approach to constrain the Hubble parameter H_0 and the primordial power spectrum amplitude A_\mathrms using supernovae type Ia (SNIa) data. By considering SNIa as tracers of the peculiar velocity field, we can model their distance and their covariance as a function of cosmological parameters without the need of calibrators like Cepheids; this yields a new independent probe of the large-scale structure based on SNIa data without distance anchors. Crucially, we implement a differentiable pipeline in JAX, including efficient emulators and affine sampling, reducing inference time from years to hours on a single GPU. We first validate our method on mock datasets, demonstrating that we can constrain H_0 and \log 10^10A_\mathrms within \sim10% using \sim10^3 SNIa. We then test our pipeline with SNIa from an N -body simulation, obtaining 7% -level unbiased constraints on H_0 with a moderate noise level. We finally apply our method to Pantheon+ data, constraining H_0 at the 10% level without Cepheids when fixing A_\mathrms to its \itPlanck value. On the other hand, we obtain 15% -level constraints on \log 10^10A_\mathrms in agreement with \itPlanck when including Cepheids in the analysis. In light of upcoming observations of low redshift SNIa from the Zwicky Transient Facility and the Vera Rubin Legacy Survey of Space and Time, surveys for which our method will develop its full potential, we make our code publicly available.

[LG-112] Learning with Positive and Imperfect Unlabeled Data

链接: https://arxiv.org/abs/2504.10428
作者: Jane H. Lee,Anay Mehrotra,Manolis Zampetakis
类目: Machine Learning (stat.ML); Data Structures and Algorithms (cs.DS); Machine Learning (cs.LG); Statistics Theory (math.ST)
*备注:

点击查看摘要

Abstract:We study the problem of learning binary classifiers from positive and unlabeled data when the unlabeled data distribution is shifted, which we call Positive and Imperfect Unlabeled (PIU) Learning. In the absence of covariate shifts, i.e., with perfect unlabeled data, Denis (1998) reduced this problem to learning under Massart noise; however, that reduction fails under even slight shifts. Our main results on PIU learning are the characterizations of the sample complexity of PIU learning and a computationally and sample-efficient algorithm achieving a misclassification error \varepsilon . We further show that our results lead to new algorithms for several related problems. 1. Learning from smooth distributions: We give algorithms that learn interesting concept classes from only positive samples under smooth feature distributions, bypassing known existing impossibility results and contributing to recent advances in smoothened learning (Haghtalab et al, this http URL’24) (Chandrasekaran et al., COLT’24). 2. Learning with a list of unlabeled distributions: We design new algorithms that apply to a broad class of concept classes under the assumption that we are given a list of unlabeled distributions, one of which–unknown to the learner–is O(1) -close to the true feature distribution. 3. Estimation in the presence of unknown truncation: We give the first polynomial sample and time algorithm for estimating the parameters of an exponential family distribution from samples truncated to an unknown set approximable by polynomials in L_1 -norm. This improves the algorithm by Lee et al. (FOCS’24) that requires approximation in L_2 -norm. 4. Detecting truncation: We present new algorithms for detecting whether given samples have been truncated (or not) for a broad class of non-product distributions, including non-product distributions, improving the algorithm by De et al. (STOC’24). Subjects: Machine Learning (stat.ML); Data Structures and Algorithms (cs.DS); Machine Learning (cs.LG); Statistics Theory (math.ST) Cite as: arXiv:2504.10428 [stat.ML] (or arXiv:2504.10428v1 [stat.ML] for this version) https://doi.org/10.48550/arXiv.2504.10428 Focus to learn more arXiv-issued DOI via DataCite (pending registration) Submission history From: Anay Mehrotra [view email] [v1] Mon, 14 Apr 2025 17:19:29 UTC (352 KB) Full-text links: Access Paper: View a PDF of the paper titled Learning with Positive and Imperfect Unlabeled Data, by Jane H. Lee and Anay Mehrotra and Manolis ZampetakisView PDFTeX SourceOther Formats view license Current browse context: stat.ML prev | next new | recent | 2025-04 Change to browse by: cs cs.DS cs.LG math math.ST stat stat.TH References Citations NASA ADSGoogle Scholar Semantic Scholar a export BibTeX citation Loading… BibTeX formatted citation loading… Data provided by: Bookmark checked=“checked”> Bibliographic Tools Bibliographic and Citation Tools Bibliographic Explorer Toggle Bibliographic Explorer (What is the Explorer?) Connected Papers Toggle Connected Papers (What is Connected Papers?) Litmaps Toggle Litmaps (What is Litmaps?) scite.ai Toggle scite Smart Citations (What are Smart Citations?) Code, Data, Media Code, Data and Media Associated with this Article alphaXiv Toggle alphaXiv (What is alphaXiv?) Links to Code Toggle CatalyzeX Code Finder for Papers (What is CatalyzeX?) DagsHub Toggle DagsHub (What is DagsHub?) GotitPub Toggle Gotit.pub (What is GotitPub?) Huggingface Toggle Hugging Face (What is Huggingface?) Links to Code Toggle Papers with Code (What is Papers with Code?) ScienceCast Toggle ScienceCast (What is ScienceCast?) Demos Demos Replicate Toggle Replicate (What is Replicate?) Spaces Toggle Hugging Face Spaces (What is Spaces?) Spaces Toggle TXYZ.AI (What is TXYZ.AI?) Related Papers Recommenders and Search Tools Link to Influence Flower Influence Flower (What are Influence Flowers?) Core recommender toggle CORE Recommender (What is CORE?) Author Venue Institution Topic About arXivLabs arXivLabs: experimental projects with community collaborators arXivLabs is a framework that allows collaborators to develop and share new arXiv features directly on our website. Both individuals and organizations that work with arXivLabs have embraced and accepted our values of openness, community, excellence, and user data privacy. arXiv is committed to these values and only works with partners that adhere to them. Have an idea for a project that will add value for arXiv’s community? Learn more about arXivLabs. Which authors of this paper are endorsers? | Disable MathJax (What is MathJax?) mathjaxToggle(); About Help contact arXivClick here to contact arXiv Contact subscribe to arXiv mailingsClick here to subscribe Subscribe Copyright Privacy Policy Web Accessibility Assistance arXiv Operational Status Get status notifications via email or slack

[LG-113] Inferring genotype-phenotype maps using attention models

链接: https://arxiv.org/abs/2504.10388
作者: Krishna Rijal,Caroline M. Holmes,Samantha Petti,Gautam Reddy,Michael M. Desai,Pankaj Mehta
类目: Genomics (q-bio.GN); Machine Learning (cs.LG); Populations and Evolution (q-bio.PE)
*备注:

点击查看摘要

Abstract:Predicting phenotype from genotype is a central challenge in genetics. Traditional approaches in quantitative genetics typically analyze this problem using methods based on linear regression. These methods generally assume that the genetic architecture of complex traits can be parameterized in terms of an additive model, where the effects of loci are independent, plus (in some cases) pairwise epistatic interactions between loci. However, these models struggle to analyze more complex patterns of epistasis or subtle gene-environment interactions. Recent advances in machine learning, particularly attention-based models, offer a promising alternative. Initially developed for natural language processing, attention-based models excel at capturing context-dependent interactions and have shown exceptional performance in predicting protein structure and function. Here, we apply attention-based models to quantitative genetics. We analyze the performance of this attention-based approach in predicting phenotype from genotype using simulated data across a range of models with increasing epistatic complexity, and using experimental data from a recent quantitative trait locus mapping study in budding yeast. We find that our model demonstrates superior out-of-sample predictions in epistatic regimes compared to standard methods. We also explore a more general multi-environment attention-based model to jointly analyze genotype-phenotype maps across multiple environments and show that such architectures can be used for “transfer learning” - predicting phenotypes in novel environments with limited training data.

[LG-114] Adaptive Sensor Steering Strategy Using Deep Reinforcement Learning for Dynamic Data Acquisition in Digital Twins

链接: https://arxiv.org/abs/2504.10248
作者: Collins O. Ogbodo,Timothy J. Rogers,Mattia Dal Borgo,David J. Wagg
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Signal Processing (eess.SP)
*备注: 18 pages, 14 figures

点击查看摘要

Abstract:This paper introduces a sensor steering methodology based on deep reinforcement learning to enhance the predictive accuracy and decision support capabilities of digital twins by optimising the data acquisition process. Traditional sensor placement techniques are often constrained by one-off optimisation strategies, which limit their applicability for online applications requiring continuous informative data assimilation. The proposed approach addresses this limitation by offering an adaptive framework for sensor placement within the digital twin paradigm. The sensor placement problem is formulated as a Markov decision process, enabling the training and deployment of an agent capable of dynamically repositioning sensors in response to the evolving conditions of the physical structure as represented by the digital twin. This ensures that the digital twin maintains a highly representative and reliable connection to its physical counterpart. The proposed framework is validated through a series of comprehensive case studies involving a cantilever plate structure subjected to diverse conditions, including healthy and damaged conditions. The results demonstrate the capability of the deep reinforcement learning agent to adaptively reposition sensors improving the quality of data acquisition and hence enhancing the overall accuracy of digital twins.

[LG-115] Conditional Distribution Compression via the Kernel Conditional Mean Embedding

链接: https://arxiv.org/abs/2504.10139
作者: Dominic Broadbent,Nick Whiteley,Robert Allison,Tom Lovett
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Computation (stat.CO); Methodology (stat.ME)
*备注: 68 pages, 28 figures

点击查看摘要

Abstract:Existing distribution compression methods, like Kernel Herding (KH), were originally developed for unlabelled data. However, no existing approach directly compresses the conditional distribution of labelled data. To address this gap, we first introduce the Average Maximum Conditional Mean Discrepancy (AMCMD), a natural metric for comparing conditional distributions. We then derive a consistent estimator for the AMCMD and establish its rate of convergence. Next, we make a key observation: in the context of distribution compression, the cost of constructing a compressed set targeting the AMCMD can be reduced from \mathcalO(n^3) to \mathcalO(n) . Building on this, we extend the idea of KH to develop Average Conditional Kernel Herding (ACKH), a linear-time greedy algorithm that constructs a compressed set targeting the AMCMD. To better understand the advantages of directly compressing the conditional distribution rather than doing so via the joint distribution, we introduce Joint Kernel Herding (JKH), a straightforward adaptation of KH designed to compress the joint distribution of labelled data. While herding methods provide a simple and interpretable selection process, they rely on a greedy heuristic. To explore alternative optimisation strategies, we propose Joint Kernel Inducing Points (JKIP) and Average Conditional Kernel Inducing Points (ACKIP), which jointly optimise the compressed set while maintaining linear complexity. Experiments show that directly preserving conditional distributions with ACKIP outperforms both joint distribution compression (via JKH and JKIP) and the greedy selection used in ACKH. Moreover, we see that JKIP consistently outperforms JKH.

[LG-116] Physical Scales Matter: The Role of Receptive Fields and Advection in Satellite-Based Thunderstorm Nowcasting with Convolutional Neural Networks

链接: https://arxiv.org/abs/2504.09994
作者: Christoph Metzl,Kianusch Vahid Yousefnia,Richard Müller,Virginia Poli,Miria Celano,Tobias Bölle
类目: Atmospheric and Oceanic Physics (physics.ao-ph); Machine Learning (cs.LG)
*备注: 11 pages, 10 figures, 1 table. This work has been submitted to Artificial Intelligence for the Earth Systems (AIES). Copyright in this work may be transferred without further notice

点击查看摘要

Abstract:The focus of nowcasting development is transitioning from physically motivated advection methods to purely data-driven Machine Learning (ML) approaches. Nevertheless, recent work indicates that incorporating advection into the ML value chain has improved skill for radar-based precipitation nowcasts. However, the generality of this approach and the underlying causes remain unexplored. This study investigates the generality by probing the approach on satellite-based thunderstorm nowcasts for the first time. Resorting to a scale argument, we then put forth an explanation when and why skill improvements can be expected. In essence, advection guarantees that thunderstorm patterns relevant for nowcasting are contained in the receptive field at long lead times. To test our hypotheses, we train ResU-Nets solving segmentation tasks with lightning observations as ground truth. The input of the Baseline Neural Network (BNN) are short time series of multispectral satellite imagery and lightning observations, whereas the Advection-Informed Neural Network (AINN) additionally receives the Lagrangian persistence nowcast of all input channels at the desired lead time. Overall, we find only a minor skill improvement of the AINN over the BNN when considering fully averaged scores. However, assessing skill conditioned on lead time and wind speed, we demonstrate that our scale argument correctly predicts the onset of skill improvement of the AINN over the BNN after 2h lead time. We confirm that generally advection becomes gradually more important with longer lead times and higher wind speeds. Our work accentuates the importance of considering and incorporating the underlying physical scales when designing ML based forecasting models.

[LG-117] owards Weaker Variance Assumptions for Stochastic Optimization

链接: https://arxiv.org/abs/2504.09951
作者: Ahmet Alacaoglu,Yura Malitsky,Stephen J. Wright
类目: Optimization and Control (math.OC); Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:We revisit a classical assumption for analyzing stochastic gradient algorithms where the squared norm of the stochastic subgradient (or the variance for smooth problems) is allowed to grow as fast as the squared norm of the optimization variable. We contextualize this assumption in view of its inception in the 1960s, its seemingly independent appearance in the recent literature, its relationship to weakest-known variance assumptions for analyzing stochastic gradient algorithms, and its relevance in deterministic problems for non-Lipschitz nonsmooth convex optimization. We build on and extend a connection recently made between this assumption and the Halpern iteration. For convex nonsmooth, and potentially stochastic, optimization, we analyze horizon-free, anytime algorithms with last-iterate rates. For problems beyond simple constrained optimization, such as convex problems with functional constraints or regularized convex-concave min-max problems, we obtain rates for optimality measures that do not require boundedness of the feasible set.

[LG-118] Score Matching Diffusion Based Feedback Control and Planning of Nonlinear Systems

链接: https://arxiv.org/abs/2504.09836
作者: Karthik Elamvazhuthi,Darshan Gadginmath,Fabio Pasqualetti
类目: Optimization and Control (math.OC); Machine Learning (cs.LG); Robotics (cs.RO); Systems and Control (eess.SY)
*备注:

点击查看摘要

Abstract:We propose a novel control-theoretic framework that leverages principles from generative modeling – specifically, Denoising Diffusion Probabilistic Models (DDPMs) – to stabilize control-affine systems with nonholonomic constraints. Unlike traditional stochastic approaches, which rely on noise-driven dynamics in both forward and reverse processes, our method crucially eliminates the need for noise in the reverse phase, making it particularly relevant for control applications. We introduce two formulations: one where noise perturbs all state dimensions during the forward phase while the control system enforces time reversal deterministically, and another where noise is restricted to the control channels, embedding system constraints directly into the forward process. For controllable nonlinear drift-free systems, we prove that deterministic feedback laws can exactly reverse the forward process, ensuring that the system’s probability density evolves correctly without requiring artificial diffusion in the reverse phase. Furthermore, for linear time-invariant systems, we establish a time-reversal result under the second formulation. By eliminating noise in the backward process, our approach provides a more practical alternative to machine learning-based denoising methods, which are unsuitable for control applications due to the presence of stochasticity. We validate our results through numerical simulations on benchmark systems, including a unicycle model in a domain with obstacles, a driftless five-dimensional system, and a four-dimensional linear system, demonstrating the potential for applying diffusion-inspired techniques in linear, nonlinear, and settings with state space constraints. Subjects: Optimization and Control (math.OC); Machine Learning (cs.LG); Robotics (cs.RO); Systems and Control (eess.SY) Cite as: arXiv:2504.09836 [math.OC] (or arXiv:2504.09836v1 [math.OC] for this version) https://doi.org/10.48550/arXiv.2504.09836 Focus to learn more arXiv-issued DOI via DataCite (pending registration)

[LG-119] Virtual domain extension for imposing boundary conditions in flow simulation using pre-trained local neural operator

链接: https://arxiv.org/abs/2504.09807
作者: Ximeng Ye,Hongyu Li,Zhen-Guo Yan
类目: Fluid Dynamics (physics.flu-dyn); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:This paper builds up a virtual domain extension (VDE) framework for imposing boundary conditions (BCs) in flow simulation using pre-trained local neural operator (LNO). It creates extended virtual domains to the input function to compensate for the corrosion nature of computational domains during LNO inference, thus turns the implementation of BC into the determination of field values on the extended domain. Several strategies to calculate the field values are proposed and validated in solving numerical examples, including padding operation, direct imposition, pressure symmetry, and optimization by backpropagation, and compared with boundary imposition in traditional solvers. It is found that the large time interval of LNO induces a relatively wide near-boundary domain to be processed, thus imposing BC on only a few nodes near the boundary following the immersed boundary conception in traditional solvers can hardly achieve high accuracy. With appropriate values assigned on the extended virtual domains, VDE can accurately impose BCs and lead to reasonable flow field predictions. This work provides a guidance for imposing BCs reliably in LNO prediction, which could facilitate the reuse of pre-trained LNO in more applications.

[LG-120] Preconditioned Gradient Descent for Over-Parameterized Nonconvex Matrix Factorization NEURIPS NEURIPS2021

链接: https://arxiv.org/abs/2504.09708
作者: Gavin Zhang,Salar Fattahi,Richard Y. Zhang
类目: Optimization and Control (math.OC); Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: NeurIPS 2021. See also this https URL

点击查看摘要

Abstract:In practical instances of nonconvex matrix factorization, the rank of the true solution r^\star is often unknown, so the rank r of the model can be overspecified as rr^\star . This over-parameterized regime of matrix factorization significantly slows down the convergence of local search algorithms, from a linear rate with r=r^\star to a sublinear rate when rr^\star . We propose an inexpensive preconditioner for the matrix sensing variant of nonconvex matrix factorization that restores the convergence rate of gradient descent back to linear, even in the over-parameterized case, while also making it agnostic to possible ill-conditioning in the ground truth. Classical gradient descent in a neighborhood of the solution slows down due to the need for the model matrix factor to become singular. Our key result is that this singularity can be corrected by \ell_2 regularization with a specific range of values for the damping parameter. In fact, a good damping parameter can be inexpensively estimated from the current iterate. The resulting algorithm, which we call preconditioned gradient descent or PrecGD, is stable under noise, and converges linearly to an information theoretically optimal error bound. Our numerical experiments find that PrecGD works equally well in restoring the linear convergence of other variants of nonconvex matrix factorization in the over-parameterized regime.

[LG-121] Conditional Independence Test Based on Transport Maps

链接: https://arxiv.org/abs/2504.09567
作者: Chenxuan He,Yuan Gao,Liping Zhu,Jian Huang
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Methodology (stat.ME)
*备注: 35 pages

点击查看摘要

Abstract:Testing conditional independence between two random vectors given a third is a fundamental and challenging problem in statistics, particularly in multivariate nonparametric settings due to the complexity of conditional structures. We propose a novel framework for testing conditional independence using transport maps. At the population level, we show that two well-defined transport maps can transform the conditional independence test into an unconditional independence test, this substantially simplifies the problem. These transport maps are estimated from data using conditional continuous normalizing flow models. Within this framework, we derive a test statistic and prove its consistency under both the null and alternative hypotheses. A permutation-based procedure is employed to evaluate the significance of the test. We validate the proposed method through extensive simulations and real-data analysis. Our numerical studies demonstrate the practical effectiveness of the proposed method for conditional independence testing.

[LG-122] Optimal sparse phase retrieval via a quasi-Bayesian approach

链接: https://arxiv.org/abs/2504.09509
作者: TheTien Mai
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Statistics Theory (math.ST); Methodology (stat.ME)
*备注:

点击查看摘要

Abstract:This paper addresses the problem of sparse phase retrieval, a fundamental inverse problem in applied mathematics, physics, and engineering, where a signal need to be reconstructed using only the magnitude of its transformation while phase information remains inaccessible. Leveraging the inherent sparsity of many real-world signals, we introduce a novel sparse quasi-Bayesian approach and provide the first theoretical guarantees for such an approach. Specifically, we employ a scaled Student distribution as a continuous shrinkage prior to enforce sparsity and analyze the method using the PAC-Bayesian inequality framework. Our results establish that the proposed Bayesian estimator achieves minimax-optimal convergence rates under sub-exponential noise, matching those of state-of-the-art frequentist methods. To ensure computational feasibility, we develop an efficient Langevin Monte Carlo sampling algorithm. Through numerical experiments, we demonstrate that our method performs comparably to existing frequentist techniques, highlighting its potential as a principled alternative for sparse phase retrieval in noisy settings.

[LG-123] Bregman Linearized Augmented Lagrangian Method for Nonconvex Constrained Stochastic Zeroth-order Optimization

链接: https://arxiv.org/abs/2504.09409
作者: Qiankun Shi,Xiao Wang,Hao Wang
类目: Optimization and Control (math.OC); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:In this paper, we study nonconvex constrained stochastic zeroth-order optimization problems, for which we have access to exact information of constraints and noisy function values of the objective. We propose a Bregman linearized augmented Lagrangian method that utilizes stochastic zeroth-order gradient estimators combined with a variance reduction technique. We analyze its oracle complexity, in terms of the total number of stochastic function value evaluations required to achieve an (\epsilon)-KKT point in (\ell_p)-norm metrics with (p \ge 2), where (p) is a parameter associated with the selected Bregman distance. In particular, starting from a near-feasible initial point and using Rademacher smoothing, the oracle complexity is in order (O(p d^2/p \epsilon^-3)) for (p \in [2, 2 \ln d]), and (O(\ln d \cdot \epsilon^-3)) for (p 2 \ln d), where (d) denotes the problem dimension. Those results show that the complexity of the proposed method can achieve a dimensional dependency lower than (O(d)) without requiring additional assumptions, provided that a Bregman distance is chosen properly. This offers a significant improvement in the high-dimensional setting over existing work, and matches the lowest complexity order with respect to the tolerance (\epsilon) reported in the literature. Numerical experiments on constrained Lasso and black-box adversarial attack problems highlight the promising performances of the proposed method.

[LG-124] Graph-Based Prediction Models for Data Debiasing

链接: https://arxiv.org/abs/2504.09348
作者: Dongze Wu,Hanyang Jiang,Yao Xie
类目: Methodology (stat.ME); Machine Learning (cs.LG); Signal Processing (eess.SP)
*备注:

点击查看摘要

Abstract:Bias in data collection, arising from both under-reporting and over-reporting, poses significant challenges in critical applications such as healthcare and public safety. In this work, we introduce Graph-based Over- and Under-reporting Debiasing (GROUD), a novel graph-based optimization framework that debiases reported data by jointly estimating the true incident counts and the associated reporting bias probabilities. By modeling the bias as a smooth signal over a graph constructed from geophysical or feature-based similarities, our convex formulation not only ensures a unique solution but also comes with theoretical recovery guarantees under certain assumptions. We validate GROUD on both challenging simulated experiments and real-world datasets – including Atlanta emergency calls and COVID-19 vaccine adverse event reports – demonstrating its robustness and superior performance in accurately recovering debiased counts. This approach paves the way for more reliable downstream decision-making in systems affected by reporting irregularities.

[LG-125] Inferring Outcome Means of Exponential Family Distributions Estimated by Deep Neural Networks

链接: https://arxiv.org/abs/2504.09347
作者: Xuran Meng,Yi Li
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Statistics Theory (math.ST)
*备注: 44 pages, 6 figures, 5 tables

点击查看摘要

Abstract:Despite the widespread use of deep neural networks (DNNs) for prediction, inference on estimated means for categorical or exponential family outcomes remains underexplored. We address this gap by framing the problem within the generalized linear models (GLMs) framework and developing a rigorous statistical approach for inference on DNN-estimated means. To address a key limitation of assuming independence between prediction errors and input variables in the literature, which often fails in GLMs, we introduce a truncation technique that partitions the problem into regimes with distinct noise behaviors, enabling refined analysis and robust theoretical guarantees under general GLM frameworks. To implement inference, we consider an Ensemble Subsampling Method (ESM) that leverages U-statistics and the Hoeffding decomposition to construct reliable confidence intervals. This method enables model-free variance estimation and accounts for heterogeneity among individuals in the population. Through extensive simulations across Binary, Poisson and Binomial models, we demonstrate the effectiveness and efficiency of our method. We further apply the method to real-world data from the eICU dataset to predict patient readmission risks, providing actionable insights for clinical decision-making.

[LG-126] Building Machine Learning Limited Area Models: Kilometer-Scale Weather Forecasting in Realistic Settings

链接: https://arxiv.org/abs/2504.09340
作者: Simon Adamov,Joel Oskarsson,Leif Denby,Tomas Landelius,Kasper Hintz,Simon Christiansen,Irene Schicker,Carlos Osuna,Fredrik Lindsten,Oliver Fuhrer,Sebastian Schemm
类目: Atmospheric and Oceanic Physics (physics.ao-ph); Machine Learning (cs.LG)
*备注: 72 pages, 50 figures. Code is available at this https URL

点击查看摘要

Abstract:Machine learning is revolutionizing global weather forecasting, with models that efficiently produce highly accurate forecasts. Apart from global forecasting there is also a large value in high-resolution regional weather forecasts, focusing on accurate simulations of the atmosphere for a limited area. Initial attempts have been made to use machine learning for such limited area scenarios, but these experiments do not consider realistic forecasting settings and do not investigate the many design choices involved. We present a framework for building kilometer-scale machine learning limited area models with boundary conditions imposed through a flexible boundary forcing method. This enables boundary conditions defined either from reanalysis or operational forecast data. Our approach employs specialized graph constructions with rectangular and triangular meshes, along with multi-step rollout training strategies to improve temporal consistency. We perform systematic evaluation of different design choices, including the boundary width, graph construction and boundary forcing integration. Models are evaluated across both a Danish and a Swiss domain, two regions that exhibit different orographical characteristics. Verification is performed against both gridded analysis data and in-situ observations, including a case study for the storm Ciara in February 2020. Both models achieve skillful predictions across a wide range of variables, with our Swiss model outperforming the numerical weather prediction baseline for key surface variables. With their substantially lower computational cost, our findings demonstrate great potential for machine learning limited area models in the future of regional weather forecasting.

[LG-127] No-Regret Generative Modeling via Parabolic Monge-Ampère PDE

链接: https://arxiv.org/abs/2504.09279
作者: Nabarun Deb,Tengyuan Liang
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Optimization and Control (math.OC); Statistics Theory (math.ST)
*备注: 30 pages, 3 figures

点击查看摘要

Abstract:We introduce a novel generative modeling framework based on a discretized parabolic Monge-Ampère PDE, which emerges as a continuous limit of the Sinkhorn algorithm commonly used in optimal transport. Our method performs iterative refinement in the space of Brenier maps using a mirror gradient descent step. We establish theoretical guarantees for generative modeling through the lens of no-regret analysis, demonstrating that the iterates converge to the optimal Brenier map under a variety of step-size schedules. As a technical contribution, we derive a new Evolution Variational Inequality tailored to the parabolic Monge-Ampère PDE, connecting geometry, transportation cost, and regret. Our framework accommodates non-log-concave target distributions, constructs an optimal sampling process via the Brenier map, and integrates favorable learning techniques from generative adversarial networks and score-based diffusion models. As direct applications, we illustrate how our theory paves new pathways for generative modeling and variational inference.

[LG-128] Adiabatic Encoding of Pre-trained MPS Classifiers into Quantum Circuits

链接: https://arxiv.org/abs/2504.09250
作者: Keisuke Murota
类目: Quantum Physics (quant-ph); Strongly Correlated Electrons (cond-mat.str-el); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Although Quantum Neural Networks (QNNs) offer powerful methods for classification tasks, the training of QNNs faces two major training obstacles: barren plateaus and local minima. A promising solution is to first train a tensor-network (TN) model classically and then embed it into a QNN.\ However, embedding TN-classifiers into quantum circuits generally requires postselection whose success probability may decay exponentially with the system size. We propose an \emphadiabatic encoding framework that encodes pre-trained MPS-classifiers into quantum MPS (qMPS) circuits with postselection, and gradually removes the postselection while retaining performance. We prove that training qMPS-classifiers from scratch on a certain artificial dataset is exponentially hard due to barren plateaus, but our adiabatic encoding circumvents this issue. Additional numerical experiments on binary MNIST also confirm its robustness.

[LG-129] Dose-finding design based on level set estimation in phase I cancer clinical trials

链接: https://arxiv.org/abs/2504.09157
作者: Keiichiro Seno,Kota Matsui,Shogo Iwazaki,Yu Inatsu,Shion Takeno,Shigeyuki Matsui
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:The primary objective of phase I cancer clinical trials is to evaluate the safety of a new experimental treatment and to find the maximum tolerated dose (MTD). We show that the MTD estimation problem can be regarded as a level set estimation (LSE) problem whose objective is to determine the regions where an unknown function value is above or below a given threshold. Then, we propose a novel dose-finding design in the framework of LSE. The proposed design determines the next dose on the basis of an acquisition function incorporating uncertainty in the posterior distribution of the dose-toxicity curve as well as overdose control. Simulation experiments show that the proposed LSE design achieves a higher accuracy in estimating the MTD and involves a lower risk of overdosing allocation compared to existing designs, thereby indicating that it provides an effective methodology for phase I cancer clinical trial design.

[LG-130] An Incremental Non-Linear Manifold Approximation Method

链接: https://arxiv.org/abs/2504.09068
作者: Praveen T. W. Hettige,Benjamin W. Ong
类目: Machine Learning (stat.ML); Graphics (cs.GR); Machine Learning (cs.LG)
*备注: 19 pages

点击查看摘要

Abstract:Analyzing high-dimensional data presents challenges due to the "curse of dimensionality’', making computations intensive. Dimension reduction techniques, categorized as linear or non-linear, simplify such data. Non-linear methods are particularly essential for efficiently visualizing and processing complex data structures in interactive and graphical applications. This research develops an incremental non-linear dimension reduction method using the Geometric Multi-Resolution Analysis (GMRA) framework for streaming data. The proposed method enables real-time data analysis and visualization by incrementally updating the cluster map, PCA basis vectors, and wavelet coefficients. Numerical experiments show that the incremental GMRA accurately represents non-linear manifolds even with small initial samples and aligns closely with batch GMRA, demonstrating efficient updates and maintaining the multiscale structure. The findings highlight the potential of Incremental GMRA for real-time visualization and interactive graphics applications that require adaptive high-dimensional data representations.

[LG-131] InterQ: A DQN Framework for Optimal Intermittent Control

链接: https://arxiv.org/abs/2504.09035
作者: Shubham Aggarwal,Dipankar Maity,Tamer Başar
类目: Optimization and Control (math.OC); Machine Learning (cs.LG); Systems and Control (eess.SY)
*备注: Submitted to IEEE for possible publication

点击查看摘要

Abstract:In this letter, we explore the communication-control co-design of discrete-time stochastic linear systems through reinforcement learning. Specifically, we examine a closed-loop system involving two sequential decision-makers: a scheduler and a controller. The scheduler continuously monitors the system’s state but transmits it to the controller intermittently to balance the communication cost and control performance. The controller, in turn, determines the control input based on the intermittently received information. Given the partially nested information structure, we show that the optimal control policy follows a certainty-equivalence form. Subsequently, we analyze the qualitative behavior of the scheduling policy. To develop the optimal scheduling policy, we propose InterQ, a deep reinforcement learning algorithm which uses a deep neural network to approximate the Q-function. Through extensive numerical evaluations, we analyze the scheduling landscape and further compare our approach against two baseline strategies: (a) a multi-period periodic scheduling policy, and (b) an event-triggered policy. The results demonstrate that our proposed method outperforms both baselines. The open source implementation can be found at this https URL.

[LG-132] Improving the evaluation of samplers on multi-modal targets ICLR2025

链接: https://arxiv.org/abs/2504.08916
作者: Louis Grenioux,Maxence Noble,Marylou Gabrié
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注: Accepted at ICLR 2025 workshop “Frontiers in Probabilistic Inference: Learning meets Sampling”

点击查看摘要

Abstract:Addressing multi-modality constitutes one of the major challenges of sampling. In this reflection paper, we advocate for a more systematic evaluation of samplers towards two sources of difficulty that are mode separation and dimension. For this, we propose a synthetic experimental setting that we illustrate on a selection of samplers, focusing on the challenging criterion of recovery of the mode relative importance. These evaluations are crucial to diagnose the potential of samplers to handle multi-modality and therefore to drive progress in the field.

[LG-133] DataMap: A Portable Application for Visualizing High-Dimensional Data

链接: https://arxiv.org/abs/2504.08875
作者: Xijin Ge
类目: Quantitative Methods (q-bio.QM); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG); Applications (stat.AP)
*备注:

点击查看摘要

Abstract:Motivation: The visualization and analysis of high-dimensional data are essential in biomedical research. There is a need for secure, scalable, and reproducible tools to facilitate data exploration and interpretation. Results: We introduce DataMap, a browser-based application for visualization of high-dimensional data using heatmaps, principal component analysis (PCA), and t-distributed stochastic neighbor embedding (t-SNE). DataMap runs in the web browser, ensuring data privacy while eliminating the need for installation or a server. The application has an intuitive user interface for data transformation, annotation, and generation of reproducible R code. Availability and Implementation: Freely available as a GitHub page this https URL. The source code can be found at this https URL, and can also be installed as an R package. Contact: this http URL@sdstate.ed

[LG-134] Double Machine Learning for Causal Inference under Shared-State Interference

链接: https://arxiv.org/abs/2504.08836
作者: Chris Hays,Manish Raghavan
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Econometrics (econ.EM)
*备注: 48 pages, 6 figures

点击查看摘要

Abstract:Researchers and practitioners often wish to measure treatment effects in settings where units interact via markets and recommendation systems. In these settings, units are affected by certain shared states, like prices, algorithmic recommendations or social signals. We formalize this structure, calling it shared-state interference, and argue that our formulation captures many relevant applied settings. Our key modeling assumption is that individuals’ potential outcomes are independent conditional on the shared state. We then prove an extension of a double machine learning (DML) theorem providing conditions for achieving efficient inference under shared-state interference. We also instantiate our general theorem in several models of interest where it is possible to efficiently estimate the average direct effect (ADE) or global average treatment effect (GATE).

[LG-135] owards scientific machine learning for granular material simulations – challenges and opportunities

链接: https://arxiv.org/abs/2504.08766
作者: Marc Fransen,Andreas Fürst,Deepak Tunuguntla,Daniel N. Wilke,Benedikt Alkin,Daniel Barreto,Johannes Brandstetter,Miguel Angel Cabrera,Xinyan Fan,Mengwu Guo,Bram Kieskamp,Krishna Kumar,John Morrissey,Jonathan Nuttall,Jin Ooi,Luisa Orozco,Stefanos-Aldo Papanicolopulos,Tongming Qu,Dingena Schott,Takayuki Shuku,WaiChing Sun,Thomas Weinhart,Dongwei Ye,Hongyang Cheng
类目: oft Condensed Matter (cond-mat.soft); Machine Learning (cs.LG); Computational Physics (physics.comp-ph)
*备注: 35 pages, 17 figures

点击查看摘要

Abstract:Micro-scale mechanisms, such as inter-particle and particle-fluid interactions, govern the behaviour of granular systems. While particle-scale simulations provide detailed insights into these interactions, their computational cost is often prohibitive. Attended by researchers from both the granular materials (GM) and machine learning (ML) communities, a recent Lorentz Center Workshop on “Machine Learning for Discrete Granular Media” brought the ML community up to date with GM challenges. This position paper emerged from the workshop discussions. We define granular materials and identify seven key challenges that characterise their distinctive behaviour across various scales and regimes, ranging from gas-like to fluid-like and solid-like. Addressing these challenges is essential for developing robust and efficient digital twins for granular systems in various industrial applications. To showcase the potential of ML to the GM community, we present classical and emerging machine/deep learning techniques that have been, or could be, applied to granular materials. We reviewed sequence-based learning models for path-dependent constitutive behaviour, followed by encoder-decoder type models for representing high-dimensional data. We then explore graph neural networks and recent advances in neural operator learning. Lastly, we discuss model-order reduction and probabilistic learning techniques for high-dimensional parameterised systems, which are crucial for quantifying uncertainties arising from physics-based and data-driven models. We present a workflow aimed at unifying data structures and modelling pipelines and guiding readers through the selection, training, and deployment of ML surrogates for granular material simulations. Finally, we illustrate the workflow’s practical use with two representative examples, focusing on granular materials in solid-like and fluid-like regimes. Comments: 35 pages, 17 figures Subjects: Soft Condensed Matter (cond-mat.soft); Machine Learning (cs.LG); Computational Physics (physics.comp-ph) Cite as: arXiv:2504.08766 [cond-mat.soft] (or arXiv:2504.08766v1 [cond-mat.soft] for this version) https://doi.org/10.48550/arXiv.2504.08766 Focus to learn more arXiv-issued DOI via DataCite (pending registration)

信息检索

[IR-0] Invariance Matters: Empowering Social Recommendation via Graph Invariant Learning

链接: https://arxiv.org/abs/2504.10432
作者: Yonghui Yang,Le Wu,Yuxin Liao,Zhuangzhuang He,Pengyang Shao,Richang Hong,Meng Wang
类目: Information Retrieval (cs.IR)
*备注:

点击查看摘要

Abstract:Graph-based social recommendation systems have shown significant promise in enhancing recommendation performance, particularly in addressing the issue of data sparsity in user behaviors. Typically, these systems leverage Graph Neural Networks (GNNs) to capture user preferences by incorporating high-order social influences from observed social networks. However, existing graph-based social recommendations often overlook the fact that social networks are inherently noisy, containing task-irrelevant relationships that can hinder accurate user preference learning. The removal of these redundant social relations is crucial, yet it remains challenging due to the lack of ground truth. In this paper, we approach the social denoising problem from the perspective of graph invariant learning and propose a novel method, Social Graph Invariant Learning(SGIL). Specifically,SGIL aims to uncover stable user preferences within the input social graph, thereby enhancing the robustness of graph-based social recommendation systems. To achieve this goal, SGIL first simulates multiple noisy social environments through graph generators. It then seeks to learn environment-invariant user preferences by minimizing invariant risk across these environments. To further promote diversity in the generated social environments, we employ an adversarial training strategy to simulate more potential social noisy distributions. Extensive experimental results demonstrate the effectiveness of the proposed SGIL. The code is available at this https URL.

[IR-1] Brain-Machine Interfaces Information Retrieval Challenges and Opportunities

链接: https://arxiv.org/abs/2504.10371
作者: Yashar Moshfeghi,Niall McGuire
类目: Information Retrieval (cs.IR)
*备注:

点击查看摘要

Abstract:The fundamental goal of Information Retrieval (IR) systems lies in their capacity to effectively satisfy human information needs - a challenge that encompasses not just the technical delivery of information, but the nuanced understanding of human cognition during information seeking. Contemporary IR platforms rely primarily on observable interaction signals, creating a fundamental gap between system capabilities and users’ cognitive processes. Brain-Machine Interface (BMI) technologies now offer unprecedented potential to bridge this gap through direct measurement of previously inaccessible aspects of information-seeking behaviour. This perspective paper offers a broad examination of the IR landscape, providing a comprehensive analysis of how BMI technology could transform IR systems, drawing from advances at the intersection of both neuroscience and IR research. We present our analysis through three identified fundamental vertices: (1) understanding the neural correlates of core IR concepts to advance theoretical models of search behaviour, (2) enhancing existing IR systems through contextual integration of neurophysiological signals, and (3) developing proactive IR capabilities through direct neurophysiological measurement. For each vertex, we identify specific research opportunities and propose concrete directions for developing BMI-enhanced IR systems. We conclude by examining critical technical and ethical challenges in implementing these advances, providing a structured roadmap for future research at the intersection of neuroscience and IR.

[IR-2] CROSSAN: Towards Efficient and Effective Adaptation of Multiple Multimodal Foundation Models for Sequential Recommendation

链接: https://arxiv.org/abs/2504.10307
作者: Junchen Fu,Yongxin Ni,Joemon M. Jose,Ioannis Arapakis,Kaiwen Zheng,Youhua Li,Xuri Ge
类目: Information Retrieval (cs.IR)
*备注:

点击查看摘要

Abstract:Multimodal Foundation Models (MFMs) excel at representing diverse raw modalities (e.g., text, images, audio, videos, etc.). As recommender systems increasingly incorporate these modalities, leveraging MFMs to generate better representations has great potential. However, their application in sequential recommendation remains largely unexplored. This is primarily because mainstream adaptation methods, such as Fine-Tuning and even Parameter-Efficient Fine-Tuning (PEFT) techniques (e.g., Adapter and LoRA), incur high computational costs, especially when integrating multiple modality encoders, thus hindering research progress. As a result, it remains unclear whether we can efficiently and effectively adapt multiple (2) MFMs for the sequential recommendation task. To address this, we propose a plug-and-play Cross-modal Side Adapter Network (CROSSAN). Leveraging the fully decoupled side adapter-based paradigm, CROSSAN achieves high efficiency while enabling cross-modal learning across diverse modalities. To optimize the final stage of multimodal fusion across diverse modalities, we adopt the Mixture of Modality Expert Fusion (MOMEF) mechanism. CROSSAN achieves superior performance on the public datasets for adapting four foundation models with raw modalities. Performance consistently improves as more MFMs are adapted. We will release our code and datasets to facilitate future research. Subjects: Information Retrieval (cs.IR) Cite as: arXiv:2504.10307 [cs.IR] (or arXiv:2504.10307v1 [cs.IR] for this version) https://doi.org/10.48550/arXiv.2504.10307 Focus to learn more arXiv-issued DOI via DataCite (pending registration)

[IR-3] HistLLM : A Unified Framework for LLM -Based Multimodal Recommendation with User History Encoding and Compression

链接: https://arxiv.org/abs/2504.10150
作者: Chen Zhang,Bo Hu,Weidong Chen,Zhendong Mao
类目: Information Retrieval (cs.IR); Multimedia (cs.MM)
*备注:

点击查看摘要

Abstract:While large language models (LLMs) have proven effective in leveraging textual data for recommendations, their application to multimodal recommendation tasks remains relatively underexplored. Although LLMs can process multimodal information through projection functions that map visual features into their semantic space, recommendation tasks often require representing users’ history interactions through lengthy prompts combining text and visual elements, which not only hampers training and inference efficiency but also makes it difficult for the model to accurately capture user preferences from complex and extended prompts, leading to reduced recommendation performance. To address this challenge, we introduce HistLLM, an innovative multimodal recommendation framework that integrates textual and visual features through a User History Encoding Module (UHEM), compressing multimodal user history interactions into a single token representation, effectively facilitating LLMs in processing user preferences. Extensive experiments demonstrate the effectiveness and efficiency of our proposed mechanism.

[IR-4] A Survey of Personalization: From RAG to Agent

链接: https://arxiv.org/abs/2504.10147
作者: Xiaopeng Li,Pengyue Jia,Derong Xu,Yi Wen,Yingyi Zhang,Wenlin Zhang,Wanyu Wang,Yichao Wang,Zhaocheng Du,Xiangyang Li,Yong Liu,Huifeng Guo,Ruiming Tang,Xiangyu Zhao
类目: Information Retrieval (cs.IR)
*备注: 18 pages, 5 figures

点击查看摘要

Abstract:Personalization has become an essential capability in modern AI systems, enabling customized interactions that align with individual user preferences, contexts, and goals. Recent research has increasingly concentrated on Retrieval-Augmented Generation (RAG) frameworks and their evolution into more advanced agent-based architectures within personalized settings to enhance user satisfaction. Building on this foundation, this survey systematically examines personalization across the three core stages of RAG: pre-retrieval, retrieval, and generation. Beyond RAG, we further extend its capabilities into the realm of Personalized LLM-based Agents, which enhance traditional RAG systems with agentic functionalities, including user understanding, personalized planning and execution, and dynamic generation. For both personalization in RAG and agent-based personalization, we provide formal definitions, conduct a comprehensive review of recent literature, and summarize key datasets and evaluation metrics. Additionally, we discuss fundamental challenges, limitations, and promising research directions in this evolving field. Relevant papers and resources are continuously updated at this https URL.

[IR-5] Unveiling Contrastive Learnings Capability of Neighborhood Aggregation for Collaborative Filtering SIGIR2025

链接: https://arxiv.org/abs/2504.10113
作者: Yu Zhang,Yiwen Zhang,Yi Zhang,Lei Sang,Yun Yang
类目: Information Retrieval (cs.IR)
*备注: Accepted by SIGIR2025

点击查看摘要

Abstract:Personalized recommendation is widely used in the web applications, and graph contrastive learning (GCL) has gradually become a dominant approach in recommender systems, primarily due to its ability to extract self-supervised signals from raw interaction data, effectively alleviating the problem of data sparsity. A classic GCL-based method typically uses data augmentation during graph convolution to generates more contrastive views, and performs contrast on these new views to obtain rich self-supervised signals. Despite this paradigm is effective, the reasons behind the performance gains remain a mystery. In this paper, we first reveal via theoretical derivation that the gradient descent process of the CL objective is formally equivalent to graph convolution, which implies that CL objective inherently supports neighborhood aggregation on interaction graphs. We further substantiate this capability through experimental validation and identify common misconceptions in the selection of positive samples in previous methods, which limit the potential of CL objective. Based on this discovery, we propose the Light Contrastive Collaborative Filtering (LightCCF) method, which introduces a novel neighborhood aggregation objective to bring users closer to all interacted items while pushing them away from other positive pairs, thus achieving high-quality neighborhood aggregation with very low time complexity. On three highly sparse public datasets, the proposed method effectively aggregate neighborhood information while preventing graph over-smoothing, demonstrating significant improvements over existing GCL-based counterparts in both training efficiency and recommendation accuracy. Our implementations are publicly accessible.

[IR-6] Enhancing LLM -based Recommendation through Semantic-Aligned Collaborative Knowledge

链接: https://arxiv.org/abs/2504.10107
作者: Zihan Wang,Jinghao Lin,Xiaocui Yang,Yongkang Liu,Shi Feng,Daling Wang,Yifei Zhang
类目: Information Retrieval (cs.IR)
*备注:

点击查看摘要

Abstract:Large Language Models (LLMs) demonstrate remarkable capabilities in leveraging comprehensive world knowledge and sophisticated reasoning mechanisms for recommendation tasks. However, a notable limitation lies in their inability to effectively model sparse identifiers (e.g., user and item IDs), unlike conventional collaborative filtering models (Collabs.), thus hindering LLM to learn distinctive user-item representations and creating a performance bottleneck. Prior studies indicate that integrating collaborative knowledge from Collabs. into LLMs can mitigate the above limitations and enhance their recommendation performance. Nevertheless, the significant discrepancy in knowledge distribution and semantic space between LLMs and Collab. presents substantial challenges for effective knowledge transfer. To tackle these challenges, we propose a novel framework, SeLLa-Rec, which focuses on achieving alignment between the semantic spaces of Collabs. and LLMs. This alignment fosters effective knowledge fusion, mitigating the influence of discriminative noise and facilitating the deep integration of knowledge from diverse models. Specifically, three special tokens with collaborative knowledge are embedded into the LLM’s semantic space through a hybrid projection layer and integrated into task-specific prompts to guide the recommendation process. Experiments conducted on two public benchmark datasets (MovieLens-1M and Amazon Book) demonstrate that SeLLa-Rec achieves state-of-the-art performance.

[IR-7] On Precomputation and Caching in Information Retrieval Experiments with Pipeline Architectures ECIR2025

链接: https://arxiv.org/abs/2504.09984
作者: Sean MacAvaney,Craig Macdonald
类目: Information Retrieval (cs.IR)
*备注: WOWS @ ECIR 2025

点击查看摘要

Abstract:Modern information retrieval systems often rely on multiple components executed in a pipeline. In a research setting, this can lead to substantial redundant computations (e.g., retrieving the same query multiple times for evaluating different downstream rerankers). To overcome this, researchers take cached “result” files as inputs, which represent the output of another pipeline. However, these result files can be brittle and can cause a disconnect between the conceptual design of the pipeline and its logical implementation. To overcome both the redundancy problem (when executing complete pipelines) and the disconnect problem (when relying on intermediate result files), we describe our recent efforts to improve the caching capabilities in the open-source PyTerrier IR platform. We focus on two main directions: (1) automatic implicit caching of common pipeline prefixes when comparing systems and (2) explicit caching of operations through a new extension package, pyterrier-caching. These approaches allow for the best of both worlds: pipelines can be fully expressed end-to-end, while also avoiding redundant computations between pipelines.

[IR-8] Constrained Auto-Regressive Decoding Constrains Generative Retrieval SIGIR SIGIR2025

链接: https://arxiv.org/abs/2504.09935
作者: Shiguang Wu,Zhaochun Ren,Xin Xin,Jiyuan Yang,Mengqi Zhang,Zhumin Chen,Maarten de Rijke,Pengjie Ren
类目: Information Retrieval (cs.IR)
*备注: 13 pages, 6 figures, 2 tables, accepted by SIGIR 2025 (Proceedings of the 48th International ACM SIGIR Conference on Research and Development in Information Retrieval)

点击查看摘要

Abstract:Generative retrieval seeks to replace traditional search index data structures with a single large-scale neural network, offering the potential for improved efficiency and seamless integration with generative large language models. As an end-to-end paradigm, generative retrieval adopts a learned differentiable search index to conduct retrieval by directly generating document identifiers through corpus-specific constrained decoding. The generalization capabilities of generative retrieval on out-of-distribution corpora have gathered significant attention. In this paper, we examine the inherent limitations of constrained auto-regressive generation from two essential perspectives: constraints and beam search. We begin with the Bayes-optimal setting where the generative retrieval model exactly captures the underlying relevance distribution of all possible documents. Then we apply the model to specific corpora by simply adding corpus-specific constraints. Our main findings are two-fold: (i) For the effect of constraints, we derive a lower bound of the error, in terms of the KL divergence between the ground-truth and the model-predicted step-wise marginal distributions. (ii) For the beam search algorithm used during generation, we reveal that the usage of marginal distributions may not be an ideal approach. This paper aims to improve our theoretical understanding of the generalization capabilities of the auto-regressive decoding retrieval paradigm, laying a foundation for its limitations and inspiring future advancements toward more robust and generalizable generative retrieval. Comments: 13 pages, 6 figures, 2 tables, accepted by SIGIR 2025 (Proceedings of the 48th International ACM SIGIR Conference on Research and Development in Information Retrieval) Subjects: Information Retrieval (cs.IR) Cite as: arXiv:2504.09935 [cs.IR] (or arXiv:2504.09935v1 [cs.IR] for this version) https://doi.org/10.48550/arXiv.2504.09935 Focus to learn more arXiv-issued DOI via DataCite (pending registration) Related DOI: https://doi.org/10.1145/3726302.3729934 Focus to learn more DOI(s) linking to related resources

[IR-9] StePO-Rec: Towards Personalized Outfit Styling Assistant via Knowledge-Guided Multi-Step Reasoning

链接: https://arxiv.org/abs/2504.09915
作者: Yuxi Bi,Yunfan Gao,Haofen Wang
类目: Information Retrieval (cs.IR); Multimedia (cs.MM)
*备注:

点击查看摘要

Abstract:Advancements in Generative AI offers new opportunities for FashionAI, surpassing traditional recommendation systems that often lack transparency and struggle to integrate expert knowledge, leaving the potential for personalized fashion styling remain untapped. To address these challenges, we present PAFA (Principle-Aware Fashion), a multi-granular knowledge base that organizes professional styling expertise into three levels of metadata, domain principles, and semantic relationships. Using PAFA, we develop StePO-Rec, a knowledge-guided method for multi-step outfit recommendation. StePO-Rec provides structured suggestions using a scenario-dimension-attribute framework, employing recursive tree construction to align recommendations with both professional principles and individual preferences. A preference-trend re-ranking system further adapts to fashion trends while maintaining the consistency of the user’s original style. Experiments on the widely used personalized outfit dataset IQON show a 28% increase in Recall@1 and 32.8% in MAP. Furthermore, case studies highlight improved explainability, traceability, result reliability, and the seamless integration of expertise and personalization.

[IR-10] RAKG:Document-level Retrieval Augmented Knowledge Graph Construction

链接: https://arxiv.org/abs/2504.09823
作者: Hairong Zhang,Jiaheng Si,Guohang Yan,Boyuan Qi,Pinlong Cai,Song Mao,Ding Wang,Botian Shi
类目: Information Retrieval (cs.IR)
*备注: 9 pages, 6 figures

点击查看摘要

Abstract:With the rise of knowledge graph based retrieval-augmented generation (RAG) techniques such as GraphRAG and Pike-RAG, the role of knowledge graphs in enhancing the reasoning capabilities of large language models (LLMs) has become increasingly prominent. However, traditional Knowledge Graph Construction (KGC) methods face challenges like complex entity disambiguation, rigid schema definition, and insufficient cross-document knowledge integration. This paper focuses on the task of automatic document-level knowledge graph construction. It proposes the Document-level Retrieval Augmented Knowledge Graph Construction (RAKG) framework. RAKG extracts pre-entities from text chunks and utilizes these pre-entities as queries for RAG, effectively addressing the issue of long-context forgetting in LLMs and reducing the complexity of Coreference Resolution. In contrast to conventional KGC methods, RAKG more effectively captures global information and the interconnections among disparate nodes, thereby enhancing the overall performance of the model. Additionally, we transfer the RAG evaluation framework to the KGC field and filter and evaluate the generated knowledge graphs, thereby avoiding incorrectly generated entities and relationships caused by hallucinations in LLMs. We further developed the MINE dataset by constructing standard knowledge graphs for each article and experimentally validated the performance of RAKG. The results show that RAKG achieves an accuracy of 95.91 % on the MINE dataset, a 6.2 % point improvement over the current best baseline, GraphRAG (89.71 %). The code is available at this https URL.

[IR-11] Outage Probability Analysis for OTFS with Finite Blocklength

链接: https://arxiv.org/abs/2504.09628
作者: Xin Zhang,Wensheng Lin,Lixin Li,Zhu Han,Tad Matsumoto
类目: Information Retrieval (cs.IR)
*备注:

点击查看摘要

Abstract:Orthogonal time frequency space (OTFS) modulation is widely acknowledged as a prospective waveform for future wireless communication this http URL provide insights for the practical system design, this paper analyzes the outage probability of OTFS modulation with finite this http URL begin with, we present the system model and formulate the analysis of outage probability for OTFS with finite blocklength as an equivalent problem of calculating the outage probability with finite blocklength over parallel additive white Gaussian noise (AWGN) this http URL, we apply the equivalent noise approach to derive a lower bound on the outage probability of OTFS with finite blocklength under both average power allocation and water-filling power allocation strategies, this http URL, the lower bounds of the outage probability are determined using the Monte-Carlo method for the two power allocation this http URL impact of the number of resolvable paths and coding rates on the outage probability is analyzed, and the simulation results are compared with the theoretical lower bounds.

[IR-12] Revisiting Self-Attentive Sequential Recommendation

链接: https://arxiv.org/abs/2504.09596
作者: Zan Huang
类目: Information Retrieval (cs.IR)
*备注: 4 pages, 0 figure, technical report based on this https URL experiment findings

点击查看摘要

Abstract:Recommender systems are ubiquitous in on-line services to drive businesses. And many sequential recommender models were deployed in these systems to enhance personalization. The approach of using the transformer decoder as the sequential recommender was proposed years ago and is still a strong baseline in recent works. But this kind of sequential recommender model did not scale up well, compared to language models. Quite some details in the classical self-attentive sequential recommender model could be revisited, and some new experiments may lead to new findings, without changing the general model structure which was the focus of many previous works. In this paper, we show the details and propose new experiment methodologies for future research on sequential recommendation, in hope to motivate further exploration to new findings in this area.

[IR-13] HD-RAG : Retrieval-Augmented Generation for Hybrid Documents Containing Text and Hierarchical Tables

链接: https://arxiv.org/abs/2504.09554
作者: Chi Zhang,Qiyang Chen
类目: Information Retrieval (cs.IR)
*备注: 10 pages, 4 figures

点击查看摘要

Abstract:With the rapid advancement of large language models (LLMs), Retrieval-Augmented Generation (RAG) effectively combines LLMs generative capabilities with external retrieval-based information. The Hybrid Document RAG task aims to integrate textual and hierarchical tabular data for more comprehensive retrieval and generation in complex scenarios. However, there is no existing dataset specifically designed for this task that includes both text and tabular data. Additionally, existing methods struggle to retrieve relevant tabular data and integrate it with text. Semantic similarity-based retrieval lacks accuracy, while table-specific methods fail to handle complex hierarchical structures effectively. Furthermore, the QA task requires complex reasoning and calculations, further complicating the challenge. In this paper, we propose a new large-scale dataset, DocRAGLib, specifically designed for the question answering (QA) task scenario under Hybrid Document RAG. To tackle these challenges, we introduce HD-RAG, a novel framework that incorporates a row-and-column level (RCL) table representation, employs a two-stage process combining ensemble and LLM-based retrieval, and integrates RECAP, which is designed for multi-step reasoning and complex calculations in Document-QA tasks. We conduct comprehensive experiments with DocRAGLib, showing that HD-RAG outperforms existing baselines in both retrieval accuracy and QA performance, demonstrating its effectiveness.

[IR-14] Breaking the Lens of the Telescope: Online Relevance Estimation over Large Retrieval Sets

链接: https://arxiv.org/abs/2504.09353
作者: Mandeep Rathee,Venktesh V,Sean MacAvaney,Avishek Anand
类目: Information Retrieval (cs.IR)
*备注: 11 pages,5 figures, 4 tables

点击查看摘要

Abstract:Advanced relevance models, such as those that use large language models (LLMs), provide highly accurate relevance estimations. However, their computational costs make them infeasible for processing large document corpora. To address this, retrieval systems often employ a telescoping approach, where computationally efficient but less precise lexical and semantic retrievers filter potential candidates for further ranking. However, this approach heavily depends on the quality of early-stage retrieval, which can potentially exclude relevant documents early in the process. In this work, we propose a novel paradigm for re-ranking called online relevance estimation that continuously updates relevance estimates for a query throughout the ranking process. Instead of re-ranking a fixed set of top-k documents in a single step, online relevance estimation iteratively re-scores smaller subsets of the most promising documents while adjusting relevance scores for the remaining pool based on the estimations from the final model using an online bandit-based algorithm. This dynamic process mitigates the recall limitations of telescoping systems by re-prioritizing documents initially deemed less relevant by earlier stages – including those completely excluded by earlier-stage retrievers. We validate our approach on TREC benchmarks under two scenarios: hybrid retrieval and adaptive retrieval. Experimental results demonstrate that our method is sample-efficient and significantly improves recall, highlighting the effectiveness of our online relevance estimation framework for modern search systems.

[IR-15] Pneuma: Leverag ing LLM s for Tabular Data Representation and Retrieval in an End-to-End System SIGMOD2025

链接: https://arxiv.org/abs/2504.09207
作者: Muhammad Imam Luthfi Balaka,David Alexander,Qiming Wang,Yue Gong,Adila Krisnadhi,Raul Castro Fernandez
类目: Databases (cs.DB); Information Retrieval (cs.IR)
*备注: SIGMOD 2025 Paper

点击查看摘要

Abstract:Finding relevant tables among databases, lakes, and repositories is the first step in extracting value from data. Such a task remains difficult because assessing whether a table is relevant to a problem does not always depend only on its content but also on the context, which is usually tribal knowledge known to the individual or team. While tools like data catalogs and academic data discovery systems target this problem, they rely on keyword search or more complex interfaces, limiting non-technical users’ ability to find relevant data. The advent of large language models (LLMs) offers a unique opportunity for users to ask questions directly in natural language, making dataset discovery more intuitive, accessible, and efficient. In this paper, we introduce Pneuma, a retrieval-augmented generation (RAG) system designed to efficiently and effectively discover tabular data. Pneuma leverages large language models (LLMs) for both table representation and table retrieval. For table representation, Pneuma preserves schema and row-level information to ensure comprehensive data understanding. For table retrieval, Pneuma augments LLMs with traditional information retrieval techniques, such as full-text and vector search, harnessing the strengths of both to improve retrieval performance. To evaluate Pneuma, we generate comprehensive benchmarks that simulate table discovery workload on six real-world datasets including enterprise data, scientific databases, warehousing data, and open data. Our results demonstrate that Pneuma outperforms widely used table search systems (such as full-text search and state-of-the-art RAG systems) in accuracy and resource efficiency. Comments: SIGMOD 2025 Paper Subjects: Databases (cs.DB); Information Retrieval (cs.IR) Cite as: arXiv:2504.09207 [cs.DB] (or arXiv:2504.09207v1 [cs.DB] for this version) https://doi.org/10.48550/arXiv.2504.09207 Focus to learn more arXiv-issued DOI via DataCite (pending registration) Related DOI: https://doi.org/10.1145/3725337 Focus to learn more DOI(s) linking to related resources

[IR-16] Code-Craft: Hierarchical Graph-Based Code Summarization for Enhanced Context Retrieval

链接: https://arxiv.org/abs/2504.08975
作者: David Sounthiraraj,Jared Hancock,Yassin Kortam,Ashok Javvaji,Prabhat Singh,Shaila Shankar
类目: oftware Engineering (cs.SE); Information Retrieval (cs.IR)
*备注:

点击查看摘要

Abstract:Understanding and navigating large-scale codebases remains a significant challenge in software engineering. Existing methods often treat code as flat text or focus primarily on local structural relationships, limiting their ability to provide holistic, context-aware information retrieval. We present Hierarchical Code Graph Summarization (HCGS), a novel approach that constructs a multi-layered representation of a codebase by generating structured summaries in a bottom-up fashion from a code graph. HCGS leverages the Language Server Protocol for language-agnostic code analysis and employs a parallel level-based algorithm for efficient summary generation. Through extensive evaluation on five diverse codebases totaling 7,531 functions, HCGS demonstrates significant improvements in code retrieval accuracy, achieving up to 82 percentage relative improvement in top-1 retrieval precision for large codebases like libsignal (27.15 percentage points), and perfect Pass@3 scores for smaller repositories. The system’s hierarchical approach consistently outperforms traditional code-only retrieval across all metrics, with particularly substantial gains in larger, more complex codebases where understanding function relationships is crucial.

[IR-17] Accelerating Causal Network Discovery of Alzheimer Disease Biomarkers via Scientific Literature-based Retrieval Augmented Generation

链接: https://arxiv.org/abs/2504.08768
作者: Xiaofan Zhou,Liangjie Huang,Pinyang Cheng,Wenpen Yin,Rui Zhang,Wenrui Hao,Lu Cheng
类目: Information Retrieval (cs.IR); Quantitative Methods (q-bio.QM)
*备注: 9 pages, under review

点击查看摘要

Abstract:The causal relationships between biomarkers are essential for disease diagnosis and medical treatment planning. One notable application is Alzheimer’s disease (AD) diagnosis, where certain biomarkers may influence the presence of others, enabling early detection, precise disease staging, targeted treatments, and improved monitoring of disease progression. However, understanding these causal relationships is complex and requires extensive research. Constructing a comprehensive causal network of biomarkers demands significant effort from human experts, who must analyze a vast number of research papers, and have bias in understanding diseases’ biomarkers and their relation. This raises an important question: Can advanced large language models (LLMs), such as those utilizing retrieval-augmented generation (RAG), assist in building causal networks of biomarkers for further medical analysis? To explore this, we collected 200 AD-related research papers published over the past 25 years and then integrated scientific literature with RAG to extract AD biomarkers and generate causal relations among them. Given the high-risk nature of the medical diagnosis, we applied uncertainty estimation to assess the reliability of the generated causal edges and examined the faithfulness and scientificness of LLM reasoning using both automatic and human evaluation. We find that RAG enhances the ability of LLMs to generate more accurate causal networks from scientific papers. However, the overall performance of LLMs in identifying causal relations of AD biomarkers is still limited. We hope this study will inspire further foundational research on AI-driven analysis of AD biomarkers causal network discovery.

[IR-18] A Proposed Hybrid Recommender System for Tourism Industry in Iraq Using Evolutionary Apriori and K-means Algorithms

链接: https://arxiv.org/abs/2504.08767
作者: Bryar A. Hassan,Alla A. Hassan,Joan Lu,Aram M. Ahmed,Tarik A. Rashid
类目: Information Retrieval (cs.IR)
*备注:

点击查看摘要

Abstract:The rapid proliferation of tourism data across sectors, including accommodations, cultural sites, and events, has made it increasingly challenging for travelers to identify relevant and personalized recommendations. While traditional recommender systems such as collaborative, content-based, and context-aware systems offer partial solutions, they often struggle with issues like data sparsity and overspecialization. This study proposes a novel hybrid recommender system that combines evolutionary Apriori and K-means clustering algorithms to improve recommendation accuracy and efficiency in the tourism domain. Designed specifically to address the diverse and dynamic tourism landscape in Iraq, the system provides personalized recommendations and clusters of tourist destinations tailored to user preferences and contextual information. To evaluate the systems performance, experiments were conducted on an augmented dataset representative of Iraqs tourism activity, comparing the proposed system with existing methods. Results indicate that the proposed hybrid system significantly reduces execution time by 27-56% and space consumption by 24-31%, while achieving consistently lower Root Mean Square Error (RMSE) and Mean Absolute Error (MAE) values, thereby enhancing prediction accuracy. This approach offers a scalable, context-aware framework that is well-suited for application in regions where tourism data is limited, such as Iraq, ultimately advancing tourism recommender systems by addressing their limitations in complex and data-scarce environments.

[IR-19] UltraRAG : A Modular and Automated Toolkit for Adaptive Retrieval-Augmented Generation

链接: https://arxiv.org/abs/2504.08761
作者: Yuxuan Chen,Dewen Guo,Sen Mei,Xinze Li,Hao Chen,Yishan Li,Yixuan Wang,Chaoyue Tang,Ruobing Wang,Dingjun Wu,Yukun Yan,Zhenghao Liu,Shi Yu,Zhiyuan Liu,Maosong Sun
类目: Information Retrieval (cs.IR)
*备注:

点击查看摘要

Abstract:Retrieval-Augmented Generation (RAG) significantly enhances the performance of large language models (LLMs) in downstream tasks by integrating external knowledge. To facilitate researchers in deploying RAG systems, various RAG toolkits have been introduced. However, many existing RAG toolkits lack support for knowledge adaptation tailored to specific application scenarios. To address this limitation, we propose UltraRAG, a RAG toolkit that automates knowledge adaptation throughout the entire workflow, from data construction and training to evaluation, while ensuring ease of use. UltraRAG features a user-friendly WebUI that streamlines the RAG process, allowing users to build and optimize systems without coding expertise. It supports multimodal input and provides comprehensive tools for managing the knowledge base. With its highly modular architecture, UltraRAG delivers an end-to-end development solution, enabling seamless knowledge adaptation across diverse user scenarios. The code, demonstration videos, and installable package for UltraRAG are publicly available at this https URL.

[IR-20] A Quantitative Approach to Evaluating Open-Source EHR Systems for Indian Healthcare

链接: https://arxiv.org/abs/2504.08750
作者: Biswanath Dutta,Debanjali Bain
类目: Information Retrieval (cs.IR); Digital Libraries (cs.DL)
*备注: 21 pages, 3 figures, 10 tables

点击查看摘要

Abstract:The increasing use of Electronic Health Records (EHR) has emphasized the need for standardization and interoperability in healthcare data management. The Ministry of Health and Family Welfare, Government of India, has introduced the Electronic Health Record Minimum Data Set (EHRMDS) to facilitate uniformity in clinical documentation. However, the compatibility of Open-Source Electronic Health Record Systems (OS-EHRS) with EHRMDS remains largely unexplored. This study conducts a systematic assessment of the alignment between EHRMDS and commonly utilized OS-EHRS to determine the most appropriate system for healthcare environments in India. A quantitative closeness analysis was performed by comparing the metadata elements of EHRMDS with those of 10 selected OS-EHRS. Using crosswalk methodologies based on syntactic and semantic similarity, the study measured the extent of metadata alignment. Results indicate that OpenEMR exhibits the highest compatibility with EHRMDS, covering 73.81% of its metadata elements, while OpenClinic shows the least alignment at 33.33%. Additionally, the analysis identified 47 metadata elements present in OS-EHRS but absent in EHRMDS, suggesting the need for an extended metadata schema. By bridging gaps in clinical metadata, this study contributes to enhancing the interoperability of EHR systems in India. The findings provide valuable insights for healthcare policymakers and organizations seeking to adopt OS-EHRS aligned with national standards. Keywords. EHR metadata, electronic health record systems, EHRMDS, meta data, structured vocabularies, metadata crosswalk, methodologies and tools, SNOMED-CT, UMLS terms.

[IR-21] Enhancing Product Search Interfaces with Sketch-Guided Diffusion and Language Agents

链接: https://arxiv.org/abs/2504.08739
作者: Edward Sun
类目: Information Retrieval (cs.IR); Human-Computer Interaction (cs.HC)
*备注: Companion Proceedings of the ACM Web Conference 2025

点击查看摘要

Abstract:The rapid progress in diffusion models, transformers, and language agents has unlocked new possibilities, yet their potential in user interfaces and commercial applications remains underexplored. We present Sketch-Search Agent, a novel framework that transforms the image search experience by integrating a multimodal language agent with freehand sketches as control signals for diffusion models. Using the T2I-Adapter, Sketch-Search Agent combines sketches and text prompts to generate high-quality query images, encoded via a CLIP image encoder for efficient matching against an image corpus. Unlike existing methods, Sketch-Search Agent requires minimal setup, no additional training, and excels in sketch-based image retrieval and natural language interactions. The multimodal agent enhances user experience by dynamically retaining preferences, ranking results, and refining queries for personalized recommendations. This interactive design empowers users to create sketches and receive tailored product suggestions, showcasing the potential of diffusion models in user-centric image retrieval. Experiments confirm Sketch-Search Agent’s high accuracy in delivering relevant product search results.

附件下载

点击下载今日全部论文列表

Arxiv今日论文 | 2025-04-15

目录

概览 (2025-04-15)

自然语言处理

计算机视觉

人工智能

机器学习

信息检索

附件下载