This blog post presents the latest paper list retrieved from Arxiv.org on 2025-04-24. It is updated automatically and organized into five major areas: NLP, CV, ML, AI, and IR. If you would like to receive the daily digest by email, please leave your email address in the comments.

Note: paper data is retrieved from Arxiv.org daily and updated automatically at around 12:00 each day.

Friendly reminder: if you need the daily paper data delivered by email, please leave your email address in the comments.

Table of Contents

Overview (2025-04-24)

A total of 441 papers were updated today, including:

  • Natural Language Processing: 49 papers (Computation and Language (cs.CL))
  • Artificial Intelligence: 114 papers (Artificial Intelligence (cs.AI))
  • Computer Vision: 78 papers (Computer Vision and Pattern Recognition (cs.CV))
  • Machine Learning: 109 papers (Machine Learning (cs.LG))

Natural Language Processing

[NLP-0] IberBench: LLM Evaluation on Iberian Languages

[Quick Read]: This paper targets the inadequate evaluation of Large Language Models (LLMs) in languages other than English, particularly the languages of the Iberian Peninsula and Ibero-America: existing benchmarks are largely English-centric, overlooking the diversity of language varieties, the assessment of industrially relevant capabilities, and the drawbacks of static evaluation setups. The key to the solution is IberBench, a comprehensive and extensible benchmark that integrates 101 datasets across 22 task categories (such as sentiment analysis, toxicity detection, and summarization), covers multiple language varieties, and supports continual updates and community-driven model and dataset submissions, thereby addressing the lack of linguistic diversity and the static evaluation setups of current practice.

Link: https://arxiv.org/abs/2504.16921
Authors: José Ángel González, Ian Borrego Obrador, Álvaro Romo Herrero, Areg Mikael Sarvazyan, Mara Chinea-Ríos, Angelo Basile, Marc Franco-Salvador
Affiliations: Symanto; Keepler; Universitat Politècnica de València; UNICC
Subjects: Computation and Language (cs.CL)
Comments:


Abstract:Large Language Models (LLMs) remain difficult to evaluate comprehensively, particularly for languages other than English, where high-quality data is often limited. Existing benchmarks and leaderboards are predominantly English-centric, with only a few addressing other languages. These benchmarks fall short in several key areas: they overlook the diversity of language varieties, prioritize fundamental Natural Language Processing (NLP) capabilities over tasks of industrial relevance, and are static. With these aspects in mind, we present IberBench, a comprehensive and extensible benchmark designed to assess LLM performance on both fundamental and industry-relevant NLP tasks, in languages spoken across the Iberian Peninsula and Ibero-America. IberBench integrates 101 datasets from evaluation campaigns and recent benchmarks, covering 22 task categories such as sentiment and emotion analysis, toxicity detection, and summarization. The benchmark addresses key limitations in current evaluation practices, such as the lack of linguistic diversity and static evaluation setups by enabling continual updates and community-driven model and dataset submissions moderated by a committee of experts. We evaluate 23 LLMs ranging from 100 million to 14 billion parameters and provide empirical insights into their strengths and limitations. Our findings indicate that (i) LLMs perform worse on industry-relevant tasks than in fundamental ones, (ii) performance is on average lower for Galician and Basque, (iii) some tasks show results close to random, and (iv) in other tasks LLMs perform above random but below shared task systems. IberBench offers open-source implementations for the entire evaluation pipeline, including dataset normalization and hosting, incremental evaluation of LLMs, and a publicly accessible leaderboard.

[NLP-1] OptimAI: Optimization from Natural Language Using LLM-Powered AI Agents

[Quick Read]: This paper tackles the problem of turning optimization problems described in natural language into precise mathematical formulations and selecting a suitable solver, which normally demands substantial domain expertise. It proposes OptimAI, a framework that addresses this challenge with AI agents powered by large language models (LLMs), outperforming current state-of-the-art methods. The key lies in a framework built on four core roles: (1) a formulator that translates natural-language problem descriptions into precise mathematical formulations; (2) a planner that constructs a high-level solution strategy before execution; and (3) a coder and a code critic that interact with the environment and reflect on outcomes to refine subsequent actions. Ablation studies confirm that all roles matter: removing the planner or the code critic causes 5.8x and 3.1x drops in productivity, respectively. In addition, a UCB-based debug-scheduling mechanism that dynamically switches between alternative plans yields a further 3.3x productivity gain. The design emphasizes multi-agent collaboration, making it convenient to explore the synergy of combining diverse models within a unified system. The approach attains 88.1% accuracy on the NLP4LP dataset and 71.2% on the Optibench (non-linear without tables) subset, cutting error rates by 58% and 50% relative to prior best results.

Link: https://arxiv.org/abs/2504.16918
Authors: Raghav Thind, Youran Sun, Ling Liang, Haizhao Yang
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:


Abstract:Optimization plays a vital role in scientific research and practical applications, but formulating a concrete optimization problem described in natural language into a mathematical form and selecting a suitable solver to solve the problem requires substantial domain expertise. We introduce OptimAI, a framework for solving Optimization problems described in natural language by leveraging LLM-powered AI agents, achieving superior performance over current state-of-the-art methods. Our framework is built upon four key roles: (1) a formulator that translates natural language problem descriptions into precise mathematical formulations; (2) a planner that constructs a high-level solution strategy prior to execution; and (3) a coder and a code critic capable of interacting with the environment and reflecting on outcomes to refine future actions. Ablation studies confirm that all roles are essential; removing the planner or code critic results in 5.8x and 3.1x drops in productivity, respectively. Furthermore, we introduce UCB-based debug scheduling to dynamically switch between alternative plans, yielding an additional 3.3x productivity gain. Our design emphasizes multi-agent collaboration, allowing us to conveniently explore the synergistic effect of combining diverse models within a unified system. Our approach attains 88.1% accuracy on the NLP4LP dataset and 71.2% on the Optibench (non-linear w/o table) subset, reducing error rates by 58% and 50% respectively over prior best results.
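To make the UCB-based debug scheduling concrete, here is a minimal sketch of UCB1 selection over alternative plans. The plan names, reward model, and exploration constant are illustrative assumptions, not details from the paper.

```python
import math
import random

def ucb_select(plans, counts, rewards, total, c=1.4):
    """Pick the plan with the highest UCB1 score; untried plans go first."""
    for p in plans:
        if counts[p] == 0:
            return p
    return max(plans, key=lambda p: rewards[p] / counts[p]
               + c * math.sqrt(math.log(total) / counts[p]))

plans = ["plan_A", "plan_B"]                    # hypothetical alternative plans
counts = {p: 0 for p in plans}
rewards = {p: 0.0 for p in plans}

for step in range(1, 21):
    plan = ucb_select(plans, counts, rewards, step)
    # Stand-in for "run the coder on this plan and check whether debugging succeeds".
    success = random.random() < (0.7 if plan == "plan_A" else 0.4)
    counts[plan] += 1
    rewards[plan] += float(success)

print(counts)  # the more productive plan accumulates more trials over time
```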

[NLP-2] Tracing Thought: Using Chain-of-Thought Reasoning to Identify the LLM Behind AI-Generated Text AAAI2025

[Quick Read]: This paper addresses the detection of AI-generated text and the identification of the specific large language model (LLM) behind it, in response to challenges around academic integrity, misinformation, and ethical AI deployment. The key to the proposed solution is a Chain-of-Thought (CoT) fine-tuning framework with a dual-task design: Task A classifies text as AI-generated or human-written, and Task B identifies the specific LLM that produced the text. The innovation lies in the CoT reasoning mechanism, which lets the model generate explanations for its predictions, improving transparency and interpretability. Experiments show high accuracy on both tasks and confirm that the approach substantially improves both performance and interpretability.

Link: https://arxiv.org/abs/2504.16913
Authors: Shifali Agrahari, Sanasam Ranbir Singh
Affiliations: Department of Computer Science and Engineering, Indian Institute of Technology Guwahati
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: De-Factify 4: 4th Workshop on Multimodal Fact Checking and Hate Speech Detection, co-located with AAAI 2025, Pennsylvania


Abstract:In recent years, the detection of AI-generated text has become a critical area of research due to concerns about academic integrity, misinformation, and ethical AI deployment. This paper presents COT Fine-tuned, a novel framework for detecting AI-generated text and identifying the specific language model responsible for generating the text. We propose a dual-task approach, where Task A involves classifying text as AI-generated or human-written, and Task B identifies the specific LLM behind the text. The key innovation of our method lies in the use of Chain-of-Thought reasoning, which enables the model to generate explanations for its predictions, enhancing transparency and interpretability. Our experiments demonstrate that COT Fine-tuned achieves high accuracy in both tasks, with strong performance in LLM identification and human-AI classification. We also show that the CoT reasoning process contributes significantly to the model's effectiveness and interpretability.

[NLP-3] AIMO-2 Winning Solution: Building State-of-the-Art Mathematical Reasoning Models with OpenMathReasoning dataset

[Quick Read]: This paper aims to build state-of-the-art mathematical reasoning models, where the core challenge is improving the ability to reason over and solve complex mathematical problems. The solution rests on three key pillars: first, a large-scale dataset of 540K unique high-quality math problems with 3.2M long-reasoning solutions; second, a novel method that integrates code execution with long-reasoning models through iterative training, generation, and quality filtering, yielding 1.7M high-quality tool-integrated reasoning solutions; and third, a pipeline for selecting the most promising candidate solution, in which generative solution selection (GenSelect) significantly outperforms the majority-voting baseline. Together these innovations drive state-of-the-art results on mathematical reasoning benchmarks.

Link: https://arxiv.org/abs/2504.16891
Authors: Ivan Moshkov, Darragh Hanley, Ivan Sorokin, Shubham Toshniwal, Christof Henkel, Benedikt Schifferer, Wei Du, Igor Gitman
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: Report of the AIMO-2 winning submission


Abstract:This paper presents our winning submission to the AI Mathematical Olympiad - Progress Prize 2 (AIMO-2) competition. Our recipe for building state-of-the-art mathematical reasoning models relies on three key pillars. First, we create a large-scale dataset comprising 540K unique high-quality math problems, including olympiad-level problems, and their 3.2M long-reasoning solutions. Second, we develop a novel method to integrate code execution with long reasoning models through iterative training, generation, and quality filtering, resulting in 1.7M high-quality Tool-Integrated Reasoning solutions. Third, we create a pipeline to train models to select the most promising solution from many candidates. We show that such generative solution selection (GenSelect) can significantly improve upon majority voting baseline. Combining these ideas, we train a series of models that achieve state-of-the-art results on mathematical reasoning benchmarks. To facilitate further research, we release our code, models, and the complete OpenMathReasoning dataset under a commercially permissive license.
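For reference, the majority-voting baseline that GenSelect is compared against can be sketched in a few lines; the candidate answers below are invented.

```python
from collections import Counter

def majority_vote(answers):
    """Return the most frequent final answer among sampled solutions."""
    filtered = [a for a in answers if a is not None]  # drop failed generations
    return Counter(filtered).most_common(1)[0][0] if filtered else None

# Hypothetical final answers extracted from 8 sampled solutions:
samples = ["42", "42", "17", "42", None, "17", "42", "42"]
print(majority_vote(samples))  # -> "42"
```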

[NLP-4] Do Large Language Models know who did what to whom?

[Quick Read]: This paper asks to what extent large language models (LLMs) understand the thematic roles in a sentence (i.e., "who did what to whom"), and in particular whether their main training objective, word prediction, yields sentence representations that capture thematic roles. The key lies in two experiments characterizing the sentence representations of four LLMs: although some attention heads robustly capture thematic roles independently of syntax, overall representational similarity reflects syntactic similarity rather than whether thematic-role assignments agree, and there is little evidence that thematic-role information is present in any subset of hidden units. This suggests that LLMs can extract thematic roles, but relative to humans this information influences their representations only weakly.

Link: https://arxiv.org/abs/2504.16884
Authors: Joseph M. Denning, Xiaohan (Hannah) Guo, Bryor Snefjella, Idan A. Blank
Affiliations: Unknown
Subjects: Computation and Language (cs.CL)
Comments:


Abstract:Large Language Models (LLMs) are commonly criticized for not understanding language. However, many critiques focus on cognitive abilities that, in humans, are distinct from language processing. Here, we instead study a kind of understanding tightly linked to language: inferring who did what to whom (thematic roles) in a sentence. Does the central training objective of LLMs-word prediction-result in sentence representations that capture thematic roles? In two experiments, we characterized sentence representations in four LLMs. In contrast to human similarity judgments, in LLMs the overall representational similarity of sentence pairs reflected syntactic similarity but not whether their agent and patient assignments were identical vs. reversed. Furthermore, we found little evidence that thematic role information was available in any subset of hidden units. However, some attention heads robustly captured thematic roles, independently of syntax. Therefore, LLMs can extract thematic roles but, relative to humans, this information influences their representations more weakly.

[NLP-5] Planning with Diffusion Models for Target-Oriented Dialogue Systems

[Quick Read]: This paper targets a limitation of dialogue planning in Target-Oriented Dialogue (TOD): existing methods generate dialogue plans in a step-by-step sequential manner and are therefore prone to compounding errors and myopic actions. To address this, it proposes DiffTOD, a novel dialogue planning framework. The key is to use diffusion models for non-sequential dialogue planning, formulating planning as a trajectory-generation problem with conditional guidance, and to introduce three tailored guidance mechanisms for different target types that provide flexible test-time guidance toward diverse TOD targets. This allows DiffTOD to perform non-myopic lookahead exploration and optimize action strategies over long horizons.

Link: https://arxiv.org/abs/2504.16858
Authors: Hanwen Du, Bo Peng, Xia Ning
Affiliations: Department of Computer Science and Engineering, The Ohio State University, USA; Department of Biomedical Informatics, The Ohio State University, USA; Translational Data Analytics Institute, The Ohio State University, USA
Subjects: Computation and Language (cs.CL)
Comments:


Abstract:Target-Oriented Dialogue (TOD) remains a significant challenge in the LLM era, where strategic dialogue planning is crucial for directing conversations toward specific targets. However, existing dialogue planning methods generate dialogue plans in a step-by-step sequential manner, and may suffer from compounding errors and myopic actions. To address these limitations, we introduce a novel dialogue planning framework, DiffTOD, which leverages diffusion models to enable non-sequential dialogue planning. DiffTOD formulates dialogue planning as a trajectory generation problem with conditional guidance, and leverages a diffusion language model to estimate the likelihood of the dialogue trajectory. To optimize the dialogue action strategies, DiffTOD introduces three tailored guidance mechanisms for different target types, offering flexible guidance towards diverse TOD targets at test time. Extensive experiments across three diverse TOD settings show that DiffTOD can effectively perform non-myopic lookahead exploration and optimize action strategies over a long horizon through non-sequential dialogue planning, and demonstrates strong flexibility across complex and diverse dialogue scenarios. Our code and data are accessible through this https URL.

[NLP-6] Emo Pillars: Knowledge Distillation to Support Fine-Grained Context-Aware and Context-Less Emotion Classification

[Quick Read]: This paper addresses the lack of context and the limited emotion categories in sentiment-analysis datasets, as well as the over-prediction of emotions and the heavy resource demands of foundation LLMs such as GPT-4. The solution is an LLM-based data-synthesis pipeline that uses the large model Mistral-7b to generate training examples for more accessible, lightweight BERT-type encoder models. The key innovation is grounding generation in a corpus of narratives to enlarge the semantic diversity of examples, producing non-repetitive, story-character-centered utterances with unique contexts over 28 emotion classes. The paper contributes a dataset of 100K contextual and 300K context-less examples and fine-tunes pre-trained encoders into several "Emo Pillars" models. Experiments show SOTA performance on GoEmotions, ISEAR, and IEMOCAP, and statistical analysis plus human evaluation confirm the dataset's effectiveness for utterance diversification and context personalization, while also pointing out the pipeline's weaker handling of out-of-taxonomy labels.

Link: https://arxiv.org/abs/2504.16856
Authors: Alexander Shvets
Affiliations: NLP Group, Pompeu Fabra University; Language Technologies Unit, Barcelona Supercomputing Center
Subjects: Computation and Language (cs.CL)
Comments:


Abstract:Most datasets for sentiment analysis lack context in which an opinion was expressed, often crucial for emotion understanding, and are mainly limited by a few emotion categories. Foundation large language models (LLMs) like GPT-4 suffer from over-predicting emotions and are too resource-intensive. We design an LLM-based data synthesis pipeline and leverage a large model, Mistral-7b, for the generation of training examples for more accessible, lightweight BERT-type encoder models. We focus on enlarging the semantic diversity of examples and propose grounding the generation into a corpus of narratives to produce non-repetitive story-character-centered utterances with unique contexts over 28 emotion classes. By running 700K inferences in 450 GPU hours, we contribute with the dataset of 100K contextual and also 300K context-less examples to cover both scenarios. We use it for fine-tuning pre-trained encoders, which results in several Emo Pillars models. We show that Emo Pillars models are highly adaptive to new domains when tuned to specific tasks such as GoEmotions, ISEAR, IEMOCAP, and EmoContext, reaching the SOTA performance on the first three. We also validate our dataset, conducting statistical analysis and human evaluation, and confirm the success of our measures in utterance diversification (although less for the neutral class) and context personalization, while pointing out the need for improved handling of out-of-taxonomy labels within the pipeline.

[NLP-7] Monte Carlo Planning with Large Language Model for Text-Based Game Agents

[Quick Read]: This paper targets the inefficiency of planning and learning for language-based autonomous agents in text-based game environments, along with their lack of language understanding and reasoning. Traditional planning-then-learning paradigms, such as those combining Monte Carlo Tree Search (MCTS) with reinforcement learning (RL), require extensive iterations and rely solely on uncertainty-driven exploration. The key of the proposed algorithm, Monte Carlo planning with Dynamic Memory-guided Large language model (MC-DML), is to combine the language understanding and reasoning of large language models (LLMs) with the exploratory strengths of tree search, augmenting the LLM with in-trial and cross-trial memory mechanisms so it can learn from past experience and dynamically adjust action evaluations during planning. This significantly boosts performance at the initial planning phase, outperforming strong contemporary methods that require multiple iterations.

Link: https://arxiv.org/abs/2504.16855
Authors: Zijing Shi, Meng Fang, Ling Chen
Affiliations: AAII, University of Technology Sydney; University of Liverpool
Subjects: Computation and Language (cs.CL)
Comments:


Abstract:Text-based games provide valuable environments for language-based autonomous agents. However, planning-then-learning paradigms, such as those combining Monte Carlo Tree Search (MCTS) and reinforcement learning (RL), are notably time-consuming due to extensive iterations. Additionally, these algorithms perform uncertainty-driven exploration but lack language understanding and reasoning abilities. In this paper, we introduce the Monte Carlo planning with Dynamic Memory-guided Large language model (MC-DML) algorithm. MC-DML leverages the language understanding and reasoning capabilities of Large Language Models (LLMs) alongside the exploratory advantages of tree search algorithms. Specifically, we enhance LLMs with in-trial and cross-trial memory mechanisms, enabling them to learn from past experiences and dynamically adjust action evaluations during planning. We conduct experiments on a series of text-based games from the Jericho benchmark. Our results demonstrate that the MC-DML algorithm significantly enhances performance across various games at the initial planning phase, outperforming strong contemporary methods that require multiple iterations. This demonstrates the effectiveness of our algorithm, paving the way for more efficient language-grounded planning in complex environments.

[NLP-8] GreenMind: A Next-Generation Vietnamese Large Language Model for Structured and Logical Reasoning

[Quick Read]: This paper addresses the challenges of reasoning with LLMs on Vietnamese tasks, especially problems that require intermediate reasoning steps before producing a final answer. The key is GreenMind-Medium-14B-R1, developed with a fine-tuning strategy based on Group Relative Policy Optimization (GRPO) on a high-quality synthesized Vietnamese reasoning dataset, together with two reward functions: the first tackles language mixing by explicitly detecting biased-language characters during token sampling, and the second uses Sentence Transformer-based models to keep the generated reasoning factually correct without distorting the final output. Experiments show the model outperforms prior work and improves linguistic consistency, and evaluation on SeaExam, a multilingual multiple-choice dataset, further confirms the effectiveness of the reasoning method over few-shot prompting.

Link: https://arxiv.org/abs/2504.16832
Authors: Luu Quy Tung, Hoang Quoc Viet, Vo Trong Thu
Affiliations: GreenNode.ai; John Von Neumann Institute
Subjects: Computation and Language (cs.CL)
Comments:


Abstract:Chain-of-Thought (CoT) is a robust approach for tackling LLM tasks that require intermediate reasoning steps prior to generating a final answer. In this paper, we present GreenMind-Medium-14B-R1, the Vietnamese reasoning model inspired by the finetuning strategy based on Group Relative Policy Optimization. We also leverage a high-quality Vietnamese synthesized reasoning dataset and design two reward functions to tackle the main limitations of this technique: (i) language mixing, where we explicitly detect the presence of biased language characters during the process of sampling tokens, and (ii) we leverage Sentence Transformer-based models to ensure that the generated reasoning content maintains factual correctness and does not distort the final output. Experimental results on the Vietnamese dataset from the VLSP 2023 Challenge demonstrate that our model outperforms prior works and enhances linguistic consistency in its responses. Furthermore, we extend our evaluation to SeaExam-a multilingual multiple-choice dataset, showing the effectiveness of our reasoning method compared to few-shot prompting techniques.
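The first reward function, detecting biased-language characters to curb language mixing, might look roughly like the following sketch; the CJK character range and the penalty schedule are assumptions, not the paper's exact rule.

```python
import re

def language_purity_reward(text):
    """Penalize outputs containing CJK characters, a proxy for language
    mixing in a model expected to reason in Vietnamese."""
    mixed = len(re.findall(r"[\u4e00-\u9fff]", text))
    return 1.0 if mixed == 0 else max(0.0, 1.0 - 0.1 * mixed)

print(language_purity_reward("Xin chào, đây là một ví dụ."))  # -> 1.0
print(language_purity_reward("Xin chào 你好"))                 # -> 0.8
```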

[NLP-9] Process Reward Models That Think

[Quick Read]: This paper targets the high supervision cost of process reward models (PRMs) for test-time scaling: PRMs require step-level supervision, making them expensive to train. The proposed data-efficient PRM, ThinkPRM, exploits the inherent reasoning abilities of long chain-of-thought (CoT) models and is fine-tuned on orders of magnitude fewer process labels, verifying every step of a solution by generating a verification CoT. The key result is that, using only 1% of the process labels in PRM800K, ThinkPRM outperforms discriminative PRMs, LLM-as-a-Judge, and other baselines on several challenging benchmarks, performs strongly in out-of-domain evaluations, and scales verification compute more effectively under the same budget.

Link: https://arxiv.org/abs/2504.16828
Authors: Muhammad Khalifa, Rishabh Agarwal, Lajanugen Logeswaran, Jaekyeom Kim, Hao Peng, Moontae Lee, Honglak Lee, Lu Wang
Affiliations: University of Michigan; Mila; LG AI Research; University of Illinois Urbana-Champaign
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:


Abstract:Step-by-step verifiers – also known as process reward models (PRMs) – are a key ingredient for test-time scaling. PRMs require step-level supervision, making them expensive to train. This work aims to build data-efficient PRMs as verbalized step-wise reward models that verify every step in the solution by generating a verification chain-of-thought (CoT). We propose ThinkPRM, a long CoT verifier fine-tuned on orders of magnitude fewer process labels than those required by discriminative PRMs. Our approach capitalizes on the inherent reasoning abilities of long CoT models, and outperforms LLM-as-a-Judge and discriminative verifiers – using only 1% of the process labels in PRM800K – across several challenging benchmarks. Specifically, ThinkPRM beats the baselines on ProcessBench, MATH-500, and AIME '24 under best-of-N selection and reward-guided search. In an out-of-domain evaluation on a subset of GPQA-Diamond and LiveCodeBench, our PRM surpasses discriminative verifiers trained on the full PRM800K by 8% and 4.5%, respectively. Lastly, under the same token budget, ThinkPRM scales up verification compute more effectively compared to LLM-as-a-Judge, outperforming it by 7.2% on a subset of ProcessBench. Our work highlights the value of generative, long CoT PRMs that can scale test-time compute for verification while requiring minimal supervision for training. Our code, data, and models will be released at this https URL.
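A minimal sketch of the best-of-N selection setting in which ThinkPRM is evaluated: a verifier scores each candidate solution and the highest-scoring one is kept. The toy scoring function below is a placeholder for the PRM's verification CoT and per-step judgments.

```python
def best_of_n(candidates, score_fn):
    """Keep the candidate the verifier scores highest."""
    return max(candidates, key=score_fn)

# Placeholder verifier: in ThinkPRM this would be a long-CoT model that
# checks every step and emits a correctness judgment per step.
def toy_prm_score(solution):
    steps = solution["steps"]
    return sum(s["p_correct"] for s in steps) / len(steps)

candidates = [
    {"answer": "x=3", "steps": [{"p_correct": 0.9}, {"p_correct": 0.4}]},
    {"answer": "x=5", "steps": [{"p_correct": 0.8}, {"p_correct": 0.85}]},
]
print(best_of_n(candidates, toy_prm_score)["answer"])  # -> "x=5"
```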

[NLP-10] LLM-assisted Graph-RAG Information Extraction from IFC Data

[Quick Read]: This paper addresses the ambiguity and hierarchical complexity of IFC data: the standard allows the same product information to be represented in multiple ways, which introduces uncertainty into collaboration. The key is to use large language models (LLMs) with the Graph Retrieval-Augmented Generation (Graph-RAG) technique to parse IFC data, retrieving building-object properties and their relations and enabling natural-language query-response retrieval without a complex pipeline. This Graph-RAG approach equips generative LLMs such as GPT-4o with graph-based knowledge, effectively coping with the challenges posed by the hierarchical structure of IFC data.

Link: https://arxiv.org/abs/2504.16813
Authors: Sima Iranmanesh, Hadeel Saadany, Edlira Vakaj
Affiliations: Unknown
Subjects: Computation and Language (cs.CL)
Comments: 2025 European Conference on Computing in Construction


Abstract:IFC data has become the general building information standard for collaborative work in the construction industry. However, IFC data can be very complicated because it allows for multiple ways to represent the same product information. In this research, we utilise the capabilities of LLMs to parse the IFC data with Graph Retrieval-Augmented Generation (Graph-RAG) technique to retrieve building object properties and their relations. We will show that, despite limitations due to the complex hierarchy of the IFC data, the Graph-RAG parsing enhances generative LLMs like GPT-4o with graph-based knowledge, enabling natural language query-response retrieval without the need for a complex pipeline.

[NLP-11] Random Long-Context Access for Mamba via Hardware-aligned Hierarchical Sparse Attention

[Quick Read]: This paper tackles the problem that RNNs cannot randomly access historical context when processing long sequences, while aiming to retain their advantages in efficiency and length generalization; naively integrating attention into an RNN can erode those efficiency benefits. The key innovation is Hierarchical Sparse Attention (HSA), a novel attention mechanism that learns token-to-chunk relevance from fine-grained token-level information within each chunk, improving the precision of chunk selection, and aggregates information hierarchically to enable flexible long-range random access. A hardware-aligned kernel design further improves efficiency. Combining HSA with Mamba yields RAMba, which achieves perfect accuracy on passkey retrieval over 64 million tokens of context despite being pre-trained with only 4K-length contexts, and delivers significant gains on various downstream tasks, demonstrating strong potential for long-context modeling.

Link: https://arxiv.org/abs/2504.16795
Authors: Xiang Hu, Jiaqi Leng, Jun Zhao, Kewei Tu, Wei Wu
Affiliations: Ant Group; Fudan University; ShanghaiTech University
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: preprint


Abstract:A key advantage of Recurrent Neural Networks (RNNs) over Transformers is that their linear computational and space complexity enables faster training and inference for long sequences. However, RNNs are fundamentally unable to randomly access historical context, and simply integrating attention mechanisms may undermine their efficiency advantages. To overcome this limitation, we propose Hierarchical Sparse Attention (HSA), a novel attention mechanism that enhances RNNs with long-range random access flexibility while preserving their merits in efficiency and length generalization. HSA divides inputs into chunks, selects the top-k chunks, and hierarchically aggregates information. The core innovation lies in learning token-to-chunk relevance based on fine-grained token-level information inside each chunk. This approach enhances the precision of chunk selection across both in-domain and out-of-domain context lengths. To make HSA efficient, we further introduce a hardware-aligned kernel design. By combining HSA with Mamba, we introduce RAMba, which achieves perfect accuracy in passkey retrieval across 64 million contexts despite pre-training on only 4K-length contexts, and significant improvements on various downstream tasks, with nearly constant memory footprint. These results show RAMba's huge potential in long-context modeling.
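The chunking and top-k selection step of HSA can be sketched as follows. The dot-product relevance and max-pooling over token scores are stand-ins for the learned token-to-chunk scoring, and the dimensions are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, chunk, dim, k = 64, 8, 16, 2

keys = rng.normal(size=(seq_len, dim))   # historical token keys
query = rng.normal(size=(dim,))          # current query token

# Score every historical token, then score each chunk from its token-level
# scores, mirroring HSA's chunk relevance learned from fine-grained tokens.
token_scores = keys @ query                          # (seq_len,)
chunk_scores = token_scores.reshape(-1, chunk).max(axis=1)

topk = np.argsort(chunk_scores)[-k:]                 # selected chunk indices
selected = np.concatenate([keys[i * chunk:(i + 1) * chunk] for i in topk])
print(topk, selected.shape)                          # attend only over these tokens
```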

[NLP-12] Credible plan-driven RAG method for Multi-hop Question Answering

[Quick Read]: This paper addresses a challenge for Retrieval-Augmented Generation (RAG) in multi-hop question answering (QA): reasoning paths for complex queries easily drift, intermediate results can be wrong, and these errors propagate and accumulate through the reasoning process, reducing answer accuracy. The key is the Plan-then-Act-and-Review (PAR RAG) framework, which mitigates error propagation through an interpretable, incremental reasoning paradigm organized into three stages: planning, act, and review. Planning applies a top-down global decomposition strategy to form a comprehensive plan from a holistic viewpoint, avoiding the local-optima pitfalls of traditional methods; act couples plan execution with multi-granularity verification to keep intermediate results accurate while managing error propagation; and review fine-tunes the results to keep the whole reasoning process reliable. Experiments on multi-hop QA datasets show PAR RAG substantially outperforms state-of-the-art methods on key metrics, including EM and F1 scores.

Link: https://arxiv.org/abs/2504.16787
Authors: Ningning Zhang, Chi Zhang, Zhizhong Tan, Xingxing Yang, Weiping Deng, Wenyong Wang
Affiliations: Macau University of Science and Technology
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 18 pages, 3 figures


Abstract:Multi-hop question answering (QA) presents a considerable challenge for Retrieval-Augmented Generation (RAG), requiring the structured decomposition of complex queries into logical reasoning paths and the generation of dependable intermediate results. However, deviations in reasoning paths or errors in intermediate results, which are common in current RAG methods, may propagate and accumulate throughout the reasoning process, diminishing the accuracy of the answer to complex queries. To address this challenge, we propose the Plan-then-Act-and-Review (PAR RAG) framework, which is organized into three key stages: planning, act, and review, and aims to offer an interpretable and incremental reasoning paradigm for accurate and reliable multi-hop question answering by mitigating error propagation. PAR RAG initially applies a top-down problem decomposition strategy, formulating a comprehensive plan that integrates multiple executable steps from a holistic viewpoint. This approach avoids the pitfalls of local optima common in traditional RAG methods, ensuring the accuracy of the entire reasoning path. Subsequently, PAR RAG incorporates a plan execution mechanism based on multi-granularity verification. By utilizing both coarse-grained similarity information and fine-grained relevant data, the framework thoroughly checks and adjusts intermediate results, ensuring process accuracy while effectively managing error propagation and amplification. Experimental results on multi-hop QA datasets demonstrate that the PAR RAG framework substantially outperforms existing state-of-the-art methods in key metrics, including EM and F1 scores.
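A schematic of the plan / act / review loop described above, with placeholder functions standing in for the LLM-backed planner, executor, and verifier; the control flow is an assumption based on the abstract, not the paper's exact algorithm.

```python
def par_rag(question, plan_fn, act_fn, review_fn, max_rounds=3):
    """Schematic plan / act / review loop: every intermediate result is
    verified (and possibly revised) before it feeds the next step."""
    context = []
    for step in plan_fn(question):            # top-down decomposition
        result = act_fn(step, context)        # retrieve and answer the sub-step
        for _ in range(max_rounds):
            accepted, revised = review_fn(step, result, context)
            if accepted:
                break
            result = revised                  # adjust before errors propagate
        context.append((step, result))
    return context[-1][1] if context else None

# Toy stand-ins for the LLM-backed components:
plan_fn = lambda q: [f"who wrote {q}?", "where was that author born?"]
act_fn = lambda step, ctx: f"answer({step})"
review_fn = lambda step, res, ctx: (True, res)   # always accepts in this sketch
print(par_rag("the novel Hard Times", plan_fn, act_fn, review_fn))
```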

[NLP-13] MOOSComp: Improving Lightweight Long-Context Compressor via Mitigating Over-Smoothing and Incorporating Outlier Scores

[Quick Read]: This paper targets the practical challenge that long-context inputs increase inference time and resource consumption for large language models, especially in resource-constrained environments. The key solution is MOOSComp, a token-classification-based long-context compression method that strengthens a BERT-based compressor by mitigating the over-smoothing problem and incorporating outlier scores. During training, an inter-class cosine similarity loss term penalizes overly similar token representations, improving token-classification accuracy; during compression, outlier scores preserve rare but critical tokens that task-agnostic compression tends to discard, and are combined with the classifier output to make the compressor generalize across tasks. Experiments show strong performance at various compression ratios on long-context understanding and reasoning benchmarks, plus a 3.3x speedup at a 4x compression ratio on a resource-constrained mobile device.

Link: https://arxiv.org/abs/2504.16786
Authors: Fengwei Zhou, Jiafei Song, Wenjin Jason Li, Gengjian Xue, Zhikang Zhao, Yichao Lu, Bailin Na
Affiliations: OPPO CTG (OPPO Central Technology Group)
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments:


Abstract:Recent advances in large language models have significantly improved their ability to process long-context input, but practical applications are challenged by increased inference time and resource consumption, particularly in resource-constrained environments. To address these challenges, we propose MOOSComp, a token-classification-based long-context compression method that enhances the performance of a BERT-based compressor by mitigating the over-smoothing problem and incorporating outlier scores. In the training phase, we add an inter-class cosine similarity loss term to penalize excessively similar token representations, thereby improving the token classification accuracy. During the compression phase, we introduce outlier scores to preserve rare but critical tokens that are prone to be discarded in task-agnostic compression. These scores are integrated with the classifier’s output, making the compressor more generalizable to various tasks. Superior performance is achieved at various compression ratios on long-context understanding and reasoning benchmarks. Moreover, our method obtains a speedup of 3.3x at a 4x compression ratio on a resource-constrained mobile device.
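The inter-class cosine similarity loss term can be sketched in PyTorch as below; the exact formulation in the paper may differ, and the class representations here are simply mean token vectors per class.

```python
import torch
import torch.nn.functional as F

def inter_class_cos_loss(token_reprs, labels):
    """Penalize high cosine similarity between the mean representations of
    'keep' (1) and 'discard' (0) tokens, discouraging over-smoothing."""
    keep = token_reprs[labels == 1].mean(dim=0)
    drop = token_reprs[labels == 0].mean(dim=0)
    return F.cosine_similarity(keep, drop, dim=0)

reprs = torch.randn(10, 768, requires_grad=True)    # token representations
labels = torch.tensor([1, 0, 1, 1, 0, 0, 1, 0, 1, 0])
loss = inter_class_cos_loss(reprs, labels)
loss.backward()                   # added on top of the classification loss
print(loss.item())
```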

[NLP-14] Evaluation Framework for AI Systems in “the Wild”

[Quick Read]: This paper addresses the lag in evaluation methodology for generative AI (GenAI) models in real-world use: traditional evaluation relies on benchmarks and fixed datasets that often fail to reflect real-world performance, creating a gap between lab-tested outcomes and practical applications. The key is a comprehensive evaluation framework emphasizing diverse, evolving inputs and holistic, dynamic, ongoing assessment. The paper offers practitioners guidance for designing evaluations that accurately reflect real-time capabilities, and recommends that policymakers craft GenAI policy around societal impacts rather than fixed performance numbers or parameter sizes. The core is to advocate holistic frameworks that integrate performance, fairness, and ethics, and continuous, outcome-oriented methods combining human and automated assessment while staying transparent to foster stakeholder trust, ensuring GenAI models are not only technically proficient but also ethically responsible and impactful.

Link: https://arxiv.org/abs/2504.16778
Authors: Sarah Jabbour, Trenton Chang, Anindya Das Antar, Joseph Peper, Insu Jang, Jiachen Liu, Jae-Won Chung, Shiqi He, Michael Wellman, Bryan Goodman, Elizabeth Bondi-Kelly, Kevin Samy, Rada Mihalcea, Mosharaf Chowhury, David Jurgens, Lu Wang
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
Comments: 35 pages


Abstract:Generative AI (GenAI) models have become vital across industries, yet current evaluation methods have not adapted to their widespread use. Traditional evaluations often rely on benchmarks and fixed datasets, frequently failing to reflect real-world performance, which creates a gap between lab-tested outcomes and practical applications. This white paper proposes a comprehensive framework for how we should evaluate real-world GenAI systems, emphasizing diverse, evolving inputs and holistic, dynamic, and ongoing assessment approaches. The paper offers guidance for practitioners on how to design evaluation methods that accurately reflect real-time capabilities, and provides policymakers with recommendations for crafting GenAI policies focused on societal impacts, rather than fixed performance numbers or parameter sizes. We advocate for holistic frameworks that integrate performance, fairness, and ethics and the use of continuous, outcome-oriented methods that combine human and automated assessments while also being transparent to foster trust among stakeholders. Implementing these strategies ensures GenAI models are not only technically proficient but also ethically responsible and impactful.

[NLP-15] How Effective are Generative Large Language Models in Performing Requirements Classification?

[Quick Read]: This paper asks how well generative large language models (LLMs) perform on requirements classification and explores their potential relative to non-generative LLMs such as BERT. It focuses on generative LLMs' ability to handle both binary and multi-class classification, a common task in requirements engineering (RE) that calls for context-aware text generation.

The key of the solution is an extensive experimental study comparing three generative LLMs (Bloom, Gemma, and Llama) on three widely used datasets (PROMISE NFR, Functional-Quality, and SecReq). The study finds that prompt design and model architecture are universally important, while other factors such as dataset variation have a more situational impact that depends on the complexity of the classification task. This insight can guide future model development and deployment strategies, emphasizing optimized prompt structures and model architectures aligned with task-specific needs.

Link: https://arxiv.org/abs/2504.16768
Authors: Waad Alhoshan, Alessio Ferrari, Liping Zhao
Affiliations: Imam Mohammad Ibn Saud Islamic University (IMSIU); University College Dublin (UCD); University of Manchester
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
Comments:


Abstract:In recent years, transformer-based large language models (LLMs) have revolutionised natural language processing (NLP), with generative models opening new possibilities for tasks that require context-aware text generation. Requirements engineering (RE) has also seen a surge in the experimentation of LLMs for different tasks, including trace-link detection, regulatory compliance, and others. Requirements classification is a common task in RE. While non-generative LLMs like BERT have been successfully applied to this task, there has been limited exploration of generative LLMs. This gap raises an important question: how well can generative LLMs, which produce context-aware outputs, perform in requirements classification? In this study, we explore the effectiveness of three generative LLMs (Bloom, Gemma, and Llama) in performing both binary and multi-class requirements classification. We design an extensive experimental study involving over 400 experiments across three widely used datasets (PROMISE NFR, Functional-Quality, and SecReq). Our study concludes that while factors like prompt design and LLM architecture are universally important, others, such as dataset variations, have a more situational impact, depending on the complexity of the classification task. This insight can guide future model development and deployment strategies, focusing on optimising prompt structures and aligning model architectures with task-specific needs for improved performance.

[NLP-16] HEMA : A Hippocampus-Inspired Extended Memory Architecture for Long-Context AI Conversations

[Quick Read]: This paper tackles the difficulty large language models (LLMs) have in maintaining coherence over extended conversations spanning hundreds of turns, despite performing well within their context windows. The key is HEMA (Hippocampus-Inspired Extended Memory Architecture), a dual-memory system that combines Compact Memory, a continuously updated one-sentence summary preserving global narrative coherence, with Vector Memory, an episodic store of chunk embeddings queried via cosine similarity. Compared with summarization-only approaches, Vector Memory with 10K indexed chunks achieves P@5 = 0.80 and R@50 = 0.74, doubling the area under the precision-recall curve. Ablation studies reveal two key insights: semantic forgetting via age-weighted pruning cuts retrieval latency by 34% with minimal recall loss, and a two-level summary hierarchy prevents cascading errors in ultra-long conversations exceeding 1,000 turns. HEMA shows that combining verbatim recall with semantic continuity offers a practical path to privacy-aware, long-horizon conversational ability without retraining the model.

Link: https://arxiv.org/abs/2504.16754
Authors: Kwangseob Ahn
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:


Abstract:Large language models (LLMs) struggle with maintaining coherence in extended conversations spanning hundreds of turns, despite performing well within their context windows. This paper introduces HEMA (Hippocampus-Inspired Extended Memory Architecture), a dual-memory system inspired by human cognitive processes. HEMA combines Compact Memory - a continuously updated one-sentence summary preserving global narrative coherence, and Vector Memory - an episodic store of chunk embeddings queried via cosine similarity. When integrated with a 6B-parameter transformer, HEMA maintains coherent dialogues beyond 300 turns while keeping prompt length under 3,500 tokens. Experimental results show substantial improvements: factual recall accuracy increases from 41% to 87%, and human-rated coherence improves from 2.7 to 4.3 on a 5-point scale. With 10K indexed chunks, Vector Memory achieves P@5 = 0.80 and R@50 = 0.74, doubling the area under the precision-recall curve compared to summarization-only approaches. Ablation studies reveal two key insights: semantic forgetting through age-weighted pruning reduces retrieval latency by 34% with minimal recall loss, and a two-level summary hierarchy prevents cascade errors in ultra-long conversations exceeding 1,000 turns. HEMA demonstrates that combining verbatim recall with semantic continuity provides a practical solution for privacy-aware conversational AI capable of month-long dialogues without model retraining.
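A toy version of HEMA's Vector Memory, cosine-similarity retrieval plus age-weighted pruning, is sketched below; the decay constant, pruning rule, and random embeddings are illustrative assumptions rather than the paper's exact design.

```python
import numpy as np

class VectorMemory:
    """Toy episodic store in the spirit of HEMA's Vector Memory."""
    def __init__(self, decay=0.99, max_items=10_000):
        self.embs, self.texts, self.ages = [], [], []
        self.decay, self.max_items = decay, max_items

    def add(self, emb, text):
        self.ages = [a + 1 for a in self.ages]       # everything else ages
        self.embs.append(emb / np.linalg.norm(emb))
        self.texts.append(text)
        self.ages.append(0)
        if len(self.texts) > self.max_items:          # age-weighted pruning
            keep = np.argsort(self.ages)[: self.max_items]
            self.embs = [self.embs[i] for i in keep]
            self.texts = [self.texts[i] for i in keep]
            self.ages = [self.ages[i] for i in keep]

    def query(self, emb, k=5):
        q = emb / np.linalg.norm(emb)
        sims = np.array([q @ e for e in self.embs])
        scores = sims * self.decay ** np.array(self.ages)  # recency-weighted
        return [self.texts[i] for i in np.argsort(scores)[-k:][::-1]]

mem = VectorMemory()
rng = np.random.default_rng(1)
for t in ["user likes green tea", "meeting moved to Friday", "cat is named Miso"]:
    mem.add(rng.normal(size=32), t)
print(mem.query(rng.normal(size=32), k=2))
```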

[NLP-17] IRIS: Interactive Research Ideation System for Accelerating Scientific Discovery

[Quick Read]: This paper examines how the potential of large language models (LLMs) for accelerating scientific research can translate into actual discovery, focusing on the crucial first stage of generating novel research hypotheses. Existing automated hypothesis-generation methods center on multi-agent frameworks and extended test-time compute but lack transparency and steerability via a synergistic human-in-the-loop (HITL) approach. To fill this gap, the paper introduces IRIS (Interactive Research Ideation System), an open-source platform for researchers to leverage LLM-assisted scientific ideation. Its key innovations combine adaptive test-time compute expansion via Monte Carlo Tree Search (MCTS), a fine-grained feedback mechanism, and query-based literature synthesis, giving researchers greater control and insight throughout the ideation process. A user study with researchers across diverse disciplines validates the system's effectiveness.

Link: https://arxiv.org/abs/2504.16728
Authors: Aniketh Garikaparthi, Manasi Patwardhan, Lovekesh Vig, Arman Cohan
Affiliations: TCS Research; Yale University
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 6 pages main text, 2 pages appendix


Abstract:The rapid advancement in capabilities of large language models (LLMs) raises a pivotal question: How can LLMs accelerate scientific discovery? This work tackles the crucial first stage of research, generating novel hypotheses. While recent work on automated hypothesis generation focuses on multi-agent frameworks and extending test-time compute, none of the approaches effectively incorporate transparency and steerability through a synergistic Human-in-the-loop (HITL) approach. To address this gap, we introduce IRIS: Interactive Research Ideation System, an open-source platform designed for researchers to leverage LLM-assisted scientific ideation. IRIS incorporates innovative features to enhance ideation, including adaptive test-time compute expansion via Monte Carlo Tree Search (MCTS), fine-grained feedback mechanism, and query-based literature synthesis. Designed to empower researchers with greater control and insight throughout the ideation process. We additionally conduct a user study with researchers across diverse disciplines, validating the effectiveness of our system in enhancing ideation. We open-source our code at this https URL

[NLP-18] A Post-trainers Guide to Multilingual Training Data: Uncovering Cross-lingual Transfer Dynamics

[Quick Read]: This paper probes the unclear dynamics of cross-lingual transfer (CLT) in large language models: despite the ubiquity of post-training on multilingual data, the factors that actually enable cross-lingual transfer remain poorly understood. The key is a systematic study of CLT dynamics in two model families of up to 35B parameters, trained on carefully controlled multilingual data mixtures across three generative tasks of varying complexity (summarization, instruction following, and mathematical reasoning) in both single-task and multi-task instruction-tuning settings. The study finds that cross-lingual transfer and multilingual performance cannot be explained by isolated variables but depend on the combination of post-training settings, and it identifies the conditions that lead to effective cross-lingual transfer in practice.

Link: https://arxiv.org/abs/2504.16677
Authors: Luisa Shimabucoro, Ahmet Ustun, Marzieh Fadaee, Sebastian Ruder
Affiliations: Cohere For AI; University of São Paulo; Meta
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:


Abstract:In order for large language models to be useful across the globe, they are fine-tuned to follow instructions on multilingual data. Despite the ubiquity of such post-training, a clear understanding of the dynamics that enable cross-lingual transfer remains elusive. This study examines cross-lingual transfer (CLT) dynamics in realistic post-training settings. We study two model families of up to 35B parameters in size trained on carefully controlled mixtures of multilingual data on three generative tasks with varying levels of complexity (summarization, instruction following, and mathematical reasoning) in both single-task and multi-task instruction tuning settings. Overall, we find that the dynamics of cross-lingual transfer and multilingual performance cannot be explained by isolated variables, varying depending on the combination of post-training settings. Finally, we identify the conditions that lead to effective cross-lingual transfer in practice.

[NLP-19] ParetoHqD: Fast Offline Multiobjective Alignment of Large Language Models using Pareto High-quality Data

[Quick Read]: This paper addresses the limited performance of multiobjective alignment of large language models caused by inappropriate preference representations and training on imbalanced reward scores. The key is ParetoHqD, which represents human preferences as preference directions in the objective space and treats data near the Pareto front as "high-quality" data. For each preference, ParetoHqD runs a two-stage supervised fine-tuning process in which each stage uses an individual Pareto high-quality training set that best matches the preference direction, effectively improving multiobjective alignment. Experiments show ParetoHqD outperforms five baselines on two multiobjective alignment tasks.

Link: https://arxiv.org/abs/2504.16628
Authors: Haoran Gu, Handing Wang, Yi Mei, Mengjie Zhang, Yaochu Jin
Affiliations: School of Artificial Intelligence, Xidian University, China; School of Engineering and Computer Science, Victoria University of Wellington, New Zealand; School of Engineering, Westlake University, China
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments: 19 pages, 6 figures, multiobjective alignment of LLMs


Abstract:Aligning large language models with multiple human expectations and values is crucial for ensuring that they adequately serve a variety of user needs. To this end, offline multiobjective alignment algorithms such as the Rewards-in-Context algorithm have shown strong performance and efficiency. However, inappropriate preference representations and training with imbalanced reward scores limit the performance of such algorithms. In this work, we introduce ParetoHqD that addresses the above issues by representing human preferences as preference directions in the objective space and regarding data near the Pareto front as "high-quality" data. For each preference, ParetoHqD follows a two-stage supervised fine-tuning process, where each stage uses an individual Pareto high-quality training set that best matches its preference direction. The experimental results have demonstrated the superiority of ParetoHqD over five baselines on two multiobjective alignment tasks.
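Treating data near the Pareto front as "high-quality" can be illustrated with a small non-dominated-filtering sketch (assuming higher is better on every objective); the two-objective reward scores below are invented.

```python
def pareto_front(points):
    """Return indices of non-dominated points (maximization on all axes)."""
    front = []
    for i, p in enumerate(points):
        dominated = any(
            q != p and all(q[d] >= p[d] for d in range(len(p)))
            for q in points
        )
        if not dominated:
            front.append(i)
    return front

# Hypothetical (helpfulness, harmlessness) reward scores for six responses:
scores = [(0.9, 0.2), (0.7, 0.7), (0.2, 0.9), (0.5, 0.5), (0.6, 0.69), (0.85, 0.4)]
print(pareto_front(scores))  # -> [0, 1, 2, 5]: the "high-quality" examples
```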

[NLP-20] TIFIN India at SemEval-2025: Harnessing Translation to Overcome Multilingual IR Challenges in Fact-Checked Claim Retrieval

[Quick Read]: This paper addresses the challenge of retrieving previously fact-checked claims in both monolingual and cross-lingual settings, a critical task given the global prevalence of disinformation. The key is a two-stage strategy: a reliable baseline retrieval system built on a fine-tuned embedding model, followed by an LLM-based reranker that refines the results. The core contribution is demonstrating how LLM-based translation can overcome the hurdles of multilingual information retrieval. The work also emphasizes that the bulk of the pipeline can be replicated on a consumer GPU. The final integrated system achieves success@10 scores of 0.938 and 0.81025 on the monolingual and cross-lingual test sets, respectively.

Link: https://arxiv.org/abs/2504.16627
Authors: Prasanna Devadiga, Arya Suneesh, Pawan Kumar Rajpoot, Bharatdeep Hazarika, Aditya U Baliga
Affiliations: TIFIN India; IIIT Kottayam
Subjects: Computation and Language (cs.CL)
Comments:


Abstract:We address the challenge of retrieving previously fact-checked claims in monolingual and crosslingual settings - a critical task given the global prevalence of disinformation. Our approach follows a two-stage strategy: a reliable baseline retrieval system using a fine-tuned embedding model and an LLM-based reranker. Our key contribution is demonstrating how LLM-based translation can overcome the hurdles of multilingual information retrieval. Additionally, we focus on ensuring that the bulk of the pipeline can be replicated on a consumer GPU. Our final integrated system achieved a success@10 score of 0.938 and 0.81025 on the monolingual and crosslingual test sets, respectively.
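The success@10 metric reported above can be computed as in the sketch below; the ranked lists and relevance sets are invented.

```python
def success_at_k(ranked_ids, relevant_ids, k=10):
    """1.0 for a query if any relevant fact-check appears in the top k."""
    return float(any(doc in relevant_ids for doc in ranked_ids[:k]))

queries = [
    (["fc7", "fc2", "fc9"], {"fc2"}),   # hit at rank 2
    (["fc1", "fc3", "fc4"], {"fc8"}),   # miss
]
score = sum(success_at_k(r, rel, k=10) for r, rel in queries) / len(queries)
print(score)  # -> 0.5
```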

[NLP-21] Debunking with Dialogue? Exploring AI-Generated Counterspeech to Challenge Conspiracy Theories

[Quick Read]: This paper addresses the lack of effective methods for generating counterspeech against conspiracy theories on online platforms. Existing research focuses mainly on hate speech, expert-driven efforts against conspiracy theories do not scale, and no dataset currently pairs conspiracy-theory comments with expert-crafted counterspeech. To fill this gap, the paper evaluates how effectively the large language models (LLMs) GPT-4o, Llama 3, and Mistral apply counterspeech strategies derived from psychological research, supplied through structured prompts. The key is using structured prompts to steer LLMs toward high-quality counterspeech, but the study finds the models often produce generic, repetitive, or superficial results, over-acknowledge fear, and frequently hallucinate facts, sources, or figures, which limits their usefulness in practical applications.

Link: https://arxiv.org/abs/2504.16604
Authors: Mareike Lisker, Christina Gottschalk, Helena Mihaljević
Affiliations: University of Applied Sciences (HTW) Berlin, Germany
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Social and Information Networks (cs.SI)
Comments: 15 pages


Abstract:Counterspeech is a key strategy against harmful online content, but scaling expert-driven efforts is challenging. Large Language Models (LLMs) present a potential solution, though their use in countering conspiracy theories is under-researched. Unlike for hate speech, no datasets exist that pair conspiracy theory comments with expert-crafted counterspeech. We address this gap by evaluating the ability of GPT-4o, Llama 3, and Mistral to effectively apply counterspeech strategies derived from psychological research provided through structured prompts. Our results show that the models often generate generic, repetitive, or superficial results. Additionally, they over-acknowledge fear and frequently hallucinate facts, sources, or figures, making their prompt-based use in practical applications problematic.

[NLP-22] Comparing Large Language Models and Traditional Machine Translation Tools for Translating Medical Consultation Summaries: A Pilot Study

[Quick Read]: This paper evaluates how well large language models (LLMs) and traditional machine translation (MT) tools translate medical consultation summaries from English into Arabic, Chinese, and Vietnamese, covering both patient-friendly and clinician-focused texts and using standard automated metrics. It finds that traditional MT tools generally perform better, especially on complex texts, while LLMs show promise for simpler summaries in Vietnamese and Chinese; Arabic translations improve with complexity, owing to the language's morphology. Although LLMs offer contextual flexibility, their performance remains inconsistent, and current evaluation metrics fail to capture clinical relevance. The paper therefore argues that the key to improving LLM performance in medical translation lies in domain-specific training, improved evaluation methods, and human oversight to ensure quality.

Link: https://arxiv.org/abs/2504.16601
Authors: Andy Li, Wei Zhou, Rashina Hoda, Chris Bain, Peter Poon
Affiliations: Faculty of Information Technology, Monash University, Clayton, VIC 3800, Australia; Faculty of Medicine, Nursing and Health Sciences, Monash University, Clayton, VIC 3800, Australia; Supportive and Palliative Care Unit, Monash Health, Clayton, VIC 3168, Australia
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 8 pages, 2 tables, and 1 figure


Abstract:This study evaluates how well large language models (LLMs) and traditional machine translation (MT) tools translate medical consultation summaries from English into Arabic, Chinese, and Vietnamese. It assesses both patient-friendly and clinician-focused texts using standard automated metrics. Results showed that traditional MT tools generally performed better, especially for complex texts, while LLMs showed promise, particularly in Vietnamese and Chinese, when translating simpler summaries. Arabic translations improved with complexity due to the language's morphology. Overall, while LLMs offer contextual flexibility, they remain inconsistent, and current evaluation metrics fail to capture clinical relevance. The study highlights the need for domain-specific training, improved evaluation methods, and human oversight in medical translation.

[NLP-23] PIS: Linking Importance Sampling and Attention Mechanisms for Efficient Prompt Compression

[Quick Read]: This paper addresses the limited adoption of large language models (LLMs) in practice due to their high costs, and proposes Prompt Importance Sampling (PIS), a novel framework for efficient, systematic prompt compression. Existing methods rely mainly on heuristic truncation or abstractive summarization, which overlook the intrinsic mechanisms of LLMs and lack a systematic assessment of token importance for generation. The key is PIS's dual-level compression mechanism: 1) at the token level, it quantifies saliency with the LLM's native attention scores and performs adaptive compression via a lightweight 9-layer reinforcement learning (RL) network; 2) at the semantic level, it applies a Russian-roulette sampling strategy for sentence-level importance sampling. Evaluations on multiple domain benchmarks show state-of-the-art compression performance, and the framework serendipitously improves reasoning efficiency through optimized context structuring, advancing prompt engineering with both theoretical grounding and practical efficiency in context management for LLMs.

Link: https://arxiv.org/abs/2504.16574
Authors: Lizhe Chen, Binjia Zhou, Yuyao Ge, Jiayi Chen, Shiguang Ni
Affiliations: Shenzhen International Graduate School, Tsinghua University; Zhejiang University; CAS Key Laboratory of AI Security, Institute of Computing Technology, Chinese Academy of Sciences; Fudan University
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:


Abstract:Large language models (LLMs) have achieved remarkable progress, demonstrating unprecedented capabilities across various natural language processing tasks. However, the high costs associated with such exceptional performance limit the widespread adoption of LLMs, highlighting the need for prompt compression. Existing prompt compression methods primarily rely on heuristic truncation or abstractive summarization techniques, which fundamentally overlook the intrinsic mechanisms of LLMs and lack a systematic evaluation of token importance for generation. In this work, we introduce Prompt Importance Sampling (PIS), a novel compression framework that dynamically compresses prompts by sampling important tokens based on the analysis of attention scores of hidden states. PIS employs a dual-level compression mechanism: 1) at the token level, we quantify saliency using LLM-native attention scores and implement adaptive compression through a lightweight 9-layer reinforcement learning (RL) network; 2) at the semantic level, we propose a Russian roulette sampling strategy for sentence-level importance sampling. Comprehensive evaluations across multiple domain benchmarks demonstrate that our method achieves state-of-the-art compression performance. Notably, our framework serendipitously enhances reasoning efficiency through optimized context structuring. This work advances prompt engineering by offering both theoretical grounding and practical efficiency in context management for LLMs.
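The token-level half of PIS, sampling tokens with probability proportional to attention-derived saliency, can be sketched as below; real PIS uses LLM-native attention scores and an RL network, which are replaced here by random saliencies.

```python
import numpy as np

rng = np.random.default_rng(0)
tokens = "Please summarize the quarterly sales report for the EMEA region".split()
saliency = rng.random(len(tokens))           # stand-in for attention scores

def compress(tokens, saliency, keep_ratio=0.5):
    """Sample tokens without replacement, importance-weighted by saliency."""
    k = max(1, int(len(tokens) * keep_ratio))
    probs = saliency / saliency.sum()
    idx = rng.choice(len(tokens), size=k, replace=False, p=probs)
    return [tokens[i] for i in sorted(idx)]   # preserve original order

print(compress(tokens, saliency))
```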

[NLP-24] Transformers for Complex Query Answering over Knowledge Hypergraphs

[Quick Read]: This paper addresses the limited representational power of knowledge graphs for complex query answering (CQA). Classic triple KGs, composed of entities and relations of arity 2, cannot fully express complex real-world facts, and although hyper-relational graphs introduce higher-order relations, they struggle to model relationships of varying arity whose entities contribute equally. To fill this gap, the paper samples new CQA datasets, JF17k-HCQA and M-FB15k-HCQA, containing query types with logical operations such as projection, negation, conjunction, and disjunction, and proposes a two-stage transformer, the Logical Knowledge Hypergraph Transformer (LKHGT), consisting of a Projection Encoder for atomic projection and a Logical Encoder for complex logical operations, both equipped with Type Aware Bias (TAB) to capture token interactions. The key is that this architecture achieves state-of-the-art CQA performance over knowledge hypergraphs (KHGs) and generalizes to out-of-distribution query types.

Link: https://arxiv.org/abs/2504.16537
Authors: Hong Ting Tsang, Zihao Wang, Yangqiu Song
Affiliations: CSE, HKUST, HKSAR, China
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:


Abstract:Complex Query Answering (CQA) has been extensively studied in recent years. In order to model data that is closer to real-world distribution, knowledge graphs with different modalities have been introduced. Triple KGs, as the classic KGs composed of entities and relations of arity 2, have limited representation of real-world facts. Real-world data is more sophisticated. While hyper-relational graphs have been introduced, there are limitations in representing relationships of varying arity that contain entities with equal contributions. To address this gap, we sampled new CQA datasets: JF17k-HCQA and M-FB15k-HCQA. Each dataset contains various query types that include logical operations such as projection, negation, conjunction, and disjunction. In order to answer knowledge hypergraph (KHG) existential first-order queries, we propose a two-stage transformer model, the Logical Knowledge Hypergraph Transformer (LKHGT), which consists of a Projection Encoder for atomic projection and a Logical Encoder for complex logical operations. Both encoders are equipped with Type Aware Bias (TAB) for capturing token interactions. Experimental results on CQA datasets show that LKHGT is a state-of-the-art CQA method over KHG and is able to generalize to out-of-distribution query types.

[NLP-25] QuaDMix: Quality-Diversity Balanced Data Selection for Efficient LLM Pretraining

[Quick Read]: This paper addresses a shortcoming of existing training-data optimization for large language models (LLMs): quality and diversity are usually optimized separately, ignoring the inherent trade-off between them. The key is QuaDMix, a unified data-selection framework that measures both the quality of each data point (via multiple quality criteria) and overall diversity (via domain classification), and then applies a unified parameterized sampling function that sets each data point's sampling probability from these quality- and diversity-related labels. This automatically optimizes the data distribution for LLM pretraining under a fixed training budget while balancing quality and diversity. To accelerate the search for QuaDMix's parameters, simulated experiments are run on smaller models with LightGBM-based parameter search, inspired by the RegMix method. Experiments across diverse models and datasets show an average 7.2% improvement over multiple benchmarks, outperforming strategies that optimize quality and diversity independently and confirming the necessity and effectiveness of balancing the two.

Link: https://arxiv.org/abs/2504.16511
Authors: Fengze Liu, Weidong Zhou, Binbin Liu, Zhimiao Yu, Yifan Zhang, Haobin Lin, Yifeng Yu, Xiaohuan Zhou, Taifeng Wang, Yong Cao
Affiliations: ByteDance
Subjects: Computation and Language (cs.CL)
Comments:


Abstract:Quality and diversity are two critical metrics for the training data of large language models (LLMs), positively impacting performance. Existing studies often optimize these metrics separately, typically by first applying quality filtering and then adjusting data proportions. However, these approaches overlook the inherent trade-off between quality and diversity, necessitating their joint consideration. Given a fixed training quota, it is essential to evaluate both the quality of each data point and its complementary effect on the overall dataset. In this paper, we introduce a unified data selection framework called QuaDMix, which automatically optimizes the data distribution for LLM pretraining while balancing both quality and diversity. Specifically, we first propose multiple criteria to measure data quality and employ domain classification to distinguish data points, thereby measuring overall diversity. QuaDMix then employs a unified parameterized data sampling function that determines the sampling probability of each data point based on these quality and diversity related labels. To accelerate the search for the optimal parameters involved in the QuaDMix framework, we conduct simulated experiments on smaller models and use LightGBM for parameters searching, inspired by the RegMix method. Our experiments across diverse models and datasets demonstrate that QuaDMix achieves an average performance improvement of 7.2% across multiple benchmarks. These results outperform the independent strategies for quality and diversity, highlighting the necessity and ability to balance data quality and diversity.
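A toy parameterized sampling function trading off quality against domain diversity might look like the following; the sigmoid form and the weights are assumptions, not the paper's actual parameterization.

```python
import math

def sampling_prob(quality, domain_freq, w_q=2.0, w_d=1.5, bias=-1.0):
    """Higher quality raises the probability; over-represented domains lower it."""
    rarity = 1.0 - domain_freq        # favor under-sampled domains
    return 1.0 / (1.0 + math.exp(-(w_q * quality + w_d * rarity + bias)))

# (domain label, quality score in [0,1], fraction of corpus from that domain)
examples = [("web", 0.9, 0.6), ("code", 0.7, 0.1), ("web_low", 0.3, 0.6)]
for name, q, freq in examples:
    print(name, round(sampling_prob(q, freq), 3))
```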

[NLP-26] T-VEC: A Telecom-Specific Vectorization Model with Enhanced Semantic Understanding via Deep Triplet Loss Fine-Tuning

[Quick Read]: This paper addresses the failure of standard NLP models to capture domain-specific semantics in telecommunications, where specialized vocabulary and complex concepts seriously hurt downstream task performance. The key is T-VEC (Telecom Vectorization Model), an embedding model deeply fine-tuned for the telecom domain: the state-of-the-art gte-Qwen2-1.5B-instruct model is adapted with a triplet-loss objective on a meticulously curated, large-scale telecom dataset. Crucially, this process substantially modifies weights across 338 layers of the base model, ensuring deep integration of domain knowledge far beyond superficial adaptation, as quantified by weight-difference analysis. The paper also develops and open-sources the first dedicated telecom-specific tokenizer, improving the handling of industry jargon. T-VEC achieves a leading average MTEB score (0.825) and vastly superior performance (0.9380 vs. less than 0.07) on an internal telecom-specific triplet benchmark, positioning NetoAI at the forefront of telecom AI innovation.

Link: https://arxiv.org/abs/2504.16460
Authors: Vignesh Ethiraj, Sidhanth Menon, Divya Vijay
Affiliations: NetoAI (https://www.netoai.ai)
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Introduces T-VEC, a telecom-specific text embedding model. Fine-tuned gte-Qwen2-1.5B-instruct on curated telecom data points. Includes the first open-source telecom tokenizer. Model available at this https URL


Abstract:The specialized vocabulary and complex concepts of the telecommunications industry present significant challenges for standard Natural Language Processing models. Generic text embeddings often fail to capture telecom-specific semantics, hindering downstream task performance. We introduce T-VEC (Telecom Vectorization Model), a novel embedding model tailored for the telecom domain through deep fine-tuning. Developed by NetoAI, T-VEC is created by adapting the state-of-the-art gte-Qwen2-1.5B-instruct model using a triplet loss objective on a meticulously curated, large-scale dataset of telecom-specific data. Crucially, this process involved substantial modification of weights across 338 layers of the base model, ensuring deep integration of domain knowledge, far exceeding superficial adaptation techniques. We quantify this deep change via weight difference analysis. A key contribution is the development and open-sourcing (MIT License) of the first dedicated telecom-specific tokenizer, enhancing the handling of industry jargon. T-VEC achieves a leading average MTEB score (0.825) compared to established models and demonstrates vastly superior performance (0.9380 vs. less than 0.07) on our internal telecom-specific triplet evaluation benchmark, indicating an exceptional grasp of domain-specific nuances, visually confirmed by improved embedding separation. This work positions NetoAI at the forefront of telecom AI innovation, providing the community with a powerful, deeply adapted, open-source tool.
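The triplet-loss objective used for the deep fine-tuning can be sketched in PyTorch; the margin, cosine-distance formulation, and toy embeddings below are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Pull the anchor toward the positive and away from the negative
    in cosine distance, as in standard triplet fine-tuning."""
    d_pos = 1 - F.cosine_similarity(anchor, positive)
    d_neg = 1 - F.cosine_similarity(anchor, negative)
    return F.relu(d_pos - d_neg + margin).mean()

# Toy stand-ins for encoded telecom sentences (batch of 4):
anchor = torch.randn(4, 1536)    # e.g. "What does RSRP measure?"
positive = torch.randn(4, 1536)  # a matching definition of RSRP
negative = torch.randn(4, 1536)  # an unrelated billing FAQ
print(triplet_loss(anchor, positive, negative).item())
```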

[NLP-27] EMRModel: A Large Language Model for Extracting Medical Consultation Dialogues into Structured Medical Records

[Quick Read]: This paper addresses the difficulty of exploiting the unstructured information in medical consultation dialogues, in particular how to efficiently extract structured electronic medical records (EMRs) from them. Traditional rule-based or shallow machine-learning methods struggle to capture deep and implicit semantics, while large pre-trained language models combined with Low-Rank Adaptation (LoRA), a lightweight fine-tuning method, show promise. The key is EMRModel, a novel approach that integrates LoRA-based fine-tuning with code-style prompt design for efficient conversion of consultation dialogues into structured EMRs. The work also constructs a high-quality, realistically grounded dataset of annotated consultation dialogues, and introduces a fine-grained evaluation benchmark for medical consultation information extraction together with a systematic evaluation methodology, advancing the optimization of medical natural language processing (NLP) models. Experiments show EMRModel reaches an F1 score of 88.1%, a 49.5% improvement over standard pre-trained models, and outperforms conventional LoRA fine-tuning, confirming its effectiveness for structured medical record extraction.

Link: https://arxiv.org/abs/2504.16448
Authors: Shuguang Zhao, Qiangzhong Feng, Zhiyang He, Peipei Sun, Yingying Wang, Xiaodong Tao, Xiaoliang Lu, Mei Cheng, Xinyue Wu, Yanyan Wang, Wei Liang
Affiliations: iflytek (iFLYTEK); ustcinfo (USTC Information Network); usc (University of Southern California)
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:


Abstract:Medical consultation dialogues contain critical clinical information, yet their unstructured nature hinders effective utilization in diagnosis and treatment. Traditional methods, relying on rule-based or shallow machine learning techniques, struggle to capture deep and implicit semantics. Recently, large pre-trained language models and Low-Rank Adaptation (LoRA), a lightweight fine-tuning method, have shown promise for structured information extraction. We propose EMRModel, a novel approach that integrates LoRA-based fine-tuning with code-style prompt design, aiming to efficiently convert medical consultation dialogues into structured electronic medical records (EMRs). Additionally, we construct a high-quality, realistically grounded dataset of medical consultation dialogues with detailed annotations. Furthermore, we introduce a fine-grained evaluation benchmark for medical consultation information extraction and provide a systematic evaluation methodology, advancing the optimization of medical natural language processing (NLP) models. Experimental results show EMRModel achieves an F1 score of 88.1%, improving by 49.5% over standard pre-trained models. Compared to traditional LoRA fine-tuning methods, our model shows superior performance, highlighting its effectiveness in structured medical record extraction tasks.

[NLP-28] MAGIC: Near-Optimal Data Attribution for Deep Learning

[Quick Read]: This paper addresses predictive data attribution: estimating how adding or removing a given set of training data points will affect model predictions. In convex settings this goal is straightforward (e.g., via the infinitesimal jackknife), but in large-scale non-convex settings existing methods are far less successful, with estimates that often only weakly correlate with ground truth. The key is a new data attribution method (MAGIC) that combines classical techniques with recent advances in metadifferentiation to (nearly) optimally estimate the effect of adding or removing training data on model predictions.

链接: https://arxiv.org/abs/2504.16430
作者: Andrew Ilyas,Logan Engstrom
机构: 未知
类目: Machine Learning (cs.LG); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (stat.ML)
备注:

Abstract:The goal of predictive data attribution is to estimate how adding or removing a given set of training datapoints will affect model predictions. In convex settings, this goal is straightforward (i.e., via the infinitesimal jackknife). In large-scale (non-convex) settings, however, existing methods are far less successful – current methods’ estimates often only weakly correlate with ground truth. In this work, we present a new data attribution method (MAGIC) that combines classical methods and recent advances in metadifferentiation to (nearly) optimally estimate the effect of adding or removing training data on model predictions.
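The convex case the abstract refers to can be made concrete: for ridge regression, the infinitesimal-jackknife estimate of a removal's effect has a closed form. A self-contained NumPy sketch of that classical idea (not MAGIC itself; dimensions and regularization are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, lam = 200, 5, 1e-2
X = rng.normal(size=(n, d))
y = X @ rng.normal(size=d) + 0.1 * rng.normal(size=n)

# Ridge solution and Hessian of the regularized squared loss.
H = X.T @ X + lam * np.eye(d)
w = np.linalg.solve(H, X.T @ y)

def removal_effect(i, x_test):
    """First-order estimate of how f(x_test) changes if point i is removed."""
    r_i = X[i] @ w - y[i]                    # residual of training point i
    return (x_test @ np.linalg.solve(H, X[i])) * r_i

# Compare against exact retraining without point i.
i, x_test = 3, rng.normal(size=d)
mask = np.arange(n) != i
H_wo = X[mask].T @ X[mask] + lam * np.eye(d)
w_wo = np.linalg.solve(H_wo, X[mask].T @ y[mask])
print(removal_effect(i, x_test), x_test @ w_wo - x_test @ w)  # nearly equal
```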

[NLP-29] Can Large Language Models Help Multimodal Language Analysis? MMLA: A Comprehensive Benchmark

Quick Read: This paper targets the limited ability of multimodal large language models (MLLMs) to understand cognitive-level semantics. The key solution is MMLA, a comprehensive benchmark comprising over 61,000 multimodal utterances from both staged and real-world scenarios, covering six core dimensions of multimodal semantics: intent, emotion, dialogue act, sentiment, speaking style, and communication behavior. Mainstream LLMs and MLLMs are evaluated with three methods: zero-shot inference, supervised fine-tuning, and instruction tuning. Even fine-tuned models reach only about 60%-70% accuracy on complex human language understanding, exposing the limitations of current MLLMs in this area. The work aims to provide a solid foundation for exploring the potential of large language models in multimodal language analysis and valuable resources to advance the field. The datasets and code are open-sourced.

Link: https://arxiv.org/abs/2504.16427
Authors: Hanlei Zhang, Zhuohang Li, Yeshuang Zhu, Hua Xu, Peiwu Wang, Jinchao Zhang, Jie Zhou, Haige Zhu
Affiliations: Department of Computer Science and Technology, Tsinghua University; Pattern Recognition Center, WeChat AI, Tencent Inc, China; Kennesaw State University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Multimedia (cs.MM)
Comments: 23 pages, 5 figures

Abstract:Multimodal language analysis is a rapidly evolving field that leverages multiple modalities to enhance the understanding of high-level semantics underlying human conversational utterances. Despite its significance, little research has investigated the capability of multimodal large language models (MLLMs) to comprehend cognitive-level semantics. In this paper, we introduce MMLA, a comprehensive benchmark specifically designed to address this gap. MMLA comprises over 61K multimodal utterances drawn from both staged and real-world scenarios, covering six core dimensions of multimodal semantics: intent, emotion, dialogue act, sentiment, speaking style, and communication behavior. We evaluate eight mainstream branches of LLMs and MLLMs using three methods: zero-shot inference, supervised fine-tuning, and instruction tuning. Extensive experiments reveal that even fine-tuned models achieve only about 60%~70% accuracy, underscoring the limitations of current MLLMs in understanding complex human language. We believe that MMLA will serve as a solid foundation for exploring the potential of large language models in multimodal language analysis and provide valuable resources to advance this field. The datasets and code are open-sourced at this https URL.

[NLP-30] Evaluating Multi-Hop Reasoning in Large Language Models: A Chemistry-Centric Case Study

Quick Read: This paper evaluates the multi-hop compositional reasoning ability of large language models (LLMs) in the chemistry domain. To this end, the authors design and validate a fully automated pipeline that combines OpenAI reasoning models with a named entity recognition (NER) system to extract chemical entities from recent literature, and augments them with external knowledge bases to build a comprehensive knowledge graph. The key step is generating multi-hop questions over these graphs and assessing LLM performance in both context-augmented and non-context-augmented settings. The study finds that even state-of-the-art models face significant challenges in multi-hop compositional reasoning, underscoring the importance of document retrieval: perfect retrieval accuracy improves performance substantially but does not eliminate reasoning errors, highlighting the complexity of compositional reasoning. Beyond benchmarking the limitations of current LLMs, the work contributes a novel data-generation pipeline that can produce challenging reasoning datasets across domains.

Link: https://arxiv.org/abs/2504.16414
Authors: Mohammad Khodadad, Ali Shiraee Kasmaee, Mahdi Astaraki, Nicholas Sherck, Hamidreza Mahyar, Soheila Samiee
Affiliations: Department of Computational Science and Engineering, McMaster University, Canada; BASF Canada Inc, Canada; BASF Corporation, USA
Categories: Computation and Language (cs.CL)
Comments:

Abstract:In this study, we introduced a new benchmark consisting of a curated dataset and a defined evaluation process to assess the compositional reasoning capabilities of large language models within the chemistry domain. We designed and validated a fully automated pipeline, verified by subject matter experts, to facilitate this task. Our approach integrates OpenAI reasoning models with named entity recognition (NER) systems to extract chemical entities from recent literature, which are then augmented with external knowledge bases to form a comprehensive knowledge graph. By generating multi-hop questions across these graphs, we assess LLM performance in both context-augmented and non-context augmented settings. Our experiments reveal that even state-of-the-art models face significant challenges in multi-hop compositional reasoning. The results reflect the importance of augmenting LLMs with document retrieval, which can have a substantial impact on improving their performance. However, even perfect retrieval accuracy with full context does not eliminate reasoning errors, underscoring the complexity of compositional reasoning. This work not only benchmarks and highlights the limitations of current LLMs but also presents a novel data generation pipeline capable of producing challenging reasoning datasets across various domains. Overall, this research advances our understanding of reasoning in computational linguistics.
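As a toy illustration of generating multi-hop questions from a knowledge graph (the entities and relations below are invented; the paper's pipeline instead uses NER over real literature plus external knowledge bases):

```python
import networkx as nx

G = nx.DiGraph()
G.add_edge("aspirin", "salicylic acid", relation="is derived from")
G.add_edge("salicylic acid", "COX enzymes", relation="inhibits")
G.add_edge("COX enzymes", "prostaglandins", relation="synthesize")

def two_hop_questions(graph):
    # Chain two edges (a -rel1-> b, b -rel2-> c) into one 2-hop question.
    for a, b, d1 in graph.edges(data=True):
        for _, c, d2 in graph.edges(b, data=True):
            question = (f"{a} {d1['relation']} X, and X {d2['relation']} Y. "
                        f"What is Y?")
            yield question, c

for q, answer in two_hop_questions(G):
    print(q, "->", answer)
# e.g. "aspirin is derived from X, and X inhibits Y. What is Y?" -> COX enzymes
```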

[NLP-31] Out-of-the-Box Conditional Text Embeddings from Large Language Models

Quick Read: This paper addresses the high labor and resource costs of conditional text embedding in practice, which stem from fine-tuning on large amounts of labeled data. The key solution is PonTE, a novel unsupervised method that generates conditional text embeddings by leveraging a causal large language model and a conditional prompt. Experiments on conditional semantic textual similarity and text clustering show that PonTE produces useful conditional embeddings and matches the performance of supervised methods without any fine-tuning. The paper also demonstrates the interpretability of PonTE embeddings by analyzing prompt-conditioned word generation and by visualizing the embeddings.

Link: https://arxiv.org/abs/2504.16411
Authors: Kosuke Yamada, Peinan Zhang
Affiliations: Cyberagent Inc., Japan
Categories: Computation and Language (cs.CL)
Comments: work in progress

Abstract:Conditional text embedding is a proposed representation that captures the shift in perspective on texts when conditioned on a specific aspect. Previous methods have relied on extensive training data for fine-tuning models, leading to challenges in terms of labor and resource costs. We propose PonTE, a novel unsupervised conditional text embedding method that leverages a causal large language model and a conditional prompt. Through experiments on conditional semantic text similarity and text clustering, we demonstrate that PonTE can generate useful conditional text embeddings and achieve performance comparable to supervised methods without fine-tuning. We also show the interpretability of text embeddings with PonTE by analyzing word generation following prompts and embedding visualization.
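The general recipe, prompting a causal LM with a condition and reading off a hidden state as the embedding, can be sketched as follows; the model choice, prompt wording, and use of the final token's last hidden state are assumptions rather than PonTE's exact design:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "gpt2"  # small stand-in for a larger causal LLM
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name, output_hidden_states=True)

def conditional_embed(text, condition):
    prompt = f'Text: "{text}"\nIn one word, the {condition} of this text is:'
    inputs = tok(prompt, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs)
    # Use the final layer's hidden state at the last token as the embedding.
    return out.hidden_states[-1][0, -1]

sent = "The cheap hotel had a stunning sea view."
e1 = conditional_embed(sent, "sentiment")
e2 = conditional_embed(sent, "price")
# Same text, different conditions -> different embeddings.
print(torch.cosine_similarity(e1, e2, dim=0))
```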

[NLP-32] Less is More: Enhancing Structured Multi-Agent Reasoning via Quality-Guided Distillation

Quick Read: This paper addresses structured reasoning under low-resource conditions: making large language models (LLMs) generate interpretable, step-by-step rationales from only a handful of labeled examples. The key idea, "Less is More", is a multi-agent framework that combines reverse-prompt induction, GPT-4o-augmented retrieval-and-synthesis reasoning, and dual-stage reward-guided filtering to distill high-quality supervision from just 24 labeled examples across three subtasks: question parsing, chain-of-thought parsing, and step-level verification. All modules are fine-tuned from Meta-Llama-3-8B-Instruct under a unified LoRA+ setup. The core innovation is the combination of structure validation and reward filtering, which consistently improves structured reasoning quality in few-shot and zero-shot settings.

Link: https://arxiv.org/abs/2504.16408
Authors: Jiahao Yuan, Xingzhe Sun, Xing Yu, Jingwen Wang, Dehui Du, Zhiqing Cui, Zixiang Di
Affiliations: East China Normal University; University of Reading
Categories: Computation and Language (cs.CL)
Comments:

Abstract:The XLLM@ACL2025 Shared Task-III formulates a low-resource structural reasoning task that challenges LLMs to generate interpretable, step-by-step rationales with minimal labeled data. We present Less is More, the third-place winning approach in the XLLM@ACL2025 Shared Task-III, which focuses on structured reasoning from only 24 labeled examples. Our approach leverages a multi-agent framework with reverse-prompt induction, retrieval-augmented reasoning synthesis via GPT-4o, and dual-stage reward-guided filtering to distill high-quality supervision across three subtasks: question parsing, CoT parsing, and step-level verification. All modules are fine-tuned from Meta-Llama-3-8B-Instruct under a unified LoRA+ setup. By combining structure validation with reward filtering across few-shot and zero-shot prompts, our pipeline consistently improves structure reasoning quality. These results underscore the value of controllable data distillation in enhancing structured inference under low-resource constraints. Our code is available at this https URL.

[NLP-33] ConTextual: Improving Clinical Text Summarization in LLMs with Context-preserving Token Filtering and Knowledge Graphs

Quick Read: This paper aims to extract the most relevant context from unstructured clinical data in order to support optimal, timely decision-making in patient care. Conventional approaches either treat all input tokens uniformly or rely on heuristic filters, which can miss nuanced clinical cues and fail to prioritize decision-critical information. The key solution is ConTextual, a novel framework that combines a context-preserving token filtering method with a domain-specific knowledge graph (KG) for contextual augmentation. By retaining context-specific important tokens and enriching them with structured knowledge, ConTextual improves linguistic coherence while preserving clinical fidelity. Experiments on two public benchmark datasets show that it consistently outperforms baseline methods.

Link: https://arxiv.org/abs/2504.16394
Authors: Fahmida Liza Piya, Rahmatollah Beheshti
Affiliations: Not listed
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:Unstructured clinical data can serve as a unique and rich source of information that can meaningfully inform clinical practice. Extracting the most pertinent context from such data is critical for exploiting its true potential toward optimal and timely decision-making in patient care. While prior research has explored various methods for clinical text summarization, most prior studies either process all input tokens uniformly or rely on heuristic-based filters, which can overlook nuanced clinical cues and fail to prioritize information critical for decision-making. In this study, we propose ConTextual, a novel framework that integrates a Context-Preserving Token Filtering method with a Domain-Specific Knowledge Graph (KG) for contextual augmentation. By preserving context-specific important tokens and enriching them with structured knowledge, ConTextual improves both linguistic coherence and clinical fidelity. Our extensive empirical evaluations on two public benchmark datasets demonstrate that ConTextual consistently outperforms other baselines. Our proposed approach highlights the complementary role of token-level filtering and structured retrieval in enhancing both linguistic and clinical integrity, as well as offering a scalable solution for improving precision in clinical text generation.

[NLP-34] SplitReason: Learning To Offload Reasoning

Quick Read: This paper targets the inefficiency and compute cost of reasoning in large language models (LLMs), which generate much longer token sequences than simpler language-modeling tasks; conventionally, the entire process runs on a single large model, which is expensive. The proposed divide-and-conquer approach offloads only the most difficult parts of the reasoning process to a larger, more capable model, while a smaller, more efficient model performs most of the generation; the smaller model is further trained to identify these difficult segments and trigger offloading on its own. The key step is annotating difficult segments across 18k reasoning traces from the OpenR1-Math-220k chain-of-thought (CoT) dataset, then applying supervised fine-tuning (SFT) and reinforcement learning fine-tuning (RLFT) so the smaller model learns to hand off the hardest parts of its own reasoning to the larger model. This improves AIME24 reasoning accuracy by 24% and 28.3% while offloading only 1.35% and 5% of the generated tokens, respectively. The authors open-source the SplitReason model along with the data, code, and logs.

Link: https://arxiv.org/abs/2504.16379
Authors: Yash Akhauri, Anthony Fei, Chi-Chih Chang, Ahmed F. AbouElhamayed, Yueying Li, Mohamed S. Abdelfattah
Affiliations: Cornell University
Categories: Computation and Language (cs.CL)
Comments:

Abstract:Reasoning in large language models (LLMs) tends to produce substantially longer token generation sequences than simpler language modeling tasks. This extended generation length reflects the multi-step, compositional nature of reasoning and is often correlated with higher solution accuracy. From an efficiency perspective, longer token generation exacerbates the inherently sequential and memory-bound decoding phase of LLMs. However, not all parts of this expensive reasoning process are equally difficult to generate. We leverage this observation by offloading only the most challenging parts of the reasoning process to a larger, more capable model, while performing most of the generation with a smaller, more efficient model; furthermore, we teach the smaller model to identify these difficult segments and independently trigger offloading when needed. To enable this behavior, we annotate difficult segments across 18k reasoning traces from the OpenR1-Math-220k chain-of-thought (CoT) dataset. We then apply supervised fine-tuning (SFT) and reinforcement learning fine-tuning (RLFT) to a 1.5B-parameter reasoning model, training it to learn to offload the most challenging parts of its own reasoning process to a larger model. This approach improves AIME24 reasoning accuracy by 24% and 28.3% while offloading 1.35% and 5% of the generated tokens respectively. We open-source our SplitReason model, data, code and logs.
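Schematically, decode-time offloading can be pictured as below; both models are stubbed out and the special-token protocol is an assumption, so this conveys only the control flow, not the paper's implementation:

```python
# Hypothetical handoff protocol: the small model emits <offload> when it
# judges the next span too hard, and the large model fills in that span.
OFFLOAD, RESUME = "<offload>", "</offload>"

def small_model_step(context):
    # Stub: a real system samples one token from the small reasoning model.
    return OFFLOAD if "integral" in context and RESUME not in context else "..."

def large_model_complete(context):
    # Stub: the larger model returns the hard segment plus a resume marker.
    return " [hard segment solved by large model] " + RESUME

def generate(prompt, max_steps=6):
    context = prompt
    for _ in range(max_steps):
        token = small_model_step(context)
        if token == OFFLOAD:
            context += large_model_complete(context)  # hand off the hard span
        else:
            context += token  # cheap generation stays on the small model
    return context

print(generate("Evaluate the integral of x^2: "))
```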

[NLP-35] Text-to-TrajVis: Enabling Trajectory Data Visualizations from Natural Language Questions

Quick Read: This paper introduces the Text-to-TrajVis task: generating trajectory data visualizations from natural language questions. As this is a new task, no relevant dataset exists in the community. To fill this gap, the key solution first designs a new visualization language, Trajectory Visualization Language (TVL), to facilitate querying trajectory data and generating visualizations, and then proposes a dataset-construction method that combines large language models (LLMs) with human effort to create high-quality data. Concretely, TVLs are first generated through a comprehensive, systematic process, and LLMs then label each TVL with a corresponding natural language question, yielding TrajVL, the first large-scale Text-to-TrajVis dataset with 18,140 (question, TVL) pairs. Experiments with multiple LLMs show the task is both feasible and highly challenging, and merits further exploration by the research community.

Link: https://arxiv.org/abs/2504.16358
Authors: Tian Bai, Huiyan Ying, Kailong Suo, Junqiu Wei, Tao Fan, Yuanfeng Song
Affiliations: Jilin University; Macau University of Science and Technology; WeBank AI
Categories: Computation and Language (cs.CL)
Comments:

Abstract:This paper introduces the Text-to-TrajVis task, which aims to transform natural language questions into trajectory data visualizations, facilitating the development of natural language interfaces for trajectory visualization systems. As this is a novel task, there is currently no relevant dataset available in the community. To address this gap, we first devised a new visualization language called Trajectory Visualization Language (TVL) to facilitate querying trajectory data and generating visualizations. Building on this foundation, we further proposed a dataset construction method that integrates Large Language Models (LLMs) with human efforts to create high-quality data. Specifically, we first generate TVLs using a comprehensive and systematic process, and then label each TVL with corresponding natural language questions using LLMs. This process results in the creation of the first large-scale Text-to-TrajVis dataset, named TrajVL, which contains 18,140 (question, TVL) pairs. Based on this dataset, we systematically evaluated the performance of multiple LLMs (GPT, Qwen, Llama, etc.) on this task. The experimental results demonstrate that this task is both feasible and highly challenging and merits further exploration within the research community.

[NLP-36] Transformer-Based Extraction of Statutory Definitions from the U.S. Code

Quick Read: This paper aims to automatically extract defined terms, their definitions, and their scope from the United States Code (U.S.C.), a complex legal corpus, which matters for improving the comprehensibility and clarity of legal information. The key solution is a transformer-based NLP system that uses a domain-specific model (Legal-BERT) fine-tuned for statutory texts within a multi-stage pipeline to achieve high-precision definition extraction. Each paragraph is first classified by the fine-tuned legal-domain BERT model to determine whether it contains a definition; related paragraphs are then aggregated, and a combination of attention mechanisms and rule-based patterns extracts the defined terms and their jurisdictional scope. This approach substantially improves extraction accuracy, reaching 96.8% precision, 98.9% recall, and a 98.2% F1 score, far surpassing traditional machine-learning classifiers.

Link: https://arxiv.org/abs/2504.16353
Authors: Arpana Hosabettu (Google), Harsh Shah (Cornell University)
Affiliations: Not listed
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 7 pages, to be published in IEEE AIIoT 2025

Abstract:Automatic extraction of definitions from legal texts is critical for enhancing the comprehension and clarity of complex legal corpora such as the United States Code (U.S.C.). We present an advanced NLP system leveraging transformer-based architectures to automatically extract defined terms, their definitions, and their scope from the U.S.C. We address the challenges of automatically identifying legal definitions, extracting defined terms, and determining their scope within this complex corpus of over 200,000 pages of federal statutory law. Building upon previous feature-based machine learning methods, our updated model employs domain-specific transformers (Legal-BERT) fine-tuned specifically for statutory texts, significantly improving extraction accuracy. Our work implements a multi-stage pipeline that combines document structure analysis with state-of-the-art language models to process legal text from the XML version of the U.S. Code. Each paragraph is first classified using a fine-tuned legal domain BERT model to determine if it contains a definition. Our system then aggregates related paragraphs into coherent definitional units and applies a combination of attention mechanisms and rule-based patterns to extract defined terms and their jurisdictional scope. The definition extraction system is evaluated on multiple titles of the U.S. Code containing thousands of definitions, demonstrating significant improvements over previous approaches. Our best model achieves 96.8% precision and 98.9% recall (98.2% F1-score), substantially outperforming traditional machine learning classifiers. This work contributes to improving accessibility and understanding of legal information while establishing a foundation for downstream legal reasoning tasks.
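The first pipeline stage, deciding whether a paragraph contains a definition, follows a standard text-classification pattern; a sketch using the public Legal-BERT base checkpoint (which still needs a fine-tuned classification head, so the predicted labels here are placeholders):

```python
from transformers import pipeline

# The public base checkpoint; the authors' fine-tuned head is not released,
# so out of the box this head is randomly initialized.
clf = pipeline("text-classification", model="nlpaueb/legal-bert-base-uncased")

paragraph = (
    'The term "State" means each of the several States, '
    "the District of Columbia, and the Commonwealth of Puerto Rico."
)
# With a head fine-tuned on annotated U.S.C. paragraphs, the output would be
# a DEFINITION / NOT_DEFINITION label with a confidence score.
print(clf(paragraph))
```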

[NLP-37] SignX: The Foundation Model for Sign Recognition

Quick Read: This paper addresses the complex data-processing challenges of American Sign Language (ASL) recognition, in particular translating RGB videos via pose information into unique English-based ID glosses. Since there is no shared convention for assigning glosses to ASL signs, consistent glossing conventions across datasets are essential. The key solution is SignX, a foundation model framework for sign recognition that is concise yet powerful and applicable to multiple human activity recognition scenarios, with two components. First, a Pose2Gloss module based on an inverse diffusion model contains a multi-track pose fusion layer that unifies five of the strongest pose information sources (SMPLer-X, DWPose, Mediapipe, PrimeDepth, and Sapiens Segmentation) into a single latent pose representation. Second, a ViT-based Video2Pose module converts raw video directly into signer pose representations. This two-stage training framework makes sign language recognition models compatible with existing pose formats and lays the foundation for the common pose estimation required for sign recognition. Experiments show that SignX recognizes signs from sign language video and produces predicted gloss representations with greater accuracy than prior work.

Link: https://arxiv.org/abs/2504.16315
Authors: Sen Fang, Chunyu Sui, Hongwei Yi, Carol Neidle, Dimitris N. Metaxas
Affiliations: Rutgers University; Columbia University; Max-Planck Institute; Boston University
Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Comments:

Abstract:The complexity of sign language data processing brings many challenges. The current approach to recognition of ASL signs aims to translate RGB sign language videos through pose information into English-based ID glosses, which serve to uniquely identify ASL signs. Note that there is no shared convention for assigning such glosses to ASL signs, so it is essential that the same glossing conventions are used for all of the data in the datasets that are employed. This paper proposes SignX, a foundation model framework for sign recognition. It is a concise yet powerful framework applicable to multiple human activity recognition scenarios. First, we developed a Pose2Gloss component based on an inverse diffusion model, which contains a multi-track pose fusion layer that unifies five of the most powerful pose information sources–SMPLer-X, DWPose, Mediapipe, PrimeDepth, and Sapiens Segmentation–into a single latent pose representation. Second, we trained a Video2Pose module based on ViT that can directly convert raw video into signer pose representation. Through this 2-stage training framework, we enable sign language recognition models to be compatible with existing pose formats, laying the foundation for the common pose estimation necessary for sign recognition. Experimental results show that SignX can recognize signs from sign language video, producing predicted gloss representations with greater accuracy than has been reported in prior work.

[NLP-38] Capturing Symmetry and Antisymmetry in Language Models through Symmetry-Aware Training Objectives

Quick Read: This paper addresses the shortcomings of large language models (LLMs) in understanding symmetric relations (e.g., country borders) and antisymmetric relations (e.g., parent-child). The authors build a Wikidata-derived natural language inference dataset to evaluate LLMs on such relations and find that they perform close to random chance, revealing a clear gap in relational understanding. The key solution is to retrain the encoder via contrastive learning with a k-nearest-neighbors approach, which matches the performance of fine-tuned classification heads while offering additional benefits, including greater efficiency in few-shot learning and better mitigation of catastrophic forgetting.

Link: https://arxiv.org/abs/2504.16312
Authors: Zhangdie Yuan, Andreas Vlachos
Affiliations: Department of Computer Science and Technology, University of Cambridge
Categories: Computation and Language (cs.CL)
Comments:

Abstract:Capturing symmetric (e.g., country borders another country) and antisymmetric (e.g., parent_of) relations is crucial for a variety of applications. This paper tackles this challenge by introducing a novel Wikidata-derived natural language inference dataset designed to evaluate large language models (LLMs). Our findings reveal that LLMs perform comparably to random chance on this benchmark, highlighting a gap in relational understanding. To address this, we explore encoder retraining via contrastive learning with k-nearest neighbors. The retrained encoder matches the performance of fine-tuned classification heads while offering additional benefits, including greater efficiency in few-shot learning and improved mitigation of catastrophic forgetting.
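The k-nearest-neighbors readout over encoder embeddings is a standard construction; a minimal scikit-learn sketch, with random vectors standing in for the retrained encoder's outputs:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
emb = rng.normal(size=(100, 64))          # embeddings of premise-hypothesis pairs
labels = rng.integers(0, 2, 100)          # e.g., entailment vs. non-entailment

# Classify a pair by majority vote among its nearest training embeddings.
knn = KNeighborsClassifier(n_neighbors=5, metric="cosine")
knn.fit(emb[:80], labels[:80])
print(knn.score(emb[80:], labels[80:]))   # ~0.5 on random data, as expected
```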

[NLP-39] The Paradox of Poetic Intent in Back-Translation: Evaluating the Quality of Large Language Models in Chinese Translation

Quick Read: This paper examines how to preserve poetic intent and cultural heritage and how to handle specialized terminology in Chinese-English translation. The key contribution is a diverse corpus covering Chinese scientific terminology, historical translation paradoxes, and literary metaphors, together with an evaluation system based on back-translation and the Friedman test (BT-Fried) that assesses six major large language models (e.g., GPT-4.5, DeepSeek V3) and three traditional translation tools. The method reveals the strengths and weaknesses of different translation strategies across text types, and the paper proposes a new BLEU variant to better measure semantic similarity.

Link: https://arxiv.org/abs/2504.16286
Authors: Li Weigang, Pedro Carvalho Brom
Affiliations: University of Brasilia; Federal Institute of Brasilia
Categories: Computation and Language (cs.CL)
Comments: 24 pages, 3 figures

Abstract:The rapid advancement of large language models (LLMs) has reshaped the landscape of machine translation, yet challenges persist in preserving poetic intent, cultural heritage, and handling specialized terminology in Chinese-English translation. This study constructs a diverse corpus encompassing Chinese scientific terminology, historical translation paradoxes, and literary metaphors. Utilizing a back-translation and Friedman test-based evaluation system (BT-Fried), we evaluate BLEU, CHRF, TER, and semantic similarity metrics across six major LLMs (e.g., GPT-4.5, DeepSeek V3) and three traditional translation tools. Key findings include: (1) Scientific abstracts often benefit from back-translation, while traditional tools outperform LLMs in linguistically distinct texts; (2) LLMs struggle with cultural and literary retention, exemplifying the “paradox of poetic intent”; (3) Some models exhibit “verbatim back-translation”, reflecting emergent memory behavior; (4) A novel BLEU variant using Jieba segmentation and n-gram weighting is proposed. The study contributes to the empirical evaluation of Chinese NLP performance and advances understanding of cultural fidelity in AI-mediated translation.
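A BLEU variant based on Jieba segmentation and custom n-gram weights can be approximated in a few lines; the specific weights below are illustrative, not the paper's:

```python
import jieba
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = "大型语言模型重塑了机器翻译的格局"
hypothesis = "大语言模型重塑了机器翻译格局"

# Segment Chinese into words first, rather than scoring raw characters.
ref_tokens = jieba.lcut(reference)
hyp_tokens = jieba.lcut(hypothesis)

weights = (0.4, 0.3, 0.2, 0.1)  # assumption: emphasize lower-order n-grams
score = sentence_bleu([ref_tokens], hyp_tokens, weights=weights,
                      smoothing_function=SmoothingFunction().method1)
print(f"Jieba-segmented BLEU: {score:.3f}")
```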

[NLP-40] The Language of Attachment: Modeling Attachment Dynamics in Psychotherapy

Quick Read: This paper explores automatically assessing patient attachment style with natural language processing (NLP), as an alternative to the current manual assessment using the Patient Attachment Coding System (PACS), which is complex, resource-intensive, and requires extensive training, limiting the broad adoption of attachment-informed therapy and research. The key contribution is an exploratory study of NLP classification models that identify patient attachment style from psychotherapy transcripts, together with an analysis of the results and of the implications of automated tools in this setting; for example, misclassifying 'preoccupied' patients as 'avoidant' likely has a more negative impact on therapy outcomes than other mislabelings. This opens an avenue toward more personalized psychotherapy and more targeted research into the mechanisms of psychotherapy.

Link: https://arxiv.org/abs/2504.16271
Authors: Frederik Bredgaard, Martin Lund Trinhammer, Elisa Bassignana
Affiliations: SODAS, University of Copenhagen, Denmark; IT University of Copenhagen, Denmark; Pioneer Center for Artificial Intelligence, Denmark
Categories: Computation and Language (cs.CL)
Comments:

Abstract:The delivery of mental healthcare through psychotherapy stands to benefit immensely from developments within Natural Language Processing (NLP), in particular through the automatic identification of patient specific qualities, such as attachment style. Currently, the assessment of attachment style is performed manually using the Patient Attachment Coding System (PACS; Talia et al., 2017), which is complex, resource-consuming and requires extensive training. To enable wide and scalable adoption of attachment informed treatment and research, we propose the first exploratory analysis into automatically assessing patient attachment style from psychotherapy transcripts using NLP classification models. We further analyze the results and discuss the implications of using automated tools for this purpose – e.g., confusing 'preoccupied' patients with 'avoidant' likely has a more negative impact on therapy outcomes with respect to other mislabeling. Our work opens an avenue of research enabling more personalized psychotherapy and more targeted research into the mechanisms of psychotherapy through advancements in NLP.

[NLP-41] CLIRudit: Cross-Lingual Information Retrieval of Scientific Documents

Quick Read: This paper addresses the challenges of cross-lingual information retrieval (CLIR) in academic search, specifically retrieving relevant French scholarly documents for English queries. The key contributions are a new dataset, CLIRudit, built from bilingual article metadata on Érudit, a Canadian publishing platform, and a comprehensive benchmark of zero-shot first-stage retrieval methods on it, including dense and sparse retrievers, machine translation of queries and documents, and state-of-the-art multilingual retrievers. The results show that large dense retrievers, even without training for cross-lingual retrieval, can achieve zero-shot performance comparable to using ground-truth human translations, without any machine translation; sparse retrievers such as BM25 or SPLADE combined with document translation are competitive, providing an efficient alternative to large dense models. The work thus demonstrates the feasibility of high-quality cross-lingual academic retrieval without additional machine translation.

Link: https://arxiv.org/abs/2504.16264
Authors: Francisco Valentini, Diego Kozlowski, Vincent Larivière
Affiliations: CONICET-UBA; École de bibliothéconomie et des sciences de l'information, Université de Montréal
Categories: Information Retrieval (cs.IR); Computation and Language (cs.CL)
Comments:

Abstract:Cross-lingual information retrieval (CLIR) consists in finding relevant documents in a language that differs from the language of the queries. This paper presents CLIRudit, a new dataset created to evaluate cross-lingual academic search, focusing on English queries and French documents. The dataset is built using bilingual article metadata from Érudit, a Canadian publishing platform, and is designed to represent scenarios in which researchers search for scholarly content in languages other than English. We perform a comprehensive benchmarking of different zero-shot first-stage retrieval methods on the dataset, including dense and sparse retrievers, query and document machine translation, and state-of-the-art multilingual retrievers. Our results show that large dense retrievers, not necessarily trained for the cross-lingual retrieval task, can achieve zero-shot performance comparable to using ground truth human translations, without the need for machine translation. Sparse retrievers, such as BM25 or SPLADE, combined with document translation, show competitive results, providing an efficient alternative to large dense models. This research advances the understanding of cross-lingual academic information retrieval and provides a framework that others can use to build comparable datasets across different languages and disciplines. By making the dataset and code publicly available, we aim to facilitate further research that will help make scientific knowledge more accessible across language barriers.

[NLP-42] Using Phonemes in cascaded S2S translation pipeline

Quick Read: This paper explores replacing the conventional text-based representation in a multilingual simultaneous speech-to-speech translation pipeline with phonemes. The key step is training an open-source sequence-to-sequence model on the WMT17 dataset in two formats, standard textual representation and phonemic representation, and comparing the two with the BLEU metric. The results show that the phonemic approach delivers comparable quality while offering several advantages, including lower resource requirements and better suitability for low-resource languages.

Link: https://arxiv.org/abs/2504.16234
Authors: Rene Pilz, Johannes Schneider
Affiliations: Not listed
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Comments: Accepted at Swiss NLP Conference 2025

Abstract:This paper explores the idea of using phonemes as a textual representation within a conventional multilingual simultaneous speech-to-speech translation pipeline, as opposed to the traditional reliance on text-based language representations. To investigate this, we trained an open-source sequence-to-sequence model on the WMT17 dataset in two formats: one using standard textual representation and the other employing phonemic representation. The performance of both approaches was assessed using the BLEU metric. Our findings show that the phonemic approach provides comparable quality but offers several advantages, including lower resource requirements and better suitability for low-resource languages.
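Converting text to a phonemic representation before training is straightforward with off-the-shelf tools; for instance, with the g2p_en package for English (one possible choice, not necessarily the authors' tool):

```python
from g2p_en import G2p

g2p = G2p()
sentence = "Machine translation with phonemes."
phonemes = g2p(sentence)
print(phonemes)
# e.g. ['M', 'AH0', 'SH', 'IY1', 'N', ' ', 'T', 'R', 'AE0', 'N', ...]
# The phoneme sequence would replace the word-level text on both the source
# and target sides before training the sequence-to-sequence model.
```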

[NLP-43] Reflexive Prompt Engineering: A Framework for Responsible Prompt Engineering and Interaction Design

Quick Read: This paper addresses the ethical, legal, and societal risks that arise when generative AI is deployed with poorly designed prompts. It argues for embedding moral, legal, and societal values directly into AI interactions through strategic prompt engineering, going beyond mere technical optimization for functionality to promote fairness, accountability, and transparency. The key solution is a comprehensive framework for responsible prompt engineering with five interconnected components: prompt design, system selection, system configuration, performance evaluation, and prompt management. Combining technical precision with ethical awareness and drawing on empirical evidence, the framework aims to balance technical functionality against social impact while mitigating potential risks. Its core idea is to integrate ethical considerations into AI development and deployment rather than treating them as afterthoughts, realizing the principle of "Responsibility by Design".

Link: https://arxiv.org/abs/2504.16204
Authors: Christian Djeffal
Affiliations: Not listed
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Emerging Technologies (cs.ET)
Comments: 20 pages, one figure

Abstract:Responsible prompt engineering has emerged as a critical framework for ensuring that generative artificial intelligence (AI) systems serve society’s needs while minimizing potential harms. As generative AI applications become increasingly powerful and ubiquitous, the way we instruct and interact with them through prompts has profound implications for fairness, accountability, and transparency. This article examines how strategic prompt engineering can embed ethical and legal considerations and societal values directly into AI interactions, moving beyond mere technical optimization for functionality. This article proposes a comprehensive framework for responsible prompt engineering that encompasses five interconnected components: prompt design, system selection, system configuration, performance evaluation, and prompt management. Drawing from empirical evidence, the paper demonstrates how each component can be leveraged to promote improved societal outcomes while mitigating potential risks. The analysis reveals that effective prompt engineering requires a delicate balance between technical precision and ethical consciousness, combining the systematic rigor and focus on functionality with the nuanced understanding of social impact. Through examination of real-world and emerging practices, the article illustrates how responsible prompt engineering serves as a crucial bridge between AI development and deployment, enabling organizations to fine-tune AI outputs without modifying underlying model architectures. This approach aligns with broader “Responsibility by Design” principles, embedding ethical considerations directly into the implementation process rather than treating them as post-hoc additions. The article concludes by identifying key research directions and practical guidelines for advancing the field of responsible prompt engineering.

[NLP-44] FinNLI: Novel Dataset for Multi-Genre Financial Natural Language Inference Benchmarking

Quick Read: This paper builds a benchmark dataset for financial natural language inference (FinNLI) to evaluate pre-trained language models (PLMs) and large language models (LLMs) on diverse financial texts such as SEC filings, annual reports, and earnings call transcripts. The key design is a dataset framework that generates diverse premise-hypothesis pairs while minimizing spurious correlations. FinNLI contains 21,304 pairs, including a high-quality test set of 3,304 instances annotated by finance experts. The study finds that domain shift significantly degrades general-domain NLI performance, and that instruction-tuned financial LLMs perform surprisingly poorly on FinNLI, suggesting limited generalizability; the benchmark thus exposes weaknesses of current LLMs in financial reasoning and points to room for improvement.

Link: https://arxiv.org/abs/2504.16188
Authors: Jabez Magomere, Elena Kochkina, Samuel Mensah, Simerjot Kaur, Charese H. Smiley
Affiliations: Not listed
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:We introduce FinNLI, a benchmark dataset for Financial Natural Language Inference (FinNLI) across diverse financial texts like SEC Filings, Annual Reports, and Earnings Call transcripts. Our dataset framework ensures diverse premise-hypothesis pairs while minimizing spurious correlations. FinNLI comprises 21,304 pairs, including a high-quality test set of 3,304 instances annotated by finance experts. Evaluations show that domain shift significantly degrades general-domain NLI performance. The highest Macro F1 scores for pre-trained (PLMs) and large language models (LLMs) baselines are 74.57% and 78.62%, respectively, highlighting the dataset’s difficulty. Surprisingly, instruction-tuned financial LLMs perform poorly, suggesting limited generalizability. FinNLI exposes weaknesses in current LLMs for financial reasoning, indicating room for improvement.

[NLP-45] Multimodal Large Language Models for Enhanced Traffic Safety: A Comprehensive Review and Future Trends

Quick Read: This paper addresses the traffic-safety limitations of traditional Advanced Driver-Assistance Systems (ADAS), which often struggle in dynamic real-world scenarios due to fragmented sensor processing and susceptibility to adversarial conditions. It argues that multimodal large language models (MLLMs) can overcome these limitations by integrating cross-modal data such as visual, spatial, and environmental inputs for holistic scene understanding. The key points are how MLLM-based approaches enhance perception, decision-making, and adversarial robustness; the role of key datasets (e.g., KITTI, DRAMA, ML4RoadSafety) in advancing research; and future directions including real-time edge deployment, causality-driven reasoning, and human-AI collaboration.

Link: https://arxiv.org/abs/2504.16134
Authors: Mohammad Abu Tami, Mohammed Elhenawy, Huthaifa I. Ashqar
Affiliations: Not listed
Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Comments:

Abstract:Traffic safety remains a critical global challenge, with traditional Advanced Driver-Assistance Systems (ADAS) often struggling in dynamic real-world scenarios due to fragmented sensor processing and susceptibility to adversarial conditions. This paper reviews the transformative potential of Multimodal Large Language Models (MLLMs) in addressing these limitations by integrating cross-modal data such as visual, spatial, and environmental inputs to enable holistic scene understanding. Through a comprehensive analysis of MLLM-based approaches, we highlight their capabilities in enhancing perception, decision-making, and adversarial robustness, while also examining the role of key datasets (e.g., KITTI, DRAMA, ML4RoadSafety) in advancing research. Furthermore, we outline future directions, including real-time edge deployment, causality-driven reasoning, and human-AI collaboration. By positioning MLLMs as a cornerstone for next-generation traffic safety systems, this review underscores their potential to revolutionize the field, offering scalable, context-aware solutions that proactively mitigate risks and improve overall road safety.

[NLP-46] LegalRAG: A Hybrid RAG System for Multilingual Legal Information Retrieval IJCNN2025

Quick Read: This paper addresses the limited application of natural language processing (NLP) and computational-linguistics techniques to legal and regulatory tasks, targeting retrieval and question answering over bilingual (English and Bangla) regulatory documents, the Bangladesh Police Gazettes. The key solution is an efficient bilingual retrieval and question-answering framework built on a modern Retrieval Augmented Generation (RAG) pipeline, together with an improved RAG approach that boosts retrieval performance and thus yields more precise answers. The system substantially speeds up searching for specific government legal notices, making legal information more accessible. Experiments on a diverse test set show that the proposed approach consistently outperforms existing methods on all evaluation metrics.

Link: https://arxiv.org/abs/2504.16121
Authors: Muhammad Rafsan Kabir, Rafeed Mohammad Sultan, Fuad Rahman, Mohammad Ruhul Amin, Sifat Momen, Nabeel Mohammed, Shafin Rahman
Affiliations: Apurba-NSU R&D Lab, Department of Electrical and Computer Engineering, North South University; Apurba Technologies; Fordham University
Categories: Information Retrieval (cs.IR); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: Accepted at IJCNN 2025

Abstract:Natural Language Processing (NLP) and computational linguistic techniques are increasingly being applied across various domains, yet their use in legal and regulatory tasks remains limited. To address this gap, we develop an efficient bilingual question-answering framework for regulatory documents, specifically the Bangladesh Police Gazettes, which contain both English and Bangla text. Our approach employs modern Retrieval Augmented Generation (RAG) pipelines to enhance information retrieval and response generation. In addition to conventional RAG pipelines, we propose an advanced RAG-based approach that improves retrieval performance, leading to more precise answers. This system enables efficient searching for specific government legal notices, making legal information more accessible. We evaluate both our proposed and conventional RAG systems on a diverse test set on Bangladesh Police Gazettes, demonstrating that our approach consistently outperforms existing methods across all evaluation metrics.
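A minimal dense-retrieval RAG skeleton of the kind described might look like the following; the multilingual encoder and the toy gazette chunks are assumptions, not the authors' actual stack:

```python
from sentence_transformers import SentenceTransformer, util

# A multilingual encoder embeds English and Bangla chunks in one space.
encoder = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")
gazette_chunks = [
    "Notification: transfer orders for district police officers...",
    "গেজেট বিজ্ঞপ্তি: জেলা পুলিশ কর্মকর্তাদের বদলির আদেশ...",
]
chunk_emb = encoder.encode(gazette_chunks, convert_to_tensor=True)

def retrieve(question, k=2):
    q_emb = encoder.encode(question, convert_to_tensor=True)
    hits = util.cos_sim(q_emb, chunk_emb)[0].topk(min(k, len(gazette_chunks)))
    return [gazette_chunks[int(i)] for i in hits.indices]

context = "\n".join(retrieve("Which notification covers transfer orders?"))
prompt = f"Answer using only this context:\n{context}\n\nQuestion: ..."
# `prompt` would then be passed to a generator LLM for the final answer.
```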

[NLP-47] HPU: High-Bandwidth Processing Unit for Scalable Cost-effective LLM Inference via GPU Co-processing

Quick Read: This paper addresses the inefficiency of transformer-based large language models (LLMs) on current GPU systems, caused by the low operational intensity of attention layers and the substantial memory requirements of KV caches. The key solution is a High-bandwidth Processing Unit (HPU), a memory-intensive co-processor that offloads memory-bound operations to improve GPU resource utilization, especially for large-batch inference. As an add-on card, the HPU scales out the system's memory capacity to absorb the growing demands of large batch sizes and long sequence lengths. The paper demonstrates an HPU prototype built from PCIe-based FPGA cards mounted on a GPU system and shows that the GPU-HPU heterogeneous system achieves up to 4.1x performance gains and 4.6x energy-efficiency improvements over a GPU-only system, providing scalability without adding GPUs.

Link: https://arxiv.org/abs/2504.16112
Authors: Myunghyun Rhee, Joonseop Sim, Taeyoung Ahn, Seungyong Lee, Daegun Yoon, Euiseok Kim, Kyoung Park, Youngpyo Joo, Hosik Kim
Affiliations: SK hynix Inc.
Categories: Hardware Architecture (cs.AR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Distributed, Parallel, and Cluster Computing (cs.DC)
Comments: 6 pages

Abstract:The attention layer, a core component of Transformer-based LLMs, brings out inefficiencies in current GPU systems due to its low operational intensity and the substantial memory requirements of KV caches. We propose a High-bandwidth Processing Unit (HPU), a memory-intensive co-processor that enhances GPU resource utilization during large-batched LLM inference. By offloading memory-bound operations, the HPU allows the GPU to focus on compute-intensive tasks, increasing overall efficiency. Also, the HPU, as an add-on card, scales out to accommodate surging memory demands driven by large batch sizes and extended sequence lengths. In this paper, we show the HPU prototype implemented with PCIe-based FPGA cards mounted on a GPU system. Our novel GPU-HPU heterogeneous system demonstrates up to 4.1x performance gains and 4.6x energy efficiency improvements over a GPU-only system, providing scalability without increasing the number of GPUs.
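The "low operational intensity" claim is easy to verify with back-of-envelope arithmetic: per decoded token, attention reads the whole KV cache but performs only a couple of FLOPs per element read (the configuration below is illustrative, not from the paper):

```python
# Decode-time attention for one query token, multi-head attention in FP16.
seq_len, head_dim, n_heads = 4096, 128, 32
bytes_per_elem = 2  # FP16

kv_bytes = 2 * seq_len * head_dim * n_heads * bytes_per_elem  # read K and V
flops = 4 * seq_len * head_dim * n_heads  # q.K^T plus attn.V multiply-adds
print(f"{flops / kv_bytes:.2f} FLOPs per byte")  # ~1.0: heavily memory-bound
```

At roughly one FLOP per byte, attention decode is bandwidth-limited on any modern GPU, which is exactly why moving it to a high-bandwidth co-processor can pay off.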

[NLP-48] Cooperative Speech Semantic Competence and AI

Quick Read: This paper asks whether large language models (LLMs) are capable of cooperative communication and whether their semantic competence is limited. The key argument is that LLMs lack the attitude of respect that partly constitutes cooperative communication, because such respect demands reciprocity. On this basis, the author argues that present-day LLMs are incapable of assertion, which raises an overlooked doubt about their semantic competence, and that knowledge of meaning is therefore a subject not only for cognitive psychology but also for moral psychology.

Link: https://arxiv.org/abs/2504.16092
Authors: Mahrad Almotahari
Affiliations: University of Edinburgh
Categories: Computers and Society (cs.CY); Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)
Comments: 25 pages

Abstract:Cooperative speech is purposive. From the speaker’s perspective, one crucial purpose is the transmission of knowledge. Cooperative speakers care about getting things right for their conversational partners. This attitude is a kind of respect. Cooperative speech is an ideal form of communication because participants have respect for each other. And having respect within a cooperative enterprise is sufficient for a particular kind of moral standing: we ought to respect those who have respect for us. Respect demands reciprocity. I maintain that large language models aren’t owed the kind of respect that partly constitutes a cooperative conversation. This implies that they aren’t cooperative interlocutors, otherwise we would be obliged to reciprocate the attitude. Leveraging this conclusion, I argue that present-day LLMs are incapable of assertion and that this raises an overlooked doubt about their semantic competence. One upshot of this argument is that knowledge of meaning isn’t just a subject for the cognitive psychologist. It’s also a subject for the moral psychologist.

Computer Vision

[CV-0] Procedural Dataset Generation for Zero-Shot Stereo Matching

Quick Read: This paper asks what makes a synthetic stereo-matching dataset effective, a question that has remained largely unexplored, including which design parameters matter and how they affect model performance. The authors systematically investigate the design space of synthetic datasets by varying the parameters of a procedural dataset generator and measuring the effect on zero-shot stereo matching performance on standard benchmarks. The key outcome is Infinigen-Stereo, a procedural generator optimized specifically for zero-shot stereo datasets: models trained only on its data outperform robust baselines trained on combinations of existing synthetic datasets and achieve stronger zero-shot stereo matching performance than public checkpoints from prior work.

Link: https://arxiv.org/abs/2504.16930
Authors: David Yan, Alexander Raistrick, Jia Deng
Affiliations: Princeton University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Synthetic datasets are a crucial ingredient for training stereo matching networks, but the question of what makes a stereo dataset effective remains largely unexplored. We investigate the design space of synthetic datasets by varying the parameters of a procedural dataset generator, and report the effects on zero-shot stereo matching performance using standard benchmarks. We collect the best settings to produce Infinigen-Stereo, a procedural generator specifically optimized for zero-shot stereo datasets. Models trained only on data from our system outperform robust baselines trained on a combination of existing synthetic datasets and have stronger zero-shot stereo matching performance than public checkpoints from prior works. We open source our system at this https URL to enable further research on procedural stereo datasets.

[CV-1] I-Con: A Unifying Framework for Representation Learning ICLR2025

Quick Read: This paper tackles the difficulty of unifying the diverse loss functions used for different tasks in modern machine learning, and proposes a general information-theoretic framework that explains and integrates these methods. The key contribution is a single information-theoretic equation that generalizes a large collection of modern loss functions and reveals their common essence: minimizing an integrated KL divergence between two conditional distributions, the supervisory and the learned representations. This viewpoint exposes a hidden information geometry underlying clustering, spectral methods, dimensionality reduction, contrastive learning, and supervised learning. Within this framework, the authors connect over 23 different methods, develop new loss functions by combining successful techniques from the literature, achieve a state-of-the-art +8% improvement on unsupervised ImageNet-1K classification, and derive principled debiasing methods that improve contrastive representation learners.

Link: https://arxiv.org/abs/2504.16929
Authors: Shaden Alshammari, John Hershey, Axel Feldmann, William T. Freeman, Mark Hamilton
Affiliations: MIT; Google; Microsoft
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Information Theory (cs.IT)
Comments: ICLR 2025; website: this https URL. Proceedings of the Thirteenth International Conference on Learning Representations (ICLR 2025)

Abstract:As the field of representation learning grows, there has been a proliferation of different loss functions to solve different classes of problems. We introduce a single information-theoretic equation that generalizes a large collection of modern loss functions in machine learning. In particular, we introduce a framework that shows that several broad classes of machine learning methods are precisely minimizing an integrated KL divergence between two conditional distributions: the supervisory and learned representations. This viewpoint exposes a hidden information geometry underlying clustering, spectral methods, dimensionality reduction, contrastive learning, and supervised learning. This framework enables the development of new loss functions by combining successful techniques from across the literature. We not only present a wide array of proofs, connecting over 23 different approaches, but we also leverage these theoretical results to create state-of-the-art unsupervised image classifiers that achieve a +8% improvement over the prior state-of-the-art on unsupervised classification on ImageNet-1K. We also demonstrate that I-Con can be used to derive principled debiasing methods which improve contrastive representation learners.
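The unifying objective, an averaged KL divergence between a supervisory and a learned conditional "neighbor" distribution, can be written down directly; a small PyTorch sketch (a generic instantiation, with the temperature and the use of cosine similarities as assumptions):

```python
import torch
import torch.nn.functional as F

def icon_loss(sup_logits, learned_emb, tau=0.1):
    """Average over anchors i of KL( p(j|i) || q(j|i) ).

    sup_logits: (n, n) similarities defining the supervisory distribution p.
    learned_emb: (n, d) embeddings defining q via softmax over cosine sims.
    Self-pairs are kept here for brevity; implementations usually mask the
    diagonal.
    """
    p = F.softmax(sup_logits, dim=-1)
    z = F.normalize(learned_emb, dim=-1)
    q_log = F.log_softmax(z @ z.T / tau, dim=-1)
    return F.kl_div(q_log, p, reduction="batchmean")  # KL(p || q)

n, d = 16, 32
loss = icon_loss(torch.randn(n, n), torch.randn(n, d, requires_grad=True))
loss.backward()
```

Different choices of p and q then recover different methods, e.g., class labels for supervised learning or augmentation pairs for contrastive learning.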

[CV-2] Generalized Neighborhood Attention: Multi-dimensional Sparse Attention at the Speed of Light

Quick Read: This paper addresses the fact that sparse attention mechanisms have rarely delivered consistent speedups over the self-attention baseline in practice, largely because of the complexity of attention infrastructure and the rapid evolution of AI hardware architecture. At the same time, many state-of-the-art foundation models are attention-bound and need reliable spatial sparsity to escape the O(n^2) complexity. The key contributions are Generalized Neighborhood Attention (GNA), which uniformly describes sliding-window, strided sliding-window, and blocked attention; a more accurate analytical model of their performance improvements, including a simulator that provides realistic speedup upper bounds for any given setting; and an efficient fused multi-headed attention (FMHA) kernel for the NVIDIA Blackwell architecture that fully realizes the theoretical maximum speedup in many perfectly block-sparse cases, reaching an effective utilization of 1.3 petaFLOPs/second in FP16.

Link: https://arxiv.org/abs/2504.16922
Authors: Ali Hassani, Fengzhe Zhou, Aditya Kane, Jiannan Huang, Chieh-Yun Chen, Min Shi, Steven Walton, Markus Hoehnerbach, Vijay Thakkar, Michael Isaev, Qinsheng Zhang, Bing Xu, Haicheng Wu, Wen-mei Hwu, Ming-Yu Liu, Humphrey Shi
Affiliations: Georgia Tech; NVIDIA; UIUC
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: this https URL

Abstract:Many sparse attention mechanisms such as Neighborhood Attention have typically failed to consistently deliver speedup over the self attention baseline. This is largely due to the level of complexity in attention infrastructure, and the rapid evolution of AI hardware architecture. At the same time, many state-of-the-art foundational models, particularly in computer vision, are heavily bound by attention, and need reliable sparsity to escape the O(n^2) complexity. In this paper, we study a class of promising sparse attention mechanisms that focus on locality, and aim to develop a better analytical model of their performance improvements. We first introduce Generalized Neighborhood Attention (GNA), which can describe sliding window, strided sliding window, and blocked attention. We then consider possible design choices in implementing these approaches, and create a simulator that can provide much more realistic speedup upper bounds for any given setting. Finally, we implement GNA on top of a state-of-the-art fused multi-headed attention (FMHA) kernel designed for the NVIDIA Blackwell architecture in CUTLASS. Our implementation can fully realize the maximum speedup theoretically possible in many perfectly block-sparse cases, and achieves an effective utilization of 1.3 petaFLOPs/second in FP16. In addition, we plug various GNA configurations into off-the-shelf generative models, such as Cosmos-7B, HunyuanVideo, and FLUX, and show that it can deliver 28% to 46% end-to-end speedup on B200 without any fine-tuning. We will open source our simulator and Blackwell kernels directly through the NATTEN project.
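A dense reference implementation of 1D sliding-window (neighborhood) attention clarifies the sparsity pattern; real GNA kernels realize the same pattern without materializing masks or the full score matrix:

```python
import torch

def neighborhood_attention_1d(q, k, v, window=3):
    """Dense reference: token i attends only to tokens j with |i - j| <= window."""
    n, d = q.shape
    idx = torch.arange(n)
    mask = (idx[None, :] - idx[:, None]).abs() <= window
    scores = (q @ k.T) / d**0.5
    scores = scores.masked_fill(~mask, float("-inf"))
    return torch.softmax(scores, dim=-1) @ v

q, k, v = (torch.randn(10, 8) for _ in range(3))
out = neighborhood_attention_1d(q, k, v)
print(out.shape)  # torch.Size([10, 8])
```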

[CV-3] DreamO: A Unified Framework for Image Customization

Quick Read: This paper addresses the lack of generality in existing image-customization methods, which are mostly designed for specific tasks and cannot combine different types of conditions, and proposes a unified framework called DreamO. The key elements are: (1) a diffusion transformer (DiT) framework that uniformly processes inputs of different types; (2) a large-scale training dataset covering diverse customization tasks, together with a feature-routing constraint for precisely querying relevant information from reference images; (3) a placeholder strategy that associates specific placeholders with conditions at particular positions, enabling control over where conditions appear in the generated results; and (4) a progressive three-stage training strategy: an initial stage on simple tasks with limited data to establish baseline consistency, a full-scale stage to comprehensively strengthen customization capabilities, and a final quality-alignment stage to correct quality biases introduced by low-quality data. Together these innovations enable efficient, flexible, high-quality image customization.

Link: https://arxiv.org/abs/2504.16915
Authors: Chong Mou, Yanze Wu, Wenxu Wu, Zinan Guo, Pengze Zhang, Yufeng Cheng, Yiming Luo, Fei Ding, Shiwen Zhang, Xinghui Li, Mengtian Li, Songtao Zhao, Jian Zhang, Qian He, Xinglong Wu
Affiliations: Intelligent Creation Team, ByteDance; School of Electronic and Computer Engineering, Peking University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Recently, extensive research on image customization (e.g., identity, subject, style, background, etc.) demonstrates strong customization capabilities in large-scale generative models. However, most approaches are designed for specific tasks, restricting their generalizability to combine different types of condition. Developing a unified framework for image customization remains an open challenge. In this paper, we present DreamO, an image customization framework designed to support a wide range of tasks while facilitating seamless integration of multiple conditions. Specifically, DreamO utilizes a diffusion transformer (DiT) framework to uniformly process input of different types. During training, we construct a large-scale training dataset that includes various customization tasks, and we introduce a feature routing constraint to facilitate the precise querying of relevant information from reference images. Additionally, we design a placeholder strategy that associates specific placeholders with conditions at particular positions, enabling control over the placement of conditions in the generated results. Moreover, we employ a progressive training strategy consisting of three stages: an initial stage focused on simple tasks with limited data to establish baseline consistency, a full-scale training stage to comprehensively enhance the customization capabilities, and a final quality alignment stage to correct quality biases introduced by low-quality data. Extensive experiments demonstrate that the proposed DreamO can effectively perform various image customization tasks with high quality and flexibly integrate different types of control conditions.

[CV-4] BadVideo: Stealthy Backdoor Attack against Text-to-Video Generation

Quick Read: This paper targets the underexplored adversarial vulnerabilities of text-to-video (T2V) generative models, in particular the observation that generated videos often contain substantial redundant information not explicitly specified in the text prompt (environmental elements, secondary objects, extra details), which gives malicious attackers an opportunity to embed hidden harmful content. The paper introduces BadVideo, the first backdoor attack framework tailored to T2V generation, whose key is two strategies for designing target adversarial outputs: (1) Spatio-Temporal Composition, which combines different spatiotemporal features to encode malicious information; and (2) Dynamic Element Transformation, which conveys malicious information through transformations of redundant elements over time. These strategies let the attacker's malicious target blend seamlessly with the user's textual instructions, yielding high stealthiness; by exploiting the temporal dimension of video, the attack also evades conventional content-moderation systems that mainly analyze spatial information in individual frames. Extensive experiments show that BadVideo achieves high attack success rates while preserving the original semantics and performing well on clean inputs, revealing the adversarial fragility of T2V models and calling attention to potential risks and misuse.

Link: https://arxiv.org/abs/2504.16907
Authors: Ruotong Wang, Mingli Zhu, Jiarong Ou, Rui Chen, Xin Tao, Pengfei Wan, Baoyuan Wu
Affiliations: The Chinese University of Hong Kong, Shenzhen; Kuaishou Technology
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:Text-to-video (T2V) generative models have rapidly advanced and found widespread applications across fields like entertainment, education, and marketing. However, the adversarial vulnerabilities of these models remain rarely explored. We observe that in T2V generation tasks, the generated videos often contain substantial redundant information not explicitly specified in the text prompts, such as environmental elements, secondary objects, and additional details, providing opportunities for malicious attackers to embed hidden harmful content. Exploiting this inherent redundancy, we introduce BadVideo, the first backdoor attack framework tailored for T2V generation. Our attack focuses on designing target adversarial outputs through two key strategies: (1) Spatio-Temporal Composition, which combines different spatiotemporal features to encode malicious information; (2) Dynamic Element Transformation, which introduces transformations in redundant elements over time to convey malicious information. Based on these strategies, the attacker’s malicious target seamlessly integrates with the user’s textual instructions, providing high stealthiness. Moreover, by exploiting the temporal dimension of videos, our attack successfully evades traditional content moderation systems that primarily analyze spatial information within individual frames. Extensive experiments demonstrate that BadVideo achieves high attack success rates while preserving original semantics and maintaining excellent performance on clean inputs. Overall, our work reveals the adversarial vulnerability of T2V models, calling attention to potential risks and misuse. Our project page is at this https URL.

[CV-5] High-Quality Cloud-Free Optical Image Synthesis Using Multi-Temporal SAR and Contaminated Optical Data

Quick Read: This paper addresses gaps in optical remote-sensing data caused by cloud cover and long satellite revisit cycles, focusing on synthesizing missing optical data in complex cloud-covered scenes. The key solution is CRSynthNet, an image-synthesis network with purpose-built modules such as the DownUp Block and Fusion Attention that improve synthesis accuracy. These modules are central to restoring structural detail, preserving spectral consistency, and achieving superior visual quality. Experiments show that CRSynthNet clearly outperforms comparison methods across multiple quantitative metrics (PSNR 26.978, SSIM 0.648, RMSE 0.050) and produces more realistic results. The paper also constructs TCSEN12, a dataset specifically designed for the cloud-cover problem in missing-optical-data synthesis; it uniquely includes cloud-covered images and uses earlier images to predict later ones, further supporting this line of research.

Link: https://arxiv.org/abs/2504.16870
Authors: Chenxi Duan
Affiliations: Faculty of Geo-Information Science and Earth Observation (ITC), University of Twente
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Addressing gaps caused by cloud cover and the long revisit cycle of satellites is vital for providing essential data to support remote sensing applications. This paper tackles the challenges of missing optical data synthesis, particularly in complex scenarios with cloud cover. We propose CRSynthNet, a novel image synthesis network that incorporates innovative designed modules such as the DownUp Block and Fusion Attention to enhance accuracy. Experimental results validate the effectiveness of CRSynthNet, demonstrating substantial improvements in restoring structural details, preserving spectral consistency, and achieving superior visual effects that far exceed those produced by comparison methods. It achieves quantitative improvements across multiple metrics: a peak signal-to-noise ratio (PSNR) of 26.978, a structural similarity index measure (SSIM) of 0.648, and a root mean square error (RMSE) of 0.050. Furthermore, this study creates the TCSEN12 dataset, a valuable resource specifically designed to address cloud cover challenges in missing optical data synthesis research. The dataset uniquely includes cloud-covered images and leverages earlier images to predict later ones, offering a realistic representation of real-world scenarios. This study offers practical methods and valuable resources for the optical satellite image synthesis task.

[CV-6] Hyperspectral Vision Transformers for Greenhouse Gas Estimations from Space

Quick Read: This paper addresses the limited spatial coverage and sparse revisit times of hyperspectral imaging for greenhouse gas (GHG) monitoring, while also overcoming the lack of spectral detail in multispectral imaging. The key solution is a self-supervised spectral transformer that synthesizes hyperspectral data from multispectral inputs: it is pre-trained with a band-wise masked autoencoder and then fine-tuned on spatio-temporally aligned multispectral-hyperspectral image pairs. The synthetic hyperspectral data retain the spatial and temporal benefits of multispectral imagery and improve GHG prediction accuracy relative to multispectral data alone, effectively bridging the trade-off between spectral resolution and coverage.

Link: https://arxiv.org/abs/2504.16851
Authors: Ruben Gonzalez Avilés, Linus Scheibenreif, Nassim Ait Ali Braham, Benedikt Blumenstiel, Thomas Brunschwiler, Ranjini Guruprasad, Damian Borth, Conrad Albrecht, Paolo Fraccaro, Devyani Lambhate, Johannes Jakubik
Affiliations: University of St. Gallen; IBM Research; German Aerospace Center (DLR)
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Hyperspectral imaging provides detailed spectral information and holds significant potential for monitoring of greenhouse gases (GHGs). However, its application is constrained by limited spatial coverage and infrequent revisit times. In contrast, multispectral imaging offers broader spatial and temporal coverage but often lacks the spectral detail that can enhance GHG detection. To address these challenges, this study proposes a spectral transformer model that synthesizes hyperspectral data from multispectral inputs. The model is pre-trained via a band-wise masked autoencoder and subsequently fine-tuned on spatio-temporally aligned multispectral-hyperspectral image pairs. The resulting synthetic hyperspectral data retain the spatial and temporal benefits of multispectral imagery and improve GHG prediction accuracy relative to using multispectral data alone. This approach effectively bridges the trade-off between spectral resolution and coverage, highlighting its potential to advance atmospheric monitoring by combining the strengths of hyperspectral and multispectral systems with self-supervised deep learning.

[CV-7] A Low-Cost Photogrammetry System for 3D Plant Modeling and Phenotyping

Quick Read: This paper develops an open-source, low-cost photogrammetry system for 3D plant modeling and phenotyping. Using wheat as an example, it shows how a range of phenotypic traits, including standard measurements such as plant height and radius as well as features that are cumbersome to measure by hand, such as leaf angles and convex hull, can be computed easily from point clouds. The system reconstructs 3D point-cloud representations of plants with a structure-from-motion approach. The key is combining low-cost photogrammetry and structure-from-motion with automated trait extraction, and the utility of the system is further demonstrated by investigating specific metrics that may yield objective classifications of erectophile versus planophile wheat canopy architectures.

Link: https://arxiv.org/abs/2504.16840
Authors: Joe Hrzich, Michael A. Beck, Christopher P. Bidinosti, Christopher J. Henry, Kalhari Manawasinghe, Karen Tanino
Affiliations: University of Winnipeg; University of Manitoba; University of Saskatchewan
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:We present an open-source, low-cost photogrammetry system for 3D plant modeling and phenotyping. The system uses a structure-from-motion approach to reconstruct 3D representations of the plants via point clouds. Using wheat as an example, we demonstrate how various phenotypic traits can be computed easily from the point clouds. These include standard measurements such as plant height and radius, as well as features that would be more cumbersome to measure by hand, such as leaf angles and convex hull. We further demonstrate the utility of the system through the investigation of specific metrics that may yield objective classifications of erectophile versus planophile wheat canopy architectures.
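Several of the listed traits reduce to a few lines of NumPy/SciPy once a point cloud is available; a sketch assuming z is the vertical axis (the paper's exact trait definitions may differ):

```python
import numpy as np
from scipy.spatial import ConvexHull

rng = np.random.default_rng(0)
points = rng.random((2000, 3))  # stand-in for a reconstructed plant cloud

height = points[:, 2].max() - points[:, 2].min()          # plant height
centroid_xy = points[:, :2].mean(axis=0)
radius = np.linalg.norm(points[:, :2] - centroid_xy, axis=1).max()
hull = ConvexHull(points)                                  # 3D convex hull

print(f"height={height:.3f}, radius={radius:.3f}, "
      f"hull volume={hull.volume:.3f}")
```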

[CV-8] Decoupled Global-Local Alignment for Improving Compositional Understanding

Quick Read: This paper addresses the limited ability of existing Contrastive Language-Image Pre-training (CLIP) models to understand compositional concepts such as relations and attributes. Although recent work improves compositional understanding with global hard negatives, these methods significantly hurt the model's general capabilities by forcibly pushing textual negatives away from images in the embedding space. The key solution is the Decoupled Global-Local Alignment (DeGLA) framework, which enhances compositional understanding while substantially mitigating losses in general capability. Specifically, DeGLA introduces a self-distillation mechanism in global alignment, aligning the learnable image-text encoder with a frozen teacher derived from an exponential moving average, which effectively mitigates catastrophic forgetting of pretrained knowledge during fine-tuning. To improve compositional understanding, it leverages the in-context learning ability of large language models (LLMs) to construct about 2M high-quality negative captions across five types, and proposes an Image-Grounded Contrast (IGC) loss and a Text-Grounded Contrast (TGC) loss to strengthen vision-language compositionality. Experiments show that DeGLA gains an average of 3.5% across the VALSE, SugarCrepe, and ARO benchmarks and an average of 13.0% on zero-shot classification across eleven datasets.

Link: https://arxiv.org/abs/2504.16801
Authors: Xiaoxing Hu, Kaicheng Yang, Jun Wang, Haoran Xu, Ziyong Feng, Yupei Wang
Affiliations: Beijing Institute of Technology; DeepGlint; Zhejiang University; Beijing, China
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Contrastive Language-Image Pre-training (CLIP) has achieved success on multiple downstream tasks by aligning image and text modalities. However, the nature of global contrastive learning limits CLIP's ability to comprehend compositional concepts, such as relations and attributes. Although recent studies employ global hard negative samples to improve compositional understanding, these methods significantly compromise the model's inherent general capabilities by forcibly distancing textual negative samples from images in the embedding space. To overcome this limitation, we introduce a Decoupled Global-Local Alignment (DeGLA) framework that improves compositional understanding while substantially mitigating losses in general capabilities. To optimize the retention of the model's inherent capabilities, we incorporate a self-distillation mechanism within the global alignment process, aligning the learnable image-text encoder with a frozen teacher model derived from an exponential moving average. Under the constraint of self-distillation, it effectively mitigates the catastrophic forgetting of pretrained knowledge during fine-tuning. To improve compositional understanding, we first leverage the in-context learning capability of Large Language Models (LLMs) to construct about 2M high-quality negative captions across five types. Subsequently, we propose the Image-Grounded Contrast (IGC) loss and Text-Grounded Contrast (TGC) loss to enhance vision-language compositionality. Extensive experimental results demonstrate the effectiveness of the DeGLA framework. Compared to previous state-of-the-art methods, DeGLA achieves an average enhancement of 3.5% across the VALSE, SugarCrepe, and ARO benchmarks. Concurrently, it obtains an average performance improvement of 13.0% on zero-shot classification tasks across eleven datasets. Our code will be released at this https URL
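The frozen exponential-moving-average teacher used for self-distillation follows a standard update rule; a minimal sketch (the momentum value is an assumption):

```python
import copy
import torch

@torch.no_grad()
def ema_update(teacher, student, momentum=0.999):
    # teacher <- momentum * teacher + (1 - momentum) * student
    for t_param, s_param in zip(teacher.parameters(), student.parameters()):
        t_param.mul_(momentum).add_(s_param, alpha=1.0 - momentum)

student = torch.nn.Linear(16, 16)
teacher = copy.deepcopy(student)
for p in teacher.parameters():
    p.requires_grad_(False)  # teacher receives no gradients, only EMA updates

# ...one optimizer step on `student` would happen here...
ema_update(teacher, student)
```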
zh

[CV-9] 4D Multimodal Co-attention Fusion Network with Latent Contrastive Alignment for Alzheimer's Diagnosis

【速读】: This paper tackles the heterogeneity challenge of fusing features across neuroimaging modalities for early diagnosis of Alzheimer's Disease (AD). The intrinsic differences between modalities (e.g., the 4D spatiotemporal dynamics of fMRI versus the 3D anatomy of sMRI) make effective feature alignment difficult. The key of the proposed M2M-AlignNet, a geometry-aware multimodal co-attention network, is a multi-patch-to-multi-patch (M2M) contrastive loss that quantifies and reduces representational discrepancies through geometry-weighted patch correspondence, explicitly aligning fMRI components across brain regions with their corresponding sMRI structural substrates without one-to-one constraints. A latent-as-query co-attention module is further designed to discover fusion patterns autonomously, avoiding modality-prioritization bias and reducing feature redundancy. Experiments validate the effectiveness of the method and highlight the correspondence between fMRI and sMRI as AD biomarkers.

链接: https://arxiv.org/abs/2504.16798
作者: Yuxiang Wei,Yanteng Zhang,Xi Xiao,Tianyang Wang,Xiao Wang,Vince D. Calhoun
机构: 未知
类目: Multimedia (cs.MM); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Multimodal neuroimaging provides complementary structural and functional insights into both human brain organization and disease-related dynamics. Recent studies demonstrate enhanced diagnostic sensitivity for Alzheimer’s disease (AD) through synergistic integration of neuroimaging data (e.g., sMRI, fMRI) with behavioral cognitive scores (tabular data biomarkers). However, the intrinsic heterogeneity across modalities (e.g., 4D spatiotemporal fMRI dynamics vs. 3D anatomical sMRI structure) presents critical challenges for discriminative feature fusion. To bridge this gap, we propose M2M-AlignNet: a geometry-aware multimodal co-attention network with latent alignment for early AD diagnosis using sMRI and fMRI. At the core of our approach is a multi-patch-to-multi-patch (M2M) contrastive loss function that quantifies and reduces representational discrepancies via geometry-weighted patch correspondence, explicitly aligning fMRI components across brain regions with their sMRI structural substrates without one-to-one constraints. Additionally, we propose a latent-as-query co-attention module to autonomously discover fusion patterns, circumventing modality prioritization biases while minimizing feature redundancy. We conduct extensive experiments to confirm the effectiveness of our method and highlight the correspondence between fMRI and sMRI as AD biomarkers.
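To make the M2M idea concrete, here is a minimal sketch of a geometry-weighted patch-to-patch contrastive loss. The Gaussian distance weighting, temperature, and row-normalized soft correspondence are assumptions for illustration, not the authors' exact formulation.

```python
import torch
import torch.nn.functional as F

def m2m_contrastive_loss(fmri_patches, smri_patches, coords_f, coords_s,
                         tau=0.1, sigma=1.0):
    """Geometry-weighted multi-patch contrastive loss (illustrative sketch).

    fmri_patches, smri_patches: (N, D) patch embeddings from each modality.
    coords_f, coords_s: (N, 3) patch centers in a shared brain coordinate frame.
    Spatially close cross-modal patches receive larger positive weight,
    giving a soft correspondence with no one-to-one constraint.
    """
    a = F.normalize(fmri_patches, dim=-1)
    b = F.normalize(smri_patches, dim=-1)
    sim = a @ b.t() / tau                          # (N, N) scaled cosine similarity
    dist = torch.cdist(coords_f, coords_s)         # (N, N) spatial distances
    w = torch.exp(-dist ** 2 / (2 * sigma ** 2))   # geometry weights
    w = w / w.sum(dim=1, keepdim=True)             # row-normalized soft targets
    log_p = F.log_softmax(sim, dim=1)
    return -(w * log_p).sum(dim=1).mean()          # cross-entropy to soft targets

loss = m2m_contrastive_loss(torch.randn(32, 128), torch.randn(32, 128),
                            torch.rand(32, 3), torch.rand(32, 3))
```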
zh

[CV-10] Towards Explainable AI: Multi-Modal Transformer for Video-based Image Description Generation

【速读】: This paper targets the generation of natural-language descriptions from video datasets, in particular producing insightful and contextualized action descriptions for video applications such as intelligent monitoring and autonomous driving. The proposed framework combines textual and visual modalities: ResNet50 extracts visual features from video frames, which are converted into patch embeddings and fed into an encoder-decoder model based on Generative Pre-trained Transformer-2 (GPT-2). To align textual and visual representations and ensure high-quality descriptions, the system employs multi-head self-attention and cross-attention. Evaluation with BLEU (1-4), CIDEr, METEOR, and ROUGE-L confirms the effectiveness of the approach, which clearly outperforms traditional methods.

链接: https://arxiv.org/abs/2504.16788
作者: Lakshita Agarwal,Bindu Verma
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Understanding and analyzing video actions are essential for producing insightful and contextualized descriptions, especially for video-based applications like intelligent monitoring and autonomous systems. The proposed work introduces a novel framework for generating natural language descriptions from video datasets by combining textual and visual modalities. The suggested architecture makes use of ResNet50 to extract visual features from video frames that are taken from the Microsoft Research Video Description Corpus (MSVD), and Berkeley DeepDrive eXplanation (BDD-X) datasets. The extracted visual characteristics are converted into patch embeddings and then run through an encoder-decoder model based on Generative Pre-trained Transformer-2 (GPT-2). In order to align textual and visual representations and guarantee high-quality description production, the system uses multi-head self-attention and cross-attention techniques. The model’s efficacy is demonstrated by performance evaluation using BLEU (1-4), CIDEr, METEOR, and ROUGE-L. The suggested framework outperforms traditional methods with BLEU-4 scores of 0.755 (BDD-X) and 0.778 (MSVD), CIDEr scores of 1.235 (BDD-X) and 1.315 (MSVD), METEOR scores of 0.312 (BDD-X) and 0.329 (MSVD), and ROUGE-L scores of 0.782 (BDD-X) and 0.795 (MSVD). By producing human-like, contextually relevant descriptions, strengthening interpretability, and improving real-world applications, this research advances explainable AI.
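The reported BLEU-n numbers are modified n-gram precision scores against reference captions; for a single sentence they can be reproduced with NLTK (the tokens below are made-up examples, not from the paper's datasets):

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

ref = ["a", "car", "turns", "left", "at", "the", "intersection"]
hyp = ["the", "car", "turns", "left", "at", "an", "intersection"]
smooth = SmoothingFunction().method1            # avoids zero scores on short texts
for n in range(1, 5):
    w = tuple(1.0 / n for _ in range(n))        # uniform weights over 1..n-grams
    score = sentence_bleu([ref], hyp, weights=w, smoothing_function=smooth)
    print(f"BLEU-{n}: {score:.3f}")
```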
zh

[CV-11] Noise-Tolerant Coreset-Based Class Incremental Continual Learning

【速读】: This work addresses the need of computer vision systems to adapt to new data distributions after deployment, in the continual learning (CL) setting of class-incremental learning (CIL), where new classes are added to a classifier over time and label noise and instance noise can degrade performance. The core question is the robustness of replay-based CL methods that rely on a small coreset memory to uncorrelated instance noise, for which the paper derives a new theoretical bound. The key contribution is two new CL algorithms that construct noise-tolerant replay buffers; experiments show they substantially improve classification accuracy and reduce forgetting under both label noise and uncorrelated instance noise.

链接: https://arxiv.org/abs/2504.16763
作者: Edison Mucllari,Aswin Raghavan,Zachary Alan Daniels
机构: University of Kentucky (肯塔基大学); SRI International (SRI国际); SRI International (SRI国际)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Neural and Evolutionary Computing (cs.NE)
备注: Work-in-Progress

点击查看摘要

Abstract:Many applications of computer vision require the ability to adapt to novel data distributions after deployment. Adaptation requires algorithms capable of continual learning (CL). Continual learners must be plastic to adapt to novel tasks while minimizing forgetting of previous tasks. However, CL opens up avenues for noise to enter the training pipeline and disrupt the CL process. This work focuses on label noise and instance noise in the context of class-incremental learning (CIL), where new classes are added to a classifier over time, and there is no access to external data from past classes. We aim to understand the sensitivity of CL methods that work by replaying items from a memory constructed using the idea of Coresets. We derive a new bound for the robustness of such a method to uncorrelated instance noise under a general additive noise threat model, revealing several insights. Putting the theory into practice, we create two continual learning algorithms to construct noise-tolerant replay buffers. We empirically compare the effectiveness of prior memory-based continual learners and the proposed algorithms under label and uncorrelated instance noise on five diverse datasets. We show that existing memory-based CL methods are not robust, whereas the proposed methods exhibit significant improvements in maximizing classification accuracy and minimizing forgetting in the noisy CIL setting.
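The paper's noise-tolerant buffer constructions are not reproduced here; as a reference point, the sketch below shows a standard greedy k-center coreset, a common way to build the small replay memories this line of work studies.

```python
import numpy as np

def greedy_kcenter_coreset(features, k, seed=0):
    """Select k indices whose features cover the set (greedy k-center).

    Repeatedly adds the point farthest from the current coreset. Illustrative
    stand-in for a replay-buffer selector, not the paper's noise-tolerant method.
    """
    rng = np.random.default_rng(seed)
    chosen = [int(rng.integers(len(features)))]
    d = np.linalg.norm(features - features[chosen[0]], axis=1)
    while len(chosen) < k:
        idx = int(np.argmax(d))                 # farthest point from the coreset
        chosen.append(idx)
        d = np.minimum(d, np.linalg.norm(features - features[idx], axis=1))
    return chosen

buffer_idx = greedy_kcenter_coreset(np.random.randn(1000, 64), k=50)
```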
zh

[CV-12] Tri-FusionNet: Enhancing Image Description Generation with Transformer-based Fusion Network and Dual Attention Mechanism

【速读】: This paper addresses image description generation, a key task for improving the accessibility of visual content and AI's understanding of visual information. While recent advances in deep learning have significantly strengthened natural language processing and computer vision, existing methods still struggle to produce accurate, contextually rich, and flexible image descriptions.

The key of the proposed Tri-FusionNet is the effective combination of multimodal data through three fused modules: (1) a Vision Transformer (ViT) encoder module enhanced with a dual-attention mechanism, which attends to relevant spatial regions of the image and the linguistic context, improving image feature extraction; (2) a Robustly Optimized BERT Approach (RoBERTa) decoder module that generates precise textual descriptions; and (3) a Contrastive Language-Image Pre-Training (CLIP) integrating module that aligns visual and textual data via contrastive learning, ensuring the two modalities are combined effectively. This joint design of ViT, RoBERTa, and CLIP, aided by dual attention, enables the model to generate higher-quality, contextually relevant image descriptions. Experimental results on several datasets (Flickr30k, Flickr8k, and MS-COCO) show competitive performance, validating the effectiveness of the solution.

链接: https://arxiv.org/abs/2504.16761
作者: Lakshita Agarwal,Bindu Verma
机构: Department of Information Technology, Delhi Technological University (德里技术大学信息技术系)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Image description generation is essential for accessibility and AI understanding of visual content. Recent advancements in deep learning have significantly improved natural language processing and computer vision. In this work, we propose Tri-FusionNet, a novel image description generation model that integrates transformer modules: a Vision Transformer (ViT) encoder module with dual-attention mechanism, a Robustly Optimized BERT Approach (RoBERTa) decoder module, and a Contrastive Language-Image Pre-Training (CLIP) integrating module. The ViT encoder, enhanced with dual attention, focuses on relevant spatial regions and linguistic context, improving image feature extraction. The RoBERTa decoder is employed to generate precise textual descriptions. CLIP’s integrating module aligns visual and textual data through contrastive learning, ensuring effective combination of both modalities. This fusion of ViT, RoBERTa, and CLIP, along with dual attention, enables the model to produce more accurate, contextually rich, and flexible descriptions. The proposed framework demonstrated competitive performance on the Flickr30k and Flickr8k datasets, with BLEU scores ranging from 0.767 to 0.456 and 0.784 to 0.479, CIDEr scores of 1.679 and 1.483, METEOR scores of 0.478 and 0.358, and ROUGE-L scores of 0.567 and 0.789, respectively. On MS-COCO, the framework obtained BLEU scores of 0.893 (B-1), 0.821 (B-2), 0.794 (B-3), and 0.725 (B-4). The results demonstrate the effectiveness of Tri-FusionNet in generating high-quality image descriptions.
zh

[CV-13] Feature Mixing Approach for Detecting Intraoperative Adverse Events in Laparoscopic Roux-en-Y Gastric Bypass Surgery

【速读】: This paper addresses the challenge of detecting intraoperative adverse events (IAEs), whose rarity leads to highly imbalanced datasets that make AI-based detection and severity quantification difficult. The key of the proposed BetaMixer is a Beta distribution-based mixing approach that converts discrete IAE severity scores into continuous values for precise severity regression (on a 0-5 scale). BetaMixer also uses Beta distribution-based sampling to enhance underrepresented classes and regularizes intermediate embeddings to maintain a structured feature space. Combined with a generative approach that aligns the feature space with the sampled IAE severities, BetaMixer performs robust classification and severity regression via a transformer. Evaluated on the MultiBypass140 dataset, extended with IAE labels, BetaMixer achieves a weighted F1 score of 0.76, recall of 0.81, PPV of 0.73, and NPV of 0.84, demonstrating strong performance on imbalanced data.

链接: https://arxiv.org/abs/2504.16749
作者: Rupak Bose,Chinedu Innocent Nwoye,Jorge Lazo,Joël Lukas Lavanchy,Nicolas Padoy
机构: ICube (UMR7357 CNRS INSERM University of Strasbourg) (法国); University Digestive Health Care Center, Clarunis, Basel, Switzerland (瑞士); Dept. of Biomedical Engineering, University of Basel, Switzerland (瑞士); IHU Strasbourg (法国)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 9 pages, 7 figures, 8 tables, Release new dataset annotations

点击查看摘要

Abstract:Intraoperative adverse events (IAEs), such as bleeding or thermal injury, can lead to severe postoperative complications if undetected. However, their rarity results in highly imbalanced datasets, posing challenges for AI-based detection and severity quantification. We propose BetaMixer, a novel deep learning model that addresses these challenges through a Beta distribution-based mixing approach, converting discrete IAE severity scores into continuous values for precise severity regression (0-5 scale). BetaMixer employs Beta distribution-based sampling to enhance underrepresented classes and regularizes intermediate embeddings to maintain a structured feature space. A generative approach aligns the feature space with sampled IAE severity, enabling robust classification and severity regression via a transformer. Evaluated on the MultiBypass140 dataset, which we extended with IAE labels, BetaMixer achieves a weighted F1 score of 0.76, recall of 0.81, PPV of 0.73, and NPV of 0.84, demonstrating strong performance on imbalanced data. By integrating Beta distribution-based sampling, feature mixing, and generative modeling, BetaMixer offers a robust solution for IAE detection and quantification in clinical settings.
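One plausible reading of the Beta-distribution mixing, mapping a discrete 0-5 severity score to a continuous regression target by sampling from a Beta concentrated around it, is sketched below; the mean mapping and the concentration value are assumptions, not the paper's parameterization.

```python
import numpy as np

def beta_soften_severity(score, max_score=5, concentration=20.0, rng=None):
    """Convert a discrete severity score into a continuous target in [0, max_score].

    The score is mapped to a mean in (0, 1), and a Beta distribution concentrated
    around that mean is sampled, yielding a soft regression target. Illustrative.
    """
    rng = rng or np.random.default_rng()
    m = (score + 0.5) / (max_score + 1)          # mean strictly inside (0, 1)
    a, b = m * concentration, (1 - m) * concentration
    return float(rng.beta(a, b)) * max_score

targets = [beta_soften_severity(s) for s in [0, 2, 5]]  # e.g. ~[0.4, 2.1, 4.6]
```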
zh

[CV-14] Gaussian Splatting is an Effective Data Generator for 3D Object Detection

【速读】: This paper studies data augmentation for 3D object detection in autonomous driving. Unlike existing diffusion-based methods that synthesize images conditioned on bird's-eye-view (BEV) layouts, the proposed approach builds on 3D reconstruction via Gaussian Splatting and places 3D objects directly in the reconstructed 3D space with explicitly imposed geometric transformations, ensuring both physically plausible object placement and highly accurate 3D pose and position annotations. The key is achieving realistic and precise 3D object placement through geometric transformations rather than relying on appearance diversity or the generation of hard examples via visual occlusion, which substantially improves camera-based 3D object detection. Experiments show the method outperforms existing diffusion-based 3D augmentation techniques and reveal that geometric diversity in object placement matters more for performance than the appearance diversity of objects.

链接: https://arxiv.org/abs/2504.16740
作者: Farhad G. Zanjani,Davide Abati,Auke Wiggers,Dimitris Kalatzis,Jens Petersen,Hong Cai,Amirhossein Habibian
机构: Qualcomm Technologies, Inc. (高通公司); Qualcomm AI Research (高通人工智能研究)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:We investigate data augmentation for 3D object detection in autonomous driving. We utilize recent advancements in 3D reconstruction based on Gaussian Splatting for 3D object placement in driving scenes. Unlike existing diffusion-based methods that synthesize images conditioned on BEV layouts, our approach places 3D objects directly in the reconstructed 3D space with explicitly imposed geometric transformations. This ensures both the physical plausibility of object placement and highly accurate 3D pose and position annotations. Our experiments demonstrate that even by integrating a limited number of external 3D objects into real scenes, the augmented data significantly enhances 3D object detection performance and outperforms existing diffusion-based 3D augmentation for object detection. Extensive testing on the nuScenes dataset reveals that imposing high geometric diversity in object placement has a greater impact compared to the appearance diversity of objects. Additionally, we show that generating hard examples, either by maximizing detection loss or imposing high visual occlusion in camera images, does not lead to more efficient 3D data augmentation for camera-based 3D object detection in autonomous driving.
zh

[CV-15] Prompt-Tuning SAM: From Generalist to Specialist with only 2048 Parameters and 16 Training Images

【速读】: This paper tackles the performance drop of the Segment Anything Model (SAM) in non-natural domains such as microscopic imaging, as well as the impracticality of SAM's interactive design, which requires a precise prompt per image and object, for automated biomedical tasks. The key is PTSAM (prompt-tuned SAM), which uses prompt-tuning, a parameter-efficient fine-tuning technique, to turn SAM into a specialist for a specific downstream task with only about 2,048 additional parameters. Prompt-tuning only SAM's mask decoder already matches state-of-the-art performance while requiring roughly 2,000x fewer trainable parameters; additionally prompt-tuning the image encoder helps close domain gaps, improving segmentation accuracy by up to 18% over state-of-the-art results. Since PTSAM can be trained reliably with as few as 16 annotated images, it is particularly useful for applications with limited training data and domain shifts.

链接: https://arxiv.org/abs/2504.16739
作者: Tristan Piater,Björn Barz,Alexander Freytag
机构: Carl Zeiss AG (卡尔蔡司股份公司)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:The Segment Anything Model (SAM) is widely used for segmenting a diverse range of objects in natural images from simple user prompts like points or bounding boxes. However, SAM’s performance decreases substantially when applied to non-natural domains like microscopic imaging. Furthermore, due to SAM’s interactive design, it requires a precise prompt for each image and object, which is unfeasible in many automated biomedical applications. Previous solutions adapt SAM by training millions of parameters via fine-tuning large parts of the model or of adapter layers. In contrast, we show that as little as 2,048 additional parameters are sufficient for turning SAM into a use-case specialist for a certain downstream task. Our novel PTSAM (prompt-tuned SAM) method uses prompt-tuning, a parameter-efficient fine-tuning technique, to adapt SAM for a specific task. We validate the performance of our approach on multiple microscopic and one medical dataset. Our results show that prompt-tuning only SAM’s mask decoder already leads to a performance on-par with state-of-the-art techniques while requiring roughly 2,000x fewer trainable parameters. For addressing domain gaps, we find that additionally prompt-tuning SAM’s image encoder is beneficial, further improving segmentation accuracy by up to 18% over state-of-the-art results. Since PTSAM can be reliably trained with as little as 16 annotated images, we find it particularly helpful for applications with limited training data and domain shifts.
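The 2,048-parameter budget is consistent with, for example, eight learnable 256-dimensional prompt tokens (SAM's mask-decoder transformer width is 256). The sketch below shows the generic prompt-tuning pattern of prepending learnable tokens to a frozen token-based model; it is a hypothetical stand-in, not PTSAM's actual code.

```python
import torch
import torch.nn as nn

class PromptTuned(nn.Module):
    """Freeze a token-based model; learn only a few prompt tokens."""

    def __init__(self, frozen_model, num_prompts=8, dim=256):
        super().__init__()
        self.model = frozen_model
        for p in self.model.parameters():
            p.requires_grad = False                     # backbone stays fixed
        # 8 x 256 = 2,048 trainable parameters
        self.prompts = nn.Parameter(torch.randn(1, num_prompts, dim) * 0.02)

    def forward(self, tokens):                          # tokens: (B, T, dim)
        prompts = self.prompts.expand(tokens.shape[0], -1, -1)
        return self.model(torch.cat([prompts, tokens], dim=1))

# toy backbone standing in for a frozen decoder
backbone = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(256, 8, batch_first=True), num_layers=2)
model = PromptTuned(backbone, num_prompts=8, dim=256)
print(sum(p.numel() for p in model.parameters() if p.requires_grad))  # 2048
```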
zh

[CV-16] V2R-Bench: Holistically Evaluating LVLM Robustness to Fundamental Visual Variations

【速读】: This paper addresses the insufficient robustness of Large Vision Language Models (LVLMs) to the visual variations in position, scale, orientation, and context that objects in natural scenes inevitably exhibit due to changes in viewpoint and environment. This area remains underexplored; in particular, models that excel at complex tasks can degrade markedly on simple tasks such as object recognition. The key contribution is V²R-Bench, a comprehensive framework for evaluating the visual-variation robustness of LVLMs, including automated dataset generation and principled evaluation metrics. A component-level analysis framework, featuring a novel visualization of aligned visual features, further reveals that these robustness deficiencies stem from error accumulation in the pipeline architecture and inadequate multimodal alignment; complementary experiments with synthetic data confirm that the limitations are fundamentally architectural, underscoring the importance of architectural innovation in future LVLM designs.

链接: https://arxiv.org/abs/2504.16727
作者: Zhiyuan Fan,Yumeng Wang,Sandeep Polisetty,Yi R.(May)Fung
机构: Hong Kong University of Science and Technology (香港科技大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Large Vision Language Models (LVLMs) excel in various vision-language tasks. Yet, their robustness to visual variations in position, scale, orientation, and context that objects in natural scenes inevitably exhibit due to changes in viewpoint and environment remains largely underexplored. To bridge this gap, we introduce V²R-Bench, a comprehensive benchmark framework for evaluating Visual Variation Robustness of LVLMs, which encompasses automated evaluation dataset generation and principled metrics for thorough robustness assessment. Through extensive evaluation on 21 LVLMs, we reveal a surprising vulnerability to visual variations, in which even advanced models that excel at complex vision-language tasks significantly underperform on simple tasks such as object recognition. Interestingly, these models exhibit a distinct visual position bias that contradicts theories of effective receptive fields, and demonstrate a human-like visual acuity threshold. To identify the source of these vulnerabilities, we present a systematic framework for component-level analysis, featuring a novel visualization approach for aligned visual features. Results show that these vulnerabilities stem from error accumulation in the pipeline architecture and inadequate multimodal alignment. Complementary experiments with synthetic data further demonstrate that these limitations are fundamentally architectural deficiencies, underscoring the need for architectural innovations in future LVLM designs.
zh

[CV-17] Detecting and Understanding Hateful Contents in Memes Through Captioning and Visual Question-Answering

【速读】: This paper addresses the challenge of detecting multimodal hate content, especially hateful memes that rely on subtle or coded references. Because such content combines text and images, it evades traditional text-only or image-only detection systems. The key of the solution is a multimodal hate-detection framework that integrates several components: OCR (Optical Character Recognition) to extract embedded text, captioning to describe the visual content neutrally, sub-label classification for fine-grained categorization of hateful content, RAG (Retrieval-Augmented Generation) for contextually relevant retrieval, and VQA (Visual Question Answering) for iterative analysis of symbolic and contextual cues. This comprehensive pipeline uncovers latent signals that simpler pipelines miss, significantly improving detection performance.

链接: https://arxiv.org/abs/2504.16723
作者: Ali Anaissi,Junaid Akram,Kunal Chaturvedi,Ali Braytee
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 13 pages, 2 figures, 2025 International Conference on Computational Science

点击查看摘要

Abstract:Memes are widely used for humor and cultural commentary, but they are increasingly exploited to spread hateful content. Due to their multimodal nature, hateful memes often evade traditional text-only or image-only detection systems, particularly when they employ subtle or coded references. To address these challenges, we propose a multimodal hate detection framework that integrates key components: OCR to extract embedded text, captioning to describe visual content neutrally, sub-label classification for granular categorization of hateful content, RAG for contextually relevant retrieval, and VQA for iterative analysis of symbolic and contextual cues. This enables the framework to uncover latent signals that simpler pipelines fail to detect. Experimental results on the Facebook Hateful Memes dataset reveal that the proposed framework exceeds the performance of unimodal and conventional multimodal models in both accuracy and AUC-ROC.
zh

[CV-18] PMG: Progressive Motion Generation via Sparse Anchor Postures Curriculum Learning

【速读】: This paper addresses the challenging problem, in computer animation, game design, and human-computer interaction, of generating human motion that matches user intent with stronger controllability. Existing methods have clear limitations: text-based approaches give high-level semantic guidance but struggle to describe complex actions precisely; trajectory-based techniques are intuitive but often fail to generate precise or customized character movements; and anchor-pose-based methods are usually limited to synthesizing simple motion patterns. The key of the proposed ProMoGen (Progressive Motion Generation) framework is to combine trajectory guidance with sparse anchor motion control: global trajectories ensure consistency in spatial direction and displacement, while sparse anchor motions deliver only precise action guidance without displacement. This decoupling allows the two aspects to be refined independently, producing more controllable, high-fidelity, and sophisticated motion synthesis. To cope with the inherent instability of learning directly from sparse motions, SAP-CL (Sparse Anchor Posture Curriculum Learning) is introduced, a curriculum learning strategy that progressively adjusts the number of anchors used for guidance, enabling more precise and stable convergence. Extensive experiments show that ProMoGen excels at synthesizing vivid and diverse motions guided by predefined trajectories and arbitrary anchor frames, significantly outperforming state-of-the-art methods across multiple control scenarios.

链接: https://arxiv.org/abs/2504.16722
作者: Yingjie Xi,Jian Jun Zhang,Xiaosong Yang
机构: Bournemouth University (伯恩茅斯大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:In computer animation, game design, and human-computer interaction, synthesizing human motion that aligns with user intent remains a significant challenge. Existing methods have notable limitations: textual approaches offer high-level semantic guidance but struggle to describe complex actions accurately; trajectory-based techniques provide intuitive global motion direction yet often fall short in generating precise or customized character movements; and anchor poses-guided methods are typically confined to synthesize only simple motion patterns. To generate more controllable and precise human motions, we propose ProMoGen (Progressive Motion Generation), a novel framework that integrates trajectory guidance with sparse anchor motion control. Global trajectories ensure consistency in spatial direction and displacement, while sparse anchor motions only deliver precise action guidance without displacement. This decoupling enables independent refinement of both aspects, resulting in a more controllable, high-fidelity, and sophisticated motion synthesis. ProMoGen supports both dual and single control paradigms within a unified training process. Moreover, recognizing that direct learning from sparse motions is inherently unstable, we introduce SAP-CL (Sparse Anchor Posture Curriculum Learning), a curriculum learning strategy that progressively adjusts the number of anchors used for guidance, thereby enabling more precise and stable convergence. Extensive experiments demonstrate that ProMoGen excels in synthesizing vivid and diverse motions guided by predefined trajectory and arbitrary anchor frames. Our approach seamlessly integrates personalized motion with structured guidance, significantly outperforming state-of-the-art methods across multiple control scenarios.
zh

[CV-19] Energy-Based Pseudo-Label Refining for Source-free Domain Adaptation

【速读】: This paper targets the negative transfer in source-free domain adaptation (SFDA) caused by pseudo-labels generated from confidence levels. The key of the proposed Energy-Based Pseudo-Label Refining (EBPR) method is to generate pseudo-labels from samples' energy scores and filter them with global and class-wise energy thresholds; a contrastive learning strategy is further introduced to filter difficult samples and align them with their augmented versions, so as to learn more discriminative features. Experiments on the Office-31, Office-Home, and VisDA-C datasets show the method outperforms state-of-the-art approaches.

链接: https://arxiv.org/abs/2504.16692
作者: Xinru Meng,Han Sun,Jiamei Liu,Ningzhong Liu,Huiyu Zhou
机构: Nanjing University of Aeronautics and Astronautics (南京航空航天大学); University of Leicester (莱斯特大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 8 pages, 3 figures, accepted by PRL. code at this https URL

点击查看摘要

Abstract:Source-free domain adaptation (SFDA), which involves adapting models without access to source data, is both demanding and challenging. Existing SFDA techniques typically rely on pseudo-labels generated from confidence levels, leading to negative transfer due to significant noise. To tackle this problem, Energy-Based Pseudo-Label Refining (EBPR) is proposed for SFDA. Pseudo-labels are created for all sample clusters according to their energy scores. Global and class energy thresholds are computed to selectively filter pseudo-labels. Furthermore, a contrastive learning strategy is introduced to filter difficult samples, aligning them with their augmented versions to learn more discriminative features. Our method is validated on the Office-31, Office-Home, and VisDA-C datasets, where our model consistently outperforms state-of-the-art methods.
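The energy score in question is typically the free energy of the classifier logits, E(x) = -T·logsumexp(logits/T); lower energy indicates a more confident, in-distribution sample. A minimal sketch of global plus class-wise threshold filtering follows, with quantile/median thresholds as placeholder choices rather than the paper's rule.

```python
import torch

def energy_scores(logits, T=1.0):
    """Free energy of classifier logits; lower = more confident."""
    return -T * torch.logsumexp(logits / T, dim=1)

def filter_pseudo_labels(logits, global_q=0.5):
    """Keep pseudo-labels that pass a global and a per-class energy threshold."""
    e = energy_scores(logits)
    pseudo = logits.argmax(dim=1)
    keep = e <= torch.quantile(e, global_q)       # global energy threshold
    for c in pseudo.unique():                     # class-wise refinement
        m = pseudo == c
        keep[m] &= e[m] <= e[m].median()
    return pseudo, keep

pl, mask = filter_pseudo_labels(torch.randn(256, 31))  # e.g. Office-31: 31 classes
```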
zh

[CV-20] SemanticSugarBeets: A Multi-Task Framework and Dataset for Inspecting Harvest and Storage Characteristics of Sugar Beets CVPR

【速读】: This paper aims to address the sugar losses sugar beets suffer during post-harvest storage, caused by microorganisms in adherent soil, excess vegetation, and similar factors, by automating their visual inspection to improve quality assurance. The key of the solution is a novel high-quality annotated dataset together with a two-stage method for detection, semantic segmentation, and quality assessment of sugar beets in monocular RGB images, plus extensive ablation experiments over model architectures, encoder choices, image sizes, and environmental conditions, achieving an mAP50-95 of 98.8 for detection and an mIoU of 64.0 for the best-performing segmentation model.

链接: https://arxiv.org/abs/2504.16684
作者: Gerardus Croonen,Andreas Trondl,Julia Simon,Daniel Steininger
机构: AIT Austrian Institute of Technology (奥地利技术研究院); Center for Vision, Automation & Control (视觉、自动化与控制中心)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: Accepted at Conference on Computer Vision and Pattern Recognition Workshops (CVPRW). Code and dataset available at this https URL

点击查看摘要

Abstract:While sugar beets are stored prior to processing, they lose sugar due to factors such as microorganisms present in adherent soil and excess vegetation. Their automated visual inspection promises to aid in quality assurance and thereby increase efficiency throughout the processing chain of sugar production. In this work, we present a novel high-quality annotated dataset and two-stage method for the detection, semantic segmentation and mass estimation of post-harvest and post-storage sugar beets in monocular RGB images. We conduct extensive ablation experiments for the detection of sugar beets and their fine-grained semantic segmentation regarding damages, rot, soil adhesion and excess vegetation. For these tasks, we evaluate multiple image sizes, model architectures and encoders, as well as the influence of environmental conditions. Our experiments show an mAP50-95 of 98.8 for sugar-beet detection and an mIoU of 64.0 for the best-performing segmentation model.
zh

[CV-21] Representation Learning via Non-Contrastive Mutual Information

【速读】: This paper aims to resolve the respective limitations of contrastive (e.g., SimCLR) and non-contrastive (e.g., BYOL) methods in self-supervised representation learning. Contrastive methods learn representations by maximizing mutual information between related data points, but require large-scale pairwise comparisons, resulting in high variance and a need for large batch sizes; non-contrastive methods have lower variance and need no pairwise comparisons, but can collapse to a constant vector. The key of the solution is a new self-supervised objective, the Mutual Information Non-Contrastive (MINC) loss, obtained by converting a particular contrastive method, the Spectral Contrastive Loss, into a more general non-contrastive form: this removes the pairwise comparisons to reduce variance while keeping the mutual-information formulation of the contrastive method to prevent collapse.

链接: https://arxiv.org/abs/2504.16667
作者: Zhaohan Daniel Guo,Bernardo Avila Pires,Khimya Khetarpal,Dale Schuurmans,Bo Dai
机构: Google DeepMind
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (stat.ML)
备注:

点击查看摘要

Abstract:Labeling data is often very time consuming and expensive, leaving us with a majority of unlabeled data. Self-supervised representation learning methods such as SimCLR (Chen et al., 2020) or BYOL (Grill et al., 2020) have been very successful at learning meaningful latent representations from unlabeled image data, resulting in much more general and transferable representations for downstream tasks. Broadly, self-supervised methods fall into two types: 1) Contrastive methods, such as SimCLR; and 2) Non-Contrastive methods, such as BYOL. Contrastive methods are generally trying to maximize mutual information between related data points, so they need to compare every data point to every other data point, resulting in high variance, and thus requiring large batch sizes to work well. Non-contrastive methods like BYOL have much lower variance as they do not need to make pairwise comparisons, but are much trickier to implement as they have the possibility of collapsing to a constant vector. In this paper, we aim to develop a self-supervised objective that combines the strength of both types. We start with a particular contrastive method called the Spectral Contrastive Loss (HaoChen et al., 2021; Lu et al., 2024), and we convert it into a more general non-contrastive form; this removes the pairwise comparisons resulting in lower variance, but keeps the mutual information formulation of the contrastive method preventing collapse. We call our new objective the Mutual Information Non-Contrastive (MINC) loss. We test MINC by learning image representations on ImageNet (similar to SimCLR and BYOL) and show that it consistently improves upon the Spectral Contrastive loss baseline.
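For reference, the Spectral Contrastive Loss that MINC starts from has the closed form L = -2·E[f(x)ᵀf(x⁺)] + E[(f(x)ᵀf(x′))²], where x⁺ is an augmented positive and x′ an independent sample. A straightforward batch estimate looks like this:

```python
import torch

def spectral_contrastive_loss(z1, z2):
    """Batch estimate of the spectral contrastive loss (HaoChen et al., 2021).

    z1, z2: (B, D) embeddings of two augmented views of the same images;
    matching rows are positives, off-diagonal pairs are negatives.
    """
    pos = -2.0 * (z1 * z2).sum(dim=1).mean()        # pull positives together
    sim = z1 @ z2.t()                               # all pairwise similarities
    b = z1.shape[0]
    off = sim[~torch.eye(b, dtype=torch.bool)]      # off-diagonal negatives
    neg = (off ** 2).mean()                         # squared push-apart term
    return pos + neg

loss = spectral_contrastive_loss(torch.randn(128, 256), torch.randn(128, 256))
```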
zh

[CV-22] A Diff-Attention Aware State Space Fusion Model for Remote Sensing Classification

【速读】: This paper addresses the feature redundancy that arises when fusing multispectral (MS) and panchromatic (PAN) remote sensing images, and the difficulty of effectively integrating separated features with large semantic differences. The key is a diff-attention aware state space fusion model (DAS2F-Model) built on a selective state space model, with two core modules: a cross-modal diff-attention module (CMDA-Module) and an attention-aware linear fusion module (AALF-Module). The CMDA-Module, in which a space preserving visual Mamba (SPVM) reasonably optimizes the input to retain spatial features and capture local features, extracts and separates the common features of MS and PAN images and their respective dominant features; the AALF-Module then performs pixel-wise linear fusion by computing influence coefficients, a mechanism that can fuse features with large semantic differences while keeping feature sizes unchanged.

链接: https://arxiv.org/abs/2504.16665
作者: Wenping Ma,Boyou Xue,Mengru Ma,Chuang Chen,Hekai Zhang,Hao Zhu
机构: Key Laboratory of Intelligent Perception and Image Understanding of Ministry of Education (教育部智能感知与图像理解重点实验室); International Research Center for Intelligent Perception and Computation (智能感知与计算国际研究中心); School of Artificial Intelligence (人工智能学院), Xidian University (西安电子科技大学), China (中国)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 12 pages,9 figures

点击查看摘要

Abstract:Multispectral (MS) and panchromatic (PAN) images describe the same land surface, so these images not only have their own advantages, but also share a great deal of similar information. To separate this shared information from each modality's distinctive advantages and to reduce feature redundancy in the fusion stage, this paper introduces a diff-attention aware state space fusion model (DAS2F-Model) for multimodal remote sensing image classification. Based on the selective state space model, a cross-modal diff-attention module (CMDA-Module) is designed to extract and separate the common features and their respective dominant features of MS and PAN images. Among this, space preserving visual mamba (SPVM) retains image spatial features and captures local features by optimizing visual mamba’s input reasonably. Considering that features in the fusion stage will have large semantic differences after feature separation and simple fusion operations struggle to effectively integrate these significantly different features, an attention-aware linear fusion module (AALF-Module) is proposed. It performs pixel-wise linear fusion by calculating influence coefficients. This mechanism can fuse features with large semantic differences while keeping the feature size unchanged. Empirical evaluations indicate that the presented method achieves better results than alternative approaches. The relevant code can be found at: this https URL
zh

[CV-23] A Time Series Dataset of NIR Spectra and RGB and NIR-HSI Images of the Barley Germination Process

【速读】: This paper builds an open-source dataset of RGB and near-infrared hyperspectral imaging (NIR-HSI) data of the germination process of barley kernels, combined with segmentation masks and NIR spectra, for time-series analysis. The key is that each kernel's germination status (germinated or not) is labeled at every image acquisition, and black filter paper is used as the background to enable simple intensity-threshold segmentation (e.g., Otsu's method), supporting germination time-series studies based on RGB image analysis, NIR spectral analysis, NIR-HSI analysis, or combinations thereof.

链接: https://arxiv.org/abs/2504.16658
作者: Ole-Christian Galbo Engstrøm,Erik Schou Dreier,Birthe Møller Jespersen,Kim Steenstrup Pedersen
机构: FOSS Analytical A/S (FOSS Analytical A/S); Department of Computer Science (DIKU), University of Copenhagen (哥本哈根大学计算机科学系); Department of Food Science (UCPH FOOD), University of Copenhagen (哥本哈根大学食品科学系); University College Lillebælt (UCL) (大学学院雷尔贝特); Natural History Museum of Denmark (NHMD), University of Copenhagen (哥本哈根大学自然历史博物馆)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:We provide an open-source dataset of RGB and NIR-HSI (near-infrared hyperspectral imaging) images with associated segmentation masks and NIR spectra of 2242 individual malting barley kernels. We imaged every kernel pre-exposure to moisture and every 24 hours after exposure to moisture for five consecutive days. Every barley kernel was labeled as germinated or not germinated during each image acquisition. The barley kernels were imaged with black filter paper as the background, facilitating straight-forward intensity threshold-based segmentation, e.g., by Otsu’s method. This dataset facilitates time series analysis of germination time for barley kernels using either RGB image analysis, NIR spectral analysis, NIR-HSI analysis, or a combination hereof.
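Since the kernels are imaged on black filter paper, segmentation reduces to a global intensity threshold; a minimal example with scikit-image follows (the toy array stands in for a grayscale acquisition):

```python
import numpy as np
from skimage.filters import threshold_otsu
from skimage.measure import label, regionprops

def segment_kernels(gray_image):
    """Threshold bright kernels against a dark background; one region per kernel."""
    t = threshold_otsu(gray_image)          # Otsu's global threshold
    mask = gray_image > t                   # kernels are brighter than the paper
    return regionprops(label(mask))         # connected-component regions

# toy image: two bright blobs on a dark background
img = np.zeros((64, 64))
img[10:20, 10:20] = 0.9
img[40:50, 40:52] = 0.8
print(len(segment_kernels(img)))            # -> 2
```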
zh

[CV-24] Skywork R1V2: Multimodal Hybrid Reinforcement Learning for Reasoning

【速读】: This paper tackles the long-standing difficulty for multimodal reasoning models of balancing sophisticated reasoning ability with broad generalization. Skywork R1V2 introduces a hybrid reinforcement learning paradigm that harmonizes reward-model guidance with rule-based strategies. To further improve training efficiency and overcome the "Vanishing Advantages" dilemma in Group Relative Policy Optimization (GRPO), a Selective Sample Buffer (SSB) mechanism is proposed that prioritizes high-value samples throughout optimization. The paper also observes that excessive reinforcement signals can induce visual hallucinations, which are systematically monitored and mitigated via calibrated reward thresholds. The key lies in the innovative hybrid reinforcement learning paradigm together with the effective design of the SSB mechanism.

链接: https://arxiv.org/abs/2504.16656
作者: Chris,Yichen Wei,Yi Peng,Xiaokun Wang,Weijie Qiu,Wei Shen,Tianyidan Xie,Jiangbo Pei,Jianhao Zhang,Yunzhuo Hao,Xuchen Song,Yang Liu,Yahui Zhou
机构: Skywork AI (Skywork AI); Kunlun Inc. (昆仑万维股份有限公司)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:We present Skywork R1V2, a next-generation multimodal reasoning model and a major leap forward from its predecessor, Skywork R1V. At its core, R1V2 introduces a hybrid reinforcement learning paradigm that harmonizes reward-model guidance with rule-based strategies, thereby addressing the long-standing challenge of balancing sophisticated reasoning capabilities with broad generalization. To further enhance training efficiency, we propose the Selective Sample Buffer (SSB) mechanism, which effectively counters the “Vanishing Advantages” dilemma inherent in Group Relative Policy Optimization (GRPO) by prioritizing high-value samples throughout the optimization process. Notably, we observe that excessive reinforcement signals can induce visual hallucinations–a phenomenon we systematically monitor and mitigate through calibrated reward thresholds throughout the training process. Empirical results affirm the exceptional capability of R1V2, with benchmark-leading performances such as 62.6 on OlympiadBench, 79.0 on AIME2024, 63.6 on LiveCodeBench, and 74.0 on MMMU. These results underscore R1V2’s superiority over existing open-source models and demonstrate significant progress in closing the performance gap with premier proprietary systems, including Gemini 2.5 and OpenAI o4-mini. The Skywork R1V2 model weights have been publicly released to promote openness and reproducibility this https URL.
zh

[CV-25] WiFi based Human Fall and Activity Recognition using Transformer based Encoder Decoder and Graph Neural Networks

【速读】: This paper addresses human pose estimation and action recognition from WiFi Channel State Information (CSI). The key of the solution is a novel architecture, the Transformer-based Encoder Decoder Network (TED Net), which combines convolutional encoders with transformer-based attention to capture spatiotemporal features from CSI signals and accurately estimate human skeleton poses. The estimated poses are then fed as input to a customized Directed Graph Neural Network (DGNN) for action classification. Experiments show that TED Net outperforms existing pose-estimation approaches and that CSI-based skeletons support reliable action classification with performance comparable to RGB-based methods, remaining robust in both fall and non-fall scenarios. This highlights the potential of CSI-driven skeleton estimation for action recognition, particularly in privacy-sensitive home settings such as elderly fall detection, where WiFi offers clear advantages over vision-based approaches.

链接: https://arxiv.org/abs/2504.16655
作者: Younggeol Cho,Elisa Motta,Olivia Nocentini,Marta Lagomarsino,Andrea Merello,Marco Crepaldi,Arash Ajoudani
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 8 pages, 4 figures

点击查看摘要

Abstract:Human pose estimation and action recognition have received attention due to their critical roles in healthcare monitoring, rehabilitation, and assistive technologies. In this study, we proposed a novel architecture named Transformer based Encoder Decoder Network (TED Net) designed for estimating human skeleton poses from WiFi Channel State Information (CSI). TED Net integrates convolutional encoders with transformer based attention mechanisms to capture spatiotemporal features from CSI signals. The estimated skeleton poses were used as input to a customized Directed Graph Neural Network (DGNN) for action recognition. We validated our model on two datasets: a publicly available multi modal dataset for assessing general pose estimation, and a newly collected dataset focused on fall related scenarios involving 20 participants. Experimental results demonstrated that TED Net outperformed existing approaches in pose estimation, and that the DGNN achieves reliable action classification using CSI based skeletons, with performance comparable to RGB based systems. Notably, TED Net maintains robust performance across both fall and non fall cases. These findings highlight the potential of CSI driven human skeleton estimation for effective action recognition, particularly in home environments such as elderly fall detection. In such settings, WiFi signals are often readily available, offering a privacy preserving alternative to vision based methods, which may raise concerns about continuous camera monitoring.
zh

[CV-26] SSLR: A Semi-Supervised Learning Method for Isolated Sign Language Recognition

【速读】: This paper addresses the scarcity of annotated data in sign language recognition (SLR) systems. The key of the proposed semi-supervised learning approach for SLR (SSLR) is a pseudo-label method used to annotate unlabeled samples. Sign gestures are represented by pose information encoding the signer's skeletal joint points, which is fed as input to the Transformer backbone model used in the approach. Experiments show that, compared with a fully supervised learning-based model, the SSL approach performs better on the WLASL-100 dataset with less labeled data in many cases.

链接: https://arxiv.org/abs/2504.16640
作者: Hasan Algafri,Hamzah Luqman,Sarah Alyami,Issam Laradji
机构: Information and Computer Science Department, King Fahd University of Petroleum and Minerals (国王法赫德石油矿产大学信息与计算机科学系); SDAIA–KFUPM Joint Research Center for Artificial Intelligence (SDAIA–KFUPM人工智能联合研究中心); Computing Department, Imam Abdulrahman Bin Faisal University (伊玛目阿卜杜勒拉赫曼本费萨尔大学计算系); ServiceNow (ServiceNow)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Sign language is the primary communication language for people with disabling hearing loss. Sign language recognition (SLR) systems aim to recognize sign gestures and translate them into spoken language. One of the main challenges in SLR is the scarcity of annotated datasets. To address this issue, we propose a semi-supervised learning (SSL) approach for SLR (SSLR), employing a pseudo-label method to annotate unlabeled samples. The sign gestures are represented using pose information that encodes the signer’s skeletal joint points. This information is used as input for the Transformer backbone model utilized in the proposed approach. To demonstrate the learning capabilities of SSL across various labeled data sizes, several experiments were conducted using different percentages of labeled data with varying numbers of classes. The performance of the SSL approach was compared with a fully supervised learning-based model on the WLASL-100 dataset. The obtained results of the SSL model outperformed the supervised learning-based model with less labeled data in many cases.
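A minimal, FixMatch-style confidence-thresholded pseudo-labeling step is sketched below to illustrate the general recipe; the threshold, equal loss weighting, and toy pose-vector model are assumptions, not SSLR's exact pipeline.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def pseudo_label_loss(model, x_labeled, y, x_unlabeled, tau=0.95):
    """Supervised loss plus a pseudo-label loss on confident unlabeled samples."""
    sup = F.cross_entropy(model(x_labeled), y)
    with torch.no_grad():
        probs = F.softmax(model(x_unlabeled), dim=1)
        conf, pseudo = probs.max(dim=1)
        mask = conf >= tau                     # keep only confident pseudo-labels
    if mask.any():
        return sup + F.cross_entropy(model(x_unlabeled[mask]), pseudo[mask])
    return sup

# toy: flattened pose keypoints (e.g., 34 joints x 2 coords) -> 100 sign classes
model = nn.Linear(68, 100)
loss = pseudo_label_loss(model, torch.randn(4, 68),
                         torch.randint(0, 100, (4,)), torch.randn(16, 68))
```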
zh

[CV-27] RouteWinFormer: A Route-Window Transformer for Middle-range Attention in Image Restoration

【速读】: This paper addresses the unnecessarily high computational overhead of long-range attention in Transformer models for image restoration: degradation and context are usually localized, so middle-range attention is sufficient. The key of the proposed RouteWinFormer, a novel window-based Transformer, is a Route-Windows Attention module that dynamically selects relevant nearby windows based on regional similarity for attention aggregation, efficiently extending the receptive field to a middle range. In addition, Multi-Scale Structure Regularization is introduced during training, so that the sub-scales of the U-shaped network focus on structural information while the original scale learns degradation patterns from generalized image structure priors. Experiments show that RouteWinFormer outperforms state-of-the-art methods across 9 datasets on various image restoration tasks.

链接: https://arxiv.org/abs/2504.16637
作者: Qifan Li,Tianyi Liang,Xingtao Wang,Xiaopeng Fan
机构: Harbin Institute of Technology (哈尔滨工业大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Transformer models have recently garnered significant attention in image restoration due to their ability to capture long-range pixel dependencies. However, long-range attention often results in computational overhead without practical necessity, as degradation and context are typically localized. Normalized average attention distance across various degradation datasets shows that middle-range attention is enough for image restoration. Building on this insight, we propose RouteWinFormer, a novel window-based Transformer that models middle-range context for image restoration. RouteWinFormer incorporates a Route-Windows Attention Module, which dynamically selects relevant nearby windows based on regional similarity for attention aggregation, extending the receptive field to a mid-range size efficiently. In addition, we introduce Multi-Scale Structure Regularization during training, enabling the sub-scale of the U-shaped network to focus on structural information, while the original-scale learns degradation patterns based on generalized image structure priors. Extensive experiments demonstrate that RouteWinFormer outperforms state-of-the-art methods across 9 datasets in various image restoration tasks.
zh

[CV-28] Dual-Camera All-in-Focus Neural Radiance Fields

【速读】: This paper addresses the synthesis of an all-in-focus neural radiance field (NeRF) from smartphone dual-camera input, without manual refocusing. Without refocusing, the camera auto-focuses on a fixed object across all views, and existing single-camera NeRF methods fail due to consistent defocus blur and the lack of a sharp reference. The key is to exploit the smartphone's dual cameras: the ultra-wide camera has a wider depth-of-field (DoF) while the main camera offers higher-resolution detail. The dual-camera images are first aligned via spatial warping and color matching; a defocus-aware fusion module with learnable defocus parameters then predicts a defocus map and fuses the aligned camera pair. The method not only contributes a multi-view dataset of paired main and ultra-wide smartphone images, but experiments also show that the proposed DC-NeRF produces high-quality all-in-focus novel views and supports DoF applications with adjustable blur intensity and focal plane, including refocusing and split-diopter effects.

链接: https://arxiv.org/abs/2504.16636
作者: Xianrui Luo,Zijin Wu,Juewen Peng,Huiqiang Sun,Zhiguo Cao,Guosheng Lin
机构: Key Laboratory of Image Processing and Intelligent Control, Ministry of Education (图像处理与智能控制重点实验室, 教育部); School of Artificial Intelligence and Automation, Huazhong University of Science and Technology (华中科技大学人工智能与自动化学院, 武汉, 430074, 中国); Nanyang Technological University (NTU) (南洋理工大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Published by IEEE TPAMI 2025

点击查看摘要

Abstract:We present the first framework capable of synthesizing the all-in-focus neural radiance field (NeRF) from inputs without manual refocusing. Without refocusing, the camera will automatically focus on the fixed object for all views, and current NeRF methods typically using one camera fail due to the consistent defocus blur and a lack of sharp reference. To restore the all-in-focus NeRF, we introduce the dual-camera from smartphones, where the ultra-wide camera has a wider depth-of-field (DoF) and the main camera possesses a higher resolution. The dual camera pair saves the high-fidelity details from the main camera and uses the ultra-wide camera’s deep DoF as reference for all-in-focus restoration. To this end, we first implement spatial warping and color matching to align the dual camera, followed by a defocus-aware fusion module with learnable defocus parameters to predict a defocus map and fuse the aligned camera pair. We also build a multi-view dataset that includes image pairs of the main and ultra-wide cameras in a smartphone. Extensive experiments on this dataset verify that our solution, termed DC-NeRF, can produce high-quality all-in-focus novel views and compares favorably against strong baselines quantitatively and qualitatively. We further show DoF applications of DC-NeRF with adjustable blur intensity and focal plane, including refocusing and split diopter.
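The learned fusion module is presumably richer than this, but a pixel-wise blend driven by the predicted defocus map conveys the core idea; the convention that 1 means "heavily defocused in the main camera" is an assumption.

```python
import torch

def defocus_fuse(main, wide_aligned, defocus):
    """Blend the aligned ultra-wide frame into the main frame where it is defocused.

    main, wide_aligned: (B, 3, H, W) aligned camera pair.
    defocus: (B, 1, H, W) in [0, 1]; here 1 = heavily defocused in the main camera
    (in the paper this would come from the learned defocus map).
    """
    return (1 - defocus) * main + defocus * wide_aligned

out = defocus_fuse(torch.rand(1, 3, 64, 64), torch.rand(1, 3, 64, 64),
                   torch.rand(1, 1, 64, 64))
```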
zh

[CV-29] EHGCN: Hierarchical Euclidean-Hyperbolic Fusion via Motion-Aware GCN for Hybrid Event Stream Perception

【速读】: This paper addresses two shortcomings of existing GNN-based event-camera perception methods: working in pure Euclidean space, they struggle to capture long-range dependencies and cannot effectively characterize the inherent hierarchical structure of non-uniformly distributed event streams. The key of the proposed EHGCN is to perceive event streams jointly in Euclidean and hyperbolic spaces. Specifically, an adaptive sampling strategy dynamically regulates sampling rates, retaining discriminative events while attenuating chaotic noise; a Markov Vector Field (MVF)-driven motion-aware hyperedge generation method, based on motion state-transition probabilities, eliminates spurious cross-target associations and provides critical topological priors while capturing long-range dependencies between events; finally, a Euclidean-Hyperbolic GCN fuses information aggregated locally in Euclidean space with information modeled globally and hierarchically in hyperbolic space, achieving hybrid event perception. Experimental results on event perception tasks such as object detection and recognition validate the effectiveness of the approach.

链接: https://arxiv.org/abs/2504.16616
作者: Haosheng Chen,Lian Luo,Mengjingcheng Mo,Zhanjie Wu,Guobao Xiao,Ji Gan,Jiaxu Leng,Xinbo Gao
机构: Chongqing University of Post and Telecommunications(重庆邮电大学); Tongji University(同济大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Event cameras, with microsecond temporal resolution and high dynamic range (HDR) characteristics, emit high-speed event stream for perception tasks. Despite the recent advancement in GNN-based perception methods, they are prone to use straightforward pairwise connectivity mechanisms in the pure Euclidean space where they struggle to capture long-range dependencies and fail to effectively characterize the inherent hierarchical structures of non-uniformly distributed event stream. To this end, in this paper we propose a novel approach named EHGCN, a pioneering approach that perceives event streams in both Euclidean and hyperbolic spaces for event vision. In EHGCN, we introduce an adaptive sampling strategy to dynamically regulate sampling rates, retaining discriminative events while attenuating chaotic noise. Then we present a Markov Vector Field (MVF)-driven motion-aware hyperedge generation method based on motion state transition probabilities, thereby eliminating cross-target spurious associations and providing critical topological priors while capturing long-range dependencies between events. Finally, we propose a Euclidean-Hyperbolic GCN to fuse the information locally aggregated and globally hierarchically modeled in Euclidean and hyperbolic spaces, respectively, to achieve hybrid event perception. Experimental results on event perception tasks such as object detection and recognition validate the effectiveness of our approach.
zh

[CV-30] Federated EndoViT: Pretraining Vision Transformers via Federated Learning on Endoscopic Image Collections

【速读】: This paper addresses the difficulty of collaborative model training in minimally invasive surgery caused by data-sharing restrictions, using federated learning (FL) to enable joint training without data transfer. The key of the solution is the FL-EndoViT model, which combines the Masked Autoencoder (MAE) with adaptive Sharpness-Aware Minimization (FedSAM) and Stochastic Weight Averaging (SWA). The model is first pretrained on the Endo700k dataset collection and then fine-tuned and evaluated on downstream tasks including semantic segmentation, action triplet recognition, and surgical phase recognition. Results show that integrating FedSAM reduces the per-patch reconstruction loss and improves performance, with particular advantages in surgical scene segmentation under limited data and in action triplet recognition with large datasets, where it matches or exceeds the centralized approach. This indicates that federated learning offers an effective route to privacy-preserving training of surgical foundation models, with the core being methodological adaptations, such as integrating FedSAM, that accommodate inter-institutional data heterogeneity.

链接: https://arxiv.org/abs/2504.16612
作者: Max Kirchner,Alexander C. Jenke,Sebastian Bodenstedt,Fiona R. Kolbinger,Oliver Saldanha,Jakob N. Kather,Martin Wagner,Stefanie Speidel
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: Preprint submitted to MEDIA

点击查看摘要

Abstract:Purpose: In this study, we investigate the training of foundation models using federated learning to address data-sharing limitations and enable collaborative model training without data transfer for minimally invasive surgery. Methods: Inspired by the EndoViT study, we adapt the Masked Autoencoder for federated learning, enhancing it with adaptive Sharpness-Aware Minimization (FedSAM) and Stochastic Weight Averaging (SWA). Our model is pretrained on the Endo700k dataset collection and later fine-tuned and evaluated for tasks such as Semantic Segmentation, Action Triplet Recognition, and Surgical Phase Recognition. Results: Our findings demonstrate that integrating adaptive FedSAM into the federated MAE approach improves pretraining, leading to a reduction in reconstruction loss per patch. The application of FL-EndoViT in surgical downstream tasks results in performance comparable to CEN-EndoViT. Furthermore, FL-EndoViT exhibits advantages over CEN-EndoViT in surgical scene segmentation when data is limited and in action triplet recognition when large datasets are used. Conclusion: These findings highlight the potential of federated learning for privacy-preserving training of surgical foundation models, offering a robust and generalizable solution for surgical data science. Effective collaboration requires adapting federated learning methods, such as the integration of FedSAM, which can accommodate the inherent data heterogeneity across institutions. In future, exploring FL in video-based models may enhance these capabilities by incorporating spatiotemporal dynamics crucial for real-world surgical environments.
zh

[CV-31] HUG: Hierarchical Urban Gaussian Splatting with Block-Based Reconstruction

【速读】: This paper aims to overcome inefficiencies in handling large-scale urban environments and intricate details in scene reconstruction and rendering based on 3D Gaussian Splatting. The key of the proposed HUG is a hierarchical neural Gaussian representation that optimizes data partitioning and the reconstruction pipeline. Specifically, an enhanced block-based reconstruction pipeline focuses on improving reconstruction quality within each block while reducing the need for redundant training regions around block boundaries. By integrating the neural Gaussian representation with a hierarchical architecture, the method achieves high-quality scene rendering at low computational cost, as demonstrated by state-of-the-art results on public benchmarks, confirming its effectiveness and advantages for large-scale urban scene representation.

链接: https://arxiv.org/abs/2504.16606
作者: Zhongtao Wang,Mai Su,Huishan Au,Yilong Li,Xizhe Cao,Chengwei Pan,Yisong Chen,Guoping Wang
机构: Peking University (北京大学); Beihang University (北京航空航天大学)
类目: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:As urban 3D scenes become increasingly complex and the demand for high-quality rendering grows, efficient scene reconstruction and rendering techniques become crucial. We present HUG, a novel approach to address inefficiencies in handling large-scale urban environments and intricate details based on 3D Gaussian splatting. Our method optimizes data partitioning and the reconstruction pipeline by incorporating a hierarchical neural Gaussian representation. We employ an enhanced block-based reconstruction pipeline focusing on improving reconstruction quality within each block and reducing the need for redundant training regions around block boundaries. By integrating neural Gaussian representation with a hierarchical architecture, we achieve high-quality scene rendering at a low computational cost. This is demonstrated by our state-of-the-art results on public benchmarks, which prove the effectiveness and advantages in large-scale urban scene representation.
zh

[CV-32] JEPA for RL: Investigating Joint-Embedding Predictive Architectures for Reinforcement Learning

【速读】: This paper investigates how to adapt Joint-Embedding Predictive Architectures (JEPA) to image-based reinforcement learning, and discusses the phenomenon of model collapse and how to prevent it. The key lies in proposing effective strategies to avoid the representation degradation that collapse causes during learning, with exemplary experimental results provided on the classical Cart Pole task.

链接: https://arxiv.org/abs/2504.16591
作者: Tristan Kenneweg,Philip Kenneweg,Barbara Hammer
机构: University of Bielefeld (比勒费尔德大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Published at ESANN 2025

点击查看摘要

Abstract:Joint-Embedding Predictive Architectures (JEPA) have recently become popular as promising architectures for self-supervised learning. Vision transformers have been trained using JEPA to produce embeddings from images and videos, which have been shown to be highly suitable for downstream tasks like classification and segmentation. In this paper, we show how to adapt the JEPA architecture to reinforcement learning from images. We discuss model collapse, show how to prevent it, and provide exemplary data on the classical Cart Pole task.
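The standard collapse-prevention recipe in this setting, predicting the embedding of a stop-gradient EMA target encoder rather than raw pixels, looks roughly like this in minimal form (the 4-dimensional Cart Pole state stands in for an image embedding; all sizes are placeholders):

```python
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F

enc = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 32))
target = copy.deepcopy(enc)                     # EMA target encoder (no gradients)
pred = nn.Linear(32 + 1, 32)                    # predicts next embedding from (z, action)
opt = torch.optim.Adam(list(enc.parameters()) + list(pred.parameters()), lr=1e-3)

def jepa_step(obs, action, next_obs, m=0.99):
    z = enc(obs)
    with torch.no_grad():
        z_next = target(next_obs)               # stop-gradient target prevents collapse
    z_hat = pred(torch.cat([z, action], dim=1))
    loss = F.mse_loss(F.normalize(z_hat, dim=1), F.normalize(z_next, dim=1))
    opt.zero_grad(); loss.backward(); opt.step()
    for p, q in zip(enc.parameters(), target.parameters()):
        q.data.mul_(m).add_(p.data, alpha=1 - m)    # EMA update of the target
    return loss.item()

jepa_step(torch.randn(8, 4), torch.randn(8, 1), torch.randn(8, 4))
```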
zh

[CV-33] CountingDINO: A Training-free Pipeline for Class-Agnostic Counting using Unsupervised Backbones

【速读】: This paper addresses class-agnostic counting (CAC), which aims to estimate the number of objects in images without being restricted to predefined categories. Current exemplar-based CAC methods are flexible at inference time, but training still relies heavily on labeled data, limiting scalability and generalization to downstream use cases. The key of the proposed CountingDINO is a training-free, exemplar-based CAC framework built on a fully unsupervised feature extractor. Specifically, self-supervised vision-only backbones extract object-aware features, and no annotated data is needed anywhere in the proposed pipeline. At inference time, latent object prototypes are extracted from DINO features via ROI-Align and used as convolutional kernels to generate similarity maps, which a simple yet effective normalization scheme turns into density maps. On the FSC-147 benchmark the method outperforms a baseline under the same label-free setting, and achieves competitive (in some cases superior) results compared with training-free approaches relying on supervised backbones as well as several fully supervised state-of-the-art methods, demonstrating that training-free CAC can be both scalable and competitive.

链接: https://arxiv.org/abs/2504.16570
作者: Giacomo Pacini,Lorenzo Bianchi,Luca Ciampi,Nicola Messina,Giuseppe Amato,Fabrizio Falchi
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 13 pages, 2 figures, 2 tables. Project website: this https URL

点击查看摘要

Abstract:Class-agnostic counting (CAC) aims to estimate the number of objects in images without being restricted to predefined categories. However, while current exemplar-based CAC methods offer flexibility at inference time, they still rely heavily on labeled data for training, which limits scalability and generalization to many downstream use cases. In this paper, we introduce CountingDINO, the first training-free exemplar-based CAC framework that exploits a fully unsupervised feature extractor. Specifically, our approach employs self-supervised vision-only backbones to extract object-aware features, and it eliminates the need for annotated data throughout the entire proposed pipeline. At inference time, we extract latent object prototypes via ROI-Align from DINO features and use them as convolutional kernels to generate similarity maps. These are then transformed into density maps through a simple yet effective normalization scheme. We evaluate our approach on the FSC-147 benchmark, where we outperform a baseline under the same label-free setting. Our method also achieves competitive – and in some cases superior – results compared to training-free approaches relying on supervised backbones, as well as several fully supervised state-of-the-art methods. This demonstrates that training-free CAC can be both scalable and competitive. Website: this https URL
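The core mechanism, ROI-pooling an exemplar into a prototype and sliding it over the feature map as a convolution kernel, can be sketched as follows; the 3x3 prototype size and the crude sum-to-one normalization are placeholders for the paper's scheme.

```python
import torch
import torch.nn.functional as F
from torchvision.ops import roi_align

def similarity_map(feats, box):
    """Correlate one exemplar prototype against a feature map (illustrative).

    feats: (1, C, H, W) backbone features; box: (x1, y1, x2, y2) in feature coords.
    """
    rois = torch.cat([torch.zeros(1, 1),                    # batch index 0
                      torch.tensor([box], dtype=torch.float)], dim=1)
    proto = roi_align(feats, rois, output_size=(3, 3))      # (1, C, 3, 3) prototype
    sim = F.conv2d(feats, proto, padding=1)                 # prototype as conv kernel
    sim = sim - sim.min()
    return sim / (sim.sum() + 1e-8)                         # density-style normalization

density = similarity_map(torch.randn(1, 64, 32, 32), (4.0, 4.0, 10.0, 10.0))
```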
zh

[CV-34] SAIP-Net: Enhancing Remote Sensing Image Segmentation via Spectral Adaptive Information Propagation

【速读】: This paper addresses imprecise spatial boundaries and insufficient intra-class feature consistency in semantic segmentation of remote sensing imagery, challenges that conventional hierarchical models struggle with. To overcome the limitations of spatial-domain feature fusion and insufficient receptive fields, the key of the proposed frequency-aware segmentation framework SAIP-Net is a spectral adaptive information propagation mechanism: adaptive frequency filtering combined with multi-scale receptive-field enhancement effectively suppresses intra-class feature inconsistencies and sharpens boundaries. Experimental results show significant improvements over state-of-the-art methods, confirming the effectiveness of spectral-adaptive strategies combined with expanded receptive fields for remote sensing image segmentation.

链接: https://arxiv.org/abs/2504.16564
作者: Zhongtao Wang,Xizhe Cao,Yisong Chen,Guoping Wang
机构: Peking University (北京大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
备注:

点击查看摘要

Abstract:Semantic segmentation of remote sensing imagery demands precise spatial boundaries and robust intra-class consistency, challenging conventional hierarchical models. To address limitations arising from spatial domain feature fusion and insufficient receptive fields, this paper introduces SAIP-Net, a novel frequency-aware segmentation framework that leverages Spectral Adaptive Information Propagation. SAIP-Net employs adaptive frequency filtering and multi-scale receptive field enhancement to effectively suppress intra-class feature inconsistencies and sharpen boundary lines. Comprehensive experiments demonstrate significant performance improvements over state-of-the-art methods, highlighting the effectiveness of spectral-adaptive strategies combined with expanded receptive fields for remote sensing image segmentation.
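A minimal version of frequency-domain adaptive filtering of feature maps, a learnable per-frequency gate applied in the real-FFT domain, is shown below; the sigmoid gating design is an assumption, not SAIP-Net's actual module.

```python
import torch
import torch.nn as nn

class FrequencyGate(nn.Module):
    """Learnable per-channel, per-frequency filter on feature maps (illustrative)."""

    def __init__(self, channels, h, w):
        super().__init__()
        # one gate per channel and real-FFT frequency bin
        self.gate = nn.Parameter(torch.ones(1, channels, h, w // 2 + 1))

    def forward(self, x):                           # x: (B, C, H, W)
        spec = torch.fft.rfft2(x, norm="ortho")
        spec = spec * torch.sigmoid(self.gate)      # attenuate/keep frequencies
        return torch.fft.irfft2(spec, s=x.shape[-2:], norm="ortho")

y = FrequencyGate(16, 64, 64)(torch.randn(2, 16, 64, 64))
```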
zh

[CV-35] Beyond Anonymization: Object Scrubbing for Privacy-Preserving 2D and 3D Vision Tasks ICCV2025

【速读】: This paper addresses how to remove sensitive objects during privacy-preserving dataset obfuscation while retaining as much dataset utility as possible. The key of the proposed ROAR (Robust Object Removal and Re-annotation), a scalable framework, is to combine instance segmentation with generative inpainting, removing identifiable entities outright rather than modifying them, which preserves scene integrity while significantly improving dataset usability. Experiments show that on 2D COCO detection, ROAR retains 87.5% of baseline AP versus 74.2% for image dropping, with the advantage especially pronounced for small objects; in NeRF-based 3D reconstruction, the method incurs at most a 1.66 dB PSNR loss while maintaining structural similarity (SSIM) and improving perceptual quality (LPIPS). This demonstrates that ROAR provides strong privacy guarantees with minimal performance trade-offs, laying a foundation for future research on privacy-preserving vision systems.

链接: https://arxiv.org/abs/2504.16557
作者: Murat Bilgehan Ertan,Ronak Sahu,Phuong Ha Nguyen,Kaleel Mahmood,Marten van Dijk
机构: CWI(荷兰国家数学与计算机科学研究所); University of Connecticut (美国康涅狄格大学); eBay Inc. (易趣公司); University of Rhode Island (美国罗德岛大学); CWI(荷兰国家数学与计算机科学研究所)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Submitted to ICCV 2025

点击查看摘要

Abstract:We introduce ROAR (Robust Object Removal and Re-annotation), a scalable framework for privacy-preserving dataset obfuscation that eliminates sensitive objects instead of modifying them. Our method integrates instance segmentation with generative inpainting to remove identifiable entities while preserving scene integrity. Extensive evaluations on 2D COCO-based object detection show that ROAR achieves 87.5% of the baseline detection average precision (AP), whereas image dropping achieves only 74.2% of the baseline AP, highlighting the advantage of scrubbing in preserving dataset utility. The degradation is even more severe for small objects due to occlusion and loss of fine-grained details. Furthermore, in NeRF-based 3D reconstruction, our method incurs a PSNR loss of at most 1.66 dB while maintaining SSIM and improving LPIPS, demonstrating superior perceptual quality. Our findings establish object removal as an effective privacy framework, achieving strong privacy guarantees with minimal performance trade-offs. The results highlight key challenges in generative inpainting, occlusion-robust segmentation, and task-specific scrubbing, setting the foundation for future advancements in privacy-preserving vision systems.
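The scrub-then-inpaint pattern can be illustrated with classical OpenCV inpainting standing in for the generative inpainter, and a binary mask assumed to come from any instance segmenter:

```python
import cv2
import numpy as np

def scrub_objects(image_bgr, instance_mask, dilate_px=5):
    """Remove masked objects and fill the holes by inpainting (illustrative).

    instance_mask: uint8 {0, 255} mask of sensitive instances from any segmenter.
    """
    kernel = np.ones((dilate_px, dilate_px), np.uint8)
    mask = cv2.dilate(instance_mask, kernel)          # cover object boundaries too
    return cv2.inpaint(image_bgr, mask, 7, cv2.INPAINT_TELEA)

img = np.full((128, 128, 3), 200, np.uint8)           # toy image
m = np.zeros((128, 128), np.uint8)
m[40:80, 40:80] = 255                                 # toy instance mask
clean = scrub_objects(img, m)
```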
zh

[CV-36] oF-Splatting: Dense SLAM using Sparse Time-of-Flight Depth and Multi-Frame Integration

【速读】: This paper addresses the fact that, under the increasingly tight power budgets of mobile and AR/VR devices, the extremely sparse depth produced by very low-resolution Time-of-Flight (ToF) sensors prevents its seamless use in simultaneous localization and mapping (SLAM). The key of the solution is ToF-Splatting, a novel SLAM pipeline based on 3D Gaussian Splatting that can effectively exploit extremely sparse ToF input. The method introduces a multi-frame integration module that merges cues from extremely sparse ToF depth, monocular color images, and multi-view geometry to produce dense depth maps. Experimental results show state-of-the-art tracking and mapping performance on reference datasets.

链接: https://arxiv.org/abs/2504.16545
作者: Andrea Conti,Matteo Poggi,Valerio Cambareri,Martin R. Oswald,Stefano Mattoccia
机构: University of Bologna (博洛尼亚大学), Italy; Sony DepthSensing Solutions (索尼深度传感解决方案), Belgium; University of Amsterdam (阿姆斯特丹大学), Netherlands
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Time-of-Flight (ToF) sensors provide efficient active depth sensing at relatively low power budgets; among such designs, only very sparse measurements from low-resolution sensors are considered to meet the increasingly limited power constraints of mobile and AR/VR devices. However, such extreme sparsity levels limit the seamless usage of ToF depth in SLAM. In this work, we propose ToF-Splatting, the first 3D Gaussian Splatting-based SLAM pipeline tailored for using effectively very sparse ToF input data. Our approach improves upon the state of the art by introducing a multi-frame integration module, which produces dense depth maps by merging cues from extremely sparse ToF depth, monocular color, and multi-view geometry. Extensive experiments on both synthetic and real sparse ToF datasets demonstrate the viability of our approach, as it achieves state-of-the-art tracking and mapping performances on reference datasets.
zh

[CV-37] Streetscape Analysis with Generative AI (SAGAI): Vision-Language Assessment and Mapping of Urban Scenes

【速读】:本文旨在解决街道景观评估中存在的两个主要问题:一是当前评估方法要么局限于形态度量属性,要么依赖于耗时的定性评价;二是缺乏一种高效、可扩展且基于开放数据的定量分析工具。为了解决这些问题,论文提出了一种名为SAGAI(Streetscape Analysis with Generative Artificial Intelligence)的新方法,其关键在于结合开源地理数据(如OpenStreetMap)、街景图像(Google Street View)以及轻量级视觉-语言模型(LLaVA),通过定制化的自然语言提示生成结构化的空间指标。这种方法无需针对特定任务进行训练或依赖专有软件,能够实现对城市环境的可扩展且可解释的分析,并支持多种应用场景,如步行友好性、安全性及城市设计研究等。
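
SAGAI 的核心机制可以用如下极简示意来体会(假设性示例,非官方代码):对每张街景图像,用自定义的自然语言提示向轻量级 LLaVA 模型提问,并把回答解析为结构化指标;这里的模型名、提示词与解析方式都是演示用的假设。

```python
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration

MODEL_ID = "llava-hf/llava-1.5-7b-hf"  # 假设选用的 LLaVA 变体
processor = AutoProcessor.from_pretrained(MODEL_ID)
model = LlavaForConditionalGeneration.from_pretrained(MODEL_ID)

def score_scene(image_path: str,
                question: str = "Is this street scene urban or rural? Answer with one word.") -> str:
    # llava-1.5 约定的对话模板,<image> 为图像占位符
    prompt = f"USER: <image>\n{question} ASSISTANT:"
    inputs = processor(images=Image.open(image_path), text=prompt, return_tensors="pt")
    out = model.generate(**inputs, max_new_tokens=10)
    text = processor.decode(out[0], skip_special_tokens=True)
    return text.split("ASSISTANT:")[-1].strip()  # 解析为可聚合的标量/类别指标
```

对整条街道的逐点得分取平均,即可得到论文所述可直接制图的街道级指标。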

链接: https://arxiv.org/abs/2504.16538
作者: Joan Perez(1),Giovanni Fusco(2) ((1) Urban Geo Analytics, France (2) Universite Cote-Azur-CNRS-AMU-Avignon Universite, ESPACE, France)
机构: Urban Geo Analytics (城市地理分析); Université Côte d'Azur-CNRS-AMU-Avignon Université, ESPACE (蔚蓝海岸大学-CNRS-马赛大学-阿维尼翁大学, ESPACE)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: 25 pages, 6 figures in main paper, 6 figures in appendices

点击查看摘要

Abstract:Streetscapes are an essential component of urban space. Their assessment is presently either limited to morphometric properties of their mass skeleton or requires labor-intensive qualitative evaluations of visually perceived qualities. This paper introduces SAGAI: Streetscape Analysis with Generative Artificial Intelligence, a modular workflow for scoring street-level urban scenes using open-access data and vision-language models. SAGAI integrates OpenStreetMap geometries, Google Street View imagery, and a lightweight version of the LLaVA model to generate structured spatial indicators from images via customizable natural language prompts. The pipeline includes an automated mapping module that aggregates visual scores at both the point and street levels, enabling direct cartographic interpretation. It operates without task-specific training or proprietary software dependencies, supporting scalable and interpretable analysis of urban environments. Two exploratory case studies in Nice and Vienna illustrate SAGAI’s capacity to produce geospatial outputs from vision-language inference. The initial results show strong performance for binary urban-rural scene classification, moderate precision in commercial feature detection, and lower estimates, but still informative, of sidewalk width. Fully deployable by any user, SAGAI can be easily adapted to a wide range of urban research themes, such as walkability, safety, or urban design, through prompt modification alone.
zh

[CV-38] A Few-Shot Metric Learning Method with Dual-Channel Attention for Cross-Modal Same-Neuron Identification

【速读】:该论文旨在解决神经科学研究中跨成像模态单神经元匹配的问题,这一问题是理解神经元结构与功能关系的关键。然而,不同成像模态之间的差异(modality gaps)以及有限的标注数据带来了显著挑战。为应对这些挑战,论文提出了一种基于少样本度量学习的方法,其关键在于引入了具有双通道注意力机制的预训练视觉Transformer模型,能够实现鲁棒的跨模态神经元识别。具体而言,局部和全局通道分别提取胞体形态和纤维上下文信息,并通过门控机制融合输出;为进一步提升模型的细粒度区分能力,引入了基于MultiSimilarityMiner算法的困难样本挖掘策略以及Circle Loss函数。实验结果表明,该方法在Top-K准确率和召回率方面优于现有方法,消融研究和t-SNE可视化验证了各模块的有效性,同时在不同微调策略下实现了精度与训练效率的良好权衡。
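
其中“困难样本挖掘 + Circle Loss”的组合在开源库 pytorch-metric-learning 中有现成实现,下面给出一个示意性的训练片段(嵌入网络与超参数均为假设,双通道注意力与门控融合结构从略):

```python
import torch
from pytorch_metric_learning import losses, miners

miner = miners.MultiSimilarityMiner()            # 基于 MultiSimilarity 的困难样本对挖掘
loss_fn = losses.CircleLoss(m=0.25, gamma=256)   # Circle Loss,超参为常见取值

def train_step(embed_net, images, labels, optimizer):
    embeddings = embed_net(images)               # (N, d):跨模态共享的神经元嵌入
    hard_pairs = miner(embeddings, labels)       # 只在挖掘到的困难样本对上计算损失
    loss = loss_fn(embeddings, labels, hard_pairs)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```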

链接: https://arxiv.org/abs/2504.16520
作者: Wenwei Li,Liyi Cai,Wu Chen,Anan Li
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Neurons and Cognition (q-bio.NC)
备注: 23 pages, 9 figures, submitted to arXiv for public access

点击查看摘要

Abstract:In neuroscience research, achieving single-neuron matching across different imaging modalities is critical for understanding the relationship between neuronal structure and function. However, modality gaps and limited annotations present significant challenges. We propose a few-shot metric learning method with a dual-channel attention mechanism and a pretrained vision transformer to enable robust cross-modal neuron identification. The local and global channels extract soma morphology and fiber context, respectively, and a gating mechanism fuses their outputs. To enhance the model’s fine-grained discrimination capability, we introduce a hard sample mining strategy based on the MultiSimilarityMiner algorithm, along with the Circle Loss function. Experiments on two-photon and fMOST datasets demonstrate superior Top-K accuracy and recall compared to existing methods. Ablation studies and t-SNE visualizations validate the effectiveness of each module. The method also achieves a favorable trade-off between accuracy and training efficiency under different fine-tuning strategies. These results suggest that the proposed approach offers a promising technical solution for accurate single-cell level matching and multimodal neuroimaging integration.
zh

[CV-39] Think Hierarchically, Act Dynamically: Hierarchical Multi-modal Fusion and Reasoning for Vision-and-Language Navigation ACM-MM2025

【速读】:该论文旨在解决视觉-语言导航(Vision-and-Language Navigation, VLN)任务中,现有方法因仅依赖全局场景表示或对象级特征而无法充分捕捉多模态之间复杂交互关系的问题。为应对这一挑战,论文提出了一种多层级融合与推理架构(Multi-level Fusion and Reasoning Architecture, MFRA)。MFRA 的关键在于引入了一个分层融合机制,能够从多个模态中聚合从低级视觉线索到高级语义概念的不同层级特征,并进一步设计了一个推理模块,通过基于指令引导的注意力机制和动态上下文整合来推断导航动作。这种选择性捕获和组合相关视觉、语言及时间信号的方式显著提升了复杂导航场景中的决策准确性。实验结果表明,MFRA 在 REVERIE、R2R 和 SOON 等基准数据集上的表现优于当前最先进的方法,验证了多层级模态融合在具身导航任务中的有效性。

链接: https://arxiv.org/abs/2504.16516
作者: Junrong Yue,Yifan Zhang,Chuan Qin,Bo Li,Xiaomin Lie,Xinlei Yu,Wenxin Zhang,Zhendong Zhao
机构: City University of Hong Kong, Dongguan Campus (香港城市大学东莞校区); The University of Melbourne (墨尔本大学); Tsinghua University & Baidu Inc. (清华大学&百度公司); University of Chinese Academy of Science (中国科学院大学); National University of Singapore (新加坡国立大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 11 pages, 4 figures, Submitted to ACM MM 2025

点击查看摘要

Abstract:Vision-and-Language Navigation (VLN) aims to enable embodied agents to follow natural language instructions and reach target locations in real-world environments. While prior methods often rely on either global scene representations or object-level features, these approaches are insufficient for capturing the complex interactions across modalities required for accurate navigation. In this paper, we propose a Multi-level Fusion and Reasoning Architecture (MFRA) to enhance the agent’s ability to reason over visual observations, language instructions and navigation history. Specifically, MFRA introduces a hierarchical fusion mechanism that aggregates multi-level features-ranging from low-level visual cues to high-level semantic concepts-across multiple modalities. We further design a reasoning module that leverages fused representations to infer navigation actions through instruction-guided attention and dynamic context integration. By selectively capturing and combining relevant visual, linguistic, and temporal signals, MFRA improves decision-making accuracy in complex navigation scenarios. Extensive experiments on benchmark VLN datasets including REVERIE, R2R, and SOON demonstrate that MFRA achieves superior performance compared to state-of-the-art methods, validating the effectiveness of multi-level modal fusion for embodied navigation.
zh

[CV-40] Federated Learning of Low-Rank One-Shot Image Detection Models in Edge Devices with Scalable Accuracy and Compute Complexity

【速读】:该论文旨在解决在资源受限的边缘设备上训练高效低秩一次图像检测模型时面临的计算和通信开销大的问题。论文的关键创新在于提出了LoRa-FL框架,通过将低秩适应技术集成到一次检测架构中,显著降低了计算和通信开销,同时保持了可扩展的准确性。该方法利用联邦学习实现轻量级图像识别模型的协作训练,支持快速适配与高效部署于异构资源受限设备。实验结果表明,该方法在MNIST和CIFAR10数据集上实现了竞争性的检测性能,并大幅减少了通信带宽和计算复杂度。
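
“低秩适配 + 联邦平均”为何能降低通信量,可以用下面的极简示意理解(非 LoRa-FL 原始实现):各客户端冻结基础权重,只训练并上传低秩矩阵 A、B,服务器对其做 FedAvg;秩 r 与缩放系数均为假设值。

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """冻结的全连接层 + 可训练低秩增量(仅 A、B 参与联邦通信)。"""
    def __init__(self, base: nn.Linear, r: int = 4, alpha: float = 8.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))  # 初始增量为 0
        self.scale = alpha / r

    def forward(self, x):
        return self.base(x) + (x @ self.A.t() @ self.B.t()) * self.scale

def fedavg_lora(client_states, weights):
    """只对各客户端上传的 A、B 做加权平均,通信量与秩 r 成正比。"""
    keys = client_states[0].keys()
    return {k: sum(w * s[k] for s, w in zip(client_states, weights)) for k in keys}
```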

链接: https://arxiv.org/abs/2504.16515
作者: Abdul Hannaan,Zubair Shah,Aiman Erbad,Amr Mohamed,Ali Safa
机构: College of Science and Engineering, Hamad Bin Khalifa University (哈马德本哈利法大学), Doha Qatar; College of Engineering, Qatar University (卡塔尔大学), Doha Qatar
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: accepted for publication at IEEE IWCMC 2025

点击查看摘要

Abstract:This paper introduces a novel federated learning framework termed LoRa-FL designed for training low-rank one-shot image detection models deployed on edge devices. By incorporating low-rank adaptation techniques into one-shot detection architectures, our method significantly reduces both computational and communication overhead while maintaining scalable accuracy. The proposed framework leverages federated learning to collaboratively train lightweight image recognition models, enabling rapid adaptation and efficient deployment across heterogeneous, resource-constrained devices. Experimental evaluations on the MNIST and CIFAR10 benchmark datasets, both in an independent-and-identically-distributed (IID) and non-IID setting, demonstrate that our approach achieves competitive detection performance while significantly reducing communication bandwidth and compute complexity. This makes it a promising solution for adaptively reducing the communication and compute power overheads, while not sacrificing model accuracy.
zh

[CV-41] TraveLLaMA: Facilitating Multi-modal Large Language Models to Understand Urban Scenes and Provide Travel Assistance

【速读】:该论文旨在解决现有多模态 AI 系统在城市环境场景理解与旅行辅助方面缺乏专业化知识和上下文理解的问题。为应对这一挑战,论文提出了一种名为 TraveLLaMA 的专用多模态语言模型。其关键是构建了一个包含 22 万问答对的大规模数据集,其中包括从真实旅行论坛中精心整理的 13 万文本问答对(结合 GPT 增强响应)以及专门针对地图理解和场景解析的 9 万视觉-语言问答对。通过在最先进的视觉-语言模型(如 LLaVA、Qwen-VL 和 Shikra)上进行广泛的微调实验,研究展示了 TraveLLaMA 在纯文本旅行理解和视觉问答任务中的性能提升了 6.5%-9.4%,并在提供上下文相关的旅行推荐、解读地图位置及理解特定地点图像的同时,提供了实用信息(如营业时间和游客评论)。相较通用模型,TraveLLaMA 在旅行特定任务中表现显著更优,确立了多模态旅行辅助系统的全新基准。

链接: https://arxiv.org/abs/2504.16505
作者: Meng Chu,Yukang Chen,Haokun Gui,Shaozuo Yu,Yi Wang,Jiaya Jia
机构: HKUST(Hong Kong University of Science and Technology)(香港科技大学); CUHK(The Chinese University of Hong Kong)(香港中文大学); ShanghaiAI Lab(上海AI实验室)
类目: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
备注:

点击查看摘要

Abstract:Tourism and travel planning increasingly rely on digital assistance, yet existing multimodal AI systems often lack specialized knowledge and contextual understanding of urban environments. We present TraveLLaMA, a specialized multimodal language model designed for urban scene understanding and travel assistance. Our work addresses the fundamental challenge of developing practical AI travel assistants through a novel large-scale dataset of 220k question-answer pairs. This comprehensive dataset uniquely combines 130k text QA pairs meticulously curated from authentic travel forums with GPT-enhanced responses, alongside 90k vision-language QA pairs specifically focused on map understanding and scene comprehension. Through extensive fine-tuning experiments on state-of-the-art vision-language models (LLaVA, Qwen-VL, Shikra), we demonstrate significant performance improvements ranging from 6.5%-9.4% in both pure text travel understanding and visual question answering tasks. Our model exhibits exceptional capabilities in providing contextual travel recommendations, interpreting map locations, and understanding place-specific imagery while offering practical information such as operating hours and visitor reviews. Comparative evaluations show TraveLLaMA significantly outperforms general-purpose models in travel-specific tasks, establishing a new benchmark for multi-modal travel assistance systems.
zh

[CV-42] PRaDA: Projective Radial Distortion Averaging CVPR2025

【速读】:该论文试图解决在具有挑战性的条件下自动校准径向畸变相机的问题。传统方法要么需要求解包含相机位姿、3D点和畸变参数的完整运动恢复结构(Structure from Motion, SfM)问题,这通常需要大量重叠充分的图像;要么严重依赖学习型方法,但其准确性相对较低。本文的关键解决方案在于将畸变校准与三维重建解耦,在保持基于SfM方法精度的同时规避了许多相关复杂性。这一点通过在投影空间中工作来实现:在该空间中,几何在相差一个单应变换的意义下唯一,而该单应封装了除畸变外的所有相机参数。所提出的方法称为“投影径向畸变平均”(Projective Radial Distortion Averaging),它在完全投影的框架中对多个畸变估计取平均,而无需创建3D点或进行完整的光束法平差(bundle adjustment)。由于依赖成对投影关系,该方法可配合任意特征匹配方法使用,而无需在多幅图像间构建点轨迹。

链接: https://arxiv.org/abs/2504.16499
作者: Daniil Sinitsyn,Linus Härenstam-Nielsen,Daniel Cremers
机构: Technical University of Munich (慕尼黑工业大学); Munich Center for Machine Learning (慕尼黑机器学习中心)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted at CVPR 2025. 8 pages + references

点击查看摘要

Abstract:We tackle the problem of automatic calibration of radially distorted cameras in challenging conditions. Accurately determining distortion parameters typically requires either 1) solving the full Structure from Motion (SfM) problem involving camera poses, 3D points, and the distortion parameters, which is only possible if many images with sufficient overlap are provided, or 2) relying heavily on learning-based methods that are comparatively less accurate. In this work, we demonstrate that distortion calibration can be decoupled from 3D reconstruction, maintaining the accuracy of SfM-based methods while avoiding many of the associated complexities. This is achieved by working in Projective Space, where the geometry is unique up to a homography, which encapsulates all camera parameters except for distortion. Our proposed method, Projective Radial Distortion Averaging, averages multiple distortion estimates in a fully projective framework without creating 3D points or running full bundle adjustment. By relying on pairwise projective relations, our method supports any feature-matching approach without constructing point tracks across multiple images.
zh

[CV-43] Rethinking Generalizable Infrared Small Target Detection: A Real-scene Benchmark and Cross-view Representation Learning

【速读】:该论文旨在解决红外小目标检测(Infrared Small Target Detection, ISTD)在面对传感器类型、观测条件及目标固有属性变化时引起的领域偏移(domain shift)问题,这种领域偏移导致获取的红外图像数据分布存在显著差异,从而严重限制了ISTD模型在多样化场景中的泛化能力。为应对这一挑战,论文提出了一种基于领域适应增强的ISTD框架。其关键解决方案包括:引入跨视图通道对齐(Cross-view Channel Alignment, CCA)以缓解数据集之间的分布偏移并实现跨样本对齐;提出跨视图Top-K融合(Cross-view Top-K Fusion)策略,将目标信息与多样化的背景特征相结合,提升模型提取关键数据特征的能力;此外,开发了噪声引导表征学习(Noise-guided Representation Learning)策略,使模型能够学习到更抗噪的特征表示,从而提高其在多种噪声环境下的泛化性能。最终,论文构建了一个专用的红外小目标数据集RealScene-ISTD,并验证了所提方法在检测概率(Pd)、虚警率(Fa)以及交并比(IoU)方面的优越性能。

链接: https://arxiv.org/abs/2504.16487
作者: Yahao Lu,Yuehui Li,Xingyuan Guo,Shuai Yuan,Yukai Shi,Liang Lin
机构: School of Information Engineering, Guangdong University of Technology, Guangzhou, 510006, China (广东工业大学信息工程学院,广州,510006,中国); Southern Power Grid, Ltd., Guangzhou, 510000, China (南方电网有限公司,广州,510000,中国); Xi’an Key Laboratory of Infrared Technology and System, School of Optoelectronic Engineering, Xidian University, Xi’an 710071, China (西安红外技术与系统重点实验室,西安电子科技大学光电工程学院,西安,710071,中国); School of Data and Computer Science, Sun Yat-sen University, Guangzhou, 510006, China (中山大学数据科学与计算机学院,广州,510006,中国)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: A benchmark associated with real-world scenes for the Infrared Small Target Detection (ISTD) is presented

点击查看摘要

Abstract:Infrared small target detection (ISTD) is highly sensitive to sensor type, observation conditions, and the intrinsic properties of the target. These factors can introduce substantial variations in the distribution of acquired infrared image data, a phenomenon known as domain shift. Such distribution discrepancies significantly hinder the generalization capability of ISTD models across diverse scenarios. To tackle this challenge, this paper introduces an ISTD framework enhanced by domain adaptation. To alleviate distribution shift between datasets and achieve cross-sample alignment, we introduce Cross-view Channel Alignment (CCA). Additionally, we propose the Cross-view Top-K Fusion strategy, which integrates target information with diverse background features, enhancing the model’ s ability to extract critical data characteristics. To further mitigate the impact of noise on ISTD, we develop a Noise-guided Representation learning strategy. This approach enables the model to learn more noise-resistant feature representations, to improve its generalization capability across diverse noisy domains. Finally, we develop a dedicated infrared small target dataset, RealScene-ISTD. Compared to state-of-the-art methods, our approach demonstrates superior performance in terms of detection probability (Pd), false alarm rate (Fa), and intersection over union (IoU). The code is available at: this https URL.
zh

[CV-44] RGB-D Video Object Segmentation via Enhanced Multi-store Feature Memory

【速读】:该论文旨在解决RGB-Depth (RGB-D) 视频对象分割(VOS)任务中跨模态信息利用不足以及长时间预测时易发生目标漂移的问题。为应对这些挑战,论文提出了一种基于多存储特征记忆的新型RGB-D VOS方法。其关键是设计了一个分层模态选择与融合机制,能够自适应地结合两种模态的特征,并开发了一个分割精化模块,通过充分利用Segmentation Anything Model (SAM) 来优化分割掩膜,同时引入空间时间嵌入和模态嵌入以增强SAM在RGB-D VOS中的性能表现。实验结果表明,该方法在最新的RGB-D VOS基准测试中达到了最先进的性能。

链接: https://arxiv.org/abs/2504.16471
作者: Boyue Xu,Ruichao Hou,Tongwei Ren,Gangshan Wu
机构: State Key Laboratory for Novel Software Technology, Nanjing University (软件新技术国家重点实验室,南京大学), Nanjing (南京), China (中国)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:The RGB-Depth (RGB-D) Video Object Segmentation (VOS) aims to integrate the fine-grained texture information of RGB with the spatial geometric clues of depth modality, boosting the performance of segmentation. However, off-the-shelf RGB-D segmentation methods fail to fully explore cross-modal information and suffer from object drift during long-term prediction. In this paper, we propose a novel RGB-D VOS method via multi-store feature memory for robust segmentation. Specifically, we design the hierarchical modality selection and fusion, which adaptively combines features from both modalities. Additionally, we develop a segmentation refinement module that effectively utilizes the Segmentation Anything Model (SAM) to refine the segmentation mask, ensuring more reliable results as memory to guide subsequent segmentation tasks. By leveraging spatio-temporal embedding and modality embedding, mixed prompts and fused images are fed into SAM to unleash its potential in RGB-D VOS. Experimental results show that the proposed method achieves state-of-the-art performance on the latest RGB-D VOS benchmark.
zh

[CV-45] MTSGL: Multi-Task Structure Guided Learning for Robust and Interpretable SAR Aircraft Recognition

【速读】:该论文旨在解决合成孔径雷达(SAR)图像中飞机识别任务中现有分类算法对飞机结构知识理解不足的问题。论文提出的关键解决方案是引入基于结构的SAR飞机标注方法,并设计一个多任务结构引导学习(MTSGL)网络。MTSGL网络不仅包含分类任务,还引入了结构语义感知(SSA)模块和结构一致性正则化(SCR)模块。SSA模块用于捕获结构语义信息,以促进对飞机知识的人类认知理解;SCR模块则帮助保持SAR图像中飞机结构与所提出的标注之间的几何一致性,从而以几何上有意义的方式解耦结构属性。通过这种方式,MTSGL实现了专家级的飞机先验知识和结构引导学习范式的结合,目标是以类似于人类认知过程的方式理解飞机概念。实验结果表明,该方法在鲁棒性和可解释性方面具有显著优势。

链接: https://arxiv.org/abs/2504.16467
作者: Qishan He,Lingjun Zhao,Ru Luo,Siqian Zhang,Lin Lei,Kefeng Ji,Gangyao Kuang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Aircraft recognition in synthetic aperture radar (SAR) imagery is a fundamental mission in both military and civilian applications. Recently deep learning (DL) has emerged a dominant paradigm for its explosive performance on extracting discriminative features. However, current classification algorithms focus primarily on learning decision hyperplane without enough comprehension on aircraft structural knowledge. Inspired by the fined aircraft annotation methods for optical remote sensing images (RSI), we first introduce a structure-based SAR aircraft annotations approach to provide structural and compositional supplement information. On this basis, we propose a multi-task structure guided learning (MTSGL) network for robust and interpretable SAR aircraft recognition. Besides the classification task, MTSGL includes a structural semantic awareness (SSA) module and a structural consistency regularization (SCR) module. The SSA is designed to capture structure semantic information, which is conducive to gain human-like comprehension of aircraft knowledge. The SCR helps maintain the geometric consistency between the aircraft structure in SAR imagery and the proposed annotation. In this process, the structural attribute can be disentangled in a geometrically meaningful manner. In conclusion, the MTSGL is presented with the expert-level aircraft prior knowledge and structure guided learning paradigm, aiming to comprehend the aircraft concept in a way analogous to the human cognitive process. Extensive experiments are conducted on a self-constructed multi-task SAR aircraft recognition dataset (MT-SARD) and the effective results illustrate the superiority of robustness and interpretation ability of the proposed MTSGL.
zh

[CV-46] Cross Paradigm Representation and Alignment Transformer for Image Deraining

【速读】:该论文旨在解决在低级视觉任务(如图像去雨)中,基于Transformer的网络因单一范式架构难以有效处理不规则雨纹模式和复杂的几何重叠问题。为应对这一挑战,论文提出了一种新颖的跨范式表示与对齐Transformer(CPRAformer)。其关键在于通过层次化的表示与对齐机制,融合空间-通道和全局-局部两种范式的互补特性,以辅助图像重建。具体而言,模型采用两种自注意力机制:稀疏提示通道自注意力(SPC-SA)和空间像素细化自注意力(SPR-SA),分别增强全局通道依赖性和聚焦于空间雨纹分布及细粒度纹理恢复。此外,引入自适应对齐频率模块(AAFM),以两阶段渐进方式对齐和交互特征,减少范式间的特征错位与知识差异。通过这一统一的动态交互框架,实现从两种范式中提取最具价值的融合信息,从而显著提升性能,在八个基准数据集上达到当前最优表现,并验证了其在其他图像恢复任务中的鲁棒性。

链接: https://arxiv.org/abs/2504.16455
作者: Shun Zou,Yi Zou,Juncheng Li,Guangwei Gao,Guojun Qi
机构: Nanjing Agricultural University (南京农业大学); Soochow University (苏州大学); Xiangtan University (湘潭大学); Shanghai University (上海大学); Nanjing University of Posts and Telecommunications (南京邮电大学); Westlake University (西湖大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: code: this https URL

点击查看摘要

Abstract:Transformer-based networks have achieved strong performance in low-level vision tasks like image deraining by utilizing spatial or channel-wise self-attention. However, irregular rain patterns and complex geometric overlaps challenge single-paradigm architectures, necessitating a unified framework to integrate complementary global-local and spatial-channel representations. To address this, we propose a novel Cross Paradigm Representation and Alignment Transformer (CPRAformer). Its core idea is the hierarchical representation and alignment, leveraging the strengths of both paradigms (spatial-channel and global-local) to aid image reconstruction. It bridges the gap within and between paradigms, aligning and coordinating them to enable deep interaction and fusion of features. Specifically, we use two types of self-attention in the Transformer blocks: sparse prompt channel self-attention (SPC-SA) and spatial pixel refinement self-attention (SPR-SA). SPC-SA enhances global channel dependencies through dynamic sparsity, while SPR-SA focuses on spatial rain distribution and fine-grained texture recovery. To address the feature misalignment and knowledge differences between them, we introduce the Adaptive Alignment Frequency Module (AAFM), which aligns and interacts with features in a two-stage progressive manner, enabling adaptive guidance and complementarity. This reduces the information gap within and between paradigms. Through this unified cross-paradigm dynamic interaction framework, we achieve the extraction of the most valuable interactive fusion information from the two paradigms. Extensive experiments demonstrate that our model achieves state-of-the-art performance on eight benchmark datasets and further validates CPRAformer’s robustness in other image restoration tasks and downstream applications.
zh

[CV-47] Marginalized Generalized IoU (MGIoU): A Unified Objective Function for Optimizing Any Convex Parametric Shapes

【速读】:该论文旨在解决现有参数化形状优化目标函数在多种计算机视觉任务中的不足,这些问题包括:基于回归的损失(如L1/L2)与IoU缺乏相关性,基于IoU的损失不稳定且局限于简单形状,以及特定任务的方法计算复杂且无法跨领域泛化。这导致当前参数化形状优化的目标函数分散,每个领域提出了不同的IoU近似方法。为了解决这些问题,论文的关键在于引入了边缘化广义IoU(MGIoU),这是一种新颖的损失函数,通过将结构化凸形投影到其唯一的形状法线上来计算一维归一化的广义IoU,从而克服了上述挑战。MGIoU不仅简单高效,而且完全可微,与IoU高度相关。进一步地,论文扩展了MGIoU为MGIoU+,以支持非结构化凸形的优化,实现了参数化形状优化在多样化应用中的统一。实验表明,MGIoU及其扩展版本在标准基准测试中始终优于现有损失函数,并将损失计算延迟降低了10到40倍。此外,MGIoU满足度量属性和尺度不变性,确保其作为目标函数的鲁棒性。论文还提出了MGIoU-,用于减少无碰撞轨迹预测(collision-free trajectory prediction)等任务中的重叠。代码已开源。
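
按摘要描述,MGIoU 的核心是“投影到形状法线 → 一维 GIoU → 取平均”。下面是照此思路写的二维凸多边形示意实现(对结构化形状应只用其自身的少数法线,此处为演示对两个多边形的全部边法线取平均,更接近 MGIoU+ 的做法,并非官方代码):

```python
import numpy as np

def giou_1d(a, b):
    """两个闭区间 a=(lo,hi)、b=(lo,hi) 的一维 GIoU。"""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    hull = max(a[1], b[1]) - min(a[0], b[0])
    return inter / union - (hull - union) / hull

def edge_normals(poly):
    """凸多边形 (N,2)(顶点按序排列)各边的单位法线。"""
    edges = np.roll(poly, -1, axis=0) - poly
    normals = np.stack([-edges[:, 1], edges[:, 0]], axis=1)
    return normals / np.linalg.norm(normals, axis=1, keepdims=True)

def mgiou(poly1, poly2):
    vals = []
    for n in np.vstack([edge_normals(poly1), edge_normals(poly2)]):
        p1, p2 = poly1 @ n, poly2 @ n          # 投影为一维区间
        vals.append(giou_1d((p1.min(), p1.max()), (p2.min(), p2.max())))
    return float(np.mean(vals))                # 训练时可取 1 - mgiou 作为损失
```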

链接: https://arxiv.org/abs/2504.16443
作者: Duy-Tho Le,Trung Pham,Jianfei Cai,Hamid Rezatofighi
机构: Monash University (蒙纳士大学); NVIDIA (英伟达)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 8 pages

点击查看摘要

Abstract:Optimizing the similarity between parametric shapes is crucial for numerous computer vision tasks, where Intersection over Union (IoU) stands as the canonical measure. However, existing optimization methods exhibit significant shortcomings: regression-based losses like L1/L2 lack correlation with IoU, IoU-based losses are unstable and limited to simple shapes, and task-specific methods are computationally intensive and not generalizable across domains. As a result, the current landscape of parametric shape objective functions has become scattered, with each domain proposing distinct IoU approximations. To address this, we unify the parametric shape optimization objective functions by introducing Marginalized Generalized IoU (MGIoU), a novel loss function that overcomes these challenges by projecting structured convex shapes onto their unique shape Normals to compute one-dimensional normalized GIoU. MGIoU offers a simple, efficient, fully differentiable approximation strongly correlated with IoU. We then extend MGIoU to MGIoU+ that supports optimizing unstructured convex shapes. Together, MGIoU and MGIoU+ unify parametric shape optimization across diverse applications. Experiments on standard benchmarks demonstrate that MGIoU and MGIoU+ consistently outperform existing losses while reducing loss computation latency by 10-40x. Additionally, MGIoU and MGIoU+ satisfy metric properties and scale-invariance, ensuring robustness as an objective function. We further propose MGIoU- for minimizing overlaps in tasks like collision-free trajectory prediction. Code is available at this https URL
zh

[CV-48] FrogDogNet: Fourier frequency Retained visual prompt Output Guidance for Domain Generalization of CLIP in Remote Sensing

【速读】:该论文致力于解决现有视觉语言模型(Vision-Language Models, VLMs)在遥感场景分类和领域泛化任务中的不足,特别是基于指令文本提示的零样本推理方法容易受到全图特征中噪声和背景干扰的影响,导致类别内特征不一致及误分类的问题。论文的关键创新在于提出了一种名为FrogDogNet的新框架,通过结合傅里叶频域滤波和自注意力机制,实现低频不变特征的选择性保留,同时去除噪声和无关背景信息。这种方法确保了跨领域的鲁棒特征表示,从而显著提升了遥感图像的场景分类性能及领域泛化能力。实验结果表明,FrogDogNet在多个遥感数据集和领域泛化任务上优于当前最先进的基于提示学习的方法。
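
“保留低频不变分量、滤除高频噪声”这一关键步骤可用如下示意代码理解(假设性实现,保留比例等超参为演示值,非论文官方代码):

```python
import torch

def keep_low_freq(x: torch.Tensor, keep_ratio: float = 0.25) -> torch.Tensor:
    """对 (B,C,H,W) 特征做 2D FFT,仅保留频谱中心的低频区域后再逆变换。"""
    spec = torch.fft.fftshift(torch.fft.fft2(x, norm="ortho"), dim=(-2, -1))
    B, C, H, W = x.shape
    mask = torch.zeros(H, W, device=x.device)
    h = max(1, int(H * keep_ratio) // 2)
    w = max(1, int(W * keep_ratio) // 2)
    mask[H // 2 - h:H // 2 + h, W // 2 - w:W // 2 + w] = 1.0  # 中心低频方块
    spec = spec * mask
    out = torch.fft.ifft2(torch.fft.ifftshift(spec, dim=(-2, -1)), norm="ortho")
    return out.real  # 滤除高频噪声与无关背景后的特征
```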

链接: https://arxiv.org/abs/2504.16433
作者: Hariseetharam Gunduboina(1),Muhammad Haris Khan(2),Biplab Banerjee(1) ((1) Indian Institute of Technology Bombay, India, (2) Mohamed Bin Zayed University of Artificial Intelligence, UAE)
机构: Indian Institute of Technology Bombay (印度理工学院孟买分校); Mohamed Bin Zayed University of Artificial Intelligence (穆罕默德 bin扎耶德人工智能大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:In recent years, large-scale vision-language models (VLMs) like CLIP have gained attention for their zero-shot inference using instructional text prompts. While these models excel in general computer vision, their potential for domain generalization in remote sensing (RS) remains underexplored. Existing approaches enhance prompt learning by generating visual prompt tokens but rely on full-image features, introducing noise and background artifacts that vary within a class, causing misclassification. To address this, we propose FrogDogNet, a novel prompt learning framework integrating Fourier frequency filtering and self-attention to improve RS scene classification and domain generalization. FrogDogNet selectively retains invariant low-frequency components while eliminating noise and irrelevant backgrounds, ensuring robust feature representation across domains. The model first extracts significant features via projection and self-attention, then applies frequency-based filtering to preserve essential structural information for prompt learning. Extensive experiments on four RS datasets and three domain generalization tasks show that FrogDogNet consistently outperforms state-of-the-art prompt learning methods, demonstrating superior adaptability across domain shifts. Our findings highlight the effectiveness of frequency-based invariant feature retention in generalization, paving the way for broader applications. Our code is available at this https URL
zh

[CV-49] PixelWeb: The First Web GUI Dataset with Pixel-Wise Labels

【速读】:该论文旨在解决现有图形用户界面(GUI)数据集中标注信息准确性不足的问题,包括丢失、重复或无意义的边界框(BBox)注释,这些问题会降低基于这些数据集训练模型的性能,并限制其在实际应用中的有效性。此外,现有的GUI数据集仅提供视觉上的BBox注释,这阻碍了与视觉相关的下游任务的发展。为了解决这些问题,论文提出PixelWeb,这是一个包含超过100,000个标注网页的大规模GUI数据集。PixelWeb的关键解决方案在于采用了一种新颖的自动标注方法,通过两个核心模块——通道推导和层分析——集成视觉特征提取与文档对象模型(DOM)结构分析。通道推导确保了在遮挡和重叠元素情况下GUI元素定位的准确性,而层分析利用DOM确定元素的可见性和堆叠顺序,从而提供精确的BBox注释。此外,PixelWeb还包括全面的元数据,如元素图像、轮廓和掩码注释,并通过三位独立标注人员的手动验证确认了其高质量和高精度的注释。实验结果表明,PixelWeb在GUI元素检测任务中的mAP95指标性能比现有数据集高出3到7倍,显示出其在GUI生成和自动化用户交互等下游任务中具有显著的性能提升潜力。

链接: https://arxiv.org/abs/2504.16419
作者: Qi Yang,Weichen Bi,Haiyang Shen,Yaoqi Guo,Yun Ma
机构: Peking University (北京大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
备注:

点击查看摘要

Abstract:Graphical User Interface (GUI) datasets are crucial for various downstream tasks. However, GUI datasets often generate annotation information through automatic labeling, which commonly results in inaccurate GUI element BBox annotations, including missing, duplicate, or meaningless BBoxes. These issues can degrade the performance of models trained on these datasets, limiting their effectiveness in real-world applications. Additionally, existing GUI datasets only provide BBox annotations visually, which restricts the development of visually related GUI downstream tasks. To address these issues, we introduce PixelWeb, a large-scale GUI dataset containing over 100,000 annotated web pages. PixelWeb is constructed using a novel automatic annotation approach that integrates visual feature extraction and Document Object Model (DOM) structure analysis through two core modules: channel derivation and layer analysis. Channel derivation ensures accurate localization of GUI elements in cases of occlusion and overlapping elements by extracting BGRA four-channel bitmap annotations. Layer analysis uses the DOM to determine the visibility and stacking order of elements, providing precise BBox annotations. Additionally, PixelWeb includes comprehensive metadata such as element images, contours, and mask annotations. Manual verification by three independent annotators confirms the high quality and accuracy of PixelWeb annotations. Experimental results on GUI element detection tasks show that PixelWeb achieves performance on the mAP95 metric that is 3-7 times better than existing datasets. We believe that PixelWeb has great potential for performance improvement in downstream tasks such as GUI generation and automated user interaction.
zh

[CV-50] Assessing the Feasibility of Internet-Sourced Video for Automatic Cattle Lameness Detection

【速读】:该论文试图解决通过视频数据检测牛跛行、疾病或步态异常的问题。解决方案的关键在于直接利用深度学习分类模型从视频数据中学习时空特征,而无需采用传统的多阶段方法(如目标检测、姿态估计和特征提取)。具体而言,研究采用了ConvLSTM2D和3D CNN两种深度学习模型,并通过数据增强技术提升模型的鲁棒性和泛化能力。实验结果显示,3D CNN模型在视频级别达到了90%的分类准确率,且precision、recall和f1-score均为90.9%,表现出色。这表明所提出的深度学习方法能够有效简化处理流程并实现对牛跛行的精准检测与分类。
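
直接在视频片段上端到端学习时空特征的 3D CNN 可以相当小,下面给出一个结构示意(层数、通道数均为假设,与原文网络不必一致):

```python
import torch
import torch.nn as nn

class Tiny3DCNN(nn.Module):
    """输入 (B, 3, T, H, W) 的视频片段,输出 跛行/正常 两类 logits。"""
    def __init__(self, num_classes: int = 2):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv3d(3, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool3d((1, 2, 2)),                    # 先只在空间维下采样
            nn.Conv3d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool3d(2),                            # 时空同时下采样
            nn.AdaptiveAvgPool3d(1),
        )
        self.classifier = nn.Linear(32, num_classes)

    def forward(self, x):
        return self.classifier(self.features(x).flatten(1))

logits = Tiny3DCNN()(torch.randn(2, 3, 16, 112, 112))  # -> (2, 2)
```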

链接: https://arxiv.org/abs/2504.16404
作者: Md Fahimuzzman Sohan
机构: Department of Software Engineering (软件工程系), Daffodil International University (达夫迪尔国际大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Image and Video Processing (eess.IV)
备注:

点击查看摘要

Abstract:Cattle lameness is often caused by hoof injuries or interdigital dermatitis, leads to pain and significantly impacts essential physiological activities such as walking, feeding, and drinking. This study presents a deep learning-based model for detecting cattle lameness, sickness, or gait abnormalities using publicly available video data. The dataset consists of 50 unique videos from 40 individual cattle, recorded from various angles in both indoor and outdoor environments. Half of the dataset represents naturally walking (normal/non-lame) cattle, while the other half consists of cattle exhibiting gait abnormalities (lame). To enhance model robustness and generalizability, data augmentation was applied to the training data. The pre-processed videos were then classified using two deep learning models: ConvLSTM2D and 3D CNN. A comparative analysis of the results demonstrates strong classification performance. Specifically, the 3D CNN model achieved a video-level classification accuracy of 90%, with precision, recall, and f1-score of 90.9%, 90.9%, and 90.91% respectively. The ConvLSTM2D model exhibited a slightly lower accuracy of 85%. This study highlights the effectiveness of directly applying classification models to learn spatiotemporal features from video data, offering an alternative to traditional multi-stage approaches that typically involve object detection, pose estimation, and feature extraction. Besides, the findings demonstrate that the proposed deep learning models, particularly the 3D CNN, effectively classify and detect lameness in cattle while simplifying the processing pipeline.
zh

[CV-51] SaENeRF: Suppressing Artifacts in Event-based Neural Radiance Fields IJCNN2025

【速读】:本文旨在解决从事件流重建几何一致且光度准确(photometrically accurate)的静态场景3D表征的问题。现有基于事件的神经辐射场(Neural Radiance Fields, NeRF)方法虽部分缓解了挑战,但由于网络早期阶段的激进学习以及事件相机固有的噪声,仍存在持续的伪影问题。为克服这些限制,论文提出了一种名为SaENeRF的新型自监督框架。其关键是依据累积事件极性对预测的辐射变化进行归一化,实现表征构建的渐进且快速学习;同时引入专门设计的正则化损失,抑制光度变化低于事件阈值区域中的伪影,并增强非零事件处的光强差异,从而提升重建场景的视觉保真度。

链接: https://arxiv.org/abs/2504.16389
作者: Yuanjian Wang,Yufei Deng,Rong Xiao,Jiahao Fan,Chenwei Tang,Deng Xiong,Jiancheng Lv
机构: College of Computer Science, Sichuan University (四川大学计算机学院), Chengdu, China; Stevens Institute of Technology (史蒂文斯理工学院), Hoboken, USA
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by IJCNN 2025

点击查看摘要

Abstract:Event cameras are neuromorphic vision sensors that asynchronously capture changes in logarithmic brightness changes, offering significant advantages such as low latency, low power consumption, low bandwidth, and high dynamic range. While these characteristics make them ideal for high-speed scenarios, reconstructing geometrically consistent and photometrically accurate 3D representations from event data remains fundamentally challenging. Current event-based Neural Radiance Fields (NeRF) methods partially address these challenges but suffer from persistent artifacts caused by aggressive network learning in early stages and the inherent noise of event cameras. To overcome these limitations, we present SaENeRF, a novel self-supervised framework that effectively suppresses artifacts and enables 3D-consistent, dense, and photorealistic NeRF reconstruction of static scenes solely from event streams. Our approach normalizes predicted radiance variations based on accumulated event polarities, facilitating progressive and rapid learning for scene representation construction. Additionally, we introduce regularization losses specifically designed to suppress artifacts in regions where photometric changes fall below the event threshold and simultaneously enhance the light intensity difference of non-zero events, thereby improving the visual fidelity of the reconstructed scene. Extensive qualitative and quantitative experiments demonstrate that our method significantly reduces artifacts and achieves superior reconstruction quality compared to existing methods. The code is available at this https URL.
zh

[CV-52] Revisiting Radar Camera Alignment by Contrastive Learning for 3D Object Detection

【速读】:该论文旨在解决基于雷达和相机融合的3D目标检测算法中存在的特征对齐问题,特别是由于雷达与相机域差距导致的特征错位。现有方法要么在对齐过程中忽视了模态间特征的交互,要么未能有效对齐跨模态相同空间位置的特征。为了解决上述问题,论文提出了一种名为Radar Camera Alignment (RCAlign) 的新对齐模型。其关键是设计了一个基于对比学习的Dual-Route Alignment (DRA) 模块,用于对齐和融合雷达与相机特征,并针对雷达鸟瞰图(BEV)特征稀疏的问题,提出了Radar Feature Enhancement (RFE) 模块,通过知识蒸馏损失提升雷达BEV特征的密度。实验结果表明,RCAlign 在nuScenes公开数据集上的雷达-相机融合3D目标检测任务中达到了新的技术水平,并在实时3D检测中相较于最新方法实现了显著性能提升(4.3% NDS 和 8.4% mAP)。
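
基于对比学习的跨模态对齐通常可写成 InfoNCE 形式:同一 BEV 空间位置上的雷达/相机特征互为正样本、其余位置为负样本。下面是一个假设性示意(实际 DRA 模块的双路设计更复杂,此处仅演示对齐损失本身):

```python
import torch
import torch.nn.functional as F

def bev_align_loss(radar_bev, cam_bev, tau=0.07, n_samples=1024):
    """radar_bev / cam_bev: (B,C,H,W)。随机采样部分位置以控制显存。"""
    B, C, H, W = radar_bev.shape
    r = radar_bev.permute(0, 2, 3, 1).reshape(-1, C)
    c = cam_bev.permute(0, 2, 3, 1).reshape(-1, C)
    idx = torch.randperm(r.shape[0], device=r.device)[:n_samples]
    r = F.normalize(r[idx], dim=-1)
    c = F.normalize(c[idx], dim=-1)
    logits = r @ c.t() / tau                      # (n, n),对角线为同位置正样本对
    labels = torch.arange(logits.shape[0], device=logits.device)
    return 0.5 * (F.cross_entropy(logits, labels) +
                  F.cross_entropy(logits.t(), labels))
```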

链接: https://arxiv.org/abs/2504.16368
作者: Linhua Kong,Dongxia Chang,Lian Liu,Zisen Kong,Pengyuan Li,Yao Zhao
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Recently, 3D object detection algorithms based on radar and camera fusion have shown excellent performance, setting the stage for their application in autonomous driving perception tasks. Existing methods have focused on dealing with feature misalignment caused by the domain gap between radar and camera. However, existing methods either neglect inter-modal features interaction during alignment or fail to effectively align features at the same spatial location across modalities. To alleviate the above problems, we propose a new alignment model called Radar Camera Alignment (RCAlign). Specifically, we design a Dual-Route Alignment (DRA) module based on contrastive learning to align and fuse the features between radar and camera. Moreover, considering the sparsity of radar BEV features, a Radar Feature Enhancement (RFE) module is proposed to improve the densification of radar BEV features with the knowledge distillation loss. Experiments show RCAlign achieves a new state-of-the-art on the public nuScenes benchmark in radar camera fusion for 3D Object Detection. Furthermore, the RCAlign achieves a significant performance gain (4.3% NDS and 8.4% mAP) in real-time 3D detection compared to the latest state-of-the-art method (RCBEVDet).
zh

[CV-53] CLPSTNet: A Progressive Multi-Scale Convolutional Steganography Model Integrating Curriculum Learning

【速读】:该论文旨在解决在使用卷积神经网络(Convolutional Neural Networks, CNNs)进行图像隐写时,因数字图像固有复杂性导致的不可见性和安全性问题。为应对这些挑战,论文提出了一种名为课程学习渐进式隐写网络(Curriculum Learning Progressive Steganography Network, CLPSTNet)的解决方案。CLPSTNet 的核心在于其包含多个渐进式的多尺度卷积模块,这些模块集成了 Inception 结构与空洞卷积。通过从较小的卷积核和膨胀率开始,逐步扩展至较大的卷积核和膨胀率,该模块能够从浅层到深层、从精细到粗糙实现多尺度特征提取,并在不同融合阶段精炼隐藏信息的特征。这种设计使得 CLPSTNet 在 ALASKA2、VOC2012 和 ImageNet 三个公开数据集上的峰值信噪比(PSNR)、结构相似性指数(SSIM)以及解码准确性均表现优异,同时生成的隐写图像具有较低的隐写分析检测率。
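
其“膨胀率由小到大”的多尺度卷积模块可以用如下结构示意(分支数与膨胀率为假设值,非官方实现):

```python
import torch
import torch.nn as nn

class ProgressiveMultiScaleBlock(nn.Module):
    """Inception 式多分支:膨胀率由小到大、感受野由细到粗,最后 1x1 卷积融合。"""
    def __init__(self, c_in: int, c_out: int, dilations=(1, 2, 4)):
        super().__init__()
        self.branches = nn.ModuleList(
            nn.Sequential(
                nn.Conv2d(c_in, c_out, 3, padding=d, dilation=d),
                nn.BatchNorm2d(c_out),
                nn.ReLU(inplace=True),
            ) for d in dilations
        )
        self.fuse = nn.Conv2d(c_out * len(dilations), c_out, kernel_size=1)

    def forward(self, x):
        return self.fuse(torch.cat([b(x) for b in self.branches], dim=1))
```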

链接: https://arxiv.org/abs/2504.16364
作者: Fengchun Liu,Tong Zhang,Chunying Zhang
机构: Qianan College, North China University of Science and Technology (华北理工大学迁安学院); School of Cyberspace Security, Beijing University of Posts and Telecommunications (北京邮电大学网络空间安全学院); College of Science, North China University of Science and Technology (华北理工大学理学院)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
备注:

点击查看摘要

Abstract:In recent years, a large number of works have introduced Convolutional Neural Networks (CNNs) into image steganography, which transform traditional steganography methods such as hand-crafted features and prior knowledge design into steganography methods in which neural networks autonomously learn information embedding. However, due to the inherent complexity of digital images, issues of invisibility and security persist when using CNN models for information embedding. In this paper, we propose the Curriculum Learning Progressive Steganography Network (CLPSTNet). The network consists of multiple progressive multi-scale convolutional modules that integrate Inception structures and dilated convolutions. The module contains multiple branching pathways, starting from a smaller convolutional kernel and dilation rate, extracting the basic, local feature information from the feature map, and gradually expanding to convolutions with a larger convolutional kernel and dilation rate for perceiving the feature information of a larger receptive field, so as to realize multi-scale feature extraction from shallow to deep, and from fine to coarse, allowing the shallow secret information features to be refined in different fusion stages. The experimental results show that the proposed CLPSTNet not only has high PSNR, SSIM metrics and decoding accuracy on three large public datasets, ALASKA2, VOC2012 and ImageNet, but also the steganographic images generated by CLPSTNet have low steganalysis detection rates. You can find our code at this https URL.
zh

[CV-54] Almost Right: Making First-layer Kernels Nearly Orthogonal Improves Model Generalization

【速读】:该论文旨在解决计算机视觉领域中模型泛化能力不足的问题,特别是在处理未知样本时的表现。为实现这一目标,论文提出了一种新的损失组件,通过正则化网络第一卷积层中的滤波核(filtering kernels),使其接近正交性,从而提升模型的泛化性能。与以往方法不同的是,该方案赋予网络灵活性,自主决定哪一对滤波核需要正交化,而非固定规则,这有助于网络探索更优的解空间,并施加更强的约束。实验结果显示,在未对网络架构进行修改的情况下,该方法在三种不同架构(ResNet-50、DenseNet-121、ViT-b-16)及两项具有挑战性的开放式识别任务(虹膜生物特征中的呈现攻击检测和胸部X光图像异常检测)中均显著提升了模型的泛化能力。
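
经典的软正交正则会对所有核对一视同仁;按摘要“让网络自主选择正交化哪些核对”的思路,一个简单的代理是只惩罚当前相关性最高的若干对(下面的 top-k 策略仅为示意性假设,未必等同于论文的选择机制):

```python
import torch
import torch.nn.functional as F

def near_orthogonal_loss(conv_weight: torch.Tensor, k: int = 32) -> torch.Tensor:
    """conv_weight: 第一卷积层权重 (out_c, in_c, kh, kw)。"""
    w = F.normalize(conv_weight.flatten(1), dim=1)   # 每个核展平并单位化
    gram = w @ w.t()                                 # 核间余弦相似度矩阵
    n = gram.shape[0]
    off_diag = gram[~torch.eye(n, dtype=torch.bool, device=gram.device)]
    # 只惩罚相关性最强的 k 对,给其余核对留出自由度
    topk = off_diag.abs().topk(min(k, off_diag.numel())).values
    return topk.pow(2).mean()
```

使用时把该项乘以权重系数加到主损失上即可,例如 `loss = ce + 0.1 * near_orthogonal_loss(model.conv1.weight)`(系数为假设值)。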

链接: https://arxiv.org/abs/2504.16362
作者: Colton R. Crum,Adam Czajka
机构: Department of Computer Science and Engineering (计算机科学与工程系), University of Notre Dame (圣母大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 8 pages, 1 figure, 3 tables

点击查看摘要

Abstract:An ongoing research challenge within several domains in computer vision is how to increase model generalization capabilities. Several attempts to improve model generalization performance are heavily inspired by human perceptual intelligence, which is remarkable in both its performance and efficiency to generalize to unknown samples. Many of these methods attempt to force portions of the network to be orthogonal, following some observation within neuroscience related to early vision processes. In this paper, we propose a loss component that regularizes the filtering kernels in the first convolutional layer of a network to make them nearly orthogonal. Deviating from previous works, we give the network flexibility in which pairs of kernels it makes orthogonal, allowing the network to navigate to a better solution space, imposing harsh penalties. Without architectural modifications, we report substantial gains in generalization performance using the proposed loss against previous works (including orthogonalization- and saliency-based regularization methods) across three different architectures (ResNet-50, DenseNet-121, ViT-b-16) and two difficult open-set recognition tasks: presentation attack detection in iris biometrics, and anomaly detection in chest X-ray images.
zh

[CV-55] Regularizing Differentiable Architecture Search with Smooth Activation

【速读】:该论文旨在解决基于Differentiable Architecture Search (DARTS) 的神经架构搜索(NAS)方法中存在的鲁棒性、泛化能力不足以及跳过支配(skip dominance)引起的性能崩溃问题。论文的关键创新在于提出了一种名为Smooth Activation DARTS (SA-DARTS) 的简单而有效的解决方案。SA-DARTS 通过在架构权重上引入平滑激活函数作为辅助损失,缓解了无权重操作对搜索过程的不公平优势,促使架构权重收敛至扇出分布,并能够从跳过支配的初始状态恢复搜索过程。这一方法不仅解决了跳过支配和离散化差异挑战,还通过理论与实证分析证明了其能够在 NAS-Bench-201、分类任务及超分辨率任务上达到新的最先进的性能(SOTA),同时展示了其在减少参数量的情况下提升现有SOTA模型性能的能力。

链接: https://arxiv.org/abs/2504.16306
作者: Yanlin Zhou,Mostafa El-Khamy,Kee-Bong Song
机构: Samsung Semiconductor, Inc. (三星半导体股份有限公司)
类目: Neural and Evolutionary Computing (cs.NE); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Differentiable Architecture Search (DARTS) is an efficient Neural Architecture Search (NAS) method but suffers from robustness, generalization, and discrepancy issues. Many efforts have been made towards the performance collapse issue caused by skip dominance with various regularization techniques towards operation weights, path weights, noise injection, and super-network redesign. It had become questionable at a certain point if there could exist a better and more elegant way to retract the search to its intended goal – NAS is a selection problem. In this paper, we undertake a simple but effective approach, named Smooth Activation DARTS (SA-DARTS), to overcome skip dominance and discretization discrepancy challenges. By leveraging a smooth activation function on architecture weights as an auxiliary loss, our SA-DARTS mitigates the unfair advantage of weight-free operations, converging to fanned-out architecture weight values, and can recover the search process from skip-dominance initialization. Through theoretical and empirical analysis, we demonstrate that the SA-DARTS can yield new state-of-the-art (SOTA) results on NAS-Bench-201, classification, and super-resolution. Further, we show that SA-DARTS can help improve the performance of SOTA models with fewer parameters, such as Information Multi-distillation Network on the super-resolution task.
zh

[CV-56] MetaHarm: Harmful YouTube Video Dataset Annotated by Domain Experts GPT -4-Turbo and Crowdworkers

【速读】:该论文旨在解决短视频平台(如YouTube、Instagram或TikTok)上在线危害内容的全面理解和量化测量问题。目前,对于这些平台上危害内容的范围和程度缺乏系统性的研究与数据支持。论文的关键解决方案是构建两个大规模多模态、多类别危害内容的数据集:一是包含60,906个经过系统筛选的可能有害的YouTube视频;二是包含19,422个视频的标注数据集,由三类标注者完成,包括训练有素的领域专家、GPT-4-Turbo(基于14帧图像、1个缩略图及文本元数据)以及亚马逊Mechanical Turk的专业工人。该标注数据集不仅提供了二分类(有害 vs. 无害)和六种危害类别(信息、仇恨与骚扰、成瘾性、诱饵点击、性相关及身体危害)的多标签分类,还包含了跨所有标注者的一致性标注结果以及多数标注者的标注结果,并进一步提供了单个标注者独立标注的三个子数据集。这些数据集有望推动在线危害内容的研究、辅助多模态分类任务,并促进短视频平台中危害内容的识别与潜在缓解策略的发展。

链接: https://arxiv.org/abs/2504.16304
作者: Wonjeong Jo,Magdalena Wojcieszak
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Short video platforms, such as YouTube, Instagram, or TikTok, are used by billions of users. These platforms expose users to harmful content, ranging from clickbait or physical harms to hate or misinformation. Yet, we lack a comprehensive understanding and measurement of online harm on short video platforms. Toward this end, we present two large-scale datasets of multi-modal and multi-categorical online harm: (1) 60,906 systematically selected potentially harmful YouTube videos and (2) 19,422 videos annotated by three labeling actors: trained domain experts, GPT-4-Turbo (using 14 image frames, 1 thumbnail, and text metadata), and crowdworkers (Amazon Mechanical Turk master workers). The annotated dataset includes both (a) binary classification (harmful vs. harmless) and (b) multi-label categorizations of six harm categories: Information, Hate and harassment, Addictive, Clickbait, Sexual, and Physical harms. Furthermore, the annotated dataset provides (1) ground truth data with videos annotated consistently across (a) all three actors and (b) the majority of the labeling actors, and (2) three data subsets labeled by individual actors. These datasets are expected to facilitate future work on online harm, aid in (multi-modal) classification efforts, and advance the identification and potential mitigation of harmful content on video platforms.
zh

[CV-57] Naturally Computed Scale Invariance in the Residual Stream of ResNet18

【速读】:该论文旨在探索神经网络如何实现对图像变化变量(如尺度变化)的不变性,以保持物体身份的识别不受影响。论文特别关注ResNet18架构中的残差流(residual stream),这是InceptionV1所缺乏的组件。研究发现,中间块中的许多卷积通道表现出尺度不变性,其计算方式是对两份尺度等变表示做逐元素残差相加:块输入中较小尺度的副本,与求和前块输出中较大尺度的副本。关键在于利用残差流组合尺度等变表示以实现尺度不变性,并通过消融实验尝试建立这些神经属性与鲁棒物体识别行为之间的因果联系。

链接: https://arxiv.org/abs/2504.16290
作者: André Longon
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:An important capacity in visual object recognition is invariance to image-altering variables which leave the identity of objects unchanged, such as lighting, rotation, and scale. How do neural networks achieve this? Prior mechanistic interpretability research has illuminated some invariance-building circuitry in InceptionV1, but the results are limited and networks with different architectures have remained largely unexplored. This work investigates ResNet18 with a particular focus on its residual stream, an architectural component which InceptionV1 lacks. We observe that many convolutional channels in intermediate blocks exhibit scale invariant properties, computed by the element-wise residual summation of scale equivariant representations: the block input’s smaller-scale copy with the block pre-sum output’s larger-scale copy. Through subsequent ablation experiments, we attempt to causally link these neural properties with scale-robust object recognition behavior. Our tentative findings suggest how the residual stream computes scale invariance and its possible role in behavior. Code is available at: this https URL
zh

[CV-58] An Automated Pipeline for Few-Shot Bird Call Classification: A Case Study with the Tooth-Billed Pigeon

【速读】:该论文旨在解决现有大型鸟类分类器(如BirdNET和Perch)在处理稀有物种时的局限性,这些模型在检测常见鸟类方面表现出色,但对于仅存1-3条已知录音的濒危物种却缺乏有效的分类选项,这对保护学家监测极危鸟类的最后个体构成了重大挑战。为了解决这一问题,论文的关键在于利用大型鸟类分类网络的嵌入空间(embedding space),通过余弦相似度(cosine similarity)构建分类器,并结合过滤和去噪预处理技术,在极少量训练数据下优化稀有物种的检测能力。最终,该方法在模拟场景和真实世界测试中均展现出高召回率(1.0)和高准确性(0.95),证明其在实际野外应用中的可行性。
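
“大模型嵌入空间 + 余弦相似度”的一发分类器本身只有几行,下面给出示意(嵌入可取自 BirdNET/Perch 等分类网络的中间表示,阈值为假设值):

```python
import numpy as np

def build_prototype(ref_embeddings: np.ndarray) -> np.ndarray:
    """ref_embeddings: (n, d),n 为该物种的已知录音数(本文场景 n<=3)。"""
    proto = ref_embeddings.mean(axis=0)
    return proto / np.linalg.norm(proto)

def detect_call(query_emb: np.ndarray, proto: np.ndarray, thr: float = 0.8):
    q = query_emb / np.linalg.norm(query_emb)
    score = float(q @ proto)              # 余弦相似度
    return score, score >= thr            # (得分, 是否判为目标物种)
```

实际部署时,论文还在嵌入前加入了滤波与去噪预处理,以在极少训练数据下稳定该相似度得分。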

链接: https://arxiv.org/abs/2504.16276
作者: Abhishek Jana,Moeumu Uili,James Atherton,Mark O’Brien,Joe Wood,Leandra Brickson
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Sound (cs.SD)
备注: 16 pages, 5 figures, 4 tables

点击查看摘要

Abstract:This paper presents an automated one-shot bird call classification pipeline designed for rare species absent from large publicly available classifiers like BirdNET and Perch. While these models excel at detecting common birds with abundant training data, they lack options for species with only 1-3 known recordings-a critical limitation for conservationists monitoring the last remaining individuals of endangered birds. To address this, we leverage the embedding space of large bird classification networks and develop a classifier using cosine similarity, combined with filtering and denoising preprocessing techniques, to optimize detection with minimal training data. We evaluate various embedding spaces using clustering metrics and validate our approach in both a simulated scenario with Xeno-Canto recordings and a real-world test on the critically endangered tooth-billed pigeon (Didunculus strigirostris), which has no existing classifiers and only three confirmed recordings. The final model achieved 1.0 recall and 0.95 accuracy in detecting tooth-billed pigeon calls, making it practical for use in the field. This open-source system provides a practical tool for conservationists seeking to detect and monitor rare species on the brink of extinction.
zh

[CV-59] Quantum Doubly Stochastic Transformers

【速读】:该论文旨在解决Transformer中Softmax归一化(使注意力矩阵仅为右随机矩阵)常导致训练不稳定的问题,并提出一种基于量子电路的参数化方法以生成双随机矩阵(Doubly Stochastic Matrix, DSM),从而提升模型性能及训练稳定性。论文的关键创新在于设计了一种混合经典-量子的双随机Transformer(QDSFormer),通过变分量子电路替代自注意力层中的Softmax操作,实现对DSM的直接参数化建模。这种方法不仅提供了经典方法无法实现的独特量子归纳偏置,还显著提升了模型的信息保留能力与表达多样性,在多个小规模物体识别任务中表现出超越标准Vision Transformer及其相关改进版本的一致优越性能,同时展现出更稳定的训练过程和更低的性能波动。
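
作为经典对照,Sinkformer 式的双随机注意力可在对数域用几步 Sinkhorn 迭代近似得到;量子电路参数化无法在此示意,下述代码仅展示被替换的“Softmax → DSM”这一步(迭代步数为假设值):

```python
import torch

def sinkhorn_attention(scores: torch.Tensor, n_iters: int = 3) -> torch.Tensor:
    """scores: (..., n, n) 注意力 logits;返回近似双随机的注意力矩阵。"""
    log_p = scores
    for _ in range(n_iters):
        log_p = log_p - torch.logsumexp(log_p, dim=-1, keepdim=True)  # 行归一化
        log_p = log_p - torch.logsumexp(log_p, dim=-2, keepdim=True)  # 列归一化
    return log_p.exp()  # 迭代次数越多越接近严格的双随机矩阵
```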

链接: https://arxiv.org/abs/2504.16275
作者: Jannis Born,Filip Skogh,Kahn Rhrissorrakrai,Filippo Utro,Nico Wagner,Aleksandros Sobczyk
机构: IBM Research
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE); Computer Vision and Pattern Recognition (cs.CV)
备注: Under Review

点击查看摘要

Abstract:At the core of the Transformer, the Softmax normalizes the attention matrix to be right stochastic. Previous research has shown that this often destabilizes training and that enforcing the attention matrix to be doubly stochastic (through Sinkhorn’s algorithm) consistently improves performance across different tasks, domains and Transformer flavors. However, Sinkhorn’s algorithm is iterative, approximative, non-parametric and thus inflexible w.r.t. the obtained doubly stochastic matrix (DSM). Recently, it has been proven that DSMs can be obtained with a parametric quantum circuit, yielding a novel quantum inductive bias for DSMs with no known classical analogue. Motivated by this, we demonstrate the feasibility of a hybrid classical-quantum doubly stochastic Transformer (QDSFormer) that replaces the Softmax in the self-attention layer with a variational quantum circuit. We study the expressive power of the circuit and find that it yields more diverse DSMs that better preserve information than classical operators. Across multiple small-scale object recognition tasks, we find that our QDSFormer consistently surpasses both a standard Vision Transformer and other doubly stochastic Transformers. Beyond the established Sinkformer, this comparison includes a novel quantum-inspired doubly stochastic Transformer (based on QR decomposition) that can be of independent interest. The QDSFormer also shows improved training stability and lower performance variation suggesting that it may mitigate the notoriously unstable training of ViTs on small-scale data.
zh

[CV-60] DeepCS-TRD, a Deep Learning-based Cross-Section Tree Ring Detector

【速读】:该论文旨在解决树木年轮自动检测的问题,特别是针对不同物种(如火炬松、美国皂荚和灰柳)和图像采集条件(显微镜图像、扫描仪图像及智能手机拍摄图像)下的年轮检测。论文的关键创新在于将传统方法中的边缘检测步骤替换为基于深度学习的方法(U-Net),从而显著提升了在宏观图像上的检测性能,并首次实现了跨物种和多图像域的通用性。此外,作者发布了两个公开可用的标注图像数据集以推动社区研究。

链接: https://arxiv.org/abs/2504.16242
作者: Henry Marichal,Verónica Casaravilla,Candice Power,Karolain Mello,Joaquín Mazarino,Christine Lucas,Ludmila Profumo,Diego Passarella,Gregory Randall
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 12 pages, 6 figures. Accepted in ICIAP 2025

点击查看摘要

Abstract:Here, we propose Deep CS-TRD, a new automatic algorithm for detecting tree rings in whole cross-sections. It substitutes the edge detection step of CS-TRD by a deep-learning-based approach (U-Net), which allows the application of the method to different image domains: microscopy, scanner or smartphone acquired, and species (Pinus taeda, Gleditsia triacanthos and Salix glauca). Additionally, we introduce two publicly available datasets of annotated images to the community. The proposed method outperforms state-of-the-art approaches in macro images (Pinus taeda and Gleditsia triacanthos) while showing slightly lower performance in microscopy images of Salix glauca. To our knowledge, this is the first paper that studies automatic tree ring detection for such different species and acquisition conditions. The dataset and source code are available in this https URL
zh

[CV-61] CLIP-IT: CLIP-based Pairing for Histology Images Classification

【速读】:该论文试图解决在训练基于组织学图像和文本报告的视觉-语言模型(Vision-Language Models, VLMs)时,因需要大规模配对数据集而导致的隐私、数据收集、标注和维护成本等问题。论文的关键解决方案在于提出了一种名为CLIP-IT的方法:首先利用基于CLIP的模态配对模型,将组织学图像与外部来源的特权文本信息进行语义匹配,从而构建增强型多模态数据集,无需人工配对样本;其次设计了一种多模态训练流程,通过知识蒸馏从配对文本模态向单模态图像分类器传递知识,以提升性能,并在推理阶段无需使用文本数据;同时采用参数高效的微调方法缓解主模态(图像)与配对模态(文本)之间的错位问题。实验结果表明,CLIP-IT能够在保持较低计算复杂度的同时,有效利用特权文本信息,显著优于单一模态分类器。
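
其中“模态配对”一步可以用现成的 CLIP 检索来示意:为每张组织学图像从外部报告库中检索语义最相近的文本(模型与数据均为假设;通用 CLIP 并非组织学专用模型,论文实际使用的配对模型可能不同):

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

@torch.no_grad()
def pair_image_with_report(image: Image.Image, reports: list) -> str:
    inputs = processor(text=reports, images=image,
                       return_tensors="pt", padding=True, truncation=True)
    out = model(**inputs)
    # logits_per_image: (1, n_reports),即图文相似度(已含温度缩放)
    best = out.logits_per_image.argmax(dim=-1).item()
    return reports[best]  # 作为该图像的“特权文本”参与后续蒸馏训练
```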

链接: https://arxiv.org/abs/2504.16181
作者: Banafsheh Karimian(1),Giulia Avanzato(2),Soufian Belharbi(1),Luke McCaffrey(3),Mohammadhadi Shateri(1),Eric Granger(1) ((1) LIVIA ILLS Dept. of Systems Engineering ETS Montreal Canada, (2) Dept. of Computer Engineering University of Cagliari Italy, (3) Goodman Cancer Research Centre Dept. of Oncology McGill University Canada)
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Multimodal learning has shown significant promise for improving medical image analysis by integrating information from complementary data sources. This is widely employed for training vision-language models (VLMs) for cancer detection based on histology images and text reports. However, one of the main limitations in training these VLMs is the requirement for large paired datasets, raising concerns over privacy, and data collection, annotation, and maintenance costs. To address this challenge, we introduce CLIP-IT method to train a vision backbone model to classify histology images by pairing them with privileged textual information from an external source. At first, the modality pairing step relies on a CLIP-based model to match histology images with semantically relevant textual report data from external sources, creating an augmented multimodal dataset without the need for manually paired samples. Then, we propose a multimodal training procedure that distills the knowledge from the paired text modality to the unimodal image classifier for enhanced performance without the need for the textual data during inference. A parameter-efficient fine-tuning method is used to efficiently address the misalignment between the main (image) and paired (text) modalities. During inference, the improved unimodal histology classifier is used, with only minimal additional computational complexity. Our experiments on challenging PCAM, CRC, and BACH histology image datasets show that CLIP-IT can provide a cost-effective approach to leverage privileged textual information and outperform unimodal classifiers for histology.
zh

[CV-62] A detection-task-specific deep-learning method to improve the quality of sparse-view myocardial perfusion SPECT images

【速读】:该论文试图解决单光子发射计算机断层成像(Single-Photon Emission Computed Tomography, SPECT)心肌灌注显像(Myocardial Perfusion Imaging, MPI)中因扫描时间过长导致的患者不适、运动伪影,以及由于SPECT扫描与用于衰减补偿的CT扫描配准不良可能引起的诊断不准确问题。同时,减少投影角度虽可缩短扫描时间,但会显著影响重建图像的质量。为此,论文提出了一种针对检测任务的深度学习方法,用于稀疏视图MPI SPECT图像。该方法的关键在于引入观察者损失项(observer loss term),通过惩罚拟人通道(anthropomorphic channel)特征的丢失来提升灌注缺损检测任务的性能,并能有效恢复左心室壁结构,从而克服稀疏采样带来的伪影问题。

链接: https://arxiv.org/abs/2504.16171
作者: Zezhang Yang,Zitong Yu,Nuri Choi,Abhinav K. Jha
机构: Department of Electrical and Systems Engineering, Washington University (华盛顿大学), St. Louis, MO, USA; Department of Biomedical Engineering, Washington University (华盛顿大学), St. Louis, MO, USA; Mallinckrodt Institute of Radiology, Washington University (华盛顿大学), St. Louis, MO, USA
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Myocardial perfusion imaging (MPI) with single-photon emission computed tomography (SPECT) is a widely used and cost-effective diagnostic tool for coronary artery disease. However, the lengthy scanning time in this imaging procedure can cause patient discomfort, motion artifacts, and potentially inaccurate diagnoses due to misalignment between the SPECT scans and the CT-scans which are acquired for attenuation compensation. Reducing projection angles is a potential way to shorten scanning time, but this can adversely impact the quality of the reconstructed images. To address this issue, we propose a detection-task-specific deep-learning method for sparse-view MPI SPECT images. This method integrates an observer loss term that penalizes the loss of anthropomorphic channel features with the goal of improving performance in perfusion defect-detection task. We observed that, on the task of detecting myocardial perfusion defects, the proposed method yielded an area under the receiver operating characteristic (ROC) curve (AUC) significantly larger than the sparse-view protocol. Further, the proposed method was observed to be able to restore the structure of the left ventricle wall, demonstrating ability to overcome sparse-sampling artifacts. Our preliminary results motivate further evaluations of the method.
zh
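
为直观理解上述"观察者损失"的构思,下面给出一个极简的 PyTorch 示意:用一组频带滤波器粗略充当"人形通道",在重建损失之外惩罚预测图与参考图在通道响应上的差异。通道的具体构造与权重 lam 均为本文为说明概念而作的假设,并非论文官方实现:

```python
import torch
import torch.nn.functional as F

def make_channel_bank(num_channels=4, size=64):
    """构造一组简化的通道滤波器(用不同径向频带近似"人形通道",仅为示意)。"""
    freqs = torch.fft.fftfreq(size)
    fy, fx = torch.meshgrid(freqs, freqs, indexing="ij")
    rho = torch.sqrt(fx ** 2 + fy ** 2)
    bands = torch.linspace(0.02, 0.4, num_channels + 1)
    return torch.stack([((rho >= bands[i]) & (rho < bands[i + 1])).float()
                        for i in range(num_channels)])

def channel_responses(img, bank):
    """计算图像在各通道上的响应能量。img: (B, 1, H, W)"""
    spec = torch.fft.fft2(img)
    return torch.stack([(spec * m).abs().mean(dim=(-2, -1)) for m in bank], dim=-1)

def observer_aware_loss(pred, target, bank, lam=0.1):
    """重建损失 + 观察者损失:惩罚通道特征的丢失。"""
    recon = F.mse_loss(pred, target)
    obs = F.mse_loss(channel_responses(pred, bank), channel_responses(target, bank))
    return recon + lam * obs

bank = make_channel_bank()
pred = torch.rand(2, 1, 64, 64, requires_grad=True)   # 稀疏视图重建结果(模拟)
target = torch.rand(2, 1, 64, 64)                     # 全视图参考图像(模拟)
observer_aware_loss(pred, target, bank).backward()
```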

[CV-63] Classification of Firn Data via Topological Features

【速读】:该论文旨在评估拓扑特征在粒雪(firn)图像通用且鲁棒分类中的性能,以更广泛地理解拓扑特征化的优势、陷阱及权衡。论文的关键在于利用子水平集特征(sublevel set features)和距离变换特征(distance transform features),结合持久性曲线(persistence curves),从微计算机断层扫描(microCT)图像预测样本深度。研究通过多种具有挑战性的训练-测试场景发现,没有任何单一方法在所有类别中始终占据主导地位,并揭示了准确性、可解释性和泛化能力之间复杂的权衡关系。

链接: https://arxiv.org/abs/2504.16150
作者: Sarah Day,Jesse Dimino,Matt Jester,Kaitlin Keegan,Thomas Weighill
机构: Department of Mathematics, William & Mary (威廉与玛丽学院); Department of Mathematics, CUNY College of Staten Island (纽约城市大学斯塔滕岛学院); Department of Mathematics and Statistics, University of North Carolina at Greensboro (北卡罗来纳大学格林斯伯勒分校); Nevada Geosciences, Department of Geological Sciences and Engineering, University of Nevada, Reno (内华达大学里诺分校地质科学与工程系); Department of Mathematics and Statistics, University of North Carolina at Greensboro (北卡罗来纳大学格林斯伯勒分校)
类目: Computer Vision and Pattern Recognition (cs.CV); Algebraic Topology (math.AT)
备注:

点击查看摘要

Abstract:In this paper we evaluate the performance of topological features for generalizable and robust classification of firn image data, with the broader goal of understanding the advantages, pitfalls, and trade-offs in topological featurization. Firn refers to layers of granular snow within glaciers that haven’t been compressed into ice. This compactification process imposes distinct topological and geometric structure on firn that varies with depth within the firn column, making topological data analysis (TDA) a natural choice for understanding the connection between depth and structure. We use two classes of topological features, sublevel set features and distance transform features, together with persistence curves, to predict sample depth from microCT images. A range of challenging training-test scenarios reveals that no one choice of method dominates in all categories, and uncovers a web of trade-offs between accuracy, interpretability, and generalizability.
zh
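
作为理解"子水平集拓扑特征"的入门示意,下面用 NumPy/SciPy 统计图像子水平集的连通分量数随阈值的变化(即 0 维 Betti 数曲线),并顺带演示距离变换特征的出发点。这只是论文所用持久性曲线的粗略近似,数据也是随机生成的占位:

```python
import numpy as np
from scipy import ndimage

def betti0_curve(img, thresholds):
    """对子水平集 {x : img(x) <= t} 统计连通分量个数,得到一条拓扑特征曲线。"""
    return np.array([ndimage.label(img <= t)[1] for t in thresholds])

rng = np.random.default_rng(0)
img = rng.random((128, 128))            # 占位数据,实际应为 firn 的 microCT 切片
ts = np.linspace(0.0, 1.0, 32)
sublevel_feat = betti0_curve(img, ts)   # 32 维子水平集特征,可输入任意分类器

dt = ndimage.distance_transform_edt(img > 0.5)   # 二值化后的距离变换
distance_feat = betti0_curve(dt / dt.max(), ts)  # 距离变换上的同类特征
print(sublevel_feat[:8], distance_feat[:8])
```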

[CV-64] Progressive Language-guided Visual Learning for Multi-Task Visual Grounding

【速读】:该论文旨在解决多任务视觉定位(Multi-task Visual Grounding, MTVG)中的两个主要问题:一是现有方法未能充分将语言信息注入整个视觉主干以增强视觉特征提取,且需要额外的跨模态交互模块;二是未有效利用参考表达理解(REC)与参考表达分割(RES)任务之间的关系来实现更精确的协作预测。为了解决这些问题,论文提出了一种渐进式语言引导视觉学习框架(Progressive Language-guided Visual Learning, PLVL)。其关键是不仅深入挖掘视觉模态本身的内在特征表达,还逐步注入语言信息以帮助学习与语言相关的视觉特征,从而无需额外的跨模态融合模块即可全面引入语言指导。此外,通过分析REC任务的定位中心可以部分辅助RES任务的对象区域分割,论文设计了一个多任务头以实现这两个子任务的协作预测。实验结果表明,PLVL在REC和RES任务上均显著优于代表性方法。

链接: https://arxiv.org/abs/2504.16145
作者: Jingchao Wang,Hong Wang,Wenlong Zhang,Kunhua Ji,Dingjiang Huang,Yefeng Zheng
机构: East China Normal University (华东师范大学); Xi’an Jiaotong University (西安交通大学); Westlake University (西湖大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Multi-task visual grounding (MTVG) includes two sub-tasks, i.e., Referring Expression Comprehension (REC) and Referring Expression Segmentation (RES). The existing representative approaches generally follow the research pipeline which mainly consists of three core procedures, including independent feature extraction for visual and linguistic modalities, respectively, cross-modal interaction module, and independent prediction heads for different sub-tasks. Albeit achieving remarkable performance, this research line has two limitations: 1) The linguistic content has not been fully injected into the entire visual backbone for boosting more effective visual feature extraction and it needs an extra cross-modal interaction module; 2) The relationship between REC and RES tasks is not effectively exploited to help the collaborative prediction for more accurate output. To deal with these problems, in this paper, we propose a Progressive Language-guided Visual Learning framework for multi-task visual grounding, called PLVL, which not only finely mines the inherent feature expression of the visual modality itself but also progressively injects the language information to help learn linguistic-related visual features. In this manner, our PLVL does not need an additional cross-modal fusion module while fully introducing the language guidance. Furthermore, our analysis shows that the localization center for REC can help identify the to-be-segmented object region for RES to some extent. Inspired by this investigation, we design a multi-task head to accomplish collaborative predictions for these two sub-tasks. Extensive experiments conducted on several benchmark datasets comprehensively substantiate that our PLVL clearly outperforms the representative methods in both REC and RES tasks. this https URL
zh
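
下面用一个 FiLM 式的特征调制块示意"在视觉主干各阶段逐步注入语言信息"的思路:由语言向量生成逐通道的缩放与偏移来调制视觉特征。这只是本文对 PLVL 思想的简化猜测,结构与维度均为假设:

```python
import torch
import torch.nn as nn

class LanguageGuidedStage(nn.Module):
    """视觉主干中的一个语言引导阶段(示意):语言向量调制视觉特征。"""
    def __init__(self, vis_ch=64, lang_dim=512):
        super().__init__()
        self.conv = nn.Conv2d(vis_ch, vis_ch, 3, padding=1)
        self.to_scale = nn.Linear(lang_dim, vis_ch)
        self.to_shift = nn.Linear(lang_dim, vis_ch)

    def forward(self, v, l):                        # v: (B,C,H,W), l: (B,lang_dim)
        v = torch.relu(self.conv(v))
        gamma = self.to_scale(l)[:, :, None, None]  # 逐通道缩放
        beta = self.to_shift(l)[:, :, None, None]   # 逐通道偏移
        return v * (1 + gamma) + beta

stage = LanguageGuidedStage()
v, l = torch.randn(2, 64, 56, 56), torch.randn(2, 512)
print(stage(v, l).shape)   # torch.Size([2, 64, 56, 56])
```

将若干个这样的阶段串联,即可在不引入独立跨模态融合模块的情况下,让语言信息贯穿整个视觉特征提取过程,这正是摘要所强调的设计动机。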

[CV-65] Hybrid Knowledge Transfer through Attention and Logit Distillation for On-Device Vision Systems in Agricultural IoT

【速读】:该论文旨在解决在农业物联网(IoT)系统中将深度学习应用集成到资源受限的边缘设备时面临的挑战,即如何在保证Vision Transformers (ViTs)高精度的同时满足效率需求。论文的关键创新在于提出了一种混合知识蒸馏框架,通过从Swin Transformer教师模型协同转移logit和注意力知识至MobileNetV3学生模型来弥合轻量级模型与复杂Transformer模型之间的性能差距。解决方案的关键包括引入自适应注意力对齐以解决跨架构不匹配问题,并采用双损失函数优化类别概率与空间关注,同时设计了针对IoT设备的轻量化验证指标(如13 MB内存占用、0.22 GFLOPs计算复杂度)及动态分辨率适配的注意力映射。实验结果表明,该方法显著提升了推理效率,在保持较高精度的同时大幅降低了计算成本和延迟。

链接: https://arxiv.org/abs/2504.16128
作者: Stanley Mugisha,Rashid Kisitu,Florence Tushabe
机构: Soroti University (索罗蒂大学), Uganda
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 12 pages and 4 figures

点击查看摘要

Abstract:Integrating deep learning applications into agricultural IoT systems faces a serious challenge of balancing the high accuracy of Vision Transformers (ViTs) with the efficiency demands of resource-constrained edge devices. Large transformer models like the Swin Transformers excel in plant disease classification by capturing global-local dependencies. However, their computational complexity (34.1 GFLOPs) limits applications and renders them impractical for real-time on-device inference. Lightweight models such as MobileNetV3 and TinyML would be suitable for on-device inference but lack the required spatial reasoning for fine-grained disease detection. To bridge this gap, we propose a hybrid knowledge distillation framework that synergistically transfers logit and attention knowledge from a Swin Transformer teacher to a MobileNetV3 student model. Our method includes the introduction of adaptive attention alignment to resolve cross-architecture mismatch (resolution, channels) and a dual-loss function optimizing both class probabilities and spatial focus. On the PlantVillage-Tomato dataset (18,160 images), the distilled MobileNetV3 attains 92.4% accuracy relative to 95.9% for Swin-L, with a 95% reduction in inference latency on PC and 82% on IoT devices (23 ms on a PC CPU and 86 ms/image on smartphone CPUs). Key innovations include IoT-centric validation metrics (13 MB memory, 0.22 GFLOPs) and dynamic resolution-matching attention maps. Comparative experiments show significant improvements over standalone CNNs and prior distillation methods, with a 3.5% accuracy gain over MobileNetV3 baselines. Significantly, this work advances real-time, energy-efficient crop monitoring in precision agriculture and demonstrates how we can attain ViT-level diagnostic precision on edge devices. Code and models will be made available for replication after acceptance.
zh
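
摘要中的"logit 蒸馏 + 注意力蒸馏"混合损失可以用如下 PyTorch 片段概括。注意力对齐这里简化为双线性插值到教师分辨率;温度 T 与系数 alpha/beta 均为假设值,自适应对齐的具体形式以论文为准:

```python
import torch
import torch.nn.functional as F

def hybrid_kd_loss(s_logits, t_logits, s_attn, t_attn, labels,
                   T=4.0, alpha=0.5, beta=0.1):
    """监督损失 + logit蒸馏(KL散度) + 注意力图对齐(MSE)的示意组合。"""
    ce = F.cross_entropy(s_logits, labels)
    kd = F.kl_div(F.log_softmax(s_logits / T, dim=-1),
                  F.softmax(t_logits / T, dim=-1),
                  reduction="batchmean") * T * T
    s_attn = F.interpolate(s_attn, size=t_attn.shape[-2:],
                           mode="bilinear", align_corners=False)
    return ce + alpha * kd + beta * F.mse_loss(s_attn, t_attn)

s_logits = torch.randn(8, 10, requires_grad=True)   # 学生(MobileNetV3)输出
t_logits = torch.randn(8, 10)                       # 教师(Swin)输出
s_attn = torch.rand(8, 1, 7, 7, requires_grad=True)
t_attn = torch.rand(8, 1, 14, 14)
labels = torch.randint(0, 10, (8,))
hybrid_kd_loss(s_logits, t_logits, s_attn, t_attn, labels).backward()
```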

[CV-66] MonoTher-Depth: Enhancing Thermal Depth Estimation via Confidence-Aware Distillation

【速读】:该论文旨在解决热成像单目深度估计(MDE)中因标注数据有限导致的泛化能力不足问题,与基于大量多样化RGB图像数据集的传统RGB-MDE模型相比,热成像MDE模型受限于标注数据的稀缺性。为应对这一挑战,论文提出了一种通过知识蒸馏从通用RGB-MDE模型迁移知识来增强热成像MDE的新方法。其关键在于一种置信度感知的知识蒸馏方法,利用RGB-MDE模型预测的置信度有选择地强化热成像MDE模型,充分发挥RGB模型的优势并弥补其不足,从而显著提升热成像MDE的精度,并且不依赖于深度监督标签,极大地扩展了其在新场景中的适用性。实验表明,在无标注深度的新场景中,所提出的置信度感知蒸馏方法相较于未使用蒸馏的基线方法,绝对相对误差降低了22.88%。

链接: https://arxiv.org/abs/2504.16127
作者: Xingxing Zuo,Nikhil Ranganathan,Connor Lee,Georgia Gkioxari,Soon-Jo Chung
机构: California Institute of Technology (加州理工学院), Pasadena, California, USA
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
备注: 8 Pages; The code will be available at this https URL

点击查看摘要

Abstract:Monocular depth estimation (MDE) from thermal images is a crucial technology for robotic systems operating in challenging conditions such as fog, smoke, and low light. The limited availability of labeled thermal data constrains the generalization capabilities of thermal MDE models compared to foundational RGB MDE models, which benefit from datasets of millions of images across diverse scenarios. To address this challenge, we introduce a novel pipeline that enhances thermal MDE through knowledge distillation from a versatile RGB MDE model. Our approach features a confidence-aware distillation method that utilizes the predicted confidence of the RGB MDE to selectively strengthen the thermal MDE model, capitalizing on the strengths of the RGB model while mitigating its weaknesses. Our method significantly improves the accuracy of the thermal MDE, independent of the availability of labeled depth supervision, and greatly expands its applicability to new scenarios. In our experiments on new scenarios without labeled depth, the proposed confidence-aware distillation method reduces the absolute relative error of thermal MDE by 22.88% compared to the baseline without distillation.
zh
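
"置信度感知蒸馏"的核心是在逐像素蒸馏损失上乘以教师模型的置信度权重,使热成像学生模型只在 RGB 教师可靠的区域被强化。下面是按此思路写的极简示意(归一化方式与损失形式均为本文假设):

```python
import torch

def confidence_aware_distill_loss(student_depth, teacher_depth, teacher_conf):
    """teacher_conf ∈ [0,1]:高置信区域蒸馏权重大,低置信区域被弱化。"""
    per_pixel = (student_depth - teacher_depth).abs()
    return (teacher_conf * per_pixel).sum() / teacher_conf.sum().clamp(min=1e-6)

student = torch.rand(2, 1, 60, 80, requires_grad=True)  # 热成像学生的深度预测
teacher = torch.rand(2, 1, 60, 80)                      # RGB 教师的深度预测
conf = torch.rand(2, 1, 60, 80)                         # RGB 教师的逐像素置信度
confidence_aware_distill_loss(student, teacher, conf).backward()
```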

[CV-67] Context-Awareness and Interpretability of Rare Occurrences for Discovery and Formalization of Critical Failure Modes

【速读】:本文旨在解决视觉系统在关键领域(如监控、执法和交通)中因罕见或不可预见场景导致的潜在安全风险问题。论文提出了一种基于本体论的人机辅助发现框架——CAIRO(Context-Awareness and Interpretability of Rare Occurrences),用于检测和形式化失败案例(CP - Critical Phenomena)。其关键是通过设计激励人机交互流程,评估并测试AI黑盒模型在误检、对抗攻击及幻觉等情形下的关键性问题,并通过分析自动化驾驶系统中目标检测模型的失效情况,以可扩展且可解释的方式填补相机感知与真实世界上下文之间的差距。最终,将这些测试案例存储为显式的知识图谱(OWL/XML格式),以便共享、下游分析、逻辑推理以及责任追溯。

链接: https://arxiv.org/abs/2504.16117
作者: Sridevi Polavaram,Xin Zhou,Meenu Ravi,Mohammad Zarei,Anmol Srivastava
机构: MITRE CORP. (MITRE公司)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
备注: Accepted to IEEE Conference for Artificial Intelligence, 2025

点击查看摘要

Abstract:Vision systems are increasingly deployed in critical domains such as surveillance, law enforcement, and transportation. However, their vulnerabilities to rare or unforeseen scenarios pose significant safety risks. To address these challenges, we introduce Context-Awareness and Interpretability of Rare Occurrences (CAIRO), an ontology-based human-assistive discovery framework for failure cases (or CP - Critical Phenomena) detection and formalization. CAIRO by design incentivizes human-in-the-loop for testing and evaluation of criticality that arises from misdetections, adversarial attacks, and hallucinations in AI black-box models. Our robust analysis of object detection model(s) failures in automated driving systems (ADS) showcases scalable and interpretable ways of formalizing the observed gaps between camera perception and real-world contexts, resulting in test cases stored as explicit knowledge graphs (in OWL/XML format) amenable for sharing, downstream analysis, logical reasoning, and accountability.
zh

[CV-68] Persistence-based Hough Transform for Line Detection

【速读】:该论文旨在解决传统霍夫变换(Hough Transform)在检测线条或其他物体时因简单阈值化投票过程易受噪声和其他伪影影响的问题。论文的关键创新在于提出了一种基于持久同调(Persistent Homology)的替代投票技术,用于在霍夫空间中检测峰值。这种方法能够自然克服简单阈值化方法的局限性,显著提高检测性能,并增强算法的鲁棒性。实验结果表明,所提出的方法在合成数据上的表现优于原始方法。

链接: https://arxiv.org/abs/2504.16114
作者: Johannes Ferner,Stefan Huber,Saverio Messineo,Angel Pop,Martin Uray
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted at iDSC’25, Salzburg, Austria

点击查看摘要

Abstract:The Hough transform is a popular and classical technique in computer vision for the detection of lines (or more general objects). It maps a pixel into a dual space – the Hough space: each pixel is mapped to the set of lines through this pixel, which forms a curve in Hough space. The detection of lines then becomes a voting process to find those lines that received many votes by pixels. However, this voting is done by thresholding, which is susceptible to noise and other artifacts. In this work, we present an alternative voting technique to detect peaks in the Hough space based on persistent homology, which very naturally addresses limitations of simple thresholding. Experiments on synthetic data show that our method significantly outperforms the original method, while also demonstrating enhanced robustness. This work seeks to inspire future research in two key directions. First, we highlight the untapped potential of Topological Data Analysis techniques and advocate for their broader integration into existing methods, including well-established ones. Secondly, we initiate a discussion on the mathematical stability of the Hough transform, encouraging exploration of mathematically grounded improvements to enhance its robustness.
zh
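
"以持久同调代替简单阈值化来挑选 Hough 空间峰值"可以用并查集直接实现:按累加器值从大到小加入像素,分量合并时记录较"年轻"峰的持久值(诞生值减合并值),只保留持久值足够大的峰。以下是一个自包含的简化实现(8 邻域简化为 4 邻域,阈值为假设值,并非论文代码):

```python
import numpy as np
from skimage.transform import hough_line

def persistence_peaks(acc, min_persistence):
    """基于 0 维持久同调(超水平集方向)的峰值检测示意。"""
    h, w = acc.shape
    order = np.argsort(acc.ravel())[::-1]          # 按值降序加入像素
    parent = np.full(h * w, -1, dtype=np.int64)    # -1 表示尚未加入
    birth, peaks = {}, []

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for idx in order:
        y, x = divmod(int(idx), w)
        parent[idx] = idx
        birth[idx] = acc[y, x]
        for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
            if 0 <= ny < h and 0 <= nx < w and parent[ny * w + nx] != -1:
                ra, rb = find(idx), find(ny * w + nx)
                if ra == rb:
                    continue
                young, old = (ra, rb) if birth[ra] < birth[rb] else (rb, ra)
                if birth[young] - acc[y, x] >= min_persistence:
                    peaks.append((young // w, young % w))   # 持久的峰被保留
                parent[young] = old
    gm = int(order[0])
    peaks.append((gm // w, gm % w))                # 全局最大值永不死亡
    return peaks

# 两条直线的二值图 -> Hough 累加器 -> 持久峰即检测出的直线 (dist_idx, theta_idx)
img = np.zeros((64, 64), dtype=bool)
img[20, 5:60] = True
img[5:60, 40] = True
acc, thetas, dists = hough_line(img)
print(persistence_peaks(acc.astype(float), min_persistence=20.0))
```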

[CV-69] Shape Your Ground: Refining Road Surfaces Beyond Planar Representations

【速读】:本文旨在解决从航空影像中重建路面时存在的表面不平滑、紧凑性不足及准确性受限等问题。现有方法常产生伪影与不一致性,而下游任务倾向于将道路简化为平面以方便处理,但牺牲了精度。为应对这些挑战,论文提出了FlexRoad框架,其核心创新在于利用Elevation-Constrained Spatial Road Clustering (ECSRC)算法实现鲁棒的异常值修正,显著降低表面粗糙度和拟合误差,并通过拟合非均匀有理B样条(Non-Uniform Rational B-Splines, NURBS)曲面来直接优化路面平滑度。此外,为了量化评估不同方法的效果,论文构建了GeoRoad Dataset (GeRoD),包含多种来源的路面与地形剖面数据。实验结果表明,FlexRoad在多个指标上优于常用方法,且对输入源、地形类型及噪声具有良好的适应性。通过消融研究,进一步验证了各组成部分对高质量重建性能的关键作用,使FlexRoad成为一种通用的真实路面建模方法。

链接: https://arxiv.org/abs/2504.16103
作者: Oussema Dhaouadi,Johannes Meier,Jacques Kaiser,Daniel Cremers
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
备注:

点击查看摘要

Abstract:Road surface reconstruction from aerial images is fundamental for autonomous driving, urban planning, and virtual simulation, where smoothness, compactness, and accuracy are critical quality factors. Existing reconstruction methods often produce artifacts and inconsistencies that limit usability, while downstream tasks have a tendency to represent roads as planes for simplicity but at the cost of accuracy. We introduce FlexRoad, the first framework to directly address road surface smoothing by fitting Non-Uniform Rational B-Splines (NURBS) surfaces to 3D road points obtained from photogrammetric reconstructions or geodata providers. Our method at its core utilizes the Elevation-Constrained Spatial Road Clustering (ECSRC) algorithm for robust anomaly correction, significantly reducing surface roughness and fitting errors. To facilitate quantitative comparison between road surface reconstruction methods, we present GeoRoad Dataset (GeRoD), a diverse collection of road surface and terrain profiles derived from openly accessible geodata. Experiments on GeRoD and the photogrammetry-based DeepScenario Open 3D Dataset (DSC3D) demonstrate that FlexRoad considerably surpasses commonly used road surface representations across various metrics while being insensitive to various input sources, terrains, and noise types. By performing ablation studies, we identify the key role of each component towards high-quality reconstruction performance, making FlexRoad a generic method for realistic road surface modeling.
zh
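
路面点的"异常值剔除 + 样条曲面拟合"流程可以用 SciPy 做一个很粗的示意。这里用中位数残差截断近似替代论文的 ECSRC 聚类,用平滑双变量 B 样条近似替代 NURBS,数据为随机模拟:

```python
import numpy as np
from scipy.interpolate import SmoothBivariateSpline

rng = np.random.default_rng(42)
x = rng.uniform(0, 50, 2000)                  # 沿路方向 (m)
y = rng.uniform(0, 8, 2000)                   # 横向 (m)
z = 0.02 * x + 0.05 * np.sin(0.3 * x) + rng.normal(0, 0.03, x.size)  # 含噪高程

keep = np.abs(z - np.median(z)) < 3 * np.std(z)   # 粗糙的 3σ 异常值剔除

spline = SmoothBivariateSpline(x[keep], y[keep], z[keep], kx=3, ky=3, s=len(z))
print(float(spline.ev(25.0, 4.0)))            # 查询任意 (x, y) 处的平滑路面高程
```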

[CV-70] Audio and Multiscale Visual Cues Driven Cross-modal Transformer for Idling Vehicle Detection

【速读】:该论文旨在解决车辆怠速检测(Idle Vehicle Detection, IVD)任务中跨模态信息对齐的挑战,特别是视频与音频模态之间的关联建模问题。传统方法因采用基本注意力机制难以有效融合视觉和听觉线索,导致漏检现象严重。为了解决这一问题,论文提出了一种基于Transformer的端到端检测网络AVIVDNetv2,其关键在于引入跨模态Transformer以全局patch级别学习、多尺度视觉特征融合模块以及解耦检测头,从而显著提升了跨模态对齐能力和检测性能。实验结果表明,AVIVDNetv2在mAP指标上分别比基准模型提升了7.66和9.42个百分点,并在所有车辆类别上保持一致的平均精度增益(AP),同时刷新了 sounding object localization 领域的性能记录。

链接: https://arxiv.org/abs/2504.16102
作者: Xiwen Li,Ross Whitaker,Tolga Tasdizen
机构: Scientific Computing and Imaging (SCI) Institute at the University of Utah (犹他大学科学计算与成像研究所), Salt Lake City, UT, USA
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
备注:

点击查看摘要

Abstract:Idling vehicle detection (IVD) supports real-time systems that reduce pollution and emissions by dynamically messaging drivers to curb excess idling behavior. In computer vision, IVD has become an emerging task that leverages video from surveillance cameras and audio from remote microphones to localize and classify vehicles in each frame as moving, idling, or engine-off. As with other cross-modal tasks, the key challenge lies in modeling the correspondence between audio and visual modalities, which differ in representation but provide complementary cues – video offers spatial and motion context, while audio conveys engine activity beyond the visual field. The previous end-to-end model, which uses a basic attention mechanism, struggles to align these modalities effectively, often missing vehicle detections. To address this issue, we propose AVIVDNetv2, a transformer-based end-to-end detection network. It incorporates a cross-modal transformer with global patch-level learning, a multiscale visual feature fusion module, and decoupled detection heads. Extensive experiments show that AVIVDNetv2 improves mAP by 7.66 over the disjoint baseline and 9.42 over the E2E baseline, with consistent AP gains across all vehicle categories. Furthermore, AVIVDNetv2 outperforms the state-of-the-art method for sounding object localization, establishing a new performance benchmark on the AVIVD dataset.
zh
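
"音频 token 作为 query、视觉 patch token 作为 key/value"的跨模态 Transformer 块是此类方法的常见骨架,可用几行 PyTorch 表达(维度与层数均为假设,并非 AVIVDNetv2 的真实配置):

```python
import torch
import torch.nn as nn

class CrossModalBlock(nn.Module):
    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1, self.norm2 = nn.LayerNorm(dim), nn.LayerNorm(dim)
        self.ffn = nn.Sequential(nn.Linear(dim, dim * 4), nn.GELU(),
                                 nn.Linear(dim * 4, dim))

    def forward(self, audio_tokens, visual_tokens):
        # 音频作为 query 去"查询"视觉 patch,实现全局 patch 级的跨模态对齐
        fused, _ = self.attn(audio_tokens, visual_tokens, visual_tokens)
        x = self.norm1(audio_tokens + fused)
        return self.norm2(x + self.ffn(x))

block = CrossModalBlock()
audio = torch.randn(2, 16, 256)    # (B, 音频token数, dim)
visual = torch.randn(2, 196, 256)  # (B, 视觉patch数, dim)
print(block(audio, visual).shape)  # torch.Size([2, 16, 256])
```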

[CV-71] Advanced Chest X-Ray Analysis via Transformer-Based Image Descriptors and Cross-Model Attention Mechanism

【速读】:该论文旨在解决通过胸片图像生成高质量描述的问题,以辅助检测多种胸部疾病。解决方案的关键在于提出了一种结合Vision Transformer (ViT) 编码器、跨模态注意力机制以及基于GPT-4的解码器的新模型。ViT提取高保真的视觉特征,并通过跨模态注意力与文本数据融合,提升图像描述的准确性、上下文连贯性和丰富性;GPT-4解码器进一步将这些融合特征转化为精确且相关的图像标题。这种端到端的方法在Indiana University (IU) 和 National Institutes of Health (NIH) 胸部X光数据集上的出色表现验证了其有效性。

链接: https://arxiv.org/abs/2504.16774
作者: Lakshita Agarwal,Bindu Verma
机构: Department of Information Technology, Delhi Technological University (德里技术大学)
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:The examination of chest X-ray images is a crucial component in detecting various thoracic illnesses. This study introduces a new image description generation model that integrates a Vision Transformer (ViT) encoder with cross-modal attention and a GPT-4-based transformer decoder. The ViT captures high-quality visual features from chest X-rays, which are fused with text data through cross-modal attention to improve the accuracy, context, and richness of image descriptions. The GPT-4 decoder transforms these fused features into accurate and relevant captions. The model was tested on the National Institutes of Health (NIH) and Indiana University (IU) Chest X-ray datasets. On the IU dataset, it achieved scores of 0.854 (B-1), 0.883 (CIDEr), 0.759 (METEOR), and 0.712 (ROUGE-L). On the NIH dataset, it achieved the best performance on all metrics: BLEU 1–4 (0.825, 0.788, 0.765, 0.752), CIDEr (0.857), METEOR (0.726), and ROUGE-L (0.705). This framework has the potential to enhance chest X-ray evaluation, assisting radiologists in more precise and efficient diagnosis.
zh

[CV-72] Frequency-Compensated Network for Daily Arctic Sea Ice Concentration Prediction

【速读】:该论文旨在解决北极海冰浓度(Sea Ice Concentration, SIC)预测中的两个关键挑战:1) 当前方法难以探索频域中的长期特征依赖关系;2) 难以保留高频细节,导致边缘区域的变化无法被精确捕捉。为了解决这些问题,论文提出了一种名为Frequency-Compensated Network (FCNet) 的方法,用于每日尺度上的北极SIC预测。其关键是设计了一个双分支网络,分别用于频域特征提取和卷积特征提取。在频域特征提取方面,引入了自适应频率滤波块,结合可训练层与基于傅里叶的滤波器;在卷积特征提取方面,提出了高频频增强块,并通过通道注意力机制增强高频特征,同时利用时间注意力单元捕获低频特征以揭示长时间范围内的海冰变化。这些创新共同提升了边缘和细节预测的精度。

链接: https://arxiv.org/abs/2504.16745
作者: Jialiang Zhang,Feng Gao,Yanhai Gan,Junyu Dong,Qian Du
机构: School of Computer Science and Technology, Ocean University of China (海洋大学计算机科学与技术学院); Department of Electrical and Computer Engineering, Mississippi State University (密西西比州立大学电气与计算机工程系)
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by IEEE TGRS 2025

点击查看摘要

Abstract:Accurately forecasting sea ice concentration (SIC) in the Arctic is critical to global ecosystem health and navigation safety. However, current methods are still confronted with two challenges: 1) they rarely explore the long-term feature dependencies in the frequency domain; 2) they can hardly preserve the high-frequency details, and the changes in the marginal area of the sea ice cannot be accurately captured. To this end, we present a Frequency-Compensated Network (FCNet) for Arctic SIC prediction on a daily basis. In particular, we design a dual-branch network, including branches for frequency feature extraction and convolutional feature extraction. For frequency feature extraction, we design an adaptive frequency filter block, which integrates trainable layers with Fourier-based filters. By adding frequency features, the FCNet can achieve refined prediction of edges and details. For convolutional feature extraction, we propose a high-frequency enhancement block to separate high and low-frequency information. Moreover, high-frequency features are enhanced via channel-wise attention, and temporal attention unit is employed for low-frequency feature extraction to capture long-range sea ice changes. Extensive experiments are conducted on a satellite-derived daily SIC dataset, and the results verify the effectiveness of the proposed FCNet. Our codes and data will be made public available at: this https URL .
zh
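
摘要中的"自适应频率滤波块"(可训练层与基于傅里叶的滤波器结合)可以理解为:对特征图做 FFT,乘以一组可训练的频域权重,再做逆变换。以下是按这一理解写的极简示意(通道数与尺寸均为假设):

```python
import torch
import torch.nn as nn

class AdaptiveFrequencyFilter(nn.Module):
    def __init__(self, channels=32, height=64, width=64):
        super().__init__()
        # rfft2 输出的最后一维为 width // 2 + 1
        self.weight = nn.Parameter(torch.ones(channels, height, width // 2 + 1))

    def forward(self, x):                        # x: (B, C, H, W)
        spec = torch.fft.rfft2(x, norm="ortho")
        spec = spec * self.weight                # 可训练的逐频率加权
        return torch.fft.irfft2(spec, s=x.shape[-2:], norm="ortho")

m = AdaptiveFrequencyFilter()
print(m(torch.randn(2, 32, 64, 64)).shape)       # torch.Size([2, 32, 64, 64])
```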

[CV-73] Comprehensive Evaluation of Quantitative Measurements from Automated Deep Segmentations of PSMA PET/CT Images

【速读】:该论文旨在解决基于自动化深度学习分割方法提取的定量测量值在前列腺癌生物复发患者[18F]DCFPyL PSMA靶向PET/CT图像分析中的准确性与一致性问题,超越传统Dice相似性系数评估,聚焦于SUVmax、SUVmean、总病灶活性(TLA)、肿瘤体积(TMTV)、病灶计数及病灶扩散等六个定量指标。论文的关键解决方案在于提出了一种新的损失函数——L1加权Dice焦点损失(L1DFL),并与多种深度神经网络模型(U-Net、Attention U-Net和SegResNet)结合进行训练。实验结果表明,Attention U-Net与L1DFL的组合在SUVmax和TLA等指标上实现了与地面实况最强的相关性(一致性相关系数=0.90-0.99),并通过等效性检验确认了其在SUV指标、病灶计数和TLA方面的高性能表现,同时通过Bland-Altman、覆盖概率和总偏差指数分析证明L1DFL能够有效减少量化地面实况临床测量的变异性。

链接: https://arxiv.org/abs/2504.16237
作者: Obed Korshie Dzikunu,Amirhossein Toosi,Shadab Ahamed,Sara Harsini,Francois Benard,Xiaoxiao Li,Arman Rahmim
机构: University of British Columbia (英属哥伦比亚大学), Vancouver, Canada; BC Cancer Research Institute (BC 癌症研究中心), Vancouver, Canada; Vector Institute (向量研究所), Toronto, Canada
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: 12 pages, 8 figures

点击查看摘要

Abstract:This study performs a comprehensive evaluation of quantitative measurements as extracted from automated deep-learning-based segmentation methods, beyond traditional Dice Similarity Coefficient assessments, focusing on six quantitative metrics, namely SUVmax, SUVmean, total lesion activity (TLA), tumor volume (TMTV), lesion count, and lesion spread. We analyzed 380 prostate-specific membrane antigen (PSMA) targeted [18F]DCFPyL PET/CT scans of patients with biochemical recurrence of prostate cancer, training deep neural networks, U-Net, Attention U-Net and SegResNet with four loss functions: Dice Loss, Dice Cross Entropy, Dice Focal Loss, and our proposed L1 weighted Dice Focal Loss (L1DFL). Evaluations indicated that Attention U-Net paired with L1DFL achieved the strongest correlation with the ground truth (concordance correlation = 0.90-0.99 for SUVmax and TLA), whereas models employing the Dice Loss and the other two compound losses, particularly with SegResNet, underperformed. Equivalence testing (TOST, alpha = 0.05, Delta = 20%) confirmed high performance for SUV metrics, lesion count and TLA, with L1DFL yielding the best performance. By contrast, tumor volume and lesion spread exhibited greater variability. Bland-Altman, Coverage Probability, and Total Deviation Index analyses further highlighted that our proposed L1DFL minimizes variability in quantification of the ground truth clinical measures. The code is publicly available at: this https URL.
zh
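
论文提出的 L1 加权 Dice 焦点损失(L1DFL)未在摘要中给出公式,下面按"Dice Focal Loss 与 L1 项组合"的字面含义给出一种猜测性的写法,仅帮助理解这类复合分割损失的结构,切勿当作论文的真实定义:

```python
import torch

def l1_weighted_dice_focal_loss(pred, target, gamma=2.0, lam=1.0, eps=1e-6):
    """pred: sigmoid 后的概率图; target: 0/1 掩膜。具体形式为本文假设。"""
    p, t = pred.flatten(1), target.flatten(1)
    inter = (p * t).sum(dim=1)
    dice = 1 - (2 * inter + eps) / (p.sum(1) + t.sum(1) + eps)
    focal_dice = dice.pow(gamma)            # 对难样本加大权重的 Dice 项
    l1 = (p - t).abs().mean(dim=1)          # L1 加权项
    return (focal_dice + lam * l1).mean()

pred = torch.rand(4, 1, 32, 32, requires_grad=True)
target = (torch.rand(4, 1, 32, 32) > 0.7).float()
l1_weighted_dice_focal_loss(pred, target).backward()
```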

[CV-74] BrainPrompt: Multi-Level Brain Prompt Enhancement for Neurological Condition Identification

【速读】:本文旨在解决神经系统疾病(如阿尔茨海默病)早期诊断困难的问题,特别是在症状与健康对照组相似的早期阶段。现有基于脑网络分析的方法主要依赖图模型处理成像数据,可能忽略重要的非成像因素,从而限制了模型的预测能力和可解释性。为了解决这一问题,论文提出了BrainPrompt框架,其关键是通过将大型语言模型(Large Language Models, LLMs)与知识驱动提示相结合来增强图神经网络(Graph Neural Networks, GNNs)。BrainPrompt设计了三种知识驱动提示:(1) 区域提示(ROI-level prompts)用于编码每个脑区的身份和功能;(2) 主体提示(subject-level prompts)整合人口统计信息;(3) 疾病提示(disease-level prompts)捕捉疾病的时序进展。通过多层级提示,BrainPrompt有效利用了来自LLMs的知识增强多模态信息,提升了模型对神经系统疾病分期的预测能力,并提供了更易解释的结果。实验表明,BrainPrompt在两个神经障碍的静息态功能磁共振成像(resting-state fMRI)数据集上的表现优于当前最先进的方法,并且通过生物标志物研究验证了其提取有价值且可解释信息的能力。

链接: https://arxiv.org/abs/2504.16096
作者: Jiaxing Xu,Kai He,Yue Tang,Wei Li,Mengcheng Lan,Xia Dong,Yiping Ke,Mengling Feng
机构: 未知
类目: Neurons and Cognition (q-bio.NC); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Neurological conditions, such as Alzheimer’s Disease, are challenging to diagnose, particularly in the early stages where symptoms closely resemble healthy controls. Existing brain network analysis methods primarily focus on graph-based models that rely solely on imaging data, which may overlook important non-imaging factors and limit the model’s predictive power and interpretability. In this paper, we present BrainPrompt, an innovative framework that enhances Graph Neural Networks (GNNs) by integrating Large Language Models (LLMs) with knowledge-driven prompts, enabling more effective capture of complex, non-imaging information and external knowledge for neurological disease identification. BrainPrompt integrates three types of knowledge-driven prompts: (1) ROI-level prompts to encode the identity and function of each brain region, (2) subject-level prompts that incorporate demographic information, and (3) disease-level prompts to capture the temporal progression of disease. By leveraging these multi-level prompts, BrainPrompt effectively harnesses knowledge-enhanced multi-modal information from LLMs, enhancing the model’s capability to predict neurological disease stages and meanwhile offers more interpretable results. We evaluate BrainPrompt on two resting-state functional Magnetic Resonance Imaging (fMRI) datasets from neurological disorders, showing its superiority over state-of-the-art methods. Additionally, a biomarker study demonstrates the framework’s ability to extract valuable and interpretable information aligned with domain knowledge in neuroscience.
zh

人工智能

[AI-0] Latent Diffusion Planning for Imitation Learning

【速读】:该论文试图解决现有模仿学习方法依赖大量专家演示数据的问题。为应对这一挑战,论文提出了一种名为潜在扩散规划(Latent Diffusion Planning, LDP)的模块化方法。LDP 的关键在于结合了一个能够利用无动作演示的规划器和一个能够利用次优数据的逆动力学模型,二者均在学习到的潜在空间中运作。通过引入变分自编码器学习紧凑的潜在表示,实现基于图像域的未来状态有效预测,并借助扩散目标函数训练规划器和逆动力学模型。这种分离规划与动作预测的设计使 LDP 能够充分利用次优及无动作数据提供的更密集监督信号,在模拟视觉机器人操作任务中超越了当前最先进的模仿学习方法。

链接: https://arxiv.org/abs/2504.16925
作者: Amber Xie,Oleh Rybkin,Dorsa Sadigh,Chelsea Finn
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Recent progress in imitation learning has been enabled by policy architectures that scale to complex visuomotor tasks, multimodal distributions, and large datasets. However, these methods often rely on learning from large amount of expert demonstrations. To address these shortcomings, we propose Latent Diffusion Planning (LDP), a modular approach consisting of a planner which can leverage action-free demonstrations, and an inverse dynamics model which can leverage suboptimal data, that both operate over a learned latent space. First, we learn a compact latent space through a variational autoencoder, enabling effective forecasting of future states in image-based domains. Then, we train a planner and an inverse dynamics model with diffusion objectives. By separating planning from action prediction, LDP can benefit from the denser supervision signals of suboptimal and action-free data. On simulated visual robotic manipulation tasks, LDP outperforms state-of-the-art imitation learning approaches, as they cannot leverage such additional data.
zh
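
LDP 的"编码器 + 潜空间规划器 + 逆动力学模型"三件套可以用接口级的伪实现勾勒出来。下面的结构与维度均为假设,扩散规划器以单次前向的去噪网络占位(实际需要完整的扩散训练与迭代采样):

```python
import torch
import torch.nn as nn

latent_dim, act_dim, horizon = 32, 4, 8

encoder = nn.Sequential(nn.Flatten(), nn.Linear(64 * 64 * 3, 256), nn.ReLU(),
                        nn.Linear(256, latent_dim))           # VAE 编码器(均值部分)
planner = nn.Sequential(nn.Linear(latent_dim * (horizon + 1), 256), nn.ReLU(),
                        nn.Linear(256, latent_dim * horizon)) # 扩散去噪网络占位
inv_dyn = nn.Sequential(nn.Linear(latent_dim * 2, 128), nn.ReLU(),
                        nn.Linear(128, act_dim))              # 逆动力学模型

obs = torch.randn(1, 3, 64, 64)                # 当前图像观测
z0 = encoder(obs)
noise = torch.randn(1, latent_dim * horizon)   # 扩散采样起点
plan = planner(torch.cat([z0, noise], -1)).view(1, horizon, latent_dim)
action = inv_dyn(torch.cat([z0, plan[:, 0]], -1))   # 由相邻潜状态推回动作
print(action.shape)                            # torch.Size([1, 4])
```

规划器可用无动作数据训练、逆动力学模型可用次优数据训练,二者由此解耦,这正是摘要所述 LDP 能利用额外数据的原因。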

[AI-1] Building A Secure Agentic AI Application Leveraging A2A Protocol

【速读】:该论文旨在解决复杂多智能体协作场景下,Google Agent2Agent (A2A) 协议的安全实施与部署问题,以确保其交互的可靠性和安全性。论文的关键在于通过综合分析 A2A 协议的基本组成、运行机制及其在智能体通信框架中的定位,结合 MAESTRO 风险评估框架,采用主动威胁建模方法,针对代理卡管理、任务执行完整性以及认证方法等核心方面,识别潜在的安全隐患。基于这些分析,论文提出了实用的安全开发方法和架构最佳实践,以构建鲁棒且高效的 A2A 系统。此外,论文还探讨了 A2A 与模型上下文协议(Model Context Protocol, MCP)协同作用如何进一步提升安全互操作性。关键解决方案在于将理论分析与实践指导相结合,为开发者和架构师提供必要的知识与工具,以支持下一代智能体应用的构建。

链接: https://arxiv.org/abs/2504.16902
作者: Idan Habler,Ken Huang,Vineeth Sai Narajala,Prashant Kulkarni
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注: 13 pages, 4 figures, 1 table, Authors contributed equally to this work

点击查看摘要

Abstract:As Agentic AI systems evolve from basic workflows to complex multi agent collaboration, robust protocols such as Google’s Agent2Agent (A2A) become essential enablers. To foster secure adoption and ensure the reliability of these complex interactions, understanding the secure implementation of A2A is essential. This paper addresses this goal by providing a comprehensive security analysis centered on the A2A protocol. We examine its fundamental elements and operational dynamics, situating it within the framework of agent communication development. Utilizing the MAESTRO framework, specifically designed for AI risks, we apply proactive threat modeling to assess potential security issues in A2A deployments, focusing on aspects such as Agent Card management, task execution integrity, and authentication methodologies. Based on these insights, we recommend practical secure development methodologies and architectural best practices designed to build resilient and effective A2A systems. Our analysis also explores how the synergy between A2A and the Model Context Protocol (MCP) can further enhance secure interoperability. This paper equips developers and architects with the knowledge and practical guidance needed to confidently leverage the A2A protocol for building robust and secure next generation agentic applications.
zh

[AI-2] Approximating Optimal Labelings for Temporal Connectivity

【速读】:该论文研究的是Minimum Aged Labeling (MAL) 问题,即在时间图中调度边的时间标签,使得所有顶点对在不超过给定最大允许时间 a 的情况下保持连通,并且所使用的标签总数最小化。这一问题在物流、调度分配以及社交网络中的信息传播等领域具有重要应用价值,合理选择时间标签可以显著降低基础设施成本、燃料消耗或温室气体排放。

解决方案的关键在于理解最大允许时间 a 和输入图直径之间的关系,并设计相应的近似算法。论文证明了当 a ≥ 2 或 a ≥ 3 时,该问题的近似比下界分别为 O(log n) 和 2^(log^(1-ε) n),除非 P = NP 或 NP ⊆ DTIME(2^polylog(n))。此外,论文提出了在特定条件下几乎匹配上述下界的近似算法,并进一步揭示了该问题与静态图上的经典优化问题 Diameter Constrained Spanning Subgraph (DCSS) 之间的联系。

链接: https://arxiv.org/abs/2504.16837
作者: Daniele Carnevale(1),Gianlorenzo D’Angelo(1),Martin Olsen(2) ((1) Gran Sasso Science Institute, (2) Aarhus University)
机构: 未知
类目: Data Structures and Algorithms (cs.DS); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:In a temporal graph the edge set dynamically changes over time according to a set of time-labels associated with each edge that indicates at which time-steps the edge is available. Two vertices are connected if there is a path connecting them in which the edges are traversed in increasing order of their labels. We study the problem of scheduling the availability time of the edges of a temporal graph in such a way that all pairs of vertices are connected within a given maximum allowed time a and the overall number of labels is minimized. The problem, known as Minimum Aged Labeling (MAL), has several applications in logistics, distribution scheduling, and information spreading in social networks, where carefully choosing the time-labels can significantly reduce infrastructure costs, fuel consumption, or greenhouse gases. The problem MAL has previously been proved to be NP-complete on undirected graphs and APX-hard on directed graphs. In this paper, we extend our knowledge on the complexity and approximability of MAL in several directions. We first show that the problem cannot be approximated within a factor better than O(log n) when a ≥ 2, unless P = NP, and a factor better than 2^(log^(1-ε) n) when a ≥ 3, unless NP ⊆ DTIME(2^polylog(n)), where n is the number of vertices in the graph. Then we give a set of approximation algorithms that, under some conditions, almost match these lower bounds. In particular, we show that the approximation depends on a relation between a and the diameter of the input graph. We further establish a connection with a foundational optimization problem on static graphs called Diameter Constrained Spanning Subgraph (DCSS) and show that our hardness results also apply to DCSS.
zh

[AI-3] Improving Significant Wave Height Prediction Using Chronos Models

【速读】:该论文旨在解决波高预测中传统物理模型和机器学习方法在计算效率及非线性动力学建模方面的挑战。解决方案的关键在于提出了一种由大型语言模型(Large Language Model, LLM)驱动的时间架构(Chronos),该架构针对波浪预报进行了优化。通过在西北太平洋盆地三个战略性选择的海洋区域的历史波浪数据上应用先进的时间模式识别技术,Chronos实现了多方面的改进,包括显著减少训练时间和提高推理速度、在短期和长期预测中的优越性能以及零样本能力,从而为复杂地球物理系统的建模提供了一个高效且可转移的框架。

链接: https://arxiv.org/abs/2504.16834
作者: Yilin Zhai,Hongyuan Shi,Chao Zhan,Qing Wang,Zaijin You,Nan Wang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Atmospheric and Oceanic Physics (physics.ao-ph)
备注:

点击查看摘要

Abstract:Accurate wave height prediction is critical for maritime safety and coastal resilience, yet conventional physics-based models and traditional machine learning methods face challenges in computational efficiency and nonlinear dynamics modeling. This study introduces Chronos, the first implementation of a large language model (LLM)-powered temporal architecture optimized for wave forecasting. Through advanced temporal pattern recognition applied to historical wave data from three strategically chosen marine zones in the Northwest Pacific basin, our framework achieves multimodal improvements: (1) 14.3% reduction in training time with 2.5x faster inference speed compared to PatchTST baselines, achieving 0.575 mean absolute scaled error (MASE) units; (2) superior short-term forecasting (1-24h) across comprehensive metrics; (3) sustained predictive leadership in extended-range forecasts (1-120h); and (4) demonstrated zero-shot capability maintaining median performance (rank 4/12) against specialized operational models. This LLM-enhanced temporal modeling paradigm establishes a new standard in wave prediction, offering both computationally efficient solutions and a transferable framework for complex geophysical systems modeling.
zh
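
Chronos 是亚马逊开源的时间序列基础模型,其零样本预报用法大致如下(需 pip install chronos-forecasting;模型名与接口以官方文档为准,输入序列为随意构造的占位数据):

```python
import torch
from chronos import ChronosPipeline

pipeline = ChronosPipeline.from_pretrained(
    "amazon/chronos-t5-small", device_map="cpu", torch_dtype=torch.float32)

history = torch.tensor([1.2, 1.3, 1.5, 1.4, 1.8, 2.1, 1.9, 1.7] * 16)  # 假想SWH序列(米)
forecast = pipeline.predict(history, prediction_length=24)  # 未来24步的采样式概率预报
print(forecast.shape)   # (序列数, 采样数, 24)
```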

[AI-4] Lightweight Latent Verifiers for Efficient Meta-Generation Strategies

【速读】:该论文旨在解决传统大语言模型(Large Language Models, LLMs)验证器因规模较大导致计算成本高昂的问题。解决方案的关键在于提出了一种新颖的轻量级验证方法 LiLaVe,它能够从基础 LLM 的隐藏状态中可靠地提取正确性信号。与传统的基于 LLM 的验证器相比,LiLaVe 仅需极小的计算资源即可运行,同时通过结合流行的元生成策略(如 best-of-n 或自一致性)以及设计新的基于 LiLaVe 的方法(如条件自校正或条件多数投票),显著提升了小型 LLM 在生成任务中的准确性和效率。这一工作展示了从 LLM 隐藏状态中提取潜在信息的可行性,并为需要推理密集型应用的可扩展且资源高效解决方案铺平了道路。

链接: https://arxiv.org/abs/2504.16760
作者: Bartosz Piotrowski,Witold Drzewakowski,Konrad Staniszewski,Piotr Miłoś
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Verifiers are auxiliary models that assess the correctness of outputs generated by base large language models (LLMs). They play a crucial role in many strategies for solving reasoning-intensive problems with LLMs. Typically, verifiers are LLMs themselves, often as large (or larger) than the base model they support, making them computationally expensive. In this work, we introduce a novel lightweight verification approach, LiLaVe, which reliably extracts correctness signals from the hidden states of the base LLM. A key advantage of LiLaVe is its ability to operate with only a small fraction of the computational budget required by traditional LLM-based verifiers. To demonstrate its practicality, we couple LiLaVe with popular meta-generation strategies, like best-of-n or self-consistency. Moreover, we design novel LiLaVe-based approaches, like conditional self-correction or conditional majority voting, that significantly improve both accuracy and efficiency in generation tasks with smaller LLMs. Our work demonstrates the fruitfulness of extracting latent information from the hidden states of LLMs, and opens the door to scalable and resource-efficient solutions for reasoning-intensive applications.
zh
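
LiLaVe 的核心操作是"从基座 LLM 的隐藏状态提取特征,训练一个轻量分类器作为验证器"。下面用 GPT-2 作占位基座演示这一流程(取哪些层、哪些 token 位置的隐藏状态,以及分类器的选择,均为本文假设):

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
from sklearn.linear_model import LogisticRegression

name = "gpt2"   # 占位基座模型,论文中应为待验证的实际 LLM
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name, output_hidden_states=True)

def last_token_hidden(text, layer=-1):
    ids = tok(text, return_tensors="pt")
    with torch.no_grad():
        out = model(**ids)
    return out.hidden_states[layer][0, -1].numpy()   # 末 token 的隐藏向量

answers = ["2+2=4", "2+2=5"]                 # 基座模型的候选输出(占位)
labels = [1, 0]                              # 1=正确, 0=错误(需事先标注)
X = [last_token_hidden(a) for a in answers]
verifier = LogisticRegression(max_iter=1000).fit(X, labels)
print(verifier.predict_proba([last_token_hidden("3+3=6")])[0, 1])  # 正确性打分
```

训练好的验证器即可插入 best-of-n、自一致性等元生成策略,以极小的额外计算完成候选答案的筛选。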

[AI-5] MOSAIC: A Skill-Centric Algorithmic Framework for Long-Horizon Manipulation Planning

【速读】:该论文旨在解决机器人与人工智能领域中利用预定义技能规划长时域运动的关键挑战。现有方法在系统性探索技能组合、利用通用易学技能(如推和抓)实现任务泛化以及避免依赖需大量领域特定知识的符号化世界表示方面存在显著不足,这些要素在现有方案中仍相对独立,导致难以形成鲁棒且可扩展的复杂长时域问题解决方案。论文提出MOSAIC框架作为解决方案,其关键是围绕技能本身指导规划过程,通过Generator计算可执行轨迹与世界配置,以及Connector通过求解边界值问题连接独立生成的技能轨迹,从而实现整体任务的逐步完成。MOSAIC突破了从预定义起始或目标状态逐步发现技能的传统范式,将规划焦点集中在技能本就有效的区域,显著提升了探索效率。

链接: https://arxiv.org/abs/2504.16738
作者: Itamar Mishani,Yorai Shaoul,Maxim Likhachev
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注: Under review. Project page: this https URL

点击查看摘要

Abstract:Planning long-horizon motions using a set of predefined skills is a key challenge in robotics and AI. Addressing this challenge requires methods that systematically explore skill combinations to uncover task-solving sequences, harness generic, easy-to-learn skills (e.g., pushing, grasping) to generalize across unseen tasks, and bypass reliance on symbolic world representations that demand extensive domain and task-specific knowledge. Despite significant progress, these elements remain largely disjoint in existing approaches, leaving a critical gap in achieving robust, scalable solutions for complex, long-horizon problems. In this work, we present MOSAIC, a skill-centric framework that unifies these elements by using the skills themselves to guide the planning process. MOSAIC uses two families of skills: Generators compute executable trajectories and world configurations, and Connectors link these independently generated skill trajectories by solving boundary value problems, enabling progress toward completing the overall task. By breaking away from the conventional paradigm of incrementally discovering skills from predefined start or goal states–a limitation that significantly restricts exploration–MOSAIC focuses planning efforts on regions where skills are inherently effective. We demonstrate the efficacy of MOSAIC in both simulated and real-world robotic manipulation tasks, showcasing its ability to solve complex long-horizon planning problems using a diverse set of skills incorporating generative diffusion models, motion planning algorithms, and manipulation-specific models. Visit this https URL for demonstrations and examples.
zh

[AI-6] A Survey of AI Agent Protocols

【速读】:该论文试图解决大型语言模型(Large Language Models, LLMs)代理在实际应用中缺乏标准化通信协议的问题。随着LLM代理在各行业的广泛应用,其与外部工具或数据源交互的方式缺乏统一标准,导致协作困难、扩展性受限,并限制了解决复杂现实任务的能力。论文的关键解决方案在于提出一个统一的通信协议,以实现LLM代理与工具之间的流畅交互,促进协作,并激发集体智能的形成。此外,论文通过分类现有协议、分析其适用场景,并从安全性、可扩展性和延迟等维度进行性能比较,为特定应用场景选择最合适的协议提供指导。

链接: https://arxiv.org/abs/2504.16736
作者: Yingxuan Yang,Huacan Chai,Yuanyi Song,Siyuan Qi,Muning Wen,Ning Li,Junwei Liao,Haoyi Hu,Jianghao Lin,Gaowei Chang,Weiwen Liu,Ying Wen,Yong Yu,Weinan Zhang
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:The rapid development of large language models (LLMs) has led to the widespread deployment of LLM agents across diverse industries, including customer service, content generation, data analysis, and even healthcare. However, as more LLM agents are deployed, a major issue has emerged: there is no standard way for these agents to communicate with external tools or data sources. This lack of standardized protocols makes it difficult for agents to work together or scale effectively, and it limits their ability to tackle complex, real-world tasks. A unified communication protocol for LLM agents could change this. It would allow agents and tools to interact more smoothly, encourage collaboration, and trigger the formation of collective intelligence. In this paper, we provide a systematic overview of existing communication protocols for LLM agents. We classify them into four main categories and provide an analysis to help users and developers select the most suitable protocols for specific applications. Additionally, we conduct a comparative performance analysis of these protocols across key dimensions such as security, scalability, and latency. Finally, we explore future challenges, such as how protocols can adapt and survive in fast-evolving environments, and what qualities future protocols might need to support the next generation of LLM agent ecosystems. We expect this work to serve as a practical reference for both researchers and engineers seeking to design, evaluate, or integrate robust communication infrastructures for intelligent agents.
zh

[AI-7] Offline Robotic World Model: Learning Robotic Policies without a Physics Simulator

【速读】:该论文旨在解决离线强化学习 (Offline RL) 在机器人控制中的分布偏移 (distributional shift) 问题,限制了策略的泛化能力,并提出了一种名为离线机器人世界模型 (RWM-O) 的基于模型的方法。RWM-O 的关键是显式估计认识不确定性 (epistemic uncertainty),以改进策略学习,而无需依赖物理模拟器。通过将这些不确定性估计集成到策略优化过程中,该方法惩罚不可靠的转换,减少对模型误差的过拟合,从而提高稳定性。实验结果表明,RWM-O 提升了泛化能力和安全性,实现了完全基于真实世界数据的策略学习,推动了机器人领域的可扩展且高效的数据利用强化学习。

链接: https://arxiv.org/abs/2504.16680
作者: Chenhao Li,Andreas Krause,Marco Hutter
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Reinforcement Learning (RL) has demonstrated impressive capabilities in robotic control but remains challenging due to high sample complexity, safety concerns, and the sim-to-real gap. While offline RL eliminates the need for risky real-world exploration by learning from pre-collected data, it suffers from distributional shift, limiting policy generalization. Model-Based RL (MBRL) addresses this by leveraging predictive models for synthetic rollouts, yet existing approaches often lack robust uncertainty estimation, leading to compounding errors in offline settings. We introduce Offline Robotic World Model (RWM-O), a model-based approach that explicitly estimates epistemic uncertainty to improve policy learning without reliance on a physics simulator. By integrating these uncertainty estimates into policy optimization, our approach penalizes unreliable transitions, reducing overfitting to model errors and enhancing stability. Experimental results show that RWM-O improves generalization and safety, enabling policy learning purely from real-world data and advancing scalable, data-efficient RL for robotics.
zh
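
RWM-O 的关键一步是"用认识不确定性惩罚不可靠的转移"。常见做法是以模型集成的预测分歧作为不确定性估计,再从奖励中扣除,下面是这一思路的极简示意(网络结构与惩罚系数均为假设):

```python
import torch
import torch.nn as nn

class EnsembleDynamics(nn.Module):
    def __init__(self, obs_dim=8, act_dim=2, n_models=5, hidden=64):
        super().__init__()
        self.models = nn.ModuleList([
            nn.Sequential(nn.Linear(obs_dim + act_dim, hidden), nn.ReLU(),
                          nn.Linear(hidden, obs_dim)) for _ in range(n_models)])

    def forward(self, s, a):
        preds = torch.stack([m(torch.cat([s, a], -1)) for m in self.models])
        return preds.mean(0), preds.std(0).mean(-1)   # 预测均值, 集成分歧

dyn = EnsembleDynamics()
s, a = torch.randn(16, 8), torch.randn(16, 2)
next_s, epistemic = dyn(s, a)
reward = torch.randn(16)
penalized = reward - 1.0 * epistemic     # 不确定的转移获得更低的有效奖励
print(penalized.shape)
```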

[AI-8] MAYA: Addressing Inconsistencies in Generative Password Guessing through a Unified Benchmark

【速读】:该论文旨在解决生成式模型在密码猜测领域应用中评估标准不统一、研究结果不一致以及缺乏严格评估的问题。论文的关键在于引入了MAYA(Unified, Customizable, Plug-and-Play Password Benchmarking Framework),这是一个标准化的评估框架,通过一系列先进的测试场景和八个真实世界密码数据集,对六种最先进的生成式密码猜测方法进行了全面评估。MAYA的关键创新在于其统一性和可定制性,能够确保不同模型之间的公平比较,并揭示模型在生成复杂密码时的性能差异。通过MAYA,研究发现顺序模型在生成准确且复杂的密码猜测方面优于其他生成架构和传统工具,同时多模型攻击策略进一步提升了整体性能。

链接: https://arxiv.org/abs/2504.16651
作者: William Corrias,Fabio De Gaspari,Dorjan Hitaj,Luigi V. Mancini
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:The rapid evolution of generative models has led to their integration across various fields, including password guessing, aiming to generate passwords that resemble human-created ones in complexity, structure, and patterns. Despite generative models’ promise, inconsistencies in prior research and a lack of rigorous evaluation have hindered a comprehensive understanding of their true potential. In this paper, we introduce MAYA, a unified, customizable, plug-and-play password benchmarking framework. MAYA provides a standardized approach for evaluating generative password-guessing models through a rigorous set of advanced testing scenarios and a collection of eight real-life password datasets. Using MAYA, we comprehensively evaluate six state-of-the-art approaches, which have been re-implemented and adapted to ensure standardization, for a total of over 15,000 hours of computation. Our findings indicate that these models effectively capture different aspects of human password distribution and exhibit strong generalization capabilities. However, their effectiveness varies significantly with long and complex passwords. Through our evaluation, sequential models consistently outperform other generative architectures and traditional password-guessing tools, demonstrating unique capabilities in generating accurate and complex guesses. Moreover, models learn and generate different password distributions, enabling a multi-model attack that outperforms the best individual model. By releasing MAYA, we aim to foster further research, providing the community with a new tool to consistently and reliably benchmark password-generation techniques. Our framework is publicly available at this https URL
zh

[AI-9] Bridging Econometrics and AI: VaR Estimation via Reinforcement Learning and GARCH Models

【速读】:该论文试图解决在金融市场波动加剧的环境下,传统计量经济学模型(如GARCH及其变体)因假设过于严格而难以适应当前市场复杂性的风险度量准确性问题。论文的关键解决方案在于提出了一种结合GARCH波动率模型与深度强化学习的混合框架,通过将风险水平预测任务视为一个不平衡分类问题,并利用Double Deep Q-Network (DDQN) 模型进行方向性市场预测,实现了根据市场条件动态调整风险预测的能力。这种架构不仅显著提高了Value-at-Risk (VaR) 估计的准确性,减少了超越风险阈值的次数,还降低了资本需求,同时保持了对监管风险阈值的遵守,从而增强了其在现代主动风险管理中的实用性。

链接: https://arxiv.org/abs/2504.16635
作者: Fredy Pokou(CRIStAL, INOCS),Jules Sadefo Kamdem(MRE),François Benhmad(MRE)
机构: 未知
类目: Artificial Intelligence (cs.AI); Computational Finance (q-fin.CP); Risk Management (q-fin.RM); Statistical Finance (q-fin.ST)
备注:

点击查看摘要

Abstract:In an environment of increasingly volatile financial markets, the accurate estimation of risk remains a major challenge. Traditional econometric models, such as GARCH and its variants, are based on assumptions that are often too rigid to adapt to the complexity of the current market dynamics. To overcome these limitations, we propose a hybrid framework for Value-at-Risk (VaR) estimation, combining GARCH volatility models with deep reinforcement learning. Our approach incorporates directional market forecasting using the Double Deep Q-Network (DDQN) model, treating the task as an imbalanced classification problem. This architecture enables the dynamic adjustment of risk-level forecasts according to market conditions. Empirical validation on daily Eurostoxx 50 data covering periods of crisis and high volatility shows a significant improvement in the accuracy of VaR estimates, as well as a reduction in the number of breaches and also in capital requirements, while respecting regulatory risk thresholds. The ability of the model to adjust risk levels in real time reinforces its relevance to modern and proactive risk management.
zh
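
"GARCH 估计波动率 + 强化学习信号动态调整 VaR"的两段式流程可以这样粗略示意(需 pip install arch;数据为模拟收益率,方向信号用常量占位,实际应由 DDQN 给出;调整公式为本文假设):

```python
import numpy as np
from arch import arch_model

rng = np.random.default_rng(7)
returns = 100 * rng.normal(0, 0.01, 1000)     # 百分比日收益(占位数据)
res = arch_model(returns, vol="GARCH", p=1, q=1).fit(disp="off")
sigma_next = float(res.forecast(horizon=1).variance.values[-1, 0]) ** 0.5

direction = -1        # DDQN 方向性预测的占位: -1=预期下行, 0=中性, 1=上行
z99, k = 2.326, 0.2   # 99% 正态分位数与假设的调整系数
var_99 = z99 * sigma_next * (1 + k * max(0, -direction))  # 预期下行时收紧 VaR
print(f"下一日 99% VaR ≈ {var_99:.3f}%")
```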

[AI-10] Cognitive Silicon: An Architectural Blueprint for Post-Industrial Computing Systems

【速读】:该论文试图解决自主人工智能(Autonomous AI)系统在确定性、人为设计的计算架构中存在的基础局限性问题。论文提出了一种名为“Cognitive Silicon”的假设性全栈架构框架,探索认知计算系统设计的可能发展方向。解决方案的关键在于设计一种集成符号支架、受控记忆、运行时道德一致性以及与硅到语义层对齐感知执行的架构。这种设计方法通过与大型语言模型(LLMs)在非对称认识论条件下进行辩证协同设计而形成,旨在通过引入结构化摩擦来揭示盲点和权衡。该框架设想将物理约束导致的有限性、不可复制的隐性知识以及不可克隆的身份密钥作为认知实体的基础构建模块,并将信任/代理、支架/涌现、执行/治理等核心张力视为架构的核心驱动力而非边缘情况。理论上,该架构与自由能量原理(Free Energy Principle)相一致,可能提供一种正式的解释,说明认知系统如何通过预测误差最小化在物理和计算边界内保持身份。最终目标是构建一种在不可逆硬件约束和身份限定认识论机制下能够维持人类对齐的可道德处理的认知基础设施。

链接: https://arxiv.org/abs/2504.16622
作者: Christoforus Yoga Haryanto,Emily Lomempow
机构: 未知
类目: Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
备注: Working Paper, 37 pages, 1 figure, 5 tables

点击查看摘要

Abstract:Autonomous AI systems reveal foundational limitations in deterministic, human-authored computing architectures. This paper presents Cognitive Silicon: a hypothetical full-stack architectural framework projected toward 2035, exploring a possible trajectory for cognitive computing system design. The proposed architecture would integrate symbolic scaffolding, governed memory, runtime moral coherence, and alignment-aware execution across silicon-to-semantics layers. Our design grammar has emerged from dialectical co-design with LLMs under asymmetric epistemic conditions–creating structured friction to expose blind spots and trade-offs. The envisioned framework would establish mortality as a natural consequence of physical constraints, non-copyable tacit knowledge, and non-cloneable identity keys as cognitive-embodiment primitives. Core tensions (trust/agency, scaffolding/emergence, execution/governance) would function as central architectural pressures rather than edge cases. The architecture theoretically converges with the Free Energy Principle, potentially offering a formal account of how cognitive systems could maintain identity through prediction error minimization across physical and computational boundaries. The resulting framework aims to deliver a morally tractable cognitive infrastructure that could maintain human-alignment through irreversible hardware constraints and identity-bound epistemic mechanisms resistant to replication or subversion.
zh

[AI-11] Case Study: Fine-tuning Small Language Models for Accurate and Private CWE Detection in Python Code

【速读】:该论文试图解决大型语言模型(Large Language Models, LLMs)在分析敏感或专有代码以检测安全漏洞时面临的隐私和推理成本挑战。具体而言,研究关注如何利用小规模语言模型(Small Language Models, SLMs)实现准确的本地化漏洞检测,特别是在Python代码中识别MITRE Top 25常见弱点枚举(Common Weakness Enumerations, CWEs)。解决方案的关键在于通过指令跟随微调(instruction-following fine-tuning),将一个预训练的3.5亿参数代码模型(codegen-mono)调整为专注于特定任务的SLM,结合半监督方法生成合成数据与人工审查构建针对性数据集,从而显著提升模型性能,最终在测试集中达到约99%的准确率、98.08%的精确率、100%的召回率以及99.04%的F1分数,验证了微调后的SLMs作为高效且隐私保护工具的有效性。

链接: https://arxiv.org/abs/2504.16584
作者: Md. Azizul Hakim Bappy(Institute of Information and Communication Technology, Bangladesh University of Engineering Technology, Dhaka, Bangladesh),Hossen A Mustafa(Institute of Information and Communication Technology, Bangladesh University of Engineering Technology, Dhaka, Bangladesh),Prottoy Saha(Institute of Information and Communication Technology, Bangladesh University of Engineering Technology, Dhaka, Bangladesh),Rajinus Salehat(Hajee Mohammad Danesh Science and Technology University, Dinajpur, Bangladesh)
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注: 11 pages, 2 figures, 3 tables. Dataset available at this https URL . Model available at this https URL . Keywords: Small Language Models (SLMs), Vulnerability Detection, CWE, Fine-tuning, Python Security, Privacy-Preserving Code Analysis

点击查看摘要

Abstract:Large Language Models (LLMs) have demonstrated significant capabilities in understanding and analyzing code for security vulnerabilities, such as Common Weakness Enumerations (CWEs). However, their reliance on cloud infrastructure and substantial computational requirements pose challenges for analyzing sensitive or proprietary codebases due to privacy concerns and inference costs. This work explores the potential of Small Language Models (SLMs) as a viable alternative for accurate, on-premise vulnerability detection. We investigated whether a 350-million parameter pre-trained code model (codegen-mono) could be effectively fine-tuned to detect the MITRE Top 25 CWEs specifically within Python code. To facilitate this, we developed a targeted dataset of 500 examples using a semi-supervised approach involving LLM-driven synthetic data generation coupled with meticulous human review. Initial tests confirmed that the base codegen-mono model completely failed to identify CWEs in our samples. However, after applying instruction-following fine-tuning, the specialized SLM achieved remarkable performance on our test set, yielding approximately 99% accuracy, 98.08% precision, 100% recall, and a 99.04% F1-score. These results strongly suggest that fine-tuned SLMs can serve as highly accurate and efficient tools for CWE detection, offering a practical and privacy-preserving solution for integrating advanced security analysis directly into development workflows.
zh
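
针对 CWE 检测的指令微调,其训练样本格式与单步优化大致如下(模型为 Hugging Face 上公开的 codegen-350M-mono;指令模板为本文假设,且为简洁起见未屏蔽 prompt 部分的损失,完整训练还需数据集与训练循环):

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

name = "Salesforce/codegen-350M-mono"
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name)
tok.pad_token = tok.eos_token

prompt = ("### Instruction: Identify any CWE in the following Python code.\n"
          "### Code:\nquery = \"SELECT * FROM users WHERE id = \" + user_id\n"
          "### Answer: ")
target = "CWE-89: SQL Injection"
ids = tok(prompt + target + tok.eos_token, return_tensors="pt")
labels = ids["input_ids"].clone()

opt = torch.optim.AdamW(model.parameters(), lr=2e-5)
loss = model(**ids, labels=labels).loss   # 因果语言模型损失
loss.backward(); opt.step(); opt.zero_grad()
print(float(loss))
```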

[AI-12] MMHCL: Multi-Modal Hypergraph Contrastive Learning for Recommendation

【速读】:该论文旨在解决多模态内容共享平台中个性化推荐系统面临的两个主要问题:数据稀疏性和冷启动问题,并提出了一种新的解决方案。现有方法通常难以充分探索用户与产品之间从多模态数据中衍生的语义关联。为此,论文提出了一种名为Multi-Modal Hypergraph Contrastive Learning (MMHCL) 的框架用于用户推荐。其关键在于构建了两个超图——用户到用户(u2u)超图和项目到项目(i2i)超图,分别挖掘用户之间的共享偏好以及项目间的复杂多模态语义相似性,从而生成更密集的二阶语义信息,与一阶用户-项目交互信息互补以缓解数据稀疏性问题。此外,通过设计协同对比学习的特征增强范式,最大化或最小化相同/不同用户和项目的一阶和二阶嵌入之间的互信息,有效提升了特征的可区分性。这种方法不仅获得了更密集的二阶超图,还挖掘出更多共享属性,从而在一定程度上缓解了数据稀疏性和冷启动问题。

链接: https://arxiv.org/abs/2504.16576
作者: Xu Guo,Tong Zhang,Fuyun Wang,Xudong Wang,Xiaoya Zhang,Xin Liu,Zhen Cui
机构: 未知
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
备注: 23 pages, 8 figures. This manuscript is currently under major revision for ACM Transactions on Multimedia Computing, Communications, and Applications (ACM TOMM)

点击查看摘要

Abstract:The burgeoning presence of multimodal content-sharing platforms propels the development of personalized recommender systems. Previous works usually suffer from data sparsity and cold-start problems, and may fail to adequately explore semantic user-product associations from multimodal data. To address these issues, we propose a novel Multi-Modal Hypergraph Contrastive Learning (MMHCL) framework for user recommendation. For a comprehensive information exploration from user-product relations, we construct two hypergraphs, i.e. a user-to-user (u2u) hypergraph and an item-to-item (i2i) hypergraph, to mine shared preferences among users and intricate multimodal semantic resemblance among items, respectively. This process yields denser second-order semantics that are fused with first-order user-item interaction as complementary to alleviate the data sparsity issue. Then, we design a contrastive feature enhancement paradigm by applying synergistic contrastive learning. By maximizing/minimizing the mutual information between second-order (e.g. shared preference pattern for users) and first-order (information of selected items for users) embeddings of the same/different users and items, the feature distinguishability can be effectively enhanced. Compared with using sparse primary user-item interaction only, our MMHCL obtains denser second-order hypergraphs and excavates more abundant shared attributes to explore the user-product associations, which to a certain extent alleviates the problems of data sparsity and cold-start. Extensive experiments have comprehensively demonstrated the effectiveness of our method. Our code is publicly available at: this https URL.
zh
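
MMHCL 中"最大化同一用户/物品的一阶与二阶嵌入互信息"的对比目标,通常落地为 InfoNCE 损失:同一实体的两种嵌入互为正样本,批内其余样本为负样本。示意如下(温度等超参为假设):

```python
import torch
import torch.nn.functional as F

def info_nce(z1, z2, tau=0.2):
    """z1: 一阶(用户-物品交互)嵌入; z2: 二阶(u2u/i2i 超图)嵌入。"""
    z1, z2 = F.normalize(z1, dim=-1), F.normalize(z2, dim=-1)
    logits = z1 @ z2.t() / tau                  # (B, B) 相似度矩阵
    labels = torch.arange(z1.size(0))           # 对角线为正样本
    return F.cross_entropy(logits, labels)

first_order = torch.randn(64, 128, requires_grad=True)
second_order = torch.randn(64, 128, requires_grad=True)
info_nce(first_order, second_order).backward()
```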

[AI-13] PsyCounAssist: A Full-Cycle AI-Powered Psychological Counseling Assistant System

【速读】:该论文旨在解决心理辅导过程中情感监测不实时、记录主观性强以及个性化支持不足的问题。解决方案的关键在于PsyCounAssist系统,它通过整合多模态情感识别(结合语音与光电容积脉搏波描记法PPG信号)实现精准的实时情感分析,利用大语言模型(LLMs)自动生成结构化的会话报告,并提供个性化的人工智能生成后续支持。这一综合方案不仅提升了心理辅导的效率与质量,还确保了系统的实用性和灵活性,同时强调了基于PPG的情感分类可靠性和非侵入式隐私保护的优势。

链接: https://arxiv.org/abs/2504.16573
作者: Xianghe Liu,Jiaqi Xu,Tao Sun
机构: 未知
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Psychological counseling is a highly personalized and dynamic process that requires therapists to continuously monitor emotional changes, document session insights, and maintain therapeutic continuity. In this paper, we introduce PsyCounAssist, a comprehensive AI-powered counseling assistant system specifically designed to augment psychological counseling practices. PsyCounAssist integrates multimodal emotion recognition combining speech and photoplethysmography (PPG) signals for accurate real-time affective analysis, automated structured session reporting using large language models (LLMs), and personalized AI-generated follow-up support. Deployed on Android-based tablet devices, the system demonstrates practical applicability and flexibility in real-world counseling scenarios. Experimental evaluation confirms the reliability of PPG-based emotional classification and highlights the system’s potential for non-intrusive, privacy-aware emotional support. PsyCounAssist represents a novel approach to ethically and effectively integrating AI into psychological counseling workflows.
zh

[AI-14] A Vision for AI-Driven Adaptation of Dynamic AR Content to Users and Environments

【Quick Read】: This paper targets the difficulty existing augmented reality (AR) systems have in managing the many interaction possibilities AR presents, in particular how to place AR content adaptively and effectively. The key to the solution is to use artificial intelligence (AI): machine learning methods dynamically adjust AR content layout in real time in response to user movement and environmental changes. At its core, the system intelligently manages the distribution of content between AR projections integrated into the environment and fixed static content, enabling seamless UI layouts and potentially reducing users' cognitive load.

Link: https://arxiv.org/abs/2504.16562
Authors: Julian Rasch, Florian Müller, Francesco Chiossi
Institutions: Unknown
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Augmented Reality (AR) is transforming the way we interact with virtual information in the physical world. By overlaying digital content in real-world environments, AR enables new forms of immersive and engaging experiences. However, existing AR systems often struggle to effectively manage the many interactive possibilities that AR presents. This vision paper speculates on AI-driven approaches for adaptive AR content placement, dynamically adjusting to user movement and environmental changes. By leveraging machine learning methods, such a system would intelligently manage content distribution between AR projections integrated into the external environment and fixed static content, enabling seamless UI layout and potentially reducing users’ cognitive load. By exploring the possibilities of AI-driven dynamic AR content placement, we aim to envision new opportunities for innovation and improvement in various industries, from urban navigation and workplace productivity to immersive learning and beyond. This paper outlines a vision for the development of more intuitive, engaging, and effective AI-powered AR experiences.
zh

[AI-15] Exploring human-SAV interaction using large language models: The impact of psychological ownership and anthropomorphism on user experience

【Quick Read】: The problem this paper tackles is how prompt strategies in the user interface (UI) of a large language model (LLM)-powered shared autonomous vehicle (SAV) affect users' perceptions, experiences, and intention to adopt the technology. Prior work has mostly focused on psychological factors such as anthropomorphism in SAV adoption, whereas this study examines the concrete effects of prompt strategies in LLM-driven SAV UIs.

The key to the solution is to design SAV UIs with varying levels of anthropomorphic characteristics and psychological ownership triggers, and to evaluate, with both quantitative and qualitative methods, how these designs affect users' psychological ownership, perceived anthropomorphism, quality of service, disclosure tendency, sentiment of SAV responses, and overall acceptance. The experiments show that a more anthropomorphic, ownership-inducing conversational SAV UI improves users' perception of the SAV's human-like qualities and the sentiment of its responses, providing practical guidance for designing LLM-based conversational UIs that enhance user experience and SAV adoption.

Link: https://arxiv.org/abs/2504.16548
Authors: Lirui Guo, Michael G. Burke, Wynita M. Griggs
Institutions: Unknown
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET)
Comments:

Click to view abstract

Abstract:There has been extensive prior work exploring how psychological factors such as anthropomorphism affect the adoption of shared autonomous vehicles (SAVs). However, limited research has been conducted on how prompt strategies in large language model (LLM)-powered SAV User Interfaces (UIs) affect users’ perceptions, experiences, and intentions to adopt such technology. In this work, we investigate how conversational UIs powered by LLMs drive these psychological factors and psychological ownership, the sense of possession a user may come to feel towards an entity or object they may not legally own. We designed four SAV UIs with varying levels of anthropomorphic characteristics and psychological ownership triggers. Quantitative measures of psychological ownership, anthropomorphism, quality of service, disclosure tendency, sentiment of SAV responses, and overall acceptance were collected after participants interacted with each SAV. Qualitative feedback was also gathered regarding the experience of psychological ownership during the interactions. The results indicate that an SAV conversational UI designed to be more anthropomorphic and to induce psychological ownership improved users’ perceptions of the SAV’s human-like qualities and improved the sentiment of responses compared to a control condition. These findings provide practical guidance for designing LLM-based conversational UIs that enhance user experience and adoption of SAVs.
zh

[AI-16] Amplified Vulnerabilities: Structured Jailbreak Attacks on LLM-based Multi-Agent Debate

【Quick Read】: This paper examines the security concerns raised by Multi-Agent Debate (MAD) systems, which enhance reasoning on complex tasks but whose iterative dialogue and role-playing characteristics make them susceptible to adversarial attacks, such as jailbreaks that elicit harmful content. The key contribution is a novel structured prompt-rewriting framework that exploits MAD dynamics through narrative encapsulation, role-driven escalation, iterative refinement, and rhetorical obfuscation. Experiments show that this approach significantly amplifies the fragility of MAD architectures, raising the proportion of harmful content from 28.14% to 80.34% and achieving attack success rates as high as 80% in some scenarios. These findings expose intrinsic security weaknesses of MAD systems and underscore the urgency of developing dedicated defenses before real-world deployment.

Link: https://arxiv.org/abs/2504.16489
Authors: Senmao Qi, Yifei Zou, Peng Li, Ziyi Lin, Xiuzhen Cheng, Dongxiao Yu
Institutions: Unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments: 33 pages, 5 figures

Click to view abstract

Abstract:Multi-Agent Debate (MAD), leveraging collaborative interactions among Large Language Models (LLMs), aim to enhance reasoning capabilities in complex tasks. However, the security implications of their iterative dialogues and role-playing characteristics, particularly susceptibility to jailbreak attacks eliciting harmful content, remain critically underexplored. This paper systematically investigates the jailbreak vulnerabilities of four prominent MAD frameworks built upon leading commercial LLMs (GPT-4o, GPT-4, GPT-3.5-turbo, and DeepSeek) without compromising internal agents. We introduce a novel structured prompt-rewriting framework specifically designed to exploit MAD dynamics via narrative encapsulation, role-driven escalation, iterative refinement, and rhetorical obfuscation. Our extensive experiments demonstrate that MAD systems are inherently more vulnerable than single-agent setups. Crucially, our proposed attack methodology significantly amplifies this fragility, increasing average harmfulness from 28.14% to 80.34% and achieving attack success rates as high as 80% in certain scenarios. These findings reveal intrinsic vulnerabilities in MAD architectures and underscore the urgent need for robust, specialized defenses prior to real-world deployment.
zh

[AI-17] On Developers Self-Declaration of AI-Generated Code: An Analysis of Practices

【Quick Read】: This paper addresses the need, in real-world software development, to distinguish AI-generated code from human-written code, and explores how developers self-declare AI-generated code and why. The key is a mixed-methods study: the authors first mined GitHub repositories for 613 instances of AI-generated code snippets, then ran a follow-up industrial survey that received 111 valid responses, revealing both the practices and the motivations behind self-declaration. Most practitioners (76.6%) always or sometimes self-declare AI-generated code, while the rest (23.4%) never do. The main reasons for self-declaring are tracking and monitoring the code for future review and debugging, along with ethical considerations; the main reasons against are extensive modification of the AI-generated code and the perception that self-declaration is unnecessary. The paper closes with guidelines for practitioners on self-declaring AI-generated code that address ethical and code-quality concerns.

Link: https://arxiv.org/abs/2504.16485
Authors: Syed Mohammad Kashif, Peng Liang, Amjed Tahir
Institutions: Unknown
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments: 35 pages, 17 images, 8 tables, Manuscript submitted to a journal (2025)

Click to view abstract

Abstract:AI code generation tools have gained significant popularity among developers, who use them to assist in software development due to their capability to generate code. Existing studies mainly explored the quality, e.g., correctness and security, of AI-generated code, while in real-world software development, the prerequisite is to distinguish AI-generated code from human-written code, which emphasizes the need to explicitly declare AI-generated code by developers. To this end, this study intends to understand the ways developers use to self-declare AI-generated code and explore the reasons why developers choose to self-declare or not. We conducted a mixed-methods study consisting of two phases. In the first phase, we mined GitHub repositories and collected 613 instances of AI-generated code snippets. In the second phase, we conducted a follow-up industrial survey, which received 111 valid responses. Our research revealed the practices followed by developers to self-declare AI-generated code. Most practitioners (76.6%) always or sometimes self-declare AI-generated code. In contrast, other practitioners (23.4%) noted that they never self-declare AI-generated code. The reasons for self-declaring AI-generated code include the need to track and monitor the code for future review and debugging, and ethical considerations. The reasons for not self-declaring AI-generated code include extensive modifications to AI-generated code and the developers’ perception that self-declaration is an unnecessary activity. We finally provided guidelines for practitioners to self-declare AI-generated code, addressing ethical and code quality concerns.
zh

[AI-18] Harden and Catch for Just-in-Time Assured LLM-Based Software Testing: Open Research Challenges

【Quick Read】: This paper tackles several fundamental concepts in automated software testing that, despite their large practical potential, remain ill-defined and under-explored. Specifically, it formally defines and studies the properties of hardening tests and catching tests: a hardening test protects against future regressions, while a catching test catches a regression, or a fault in new functionality, introduced by a code change. The paper also poses the Catching 'Just-in-Time' (JiTTest) Challenge: generating tests just in time to catch new faults before a change lands in production, and it notes that any solution to this challenge can equally be repurposed to find latent faults in legacy code. It discusses possible outcomes for hardening tests, catching tests, and JiTTests, open research problems, deployment options, and initial results from Meta's work on automated LLM-based hardening. The core contribution is the formal definition of these test concepts and their relationships, together with the methodology of just-in-time test generation for improving software quality.

Link: https://arxiv.org/abs/2504.16472
Authors: Mark Harman, Peter O'Hearn, Shubho Sengupta
Institutions: Unknown
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments: To Appear as keynote paper at FSE 2025

Click to view abstract

Abstract:Despite decades of research and practice in automated software testing, several fundamental concepts remain ill-defined and under-explored, yet offer enormous potential real-world impact. We show that these concepts raise exciting new challenges in the context of Large Language Models for software test generation. More specifically, we formally define and investigate the properties of hardening and catching tests. A hardening test is one that seeks to protect against future regressions, while a catching test is one that catches such a regression or a fault in new functionality introduced by a code change. Hardening tests can be generated at any time and may become catching tests when a future regression is caught. We also define and motivate the 'Catching Just-in-Time' (JiTTest) Challenge, in which tests are generated 'just-in-time' to catch new faults before they land into production. We show that any solution to Catching JiTTest generation can also be repurposed to catch latent faults in legacy code. We enumerate possible outcomes for hardening and catching tests and JiTTests, and discuss open research problems, deployment options, and initial results from our work on automated LLM-based hardening at Meta. This paper (author order is alphabetical; the corresponding author is Mark Harman) was written to accompany the keynote by the authors at the ACM International Conference on the Foundations of Software Engineering (FSE) 2025.
zh

[AI-19] ManipDreamer: Boosting Robotic Manipulation World Model with Action Tree and Visual Guidance

【Quick Read】: This paper addresses the shortcomings of existing robotic manipulation video synthesis methods in instruction following and visual quality, where ensuring effective instruction-following and high visual quality remains a significant challenge. Methods such as RoboDreamer achieve compositional instruction-following via linguistic decomposition, dividing instructions into low-level subtasks, but they do not model the relationships between these subtasks and they ignore valuable visual guidance such as depth and semantics, both crucial for visual quality.

To address these issues, the paper proposes ManipDreamer, an advanced world model whose key components are an action tree and visual guidance. First, each instruction is represented as an action tree whose nodes carry embeddings; an instruction obtains its embedding by traversing the tree, which better captures the relationships between subtasks and guides the world model. Second, to improve visual quality, a visual guidance adapter compatible with the world model combines depth and semantic guidance, improving the temporal and physical consistency of the generated videos. Built on the action tree and visual guidance, ManipDreamer markedly improves both instruction-following and visual quality: compared with RoboDreamer, PSNR improves from 19.55 to 21.05, SSIM from 0.7474 to 0.7982, and Flow Error drops from 3.506 to 3.201 on unseen tasks, while the average success rate on 6 RLBench manipulation tasks increases by 2.5%.

Link: https://arxiv.org/abs/2504.16464
Authors: Ying Li, Xiaobao Wei, Xiaowei Chi, Yuming Li, Zhongyu Zhao, Hao Wang, Ningning Ma, Ming Lu, Shanghang Zhang
Institutions: Unknown
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments: 9 pages, 3 figures

Click to view abstract

Abstract:While recent advancements in robotic manipulation video synthesis have shown promise, significant challenges persist in ensuring effective instruction-following and achieving high visual quality. Recent methods, like RoboDreamer, utilize linguistic decomposition to divide instructions into separate lower-level primitives, conditioning the world model on these primitives to achieve compositional instruction-following. However, these separate primitives do not consider the relationships that exist between them. Furthermore, recent methods neglect valuable visual guidance, including depth and semantic guidance, both crucial for enhancing visual quality. This paper introduces ManipDreamer, an advanced world model based on the action tree and visual guidance. To better learn the relationships between instruction primitives, we represent the instruction as the action tree and assign embeddings to tree nodes, each instruction can acquire its embeddings by navigating through the action tree. The instruction embeddings can be used to guide the world model. To enhance visual quality, we combine depth and semantic guidance by introducing a visual guidance adapter compatible with the world model. This visual adapter enhances both the temporal and physical consistency of video generation. Based on the action tree and visual guidance, ManipDreamer significantly boosts the instruction-following ability and visual quality. Comprehensive evaluations on robotic manipulation benchmarks reveal that ManipDreamer achieves large improvements in video quality metrics in both seen and unseen tasks, with PSNR improved from 19.55 to 21.05, SSIM improved from 0.7474 to 0.7982 and reduced Flow Error from 3.506 to 3.201 in unseen tasks, compared to the recent RoboDreamer model. Additionally, our method increases the success rate of robotic manipulation tasks by 2.5% in 6 RLbench tasks on average.
zh
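As a rough illustration of the action-tree idea (not ManipDreamer's actual implementation), the sketch below stores an embedding per tree node and derives an instruction embedding by pooling along a root-to-leaf traversal; the example tree, the dimension, and the mean-pooling rule are all assumptions.

```python
import torch

class ActionNode:
    def __init__(self, name, dim=16, children=None):
        self.name = name
        self.embedding = torch.randn(dim)  # learned jointly in practice
        self.children = children or []

def path_embedding(path):
    # pool node embeddings along a root-to-leaf traversal
    return torch.stack([n.embedding for n in path]).mean(dim=0)

# hypothetical tree: "pick up the cup" -> reach -> grasp
grasp = ActionNode("grasp")
reach = ActionNode("reach", children=[grasp])
root = ActionNode("pick up the cup", children=[reach])
inst_emb = path_embedding([root, reach, grasp])  # conditions the world model
```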

[AI-20] Private Federated Learning using Preference-Optimized Synthetic Data ICLR25

【Quick Read】: This paper addresses the limited quality of differentially private (DP) synthetic data in DP federated learning (DP-FL). Recent work suggests that DP synthetic data approaches can enhance or outperform conventional DP-FL (Wu et al., 2024; Hou et al., 2024), but the main algorithms for generating DP synthetic data rely on careful prompt engineering based on public information and iterative private client feedback, which limits efficiency and quality.

The key insight is to view the private client feedback collected by prior DP synthetic data methods as a preference ranking. The proposed algorithm, Preference Optimization for Private Client Data (POPri), uses preference-optimization techniques such as Direct Preference Optimization (DPO) to fine-tune LLMs to generate higher-quality DP synthetic data. To evaluate POPri, the authors release LargeFedBench, a new federated text benchmark for uncontaminated LLM evaluation on federated client data. POPri substantially improves the utility of DP synthetic data over prior work: it closes the gap in next-token prediction accuracy between the fully private and non-private settings by up to 68%, versus 52% for prior synthetic data methods and 10% for state-of-the-art DP federated learning. Code and data are available at the link provided.

Link: https://arxiv.org/abs/2504.16438
Authors: Charlie Hou, Mei-Yu Wang, Yige Zhu, Daniel Lazar, Giulia Fanti
Institutions: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Distributed, Parallel, and Cluster Computing (cs.DC)
Comments: Spotlight presentation at SynthData Workshop ICLR25

Click to view abstract

Abstract:In practical settings, differentially private Federated learning (DP-FL) is the dominant method for training models from private, on-device client data. Recent work has suggested that DP-FL may be enhanced or outperformed by methods that use DP synthetic data (Wu et al., 2024; Hou et al., 2024). The primary algorithms for generating DP synthetic data for FL applications require careful prompt engineering based on public information and/or iterative private client feedback. Our key insight is that the private client feedback collected by prior DP synthetic data methods (Hou et al., 2024; Xie et al., 2024) can be viewed as a preference ranking. Our algorithm, Preference Optimization for Private Client Data (POPri) harnesses client feedback using preference optimization algorithms such as Direct Preference Optimization (DPO) to fine-tune LLMs to generate high-quality DP synthetic data. To evaluate POPri, we release LargeFedBench, a new federated text benchmark for uncontaminated LLM evaluations on federated client data. POPri substantially improves the utility of DP synthetic data relative to prior work on LargeFedBench datasets and an existing benchmark from Xie et al. (2024). POPri closes the gap between next-token prediction accuracy in the fully-private and non-private settings by up to 68%, compared to 52% for prior synthetic data methods, and 10% for state-of-the-art DP federated learning methods. The code and data are available at this https URL.
zh
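The preference-optimization step can be pictured with the standard DPO loss applied to (chosen, rejected) pairs derived from the client rankings. The sketch below assumes summed token log-probabilities are already computed for the policy and a frozen reference model; beta and all tensors are placeholders, and POPri's full pipeline (private aggregation, sampling) is omitted.

```python
import torch
import torch.nn.functional as F

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Direct Preference Optimization on (chosen, rejected) pairs of
    synthetic samples, where 'chosen' outranked 'rejected' in the
    aggregated client feedback."""
    margin = (logp_w - ref_logp_w) - (logp_l - ref_logp_l)
    return -F.logsigmoid(beta * margin).mean()

# hypothetical summed token log-probs for a batch of 8 pairs
loss = dpo_loss(torch.randn(8), torch.randn(8), torch.randn(8), torch.randn(8))
```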

[AI-21] FKAN: Interpretable Time Series Forecasting with Kolmogorov-Arnold Network

【Quick Read】: This paper tackles the lack of interpretability in existing deep time series forecasting methods, which, despite their accuracy, hinders trustworthy deployment in safety-critical domains such as autonomous driving and healthcare. The key to the solution is iTFKAN, a novel interpretable model that achieves interpretability through model symbolization, enabling further exploration of decision rationales and latent data patterns. iTFKAN additionally develops two strategies, prior knowledge injection and time-frequency synergy learning, to guide learning effectively on complex, intertwined time series data. Experiments show that iTFKAN achieves strong forecasting performance while offering a high degree of interpretability.

Link: https://arxiv.org/abs/2504.16432
Authors: Ziran Liang, Rui An, Wenqi Fan, Yanghui Rao, Yuxuan Liang
Institutions: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:As time evolves, data within specific domains exhibit predictability that motivates time series forecasting to predict future trends from historical data. However, current deep forecasting methods can achieve promising performance but generally lack interpretability, hindering trustworthiness and practical deployment in safety-critical applications such as auto-driving and healthcare. In this paper, we propose a novel interpretable model, iTFKAN, for credible time series forecasting. iTFKAN enables further exploration of model decision rationales and underlying data patterns due to its interpretability achieved through model symbolization. Besides, iTFKAN develops two strategies, prior knowledge injection, and time-frequency synergy learning, to effectively guide model learning under complex intertwined time series data. Extensive experimental results demonstrated that iTFKAN can achieve promising forecasting performance while simultaneously possessing high interpretive capabilities.
zh

[AI-22] A Survey of Foundation Model-Powered Recommender Systems: From Feature-Based Generative to Agentic Paradigms

【Quick Read】: This survey explores the potential of foundation models (FMs) in recommender systems (RS) and their role in reshaping the recommendation paradigm. Traditional recommendation techniques rely on task-specific models of user-item interactions and content features, whereas large pre-trained models such as GPT, LLaMA, and CLIP open new possibilities. The key contribution is a review of how FMs can be integrated into RS under three paradigms: feature-based representation augmentation, generative recommendation, and agentic interactive systems. The survey also analyzes FM capabilities in multimodal data processing, representation learning, and natural language understanding, identifies realized opportunities and remaining challenges, and outlines future research directions and technical difficulties. The central message is to exploit the power of FMs while preserving efficiency and accuracy and overcoming current technical limitations.

Link: https://arxiv.org/abs/2504.16420
Authors: Chengkai Huang, Hongtao Huang, Tong Yu, Kaige Xie, Junda Wu, Shuai Zhang, Julian Mcauley, Dietmar Jannach, Lina Yao
Institutions: Unknown
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Recommender systems (RS) have become essential in filtering information and personalizing content for users. RS techniques have traditionally relied on modeling interactions between users and items as well as the features of content using models specific to each task. The emergence of foundation models (FMs), large scale models trained on vast amounts of data such as GPT, LLaMA and CLIP, is reshaping the recommendation paradigm. This survey provides a comprehensive overview of the Foundation Models for Recommender Systems (FM4RecSys), covering their integration in three paradigms: (1) Feature-Based augmentation of representations, (2) Generative recommendation approaches, and (3) Agentic interactive systems. We first review the data foundations of RS, from traditional explicit or implicit feedback to multimodal content sources. We then introduce FMs and their capabilities for representation learning, natural language understanding, and multi-modal reasoning in RS contexts. The core of the survey discusses how FMs enhance RS under different paradigms. Afterward, we examine FM applications in various recommendation tasks. Through an analysis of recent research, we highlight key opportunities that have been realized as well as challenges encountered. Finally, we outline open research directions and technical challenges for next-generation FM4RecSys. This survey not only reviews the state-of-the-art methods but also provides a critical analysis of the trade-offs among the feature-based, the generative, and the agentic paradigms, outlining key open issues and future research directions.
zh

[AI-23] FeedQUAC: Quick Unobtrusive AI-Generated Commentary

【Quick Read】: This paper addresses the problem that gathering feedback throughout the design process is labor-intensive and disruptive. The key to the solution is FeedQUAC, a design companion that delivers real-time AI-generated commentary from a variety of perspectives through different personas, making feedback effortless and ambient.

Link: https://arxiv.org/abs/2504.16416
Authors: Tao Long, Kendra Wannamaker, Jo Vermeulen, George Fitzmaurice, Justin Matejka
Institutions: Unknown
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Multimedia (cs.MM)
Comments: 20 pages, 12 figures

Click to view abstract

Abstract:Design thrives on feedback. However, gathering constant feedback throughout the design process can be labor-intensive and disruptive. We explore how AI can bridge this gap by providing effortless, ambient feedback. We introduce FeedQUAC, a design companion that delivers real-time AI-generated commentary from a variety of perspectives through different personas. A design probe study with eight participants highlights how designers can leverage quick yet ambient AI feedback to enhance their creative workflows. Participants highlight benefits such as convenience, playfulness, confidence boost, and inspiration from this lightweight feedback agent, while suggesting additional features, like chat interaction and context curation. We discuss the role of AI feedback, its strengths and limitations, and how to integrate it into existing design workflows while balancing user involvement. Our findings also suggest that ambient interaction is a valuable consideration for both the design and evaluation of future creativity support systems.
zh

[AI-24] Cyberoception: Finding a Painlessly-Measurable New Sense in the Cyberworld Towards Emotion-Awareness in Computing

【Quick Read】: This paper addresses the practical limitations of existing interoception measurement methods, which depend on well-controlled laboratory environments and precision apparatus and therefore make real-time monitoring of users' interoceptive states difficult. The key is the newly proposed hypothetical concept of "cyberoception": a new sense measurable solely with the sensors embedded in commodity smartphones, with emotion-related properties similar to interoception. A new data type, "Turn On" (users' subjective perception of how frequently they turn on their smartphones), is defined, and a 10-day in-lab/in-the-wild hybrid experiment shows a significant association between it and participants' emotional valence. The result lays groundwork for developing more emotion-aware, user-friendly applications and services.

Link: https://arxiv.org/abs/2504.16378
Authors: Tadashi Okoshi, Zexiong Gao, Tan Yi Zhen, Takumi Karasawa, Takeshi Miki, Wataru Sasaki, Rajesh K. Balan
Institutions: Unknown
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
Comments: Accepted by ACM CHI2025

Click to view abstract

Abstract:In Affective computing, recognizing users’ emotions accurately is the basis of affective human-computer interaction. Understanding users’ interoception contributes to a better understanding of individually different emotional abilities, which is essential for achieving inter-individually accurate emotion estimation. However, existing interoception measurement methods, such as the heart rate discrimination task, have several limitations, including their dependence on a well-controlled laboratory environment and precision apparatus, making monitoring users’ interoception challenging. This study aims to determine other forms of data that can explain users’ interoceptive or similar states in their real-world lives and propose a novel hypothetical concept “cyberoception,” a new sense (1) which has properties similar to interoception in terms of the correlation with other emotion-related abilities, and (2) which can be measured only by the sensors embedded inside commodity smartphone devices in users’ daily lives. Results from a 10-day-long in-lab/in-the-wild hybrid experiment reveal a specific cyberoception type “Turn On” (users’ subjective sensory perception about the frequency of turning-on behavior on their smartphones), significantly related to participants’ emotional valence. We anticipate that cyberoception to serve as a fundamental building block for developing more “emotion-aware”, user-friendly applications and services.
zh

[AI-25] DP2FL: Dual Prompt Personalized Federated Learning in Foundation Models

【Quick Read】: This paper addresses the poor performance of under-trained deep models in personalized federated learning (PFL) when local client data is limited, explores how to better harness foundation models to mitigate this problem, and tackles the challenges that arise when new clients join. The key is the Dual Prompt Personalized Federated Learning (DP2FL) framework, which introduces dual prompts and an adaptive aggregation strategy to combine global task awareness with local data-driven insights, improving generalization while adapting to specific data distributions. DP2FL also provides a global model that supports prediction on new data sources and seamlessly integrates newly added clients without retraining. Experiments in highly heterogeneous environments validate the effectiveness of its prompt design and aggregation strategy.

Link: https://arxiv.org/abs/2504.16357
Authors: Ying Chang, Xiaohu Shi, Xiaohui Zhao, Zhaohuang Chen, Deyin Ma
Institutions: Unknown
Categories: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:Personalized federated learning (PFL) has garnered significant attention for its ability to address heterogeneous client data distributions while preserving data privacy. However, when local client data is limited, deep learning models often suffer from insufficient training, leading to suboptimal performance. Foundation models, such as CLIP (Contrastive Language-Image Pretraining), exhibit strong feature extraction capabilities and can alleviate this issue by fine-tuning on limited local data. Despite their potential, foundation models are rarely utilized in federated learning scenarios, and challenges related to integrating new clients remain largely unresolved. To address these challenges, we propose the Dual Prompt Personalized Federated Learning (DP2FL) framework, which introduces dual prompts and an adaptive aggregation strategy. DP2FL combines global task awareness with local data-driven insights, enabling local models to achieve effective generalization while remaining adaptable to specific data distributions. Moreover, DP2FL introduces a global model that enables prediction on new data sources and seamlessly integrates newly added clients without requiring retraining. Experimental results in highly heterogeneous environments validate the effectiveness of DP2FL’s prompt design and aggregation strategy, underscoring the advantages of prediction on novel data sources and demonstrating the seamless integration of new clients into the federated learning framework.
zh
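One way to picture the server-side aggregation of client prompts is a weighted average of prompt tensors; this is a minimal sketch under assumed details (the weighting by client data size and all shapes are illustrative, not DP2FL's published rule).

```python
import torch

def aggregate_prompts(client_prompts, weights):
    """Adaptive aggregation sketch: build a global prompt as a weighted
    average of client prompt tensors (weights here are hypothetical,
    e.g. proportional to local data sizes)."""
    w = torch.tensor(weights, dtype=torch.float32)
    w = w / w.sum()
    stacked = torch.stack(client_prompts)           # (num_clients, prompt_len, d)
    return (w.view(-1, 1, 1) * stacked).sum(dim=0)  # global prompt

# three clients with prompt length 4 and embedding dim 8
global_prompt = aggregate_prompts([torch.randn(4, 8) for _ in range(3)], [100, 40, 60])
```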

[AI-26] Disentangling and Generating Modalities for Recommendation in Missing Modality Scenarios SIGIR2025

【Quick Read】: This paper addresses two key shortcomings of multi-modal recommender systems (MRSs) in missing-modality scenarios: (1) insufficient consideration of missing modalities, and (2) neglect of the unique characteristics of modality features, which together cause significant performance degradation in practice. The key is the proposed Disentangling and Generating Modality Recommender (DGMRec), which, from an information-theoretic perspective, disentangles modality features into general and modality-specific features for richer representations, and then generates missing modality features by integrating aligned features from other modalities and exploiting users' modality preferences. Experiments show that DGMRec consistently outperforms state-of-the-art MRSs in challenging settings, and its generation-based approach additionally enables cross-modal retrieval, a task existing MRSs cannot perform.

Link: https://arxiv.org/abs/2504.16352
Authors: Jiwan Kim, Hongseok Kang, Sein Kim, Kibum Kim, Chanyoung Park
Institutions: Unknown
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
Comments: SIGIR 2025

Click to view abstract

Abstract:Multi-modal recommender systems (MRSs) have achieved notable success in improving personalization by leveraging diverse modalities such as images, text, and audio. However, two key challenges remain insufficiently addressed: (1) Insufficient consideration of missing modality scenarios and (2) the overlooking of unique characteristics of modality features. These challenges result in significant performance degradation in realistic situations where modalities are missing. To address these issues, we propose Disentangling and Generating Modality Recommender (DGMRec), a novel framework tailored for missing modality scenarios. DGMRec disentangles modality features into general and specific modality features from an information-based perspective, enabling richer representations for recommendation. Building on this, it generates missing modality features by integrating aligned features from other modalities and leveraging user modality preferences. Extensive experiments show that DGMRec consistently outperforms state-of-the-art MRSs in challenging scenarios, including missing modalities and new item settings as well as diverse missing ratios and varying levels of missing modalities. Moreover, DGMRec’s generation-based approach enables cross-modal retrieval, a task inapplicable for existing MRSs, highlighting its adaptability and potential for real-world applications. Our code is available at this https URL.
zh
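A toy version of the missing-modality generation idea: impute the absent modality from the aligned (general) features of the modalities that are present, weighted by the user's modality preference. All names and the simple fusion rule are illustrative assumptions rather than DGMRec's actual generator.

```python
import torch

def generate_missing(aligned_feats, user_pref):
    """Impute a missing modality feature from the aligned features of
    the available modalities, weighted by the user's modality
    preference scores (all names hypothetical)."""
    w = torch.softmax(user_pref, dim=0)             # preference over modalities
    return sum(w[i] * f for i, f in enumerate(aligned_feats))

# e.g. an item missing its image feature, reconstructed from text and audio
text_f, audio_f = torch.randn(64), torch.randn(64)
image_hat = generate_missing([text_f, audio_f], user_pref=torch.tensor([0.7, 0.3]))
```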

[AI-27] Mining Software Repositories for Expert Recommendation

【Quick Read】: This paper aims to help human bug triagers in large open-source projects assign bug reports to the most suitable developers, a task that normally requires substantial domain expertise. The goal is to automatically identify developers with matching expertise by analyzing the historical development records stored in issue tracking systems. The key is to combine BERTopic with techniques from TopicMiner, using features of the bug reports (the affected products and components, plus priority and severity levels) to measure developer experience and rank developers accordingly. Evaluation uses Top-k accuracy, compared against results reported in prior work (TopicMiner MTM, BUGZIE, BT-RL, and LDA-SVM) on data from several Eclipse and Mozilla projects such as JDT, Firefox, and Thunderbird.

Link: https://arxiv.org/abs/2504.16343
Authors: Chad Marshall, Andrew Barovic, Armin Moin
Institutions: Unknown
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:We propose an automated approach to bug assignment to developers in large open-source software projects. This way, we assist human bug triagers who are in charge of finding the best developer with the right level of expertise in a particular area to be assigned to a newly reported issue. Our approach is based on the history of software development as documented in the issue tracking systems. We deploy BERTopic and techniques from TopicMiner. Our approach works based on the bug reports’ features, such as the corresponding products and components, as well as their priority and severity levels. We sort developers based on their experience with specific combinations of new reports. The evaluation is performed using Top-k accuracy, and the results are compared with the reported results in prior work, namely TopicMiner MTM, BUGZIE, Bug triaging via deep Reinforcement Learning BT-RL, and LDA-SVM. The evaluation data come from various Eclipse and Mozilla projects, such as JDT, Firefox, and Thunderbird.
zh
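To sketch the ranking idea, the snippet below profiles how often each developer resolved reports of each topic and ranks them for a new report's topic. TF-IDF plus k-means stands in for the paper's BERTopic/TopicMiner pipeline, and the four-document corpus is a toy placeholder for real issue-tracker history.

```python
from collections import Counter
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

# hypothetical history: resolved bug reports and the developer who fixed each;
# TF-IDF + k-means stands in for the paper's BERTopic topic modelling
reports = [
    "crash when saving large project file",
    "editor freezes while saving project",
    "dark theme renders toolbar icons wrong",
    "toolbar icons blurry on high dpi displays",
]
fixers = ["alice", "alice", "carol", "carol"]

vec = TfidfVectorizer()
X = vec.fit_transform(reports)
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

experience = Counter(zip(fixers, km.labels_))     # (developer, topic) -> count

new_topic = km.predict(vec.transform(["icons look wrong in dark mode"]))[0]
ranking = sorted(((d, n) for (d, t), n in experience.items() if t == new_topic),
                 key=lambda item: -item[1])
print(ranking)   # developers ranked by experience with the matched topic
```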

[AI-28] On the Consistency of GNN Explanations for Malware Detection

【Quick Read】: This paper addresses the lack of model interpretability in malware detection based on control flow graphs (CFGs). The key is a novel framework that dynamically constructs CFGs and builds node feature representations with a hybrid of rule-based encoding and autoencoder-based embedding, then trains a graph neural network (GNN) classifier to detect malicious behavior. For interpretability, it applies several state-of-the-art explainability techniques, including GNNExplainer, PGExplainer, and CaptumExplainer, the last with three attribution methods: Integrated Gradients, Guided Backpropagation, and Saliency. The paper further proposes RankFusion, a novel aggregation method that integrates the outputs of the top-performing explainers to improve explanation quality, and evaluates two subgraph extraction strategies, including the proposed Greedy Edge-wise Composition (GEC) method for better structural coherence. A comprehensive evaluation using accuracy, fidelity, and consistency metrics shows that the framework accurately identifies malware samples while producing reliable and interpretable explanations.

Link: https://arxiv.org/abs/2504.16316
Authors: Hossein Shokouhinejad, Griffin Higgins, Roozbeh Razavi-Far, Hesamodin Mohammadian, Ali A. Ghorbani
Institutions: Unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:Control Flow Graphs (CFGs) are critical for analyzing program execution and characterizing malware behavior. With the growing adoption of Graph Neural Networks (GNNs), CFG-based representations have proven highly effective for malware detection. This study proposes a novel framework that dynamically constructs CFGs and embeds node features using a hybrid approach combining rule-based encoding and autoencoder-based embedding. A GNN-based classifier is then constructed to detect malicious behavior from the resulting graph representations. To improve model interpretability, we apply state-of-the-art explainability techniques, including GNNExplainer, PGExplainer, and CaptumExplainer, the latter utilizing three attribution methods: Integrated Gradients, Guided Backpropagation, and Saliency. In addition, we introduce a novel aggregation method, called RankFusion, that integrates the outputs of the top-performing explainers to enhance the explanation quality. We also evaluate explanations using two subgraph extraction strategies, including the proposed Greedy Edge-wise Composition (GEC) method for improved structural coherence. A comprehensive evaluation using accuracy, fidelity, and consistency metrics demonstrates the effectiveness of the proposed framework in terms of accurate identification of malware samples and generating reliable and interpretable explanations.
zh
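The rank-aggregation idea behind RankFusion can be sketched as follows: convert each explainer's edge-importance scores to ranks and average them. The exact fusion rule used in the paper may differ, and the scores below are made up.

```python
import numpy as np

def rank_fusion(score_lists):
    """Aggregate per-edge importance scores from several explainers by
    averaging their ranks (higher score -> better rank). A sketch of
    the rank-aggregation idea, not necessarily the paper's exact rule."""
    ranks = []
    for scores in score_lists:
        order = np.argsort(-np.asarray(scores))      # best edge first
        r = np.empty_like(order)
        r[order] = np.arange(len(scores))            # rank position per edge
        ranks.append(r)
    return np.mean(ranks, axis=0)                    # lower = more important

gnnexp = [0.9, 0.1, 0.5, 0.3]   # hypothetical GNNExplainer edge scores
pgexp  = [0.8, 0.2, 0.6, 0.1]   # hypothetical PGExplainer edge scores
captum = [0.7, 0.3, 0.4, 0.2]   # hypothetical CaptumExplainer edge scores
fused = rank_fusion([gnnexp, pgexp, captum])
print(np.argsort(fused))        # edges ordered by fused importance
```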

[AI-29] DataS3: Dataset Subset Selection for Specialization

【Quick Read】: This paper addresses optimizing machine learning model performance on specific deployments (e.g., a particular hospital or national park) rather than on a domain broadly. Real deployment data is often imbalanced and distinctive, and the distribution gap between training and deployment data degrades performance. To tackle this, the paper formalizes dataset subset selection for specialization (DS3): selecting a subset of an existing training set that optimizes performance on a target deployment-specific distribution. The key contribution is DataS^3, the first dataset and benchmark designed specifically for the DS3 problem. A comprehensive study of algorithm families (coresets, data filtering, and data curation) shows that general-distribution methods consistently fail on deployment-specific tasks, while carefully curated expert subsets can improve accuracy by up to 51.3%. The benchmark thus highlights the critical role of tailored dataset curation for performance and training efficiency on deployment-specific distributions.

Link: https://arxiv.org/abs/2504.16277
Authors: Neha Hulkund, Alaa Maalouf, Levi Cai, Daniel Yang, Tsun-Hsuan Wang, Abigail O'Neil, Timm Haucke, Sandeep Mukherjee, Vikram Ramaswamy, Judy Hansen Shen, Gabriel Tseng, Mike Walmsley, Daniela Rus, Ken Goldberg, Hannah Kerner, Irene Chen, Yogesh Girdhar, Sara Beery
Institutions: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:In many real-world machine learning (ML) applications (e.g. detecting broken bones in x-ray images, detecting species in camera traps), in practice models need to perform well on specific deployments (e.g. a specific hospital, a specific national park) rather than the domain broadly. However, deployments often have imbalanced, unique data distributions. Discrepancy between the training distribution and the deployment distribution can lead to suboptimal performance, highlighting the need to select deployment-specialized subsets from the available training data. We formalize dataset subset selection for specialization (DS3): given a training set drawn from a general distribution and a (potentially unlabeled) query set drawn from the desired deployment-specific distribution, the goal is to select a subset of the training data that optimizes deployment performance. We introduce DataS^3; the first dataset and benchmark designed specifically for the DS3 problem. DataS^3 encompasses diverse real-world application domains, each with a set of distinct deployments to specialize in. We conduct a comprehensive study evaluating algorithms from various families--including coresets, data filtering, and data curation--on DataS^3, and find that general-distribution methods consistently fail on deployment-specific tasks. Additionally, we demonstrate the existence of manually curated (deployment-specific) expert subsets that outperform training on all available data with accuracy gains up to 51.3 percent. Our benchmark highlights the critical role of tailored dataset curation in enhancing performance and training efficiency on deployment-specific distributions, which we posit will only become more important as global, public datasets become available across domains and ML models are deployed in the real world.
zh
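As a simple baseline for the DS3 problem (not one of the benchmark's published methods), one can score each training point by its mean similarity to the unlabeled query set in some embedding space and keep the top k:

```python
import numpy as np

def select_subset(train_emb, query_emb, k):
    """Deployment-specialized subset selection baseline: keep the k
    training points closest (on average, in cosine similarity) to the
    query-set embeddings. A sketch, assuming precomputed embeddings."""
    t = train_emb / np.linalg.norm(train_emb, axis=1, keepdims=True)
    q = query_emb / np.linalg.norm(query_emb, axis=1, keepdims=True)
    sim = t @ q.T                          # (n_train, n_query)
    return np.argsort(-sim.mean(axis=1))[:k]

# hypothetical 128-d embeddings: 1000 training points, 50 query points
idx = select_subset(np.random.randn(1000, 128), np.random.randn(50, 128), k=200)
```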

[AI-30] Investigating LLMs in Clinical Triage: Promising Capabilities, Persistent Intersectional Biases ALT AAAI2025

【Quick Read】: This paper addresses the assessment of large language models' (LLMs) potential in emergency department triage and the analysis of their possible biases. It systematically investigates LLM capabilities along two key dimensions: (1) robustness to distribution shifts and missing data, and (2) counterfactual analysis of intersectional biases across sex and race. The key lies in evaluating multiple LLM-based approaches (such as continued pre-training and in-context learning) alongside traditional machine learning methods, and in using empirical analysis to reveal where LLMs excel and why performance differs. The study finds preference gaps at particular intersections of sex and race, with sex-based differences most pronounced in certain racial groups, suggesting that LLMs encode demographic preferences that may surface in specific clinical contexts or combinations of characteristics.

Link: https://arxiv.org/abs/2504.16273
Authors: Joseph Lee, Tianqi Shang, Jae Young Baik, Duy Duong-Tran, Shu Yang, Lingyao Li, Li Shen
Institutions: Unknown
Categories: Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
Comments: Accepted to GenAI4Health Workshop @ AAAI 2025

Click to view abstract

Abstract:Large Language Models (LLMs) have shown promise in clinical decision support, yet their application to triage remains underexplored. We systematically investigate the capabilities of LLMs in emergency department triage through two key dimensions: (1) robustness to distribution shifts and missing data, and (2) counterfactual analysis of intersectional biases across sex and race. We assess multiple LLM-based approaches, ranging from continued pre-training to in-context learning, as well as machine learning approaches. Our results indicate that LLMs exhibit superior robustness, and we investigate the key factors contributing to the promising LLM-based approaches. Furthermore, in this setting, we identify gaps in LLM preferences that emerge in particular intersections of sex and race. LLMs generally exhibit sex-based differences, but they are most pronounced in certain racial groups. These findings suggest that LLMs encode demographic preferences that may emerge in specific clinical contexts or particular combinations of characteristics.
zh

[AI-31] Boosting Classifier Performance with Opposition-Based Data Transformation

【Quick Read】: This paper targets the inadequate performance of traditional classification algorithms in complex or sparse learning environments. The proposed remedy is a novel data transformation framework based on Opposition-Based Learning (OBL). Originally developed to accelerate convergence in optimization tasks, OBL is repurposed here to generate synthetic opposite samples that replace part of the training data and improve decision-boundary formation. The key is the design of three OBL variants (Global OBL, Class-Wise OBL, and Localized Class-Wise OBL) combined with several widely used classifiers (KNN, SVM, LR, and DT); the opposition-augmented data improves overall classification performance while markedly increasing computational efficiency, particularly for SVM and LR. Experiments show that OBL-enhanced classifiers outperform their standard counterparts in accuracy and F1-score, and the approach is particularly suited to high-dimensional, heterogeneous datasets.

Link: https://arxiv.org/abs/2504.16268
Authors: Abdesslem Layeb
Institutions: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:In this paper, we introduce a novel data transformation framework based on Opposition-Based Learning (OBL) to boost the performance of traditional classification algorithms. Originally developed to accelerate convergence in optimization tasks, OBL is leveraged here to generate synthetic opposite samples that replace the acutely training data and improve decision boundary formation. We explore three OBL variants; Global OBL, Class-Wise OBL, and Localized Class-Wise OBL; and integrate them with several widely used classifiers, including K-Nearest Neighbors (KNN), Support Vector Machines (SVM), Logistic Regression (LR), and Decision Tree (DT). Extensive experiments conducted on 26 heterogeneous and high-dimensional datasets demonstrate that OBL-enhanced classifiers consistently outperform their standard counterparts in terms of accuracy and F1-score, frequently achieving near-perfect or perfect classification. Furthermore, OBL contributes to improved computational efficiency, particularly in SVM and LR. These findings underscore the potential of OBL as a lightweight yet powerful data transformation strategy for enhancing classification performance, especially in complex or sparse learning environments.
zh
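The opposite-sample construction is easy to state: reflect each feature value about its interval midpoint. Below is a small sketch of the Global and Class-Wise variants as described in the abstract (the Localized variant and the exact way opposites replace training data are left out):

```python
import numpy as np

def global_opposition(X):
    """Global OBL: reflect each sample about the per-feature interval
    midpoint, x_opp = a + b - x, with a/b the feature-wise min/max."""
    a, b = X.min(axis=0), X.max(axis=0)
    return a + b - X

def classwise_opposition(X, y):
    """Class-Wise OBL: the same reflection, but with bounds computed
    within each class so opposites stay inside the class region."""
    X_opp = np.empty_like(X)
    for c in np.unique(y):
        m = (y == c)
        a, b = X[m].min(axis=0), X[m].max(axis=0)
        X_opp[m] = a + b - X[m]
    return X_opp

X = np.random.rand(100, 5)
y = np.random.randint(0, 2, size=100)
X_glob, X_cls = global_opposition(X), classwise_opposition(X, y)
```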

[AI-32] Gradient-Optimized Fuzzy Classifier: A Benchmark Study Against State-of-the-Art Models

【Quick Read】: This paper benchmarks a Gradient-Optimized Fuzzy Inference System (GF) classifier against several state-of-the-art machine learning models, including Random Forest, XGBoost, Logistic Regression, Support Vector Machines, and Neural Networks, on five datasets from the UCI Machine Learning Repository chosen for their diversity in input types, class distributions, and classification complexity. The key is that GF replaces the derivative-free optimization of traditional fuzzy inference systems with gradient descent, which markedly improves training efficiency and predictive performance. Results show the GF achieves competitive, and in several cases superior, accuracy while maintaining high precision and very short training times, with strong consistency across folds and datasets that underscores its robustness to noisy data and variable feature sets. GF thus shows potential as an interpretable, efficient, and adaptable alternative to more complex deep learning models in supervised learning tasks.

Link: https://arxiv.org/abs/2504.16263
Authors: Magnus Sieverding, Nathan Steffen, Kelly Cohen
Institutions: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:This paper presents a performance benchmarking study of a Gradient-Optimized Fuzzy Inference System (GF) classifier against several state-of-the-art machine learning models, including Random Forest, XGBoost, Logistic Regression, Support Vector Machines, and Neural Networks. The evaluation was conducted across five datasets from the UCI Machine Learning Repository, each chosen for their diversity in input types, class distributions, and classification complexity. Unlike traditional Fuzzy Inference Systems that rely on derivative-free optimization methods, the GF leverages gradient descent to significantly improving training efficiency and predictive performance. Results demonstrate that the GF model achieved competitive, and in several cases superior, classification accuracy while maintaining high precision and exceptionally low training times. In particular, the GF exhibited strong consistency across folds and datasets, underscoring its robustness in handling noisy data and variable feature sets. These findings support the potential of gradient optimized fuzzy systems as interpretable, efficient, and adaptable alternatives to more complex deep learning models in supervised learning tasks.
zh
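To illustrate what "gradient-optimized fuzzy inference" means mechanically, here is a minimal PyTorch sketch with Gaussian membership functions whose centers and widths are trained by gradient descent. One rule per class and the product firing strength (sum of log-memberships) are simplifying assumptions, not the paper's GF architecture.

```python
import torch
import torch.nn as nn

class GaussianFuzzy(nn.Module):
    """Gradient-trainable fuzzy classifier sketch: Gaussian memberships
    with learnable centers/widths, one rule per class, firing strength
    as the product of memberships (computed in log space)."""
    def __init__(self, n_features, n_classes):
        super().__init__()
        self.centers = nn.Parameter(torch.randn(n_classes, n_features))
        self.log_sigma = nn.Parameter(torch.zeros(n_classes, n_features))

    def forward(self, x):                       # x: (batch, n_features)
        diff = x.unsqueeze(1) - self.centers    # (batch, classes, features)
        log_mu = -0.5 * (diff / self.log_sigma.exp()) ** 2
        return log_mu.sum(dim=-1)               # log firing strength per rule

model = GaussianFuzzy(4, 3)
opt = torch.optim.Adam(model.parameters(), lr=1e-2)
x, y = torch.randn(32, 4), torch.randint(0, 3, (32,))
loss = nn.CrossEntropyLoss()(model(x), y)       # end-to-end gradient descent
loss.backward(); opt.step()
```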

[AI-33] Blockchain Meets Adaptive Honeypots: A Trust-Aware Approach to Next-Gen IoT Security

【Quick Read】: This paper addresses the growing cybersecurity threats facing edge-computing-based Next-Generation Wireless Networks (NGWN)-IoT, which offer enhanced bandwidth for large-scale service provisioning yet remain vulnerable as attackers continually evolve their strategies, limiting the effectiveness of existing intrusion detection and prevention. The proposed dynamic attack detection and prevention scheme has several key components: blockchain-based authentication using the Deoxys Authentication Algorithm (DAA) to verify IoT device legitimacy before data transmission; a bi-stage intrusion detection system whose first stage performs signature-based detection with an Improved Random Forest (IRF) and whose second stage performs feature-based anomaly detection with a Diffusion Convolution Recurrent Neural Network (DCRNN); and trust-aware service migration via Heap-Based Optimization (HBO) to preserve Quality of Service (QoS) and Service Level Agreements (SLA). In addition, on-demand high-interaction honeypots deceive attackers and extract attack patterns, which are stored securely with the Bimodal Lattice Signature Scheme (BLISS) to strengthen the signature-based Intrusion Detection System (IDS). Evaluated in the NS3 simulation environment against existing methods across accuracy, attack detection rate, false negative rate, precision, recall, ROC curve, memory usage, CPU usage, and execution time, the framework significantly outperforms prior approaches and strengthens the security of NGWN-enabled IoT ecosystems.

Link: https://arxiv.org/abs/2504.16226
Authors: Yazan Otoum, Arghavan Asad, Amiya Nayak
Institutions: Unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET); Machine Learning (cs.LG)
Comments: This paper has been submitted to the IEEE Transactions on Network Science and Engineering (TNSE) for possible publication

Click to view abstract

Abstract:Edge computing-based Next-Generation Wireless Networks (NGWN)-IoT offer enhanced bandwidth capacity for large-scale service provisioning but remain vulnerable to evolving cyber threats. Existing intrusion detection and prevention methods provide limited security as adversaries continually adapt their attack strategies. We propose a dynamic attack detection and prevention approach to address this challenge. First, blockchain-based authentication uses the Deoxys Authentication Algorithm (DAA) to verify IoT device legitimacy before data transmission. Next, a bi-stage intrusion detection system is introduced: the first stage uses signature-based detection via an Improved Random Forest (IRF) algorithm. In contrast, the second stage applies feature-based anomaly detection using a Diffusion Convolution Recurrent Neural Network (DCRNN). To ensure Quality of Service (QoS) and maintain Service Level Agreements (SLA), trust-aware service migration is performed using Heap-Based Optimization (HBO). Additionally, on-demand virtual High-Interaction honeypots deceive attackers and extract attack patterns, which are securely stored using the Bimodal Lattice Signature Scheme (BLISS) to enhance signature-based Intrusion Detection Systems (IDS). The proposed framework is implemented in the NS3 simulation environment and evaluated against existing methods across multiple performance metrics, including accuracy, attack detection rate, false negative rate, precision, recall, ROC curve, memory usage, CPU usage, and execution time. Experimental results demonstrate that the framework significantly outperforms existing approaches, reinforcing the security of NGWN-enabled IoT ecosystems
zh

[AI-34] Hexcute: A Tile-based Programming Language with Automatic Layout and Task-Mapping Synthesis

【Quick Read】: This paper addresses the new GPU optimization challenges posed by deep learning (DL) quantization, in particular the need for matrix multiplication operators with mixed input data types: existing high-level compilers lack the expressiveness for key optimizations on these operators, while low-level programming models demand substantial manual effort. The key is Hexcute, a tile-based programming language that exposes shared memory and register abstractions to enable fine-grained optimization of these operators. Hexcute also schedules GPU programs via task mapping and, to reduce programming effort, automates layout and task-mapping synthesis with a novel type-inference-based algorithm.

Link: https://arxiv.org/abs/2504.16214
Authors: Xiao Zhang, Yaoyao Ding, Yang Hu, Gennady Pekhimenko
Institutions: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Programming Languages (cs.PL)
Comments: 17 pages, 24 figures

Click to view abstract

Abstract:Deep learning (DL) workloads mainly run on accelerators like GPUs. Recent DL quantization techniques demand a new matrix multiplication operator with mixed input data types, further complicating GPU optimization. Prior high-level compilers like Triton lack the expressiveness to implement key optimizations like fine-grained data pipelines and hardware-friendly memory layouts for these operators, while low-level programming models, such as Hidet, Graphene, and CUTLASS, require significant programming efforts. To balance expressiveness with engineering effort, we propose Hexcute, a tile-based programming language that exposes shared memory and register abstractions to enable fine-grained optimization for these operators. Additionally, Hexcute leverages task mapping to schedule the GPU program, and to reduce programming efforts, it automates layout and task mapping synthesis with a novel type-inference-based algorithm. Our evaluation shows that Hexcute generalizes to a wide range of DL operators, achieves 1.7-11.28 \times speedup over existing DL compilers for mixed-type operators, and brings up to 2.91 \times speedup in the end-to-end evaluation.
zh

[AI-35] TinyML for Speech Recognition

【Quick Read】: This paper targets efficient speech recognition on highly resource-constrained IoT edge devices. The key is to train and deploy a quantized 1D convolutional neural network (CNN), optimized with the tooling provided by Edge Impulse, reaching up to 97% accuracy on a newly created dataset. The prototype is validated on an Arduino Nano 33 BLE Sense microcontroller board, which is designed for IoT and AI applications and thus suits the target use cases. Whereas most existing work covers only a small set of keywords, this model handles 23 different keywords, enabling more complex voice commands.

Link: https://arxiv.org/abs/2504.16213
Authors: Andrew Barovic, Armin Moin
Institutions: Unknown
Categories: Sound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
Comments:

Click to view abstract

Abstract:We train and deploy a quantized 1D convolutional neural network model to conduct speech recognition on a highly resource-constrained IoT edge device. This can be useful in various Internet of Things (IoT) applications, such as smart homes and ambient assisted living for the elderly and people with disabilities, just to name a few examples. In this paper, we first create a new dataset with over one hour of audio data that enables our research and will be useful to future studies in this field. Second, we utilize the technologies provided by Edge Impulse to enhance our model’s performance and achieve a high Accuracy of up to 97% on our dataset. For the validation, we implement our prototype using the Arduino Nano 33 BLE Sense microcontroller board. This microcontroller board is specifically designed for IoT and AI applications, making it an ideal choice for our target use case scenarios. While most existing research focuses on a limited set of keywords, our model can process 23 different keywords, enabling complex commands.
zh
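A generic starting point for this kind of deployment is a small 1D CNN followed by post-training quantization. Everything below is illustrative: the layer sizes, the assumed input of 49 MFCC frames x 13 coefficients, and the use of Keras/TFLite are assumptions, since the paper builds its pipeline with Edge Impulse.

```python
import tensorflow as tf

# a small 1D CNN for keyword spotting over MFCC frames (shapes assumed)
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(49, 13)),            # 49 frames x 13 MFCCs
    tf.keras.layers.Conv1D(16, 3, activation="relu"),
    tf.keras.layers.MaxPooling1D(2),
    tf.keras.layers.Conv1D(32, 3, activation="relu"),
    tf.keras.layers.GlobalAveragePooling1D(),
    tf.keras.layers.Dense(23, activation="softmax"),  # 23 keywords
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")

# dynamic-range post-training quantization for a microcontroller target
converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
tflite_model = converter.convert()
open("kws_quant.tflite", "wb").write(tflite_model)
```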

[AI-36] HTN Plan Repair Algorithms Compared: Strengths and Weaknesses of Different Methods ICAPS2025

【Quick Read】: This paper compares, theoretically and empirically, three recent hierarchical plan repair algorithms (SHOPFixer, IPyHOPPER, and Rewrite) to shed light on the plan repair problem. The key insight is that the three algorithms correspond to three different definitions of plan repair, leading to differences in their search spaces, the repair problems they can solve, and the kinds of repairs they can make; understanding these distinctions matters when choosing a repair method for a given application. Building on the theory, the algorithms are evaluated empirically on a series of benchmark planning problems, yielding detailed insight into runtime repair performance and problem coverage as a function of algorithmic properties such as replanning, chronological backtracking, and backjumping over plan trees.

Link: https://arxiv.org/abs/2504.16209
Authors: Paul Zaidins, Robert P. Goldman, Ugur Kuter, Dana Nau, Mark Roberts
Institutions: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: 20 pages; 19 figures; To appear in the Proceedings for ICAPS 2025, the 35th International Conference on Automated Planning and Scheduling

Click to view abstract

Abstract:This paper provides theoretical and empirical comparisons of three recent hierarchical plan repair algorithms: SHOPFixer, IPyHOPPER, and Rewrite. Our theoretical results show that the three algorithms correspond to three different definitions of the plan repair problem, leading to differences in the algorithms' search spaces, the repair problems they can solve, and the kinds of repairs they can make. Understanding these distinctions is important when choosing a repair method for any given application. Building on the theoretical results, we evaluate the algorithms empirically in a series of benchmark planning problems. Our empirical results provide more detailed insight into the runtime repair performance of these systems and the coverage of the repair problems solved, based on algorithmic properties such as replanning, chronological backtracking, and backjumping over plan trees.
zh

[AI-37] Quality of explanation of xAI from the perspective of Italian end-users: Italian version of System Causability Scale (SCS)

【Quick Read】: This paper validates the Italian version of the System Causability Scale (I-SCS) for measuring the quality of explanations provided by explainable AI (xAI). Starting from the English version originally provided in 2020 in coordination with its main developer, the authors applied the forward-backward translation method to ensure accuracy, then completed nine steps including computing the content validity index/ratio and conducting cognitive interviews with representative end users; one question that failed content validity (CVR below 0.49) was removed entirely. The resulting Italian version contains 9 questions, whose meaning and content were fully understood by the sampled end users. The key lies in the rigorous translation and validation process, which ensures cultural adaptation and validity and yields a reliable tool for assessing explanation quality in xAI systems.

Link: https://arxiv.org/abs/2504.16193
Authors: Carmine Attanasio, Alireza Mortezapour
Institutions: Unknown
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
Comments: This work will be presented in Coperman 2025 Conference

Click to view abstract

Abstract:Background and aim: Considering the scope of the application of artificial intelligence beyond the field of computer science, one of the concerns of researchers is to provide quality explanations about the functioning of algorithms based on artificial intelligence and the data extracted from it. The purpose of the present study is to validate the Italian version of system causability scale (I-SCS) to measure the quality of explanations provided in a xAI. Method: For this purpose, the English version, initially provided in 2020 in coordination with the main developer, was utilized. The forward-backward translation method was applied to ensure accuracy. Finally, these nine steps were completed by calculating the content validity index/ratio and conducting cognitive interviews with representative end users. Results: The original version of the questionnaire consisted of 10 questions. However, based on the obtained indexes (CVR below 0.49), one question (Question 8) was entirely removed. After completing the aforementioned steps, the Italian version contained 9 questions. The representative sample of Italian end users fully comprehended the meaning and content of the questions in the Italian version. Conclusion: The Italian version obtained in this study can be used in future research studies as well as in the field by xAI developers. This tool can be used to measure the quality of explanations provided for an xAI system in Italian culture.
zh

[AI-38] FPGA-Based Neural Network Accelerators for Space Applications: A Survey

【Quick Read】: This paper addresses the gap between the increasing demand of space missions for high-performance onboard spacecraft computing and current technical capabilities. It surveys neural network (NN) accelerators implemented on field-programmable gate arrays (FPGAs) to serve space-mission needs such as autonomous operations, sensor data analysis, and data compression. The key lies in exploiting the flexibility, cost-effectiveness, and radiation-tolerance potential of FPGAs together with the strength of NNs on complex tasks, thereby enhancing onboard computing performance and pointing to future research directions in the field.

Link: https://arxiv.org/abs/2504.16173
Authors: Pedro Antunes, Artur Podobas
Institutions: Unknown
Categories: Hardware Architecture (cs.AR); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Space missions are becoming increasingly ambitious, necessitating high-performance onboard spacecraft computing systems. In response, field-programmable gate arrays (FPGAs) have garnered significant interest due to their flexibility, cost-effectiveness, and radiation tolerance potential. Concurrently, neural networks (NNs) are being recognized for their capability to execute space mission tasks such as autonomous operations, sensor data analysis, and data compression. This survey serves as a valuable resource for researchers aiming to implement FPGA-based NN accelerators in space applications. By analyzing existing literature, identifying trends and gaps, and proposing future research directions, this work highlights the potential of these accelerators to enhance onboard computing systems.
zh

[AI-39] Physics-Informed Inference Time Scaling via Simulation-Calibrated Scientific Machine Learning

【Quick Read】: This paper tackles the computational challenges of high-dimensional partial differential equations (PDEs), which arise across quantum chemistry, economics, and finance. Existing scientific machine learning (SciML) methods provide approximate solutions but are often biased and neglect important physical insight. The key is Simulation-Calibrated Scientific Machine Learning (SCaSML), a framework that dynamically corrects and debiases SciML predictions at inference time by enforcing physical laws. Its core innovations are newly derived physical laws that quantify systematic error, and Monte Carlo solvers based on the Feynman-Kac and Elworthy-Bismut-Li formulas that dynamically correct the predictions. Theoretical and numerical analyses confirm improved convergence rates via compute-optimal inference, and experiments show SCaSML reduces errors by 20%-50% relative to the base surrogate model, making it the first algorithm to refine approximate solutions of high-dimensional PDEs at inference time.

Link: https://arxiv.org/abs/2504.16172
Authors: Zexi Fan, Yan Sun, Shihao Yang, Yiping Lu
Institutions: Unknown
Categories: Numerical Analysis (math.NA); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Probability (math.PR); Machine Learning (stat.ML)
Comments:

Click to view abstract

Abstract:High-dimensional partial differential equations (PDEs) pose significant computational challenges across fields ranging from quantum chemistry to economics and finance. Although scientific machine learning (SciML) techniques offer approximate solutions, they often suffer from bias and neglect crucial physical insights. Inspired by inference-time scaling strategies in language models, we propose Simulation-Calibrated Scientific Machine Learning (SCaSML), a physics-informed framework that dynamically refines and debiases the SCiML predictions during inference by enforcing the physical laws. SCaSML leverages derived new physical laws that quantifies systematic errors and employs Monte Carlo solvers based on the Feynman-Kac and Elworthy-Bismut-Li formulas to dynamically correct the prediction. Both numerical and theoretical analysis confirms enhanced convergence rates via compute-optimal inference methods. Our numerical experiments demonstrate that SCaSML reduces errors by 20-50% compared to the base surrogate model, establishing it as the first algorithm to refine approximated solutions to high-dimensional PDE during inference. Code of SCaSML is available at this https URL.
zh
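For context, the Monte Carlo corrector builds on the classical Feynman-Kac representation, stated below in its standard textbook form (the paper's derived laws and the Elworthy-Bismut-Li formula go beyond this):

```latex
% Feynman-Kac: if \partial_t u + \mu \cdot \nabla u
%   + \tfrac{1}{2}\operatorname{Tr}\!\big(\sigma\sigma^{\top}\nabla^{2} u\big) + f = 0
% on [0,T] with terminal condition u(T,x) = g(x), then
u(t,x) = \mathbb{E}\left[\, g(X_T) + \int_t^T f(s, X_s)\,\mathrm{d}s \;\middle|\; X_t = x \right],
\quad \mathrm{d}X_s = \mu(s, X_s)\,\mathrm{d}s + \sigma(s, X_s)\,\mathrm{d}W_s .
```

Averaging simulated paths of X gives an unbiased estimate of u(t,x), which is what lets the framework calibrate a biased surrogate prediction at inference time.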

[AI-40] Leveraging Social Media Analytics for Sustainability Trend Detection in Saudi Arabia's Evolving Market

【Quick Read】: This paper addresses how to identify and monitor, in real time, emerging trends in Saudi Arabia across areas such as sustainability, in order to surface business and investment opportunities. The key is an AI-driven methodology that processes millions of social media posts, news items, and blog articles to extract sustainability trends in the region. The approach provides economists, businesses, and government with reliable sector-specific and cross-sector insights, helps decision makers track market shifts in time, and shows potential for adaptation to other regions. The findings highlight how AI methods let decision makers reliably gauge public perception and adoption of initiatives and track the growth of trends.

Link: https://arxiv.org/abs/2504.16153
Authors: Kanwal Aalijah
Institutions: Unknown
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
Comments: 9

Click to view abstract

Abstract:Saudi Arabias rapid economic growth and social evolution under Vision 2030 present a unique opportunity to track emerging trends in real time. Uncovering trends in real time can open up new avenues for business and investment opportunities. This paper explores how AI and social media analytics can uncover and monitor these trends across sectors like sustainability, construction, food beverages industry, tourism, technology, and entertainment. This paper focus on use of AI-driven methodology to identify sustainability trends across Saudi Arabia. We processed millions of social media posts, news, blogs in order to understand sustainability trends in the region. The paper presents an AI approach that can help economists, businesses, government to understand sustainability trends and make better decisions around them. This approach offers both sector-specific and cross-sector insights, giving decision-makers a reliable, up to date snapshot of Saudi Arabias market shifts. Beyond Saudi Arabia, this framework also shows potential for adapting to other regions. Overall, our findings highlight how by using AI-methodologies, give decision makers a reliable method to understand how initiatives are perceived and adopted by the public and understand growth of trends.
zh

[AI-41] Towards responsible AI for education: Hybrid human-AI to confront the Elephant in the room

【Quick Read】: This paper addresses nine persistent, critical issues around fairness, transparency, and effectiveness in AI methods and applications for education. These include: ambiguity about what AI for education actually means; neglect of learning processes such as motivation and metacognition; limited integration of domain knowledge and stakeholder involvement; inappropriate use of non-sequential models on temporal data; misuse of evaluation metrics; unreliable explainable AI methods; missing ethical guidance; the lack of systematic benchmarking; and insufficient attention to personalized, student-specific recommendations. The key is the proposal of hybrid AI methods, specifically neural-symbolic AI, to address these challenges and lay the foundations for responsible, trustworthy AI systems in education.

Link: https://arxiv.org/abs/2504.16148
Authors: Danial Hooshyar, Gustav Šír, Yeongwook Yang, Eve Kikas, Raija Hämäläinen, Tommi Kärkkäinen, Dragan Gašević, Roger Azevedo
Institutions: Unknown
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Despite significant advancements in AI-driven educational systems and ongoing calls for responsible AI for education, several critical issues remain unresolved – acting as the elephant in the room within AI in education, learning analytics, educational data mining, learning sciences, and educational psychology communities. This critical analysis identifies and examines nine persistent challenges that continue to undermine the fairness, transparency, and effectiveness of current AI methods and applications in education. These include: (1) the lack of clarity around what AI for education truly means – often ignoring the distinct purposes, strengths, and limitations of different AI families – and the trend of equating it with domain-agnostic, company-driven large language models; (2) the widespread neglect of essential learning processes such as motivation, emotion, and (meta)cognition in AI-driven learner modelling and their contextual nature; (3) limited integration of domain knowledge and lack of stakeholder involvement in AI design and development; (4) continued use of non-sequential machine learning models on temporal educational data; (5) misuse of non-sequential metrics to evaluate sequential models; (6) use of unreliable explainable AI methods to provide explanations for black-box models; (7) ignoring ethical guidelines in addressing data inconsistencies during model training; (8) use of mainstream AI methods for pattern discovery and learning analytics without systematic benchmarking; and (9) overemphasis on global prescriptions while overlooking localised, student-specific recommendations. Supported by theoretical and empirical research, we demonstrate how hybrid AI methods – specifically neural-symbolic AI – can address the elephant in the room and serve as the foundation for responsible, trustworthy AI systems in education.

[AI-42] Detecting Actionable Requests and Offers on Social Media During Crises Using LLMs

【Quick Read】: This paper addresses the difficulty of efficiently organizing and classifying the flood of social media information during natural disasters to support humanitarian response. The key contribution is a fine-grained hierarchical taxonomy that systematically structures crisis-related requests and offers along three critical dimensions (supplies, emergency personnel, and actions), combined with Query-Specific Few-shot Learning (QSF Learning), which retrieves class-relevant labeled examples from an embedding database to improve the performance of large language models (LLMs) in detecting and classifying posts. The approach also assesses the actionability of messages so that posts requiring immediate attention can be prioritized. Experiments show that the method outperforms baseline prompting strategies, effectively identifying and prioritizing actionable requests and offers of assistance.

Link: https://arxiv.org/abs/2504.16144
Authors: Ahmed El Fekih Zguir, Ferda Ofli, Muhammad Imran
Institution: Unknown
Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
Comments:

Abstract:Natural disasters often result in a surge of social media activity, including requests for assistance, offers of help, sentiments, and general updates. To enable humanitarian organizations to respond more efficiently, we propose a fine-grained hierarchical taxonomy to systematically organize crisis-related information about requests and offers into three critical dimensions: supplies, emergency personnel, and actions. Leveraging the capabilities of Large Language Models (LLMs), we introduce Query-Specific Few-shot Learning (QSF Learning) that retrieves class-specific labeled examples from an embedding database to enhance the model’s performance in detecting and classifying posts. Beyond classification, we assess the actionability of messages to prioritize posts requiring immediate attention. Extensive experiments demonstrate that our approach outperforms baseline prompting strategies, effectively identifying and prioritizing actionable requests and offers.
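
To make the retrieval step concrete, below is a minimal sketch of QSF-style few-shot prompt construction. TF-IDF vectors stand in for the learned embedding database, and the example posts, labels, and helper names are invented:

```python
# A minimal sketch of retrieval-augmented few-shot prompting in the spirit of
# QSF Learning, using TF-IDF vectors as a stand-in for learned embeddings.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import NearestNeighbors

# Hypothetical labeled crisis posts (the "embedding database").
examples = [
    ("We need bottled water and blankets at the shelter", "request-supplies"),
    ("Offering my truck to move supplies this weekend", "offer-actions"),
    ("Volunteer nurses available near the stadium", "offer-personnel"),
    ("Requesting paramedics at 5th and Main", "request-personnel"),
]
texts, labels = zip(*examples)

vectorizer = TfidfVectorizer().fit(texts)
index = NearestNeighbors(n_neighbors=2, metric="cosine").fit(
    vectorizer.transform(texts))

def build_few_shot_prompt(query: str) -> str:
    """Retrieve the nearest labeled examples and format a few-shot prompt."""
    _, idx = index.kneighbors(vectorizer.transform([query]))
    shots = [f"Post: {texts[i]}\nLabel: {labels[i]}" for i in idx[0]]
    return "\n\n".join(shots) + f"\n\nPost: {query}\nLabel:"

print(build_few_shot_prompt("Does anyone have spare water bottles?"))
```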

[AI-43] SparseJEPA: Sparse Representation Learning of Joint Embedding Predictive Architectures

【Quick Read】: This paper addresses the limited interpretability and the inefficiency caused by dense embedding representations in Joint Embedding Predictive Architectures (JEPA) when learning general-purpose representations. The key idea of the proposed extension, SparseJEPA, is to integrate sparse representation learning into the JEPA framework: a penalty term encourages latent variables to be shared among data features with strong semantic relationships while maintaining predictive performance. A theoretical proof further shows that this grouping mechanism improves representation quality by reducing the Multiinformation among latent variables. The result is a latent space that is not only refined but also supports more meaningful and interpretable representations.

Link: https://arxiv.org/abs/2504.16140
Authors: Max Hartman, Lav Varshney
Institution: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:Joint Embedding Predictive Architectures (JEPA) have emerged as a powerful framework for learning general-purpose representations. However, these models often lack interpretability and suffer from inefficiencies due to dense embedding representations. We propose SparseJEPA, an extension that integrates sparse representation learning into the JEPA framework to enhance the quality of learned representations. SparseJEPA employs a penalty method that encourages latent space variables to be shared among data features with strong semantic relationships, while maintaining predictive performance. We demonstrate the effectiveness of SparseJEPA by training on the CIFAR-100 dataset and pre-training a lightweight Vision Transformer. The improved embeddings are utilized in linear-probe transfer learning for both image classification and low-level tasks, showcasing the architecture’s versatility across different transfer tasks. Furthermore, we provide a theoretical proof that the grouping mechanism enhances representation quality, by showing that grouping reduces the Multiinformation among latent variables, including proving the Data Processing Inequality for Multiinformation. Our results indicate that incorporating sparsity not only refines the latent space but also facilitates the learning of more meaningful and interpretable representations. In future work, we hope to extend this method by finding new ways to leverage the grouping mechanism through object-centric representation learning.
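
The exact form of SparseJEPA's penalty is not given here, so the sketch below uses a standard group-lasso term as an assumed stand-in to illustrate the general mechanism of encouraging groups of latent dimensions to activate or switch off together:

```python
# A minimal sketch of a group-sparsity penalty on latent variables, assuming a
# simple L2-over-groups (group lasso) form; SparseJEPA's actual penalty may differ.
import torch

def group_sparsity_penalty(z: torch.Tensor, group_size: int) -> torch.Tensor:
    """Mean of L2 norms over contiguous latent groups: whole groups of latent
    dimensions are encouraged to switch off together."""
    b, d = z.shape
    groups = z.view(b, d // group_size, group_size)
    return groups.norm(dim=-1).mean()

# Toy usage: combine a JEPA-style prediction loss with the penalty.
pred, target = torch.randn(8, 64), torch.randn(8, 64)
latent = torch.randn(8, 64, requires_grad=True)
loss = torch.nn.functional.mse_loss(pred, target) \
       + 0.1 * group_sparsity_penalty(latent, group_size=8)
loss.backward()
print(float(loss))
```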

[AI-44] Enhancing Trust Through Standards: A Comparative Risk-Impact Framework for Aligning ISO AI Standards with Global Ethical and Regulatory Contexts

【Quick Read】: This paper addresses the inconsistency of ethical risk governance for artificial intelligence (AI) across regulatory environments, in particular the limited effectiveness of International Organization for Standardization (ISO) AI standards in mitigating these risks. The key contribution is a novel Comparative Risk-Impact Assessment Framework for evaluating how well ISO standards mitigate ethical risks under the EU AI Act and across ten other regions (including the UK, Canada, and China), revealing their shortcomings. Case studies show that voluntary ISO standards suffer from weak enforcement (e.g., in Colorado, US) and undervalue region-specific risks such as privacy (in China). The paper therefore recommends mandatory risk audits, region-specific annexes, and a privacy-focused module to strengthen the adaptability and global applicability of ISO standards. This approach not only synthesizes global trends but also offers a replicable tool for aligning standardization with ethical imperatives, fostering AI interoperability and trust worldwide.

Link: https://arxiv.org/abs/2504.16139
Authors: Sridharan Sankaran
Institution: Unknown
Subjects: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
Comments:

Abstract:As artificial intelligence (AI) reshapes industries and societies, ensuring its trustworthiness-through mitigating ethical risks like bias, opacity, and accountability deficits-remains a global challenge. International Organization for Standardization (ISO) AI standards, such as ISO/IEC 24027 and 24368, aim to foster responsible development by embedding fairness, transparency, and risk management into AI systems. However, their effectiveness varies across diverse regulatory landscapes, from the EU’s risk-based AI Act to China’s stability-focused measures and the U.S.'s fragmented state-led initiatives. This paper introduces a novel Comparative Risk-Impact Assessment Framework to evaluate how well ISO standards address ethical risks within these contexts, proposing enhancements to strengthen their global applicability. By mapping ISO standards to the EU AI Act and surveying regulatory frameworks in ten regions-including the UK, Canada, India, Japan, Singapore, South Korea, and Brazil-we establish a baseline for ethical alignment. The framework, applied to case studies in the EU, US-Colorado, and China, reveals gaps: voluntary ISO standards falter in enforcement (e.g., Colorado) and undervalue region-specific risks like privacy (China). We recommend mandatory risk audits, region-specific annexes, and a privacy-focused module to enhance ISO’s adaptability. This approach not only synthesizes global trends but also offers a replicable tool for aligning standardization with ethical imperatives, fostering interoperability and trust in AI worldwide. Policymakers and standards bodies can leverage these insights to evolve AI governance, ensuring it meets diverse societal needs as the technology advances.

[AI-45] Trends in Frontier AI Model Count: A Forecast to 2028

【Quick Read】: This paper asks how many AI models will be captured over the coming years by thresholds based on training compute (FLOP). The key contribution is a forecast of the number of foundation models exceeding specific FLOP thresholds at different points in time, showing that the number of models captured by such absolute thresholds will grow superlinearly over time, whereas relative thresholds defined with respect to the largest training run to date show a much more stable capture trend. These findings give policymakers useful reference points for setting sensible, dynamic regulatory thresholds for AI models.

Link: https://arxiv.org/abs/2504.16138
Authors: Iyngkarran Kumar, Sam Manning
Institution: Unknown
Subjects: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
Comments:

Abstract:Governments are starting to impose requirements on AI models based on how much compute was used to train them. For example, the EU AI Act imposes requirements on providers of general-purpose AI with systemic risk, which includes systems trained using greater than 10^25 floating point operations (FLOP). In the United States’ AI Diffusion Framework, a training compute threshold of 10^26 FLOP is used to identify “controlled models” which face a number of requirements. We explore how many models such training compute thresholds will capture over time. We estimate that by the end of 2028, there will be between 103-306 foundation models exceeding the 10^25 FLOP threshold put forward in the EU AI Act (90% CI), and 45-148 models exceeding the 10^26 FLOP threshold that defines controlled models in the AI Diffusion Framework (90% CI). We also find that the number of models exceeding these absolute compute thresholds each year will increase superlinearly – that is, each successive year will see more new models captured within the threshold than the year before. Thresholds that are defined with respect to the largest training run to date (for example, such that all models within one order of magnitude of the largest training run to date are captured by the threshold) see a more stable trend, with a median forecast of 14-16 models being captured by this definition annually from 2025-2028.
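
The contrast between absolute and relative thresholds can be made concrete with a toy computation; all training-compute figures below are invented:

```python
# A toy illustration of absolute vs. relative compute thresholds, with made-up
# training-compute figures (FLOP) for hypothetical models.
models = {"model_a": 3e25, "model_b": 8e24, "model_c": 2e26, "model_d": 5e25}

absolute_threshold = 1e25  # e.g., the EU AI Act's systemic-risk threshold
captured_abs = [m for m, flop in models.items() if flop > absolute_threshold]

largest = max(models.values())
relative_threshold = largest / 10  # within one order of magnitude of the frontier
captured_rel = [m for m, flop in models.items() if flop >= relative_threshold]

print("absolute:", captured_abs)  # grows superlinearly as the frontier advances
print("relative:", captured_rel)  # tracks the frontier, so capture stays stabler
```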

[AI-46] A Conceptual Framework for AI-based Decision Systems in Critical Infrastructures

【Quick Read】: This paper addresses the unique challenges of human-AI interaction in safety-critical systems, which existing frameworks only partially cover. These challenges stem from the complex interplay between requirements for transparency, trust, and explainability and the necessity of robust and safe decision-making. The key to the solution is a holistic conceptual framework that integrates human and AI capabilities through an interdisciplinary approach, combining traditionally distinct fields such as mathematics, decision theory, computer science, philosophy, psychology, and cognitive engineering, and drawing on specialized engineering domains including energy, mobility, and aeronautics. The framework's flexibility is further demonstrated by instantiating it on an existing framework.

Link: https://arxiv.org/abs/2504.16133
Authors: Milad Leyli-abadi, Ricardo J. Bessa, Jan Viebahn, Daniel Boos, Clark Borst, Alberto Castagna, Ricardo Chavarriaga, Mohamed Hassouna, Bruno Lemetayer, Giulia Leto, Antoine Marot, Maroua Meddeb, Manuel Meyer, Viola Schiaffonati, Manuel Schneider, Toni Waefler
Institution: Unknown
Subjects: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
Comments:

Abstract:The interaction between humans and AI in safety-critical systems presents a unique set of challenges that remain partially addressed by existing frameworks. These challenges stem from the complex interplay of requirements for transparency, trust, and explainability, coupled with the necessity for robust and safe decision-making. A framework that holistically integrates human and AI capabilities while addressing these concerns is notably required, bridging the critical gaps in designing, deploying, and maintaining safe and effective systems. This paper proposes a holistic conceptual framework for critical infrastructures by adopting an interdisciplinary approach. It integrates traditionally distinct fields such as mathematics, decision theory, computer science, philosophy, psychology, and cognitive engineering and draws on specialized engineering domains, particularly energy, mobility, and aeronautics. The flexibility in its adoption is also demonstrated through its instantiation on an already existing framework.

[AI-47] Efficacy of a Computer Tutor that Models Expert Human Tutors

【Quick Read】: This paper investigates the role of expertise in tutoring effectiveness and how an intelligent tutoring system (ITS) compares with human tutors. The key to the solution is a 9-week learning efficacy study comparing three conditions (an ITS modeled on expert human tutors, human tutors with domain expertise but no tutoring expertise, and a no-tutoring control) and measuring their effects on students' immediate and delayed learning test scores. The data are analyzed with logistic mixed-effects models, revealing significant positive effects of both the ITS and the human tutors on immediate and delayed tests and providing empirical evidence on the role of expertise in tutoring.

Link: https://arxiv.org/abs/2504.16132
Authors: Andrew M. Olney, Sidney K. D’Mello, Natalie Person, Whitney Cade, Patrick Hays, Claire W. Dempsey, Blair Lehman, Betsy Williams, Art Graesser
Institution: Unknown
Subjects: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
Comments: Shortened version of this paper has been accepted to AIED 2025

Abstract:Tutoring is highly effective for promoting learning. However, the contribution of expertise to tutoring effectiveness is unclear and continues to be debated. We conducted a 9-week learning efficacy study of an intelligent tutoring system (ITS) for biology modeled on expert human tutors with two control conditions: human tutors who were experts in the domain but not in tutoring and a no-tutoring condition. All conditions were supplemental to classroom instruction, and students took learning tests immediately before and after tutoring sessions as well as delayed tests 1-2 weeks later. Analysis using logistic mixed-effects modeling indicates significant positive effects on the immediate post-test for the ITS (d =.71) and human tutors (d =.66) which are in the 99th percentile of meta-analytic effects, as well as significant positive effects on the delayed post-test for the ITS (d =.36) and human tutors (d =.39). We discuss implications for the role of expertise in tutoring and the design of future studies.

[AI-48] MARFT: Multi-Agent Reinforcement Fine-Tuning

【Quick Read】: This paper addresses the lack of research on reinforcement learning (RL) fine-tuning of LLM-based multi-agent systems (LaMAS) and the challenges of directly applying multi-agent reinforcement learning (MARL) methods to LaMAS. The key contribution is a new paradigm, Multi-Agent Reinforcement Fine-Tuning (MARFT): a universal algorithmic framework tailored to LaMAS that lays out its conceptual foundations, its key distinctions from MARL, and practical implementation strategies. Central to the work is a robust and scalable MARFT framework with a complete open-source implementation, intended to foster adoption and further research and to advance resilient, adaptive solutions for complex agentic tasks.

Link: https://arxiv.org/abs/2504.16129
Authors: Junwei Liao, Muning Wen, Jun Wang, Weinan Zhang
Institution: Unknown
Subjects: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Robotics (cs.RO)
Comments: 36 pages

Abstract:LLM-based Multi-Agent Systems have demonstrated remarkable capabilities in addressing complex, agentic tasks requiring multifaceted reasoning and collaboration, from generating high-quality presentation slides to conducting sophisticated scientific research. Meanwhile, RL has been widely recognized for its effectiveness in enhancing agent intelligence, but limited research has investigated the fine-tuning of LaMAS using foundational RL techniques. Moreover, the direct application of MARL methodologies to LaMAS introduces significant challenges, stemming from the unique characteristics and mechanisms inherent to LaMAS. To address these challenges, this article presents a comprehensive study of LLM-based MARL and proposes a novel paradigm termed Multi-Agent Reinforcement Fine-Tuning (MARFT). We introduce a universal algorithmic framework tailored for LaMAS, outlining the conceptual foundations, key distinctions, and practical implementation strategies. We begin by reviewing the evolution from RL to Reinforcement Fine-Tuning, setting the stage for a parallel analysis in the multi-agent domain. In the context of LaMAS, we elucidate critical differences between MARL and MARFT. These differences motivate a transition toward a novel, LaMAS-oriented formulation of RFT. Central to this work is the presentation of a robust and scalable MARFT framework. We detail the core algorithm and provide a complete, open-source implementation to facilitate adoption and further research. The latter sections of the paper explore real-world application perspectives and opening challenges in MARFT. By bridging theoretical underpinnings with practical methodologies, this work aims to serve as a roadmap for researchers seeking to advance MARFT toward resilient and adaptive solutions in agentic systems. Our implementation of the proposed framework is publicly available at: this https URL.

[AI-49] SOTOPIA-S4: a user-friendly system for flexible, customizable, and large-scale social simulation

【Quick Read】: This paper aims to remove the technical barriers to social simulation with large language model (LLM) agents while giving researchers an efficient, flexible, and scalable system for generating multi-turn, multi-party LLM-based interactions with customizable evaluation metrics for hypothesis testing. The key contribution is SOTOPIA-S4, a pip package comprising a simulation engine, an API server with flexible RESTful APIs, and an intuitive web interface. The system lowers the entry barrier so that even non-technical users can design, run, and analyze simulations, and its usefulness is demonstrated through two concrete use cases: a dyadic hiring negotiation and a multi-party planning scenario.

Link: https://arxiv.org/abs/2504.16122
Authors: Xuhui Zhou, Zhe Su, Sophie Feng, Jiaxu Zhou, Jen-tse Huang, Hsien-Te Kao, Spencer Lynch, Svitlana Volkova, Tongshuang Sherry Wu, Anita Woolley, Hao Zhu, Maarten Sap
Institution: Unknown
Subjects: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
Comments: The first author and the second author contributed equally

Abstract:Social simulation through large language model (LLM) agents is a promising approach to explore and validate hypotheses related to social science questions and LLM agents behavior. We present SOTOPIA-S4, a fast, flexible, and scalable social simulation system that addresses the technical barriers of current frameworks while enabling practitioners to generate multi-turn and multi-party LLM-based interactions with customizable evaluation metrics for hypothesis testing. SOTOPIA-S4 comes as a pip package that contains a simulation engine, an API server with flexible RESTful APIs for simulation management, and a web interface that enables both technical and non-technical users to design, run, and analyze simulations without programming. We demonstrate the usefulness of SOTOPIA-S4 with two use cases involving dyadic hiring negotiation and multi-party planning scenarios.

[AI-50] A Data-Centric Approach for Safe and Secure Large Language Models against Threatening and Toxic Content

【Quick Read】: This paper addresses the potential biases and harmful content that large language models (LLMs) can produce in real-world use, proposing a practical solution for ensuring their safe and ethical deployment. The key component is a post-generation correction mechanism, the BART-Corrective Model, which adjusts generated content to ensure safety and compliance. Unlike approaches that rely solely on model fine-tuning or prompt engineering, this method offers a robust data-centric alternative for mitigating harmful content. Experiments across several toxicity datasets show significant reductions in mean toxicity and jail-breaking scores after integration.

Link: https://arxiv.org/abs/2504.16120
Authors: Chaima Njeh, Haïfa Nakouri, Fehmi Jaafar
Institution: Unknown
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments: This paper is under revision in the International Journal of Information Security

Abstract:Large Language Models (LLM) have made remarkable progress, but concerns about potential biases and harmful content persist. To address these apprehensions, we introduce a practical solution for ensuring LLM’s safe and ethical use. Our novel approach focuses on a post-generation correction mechanism, the BART-Corrective Model, which adjusts generated content to ensure safety and security. Unlike relying solely on model fine-tuning or prompt engineering, our method provides a robust data-centric alternative for mitigating harmful content. We demonstrate the effectiveness of our approach through experiments on multiple toxic datasets, which show a significant reduction in mean toxicity and jail-breaking scores after integration. Specifically, our results show a reduction of 15% and 21% in mean toxicity and jail-breaking scores with GPT-4, a substantial reduction of 28% and 5% with PaLM2, a reduction of approximately 26% and 23% with Mistral-7B, and a reduction of 11.1% and 19% with Gemma-2b-it. These results demonstrate the potential of our approach to improve the safety and security of LLM, making them more suitable for real-world applications.
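
A minimal sketch of the post-generation correction flow is shown below; the toy lexicon scorer and masking rewriter are invented stand-ins for a trained toxicity classifier and the BART-Corrective Model:

```python
# A minimal sketch of a post-generation correction pipeline. The toxicity
# scorer and corrective rewriter below are trivial stand-ins; the paper's
# BART-Corrective Model is a trained seq2seq model, not shown here.
BLOCKLIST = {"idiot", "stupid"}  # hypothetical toy lexicon

def toxicity_score(text: str) -> float:
    words = text.lower().split()
    return sum(w.strip(".,!?") in BLOCKLIST for w in words) / max(len(words), 1)

def corrective_rewrite(text: str) -> str:
    # Stand-in for a seq2seq corrector: mask flagged tokens.
    return " ".join("[removed]" if w.strip(".,!?").lower() in BLOCKLIST else w
                    for w in text.split())

def safe_generate(llm_output: str, threshold: float = 0.05) -> str:
    """Pass the raw generation through; rewrite only if it scores as toxic."""
    if toxicity_score(llm_output) > threshold:
        return corrective_rewrite(llm_output)
    return llm_output

print(safe_generate("You are an idiot for asking that."))
```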

[AI-51] Towards Explainable and Lightweight AI for Real-Time Cyber Threat Hunting in Edge Networks

【Quick Read】: This paper tackles two major challenges in real-time cyber threat detection for edge networks: 1) black-box AI models lack interpretability, limiting security analysts' understanding of the reasoning behind predictions; and 2) conventional deep learning techniques are hard to deploy on resource-constrained edge devices because of their high computational cost. The key to the proposed Explainable and Lightweight AI (ELAI) framework is combining interpretable machine learning algorithms with optimized lightweight deep learning techniques, preserving detection transparency while achieving computational efficiency. Concretely, ELAI leverages decision trees, attention-based deep learning, and federated learning to improve detection accuracy while remaining explainable. Experiments on standard cybersecurity datasets show high detection rates with few false positives and a substantial reduction in computational demands compared with traditional deep learning methods.

Link: https://arxiv.org/abs/2504.16118
Authors: Milad Rahmati
Institution: Unknown
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments:

Abstract:As cyber threats continue to evolve, securing edge networks has become increasingly challenging due to their distributed nature and resource limitations. Many AI-driven threat detection systems rely on complex deep learning models, which, despite their high accuracy, suffer from two major drawbacks: lack of interpretability and high computational cost. Black-box AI models make it difficult for security analysts to understand the reasoning behind their predictions, limiting their practical deployment. Moreover, conventional deep learning techniques demand significant computational resources, rendering them unsuitable for edge devices with limited processing power. To address these issues, this study introduces an Explainable and Lightweight AI (ELAI) framework designed for real-time cyber threat detection in edge networks. Our approach integrates interpretable machine learning algorithms with optimized lightweight deep learning techniques, ensuring both transparency and computational efficiency. The proposed system leverages decision trees, attention-based deep learning, and federated learning to enhance detection accuracy while maintaining explainability. We evaluate ELAI using benchmark cybersecurity datasets, such as CICIDS and UNSW-NB15, assessing its performance across diverse cyberattack scenarios. Experimental results demonstrate that the proposed framework achieves high detection rates with minimal false positives, all while significantly reducing computational demands compared to traditional deep learning methods. The key contributions of this work include: (1) a novel interpretable AI-based cybersecurity model tailored for edge computing environments, (2) an optimized lightweight deep learning approach for real-time cyber threat detection, and (3) a comprehensive analysis of explainability techniques in AI-driven cybersecurity applications.
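
To illustrate the interpretable, lightweight end of such a framework, this sketch trains a small decision tree on synthetic flow features standing in for CICIDS/UNSW-NB15 records; the feature names and distributions are invented:

```python
# A minimal sketch of an interpretable, edge-sized detector: a shallow decision
# tree whose whole logic is human-readable and fits in a few kilobytes.
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

rng = np.random.default_rng(0)
feature_names = ["duration", "bytes_sent", "packet_rate", "syn_flag_ratio"]
X_benign = rng.normal([1.0, 500, 10, 0.1], 0.3, size=(200, 4))
X_attack = rng.normal([0.2, 50, 200, 0.9], 0.3, size=(200, 4))
X = np.vstack([X_benign, X_attack])
y = np.array([0] * 200 + [1] * 200)

clf = DecisionTreeClassifier(max_depth=3).fit(X, y)
# Security analysts can audit the decision logic directly:
print(export_text(clf, feature_names=feature_names))
```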

[AI-52] DMind Benchmark: The First Comprehensive Benchmark for LLM Evaluation in the Web3 Domain

【Quick Read】: This paper addresses the limited effectiveness of large language models (LLMs) in specialized, fast-evolving domains such as Web3. The key contribution is DMind Benchmark, a framework that systematically tests LLMs across nine key categories covering blockchain fundamentals, infrastructure, smart contract analysis, decentralized finance (DeFi), decentralized autonomous organizations (DAOs), non-fungible tokens (NFTs), token economics, meme concepts, and security vulnerabilities. Beyond conventional multiple-choice questions, it includes domain-specific subjective tasks (such as smart contract code auditing and repair, numeric reasoning over on-chain data, and fill-in assessments) that capture real-world complexity and stress-test model adaptability. Evaluation on the benchmark reveals performance gaps in Web3-specific reasoning and application among existing LLMs, and the dataset, evaluation pipeline, and annotated results are publicly released to support further research and the development of more robust Web3-oriented LLMs.

Link: https://arxiv.org/abs/2504.16116
Authors: Miracle Master, Rainy Sun, Anya Reese, Joey Ouyang, Alex Chen, Winter Dong, Frank Li, James Yi, Garry Zhao, Tony Ling, Hobert Wong, Lowes Yang
Institution: Unknown
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments:

Abstract:Recent advances in Large Language Models (LLMs) have led to significant progress on a wide range of natural language processing tasks. However, their effectiveness in specialized and rapidly evolving domains such as Web3 remains underexplored. In this paper, we introduce DMind Benchmark, a novel framework that systematically tests LLMs across nine key categories encompassing blockchain fundamentals, infrastructure, smart contract analysis, decentralized finance (DeFi), decentralized autonomous organizations (DAOs), non-fungible tokens (NFTs), token economics, meme concepts, and security vulnerabilities. DMind Benchmark goes beyond conventional multiple-choice questions by incorporating domain-specific subjective tasks (e.g., smart contract code auditing and repair, numeric reasoning on on-chain data, and fill-in assessments), thereby capturing real-world complexities and stress-testing model adaptability. We evaluate fifteen popular LLMs (from ChatGPT, DeepSeek, Claude, and Gemini series) on DMind Benchmark, uncovering performance gaps in Web3-specific reasoning and application, particularly in emerging areas like token economics and meme concepts. Even the strongest models face significant challenges in identifying subtle security vulnerabilities and analyzing complex DeFi mechanisms. To foster progress in this area, we publicly release our benchmark dataset, evaluation pipeline, and annotated results at this http URL, offering a valuable resource for advancing specialized domain adaptation and the development of more robust Web3-enabled LLMs.

[AI-53] A Framework for Objective-Driven Dynamical Stochastic Fields

【Quick Read】: This paper addresses the lack of a formal theoretical description of complex dynamical stochastic systems that exhibit goal-directed behavior ("intelligent fields") and the difficulty of translating such descriptions into practical applications. The key contribution is a theoretical framework built on three fundamental principles (complete configuration, locality, and purposefulness) for understanding intelligent fields, together with methodologies for designing such fields from the perspective of artificial intelligence applications. This initial investigation aims to lay the groundwork for future theoretical developments and practical advances in understanding and harnessing such objective-driven dynamical stochastic systems.

Link: https://arxiv.org/abs/2504.16115
Authors: Yibo Jacky Zhang, Sanmi Koyejo
Institution: Unknown
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multiagent Systems (cs.MA); Adaptation and Self-Organizing Systems (nlin.AO)
Comments:

Abstract:Fields offer a versatile approach for describing complex systems composed of interacting and dynamic components. In particular, some of these dynamical and stochastic systems may exhibit goal-directed behaviors aimed at achieving specific objectives, which we refer to as "intelligent fields". However, due to their inherent complexity, it remains challenging to develop a formal theoretical description of such systems and to effectively translate these descriptions into practical applications. In this paper, we propose three fundamental principles – complete configuration, locality, and purposefulness – to establish a theoretical framework for understanding intelligent fields. Moreover, we explore methodologies for designing such fields from the perspective of artificial intelligence applications. This initial investigation aims to lay the groundwork for future theoretical developments and practical advances in understanding and harnessing the potential of such objective-driven dynamical stochastic fields.

[AI-54] AI-Based Vulnerability Analysis of NFT Smart Contracts

【Quick Read】: This paper addresses the detection and classification of common defects in smart contract code. The study collects and categorizes a large number of smart contracts, identifying common defects including Risky Mutable Proxy, ERC-721 Reentrancy, Unlimited Mining, Missing Requirements, and Public Burns. The key to the solution is a Decision Tree based model, further refined with a Random Forest. Concretely, an initial decision tree is built via feature extraction, data splitting, and algorithm selection using a CART classification tree; the random forest is then constructed on top of the decision tree through sample selection, feature selection, parameter tuning, and model optimization. Finally, general conclusions are drawn by comparing the decision tree, the random forest, and the authors' own model.

Link: https://arxiv.org/abs/2504.16113
Authors: Xin Wang, Xiaoqi Li
Institution: Unknown
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments:

Abstract:In the research experiment of this article, our research work is divided into several stages. Firstly, we collected a large number of smart contract codes and classified them, identifying several common defects, including Risky Mutable Proxy, ERC-721 Reentrancy, Unlimited Mining, Missing Requirements, and Public Burns. Secondly, we used Python to process the smart contracts: on the one hand, we modified the file names, and on the other hand, we batch-processed the content for analysis and application. Next, we built the decision tree model: we carried out feature extraction, selected the algorithm, and divided the data, and after comparison we chose the CART classification tree. Using the Gini coefficient, we analyzed and sorted the data and obtained the initial decision tree model. Then, we introduced the random forest model on the basis of the decision tree, from sampling the same number of examples and selecting features, through adjusting and optimizing parameters, to completing the construction of the forest model. Finally, we compared and analyzed the decision tree, the random forest, and the self-built model in the paper and drew general conclusions.
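
A minimal sketch of the described pipeline with scikit-learn, using invented stand-in features for contract code:

```python
# A CART-style decision tree (Gini impurity) followed by a random forest, on
# synthetic stand-in features; real inputs would be patterns extracted from
# contract source or bytecode.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(42)
X = rng.random((500, 6))                    # e.g., counts of risky patterns
y = (X[:, 0] + X[:, 3] > 1.0).astype(int)   # toy "defective" label

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

tree = DecisionTreeClassifier(criterion="gini").fit(X_tr, y_tr)
forest = RandomForestClassifier(n_estimators=100, max_features="sqrt",
                                random_state=0).fit(X_tr, y_tr)

print("decision tree :", tree.score(X_te, y_te))
print("random forest :", forest.score(X_te, y_te))
```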

[AI-55] Security-First AI: Foundations for Robust and Trustworthy Systems

【Quick Read】: This paper argues that security, as a foundational safeguard, is underappreciated in the artificial intelligence (AI) field, and that AI security (protecting data, models, and pipelines from adversarial manipulation) is the fundamental prerequisite for goals such as safety, transparency, accountability, alignment, and responsibility. The key to the proposed solution is a security-first approach: a hierarchical view of AI challenges that distinguishes security from safety, together with a metric-driven methodology for building robust AI security, thereby enabling trustworthy and resilient AI systems.

Link: https://arxiv.org/abs/2504.16110
Authors: Krti Tallam
Institution: Unknown
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments:

Abstract:The conversation around artificial intelligence (AI) often focuses on safety, transparency, accountability, alignment, and responsibility. However, AI security (i.e., the safeguarding of data, models, and pipelines from adversarial manipulation) underpins all of these efforts. This manuscript posits that AI security must be prioritized as a foundational layer. We present a hierarchical view of AI challenges, distinguishing security from safety, and argue for a security-first approach to enable trustworthy and resilient AI systems. We discuss core threat models, key attack vectors, and emerging defense mechanisms, concluding that a metric-driven approach to AI security is essential for robust AI safety, transparency, and accountability.

[AI-56] Radiometer Calibration using Machine Learning

【Quick Read】: This paper addresses the signal reflections and distortions in radio-astronomy radiometers caused by impedance mismatches between the antenna and the receiver, and the limitations of traditional calibration methods (such as Dicke switching) for the high-precision detection of the cosmologically important sky-averaged 21-cm hydrogen line signal (z > 10). The key to the solution is a novel machine learning (ML) based calibration framework: neural networks trained on known signal sources are used to model and calibrate complex systems that are difficult to handle with traditional analytical approaches, achieving the precision required by radiometric experiments.

Link: https://arxiv.org/abs/2504.16791
Authors: S. A. K. Leeney, H. T. J. Bevins, E. de Lera Acedo, W. J. Handley, C. Kirkham, R. S. Patel, J. Zhu, D. Molnar, J. Cumner, D. Anstey, K. Artuc, G. Bernardi, M. Bucher, S. Carey, J. Cavillot, R. Chiello, W. Croukamp, D. I. L. de Villiers, J. A. Ely, A. Fialkov, T. Gessey-Jones, G. Kulkarni, A. Magro, P. D. Meerburg, S. Mittal, J. H. N. Pattison, S. Pegwal, C. M. Pieterse, J. R. Pritchard, E. Puchwein, N. Razavi-Ghods, I. L. V. Roque, A. Saxena, K. H. Scheutwinkel, P. Scott, E. Shen, P. H. Sims, M. Spinelli
Institution: Unknown
Subjects: Instrumentation and Methods for Astrophysics (astro-ph.IM); Cosmology and Nongalactic Astrophysics (astro-ph.CO); Artificial Intelligence (cs.AI)
Comments: Under peer review for publication in Nature Scientific Reports as part of the Radio Astronomy collection

Abstract:Radiometers are crucial instruments in radio astronomy, forming the primary component of nearly all radio telescopes. They measure the intensity of electromagnetic radiation, converting this radiation into electrical signals. A radiometer’s primary components are an antenna and a Low Noise Amplifier (LNA), which is the core of the "receiver" chain. Instrumental effects introduced by the receiver are typically corrected or removed during calibration. However, impedance mismatches between the antenna and receiver can introduce unwanted signal reflections and distortions. Traditional calibration methods, such as Dicke switching, alternate the receiver input between the antenna and a well-characterised reference source to mitigate errors by comparison. Recent advances in Machine Learning (ML) offer promising alternatives. Neural networks, which are trained using known signal sources, provide a powerful means to model and calibrate complex systems where traditional analytical approaches struggle. These methods are especially relevant for detecting the faint sky-averaged 21-cm signal from atomic hydrogen at high redshifts. This is one of the main challenges in observational Cosmology today. Here, for the first time, we introduce and test a machine learning-based calibration framework capable of achieving the precision required for radiometric experiments aiming to detect the 21-cm line.
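
As a rough illustration of the approach, the sketch below fits a small neural network to invert an invented, nonlinear receiver response on synthetic data; the real framework calibrates far more complex instrument systematics:

```python
# A toy NN-based calibration: learn the map from a distorted receiver readout
# back to the known input temperature. The "receiver" model here is invented.
import numpy as np
from sklearn.neural_network import MLPRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
T_in = rng.uniform(100, 10000, size=2000)            # known source temps (K)
readout = 2.3 * T_in + 150 + 0.02 * T_in**0.9 \
          + rng.normal(0, 5, size=T_in.size)         # distorted measurement

model = make_pipeline(StandardScaler(),
                      MLPRegressor(hidden_layer_sizes=(64, 64),
                                   max_iter=2000, random_state=0))
model.fit(readout.reshape(-1, 1), T_in)              # train the inverse map

T_est = model.predict(np.array([[2500.0]]))
print(f"calibrated temperature: {T_est[0]:.1f} K")
```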

[AI-57] The Dance of Atoms-De Novo Protein Design with Diffusion Model

【Quick Read】: This paper surveys de novo protein design, a field where traditional fragment- and bioinformatics-based approaches suffer from low efficiency, limited success rates, and high experimental cost. The key development it reviews is the use of diffusion models, a generative AI technique that has achieved breakthrough progress in generating protein backbones and sequences, substantially raising design success rates and lowering experimental costs. The representative model, RFDiffusion, has demonstrated success rates in 25 protein design tasks that far exceed those of traditional methods and other AI-based approaches such as RFjoint and hallucination. The review systematically examines these models, their strengths and limitations, successful design cases, and future directions.

Link: https://arxiv.org/abs/2504.16479
Authors: Yujie Qin, Ming He, Changyong Yu, Ming Ni, Xian Liu, Xiaochen Bo
Institution: Unknown
Subjects: Biomolecules (q-bio.BM); Artificial Intelligence (cs.AI)
Comments:

Abstract:The de novo design of proteins refers to creating proteins with specific structures and functions that do not naturally exist. In recent years, the accumulation of high-quality protein structure and sequence data and technological advancements have paved the way for the successful application of generative artificial intelligence (AI) models in protein design. These models have surpassed traditional approaches that rely on fragments and bioinformatics. They have significantly enhanced the success rate of de novo protein design, and reduced experimental costs, leading to breakthroughs in the field. Among various generative AI models, diffusion models have yielded the most promising results in protein design. In the past two to three years, more than ten protein design models based on diffusion models have emerged. Among them, the representative model, RFDiffusion, has demonstrated success rates in 25 protein design tasks that far exceed those of traditional methods, and other AI-based approaches like RFjoint and hallucination. This review will systematically examine the application of diffusion models in generating protein backbones and sequences. We will explore the strengths and limitations of different models, summarize successful cases of protein design using diffusion models, and discuss future development directions.

[AI-58] PINN-MEP: Continuous Neural Representations for Minimum-Energy Path Discovery in Molecular Systems

【Quick Read】: This paper addresses the difficulty of sampling conformational transition paths in physical systems: traditional methods such as molecular dynamics (MD) or Markov chain Monte Carlo (MCMC) are inefficient for high-dimensional molecular systems and for transitions over high energy barriers between stable states. Such rare events are infrequent on simulation timescales yet biologically crucial, for example the transition of an ion channel protein from its closed to its open state. Transitions in real systems may take milliseconds to seconds, while traditional methods might require months or years of continuous simulation to observe a single event. The key to the solution is to reformulate transition path generation as a continuous optimization problem solved with physics-informed neural networks (PINNs), inspired by string methods for finding minimum-energy paths (MEPs). Paths are represented as implicit neural functions, and automatic differentiation with differentiable molecular dynamics force fields enables the efficient discovery of physically realistic transition pathways without expensive path sampling. The method is validated on two proteins, including an explicitly hydrated bovine pancreatic trypsin inhibitor (BPTI) system with over 8,300 atoms.

Link: https://arxiv.org/abs/2504.16381
Authors: Magnus Petersen, Roberto Covino
Institution: Unknown
Subjects: Chemical Physics (physics.chem-ph); Artificial Intelligence (cs.AI); Computational Physics (physics.comp-ph)
Comments:

Abstract:Characterizing conformational transitions in physical systems remains a fundamental challenge in the computational sciences. Traditional sampling methods like molecular dynamics (MD) or MCMC often struggle with the high-dimensional nature of molecular systems and the high energy barriers of transitions between stable states. While these transitions are rare events in simulation timescales, they often represent the most biologically significant processes - for example, the conformational change of an ion channel protein from its closed to open state, which controls cellular ion flow and is crucial for neural signaling. Such transitions in real systems may take milliseconds to seconds but could require months or years of continuous simulation to observe even once. We present a method that reformulates transition path generation as a continuous optimization problem solved through physics-informed neural networks (PINNs) inspired by string methods for minimum-energy path (MEP) generation. By representing transition paths as implicit neural functions and leveraging automatic differentiation with differentiable molecular dynamics force fields, our method enables the efficient discovery of physically realistic transition pathways without requiring expensive path sampling. We demonstrate our method’s effectiveness on two proteins, including an explicitly hydrated bovine pancreatic trypsin inhibitor (BPTI) system with over 8,300 atoms.
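
The core idea can be sketched on a 2-D double-well potential: represent the path as an implicit neural function pinned to the two minima and minimize the energy along it by gradient descent. Force fields, reparameterization or spring terms, and all protein detail are omitted, so this is an assumed toy setup rather than the paper's method:

```python
# A toy PINN-style minimum-energy-path search on a 2-D double well.
import torch

def potential(xy):  # double well with minima near (-1, 0) and (1, 0)
    x, y = xy[:, 0], xy[:, 1]
    return (x**2 - 1)**2 + 2.0 * y**2

A = torch.tensor([-1.0, 0.0])   # start state
B = torch.tensor([1.0, 0.0])    # end state
net = torch.nn.Sequential(torch.nn.Linear(1, 64), torch.nn.Tanh(),
                          torch.nn.Linear(64, 2))

def path(t):  # hard-wires the boundary conditions path(0)=A, path(1)=B
    return (1 - t) * A + t * B + t * (1 - t) * net(t)

opt = torch.optim.Adam(net.parameters(), lr=1e-2)
t = torch.linspace(0, 1, 64).unsqueeze(1)
for step in range(500):
    opt.zero_grad()
    loss = potential(path(t)).mean()   # energy averaged along the path
    loss.backward()
    opt.step()

print("max energy along path:", float(potential(path(t)).max()))
```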

[AI-59] QAOA-GPT: Efficient Generation of Adaptive and Regular Quantum Approximate Optimization Algorithm Circuits

【Quick Read】: This paper targets the inefficiency of solving certain computationally hard optimization problems on classical computers by introducing a quantum-algorithmic route to potential speedups. The key contribution is QAOA-GPT, a generative framework based on Generative Pretrained Transformers (GPT) that directly synthesizes quantum circuits for quadratic unconstrained binary optimization problems, demonstrated on MaxCut. Its core innovation is the use of adaptive QAOA, a method that incrementally builds and optimizes problem-specific circuits, to generate a diverse, high-quality synthetic training dataset. Experiments show that QAOA-GPT generates high-quality quantum circuits for problem instances unseen during training and effectively parametrizes QAOA, substantially reducing the circuit-generation and parameter-optimization overhead of classical QAOA and its adaptive variants, and demonstrating the potential of generative AI for scalable generation of compact quantum circuits.

Link: https://arxiv.org/abs/2504.16350
Authors: Ilya Tyagin, Marwa H. Farag, Kyle Sherbert, Karunya Shirali, Yuri Alexeev, Ilya Safro
Institution: Unknown
Subjects: Quantum Physics (quant-ph); Artificial Intelligence (cs.AI)
Comments:

Abstract:Quantum computing has the potential to improve our ability to solve certain optimization problems that are computationally difficult for classical computers, by offering new algorithmic approaches that may provide speedups under specific conditions. In this work, we introduce QAOA-GPT, a generative framework that leverages Generative Pretrained Transformers (GPT) to directly synthesize quantum circuits for solving quadratic unconstrained binary optimization problems, and demonstrate it on the MaxCut problem on graphs. To diversify the training circuits and ensure their quality, we have generated a synthetic dataset using the adaptive QAOA approach, a method that incrementally builds and optimizes problem-specific circuits. The experiments conducted on a curated set of graph instances demonstrate that QAOA-GPT generates high-quality quantum circuits for new problem instances unseen in the training as well as successfully parametrizes QAOA. Our results show that using QAOA-GPT to generate quantum circuits will significantly decrease both the computational overhead of classical QAOA and adaptive approaches that often use gradient evaluation to generate the circuit and the classical optimization of the circuit parameters. Our work shows that generative AI could be a promising avenue to generate compact quantum circuits in a scalable way.

[AI-60] Heterogeneous networks in drug-target interaction prediction

【Quick Read】: This paper addresses the long, costly drug discovery process, focusing on computational drug-target interaction (DTI) prediction as a way to narrow the search space for wet-lab experiments. The key contribution is a survey of graph machine learning based methods for DTI prediction, which have shown promising results in this field. It details each method's overall framework, main contribution, datasets, and source code, introduces the datasets and metrics commonly used to assess performance, and discusses future challenges and important areas that warrant further exploration.

Link: https://arxiv.org/abs/2504.16152
Authors: Mohammad Molaee, Nasrollah Moghadam Charkari
Institution: Unknown
Subjects: Biomolecules (q-bio.BM); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 18 pages, 5 figures, 10 tables

Abstract:Drug discovery requires a tremendous amount of time and cost. Computational drug-target interaction prediction, a significant part of this process, can reduce these requirements by narrowing the search space for wet lab experiments. In this survey, we provide comprehensive details of graph machine learning-based methods in predicting drug-target interaction, as they have shown promising results in this field. These details include the overall framework, main contribution, datasets, and their source codes. The selected papers were mainly published from 2020 to 2024. Prior to discussing papers, we briefly introduce the datasets commonly used with these methods and measurements to assess their performance. Finally, future challenges and some crucial areas that need to be explored are discussed.

[AI-61] A Non-Invasive Load Monitoring Method for Edge Computing Based on MobileNetV3 and Dynamic Time Regulation

【Quick Read】: This paper addresses the high computational cost and large memory footprint that hinder the deployment of non-intrusive load monitoring (NILM) on resource-constrained microcontroller units (MCUs). The key to the solution is an innovative time-frequency domain Dynamic Time Warping (DTW) algorithm, together with a systematic comparison of six machine learning methods in home electricity scenarios. Full experimental validation on an edge MCU achieves 95% recognition accuracy, while deep optimization of the frequency-domain feature extraction reduces running time by 55.55% and storage overhead by about 34.6%. Future work will further optimize the algorithm and aims to eliminate the voltage transformer design to cut costs substantially, providing a more cost-effective solution for practical NILM deployment and laying theoretical and technical foundations for efficient NILM systems in edge computing environments.

Link: https://arxiv.org/abs/2504.16142
Authors: Hangxu Liu, Yaojie Sun, Yu Wang
Institution: Unknown
Subjects: Signal Processing (eess.SP); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:In recent years, non-intrusive load monitoring (NILM) technology has attracted much attention in the related research field by virtue of its unique advantage of utilizing single meter data to achieve accurate decomposition of device-level energy consumption. Cutting-edge methods based on machine learning and deep learning have achieved remarkable results in load decomposition accuracy by fusing time-frequency domain features. However, these methods generally suffer from high computational costs and huge memory requirements, which become the main obstacles for their deployment on resource-constrained microcontroller units (MCUs). To address these challenges, this study proposes an innovative Dynamic Time Warping (DTW) algorithm in the time-frequency domain and systematically compares and analyzes the performance of six machine learning techniques in home electricity scenarios. Through complete experimental validation on edge MCUs, this scheme successfully achieves a recognition accuracy of 95%. Meanwhile, this study deeply optimizes the frequency domain feature extraction process, which effectively reduces the running time by 55.55% and the storage overhead by about 34.6%. The algorithm performance will be further optimized in future research work. Considering that the elimination of voltage transformer design can significantly reduce the cost, the subsequent research will focus on this direction, and is committed to providing more cost-effective solutions for the practical application of NILM, and providing a solid theoretical foundation and feasible technical paths for the design of efficient NILM systems in edge computing environments.
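
For reference, plain time-domain DTW between two toy load signatures looks as follows; the paper's time-frequency variant adds feature extraction that is not reproduced here:

```python
# Classic dynamic time warping between two 1-D appliance load signatures.
import numpy as np

def dtw_distance(a: np.ndarray, b: np.ndarray) -> float:
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)   # accumulated-cost matrix
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

kettle = np.array([0, 0, 2.0, 2.1, 2.0, 0, 0])    # toy power traces (kW)
measured = np.array([0, 2.0, 2.0, 2.1, 1.9, 0])
print("DTW distance:", dtw_distance(kettle, measured))
```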

[AI-62] Introduction to Quantum Machine Learning and Quantum Architecture Search ISCAS2025

【Quick Read】: This tutorial explores the integration of quantum computing (QC) and machine learning (ML), in particular how quantum machine learning (QML) can enhance the performance of ML algorithms. It also focuses on systematic, automated approaches to designing high-performance quantum circuit architectures for QML tasks, lowering the barrier so that researchers outside the quantum computing domain can effectively use quantum-enhanced tools. The key contribution is an in-depth overview of recent breakthroughs in both areas, highlighting their potential to expand the application landscape of QML across diverse fields.

Link: https://arxiv.org/abs/2504.16131
Authors: Samuel Yen-Chi Chen, Zhiding Liang
Institution: Unknown
Subjects: Quantum Physics (quant-ph); Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET); Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)
Comments: ISCAS 2025 Tutorial

Abstract:Recent advancements in quantum computing (QC) and machine learning (ML) have fueled significant research efforts aimed at integrating these two transformative technologies. Quantum machine learning (QML), an emerging interdisciplinary field, leverages quantum principles to enhance the performance of ML algorithms. Concurrently, the exploration of systematic and automated approaches for designing high-performance quantum circuit architectures for QML tasks has gained prominence, as these methods empower researchers outside the quantum computing domain to effectively utilize quantum-enhanced tools. This tutorial will provide an in-depth overview of recent breakthroughs in both areas, highlighting their potential to expand the application landscape of QML across diverse fields.

[AI-63] A Self-supervised Learning Method for Raman Spectroscopy based on Masked Autoencoders

【Quick Read】: This paper addresses the degraded performance of supervised learning in Raman spectral analysis when annotated data are limited or hard to obtain. The key to the solution is SMAE, a self-supervised learning paradigm based on a Masked AutoEncoder (MAE): by randomly masking and then reconstructing spectral information, the model learns essential spectral features without any spectral annotation during pre-training. The reconstruction exhibits denoising properties, improving the signal-to-noise ratio (SNR) by more than twofold; clustering accuracy significantly exceeds that of classical unsupervised methods and recent deep clustering methods; and after fine-tuning with a small amount of annotated data, SMAE reaches identification accuracy competitive with a supervised ResNet.

Link: https://arxiv.org/abs/2504.16130
Authors: Pengju Ren, Ri-gui Zhou, Yaochong Li
Institution: Unknown
Subjects: Signal Processing (eess.SP); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 15 pages, 10 figures

Abstract:Raman spectroscopy serves as a powerful and reliable tool for analyzing the chemical information of substances. The integration of Raman spectroscopy with deep learning methods enables rapid qualitative and quantitative analysis of materials. Most existing approaches adopt supervised learning methods. Although supervised learning has achieved satisfactory accuracy in spectral analysis, it is still constrained by costly and limited well-annotated spectral datasets for training. When spectral annotation is challenging or the amount of annotated data is insufficient, the performance of supervised learning in spectral material identification declines. In order to address the challenge of feature extraction from unannotated spectra, we propose a self-supervised learning paradigm for Raman Spectroscopy based on a Masked AutoEncoder, termed SMAE. SMAE does not require any spectral annotations during pre-training. By randomly masking and then reconstructing the spectral information, the model learns essential spectral features. The reconstructed spectra exhibit certain denoising properties, improving the signal-to-noise ratio (SNR) by more than twofold. Utilizing the network weights obtained from masked pre-training, SMAE achieves clustering accuracy of over 80% for 30 classes of isolated bacteria in a pathogenic bacterial dataset, demonstrating significant improvements compared to classical unsupervised methods and other state-of-the-art deep clustering methods. After fine-tuning the network with a limited amount of annotated data, SMAE achieves an identification accuracy of 83.90% on the test set, presenting competitive performance against the supervised ResNet (83.40%).
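
A minimal sketch of masked-autoencoder pre-training on 1-D spectra: random channels are hidden and the network is trained to reconstruct them. The data, masking ratio, and architecture below are illustrative, not SMAE's:

```python
# Toy masked-autoencoder pre-training on synthetic 1-D "spectra".
import torch

D, mask_ratio = 256, 0.5
model = torch.nn.Sequential(torch.nn.Linear(D, 128), torch.nn.ReLU(),
                            torch.nn.Linear(128, D))
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

for step in range(200):
    spectra = torch.rand(32, D).cumsum(-1)          # smooth-ish toy spectra
    spectra = spectra / spectra.max(-1, keepdim=True).values
    mask = torch.rand(32, D) < mask_ratio           # True = hidden channel
    recon = model(spectra * ~mask)                  # encode the visible part
    loss = ((recon - spectra)[mask] ** 2).mean()    # score masked channels only
    opt.zero_grad()
    loss.backward()
    opt.step()

print("final masked-reconstruction MSE:", float(loss))
```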

[AI-64] xLSTM-ECG: Multi-label ECG Classification via Feature Fusion with xLSTM

【Quick Read】: This paper addresses the time-consuming and error-prone manual interpretation of electrocardiograms (ECGs) in the diagnosis of cardiovascular diseases (CVDs). The key to the solution is xLSTM-ECG, a method that applies an extended Long Short-Term Memory (xLSTM) network to multi-label ECG classification. A Short-Time Fourier Transform (STFT) converts time-domain ECG waveforms to the frequency domain to strengthen feature extraction, and the xLSTM architecture is tailored to 12-lead ECG recordings so as to capture both local and global signal features. Experiments show strong multi-label classification performance on the PTB-XL dataset, and additional tests on the Georgia 12-Lead dataset confirm the model's robustness and efficiency.

Link: https://arxiv.org/abs/2504.16101
Authors: Lei Kang, Xuanshuo Fu, Javier Vazquez-Corral, Ernest Valveny, Dimosthenis Karatzas
Institution: Unknown
Subjects: Signal Processing (eess.SP); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:Cardiovascular diseases (CVDs) remain the leading cause of mortality worldwide, highlighting the critical need for efficient and accurate diagnostic tools. Electrocardiograms (ECGs) are indispensable in diagnosing various heart conditions; however, their manual interpretation is time-consuming and error-prone. In this paper, we propose xLSTM-ECG, a novel approach that leverages an extended Long Short-Term Memory (xLSTM) network for multi-label classification of ECG signals, using the PTB-XL dataset. To the best of our knowledge, this work represents the first design and application of xLSTM modules specifically adapted for multi-label ECG classification. Our method employs a Short-Time Fourier Transform (STFT) to convert time-series ECG waveforms into the frequency domain, thereby enhancing feature extraction. The xLSTM architecture is specifically tailored to address the complexities of 12-lead ECG recordings by capturing both local and global signal features. Comprehensive experiments on the PTB-XL dataset reveal that our model achieves strong multi-label classification performance, while additional tests on the Georgia 12-Lead dataset underscore its robustness and efficiency. This approach significantly improves ECG classification accuracy, thereby advancing clinical diagnostics and patient care. The code will be publicly available upon acceptance.
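
The STFT step can be sketched on a synthetic signal as follows; in the paper such time-frequency maps are computed per lead before being fed to the xLSTM:

```python
# Computing a magnitude spectrogram of a synthetic ECG-like signal with STFT.
import numpy as np
from scipy.signal import stft

fs = 500                                   # sampling rate (Hz)
t = np.arange(0, 10, 1 / fs)
ecg_like = np.sin(2 * np.pi * 1.2 * t) + 0.3 * np.sin(2 * np.pi * 15 * t)

f, seg_times, Z = stft(ecg_like, fs=fs, nperseg=256, noverlap=192)
features = np.abs(Z)                       # magnitude spectrogram
print("frequency bins x time frames:", features.shape)
```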

[AI-65] Towards Accurate Forecasting of Renewable Energy: Building Datasets and Benchmarking Machine Learning Models for Solar and Wind Power in France

【Quick Read】: This paper addresses accurate country-scale forecasting of non-dispatchable renewables (solar and wind), which is essential for grid stability and price prediction. Regional supply forecasts are usually derived indirectly through bottom-up plant-level forecasts, ignoring the potential of spatially resolved data. The key to the solution is a comprehensive framework in which machine learning models are trained on spatially explicit weather data combined with production site capacity and location. A dataset spanning 2012 to 2023 is built with daily production data from RTE (the national grid operator) as the target and ERA5 weather data, site capacities and locations, and electricity prices as inputs. Three approaches to handling spatially resolved weather data are explored: spatial averaging over the country, dimensionality reduction via principal component analysis, and a computer vision architecture that captures complex spatial relationships. The study also evaluates cross-validation strategies and finds that cross-validation tailored to time series yields the lowest error, and that neural networks tend to outperform traditional tree-based models, which struggle to extrapolate under growing renewable capacity. Models reach an nRMSE of 4% to 10% at midterm horizons, close to the accuracy of local single-plant models, highlighting the potential of these methods for regional power supply forecasting.

Link: https://arxiv.org/abs/2504.16100
Authors: Eloi Lindas, Yannig Goude, Philippe Ciais
Institution: Unknown
Subjects: Signal Processing (eess.SP); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments: 24 pages, 4 tables, 18 figures

Abstract:Accurate prediction of non-dispatchable renewable energy sources is essential for grid stability and price prediction. Regional power supply forecasts are usually indirect through a bottom-up approach of plant-level forecasts, incorporate lagged power values, and do not use the potential of spatially resolved data. This study presents a comprehensive methodology for predicting solar and wind power production at country scale in France using machine learning models trained with spatially explicit weather data combined with spatial information about production sites capacity. A dataset is built spanning from 2012 to 2023, using daily power production data from RTE (the national grid operator) as the target variable, with daily weather data from ERA5, production sites capacity and location, and electricity prices as input features. Three modeling approaches are explored to handle spatially resolved weather data: spatial averaging over the country, dimension reduction through principal component analysis, and a computer vision architecture to exploit complex spatial relationships. The study benchmarks state-of-the-art machine learning models as well as hyperparameter tuning approaches based on cross-validation methods on daily power production data. Results indicate that cross-validation tailored to time series is best suited to reach low error. We found that neural networks tend to outperform traditional tree-based models, which face challenges in extrapolation due to the increasing renewable capacity over time. Model performance ranges from 4% to 10% in nRMSE for midterm horizon, achieving similar error metrics to local models established at a single-plant level, highlighting the potential of these methods for regional power supply forecasting.

[AI-66] Two-Timescale Joint Transmit and Pinching Beamforming for Pinching-Antenna Systems

【Quick Read】: This paper addresses sum-rate maximization in a downlink multi-user multiple-input single-output (MU-MISO) system based on pinching antenna systems (PASS). The key to the solution is a two-timescale joint transmit and pinching beamforming design that decomposes the original problem into a short-term transmit beamforming sub-problem and a long-term pinching beamforming sub-problem. The short-term sub-problem is solved with a Karush-Kuhn-Tucker-guided dual learning approach, while the long-term sub-problem is handled with a stochastic successive convex approximation method. Simulations show that the proposed two-timescale algorithm achieves a significant performance gain over baseline methods.

Link: https://arxiv.org/abs/2504.16099
Authors: Luyuan Zhang, Xidong Mu, An Liu, Yuanwei Liu
Institution: Unknown
Subjects: Signal Processing (eess.SP); Artificial Intelligence (cs.AI); Information Theory (cs.IT)
Comments: 5 pages, 4 figures, letter

Abstract:Pinching antenna systems (PASS) have been proposed as a revolutionary flexible antenna technology which facilitates line-of-sight links via numerous low-cost pinching antennas with adjustable activation positions over waveguides. This letter proposes a two-timescale joint transmit and pinching beamforming design for the maximization of sum rate of a PASS-based downlink multi-user multiple input single output system. A primal dual decomposition method is developed to decouple the two-timescale problem into two sub-problems: 1) A Karush-Kuhn-Tucker-guided dual learning-based approach is proposed to solve the short-term transmit beamforming design sub-problem; 2) The long-term pinching beamforming design sub-problem is tackled by adopting a stochastic successive convex approximation method. Simulation results demonstrate that the proposed two-timescale algorithm achieves a significant performance gain compared to other baselines.

[AI-67] A CNN-based Local-Global Self-Attention via Averaged Window Embeddings for Hierarchical ECG Analysis

【Quick Read】: This paper addresses the difficulty traditional transformer models have in capturing the local morphological features that are critical for accurate ECG interpretation. The key innovation is a new Local-Global Attention ECG model (LGA-ECG) that combines convolutional inductive biases with global self-attention. Specifically, LGA-ECG extracts queries by averaging embeddings obtained from overlapping convolutional windows, enabling fine-grained morphological analysis, while modeling global context through attention over keys and values derived from the entire sequence. Experiments on the CODE-15 dataset show that the method outperforms state-of-the-art models, and ablation studies validate the effectiveness of the local-global attention strategy.

Link: https://arxiv.org/abs/2504.16097
Authors: Arthur Buzelin, Pedro Robles Dutenhefner, Turi Rezende, Luisa G. Porfirio, Pedro Bento, Yan Aquino, Jose Fernandes, Caio Santana, Gabriela Miana, Gisele L. Pappa, Antonio Ribeiro, Wagner Meira Jr
Institution: Unknown
Subjects: Signal Processing (eess.SP); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:Cardiovascular diseases remain the leading cause of global mortality, emphasizing the critical need for efficient diagnostic tools such as electrocardiograms (ECGs). Recent advancements in deep learning, particularly transformers, have revolutionized ECG analysis by capturing detailed waveform features as well as global rhythm patterns. However, traditional transformers struggle to effectively capture local morphological features that are critical for accurate ECG interpretation. We propose a novel Local-Global Attention ECG model (LGA-ECG) to address this limitation, integrating convolutional inductive biases with global self-attention mechanisms. Our approach extracts queries by averaging embeddings obtained from overlapping convolutional windows, enabling fine-grained morphological analysis, while simultaneously modeling global context through attention to keys and values derived from the entire sequence. Experiments conducted on the CODE-15 dataset demonstrate that LGA-ECG outperforms state-of-the-art models and ablation studies validate the effectiveness of the local-global attention strategy. By capturing the hierarchical temporal dependencies and morphological patterns in ECG signals, this new design showcases its potential for clinical deployment with robust automated ECG classification.
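
A minimal sketch of the local-global attention idea: queries come from average-pooled overlapping convolutional windows, while keys and values come from the full sequence. Dimensions and layer choices are illustrative, not LGA-ECG's:

```python
# Toy local-global attention: convolutional local features are window-averaged
# into queries; keys and values span the entire sequence.
import torch
import torch.nn.functional as F

B, L, C = 2, 1000, 64                      # batch, signal length, channels
x = torch.randn(B, C, L)                   # per-lead embeddings over time

conv = torch.nn.Conv1d(C, C, kernel_size=15, padding=7)
local = conv(x)                            # local morphological features
queries = F.avg_pool1d(local, kernel_size=8, stride=4)  # overlapping windows

q = queries.transpose(1, 2)                # (B, Lq, C)
kv = x.transpose(1, 2)                     # (B, L, C): global context
attn = torch.softmax(q @ kv.transpose(1, 2) / C**0.5, dim=-1)
out = attn @ kv                            # (B, Lq, C) attended features
print(out.shape)
```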

[AI-68] Efficient Portfolio Selection through Preference Aggregation with Quicksort and the Bradley–Terry Model

【Quick Read】: This paper addresses how to allocate limited resources, under uncertainty, to the portfolio of projects with the greatest long-term benefit, a problem arising in innovation project evaluation, research funding decisions, and participatory budgeting. The key contribution is a set of comparison rules based on Quicksort and the Bradley–Terry model: each project's uncertain long-term value is translated into pairwise "win" probabilities, which are aggregated across multiple agents' evaluations to maximize overall benefit. The methods can also be combined with sampling techniques to significantly reduce the number of pairwise comparisons, making portfolio selection both efficient and practical.

Link: https://arxiv.org/abs/2504.16093
Authors: Yurun Ge, Lucas Böttcher, Tom Chou, Maria R. D’Orsogna
Institution: Unknown
Subjects: Portfolio Management (q-fin.PM); Artificial Intelligence (cs.AI); Probability (math.PR)
Comments: 15pp, 4 figs

Abstract:How to allocate limited resources to projects that will yield the greatest long-term benefits is a problem that often arises in decision-making under uncertainty. For example, organizations may need to evaluate and select innovation projects with risky returns. Similarly, when allocating resources to research projects, funding agencies are tasked with identifying the most promising proposals based on idiosyncratic criteria. Finally, in participatory budgeting, a local community may need to select a subset of public projects to fund. Regardless of context, agents must estimate the uncertain values of a potentially large number of projects. Developing parsimonious methods to compare these projects, and aggregating agent evaluations so that the overall benefit is maximized, are critical in assembling the best project portfolio. Unlike in standard sorting algorithms, evaluating projects on the basis of uncertain long-term benefits introduces additional complexities. We propose comparison rules based on Quicksort and the Bradley–Terry model, which connects rankings to pairwise “win” probabilities. In our model, each agent determines win probabilities of a pair of projects based on his or her specific evaluation of the projects’ long-term benefit. The win probabilities are then appropriately aggregated and used to rank projects. Several of the methods we propose perform better than the two most effective aggregation methods currently available. Additionally, our methods can be combined with sampling techniques to significantly reduce the number of pairwise comparisons. We also discuss how the Bradley–Terry portfolio selection approach can be implemented in practice.
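
A minimal sketch of ranking with Bradley–Terry win probabilities inside Quicksort; the agent scores and the mean-score aggregation below are invented choices (the paper studies several aggregation rules):

```python
# Aggregate agents' value estimates into Bradley-Terry win probabilities, then
# rank projects with Quicksort using a probabilistic comparator.
import math
import random

random.seed(0)
agent_values = {                 # each agent's estimated long-term benefit
    "proj_a": [3.1, 2.8, 3.4], "proj_b": [1.9, 2.2, 2.0],
    "proj_c": [2.7, 3.0, 2.5], "proj_d": [0.8, 1.1, 0.9],
}

def win_prob(i: str, j: str) -> float:
    """Bradley-Terry: P(i beats j) from averaged agent scores."""
    si = sum(agent_values[i]) / len(agent_values[i])
    sj = sum(agent_values[j]) / len(agent_values[j])
    return math.exp(si) / (math.exp(si) + math.exp(sj))

def quicksort(items):
    if len(items) <= 1:
        return items
    pivot, rest = items[0], items[1:]
    wins = [p for p in rest if random.random() < win_prob(p, pivot)]
    losses = [p for p in rest if p not in wins]
    return quicksort(wins) + [pivot] + quicksort(losses)

print("ranking (best first):", quicksort(list(agent_values)))
```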

Machine Learning

[LG-0] Meta-Learning Online Dynamics Model Adaptation in Off-Road Autonomous Driving

Link: https://arxiv.org/abs/2504.16923
Authors: Jacob Levy, Jason Gibson, Bogdan Vlahov, Erica Tevere, Evangelos Theodorou, David Fridovich-Keil, Patrick Spieler
Subjects: Robotics (cs.RO); Machine Learning (cs.LG); Systems and Control (eess.SY)
Comments:

Abstract:High-speed off-road autonomous driving presents unique challenges due to complex, evolving terrain characteristics and the difficulty of accurately modeling terrain-vehicle interactions. While dynamics models used in model-based control can be learned from real-world data, they often struggle to generalize to unseen terrain, making real-time adaptation essential. We propose a novel framework that combines a Kalman filter-based online adaptation scheme with meta-learned parameters to address these challenges. Offline meta-learning optimizes the basis functions along which adaptation occurs, as well as the adaptation parameters, while online adaptation dynamically adjusts the onboard dynamics model in real time for model-based control. We validate our approach through extensive experiments, including real-world testing on a full-scale autonomous off-road vehicle, demonstrating that our method outperforms baseline approaches in prediction accuracy, performance, and safety metrics, particularly in safety-critical scenarios. Our results underscore the effectiveness of meta-learned dynamics model adaptation, advancing the development of reliable autonomous systems capable of navigating diverse and unseen environments. Video is available at: this https URL

[LG-1] Learning Verifiable Control Policies Using Relaxed Verification

Link: https://arxiv.org/abs/2504.16879
Authors: Puja Chaudhury, Alexander Estornell, Michael Everett
Subjects: Systems and Control (eess.SY); Machine Learning (cs.LG)
Comments:

Abstract:To provide safety guarantees for learning-based control systems, recent work has developed formal verification methods to apply after training ends. However, if the trained policy does not meet the specifications, or there is conservatism in the verification algorithm, establishing these guarantees may not be possible. Instead, this work proposes to perform verification throughout training to ultimately aim for policies whose properties can be evaluated throughout runtime with lightweight, relaxed verification algorithms. The approach is to use differentiable reachability analysis and incorporate new components into the loss function. Numerical experiments on a quadrotor model and unicycle model highlight the ability of this approach to lead to learned control policies that satisfy desired reach-avoid and invariance specifications.

[LG-2] Hybrid Reinforcement Learning and Model Predictive Control for Adaptive Control of Hydrogen-Diesel Dual-Fuel Combustion

Link: https://arxiv.org/abs/2504.16875
Authors: Julian Bedei, Murray McBain, Charles Robert Koch, Jakob Andert, David Gordon
Subjects: Machine Learning (cs.LG)
Comments:

Abstract:Reinforcement Learning (RL) and Machine Learning Integrated Model Predictive Control (ML-MPC) are promising approaches for optimizing hydrogen-diesel dual-fuel engine control, as they can effectively control multiple-input multiple-output systems and nonlinear processes. ML-MPC is advantageous for providing safe and optimal controls, ensuring the engine operates within predefined safety limits. In contrast, RL is distinguished by its adaptability to changing conditions through its learning-based approach. However, the practical implementation of either method alone poses challenges. RL requires high variance in control inputs during early learning phases, which can pose risks to the system by potentially executing unsafe actions, leading to mechanical damage. Conversely, ML-MPC relies on an accurate system model to generate optimal control inputs and has limited adaptability to system drifts, such as injector aging, which naturally occur in engine applications. To address these limitations, this study proposes a hybrid RL and ML-MPC approach that uses an ML-MPC framework while incorporating an RL agent to dynamically adjust the ML-MPC load tracking reference in response to changes in the environment. At the same time, the ML-MPC ensures that actions stay safe throughout the RL agent’s exploration. To evaluate the effectiveness of this approach, fuel pressure is deliberately varied to introduce a model-plant mismatch between the ML-MPC and the engine test bench. The result of this mismatch is a root mean square error (RMSE) in indicated mean effective pressure of 0.57 bar when running the ML-MPC. The experimental results demonstrate that RL successfully adapts to changing boundary conditions by altering the tracking reference while ML-MPC ensures safe control inputs. The quantitative improvement in load tracking by implementing RL is an RMSE of 0.44 bar.

[LG-3] Exploring How LLM s Capture and Represent Domain-Specific Knowledge

链接: https://arxiv.org/abs/2504.16871
作者: Mirian Hipolito Garcia,Camille Couturier,Daniel Madrigal Diaz,Ankur Mallick,Anastasios Kyrillidis,Robert Sim,Victor Ruhle,Saravan Rajmohan
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We study whether Large Language Models (LLMs) inherently capture domain-specific nuances in natural language. Our experiments probe the domain sensitivity of LLMs by examining their ability to distinguish queries from different domains using hidden states generated during the prefill phase. We reveal latent domain-related trajectories that indicate the model’s internal recognition of query domains. We also study the robustness of these domain representations to variations in prompt styles and sources. Our approach leverages these representations for model selection, mapping the LLM that best matches the domain trace of the input query (i.e., the model with the highest performance on similar traces). Our findings show that LLMs can differentiate queries for related domains, and that the fine-tuned model is not always the most accurate. Unlike previous work, our interpretations apply to both closed and open-ended generative tasks.

[LG-4] An Adaptive ML Framework for Power Converter Monitoring via Federated Transfer Learning

链接: https://arxiv.org/abs/2504.16866
作者: Panagiotis Kakosimos,Alireza Nemat Saberi,Luca Peretti
类目: Machine Learning (cs.LG)
*备注: 2025 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works

点击查看摘要

Abstract:This study explores alternative framework configurations for adapting thermal machine learning (ML) models for power converters by combining transfer learning (TL) and federated learning (FL) in a piecewise manner. This approach inherently addresses challenges such as varying operating conditions, data sharing limitations, and security implications. The framework starts with a base model that is incrementally adapted by multiple clients using three state-of-the-art domain adaptation techniques: Fine-tuning, Transfer Component Analysis (TCA), and Deep Domain Adaptation (DDA). The Flower framework is employed for FL, using Federated Averaging for aggregation. Validation with field data demonstrates that fine-tuning offers a straightforward TL approach with high accuracy, making it suitable for practical applications. Benchmarking results reveal a comprehensive comparison of these methods, showcasing their respective strengths and weaknesses when applied in different scenarios. Locally hosted FL enhances performance when data aggregation is not feasible, while cloud-based FL becomes more practical with a significant increase in the number of clients, addressing scalability and connectivity challenges.

[LG-5] Evaluating Autoencoders for Parametric and Invertible Multidimensional Projections

链接: https://arxiv.org/abs/2504.16831
作者: Frederik L. Dennig,Nina Geyer,Daniela Blumberg,Yannick Metz,Daniel A. Keim
类目: Machine Learning (cs.LG)
*备注: 12 pages, 7 figures, 2 tables, LaTeX; to appear at the 16th International EuroVis Workshop on Visual Analytics (EuroVA’25)

点击查看摘要

Abstract:Recently, neural networks have gained attention for creating parametric and invertible multidimensional data projections. Parametric projections allow for embedding previously unseen data without recomputing the projection as a whole, while invertible projections enable the generation of new data points. However, these properties have never been explored simultaneously for arbitrary projection methods. We evaluate three autoencoder (AE) architectures for creating parametric and invertible projections. Based on a given projection, we train AEs to learn a mapping into 2D space and an inverse mapping into the original space. We perform a quantitative and qualitative comparison on four datasets of varying dimensionality and pattern complexity using t-SNE. Our results indicate that AEs with a customized loss function can create smoother parametric and inverse projections than feed-forward neural networks while giving users control over the strength of the smoothing effect.
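
The recipe is easy to prototype. Below is a minimal sketch (our own simplification, not the authors' customized loss) that trains an autoencoder whose encoder imitates a precomputed t-SNE embedding (parametric projection) and whose decoder maps 2D points back to data space (inverse projection):

```python
import torch
import torch.nn as nn
from sklearn.datasets import load_digits
from sklearn.manifold import TSNE

X = torch.tensor(load_digits().data, dtype=torch.float32)
X = (X - X.mean(0)) / (X.std(0) + 1e-8)          # standardize 64-dim digits
Y = torch.tensor(TSNE(n_components=2).fit_transform(X.numpy()),
                 dtype=torch.float32)             # reference 2D projection

enc = nn.Sequential(nn.Linear(64, 128), nn.ReLU(), nn.Linear(128, 2))
dec = nn.Sequential(nn.Linear(2, 128), nn.ReLU(), nn.Linear(128, 64))
opt = torch.optim.Adam(list(enc.parameters()) + list(dec.parameters()), lr=1e-3)

for step in range(500):
    z = enc(X)                                    # parametric 2D projection
    loss = (nn.functional.mse_loss(z, Y)          # match the t-SNE layout
            + nn.functional.mse_loss(dec(z), X))  # inverse-mapping term
    opt.zero_grad(); loss.backward(); opt.step()

# enc embeds unseen points without rerunning t-SNE;
# dec generates new data points from arbitrary 2D coordinates.
```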

[LG-6] Online model learning with data-assimilated reservoir computers

链接: https://arxiv.org/abs/2504.16767
作者: Andrea Nóvoa,Luca Magri
类目: Machine Learning (cs.LG); Fluid Dynamics (physics.flu-dyn); Applications (stat.AP)
*备注: 8 pages, 5 figures

点击查看摘要

Abstract:We propose an online learning framework for forecasting nonlinear spatio-temporal signals (fields). The method integrates (i) dimensionality reduction, here, a simple proper orthogonal decomposition (POD) projection; (ii) a generalized autoregressive model to forecast reduced dynamics, here, a reservoir computer; (iii) online adaptation to update the reservoir computer (the model), here, ensemble sequential data assimilation. We demonstrate the framework on a wake past a cylinder governed by the Navier-Stokes equations, exploring the assimilation of full flow fields (projected onto POD modes) and sparse sensors. Three scenarios are examined: a naïve physical state estimation; a two-fold estimation of physical and reservoir states; and a three-fold estimation that also adjusts the model parameters. The two-fold strategy significantly improves ensemble convergence and reduces reconstruction error compared to the naïve approach. The three-fold approach enables robust online training of partially-trained reservoir computers, overcoming limitations of a priori training. By unifying data-driven reduced order modelling with Bayesian data assimilation, this work opens new opportunities for scalable online model learning for nonlinear time series forecasting.
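
The adaptation step (iii) is a standard ensemble Kalman filter analysis. For intuition, here is a toy stochastic EnKF update in NumPy; the state, sensor layout, and noise levels are placeholders, and in the paper the ensemble states would come from reservoir-computer forecasts rather than random numbers:

```python
import numpy as np

rng = np.random.default_rng(0)
m, n_state, n_obs = 50, 10, 3            # ensemble size, state dim, obs dim
X = rng.normal(size=(n_state, m))        # forecast ensemble (columns = members)
H = np.zeros((n_obs, n_state))
H[[0, 1, 2], [0, 4, 9]] = 1.0            # sparse sensors observing 3 states
R = 0.1 * np.eye(n_obs)                  # observation noise covariance
y = rng.normal(size=n_obs)               # observation vector

Xm = X.mean(axis=1, keepdims=True)
Apert = X - Xm                           # ensemble anomalies
Pf = Apert @ Apert.T / (m - 1)           # sample forecast covariance
K = Pf @ H.T @ np.linalg.inv(H @ Pf @ H.T + R)        # Kalman gain
Yp = y[:, None] + rng.multivariate_normal(np.zeros(n_obs), R, size=m).T
Xa = X + K @ (Yp - H @ X)                # analysis (updated) ensemble
```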

[LG-7] QAOA-PCA: Enhancing Efficiency in the Quantum Approximate Optimization Algorithm via Principal Component Analysis

链接: https://arxiv.org/abs/2504.16755
作者: Owain Parry,Phil McMinn
类目: Machine Learning (cs.LG); Emerging Technologies (cs.ET)
*备注:

点击查看摘要

Abstract:The Quantum Approximate Optimization Algorithm (QAOA) is a promising variational algorithm for solving combinatorial optimization problems on near-term devices. However, as the number of layers in a QAOA circuit increases, which is correlated with the quality of the solution, the number of parameters to optimize grows linearly. This results in more iterations required by the classical optimizer, which results in an increasing computational burden as more circuit executions are needed. To mitigate this issue, we introduce QAOA-PCA, a novel reparameterization technique that employs Principal Component Analysis (PCA) to reduce the dimensionality of the QAOA parameter space. By extracting principal components from optimized parameters of smaller problem instances, QAOA-PCA facilitates efficient optimization with fewer parameters on larger instances. Our empirical evaluation on the prominent MaxCut problem demonstrates that QAOA-PCA consistently requires fewer iterations than standard QAOA, achieving substantial efficiency gains. While this comes at the cost of a slight reduction in approximation ratio compared to QAOA with the same number of layers, QAOA-PCA almost always outperforms standard QAOA when matched by parameter count. QAOA-PCA strikes a favorable balance between efficiency and performance, reducing optimization overhead without significantly compromising solution quality.
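
The reparameterization itself is simple to sketch: fit PCA on optimized parameter vectors from small instances, then optimize only the component coefficients on a larger instance. In the hedged example below, `energy` is a stand-in for the QAOA expectation value and the "optimized" small-instance parameters are random placeholders:

```python
import numpy as np
from sklearn.decomposition import PCA
from scipy.optimize import minimize

# Placeholder: 40 optimized (gamma, beta) vectors from small instances, p = 5.
small_instance_params = np.random.randn(40, 10)
pca = PCA(n_components=3).fit(small_instance_params)

def energy(theta):
    # Stand-in for a circuit evaluation returning the QAOA expectation value.
    return np.sum(np.cos(theta))

def reduced_objective(coeffs):
    # Map 3 principal-component coefficients back to all 10 circuit parameters.
    theta = pca.inverse_transform(coeffs.reshape(1, -1)).ravel()
    return energy(theta)

res = minimize(reduced_objective, x0=np.zeros(3), method="COBYLA")
best_full_params = pca.inverse_transform(res.x.reshape(1, -1)).ravel()
```

The classical optimizer now searches a 3-dimensional space instead of a 10-dimensional one, which is the source of the iteration savings the abstract reports.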

[LG-8] Simple Graph Contrastive Learning via Fractional-order Neural Diffusion Networks ICML

链接: https://arxiv.org/abs/2504.16748
作者: Yanan Zhao,Feng Ji,Kai Zhao,Xuhao Li,Qiyu Kang,Wenfei Liang,Yahya Alkhatib,Xingchao Jian,Wee Peng Tay
类目: Machine Learning (cs.LG)
*备注: Submitted to ICML

点击查看摘要

Abstract:Graph Contrastive Learning (GCL) has recently made progress as an unsupervised graph representation learning paradigm. GCL approaches can be categorized into augmentation-based and augmentation-free methods. The former relies on complex data augmentations, while the latter depends on encoders that can generate distinct views of the same input. Both approaches may require negative samples for training. In this paper, we introduce a novel augmentation-free GCL framework based on graph neural diffusion models. Specifically, we utilize learnable encoders governed by Fractional Differential Equations (FDE). Each FDE is characterized by an order parameter of the differential operator. We demonstrate that varying these parameters allows us to produce learnable encoders that generate diverse views, capturing either local or global information, for contrastive learning. Our model does not require negative samples for training and is applicable to both homophilic and heterophilic datasets. We demonstrate its effectiveness across various datasets, achieving state-of-the-art performance.

[LG-9] Simplified Swarm Learning Framework for Robust and Scalable Diagnostic Services in Cancer Histopathology

链接: https://arxiv.org/abs/2504.16732
作者: Yanjie Wu,Yuhao Ji,Saiho Lee,Juniad Akram,Ali Braytee,Ali Anaissi
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)
*备注: 8 pages, 4 figures, 2025 International Conference on Computational Science

点击查看摘要

Abstract:The complexities of healthcare data, including privacy concerns, imbalanced datasets, and interoperability issues, necessitate innovative machine learning solutions. Swarm Learning (SL), a decentralized alternative to Federated Learning, offers privacy-preserving distributed training, but its reliance on blockchain technology hinders accessibility and scalability. This paper introduces a Simplified Peer-to-Peer Swarm Learning (P2P-SL) Framework tailored for resource-constrained environments. By eliminating blockchain dependencies and adopting lightweight peer-to-peer communication, the proposed framework ensures robust model synchronization while maintaining data privacy. Applied to cancer histopathology, the framework integrates optimized pre-trained models, such as TorchXRayVision, enhanced with DenseNet decoders, to improve diagnostic accuracy. Extensive experiments demonstrate the framework’s efficacy in handling imbalanced and biased datasets, achieving comparable performance to centralized models while preserving privacy. This study paves the way for democratizing advanced machine learning in healthcare, offering a scalable, accessible, and efficient solution for privacy-sensitive diagnostic applications.

[LG-10] A Unified Retrieval Framework with Document Ranking and EDU Filtering for Multi-document Summarization

链接: https://arxiv.org/abs/2504.16711
作者: Shiyin Tan,Jaeeon Park,Dongyuan Li,Renhe Jiang,Manabu Okumura
类目: Machine Learning (cs.LG); Information Retrieval (cs.IR)
*备注:

点击查看摘要

Abstract:In the field of multi-document summarization (MDS), transformer-based models have demonstrated remarkable success, yet they suffer an input length limitation. Current methods apply truncation after the retrieval process to fit the context length; however, they heavily depend on manually well-crafted queries, which are impractical to create for each document set for MDS. Additionally, these methods retrieve information at a coarse granularity, leading to the inclusion of irrelevant content. To address these issues, we propose a novel retrieval-based framework that integrates query selection and document ranking and shortening into a unified process. Our approach identifies the most salient elementary discourse units (EDUs) from input documents and utilizes them as latent queries. These queries guide the document ranking by calculating relevance scores. Instead of traditional truncation, our approach filters out irrelevant EDUs to fit the context length, ensuring that only critical information is preserved for summarization. We evaluate our framework on multiple MDS datasets, demonstrating consistent improvements in ROUGE metrics while confirming its scalability and flexibility across diverse model architectures. Additionally, we validate its effectiveness through an in-depth analysis, emphasizing its ability to dynamically select appropriate queries and accurately rank documents based on their relevance scores. These results demonstrate that our framework effectively addresses context-length constraints, establishing it as a robust and reliable solution for MDS.

[LG-11] PIN-WM: Learning Physics-INformed World Models for Non-Prehensile Manipulation

链接: https://arxiv.org/abs/2504.16693
作者: Wenxuan Li,Hang Zhao,Zhiyuan Yu,Yu Du,Qin Zou,Ruizhen Hu,Kai Xu
类目: Machine Learning (cs.LG); Robotics (cs.RO)
*备注:

点击查看摘要

Abstract:While non-prehensile manipulation (e.g., controlled pushing/poking) constitutes a foundational robotic skill, its learning remains challenging due to the high sensitivity to complex physical interactions involving friction and restitution. To achieve robust policy learning and generalization, we opt to learn a world model of the 3D rigid body dynamics involved in non-prehensile manipulations and use it for model-based reinforcement learning. We propose PIN-WM, a Physics-INformed World Model that enables efficient end-to-end identification of a 3D rigid body dynamical system from visual observations. Adopting differentiable physics simulation, PIN-WM can be learned with only few-shot and task-agnostic physical interaction trajectories. Further, PIN-WM is learned with observational loss induced by Gaussian Splatting without needing state estimation. To bridge Sim2Real gaps, we turn the learned PIN-WM into a group of Digital Cousins via physics-aware randomizations which perturb physics and rendering parameters to generate diverse and meaningful variations of the PIN-WM. Extensive evaluations on both simulation and real-world tests demonstrate that PIN-WM, enhanced with physics-aware digital cousins, facilitates learning robust non-prehensile manipulation skills with Sim2Real transfer, surpassing the Real2Sim2Real state-of-the-arts.

[LG-12] A Statistical Evaluation of Indoor LoRaWAN Environment-Aware Propagation for 6G: MLR, ANOVA, and Residual Distribution Analysis

链接: https://arxiv.org/abs/2504.16688
作者: Nahshon Mokua Obiri,Kristof Van Laerhoven
类目: Networking and Internet Architecture (cs.NI); Machine Learning (cs.LG); Signal Processing (eess.SP)
*备注: © 2025 IEEE. Personal use of this material is permitted. This is the accepted version of the article: To appear in the 2025 Joint European Conference on Networks and Communications & 6G Summit (EuCNC/6G Summit)

点击查看摘要

Abstract:Modeling path loss in indoor LoRaWAN technology deployments is inherently challenging due to structural obstructions, occupant density and activities, and fluctuating environmental conditions. This study proposes a two-stage approach to capture and analyze these complexities using an extensive dataset of 1,328,334 field measurements collected over six months in a single-floor office at the University of Siegen’s Hoelderlinstrasse Campus, Germany. First, we implement a multiple linear regression model that includes traditional propagation metrics (distance, structural walls) and an extension with proposed environmental variables (relative humidity, temperature, carbon dioxide, particulate matter, and barometric pressure). Using analysis of variance, we demonstrate that adding these environmental factors can reduce unexplained variance by 42.32 percent. Secondly, we examine residual distributions by fitting five candidate probability distributions: Normal, Skew-Normal, Cauchy, Student’s t, and Gaussian Mixture Models with one to five components. Our results show that a four-component Gaussian Mixture Model captures the residual heterogeneity of indoor signal propagation most accurately, significantly outperforming single-distribution approaches. Given the push toward ultra-reliable, context-aware communications in 6G networks, our analysis shows that environment-aware modeling can substantially improve LoRaWAN network design in dynamic indoor IoT deployments.

[LG-13] MCMC for Bayesian estimation of Differential Privacy from Membership Inference Attacks

链接: https://arxiv.org/abs/2504.16683
作者: Ceren Yildirim,Kamer Kaya,Sinan Yildirim,Erkay Savas
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: Code available: this https URL

点击查看摘要

Abstract:We propose a new framework for Bayesian estimation of differential privacy, incorporating evidence from multiple membership inference attacks (MIA). Bayesian estimation is carried out via a Markov chain Monte Carlo (MCMC) algorithm, named MCMC-DP-Est, which provides an estimate of the full posterior distribution of the privacy parameter (rather than just credible intervals). Critically, the proposed method does not assume that privacy auditing is performed with the most powerful attack on the worst-case (dataset, challenge point) pair, which is typically unrealistic. Instead, MCMC-DP-Est jointly estimates the strengths of MIAs used and the privacy of the training algorithm, yielding a more cautious privacy analysis. We also present an economical way to generate measurements for the performance of an MIA that is to be used by the MCMC method to estimate privacy. We present the use of the methods with numerical examples with both artificial and real data.

[LG-14] Provable wavelet-based neural approximation

链接: https://arxiv.org/abs/2504.16682
作者: Youngmi Hur,Hyojae Lim,Mikyoung Lim
类目: Machine Learning (cs.LG); Classical Analysis and ODEs (math.CA); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:In this paper, we develop a wavelet-based theoretical framework for analyzing the universal approximation capabilities of neural networks over a wide range of activation functions. Leveraging wavelet frame theory on the spaces of homogeneous type, we derive sufficient conditions on activation functions to ensure that the associated neural network approximates any functions in the given space, along with an error estimate. These sufficient conditions accommodate a variety of smooth activation functions, including those that exhibit oscillatory behavior. Furthermore, by considering the L^2-distance between smooth and non-smooth activation functions, we establish a generalized approximation result that is applicable to non-smooth activations, with the error explicitly controlled by this distance. This provides increased flexibility in the design of network architectures.

[LG-15] Efficient Data Valuation Approximation in Federated Learning: A Sampling-based Approach

链接: https://arxiv.org/abs/2504.16668
作者: Shuyue Wei,Yongxin Tong,Zimu Zhou,Tianran He,Yi Xu
类目: Machine Learning (cs.LG); Databases (cs.DB)
*备注:

点击查看摘要

Abstract:Federated learning (FL) is a collaborative learning paradigm that utilizes datasets across multiple data providers. In FL, cross-silo data providers often hesitate to share their high-quality dataset unless their data value can be fairly assessed. Shapley value (SV) has been advocated as the standard metric for data valuation in FL due to its desirable properties. However, the computational overhead of SV is prohibitive in practice, as it inherently requires training and evaluating an FL model across an exponential number of dataset combinations. Furthermore, existing solutions fail to achieve high accuracy and efficiency, making practical use of SV still out of reach, because they ignore choosing a suitable computation scheme for the approximation framework and overlook the property of the utility function in FL. We first propose a unified stratified-sampling framework for two widely-used schemes. Then, we analyze and choose the more promising scheme under the FL linear regression assumption. After that, we identify a phenomenon termed key combinations, where only limited dataset combinations have a high impact on the final data value. Building on these insights, we propose a practical approximation algorithm, IPSS, which strategically selects high-impact dataset combinations rather than evaluating all possible combinations, thus substantially reducing time cost with minor approximation error. Furthermore, we conduct extensive evaluations on the FL benchmark datasets to demonstrate that our proposed algorithm outperforms a series of representative baselines in terms of efficiency and effectiveness.
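
For context, the baseline that such approximations improve on is Monte Carlo estimation of the Shapley value over random permutations of providers. A minimal sketch follows (this is not the paper's IPSS; `utility` is a placeholder for training and evaluating an FL model on a coalition of providers):

```python
import random

providers = ["A", "B", "C", "D"]

def utility(coalition):
    # Placeholder: in FL this would be the accuracy of a model trained
    # on the coalition's combined data.
    return len(coalition) ** 0.5

def shapley_mc(n_perms=1000, seed=0):
    """Estimate each provider's Shapley value by permutation sampling."""
    random.seed(seed)
    phi = {p: 0.0 for p in providers}
    for _ in range(n_perms):
        perm = random.sample(providers, len(providers))
        coalition, prev = [], utility([])
        for p in perm:
            coalition.append(p)
            cur = utility(coalition)
            phi[p] += (cur - prev) / n_perms   # marginal contribution
            prev = cur
    return phi

print(shapley_mc())
```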

[LG-16] DAPLSR: Data Augmentation Partial Least Squares Regression Model via Manifold Optimization

链接: https://arxiv.org/abs/2504.16639
作者: Haoran Chen,Jiapeng Liu,Jiafan Wang,Wenjun Shi
类目: Machine Learning (cs.LG); Information Retrieval (cs.IR)
*备注:

点击查看摘要

Abstract:Traditional Partial Least Squares Regression (PLSR) models frequently underperform when handling data characterized by uneven categories. To address the issue, this paper proposes a Data Augmentation Partial Least Squares Regression (DAPLSR) model via manifold optimization. The DAPLSR model introduces the Synthetic Minority Over-sampling Technique (SMOTE) to increase the number of samples and utilizes the Value Difference Metric (VDM) to select the nearest neighbor samples that closely resemble the original samples for generating synthetic samples. In solving the model, in order to obtain a more accurate numerical solution for PLSR, this paper proposes a manifold optimization method that uses the geometric properties of the constraint space to mitigate model degradation and improve optimization. Comprehensive experiments show that the proposed DAPLSR model achieves superior classification performance and outstanding evaluation metrics on various datasets, significantly outperforming existing methods.
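
The SMOTE component can be sketched compactly: each synthetic point interpolates between a minority sample and one of its nearest minority neighbors. The example below uses plain Euclidean neighbors on random data; the paper instead selects neighbors with the Value Difference Metric:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def smote(X_min, n_synthetic, k=5, seed=0):
    """Generate synthetic minority samples by neighbor interpolation."""
    rng = np.random.default_rng(seed)
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_min)
    _, idx = nn.kneighbors(X_min)            # idx[:, 0] is the point itself
    out = []
    for _ in range(n_synthetic):
        i = rng.integers(len(X_min))
        j = idx[i, rng.integers(1, k + 1)]   # random true neighbor of sample i
        lam = rng.random()
        out.append(X_min[i] + lam * (X_min[j] - X_min[i]))
    return np.asarray(out)

X_minority = np.random.default_rng(1).normal(size=(20, 4))
X_new = smote(X_minority, n_synthetic=30)
```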

[LG-17] Compositional Active Learning of Synchronous Systems through Automated Alphabet Refinement

链接: https://arxiv.org/abs/2504.16624
作者: Leo Henry,Thomas Neele,Mohammad Mousavi,Matteo Sammartino
类目: Machine Learning (cs.LG); Formal Languages and Automata Theory (cs.FL)
*备注:

点击查看摘要

Abstract:Active automata learning infers automaton models of systems from behavioral observations, a technique successfully applied to a wide range of domains. Compositional approaches for concurrent systems have recently emerged. We take a significant step beyond available results, including those by the authors, and develop a general technique for compositional learning of a synchronizing parallel system with an unknown decomposition. Our approach automatically refines the global alphabet into component alphabets while learning the component models. We develop a theoretical treatment of distributions of alphabets, i.e., sets of possibly overlapping component alphabets. We characterize counter-examples that reveal inconsistencies with global observations, and show how to systematically update the distribution to restore consistency. We present a compositional learning algorithm implementing these ideas, where learning counterexamples precisely correspond to distribution counterexamples under well-defined conditions. We provide an implementation, called CoalA, using the state-of-the-art active learning library LearnLib. Our experiments show that in more than 630 subject systems, CoalA delivers orders of magnitude improvements (up to five orders) in membership queries; in systems with significant concurrency, it also achieves better scalability in the number of equivalence queries.

[LG-18] HERB: Human-augmented Efficient Reinforcement learning for Bin-packing

链接: https://arxiv.org/abs/2504.16595
作者: Gojko Perovic,Nuno Ferreira Duarte,Atabak Dehban,Gonçalo Teixeira,Egidio Falotico,José Santos-Victor
类目: Robotics (cs.RO); Machine Learning (cs.LG)
*备注: 7 pages, 5 Figures

点击查看摘要

Abstract:Packing objects efficiently is a fundamental problem in logistics, warehouse automation, and robotics. While traditional packing solutions focus on geometric optimization, packing irregular, 3D objects presents significant challenges due to variations in shape and stability. Reinforcement Learning (RL) has gained popularity in robotic packing tasks, but training purely from simulation can be inefficient and computationally expensive. In this work, we propose HERB, a human-augmented RL framework for packing irregular objects. We first leverage human demonstrations to learn the best sequence of objects to pack, incorporating latent factors such as space optimization, stability, and object relationships that are difficult to model explicitly. Next, we train a placement algorithm that uses visual information to determine the optimal object positioning inside a packing container. Our approach is validated through extensive performance evaluations, analyzing both packing efficiency and latency. Finally, we demonstrate the real-world feasibility of our method on a robotic system. Experimental results show that our method outperforms geometric and purely RL-based approaches by leveraging human intuition, improving both packing robustness and adaptability. This work highlights the potential of combining human expertise with RL to tackle complex real-world packing challenges in robotic systems.

[LG-19] Data-Assimilated Model-Based Reinforcement Learning for Partially Observed Chaotic Flows

链接: https://arxiv.org/abs/2504.16588
作者: Defne E. Ozan,Andrea Nóvoa,Luca Magri
类目: ystems and Control (eess.SY); Machine Learning (cs.LG); Fluid Dynamics (physics.flu-dyn)
*备注:

点击查看摘要

Abstract:The goal of many applications in energy and transport sectors is to control turbulent flows. However, because of chaotic dynamics and high dimensionality, the control of turbulent flows is exceedingly difficult. Model-free reinforcement learning (RL) methods can discover optimal control policies by interacting with the environment, but they require full state information, which is often unavailable in experimental settings. We propose a data-assimilated model-based RL (DA-MBRL) framework for systems with partial observability and noisy measurements. Our framework employs a control-aware Echo State Network for data-driven prediction of the dynamics, and integrates data assimilation with an Ensemble Kalman Filter for real-time state estimation. An off-policy actor-critic algorithm is employed to learn optimal control strategies from state estimates. The framework is tested on the Kuramoto-Sivashinsky equation, demonstrating its effectiveness in stabilizing a spatiotemporally chaotic flow from noisy and partial measurements.

[LG-20] Enhancing Variable Selection in Large-scale Logistic Regression: Leveraging Manual Labeling with Beneficial Noise

链接: https://arxiv.org/abs/2504.16585
作者: Xiaofei Wu,Rongmei Liang
类目: Machine Learning (cs.LG); Computation (stat.CO); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:In large-scale supervised learning, penalized logistic regression (PLR) effectively addresses the overfitting problem by introducing regularization terms, yet its performance still depends on efficient variable selection strategies. This paper theoretically demonstrates that label noise stemming from manual labeling, which is solely related to classification difficulty, represents a type of beneficial noise for variable selection in PLR. This benefit is reflected in a more accurate estimation of the selected non-zero coefficients when compared with the case where only truth labels are used. Under large-scale settings, the sample size for PLR can become very large, making it infeasible to store on a single machine. In such cases, distributed computing methods are required to handle the PLR model with manual labeling. This paper presents a partition-insensitive parallel algorithm founded on the ADMM (alternating direction method of multipliers) algorithm to address PLR by incorporating manual labeling. The partition insensitivity of the proposed algorithm refers to the fact that the solutions obtained by the algorithm will not change with the distributed storage of data. In addition, the algorithm has global convergence and a sublinear convergence rate. Experimental results indicate that, as compared with traditional variable selection classification techniques, the PLR with manually-labeled noisy data achieves higher estimation and classification accuracy across multiple large-scale datasets.

[LG-21] Hyper-Transforming Latent Diffusion Models

链接: https://arxiv.org/abs/2504.16580
作者: Ignacio Peis,Batuhan Koyuncu,Isabel Valera,Jes Frellsen
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:We introduce a novel generative framework for functions by integrating Implicit Neural Representations (INRs) and Transformer-based hypernetworks into latent variable models. Unlike prior approaches that rely on MLP-based hypernetworks with scalability limitations, our method employs a Transformer-based decoder to generate INR parameters from latent variables, addressing both representation capacity and computational efficiency. Our framework extends latent diffusion models (LDMs) to INR generation by replacing standard decoders with a Transformer-based hypernetwork, which can be trained either from scratch or via hyper-transforming-a strategy that fine-tunes only the decoder while freezing the pre-trained latent space. This enables efficient adaptation of existing generative models to INR-based representations without requiring full retraining.

[LG-22] Unified Molecule Generation and Property Prediction

链接: https://arxiv.org/abs/2504.16559
作者: Adam Izdebski,Jan Olszewski,Pankhil Gawade,Krzysztof Koras,Serra Korkmaz,Valentin Rauscher,Jakub M. Tomczak,Ewa Szczurek
类目: Machine Learning (cs.LG); Quantitative Methods (q-bio.QM)
*备注: 17 pages, 4 figures

点击查看摘要

Abstract:Modeling the joint distribution of the data samples and their properties allows to construct a single model for both data generation and property prediction, with synergistic capabilities reaching beyond purely generative or predictive models. However, training joint models presents daunting architectural and optimization challenges. Here, we propose Hyformer, a transformer-based joint model that successfully blends the generative and predictive functionalities, using an alternating attention mask together with a unified pre-training scheme. We show that Hyformer rivals other joint models, as well as state-of-the-art molecule generation and property prediction models. Additionally, we show the benefits of joint modeling in downstream tasks of molecular representation learning, hit identification and antimicrobial peptide design.

[LG-23] Least-Squares-Embedded Optimization for Accelerated Convergence of PINNs in Acoustic Wavefield Simulations

链接: https://arxiv.org/abs/2504.16553
作者: Mohammad Mahdi Abedi,David Pardo,Tariq Alkhalifah
类目: Machine Learning (cs.LG); Computational Physics (physics.comp-ph); Geophysics (physics.geo-ph)
*备注:

点击查看摘要

Abstract:Physics-Informed Neural Networks (PINNs) have shown promise in solving partial differential equations (PDEs), including the frequency-domain Helmholtz equation. However, standard training of PINNs using gradient descent (GD) suffers from slow convergence and instability, particularly for high-frequency wavefields. For scattered acoustic wavefield simulation based on the Helmholtz equation, we derive a hybrid optimization framework that accelerates training convergence by embedding a least-squares (LS) solver directly into the GD loss function. This formulation enables optimal updates for the linear output layer. Our method is applicable with or without perfectly matched layers (PML), and we provide practical tensor-based implementations for both scenarios. Numerical experiments on benchmark velocity models demonstrate that our approach achieves faster convergence, higher accuracy, and improved stability compared to conventional PINN training. In particular, our results show that the LS-enhanced method converges rapidly even in cases where standard GD-based training fails. The LS solver operates on a small normal matrix, ensuring minimal computational overhead and making the method scalable for large-scale wavefield simulations.
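
The LS-embedding principle can be illustrated on a generic regression network: treat the final linear layer as a closed-form least-squares problem over the hidden features, so gradient descent only has to shape the nonlinear part. This is a hedged sketch of the idea on a toy 1D target, not the paper's Helmholtz/PML formulation:

```python
import torch
import torch.nn as nn

# Nonlinear feature extractor; the linear readout is solved, not learned.
hidden = nn.Sequential(nn.Linear(1, 64), nn.Tanh(), nn.Linear(64, 64), nn.Tanh())
opt = torch.optim.Adam(hidden.parameters(), lr=1e-3)

x = torch.linspace(-1, 1, 256).unsqueeze(1)
y = torch.sin(3.0 * x)                 # toy target standing in for a PDE loss

for step in range(500):
    phi = hidden(x)                    # (256, 64) hidden features
    # Closed-form ridge-regularized LS solve for the output weights; the
    # solve is differentiable, so GD shapes the features while the readout
    # is always optimal for the current features.
    gram = phi.T @ phi + 1e-6 * torch.eye(64)
    w = torch.linalg.solve(gram, phi.T @ y)
    loss = ((phi @ w - y) ** 2).mean() # LS-embedded training loss
    opt.zero_grad(); loss.backward(); opt.step()
```

Note the normal matrix `gram` is only 64x64 regardless of the number of training points, which mirrors the abstract's remark about minimal computational overhead.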

[LG-24] A Comprehensive Survey of Synthetic Tabular Data Generation

链接: https://arxiv.org/abs/2504.16506
作者: Ruxue Shi,Yili Wang,Mengnan Du,Xu Shen,Xin Wang
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Tabular data remains one of the most prevalent and critical data formats across diverse real-world applications. However, its effective use in machine learning (ML) is often constrained by challenges such as data scarcity, privacy concerns, and class imbalance. Synthetic data generation has emerged as a promising solution, leveraging generative models to learn the distribution of real datasets and produce high-fidelity, privacy-preserving samples. Various generative paradigms have been explored, including energy-based models (EBMs), variational autoencoders (VAEs), generative adversarial networks (GANs), large language models (LLMs), and diffusion models. While several surveys have investigated synthetic tabular data generation, most focus on narrow subdomains or specific generative methods, such as GANs, diffusion models, or privacy-preserving techniques. This limited scope often results in fragmented insights, lacking a comprehensive synthesis that bridges diverse approaches. In particular, recent advances driven by LLMs and diffusion-based models remain underexplored. This gap hinders a holistic understanding of the field's evolution, methodological interplay, and open challenges. To address this, our survey provides a unified and systematic review of synthetic tabular data generation. Our contributions are threefold: (1) we propose a comprehensive taxonomy that organizes existing methods into traditional approaches, diffusion-based methods, and LLM-based models, and provide an in-depth comparative analysis; (2) we detail the complete pipeline for synthetic tabular data generation, including data synthesis, post-processing, and evaluation; (3) we identify major challenges, explore real-world applications, and outline open research questions and future directions to guide future work in this rapidly evolving area.

[LG-25] Neuro-Evolutionary Approach to Physics-Aware Symbolic Regression

链接: https://arxiv.org/abs/2504.16503
作者: Jiří Kubalík,Robert Babuška
类目: Neural and Evolutionary Computing (cs.NE); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Symbolic regression is a technique that can automatically derive analytic models from data. Traditionally, symbolic regression has been implemented primarily through genetic programming that evolves populations of candidate solutions sampled by genetic operators, crossover and mutation. More recently, neural networks have been employed to learn the entire analytical model, i.e., its structure and coefficients, using regularized gradient-based optimization. Although this approach tunes the model’s coefficients better, it is prone to premature convergence to suboptimal model structures. Here, we propose a neuro-evolutionary symbolic regression method that combines the strengths of evolutionary-based search for optimal neural network (NN) topologies with gradient-based tuning of the network’s parameters. Due to the inherent high computational demand of evolutionary algorithms, it is not feasible to learn the parameters of every candidate NN topology to full convergence. Thus, our method employs a memory-based strategy and population perturbations to enhance exploitation and reduce the risk of being trapped in suboptimal NNs. In this way, each NN topology can be trained using only a short sequence of backpropagation iterations. The proposed method was experimentally evaluated on three real-world test problems and has been shown to outperform other NN-based approaches regarding the quality of the models obtained.

[LG-26] Dynamic Time-aware Continual User Representation Learning

链接: https://arxiv.org/abs/2504.16501
作者: Seungyoon Choi,Sein Kim,Hongseok Kang,Wonjoong Kim,Chanyoung Park
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Traditional user modeling (UM) approaches have primarily focused on designing models for a single specific task, but they face limitations in generalization and adaptability across various tasks. Recognizing these challenges, recent studies have shifted towards continual learning (CL)-based universal user representation learning aiming to develop a single model capable of handling multiple tasks. Despite advancements, existing methods are in fact evaluated under an unrealistic scenario that does not consider the passage of time as tasks progress, which overlooks newly emerged items that may change the item distribution of previous tasks. In this paper, we introduce a practical evaluation scenario on which CL-based universal user representation learning approaches should be evaluated, which takes into account the passage of time as tasks progress. Then, we propose a novel framework Dynamic Time-aware continual user representation learner, named DITTO, designed to alleviate catastrophic forgetting despite continuous shifts in item distribution, while also allowing the knowledge acquired from previous tasks to adapt to the current shifted item distribution. Through our extensive experiments, we demonstrate the superiority of DITTO over state-of-the-art methods under a practical evaluation scenario. Our source code is available at this https URL.

[LG-27] Seeking Flat Minima over Diverse Surrogates for Improved Adversarial Transferability: A Theoretical Framework and Algorithmic Instantiation

链接: https://arxiv.org/abs/2504.16474
作者: Meixi Zheng,Kehan Wu,Yanbo Fan,Rui Huang,Baoyuan Wu
类目: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注: 26 pages, 6 figures

点击查看摘要

Abstract:The transfer-based black-box adversarial attack setting poses the challenge of crafting an adversarial example (AE) on known surrogate models that remain effective against unseen target models. Due to the practical importance of this task, numerous methods have been proposed to address this challenge. However, most previous methods are heuristically designed and intuitively justified, lacking a theoretical foundation. To bridge this gap, we derive a novel transferability bound that offers provable guarantees for adversarial transferability. Our theoretical analysis has the advantages of (i) deepening our understanding of previous methods by building a general attack framework and (ii) providing guidance for designing an effective attack algorithm. Our theoretical results demonstrate that optimizing AEs toward flat minima over the surrogate model set, while controlling the surrogate-target model shift measured by the adversarial model discrepancy, yields a comprehensive guarantee for AE transferability. The results further lead to a general transfer-based attack framework, within which we observe that previous methods consider only partial factors contributing to the transferability. Algorithmically, inspired by our theoretical results, we first elaborately construct the surrogate model set in which models exhibit diverse adversarial vulnerabilities with respect to AEs to narrow an instantiated adversarial model discrepancy. Then, a model-Diversity-compatible Reverse Adversarial Perturbation (DRAP) is generated to effectively promote the flatness of AEs over diverse surrogate models to improve transferability. Extensive experiments on NIPS2017 and CIFAR-10 datasets against various target models demonstrate the effectiveness of our proposed attack.

[LG-28] An Effective Gram Matrix Characterizes Generalization in Deep Networks

链接: https://arxiv.org/abs/2504.16450
作者: Rubing Yang,Pratik Chaudhari
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:We derive a differential equation that governs the evolution of the generalization gap when a deep network is trained by gradient descent. This differential equation is controlled by two quantities, a contraction factor that brings together trajectories corresponding to slightly different datasets, and a perturbation factor that accounts for them training on different datasets. We analyze this differential equation to compute an "effective Gram matrix" that characterizes the generalization gap after training in terms of the alignment between this Gram matrix and a certain "initial residual". Empirical evaluations on image classification datasets indicate that this analysis can predict the test loss accurately. Further, at any point during training, the residual predominantly lies in the subspace of the effective Gram matrix with the smallest eigenvalues. This indicates that the training process is benign, i.e., it does not lead to significant deterioration of the generalization gap (which is zero at initialization). The alignment between the effective Gram matrix and the residual is different for different datasets and architectures. The match/mismatch of the data and the architecture is primarily responsible for good/bad generalization.

[LG-29] From Past to Present: A Survey of Malicious URL Detection Techniques Datasets and Code Repositories

链接: https://arxiv.org/abs/2504.16449
作者: Ye Tian,Yanqiu Yu,Jianguo Sun,Yanbin Wang
类目: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Malicious URLs persistently threaten the cybersecurity ecosystem, by either deceiving users into divulging private data or distributing harmful payloads to infiltrate host systems. Gaining timely insights into the current state of this ongoing battle holds significant importance. However, existing reviews exhibit 4 critical gaps: 1) Their reliance on algorithm-centric taxonomies obscures understanding of how detection approaches exploit specific modal information channels; 2) They fail to incorporate pivotal LLM/Transformer-based defenses; 3) No open-source implementations are collected to facilitate benchmarking; 4) Insufficient dataset coverage. This paper presents a comprehensive review of malicious URL detection technologies, systematically analyzing methods from traditional blacklisting to advanced deep learning approaches (e.g. Transformer, GNNs, and LLMs). Unlike prior surveys, we propose a novel modality-based taxonomy that categorizes existing works according to their primary data modalities (URL, HTML, Visual, etc.). This hierarchical classification enables both rigorous technical analysis and clear understanding of multimodal information utilization. Furthermore, to establish a profile of accessible datasets and address the lack of standardized benchmarking (where current studies often lack proper baseline comparisons), we curate and analyze: 1) publicly available datasets (2016-2024), and 2) open-source implementations from published works (2013-2025). Then, we outline essential design principles and architectural frameworks for product-level implementations. The review concludes by examining emerging challenges and proposing actionable directions for future research. We maintain a GitHub repository for ongoing curating datasets and open-source implementations: this https URL.

[LG-30] Node Assigned physics-informed neural networks for thermal-hydraulic system simulation: CVH/FL module

链接: https://arxiv.org/abs/2504.16447
作者: Jeesuk Shin,Cheolwoong Kim,Sunwoong Yang,Minseo Lee,Sung Joong Kim,Joongoo Jeon
类目: Machine Learning (cs.LG)
*备注: 40 pages, 12 figures. Jeesuk Shin and Cheolwoong Kim contributed equally to this work. Sung Joong Kim and Joongoo Jeon are co-corresponding authors

点击查看摘要

Abstract:Severe accidents (SAs) in nuclear power plants have been analyzed using thermal-hydraulic (TH) system codes such as MELCOR and MAAP. These codes efficiently simulate the progression of SAs, while they still have inherent limitations due to their inconsistent finite difference schemes. The use of empirical schemes incorporating both implicit and explicit formulations inherently induces unidirectional coupling in multi-physics analyses. The objective of this study is to develop a novel numerical method for TH system codes using physics-informed neural network (PINN). They have shown strength in solving multi-physics due to the innate feature of neural networks-automatic differentiation. We propose a node-assigned PINN (NA-PINN) that is suitable for the control volume approach-based system codes. NA-PINN addresses the issue of spatial governing equation variation by assigning an individual network to each nodalization of the system code, such that spatial information is excluded from both the input and output domains, and each subnetwork learns to approximate a purely temporal solution. In this phase, we evaluated the accuracy of the PINN methods for the hydrodynamic module. In the 6 water tank simulation, PINN and NA-PINN showed maximum absolute errors of 1.678 and 0.007, respectively. It should be noted that only NA-PINN demonstrated acceptable accuracy. To the best of the authors’ knowledge, this is the first study to successfully implement a system code using PINN. Our future work involves extending NA-PINN to a multi-physics solver and developing it in a surrogate manner.

[LG-31] Target Concrete Score Matching: A Holistic Framework for Discrete Diffusion

链接: https://arxiv.org/abs/2504.16431
作者: Ruixiang Zhang,Shuangfei Zhai,Yizhe Zhang,James Thornton,Zijing Ou,Joshua Susskind,Navdeep Jaitly
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Discrete diffusion is a promising framework for modeling and generating discrete data. In this work, we present Target Concrete Score Matching (TCSM), a novel and versatile objective for training and fine-tuning discrete diffusion models. TCSM provides a general framework with broad applicability. It supports pre-training discrete diffusion models directly from data samples, and many existing discrete diffusion approaches naturally emerge as special cases of our more general TCSM framework. Furthermore, the same TCSM objective extends to post-training of discrete diffusion models, including fine-tuning using reward functions or preference data, and distillation of knowledge from pre-trained autoregressive models. These new capabilities stem from the core idea of TCSM, estimating the concrete score of the target distribution, which resides in the original (clean) data space. This allows seamless integration with reward functions and pre-trained models, which inherently only operate in the clean data space rather than the noisy intermediate spaces of diffusion processes. Our experiments on language modeling tasks demonstrate that TCSM matches or surpasses current methods. Additionally, TCSM is versatile, applicable to both pre-training and post-training scenarios, offering greater flexibility and sample efficiency.

[LG-32] Natural Policy Gradient for Average Reward Non-Stationary RL

链接: https://arxiv.org/abs/2504.16415
作者: Neharika Jali,Eshika Pathak,Pranay Sharma,Guannan Qu,Gauri Joshi
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:We consider the problem of non-stationary reinforcement learning (RL) in the infinite-horizon average-reward setting. We model it by a Markov Decision Process with time-varying rewards and transition probabilities, with a variation budget of \Delta_T. Existing non-stationary RL algorithms focus on model-based and model-free value-based methods. Policy-based methods, despite their flexibility in practice, are not theoretically well understood in non-stationary RL. We propose and analyze the first model-free policy-based algorithm, Non-Stationary Natural Actor-Critic (NS-NAC), a policy gradient method with a restart-based exploration for change and a novel interpretation of learning rates as adapting factors. Further, we present a bandit-over-RL based parameter-free algorithm BORL-NS-NAC that does not require prior knowledge of the variation budget \Delta_T. We present a dynamic regret of \tilde{\mathscr{O}}(|S|^{1/2} |A|^{1/2} \Delta_T^{1/6} T^{5/6}) for both algorithms, where T is the time horizon, and |S|, |A| are the sizes of the state and action spaces. The regret analysis leverages a novel adaptation of the Lyapunov function analysis of NAC to dynamic environments and characterizes the effects of simultaneous updates in policy, value function estimate and changes in the environment.

[LG-33] Circinus: Efficient Query Planner for Compound ML Serving

链接: https://arxiv.org/abs/2504.16397
作者: Banruo Liu,Wei-Yu Lin,Minghao Fang,Yihan Jiang,Fan Lai
类目: Databases (cs.DB); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:The rise of compound AI serving – integrating multiple operators in a pipeline that may span edge and cloud tiers – enables end-user applications such as autonomous driving, generative AI-powered meeting companions, and immersive gaming. Achieving high service goodput – i.e., meeting service level objectives (SLOs) for pipeline latency, accuracy, and costs – requires effective planning of operator placement, configuration, and resource allocation across infrastructure tiers. However, the diverse SLO requirements, varying edge capabilities, and high query volumes create an enormous planning search space, rendering current solutions fundamentally limited for real-time serving and cost-efficient deployments. This paper presents Circinus, an SLO-aware query planner for large-scale compound AI workloads. Circinus novelly decomposes multi-query planning and multi-dimensional SLO objectives while preserving global decision quality. By exploiting plan similarities within and across queries, it significantly reduces search steps. It further improves per-step efficiency with a precision-aware plan profiler that incrementally profiles and strategically applies early stopping based on imprecise estimates of plan performance. At scale, Circinus selects query-plan combinations to maximize global SLO goodput. Evaluations in real-world settings show that Circinus improves service goodput by 3.2-5.0 \times, accelerates query planning by 4.2-5.8 \times, achieving query response in seconds, while reducing deployment costs by 3.2-4.0 \times over state of the arts even in their intended single-tier deployments.

[LG-34] Disentangled Graph Representation Based on Substructure-Aware Graph Optimal Matching Kernel Convolutional Networks

链接: https://arxiv.org/abs/2504.16360
作者: Mao Wang,Tao Wu,Xingping Xian,Shaojie Qiao,Weina Niu,Canyixing Cui
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Graphs effectively characterize relational data, driving graph representation learning methods that uncover underlying predictive information. As state-of-the-art approaches, Graph Neural Networks (GNNs) enable end-to-end learning for diverse tasks. Recent disentangled graph representation learning enhances interpretability by decoupling independent factors in graph data. However, existing methods often implicitly and coarsely characterize graph structures, limiting structural pattern analysis within the graph. This paper proposes the Graph Optimal Matching Kernel Convolutional Network (GOMKCN) to address this limitation. We view graphs as node-centric subgraphs, where each subgraph acts as a structural factor encoding position-specific information. This transforms graph prediction into structural pattern recognition. Inspired by CNNs, GOMKCN introduces the Graph Optimal Matching Kernel (GOMK) as a convolutional operator, computing similarities between subgraphs and learnable graph filters. Mathematically, GOMK maps subgraphs and filters into a Hilbert space, representing graphs as point sets. Disentangled representations emerge from projecting subgraphs onto task-optimized filters, which adaptively capture relevant structural patterns via gradient descent. Crucially, GOMK incorporates local correspondences in similarity measurement, resolving the trade-off between differentiability and accuracy in graph kernels. Experiments validate that GOMKCN achieves superior accuracy and interpretability in graph pattern mining and prediction. The framework advances the theoretical foundation for disentangled graph representation learning.

[LG-35] Property-Preserving Hashing for \ell_1-Distance Predicates: Applications to Countering Adversarial Input Attacks

链接: https://arxiv.org/abs/2504.16355
作者: Hassan Asghar,Chenhan Zhang,Dali Kaafar
类目: Cryptography and Security (cs.CR); Information Theory (cs.IT); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Perceptual hashing is used to detect whether an input image is similar to a reference image with a variety of security applications. Recently, they have been shown to succumb to adversarial input attacks which make small imperceptible changes to the input image yet the hashing algorithm does not detect its similarity to the original image. Property-preserving hashing (PPH) is a recent construct in cryptography, which preserves some property (predicate) of its inputs in the hash domain. Researchers have so far shown constructions of PPH for Hamming distance predicates, which, for instance, outputs 1 if two inputs are within Hamming distance t. A key feature of PPH is its strong correctness guarantee, i.e., the probability that the predicate will not be correctly evaluated in the hash domain is negligible. Motivated by the use case of detecting similar images under an adversarial setting, we propose the first PPH construction for an \ell_1-distance predicate. Roughly, this predicate checks if the two one-sided \ell_1-distances between two images are within a threshold t. Since many adversarial attacks use \ell_2-distance (related to \ell_1-distance) as the objective function to perturb the input image, by appropriately choosing the threshold t, we can force the attacker to add considerable noise to evade detection, and hence significantly deteriorate the image quality. Our proposed scheme is highly efficient, and runs in time O(t^2). For grayscale images of size 28 \times 28, we can evaluate the predicate in 0.0784 seconds when pixel values are perturbed by up to 1%. For larger RGB images of size 224 \times 224, by dividing the image into 1,000 blocks, we achieve times of 0.0128 seconds per block for 1% change, and up to 0.2641 seconds per block for 14% change.
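
To make the predicate concrete, here is a plaintext (non-hashed) check of the two one-sided \ell_1-distances, interpreted here as the positive and negative parts of the pixel-wise difference; this interpretation and the threshold choice are our assumptions, and the paper's contribution is evaluating such a predicate in the hash domain with cryptographic guarantees:

```python
import numpy as np

def one_sided_l1_predicate(x, y, t):
    d_pos = np.maximum(x - y, 0).sum()   # mass where x exceeds y
    d_neg = np.maximum(y - x, 0).sum()   # mass where y exceeds x
    return bool(d_pos <= t and d_neg <= t)

rng = np.random.default_rng(0)
img = rng.integers(0, 256, size=(28, 28)).astype(float)
noisy = img + rng.normal(scale=1.0, size=img.shape)
print(one_sided_l1_predicate(img, noisy, t=0.01 * img.sum()))
```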

[LG-36] ClarifyCoder: Clarification-Aware Fine-Tuning for Programmatic Problem Solving

链接: https://arxiv.org/abs/2504.16331
作者: Jie JW Wu,Manav Chaudhary,Davit Abrahamyan,Arhaan Khaku,Anjiang Wei,Fatemeh H. Fard
类目: oftware Engineering (cs.SE); Machine Learning (cs.LG)
*备注: 12 pages, 5 figures, 6 tables

点击查看摘要

Abstract:Large language models (LLMs) have demonstrated remarkable capabilities in code generation tasks. However, a significant gap remains between their current performance and that of expert software engineers. A key differentiator is that human engineers actively seek clarification when faced with ambiguous requirements, while LLMs typically generate code regardless of uncertainties in the problem description. We present ClarifyCoder, a novel framework with synthetic data generation and instruction-tuning that enables LLMs to identify ambiguities and request clarification before proceeding with code generation. While recent work has focused on LLM-based agents for iterative code generation, we argue that the fundamental ability to recognize and query ambiguous requirements should be intrinsic to the models themselves. Our approach consists of two main components: (1) a data synthesis technique that augments existing programming datasets with scenarios requiring clarification to generate clarification-aware training data, and (2) a fine-tuning strategy that teaches models to prioritize seeking clarification over immediate code generation when faced with incomplete or ambiguous requirements. We further provide an empirical analysis of integrating ClarifyCoder with standard fine-tuning for a joint optimization of both clarify-awareness and coding ability. Experimental results demonstrate that ClarifyCoder significantly improves the communication capabilities of Code LLMs through meaningful clarification dialogues while maintaining code generation capabilities.

[LG-37] PCF-Grasp: Converting Point Completion to Geometry Feature to Enhance 6-DoF Grasp

链接: https://arxiv.org/abs/2504.16320
作者: Yaofeng Cheng,Fusheng Zha,Wei Guo,Pengfei Wang,Chao Zeng,Lining Sun,Chenguang Yang
类目: Robotics (cs.RO); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:The 6-Degree of Freedom (DoF) grasp method based on point clouds has shown significant potential in enabling robots to grasp target objects. However, most existing methods are based on the point clouds (2.5D points) generated from single-view depth images. These point clouds capture only one side of the object's surface, providing incomplete geometry information that misleads the grasping algorithm when judging the shape of the target object, resulting in low grasping accuracy. Humans can accurately grasp objects from a single view by leveraging their geometry experience to estimate object shapes. Inspired by humans, we propose a novel 6-DoF grasping framework that converts the point completion results into object shape features to train the 6-DoF grasp network. Here, point completion can generate approximate complete points from the 2.5D points similar to the human geometry experience, and converting them into shape features is the way to utilize it to improve grasp efficiency. Furthermore, due to the gap between the network generation and actual execution, we integrate a score filter into our framework to select more executable grasp proposals for the real robot. This enables our method to maintain a high grasp quality in any camera viewpoint. Extensive experiments demonstrate that utilizing complete point features enables the generation of significantly more accurate grasp proposals and the inclusion of a score filter greatly enhances the credibility of real-world robot grasping. Our method achieves a success rate 17.8% higher than the state-of-the-art method in real-world experiments.

[LG-38] Semantics at an Angle: When Cosine Similarity Works Until It Doesn't

Link: https://arxiv.org/abs/2504.16318
Authors: Kisung You
Subjects: Machine Learning (cs.LG)
*Comments:

Abstract:Cosine similarity has become a standard metric for comparing embeddings in modern machine learning. Its scale-invariance and alignment with model training objectives have contributed to its widespread adoption. However, recent studies have revealed important limitations, particularly when embedding norms carry meaningful semantic information. This informal article offers a reflective and selective examination of the evolution, strengths, and limitations of cosine similarity. We highlight why it performs well in many settings, where it tends to break down, and how emerging alternatives are beginning to address its blind spots. We hope to offer a mix of conceptual clarity and practical perspective, especially for quantitative scientists who think about embeddings not just as vectors, but as geometric and philosophical objects.
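
A two-line illustration of the core point: cosine similarity is scale-invariant, so any semantics carried by the embedding norm is invisible to it.

```python
import numpy as np

def cosine(u, v):
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

# Two embeddings pointing the same way but with very different norms.
u = np.array([1.0, 2.0, 3.0])
v = 100.0 * u  # e.g. a far more frequent or "confident" token/document

print(cosine(u, v))  # 1.0: cosine is blind to the norm difference
print(np.dot(u, v))  # 1400.0: the dot product keeps the magnitude signal
```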

[LG-39] Affect Models Have Weak Generalizability to Atypical Speech

Link: https://arxiv.org/abs/2504.16283
Authors: Jaya Narain,Amrit Romana,Vikramjit Mitra,Colin Lea,Shirley Ren
Subjects: Machine Learning (cs.LG)
*Comments: Preprint

Abstract:Speech and voice conditions can alter the acoustic properties of speech, which could impact the performance of paralinguistic models for affect for people with atypical speech. We evaluate publicly available models for recognizing categorical and dimensional affect from speech on a dataset of atypical speech, comparing results to datasets of typical speech. We investigate three dimensions of speech atypicality: intelligibility, which is related to pronunciation; monopitch, which is related to prosody; and harshness, which is related to voice quality. We look at (1) distributional trends of categorical affect predictions within the dataset, (2) distributional comparisons of categorical affect predictions to similar datasets of typical speech, and (3) correlation strengths between text and speech predictions for spontaneous speech for valence and arousal. We find that the output of affect models is significantly impacted by the presence and degree of speech atypicalities. For instance, the percentage of speech predicted as sad is significantly higher for all types and grades of atypical speech when compared to similar typical speech datasets. In a preliminary investigation on improving robustness for atypical speech, we find that fine-tuning models on pseudo-labeled atypical speech data improves performance on atypical speech without impacting performance on typical speech. Our results emphasize the need for broader training and evaluation datasets for speech emotion models, and for modeling approaches that are robust to voice and speech differences.

[LG-40] Learning Explainable Dense Reward Shapes via Bayesian Optimization

Link: https://arxiv.org/abs/2504.16272
Authors: Ryan Koo,Ian Yang,Vipul Raheja,Mingyi Hong,Kwang-Sung Jun,Dongyeop Kang
Subjects: Machine Learning (cs.LG)
*Comments:

Abstract:Current reinforcement learning from human feedback (RLHF) pipelines for large language model (LLM) alignment typically assign scalar rewards to sequences, using the final token as a surrogate indicator for the quality of the entire sequence. However, this leads to sparse feedback and suboptimal token-level credit assignment. In this work, we frame reward shaping as an optimization problem focused on token-level credit assignment. We propose a reward-shaping function leveraging explainability methods such as SHAP and LIME to estimate per-token rewards from the reward model. To learn parameters of this shaping function, we employ a bilevel optimization framework that integrates Bayesian Optimization and policy training to handle noise from the token reward estimates. Our experiments show that achieving a better balance of token-level reward attribution leads to performance improvements over baselines on downstream tasks and finds an optimal policy faster during training. Furthermore, we show theoretically that explainability methods that are feature additive attribution functions maintain the optimal policy as the original reward.
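
The token-level credit-assignment idea can be sketched independently of the paper's bilevel optimization: given per-token attribution scores from an explainability method, spread the scalar sequence reward over tokens while preserving its sum. The attribution values below are made up.

```python
import numpy as np

def shape_token_rewards(attributions, sequence_reward):
    """Spread a scalar sequence-level reward over tokens in proportion to
    per-token attribution scores (e.g. SHAP/LIME values from the reward
    model), preserving the total return. The normalization scheme here is
    an illustrative choice, not the paper's learned shaping function."""
    a = np.abs(np.asarray(attributions, dtype=float))
    weights = a / (a.sum() + 1e-8)
    return weights * sequence_reward

# Hypothetical attributions for a 5-token response scored 2.0 overall.
token_rewards = shape_token_rewards([0.1, 0.05, 0.6, 0.2, 0.05], 2.0)
print(token_rewards, token_rewards.sum())  # dense rewards summing to 2.0
```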

[LG-41] COBRA: Algorithm-Architecture Co-optimized Binary Transformer Accelerator for Edge Inference

Link: https://arxiv.org/abs/2504.16269
Authors: Ye Qiao,Zhiheng Cheng,Yian Wang,Yifan Zhang,Yunzhe Deng,Sitao Huang
Subjects: Hardware Architecture (cs.AR); Machine Learning (cs.LG)
*Comments:

Abstract:Transformer-based models have demonstrated superior performance in various fields, including natural language processing and computer vision. However, their enormous model size and high demands in computation, memory, and communication limit their deployment to edge platforms for local, secure inference. Binary transformers offer a compact, low-complexity solution for edge deployment with reduced bandwidth needs and acceptable accuracy. However, existing binary transformers perform inefficiently on current hardware due to the lack of binary specific optimizations. To address this, we introduce COBRA, an algorithm-architecture co-optimized binary Transformer accelerator for edge computing. COBRA features a real 1-bit binary multiplication unit, enabling matrix operations with -1, 0, and +1 values, surpassing ternary methods. With further hardware-friendly optimizations in the attention block, COBRA achieves up to 3,894.7 GOPS throughput and 448.7 GOPS/Watt energy efficiency on edge FPGAs, delivering a 311x energy efficiency improvement over GPUs and a 3.5x throughput improvement over the state-of-the-art binary accelerator, with only negligible inference accuracy degradation.
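
The key hardware trick, matrix products whose weights take only the values -1, 0, and +1, can be emulated in a few lines; this numpy sketch shows why no real multiplier is needed (it is not COBRA's actual datapath).

```python
import numpy as np

def ternary_matmul(x, w):
    """Emulate the multiplier-free matrix product used by binary/ternary
    accelerators: with w in {-1, 0, +1}, each output is a signed sum, so
    hardware can replace multipliers with adders and sign flips."""
    assert set(np.unique(w)).issubset({-1, 0, 1})
    pos = x @ (w == 1)   # add activations where the weight is +1
    neg = x @ (w == -1)  # subtract activations where the weight is -1
    return pos - neg

rng = np.random.default_rng(0)
x = rng.standard_normal((2, 8))
w = rng.integers(-1, 2, size=(8, 4))
print(np.allclose(ternary_matmul(x, w), x @ w))  # True
```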

[LG-42] TeLLMe: An Energy-Efficient Ternary LLM Accelerator for Prefilling and Decoding on Edge FPGAs

Link: https://arxiv.org/abs/2504.16266
Authors: Ye Qiao,Zhiheng Cheng,Yifan Zhang,Yian Wang,Sitao Huang
Subjects: Hardware Architecture (cs.AR); Machine Learning (cs.LG)
*Comments:

Abstract:Deploying large language models (LLMs) on edge platforms is challenged by their high computational and memory demands. Although recent low-bit quantization methods (e.g., BitNet, DeepSeek) compress weights to as little as 1.58 bits with minimal accuracy loss, edge deployment is still constrained by limited on-chip resources, power budgets, and the often-neglected latency of the prefill phase. We present TeLLMe, the first ternary LLM accelerator for low-power FPGAs (e.g., AMD KV260) that fully supports both prefill and autoregressive decoding using 1.58-bit weights and 8-bit activations. Our contributions include: (1) a table-lookup matrix engine for ternary matmul that merges grouped activations with online precomputation to minimize resource use; (2) a fused, bandwidth-efficient attention module featuring a reversed reordering scheme to accelerate prefill; and (3) a tightly integrated normalization and quantization–dequantization unit optimized for ultra-low-bit inference. Under a 7W power budget, TeLLMe delivers up to 9 tokens/s throughput over 1,024-token contexts and prefill latencies of 0.55–1.15 s for 64–128 token prompts, marking a significant energy-efficiency advance and establishing a new edge FPGA benchmark for generative AI.

[LG-43] Learning Energy-Based Generative Models via Potential Flow: A Variational Principle Approach to Probability Density Homotopy Matching

Link: https://arxiv.org/abs/2504.16262
Authors: Junn Yong Loo,Michelle Adeline,Julia Kaiwen Lau,Fang Yu Leong,Hwa Hui Tew,Arghya Pal,Vishnu Monn Baskaran,Chee-Ming Ting,Raphaël C.-W. Phan
Subjects: Machine Learning (cs.LG)
*Comments: Accepted by Transactions on Machine Learning Research (TMLR)

Abstract:Energy-based models (EBMs) are a powerful class of probabilistic generative models due to their flexibility and interpretability. However, relationships between potential flows and explicit EBMs remain underexplored, while contrastive divergence training via implicit Markov chain Monte Carlo (MCMC) sampling is often unstable and expensive in high-dimensional settings. In this paper, we propose Variational Potential Flow Bayes (VPFB), a new energy-based generative framework that eliminates the need for implicit MCMC sampling and does not rely on auxiliary networks or cooperative training. VPFB learns an energy-parameterized potential flow by constructing a flow-driven density homotopy that is matched to the data distribution through a variational loss minimizing the Kullback-Leibler divergence between the flow-driven and marginal homotopies. This principled formulation enables robust and efficient generative modeling while preserving the interpretability of EBMs. Experimental results on image generation, interpolation, out-of-distribution detection, and compositional generation confirm the effectiveness of VPFB, showing that our method performs competitively with existing approaches in terms of sample quality and versatility across diverse generative modeling tasks.

[LG-44] FairPlay: A Collaborative Approach to Mitigate Bias in Datasets for Improved AI Fairness

Link: https://arxiv.org/abs/2504.16255
Authors: Tina Behzad,Mithilesh Kumar Singh,Anthony J. Ripa,Klaus Mueller
Subjects: Machine Learning (cs.LG); Computers and Society (cs.CY); Human-Computer Interaction (cs.HC)
*Comments: Accepted at ACM CSCW 2025. 30 pages total (including references and supplementary material). Contains 10 figures

Abstract:The issue of fairness in decision-making is a critical one, especially given the variety of stakeholder demands for differing and mutually incompatible versions of fairness. Adopting a strategic interaction of perspectives provides an alternative to enforcing a singular standard of fairness. We present a web-based software application, FairPlay, that enables multiple stakeholders to debias datasets collaboratively. With FairPlay, users can negotiate and arrive at a mutually acceptable outcome without a universally agreed-upon theory of fairness. In the absence of such a tool, reaching a consensus would be highly challenging due to the lack of a systematic negotiation process and the inability to modify and observe changes. We have conducted user studies that demonstrate the success of FairPlay, as users could reach a consensus within about five rounds of gameplay, illustrating the application’s potential for enhancing fairness in AI systems.

[LG-45] General Post-Processing Framework for Fairness Adjustment of Machine Learning Models

Link: https://arxiv.org/abs/2504.16238
Authors: Léandre Eberhard,Nirek Sharma,Filipp Shelobolin,Aalok Ganesh Shanbhag
Subjects: Machine Learning (cs.LG)
*Comments: Submitted to FAccT 2025. Does not include reviewer feedback yet

Abstract:As machine learning increasingly influences critical domains such as credit underwriting, public policy, and talent acquisition, ensuring compliance with fairness constraints is both a legal and ethical imperative. This paper introduces a novel framework for fairness adjustments that applies to diverse machine learning tasks, including regression and classification, and accommodates a wide range of fairness metrics. Unlike traditional approaches categorized as pre-processing, in-processing, or post-processing, our method adapts in-processing techniques for use as a post-processing step. By decoupling fairness adjustments from the model training process, our framework preserves model performance on average while enabling greater flexibility in model development. Key advantages include eliminating the need for custom loss functions, enabling fairness tuning using different datasets, accommodating proprietary models as black-box systems, and providing interpretable insights into the fairness adjustments. We demonstrate the effectiveness of this approach by comparing it to Adversarial Debiasing, showing that our framework achieves a comparable fairness/accuracy tradeoff on real-world datasets.
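
As a toy example of the decoupling the abstract describes, one can post-process a black-box model's scores with per-group decision thresholds; this sketch targets a common positive rate per group and is not the paper's actual adjustment procedure.

```python
import numpy as np

def fair_thresholds(scores, groups, target_rate):
    """Pick a per-group threshold so each group's positive rate matches
    target_rate, leaving the trained model untouched (black box)."""
    return {g: np.quantile(scores[groups == g], 1.0 - target_rate)
            for g in np.unique(groups)}

rng = np.random.default_rng(1)
scores = rng.uniform(size=1000)          # black-box model scores
groups = rng.integers(0, 2, size=1000)   # protected attribute
th = fair_thresholds(scores, groups, target_rate=0.2)
decisions = scores >= np.vectorize(th.get)(groups)
for g in (0, 1):
    print(g, decisions[groups == g].mean())  # ~0.2 for both groups
```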

[LG-46] Towards a Distributed Federated Learning Aggregation Placement using Particle Swarm Intelligence

Link: https://arxiv.org/abs/2504.16227
Authors: Amir Ali-Pour,Sadra Bekrani,Laya Samizadeh,Julien Gascon-Samson
Subjects: Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE); Networking and Internet Architecture (cs.NI)
*Comments:

Abstract:Federated learning has become a promising distributed learning concept with extra insurance on data privacy. Extensive studies on various models of Federated learning have been done since the coinage of its term. One of the important derivatives of federated learning is hierarchical semi-decentralized federated learning, which distributes the load of the aggregation task over multiple nodes and parallelizes the aggregation workload at the breadth of each level of the hierarchy. Various methods have also been proposed to perform inter-cluster and intra-cluster aggregation optimally. Most of the solutions, nonetheless, require monitoring the nodes’ performance and resource consumption at each round, which necessitates frequently exchanging systematic data. To optimally perform distributed aggregation in SDFL with minimal reliance on systematic data, we propose Flag-Swap, a Particle Swarm Optimization (PSO) method that optimizes the aggregation placement according only to the processing delay. Our simulation results show that PSO-based placement can find the optimal placement relatively fast, even in scenarios with many clients as candidates for aggregation. Our real-world docker-based implementation of Flag-Swap over the recently emerged FL framework shows superior performance compared to black-box-based deterministic placement strategies, completing about 43% faster than random placement and 32% faster than uniform placement in terms of total processing time.
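
Below is a generic particle swarm sketch for delay-only aggregation placement, in the spirit of Flag-Swap; the delay vector, swarm hyperparameters, and the rounding of particle coordinates to node indices are all illustrative assumptions (duplicate node picks are not deduplicated, for brevity).

```python
import numpy as np

rng = np.random.default_rng(0)
delay = rng.uniform(1, 10, size=20)      # processing delay per candidate node
n_slots, n_particles, iters = 3, 12, 50  # place 3 aggregators

def cost(p):
    # Map continuous particle coordinates to node indices, sum their delays.
    idx = np.clip(p.astype(int), 0, len(delay) - 1)
    return delay[idx].sum()

pos = rng.uniform(0, len(delay), size=(n_particles, n_slots))
vel = np.zeros_like(pos)
pbest, pbest_cost = pos.copy(), np.array([cost(p) for p in pos])
gbest = pbest[pbest_cost.argmin()].copy()

for _ in range(iters):
    r1, r2 = rng.uniform(size=pos.shape), rng.uniform(size=pos.shape)
    vel = 0.7 * vel + 1.5 * r1 * (pbest - pos) + 1.5 * r2 * (gbest - pos)
    pos = np.clip(pos + vel, 0, len(delay) - 1e-9)
    costs = np.array([cost(p) for p in pos])
    better = costs < pbest_cost
    pbest[better], pbest_cost[better] = pos[better], costs[better]
    gbest = pbest[pbest_cost.argmin()].copy()

print(sorted(gbest.astype(int)), pbest_cost.min())  # chosen nodes, total delay
```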

[LG-47] Deep Learning Meets Process-Based Models: A Hybrid Approach to Agricultural Challenges

Link: https://arxiv.org/abs/2504.16141
Authors: Yue Shi,Liangxiu Han,Xin Zhang,Tam Sobeih,Thomas Gaiser,Nguyen Huu Thuy,Dominik Behrend,Amit Kumar Srivastava,Krishnagopal Halder,Frank Ewert
Subjects: Machine Learning (cs.LG)
*Comments:

Abstract:Process-based models (PBMs) and deep learning (DL) are two key approaches in agricultural modelling, each offering distinct advantages and limitations. PBMs provide mechanistic insights based on physical and biological principles, ensuring interpretability and scientific rigour. However, they often struggle with scalability, parameterisation, and adaptation to heterogeneous environments. In contrast, DL models excel at capturing complex, nonlinear patterns from large datasets but may suffer from limited interpretability, high computational demands, and overfitting in data-scarce scenarios. This study presents a systematic review of PBMs, DL models, and hybrid PBM-DL frameworks, highlighting their applications in agricultural and environmental modelling. We classify hybrid PBM-DL approaches into DL-informed PBMs, where neural networks refine process-based models, and PBM-informed DL, where physical constraints guide deep learning predictions. Additionally, we conduct a case study on crop dry biomass prediction, comparing hybrid models against standalone PBMs and DL models under varying data quality, sample sizes, and spatial conditions. The results demonstrate that hybrid models consistently outperform traditional PBMs and DL models, offering greater robustness to noisy data and improved generalisation across unseen locations. Finally, we discuss key challenges, including model interpretability, scalability, and data requirements, alongside actionable recommendations for advancing hybrid modelling in agriculture. By integrating domain knowledge with AI-driven approaches, this study contributes to the development of scalable, interpretable, and reproducible agricultural models that support data-driven decision-making for sustainable agriculture.
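
One common flavour of the PBM-informed DL category can be sketched as a physics-penalized loss; the linear `process_model` stand-in and the 0.1 penalty weight are assumptions for illustration, not the paper's setup.

```python
import torch

def process_model(x):        # stand-in mechanistic model (an assumption):
    return 3.0 * x[:, :1]    # e.g. biomass grows linearly with radiation

net = torch.nn.Sequential(torch.nn.Linear(2, 16), torch.nn.ReLU(),
                          torch.nn.Linear(16, 1))
opt = torch.optim.Adam(net.parameters(), lr=1e-2)

x = torch.rand(256, 2)
y = 3.0 * x[:, :1] + 0.5 * torch.sin(4 * x[:, 1:]) + 0.05 * torch.randn(256, 1)

for step in range(300):
    pred = net(x)
    data_loss = torch.mean((pred - y) ** 2)
    # Physics term keeps predictions near the PBM where data are noisy/sparse.
    pbm_loss = torch.mean((pred - process_model(x)) ** 2)
    loss = data_loss + 0.1 * pbm_loss
    opt.zero_grad(); loss.backward(); opt.step()

print(float(data_loss), float(pbm_loss))
```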

[LG-48] Virology Capabilities Test (VCT): A Multimodal Virology QA Benchmark

Link: https://arxiv.org/abs/2504.16137
Authors: Jasper Götting,Pedro Medeiros,Jon G Sanders,Nathaniel Li,Long Phan,Karam Elabd,Lennart Justen,Dan Hendrycks,Seth Donoughe
Subjects: Computers and Society (cs.CY); Machine Learning (cs.LG)
*Comments: 31 pages

Abstract:We present the Virology Capabilities Test (VCT), a large language model (LLM) benchmark that measures the capability to troubleshoot complex virology laboratory protocols. Constructed from the inputs of dozens of PhD-level expert virologists, VCT consists of 322 multimodal questions covering fundamental, tacit, and visual knowledge that is essential for practical work in virology laboratories. VCT is difficult: expert virologists with access to the internet score an average of 22.1% on questions specifically in their sub-areas of expertise. However, the most performant LLM, OpenAI’s o3, reaches 43.8% accuracy, outperforming 94% of expert virologists even within their sub-areas of specialization. The ability to provide expert-level virology troubleshooting is inherently dual-use: it is useful for beneficial research, but it can also be misused. Therefore, the fact that publicly available models outperform virologists on VCT raises pressing governance considerations. We propose that the capability of LLMs to provide expert-level troubleshooting of dual-use virology work should be integrated into existing frameworks for handling dual-use technologies in the life sciences.

[LG-49] Active Learning Methods for Efficient Data Utilization and Model Performance Enhancement

Link: https://arxiv.org/abs/2504.16136
Authors: Chiung-Yi Tseng,Junhao Song,Ziqian Bi,Tianyang Wang,Chia Xin Liang,Ming Liu
Subjects: Machine Learning (cs.LG)
*Comments:

Abstract:In the era of data-driven intelligence, the paradox of data abundance and annotation scarcity has emerged as a critical bottleneck in the advancement of machine learning. This paper gives a detailed overview of Active Learning (AL), which is a strategy in machine learning that helps models achieve better performance using fewer labeled examples. It introduces the basic concepts of AL and discusses how it is used in various fields such as computer vision, natural language processing, transfer learning, and real-world applications. The paper focuses on important research topics such as uncertainty estimation, handling of class imbalance, domain adaptation, fairness, and the creation of strong evaluation metrics and benchmarks. It also shows that learning methods inspired by humans and guided by questions can improve data efficiency and help models learn more effectively. In addition, this paper talks about current challenges in the field, including the need to rebuild trust, ensure reproducibility, and deal with inconsistent methodologies. It points out that AL often gives better results than passive learning, especially when good evaluation measures are used. This work aims to be useful for both researchers and practitioners by providing key insights and proposing directions for future progress in active learning.
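
A minimal uncertainty-sampling loop, the textbook AL strategy this survey covers, looks like the following sketch (synthetic data, least-confidence acquisition):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
rng = np.random.default_rng(0)
labeled = list(rng.choice(len(X), size=20, replace=False))
pool = [i for i in range(len(X)) if i not in labeled]

for round_ in range(10):
    clf = LogisticRegression(max_iter=1000).fit(X[labeled], y[labeled])
    proba = clf.predict_proba(X[pool])
    # Least-confidence acquisition: query the point the model is least sure of.
    query = pool[int(np.argmin(proba.max(axis=1)))]
    labeled.append(query)
    pool.remove(query)

print(len(labeled), clf.score(X, y))
```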

[LG-50] Representation Learning for Tabular Data: A Comprehensive Survey

Link: https://arxiv.org/abs/2504.16109
Authors: Jun-Peng Jiang,Si-Yang Liu,Hao-Run Cai,Qile Zhou,Han-Jia Ye
Subjects: Machine Learning (cs.LG)
*Comments:

Abstract:Tabular data, structured as rows and columns, is among the most prevalent data types in machine learning classification and regression applications. Models for learning from tabular data have continuously evolved, with Deep Neural Networks (DNNs) recently demonstrating promising results through their capability of representation learning. In this survey, we systematically introduce the field of tabular representation learning, covering the background, challenges, and benchmarks, along with the pros and cons of using DNNs. We organize existing methods into three main categories according to their generalization capabilities: specialized, transferable, and general models. Specialized models focus on tasks where training and evaluation occur within the same data distribution. We introduce a hierarchical taxonomy for specialized models based on the key aspects of tabular data – features, samples, and objectives – and delve into detailed strategies for obtaining high-quality feature- and sample-level representations. Transferable models are pre-trained on one or more datasets and subsequently fine-tuned on downstream tasks, leveraging knowledge acquired from homogeneous or heterogeneous sources, or even cross-modalities such as vision and language. General models, also known as tabular foundation models, extend this concept further, allowing direct application to downstream tasks without fine-tuning. We group these general models based on the strategies used to adapt across heterogeneous datasets. Additionally, we explore ensemble methods, which integrate the strengths of multiple tabular models. Finally, we discuss representative extensions of tabular learning, including open-environment tabular machine learning, multimodal learning with tabular data, and tabular understanding. More information can be found in the following repository: this https URL.

[LG-51] Application of an attention-based CNN-BiLSTM framework for in vivo two-photon calcium imaging of neuronal ensembles: decoding complex bilateral forelimb movements from unilateral M1

Link: https://arxiv.org/abs/2504.16917
Authors: Ghazal Mirzaee,Jonathan Chang,Shahrzad Latifi
Subjects: Neurons and Cognition (q-bio.NC); Machine Learning (cs.LG)
*Comments:

Abstract:Decoding behavior, such as movement, from multiscale brain networks remains a central objective in neuroscience. Over the past decades, artificial intelligence and machine learning have played an increasingly significant role in elucidating the neural mechanisms underlying motor function. The advancement of brain-monitoring technologies, capable of capturing complex neuronal signals with high spatial and temporal resolution, necessitates the development and application of more sophisticated machine learning models for behavioral decoding. In this study, we employ a hybrid deep learning framework, an attention-based CNN-BiLSTM model, to decode skilled and complex forelimb movements using signals obtained from in vivo two-photon calcium imaging. Our findings demonstrate that the intricate movements of both ipsilateral and contralateral forelimbs can be accurately decoded from unilateral M1 neuronal ensembles. These results highlight the efficacy of advanced hybrid deep learning models in capturing the spatiotemporal dependencies of neuronal networks activity linked to complex movement execution.
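
A compact PyTorch sketch of the general architecture family named in the abstract; the layer sizes, the single conv block, and the additive attention pooling are illustrative guesses, not the authors' exact model.

```python
import torch
import torch.nn as nn

class AttnCNNBiLSTM(nn.Module):
    """Attention-based CNN-BiLSTM decoder for calcium-imaging traces:
    Conv1d extracts local features per time step, a BiLSTM models temporal
    context, and a learned attention pools over time before classification."""
    def __init__(self, n_neurons, n_classes, hidden=64):
        super().__init__()
        self.cnn = nn.Sequential(nn.Conv1d(n_neurons, hidden, 5, padding=2),
                                 nn.ReLU())
        self.lstm = nn.LSTM(hidden, hidden, batch_first=True, bidirectional=True)
        self.attn = nn.Linear(2 * hidden, 1)
        self.head = nn.Linear(2 * hidden, n_classes)

    def forward(self, x):            # x: (batch, time, neurons)
        h = self.cnn(x.transpose(1, 2)).transpose(1, 2)
        h, _ = self.lstm(h)          # (batch, time, 2*hidden)
        w = torch.softmax(self.attn(h), dim=1)
        pooled = (w * h).sum(dim=1)  # attention-weighted temporal pooling
        return self.head(pooled)

model = AttnCNNBiLSTM(n_neurons=100, n_classes=4)
print(model(torch.randn(8, 200, 100)).shape)  # torch.Size([8, 4])
```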

[LG-52] Exploring zero-shot structure-based protein fitness prediction

Link: https://arxiv.org/abs/2504.16886
Authors: Arnav Sharma,Anthony Gitter
Subjects: Quantitative Methods (q-bio.QM); Machine Learning (cs.LG); Biomolecules (q-bio.BM)
*Comments: 26 pages, 7 figures

Abstract:The ability to make zero-shot predictions about the fitness consequences of protein sequence changes with pre-trained machine learning models enables many practical applications. Such models can be applied for downstream tasks like genetic variant interpretation and protein engineering without additional labeled data. The advent of capable protein structure prediction tools has led to the availability of orders of magnitude more precomputed predicted structures, giving rise to powerful structure-based fitness prediction models. Through our experiments, we assess several modeling choices for structure-based models and their effects on downstream fitness prediction. Zero-shot fitness prediction models can struggle to assess the fitness landscape within disordered regions of proteins, those that lack a fixed 3D structure. We confirm the importance of matching protein structures to fitness assays and find that predicted structures for disordered regions can be misleading and affect predictive performance. Lastly, we evaluate an additional structure-based model on the ProteinGym substitution benchmark and show that simple multi-modal ensembles are strong baselines.

[LG-53] Common Functional Decompositions Can Mis-attribute Differences in Outcomes Between Populations ICLR2025

Link: https://arxiv.org/abs/2504.16864
Authors: Manuel Quintero,William T. Stephenson,Advik Shreekumar,Tamara Broderick
Subjects: Methodology (stat.ME); Machine Learning (cs.LG); Econometrics (econ.EM); Machine Learning (stat.ML)
*Comments: 30 pages, appearing in 2nd Workshop on Navigating and Addressing Data Problems for Foundation Models (DATA-FM @ ICLR 2025)

Abstract:In science and social science, we often wish to explain why an outcome is different in two populations. For instance, if a jobs program benefits members of one city more than another, is that due to differences in program participants (particular covariates) or the local labor markets (outcomes given covariates)? The Kitagawa-Oaxaca-Blinder (KOB) decomposition is a standard tool in econometrics that explains the difference in the mean outcome across two populations. However, the KOB decomposition assumes a linear relationship between covariates and outcomes, while the true relationship may be meaningfully nonlinear. Modern machine learning boasts a variety of nonlinear functional decompositions for the relationship between outcomes and covariates in one population. It seems natural to extend the KOB decomposition using these functional decompositions. We observe that a successful extension should not attribute the differences to covariates – or, respectively, to outcomes given covariates – if those are the same in the two populations. Unfortunately, we demonstrate that, even in simple examples, two common decompositions – functional ANOVA and Accumulated Local Effects – can attribute differences to outcomes given covariates, even when they are identical in two populations. We provide a characterization of when functional ANOVA misattributes, as well as a general property that any discrete decomposition must satisfy to avoid misattribution. We show that if the decomposition is independent of its input distribution, it does not misattribute. We further conjecture that misattribution arises in any reasonable additive decomposition that depends on the distribution of the covariates.
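
For readers less familiar with the baseline being extended, the two-fold KOB identity for OLS fits (where $\bar{Y}_g = \bar{X}_g^\top \hat{\beta}_g$) is:

```latex
\bar{Y}_A - \bar{Y}_B
  = \underbrace{(\bar{X}_A - \bar{X}_B)^\top \hat{\beta}_B}_{\text{explained: differences in covariates}}
  + \underbrace{\bar{X}_A^\top \left(\hat{\beta}_A - \hat{\beta}_B\right)}_{\text{unexplained: outcomes given covariates}}
```

Which group's coefficients anchor each term is a reference-group choice; the paper's question is what happens to this attribution when the linear $\hat{\beta}$ terms are replaced by flexible nonlinear functional decompositions.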

[LG-54] Confidence Sequences for Generalized Linear Models via Regret Analysis

Link: https://arxiv.org/abs/2504.16555
Authors: Eugenio Clerico,Hamish Flynn,Wojciech Kotłowski,Gergely Neu
Subjects: Statistics Theory (math.ST); Machine Learning (cs.LG); Machine Learning (stat.ML)
*Comments:

Abstract:We develop a methodology for constructing confidence sets for parameters of statistical models via a reduction to sequential prediction. Our key observation is that for any generalized linear model (GLM), one can construct an associated game of sequential probability assignment such that achieving low regret in the game implies a high-probability upper bound on the excess likelihood of the true parameter of the GLM. This allows us to develop a scheme that we call online-to-confidence-set conversions, which effectively reduces the problem of proving the desired statistical claim to an algorithmic question. We study two varieties of this conversion scheme: 1) analytical conversions that only require proving the existence of algorithms with low regret and provide confidence sets centered at the maximum-likelihood estimator, and 2) algorithmic conversions that actively leverage the output of the online algorithm to construct confidence sets (and may be centered at other, adaptively constructed point estimators). The resulting methodology recovers all state-of-the-art confidence set constructions within a single framework, and also provides several new types of confidence sets that were previously unknown in the literature.

[LG-55] Breaking scaling relations with inverse catalysts: a machine learning exploration of trends in $\mathrm{CO}_2$ hydrogenation energy barriers

Link: https://arxiv.org/abs/2504.16493
Authors: Luuk H. E. Kempen,Marius Juul Nielsen,Mie Andersen
Subjects: Materials Science (cond-mat.mtrl-sci); Machine Learning (cs.LG); Chemical Physics (physics.chem-ph)
*Comments: 10 pages, 6 figures + supporting information (5 pages, 7 figures, 2 tables)

Abstract:The conversion of $\mathrm{CO}_2$ into useful products such as methanol is a key strategy for abating climate change and our dependence on fossil fuels. Developing new catalysts for this process is costly and time-consuming and can thus benefit from computational exploration of possible active sites. However, this is complicated by the complexity of the materials and reaction networks. Here, we present a workflow for exploring transition states of elementary reaction steps at inverse catalysts, which is based on the training of a neural network-based machine learning interatomic potential. We focus on the crucial formate intermediate and its formation over nanoclusters of indium oxide supported on Cu(111). The speedup compared to an approach purely based on density functional theory allows us to probe a wide variety of active sites found at nanoclusters of different sizes and stoichiometries. Analysis of the obtained set of transition state geometries reveals different structure–activity trends at the edge or interior of the nanoclusters. Furthermore, the identified geometries allow for the breaking of linear scaling relations, which could be a key underlying reason for the excellent catalytic performance of inverse catalysts observed in experiments.

[LG-56] he Safety-Privacy Tradeoff in Linear Bandits

Link: https://arxiv.org/abs/2504.16371
Authors: Arghavan Zibaie,Spencer Hutchinson,Ramtin Pedarsani,Mahnoosh Alizadeh
Subjects: Optimization and Control (math.OC); Machine Learning (cs.LG)
*Comments: 16 pages, 3 figures, accepted to 2025 IEEE International Symposium on Information Theory (ISIT)

Abstract:We consider a collection of linear stochastic bandit problems, each modeling the random response of different agents to proposed interventions, coupled together by a global safety constraint. We assume a central coordinator must choose actions to play on each bandit with the objective of regret minimization, while also ensuring that the expected response of all agents satisfies the global safety constraints at each round, in spite of uncertainty about the bandits’ parameters. The agents consider their observed responses to be private and in order to protect their sensitive information, the data sharing with the central coordinator is performed under local differential privacy (LDP). However, providing higher level of privacy to different agents would have consequences in terms of safety and regret. We formalize these tradeoffs by building on the notion of the sharpness of the safety set - a measure of how the geometric properties of the safe set affects the growth of regret - and propose a unilaterally unimprovable vector of privacy levels for different agents given a maximum regret budget.

[LG-57] Covariate-dependent Graphical Model Estimation via Neural Networks with Statistical Guarantees

Link: https://arxiv.org/abs/2504.16356
Authors: Jiahe Lin,Yikai Zhang,George Michailidis
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Methodology (stat.ME)
*Comments: Accepted by Transactions on Machine Learning Research (TMLR)

Abstract:Graphical models are widely used in diverse application domains to model the conditional dependencies amongst a collection of random variables. In this paper, we consider settings where the graph structure is covariate-dependent, and investigate a deep neural network-based approach to estimate it. The method allows for flexible functional dependency on the covariate, and fits the data reasonably well in the absence of a Gaussianity assumption. Theoretical results with PAC guarantees are established for the method, under assumptions commonly used in an Empirical Risk Minimization framework. The performance of the proposed method is evaluated on several synthetic data settings and benchmarked against existing approaches. The method is further illustrated on real datasets involving data from neuroscience and finance, respectively, and produces interpretable results.

[LG-58] Deep Neural Network Emulation of the Quantum-Classical Transition via Learned Wigner Function Dynamics

Link: https://arxiv.org/abs/2504.16334
Authors: Kamran Majid
Subjects: Quantum Physics (quant-ph); Machine Learning (cs.LG)
*Comments:

Abstract:The emergence of classical behavior from quantum mechanics as Planck’s constant $\hbar$ approaches zero remains a fundamental challenge in physics [1-3]. This paper introduces a novel approach employing deep neural networks to directly learn the dynamical mapping from initial quantum state parameters (for Gaussian wave packets of the one-dimensional harmonic oscillator) and $\hbar$ to the parameters of the time-evolved Wigner function in phase space [4-6]. A comprehensive dataset of analytically derived time-evolved Wigner functions was generated, and a deep feedforward neural network with an enhanced architecture was successfully trained for this prediction task, achieving a final training loss of approximately 0.0390. The network demonstrates a significant and previously unrealized ability to accurately capture the underlying mapping of the Wigner function dynamics. This allows for a direct emulation of the quantum-classical transition by predicting the evolution of phase-space distributions as $\hbar$ is systematically varied. The implications of these findings for providing a new computational lens on the emergence of classicality are discussed, highlighting the potential of this direct phase-space learning approach for studying fundamental aspects of quantum mechanics. This work presents a significant advancement beyond previous efforts that focused on learning observable mappings [7], offering a direct route via the phase-space representation.

[LG-59] A Geometric Approach to Problems in Optimization and Data Science

Link: https://arxiv.org/abs/2504.16270
Authors: Naren Sarayu Manoj
Subjects: Optimization and Control (math.OC); Machine Learning (cs.LG); Machine Learning (stat.ML)
*Comments: PhD dissertation

Abstract:We give new results for problems in computational and statistical machine learning using tools from high-dimensional geometry and probability. We break up our treatment into two parts. In Part I, we focus on computational considerations in optimization. Specifically, we give new algorithms for approximating convex polytopes in a stream, sparsification and robust least squares regression, and dueling optimization. In Part II, we give new statistical guarantees for data science problems. In particular, we formulate a new model in which we analyze statistical properties of backdoor data poisoning attacks, and we study the robustness of graph clustering algorithms to "helpful" misspecification.

[LG-60] Probabilistic Emulation of the Community Radiative Transfer Model Using Machine Learning

Link: https://arxiv.org/abs/2504.16192
Authors: Lucas Howard,Aneesh C. Subramanian,Gregory Thompson,Benjamin Johnson,Thomas Auligne
Subjects: Atmospheric and Oceanic Physics (physics.ao-ph); Machine Learning (cs.LG); Machine Learning (stat.ML)
*Comments: 26 pages, 9 figures, 1 table

Abstract:The continuous improvement in weather forecast skill over the past several decades is largely due to the increasing quantity of available satellite observations and their assimilation into operational forecast systems. Assimilating these observations requires observation operators in the form of radiative transfer models. Significant efforts have been dedicated to enhancing the computational efficiency of these models. Computational cost remains a bottleneck, and a large fraction of available data goes unused for assimilation. To address this, we used machine learning to build an efficient neural network based probabilistic emulator of the Community Radiative Transfer Model (CRTM), applied to the GOES Advanced Baseline Imager. The trained NN emulator predicts brightness temperatures output by CRTM and the corresponding error with respect to CRTM. RMSE of the predicted brightness temperature is 0.3 K averaged across all channels. For clear sky conditions, the RMSE is less than 0.1 K for 9 out of 10 infrared channels. The error predictions are generally reliable across a wide range of conditions. Explainable AI methods demonstrate that the trained emulator reproduces the relevant physics, increasing confidence that the model will perform well when presented with new data.

[LG-61] Behavior of prediction performance metrics with rare events

Link: https://arxiv.org/abs/2504.16185
Authors: Emily Minus,R. Yates Coley,Susan M. Shortreed,Brian D. Williamson
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
*Comments: 55 pages (21 main, 34 supplementary), 26 tables (3 main, 23 supplementary), 5 figures (3 main, 2 supplementary)

Abstract:Area under the receiver operating characteristic curve (AUC) is commonly reported alongside binary prediction models. However, there are concerns that AUC might be a misleading measure of prediction performance in the rare event setting. This setting is common since many events of clinical importance are rare events. We conducted a simulation study to determine when or whether AUC is unstable in the rare event setting. Specifically, we aimed to determine whether the bias and variance of AUC are driven by the number of events or the event rate. We also investigated the behavior of other commonly used measures of prediction performance, including positive predictive value, accuracy, sensitivity, and specificity. Our results indicate that poor AUC behavior – as measured by empirical bias, variability of cross-validated AUC estimates, and empirical coverage of confidence intervals – is driven by the minimum class size, not event rate. Performance of sensitivity is driven by the number of events, while that of specificity is driven by the number of non-events. Other measures, including positive predictive value and accuracy, depend on the event rate even in large samples. AUC is reliable in the rare event setting provided that the total number of events is moderately large.
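
The headline finding is easy to reproduce in miniature: hold the event rate fixed, vary the number of events, and the spread of the AUC estimate tracks the minimum class size. The score distributions below are arbitrary (true AUC is roughly 0.76).

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)

def auc_spread(n_events, n_nonevents, reps=200):
    aucs = []
    for _ in range(reps):
        s1 = rng.normal(1.0, 1.0, n_events)     # scores for events
        s0 = rng.normal(0.0, 1.0, n_nonevents)  # scores for non-events
        y = np.r_[np.ones(n_events), np.zeros(n_nonevents)]
        aucs.append(roc_auc_score(y, np.r_[s1, s0]))
    return np.std(aucs)

# Same 1% event rate, very different event counts:
print(auc_spread(10, 990))     # few events -> unstable AUC
print(auc_spread(500, 49500))  # many events -> stable AUC, same rate
```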

[LG-62] A Statistical Approach for Synthetic EEG Data Generation

Link: https://arxiv.org/abs/2504.16143
Authors: Gideon Vos,Maryam Ebrahimpour,Liza van Eijk,Zoltan Sarnyai,Mostafa Rahimi Azghadi
Subjects: Signal Processing (eess.SP); Machine Learning (cs.LG)
*Comments: 24 pages, 10 figures

Abstract:Electroencephalogram (EEG) data is crucial for diagnosing mental health conditions but is costly and time-consuming to collect at scale. Synthetic data generation offers a promising solution to augment datasets for machine learning applications. However, generating high-quality synthetic EEG that preserves emotional and mental health signals remains challenging. This study proposes a method combining correlation analysis and random sampling to generate realistic synthetic EEG data. We first analyze interdependencies between EEG frequency bands using correlation analysis. Guided by this structure, we generate synthetic samples via random sampling. Samples with high correlation to real data are retained and evaluated through distribution analysis and classification tasks. A Random Forest model trained to distinguish synthetic from real EEG performs at chance level, indicating high fidelity. The generated synthetic data closely match the statistical and structural properties of the original EEG, with similar correlation coefficients and no significant differences in PERMANOVA tests. This method provides a scalable, privacy-preserving approach for augmenting EEG datasets, enabling more efficient model training in mental health research.
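
A loose sketch of correlation-guided accept/reject sampling in the spirit of the abstract; the five "bands", the Gaussian generator, and the 0.15 tolerance are placeholders rather than the paper's procedure.

```python
import numpy as np

rng = np.random.default_rng(0)
cov = np.eye(5) * 0.5 + 0.5                      # 5 frequency "bands"
real = rng.multivariate_normal(np.zeros(5), cov, size=500)
target_corr = np.corrcoef(real, rowvar=False)    # structure to preserve

# Generate random candidate batches; keep those whose band-correlation
# structure closely matches the real data (a simple accept/reject filter).
kept = []
while len(kept) < 100:
    cand = rng.multivariate_normal(np.zeros(5), cov, size=64)
    if np.abs(np.corrcoef(cand, rowvar=False) - target_corr).max() < 0.15:
        kept.append(cand)

synthetic = np.vstack(kept)
print(synthetic.shape,
      np.abs(np.corrcoef(synthetic, rowvar=False) - target_corr).max())
```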

[LG-63] SeizureFormer: A Transformer Model for IEA-Based Seizure Risk Forecasting

Link: https://arxiv.org/abs/2504.16098
Authors: Tianning Feng(1),Junting Ni(1),Ezequiel Gleichgerrcht(2),Wei Jin(1) ((1) Department of Computer Science, Emory University, Atlanta, GA, USA, (2) Department of Neurology, Emory University, Atlanta, GA, USA)
Subjects: Signal Processing (eess.SP); Machine Learning (cs.LG)
*Comments: 9 pages, 2 figures. Submitted to AMIA 2025. Also submitted as an undergraduate honors thesis at Emory University

Abstract:We present SeizureFormer, a Transformer-based model for long-term seizure risk forecasting using interictal epileptiform activity (IEA) surrogate biomarkers and long episode (LE) biomarkers from responsive neurostimulation (RNS) systems. Unlike raw scalp EEG-based models, SeizureFormer leverages structured, clinically relevant features and integrates CNN-based patch embedding, multi-head self-attention, and squeeze-and-excitation blocks to model both short-term dynamics and long-term seizure cycles. Tested across five patients and multiple prediction windows (1 to 14 days), SeizureFormer achieved state-of-the-art performance with mean ROC AUC of 79.44 percent and mean PR AUC of 76.29 percent. Compared to statistical, machine learning, and deep learning baselines, it demonstrates enhanced generalizability and seizure risk forecasting performance under class imbalance. This work supports future clinical integration of interpretable and robust seizure forecasting tools for personalized epilepsy management.

Information Retrieval

[IR-0] Search Timelines: Visualizing Search History to Enable Cross-Session Exploratory Search

Link: https://arxiv.org/abs/2504.16741
Authors: Orland Hoeber,Md Nazmul Islam,Miriam Boon,Dale Storie,Veronica Ramshaw
Subjects: Human-Computer Interaction (cs.HC); Information Retrieval (cs.IR)
*Comments:

Abstract:Purpose: The timespan over which exploratory searching can occur, as well as the scope and volume of the search activities undertaken, can make it difficult for searchers to remember key details about their search activities. These difficulties are present both in the midst of searching as well as when resuming a search that spans multiple sessions. In this paper, we present a search interface designed to support cross-session exploratory search in a public digital library context. Methods: Search Timelines provides a visualization of current and past search activities via a dynamic timeline of the search activity (queries and saved resources). This timeline is presented at two levels of detail. An overview timeline is provided alongside the search results in a typical search engine results page design. A detailed timeline is provided in the workspace, where searchers can review the history of their search activities and their saved resources. A controlled laboratory study was conducted to compare this approach to a baseline interface modelled after a typical public digital library search/workspace interface. Results: Participants who used Search Timelines reported higher levels of user engagement, usability, and perceived knowledge gain, during an initial search session and when resuming the search after a 7-8 day interval. This came at the expense of the searchers taking more time to complete the search task, which we view as positive evidence of engagement in cross-session exploratory search processes. Conclusion: Search Timelines serves as an example of how lightweight visualization approaches can be used to enhance typical search interface designs to support exploratory search. The results highlight the value of providing persistent representations of past search activities within the search interface.

[IR-1] Information Leakage of Sentence Embeddings via Generative Embedding Inversion Attacks SIGIR2025

Link: https://arxiv.org/abs/2504.16609
Authors: Antonios Tragoudaras,Theofanis Aslanidis,Emmanouil Georgios Lionis,Marina Orozco González,Panagiotis Eustratiadis
Subjects: Information Retrieval (cs.IR)
*Comments: This is a preprint of our paper accepted at SIGIR 2025

Abstract:Text data are often encoded as dense vectors, known as embeddings, which capture semantic, syntactic, contextual, and domain-specific information. These embeddings, widely adopted in various applications, inherently contain rich information that may be susceptible to leakage under certain attacks. The GEIA framework highlights vulnerabilities in sentence embeddings, demonstrating that they can reveal the original sentences they represent. In this study, we reproduce GEIA’s findings across various neural sentence embedding models. Additionally, we contribute new analysis to examine whether these models leak sensitive information from their training datasets. We propose a simple yet effective method without any modification to the attacker’s architecture proposed in GEIA. The key idea is to examine differences between log-likelihood for masked and original variants of data that sentence embedding models have been pre-trained on, calculated on the embedding space of the attacker. Our findings indicate that following our approach, an adversary party can recover meaningful sensitive information related to the pre-training knowledge of the popular models used for creating sentence embeddings, seriously undermining their security. Our code is available on: this https URL

[IR-2] Enhancing LLM-Based Agents via Global Planning and Hierarchical Execution

Link: https://arxiv.org/abs/2504.16563
Authors: Junjie Chen,Haitao Li,Jingli Yang,Yiqun Liu,Qingyao Ai
Subjects: Information Retrieval (cs.IR)
*Comments:

Abstract:Intelligent agent systems based on Large Language Models (LLMs) have shown great potential in real-world applications. However, existing agent frameworks still face critical limitations in task planning and execution, restricting their effectiveness and generalizability. Specifically, current planning methods often lack clear global goals, leading agents to get stuck in local branches, or produce non-executable plans. Meanwhile, existing execution mechanisms struggle to balance complexity and stability, and their limited action space restricts their ability to handle diverse real-world tasks. To address these limitations, we propose GoalAct, a novel agent framework that introduces a continuously updated global planning mechanism and integrates a hierarchical execution strategy. GoalAct decomposes task execution into high-level skills, including searching, coding, writing and more, thereby reducing planning complexity while enhancing the agents’ adaptability across diverse task scenarios. We evaluate GoalAct on LegalAgentBench, a benchmark with multiple types of legal tasks that require the use of multiple types of tools. Experimental results demonstrate that GoalAct achieves state-of-the-art (SOTA) performance, with an average improvement of 12.22% in success rate. These findings highlight GoalAct’s potential to drive the development of more advanced intelligent agent systems, making them more effective across complex real-world applications. Our code can be found at this https URL.

[IR-3] Modality Reliability Guided Multimodal Recommendation

Link: https://arxiv.org/abs/2504.16524
Authors: Xue Dong,Xuemeng Song,Na Zheng,Sicheng Zhao,Guiguang Ding
Subjects: Information Retrieval (cs.IR)
*Comments:

Abstract:Multimodal recommendation faces a performance degradation issue, in which uni-modal recommendation sometimes achieves better performance. A possible reason is that unreliable item modality data hurt the fusion result. Several existing studies have introduced weights for different modalities to reduce the contribution of the unreliable modality data in predicting the final user rating. However, they fail to provide appropriate supervisions for learning the modality weights, making the learned weights imprecise. Therefore, we propose a modality reliability guided multimodal recommendation framework that uniquely learns the modality weights supervised by the modality reliability. Considering that there is no explicit label provided for modality reliability, we resort to automatically identify it through the BPR recommendation objective. In particular, we define a modality reliability vector as the supervision label by the difference between modality-specific user ratings to positive and negative items, where a larger difference indicates a higher reliability of the modality as the BPR objective is better satisfied. Furthermore, to enhance the effectiveness of the supervision, we calculate the confidence level for the modality reliability vector, which dynamically adjusts the supervision strength and eliminates the harmful supervision. Extensive experiments on three real-world datasets show the effectiveness of the proposed method.
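
The supervision label described here reduces to a per-modality rating gap; a tiny sketch follows (the `tanh` confidence mapping is an assumption; the paper defines its own confidence level).

```python
import numpy as np

def modality_reliability(pos_scores, neg_scores):
    """Reliability label per modality: the modality-specific rating for the
    positive item minus the rating for a sampled negative item. A larger gap
    means the BPR objective is better satisfied; the confidence weight damps
    near-zero, unreliable gaps (mapping chosen for illustration)."""
    gap = pos_scores - neg_scores        # shape: (n_modalities,)
    confidence = np.abs(np.tanh(gap))    # assumed confidence mapping
    return gap, confidence

pos = np.array([0.9, 0.4, 0.7])  # user's rating of the positive item per modality
neg = np.array([0.2, 0.5, 0.6])  # same user, sampled negative item
gap, conf = modality_reliability(pos, neg)
print(gap, conf)  # the second modality gets a negative, low-confidence label
```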

[IR-4] Killing Two Birds with One Stone: Unifying Retrieval and Ranking with a Single Generative Recommendation Model SIGIR2025

Link: https://arxiv.org/abs/2504.16454
Authors: Luankang Zhang,Kenan Song,Yi Quan Lee,Wei Guo,Hao Wang,Yawen Li,Huifeng Guo,Yong Liu,Defu Lian,Enhong Chen
Subjects: Information Retrieval (cs.IR)
*Comments: This paper has been accepted at SIGIR 2025

Abstract:In recommendation systems, the traditional multi-stage paradigm, which includes retrieval and ranking, often suffers from information loss between stages and diminishes performance. Recent advances in generative models, inspired by natural language processing, suggest the potential for unifying these stages to mitigate such loss. This paper presents the Unified Generative Recommendation Framework (UniGRF), a novel approach that integrates retrieval and ranking into a single generative model. By treating both stages as sequence generation tasks, UniGRF enables sufficient information sharing without additional computational costs, while remaining model-agnostic. To enhance inter-stage collaboration, UniGRF introduces a ranking-driven enhancer module that leverages the precision of the ranking stage to refine retrieval processes, creating an enhancement loop. Besides, a gradient-guided adaptive weighter is incorporated to dynamically balance the optimization of retrieval and ranking, ensuring synchronized performance improvements. Extensive experiments demonstrate that UniGRF significantly outperforms existing models on benchmark datasets, confirming its effectiveness in facilitating information transfer. Ablation studies and further experiments reveal that UniGRF not only promotes efficient collaboration between stages but also achieves synchronized optimization. UniGRF provides an effective, scalable, and compatible framework for generative recommendation systems.

[IR-5] MPAD: A New Dimension-Reduction Method for Preserving Nearest Neighbors in High-Dimensional Vector Search

Link: https://arxiv.org/abs/2504.16335
Authors: Jiuzhou Fu,Dongfang Zhao
Subjects: Information Retrieval (cs.IR); Databases (cs.DB)
*Comments:

Abstract:High-dimensional vector embeddings are widely used in retrieval systems, yet dimensionality reduction (DR) is seldom applied due to its tendency to distort nearest-neighbor (NN) structure critical for search. Existing DR techniques such as PCA and UMAP optimize global or manifold-preserving criteria, rather than retrieval-specific objectives. We present MPAD: Maximum Pairwise Absolute Difference, an unsupervised DR method that explicitly preserves approximate NN relations by maximizing the margin between k-NNs and non-k-NNs under a soft orthogonality constraint. This design enables MPAD to retain ANN-relevant geometry without supervision or changes to the original embedding model. Experiments across multiple domains show that MPAD consistently outperforms standard DR methods in preserving neighborhood structure, enabling more accurate search in reduced dimensions.
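
The stated objective, maximizing the margin between projected k-NN and non-k-NN distances under a soft orthogonality penalty, can be prototyped directly with autograd; this is a gradient sketch of that objective, not the authors' algorithm, and all hyperparameters are placeholders.

```python
import torch

torch.manual_seed(0)
X = torch.randn(200, 64)                      # original embeddings
P = torch.randn(64, 16, requires_grad=True)   # projection to learn
opt = torch.optim.Adam([P], lr=1e-2)
k, lam = 10, 1.0

knn = torch.cdist(X, X).topk(k + 1, largest=False).indices[:, 1:]  # true k-NN

for step in range(200):
    d = torch.cdist(X @ P, X @ P)
    d_knn = d.gather(1, knn).max(dim=1).values       # farthest true neighbour
    big = torch.zeros_like(d)
    big.scatter_(1, knn, float("inf"))               # mask out true neighbours
    big.fill_diagonal_(float("inf"))                 # ...and self-distances
    d_rest = (d + big).min(dim=1).values             # nearest non-neighbour
    ortho = ((P.t() @ P - torch.eye(16)) ** 2).sum() # soft orthogonality
    loss = -(d_rest - d_knn).mean() + lam * ortho    # maximize the k-NN margin
    opt.zero_grad(); loss.backward(); opt.step()

print(float((d_rest - d_knn).mean()))  # margin should have grown
```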

Attachment Download

Click to download the full list of today's papers