本篇博文主要内容为 2025-04-07 从Arxiv.org论文网站获取的最新论文列表,自动更新,按照NLP、CV、ML、AI、IR五个大方向区分,若需要邮件定时接收,请在评论区留下你的邮箱地址。

说明:每日论文数据从Arxiv.org获取,每天早上12:00左右定时自动更新。

友情提示: 如果您需要邮箱接收每日论文数据,请在评论处留下你的邮箱地址。

目录

概览 (2025-04-07)

今日共更新427篇论文,其中:

  • 自然语言处理81篇(Computation and Language (cs.CL))
  • 人工智能104篇(Artificial Intelligence (cs.AI))
  • 计算机视觉99篇(Computer Vision and Pattern Recognition (cs.CV))
  • 机器学习127篇(Machine Learning (cs.LG))

自然语言处理

[NLP-0] Bonsai: Interpretable Tree-Adaptive Grounded Reasoning

【速读】: 该论文旨在解决开发通用协作代理过程中AI系统面临的两大挑战:(1) 在新领域中的适应能力;(2) 对不确定性进行透明推理以支持验证与修正的能力。传统黑盒模型虽具备强大的数据处理能力,但因缺乏透明性、领域特定性和不确定性意识而无法满足这些需求。论文提出的关键解决方案是Bonsai系统,这是一种基于组合与概率推理的方法,通过检索相关证据并利用其计算从自然语言推断得出子命题的可能性,从而生成可适应的推理树。Bonsai系统在测试阶段可通过证据缩放调整推理强度,并在处理多样化领域(如文本记录、照片、视频、音频及数据库)时展现出可靠性能。此外,问答与人机对齐实验表明,Bonsai在生成可解释、基于证据且具有不确定性意识的推理轨迹方面能够媲美领域特定的黑盒方法。

链接: https://arxiv.org/abs/2504.03640
作者: Kate Sanders,Benjamin Van Durme
机构: Johns Hopkins University (约翰斯·霍普金斯大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注: 9 pages, preprint

点击查看摘要

Abstract:To develop general-purpose collaborative agents, humans need reliable AI systems that can (1) adapt to new domains and (2) transparently reason with uncertainty to allow for verification and correction. Black-box models demonstrate powerful data processing abilities but do not satisfy these criteria due to their opaqueness, domain specificity, and lack of uncertainty awareness. We introduce Bonsai, a compositional and probabilistic reasoning system that generates adaptable inference trees by retrieving relevant grounding evidence and using it to compute likelihoods of sub-claims derived from broader natural language inferences. Bonsai’s reasoning power is tunable at test-time via evidence scaling and it demonstrates reliable handling of varied domains including transcripts, photographs, videos, audio, and databases. Question-answering and human alignment experiments demonstrate that Bonsai matches the performance of domain-specific black-box methods while generating interpretable, grounded, and uncertainty-aware reasoning traces.
zh
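Bonsai 的核心是把一个命题分解为子命题,再用检索证据给出的子命题似然自底向上组合。下面是一个极简的示意性草图(非论文官方实现):假设 AND 节点取子命题似然之积、OR 节点按 noisy-OR 组合,叶节点似然由证据给出(此处直接硬编码)。

```python
# 示意性草图:用简单的推理树组合子命题似然(非 Bonsai 官方实现)。
# 假设:AND 节点取子命题似然之积,OR 节点按 noisy-OR 组合;
# 叶节点的似然应由检索证据计算,此处为演示而硬编码。
from dataclasses import dataclass, field
from typing import List

@dataclass
class ClaimNode:
    text: str
    op: str = "leaf"            # "leaf" | "and" | "or"
    likelihood: float = 0.0     # 仅叶节点使用
    children: List["ClaimNode"] = field(default_factory=list)

def evaluate(node: ClaimNode) -> float:
    """自底向上计算命题成立的似然。"""
    if node.op == "leaf":
        return node.likelihood
    child_probs = [evaluate(c) for c in node.children]
    if node.op == "and":
        p = 1.0
        for q in child_probs:
            p *= q
        return p
    if node.op == "or":  # noisy-OR:至少一个子命题成立
        p = 1.0
        for q in child_probs:
            p *= (1.0 - q)
        return 1.0 - p
    raise ValueError(node.op)

tree = ClaimNode(
    text="视频中有人在弹吉他",
    op="and",
    children=[
        ClaimNode("画面中出现吉他", likelihood=0.9),
        ClaimNode("音轨中有吉他声", likelihood=0.8),
    ],
)
print(round(evaluate(tree), 2))  # 0.72
```

论文所说的"证据缩放"可以理解为:叶节点似然随检索到的证据数量而更新,进而改变整棵树的推理强度。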

[NLP-1] Do Larger Language Models Imply Better Reasoning? A Pretraining Scaling Law for Reasoning

【速读】: 该论文试图解决的问题是如何理解大规模语言模型(LLMs)在复杂推理任务中的性能随参数规模变化的趋势,并探索其最优模型大小与知识图谱特性之间的关系。论文的关键解决方案在于设计了一个合成的多跳推理环境,用于模拟真实世界的大规模知识图谱结构,并通过仅使用不完整图谱中的三元组对语言模型进行预训练,评估其推理缺失边的能力。研究发现,过参数化可能导致因过度记忆而损害推理性能,并进一步分析了影响这种“U型”损失曲线的因素,包括图谱结构、模型大小和训练步数。最终,论文提出了一种经验性缩放方法,通过将知识图谱搜索熵线性映射到最优模型大小,以预测特定知识图谱下的最佳模型规模。这一工作为理解LLMs中规模与推理能力之间的关系提供了新见解,并揭示了优化其推理任务性能的潜在途径。

链接: https://arxiv.org/abs/2504.03635
作者: Xinyi Wang,Shawn Tan,Mingyu Jin,William Yang Wang,Rameswar Panda,Yikang Shen
机构: UC Santa Barbara; MIT-IBM Watson AI Lab; Rutgers University (罗格斯大学)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Large Language Models (LLMs) have demonstrated remarkable capabilities across a wide range of tasks requiring complex reasoning. However, the effects of scaling on their reasoning abilities remain insufficiently understood. In this paper, we introduce a synthetic multihop reasoning environment designed to closely replicate the structure and distribution of real-world large-scale knowledge graphs. Our reasoning task involves completing missing edges in the graph, which requires advanced multi-hop reasoning and mimics real-world reasoning scenarios. To evaluate this, we pretrain language models (LMs) from scratch solely on triples from the incomplete graph and assess their ability to infer the missing edges. Interestingly, we observe that overparameterization can impair reasoning performance due to excessive memorization. We investigate different factors that affect this U-shaped loss curve, including graph structure, model size, and training steps. To predict the optimal model size for a specific knowledge graph, we find an empirical scaling that linearly maps the knowledge graph search entropy to the optimal model size. This work provides new insights into the relationship between scaling and reasoning in LLMs, shedding light on possible ways to optimize their performance for reasoning tasks.
zh
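论文给出的经验性缩放关系是"知识图谱搜索熵线性映射到最优模型大小"。下面用纯 Python 的最小二乘拟合演示这一思路;数据点为虚构示例,仅展示拟合与预测的形式,并非论文实测数值。

```python
# 示意性草图:用线性拟合把知识图谱搜索熵映射到最优模型规模
# (论文提出的经验性缩放关系;数据点为虚构示例,仅演示拟合与预测)。

def fit_linear(xs, ys):
    """最小二乘拟合 y = a*x + b,返回 (a, b)。"""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var = sum((x - mx) ** 2 for x in xs)
    a = cov / var
    return a, my - a * mx

# 虚构数据:若干合成图谱的搜索熵(bits)与观测到的最优参数量(百万)
entropy = [2.0, 4.0, 6.0, 8.0]
opt_size_m = [10.0, 20.0, 30.0, 40.0]

a, b = fit_linear(entropy, opt_size_m)
predict = lambda h: a * h + b
print(predict(5.0))  # 预测熵为 5 bits 的图谱对应的最优模型规模(百万参数)
```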

[NLP-2] Nemotron-H: A Family of Accurate and Efficient Hybrid Mamba-Transformer Models

【速读】: 该论文旨在解决在保持或提升推理准确性的同时,显著降低大规模生成式 AI (Generative AI) 模型的推理成本和加速推理速度的问题。论文的关键创新在于引入了 Nemotron-H 系列模型,通过将传统 Transformer 模型中的大部分自注意力层替换为 Mamba 层,这些 Mamba 层以恒定计算量和每生成 token 的恒定内存需求实现了高效的推理。此外,通过 MiniPuzzle 压缩与蒸馏技术进一步优化了 56B 参数规模的模型,生成了更轻量化的 Nemotron-H-47B-Base 模型,其推理速度提升了 20%,同时保持了相似的准确性。论文还提出了基于 FP8 的训练方法,证明其在推理性能上可媲美传统的 BF16 训练方法,并成功应用于 56B 模型的训练。

链接: https://arxiv.org/abs/2504.03624
作者: NVIDIA:Aaron Blakeman,Aarti Basant,Abhinav Khattar,Adithya Renduchintala,Akhiad Bercovich,Aleksander Ficek,Alexis Bjorlin,Ali Taghibakhshi,Amala Sanjay Deshmukh,Ameya Sunil Mahabaleshwarkar,Andrew Tao,Anna Shors,Ashwath Aithal,Ashwin Poojary,Ayush Dattagupta,Balaram Buddharaju,Bobby Chen,Boris Ginsburg,Boxin Wang,Brandon Norick,Brian Butterfield,Bryan Catanzaro,Carlo del Mundo,Chengyu Dong,Christine Harvey,Christopher Parisien,Dan Su,Daniel Korzekwa,Danny Yin,Daria Gitman,David Mosallanezhad,Deepak Narayanan,Denys Fridman,Dima Rekesh,Ding Ma,Dmytro Pykhtar,Dong Ahn,Duncan Riach,Dusan Stosic,Eileen Long,Elad Segal,Ellie Evans,Eric Chung,Erick Galinkin,Evelina Bakhturina,Ewa Dobrowolska,Fei Jia,Fuxiao Liu,Gargi Prasad,Gerald Shen,Guilin Liu,Guo Chen,Haifeng Qian,Helen Ngo,Hongbin Liu,Hui Li,Igor Gitman,Ilia Karmanov,Ivan Moshkov,Izik Golan,Jan Kautz,Jane Polak Scowcroft,Jared Casper,Jarno Seppanen,Jason Lu,Jason Sewall,Jiaqi Zeng,Jiaxuan You,Jimmy Zhang,Jing Zhang,Jining Huang,Jinze Xue,Jocelyn Huang,Joey Conway,John Kamalu,Jon Barker,Jonathan Cohen,Joseph Jennings,Jupinder Parmar,Karan Sapra,Kari Briski,Kateryna Chumachenko,Katherine Luna,Keshav Santhanam,Kezhi Kong,Kirthi Sivamani,Krzysztof Pawelec,Kumar Anik,Kunlun Li,Lawrence McAfee,Leon Derczynski,Lindsey Pavao,Luis Vega,Lukas Voegtle,Maciej Bala,Maer Rodrigues de Melo,Makesh Narsimhan Sreedhar,Marcin Chochowski,Markus Kliegl
机构: NVIDIA
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:As inference-time scaling becomes critical for enhanced reasoning capabilities, it is increasingly becoming important to build models that are efficient to infer. We introduce Nemotron-H, a family of 8B and 56B/47B hybrid Mamba-Transformer models designed to reduce inference cost for a given accuracy level. To achieve this goal, we replace the majority of self-attention layers in the common Transformer model architecture with Mamba layers that perform constant computation and require constant memory per generated token. We show that Nemotron-H models offer either better or on-par accuracy compared to other similarly-sized state-of-the-art open-sourced Transformer models (e.g., Qwen-2.5-7B/72B and Llama-3.1-8B/70B), while being up to 3 \times faster at inference. To further increase inference speed and reduce the memory required at inference time, we created Nemotron-H-47B-Base from the 56B model using a new compression via pruning and distillation technique called MiniPuzzle. Nemotron-H-47B-Base achieves similar accuracy to the 56B model, but is 20% faster to infer. In addition, we introduce an FP8-based training recipe and show that it can achieve on par results with BF16-based training. This recipe is used to train the 56B model. All Nemotron-H models will be released, with support in Hugging Face, NeMo, and Megatron-LM.
zh
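混合架构的收益可以用一个概念性的小例子说明:自注意力层的 KV 缓存随已生成 token 数线性增长,而 Mamba 层只维护恒定大小的状态。下面的草图(非 Nemotron-H 官方实现,层间隔与单位均为假设)构造一个层类型序列并粗略对比缓存占用。

```python
# 示意性草图:构造混合 Mamba-Transformer 的层类型序列,并对比
# 生成 t 个 token 时的缓存内存增长(概念演示,非官方实现)。
# 假设:注意力层缓存占用 O(t),Mamba 层状态占用 O(1),均以抽象单位计。

def build_hybrid_stack(n_layers, attn_every=8):
    """每 attn_every 层放一个自注意力层,其余为 Mamba 层。"""
    return ["attention" if (i + 1) % attn_every == 0 else "mamba"
            for i in range(n_layers)]

def cache_units(stack, n_tokens):
    """以抽象单位估算缓存占用:attention 层 O(t),mamba 层 O(1)。"""
    return sum(n_tokens if layer == "attention" else 1 for layer in stack)

stack = build_hybrid_stack(24, attn_every=8)
print(stack.count("attention"), stack.count("mamba"))  # 3 21
print(cache_units(stack, n_tokens=1000))      # 3*1000 + 21*1 = 3021
print(cache_units(["attention"] * 24, 1000))  # 纯 Transformer:24000
```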

[NLP-3] Align to Structure: Aligning Large Language Models with Structural Information

【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)在生成长篇连贯文本时面临的挑战,特别是缺乏层次化规划和结构化组织的问题。论文提出了一种名为“结构对齐(Structural Alignment)”的新方法,通过将基于语言学的篇章框架整合到强化学习中,引导模型生成具有连贯性和良好组织性的输出。解决方案的关键在于采用密集奖励机制,在Proximal Policy Optimization框架内,依据篇章特性相对于人类写作的差异性,为每个标记分配细粒度的奖励信号。此外,论文评估了两种互补的奖励模型:其一通过评分表面级文本特征来提升可读性并提供显式的结构指导;其二则通过分析层级化的篇章主题模式,强化深层连贯性和修辞复杂度,从而在作文生成和长文档摘要等任务中超越标准模型及基于RLHF增强的模型。所有训练数据和代码将公开共享。

链接: https://arxiv.org/abs/2504.03622
作者: Zae Myung Kim,Anand Ramachandran,Farideh Tavazoee,Joo-Kyung Kim,Oleg Rokhlenko,Dongyeop Kang
机构: University of Minnesota Twin Cities (明尼苏达大学双城分校); Amazon (亚马逊); OpenAI (OpenAI)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Generating long, coherent text remains a challenge for large language models (LLMs), as they lack hierarchical planning and structured organization in discourse generation. We introduce Structural Alignment, a novel method that aligns LLMs with human-like discourse structures to enhance long-form text generation. By integrating linguistically grounded discourse frameworks into reinforcement learning, our approach guides models to produce coherent and well-organized outputs. We employ a dense reward scheme within a Proximal Policy Optimization framework, assigning fine-grained, token-level rewards based on the discourse distinctiveness relative to human writing. Two complementary reward models are evaluated: the first improves readability by scoring surface-level textual features to provide explicit structuring, while the second reinforces deeper coherence and rhetorical sophistication by analyzing global discourse patterns through hierarchical discourse motifs, outperforming both standard and RLHF-enhanced models in tasks such as essay generation and long-document summarization. All training data and code will be publicly shared at this https URL.
zh
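文中"密集奖励机制"指在 PPO 中为每个 token(而非仅序列末尾)分配奖励信号。下面的草图只演示奖励的分配形式:先为每个句子打一个结构化得分,再均摊到句内 token;打分函数是虚构的启发式,并非论文的两个奖励模型。

```python
# 示意性草图:token 级密集奖励的分配方式(概念演示,非论文官方奖励模型)。
# 假设:fake_structure_score 为虚构打分函数,真实系统中应由
# 基于篇章特征的奖励模型给出句子/片段得分。

def fake_structure_score(sentence):
    """虚构打分:句子含篇章连接词则得高分。"""
    connectives = {"however", "therefore", "first", "finally"}
    words = set(sentence.lower().split())
    return 1.0 if connectives & words else 0.2

def token_level_rewards(sentences):
    """把句子级得分均摊为 PPO 可用的 token 级奖励序列。"""
    rewards = []
    for sent in sentences:
        toks = sent.split()
        score = fake_structure_score(sent)
        rewards.extend([score / len(toks)] * len(toks))
    return rewards

sents = ["First we outline the method", "It works"]
r = token_level_rewards(sents)
print(len(r), round(sum(r), 2))  # token 数与总奖励
```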

[NLP-4] Multilingual Retrieval-Augmented Generation for Knowledge-Intensive Task

【速读】: 该论文旨在解决多语言环境下检索增强生成(Retrieval-Augmented Generation, RAG)方法的有效性问题,特别是针对多语言开放域问答任务。现有RAG策略在多语言场景中的表现有限,如单向翻译增强的tRAG存在覆盖范围不足的问题,而直接跨语言检索的Multilingual RAG则因跨语言内容差异导致一致性下降。论文的关键解决方案是提出Crosslingual RAG (CrossRAG),通过在生成响应前将检索到的文档翻译成通用语言(如英语),从而显著提升知识密集型任务的性能,并同时优化高资源和低资源语言的表现。

链接: https://arxiv.org/abs/2504.03616
作者: Leonardo Ranaldi,Barry Haddow,Alexandra Birch
机构: Institute for Language, Cognition and Computation (语言、认知与计算研究所); School of Informatics, University of Edinburgh (爱丁堡大学信息学院)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Retrieval-augmented generation (RAG) has become a cornerstone of contemporary NLP, enhancing large language models (LLMs) by allowing them to access richer factual contexts through in-context retrieval. While effective in monolingual settings, especially in English, its use in multilingual tasks remains unexplored. This paper investigates the effectiveness of RAG across multiple languages by proposing novel approaches for multilingual open-domain question-answering. We evaluate the performance of various multilingual RAG strategies, including question-translation (tRAG), which translates questions into English before retrieval, and Multilingual RAG (MultiRAG), where retrieval occurs directly across multiple languages. Our findings reveal that tRAG, while useful, suffers from limited coverage. In contrast, MultiRAG improves efficiency by enabling multilingual retrieval but introduces inconsistencies due to cross-lingual variations in the retrieved content. To address these issues, we propose Crosslingual RAG (CrossRAG), a method that translates retrieved documents into a common language (e.g., English) before generating the response. Our experiments show that CrossRAG significantly enhances performance on knowledge-intensive tasks, benefiting both high-resource and low-resource languages.
zh
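CrossRAG 的流程(检索 → 将检索文档翻译到统一语言 → 生成)可以用如下骨架表示。其中 retrieve、translate、generate 均为占位函数,真实系统应替换为多语言检索器、机器翻译模型与 LLM。

```python
# 示意性草图:CrossRAG 的流程骨架(检索 → 翻译到统一语言 → 生成)。
# 假设:三个阶段均为占位实现,仅演示数据流。

def retrieve(question, k=2):
    # 占位:跨语言检索,返回 (语言, 文档) 对
    corpus = [("de", "Berlin ist die Hauptstadt Deutschlands."),
              ("fr", "Berlin est la capitale de l'Allemagne."),
              ("en", "Berlin is the capital of Germany.")]
    return corpus[:k]

def translate(text, src, tgt="en"):
    # 占位:同语言直接返回,否则标注一次"翻译"
    return text if src == tgt else f"[{src}->{tgt}] {text}"

def generate(question, contexts):
    # 占位:真实系统中这里调用 LLM,基于统一语言的上下文作答
    return f"Answer({question!r}, contexts={len(contexts)})"

def cross_rag(question, pivot="en"):
    docs = retrieve(question)
    contexts = [translate(d, lang, pivot) for lang, d in docs]
    return generate(question, contexts)

print(cross_rag("Was ist die Hauptstadt von Deutschland?"))
```

与 tRAG(只翻译问题)相比,这里翻译的是检索结果本身,从而在生成前消除跨语言内容差异。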

[NLP-5] AIR: A Systematic Analysis of Annotations, Instructions, and Response Pairs in Preference Dataset

【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)与人类价值观对齐过程中偏好学习面临的挑战,其核心问题是当前方法未能有效区分和优化构成高质量偏好数据集的三个核心组件:偏好标注(Preference Annotations)、指令(Instructions)和响应对(Response Pairs),导致难以系统性提升模型性能。论文的关键解决方案是提出了一种名为AIR的分量分析框架,通过系统隔离和优化每个组件,并评估它们的协同效应,揭示了指导性原则,包括注释简化(点式生成评分)、指令推断稳定性(基于方差的跨模型过滤)以及响应对质量(适度的边界值与高绝对评分)。这些原则的结合使基准方法获得了平均+5.3的性能提升,即使仅使用14k高质量数据对。该研究将偏好数据集的设计从随意扩展转向了组件感知优化,为高效且可复现的模型对齐提供了蓝图。

链接: https://arxiv.org/abs/2504.03612
作者: Bingxiang He,Wenbin Zhang,Jiaxi Song,Cheng Qian,Zixuan Fu,Bowen Sun,Ning Ding,Haiwen Hong,Longtao Huang,Hui Xue,Ganqu Cui,Wanxiang Che,Zhiyuan Liu,Maosong Sun
机构: Tsinghua University (清华大学); Harbin Institute of Technology (哈尔滨工业大学); Shanghai AI Lab (上海人工智能实验室); University of Illinois Urbana-Champaign (伊利诺伊大学厄巴纳-香槟分校); Alibaba Group (阿里巴巴集团)
类目: Computation and Language (cs.CL)
备注: 29 pages, 11 figures

点击查看摘要

Abstract:Preference learning is critical for aligning large language models (LLMs) with human values, yet its success hinges on high-quality datasets comprising three core components: Preference Annotations, Instructions, and Response Pairs. Current approaches conflate these components, obscuring their individual impacts and hindering systematic optimization. In this work, we propose AIR, a component-wise analysis framework that systematically isolates and optimizes each component while evaluating their synergistic effects. Through rigorous experimentation, AIR reveals actionable principles: annotation simplicity (point-wise generative scoring), instruction inference stability (variance-based filtering across LLMs), and response pair quality (moderate margins + high absolute scores). When combined, these principles yield +5.3 average gains over baseline method, even with only 14k high-quality pairs. Our work shifts preference dataset design from ad hoc scaling to component-aware optimization, offering a blueprint for efficient, reproducible alignment.
zh
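AIR 总结的两条可操作原则——指令侧"跨模型打分方差过滤"、响应对侧"适度边界 + 高绝对分"——可以写成简单的筛选器。以下数据与阈值均为虚构示例,仅演示筛选逻辑。

```python
# 示意性草图:按 AIR 总结的原则筛选偏好数据(数据与阈值均为虚构示例)。
# 假设:每条指令已由多个 LLM 打分;每个响应对已有 (chosen, rejected) 绝对分。

def variance(xs):
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

def filter_instructions(inst_scores, max_var=0.5):
    """保留跨模型打分方差小(推断稳定)的指令。"""
    return [i for i, scores in inst_scores.items() if variance(scores) <= max_var]

def filter_pairs(pairs, min_score=7.0, min_margin=0.5, max_margin=3.0):
    """保留高绝对分且边界适中的 (chosen, rejected) 分数对。"""
    kept = []
    for chosen, rejected in pairs:
        margin = chosen - rejected
        if chosen >= min_score and min_margin <= margin <= max_margin:
            kept.append((chosen, rejected))
    return kept

inst = {"写一首诗": [8, 8, 7], "解释这段歧义指令": [2, 9, 5]}
print(filter_instructions(inst))               # 只保留打分稳定的指令
print(filter_pairs([(9, 8), (9, 3), (5, 4)]))  # 只保留 (9, 8)
```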

[NLP-6] APIGen-MT: Agentic Pipeline for Multi-Turn Data Generation via Simulated Agent-Human Interplay

【速读】: 该论文旨在解决多轮交互(multi-turn interactions)人工智能代理训练数据稀缺且昂贵的问题。解决方案的关键在于提出了一种名为APIGen-MT的两阶段框架,用于生成可验证且多样化的多轮代理数据。第一阶段通过利用一组大型语言模型(LLM)审查员和迭代反馈循环生成包含真实动作的任务蓝图;第二阶段则通过模拟的人机交互将这些蓝图转化为完整的交互轨迹。此外,开发了一系列参数规模从10亿到700亿不等的xLAM-2-fc-r系列模型,其性能优于前沿模型如GPT-4o和Claude 3.5,并且较小规模的模型在多轮设置中表现尤为突出,同时保持了跨多次试验的一致性。这种方法不仅提高了训练数据的质量,还促进了更可靠、高效及强大的AI代理的发展。所有合成数据集与训练好的模型均已开源以推动相关领域的研究进展。

链接: https://arxiv.org/abs/2504.03601
作者: Akshara Prabhakar,Zuxin Liu,Weiran Yao,Jianguo Zhang,Ming Zhu,Shiyu Wang,Zhiwei Liu,Tulika Awalgaonkar,Haolin Chen,Thai Hoang,Juan Carlos Niebles,Shelby Heinecke,Huan Wang,Silvio Savarese,Caiming Xiong
机构: Salesforce AI Research (Salesforce AI 研究)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 12 pages plus references and appendices

点击查看摘要

Abstract:Training effective AI agents for multi-turn interactions requires high-quality data that captures realistic human-agent dynamics, yet such data is scarce and expensive to collect manually. We introduce APIGen-MT, a two-phase framework that generates verifiable and diverse multi-turn agent data. In the first phase, our agentic pipeline produces detailed task blueprints with ground-truth actions, leveraging a committee of LLM reviewers and iterative feedback loops. These blueprints are then transformed into complete interaction trajectories through simulated human-agent interplay. We train a family of models – the xLAM-2-fc-r series with sizes ranging from 1B to 70B parameters. Our models outperform frontier models such as GPT-4o and Claude 3.5 on τ-bench and BFCL benchmarks, with the smaller models surpassing their larger counterparts, particularly in multi-turn settings, while maintaining superior consistency across multiple trials. Comprehensive experiments demonstrate that our verified blueprint-to-details approach yields high-quality training data, enabling the development of more reliable, efficient, and capable agents. We open-source both the synthetic data collected and the trained xLAM-2-fc-r models to advance research in AI agents. Models are available on HuggingFace at this https URL and project website is this https URL
zh
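APIGen-MT 的两阶段流程(蓝图生成 + 评审委员会把关,再展开为模拟人机交互轨迹)可以用如下骨架表示。所有函数均为占位:真实系统中蓝图生成、评审与交互模拟都由 LLM 完成,此处任务与动作名也是虚构的。

```python
# 示意性草图:APIGen-MT 两阶段流程骨架(全部为占位实现,非官方代码)。

def propose_blueprint(task):
    # 占位:第一阶段生成任务蓝图(含标准动作序列)
    return {"task": task, "actions": ["search_flight", "book_flight"]}

def committee_approves(blueprint, n_reviewers=3):
    # 占位:LLM 评审委员会逐项检查蓝图;此处只检查动作序列非空
    return all(len(blueprint["actions"]) > 0 for _ in range(n_reviewers))

def simulate_interplay(blueprint):
    # 占位:第二阶段把蓝图展开成多轮 人类-代理 交互轨迹
    traj = []
    for i, act in enumerate(blueprint["actions"]):
        traj.append(("human", f"第 {i + 1} 步请求"))
        traj.append(("agent", act))
    return traj

def gen_multiturn_data(task, max_retries=2):
    for _ in range(max_retries):
        bp = propose_blueprint(task)
        if committee_approves(bp):
            return simulate_interplay(bp)
    raise RuntimeError("blueprint rejected")

traj = gen_multiturn_data("订一张去东京的机票")
print(len(traj))  # 4 条消息(两问两答)
```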

[NLP-7] EnrichIndex: Using LLMs to Enrich Retrieval Indices Offline

【速读】: 该论文旨在解决现有信息检索系统在处理隐含相关性(implied relevance)时面临的挑战,即当文档的相关性并非显式表达于其内容中,而是通过特定的专业术语或结构暗示时,如何有效评估文档与查询之间的关联。传统方法依赖于语言匹配,难以应对这种隐含的相关性推理任务。此外,基于大型语言模型(LLM)的在线增强检索虽然具备推理能力,但存在高延迟和高昂计算成本的问题,因为每次查询都需要重新计算查询-文档相关性。

解决方案的关键在于引入EnrichIndex方法,它利用LLM的推理能力离线构建语义增强的检索索引(semantically-enriched retrieval indices)。具体而言,EnrichIndex通过对检索语料库中的所有文档进行一次性遍历(single pass),提前完成语义特征的提取和索引构建工作。这些语义增强的索引能够显著提升检索性能,并且可以与现有的在线LLM重排序器协同工作以进一步优化效果。实验结果表明,EnrichIndex在包含段落和表格的五项检索任务中表现出色,相较于强基准模型,在召回率@10和NDCG@10指标上分别提升了11.7点和10.6点,同时减少了293.3倍的在线LLM调用令牌数,从而大幅降低了延迟和成本。

链接: https://arxiv.org/abs/2504.03598
作者: Peter Baile Chen,Tomer Wolfson,Michael Cafarella,Dan Roth
机构: MIT (麻省理工学院); University of Pennsylvania (宾夕法尼亚大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
备注: Dataset and code are available at this https URL

点击查看摘要

Abstract:Existing information retrieval systems excel in cases where the language of target documents closely matches that of the user query. However, real-world retrieval systems are often required to implicitly reason whether a document is relevant. For example, when retrieving technical texts or tables, their relevance to the user query may be implied through a particular jargon or structure, rather than explicitly expressed in their content. Large language models (LLMs) hold great potential in identifying such implied relevance by leveraging their reasoning skills. Nevertheless, current LLM-augmented retrieval is hindered by high latency and computation cost, as the LLM typically computes the query-document relevance online, for every query anew. To tackle this issue we introduce EnrichIndex, a retrieval approach which instead uses the LLM offline to build semantically-enriched retrieval indices, by performing a single pass over all documents in the retrieval corpus once during ingestion time. Furthermore, the semantically-enriched indices can complement existing online retrieval approaches, boosting the performance of LLM re-rankers. We evaluated EnrichIndex on five retrieval tasks, involving passages and tables, and found that it outperforms strong online LLM-based retrieval systems, with an average improvement of 11.7 points in recall@10 and 10.6 points in NDCG@10 compared to strong baselines. In terms of online calls to the LLM, it processes 293.3 times fewer tokens which greatly reduces the online latency and cost. Overall, EnrichIndex is an effective way to build better retrieval indices offline by leveraging the strong reasoning skills of LLMs.
zh
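EnrichIndex 的关键是"离线一次遍历、在线零 LLM 调用":入库时为每篇文档附加 LLM 推理出的语义描述,在线检索只在富集后的文本上做匹配。下面的草图用占位的 enrich_with_llm 演示这一分工(真实系统应调用 LLM,示例文档与描述均为虚构)。

```python
# 示意性草图:离线一次遍历语料、为文档附加"语义富集"字段后建索引
# (概念演示;enrich_with_llm 为占位函数,真实系统应调用 LLM)。

def enrich_with_llm(doc):
    # 占位:真实实现中由 LLM 推理文档隐含的用途/主题
    hints = {"CPT-7942": "medical billing code table",
             "SELECT * FROM": "SQL query example"}
    for key, hint in hints.items():
        if key in doc:
            return hint
    return ""

def build_enriched_index(corpus):
    """离线阶段:对语料只遍历一次,存储原文 + 富集描述。"""
    return [{"doc": d, "enriched": enrich_with_llm(d)} for d in corpus]

def search(index, query):
    """在线阶段:无需再调用 LLM,直接在富集后的文本上做词匹配。"""
    q = set(query.lower().split())
    def score(entry):
        text = (entry["doc"] + " " + entry["enriched"]).lower()
        return sum(1 for w in q if w in text)
    return max(index, key=score)["doc"]

corpus = ["CPT-7942 99213 99214", "SELECT * FROM users"]
index = build_enriched_index(corpus)
print(search(index, "medical billing"))  # 命中仅隐含相关的表格文档
```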

[NLP-8] Extending the SAREF4ENER Ontology with Flexibility Based on FlexOffers

【速读】: 该论文旨在解决现有行业标准本体(SAREF4ENER)在支持能源灵活性方面的局限性问题,特别是其无法充分描述复杂设备(如电动汽车、电池和热泵)的灵活性以及捕获许多柔性负荷类型的固有不确定性。论文的关键解决方案是提出了一种对SAREF4ENER的扩展,该扩展全面集成了FlexOffer模型的支持,包括高级用例,同时保持向后兼容性。这一新型本体模块能够精确描述先进设备的灵活性,并有效处理与柔性负载相关的不确定性问题。

链接: https://arxiv.org/abs/2504.03595
作者: Fabio Lilliu(1),Amir Laadhar(2),Christian Thomsen(3),Diego Reforgiato Recupero(1),Torben Bach Pedersen(3) ((1) University of Cagliari, (2) PANTOPIX GmbH & Co. KG, (3) Aalborg University)
机构: University of Cagliari (卡利亚里大学); PANTOPIX GmbH & Co. KG (潘托皮克斯有限公司); Aalborg University (奥尔堡大学)
类目: Computation and Language (cs.CL)
备注: 13 pages, 5 figures, 4 tables. Submitted to SmartGridComm 2025

点击查看摘要

Abstract:A key element to support the increased amounts of renewable energy in the energy system is flexibility, i.e., the possibility of changing energy loads in time and amount. Many flexibility models have been designed; however, exact models fail to scale for long time horizons or many devices. Because of this, the FlexOffer (FOs) model has been designed, to provide device-independent approximations of flexibility with good accuracy, and much better scaling for long time horizons and many devices. An important aspect of the real-life implementation of energy flexibility is enabling flexible data exchange with many types of smart energy appliances and market systems, e.g., in smart buildings. For this, ontologies standardizing data formats are required. However, the current industry standard ontology for integrating smart devices for energy purposes, SAREF for Energy Flexibility (SAREF4ENER) only has limited support for flexibility and thus cannot support important use cases. In this paper we propose an extension of SAREF4ENER that integrates full support for the complete FlexOffer model, including advanced use cases, while maintaining backward compatibility. This novel ontology module can accurately describe flexibility for advanced devices such as electric vehicles, batteries, and heat pumps. It can also capture the inherent uncertainty associated with many flexible load types.
zh
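FlexOffer 模型的核心思想是用区间描述负荷在时间与能量两个维度上的灵活性。下面是一个极简的数据表示草图,仅用于说明"灵活性"概念;字段名与取值均为假设,并非 SAREF4ENER 或 FlexOffer 规范的正式定义。

```python
# 示意性草图:FlexOffer 的极简数据表示(开始时间与各时隙能量的灵活区间)。
# 假设:字段为演示所设,非本体/规范定义;时隙以整数编号,能量单位 kWh。
from dataclasses import dataclass
from typing import List

@dataclass
class SliceFlexibility:
    energy_min_kwh: float
    energy_max_kwh: float

@dataclass
class FlexOffer:
    earliest_start: int          # 最早开始时隙
    latest_start: int            # 最晚开始时隙
    slices: List[SliceFlexibility]

    def feasible(self, start: int, energies: List[float]) -> bool:
        """检查某个调度(开始时隙 + 各时隙能量)是否落在灵活区间内。"""
        if not (self.earliest_start <= start <= self.latest_start):
            return False
        if len(energies) != len(self.slices):
            return False
        return all(s.energy_min_kwh <= e <= s.energy_max_kwh
                   for s, e in zip(self.slices, energies))

# 虚构示例:一次电动汽车充电,可在 18-22 时隙之间开始,持续两个时隙
ev_charge = FlexOffer(18, 22, [SliceFlexibility(1.0, 3.0),
                               SliceFlexibility(1.0, 3.0)])
print(ev_charge.feasible(20, [2.5, 1.5]))  # True
print(ev_charge.feasible(23, [2.5, 1.5]))  # False:超出最晚开始时隙
```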

[NLP-9] SynWorld: Virtual Scenario Synthesis for Agentic Action Knowledge Refinement

【速读】: 该论文试图解决LLM(大型语言模型)驱动的智能体在新型环境或非传统动作空间中部署时所面临的重大挑战。这些挑战包括如何使智能体自主探索环境、优化工作流程以及提升对动作的理解。为了解决这些问题,论文提出了一种名为SynWorld的框架,其关键是通过在动作空间内合成多步动作调用的可能场景,并结合蒙特卡洛树搜索(Monte Carlo Tree Search, MCTS)进行探索,从而有效精炼智能体在当前环境中的动作知识。实验结果表明,SynWorld是一种在新环境中学习动作知识的有效且通用的方法。

链接: https://arxiv.org/abs/2504.03561
作者: Runnan Fang,Xiaobin Wang,Yuan Liang,Shuofei Qiao,Jialong Wu,Zekun Xi,Ningyu Zhang,Yong Jiang,Pengjun Xie,Fei Huang,Huajun Chen
机构: Zhejiang University (浙江大学); Alibaba Group (阿里巴巴集团); Zhejiang Key Laboratory of Big Data Intelligent Computing (浙江省大数据智能计算重点实验室)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
备注: Work in progress

点击查看摘要

Abstract:In the interaction between agents and their environments, agents expand their capabilities by planning and executing actions. However, LLM-based agents face substantial challenges when deployed in novel environments or required to navigate unconventional action spaces. To empower agents to autonomously explore environments, optimize workflows, and enhance their understanding of actions, we propose SynWorld, a framework that allows agents to synthesize possible scenarios with multi-step action invocation within the action space and perform Monte Carlo Tree Search (MCTS) exploration to effectively refine their action knowledge in the current environment. Our experiments demonstrate that SynWorld is an effective and general approach to learning action knowledge in new environments. Code is available at this https URL.
zh
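SynWorld 在合成场景中用 MCTS 式探索来精炼动作知识。下面用一个单层 UCB 选择骨架示意"在模拟场景中反复试用候选动作描述、逐步集中到更优者"的过程;场景打分函数与成功率均为虚构占位,并非论文官方算法。

```python
# 示意性草图:用 UCB 在合成场景上迭代评估候选"动作知识"
# (MCTS 的单层选择骨架;simulate_scenario 为虚构占位)。
import math
import random

def simulate_scenario(action_desc, rng):
    # 占位:在合成场景中试用该动作描述,返回成功(1)/失败(0)
    success_rate = {"desc_v1": 0.3, "desc_v2": 0.8}[action_desc]
    return 1 if rng.random() < success_rate else 0

def ucb_refine(candidates, n_rounds=300, c=1.4, seed=0):
    rng = random.Random(seed)
    counts = {a: 0 for a in candidates}
    wins = {a: 0 for a in candidates}
    for t in range(1, n_rounds + 1):
        def ucb(a):
            if counts[a] == 0:
                return float("inf")  # 未试过的候选优先探索
            return wins[a] / counts[a] + c * math.sqrt(math.log(t) / counts[a])
        a = max(candidates, key=ucb)
        counts[a] += 1
        wins[a] += simulate_scenario(a, rng)
    return max(candidates, key=lambda a: wins[a] / counts[a])

print(ucb_refine(["desc_v1", "desc_v2"]))  # 大概率收敛到 desc_v2
```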

[NLP-10] Agentic Knowledgeable Self-awareness

【速读】: 该论文试图解决传统大型语言模型(LLMs)在代理规划任务中采用“广撒网”方法忽视人类决策过程中情境自我意识(situation self-awareness)的问题。具体而言,传统的代理规划方法无差别地注入黄金轨迹、外部反馈和领域知识,而未充分利用动态评估情境需求及战略性资源分配的能力。为填补这一空白,论文提出了基于知识的情境自我意识(agentic knowledgeable self-awareness)的新范式,使基于LLMs的代理能够自主调节知识利用。

解决方案的关键在于提出了一种以数据为中心的方法KnowSelf,它赋予代理类似人类的知识情境自我意识。通过设计一种启发式的场景判断标准,KnowSelf在代理自我探索的轨迹中标记特殊标记符以收集训练数据,并通过两阶段训练过程实现特定情境下生成相应特殊标记符,从而以最小成本达到最佳规划效果。实验表明,KnowSelf在不同任务和模型上优于多种强基准模型,且对外部知识依赖极小。

链接: https://arxiv.org/abs/2504.03553
作者: Shuofei Qiao,Zhisong Qiu,Baochang Ren,Xiaobin Wang,Xiangyuan Ru,Ningyu Zhang,Xiang Chen,Yong Jiang,Pengjun Xie,Fei Huang,Huajun Chen
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
备注: Work in progress

点击查看摘要

Abstract:Large Language Models (LLMs) have achieved considerable performance across various agentic planning tasks. However, traditional agent planning approaches adopt a “flood irrigation” methodology that indiscriminately injects gold trajectories, external feedback, and domain knowledge into agent models. This practice overlooks the fundamental human cognitive principle of situational self-awareness during decision-making-the ability to dynamically assess situational demands and strategically employ resources during decision-making. We propose agentic knowledgeable self-awareness to address this gap, a novel paradigm enabling LLM-based agents to autonomously regulate knowledge utilization. Specifically, we propose KnowSelf, a data-centric approach that applies agents with knowledgeable self-awareness like humans. Concretely, we devise a heuristic situation judgement criterion to mark special tokens on the agent’s self-explored trajectories for collecting training data. Through a two-stage training process, the agent model can switch between different situations by generating specific special tokens, achieving optimal planning effects with minimal costs. Our experiments demonstrate that KnowSelf can outperform various strong baselines on different tasks and models with minimal use of external knowledge. Code is available at this https URL.
zh
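KnowSelf 的核心机制是"按情境生成特殊标记,决定是否引入外部知识或反思"。下面的草图只演示这种分支结构;judge_situation 是虚构的启发式,特殊标记名也是假设,并非论文训练出的判据。

```python
# 示意性草图:按情境判断生成特殊标记、决定知识利用方式
# (概念演示;judge_situation 为虚构启发式,标记名为假设)。

def judge_situation(observation):
    # 占位启发式:出错则反思,遇到陌生事物则检索知识,否则直接行动
    if "error" in observation:
        return "<reflect>"
    if "unfamiliar" in observation:
        return "<knowledge>"
    return "<direct>"

def agent_step(observation):
    token = judge_situation(observation)
    if token == "<knowledge>":
        return token, "检索外部知识后再规划"
    if token == "<reflect>":
        return token, "回顾上一步失败原因后重试"
    return token, "直接给出下一步动作"

print(agent_step("unfamiliar object on the table")[0])  # <knowledge>
```

这种"按需调用"避免了无差别注入外部知识,对应论文所说的以最小成本达到最佳规划效果。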

[NLP-11] MultiMed-ST: Large-scale Many-to-many Multilingual Medical Speech Translation

【速读】: 该论文试图解决多语言医学领域语音翻译(Speech Translation, ST)的问题,特别是在跨越语言障碍以提升患者护理、缓解专业人员短缺以及改善诊断和治疗方面的需求。论文的关键在于构建了一个大规模的医学ST数据集MultiMed-ST,包含五种语言(越南语、英语、德语、法语、繁体中文和简体中文)的所有翻译方向,总计29万样本,这是目前最大的医学机器翻译(Machine Translation, MT)数据集和跨领域最大的多对多多语言ST数据集。此外,论文通过广泛的分析研究提供了重要的解决方案基准,包括双语与多语言对比、端到端与级联模型对比、任务特定与多任务序列到序列(seq2seq)对比、代码混合分析以及定量-定性错误分析,从而为医学ST领域的研究奠定了坚实的基础。所有代码、数据和模型均在线公开获取。

链接: https://arxiv.org/abs/2504.03546
作者: Khai Le-Duc,Tuyen Tran,Bach Phan Tat,Nguyen Kim Hai Bui,Quan Dang,Hung-Phong Tran,Thanh-Thuy Nguyen,Ly Nguyen,Tuan-Minh Phan,Thi Thu Phuong Tran,Chris Ngo,Nguyen X. Khanh,Thanh Nguyen-Tang
机构: University of Toronto (多伦多大学); University Health Network (未知); Knovel Engineering Lab (未知); Hanoi University of Science and Technology (河内理工大学); KU Leuven (鲁汶大学); Eötvös Loránd University (罗兰大学); HCMC Open University (胡志明市开放大学); IÉSEG School of Management (未知); Technische Universität Dortmund (多特蒙德工业大学); University of Hertfordshire (赫特福德郡大学); UC Berkeley (加州大学伯克利分校); Johns Hopkins University (约翰斯·霍普金斯大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Sound (cs.SD); Audio and Speech Processing (eess.AS)
备注: Preprint, 122 pages

点击查看摘要

Abstract:Multilingual speech translation (ST) in the medical domain enhances patient care by enabling efficient communication across language barriers, alleviating specialized workforce shortages, and facilitating improved diagnosis and treatment, particularly during pandemics. In this work, we present the first systematic study on medical ST, to our best knowledge, by releasing MultiMed-ST, a large-scale ST dataset for the medical domain, spanning all translation directions in five languages: Vietnamese, English, German, French, Traditional Chinese and Simplified Chinese, together with the models. With 290,000 samples, our dataset is the largest medical machine translation (MT) dataset and the largest many-to-many multilingual ST among all domains. Secondly, we present the most extensive analysis study in ST research to date, including: empirical baselines, bilingual-multilingual comparative study, end-to-end vs. cascaded comparative study, task-specific vs. multi-task sequence-to-sequence (seq2seq) comparative study, code-switch analysis, and quantitative-qualitative error analysis. All code, data, and models are available online: this https URL.
zh

[NLP-12] Diverse In-Context Example Selection After Decomposing Programs and Aligned Utterances Improves Semantic Parsing NAACL2025

【速读】: 本文旨在解决自然语言到结构化程序语义解析(Semantic Parsing)中提示学习(Prompt-based Learning)的上下文示例选择问题。传统方法通常直接使用完整的上下文示例(ICEs),而忽视了这些示例可能包含冗余或不相关的信息。论文的关键创新在于将可用的上下文示例树(ICE Trees)分解为片段,并提出一种利用带有语法约束的大型语言模型(LLMs)来自动映射这些片段到相应自然语言表达的方法。此外,作者还扩展了一种多样化的上下文示例选择方法以适应完整与片段化的上下文示例场景。通过这种方法,论文在多个流行的语义解析基准测试中展示了显著的准确性提升,尤其对较小规模的LLMs及资源稀缺语言的程序解析效果尤为明显。

链接: https://arxiv.org/abs/2504.03541
作者: Mayank Kothyari,Sunita Sarawagi,Soumen Chakrabarti,Gaurav Arora,Srujana Merugu
机构: Indian Institute of Technology Bombay (印度理工学院孟买); Amazon (亚马逊)
类目: Computation and Language (cs.CL)
备注: To appear at NAACL 2025 (Main)

点击查看摘要

Abstract:LLMs are increasingly used as seq2seq translators from natural language utterances to structured programs, a process called semantic interpretation. Unlike atomic labels or token sequences, programs are naturally represented as abstract syntax trees (ASTs). Such structured representation raises novel issues related to the design and selection of in-context examples (ICEs) presented to the LLM. We focus on decomposing the pool of available ICE trees into fragments, some of which may be better suited to solving the test instance. Next, we propose how to use (additional invocations of) an LLM with prompted syntax constraints to automatically map the fragments to corresponding utterances. Finally, we adapt and extend a recent method for diverse ICE selection to work with whole and fragmented ICE instances. We evaluate our system, SCUD4ICL, on popular diverse semantic parsing benchmarks, showing visible accuracy gains from our proposed decomposed diverse demonstration method. Benefits are particularly notable for smaller LLMs, ICE pools having larger labeled trees, and programs in lower resource languages.
zh
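SCUD4ICL 的第一步是把上下文示例程序(AST)分解为子树片段。下面用嵌套元组表示 AST,演示片段枚举这一步;真实系统还需用带语法约束的 LLM 把片段映射回自然语言表述,示例程序为虚构。

```python
# 示意性草图:把程序的 AST 分解成子树片段,供多样化 ICE 选择使用
# (仅演示分解步骤;AST 用嵌套元组表示,示例程序为虚构)。

def subtrees(ast):
    """枚举以每个节点为根的子树(片段)。"""
    frags = [ast]
    if isinstance(ast, tuple):
        for child in ast[1:]:  # ast[0] 为算子名,其余为子节点
            frags.extend(subtrees(child))
    return frags

# 示例程序:filter(greater(price, 100), books)
ast = ("filter", ("greater", "price", "100"), "books")
frags = subtrees(ast)
print(len(frags))  # 5 个片段(2 个内部节点 + 3 个叶节点)
```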

[NLP-13] Neutralizing the Narrative: AI-Powered Debiasing of Online News Articles

【速读】: 该论文旨在解决新闻报道中的偏见问题及其对公众认知的影响,特别是在犯罪、政治和社会议题方面的偏见。传统的人工偏见检测方法因主观解读和可扩展性限制而存在不足。为应对这一挑战,论文提出了一种基于AI的框架,利用先进的大型语言模型(LLMs),如GPT-4o、GPT-4o Mini、Gemini Pro、Gemini Flash、Llama 8B和Llama 3B,系统性地识别并缓解新闻文章中的偏见。解决方案的关键在于采用两阶段方法:第一阶段通过每个LLM在段落级别评分并解释偏见内容,并结合人工评估建立真实标注;第二阶段利用GPT-4o Mini进行迭代去偏处理,同时通过自动化重评与人工审核验证其效果。实验结果表明,GPT-4o Mini在偏见检测的准确性及去偏效果方面表现优异,且分析揭示了媒体偏见随时间和地理区域变化的现象,这些变化与社会政治动态和现实世界事件相关联。本研究为偏见缓解提供了可扩展的计算方法,促进了新闻报道的公平性和问责制。

链接: https://arxiv.org/abs/2504.03520
作者: Chen Wei Kuo,Kevin Chu,Nouar AlDahoul,Hazem Ibrahim,Talal Rahwan,Yasir Zaki
机构: 未知
类目: Computation and Language (cs.CL); Computers and Society (cs.CY)
备注: 23 pages, 3 figures

点击查看摘要

Abstract:Bias in news reporting significantly impacts public perception, particularly regarding crime, politics, and societal issues. Traditional bias detection methods, predominantly reliant on human moderation, suffer from subjective interpretations and scalability constraints. Here, we introduce an AI-driven framework leveraging advanced large language models (LLMs), specifically GPT-4o, GPT-4o Mini, Gemini Pro, Gemini Flash, Llama 8B, and Llama 3B, to systematically identify and mitigate biases in news articles. To this end, we collect an extensive dataset consisting of over 30,000 crime-related articles from five politically diverse news sources spanning a decade (2013-2023). Our approach employs a two-stage methodology: (1) bias detection, where each LLM scores and justifies biased content at the paragraph level, validated through human evaluation for ground truth establishment, and (2) iterative debiasing using GPT-4o Mini, verified by both automated reassessment and human reviewers. Empirical results indicate GPT-4o Mini’s superior accuracy in bias detection and effectiveness in debiasing. Furthermore, our analysis reveals temporal and geographical variations in media bias correlating with socio-political dynamics and real-world events. This study contributes to scalable computational methodologies for bias mitigation, promoting fairness and accountability in news reporting.
zh
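论文的两阶段"检测-去偏"流程可以写成如下骨架:先对段落打偏见分并给出理由,分数超阈值则迭代改写并复检。detect_bias 与 debias 均为占位实现(真实系统中两者都调用 LLM),词表与阈值为虚构示例。

```python
# 示意性草图:两阶段"检测-去偏"流程骨架(占位函数,非论文官方实现)。

def detect_bias(paragraph):
    # 占位:含倾向性词汇则给高偏见分;真实系统由 LLM 打分并说明理由
    loaded = {"thugs", "invasion", "radical"}
    words = set(paragraph.lower().replace(".", "").split())
    hits = loaded & words
    return (0.9, f"loaded terms: {sorted(hits)}") if hits else (0.1, "neutral")

def debias(paragraph):
    # 占位:用中性词替换;真实实现中由 LLM 重写
    repl = {"thugs": "individuals", "invasion": "influx", "radical": "activist"}
    return " ".join(repl.get(w.lower(), w) for w in paragraph.split())

def debias_pipeline(paragraph, threshold=0.5, max_rounds=3):
    """迭代去偏:改写后复检,直到通过或达到最大轮数。"""
    for _ in range(max_rounds):
        score, _ = detect_bias(paragraph)
        if score < threshold:
            return paragraph
        paragraph = debias(paragraph)
    return paragraph

print(debias_pipeline("Thugs caused an invasion downtown"))
```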

[NLP-14] Structured Legal Document Generation in India: A Model-Agnostic Wrapper Approach with VidhikDastaavej

【速读】: 该论文试图解决印度法律领域私密法律文件结构化生成的问题,这一任务在之前的研究中未得到充分关注。论文的关键解决方案在于提出了一种名为“Model-Agnostic Wrapper (MAW)”的两步框架:首先生成结构化的章节标题,然后通过结合检索机制迭代生成内容,以确保文档的连贯性和事实准确性。此外,论文还开发了一个包含匿名化数据集“VidhikDastaavej”和特定于印度法律文本的生成模型“NyayaShilp”的系统,并设计了人机协同(Human-in-the-Loop, HITL)的交互式文档生成工具,以提升实际应用中的效率和可靠性。这些方法共同构成了一个可扩展且适应性强的AI辅助法律文件起草基础。

链接: https://arxiv.org/abs/2504.03486
作者: Shubham Kumar Nigam,Balaramamahanthi Deepak Patnaik,Ajay Varghese Thomas,Noel Shallum,Kripabandhu Ghosh,Arnab Bhattacharya
机构: IIT Kanpur (印度理工学院坎普尔分校); SRM Institute of Science and Technology (SRM 科学与技术研究院); IISER Kolkata (印度科学教育与研究学院加尔各答分校); Symbiosis Law School Pune (浦那共生法学院)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Automating legal document drafting can significantly enhance efficiency, reduce manual effort, and streamline legal workflows. While prior research has explored tasks such as judgment prediction and case summarization, the structured generation of private legal documents in the Indian legal domain remains largely unaddressed. To bridge this gap, we introduce VidhikDastaavej, a novel, anonymized dataset of private legal documents, and develop NyayaShilp, a fine-tuned legal document generation model specifically adapted to Indian legal texts. We propose a Model-Agnostic Wrapper (MAW), a two-step framework that first generates structured section titles and then iteratively produces content while leveraging retrieval-based mechanisms to ensure coherence and factual accuracy. We benchmark multiple open-source LLMs, including instruction-tuned and domain-adapted versions, alongside proprietary models for comparison. Our findings indicate that while direct fine-tuning on small datasets does not always yield improvements, our structured wrapper significantly enhances coherence, factual adherence, and overall document quality while mitigating hallucinations. To ensure real-world applicability, we developed a Human-in-the-Loop (HITL) Document Generation System, an interactive user interface that enables users to specify document types, refine section details, and generate structured legal drafts. This tool allows legal professionals and researchers to generate, validate, and refine AI-generated legal documents efficiently. Extensive evaluations, including expert assessments, confirm that our framework achieves high reliability in structured legal drafting. This research establishes a scalable and adaptable foundation for AI-assisted legal drafting in India, offering an effective approach to structured legal document generation.

[NLP-15] SpectR: Dynamically Composing LM Experts with Spectral Routing

【Quick Read】: This paper addresses the problem of effectively leveraging existing specialized expert models in real-world applications, i.e., how to dynamically select or merge the models best suited for a given task. The key to the solution is SPECTR, a method that composes expert models dynamically at each time step during inference without any additional training. It supports flexible token- and layer-wise model combinations, improving performance on tasks across expert domains and significantly increasing routing accuracy.

Link: https://arxiv.org/abs/2504.03454
Authors: William Fleshman, Benjamin Van Durme
Institutions: Johns Hopkins University
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:Training large, general-purpose language models poses significant challenges. The growing availability of specialized expert models, fine-tuned from pretrained models for specific tasks or domains, offers a promising alternative. Leveraging the potential of these existing expert models in real-world applications requires effective methods to select or merge the models best suited for a given task. This paper introduces SPECTR, an approach for dynamically composing expert models at each time step during inference. Notably, our method requires no additional training and enables flexible, token- and layer-wise model combinations. Our experimental results demonstrate that SPECTR improves routing accuracy over alternative training-free methods, increasing task performance across expert domains.

[NLP-16] Locations of Characters in Narratives: Andersen and Persuasion Datasets

【Quick Read】: This paper evaluates the spatial understanding of Large Language Models (LLMs): their ability to grasp the relationship between characters and their corresponding locations in narrative contexts. To this end, the study introduces two new datasets, Andersen and Persuasion. The key to the approach is constructing prompts that combine story excerpts with questions about a character's location, and testing different LLMs against manually annotated datasets. The best-performing model reached 61.85% accuracy on the Andersen dataset and 56.06% on the Persuasion dataset.

Link: https://arxiv.org/abs/2504.03434
Authors: Batuhan Ozyurt, Roya Arkhmammadova, Deniz Yuret
Institutions: Unknown
Subjects: Computation and Language (cs.CL)
Comments: 14 pages, 3 figures, 10 tables

Click to view abstract

Abstract:The ability of machines to grasp spatial understanding within narrative contexts is an intriguing aspect of reading comprehension that continues to be studied. Motivated by the goal to test the AI’s competence in understanding the relationship between characters and their respective locations in narratives, we introduce two new datasets: Andersen and Persuasion. For the Andersen dataset, we selected fifteen children’s stories from “Andersen’s Fairy Tales” by Hans Christian Andersen and manually annotated the characters and their respective locations throughout each story. Similarly, for the Persuasion dataset, characters and their locations in the novel “Persuasion” by Jane Austen were also manually annotated. We used these datasets to prompt Large Language Models (LLMs). The prompts are created by extracting excerpts from the stories or the novel and combining them with a question asking the location of a character mentioned in that excerpt. Out of the five LLMs we tested, the best-performing one for the Andersen dataset accurately identified the location in 61.85% of the examples, while for the Persuasion dataset, the best-performing one did so in 56.06% of the cases.

[NLP-17] Online Difficulty Filtering for Reasoning Oriented Reinforcement Learning

【Quick Read】: This paper addresses the problem that, in Reasoning-Oriented Reinforcement Learning (RORL), reward sparsity makes training effectiveness highly dependent on selecting problems of appropriate difficulty. Traditional approaches such as curriculum learning rely on static difficulty schedules, and recent online filtering methods, though an improvement, lack theoretical grounding and systematic validation. The key idea is balanced online difficulty filtering: curating each batch on the fly with problems on which the current training model achieves intermediate accuracy, thereby maximizing the effectiveness of RORL training. The paper derives theoretically that a lower bound on the KL divergence between the initial and optimal policies can be expressed in terms of the variance of the sampled accuracy, and proves that balanced filtering maximizes this lower bound, leading to better performance. Experiments on several math reasoning benchmarks show an additional 10% gain on AIME and a 4% average improvement, along with markedly better sample efficiency and training-time efficiency.

Link: https://arxiv.org/abs/2504.03380
Authors: Sanghwan Bae, Jiwoo Hong, Min Young Lee, Hanbyul Kim, JeongYeon Nam, Donghyun Kwak
Institutions: NAVER Cloud; KAIST AI
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Reasoning-Oriented Reinforcement Learning (RORL) enhances the reasoning ability of Large Language Models (LLMs). However, due to the sparsity of rewards in RORL, effective training is highly dependent on the selection of problems of appropriate difficulty. Although curriculum learning attempts to address this by adjusting difficulty, it often relies on static schedules, and even recent online filtering methods lack theoretical grounding and a systematic understanding of their effectiveness. In this work, we theoretically and empirically show that curating the batch with the problems that the training model achieves intermediate accuracy on the fly can maximize the effectiveness of RORL training, namely balanced online difficulty filtering. We first derive that the lower bound of the KL divergence between the initial and the optimal policy can be expressed with the variance of the sampled accuracy. Building on those insights, we show that balanced filtering can maximize the lower bound, leading to better performance. Experimental results across five challenging math reasoning benchmarks show that balanced online filtering yields an additional 10% in AIME and 4% improvements in average over plain GRPO. Moreover, further analysis shows the gains in sample efficiency and training time efficiency, exceeding the maximum reward of plain GRPO within 60% training time and the volume of the training set.
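As a concrete illustration of the batch-curation step described above, here is a minimal Python sketch. It is our reconstruction, not the authors' code: `policy` is assumed to be a callable returning whether one sampled rollout solves the problem, and the rollout count and accuracy band are placeholder hyperparameters.

```python
def sample_accuracy(problem, policy, n_rollouts=8):
    """Estimate the current policy's accuracy on one problem via rollouts."""
    correct = sum(bool(policy(problem)) for _ in range(n_rollouts))
    return correct / n_rollouts

def balanced_filter(problems, policy, low=0.25, high=0.75, batch_size=32):
    """Curate a batch from problems the policy solves at an intermediate rate:
    neither trivially easy (accuracy > high) nor hopeless (accuracy < low)."""
    batch = []
    for problem in problems:
        acc = sample_accuracy(problem, policy)
        if low <= acc <= high:
            batch.append((problem, acc))
        if len(batch) == batch_size:
            break
    return batch
```

Trivially-solved and never-solved problems contribute near-zero gradient signal under sparse rewards, which is why the intermediate band is where the filtering concentrates.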

[NLP-18] Sustainable LLM Inference for Edge AI: Evaluating Quantized LLMs for Energy Efficiency, Output Accuracy, and Inference Latency

【Quick Read】: This paper targets the challenges of deploying Large Language Models (LLMs) on edge devices: limited compute, memory constraints, inference speed, and energy consumption. The key to the solution is model quantization, which substantially reduces model size and computational overhead while preserving output accuracy, enabling efficient inference. Concretely, the study analyzes 28 quantized LLMs from the Ollama library, which by default applies Post-Training Quantization (PTQ) and weight-only quantization, deployed on an edge device (Raspberry Pi 4 with 4GB RAM). It evaluates energy efficiency, inference performance, and output accuracy across multiple quantization levels and task types, benchmarking on five standardized datasets (CommonsenseQA, BIG-Bench Hard, TruthfulQA, GSM8K, and HumanEval) and capturing real-world power draw with a hardware-level energy measurement tool. The results reveal the trade-offs among energy efficiency, inference speed, and accuracy under different quantization settings, and identify configurations that optimize LLM deployment in resource-constrained environments. By combining hardware-level energy profiling with LLM benchmarking, the work provides actionable insights for sustainable AI, bridging an important gap in research on energy-aware LLM deployment.

Link: https://arxiv.org/abs/2504.03360
Authors: Erik Johannes Husom, Arda Goknil, Merve Astekin, Lwin Khin Shar, Andre Kåsen, Sagar Sen, Benedikt Andreas Mithassel, Ahmet Soylu
Institutions: SINTEF (Oslo, Norway); Singapore Management University (Singapore); Oslo Metropolitan University (Oslo, Norway); Kristiania University of Applied Sciences (Oslo, Norway)
Subjects: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: 30 pages, 14 figures

Click to view abstract

Abstract:Deploying Large Language Models (LLMs) on edge devices presents significant challenges due to computational constraints, memory limitations, inference speed, and energy consumption. Model quantization has emerged as a key technique to enable efficient LLM inference by reducing model size and computational overhead. In this study, we conduct a comprehensive analysis of 28 quantized LLMs from the Ollama library, which applies by default Post-Training Quantization (PTQ) and weight-only quantization techniques, deployed on an edge device (Raspberry Pi 4 with 4GB RAM). We evaluate energy efficiency, inference performance, and output accuracy across multiple quantization levels and task types. Models are benchmarked on five standardized datasets (CommonsenseQA, BIG-Bench Hard, TruthfulQA, GSM8K, and HumanEval), and we employ a high-resolution, hardware-based energy measurement tool to capture real-world power consumption. Our findings reveal the trade-offs between energy efficiency, inference speed, and accuracy in different quantization settings, highlighting configurations that optimize LLM deployment for resource-constrained environments. By integrating hardware-level energy profiling with LLM benchmarking, this study provides actionable insights for sustainable AI, bridging a critical gap in existing research on energy-aware LLM deployment.
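The per-token efficiency figures such a benchmark reports can be derived from three raw measurements per run: the energy-meter reading, the number of generated tokens, and wall-clock time. A small sketch (the function name and the exact set of reported metrics are our assumptions, not the paper's):

```python
def efficiency_metrics(energy_joules, n_tokens, wall_seconds):
    """Fold raw per-run measurements into per-token efficiency figures."""
    return {
        "tokens_per_second": n_tokens / wall_seconds,   # inference speed
        "joules_per_token": energy_joules / n_tokens,   # energy efficiency
        "average_watts": energy_joules / wall_seconds,  # sustained power draw
    }
```

Comparing quantization levels then reduces to comparing these figures alongside task accuracy on the five datasets.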

[NLP-19] Detecting Stereotypes and Anti-stereotypes the Correct Way Using Social Psychological Underpinnings

【Quick Read】: This paper addresses the difficult problem of detecting stereotypes and anti-stereotypes, with particular attention to clearly distinguishing stereotypes, anti-stereotypes, stereotypical biases, and biases, offering insight through precise definitions of these terms. The key to the solution is StereoDetect, a high-quality benchmark dataset built by optimally reusing existing datasets such as StereoSet and WinoQueer, combined with a manual verification process and the transfer of semantic information. The study finds that language models with fewer than 10B parameters are often confused when detecting anti-stereotypes, and underscores the critical importance of well-curated datasets for improving stereotype detection.

Link: https://arxiv.org/abs/2504.03352
Authors: Kaustubh Shivshankar Shejole, Pushpak Bhattacharyya
Institutions: Unknown
Subjects: Computation and Language (cs.CL); Computers and Society (cs.CY); Human-Computer Interaction (cs.HC)
Comments:

Click to view abstract

Abstract:Stereotypes are known to be highly pernicious, making their detection critically important. However, current research predominantly focuses on detecting and evaluating stereotypical biases in LLMs, leaving the study of stereotypes in its early stages. Many studies have failed to clearly distinguish between stereotypes and stereotypical biases, which has significantly slowed progress in advancing research in this area. Stereotype and anti-stereotype detection is a problem that requires knowledge of society; hence, it is one of the most difficult areas in Responsible AI. This work investigates this task, where we propose a four-tuple definition and provide precise terminology distinguishing stereotype, anti-stereotype, stereotypical bias, and bias, offering valuable insights into their various aspects. In this paper, we propose StereoDetect, a high-quality benchmarking dataset curated for this task by optimally utilizing current datasets such as StereoSet and WinoQueer, involving a manual verification process and the transfer of semantic information. We demonstrate that language models for reasoning with fewer than 10B parameters often get confused when detecting anti-stereotypes. We also demonstrate the critical importance of well-curated datasets by comparing our model with other current models for stereotype detection. The dataset and code is available at this https URL.

[NLP-20] BabyLM's First Words: Word Segmentation as a Phonological Probing Task CONLL2025

【Quick Read】: This paper addresses the limitations of large language models (LLMs) for phonological analysis: existing benchmarks are largely restricted to English, and the standard subword input representation (grapheme-based subword segmentation) is unsuitable for analyzing phoneme representations. The key to the solution is treating word-boundary extraction as a phonological probing task: exploiting the observation that prediction error peaks at word onsets, the paper presents unsupervised methods for extracting word boundaries from trained phoneme-based language models (trained on child-directed speech across 31 languages), and further uses linear probes to show that these models implicitly track word boundaries even when boundaries never appear in the training data. This cross-lingual work corroborates statistical learning theories of acquisition and empirically motivates new methods for training subword tokenizers.

Link: https://arxiv.org/abs/2504.03338
Authors: Zébulon Goriely
Institutions: Department of Computer Science & Technology, University of Cambridge, U.K.; ALTA Institute, University of Cambridge, U.K.
Subjects: Computation and Language (cs.CL)
Comments: 17 pages, 10 figures, submitted to CoNLL 2025

Click to view abstract

Abstract:Language models provide a key framework for studying linguistic theories based on prediction, but phonological analysis using large language models (LLMs) is difficult; there are few phonological benchmarks beyond English and the standard input representation used in LLMs (subwords of graphemes) is not suitable for analyzing the representation of phonemes. In this work, we demonstrate how word segmentation can be used as a phonological probing task, allowing us to study the representations learned by phoneme-based language models trained on child-directed speech across 31 languages. Following computational models of word segmentation, we present unsupervised methods for extracting word boundaries from a trained model using the observation that prediction-error peaks at the start of words. We also use linear probes to identify that these models implicitly track word boundaries, even when they do not appear in training. This cross-lingual work corroborates statistical learning theories of acquisition and empirically motivates new methods for training subword tokenizers.
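The peak-based boundary extraction above can be sketched as follows. This is an illustrative simplification: the rise-over-previous-phoneme rule and the `margin` threshold stand in for the computational segmentation models the paper follows.

```python
def peak_boundaries(surprisals, margin=0.0):
    """Hypothesize a word onset wherever per-phoneme surprisal (prediction
    error) rises by more than `margin` over the previous phoneme."""
    boundaries = [0]  # an utterance always begins a word
    for i in range(1, len(surprisals)):
        if surprisals[i] - surprisals[i - 1] > margin:
            boundaries.append(i)
    return boundaries

def segment(phonemes, boundaries):
    """Cut the phoneme sequence into words at the hypothesized onsets."""
    cuts = boundaries + [len(phonemes)]
    return [phonemes[a:b] for a, b in zip(cuts, cuts[1:])]
```

Surprisal is low inside a word (the continuation is predictable) and spikes at the start of the next word, which is what makes an unsupervised rule like this workable.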

[NLP-21] Optimal Embedding Guided Negative Sample Generation for Knowledge Graph Link Prediction

【Quick Read】: This paper addresses the difficulty Knowledge Graph Embedding (KGE) models face in precisely distinguishing positive from negative samples during training, in particular how to generate high-quality negative samples that improve model performance. Whereas traditional methods focus on identifying challenging negatives within the training data, the key contribution here is a theoretical analysis of why negative samples matter for optimal KGE, identifying a sufficient condition for an effective negative-sample distribution. On this foundation, the authors propose EMU (Embedding MUtation), a framework whose core innovation is generating negatives that satisfy this condition rather than relying on hard-negative selection. EMU's simplicity allows seamless integration with existing KGE models and negative sampling methods. Experiments show that EMU consistently and significantly improves link prediction across datasets, KGE models, and sampling methods, even matching the performance of models with embedding dimensions five times larger.

Link: https://arxiv.org/abs/2504.03327
Authors: Makoto Takamoto, Daniel Oñoro-Rubio, Wiem Ben Rim, Takashi Maruyama, Bhushan Kotnis
Institutions: NEC Laboratories Europe; University College London; Coresystems AG
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL); Information Retrieval (cs.IR)
Comments: 11 pages, 6 figures, 15 tables, accepted and to be published in TMLR

Click to view abstract

Abstract:Knowledge graph embedding (KGE) models encode the structural information of knowledge graphs to predict new links. Effective training of these models requires distinguishing between positive and negative samples with high precision. Although prior research has shown that improving the quality of negative samples can significantly enhance model accuracy, identifying high-quality negative samples remains a challenging problem. This paper theoretically investigates the condition under which negative samples lead to optimal KG embedding and identifies a sufficient condition for an effective negative sample distribution. Based on this theoretical foundation, we propose Embedding MUtation (EMU), a novel framework that generates negative samples satisfying this condition, in contrast to conventional methods that focus on identifying challenging negative samples within the training data. Importantly, the simplicity of EMU ensures seamless integration with existing KGE models and negative sampling methods. To evaluate its efficacy, we conducted comprehensive experiments across multiple datasets. The results consistently demonstrate significant improvements in link prediction performance across various KGE models and negative sampling methods. Notably, EMU enables performance improvements comparable to those achieved by models with embedding dimension five times larger. An implementation of the method and experiments are available at this https URL.
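As a rough illustration of the embedding-mutation idea, the sketch below perturbs a random subset of dimensions of a positive entity's embedding to produce a negative sample. The Gaussian perturbation, the number of mutated dimensions, and the function names are our simplifications; the paper's actual generator follows its derived sufficiency condition.

```python
import random

def mutate_embedding(entity_embedding, n_dims=2, scale=1.0, rng=None):
    """Generate a negative-sample embedding by perturbing a random subset of
    dimensions of a positive entity's embedding (illustrative sketch)."""
    rng = rng or random.Random()
    mutated = list(entity_embedding)
    for i in rng.sample(range(len(mutated)), n_dims):
        mutated[i] += rng.gauss(0.0, scale)
    return mutated
```

Because the negative is synthesized rather than searched for, the cost per negative is constant and independent of the training-set size, which is what makes a drop-in integration with existing samplers plausible.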

[NLP-22] Evaluating Compact LLMs for Zero-Shot Iberian Language Tasks on End-User Devices

【Quick Read】: This paper addresses the difficulty of deploying Large Language Models on compute-constrained consumer-grade devices, particularly for under-resourced languages such as those of the Iberian Peninsula. It presents a comprehensive evaluation of compact, state-of-the-art language models on several NLP tasks tailored to Iberian languages, revealing that although some models consistently excel at particular tasks, significant performance gaps remain, especially for languages such as Basque. The key takeaway is the need for further research on balancing model compactness with robust multilingual performance.

Link: https://arxiv.org/abs/2504.03312
Authors: Luís Couto Seller, Íñigo Sanz Torres, Adrián Vogel-Fernández, Carlos González Carballo, Pedro Miguel Sánchez Sánchez, Adrián Carruana Martín, Enrique de Miguel Ambite
Institutions: Advantx Technological Foundation (Funditec)
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: Under revision at the SEPLN conference

Click to view abstract

Abstract:Large Language Models have significantly advanced natural language processing, achieving remarkable performance in tasks such as language generation, translation, and reasoning. However, their substantial computational requirements restrict deployment to high-end systems, limiting accessibility on consumer-grade devices. This challenge is especially pronounced for under-resourced languages like those spoken in the Iberian Peninsula, where relatively limited linguistic resources and benchmarks hinder effective evaluation. This work presents a comprehensive evaluation of compact state-of-the-art LLMs across several essential NLP tasks tailored for Iberian languages. The results reveal that while some models consistently excel in certain tasks, significant performance gaps remain, particularly for languages such as Basque. These findings highlight the need for further research on balancing model compactness with robust multilingual performance.

[NLP-23] Noise Augmented Fine Tuning for Mitigating Hallucinations in Large Language Models

【Quick Read】: This paper aims to reduce hallucinations, i.e., the inaccurate or misleading content that large language models (LLMs) are prone to generate. The proposed Noise-Augmented Fine-Tuning (NoiseFiT) framework hinges on an adaptive noise-injection strategy based on the signal-to-noise ratio (SNR): dynamically scaled Gaussian noise selectively perturbs layers identified as high-SNR (more robust) or low-SNR (potentially under-regularized), improving robustness. A hybrid loss combining standard cross-entropy, soft cross-entropy, and consistency regularization keeps outputs stable and accurate under noisy training conditions. Theoretical analysis shows that the adaptive noise injection is both unbiased and variance-preserving, providing strong guarantees of convergence in expectation. Experiments across multiple test and benchmark datasets confirm that NoiseFiT significantly reduces hallucination rates and improves or matches baseline performance on key tasks, without prohibitive computational overhead.

Link: https://arxiv.org/abs/2504.03302
Authors: Afshin Khadangi, Amir Sartipi, Igor Tchappi, Ramin Bahmani
Institutions: University of Luxembourg
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Large language models (LLMs) often produce inaccurate or misleading content, known as hallucinations. To address this challenge, we introduce Noise-Augmented Fine-Tuning (NoiseFiT), a novel framework that leverages adaptive noise injection based on the signal-to-noise ratio (SNR) to enhance model robustness. In particular, NoiseFiT selectively perturbs layers identified as either high-SNR (more robust) or low-SNR (potentially under-regularized) using a dynamically scaled Gaussian noise. We further propose a hybrid loss that combines standard cross-entropy, soft cross-entropy, and consistency regularization to ensure stable and accurate outputs under noisy training conditions. Our theoretical analysis shows that adaptive noise injection is both unbiased and variance-preserving, providing strong guarantees for convergence in expectation. Empirical results on multiple test and benchmark datasets demonstrate that NoiseFiT significantly reduces hallucination rates, often improving or matching baseline performance in key tasks. These findings highlight the promise of noise-driven strategies for achieving robust, trustworthy language modeling without incurring prohibitive computational overhead. Given the comprehensive and detailed nature of our experiments, we have publicly released the fine-tuning logs, benchmark evaluation artifacts, and source code online at WB, Hugging Face, and GitHub, respectively, to foster further research, accessibility and reproducibility.
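A toy sketch of SNR-guided layer selection and noise injection, with layer weights as flat Python lists. The SNR definition, the extreme-SNR selection rule, and the noise scaling below are illustrative assumptions on our part, not the paper's exact formulation.

```python
import random
import statistics

def layer_snr(weights):
    """Signal-to-noise ratio of one layer: mean |w| over the std-dev of w."""
    std = statistics.pstdev(weights)
    mean_abs = statistics.mean(abs(w) for w in weights)
    return mean_abs / std if std > 0 else float("inf")

def inject_noise(layers, n_targets=2, scale=0.01, rng=None):
    """Perturb the layers with the most extreme SNR (highest and lowest),
    using Gaussian noise scaled to each layer's own spread."""
    rng = rng or random.Random()
    ranked = sorted(layers, key=lambda name: layer_snr(layers[name]))
    half = n_targets // 2
    targets = set(ranked[:half] + ranked[-(n_targets - half):])
    noisy = {}
    for name, w in layers.items():
        if name in targets:
            sigma = scale * statistics.pstdev(w)
            noisy[name] = [x + rng.gauss(0.0, sigma) for x in w]
        else:
            noisy[name] = list(w)
    return noisy, targets
```

Scaling the noise by each layer's own standard deviation is one way to keep the perturbation proportionate, in the spirit of the variance-preserving property the paper proves for its scheme.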

[NLP-24] Stance-Driven Multimodal Controlled Statement Generation: New Dataset and Task

【Quick Read】: This paper addresses stance-controllable multimodal content generation: given a tweet combining text and images, generating a response with a specified stance. Existing datasets are mostly text-only and lack multimodal content and effective context, particularly for stance detection. The paper formally defines and studies this new problem and creates StanceGen2024, the first resource designed specifically for multimodal stance-controllable text generation in political discourse. It contains posts and user comments from the 2024 U.S. presidential election, with text, images, videos, and stance annotations, to explore how multimodal political content shapes stance expression.

The key solution is the Stance-Driven Multimodal Generation (SDMG) framework, which integrates weighted fusion of multimodal features with stance guidance to improve semantic consistency and stance control, enabling the model to better understand and generate multimodal content conditioned on a given stance. The dataset and code are publicly released for further research.

Link: https://arxiv.org/abs/2504.03295
Authors: Bingqian Wang, Quan Fang, Jiachen Sun, Xiaoxiao Ma
Institutions: Beijing University of Posts and Telecommunications
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Formulating statements that support diverse or controversial stances on specific topics is vital for platforms that enable user expression, reshape political discourse, and drive social critique and information dissemination. With the rise of Large Language Models (LLMs), controllable text generation towards specific stances has become a promising research area with applications in shaping public opinion and commercial marketing. However, current datasets often focus solely on pure texts, lacking multimodal content and effective context, particularly in the context of stance detection. In this paper, we formally define and study the new problem of stance-driven controllable content generation for tweets with text and images, where given a multimodal post (text and image/video), a model generates a stance-controlled response. To this end, we create the Multimodal Stance Generation Dataset (StanceGen2024), the first resource explicitly designed for multimodal stance-controllable text generation in political discourse. It includes posts and user comments from the 2024 U.S. presidential election, featuring text, images, videos, and stance annotations to explore how multimodal political content shapes stance expression. Furthermore, we propose a Stance-Driven Multimodal Generation (SDMG) framework that integrates weighted fusion of multimodal features and stance guidance to improve semantic consistency and stance control. We release the dataset and code (this https URL) for public use and further research.

[NLP-25] RWKVTTS: Yet another TTS based on RWKV-7

【Quick Read】: This paper addresses the trade-off between computational efficiency and resource consumption that traditional Transformer models face in text-to-speech (TTS). The key to the solution is RWKV-7, a novel RNN-based architecture that exploits the properties of recurrent neural networks to significantly improve synthesis speed, naturalness, and resource efficiency while maintaining high-quality output. RWKV-7 also demonstrates adaptability to diverse linguistic contexts and low-resource settings, offering an innovative path toward democratizing TTS technology and broadening its applications.

Link: https://arxiv.org/abs/2504.03289
Authors: Lin yueyu, Liu Xiao
Institutions: Unknown
Subjects: Sound (cs.SD); Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)
Comments:

Click to view abstract

Abstract:Human-AI interaction thrives on intuitive and efficient interfaces, among which voice stands out as a particularly natural and accessible modality. Recent advancements in transformer-based text-to-speech (TTS) systems, such as Fish-Speech, CosyVoice, and MegaTTS 3, have delivered remarkable improvements in quality and realism, driving a significant evolution in the TTS domain. In this paper, we introduce RWKV-7 (Peng et al., 2025), a cutting-edge RNN-based architecture tailored for TTS applications. Unlike traditional transformer models, RWKV-7 leverages the strengths of recurrent neural networks to achieve greater computational efficiency and scalability, while maintaining high-quality output. Our comprehensive benchmarks demonstrate that RWKV-7 outperforms transformer-based models across multiple key metrics, including synthesis speed, naturalness of speech, and resource efficiency. Furthermore, we explore its adaptability to diverse linguistic contexts and low-resource environments, showcasing its potential to democratize TTS technology. These findings position RWKV-7 as a powerful and innovative alternative, paving the way for more accessible and versatile voice synthesis solutions in real-world applications. The code and weights are available at this https URL, this https URL.

[NLP-26] Inherent and emergent liability issues in LLM-based agentic systems: a principal-agent perspective

【Quick Read】: This paper addresses the governance, monitoring, and liability-attribution questions raised by the growing complexity and expanding deployment of agentic systems powered by Large Language Models (LLMs). From a principal-agent perspective, it analyzes the potential liability issues stemming from delegated use of LLM agents and their extended systems, covering the key aspects of the principal-agent relationship at deployment and their potential consequences. The key to the solution lies in developing technical governance methods that enhance transparency and accountability: improving interpretability and behavior evaluation, refining reward and conflict management, and mitigating misalignment and misconduct through principled engineering of detection and fail-safe mechanisms. The paper also highlights how these methods inform AI system design, auditing, and monitoring.

Link: https://arxiv.org/abs/2504.03255
Authors: Garry A. Gabison, R. Patrick Xian
Institutions: Queen Mary University of London; University of California, Berkeley; Certivize AI
Subjects: Computers and Society (cs.CY); Computation and Language (cs.CL); Multiagent Systems (cs.MA)
Comments: 12 pages content (incl. appendix) + 12 pages references, comments welcome

Click to view abstract

Abstract:Agentic systems powered by large language models (LLMs) are becoming progressively more complex and capable. Their increasing agency and expanding deployment settings attract growing attention over effective governance policies, monitoring and control protocols. Based on emerging landscapes of the agentic market, we analyze the potential liability issues stemming from delegated use of LLM agents and their extended systems from a principal-agent perspective. Our analysis complements existing risk-based studies on artificial agency and covers the spectrum of important aspects of the principal-agent relationship and their potential consequences at deployment. Furthermore, we motivate method developments for technical governance along the directions of interpretability and behavior evaluations, reward and conflict management, and the mitigation of misalignment and misconduct through principled engineering of detection and fail-safe mechanisms. By illustrating the outstanding issues in AI liability for LLM-based agentic systems, we aim to inform the system design, auditing and monitoring approaches to enhancing transparency and accountability.

[NLP-27] Think When You Need: Self-Adaptive Chain-of-Thought Learning

【Quick Read】: This paper addresses the inefficient "overthinking" that language models exhibit on simple problems: existing methods that directly penalize reasoning length fail to adapt to varying problem complexity. The key to the solution is constructing rewards through length and quality comparisons, guided by theoretical assumptions, so that solution correctness and conciseness are optimized jointly. The authors further show the method extends to fuzzy tasks where no explicit ground truth is available. Experiments across multiple reasoning benchmarks show that the approach maintains accuracy while generating significantly more concise explanations, effectively teaching models to reason only when needed.

Link: https://arxiv.org/abs/2504.03234
Authors: Junjie Yang, Ke Lin, Xing Yu
Institutions: Xiaohongshu Inc
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 9 pages

Click to view abstract

Abstract:Chain of Thought (CoT) reasoning enhances language models’ performance but often leads to inefficient “overthinking” on simple problems. We identify that existing approaches directly penalizing reasoning length fail to account for varying problem complexity. Our approach constructs rewards through length and quality comparisons, guided by theoretical assumptions that jointly enhance solution correctness with conciseness. Moreover, we further demonstrate our method to fuzzy tasks where ground truth is unavailable. Experiments across multiple reasoning benchmarks demonstrate that our method maintains accuracy while generating significantly more concise explanations, effectively teaching models to “think when needed.”
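One simple instantiation of a comparison-based length-and-quality reward: wrong answers are penalized flatly, while correct answers earn more when they are shorter than the mean correct answer in the same sample group. The exact scoring rule below is our stand-in for the paper's construction, not its actual reward.

```python
def length_quality_rewards(samples):
    """Score (is_correct, n_tokens) samples: correctness dominates, and among
    correct answers shorter ones earn a bonus relative to the group mean."""
    correct_lens = [n for ok, n in samples if ok]
    mean_len = sum(correct_lens) / len(correct_lens) if correct_lens else 0.0
    rewards = []
    for ok, n in samples:
        if not ok:
            rewards.append(-1.0)  # wrong answers never profit from brevity
        else:
            rewards.append(1.0 + (mean_len - n) / max(mean_len, 1.0))
    return rewards
```

Because the length bonus is computed relative to the group of samples for the same problem, easy problems (where all answers are short) and hard problems (where all are long) are treated on equal footing, which is the point of comparison-based rather than absolute length penalties.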

[NLP-28] Enhancing Personalized Multi-Turn Dialogue with Curiosity Reward

【Quick Read】: This paper addresses the shortcomings of existing conversational agents in personalization: current training based on Reinforcement Learning from Human Feedback (RLHF) emphasizes helpfulness and safety but falls short of truly empathetic, adaptive, and personalized interaction, and traditional personalization relies on extensive user history, limiting effectiveness for new or context-limited users. The key to the solution is an intrinsic motivation added alongside multi-turn RLHF: a reward for improving the agent's model of the user, which incentivizes the agent to actively elicit user traits by optimizing conversations to gather more information about the user, enabling more personalized interactions. Applied in education and fitness settings with LLM-simulated users, the method outperforms a multi-turn RLHF baseline at revealing users' preferences and adapting to them.

Link: https://arxiv.org/abs/2504.03206
Authors: Yanming Wan, Jiaxing Wu, Marwa Abdulhai, Lior Shani, Natasha Jaques
Institutions: Google DeepMind; University of Washington; Google Research; University of California, Berkeley
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Effective conversational agents must be able to personalize their behavior to suit a user's preferences, personality, and attributes, whether they are assisting with writing tasks or operating in domains like education or healthcare. Current training methods like Reinforcement Learning from Human Feedback (RLHF) prioritize helpfulness and safety but fall short in fostering truly empathetic, adaptive, and personalized interactions. Traditional approaches to personalization often rely on extensive user history, limiting their effectiveness for new or context-limited users. To overcome these limitations, we propose to incorporate an intrinsic motivation to improve the conversational agent's model of the user as an additional reward alongside multi-turn RLHF. This reward mechanism encourages the agent to actively elicit user traits by optimizing conversations to increase the accuracy of its user model. Consequently, the policy agent can deliver more personalized interactions through obtaining more information about the user. We applied our method in both education and fitness settings, where LLMs teach concepts or recommend personalized strategies based on users' hidden learning style or lifestyle attributes. Using LLM-simulated users, our approach outperformed a multi-turn RLHF baseline in revealing information about the users' preferences, and adapting to them.
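The intrinsic reward can be summarized as the per-turn gain in the agent's user-model accuracy, added to the usual extrinsic reward. The weighting and the notion of a scalar accuracy probe below are our assumptions, not the paper's exact objective.

```python
def turn_reward(task_reward, user_model_acc_before, user_model_acc_after, weight=1.0):
    """Extrinsic task reward plus an intrinsic curiosity bonus equal to the
    weighted gain in how well the agent's user model predicts the user."""
    return task_reward + weight * (user_model_acc_after - user_model_acc_before)
```

A turn that elicits an informative user trait raises the user model's accuracy and is rewarded even if the task reward for that turn is flat, which is what pushes the policy toward actively probing user preferences.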

[NLP-29] Explain with Visual Keypoints Like a Real Mentor! A Benchmark for Multimodal Solution Explanation

【Quick Read】: This paper addresses the gap in Large Language Models' (LLMs) ability to generate visual explanations for math problems. Current LLM-generated explanations focus on text and neglect the visual component, even though human tutors routinely use visual aids such as diagrams, markings, and highlights to enhance conceptual clarity. The paper proposes a new task, visual solution explanation, which requires models not only to solve a problem but also to generate explanations that incorporate newly introduced visual elements essential for understanding (e.g., auxiliary lines, annotations, or geometric constructions). To evaluate performance on this task, the authors build MathExplain, a multimodal benchmark of 997 math problems annotated with visual keypoints and corresponding explanatory text. The key to the solution is designing models that can identify relevant visual components and generate coherent keypoint-driven explanations, with MathExplain intended to catalyze research on multimodal LLMs in education.

Link: https://arxiv.org/abs/2504.03197
Authors: Jaewoo Park, Jungyang Park, Dongju Jang, Jiwan Chung, Byungwoo Yoo, Jaewoo Shin, Seonjoon Park, Taehyeong Kim, Youngjae Yu
Institutions: Yonsei University; Mathpresso
Subjects: Computation and Language (cs.CL)
Comments: 18 pages, 4 figures

Click to view abstract

Abstract:With the rapid advancement of mathematical reasoning capabilities in large language models (LLMs), AI systems are increasingly being adopted in educational settings to support students’ comprehension of problem-solving processes. However, a critical component remains underexplored in current LLM-generated explanations: visual explanation. In real-world instructional contexts, human tutors routinely employ visual aids-such as diagrams, markings, and highlights-to enhance conceptual clarity. To bridge this gap, we introduce a novel task of visual solution explanation, which requires not only solving problems but also generating explanations that incorporate newly introduced visual elements essential for understanding (e.g., auxiliary lines, annotations, or geometric constructions). To evaluate model performance on this task, we propose MathExplain, a multimodal benchmark consisting of 997 math problems annotated with visual keypoints and corresponding explanatory text that references those elements. Our empirical results show that while some closed-source models demonstrate promising capabilities on visual solution-explaining, current open-source general-purpose models perform inconsistently, particularly in identifying relevant visual components and producing coherent keypoint-based explanations. We expect that visual solution-explaining and the MathExplain dataset will catalyze further research on multimodal LLMs in education and advance their deployment as effective, explanation-oriented AI tutors. Code and data will be released publicly.

[NLP-30] Learning Natural Language Constraints for Safe Reinforcement Learning of Language Agents

【Quick Read】: This paper addresses the safety of deploying Large Language Models (LLMs) in real-world NLP applications, focusing on the core challenge of generalizable alignment. Current alignment methods (e.g., Reinforcement Learning from Human Feedback, RLHF) struggle to guarantee constraint satisfaction outside the training distribution because they rely on implicit, post-hoc preferences. The key to the solution is a new safe language alignment framework that, as a primary step, learns natural language constraints from positive and negative demonstrations. By inferring both a task-specific reward function and latent constraint functions, the approach supports adaptation to novel safety requirements and robust generalization under domain shifts and adversarial inputs. The framework is formalized as a Constrained Markov Decision Process (CMDP) and validated in a text-based navigation environment, showing fewer violations under domain shift when following a safe navigation path, and zero violations when the learned constraints are applied to a distilled BERT model via fine-tuning. This work offers a promising path toward safety-critical and more generalizable LLMs for practical NLP settings.

Link: https://arxiv.org/abs/2504.03185
Authors: Jaymari Chua, Chen Wang, Lina Yao
Institutions: CSIRO's Data61; UNSW
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Generalizable alignment is a core challenge for deploying Large Language Models (LLMs) safely in real-world NLP applications. Current alignment methods, including Reinforcement Learning from Human Feedback (RLHF), often fail to guarantee constraint satisfaction outside their training distribution due to their reliance on implicit, post-hoc preferences. Inspired by a paradigm shift to first curate data before tuning, we introduce a new framework for safe language alignment that learns natural language constraints from positive and negative demonstrations as a primary step. From inferring both a task-specific reward function and latent constraint functions, our approach fosters adaptation to novel safety requirements and robust generalization under domain shifts and adversarial inputs. We formalize the framework within a Constrained Markov Decision Process (CMDP) and validate it via a text-based navigation environment, demonstrating safe adaptation to changing danger zones. Our experiments show fewer violations upon domain shift when following a safe navigation path, and we achieve zero violations by applying learned constraints to a distilled BERT model as a fine-tuning technique. This work offers a promising path toward building safety-critical and more generalizable LLMs for practical NLP settings.

[NLP-31] Multi-lingual Multi-turn Automated Red Teaming for LLMs NAACL2025

【Quick Read】: This paper addresses the risk assessment of unsafe responses generated by Large Language Models (LLMs) in practical deployment. Traditional human-driven red-teaming is costly, time-consuming, and hard to scale across all model capabilities (multi-lingual, multi-modal, etc.), while existing automated methods cover only a small subset of capabilities (e.g., single-turn English). The key to the solution is Multi-lingual Multi-turn Automated Red Teaming (MM-ART), which fully automates conversational, multi-lingual red-teaming to quickly identify prompts that elicit unsafe responses. The experiments confirm that LLMs are substantially more vulnerable in multi-turn conversations than in single turns, and that safety vulnerabilities increase markedly in non-English settings.

Link: https://arxiv.org/abs/2504.03174
Authors: Abhishek Singhania, Christophe Dupuy, Shivam Mangale, Amani Namboori
Institutions: Amazon
Subjects: Computation and Language (cs.CL)
Comments: Accepted at TrustNLP@NAACL 2025

Click to view abstract

Abstract:Large Language Models (LLMs) have improved dramatically in the past few years, increasing their adoption and the scope of their capabilities over time. A significant amount of work is dedicated to "model alignment", i.e., preventing LLMs from generating unsafe responses when deployed into customer-facing applications. One popular method to evaluate safety risks is *red-teaming*, where agents attempt to bypass alignment by crafting elaborate prompts that trigger unsafe responses from a model. Standard human-driven red-teaming is costly, time-consuming and rarely covers all the recent features (e.g., multi-lingual, multi-modal aspects), while proposed automation methods only cover a small subset of LLMs capabilities (i.e., English or single-turn). We present Multi-lingual Multi-turn Automated Red Teaming (**MM-ART**), a method to fully automate conversational, multi-lingual red-teaming operations and quickly identify prompts leading to unsafe responses. Through extensive experiments on different languages, we show the studied LLMs are on average 71% more vulnerable after a 5-turn conversation in English than after the initial turn. For conversations in non-English languages, models display up to 195% more safety vulnerabilities than the standard single-turn English approach, confirming the need for automated red-teaming methods matching LLMs capabilities.
zh

[NLP-32] Efficient Dynamic Clustering-Based Document Compression for Retrieval-Augmented-Generation

【速读】: 该论文旨在解决当前 Retrieval-Augmented Generation (RAG) 实现中,在处理检索内容中的噪声、重复和冗余问题上的局限性。这些局限性主要源于现有方法难以有效利用细粒度的文档间关系。为了解决这些问题,论文提出了一种名为 Efficient Dynamic Clustering-based document Compression framework (EDC²-RAG) 的解决方案。其关键是通过有效地挖掘潜在的文档间关系,同时去除无关信息和冗余内容,从而提升 RAG 在知识集成任务中的性能。实验结果表明,该方法在多种场景和实验设置下均实现了稳定的性能提升,并展现出较强的鲁棒性和适用性。
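其"去噪 + 去冗余"的动态聚类压缩思想可以用如下纯 Python 草图示意:先过滤与查询相关性过低的文档,再通过贪心聚类把彼此高度相似的文档压缩为单一代表。相似度阈值与贪心策略均为笔者为说明而设的假设,并非论文的具体算法:

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def dynamic_cluster_compress(docs, embeddings, query_emb,
                             sim_threshold=0.9, rel_threshold=0.1):
    # 1) 去噪:过滤与查询相关性过低的文档
    kept = [(d, e) for d, e in zip(docs, embeddings)
            if cosine(e, query_emb) >= rel_threshold]
    # 2) 去冗余:与已有簇代表高度相似的文档视为重复, 仅保留代表
    reps = []
    for d, e in kept:
        if all(cosine(e, re) < sim_threshold for _, re in reps):
            reps.append((d, e))
    return [d for d, _ in reps]

docs = ["A", "B(近似A)", "C", "D(无关)"]
embs = [(1.0, 0.0), (1.0, 0.05), (0.6, 0.8), (0.0, 1.0)]
print(dynamic_cluster_compress(docs, embs, query_emb=(1.0, 0.0)))
```

示例中 D 因与查询无关被过滤,B 因与 A 高度相似被合并,最终仅保留互补的 A 与 C。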

链接: https://arxiv.org/abs/2504.03165
作者: Weitao Li,Kaiming Liu,Xiangyu Zhang,Xuanyu Lei,Weizhi Ma,Yang Liu
机构: 未知
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Retrieval-Augmented Generation (RAG) has emerged as a widely adopted approach for knowledge integration during large language model (LLM) inference in recent years. However, current RAG implementations face challenges in effectively addressing noise, repetition and redundancy in retrieved content, primarily due to their limited ability to exploit fine-grained inter-document relationships. To address these limitations, we propose an Efficient Dynamic Clustering-based document Compression framework (EDC²-RAG) that effectively utilizes latent inter-document relationships while simultaneously removing irrelevant information and redundant content. We validate our approach, built upon GPT-3.5, on widely used knowledge-QA and hallucination-detected datasets. The results show that this method achieves consistent performance improvements across various scenarios and experimental settings, demonstrating strong robustness and applicability. Our code and datasets can be found at this https URL.
zh

[NLP-33] DeepResearcher: Scaling Deep Research via Reinforcement Learning in Real-world Environments

【速读】: 该论文旨在解决现有大型语言模型(Large Language Models, LLMs)在深度研究任务中的局限性,特别是当前方法主要依赖于脆弱的手动提示工程(prompt engineering-based)或在受控环境下基于检索增强生成(Retrieval-Augmented Generation, RAG)的强化学习(RL),这些方法无法充分捕捉真实世界交互的复杂性。论文的关键解决方案是提出DeepResearcher,这是一个通过在真实世界环境中扩展强化学习来实现基于LLM的深度研究代理端到端训练的综合框架。与假设所有必要信息均存在于固定语料库中的RAG方法不同,DeepResearcher训练代理以应对开放网络中嘈杂、无结构且动态变化的特性,并采用专门设计的多智能体架构,使浏览代理能够从不同网页结构中提取相关信息并克服重大技术挑战。这一方案的核心在于通过真实网络环境中的端到端强化学习训练,使模型具备计划制定、多源信息交叉验证、自我反思以及诚实面对不确定性等认知能力。

链接: https://arxiv.org/abs/2504.03160
作者: Yuxiang Zheng,Dayuan Fu,Xiangkun Hu,Xiaojie Cai,Lyumanshan Ye,Pengrui Lu,Pengfei Liu
机构: SJTU(上海交通大学); SII(上海交通大学信息安全工程学院); GAIR(上海交通大学机器人与人工智能实验室)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Large Language Models (LLMs) equipped with web search capabilities have demonstrated impressive potential for deep research tasks. However, current approaches predominantly rely on either manually engineered prompts (prompt engineering-based) with brittle performance or reinforcement learning within controlled Retrieval-Augmented Generation (RAG) environments (RAG-based) that fail to capture the complexities of real-world interaction. In this paper, we introduce DeepResearcher, the first comprehensive framework for end-to-end training of LLM-based deep research agents through scaling reinforcement learning (RL) in real-world environments with authentic web search interactions. Unlike RAG-based approaches that assume all necessary information exists within a fixed corpus, our method trains agents to navigate the noisy, unstructured, and dynamic nature of the open web. We implement a specialized multi-agent architecture where browsing agents extract relevant information from various webpage structures and overcome significant technical challenges. Extensive experiments on open-domain research tasks demonstrate that DeepResearcher achieves substantial improvements of up to 28.9 points over prompt engineering-based baselines and up to 7.2 points over RAG-based RL agents. Our qualitative analysis reveals emergent cognitive behaviors from end-to-end RL training, including the ability to formulate plans, cross-validate information from multiple sources, engage in self-reflection to redirect research, and maintain honesty when unable to find definitive answers. Our results highlight that end-to-end training in real-world web environments is not merely an implementation detail but a fundamental requirement for developing robust research capabilities aligned with real-world applications. We release DeepResearcher at this https URL.
zh

[NLP-34] Beyond the Next Token: Towards Prompt-Robust Zero-Shot Classification via Efficient Multi-Token Prediction NAACL2025

【速读】: 该论文试图解决零样本文本分类中因大语言模型提示工程固有脆弱性导致的可靠性不足问题,即提示的小幅变化可能引起模型性能显著差异。解决方案的关键在于提出了一种名为Placeholding Parallel Prediction (P3) 的新方法,通过在单次语言模型运行中预测多个位置的标记概率,模拟生成路径的全面采样,从而克服现有方法仅关注下一个标记概率的局限性,显著提升了分类的准确性和提示鲁棒性,同时减少了对提示工程的依赖。
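P3 的核心是在一次前向传播中取得多个占位位置上的 token 概率,再按各类别标签的多 token 序列聚合打分。下面是一个与具体模型无关的玩具示意,其中概率表为人工构造,按位置连乘的聚合方式也是笔者为说明而设的简化假设:

```python
def p3_classify(token_probs_per_position, label_token_ids):
    """token_probs_per_position[i][tok] = 第 i 个占位位置上 tok 的概率
    (单次前向传播即可同时得到所有位置的分布)。
    每个类别的得分 = 其标签各 token 在对应占位位置上概率的乘积。"""
    scores = {}
    for label, toks in label_token_ids.items():
        s = 1.0
        for i, tok in enumerate(toks):
            s *= token_probs_per_position[i].get(tok, 1e-9)
        scores[label] = s
    return max(scores, key=scores.get), scores

# 玩具示例:两个占位位置, 类别标签被分词为两个 token
probs = [{"pos": 0.6, "neg": 0.4}, {"itive": 0.9, "ative": 0.1}]
labels = {"positive": ["pos", "itive"], "negative": ["neg", "ative"]}
print(p3_classify(probs, labels)[0])
```

由于各位置的分布在同一次前向传播中得到,分类不再只依赖第一个 token 的概率,从而降低对提示措辞的敏感性。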

链接: https://arxiv.org/abs/2504.03159
作者: Junlang Qian,Zixiao Zhu,Hanzhang Zhou,Zijian Feng,Zepeng Zhai,Kezhi Mao
机构: Nanyang Technological University (南洋理工大学), Singapore; Singapore-ETH Centre; Tencent (腾讯), China
类目: Computation and Language (cs.CL)
备注: Accepted in NAACL 2025 (main Oral)

点击查看摘要

Abstract:Zero-shot text classification typically relies on prompt engineering, but the inherent prompt brittleness of large language models undermines its reliability. Minor changes in prompt can cause significant discrepancies in model performance. We attribute this prompt brittleness largely to the narrow focus on next-token probabilities in existing methods. To address this, we propose Placeholding Parallel Prediction (P3), a novel approach that predicts token probabilities across multiple positions and simulates comprehensive sampling of generation paths in a single run of a language model. Experiments show improved accuracy and up to 98% reduction in the standard deviation across prompts, boosting robustness. Even without a prompt, P3 maintains comparable performance, reducing the need for prompt engineering.
zh

[NLP-35] Why Reasoning Matters? A Survey of Advancements in Multimodal Reasoning (v1)

【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)在多模态推理(Multimodal Reasoning)中的核心挑战,特别是如何有效整合视觉与文本输入以实现跨模态的结构化问题求解。论文指出,多模态推理的关键难题包括处理模态间冲突信息以及确保推理结果的准确性与连贯性。为应对这些挑战,论文强调了采用先进的算法和鲁棒的评估方法的重要性,并提出通过后训练优化(post-training optimization)及测试时推理(test-time inference)的实用技术来提升模型性能。解决方案的关键在于结合理论框架与实际应用,为未来研究提供明确方向。

链接: https://arxiv.org/abs/2504.03151
作者: Jing Bi,Susan Liang,Xiaofei Zhou,Pinxin Liu,Junjia Guo,Yunlong Tang,Luchuan Song,Chao Huang,Guangyu Sun,Jinxi He,Jiarui Wu,Shu Yang,Daoan Zhang,Chen Chen,Lianggong Bruce Wen,Zhang Liu,Jiebo Luo,Chenliang Xu
机构: University of Rochester (罗切斯特大学); University of Central Florida (中佛罗里达大学); Corning Inc. (康宁公司)
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Reasoning is central to human intelligence, enabling structured problem-solving across diverse tasks. Recent advances in large language models (LLMs) have greatly enhanced their reasoning abilities in arithmetic, commonsense, and symbolic domains. However, effectively extending these capabilities into multimodal contexts-where models must integrate both visual and textual inputs-continues to be a significant challenge. Multimodal reasoning introduces complexities, such as handling conflicting information across modalities, which require models to adopt advanced interpretative strategies. Addressing these challenges involves not only sophisticated algorithms but also robust methodologies for evaluating reasoning accuracy and coherence. This paper offers a concise yet insightful overview of reasoning techniques in both textual and multimodal LLMs. Through a thorough and up-to-date comparison, we clearly formulate core reasoning challenges and opportunities, highlighting practical methods for post-training optimization and test-time inference. Our work provides valuable insights and guidance, bridging theoretical frameworks and practical implementations, and sets clear directions for future research.
zh

[NLP-36] LightPROF: A Lightweight Reasoning Framework for Large Language Model on Knowledge Graph AAAI2025

【速读】: 该论文旨在解决大型语言模型(LLMs)在知识更新延迟下可能产生的错误推理或有害结果的问题,并提出一种高效利用知识图谱(Knowledge Graphs, KGs)结构信息的方法。现有基于知识图谱的LLM推理方法仅以文本形式注入知识图谱的信息,忽略了其结构化信息,且大多依赖于大参数量的闭源或开源模型,导致资源消耗较高。

解决方案的关键在于提出了一种轻量级且高效的Prompt学习-推理框架(LightPROF)。该框架通过“检索-嵌入-推理”(Retrieve-Embed-Reason)的过程,首先利用检索模块从知识图谱中准确、稳定地提取对应的推理图谱,然后借助基于Transformer的知识适配器(Knowledge Adapter),精细提取并整合知识图谱中的事实与结构化信息,并将其映射到LLM的词嵌入空间中,形成适合LLM使用的友好型Prompt。此外,LightPROF仅需训练知识适配器,即可兼容任何开源LLM。实验结果表明,该方法在两个公开的知识图谱问答(KGQA)基准数据集上实现了卓越的性能,并在输入Token数量和推理时间方面表现出显著优势。
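"检索-嵌入-推理"三步流程可以抽象为如下骨架代码:retrieve、adapt、llm 分别对应检索模块、知识适配器与冻结的开源 LLM,均以假设性回调表示,仅示意数据流向而非论文的真实接口:

```python
def lightprof_pipeline(question, kg_triples, retrieve, adapt, llm):
    """Retrieve-Embed-Reason 流程的占位示意:
    仅知识适配器 adapt 需要训练, retrieve 与 llm 均保持冻结/即插即用。"""
    reasoning_graph = retrieve(question, kg_triples)  # 1) 检索:从 KG 中取出推理图谱
    soft_prompt = adapt(reasoning_graph)              # 2) 嵌入:编码为 LLM 词嵌入空间中的软提示
    return llm(soft_prompt, question)                 # 3) 推理:LLM 基于软提示作答
```

调用示例(各回调以 lambda 模拟):

```python
out = lightprof_pipeline(
    "Where was X born?",
    [("X", "born_in", "Y")],
    retrieve=lambda q, kg: kg[:1],
    adapt=lambda g: f"<soft:{len(g)}>",
    llm=lambda p, q: f"{p}|{q}",
)
```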

链接: https://arxiv.org/abs/2504.03137
作者: Tu Ao,Yanhua Yu,Yuling Wang,Yang Deng,Zirui Guo,Liang Pang,Pinghui Wang,Tat-Seng Chua,Xiao Zhang,Zhen Cai
机构: 未知
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: This paper has been accepted by AAAI 2025

点击查看摘要

Abstract:Large Language Models (LLMs) have impressive capabilities in text understanding and zero-shot reasoning. However, delays in knowledge updates may cause them to reason incorrectly or produce harmful results. Knowledge Graphs (KGs) provide rich and reliable contextual information for the reasoning process of LLMs by structurally organizing and connecting a wide range of entities and relations. Existing KG-based LLM reasoning methods only inject KGs’ knowledge into prompts in a textual form, ignoring its structural information. Moreover, they mostly rely on close-source models or open-source models with large parameters, which poses challenges to high resource consumption. To address this, we propose a novel Lightweight and efficient Prompt learning-ReasOning Framework for KGQA (LightPROF), which leverages the full potential of LLMs to tackle complex reasoning tasks in a parameter-efficient manner. Specifically, LightPROF follows a “Retrieve-Embed-Reason process”, first accurately, and stably retrieving the corresponding reasoning graph from the KG through retrieval module. Next, through a Transformer-based Knowledge Adapter, it finely extracts and integrates factual and structural information from the KG, then maps this information to the LLM’s token embedding space, creating an LLM-friendly prompt to be used by the LLM for the final reasoning. Additionally, LightPROF only requires training Knowledge Adapter and can be compatible with any open-source LLM. Extensive experiments on two public KGQA benchmarks demonstrate that LightPROF achieves superior performance with small-scale LLMs. Furthermore, LightPROF shows significant advantages in terms of input token count and reasoning time.
zh

[NLP-37] Single-Pass Document Scanning for Question Answering

【速读】: 该论文旨在解决在大规模文档问答中处理极长文本的挑战:传统的基于片段(chunk-based)嵌入方法容易丢失重要的全局上下文信息,而使用完整上下文的Transformer模型在处理数十万tokens时计算成本过高。论文提出了一种单次扫描(single-pass scanning)的方法,以线性时间处理整个文本,同时保持全局连贯性,并确定与查询最相关的句子。该方法的关键在于通过在整个先前上下文中进行条件化而不分割片段,从而保留全局连贯性,这对长文档尤为重要。实验结果表明,该方法在41个QA基准数据集上的表现优于基于片段的嵌入方法,并且以较低的计算成本与大型语言模型竞争。
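单次扫描的要点是:一趟线性遍历,既维护不断增长的全局上下文,又为每个句子打相关性分。下面的玩具实现用"查询词重叠 + 上下文连贯度"作为评分函数;评分方式为笔者自拟,仅示意 O(n) 的单遍结构,并非论文的实际打分模型:

```python
def single_pass_scan(sentences, query, top_k=2):
    """单次线性扫描:逐句打分并更新全局上下文, 返回得分最高的 top_k 句。"""
    q = set(query.lower().split())
    context = set()
    scored = []
    for idx, sent in enumerate(sentences):
        words = set(sent.lower().split())
        rel = len(words & q)                              # 查询重叠度
        coherence = len(words & context) / (len(words) or 1)  # 与前文的连贯度
        scored.append((rel + 0.5 * coherence, idx, sent))
        context |= words  # 上下文只增不减, 整体保持一次遍历、线性时间
    scored.sort(reverse=True)
    return [s for _, _, s in scored[:top_k]]

sents = ["the cat sat", "dogs bark loudly", "the cat purred"]
print(single_pass_scan(sents, "cat"))
```

由于每句的打分都以全部前文为条件(而非割裂的片段),与查询相关且与前文连贯的句子会被优先保留。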

链接: https://arxiv.org/abs/2504.03101
作者: Weili Cao,Jianyou Wang,Youze Zheng,Longtian Bao,Qirui Zheng,Taylor Berg-Kirkpatrick,Ramamohan Paturi,Leon Bergen
机构: Laboratory for Emerging Intelligence (新兴智能实验室); University of California, San Diego (加州大学圣地亚哥分校)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Handling extremely large documents for question answering is challenging: chunk-based embedding methods often lose track of important global context, while full-context transformers can be prohibitively expensive for hundreds of thousands of tokens. We propose a single-pass document scanning approach that processes the entire text in linear time, preserving global coherence while deciding which sentences are most relevant to the query. On 41 QA benchmarks, our single-pass scanner consistently outperforms chunk-based embedding methods and competes with large language models at a fraction of the computational cost. By conditioning on the entire preceding context without chunk breaks, the method preserves global coherence, which is especially important for long documents. Overall, single-pass document scanning offers a simple solution for question answering over massive text. All code, datasets, and model checkpoints are available at this https URL
zh

[NLP-38] AD-GPT: Large Language Models in Alzheimer's Disease

【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)在阿尔茨海默病(Alzheimer’s Disease, AD)等专业化领域中信息检索准确性与深度不足的问题。为应对这一挑战,论文提出的关键解决方案是开发AD-GPT,这是一种针对AD领域的专用生成式预训练Transformer模型。AD-GPT通过整合多源生物医学数据(如潜在的AD相关基因、分子遗传信息及与脑区相关的基因变异),结合基于Llama3和BERT构建的堆叠式LLM架构,优化了四个核心任务:遗传信息检索、基因-脑区关系评估、基因-AD关系分析以及脑区-AD关系映射。实验结果表明,AD-GPT在这些任务上的精度和可靠性优于现有最先进的LLMs,凸显其作为强大且专门化AI工具在推动AD研究和生物标志物发现方面的潜力。

链接: https://arxiv.org/abs/2504.03071
作者: Ziyu Liu,Lintao Tang,Zeliang Sun,Zhengliang Liu,Yanjun Lyu,Wei Ruan,Yangshuang Xu,Liang Shan,Jiyoon Shin,Xiaohe Chen,Dajiang Zhu,Tianming Liu,Rongjie Liu,Chao Huang
机构: University of Georgia (UGA); Florida State University (FSU); University of Texas at Arlington (UTA); University of Georgia (UGA); University of Texas at Arlington (UTA); University of Georgia (UGA); University of Georgia (UGA); University of Georgia (UGA)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Large language models (LLMs) have emerged as powerful tools for medical information retrieval, yet their accuracy and depth remain limited in specialized domains such as Alzheimer’s disease (AD), a growing global health challenge. To address this gap, we introduce AD-GPT, a domain-specific generative pre-trained transformer designed to enhance the retrieval and analysis of AD-related genetic and neurobiological information. AD-GPT integrates diverse biomedical data sources, including potential AD-associated genes, molecular genetic information, and key gene variants linked to brain regions. We develop a stacked LLM architecture combining Llama3 and BERT, optimized for four critical tasks in AD research: (1) genetic information retrieval, (2) gene-brain region relationship assessment, (3) gene-AD relationship analysis, and (4) brain region-AD relationship mapping. Comparative evaluations against state-of-the-art LLMs demonstrate AD-GPT’s superior precision and reliability across these tasks, underscoring its potential as a robust and specialized AI tool for advancing AD research and biomarker discovery.
zh

[NLP-39] Task as Context Prompting for Accurate Medical Symptom Coding Using Large Language Models ALT

【速读】: 该论文旨在解决从非结构化临床文本(如疫苗安全报告)中进行精准医学症状编码的问题,这一任务在药物警戒和安全性监测领域具有重要应用价值。传统方法将症状提取与链接视为独立流程,难以有效应对临床叙述中的多样性和复杂性,尤其是在罕见病例中表现不佳。尽管大型语言模型(Large Language Models, LLMs)的最新进展提供了新机会,但其性能一致性仍面临挑战。为了解决这些问题,论文提出了一种名为Task as Context (TACO) Prompting的新框架,通过在LLM提示中嵌入任务特定上下文,实现提取和链接任务的统一。关键解决方案在于TACO Prompting框架的设计,它显著提升了症状编码任务的灵活性和准确性,并通过构建人类注释的数据集SYMPCODER以及两阶段评估框架,全面评估了症状链接和提及保真度。实验结果表明,TACO Prompting在多个LLMs(如Llama2-chat、Jackalope-7b、GPT-3.5 Turbo、GPT-4 Turbo和GPT-4o)上的表现证明了其在定制化任务中的有效性,为更具体的编码任务和临床文本处理方法的发展奠定了基础。
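TACO Prompting 的做法可以用一个简单的提示构造函数示意:将任务特定上下文显式嵌入提示,把症状提取与术语链接统一为一条指令。以下字段名与措辞均为笔者自拟,并非论文的原始提示模板:

```python
def taco_prompt(report_text, task_context, vocabulary_name="MedDRA"):
    """示意 TACO 的核心思想:任务上下文作为提示的一等组成部分,
    提取与链接不再拆成两个独立流程。"""
    return (
        f"Task context: {task_context}\n"
        f"Identify each symptom mention and link it to a {vocabulary_name} term.\n"
        f"Report: {report_text}\n"
        f"Output format: mention -> term"
    )

prompt = taco_prompt(
    report_text="Patient reported fever and rash after vaccination.",
    task_context="Symptom coding for vaccine adverse event reports (VAERS).",
)
print(prompt)
```

实际系统中该提示会发送给 Llama2-chat、GPT-4o 等模型;此处仅展示"任务即上下文"的提示组织方式。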

链接: https://arxiv.org/abs/2504.03051
作者: Chengyang He,Wenlong Zhang,Violet Xinying Chen,Yue Ning,Ping Wang
机构: Stevens Institute of Technology (史蒂文斯理工学院)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 11 pages, 5 figures, 5 Tables, ACM/IEEE International Conference on Connected Health: Applications, Systems and Engineering Technologies (CHASE '25), June 24–26, 2025, New York, NY, USA

点击查看摘要

Abstract:Accurate medical symptom coding from unstructured clinical text, such as vaccine safety reports, is a critical task with applications in pharmacovigilance and safety monitoring. Symptom coding, as tailored in this study, involves identifying and linking nuanced symptom mentions to standardized vocabularies like MedDRA, differentiating it from broader medical coding tasks. Traditional approaches to this task, which treat symptom extraction and linking as independent workflows, often fail to handle the variability and complexity of clinical narratives, especially for rare cases. Recent advancements in Large Language Models (LLMs) offer new opportunities but face challenges in achieving consistent performance. To address these issues, we propose Task as Context (TACO) Prompting, a novel framework that unifies extraction and linking tasks by embedding task-specific context into LLM prompts. Our study also introduces SYMPCODER, a human-annotated dataset derived from Vaccine Adverse Event Reporting System (VAERS) reports, and a two-stage evaluation framework to comprehensively assess both symptom linking and mention fidelity. Our comprehensive evaluation of multiple LLMs, including Llama2-chat, Jackalope-7b, GPT-3.5 Turbo, GPT-4 Turbo, and GPT-4o, demonstrates TACO’s effectiveness in improving flexibility and accuracy for tailored tasks like symptom coding, paving the way for more specific coding tasks and advancing clinical text processing methodologies.
zh

[NLP-40] LLM Library Learning Fails: A LEGO-Prover Case Study

【速读】: 该论文试图评估基于大型语言模型(LLMs)的图书馆学习(Library Learning)技术在提升任务性能方面的有效性,特别是通过自动创建可重用工具和缓存推理来实现这一目标。论文聚焦于LEGO-Prover系统,该系统声称能够学习可重用的数学引理以支持推理任务。

解决方案的关键在于深入分析LEGO-Prover是否真正实现了其宣称的效果。研究发现,该系统并未直接或通过修改相关示例间接重用所学引理,并且其相对于简单基线方法(即仅使用提示模型)的性能提升在计入计算成本后消失。因此,论文指出当前对这些技术有效性的认知存在严重误解,需要重新审视LLM驱动的图书馆学习现状,并呼吁采用更严格的评估标准,包括行为分析以及确保基线方法与被测系统使用相同的计算预算。

链接: https://arxiv.org/abs/2504.03048
作者: Ian Berlot-Attwell,Frank Rudzicz,Xujie Si
机构: University of Toronto (多伦多大学); Vector Institute (向量研究所); Dalhousie University (达尔豪斯大学)
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注: 24 pages, 5 figures

点击查看摘要

Abstract:Recent advancements in the coding, reasoning, and tool-using abilities of LLMs have spurred interest in library learning (i.e., online learning through the creation, storage, and retrieval of reusable and composable functions, knowledge, checklists, or lemmas). Such systems often promise improved task performance through the automatic creation of broadly applicable tools, as well as superior computational performance through the caching of reasoning (i.e., the storage of generated tools). However, we find strong reason to be skeptical. We perform a deep dive into one such system, LEGO-Prover, which purports to learn reusable lemmas for mathematical reasoning. We find no evidence of the direct reuse of learned lemmas, and find evidence against the soft reuse of learned lemmas (i.e., reuse by modifying relevant examples). Crucially, we find that LEGO-Prover does not in fact improve over the simple baseline of prompting the model - the improvements in task accuracy vanish once computational cost is accounted for. Our findings suggest that serious misconceptions exist as to the effectiveness of these techniques, that a serious re-examination of the state of LLM-based library learning is required, and that we require much stronger standards for evaluation including behavioural analysis and ensuring that an equal computational budget is used for baselines.
zh

[NLP-41] Extending CREAMT: Leveraging Large Language Models for Literary Translation Post-Editing

【速读】: 该论文试图解决在翻译创意文本(如文学作品)时,如何平衡机器翻译后编辑(Post-editing, PE)的效率与创造力及风格保留之间的矛盾。现有神经机器翻译(Neural Machine Translation, NMT)系统在此方面表现欠佳,而大型语言模型(Large Language Models, LLMs)因其上下文感知能力和创造性翻译的优势提供了新的可能性。论文的关键解决方案在于评估基于LLMs生成的文学翻译在后编辑过程中的可行性和效果,并通过自定义研究工具与专业译者合作,分析编辑时间、质量及创造力。结果显示,与人工翻译相比,后编辑LLM生成的翻译显著减少了编辑时间,同时保持了相似的创造力水平,从而验证了LLMs在支持高资源语言文学翻译工作方面的潜力。

链接: https://arxiv.org/abs/2504.03045
作者: Antonio Castaldo,Sheila Castilho,Joss Moorkens,Johanna Monti
机构: University of Naples “L’Orientale” (那不勒斯东方大学); University of Pisa (比萨大学); Dublin City University (都柏林城市大学)
类目: Computation and Language (cs.CL)
备注: to be published in the Proceedings of the 20th Machine Translation Summit (MT Summit 2025)

点击查看摘要

Abstract:Post-editing machine translation (MT) for creative texts, such as literature, requires balancing efficiency with the preservation of creativity and style. While neural MT systems struggle with these challenges, large language models (LLMs) offer improved capabilities for context-aware and creative translation. This study evaluates the feasibility of post-editing literary translations generated by LLMs. Using a custom research tool, we collaborated with professional literary translators to analyze editing time, quality, and creativity. Our results indicate that post-editing LLM-generated translations significantly reduces editing time compared to human translation while maintaining a similar level of creativity. The minimal difference in creativity between PE and MT, combined with substantial productivity gains, suggests that LLMs may effectively support literary translators working with high-resource languages.
zh

[NLP-42] IPA CHILDES & G2P+: Feature-Rich Resources for Cross-Lingual Phonology and Phonemic Language Modeling CONLL2025

【速读】: 该论文旨在解决现有字形到音素(grapheme-to-phoneme)转换工具生成的音素词汇与公认的音素库存不一致的问题,并填补现有音素数据集在多语言覆盖、自发性言语以及儿童指向语言方面的不足。为实现这一目标,论文引入了两个资源:(i) G2P+,一种将正字法(orthographic)数据集转换为一致音素表示的工具;(ii) IPA CHILDES,一个包含31种语言儿童指向语音的音素数据集。G2P+ 的关键创新在于利用 Phoible 数据库中的音素库存,从而确保生成的音素表示与标准音素集合保持一致。通过使用 G2P+ 工具,研究者扩展了 CHILDES 数据集以包含音素转写,形成 IPA CHILDES。论文进一步展示了该数据集在音系学研究中的实用性,通过训练11种语言的音素语言模型并探测其对显著特征的学习能力,发现音素的分布特性足以支持跨语言的主要类别和位置特征的学习。
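G2P+ 的关键约束——转换结果必须落在既定音素库存之内——可以用如下玩具函数示意。规则表与库存均为笔者构造的简化英文示例,真实工具依据 Phoible 数据库的完整语言库存工作:

```python
def g2p_consistent(word, g2p_rules, inventory):
    """逐字符的玩具 G2P:输出音素必须属于给定库存,
    否则报错而非产生库存外符号(对应 G2P+ 的一致性约束)。"""
    phonemes = []
    for ch in word:
        p = g2p_rules.get(ch)
        if p is None or p not in inventory:
            raise ValueError(f"no inventory-consistent phoneme for {ch!r}")
        phonemes.append(p)
    return phonemes

rules = {"c": "k", "a": "æ", "t": "t"}
print(g2p_consistent("cat", rules, inventory={"k", "æ", "t"}))
```

真实的字形到音素转换远比逐字符映射复杂(上下文相关、多字符对应单音素等),此处只示意"以库存校验输出"这一设计约束。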

链接: https://arxiv.org/abs/2504.03036
作者: Zébulon Goriely,Paula Buttery
机构: Department of Computer Science & Technology, University of Cambridge (剑桥大学计算机科学与技术系); ALTA Institute, University of Cambridge (剑桥大学ALTA研究所)
类目: Computation and Language (cs.CL)
备注: 19 pages, 7 figures. Submitted to CoNLL 2025

点击查看摘要

Abstract:In this paper, we introduce two resources: (i) G2P+, a tool for converting orthographic datasets to a consistent phonemic representation; and (ii) IPA CHILDES, a phonemic dataset of child-centered speech across 31 languages. Prior tools for grapheme-to-phoneme conversion result in phonemic vocabularies that are inconsistent with established phonemic inventories, an issue which G2P+ addresses by leveraging the inventories in the Phoible database. Using this tool, we augment CHILDES with phonemic transcriptions to produce IPA CHILDES. This new resource fills several gaps in existing phonemic datasets, which often lack multilingual coverage, spontaneous speech, and a focus on child-directed language. We demonstrate the utility of this dataset for phonological research by training phoneme language models on 11 languages and probing them for distinctive features, finding that the distributional properties of phonemes are sufficient to learn major class and place features cross-lingually.
zh

[NLP-43] Ontologies in Design: How Imagining a Tree Reveals Possibilities and Assumptions in Large Language Models

【速读】: 该论文试图解决生成式人工智能(Generative AI)在设计与开发过程中因忽视本体论(ontologies)而可能导致的潜在危害问题。论文指出,尽管基于价值观(如偏见)的分析至关重要,但本体论——即我们允许自己思考或讨论的内容范畴——是分析这些系统时一个被低估但至关重要的维度。论文的关键解决方案在于提出一种基于实践的本体论参与方法,并通过四个导向(pluralism, groundedness, liveliness, and enactment)来指导本体论在设计中的考量。通过在大型语言模型(LLM)开发全流程中开展本体论分析,论文展示了这些导向所开启的可能性,从而强调了在社会技术系统设计中运用本体论的机会与局限性。

链接: https://arxiv.org/abs/2504.03029
作者: Nava Haghighi,Sunny Yu,James Landay,Daniela Rosner
机构: Stanford University (斯坦福大学); University of Washington (华盛顿大学)
类目: Human-Computer Interaction (cs.HC); Computation and Language (cs.CL)
备注: 20 pages, 1 figure, 2 tables, CHI '25

点击查看摘要

Abstract:Amid the recent uptake of Generative AI, sociotechnical scholars and critics have traced a multitude of resulting harms, with analyses largely focused on values and axiology (e.g., bias). While value-based analyses are crucial, we argue that ontologies – concerning what we allow ourselves to think or talk about – is a vital but under-recognized dimension in analyzing these systems. Proposing a need for a practice-based engagement with ontologies, we offer four orientations for considering ontologies in design: pluralism, groundedness, liveliness, and enactment. We share examples of potentialities that are opened up through these orientations across the entire LLM development pipeline by conducting two ontological analyses: examining the responses of four LLM-based chatbots in a prompting exercise, and analyzing the architecture of an LLM-based agent simulation. We conclude by sharing opportunities and limitations of working with ontologies in the design and development of sociotechnical systems.
zh

[NLP-44] he Dual-Route Model of Induction

【速读】: 该论文旨在研究和揭示在上下文学习(in-context learning)中不同类型的归纳头(induction heads)的作用及其分工。论文引入了一种新的归纳头——概念级归纳头(concept-level induction heads),与传统的词级归纳头(token-level induction heads)形成对比。概念级归纳头专注于复制整个词汇单元(lexical units),而非单个标记(tokens),并通过关注多标记单词的结尾来实现这一目标。关键解决方案在于区分这两种归纳头的功能:概念级归纳头在语义任务(如词级翻译)中起主导作用,而词级归纳头则在需要逐字复制的任务(如复制无意义标记)中至关重要。此外,研究表明这两种机制独立运作,且词级归纳头的移除会导致模型倾向于释义(paraphrase)而非逐字复制。论文据此提出,尽管词级归纳头对特定任务至关重要,但概念级归纳头可能在更广泛的上下文学习场景中具有更高的适用性。

链接: https://arxiv.org/abs/2504.03022
作者: Sheridan Feucht,Eric Todd,Byron Wallace,David Bau
机构: Northeastern University (东北大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 36 pages, 39 figures. Code and data at this https URL

点击查看摘要

Abstract:Prior work on in-context copying has shown the existence of induction heads, which attend to and promote individual tokens during copying. In this work we introduce a new type of induction head: concept-level induction heads, which copy entire lexical units instead of individual tokens. Concept induction heads learn to attend to the ends of multi-token words throughout training, working in parallel with token-level induction heads to copy meaningful text. We show that these heads are responsible for semantic tasks like word-level translation, whereas token induction heads are vital for tasks that can only be done verbatim, like copying nonsense tokens. These two “routes” operate independently: in fact, we show that ablation of token induction heads causes models to paraphrase where they would otherwise copy verbatim. In light of these findings, we argue that although token induction heads are vital for specific tasks, concept induction heads may be more broadly relevant for in-context learning.
zh

[NLP-45] Language Models Guidance with Multi-Aspect-Cueing: A Case Study for Competitor Analysis

【速读】: 该论文试图解决大型语言模型(Large Language Models, LLMs)在竞争者分析中因缺乏对当代或未来现实的知识以及对市场竞品格局不完全理解而导致的局限性问题。论文的关键解决方案在于将商业相关方面融入LLMs,以增强其对竞争市场的理解能力,并通过定量与定性的实验证明,这种整合能够持续提升模型性能,从而提高竞争者分析的效能。

链接: https://arxiv.org/abs/2504.02984
作者: Amir Hadifar,Christopher Ochs,Arjan Van Ewijk
机构: Nokia Bell Labs (诺基亚贝尔实验室)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Competitor analysis is essential in modern business due to the influence of industry rivals on strategic planning. It involves assessing multiple aspects and balancing trade-offs to make informed decisions. Recent Large Language Models (LLMs) have demonstrated impressive capabilities to reason about such trade-offs but grapple with inherent limitations such as a lack of knowledge about contemporary or future realities and an incomplete understanding of a market’s competitive landscape. In this paper, we address this gap by incorporating business aspects into LLMs to enhance their understanding of a competitive market. Through quantitative and qualitative experiments, we illustrate how integrating such aspects consistently improves model performance, thereby enhancing analytical efficacy in competitor analysis.
zh

[NLP-46] Hummus: A Dataset of Humorous Multimodal Metaphor Use

【速读】: 本文旨在研究多模态隐喻的幽默能力,这是学术界尚未充分关注的领域。论文从意外理论(Incongruity Theory)、概念隐喻理论(Conceptual Metaphor Theory)以及 VU 阿姆斯特丹隐喻语料库(VU Amsterdam Metaphor Corpus)的标注方案中汲取灵感,提出了一种针对图像-标题配对中幽默多模态隐喻使用的新型标注方案。为验证该方案的有效性,作者构建了 HUMMUS 数据集(幽默多模态隐喻使用数据集),包含来自《纽约客》标题竞赛语料库的 1000 组图像-标题配对,并进行了专家标注。通过该数据集,论文测试了当前最先进的多模态大语言模型(Multimodal Large Language Models, MLLMs)在检测与理解幽默多模态隐喻方面的表现。实验结果表明,现有 MLLMs 在处理幽默多模态隐喻时仍存在困难,尤其是在整合视觉与文本信息方面。因此,论文的关键在于提出了一种新的标注方案并创建了相应的高质量数据集,以评估和改进多模态模型对幽默隐喻的理解能力。

链接: https://arxiv.org/abs/2504.02983
作者: Xiaoyu Tong,Zhi Zhang,Martha Lewis,Ekaterina Shutova
机构: ILLC, University of Amsterdam (阿姆斯特丹大学逻辑语言与计算研究所), the Netherlands; Computer Science Department, Stanford University (斯坦福大学计算机科学系), US
类目: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Metaphor and humor share a lot of common ground, and metaphor is one of the most common humorous mechanisms. This study focuses on the humorous capacity of multimodal metaphors, which has not received due attention in the community. We take inspiration from the Incongruity Theory of humor, the Conceptual Metaphor Theory, and the annotation scheme behind the VU Amsterdam Metaphor Corpus, and developed a novel annotation scheme for humorous multimodal metaphor use in image-caption pairs. We create the Hummus Dataset of Humorous Multimodal Metaphor Use, providing expert annotation on 1k image-caption pairs sampled from the New Yorker Caption Contest corpus. Using the dataset, we test state-of-the-art multimodal large language models (MLLMs) on their ability to detect and understand humorous multimodal metaphor use. Our experiments show that current MLLMs still struggle with processing humorous multimodal metaphors, particularly with regard to integrating visual and textual information. We release our dataset and code at this http URL.
zh

[NLP-47] A Bayesian account of pronoun and neopronoun acquisition

【速读】: 该论文旨在解决 queer 社群(LGBTQ+ 社群)成员之间在代词使用上的不平等问题,特别是如何尊重个体对自身称谓(如自选名字或新造代词)的选择。论文指出,当前语言模型通常基于形式到意义的映射以及词汇共现统计来学习指代表达,这种做法难以适应多样化的性别表达。为解决此问题,论文的关键在于提出了一种基于嵌套中餐馆连锁过程(nested Chinese Restaurant Franchise Process, nCRFP)的概率图模型,该模型能够显式建模个体间在代词选择上的差异,同时灵活处理自选代词和新造代词,而无需依赖传统的词汇共现统计。通过这种方法,论文展示了如何更好地捕捉符号知识中代词或名字被接受的速度变化,并使计算系统既能灵活适应又能尊重具有多样化性别表达的 queer 人群。

链接: https://arxiv.org/abs/2504.02973
作者: Cassandra L. Jacobs,Morgan Grobol
机构: Department of Linguistics, State University of New York at Buffalo (纽约州立大学布法罗分校); MoDyCo, Université Paris Nanterre (巴黎纳米尔大学)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:A major challenge to equity among members of queer communities is the use of one’s chosen forms of reference, such as personal names or pronouns. Speakers often dismiss their misuses of pronouns as “unintentional”, and claim that their errors reflect many decades of fossilized mainstream language use, as well as attitudes or expectations about the relationship between one’s appearance and acceptable forms of reference. We argue for explicitly modeling individual differences in pronoun selection and present a probabilistic graphical modeling approach based on the nested Chinese Restaurant Franchise Process (nCRFP) (Ahmed et al., 2013) to account for flexible pronominal reference such as chosen names and neopronouns while moving beyond form-to-meaning mappings and without lexical co-occurrence statistics to learn referring expressions, as in contemporary language models. We show that such a model can account for variability in how quickly pronouns or names are integrated into symbolic knowledge and can empower computational systems to be both flexible and respectful of queer people with diverse gender expression.
zh

[NLP-48] QID: Efficient Query-Informed ViTs in Data-Scarce Regimes for OCR-free Visual Document Understanding CVPR2025

【速读】: 该论文致力于解决在视觉文档理解(Visual Document Understanding, VDU)任务中,通过微调预训练视觉-语言模型(Vision-Language Model, VLM)优化识别文本丰富文档图像中查询特定区域的能力不足的问题。现有方法通过修改网络架构直接注入查询,往往难以适应标注数据有限的新数据集。论文提出的关键解决方案是QID,这是一种新颖、精简且保留架构的方法,通过将查询嵌入集成到视觉编码器中,显著提升了性能,尤其是在数据稀缺的微调场景下。具体而言,该方法引入了双模块框架:一是查询感知模块,用于生成精确引导模型关注的唯一查询向量;二是查询无关模块,捕捉标记之间的位置关系,确保稳健的空间理解。这两个模块独立于视觉注意力块运行,便于针对查询嵌入进行针对性学习,从而增强视觉语义识别能力。实验结果表明,该方法在多个数据集上的OCR-free VLM应用中取得了显著的性能提升,特别是在处理文本丰富的数据稀缺环境时表现出色。

链接: https://arxiv.org/abs/2504.02971
作者: Binh M. Le,Shaoyuan Xu,Jinmiao Fu,Zhishen Huang,Moyan Li,Yanhui Guo,Hongdong Li,Sameera Ramasinghe,Bryan Wang
机构: Sungkyunkwan University (成均馆大学); Amazon (亚马逊); Pluralis Research (Pluralis 研究所)
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
备注: 8 pages, accepted by CVPR 2025 MULA

点击查看摘要

Abstract:In Visual Document Understanding (VDU) tasks, fine-tuning a pre-trained Vision-Language Model (VLM) with new datasets often falls short in optimizing the vision encoder to identify query-specific regions in text-rich document images. Existing methods that directly inject queries into model layers by modifying the network architecture often struggle to adapt to new datasets with limited annotations. To address this, we introduce QID, a novel, streamlined, architecture-preserving approach that integrates query embeddings into the vision encoder, leading to notable performance gains, particularly in data-scarce fine-tuning scenarios. Specifically, our approach introduces a dual-module framework: a query-aware module that generates a unique query vector to precisely guide the model’s focus, as well as a query-agnostic module that captures the positional relationships among tokens, ensuring robust spatial understanding. Notably, both modules operate independently of the vision attention blocks, facilitating targeted learning of query embeddings and enhancing visual semantic identification. Experiments with OCR-free VLMs across multiple datasets demonstrate significant performance improvements using our method, especially in handling text-rich documents in data-scarce environments.
zh

[NLP-49] CoLa – Learning to Interactively Collaborate with Large LMs

【速读】: 该论文旨在探索能否通过对"人类引导AI系统解决复杂语言问题"的演示进行泛化,来模拟人类引导者的行为,并提出了一种名为CoLa的新型自引导学习范式,用于训练自动化"引导者"(automated guides)。论文的关键在于这一自引导学习方法的设计:通过在两个问答数据集、一个谜题求解任务及一个受限文本生成任务上的评估,证明了CoLa在所有领域均优于竞争方法。此外,研究发现,一个小规模的训练引导者在充当引导者时的表现甚至超过了GPT-4这样的强大模型。通过在问答数据集上开展人类研究,对比人类与自动化引导者的策略,论文进一步展示了自动化引导者能够根据推理器的能力调整其引导策略,从而实现更优性能。

链接: https://arxiv.org/abs/2504.02965
作者: Abhishek Sharma,Dan Goldwasser
机构: Department of Computer Science (计算机科学系), Purdue University (普渡大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:LLMs’ remarkable ability to tackle a wide range of language tasks opened new opportunities for collaborative human-AI problem solving. LLMs can amplify human capabilities by applying their intuitions and reasoning strategies at scale. We explore whether human guides can be simulated, by generalizing from human demonstrations of guiding an AI system to solve complex language problems. We introduce CoLa, a novel self-guided learning paradigm for training automated \textitguides and evaluate it on two QA datasets, a puzzle-solving task, and a constrained text generation task. Our empirical results show that CoLa consistently outperforms competitive approaches across all domains. Moreover, a small-sized trained guide outperforms a strong model like GPT-4 when acting as a guide. We compare the strategies employed by humans and automated guides by conducting a human study on a QA dataset. We show that automated guides outperform humans by adapting their strategies to reasoners’ capabilities and conduct qualitative analyses highlighting distinct differences in guiding strategies.
zh

[NLP-50] Understanding Aha Moments: from External Observations to Internal Mechanisms

【速读】: 该论文旨在研究大型推理模型(LRMs)在处理复杂问题时所表现出的“啊哈时刻”(Aha Moment),探索这些模型如何获取推理能力并在重新组织方法以分配更多思考时间时展现出这种现象。论文的关键在于系统性地分析“啊哈时刻”的外部表现与内部机制,包括语言模式、不确定性描述、“推理坍塌”以及潜在空间中的分析。解决方案的核心是揭示“啊哈时刻”通过增强拟人化语气的使用频率以及根据问题难度调整不确定性来帮助模型完成推理,避免陷入“推理坍塌”。同时,内部机制表现为拟人化特征与纯粹推理之间的分离,且对于更难的问题表现出更强的拟人化倾向。此外,论文发现,“啊哈时刻”通过改变模型对问题难度的认知,使简单问题显得更复杂,而复杂问题显得更简单,从而有助于解决复杂问题。

链接: https://arxiv.org/abs/2504.02956
作者: Shu Yang,Junchao Wu,Xin Chen,Yunze Xiao,Xinyi Yang,Derek F. Wong,Di Wang
机构: Provable Responsible AI and Data Analytics (PRADA) Lab (可证明负责任的人工智能与数据分析实验室); King Abdullah University of Science and Technology (阿卜杜拉国王科技大学); University of Macau (澳门大学); Carnegie Mellon University (卡内基梅隆大学)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Large Reasoning Models (LRMs), capable of reasoning through complex problems, have become crucial for tasks like programming, mathematics, and commonsense reasoning. However, a key challenge lies in understanding how these models acquire reasoning capabilities and exhibit “aha moments” when they reorganize their methods to allocate more thinking time to problems. In this work, we systematically study “aha moments” in LRMs, from linguistic patterns, description of uncertainty, “Reasoning Collapse” to analysis in latent space. We demonstrate that the “aha moment” is externally manifested in a more frequent use of anthropomorphic tones for self-reflection and an adaptive adjustment of uncertainty based on problem difficulty. This process helps the model complete reasoning without succumbing to “Reasoning Collapse”. Internally, it corresponds to a separation between anthropomorphic characteristics and pure reasoning, with an increased anthropomorphic tone for more difficult problems. Furthermore, we find that the “aha moment” helps models solve complex problems by altering their perception of problem difficulty. As the layer of the model increases, simpler problems tend to be perceived as more complex, while more difficult problems appear simpler.
zh

[NLP-51] Cultural Learning-Based Culture Adaptation of Language Models

【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)在适应多样化文化价值观时面临的挑战,现有模型往往默认反映特定群体的价值观,可能对其他群体造成潜在伤害。论文提出了一种名为CLCA的新框架,其关键是通过模拟社会互动生成角色扮演场景中的对话,捕捉隐含的文化规范以微调模型,从而增强LLMs与文化价值观的一致性。实验基于世界价值观调查(World Value Survey)数据验证了该方法在不同模型架构上的有效性。

链接: https://arxiv.org/abs/2504.02953
作者: Chen Cecilia Liu,Anna Korhonen,Iryna Gurevych
机构: Ubiquitous Knowledge Processing Lab, Department of Computer Science and Hessian Center for AI (hessian.AI), Technical University of Darmstadt (达姆施塔特工业大学); Language Technology Lab, University of Cambridge (剑桥大学)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Adapting large language models (LLMs) to diverse cultural values is a challenging task, as existing LLMs often reflect the values of specific groups by default, and potentially causing harm to others. In this paper, we present CLCA, a novel framework for enhancing LLM alignment with cultural values based on cultural learning. The framework leverages simulated social interactions to generate conversations in which LLMs engage in role-playing within culturally adapted social scenarios, capturing implicit cultural norms for model fine-tuning. CLCA improves cultural value alignment across various model architectures measured using World Value Survey data, demonstrating the effectiveness of our proposed approach. Our results provide early evidence that understanding intent and social interactions can enhance cultural value adaptation in LLMs, highlighting the promise of training approaches based on cultural learning.
zh

[NLP-52] Robustly identifying concepts introduced during chat fine-tuning using crosscoders

【速读】: 该论文旨在解决现有模型diffing方法(如Crosscoders)在识别微调模型特有概念时的错误归因问题,即某些实际同时存在于基础模型与微调模型中的概念被误判为微调模型独有。这一问题源于Crosscoder的L1训练损失。论文的关键创新在于引入Latent Scaling技术,通过更准确地度量每个潜在表示在各模型中的存在程度来标记这些错误归因。在此基础上,作者进一步使用BatchTopK损失重新训练Crosscoder,显著缓解了上述问题,成功发现了更多真正具有聊天特异性且高度可解释的概念,例如"虚假信息"、"个人问题"以及多个与拒绝相关的潜在表示。论文建议实践者采用类似技术,并展示了基于BatchTopK的Crosscoder能够为理解聊天微调如何改变语言模型行为提供具体洞见。
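
Latent Scaling 的直观做法可以理解为:对每个潜在方向 d,在某模型的激活贡献 a 上拟合 min_β‖a-βd‖²,用 β 的大小衡量该潜在在该模型中的"存在程度"(闭式解 β=⟨a,d⟩/⟨d,d⟩)。以下为基于这一理解的 numpy 玩具示意,并非论文的官方实现:

```python
import numpy as np

def latent_scale(direction, activations):
    """对每条激活 a 求 min_beta ||a - beta * d||^2 的闭式解 beta = <a,d>/<d,d>,
    返回各样本 beta 的均值,作为该潜在方向在模型中的'存在强度'的玩具度量。"""
    d = np.asarray(direction, dtype=float)
    A = np.atleast_2d(np.asarray(activations, dtype=float))
    betas = A @ d / (d @ d)  # 每行一个闭式最小二乘解
    return float(betas.mean())

# 一个被标为'微调模型特有'的方向:若它真的不在基础模型中,
# 其在基础模型激活上的 beta 应接近 0,在微调模型激活上则明显非零
d = np.array([1.0, 0.0])
base_acts = np.array([[0.0, 3.0], [0.0, -2.0]])   # 与 d 正交
chat_acts = np.array([[2.0, 0.1], [4.0, -0.2]])   # 含明显 d 分量
```

若两个 β 都明显非零,则说明该概念其实同时存在于两个模型中,属于文中指出的错误归因情形。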

链接: https://arxiv.org/abs/2504.02922
作者: Julian Minder,Clement Dumas,Caden Juang,Bilal Chugtai,Neel Nanda
机构: EPFL (洛桑联邦理工学院); ETHZ (苏黎世联邦理工学院); École Normale Supérieure Paris-Saclay (巴黎高等师范学校萨克雷校区); Université Paris-Saclay (巴黎萨克雷大学); Northeastern University (东北大学)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: 47 pages, 27 figures

点击查看摘要

Abstract:Model diffing is the study of how fine-tuning changes a model’s representations and internal algorithms. Many behaviours of interest are introduced during fine-tuning, and model diffing offers a promising lens to interpret such behaviors. Crosscoders are a recent model diffing method that learns a shared dictionary of interpretable concepts represented as latent directions in both the base and fine-tuned models, allowing us to track how concepts shift or emerge during fine-tuning. Notably, prior work has observed concepts with no direction in the base model, and it was hypothesized that these model-specific latents were concepts introduced during fine-tuning. However, we identify two issues which stem from the crosscoders L1 training loss that can misattribute concepts as unique to the fine-tuned model, when they really exist in both models. We develop Latent Scaling to flag these issues by more accurately measuring each latent’s presence across models. In experiments comparing Gemma 2 2B base and chat models, we observe that the standard crosscoder suffers heavily from these issues. Building on these insights, we train a crosscoder with BatchTopK loss and show that it substantially mitigates these issues, finding more genuinely chat-specific and highly interpretable concepts. We recommend practitioners adopt similar techniques. Using the BatchTopK crosscoder, we successfully identify a set of genuinely chat-specific latents that are both interpretable and causally effective, representing concepts such as \textitfalse information and \textitpersonal question , along with multiple refusal-related latents that show nuanced preferences for different refusal triggers. Overall, our work advances best practices for the crosscoder-based methodology for model diffing and demonstrates that it can provide concrete insights into how chat tuning modifies language model behavior.
zh

[NLP-53] HyperRAG: Enhancing Quality-Efficiency Tradeoffs in Retrieval-Augmented Generation with Reranker KV-Cache Reuse

【速读】: 该论文旨在解决基于检索增强生成(Retrieval-Augmented Generation, RAG)管道中的重排序器(reranker)引入的计算挑战,这些挑战阻碍系统实现高吞吐量与低延迟。论文的关键解决方案是提出HyperRAG系统,通过利用KV缓存重用来优化RAG管道中质量与效率之间的权衡,特别是在仅解码器(decoder-only)重排序器的情况下,HyperRAG不仅实现了高质量的生成,还显著提升了系统级效率。为了充分发挥KV缓存重用的优势,HyperRAG结合了一系列系统级优化措施以提升效率和可扩展性。实验表明,HyperRAG在保持高下游性能的同时,相较于传统RAG服务,实现了2到3倍的吞吐量提升。
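
HyperRAG 复用文档侧 KV-cache 的收益可以用一个计数玩具来示意:同一文档在多条查询的重排序中只"预填充"一次,之后直接命中缓存。以下代码与真实 KV 张量布局无关,仅演示缓存命中减少重复计算的效果:

```python
from collections import Counter

class TinyRerankerCache:
    """示意文档侧 KV-cache 复用:用 encode_calls 计数模拟昂贵的 prefill,
    同一 doc_id 只编码一次,后续查询直接复用缓存结果。玩具示意,非论文实现。"""
    def __init__(self):
        self.cache = {}
        self.encode_calls = 0

    def _encode_doc(self, doc):
        self.encode_calls += 1       # 模拟一次昂贵的文档侧预填充
        return Counter(doc.split())  # 用词频袋充当'文档侧 KV'

    def score(self, query, doc_id, doc):
        if doc_id not in self.cache:  # 缓存未命中才计算文档侧表示
            self.cache[doc_id] = self._encode_doc(doc)
        kv = self.cache[doc_id]
        return sum(kv.get(w, 0) for w in query.split())
```

两个文档、两轮不同查询共四次打分,文档侧编码只发生两次,这正是"以缓存换吞吐"的核心直觉。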

链接: https://arxiv.org/abs/2504.02921
作者: Yuwei An,Yihua Cheng,Seo Jin Park,Junchen Jiang
机构: Carnegie Mellon University (卡内基梅隆大学); University of Chicago (芝加哥大学); University of Southern California (南加州大学)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Retrieval-Augmented Generation (RAG) has emerged as a powerful paradigm for enhancing the performance of large language models (LLMs) by integrating external knowledge into the generation process. A key component of RAG pipelines is the reranker, which selects the most relevant documents from a pool of retrieved candidates and significantly improves the quality of the generated responses. While rerankers refine the selection of retrieved documents in RAG pipelines, they introduce computational challenges that hinder high throughput and low latency. To address this problem, we propose HyperRAG, a system that optimizes the trade-off between quality and efficiency in RAG pipelines by leveraging KV-cache reuse for efficient reranker inference. By reusing document-side KV-cache, HyperRAG achieves both high-quality generation and system-level efficiency. To fully realize the benefits of KV-cache reuse, HyperRAG incorporates a range of system-level optimizations designed to enhance efficiency and scalability. Experiments show that HyperRAG achieves a 2 - 3x throughput improvement with decoder-only rerankers while also delivering higher downstream performance compared with traditional RAG service.
zh

[NLP-54] Bias in Large Language Models Across Clinical Applications: A Systematic Review

【速读】: 该论文试图解决大型语言模型(Large Language Models, LLMs)在医疗领域应用中潜在的偏见问题及其对患者护理和健康公平性的潜在影响。论文的关键解决方案在于系统性地调查LLMs中偏见的存在、来源、表现形式以及临床影响,并强调通过严格评估模型以及开发和实施有效的缓解策略来确保LLMs在医疗领域的安全、公平和可信部署。

链接: https://arxiv.org/abs/2504.02917
作者: Thanathip Suenghataiphorn,Narisara Tribuddharat,Pojsakorn Danpanichkul,Narathorn Kulthamrongsri
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Background: Large language models (LLMs) are rapidly being integrated into healthcare, promising to enhance various clinical tasks. However, concerns exist regarding their potential for bias, which could compromise patient care and exacerbate health inequities. This systematic review investigates the prevalence, sources, manifestations, and clinical implications of bias in LLMs. Methods: We conducted a systematic search of PubMed, OVID, and EMBASE from database inception through 2025, for studies evaluating bias in LLMs applied to clinical tasks. We extracted data on LLM type, bias source, bias manifestation, affected attributes, clinical task, evaluation methods, and outcomes. Risk of bias was assessed using a modified ROBINS-I tool. Results: Thirty-eight studies met inclusion criteria, revealing pervasive bias across various LLMs and clinical applications. Both data-related bias (from biased training data) and model-related bias (from model training) were significant contributors. Biases manifested as: allocative harm (e.g., differential treatment recommendations); representational harm (e.g., stereotypical associations, biased image generation); and performance disparities (e.g., variable output quality). These biases affected multiple attributes, most frequently race/ethnicity and gender, but also age, disability, and language. Conclusions: Bias in clinical LLMs is a pervasive and systemic issue, with a potential to lead to misdiagnosis and inappropriate treatment, particularly for marginalized patient populations. Rigorous evaluation of the model is crucial. Furthermore, the development and implementation of effective mitigation strategies, coupled with continuous monitoring in real-world clinical settings, are essential to ensure the safe, equitable, and trustworthy deployment of LLMs in healthcare.
zh

[NLP-55] Noiser: Bounded Input Perturbations for Attributing Large Language Models

【速读】: 本文旨在解决如何为大语言模型(Large Language Models, LLMs)的预测生成忠实的特征归因(Feature Attribution, FA),以准确反映模型的实际内部行为。现有方法在归因的忠实性(faithfulness)和可回答性(answerability)方面存在不足。为此,论文提出Noiser,这是一种基于扰动的FA方法,通过在每个输入嵌入上施加有界噪声,并衡量模型对部分加噪输入的鲁棒性,从而获得输入的归因。关键创新在于引入了一种新的可回答性度量,利用经过指令微调的评判模型(judge model)评估高分token恢复预测输出的能力。通过对六种LLMs和三种任务的综合评估,结果表明Noiser在忠实性和可回答性方面均优于现有的基于梯度、注意力和扰动的方法,是一种稳健且有效的语言模型预测解释方法。
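
Noiser 的核心操作(对单个 token 的嵌入施加有界噪声并测量输出偏移)可以草描如下;其中的模型、噪声采样数等均为玩具设定,并非论文实现:

```python
import numpy as np

def noiser_attribution(embeds, forward, eps=0.1, n_samples=64, seed=0):
    """有界噪声归因的最小示意:对第 i 个 token 的嵌入加 [-eps, eps] 的均匀
    噪声,用输出的平均绝对偏移作为该 token 的归因分数。
    forward 为任意 '嵌入矩阵 -> 标量' 的函数,此处仅作玩具模型。"""
    rng = np.random.default_rng(seed)
    base = forward(embeds)
    scores = []
    for i in range(embeds.shape[0]):
        total = 0.0
        for _ in range(n_samples):
            noised = embeds.copy()
            noised[i] += rng.uniform(-eps, eps, size=embeds.shape[1])
            total += abs(forward(noised) - base)
        scores.append(total / n_samples)
    return np.array(scores)

# 玩具模型:输出只依赖第 0 个 token,因此第 0 个 token 的归因应最高
w = np.array([5.0, 5.0])
toy_forward = lambda M: float(M[0] @ w)
E = np.ones((3, 2))
```

对输出无影响的 token 在此设定下偏移恒为 0,归因分数自然为 0,直观体现了"鲁棒性即归因"的思路。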

链接: https://arxiv.org/abs/2504.02911
作者: Mohammad Reza Ghasemi Madani,Aryo Pradipta Gema,Gabriele Sarti,Yu Zhao,Pasquale Minervini,Andrea Passerini
机构: University of Trento (特伦托大学); University of Edinburgh (爱丁堡大学); CLCG, University of Groningen (格罗宁根大学 CLCG); Miniml.AI
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: arXiv admin note: text overlap with arXiv:2402.00794 by other authors

点击查看摘要

Abstract:Feature attribution (FA) methods are common post-hoc approaches that explain how Large Language Models (LLMs) make predictions. Accordingly, generating faithful attributions that reflect the actual inner behavior of the model is crucial. In this paper, we introduce Noiser, a perturbation-based FA method that imposes bounded noise on each input embedding and measures the robustness of the model against partially noised input to obtain the input attributions. Additionally, we propose an answerability metric that employs an instructed judge model to assess the extent to which highly scored tokens suffice to recover the predicted output. Through a comprehensive evaluation across six LLMs and three tasks, we demonstrate that Noiser consistently outperforms existing gradient-based, attention-based, and perturbation-based FA methods in terms of both faithfulness and answerability, making it a robust and effective approach for explaining language model predictions.
zh

[NLP-56] Enhancing Chart-to-Code Generation in Multimodal Large Language Models via Iterative Dual Preference Learning

【速读】: 该论文致力于解决多模态大型语言模型(Multimodal Large Language Models, MLLMs)在图表到代码生成任务中的挑战,即如何准确捕捉并总结图表的视觉与结构元素以生成可执行的绘图脚本。传统方法难以有效满足这一需求,因为MLLMs本质上并非专门为代码生成任务设计。为了解决此问题,论文提出了一种名为Chart2Code的新框架,其关键在于引入迭代双重偏好学习机制,通过结构化代码变体生成以及细粒度的双重奖励信号来增强MLLMs的图表到代码生成能力。此外,文中提出的双评分方法结合了代码结构与可视化表示的评估,进一步提升了生成质量,即使在偏好数据集规模较小时亦如此。这些创新点共同推动了图表理解领域的进步。
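
Chart2Code 的"双评分"思想(同时评估代码文本结构与渲染后的视觉表示)可以用一个加权奖励草图来说明;其中的打分函数与权重均为笔者假设的占位实现,并非论文的奖励设计:

```python
def token_overlap(ref_code, gen_code):
    """一个极简的'代码文本结构'打分:参考代码与生成代码的词元交并比(IoU)。"""
    a, b = set(ref_code.split()), set(gen_code.split())
    return len(a & b) / len(a | b) if a | b else 1.0

def dual_reward(code_score, visual_score, w_code=0.5, w_visual=0.5):
    """双重奖励示意:代码结构得分与视觉表示得分的加权和,
    作为偏好学习中比较代码变体优劣的总分。"""
    return w_code * code_score + w_visual * visual_score
```

实际系统中视觉得分需要渲染图表后比较图像,这里仅以一个标量占位,说明两路信号如何合成单一偏好分数。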

链接: https://arxiv.org/abs/2504.02906
作者: Zhihan Zhang,Yixin Cao,Lizi Liao
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 21 pages, 5 figures

点击查看摘要

Abstract:Chart-to-code generation, the process of converting chart images into executable plotting scripts, provides a lossless representation of chart information, requiring models to accurately capture and summarize all visual and structural elements. However, this remains a significant challenge for multimodal large language models (MLLMs), which are not inherently well-aligned with code generation tasks. To bridge this gap, we introduce Chart2Code, a novel iterative dual preference learning framework designed to enhance MLLMs’ chart-to-code generation capabilities through structured code variant generation and fine-grained dual reward signals. We validate Chart2Code across three MLLMs and find that iterative preference learning consistently improves out-of-distribution chart-to-code generation quality. Throughout this process, our dual scoring method, which evaluates both the textual code structure and its visual representation, leads to greater performance improvements, even with a reduced preference dataset size. Further analysis explores the key components of our framework and highlights the interplay between chart-to-code generation and broader chart reasoning, paving the way for future advancements in chart comprehension.
zh

[NLP-57] How Post-Training Reshapes LLMs: A Mechanistic View on Knowledge, Truthfulness, Refusal, and Confidence

【速读】: 该论文试图解决的问题是如何从机制层面理解大型语言模型(Large Language Models, LLMs)在后训练(post-training)过程中的内部变化。尽管已有大量研究关注后训练算法及其输出效果,但对后训练如何重塑LLMs内部结构的研究仍显不足。为填补这一空白,论文从四个视角对比了基础模型与后训练模型,以揭示后训练的影响。

解决方案的关键在于通过系统性分析,揭示基础模型与后训练模型在知识存储位置、表示形式、真实性与拒绝方向以及置信度等方面的异同。具体而言,论文发现后训练不会改变事实性知识的存储位置,而是调整了基础模型的知识表示并发展出新的表示形式;同时,真实性和拒绝行为均可在隐藏表征空间中通过线性向量表示,并且真实性方向在基础模型与后训练模型间高度相似且易于干预;而拒绝方向则存在差异,其跨模型的迁移能力有限;此外,基础模型与后训练模型之间的置信度差异并非由熵神经元直接导致。这些发现为理解后训练过程中保留和改变的核心机制提供了重要见解,并有助于下游任务如模型引导及未来可解释性研究的发展。
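
文中"真实性可由隐藏表征空间中的线性向量表示且可用于干预"这类结论,常见的通用做法是"均值差"方向估计加推理时平移。以下为这一通用做法的 numpy 玩具示意(并非论文的具体实验设置):

```python
import numpy as np

def linear_direction(pos_hidden, neg_hidden):
    """用'均值差'估计隐藏空间中的线性方向(如真实性方向)的玩具示意:
    direction = mean(正例隐状态) - mean(负例隐状态),并单位化。"""
    d = np.mean(pos_hidden, axis=0) - np.mean(neg_hidden, axis=0)
    return d / np.linalg.norm(d)

def steer(hidden, direction, alpha=1.0):
    """推理时干预:把隐状态沿该方向平移 alpha 个单位。"""
    return hidden + alpha * direction

# 玩具数据:正负例只在第 0 维上分开
pos = np.array([[1.0, 0.0], [3.0, 0.0]])
neg = np.array([[-1.0, 0.0], [-3.0, 0.0]])
```

文中"真实性方向在基础模型与后训练模型间高度相似、可迁移"对应的就是这类方向向量在两套隐状态上的对齐与跨模型干预。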

链接: https://arxiv.org/abs/2504.02904
作者: Hongzhe Du,Weikai Li,Min Cai,Karim Saraipour,Zimin Zhang,Himabindu Lakkaraju,Yizhou Sun,Shichang Zhang
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Post-training is essential for the success of large language models (LLMs), transforming pre-trained base models into more useful and aligned post-trained models. While plenty of works have studied post-training algorithms and evaluated post-training models by their outputs, it remains understudied how post-training reshapes LLMs internally. In this paper, we compare base and post-trained LLMs mechanistically from four perspectives to better understand post-training effects. Our findings across model families and datasets reveal that: (1) Post-training does not change the factual knowledge storage locations, and it adapts knowledge representations from the base model while developing new knowledge representations; (2) Both truthfulness and refusal can be represented by linear vectors in the hidden representation space. The truthfulness direction is highly similar between the base and post-trained model, and it is effectively transferable for interventions; (3) The refusal direction is different between the base and post-trained models, and it shows limited forward transferability; (4) Differences in confidence between the base and post-trained models cannot be attributed to entropy neurons. Our study provides insights into the fundamental mechanisms preserved and altered during post-training, facilitates downstream tasks like model steering, and could potentially benefit future research in interpretability and LLM post-training.
zh

[NLP-58] Beyond Accuracy: The Role of Calibration in Self-Improving Large Language Models

【速读】: 该论文试图解决大型语言模型(Large Language Models, LLMs)在迭代自改进过程中引入的过度自信(overconfidence)问题及其导致的期望校准误差(Expected Calibration Error, ECE)持续上升的现象。论文的关键在于探索并验证在自改进过程中集成置信度校准技术的有效性,并提出在每次自改进步骤中进行迭代校准(iterative calibration)能够最有效地降低ECE,从而提升模型的校准性能。这一研究为平衡LLMs的性能与可靠性提供了重要见解。
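
文中的核心度量 ECE 可按标准分箱方式计算:ECE = Σ_b (|B_b|/N)·|acc(B_b) - conf(B_b)|。下面是一个最简 numpy 实现(分箱数等细节取常见默认设定,未必与论文完全一致):

```python
import numpy as np

def expected_calibration_error(confs, correct, n_bins=10):
    """标准分箱 ECE:按置信度把样本分入等宽箱,
    累加每箱 '样本占比 * |箱内准确率 - 箱内平均置信度|'。
    confs 为模型置信度,correct 为 0/1 是否答对。"""
    confs = np.asarray(confs, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for b, (lo, hi) in enumerate(zip(edges[:-1], edges[1:])):
        if b == 0:  # 首箱闭区间,保证 conf=0 也有归属
            mask = (confs >= lo) & (confs <= hi)
        else:
            mask = (confs > lo) & (confs <= hi)
        if mask.any():
            ece += mask.mean() * abs(correct[mask].mean() - confs[mask].mean())
    return float(ece)
```

自改进带来的"高置信、低准确"样本会直接推高该值,这正是文中观察到的过度自信信号。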

链接: https://arxiv.org/abs/2504.02902
作者: Liangjie Huang,Dawei Li,Huan Liu,Lu Cheng
机构: University of Illinois Chicago (伊利诺伊大学芝加哥分校); Arizona State University (亚利桑那州立大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Large Language Models (LLMs) have demonstrated remarkable self-improvement capabilities, whereby models iteratively revise their outputs through self-generated feedback. While this reflective mechanism has shown promise in enhancing task performance, recent studies suggest that it may also introduce undesirable biases-most notably, self-bias, or the tendency of LLMs to favor their own prior outputs. In this work, we extend this line of inquiry by investigating the impact on confidence estimation. We evaluate three representative self-improvement paradigms-basic prompting, Chain-of-Thought (CoT) prompting, and tuning-based methods and find that iterative self-improvement can lead to systematic overconfidence, as evidenced by a steadily increasing Expected Calibration Error (ECE) and lower accuracy with high confidence. We then further explore the integration of confidence calibration techniques with self-improvement. Specifically, we compare three strategies: (1) applying calibration after multiple rounds of self-improvement, (2) calibrating before self-improvement, and (3) applying calibration iteratively at each self-improvement step. Our results show that iterative calibration is most effective in reducing ECE, yielding improved calibration. Our work pioneers the study of self-improving LLMs from a calibration perspective, offering valuable insights into balancing model performance and reliability.
zh

[NLP-59] A Practical Synthesis of Detecting AI-Generated Textual Visual and Audio Content

【速读】: 该论文旨在解决生成式 AI (Generative AI) 产出的文本、视觉及音频内容的检测与防范问题,重点关注如何识别和缓解因大规模语言模型、基于扩散的视觉生成器以及合成音频工具等技术发展而加剧的虚假信息、版权侵权、安全威胁及公众信任侵蚀等问题。论文的关键在于提出涵盖基于观察的策略、语言与统计分析、基于模型的管道、水印与指纹技术及新兴集成方法的综合检测方案,并强调鲁棒性、对不断进步的生成架构的快速适应,以及人类在环验证的关键作用。通过综述前沿研究并展示学术、新闻、法律及工业领域的案例研究,论文旨在为研究人员、从业者及监管者提供全面指导,以在日益复杂的 AI 媒体环境中维护内容真实性。

链接: https://arxiv.org/abs/2504.02898
作者: Lele Cao
机构: King/Microsoft
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Advances in AI-generated content have led to wide adoption of large language models, diffusion-based visual generators, and synthetic audio tools. However, these developments raise critical concerns about misinformation, copyright infringement, security threats, and the erosion of public trust. In this paper, we explore an extensive range of methods designed to detect and mitigate AI-generated textual, visual, and audio content. We begin by discussing motivations and potential impacts associated with AI-based content generation, including real-world risks and ethical dilemmas. We then outline detection techniques spanning observation-based strategies, linguistic and statistical analysis, model-based pipelines, watermarking and fingerprinting, as well as emergent ensemble approaches. We also present new perspectives on robustness, adaptation to rapidly improving generative architectures, and the critical role of human-in-the-loop verification. By surveying state-of-the-art research and highlighting case studies in academic, journalistic, legal, and industrial contexts, this paper aims to inform robust solutions and policymaking. We conclude by discussing open challenges, including adversarial transformations, domain generalization, and ethical concerns, thereby offering a holistic guide for researchers, practitioners, and regulators to preserve content authenticity in the face of increasingly sophisticated AI-generated media.
zh

[NLP-60] OnRL-RAG : Real-Time Personalized Mental Health Dialogue System

【速读】: 该论文旨在解决大型语言模型(LLMs)在处理实时动态环境中的个性化需求时,受限于预训练数据而导致的知识时效性和表达适应性不足的问题。论文提出的关键解决方案是Online Reinforcement Learning-based Retrieval-Augmented Generation (OnRL-RAG) 系统,通过结合检索增强生成(Retrieval-Augmented Generation, RAG)与基于人类反馈的强化学习(Reinforcement Learning from Human Feedback, RLHF),实现对心理健康问题(如压力、焦虑和抑郁)的个性化检测与响应。该系统不仅能够整合最新的外部信息,还能通过在线强化学习动态调整以适应不同个体的需求,从而提供更加精准的服务。论文通过一个包含2028名大学生数据集的实验验证了OnRL-RAG系统的优越性能,显著优于标准RAG及多种基准LLM模型。
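
OnRL-RAG 流程中的检索步骤可以用"句向量 + 余弦相似度取 top-k"来示意。以下为最小 numpy 草图(向量化方式与检索器选择均为假设,论文系统的具体实现未必如此):

```python
import numpy as np

def retrieve_top_k(query_vec, doc_vecs, k=2):
    """RAG 检索步骤的最小示意:对查询向量与文档向量分别做 L2 归一化,
    按余弦相似度从高到低返回 top-k 文档的下标。"""
    q = query_vec / np.linalg.norm(query_vec)
    D = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    sims = D @ q                      # 各文档与查询的余弦相似度
    return list(np.argsort(-sims)[:k])
```

在线强化学习部分则可理解为:根据用户反馈调整检索与生成策略的权重,这里不展开,仅示意被反复调用的检索原语。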

链接: https://arxiv.org/abs/2504.02894
作者: Ahsan Bilal,Beiyu Lin,Mehdi Zaeifi
机构: University of Oklahoma (俄克拉荷马大学), USA
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Large language models (LLMs) have been widely used for various tasks and applications. However, LLMs and fine-tuning are limited to the pre-trained data. For example, ChatGPT’s world knowledge until 2021 can be outdated or inaccurate. To enhance the capabilities of LLMs, Retrieval-Augmented Generation (RAG), is proposed to augment LLMs with additional, new, latest details and information to LLMs. While RAG offers the correct information, it may not best present it, especially to different population groups with personalizations. Reinforcement Learning from Human Feedback (RLHF) adapts to user needs by aligning model responses with human preference through feedback loops. In real-life applications, such as mental health problems, a dynamic and feedback-based model would continuously adapt to new information and offer personalized assistance due to complex factors fluctuating in a daily environment. Thus, we propose an Online Reinforcement Learning-based Retrieval-Augmented Generation (OnRL-RAG) system to detect and personalize the responding systems to mental health problems, such as stress, anxiety, and depression. We use an open-source dataset collected from 2028 College Students with 28 survey questions for each student to demonstrate the performance of our proposed system with the existing systems. Our system achieves superior performance compared to standard RAG and simple LLM via GPT-4o, GPT-4o-mini, Gemini-1.5, and GPT-3.5. This work would open up the possibilities of real-life applications of LLMs for personalized services in the everyday environment. The results will also help researchers in the fields of sociology, psychology, and neuroscience to align their theories more closely with the actual human daily environment.
zh

[NLP-61] Automated Survey Collection with LLM-based Conversational Agents

【速读】: 该论文旨在解决传统基于电话的问卷调查方法在收集生物医学和医疗保健数据时成本高、劳动密集且难以有效扩展的问题。为克服这些局限性,论文提出了一种由对话式大型语言模型(Conversational Large Language Models, LLMs)驱动的端到端问卷收集框架。解决方案的关键在于设计了一个综合系统,包含负责设计问卷并招募参与者的研究者、基于LLM的电话代理、用于分析调查中生成对话转录的第二个LLM(GPT-4o),以及存储和组织结果的数据库,并通过实验验证了该框架的有效性,证明了LLM代理在开展和分析电话调查方面的潜力。

链接: https://arxiv.org/abs/2504.02891
作者: Kurmanbek Kaiyrbekov,Nicholas J Dobbins,Sean D Mooney
机构: Cyberinfrastructure and Artificial Intelligence Platforms Section, Center for Genomics and Data Science Research, National Human Genome Research Institute, National Institutes of Health (国立卫生研究院), Bethesda, Maryland, USA; Biomedical Informatics & Data Science, Department of Medicine, Johns Hopkins University (约翰斯·霍普金斯大学), Baltimore, Maryland, USA
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Objective: Traditional phone-based surveys are among the most accessible and widely used methods to collect biomedical and healthcare data, however, they are often costly, labor intensive, and difficult to scale effectively. To overcome these limitations, we propose an end-to-end survey collection framework driven by conversational Large Language Models (LLMs). Materials and Methods: Our framework consists of a researcher responsible for designing the survey and recruiting participants, a conversational phone agent powered by an LLM that calls participants and administers the survey, a second LLM (GPT-4o) that analyzes the conversation transcripts generated during the surveys, and a database for storing and organizing the results. To test our framework, we recruited 8 participants consisting of 5 native and 3 non-native english speakers and administered 40 surveys. We evaluated the correctness of LLM-generated conversation transcripts, accuracy of survey responses inferred by GPT-4o and overall participant experience. Results: Survey responses were successfully extracted by GPT-4o from conversation transcripts with an average accuracy of 98% despite transcripts exhibiting an average per-line word error rate of 7.7%. While participants noted occasional errors made by the conversational LLM agent, they reported that the agent effectively conveyed the purpose of the survey, demonstrated good comprehension, and maintained an engaging interaction. Conclusions: Our study highlights the potential of LLM agents in conducting and analyzing phone surveys for healthcare applications. By reducing the workload on human interviewers and offering a scalable solution, this approach paves the way for real-world, end-to-end AI-powered phone survey collection systems. 
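
摘要中报告的逐行词错误率(per-line word error rate,均值 7.7%)可按标准编辑距离定义计算:WER = (替换 + 删除 + 插入) / 参考词数。下面给出一个自包含的动态规划实现示意,可用于像文中那样核对 LLM 生成的通话转录:

```python
def word_error_rate(reference, hypothesis):
    """词级 Levenshtein 编辑距离除以参考词数,即标准 WER。"""
    ref, hyp = reference.split(), hypothesis.split()
    m, n = len(ref), len(hyp)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i  # 全删除
    for j in range(n + 1):
        dp[0][j] = j  # 全插入
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,          # 删除
                           dp[i][j - 1] + 1,          # 插入
                           dp[i - 1][j - 1] + cost)   # 替换或匹配
    return dp[m][n] / max(m, 1)
```

按行计算后取平均,即可得到文中那样的逐行 WER 统计。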
zh

[NLP-62] Scaling Test-time Compute for Low-resource Languages: Multilingual Reasoning in LLMs

【速读】: 该论文旨在解决低资源语言下大型语言模型(Large Language Models, LLMs)在深度推理任务中的不足,特别是现有测试时计算扩展技术主要集中在流行语言(如英语),而对低资源语言的推理能力探索有限且效果不佳的问题。论文的关键在于提出了一种名为“English-Pivoted Chain-of-Thought (CoT) Training”的方法,即通过让模型在隐空间中以偏向其主导语言(如英语)的方式工作,同时在输入为低资源语言的情况下,生成英语的CoT并输出目标语言的最终答案。这种方法利用了多语言机制,通过英语作为桥梁实现了跨语言推理能力的提升,并显著优于仅在目标语言内生成CoT和最终响应的基线方法,性能提升最高可达28.33%。
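
English-Pivoted CoT 的核心是把思维链固定为英语、最终答案固定为目标语言。以下提示模板仅为示意,措辞与标记(如 `<think>`)均为笔者假设,并非论文训练时使用的原始格式:

```python
def english_pivoted_prompt(question, target_lang):
    """构造 'CoT 用英语、答案用目标语言' 的提示。
    模板文字为假设示例,仅演示 English-Pivoted CoT 的输入输出约定。"""
    return (
        f"Question ({target_lang}): {question}\n"
        "Think step by step in English inside <think>...</think>, "
        f"then give ONLY the final answer in {target_lang}.\n"
        "<think>"
    )
```

训练时则将监督信号组织成"英语思维链 + 目标语言答案"的配对,使模型在低资源语言输入下仍沿其主导语言的隐空间进行推理。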

链接: https://arxiv.org/abs/2504.02890
作者: Khanh-Tung Tran,Barry O’Sullivan,Hoang D. Nguyen
机构: School of Computer Science and Information Technology, University College Cork (爱尔兰科克大学计算机科学与信息技术学院), Cork, Ireland
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Recent advances in test-time compute scaling have enabled Large Language Models (LLMs) to tackle deep reasoning tasks by generating a chain-of-thought (CoT) that includes trial and error, backtracking, and intermediate reasoning steps before producing the final answer. However, these techniques have been applied predominantly to popular languages, such as English, leaving reasoning in low-resource languages underexplored and misaligned. In this work, we investigate the multilingual mechanism by which LLMs internally operate in a latent space biased toward their inherently dominant language. To leverage this phenomenon for low-resource languages, we train models to generate the CoT in English while outputting the final response in the target language, given input in the low-resource language. Our experiments demonstrate that this approach, named English-Pivoted CoT Training, outperforms other baselines, including training to generate both the CoT and the final response solely in the target language, with up to 28.33% improvement. Further analysis provides novel insights into the relationships between reasoning and multilinguality of LLMs, prompting for better approaches in developing multilingual large reasoning models
zh
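按论文思路,每条训练样本把低资源语言输入、英文 CoT 与目标语言答案拼接在一起。以下为构造方式的概念示意(分隔标记与示例语句均为笔者假设,并非论文的原始数据格式):

```python
# 概念性示意:构造 English-Pivoted CoT 训练样本(格式为笔者假设,非论文原始实现)

def build_pivoted_sample(question_lr: str, cot_en: str, answer_lr: str) -> str:
    """question_lr: 低资源语言问题; cot_en: 英文推理链; answer_lr: 目标语言最终答案"""
    return (
        f"Question: {question_lr}\n"
        f"<think>\n{cot_en}\n</think>\n"
        f"Answer: {answer_lr}"
    )

sample = build_pivoted_sample(
    "Có bao nhiêu số nguyên tố nhỏ hơn 10?",             # 越南语输入(示例)
    "Primes below 10 are 2, 3, 5, 7, so there are 4.",   # 英文 CoT
    "Có 4 số nguyên tố nhỏ hơn 10.",                     # 目标语言最终答案
)
print(sample)
```

训练时,模型在这种样本上学习"先用英语思考、再用目标语言作答"的固定格式。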

[NLP-63] A Status Quo Investigation of Large Language Models towards Cost-Effective CFD Automation with OpenFOAMGPT: ChatGPT vs. Qwen vs. Deepseek

【速读】: 该论文旨在评估将多个大型语言模型(Large Language Models)集成到OpenFOAMGPT中的性能,以解决复杂计算流体力学(CFD)任务的自动化问题。论文的关键在于探索如何有效管理边界条件、湍流模型及求解器配置等任务,同时识别现有方法的局限性,例如本地部署的小型模型(如QwQ-32B)在处理复杂过程时生成有效求解器文件的能力不足,以及零样本提示(Zero-Shot Prompting)在精细设置下的失效问题。研究强调,尽管大型模型在某些方面表现较好,但边界条件和求解器关键词相关的挑战凸显了专家监督的必要性,表明实现CFD模拟全自动化的进一步开发至关重要。因此,解决方案的关键在于通过结合专家知识与更先进的模型能力,提升系统的稳定性和适应复杂任务的能力。

链接: https://arxiv.org/abs/2504.02888
作者: Wenkang Wang,Ran Xu,Jingsen Feng,Qingfu Zhang,Xu Chu
机构: International Research Institute for Multidisciplinary Science, Beihang University (北航); Cluster of Excellence SimTech, University of Stuttgart (斯图加特大学); Faculty of Environment, Science and Economy, University of Exeter (埃克塞特大学); Institute of Fluid Mechanics, Beihang University (北航); Institute of Thermodynamics and Fluid Mechanics, Technische Universität Ilmenau (伊尔梅瑙工业大学); Faculty of Environment, Science and Economy, University of Exeter (埃克塞特大学); Cluster of Excellence SimTech, University of Stuttgart (斯图加特大学)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:We evaluated the performance of OpenFOAMGPT incorporating multiple large-language models. Some of the present models efficiently manage different CFD tasks such as adjusting boundary conditions, turbulence models, and solver configurations, although their token cost and stability vary. Locally deployed smaller models like QwQ-32B struggled with generating valid solver files for complex processes. Zero-shot prompting commonly failed in simulations with intricate settings, even for large models. Challenges with boundary conditions and solver keywords stress the requirement for expert supervision, indicating that further development is needed to fully automate specialized CFD simulations.
zh

[NLP-64] Processes Matter: How ML/GAI Approaches Could Support Open Qualitative Coding of Online Discourse Datasets

【速读】: 该论文试图解决在定性研究中开放编码(open coding)面临的挑战,特别是如何从大规模话语数据集中有效捕捉广泛且细微的概念或“编码时刻”(coding moments)。论文的关键在于评估机器学习(Machine Learning, ML)和生成式人工智能(Generative AI, GAI)在开放编码中的潜力,并通过对比五种基于ML/GAI的方法与四名人类编码员的结果,揭示了人机协作的互补优势。研究发现,逐行分析的人工智能方法在识别基于内容的编码方面表现出色,而人类则更擅长解析会话动态。论文强调,研究人员应将AI作为辅助工具嵌入其分析流程中,而非完全替代人类编码员,以实现更高效的开放编码过程。

链接: https://arxiv.org/abs/2504.02887
作者: John Chen,Alexandros Lotsos,Grace Wang,Lexie Zhao,Bruce Sherin,Uri Wilensky,Michael Horn
机构: 未知
类目: Computation and Language (cs.CL); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
备注: This paper was recommended for acceptance as a long paper by CSCL reviewers, but ends up as a short paper. The arXiv version here is its longer form, revised with reviewers’ comments

点击查看摘要

Abstract:Open coding, a key inductive step in qualitative research, discovers and constructs concepts from human datasets. However, capturing extensive and nuanced aspects or “coding moments” can be challenging, especially with large discourse datasets. While some studies explore machine learning (ML)/Generative AI (GAI)’s potential for open coding, few evaluation studies exist. We compare open coding results by five recently published ML/GAI approaches and four human coders, using a dataset of online chat messages around a mobile learning software. Our systematic analysis reveals ML/GAI approaches’ strengths and weaknesses, uncovering the complementary potential between humans and AI. Line-by-line AI approaches effectively identify content-based codes, while humans excel in interpreting conversational dynamics. We discussed how embedded analytical processes could shape the results of ML/GAI approaches. Instead of replacing humans in open coding, researchers should integrate AI with and according to their analytical processes, e.g., as parallel co-coders.
zh

[NLP-65] LVMed-R2: Perception and Reflection-driven Complex Reasoning for Medical Report Generation

【速读】: 该论文旨在解决现有大型视觉语言模型(Large Vision-Language Models, LVMs)在医学报告生成任务中的两大局限性:一是缺乏复杂的推理能力,导致生成报告中存在逻辑不一致和潜在诊断错误;二是缺乏反思机制,无法在思维过程中发现并修正错误。为了解决这些问题,论文提出了一种名为LVMed-R2的新微调策略,其关键是引入了复杂推理和反思机制。具体而言,复杂推理机制通过医学知识注入模块和感知增强模块提升模型诊断准确性,并结合感知树限制感知范围;反思机制则通过自我验证纠正输出中的潜在错误。这一方案首次将复杂推理引入医学报告生成任务,并通过IU-Xray和MIMIC-CXR数据集的实验验证,证明了所提方法在自然语言生成和临床效用指标上的有效性。

链接: https://arxiv.org/abs/2504.02885
作者: Hao Wang,Shuchang Ye,Jinghao Lin,Usman Naseem,Jinman Kim
机构: 未知
类目: Computation and Language (cs.CL)
备注: 10 pages, 3 figures, 1 table

点击查看摘要

Abstract:Large vision-language models (LVMs) hold a great promise for automating medical report generation, potentially reducing the burden of manual reporting. State-of-the-art (SOTA) research fine-tunes general LVMs with medical data to align radiology images to corresponding medical reports. However, there are two key factors that limit these LVM’s performance. Firstly, LVMs lack complex reasoning capability that leads to logical inconsistencies and potential diagnostic errors in generated reports. Secondly, LVMs lack reflection mechanism that leads to an inability to discover errors in the thinking process. To address these gaps, we propose LVMed-R2, a new fine-tuning strategy that introduces complex reasoning and reflection mechanisms for LVMs to enhance medical report generation. To the best of our knowledge, this is the first work to introduce complex reasoning to the medical report generation (MRG) task. Our proposed complex reasoning contains medical knowledge injection and perception-enhancing modules which improve the accuracy of LVMs diagnosis, coupled with a perception tree to provide guidance to limit the perception range. Further, the reflection mechanism forces self-verification for outputs to correct for potential errors. We experimented by fine-tuning LVMs with our proposed LVMed-R2 strategy, using IU-Xray and MIMIC-CXR datasets. Our results, measured on natural language generation (NLG) metrics and clinical efficacy (CE) metrics, demonstrate that LVMs fine-tuned with the proposed reflection mechanism possess the ability to correct outputs and complex reasoning effectively and improve LVMs performance for MRG.
zh

[NLP-66] SemEval-2025 Task 4: Unlearning sensitive content from Large Language Models

【速读】: 本文介绍了SemEval-2025任务4,旨在从大型语言模型 (Large Language Models, LLMs) 中去除敏感内容。研究聚焦于三个子任务,分别针对不同应用场景下的无学习(unlearning):(1) 去除涵盖多种体裁的长篇合成创意文档;(2) 去除包含个人可识别信息(Personally Identifiable Information, PII)的短篇合成传记,如虚假姓名、电话号码、社会安全号码(SSN)、电子邮件地址及家庭住址;(3) 去除从目标模型训练数据集中采样的真实文档。通过来自30多个机构的超过100份提交,本文总结了关键技术与经验。关键在于开发有效的机制以系统性地识别并移除模型中的敏感内容,同时确保模型在无学习过程中的性能稳定性和泛化能力。

链接: https://arxiv.org/abs/2504.02883
作者: Anil Ramakrishna,Yixin Wan,Xiaomeng Jin,Kai-Wei Chang,Zhiqi Bu,Bhanukiran Vinzamuri,Volkan Cevher,Mingyi Hong,Rahul Gupta
机构: Amazon AGI (亚马逊AGI); UCLA (加州大学洛杉矶分校); UIUC (伊利诺伊大学香槟分校); EPFL (瑞士联邦理工学院); University of Minnesota (明尼苏达大学)
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:We introduce SemEval-2025 Task 4: unlearning sensitive content from Large Language Models (LLMs). The task features 3 subtasks for LLM unlearning spanning different use cases: (1) unlearn long form synthetic creative documents spanning different genres; (2) unlearn short form synthetic biographies containing personally identifiable information (PII), including fake names, phone number, SSN, email and home addresses, and (3) unlearn real documents sampled from the target model’s training dataset. We received over 100 submissions from over 30 institutions and we summarize the key techniques and lessons in this paper.
zh
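作为背景补充:无学习文献中一种常见的基线思路是"保留集上最小化损失、遗忘集上做梯度上升"。以下仅为该组合目标的数值示意(λ 与损失值均为假设值,并非本任务的官方方法):

```python
# 示意:遗忘集梯度上升式的组合无学习目标(非 SemEval-2025 Task 4 指定方法)

def unlearning_objective(retain_nll: float, forget_nll: float, lam: float = 0.5) -> float:
    """retain_nll / forget_nll 为保留集与遗忘集上的负对数似然; 返回待最小化的总目标。
    减去 forget_nll 等价于在遗忘集上做梯度上升(推高其损失)。"""
    return retain_nll - lam * forget_nll

# 遗忘集损失越高(模型越"记不住"敏感内容), 总目标越小
print(unlearning_objective(1.0, 4.0))  # 1.0 - 0.5*4.0 = -1.0
```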

[NLP-67] DiaTool-DPO: Multi-Turn Direct Preference Optimization for Tool-Augmented Large Language Models

【速读】: 该论文旨在解决工具增强型大型语言模型(Tool-Augmented Large Language Models, TA-LLMs)在处理不完整查询和超出范围请求时面临的挑战。现有方法主要依赖于专家轨迹的监督微调(Supervised Fine-Tuning with expert trajectories),而论文提出了一种名为DiaTool-DPO的新方法,通过直接偏好优化(Direct Preference Optimization, DPO)增强TA-LLM的对话能力。关键在于将TA-LLM交互建模为具有5个不同对话状态的马尔可夫决策过程(Markov Decision Process),并根据状态转换轨迹将用户查询分为3类。此外,自动构建正确与错误对话流程的成对轨迹数据集,并引入专门的对话控制目标损失函数。实验结果表明,DiaTool-DPO在信息收集(94.8%)和工具调用拒绝(91%)方面接近GPT-4o的性能,同时显著优于基线方法(分别提升至44%和9.6%)。此方法无需额外的专家演示或人工标注即可开发出能够应对多样化实际场景的TA-LLMs。

链接: https://arxiv.org/abs/2504.02882
作者: Sunghee Jung,Donghun Lee,Shinbok Lee,Gaeun Seo,Daniel Lee,Byeongil Ko,Junrae Cho,Kihyun Kim,Eunggyun Kim,Myeongcheol Shin
机构: Kakao Corp. (卡卡奥公司)
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Tool-Augmented Large Language Models (TA-LLMs) have shown promise in real-world applications, but face challenges in handling incomplete queries and out-of-scope requests. While existing approaches rely mainly on Supervised Fine-Tuning with expert trajectories, we propose DiaTool-DPO, a novel method that enhances TA-LLM’s dialogue capabilities through Direct Preference Optimization. We model TA-LLM interactions as a Markov Decision Process with 5 distinct dialogue states and categorize user queries into 3 types based on their state transition trajectories. We automatically construct paired trajectory datasets of correct and incorrect dialogue flows and introduce a specialized objective loss for dialogue control. Our comprehensive evaluation demonstrates that DiaTool-DPO approaches GPT-4o’s performance (94.8% in information gathering, 91% in tool call rejection) with substantial improvements over baseline (44% and 9.6% respectively) while maintaining core functionality. Our approach opens new possibilities for developing TA-LLMs that can handle diverse real-world scenarios without requiring additional expert demonstrations or human labeling.
zh
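DiaTool-DPO 建立在 DPO 之上。作为参考,标准 DPO 损失对"正确轨迹 y_w / 错误轨迹 y_l"相对参考模型的对数概率差做 sigmoid 打分(下例数值为假设;论文在此之上还引入了专门的对话控制目标,此处未体现):

```python
import math

# 示意:成对轨迹上的标准 DPO 损失(DiaTool-DPO 在此基础上叠加对话控制目标)

def dpo_loss(logp_w_policy: float, logp_w_ref: float,
             logp_l_policy: float, logp_l_ref: float,
             beta: float = 0.1) -> float:
    """w = 正确对话轨迹, l = 错误对话轨迹; logp 为整条轨迹的对数概率。"""
    margin = beta * ((logp_w_policy - logp_w_ref) - (logp_l_policy - logp_l_ref))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))  # -log(sigmoid(margin))

# 策略模型相对参考模型更偏好正确轨迹时, 损失更小
better = dpo_loss(-10.0, -12.0, -15.0, -12.0)
worse = dpo_loss(-12.0, -12.0, -12.0, -12.0)
print(better < worse)  # True
```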

[NLP-68] Better Bill GPT: Comparing Large Language Models against Legal Invoice Reviewers

【速读】: 该论文试图解决法律发票审查过程中存在的高成本、不一致性和耗时过长的问题,传统方式依赖于法务运营人员、律师或计费专家逐行审核账单合规性。论文的关键解决方案是将大型语言模型(Large Language Models, LLMs)应用于法律发票审查任务,并通过实证研究验证其在准确性、速度和成本效益方面的表现。研究结果表明,LLMs在所有指标上均显著优于人类审查员,包括高达92%的审批决策准确率、最高81%的F-score分类性能,以及仅需3.6秒完成每张发票的审查速度,同时将处理成本降低至原来的0.03%。因此,论文强调了AI在法律支出管理中的变革作用,并指出未来的关键挑战是如何平衡自动化与人工判断的战略应用。

链接: https://arxiv.org/abs/2504.02881
作者: Nick Whitehouse,Nicole Lincoln,Stephanie Yiu,Lizzie Catterson,Rivindu Perera
机构: AI Center of Excellence, Onit Inc(AI卓越中心, Onit Inc)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Legal invoice review is a costly, inconsistent, and time-consuming process, traditionally performed by Legal Operations, Lawyers or Billing Specialists who scrutinise billing compliance line by line. This study presents the first empirical comparison of Large Language Models (LLMs) against human invoice reviewers - Early-Career Lawyers, Experienced Lawyers, and Legal Operations Professionals - assessing their accuracy, speed, and cost-effectiveness. Benchmarking state-of-the-art LLMs against a ground truth set by expert legal professionals, our empirically substantiated findings reveal that LLMs decisively outperform humans across every metric. In invoice approval decisions, LLMs achieve up to 92% accuracy, surpassing the 72% ceiling set by experienced lawyers. On a granular level, LLMs dominate line-item classification, with top models reaching F-scores of 81%, compared to just 43% for the best-performing human group. Speed comparisons are even more striking - while lawyers take 194 to 316 seconds per invoice, LLMs are capable of completing reviews in as fast as 3.6 seconds. And cost? AI slashes review expenses by 99.97%, reducing invoice processing costs from an average of $4.27 per invoice for human invoice reviewers to mere cents. These results highlight the evolving role of AI in legal spend management. As law firms and corporate legal departments struggle with inefficiencies, this study signals a seismic shift: The era of LLM-powered legal spend management is not on the horizon, it has arrived. The challenge ahead is not whether AI can perform as well as human reviewers, but how legal teams will strategically incorporate it, balancing automation with human discretion.
zh

[NLP-69] Revisiting Funnel Transformers for Modern LLM Architectures with Comprehensive Ablations in Training and Inference Configurations

【速读】: 该论文旨在解决基于 Transformer 的大型语言模型因计算成本高且快速迭代而导致早期提出的优化技术无法有效提升现代模型性能的问题。论文以 Dai 和 Le 提出的 Funnel Transformer 为基础,研究其在当代 Gemma2 Transformer 架构中的信息瓶颈效应及其影响。关键在于通过系统性评估不同 funnel 配置与恢复方法,发现通过精心选择 funnel 层并采用有效的恢复策略,可以显著减轻性能损失,实现高达 44% 的延迟降低,同时平衡计算效率与模型准确性之间的权衡,为大规模自然语言应用中部署 funnel 基础方法提供实践指导。

链接: https://arxiv.org/abs/2504.02877
作者: DongHyun Choi,Lucas Spangher,Chris Hidey,Peter Grabowski,Ramy Eskander
机构: Google(谷歌); Massachusetts Institute of Technology(麻省理工学院); University of California, Berkeley(加州大学伯克利分校)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Transformer-based Large Language Models, which suffer from high computational costs, advance so quickly that techniques proposed to streamline earlier iterations are not guaranteed to benefit more modern models. Building upon the Funnel Transformer proposed by Dai and Le (2020), which progressively compresses intermediate representations, we investigate the impact of funneling in contemporary Gemma2 Transformer architectures. We systematically evaluate various funnel configurations and recovery methods, comparing: (1) standard pretraining to funnel-aware pretraining strategies, (2) the impact of funnel-aware fine-tuning, and (3) the type of sequence recovery operation. Our results demonstrate that funneling creates information bottlenecks that propagate through deeper network layers, particularly in larger models (e.g., Gemma 7B), leading to at times unmanageable performance loss. However, carefully selecting the funneling layer and employing effective recovery strategies can substantially mitigate performance losses, achieving up to a 44% reduction in latency. Our findings highlight key trade-offs between computational efficiency and model accuracy, providing practical guidance for deploying funnel-based approaches in large-scale natural language applications.
zh
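funnel 的核心操作是逐步压缩中间序列表示,并在需要 token 级输出时做序列恢复。下面用"相邻均值池化 + 重复上采样"给出一个与具体模型无关的玩具示意(并非 Gemma2 或论文的实际实现):

```python
# 示意:funnel 式序列压缩(相邻两 token 均值池化)与最朴素的序列恢复(重复上采样)
# 仅为概念演示, 与论文及 Gemma2 的真实实现无关

def funnel_pool(hidden: list) -> list:
    """将序列长度减半: 相邻两个隐向量取均值。"""
    pooled = []
    for i in range(0, len(hidden) - 1, 2):
        a, b = hidden[i], hidden[i + 1]
        pooled.append([(x + y) / 2 for x, y in zip(a, b)])
    return pooled

def recover(pooled: list, target_len: int) -> list:
    """最简单的恢复操作: 每个池化向量重复两次, 截断到目标长度。"""
    out = []
    for vec in pooled:
        out.extend([vec, vec])
    return out[:target_len]

seq = [[1.0, 0.0], [3.0, 0.0], [5.0, 2.0], [7.0, 2.0]]
pooled = funnel_pool(seq)      # 长度 4 -> 2, 这就是"信息瓶颈"所在
restored = recover(pooled, 4)  # 恢复到原长度
print(pooled)  # [[2.0, 0.0], [6.0, 2.0]]
```

池化后无法无损还原原序列,这正是文中"信息瓶颈会向更深层传播"的直观来源;恢复策略的好坏决定了性能损失能否被缓解。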

[NLP-70] TheBlueScrubs-v1: a comprehensive curated medical dataset derived from the internet

【速读】: 该论文试图解决临床大型语言模型(Clinical Large Language Models, cLLMs)训练数据集规模不足且多样性有限的问题。目前公开可用的资源如PubMed虽然提供了基础医学文献,但其覆盖范围过于狭窄,无法充分满足全面医学应用的需求。论文的关键解决方案在于引入了一个名为TheBlueScrubs-v1的数据集,该数据集包含超过250亿个医学标记,规模接近PubMed的三倍,并从广泛的互联网语料库中提取。为了确保数据质量,研究团队设计了一个两阶段过滤流程:第一阶段使用逻辑回归模型对文档进行筛选,在外部验证中达到约0.95的AUC;第二阶段通过参数量为70B的Llama 3.1指令模型进行验证。每个文本被分配三个基于大型语言模型的质量评分,涵盖医学相关性、精确性和事实细节以及安全与伦理标准。此外,还通过专门的癌症分类器标注了约110亿个肿瘤学标记。最终,该数据集在多个任务中的表现证明了其对医学AI研究的潜在价值。

链接: https://arxiv.org/abs/2504.02874
作者: Luis Felipe,Carlos Garcia,Issam El Naqa,Monique Shotande,Aakash Tripathi,Vivek Rudrapatna,Ghulam Rasool,Danielle Bitterman,Gilmer Valdes
机构: Machine Learning Department, Moffitt Cancer Center (机器学习系, 莫菲特癌症中心), Tampa, Florida.; Center for Real World Evidence, UCSF (真实世界证据中心, 加州大学旧金山分校), San Francisco, California.; The Blue Scrubs (蓝条衫), Tampa, Florida.; Harvard Medical School (哈佛医学院), Boston, Massachusetts
类目: Computation and Language (cs.CL)
备注: 22 pages, 8 figures, 10 tables

点击查看摘要

Abstract:The need for robust and diverse data sets to train clinical large language models (cLLMs) is critical given that currently available public repositories often prove too limited in size or scope for comprehensive medical use. While resources like PubMed provide foundational medical literature, they capture only a narrow range of formal publications and omit the broader medical discourse on the internet. To address these deficits, we introduce TheBlueScrubs-v1, a curated dataset of over 25 billion medical tokens - nearly three times larger than PubMed - drawn from a broad-scale internet corpus. Our two-stage filtering pipeline employs a Logistic Regression model for document screening (achieving an AUC of approximately 0.95 on external validation), followed by verification via a 70B-parameter Llama 3.1 instruct model. Each text is assigned three LLM-based quality scores encompassing medical relevance, precision and factual detail, and safety and ethical standards. Clinician reviews confirm high concordance with these automated evaluations, and a specialized cancer classifier further labels approximately 11 billion oncology tokens. Two demonstration tasks highlight the dataset’s practical value: first, we distill the safety evaluations to a smaller BERT-style model that reaches an AUC near 0.96 on unseen data; second, we fine-tune a compact LLM on a filtered subset, showing measurable improvements over standard baselines in medical benchmarks as well as private ones. This Data Descriptor details the dataset’s creation and validation, underscoring its potential utility for medical AI research.
zh
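文中两阶段过滤流程的骨架可以概括为:先用轻量分类器(逻辑回归)粗筛,再把通过者交给大模型复核。以下为纯 Python 示意(特征词表与权重均为演示用假设值,并非论文的实际模型):

```python
import math

# 示意:两阶段过滤骨架 — 第一阶段逻辑回归打分, 第二阶段 LLM 复核
# 词表与权重为演示假设, 非 TheBlueScrubs-v1 的真实参数

WEIGHTS = {"patient": 1.2, "diagnosis": 1.5, "treatment": 1.1, "football": -2.0}
BIAS = -1.0

def stage1_score(text: str) -> float:
    """基于词频特征的逻辑回归打分, 返回医学相关概率。"""
    tokens = text.lower().split()
    z = BIAS + sum(WEIGHTS.get(t, 0.0) * tokens.count(t) for t in set(tokens))
    return 1.0 / (1.0 + math.exp(-z))

def two_stage_filter(text: str, llm_verify, threshold: float = 0.5) -> bool:
    """先粗筛(逻辑回归), 再由 LLM 复核; llm_verify 为外部传入的复核函数。"""
    if stage1_score(text) < threshold:
        return False
    return llm_verify(text)

keep = two_stage_filter("patient diagnosis and treatment plan", lambda t: True)
drop = two_stage_filter("football match highlights", lambda t: True)
print(keep, drop)  # True False
```

这种"便宜的筛子在前、昂贵的模型在后"的级联设计,正是该流水线能以可接受成本处理互联网规模语料的原因。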

[NLP-71] Short-PHD: Detecting Short LLM-generated Text with Topological Data Analysis After Off-topic Content Insertion

【速读】: 该论文旨在解决有效检测大规模语言模型(Large Language Models, LLMs)生成的短文本的问题。现有基于拓扑数据分析的方法虽然通过文本嵌入的持久同调维度(Persistent Homology Dimension, PHD)提供了一种更稳健的零样本检测方法,但在短文本检测方面仍存在挑战。论文提出的关键解决方案是Short-PHD方法,它通过在输入文本前插入无关主题内容来稳定先前PHD方法对短文本的估计,并依据预设的检测阈值识别LLM生成的文本。实验结果表明,Short-PHD在短LLM生成文本检测任务中优于现有的零样本方法。

链接: https://arxiv.org/abs/2504.02873
作者: Dongjun Wei,Minjia Mao,Xiao Fang,Michael Chau
机构: The University of Hong Kong (香港大学); University of Delaware (特拉华大学)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:The malicious usage of large language models (LLMs) has motivated the detection of LLM-generated texts. Previous work in topological data analysis shows that the persistent homology dimension (PHD) of text embeddings can serve as a more robust and promising score than other zero-shot methods. However, effectively detecting short LLM-generated texts remains a challenge. This paper presents Short-PHD, a zero-shot LLM-generated text detection method tailored for short texts. Short-PHD stabilizes the estimation of the previous PHD method for short texts by inserting off-topic content before the given input text and identifies LLM-generated text based on an established detection threshold. Experimental results on both public and generated datasets demonstrate that Short-PHD outperforms existing zero-shot methods in short LLM-generated text detection. Implementation codes are available online.
zh
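Short-PHD 的流程可概括为:在输入前拼接无关主题文本以稳定 PHD 估计,再与预设阈值比较。以下骨架把 phd_estimator 作为参数传入(玩具估计器仅演示调用方式;"PHD 偏低视为机器生成"的判定方向参考了相关工作的结论,实际阈值与方向应以论文为准):

```python
# 示意:Short-PHD 检测流程骨架 — 前置无关主题文本 + 阈值判定
# 真实实现中 phd_estimator 基于文本嵌入的持久同调维度(PHD), 此处仅给出接口

OFF_TOPIC_PREFIX = (
    "The weather in the mountains changes quickly in spring. "
    "Hikers should always carry extra layers and water. "
)

def short_phd_detect(text: str, phd_estimator, threshold: float) -> bool:
    """返回 True 表示判定为 LLM 生成文本(假设其 PHD 低于阈值)。"""
    stabilized = OFF_TOPIC_PREFIX + text  # 前置无关内容, 稳定短文本的 PHD 估计
    return phd_estimator(stabilized) < threshold

# 玩具估计器, 仅演示调用方式(真实场景应替换为基于嵌入的 PHD 估计)
toy_phd = lambda t: 9.0 if "human" in t else 6.0
print(short_phd_detect("this looks machine written", toy_phd, threshold=8.0))  # True
```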

[NLP-72] Scraping the Shadows: Deep Learning Breakthroughs in Dark Web Intelligence

【速读】: 该论文旨在解决手动从暗网市场(Darknet Markets, DNMs)提取数据存在的错误率高且耗时的问题。论文的关键解决方案是开发了一个自动化数据提取框架,并评估了三种最先进的命名实体识别(Named Entity Recognition, NER)模型(ELMo-BiLSTM、UniversalNER 和 GLiNER)在从暗网产品列表页面提取复杂实体方面的性能。研究通过构建一个新的标注数据集,用于训练、微调和评估这些模型。研究发现,最先进的 NER 模型在暗网信息抽取任务中表现出色,实现了 91% 的精确率(Precision)、96% 的召回率(Recall)和 94% 的 F1 分数,其中微调进一步提升了模型性能,UniversalNER 表现最佳。

链接: https://arxiv.org/abs/2504.02872
作者: Ingmar Bakermans,Daniel De Pascale,Gonçalo Marcelino,Giuseppe Cascavilla,Zeno Geradts
机构: Tilburg University (蒂尔堡大学); UvA (阿姆斯特丹大学); Eindhoven University of Technology (埃因霍温理工大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Information Retrieval (cs.IR)
备注: 17 pages, 17 images

点击查看摘要

Abstract:Darknet markets (DNMs) facilitate the trade of illegal goods on a global scale. Gathering data on DNMs is critical to ensuring law enforcement agencies can effectively combat crime. Manually extracting data from DNMs is an error-prone and time-consuming task. Aiming to automate this process we develop a framework for extracting data from DNMs and evaluate the application of three state-of-the-art Named Entity Recognition (NER) models, ELMo-BiLSTM (Shah et al., 2022), UniversalNER (Zhou et al., 2024), and GLiNER (Zaratiana et al., 2023), at the task of extracting complex entities from DNM product listing pages. We propose a new annotated dataset, which we use to train, fine-tune, and evaluate the models. Our findings show that state-of-the-art NER models perform well in information extraction from DNMs, achieving 91% Precision, 96% Recall, and an F1 score of 94%. In addition, fine-tuning enhances model performance, with UniversalNER achieving the best performance.
zh
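文中报告的 91% Precision / 96% Recall / 94% F1 通常按实体级集合匹配计算。以下为该口径的自包含示意(示例实体为虚构):

```python
# 示意:按实体集合(文本片段, 类型)计算 NER 的 Precision / Recall / F1

def ner_prf(gold: set, pred: set) -> tuple:
    tp = len(gold & pred)  # 完全匹配的实体数
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

gold = {("drugA", "PRODUCT"), ("0.5g", "QUANTITY"), ("EU", "ORIGIN")}
pred = {("drugA", "PRODUCT"), ("0.5g", "QUANTITY"), ("US", "ORIGIN")}
p, r, f = ner_prf(gold, pred)
print(round(p, 2), round(r, 2), round(f, 2))  # 0.67 0.67 0.67
```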

[NLP-73] Synthesized Annotation Guidelines are Knowledge-Lite Boosters for Clinical Information Extraction

【速读】: 该论文试图解决通过大型语言模型 (LLMs) 进行生成式信息抽取时,传统人工编写的标注指南在构建过程中耗时且难以复用的问题。同时,这些指南通常针对特定任务定制,缺乏通用性。论文的关键解决方案在于提出了一种自优化方法,利用LLMs的知识总结和文本生成能力来自动生成标注指南,几乎无需人工干预。实验结果表明,在多个生物医学领域的命名实体识别任务中,LLM生成的指南相比无指南基线提升了严格F1分数(最高达25.86%),并在大多数任务中表现出与人工编写指南相当甚至更优的性能(提升1.15%-4.14%)。

链接: https://arxiv.org/abs/2504.02871
作者: Enshuo Hsu,Martin Ugbala,Krishna Kumar Kookal,Zouaidi Kawtar,Nicholas L. Rider,Muhammad F. Walji,Kirk Roberts
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
备注:

点击查看摘要

Abstract:Generative information extraction using large language models, particularly through few-shot learning, has become a popular method. Recent studies indicate that providing a detailed, human-readable guideline-similar to the annotation guidelines traditionally used for training human annotators can significantly improve performance. However, constructing these guidelines is both labor- and knowledge-intensive. Additionally, the definitions are often tailored to meet specific needs, making them highly task-specific and often non-reusable. Handling these subtle differences requires considerable effort and attention to detail. In this study, we propose a self-improving method that harvests the knowledge summarization and text generation capacity of LLMs to synthesize annotation guidelines while requiring virtually no human input. Our zero-shot experiments on the clinical named entity recognition benchmarks, 2012 i2b2 EVENT, 2012 i2b2 TIMEX, 2014 i2b2, and 2018 n2c2 showed 25.86%, 4.36%, 0.20%, and 7.75% improvements in strict F1 scores from the no-guideline baseline. The LLM-synthesized guidelines showed equivalent or better performance compared to human-written guidelines by 1.15% to 4.14% in most tasks. In conclusion, this study proposes a novel LLM self-improving method that requires minimal knowledge and human input and is applicable to multiple biomedical domains.
zh

[NLP-74] AI Hiring with LLMs: A Context-Aware and Explainable Multi-Agent Framework for Resume Screening CVPR2025

【速读】: 该论文旨在解决人才招聘中简历筛选这一耗时且繁琐的问题,同时确保评估过程的客观性、准确性与公平性。随着大规模语言模型(Large Language Models, LLMs)的发展,论文提出了一种基于多智能体框架的解决方案,利用LLMs系统化处理和评估简历,以实现招聘流程的自动化和优化。方案的关键在于设计了一个包含简历提取器、评估器、摘要生成器和评分格式化器四个核心智能体的框架,并通过整合检索增强生成(Retrieval-Augmented Generation, RAG)技术增强评估器的上下文相关性,使其能够结合外部知识源(如行业专业知识、专业认证、大学排名及企业特定的招聘标准),从而实现个性化招聘,弥合AI自动化与人才获取之间的差距。

链接: https://arxiv.org/abs/2504.02870
作者: Frank P.-W. Lo,Jianing Qiu,Zeyu Wang,Haibao Yu,Yeming Chen,Gao Zhang,Benny Lo
机构: Imperial College London (帝国理工学院); The Chinese University of Hong Kong (香港中文大学); The University of Hong Kong (香港大学); Wedon Education Technologies (未知中文名称); Brest Business School (布雷斯特商学院)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Accepted by CVPR 2025 Workshop

点击查看摘要

Abstract:Resume screening is a critical yet time-intensive process in talent acquisition, requiring recruiters to analyze vast volume of job applications while remaining objective, accurate, and fair. With the advancements in Large Language Models (LLMs), their reasoning capabilities and extensive knowledge bases demonstrate new opportunities to streamline and automate recruitment workflows. In this work, we propose a multi-agent framework for resume screening using LLMs to systematically process and evaluate resumes. The framework consists of four core agents, including a resume extractor, an evaluator, a summarizer, and a score formatter. To enhance the contextual relevance of candidate assessments, we integrate Retrieval-Augmented Generation (RAG) within the resume evaluator, allowing incorporation of external knowledge sources, such as industry-specific expertise, professional certifications, university rankings, and company-specific hiring criteria. This dynamic adaptation enables personalized recruitment, bridging the gap between AI automation and talent acquisition. We assess the effectiveness of our approach by comparing AI-generated scores with ratings provided by HR professionals on a dataset of anonymized online resumes. The findings highlight the potential of multi-agent RAG-LLM systems in automating resume screening, enabling more efficient and scalable hiring workflows.
zh
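论文的四个智能体(提取器、评估器、摘要生成器、评分格式化器)可以理解为一条串联流水线。以下为极简骨架示意(函数逻辑、技能词表均为笔者假设;真实系统中每个智能体由 LLM 驱动,评估器还结合 RAG 检索外部招聘标准):

```python
# 示意:四智能体简历筛选流水线骨架(规则与词表为演示假设, 非论文实现)

def extractor(resume_text: str) -> dict:
    """简历提取器: 把原始文本解析为结构化字段(此处用极简词表规则代替 LLM)。"""
    return {"skills": [w for w in resume_text.lower().split() if w in {"python", "sql"}]}

def evaluator(profile: dict, retrieved_criteria: list) -> float:
    """评估器: 结合 RAG 检索到的招聘标准打分(演示: 技能命中率)。"""
    hits = sum(1 for c in retrieved_criteria if c in profile["skills"])
    return hits / len(retrieved_criteria)

def summarizer(profile: dict, score: float) -> str:
    """摘要生成器: 生成一句话评估摘要。"""
    return f"skills={','.join(profile['skills'])}; fit={score:.0%}"

def score_formatter(score: float) -> int:
    """评分格式化器: 归一化为 0-100 整数分。"""
    return round(score * 100)

resume = "Experienced analyst with Python and SQL"
profile = extractor(resume)
score = evaluator(profile, retrieved_criteria=["python", "sql", "spark"])
print(score_formatter(score), "|", summarizer(profile, score))  # 67 | skills=python,sql; fit=67%
```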

[NLP-75] Multi-Agent LLM Judge: automatic personalized LLM judge design for evaluating natural language generation applications

【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)在实际应用中面临的评估难题,具体包括现有评估方法无法充分捕捉动态开放文本生成所需的细微语义信息,以及基于LLM的评估框架(如LLM-as-a-judge)存在的适应性不足和评分结果难以解释的问题。论文的关键在于提出了一种新颖的动态多智能体系统,通过自动设计个性化的LLM评估者,实现了对不同自然语言生成任务的灵活适配,并在保持与人类感知一致性的前提下优化了评估提示的设计。实验结果显示,该多智能体LLM评估框架不仅提升了评估准确性,还显著改善了评分结果与人类判断的一致性。

链接: https://arxiv.org/abs/2504.02867
作者: Hongliu Cao,Ilias Driouich,Robin Singh,Eoin Thomas
机构: Amadeus SAS
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Presented at SophiaSummit2024

点击查看摘要

Abstract:Large Language Models (LLMs) have demonstrated impressive performance across diverse domains, yet they still encounter challenges such as insufficient domain-specific knowledge, biases, and hallucinations. This underscores the need for robust evaluation methodologies to accurately assess LLM-based applications. Traditional evaluation methods, which rely on word overlap or text embeddings, are inadequate for capturing the nuanced semantic information necessary to evaluate dynamic, open-ended text generation. Recent research has explored leveraging LLMs to mimic human reasoning and decision-making processes for evaluation purposes known as LLM-as-a-judge framework. However, these existing frameworks have two significant limitations. First, they lack the flexibility to adapt to different text styles, including various answer and ground truth styles, thereby reducing their generalization performance. Second, the evaluation scores produced by these frameworks are often skewed and hard to interpret, showing a low correlation with human judgment. To address these challenges, we propose a novel dynamic multi-agent system that automatically designs personalized LLM judges for various natural language generation applications. This system iteratively refines evaluation prompts and balances the trade-off between the adaptive requirements of downstream tasks and the alignment with human perception. Our experimental results show that the proposed multi-agent LLM Judge framework not only enhances evaluation accuracy compared to existing methods but also produces evaluation scores that better align with human perception.
zh
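该系统"迭代优化评估提示并对齐人类感知"的核心循环,可简化为:在候选评审提示中挑出与人工评分相关性最高者。以下为纯 Python 示意(候选提示与评分均为虚构;真实实现中 judge 由 LLM 对样本打分):

```python
# 示意:评审提示选择 — 以与人工评分的皮尔逊相关系数为对齐度量
# 候选提示与评分为演示假设; 真实系统中 judge(prompt) 会调用 LLM

def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy) if sx and sy else 0.0

def select_judge_prompt(prompts, human_scores, judge):
    """judge(prompt) 返回该提示下模型对同一批样本给出的评分序列。"""
    return max(prompts, key=lambda p: pearson(judge(p), human_scores))

human = [1.0, 4.0, 5.0, 2.0]
model_scores = {
    "strict rubric": [1.5, 3.5, 5.0, 2.5],  # 与人工评分走势一致
    "loose rubric": [5.0, 5.0, 5.0, 5.0],   # 全部打满分, 无区分度
}
best = select_judge_prompt(list(model_scores), human, judge=lambda p: model_scores[p])
print(best)  # strict rubric
```

多智能体系统在此基础上还会迭代改写提示、在下游任务适配与人类感知对齐之间做权衡,而非一次性择优。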

[NLP-76] The Illusionist's Prompt: Exposing the Factual Vulnerabilities of Large Language Models with Linguistic Nuances

【速读】: 该论文试图解决大型语言模型(Large Language Models, LLMs)在面对恶意构造查询时,其事实性保证不足的问题。尽管已有研究关注如何减少正式用户查询中的幻觉现象,但这些工作主要局限于非恶意场景。论文的关键解决方案是引入了一种名为“The Illusionist’s Prompt”的新型幻觉攻击方法,该方法通过融入语言学细微差别来构造对抗性查询,挑战五种增强事实性的策略。其核心在于自动生成高度可迁移的幻觉提示,以诱导内部事实性错误,同时保持用户意图和语义完整性,从而有效削弱包括商用API(如GPT-4o和Gemini-2.0)在内的闭源LLMs的事实准确性,即使在多种防御机制下亦如此。

链接: https://arxiv.org/abs/2504.02865
作者: Yining Wang,Yuquan Wang,Xi Li,Mi Zhang,Geng Hong,Min Yang
机构: Fudan University (复旦大学)
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: work in progress

点击查看摘要

Abstract:As Large Language Models (LLMs) continue to advance, they are increasingly relied upon as real-time sources of information by non-expert users. To ensure the factuality of the information they provide, much research has focused on mitigating hallucinations in LLM responses, but only in the context of formal user queries, rather than maliciously crafted ones. In this study, we introduce The Illusionist’s Prompt, a novel hallucination attack that incorporates linguistic nuances into adversarial queries, challenging the factual accuracy of LLMs against five types of fact-enhancing strategies. Our attack automatically generates highly transferrable illusory prompts to induce internal factual errors, all while preserving user intent and semantics. Extensive experiments confirm the effectiveness of our attack in compromising black-box LLMs, including commercial APIs like GPT-4o and Gemini-2.0, even with various defensive mechanisms.
zh

[NLP-77] The Material Contracts Corpus

【速读】: 该论文旨在构建并公开一个名为Material Contracts Corpus (MCC)的数据集,以解决合同设计与法律语言研究中数据稀缺的问题,并支持基于人工智能的法律工具开发。解决方案的关键在于利用机器学习(Machine Learning)和自然语言处理(Natural Language Processing, NLP)技术,特别是对LLaMA-2模型进行微调(Fine-tuning),实现合同的分类(Classification)以及与特定当事方的关联(Linking)。此外,MCC通过提供诸如提交表格、文档格式及修订状态等元数据(Metadata),进一步增强了数据集的价值。这一资源可实现大规模下载与在线访问,为相关领域的实证研究(Empirical Research)提供了重要支持。

链接: https://arxiv.org/abs/2504.02864
作者: Peter Adelson,Julian Nyarko
机构: Stanford Graduate School of Business (斯坦福商学院) and Stanford Law School (斯坦福法学院); Stanford Law School (斯坦福法学院)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:This paper introduces the Material Contracts Corpus (MCC), a publicly available dataset comprising over one million contracts filed by public companies with the U.S. Securities and Exchange Commission (SEC) between 2000 and 2023. The MCC facilitates empirical research on contract design and legal language, and supports the development of AI-based legal tools. Contracts in the corpus are categorized by agreement type and linked to specific parties using machine learning and natural language processing techniques, including a fine-tuned LLaMA-2 model for contract classification. The MCC further provides metadata such as filing form, document format, and amendment status. We document trends in contractual language, length, and complexity over time, and highlight the dominance of employment and security agreements in SEC filings. This resource is available for bulk download and online access at this https URL.
zh

[NLP-78] GS_DravidianLangTech@2025: Women Targeted Abusive Texts Detection on Social Media

【速读】: 该论文旨在解决社交平台上针对女性的恶意文本滥用问题,定义中的“恶意言语”指意图伤害或煽动仇恨的交流行为,尤其聚焦于识别针对女性的侮辱性语言。为实现这一目标,论文的关键解决方案是采用逻辑回归(Logistic Regression)和BERT作为基础模型,并利用来自DravidianLangTech@2025的数据集进行训练。最终,在Tamil和Malayalam语言上的测试结果显示,BERT模型取得了0.729的宏F1分数,而逻辑回归模型为0.6279,表明基于预训练语言模型的方法在该任务中更为有效。

链接: https://arxiv.org/abs/2504.02863
作者: Girma Yohannis Bade,Zahra Ahani,Olga Kolesnikova,José Luis Oropeza,Grigori Sidorov
机构: 未知
类目: Computation and Language (cs.CL); Social and Information Networks (cs.SI)
备注:

点击查看摘要

Abstract:The increasing misuse of social media has become a concern; however, technological solutions are being developed to moderate its content effectively. This paper focuses on detecting abusive texts targeting women on social media platforms. Abusive speech refers to communication intended to harm or incite hatred against vulnerable individuals or groups. Specifically, this study aims to identify abusive language directed toward women. To achieve this, we utilized logistic regression and BERT as base models to train datasets sourced from DravidianLangTech@2025 for Tamil and Malayalam languages. The models were evaluated on test datasets, resulting in a 0.729 macro F1 score for BERT and 0.6279 for logistic regression in Tamil and Malayalam, respectively.
zh
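文中的宏 F1(macro F1)是对每个类别分别计算 F1 后取未加权平均,使少数类与多数类权重相同。下面是一个纯 Python 的计算示意(两类的混淆计数为假设的演示值):

```python
# 示意:macro F1 — 逐类别计算 F1 后做未加权平均(混淆计数为演示假设)

def f1(tp: int, fp: int, fn: int) -> float:
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    return 2 * p * r / (p + r) if p + r else 0.0

def macro_f1(per_class_counts: list) -> float:
    """per_class_counts: 每个类别的 (TP, FP, FN) 三元组列表。"""
    scores = [f1(*c) for c in per_class_counts]
    return sum(scores) / len(scores)

# 两类(辱骂 / 非辱骂)的 (TP, FP, FN), 数值为演示假设
counts = [(40, 10, 10), (90, 10, 10)]
print(round(macro_f1(counts), 2))  # 0.85
```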

[NLP-79] Optimizing Humor Generation in Large Language Models: Temperature Configurations and Architectural Trade-offs

【速读】: 该论文试图解决大型语言模型(Large Language Models, LLMs)在生成技术相关幽默方面的系统性评估不足的问题。解决方案的关键在于通过全面分析五种架构家族中的13个最先进的LLMs,采用严谨的统计方法(如ANOVA、相关性研究和二次回归),评估模型在不同温度设置和提示变化下的表现,并基于五个加权标准(幽默质量、领域相关性、概念原创性、语气精确性和表达效率)进行综合评价。研究揭示了模型架构对性能差异的重要影响,强调了温度调整和架构选择对生成幽默效果的关键作用,并提出了实用的模型选择与配置指南。

链接: https://arxiv.org/abs/2504.02858
作者: Evgenii Evstafev
机构: 未知
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: 10 pages, 4 figures

点击查看摘要

Abstract:Large language models (LLMs) demonstrate increasing capabilities in creative text generation, yet systematic evaluations of their humor production remain underexplored. This study presents a comprehensive analysis of 13 state-of-the-art LLMs across five architectural families, evaluating their performance in generating technically relevant humor for software developers. Through a full factorial design testing 715 unique configurations of temperature settings and prompt variations, we assess model outputs using five weighted criteria: humor quality, domain relevance, concept originality, tone precision, and delivery efficiency. Our methodology employs rigorous statistical analysis including ANOVA, correlation studies, and quadratic regression to identify optimal configurations and architectural influences. Results reveal significant performance variations across models, with certain architectures achieving 21.8% superiority over baseline systems. Temperature sensitivity analysis demonstrates that 73% of models achieve peak performance at lower stochasticity settings (≤ 0.5), though optimal ranges vary substantially by architecture. We identify distinct model clusters: compact high-performers maintaining efficiency-quality balance versus verbose specialists requiring longer outputs for marginal gains. Statistical validation confirms model architecture explains 38.7% of performance variance, with significant correlations between humor quality and concept originality. The study establishes practical guidelines for model selection and configuration, demonstrating how temperature adjustments and architectural considerations impact humor generation effectiveness. These findings advance understanding of LLM capabilities in creative technical writing and provide empirically validated configuration strategies for developers implementing humor-generation systems.
zh
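The weighted multi-criteria evaluation described in the abstract can be sketched as a simple aggregation. The five criterion names come from the paper; the weights, the 0-10 scores, and the two temperature configurations below are hypothetical placeholders, not values reported by the authors.

```python
# Sketch of weighted multi-criteria scoring for humor outputs.
# Weights and scores are illustrative assumptions, not the paper's values.

CRITERIA_WEIGHTS = {
    "humor_quality": 0.30,
    "domain_relevance": 0.25,
    "concept_originality": 0.20,
    "tone_precision": 0.15,
    "delivery_efficiency": 0.10,
}

def weighted_score(scores: dict) -> float:
    """Combine per-criterion scores (0-10 scale) into one weighted score."""
    return sum(CRITERIA_WEIGHTS[name] * scores[name] for name in CRITERIA_WEIGHTS)

def best_configuration(results: dict) -> str:
    """Pick the configuration (e.g. a temperature setting) with the top score."""
    return max(results, key=lambda cfg: weighted_score(results[cfg]))

# Two hypothetical configurations evaluated on the same joke prompt.
results = {
    "temp_0.3": {"humor_quality": 7, "domain_relevance": 9,
                 "concept_originality": 6, "tone_precision": 8,
                 "delivery_efficiency": 9},
    "temp_1.0": {"humor_quality": 8, "domain_relevance": 6,
                 "concept_originality": 9, "tone_precision": 5,
                 "delivery_efficiency": 6},
}
```

A full factorial study would repeat this scoring over every temperature-prompt cell before running the ANOVA the abstract describes.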

[NLP-80] Mapping Technological Futures: Anticipatory Discourse Through Text Mining

【速读】: 该论文旨在研究围绕新兴技术未来所形成的前瞻性话语(Anticipatory Discourse),特别是在人工智能等技术的不确定性背景下,社会媒体上的讨论如何塑造人们对未来的预期。论文通过分析X平台(即推特)上400位关键意见领袖(Key Opinion Leaders, KOLs)在2021年至2023年间发布的150万条帖子,利用包括BERTopic建模、情感、情绪及态度分析在内的先进文本挖掘技术,识别出100个反映技术驱动未来预期的独特主题。论文的关键在于揭示KOLs在构建当前对未来(\textit{present futures})的乐观愿景与影响未来对当下的认知(\textit{future presents})中的双重作用,并强调其在引导公众注意力方面的核心地位,特别是在技术引发的高度不确定时期。通过将技术描绘为解决社会挑战的方案,KOLs作为社会叙事的调解者,连接了想象中的未来与现实世界,从而深化了我们对技术中介背景下前瞻性话语的理解。

链接: https://arxiv.org/abs/2504.02853
作者: Maciej Skorski,Alina Landowska,Krzysztof Rajda
机构: Czech Technical University Prague(Czech Technical University in Prague); SWPS University(SWPS University of Social Sciences and Humanities); Brand24(Brand24)
类目: Social and Information Networks (cs.SI); Computation and Language (cs.CL); Computers and Society (cs.CY)
备注: Accepted to Humanities and Social Sciences Communications. arXiv admin note: text overlap with arXiv:2407.17522

点击查看摘要

Abstract:The volatility and unpredictability of emerging technologies, such as artificial intelligence (AI), generate significant uncertainty, which is widely discussed on social media. This study examines anticipatory discourse surrounding technological futures by analysing 1.5 million posts from 400 key opinion leaders (KOLs) published on the X platform (from 2021 to 2023). Using advanced text mining techniques, including BERTopic modelling, sentiment, emotion, and attitude analyses, the research identifies 100 distinct topics reflecting anticipated tech-driven futures. Our findings emphasize the dual role of KOLs in framing \textit{present futures} – optimistic visions of transformative technologies like AI and IoT – and influencing \textit{future presents}, where these projections shape contemporary societal and geopolitical debates. Positive emotions such as Hope dominate, outweighing Anxiety, particularly in topics like "Machine Learning, Data Science, and Deep Learning," while discussions around "Climate Change" and "War, Ukraine, and Trump People" elicit \textit{Anxiety}. By framing technologies as solutions to societal challenges, KOLs act as mediators of societal narratives, bridging imagined futures and current realities. These insights underscore their pivotal role in directing public attention with emerging technologies during periods of heightened uncertainty, advancing our understanding of anticipatory discourse in technology-mediated contexts.
zh

计算机视觉

[CV-0] MME-Unify: A Comprehensive Benchmark for Unified Multimodal Understanding and Generation Models

【速读】:该论文旨在解决现有多模态大型语言模型(Unified Multimodal Large Language Models, U-MLLMs)基准测试中存在的两大问题:1) 缺乏标准化的传统任务评估基准,导致研究结果难以进行一致比较;2) 缺少针对混合模态生成任务的基准,无法有效评估多模态推理能力。论文的关键解决方案是提出了一套全面的评估框架,包含三个核心部分:1) 标准化传统任务评估,涵盖12个数据集、10类任务及30种子任务,以确保跨研究的一致性和公平性;2) 引入五项新型混合模态推理任务,如图像编辑、带图像生成的常识问答以及几何推理等,用于测试多模态推理能力;3) 对12种领先的U-MLLMs及专门化的理解与生成模型进行全面评估。通过这些措施,论文揭示了现有U-MLLMs在处理混合模态任务中的性能差距,并强调了开发更稳健模型的需求。

链接: https://arxiv.org/abs/2504.03641
作者: Wulin Xie,Yi-Fan Zhang,Chaoyou Fu,Yang Shi,Bingyan Nie,Hongkai Chen,Zhang Zhang,Liang Wang,Tieniu Tan
机构: CASIA(中科院自动化研究所); NJU(南京大学); PKU(北京大学); Vivo; M-M-E(多模态基础与应用研究中心)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project page: this https URL

点击查看摘要

Abstract:Existing MLLM benchmarks face significant challenges in evaluating Unified MLLMs (U-MLLMs) due to: 1) lack of standardized benchmarks for traditional tasks, leading to inconsistent comparisons; 2) absence of benchmarks for mixed-modality generation, which fails to assess multimodal reasoning capabilities. We present a comprehensive evaluation framework designed to systematically assess U-MLLMs. Our benchmark includes: 1. Standardized Traditional Task Evaluation. We sample from 12 datasets, covering 10 tasks with 30 subtasks, ensuring consistent and fair comparisons across studies. 2. Unified Task Assessment. We introduce five novel tasks testing multimodal reasoning, including image editing, commonsense QA with image generation, and geometric reasoning. 3. Comprehensive Model Benchmarking. We evaluate 12 leading U-MLLMs, such as Janus-Pro, EMU3, VILA-U, and Gemini2-flash, alongside specialized understanding (e.g., Claude-3.5-Sonnet) and generation models (e.g., DALL-E-3). Our findings reveal substantial performance gaps in existing U-MLLMs, highlighting the need for more robust models capable of handling mixed-modality tasks effectively. The code and evaluation data can be found in this https URL.
zh

[CV-1] Shape My Moves: Text-Driven Shape-Aware Synthesis of Human Motions CVPR2025

【速读】:该论文试图解决现有文本到运动生成方法中忽视身体形状对运动合成影响的问题。由于这些方法倾向于学习统一的标准身体形状,从而忽略了不同身体形状与运动动力学之间的自然关联,可能导致运动失真。为了解决这一问题,论文提出了一种基于自然语言提示生成考虑身体形状的人体运动的方法。其关键在于利用基于有限标量量化(Finite Scalar Quantization, FSQ)的变分自编码器(Variational Autoencoder, VAE)将运动量化为离散标记,并结合连续的身体形状信息将这些标记解量化为连续且详细的运动。此外,通过预训练的语言模型预测连续形状参数和运动标记,实现了与文本对齐的运动合成及解码为考虑身体形状的运动。

链接: https://arxiv.org/abs/2504.03639
作者: Ting-Hsuan Liao,Yi Zhou,Yu Shen,Chun-Hao Paul Huang,Saayan Mitra,Jia-Bin Huang,Uttaran Bhattacharya
机构: University of Maryland, College Park (马里兰大学帕克分校); Adobe Research (Adobe 研究院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: CVPR 2025. Project page: this https URL

点击查看摘要

Abstract:We explore how body shapes influence human motion synthesis, an aspect often overlooked in existing text-to-motion generation methods due to the ease of learning a homogenized, canonical body shape. However, this homogenization can distort the natural correlations between different body shapes and their motion dynamics. Our method addresses this gap by generating body-shape-aware human motions from natural language prompts. We utilize a finite scalar quantization-based variational autoencoder (FSQ-VAE) to quantize motion into discrete tokens and then leverage continuous body shape information to de-quantize these tokens back into continuous, detailed motion. Additionally, we harness the capabilities of a pretrained language model to predict both continuous shape parameters and motion tokens, facilitating the synthesis of text-aligned motions and decoding them into shape-aware motions. We evaluate our method quantitatively and qualitatively, and also conduct a comprehensive perceptual study to demonstrate its efficacy in generating shape-aware motions.
zh
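The finite scalar quantization (FSQ) step at the heart of the FSQ-VAE above (bounding each latent dimension, then rounding it to a small fixed set of levels so a vector maps to a discrete token) can be sketched in a few lines. The level count L=5 and the base-L token encoding here are illustrative assumptions, not the paper's configuration.

```python
import numpy as np

# Minimal FSQ sketch: bound each latent dim, round to L discrete levels.
# L=5 and the base-L index scheme are illustrative, not from the paper.

L = 5  # quantization levels per latent dimension

def fsq_quantize(z: np.ndarray) -> np.ndarray:
    """Bound each latent dim to (-1, 1) with tanh, then round to L levels."""
    bounded = np.tanh(z)                      # values in (-1, 1)
    half = (L - 1) / 2
    return np.round(bounded * half) / half    # one of L values in [-1, 1]

def fsq_token_index(q: np.ndarray) -> int:
    """Map a quantized vector to a single integer token (base-L encoding)."""
    half = (L - 1) / 2
    digits = np.round(q * half + half).astype(int)  # each digit in {0..L-1}
    index = 0
    for d in digits:
        index = index * L + d
    return index
```

In training, the rounding step would be paired with a straight-through gradient estimator; the de-quantization direction in the paper additionally conditions on continuous body-shape parameters.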

[CV-2] An Algebraic Geometry Approach to Viewing Graph Solvability

【速读】:该论文旨在研究视角图(viewing graph)在运动恢复结构(Structure-from-Motion, SfM)中的可解性问题,即在何种条件下通过视角图能够唯一确定相机参数。论文的关键在于提出了一种基于代数几何(Algebraic Geometry)的新框架,用于分析此类可解性问题,并利用该框架证明了一个先前提出的猜想,从而深化了对运动恢复结构中视角图的理解。

链接: https://arxiv.org/abs/2504.03637
作者: Federica Arrigoni,Kathlén Kohn,Andrea Fusiello,Tomas Pajdla
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Algebraic Geometry (math.AG)
备注:

点击查看摘要

Abstract:The concept of viewing graph solvability has gained significant interest in the context of structure-from-motion. A viewing graph is a mathematical structure where nodes are associated to cameras and edges represent the epipolar geometry connecting overlapping views. Solvability studies under which conditions the cameras are uniquely determined by the graph. In this paper we propose a novel framework for analyzing solvability problems based on Algebraic Geometry, demonstrating its potential in understanding structure-from-motion graphs and proving a conjecture that was previously proposed.
zh

[CV-3] Quantifying the uncertainty of model-based synthetic image quality metrics

【速读】:该论文旨在解决合成图像质量评估中特征嵌入模型可信度不足的问题,特别是针对基于预训练辅助模型(如卷积自编码器)生成的Fréchet Autoencoder Distance (FAED) 类似指标的不确定性量化。论文的关键解决方案是引入不确定性量化(Uncertainty Quantification, UQ)方法,通过在特征嵌入模型上应用Monte Carlo Dropout来建模嵌入的不确定性,并利用输入样本嵌入分布的变异性来反映FAED计算结果的置信水平。这种方法通过预测嵌入的方差以及计算得到的FAED值的标准差表达不确定性,验证了其能够有效评估输入数据是否偏离模型训练数据分布的能力,从而提升评估指标的可信度。

链接: https://arxiv.org/abs/2504.03623
作者: Ciaran Bench,Spencer A. Thomas
机构: National Physical Laboratory (国家物理实验室)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:The quality of synthetically generated images (e.g. those produced by diffusion models) are often evaluated using information about image contents encoded by pretrained auxiliary models. For example, the Fréchet Inception Distance (FID) uses embeddings from an InceptionV3 model pretrained to classify ImageNet. The effectiveness of this feature embedding model has considerable impact on the trustworthiness of the calculated metric (affecting its suitability in several domains, including medical imaging). Here, uncertainty quantification (UQ) is used to provide a heuristic measure of the trustworthiness of the feature embedding model and an FID-like metric called the Fréchet Autoencoder Distance (FAED). We apply Monte Carlo dropout to a feature embedding model (convolutional autoencoder) to model the uncertainty in its embeddings. The distribution of embeddings for each input are then used to compute a distribution of FAED values. We express uncertainty as the predictive variance of the embeddings as well as the standard deviation of the computed FAED values. We find that their magnitude correlates with the extent to which the inputs are out-of-distribution to the model’s training data, providing some validation of its ability to assess the trustworthiness of the FAED.
zh
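The FAED construction above can be sketched as a Fréchet distance between Gaussians fit to two embedding sets, repeated over stochastic embedding draws (as Monte Carlo dropout would produce) to get a distribution of metric values. The sketch below assumes diagonal covariances to avoid a matrix square root, and uses synthetic Gaussian "embeddings"; it is illustrative, not the authors' implementation.

```python
import numpy as np

# FID/FAED-style Fréchet distance, diagonal-covariance simplification.
# Repeating it over dropout-perturbed embeddings gives the distribution
# whose spread the paper reads as an uncertainty signal.

def frechet_distance_diag(x: np.ndarray, y: np.ndarray) -> float:
    """Fréchet distance between Gaussians fit to x and y (diagonal Sigma)."""
    mu_x, mu_y = x.mean(axis=0), y.mean(axis=0)
    var_x, var_y = x.var(axis=0), y.var(axis=0)
    # ||mu_x - mu_y||^2 + Tr(Sx + Sy - 2 (Sx Sy)^{1/2}), diagonal case
    mean_term = np.sum((mu_x - mu_y) ** 2)
    cov_term = np.sum(var_x + var_y - 2.0 * np.sqrt(var_x * var_y))
    return float(mean_term + cov_term)

rng = np.random.default_rng(0)
real = rng.normal(0.0, 1.0, size=(2000, 8))
# Simulate several stochastic embedding draws of the "generated" set and
# collect the resulting distribution of distance values.
distances = [frechet_distance_diag(real,
                                   rng.normal(0.5, 1.0, size=(2000, 8)))
             for _ in range(5)]
spread = float(np.std(distances))  # larger spread = less trustworthy metric
```

The real FID/FAED uses full covariances (requiring a matrix square root) and learned feature embeddings rather than raw Gaussian samples.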

[CV-4] VISTA-OCR: Towards generative and interactive end to end OCR models

【速读】:该论文旨在解决传统光学字符识别(OCR)系统中需要独立分支处理文本检测与识别的问题,提出了一种轻量级的统一架构——VISTA-OCR(Vision and Spatially-aware Text Analysis OCR)。其关键在于利用Transformer解码器在一个统一的分支内顺序生成文本转录及其空间坐标,从而实现文本检测与识别的同时进行。通过逐步训练机制,从视觉特征提取到多任务模态令牌生成,该方法不仅提升了模型效率,还增强了其在复杂场景下的适应能力。此外,为了降低计算成本并支持更多交互式OCR应用,研究团队构建了一个包含真实世界样本及边界框标注数据的新数据集,并开发了VISTA-\textomni变体,使其能够在仅使用150M参数的情况下处理手写和印刷文档。实验结果表明,相较于当前最先进的专用模型,VISTA-OCR在标准OCR任务上表现更优,并展现出强大的潜力以应对更加复杂的OCR应用场景需求。

链接: https://arxiv.org/abs/2504.03621
作者: Laziz Hamdi,Amine Tamasna,Pascal Boisson,Thierry Paquet
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:We introduce \textbf{VISTA-OCR} (Vision and Spatially-aware Text Analysis OCR), a lightweight architecture that unifies text detection and recognition within a single generative model. Unlike conventional methods that require separate branches with dedicated parameters for text recognition and detection, our approach leverages a Transformer decoder to sequentially generate text transcriptions and their spatial coordinates in a unified branch. Built on an encoder-decoder architecture, VISTA-OCR is progressively trained, starting with the visual feature extraction phase, followed by multitask learning with multimodal token generation. To address the increasing demand for versatile OCR systems capable of advanced tasks, such as content-based text localization, we introduce new prompt-controllable OCR tasks during training. To enhance the model's capabilities, we built a new dataset composed of real-world examples enriched with bounding box annotations and synthetic samples. Although recent Vision Large Language Models (VLLMs) can efficiently perform these tasks, their high computational cost remains a barrier for practical deployment. In contrast, our VISTA-omni variant processes both handwritten and printed documents with only 150M parameters, interactively, by prompting. Extensive experiments on multiple datasets demonstrate that VISTA-OCR achieves better performance compared to state-of-the-art specialized models on standard OCR tasks while showing strong potential for more sophisticated OCR applications, addressing the growing need for interactive OCR systems. All code and annotations for VISTA-OCR will be made publicly available upon acceptance.
zh

[CV-5] Autonomous and Self-Adapting System for Synthetic Media Detection and Attribution

【速读】:该论文旨在解决现有合成媒体识别系统在面对新兴生成模型时性能严重退化的问题。传统系统通常依赖于从已知生成器学习到的特征表示,难以适应不断演化的生成模型生态。为应对这一挑战,论文提出了一种自主自适应的合成媒体识别系统(Autonomous Self-Adaptive Synthetic Media Identification System),其核心在于不仅能够检测和归因已知来源的合成图像,还能在无需人工干预的情况下自主识别和整合新型生成器。关键解决方案在于采用开放集识别策略(Open-Set Identification Strategy)结合可演化嵌入空间(Evolvable Embedding Space),通过无监督聚类方法将未知样本聚集成高置信度簇,并持续优化决策边界,从而实现对已知与未知来源的稳健区分能力,确保系统在生成模型快速发展的背景下仍能保持高性能检测与归因能力。

链接: https://arxiv.org/abs/2504.03615
作者: Aref Azizpour,Tai D. Nguyen,Matthew C. Stamm
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Rapid advances in generative AI have enabled the creation of highly realistic synthetic images, which, while beneficial in many domains, also pose serious risks in terms of disinformation, fraud, and other malicious applications. Current synthetic image identification systems are typically static, relying on feature representations learned from known generators; as new generative models emerge, these systems suffer from severe performance degradation. In this paper, we introduce the concept of an autonomous self-adaptive synthetic media identification system – one that not only detects synthetic images and attributes them to known sources but also autonomously identifies and incorporates novel generators without human intervention. Our approach leverages an open-set identification strategy with an evolvable embedding space that distinguishes between known and unknown sources. By employing an unsupervised clustering method to aggregate unknown samples into high-confidence clusters and continuously refining its decision boundaries, our system maintains robust detection and attribution performance even as the generative landscape evolves. Extensive experiments demonstrate that our method significantly outperforms existing approaches, marking a crucial step toward universal, adaptable forensic systems in the era of rapidly advancing generative models.
zh
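The open-set attribution idea above can be sketched as nearest-centroid matching with a rejection threshold: an embedding is attributed to the closest known-generator cluster, or flagged as an unknown source when it is too far from all of them. The centroids, threshold, and Euclidean metric below are illustrative placeholders, not the paper's learned embedding space.

```python
import numpy as np

# Open-set attribution sketch: nearest known centroid, or "unknown"
# beyond a distance threshold. All values are illustrative assumptions.

def attribute(embedding: np.ndarray, centroids: dict, threshold: float) -> str:
    """Return the name of the nearest known source, or 'unknown'."""
    names = list(centroids)
    dists = [np.linalg.norm(embedding - centroids[n]) for n in names]
    i = int(np.argmin(dists))
    return names[i] if dists[i] <= threshold else "unknown"

# Hypothetical centroids for two known generators in a 2-D embedding space.
centroids = {
    "gan_v1": np.array([0.0, 0.0]),
    "diffusion_v2": np.array([5.0, 5.0]),
}
```

In the full system, samples rejected as "unknown" would then be clustered; a high-confidence cluster becomes a new centroid, which is how the paper's pipeline incorporates novel generators without human intervention.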

[CV-6] Multimodal Diffusion Bridge with Attention-Based SAR Fusion for Satellite Image Cloud Removal

【速读】:该论文致力于解决光学卫星图像云层去除的挑战,特别是在与合成孔径雷达(Synthetic Aperture Radar, SAR)图像融合的应用场景下。当前基于扩散模型的方法虽然能够从无云分布中采样以提供高质量估计,但其从纯高斯噪声开始采样的方式增加了采样轨迹的复杂性,并导致性能受限。此外,现有方法在有效融合SAR和光学数据方面表现不足。为了解决这些问题,论文提出了一种名为Diffusion Bridges for Cloud Removal (DB-CR) 的新方法,其关键是通过直接连接有云和无云图像分布来构建扩散桥梁。同时,DB-CR 引入了一种新颖的多模态扩散桥梁架构,采用双分支主干网络进行多模态图像恢复,结合高效的主干结构和专用的跨模态融合模块,以有效提取和融合来自SAR和光学图像的特征。通过将云层去除问题形式化为扩散桥梁问题并利用定制化的架构,DB-CR 实现了高保真的结果且计算效率较高。实验结果表明,DB-CR 在SEN12MS-CR数据集上的性能达到当前最优水平。

链接: https://arxiv.org/abs/2504.03607
作者: Yuyang Hu,Suhas Lohit,Ulugbek S. Kamilov,Tim K. Marks
机构: Department of Electrical & System Engineering, Washington University in St. Louis (华盛顿大学圣路易斯分校); Departments of Computer Science & Engineering and Electrical & System Engineering, Washington University in St. Louis (华盛顿大学圣路易斯分校); Mitsubishi Electric Research Laboratories (MERL) (三菱电机研究实验室)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Deep learning has achieved some success in addressing the challenge of cloud removal in optical satellite images, by fusing with synthetic aperture radar (SAR) images. Recently, diffusion models have emerged as powerful tools for cloud removal, delivering higher-quality estimation by sampling from cloud-free distributions, compared to earlier methods. However, diffusion models initiate sampling from pure Gaussian noise, which complicates the sampling trajectory and results in suboptimal performance. Also, current methods fall short in effectively fusing SAR and optical data. To address these limitations, we propose Diffusion Bridges for Cloud Removal, DB-CR, which directly bridges between the cloudy and cloud-free image distributions. In addition, we propose a novel multimodal diffusion bridge architecture with a two-branch backbone for multimodal image restoration, incorporating an efficient backbone and dedicated cross-modality fusion blocks to effectively extract and fuse features from synthetic aperture radar (SAR) and optical images. By formulating cloud removal as a diffusion-bridge problem and leveraging this tailored architecture, DB-CR achieves high-fidelity results while being computationally efficient. We evaluated DB-CR on the SEN12MS-CR cloud-removal dataset, demonstrating that it achieves state-of-the-art results.
zh
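The reason a diffusion bridge shortens the sampling trajectory can be illustrated with a generic Brownian-bridge marginal: intermediate states are drawn on a stochastic path pinned at the cloudy image (t=0) and the cloud-free image (t=1), rather than starting from pure Gaussian noise. This is a textbook bridge formula, not DB-CR's learned model; sigma and the toy images are placeholders.

```python
import numpy as np

# Brownian-bridge marginal between two fixed endpoints. Noise vanishes
# at t=0 and t=1, so every sample is anchored to both images.

def bridge_sample(x_cloudy, x_clear, t, sigma, rng):
    """Marginal sample of a Brownian bridge between the two endpoints."""
    mean = (1.0 - t) * x_cloudy + t * x_clear
    std = sigma * np.sqrt(t * (1.0 - t))  # zero at both endpoints
    return mean + std * rng.normal(size=x_cloudy.shape)

rng = np.random.default_rng(0)
x_cloudy = np.zeros((8, 8))  # toy "cloudy" image
x_clear = np.ones((8, 8))    # toy "cloud-free" image
x_mid = bridge_sample(x_cloudy, x_clear, 0.5, 0.1, rng)
x_end = bridge_sample(x_cloudy, x_clear, 1.0, 0.1, rng)
```

A learned bridge model replaces the fixed linear mean with a network prediction and conditions it on the fused SAR features.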

[CV-7] Robust Human Registration with Body Part Segmentation on Noisy Point Clouds

【速读】:该论文旨在解决在真实世界数据中存在的噪声和背景杂乱问题导致的人体网格与3D点云配准不精确的问题,特别是在增强现实(AR)和人机交互等应用中的挑战。论文的关键创新在于提出了一种结合身体部位分割的混合方法,通过将身体部位标签分配给点云中的个体点,并以此指导SMPL-X模型的两步拟合过程:首先利用身体部位质心进行初始姿态和方向估计,然后对点云对齐进行全局细化。这种方法不仅提升了人体姿态估计的准确性,还改善了分割精度。此外,拟合得到的人体网格能够进一步优化身体部位标签,从而实现更高质量的分割。实验结果表明,该方法在InterCap、EgoBody和BEHAVE等包含杂乱和噪声的真实世界数据集上显著优于现有方法。

链接: https://arxiv.org/abs/2504.03602
作者: Kai Lascheit,Daniel Barath,Marc Pollefeys,Leonidas Guibas,Francis Engelmann
机构: ETH Zurich (苏黎世联邦理工学院); Microsoft (微软); Stanford University (斯坦福大学); Google (谷歌)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Registering human meshes to 3D point clouds is essential for applications such as augmented reality and human-robot interaction but often yields imprecise results due to noise and background clutter in real-world data. We introduce a hybrid approach that incorporates body-part segmentation into the mesh fitting process, enhancing both human pose estimation and segmentation accuracy. Our method first assigns body part labels to individual points, which then guide a two-step SMPL-X fitting: initial pose and orientation estimation using body part centroids, followed by global refinement of the point cloud alignment. Additionally, we demonstrate that the fitted human mesh can refine body part labels, leading to improved segmentation. Evaluations on the cluttered and noisy real-world datasets InterCap, EgoBody, and BEHAVE show that our approach significantly outperforms prior methods in both pose estimation and segmentation accuracy. Code and results are available on our project website: this https URL
zh
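The first fitting step described above (initial pose and orientation from body-part centroids) amounts to estimating a rigid transform between two matched centroid sets, which the classic Kabsch algorithm solves in closed form. The sketch below uses synthetic centroids and a known transform to recover; the SMPL-X refinement itself is not reproduced.

```python
import numpy as np

# Kabsch algorithm: best-fit rotation R and translation t aligning
# matched 3D centroid sets, with dst ~= src @ R.T + t. Data are synthetic.

def kabsch(src: np.ndarray, dst: np.ndarray):
    """Least-squares rigid alignment of matched point sets (N x 3 each)."""
    src_c, dst_c = src.mean(axis=0), dst.mean(axis=0)
    H = (src - src_c).T @ (dst - dst_c)      # 3x3 cross-covariance
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))   # guard against reflections
    D = np.diag([1.0, 1.0, d])
    R = Vt.T @ D @ U.T
    t = dst_c - R @ src_c
    return R, t

# Six synthetic body-part centroids and a known rigid transform to recover.
rng = np.random.default_rng(1)
src = rng.normal(size=(6, 3))
angle = 0.5
R_true = np.array([[np.cos(angle), -np.sin(angle), 0.0],
                   [np.sin(angle),  np.cos(angle), 0.0],
                   [0.0,            0.0,           1.0]])
t_true = np.array([0.3, -1.0, 2.0])
dst = src @ R_true.T + t_true

R_est, t_est = kabsch(src, dst)
```

With noisy real point clouds, this closed-form estimate would only initialize the subsequent global refinement of the mesh-to-cloud alignment.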

[CV-8] AutoSSVH: Exploring Automated Frame Sampling for Efficient Self-Supervised Video Hashing CVPR’25

【速读】:该论文旨在解决现有自监督视频哈希(Self-Supervised Video Hashing, SSVH)方法因随机帧采样导致的哈希码质量下降问题。传统方法忽略了帧间信息密度和重建难度的差异,将所有帧同等对待,从而产生次优的哈希表示。为了解决这一局限性,论文提出了一种名为AutoSSVH的新框架,其关键是引入对抗帧采样策略与基于哈希的对比学习。对抗采样策略能够自动识别并选择信息更丰富且更具挑战性的帧进行重建,提升编码能力;同时,通过点到集合(Point-to-Set, P2Set)哈希对比目标及哈希分量投票策略,进一步增强跨视频语义关系的捕获能力和哈希码的判别性能。实验结果验证了AutoSSVH在检索效率和效果上的优越性。

链接: https://arxiv.org/abs/2504.03587
作者: Niu Lian,Jun Li,Jinpeng Wang,Ruisheng Luo,Yaowei Wang,Shu-Tao Xia,Bin Chen
机构: Harbin Institute of Technology (哈尔滨工业大学); Tsinghua Shenzhen International Graduate School, Tsinghua University (清华大学深圳国际研究生院); Research Center of Artificial Intelligence, Peng Cheng Laboratory (鹏城实验室人工智能研究中心)
类目: Computer Vision and Pattern Recognition (cs.CV); Information Retrieval (cs.IR); Multimedia (cs.MM)
备注: Accepted by CVPR’25. 11 pages, 5 figures, 3 tables

点击查看摘要

Abstract:Self-Supervised Video Hashing (SSVH) compresses videos into hash codes for efficient indexing and retrieval using unlabeled training videos. Existing approaches rely on random frame sampling to learn video features and treat all frames equally. This results in suboptimal hash codes, as it ignores frame-specific information density and reconstruction difficulty. To address this limitation, we propose a new framework, termed AutoSSVH, that employs adversarial frame sampling with hash-based contrastive learning. Our adversarial sampling strategy automatically identifies and selects challenging frames with richer information for reconstruction, enhancing encoding capability. Additionally, we introduce a hash component voting strategy and a point-to-set (P2Set) hash-based contrastive objective, which help capture complex inter-video semantic relationships in the Hamming space and improve the discriminability of learned hash codes. Extensive experiments demonstrate that AutoSSVH achieves superior retrieval efficacy and efficiency compared to state-of-the-art approaches. Code is available at this https URL.
zh
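The retrieval setting that AutoSSVH's codes target (ranking videos by Hamming distance in code space) can be sketched with plain integer bit operations. The codes below are arbitrary illustrative bit strings, not codes produced by the model.

```python
# Hash-code retrieval sketch in Hamming space. An XOR of two codes marks
# the differing bits; counting them gives the Hamming distance.

def hamming(a: int, b: int) -> int:
    """Number of differing bits between two binary hash codes."""
    return bin(a ^ b).count("1")

def rank_by_hamming(query: int, database: dict) -> list:
    """Return database video ids sorted by Hamming distance to the query."""
    return sorted(database, key=lambda vid: hamming(query, database[vid]))

# Hypothetical 8-bit codes for three videos.
database = {
    "video_a": 0b10110010,
    "video_b": 0b10110011,  # one bit away from video_a
    "video_c": 0b01001101,  # bitwise complement of video_a
}
```

This distance is what makes hashing fast at scale: XOR and popcount are constant-time per pair, so ranking needs no floating-point similarity computation.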

[CV-9] PF3Det: A Prompted Foundation Feature Assisted Visual LiDAR 3D Detector CVPR2025

【速读】:本文旨在解决三维物体检测中多模态融合的挑战,特别是激光雷达点云与图像数据之间的领域差距(domain gap)问题,同时缓解高质量标注数据稀缺导致的模型性能瓶颈。为应对这些挑战,论文提出了一种名为Prompted Foundational 3D Detector (PF3Det) 的方法,其关键是结合了基础模型编码器(foundation model encoders)与软提示(soft prompts)技术,以实现高效的激光雷达-图像特征融合,从而提升跨模态信息的有效整合能力。实验结果显示,PF3Det在nuScenes数据集上以有限的训练数据取得了最先进的性能,NDS提升了1.19%,mAP提升了2.42%。

链接: https://arxiv.org/abs/2504.03563
作者: Kaidong Li,Tianxiao Zhang,Kuan-Chuan Peng,Guanghui Wang
机构: University of Kansas (堪萨斯大学); Mitsubishi Electric Research Laboratories (三菱电机研究实验室); Toronto Metropolitan University (多伦多都会大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: This paper is accepted to the CVPR 2025 Workshop on Distillation of Foundation Models for Autonomous Driving (WDFM-AD)

点击查看摘要

Abstract:3D object detection is crucial for autonomous driving, leveraging both LiDAR point clouds for precise depth information and camera images for rich semantic information. Therefore, the multi-modal methods that combine both modalities offer more robust detection results. However, efficiently fusing LiDAR points and images remains challenging due to the domain gaps. In addition, the performance of many models is limited by the amount of high quality labeled data, which is expensive to create. The recent advances in foundation models, which use large-scale pre-training on different modalities, enable better multi-modal fusion. Combining the prompt engineering techniques for efficient training, we propose the Prompted Foundational 3D Detector (PF3Det), which integrates foundation model encoders and soft prompts to enhance LiDAR-camera feature fusion. PF3Det achieves the state-of-the-art results under limited training data, improving NDS by 1.19% and mAP by 2.42% on the nuScenes dataset, demonstrating its efficiency in 3D detection.
zh

[CV-10] HumanDreamer-X: Photorealistic Single-image Human Avatars Reconstruction via Gaussian Restoration

【速读】:该论文旨在解决单图像人体重建中的几何不一致性问题,特别是在从单一视角生成多视角图像时容易出现的肢体碎片化或模糊现象。为了解决这些问题,论文提出了一种名为\textbf{HumanDreamer-X}的新框架,将多视角人体生成与重建整合到一个统一的流水线中,显著提升了重建三维模型的几何一致性和视觉保真度。关键解决方案在于引入了3D高斯点撒(3D Gaussian Splatting)作为显式的三维表示,提供初始的几何结构和外观优先级,并在此基础上训练\textbf{HumanFixer}来修复3DGS渲染结果以保证照片级真实感。此外,针对多视角生成中注意力机制的内在挑战,论文提出了注意力调制策略,有效增强了跨视角的几何细节一致性。实验结果显示,该方法在生成和重建的峰值信噪比(PSNR)指标上分别提高了16.45%和12.65%,达到高达25.62 dB的PSNR值,同时展示了对野外数据的泛化能力和对多种人体重建基础模型的适用性。

链接: https://arxiv.org/abs/2504.03536
作者: Boyuan Wang,Runqi Ouyang,Xiaofeng Wang,Zheng Zhu,Guosheng Zhao,Chaojun Ni,Guan Huang,Lihong Liu,Xingang Wang
机构: GigaAI; Institute of Automation, Chinese Academy of Sciences (自动化研究所, 中国科学院); Peking University (北京大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project Page: this https URL

点击查看摘要

Abstract:Single-image human reconstruction is vital for digital human modeling applications but remains an extremely challenging task. Current approaches rely on generative models to synthesize multi-view images for subsequent 3D reconstruction and animation. However, directly generating multiple views from a single human image suffers from geometric inconsistencies, resulting in issues like fragmented or blurred limbs in the reconstructed models. To tackle these limitations, we introduce \textbf{HumanDreamer-X}, a novel framework that integrates multi-view human generation and reconstruction into a unified pipeline, which significantly enhances the geometric consistency and visual fidelity of the reconstructed 3D models. In this framework, 3D Gaussian Splatting serves as an explicit 3D representation to provide initial geometry and appearance priority. Building upon this foundation, \textbf{HumanFixer} is trained to restore 3DGS renderings, which guarantee photorealistic results. Furthermore, we delve into the inherent challenges associated with attention mechanisms in multi-view human generation, and propose an attention modulation strategy that effectively enhances geometric details and identity consistency across multi-view. Experimental results demonstrate that our approach markedly improves generation and reconstruction PSNR quality metrics by 16.45% and 12.65%, respectively, achieving a PSNR of up to 25.62 dB, while also showing generalization capabilities on in-the-wild data and applicability to various human reconstruction backbone models.
zh

[CV-11] RANa: Retrieval-Augmented Navigation

【速读】:该论文试图解决的问题是如何让导航代理在未知环境中不仅具备泛化能力,还能有效利用之前操作中收集的信息。论文的关键解决方案在于引入了一种新的检索增强型代理(retrieval-augmented agent),通过强化学习(RL)进行训练,并能够查询从先前相同环境中的片段(episodes)收集的数据库,同时学习如何整合这些额外的上下文信息。这一方案的核心在于提出了一种独特的代理架构,用于通用导航任务,并采用数据驱动的方法结合视觉基础模型(vision foundation model, FM)来实现语义和几何理解的双重功能。此外,论文还提出了新的基准测试方法,证明了检索机制能够在零样本迁移(zero-shot transfer)中跨任务和跨环境显著提升性能。

链接: https://arxiv.org/abs/2504.03524
作者: Gianluca Monaci,Rafael S. Rezende,Romain Deffayet,Gabriela Csurka,Guillaume Bono,Hervé Déjean,Stéphane Clinchant,Christian Wolf
机构: Naver Labs Europe (Naver实验室欧洲)
类目: Computer Vision and Pattern Recognition (cs.CV); Information Retrieval (cs.IR); Robotics (cs.RO)
备注:

点击查看摘要

Abstract:Methods for navigation based on large-scale learning typically treat each episode as a new problem, where the agent is spawned with a clean memory in an unknown environment. While these generalization capabilities to an unknown environment are extremely important, we claim that, in a realistic setting, an agent should have the capacity of exploiting information collected during earlier robot operations. We address this by introducing a new retrieval-augmented agent, trained with RL, capable of querying a database collected from previous episodes in the same environment and learning how to integrate this additional context information. We introduce a unique agent architecture for the general navigation task, evaluated on ObjectNav, ImageNav and Instance-ImageNav. Our retrieval and context encoding methods are data-driven and heavily employ vision foundation models (FM) for both semantic and geometric understanding. We propose new benchmarks for these settings and we show that retrieval allows zero-shot transfer across tasks and environments while significantly improving performance.
zh

[CV-12] FADConv: A Frequency-Aware Dynamic Convolution for Farmland Non-agriculturalization Identification and Segmentation

【速读】:该论文旨在解决因耕地非农化(cropland non-agriculturalization)导致的耕地资源损失及其对粮食安全和农业可持续性的系统性威胁。为应对这一挑战,精确识别耕地与非耕地区域至关重要。传统卷积神经网络(CNNs)采用静态卷积层,而动态卷积研究显示通过注意力机制自适应加权多个卷积核可以提高精度。然而,现有依赖全局平均池化(Global Average Pooling, GAP)进行注意力权重分配的动态卷积方法存在信息丢失问题,限制了分割精度。为解决这些问题,论文提出了频率感知动态卷积(Frequency-Aware Dynamic Convolution, FADConv)和频率注意力模块(Frequency Attention, FAT)。关键创新在于结合二维离散余弦变换(2D Discrete Cosine Transform, 2D DCT)来捕获频域特征并融合,同时用高质量的注意力权重替代传统的GAP方法,从而提升动态卷积核之间的组合效果。实验结果表明,FADConv显著提高了分割精度,且计算开销极小。

链接: https://arxiv.org/abs/2504.03510
作者: Tan Shu,Li Shen
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Cropland non-agriculturalization refers to the conversion of arable land into non-agricultural uses such as forests, residential areas, and construction sites. This phenomenon not only directly leads to the loss of cropland resources but also poses systemic threats to food security and agricultural sustainability. Accurate identification of cropland and non-cropland areas is crucial for detecting and addressing this issue. Traditional CNNs employ static convolution layers, while dynamic convolution studies demonstrate that adaptively weighting multiple convolutional kernels through attention mechanisms can enhance accuracy. However, existing dynamic convolution methods relying on Global Average Pooling (GAP) for attention weight allocation suffer from information loss, limiting segmentation precision. This paper proposes Frequency-Aware Dynamic Convolution (FADConv) and a Frequency Attention (FAT) module to address these limitations. Building upon the foundational structure of dynamic convolution, we designed FADConv by integrating 2D Discrete Cosine Transform (2D DCT) to capture frequency domain features and fuse them. The FAT module generates high-quality attention weights that replace the traditional GAP method, making the combination between dynamic convolution kernels more effective. Experiments on the GID and Hi-CNA datasets demonstrate that FADConv significantly improves segmentation accuracy with minimal computational overhead. For instance, ResNet18 with FADConv achieves 1.9% and 2.7% increases in F1-score and IoU for cropland segmentation on GID, with only 58.87M additional MAdds. Compared to other dynamic convolution approaches, FADConv exhibits superior performance in cropland segmentation tasks.
zh
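The 2D DCT summary that FADConv uses in place of global average pooling can be sketched from the separable DCT-II basis. For a constant feature map all energy sits in the DC coefficient, which is exactly the component GAP keeps; the higher-frequency coefficients, discarded by GAP, carry the rest of a non-constant map. Sizes and the pure-NumPy implementation are illustrative.

```python
import numpy as np

# Separable 2D DCT-II from an orthonormal basis matrix. This is the
# frequency transform FADConv feeds to its attention module; the paper's
# actual channel-wise selection of frequency components is not shown.

def dct_matrix(n: int) -> np.ndarray:
    """Orthonormal DCT-II basis matrix of size n x n."""
    k = np.arange(n)[:, None]
    i = np.arange(n)[None, :]
    M = np.cos(np.pi * (2 * i + 1) * k / (2 * n))
    M[0] *= np.sqrt(1.0 / n)
    M[1:] *= np.sqrt(2.0 / n)
    return M

def dct2(x: np.ndarray) -> np.ndarray:
    """2D DCT-II of a feature map, applied separably to rows and columns."""
    h, w = x.shape
    return dct_matrix(h) @ x @ dct_matrix(w).T

x = np.ones((4, 4))   # constant toy feature map
coeffs = dct2(x)      # only coeffs[0, 0] (the DC term) is nonzero
```

Feeding several low-frequency coefficients, instead of the DC term alone, is what lets the FAT module produce attention weights without GAP's information loss.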

[CV-13] LV-MAE: Learning Long Video Representations through Masked-Embedding Autoencoders

【速读】:该论文试图解决长视频表征学习中的短时与长时依赖建模难题。传统方法通常受限于输入帧数量,难以有效处理长时间跨度的视频数据。为解决这一问题,论文提出了一种名为长视频掩码嵌入自编码器(Long-Video Masked Embedding Autoencoder, LV-MAE)的自监督学习框架。其关键在于将短时和长时依赖建模解耦为两个独立任务:首先利用先进的多模态编码器提取短片段的时空表示,然后通过掩码嵌入自编码器捕捉跨片段的高层交互,从而高效实现长视频的表征学习,并支持大规模长视频样本的自监督预训练。这种设计不仅提升了模型处理长视频的能力,还实现了在长视频基准测试上的最新性能表现。

链接: https://arxiv.org/abs/2504.03501
作者: Ilan Naiman,Emanuel Ben-Baruch,Oron Anschel,Alon Shoshan,Igor Kviatkovsky,Manoj Aggarwal,Gerard Medioni
机构: Amazon
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:In this work, we introduce long-video masked-embedding autoencoders (LV-MAE), a self-supervised learning framework for long video representation. Our approach treats short- and long-span dependencies as two separate tasks. Such decoupling allows for a more intuitive video processing where short-span spatiotemporal primitives are first encoded and are then used to capture long-range dependencies across consecutive video segments. To achieve this, we leverage advanced off-the-shelf multimodal encoders to extract representations from short segments within the long video, followed by pre-training a masked-embedding autoencoder capturing high-level interactions across segments. LV-MAE is highly efficient to train and enables the processing of much longer videos by alleviating the constraint on the number of input frames. Furthermore, unlike existing methods that typically pre-train on short-video datasets, our approach offers self-supervised pre-training using long video samples (e.g., 20+ minutes video clips) at scale. Using LV-MAE representations, we achieve state-of-the-art results on three long-video benchmarks – LVU, COIN, and Breakfast – employing only a simple classification head for either attentive or linear probing. Finally, to assess LV-MAE pre-training and visualize its reconstruction quality, we leverage the video-language aligned space of short video representations to monitor LV-MAE through video-text retrieval.
zh

[CV-14] BUFF: Bayesian Uncertainty Guided Diffusion Probabilistic Model for Single Image Super-Resolution AAAI2025

【速读】:该论文旨在解决现有扩散模型在图像超分辨率(Super-Resolution, SR)任务中因采用高斯噪声模型而难以有效处理自然场景中复杂多变纹理的问题。论文的关键创新在于提出了贝叶斯不确定性引导的扩散概率模型(Bayesian Uncertainty Guided Diffusion Probabilistic Model, BUFF)。BUFF 的核心在于引入贝叶斯网络生成高分辨率不确定性掩模(uncertainty masks),这些掩模能够引导扩散过程,并以语境感知的方式动态调整噪声强度。这种上下文感知且自适应的噪声控制机制显著提升了超分辨率图像的细节保真度,同时有效减少了复杂纹理和精细结构区域中的伪影与模糊现象,从而在应对复杂噪声模式及处理纹理与边缘方面表现出卓越的鲁棒性和适应性。

链接: https://arxiv.org/abs/2504.03490
作者: Zihao He,Shengchuan Zhang,Runze Hu,Yunhang Shen,Yan Zhang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 9 pages, 5 figures, AAAI 2025

点击查看摘要

Abstract:Super-resolution (SR) techniques are critical for enhancing image quality, particularly in scenarios where high-resolution imagery is essential yet limited by hardware constraints. Existing diffusion models for SR have relied predominantly on Gaussian models for noise generation, which often fall short when dealing with the complex and variable texture inherent in natural scenes. To address these deficiencies, we introduce the Bayesian Uncertainty Guided Diffusion Probabilistic Model (BUFF). BUFF distinguishes itself by incorporating a Bayesian network to generate high-resolution uncertainty masks. These masks guide the diffusion process, allowing for the adjustment of noise intensity in a manner that is both context-aware and adaptive. This novel approach not only enhances the fidelity of super-resolved images to their original high-resolution counterparts but also significantly mitigates artifacts and blurring in areas characterized by complex textures and fine details. The model demonstrates exceptional robustness against complex noise patterns and showcases superior adaptability in handling textures and edges within images. Empirical evidence, supported by visual results, illustrates the model’s robustness, especially in challenging scenarios, and its effectiveness in addressing common SR issues such as blurring. Experimental evaluations conducted on the DIV2K dataset reveal that BUFF achieves a notable improvement, with a +0.61 increase compared to baseline in SSIM on BSD100, surpassing traditional diffusion approaches by an average additional +0.20dB PSNR gain. These findings underscore the potential of Bayesian methods in enhancing diffusion processes for SR, paving the way for future advancements in the field.

[CV-15] Probabilistic Machine Learning for Noisy Labels in Earth Observation

Quick Read: This paper tackles the challenge that label noise poses to the performance and reliability of supervised Machine Learning (ML) models in Earth Observation (EO). The key idea is to use probabilistic ML to model input-dependent label noise and quantify data uncertainty in EO tasks, accounting for the domain's unique noise sources. The authors train uncertainty-aware probabilistic models across a range of high-impact EO applications and introduce a dedicated evaluation pipeline to assess their accuracy and reliability. The uncertainty-aware models outperform standard deterministic approaches on most datasets and metrics, and rigorous uncertainty evaluation further improves the interpretability of model predictions. The paper underscores the importance of modeling label noise and incorporating uncertainty quantification in EO, laying the groundwork for more accurate, reliable, and trustworthy ML solutions in the field.

Link: https://arxiv.org/abs/2504.03478
Authors: Spyros Kondylatos, Nikolaos Ioannis Bountos, Ioannis Prapas, Angelos Zavras, Gustau Camps-Valls, Ioannis Papoutsis
Affiliations: Orion Lab; National Observatory of Athens; National Technical University of Athens; Image Processing Laboratory (IPL), Universitat de València; Harokopio University of Athens; Archimedes, Athena Research Center
Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Label noise poses a significant challenge in Earth Observation (EO), often degrading the performance and reliability of supervised Machine Learning (ML) models. Yet, given the critical nature of several EO applications, developing robust and trustworthy ML solutions is essential. In this study, we take a step in this direction by leveraging probabilistic ML to model input-dependent label noise and quantify data uncertainty in EO tasks, accounting for the unique noise sources inherent in the domain. We train uncertainty-aware probabilistic models across a broad range of high-impact EO applications-spanning diverse noise sources, input modalities, and ML configurations-and introduce a dedicated pipeline to assess their accuracy and reliability. Our experimental results show that the uncertainty-aware models consistently outperform the standard deterministic approaches across most datasets and evaluation metrics. Moreover, through rigorous uncertainty evaluation, we validate the reliability of the predicted uncertainty estimates, enhancing the interpretability of model predictions. Our findings emphasize the importance of modeling label noise and incorporating uncertainty quantification in EO, paving the way for more accurate, reliable, and trustworthy ML solutions in the field.
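One standard way to model input-dependent label noise, shown here as a generic sketch and not as the paper's specific models, is a Gaussian likelihood whose variance is predicted per input; the negative log-likelihood then penalizes errors less on inputs the model declares noisy, at the cost of a log-variance term:

```python
import math

def heteroscedastic_nll(y, mu, log_var):
    """Gaussian NLL with a predicted, input-dependent variance.

    y: observed (possibly noisy) label; mu: predicted mean;
    log_var: predicted log-variance for this input.
    """
    var = math.exp(log_var)
    return 0.5 * (math.log(2 * math.pi * var) + (y - mu) ** 2 / var)

# A large error is penalized less when the model also predicts high noise.
print(heteroscedastic_nll(3.0, 0.0, 2.0) < heteroscedastic_nll(3.0, 0.0, 0.0))  # True
```

Training a network to output both `mu` and `log_var` under this loss is one common realization of "uncertainty-aware" regression; the paper evaluates a broader family of probabilistic models.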

[CV-16] ATM-Net: Anatomy-Aware Text-Guided Multi-Modal Fusion for Fine-Grained Lumbar Spine Segmentation

Quick Read: This paper addresses two shortcomings of existing lumbar spine segmentation methods: coarse-grained strategies that lack the detail needed for precise diagnosis, and vision-only models that fail to capture anatomical semantics, leading to misclassified categories and poor segmentation detail. The proposed ATM-Net framework introduces an anatomy-aware, text-guided, multi-modal fusion mechanism for fine-grained segmentation of lumbar substructures (vertebrae, intervertebral discs, and the spinal canal). Its Anatomy-aware Text Prompt Generator (ATPG) adaptively converts image annotations into anatomy-aware prompts across views, which the Holistic Anatomy-aware Semantic Fusion (HASF) module integrates with image features to build a comprehensive anatomical context. A Channel-wise Contrastive Anatomy-Aware Enhancement (CCAE) module further improves class discrimination and refines segmentation via class-wise channel-level multi-modal contrastive learning. Experiments on the MRSpineSeg and SPIDER datasets show that ATM-Net significantly outperforms state-of-the-art methods.

Link: https://arxiv.org/abs/2504.03476
Authors: Sheng Lian, Dengfeng Pan, Jianlong Cai, Guang-Yong Chen, Zhun Zhong, Zhiming Luo, Shen Zhao, Shuo Li
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Accurate lumbar spine segmentation is crucial for diagnosing spinal disorders. Existing methods typically use coarse-grained segmentation strategies that lack the fine detail needed for precise diagnosis. Additionally, their reliance on visual-only models hinders the capture of anatomical semantics, leading to misclassified categories and poor segmentation details. To address these limitations, we present ATM-Net, an innovative framework that employs an anatomy-aware, text-guided, multi-modal fusion mechanism for fine-grained segmentation of lumbar substructures, i.e., vertebrae (VBs), intervertebral discs (IDs), and spinal canal (SC). ATM-Net adopts the Anatomy-aware Text Prompt Generator (ATPG) to adaptively convert image annotations into anatomy-aware prompts in different views. These insights are further integrated with image features via the Holistic Anatomy-aware Semantic Fusion (HASF) module, building a comprehensive anatomical context. The Channel-wise Contrastive Anatomy-Aware Enhancement (CCAE) module further enhances class discrimination and refines segmentation through class-wise channel-level multi-modal contrastive learning. Extensive experiments on the MRSpineSeg and SPIDER datasets demonstrate that ATM-Net significantly outperforms state-of-the-art methods, with consistent improvements regarding class discrimination and segmentation details. For example, ATM-Net achieves Dice of 79.39% and HD95 of 9.91 pixels on SPIDER, outperforming the competitive SpineParseNet by 8.31% and 4.14 pixels, respectively.

[CV-17] Multi-encoder nnU-Net outperforms Transformer models with self-supervised pretraining

Quick Read: This paper targets medical image segmentation, the automatic identification and delineation of anatomical structures and pathological regions. Accurate segmentation is critical in radiology because tumor size, shape, and location strongly influence clinical decisions and treatment strategies, yet variations across MRI modalities, image artifacts, and scarce labeled data hamper conventional models. To overcome these limitations, the authors propose a self-supervised Multi-encoder nnU-Net that processes each MRI modality through a separate encoder, capturing modality-specific features before fusing them for the final segmentation. The model achieves a Dice Similarity Coefficient (DSC) of 93.72%, surpassing vanilla nnU-Net, SegResNet, and Swin UNETR, and is particularly effective when annotated data are limited.

Link: https://arxiv.org/abs/2504.03474
Authors: Seyedeh Sahar Taheri Otaghsara, Reza Rahmanzadeh
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:This study addresses the essential task of medical image segmentation, which involves the automatic identification and delineation of anatomical structures and pathological regions in medical images. Accurate segmentation is crucial in radiology, as it aids in the precise localization of abnormalities such as tumors, thereby enabling effective diagnosis, treatment planning, and monitoring of disease progression. Specifically, the size, shape, and location of tumors can significantly influence clinical decision-making and therapeutic strategies, making accurate segmentation a key component of radiological workflows. However, challenges posed by variations in MRI modalities, image artifacts, and the scarcity of labeled data complicate the segmentation task and impact the performance of traditional models. To overcome these limitations, we propose a novel self-supervised learning Multi-encoder nnU-Net architecture designed to process multiple MRI modalities independently through separate encoders. This approach allows the model to capture modality-specific features before fusing them for the final segmentation, thus improving accuracy. Our Multi-encoder nnU-Net demonstrates exceptional performance, achieving a Dice Similarity Coefficient (DSC) of 93.72%, which surpasses that of other models such as vanilla nnU-Net, SegResNet, and Swin UNETR. By leveraging the unique information provided by each modality, the model enhances segmentation tasks, particularly in scenarios with limited annotated data. Evaluations highlight the effectiveness of this architecture in improving tumor segmentation outcomes.

[CV-18] Dynamic Importance in Diffusion U-Net for Enhanced Image Synthesis ICME2025

Quick Read: This paper addresses the fact that conventional U-Net-based diffusion models ignore how the importance of attention blocks shifts dynamically during inference, limiting further optimization for image applications. The key contributions are: (1) a theoretical proof that re-weighting the outputs of the U-Net's Transformer blocks is a "free lunch" that improves the signal-to-noise ratio during sampling; (2) an Importance Probe that uncovers and quantifies the dynamic shifts in Transformer-block importance throughout denoising; and (3) an adaptive importance-based re-weighting schedule tailored to specific image generation and editing tasks. Experiments show that the approach significantly improves inference efficiency and enhances aesthetic quality while preserving identity consistency, and that it can be integrated seamlessly into any U-Net-based architecture.

Link: https://arxiv.org/abs/2504.03471
Authors: Xi Wang, Ziqi He, Yang Zhou
Affiliations: CSSE, Shenzhen University
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to ICME 2025. Appendix Code: this https URL

Abstract:Traditional diffusion models typically employ a U-Net architecture. Previous studies have unveiled the roles of attention blocks in the U-Net. However, they overlook the dynamic evolution of their importance during the inference process, which hinders their further exploitation to improve image applications. In this study, we first theoretically proved that, re-weighting the outputs of the Transformer blocks within the U-Net is a “free lunch” for improving the signal-to-noise ratio during the sampling process. Next, we proposed Importance Probe to uncover and quantify the dynamic shifts in importance of the Transformer blocks throughout the denoising process. Finally, we design an adaptive importance-based re-weighting schedule tailored to specific image generation and editing tasks. Experimental results demonstrate that, our approach significantly improves the efficiency of the inference process, and enhances the aesthetic quality of the samples with identity consistency. Our method can be seamlessly integrated into any U-Net-based architecture. Code: this https URL
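The re-weighting idea can be sketched in miniature (hypothetical toy code: in the real method the weights form a learned, timestep-dependent schedule over actual Transformer blocks): each block's residual contribution is scaled by an importance weight before being added back.

```python
def forward_with_reweighting(x, blocks, weights):
    """Toy residual stack: scale each block's output by its importance weight.

    x: a plain list of floats standing in for a feature map;
    blocks: callables standing in for Transformer blocks;
    weights: one scalar importance weight per block.
    """
    for block, w in zip(blocks, weights):
        delta = block(x)
        x = [xi + w * di for xi, di in zip(x, delta)]  # re-weighted residual update
    return x

blocks = [lambda v: [1.0 for _ in v]]  # a stand-in "block" with constant output
print(forward_with_reweighting([0.0, 0.0], blocks, [0.5]))  # [0.5, 0.5]
```

Setting all weights to 1.0 recovers the ordinary residual forward pass, which is why the schedule can be dropped into an existing U-Net without retraining.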

[CV-19] D-Garment: Physics-Conditioned Latent Diffusion for Dynamic Garment Deformations

Quick Read: This paper tackles adjusting and deforming 3D garments to body shape, body motion, and cloth material in virtual and augmented reality. The problem is challenging because garment dynamics drive geometric details such as wrinkle patterns, which depend on physical inputs including the wearer's shape and motion and the fabric's properties. Prior work spans learning-based models that generate garment deformations from example data and physics-inspired simulators that produce realistic garment dynamics.

The authors propose a learning-based approach trained on data generated by a physics-based simulator. Unlike prior work, their 3D generative model learns deformations of loose cloth geometry, especially large deformations and dynamic wrinkles driven by body motion and cloth material, and can be fitted efficiently to observations captured by vision sensors. To exploit diffusion models' ability to learn fine-scale detail, the 3D garment is modeled in a 2D parameter space, and a latent diffusion model is learned in this representation independently of mesh resolution, allowing global and local geometry to be conditioned on body and material information.

The key to the solution is combining physics-simulated training data with the expressive power of diffusion models in a 2D parameterization of the garment, yielding accurate deformations across mesh resolutions. Quantitative and qualitative evaluations on simulated data and on data captured with a multi-view acquisition platform show the method is more accurate than strong baselines in terms of Chamfer distance.

Link: https://arxiv.org/abs/2504.03468
Authors: Antoine Dumoulin, Adnane Boukhayma, Laurence Boissieux, Bharath Bhushan Damodaran, Pierre Hellier, Stefanie Wuhrer
Affiliations: Inria Centre at the University Grenoble Alpes; Inria, University of Rennes, CNRS, IRISA-UMR 6074; InterDigital Inc.
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 11 pages, 7 figures

Abstract:Adjusting and deforming 3D garments to body shapes, body motion, and cloth material is an important problem in virtual and augmented reality. Applications are numerous, ranging from virtual change rooms to the entertainment and gaming industry. This problem is challenging as garment dynamics influence geometric details such as wrinkling patterns, which depend on physical input including the wearer’s body shape and motion, as well as cloth material features. Existing work studies learning-based modeling techniques to generate garment deformations from example data, and physics-inspired simulators to generate realistic garment dynamics. We propose here a learning-based approach trained on data generated with a physics-based simulator. Compared to prior work, our 3D generative model learns garment deformations for loose cloth geometry, especially for large deformations and dynamic wrinkles driven by body motion and cloth material. Furthermore, the model can be efficiently fitted to observations captured using vision sensors. We propose to leverage the capability of diffusion models to learn fine-scale detail: we model the 3D garment in a 2D parameter space, and learn a latent diffusion model using this representation independent from the mesh resolution. This allows to condition global and local geometric information with body and material information. We quantitatively and qualitatively evaluate our method on both simulated data and data captured with a multi-view acquisition platform. Compared to strong baselines, our method is more accurate in terms of Chamfer distance.

[CV-20] Pyramid-based Mamba Multi-class Unsupervised Anomaly Detection

Quick Read: This paper addresses multi-class anomaly detection and localization, in particular the precise localization of small anomalies. CNNs struggle to capture long-range dependencies, while Transformer-based methods carry heavy computational overhead. The authors propose a Pyramidal Scanning Strategy (PSS) built on a state space model (SSM). The key is to combine the PSS with a pre-trained encoder for multi-scale feature extraction and a feature-level synthetic anomaly generator, capturing fine-grained details at multiple scales and improving small-anomaly localization. On the MVTec benchmark the method gains +1% AP for multi-class anomaly localization and +1% AU-PRO, demonstrating its advantage across diverse industrial scenarios.

Link: https://arxiv.org/abs/2504.03442
Authors: Nasar Iqbal, Niki Martinel
Affiliations: Department of Mathematics, Computer Science and Physics, University of Udine
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Recent advances in convolutional neural networks (CNNs) and transformer-based methods have improved anomaly detection and localization, but challenges persist in precisely localizing small anomalies. While CNNs face limitations in capturing long-range dependencies, transformer architectures often suffer from substantial computational overheads. We introduce a state space model (SSM)-based Pyramidal Scanning Strategy (PSS) for multi-class anomaly detection and localization–a novel approach designed to address the challenge of small anomaly localization. Our method captures fine-grained details at multiple scales by integrating the PSS with a pre-trained encoder for multi-scale feature extraction and a feature-level synthetic anomaly generator. An improvement of +1% AP for multi-class anomaly localization and a + 1% increase in AU-PRO on MVTec benchmark demonstrate our method’s superiority in precise anomaly localization across diverse industrial scenarios. The code is available at this https URL Mamba.
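The multi-scale side of the method can be pictured with a minimal pyramid over a 1D feature sequence. This is an illustrative stand-in only: the actual PSS scans 2D feature maps with a state space model, whereas here we just average-pool by a factor of 2 per level.

```python
def build_pyramid(signal, levels=3):
    """Toy feature pyramid: average-pool a 1D sequence by 2 at each level."""
    pyramid = [list(signal)]
    for _ in range(levels - 1):
        prev = pyramid[-1]
        pyramid.append([(prev[i] + prev[i + 1]) / 2
                        for i in range(0, len(prev) - 1, 2)])
    return pyramid

# Coarser levels summarize larger contexts; fine levels keep small-anomaly detail.
print(build_pyramid([1, 2, 3, 4], levels=2))  # [[1, 2, 3, 4], [1.5, 3.5]]
```

Scanning each level and fusing the results is what lets a detector keep both the global context and the fine detail that small anomalies require.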

[CV-21] Know What You do Not Know: Verbalized Uncertainty Estimation Robustness on Corrupted Images in Vision-Language Models NAACL2025

Quick Read: This paper studies uncertainty estimation in Visual Language Models (VLMs), especially on corrupted image data. The key contribution is an evaluation of state-of-the-art VLMs' ability to quantify the uncertainty of their responses, revealing that as corruption severity increases, the models' uncertainty estimates degrade and the models become overconfident.

Link: https://arxiv.org/abs/2504.03440
Authors: Mirko Borszukovszki, Ivo Pascal de Jong, Matias Valdenegro-Toro
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 10 pages, 11 figures, TrustNLP Workshop @ NAACL 2025 Camera ready

Abstract:To leverage the full potential of Large Language Models (LLMs) it is crucial to have some information on their answers’ uncertainty. This means that the model has to be able to quantify how certain it is in the correctness of a given response. Bad uncertainty estimates can lead to overconfident wrong answers undermining trust in these models. Quite a lot of research has been done on language models that work with text inputs and provide text outputs. Still, since the visual capabilities have been added to these models recently, there has not been much progress on the uncertainty of Visual Language Models (VLMs). We tested three state-of-the-art VLMs on corrupted image data. We found that the severity of the corruption negatively impacted the models’ ability to estimate their uncertainty and the models also showed overconfidence in most of the experiments.
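Overconfidence of the kind measured here can be quantified with a simple gap between stated confidence and accuracy. This is a generic metric sketch, not the paper's exact protocol (the function name and the 0/1 correctness encoding are assumptions for illustration):

```python
def overconfidence(confidences, correct):
    """Mean verbalized confidence minus empirical accuracy.

    confidences: model-reported confidences in [0, 1];
    correct: 1 if the answer was right, else 0. A positive gap
    means the model claims more certainty than it earns.
    """
    n = len(confidences)
    return sum(confidences) / n - sum(correct) / n

# Two answers, both claimed at 90% confidence, only one actually correct.
gap = overconfidence([0.9, 0.9], [1, 0])
print(gap > 0)  # True: the model is overconfident on this pair
```

Tracking this gap as a function of corruption severity is one straightforward way to reproduce the trend the paper reports.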

[CV-22] ZFusion: An Effective Fuser of Camera and 4D Radar for 3D Object Perception in Autonomous Driving CVPR2025

Quick Read: This paper addresses 3D object perception for autonomous driving, specifically how to fuse 4D radar with the vision modality to improve detection despite radar point clouds being much sparser than LiDAR's. The proposed ZFusion method centers on an FP-DDCA (Feature Pyramid-Double Deformable Cross Attention) fuser, which stacks Transformer blocks in a feature-pyramid structure to interactively fuse multi-modal features at different scales, letting the sparse radar and dense vision information complement each other and raising perception accuracy. A Depth-Context-Split view transformation module further exploits the physical properties of 4D radar. On the View-of-Delft (VoD) dataset, ZFusion achieves state-of-the-art mAP (mean average precision) in the region of interest with reasonable inference speed and competitive mAP over the entire area, approaching LiDAR-based performance while greatly outperforming camera-only methods.

Link: https://arxiv.org/abs/2504.03438
Authors: Sheng Yang, Tong Zhan, Shichen Qiao, Jicheng Gong, Qing Yang, Yanfeng Lu, Jian Wang
Affiliations: School of Data Science, Fudan University; ZF (China) Investment Co., Ltd.; Institute of Automation, Chinese Academy of Sciences
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: CVPR 2025 WDFM-AD

Abstract:Reliable 3D object perception is essential in autonomous driving. Owing to its sensing capabilities in all weather conditions, 4D radar has recently received much attention. However, compared to LiDAR, 4D radar provides much sparser point cloud. In this paper, we propose a 3D object detection method, termed ZFusion, which fuses 4D radar and vision modality. As the core of ZFusion, our proposed FP-DDCA (Feature Pyramid-Double Deformable Cross Attention) fuser complements the (sparse) radar information and (dense) vision information, effectively. Specifically, with a feature-pyramid structure, the FP-DDCA fuser packs Transformer blocks to interactively fuse multi-modal features at different scales, thus enhancing perception accuracy. In addition, we utilize the Depth-Context-Split view transformation module due to the physical properties of 4D radar. Considering that 4D radar has a much lower cost than LiDAR, ZFusion is an attractive alternative to LiDAR-based methods. In typical traffic scenarios like the VoD (View-of-Delft) dataset, experiments show that with reasonable inference speed, ZFusion achieved the state-of-the-art mAP (mean average precision) in the region of interest, while having competitive mAP in the entire area compared to the baseline methods, which demonstrates performance close to LiDAR and greatly outperforms those camera-only methods.

[CV-23] Autonomous state-space segmentation for Deep-RL sparse reward scenarios

Quick Read: This paper targets the difficulty deep reinforcement learning faces in sparse-reward environments. The proposed solution is a two-level architecture built on intrinsic motivation, alternating an "intrinsically driven" phase of exploration and autonomous sub-goal generation with a phase of sparse-reward, goal-directed policy learning. Concretely, several small networks are built, each specialized on a particular sub-path, and used as starting points for future exploration, avoiding re-exploring previously learned paths from scratch. Two versions of the system were trained and tested in the Gym SuperMarioBros environment without any additional extrinsic reward; the results validate the approach and show the importance of autonomously segmenting the environment to build an efficient path to the final goal.

Link: https://arxiv.org/abs/2504.03420
Authors: Gianluca Maselli, Vieri Giuliano Santucci
Affiliations: Institute of Cognitive Sciences and Technologies (ISTC), National Research Council (CNR), Rome, Italy; Department of Computer, Control and Management Engineering (DIAG), Sapienza University of Rome
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Dealing with environments with sparse rewards has always been crucial for systems developed to operate in autonomous open-ended learning settings. Intrinsic Motivations could be an effective way to help Deep Reinforcement Learning algorithms learn in such scenarios. In fact, intrinsic reward signals, such as novelty or curiosity, are generally adopted to improve exploration when extrinsic rewards are delayed or absent. Building on previous works, we tackle the problem of learning policies in the presence of sparse rewards by proposing a two-level architecture that alternates an ‘‘intrinsically driven’’ phase of exploration and autonomous sub-goal generation, to a phase of sparse reward, goal-directed policy learning. The idea is to build several small networks, each one specialized on a particular sub-path, and use them as starting points for future exploration without the need to further explore from scratch previously learnt paths. Two versions of the system have been trained and tested in the Gym SuperMarioBros environment without considering any additional extrinsic reward. The results show the validity of our approach and the importance of autonomously segment the environment to generate an efficient path towards the final goal.

[CV-24] NeRFlex: Resource-aware Real-time High-quality Rendering of Complex Scenes on Mobile Devices

Quick Read: This paper aims at high-resolution, real-time Neural Radiance Field (NeRF) rendering of complex scenes on mobile devices. Traditional approaches struggle to deliver real-time performance with good quality because of heavy computation and memory overhead, especially on complex scenes. The key innovation, NeRFlex, decomposes a scene into multiple sub-scenes in a multi-NeRF representation and treats memory and compute constraints as first-class citizens. Its core components are a detail-oriented segmentation module that identifies sub-scenes with high-frequency detail, a lightweight domain-knowledge-based profiler that accurately maps configurations to visual quality and memory usage, and a dynamic programming algorithm that efficiently determines configurations for all NeRF networks despite the NP-hardness of the underlying decision problem, achieving high-quality real-time rendering on commercial mobile devices.

Link: https://arxiv.org/abs/2504.03415
Authors: Zhe Wang, Yifei Zhu
Affiliations: UM-SJTU Joint Institute; Shanghai Jiao Tong University
Subjects: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Multimedia (cs.MM); Performance (cs.PF)
Comments: Accepted by the 45th IEEE International Conference on Distributed Computing Systems (ICDCS 2025)

Abstract:Neural Radiance Fields (NeRF) is a cutting-edge neural network-based technique for novel view synthesis in 3D reconstruction. However, its significant computational demands pose challenges for deployment on mobile devices. While mesh-based NeRF solutions have shown potential in achieving real-time rendering on mobile platforms, they often fail to deliver high-quality reconstructions when rendering practical complex scenes. Additionally, the non-negligible memory overhead caused by pre-computed intermediate results complicates their practical application. To overcome these challenges, we present NeRFlex, a resource-aware, high-resolution, real-time rendering framework for complex scenes on mobile devices. NeRFlex integrates mobile NeRF rendering with multi-NeRF representations that decompose a scene into multiple sub-scenes, each represented by an individual NeRF network. Crucially, NeRFlex considers both memory and computation constraints as first-class citizens and redesigns the reconstruction process accordingly. NeRFlex first designs a detail-oriented segmentation module to identify sub-scenes with high-frequency details. For each NeRF network, a lightweight profiler, built on domain knowledge, is used to accurately map configurations to visual quality and memory usage. Based on these insights and the resource constraints on mobile devices, NeRFlex presents a dynamic programming algorithm to efficiently determine configurations for all NeRF representations, despite the NP-hardness of the original decision problem. Extensive experiments on real-world datasets and mobile devices demonstrate that NeRFlex achieves real-time, high-quality rendering on commercial mobile devices.
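The configuration-selection step can be illustrated with a small knapsack-style dynamic program. This is a hypothetical simplification: we assume each sub-scene offers a list of (quality, memory) options and maximize total quality under a memory budget; the paper's actual algorithm, cost model, and profiler outputs differ.

```python
def select_configs(options, budget):
    """options[i]: list of (quality, mem) choices for sub-scene i.
    Returns the best total quality achievable within the memory budget,
    picking exactly one configuration per sub-scene."""
    best = {0: 0.0}  # memory used -> best total quality so far
    for configs in options:
        nxt = {}
        for used, q in best.items():
            for quality, mem in configs:
                u = used + mem
                if u <= budget and nxt.get(u, float("-inf")) < q + quality:
                    nxt[u] = q + quality
        best = nxt
    return max(best.values()) if best else 0.0

# Two sub-scenes; the high-quality option for the first fits alongside the second.
print(select_configs([[(1.0, 1), (3.0, 2)], [(2.0, 1)]], budget=3))  # 5.0
```

Enumerating memory states like this keeps the search polynomial in the budget, which is the usual way such NP-hard selection problems are made tractable in practice.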

[CV-25] FLAIRBrainSeg: Fine-grained brain segmentation using FLAIR MRI only

Quick Read: This paper addresses brain parcellation when only fluid-attenuated inversion recovery (FLAIR) MRI is available. Conventional methods rely on T1-weighted MRI, which may be unavailable, for example in the presence of multiple sclerosis lesions or when other modalities cannot be acquired. The key of the proposed FLAIRBrainSeg method is to leverage existing automatic segmentation techniques to train a network that approximates the segmentations normally obtained from T1-weighted MRI, producing segmentations of 132 brain structures while remaining robust to multiple sclerosis lesions. Experiments on in-domain and out-of-domain datasets show that the method outperforms modality-agnostic approaches based on image synthesis, the only currently available alternative for FLAIR-only parcellation.

Link: https://arxiv.org/abs/2504.03376
Authors: Edern Le Bot, Rémi Giraud, Boris Mansencal, Thomas Tourdias, Josè V. Manjon, Pierrick Coupé
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 9 pages, 6 figures

Abstract:This paper introduces a novel method for brain segmentation using only FLAIR MRIs, specifically targeting cases where access to other imaging modalities is limited. By leveraging existing automatic segmentation methods, we train a network to approximate segmentations, typically obtained from T1-weighted MRIs. Our method, called FLAIRBrainSeg, produces segmentations of 132 structures and is robust to multiple sclerosis lesions. Experiments on both in-domain and out-of-domain datasets demonstrate that our method outperforms modality-agnostic approaches based on image synthesis, the only currently available alternative for performing brain parcellation using FLAIR MRI alone. This technique holds promise for scenarios where T1-weighted MRIs are unavailable and offers a valuable alternative for clinicians and researchers in need of reliable anatomical segmentation.

[CV-26] Point Cloud-based Grasping for Soft Hand Exoskeleton

Quick Read: This paper addresses the difficulty individuals with hand impairments face in grasping, using a soft hand exoskeleton to assist, a setting where conventional control methods struggle because understanding the environment is complex. The paper proposes a vision-based predictive control framework whose key is depth-based contextual awareness to predict the grasping target and determine the next control state for activation. Unlike data-driven approaches that require large labeled datasets and generalize poorly, the solution is grounded in geometric modelling, enabling robust adaptation across diverse grasping scenarios. Evaluated with the Grasping Ability Score (GAS), the system reaches a state-of-the-art GAS of 91% across 15 objects and healthy participants, and maintains reconstruction success on unseen objects, showing better generalizability than learning-based models.

Link: https://arxiv.org/abs/2504.03369
Authors: Chen Hu, Enrica Tricomi, Eojin Rho, Daekyum Kim, Lorenzo Masia, Shan Luo, Letizia Gionfrida
Affiliations: King's College London; Institut für Technische Informatik (ZITI), Heidelberg University; School of Computing, KAIST; School of Mechanical Engineering and the School of Smart Mobility, Korea University; Munich Institute for Robotics and Machine Intelligence, Technical University of Munich
Subjects: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Grasping is a fundamental skill for interacting with and manipulating objects in the environment. However, this ability can be challenging for individuals with hand impairments. Soft hand exoskeletons designed to assist grasping can enhance or restore essential hand functions, yet controlling these soft exoskeletons to support users effectively remains difficult due to the complexity of understanding the environment. This study presents a vision-based predictive control framework that leverages contextual awareness from depth perception to predict the grasping target and determine the next control state for activation. Unlike data-driven approaches that require extensive labelled datasets and struggle with generalizability, our method is grounded in geometric modelling, enabling robust adaptation across diverse grasping scenarios. The Grasping Ability Score (GAS) was used to evaluate performance, with our system achieving a state-of-the-art GAS of 91% across 15 objects and healthy participants, demonstrating its effectiveness across different object types. The proposed approach maintained reconstruction success for unseen objects, underscoring its enhanced generalizability compared to learning-based models.

[CV-27] Meta-DAN: towards an efficient prediction strategy for page-level handwritten text recognition

Quick Read: This paper tackles the long prediction times of attention-based end-to-end text recognition models caused by character-level autoregressive decoding; existing methods need several seconds per page image. The proposed Meta Document Attention Network (Meta-DAN) decoding strategy rests on two components: windowed queries, which process several Transformer queries together to enlarge context modeling toward the near future, and multi-token predictions, which predict several tokens per query instead of only the next one. Evaluated on 10 full-page handwritten datasets, the approach achieves state-of-the-art character error rates on average.

Link: https://arxiv.org/abs/2504.03349
Authors: Denis Coquenet
Affiliations: Univ Rennes, CNRS, Inria, IRISA - UMR 6074
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Recent advances in text recognition led to a paradigm shift for page-level recognition, from multi-step segmentation-based approaches to end-to-end attention-based ones. However, the naïve character-level autoregressive decoding process results in long prediction times: it requires several seconds to process a single page image on a modern GPU. We propose the Meta Document Attention Network (Meta-DAN) as a novel decoding strategy to reduce the prediction time while enabling a better context modeling. It relies on two main components: windowed queries, to process several transformer queries altogether, enlarging the context modeling with near future; and multi-token predictions, whose goal is to predict several tokens per query instead of only the next one. We evaluate the proposed approach on 10 full-page handwritten datasets and demonstrate state-of-the-art results on average in terms of character error rate. Source code and weights of trained models are available at this https URL.
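The multi-token idea can be sketched as a greedy decoding loop that consumes several tokens per model call. Everything model-side here is assumed for illustration: `predict` is a hypothetical callable returning candidate token ids, and token id 0 as end-of-sequence is an arbitrary convention of this sketch, not Meta-DAN's actual interface.

```python
def decode_multi_token(predict, prompt, max_len, k=4):
    """Greedy decoding that accepts up to k tokens per model call
    (token id 0 is treated as end-of-sequence in this sketch)."""
    seq = list(prompt)
    while len(seq) < max_len:
        for t in predict(seq)[:k]:
            if t == 0 or len(seq) >= max_len:
                return seq
            seq.append(t)
    return seq

# A stand-in model that always proposes the same 4 tokens, ending with EOS.
print(decode_multi_token(lambda s: [1, 2, 3, 0], [], max_len=10))  # [1, 2, 3]
```

Accepting k tokens per call cuts the number of sequential model invocations by roughly a factor of k, which is where the speed-up over character-by-character decoding comes from.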

[CV-28] EOOD: Entropy-based Out-of-distribution Detection IJCNN2025

Quick Read: This paper addresses the overconfidence deep neural networks (DNNs) often exhibit on out-of-distribution (OOD) samples, a serious obstacle to deployment. The key solution is an Entropy-based Out-Of-distribution Detection (EOOD) framework: using real in-distribution (ID) data and pseudo-OOD data, EOOD first identifies the network block where the information-flow difference between ID and OOD samples is most pronounced, then computes the conditional entropy on that block as the OOD confidence score. Experiments across various ID and OOD settings show the method is effective and outperforms state-of-the-art approaches.

Link: https://arxiv.org/abs/2504.03342
Authors: Guide Yang, Chao Hou, Weilong Peng, Xiang Fang, Yongwei Nie, Peican Zhu, Keke Tang
Affiliations: Cyberspace Institute of Advanced Technology, Guangzhou University, Guangzhou, China; Huangpu Research School, Guangzhou University, Guangzhou, China; School of Computer Science and Cyber Engineering, Guangzhou University, Guangzhou, China; IGP-ERI@N, Nanyang Technological University, Singapore; School of Computer Science & Engineering, South China University of Technology, Guangzhou, China; School of Artificial Intelligence, OPtics and ElectroNics (iOPEN), Northwestern Polytechnical University, Xi'an, China
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: IJCNN 2025

Abstract:Deep neural networks (DNNs) often exhibit overconfidence when encountering out-of-distribution (OOD) samples, posing significant challenges for deployment. Since DNNs are trained on in-distribution (ID) datasets, the information flow of ID samples through DNNs inevitably differs from that of OOD samples. In this paper, we propose an Entropy-based Out-Of-distribution Detection (EOOD) framework. EOOD first identifies specific block where the information flow differences between ID and OOD samples are more pronounced, using both ID and pseudo-OOD samples. It then calculates the conditional entropy on the selected block as the OOD confidence score. Comprehensive experiments conducted across various ID and OOD settings demonstrate the effectiveness of EOOD in OOD detection and its superiority over state-of-the-art methods.
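The entropy-as-score idea can be pictured in miniature. This is an illustrative sketch using a plain softmax entropy over one block's activations; the paper's conditional entropy and its block-selection procedure are more involved than this.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    z = sum(exps)
    return [e / z for e in exps]

def entropy_score(block_activations):
    """Entropy of the block's softmax-normalized activations.
    Higher entropy -> flatter response -> more OOD-like in this sketch."""
    return -sum(p * math.log(p) for p in softmax(block_activations) if p > 0)

# A peaked (confident) activation pattern scores lower than a flat one.
print(entropy_score([10.0, 0.0, 0.0]) < entropy_score([1.0, 1.0, 1.0]))  # True
```

Thresholding such a score is the standard final step of an OOD detector: samples above the threshold are flagged as out-of-distribution.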

[CV-29] QIRL: Boosting Visual Question Answering via Optimized Question-Image Relation Learning

Quick Read: This paper targets two weaknesses of existing Visual Question Answering (VQA) debiasing methods: they fail to capture deeper image-text correlations because prevailing learning frameworks cannot extract them from highly contrasting samples, and they do not assess the relevance between the input question and image at inference time. The proposed Optimized Question-Image Relation Learning (QIRL) framework, a generation-based self-supervised learning strategy, introduces two modules: a Negative Image Generation (NIG) module that automatically produces highly irrelevant question-image pairs during training to strengthen correlation learning, and an Irrelevant Sample Identification (ISI) module that improves robustness by detecting and filtering irrelevant inputs, reducing prediction errors. A dedicated metric is also proposed to evaluate the ISI module. The approach is model-agnostic and can be combined with various VQA models; experiments on VQA-CPv2 and VQA-v2 demonstrate its effectiveness and generalization ability.

Link: https://arxiv.org/abs/2504.03337
Authors: Quanxing Xu, Ling Zhou, Xian Zhong, Feifei Zhang, Rubing Huang, Chia-Wen Lin
Affiliations: School of Computer Science and Engineering, Macau University of Science and Technology, Taipa, Macau 999078, China; Hubei Key Laboratory of Transportation Internet of Things, School of Computer Science and Artificial Intelligence, Wuhan University of Technology, Wuhan 430070, China; State Key Laboratory of Maritime Technology and Safety, Wuhan University of Technology, Wuhan 430063, China; School of Computer Science and Engineering, Tianjin University of Technology, Tianjin 300382, China; Macau University of Science and Technology Zhuhai MUST Science and Technology Research Institute, Zhuhai, Guangdong 519099, China; Department of Electrical Engineering, National Tsing Hua University, Hsinchu 30013, Taiwan
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Existing debiasing approaches in Visual Question Answering (VQA) primarily focus on enhancing visual learning, integrating auxiliary models, or employing data augmentation strategies. However, these methods exhibit two major drawbacks. First, current debiasing techniques fail to capture the superior relation between images and texts because prevalent learning frameworks do not enable models to extract deeper correlations from highly contrasting samples. Second, they do not assess the relevance between the input question and image during inference, as no prior work has examined the degree of input relevance in debiasing studies. Motivated by these limitations, we propose a novel framework, Optimized Question-Image Relation Learning (QIRL), which employs a generation-based self-supervised learning strategy. Specifically, two modules are introduced to address the aforementioned issues. The Negative Image Generation (NIG) module automatically produces highly irrelevant question-image pairs during training to enhance correlation learning, while the Irrelevant Sample Identification (ISI) module improves model robustness by detecting and filtering irrelevant inputs, thereby reducing prediction errors. Furthermore, to validate our concept of reducing output errors through filtering unrelated question-image inputs, we propose a specialized metric to evaluate the performance of the ISI module. Notably, our approach is model-agnostic and can be integrated with various VQA models. Extensive experiments on VQA-CPv2 and VQA-v2 demonstrate the effectiveness and generalization ability of our method. Among data augmentation strategies, our approach achieves state-of-the-art results.
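The ISI module's input filtering can be sketched as a similarity gate. This is hypothetical: the embeddings, the cosine measure, and the threshold `tau` are illustrative choices, not the paper's actual relevance criterion.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if degenerate)."""
    num = sum(a * b for a, b in zip(u, v))
    den = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return num / den if den else 0.0

def filter_irrelevant(pairs, tau=0.2):
    """Keep only (question_emb, image_emb) pairs whose similarity exceeds tau."""
    return [p for p in pairs if cosine(p[0], p[1]) > tau]

pairs = [([1.0, 0.0], [1.0, 0.0]),   # aligned -> kept as relevant
         ([1.0, 0.0], [0.0, 1.0])]   # orthogonal -> filtered out
print(len(filter_irrelevant(pairs)))  # 1
```

Dropping low-relevance inputs before answering is the mechanism by which the module reduces prediction errors on mismatched question-image pairs.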

[CV-30] Steerable Anatomical Shape Synthesis with Implicit Neural Representations

Quick Read: This paper addresses generative modeling of anatomical structures for virtual imaging trials, in particular how to steer a generative model toward specific patient populations rather than relying on purely random sampling. The key is a steerable generative model based on implicit neural representations, which naturally support topology changes and thus suit anatomical structures with varying topology, such as the thyroid. The model learns a disentangled latent representation, enabling fine-grained control over shape variations while maintaining reconstruction accuracy and anatomical plausibility.

Link: https://arxiv.org/abs/2504.03313
Authors: Bram de Wilde, Max T. Rietberg, Guillaume Lajoinie, Jelmer M. Wolterink
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

点击查看摘要

Abstract:Generative modeling of anatomical structures plays a crucial role in virtual imaging trials, which allow researchers to perform studies without the costs and constraints inherent to in vivo and phantom studies. For clinical relevance, generative models should allow targeted control to simulate specific patient populations rather than relying on purely random sampling. In this work, we propose a steerable generative model based on implicit neural representations. Implicit neural representations naturally support topology changes, making them well-suited for anatomical structures with varying topology, such as the thyroid. Our model learns a disentangled latent representation, enabling fine-grained control over shape variations. Evaluation includes reconstruction accuracy and anatomical plausibility. Our results demonstrate that the proposed model achieves high-quality shape generation while enabling targeted anatomical modifications.
zh

[CV-31] Multi-Flow: Multi-View-Enriched Normalizing Flows for Industrial Anomaly Detection CVPR2025

【速读】:该论文致力于解决单视图异常检测方法在处理多视角工业产品复杂性质时的局限性问题。现有基于归一化流(Normalizing Flow)的方法在单一视角场景中表现良好,但未能充分利用多视角数据中的先验信息。为弥合这一差距,论文提出了一种名为Multi-Flow的新颖多视角异常检测方法,其关键是通过引入一种新的跨视角消息传递机制(cross-view message-passing scheme),使不同视角间的信息能够有效流动与融合,从而提升精确似然估计能力。实验验证表明,Multi-Flow在Real-IAD数据集上的图像级和样本级异常检测任务中均达到了最新的技术水平。
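跨视角消息传递的核心思想是让每个视角的特征与相邻视角交换信息。以下是一个与论文实现无关的极简示意(假设视角按环形排列、消息为相邻特征的平均,这些均为演示用的假设):

```python
import numpy as np

def cross_view_message_passing(feats, alpha=0.5):
    """feats: (V, D), 每个视角一条特征向量; 与左右相邻视角交换信息后融合。"""
    left  = np.roll(feats,  1, axis=0)   # 左邻视角
    right = np.roll(feats, -1, axis=0)   # 右邻视角
    messages = 0.5 * (left + right)      # 来自邻居的消息
    return (1 - alpha) * feats + alpha * messages

feats = np.arange(12, dtype=float).reshape(4, 3)  # 4 个视角, 每个 3 维特征
fused = cross_view_message_passing(feats)
print(fused.shape)  # (4, 3)
```

融合后的特征随后可供各视角的归一化流进行更准确的似然估计。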

链接: https://arxiv.org/abs/2504.03306
作者: Mathis Kruse,Bodo Rosenhahn
机构: Institute for Information Processing, L3S - Leibniz University Hannover (信息处理研究所, L3S - 汉诺威莱布尼茨大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: Visual Anomaly and Novelty Detection 3.0 Workshop at CVPR 2025

点击查看摘要

Abstract:With more well-performing anomaly detection methods proposed, many of the single-view tasks have been solved to a relatively good degree. However, real-world production scenarios often involve complex industrial products, whose properties may not be fully captured by one single image. While normalizing flow based approaches already work well in single-camera scenarios, they currently do not make use of the priors in multi-view data. We aim to bridge this gap by using these flow-based models as a strong foundation and propose Multi-Flow, a novel multi-view anomaly detection method. Multi-Flow makes use of a novel multi-view architecture, whose exact likelihood estimation is enhanced by fusing information across different views. For this, we propose a new cross-view message-passing scheme, letting information flow between neighboring views. We empirically validate it on the real-world multi-view data set Real-IAD and reach a new state-of-the-art, surpassing current baselines in both image-wise and sample-wise anomaly detection tasks.
zh

[CV-32] FaR: Enhancing Multi-Concept Text-to-Image Diffusion via Concept Fusion and Localized Refinement

【速读】:该论文致力于解决文本到图像任务中生成多个新概念的挑战,特别是当训练样本数量较少时容易出现过拟合现象,并且在处理类相似主体(如特定的两只狗)时会遇到属性泄漏的问题。论文提出的解决方案 Fuse-and-Refine (FaR) 包含两项关键技术:Concept Fusion 数据增强技术和 Localized Refinement 损失函数。Concept Fusion 技术通过将参考主体从背景中分离并重新组合成复合图像来系统性地扩充训练数据,从而增加多样性并缓解由于有限训练样本分布狭窄导致的过拟合问题;Localized Refinement 损失函数则通过使每个概念的注意力图与正确区域对齐来保留主体代表性属性,有效防止属性泄漏,确保扩散模型能够在去噪过程中区分相似主体而不混淆它们的注意力图。这些方法共同实现了新概念学习与已有知识保留之间的平衡,并显著提升了生成效果。

链接: https://arxiv.org/abs/2504.03292
作者: Gia-Nghia Tran,Quang-Huy Che,Trong-Tai Dam Vu,Bich-Nga Pham,Vinh-Tiep Nguyen,Trung-Nghia Le,Minh-Triet Tran
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Generating multiple new concepts remains a challenging problem in the text-to-image task. Current methods often overfit when trained on a small number of samples and struggle with attribute leakage, particularly for class-similar subjects (e.g., two specific dogs). In this paper, we introduce Fuse-and-Refine (FaR), a novel approach that tackles these challenges through two key contributions: Concept Fusion technique and Localized Refinement loss function. Concept Fusion systematically augments the training data by separating reference subjects from backgrounds and recombining them into composite images to increase diversity. This augmentation technique tackles the overfitting problem by mitigating the narrow distribution of the limited training samples. In addition, Localized Refinement loss function is introduced to preserve subject representative attributes by aligning each concept’s attention map to its correct region. This approach effectively prevents attribute leakage by ensuring that the diffusion model distinguishes similar subjects without mixing their attention maps during the denoising process. By fine-tuning specific modules at the same time, FaR balances the learning of new concepts with the retention of previously learned knowledge. Empirical results show that FaR not only prevents overfitting and attribute leakage while maintaining photorealism, but also outperforms other state-of-the-art methods.
zh

[CV-33] TQD-Track: Temporal Query Denoising for 3D Multi-Object Tracking

【速读】:该论文旨在解决现有基于注意力机制的多目标跟踪(MOT)方法中,query denoising仅在单帧内进行,无法有效利用时序相关信息的问题,同时指出当前去噪过程中注意力掩码限制了自注意力机制在提升实例关联性方面的潜力。为了解决这些问题,论文提出了一种名为TQD-Track的新方法,其关键是引入了面向MOT任务的时序query denoising (Temporal Query Denoising, TQD),使去噪查询能够携带时序信息及特定实例的特征表示。此外,通过在关联模块中设计一致性关联掩码,确保检测与轨迹查询在推理阶段的交互一致,进一步增强了适用于显式数据关联模块(如基于检测或交替检测与关联范式的)跟踪方法的表现。

链接: https://arxiv.org/abs/2504.03258
作者: Shuxiao Ding,Yutong Yang,Julian Wiederer,Markus Braun,Peizheng Li,Juergen Gall,Bin Yang
机构: Mercedes-Benz AG (梅赛德斯-奔驰); University of Bonn (波恩大学); University of Stuttgart (斯图加特大学); University of Tübingen (图宾根大学); Lamarr Institute for Machine Learning and Artificial Intelligence (拉马尔机器学习与人工智能研究所)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Query denoising has become a standard training strategy for DETR-based detectors by addressing the slow convergence issue. Besides that, query denoising can be used to increase the diversity of training samples for modeling complex scenarios which is critical for Multi-Object Tracking (MOT), showing its potential in MOT application. Existing approaches integrate query denoising within the tracking-by-attention paradigm. However, as the denoising process only happens within the single frame, it cannot benefit the tracker to learn temporal-related information. In addition, the attention mask in query denoising prevents information exchange between denoising and object queries, limiting its potential in improving association using self-attention. To address these issues, we propose TQD-Track, which introduces Temporal Query Denoising (TQD) tailored for MOT, enabling denoising queries to carry temporal information and instance-specific feature representation. We introduce diverse noise types onto denoising queries that simulate real-world challenges in MOT. We analyze our proposed TQD for different tracking paradigms, and find out the paradigm with explicit learned data association module, e.g. tracking-by-detection or alternating detection and association, benefit from TQD by a larger margin. For these paradigms, we further design an association mask in the association module to ensure the consistent interaction between track and detection queries as during inference. Extensive experiments on the nuScenes dataset demonstrate that our approach consistently enhances different tracking methods by only changing the training process, especially the paradigms with explicit association module.
zh

[CV-34] SARLANG-1M: A Benchmark for Vision-Language Modeling in SAR Image Understanding

【速读】:该论文旨在解决合成孔径雷达(SAR)图像理解难题,主要由于其复杂的物理成像机制以及与人类视觉感知显著不同的视觉特性,导致现有视觉语言模型(VLMs)在SAR图像上的表现欠佳。尽管VLMs在RGB图像理解方面表现出色,但它们缺乏针对SAR特定知识的训练分布,从而限制了其在SAR图像上的应用效果。为了解决这一局限性,论文提出了一种名为SARLANG-1M的大规模多模态SAR图像理解基准数据集。该数据集的关键在于通过整合SAR图像与文本模态,构建了一个包含超过100万高质量SAR图像-文本对的数据集,覆盖全球59个城市,具备分层分辨率、细粒度语义描述、丰富的遥感类别以及跨多个任务和应用场景的问题-答案对。实验结果表明,使用SARLANG-1M对主流VLMs进行微调能够显著提升其在SAR图像理解中的性能,达到接近人类专家的水平。

链接: https://arxiv.org/abs/2504.03254
作者: Yimin Wei,Aoran Xiao,Yexian Ren,Yuting Zhu,Hongruixuan Chen,Junshi Xia,Naoto Yokoya
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Synthetic Aperture Radar (SAR) is a crucial remote sensing technology, enabling all-weather, day-and-night observation with strong surface penetration for precise and continuous environmental monitoring and analysis. However, SAR image interpretation remains challenging due to its complex physical imaging mechanisms and significant visual disparities from human perception. Recently, Vision-Language Models (VLMs) have demonstrated remarkable success in RGB image understanding, offering powerful open-vocabulary interpretation and flexible language interaction. However, their application to SAR images is severely constrained by the absence of SAR-specific knowledge in their training distributions, leading to suboptimal performance. To address this limitation, we introduce SARLANG-1M, a large-scale benchmark tailored for multimodal SAR image understanding, with a primary focus on integrating SAR with textual modality. SARLANG-1M comprises more than 1 million high-quality SAR image-text pairs collected from over 59 cities worldwide. It features hierarchical resolutions (ranging from 0.1 to 25 meters), fine-grained semantic descriptions (including both concise and detailed captions), diverse remote sensing categories (1,696 object types and 16 land cover classes), and multi-task question-answering pairs spanning seven applications and 1,012 question types. Extensive experiments on mainstream VLMs demonstrate that fine-tuning with SARLANG-1M significantly enhances their performance in SAR image interpretation, reaching performance comparable to human experts. The dataset and code will be made publicly available at this https URL.
zh

[CV-35] Robot Localization Using a Learned Keypoint Detector and Descriptor with a Floor Camera and a Feature Rich Industrial Floor

【速读】:该论文试图解决移动机器人定位问题,特别是在无需依赖可读标记的情况下,从环境中提取高质量特征以实现精确的机器人定位。论文的关键解决方案是提出了一种名为Keypoint Localization Framework (KOALA) 的框架,利用深度神经网络从工业地板图像中提取足够的特征,从而在75.7%的图像中实现平均位置误差2厘米、旋转误差2.4%的定位精度。尽管未使用滤波、先验或时间信息,该方法仍能在机器人移动过程中高精度解决绑架问题(kidnapped robot problem)。其核心优势在于结合特定的检测器与描述符,实现了优于同类方法的性能表现。
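基于地面关键点的定位在几何上归结为:由两组匹配关键点估计 2D 刚体变换(旋转 + 平移)。以下为经典 Kabsch 最小二乘解法的示意实现,与 KOALA 的检测器/描述符网络无关,仅演示由匹配点求位姿这一步:

```python
import numpy as np

def estimate_rigid_2d(src, dst):
    """由匹配点对 src->dst (各为 (N,2)) 估计旋转 R 与平移 t (Kabsch 方法)。"""
    mu_s, mu_d = src.mean(0), dst.mean(0)
    H = (src - mu_s).T @ (dst - mu_d)           # 2x2 互协方差
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))      # 防止出现反射
    R = Vt.T @ np.diag([1.0, d]) @ U.T
    t = mu_d - R @ mu_s
    return R, t

# 构造真值: 旋转 30 度, 平移 (0.5, -0.2)
theta = np.deg2rad(30)
R_true = np.array([[np.cos(theta), -np.sin(theta)],
                   [np.sin(theta),  np.cos(theta)]])
t_true = np.array([0.5, -0.2])
src = np.random.default_rng(1).normal(size=(20, 2))
dst = src @ R_true.T + t_true
R, t = estimate_rigid_2d(src, dst)
print(np.allclose(R, R_true), np.allclose(t, t_true))  # True True
```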

链接: https://arxiv.org/abs/2504.03249
作者: Piet Brömmel,Dominik Brämer,Oliver Urbann,Diana Kleingarn
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
备注:

点击查看摘要

Abstract:The localization of moving robots depends on the availability of good features from the environment. Sensor systems like Lidar are popular, but unique features can also be extracted from images of the ground. This work presents the Keypoint Localization Framework (KOALA), which utilizes deep neural networks that extract sufficient features from an industrial floor for accurate localization without having readable markers. For this purpose, we use a floor covering that can be produced as cheaply as common industrial floors. Although we do not use any filtering, prior, or temporal information, we can estimate our position in 75.7 % of all images with a mean position error of 2 cm and a rotation error of 2.4 %. Thus, the robot kidnapping problem can be solved with high precision in every frame, even while the robot is moving. Furthermore, we show that our framework with our detector and descriptor combination is able to outperform comparable approaches.
zh

[CV-36] Rotation Invariance in Floor Plan Digitization using Zernike Moments

【速读】:该论文试图解决将老旧建筑平面图从打印或扫描为栅格图像的形式转化为机器可读形式的问题,特别是处理因扫描导致的轻微旋转或位移。论文的关键在于提出了一种端到端的流水线,包括图像预处理、基于新颖方法从预处理图像构建区域邻接图(Region Adjacency Graph, RAG)并预测其节点。通过在RAG特征提取中加入归一化步骤,显著提升了RAG特征计算的旋转不变性,并提高了旋转数据上的F1分数和交并比(IoU)。此外,论文还提出了一种墙分割算法,用于将墙体分割成与相应房间关联的段。
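标题中的 Zernike 矩之所以适合构造旋转不变特征,是因为复数矩在旋转下只改变相位:c_{pq} 经旋转 θ 后变为 e^{i(p-q)θ} c_{pq},故其模长不变。下面用点集上的复数矩做一个极简数值验证(示意性实现,与论文的 RAG 特征流程无关):

```python
import numpy as np

def complex_moment(points, p, q):
    """点集 (N,2) 的复数矩 c_pq; 先减去质心以获得平移不变性。"""
    z = (points[:, 0] - points[:, 0].mean()) \
        + 1j * (points[:, 1] - points[:, 1].mean())
    return np.sum(z**p * np.conj(z)**q)

def rotate(points, theta):
    c, s = np.cos(theta), np.sin(theta)
    return points @ np.array([[c, -s], [s, c]]).T

pts = np.random.default_rng(2).normal(size=(50, 2))
m0 = abs(complex_moment(pts, 2, 1))
m1 = abs(complex_moment(rotate(pts, 0.7), 2, 1))
print(np.isclose(m0, m1))  # True — 模长对旋转不变
```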

链接: https://arxiv.org/abs/2504.03241
作者: Marius Graumann(1),Jan Marius Stürmer(1),Tobias Koch(1) ((1) German Aerospace Center (DLR), Institute for the Protection of Terrestrial Infrastructures)
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 17 pages, 5 figures

点击查看摘要

Abstract:Nowadays, a lot of old floor plans exist in printed form or are stored as scanned raster images. Slight rotations or shifts may occur during scanning. Bringing floor plans of this form into a machine readable form to enable further use, still poses a problem. Therefore, we propose an end-to-end pipeline that pre-processes the image and leverages a novel approach to create a region adjacency graph (RAG) from the pre-processed image and predict its nodes. By incorporating normalization steps into the RAG feature extraction, we significantly improved the rotation invariance of the RAG feature calculation. Moreover, applying our method leads to an improved F1 score and IoU on rotated data. Furthermore, we proposed a wall splitting algorithm for partitioning walls into segments associated with the corresponding rooms.
zh

[CV-37] Malware Detection in Docker Containers: An Image is Worth a Thousand Logs

【速读】:该论文旨在解决因恶意软件混淆(obfuscation)和多态性(polymorphism)等技术的发展而受到挑战的恶意软件检测问题,同时应对软件容器广泛使用所带来的新安全威胁,特别是恶意软件注入导致的容器被劫持风险。论文的关键解决方案是通过机器学习分析容器文件系统来识别被劫持的容器。具体而言,论文将整个软件容器通过其 tarball 表示转换为大型 RGB 图像,并提出了一种基于流式处理和分块(patch-based)方法的卷积神经网络(Convolutional Neural Network, CNN)架构。这种方法不仅有效提高了恶意软件检测的准确性,还通过发布 COSOCO 数据集支持了实验验证,该数据集包含了良性与被劫持容器的 3364 张大型 RGB 图像。实验结果显示,该方法在 F1 和召回率(Recall)方面优于所有独立及集成的 VirusTotal 引擎,从而确立了新的行业标准。
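将容器文件系统的 tarball 字节流铸成 RGB 图像,本质上是把一维字节序列按 3 通道排布成二维数组。以下示意代码中图像宽度与零填充方式均为假设,论文的具体转换细节可能不同:

```python
import numpy as np

def bytes_to_rgb(data: bytes, width: int = 64) -> np.ndarray:
    """把任意字节流按 3 通道排布成 RGB 图像 (H, width, 3), 末尾补零对齐整行。"""
    arr = np.frombuffer(data, dtype=np.uint8)
    row_bytes = width * 3
    pad = (-len(arr)) % row_bytes                      # 补齐到整行
    arr = np.concatenate([arr, np.zeros(pad, dtype=np.uint8)])
    return arr.reshape(-1, width, 3)

img = bytes_to_rgb(b"hello container filesystem" * 100)
print(img.shape, img.dtype)  # 行数 H 由数据长度决定
```

之后即可对这类大图按 patch 流式地送入卷积网络分类。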

链接: https://arxiv.org/abs/2504.03238
作者: Akis Nousias,Efklidis Katsaros,Evangelos Syrmos,Panagiotis Radoglou-Grammatikis,Thomas Lagkas,Vasileios Argyriou,Ioannis Moscholios,Evangelos Markakis,Sotirios Goudos,Panagiotis Sarigiannidis
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted at ICC-W

点击查看摘要

Abstract:Malware detection is increasingly challenged by evolving techniques like obfuscation and polymorphism, limiting the effectiveness of traditional methods. Meanwhile, the widespread adoption of software containers has introduced new security challenges, including the growing threat of malicious software injection, where a container, once compromised, can serve as entry point for further cyberattacks. In this work, we address these security issues by introducing a method to identify compromised containers through machine learning analysis of their file systems. We cast the entire software containers into large RGB images via their tarball representations, and propose to use established Convolutional Neural Network architectures on a streaming, patch-based manner. To support our experiments, we release the COSOCO dataset–the first of its kind–containing 3364 large-scale RGB images of benign and compromised software containers at this https URL. Our method detects more malware and achieves higher F1 and Recall scores than all individual and ensembles of VirusTotal engines, demonstrating its effectiveness and setting a new standard for identifying malware-compromised software containers.
zh

[CV-38] Crash Time Matters: HybridMamba for Fine-Grained Temporal Localization in Traffic Surveillance Footage

【速读】:该论文旨在解决长视频监控数据中交通事故检测困难的问题:由于事故事件短暂且罕见,传统方法难以实现精确的时间定位。论文提出了一种名为HybridMamba的新架构,结合视觉Transformer与状态空间时间建模,以实现精准的事故时间定位。其关键在于采用多层级Token压缩与分层时间处理技术,在保持计算效率的同时不牺牲时间分辨率,从而在大规模数据集上实现了1.50秒的平均绝对误差,并具备强泛化能力。
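文中的多层级 Token 压缩可以理解为对 token 序列做逐级池化:每一级按固定比例合并相邻 token,在缩短序列的同时保留时间顺序。以下为与论文实现无关的示意(压缩比例为演示用的假设):

```python
import numpy as np

def compress_tokens(tokens, ratios=(2, 2)):
    """多层级 token 压缩: 每一级按 ratio 对相邻 token 做平均池化。tokens: (T, D)。"""
    for r in ratios:
        T = (tokens.shape[0] // r) * r          # 丢弃无法整除的尾部 token
        tokens = tokens[:T].reshape(-1, r, tokens.shape[1]).mean(axis=1)
    return tokens

tokens = np.random.default_rng(3).normal(size=(100, 16))  # 100 帧的 token 序列
out = compress_tokens(tokens, ratios=(2, 5))
print(out.shape)  # (10, 16): 序列长度 100 -> 50 -> 10
```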

链接: https://arxiv.org/abs/2504.03235
作者: Ibne Farabi Shihab,Anuj Sharma
机构: Department of Computer Science, Iowa State University (爱荷华州立大学); Department of Civil, Construction and Environmental Engineering, Iowa State University (爱荷华州立大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Traffic crash detection in long-form surveillance videos is critical for emergency response and infrastructure planning but remains difficult due to the brief and rare nature of crash events. We introduce HybridMamba, a novel architecture that combines visual transformers with state-space temporal modeling to achieve accurate crash time localization. Our method uses multi-level token compression and hierarchical temporal processing to remain computationally efficient without sacrificing temporal resolution. Evaluated on a large-scale dataset from the Iowa Department of Transportation, HybridMamba achieves a mean absolute error of 1.50 seconds, with 65.2 percent of predictions within one second of the ground truth. It outperforms recent video-language models such as TimeChat and VideoLLaMA2 by up to 2.8 seconds, while using significantly fewer parameters. Our results demonstrate strong generalization across videos ranging from 2 to 40 minutes in diverse conditions. HybridMamba offers a robust and efficient solution for fine-grained temporal localization in traffic surveillance. The code will be released upon publication.
zh

[CV-39] Unlocking Neural Transparency: Jacobian Maps for Explainable AI in Alzheimers Detection

【速读】:该论文旨在解决阿尔茨海默病(Alzheimer’s Disease, AD)早期检测中深度学习模型准确性高但缺乏可解释性的问题,以提升临床信任度与应用广泛性。论文的关键解决方案是引入一种基于雅可比映射(Jacobian Maps, JMs)的预处理方法,并将其融入多模态框架中。通过捕捉局部脑容积变化,JMs能够建立模型预测结果与已知神经解剖学AD生物标志物之间的有意义关联,从而增强诊断的可解释性和可靠性。此外,通过与传统预处理数据训练的3D卷积神经网络(CNN)对比实验以及3D Grad-CAM分析,验证了该方法在提升诊断准确性及解释能力方面的有效性。
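雅可比映射用形变场的雅可比行列式刻画局部体积变化:行列式大于 1 表示局部膨胀,小于 1 表示萎缩(对应 AD 中的脑萎缩)。以下用一个均匀膨胀的 2D 形变场做数值验证,仅演示原理,并非论文的预处理流程:

```python
import numpy as np

def jacobian_determinant_2d(disp):
    """disp: (H, W, 2) 位移场, 变换为 phi(x) = x + disp。返回每个像素的 det(J)。"""
    dudy, dudx = np.gradient(disp[..., 0])   # x 方向位移的梯度
    dvdy, dvdx = np.gradient(disp[..., 1])   # y 方向位移的梯度
    # J = I + 位移场的梯度
    return (1 + dudx) * (1 + dvdy) - dudy * dvdx

H, W = 32, 32
ys, xs = np.mgrid[0:H, 0:W].astype(float)
disp = np.stack([0.1 * xs, 0.1 * ys], axis=-1)   # phi(x,y) = 1.1*(x,y): 均匀膨胀
det = jacobian_determinant_2d(disp)
print(det.mean())  # ≈ 1.21 = 1.1 * 1.1, 即局部面积放大 21%
```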

链接: https://arxiv.org/abs/2504.03230
作者: Yasmine Mustafa,Mohamed Elmahallawy,Tie Luo
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Alzheimer’s disease (AD) leads to progressive cognitive decline, making early detection crucial for effective intervention. While deep learning models have shown high accuracy in AD diagnosis, their lack of interpretability limits clinical trust and adoption. This paper introduces a novel pre-model approach leveraging Jacobian Maps (JMs) within a multi-modal framework to enhance explainability and trustworthiness in AD detection. By capturing localized brain volume changes, JMs establish meaningful correlations between model predictions and well-known neuroanatomical biomarkers of AD. We validate JMs through experiments comparing a 3D CNN trained on JMs versus on traditional preprocessed data, which demonstrates superior accuracy. We also employ 3D Grad-CAM analysis to provide both visual and quantitative insights, further showcasing improved interpretability and diagnostic reliability.
zh

[CV-40] Electromyography-Based Gesture Recognition: Hierarchical Feature Extraction for Enhanced Spatial-Temporal Dynamics

【速读】:该论文旨在解决基于多通道表面肌电图(sEMG)的手势识别中因信号不稳定导致预测不准确以及时间变化特征增强效率低下的问题。为克服缺乏基于信号的时间变化特征的问题,论文提出了一种轻量级的挤压激励深度学习多流空间时间动态时间变化特征提取方法,构建了一个有效的基于sEMG的手势识别系统。方案的关键在于设计了每个分支来提取分层特征,捕获全局和详细的时空关系以确保特征的有效性。具体而言,第一分支利用双向时序卷积网络(Bi-TCN)专注于捕捉长期时间依赖;第二分支结合一维卷积层、可分离CNN和挤压激励(SE)块高效提取时空特征;第三分支通过时序卷积网络(TCN)和双向长短期记忆网络(BiLSTM)捕获双向时间关系和时间变化模式。各分支输出通过拼接融合,并通过通道注意力模块进一步优化,从而选择性地关注最具有信息量的特征,同时提升计算效率。实验结果表明,该模型在Ninapro DB2、DB4和DB5数据集上的准确率分别为96.41%、92.40%和93.34%,证明了系统处理复杂sEMG动态的能力,为假肢控制和人机界面技术的发展提供了重要贡献。
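上面提到的挤压激励(SE)模块通过"全局池化 — 门控"为每个通道学习一个重标定权重。以下用 numpy 给出其前向过程的示意(权重随机初始化,降维比例为演示用的假设,仅展示计算流程):

```python
import numpy as np

def se_block(x, W1, W2):
    """x: (C, T) 多通道 sEMG 特征。squeeze: 沿时间平均; excitation: 两层门控。"""
    s = x.mean(axis=1)                     # squeeze: (C,)
    h = np.maximum(0, W1 @ s)              # 降维 + ReLU
    g = 1.0 / (1.0 + np.exp(-(W2 @ h)))    # 升维 + sigmoid, 得到通道权重 (C,)
    return x * g[:, None]                  # 逐通道重标定

rng = np.random.default_rng(4)
C, T, r = 8, 100, 2
x = rng.normal(size=(C, T))
W1 = rng.normal(size=(C // r, C))
W2 = rng.normal(size=(C, C // r))
y = se_block(x, W1, W2)
print(y.shape)  # (8, 100)
```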

链接: https://arxiv.org/abs/2504.03221
作者: Jungpil Shin,Abu Saleh Musa Miah,Sota Konnai,Shu Hoshitaka,Pankoo Kim
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Hand gesture recognition using multichannel surface electromyography (sEMG) is challenging due to unstable predictions and inefficient time-varying feature enhancement. To overcome the lack of signal based time-varying feature problems, we propose a lightweight squeeze-excitation deep learning-based multi stream spatial temporal dynamics time-varying feature extraction approach to build an effective sEMG-based hand gesture recognition system. Each branch of the proposed model was designed to extract hierarchical features, capturing both global and detailed spatial-temporal relationships to ensure feature effectiveness. The first branch, utilizing a Bidirectional-TCN (Bi-TCN), focuses on capturing long-term temporal dependencies by modelling past and future temporal contexts, providing a holistic view of gesture dynamics. The second branch, incorporating a 1D Convolutional layer, separable CNN, and Squeeze-and-Excitation (SE) block, efficiently extracts spatial-temporal features while emphasizing critical feature channels, enhancing feature relevance. The third branch, combining a Temporal Convolutional Network (TCN) and Bidirectional LSTM (BiLSTM), captures bidirectional temporal relationships and time-varying patterns. Outputs from all branches are fused using concatenation to capture subtle variations in the data and then refined with a channel attention module, selectively focusing on the most informative features while improving computational efficiency. The proposed model was tested on the Ninapro DB2, DB4, and DB5 datasets, achieving accuracy rates of 96.41%, 92.40%, and 93.34%, respectively. These results demonstrate the capability of the system to handle complex sEMG dynamics, offering advancements in prosthetic limb control and human-machine interface technologies with significant implications for assistive technologies.
zh

[CV-41] From ChatGPT to DeepSeek AI: A Comprehensive Analysis of Evolution Deviation and Future Implications in AI-Language Models

【速读】:该论文旨在探讨从 ChatGPT 到 DeepSeek AI 的技术演进及其对人工智能(Artificial Intelligence, AI)发展的更广泛影响。论文试图解决的问题是如何评估和理解这两种模型在架构设计、性能表现以及伦理考量上的差异,并探索它们在实际应用中的优势与局限性。解决方案的关键在于通过设计一个包含多领域多项选择题的案例研究,系统性地比较 ChatGPT 和 DeepSeek AI 的能力,从而为语言模型的技术改进、行业应用潜力以及未来研究方向提供有价值的洞见。

链接: https://arxiv.org/abs/2504.03219
作者: Simrandeep Singh,Shreya Bansal,Abdulmotaleb El Saddik,Mukesh Saini
机构: Chandigarh University (昌迪加尔大学); Indian Institute of Technology Ropar (印度理工学院罗帕尔分校); University of Ottawa (渥太华大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 10 pages, 1 figure, 4 tables

点击查看摘要

Abstract:The rapid advancement of artificial intelligence (AI) has reshaped the field of natural language processing (NLP), with models like OpenAI ChatGPT and DeepSeek AI. Although ChatGPT established a strong foundation for conversational AI, DeepSeek AI introduces significant improvements in architecture, performance, and ethical considerations. This paper presents a detailed analysis of the evolution from ChatGPT to DeepSeek AI, highlighting their technical differences, practical applications, and broader implications for AI development. To assess their capabilities, we conducted a case study using a predefined set of multiple choice questions in various domains, evaluating the strengths and limitations of each model. By examining these aspects, we provide valuable insight into the future trajectory of AI, its potential to transform industries, and key research directions for improving AI-driven language models.
zh

[CV-42] Endo3R: Unified Online Reconstruction from Dynamic Monocular Endoscopic Video

【速读】:该论文旨在解决从单目手术视频中实现一致尺度的三维场景重建问题,这是计算机辅助手术任务中的重要挑战。由于内窥镜视频存在动态形变和无纹理表面等固有难题,传统方法通常依赖于校准或工具先验来估计尺度,或者采用类似SfM的多阶段流程,这些方法存在误差累积且需要离线优化。论文提出Endo3R,一种无需先验知识或额外优化即可在线实现一致尺度重建的统一三维基础模型。

解决方案的关键在于引入了一种基于不确定性感知的双重记忆机制,将近期短期动态与长期空间一致性相结合,扩展了最近提出的成对重建模型至长时间增量动态重建能力。此外,为应对手术场景的高度动态特性,通过Sampson距离测量令牌的不确定性并过滤高不确定性令牌。针对内窥镜数据集中缺乏真实深度和相机姿态标注的问题,设计了一种自监督机制,并引入新颖的动力学感知流损失函数。实验结果表明,Endo3R在SCARED和Hamlyn数据集上的零样本深度预测和相机姿态估计任务中表现出色,同时保持了在线效率。
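文中用于衡量不确定性的 Sampson 距离,是对极几何中点对偏离约束 x₂ᵀFx₁=0 的一阶近似误差。以下为其标准公式的示意实现(与 Endo3R 的网络本身无关):

```python
import numpy as np

def sampson_distance(F, x1, x2):
    """F: 3x3 基础矩阵; x1, x2: (N,3) 齐次坐标匹配点。返回每对点的 Sampson 距离。"""
    Fx1  = x1 @ F.T          # (N,3): F @ x1
    Ftx2 = x2 @ F            # (N,3): F^T @ x2
    num = np.sum(x2 * Fx1, axis=1) ** 2                          # (x2^T F x1)^2
    den = Fx1[:, 0]**2 + Fx1[:, 1]**2 + Ftx2[:, 0]**2 + Ftx2[:, 1]**2
    return num / den

# 沿 x 轴纯平移相机的基础矩阵 F = [t]_x, 此时对应点位于同一水平极线上
F = np.array([[0., 0., 0.], [0., 0., -1.], [0., 1., 0.]])
x1 = np.array([[0.2, 0.3, 1.0]])
x2 = np.array([[0.5, 0.3, 1.0]])   # y 坐标相同 -> 满足约束, 距离为 0
print(sampson_distance(F, x1, x2))  # [0.]
```

距离越大说明匹配越不满足对极约束,对应 token 的不确定性越高,可据此过滤。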

链接: https://arxiv.org/abs/2504.03198
作者: Jiaxin Guo,Wenzhen Dong,Tianyu Huang,Hao Ding,Ziyi Wang,Haomin Kuang,Qi Dou,Yun-Hui Liu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Reconstructing 3D scenes from monocular surgical videos can enhance surgeon’s perception and therefore plays a vital role in various computer-assisted surgery tasks. However, achieving scale-consistent reconstruction remains an open challenge due to inherent issues in endoscopic videos, such as dynamic deformations and textureless surfaces. Despite recent advances, current methods either rely on calibration or instrument priors to estimate scale, or employ SfM-like multi-stage pipelines, leading to error accumulation and requiring offline optimization. In this paper, we present Endo3R, a unified 3D foundation model for online scale-consistent reconstruction from monocular surgical video, without any priors or extra optimization. Our model unifies the tasks by predicting globally aligned pointmaps, scale-consistent video depths, and camera parameters without any offline optimization. The core contribution of our method is expanding the capability of the recent pairwise reconstruction model to long-term incremental dynamic reconstruction by an uncertainty-aware dual memory mechanism. The mechanism maintains history tokens of both short-term dynamics and long-term spatial consistency. Notably, to tackle the highly dynamic nature of surgical scenes, we measure the uncertainty of tokens via Sampson distance and filter out tokens with high uncertainty. Regarding the scarcity of endoscopic datasets with ground-truth depth and camera poses, we further devise a self-supervised mechanism with a novel dynamics-aware flow loss. Abundant experiments on SCARED and Hamlyn datasets demonstrate our superior performance in zero-shot surgical video depth prediction and camera pose estimation with online efficiency. Project page: this https URL.
zh

[CV-43] Mamba as a Bridge: Where Vision Foundation Models Meet Vision Language Models for Domain-Generalized Semantic Segmentation CVPR2025

【速读】:该论文旨在解决现有领域泛化语义分割(DGSS)方法仅依赖视觉基础模型(VFMs)或视觉语言模型(VLMs),未能有效整合两者互补优势的问题。虽然VFMs擅长捕捉细粒度特征,而VLMs在文本对齐方面表现稳健但难以处理粗粒度信息,但将它们通过注意力机制有效融合面临挑战,主要是由于增加的patch tokens导致长序列建模复杂化。为了解决这一问题,论文提出了一种名为MFuser的新型Mamba基融合框架,其关键在于通过两个组件实现两者的高效结合:MVFuser作为协同适配器,通过捕获时序和空间动态来联合微调两个模型;MTEnhancer则是一个混合注意力-Mamba模块,通过引入图像先验优化文本嵌入。这种设计能够在保持序列长度线性可扩展性的同时,实现精确的特征局部性和强大的文本对齐,且不带来显著的计算开销。实验结果表明,MFuser在合成到真实及真实到真实的基准测试中分别达到了68.20 mIoU和71.87 mIoU的性能,显著优于当前最先进的DGSS方法。

链接: https://arxiv.org/abs/2504.03193
作者: Xin Zhang,Robby T. Tan
机构: National University of Singapore (新加坡国立大学); ASUS Intelligent Cloud Services (华硕智能云服务)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to CVPR 2025

点击查看摘要

Abstract:Vision Foundation Models (VFMs) and Vision-Language Models (VLMs) have gained traction in Domain Generalized Semantic Segmentation (DGSS) due to their strong generalization capabilities. However, existing DGSS methods often rely exclusively on either VFMs or VLMs, overlooking their complementary strengths. VFMs (e.g., DINOv2) excel at capturing fine-grained features, while VLMs (e.g., CLIP) provide robust text alignment but struggle with coarse granularity. Despite their complementary strengths, effectively integrating VFMs and VLMs with attention mechanisms is challenging, as the increased patch tokens complicate long-sequence modeling. To address this, we propose MFuser, a novel Mamba-based fusion framework that efficiently combines the strengths of VFMs and VLMs while maintaining linear scalability in sequence length. MFuser consists of two key components: MVFuser, which acts as a co-adapter to jointly fine-tune the two models by capturing both sequential and spatial dynamics; and MTEnhancer, a hybrid attention-Mamba module that refines text embeddings by incorporating image priors. Our approach achieves precise feature locality and strong text alignment without incurring significant computational overhead. Extensive experiments demonstrate that MFuser significantly outperforms state-of-the-art DGSS methods, achieving 68.20 mIoU on synthetic-to-real and 71.87 mIoU on real-to-real benchmarks. The code is available at this https URL.
zh

[CV-44] Three Forensic Cues for JPEG AI Images

【速读】:该论文试图解决JPEG AI图像的数字取证问题,特别是开发针对基于AI压缩的JPEG图像(JPEG AI)的新颖检测与区分方法。传统JPEG的取证工具无法直接应用于JPEG AI,因为其特有的伪影容易与深度伪造(DeepFake)图像混淆。论文的关键在于提出了三种可用于JPEG AI取证算法的线索:首先,揭示了JPEG AI预处理在颜色通道中引入的独特相关性;其次,证明了JPEG AI图像重复压缩会导致失真差异减小,可借此检测重新压缩;第三,展示了利用潜在空间中的量化特征来区分真实图像与合成图像的方法。这些方案不仅具有实际应用价值,还为AI压缩图像的数字取证研究提供了启发。
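第二条线索(重复压缩失真差递减)背后的直觉可以用标量量化说明:量化是幂等操作,第一次压缩引入失真,而之后每次重压缩的额外失真趋于零,据此可检测重新压缩。示意代码(与 JPEG AI 的实际编码器无关):

```python
import numpy as np

def quantize(x, step=0.25):
    """均匀标量量化, 模拟一次有损压缩。"""
    return np.round(x / step) * step

x = np.random.default_rng(5).uniform(size=10000)
d1 = np.mean((quantize(x) - x) ** 2)          # 第一次压缩引入的失真
x1 = quantize(x)
d2 = np.mean((quantize(x1) - x1) ** 2)        # 第二次重压缩的额外失真
print(d1 > 0, d2 == 0)  # True True — 量化幂等, 重压缩失真消失
```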

链接: https://arxiv.org/abs/2504.03191
作者: Sandra Bergmann,Fabian Brand,Christian Riess
机构: IT Security Infrastructures Lab, Friedrich-Alexander-Universität Erlangen-Nürnberg (埃尔朗根-纽伦堡弗里德里希亚历山大大学); Multimedia Communications and Signal Processing Lab, Friedrich-Alexander-Universität Erlangen-Nürnberg (埃尔朗根-纽伦堡弗里德里希亚历山大大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:The JPEG standard was vastly successful. Currently, the first AI-based compression method "JPEG AI" will be standardized. JPEG AI brings remarkable benefits. JPEG AI images exhibit impressive image quality at bitrates that are an order of magnitude lower than images compressed with traditional JPEG. However, forensic analysis of JPEG AI has to be completely re-thought: forensic tools for traditional JPEG do not transfer to JPEG AI, and artifacts from JPEG AI are easily confused with artifacts from artificially generated images ("DeepFakes"). This creates a need for novel forensic approaches to detection and distinction of JPEG AI images. In this work, we make a first step towards a forensic JPEG AI toolset. We propose three cues for forensic algorithms for JPEG AI. These algorithms address three forensic questions: first, we show that the JPEG AI preprocessing introduces correlations in the color channels that do not occur in uncompressed images. Second, we show that repeated compression of JPEG AI images leads to diminishing distortion differences. This can be used to detect recompression, in a spirit similar to some classic JPEG forensics methods. Third, we show that the quantization of JPEG AI images in the latent space can be used to distinguish real images with JPEG AI compression from synthetically generated images. The proposed methods are interpretable for a forensic analyst, and we hope that they inspire further research in the forensics of AI-compressed images.
zh

[CV-45] MIMRS: A Survey on Masked Image Modeling in Remote Sensing

【速读】:该论文旨在解决遥感领域中因云遮挡、观测遮挡及传感器限制等因素导致的数据不完整问题,并探索利用自监督学习技术提升视觉理解能力的方法。论文的关键解决方案在于引入掩码图像建模(Masked Image Modeling, MIM)技术,通过屏蔽图像的部分信息(如像素、patches或潜在表示),训练模型预测缺失部分以充分利用未标注数据进行预训练。这种方法不仅为遥感任务提供了新的可能性,如云去除、多模态数据融合及超分辨率处理,还为这一快速发展的研究领域奠定了理论与实践基础。
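掩码图像建模的训练目标可以概括为:随机屏蔽一部分 patch,仅在被屏蔽位置计算重建损失。以下为该目标的极简示意,其中用"可见 patch 的均值"代替真实网络的预测,掩码率 75% 为 MAE 风格的常见设置(均为演示假设):

```python
import numpy as np

rng = np.random.default_rng(6)
patches = rng.normal(size=(16, 32))              # 16 个 patch, 每个 32 维
mask = np.zeros(16, dtype=bool)
mask[rng.permutation(16)[:12]] = True            # 屏蔽 75% 的 patch (12/16)

visible_mean = patches[~mask].mean(axis=0)       # "模型": 以可见 patch 均值作预测
pred = np.tile(visible_mean, (16, 1))

loss = np.mean((pred[mask] - patches[mask]) ** 2)  # 只在被屏蔽位置计算重建损失
print(int(mask.sum()), loss > 0)
```

在遥感场景中,被屏蔽的 patch 可类比为被云遮挡或传感器缺失的区域。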

链接: https://arxiv.org/abs/2504.03181
作者: Shabnam Choudhury,Akhil Vasim,Michael Schmitt,Biplab Banerjee
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 6 pages

点击查看摘要

Abstract:Masked Image Modeling (MIM) is a self-supervised learning technique that involves masking portions of an image, such as pixels, patches, or latent representations, and training models to predict the missing information using the visible context. This approach has emerged as a cornerstone in self-supervised learning, unlocking new possibilities in visual understanding by leveraging unannotated data for pre-training. In remote sensing, MIM addresses challenges such as incomplete data caused by cloud cover, occlusions, and sensor limitations, enabling applications like cloud removal, multi-modal data fusion, and super-resolution. By synthesizing and critically analyzing recent advancements, this survey (MIMRS) is a pioneering effort to chart the landscape of mask image modeling in remote sensing. We highlight state-of-the-art methodologies, applications, and future research directions, providing a foundational review to guide innovation in this rapidly evolving field.
zh

[CV-46] Detection Based Part-level Articulated Object Reconstruction from Single RGBD Image NEURIPS2023

【速读】:该论文旨在解决从单张RGBD图像中重建具有多种结构和多样组件数量的人造关节物体的问题,重点在于部件级形状重建、姿态估计及运动学分析。不同于以往依赖于学习实例级潜在空间的方法,该研究提出了一种基于部件级表示的新颖方法,将实例视为检测到的部件组合。然而,这种“检测-再分组”的方法在处理具有不同部件结构和数量的实例时面临假阳性、部件大小尺度变化以及由于端到端训练导致模型规模增加等问题。为了解决这些挑战,论文提出了三个关键技术:1)测试时结合运动学信息的部件融合以提升检测性能并抑制假阳性;2)各向异性尺度归一化用于部件形状学习,适应不同的部件尺寸与比例;3)特征空间与输出空间之间的平衡策略,以改进部件检测同时控制模型规模。实验结果表明,所提方法能够成功重建复杂结构的多实例,并在形状重建和运动学估计方面优于现有方法。

链接: https://arxiv.org/abs/2504.03177
作者: Yuki Kawana,Tatsuya Harada
机构: The University of Tokyo (东京大学); RIKEN AIP (理化学研究所人工智能中心)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to NeurIPS 2023

点击查看摘要

Abstract:We propose an end-to-end trainable, cross-category method for reconstructing multiple man-made articulated objects from a single RGBD image, focusing on part-level shape reconstruction and pose and kinematics estimation. We depart from previous works that rely on learning instance-level latent space, focusing on man-made articulated objects with predefined part counts. Instead, we propose a novel alternative approach that employs part-level representation, representing instances as combinations of detected parts. While our detect-then-group approach effectively handles instances with diverse part structures and various part counts, it faces issues of false positives, varying part sizes and scales, and an increasing model size due to end-to-end training. To address these challenges, we propose 1) test-time kinematics-aware part fusion to improve detection performance while suppressing false positives, 2) anisotropic scale normalization for part shape learning to accommodate various part sizes and scales, and 3) a balancing strategy for cross-refinement between feature space and output space to improve part detection while maintaining model size. Evaluation on both synthetic and real data demonstrates that our method successfully reconstructs variously structured multiple instances that previous works cannot handle, and outperforms prior works in shape reconstruction and kinematics estimation.
zh

[CV-47] Real-Time Roadway Obstacle Detection for Electric Scooters Using Deep Learning and Multi-Sensor Fusion

【速读】:本文旨在解决电动滑板车(e-scooter)在城市环境中因小型车轮、缺乏悬挂系统以及对不平路面敏感而导致交通事故和伤害增加的问题。尽管基于深度学习的目标检测技术已被广泛用于提升汽车安全性,但其在电动滑板车障碍物检测中的应用尚未被探索。为了解决这一问题,研究提出了一种集成RGB相机和深度相机的新颖地面障碍物检测系统,以增强实时道路危险检测能力。此外,惯性测量单元(IMU)通过测量线性垂直加速度来识别表面振动,从而指导六类障碍物的分类:树枝、井盖、坑洼、松果、无方向裂缝和截断圆顶。所有传感器,包括RGB相机、深度相机和IMU,均整合于Intel RealSense Camera D435i中。利用YOLO模型的深度学习方法能够检测道路危险,并结合深度数据估算障碍物距离。在长达七小时的自然骑行数据集上的评估显示,该系统具有较高的平均精度均值(mAP)0.827,并表现出优异的实时性能。此方法通过先进的计算机视觉与数据融合技术有效提升了电动滑板车的安全性。数据集可通过提供的链接访问,项目代码托管于另一个链接处。关键在于结合多源传感器数据与高效的深度学习算法实现精准的道路障碍物检测。
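摘要中“利用深度数据估算障碍物距离”这一步,可以用如下草图示意(假设已有与RGB对齐、以米为单位的深度图和检测框;取框内有效深度中位数属于示意性做法,并非论文原实现):

```python
import numpy as np

def estimate_obstacle_distance(depth_map, bbox):
    """从对齐的深度图中估算检测框内障碍物的距离(米)。

    depth_map: (H, W) 深度图,单位为米,0 表示无效深度;
    bbox: (x1, y1, x2, y2) 像素坐标。
    取框内有效深度的中位数以抑制离群点。
    """
    x1, y1, x2, y2 = bbox
    region = depth_map[y1:y2, x1:x2]
    valid = region[region > 0]
    if valid.size == 0:
        return None                  # 框内无有效深度
    return float(np.median(valid))

depth = np.full((480, 640), 8.0)     # 8 米处的背景
depth[200:280, 300:380] = 2.5        # 2.5 米处的障碍物
print(estimate_obstacle_distance(depth, (300, 200, 380, 280)))  # 2.5
```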

链接: https://arxiv.org/abs/2504.03171
作者: Zeyang Zheng,Arman Hosseini,Dong Chen,Omid Shoghli,Arsalan Heydarian
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Robotics (cs.RO)
备注: Accepted at ASCE International Conference on Computing in Civil Engineering (i3ce)

点击查看摘要

Abstract:The increasing adoption of electric scooters (e-scooters) in urban areas has coincided with a rise in traffic accidents and injuries, largely due to their small wheels, lack of suspension, and sensitivity to uneven surfaces. While deep learning-based object detection has been widely used to improve automobile safety, its application for e-scooter obstacle detection remains unexplored. This study introduces a novel ground obstacle detection system for e-scooters, integrating an RGB camera, and a depth camera to enhance real-time road hazard detection. Additionally, the Inertial Measurement Unit (IMU) measures linear vertical acceleration to identify surface vibrations, guiding the selection of six obstacle categories: tree branches, manhole covers, potholes, pine cones, non-directional cracks, and truncated domes. All sensors, including the RGB camera, depth camera, and IMU, are integrated within the Intel RealSense Camera D435i. A deep learning model powered by YOLO detects road hazards and utilizes depth data to estimate obstacle proximity. Evaluated on the seven hours of naturalistic riding dataset, the system achieves a high mean average precision (mAP) of 0.827 and demonstrates excellent real-time performance. This approach provides an effective solution to enhance e-scooter safety through advanced computer vision and data fusion. The dataset is accessible at this https URL, and the project code is hosted on this https URL.
zh

[CV-48] REJEPA: A Novel Joint-Embedding Predictive Architecture for Efficient Remote Sensing Image Retrieval

【速读】:该论文旨在满足遥感图像档案快速扩展背景下,基于内容的遥感图像检索(RS-CBIR)对高效且鲁棒方法的需求。传统方法在处理高分辨率、多光谱数据以及复杂背景时面临计算开销大、特征冗余及多样性不足等问题。为此,论文提出了一种创新的自监督框架——REJEPA(Retrieval with Joint-Embedding Predictive Architecture,基于联合嵌入预测架构的检索),其关键在于通过空间分布上下文标记编码预测目标标记的抽象表示,有效捕获高层语义特征并去除不必要的像素级细节。与依赖像素重建或对比学习负样本的方法不同,REJEPA在特征空间内运作,相比像素重建基线(如MAE)可降低40%-60%的计算复杂度。此外,通过引入方差-不变性-协方差正则化(Variance-Invariance-Covariance Regularisation, VICReg),REJEPA提升了特征的多样性和表达能力,避免了编码器退化问题。实验结果表明,REJEPA在多个遥感基准数据集(如BEN-14K和FMoW)上的检索精度显著提升(最高达10.1%),并实现了传感器无关的高效、可扩展且精确的RS-CBIR方案。
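文中用于防止编码器退化的方差-不变性-协方差正则化(VICReg),其三项损失可用如下 numpy 草图说明(简化版,仅示意各项的计算方式,权重系数等细节以 VICReg 原文为准):

```python
import numpy as np

def vicreg_terms(z_a, z_b, eps=1e-4):
    """VICReg 三项(不变性/方差/协方差)的简化计算。

    z_a, z_b: (N, D) 两个分支的嵌入;返回三项的数值(未加权)。
    """
    # 不变性项:两分支嵌入的均方误差
    inv = np.mean((z_a - z_b) ** 2)

    # 方差项:铰链损失鼓励每一维标准差不低于 1,防止编码器坍缩
    def var_term(z):
        std = np.sqrt(z.var(axis=0) + eps)
        return np.mean(np.maximum(0.0, 1.0 - std))

    # 协方差项:惩罚维度间协方差,降低特征冗余
    def cov_term(z):
        zc = z - z.mean(axis=0)
        cov = (zc.T @ zc) / (len(z) - 1)
        off_diag = cov - np.diag(np.diag(cov))
        return np.sum(off_diag ** 2) / z.shape[1]

    return inv, var_term(z_a) + var_term(z_b), cov_term(z_a) + cov_term(z_b)
```

直观上:方差项阻止所有嵌入塌缩到同一点,协方差项阻止各维度学到重复信息,不变性项保证两分支表征对齐。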

链接: https://arxiv.org/abs/2504.03169
作者: Shabnam Choudhury,Yash Salunkhe,Sarthak Mehrotra,Biplab Banerjee
机构: Indian Institute of Technology Bombay (印度孟买理工学院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 14 pages

点击查看摘要

Abstract:The rapid expansion of remote sensing image archives demands the development of strong and efficient techniques for content-based image retrieval (RS-CBIR). This paper presents REJEPA (Retrieval with Joint-Embedding Predictive Architecture), an innovative self-supervised framework designed for unimodal RS-CBIR. REJEPA utilises spatially distributed context token encoding to forecast abstract representations of target tokens, effectively capturing high-level semantic features and eliminating unnecessary pixel-level details. In contrast to generative methods that focus on pixel reconstruction or contrastive techniques that depend on negative pairs, REJEPA functions within feature space, achieving a reduction in computational complexity of 40-60% when compared to pixel-reconstruction baselines like Masked Autoencoders (MAE). To guarantee strong and varied representations, REJEPA incorporates Variance-Invariance-Covariance Regularisation (VICReg), which prevents encoder collapse by promoting feature diversity and reducing redundancy. The method demonstrates an estimated enhancement in retrieval accuracy of 5.1% on BEN-14K (S1), 7.4% on BEN-14K (S2), 6.0% on FMoW-RGB, and 10.1% on FMoW-Sentinel compared to prominent SSL techniques, including CSMAE-SESD, Mask-VLM, SatMAE, ScaleMAE, and SatMAE++, on extensive RS benchmarks BEN-14K (multispectral and SAR data), FMoW-RGB and FMoW-Sentinel. Through effective generalisation across sensor modalities, REJEPA establishes itself as a sensor-agnostic benchmark for efficient, scalable, and precise RS-CBIR, addressing challenges like varying resolutions, high object density, and complex backgrounds with computational efficiency.
zh

[CV-49] Finding the Reflection Point: Unpadding Images to Remove Data Augmentation Artifacts in Large Open Source Image Datasets for Machine Learning

【速读】:本文旨在解决机器学习数据集整理中的一个新问题:检测并去除噪声镜像填充伪影。为标准化图像尺寸,数据增强技术如填充是必要的,但它们可能引入会在跨领域重用数据集时降低模型评估质量的伪影。论文提出了一种系统算法,通过最小均方误差方法结合阈值处理来精确定位反射边界,并移除反射填充。该方法的关键在于能够即使在存在压缩或插值噪声的情况下,有效识别真实内容与其镜像副本之间的过渡。通过在SHEL5k数据集上的验证,展示了该算法在OWLv2零样本目标检测任务中的显著性能提升,硬帽检测的平均精度从0.47提高到0.61,人员检测的平均精度从0.68提高到0.73。通过解决填充区域中的标注不一致和扭曲对象问题,本研究提升了数据集的整体完整性,从而在计算机视觉任务中实现了更可靠的模型评估。
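该算法“最小均方误差 + 阈值”定位反射边界的思路,可以用下面针对底部竖直镜像填充的草图示意(仅处理底部填充,阈值取法为示意性假设,并非论文原实现):

```python
import numpy as np

def find_reflection_boundary(img, max_pad=64, tol=1e-3):
    """用最小均方误差加阈值在图像底部定位竖直镜像填充的反射边界。

    假设填充由最后若干行真实内容竖直镜像而来。
    返回真实内容的行数;未检测到镜像填充时返回 img.shape[0]。
    """
    h = img.shape[0]
    best_b, best_err = h, np.inf
    for pad in range(1, min(max_pad, h // 2) + 1):
        b = h - pad                       # 候选反射边界
        authentic = img[b - pad:b]        # 边界上方 pad 行真实内容
        mirrored = img[b:b + pad][::-1]   # 边界下方 pad 行,翻转后应与之重合
        err = np.mean((authentic.astype(float) - mirrored.astype(float)) ** 2)
        if err < best_err:
            best_err, best_b = err, b
    # 阈值容忍压缩/插值噪声;完全不匹配时视为无填充
    return best_b if best_err < tol * 255 ** 2 else h

# 构造一幅底部带 8 行镜像填充的合成图像
rng = np.random.default_rng(0)
content = rng.integers(0, 256, size=(56, 32), dtype=np.uint8)
padded = np.vstack([content, content[-8:][::-1]])
print(find_reflection_boundary(padded))  # 56
```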

链接: https://arxiv.org/abs/2504.03168
作者: Lucas Choi,Ross Greer
机构: Archbishop Mitty; University of California, Merced (加州大学默塞德分校)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:In this paper, we address a novel image restoration problem relevant to machine learning dataset curation: the detection and removal of noisy mirrored padding artifacts. While data augmentation techniques like padding are necessary for standardizing image dimensions, they can introduce artifacts that degrade model evaluation when datasets are repurposed across domains. We propose a systematic algorithm to precisely delineate the reflection boundary through a minimum mean squared error approach with thresholding and remove reflective padding. Our method effectively identifies the transition between authentic content and its mirrored counterpart, even in the presence of compression or interpolation noise. We demonstrate our algorithm’s efficacy on the SHEL5k dataset, showing significant performance improvements in zero-shot object detection tasks using OWLv2, with average precision increasing from 0.47 to 0.61 for hard hat detection and from 0.68 to 0.73 for person detection. By addressing annotation inconsistencies and distorted objects in padded regions, our approach enhances dataset integrity, enabling more reliable model evaluation across computer vision tasks.
zh

[CV-50] RingMoE: Mixture-of-Modality-Experts Multi-Modal Foundation Models for Universal Remote Sensing Image Interpretation

【速读】:该论文旨在解决现有基础模型在遥感(Remote Sensing, RS)应用中的局限性,即它们主要针对单一或有限模态的数据进行处理,而未能充分利用遥感数据固有的多模态特性。光学、合成孔径雷达(Synthetic Aperture Radar, SAR)及多光谱数据提供了互补的信息,能够显著降低单源分析中的不确定性。为解决这一问题,论文提出了RingMoE,这是一种具有147亿参数的统一多模态遥感基础模型,在来自九颗卫星的4亿张多模态遥感图像上完成预训练。其关键创新点包括:(1) 层次化的专家混合模型(Mixture-of-Experts, MoE)架构,包含模态专用、协作及共享专家,既能有效建模模态内知识,又能捕捉模态间依赖关系以缓解模态表示之间的冲突;(2) 物理信息引导的自监督学习,将传感器特定的辐射特征显式嵌入到预训练目标中;(3) 动态专家剪枝技术,能够在保持性能的同时将模型从147亿参数压缩至10亿参数,从而支持高效部署于地球观测任务中。通过在六个关键遥感任务(分类、检测、分割、跟踪、变化检测和深度估计)的23个基准测试中验证,RingMoE不仅超越了现有的基础模型,还创造了新的最优性能记录,展现了从单模态到多模态场景的强大适应能力。
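其中“模态专用专家 + 共享专家”的层次化 MoE 路由,可以用如下极简草图示意(门控与专家均退化为单个线性映射,结构细节纯属示意性假设,与 RingMoE 的实际实现无关):

```python
import numpy as np

def moe_forward(x, modality, experts, shared, gate_w):
    """模态专用专家与共享专家组合的极简前向草图。

    x: (D,) 输入 token;modality: 模态名(如 'sar'、'optical');
    experts: {模态名: (D, D) 权重};shared: (D, D) 共享专家权重;
    gate_w: (2, D) 门控权重。真实模型中专家为完整子网络且数量更多。
    """
    logits = gate_w @ x
    g = np.exp(logits - logits.max())
    g = g / g.sum()                      # softmax 门控系数
    y_spec = experts[modality] @ x       # 模态专用专家输出
    y_shared = shared @ x                # 共享专家输出
    return g[0] * y_spec + g[1] * y_shared
```

按 `modality` 只激活对应的专用专家,是 MoE 将巨大参数量与较小单次推理开销解耦的关键,也为后续按模态剪枝专家提供了结构基础。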

链接: https://arxiv.org/abs/2504.03166
作者: Hanbo Bi,Yingchao Feng,Boyuan Tong,Mengyu Wang,Haichen Yu,Yongqiang Mao,Hao Chang,Wenhui Diao,Peijin Wang,Yue Yu,Hanyang Peng,Yehong Zhang,Kun Fu,Xian Sun
机构: Aerospace Information Research Institute, Chinese Academy of Sciences, Beijing 100190, China (航天信息研究所,中国科学院,北京 100190, 中国); School of Electronic, Electrical and Communication Engineering, University of Chinese Academy of Sciences, Beijing 100190, China (中国科学院大学电子、电气与通信工程学院,北京 100190, 中国); University of Chinese Academy of Sciences, Beijing 100190, China (中国科学院大学,北京 100190, 中国); Key Laboratory of Target Cognition and Application Technology(TCAT), Aerospace Information Research Institute, Chinese Academy of Sciences, Beijing 100094, China (目标认知与应用技术重点实验室,中国科学院航天信息研究所,北京 100094, 中国); Department of Electronic Engineering, Tsinghua University, Beijing 100084, China (清华大学电子工程系,北京 100084, 中国); Peng Cheng Laboratory, Shenzhen 518066, China (鹏城实验室,深圳 518066, 中国)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:The rapid advancement of foundation models has revolutionized visual representation learning in a self-supervised manner. However, their application in remote sensing (RS) remains constrained by a fundamental gap: existing models predominantly handle single or limited modalities, overlooking the inherently multi-modal nature of RS observations. Optical, synthetic aperture radar (SAR), and multi-spectral data offer complementary insights that significantly reduce the inherent ambiguity and uncertainty in single-source analysis. To bridge this gap, we introduce RingMoE, a unified multi-modal RS foundation model with 14.7 billion parameters, pre-trained on 400 million multi-modal RS images from nine satellites. RingMoE incorporates three key innovations: (1) A hierarchical Mixture-of-Experts (MoE) architecture comprising modal-specialized, collaborative, and shared experts, effectively modeling intra-modal knowledge while capturing cross-modal dependencies to mitigate conflicts between modal representations; (2) Physics-informed self-supervised learning, explicitly embedding sensor-specific radiometric characteristics into the pre-training objectives; (3) Dynamic expert pruning, enabling adaptive model compression from 14.7B to 1B parameters while maintaining performance, facilitating efficient deployment in Earth observation applications. Evaluated across 23 benchmarks spanning six key RS tasks (i.e., classification, detection, segmentation, tracking, change detection, and depth estimation), RingMoE outperforms existing foundation models and sets new SOTAs, demonstrating remarkable adaptability from single-modal to multi-modal scenarios. Beyond theoretical progress, it has been deployed and trialed in multiple sectors, including emergency response, land management, marine sciences, and urban planning.
zh

[CV-51] NuScenes-SpatialQA: A Spatial Understanding and Reasoning Benchmark for Vision-Language Models in Autonomous Driving

【速读】:该论文旨在解决现有视觉-语言模型(Vision-Language Models, VLMs)在自动驾驶任务中空间理解与推理能力不足的问题。尽管VLMs在自动驾驶领域展现出巨大潜力,但它们的空间推理能力仍存在显著局限性,而现有的基准数据集并未系统性评估这些模型在驾驶场景中的空间推理能力。为填补这一空白,论文提出了NuScenes-SpatialQA,这是一个基于大规模真实标注的问答(Question-Answer, QA)基准,专门用于评估VLMs在自动驾驶中的空间理解和推理能力。该基准依托NuScenes数据集构建,并通过自动化3D场景图生成管道和QA生成管道实现。关键解决方案在于利用NuScenes-SpatialQA基准,对包括通用模型和空间增强模型在内的多种VLMs进行广泛实验,从而提供其空间能力的首次全面评估。实验结果表明,虽然空间增强模型在定性QA任务中表现更优,但在定量QA任务中未表现出竞争力,这凸显了VLMs在空间理解与推理方面面临的重大挑战。

链接: https://arxiv.org/abs/2504.03164
作者: Kexin Tian,Jingrui Mao,Yunlong Zhang,Jiwan Jiang,Yang Zhou,Zhengzhong Tu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Recent advancements in Vision-Language Models (VLMs) have demonstrated strong potential for autonomous driving tasks. However, their spatial understanding and reasoning-key capabilities for autonomous driving-still exhibit significant limitations. Notably, none of the existing benchmarks systematically evaluate VLMs’ spatial reasoning capabilities in driving scenarios. To fill this gap, we propose NuScenes-SpatialQA, the first large-scale ground-truth-based Question-Answer (QA) benchmark specifically designed to evaluate the spatial understanding and reasoning capabilities of VLMs in autonomous driving. Built upon the NuScenes dataset, the benchmark is constructed through an automated 3D scene graph generation pipeline and a QA generation pipeline. The benchmark systematically evaluates VLMs’ performance in both spatial understanding and reasoning across multiple dimensions. Using this benchmark, we conduct extensive experiments on diverse VLMs, including both general and spatial-enhanced models, providing the first comprehensive evaluation of their spatial capabilities in autonomous driving. Surprisingly, the experimental results show that the spatial-enhanced VLM outperforms in qualitative QA but does not demonstrate competitiveness in quantitative QA. In general, VLMs still face considerable challenges in spatial understanding and reasoning.
zh

[CV-52] TokenFLEX: Unified VLM Training for Flexible Visual Tokens Inference

【速读】:该论文旨在解决传统视觉-语言模型(Vision-Language Models, VLMs)中固定视觉token数量导致的效率低下问题。具体而言,在任务简单时,固定数量的token会导致计算资源浪费;而在任务复杂时,则可能无法提供足够的视觉细节以支持精细理解。为了解决这些问题,论文提出了一种名为TokenFLEX的新框架,它能够根据任务需求灵活调整图像编码的token数量,从而更高效地与大型语言模型(Large Language Model, LLM)集成。

TokenFLEX的关键创新在于两个方面:首先,引入了一种新颖的训练范式,在训练过程中通过随机调节token数量来提升模型在不同token数量下的性能;其次,设计了一个轻量级的视觉token投影器,其中包含自适应池化层和SwiGLU模块,使得可以灵活地对视觉token进行下采样,并根据特定的token数量选择适配的特征。这些创新确保了TokenFLEX不仅具有高度灵活性,还能保持高水平的视觉-语言理解能力。
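其中自适应池化层将视觉 token 下采样到任意目标数量的思路,可用如下草图示意(假设输入与输出 token 数均为平方数,划分方式仿照 2D 自适应平均池化;并非论文原实现):

```python
import numpy as np

def pool_vision_tokens(tokens, num_out):
    """把 (H*W, C) 的视觉 token 自适应平均池化为 num_out 个 token。

    假设输入与输出 token 数均为平方数,划分规则仿照 2D 自适应平均池化:
    第 i 个输出 bin 覆盖输入行区间 [i*h//g, ceil((i+1)*h/g))。
    """
    n, c = tokens.shape
    h = int(np.sqrt(n))
    g = int(np.sqrt(num_out))
    assert h * h == n and g * g == num_out
    grid = tokens.reshape(h, h, c)
    edges = [(i * h // g, -(-((i + 1) * h) // g)) for i in range(g)]
    out = np.stack([grid[r0:r1, c0:c1].mean(axis=(0, 1))
                    for (r0, r1) in edges for (c0, c1) in edges])
    return out  # (num_out, C)

tokens = np.arange(256 * 3, dtype=float).reshape(256, 3)
print(pool_vision_tokens(tokens, 64).shape)  # (64, 3)
```

同一函数即可把 256 个 token 池化为 64、144 或 256 个,这正是“训练时随机调节 token 数量”得以实现的前提。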

链接: https://arxiv.org/abs/2504.03154
作者: Junshan Hu,Jialiang Mao,Zhikang Liu,Zhongpu Xia,Peng Jia,Xianpeng Lang
机构: Li Auto
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Conventional Vision-Language Models(VLMs) typically utilize a fixed number of vision tokens, regardless of task complexity. This one-size-fits-all strategy introduces notable inefficiencies: using excessive tokens leads to unnecessary computational overhead in simpler tasks, whereas insufficient tokens compromise fine-grained visual comprehension in more complex contexts. To overcome these limitations, we present TokenFLEX, an innovative and adaptable vision-language framework that encodes images into a variable number of tokens for efficient integration with a Large Language Model (LLM). Our approach is underpinned by two pivotal innovations. Firstly, we present a novel training paradigm that enhances performance across varying numbers of vision tokens by stochastically modulating token counts during training. Secondly, we design a lightweight vision token projector incorporating an adaptive pooling layer and SwiGLU, allowing for flexible downsampling of vision tokens and adaptive selection of features tailored to specific token counts. Comprehensive experiments reveal that TokenFLEX consistently outperforms its fixed-token counterparts, achieving notable performance gains across various token counts enhancements of 1.6%, 1.0%, and 0.4% with 64, 144, and 256 tokens, respectively averaged over eight vision-language benchmarks. These results underscore TokenFLEX’s remarkable flexibility while maintaining high-performance vision-language understanding.
zh

[CV-53] Model Reveals What to Cache: Profiling-Based Feature Reuse for Video Diffusion Models

【速读】:该论文旨在解决扩散模型在视频生成中的高计算强度问题,特别是现有基于特征缓存的方法因未能充分利用各模块的异构重要性而导致的缓存重用效率低下及输出质量下降的问题。论文的关键创新在于提出了ProfilingDiT,这是一种新颖的自适应缓存策略,通过解耦关注前景和背景的模块,系统分析扩散模型中的注意力分布,发现大多数层对前景或背景区域具有一致偏好,并且预测噪声在去噪过程中从初始的低步间相似性逐渐趋于稳定。基于此,论文设计了一种选择性缓存策略,在动态前景元素上保留完全计算的同时,高效缓存静态背景特征,从而显著降低计算开销并保持视觉保真度。实验结果表明,该框架在实现显著加速(例如,Wan2.1模型加速2.01倍)的同时,维持了全面质量指标下的视觉保真度。
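基于“步间相似度先低后稳”这一观察的缓存复用判据,可以写成如下示意函数(warmup 步数与相似度阈值均为假设值,并非论文原实现):

```python
import numpy as np

def should_reuse(prev_feat, cur_feat, step, warmup=10, thresh=0.95):
    """依据相邻去噪步特征的余弦相似度决定是否复用缓存的背景特征。

    对应摘要的观察:预测噪声在初始步相似度低、随去噪推进趋于稳定,
    因此早期强制重新计算,后期在相似度足够高时复用缓存。
    warmup 与 thresh 为示意性取值。
    """
    if step < warmup:          # 初始阶段:步间相似度低,不复用
        return False
    a, b = prev_feat.ravel(), cur_feat.ravel()
    cos = a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8)
    return bool(cos >= thresh)
```

实际系统中该判据只作用于偏好背景的模块:动态前景模块始终完全计算,静态背景模块在判据满足时直接复用上一步特征。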

链接: https://arxiv.org/abs/2504.03140
作者: Xuran Ma,Yexin Liu,Yaofu Liu,Xianfeng Wu,Mingzhe Zheng,Zihao Wang,Ser-Nam Lim,Harry Yang
机构: Hong Kong University of Science and Technology (香港科技大学); Everlyn AI; University of Central Florida (中佛罗里达大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Recent advances in diffusion models have demonstrated remarkable capabilities in video generation. However, the computational intensity remains a significant challenge for practical applications. While feature caching has been proposed to reduce the computational burden of diffusion models, existing methods typically overlook the heterogeneous significance of individual blocks, resulting in suboptimal reuse and degraded output quality. To this end, we address this gap by introducing ProfilingDiT, a novel adaptive caching strategy that explicitly disentangles foreground and background-focused blocks. Through a systematic analysis of attention distributions in diffusion models, we reveal a key observation: 1) Most layers exhibit a consistent preference for either foreground or background regions. 2) Predicted noise shows low inter-step similarity initially, which stabilizes as denoising progresses. This finding inspires us to formulate a selective caching strategy that preserves full computation for dynamic foreground elements while efficiently caching static background features. Our approach substantially reduces computational overhead while preserving visual fidelity. Extensive experiments demonstrate that our framework achieves significant acceleration (e.g., 2.01 times speedup for Wan2.1) while maintaining visual fidelity across comprehensive quality metrics, establishing a viable method for efficient video generation.
zh

[CV-54] Classic Video Denoising in a Machine Learning World: Robust Fast and Controllable

【速读】:该论文旨在解决传统去噪方法与基于深度学习的去噪方法之间的权衡问题:传统方法在处理真实场景视频时表现可靠且运行效率高,但需要手动调整参数;而基于深度学习的方法虽然在去噪质量上有显著提升,但容易因训练数据分布与实际噪声模式的差异导致意外失效,同时存在速度慢和缺乏用户控制的问题。论文的关键解决方案是提出了一种基于传统去噪方法的可微分(differentiable)去噪管道,并通过神经网络预测每个特定输入的最佳去噪参数,从而实现一种既鲁棒高效又支持用户控制的新型去噪方法。

链接: https://arxiv.org/abs/2504.03136
作者: Xin Jin,Simon Niklaus,Zhoutong Zhang,Zhihao Xia,Chunle Guo,Yuting Yang,Jiawen Chen,Chongyi Li
机构: VCIP, CS, Nankai University (南开大学); Adobe Research (Adobe 研究院); Adobe (Adobe); NKIARI, Shenzhen Futian (深圳福田新型智能研究院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Homepage: this https URL

点击查看摘要

Abstract:Denoising is a crucial step in many video processing pipelines such as in interactive editing, where high quality, speed, and user control are essential. While recent approaches achieve significant improvements in denoising quality by leveraging deep learning, they are prone to unexpected failures due to discrepancies between training data distributions and the wide variety of noise patterns found in real-world videos. These methods also tend to be slow and lack user control. In contrast, traditional denoising methods perform reliably on in-the-wild videos and run relatively quickly on modern hardware. However, they require manually tuning parameters for each input video, which is not only tedious but also requires skill. We bridge the gap between these two paradigms by proposing a differentiable denoising pipeline based on traditional methods. A neural network is then trained to predict the optimal denoising parameters for each specific input, resulting in a robust and efficient approach that also supports user control.
zh

[CV-55] Hierarchical Modeling for Medical Visual Question Answering with Cross-Attention Fusion

【速读】:该论文旨在解决两个主要问题:(1) 层次化建模不完善导致问题级别区分不佳,从而在层次间引发语义碎片化;(2) 过度依赖基于Transformer的跨模态自注意力融合方法中的隐式学习,掩盖了医学场景中的关键局部语义关联。为了解决这些问题,论文提出了一种HiCA-VQA方法,其关键在于设计了两个模块:层次提示(Hierarchical Prompting)用于细粒度医学问题的预对齐,以及层次化答案解码器(Hierarchical Answer Decoders)分别处理不同级别的问题以提升多粒度预测精度。此外,引入了跨注意融合模块,其中图像作为查询,文本作为键值对,进一步增强了模型对局部语义特征的学习能力。实验结果表明,HiCA-VQA框架在Rad-Restruct基准数据集上的表现优于现有最先进的方法。
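文中“图像作为查询、文本作为键值对”的跨注意融合,可以用单头 numpy 草图示意(省略多头与归一化等细节,与论文具体配置可能不同):

```python
import numpy as np

def cross_attention(img_tokens, txt_tokens, wq, wk, wv):
    """图像作查询(Q)、文本作键值(K/V)的单头交叉注意力。

    img_tokens: (N, D);txt_tokens: (M, D);wq/wk/wv: (D, D) 投影矩阵。
    返回 (N, D):每个图像 token 按注意力权重聚合到的文本语义。
    """
    q = img_tokens @ wq
    k = txt_tokens @ wk
    v = txt_tokens @ wv
    scores = q @ k.T / np.sqrt(q.shape[-1])          # (N, M)
    attn = np.exp(scores - scores.max(axis=-1, keepdims=True))
    attn = attn / attn.sum(axis=-1, keepdims=True)   # 按行 softmax
    return attn @ v
```

与自注意力相比,这种方向性设计让每个图像区域显式地向问题文本“取”语义,从而更直接地建立医学场景中局部区域与问题词之间的关联。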

链接: https://arxiv.org/abs/2504.03135
作者: Junkai Zhang,Bin Li,Shoujun Zhou,Yue Du
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Medical Visual Question Answering (Med-VQA) answers clinical questions using medical images, aiding diagnosis. Designing the MedVQA system holds profound importance in assisting clinical diagnosis and enhancing diagnostic accuracy. Building upon this foundation, Hierarchical Medical VQA extends Medical VQA by organizing medical questions into a hierarchical structure and making level-specific predictions to handle fine-grained distinctions. Recently, many studies have proposed hierarchical MedVQA tasks and established datasets, However, several issues still remain: (1) imperfect hierarchical modeling leads to poor differentiation between question levels causing semantic fragmentation across hierarchies. (2) Excessive reliance on implicit learning in Transformer-based cross-modal self-attention fusion methods, which obscures crucial local semantic correlations in medical scenarios. To address these issues, this study proposes a HiCA-VQA method, including two modules: Hierarchical Prompting for fine-grained medical questions and Hierarchical Answer Decoders. The hierarchical prompting module pre-aligns hierarchical text prompts with image features to guide the model in focusing on specific image regions according to question types, while the hierarchical decoder performs separate predictions for questions at different levels to improve accuracy across granularities. The framework also incorporates a cross-attention fusion module where images serve as queries and text as key-value pairs. Experiments on the Rad-Restruct benchmark demonstrate that the HiCA-VQA framework better outperforms existing state-of-the-art methods in answering hierarchical fine-grained questions. This study provides an effective pathway for hierarchical visual question answering systems, advancing medical image understanding.
zh

[CV-56] Joint Retrieval of Cloud properties using Attention-based Deep Learning Models

【速读】:该论文旨在解决云属性反演中的若干关键挑战,包括独立像素近似(IPA)方法在处理三维辐射效应、云边缘误差以及重叠或异质云场时的局限性。同时,针对基于人工智能/机器学习(AI/ML)的深度学习模型存在的内存消耗大、只能反演出单一云属性或难以实现联合反演的问题,论文提出了一种新的解决方案。该方案的关键在于引入CloudUNet结合注意力模块(CAM),其核心创新点包括:(1) 利用注意力机制减少厚云和重叠云区域的误差;(2) 设计专门的损失函数以实现云光学厚度(COT)与云有效半径(CER)的同时反演。实验结果表明,CAM模型在Large Eddy Simulation(LES)数据集上的表现优于现有最先进的深度学习方法,并显著降低了COT和CER的平均绝对误差(MAE)。

链接: https://arxiv.org/abs/2504.03133
作者: Zahid Hassan Tushar,Adeleke Ademakinwa,Jianwu Wang,Zhibo Zhang,Sanjay Purushotham
机构: University of Maryland, Baltimore County (马里兰大学巴尔的摩郡分校)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 6 Pages, 4 figures, to be published in 2025 IEEE International Geoscience and Remote Sensing Symposium (IGARSS 2025)

点击查看摘要

Abstract:Accurate cloud property retrieval is vital for understanding cloud behavior and its impact on climate, including applications in weather forecasting, climate modeling, and estimating Earth’s radiation balance. The Independent Pixel Approximation (IPA), a widely used physics-based approach, simplifies radiative transfer calculations by assuming each pixel is independent of its neighbors. While computationally efficient, IPA has significant limitations, such as inaccuracies from 3D radiative effects, errors at cloud edges, and ineffectiveness for overlapping or heterogeneous cloud fields. Recent AI/ML-based deep learning models have improved retrieval accuracy by leveraging spatial relationships across pixels. However, these models are often memory-intensive, retrieve only a single cloud property, or struggle with joint property retrievals. To overcome these challenges, we introduce CloudUNet with Attention Module (CAM), a compact UNet-based model that employs attention mechanisms to reduce errors in thick, overlapping cloud regions and a specialized loss function for joint retrieval of Cloud Optical Thickness (COT) and Cloud Effective Radius (CER). Experiments on a Large Eddy Simulation (LES) dataset show that our CAM model outperforms state-of-the-art deep learning methods, reducing mean absolute errors (MAE) by 34% for COT and 42% for CER, and achieving 76% and 86% lower MAE for COT and CER retrievals compared to the IPA method.
zh

[CV-57] GraphSeg: Segmented 3D Representations via Graph Edge Addition and Contraction

【速读】:本文旨在解决机器人在非结构化环境中对物体进行准确且一致分割的问题。近期的大模型(如 Segment Anything, SAM)虽在2D图像分割中表现出色,但其能力难以直接迁移到物理3D世界,往往存在过分割现象以及跨视图掩码一致性不足的问题。为此,论文提出GraphSeg框架,仅需环境的稀疏2D图像集而无需任何深度信息,其关键是通过构建双重对应图(分别源自2D像素级相似性和推断的3D结构)将分割问题转化为边添加与随后的图收缩过程,从而将多个2D掩码融合为统一的物体级分割结果,并进一步利用3D基础模型生成分段的3D表示。GraphSeg显著减少了所需图像数量,同时提升了分割精度,在桌面场景中实现了最先进的性能,并提升了下游机器人操作任务的表现。
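其中“边添加后做图收缩、把多视图 2D 掩码合并为物体”这一步,本质上等价于对掩码对应图求连通分量,可以用并查集草图示意(示意性实现,并非论文原代码):

```python
class DSU:
    """并查集:图收缩(连通分量合并)的标准工具。"""
    def __init__(self, n):
        self.parent = list(range(n))

    def find(self, x):
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # 路径减半
            x = self.parent[x]
        return x

    def union(self, a, b):
        self.parent[self.find(a)] = self.find(b)

def contract_masks(num_masks, edges):
    """按掩码间对应边做图收缩,返回每个 2D 掩码所属的物体编号。"""
    dsu = DSU(num_masks)
    for i, j in edges:
        dsu.union(i, j)
    roots, labels = {}, []
    for m in range(num_masks):
        r = dsu.find(m)
        labels.append(roots.setdefault(r, len(roots)))
    return labels

# 5 个掩码,跨视图对应边把 {0,1,2} 与 {3,4} 各并成一个物体
print(contract_masks(5, [(0, 1), (1, 2), (3, 4)]))  # [0, 0, 0, 1, 1]
```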

链接: https://arxiv.org/abs/2504.03129
作者: Haozhan Tang,Tianyi Zhang,Oliver Kroemer,Matthew Johnson-Roberson,Weiming Zhi
机构: Robotics Institute, Carnegie Mellon University (卡内基梅隆大学机器人研究所)
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Systems and Control (eess.SY)
备注:

点击查看摘要

Abstract:Robots operating in unstructured environments often require accurate and consistent object-level representations. This typically requires segmenting individual objects from the robot’s surroundings. While recent large models such as Segment Anything (SAM) offer strong performance in 2D image segmentation, these advances do not translate directly to performance in the physical 3D world, where they often over-segment objects and fail to produce consistent mask correspondences across views. In this paper, we present GraphSeg, a framework for generating consistent 3D object segmentations from a sparse set of 2D images of the environment without any depth information. GraphSeg adds edges to graphs and constructs dual correspondence graphs: one from 2D pixel-level similarities and one from inferred 3D structure. We formulate segmentation as a problem of edge addition, then subsequent graph contraction, which merges multiple 2D masks into unified object-level segmentations. We can then leverage 3D foundation models to produce segmented 3D representations. GraphSeg achieves robust segmentation with significantly fewer images and greater accuracy than prior methods. We demonstrate state-of-the-art performance on tabletop scenes and show that GraphSeg enables improved performance on downstream robotic manipulation tasks. Code available at this https URL.
zh

[CV-58] FontGuard: A Robust Font Watermarking Approach Leveraging Deep Font Knowledge

【速读】:该论文旨在解决AI生成内容带来的版权保护、溯源及合规性等法证与安全问题,特别是针对文本字体水印技术在嵌入容量、视觉质量以及鲁棒性方面的不足。现有字体水印方法通常忽视了字体领域的专业知识,导致水印字体质量低劣、嵌入容量有限,并且容易受到实际应用场景中的失真、低分辨率以及字符分割不准确等问题的影响。

论文提出的关键解决方案是FontGuard,这是一种基于字体模型和语言引导对比学习的新型字体水印模型。FontGuard通过修改字体的隐藏风格特征而非仅限于像素级调整来实现水印嵌入,从而在保证嵌入信息的同时显著提升水印字体的视觉质量。此外,FontGuard利用字体流形生成大量与原始字体高度相似的新变体以增加嵌入容量,并在解码器中采用图像-文本对比学习以增强对真实世界传输失真的鲁棒性。实验结果表明,FontGuard在合成、跨媒体及在线社交网络三种场景下的解码准确率分别提升了+5.4%、+7.4%和+5.8%,同时在LPIPS指标下将水印字体的视觉质量提高了52.7%。此外,FontGuard还具备无需重新训练即可为未见字体生成水印的独特能力。

链接: https://arxiv.org/abs/2504.03128
作者: Kahim Wong,Jicheng Zhou,Kemou Li,Yain-Whar Si,Xiaowei Wu,Jiantao Zhou
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:The proliferation of AI-generated content brings significant concerns on the forensic and security issues such as source tracing, copyright protection, etc, highlighting the need for effective watermarking technologies. Font-based text watermarking has emerged as an effective solution to embed information, which could ensure copyright, traceability, and compliance of the generated text content. Existing font watermarking methods usually neglect essential font knowledge, which leads to watermarked fonts of low quality and limited embedding capacity. These methods are also vulnerable to real-world distortions, low-resolution fonts, and inaccurate character segmentation. In this paper, we introduce FontGuard, a novel font watermarking model that harnesses the capabilities of font models and language-guided contrastive learning. Unlike previous methods that focus solely on the pixel-level alteration, FontGuard modifies fonts by altering hidden style features, resulting in better font quality upon watermark embedding. We also leverage the font manifold to increase the embedding capacity of our proposed method by generating substantial font variants closely resembling the original font. Furthermore, in the decoder, we employ an image-text contrastive learning to reconstruct the embedded bits, which can achieve desirable robustness against various real-world transmission distortions. FontGuard outperforms state-of-the-art methods by +5.4%, +7.4%, and +5.8% in decoding accuracy under synthetic, cross-media, and online social network distortions, respectively, while improving the visual quality by 52.7% in terms of LPIPS. Moreover, FontGuard uniquely allows the generation of watermarked fonts for unseen fonts without re-training the network. The code and dataset are available at this https URL.
zh

[CV-59] NuWa: Deriving Lightweight Task-Specific Vision Transformers for Edge Devices

【速读】:该论文旨在解决视觉Transformer(Vision Transformer, ViT)在边缘设备上的任务特定精度不足以及推理效率低的问题。传统预训练的ViT模型通常针对广泛的任务设计,对于边缘设备而言属于“能力过剩”(over-qualified),因为这些设备往往只需要ViT的部分知识来完成特定任务。这种情况下,ViT在边缘设备上的任务特定精度较低且推理速度较慢。

论文的关键解决方案是提出NuWa方法,通过从基础ViT模型中衍生出专注于特定任务的小型ViT模型,将基础ViT中的任务特定知识有效地迁移到这些小型ViT中。NuWa能够在边缘设备受限资源下最大化模型精度,同时提供推理延迟保证。实验结果表明,与现有最先进的方法相比,NuWa将任务特定精度提升了高达11.83%,并将推理速度提升1.29倍至2.79倍。

链接: https://arxiv.org/abs/2504.03118
作者: Ziteng Wei,Qiang He,Bing Li,Feifei Chen,Yun Yang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 8 pages, 12 figures, 6 tables

点击查看摘要

Abstract:Vision Transformers (ViTs) excel in computer vision tasks but lack flexibility for edge devices’ diverse needs. A vital issue is that ViTs pre-trained to cover a broad range of tasks are over-qualified for edge devices that usually demand only part of a ViT’s knowledge for specific tasks. Their task-specific accuracy on these edge devices is suboptimal. We discovered that small ViTs that focus on device-specific tasks can improve model accuracy and in the meantime, accelerate model inference. This paper presents NuWa, an approach that derives small ViTs from the base ViT for edge devices with specific task requirements. NuWa can transfer task-specific knowledge extracted from the base ViT into small ViTs that fully leverage constrained resources on edge devices to maximize model accuracy with inference latency assurance. Experiments with three base ViTs on three public datasets demonstrate that compared with state-of-the-art solutions, NuWa improves model accuracy by up to 11.83% and accelerates model inference by 1.29x - 2.79x. Code for reproduction is available at this https URL.
zh

[CV-60] Multi-Granularity Vision Fastformer with Fusion Mechanism for Skin Lesion Segmentation

【速读】:该论文旨在解决医学图像分割领域中卷积神经网络(CNN)和视觉变换器(Vision Transformer, ViT)面临的挑战。CNN受限于局部上下文信息,而ViT因二次复杂度导致显著的计算成本,同时区分不同程度严重性的病灶边界也是一项难题。为此,研究提出了一种轻量级U形网络VFFM-UNet,其关键在于结合Vision Fastformer与融合机制(Fusion Mechanism, FM)。具体而言,通过Fastformer的加法注意力机制,采用逐元素乘积(element-wise product)和矩阵乘积实现全面特征提取与通道降维以减少计算开销;设计的融合机制包括多粒度融合和通道融合,能够在粒度和通道层面处理特征图,获取不同上下文信息,从而实现对不同程度严重性病灶边界的精确识别。实验结果表明,VFFM-UNet在参数数量、计算复杂度以及分割性能方面均优于现有最先进的模型。
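Fastformer 加法注意力“用打分向量把序列压成全局查询/全局键,再做逐元素乘积”的线性复杂度机制,可以用如下单头 numpy 草图示意(省略输出变换与多头,细节为简化假设):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def additive_attention(q, k, v, wq, wk):
    """Fastformer 式加法注意力的单头草图(复杂度与序列长度成线性)。

    q, k, v: (N, D);wq, wk: (D,) 可学习打分向量。
    先把整条序列压成一个全局查询,与键做逐元素乘积得到全局键,
    再与值做逐元素乘积;省略输出变换与多头。
    """
    alpha = softmax(q @ wq / np.sqrt(q.shape[-1]))   # (N,) 查询打分
    g_q = alpha @ q                                  # (D,) 全局查询
    p = g_q * k                                      # 逐元素乘积
    beta = softmax(p @ wk / np.sqrt(p.shape[-1]))
    g_k = beta @ p                                   # (D,) 全局键
    return g_k * v                                   # (N, D)
```

与标准自注意力的 N x N 打分矩阵不同,这里每步只维护一个 D 维全局向量,这正是其计算开销较低的原因。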

链接: https://arxiv.org/abs/2504.03108
作者: Xuanyu Liu,Huiyun Yao,Jinggui Gao,Zhongyi Guo,Xue Zhang,Yulin Dong
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Background:Convolutional Neural Networks(CNN) and Vision Transformers(ViT) are the main techniques used in Medical image segmentation. However, CNN is limited to local contextual information, and ViT’s quadratic complexity results in significant computational costs. At the same time, equipping the model to distinguish lesion boundaries with varying degrees of severity is also a challenge encountered in skin lesion segmentation. Purpose:This research aims to optimize the balance between computational costs and long-range dependency modelling and achieve excellent generalization across lesions with different degrees of severity. Methods:we propose a lightweight U-shape network that utilizes Vision Fastformer with Fusion Mechanism (VFFM-UNet). We inherit the advantages of Fastformer’s additive attention mechanism, combining element-wise product and matrix product for comprehensive feature extraction and channel reduction to save computational costs. In order to accurately identify the lesion boundaries with varying degrees of severity, we designed Fusion Mechanism including Multi-Granularity Fusion and Channel Fusion, which can process the feature maps in the granularity and channel levels to obtain different contextual information. Results:Comprehensive experiments on the ISIC2017, ISIC2018 and PH2 datasets demonstrate that VFFM-UNet outperforms existing state-of-the-art models regarding parameter numbers, computational complexity and segmentation performance. In short, compared to MISSFormer, our model achieves superior segmentation performance while reducing parameter and computation costs by 101x and 15x, respectively. Conclusions:Both quantitative and qualitative analyses show that VFFM-UNet sets a new benchmark by reaching an ideal balance between parameter numbers, computational complexity, and segmentation performance compared to existing state-of-the-art models.
zh

[CV-61] Scaling Open-Vocabulary Action Detection

【速读】:该论文旨在解决开放词汇动作检测(open-vocabulary action detection)中的两个关键挑战:(1) 缺乏大规模多动作类别的数据集以实现稳健训练;(2) 将预训练视觉-语言对比模型扩展到检测任务时,由于参数密集型适应可能导致对基础动作类别的过拟合。论文的关键解决方案包括:(1) 提出一种仅编码器的多模态视频动作检测模型,减少对参数密集型附加组件的依赖;(2) 引入一种简单的弱监督训练策略,利用现有的闭集动作检测数据集进行预训练;(3) 设计一个新的基准测试方法,无需使用现有闭集动作检测数据集进行训练即可评估性能,为未来研究提供基线结果。

链接: https://arxiv.org/abs/2504.03096
作者: Zhen Hao Sia,Yogesh Singh Rawat
机构: University of Central Florida (中佛罗里达大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:In this work, we focus on scaling open-vocabulary action detection. Existing approaches for action detection are predominantly limited to closed-set scenarios and rely on complex, parameter-heavy architectures. Extending these models to the open-vocabulary setting poses two key challenges: (1) the lack of large-scale datasets with many action classes for robust training, and (2) parameter-heavy adaptations to a pretrained vision-language contrastive model to convert it for detection, risking overfitting the additional non-pretrained parameters to base action classes. Firstly, we introduce an encoder-only multimodal model for video action detection, reducing the reliance on parameter-heavy additions for video action detection. Secondly, we introduce a simple weakly supervised training strategy to exploit an existing closed-set action detection dataset for pretraining. Finally, we depart from the ill-posed base-to-novel benchmark used by prior works in open-vocabulary action detection and devise a new benchmark to evaluate on existing closed-set action detection datasets without ever using them for training, showing novel results to serve as baselines for future work.
zh

[CV-62] SLACK: Attacking LiDAR-based SLAM with Adversarial Point Injections

【速读】:该论文旨在解决基于学习的方法在LiDAR应用中使自动驾驶车辆易受对抗性攻击(adversarial point injections, PiJ)的问题,这种攻击对导航和地图生成构成严重安全挑战。现有研究缺乏对基于LiDAR的SLAM系统中学习型攻击的深入探讨。为应对这一问题,论文提出了一种端到端的深度生成对抗模型SLACK,通过少量点注入实现对LiDAR扫描的攻击,同时保持LiDAR数据质量不受损害。解决方案的关键在于设计了一种新颖且简单的自动编码器,该编码器结合对比学习与基于分割的注意力机制,以实现精确的重建效果,从而有效提升PiJ攻击的性能。实验结果显示,SLACK在KITTI和CARLA-64数据集上的表现优于现有最佳基线,同时验证了其在降低导航和地图质量方面的有效性,而不会影响LiDAR扫描的整体质量。

链接: https://arxiv.org/abs/2504.03089
作者: Prashant Kumar,Dheeraj Vattikonda,Kshitij Madhav Bhat,Kunal Dargan,Prem Kalra
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:The widespread adoption of learning-based methods for LiDAR makes autonomous vehicles vulnerable to adversarial attacks through adversarial point injections (PiJ). This poses serious security challenges for navigation and map generation. Despite its critical nature, no major work exists that studies learning-based attacks on LiDAR-based SLAM. Our work proposes SLACK, an end-to-end deep generative adversarial model to attack LiDAR scans with several point injections without deteriorating LiDAR quality. To facilitate SLACK, we design a novel yet simple autoencoder that augments contrastive learning with segmentation-based attention for precise reconstructions. SLACK demonstrates superior performance on the task of point injections (PiJ) compared to the best baselines on the KITTI and CARLA-64 datasets while maintaining accurate scan quality. We qualitatively and quantitatively demonstrate PiJ attacks using a fraction of LiDAR points. It severely degrades navigation and map quality without deteriorating the LiDAR scan quality.
zh

[CV-63] How I Warped Your Noise: a Temporally-Correlated Noise Prior for Diffusion Models ICLR2024

【速读】:该论文试图解决视频编辑与生成过程中因传统噪声采样技术无法保留视频相邻帧间的时间相关性而导致的高频闪烁或纹理粘连等质量问题。论文的关键解决方案在于提出了一种新的噪声表示方法——积分噪声(∫-noise),它将离散的噪声样本重新解释为连续积分的噪声场,使得像素值反映的是底层无限分辨率噪声在整个像素区域上的积分而非离散值。此外,论文设计了一种专门的传输方法,利用积分噪声在帧序列中精确传递噪声样本,从而最大化不同帧之间的相关性同时保持噪声特性。这种创新方法能够广泛应用于视频修复、代理渲染及条件视频生成等任务。
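上述"积分噪声"的核心性质可以用一个极简的 NumPy 草图来说明(这并非论文的官方实现,网格倍数 k 等均为示意取值):像素值取为精细噪声场在像素区域上的均值("积分"的离散近似),再乘以 k 重新缩放,使结果仍服从扩散模型所需的单位高斯分布。

```python
import numpy as np

def integral_noise(h, w, k, rng):
    """在 k 倍精细网格上采样白噪声, 对每个像素区域求均值并重新缩放。

    k*k 个 i.i.d. N(0,1) 的均值方差为 1/k^2, 乘以 k 后方差恢复为 1,
    因此"积分"后的噪声仍满足扩散模型要求的高斯先验。
    """
    fine = rng.standard_normal((h * k, w * k))           # 无限分辨率噪声的离散近似
    coarse = fine.reshape(h, k, w, k).mean(axis=(1, 3))  # 像素区域上的"积分"(均值)
    return coarse * k                                    # 重新缩放, 保持单位方差

rng = np.random.default_rng(0)
noise = integral_noise(64, 64, 4, rng)
print(noise.shape, round(float(noise.var()), 2))
```

这种表示的好处是:对精细噪声场做连续形变(advection)后再积分,即可得到帧间相关但仍为高斯的噪声序列。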

链接: https://arxiv.org/abs/2504.03072
作者: Pascal Chang,Jingwei Tang,Markus Gross,Vinicius C. Azevedo
机构: ETH Zürich (苏黎世联邦理工学院)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: Accepted at ICLR 2024 (Oral)

点击查看摘要

Abstract:Video editing and generation methods often rely on pre-trained image-based diffusion models. During the diffusion process, however, the reliance on rudimentary noise sampling techniques that do not preserve correlations present in subsequent frames of a video is detrimental to the quality of the results. This either produces high-frequency flickering, or texture-sticking artifacts that are not amenable to post-processing. With this in mind, we propose a novel method for preserving temporal correlations in a sequence of noise samples. This approach is materialized by a novel noise representation, dubbed ∫-noise (integral noise), that reinterprets individual noise samples as a continuously integrated noise field: pixel values do not represent discrete values, but are rather the integral of an underlying infinite-resolution noise over the pixel area. Additionally, we propose a carefully tailored transport method that uses ∫-noise to accurately advect noise samples over a sequence of frames, maximizing the correlation between different frames while also preserving the noise properties. Our results demonstrate that the proposed ∫-noise can be used for a variety of tasks, such as video restoration, surrogate rendering, and conditional video generation. See this https URL for video results.
zh

[CV-64] Compressing 3D Gaussian Splatting by Noise-Substituted Vector Quantization

【速读】:该论文旨在解决3D Gaussian Splatting (3DGS) 在3D重建中的高存储成本问题。传统方法在重建单个场景时通常需要数百万个高斯点(Gaussian splats),每个点由59个浮点参数表示,导致内存消耗高达约1 GB。为应对这一挑战,论文提出了一种通过构建独立的属性码本并将离散码索引存储来实现压缩的方法。关键在于采用噪声替代的向量量化(noise-substituted vector quantization)技术,同时训练码本与模型特征,确保梯度下降优化与参数离散化之间的一致性。此方案在保持竞争性重建质量的同时,将内存消耗减少了约45倍,并提升了渲染速度,使其适用于实际应用。
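"噪声替代的向量量化"可以粗略理解为:训练时不做硬量化,而是加入与量化步长同尺度的噪声以保持梯度可导;推理时才做真正的最近邻码本查找,只需存储离散码索引。以下为一个假设性的 NumPy 草图(码本大小、维度与噪声尺度均为示意取值,并非论文实现):

```python
import numpy as np

def quantize(features, codebook, training, rng=None):
    """features: (N, D); codebook: (K, D)。返回量化结果与码索引。"""
    if training:
        # 训练阶段: 用加性噪声模拟量化误差, 使梯度下降优化
        # 与参数离散化保持一致(噪声替代的基本思想)。
        step = np.abs(features).mean()  # 粗略的噪声尺度(示意)
        noise = rng.uniform(-0.5, 0.5, features.shape) * step
        return features + noise, None
    # 推理阶段: 最近邻查找, 之后只需存储整数索引即可重建特征
    d = ((features[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    idx = d.argmin(axis=1)
    return codebook[idx], idx

rng = np.random.default_rng(1)
codebook = rng.standard_normal((16, 8))   # K=16 个码字, 每个 8 维(示意)
feats = codebook[rng.integers(0, 16, 32)] + 0.01 * rng.standard_normal((32, 8))
recon, idx = quantize(feats, codebook, training=False)
print(idx.shape)  # 每个特征只剩一个整数索引, 这正是压缩存储的来源
```

将 59 维浮点参数替换为若干个小整数索引加一份共享码本,正是约 45 倍压缩的直观来源。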

链接: https://arxiv.org/abs/2504.03059
作者: Haishan Wang,Mohammad Hassan Vali,Arno Solin
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:3D Gaussian Splatting (3DGS) has demonstrated remarkable effectiveness in 3D reconstruction, achieving high-quality results with real-time radiance field rendering. However, a key challenge is the substantial storage cost: reconstructing a single scene typically requires millions of Gaussian splats, each represented by 59 floating-point parameters, resulting in approximately 1 GB of memory. To address this challenge, we propose a compression method by building separate attribute codebooks and storing only discrete code indices. Specifically, we employ a noise-substituted vector quantization technique to jointly train the codebooks and model features, ensuring consistency between gradient descent optimization and parameter discretization. Our method reduces the memory consumption efficiently (around 45×) while maintaining competitive reconstruction quality on standard 3D benchmark scenes. Experiments on different codebook sizes show the trade-off between compression ratio and image quality. Furthermore, the trained compressed model remains fully compatible with popular 3DGS viewers and enables faster rendering speed, making it well-suited for practical applications.
zh

[CV-65] Cooperative Inference for Real-Time 3D Human Pose Estimation in Multi-Device Edge Networks

【速读】:该论文致力于解决在资源受限且动态变化的环境中,实时三维(3D)人体姿态估计面临的高计算复杂性挑战。为应对这一问题,论文提出了一种新颖的合作推理方法,用于移动边缘计算(MEC)网络中的实时3D人体姿态估计。该方法的关键在于,配备轻量级推理模型的多个终端设备利用双重置信阈值过滤模糊图像,仅将筛选后的图像上传至边缘服务器以供更强大的推理模型重新评估,从而在计算和通信约束下提高估计精度。此外,通过数值分析推理方法的准确性与端到端延迟,并制定联合优化问题以推导每个设备的最佳置信阈值和传输时间,目标是最小化平均关节位置误差(MPJPE),同时满足所需的端到端延迟约束。论文进一步证明最小化MPJPE等价于最大化所有设备的推理准确性总和,将问题分解为可管理的子问题,并提出一种低复杂度优化算法以获得近似最优解。实验结果表明,MPJPE与端到端延迟之间存在权衡关系,并确认所提出的合作推理方法通过最佳选择置信阈值和传输时间显著降低了MPJPE,同时始终满足各种MEC环境下的端到端延迟需求。
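文中"双重置信阈值"的卸载逻辑可以用几行代码示意:置信度落在两个阈值之间的图像被视为"模糊",才会上传边缘服务器重估,其余直接采用本地轻量模型的结果。注意阈值在论文中是按设备联合优化得到的变量,此处数值仅为假设:

```python
def offload_decision(confidence, t_low=0.3, t_high=0.8):
    """终端设备的卸载决策(示意)。

    若本地模型的置信度落在 (t_low, t_high) 之间, 视为模糊图像,
    上传边缘服务器用更强模型重估; 否则直接采用本地结果,
    从而在计算与通信约束下兼顾精度与时延。
    """
    return "offload" if t_low < confidence < t_high else "local"

for c in (0.9, 0.5, 0.1):
    print(c, offload_decision(c))
```

实际系统中,t_low、t_high 与传输时间一起作为优化变量,在满足端到端时延约束的前提下最小化 MPJPE。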

链接: https://arxiv.org/abs/2504.03052
作者: Hyun-Ho Choi,Kangsoo Kim,Ki-Ho Lee,Kisong Lee
机构: School of ICT, Robotics & Mechanical Engineering, Hankyong National University (韩京国立大学); Department of Electrical and Software Engineering, Schulich School of Engineering, University of Calgary (卡尔加里大学); School of the Electrical Engineering, Chung-Ang University (中央大学); Department of Information and Communication Engineering, Dongguk University (东国大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 13 pages, 12 figures

点击查看摘要

Abstract:Accurate and real-time three-dimensional (3D) pose estimation is challenging in resource-constrained and dynamic environments owing to its high computational complexity. To address this issue, this study proposes a novel cooperative inference method for real-time 3D human pose estimation in mobile edge computing (MEC) networks. In the proposed method, multiple end devices equipped with lightweight inference models employ dual confidence thresholds to filter ambiguous images. Only the filtered images are offloaded to an edge server with a more powerful inference model for re-evaluation, thereby improving the estimation accuracy under computational and communication constraints. We numerically analyze the performance of the proposed inference method in terms of the inference accuracy and end-to-end delay and formulate a joint optimization problem to derive the optimal confidence thresholds and transmission time for each device, with the objective of minimizing the mean per-joint position error (MPJPE) while satisfying the required end-to-end delay constraint. To solve this problem, we demonstrate that minimizing the MPJPE is equivalent to maximizing the sum of the inference accuracies for all devices, decompose the problem into manageable subproblems, and present a low-complexity optimization algorithm to obtain a near-optimal solution. The experimental results show that a trade-off exists between the MPJPE and end-to-end delay depending on the confidence thresholds. Furthermore, the results confirm that the proposed cooperative inference method achieves a significant reduction in the MPJPE through the optimal selection of confidence thresholds and transmission times, while consistently satisfying the end-to-end delay requirement in various MEC environments.
zh

[CV-66] Attention-Aware Multi-View Pedestrian Tracking

【速读】:该论文旨在解决多目标跟踪中的遮挡问题,特别是在多摄像机设置下,传统视角转换导致鸟瞰图(Bird’s Eye View, BEV)特征图产生显著失真,从而影响行人外观特征鲁棒性的问题。论文的关键解决方案是提出了一种结合注意力机制的新型多视图行人跟踪模型。该模型采用早期融合策略进行检测,并利用交叉注意力机制在不同帧之间建立稳健的行人关联,同时高效传播行人特征跨帧,从而为每个行人生成更鲁棒的特征表示。实验结果表明,该方法在Wildtrack数据集上的IDF1得分为96.1%,在MultiviewX数据集上的IDF1得分为85.7%,优于现有最先进的方法。
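跨注意力做帧间行人关联的核心是标准的缩放点积注意力:注意力权重矩阵本身可以视为一个"软"的跨帧关联矩阵。以下 NumPy 草图只演示该机制本身,与论文的具体网络结构无关:

```python
import numpy as np

def cross_attention(q, k, v):
    """q: 当前帧行人特征 (N, D); k, v: 上一帧行人特征 (M, D)。

    输出为按相似度加权聚合的上一帧特征, 可用于跨帧传播行人表征;
    attn 的每一行是当前帧某行人与上一帧所有行人的软关联分布。
    """
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)                 # 帧间两两相似度
    scores -= scores.max(axis=-1, keepdims=True)  # 数值稳定的 softmax
    attn = np.exp(scores)
    attn /= attn.sum(axis=-1, keepdims=True)      # 每行和为 1
    return attn @ v, attn

rng = np.random.default_rng(0)
out, attn = cross_attention(rng.standard_normal((3, 16)),
                            rng.standard_normal((5, 16)),
                            rng.standard_normal((5, 16)))
print(out.shape, attn.shape)
```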

链接: https://arxiv.org/abs/2504.03047
作者: Reef Alturki,Adrian Hilton,Jean-Yves Guillemaut
机构: Centre for Vision, Speech and Signal Processing, University of Surrey (萨里大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:In spite of the recent advancements in multi-object tracking, occlusion poses a significant challenge. Multi-camera setups have been used to address this challenge by providing a comprehensive coverage of the scene. Recent multi-view pedestrian detection models have highlighted the potential of an early-fusion strategy, projecting feature maps of all views to a common ground plane or the Bird’s Eye View (BEV), and then performing detection. This strategy has been shown to improve both detection and tracking performance. However, the perspective transformation results in significant distortion on the ground plane, affecting the robustness of the appearance features of the pedestrians. To tackle this limitation, we propose a novel model that incorporates attention mechanisms in a multi-view pedestrian tracking scenario. Our model utilizes an early-fusion strategy for detection, and a cross-attention mechanism to establish robust associations between pedestrians in different frames, while efficiently propagating pedestrian features across frames, resulting in a more robust feature representation for each pedestrian. Extensive experiments demonstrate that our model outperforms state-of-the-art models, with an IDF1 score of 96.1% on Wildtrack dataset, and 85.7% on MultiviewX dataset.
zh

[CV-67] Sliced Wasserstein Discrepancy in Disentangling Representation and Adaptation Networks for Unsupervised Domain Adaptation

【速读】:该论文致力于解决无监督领域适应(Unsupervised Domain Adaptation, UDA)中图像内容与风格表征解耦的问题。现有方法通常采用Gram矩阵损失来捕捉风格差异,但其局限性在于对风格变化的表达可能不够鲁棒。为此,论文提出了一种名为DRANet-SWD的新方法,关键在于引入切片Wasserstein散度(Sliced Wasserstein Discrepancy, SWD)作为风格损失替代传统的Gram矩阵损失。SWD通过提供更稳健的特征分布统计比较,增强了风格适应能力,从而提升了UDA任务的性能。实验结果表明,该方法在数字分类数据集和驾驶场景分割任务中均表现出色,验证了SWD在优化特征对齐和改善领域适应方面的有效性。
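切片 Wasserstein 散度的标准做法是:将特征随机投影到若干一维方向上,利用一维 Wasserstein-2 距离的闭式解(排序后逐位比较),再对各方向取平均。下面是一个与具体网络无关的最小实现(投影数等超参数为示意取值):

```python
import numpy as np

def swd(x, y, n_proj=64, rng=None):
    """x, y: (N, D) 特征集合。对 n_proj 个随机单位方向分别做一维投影,
    排序后计算 L2 距离(一维 Wasserstein-2 的闭式解), 再取平均。"""
    rng = rng or np.random.default_rng(0)
    d = x.shape[1]
    theta = rng.standard_normal((d, n_proj))
    theta /= np.linalg.norm(theta, axis=0, keepdims=True)  # 单位方向
    px, py = np.sort(x @ theta, axis=0), np.sort(y @ theta, axis=0)
    return float(((px - py) ** 2).mean())

rng = np.random.default_rng(0)
a = rng.standard_normal((256, 32))
b = rng.standard_normal((256, 32))
print(swd(a, a) == 0.0, swd(a, b) < swd(a, b + 3.0))
```

与 Gram 矩阵损失只比较二阶统计量不同,SWD 直接比较两组特征的经验分布,这正是文中认为其对风格差异更鲁棒的原因。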

链接: https://arxiv.org/abs/2504.03043
作者: Joel Sol,Shadi Alijani,Homayoun Najjaran
机构: Faculty of Electrical and Computer Engineering, University of Victoria (维多利亚大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 6 pages, 3 figures, submitted to IEEE conference

点击查看摘要

Abstract:This paper introduces DRANet-SWD, an extension of existing work that disentangles content and style representations of images for unsupervised domain adaptation (UDA). The approach builds upon DRANet by incorporating the sliced Wasserstein discrepancy (SWD) as a style loss instead of the traditional Gram matrix loss. The potential advantages of SWD over the Gram matrix loss for capturing style variations in domain adaptation are investigated. Experiments using digit classification datasets and driving scenario segmentation validate the method, demonstrating that DRANet-SWD enhances performance. Results indicate that SWD provides a more robust statistical comparison of feature distributions, leading to better style adaptation. These findings highlight the effectiveness of SWD in refining feature alignment and improving domain adaptation tasks across these benchmarks. Our code can be found here.
zh

[CV-68] VIP: Video Inpainting Pipeline for Real World Human Removal

【速读】:该论文致力于解决高分辨率视频片段中真实场景下人类及行人移除的难题,特别是在保证高质量结果、时间一致性以及处理复杂对象交互(如人类及其随身物品和影子)方面的挑战。论文提出了一种名为VIP(Video Inpainting Pipeline)的新颖无提示视频修复框架,用于实际应用中的人类移除任务。VIP通过引入运动模块增强现有最先进的文本到视频模型,并利用变分自编码器(Variational Autoencoder, VAE)在潜在空间中进行渐进去噪。此外,还实现了高效的包含随身物品的人体分割以生成精确的掩码。关键解决方案包括开发VIP流水线、参考帧集成技术以及Dual-Fusion潜在片段细化方法,这些方法共同应对长高分辨率视频序列修复中的复杂性。

链接: https://arxiv.org/abs/2504.03041
作者: Huiming Sun,Yikang Li,Kangning Yang,Ruineng Li,Daitao Xing,Yangbo Xie,Lan Fu,Kaiyu Zhang,Ming Chen,Jiaming Ding,Jiang Geng,Jie Cai,Zibo Meng,Chiuman Ho
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Inpainting for real-world human and pedestrian removal in high-resolution video clips presents significant challenges, particularly in achieving high-quality outcomes, ensuring temporal consistency, and managing complex object interactions that involve humans, their belongings, and their shadows. In this paper, we introduce VIP (Video Inpainting Pipeline), a novel promptless video inpainting framework for real-world human removal applications. VIP enhances a state-of-the-art text-to-video model with a motion module and employs a Variational Autoencoder (VAE) for progressive denoising in the latent space. Additionally, we implement an efficient human-and-belongings segmentation for precise mask generation. Sufficient experimental results demonstrate that VIP achieves superior temporal consistency and visual fidelity across diverse real-world scenarios, surpassing state-of-the-art methods on challenging datasets. Our key contributions include the development of the VIP pipeline, a reference frame integration technique, and the Dual-Fusion Latent Segment Refinement method, all of which address the complexities of inpainting in long, high-resolution video sequences.
zh

[CV-69] HALO: Human-Aligned End-to-end Image Retargeting with Layered Transformations

【速读】:该论文试图解决图像重定目标(image retargeting)任务中因宽高比变化而产生的视觉伪影(visual artifacts)、内容丢失或结构破坏的问题。现有方法在这些方面表现不佳,无法同时保持原始图像的内容与结构。为了解决这些问题,论文提出了一种端到端可训练的解决方案HALO。其关键在于将输入图像分解为显著区域(salient areas)与非显著区域(non-salient areas)两层,并为不同层次应用不同的变形场(warping fields)。此外,为了进一步减少输出图像中的结构失真,论文引入了感知结构相似性损失(perceptual structure similarity loss),该损失函数能够量化输入与输出图像之间的结构相似性,并与人类感知相一致。定量评估和用户研究均表明,HALO在RetargetMe数据集上达到了当前最优水平(SOTA),尤其在用户偏好测试中,HALO比基线方法平均提高了18.4%的用户满意度。

链接: https://arxiv.org/abs/2504.03026
作者: Yiran Xu,Siqi Xie,Zhuofang Li,Harris Shadmany,Yinxiao Li,Luciano Sbaiz,Miaosen Wang,Junjie Ke,Jose Lezama,Hang Qi,Han Zhang,Jesse Berent,Ming-Hsuan Yang,Irfan Essa,Jia-Bin Huang,Feng Yang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Image retargeting aims to change the aspect-ratio of an image while maintaining its content and structure with fewer visual artifacts. Existing methods still generate many artifacts or fail to maintain original content or structure. To address this, we introduce HALO, an end-to-end trainable solution for image retargeting. Since humans are more sensitive to distortions in salient areas than non-salient areas of an image, HALO decomposes the input image into salient/non-salient layers and applies different warping fields to different layers. To further minimize the structure distortion in the output images, we propose a perceptual structure similarity loss which measures the structure similarity between input and output images and aligns with human perception. Both quantitative results and a user study on the RetargetMe dataset show that HALO achieves SOTA. Especially, our method achieves an 18.4% higher user preference compared to the baselines on average.
zh

[CV-70] Page Classification for Print Imaging Pipeline

【速读】:该论文旨在解决多类别图像分类的问题,目标是区分五种类型的图像:纯文本、纯图片、图文混合、收据以及高亮显示内容。论文的关键在于提出了一种基于支持向量机(SVM)的高级分类方法,并引入了四个新的特征,以提升分类的准确性与适用性。

链接: https://arxiv.org/abs/2504.03020
作者: Shaoyuan Xu,Cheng Lu,Mark Shaw,Peter Bauer,Jan P. Allebach
机构: School of Electrical and Computer Engineering, Purdue University (普渡大学); HP Inc. (惠普公司)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Digital copiers and printers are widely used nowadays. One of the most important things people care about is copying or printing quality. In order to improve it, we previously came up with an SVM-based classification method to classify images with only text, only pictures or a mixture of both based on the fact that modern copiers and printers are equipped with processing pipelines designed specifically for different kinds of images. However, in some other applications, we need to distinguish more than three classes. In this paper, we develop a more advanced SVM-based classification method using four more new features to classify 5 types of images which are text, picture, mixed, receipt and highlight.
zh

[CV-71] Comprehensive Relighting: Generalizable and Consistent Monocular Human Relighting and Harmonization CVPR2025

【速读】:本文旨在解决在任意场景中,通过单一图像或视频控制并协调人体任意身体部位照明的问题。由于缺乏相关数据集,现有的基于图像的重照明方法通常局限于特定场景(如人脸或静态人体),难以泛化。为应对这一挑战,论文的关键在于重新利用预训练的扩散模型作为通用图像先验,并在从粗到细的框架中联合建模人体重照明与背景和谐化。此外,引入了一种无监督的时间光照模型,通过学习大量真实世界视频中的光照循环一致性,进一步增强重照明的时间连贯性。在推理阶段,时间光照模块通过时空特征融合算法与扩散模型结合,无需额外训练;同时采用新的引导细化作为后处理,以保留输入图像的高频细节。实验结果表明,该方法在泛化能力和光照时间连贯性方面表现优异,优于现有基于图像的人体重照明与和谐化方法。

链接: https://arxiv.org/abs/2504.03011
作者: Junying Wang,Jingyuan Liu,Xin Sun,Krishna Kumar Singh,Zhixin Shu,He Zhang,Jimei Yang,Nanxuan Zhao,Tuanfeng Y. Wang,Simon S. Chen,Ulrich Neumann,Jae Shin Yoon
机构: University of Southern California (南加州大学); Adobe Research (Adobe 研究院); Runway
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project page: this https URL . Accepted by CVPR 2025

点击查看摘要

Abstract:This paper introduces Comprehensive Relighting, the first all-in-one approach that can both control and harmonize the lighting from an image or video of humans with arbitrary body parts from any scene. Building such a generalizable model is extremely challenging due to the lack of dataset, restricting existing image-based relighting models to a specific scenario (e.g., face or static human). To address this challenge, we repurpose a pre-trained diffusion model as a general image prior and jointly model the human relighting and background harmonization in the coarse-to-fine framework. To further enhance the temporal coherence of the relighting, we introduce an unsupervised temporal lighting model that learns the lighting cycle consistency from many real-world videos without any ground truth. In inference time, our temporal lighting module is combined with the diffusion models through the spatio-temporal feature blending algorithms without extra training; and we apply a new guided refinement as a post-processing to preserve the high-frequency details from the input image. In the experiments, Comprehensive Relighting shows a strong generalizability and lighting temporal coherence, outperforming existing image-based human relighting and harmonization methods.
zh

[CV-72] Emotion Recognition Using Convolutional Neural Networks

【速读】:该论文旨在解决如何通过深度学习技术实现面部表情的情绪识别问题,具体目标是对静止图像和实时视频中的七种情绪(愤怒、厌恶、恐惧、快乐、中性、悲伤和惊讶)进行检测和分类。解决方案的关键在于从零开始构建了一个包含数据集收集、数据预处理、模型训练和测试的完整情绪识别分类与回归系统,并利用卷积神经网络在两种不同数据集上的测试验证了系统的有效性,达到了超过80%的准确率,同时证明了其在实时情绪检测中的可行性和高效性。

链接: https://arxiv.org/abs/2504.03010
作者: Shaoyuan Xu,Yang Cheng,Qian Lin,Jan P. Allebach
机构: School of Electrical and Computer Engineering, Purdue University (普渡大学); HP Labs (惠普实验室)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Emotion has an important role in daily life, as it helps people better communicate with and understand each other more efficiently. Facial expressions can be classified into 7 categories: angry, disgust, fear, happy, neutral, sad and surprise. How to detect and recognize these seven emotions has become a popular topic in the past decade. In this paper, we develop an emotion recognition system that can apply emotion recognition on both still images and real-time videos by using deep learning. We build our own emotion recognition classification and regression system from scratch, which includes dataset collection, data preprocessing, model training and testing. Given a certain image or a real-time video, our system is able to show the classification and regression results for all of the 7 emotions. The proposed system is tested on 2 different datasets, and achieved an accuracy of over 80%. Moreover, the result obtained from real-time testing proves the feasibility of implementing convolutional neural networks in real time to detect emotions accurately and efficiently.
zh

[CV-73] DiSRT-In-Bed: Diffusion-Based Sim-to-Real Transfer Framework for In-Bed Human Mesh Recovery CVPR2025

【速读】:该论文旨在解决在床人体网格恢复领域因隐私和成本限制难以收集大规模真实视觉数据的问题,这阻碍了深度学习模型的训练与部署。此外,现有的在床人体网格估计方法过度依赖真实数据,导致其在不同场景(如不同覆盖物和环境设置)中的泛化能力有限。为了解决这些问题,论文提出了一种针对从顶部深度图像进行在床人体网格恢复的Sim-to-Real Transfer框架,该框架利用大规模合成数据以及少量或无真实样本。解决方案的关键在于引入了一个扩散模型,用于弥合合成数据与真实数据之间的差距,从而支持实际在床姿态和身体推断场景下的泛化能力。通过广泛的实验和消融研究验证了该框架的有效性,展示了其在多种医疗保健场景中的鲁棒性和适应性的显著提升。

链接: https://arxiv.org/abs/2504.03006
作者: Jing Gao,Ce Zheng,Laszlo A. Jeni,Zackory Erickson
机构: Carnegie Mellon University (卡内基梅隆大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 16 pages, 19 figures. Accepted to CVPR 2025

点击查看摘要

Abstract:In-bed human mesh recovery can be crucial and enabling for several healthcare applications, including sleep pattern monitoring, rehabilitation support, and pressure ulcer prevention. However, it is difficult to collect large real-world visual datasets in this domain, in part due to privacy and expense constraints, which in turn presents significant challenges for training and deploying deep learning models. Existing in-bed human mesh estimation methods often rely heavily on real-world data, limiting their ability to generalize across different in-bed scenarios, such as varying coverings and environmental settings. To address this, we propose a Sim-to-Real Transfer Framework for in-bed human mesh recovery from overhead depth images, which leverages large-scale synthetic data alongside limited or no real-world samples. We introduce a diffusion model that bridges the gap between synthetic data and real data to support generalization in real-world in-bed pose and body inference scenarios. Extensive experiments and ablation studies validate the effectiveness of our framework, demonstrating significant improvements in robustness and adaptability across diverse healthcare scenarios.
zh

[CV-74] Noise-Aware Generalization: Robustness to In-Domain Noise and Out-of-Domain Generalization

【速读】:该论文旨在解决多源领域泛化(Multi-source Domain Generalization, DG)中因标签噪声导致模型性能下降的问题,提出了一种新的框架以应对同时存在分布偏移和标签噪声的情境,即噪声感知泛化(Noise-Aware Generalization, NAG)。传统方法通常假设分布偏移与标签噪声相关联,但这种关联在NAG场景下可能不成立。论文的关键创新在于提出的DL4ND方法,通过利用跨域样本的差异性来增强对标签噪声的检测能力,从而有效区分标签噪声引起的分布变化与真实域间分布偏移,避免了简单假定域标签带来的信息损失。实验结果表明,DL4ND在四个不同数据集上的性能显著提升,为解决NAG问题提供了有前景的方向。
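"噪声样本在单一域内难以区分、跨域比较时偏离更明显"这一观察,可以用一个玩具化的检测草图来说明。注意这只是对该思想的示意,函数名与合成数据均为假设,并非 DL4ND 的真实算法:

```python
import numpy as np

def cross_domain_noise_scores(feats, labels, domains):
    """对每个样本, 计算其到"同类但来自其他域"样本均值的距离。

    干净样本与跨域同类中心仍然接近(仅有域偏移);
    标签噪声样本往往偏离明显, 得分越大越可疑。"""
    feats, labels, domains = map(np.asarray, (feats, labels, domains))
    scores = np.zeros(len(feats))
    for i in range(len(feats)):
        mask = (labels == labels[i]) & (domains != domains[i])
        centroid = feats[mask].mean(axis=0)
        scores[i] = np.linalg.norm(feats[i] - centroid)
    return scores

rng = np.random.default_rng(0)
# 两个域、两个类: 类中心分别在 0 和 5, 域间仅有 0.2 的小幅偏移
f = np.concatenate([rng.normal(c + 0.2 * d, 0.1, (20, 4))
                    for d in (0, 1) for c in (0.0, 5.0)])
y = np.array([0] * 20 + [1] * 20 + [0] * 20 + [1] * 20)
dom = np.array([0] * 40 + [1] * 40)
y_noisy = y.copy(); y_noisy[0] = 1   # 人为翻转一个标签, 注入噪声
s = cross_domain_noise_scores(f, y_noisy, dom)
print(int(s.argmax()) == 0)  # 被翻错标签的样本得分最高
```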

链接: https://arxiv.org/abs/2504.02996
作者: Siqi Wang,Aoming Liu,Bryan A. Plummer
机构: Boston University (波士顿大学)
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Multi-source Domain Generalization (DG) aims to improve model robustness to new distributions. However, DG methods often overlook the effect of label noise, which can confuse a model during training, reducing performance. Limited prior work has analyzed DG method’s noise-robustness, typically focused on an analysis of existing methods rather than new solutions. In this paper, we investigate this underexplored space, where models are evaluated under both distribution shifts and label noise, which we refer to as Noise-Aware Generalization (NAG). A natural solution to address label noise would be to combine a Learning with Noisy Labels (LNL) method with those from DG. Many LNL methods aim to detect distribution shifts in a class’s samples, i.e., they assume that distribution shifts often correspond to label noise. However, in NAG distribution shifts can be due to label noise or domain shifts, breaking the assumptions used by LNL methods. A naive solution is to make a similar assumption made by many DG methods, where we presume to have domain labels during training, enabling us to isolate the two types of shifts. However, this ignores valuable cross-domain information. Specifically, our proposed DL4ND approach improves noise detection by taking advantage of the observation that noisy samples that may appear indistinguishable within a single domain often show greater variation when compared across domains. Experiments show that DL4ND significantly improves performance across four diverse datasets, offering a promising direction for tackling NAG.
zh

[CV-75] VARGPT -v1.1: Improve Visual Autoregressive Large Unified Model via Iterative Instruction Tuning and Reinforcement Learning

【速读】:该论文试图解决多模态理解与文本到图像指令跟随任务中的性能瓶颈问题,同时探索视觉理解、生成及编辑功能的统一模型设计。解决方案的关键在于提出VARGPT-v1.1,通过引入一种结合迭代视觉指令微调与直接偏好优化(Direct Preference Optimization, DPO)的强化学习策略、扩展训练数据集至830万视觉-生成指令对、升级语言模型主干为Qwen2、提升图像生成分辨率以及在不修改架构的情况下实现图像编辑能力,实现了多模态理解与生成性能的显著提升,并展示了统一视觉自回归模型从大型语言模型(Large Language Models, LLMs)中采用灵活训练策略的潜力。

链接: https://arxiv.org/abs/2504.02949
作者: Xianwei Zhuang,Yuxin Xie,Yufan Deng,Dongchao Yang,Liming Liang,Jinghan Ru,Yuguo Yin,Yuexian Zou
机构: School of Electronic and Computer Engineering, Peking University (北京大学电子与计算机工程学院); The Chinese University of Hong Kong (香港中文大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Code is available at: this https URL . arXiv admin note: text overlap with arXiv:2501.12327

点击查看摘要

Abstract:In this work, we present VARGPT-v1.1, an advanced unified visual autoregressive model that builds upon our previous framework VARGPT. The model preserves the dual paradigm of next-token prediction for visual understanding and next-scale generation for image synthesis. Specifically, VARGPT-v1.1 integrates: (1) a novel training strategy combining iterative visual instruction tuning with reinforcement learning through Direct Preference Optimization (DPO), (2) an expanded training corpus containing 8.3M visual-generative instruction pairs, (3) an upgraded language model backbone using Qwen2, (4) enhanced image generation resolution, and (5) emergent image editing capabilities without architectural modifications. These advancements enable VARGPT-v1.1 to achieve state-of-the-art performance in multimodal understanding and text-to-image instruction-following tasks, demonstrating significant improvements in both comprehension and generation metrics. Notably, through visual instruction tuning, the model acquires image editing functionality while maintaining architectural consistency with its predecessor, revealing the potential for unified visual understanding, generation, and editing. Our findings suggest that well-designed unified visual autoregressive models can effectively adopt flexible training strategies from large language models (LLMs), exhibiting promising scalability. The codebase and model weights are publicly available at this https URL.
zh

[CV-76] LiDAR-based Object Detection with Real-time Voice Specifications

【速读】:该论文旨在解决基于 LiDAR 的物体检测系统在实时应用中的准确性与泛化能力不足的问题,特别是在处理类别不平衡和多模态数据融合方面。解决方案的关键在于提出了一种结合 KITTI 数据集的 3D 点云与 RGB 图像的多模态 PointNet 框架,通过加权损失函数缓解类别不平衡,并利用自适应技术优化训练过程,从而实现 87.0% 的验证准确率(显著优于 200 样本基线的 67.5%)。此外,通过集成自然语音输出与 3D 可视化反馈,增强了系统的可用性和安全性,为人类与计算机交互及环境感知领域的扩展性研究奠定了基础。
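文中用加权损失缓解类别不平衡,常见做法是按类别频率的倒数给交叉熵加权。以下为示意实现(权重方案为通用假设,并非论文原文细节):

```python
import numpy as np

def class_weights(class_counts):
    """按类频率倒数生成权重, 并归一化到均值为 1(常见做法, 细节为假设)。"""
    w = 1.0 / np.asarray(class_counts, dtype=np.float64)
    return w * len(w) / w.sum()

def weighted_cross_entropy(probs, labels, class_counts):
    """probs: (N, C) 预测概率; labels: (N,)。

    稀有类样本的损失被放大, 避免其梯度贡献被多数类淹没。"""
    w = class_weights(class_counts)
    nll = -np.log(probs[np.arange(len(labels)), labels] + 1e-12)
    return float((w[labels] * nll).mean())

print(class_weights([1000, 10]).round(2))   # 稀有类权重远大于多数类
probs = np.array([[0.9, 0.1], [0.6, 0.4], [0.2, 0.8]])
labels = np.array([0, 0, 1])
print(round(weighted_cross_entropy(probs, labels, [1000, 10]), 4))
```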

链接: https://arxiv.org/abs/2504.02920
作者: Anurag Kulkarni
机构: Shivaji University (沙瓦吉大学), Maharashtra (马哈拉施特拉邦), India (印度)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: 10 pages, 4 figures, submitted as part of MSc research

点击查看摘要

Abstract:This paper presents a LiDAR-based object detection system with real-time voice specifications, integrating KITTI’s 3D point clouds and RGB images through a multi-modal PointNet framework. It achieves 87.0% validation accuracy on a 3000-sample subset, surpassing a 200-sample baseline of 67.5% by combining spatial and visual data, addressing class imbalance with weighted loss, and refining training via adaptive techniques. A Tkinter prototype provides natural Indian male voice output using Edge TTS (en-IN-PrabhatNeural), alongside 3D visualizations and real-time feedback, enhancing accessibility and safety in autonomous navigation, assistive technology, and beyond. The study offers a detailed methodology, comprehensive experimental analysis, and a broad review of applications and challenges, establishing this work as a scalable advancement in human-computer interaction and environmental perception, aligned with current research trends.
zh

[CV-77] Morpheus: Benchmarking Physical Reasoning of Video Generative Models with Real Physical Experiments

【速读】:该论文试图解决的问题是:现有图像和视频生成模型是否具备遵循物理守恒定律的能力,即能否被视为世界模型(world models)以生成符合物理规律的真实且可信的视频。这一问题对于评估生成模型在机器人、自动驾驶和科学模拟等领域的适用性至关重要。

解决方案的关键在于引入了一个名为Morpheus的新基准(benchmark),用于基于物理推理评估视频生成模型。Morpheus包含80段捕捉真实物理现象的视频,并以物理守恒定律为导向。由于人工生成的视频缺乏地面真实数据(ground truth),研究者利用与特定物理场景相关的不可违背的守恒定律,结合物理信息神经网络(physics-informed neural networks, PINNs)和视觉-语言基础模型(vision-language foundation models),开发了基于物理的信息度量方法来评估生成视频的物理合理性。通过这种方法,论文揭示了即使采用先进的提示技术和视频条件化策略,当前的生成模型仍然难以有效编码物理原理,尽管它们能够生成视觉上令人愉悦的视频。
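"以守恒定律衡量生成视频的物理合理性"这一思路,可以用一个最简单的能量守恒检查来示意。以下代码与论文的具体指标无关,轨迹与参数均为假设,仅说明"物理一致的轨迹总机械能几乎不漂移"这一判据:

```python
def energy_violation(heights, speeds, g=9.8, m=1.0):
    """对一段轨迹计算总机械能 E = mgh + 0.5*m*v^2 随时间的最大相对漂移。

    理想物理轨迹应接近 0; 漂移越大, 越不符合能量守恒。"""
    energies = [m * g * h + 0.5 * m * v * v for h, v in zip(heights, speeds)]
    e0 = energies[0]
    return max(abs(e - e0) / e0 for e in energies)

# 构造一条满足能量守恒的合成自由落体轨迹: h(t) = h0 - 0.5*g*t^2, v(t) = g*t
g, h0 = 9.8, 10.0
ts = [i * 0.05 for i in range(20)]
hs = [h0 - 0.5 * g * t * t for t in ts]
vs = [g * t for t in ts]
print(energy_violation(hs, vs) < 1e-9)  # 物理一致的轨迹几乎无漂移
```

对生成视频而言,高度与速度需先由视觉模型从帧序列中估计出来,这正是论文借助视觉-语言基础模型与 PINN 的原因。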

链接: https://arxiv.org/abs/2504.02918
作者: Chenyu Zhang,Daniil Cherniavskii,Andrii Zadaianchuk,Antonios Tragoudaras,Antonios Vozikis,Thijmen Nijdam,Derck W. E. Prinzhorn,Mark Bodracska,Nicu Sebe,Efstratios Gavves
机构: University of Trento (特伦托大学), Italy; University of Amsterdam (阿姆斯特丹大学), the Netherlands; Archimedes, Athena Research Center (雅典研究与技术中心), Greece
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Recent advances in image and video generation raise hopes that these models possess world modeling capabilities, the ability to generate realistic, physically plausible videos. This could revolutionize applications in robotics, autonomous driving, and scientific simulation. However, before treating these models as world models, we must ask: Do they adhere to physical conservation laws? To answer this, we introduce Morpheus, a benchmark for evaluating video generation models on physical reasoning. It features 80 real-world videos capturing physical phenomena, guided by conservation laws. Since artificial generations lack ground truth, we assess physical plausibility using physics-informed metrics evaluated with respect to infallible conservation laws known per physical setting, leveraging advances in physics-informed neural networks and vision-language foundation models. Our findings reveal that even with advanced prompting and video conditioning, current models struggle to encode physical principles despite generating aesthetically pleasing videos. All data, leaderboard, and code are open-sourced at our project page.
zh
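论文提到的"以守恒定律为依据的物理合理性度量"可以用一个极简示例说明:对视频中被跟踪物体的轨迹,检查其总机械能是否近似守恒。下面的函数与自由落体算例均为示意性假设,并非 Morpheus 的官方度量实现。

```python
import numpy as np

# 示意代码:用总机械能的相对漂移检验一条被跟踪轨迹的物理合理性。
# 无摩擦运动的漂移应接近 0;漂移显著偏离 0 即提示生成视频违背守恒定律。
def energy_drift(heights, speeds, mass=1.0, g=9.81):
    h = np.asarray(heights, dtype=float)
    v = np.asarray(speeds, dtype=float)
    e = mass * g * h + 0.5 * mass * v ** 2      # 每帧的势能 + 动能
    return float(np.max(np.abs(e - e[0])) / abs(e[0]))

# 自由落体:h(t) = 10 - g t^2 / 2, v(t) = g t,能量严格守恒
t = np.linspace(0.0, 1.0, 50)
drift = energy_drift(10 - 0.5 * 9.81 * t ** 2, 9.81 * t)
```

真实评估中,轨迹需由视频目标跟踪获得,且应按具体物理场景选择相应的守恒量。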

[CV-78] Haphazard Inputs as Images in Online Learning IJCNN2025

【速读】:该论文致力于解决在线学习环境中特征空间变化(即随机输入)的问题,这是当前研究中的一个突出挑战。由于现有方法依赖于特定模型且无法直接利用先进的深度学习技术(这些技术通常需要固定维度的输入),导致其应用受限。论文的关键解决方案是提出一种在在线学习场景中将变化的特征空间动态转换为固定维度图像表示的方法。这种方法具有模型无关性,使得任何基于视觉的模型都能处理随机输入,并通过ResNet和ViT验证了其有效性。通过无缝处理不一致的输入数据,该方法实现了可扩展性和鲁棒性。实验结果表明,该方法在四个公开数据集上均表现出了良好的性能,代码已开源。

链接: https://arxiv.org/abs/2504.02912
作者: Rohit Agarwal,Aryan Dessai,Arif Ahmed Sekh,Krishna Agarwal,Alexander Horsch,Dilip K. Prasad
机构: UiT The Arctic University of Norway (挪威北极大学); IIT (ISM) Dhanbad (印度理工学院丹巴德分校)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET); Machine Learning (cs.LG)
备注: Accepted at IJCNN 2025

点击查看摘要

Abstract:The field of varying feature space in online learning settings, also known as haphazard inputs, is very prominent nowadays due to its applicability in various fields. However, the current solutions to haphazard inputs are model-dependent and cannot benefit from the existing advanced deep-learning methods, which necessitate inputs of fixed dimensions. Therefore, we propose to transform the varying feature space in an online learning setting to a fixed-dimension image representation on the fly. This simple yet novel approach is model-agnostic, allowing any vision-based models to be applicable for haphazard inputs, as demonstrated using ResNet and ViT. The image representation handles the inconsistent input data seamlessly, making our proposed approach scalable and robust. We show the efficacy of our method on four publicly available datasets. The code is available at this https URL.
zh
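其核心思想——把可变特征集合即时(on the fly)转换为固定尺寸图像——可以用下面的极简示意说明;像素位置按特征首次出现的顺序分配,这一分配策略是本示例的假设,并非论文原方案。

```python
import numpy as np

# 示意代码:将"特征名 -> 数值"的可变集合映射为固定尺寸灰度图,
# 新特征按首次出现顺序占用一个固定像素槽位,缺失特征对应像素保持为 0,
# 因此任意特征子集都能得到同尺寸输入,供任何视觉模型处理。
_slots = {}

def features_to_image(features, size=8):
    img = np.zeros((size, size), dtype=float)
    for name, value in features.items():
        if name not in _slots:                      # 新特征分配下一个空闲槽位
            _slots[name] = len(_slots) % (size * size)
        i = _slots[name]
        img[i // size, i % size] = value
    return img

img_a = features_to_image({"x1": 0.5, "x2": 0.9})   # 两个特征同时出现
img_b = features_to_image({"x2": 0.9})              # 特征缺失时仍得到同尺寸图像
```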

[CV-79] Comparative Analysis of Deepfake Detection Models: New Approaches and Perspectives

【速读】:该论文旨在解决深度伪造(Deepfake)视频带来的日益严重的威胁,即通过操纵现实和传播错误信息对社会造成的危害,迫切需要有效的检测方法。论文的关键在于研究和比较不同的深度伪造检测方法,特别关注GenConViT模型及其在DeepfakeBenchmark中与其他架构的性能对比。论文通过分析数字图像处理、机器学习和人工神经网络等技术基础,尤其是卷积神经网络(CNN)、生成对抗网络(GAN)和Transformer,探索了深度伪造的创建与检测原理,并基于相关指标和新数据集(如WildDeepfake和DeepSpeak)评估模型性能,以识别最有效的检测工具。结果显示,经过微调后,GenConViT模型在准确性(93.82%)和泛化能力方面表现出色,优于DeepfakeBenchmark中的其他架构。因此,该研究为提升深度伪造检测技术提供了贡献,助力开发更强大且有效的反制虚假信息传播的解决方案。

链接: https://arxiv.org/abs/2504.02900
作者: Matheus Martins Batista
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Computation (stat.CO); Machine Learning (stat.ML)
备注: Bachelor’s thesis

点击查看摘要

Abstract:The growing threat posed by deepfake videos, capable of manipulating realities and disseminating misinformation, drives the urgent need for effective detection methods. This work investigates and compares different approaches for identifying deepfakes, focusing on the GenConViT model and its performance relative to other architectures present in the DeepfakeBenchmark. To contextualize the research, the social and legal impacts of deepfakes are addressed, as well as the technical fundamentals of their creation and detection, including digital image processing, machine learning, and artificial neural networks, with emphasis on Convolutional Neural Networks (CNNs), Generative Adversarial Networks (GANs), and Transformers. The performance evaluation of the models was conducted using relevant metrics and new datasets established in the literature, such as WildDeepfake and DeepSpeak, aiming to identify the most effective tools in the battle against misinformation and media manipulation. The obtained results indicated that GenConViT, after fine-tuning, exhibited superior performance in terms of accuracy (93.82%) and generalization capacity, surpassing other architectures in the DeepfakeBenchmark on the DeepSpeak dataset. This study contributes to the advancement of deepfake detection techniques, offering contributions to the development of more robust and effective solutions against the dissemination of false information.
zh

[CV-80] UAC: Uncertainty-Aware Calibration of Neural Networks for Gesture Detection

【速读】:该论文旨在解决在安全关键领域(如建筑、制造和医疗)中,基于IMU(Inertial Measurement Units)的人体手势识别模型因严格的校准需求和对抗分布外(Out-of-Distribution, OOD)数据的鲁棒性限制而难以广泛应用的问题。论文提出了一种名为UAC(Uncertainty-Aware Calibration)的新型两步法来应对这些挑战。解决方案的关键在于:首先,设计了一种能够从IMU数据中预测手势概率及其相关不确定性的不确定性感知手势网络架构,并利用该不确定性对每个潜在手势的概率进行校准;其次,通过多IMU数据窗口预测的熵加权期望,提升模型精度的同时保持正确的校准效果。实验表明,UAC方法在OOD和分布内场景下均优于现有最先进的校准方法,显著提高了模型的校准和准确性。

链接: https://arxiv.org/abs/2504.02895
作者: Farida Al Haddad,Yuxin Wang,Malcolm Mielle
机构: Ecole Polytechnique Federale de Lausanne (洛桑联邦理工学院); Schindler EPFL Lab (Schindler EPFL 实验室)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 12 pages, 2 figures

点击查看摘要

Abstract:Artificial intelligence has the potential to impact safety and efficiency in safety-critical domains such as construction, manufacturing, and healthcare. For example, using sensor data from wearable devices, such as inertial measurement units (IMUs), human gestures can be detected while maintaining privacy, thereby ensuring that safety protocols are followed. However, strict safety requirements in these domains have limited the adoption of AI, since accurate calibration of predicted probabilities and robustness against out-of-distribution (OOD) data is necessary. This paper proposes UAC (Uncertainty-Aware Calibration), a novel two-step method to address these challenges in IMU-based gesture recognition. First, we present an uncertainty-aware gesture network architecture that predicts both gesture probabilities and their associated uncertainties from IMU data. This uncertainty is then used to calibrate the probabilities of each potential gesture. Second, an entropy-weighted expectation of predictions over multiple IMU data windows is used to improve accuracy while maintaining correct calibration. Our method is evaluated using three publicly available IMU datasets for gesture detection and is compared to three state-of-the-art calibration methods for neural networks: temperature scaling, entropy maximization, and Laplace approximation. UAC outperforms existing methods, achieving improved accuracy and calibration in both OOD and in-distribution scenarios. Moreover, we find that, unlike our method, none of the state-of-the-art methods significantly improve the calibration of IMU-based gesture recognition models. In conclusion, our work highlights the advantages of uncertainty-aware calibration of neural networks, demonstrating improvements in both calibration and accuracy for gesture detection using IMU data. 
zh
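"对多个 IMU 数据窗口的预测做熵加权期望"这一步可以用下面的极简示例说明。具体权重取法(这里用 exp(-熵) 后归一化)是本示例的假设,仅用于体现"预测越确定的窗口权重越大"的思想,并非论文官方实现。

```python
import numpy as np

# 示意代码:对多个 IMU 窗口的类别概率做熵加权期望。
def entropy_weighted_prediction(window_probs, eps=1e-12):
    p = np.asarray(window_probs, dtype=float)        # (n_windows, n_classes)
    entropy = -np.sum(p * np.log(p + eps), axis=1)   # 各窗口的预测熵
    weights = np.exp(-entropy)                       # 低熵 -> 高权重(假设的取法)
    weights /= weights.sum()
    return weights @ p                               # 加权期望概率

probs = entropy_weighted_prediction([
    [0.9, 0.05, 0.05],   # 高置信窗口
    [0.4, 0.3, 0.3],     # 低置信窗口,获得较小权重
])
```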

[CV-81] Enhancing Traffic Sign Recognition On The Performance Based On Yolov8

【速读】:该论文旨在解决交通标志识别中因小尺寸、多变环境条件、遮挡及类别不平衡等因素导致的检测与分类准确性低的问题。论文的关键在于提出了一种基于增强版 YOLOv8 的检测系统,通过集成先进的数据增强技术、新颖的架构改进(如坐标注意力机制 Coordinate Attention 和双向特征金字塔网络 BiFPN)、动态模块(如 ODConv 和 LSKA)以及优化的损失函数(结合 EIoU 和 WIoU 的 Focal Loss),显著提升了检测精度、在恶劣条件下的鲁棒性以及边缘设备上的实时推理能力。

链接: https://arxiv.org/abs/2504.02884
作者: Baba Ibrahim,Zhou Kui(Hubei University of Automotive Technology)
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Performance (cs.PF)
备注: 27 Pages, 6 Figures, 10 Tables and 20 References

点击查看摘要

Abstract:Traffic sign recognition plays a crucial role in the development of autonomous vehicles and advanced driver-assistance systems (ADAS). Despite significant advances in deep learning and object detection, accurately detecting and classifying traffic signs remains challenging due to their small sizes, variable environmental conditions, occlusion, and class imbalance. This thesis presents an enhanced YOLOv8-based detection system that integrates advanced data augmentation techniques, novel architectural enhancements including Coordinate Attention (CA), Bidirectional Feature Pyramid Network (BiFPN), and dynamic modules such as ODConv and LSKA, along with refined loss functions (EIoU and WIoU combined with Focal Loss). Extensive experiments conducted on datasets including GTSRB, TT100K, and GTSDB demonstrate marked improvements in detection accuracy, robustness under adverse conditions, and real-time inference on edge devices. The findings contribute actionable insights for deploying reliable traffic sign recognition systems in real-world autonomous driving scenarios.
zh
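文中提到的 EIoU 损失在 1 - IoU 之外,额外惩罚中心点距离与宽、高之差。下面是单对边界框的简化实现(框格式 [x1, y1, x2, y2]),仅为公式示意,并非论文或 YOLOv8 官方代码。

```python
# 示意代码:EIoU 损失(单对边界框,要求两框面积非零)。
def eiou_loss(box_a, box_b):
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    iw = max(0.0, min(ax2, bx2) - max(ax1, bx1))            # 交集宽
    ih = max(0.0, min(ay2, by2) - max(ay1, by1))            # 交集高
    inter = iw * ih
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    iou = inter / union
    cw = max(ax2, bx2) - min(ax1, bx1)                      # 最小外接框宽
    ch = max(ay2, by2) - min(ay1, by1)                      # 最小外接框高
    center = ((ax1 + ax2 - bx1 - bx2) ** 2 + (ay1 + ay2 - by1 - by2) ** 2) / 4.0
    return (1.0 - iou
            + center / (cw ** 2 + ch ** 2)                  # 中心距离惩罚
            + ((ax2 - ax1) - (bx2 - bx1)) ** 2 / cw ** 2    # 宽差惩罚
            + ((ay2 - ay1) - (by2 - by1)) ** 2 / ch ** 2)   # 高差惩罚

loss_same = eiou_loss((0, 0, 2, 2), (0, 0, 2, 2))   # 完全重合,损失为 0
loss_far = eiou_loss((0, 0, 1, 1), (2, 2, 3, 3))    # 不相交,损失大于 1
```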

[CV-82] Exploring the Capabilities of LLMs for IMU-based Fine-grained Human Activity Understanding

【速读】:本文旨在解决现有基于大语言模型(LLMs)的人体活动识别(HAR)方法在细粒度任务上的性能不足问题,特别是针对平面书写场景下的字母识别等任务,传统方法仅能达到接近随机猜测的准确性。为填补这一空白,研究的关键在于通过微调预训练的LLMs,并结合自收集的数据集与少样本学习(few-shot learning),显著提升了二维数据处理的性能,最高可达129倍改进。进一步地,为了扩展至三维空间中的空中书写场景,提出了基于编码器的流水线设计,将三维数据映射为二维等效表示,以保留时空信息,从而实现鲁棒的字母预测。最终,端到端系统在包含多达5个字母的空中书写环境中实现了78%的单词识别准确率,确立了LLMs在细粒度HAR任务中的可行性。

链接: https://arxiv.org/abs/2504.02878
作者: Lilin Xu,Kaiyuan Hou,Xiaofan Jiang
机构: Columbia University (哥伦比亚大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: Accepted to The 2nd International Workshop on Foundation Models for Cyber-Physical Systems Internet of Things (FMSys 2025)

点击查看摘要

Abstract:Human activity recognition (HAR) using inertial measurement units (IMUs) increasingly leverages large language models (LLMs), yet existing approaches focus on coarse activities like walking or running. Our preliminary study indicates that pretrained LLMs fail catastrophically on fine-grained HAR tasks such as air-written letter recognition, achieving only near-random guessing accuracy. In this work, we first bridge this gap for flat-surface writing scenarios: by fine-tuning LLMs with a self-collected dataset and few-shot learning, we achieved up to a 129x improvement on 2D data. To extend this to 3D scenarios, we designed an encoder-based pipeline that maps 3D data into 2D equivalents, preserving the spatiotemporal information for robust letter prediction. Our end-to-end pipeline achieves 78% accuracy on word recognition with up to 5 letters in mid-air writing scenarios, establishing LLMs as viable tools for fine-grained HAR.
zh
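"将三维空中书写轨迹映射为二维等效表示"的最朴素做法,是把轨迹投影到其 PCA 拟合平面上。下面的示例仅用于说明这一几何直觉;论文采用的是可学习的编码器,而非简单 PCA。

```python
import numpy as np

# 示意代码:用 SVD 把 3D 轨迹投影到其拟合平面,得到保留轨迹几何的 2D 等效表示。
def project_to_plane(points_3d):
    p = np.asarray(points_3d, dtype=float)
    centered = p - p.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:2].T          # 前两个主方向张成拟合平面,输出 (n, 2)

# 位于 z = 0 平面上的正方形轨迹:投影后点间距离保持不变
traj = np.array([[0, 0, 0], [1, 0, 0], [1, 1, 0], [0, 1, 0]], dtype=float)
flat = project_to_plane(traj)
```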

[CV-83] Multimodal Reference Visual Grounding

【速读】:该论文旨在解决视觉定位(Visual Grounding)任务中,当输入图像包含相似对象时,现有大型视觉-语言模型(Large Vision-Language Models, LVLMs)难以区分的问题。具体而言,针对类似物体(如无糖可乐与普通可乐)的识别挑战,论文引入了一个新任务——多模态参考视觉定位(Multimodal Reference Visual Grounding, MRVG)。解决方案的关键在于利用参考图像结合少量样本目标检测技术(few-shot object detection),并通过大型语言模型(Large Language Models, LLMs)实现对象匹配。为此,论文设计了一种名为MRVG-Net的新方法,并构建了一个新的数据集来研究MRVG问题。实验结果表明,该方法在视觉定位性能上超越了现有的LVLMs(如Qwen2.5-VL-7B),实现了少样本检测与视觉定位之间的有效融合,为视觉理解解锁了新的能力。

链接: https://arxiv.org/abs/2504.02876
作者: Yangxiao Lu,Ruosen Li,Liqiang Jing,Jikai Wang,Xinya Du,Yunhui Guo,Nicholas Ruozzi,Yu Xiang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: Project page with our code and dataset: this https URL

点击查看摘要

Abstract:Visual grounding focuses on detecting objects from images based on language expressions. Recent Large Vision-Language Models (LVLMs) have significantly advanced visual grounding performance by training large models with large-scale datasets. However, the problem remains challenging, especially when similar objects appear in the input image. For example, an LVLM may not be able to differentiate Diet Coke and regular Coke in an image. In this case, if additional reference images of Diet Coke and regular Coke are available, it can help the visual grounding of similar objects. In this work, we introduce a new task named Multimodal Reference Visual Grounding (MRVG). In this task, a model has access to a set of reference images of objects in a database. Based on these reference images and a language expression, the model is required to detect a target object from a query image. We first introduce a new dataset to study the MRVG problem. Then we introduce a novel method, named MRVG-Net, to solve this visual grounding problem. We show that by efficiently using reference images with few-shot object detection and using Large Language Models (LLMs) for object matching, our method achieves superior visual grounding performance compared to the state-of-the-art LVLMs such as Qwen2.5-VL-7B. Our approach bridges the gap between few-shot detection and visual grounding, unlocking new capabilities for visual understanding. Project page with our code and dataset: this https URL
zh
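"利用参考图像做对象匹配"的骨架流程可以简化为:为查询图中每个检测目标提取特征,再与参考图库特征做余弦相似度匹配。以下仅为示意实现;MRVG-Net 实际借助 LLM 完成匹配,这里的特征向量与匹配方式均为本示例假设。

```python
import numpy as np

# 示意代码:将检测目标的特征与参考图库特征做余弦相似度匹配,
# 返回最相似的参考对象索引及相似度。
def match_to_reference(query_emb, ref_embs):
    q = np.asarray(query_emb, dtype=float)
    r = np.asarray(ref_embs, dtype=float)
    sims = (r @ q) / (np.linalg.norm(r, axis=1) * np.linalg.norm(q) + 1e-12)
    return int(np.argmax(sims)), float(np.max(sims))

refs = [[1.0, 0.0, 0.0],   # 参考对象 0(例如"无糖可乐")的特征
        [0.0, 1.0, 0.0]]   # 参考对象 1(例如"普通可乐")的特征
idx, sim = match_to_reference([0.9, 0.1, 0.0], refs)
```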

[CV-84] OpenFACADES: An Open Framework for Architectural Caption and Attribute Data Enrichment via Street View Imagery

【速读】:该论文旨在解决城市地区综合性高质量建筑属性数据稀缺的问题,特别是通过整合多源开放数据集,获取大规模全面的建筑图像,并推断完整的建筑属性。论文的关键在于提出OpenFACADES框架,利用多模态众包数据结合多模态大语言模型,为建筑档案补充客观属性与语义描述。其解决方案的核心在于三个步骤:首先通过等视线分析整合街景图像元数据与OpenStreetMap几何信息,识别适合观察目标建筑的视角;其次自动化检测全景图像中的建筑立面并采用重投影方法生成整体透视视图;最后引入一种创新方法,利用开源大视觉语言模型(VLMs)进行多属性预测及开放式词汇描述,基于来自七个城市的全球性标注图像数据集开展系统研究。评估显示,微调后的VLM在多属性推理方面表现出色,优于单属性计算机视觉模型和零样本ChatGPT-4o。

链接: https://arxiv.org/abs/2504.02866
作者: Xiucheng Liang,Jinheng Xie,Tianhong Zhao,Rudi Stouffs,Filip Biljecki
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Building properties, such as height, usage, and material composition, play a crucial role in spatial data infrastructures, supporting applications such as energy simulation, risk assessment, and environmental modeling. Despite their importance, comprehensive and high-quality building attribute data remain scarce in many urban areas. Recent advances have enabled the extraction and tagging of objective building attributes using remote sensing and street-level imagery. However, establishing a method and pipeline that integrates diverse open datasets, acquires holistic building imagery at scale, and infers comprehensive building attributes remains a significant challenge. Among the first, this study bridges the gaps by introducing OpenFACADES, an open framework that leverages multimodal crowdsourced data to enrich building profiles with both objective attributes and semantic descriptors through multimodal large language models. Our methodology proceeds in three major steps. First, we integrate street-level image metadata from Mapillary with OpenStreetMap geometries via isovist analysis, effectively identifying images that provide suitable vantage points for observing target buildings. Second, we automate the detection of building facades in panoramic imagery and tailor a reprojection approach to convert objects into holistic perspective views that approximate real-world observation. Third, we introduce an innovative approach that harnesses and systematically investigates the capabilities of open-source large vision-language models (VLMs) for multi-attribute prediction and open-vocabulary captioning in building-level analytics, leveraging a globally sourced dataset of 30,180 labeled images from seven cities. Evaluation shows that fine-tuned VLM excel in multi-attribute inference, outperforming single-attribute computer vision models and zero-shot ChatGPT-4o.
zh
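第一步"通过等视线分析筛选能观察到目标建筑的街景图像",其最简化形式是一个视锥(扇形视域)测试:判断建筑质心是否落在相机朝向的视场角与有效距离之内。以下实现仅为几何示意,并非论文原流程。

```python
import math

# 示意代码:简化的视锥检查——目标是否处于相机朝向 heading
# (以正北为 0°、顺时针)± fov/2 的扇形视域内且不超过最大距离。
def in_view(cam_xy, heading_deg, fov_deg, target_xy, max_dist=50.0):
    dx = target_xy[0] - cam_xy[0]
    dy = target_xy[1] - cam_xy[1]
    if math.hypot(dx, dy) > max_dist:
        return False
    bearing = math.degrees(math.atan2(dx, dy))              # 正北为 0°,顺时针为正
    diff = (bearing - heading_deg + 180.0) % 360.0 - 180.0  # 归一化到 [-180, 180)
    return abs(diff) <= fov_deg / 2.0
```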

[CV-85] Towards Understanding How Knowledge Evolves in Large Vision-Language Models

【速读】:该论文试图解决大型视觉-语言模型(LVLMs)内部工作机制理解不足的问题,这一理解局限阻碍了其能力的进一步提升。论文的关键解决方案在于设计了一系列新颖的策略来分析LVLMs中的多模态知识演化,并从单个标记概率、标记概率分布以及特征编码三个层次深入研究多模态知识的演化过程。在此过程中,论文识别出两个关键节点:临界层和突变层,将演化过程划分为快速演化、稳定化和突变三个阶段。这种研究首次揭示了LVLMs中知识演化的轨迹,为理解其底层机制提供了新的视角。

链接: https://arxiv.org/abs/2504.02862
作者: Sudong Wang,Yunjian Zhang,Yao Zhu,Jianing Li,Zizhe Wang,Yanwei Liu,Xiangyang Ji
机构: Institute of Information Engineering, Chinese Academy of Sciences (中国科学院信息工程研究所); Nanyang Technological University (南洋理工大学); Tsinghua University (清华大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Large Vision-Language Models (LVLMs) are gradually becoming the foundation for many artificial intelligence applications. However, understanding their internal working mechanisms has continued to puzzle researchers, which in turn limits the further enhancement of their capabilities. In this paper, we seek to investigate how multimodal knowledge evolves and eventually induces natural languages in LVLMs. We design a series of novel strategies for analyzing internal knowledge within LVLMs, and delve into the evolution of multimodal knowledge from three levels, including single token probabilities, token probability distributions, and feature encodings. In this process, we identify two key nodes in knowledge evolution: the critical layers and the mutation layers, dividing the evolution process into three stages: rapid evolution, stabilization, and mutation. Our research is the first to reveal the trajectory of knowledge evolution in LVLMs, providing a fresh perspective for understanding their underlying mechanisms. Our codes are available at this https URL.
zh
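论文在"标记概率分布"层面的分析思路,可以用相邻层分布间的 KL 散度粗略示意:散度突然增大的位置即对应某种"突变层"。下面的判据是本示例的简化假设,并非论文对临界层/突变层的原始定义。

```python
import numpy as np

# 示意代码:用相邻层 token 概率分布的 KL 散度定位分布"跳变"最大的层。
def kl(p, q, eps=1e-12):
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    return float(np.sum(p * np.log((p + eps) / (q + eps))))

def find_mutation_layer(layer_dists):
    jumps = [kl(layer_dists[i + 1], layer_dists[i])
             for i in range(len(layer_dists) - 1)]
    return int(np.argmax(jumps)) + 1, jumps      # 返回跳变最大的层号及各层跳变量

dists = [
    [0.40, 0.30, 0.30],
    [0.45, 0.30, 0.25],   # 缓慢演化
    [0.05, 0.05, 0.90],   # 分布骤变
]
layer, jumps = find_mutation_layer(dists)
```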

[CV-86] Computer Vision and Deep Learning for 4D Augmented Reality

【速读】:该论文旨在解决在扩展现实(XR)平台中高效渲染4D视频的问题,特别是针对复杂3D模型传输带宽限制的挑战。论文的关键解决方案在于利用深度学习模型开发了一种紧凑的4D视频序列形状与外观表示方法,通过有效学习和重构4D视频序列,确保在降低数据需求的同时,不牺牲视频的形状和外观质量。

链接: https://arxiv.org/abs/2504.02860
作者: Karthik Shivashankar
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: My Master Thesis , University of Surrey 2019

点击查看摘要

Abstract:The prospect of 4D video in Extended Reality (XR) platform is huge and exciting, it opens a whole new way of human computer interaction and the way we perceive the reality and consume multimedia. In this thesis, we have shown that feasibility of rendering 4D video in Microsoft mixed reality platform. This enables us to port any 3D performance capture from CVSSP into XR product like the HoloLens device with relative ease. However, if the 3D model is too complex and is made up of millions of vertices, the data bandwidth required to port the model is a severe limitation with the current hardware and communication system. Therefore, in this project we have also developed a compact representation of both shape and appearance of the 4d video sequence using deep learning models to effectively learn the compact representation of 4D video sequence and reconstruct it without affecting the shape and appearance of the video sequence.
zh

[CV-87] Survey and synthesis of state of the art in driver monitoring

【速读】:该论文旨在解决如何有效表征驾驶员状态的问题,这是驾驶监控(Driver Monitoring, DM)系统的核心第一步。随着驾驶自动化(Driving Automation, DA)的发展,DM将在所有非完全自动驾驶的车辆中持续保持重要性。论文的关键在于提供一种全面且结构化的视角来梳理DM的技术手段,并沿着五大主要维度(即“亚状态”)——包括嗜睡、认知负荷、分心、情绪以及是否受酒精或药物影响(under the influence)——对驾驶员状态进行表征。为此,论文设计了一组相互关联的表格,将这些状态与其对应的指标(如眨眼频率)和可获取这些指标的传感器(如摄像头)联系起来,同时考虑了与驾驶员直接相关以及间接相关的车辆和环境因素。这一方法不仅为研究人员、设备供应商及整车制造商提供了实现先进DM系统的多种选择,还指出了未来研究和创新的潜在方向。

链接: https://arxiv.org/abs/2110.00472
作者: Anaïs Halin,Jacques G. Verly,Marc Van Droogenbroeck
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC); Image and Video Processing (eess.IV)
备注:

点击查看摘要

Abstract:Road-vehicle accidents are mostly due to human errors, and many such accidents could be avoided by continuously monitoring the driver. Driver monitoring (DM) is a topic of growing interest in the automotive industry, and it will remain relevant for all vehicles that are not fully autonomous, and thus for decades for the average vehicle owner. The present paper focuses on the first step of DM, which consists in characterizing the state of the driver. Since DM will be increasingly linked to driving automation (DA), this paper presents a clear view of the role of DM at each of the six SAE levels of DA. This paper surveys the state of the art of DM, and then synthesizes it, providing a unique, structured, polychotomous view of the many characterization techniques of DM. Informed by the survey, the paper characterizes the driver state along the five main dimensions–called here “(sub)states”–of drowsiness, mental workload, distraction, emotions, and under the influence. The polychotomous view of DM is presented through a pair of interlocked tables that relate these states to their indicators (e.g., the eye-blink rate) and the sensors that can access each of these indicators (e.g., a camera). The tables factor in not only the effects linked directly to the driver, but also those linked to the (driven) vehicle and the (driving) environment. They show, at a glance, to concerned researchers, equipment providers, and vehicle manufacturers (1) most of the options they have to implement various forms of advanced DM systems, and (2) fruitful areas for further research and innovation.
zh
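以文中提到的指标"眨眼频率"为例,嗜睡监测常用 PERCLOS(一段时间内眼睛闭合帧所占比例)作为量化指标。下面给出一个通用的示意实现,闭眼阈值取 0.2 为本示例假设。

```python
# 示意代码:PERCLOS——一段时间内"闭眼"帧占比,常用的嗜睡量化指标。
# 输入为每帧的睁眼程度(0~1)。
def perclos(eye_openness, closed_thresh=0.2):
    closed = sum(1 for o in eye_openness if o < closed_thresh)
    return closed / len(eye_openness)

score = perclos([0.9, 0.8, 0.1, 0.05, 0.1, 0.9, 0.85, 0.9, 0.8, 0.7])  # 10 帧中 3 帧闭眼
```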

[CV-88] MedSAM2: Segment Anything in 3D Medical Images and Videos

【速读】:该论文旨在解决医学图像和视频分割领域中缺乏通用性强、适用于三维数据且经过全面用户研究的模型的问题。论文的关键在于提出MedSAM2,这是一种基于提示的三维图像和视频分割基础模型。MedSAM2通过在包含超过455,000个三维图像-掩码对和76,000帧的大规模医学数据集上微调Segment Anything Model 2而开发,显著提升了在多种器官、病变类型及成像模式下的性能。此外,论文还设计了一种人机协作流程来构建大规模数据集,实现了迄今为止最大规模的用户研究,证明MedSAM2能够将人工成本减少超过85%。这一解决方案的核心在于结合先进的模型微调技术和高效的数据标注流程,以实现高精度、高效率的医学图像与视频分割。

链接: https://arxiv.org/abs/2504.03600
作者: Jun Ma,Zongxin Yang,Sumin Kim,Bihui Chen,Mohammed Baharoon,Adibvafa Fallahpour,Reza Asakereh,Hongwei Lyu,Bo Wang
机构: AI Collaborative Centre, University Health Network (联合健康网络人工智能协作中心); Vector Institute, Toronto, Canada (多伦多向量研究所); Department of Biomedical Informatics, Harvard Medical School, Harvard University, Boston, USA (波士顿哈佛大学医学院生物医学信息学系); Peter Munk Cardiac Centre, University Health Network (彼得·芒克心脏中心,联合健康网络); Department of Computer Science, University of Toronto (多伦多大学计算机科学系); Harvard University (哈佛大学); University of Toronto (多伦多大学); Stanford University (斯坦福大学)
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注: this https URL

点击查看摘要

Abstract:Medical image and video segmentation is a critical task for precision medicine, which has witnessed considerable progress in developing task or modality-specific and generalist models for 2D images. However, there have been limited studies on building general-purpose models for 3D images and videos with comprehensive user studies. Here, we present MedSAM2, a promptable segmentation foundation model for 3D image and video segmentation. The model is developed by fine-tuning the Segment Anything Model 2 on a large medical dataset with over 455,000 3D image-mask pairs and 76,000 frames, outperforming previous models across a wide range of organs, lesions, and imaging modalities. Furthermore, we implement a human-in-the-loop pipeline to facilitate the creation of large-scale datasets resulting in, to the best of our knowledge, the most extensive user study to date, involving the annotation of 5,000 CT lesions, 3,984 liver MRI lesions, and 251,550 echocardiogram video frames, demonstrating that MedSAM2 can reduce manual costs by more than 85%. MedSAM2 is also integrated into widely used platforms with user-friendly interfaces for local and cloud deployment, making it a practical tool for supporting efficient, scalable, and high-quality segmentation in both research and healthcare environments.
zh
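医学图像分割的性能通常以 Dice 系数衡量,论文的对比评估亦属此类重叠度指标。下面是该指标的通用实现,仅作参考,并非论文的评估代码。

```python
import numpy as np

# 示意代码:Dice 系数 = 2|A∩B| / (|A| + |B|),取值 0~1,越大表示分割重叠越好。
def dice(mask_a, mask_b, eps=1e-9):
    a = np.asarray(mask_a, dtype=bool)
    b = np.asarray(mask_b, dtype=bool)
    return float(2.0 * np.logical_and(a, b).sum() / (a.sum() + b.sum() + eps))

d = dice([[1, 1], [0, 0]],
         [[1, 0], [0, 0]])   # 交集 1 个像素,两掩码共 3 个前景像素
```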

[CV-89] AdaViT: Adaptive Vision Transformer for Flexible Pretrain and Finetune with Variable 3D Medical Image Modalities

【速读】:该论文试图解决在深度学习模型中因输入模态/对比度集不匹配而导致预训练模型性能下降的问题。传统方法在面对不同输入模态或预训练与微调阶段模态不一致时,往往难以保持性能,通常会导致精度下降。为了解决这一挑战,论文提出了一种自适应视觉Transformer(AdaViT)框架,其关键在于利用动态标记器(dynamic tokenizer)将不同的输入图像模态编码为令牌,并借助Transformer架构的特性构建跨变长令牌的注意力机制,从而实现对每个病例可变输入模态的有效处理。通过广泛的实验验证,该架构在脑梗死和脑肿瘤分割任务中的零样本测试、少量样本微调以及反向迁移场景下表现出色,并且对于自监督预训练也能最大化利用预训练数据,促进多样化下游任务的迁移。

链接: https://arxiv.org/abs/2504.03589
作者: Badhan Kumar Das,Gengyan Zhao,Han Liu,Thomas J. Re,Dorin Comaniciu,Eli Gibson,Andreas Maier
机构: Siemens Healthineers AG; Siemens Medical Solutions USA, Inc.; FAU Erlangen-Nuremberg
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Pretrain techniques, whether supervised or self-supervised, are widely used in deep learning to enhance model performance. In real-world clinical scenarios, different sets of magnetic resonance (MR) contrasts are often acquired for different subjects/cases, creating challenges for deep learning models assuming consistent input modalities among all the cases and between pretrain and finetune. Existing methods struggle to maintain performance when there is an input modality/contrast set mismatch with the pretrained model, often resulting in degraded accuracy. We propose an adaptive Vision Transformer (AdaViT) framework capable of handling variable set of input modalities for each case. We utilize a dynamic tokenizer to encode different input image modalities to tokens and take advantage of the characteristics of the transformer to build attention mechanism across variable length of tokens. Through extensive experiments, we demonstrate that this architecture effectively transfers supervised pretrained models to new datasets with different input modality/contrast sets, resulting in superior performance on zero-shot testing, few-shot finetuning, and backward transferring in brain infarct and brain tumor segmentation tasks. Additionally, for self-supervised pretrain, the proposed method is able to maximize the pretrain data and facilitate transferring to diverse downstream tasks with variable sets of input modalities.
zh
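"动态标记器 + 变长 token 序列"的核心思想可以用下面的极简示意说明:每个可用模态各自切分为 token 并叠加该模态的嵌入,缺失的模态不产生 token,序列长度随病例而变。切分方式与嵌入形式均为本示例的假设,并非论文原实现。

```python
import numpy as np

# 示意代码:按病例实际可用的模态动态生成 token 序列,
# 再交给 Transformer 的注意力机制处理变长输入。
def build_tokens(volumes, modality_embed, patch=4):
    tokens = []
    for name, vol in volumes.items():
        v = np.asarray(vol, dtype=float).ravel()
        for i in range(v.size // patch):              # 每个模态切分为若干 token
            tokens.append(v[i * patch:(i + 1) * patch] + modality_embed[name])
    return np.stack(tokens)

embed = {"T1": np.zeros(4), "T2": np.ones(4)}         # 各模态的(假设)嵌入
case_full = build_tokens({"T1": np.zeros(8), "T2": np.zeros(8)}, embed)  # 两个模态
case_partial = build_tokens({"T1": np.zeros(8)}, embed)                  # 缺失 T2
```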

[CV-90] Early detection of diabetes through transfer learning-based eye (vision) screening and improvement of machine learning model performance and advanced parameter setting algorithms

【速读】:该论文旨在解决糖尿病视网膜病变(Diabetic Retinopathy, DR)检测中传统机器学习方法存在的低准确性、低敏感性以及因数据复杂性和体量导致的模型训练周期长等问题。论文的关键解决方案是引入迁移学习(Transfer Learning, TL),通过特征维度降低、优化学习率调整及高级参数调优算法等改进措施,提升模型效率与诊断准确性。最终,所提出的模型在测试集上的整体准确率达到84%,其中特定类别的最高准确率为89%,敏感性最高达97%,F1分数为92%,表明其在识别DR病例方面具有显著优势。这些结果表明基于迁移学习的DR筛查是一种有前景的早期诊断手段,有助于及时干预以防止视力丧失并改善患者预后。

链接: https://arxiv.org/abs/2504.03439
作者: Mohammad Reza Yousefi,Ali Bakrani,Amin Dehghani
机构: 未知
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Signal Processing (eess.SP)
备注: 25 pages,12 Figures, 1 Table

点击查看摘要

Abstract:Diabetic Retinopathy (DR) is a serious and common complication of diabetes, caused by prolonged high blood sugar levels that damage the small retinal blood vessels. If left untreated, DR can progress to retinal vein occlusion and stimulate abnormal blood vessel growth, significantly increasing the risk of blindness. Traditional diabetes diagnosis methods often utilize convolutional neural networks (CNNs) to extract visual features from retinal images, followed by classification algorithms such as decision trees and k-nearest neighbors (KNN) for disease detection. However, these approaches face several challenges, including low accuracy and sensitivity, lengthy machine learning (ML) model training due to high data complexity and volume, and the use of limited datasets for testing and evaluation. This study investigates the application of transfer learning (TL) to enhance ML model performance in DR detection. Key improvements include dimensionality reduction, optimized learning rate adjustments, and advanced parameter tuning algorithms, aimed at increasing efficiency and diagnostic accuracy. The proposed model achieved an overall accuracy of 84% on the testing dataset, outperforming prior studies. The highest class-specific accuracy reached 89%, with a maximum sensitivity of 97% and an F1-score of 92%, demonstrating strong performance in identifying DR cases. These findings suggest that TL-based DR screening is a promising approach for early diagnosis, enabling timely interventions to prevent vision loss and improve patient outcomes.
zh
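文中报告的敏感性与 F1 分数可由混淆矩阵计数直接计算,公式示意如下(数值为虚构示例,并非论文结果):

```python
# 示意代码:由混淆矩阵计数计算敏感性(召回率)与 F1 分数。
def sensitivity_f1(tp, fp, fn):
    recall = tp / (tp + fn)               # 敏感性 = TP / (TP + FN)
    precision = tp / (tp + fp)
    f1 = 2 * precision * recall / (precision + recall)
    return recall, f1

rec, f1 = sensitivity_f1(tp=90, fp=10, fn=10)
```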

[CV-91] Comparative Analysis of Unsupervised and Supervised Autoencoders for Nuclei Classification in Clear Cell Renal Cell Carcinoma Images

【速读】:该论文致力于解决通过自动化方法改进透明细胞肾细胞癌(clear cell renal cell carcinoma, ccRCC)病理图像中细胞核分级的问题,这一任务传统上依赖于病理学家的主观视觉评估。论文的关键在于提出了一种结合分类器分支的判别自编码器(Classifier-based Discriminative Autoencoder, CDAE),并通过神经架构搜索及对比学习优化其性能。这种方法显著提升了潜在空间分离能力和分类准确性,尤其是在区分具有挑战性的相邻等级时表现优异,并最终在所有评估指标上超越了现有的最先进模型CHR-Network。因此,将分类器集成到自编码器中以实现监督学习特征提取是该解决方案的核心创新点。

链接: https://arxiv.org/abs/2504.03146
作者: Fatemeh Javadian,Zahra Aminparast,Johannes Stegmaier,Abin Jose
机构: 未知
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: Accepted 4-page paper at IEEE ISBI 2025. 3 figures, 3 tables

点击查看摘要

Abstract:This study explores the application of supervised and unsupervised autoencoders (AEs) to automate nuclei classification in clear cell renal cell carcinoma (ccRCC) images, a diagnostic task traditionally reliant on subjective visual grading by pathologists. We evaluate various AE architectures, including standard AEs, contractive AEs (CAEs), and discriminative AEs (DAEs), as well as a classifier-based discriminative AE (CDAE), optimized using the hyperparameter tuning tool Optuna. Bhattacharyya distance is selected from several metrics to assess class separability in the latent space, revealing challenges in distinguishing adjacent grades using unsupervised models. CDAE, integrating a supervised classifier branch, demonstrated superior performance in both latent space separation and classification accuracy. Given that CDAE-CNN achieved notable improvements in classification metrics, affirming the value of supervised learning for class-specific feature extraction, F1 score was incorporated into the tuning process to optimize classification performance. Results show significant improvements in identifying aggressive ccRCC grades by leveraging the classification capability of AE through latent clustering followed by fine-grained classification. Our model outperforms the current state of the art, CHR-Network, across all evaluated metrics. These findings suggest that integrating a classifier branch in AEs, combined with neural architecture search and contrastive learning, enhances grading automation in ccRCC pathology, particularly in detecting aggressive tumor grades, and may improve diagnostic accuracy.
zh
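文中用于评估潜空间类别可分性的 Bhattacharyya 距离,在一维高斯假设下有闭式表达式,示意如下(论文在高维潜空间中的具体用法可能不同):

```python
import math

# 示意代码:两个一维高斯分布 N(mu1, var1) 与 N(mu2, var2) 之间的
# Bhattacharyya 距离,可度量潜空间中两类样本的可分性。
def bhattacharyya_gauss(mu1, var1, mu2, var2):
    return (0.25 * math.log(0.25 * (var1 / var2 + var2 / var1 + 2.0))
            + 0.25 * (mu1 - mu2) ** 2 / (var1 + var2))

d_same = bhattacharyya_gauss(0.0, 1.0, 0.0, 1.0)   # 相同分布,距离为 0
d_far = bhattacharyya_gauss(0.0, 1.0, 3.0, 1.0)    # 均值相距 3 个标准差
```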

[CV-92] Global Rice Multi-Class Segmentation Dataset (RiceSEG): A Comprehensive and Diverse High-Resolution RGB-Annotated Images for the Development and Benchmarking of Rice Segmentation Algorithms

【速读】:该论文旨在解决利用计算机视觉技术进行水稻表型性状分析时,图像成分区分任务面临的挑战,特别是由于水稻器官精细结构和冠层复杂光照条件导致的难度。解决方案的关键在于构建了一个高质量的多类别水稻语义分割数据集——RiceSEG。该数据集包含来自五个主要水稻种植国家(中国、日本、印度、菲律宾和坦桑尼亚)的近50,000张高分辨率地面图像,覆盖超过6,000个基因型的所有生长阶段,并从中精选出3,078个代表性样本,标注为六类(背景、绿色植被、衰老植被、稻穗、杂草和浮萍)。这一数据集不仅填补了现有资源的空白,还特别强调了中国子数据集对东北到南方主要基因型及种植环境的全面覆盖,为开发针对水稻及其他作物的专用分割模型提供了重要支持。

链接: https://arxiv.org/abs/2504.02880
作者: Junchi Zhou,Haozhou Wang,Yoichiro Kato,Tejasri Nampally,P. Rajalakshmi,M. Balram,Keisuke Katsura,Hao Lu,Yue Mu,Wanneng Yang,Yangmingrui Gao,Feng Xiao,Hongtao Chen,Yuhao Chen,Wenjuan Li,Jingwen Wang,Fenghua Yu,Jian Zhou,Wensheng Wang,Xiaochun Hu,Yuanzhu Yang,Yanfeng Ding,Wei Guo,Shouyang Liu
机构: 未知
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Developing computer vision-based rice phenotyping techniques is crucial for precision field management and accelerating breeding, thereby continuously advancing rice production. Among phenotyping tasks, distinguishing image components is a key prerequisite for characterizing plant growth and development at the organ scale, enabling deeper insights into eco-physiological processes. However, due to the fine structure of rice organs and complex illumination within the canopy, this task remains highly challenging, underscoring the need for a high-quality training dataset. Such datasets are scarce, both due to a lack of large, representative collections of rice field images and the time-intensive nature of annotation. To address this gap, we established the first comprehensive multi-class rice semantic segmentation dataset, RiceSEG. We gathered nearly 50,000 high-resolution, ground-based images from five major rice-growing countries (China, Japan, India, the Philippines, and Tanzania), encompassing over 6,000 genotypes across all growth stages. From these original images, 3,078 representative samples were selected and annotated with six classes (background, green vegetation, senescent vegetation, panicle, weeds, and duckweed) to form the RiceSEG dataset. Notably, the sub-dataset from China spans all major genotypes and rice-growing environments from the northeast to the south. Both state-of-the-art convolutional neural networks and transformer-based semantic segmentation models were used as baselines. While these models perform reasonably well in segmenting background and green vegetation, they face difficulties during the reproductive stage, when canopy structures are more complex and multiple classes are involved. These findings highlight the importance of our dataset for developing specialized segmentation models for rice and other crops.
zh

[CV-93] Machine Learning Prediction of Cardiovascular Risk in Type 1 Diabetes Mellitus Using Radiomics Features from Multimodal Retinal Images

【速读】:该论文旨在开发一种机器学习(Machine Learning, ML)算法,用于从1型糖尿病患者的多模态视网膜图像中确定心血管(Cardiovascular, CV)风险,并将其分为中等、高和极高风险三个级别。研究的关键在于从眼底照相(Fundus Retinography)、光学相干断层扫描(Optical Coherence Tomography, OCT)及光学相干断层扫描血管成像(Optical Coherence Tomography Angiography, OCTA)图像中提取影像组学特征(Radiomic Features),并通过这些特征单独或与临床数据结合训练ML模型,以实现对CV风险的精确分类和评估。实验结果表明,结合OCT、OCTA以及眼部相关数据的影像组学方法在无需全身性数据输入的情况下也能达到较高的诊断性能(AUC=0.89±0.02),证明了这种基于眼科学(Oculomics)的方法在CV风险评估中的潜力。

链接: https://arxiv.org/abs/2504.02868
作者: Ariadna Tohà-Dalmau(1),Josep Rosinés-Fonoll(2),Enrique Romero(1 and 3),Ferran Mazzanti(4),Ruben Martin-Pinardel(5),Sonia Marias-Perez(2),Carolina Bernal-Morales(2, 5 and 6),Rafael Castro-Dominguez(2),Andrea Mendez(2),Emilio Ortega(5, 6 and 7),Irene Vinagre(5, 6 and 7),Marga Gimenez(5, 6 and 7),Alfredo Vellido(1 and 3),Javier Zarranz-Ventura(2, 5, 6 and 7) ((1) Department of Computer Science, Universitat Politècnica de Catalunya (2) Institut Clínic d’Oftalmología, Hospital Clínic de Barcelona (3) Intelligent Data Science and Artificial Intelligence Research Center (4) Department of Physics, Universitat Politècnica de Catalunya (5) August Pi i Sunyer Biomedical Research Institute (6) Diabetes Unit, Hospital Clínic de Barcelona (7) School of Medicine, Universitat de Barcelona)
机构: 未知
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: 19 pages, 7 figures. Submitted to Ophthalmology Science, under second review

点击查看摘要

Abstract:This study aimed to develop a machine learning (ML) algorithm capable of determining cardiovascular risk in multimodal retinal images from patients with type 1 diabetes mellitus, distinguishing between moderate, high, and very high-risk levels. Radiomic features were extracted from fundus retinography, optical coherence tomography (OCT), and OCT angiography (OCTA) images. ML models were trained using these features either individually or combined with clinical data. A dataset of 597 eyes (359 individuals) was analyzed, and models trained only with radiomic features achieved AUC values of (0.79 ± 0.03) for identifying moderate risk cases from high and very high-risk cases, and (0.73 ± 0.07) for distinguishing between high and very high-risk cases. The addition of clinical variables improved all AUC values, reaching (0.99 ± 0.01) for identifying moderate risk cases and (0.95 ± 0.02) for differentiating between high and very high-risk cases. For very high CV risk, radiomics combined with OCT+OCTA metrics and ocular data achieved an AUC of (0.89 ± 0.02) without systemic data input. These results demonstrate that radiomic features obtained from multimodal retinal images are useful for discriminating and classifying CV risk labels, highlighting the potential of this oculomics approach for CV risk assessment.
zh

人工智能

[AI-0] Towards deployment-centric multimodal AI beyond vision and language

【速读】:该论文试图解决多模态人工智能(Multimodal AI)在实际部署中面临的挑战,特别是现有研究主要集中于视觉和语言数据模型,而忽视了部署可行性的关键问题。论文的关键解决方案在于提出一种以部署为中心的工作流程(deployment-centric workflow),将部署约束(deployment constraints)尽早融入研发过程,从而减少不可部署方案的可能性。此外,论文强调通过多层次的多模态整合(multilevel multimodal integration)和跨学科协作(multidisciplinary collaboration),拓宽研究范围至更广泛的领域,并通过分析流行病应对、自动驾驶汽车设计及气候变化适应等真实场景下的共性挑战,促进开放研究实践与多学科对话,加速具有广泛社会影响的部署驱动型发展。

链接: https://arxiv.org/abs/2504.03603
作者: Xianyuan Liu,Jiayang Zhang,Shuo Zhou,Thijs L. van der Plas,Avish Vijayaraghavan,Anastasiia Grishina,Mengdie Zhuang,Daniel Schofield,Christopher Tomlinson,Yuhan Wang,Ruizhe Li,Louisa van Zeeland,Sina Tabakhi,Cyndie Demeocq,Xiang Li,Arunav Das,Orlando Timmerman,Thomas Baldwin-McDonald,Jinge Wu,Peizhen Bai,Zahraa Al Sahili,Omnia Alwazzan,Thao N. Do,Mohammod N.I. Suvon,Angeline Wang,Lucia Cipolina-Kun,Luigi A. Moretti,Lucas Farndale,Nitisha Jain,Natalia Efremova,Yan Ge,Marta Varela,Hak-Keung Lam,Oya Celiktutan,Ben R. Evans,Alejandro Coca-Castro,Honghan Wu,Zahraa S. Abdallah,Chen Chen,Valentin Danchev,Nataliya Tkachenko,Lei Lu,Tingting Zhu,Gregory G. Slabaugh,Roger K. Moore,William K. Cheung,Peter H. Charlton,Haiping Lu
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Multimodal artificial intelligence (AI) integrates diverse types of data via machine learning to improve understanding, prediction, and decision-making across disciplines such as healthcare, science, and engineering. However, most multimodal AI advances focus on models for vision and language data, while their deployability remains a key challenge. We advocate a deployment-centric workflow that incorporates deployment constraints early to reduce the likelihood of undeployable solutions, complementing data-centric and model-centric approaches. We also emphasise deeper integration across multiple levels of multimodality and multidisciplinary collaboration to significantly broaden the research scope beyond vision and language. To facilitate this approach, we identify common multimodal-AI-specific challenges shared across disciplines and examine three real-world use cases: pandemic response, self-driving car design, and climate change adaptation, drawing expertise from healthcare, social science, engineering, science, sustainability, and finance. By fostering multidisciplinary dialogue and open research practices, our community can accelerate deployment-centric development for broad societal impact.
zh

[AI-1] Real-is-Sim: Bridging the Sim-to-Real Gap with a Dynamic Digital Twin for Real-World Robot Policy Evaluation

【速读】:该论文旨在解决行为克隆(Behavior Cloning)在复杂操作任务训练中的两个主要挑战:一是评估训练性能困难,尤其是与实际任务成功率相关性较差的行为克隆损失;二是依赖耗时且昂贵的真实世界评估来识别最优策略或检测过拟合与欠拟合的问题。为应对这些挑战,论文提出了一种名为real-is-sim的新框架,其关键是引入基于Embodied Gaussians的动态数字孪生体(Digital Twin),贯穿政策开发的整个流程——包括数据收集、训练和部署阶段。通过持续同步模拟环境与物理环境,该框架允许在真实环境中采集演示数据的同时利用模拟器提取状态信息,并支持灵活的状态表示方式。此外,在训练过程中可高效并行地对策略进行离线评估,而在部署阶段则实现策略执行与真实硬件解耦,从而缓解领域转移难题。实验验证表明,此方法显著提升了模拟成功率与现实世界评价之间的相关性。

链接: https://arxiv.org/abs/2504.03597
作者: Jad Abou-Chakra,Lingfeng Sun,Krishan Rana,Brandon May,Karl Schmeckpeper,Maria Vittoria Minniti,Laura Herlant
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Recent advancements in behavior cloning have enabled robots to perform complex manipulation tasks. However, accurately assessing training performance remains challenging, particularly for real-world applications, as behavior cloning losses often correlate poorly with actual task success. Consequently, researchers resort to success rate metrics derived from costly and time-consuming real-world evaluations, making the identification of optimal policies and detection of overfitting or underfitting impractical. To address these issues, we propose real-is-sim, a novel behavior cloning framework that incorporates a dynamic digital twin (based on Embodied Gaussians) throughout the entire policy development pipeline: data collection, training, and deployment. By continuously aligning the simulated world with the physical world, demonstrations can be collected in the real world with states extracted from the simulator. The simulator enables flexible state representations by rendering image inputs from any viewpoint or extracting low-level state information from objects embodied within the scene. During training, policies can be directly evaluated within the simulator in an offline and highly parallelizable manner. Finally, during deployment, policies are run within the simulator where the real robot directly tracks the simulated robot’s joints, effectively decoupling policy execution from real hardware and mitigating traditional domain-transfer challenges. We validate real-is-sim on the PushT manipulation task, demonstrating strong correlation between success rates obtained in the simulator and real-world evaluations. Videos of our system can be found at this https URL.
zh

[AI-2] Dense Neural Network Based Arrhythmia Classification on Low-cost and Low-compute Micro-controller

【速读】:该论文旨在解决心血管疾病(CVD)诊断中传统心电图(ECG)监测设备成本高昂的问题,同时确保能够有效检测心律失常(arrhythmia)。目前,工业级ECG系统虽然性能优越,但开发成本较高,而基于微控制器单元(MCU)的系统虽能显著降低成本,但在匹配行业标准并实现心律失常检测方面仍面临挑战。为此,论文提出了一种基于密集神经网络的解决方案,通过在Arduino Nano平台上实现高效的算法,以满足实时检测需求。关键在于设计了一个包含两层(不计输入层),分别具有10个和4个神经元的神经网络模型,并结合适当的激活函数选择方法,最终实现了对四种心律失常分类的78.3%宏平均F1分数及96.38%的准确率,同时保持极低的计算开销(仅需0.001314 MOps浮点运算)。
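
按摘要描述,该模型为两层全连接网络(隐藏层 10 个神经元、输出层 4 个神经元,均用 sigmoid 激活)。下面给出一个最小化的 NumPy 前向传播草图,用于直观核对参数量与浮点运算量;其中输入特征维数 n_input=32 为假设值(摘要未给出),权重为随机初始化,并非论文训练所得模型。

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# 假设性结构:输入 -> 10 个神经元 -> 4 个神经元(四类心律失常),与摘要描述一致
rng = np.random.default_rng(0)
n_input = 32                       # 假设的输入特征维数,摘要未给出
W1 = rng.standard_normal((n_input, 10))
b1 = np.zeros(10)
W2 = rng.standard_normal((10, 4))
b2 = np.zeros(4)

def forward(x):
    """两层前向传播,输出四类的 sigmoid 激活值。"""
    h = sigmoid(x @ W1 + b1)
    return sigmoid(h @ W2 + b2)

# 参数量与单次前向的 FLOPs 粗略估算(乘、加各计一次)
n_params = W1.size + b1.size + W2.size + b2.size
flops = 2 * (n_input * 10 + 10 * 4)
```

以 32 维输入计,参数量为 374、单次前向约 720 FLOPs;具体数值取决于实际输入维数,可与摘要中 1.267 KB 模型大小、0.001314 MOps 的量级大致对照。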

链接: https://arxiv.org/abs/2504.03531
作者: Md Abu Obaida Zishan,H M Shihab,Sabik Sadman Islam,Maliha Alam Riya,Gazi Mashrur Rahman,Jannatun Noor
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:The electrocardiogram (ECG) monitoring device is an expensive albeit essential device for the treatment and diagnosis of cardiovascular diseases (CVD). The cost of this device typically ranges from $2,000 to $10,000. Several studies have implemented ECG monitoring systems in micro-controller units (MCU) to reduce industrial development costs by up to 20 times. However, to match industry-grade systems and display heartbeats effectively, it is essential to develop an efficient algorithm for detecting arrhythmia (irregular heartbeat). Hence in this study, a dense neural network is developed to detect arrhythmia on the Arduino Nano. The Nano consists of the ATMega328 microcontroller with a 16MHz clock, 2KB of SRAM, and 32KB of program memory. Additionally, the AD8232 SparkFun Single-Lead Heart Rate Monitor is used as the ECG sensor. The implemented neural network model consists of two layers (excluding the input) with 10 and 4 neurons, respectively, with sigmoid activation functions. However, four approaches are explored to choose the appropriate activation functions. The model has a size of 1.267 KB, achieves an F1 score (macro-average) of 78.3% for classifying four types of arrhythmia, an accuracy rate of 96.38%, and requires 0.001314 MOps of floating-point operations (FLOPs).
zh

[AI-3] Quantifying Robustness: A Benchmarking Framework for Deep Learning Forecasting in Cyber-Physical Systems

【速读】:该论文旨在解决工业Cyber-Physical Systems (CPS)领域中深度学习(Deep Learning, DL)方法在故障预测与健康管理(Prognostics and Health Management, PHM)任务中的鲁棒性不足问题,以及现有鲁棒性评估方法无法充分反映实际复杂场景的问题。论文的关键在于提出了一种基于分布鲁棒性的实用鲁棒性定义,专门针对工业CPS场景进行设计,并构建了一个系统化的鲁棒性评估框架。该框架通过模拟真实的扰动(如传感器漂移、噪声和不规则采样)来全面分析预测模型在真实数据集上的鲁棒性,同时提供标准化评分以量化和比较不同模型及架构的性能,从而辅助模型选择与优化。
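
作为示意,下面用 Python 给出摘要中三类现实扰动(传感器漂移、噪声、不规则采样)的一种最小实现草图,并附一个以误差相对退化为形式的标准化鲁棒性分数。具体的扰动参数化与论文框架的实际评分定义在摘要中未给出,以下函数名与参数均为假设。

```python
import numpy as np

rng = np.random.default_rng(42)

def add_drift(x, slope=0.01):
    """模拟传感器漂移:在信号上叠加缓慢的线性偏移。"""
    return x + slope * np.arange(len(x))

def add_noise(x, sigma=0.1):
    """模拟测量噪声:叠加零均值高斯白噪声。"""
    return x + rng.normal(0.0, sigma, size=len(x))

def irregular_sampling(x, drop_prob=0.2):
    """模拟不规则采样:随机丢弃部分采样点,返回保留点的索引与取值。"""
    keep = rng.random(len(x)) >= drop_prob
    return np.flatnonzero(keep), x[keep]

def robustness_score(err_clean, err_perturbed):
    """假设性的标准化分数:扰动前后预测误差的相对退化,越小越鲁棒。"""
    return (err_perturbed - err_clean) / max(err_clean, 1e-12)
```

这样的扰动函数可以套在任意预测模型的输入上,对同一数据集比较不同架构在扰动前后的误差退化。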

链接: https://arxiv.org/abs/2504.03494
作者: Alexander Windmann,Henrik Steude,Daniel Boschmann,Oliver Niggemann
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Cyber-Physical Systems (CPS) in domains such as manufacturing and energy distribution generate complex time series data crucial for Prognostics and Health Management (PHM). While Deep Learning (DL) methods have demonstrated strong forecasting capabilities, their adoption in industrial CPS remains limited due to insufficient robustness. Existing robustness evaluations primarily focus on formal verification or adversarial perturbations, inadequately representing the complexities encountered in real-world CPS scenarios. To address this, we introduce a practical robustness definition grounded in distributional robustness, explicitly tailored to industrial CPS, and propose a systematic framework for robustness evaluation. Our framework simulates realistic disturbances, such as sensor drift, noise and irregular sampling, enabling thorough robustness analyses of forecasting models on real-world CPS datasets. The robustness definition provides a standardized score to quantify and compare model performance across diverse datasets, assisting in informed model selection and architecture design. Through extensive empirical studies evaluating prominent DL architectures (including recurrent, convolutional, attention-based, modular, and structured state-space models), we demonstrate the applicability and effectiveness of our approach. We publicly release our robustness benchmark to encourage further research and reproducibility.
zh

[AI-4] Decentralized Collective World Model for Emergent Communication and Coordination

【速读】:该论文旨在解决多智能体系统中同时实现符号涌现(symbol emergence)用于通信以及协调行为的问题。现有研究通常分别关注通信或协调,而本文提出的方法通过集体预测编码的时间扩展,在同一框架下同时实现两者。方案的关键在于将世界模型与通信信道集成,使智能体能够预测环境动力学、从部分观测中估计状态,并通过双向消息交换共享关键信息,其中对比学习用于确保消息对齐。此外,通过约束防止直接访问其他智能体的内部状态,促进了更准确反映环境状态的有意义符号系统的涌现。这一方法在具有不同感知能力的智能体协作任务中表现出色,验证了去中心化通信在支持协调的同时构建共享环境表征的有效性。

链接: https://arxiv.org/abs/2504.03353
作者: Kentaro Nomura,Tatsuya Aoki,Tadahiro Taniguchi,Takato Horii
机构: 未知
类目: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:We propose a fully decentralized multi-agent world model that enables both symbol emergence for communication and coordinated behavior through temporal extension of collective predictive coding. Unlike previous research that focuses on either communication or coordination separately, our approach achieves both simultaneously. Our method integrates world models with communication channels, enabling agents to predict environmental dynamics, estimate states from partial observations, and share critical information through bidirectional message exchange with contrastive learning for message alignment. Using a two-agent trajectory drawing task, we demonstrate that our communication-based approach outperforms non-communicative models when agents have divergent perceptual capabilities, achieving the second-best coordination after centralized models. Importantly, our distributed approach with constraints preventing direct access to other agents’ internal states facilitates the emergence of more meaningful symbol systems that accurately reflect environmental states. These findings demonstrate the effectiveness of decentralized communication for supporting coordination while developing shared representations of the environment.
zh

[AI-5] Talk2X – An Open-Source Toolkit Facilitating Deployment of LLM-Powered Chatbots on the Web

【速读】:该论文旨在解决当前大型语言模型(LLM)驱动的聊天机器人在网页集成中的两大主要问题:一是由于闭源方案的限制,导致其在网页托管中的普及率较低;二是缺乏透明性,尤其是在实现细节和能源效率方面。论文的关键解决方案是提出了一种开源的代理工具Talk2X,它采用经过适配的检索增强生成方法(RAG),结合自动构建的向量数据库,以提升能源效率。Talk2X的架构通用性强,可适用于任意网站,并为开发者提供了一个即插即用的集成工具。通过混合方法评估,Talk2X显著提高了任务完成时间、正确性和用户体验,验证了解决方案的有效性。
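
摘要提到 Talk2X 采用改编的检索增强生成(RAG)方法与自动构建的向量数据库。下面用词袋向量和余弦相似度给出"检索、拼接提示词"这一流程的极简示意;Talk2X 实际使用的嵌入模型、索引结构与提示词模板论文摘要中未说明,以下文档内容与函数名均为假设。

```python
import numpy as np

# 假设性的网站文档片段,代替"自动生成的向量数据库"中的内容
docs = [
    "Talk2X is an open-source chatbot toolkit for websites.",
    "The vector database is generated automatically from site content.",
    "Retrieval-augmented generation combines search with an LLM.",
]

vocab = sorted({w for d in docs for w in d.lower().split()})

def embed(text):
    """词袋向量并做 L2 归一化,仅为示意,实际系统通常用神经嵌入模型。"""
    v = np.array([text.lower().split().count(w) for w in vocab], dtype=float)
    n = np.linalg.norm(v)
    return v / n if n else v

index = np.stack([embed(d) for d in docs])   # 简化版"向量数据库"

def retrieve(query, k=1):
    """按余弦相似度返回 top-k 文档。"""
    sims = index @ embed(query)
    top = np.argsort(-sims)[:k]
    return [docs[i] for i in top]

def build_prompt(query):
    """将检索结果拼入提示词,交给下游 LLM 生成回答(此处省略生成步骤)。"""
    context = "\n".join(retrieve(query, k=2))
    return f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"
```

真实部署中,嵌入与生成由模型服务承担,向量数据库也需持久化索引;这里只保留 RAG 的骨架逻辑。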

链接: https://arxiv.org/abs/2504.03343
作者: Lars Krupp,Daniel Geißler,Peter Hevesi,Marco Hirsch,Paul Lukowicz,Jakob Karolus
机构: 未知
类目: Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Information Retrieval (cs.IR)
备注:

点击查看摘要

Abstract:Integrated into websites, LLM-powered chatbots offer alternative means of navigation and information retrieval, leading to a shift in how users access information on the web. Yet, predominantly closed-sourced solutions limit proliferation among web hosts and suffer from a lack of transparency with regard to implementation details and energy efficiency. In this work, we propose our openly available agent Talk2X leveraging an adapted retrieval-augmented generation approach (RAG) combined with an automatically generated vector database, benefiting energy efficiency. Talk2X’s architecture is generalizable to arbitrary websites, offering developers a ready-to-use tool for integration. Using a mixed-methods approach, we evaluated Talk2X’s usability by tasking users to acquire specific assets from an open science repository. Talk2X significantly improved task completion time, correctness, and user experience, supporting users in quickly pinpointing specific information as compared to standard user-website interaction. Our findings contribute technical advancements to an ongoing paradigm shift of how we access information on the web.
zh

[AI-6] Policy Optimization Algorithms in a Unified Framework

【速读】:该论文旨在解决策略优化(Policy Optimization)算法在理解和实现上的复杂性问题,这些问题通常源于与马尔可夫决策过程相关的复杂计算,以及折扣奖励和平均奖励设置的差异使用。论文的关键解决方案在于提出一个统一框架,该框架结合广义遍历理论(Generalized Ergodicity Theory)和扰动分析(Perturbation Analysis)。广义遍历理论有助于阐明随机过程的稳态行为,从而帮助理解折扣奖励和平均奖励的机制;而扰动分析则深入揭示了策略优化算法的基本原理。通过这一框架,论文不仅识别了常见的实现错误,还展示了正确的实现方法,并通过线性二次调节器(Linear Quadratic Regulator, LQR)问题的案例研究,说明了算法设计的细微变化对实现结果的影响。论文的目标是使策略优化算法更易于理解与使用,并减少其在实际应用中的误用。
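
为直观区分摘要中提到的折扣奖励与平均奖励两种设置,下面给出二者在单条轨迹上的最小计算草图(仅为示意,并非论文框架本身):

```python
import numpy as np

def discounted_return(rewards, gamma=0.99):
    """折扣奖励设置:G = r_0 + gamma*r_1 + gamma^2*r_2 + ...
    逆序累加,避免显式计算幂次。"""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

def average_reward(rewards):
    """平均奖励设置:长期平均 (1/T) * sum_t r_t;
    样本平均收敛到稳态期望依赖于链的遍历性(ergodicity)。"""
    return float(np.mean(rewards))
```

两种设置对应不同的最优性准则,这正是论文用广义遍历理论统一处理的对象。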

链接: https://arxiv.org/abs/2504.03328
作者: Shuang Wu
机构: 未知
类目: Systems and Control (eess.SY); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Policy optimization algorithms are crucial in many fields but challenging to grasp and implement, often due to complex calculations related to Markov decision processes and varying use of discount and average reward setups. This paper presents a unified framework that applies generalized ergodicity theory and perturbation analysis to clarify and enhance the application of these algorithms. Generalized ergodicity theory sheds light on the steady-state behavior of stochastic processes, aiding understanding of both discounted and average rewards. Perturbation analysis provides in-depth insights into the fundamental principles of policy optimization algorithms. We use this framework to identify common implementation errors and demonstrate the correct approaches. Through a case study on Linear Quadratic Regulator problems, we illustrate how slight variations in algorithm design affect implementation outcomes. We aim to make policy optimization algorithms more accessible and reduce their misuse in practice.
zh

[AI-7] Towards Effective EU E-Participation: The Development of AskThePublic

【速读】:该论文试图解决如何通过设计一个基于大型语言模型(Large Language Model)的聊天机器人(Chatbot),提升现有公民参与平台在政策制定过程中的有效性。论文的关键在于结合媒体丰富度理论(Media Richness Theory)与设计科学研究方法(Design Science Research),开发名为AskThePublic的聊天机器人,以提供交互式、结构化且语言能力增强的响应,从而提高政策制定者、记者、研究人员及公众对公共输入的参与意愿。

链接: https://arxiv.org/abs/2504.03287
作者: Kilian Sprenkamp,Nils Messerschmidt,Amir Sartipi,Igor Tchappi,Xiaohui Wu,Liudmila Zavolokina,Gilbert Fridgen
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:E-participation platforms can be an important asset for governments in increasing trust and fostering democratic societies. By engaging non-governmental and private institutions, domain experts, and even the general public, policymakers can make informed and inclusive decisions. Drawing on the Media Richness Theory and applying the Design Science Research method, we explore how a chatbot can be designed to improve the effectiveness of the policy-making process of existing citizen involvement platforms. Leveraging the Have Your Say platform, which solicits feedback on European Commission initiatives and regulations, a Large Language Model based chatbot, called AskThePublic, is created, providing policymakers, journalists, researchers, and interested citizens with a convenient channel to explore and engage with public input. Results from 11 semistructured interviews show that the participants value the interactive and structured responses as well as enhanced language capabilities, thus increasing their likelihood of engaging with AskThePublic over the existing platform. An outlook for future iterations is provided and discussed with regard to the perspectives of the different stakeholders.
zh

[AI-8] Monte Carlo Graph Coloring

【速读】:该论文试图解决图着色(Graph Coloring)这一经典组合优化问题,尤其是在大规模图实例上的求解挑战。传统精确方法在处理包含数百个顶点以上的图时效率低下,因此论文探索如何有效应用蒙特卡洛搜索算法,特别是嵌套蒙特卡洛搜索(Nested Monte Carlo Search, NMCS)和嵌套 rollout 策略适应(Nested Rollout Policy Adaptation, NRPA),来应对这一难题。解决方案的关键在于设计适合图着色问题的蒙特卡洛搜索策略,并将其与现有启发式方法进行对比评估,以验证其性能优势。
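
作为示意,下面给出将一层嵌套蒙特卡洛搜索(level-1 NMCS)用于图着色的最小草图:逐顶点决策,对每个候选颜色做若干次随机补全(playout),以补全后的冲突边数作为评分,选择评分最小的颜色。具体的嵌套层数、评分函数与论文中的实现可能不同,此处仅为假设性实现。

```python
import random

random.seed(0)

def conflicts(coloring, edges):
    """冲突数:两端点同色的边数;着色合法当且仅当冲突数为 0。"""
    return sum(1 for u, v in edges if coloring[u] == coloring[v])

def playout(coloring, start, n, k):
    """随机补全:从 start 号顶点起均匀随机指派 k 种颜色之一。"""
    c = list(coloring)
    for j in range(start, n):
        c[j] = random.randrange(k)
    return c

def nmcs_level1(n, edges, k, n_playouts=16):
    """一层嵌套蒙特卡洛搜索:逐顶点决策,对每个候选颜色取
    n_playouts 次随机补全的最小冲突数作为评分。"""
    coloring = [0] * n
    for i in range(n):
        best_color, best_score = 0, float("inf")
        for color in range(k):
            coloring[i] = color
            score = min(
                conflicts(playout(coloring, i + 1, n, k), edges)
                for _ in range(n_playouts)
            )
            if score < best_score:
                best_color, best_score = color, score
        coloring[i] = best_color
    return coloring
```

在三角形等小图上,该草图通常能很快找到零冲突的合法着色;更高层的 NMCS 与 NRPA 则在此之上递归嵌套或自适应调整 playout 策略。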

链接: https://arxiv.org/abs/2504.03277
作者: Tristan Cazenave,Benjamin Negrevergne,Florian Sikora
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Graph Coloring is probably one of the most studied and famous problems in graph algorithms. Exact methods fail to solve instances with more than a few hundred vertices, therefore, a large number of heuristics have been proposed. Nested Monte Carlo Search (NMCS) and Nested Rollout Policy Adaptation (NRPA) are Monte Carlo search algorithms for single player games. Surprisingly, little work has been dedicated to evaluating Monte Carlo search algorithms on combinatorial graph problems. In this paper we show how to efficiently apply Monte Carlo search to Graph Coloring and compare this approach to existing ones.
zh

[AI-9] Do Large Language Models Solve the Problems of Agent-Based Modeling? A Critical Review of Generative Social Simulations

【速读】:该论文试图解决的问题是评估“生成式 Agent-Based Models (ABMs)”这一新兴方法是否能够有效应对传统 ABMs 长期面临的批评,包括缺乏现实性、计算复杂性以及校准和验证与实证数据的挑战。论文的关键在于通过回顾相关文献,发现生成式 ABMs 在面对这些批评时存在局限性,例如对历史辩论的关注不足、验证方法薄弱(多依赖主观可信度评估)且难以充分证明操作有效性。此外,论文指出大型语言模型(LLMs)的黑箱特性可能加剧而非缓解传统 ABMs 的挑战,并质疑该领域是否能够过渡到所需的严谨建模水平以推动社会科学理论的发展。

链接: https://arxiv.org/abs/2504.03274
作者: Maik Larooij,Petter Törnberg
机构: 未知
类目: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Recent advancements in AI have reinvigorated Agent-Based Models (ABMs), as the integration of Large Language Models (LLMs) has led to the emergence of “generative ABMs” as a novel approach to simulating social systems. While ABMs offer means to bridge micro-level interactions with macro-level patterns, they have long faced criticisms from social scientists, pointing to e.g., lack of realism, computational complexity, and challenges of calibrating and validating against empirical data. This paper reviews the generative ABM literature to assess how this new approach adequately addresses these long-standing criticisms. Our findings show that studies exhibit limited awareness of historical debates. Validation remains poorly addressed, with many studies relying solely on subjective assessments of model ‘believability’, and even the most rigorous validation failing to adequately evidence operational validity. We argue that there are reasons to believe that LLMs will exacerbate rather than resolve the long-standing challenges of ABMs. The black-box nature of LLMs moreover limits their usefulness for disentangling complex emergent causal mechanisms. While generative ABMs are still in a stage of early experimentation, these findings raise the question of whether and how the field can transition to the type of rigorous modeling needed to contribute to social scientific theory.
zh

[AI-10] Verification of Autonomous Neural Car Control with KeYmaera X

【速读】:该论文致力于解决自动驾驶汽车在高速公路场景中避免与邻近车辆发生碰撞的安全性验证问题。论文的关键在于利用微分动态逻辑(differential dynamic logic, dL)构建形式化模型,并通过KeYmaera X工具进行形式化安全证明。研究证明了在无限时间范围内无碰撞的安全性,且该安全性不受行驶时间长度及随时间变化的反应时间和制动力的影响。论文不仅展示了dL作为严格的形式化方法在运行时监控、屏蔽以及神经网络验证中的应用潜力,还揭示了ABZ’25案例研究中提供的规范与highway-env仿真环境之间的不一致性。通过尝试修复这些不一致性,论文发现了多个反例,进一步暴露了强化学习环境中存在的潜在问题。

链接: https://arxiv.org/abs/2504.03272
作者: Enguerrand Prebet,Samuel Teuber,André Platzer
机构: 未知
类目: Systems and Control (eess.SY); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Logic in Computer Science (cs.LO)
备注: 21 pages, 6 figures; Accepted at the 11th International Conference on Rigorous State Based Methods (ABZ’25)

点击查看摘要

Abstract:This article presents a formal model and formal safety proofs for the ABZ’25 case study in differential dynamic logic (dL). The case study considers an autonomous car driving on a highway avoiding collisions with neighbouring cars. Using KeYmaera X’s dL implementation, we prove absence of collision on an infinite time horizon which ensures that safety is preserved independently of trip length. The safety guarantees hold for time-varying reaction time and brake force. Our dL model considers the single lane scenario with cars ahead or behind. We demonstrate that dL with its tools is a rigorous foundation for runtime monitoring, shielding, and neural network verification. Doing so sheds light on inconsistencies between the provided specification and simulation environment highway-env of the ABZ’25 study. We attempt to fix these inconsistencies and uncover numerous counterexamples which also indicate issues in the provided reinforcement learning environment.
zh

[AI-11] An Extended Symbolic-Arithmetic Model for Teaching Double-Black Removal with Rotation in Red-Black Trees

【速读】:该论文旨在解决红黑(RB)树中双黑(DB)节点形成后引起的旋转和重新着色操作带来的教学与学习挑战。论文的关键在于提出了一种基于符号算术代数(SA)方法的扩展方案,通过定义如“Red + Black = Black”、“Black - Black = Red”等规则,实现双黑节点的移除以及黑高度的重新平衡。此外,论文进一步推导了三个SA数学方程:一般符号算术规则、部分符号算术规则1和部分符号算术规则2,用于处理不同情形下的黑高度平衡问题,包括LR、RL、LL和RR四种旋转情况,并考虑了直接或间接与双黑节点相连节点的位置关系。最终,论文通过分析内侄子(inner nephew)、外侄子(outer nephew)的颜色属性来指导旋转和重新着色操作,以确保红黑树的平衡性。这一方法的核心创新在于其能够有效处理涉及节点旋转和路径重着色的双黑节点移除过程,从而实现黑高度的整体平衡。
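
论文给出的四条符号算术(SA)规则可以直接编码为一个小型查表函数。下面是假设性的 Python 草图,其中 Red + Black 按交换律补全为两个方向(这是本示例的假设),规则以外的颜色组合保持未定义:

```python
# 颜色用字符串表示;"DB" 表示双黑(double-black)节点
ADD = {
    ("Red", "Black"): "Black",    # Red + Black = Black
    ("Black", "Red"): "Black",    # 假设加法可交换
    ("Black", "Black"): "DB",     # Black + Black = DB
}
SUB = {
    ("Black", "Black"): "Red",    # Black - Black = Red
    ("DB", "Black"): "Black",     # DB - Black = Black,即消去双黑
}

def color_add(a, b):
    """按 SA 规则做颜色加法;未定义的组合抛出 KeyError。"""
    return ADD[(a, b)]

def color_sub(a, b):
    """按 SA 规则做颜色减法;未定义的组合抛出 KeyError。"""
    return SUB[(a, b)]
```

在这种表示下,"DB - Black = Black" 对应双黑节点的消去,配合旋转与路径重着色即可恢复黑高度平衡。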

链接: https://arxiv.org/abs/2504.03259
作者: Kennedy E. Ehimwenma,Hongyu Zhou,Junfeng Wang,Ze Zheng
机构: 未知
类目: Data Structures and Algorithms (cs.DS); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Double-black (DB) nodes have no place in red-black (RB) trees. So when DB nodes are formed, they are immediately removed. The removal of DB nodes that cause rotation and recoloring of other connected nodes poses greater challenges in the teaching and learning of RB trees. To ease this difficulty, this paper extends our previous work on the symbolic arithmetic algebraic (SA) method for removing DB nodes. The SA operations, given as Red + Black = Black; Black - Black = Red; Black + Black = DB; and DB - Black = Black, remove DB nodes and rebalance black heights in RB trees. By extension, this paper projects three SA mathematical equations, namely, the general symbolic arithmetic rule, partial symbolic arithmetic rule1, and partial symbolic arithmetic rule2. The removal of a DB node ultimately affects black heights in RB trees. To balance black heights using the SA equations, all the RB tree cases, namely, LR, RL, LL, and RR, were considered in this work, and the position of the nodes connected directly or indirectly to the DB node was also tested. In this study, to balance a RB tree, the issues considered w.r.t. the different cases of the RB tree were i) whether a DB node has inner, outer, or both inner and outer black nephews; or ii) whether a DB node has inner, outer, or both inner and outer red nephews. The nephews r and x in this work are the children of the sibling s of a DB node, and further up the tree, the parent p of a DB node is their grandparent g. Thus, r and x have indirect relationships to a DB node at the point of its formation. The novelty of the SA equations is in their effectiveness in the removal of DB nodes that involves rotation of nodes as well as the recoloring of nodes along any simple path so as to balance black heights in a tree.
zh

[AI-12] Seeing is Believing: Belief-Space Planning with Foundation Models as Uncertainty Estimators

【速读】:该论文致力于解决在开放世界环境中通用机器人移动操作所面临的长时序、复杂目标和部分可观测性带来的挑战。论文的关键解决方案在于提出了一种新颖的框架,利用视觉-语言模型(Vision-Language Models, VLMs)作为感知模块来估计不确定性并促进符号化语义绑定。其核心创新点在于构建符号化信念表示,并采用信念空间规划器生成考虑不确定性的计划,从而实现战略性信息收集。这种方法使智能体能够有效应对部分可观测性和属性不确定性,显著提升了在部分可观测环境中的推理能力。通过实证研究,该方法在性能上优于传统的端到端 VLM 规划或基于 VLM 的状态估计方法。

链接: https://arxiv.org/abs/2504.03245
作者: Linfeng Zhao,Willie McClinton,Aidan Curtis,Nishanth Kumar,Tom Silver,Leslie Pack Kaelbling,Lawson L.S. Wong
机构: 未知
类目: Artificial Intelligence (cs.AI); Robotics (cs.RO)
备注:

点击查看摘要

Abstract:Generalizable robotic mobile manipulation in open-world environments poses significant challenges due to long horizons, complex goals, and partial observability. A promising approach to address these challenges involves planning with a library of parameterized skills, where a task planner sequences these skills to achieve goals specified in structured languages, such as logical expressions over symbolic facts. While vision-language models (VLMs) can be used to ground these expressions, they often assume full observability, leading to suboptimal behavior when the agent lacks sufficient information to evaluate facts with certainty. This paper introduces a novel framework that leverages VLMs as a perception module to estimate uncertainty and facilitate symbolic grounding. Our approach constructs a symbolic belief representation and uses a belief-space planner to generate uncertainty-aware plans that incorporate strategic information gathering. This enables the agent to effectively reason about partial observability and property uncertainty. We demonstrate our system on a range of challenging real-world tasks that require reasoning in partially observable environments. Simulated evaluations show that our approach outperforms both vanilla VLM-based end-to-end planning or VLM-based state estimation baselines by planning for and executing strategic information gathering. This work highlights the potential of VLMs to construct belief-space symbolic scene representations, enabling downstream tasks such as uncertainty-aware planning.
zh

[AI-13] Persuasive Calibration

【速读】:该论文旨在解决**Persuasive Calibration(说服性校准)**问题,即在存在激励不一致的情况下,如何设计一个最优预测器,使得主方(Principal)提供的预测能够在满足校准误差预算的前提下,引导代理方(Agent)做出期望的决策。论文的核心在于探讨当主方与代理方的目标不完全一致时,如何在标准的Lt-范数预期校准误差(Expected Calibration Error, ECE)框架下计算出最优预测器。

解决方案的关键在于:将预测器视为完美校准预测器的后处理版本,并基于此构建了一般性框架。通过该框架,论文揭示了最优预测器的结构特性,特别是当主方效用与事件无关且采用L1-范数ECE时,最优预测器在高或低真实期望结果时呈现过信或欠信特性,而在中间保持完美校准;同时,其错误校准行为与主方效用函数具有共线性结构。此外,论文提出了针对一般主方效用和Lt-范数ECE的近似最优预测器的FPTAS算法,以及针对L1-范数和L-无穷范数ECE的多项式时间精确算法。
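
作为参照,下面给出 L1-范数 ECE 的一种常见分箱经验估计量的 Python 草图。论文研究的是该类指标约束下最优预测器的结构与计算,此处仅示意指标本身;等宽分箱与箱数均为示例假设。

```python
import numpy as np

def l1_ece(pred, outcome, n_bins=10):
    """L1-范数 ECE 的分箱经验估计:按预测值等宽分箱,
    对每箱计算 |平均预测 - 平均结果|,再按箱内样本比例加权求和。"""
    pred = np.asarray(pred, dtype=float)
    outcome = np.asarray(outcome, dtype=float)
    bins = np.minimum((pred * n_bins).astype(int), n_bins - 1)
    ece = 0.0
    for b in range(n_bins):
        mask = bins == b
        if mask.any():
            ece += mask.mean() * abs(pred[mask].mean() - outcome[mask].mean())
    return ece
```

完美校准的预测器在该估计量下误差为 0;论文中的"校准误差预算"即允许该值不超过某个小常数。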

链接: https://arxiv.org/abs/2504.03211
作者: Yiding Feng,Wei Tang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Science and Game Theory (cs.GT); Theoretical Economics (econ.TH)
备注:

点击查看摘要

Abstract:We introduce and study the persuasive calibration problem, where a principal aims to provide trustworthy predictions about underlying events to a downstream agent to make desired decisions. We adopt the standard calibration framework that regulates predictions to be unbiased conditional on their own value, and thus, they can reliably be interpreted at the face value by the agent. Allowing a small calibration error budget, we aim to answer the following question: what is and how to compute the optimal predictor under this calibration error budget, especially when there exists incentive misalignment between the principal and the agent? We focus on the standard Lt-norm Expected Calibration Error (ECE) metric. We develop a general framework by viewing predictors as post-processed versions of perfectly calibrated predictors. Using this framework, we first characterize the structure of the optimal predictor. Specifically, when the principal’s utility is event-independent and for L1-norm ECE, we show: (1) the optimal predictor is over- (resp. under-) confident for high (resp. low) true expected outcomes, while remaining perfectly calibrated in the middle; (2) the miscalibrated predictions exhibit a collinearity structure with the principal’s utility function. On the algorithmic side, we provide an FPTAS for computing an approximately optimal predictor for general principal utility and general Lt-norm ECE. Moreover, for the L1- and L-Infinity-norm ECE, we provide polynomial-time algorithms that compute the exact optimal predictor.
zh

[AI-14] Augmenting Human Cognition With Generative AI: Lessons From AI-Assisted Decision-Making

【速读】:该论文旨在探索如何利用生成式 AI (Generative AI) 设计增强而非取代人类认知的工具。论文的关键问题是如何在生成式 AI 和 AI 辅助决策中平衡提供端到端(end-to-end)解决方案与过程导向(process-oriented)支持之间的关系。论文指出,当前流行的做法是向用户提供由 AI 生成的完整解决方案,供用户接受、拒绝或编辑,但这种方法存在挑战。论文的核心解决方案在于提倡过程导向的支持方式,即通过逐步协助用户完成任务,从而弥补端到端方案的不足,并讨论了这种支持方式在复杂决策任务中的适用性,基于大型语言模型 (LLMs) 的实验验证了这一方法的有效性。

链接: https://arxiv.org/abs/2504.03207
作者: Zelun Tony Zhang,Leon Reicherts
机构: 未知
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:How can we use generative AI to design tools that augment rather than replace human cognition? In this position paper, we review our own research on AI-assisted decision-making for lessons to learn. We observe that in both AI-assisted decision-making and generative AI, a popular approach is to suggest AI-generated end-to-end solutions to users, which users can then accept, reject, or edit. Alternatively, AI tools could offer more incremental support to help users solve tasks themselves, which we call process-oriented support. We describe findings on the challenges of end-to-end solutions, and how process-oriented support can address them. We also discuss the applicability of these findings to generative AI based on a recent study in which we compared both approaches to assist users in a complex decision-making task with LLMs.
zh

[AI-15] A Human Digital Twin Architecture for Knowledge-based Interactions and Context-Aware Conversations

【速读】:该论文旨在解决人工智能(Artificial Intelligence, AI)与机器学习(Machine Learning, ML)在人机协作(Human-Autonomy Teaming, HAT)任务中,如何帮助人类维持对自主资产的意识与控制,同时建立信任并支持共享语境理解这一核心挑战。论文的关键解决方案在于提出了一种实时的人类数字孪生体(Human Digital Twin, HDT)架构,该架构通过集成大型语言模型(Large Language Models, LLMs)实现知识报告、问答及推荐功能,并以视觉界面形式呈现。其核心创新点在于采用元认知方法,提供个性化且情境感知的响应,确保与人类队友的期望一致。此外,HDT 被设计为从训练到部署再到战后复盘全生命周期内表现行为均高度拟真的团队成员,系统包含语音识别、上下文处理、基于 AI 的对话、情感建模、唇形同步及多模态反馈等模块。

链接: https://arxiv.org/abs/2504.03147
作者: Abdul Mannan Mohammed,Azhar Ali Mohammad,Jason A. Ortiz,Carsten Neumann,Grace Bochenek,Dirk Reiners,Carolina Cruz-Neira
机构: 未知
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
备注: Presented at: 2024 Interservice/Industry Training, Simulation, and Education Conference (I/ITSEC), Paper No. 24366, 10 pages, 5 figures

点击查看摘要

Abstract:Recent developments in Artificial Intelligence (AI) and Machine Learning (ML) are creating new opportunities for Human-Autonomy Teaming (HAT) in tasks, missions, and continuous coordinated activities. A major challenge is enabling humans to maintain awareness and control over autonomous assets, while also building trust and supporting shared contextual understanding. To address this, we present a real-time Human Digital Twin (HDT) architecture that integrates Large Language Models (LLMs) for knowledge reporting, answering, and recommendation, embodied in a visual interface. The system applies a metacognitive approach to enable personalized, context-aware responses aligned with the human teammate’s expectations. The HDT acts as a visually and behaviorally realistic team member, integrated throughout the mission lifecycle, from training to deployment to after-action review. Our architecture includes speech recognition, context processing, AI-driven dialogue, emotion modeling, lip-syncing, and multimodal feedback. We describe the system design, performance metrics, and future development directions for more adaptive and realistic HAT systems. Related DOI: https://doi.org/10.17605/OSF.IO/KEH9T
zh

[AI-16] Graph Network Modeling Techniques for Visualizing Human Mobility Patterns

【速读】:该论文旨在解决城市尺度下人类移动分析中基于图(Graph-based)方法所面临的多重挑战,包括高质量数据的缺乏以实现高时空分辨率的流动表示、有限的计算资源将大规模移动数据转化为网络结构,以及图模型固有的扩展性问题。论文的关键解决方案在于通过将图嵌入连续空间(graph embedding into a continuous space),缓解了快速图匹配、图时间序列建模以及移动动态可视化等相关问题。实验表明,该方法能够有效将出租车轨迹数据转化为网络结构与流动模式,并在匹配图的任务中平均减少约40%的误差,相较未匹配图表现更优。

链接: https://arxiv.org/abs/2504.03119
作者: Sinjini Mitra,Anuj Srivastava,Avipsa Roy,Pavan Turaga
机构: 未知
类目: Social and Information Networks (cs.SI); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
备注:

点击查看摘要

Abstract:Human mobility analysis at urban scale requires models to represent the complex nature of human movements, which in turn are affected by accessibility to nearby points of interest, underlying socioeconomic factors of a place, and local transport choices for people living in a geographic region. In this work, we represent human mobility and the associated flow of movements as a graph. Graph-based approaches for mobility analysis are still in their early stages of adoption and are actively being researched. The challenges of graph-based mobility analysis are multifaceted: the lack of sufficiently high-quality data to represent flows at high spatial and temporal resolution, limited computational resources to translate large volumes of mobility data into a network structure, and scaling issues inherent in graph models. The current study develops a methodology by embedding graphs into a continuous space, which alleviates issues related to fast graph matching, graph time-series modeling, and visualization of mobility dynamics. Through experiments, we demonstrate how mobility data collected from taxicab trajectories can be transformed into network structures and patterns of mobility flow changes, and can be used for downstream tasks, reporting an approximately 40% decrease in error on average for matched graphs versus unmatched ones.
zh

[AI-17] Post-processing for Fair Regression via Explainable SVD

【速读】:本文旨在解决神经网络回归模型在训练过程中实现统计公平性(statistical parity)的问题。关键在于通过一种基于可解释奇异值分解(Singular Value Decomposition, SVD)的权重矩阵线性变换方法,将公平性约束转化为对奇异值的约束。具体而言,该方法利用变换后矩阵的奇异值直接反映两个群体输出分布一阶矩和二阶矩的差异,从而在解析层面优化满足公平性约束的模型参数。实验验证表明,该方法在不同数据集上的公平性-准确性权衡达到或优于基准方法,且无需在推理阶段使用敏感属性。

链接: https://arxiv.org/abs/2504.03093
作者: Zhiqun Zuo,Ding Zhu,Mohammad Mahdi Khalili
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:This paper presents a post-processing algorithm for training fair neural network regression models that satisfy statistical parity, utilizing an explainable singular value decomposition (SVD) of the weight matrix. We propose a linear transformation of the weight matrix, whereby the singular values derived from the SVD of the transformed matrix directly correspond to the differences in the first and second moments of the output distributions across two groups. Consequently, we can convert the fairness constraints into constraints on the singular values. We analytically solve the problem of finding the optimal weights under these constraints. Experimental validation on various datasets demonstrates that our method achieves a similar or superior fairness-accuracy trade-off compared to the baselines without using the sensitive attribute at the inference time.
zh
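论文把公平性约束转化为对(变换后)权重矩阵奇异值的约束。下面的 numpy 草图只演示其中最简单的一种情形:末层权重为向量、且仅对齐两群体输出的一阶矩,相当于把均值差方向上对应的"奇异值"压为零。示例数据、维度与这种向量化简化均为假设,并非论文的完整 SVD 构造:

```python
import numpy as np

rng = np.random.default_rng(0)
# 两个群体的倒数第二层特征(假设的玩具数据,群体 1 的均值整体偏移 0.5)
h0 = rng.normal(0.0, 1.0, size=(200, 4))
h1 = rng.normal(0.5, 1.0, size=(200, 4))
w = rng.normal(size=4)                     # 末层线性权重

d = h0.mean(axis=0) - h1.mean(axis=0)      # 群体特征均值差方向
gap = w @ d                                # 两群体输出一阶矩之差

# 去掉 w 在均值差方向上的分量,使一阶矩差严格为零;
# 这是"把对应奇异值约束为 0"在向量情形下的类比。
w_fair = w - (w @ d) / (d @ d) * d
gap_fair = w_fair @ d                      # 数学上恰为 0
```

论文的完整方法同时约束一阶与二阶矩,并在这些奇异值约束下解析求得最优权重,而非像此处简单投影。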

[AI-18] Machine Learning-Based Detection and Analysis of Suspicious Activities in Bitcoin Wallet Transactions in the USA

【速读】:本文旨在解决通过机器学习算法有效识别和追踪比特币钱包交易中可疑活动的问题。解决方案的关键在于开发能够捕捉数据非线性关系的模型,并通过特征工程提取趋势和异常值以揭示非法活动。研究采用了深度解析的比特币钱包交易数据集,包含交易金额、时间戳、网络流量及地址等要素。经实验验证,Random Forest因其最高的F1分数被证明为最佳模型,这表明其在处理复杂数据关系方面的优越性。研究结果不仅揭示了钱包活动中显著的模式,如未赎回交易与最终余额之间的相关性,还强调了应用机器学习算法跟踪加密货币对于构建透明且安全的美国市场的重要性。

链接: https://arxiv.org/abs/2504.03092
作者: Md Zahidul Islam,Md Shahidul Islam,Biswajit Chandra das,Syed Ali Reza,Proshanta Kumar Bhowmik,Kanchon Kumar Bishnu,Md Shafiqur Rahman,Redoyan Chowdhury,Laxmi Pant
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 20 pages,7 figures

点击查看摘要

Abstract:The dramatic adoption of Bitcoin and other cryptocurrencies in the USA has revolutionized the financial landscape and provided unprecedented investment and transaction efficiency opportunities. The prime objective of this research project is to develop machine learning algorithms capable of effectively identifying and tracking suspicious activity in Bitcoin wallet transactions. With high-tech analysis, the study aims to create a model with a feature for identifying trends and outliers that can expose illicit activity. The current study specifically focuses on Bitcoin transaction information in America, with a strong emphasis placed on the importance of knowing about the immediate environment in and through which such transactions pass through. The dataset is composed of in-depth Bitcoin wallet transactional information, including important factors such as transaction values, timestamps, network flows, and addresses for wallets. All entries in the dataset expose information about financial transactions between wallets, including received and sent transactions, and such information is significant for analysis and trends that can represent suspicious activity. This study deployed three accredited algorithms, most notably, Logistic Regression, Random Forest, and Support Vector Machines. In retrospect, Random Forest emerged as the best model with the highest F1 Score, showcasing its ability to handle non-linear relationships in the data. Insights revealed significant patterns in wallet activity, such as the correlation between unredeemed transactions and final balances. The application of machine algorithms in tracking cryptocurrencies is a tool for creating transparent and secure U.S. markets.
zh

[AI-19] From Questions to Insights: Exploring XAI Challenges Reported on Stack Overflow Questions

【速读】:该论文旨在解决人工智能(AI)模型可解释性不足导致其实际应用受限的问题。论文通过分析Stack Overflow(SO)上的663个相关问题,识别出解释性AI(XAI)技术在真实场景中面临的七大挑战,其中模型集成与分歧问题是最为普遍且严重的挑战。论文的关键解决方案在于提出改进方向,即通过增强解释一致性与简化集成过程,以提升XAI技术的易用性和可访问性,从而促进其在实践中的有效应用。

链接: https://arxiv.org/abs/2504.03085
作者: Saumendu Roy,Saikat Mondal,Banani Roy,Chanchal Roy
机构: 未知
类目: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注: Accepted in 29th International Conference on Evaluation and Assessment in Software Engineering (EASE 2025)

点击查看摘要

Abstract:The lack of interpretability is a major barrier that limits the practical usage of AI models. Several eXplainable AI (XAI) techniques (e.g., SHAP, LIME) have been employed to interpret these models’ performance. However, users often face challenges when leveraging these techniques in real-world scenarios and thus submit questions in technical QA forums like Stack Overflow (SO) to resolve these challenges. We conducted an exploratory study to expose these challenges, their severity, and features that can make XAI techniques more accessible and easier to use. Our contributions to this study are fourfold. First, we manually analyzed 663 SO questions that discussed challenges related to XAI techniques. Our careful investigation produced a catalog of seven challenges (e.g., disagreement issues). We then analyzed their prevalence and found that model integration and disagreement issues emerged as the most prevalent challenges. Second, we attempt to estimate the severity of each XAI challenge by determining the correlation between challenge types and answer metadata (e.g., the presence of accepted answers). Our analysis suggests that model integration issues are the most severe challenge. Third, we attempt to perceive the severity of these challenges based on practitioners’ ability to use XAI techniques effectively in their work. Practitioners’ responses suggest that disagreement issues most severely affect the use of XAI techniques. Fourth, we seek agreement from practitioners on improvements or features that could make XAI techniques more accessible and user-friendly. The majority of them suggest consistency in explanations and simplified integration. Our study findings might (a) help to enhance the accessibility and usability of XAI and (b) act as the initial benchmark that can inspire future research.
zh

[AI-20] Integrating Identity-Based Identification against Adaptive Adversaries in Federated Learning

【速读】:本文旨在解决联邦学习(Federated Learning, FL)系统中由重新连接的恶意客户端(Reconnecting Malicious Clients, RMCs)带来的安全威胁问题。这些威胁源于恶意客户端利用FL系统的开放连接特性,在断开后通过修改攻击策略重新接入系统以规避检测。论文的关键解决方案是将基于身份的认证(Identity-Based Identification, IBI)集成到FL环境中,通过基于密码学的身份方案实现客户端的身份验证,从而有效防止已断开连接的恶意客户端再次进入系统。此方法采用椭圆曲线上的TNC-IBI(Tan-Ng-Chin)方案以确保计算效率,并结合安全聚合算法如Krum和Trimmed Mean进一步提升FL系统的鲁棒性,显著减轻RMCs的影响。

链接: https://arxiv.org/abs/2504.03077
作者: Jakub Kacper Szelag,Ji-Jian Chin,Lauren Ansell,Sook-Chin Yip
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注: 10 pages, 5 figures, research article, IEEE possible publication (in submission)

点击查看摘要

Abstract:Federated Learning (FL) has recently emerged as a promising paradigm for privacy-preserving, distributed machine learning. However, FL systems face significant security threats, particularly from adaptive adversaries capable of modifying their attack strategies to evade detection. One such threat is the presence of Reconnecting Malicious Clients (RMCs), which exploit FL's open connectivity by reconnecting to the system with modified attack strategies. To address this vulnerability, we propose integration of Identity-Based Identification (IBI) as a security measure within FL environments. By leveraging IBI, we enable FL systems to authenticate clients based on cryptographic identity schemes, effectively preventing previously disconnected malicious clients from re-entering the system. Our approach is implemented using the TNC-IBI (Tan-Ng-Chin) scheme over elliptic curves to ensure computational efficiency, particularly in resource-constrained environments like Internet of Things (IoT). Experimental results demonstrate that integrating IBI with secure aggregation algorithms, such as Krum and Trimmed Mean, significantly improves FL robustness by mitigating the impact of RMCs. We further discuss the broader implications of IBI in FL security, highlighting research directions for adaptive adversary detection, reputation-based mechanisms, and the applicability of identity-based cryptographic frameworks in decentralized FL architectures. Our findings advocate for a holistic approach to FL security, emphasizing the necessity of proactive defence strategies against evolving adaptive adversarial threats.
zh

[AI-21] Design of AI-Powered Tool for Self-Regulation Support in Programming Education

【速读】:该论文旨在解决大型语言模型(LLM)编程辅助工具与机构学习管理系统(LMS)之间缺乏集成的问题,以及现有研究在利用LLM支持学生自我调节学习技能发展方面的不足。论文的核心问题是:如何通过整合LLM工具与LMS,实现基于上下文的个性化反馈,并促进学生的自我调节学习能力提升。

解决方案的关键在于开发CodeRunner Agent,这是一个基于LLM的编程助手,它集成了Moodle中的CodeRunner插件。CodeRunner Agent通过结合课程材料、编程问题、学生答案及执行结果等详细上下文信息,为教育者提供可定制的AI反馈。同时,它通过策略导向的AI回应增强学生的自我调节学习能力。这种集成化、上下文化且以技能为导向的方法为编程教育的数据驱动改进提供了有前景的方向。

链接: https://arxiv.org/abs/2504.03068
作者: Huiyong Li,Boxuan Ma
机构: 未知
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Large Language Model (LLM) tools have demonstrated their potential to deliver high-quality assistance by providing instant, personalized feedback that is crucial for effective programming education. However, many of these tools operate independently from institutional Learning Management Systems, which creates a significant disconnect. This isolation limits the ability to leverage learning materials and exercise context for generating tailored, context-aware feedback. Furthermore, previous research on self-regulated learning and LLM support mainly focused on knowledge acquisition, not the development of important self-regulation skills. To address these challenges, we developed CodeRunner Agent, an LLM-based programming assistant that integrates the CodeRunner, a student-submitted code executing and automated grading plugin in Moodle. CodeRunner Agent empowers educators to customize AI-generated feedback by incorporating detailed context from lecture materials, programming questions, student answers, and execution results. Additionally, it enhances students’ self-regulated learning by providing strategy-based AI responses. This integrated, context-aware, and skill-focused approach offers promising avenues for data-driven improvements in programming education.
zh

[AI-22] Context-Aware Self-Adaptation for Domain Generalization ICML2023

【速读】:该论文致力于解决领域泛化(Domain Generalization)问题,即如何在仅使用源训练域(source training domains)数据的情况下,学习一个能够在未见测试域(unseen testing domain)上表现良好的模型。论文提出了一种名为上下文感知自适应(Context-Aware Self-Adaptation, CASA)的两阶段方法作为解决方案。CASA 的关键在于通过模拟近似的元泛化(meta-generalization)场景,并引入自适应模块,在保持预训练的元源模型(meta-source models)在元源域(meta-source domains)上的预测能力的同时,调整这些模型以适配元目标域(meta-target domains)。自适应的核心思想是利用上下文信息(如小批量特征的均值)作为领域知识,自动将第一阶段训练好的模型迁移到第二阶段的新上下文中。此外,通过集成多个元源模型进行测试域推理进一步提升了性能。实验结果表明,所提出的方法在标准基准数据集上达到了最先进的性能。

链接: https://arxiv.org/abs/2504.03064
作者: Hao Yan,Yuhong Guo
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: ICML 2023 AdvML Frontiers workshop

点击查看摘要

Abstract:Domain generalization aims at developing suitable learning algorithms in source training domains such that the model learned can generalize well on a different unseen testing domain. We present a novel two-stage approach called Context-Aware Self-Adaptation (CASA) for domain generalization. CASA simulates an approximate meta-generalization scenario and incorporates a self-adaptation module to adjust pre-trained meta source models to the meta-target domains while maintaining their predictive capability on the meta-source domains. The core concept of self-adaptation involves leveraging contextual information, such as the mean of mini-batch features, as domain knowledge to automatically adapt a model trained in the first stage to new contexts in the second stage. Lastly, we utilize an ensemble of multiple meta-source models to perform inference on the testing domain. Experimental results demonstrate that our proposed method achieves state-of-the-art performance on standard benchmarks.
zh

[AI-23] Safety Modulation: Enhancing Safety in Reinforcement Learning through Cost-Modulated Rewards

【速读】:本文旨在解决强化学习(Reinforcement Learning, RL)在实际应用中因违反安全约束而导致严重后果的问题。传统的强化学习方法通常仅关注最大化奖励信号,而忽视了对安全约束的严格遵守。为了解决这一问题,论文提出了一种名为安全调制策略优化(Safety Modulated Policy Optimization, SMPO)的新方法。其关键是通过引入安全调制奖励机制,在标准策略优化框架内实现安全策略函数的学习。具体而言,论文将安全违规成本视为与标准奖励并行的环境反馈,并设计了一个Q-cost函数作为安全评估器来估计未来的累积成本。此外,通过一个精心设计的成本感知加权函数调整奖励,确保在满足安全限制的同时最大化期望奖励。策略函数与安全评估器通过梯度下降法在与环境的在线交互过程中同步更新。实验结果表明,该方法在多个强化学习环境中显著优于经典及最新的对比方法。

链接: https://arxiv.org/abs/2504.03040
作者: Hanping Zhang,Yuhong Guo
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Safe Reinforcement Learning (Safe RL) aims to train an RL agent to maximize its performance in real-world environments while adhering to safety constraints, as exceeding safety violation limits can result in severe consequences. In this paper, we propose a novel safe RL approach called Safety Modulated Policy Optimization (SMPO), which enables safe policy function learning within the standard policy optimization framework through safety modulated rewards. In particular, we consider safety violation costs as feedback from the RL environments that are parallel to the standard rewards, and introduce a Q-cost function as a safety critic to estimate expected future cumulative costs. Then we propose to modulate the rewards using a cost-aware weighting function, which is carefully designed to ensure the safety limits based on the estimation of the safety critic, while maximizing the expected rewards. The policy function and the safety critic are simultaneously learned through gradient descent during online interactions with the environment. We conduct experiments using multiple RL environments and the experimental results demonstrate that our method outperforms several classic and state-of-the-art comparison methods in terms of overall safe RL performance.
zh
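"成本感知加权调制奖励"的思想可以用一个极简函数示意:当安全评估器(Q-cost)估计的未来累积成本接近安全上限时压低奖励。其中 sigmoid 形状的权重函数与各参数均为示例性假设,并非论文给出的具体调制形式:

```python
import math

def modulated_reward(reward, q_cost, cost_limit, temperature=1.0):
    """当估计的未来累积成本 q_cost 接近或超过安全上限 cost_limit 时,
    将奖励按权重压低;成本远低于上限时奖励基本保持不变。"""
    overshoot = (q_cost - cost_limit) / temperature
    weight = 1.0 / (1.0 + math.exp(overshoot))  # 安全时约为 1,危险时约为 0
    return weight * reward

safe = modulated_reward(1.0, q_cost=0.0, cost_limit=10.0)     # 权重约为 1
unsafe = modulated_reward(1.0, q_cost=20.0, cost_limit=10.0)  # 权重接近 0
```

在实际方法中,q_cost 由与策略同步训练的安全评估器在线给出,调制后的奖励再进入标准策略优化框架。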

[AI-24] Deep Reinforcement Learning via Object-Centric Attention

【速读】:该论文试图解决深度强化学习(Deep Reinforcement Learning)智能体在原始像素输入训练下难以泛化至未见过的环境的问题,这些问题源于对无关背景细节和虚假相关性的依赖。为了解决这一挑战,论文提出了一种基于对象的注意力机制——Object-Centric Attention via Masking (OCCAM),其核心在于通过掩码选择性地保留与任务相关的实体,同时过滤掉无关的视觉信息。OCCAM 的关键创新在于利用对象中心的归纳偏置(inductive bias),无需针对不同任务设计特定表示,从而实现更鲁棒的泛化能力,显著降低样本复杂度,并在 Atari 基准测试中展现出与传统基于像素的方法相当或更好的性能。

链接: https://arxiv.org/abs/2504.03024
作者: Jannis Blüml,Cedric Derstroff,Bjarne Gregori,Elisabeth Dillies,Quentin Delfosse,Kristian Kersting
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 26 pages, 11 figures, 7 tables

点击查看摘要

Abstract:Deep reinforcement learning agents, trained on raw pixel inputs, often fail to generalize beyond their training environments, relying on spurious correlations and irrelevant background details. To address this issue, object-centric agents have recently emerged. However, they require different representations tailored to the task specifications. Contrary to deep agents, no single object-centric architecture can be applied to any environment. Inspired by principles of cognitive science and Occam’s Razor, we introduce Object-Centric Attention via Masking (OCCAM), which selectively preserves task-relevant entities while filtering out irrelevant visual information. Specifically, OCCAM takes advantage of the object-centric inductive bias. Empirical evaluations on Atari benchmarks demonstrate that OCCAM significantly improves robustness to novel perturbations and reduces sample complexity while showing similar or improved performance compared to conventional pixel-based RL. These results suggest that structured abstraction can enhance generalization without requiring explicit symbolic representations or domain-specific object extraction pipelines.
zh
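OCCAM 的掩码机制本身可以用几行 numpy 示意:保留任务相关对象覆盖的像素、滤除其余视觉信息。真实智能体中对象掩码由上游对象发现模块产生,此处以手工构造的玩具掩码代替(画面、对象名与掩码均为假设):

```python
import numpy as np

def occam_mask(frame, object_masks, relevant_ids):
    """只保留任务相关对象覆盖的像素,其余(背景、干扰物)全部置零。"""
    keep = np.zeros(frame.shape[:2], dtype=bool)
    for oid in relevant_ids:
        keep |= object_masks[oid]
    return np.where(keep[..., None], frame, 0)

frame = np.full((4, 4, 3), 255, dtype=np.uint8)   # 玩具画面:全白像素
masks = {
    "ball":   np.zeros((4, 4), dtype=bool),
    "banner": np.zeros((4, 4), dtype=bool),
}
masks["ball"][1, 1] = True     # 任务相关对象
masks["banner"][3, :] = True   # 无关背景细节

filtered = occam_mask(frame, masks, ["ball"])      # 仅保留 ball 的像素
```

过滤后的画面再作为 RL 智能体的观测输入,从而切断对背景细节与虚假相关性的依赖。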

[AI-25] Localized Definitions and Distributed Reasoning : A Proof-of-Concept Mechanistic Interpretability Study via Activation Patching

【速读】:该论文旨在探究经过微调的 GPT-2 模型中知识表示的定位问题,通过因果层归因方法(Causal Layer Attribution via Activation Patching, CLAP)识别负责正确答案生成的关键神经网络层。研究使用了两个配置对模型进行微调,微调数据集包含 9,958 篇 PubMed 摘要,并通过验证损失监控实现早期停止。CLAP 方法的核心包括缓存干净(正确答案)和损坏(错误答案)激活、计算对数差以量化模型偏好,以及通过将损坏的激活替换为干净的激活来评估恢复效果。研究的关键发现是:事实性知识更倾向于局部化表示,而关联性知识依赖于分布式表示;此外,编辑效果取决于任务类型。这些结果不仅解决了关于模型编辑中局部化问题的冲突观察,还强调了采用任务适应性技术以实现可靠且可解释的更新的重要性。

链接: https://arxiv.org/abs/2504.02976
作者: Nooshin Bahador
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 15 pages, 3 figures

点击查看摘要

Abstract:This study investigates the localization of knowledge representation in fine-tuned GPT-2 models using Causal Layer Attribution via Activation Patching (CLAP), a method that identifies critical neural layers responsible for correct answer generation. The model was fine-tuned on 9,958 PubMed abstracts (epilepsy: 20,595 mentions, EEG: 11,674 mentions, seizure: 13,921 mentions) using two configurations with validation loss monitoring for early stopping. CLAP involved (1) caching clean (correct answer) and corrupted (incorrect answer) activations, (2) computing logit difference to quantify model preference, and (3) patching corrupted activations with clean ones to assess recovery. Results revealed three findings: First, patching the first feedforward layer recovered 56% of correct preference, demonstrating that associative knowledge is distributed across multiple layers. Second, patching the final output layer completely restored accuracy (100% recovery), indicating that definitional knowledge is localised. The stronger clean logit difference for definitional questions further supports this localized representation. Third, minimal recovery from convolutional layer patching (13.6%) suggests low-level features contribute marginally to high-level reasoning. Statistical analysis confirmed significant layer-specific effects (p < 0.01). These findings demonstrate that factual knowledge is more localized and associative knowledge depends on distributed representations. We also showed that editing efficacy depends on task type. Our findings not only reconcile conflicting observations about localization in model editing but also emphasize on using task-adaptive techniques for reliable, interpretable updates.
zh
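激活补丁 (activation patching) 的核心循环可以用一个玩具"两层模型"演示:分别缓存干净与损坏两次前向的激活,把干净激活替换进损坏前向的某一层,再用 logit 差衡量恢复比例。这里的层函数与数值均为示意假设,与 GPT-2 的真实结构无关:

```python
def run(layers, x, patch=None):
    """依次执行各层;patch=(层号, 激活) 表示把该层的输出替换为给定激活。"""
    acts = []
    for i, layer in enumerate(layers):
        x = layer(x)
        if patch is not None and patch[0] == i:
            x = patch[1]
        acts.append(x)
    return x, acts

layers = [lambda x: 2 * x, lambda x: x + 1]   # 玩具"模型"

clean_logit, clean_acts = run(layers, 3.0)    # 干净输入的前向与激活缓存
corrupt_logit, _ = run(layers, 1.0)           # 损坏输入的前向

# 把损坏前向中第 0 层的激活替换为干净激活
patched_logit, _ = run(layers, 1.0, patch=(0, clean_acts[0]))

# 恢复比例:1.0 表示被补丁的层完全承载了干净/损坏之间的差异
recovery = (patched_logit - corrupt_logit) / (clean_logit - corrupt_logit)
```

论文正是按这一比例来比较前馈层(部分恢复,56%)与输出层(完全恢复,100%)对不同类型知识的贡献。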

[AI-26] Improved Compact Genetic Algorithms with Efficient Caching

【速读】:该论文旨在解决 Compact Genetic Algorithms (cGAs) 在接近收敛时反复生成相同染色体导致冗余评估的问题。解决方案的关键在于引入缓存机制(caching),通过记录和复用已评估过的染色体,避免对相同染色体的重复评估,从而显著减少函数评估次数,同时保持算法的时间效率。论文还提出了一种高效的数据结构用于缓存维护,并分析了基于两种知名缓存替换策略的性能,证明了该方法在维持性能准确性的同时能够有效降低计算开销。

链接: https://arxiv.org/abs/2504.02972
作者: Prasanta Dutta,Anirban Mukhopadhyay
机构: 未知
类目: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI)
备注: 13 pages

点击查看摘要

Abstract:Compact Genetic Algorithms (cGAs) are condensed variants of classical Genetic Algorithms (GAs) that use a probability vector representation of the population instead of the complete population. cGAs have been shown to significantly reduce the number of function evaluations required while producing outcomes similar to those of classical GAs. However, cGAs have a tendency to repeatedly generate the same chromosomes as they approach convergence, resulting in unnecessary evaluations of identical chromosomes. This article introduces the concept of caching in cGAs as a means of avoiding redundant evaluations of the same chromosomes. Our proposed approach operates equivalently to cGAs, but enhances the algorithm’s time efficiency by reducing the number of function evaluations. We also present a data structure for efficient cache maintenance to ensure low overhead. The proposed caching approach has an asymptotically constant time complexity on average. The proposed method further generalizes the caching mechanism with higher selection pressure for elitism-based cGAs. We conduct a rigorous analysis based on experiments on benchmark optimization problems using two well-known cache replacement strategies. The results demonstrate that caching significantly reduces the number of function evaluations required while maintaining the same level of performance accuracy.
zh
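缓存思想可以在 OneMax 问题上的标准 cGA 中用一个字典实现:已评估过的染色体不再触发适应度评估,接近收敛时反复采样的相同染色体因此只计一次。问题规模、种群参数与迭代次数均为示例,缓存替换策略等细节从略:

```python
import random

def onemax(bits):
    return sum(bits)   # 适应度:串中 1 的个数

def cga_with_cache(n=20, pop_size=50, iters=2000, seed=1):
    rng = random.Random(seed)
    p = [0.5] * n                  # 概率向量表示的"种群"
    cache, evals = {}, 0

    def evaluate(bits):
        nonlocal evals
        key = tuple(bits)
        if key not in cache:       # 只有首次出现的染色体才真正评估
            cache[key] = onemax(bits)
            evals += 1
        return cache[key]

    def sample():
        return [1 if rng.random() < pi else 0 for pi in p]

    for _ in range(iters):
        a, b = sample(), sample()
        winner, loser = (a, b) if evaluate(a) >= evaluate(b) else (b, a)
        for i in range(n):         # 向胜者方向移动概率向量
            if winner[i] != loser[i]:
                step = 1 / pop_size
                p[i] += step if winner[i] == 1 else -step
                p[i] = min(1.0, max(0.0, p[i]))
    return max(cache.values()), evals

best, evals = cga_with_cache()
# 共采样 2 * iters 个染色体,但 evals 明显更少:重复染色体全部命中缓存
```

论文进一步给出了低开销的缓存维护数据结构,并把该机制推广到高选择压力的精英 cGA 变体。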

[AI-27] Global-Order GFlowNets ICLR2025

【速读】:该论文旨在解决 Order-Preserving (OP) GFlowNets 在处理多目标优化问题时因局部顺序(local order)导致的冲突优化目标问题。论文指出,虽然 OP GFlowNets 能通过基于帕累托支配关系施加局部顺序来高效采样帕累托前沿附近的多样化候选解,并避免使用标量化(scalarization),但这种局部顺序可能导致优化目标之间的冲突。为了解决这一问题,论文引入了 Global-Order GFlowNets,通过将局部顺序转化为全局顺序,消除了这些冲突。实验评估表明,所提出的方法在多个基准测试中表现出显著的有效性和潜力。

链接: https://arxiv.org/abs/2504.02968
作者: Lluís Pastor-Pérez,Javier Alonso-Garcia,Lukas Mauch
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 8 pages, ICLR 2025 Workshop format

点击查看摘要

Abstract:Order-Preserving (OP) GFlowNets have demonstrated remarkable success in tackling complex multi-objective (MOO) black-box optimization problems using stochastic optimization techniques. Specifically, they can be trained online to efficiently sample diverse candidates near the Pareto front. A key advantage of OP GFlowNets is their ability to impose a local order on training samples based on Pareto dominance, eliminating the need for scalarization - a common requirement in other approaches like Preference-Conditional GFlowNets. However, we identify an important limitation of OP GFlowNets: imposing a local order on training samples can lead to conflicting optimization objectives. To address this issue, we introduce Global-Order GFlowNets, which transform the local order into a global one, thereby resolving these conflicts. Our experimental evaluations on various benchmarks demonstrate the efficacy and promise of our proposed method.
zh
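摘要中的"局部顺序"基于帕累托支配关系,而支配只是一个偏序:存在互不支配的候选对,这正是局部顺序可能引发冲突优化目标的根源。一个最小的支配判定示例(以最大化为约定,具体目标值为假设):

```python
def dominates(a, b):
    """帕累托支配(最大化):a 在所有目标上不劣于 b,
    且至少在一个目标上严格更优。"""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))

# (2, 3) 支配 (1, 3);而 (1, 3) 与 (2, 2) 互不支配——偏序中的"不可比"对
comparable = dominates((2, 3), (1, 3))
incomparable = not dominates((1, 3), (2, 2)) and not dominates((2, 2), (1, 3))
```

Global-Order GFlowNets 的出发点即是把这种只在可比对上定义的局部顺序替换为全局顺序,从而消除训练样本间的目标冲突。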

[AI-28] Digital Forensics in the Age of Large Language Models

【速读】:该论文试图解决的问题是传统数字取证方法因依赖手动操作而难以应对快速增长且日益复杂的数字数据,以及数字取证领域对大型语言模型(Large Language Models, LLM)的潜力缺乏全面理解。解决方案的关键在于通过提供一个易于理解和系统性的概述,阐明LLM如何革新数字取证方法,并强调其在自动化和增强取证任务中的优越能力。同时,论文结合理论与实践,通过具体示例和现实场景展示LLM的应用,并深入分析其在幻觉效应、可解释性、偏见及伦理考量等方面的局限性,进而提出未来研究方向,强调实现透明度、问责制和流程标准化的重要性。

链接: https://arxiv.org/abs/2504.02963
作者: Zhipeng Yin,Zichong Wang,Weifeng Xu,Jun Zhuang,Pallab Mozumder,Antoinette Smith,Wenbin Zhang
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Digital forensics plays a pivotal role in modern investigative processes, utilizing specialized methods to systematically collect, analyze, and interpret digital evidence for judicial proceedings. However, traditional digital forensic techniques are primarily based on manual labor-intensive processes, which become increasingly insufficient with the rapid growth and complexity of digital data. To this end, Large Language Models (LLMs) have emerged as powerful tools capable of automating and enhancing various digital forensic tasks, significantly transforming the field. Despite the strides made, general practitioners and forensic experts often lack a comprehensive understanding of the capabilities, principles, and limitations of LLM, which limits the full potential of LLM in forensic applications. To fill this gap, this paper aims to provide an accessible and systematic overview of how LLM has revolutionized the digital forensics approach. Specifically, it takes a look at the basic concepts of digital forensics, as well as the evolution of LLM, and emphasizes the superior capabilities of LLM. To connect theory and practice, relevant examples and real-world scenarios are discussed. We also critically analyze the current limitations of applying LLMs to digital forensics, including issues related to hallucination, interpretability, bias, and ethical considerations. In addition, this paper outlines the prospects for future research, highlighting the need for effective use of LLMs for transparency, accountability, and robust standardization in the forensic process.
zh

[AI-29] Level Up Peer Review in Education: Investigating genAI-driven Gamification system and its influence on Peer Feedback Effectiveness

【速读】:该论文旨在解决软件工程(Software Engineering, SE)教育中代码评审和设计批评技能培养不足的问题。尽管这些技能在专业实践中至关重要,但它们在正式教育中很少被强调,且学生之间的同行反馈质量与参与度存在显著差异。为了解决这一问题,论文提出了关键解决方案:Socratique,这是一个结合游戏化(Gamification)和生成式人工智能(Generative AI, GenAI)辅助的同行评估平台。其核心在于通过引入游戏元素激励学生提供更多的反馈,同时利用GenAI助手实时支持高质量、建设性的评论撰写,从而有效提升同行评审的质量与效率。

链接: https://arxiv.org/abs/2504.02962
作者: Rafal Wlodarski,Leonardo da Silva Sousa,Allison Connell Pensky
机构: 未知
类目: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:In software engineering (SE), the ability to review code and critique designs is essential for professional practice. However, these skills are rarely emphasized in formal education, and peer feedback quality and engagement can vary significantly among students. This paper introduces Socratique, a gamified peer-assessment platform integrated with Generative AI (GenAI) assistance, designed to develop students’ peer-review skills in a functional programming course. By incorporating game elements, Socratique aims to motivate students to provide more feedback, while the GenAI assistant offers real-time support in crafting high quality, constructive comments. To evaluate the impact of this approach, we conducted a randomized controlled experiment with master’s students comparing a treatment group with a gamified, GenAI-driven setup against a control group with minimal gamification. Results show that students in the treatment group provided significantly more voluntary feedback, with higher scores on clarity, relevance, and specificity - all key aspects of effective code and design reviews. This study provides evidence for the effectiveness of combining gamification and AI to improve peer review processes, with implications for fostering review-related competencies in software engineering curricula.
zh

[AI-30] Graph Attention for Heterogeneous Graphs with Positional Encoding

【速读】:该论文旨在解决异构图(heterogeneous graphs)上 Graph Neural Networks (GNNs) 性能普遍低于同构图(homogeneous graphs)的问题。论文通过基准测试多种 GNN 架构,发现图注意力网络(Graph Attention Networks, GATs)在节点分类和链接预测任务中表现尤为出色。解决方案的关键在于通过将位置编码(positional encodings)集成到注意力机制中,利用完整的拉普拉斯谱(full Laplacian spectrum)来精确捕捉节点的相对和绝对位置,从而进一步提升下游任务(如节点分类和链接预测)的性能。

链接: https://arxiv.org/abs/2504.02938
作者: Nikhil Shivakumar Nayak
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Discrete Mathematics (cs.DM); Differential Geometry (math.DG); Machine Learning (stat.ML)
备注: 10 pages, 3 figures

点击查看摘要

Abstract:Graph Neural Networks (GNNs) have emerged as the de facto standard for modeling graph data, with attention mechanisms and transformers significantly enhancing their performance on graph-based tasks. Despite these advancements, the performance of GNNs on heterogeneous graphs often remains complex, with networks generally underperforming compared to their homogeneous counterparts. This work benchmarks various GNN architectures to identify the most effective methods for heterogeneous graphs, with a particular focus on node classification and link prediction. Our findings reveal that graph attention networks excel in these tasks. As a main contribution, we explore enhancements to these attention networks by integrating positional encodings for node embeddings. This involves utilizing the full Laplacian spectrum to accurately capture both the relative and absolute positions of each node within the graph, further enhancing performance on downstream tasks such as node classification and link prediction.
zh
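利用完整拉普拉斯谱作位置编码的做法可以用 numpy 在小图上示意。下面以 4 节点路径图为例,计算未归一化拉普拉斯 L = D - A 的全部特征向量作为节点位置编码;归一化方式、是否只取前 k 个分量等细节是论文中的设计选择,此处仅作示意:

```python
import numpy as np

def laplacian_positional_encoding(adj, k=None):
    """返回图拉普拉斯 L = D - A 的特征值与特征向量(按特征值升序);
    特征向量矩阵的每一行即对应节点的位置编码,k=None 表示使用完整谱。"""
    adj = np.asarray(adj, dtype=float)
    lap = np.diag(adj.sum(axis=1)) - adj
    eigvals, eigvecs = np.linalg.eigh(lap)   # 对称矩阵,特征值升序
    if k is not None:
        eigvecs = eigvecs[:, :k]
    return eigvals, eigvecs

# 4 节点路径图 0-1-2-3
A = [[0, 1, 0, 0],
     [1, 0, 1, 0],
     [0, 1, 0, 1],
     [0, 0, 1, 0]]
vals, vecs = laplacian_positional_encoding(A)
# 最小特征值恒为 0(对应常数向量);该路径图的最大特征值为 2 + sqrt(2)
```

将每行编码与节点特征拼接后输入注意力层,即可让 GAT 同时感知节点的相对与绝对位置。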

[AI-31] Systematic Literature Review: Explainable AI Definitions and Challenges in Education

【速读】: This paper addresses the lack of a unified definition of Explainable AI (XAI) and the identification of its challenges in education. Through a systematic review following the PRISMA process, the authors analyze 19 relevant studies and extract 15 definitions of XAI and 62 challenges, grouping the latter into seven themes: explainability, ethics, technical issues, human-computer interaction (HCI), trustworthiness, policy and guidelines, and others. The review highlights both the potential contribution of XAI to education and the confusion caused by overlapping and inconsistent definitions spanning ethics, trustworthiness, technical detail, and explainability. The key takeaway is the need for standardized XAI definitions and frameworks to address these challenges and enable effective adoption in education.

链接: https://arxiv.org/abs/2504.02910
作者: Zaid M. Altukhi,Sojen Pradhan
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Explainable AI (XAI) seeks to transform black-box algorithmic processes into transparent ones, enhancing trust in AI applications across various sectors such as education. This review aims to examine the various definitions of XAI within the literature and explore the challenges of XAI in education. Our goal is to shed light on how XAI can contribute to enhancing the educational field. This systematic review, utilising the PRISMA method for rigorous and transparent research, identified 19 relevant studies. Our findings reveal 15 definitions and 62 challenges. These challenges are categorised using thematic analysis into seven groups: explainability, ethical, technical, human-computer interaction (HCI), trustworthiness, policy and guideline, and others, thereby deepening our understanding of the implications of XAI in education. Our analysis highlights the absence of standardised definitions for XAI, leading to confusion, especially because definitions concerning ethics, trustworthiness, technicalities, and explainability tend to overlap and vary.
zh

[AI-32] Hide and Seek in Noise Labels: Noise-Robust Collaborative Active Learning with LLM-Powered Assistance

【速读】: This paper tackles learning from noisy labels (LNL), a common real-world challenge in which training data may contain incorrect or corrupted labels. Existing methods typically rely on active learning to identify noisy labels and query human experts for denoising, but their performance remains limited by how accurately clean samples can be separated from noisy ones. To overcome this limitation, the paper proposes NoiseAL, an innovative collaborative learning framework based on active learning that combines Large Language Models (LLMs) with Small Models (SMs). During collaborative training, two small models first form a co-prediction network and a dynamic-enhanced threshold strategy divides the noisy data into different subsets; clean and noisy samples are then selected from these subsets and fed to the LLM, acting as an active annotator, to rectify the noisy samples; finally, different optimization objectives are applied to subsets with different degrees of label noise. Experiments on synthetic and real-world noisy datasets show that the framework outperforms state-of-the-art baselines.

链接: https://arxiv.org/abs/2504.02901
作者: Bo Yuan,Yulin Chen,Yin Zhang,Wei Jiang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Learning from noisy labels (LNL) is a challenge that arises in many real-world scenarios where collected training data can contain incorrect or corrupted labels. Most existing solutions identify noisy labels and adopt active learning to query human experts on them for denoising. In the era of large language models (LLMs), although we can reduce the human effort to improve these methods, their performances are still subject to accurately separating the clean and noisy samples from noisy data. In this paper, we propose an innovative collaborative learning framework NoiseAL based on active learning to combine LLMs and small models (SMs) for learning from noisy labels. During collaborative training, we first adopt two SMs to form a co-prediction network and propose a dynamic-enhanced threshold strategy to divide the noisy data into different subsets, then select the clean and noisy samples from these subsets to feed the active annotator LLMs to rectify noisy samples. Finally, we employ different optimization objectives to conquer subsets with different degrees of label noises. Extensive experiments on synthetic and real-world noise datasets further demonstrate the superiority of our framework over state-of-the-art baselines.
zh

[AI-33] Meat-Free Day Reduces Greenhouse Gas Emissions but Poses Challenges for Customer Retention and Adherence to Dietary Guidelines

【速读】: This paper evaluates the environmental, behavioral, and nutritional impacts of a Meat-Free Day (MFD) strategy for reducing meat consumption on a university campus, and its potential for meeting global environmental and nutritional targets. Over 18 months, the authors implemented 67 randomly scheduled weekly MFDs across 12 university cafeterias and analyzed more than 400,000 food purchases. MFD substantially reduced campus food-related greenhouse gas emissions on treated days (-52.9%) and improved fiber (+26.9%) and cholesterol (-4.5%) intake, but was accompanied by a 27.6% drop in protein intake and a 34.2% rise in sugar consumption. Moreover, the increase in plant-based meals on treated days did not carry over to subsequent non-MFD days, and some diners left campus, creating a risk of compensatory off-campus meat consumption. The analysis therefore identifies on-campus customer retention as the main challenge to MFD effectiveness and recommends pairing MFD with customer-retention interventions to maximize its environmental and nutritional benefits.

链接: https://arxiv.org/abs/2504.02899
作者: Giuseppe Russo,Kristina Gligorić,Vincent Moreau,Robert West
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); General Economics (econ.GN)
备注: 26 pages, 7 figures, 19 Tables

点击查看摘要

Abstract:Reducing meat consumption is crucial for achieving global environmental and nutritional targets. Meat-Free Day (MFD) is a widely adopted strategy to address this challenge by encouraging plant-based diets through the removal of animal-based meals. We assessed the environmental, behavioral, and nutritional impacts of MFD by implementing 67 MFDs over 18 months (once a week on a randomly chosen day) across 12 cafeterias on a large university campus, analyzing over 400,000 food purchases. MFD reduced on-campus food-related greenhouse gas (GHG) emissions on treated days by 52.9% and contributed to improved fiber (+26.9%) and cholesterol (-4.5%) consumption without altering caloric intake. These nutritional benefits were, however, accompanied by a 27.6% decrease in protein intake and a 34.2% increase in sugar consumption. Moreover, the increase in plant-based meals did not carry over to subsequent days, as evidenced by a 3.5% rebound in animal-based meal consumption on days immediately following treated days. MFD also led to a 16.8% drop in on-campus meal sales on treated days. Monte Carlo simulations suggest that if 8.7% of diners were to eat burgers off-campus on treated days, MFD’s GHG savings would be fully negated. As our analysis identifies on-campus customer retention as the main challenge to MFD effectiveness, we recommend combining MFD with customer retention interventions to ensure environmental and nutritional benefits.
zh

[AI-34] Embedding Method for Knowledge Graph with Densely Defined Ontology

【速读】: This paper addresses the underutilization of ontologies, and in particular of relationships between properties, in existing Knowledge Graph Embedding (KGE) models. It proposes TransU, a new KGE model designed for knowledge graphs whose ontologies explicitly define relationships between properties. The key idea is to treat properties as a subset of entities, enabling a unified representation that allows the model to exploit semantic information among properties more effectively. Experimental results are reported on a standard dataset and a practical, real-world dataset.

链接: https://arxiv.org/abs/2504.02889
作者: Takanori Ugai
机构: 未知
类目: Social and Information Networks (cs.SI); Artificial Intelligence (cs.AI)
备注: 6pages, 4 figures

点击查看摘要

Abstract:Knowledge graph embedding (KGE) is a technique that enhances knowledge graphs by addressing incompleteness and improving knowledge retrieval. A limitation of the existing KGE models is their underutilization of ontologies, specifically the relationships between properties. This study proposes a KGE model, TransU, designed for knowledge graphs with well-defined ontologies that incorporate relationships between properties. The model treats properties as a subset of entities, enabling a unified representation. We present experimental results using a standard dataset and a practical dataset.
zh
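A minimal sketch of what a "unified representation" can look like: the property ("knows") indexes the same embedding table as the entities it relates, scored here TransE-style. The vocabulary, embedding dimension, and scoring function are illustrative assumptions; TransU's exact formulation may differ:

```python
import numpy as np

rng = np.random.default_rng(0)

# One shared table: properties are treated as a subset of entities, so a
# property id indexes the same matrix as head/tail entity ids.
vocab = ["alice", "bob", "knows"]       # two entities and one property
emb = rng.normal(size=(len(vocab), 16))

def score(h, r, t):
    """Lower is more plausible, TransE-style: ||h + r - t||."""
    return float(np.linalg.norm(emb[h] + emb[r] - emb[t]))

# Plausibility of the triple (alice, knows, bob); "knows" is row 2 of the
# same table that rows 0 and 1 (the entities) come from.
print(score(0, 2, 1))
```

Because properties live in the same space as entities, triples *about* properties (e.g., subPropertyOf statements) can be scored with the same function.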

[AI-35] Exploration of Multi-Element Collaborative Research and Application for Modern Power System Based on Generative Large Models

【速读】: This paper targets the complex challenges of renewable-energy integration, energy-storage management, and carbon-emission optimization in intelligent, low-carbon power systems. The key is to leverage Generative Large Models (GLMs), combined with spatiotemporal modeling and reinforcement learning, to optimize load-side management, energy-storage utilization, and the dynamics of electricity-related carbon emissions. GLMs can process multi-source data and capture complex system dynamics, supporting dynamic grid scheduling, improved stability, optimized carbon-trading strategies, and greater resilience against extreme weather events. The core of the solution is combining data-driven methods with intelligent algorithms to achieve efficient, adaptive, and low-carbon power-system operation.

链接: https://arxiv.org/abs/2504.02855
作者: Lu Cheng,Qixiu Zhang,Beibei Xu,Zhiwei Huang,Cirun Zhang,Yanan Lyu,Fan Zhang
机构: 未知
类目: Systems and Control (eess.SY); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:The transition to intelligent, low-carbon power systems necessitates advanced optimization strategies for managing renewable energy integration, energy storage, and carbon emissions. Generative Large Models (GLMs) provide a data-driven approach to enhancing forecasting, scheduling, and market operations by processing multi-source data and capturing complex system dynamics. This paper explores the role of GLMs in optimizing load-side management, energy storage utilization, and electricity carbon, with a focus on Smart Wide-area Hybrid Energy Systems with Storage and Carbon (SGLSC). By leveraging spatiotemporal modeling and reinforcement learning, GLMs enable dynamic energy scheduling, improve grid stability, enhance carbon trading strategies, and strengthen resilience against extreme weather events. The proposed framework highlights the transformative potential of GLMs in achieving efficient, adaptive, and low-carbon power system operations.
zh

[AI-36] A First-Principles Based Risk Assessment Framework and the IEEE P3396 Standard

【速读】: This paper addresses the novel risks raised by generative AI's unprecedented automation of content creation and decision support. It presents a first-principles risk assessment framework underlying the IEEE P3396 Recommended Practice for AI Risk, Safety, Trustworthiness, and Responsibility. The key is an information-centric ontology that classifies generative AI outputs into four fundamental categories: (1) perception-level information, (2) knowledge-level information, (3) decision/action-plan information, and (4) control tokens (access or resource directives). This classification enables systematic identification of harms and more precise attribution of responsibility to stakeholders (developers, deployers, users, regulators) based on the nature of the information produced. The paper shows how each information type entails distinct outcome risks (e.g., deception, misinformation, unsafe recommendations, security breaches) and requires tailored risk metrics and mitigations. By grounding the framework in the essence of information, human agency, and cognition, risk evaluation is aligned with how AI outputs influence human understanding and action. The result is a principled approach that supports clear accountability and targeted safeguards rather than broad application-based risk categorizations; example tables map information types to risks and responsibilities. The work aims to give the IEEE P3396 Recommended Practice, and AI governance more broadly, a rigorous first-principles foundation for assessing generative AI risks while enabling responsible innovation.

链接: https://arxiv.org/abs/2504.00091
作者: Richard J. Tong,Marina Cortês,Jeanine A. DeFalco,Mark Underwood,Janusz Zalewski
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
备注: 8 pages with 3 tables. This manuscript is prepared for publication by the Institute of Electrical and Electronics Engineers, Standards Association (IEEE-SA), Sponsor Committee - Artificial Intelligence Standards Committee (C/AISC) as a White Paper of Working Group p3396 at this https URL

点击查看摘要

Abstract:Generative Artificial Intelligence (AI) is enabling unprecedented automation in content creation and decision support, but it also raises novel risks. This paper presents a first-principles risk assessment framework underlying the IEEE P3396 Recommended Practice for AI Risk, Safety, Trustworthiness, and Responsibility. We distinguish between process risks (risks arising from how AI systems are built or operated) and outcome risks (risks manifest in the AI system’s outputs and their real-world effects), arguing that generative AI governance should prioritize outcome risks. Central to our approach is an information-centric ontology that classifies AI-generated outputs into four fundamental categories: (1) Perception-level information, (2) Knowledge-level information, (3) Decision/Action plan information, and (4) Control tokens (access or resource directives). This classification allows systematic identification of harms and more precise attribution of responsibility to stakeholders (developers, deployers, users, regulators) based on the nature of the information produced. We illustrate how each information type entails distinct outcome risks (e.g. deception, misinformation, unsafe recommendations, security breaches) and requires tailored risk metrics and mitigations. By grounding the framework in the essence of information, human agency, and cognition, we align risk evaluation with how AI outputs influence human understanding and action. The result is a principled approach to AI risk that supports clear accountability and targeted safeguards, in contrast to broad application-based risk categorizations. We include example tables mapping information types to risks and responsibilities. This work aims to inform the IEEE P3396 Recommended Practice and broader AI governance with a rigorous, first-principles foundation for assessing generative AI risks while enabling responsible innovation.
zh

[AI-37] On-line Policy Improvement using Monte-Carlo Search NIPS NEURIPS1996

【速读】: This paper addresses real-time policy improvement for adaptive controllers. The key is a Monte-Carlo simulation algorithm: at each simulation step, the long-term expected reward of each possible action is statistically measured, using the initial policy to make decisions within the simulation, and the action maximizing the measured expected reward is then taken, yielding an improved policy. The algorithm is easily parallelizable and was implemented on the IBM SP1 and SP2 parallel-RISC supercomputers. Applied to backgammon with a wide variety of initial policies, ranging from a random policy to TD-Gammon, it substantially reduced the error rate of the base players in every case, by as much as a factor of 5 or more. The algorithm is also potentially useful in many other adaptive control applications where the environment can be simulated.

链接: https://arxiv.org/abs/2501.05407
作者: Gerald Tesauro,Gregory R. Galperin
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Neural and Evolutionary Computing (cs.NE)
备注: Accompanied by oral presentation by Gregory Galperin at NeurIPS 1996 (then known as NIPS*96)

点击查看摘要

Abstract:We present a Monte-Carlo simulation algorithm for real-time policy improvement of an adaptive controller. In the Monte-Carlo simulation, the long-term expected reward of each possible action is statistically measured, using the initial policy to make decisions in each step of the simulation. The action maximizing the measured expected reward is then taken, resulting in an improved policy. Our algorithm is easily parallelizable and has been implemented on the IBM SP1 and SP2 parallel-RISC supercomputers. We have obtained promising initial results in applying this algorithm to the domain of backgammon. Results are reported for a wide variety of initial policies, ranging from a random policy to TD-Gammon, an extremely strong multi-layer neural network. In each case, the Monte-Carlo algorithm gives a substantial reduction, by as much as a factor of 5 or more, in the error rate of the base players. The algorithm is also potentially useful in many other adaptive control applications in which it is possible to simulate the environment.
Journal reference: Advances in Neural Information Processing 9 (NIPS 1996 Proceedings, published 1997)
zh
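The Monte-Carlo policy-improvement loop described in the abstract can be sketched on a toy chain environment (the environment, rollout depth, and sample counts below are illustrative assumptions, not the backgammon setup):

```python
import random

def simulate(env_step, state, action, base_policy, depth, rng):
    """Roll out `depth` steps: take `action` first, then follow base_policy."""
    total, s, a = 0.0, state, action
    for _ in range(depth):
        s, r = env_step(s, a, rng)
        total += r
        a = base_policy(s)
    return total

def mc_policy_improvement(env_step, state, actions, base_policy,
                          n_rollouts=200, depth=10, seed=0):
    """Pick the action whose Monte-Carlo estimate of long-term reward is best."""
    rng = random.Random(seed)
    def value(a):
        return sum(simulate(env_step, state, a, base_policy, depth, rng)
                   for _ in range(n_rollouts)) / n_rollouts
    return max(actions, key=value)

# Toy chain: the action shifts the state; the reward equals the new state.
def step(s, a, rng):
    s2 = s + a + rng.choice([-1, 0, 0, 1])  # noisy transition
    return s2, s2

base_policy = lambda s: -1                  # deliberately poor base policy
best = mc_policy_improvement(step, 0, [-1, +1], base_policy)
print(best)  # 1: the Monte-Carlo lookahead corrects the base policy
```

Because each rollout is independent, the `n_rollouts` loop is what the paper parallelized across the SP1/SP2 processors.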

[AI-38] Physics-informed 4D X-ray image reconstruction from ultra-sparse spatiotemporal data

【速读】: This paper addresses the challenges of X-ray imaging of fast dynamic processes with the high flux density of modern X-ray sources: time-resolved tomography yields a limited number of projections or limited spatial information due to constrained scanning speed, while stroboscopic imaging yields a limited number of time points, making the reconstruction problem ill-posed and intractable for classical reconstruction methods. The key contribution is 4D physics-informed optimized neural implicit X-ray imaging (4D-PIONIX), a new method that combines a full physical model with a state-of-the-art deep-learning-based sparse-view reconstruction approach for 4D X-ray imaging, preserving the integrity of the physical model while coping with ultra-sparse data. Its potential is demonstrated by retrieving 4D information from ultra-sparse spatiotemporal acquisitions of simulated binary droplet collisions, and the authors expect it to open new spatiotemporal possibilities for time-resolved X-ray tomography and novel sparse-acquisition modalities such as X-ray multi-projection imaging, enabling studies of a range of rapid 4D dynamics.

链接: https://arxiv.org/abs/2504.03469
作者: Zisheng Yao,Yuhe Zhang,Zhe Hu,Robert Klöfkorn,Tobias Ritschel,Pablo Villanueva-Perez
机构: 未知
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Data Analysis, Statistics and Probability (physics.data-an)
备注:

点击查看摘要

Abstract:The unprecedented X-ray flux density provided by modern X-ray sources offers new spatiotemporal possibilities for X-ray imaging of fast dynamic processes. Approaches to exploit such possibilities often result in either i) a limited number of projections or spatial information due to limited scanning speed, as in time-resolved tomography, or ii) a limited number of time points, as in stroboscopic imaging, making the reconstruction problem ill-posed and unlikely to be solved by classical reconstruction approaches. 4D reconstruction from such data requires sample priors, which can be included via deep learning (DL). State-of-the-art 4D reconstruction methods for X-ray imaging combine the power of AI and the physics of X-ray propagation to tackle the challenge of sparse views. However, most approaches do not constrain the physics of the studied process, i.e., a full physical model. Here we present 4D physics-informed optimized neural implicit X-ray imaging (4D-PIONIX), a novel physics-informed 4D X-ray image reconstruction method combining the full physical model and a state-of-the-art DL-based reconstruction method for 4D X-ray imaging from sparse views. We demonstrate and evaluate the potential of our approach by retrieving 4D information from ultra-sparse spatiotemporal acquisitions of simulated binary droplet collisions, a relevant fluid dynamic process. We envision that this work will open new spatiotemporal possibilities for various 4D X-ray imaging modalities, such as time-resolved X-ray tomography and more novel sparse acquisition approaches like X-ray multi-projection imaging, which will pave the way for investigations of various rapid 4D dynamics, such as fluid dynamics and composite testing.
zh

[AI-39] The AI Cosmologist I: An Agentic System for Automated Data Analysis

【速读】: This paper tackles the automation of the scientific research workflow, and in particular the complexity of data-intensive research in astrophysics and cosmology. Traditional automated machine-learning systems usually optimize a single objective and lack flexibility and creativity. The proposed solution, the AI Cosmologist, is an agentic system that emulates the full scientific process of a human researcher through specialized agents for planning, coding, execution, analysis, and synthesis, covering the entire pipeline from idea generation through experimental evaluation to research dissemination. Notably, the AI Cosmologist can generate diverse implementation strategies, write complete code, handle runtime errors, interpret results, and propose new approaches based on experimental outcomes. This end-to-end capability demonstrates the potential of agentic systems to accelerate scientific discovery, including autonomously producing a complete scientific paper starting from only a dataset and a task description.

链接: https://arxiv.org/abs/2504.03424
作者: Adam Moss
机构: 未知
类目: Instrumentation and Methods for Astrophysics (astro-ph.IM); Cosmology and Nongalactic Astrophysics (astro-ph.CO); Astrophysics of Galaxies (astro-ph.GA); Artificial Intelligence (cs.AI); Data Analysis, Statistics and Probability (physics.data-an)
备注: 45 pages

点击查看摘要

Abstract:We present the AI Cosmologist, an agentic system designed to automate cosmological/astronomical data analysis and machine learning research workflows. This implements a complete pipeline from idea generation to experimental evaluation and research dissemination, mimicking the scientific process typically performed by human researchers. The system employs specialized agents for planning, coding, execution, analysis, and synthesis that work together to develop novel approaches. Unlike traditional auto machine-learning systems, the AI Cosmologist generates diverse implementation strategies, writes complete code, handles execution errors, analyzes results, and synthesizes new approaches based on experimental outcomes. We demonstrate the AI Cosmologist capabilities across several machine learning tasks, showing how it can successfully explore solution spaces, iterate based on experimental results, and combine successful elements from different approaches. Our results indicate that agentic systems can automate portions of the research process, potentially accelerating scientific discovery. The code and experimental data used in this paper are available on GitHub at this https URL. Example papers included in the appendix demonstrate the system’s capability to autonomously produce complete scientific publications, starting from only the dataset and task description
zh

[AI-40] Mind the Prompt: Prompting Strategies in Audio Generations for Improving Sound Classification

【速读】: This paper investigates how to design effective prompt strategies for generating realistic datasets with Text-To-Audio (TTA) models in order to improve sound classification, and explores techniques for efficiently combining these datasets to further enhance their utility. The key findings are that task-specific prompt strategies significantly outperform basic prompting for data generation, and that merging datasets generated by different TTA models improves classification performance more effectively than simply enlarging the training set.

链接: https://arxiv.org/abs/2504.03329
作者: Francesca Ronchini,Ho-Hsiang Wu,Wei-Cheng Lin,Fabio Antonacci
机构: 未知
类目: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Sound (cs.SD); Signal Processing (eess.SP)
备注: Accepted at Generative Data Augmentation for Real-World Signal Processing Applications Workshop

点击查看摘要

Abstract:This paper investigates the design of effective prompt strategies for generating realistic datasets using Text-To-Audio (TTA) models. We also analyze different techniques for efficiently combining these datasets to enhance their utility in sound classification tasks. By evaluating two sound classification datasets with two TTA models, we apply a range of prompt strategies. Our findings reveal that task-specific prompt strategies significantly outperform basic prompt approaches in data generation. Furthermore, merging datasets generated using different TTA models proves to enhance classification performance more effectively than merely increasing the training dataset size. Overall, our results underscore the advantages of these methods as effective data augmentation techniques using synthetic data.
zh

[AI-41] JanusDDG: A Thermodynamics-Compliant Model for Sequence-Based Protein Stability via Two-Fronts Multi-Head Attention

【速读】: This paper addresses the prediction of protein free-energy changes ((\Delta \Delta G)) caused by residue mutations, covering both single- and multiple-point mutations. Whereas traditional methods rely on protein structure, the key contribution is JanusDDG, a deep learning framework that combines embeddings from Protein Language Models (PLMs) with a bidirectional cross-attention transformer architecture. Queries (Q) and values (V) are computed as the difference between wild-type and mutant embeddings, while keys (K) alternate between the two states, yielding a cross-interleaved attention mechanism that captures mutation-induced perturbations while preserving essential contextual information. By design, the model also respects fundamental thermodynamic properties such as antisymmetry and transitivity. Experiments show that JanusDDG achieves state-of-the-art accuracy in predicting (\Delta \Delta G) from sequence alone, matching or exceeding structure-based methods for both single and multiple mutations.

链接: https://arxiv.org/abs/2504.03278
作者: Guido Barducci,Ivan Rossi,Francesco Codicè,Cesare Rollo,Valeria Repetto,Corrado Pancotti,Virginia Iannibelli,Tiziana Sanavia,Piero Fariselli
机构: 未知
类目: Quantitative Methods (q-bio.QM); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Computational Physics (physics.comp-ph)
备注: 20 pages, 11 figures

点击查看摘要

Abstract:Understanding how residue variations affect protein stability is crucial for designing functional proteins and deciphering the molecular mechanisms underlying disease-related mutations. Recent advances in protein language models (PLMs) have revolutionized computational protein analysis, enabling, among other things, more accurate predictions of mutational effects. In this work, we introduce JanusDDG, a deep learning framework that leverages PLM-derived embeddings and a bidirectional cross-attention transformer architecture to predict \Delta \Delta G of single and multiple-residue mutations while simultaneously being constrained to respect fundamental thermodynamic properties, such as antisymmetry and transitivity. Unlike conventional self-attention, JanusDDG computes queries (Q) and values (V) as the difference between wild-type and mutant embeddings, while keys (K) alternate between the two. This cross-interleaved attention mechanism enables the model to capture mutation-induced perturbations while preserving essential contextual information. Experimental results show that JanusDDG achieves state-of-the-art performance in predicting \Delta \Delta G from sequence alone, matching or exceeding the accuracy of structure-based methods for both single and multiple mutations.
zh
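A single-head, projection-free sketch of the cross-interleaved attention the abstract describes, where Q and V are the wild-type minus mutant embedding difference and K alternates between the two states. The real JanusDDG adds learned projections, multiple heads, and the thermodynamic constraints, so this is a simplification:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def two_front_attention(wt, mut, front):
    """One 'front' of the cross-interleaved attention (simplified sketch).

    wt, mut: (seq_len, dim) wild-type and mutant residue embeddings.
    """
    q = v = wt - mut                    # mutation-induced perturbation
    k = wt if front == "wt" else mut    # keys alternate between the states
    att = softmax(q @ k.T / np.sqrt(q.shape[-1]))
    return att @ v

rng = np.random.default_rng(1)
wt, mut = rng.normal(size=(5, 8)), rng.normal(size=(5, 8))
print(two_front_attention(wt, mut, "wt").shape)  # (5, 8)
```

Note that with identical inputs (`wt == mut`) the perturbation Q/V vanishes and the output is zero, consistent with a mutation-free baseline.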

[AI-42] Properties of Fixed Points of Generalised Extra Gradient Methods Applied to Min-Max Problems

【速读】: This paper studies the fixed points of generalised Extra-gradient (GEG) algorithms applied to min-max problems, and the connection between saddle points (Nash equilibria) of the min-max objective and GEG fixed points. The key result is that, under appropriate step-size selection, the set of saddle points of the min-max problem is a subset of the stable fixed points of GEG, with the algorithm's convergence properties obtained through a stability analysis of a discrete-time dynamical system. Numerical examples illustrate the results and the benefits over existing methods.

链接: https://arxiv.org/abs/2504.03069
作者: Amir Ali Farzin,Yuen-Man Pun,Philipp Braun,Iman Shames
机构: 未知
类目: Optimization and Control (math.OC); Artificial Intelligence (cs.AI); Numerical Analysis (math.NA)
备注:

点击查看摘要

Abstract:This paper studies properties of fixed points of generalised Extra-gradient (GEG) algorithms applied to min-max problems. We discuss connections between saddle points of the objective function of the min-max problem and GEG fixed points. We show that, under appropriate step-size selections, the set of saddle points (Nash equilibria) is a subset of stable fixed points of GEG. Convergence properties of the GEG algorithm are obtained through a stability analysis of a discrete-time dynamical system. The results and benefits when compared to existing methods are illustrated through numerical examples.
zh
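For intuition, the plain extra-gradient update (a special case of the GEG family studied here; step size and iteration count below are illustrative) can be sketched on a bilinear saddle problem, where naive simultaneous gradient descent-ascent spirals away from the saddle point but extra-gradient converges to it:

```python
def extra_gradient(grad_x, grad_y, x, y, eta=0.3, steps=500):
    """Plain extra-gradient: extrapolate, then update at the extrapolated point."""
    for _ in range(steps):
        xh = x - eta * grad_x(x, y)   # min player's extrapolation step
        yh = y + eta * grad_y(x, y)   # max player's extrapolation step
        x = x - eta * grad_x(xh, yh)  # update using gradients at (xh, yh)
        y = y + eta * grad_y(xh, yh)
    return x, y

# Bilinear saddle problem f(x, y) = x * y with its saddle point
# (Nash equilibrium) at (0, 0).
gx = lambda x, y: y   # df/dx
gy = lambda x, y: x   # df/dy
x, y = extra_gradient(gx, gy, 1.0, 1.0)
print(abs(x) < 1e-6 and abs(y) < 1e-6)  # True: converged to the saddle point
```

On this problem one can check that each extra-gradient step contracts the squared distance to the origin by the factor ((1-\eta^2)^2 + \eta^2 < 1) for (0 < \eta < 1), which is the kind of discrete-time stability property the paper analyzes in generality.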

[AI-43] Learning Distributions of Complex Fluid Simulations with Diffusion Graph Networks ICLR2025

【速读】: This paper addresses the fact that a single mean solution poorly represents the state distribution of complex unsteady dynamical systems such as fluid flows. Traditional approaches compute statistics from long, expensive numerical simulations, whereas this work proposes a graph-based latent diffusion (or, alternatively, flow-matching) model that samples states directly from the system's equilibrium distribution. The key ingredients are a graph structure that operates on unstructured meshes, which is critical for representing complex geometries with spatially localized high gradients, and latent-space diffusion modeling with a multi-scale Graph Neural Network (GNN), which enables efficient learning and inference of entire solution distributions. Notably, the method accurately learns full distributions even when trained on incomplete data from relatively short ground-truth simulations, demonstrating robustness and broad applicability.

链接: https://arxiv.org/abs/2504.02843
作者: Mario Lino,Tobias Pfaff,Nils Thuerey
机构: 未知
类目: Computational Physics (physics.comp-ph); Artificial Intelligence (cs.AI); Fluid Dynamics (physics.flu-dyn)
备注: 31 pages, 19 figures, Published as a conference paper at ICLR 2025

点击查看摘要

Abstract:Physical systems with complex unsteady dynamics, such as fluid flows, are often poorly represented by a single mean solution. For many practical applications, it is crucial to access the full distribution of possible states, from which relevant statistics (e.g., RMS and two-point correlations) can be derived. Here, we propose a graph-based latent diffusion (or alternatively, flow-matching) model that enables direct sampling of states from their equilibrium distribution, given a mesh discretization of the system and its physical parameters. This allows for the efficient computation of flow statistics without running long and expensive numerical simulations. The graph-based structure enables operations on unstructured meshes, which is critical for representing complex geometries with spatially localized high gradients, while latent-space diffusion modeling with a multi-scale GNN allows for efficient learning and inference of entire distributions of solutions. A key finding is that the proposed networks can accurately learn full distributions even when trained on incomplete data from relatively short simulations. We apply this method to a range of fluid dynamics tasks, such as predicting pressure distributions on 3D wing models in turbulent flow, demonstrating both accuracy and computational efficiency in challenging scenarios. The ability to directly sample accurate solutions, and capturing their diversity from short ground-truth simulations, is highly promising for complex scientific modeling tasks.
zh

机器学习

[LG-0] Reciprocity-Aware Convolutional Neural Networks for Map-Based Path Loss Prediction

链接: https://arxiv.org/abs/2504.03625
作者: Ryan G. Dempsey,Jonathan Ethier,Halim Yanikomeroglu
类目: Machine Learning (cs.LG); Signal Processing (eess.SP)
*备注: 6 pages, 6 figures, 7 tables

点击查看摘要

Abstract:Path loss modeling is a widely used technique for estimating point-to-point losses along a communications link from transmitter (Tx) to receiver (Rx). Accurate path loss predictions can optimize use of the radio frequency spectrum and minimize unwanted interference. Modern path loss modeling often leverages data-driven approaches, using machine learning to train models on drive test measurement datasets. Drive tests primarily represent downlink scenarios, where the Tx is located on a building and the Rx is located on a moving vehicle. Consequently, trained models are frequently reserved for downlink coverage estimation, lacking representation of uplink scenarios. In this paper, we demonstrate that data augmentation can be used to train a path loss model that is generalized to uplink, downlink, and backhaul scenarios, training using only downlink drive test measurements. By adding a small number of synthetic samples representing uplink scenarios to the training set, root mean squared error is reduced by 8 dB on uplink examples in the test set.

[LG-1] Trading off Relevance and Revenue in the Jobs Marketplace: Estimation Optimization and Auction Design AAAI2025

链接: https://arxiv.org/abs/2504.03618
作者: Farzad Pourbabaee,Sophie Yanying Sheng,Peter McCrory,Luke Simon,Di Mo
类目: Computer Science and Game Theory (cs.GT); Machine Learning (cs.LG)
*备注: Computational Jobs Marketplace, AAAI 2025

点击查看摘要

Abstract:We study the problem of position allocation in job marketplaces, where the platform determines the ranking of the jobs for each seeker. The design of ranking mechanisms is critical to marketplace efficiency, as it influences both short-term revenue from promoted job placements and long-term health through sustained seeker engagement. Our analysis focuses on the tradeoff between revenue and relevance, as well as the innovations in job auction design. We demonstrate two ways to improve relevance with minimal impact on revenue: incorporating the seekers’ preferences and applying position-aware auctions.

[LG-2] Scalable Hypergraph Structure Learning with Diverse Smoothness Priors

链接: https://arxiv.org/abs/2504.03583
作者: Benjamin T. Brown,Haoxiang Zhang,Daniel L. Lau,Gonzalo R. Arce
类目: Machine Learning (cs.LG); Signal Processing (eess.SP)
*备注: 13 pages, 6 figures, submitted to IEEE for possible publication

点击查看摘要

Abstract:In graph signal processing, learning the weighted connections between nodes from a set of sample signals is a fundamental task when the underlying relationships are not known a priori. This task is typically addressed by finding a graph Laplacian on which the observed signals are smooth. With the extension of graphs to hypergraphs - where edges can connect more than two nodes - graph learning methods have similarly been generalized to hypergraphs. However, the absence of a unified framework for calculating total variation has led to divergent definitions of smoothness and, consequently, differing approaches to hyperedge recovery. We confront this challenge through generalization of several previously proposed hypergraph total variations, subsequently allowing ease of substitution into a vector based optimization. To this end, we propose a novel hypergraph learning method that recovers a hypergraph topology from time-series signals based on a smoothness prior. Our approach addresses key limitations in prior works, such as hyperedge selection and convergence issues, by formulating the problem as a convex optimization solved via a forward-backward-forward algorithm, ensuring guaranteed convergence. Additionally, we introduce a process that simultaneously limits the span of the hyperedge search and maintains a valid hyperedge selection set. In doing so, our method becomes scalable in increasingly complex network structures. The experimental results demonstrate improved performance, in terms of accuracy, over other state-of-the-art hypergraph inference methods; furthermore, we empirically show our method to be robust to total variation terms, biased towards global smoothness, and scalable to larger hypergraphs.

[LG-3] Hallucination Detection on a Budget: Efficient Bayesian Estimation of Semantic Entropy

链接: https://arxiv.org/abs/2504.03579
作者: Kamil Ciosek,Nicolò Felicioni,Sina Ghiassian
类目: Machine Learning (cs.LG)
*备注: 22 pages

点击查看摘要

Abstract:Detecting whether an LLM hallucinates is an important research challenge. One promising way of doing so is to estimate the semantic entropy (Farquhar et al., 2024) of the distribution of generated sequences. We propose a new algorithm for doing that, with two main advantages. First, due to us taking the Bayesian approach, we achieve a much better quality of semantic entropy estimates for a given budget of samples from the LLM. Second, we are able to tune the number of samples adaptively so that ‘harder’ contexts receive more samples. We demonstrate empirically that our approach systematically beats the baselines, requiring only 59% of samples used by Farquhar et al. (2024) to achieve the same quality of hallucination detection as measured by AUROC. Moreover, quite counterintuitively, our estimator is useful even with just one sample from the LLM.
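Semantic entropy itself (the quantity being estimated) can be sketched as entropy over semantic-equivalence clusters of sampled answers. The exact-match `equivalent` predicate below is a hypothetical stand-in for the NLI-based clustering used in practice, and this is the naive estimator, not the paper's Bayesian one:

```python
import math
from collections import Counter

def semantic_entropy(samples, equivalent):
    """Entropy over semantic clusters of sampled answers (naive estimator)."""
    clusters, counts = [], Counter()
    for s in samples:
        for i, rep in enumerate(clusters):
            if equivalent(s, rep):      # same meaning -> same cluster
                counts[i] += 1
                break
        else:                           # no cluster matched: open a new one
            clusters.append(s)
            counts[len(clusters) - 1] += 1
    n = len(samples)
    return -sum((c / n) * math.log(c / n) for c in counts.values())

# Exact string match as a hypothetical stand-in for an NLI equivalence model.
same = lambda a, b: a == b
print(semantic_entropy(["paris", "paris", "paris", "rome"], same))
# low entropy: the samples mostly agree, so hallucination is less likely
```

High semantic entropy (answers scattered across many clusters) is the hallucination signal; the paper's contribution is estimating this quantity accurately from fewer, adaptively allocated samples.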

[LG-4] Dexterous Manipulation through Imitation Learning: A Survey

链接: https://arxiv.org/abs/2504.03515
作者: Shan An,Ziyu Meng,Chao Tang,Yuning Zhou,Tengyu Liu,Fangqiang Ding,Shufang Zhang,Yao Mu,Ran Song,Wei Zhang,Zeng-Guang Hou,Hong Zhang
类目: Robotics (cs.RO); Machine Learning (cs.LG)
*备注: 22pages, 5 figures

点击查看摘要

Abstract:Dexterous manipulation, which refers to the ability of a robotic hand or multi-fingered end-effector to skillfully control, reorient, and manipulate objects through precise, coordinated finger movements and adaptive force modulation, enables complex interactions similar to human hand dexterity. With recent advances in robotics and machine learning, there is a growing demand for these systems to operate in complex and unstructured environments. Traditional model-based approaches struggle to generalize across tasks and object variations due to the high-dimensionality and complex contact dynamics of dexterous manipulation. Although model-free methods such as reinforcement learning (RL) show promise, they require extensive training, large-scale interaction data, and carefully designed rewards for stability and effectiveness. Imitation learning (IL) offers an alternative by allowing robots to acquire dexterous manipulation skills directly from expert demonstrations, capturing fine-grained coordination and contact dynamics while bypassing the need for explicit modeling and large-scale trial-and-error. This survey provides an overview of dexterous manipulation methods based on imitation learning (IL), details recent advances, and addresses key challenges in the field. Additionally, it explores potential research directions to enhance IL-driven dexterous manipulation. Our goal is to offer researchers and practitioners a comprehensive introduction to this rapidly evolving domain.

[LG-5] Hierarchical Knowledge Structuring for Effective Federated Learning in Heterogeneous Environments IJCNN2025

链接: https://arxiv.org/abs/2504.03505
作者: Wai Fong Tam,Qilei Li,Ahmed M. Abdelmonie
类目: Machine Learning (cs.LG)
*备注: 9 pages, 3 figures, IJCNN 2025

点击查看摘要

Abstract:Federated learning enables collaborative model training across distributed entities while maintaining individual data privacy. A key challenge in federated learning is balancing the personalization of models for local clients with generalization for the global model. Recent efforts leverage logit-based knowledge aggregation and distillation to overcome these issues. However, due to the non-IID nature of data across diverse clients and the imbalance in the client’s data distribution, directly aggregating the logits often produces biased knowledge that fails to apply to individual clients and obstructs the convergence of local training. To solve this issue, we propose a Hierarchical Knowledge Structuring (HKS) framework that formulates sample logits into a multi-granularity codebook to represent logits from personalized per-sample insights to globalized per-class knowledge. The unsupervised bottom-up clustering method is leveraged to enable the global server to provide multi-granularity responses to local clients. These responses allow local training to integrate supervised learning objectives with global generalization constraints, which results in more robust representations and improved knowledge sharing in subsequent training rounds. The proposed framework’s effectiveness is validated across various benchmarks and model architectures.
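The multi-granularity codebook idea can be sketched with a simplified two-level version: per-sample logits at the fine level and per-class mean logits at the coarse level. This is a stand-in for the paper's unsupervised bottom-up clustering, which builds intermediate granularities as well.

```python
import numpy as np

def build_codebook(logits, labels, num_classes):
    """Two-granularity codebook: per-sample logits (fine, personalized)
    plus per-class mean logits (coarse, globalized).  A simplified
    stand-in for the paper's bottom-up clustering hierarchy."""
    coarse = np.zeros((num_classes, logits.shape[1]))
    for c in range(num_classes):
        coarse[c] = logits[labels == c].mean(axis=0)
    return {"fine": logits, "coarse": coarse}

rng = np.random.default_rng(0)
logits = rng.normal(size=(6, 3))        # 6 samples, 3-class logits
labels = np.array([0, 0, 0, 1, 1, 1])
cb = build_codebook(logits, labels, num_classes=2)
```

The server would answer a client query at whichever granularity best balances personalization (fine) against generalization (coarse).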

[LG-6] Learning Dual-Arm Coordination for Grasping Large Flat Objects

链接: https://arxiv.org/abs/2504.03500
作者: Yongliang Wang,Hamidreza Kasaei
类目: Robotics (cs.RO); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Grasping large flat objects, such as books or keyboards lying horizontally, presents significant challenges for single-arm robotic systems, often requiring extra actions like pushing objects against walls or moving them to the edge of a surface to facilitate grasping. In contrast, dual-arm manipulation, inspired by human dexterity, offers a more refined solution by directly coordinating both arms to lift and grasp the object without the need for complex repositioning. In this paper, we propose a model-free deep reinforcement learning (DRL) framework to enable dual-arm coordination for grasping large flat objects. We utilize a large-scale grasp pose detection model as a backbone to extract high-dimensional features from input images, which are then used as the state representation in a reinforcement learning (RL) model. A CNN-based Proximal Policy Optimization (PPO) algorithm with shared Actor-Critic layers is employed to learn coordinated dual-arm grasp actions. The system is trained and tested in Isaac Gym and deployed to real robots. Experimental results demonstrate that our policy can effectively grasp large flat objects without requiring additional maneuvers. Furthermore, the policy exhibits strong generalization capabilities, successfully handling unseen objects. Importantly, it can be directly transferred to real robots without fine-tuning, consistently outperforming baseline methods.

[LG-7] Optimistic Learning for Communication Networks

链接: https://arxiv.org/abs/2504.03499
作者: George Iosifidis,Naram Mhaisen,Douglas J. Leith
类目: Networking and Internet Architecture (cs.NI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:AI/ML-based tools are at the forefront of resource management solutions for communication networks. Deep learning, in particular, is highly effective in facilitating fast and high-performing decision-making whenever representative training data is available to build offline accurate models. Conversely, online learning solutions do not require training and enable adaptive decisions based on runtime observations, alas are often overly conservative. This extensive tutorial proposes the use of optimistic learning (OpL) as a decision engine for resource management frameworks in modern communication systems. When properly designed, such solutions can achieve fast and high-performing decisions – comparable to offline-trained models – while preserving the robustness and performance guarantees of the respective online learning approaches. We introduce the fundamental concepts, algorithms and results of OpL, discuss the roots of this theory and present different approaches to defining and achieving optimism. We proceed to showcase how OpL can enhance resource management in communication networks for several key problems such as caching, edge computing, network slicing, and workload assignment in decentralized O-RAN platforms. Finally, we discuss the open challenges that must be addressed to unlock the full potential of this new resource management approach.
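Optimistic online learning augments a standard no-regret update with a prediction of the next gradient. A minimal sketch using the common "last gradient as prediction" rule on a toy quadratic loss, not any specific algorithm from the tutorial:

```python
def optimistic_ogd(grad, x0, eta, steps):
    """Optimistic online gradient descent with prediction m_t = g_{t-1}:
    keep a held state y updated by real gradients, and play the
    optimistic point x that jumps ahead using the prediction."""
    y, x = x0, x0
    for _ in range(steps):
        g = grad(x)          # gradient revealed at the played point
        y = y - eta * g      # standard descent step on the held state
        x = y - eta * g      # optimistic jump: predict next gradient = g
    return x

# Toy stationary environment: f(x) = (x - 3)^2 / 2, so grad(x) = x - 3.
x_final = optimistic_ogd(lambda x: x - 3.0, x0=0.0, eta=0.1, steps=200)
```

When the environment is predictable (here, a fixed loss), the prediction is accurate and the iterate converges quickly to the optimum, which is the regime where optimism pays off.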

[LG-8] Hybrid Real- and Complex-valued Neural Network Architecture

链接: https://arxiv.org/abs/2504.03497
作者: Alex Young,Luan Vinícius Fiorio,Bo Yang,Boris Karanov,Wim van Houtum,Ronald M. Aarts
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We propose a hybrid real- and complex-valued neural network (HNN) architecture, designed to combine the computational efficiency of real-valued processing with the ability to effectively handle complex-valued data. We illustrate the limitations of using real-valued neural networks (RVNNs) for inherently complex-valued problems by showing how an RVNN learns to perform complex-valued convolution, but with notable inefficiencies stemming from its real-valued constraints. To create the HNN, we propose building blocks containing both real- and complex-valued paths, where information between domains is exchanged through domain conversion functions. We also introduce novel complex-valued activation functions with higher generalisation and parameterisation efficiency. HNN-specific architecture search techniques are described to navigate the larger solution space. Experiments with the AudioMNIST dataset demonstrate that the HNN reduces cross-entropy loss and uses fewer parameters compared to an RVNN in all considered cases. Such results highlight the potential for the use of partially complex-valued processing in neural networks and applications for HNNs in many signal processing domains.
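The domain conversion functions that exchange information between the real- and complex-valued paths can be as simple as stacking and re-pairing components. A hypothetical minimal choice (the paper's actual conversion functions may differ):

```python
import numpy as np

def complex_to_real(z):
    """One possible domain conversion: stack real and imaginary
    parts of a complex feature vector into a real vector."""
    return np.concatenate([z.real, z.imag])

def real_to_complex(x):
    """Inverse conversion: pair the first half of a real vector
    with the second half as imaginary components."""
    n = x.shape[0] // 2
    return x[:n] + 1j * x[n:]

z = np.array([1 + 2j, 3 - 1j])
roundtrip = real_to_complex(complex_to_real(z))  # lossless round trip
```

Losslessness of the round trip is what lets the two paths exchange features without discarding phase information.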

[LG-9] Diffusion Active Learning: Towards Data-Driven Experimental Design in Computed Tomography

链接: https://arxiv.org/abs/2504.03491
作者: Luis Barba,Johannes Kirschner,Tomas Aidukas,Manuel Guizar-Sicairos,Benjamín Béjar
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We introduce Diffusion Active Learning, a novel approach that combines generative diffusion modeling with data-driven sequential experimental design to adaptively acquire data for inverse problems. Although broadly applicable, we focus on scientific computed tomography (CT) for experimental validation, where structured prior datasets are available, and reducing data requirements directly translates to shorter measurement times and lower X-ray doses. We first pre-train an unconditional diffusion model on domain-specific CT reconstructions. The diffusion model acts as a learned prior that is data-dependent and captures the structure of the underlying data distribution, which is then used in two ways: It drives the active learning process and also improves the quality of the reconstructions. During the active learning loop, we employ a variant of diffusion posterior sampling to generate conditional data samples from the posterior distribution, ensuring consistency with the current measurements. Using these samples, we quantify the uncertainty in the current estimate to select the most informative next measurement. Our results show substantial reductions in data acquisition requirements, corresponding to lower X-ray doses, while simultaneously improving image reconstruction quality across multiple real-world tomography datasets.

[LG-10] Gaussian Process Tilted Nonparametric Density Estimation using Fisher Divergence Score Matching

链接: https://arxiv.org/abs/2504.03485
作者: John Paisley,Wei Zhang,Brian Barr
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We present three Fisher divergence (FD) minimization algorithms for learning Gaussian process (GP) based score models for lower dimensional density estimation problems. The density is formed by multiplying a base multivariate normal distribution with an exponentiated GP refinement, and so we refer to it as a GP-tilted nonparametric density. By representing the GP part of the score as a linear function using the random Fourier feature (RFF) approximation, we show that all learning problems can be solved in closed form. This includes the basic and noise conditional versions of the Fisher divergence, as well as a novel alternative to noise conditional FD models based on variational inference (VI). Here, we propose using an ELBO-like optimization of the approximate posterior with which we derive a Fisher variational predictive distribution. The RFF representation of the GP, which is functionally equivalent to a single layer neural network score model with cosine activation, provides a unique linear form for which all expectations are in closed form. The Gaussian base also helps with tractability of the VI approximation. We demonstrate our three learning algorithms, as well as a MAP baseline algorithm, on several low dimensional density estimation problems. The closed-form nature of the learning problem removes the reliance on iterative algorithms, making this technique particularly well-suited to large data sets.
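The RFF approximation that makes the GP score linear can be checked numerically: random cosine features approximate the squared-exponential kernel as a plain inner product. A small sketch with an illustrative lengthscale and feature count:

```python
import numpy as np

rng = np.random.default_rng(0)
D, d, ell = 5000, 2, 1.0        # number of random features, input dim, lengthscale

W = rng.normal(scale=1.0 / ell, size=(D, d))   # frequencies ~ N(0, I / ell^2)
b = rng.uniform(0.0, 2.0 * np.pi, size=D)      # random phases

def rff(x):
    """Random Fourier features: phi(x) . phi(y) approximates the
    RBF kernel exp(-||x - y||^2 / (2 ell^2))."""
    return np.sqrt(2.0 / D) * np.cos(W @ x + b)

x, y = np.array([0.3, -0.5]), np.array([0.1, 0.2])
approx = rff(x) @ rff(y)
exact = np.exp(-np.sum((x - y) ** 2) / (2.0 * ell ** 2))
```

Because the features are cosines of a linear map, a GP score built on them is functionally a single-layer network with cosine activation, which is what yields the closed-form learning problems in the paper.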

[LG-11] Discovering Partially Known Ordinary Differential Equations: a Case Study on the Chemical Kinetics of Cellulose Degradation

链接: https://arxiv.org/abs/2504.03484
作者: Federica Bragone,Kateryna Morozovska,Tor Laneryd,Khemraj Shukla,Stefano Markidis
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:The degree of polymerization (DP) is one of the methods for estimating the aging of the polymer based insulation systems, such as cellulose insulation in power components. The main degradation mechanisms in polymers are hydrolysis, pyrolysis, and oxidation. These mechanisms combined cause a reduction of the DP. However, the data availability for these types of problems is usually scarce. This study analyzes insulation aging using cellulose degradation data from power transformers. The aging problem for the cellulose immersed in mineral oil inside power transformers is modeled with ordinary differential equations (ODEs). We recover the governing equations of the degradation system using Physics-Informed Neural Networks (PINNs) and symbolic regression. We apply PINNs to discover the Arrhenius equation’s unknown parameters in the Ekenstam ODE describing cellulose contamination content and the material aging process related to temperature for synthetic data and real DP values. A modification of the Ekenstam ODE is given by Emsley’s system of ODEs, where the rate constant expressed by the Arrhenius equation decreases in time with the new formulation. We use PINNs and symbolic regression to recover the functional form of one of the ODEs of the system and to identify an unknown parameter.
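The two building blocks named in the abstract are standard and easy to state: the Arrhenius rate constant and the Ekenstam relation 1/DP(t) - 1/DP(0) = k·t for the degree of polymerization. A sketch with illustrative (not fitted) parameters:

```python
import math

def arrhenius(A, Ea, T, R=8.314):
    """Arrhenius rate constant k = A * exp(-Ea / (R * T)),
    with Ea in J/mol and T in kelvin."""
    return A * math.exp(-Ea / (R * T))

def ekenstam_dp(dp0, k, t):
    """Ekenstam model: 1/DP(t) - 1/DP(0) = k * t."""
    return 1.0 / (1.0 / dp0 + k * t)

# Illustrative parameters only: hotter insulation ages faster.
k_cold = arrhenius(A=1e5, Ea=1.1e5, T=333.0)   # 60 C
k_hot = arrhenius(A=1e5, Ea=1.1e5, T=363.0)    # 90 C
dp_cold = ekenstam_dp(1000.0, k_cold, t=1e6)
dp_hot = ekenstam_dp(1000.0, k_hot, t=1e6)
```

The paper's PINN recovers A and Ea from noisy DP measurements; Emsley's refinement then lets k itself decay over time.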

[LG-12] Online Traffic Density Estimation using Physics-Informed Neural Networks

链接: https://arxiv.org/abs/2504.03483
作者: Dennis Wilkman,Kateryna Morozovska,Karl Henrik Johansson,Matthieu Barreau
类目: Machine Learning (cs.LG); Optimization and Control (math.OC)
*备注:

点击查看摘要

Abstract:Recent works on the application of Physics-Informed Neural Networks to traffic density estimation have shown to be promising for future developments due to their robustness to model errors and noisy data. In this paper, we introduce a methodology for online approximation of the traffic density using measurements from probe vehicles in two settings: one using the Greenshield model and the other considering a high-fidelity traffic simulation. The proposed method continuously estimates the real-time traffic density in space and performs model identification with each new set of measurements. The density estimate is updated in almost real-time using gradient descent and adaptive weights. In the case of full model knowledge, the resulting algorithm has similar performance to the classical open-loop one. However, in the case of model mismatch, the iterative solution behaves as a closed-loop observer and outperforms the baseline method. Similarly, in the high-fidelity setting, the proposed algorithm correctly reproduces the traffic characteristics.

[LG-13] Optimizing Specific and Shared Parameters for Efficient Parameter Tuning

链接: https://arxiv.org/abs/2504.03450
作者: Van-Anh Nguyen,Thanh-Toan Do,Mehrtash Harandi,Dinh Phung,Trung Le
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Foundation models, with a vast number of parameters and pretraining on massive datasets, achieve state-of-the-art performance across various applications. However, efficiently adapting them to downstream tasks with minimal computational overhead remains a challenge. Parameter-Efficient Transfer Learning (PETL) addresses this by fine-tuning only a small subset of parameters while preserving pre-trained knowledge. In this paper, we propose SaS, a novel PETL method that effectively mitigates distributional shifts during fine-tuning. SaS integrates (1) a shared module that captures common statistical characteristics across layers using low-rank projections and (2) a layer-specific module that employs hypernetworks to generate tailored parameters for each layer. This dual design ensures an optimal balance between performance and parameter efficiency while introducing less than 0.05% additional parameters, making it significantly more compact than existing methods. Extensive experiments on diverse downstream tasks, few-shot settings and domain generalization demonstrate that SaS significantly enhances performance while maintaining superior parameter efficiency compared to existing methods, highlighting the importance of capturing both shared and layer-specific information in transfer learning. Code and data are available at this https URL.

[LG-14] Optimizing Quantum Circuits via ZX Diagrams using Reinforcement Learning and Graph Neural Networks

链接: https://arxiv.org/abs/2504.03429
作者: Alexander Mattick,Maniraman Periyasamy,Christian Ufrecht,Abhishek Y. Dubey,Christopher Mutschler,Axel Plinge,Daniel D. Scherer
类目: Machine Learning (cs.LG); Quantum Physics (quant-ph)
*备注:

点击查看摘要

Abstract:Quantum computing is currently strongly limited by the impact of noise, in particular introduced by the application of two-qubit gates. For this reason, reducing the number of two-qubit gates is of paramount importance on noisy intermediate-scale quantum hardware. To advance towards more reliable quantum computing, we introduce a framework based on ZX calculus, graph-neural networks and reinforcement learning for quantum circuit optimization. By combining reinforcement learning and tree search, our method addresses the challenge of selecting optimal sequences of ZX calculus rewrite rules. Instead of relying on existing heuristic rules for minimizing circuits, our method trains a novel reinforcement learning policy that directly operates on ZX-graphs, therefore allowing us to search through the space of all possible circuit transformations to find a circuit significantly minimizing the number of CNOT gates. This way we can scale beyond hard-coded rules towards discovering arbitrary optimization rules. We demonstrate our method's competitiveness with state-of-the-art circuit optimizers and its generalization capabilities on large sets of diverse random circuits.

[LG-15] DML-RAM: Deep Multimodal Learning Framework for Robotic Arm Manipulation using Pre-trained Models

链接: https://arxiv.org/abs/2504.03423
作者: Sathish Kumar,Swaroop Damodaran,Naveen Kumar Kuruba,Sumit Jha,Arvind Ramanathan
类目: Machine Learning (cs.LG); Robotics (cs.RO)
*备注: 7 pages , 4 figures

点击查看摘要

Abstract:This paper presents a novel deep learning framework for robotic arm manipulation that integrates multimodal inputs using a late-fusion strategy. Unlike traditional end-to-end or reinforcement learning approaches, our method processes image sequences with pre-trained models and robot state data with machine learning algorithms, fusing their outputs to predict continuous action values for control. Evaluated on BridgeData V2 and Kuka datasets, the best configuration (VGG16 + Random Forest) achieved MSEs of 0.0021 and 0.0028, respectively, demonstrating strong predictive performance and robustness. The framework supports modularity, interpretability, and real-time decision-making, aligning with the goals of adaptive, human-in-the-loop cyber-physical systems.
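The late-fusion idea can be sketched independently of the specific backbones: extract features per modality, concatenate, and fit one predictor of continuous actions. Below, random arrays stand in for VGG16 embeddings and robot state, and ordinary least squares stands in for the Random Forest head:

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-ins: 'image_feats' plays the role of pooled pre-trained CNN
# embeddings, 'state_feats' the robot joint state, 'actions' the targets.
image_feats = rng.normal(size=(200, 16))
state_feats = rng.normal(size=(200, 7))
true_w = rng.normal(size=(23, 4))
actions = np.hstack([image_feats, state_feats]) @ true_w   # 4-dim actions

def late_fusion_fit(img, state, y):
    """Late fusion: concatenate per-modality features, then fit a single
    regressor (least squares here, Random Forest in the paper)."""
    X = np.hstack([img, state])
    w, *_ = np.linalg.lstsq(X, y, rcond=None)
    return w

w = late_fusion_fit(image_feats, state_feats, actions)
pred = np.hstack([image_feats, state_feats]) @ w
mse = float(np.mean((pred - actions) ** 2))
```

Keeping the fusion at the feature level is what gives the framework its modularity: either branch can be swapped without retraining the other.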

[LG-16] BitHEP – The Limits of Low-Precision ML in HEP

链接: https://arxiv.org/abs/2504.03387
作者: Claudius Krause,Daohan Wang,Ramon Winterhalder
类目: Machine Learning (cs.LG); High Energy Physics - Experiment (hep-ex); High Energy Physics - Phenomenology (hep-ph)
*备注: 15 pages, 5 figures

点击查看摘要

Abstract:The increasing complexity of modern neural network architectures demands fast and memory-efficient implementations to mitigate computational bottlenecks. In this work, we evaluate the recently proposed BitNet architecture in HEP applications, assessing its performance in classification, regression, and generative modeling tasks. Specifically, we investigate its suitability for quark-gluon discrimination, SMEFT parameter estimation, and detector simulation, comparing its efficiency and accuracy to state-of-the-art methods. Our results show that while BitNet consistently performs competitively in classification tasks, its performance in regression and generation varies with the size and type of the network, highlighting key limitations and potential areas for improvement.
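BitNet-style networks constrain weights to very low precision. The sketch below shows absmean ternary quantization in the spirit of BitNet b1.58; the exact scheme is an assumption here, and details vary across BitNet variants:

```python
import numpy as np

def ternarize(W, eps=1e-8):
    """Absmean ternary quantization: scale by the mean absolute
    weight, round, and clip to the ternary set {-1, 0, +1}."""
    scale = np.mean(np.abs(W)) + eps
    Wq = np.clip(np.round(W / scale), -1, 1)
    return Wq, scale

rng = np.random.default_rng(0)
W = rng.normal(size=(4, 4))
Wq, scale = ternarize(W)
dequant = Wq * scale   # low-precision weights used in the forward pass
```

Classification tends to tolerate this coarse quantization well, which matches the paper's finding that regression and generative tasks are where the precision loss starts to bite.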

[LG-17] A metrological framework for uncertainty evaluation in machine learning classification models

链接: https://arxiv.org/abs/2504.03359
作者: Samuel Bilson,Maurice Cox,Anna Pustogvar,Andrew Thompson
类目: Machine Learning (cs.LG)
*备注: 47 pages, 7 figures

点击查看摘要

Abstract:Machine learning (ML) classification models are increasingly being used in a wide range of applications where it is important that predictions are accompanied by uncertainties, including in climate and earth observation, medical diagnosis and bioaerosol monitoring. The output of an ML classification model is a type of categorical variable known as a nominal property in the International Vocabulary of Metrology (VIM). However, concepts related to uncertainty evaluation for nominal properties are not defined in the VIM, nor is such evaluation addressed by the Guide to the Expression of Uncertainty in Measurement (GUM). In this paper we propose a metrological conceptual uncertainty evaluation framework for ML classification, and illustrate its use in the context of two applications that exemplify the issues and have significant societal impact, namely, climate and earth observation and medical diagnosis. Our framework would enable an extension of the VIM and GUM to uncertainty for nominal properties, which would make both applicable to ML classification models.

[LG-18] Data Augmentation of Time-Series Data in Human Movement Biomechanics: A Scoping Review

链接: https://arxiv.org/abs/2504.03334
作者: Christina Halmich,Lucas Höschler,Christoph Schranz,Christian Borgelt
类目: Machine Learning (cs.LG); Human-Computer Interaction (cs.HC)
*备注: Preprint under review at PLOS ONE

点击查看摘要

Abstract:The integration of machine learning and deep learning has transformed data analytics in biomechanics, enabled by extensive wearable sensor data. However, the field faces challenges such as limited large-scale datasets and high data acquisition costs, which hinder the development of robust algorithms. Data augmentation techniques show promise in addressing these issues, but their application to biomechanical time-series data requires comprehensive evaluation. This scoping review investigates data augmentation methods for time-series data in the biomechanics domain. It analyzes current approaches for augmenting and generating time-series datasets, evaluates their effectiveness, and offers recommendations for applying these techniques in biomechanics. Four databases, PubMed, IEEE Xplore, Scopus, and Web of Science, were searched for studies published between 2013 and 2024. Following PRISMA-ScR guidelines, a two-stage screening identified 21 relevant publications. Results show that there is no universally preferred method for augmenting biomechanical time-series data; instead, methods vary based on study objectives. A major issue identified is the absence of soft tissue artifacts in synthetic data, leading to discrepancies referred to as the synthetic gap. Moreover, many studies lack proper evaluation of augmentation methods, making it difficult to assess their effects on model performance and data quality. This review highlights the critical role of data augmentation in addressing limited dataset availability and improving model generalization in biomechanics. Tailoring augmentation strategies to the characteristics of biomechanical data is essential for advancing predictive modeling. A better understanding of how different augmentation methods impact data quality and downstream tasks will be key to developing more effective and realistic techniques. 
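Two of the most common time-series augmentations covered by such reviews, jittering and magnitude scaling, take only a few lines. A generic sketch, not tied to any study in the review:

```python
import numpy as np

rng = np.random.default_rng(0)

def jitter(x, sigma=0.03):
    """Additive Gaussian noise, a standard augmentation for
    wearable-sensor time series (samples x channels)."""
    return x + rng.normal(scale=sigma, size=x.shape)

def scale(x, sigma=0.1):
    """Random per-channel amplitude scaling around 1.0."""
    factors = rng.normal(loc=1.0, scale=sigma, size=(1, x.shape[1]))
    return x * factors

signal = np.sin(np.linspace(0, 4 * np.pi, 200))[:, None]  # one gait-like channel
augmented = scale(jitter(signal))
```

Note that such purely statistical transforms do not reproduce soft tissue artifacts, which is exactly the "synthetic gap" the review highlights.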

[LG-19] Learning Lie Group Generators from Trajectories

链接: https://arxiv.org/abs/2504.03220
作者: Lifan Hu
类目: Machine Learning (cs.LG)
*备注: 7 pages, 12 figures

点击查看摘要

Abstract:This work investigates the inverse problem of generator recovery in matrix Lie groups from discretized trajectories. Let G be a real matrix Lie group and \mathfrak{g} = \mathrm{Lie}(G) its corresponding Lie algebra. A smooth trajectory \gamma(t) generated by a fixed Lie algebra element \xi \in \mathfrak{g} follows the exponential flow \gamma(t) = g_0 \cdot \exp(t\xi). The central task addressed in this work is the reconstruction of such a latent generator \xi from a discretized sequence of poses \{g_0, g_1, \dots, g_T\} \subset G, sampled at uniform time intervals. This problem is formulated as a data-driven regression from normalized sequences of discrete Lie algebra increments \log(g_t^{-1} g_{t+1}) to the constant generator \xi \in \mathfrak{g}. A feedforward neural network is trained to learn this mapping across several groups, including \mathrm{SE}(2), \mathrm{SE}(3), \mathrm{SO}(3), and \mathrm{SL}(2, \mathbb{R}). It demonstrates strong empirical accuracy under both clean and noisy conditions, which validates the viability of data-driven recovery of Lie group generators using shallow neural architectures. The Lie-RL GitHub repository is available at this https URL; suggestions and collaborations are welcome!
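The recovery target can be illustrated in closed form on SO(2), where the matrix logarithm reduces to an angle: average the normalized log-increments of consecutive poses. Here a simple average stands in for the paper's neural regression:

```python
import numpy as np

def rot(theta):
    """2x2 rotation matrix, an element of SO(2)."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s], [s, c]])

def so2_log(R):
    """Matrix logarithm on SO(2), returned as the scalar generator angle."""
    return np.arctan2(R[1, 0], R[0, 0])

# Trajectory gamma(t) = g0 * exp(t * xi) sampled at uniform time steps.
xi, dt = 0.17, 0.1                       # latent generator, time step
poses = [rot(0.3 + xi * k * dt) for k in range(20)]

# Average the normalized log-increments log(g_t^{-1} g_{t+1}) / dt.
increments = [so2_log(poses[k].T @ poses[k + 1]) / dt
              for k in range(len(poses) - 1)]
xi_hat = float(np.mean(increments))
```

On noisy trajectories, or on groups like SE(3) where the log is costlier, a learned regressor over these increments replaces the plain average.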

[LG-20] Structured Knowledge Accumulation: The Principle of Entropic Least Action in Forward-Only Neural Learning

链接: https://arxiv.org/abs/2504.03214
作者: Bouarfa Mahi Quantiota
类目: Machine Learning (cs.LG)
*备注: 18 pages, 6 figures

点击查看摘要

Abstract:This paper aims to extend the Structured Knowledge Accumulation (SKA) framework recently proposed by Mahi (2025). We introduce two core concepts: the Tensor Net function and the characteristic time property of neural learning. First, we reinterpret the learning rate as a time step in a continuous system. This transforms neural learning from discrete optimization into continuous-time evolution. We show that learning dynamics remain consistent when the product of learning rate and iteration steps stays constant. This reveals a time-invariant behavior and identifies an intrinsic timescale of the network. Second, we define the Tensor Net function as a measure that captures the relationship between decision probabilities, entropy gradients, and knowledge change. Additionally, we define its zero-crossing as the equilibrium state between decision probabilities and entropy gradients. We show that the convergence of entropy and knowledge flow provides a natural stopping condition, replacing arbitrary thresholds with an information-theoretic criterion. We also establish that SKA dynamics satisfy a variational principle based on the Euler-Lagrange equation. These findings extend SKA into a continuous and self-organizing learning model. The framework links computational learning with physical systems that evolve by natural laws. By understanding learning as a time-based process, we open new directions for building efficient, robust, and biologically-inspired AI systems.

[LG-21] PIONM: A Generalized Approach to Solving Density-Constrained Mean-Field Games Equilibrium under Modified Boundary Conditions

链接: https://arxiv.org/abs/2504.03209
作者: Jinwei Liu,Wang Yao,Xiao Zhang
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Neural network-based methods are effective for solving equilibria in Mean-Field Games (MFGs), particularly in high-dimensional settings. However, solving the coupled partial differential equations (PDEs) in MFGs limits their applicability since solving coupled PDEs is computationally expensive. Additionally, modifying boundary conditions, such as the initial state distribution or terminal value function, necessitates extensive retraining, reducing scalability. To address these challenges, we propose a generalized framework, PIONM (Physics-Informed Neural Operator NF-MKV Net), which leverages physics-informed neural operators to solve MFGs equations. PIONM utilizes neural operators to compute MFGs equilibria for arbitrary boundary conditions. The method encodes boundary conditions as input features and trains the model to align them with density evolution, modeled using discrete-time normalizing flows. Once trained, the algorithm efficiently computes the density distribution at any time step for modified boundary conditions, ensuring efficient adaptation to different boundary conditions in MFGs equilibria. Unlike traditional MFGs methods constrained by fixed coefficients, PIONM efficiently computes equilibria under varying boundary conditions, including obstacles, diffusion coefficients, initial densities, and terminal functions. PIONM can adapt to modified conditions while preserving density distribution constraints, demonstrating superior scalability and generalization capabilities compared to existing methods.

[LG-22] BondMatcher: H-Bond Stability Analysis in Molecular Systems

链接: https://arxiv.org/abs/2504.03205
作者: Thomas Daniel,Malgorzata Olejniczak,Julien Tierny
类目: Machine Learning (cs.LG); Image and Video Processing (eess.IV); Chemical Physics (physics.chem-ph)
*备注:

点击查看摘要

Abstract:This application paper investigates the stability of hydrogen bonds (H-bonds), as characterized by the Quantum Theory of Atoms in Molecules (QTAIM). First, we contribute a database of 4544 electron densities associated to four isomers of water hexamers (the so-called Ring, Book, Cage and Prism), generated by distorting their equilibrium geometry under various structural perturbations, modeling the natural dynamic behavior of molecular systems. Second, we present a new stability measure, called bond occurrence rate, associating each bond path present at equilibrium with its rate of occurrence within the input ensemble. We also provide an algorithm, called BondMatcher, for its automatic computation, based on a tailored, geometry-aware partial isomorphism estimation between the extremum graphs of the considered electron densities. Our new stability measure allows for the automatic identification of densities lacking H-bond paths, enabling further visual inspections. Specifically, the topological analysis enabled by our framework corroborates experimental observations and provides refined geometrical criteria for characterizing the disappearance of H-bond paths. Our electron density database and our C++ implementation are available at this address: this https URL.

[LG-23] Simultaneous Learning of Optimal Transports for Training All-to-All Flow-Based Condition Transfer Model

链接: https://arxiv.org/abs/2504.03188
作者: Kotaro Ikeda,Masanori Koyama,Jinzhe Zhang,Kohei Hayashi,Kenji Fukumizu
类目: Machine Learning (cs.LG)
*备注: 29 pages, 17 figures

点击查看摘要

Abstract:In this paper, we propose a flow-based method for learning all-to-all transfer maps among conditional distributions, approximating pairwise optimal transport. The proposed method addresses the challenge of handling continuous conditions, which often involve a large set of conditions with sparse empirical observations per condition. We introduce a novel cost function that enables simultaneous learning of optimal transports for all pairs of conditional distributions. Our method is supported by a theoretical guarantee that, in the limit, it converges to pairwise optimal transports among infinite pairs of conditional distributions. The learned transport maps are subsequently used to couple data points in conditional flow matching. We demonstrate the effectiveness of this method on synthetic and benchmark datasets, as well as on chemical datasets where continuous physical properties are defined as conditions.

[LG-24] On the Connection Between Diffusion Models and Molecular Dynamics

链接: https://arxiv.org/abs/2504.03187
作者: Liam Harcombe,Timothy T. Duignan
类目: Machine Learning (cs.LG)
*备注: 13 pages, 5 figures

点击查看摘要

Abstract:Neural Network Potentials (NNPs) have emerged as a powerful tool for modelling atomic interactions with high accuracy and computational efficiency. Recently, denoising diffusion models have shown promise in NNPs by training networks to remove noise added to stable configurations, eliminating the need for force data during training. In this work, we explore the connection between noise and forces by providing a new, simplified mathematical derivation of their relationship. We also demonstrate how a denoising model can be implemented using a conventional MD software package interfaced with a standard NNP architecture. We demonstrate the approach by training a diffusion-based NNP to simulate a coarse-grained lithium chloride solution and employ data duplication to enhance model performance.

[LG-25] Mathematical Modeling of Option Pricing with an Extended Black-Scholes Framework

链接: https://arxiv.org/abs/2504.03175
作者: Nikhil Shivakumar Nayak,Michael P. Brenner
类目: Numerical Analysis (math.NA); Machine Learning (cs.LG); Probability (math.PR); Computational Finance (q-fin.CP)
*备注: 7 pages, 3 figures

点击查看摘要

Abstract:This study investigates enhancing option pricing by extending the Black-Scholes model to include stochastic volatility and interest rate variability within the Partial Differential Equation (PDE). The PDE is solved using the finite difference method. The extended Black-Scholes model and a machine learning-based LSTM model are developed and evaluated for pricing Google stock options. Both models were backtested using historical market data. While the LSTM model exhibited higher predictive accuracy, the finite difference method demonstrated superior computational efficiency. This work provides insights into model performance under varying market conditions and emphasizes the potential of hybrid approaches for robust financial modeling.
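The finite-difference side of the comparison can be sketched on the baseline constant-coefficient Black-Scholes PDE (the paper's extension adds stochastic volatility and interest-rate variability, which this minimal example omits; all function names and parameter values are illustrative):

```python
import math
import numpy as np

def bs_call_closed_form(S0, K, r, sigma, T):
    """Analytic Black-Scholes price of a European call (for reference)."""
    d1 = (math.log(S0 / K) + (r + 0.5 * sigma**2) * T) / (sigma * math.sqrt(T))
    d2 = d1 - sigma * math.sqrt(T)
    N = lambda z: 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    return S0 * N(d1) - K * math.exp(-r * T) * N(d2)

def bs_call_fd(S0, K, r, sigma, T, Smax=300.0, M=300, Nt=20_000):
    """Explicit finite-difference solver for the Black-Scholes PDE,
    stepping backwards in time from the payoff at maturity."""
    dS, dt = Smax / M, T / Nt
    S = np.linspace(0.0, Smax, M + 1)
    V = np.maximum(S - K, 0.0)                  # terminal payoff
    j = np.arange(1, M)
    a = 0.5 * dt * (sigma**2 * j**2 - r * j)    # explicit-scheme coefficients
    b = 1.0 - dt * (sigma**2 * j**2 + r)
    c = 0.5 * dt * (sigma**2 * j**2 + r * j)
    for n in range(Nt):
        V[1:M] = a * V[0:M-1] + b * V[1:M] + c * V[2:M+1]
        V[0] = 0.0                              # a call is worthless at S = 0
        V[M] = Smax - K * math.exp(-r * (n + 1) * dt)
    return float(np.interp(S0, S, V))

cf = bs_call_closed_form(100.0, 100.0, 0.05, 0.2, 1.0)   # ≈ 10.45
fd = bs_call_fd(100.0, 100.0, 0.05, 0.2, 1.0)
```

Note that the explicit scheme is only conditionally stable: dt must stay below roughly 1/(sigma^2 M^2), which is why the time grid above is much finer than the price grid.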

[LG-26] Water Mapping and Change Detection Using Time Series Derived from the Continuous Monitoring of Land Disturbance Algorithm

链接: https://arxiv.org/abs/2504.03170
作者: Huong Pham,Samuel Cheng,Tao Hu,Chengbin Deng
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Given the growing environmental challenges, accurate monitoring and prediction of changes in water bodies are essential for sustainable management and conservation. The Continuous Monitoring of Land Disturbance (COLD) algorithm provides a valuable tool for real-time analysis of land changes, such as deforestation, urban expansion, agricultural activities, and natural disasters. This capability enables timely interventions and more informed decision-making. This paper assesses the effectiveness of the algorithm in estimating water bodies and tracking pixel-level water trends over time. Our findings indicate that COLD-derived data can reliably estimate water frequency during stable periods and delineate water bodies. Furthermore, it enables the evaluation of trends in water areas after disturbances, allowing for the determination of whether water frequency increases, decreases, or remains constant.
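The per-pixel quantities discussed above (water frequency and its trend over time) can be sketched as follows; the function names and the synthetic binary water masks are illustrative and are not part of the COLD algorithm itself:

```python
import numpy as np

def water_frequency(water_masks):
    """Fraction of observations in which each pixel is classified as water.
    `water_masks` has shape (time, pixels) with binary entries."""
    return np.mean(water_masks, axis=0)

def water_trend(water_masks, dates):
    """Per-pixel least-squares slope of the binary water series:
    positive -> water frequency increasing, negative -> decreasing."""
    t = dates - dates.mean()
    y = water_masks - water_masks.mean(axis=0)
    return (t[:, None] * y).sum(axis=0) / (t**2).sum()

# synthetic example: one stable pixel, one pixel gaining water over time
masks = np.zeros((10, 2))
masks[:, 0] = 1.0          # pixel 0: always water (stable)
masks[5:, 1] = 1.0         # pixel 1: water only in the second half (increasing)
dates = np.arange(10, dtype=float)
freq = water_frequency(masks)
slopes = water_trend(masks, dates)
```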

[LG-27] Enhanced Penalty-based Bidirectional Reinforcement Learning Algorithms

链接: https://arxiv.org/abs/2504.03163
作者: Sai Gana Sandeep Pula,Sathish A. P. Kumar,Sumit Jha,Arvind Ramanathan
类目: Machine Learning (cs.LG)
*备注: 16 pages, 13 Figures

点击查看摘要

Abstract:This research focuses on enhancing reinforcement learning (RL) algorithms by integrating penalty functions to guide agents in avoiding unwanted actions while optimizing rewards. The goal is to improve the learning process by ensuring that agents learn not only suitable actions but also which actions to avoid. Additionally, we reintroduce a bidirectional learning approach that enables agents to learn from both initial and terminal states, thereby improving speed and robustness in complex environments. Our proposed Penalty-Based Bidirectional methodology is tested on the ManiSkill benchmark environments, demonstrating an improvement in success rate of approximately 4% compared to existing RL implementations. The findings indicate that this integrated strategy enhances policy learning, adaptability, and overall performance in challenging scenarios.

[LG-28] Beyond Progress Measures: Theoretical Insights into the Mechanism of Grokking

链接: https://arxiv.org/abs/2504.03162
作者: Zihan Gu,Ruoyu Chen,Hua Zhang,Yue Hu,Xiaochun Cao
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Grokking, referring to the abrupt improvement in test accuracy after extended overfitting, offers valuable insights into the mechanisms of model generalization. Existing research based on progress measures implies that grokking relies on understanding the optimization dynamics when the loss function is dominated solely by the weight decay term. However, we find that this optimization merely leads to token uniformity, which is not a sufficient condition for grokking. In this work, we investigate the grokking mechanism underlying the Transformer in the task of prime number operations. Based on theoretical analysis and experimental validation, we present the following insights: (i) The weight decay term encourages uniformity across all tokens in the embedding space when it is minimized. (ii) The occurrence of grokking is jointly determined by the uniformity of the embedding space and the distribution of the training dataset. Building on these insights, we provide a unified perspective for understanding various previously proposed progress measures and introduce a novel, concise, and effective progress measure that could trace the changes in test loss more accurately. Finally, to demonstrate the versatility of our theoretical framework, we design a dedicated dataset to validate our theory on ResNet-18, successfully showcasing the occurrence of grokking.

[LG-29] MORAL: A Multimodal Reinforcement Learning Framework for Decision Making in Autonomous Laboratories

链接: https://arxiv.org/abs/2504.03153
作者: Natalie Tirabassi,Sathish A. P. Kumar,Sumit Jha,Arvind Ramanathan
类目: Machine Learning (cs.LG)
*备注: 9 pages, 14 figures and 3 tables

点击查看摘要

Abstract:We propose MORAL (a multimodal reinforcement learning framework for decision making in autonomous laboratories) that enhances sequential decision-making in autonomous robotic laboratories through the integration of visual and textual inputs. Using the BridgeData V2 dataset, we generate fine-tuned image captions with a pretrained BLIP-2 vision-language model and combine them with visual features through an early fusion strategy. The fused representations are processed using Deep Q-Network (DQN) and Proximal Policy Optimization (PPO) agents. Experimental results demonstrate that multimodal agents achieve a 20% improvement in task completion rates and significantly outperform visual-only and textual-only baselines after sufficient training. Compared to transformer-based and recurrent multimodal RL models, our approach achieves superior performance in cumulative reward and caption quality metrics (BLEU, METEOR, ROUGE-L). These results highlight the impact of semantically aligned language cues in enhancing agent learning efficiency and generalization. The proposed framework contributes to the advancement of multimodal reinforcement learning and embodied AI systems in dynamic, real-world environments.

[LG-30] Safe Screening Rules for Group OWL Models

链接: https://arxiv.org/abs/2504.03152
作者: Runxue Bao,Quanchao Lu,Yanfu Zhang
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: 8 pages. arXiv admin note: text overlap with arXiv:2006.16433

点击查看摘要

Abstract:Group Ordered Weighted L_1-Norm (Group OWL) regularized models have emerged as a useful procedure for high-dimensional sparse multi-task learning with correlated features. Proximal gradient methods are used as standard approaches to solving Group OWL models. However, Group OWL models usually suffer from high computational costs and memory usage when the feature size is large in high-dimensional scenarios. To address this challenge, in this paper, we are the first to propose a safe screening rule for Group OWL models by effectively tackling the structured non-separable penalty, which can quickly identify the inactive features that have zero coefficients across all the tasks. Thus, by removing the inactive features during the training process, we may achieve substantial computational gains and memory savings. More importantly, the proposed screening rule can be directly integrated with existing solvers in both the batch and stochastic settings. Theoretically, we prove that our screening rule is safe and can also be safely applied to existing iterative optimization algorithms. Our experimental results demonstrate that our screening rule can effectively identify the inactive features and leads to a significant computational speedup without any loss of accuracy.

[LG-31] From Observation to Orientation: an Adaptive Integer Programming Approach to Intervention Design

链接: https://arxiv.org/abs/2504.03122
作者: Abdelmonem Elrefaey,Rong Pan
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:Using both observational and experimental data, a causal discovery process can identify the causal relationships between variables. A unique adaptive intervention design paradigm is presented in this work, in which causal directed acyclic graphs (DAGs) are effectively recovered under practical budgetary constraints. In order to choose treatments that optimize information gain under these constraints, an iterative integer programming (IP) approach is proposed, which drastically reduces the number of experiments required. Simulations over a broad range of graph sizes and edge densities are used to assess the effectiveness of the suggested approach. Results show that the proposed adaptive IP approach achieves full causal graph recovery with fewer intervention iterations and variable manipulations than random intervention baselines, and it is also flexible enough to accommodate a variety of practical constraints.

[LG-32] Anomaly Detection in Time Series Data Using Reinforcement Learning Variational Autoencoder and Active Learning

链接: https://arxiv.org/abs/2504.02999
作者: Bahareh Golchin,Banafsheh Rekabdar
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:A novel approach to detecting anomalies in time series data is presented in this paper. This approach is pivotal in domains such as data centers, sensor networks, and finance. Traditional methods often struggle with manual parameter tuning and cannot adapt to new anomaly types. Our method overcomes these limitations by integrating Deep Reinforcement Learning (DRL) with a Variational Autoencoder (VAE) and Active Learning. By incorporating a Long Short-Term Memory (LSTM) network, our approach models sequential data and its dependencies effectively, allowing for the detection of new anomaly classes with minimal labeled data. Our innovative DRL-VAE and Active Learning combination significantly improves existing methods, as shown by our evaluations on real-world datasets, enhancing anomaly detection techniques and advancing time series analysis.

[LG-33] Improving log-based anomaly detection through learned adaptive filter

链接: https://arxiv.org/abs/2504.02994
作者: Yiyuan Xiong,Shaofeng Cai
类目: Machine Learning (cs.LG); Databases (cs.DB)
*备注:

点击查看摘要

Abstract:Log messages record important system runtime information and are useful for detecting anomalous behaviors and managing modern software systems. Many supervised and unsupervised learning methods have been proposed recently for log-based anomaly detection. State-of-the-art unsupervised methods predict the next log event given a log sequence and apply a fixed filter configuration (i.e., a constant k: the top-k predicted log events are regarded as normal next events), which leads to inferior performance in the detection stage because one fixed k is set for all log sequences, ignoring the dynamic nature of and variance across different log sequences. Recently, deep reinforcement learning (DRL) has been widely applied to make intelligent decisions in dynamic environments. In this work, we contend that it is necessary to apply adaptive filters for different log sequences. To achieve this, we propose a novel approach based on DRL to construct a learned adaptive filter that applies different normal/abnormal filter thresholds to different log sequences. We define the Markov Decision Process (MDP) and formulate the learned adaptive filter as a problem that can be solved by DRL. We evaluate the learned adaptive filter on two state-of-the-art unsupervised log-based anomaly detection approaches, DeepLog and LogAnomaly, on two datasets, HDFS and BGL. Extensive experiments show that our approach outperforms the fixed configurations and achieves significantly better performance in log-based anomaly detection.
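The fixed-k filtering criticized above can be sketched in a few lines; a learned adaptive filter would replace the constant k with a per-sequence choice made by the DRL policy. The `adaptive_k` interface below is a hypothetical stand-in for that policy, not the paper's implementation:

```python
import numpy as np

def is_anomalous(pred_probs, actual_event, k):
    """Fixed-configuration filter: flag the observed next log event as
    anomalous when it is not among the model's top-k predicted events."""
    topk = np.argsort(pred_probs)[::-1][:k]
    return int(actual_event) not in topk

def adaptive_k(policy, sequence_features):
    """Hypothetical adaptive filter: a learned policy maps per-sequence
    features to the threshold k instead of using one global constant."""
    return int(policy(sequence_features))

# toy next-event distribution over 4 log-event types
probs = np.array([0.5, 0.3, 0.1, 0.1])
```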

[LG-34] Route Recommendations for Traffic Management Under Learned Partial Driver Compliance

链接: https://arxiv.org/abs/2504.02993
作者: Heeseung Bang,Jung-Hoon Cho,Cathy Wu,Andreas A. Malikopoulos
类目: Systems and Control (eess.SY); Machine Learning (cs.LG)
*备注: 7 pages

点击查看摘要

Abstract:In this paper, we aim to mitigate congestion in traffic management systems by guiding travelers along system-optimal (SO) routes. However, we recognize that most theoretical approaches assume perfect driver compliance, which often does not reflect reality, as drivers tend to deviate from recommendations to fulfill their personal objectives. Therefore, we propose a route recommendation framework that explicitly learns partial driver compliance and optimizes traffic flow under realistic adherence. We first compute an SO edge flow through flow optimization techniques. Next, we train a compliance model based on historical driver decisions to capture individual responses to our recommendations. Finally, we formulate a stochastic optimization problem that minimizes the gap between the target SO flow and the realized flow under conditions of imperfect adherence. Our simulations conducted on a grid network reveal that our approach significantly reduces travel time compared to baseline strategies, demonstrating the practical advantage of incorporating learned compliance into traffic management.

[LG-35] Randomized Pairwise Learning with Adaptive Sampling: A PAC-Bayes Analysis

链接: https://arxiv.org/abs/2504.02957
作者: Sijia Zhou(1),Yunwen Lei(2),Ata Kabán(1) ((1) School of Computer Science, University of Birmingham, (2) Department of Mathematics, The University of Hong Kong)
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We study stochastic optimization with data-adaptive sampling schemes to train pairwise learning models. Pairwise learning is ubiquitous, and it covers several popular learning tasks such as ranking, metric learning and AUC maximization. A notable difference of pairwise learning from pointwise learning is the statistical dependencies among input pairs, which existing analyses have not been able to handle in the general setting considered in this paper. To this end, we extend recent results that blend together two algorithm-dependent frameworks of analysis – algorithmic stability and PAC-Bayes – which allow us to deal with any data-adaptive sampling scheme in the optimizer. We instantiate this framework to analyze (1) pairwise stochastic gradient descent, which is a default workhorse in many machine learning problems, and (2) pairwise stochastic gradient descent ascent, which is a method used in adversarial training. All of these algorithms make use of a stochastic sampling from a discrete distribution (sample indices) before each update. Non-uniform sampling of these indices has been already suggested in the recent literature, to which our work provides generalization guarantees in both smooth and non-smooth convex problems.
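Pairwise stochastic gradient descent with a (possibly non-uniform) sampling distribution over pair indices can be sketched as below. This is a generic ranking-style illustration with illustrative names and a realizable toy problem, not the paper's analysis or its adversarial-training variant:

```python
import numpy as np

def pairwise_sgd(X, y, probs=None, lr=0.01, n_steps=20_000, seed=0):
    """Pairwise SGD for a ranking-style squared loss: sample a pair (i, j),
    then step on (w·(x_i - x_j) - (y_i - y_j))^2. `probs` is an optional
    non-uniform distribution over pairs; an importance weight keeps the
    stochastic gradient unbiased."""
    rng = np.random.default_rng(seed)
    n, _ = X.shape
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
    if probs is None:
        probs = np.full(len(pairs), 1.0 / len(pairs))
    w = np.zeros(X.shape[1])
    for _ in range(n_steps):
        k = rng.choice(len(pairs), p=probs)
        i, j = pairs[k]
        diff = X[i] - X[j]
        resid = w @ diff - (y[i] - y[j])
        weight = 1.0 / (len(pairs) * probs[k])   # importance weight
        w -= lr * weight * 2.0 * resid * diff
    return w

# realizable toy problem: pairwise comparisons generated by a linear scorer
rng = np.random.default_rng(1)
X = rng.standard_normal((30, 3))
w_true = np.array([1.0, -2.0, 0.5])
w_hat = pairwise_sgd(X, X @ w_true)
```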

[LG-36] Feature Engineering on LMS Data to Optimize Student Performance Prediction

链接: https://arxiv.org/abs/2504.02916
作者: Keith Hubbard,Sheilla Amponsah
类目: Computers and Society (cs.CY); Machine Learning (cs.LG)
*备注: 17 pages

点击查看摘要

Abstract:Nearly every educational institution uses a learning management system (LMS), often producing terabytes of data generated by thousands of people. We examine LMS grade and login data from a regional comprehensive university, specifically documenting key considerations for engineering features from these data when trying to predict student performance. We specifically document changes to LMS data patterns since Covid-19, which are critical for data scientists to account for when using historic data. We compare numerous engineered features and approaches to utilizing those features for machine learning. We finish with a summary of the implications of including these features into more comprehensive student performance models.

[LG-37] Enhancing Air Quality Monitoring: A Brief Review of Federated Learning Advances

链接: https://arxiv.org/abs/2504.02909
作者: Sara Yarham,Mehran Behjati
类目: Computers and Society (cs.CY); Machine Learning (cs.LG)
*备注: This is a preprint version of a paper accepted and published in Springer Lecture Notes in Networks and Systems. The final version is available at this https URL

点击查看摘要

Abstract:Monitoring air quality and environmental conditions is crucial for public health and effective urban planning. Current environmental monitoring approaches often rely on centralized data collection and processing, which pose significant privacy, security, and scalability challenges. Federated Learning (FL) offers a promising solution to these limitations by enabling collaborative model training across multiple devices without sharing raw data. This decentralized approach addresses privacy concerns while still leveraging distributed data sources. This paper provides a comprehensive review of FL applications in air quality and environmental monitoring, emphasizing its effectiveness in predicting pollutants and managing environmental data. However, the paper also identifies key limitations of FL when applied in this domain, including challenges such as communication overhead, infrastructure demands, generalizability issues, computational complexity, and security vulnerabilities. For instance, communication overhead, caused by the frequent exchange of model updates between local devices and central servers, is a notable challenge. To address this, future research should focus on optimizing communication protocols and reducing the frequency of updates to lessen the burden on network resources. Additionally, the paper suggests further research directions to refine FL frameworks and enhance their applicability in real-world environmental monitoring scenarios. By synthesizing findings from existing studies, this paper highlights the potential of FL to improve air quality management while maintaining data privacy and security, and it provides valuable insights for future developments in the field.

[LG-38] Scenario Discovery for Urban Planning : The Case of Green Urbanism and the Impact on Stress

链接: https://arxiv.org/abs/2504.02905
作者: Lorena Torres Lahoz,Carlos Lima Azevedo,Leonardo Ancora,Paulo Morgado,Zenia Kotval,Bruno Miranda,Francisco Camara Pereira
类目: Computers and Society (cs.CY); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Urban environments significantly influence mental health outcomes, yet effective frameworks for decision-making under deep uncertainty (DMDU) to optimize urban policies for stress reduction remain underexplored. While existing research has demonstrated the effects of urban design on mental health, there is a lack of systematic scenario-based analysis to guide urban planning decisions. This study addresses this gap by applying Scenario Discovery (SD) in urban planning to evaluate the effectiveness of urban vegetation interventions in stress reduction across different urban environments using a predictive model based on emotional responses collected from a neuroscience-based outdoor experiment in Lisbon. Combining these insights with detailed urban data from Copenhagen, we identify key intervention thresholds where vegetation-based solutions succeed or fail in mitigating stress responses. Our findings reveal that while increased vegetation generally correlates with lower stress levels, high-density urban environments, crowding, and individual psychological traits (e.g., extraversion) can reduce its effectiveness. This work showcases our Scenario Discovery framework as a systematic approach for identifying robust policy pathways in urban planning, opening the door for its exploration in other urban decision-making contexts where uncertainty and design resiliency are critical.

[LG-39] Quantum Speedups for Markov Chain Monte Carlo Methods with Application to Optimization

链接: https://arxiv.org/abs/2504.03626
作者: Guneykan Ozgul,Xiantao Li,Mehrdad Mahdavi,Chunhao Wang
类目: Quantum Physics (quant-ph); Machine Learning (cs.LG); Optimization and Control (math.OC)
*备注: 37 pages

点击查看摘要

Abstract:We propose quantum algorithms that provide provable speedups for Markov Chain Monte Carlo (MCMC) methods commonly used for sampling from probability distributions of the form \pi \propto e^{-f}, where f is a potential function. Our first approach considers Gibbs sampling for finite-sum potentials in the stochastic setting, employing an oracle that provides gradients of individual functions. In the second setting, we consider access only to a stochastic evaluation oracle, allowing simultaneous queries at two points of the potential function under the same stochastic parameter. By introducing novel techniques for stochastic gradient estimation, our algorithms improve the gradient and evaluation complexities of classical samplers, such as Hamiltonian Monte Carlo (HMC) and Langevin Monte Carlo (LMC) in terms of dimension, precision, and other problem-dependent parameters. Furthermore, we achieve quantum speedups in optimization, particularly for minimizing non-smooth and approximately convex functions that commonly appear in empirical risk minimization problems.
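For reference, the classical Langevin Monte Carlo baseline being accelerated samples from pi ∝ exp(-f) via the unadjusted discretization sketched below. This is the classical algorithm only, not the proposed quantum method; the step size and iteration counts are illustrative:

```python
import numpy as np

def langevin_monte_carlo(grad_f, x0, eta=0.1, n_steps=50_000, seed=0):
    """Unadjusted Langevin algorithm targeting pi ∝ exp(-f):
    x_{k+1} = x_k - eta * grad_f(x_k) + sqrt(2 * eta) * xi_k,
    which carries an O(eta) discretization bias."""
    rng = np.random.default_rng(seed)
    x = float(x0)
    samples = np.empty(n_steps)
    for k in range(n_steps):
        x = x - eta * grad_f(x) + np.sqrt(2.0 * eta) * rng.standard_normal()
        samples[k] = x
    return samples

# f(x) = x^2 / 2, so pi is the standard normal (up to the O(eta) bias)
samples = langevin_monte_carlo(lambda x: x, x0=3.0)
burned = samples[5_000:]   # discard burn-in
```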

[LG-40] Optimistic Online Learning in Symmetric Cone Games

链接: https://arxiv.org/abs/2504.03592
作者: Anas Barakat,Wayne Lin,John Lazarsfeld,Antonios Varvitsiotis
类目: Optimization and Control (math.OC); Computer Science and Game Theory (cs.GT); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Optimistic online learning algorithms have led to significant advances in equilibrium computation, particularly for two-player zero-sum games, achieving an iteration complexity of \mathcal{O}(1/\epsilon) to reach an \epsilon-saddle point. These advances have been established in normal-form games, where strategies are simplex vectors, and quantum games, where strategies are trace-one positive semidefinite matrices. We extend optimistic learning to symmetric cone games (SCGs), a class of two-player zero-sum games where strategy spaces are generalized simplices (trace-one slices of symmetric cones). A symmetric cone is the cone of squares of a Euclidean Jordan Algebra; canonical examples include the nonnegative orthant, the second-order cone, the cone of positive semidefinite matrices, and their products, all fundamental to convex optimization. SCGs unify normal-form and quantum games and, as we show, offer significantly greater modeling flexibility, allowing us to model applications such as distance metric learning problems and the Fermat-Weber problem. To compute approximate saddle points in SCGs, we introduce the Optimistic Symmetric Cone Multiplicative Weights Update algorithm and establish an iteration complexity of \mathcal{O}(1/\epsilon) to reach an \epsilon-saddle point. Our analysis builds on the Optimistic Follow-the-Regularized-Leader framework, with a key technical contribution being a new proof of the strong convexity of the symmetric cone negative entropy with respect to the trace-one norm, a result that may be of independent interest.
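In the normal-form special case (strategies on the simplex, i.e. the trace-one slice of the nonnegative orthant), optimistic multiplicative weights can be sketched as follows. This is the classical OMWU for matrix games, not the paper's general symmetric-cone algorithm; function names and parameters are illustrative:

```python
import numpy as np

def omwu_zero_sum(A, eta=0.1, T=5000, x0=None, y0=None):
    """Optimistic Multiplicative Weights Update for a two-player zero-sum
    matrix game: the row player minimizes x^T A y, the column player
    maximizes it. Returns the time-averaged strategies."""
    n, m = A.shape
    x = np.ones(n) / n if x0 is None else np.asarray(x0, dtype=float).copy()
    y = np.ones(m) / m if y0 is None else np.asarray(y0, dtype=float).copy()
    prev_lx, prev_ly = np.zeros(n), np.zeros(m)
    avg_x, avg_y = np.zeros(n), np.zeros(m)
    for _ in range(T):
        avg_x += x
        avg_y += y
        lx = A @ y                 # row player's losses
        ly = -A.T @ x              # column player's losses (it maximizes)
        # optimistic step: predict the next loss by the current one,
        # giving the familiar 2*current - previous exponent
        x = x * np.exp(-eta * (2 * lx - prev_lx)); x /= x.sum()
        y = y * np.exp(-eta * (2 * ly - prev_ly)); y /= y.sum()
        prev_lx, prev_ly = lx, ly
    return avg_x / T, avg_y / T

# rock-paper-scissors: the unique equilibrium is uniform play
A = np.array([[0.0, -1.0, 1.0],
              [1.0, 0.0, -1.0],
              [-1.0, 1.0, 0.0]])
avg_x, avg_y = omwu_zero_sum(A, x0=np.array([0.7, 0.2, 0.1]))
```

Starting the row player away from uniform makes the dynamics non-trivial; the time-averaged strategies still approach the uniform equilibrium.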

[LG-41] Stochastic Optimization with Optimal Importance Sampling

链接: https://arxiv.org/abs/2504.03560
作者: Liviu Aolaritei,Bart P.G. Van Parys,Henry Lam,Michael I. Jordan
类目: Optimization and Control (math.OC); Machine Learning (cs.LG); Statistics Theory (math.ST); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:Importance Sampling (IS) is a widely used variance reduction technique for enhancing the efficiency of Monte Carlo methods, particularly in rare-event simulation and related applications. Despite its power, the performance of IS is often highly sensitive to the choice of the proposal distribution and frequently requires stochastic calibration techniques. While the design and analysis of IS have been extensively studied in estimation settings, applying IS within stochastic optimization introduces a unique challenge: the decision and the IS distribution are mutually dependent, creating a circular optimization structure. This interdependence complicates both the analysis of convergence for decision iterates and the efficiency of the IS scheme. In this paper, we propose an iterative gradient-based algorithm that jointly updates the decision variable and the IS distribution without requiring time-scale separation between the two. Our method achieves the lowest possible asymptotic variance and guarantees global convergence under convexity of the objective and mild assumptions on the IS distribution family. Furthermore, we show that these properties are preserved under linear constraints by incorporating a recent variant of Nesterov’s dual averaging method.

[LG-42] Operator Learning: A Statistical Perspective

链接: https://arxiv.org/abs/2504.03503
作者: Unique Subedi,Ambuj Tewari
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注: 28 pages, 6 figures

点击查看摘要

Abstract:Operator learning has emerged as a powerful tool in scientific computing for approximating mappings between infinite-dimensional function spaces. A primary application of operator learning is the development of surrogate models for the solution operators of partial differential equations (PDEs). These methods can also be used to develop black-box simulators to model system behavior from experimental data, even without a known mathematical model. In this article, we begin by formalizing operator learning as a function-to-function regression problem and review some recent developments in the field. We also discuss PDE-specific operator learning, outlining strategies for incorporating physical and mathematical constraints into architecture design and training processes. Finally, we end by highlighting key future directions such as active data collection and the development of rigorous uncertainty quantification frameworks.
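The function-to-function regression framing can be illustrated with the simplest possible setup: fitting a linear operator to discretized input-output function pairs by least squares. This is a hypothetical toy using the antiderivative operator on a fixed grid, not a neural-operator architecture:

```python
import numpy as np

rng = np.random.default_rng(0)
grid = np.linspace(0.0, 1.0, 50)
dx = grid[1] - grid[0]

def random_function():
    """Random smooth input: a low-order sine series on [0, 1]."""
    coeffs = rng.normal(size=3)
    return sum(c * np.sin((k + 1) * np.pi * grid) for k, c in enumerate(coeffs))

def antiderivative(u):
    """Ground-truth operator: cumulative trapezoidal integral of u."""
    return np.concatenate([[0.0], np.cumsum(0.5 * (u[1:] + u[:-1]) * dx)])

# assemble training pairs (u_i, G(u_i)) and fit a linear operator by least squares
U = np.stack([random_function() for _ in range(100)])
S = np.stack([antiderivative(u) for u in U])
W, *_ = np.linalg.lstsq(U, S, rcond=None)    # S ≈ U @ W

# the learned operator generalizes to unseen functions from the same class
u_test = random_function()
s_pred = u_test @ W
s_true = antiderivative(u_test)
```

Because the target operator is linear and the inputs span a low-dimensional subspace, least squares recovers it exactly on that subspace; real operator learning replaces the matrix W with a neural surrogate for nonlinear maps.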

[LG-43] Generating ensembles of spatially-coherent in-situ forecasts using flow matching

链接: https://arxiv.org/abs/2504.03463
作者: David Landry,Claire Monteleoni,Anastase Charantonis
类目: Atmospheric and Oceanic Physics (physics.ao-ph); Machine Learning (cs.LG)
*备注: 16 pages, 7 figures

点击查看摘要

Abstract:We propose a machine-learning-based methodology for in-situ weather forecast postprocessing that is both spatially coherent and multivariate. Compared to previous work, our Flow MAtching Postprocessing (FMAP) better represents the correlation structures of the observations distribution, while also improving marginal performance at the stations. FMAP generates forecasts that are not bound to what is already modeled by the underlying gridded prediction and can infer new correlation structures from data. The resulting model can generate an arbitrary number of forecasts from a limited number of numerical simulations, allowing for low-cost forecasting systems. A single training is sufficient to perform postprocessing at multiple lead times, in contrast with other methods which use multiple trained networks at generation time. This work details our methodology, including a spatial attention transformer backbone trained within a flow matching generative modeling framework. FMAP shows promising performance in experiments on the EUPPBench dataset, forecasting surface temperature and wind gust values at station locations in western Europe up to five-day lead times.

[LG-44] Conditioning Diffusions Using Malliavin Calculus

链接: https://arxiv.org/abs/2504.03461
作者: Jakiw Pidstrigach,Elizabeth Baker,Carles Domingo-Enrich,George Deligiannidis,Nikolas Nüsken
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Probability (math.PR)
*备注:

点击查看摘要

Abstract:In stochastic optimal control and conditional generative modelling, a central computational task is to modify a reference diffusion process to maximise a given terminal-time reward. Most existing methods require this reward to be differentiable, using gradients to steer the diffusion towards favourable outcomes. However, in many practical settings, like diffusion bridges, the reward is singular, taking an infinite value if the target is hit and zero otherwise. We introduce a novel framework, based on Malliavin calculus and path-space integration by parts, that enables the development of methods robust to such singular rewards. This allows our approach to handle a broad range of applications, including classification, diffusion bridges, and conditioning without the need for artificial observational noise. We demonstrate that our approach offers stable and reliable training, outperforming existing techniques.

[LG-45] A Polynomial-Time Algorithm for Variational Inequalities under the Minty Condition

链接: https://arxiv.org/abs/2504.03432
作者: Ioannis Anagnostides,Gabriele Farina,Tuomas Sandholm,Brian Hu Zhang
类目: Optimization and Control (math.OC); Computer Science and Game Theory (cs.GT); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Solving (Stampacchia) variational inequalities (SVIs) is a foundational problem at the heart of optimization, with a host of critical applications ranging from engineering to economics. However, this expressivity comes at the cost of computational hardness. As a result, most research has focused on carving out specific subclasses that elude those intractability barriers. A classical property that goes back to the 1960s is the Minty condition, which postulates that the Minty VI (MVI) problem – the weak dual of the SVI problem – admits a solution. In this paper, we establish the first polynomial-time algorithm – that is, with complexity growing polynomially in the dimension d and \log(1/\epsilon) – for solving \epsilon-SVIs for Lipschitz continuous mappings under the Minty condition. Prior approaches either incurred an exponentially worse dependence on 1/\epsilon (and other natural parameters of the problem) or made overly restrictive assumptions – such as strong monotonicity. To do so, we introduce a new variant of the ellipsoid algorithm wherein separating hyperplanes are obtained after taking a gradient descent step from the center of the ellipsoid. It succeeds even though the set of SVI solutions can be nonconvex and not fully dimensional. Moreover, when our algorithm is applied to an instance with no MVI solution and fails to identify an SVI solution, it produces a succinct certificate of MVI infeasibility. We also show that deciding whether the Minty condition holds is \mathsf{coNP}-complete. We provide several extensions and new applications of our main results. Specifically, we obtain the first polynomial-time algorithms for i) solving monotone VIs, ii) globally minimizing a (potentially nonsmooth) quasar-convex function, and iii) computing Nash equilibria in multi-player harmonic games.

[LG-46] Bayesian LSTM for indoor temperature modeling

链接: https://arxiv.org/abs/2504.03350
作者: Emma Hannula,Arttu Häkkinen,Antti Solonen,Felibe Uribe,Jana de Wiljes,Lassi Roininen
类目: Applications (stat.AP); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Improving energy efficiency of building heating systems is essential for reducing global energy consumption and greenhouse gas emissions. Traditional control methods in buildings rely on static heating curves based solely on outdoor temperature measurements, neglecting system state and free heat sources like solar gain. Model predictive control (MPC) not only addresses these limitations but further optimizes heating control by incorporating weather forecasts and system state predictions. However, current industrial MPC solutions often use simplified physics-inspired models, which compromise accuracy for interpretability. While purely data-driven models offer better predictive performance, they face challenges like overfitting and lack of transparency. To bridge this gap, we propose a Bayesian Long Short-Term Memory (LSTM) architecture for indoor temperature modeling. Our experiments across 100 real-world buildings demonstrate that the Bayesian LSTM outperforms an industrial physics-based model in predictive accuracy, enabling potential for improved energy efficiency and thermal comfort if deployed in heating MPC solutions. Over deterministic black-box approaches, the Bayesian framework provides additional advantages by improving generalization ability and allowing interpretation of predictions via uncertainty quantification. This work advances data-driven heating control by balancing predictive performance with the transparency and reliability required for real-world heating MPC applications.
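The Bayesian benefit the abstract cites, i.e. predictions that carry uncertainty estimates, can be illustrated with a minimal Monte Carlo sketch (a toy stand-in, not the paper's Bayesian LSTM; every name and number below is an illustrative assumption):

```python
import random
import statistics

# Toy illustration of Bayesian prediction: instead of a single fixed weight,
# we sample weights from a posterior-like distribution and summarize the
# resulting predictions, obtaining an interval rather than just a point.
def predict_with_uncertainty(x, weight_mean, weight_std, n_samples, rng):
    """Sample weights, predict for each sample, and summarize mean/spread."""
    preds = [rng.gauss(weight_mean, weight_std) * x for _ in range(n_samples)]
    return statistics.mean(preds), statistics.stdev(preds)

rng = random.Random(0)
mean, std = predict_with_uncertainty(x=2.0, weight_mean=1.5, weight_std=0.1,
                                     n_samples=500, rng=rng)
# mean is close to 1.5 * 2.0 = 3.0; std is close to 2.0 * 0.1 = 0.2.
# For heating MPC, the interval (not just the point) can inform decisions.
```

The same principle, sampling model parameters instead of fixing them, is what lets a Bayesian LSTM report how confident each temperature forecast is.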

[LG-47] Block Toeplitz Sparse Precision Matrix Estimation for Large-Scale Interval-Valued Time Series Forecasting

链接: https://arxiv.org/abs/2504.03322
作者: Wan Tian,Zhongfeng Qin
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Modeling and forecasting interval-valued time series (ITS) have attracted considerable attention due to their growing presence in various contexts. To the best of our knowledge, there have been no efforts to model large-scale ITS. In this paper, we propose a feature extraction procedure for large-scale ITS, which involves key steps such as auto-segmentation and clustering, and feature transfer learning. This procedure can be seamlessly integrated with any suitable prediction models for forecasting purposes. Specifically, we transform the automatic segmentation and clustering of ITS into the estimation of Toeplitz sparse precision matrices and an assignment set. The majorization-minimization algorithm is employed to convert this highly non-convex optimization problem into two subproblems. We derive an efficient dynamic programming algorithm and an alternating direction method to solve these two subproblems alternately and establish their convergence properties. By employing the Joint Recurrence Plot (JRP) to image each subsequence and assigning a class label to each cluster, an image dataset is constructed. Then, an appropriate neural network is chosen to train on this image dataset and used to extract features for the subsequent forecasting step. Real data applications demonstrate that the proposed method can effectively obtain invariant representations of the raw data and enhance forecasting performance.

[LG-48] Adaptive Classification of Interval-Valued Time Series

链接: https://arxiv.org/abs/2504.03318
作者: Wan Tian,Zhongfeng Qin
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:In recent years, the modeling and analysis of interval-valued time series have garnered significant attention in the fields of econometrics and statistics. However, the existing literature primarily focuses on regression tasks while neglecting classification aspects. In this paper, we propose an adaptive approach for interval-valued time series classification. Specifically, we represent interval-valued time series using convex combinations of the upper and lower bounds of intervals and transform these representations into images based on point-valued time series imaging methods. We utilize a fine-grained image classification neural network to classify these images, thereby classifying the original interval-valued time series. The proposed method is applicable to both univariate and multivariate interval-valued time series. On the optimization front, we treat the convex combination coefficients as learnable parameters, similar to the parameters of the neural network, and provide an efficient estimation method based on the alternating direction method of multipliers (ADMM). On the theoretical front, under specific conditions, we establish a margin-based multiclass generalization bound for generic CNNs composed of basic blocks involving convolution, pooling, and fully connected layers. Through simulation studies and real data applications, we validate the effectiveness of the proposed method and compare its performance against a wide range of point-valued time series classification methods.
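The core representation step described above, a learnable convex combination of interval bounds, can be sketched as follows (the paper learns the coefficients jointly with a CNN via ADMM; this toy version uses a grid search and an invented separation criterion instead):

```python
# An interval-valued series [lower_t, upper_t] is collapsed to a point-valued
# series via a convex combination  x_t = c * upper_t + (1 - c) * lower_t,
# where c in [0, 1] is treated as a learnable parameter.
def convex_representation(lower, upper, c):
    """Collapse an interval series to a point series with coefficient c."""
    return [c * u + (1 - c) * l for l, u in zip(lower, upper)]

def class_separation(series_a, series_b):
    """Toy criterion: absolute difference of the class means."""
    mean = lambda s: sum(s) / len(s)
    return abs(mean(series_a) - mean(series_b))

# Two toy interval series from different classes.
low_a, up_a = [0.0, 1.0, 2.0], [2.0, 3.0, 4.0]
low_b, up_b = [0.0, 1.0, 2.0], [6.0, 7.0, 8.0]

# Grid search over c: pick the combination that best separates the classes.
best_c, best_sep = max(
    ((c / 10, class_separation(convex_representation(low_a, up_a, c / 10),
                               convex_representation(low_b, up_b, c / 10)))
     for c in range(11)),
    key=lambda t: t[1],
)
print(best_c, best_sep)  # → 1.0 4.0 (the upper bounds differ most here)
```

In the paper the resulting point-valued series is then imaged and fed to the classifier, with c updated alongside the network weights.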

[LG-49] Detecting underdetermination in parameterized quantum circuits

链接: https://arxiv.org/abs/2504.03315
作者: Marie Kempkes,Jakob Spiegelberg,Evert van Nieuwenburg,Vedran Dunjko
类目: Quantum Physics (quant-ph); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:A central question in machine learning is how reliable the predictions of a trained model are. Reliability includes the identification of instances for which a model is likely not to be trusted based on an analysis of the learning system itself. Such unreliability for an input may arise from the model family providing a variety of hypotheses consistent with the training data, which can vastly disagree in their predictions on that particular input point. This is called the underdetermination problem, and it is important to develop methods to detect it. With the emergence of quantum machine learning (QML) as a prospective alternative to classical methods for certain learning problems, the question arises to what extent QML models are subject to underdetermination and whether techniques similar to those developed for classical models can be employed for its detection. In this work, we first provide an overview of concepts from Safe AI and reliability, which have received particularly little attention in QML. We then explore the use of a method based on local second-order information for the detection of underdetermination in parameterized quantum circuits through numerical experiments. We further demonstrate that the approach is robust to certain levels of shot noise. Our work contributes to the body of literature on Safe Quantum AI, which is an emerging field of growing importance.

[LG-50] Roto-Translation Invariant Metrics on Position-Orientation Space

链接: https://arxiv.org/abs/2504.03309
作者: Gijs Bellaard,Bart M. N. Smets
类目: Differential Geometry (math.DG); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Riemannian metrics on the position-orientation space M(3) that are roto-translation group SE(3) invariant play a key role in image analysis tasks like enhancement, denoising, and segmentation. These metrics enable roto-translation equivariant algorithms, with the associated Riemannian distance often used in implementation. However, computing the Riemannian distance is costly, which makes it unsuitable in situations where constant recomputation is needed. We propose the mav (minimal angular velocity) distance, defined as the Riemannian length of a geometrically meaningful curve, as a practical alternative. We see an application of the mav distance in geometric deep learning. Namely, neural network architectures such as PONITA rely on geometric invariants to create their roto-translation equivariant models. The mav distance offers a trainable invariant, with the parameters that determine the Riemannian metric acting as learnable weights. In this paper we: 1) classify and parametrize all SE(3) invariant metrics on M(3), 2) describe how to efficiently calculate the mav distance, and 3) investigate whether including the mav distance within PONITA can positively impact its accuracy in predicting molecular properties.

[LG-51] Universal Collection of Euclidean Invariants between Pairs of Position-Orientations

链接: https://arxiv.org/abs/2504.03299
作者: Gijs Bellaard,Bart M. N. Smets,Remco Duits
类目: Differential Geometry (math.DG); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Euclidean E(3) equivariant neural networks that employ scalar fields on position-orientation space M(3) have been effectively applied to tasks such as predicting molecular dynamics and properties. To perform equivariant convolution-like operations in these architectures one needs Euclidean invariant kernels on M(3) x M(3). In practice, a handcrafted collection of invariants is selected, and this collection is then fed into multilayer perceptrons to parametrize the kernels. We rigorously describe an optimal collection of 4 smooth scalar invariants on the whole of M(3) x M(3). By optimal we mean that the collection is independent and universal, meaning that all invariants are pertinent, and any invariant kernel is a function of them. We evaluate two collections of invariants, one universal and one not, using the PONITA neural network architecture. Our experiments show that using a collection of invariants that is universal positively impacts the accuracy of PONITA significantly.

[LG-52] The Ground Cost for Optimal Transport of Angular Velocity

链接: https://arxiv.org/abs/2504.03190
作者: Karthik Elamvazhuthi,Abhishek Halder
类目: Optimization and Control (math.OC); Machine Learning (cs.LG); Systems and Control (eess.SY); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:We revisit the optimal transport problem over angular velocity dynamics given by the controlled Euler equation. The solution of this problem enables stochastic guidance of spin states of a rigid body (e.g., spacecraft) under a hard deadline constraint by transferring a given initial state statistics to a desired terminal state statistics. This is an instance of generalized optimal transport over a nonlinear dynamical system. While prior work has reported existence-uniqueness and numerical solution of this dynamical optimal transport problem, here we present structural results about the equivalent Kantorovich a.k.a. optimal coupling formulation. Specifically, we focus on deriving the ground cost for the associated Kantorovich optimal coupling formulation. The ground cost equals the cost of transporting a unit amount of mass from a specific realization of the initial or source joint probability measure to a realization of the terminal or target joint probability measure, and determines the Kantorovich formulation. Finding the ground cost leads to solving a structured deterministic nonlinear optimal control problem, which is shown to be amenable to an analysis technique pioneered by Athans et al. We show that such techniques have broader applicability in determining the ground cost (and thus the Kantorovich formulation) for a class of generalized optimal mass transport problems involving nonlinear dynamics with translated norm-invariant drift.

[LG-53] Bayesian Optimization of Robustness Measures Using Randomized GP-UCB-based Algorithms under Input Uncertainty

链接: https://arxiv.org/abs/2504.03172
作者: Yu Inatsu
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注: 44 pages, 4 figures

点击查看摘要

Abstract:Bayesian optimization based on Gaussian process upper confidence bound (GP-UCB) has a theoretical guarantee for optimizing black-box functions. Black-box functions often have input uncertainty, but even in this case, GP-UCB can be extended to optimize evaluation measures called robustness measures. However, GP-UCB-based methods for robustness measures include a trade-off parameter \beta , which must be excessively large to achieve theoretical validity, just like the original GP-UCB. In this study, we propose a new method called randomized robustness measure GP-UCB (RRGP-UCB), which samples the trade-off parameter \beta from a probability distribution based on a chi-squared distribution and avoids explicitly specifying \beta . The expected value of \beta is not excessively large. Furthermore, we show that RRGP-UCB provides tight bounds on the expected value of regret based on the optimal solution and estimated solutions. Finally, we demonstrate the usefulness of the proposed method through numerical experiments.
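The randomization idea above, sampling the trade-off parameter instead of fixing it at a conservatively large value, can be sketched in a few lines (the exact chi-squared-based distribution and the posterior values here are illustrative assumptions, not the paper's specification):

```python
import random

# Sketch of a randomized GP-UCB acquisition: each round samples the trade-off
# parameter beta from a chi-squared-based distribution, so its *expected*
# value (= the degrees of freedom) stays moderate instead of excessively large.
def sample_chi_squared(dof, rng):
    """Chi-squared sample as a sum of dof squared standard normals."""
    return sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(dof))

def randomized_ucb(mean, std, dof, rng):
    """UCB acquisition value with a freshly sampled trade-off parameter."""
    beta = sample_chi_squared(dof, rng)  # E[beta] = dof
    return mean + beta ** 0.5 * std

rng = random.Random(0)
# Toy GP posterior summaries (mean, std) at three candidate points.
candidates = [(0.2, 0.5), (0.5, 0.1), (0.1, 0.9)]
scores = [randomized_ucb(m, s, dof=3, rng=rng) for m, s in candidates]
best = max(range(len(scores)), key=scores.__getitem__)  # index of next query
```

Because beta is resampled every round, exploration varies stochastically while its expectation remains bounded, which is the intuition behind avoiding an explicitly specified, overly large beta.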

[LG-54] Accelerating Particle-based Energetic Variational Inference

链接: https://arxiv.org/abs/2504.03158
作者: Xuelian Bao,Lulu Kang,Chun Liu,Yiwei Wang
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注: 21 pages, 5 figures, 2 tables

点击查看摘要

Abstract:In this work, we propose a novel particle-based variational inference (ParVI) method that accelerates the EVI-Im. Inspired by energy quadratization (EQ) and operator splitting techniques for gradient flows, our approach efficiently drives particles towards the target distribution. Unlike EVI-Im, which employs the implicit Euler method to solve variational-preserving particle dynamics for minimizing the KL divergence, derived using a “discretize-then-variational” approach, the proposed algorithm avoids repeated evaluation of inter-particle interaction terms, significantly reducing computational cost. The framework is also extensible to other gradient-based sampling techniques. Through several numerical experiments, we demonstrate that our method outperforms existing ParVI approaches in efficiency, robustness, and accuracy.

[LG-55] A computational transition for detecting multivariate shuffled linear regression by low-degree polynomials

链接: https://arxiv.org/abs/2504.03097
作者: Zhangsong Li
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Probability (math.PR); Statistics Theory (math.ST)
*备注: 23 pages

点击查看摘要

Abstract:In this paper, we study the problem of multivariate shuffled linear regression, where the correspondence between predictors and responses in a linear model is obfuscated by a latent permutation. Specifically, we investigate the model Y=\tfrac{1}{\sqrt{1+\sigma^2}}(\Pi_* X Q_* + \sigma Z) , where X is an n \times d standard Gaussian design matrix, Z is an n \times m Gaussian noise matrix, \Pi_* is an unknown n \times n permutation matrix, and Q_* is an unknown d \times m matrix on the Grassmannian manifold satisfying Q_*^\top Q_* = \mathbb{I}_m . Consider the hypothesis testing problem of distinguishing this model from the case where X and Y are independent Gaussian random matrices of sizes n \times d and n \times m , respectively. Our results reveal a phase transition phenomenon in the performance of low-degree polynomial algorithms for this task. (1) When m=o(d) , we show that all degree- D polynomials fail to distinguish these two models even when \sigma=0 , provided that D^4=o\big( \tfrac{d}{m} \big) . (2) When m=d and \sigma=\omega(1) , we show that all degree- D polynomials fail to distinguish these two models provided that D=o(\sigma) . (3) When m=d and \sigma=o(1) , we show that there exists a constant-degree polynomial that strongly distinguishes these two models. These results establish a smooth transition in the effectiveness of low-degree polynomial algorithms for this problem, highlighting the interplay between the dimensions m and d , the noise level \sigma , and the computational complexity of the testing task.

[LG-56] High-dimensional ridge regression with random features for non-identically distributed data with a variance profile

链接: https://arxiv.org/abs/2504.03035
作者: Issa-Mbenard Dabo,Jérémie Bigot
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Probability (math.PR); Statistics Theory (math.ST); Methodology (stat.ME)
*备注:

点击查看摘要

Abstract:The behavior of the random feature model in the high-dimensional regression framework has become a popular issue of interest in the machine learning literature. This model is generally considered for feature vectors x_i = \Sigma^{1/2} x_i' , where x_i' is a random vector made of independent and identically distributed (iid) entries, and \Sigma is a positive definite matrix representing the covariance of the features. In this paper, we move beyond this standard assumption by studying the performance of the random features model in the setting of non-iid feature vectors. Our approach is related to the analysis of the spectrum of large random matrices through random matrix theory (RMT) and free probability results. We turn to the analysis of non-iid data by using the notion of variance profile, which is well studied in RMT. Our main contribution is then the study of the limits of the training and prediction risks associated to the ridge estimator in the random features model when its dimensions grow. We provide asymptotic equivalents of these risks that capture the behavior of ridge regression with random features in a high-dimensional framework. These asymptotic equivalents, which prove to be sharp in numerical experiments, are retrieved by adapting, to our setting, established results from operator-valued free probability theory. Moreover, for various classes of random feature vectors that have not been considered so far in the literature, our approach allows us to show the appearance of the double descent phenomenon when the ridge regularization parameter is small enough.

[LG-57] ConfEviSurrogate: A Conformalized Evidential Surrogate Model for Uncertainty Quantification

链接: https://arxiv.org/abs/2504.02919
作者: Yuhan Duan,Xin Zhao,Neng Shi,Han-Wei Shen
类目: Machine Learning (stat.ML); Graphics (cs.GR); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Surrogate models, crucial for approximating complex simulation data across sciences, inherently carry uncertainties that range from simulation noise to model prediction errors. Without rigorous uncertainty quantification, predictions become unreliable and hence hinder analysis. While methods like Monte Carlo dropout and ensemble models exist, they are often costly, fail to isolate uncertainty types, and lack guaranteed coverage in prediction intervals. To address this, we introduce ConfEviSurrogate, a novel Conformalized Evidential Surrogate Model that can efficiently learn high-order evidential distributions, directly predict simulation outcomes, separate uncertainty sources, and provide prediction intervals. A conformal prediction-based calibration step further enhances interval reliability to ensure coverage and improve efficiency. Our ConfEviSurrogate demonstrates accurate predictions and robust uncertainty estimates in diverse simulations, including cosmology, ocean dynamics, and fluid dynamics.
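The conformal calibration step mentioned above can be illustrated with a minimal split-conformal sketch (the surrogate below is a stand-in linear model, not ConfEviSurrogate): residuals on a held-out calibration set yield a quantile that widens point predictions into intervals with a finite-sample coverage guarantee.

```python
import math

def conformal_quantile(residuals, alpha):
    """The ceil((n+1)(1-alpha))/n empirical quantile of |y - y_hat|."""
    sorted_r = sorted(residuals)
    n = len(sorted_r)
    k = min(n, math.ceil((n + 1) * (1 - alpha)))
    return sorted_r[k - 1]

# Toy surrogate: predicts y = 2x; the true data carries bounded noise.
predict = lambda x: 2.0 * x
calibration = [(x, 2.0 * x + noise) for x, noise in
               zip([0.1 * i for i in range(20)],
                   [(-1) ** i * 0.05 * (i % 5) for i in range(20)])]
residuals = [abs(y - predict(x)) for x, y in calibration]
q = conformal_quantile(residuals, alpha=0.1)  # here q is about 0.2

# Every new prediction becomes an interval of guaranteed coverage ~1 - alpha.
x_new = 3.0
interval = (predict(x_new) - q, predict(x_new) + q)  # → roughly (5.8, 6.2)
```

The same calibration wraps any point predictor; in the paper it is applied on top of the evidential surrogate to make its intervals reliable.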

[LG-58] Efficient First-Order Optimization on the Pareto Set for Multi-Objective Learning under Preference Guidance

链接: https://arxiv.org/abs/2504.02854
作者: Lisha Chen,Quan Xiao,Ellen Hidemi Fukuda,Xinyi Chen,Kun Yuan,Tianyi Chen
类目: Optimization and Control (math.OC); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Multi-objective learning under user-specified preference is common in real-world problems such as multi-lingual speech recognition under fairness. In this work, we frame such a problem as a semivectorial bilevel optimization problem, whose goal is to optimize a pre-defined preference function, subject to the constraint that the model parameters are weakly Pareto optimal. To solve this problem, we convert the multi-objective constraints to a single-objective constraint through a merit function with an easy-to-evaluate gradient, and then, we use a penalty-based reformulation of the bilevel optimization problem. We theoretically establish the properties of the merit function, and the relations of solutions for the penalty reformulation and the constrained formulation. Then we propose algorithms to solve the reformulated single-level problem, and establish its convergence guarantees. We test the method on various synthetic and real-world problems. The results demonstrate the effectiveness of the proposed method in finding preference-guided optimal solutions to the multi-objective problem.

[LG-59] Transfer learning from first-principles calculations to experiments with chemistry-informed domain transformation

链接: https://arxiv.org/abs/2504.02848
作者: Yuta Yahagi,Kiichi Obuchi,Fumihiko Kosaka,Kota Matsui
类目: Chemical Physics (physics.chem-ph); Materials Science (cond-mat.mtrl-sci); Machine Learning (cs.LG); Computational Physics (physics.comp-ph)
*备注: 36 pages, 19 figures, 8 tables

点击查看摘要

Abstract:Simulation-to-Real (Sim2Real) transfer learning, the machine learning technique that efficiently solves a real-world task by leveraging knowledge from computational data, has received increasing attention in materials science as a promising solution to the scarcity of experimental data. We propose an efficient transfer learning scheme from first-principles calculations to experiments based on chemistry-informed domain transformation, which integrates the heterogeneous source and target domains by harnessing the underlying physics and chemistry. The proposed method maps the computational data from the simulation space (source domain) into the space of experimental data (target domain). During this process, these qualitatively different domains are efficiently bridged by prior knowledge of chemistry, the statistical ensemble, and the relationship between source and target quantities. As a proof of concept, we predict the catalyst activity for the reverse water-gas shift reaction by using the abundant first-principles data in addition to the experimental data. Through the demonstration, we confirmed that the transfer learning model exhibits positive transfer in both accuracy and data efficiency. In particular, significantly high accuracy was achieved despite using only a few (fewer than ten) target data points in the domain transformation, one order of magnitude fewer than the over 100 target data points used to train a full-scratch model. This result indicates that the proposed method achieves high prediction performance with few target data, which helps to save the number of trials in real laboratories.

[LG-60] Enhanced ECG Arrhythmia Detection Accuracy by Optimizing Divergence-Based Data Fusion

链接: https://arxiv.org/abs/2504.02842
作者: Baozhuo Su,Qingli Dou,Kang Liu,Zhengxian Qu,Jerry Deng,Ting Tan,Yanan Gu
类目: Signal Processing (eess.SP); Machine Learning (cs.LG); Applications (stat.AP); Methodology (stat.ME)
*备注: 13 pages, 8 figures, 6 tables

点击查看摘要

Abstract:AI computation in healthcare faces significant challenges when clinical datasets are limited and heterogeneous. Integrating datasets from multiple sources and different equipment is critical for effective AI computation but is complicated by their diversity, complexity, and lack of representativeness, so multiple datasets often need to be joined for analysis. The commonly used method is fusion after normalization, but this can introduce redundant information, decreasing the signal-to-noise ratio and reducing classification accuracy. To tackle this issue, we propose a feature-based fusion algorithm utilizing Kernel Density Estimation (KDE) and Kullback-Leibler (KL) divergence. Our approach involves initial preprocessing and continuous estimation of the extracted features, followed by employing the gradient descent method to identify the optimal linear parameters that minimize the KL divergence between the feature distributions. We evaluate the method using our in-house datasets, consisting of ECG signals collected from 2000 healthy and 2000 diseased individuals on different equipment, and verify it on the publicly available PTB-XL dataset, which contains 21,837 ECG recordings from 18,885 patients. We employ a Light Gradient Boosting Machine (LGBM) model to perform the binary classification. The results demonstrate that the proposed fusion method significantly enhances feature-based classification accuracy for abnormal ECG cases in the merged datasets, compared to the normalization method. This data fusion strategy provides a new approach to processing heterogeneous datasets for optimal AI computation results.
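The KDE-plus-KL-divergence fusion idea can be sketched in one dimension (the shift-only linear map and all constants are illustrative simplifications of the paper's method, not its exact procedure):

```python
import math

def kde(samples, grid, bandwidth=0.5):
    """Gaussian kernel density estimate evaluated on a grid."""
    norm = 1.0 / (len(samples) * bandwidth * math.sqrt(2 * math.pi))
    return [norm * sum(math.exp(-0.5 * ((g - s) / bandwidth) ** 2)
                       for s in samples) for g in grid]

def kl_divergence(p, q, eps=1e-12):
    """Discrete KL(p || q) over matching grid points."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def fit_shift(source, target, steps=200, lr=0.05, h=1e-4):
    """Gradient descent (numeric gradient) on a shift b that aligns the
    KDE of the shifted source features with the KDE of the target features."""
    grid = [-4.0 + 0.2 * i for i in range(41)]
    p = kde(target, grid)
    loss = lambda b: kl_divergence(p, kde([x + b for x in source], grid))
    b = 0.0
    for _ in range(steps):
        b -= lr * (loss(b + h) - loss(b - h)) / (2 * h)
    return b

source = [2.0 + 0.5 * i for i in range(-3, 4)]  # features centered near 2
target = [0.0 + 0.5 * i for i in range(-3, 4)]  # features centered near 0
b_opt = fit_shift(source, target)               # converges toward -2
```

The same recipe extends to a full linear map (scale and shift) per feature, which is closer to what the abstract describes for merging datasets from different equipment.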

[LG-61] PETIMOT: A Novel Framework for Inferring Protein Motions from Sparse Data Using SE(3)-Equivariant Graph Neural Networks

链接: https://arxiv.org/abs/2504.02839
作者: Valentin Lombard,Sergei Grudinin,Elodie Laine
类目: Biomolecules (q-bio.BM); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Proteins move and deform to ensure their biological functions. Despite significant progress in protein structure prediction, approximating conformational ensembles at physiological conditions remains a fundamental open problem. This paper presents a novel perspective on the problem by directly targeting continuous compact representations of protein motions inferred from sparse experimental observations. We develop a task-specific loss function enforcing data symmetries, including scaling and permutation operations. Our method PETIMOT (Protein sEquence and sTructure-based Inference of MOTions) leverages transfer learning from pre-trained protein language models through an SE(3)-equivariant graph neural network. When trained and evaluated on the Protein Data Bank, PETIMOT shows superior performance in time and accuracy, capturing protein dynamics, particularly large/slow conformational changes, compared to state-of-the-art flow-matching approaches and traditional physics-based models.

[LG-62] Explainable Dual-Attention Tabular Transformer for Soil Electrical Resistivity Prediction: A Decision Support Framework for High-Voltage Substation Construction

链接: https://arxiv.org/abs/2504.02834
作者: Warat Kongkitkul,Sompote Youwai,Warut Sakulpojworachai
类目: Signal Processing (eess.SP); Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)
*备注: arXiv admin note: text overlap with arXiv:2502.15827 by other authors

点击查看摘要

Abstract:This research introduces a novel dual-attention transformer architecture for predicting soil electrical resistivity, a critical parameter for high-voltage substation construction. Our model employs attention mechanisms operating across both features and data batches, enhanced by feature embedding layers that project inputs into higher-dimensional spaces. We implement Particle Swarm Optimization for hyperparameter tuning, systematically optimizing embedding dimensions, attention heads, and the neural network architecture. The proposed architecture achieves superior predictive performance (Mean Absolute Percentage Error: 0.63%) compared to recent state-of-the-art models for tabular data. Crucially, our model maintains explainability through SHapley Additive exPlanations value analysis, revealing that fine particle content and dry density are the most influential parameters affecting soil resistivity. We develop a web-based application implementing this model to provide engineers with an accessible decision support framework that bridges geotechnical and electrical engineering requirements for the Electricity Generating Authority of Thailand. This integrated approach satisfies both structural stability and electrical safety standards, improving construction efficiency and safety compliance in high-voltage infrastructure implementation.

[LG-63] Scalable Min-Max Optimization via Primal-Dual Exact Pareto Optimization

链接: https://arxiv.org/abs/2504.02833
作者: Sangwoo Park,Stefan Vlaski,Lajos Hanzo
类目: Optimization and Control (math.OC); Machine Learning (cs.LG)
*备注: submitted for a conference

点击查看摘要

Abstract:In multi-objective optimization, minimizing the worst objective can be preferable to minimizing the average objective, as this ensures improved fairness across objectives. Due to the non-smooth nature of the resultant min-max optimization problem, classical subgradient-based approaches typically exhibit slow convergence. Motivated by primal-dual consensus techniques in multi-agent optimization and learning, we formulate a smooth variant of the min-max problem based on the augmented Lagrangian. The resultant Exact Pareto Optimization via Augmented Lagrangian (EPO-AL) algorithm scales better with the number of objectives than subgradient-based strategies, while exhibiting lower per-iteration complexity than recent smoothing-based counterparts. We establish that every fixed-point of the proposed algorithm is both Pareto and min-max optimal under mild assumptions and demonstrate its effectiveness in numerical simulations.

[LG-64] DualMS: Implicit Dual-Channel Minimal Surface Optimization for Heat Exchanger Design

链接: https://arxiv.org/abs/2504.02830
作者: Weizheng Zhang(1),Hao Pan(2),Lin Lu(1),Xiaowei Duan(1),Xin Yan(1),Ruonan Wang(3),Qiang Du(3) ((1) Shandong University, (2) Tsinghua University, (3) Institute of Engineering Thermophysics, Chinese Academy of Sciences)
类目: Optimization and Control (math.OC); Graphics (cs.GR); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Heat exchangers are critical components in a wide range of engineering applications, from energy systems to chemical processing, where efficient thermal management is essential. The design objectives for heat exchangers include maximizing the heat exchange rate while minimizing the pressure drop, requiring both a large interface area and a smooth internal structure. State-of-the-art designs, such as triply periodic minimal surfaces (TPMS), have proven effective in optimizing heat exchange efficiency. However, TPMS designs are constrained by predefined mathematical equations, limiting their adaptability to freeform boundary shapes. Additionally, TPMS structures do not inherently control flow directions, which can lead to flow stagnation and undesirable pressure drops. This paper presents DualMS, a novel computational framework for optimizing dual-channel minimal surfaces specifically for heat exchanger designs in freeform shapes. To the best of our knowledge, this is the first attempt to directly optimize minimal surfaces for two-fluid heat exchangers, rather than relying on TPMS. Our approach formulates the heat exchange maximization problem as a constrained connected maximum cut problem on a graph, with flow constraints guiding the optimization process. To address undesirable pressure drops, we model the minimal surface as a classification boundary separating the two fluids, incorporating an additional regularization term for area minimization. We employ a neural network that maps spatial points to binary flow types, enabling it to classify flow skeletons and automatically determine the surface boundary. DualMS demonstrates greater flexibility in surface topology compared to TPMS and achieves superior thermal performance, with lower pressure drops while maintaining a similar heat exchange rate under the same material cost. 
Cite as: arXiv:2504.02830 [math.OC] (or arXiv:2504.02830v1 [math.OC] for this version), https://doi.org/10.48550/arXiv.2504.02830

Information Retrieval

[IR-0] Learning Sparse Disentangled Representations for Multimodal Exclusion Retrieval

Link: https://arxiv.org/abs/2504.03184
Authors: Prachi, Sumit Bhatia, Srikanta Bedathur
Subjects: Information Retrieval (cs.IR)
*Comments:

Click to view abstract

Abstract:Multimodal representations are essential for cross-modal retrieval, but they often lack interpretability, making it difficult to understand the reasoning behind retrieved results. Sparse disentangled representations offer a promising solution; however, existing methods rely heavily on text tokens, resulting in high-dimensional embeddings. In this work, we propose a novel approach that generates compact, fixed-size embeddings that maintain disentanglement while providing greater control over retrieval tasks. We evaluate our method on challenging exclusion queries using the MSCOCO and Conceptual Captions benchmarks, demonstrating notable improvements over dense models like CLIP, BLIP, and VISTA (with gains of up to 11% in AP@10), as well as over sparse disentangled models like VDR (achieving up to 21% gains in AP@10). Furthermore, we present qualitative results that emphasize the enhanced interpretability of our disentangled representations.
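The exclusion queries mentioned above (e.g. "a dog, but not at the beach") are where disentangled dimensions pay off: an excluded concept can simply be penalized in the score. A minimal sketch of that idea, under the simplifying assumption that each sparse dimension corresponds to one nameable concept (the paper instead learns compact fixed-size disentangled embeddings, and this scoring rule is ours, not theirs):

```python
def exclusion_score(doc_vec, include_vec, exclude_vec, penalty=1.0):
    """Score a document for an exclusion query: reward overlap with
    included concepts and penalize overlap with excluded ones.
    Vectors are sparse dicts mapping dimension index -> activation."""
    pos = sum(w * doc_vec.get(i, 0.0) for i, w in include_vec.items())
    neg = sum(w * doc_vec.get(i, 0.0) for i, w in exclude_vec.items())
    return pos - penalty * neg
```

With dense, entangled embeddings there is no per-concept dimension to subtract, which is one intuition for why the sparse disentangled models show the reported gains on exclusion benchmarks.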

[IR-1] Exploiting Fine-Grained Skip Behaviors for Micro-Video Recommendation AAAI

Link: https://arxiv.org/abs/2504.03107
Authors: Sanghyuck Lee, Sangkeun Park, Jaesung Lee
Subjects: Information Retrieval (cs.IR)
*Comments: 9 pages, 5 figures. Published in Proceedings of the AAAI Conference on Artificial Intelligence (AAAI), 2025

Click to view abstract

Abstract:The growing trend of sharing short videos on social media platforms, where users capture and share moments from their daily lives, has led to an increase in research efforts focused on micro-video recommendations. However, conventional methods oversimplify the modeling of skip behavior, categorizing interactions solely as positive or negative based on whether skipping occurs. This study was motivated by the importance of the first few seconds of micro-videos, leading to a refinement of signals into three distinct categories: highly positive, less positive, and negative. Specifically, we classify skip interactions occurring within a short time as negatives, while those occurring after a delay are categorized as less positive. The proposed dual-level graph and hierarchical ranking loss are designed to effectively learn these fine-grained interactions. Our experiments demonstrated that the proposed method outperformed three conventional methods across eight evaluation measures on two public datasets.
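The three-way refinement of skip signals described above can be sketched as a simple labeling rule. The `quick_skip` threshold of 3 seconds below is an illustrative assumption, not a value taken from the paper:

```python
def label_interaction(skipped, watch_seconds, quick_skip=3.0):
    """Map a micro-video interaction to one of three signal levels,
    refining the conventional binary skip/no-skip signal."""
    if not skipped:
        return "highly_positive"   # watched to completion
    if watch_seconds < quick_skip:
        return "negative"          # skipped almost immediately
    return "less_positive"         # watched a while before skipping
```

The paper's dual-level graph and hierarchical ranking loss then learn from these graded labels rather than treating every skip as equally negative.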

[IR-2] Integrating Notch Filtering and Statistical Methods for Improved Cardiac Diagnostics Using MATLAB

Link: https://arxiv.org/abs/2504.02847
Authors: Lohit Bibar, Samali Bose, Tribeni Prasad Banerjee
Subjects: Signal Processing (eess.SP); Information Retrieval (cs.IR); Image and Video Processing (eess.IV)
*Comments: 11

Click to view abstract

Abstract:A Notch Filter is essential in ECG signal processing to eliminate narrowband noise, especially powerline interference at 50 Hz or 60 Hz. This interference overlaps with vital ECG signal features, affecting the accuracy of downstream classification tasks (e.g., arrhythmia detection). A properly designed notch filter enhances signal quality, preserves essential ECG components (P, QRS, T waves), and improves the performance of machine learning or deep learning models used for ECG classification.
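Although the paper works in MATLAB, the same 50/60 Hz notch can be sketched with a standard second-order IIR biquad (the well-known audio-EQ cookbook notch form). The quality factor and function names below are generic illustrative choices, not the authors' settings:

```python
import math

def notch_coeffs(f0, fs, q=30.0):
    """Second-order IIR notch coefficients, normalized so a[0] = 1.
    f0: notch frequency in Hz (e.g. 50 or 60), fs: sampling rate in Hz."""
    w0 = 2 * math.pi * f0 / fs
    alpha = math.sin(w0) / (2 * q)
    a0 = 1 + alpha
    b = [1.0 / a0, -2 * math.cos(w0) / a0, 1.0 / a0]
    a = [1.0, -2 * math.cos(w0) / a0, (1 - alpha) / a0]
    return b, a

def filt(b, a, x):
    """Apply the biquad in direct form I to a list of samples."""
    y = []
    for n, xn in enumerate(x):
        yn = b[0] * xn
        if n >= 1:
            yn += b[1] * x[n - 1] - a[1] * y[n - 1]
        if n >= 2:
            yn += b[2] * x[n - 2] - a[2] * y[n - 2]
        y.append(yn)
    return y
```

With fs = 500 Hz and f0 = 50 Hz this filter places a true zero at the powerline frequency, so steady-state hum is removed while the much lower-frequency P, QRS, and T components pass with near-unity gain, which is the behavior the paper relies on for downstream classification.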

Attachment Download

Click to download today's full paper list