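This post lists the latest papers retrieved from arXiv.org on 2025-05-09. It is updated automatically and grouped into five broad areas: NLP, CV, ML, AI, and IR.
Note: The daily paper data is retrieved from arXiv.org and updated automatically at around 12:00 every morning.
Reminder: If you would like to receive the daily paper data by email, leave your email address in the comments.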
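Overview (2025-05-09)
418 papers were updated today, including:
- Natural language processing: 70 papers (Computation and Language (cs.CL))
- Artificial intelligence: 112 papers (Artificial Intelligence (cs.AI))
- Computer vision: 113 papers (Computer Vision and Pattern Recognition (cs.CV))
- Machine learning: 123 papers (Machine Learning (cs.LG))
Natural Language Processing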
[NLP-0] StreamBridge: Turning Your Offline Video Large Language Model into a Proactive Streaming Assistant
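[Quick read]: This paper addresses two core problems in adapting offline video large language models (Video-LLMs) to online streaming scenarios: limited capability for multi-turn real-time understanding, and the lack of a proactive response mechanism. The key to the solution is the StreamBridge framework, which (1) combines a memory buffer with a round-decayed compression strategy to support long-context multi-turn interaction, and (2) uses a decoupled, lightweight activation model that integrates into existing Video-LLMs to enable continuous proactive responses. To support StreamBridge, the authors also construct Stream-IT, a dataset for streaming video understanding.
Link: https://arxiv.org/abs/2505.05467
Authors: Haibo Wang, Bo Feng, Zhengfeng Lai, Mingze Xu, Shiyu Li, Weifeng Ge, Afshin Dehghan, Meng Cao, Ping Huang
Affiliations: Apple; Fudan University
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: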
Abstract:We present StreamBridge, a simple yet effective framework that seamlessly transforms offline Video-LLMs into streaming-capable models. It addresses two fundamental challenges in adapting existing models into online scenarios: (1) limited capability for multi-turn real-time understanding, and (2) lack of proactive response mechanisms. Specifically, StreamBridge incorporates (1) a memory buffer combined with a round-decayed compression strategy, supporting long-context multi-turn interactions, and (2) a decoupled, lightweight activation model that can be effortlessly integrated into existing Video-LLMs, enabling continuous proactive responses. To further support StreamBridge, we construct Stream-IT, a large-scale dataset tailored for streaming video understanding, featuring interleaved video-text sequences and diverse instruction formats. Extensive experiments show that StreamBridge significantly improves the streaming understanding capabilities of offline Video-LLMs across various tasks, outperforming even proprietary models such as GPT-4o and Gemini 1.5 Pro. Simultaneously, it achieves competitive or superior performance on standard video understanding benchmarks.
[NLP-1] ComPO: Preference Alignment via Comparison Oracles
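[Quick read]: This paper addresses the verbosity and likelihood displacement that direct alignment methods exhibit when aligning large language models (LLMs) with human preferences. These issues are often driven by noisy preference pairs that induce similar likelihoods for preferred and dispreferred responses. The paper proposes a new preference alignment method based on comparison oracles and provides a convergence guarantee for its basic scheme. The key to the solution is designing specialized methods for preference pairs with a distinct likelihood margin, which improves LLM performance when training on noisy preference pairs.
Link: https://arxiv.org/abs/2505.05465
Authors: Peter Chen, Xi Chen, Wotao Yin, Tianyi Lin
Affiliations: Columbia University; Stern School of Business, New York University; DAMO Academy, Alibaba Group US
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 25 pages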
Abstract:Direct alignment methods are increasingly used for aligning large language models (LLMs) with human preferences. However, these methods suffer from the issues of verbosity and likelihood displacement, which can be driven by the noisy preference pairs that induce similar likelihood for preferred and dispreferred responses. The contributions of this paper are two-fold. First, we propose a new preference alignment method based on comparison oracles and provide the convergence guarantee for its basic scheme. Second, we improve our method using some heuristics and conduct the experiments to demonstrate the flexibility and compatibility of practical scheme in improving the performance of LLMs using noisy preference pairs. Evaluations are conducted across multiple base and instruction-tuned models (Mistral-7B, Llama-3-8B and Gemma-2-9B) with benchmarks (AlpacaEval 2, MT-Bench and Arena-Hard). Experimental results show the effectiveness of our method as an alternative to addressing the limitations of existing direct alignment methods. A highlight of our work is that we evidence the importance of designing specialized methods for preference pairs with distinct likelihood margin, which complements the recent findings in Razin et al. (2025).
[NLP-2] Bring Reason to Vision: Understanding Perception and Reasoning through Model Merging ICML2025
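[Quick read]: This paper asks how visual perception can be effectively combined with the reasoning capabilities of large language models (LLMs), a mechanism that remains poorly understood. The key to the solution is model merging, which connects the parameters of models from different modalities and thereby incorporates the reasoning capabilities of LLMs into vision-language models (VLMs). The method transfers reasoning ability without additional training and offers a new view of how perception and reasoning are distributed inside the model and how merging affects them.
Link: https://arxiv.org/abs/2505.05464
Authors: Shiqi Chen, Jinghan Zhang, Tongyao Zhu, Wei Liu, Siyang Gao, Miao Xiong, Manling Li, Junxian He
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: ICML 2025. Our code is publicly available at this https URL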
Abstract:Vision-Language Models (VLMs) combine visual perception with the general capabilities, such as reasoning, of Large Language Models (LLMs). However, the mechanisms by which these two abilities can be combined and contribute remain poorly understood. In this work, we explore to compose perception and reasoning through model merging that connects parameters of different models. Unlike previous works that often focus on merging models of the same kind, we propose merging models across modalities, enabling the incorporation of the reasoning capabilities of LLMs into VLMs. Through extensive experiments, we demonstrate that model merging offers a successful pathway to transfer reasoning abilities from LLMs to VLMs in a training-free manner. Moreover, we utilize the merged models to understand the internal mechanism of perception and reasoning and how merging affects it. We find that perception capabilities are predominantly encoded in the early layers of the model, whereas reasoning is largely facilitated by the middle-to-late layers. After merging, we observe that all layers begin to contribute to reasoning, whereas the distribution of perception abilities across layers remains largely unchanged. These observations shed light on the potential of model merging as a tool for multimodal integration and interpretation.
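The abstract does not state which merging rule is used, so the following is only a minimal sketch of one common form of model merging: linear interpolation of parameters that exist in both models. The function name `merge_linear`, the weight `alpha`, and the dict-of-lists parameter format are assumptions made for illustration, not the authors' implementation.

```python
from typing import Dict, List

# Hypothetical parameter format: parameter name -> flat list of weights.
Params = Dict[str, List[float]]


def merge_linear(vlm: Params, llm: Params, alpha: float = 0.9) -> Params:
    """Interpolate parameters shared by both models; keep VLM-only ones unchanged.

    alpha is the (assumed) weight given to the VLM parameters.
    """
    merged: Params = {}
    for name, vlm_weights in vlm.items():
        llm_weights = llm.get(name)
        if llm_weights is None or len(llm_weights) != len(vlm_weights):
            # For example, vision-side weights have no counterpart in a text-only LLM.
            merged[name] = list(vlm_weights)
        else:
            merged[name] = [
                alpha * v + (1.0 - alpha) * l
                for v, l in zip(vlm_weights, llm_weights)
            ]
    return merged


# Toy example with made-up parameter names and values.
vlm = {"vision.proj": [1.0, 2.0], "lm.layer0": [0.0, 4.0]}
llm = {"lm.layer0": [2.0, 0.0]}
merged = merge_linear(vlm, llm, alpha=0.5)
```

Because only the shared language-model parameters are interpolated, no gradient step is involved, which matches the training-free setting the abstract describes.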
[NLP-3] UKElectionNarratives: A Dataset of Misleading Narratives Surrounding Recent UK General Elections AAAI
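[Quick read]: This paper addresses how misleading narratives shape public opinion during elections: such narratives can influence how voters perceive candidates and parties, so they need to be detected accurately. The key to the solution is the first taxonomy of common misleading narratives in recent European elections, on which the authors build UKElectionNarratives, a dataset of human-annotated misleading narratives from the 2019 and 2024 UK General Elections. They also benchmark pre-trained and large language models (GPT-4o in particular) on detecting election-related misleading narratives.
Link: https://arxiv.org/abs/2505.05459
Authors: Fatima Haouari, Carolina Scarton, Nicolò Faggiani, Nikolaos Nikolaidis, Bonka Kotseva, Ibrahim Abu Farha, Jens Linge, Kalina Bontcheva
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Social and Information Networks (cs.SI)
Comments: This work was accepted at the International AAAI Conference on Web and Social Media (ICWSM 2025)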
Abstract:Misleading narratives play a crucial role in shaping public opinion during elections, as they can influence how voters perceive candidates and political parties. This entails the need to detect these narratives accurately. To address this, we introduce the first taxonomy of common misleading narratives that circulated during recent elections in Europe. Based on this taxonomy, we construct and analyse UKElectionNarratives: the first dataset of human-annotated misleading narratives which circulated during the UK General Elections in 2019 and 2024. We also benchmark Pre-trained and Large Language Models (focusing on GPT-4o), studying their effectiveness in detecting election-related misleading narratives. Finally, we discuss potential use cases and make recommendations for future research directions using the proposed codebook and dataset.
[NLP-4] Adaptive Markup Language Generation for Contextually-Grounded Visual Document Understanding CVPR2025
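[Quick read]: This paper addresses challenges in visual document understanding, namely how to integrate visual perception with textual comprehension across diverse document types and complex layouts. Existing fine-tuning datasets provide too little detailed contextual information, which leads models to hallucinate and to struggle with the spatial relationships among visual elements. The key to the solution is a pipeline that adaptively generates markup languages (such as Markdown, JSON, HTML, and TiKZ) to build highly structured document representations and produce contextually grounded responses. The authors also introduce two fine-grained structured datasets: DocMark-Pile for document-parsing pretraining and DocMark-Instruct for grounded instruction-following fine-tuning, which significantly improve performance on a range of visual document understanding benchmarks.
Link: https://arxiv.org/abs/2505.05446
Authors: Han Xiao, Yina Xie, Guanxin Tan, Yinghao Chen, Rui Hu, Ke Wang, Aojun Zhou, Hao Li, Hao Shao, Xudong Lu, Peng Gao, Yafei Wen, Xiaoxin Chen, Shuai Ren, Hongsheng Li
Affiliations: CUHK MMLab (The Chinese University of Hong Kong Multimedia Laboratory); vivo AI Lab (vivo Artificial Intelligence Lab); CPII under InnoHK; Shanghai AI Lab (Shanghai Artificial Intelligence Lab); Shenzhen Institute of Advanced Technology, CAS (Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences)
Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Comments: CVPR2025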
Abstract:Visual Document Understanding has become essential with the increase of text-rich visual content. This field poses significant challenges due to the need for effective integration of visual perception and textual comprehension, particularly across diverse document types with complex layouts. Moreover, existing fine-tuning datasets for this domain often fall short in providing the detailed contextual information for robust understanding, leading to hallucinations and limited comprehension of spatial relationships among visual elements. To address these challenges, we propose an innovative pipeline that utilizes adaptive generation of markup languages, such as Markdown, JSON, HTML, and TiKZ, to build highly structured document representations and deliver contextually-grounded responses. We introduce two fine-grained structured datasets: DocMark-Pile, comprising approximately 3.8M pretraining data pairs for document parsing, and DocMark-Instruct, featuring 624k fine-tuning data annotations for grounded instruction following. Extensive experiments demonstrate that our proposed model significantly outperforms existing state-of-the-art MLLMs across a range of visual document understanding benchmarks, facilitating advanced reasoning and comprehension capabilities in complex visual scenarios. Our code and models are released at https://github.com/Euphoria16/DocMark.
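[NLP-5] clem:todd: A Framework for the Systematic Benchmarking of LLM-Based Task-Oriented Dialogue System Realisations
[Quick read]: This paper addresses the fact that existing work tends to evaluate user simulators or specific system designs in isolation, which limits how well insights generalize across architectures and configurations. The key to the solution is clem:todd (chat-optimized LLMs for task-oriented dialogue systems development), a flexible framework for systematically evaluating dialogue systems under consistent conditions. It supports detailed benchmarking across combinations of user simulators and dialogue systems and ensures uniform datasets, evaluation metrics, and computational constraints.
Link: https://arxiv.org/abs/2505.05445
Authors: Chalamalasetti Kranti, Sherzod Hakimov, David Schlangen
Affiliations: University of Potsdam; German Research Center for Artificial Intelligence (DFKI)
Categories: Computation and Language (cs.CL)
Comments: 30 pages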
Abstract:The emergence of instruction-tuned large language models (LLMs) has advanced the field of dialogue systems, enabling both realistic user simulations and robust multi-turn conversational agents. However, existing research often evaluates these components in isolation, either focusing on a single user simulator or a specific system design, limiting the generalisability of insights across architectures and configurations. In this work, we propose clem:todd (chat-optimized LLMs for task-oriented dialogue systems development), a flexible framework for systematically evaluating dialogue systems under consistent conditions. clem:todd enables detailed benchmarking across combinations of user simulators and dialogue systems, whether existing models from literature or newly developed ones. It supports plug-and-play integration and ensures uniform datasets, evaluation metrics, and computational constraints. We showcase clem:todd’s flexibility by re-evaluating existing task-oriented dialogue systems within this unified setup and integrating three newly proposed dialogue systems into the same evaluation pipeline. Our results provide actionable insights into how architecture, scale, and prompting strategies affect dialogue performance, offering practical guidance for building efficient and effective conversational AI systems.
[NLP-6] Ultra-FineWeb: Efficient Data Filtering and Verification for High-Quality LLM Training Data
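[Quick read]: This paper addresses two main problems of model-driven data filtering: the lack of an efficient verification strategy that gives timely feedback on data quality, and the subjectivity of seed-data selection, which lacks clear criteria and relies on human expertise. The key to the solution is an efficient verification strategy that rapidly evaluates the impact of data on large language model (LLM) training; combined with this strategy, the selection of positive and negative samples is optimized to build an efficient data filtering pipeline. The pipeline improves filtering efficiency, classifier quality, and robustness while reducing experimental and inference costs.
Link: https://arxiv.org/abs/2505.05427
Authors: Yudong Wang, Zixuan Fu, Jie Cai, Peijun Tang, Hongya Lyu, Yewei Fang, Zhi Zheng, Jie Zhou, Guoyang Zeng, Chaojun Xiao, Xu Han, Zhiyuan Liu
Affiliations: ModelBest Inc.; Tsinghua University; Soochow University
Categories: Computation and Language (cs.CL)
Comments: The datasets are available on this https URL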
Abstract:Data quality has become a key factor in enhancing model performance with the rapid development of large language models (LLMs). Model-driven data filtering has increasingly become a primary approach for acquiring high-quality data. However, it still faces two main challenges: (1) the lack of an efficient data verification strategy makes it difficult to provide timely feedback on data quality; and (2) the selection of seed data for training classifiers lacks clear criteria and relies heavily on human expertise, introducing a degree of subjectivity. To address the first challenge, we introduce an efficient verification strategy that enables rapid evaluation of the impact of data on LLM training with minimal computational cost. To tackle the second challenge, we build upon the assumption that high-quality seed data is beneficial for LLM training, and by integrating the proposed verification strategy, we optimize the selection of positive and negative samples and propose an efficient data filtering pipeline. This pipeline not only improves filtering efficiency, classifier quality, and robustness, but also significantly reduces experimental and inference costs. In addition, to efficiently filter high-quality data, we employ a lightweight classifier based on fastText, and successfully apply the filtering pipeline to two widely-used pre-training corpora, FineWeb and Chinese FineWeb datasets, resulting in the creation of the higher-quality Ultra-FineWeb dataset. Ultra-FineWeb contains approximately 1 trillion English tokens and 120 billion Chinese tokens. Empirical results demonstrate that the LLMs trained on Ultra-FineWeb exhibit significant performance improvements across multiple benchmark tasks, validating the effectiveness of our pipeline in enhancing both data quality and training efficiency.
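The classifier-based filtering step described in the abstract can be sketched as a score-and-threshold pass over documents. This is a minimal illustration under stated assumptions, not the authors' pipeline: `toy_score` is a hypothetical stand-in for the quality score of the fastText-based classifier, and the threshold of 0.5 is an assumed value.

```python
from typing import Callable, Iterable, List


def filter_documents(
    docs: Iterable[str],
    score: Callable[[str], float],
    threshold: float = 0.5,
) -> List[str]:
    """Keep documents whose quality score is at least `threshold`."""
    return [doc for doc in docs if score(doc) >= threshold]


def toy_score(doc: str) -> float:
    """Hypothetical scorer: fraction of whitespace-separated tokens that are purely alphabetic."""
    words = doc.split()
    if not words:
        return 0.0
    return sum(word.isalpha() for word in words) / len(words)


docs = ["clean readable text", "$$$ 1234 !!!"]
kept = filter_documents(docs, toy_score, threshold=0.5)
```

Passing the scorer in as a callable keeps the filter independent of the classifier, so a trained model could replace `toy_score` without changing `filter_documents`.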
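[NLP-7] TransProQA: an LLM-based literary Translation evaluation metric with Professional Question Answering
[Quick read]: This paper addresses two problems: current evaluation metrics for literary translation favor mechanical accuracy over artistic expression, and they overrate machine translation, which risks a long-term decline in translation quality and cultural authenticity. The key to the solution is TransProQA, a reference-free question-answering framework based on large language models (LLMs). It integrates insights from professional literary translators and researchers, focuses on key elements of literary quality assessment such as literary devices, cultural understanding, and authorial voice, and further improves performance by using professional translator feedback as weights, approaching human-level evaluation of literary translation.
Link: https://arxiv.org/abs/2505.05423
Authors: Ran Zhang, Wei Zhao, Lieve Macken, Steffen Eger
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: WIP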
Abstract:The impact of Large Language Models (LLMs) has extended into literary domains. However, existing evaluation metrics prioritize mechanical accuracy over artistic expression and tend to overrate machine translation (MT) as being superior to experienced professional human translation. In the long run, this bias could result in a permanent decline in translation quality and cultural authenticity. In response to the urgent need for a specialized literary evaluation metric, we introduce TransProQA, a novel, reference-free, LLM-based question-answering (QA) framework designed specifically for literary translation evaluation. TransProQA uniquely integrates insights from professional literary translators and researchers, focusing on critical elements in literary quality assessment such as literary devices, cultural understanding, and authorial voice. Our extensive evaluation shows that while literary-finetuned XCOMET-XL yields marginal gains, TransProQA substantially outperforms current metrics, achieving up to 0.07 gain in correlation (ACC-EQ and Kendall’s tau) and surpassing the best state-of-the-art (SOTA) metrics by over 15 points in adequacy assessments. Incorporating professional translator insights as weights further improves performance, highlighting the value of translator inputs. Notably, TransProQA approaches human-level evaluation performance comparable to trained linguistic annotators. It demonstrates broad applicability to open-source models such as LLaMA3.3-70b and Qwen2.5-32b, indicating its potential as an accessible and training-free literary evaluation metric and a valuable tool for evaluating texts that require local processing due to copyright or ethical considerations.
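[NLP-8] TokLIP: Marry Visual Tokens to CLIP for Multimodal Comprehension and Generation
[Quick read]: This paper addresses the high training compute overhead and limited comprehension performance that multimodal unification suffers from when high-level semantics are missing. The key to the solution is TokLIP, a visual tokenizer that semanticizes vector-quantized (VQ) tokens and incorporates CLIP-level semantics while still supporting end-to-end multimodal autoregressive training with standard VQ tokens. TokLIP combines a low-level discrete VQ tokenizer with a ViT-based token encoder to capture high-level continuous semantics and, by disentangling the training objectives for comprehension and generation, can apply advanced VQ tokenizers directly without tailored quantization operations.
Link: https://arxiv.org/abs/2505.05422
Authors: Haokun Lin, Teng Wang, Yixiao Ge, Yuying Ge, Zhichao Lu, Ying Wei, Qingfu Zhang, Zhenan Sun, Ying Shan
Affiliations: ARC Lab, Tencent PCG; City University of Hong Kong; Zhejiang University; NLPR & MAIS, Institute of Automation, CAS
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: Technical Report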
Abstract:Pioneering token-based works such as Chameleon and Emu3 have established a foundation for multimodal unification but face challenges of high training computational overhead and limited comprehension performance due to a lack of high-level semantics. In this paper, we introduce TokLIP, a visual tokenizer that enhances comprehension by semanticizing vector-quantized (VQ) tokens and incorporating CLIP-level semantics while enabling end-to-end multimodal autoregressive training with standard VQ tokens. TokLIP integrates a low-level discrete VQ tokenizer with a ViT-based token encoder to capture high-level continuous semantics. Unlike previous approaches (e.g., VILA-U) that discretize high-level features, TokLIP disentangles training objectives for comprehension and generation, allowing the direct application of advanced VQ tokenizers without the need for tailored quantization operations. Our empirical results demonstrate that TokLIP achieves exceptional data efficiency, empowering visual tokens with high-level semantic understanding while enhancing low-level generative capacity, making it well-suited for autoregressive Transformers in both comprehension and generation tasks. The code and models are available at this https URL.
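[NLP-9] Reasoning Models Don't Always Say What They Think
[Quick read]: This paper examines whether chain-of-thought (CoT) monitoring can be relied on to detect unsafe behavior in reasoning models, which depends on whether CoTs faithfully reflect a model's actual reasoning. The key to the approach is evaluating the CoT faithfulness of state-of-the-art reasoning models across hints placed in the prompts. The study finds that CoTs do reveal hint usage to some extent, but the reveal rate is low, and that reinforcement learning improves faithfulness initially but then plateaus, so CoT monitoring alone cannot rule out undesired behavior.
Link: https://arxiv.org/abs/2505.05410
Authors: Yanda Chen, Joe Benton, Ansh Radhakrishnan, Jonathan Uesato, Carson Denison, John Schulman, Arushi Somani, Peter Hase, Misha Wagner, Fabien Roger, Vlad Mikulik, Samuel R. Bowman, Jan Leike, Jared Kaplan, Ethan Perez
Affiliations: Anthropic
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: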
Abstract:Chain-of-thought (CoT) offers a potential boon for AI safety as it allows monitoring a model’s CoT to try to understand its intentions and reasoning processes. However, the effectiveness of such monitoring hinges on CoTs faithfully representing models’ actual reasoning processes. We evaluate CoT faithfulness of state-of-the-art reasoning models across 6 reasoning hints presented in the prompts and find: (1) for most settings and models tested, CoTs reveal their usage of hints in at least 1% of examples where they use the hint, but the reveal rate is often below 20%, (2) outcome-based reinforcement learning initially improves faithfulness but plateaus without saturating, and (3) when reinforcement learning increases how frequently hints are used (reward hacking), the propensity to verbalize them does not increase, even without training against a CoT monitor. These results suggest that CoT monitoring is a promising way of noticing undesired behaviors during training and evaluations, but that it is not sufficient to rule them out. They also suggest that in settings like ours where CoT reasoning is not necessary, test-time monitoring of CoTs is unlikely to reliably catch rare and catastrophic unexpected behaviors.
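The reveal rate mentioned in the abstract is a conditional fraction: among examples where the model uses the hint, how often the CoT says so. A minimal sketch follows; the boolean labels `used_hint` and `verbalized_hint` are assumed to be given, and how the paper derives them is not covered here.

```python
from typing import List, Tuple


def reveal_rate(examples: List[Tuple[bool, bool]]) -> float:
    """Fraction of hint-using examples whose CoT verbalizes the hint.

    Each example is (used_hint, verbalized_hint). Examples where the hint
    was not used are excluded from the denominator.
    """
    verbalized_flags = [verbalized for used_hint, verbalized in examples if used_hint]
    if not verbalized_flags:
        raise ValueError("no example uses the hint, so the reveal rate is undefined")
    return sum(verbalized_flags) / len(verbalized_flags)


# Made-up labels: four hint-using examples, one of which verbalizes the hint.
examples = [(True, True), (True, False), (True, False), (True, False), (False, False)]
rate = reveal_rate(examples)
```

Conditioning on hint usage matters: an example where the model ignores the hint says nothing about whether its CoT is faithful.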
[NLP-10] Crosslingual Reasoning through Test-Time Scaling
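[Quick read]: This paper studies how well English-centric reasoning language models (RLMs) generalize to crosslingual reasoning, in particular whether English reasoning finetuning with long chains of thought (CoTs) improves mathematical reasoning in many languages, including low-resource ones. The key finding is that scaling up inference compute for English-centric RLMs generalizes to many languages. By controlling the language of the CoT, the authors also find that models reason better in high-resource languages, and they observe poor out-of-domain generalization (for example, from STEM to cultural commonsense knowledge).
Link: https://arxiv.org/abs/2505.05408
Authors: Zheng-Xin Yong, M. Farid Adilazuarda, Jonibek Mansurov, Ruochen Zhang, Niklas Muennighoff, Carsten Eickhoff, Genta Indra Winata, Julia Kreutzer, Stephen H. Bach, Alham Fikri Aji
Affiliations: Brown University; MBZUAI; Stanford University; University of Tübingen; Capital One; Cohere Labs
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: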
Abstract:Reasoning capabilities of large language models are primarily studied for English, even when pretrained models are multilingual. In this work, we investigate to what extent English reasoning finetuning with long chain-of-thoughts (CoTs) can generalize across languages. First, we find that scaling up inference compute for English-centric reasoning language models (RLMs) improves multilingual mathematical reasoning across many languages including low-resource languages, to an extent where they outperform models twice their size. Second, we reveal that while English-centric RLM’s CoTs are naturally predominantly English, they consistently follow a quote-and-think pattern to reason about quoted non-English inputs. Third, we discover an effective strategy to control the language of long CoT reasoning, and we observe that models reason better and more efficiently in high-resource languages. Finally, we observe poor out-of-domain reasoning generalization, in particular from STEM to cultural commonsense knowledge, even for English. Overall, we demonstrate the potentials, study the mechanisms and outline the limitations of crosslingual generalization of English reasoning test-time scaling. We conclude that practitioners should let English-centric RLMs reason in high-resource languages, while further work is needed to improve reasoning in low-resource languages and out-of-domain contexts.
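[NLP-11] Frame In Frame Out: Do LLMs Generate More Biased News Headlines than Humans?
[Quick read]: This paper addresses the concern that large language models (LLMs) may introduce or amplify framing bias in automated news and content generation. The key to the approach is analyzing news content generated by different model architectures to identify differences in framing. The authors stress the need for effective post-training mitigation strategies and tighter evaluation frameworks so that automated news content upholds the standards of balanced reporting.
Link: https://arxiv.org/abs/2505.05406
Authors: Valeria Pastorino, Nafise Sadat Moosavi
Affiliations: University of Sheffield
Categories: Computation and Language (cs.CL)
Comments: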
Abstract:Framing in media critically shapes public perception by selectively emphasizing some details while downplaying others. With the rise of large language models in automated news and content creation, there is growing concern that these systems may introduce or even amplify framing biases compared to human authors. In this paper, we explore how framing manifests in both out-of-the-box and fine-tuned LLM-generated news content. Our analysis reveals that, particularly in politically and socially sensitive contexts, LLMs tend to exhibit more pronounced framing than their human counterparts. In addition, we observe significant variation in framing tendencies across different model architectures, with some models displaying notably higher biases. These findings point to the need for effective post-training mitigation strategies and tighter evaluation frameworks to ensure that automated news content upholds the standards of balanced reporting.
[NLP-12] ICon: In-Context Contribution for Automatic Data Selection
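[Quick read]: This paper addresses data selection for instruction tuning, with the goal of improving large language model (LLM) performance and reducing training cost. Existing automated selection methods depend either on computationally expensive gradient-based measures or on manually designed heuristics, which may fail to exploit the intrinsic attributes of the data. The proposed solution is ICon (In-context Learning for Contribution Measurement). Its key idea is to use the implicit fine-tuning nature of in-context learning (ICL) to measure sample contribution without gradient computation or manual feature engineering, giving a more computationally efficient alternative and reducing the subjective bias inherent in heuristic methods.
Link: https://arxiv.org/abs/2505.05327
Authors: Yixin Yang, Qingxiu Dong, Linli Yao, Fangwei Zhu, Zhifang Sui
Affiliations: State Key Laboratory of Multimedia Information Processing, Peking University
Categories: Computation and Language (cs.CL)
Comments: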
Abstract:Data selection for instruction tuning is essential for improving the performance of Large Language Models (LLMs) and reducing training cost. However, existing automated selection methods either depend on computationally expensive gradient-based measures or manually designed heuristics, which may fail to fully exploit the intrinsic attributes of data. In this paper, we propose In-context Learning for Contribution Measurement (ICon), a novel gradient-free method that takes advantage of the implicit fine-tuning nature of in-context learning (ICL) to measure sample contribution without gradient computation or manual indicators engineering. ICon offers a computationally efficient alternative to gradient-based methods and reduces human inductive bias inherent in heuristic-based approaches. ICon comprises three components and identifies high-contribution data by assessing performance shifts under implicit learning through ICL. Extensive experiments on three LLMs across 12 benchmarks and 5 pairwise evaluation sets demonstrate the effectiveness of ICon. Remarkably, on LLaMA3.1-8B, models trained on 15% of ICon-selected data outperform full datasets by 5.42% points and exceed the best performance of widely used selection methods by 2.06% points. We further analyze high-contribution samples selected by ICon, which show both diverse tasks and appropriate difficulty levels, rather than just the hardest ones.
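The idea of scoring a sample by the performance shift it causes when placed in context can be sketched as below. This is only an illustration of that single idea, not the three-component ICon method: `evaluate` is a hypothetical callable that returns a model's score given a list of in-context demonstrations, and `toy_evaluate` is a made-up stand-in for a real model evaluation.

```python
from typing import Callable, Dict, List, Sequence


def contribution_scores(
    candidates: Sequence[str],
    evaluate: Callable[[Sequence[str]], float],
) -> Dict[str, float]:
    """Score each candidate by how much it shifts performance when used as a demonstration."""
    baseline = evaluate([])  # performance with no in-context demonstration
    return {sample: evaluate([sample]) - baseline for sample in candidates}


def select_top(scores: Dict[str, float], k: int) -> List[str]:
    """Return the k samples with the largest contribution."""
    return sorted(scores, key=scores.get, reverse=True)[:k]


def toy_evaluate(demos: Sequence[str]) -> float:
    """Made-up evaluator: demonstrations containing 'step' raise the score."""
    return 0.5 + 0.1 * sum("step" in demo for demo in demos)


candidates = ["solve it step by step", "hello there"]
scores = contribution_scores(candidates, toy_evaluate)
top = select_top(scores, k=1)
```

No gradients are computed anywhere in this sketch; each score needs only forward evaluations, which is the property the abstract contrasts with gradient-based selection.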
[NLP-13] Scalable Chain of Thoughts via Elastic Reasoning
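[Quick read]: This paper addresses the resource constraints that large reasoning models (LRMs) face in deployment because their output length is uncontrolled, especially under strict limits on tokens, latency, or compute. The key to the solution is Elastic Reasoning, a framework that explicitly splits reasoning into a thinking phase and a solution phase and allocates an independent budget to each, so that at test time the completeness of the solution segment is prioritized and reliability under tight resource constraints improves. A lightweight budget-constrained rollout strategy additionally teaches the model to reason adaptively when thinking is cut short and to generalize to unseen budget constraints without additional training.
Link: https://arxiv.org/abs/2505.05315
Authors: Yuhui Xu, Hanze Dong, Lei Wang, Doyen Sahoo, Junnan Li, Caiming Xiong
Affiliations: Salesforce AI Research
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: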
Abstract:Large reasoning models (LRMs) have achieved remarkable progress on complex tasks by generating extended chains of thought (CoT). However, their uncontrolled output lengths pose significant challenges for real-world deployment, where inference-time budgets on tokens, latency, or compute are strictly constrained. We propose Elastic Reasoning, a novel framework for scalable chain of thoughts that explicitly separates reasoning into two phases–thinking and solution–with independently allocated budgets. At test time, Elastic Reasoning prioritizes the completeness of solution segments, significantly improving reliability under tight resource constraints. To train models that are robust to truncated thinking, we introduce a lightweight budget-constrained rollout strategy, integrated into GRPO, which teaches the model to reason adaptively when the thinking process is cut short and generalizes effectively to unseen budget constraints without additional training. Empirical results on mathematical (AIME, MATH500) and programming (LiveCodeBench, Codeforces) benchmarks demonstrate that Elastic Reasoning performs robustly under strict budget constraints, while incurring significantly lower training cost than baseline methods. Remarkably, our approach also produces more concise and efficient reasoning even in unconstrained settings. Elastic Reasoning offers a principled and practical solution to the pressing challenge of controllable reasoning at scale.
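The two-budget decoding idea can be sketched as follows. This is a minimal illustration under stated assumptions: the two token streams stand in for model decoding, and the `</think>` delimiter and the function name `elastic_decode` are hypothetical, not taken from the paper.

```python
from typing import Iterable, List

END_THINK = "</think>"  # hypothetical marker that closes the thinking phase


def elastic_decode(
    think_stream: Iterable[str],
    solution_stream: Iterable[str],
    think_budget: int,
    solution_budget: int,
) -> List[str]:
    """Emit at most `think_budget` thinking tokens, then at most `solution_budget` solution tokens."""
    output: List[str] = []
    for i, token in enumerate(think_stream):
        if i >= think_budget or token == END_THINK:
            break  # thinking is cut short once its own budget is spent
        output.append(token)
    output.append(END_THINK)  # always close thinking so the solution phase can start
    for i, token in enumerate(solution_stream):
        if i >= solution_budget:
            break
        output.append(token)
    return output


result = elastic_decode(
    ["t1", "t2", "t3", "t4"], ["s1", "s2"], think_budget=2, solution_budget=5
)
```

Because the solution budget is separate, a long thinking phase cannot consume the tokens reserved for the answer, which is the reliability property the abstract emphasizes.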
[NLP-14] Toward Reasonable Parrots: Why Large Language Models Should Argue with Us by Design
[Quick read]: This paper addresses the inadequacy of current large language models (LLMs) for supporting and facilitating argumentative processes. It argues that existing LLMs do not serve the development of argumentative skills, and proposes an ideal technology design whose key is to re-frame LLMs as tools for exercising critical thinking rather than replacing it. The central concept is the 'reasonable parrot', which embodies the principles of relevance, responsibility, and freedom and interacts through argumentative dialogical moves. These principles come from long-standing work in argumentation theory and should be the starting point for LLM-based technology that incorporates basic principles of argumentation.
Link: https://arxiv.org/abs/2505.05298
Authors: Elena Musi, Nadin Kokciyan, Khalid Al-Khatib, Davide Ceolin, Emmanuelle Dietz, Klara Gutekunst, Annette Hautli-Janisz, Cristian Manuel Santibañez Yañez, Jodi Schneider, Jonas Scholz, Cor Steging, Jacky Visser, Henning Wachsmuth
Affiliations: U. of Liverpool; The U. of Edinburgh; U. of Groningen; Centrum Wiskunde & Informatica; Airbus; U. of Kassel; U. of Passau; Universidad de Concepción; U. of Illinois Urbana-Champaign; U. of Dundee; U. of Hannover
Categories: Computation and Language (cs.CL); Multiagent Systems (cs.MA)
Comments:
Abstract:In this position paper, we advocate for the development of conversational technology that is inherently designed to support and facilitate argumentative processes. We argue that, at present, large language models (LLMs) are inadequate for this purpose, and we propose an ideal technology design aimed at enhancing argumentative skills. This involves re-framing LLMs as tools to exercise our critical thinking rather than replacing them. We introduce the concept of ‘reasonable parrots’ that embody the fundamental principles of relevance, responsibility, and freedom, and that interact through argumentative dialogical moves. These principles and moves arise out of millennia of work in argumentation theory and should serve as the starting point for LLM-based technology that incorporates basic principles of argumentation.
[NLP-15] T-T: Table Transformer for Tagging-based Aspect Sentiment Triplet Extraction IJCAI2025
[Quick read]: This paper addresses relation modeling in Aspect Sentiment Triplet Extraction (ASTE), in particular how to capture relations between words in a sentence more effectively. The key to the solution is using Transformer layers directly as the downstream relation learning module to strengthen modeling of word-pair relations in the table. Because using a Transformer directly leads to overly long table sequences and unfair local attention interaction, the authors propose the Table-Transformer (T-T), which introduces a stripe attention mechanism with a loop-shift strategy: attention is restricted to a local window and different windows are allowed to interact, achieving state-of-the-art performance at lower computational cost.
Link: https://arxiv.org/abs/2505.05271
Authors: Kun Peng, Chaodong Tong, Cong Cao, Hao Peng, Qian Li, Guanlin Wu, Lei Jiang, Yanbing Liu, Philip S. Yu
Affiliations: Institute of Information Engineering, Chinese Academy of Sciences; School of Cyber Security, University of Chinese Academy of Sciences; School of Cyber Science and Technology, Beihang University; School of Computer Science, Beijing University of Posts and Telecommunications; College of Systems Engineering, National University of Defense Technology; Department of Computer Science, University of Illinois at Chicago
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Accepted by IJCAI2025
Abstract:Aspect sentiment triplet extraction (ASTE) aims to extract triplets composed of aspect terms, opinion terms, and sentiment polarities from given sentences. The table tagging method is a popular approach to addressing this task, which encodes a sentence into a 2-dimensional table, allowing for the tagging of relations between any two words. Previous efforts have focused on designing various downstream relation learning modules to better capture interactions between tokens in the table, revealing that a stronger capability to capture relations can lead to greater improvements in the model. Motivated by this, we attempt to directly utilize transformer layers as downstream relation learning modules. Due to the powerful semantic modeling capability of transformers, it is foreseeable that this will lead to excellent improvement. However, owing to the quadratic relation between the length of the table and the length of the input sentence sequence, using transformers directly faces two challenges: overly long table sequences and unfair local attention interaction. To address these challenges, we propose a novel Table-Transformer (T-T) for the tagging-based ASTE method. Specifically, we introduce a stripe attention mechanism with a loop-shift strategy to tackle these challenges. The former modifies the global attention mechanism to only attend to a 2-dimensional local attention window, while the latter facilitates interaction between different attention windows. Extensive and comprehensive experiments demonstrate that the T-T, as a downstream relation learning module, achieves state-of-the-art performance with lower computational costs.
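[NLP-16] QualBench: Benchmarking Chinese LLMs with Localized Professional Qualifications for Vertical Domain Evaluation
Quick read: This paper addresses shortcomings in how Chinese large language models (LLMs) are evaluated in vertical domains, in particular the limited domain coverage of existing benchmarks and their limited insight into the Chinese working context. The key to the solution is QualBench, the first multi-domain QA benchmark for Chinese LLMs. It uses qualification exams as a unified framework for evaluating human expertise and contains over 17,000 questions across six vertical domains, with data selection grounded in 24 Chinese qualifications to align closely with national policies and industry standards.
Link: https://arxiv.org/abs/2505.05225
Authors: Mengze Hong, Wailing Ng, Di Jiang, Chen Jason Zhang
Affiliations: Hong Kong Polytechnic University; WeBank Co., Ltd
Categories: Computation and Language (cs.CL)
Comments: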
Abstract:The rapid advancement of Chinese large language models (LLMs) underscores the need for domain-specific evaluations to ensure reliable applications. However, existing benchmarks often lack coverage in vertical domains and offer limited insights into the Chinese working context. Leveraging qualification exams as a unified framework for human expertise evaluation, we introduce QualBench, the first multi-domain Chinese QA benchmark dedicated to localized assessment of Chinese LLMs. The dataset includes over 17,000 questions across six vertical domains, with data selections grounded in 24 Chinese qualifications to closely align with national policies and working standards. Through comprehensive evaluation, the Qwen2.5 model outperformed the more advanced GPT-4o, with Chinese LLMs consistently surpassing non-Chinese models, highlighting the importance of localized domain knowledge in meeting qualification requirements. The best performance of 75.26% reveals the current gaps in domain coverage within model capabilities. Furthermore, we present the failure of LLM collaboration with crowdsourcing mechanisms and suggest the opportunities for multi-domain RAG knowledge enhancement and vertical domain LLM training with Federated Learning.
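[NLP-17] Revealing Weaknesses in Text Watermarking Through Self-Information Rewrite Attacks ICML2025
Quick read: This paper addresses the insufficient robustness of current text watermarking algorithms under adversarial attack, in particular the vulnerability of methods that embed watermarks in high-entropy tokens. The key to the solution is a generic, efficient paraphrasing attack, the Self-Information Rewrite Attack (SIRA), which computes the self-information of each token to identify potential pattern tokens and attack them selectively, breaking watermark detection. Experiments show that SIRA achieves nearly 100% attack success on seven recent watermarking methods at low cost, requires no access to the watermarking algorithm or the watermarked LLM, and transfers well across attack models.
Link: https://arxiv.org/abs/2505.05190
Authors: Yixin Cheng, Hongcheng Guo, Yangming Li, Leonid Sigal
Affiliations: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Cryptography and Security (cs.CR)
Comments: ICML 2025 Accepted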
Abstract:Text watermarking aims to subtly embed statistical signals into text by controlling the Large Language Model (LLM)'s sampling process, enabling watermark detectors to verify that the output was generated by the specified model. The robustness of these watermarking algorithms has become a key factor in evaluating their effectiveness. Current text watermarking algorithms embed watermarks in high-entropy tokens to ensure text quality. In this paper, we reveal that this seemingly benign design can be exploited by attackers, posing a significant risk to the robustness of the watermark. We introduce a generic efficient paraphrasing attack, the Self-Information Rewrite Attack (SIRA), which leverages the vulnerability by calculating the self-information of each token to identify potential pattern tokens and perform targeted attack. Our work exposes a widely prevalent vulnerability in current watermarking algorithms. The experimental results show SIRA achieves nearly 100% attack success rates on seven recent watermarking methods with only 0.88 USD per million tokens cost. Our approach does not require any access to the watermark algorithms or the watermarked LLM and can seamlessly transfer to any LLM as the attack model, even mobile-level models. Our findings highlight the urgent need for more robust watermarking.
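The self-information step mentioned in the abstract can be illustrated with a minimal sketch. It assumes per-token probabilities are already available from some language model; the function names, the fixed threshold, and the flagging rule are hypothetical and are not taken from the paper, which may select and rewrite tokens differently.

```python
import math


def self_information(probs):
    """Return the self-information -log2(p), in bits, of each token probability."""
    return [-math.log2(p) for p in probs]


def flag_high_information(tokens, probs, threshold_bits):
    """Return tokens whose self-information exceeds threshold_bits.

    The fixed threshold is an assumption made for illustration only.
    """
    bits = self_information(probs)
    return [tok for tok, b in zip(tokens, bits) if b > threshold_bits]


if __name__ == "__main__":
    tokens = ["the", "cat", "sat"]
    probs = [0.5, 0.125, 0.25]  # hypothetical model probabilities
    print(flag_high_information(tokens, probs, threshold_bits=1.5))
```

Tokens with low model probability carry more bits, so they are the ones such a filter would hand to a rewriting step.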
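[NLP-18] A Benchmark Dataset and a Framework for Urdu Multimodal Named Entity Recognition
Quick read: This paper addresses the lack of research on Multimodal Named Entity Recognition (MNER) for low-resource languages such as Urdu, where the main challenges are the scarcity of annotated multimodal datasets and the absence of standardized baselines. The key to the solution is the U-MNER framework together with the Twitter2015-Urdu dataset, which is adapted from the widely used Twitter2015 dataset and annotated according to Urdu grammar rules. U-MNER combines textual and visual information, using Urdu-BERT for text embeddings, ResNet for visual features, and a Cross-Modal Fusion Module to align and fuse the two. It achieves state-of-the-art performance on this dataset and lays groundwork for MNER research in low-resource languages.
Link: https://arxiv.org/abs/2505.05148
Authors: Hussain Ahmad, Qingyang Zeng, Jing Wan
Affiliations: Beijing University of Chemical Technology
Categories: Computation and Language (cs.CL)
Comments: 16 pages, 5 figures. Preprint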
Abstract:The emergence of multimodal content, particularly text and images on social media, has positioned Multimodal Named Entity Recognition (MNER) as an increasingly important area of research within Natural Language Processing. Despite progress in high-resource languages such as English, MNER remains underexplored for low-resource languages like Urdu. The primary challenges include the scarcity of annotated multimodal datasets and the lack of standardized baselines. To address these challenges, we introduce the U-MNER framework and release the Twitter2015-Urdu dataset, a pioneering resource for Urdu MNER. Adapted from the widely used Twitter2015 dataset, it is annotated with Urdu-specific grammar rules. We establish benchmark baselines by evaluating both text-based and multimodal models on this dataset, providing comparative analyses to support future research on Urdu MNER. The U-MNER framework integrates textual and visual context using Urdu-BERT for text embeddings and ResNet for visual feature extraction, with a Cross-Modal Fusion Module to align and fuse information. Our model achieves state-of-the-art performance on the Twitter2015-Urdu dataset, laying the groundwork for further MNER research in low-resource languages.
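[NLP-19] Understanding In-context Learning of Addition via Activation Subspaces
Quick read: This paper asks how modern transformer models implement in-context learning in the forward pass: extracting signals from a few examples, aggregating them into a prediction rule, and applying that rule to new examples. The key to the solution is a novel optimization approach that localizes the few-shot ability of Llama-3-8B to just three attention heads. The signals these heads extract lie in a six-dimensional subspace, with four dimensions tracking the unit digit and two tracking overall magnitude. The paper also identifies a self-correction mechanism in which later examples suppress mistakes from earlier ones.
Link: https://arxiv.org/abs/2505.05145
Authors: Xinyan Hu, Kayo Yin, Michael I. Jordan, Jacob Steinhardt, Lijie Chen
Affiliations: UC Berkeley
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 16 pages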
Abstract:To perform in-context learning, language models must extract signals from individual few-shot examples, aggregate these into a learned prediction rule, and then apply this rule to new examples. How is this implemented in the forward pass of modern transformer models? To study this, we consider a structured family of few-shot learning tasks for which the true prediction rule is to add an integer k to the input. We find that Llama-3-8B attains high accuracy on this task for a range of k, and localize its few-shot ability to just three attention heads via a novel optimization approach. We further show the extracted signals lie in a six-dimensional subspace, where four of the dimensions track the unit digit and the other two dimensions track overall magnitude. We finally examine how these heads extract information from individual few-shot examples, identifying a self-correction mechanism in which mistakes from earlier examples are suppressed by later examples. Our results demonstrate how tracking low-dimensional subspaces across a forward pass can provide insight into fine-grained computational structures.
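The add-k task family described in the abstract can be made concrete with a small sketch that builds one few-shot prompt. The `x -> y` line format, the input range 0 to 99, and the function name are assumptions for illustration; the paper's actual prompt template is not given here.

```python
import random


def make_add_k_prompt(k, n_shots, query, seed=0):
    """Build a few-shot prompt whose hidden rule is y = x + k.

    Returns the prompt text and the expected answer for the query.
    The "x -> y" format is an assumption, not the paper's template.
    """
    rng = random.Random(seed)
    inputs = [rng.randint(0, 99) for _ in range(n_shots)]
    lines = [f"{x} -> {x + k}" for x in inputs]
    lines.append(f"{query} ->")  # the model must complete this line
    return "\n".join(lines), query + k


if __name__ == "__main__":
    prompt, answer = make_add_k_prompt(k=7, n_shots=3, query=10)
    print(prompt)
    print("expected:", answer)
```

The model never sees k directly; it has to infer it from the demonstrations, which is the behaviour the paper traces back to specific attention heads.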
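[NLP-20] Unveiling Language-Specific Features in Large Language Models via Sparse Autoencoders
Quick read: This paper addresses the mechanistic understanding and controllability of multilingual capabilities in Large Language Models (LLMs). Existing neuron-based or internal-activation-based methods face challenges such as superposition and layer-wise activation variance, which make their results unreliable. The key to the solution is to decompose LLM activations with Sparse Autoencoders (SAEs), obtain language-specific features, and introduce a new metric that verifies how strongly these features relate to specific languages. Further experiments show that ablating these language-specific SAE features significantly degrades performance in one language while leaving other languages almost unaffected, and that the features can be used to better control the language an LLM generates.
Link: https://arxiv.org/abs/2505.05111
Authors: Boyi Deng, Yu Wan, Yidan Zhang, Baosong Yang, Fuli Feng
Affiliations: Tongyi Lab, Alibaba Group Inc; Institute of Dataspace, Hefei, Anhui, China
Categories: Computation and Language (cs.CL)
Comments: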
Abstract:The mechanisms behind multilingual capabilities in Large Language Models (LLMs) have been examined using neuron-based or internal-activation-based methods. However, these methods often face challenges such as superposition and layer-wise activation variance, which limit their reliability. Sparse Autoencoders (SAEs) offer a more nuanced analysis by decomposing the activations of LLMs into a sparse linear combination of SAE features. We introduce a novel metric to assess the monolinguality of features obtained from SAEs, discovering that some features are strongly related to specific languages. Additionally, we show that ablating these SAE features significantly reduces LLM abilities in only one language, leaving others almost unaffected. Interestingly, we find some languages have multiple synergistic SAE features, and ablating them together yields greater improvement than ablating them individually. Moreover, we leverage these SAE-derived language-specific features to enhance steering vectors, achieving control over the language generated by LLMs.
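The idea of a sparse linear combination of SAE features, and of ablating one of them, can be sketched in a few lines. This is a toy illustration in plain Python: the decoder vectors, feature activations, and the zeroing-out form of ablation are assumptions, and the paper may ablate features differently.

```python
def reconstruct(feature_acts, decoder, bias):
    """Reconstruct an activation as bias + sum_i f_i * d_i.

    feature_acts: list of feature activations f_i (mostly zero when sparse).
    decoder: list of decoder vectors d_i, one per feature.
    """
    dim = len(bias)
    return [
        bias[k] + sum(f * d[k] for f, d in zip(feature_acts, decoder))
        for k in range(dim)
    ]


def ablate(feature_acts, indices):
    """Zero out the selected features (an assumed, simple form of ablation)."""
    return [0.0 if i in indices else f for i, f in enumerate(feature_acts)]


if __name__ == "__main__":
    decoder = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # hypothetical decoder
    bias = [0.0, 0.0]
    acts = [2.0, 0.0, 3.0]  # hypothetical sparse feature activations
    print(reconstruct(acts, decoder, bias))
    print(reconstruct(ablate(acts, {2}), decoder, bias))
```

Removing a feature changes the reconstructed activation only along that feature's decoder direction, which is why a language-specific feature can be removed with little effect on others.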
[NLP-21] X-Driver: Explainable Autonomous Driving with Vision-Language Models
Quick read: This paper addresses the low success rate of end-to-end autonomous driving in closed-loop evaluation, which exposes the limits of existing frameworks for real-world deployment. The key to the solution is X-Driver, a unified framework built on multi-modal large language models (MLLMs) that uses Chain-of-Thought (CoT) and autoregressive modeling to improve perception and decision-making.
Link: https://arxiv.org/abs/2505.05098
Authors: Wei Liu, Jiyuan Zhang, Binxiong Zheng, Yufeng Hu, Yingzhan Lin, Zengfeng Zeng
Affiliations: unknown
Categories: Robotics (cs.RO); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Emerging Technologies (cs.ET)
Comments:
Abstract:End-to-end autonomous driving has advanced significantly, offering benefits such as system simplicity and stronger driving performance in both open-loop and closed-loop settings than conventional pipelines. However, existing frameworks still suffer from low success rates in closed-loop evaluations, highlighting their limitations in real-world deployment. In this paper, we introduce X-Driver, a unified multi-modal large language model (MLLM) framework designed for closed-loop autonomous driving, leveraging Chain-of-Thought (CoT) and autoregressive modeling to enhance perception and decision-making. We validate X-Driver across multiple autonomous driving tasks using public benchmarks in the CARLA simulation environment, including Bench2Drive [6]. Our experimental results demonstrate superior closed-loop performance, surpassing the current state-of-the-art (SOTA) while improving the interpretability of driving decisions. These findings underscore the importance of structured reasoning in end-to-end driving and establish X-Driver as a strong baseline for future research in closed-loop autonomous driving.
[NLP-22] Reliably Bounding False Positives: A Zero-Shot Machine-Generated Text Detection Framework via Multiscaled Conformal Prediction
Quick read: This paper addresses the problem that existing detection methods for machine-generated text focus too heavily on detection accuracy and neglect the societal risk of high false positive rates (FPRs). The key to the solution is Multiscaled Conformal Prediction (MCP), which constrains the upper bound of the FPR while improving detection performance and robustness to adversarial attacks. The paper also introduces the RealDet dataset for more realistic calibration and better detection.
Link: https://arxiv.org/abs/2505.05084
Authors: Xiaowei Zhu, Yubing Ren, Yanan Cao, Xixun Lin, Fang Fang, Yangxi Li
Affiliations: Institute of Information Engineering, Chinese Academy of Sciences; School of Cyber Security, University of Chinese Academy of Sciences; National Computer Network Emergency Response Technical Team, Coordination Center of China
Categories: Computation and Language (cs.CL)
Comments:
Abstract:The rapid advancement of large language models has raised significant concerns regarding their potential misuse by malicious actors. As a result, developing effective detectors to mitigate these risks has become a critical priority. However, most existing detection methods focus excessively on detection accuracy, often neglecting the societal risks posed by high false positive rates (FPRs). This paper addresses this issue by leveraging Conformal Prediction (CP), which effectively constrains the upper bound of FPRs. While directly applying CP constrains FPRs, it also leads to a significant reduction in detection performance. To overcome this trade-off, this paper proposes a Zero-Shot Machine-Generated Text Detection Framework via Multiscaled Conformal Prediction (MCP), which both enforces the FPR constraint and improves detection performance. This paper also introduces RealDet, a high-quality dataset that spans a wide range of domains, ensuring realistic calibration and enabling superior detection performance when combined with MCP. Empirical evaluations demonstrate that MCP effectively constrains FPRs, significantly enhances detection performance, and increases robustness against adversarial attacks across multiple detectors and datasets.
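How conformal prediction can bound the false positive rate is easiest to see in the plain split-conformal setting, sketched below. This is the standard single-threshold construction, not the paper's multiscaled variant; the detector scores and function names are hypothetical. It assumes higher scores mean "more likely machine-generated" and that calibration scores come from human-written text exchangeable with future human-written text.

```python
import math


def conformal_threshold(human_scores, alpha):
    """Pick a threshold so that at most a fraction alpha of human-written
    texts is flagged, marginally, under exchangeability.

    Uses the ceil((n + 1) * (1 - alpha))-th smallest calibration score.
    """
    n = len(human_scores)
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        return math.inf  # too few calibration points: never flag
    return sorted(human_scores)[k - 1]


def is_machine_generated(score, threshold):
    """Flag a text only when its detector score is strictly above the threshold."""
    return score > threshold


if __name__ == "__main__":
    calibration = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7]  # hypothetical scores
    t = conformal_threshold(calibration, alpha=0.25)
    print(t, is_machine_generated(0.65, t))
```

A tighter alpha raises the threshold, so fewer machine-generated texts are caught. That loss in detection performance is the trade-off the abstract says MCP is designed to overcome.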
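[NLP-23] Performance Evaluation of Large Language Models in Bangla Consumer Health Query Summarization
Quick read: This paper addresses the difficulty of responding efficiently to Consumer Health Queries (CHQs) in low-resource languages such as Bengali (Bangla), where queries often contain extraneous detail. The key to the solution is evaluating nine advanced large language models (LLMs) on zero-shot summarization, to test whether they can produce high-quality query summaries without task-specific training. Using the BanglaCHQ-Summ dataset and ROUGE metrics, with the fine-tuned Bangla T5 model as a reference point, the study finds that some LLMs rival the fine-tuned model, which shows the potential of LLMs for medical information processing in low-resource languages.
Link: https://arxiv.org/abs/2505.05070
Authors: Ajwad Abrar, Farzana Tabassum, Sabbir Ahmed
Affiliations: unknown
Categories: Computation and Language (cs.CL)
Comments: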
Abstract:Consumer Health Queries (CHQs) in Bengali (Bangla), a low-resource language, often contain extraneous details, complicating efficient medical responses. This study investigates the zero-shot performance of nine advanced large language models (LLMs): GPT-3.5-Turbo, GPT-4, Claude-3.5-Sonnet, Llama3-70b-Instruct, Mixtral-8x22b-Instruct, Gemini-1.5-Pro, Qwen2-72b-Instruct, Gemma-2-27b, and Athene-70B, in summarizing Bangla CHQs. Using the BanglaCHQ-Summ dataset comprising 2,350 annotated query-summary pairs, we benchmarked these LLMs using ROUGE metrics against Bangla T5, a fine-tuned state-of-the-art model. Mixtral-8x22b-Instruct emerged as the top performing model in ROUGE-1 and ROUGE-L, while Bangla T5 excelled in ROUGE-2. The results demonstrate that zero-shot LLMs can rival fine-tuned models, achieving high-quality summaries even without task-specific training. This work underscores the potential of LLMs in addressing challenges in low-resource languages, providing scalable solutions for healthcare query summarization.
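For readers unfamiliar with the metric, ROUGE-1 is unigram overlap between a candidate summary and a reference. The sketch below computes its F1 form with whitespace tokenization, which is an assumption: the study's actual ROUGE implementation and Bangla tokenizer are not specified here.

```python
from collections import Counter


def rouge1_f1(reference, candidate):
    """Unigram-overlap F1 between two whitespace-tokenized strings."""
    ref_counts = Counter(reference.split())
    cand_counts = Counter(candidate.split())
    # Clipped overlap: each word counts at most as often as in the reference.
    overlap = sum((ref_counts & cand_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    print(rouge1_f1("a b c d", "a b e"))
```

ROUGE-2 applies the same idea to bigrams, and ROUGE-L uses the longest common subsequence instead of n-gram counts.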
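[NLP-24] CodeMixBench: Evaluating Large Language Models on Code Generation with Code-Mixed Prompts
Quick read: This paper addresses the fact that existing code generation benchmarks (such as HumanEval, MBPP, and BigCodeBench) mainly evaluate English-only prompts and overlook the code-mixed language that multilingual developers often use when interacting with large language models (LLMs). The key to the solution is CodeMixBench, a new benchmark that introduces controlled code-mixing (CMD) into the natural-language parts of prompts across three language pairs (Hinglish, Spanish-English, and Chinese Pinyin-English) to evaluate the robustness of LLMs on code generation.
Link: https://arxiv.org/abs/2505.05063
Authors: Manik Sheokand, Parth Sawant
Affiliations: Chandigarh University; New York University
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments: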
Abstract:Large Language Models (LLMs) have achieved remarkable success in code generation tasks, powering various applications like code completion, debugging, and programming assistance. However, existing benchmarks such as HumanEval, MBPP, and BigCodeBench primarily evaluate LLMs on English-only prompts, overlooking the real-world scenario where multilingual developers often use code-mixed language while interacting with LLMs. To address this gap, we introduce CodeMixBench, a novel benchmark designed to evaluate the robustness of LLMs on code generation from code-mixed prompts. Built upon BigCodeBench, CodeMixBench introduces controlled code-mixing (CMD) into the natural language parts of prompts across three language pairs: Hinglish (Hindi-English), Spanish-English, and Chinese Pinyin-English. We comprehensively evaluate a diverse set of open-source code generation models ranging from 1.5B to 15B parameters. Our results show that code-mixed prompts consistently degrade Pass@1 performance compared to their English-only counterparts, with performance drops increasing under higher CMD levels for smaller models. CodeMixBench provides a realistic evaluation framework for studying multilingual code generation and highlights new challenges and directions for building robust code generation models that generalize well across diverse linguistic settings.
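The Pass@1 figure reported above is a special case of pass@k. A commonly used unbiased estimator is sketched below: given n generated samples per problem, of which c pass the tests, it computes 1 - C(n-c, k) / C(n, k). The abstract does not say how the authors estimated Pass@1, so treating it this way is an assumption.

```python
from math import comb


def pass_at_k(n, c, k):
    """Estimate the probability that at least one of k samples passes,
    given n generated samples of which c passed.
    """
    if n - c < k:
        return 1.0  # every size-k subset contains at least one passing sample
    return 1.0 - comb(n - c, k) / comb(n, k)


if __name__ == "__main__":
    # With k = 1 the estimator reduces to the pass fraction c / n.
    print(pass_at_k(n=10, c=3, k=1))
```

A benchmark score is then the mean of this value over all problems, computed separately for English-only and code-mixed prompts.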
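[NLP-25] Teochew-Wild: The First In-the-wild Teochew Dataset with Orthographic Annotations
Quick read: This paper addresses the lack of high-quality annotated corpora for speech tasks, such as automatic speech recognition and text-to-speech, in the low-resource Teochew dialect. The key to the solution is the Teochew-Wild corpus, which contains 18.9 hours of in-the-wild Teochew speech with precise orthographic and pinyin annotations, accompanied by text processing tools and resources to support research and applications.
Link: https://arxiv.org/abs/2505.05056
Authors: Linrong Pan, Chenglong Jiang, Gaoze Hou, Ying Gao
Affiliations: South China University of Technology; Guangzhou No.6 Middle School
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: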
Abstract:This paper reports the construction of the Teochew-Wild, a speech corpus of the Teochew dialect. The corpus includes 18.9 hours of in-the-wild Teochew speech data from multiple speakers, covering both formal and colloquial expressions, with precise orthographic and pinyin annotations. Additionally, we provide supplementary text processing tools and resources to propel research and applications in speech tasks for this low-resource language, such as automatic speech recognition (ASR) and text-to-speech (TTS). To the best of our knowledge, this is the first publicly available Teochew dataset with accurate orthographic annotations. We conduct experiments on the corpus, and the results validate its effectiveness in ASR and TTS tasks.
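[NLP-26] Image-Text Relation Prediction for Multilingual Tweets
Quick read: This paper examines how well multilingual vision-language models predict the relation between image and text in different languages, motivated by social media posts where that relation is often unclear. The key to the solution is a dedicated balanced benchmark dataset built from Latvian tweets paired with their manual English translations. It is used to evaluate models in a multilingual setting and, by comparing recently released vision-language model checkpoints, to show that they are becoming increasingly capable at this task.
Link: https://arxiv.org/abs/2505.05040
Authors: Matīss Rikters, Edison Marrese-Taylor
Affiliations: National Institute of Advanced Industrial Science and Technology
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: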
Abstract:Various social networks have been allowing media uploads for over a decade now. Still, it has not always been clear what is their relation with the posted text or even if there is any at all. In this work, we explore how multilingual vision-language models tackle the task of image-text relation prediction in different languages, and construct a dedicated balanced benchmark data set from Twitter posts in Latvian along with their manual translations into English. We compare our results to previous work and show that the more recently released vision-language model checkpoints are becoming increasingly capable at this task, but there is still much room for further improvement.
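[NLP-27] G-FOCUS: Towards a Robust Method for Assessing UI Design Persuasiveness
Quick read: This paper addresses two issues: traditional A/B testing is costly and time-consuming for assessing the persuasiveness of user interface (UI) designs, and existing UI analysis methods based on Vision-Language Models (VLMs) focus on isolated design attributes rather than comparative persuasiveness. The key to the solution is G-FOCUS, a novel inference-time reasoning strategy that improves VLM-based pairwise assessment of UI design persuasiveness by reducing position bias and improving evaluation accuracy.
Link: https://arxiv.org/abs/2505.05026
Authors: Jaehyun Jeon, Janghan Yoon, Minsoo Kim, Sumin Shim, Yejin Choi, Hanbin Kim, Youngjae Yu
Affiliations: unknown
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: 31 pages, 17 figures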
Abstract:Evaluating user interface (UI) design effectiveness extends beyond aesthetics to influencing user behavior, a principle central to Design Persuasiveness. A/B testing is the predominant method for determining which UI variations drive higher user engagement, but it is costly and time-consuming. While recent Vision-Language Models (VLMs) can process automated UI analysis, current approaches focus on isolated design attributes rather than comparative persuasiveness, the key factor in optimizing user interactions. To address this, we introduce WiserUI-Bench, a benchmark designed for the Pairwise UI Design Persuasiveness Assessment task, featuring 300 real-world UI image pairs labeled with A/B test results and expert rationales. Additionally, we propose G-FOCUS, a novel inference-time reasoning strategy that enhances VLM-based persuasiveness assessment by reducing position bias and improving evaluation accuracy. Experimental results show that G-FOCUS surpasses existing inference strategies in consistency and accuracy for pairwise UI evaluation. Through promoting VLM-driven evaluation of UI persuasiveness, our work offers an approach to complement A/B testing, propelling progress in scalable UI preference modeling and design optimization. Code and data will be released publicly.
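[NLP-28] Scalable Multi-Stage Influence Function for Large Language Models via Eigenvalue-Corrected Kronecker-Factored Parameterization IJCAI2025
Quick read: This paper addresses how to attribute the downstream predictions of fine-tuned Large Language Models (LLMs) to their pre-training data under the full-parameter fine-tuning paradigm. Existing methods cannot compute "multi-stage" influence and do not scale to billion-parameter LLMs, so they cannot effectively explain the predictions of fine-tuned models. The proposed solution is a multi-stage influence function; its key is the Eigenvalue-corrected Kronecker-Factored (EK-FAC) parameterization, which gives an efficient approximation and improves scalability and practicality.
Link: https://arxiv.org/abs/2505.05017
Authors: Yuntai Bao, Xuhong Zhang, Tianyu Du, Xinkui Zhao, Jiang Zong, Hao Peng, Jianwei Yin
Affiliations: Zhejiang University; Universal Identification Technology (Hangzhou) Co.,Ltd.; Zhejiang Normal University
Categories: Computation and Language (cs.CL)
Comments: 9 pages, accepted by IJCAI 2025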
Abstract:Pre-trained large language models (LLMs) are commonly fine-tuned to adapt to downstream tasks. Since the majority of knowledge is acquired during pre-training, attributing the predictions of fine-tuned LLMs to their pre-training data may provide valuable insights. Influence functions have been proposed as a means to explain model predictions based on training data. However, existing approaches fail to compute "multi-stage" influence and lack scalability to billion-scale LLMs. In this paper, we propose the multi-stage influence function to attribute the downstream predictions of fine-tuned LLMs to pre-training data under the full-parameter fine-tuning paradigm. To enhance the efficiency and practicality of our multi-stage influence function, we leverage Eigenvalue-corrected Kronecker-Factored (EK-FAC) parameterization for efficient approximation. Empirical results validate the superior scalability of EK-FAC approximation and the effectiveness of our multi-stage influence function. Additionally, case studies on a real-world LLM, dolly-v2-3b, demonstrate its interpretive power, with exemplars illustrating insights provided by multi-stage influence estimates. Our code is public at this https URL.
[NLP-29] The Pitfalls of Growing Group Complexity: LLMs and Social Choice-Based Aggregation for Group Recommendations
Quick read: This paper examines whether language models can correctly execute social choice-based aggregation strategies in Group Recommender Systems (GRS), particularly under zero-shot conditions, and how prompt formatting affects accuracy. The key to the solution is evaluating the effect of group complexity (number of users and items), different language models, different prompting conditions (including in-context learning or generating explanations), and the formatting of group preferences. The results show that In-Context Learning (ICL) significantly improves performance at higher group complexity, while other prompt modifications have little effect.
Link: https://arxiv.org/abs/2505.05016
Authors: Cedric Waterschoot, Nava Tintarev, Francesco Barile
Affiliations: Maastricht University
Categories: Computation and Language (cs.CL); Information Retrieval (cs.IR)
Comments: To be published in: Adjunct Proceedings of the 33rd ACM Conference on User Modeling, Adaptation and Personalization (UMAP Adjunct '25), June 16–19, 2025, New York City, NY, USA. Accepted at the 4th Workshop on Group Modeling, Adaptation and Personalization (GMAP), co-located with UMAP 2025
Abstract:Large Language Models (LLMs) are increasingly applied in recommender systems aimed at both individuals and groups. Previously, Group Recommender Systems (GRS) often used social choice-based aggregation strategies to derive a single recommendation based on the preferences of multiple people. In this paper, we investigate under which conditions language models can perform these strategies correctly based on zero-shot learning and analyse whether the formatting of the group scenario in the prompt affects accuracy. We specifically focused on the impact of group complexity (number of users and items), different LLMs, different prompting conditions, including In-Context learning or generating explanations, and the formatting of group preferences. Our results show that performance starts to deteriorate when considering more than 100 ratings. However, not all language models were equally sensitive to growing group complexity. Additionally, we showed that In-Context Learning (ICL) can significantly increase the performance at higher degrees of group complexity, while adding other prompt modifications, specifying domain cues or prompting for explanations, did not impact accuracy. We conclude that future research should include group complexity as a factor in GRS evaluation due to its effect on LLM performance. Furthermore, we showed that formatting the group scenarios differently, such as rating lists per user or per item, affected accuracy. All in all, our study implies that smaller LLMs are capable of generating group recommendations under the right conditions, making the case for using smaller models that require less computing power and costs.
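To make "social choice-based aggregation" concrete, the sketch below computes the reference answer an LLM would be expected to reproduce. The abstract does not list which strategies were tested; additive, least misery, and most pleasure are used here as common examples from the group recommendation literature, and the rating data and alphabetical tie-breaking rule are hypothetical.

```python
def group_scores(ratings, strategy="additive"):
    """Aggregate per-user ratings into one score per item.

    ratings: dict mapping user -> dict mapping item -> rating.
    """
    reducers = {"additive": sum, "least_misery": min, "most_pleasure": max}
    reduce_fn = reducers[strategy]
    items = next(iter(ratings.values())).keys()
    return {
        item: reduce_fn(user_ratings[item] for user_ratings in ratings.values())
        for item in items
    }


def recommend(ratings, strategy="additive"):
    """Return the top-scoring item; ties are broken alphabetically (an assumption)."""
    scores = group_scores(ratings, strategy)
    return max(sorted(scores), key=scores.get)


if __name__ == "__main__":
    ratings = {
        "ann": {"A": 5, "B": 3, "C": 4},
        "bob": {"A": 1, "B": 3, "C": 4},
        "cat": {"A": 5, "B": 4, "C": 3},
    }
    for name in ("additive", "least_misery", "most_pleasure"):
        print(name, recommend(ratings, name))
```

Group complexity in the paper's sense is the size of this table, users times items, which the abstract reports starts to hurt LLM accuracy beyond about 100 ratings.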
[NLP-30] Rethinking Invariance in In-context Learning
Quick read: This paper addresses the sensitivity of In-Context Learning (ICL) in auto-regressive large language models to the order of context examples, even though the examples are mutually independent. The key to the solution is Invariant ICL (InvICL), an order-invariant ICL method designed to achieve both information non-leakage and context interdependence, which makes the model robust to input order while preserving performance.
Link: https://arxiv.org/abs/2505.04994
Authors: Lizhe Fang, Yifei Wang, Khashayar Gatmiry, Lei Fang, Yisen Wang
Affiliations: Peking University; MIT CSAIL
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:
Abstract:In-Context Learning (ICL) has emerged as a pivotal capability of auto-regressive large language models, yet it is hindered by a notable sensitivity to the ordering of context examples regardless of their mutual independence. To address this issue, recent studies have introduced several variant algorithms of ICL that achieve permutation invariance. However, many of these do not exhibit comparable performance with the standard auto-regressive ICL algorithm. In this work, we identify two crucial elements in the design of an invariant ICL algorithm: information non-leakage and context interdependence, which are not simultaneously achieved by any of the existing methods. These investigations lead us to the proposed Invariant ICL (InvICL), a methodology designed to achieve invariance in ICL while ensuring the two properties. Empirically, our findings reveal that InvICL surpasses previous models, both invariant and non-invariant, in most benchmark datasets, showcasing superior generalization capabilities across varying input lengths. Code is available at this https URL.
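[NLP-31] Latent Preference Coding: Aligning Large Language Models via Discrete Latent Codes
Quick read: This paper addresses the alignment of Large Language Model (LLM) generations with human preferences. Existing preference modeling methods often rely on an explicit or implicit reward function and overlook how complex and multifaceted human preferences are across tasks and populations. The key to the solution is Latent Preference Coding (LPC), which uses discrete latent codes to model the implicit factors behind holistic preferences and their combinations, inferring the factors and their importance from data without pre-defined reward functions or hand-crafted combination weights.
Link: https://arxiv.org/abs/2505.04993
Authors: Zhuocheng Gong, Jian Guan, Wei Wu, Huishuai Zhang, Dongyan Zhao
Affiliations: Cranberry-Lemon University; University of the Witwatersrand
Categories: Computation and Language (cs.CL)
Comments: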
Abstract:Large language models (LLMs) have achieved remarkable success, yet aligning their generations with human preferences remains a critical challenge. Existing approaches to preference modeling often rely on an explicit or implicit reward function, overlooking the intricate and multifaceted nature of human preferences that may encompass conflicting factors across diverse tasks and populations. To address this limitation, we introduce Latent Preference Coding (LPC), a novel framework that models the implicit factors as well as their combinations behind holistic preferences using discrete latent codes. LPC seamlessly integrates with various offline alignment algorithms, automatically inferring the underlying factors and their importance from data without relying on pre-defined reward functions and hand-crafted combination weights. Extensive experiments on multiple benchmarks demonstrate that LPC consistently improves upon three alignment algorithms (DPO, SimPO, and IPO) using three base models (Mistral-7B, Llama3-8B, and Llama3-8B-Instruct). Furthermore, deeper analysis reveals that the learned latent codes effectively capture the differences in the distribution of human preferences and significantly enhance the robustness of alignment against noise in data. By providing a unified representation for the multifarious preference factors, LPC paves the way towards developing more robust and versatile alignment techniques for the responsible deployment of powerful LLMs.
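[NLP-32] Rethinking the Relationship between the Power Law and Hierarchical Structures
Quick read: This paper tests whether the theoretical argument linking power laws in language to hierarchical structure actually holds, in particular whether the statistical properties of syntactic structures satisfy the argument's implicit assumptions. The key to the solution is analyzing parse trees from English corpora: their mutual information, their deviations from probabilistic context-free grammars (PCFGs), and other properties. The results indicate that the assumptions do not hold for syntactic structures, which makes the argument hard to extend to child language and animal signals and calls for reconsidering the relationship between the power law and hierarchical structures.
Link: https://arxiv.org/abs/2505.04984
Authors: Kai Nakaishi, Ryo Yoshida, Kohei Kajikawa, Koji Hukushima, Yohei Oseki
Affiliations: RIKEN; The University of Tokyo; National Institute for Japanese Language and Linguistics
Categories: Computation and Language (cs.CL)
Comments: 13 pages, 11 figures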
Abstract:Statistical analysis of corpora provides an approach to quantitatively investigate natural languages. This approach has revealed that several power laws consistently emerge across different corpora and languages, suggesting the universal principles underlying languages. Particularly, the power-law decay of correlation has been interpreted as evidence for underlying hierarchical structures in syntax, semantics, and discourse. This perspective has also been extended to child languages and animal signals. However, the argument supporting this interpretation has not been empirically tested. To address this problem, this study examines the validity of the argument for syntactic structures. Specifically, we test whether the statistical properties of parse trees align with the implicit assumptions in the argument. Using English corpora, we analyze the mutual information, deviations from probabilistic context-free grammars (PCFGs), and other properties in parse trees, as well as in the PCFG that approximates these trees. Our results indicate that the assumptions do not hold for syntactic structures and that it is difficult to apply the proposed argument to child languages and animal signals, highlighting the need to reconsider the relationship between the power law and hierarchical structures.
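The "power-law decay of correlation" mentioned in the abstract is typically measured as mutual information between symbols a fixed distance apart. The sketch below is a plug-in estimate over a flat symbol sequence. It illustrates that quantity only and is an assumption about the general setup, since the paper itself analyzes parse trees rather than flat sequences.

```python
import math
from collections import Counter


def mutual_information(sequence, distance):
    """Plug-in estimate, in bits, of the mutual information between
    symbols that are `distance` positions apart in `sequence`.
    """
    pairs = [
        (sequence[i], sequence[i + distance])
        for i in range(len(sequence) - distance)
    ]
    total = len(pairs)
    joint = Counter(pairs)
    left = Counter(x for x, _ in pairs)
    right = Counter(y for _, y in pairs)
    mi = 0.0
    for (x, y), count in joint.items():
        p_xy = count / total
        mi += p_xy * math.log2(p_xy / ((left[x] / total) * (right[y] / total)))
    return mi


if __name__ == "__main__":
    print(mutual_information("abababab", 2))
    print(mutual_information("aaaa", 1))
```

Plotting this value against distance on log-log axes is how a power-law decay would show up as a straight line.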
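[NLP-33] General Transform: A Unified Framework for Adaptive Transform to Enhance Representations
Quick read: This paper addresses the problem of choosing a suitable discrete transform to improve model performance in machine learning, where conventional approaches are limited when the properties of the dataset are unknown. The key to the solution is General Transform (GT), an adaptive transform-based representation that learns a data-driven mapping tailored to the dataset and task, removing the dependence of conventional transforms on prior knowledge.
Link: https://arxiv.org/abs/2505.04969
Authors: Gekko Budiutama, Shunsuke Daimon, Hirofumi Nishi, Yu-ichiro Matsushita
Affiliations: unknown
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Comments: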
Abstract:Discrete transforms, such as the discrete Fourier transform, are widely used in machine learning to improve model performance by extracting meaningful features. However, with numerous transforms available, selecting an appropriate one often depends on understanding the dataset’s properties, making the approach less effective when such knowledge is unavailable. In this work, we propose General Transform (GT), an adaptive transform-based representation designed for machine learning applications. Unlike conventional transforms, GT learns data-driven mapping tailored to the dataset and task of interest. Here, we demonstrate that models incorporating GT outperform conventional transform-based approaches across computer vision and natural language processing tasks, highlighting its effectiveness in diverse learning scenarios.
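[NLP-34] Chain-of-Thought Tokens are Computer Program Variables
[Quick Read]: This paper addresses the unclear inner mechanism of chain-of-thought (CoT) in large language models (LLMs). The key finding of its empirical study is that preserving only the CoT tokens that store intermediate results achieves performance comparable to full CoT, and that storing intermediate results in an alternative latent form does not affect model performance. Randomly intervening on some values in the CoT also changes the subsequent CoT tokens and the final answer, which suggests that CoT tokens may function like variables in computer programs, though with potential drawbacks such as unintended shortcuts and limits on computational complexity between tokens.
Link: https://arxiv.org/abs/2505.04955
Authors: Fangwei Zhu, Peiyi Wang, Zhifang Sui
Affiliations: Peking University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: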
Abstract:Chain-of-thoughts (CoT) requires large language models (LLMs) to generate intermediate steps before reaching the final answer, and has been proven effective to help LLMs solve complex reasoning tasks. However, the inner mechanism of CoT still remains largely unclear. In this paper, we empirically study the role of CoT tokens in LLMs on two compositional tasks: multi-digit multiplication and dynamic programming. While CoT is essential for solving these problems, we find that preserving only tokens that store intermediate results would achieve comparable performance. Furthermore, we observe that storing intermediate results in an alternative latent form will not affect model performance. We also randomly intervene some values in CoT, and notice that subsequent CoT tokens and the final answer would change correspondingly. These findings suggest that CoT tokens may function like variables in computer programs but with potential drawbacks like unintended shortcuts and computational complexity limits between tokens. The code and data are available at this https URL.
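[NLP-35] Prompt-Based LLMs for Position Bias-Aware Reranking in Personalized Recommendations
[Quick Read]: This paper addresses the limitations of recommender systems based on large language models (LLMs) in handling position bias, context window limits, and listwise ranking. The key to the solution is a hybrid framework that combines a traditional recommendation model with an LLM, which reranks the top-k candidate items using structured prompts, with the aim of mitigating position bias and improving recommendation quality.
Link: https://arxiv.org/abs/2505.04948
Authors: Md Aminul Islam, Ahmed Sayeed Faruk
Affiliations: University of Illinois Chicago
Categories: Information Retrieval (cs.IR); Computation and Language (cs.CL)
Comments: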
Abstract:Recommender systems are essential for delivering personalized content across digital platforms by modeling user preferences and behaviors. Recently, large language models (LLMs) have been adopted for prompt-based recommendation due to their ability to generate personalized outputs without task-specific training. However, LLM-based methods face limitations such as limited context window size, inefficient pointwise and pairwise prompting, and difficulty handling listwise ranking due to token constraints. LLMs can also be sensitive to position bias, as they may overemphasize earlier items in the prompt regardless of their true relevance. To address and investigate these issues, we propose a hybrid framework that combines a traditional recommendation model with an LLM for reranking top-k items using structured prompts. We evaluate the effects of user history reordering and instructional prompts for mitigating position bias. Experiments on MovieLens-100K show that randomizing user history improves ranking quality, but LLM-based reranking does not outperform the base model. Explicit instructions to reduce position bias are also ineffective. Our evaluations reveal limitations in LLMs’ ability to model ranking context and mitigate bias. Our code is publicly available at this https URL.
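The history-reordering idea in this abstract can be sketched as a small prompt builder. This is only an illustration: the function name `build_rerank_prompt`, the prompt wording, and the `seed` parameter are assumptions, not the authors' released code.

```python
import random


def build_rerank_prompt(history, candidates, seed=None, shuffle_history=True):
    """Build a listwise reranking prompt (hypothetical wording).

    Shuffling the user history is one way to keep the LLM from
    over-weighting whichever items happen to appear first.
    """
    rng = random.Random(seed)
    history = list(history)  # copy so the caller's list is not mutated
    if shuffle_history:
        rng.shuffle(history)
    lines = ["The user has watched the following movies (in no particular order):"]
    lines += [f"- {title}" for title in history]
    lines.append("Rank the candidate movies below from most to least relevant.")
    lines += [f"{i}. {title}" for i, title in enumerate(candidates, start=1)]
    return "\n".join(lines)


if __name__ == "__main__":
    print(build_rerank_prompt(["Heat", "Alien", "Fargo"], ["Se7en", "Aliens"], seed=0))
```

The candidate list, typically the top-k output of the base recommender, is left in its original order here; only the history is randomized.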
[NLP-36] T2VTextBench: A Human Evaluation Benchmark for Textual Control in Video Generation Models
[Quick Read]: This paper addresses the shortcomings of text-to-video generation models in rendering precise on-screen text such as captions or mathematical formulas, which poses a significant challenge for applications that require exact textual accuracy. The key to the solution is T2VTextBench, the first benchmark dedicated to evaluating on-screen text fidelity and temporal consistency in text-to-video models. Its prompt suite integrates complex text strings with dynamic scene changes to test each model's ability to maintain detailed instructions across frames.
Link: https://arxiv.org/abs/2505.04946
Authors: Xuyang Guo, Jiayan Huo, Zhenmei Shi, Zhao Song, Jiahao Zhang, Jiale Zhao
Affiliations: Guilin University of Electronic Technology; University of Arizona; University of Wisconsin-Madison; University of California, Berkeley; Arizona State University
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments:
Abstract:Thanks to recent advancements in scalable deep architectures and large-scale pretraining, text-to-video generation has achieved unprecedented capabilities in producing high-fidelity, instruction-following content across a wide range of styles, enabling applications in advertising, entertainment, and education. However, these models’ ability to render precise on-screen text, such as captions or mathematical formulas, remains largely untested, posing significant challenges for applications requiring exact textual accuracy. In this work, we introduce T2VTextBench, the first human-evaluation benchmark dedicated to evaluating on-screen text fidelity and temporal consistency in text-to-video models. Our suite of prompts integrates complex text strings with dynamic scene changes, testing each model’s ability to maintain detailed instructions across frames. We evaluate ten state-of-the-art systems, ranging from open-source solutions to commercial offerings, and find that most struggle to generate legible, consistent text. These results highlight a critical gap in current video generators and provide a clear direction for future research aimed at enhancing textual manipulation in video synthesis.
[NLP-37] Perception, Reason, Think, and Plan: A Survey on Large Multimodal Reasoning Models
[Quick Read]: This paper addresses key problems of multimodal reasoning in open, uncertain, and complex environments, including generalization, reasoning depth, and agentic behavior. The key to its approach is a four-stage developmental roadmap that covers the evolution from early methods based on task-specific modules to unified, language-centric Multimodal Large Language Models (MLLMs), and then discusses the conceptual direction of Native Large Multimodal Reasoning Models (N-LMRMs), which aim at scalable, autonomous, and adaptive reasoning and planning.
Link: https://arxiv.org/abs/2505.04921
Authors: Yunxin Li, Zhenyu Liu, Zitao Li, Xuanyu Zhang, Zhenran Xu, Xinyu Chen, Haoyuan Shi, Shenyuan Jiang, Xintong Wang, Jifang Wang, Shouzheng Huang, Xinping Zhao, Borui Jiang, Lanqing Hong, Longyue Wang, Zhuotao Tian, Baoxing Huai, Wenhan Luo, Weihua Luo, Zheng Zhang, Baotian Hu, Min Zhang
Affiliations: Harbin Institute of Technology, Shenzhen
Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Comments: 75 pages, 10 figures; Project: this https URL
Abstract:Reasoning lies at the heart of intelligence, shaping the ability to make decisions, draw conclusions, and generalize across domains. In artificial intelligence, as systems increasingly operate in open, uncertain, and multimodal environments, reasoning becomes essential for enabling robust and adaptive behavior. Large Multimodal Reasoning Models (LMRMs) have emerged as a promising paradigm, integrating modalities such as text, images, audio, and video to support complex reasoning capabilities and aiming to achieve comprehensive perception, precise understanding, and deep reasoning. As research advances, multimodal reasoning has rapidly evolved from modular, perception-driven pipelines to unified, language-centric frameworks that offer more coherent cross-modal understanding. While instruction tuning and reinforcement learning have improved model reasoning, significant challenges remain in omni-modal generalization, reasoning depth, and agentic behavior. To address these issues, we present a comprehensive and structured survey of multimodal reasoning research, organized around a four-stage developmental roadmap that reflects the field’s shifting design philosophies and emerging capabilities. First, we review early efforts based on task-specific modules, where reasoning was implicitly embedded across stages of representation, alignment, and fusion. Next, we examine recent approaches that unify reasoning into multimodal LLMs, with advances such as Multimodal Chain-of-Thought (MCoT) and multimodal reinforcement learning enabling richer and more structured reasoning chains. Finally, drawing on empirical insights from challenging benchmarks and experimental cases of OpenAI O3 and O4-mini, we discuss the conceptual direction of native large multimodal reasoning models (N-LMRMs), which aim to support scalable, agentic, and adaptive reasoning and planning in complex, real-world environments.
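[NLP-38] An Open-Source Dual-Loss Embedding Model for Semantic Retrieval in Higher Education
[Quick Read]: This paper addresses the poor fit of semantic retrieval systems to the linguistic and structural characteristics of academic content, especially the semantic-matching challenges of educational question answering over course syllabi. The key to the solution is a synthetic dataset of 3,197 sentence pairs together with two fine-tuning strategies: a baseline model trained with MultipleNegativesRankingLoss (MNRL), and a dual-loss model that combines MNRL with CosineSimilarityLoss to improve both semantic ranking and similarity calibration. Experiments show that the proposed models outperform existing open-source baselines on several education-related tasks and approach the performance of high-performing proprietary embedding models.
Link: https://arxiv.org/abs/2505.04916
Authors: Ramteja Sajja, Yusuf Sermet, Ibrahim Demir
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
Comments: 17 pages, 3 tables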
Abstract:Recent advances in AI have catalyzed the adoption of intelligent educational tools, yet many semantic retrieval systems remain ill-suited to the unique linguistic and structural characteristics of academic content. This study presents two open-source embedding models fine-tuned for educational question answering, particularly in the context of course syllabi. A synthetic dataset of 3,197 sentence pairs, spanning synonymous terminology, paraphrased questions, and implicit-explicit mappings, was constructed through a combination of manual curation and large language model (LLM)-assisted generation. Two training strategies were evaluated: (1) a baseline model fine-tuned using MultipleNegativesRankingLoss (MNRL), and (2) a dual-loss model that combines MNRL with CosineSimilarityLoss to improve both semantic ranking and similarity calibration. Evaluations were conducted on 28 university course syllabi using a fixed set of natural language questions categorized into course, faculty, and teaching assistant information. Results demonstrate that both fine-tuned models outperform strong open-source baselines, including all-MiniLM-L6-v2 and multi-qa-MiniLM-L6-cos-v1, and that the dual-loss model narrows the performance gap with high-performing proprietary embeddings such as OpenAI’s text-embedding-3 series. This work contributes reusable, domain-aligned embedding models and provides a replicable framework for educational semantic retrieval, supporting downstream applications such as academic chatbots, retrieval-augmented generation (RAG) systems, and learning management system (LMS) integrations.
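To make the dual-loss idea concrete, here is a from-scratch sketch in plain Python. It assumes MNRL is an in-batch softmax cross-entropy over scaled cosine similarities and that CosineSimilarityLoss is a mean squared error against a target similarity. The mixing weight `alpha`, the `scale` value, and the weighted-sum combination are illustrative assumptions; the abstract does not say how the two losses are combined.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def mnrl(anchors, positives, scale=20.0):
    """In-batch ranking loss: positives[i] is the target for anchors[i];
    every other positive in the batch acts as a negative."""
    total = 0.0
    for i, anchor in enumerate(anchors):
        logits = [scale * cosine(anchor, p) for p in positives]
        m = max(logits)  # subtract the max for a stable log-sum-exp
        log_z = m + math.log(sum(math.exp(x - m) for x in logits))
        total += log_z - logits[i]
    return total / len(anchors)


def cosine_similarity_loss(anchors, positives, labels):
    """Mean squared error between cosine similarity and a target score."""
    errors = [(cosine(a, p) - y) ** 2 for a, p, y in zip(anchors, positives, labels)]
    return sum(errors) / len(errors)


def dual_loss(anchors, positives, labels, alpha=0.5):
    """Weighted sum of the two losses; alpha is a hypothetical mixing weight."""
    ranking = mnrl(anchors, positives)
    calibration = cosine_similarity_loss(anchors, positives, labels)
    return alpha * ranking + (1 - alpha) * calibration
```

The ranking term only cares about relative order within the batch, while the MSE term pulls the absolute similarity toward a target value, which is one plausible reading of "similarity calibration".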
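[NLP-39] Enigme: Generative Text Puzzles for Evaluating Reasoning in Language Models
[Quick Read]: This paper addresses the limitations of generative AI in reasoning, in particular how to evaluate and improve the ability of transformer-decoder language models to reason on novel tasks. The key to the solution is to analyze the latent variable structure of transformer-decoder models and use it to design reasoning tasks that probe the boundary of their reasoning capacity. To this end, the author presents enigme, an open-source library for generating text-based puzzles for training and evaluating reasoning skills in transformer-decoder models and future AI architectures.
Link: https://arxiv.org/abs/2505.04914
Authors: John Hawkins
Affiliations: Pingla Institute, Sydney, Australia
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: To be published in the proceedings of The 2025 11th International Conference on Engineering, Applied Sciences, and Technology (ICEAST)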
Abstract:Transformer-decoder language models are a core innovation in text based generative artificial intelligence. These models are being deployed as general-purpose intelligence systems in many applications. Central to their utility is the capacity to understand natural language commands and exploit the reasoning embedded in human text corpora to apply some form of reasoning process to a wide variety of novel tasks. To understand the limitations of this approach to generating reasoning we argue that we need to consider the architectural constraints of these systems. Consideration of the latent variable structure of transformer-decoder models allows us to design reasoning tasks that should probe the boundary of their capacity to reason. We present enigme, an open-source library for generating text-based puzzles to be used in training and evaluating reasoning skills within transformer-decoder models and future AI architectures.
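[NLP-40] SpatialPrompting: Keyframe-driven Zero-Shot Spatial Reasoning with Off-the-Shelf Multimodal Large Language Models
[Quick Read]: This paper aims to achieve zero-shot spatial reasoning in three-dimensional (3D) environments. Conventional methods typically rely on expensive 3D-specific fine-tuning and specialized 3D inputs such as point clouds or voxel-based features, whereas the SpatialPrompting framework is built around a keyframe-driven prompt generation strategy. This strategy uses metrics such as vision-language similarity, Mahalanobis distance, field of view, and image sharpness to select a diverse and informative set of keyframes from image sequences, and combines them with the corresponding camera pose data to abstract spatial relationships and infer complex 3D structures.
Link: https://arxiv.org/abs/2505.04911
Authors: Shun Taguchi, Hideki Deguchi, Takumi Hamazaki, Hiroyuki Sakai
Affiliations: Toyota Central R&D Labs., Inc.
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 18 pages, 11 figures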
Abstract:This study introduces SpatialPrompting, a novel framework that harnesses the emergent reasoning capabilities of off-the-shelf multimodal large language models to achieve zero-shot spatial reasoning in three-dimensional (3D) environments. Unlike existing methods that rely on expensive 3D-specific fine-tuning with specialized 3D inputs such as point clouds or voxel-based features, SpatialPrompting employs a keyframe-driven prompt generation strategy. This framework uses metrics such as vision-language similarity, Mahalanobis distance, field of view, and image sharpness to select a diverse and informative set of keyframes from image sequences and then integrates them with corresponding camera pose data to effectively abstract spatial relationships and infer complex 3D structures. The proposed framework not only establishes a new paradigm for flexible spatial reasoning that utilizes intuitive visual and positional cues but also achieves state-of-the-art zero-shot performance on benchmark datasets, such as ScanQA and SQA3D, across several metrics. The proposed method effectively eliminates the need for specialized 3D inputs and fine-tuning, offering a simpler and more scalable alternative to conventional approaches.
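[NLP-41] ConCISE: Confidence-guided Compression in Step-by-step Efficient Reasoning
[Quick Read]: This paper addresses the redundant content that Large Reasoning Models (LRMs) produce when performing complex reasoning via Chain-of-Thought (CoT) prompting, which makes outputs verbose, increases computational overhead, and degrades user experience. Existing compression methods either prune after reasoning, which risks breaking reasoning coherence, or rely on sampling-based selection, which cannot intervene effectively during generation. The key solution is ConCISE (Confidence-guided Compression In Step-by-step Efficient Reasoning), which simplifies reasoning chains by reinforcing the model's confidence during inference: Confidence Injection stabilizes intermediate steps, and Early Stopping terminates reasoning once confidence is sufficient, reducing the generation of redundant reflection steps.
Link: https://arxiv.org/abs/2505.04881
Authors: Ziqing Qiao, Yongheng Deng, Jiali Zeng, Dong Wang, Lai Wei, Fandong Meng, Jie Zhou, Ju Ren, Yaoxue Zhang
Affiliations: Tsinghua University; Pattern Recognition Center, WeChat AI, Tencent Inc., China
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: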
Abstract:Large Reasoning Models (LRMs) perform strongly in complex reasoning tasks via Chain-of-Thought (CoT) prompting, but often suffer from verbose outputs caused by redundant content, increasing computational overhead, and degrading user experience. Existing compression methods either operate post-hoc pruning, risking disruption to reasoning coherence, or rely on sampling-based selection, which fails to intervene effectively during generation. In this work, we introduce a confidence-guided perspective to explain the emergence of redundant reflection in LRMs, identifying two key patterns: Confidence Deficit, where the model reconsiders correct steps due to low internal confidence, and Termination Delay, where reasoning continues even after reaching a confident answer. Based on this analysis, we propose ConCISE (Confidence-guided Compression In Step-by-step Efficient Reasoning), a framework that simplifies reasoning chains by reinforcing the model’s confidence during inference, thus preventing the generation of redundant reflection steps. It integrates Confidence Injection to stabilize intermediate steps and Early Stopping to terminate reasoning when confidence is sufficient. Extensive experiments demonstrate that fine-tuning LRMs on ConCISE-generated data yields significantly shorter outputs, reducing length by up to approximately 50% under SimPO, while maintaining high task accuracy. ConCISE consistently outperforms existing baselines across multiple reasoning benchmarks.
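The Early Stopping component can be pictured with a toy loop. This sketch assumes each reasoning step arrives with a scalar confidence score; the `(text, confidence)` input format, the function name, and the threshold value are hypothetical, and Confidence Injection is not modeled here.

```python
def reason_with_early_stopping(steps, threshold=0.9):
    """Keep reasoning steps until one reaches the confidence threshold.

    `steps` is an iterable of (text, confidence) pairs from a hypothetical
    step generator. Steps after the first confident one are never consumed,
    which is where the length savings would come from.
    """
    kept = []
    for text, confidence in steps:
        kept.append(text)
        if confidence >= threshold:
            break  # confident enough: stop instead of re-checking the answer
    return kept


if __name__ == "__main__":
    trace = [
        ("Set up the equation.", 0.40),
        ("Solve for x: x = 3.", 0.95),
        ("Wait, let me double-check...", 0.97),
    ]
    print(reason_with_early_stopping(trace))
```

In this toy trace the third step is the kind of redundant reflection that the abstract calls Termination Delay.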
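[NLP-42] CRAFT: Cultural Russian-Oriented Dataset Adaptation for Focused Text-to-Image Generation
[Quick Read]: This paper addresses the knowledge gap that popular text-to-image generation models show on culture-specific content. The gap arises because training data is predominantly based on Western European and American popular culture, so models lack cultural adaptation, which leads to incorrect results, lower generation quality, and the spread of stereotypes and offensive content. The key to the solution is the concept of a "cultural code" and a methodology for collecting and processing data based on it, in particular for building a Russian-culture dataset, to improve the model's understanding and generation of culture-specific content.
Link: https://arxiv.org/abs/2505.04851
Authors: Viacheslav Vasilev, Vladimir Arkhipkin, Julia Agafonova, Tatiana Nikulina, Evelina Mironova, Alisa Shichanina, Nikolai Gerasimenko, Mikhail Shoytov, Denis Dimitrov
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Computers and Society (cs.CY); Machine Learning (cs.LG)
Comments: This is the arXiv version of the paper, which was accepted for the Doklady Mathematics Journal in 2024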
Abstract:Despite the fact that popular text-to-image generation models cope well with international and general cultural queries, they have a significant knowledge gap regarding individual cultures. This is due to the content of existing large training datasets collected on the Internet, which are predominantly based on Western European or American popular culture. Meanwhile, the lack of cultural adaptation of the model can lead to incorrect results, a decrease in the generation quality, and the spread of stereotypes and offensive content. In an effort to address this issue, we examine the concept of cultural code and recognize the critical importance of its understanding by modern image generation models, an issue that has not been sufficiently addressed in the research community to date. We propose the methodology for collecting and processing the data necessary to form a dataset based on the cultural code, in particular the Russian one. We explore how the collected data affects the quality of generations in the national domain and analyze the effectiveness of our approach using the Kandinsky 3.1 text-to-image model. Human evaluation results demonstrate an increase in the level of awareness of Russian culture in the model.
[NLP-43] Benchmarking LLM Faithfulness in RAG with Evolving Leaderboards
[Quick Read]: This paper addresses hallucinations produced by large language models (LLMs), especially in summarization, where LLMs may introduce unsupported information or contradictions even when given context. The key to the solution is FaithJudge, an LLM-as-a-judge approach guided by a small number of human hallucination annotations, which substantially improves automated hallucination evaluation. An enhanced hallucination leaderboard built around FaithJudge enables more reliable benchmarking of LLM hallucinations in RAG systems.
Link: https://arxiv.org/abs/2505.04847
Authors: Manveer Singh Tamber, Forrest Sheng Bao, Chenyu Xu, Ge Luo, Suleman Kazi, Minseok Bae, Miaoran Li, Ofer Mendelevitch, Renyi Qu, Jimmy Lin
Affiliations: University of Waterloo; Vectara; Iowa State University; Stanford University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:
Abstract:Hallucinations remain a persistent challenge for LLMs. RAG aims to reduce hallucinations by grounding responses in contexts. However, even when provided context, LLMs still frequently introduce unsupported information or contradictions. This paper presents our efforts to measure LLM hallucinations with a focus on summarization tasks, assessing how often various LLMs introduce hallucinations when summarizing documents. We discuss Vectara’s existing LLM hallucination leaderboard, based on the Hughes Hallucination Evaluation Model (HHEM). While HHEM and Vectara’s Hallucination Leaderboard have garnered great research interest, we examine challenges faced by HHEM and current hallucination detection methods by analyzing the effectiveness of these methods on existing hallucination datasets. To address these limitations, we propose FaithJudge, an LLM-as-a-judge approach guided by few-shot human hallucination annotations, which substantially improves automated LLM hallucination evaluation over current methods. We introduce an enhanced hallucination leaderboard centered on FaithJudge, alongside our current hallucination leaderboard, enabling more reliable benchmarking of LLMs for hallucinations in RAG.
[NLP-44] HiPerRAG: High-Performance Retrieval Augmented Generation for Scientific Insights
[Quick Read]: This paper addresses the underutilized discoveries, duplicated efforts, and limited cross-disciplinary collaboration caused by the rapid growth of scientific literature, and aims to assist scientists by improving the factual accuracy of Large Language Models (LLMs) in processing scientific information. The key to the solution is HiPerRAG, a Retrieval Augmented Generation (RAG) workflow powered by High Performance Computing (HPC). Its core components are Oreo, a high-throughput model for multimodal document parsing, and ColTrast, a query-aware encoder fine-tuning algorithm that improves retrieval accuracy using contrastive learning and late-interaction techniques, enabling efficient indexing and knowledge retrieval over more than 3.6 million scientific articles.
Link: https://arxiv.org/abs/2505.04846
Authors: Ozan Gokdemir, Carlo Siebenschuh, Alexander Brace, Azton Wells, Brian Hsu, Kyle Hippe, Priyanka V. Setty, Aswathy Ajith, J. Gregory Pauloski, Varuni Sastry, Sam Foreman, Huihuo Zheng, Heng Ma, Bharat Kale, Nicholas Chia, Thomas Gibbs, Michael E. Papka, Thomas Brettin, Francis J. Alexander, Anima Anandkumar, Ian Foster, Rick Stevens, Venkatram Vishwanath, Arvind Ramanathan
Affiliations: Argonne National Laboratory; The University of Chicago; NVIDIA Inc.; University of Illinois Chicago; California Institute of Technology
Categories: Information Retrieval (cs.IR); Computational Engineering, Finance, and Science (cs.CE); Computation and Language (cs.CL); Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)
Comments: This paper has been accepted at the Platform for Advanced Scientific Computing Conference (PASC 25), June 16-18, 2025, Brugg-Windisch, Switzerland
Abstract:The volume of scientific literature is growing exponentially, leading to underutilized discoveries, duplicated efforts, and limited cross-disciplinary collaboration. Retrieval Augmented Generation (RAG) offers a way to assist scientists by improving the factuality of Large Language Models (LLMs) in processing this influx of information. However, scaling RAG to handle millions of articles introduces significant challenges, including the high computational costs associated with parsing documents and embedding scientific knowledge, as well as the algorithmic complexity of aligning these representations with the nuanced semantics of scientific content. To address these issues, we introduce HiPerRAG, a RAG workflow powered by high performance computing (HPC) to index and retrieve knowledge from more than 3.6 million scientific articles. At its core are Oreo, a high-throughput model for multimodal document parsing, and ColTrast, a query-aware encoder fine-tuning algorithm that enhances retrieval accuracy by using contrastive learning and late-interaction techniques. HiPerRAG delivers robust performance on existing scientific question answering benchmarks and two new benchmarks introduced in this work, achieving 90% accuracy on SciQ and 76% on PubMedQA, outperforming both domain-specific models like PubMedGPT and commercial LLMs such as GPT-4. Scaling to thousands of GPUs on the Polaris, Sunspot, and Frontier supercomputers, HiPerRAG delivers million document-scale RAG workflows for unifying scientific knowledge and fostering interdisciplinary innovation.
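For readers unfamiliar with late interaction, the sketch below shows one common form of it: a ColBERT-style MaxSim score, where each query token embedding is matched to its best document token embedding and the maxima are summed. Whether ColTrast uses exactly this scoring rule is an assumption; the function names and toy vectors are illustrative only.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def maxsim_score(query_vecs, doc_vecs):
    """Sum, over query tokens, of the best-matching document token similarity."""
    return sum(max(dot(q, d) for d in doc_vecs) for q in query_vecs)


def rank(query_vecs, docs):
    """Return document names sorted by descending MaxSim score.

    `docs` maps a document name to its list of token embeddings.
    """
    return sorted(docs, key=lambda name: maxsim_score(query_vecs, docs[name]), reverse=True)


if __name__ == "__main__":
    query = [[1.0, 0.0], [0.0, 1.0]]
    docs = {
        "doc_a": [[1.0, 0.0], [0.0, 1.0]],  # covers both query tokens
        "doc_b": [[1.0, 0.0]],              # covers only the first
    }
    print(rank(query, docs))
```

Unlike single-vector retrieval, the document keeps one embedding per token, so the interaction between query and document is deferred to scoring time.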
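[NLP-45] Osiris: A Lightweight Open-Source Hallucination Detection System
[Quick Read]: This paper addresses hallucination in Retrieval-Augmented Generation (RAG) systems, where responses from Large Language Models (LLMs) are unfaithful to the provided context, a problem that hinders the deployment of RAG systems in production. The key to the solution is a perturbed multi-hop QA dataset with induced hallucinations. Supervised fine-tuning on this dataset yields a model that achieves better recall than GPT-4o, with competitive precision and accuracy, while using far fewer parameters.
Link: https://arxiv.org/abs/2505.04844
Authors: Alex Shan, John Bauer, Christopher D. Manning
Affiliations: Stanford University; Stanford HAI
Categories: Computation and Language (cs.CL)
Comments: Stanford 191W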
Abstract:Retrieval-Augmented Generation (RAG) systems have gained widespread adoption by application builders because they leverage sources of truth to enable Large Language Models (LLMs) to generate more factually sound responses. However, hallucinations, instances of LLM responses that are unfaithful to the provided context, often prevent these systems from being deployed in production environments. Current hallucination detection methods typically involve human evaluation or the use of closed-source models to review RAG system outputs for hallucinations. Both human evaluators and closed-source models suffer from scaling issues due to their high costs and slow inference speeds. In this work, we introduce a perturbed multi-hop QA dataset with induced hallucinations. Via supervised fine-tuning on our dataset, we achieve better recall with a 7B model than GPT-4o on the RAGTruth hallucination detection benchmark and offer competitive performance on precision and accuracy, all while using a fraction of the parameters. Code is released at our repository.
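The abstract compares detectors on recall, precision, and accuracy. As a reminder of how these three differ for a binary hallucination detector, here is a minimal sketch; the label convention (1 = hallucinated) and the function name are assumptions made for the example.

```python
def detection_metrics(y_true, y_pred):
    """Precision, recall, and accuracy for binary labels (1 = hallucinated)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    return {
        # Of the responses flagged as hallucinated, how many really were.
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        # Of the truly hallucinated responses, how many were caught.
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
        "accuracy": (tp + tn) / len(pairs) if pairs else 0.0,
    }
```

A detector tuned for high recall flags more responses and so tends to trade away some precision, which is why the abstract reports the three numbers separately.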
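[NLP-46] Red Teaming the Mind of the Machine: A Systematic Evaluation of Prompt Injection and Jailbreak Vulnerabilities in LLMs
[Quick Read]: This paper addresses the security of generative AI under adversarial attack, in particular jailbreak attacks on large language models (LLMs) that bypass alignment safeguards. The key to the solution is a systematic analysis of more than 1,400 adversarial prompts, together with layered mitigation strategies and a recommended hybrid red-teaming and sandboxing approach to strengthen LLM security.
Link: https://arxiv.org/abs/2505.04806
Authors: Chetan Pathade
Affiliations: Unknown
Categories: Cryptography and Security (cs.CR); Computation and Language (cs.CL)
Comments: 7 pages, 6 figures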
Abstract:Large Language Models (LLMs) are increasingly integrated into consumer and enterprise applications. Despite their capabilities, they remain susceptible to adversarial attacks such as prompt injection and jailbreaks that override alignment safeguards. This paper provides a systematic investigation of jailbreak strategies against various state-of-the-art LLMs. We categorize over 1,400 adversarial prompts, analyze their success against GPT-4, Claude 2, Mistral 7B, and Vicuna, and examine their generalizability and construction logic. We further propose layered mitigation strategies and recommend a hybrid red-teaming and sandboxing approach for robust LLM security.
[NLP-47] Flower Across Time and Media: Sentiment Analysis of Tang Song Poetry and Visual Correspondence
[Quick Read]: This paper addresses the underexplored systematic correlation between evolving literary emotions and visual culture in the Tang and Song periods. The key to the solution is BERT-based sentiment analysis that quantifies emotional patterns in floral imagery in Tang and Song poetry, cross-validated against visual evidence from contemporaneous material culture such as textiles and ceramics, revealing new links between literary expression and artistic representation.
Link: https://arxiv.org/abs/2505.04785
Authors: Shuai Gong, Tiange Zhou
Affiliations: Macau University of Science and Technology; Beijing Normal University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 5 pages, 9 figures
Abstract:The Tang (618 to 907) and Song (960 to 1279) dynasties witnessed an extraordinary flourishing of Chinese cultural expression, where floral motifs served as a dynamic medium for both poetic sentiment and artistic design. While previous scholarship has examined these domains independently, the systematic correlation between evolving literary emotions and visual culture remains underexplored. This study addresses that gap by employing BERT-based sentiment analysis to quantify emotional patterns in floral imagery across Tang Song poetry, then validating these patterns against contemporaneous developments in decorative this http URL approach builds upon recent advances in computational humanities while remaining grounded in traditional sinological methods. By applying a fine-tuned BERT model to analyze peony and plum blossom imagery in classical poetry, we detect measurable shifts in emotional connotations between the Tang and Song periods. These textual patterns are then cross-referenced with visual evidence from textiles, ceramics, and other material culture, revealing previously unrecognized synergies between literary expression and artistic representation.
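[NLP-48] When Bad Data Leads to Good Models ICML2025
[Quick Read]: This paper re-examines the conventional view of how data quality affects model quality in large language model (LLM) pretraining, namely whether high-quality data is always the best choice. The key to the solution is a co-design of pre-training and post-training that explores the potential value of toxic data during pre-training: increasing the proportion of toxic data makes toxicity easier to control in post-training, reducing generated toxicity while preserving the model's general capabilities.
Link: https://arxiv.org/abs/2505.04741
Authors: Kenneth Li, Yida Chen, Fernanda Viégas, Martin Wattenberg
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: ICML 2025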
Abstract:In large language model (LLM) pretraining, data quality is believed to determine model quality. In this paper, we re-examine the notion of “quality” from the perspective of pre- and post-training co-design. Specifically, we explore the possibility that pre-training on more toxic data can lead to better control in post-training, ultimately decreasing a model’s output toxicity. First, we use a toy experiment to study how data composition affects the geometry of features in the representation space. Next, through controlled experiments with Olmo-1B models trained on varying ratios of clean and toxic data, we find that the concept of toxicity enjoys a less entangled linear representation as the proportion of toxic data increases. Furthermore, we show that although toxic data increases the generational toxicity of the base model, it also makes the toxicity easier to remove. Evaluations on Toxigen and Real Toxicity Prompts demonstrate that models trained on toxic data achieve a better trade-off between reducing generational toxicity and preserving general capabilities when detoxifying techniques such as inference-time intervention (ITI) are applied. Our findings suggest that, with post-training taken into account, bad data may lead to good models.
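The claim that a cleaner linear representation makes toxicity "easier to remove" can be pictured with a toy steering step: shift a hidden vector away from a toxicity direction. This is a heavily simplified sketch. It assumes a single known direction and a hand-picked strength `alpha`, whereas inference-time intervention as used in practice acts on selected attention-head activations; all names here are hypothetical.

```python
import math


def unit(direction):
    """Normalize a vector to unit length."""
    norm = math.sqrt(sum(d * d for d in direction))
    return [d / norm for d in direction]


def projection(hidden, direction):
    """Signed length of `hidden` along `direction`."""
    return sum(h * u for h, u in zip(hidden, unit(direction)))


def steer_away(hidden, direction, alpha):
    """Shift `hidden` by `alpha` units against the given direction."""
    return [h - alpha * u for h, u in zip(hidden, unit(direction))]


if __name__ == "__main__":
    hidden = [3.0, 4.0]
    toxicity_direction = [2.0, 0.0]  # hypothetical probe direction
    steered = steer_away(hidden, toxicity_direction, alpha=1.0)
    print(projection(hidden, toxicity_direction), projection(steered, toxicity_direction))
```

If toxicity is entangled with other features, a shift like this also perturbs those features; a less entangled direction lets the same shift change toxicity while leaving the orthogonal components, here the second coordinate, untouched.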
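[NLP-49] SOAEsV2-7B/72B: Full-Pipeline Optimization for State-Owned Enterprise LLMs via Continual Pre-Training, Domain-Progressive SFT and Distillation-Enhanced Speculative Decoding
[Quick Read]: This paper addresses three challenges in building domain-specific large language models (LLMs) for Chinese state-owned assets and enterprises (SOAEs): 1) constrained model capacity, which limits knowledge integration and cross-task adaptability; 2) excessive reliance on domain-specific supervised fine-tuning (SFT) data, which neglects the broader applicability of general language patterns; and 3) inefficient inference acceleration for large models processing long contexts. The key to the solution is a three-phase framework: 1) continual pre-training, which integrates domain knowledge while retaining base capabilities; 2) domain-progressive SFT with a curriculum-based learning strategy, moving from weakly relevant conversational data to expert-annotated SOAEs datasets to optimize domain-specific tasks; and 3) inference acceleration via logit distillation between the 72B target model and the 7B draft model, achieving a 1.39-1.52× speedup without quality loss.
Link: https://arxiv.org/abs/2505.04723
Authors: Jingyang Deng, Ran Chen, Jo-Ku Cheng, Jinwen Ma
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: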
Abstract:This study addresses key challenges in developing domain-specific large language models (LLMs) for Chinese state-owned assets and enterprises (SOAEs), where current approaches face three limitations: 1) constrained model capacity that limits knowledge integration and cross-task adaptability; 2) excessive reliance on domain-specific supervised fine-tuning (SFT) data, which neglects the broader applicability of general language patterns; and 3) inefficient inference acceleration for large models processing long contexts. In this work, we propose SOAEsV2-7B/72B, a specialized LLM series developed via a three-phase framework: 1) continual pre-training integrates domain knowledge while retaining base capabilities; 2) domain-progressive SFT employs a curriculum-based learning strategy, transitioning from weakly relevant conversational data to expert-annotated SOAEs datasets to optimize domain-specific tasks; 3) distillation-enhanced speculative decoding accelerates inference via logit distillation between 72B target and 7B draft models, achieving 1.39-1.52× speedup without quality loss. Experimental results demonstrate that our domain-specific pre-training phase maintains 99.8% of original general language capabilities while significantly improving domain performance, resulting in a 1.08× improvement in Rouge-1 score and a 1.17× enhancement in BLEU-4 score. Ablation studies further show that domain-progressive SFT outperforms single-stage training, achieving 1.02× improvement in Rouge-1 and 1.06× in BLEU-4. Our work introduces a comprehensive, full-pipeline approach for optimizing SOAEs LLMs, bridging the gap between general language capabilities and domain-specific expertise.
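The draft/target mechanism behind speculative decoding can be sketched with two toy "models". The version below is the greedy variant: the draft proposes `k` tokens, the target keeps the longest agreeing prefix and substitutes its own token at the first disagreement. The callables `draft_next` and `target_next` are hypothetical stand-ins for the 7B and 72B models, and the bonus token a real system gains when all `k` proposals are accepted is omitted for brevity.

```python
def speculative_decode(prefix, draft_next, target_next, k=4, max_new=8):
    """Greedy speculative decoding over token lists.

    Returns (tokens, target_passes). The output always equals what greedy
    decoding with the target alone would produce; the saving is that the
    target runs once per block of up to k tokens, not once per token.
    """
    out = list(prefix)
    target_passes = 0
    while len(out) - len(prefix) < max_new:
        # 1) The cheap draft model proposes k tokens autoregressively.
        proposal = []
        for _ in range(k):
            proposal.append(draft_next(out + proposal))
        # 2) The target verifies the block. A real system scores all k
        #    positions in one forward pass; here it is simulated step by step.
        target_passes += 1
        accepted = []
        for token in proposal:
            expected = target_next(out + accepted)
            if token == expected:
                accepted.append(token)
            else:
                accepted.append(expected)  # fix the first mismatch, drop the rest
                break
        out.extend(accepted)
    return out[: len(prefix) + max_new], target_passes
```

The closer the draft's predictions are to the target's, the more tokens survive each verification pass, which is presumably why the paper distills the target's logits into the draft model.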
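[NLP-50] Advanced Deep Learning Approaches for Automated Recognition of Cuneiform Symbols
[Quick Read]: This paper addresses the automated recognition and interpretation of ancient cuneiform script, aiming to decode this ancient writing system efficiently with deep learning. The key to the solution is to train and evaluate five different deep-learning models on a large cuneiform dataset, then use the best-performing models to recognize the meanings of cuneiform symbols from the Code of Hammurabi and translate them into English, enabling accurate interpretation of ancient Akkadian.
Link: https://arxiv.org/abs/2505.04678
Authors: Shahad Elshehaby, Alavikunhu Panthakkan, Hussain Al-Ahmad, Mina Al-Saad
Affiliations: University of Dubai
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: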
Abstract:This paper presents a thoroughly automated method for identifying and interpreting cuneiform characters via advanced deep-learning algorithms. Five distinct deep-learning models were trained on a comprehensive dataset of cuneiform characters and evaluated according to critical performance metrics, including accuracy and precision. Two models demonstrated outstanding performance and were subsequently assessed using cuneiform symbols from the Hammurabi law acquisition, notably Hammurabi Law 1. Each model effectively recognized the relevant Akkadian meanings of the symbols and delivered precise English translations. Future work will investigate ensemble and stacking approaches to optimize performance, utilizing hybrid architectures to improve detection accuracy and reliability. This research explores the linguistic relationships between Akkadian, an ancient Mesopotamian language, and Arabic, emphasizing their historical and cultural linkages. This study demonstrates the capability of deep learning to decipher ancient scripts by merging computational linguistics with archaeology, therefore providing significant insights for the comprehension and conservation of human history.
[NLP-51] REVEAL: Multi-turn Evaluation of Image-Input Harms for Vision LLM IJCAI2025
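【Quick Read】: This paper addresses the safety and ethical challenges that Vision Large Language Models (VLLMs) face in multimodal, multi-turn conversations, which safety evaluation frameworks built for text-only, single-turn interactions cannot handle. The key is the REVEAL (Responsible Evaluation of Vision-Enabled AI LLMs) framework, which systematically detects and assesses image-input harms in VLLMs through automated image mining, synthetic adversarial data generation, multi-turn conversational expansion using crescendo attack strategies, and comprehensive harm assessment by evaluators such as GPT-4o.
Link: https://arxiv.org/abs/2505.04673
Authors: Madhur Jindal, Saurabh Deshpande
Affiliations: Microsoft
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 13 pages (8 main), to be published in IJCAI 2025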
Abstract:Vision Large Language Models (VLLMs) represent a significant advancement in artificial intelligence by integrating image-processing capabilities with textual understanding, thereby enhancing user interactions and expanding application domains. However, their increased complexity introduces novel safety and ethical challenges, particularly in multi-modal and multi-turn conversations. Traditional safety evaluation frameworks, designed for text-based, single-turn interactions, are inadequate for addressing these complexities. To bridge this gap, we introduce the REVEAL (Responsible Evaluation of Vision-Enabled AI LLMs) Framework, a scalable and automated pipeline for evaluating image-input harms in VLLMs. REVEAL includes automated image mining, synthetic adversarial data generation, multi-turn conversational expansion using crescendo attack strategies, and comprehensive harm assessment through evaluators like GPT-4o. We extensively evaluated five state-of-the-art VLLMs, GPT-4o, Llama-3.2, Qwen2-VL, Phi3.5V, and Pixtral, across three important harm categories: sexual harm, violence, and misinformation. Our findings reveal that multi-turn interactions result in significantly higher defect rates compared to single-turn evaluations, highlighting deeper vulnerabilities in VLLMs. Notably, GPT-4o demonstrated the most balanced performance as measured by our Safety-Usability Index (SUI) followed closely by Pixtral. Additionally, misinformation emerged as a critical area requiring enhanced contextual defenses. Llama-3.2 exhibited the highest MT defect rate (16.55%) while Qwen2-VL showed the highest MT refusal rate (19.1%).
[NLP-52] Reward-SQL: Boosting Text-to-SQL via Stepwise Reasoning and Process-Supervised Rewards
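【Quick Read】: This paper addresses the drop in generated SQL quality caused by inaccurate reasoning in Text-to-SQL. The key is Reward-SQL, a framework that systematically incorporates Process Reward Models (PRMs) into the Text-to-SQL reasoning process. It follows a “cold start, then PRM supervision” paradigm: the model first learns to decompose queries into stepwise chains of common table expressions (Chain-of-CTEs), giving an interpretable reasoning baseline, and the PRM is then used both as an online training signal (GRPO) and to guide inference (e.g., best-of-N sampling), the combination that performed best among the strategies studied.
Link: https://arxiv.org/abs/2505.04671
Authors: Yuxin Zhang, Meihao Fan, Ju Fan, Mingyang Yi, Yuyu Luo, Jian Tan, Guoliang Li
Affiliations: Renmin University of China; HKUST (GZ); Alibaba Cloud Computing; Tsinghua University
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: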
Abstract:Recent advances in large language models (LLMs) have significantly improved performance on the Text-to-SQL task by leveraging their powerful reasoning capabilities. To enhance accuracy during the reasoning process, external Process Reward Models (PRMs) can be introduced during training and inference to provide fine-grained supervision. However, if misused, PRMs may distort the reasoning trajectory and lead to suboptimal or incorrect SQL. To address this challenge, we propose Reward-SQL, a framework that systematically explores how to incorporate PRMs into the Text-to-SQL reasoning process effectively. Our approach follows a “cold start, then PRM supervision” paradigm. Specifically, we first train the model to decompose SQL queries into structured stepwise reasoning chains using common table expressions (Chain-of-CTEs), establishing a strong and interpretable reasoning baseline. Then, we investigate four strategies for integrating PRMs, and find that combining PRM as an online training signal (GRPO) with PRM-guided inference (e.g., best-of-N sampling) yields the best results. Empirically, on the BIRD benchmark, Reward-SQL enables models supervised by a 7B PRM to achieve a 13.1% performance gain across various guidance strategies. Notably, our GRPO-aligned policy model based on Qwen2.5-Coder-7B-Instruct achieves 68.9% accuracy on the BIRD development set, outperforming all baseline methods under the same model size. These results demonstrate the effectiveness of Reward-SQL in leveraging reward-based supervision for Text-to-SQL reasoning. Our code is publicly available.
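PRM-guided best-of-N selection, mentioned in the abstract above, can be sketched in a few lines. This is a minimal illustration and not the paper's code: the scoring function toy_prm is a hypothetical stand-in for a trained process reward model, and aggregating step scores with min (so the weakest step decides) is an assumption, since the abstract does not say how step scores are combined.

```python
from typing import Callable, List, Sequence

Chain = List[str]  # one candidate: a list of CTE-style reasoning steps


def score_chain(step_scores: Sequence[float]) -> float:
    """Aggregate per-step scores; min-aggregation is an assumption of this sketch."""
    return min(step_scores) if step_scores else 0.0


def best_of_n(candidates: List[Chain], prm: Callable[[Chain], List[float]]) -> Chain:
    """Return the candidate chain with the highest aggregated step score."""
    return max(candidates, key=lambda chain: score_chain(prm(chain)))


def toy_prm(chain: Chain) -> List[float]:
    # Hypothetical scorer: a step "looks valid" if it contains a SELECT.
    return [1.0 if "SELECT" in step else 0.2 for step in chain]


good = ["WITH t AS (SELECT id FROM users)", "SELECT COUNT(*) FROM t"]
bad = ["WITH t AS (users)", "SELECT COUNT(*) FROM t"]
chosen = best_of_n([bad, good], toy_prm)
```

Because every step is scored, a chain with one broken CTE is rejected even when its final SELECT looks plausible, which is the kind of fine-grained supervision the abstract attributes to PRMs.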
[NLP-53] Fine-Tuning Large Language Models and Evaluating Retrieval Methods for Improved Question Answering on Building Codes
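【Quick Read】: This paper addresses how hard and slow it is to query building codes by hand, given their volume of text, technical language, and clauses scattered across sections. The key is a question-answering system based on Retrieval-Augmented Generation (RAG), which hinges on choosing a suitable retriever and improving the language model's generation. The study evaluates several retrieval methods and fine-tunes language models on a dataset derived from the National Building Code of Canada (NBCC), finding that Elasticsearch is the most robust retriever tested and that domain-specific fine-tuning improves the relevance of generated responses.
Link: https://arxiv.org/abs/2505.04666
Authors: Mohammad Aqib, Mohd Hamza, Qipei Mei, Ying Hei Chui
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Information Retrieval (cs.IR); Machine Learning (cs.LG)
Comments: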
Abstract:Building codes are regulations that establish standards for the design, construction, and safety of buildings to ensure structural integrity, fire protection, and accessibility. They are often extensive, complex, and subject to frequent updates, making manual querying challenging and time-consuming. Key difficulties include navigating large volumes of text, interpreting technical language, and identifying relevant clauses across different sections. A potential solution is to build a Question-Answering (QA) system that answers user queries based on building codes. Among the various methods for building a QA system, Retrieval-Augmented Generation (RAG) stands out in performance. RAG consists of two components: a retriever and a language model. This study focuses on identifying a suitable retriever method for building codes and optimizing the generational capability of the language model using fine-tuning techniques. We conducted a detailed evaluation of various retrieval methods by performing the retrieval on the National Building Code of Canada (NBCC) and explored the impact of domain-specific fine-tuning on several language models using the dataset derived from NBCC. Our analysis included a comparative assessment of different retrievers and the performance of both pre-trained and fine-tuned models to determine the efficacy and domain-specific adaptation of language models using fine-tuning on the NBCC dataset. Experimental results showed that Elasticsearch proved to be the most robust retriever among all. The findings also indicate that fine-tuning language models on an NBCC-specific dataset can enhance their ability to generate contextually relevant responses. When combined with context retrieved by a powerful retriever like Elasticsearch, this improvement in LLM performance can optimize the RAG system, enabling it to better navigate the complexities of the NBCC.
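The retriever-plus-language-model split described above can be sketched without any external service. The code below is an illustration only: it does not call Elasticsearch, the clause texts are invented placeholders rather than NBCC content, and the prompt template is an assumption. It scores documents with a plain BM25 formula (Elasticsearch's default ranking function is BM25, with k1 = 1.2 and b = 0.75 as defaults) and assembles the retrieved context into a prompt for whichever language model is used.

```python
import math
from collections import Counter
from typing import List


def bm25_scores(query: str, docs: List[str], k1: float = 1.2, b: float = 0.75) -> List[float]:
    """Score each document against the query with a basic BM25 formula."""
    tokenized = [d.lower().split() for d in docs]
    n = len(tokenized)
    avg_len = sum(len(toks) for toks in tokenized) / n
    # Document frequency: in how many documents each term appears.
    df = Counter(term for toks in tokenized for term in set(toks))
    scores = []
    for toks in tokenized:
        tf = Counter(toks)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            idf = math.log(1 + (n - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(toks) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


def build_prompt(question: str, docs: List[str], top_k: int = 2) -> str:
    """Retrieve the top_k documents and place them ahead of the question."""
    scores = bm25_scores(question, docs)
    ranked = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)[:top_k]
    context = "\n".join(docs[i] for i in ranked)
    return f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"


# Invented placeholder clauses (not real NBCC text).
clauses = [
    "exit stairs shall have a minimum width",
    "fire separations shall have a fire-resistance rating",
    "guards are required around balconies",
]
prompt = build_prompt("minimum width of exit stairs", clauses, top_k=1)
```

The fine-tuned model would then complete the text after "Answer:"; the retriever only decides which clauses the model gets to see.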
[NLP-54] Personalized Risks and Regulatory Strategies of Large Language Models in Digital Advertising
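【Quick Read】: This paper asks how advertising recommendation systems can be combined with user privacy protection and data security in real-world operation. The key is an algorithmic model that pairs a Large Language Model (LLM) with the attention mechanism to deliver personalized ad recommendations while protecting user privacy. Concretely, BERT provides semantic embeddings of ads and drives recommendations based on user profiles, while local model training and data encryption reduce the risk of leaking personal data.
Link: https://arxiv.org/abs/2505.04665
Authors: Haoyang Feng, Yanjun Dai, Yuan Gao
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: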
Abstract:Although large language models have demonstrated the potential for personalized advertising recommendations in experimental environments, in actual operations, how advertising recommendation systems can be combined with measures such as user privacy protection and data security is still an area worthy of in-depth discussion. To this end, this paper studies the personalized risks and regulatory strategies of large language models in digital advertising. This study first outlines the principles of Large Language Model (LLM), especially the self-attention mechanism based on the Transformer architecture, and how to enable the model to understand and generate natural language text. Then, the BERT (Bidirectional Encoder Representations from Transformers) model and the attention mechanism are combined to construct an algorithmic model for personalized advertising recommendations and user factor risk protection. The specific steps include: data collection and preprocessing, feature selection and construction, using large language models such as BERT for advertising semantic embedding, and ad recommendations based on user portraits. Then, local model training and data encryption are used to ensure the security of user privacy and avoid the leakage of personal data. This paper designs an experiment for personalized advertising recommendation based on a large language model of BERT and verifies it with real user data. The experimental results show that BERT-based advertising push can effectively improve the click-through rate and conversion rate of advertisements. At the same time, through local model training and privacy protection mechanisms, the risk of user privacy leakage can be reduced to a certain extent.
[NLP-55] AI-Generated Fall Data: Assessing LLMs and Diffusion Model for Wearable Fall Detection
【Quick Read】: This paper addresses the difficulty of training fall detection systems when real-world fall data is scarce. The key is using Large Language Models (LLMs) to generate synthetic fall data: text-to-motion (T2M) and text-to-text models simulate realistic fall scenarios, and the generated data is combined with real datasets to assess its effect on fall detection performance.
Link: https://arxiv.org/abs/2505.04660
Authors: Sana Alamgeer, Yasine Souissi, Anne H. H. Ngu
Affiliations: Texas State University; University of North Carolina
Categories: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Training fall detection systems is challenging due to the scarcity of real-world fall data, particularly from elderly individuals. To address this, we explore the potential of Large Language Models (LLMs) for generating synthetic fall data. This study evaluates text-to-motion (T2M, SATO, ParCo) and text-to-text models (GPT4o, GPT4, Gemini) in simulating realistic fall scenarios. We generate synthetic datasets and integrate them with four real-world baseline datasets to assess their impact on fall detection performance using a Long Short-Term Memory (LSTM) model. Additionally, we compare LLM-generated synthetic data with a diffusion-based method to evaluate their alignment with real accelerometer distributions. Results indicate that dataset characteristics significantly influence the effectiveness of synthetic data, with LLM-generated data performing best in low-frequency settings (e.g., 20Hz) while showing instability in high-frequency datasets (e.g., 200Hz). While text-to-motion models produce more realistic biomechanical data than text-to-text models, their impact on fall detection varies. Diffusion-based synthetic data demonstrates the closest alignment to real data but does not consistently enhance model performance. An ablation study further confirms that the effectiveness of synthetic data depends on sensor placement and fall representation. These findings provide insights into optimizing synthetic data generation for fall detection models.
[NLP-56] Integration of Large Language Models and Traditional Deep Learning for Social Determinants of Health Prediction
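【Quick Read】: This paper addresses the automatic extraction of Social Determinants of Health (SDoH) from clinical text and compares traditional deep learning with Large Language Models (LLMs) on SDoH classification. The key is a hybrid strategy that combines the efficiency of traditional deep learning with the precision of LLMs: eliminating expensive LLM processing speeds up classification by 12× in execution time, and the models outperform a previous reference point on multilabel SDoH classification by 10 points.
Link: https://arxiv.org/abs/2505.04655
Authors: Paul Landes, Jimeng Sun, Adam Cross
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: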
Abstract:Social Determinants of Health (SDoH) are economic, social and personal circumstances that affect or influence an individual’s health status. SDoHs have shown to be correlated to wellness outcomes, and therefore, are useful to physicians in diagnosing diseases and in decision-making. In this work, we automatically extract SDoHs from clinical text using traditional deep learning and Large Language Models (LLMs) to find the advantages and disadvantages of each on an existing publicly available dataset. Our models outperform a previous reference point on a multilabel SDoH classification by 10 points, and we present a method and model to drastically speed up classification (12X execution time) by eliminating expensive LLM processing. The method we present combines a more nimble and efficient solution that leverages the power of the LLM for precision and traditional deep learning methods for efficiency. We also show highly performant results on a dataset supplemented with synthetic data and several traditional deep learning models that outperform LLMs. Our models and methods offer the next iteration of automatic prediction of SDoHs that impact at-risk patients.
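[NLP-57] A Comparative Analysis of Ethical and Safety Gaps in LLMs using Relative Danger Coefficient
【Quick Read】: This paper addresses the evaluation of the ethical performance of Large Language Models (LLMs), particularly their safety risks, potential for misuse, discrimination, and overall societal impact. It proposes a new metric for harm in LLMs, the Relative Danger Coefficient (RDC), and, through a comparative analysis of several AI models (DeepSeek-V3, GPT variants, and Gemini variants), stresses the need for robust human oversight in high-stakes situations. The key is a quantitative evaluation framework for systematically identifying and managing the ethical risks these models may pose.
Link: https://arxiv.org/abs/2505.04654
Authors: Yehor Tereshchenko, Mika Hämäläinen
Affiliations: Metropolia University of Applied Sciences
Categories: Computation and Language (cs.CL)
Comments: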
Abstract:Artificial Intelligence (AI) and Large Language Models (LLMs) have rapidly evolved in recent years, showcasing remarkable capabilities in natural language understanding and generation. However, these advancements also raise critical ethical questions regarding safety, potential misuse, discrimination and overall societal impact. This article provides a comparative analysis of the ethical performance of various AI models, including the brand new DeepSeek-V3(R1 with reasoning and without), various GPT variants (4o, 3.5 Turbo, 4 Turbo, o1/o3 mini) and Gemini (1.5 flash, 2.0 flash and 2.0 flash exp) and highlights the need for robust human oversight, especially in situations with high stakes. Furthermore, we present a new metric for calculating harm in LLMs called Relative Danger Coefficient (RDC).
[NLP-58] Advancing Conversational Diagnostic AI with Multimodal Reasoning
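【Quick Read】: This paper addresses the fact that evaluation of diagnostic conversations for remote care has been limited to language-only interaction, which does not reflect the real-world need to handle multimodal medical artifacts such as skin photos, ECGs, and clinical documents. The key is extending the Articulate Medical Intelligence Explorer (AMIE) to gather and interpret multimodal data and reason about it precisely during consultations. Built on Gemini 2.0 Flash, the system implements a state-aware dialogue framework that dynamically controls conversation flow based on patient state and evolving diagnoses, producing a structured multimodal history-taking process and better diagnostic performance.
Link: https://arxiv.org/abs/2505.04653
Authors: Khaled Saab, Jan Freyberg, Chunjong Park, Tim Strother, Yong Cheng, Wei-Hung Weng, David G.T. Barrett, David Stutz, Nenad Tomasev, Anil Palepu, Valentin Liévin, Yash Sharma, Roma Ruparel, Abdullah Ahmed, Elahe Vedadi, Kimberly Kanada, Cian Hughes, Yun Liu, Geoff Brown, Yang Gao, Sean Li, S. Sara Mahdavi, James Manyika, Katherine Chou, Yossi Matias, Avinatan Hassidim, Dale R. Webster, Pushmeet Kohli, S.M. Ali Eslami, Joëlle Barral, Adam Rodman, Vivek Natarajan, Mike Schaekermann, Tao Tu, Alan Karthikesalingam, Ryutaro Tanno
Affiliations: Google DeepMind; Google Research
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: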
Abstract:Large Language Models (LLMs) have demonstrated great potential for conducting diagnostic conversations but evaluation has been largely limited to language-only interactions, deviating from the real-world requirements of remote care delivery. Instant messaging platforms permit clinicians and patients to upload and discuss multimodal medical artifacts seamlessly in medical consultation, but the ability of LLMs to reason over such data while preserving other attributes of competent diagnostic conversation remains unknown. Here we advance the conversational diagnosis and management performance of the Articulate Medical Intelligence Explorer (AMIE) through a new capability to gather and interpret multimodal data, and reason about this precisely during consultations. Leveraging Gemini 2.0 Flash, our system implements a state-aware dialogue framework, where conversation flow is dynamically controlled by intermediate model outputs reflecting patient states and evolving diagnoses. Follow-up questions are strategically directed by uncertainty in such patient states, leading to a more structured multimodal history-taking process that emulates experienced clinicians. We compared AMIE to primary care physicians (PCPs) in a randomized, blinded, OSCE-style study of chat-based consultations with patient actors. We constructed 105 evaluation scenarios using artifacts like smartphone skin photos, ECGs, and PDFs of clinical documents across diverse conditions and demographics. Our rubric assessed multimodal capabilities and other clinically meaningful axes like history-taking, diagnostic accuracy, management reasoning, communication, and empathy. Specialist evaluation showed AMIE to be superior to PCPs on 7/9 multimodal and 29/32 non-multimodal axes (including diagnostic accuracy). The results show clear progress in multimodal conversational diagnostic AI, but real-world translation needs further research.
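[NLP-59] Scientific Hypothesis Generation and Validation: Methods, Datasets, and Future Directions
【Quick Read】: This survey covers information synthesis, latent relationship discovery, and reasoning augmentation in scientific hypothesis generation and validation. It focuses on approaches driven by Large Language Models (LLMs), including symbolic frameworks, generative models, hybrid systems, and multi-agent architectures, and on techniques such as retrieval-augmented generation, knowledge-graph completion, simulation, causal inference, and tool-assisted reasoning. It highlights trade-offs among interpretability, novelty, and domain alignment, and contrasts early symbolic discovery systems with modern LLM pipelines that combine in-context learning, domain adaptation, and symbolic grounding.
Link: https://arxiv.org/abs/2505.04651
Authors: Adithya Kulkarni, Fatimah Alotaibi, Xinyue Zeng, Longfeng Wu, Tong Zeng, Barry Menglong Yao, Minqian Liu, Shuaicheng Zhang, Lifu Huang, Dawei Zhou
Affiliations: Virginia Tech; University of California, Davis
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: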
Abstract:Large Language Models (LLMs) are transforming scientific hypothesis generation and validation by enabling information synthesis, latent relationship discovery, and reasoning augmentation. This survey provides a structured overview of LLM-driven approaches, including symbolic frameworks, generative models, hybrid systems, and multi-agent architectures. We examine techniques such as retrieval-augmented generation, knowledge-graph completion, simulation, causal inference, and tool-assisted reasoning, highlighting trade-offs in interpretability, novelty, and domain alignment. We contrast early symbolic discovery systems (e.g., BACON, KEKADA) with modern LLM pipelines that leverage in-context learning and domain adaptation via fine-tuning, retrieval, and symbolic grounding. For validation, we review simulation, human-AI collaboration, causal modeling, and uncertainty quantification, emphasizing iterative assessment in open-world contexts. The survey maps datasets across biomedicine, materials science, environmental science, and social science, introducing new resources like AHTech and CSKG-600. Finally, we outline a roadmap emphasizing novelty-aware generation, multimodal-symbolic integration, human-in-the-loop systems, and ethical safeguards, positioning LLMs as agents for principled, scalable scientific discovery.
[NLP-60] FRAME: Feedback-Refined Agent Methodology for Enhancing Medical Research Insights
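【Quick Read】: This paper addresses the knowledge-synthesis and quality-assurance problems that arise when automating scientific research with Large Language Models (LLMs). The key is the Feedback-Refined Agent Methodology (FRAME), a framework that improves medical paper generation through iterative refinement and structured feedback. Its core contributions are a structured dataset construction method, a tripartite architecture of Generator, Evaluator, and Reflector agents, and a comprehensive evaluation framework.
Link: https://arxiv.org/abs/2505.04649
Authors: Chengzhang Yu, Yiming Zhang, Zhixin Liu, Zenghui Ding, Yining Sun, Zhanpeng Jin
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: 12 pages, 4 figures, 5 tables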
Abstract:The automation of scientific research through large language models (LLMs) presents significant opportunities but faces critical challenges in knowledge synthesis and quality assurance. We introduce Feedback-Refined Agent Methodology (FRAME), a novel framework that enhances medical paper generation through iterative refinement and structured feedback. Our approach comprises three key innovations: (1) A structured dataset construction method that decomposes 4,287 medical papers into essential research components through iterative refinement; (2) A tripartite architecture integrating Generator, Evaluator, and Reflector agents that progressively improve content quality through metric-driven feedback; and (3) A comprehensive evaluation framework that combines statistical metrics with human-grounded benchmarks. Experimental results demonstrate FRAME’s effectiveness, achieving significant improvements over conventional approaches across multiple models (9.91% average gain with DeepSeek V3, comparable improvements with GPT-4o Mini) and evaluation dimensions. Human evaluation confirms that FRAME-generated papers achieve quality comparable to human-authored works, with particular strength in synthesizing future research directions. The results demonstrated our work could efficiently assist medical research by building a robust foundation for automated medical research paper generation while maintaining rigorous academic standards.
[NLP-61] ChatGPT for automated grading of short answer questions in mechanical ventilation
【Quick Read】: This paper examines the feasibility and accuracy of using Large Language Models (LLMs) to grade short answer questions (SAQs) automatically in postgraduate medical education. It evaluates how well ChatGPT 4o grades SAQs on case-based clinical scenarios when given a standardised grading prompt and rubric, and compares it with human graders to judge its suitability for high-stakes assessment. ChatGPT's marks were systematically biased, individual-level agreement with human graders was very low, and agreement was poorest on evaluative and analytic items, so the authors caution against using LLMs to grade postgraduate coursework.
Link: https://arxiv.org/abs/2505.04645
Authors: Tejas Jade, Alex Yartsev
Affiliations: Shri Atal Bihari Vajpayee Medical College and Research Institute; Bowring and Lady Curzon Hospitals; Rajiv Gandhi University of Health Sciences; The University of Sydney; Intensive Care Service, Westmead Hospital
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG); Computation (stat.CO)
Comments:
Abstract:Standardised tests using short answer questions (SAQs) are common in postgraduate education. Large language models (LLMs) simulate conversational language and interpret unstructured free-text responses in ways aligning with applying SAQ grading rubrics, making them attractive for automated grading. We evaluated ChatGPT 4o to grade SAQs in a postgraduate medical setting using data from 215 students (557 short-answer responses) enrolled in an online course on mechanical ventilation (2020–2024). Deidentified responses to three case-based scenarios were presented to ChatGPT with a standardised grading prompt and rubric. Outputs were analysed using mixed-effects modelling, variance component analysis, intraclass correlation coefficients (ICCs), Cohen’s kappa, Kendall’s W, and Bland–Altman statistics. ChatGPT awarded systematically lower marks than human graders with a mean difference (bias) of -1.34 on a 10-point scale. ICC values indicated poor individual-level agreement (ICC1 = 0.086), and Cohen’s kappa (-0.0786) suggested no meaningful agreement. Variance component analysis showed minimal variability among the five ChatGPT sessions (G-value = 0.87), indicating internal consistency but divergence from the human grader. The poorest agreement was observed for evaluative and analytic items, whereas checklist and prescriptive rubric items had less disagreement. We caution against the use of LLMs in grading postgraduate coursework. Over 60% of ChatGPT-assigned grades differed from human grades by more than acceptable boundaries for high-stakes assessments.
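Two of the agreement statistics named in the abstract above, the Bland-Altman bias and Cohen's kappa, are easy to compute by hand. The sketch below uses invented marks for five hypothetical responses, not the study's data, and treats each mark as a category for kappa, which is an assumption about how the authors applied it.

```python
from collections import Counter
from statistics import mean, stdev
from typing import Sequence, Tuple


def bland_altman(a: Sequence[float], b: Sequence[float]) -> Tuple[float, float, float]:
    """Return (bias, lower limit, upper limit) for the paired differences a - b."""
    diffs = [x - y for x, y in zip(a, b)]
    bias = mean(diffs)
    sd = stdev(diffs)  # sample standard deviation of the differences
    return bias, bias - 1.96 * sd, bias + 1.96 * sd


def cohen_kappa(a: Sequence[int], b: Sequence[int]) -> float:
    """Chance-corrected agreement between two raters on categorical marks."""
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    count_a, count_b = Counter(a), Counter(b)
    # Agreement expected by chance from each rater's marginal distribution.
    expected = sum(count_a[c] * count_b[c] for c in set(a) | set(b)) / (n * n)
    return (observed - expected) / (1 - expected)


# Invented marks on a 10-point scale (hypothetical, for illustration only).
human_marks = [8, 7, 9, 6, 8]
llm_marks = [6, 6, 8, 5, 7]
bias, lower, upper = bland_altman(llm_marks, human_marks)
kappa = cohen_kappa(llm_marks, human_marks)
```

A negative bias means the LLM marks are lower on average than the human marks, and a kappa at or below zero means the raters agree no more often than chance would predict, which is how the abstract interprets its reported values.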
[NLP-62] Prediction-powered estimators for finite population statistics in highly imbalanced textual data: Public hate crime estimation
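【Quick Read】: This paper addresses the challenge of estimating population parameters in a finite population of text documents when labels for the target variable require manual annotation. The key is combining predictions from a transformer encoder neural network with classical survey sampling estimators, using the model predictions as an auxiliary variable, which makes estimates more efficient and reduces the manual annotation needed.
Link: https://arxiv.org/abs/2505.04643
Authors: Hannes Waldetoft, Jakob Torgander, Måns Magnusson
Affiliations: Uppsala University
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: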
Abstract:Estimating population parameters in finite populations of text documents can be challenging when obtaining the labels for the target variable requires manual annotation. To address this problem, we combine predictions from a transformer encoder neural network with well-established survey sampling estimators using the model predictions as an auxiliary variable. The applicability is demonstrated in Swedish hate crime statistics based on Swedish police reports. Estimates of the yearly number of hate crimes and the police’s under-reporting are derived using the Hansen-Hurwitz estimator, difference estimation, and stratified random sampling estimation. We conclude that if labeled training data is available, the proposed method can provide very efficient estimates with reduced time spent on manual annotation.
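Difference estimation, one of the estimators listed in the abstract above, can be sketched as follows. The sketch assumes simple random sampling without replacement and uses invented numbers; the paper's actual sampling design and data may differ. The estimated total is the sum of model predictions over the whole population plus a scaled correction from the annotated sample: T_hat = sum(pred_i over all N documents) + (N / n) * sum(y_i - pred_i over the n sampled documents).

```python
from typing import List, Sequence


def difference_estimate(preds: Sequence[float],
                        sample_idx: Sequence[int],
                        sample_labels: Sequence[int]) -> float:
    """Estimate the population total of y using predictions as an auxiliary variable.

    preds: model prediction for every document in the population.
    sample_idx: indices of the manually annotated documents.
    sample_labels: the manual labels (0 or 1) for those documents, in the same order.
    """
    population_size, sample_size = len(preds), len(sample_idx)
    # How far the predictions were off on the annotated sample.
    correction = sum(y - preds[i] for i, y in zip(sample_idx, sample_labels))
    return sum(preds) + population_size / sample_size * correction


# Invented toy population of 8 documents (hypothetical values).
truth: List[int] = [1, 0, 0, 1, 0, 0, 0, 1]
scores: List[float] = [0.9, 0.1, 0.2, 0.6, 0.0, 0.1, 0.3, 0.7]

# Annotating only documents 0, 3, and 5 still corrects the model's total.
partial = difference_estimate(scores, [0, 3, 5], [truth[i] for i in (0, 3, 5)])
# Annotating everything recovers the true total regardless of model quality.
full = difference_estimate(scores, list(range(8)), truth)
```

The better the predictions track the labels, the smaller the residuals y_i - pred_i and the lower the variance of the estimate, which is why a good classifier reduces the amount of manual annotation needed.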
[NLP-63] Rethinking Multimodal Sentiment Analysis: A High-Accuracy Simplified Fusion Architecture
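【Quick Read】: This paper addresses multimodal sentiment analysis: classifying utterance-level emotion by fusing language, audio, and visual signals. The key is a lightweight yet effective fusion-based deep learning model in which modality-specific encoders extract features for each modality independently, the representations are fused by simple concatenation, and a dense fusion layer captures cross-modal interactions, preserving performance while avoiding computational overhead.
Link: https://arxiv.org/abs/2505.04642
Authors: Nischal Mandal, Yang Li
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: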
Abstract:Multimodal sentiment analysis, a pivotal task in affective computing, seeks to understand human emotions by integrating cues from language, audio, and visual signals. While many recent approaches leverage complex attention mechanisms and hierarchical architectures, we propose a lightweight, yet effective fusion-based deep learning model tailored for utterance-level emotion classification. Using the benchmark IEMOCAP dataset, which includes aligned text, audio-derived numeric features, and visual descriptors, we design a modality-specific encoder using fully connected layers followed by dropout regularization. The modality-specific representations are then fused using simple concatenation and passed through a dense fusion layer to capture cross-modal interactions. This streamlined architecture avoids computational overhead while preserving performance, achieving a classification accuracy of 92% across six emotion categories. Our approach demonstrates that with careful feature engineering and modular design, simpler fusion strategies can outperform or match more complex models, particularly in resource-constrained environments.
[NLP-64] A Comparative Benchmark of a Moroccan Darija Toxicity Detection Model (Typica.ai) and Major LLM-Based Moderation APIs (OpenAI, Mistral, Anthropic)
【Quick Read】: This paper addresses the detection of culturally grounded toxic content, such as implicit insults, sarcasm, and culturally specific aggression, that general-purpose systems often overlook. The key is evaluating a custom toxicity detection model for Moroccan Darija against major LLM-based moderation APIs (OpenAI, Mistral, and Anthropic Claude) to show its stronger performance on culturally specific harmful content.
Link: https://arxiv.org/abs/2505.04640
Authors: Hicham Assoudi
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: GitHub repository with reproducibility materials and evaluation notebook available at: this https URL
Abstract:This paper presents a comparative benchmark evaluating the performance of Typica.ai’s custom Moroccan Darija toxicity detection model against major LLM-based moderation APIs: OpenAI (omni-moderation-latest), Mistral (mistral-moderation-latest), and Anthropic Claude (claude-3-haiku-20240307). We focus on culturally grounded toxic content, including implicit insults, sarcasm, and culturally specific aggression often overlooked by general-purpose systems. Using a balanced test set derived from the OMCD_Typica.ai_Mix dataset, we report precision, recall, F1-score, and accuracy, offering insights into challenges and opportunities for moderation in underrepresented languages. Our results highlight Typica.ai’s superior performance, underlining the importance of culturally adapted models for reliable content moderation.
[NLP-65] Language translation and change of accent for speech-to-speech task using diffusion model
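【Quick Read】: This paper addresses handling language translation and speaker accent adaptation simultaneously in speech-to-speech translation (S2ST), a task that remains underexplored in the literature. The key is a unified approach that recasts the problem as conditional generation: target speech is generated from phonemes and guided by target speech features, and a diffusion model conditioned on source speech transcriptions generates Mel spectrograms with the desired linguistic and accentual attributes, enabling joint optimization of translation and accent adaptation.
Link: https://arxiv.org/abs/2505.04639
Authors: Abhishek Mishra, Ritesh Sur Chowdhury, Vartul Bahuguna, Isha Pandey, Ganesh Ramakrishnan
Affiliations: CMINDS, IIT Bombay; CSE, IIT Bombay
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: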
Abstract:Speech-to-speech translation (S2ST) aims to convert spoken input in one language to spoken output in another, typically focusing on either language translation or accent adaptation. However, effective cross-cultural communication requires handling both aspects simultaneously - translating content while adapting the speaker’s accent to match the target language context. In this work, we propose a unified approach for simultaneous speech translation and change of accent, a task that remains underexplored in current literature. Our method reformulates the problem as a conditional generation task, where target speech is generated based on phonemes and guided by target speech features. Leveraging the power of diffusion models, known for high-fidelity generative capabilities, we adapt text-to-image diffusion strategies by conditioning on source speech transcriptions and generating Mel spectrograms representing the target speech with desired linguistic and accentual attributes. This integrated framework enables joint optimization of translation and accent adaptation, offering a more parameter-efficient and effective model compared to traditional pipelines.
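[NLP-66] Towards Artificial Intelligence Research Assistant for Expert-Involved Learning
【Quick Read】: This paper addresses the fact that the reliability and specific contributions of Large Language Models (LLMs) and Large Multi-Modal Models (LMMs) in biomedical research remain insufficiently characterized. The key is ARIEL, a multimodal dataset for benchmarking and improving two core capabilities: summarizing scientific texts and interpreting complex biomedical figures. Using two open-source sets and expert-driven human evaluation, the authors systematically benchmark open- and closed-source foundation models, improve summarization through targeted prompt engineering and fine-tuning, and apply test-time computational scaling to strengthen LMM reasoning, achieving higher accuracy than human-expert corrections.
Link: https://arxiv.org/abs/2505.04638
Authors: Tianyu Liu, Simeng Han, Xiao Luo, Hanchen Wang, Pan Lu, Biqing Zhu, Yuge Wang, Keyi Li, Jiapeng Chen, Rihao Qu, Yufeng Liu, Xinyue Cui, Aviv Yaish, Yuhang Chen, Minsheng Hao, Chuhan Li, Kexing Li, Arman Cohan, Hua Xu, Mark Gerstein, James Zou, Hongyu Zhao
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Information Retrieval (cs.IR)
Comments: 36 pages, 7 figures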
Abstract:Large Language Models (LLMs) and Large Multi-Modal Models (LMMs) have emerged as transformative tools in scientific research, yet their reliability and specific contributions to biomedical applications remain insufficiently characterized. In this study, we present ARtificial Intelligence research assistant for Expert-involved Learning (ARIEL), a multimodal dataset designed to benchmark and enhance two critical capabilities of LLMs and LMMs in biomedical research: summarizing extensive scientific texts and interpreting complex biomedical figures. To facilitate rigorous assessment, we create two open-source sets comprising biomedical articles and figures with designed questions. We systematically benchmark both open- and closed-source foundation models, incorporating expert-driven human evaluations conducted by doctoral-level experts. Furthermore, we improve model performance through targeted prompt engineering and fine-tuning strategies for summarizing research papers, and apply test-time computational scaling to enhance the reasoning capabilities of LMMs, achieving superior accuracy compared to human-expert corrections. We also explore the potential of using LMM Agents to generate scientific hypotheses from diverse multimodal inputs. Overall, our results delineate clear strengths and highlight significant limitations of current foundation models, providing actionable insights and guiding future advancements in deploying large-scale language and multi-modal models within biomedical research.
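[NLP-67] Adaptive Token Boundaries: Integrating Human Chunking Mechanisms into Multimodal LLMs
[Quick Read]: This paper addresses the marked gap between how multimodal large language models (MLLMs) integrate cross-modal information and how human cognition does. The static tokenization used by existing models limits their ability to simulate the dynamic, context-sensitive nature of human information processing. The key to the solution is a dynamic cross-modal tokenization framework grounded in cognitive science principles, comprising adaptive boundaries, hierarchical representations, and alignment mechanisms, that tracks human cross-modal processing more closely.
Link: https://arxiv.org/abs/2505.04637
Authors: Dongxing Yu
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: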
Abstract:Recent advancements in multimodal large language models (MLLMs) have demonstrated remarkable capabilities in processing diverse data types, yet significant disparities persist between human cognitive processes and computational approaches to multimodal information integration. This research presents a systematic investigation into the parallels between human cross-modal chunking mechanisms and token representation methodologies in MLLMs. Through empirical studies comparing human performance patterns with model behaviors across visual-linguistic tasks, we demonstrate that conventional static tokenization schemes fundamentally constrain current models’ capacity to simulate the dynamic, context-sensitive nature of human information processing. We propose a novel framework for dynamic cross-modal tokenization that incorporates adaptive boundaries, hierarchical representations, and alignment mechanisms grounded in cognitive science principles. Quantitative evaluations demonstrate that our approach yields statistically significant improvements over state-of-the-art models on benchmark tasks (+7.8% on Visual Question Answering, +5.3% on Complex Scene Description) while exhibiting more human-aligned error patterns and attention distributions. These findings contribute to the theoretical understanding of the relationship between human cognition and artificial intelligence, while providing empirical evidence for developing more cognitively plausible AI systems.
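[NLP-68] How Social is It? A Benchmark for LLMs' Capabilities in Multi-user Multi-turn Social Agent Tasks
[Quick Read]: This paper addresses the lack of systematic evaluation of the social capabilities of large language models (LLMs) in multi-user, multi-turn social agent tasks. Existing benchmarks do not fully measure whether LLMs can independently play roles in complex social settings. The key to the solution is an agent task leveling framework grounded in sociological principles, together with a new benchmark, How Social Is It (HSII), with four stages: format parsing, target selection, target switching conversation, and stable conversation. These stages evaluate the communication and task completion capabilities of LLMs in realistic social interaction scenarios. The paper also introduces a statistical metric, COT-complexity, to strike a better trade-off between correctness and efficiency.
Link: https://arxiv.org/abs/2505.04628
Authors: Yusen Wu,Junwu Xiong,Xiaotie Deng
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Social and Information Networks (cs.SI)
Comments: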
Abstract:Expanding the application of large language models (LLMs) to societal life, instead of primary function only as auxiliary assistants to communicate with only one person at a time, necessitates LLMs’ capabilities to independently play roles in multi-user, multi-turn social agent tasks within complex social settings. However, currently the capability has not been systematically measured with available benchmarks. To address this gap, we first introduce an agent task leveling framework grounded in sociological principles. Concurrently, we propose a novel benchmark, How Social Is It (we call it HSII below), designed to assess LLM’s social capabilities in comprehensive social agents tasks and benchmark representative models. HSII comprises four stages: format parsing, target selection, target switching conversation, and stable conversation, which collectively evaluate the communication and task completion capabilities of LLMs within realistic social interaction scenarios dataset, HSII-Dataset. The dataset is derived step by step from news dataset. We perform an ablation study by doing clustering to the dataset. Additionally, we investigate the impact of chain of thought (COT) method on enhancing LLMs’ social performance. Since COT cost more computation, we further introduce a new statistical metric, COT-complexity, to quantify the efficiency of certain LLMs with COTs for specific social tasks and strike a better trade-off between measurement of correctness and efficiency. Various results of our experiments demonstrate that our benchmark is well-suited for evaluating social skills in LLMs.
[NLP-69] From Dialect Gaps to Identity Maps: Tackling Variability in Speaker Verification
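[Quick Read]: This paper addresses the complexity and difficulty of speaker detection across Kurdish dialects, whose large phonetic and lexical differences pose particular challenges for speaker recognition systems. The key solutions it proposes are advanced machine learning methods, data augmentation strategies, and the construction of thorough dialect-specific corpora to improve system accuracy and reliability. The results show that strategies customized for each dialect, combined with cross-dialect training, greatly improve recognition performance.
Link: https://arxiv.org/abs/2505.04629
Authors: Abdulhady Abas Abdullah,Soran Badawi,Dana A. Abdullah,Dana Rasul Hamad,Hanan Abdulrahman Taher,Sabat Salih Muhamad,Aram Mahmood Ahmed,Bryar A. Hassan,Sirwan Abdolwahed Aula,Tarik A. Rashid
Affiliations: Unknown
Categories: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: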
Abstract:The complexity and difficulties of Kurdish speaker detection among its several dialects are investigated in this work. Because of its great phonetic and lexical differences, Kurdish with several dialects including Kurmanji, Sorani, and Hawrami offers special challenges for speaker recognition systems. The main difficulties in building a strong speaker identification system capable of precisely identifying speakers across several dialects are investigated in this work. To raise the accuracy and dependability of these systems, it also suggests solutions like sophisticated machine learning approaches, data augmentation tactics, and the building of thorough dialect-specific corpus. The results show that customized strategies for every dialect together with cross-dialect training greatly enhance recognition performance.
Computer Vision
[CV-0] SVAD: From Single Image to 3D Avatar via Synthetic Data Generation with Video Diffusion and Data Augmentation CVPR2025
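[Quick Read]: This paper addresses the generation of high-quality animatable 3D human avatars from a single image, which is hard because reconstructing complete 3D information from a single viewpoint is inherently difficult. Existing methods have clear limits: 3D Gaussian Splatting (3DGS) methods produce high-quality results but need multiple views or video sequences, while video diffusion models can animate a single image but struggle with consistency and identity preservation. The proposed SVAD combines the strengths of both. Its key idea is to generate synthetic training data with a video diffusion model, improve that data with identity preservation and image restoration modules, and train a 3DGS avatar on the refined data, yielding high fidelity, identity consistency, and real-time rendering.
Link: https://arxiv.org/abs/2505.05475
Authors: Yonwoo Choi
Affiliations: SECERN AI
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by CVPR 2025 SyntaGen Workshop, Project Page: this https URL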
Abstract:Creating high-quality animatable 3D human avatars from a single image remains a significant challenge in computer vision due to the inherent difficulty of reconstructing complete 3D information from a single viewpoint. Current approaches face a clear limitation: 3D Gaussian Splatting (3DGS) methods produce high-quality results but require multiple views or video sequences, while video diffusion models can generate animations from single images but struggle with consistency and identity preservation. We present SVAD, a novel approach that addresses these limitations by leveraging complementary strengths of existing techniques. Our method generates synthetic training data through video diffusion, enhances it with identity preservation and image restoration modules, and utilizes this refined data to train 3DGS avatars. Comprehensive evaluations demonstrate that SVAD outperforms state-of-the-art (SOTA) single-image methods in maintaining identity consistency and fine details across novel poses and viewpoints, while enabling real-time rendering capabilities. Through our data augmentation pipeline, we overcome the dependency on dense monocular or multi-view training data typically required by traditional 3DGS approaches. Extensive quantitative, qualitative comparisons show our method achieves superior performance across multiple metrics against baseline models. By effectively combining the generative power of diffusion models with both the high-quality results and rendering efficiency of 3DGS, our work establishes a new approach for high-fidelity avatar generation from a single image input.
[CV-1] 3D Scene Generation: A Survey
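[Quick Read]: This survey covers 3D scene generation, the synthesis of spatially structured, semantically meaningful, and photorealistic environments for applications such as immersive media, robotics, autonomous driving, and embodied AI. The key development it highlights is the combination of deep generative models (e.g., GANs, diffusion models) with 3D representations (e.g., NeRF, 3D Gaussians), including approaches that reframe scene generation as image or video generation, which improves the fidelity, diversity, and view consistency of generated scenes. The paper systematically reviews current mainstream methods and outlines future directions, including higher fidelity, physics-aware generation, and interactive generation.
Link: https://arxiv.org/abs/2505.05474
Authors: Beichen Wen,Haozhe Xie,Zhaoxi Chen,Fangzhou Hong,Ziwei Liu
Affiliations: Nanyang Technological University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project Page: this https URL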
Abstract:3D scene generation seeks to synthesize spatially structured, semantically meaningful, and photorealistic environments for applications such as immersive media, robotics, autonomous driving, and embodied AI. Early methods based on procedural rules offered scalability but limited diversity. Recent advances in deep generative models (e.g., GANs, diffusion models) and 3D representations (e.g., NeRF, 3D Gaussians) have enabled the learning of real-world scene distributions, improving fidelity, diversity, and view consistency. Recent advances like diffusion models bridge 3D scene synthesis and photorealism by reframing generation as image or video synthesis problems. This survey provides a systematic overview of state-of-the-art approaches, organizing them into four paradigms: procedural generation, neural 3D-based generation, image-based generation, and video-based generation. We analyze their technical foundations, trade-offs, and representative results, and review commonly used datasets, evaluation protocols, and downstream applications. We conclude by discussing key challenges in generation capacity, 3D representation, data and annotations, and evaluation, and outline promising directions including higher fidelity, physics-aware and interactive generation, and unified perception-generation models. This review organizes recent advances in 3D scene generation and highlights promising directions at the intersection of generative AI, 3D vision, and embodied intelligence. To track ongoing developments, we maintain an up-to-date project page: this https URL.
[CV-2] DiffusionSfM: Predicting Structure and Motion via Ray Origin and Endpoint Diffusion CVPR2025
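[Quick Read]: This paper addresses a limitation of conventional Structure-from-Motion (SfM) methods, which rely on a two-stage pipeline that combines learned or geometric pairwise reasoning with a subsequent global optimization. It proposes a data-driven multi-view reasoning approach that infers 3D scene geometry and camera poses directly from multi-view images. The key to the solution is a transformer-based denoising diffusion model: scene geometry and cameras are parameterized as pixel-wise ray origins and endpoints in a global frame, and the model predicts them from multi-view inputs, giving end-to-end 3D reconstruction.
Link: https://arxiv.org/abs/2505.05473
Authors: Qitao Zhao,Amy Lin,Jeff Tan,Jason Y. Zhang,Deva Ramanan,Shubham Tulsiani
Affiliations: Carnegie Mellon University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: CVPR 2025. Project website: this https URL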
Abstract:Current Structure-from-Motion (SfM) methods typically follow a two-stage pipeline, combining learned or geometric pairwise reasoning with a subsequent global optimization step. In contrast, we propose a data-driven multi-view reasoning approach that directly infers 3D scene geometry and camera poses from multi-view images. Our framework, DiffusionSfM, parameterizes scene geometry and cameras as pixel-wise ray origins and endpoints in a global frame and employs a transformer-based denoising diffusion model to predict them from multi-view inputs. To address practical challenges in training diffusion models with missing data and unbounded scene coordinates, we introduce specialized mechanisms that ensure robust learning. We empirically validate DiffusionSfM on both synthetic and real datasets, demonstrating that it outperforms classical and learning-based approaches while naturally modeling uncertainty.
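The ray origin and endpoint parameterization mentioned in the abstract can be illustrated with standard pinhole geometry. The sketch below is not the authors' code. It assumes a pinhole camera with intrinsics `fx, fy, cx, cy`, a world-to-camera pose `x_cam = R @ x_world + t`, and a known z-depth for the pixel; all function names are hypothetical.

```python
def mat_vec(m, v):
    """Multiply a 3x3 matrix (list of rows) by a 3-vector."""
    return [sum(m[i][j] * v[j] for j in range(3)) for i in range(3)]


def transpose(m):
    """Transpose a 3x3 matrix; for a rotation this is also its inverse."""
    return [[m[j][i] for j in range(3)] for i in range(3)]


def pixel_ray(u, v, depth, fx, fy, cx, cy, R, t):
    """Return (origin, endpoint) of the ray through pixel (u, v) in the world frame.

    Assumed convention: x_cam = R @ x_world + t, and `depth` is the z-depth
    of the observed surface point in the camera frame.
    """
    R_inv = transpose(R)
    # Ray origin: the camera center, which is -R^T t in world coordinates.
    origin = [-c for c in mat_vec(R_inv, t)]
    # Back-project the pixel to a 3D point in the camera frame.
    p_cam = [(u - cx) / fx * depth, (v - cy) / fy * depth, depth]
    # Ray endpoint: the same point expressed in world coordinates.
    endpoint = mat_vec(R_inv, [p_cam[i] - t[i] for i in range(3)])
    return origin, endpoint
```

Under this convention every pixel of one image shares the same origin, while the endpoints trace out the visible surface, so a set of (origin, endpoint) pairs carries both the camera and the geometry.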
[CV-3] Mogao: An Omni Foundation Model for Interleaved Multi-Modal Generation
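[Quick Read]: This paper addresses the restriction of unified models to single-modal generation: most existing approaches generate one modality conditioned on several and cannot produce interleaved multi-modal output. The key to the solution is Mogao, a framework that achieves interleaved multi-modal generation through a causal approach. Its core technical improvements are a deep-fusion design, dual vision encoders, interleaved rotary position embeddings, and multi-modal classifier-free guidance, which together combine the strengths of autoregressive models for text generation and diffusion models for high-quality image synthesis.
Link: https://arxiv.org/abs/2505.05472
Authors: Chao Liao,Liyang Liu,Xun Wang,Zhengxiong Luo,Xinyu Zhang,Wenliang Zhao,Jie Wu,Liang Li,Zhi Tian,Weilin Huang
Affiliations: ByteDance
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Mogao Technical Report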
Abstract:Recent progress in unified models for image understanding and generation has been impressive, yet most approaches remain limited to single-modal generation conditioned on multiple modalities. In this paper, we present Mogao, a unified framework that advances this paradigm by enabling interleaved multi-modal generation through a causal approach. Mogao integrates a set of key technical improvements in architecture design, including a deep-fusion design, dual vision encoders, interleaved rotary position embeddings, and multi-modal classifier-free guidance, which allow it to harness the strengths of both autoregressive models for text generation and diffusion models for high-quality image synthesis. These practical improvements also make Mogao particularly effective to process interleaved sequences of text and images arbitrarily. To further unlock the potential of unified models, we introduce an efficient training strategy on a large-scale, in-house dataset specifically curated for joint text and image generation. Extensive experiments show that Mogao not only achieves state-of-the-art performance in multi-modal understanding and text-to-image generation, but also excels in producing high-quality, coherent interleaved outputs. Its emergent capabilities in zero-shot image editing and compositional generation highlight Mogao as a practical omni-modal foundation model, paving the way for future development and scaling the unified multi-modal systems.
[CV-4] Flow-GRPO: Training Flow Matching Models via Online RL
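[Quick Read]: This paper addresses how to bring online reinforcement learning (RL) into flow matching models to improve generation quality and controllability. The solution rests on two strategies. First, it converts the deterministic ordinary differential equation (ODE) into an equivalent stochastic differential equation (SDE) that preserves the original model's marginal distributions, enabling statistical sampling for RL exploration. Second, it uses a Denoising Reduction strategy that cuts the number of denoising steps during training while keeping the original number of inference timesteps, significantly improving sampling efficiency without degrading performance.
Link: https://arxiv.org/abs/2505.05470
Authors: Jie Liu,Gongye Liu,Jiajun Liang,Yangguang Li,Jiaheng Liu,Xintao Wang,Pengfei Wan,Di Zhang,Wanli Ouyang
Affiliations: CUHK MMLab; Tsinghua University; Kuaishou Technology; Nanjing University; Shanghai AI Laboratory
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Code: this https URL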
Abstract:We propose Flow-GRPO, the first method integrating online reinforcement learning (RL) into flow matching models. Our approach uses two key strategies: (1) an ODE-to-SDE conversion that transforms a deterministic Ordinary Differential Equation (ODE) into an equivalent Stochastic Differential Equation (SDE) that matches the original model’s marginal distribution at all timesteps, enabling statistical sampling for RL exploration; and (2) a Denoising Reduction strategy that reduces training denoising steps while retaining the original inference timestep number, significantly improving sampling efficiency without performance degradation. Empirically, Flow-GRPO is effective across multiple text-to-image tasks. For complex compositions, RL-tuned SD3.5 generates nearly perfect object counts, spatial relations, and fine-grained attributes, boosting GenEval accuracy from 63% to 95%. In visual text rendering, its accuracy improves from 59% to 92%, significantly enhancing text generation. Flow-GRPO also achieves substantial gains in human preference alignment. Notably, little to no reward hacking occurred, meaning rewards did not increase at the cost of image quality or diversity, and both remained stable in our experiments.
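The abstract does not spell out the policy update, but GRPO-style methods in general score a group of samples drawn for the same prompt and normalize each reward against the group. A minimal sketch of that normalization follows; the function name, the use of the population standard deviation, and the `eps` constant are assumptions for illustration, not details taken from the paper.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one group of samples from the same prompt.

    Each advantage is (reward - group mean) / (group std + eps), so samples
    that beat their siblings get a positive signal and the rest a negative one.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std; an assumed choice
    return [(r - mean) / (std + eps) for r in rewards]
```

For example, `group_relative_advantages([1.0, 0.0, 1.0, 0.0])` gives positive values for the two rewarded samples and negative values for the others, and a group with identical rewards yields all zeros. This is why exploration matters: with a deterministic sampler every sample in the group would be the same and carry no learning signal.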
[CV-5] Generating Physically Stable and Buildable LEGO Designs from Text
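[Quick Read]: This paper addresses the generation of physically stable LEGO brick models from text prompts. The key to the solution is to build a large-scale dataset of physically stable LEGO designs and train an autoregressive large language model to predict the next brick to add. During autoregressive inference, an efficient validity check and a physics-aware rollback prune infeasible token predictions using physics laws and assembly constraints, improving the stability of the generated designs.
Link: https://arxiv.org/abs/2505.05469
Authors: Ava Pun,Kangle Deng,Ruixuan Liu,Deva Ramanan,Changliu Liu,Jun-Yan Zhu
Affiliations: Carnegie Mellon University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project page: this https URL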
Abstract:We introduce LegoGPT, the first approach for generating physically stable LEGO brick models from text prompts. To achieve this, we construct a large-scale, physically stable dataset of LEGO designs, along with their associated captions, and train an autoregressive large language model to predict the next brick to add via next-token prediction. To improve the stability of the resulting designs, we employ an efficient validity check and physics-aware rollback during autoregressive inference, which prunes infeasible token predictions using physics laws and assembly constraints. Our experiments show that LegoGPT produces stable, diverse, and aesthetically pleasing LEGO designs that align closely with the input text prompts. We also develop a text-based LEGO texturing method to generate colored and textured designs. We show that our designs can be assembled manually by humans and automatically by robotic arms. We also release our new dataset, StableText2Lego, containing over 47,000 LEGO structures of over 28,000 unique 3D objects accompanied by detailed captions, along with our code and models at the project website: this https URL.
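The interplay of a validity check and a rollback during step-by-step generation can be sketched with a toy example. This is not LegoGPT's algorithm: the 2D side-view grid, the `(x, z, width)` brick format, and the "supported from below" rule standing in for a real physics analysis are all simplifying assumptions, and the proposal list stands in for bricks sampled from a language model.

```python
def overlaps(a, b):
    """Bricks are (x, z, width); they collide if on the same layer with intersecting x-ranges."""
    return a[1] == b[1] and a[0] < b[0] + b[2] and b[0] < a[0] + a[2]


def is_valid(brick, placed, grid_width):
    """Validity check: the brick must lie inside the grid and not collide."""
    x, z, w = brick
    in_bounds = 0 <= x and x + w <= grid_width and z >= 0
    return in_bounds and not any(overlaps(brick, p) for p in placed)


def is_supported(brick, placed):
    """Toy stability proxy: on the ground, or resting on a brick one layer below."""
    x, z, w = brick
    if z == 0:
        return True
    return any(p[1] == z - 1 and x < p[0] + p[2] and p[0] < x + w for p in placed)


def is_stable(placed):
    return all(is_supported(b, placed) for b in placed)


def build(proposals, grid_width=8):
    """Consume proposed bricks in order, rejecting invalid ones and rolling back unstable states."""
    placed = []
    for brick in proposals:
        if not is_valid(brick, placed, grid_width):
            continue  # reject; a real system would resample from the model
        placed.append(brick)
        while placed and not is_stable(placed):
            placed.pop()  # roll back to the most recent stable prefix
    return placed
```

With proposals `[(0, 0, 4), (2, 1, 4), (6, 3, 2), (9, 0, 2), (2, 1, 2)]` the first two bricks are kept, the floating brick is rolled back, and the out-of-bounds and colliding bricks are rejected before placement.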
[CV-6] SITE: towards Spatial Intelligence Thorough Evaluation
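[Quick Read]: This paper addresses the lack of a thorough evaluation of spatial intelligence (SI) in large vision-language models, covering multiple visual modalities (single-image, multi-image, and video) and multiple SI factors (figural to environmental scales, spatial visualization and orientation, intrinsic and extrinsic, static and dynamic). The key to the solution is the SITE benchmark, curated by combining a bottom-up survey of 31 existing datasets with a top-down strategy drawing on classification systems from cognitive science, which led to two new task types on view-taking and dynamic scenes for measuring spatial intelligence more systematically.
Link: https://arxiv.org/abs/2505.05456
Authors: Wenqi Wang,Reuben Tan,Pengyue Zhu,Jianwei Yang,Zhengyuan Yang,Lijuan Wang,Andrey Kolobov,Jianfeng Gao,Boqing Gong
Affiliations: Boston University; Microsoft Research, Redmond
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: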
Abstract:Spatial intelligence (SI) represents a cognitive ability encompassing the visualization, manipulation, and reasoning about spatial relationships, underpinning disciplines from neuroscience to robotics. We introduce SITE, a benchmark dataset towards SI Thorough Evaluation in a standardized format of multi-choice visual question-answering, designed to assess large vision-language models’ spatial intelligence across diverse visual modalities (single-image, multi-image, and video) and SI factors (figural to environmental scales, spatial visualization and orientation, intrinsic and extrinsic, static and dynamic). Our approach to curating the benchmark combines a bottom-up survey about 31 existing datasets and a top-down strategy drawing upon three classification systems in cognitive science, which prompt us to design two novel types of tasks about view-taking and dynamic scenes. Extensive experiments reveal that leading models fall behind human experts especially in spatial orientation, a fundamental SI factor. Moreover, we demonstrate a positive correlation between a model’s spatial reasoning proficiency and its performance on an embodied AI task.
[CV-7] PillarMamba: Learning Local-Global Context for Roadside Point Cloud via Hybrid State Space Model
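[Quick Read]: This paper addresses 3D object detection on roadside point clouds, a problem that matters for Intelligent Transport System (ITS) and Vehicle-to-Everything (V2X) tasks but has not been explored effectively. The key lies in enlarging the network's receptive field and making effective use of scene context. The proposed solution brings the State Space Model (SSM) based Mamba architecture into a Cross-stage State-space Group (CSG) framework, which strengthens the network's expressiveness and keeps computation efficient through cross-stage feature fusion. Because limited scan directions cause state space models to break local connections and forget historical relationships, the paper further proposes the Hybrid State-space Block (HSB), which strengthens neighborhood connections through local convolution and preserves historical memory through residual attention.
Link: https://arxiv.org/abs/2505.05397
Authors: Zhang Zhang,Chao Sun,Chao Yue,Da Wen,Tianze Wang,Jianghao Leng
Affiliations: Beijing Institute of Technology
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: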
Abstract:Serving the Intelligent Transport System (ITS) and Vehicle-to-Everything (V2X) tasks, roadside perception has received increasing attention in recent years, as it can extend the perception range of connected vehicles and improve traffic safety. However, roadside point cloud oriented 3D object detection has not been effectively explored. To some extent, the key to the performance of a point cloud detector lies in the receptive field of the network and the ability to effectively utilize the scene context. The recent emergence of Mamba, based on State Space Model (SSM), has shaken up the traditional convolution and transformers that have long been the foundational building blocks, due to its efficient global receptive field. In this work, we introduce Mamba to pillar-based roadside point cloud perception and propose a framework based on Cross-stage State-space Group (CSG), called PillarMamba. It enhances the expressiveness of the network and achieves efficient computation through cross-stage feature fusion. However, due to the limitations of scan directions, state space model faces local connection disrupted and historical relationship forgotten. To address this, we propose the Hybrid State-space Block (HSB) to obtain the local-global context of roadside point cloud. Specifically, it enhances neighborhood connections through local convolution and preserves historical memory through residual attention. The proposed method outperforms the state-of-the-art methods on the popular large scale roadside benchmark: DAIR-V2X-I. The code will be released soon.
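The scan-direction limitation noted in the abstract is easier to see on the basic state space recurrence. The sketch below uses a scalar, fixed-parameter recurrence `h_t = a * h_{t-1} + b * x_t`, `y_t = c * h_t`; Mamba makes these parameters input-dependent, so this is a simplified stand-in, and the parameter values are arbitrary assumptions.

```python
def ssm_scan(xs, a=0.5, b=1.0, c=1.0):
    """Run a scalar linear state space recurrence over a sequence.

    The hidden state h carries a decaying summary of everything seen so far,
    so each output depends only on inputs earlier in the scan order.
    """
    h = 0.0
    ys = []
    for x in xs:
        h = a * h + b * x
        ys.append(c * h)
    return ys


def ssm_scan_backward(xs, a=0.5, b=1.0, c=1.0):
    """Same recurrence, scanned from the end of the sequence to the start."""
    return list(reversed(ssm_scan(list(reversed(xs)), a, b, c)))
```

For the input `[1, 0, 0]`, the forward scan spreads the impulse to later positions with decaying weight, while the backward scan leaves those positions at zero. Two pillars that are neighbors in space can therefore end up far apart in a flattened scan order, which is the kind of broken local connection the abstract describes.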
[CV-8] EDmamba: A Simple yet Effective Event Denoising Method with State Space Model
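[Quick Read]: This paper addresses the inherent noise in the event stream output by a Dynamic Vision Sensor (DVS), with the goal of denoising while preserving the sensor's ultra-low latency and real-time processing. Existing event denoising methods face a trade-off: computationally heavy approaches sacrifice speed, while lightweight ones lack robustness. The paper proposes a new event denoising framework based on State Space Models (SSMs). Its key idea is to represent events as 4D event clouds and to model spatiotemporal features with a Coarse Feature Extraction module, a Spatial Mamba (S-SSM), and a Temporal Mamba (T-SSM), improving denoising performance while staying efficient.
Link: https://arxiv.org/abs/2505.05391
Authors: Ciyu Ruan,Zihang Gong,Ruishan Guo,Jingao Xu,Xinlei Chen
Affiliations: Tsinghua University; Harbin Institute of Technology; Carnegie Mellon University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: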
Abstract:Event cameras excel in high-speed vision due to their high temporal resolution, high dynamic range, and low power consumption. However, as dynamic vision sensors, their output is inherently noisy, making efficient denoising essential to preserve their ultra-low latency and real-time processing capabilities. Existing event denoising methods struggle with a critical dilemma: computationally intensive approaches compromise the sensor’s high-speed advantage, while lightweight methods often lack robustness across varying noise levels. To address this, we propose a novel event denoising framework based on State Space Models (SSMs). Our approach represents events as 4D event clouds and includes a Coarse Feature Extraction (CFE) module that extracts embedding features from both geometric and polarity-aware subspaces. The model is further composed of two essential components: A Spatial Mamba (S-SSM) that models local geometric structures and a Temporal Mamba (T-SSM) that captures global temporal dynamics, efficiently propagating spatiotemporal features across events. Experiments demonstrate that our method achieves state-of-the-art accuracy and efficiency, with 88.89K parameters, 0.0685s per 100K events inference time, and a 0.982 accuracy score, outperforming Transformer-based methods by 2.08% in denoising accuracy and 36X faster.
[CV-9] GeomHair: Reconstruction of Hair Strands from Colorless 3D Scans
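[Quick Read]: This paper addresses the reconstruction of hair strands directly from colorless 3D scans, a problem of importance in computer vision and graphics with applications in high-fidelity digital avatar synthesis, animation, and AR/VR. Existing methods typically rely on RGB captures, which are sensitive to the environment and make it hard to extract the orientation of guiding strands for complex hairstyles. The key to the proposed solution is multi-modal hair orientation extraction: it detects sharp surface features directly on the scan and estimates strand orientation with a neural 2D line detector applied to renderings of the scan shading. It also adds a diffusion prior, refined with an improved noise schedule and adapted via a scan-specific text prompt, so that both simple and intricate hairstyles can be reconstructed accurately without color information.
Link: https://arxiv.org/abs/2505.05376
Authors: Rachmadio Noval Lazuardi,Artem Sevastopolsky,Egor Zakharov,Matthias Niessner,Vanessa Sklyarova
Affiliations: Technical University of Munich; ETH Zürich; Max Planck Institute for Intelligent Systems
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 15 pages, 9 figures, 1 table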
Abstract:We propose a novel method that reconstructs hair strands directly from colorless 3D scans by leveraging multi-modal hair orientation extraction. Hair strand reconstruction is a fundamental problem in computer vision and graphics that can be used for high-fidelity digital avatar synthesis, animation, and AR/VR applications. However, accurately recovering hair strands from raw scan data remains challenging due to human hair’s complex and fine-grained structure. Existing methods typically rely on RGB captures, which can be sensitive to the environment and can be a challenging domain for extracting the orientation of guiding strands, especially in the case of challenging hairstyles. To reconstruct the hair purely from the observed geometry, our method finds sharp surface features directly on the scan and estimates strand orientation through a neural 2D line detector applied to the renderings of scan shading. Additionally, we incorporate a diffusion prior trained on a diverse set of synthetic hair scans, refined with an improved noise schedule, and adapted to the reconstructed contents via a scan-specific text prompt. We demonstrate that this combination of supervision signals enables accurate reconstruction of both simple and intricate hairstyles without relying on color information. To facilitate further research, we introduce Strands400, the largest publicly available dataset of hair strands with detailed surface geometry extracted from real-world data, which contains reconstructed hair strands from the scans of 400 subjects.
[CV-10] Threshold Modulation for Online Test-Time Adaptation of Spiking Neural Networks IJCNN2025
[Quick Read]: This paper addresses the limited ability of spiking neural networks (SNNs) to adapt to changes in data distribution after deployment. Existing online test-time adaptation (OTTA) methods are designed mainly for traditional artificial neural networks and do not suit SNNs. The key to the proposed solution is Threshold Modulation (TM), a low-power, neuromorphic chip-friendly OTTA framework that dynamically adjusts the firing threshold through a normalization inspired by neuronal dynamics, improving SNN generalization under distribution shifts at low computational cost.
Link: https://arxiv.org/abs/2505.05375
Authors: Kejie Zhao,Wenjia Hua,Aiersi Tuerhong,Luziwei Leng,Yuxin Ma,Qinghua Guo
Affiliations: Southern University of Science and Technology; Chongqing University; Huawei Technologies Co., Ltd.
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)
Comments: Accepted by IJCNN 2025. © 2025 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, including reprinting/republishing this material for advertising or promotional purposes, collecting new collected works for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works
Abstract:Recently, spiking neural networks (SNNs), deployed on neuromorphic chips, provide highly efficient solutions on edge devices in different scenarios. However, their ability to adapt to distribution shifts after deployment has become a crucial challenge. Online test-time adaptation (OTTA) offers a promising solution by enabling models to dynamically adjust to new data distributions without requiring source data or labeled target samples. Nevertheless, existing OTTA methods are largely designed for traditional artificial neural networks and are not well-suited for SNNs. To address this gap, we propose a low-power, neuromorphic chip-friendly online test-time adaptation framework, aiming to enhance model generalization under distribution shifts. The proposed approach is called Threshold Modulation (TM), which dynamically adjusts the firing threshold through neuronal dynamics-inspired normalization, being more compatible with neuromorphic hardware. Experimental results on benchmark datasets demonstrate the effectiveness of this method in improving the robustness of SNNs against distribution shifts while maintaining low computational cost. The proposed method offers a practical solution for online test-time adaptation of SNNs, providing inspiration for the design of future neuromorphic chips. The demo code is available at this http URL.
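One way to see why a normalization can be expressed as a threshold adjustment: applying an affine normalization to the membrane potential and comparing against a fixed threshold is the same as comparing the raw potential against a shifted, rescaled threshold. The sketch below shows only this general identity, assuming positive `sigma` and `gamma`; it is not claimed to be the paper's exact formulation, and all names are hypothetical.

```python
def spikes_with_norm(v, theta, mu, sigma, gamma, beta):
    """Fire if the normalized potential gamma * (v - mu) / sigma + beta reaches theta."""
    return (v - mu) / sigma * gamma + beta >= theta


def effective_threshold(theta, mu, sigma, gamma, beta):
    """Threshold on the raw potential that gives the same firing decision.

    Derived by solving gamma * (v - mu) / sigma + beta >= theta for v,
    which requires sigma > 0 and gamma > 0.
    """
    return mu + sigma * (theta - beta) / gamma


def spikes_with_threshold(v, theta_eff):
    """Fire if the raw potential reaches the effective threshold."""
    return v >= theta_eff
```

Under this identity, updating the statistics `mu` and `sigma` on test data only changes a per-neuron threshold, which suits hardware that exposes thresholds but has no normalization layers.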
[CV-11] Joint Super-Resolution and Segmentation for 1-m Impervious Surface Area Mapping in China's Yangtze River Economic Belt
[Quick Read]: This paper addresses high-precision mapping of impervious surface area (ISA), in particular how to produce 1-meter ISA maps from low-cost, freely available Sentinel-2 imagery. The key to the solution is JointSeg, a framework that joins super-resolution and segmentation. Trained on multimodal cross-resolution inputs, it raises resolution gradually from 10 m to 1 m while preserving fine-grained spatial textures, and it maintains classification accuracy through effective cross-scale feature fusion. The method performs well in complex terrain and mixed urban-rural regions, improving the accuracy and applicability of ISA identification.
Link: https://arxiv.org/abs/2505.05367
Authors: Jie Deng,Danfeng Hong,Chenyu Li,Naoto Yokoya
Affiliations: Aerospace Information Research Institute, Chinese Academy of Sciences, Beijing 100094, China; School of Electronic, Electrical and Communication Engineering, University of Chinese Academy of Sciences, Beijing 100049, China; School of Mathematics, Southeast University, Nanjing 210096, China; Graduate School of Frontier Sciences, the University of Tokyo, Chiba 277-8561, Japan; RIKEN Center for Advanced Intelligence Project, Tokyo 103-0027, Japan
Categories: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
Comments:
Abstract:We propose a novel joint framework by integrating super-resolution and segmentation, called JointSeg, which enables the generation of 1-meter ISA maps directly from freely available Sentinel-2 imagery. JointSeg was trained on multimodal cross-resolution inputs, offering a scalable and affordable alternative to traditional approaches. This synergistic design enables gradual resolution enhancement from 10m to 1m while preserving fine-grained spatial textures, and ensures high classification fidelity through effective cross-scale feature fusion. This method has been successfully applied to the Yangtze River Economic Belt (YREB), a region characterized by complex urban-rural patterns and diverse topography. As a result, a comprehensive ISA mapping product for 2021, referred to as ISA-1, was generated, covering an area of over 2.2 million square kilometers. Quantitative comparisons against the 10m ESA WorldCover and other benchmark products reveal that ISA-1 achieves an F1-score of 85.71%, outperforming bilinear-interpolation-based segmentation by 9.5%, and surpassing other ISA datasets by 21.43%-61.07%. In densely urbanized areas (e.g., Suzhou, Nanjing), ISA-1 reduces ISA overestimation through improved discrimination of green spaces and water bodies. Conversely, in mountainous regions (e.g., Ganzi, Zhaotong), it identifies significantly more ISA due to its enhanced ability to detect fragmented anthropogenic features such as rural roads and sparse settlements, demonstrating its robustness across diverse landscapes. Moreover, we present biennial ISA maps from 2017 to 2023, capturing spatiotemporal urbanization dynamics across representative cities. The results highlight distinct regional growth patterns: rapid expansion in upstream cities, moderate growth in midstream regions, and saturation in downstream metropolitan areas.
[CV-12] me of the Flight of the Gaussians: Optimizing Depth Indirectly in Dynamic Radiance Fields
【速读】:该论文试图解决从单目连续波时间飞行(C-ToF)相机中重建动态场景的问题,特别是在缺乏直接测量深度信息的情况下实现高保真度的3D重建。其解决方案的关键在于将两种启发式方法引入优化过程,以提升由高斯表示的场景几何精度,从而在受限的C-ToF传感条件下获得准确的重建结果,且相比神经体积方法具有更高的效率。
链接: https://arxiv.org/abs/2505.05356
作者: Runfeng Li,Mikhail Okunev,Zixuan Guo,Anh Ha Duong,Christian Richardt,Matthew O’Toole,James Tompkin
机构: 未知
类目: Graphics (cs.GR); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:We present a method to reconstruct dynamic scenes from monocular continuous-wave time-of-flight (C-ToF) cameras using raw sensor samples that achieves similar or better accuracy than neural volumetric approaches and is 100x faster. Quickly achieving high-fidelity dynamic 3D reconstruction from a single viewpoint is a significant challenge in computer vision. In C-ToF radiance field reconstruction, the property of interest, depth, is not directly measured, causing an additional challenge. This problem has a large and underappreciated impact upon the optimization when using a fast primitive-based scene representation like 3D Gaussian splatting, which is commonly used with multi-view data to produce satisfactory results and is brittle in its optimization otherwise. We incorporate two heuristics into the optimization to improve the accuracy of scene geometry represented by Gaussians. Experimental results show that our approach produces accurate reconstructions under constrained C-ToF sensing conditions, including for fast motions like swinging baseball bats. this https URL
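To make concrete why depth is "not directly measured" by a C-ToF camera, the sketch below uses the textbook four-sample model: the sensor records correlation samples at four phase offsets, and depth follows from the recovered phase. The sample model `A * cos(phi + k * pi / 2) + B`, the sign convention, and the function names are assumptions for illustration; real sensors and the paper's pipeline may differ.

```python
import math

SPEED_OF_LIGHT = 299_792_458.0  # meters per second


def simulate_samples(depth_m, freq_hz, amplitude=1.0, offset=2.0):
    """Ideal raw samples for a surface at depth_m under the assumed model."""
    phi = 4.0 * math.pi * freq_hz * depth_m / SPEED_OF_LIGHT  # round-trip phase
    return [amplitude * math.cos(phi + k * math.pi / 2) + offset for k in range(4)]


def depth_from_samples(samples, freq_hz):
    """Recover depth from four raw samples via the phase of the modulation."""
    c0, c1, c2, c3 = samples
    # Under the assumed model, c0 - c2 = 2A cos(phi) and c3 - c1 = 2A sin(phi).
    phi = math.atan2(c3 - c1, c0 - c2) % (2.0 * math.pi)
    return SPEED_OF_LIGHT * phi / (4.0 * math.pi * freq_hz)
```

Because the phase wraps every `2 * pi`, depths beyond `SPEED_OF_LIGHT / (2 * freq_hz)` alias back into range (about 5 m at 30 MHz), and the depth is a nonlinear function of the raw samples, both of which complicate optimizing geometry directly against raw measurements.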
[CV-13] Hearing and Seeing Through CLIP: A Framework for Self-Supervised Sound Source Localization WACV2024 ACL
[Quick read]: This paper addresses sound source localization, aiming to align and understand audio and visual information with a multimodal model. The key to the solution is to extend CLIP to the audio domain with a self-supervised method that needs no explicit text input: audio is mapped into tokens compatible with CLIP's text encoder to produce audio-driven embeddings, and visual features are aligned with them through a contrastive audio-visual correspondence objective, giving more complete and compact localization of sounding objects.
Link: https://arxiv.org/abs/2505.05343
Authors: Sooyoung Park,Arda Senocak,Joon Son Chung
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Comments: Journal Extension of WACV 2024 paper ( arXiv:2311.04066 ). Code is available at this https URL
Abstract:Large-scale vision-language models demonstrate strong multimodal alignment and generalization across diverse tasks. Among them, CLIP stands out as one of the most successful approaches. In this work, we extend the application of CLIP to sound source localization, proposing a self-supervised method that operates without explicit text input. We introduce a framework that maps audio into tokens compatible with CLIP’s text encoder, producing audio-driven embeddings. These embeddings are used to generate sounding region masks, from which visual features are extracted and aligned with the audio embeddings through a contrastive audio-visual correspondence objective. Our findings show that the alignment knowledge of a pre-trained multimodal foundation model enables our method to generate more complete and compact localization for sounding objects. We further propose an LLM-guided extension that distills object-aware audio-visual scene understanding into the model during training to enhance alignment. Extensive experiments across five diverse tasks demonstrate that our method, in all variants, outperforms state-of-the-art approaches and achieves strong generalization in zero-shot settings.
[CV-14] Progressive Inertial Poser: Progressive Real-Time Kinematic Chain Estimation for 3D Full-Body Pose from Three IMU Sensors
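[Quick read]: This paper addresses full-body pose estimation for virtual reality applications, specifically how to achieve accurate full-body motion capture from limited inertial measurement unit (IMU) data without external visual sensors or additional sensors worn on the pelvis and lower body. The key to the solution is Progressive Inertial Poser (ProgIP), which combines neural network estimation with a human dynamics model and uses inertial data from three IMU sensors worn on the head and wrists, reconstructing full-body motion in real time through a multi-stage progressive network.
Link: https://arxiv.org/abs/2505.05336
Authors: Zunjie Zhu,Yan Zhao,Yihan Hu,Guoxiang Wang,Hai Qiu,Bolun Zheng,Chenggang Yan,Feng Xu
Affiliations: Hangzhou Dianzi University; Tiangong University; Lishui University; Costar Intelligent Optoelectronics Technology Co.,Ltd; Macao Polytechnic University; Tsinghua University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: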
Abstract:The motion capture system that supports full-body virtual representation is of key significance for virtual reality. Compared to vision-based systems, full-body pose estimation from sparse tracking signals is not limited by environmental conditions or recording range. However, previous works either face the challenge of wearing additional sensors on the pelvis and lower-body or rely on external visual sensors to obtain global positions of key joints. To improve the practicality of the technology for virtual reality applications, we estimate full-body poses using only inertial data obtained from three Inertial Measurement Unit (IMU) sensors worn on the head and wrists, thereby reducing the complexity of the hardware system. In this work, we propose a method called Progressive Inertial Poser (ProgIP) for human pose estimation, which combines neural network estimation with a human dynamics model, considers the hierarchical structure of the kinematic chain, and employs a multi-stage progressive network estimation with increased depth to reconstruct full-body motion in real time. The encoder combines Transformer Encoder and bidirectional LSTM (TE-biLSTM) to flexibly capture the temporal dependencies of the inertial sequence, while the decoder based on multi-layer perceptrons (MLPs) transforms high-dimensional features and accurately projects them onto Skinned Multi-Person Linear (SMPL) model parameters. Quantitative and qualitative experimental results on multiple public datasets show that our method outperforms state-of-the-art methods with the same inputs, and is comparable to recent works using six IMU sensors.
[CV-15] Aesthetics Without Semantics
[Quick read]: This paper addresses the bias of current databases in aesthetics research: they contain mostly beautiful images, which complicates the understanding and prediction of aesthetic judgements. The key to the solution is to build a database of images with minimal semantic content (Minimum Semantic Content, MSC) and to devise a method that generates images on the ugly side of aesthetic valuations, filling the gap in existing datasets. This enables a more complete analysis of the relationship between image features and aesthetic valuation.
Link: https://arxiv.org/abs/2505.05331
Authors: C. Alejandro Parraga(1 and 2),Olivier Penacchio(1, 2 and 3),Marcos Muňoz Gonzalez(1),Bogdan Raducanu(2),Xavier Otazu(1 and 2) ((1) Comp. Sci. Dept., Engineering School, Universitat Autònoma de Barcelona (UAB), Campus UAB-Bellaterra, 08193, Barcelona, Spain, (2) Computer Vision Centre, Campus UAB, Bellaterra, 08193, Barcelona, Spain, (3) School of Psychology and Neuroscience, University of St Andrews, St Andrews, Fife KY16 9JP, United Kingdom)
Affiliations: Universitat Autònoma de Barcelona (UAB); Computer Vision Centre; University of St Andrews
Categories: Computer Vision and Pattern Recognition (cs.CV); Neurons and Cognition (q-bio.NC); Computation (stat.CO)
Comments: Parts of this work were presented in abstract format at the Vision Science of Art Conference (VSAC2016), the Iberian Conference on Perception (CIP2022), and the European Conference on Visual Perception (ECVP2022). See Perception 51, No1 (Suppl.) pp139, 2022)
Abstract:While it is easy for human observers to judge an image as beautiful or ugly, aesthetic decisions result from a combination of entangled perceptual and cognitive (semantic) factors, making the understanding of aesthetic judgements particularly challenging from a scientific point of view. Furthermore, our research shows a prevailing bias in current databases, which include mostly beautiful images, further complicating the study and prediction of aesthetic responses. We address these limitations by creating a database of images with minimal semantic content and devising, and next exploiting, a method to generate images on the ugly side of aesthetic valuations. The resulting Minimum Semantic Content (MSC) database consists of a large and balanced collection of 10,426 images, each evaluated by 100 observers. We next use established image metrics to demonstrate how augmenting an image set biased towards beautiful images with ugly images can modify, or even invert, an observed relationship between image features and aesthetics valuation. Taken together, our study reveals that works in empirical aesthetics attempting to link image content and aesthetic judgements may magnify, underestimate, or simply miss interesting effects due to a limitation of the range of aesthetic values they consider.
[CV-16] Feature-Augmented Deep Networks for Multiscale Building Segmentation in High-Resolution UAV and Satellite Imagery
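[Quick read]: This paper addresses building segmentation in high-resolution RGB imagery, which is difficult because of spectral similarity with non-building features, shadows, and irregular building geometries. The key to the solution is a deep learning framework for multiscale building segmentation from RGB aerial and satellite imagery: feature-augmented inputs (Principal Component Analysis, Visible Difference Vegetation Index, Morphological Building Index, and Sobel edge filters) guide a Res-U-Net architecture to learn complex spatial patterns more effectively, and training policies that combine layer freezing, cyclical learning rates, and SuperConvergence reduce training time and resource usage.
Link: https://arxiv.org/abs/2505.05321
Authors: Chintan B. Maniyar,Minakshi Kumar,Gengchen Mai
Affiliations: University of Georgia; Indian Institute of Remote Sensing; University of Texas at Austin
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: in preparation for journal submission, 25 pages, 11 figures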
Abstract:Accurate building segmentation from high-resolution RGB imagery remains challenging due to spectral similarity with non-building features, shadows, and irregular building geometries. In this study, we present a comprehensive deep learning framework for multiscale building segmentation using RGB aerial and satellite imagery with spatial resolutions ranging from 0.4m to 2.7m. We curate a diverse, multi-sensor dataset and introduce feature-augmented inputs by deriving secondary representations including Principal Component Analysis (PCA), Visible Difference Vegetation Index (VDVI), Morphological Building Index (MBI), and Sobel edge filters from RGB channels. These features guide a Res-U-Net architecture in learning complex spatial patterns more effectively. We also propose training policies incorporating layer freezing, cyclical learning rates, and SuperConvergence to reduce training time and resource usage. Evaluated on a held-out WorldView-3 image, our model achieves an overall accuracy of 96.5%, an F1-score of 0.86, and an Intersection over Union (IoU) of 0.80, outperforming existing RGB-based benchmarks. This study demonstrates the effectiveness of combining multi-resolution imagery, feature augmentation, and optimized training strategies for robust building segmentation in remote sensing applications.
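As a rough illustration of feature-augmented inputs, the sketch below derives two of the secondary channels named in the abstract, VDVI and a Sobel edge magnitude, from plain RGB values. It assumes the commonly used definition VDVI = (2G - R - B) / (2G + R + B) and a simple channel-mean grayscale; whether the paper uses exactly these forms is an assumption, PCA and MBI are omitted, and the function names are hypothetical.

```python
def vdvi(r, g, b):
    """Visible-band difference vegetation index for one pixel (assumed standard form)."""
    denom = 2.0 * g + r + b
    return 0.0 if denom == 0 else (2.0 * g - r - b) / denom


def sobel_magnitude(gray):
    """Sobel gradient magnitude of a 2D list of floats; border pixels are left at 0."""
    h, w = len(gray), len(gray[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            gx = (gray[y - 1][x + 1] + 2 * gray[y][x + 1] + gray[y + 1][x + 1]) \
                - (gray[y - 1][x - 1] + 2 * gray[y][x - 1] + gray[y + 1][x - 1])
            gy = (gray[y + 1][x - 1] + 2 * gray[y + 1][x] + gray[y + 1][x + 1]) \
                - (gray[y - 1][x - 1] + 2 * gray[y - 1][x] + gray[y - 1][x + 1])
            out[y][x] = (gx * gx + gy * gy) ** 0.5
    return out


def augment(rgb):
    """Map a 2D list of (r, g, b) tuples to (r, g, b, vdvi, edge) tuples."""
    gray = [[(r + g + b) / 3.0 for (r, g, b) in row] for row in rgb]
    edges = sobel_magnitude(gray)
    return [
        [(r, g, b, vdvi(r, g, b), edges[y][x]) for x, (r, g, b) in enumerate(row)]
        for y, row in enumerate(rgb)
    ]


if __name__ == "__main__":
    # A 3x3 toy image with a dark left column and a bright remainder.
    image = [[(0.0, 0.0, 0.0), (1.0, 1.0, 1.0), (1.0, 1.0, 1.0)] for _ in range(3)]
    print(augment(image)[1][1])
```

The extra channels would simply be stacked with the RGB bands before being fed to the segmentation network, so the network input has more than three channels.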
[CV-17] Mapping User Trust in Vision Language Models: Research Landscape Challenges and Prospects
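[Quick read]: This paper addresses how users' trust changes during interactions with Vision Language Models (VLMs), and how to protect users and inform them about when to trust these systems. The key to the solution is a multi-disciplinary taxonomy covering different cognitive science capabilities, collaboration modes, and agent behaviours, which is used to systematically analyze trust dynamics between users and VLMs and to derive preliminary requirements for future VLM trust studies.
Link: https://arxiv.org/abs/2505.05318
Authors: Agnese Chiatti,Sara Bernardini,Lara Shibelski Godoy Piccolo,Viola Schiaffonati,Matteo Matteucci
Affiliations: Politecnico di Milano; University of Oxford; CODE University of Applied Sciences
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Human-Computer Interaction (cs.HC); Robotics (cs.RO)
Comments: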
Abstract:The rapid adoption of Vision Language Models (VLMs), pre-trained on large image-text and video-text datasets, calls for protecting and informing users about when to trust these systems. This survey reviews studies on trust dynamics in user-VLM interactions, through a multi-disciplinary taxonomy encompassing different cognitive science capabilities, collaboration modes, and agent behaviours. Literature insights and findings from a workshop with prospective VLM users inform preliminary requirements for future VLM trust studies.
[CV-18] PRE-Mamba: A 4D State Space Model for Ultra-High-Frequent Event Camera Deraining
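[Quick read]: This paper addresses event camera deraining, where rain produces dense noise and existing methods trade off temporal precision, deraining effectiveness, and computational efficiency. The key to the solution is the PRE-Mamba framework, which uses a 4D event cloud representation that integrates dual temporal scales to preserve high temporal precision, a Spatio-Temporal Decoupling and Fusion module (STDF) to strengthen deraining, and a Multi-Scale State Space Model (MS3M) that captures deeper rain dynamics with linear computational complexity.
Link: https://arxiv.org/abs/2505.05307
Authors: Ciyu Ruan,Ruishan Guo,Zihang Gong,Jingao Xu,Wenhan Yang,Xinlei Chen
Affiliations: Tsinghua University; Harbin Institute of Technology; Carnegie Mellon University; Pengcheng Laboratory
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: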
Abstract:Event cameras excel in high temporal resolution and dynamic range but suffer from dense noise in rainy conditions. Existing event deraining methods face trade-offs between temporal precision, deraining effectiveness, and computational efficiency. In this paper, we propose PRE-Mamba, a novel point-based event camera deraining framework that fully exploits the spatiotemporal characteristics of raw event and rain. Our framework introduces a 4D event cloud representation that integrates dual temporal scales to preserve high temporal precision, a Spatio-Temporal Decoupling and Fusion module (STDF) that enhances deraining capability by enabling shallow decoupling and interaction of temporal and spatial information, and a Multi-Scale State Space Model (MS3M) that captures deeper rain dynamics across dual-temporal and multi-spatial scales with linear computational complexity. Enhanced by frequency-domain regularization, PRE-Mamba achieves superior performance (0.95 SR, 0.91 NR, and 0.4s/M events) with only 0.26M parameters on EventRain-27K, a comprehensive dataset with labeled synthetic and real-world sequences. Moreover, our method generalizes well across varying rain intensities, viewpoints, and even snowy conditions.
[CV-19] PlaceIt3D: Language-Guided Object Placement in Real 3D Scenes
[Quick read]: This paper addresses Language-Guided Object Placement in Real 3D Scenes. Given a 3D scene's point cloud, a 3D asset, and a textual prompt, the task is to find a valid placement that respects the prompt and the scene geometry. The key to the solution is a new benchmark and evaluation protocol, a new dataset for training 3D large language models (LLMs), and a first method that serves as a non-trivial baseline, addressing the ambiguity caused by multiple valid solutions and the need to reason about 3D geometric relationships and free space.
Link: https://arxiv.org/abs/2505.05288
Authors: Ahmed Abdelreheem,Filippo Aleotti,Jamie Watson,Zawar Qureshi,Abdelrahman Eldesokey,Peter Wonka,Gabriel Brostow,Sara Vicente,Guillermo Garcia-Hernando
Affiliations: Niantic Spatial; KAUST; UCL
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Robotics (cs.RO)
Comments: Tech report. Project page: this https URL
Abstract:We introduce the novel task of Language-Guided Object Placement in Real 3D Scenes. Our model is given a 3D scene’s point cloud, a 3D asset, and a textual prompt broadly describing where the 3D asset should be placed. The task here is to find a valid placement for the 3D asset that respects the prompt. Compared with other language-guided localization tasks in 3D scenes such as grounding, this task has specific challenges: it is ambiguous because it has multiple valid solutions, and it requires reasoning about 3D geometric relationships and free space. We inaugurate this task by proposing a new benchmark and evaluation protocol. We also introduce a new dataset for training 3D LLMs on this task, as well as the first method to serve as a non-trivial baseline. We believe that this challenging task and our new benchmark could become part of the suite of benchmarks used to evaluate and compare generalist 3D LLM models.
[CV-20] MTL-UE: Learning to Learn Nothing for Multi-Task Learning ICML2025
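[Quick read]: This paper addresses a gap in research on unlearnable strategies for multi-task data and multi-task learning (MTL) models: existing methods mainly target single-task learning (STL) models and neglect data protection in the MTL setting. The key to the solution is MTL-UE, the first unified framework for generating unlearnable examples for multi-task data and MTL models. Its core contributions are a generator-based structure that introduces label priors and class-wise feature embeddings to improve attacking performance, and intra-task and inter-task embedding regularization that increases inter-class separation and suppresses intra-class variance, which greatly improves attack robustness.
Link: https://arxiv.org/abs/2505.05279
Authors: Yi Yu,Song Xia,Siyuan Yang,Chenqi Kong,Wenhan Yang,Shijian Lu,Yap-Peng Tan,Alex C. Kot
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by ICML 2025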
Abstract:Most existing unlearnable strategies focus on preventing unauthorized users from training single-task learning (STL) models with personal data. Nevertheless, the paradigm has recently shifted towards multi-task data and multi-task learning (MTL), targeting generalist and foundation models that can handle multiple tasks simultaneously. Despite their growing importance, MTL data and models have been largely neglected while pursuing unlearnable strategies. This paper presents MTL-UE, the first unified framework for generating unlearnable examples for multi-task data and MTL models. Instead of optimizing perturbations for each sample, we design a generator-based structure that introduces label priors and class-wise feature embeddings which leads to much better attacking performance. In addition, MTL-UE incorporates intra-task and inter-task embedding regularization to increase inter-class separation and suppress intra-class variance which enhances the attack robustness greatly. Furthermore, MTL-UE is versatile with good supports for dense prediction tasks in MTL. It is also plug-and-play allowing integrating existing surrogate-dependent unlearnable methods with little adaptation. Extensive experiments show that MTL-UE achieves superior attacking performance consistently across 4 MTL datasets, 3 base UE methods, 5 model backbones, and 5 MTL task-weighting strategies.
[CV-21] PADriver: Towards Personalized Autonomous Driving
[Quick read]: This paper addresses decision making in Personalized Autonomous Driving (PAD), covering scene understanding, danger level estimation, and action decision in complex traffic. The key to the solution is PADriver, a closed-loop framework built on a Multi-modal Large Language Model (MLLM) that takes streaming frames and personalized textual prompts as input and performs scene understanding and danger level estimation, so that the final action has an explicit risk reference and matches the user's preferences.
Link: https://arxiv.org/abs/2505.05240
Authors: Genghua Kou,Fan Jia,Weixin Mao,Yingfei Liu,Yucheng Zhao,Ziheng Zhang,Osamu Yoshie,Tiancai Wang,Ying Li,Xiangyu Zhang
Affiliations: Beijing Institute of Technology; Megvii Technology; Waseda University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:In this paper, we propose PADriver, a novel closed-loop framework for personalized autonomous driving (PAD). Built upon Multi-modal Large Language Model (MLLM), PADriver takes streaming frames and personalized textual prompts as inputs. It autoregressively performs scene understanding, danger level estimation and action decision. The predicted danger level reflects the risk of the potential action and provides an explicit reference for the final action, which corresponds to the preset personalized prompt. Moreover, we construct a closed-loop benchmark named PAD-Highway based on Highway-Env simulator to comprehensively evaluate the decision performance under traffic rules. The dataset contains 250 hours of videos with high-quality annotation to facilitate the development of PAD behavior analysis. Experimental results on the constructed benchmark show that PADriver outperforms state-of-the-art approaches on different evaluation metrics, and enables various driving modes.
[CV-22] Does CLIP perceive art the same way we do?
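[Quick read]: This paper asks whether CLIP perceives artworks in a way that is consistent with human perception, that is, whether it can extract high-level semantic and stylistic information from paintings as humans do. The key to the solution is to design targeted probing tasks and compare CLIP's responses with human annotations and expert benchmarks, evaluating its perception of content, scene understanding, artistic style, historical period, and visual deformations or artifacts, and thereby revealing its strengths and limitations with respect to aesthetic cues and artistic intent.
Link: https://arxiv.org/abs/2505.05229
Authors: Andrea Asperti,Leonardo Dessì,Maria Chiara Tonetti,Nico Wu
Affiliations: University of Bologna
Categories: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
Comments: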
Abstract:CLIP has emerged as a powerful multimodal model capable of connecting images and text through joint embeddings, but to what extent does it “see” the same way humans do - especially when interpreting artworks? In this paper, we investigate CLIP’s ability to extract high-level semantic and stylistic information from paintings, including both human-created and AI-generated imagery. We evaluate its perception across multiple dimensions: content, scene understanding, artistic style, historical period, and the presence of visual deformations or artifacts. By designing targeted probing tasks and comparing CLIP’s responses to human annotations and expert benchmarks, we explore its alignment with human perceptual and contextual understanding. Our findings reveal both strengths and limitations in CLIP’s visual representations, particularly in relation to aesthetic cues and artistic intent. We further discuss the implications of these insights for using CLIP as a guidance mechanism during generative processes, such as style transfer or prompt-based image synthesis. Our work highlights the need for deeper interpretability in multimodal systems, especially when applied to creative domains where nuance and subjectivity play a central role.
[CV-23] Multi-Objective Reinforcement Learning for Adaptive Personalized Autonomous Driving
[Quick read]: This paper addresses the limited ability of autonomous vehicles to adapt to users' driving style preferences: existing end-to-end driving approaches often rely on predefined driving styles or require continuous user feedback, which limits their support for dynamic, context-dependent preferences. The key to the solution is Multi-Objective Reinforcement Learning (MORL) with preference-driven optimization, where preferences are encoded as continuous weight vectors that modulate interpretable style objectives (efficiency, comfort, speed, and aggressiveness), enabling runtime adaptation to driving style preferences without retraining the policy.
Link: https://arxiv.org/abs/2505.05223
Authors: Hendrik Surmann,Jorge de Heuvel,Maren Bennewitz
Affiliations: University of Bonn, Germany; Lamarr Institute for Machine Learning and Artificial Intelligence; Center for Robotics, Bonn, Germany; Robotics Institute Germany; Federal Ministry of Education and Research of Germany; state of North-Rhine Westphalia; LAMARR22B
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:
Abstract:Human drivers exhibit individual preferences regarding driving style. Adapting autonomous vehicles to these preferences is essential for user trust and satisfaction. However, existing end-to-end driving approaches often rely on predefined driving styles or require continuous user feedback for adaptation, limiting their ability to support dynamic, context-dependent preferences. We propose a novel approach using multi-objective reinforcement learning (MORL) with preference-driven optimization for end-to-end autonomous driving that enables runtime adaptation to driving style preferences. Preferences are encoded as continuous weight vectors to modulate behavior along interpretable style objectives (including efficiency, comfort, speed, and aggressiveness) without requiring policy retraining. Our single-policy agent integrates vision-based perception in complex mixed-traffic scenarios and is evaluated in diverse urban environments using the CARLA simulator. Experimental results demonstrate that the agent dynamically adapts its driving behavior according to changing preferences while maintaining performance in terms of collision avoidance and route completion.
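One simple way to picture preferences encoded as continuous weight vectors is linear scalarization of a vector-valued reward, sketched below. This is an assumption chosen for illustration, not necessarily the paper's optimization scheme; the objective ordering, the `scalarize` and `conditioned_observation` helpers, and the numeric values are hypothetical.

```python
OBJECTIVES = ("efficiency", "comfort", "speed", "aggressiveness")


def normalize(preference):
    """Rescale non-negative preference weights so they sum to 1."""
    if len(preference) != len(OBJECTIVES):
        raise ValueError("one weight per objective is required")
    total = sum(preference)
    if total <= 0 or any(p < 0 for p in preference):
        raise ValueError("weights must be non-negative and not all zero")
    return [p / total for p in preference]


def scalarize(reward_vector, preference):
    """Collapse a per-objective reward vector into one scalar using the preference weights."""
    weights = normalize(preference)
    return sum(w * r for w, r in zip(weights, reward_vector))


def conditioned_observation(observation, preference):
    """Append the weights to the observation so a single policy can see the current preference."""
    return list(observation) + normalize(preference)


if __name__ == "__main__":
    step_reward = [0.2, 0.9, 0.1, 0.0]  # hypothetical per-objective rewards for one step
    print(scalarize(step_reward, [0.1, 0.7, 0.1, 0.1]))  # comfort-heavy preference
    print(scalarize(step_reward, [0.1, 0.1, 0.7, 0.1]))  # speed-heavy preference
```

Because the weights are an input rather than a fixed hyperparameter, changing them at runtime changes which behavior scores well without touching the policy parameters.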
[CV-24] Diffusion Model Quantization: A Review
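[Quick read]: This paper addresses the efficient deployment of diffusion models on resource-constrained edge devices, where the core challenge is to compress and accelerate the models through quantization. The key to the solution is a systematic review of recent progress in diffusion model quantization, including methods for mainstream architectures such as U-Net and Diffusion Transformers (DiT), together with a qualitative and quantitative evaluation of representative schemes, in order to advance quantization research for generative models in practical applications.
Link: https://arxiv.org/abs/2505.05215
Authors: Qian Zeng,Chenggong Hu,Mingli Song,Jie Song
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 40 pages, 8 figures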
Abstract:Recent success of large text-to-image models has empirically underscored the exceptional performance of diffusion models in generative tasks. To facilitate their efficient deployment on resource-constrained edge devices, model quantization has emerged as a pivotal technique for both compression and acceleration. This survey offers a thorough review of the latest advancements in diffusion model quantization, encapsulating and analyzing the current state of the art in this rapidly advancing domain. First, we provide an overview of the key challenges encountered in the quantization of diffusion models, including those based on U-Net architectures and Diffusion Transformers (DiT). We then present a comprehensive taxonomy of prevalent quantization techniques, engaging in an in-depth discussion of their underlying principles. Subsequently, we perform a meticulous analysis of representative diffusion model quantization schemes from both qualitative and quantitative perspectives. From a quantitative standpoint, we rigorously benchmark a variety of methods using widely recognized datasets, delivering an extensive evaluation of the most recent and impactful research in the field. From a qualitative standpoint, we categorize and synthesize the effects of quantization errors, elucidating these impacts through both visual analysis and trajectory examination. In conclusion, we outline prospective avenues for future research, proposing novel directions for the quantization of generative models in practical applications. The list of related papers, corresponding codes, pre-trained models and comparison results are publicly available at the survey project homepage this https URL.
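For readers new to the topic, the sketch below shows generic uniform affine (min-max) quantization of a list of floats, the basic operation that quantization schemes build on. It is not taken from any method in the survey; the function names and the choice to extend the range to include zero are assumptions made for illustration.

```python
def quantize(values, num_bits=8):
    """Map floats to integers in [0, 2**num_bits - 1] with a shared scale and zero point."""
    qmin, qmax = 0, 2 ** num_bits - 1
    # Extend the range to include 0 so that zero is representable.
    lo, hi = min(min(values), 0.0), max(max(values), 0.0)
    scale = (hi - lo) / (qmax - qmin) if hi > lo else 1.0
    zero_point = max(qmin, min(qmax, round(-lo / scale)))
    q = [max(qmin, min(qmax, round(v / scale) + zero_point)) for v in values]
    return q, scale, zero_point


def dequantize(q, scale, zero_point):
    """Recover approximate floats from the integer codes."""
    return [scale * (x - zero_point) for x in q]


def max_error(values, num_bits):
    """Largest absolute round-trip error for the given bit width."""
    q, scale, zp = quantize(values, num_bits)
    return max(abs(v - r) for v, r in zip(values, dequantize(q, scale, zp)))


if __name__ == "__main__":
    weights = [-1.0, -0.25, 0.0, 0.5, 2.0]  # hypothetical weight values
    for bits in (8, 4):
        print(bits, max_error(weights, bits))
```

Lower bit widths give a coarser grid and hence larger rounding error. In a diffusion model such errors can compound because the network is applied repeatedly over many denoising steps.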
[CV-25] HQC-NBV: A Hybrid Quantum-Classical View Planning Approach
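[Quick read]: This paper addresses efficient view planning in computer vision and robotic perception, which is critical for tasks such as search and rescue and autonomous navigation. Classical approaches, both sampling-based and deterministic, often struggle with computational scalability and solution optimality in complex settings. The proposed solution is HQC-NBV, a hybrid quantum-classical framework whose key idea is to use quantum properties to explore the parameter space efficiently while maintaining robustness and scalability. It uses a Hamiltonian formulation with multi-component cost terms and a parameter-centric variational ansatz with bidirectional alternating entanglement patterns to capture the hierarchical dependencies between viewpoint parameters.
Link: https://arxiv.org/abs/2505.05212
Authors: Xiaotong Yu,Chang Wen Chen
Affiliations: The Hong Kong Polytechnic University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: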
Abstract:Efficient view planning is a fundamental challenge in computer vision and robotic perception, critical for tasks ranging from search and rescue operations to autonomous navigation. While classical approaches, including sampling-based and deterministic methods, have shown promise in planning camera viewpoints for scene exploration, they often struggle with computational scalability and solution optimality in complex settings. This study introduces HQC-NBV, a hybrid quantum-classical framework for view planning that leverages quantum properties to efficiently explore the parameter space while maintaining robustness and scalability. We propose a specific Hamiltonian formulation with multi-component cost terms and a parameter-centric variational ansatz with bidirectional alternating entanglement patterns that capture the hierarchical dependencies between viewpoint parameters. Comprehensive experiments demonstrate that quantum-specific components provide measurable performance advantages. Compared to the classical methods, our approach achieves up to 49.2% higher exploration efficiency across diverse environments. Our analysis of entanglement architecture and coherence-preserving terms provides insights into the mechanisms of quantum advantage in robotic exploration tasks. This work represents a significant advancement in integrating quantum computing into robotic perception systems, offering a paradigm-shifting solution for various robot vision tasks.
[CV-26] EAM: Enhancing Anything with Diffusion Transformers for Blind Super-Resolution
[Quick read]: This paper addresses insufficient restoration quality in Blind Super-Resolution (BSR) by using pre-trained text-to-image (T2I) diffusion priors to improve reconstruction. The key to the solution is a new BSR method, the Enhancing Anything Model (EAM), built on the Diffusion Transformers (DiT) architecture. It introduces the Ψ-DiT block, which injects a low-resolution latent as a separable flow control and forms a triple-flow architecture that effectively uses the prior knowledge in the pre-trained DiT. It also proposes a progressive Masked Image Modeling strategy and a subject-aware prompt generation strategy to strengthen the prior guidance and generalization of T2I models.
Link: https://arxiv.org/abs/2505.05209
Authors: Haizhen Xie,Kunpeng Du,Qiangyu Yan,Sen Lu,Jianhong Han,Hanting Chen,Hailin Hu,Jie Hu
Affiliations: Huawei Noah’s Ark Lab
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Utilizing pre-trained Text-to-Image (T2I) diffusion models to guide Blind Super-Resolution (BSR) has become a predominant approach in the field. While T2I models have traditionally relied on U-Net architectures, recent advancements have demonstrated that Diffusion Transformers (DiT) achieve significantly higher performance in this domain. In this work, we introduce Enhancing Anything Model (EAM), a novel BSR method that leverages DiT and outperforms previous U-Net-based approaches. We introduce a novel block, Ψ-DiT, which effectively guides the DiT to enhance image restoration. This block employs a low-resolution latent as a separable flow injection control, forming a triple-flow architecture that effectively leverages the prior knowledge embedded in the pre-trained DiT. To fully exploit the prior guidance capabilities of T2I models and enhance their generalization in BSR, we introduce a progressive Masked Image Modeling strategy, which also reduces training costs. Additionally, we propose a subject-aware prompt generation strategy that employs a robust multi-modal model in an in-context learning framework. This strategy automatically identifies key image areas, provides detailed descriptions, and optimizes the utilization of T2I diffusion priors. Our experiments demonstrate that EAM achieves state-of-the-art results across multiple datasets, outperforming existing methods in both quantitative metrics and visual quality.
[CV-27] Concept-Based Unsupervised Domain Adaptation ICML2025
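[Quick read]: This paper addresses the degraded performance and poor generalization of Concept Bottleneck Models (CBMs) under domain shift. The key to the solution is the Concept-based Unsupervised Domain Adaptation (CUDA) framework, which aligns concept representations across domains with adversarial training, introduces a relaxation threshold to tolerate minor differences between domains, infers concepts directly in the target domain without labeled concept data, and integrates concept learning into conventional domain adaptation, improving both the robustness and the interpretability of CBMs.
Link: https://arxiv.org/abs/2505.05195
Authors: Xinyue Xu,Yueying Hu,Hui Tang,Yi Qin,Lu Mi,Hao Wang,Xiaomeng Li
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by ICML 2025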
Abstract:Concept Bottleneck Models (CBMs) enhance interpretability by explaining predictions through human-understandable concepts but typically assume that training and test data share the same distribution. This assumption often fails under domain shifts, leading to degraded performance and poor generalization. To address these limitations and improve the robustness of CBMs, we propose the Concept-based Unsupervised Domain Adaptation (CUDA) framework. CUDA is designed to: (1) align concept representations across domains using adversarial training, (2) introduce a relaxation threshold to allow minor domain-specific differences in concept distributions, thereby preventing performance drop due to over-constraints of these distributions, (3) infer concepts directly in the target domain without requiring labeled concept data, enabling CBMs to adapt to diverse domains, and (4) integrate concept learning into conventional domain adaptation (DA) with theoretical guarantees, improving interpretability and establishing new benchmarks for DA. Experiments demonstrate that our approach significantly outperforms the state-of-the-art CBM and DA methods on real-world datasets.
[CV-28] Biomed-DPT: Dual Modality Prompt Tuning for Biomedical Vision-Language Models
[Quick read]: This paper addresses a limitation of prompt learning for pre-trained vision-language models (VLMs) in biomedical image classification, especially in few-shot settings: most methods use only text prompts and ignore the particular structures in biomedical images, such as complex anatomical structures and subtle pathological features. The key to the solution is Biomed-DPT, a knowledge-enhanced dual modality prompt tuning technique. For the text prompt it builds a dual prompt from template-driven clinical prompts and large language model (LLM)-driven domain-adapted prompts, and extracts clinical knowledge through knowledge distillation. For the vision prompt it introduces a zero vector as a soft prompt to re-weight attention, avoiding focus on non-diagnostic regions and non-critical pathological features.
Link: https://arxiv.org/abs/2505.05189
Authors: Wei Peng,Kang Liu,Jianchen Hu,Meng Zhang
Affiliations: Xi’an Jiaotong University
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:
Abstract:Prompt learning is one of the most effective paradigms for adapting pre-trained vision-language models (VLMs) to the biomedical image classification tasks in few shot scenarios. However, most of the current prompt learning methods only used the text prompts and ignored the particular structures (such as the complex anatomical structures and subtle pathological features) in the biomedical images. In this work, we propose Biomed-DPT, a knowledge-enhanced dual modality prompt tuning technique. In designing the text prompt, Biomed-DPT constructs a dual prompt including the template-driven clinical prompts and the large language model (LLM)-driven domain-adapted prompts, then extracts the clinical knowledge from the domain-adapted prompts through the knowledge distillation technique. In designing the vision prompt, Biomed-DPT introduces the zero vector as a soft prompt to leverage attention re-weighting so that the focus on non-diagnostic regions and the recognition of non-critical pathological features are avoided. Biomed-DPT achieves an average classification accuracy of 66.14% across 11 biomedical image datasets covering 9 modalities and 10 organs, with performance reaching 78.06% in base classes and 75.97% in novel classes, surpassing the Context Optimization (CoOp) method by 6.20%, 3.78%, and 8.04%, respectively. Our code is available at this https URL.
[CV-29] PaniCar: Securing the Perception of Advanced Driving Assistance Systems Against Emergency Vehicle Lighting
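[Quick read]: This paper addresses a phenomenon the authors call PaniCar: when an autonomous car's camera is exposed to activated emergency vehicle lighting, lens flare causes the object detector's confidence score to fluctuate and possibly drop below the detection threshold. The key to the solution is Caracetamol, a robust framework designed to make object detectors more resilient to activated emergency vehicle lighting by raising the average detection confidence and its lower bound and reducing the fluctuation range, while meeting the frame rate needed for real-time driving.
Link: https://arxiv.org/abs/2505.05183
Authors: Elad Feldman,Jacob Shams,Dudi Biton,Alfred Chen,Shaoyuan Xie,Satoru Koda,Yisroel Mirsky,Asaf Shabtai,Yuval Elovici,Ben Nassi
Affiliations: Ben-Gurion University of the Negev, Israel; University of California, Irvine, USA; Fujitsu Limited, Japan
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: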
Abstract:The safety of autonomous cars has come under scrutiny in recent years, especially after 16 documented incidents involving Teslas (with autopilot engaged) crashing into parked emergency vehicles (police cars, ambulances, and firetrucks). While previous studies have revealed that strong light sources often introduce flare artifacts in the captured image, which degrade the image quality, the impact of flare on object detection performance remains unclear. In this research, we unveil PaniCar, a digital phenomenon that causes an object detector’s confidence score to fluctuate below detection thresholds when exposed to activated emergency vehicle lighting. This vulnerability poses a significant safety risk, and can cause autonomous vehicles to fail to detect objects near emergency vehicles. In addition, this vulnerability could be exploited by adversaries to compromise the security of advanced driving assistance systems (ADASs). We assess seven commercial ADASs (Tesla Model 3, “manufacturer C”, HP, Pelsee, AZDOME, Imagebon, Rexing), four object detectors (YOLO, SSD, RetinaNet, Faster R-CNN), and 14 patterns of emergency vehicle lighting to understand the influence of various technical and environmental factors. We also evaluate four SOTA flare removal methods and show that their performance and latency are insufficient for real-time driving constraints. To mitigate this risk, we propose Caracetamol, a robust framework designed to enhance the resilience of object detectors against the effects of activated emergency vehicle lighting. Our evaluation shows that on YOLOv3 and Faster RCNN, Caracetamol improves the models’ average confidence of car detection by 0.20, the lower confidence bound by 0.33, and reduces the fluctuation range by 0.33. In addition, Caracetamol is capable of processing frames at a rate of between 30-50 FPS, enabling real-time ADAS car detection.
[CV-30] Probabilistic Embeddings for Frozen Vision-Language Models: Uncertainty Quantification with Gaussian Process Latent Variable Models UAI2025
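[Quick Read]: This paper addresses the difficulty that deterministic embeddings from standard Vision-Language Models (VLMs) have in capturing the uncertainty in visual and textual descriptions, as well as the multiple possible correspondences between images and texts. The key to the solution is GroVE, a post-hoc method built on the Gaussian Process Latent Variable Model (GPLVM). On top of a frozen VLM, it learns a shared low-dimensional latent space, optimized through single-modal embedding reconstruction and cross-modal alignment objectives, and then produces uncertainty-aware probabilistic embeddings.
Link: https://arxiv.org/abs/2505.05163
Authors: Aishwarya Venkataramanan,Paul Bodesheim,Joachim Denzler
Affiliations: Friedrich Schiller University Jena
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: UAI 2025, 22 pages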
Abstract:Vision-Language Models (VLMs) learn joint representations by mapping images and text into a shared latent space. However, recent research highlights that deterministic embeddings from standard VLMs often struggle to capture the uncertainties arising from the ambiguities in visual and textual descriptions and the multiple possible correspondences between images and texts. Existing approaches tackle this by learning probabilistic embeddings during VLM training, which demands large datasets and does not leverage the powerful representations already learned by large-scale VLMs like CLIP. In this paper, we propose GroVE, a post-hoc approach to obtaining probabilistic embeddings from frozen VLMs. GroVE builds on Gaussian Process Latent Variable Model (GPLVM) to learn a shared low-dimensional latent space where image and text inputs are mapped to a unified representation, optimized through single-modal embedding reconstruction and cross-modal alignment objectives. Once trained, the Gaussian Process model generates uncertainty-aware probabilistic embeddings. Evaluation shows that GroVE achieves state-of-the-art uncertainty calibration across multiple downstream tasks, including cross-modal retrieval, visual question answering, and active learning.
[CV-31] Research on Anomaly Detection Methods Based on Diffusion Models
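[Quick Read]: This paper addresses the performance bottleneck that traditional anomaly detection methods (such as statistical modeling and machine learning approaches) face on complex, high-dimensional data distributions. The key to the solution is to exploit the generative capability of Diffusion Probabilistic Models (DPMs): a diffusion process models the distribution of normal data, reverse diffusion reconstructs the input, and reconstruction error combined with semantic discrepancy serves as the anomaly indicator. Multi-scale feature extraction, attention mechanisms, and wavelet-domain representations are added so that the model captures fine-grained structures and global dependencies, yielding more accurate and robust anomaly detection.
Link: https://arxiv.org/abs/2505.05137
Authors: Yi Chen
Affiliations: Zhejiang Normal University
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments: 6 pages, 3 tables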
Abstract:Anomaly detection is a fundamental task in machine learning and data mining, with significant applications in cybersecurity, industrial fault diagnosis, and clinical disease monitoring. Traditional methods, such as statistical modeling and machine learning-based approaches, often face challenges in handling complex, high-dimensional data distributions. In this study, we explore the potential of diffusion models for anomaly detection, proposing a novel framework that leverages the strengths of diffusion probabilistic models (DPMs) to effectively identify anomalies in both image and audio data. The proposed method models the distribution of normal data through a diffusion process and reconstructs input data via reverse diffusion, using a combination of reconstruction errors and semantic discrepancies as anomaly indicators. To enhance the framework’s performance, we introduce multi-scale feature extraction, attention mechanisms, and wavelet-domain representations, enabling the model to capture fine-grained structures and global dependencies in the data. Extensive experiments on benchmark datasets, including MVTec AD and UrbanSound8K, demonstrate that our method outperforms state-of-the-art anomaly detection techniques, achieving superior accuracy and robustness across diverse data modalities. This research highlights the effectiveness of diffusion models in anomaly detection and provides a robust and efficient solution for real-world applications.
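The scoring idea in this abstract, combining reconstruction error with a semantic discrepancy, might be sketched as follows. The weighted sum, the use of cosine distance on feature vectors, the weight `alpha`, and all names are assumptions made for illustration; the abstract does not specify how the two terms are combined. The reconstruction `x_hat` would come from a reverse-diffusion pass, which is not modelled here.

```python
import math

def mse(x, x_hat):
    """Mean squared reconstruction error between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(x, x_hat)) / len(x)

def cosine_distance(u, v):
    """1 - cosine similarity of two non-zero vectors (stand-in semantic gap)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm

def anomaly_score(x, x_hat, feat, feat_hat, alpha=0.5):
    """Weighted sum of reconstruction error and feature-space discrepancy."""
    return alpha * mse(x, x_hat) + (1.0 - alpha) * cosine_distance(feat, feat_hat)

# Hypothetical inputs: a well-reconstructed sample and a poorly reconstructed one.
normal = anomaly_score([1.0, 2.0], [1.0, 2.1], [1.0, 0.0], [1.0, 0.05])
anomalous = anomaly_score([1.0, 2.0], [0.0, 0.5], [1.0, 0.0], [0.2, 1.0])
```

A sample is flagged when its score exceeds a threshold chosen on normal data; here `anomalous` comes out far larger than `normal`.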
[CV-32] Automated vision-based assistance tools in bronchoscopy: stenosis severity estimation
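[Quick Read]: This paper aims to automate the severity assessment of subglottic stenosis. The traditional approach relies on visual inspection by experts, which is highly subjective and yields inconsistent diagnoses. The key to the solution is to exploit the physical effect of illumination decline in endoscopic imaging to segment and track the airway lumen, and to build a 3D model from a single frame to measure the airway narrowing, enabling automated assessment without requiring the physician to traverse the stenosed region.
Link: https://arxiv.org/abs/2505.05136
Authors: Clara Tomasini,Javier Rodriguez-Puigvert,Dinora Polanco,Manuel Viñuales,Luis Riazuelo,Ana Cristina Murillo
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: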
Abstract:Purpose: Subglottic stenosis refers to the narrowing of the subglottis, the airway between the vocal cords and the trachea. Its severity is typically evaluated by estimating the percentage of obstructed airway. This estimation can be obtained from CT data or through visual inspection by experts exploring the region. However, visual inspections are inherently subjective, leading to less consistent and robust diagnoses. No public methods or datasets are currently available for automated evaluation of this condition from bronchoscopy video. Methods: We propose a pipeline for automated subglottic stenosis severity estimation during the bronchoscopy exploration, without requiring the physician to traverse the stenosed region. Our approach exploits the physical effect of illumination decline in endoscopy to segment and track the lumen and obtain a 3D model of the airway. This 3D model is obtained from a single frame and is used to measure the airway narrowing. Results: Our pipeline is the first to enable automated and robust subglottic stenosis severity measurement using bronchoscopy images. The results show consistency with ground-truth estimations from CT scans and expert estimations, and reliable repeatability across multiple estimations on the same patient. Our evaluation is performed on our new Subglottic Stenosis Dataset of real bronchoscopy procedures data. Conclusion: We demonstrate how to automate evaluation of subglottic stenosis severity using only bronchoscopy. Our approach can assist with and shorten diagnosis and monitoring procedures, with automated and repeatable estimations and less exploration time, and save radiation exposure to patients as no CT is required. Additionally, we release the first public benchmark for subglottic stenosis severity assessment. 
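The "percentage of obstructed airway" mentioned in this abstract can be made concrete with a small worked example. It assumes obstruction is measured as the loss of cross-sectional area relative to a healthy reference section; the abstract does not state how the pipeline computes it, and the function name and values are hypothetical.

```python
def obstruction_percent(stenosed_area, reference_area):
    """Percentage of the airway cross-section that is obstructed.

    Assumption: severity = 100 * (1 - stenosed_area / reference_area),
    with both areas in the same units (for example mm^2).
    """
    if reference_area <= 0:
        raise ValueError("reference_area must be positive")
    if not 0 <= stenosed_area <= reference_area:
        raise ValueError("stenosed_area must lie in [0, reference_area]")
    return 100.0 * (1.0 - stenosed_area / reference_area)

# Hypothetical areas measured on a reconstructed 3D airway model.
severity = obstruction_percent(stenosed_area=30.0, reference_area=120.0)
```

With these numbers the narrowed section keeps a quarter of the reference area, so `severity` is 75.0.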
[CV-33] An Active Contour Model for Silhouette Vectorization using Bézier Curves
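[Quick Read]: This paper addresses silhouette vectorization, that is, converting a bitmap silhouette boundary into vector graphics represented by cubic Bézier curves. The key to the solution is an active contour model that minimizes the distance between the Bézier curves and the silhouette boundary, optimizing the location of the curve end points, the orientation of the tangent vectors at regular points, and the estimation of the Bézier curve parameters. The method can take the vectorization produced by any method as an initial guess. It significantly reduces the average distance between the silhouette boundary and the vectorizations obtained with Inkscape, Adobe Illustrator, and a curvature-based vectorization method, and it can impose additional regularity by reducing the curve lengths.
Link: https://arxiv.org/abs/2505.05132
Authors: Luis Alvarez,Jean-Michel Morel
Affiliations: Unknown
Categories: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV); Functional Analysis (math.FA)
Comments: 14 pages, 5 figures and 1 table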
Abstract:In this paper, we propose an active contour model for silhouette vectorization using cubic Bézier curves. Among the end points of the Bézier curves, we distinguish between corner and regular points where the orientation of the tangent vector is prescribed. By minimizing the distance of the Bézier curves to the silhouette boundary, the active contour model optimizes the location of the Bézier curves end points, the orientation of the tangent vectors in the regular points, and the estimation of the Bézier curve parameters. This active contour model can use the silhouette vectorization obtained by any method as an initial guess. The proposed method significantly reduces the average distance between the silhouette boundary and its vectorization obtained by the world-class graphic software Inkscape, Adobe Illustrator, and a curvature-based vectorization method, which we introduce for comparison. Our method also allows us to impose additional regularity on the Bézier curves by reducing their lengths.
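The quantity being minimized here, the distance between a cubic Bézier curve and the silhouette boundary, might be evaluated as in the sketch below. This only computes the objective for a fixed curve; the optimization over end points and tangents is not shown. Approximating the point-to-curve distance by dense sampling, and the names `mean_distance` and `samples`, are assumptions for illustration.

```python
import math

def cubic_bezier(p0, p1, p2, p3, t):
    """Evaluate a cubic Bézier curve with control points p0..p3 at t in [0, 1]."""
    s = 1.0 - t
    b = (s ** 3, 3 * s ** 2 * t, 3 * s * t ** 2, t ** 3)  # Bernstein weights
    return tuple(
        b[0] * a + b[1] * c + b[2] * d + b[3] * e
        for a, c, d, e in zip(p0, p1, p2, p3)
    )

def mean_distance(boundary, control_points, samples=200):
    """Average distance from each boundary point to the nearest curve sample."""
    curve = [cubic_bezier(*control_points, i / samples) for i in range(samples + 1)]
    return sum(min(math.dist(q, c) for c in curve) for q in boundary) / len(boundary)

# Hypothetical data: collinear control points trace the segment (0,0)-(3,0).
ctrl = ((0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (3.0, 0.0))
on_curve = [(0.0, 0.0), (1.5, 0.0), (3.0, 0.0)]
off_curve = [(0.0, 1.0), (1.5, 1.0), (3.0, 1.0)]
```

Boundary points lying on the curve give a mean distance of essentially zero, while the points shifted one unit upward give a mean distance of one. An optimizer would move the control points to drive this value down.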
[CV-34] MDE-Edit: Masked Dual-Editing for Multi-Object Image Editing via Diffusion Models
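[Quick Read]: This paper targets two problems that arise in multi-object editing when target objects overlap or interact: inaccurate localization caused by attention misalignment, and attribute-object mismatch caused by cross-attention leakage. Existing methods based on global cross-attention suffer from attention dilution and spatial interference, while mask-based methods fail to bind attributes to geometrically accurate regions because of feature entanglement in multi-object scenes. The proposed MDE-Edit is a training-free, inference-stage optimization method built on two losses: Object Alignment Loss (OAL) aligns multi-layer cross-attention with segmentation masks to localize target objects precisely, and Color Consistency Loss (CCL) amplifies target-attribute attention within the masks while suppressing leakage into adjacent regions, keeping multi-object edits localized and consistent.
Link: https://arxiv.org/abs/2505.05101
Authors: Hongyang Zhu,Haipeng Liu,Bo Fu,Yang Wang
Affiliations: Hefei University of Technology; Liaoning Normal University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 9 pages, 7 figures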
Abstract:Multi-object editing aims to modify multiple objects or regions in complex scenes while preserving structural coherence. This task faces significant challenges in scenarios involving overlapping or interacting objects: (1) Inaccurate localization of target objects due to attention misalignment, leading to incomplete or misplaced edits; (2) Attribute-object mismatch, where color or texture changes fail to align with intended regions due to cross-attention leakage, creating semantic conflicts (e.g., color bleeding into non-target areas). Existing methods struggle with these challenges: approaches relying on global cross-attention mechanisms suffer from attention dilution and spatial interference between objects, while mask-based methods fail to bind attributes to geometrically accurate regions due to feature entanglement in multi-object scenarios. To address these limitations, we propose a training-free, inference-stage optimization approach that enables precise localized image manipulation in complex multi-object scenes, named MDE-Edit. MDE-Edit optimizes the noise latent feature in diffusion models via two key losses: Object Alignment Loss (OAL) aligns multi-layer cross-attention with segmentation masks for precise object positioning, and Color Consistency Loss (CCL) amplifies target attribute attention within masks while suppressing leakage to adjacent regions. This dual-loss design ensures localized and coherent multi-object edits. Extensive experiments demonstrate that MDE-Edit outperforms state-of-the-art methods in editing accuracy and visual quality, offering a robust solution for complex multi-object image manipulation tasks.
[CV-35] DispBench: Benchmarking Disparity Estimation to Synthetic Corruptions CVPR2025
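[Quick Read]: This paper addresses the limited reliability and generalization of deep learning (DL) methods for disparity estimation on stereo image pairs, in particular their vulnerability to distribution shifts and adversarial attacks. Existing methods perform well on standard benchmarks but raise reliability concerns in practice, and the absence of a standardized evaluation benchmark further hinders progress. To fill this gap the authors introduce DispBench, a comprehensive benchmarking tool whose key feature is the systematic evaluation of disparity estimation methods under a range of synthetic image corruptions, including adversarial attacks and out-of-distribution shifts, revealing key correlations between accuracy, reliability, and generalization.
Link: https://arxiv.org/abs/2505.05091
Authors: Shashank Agnihotri,Amaan Ansari,Annika Dackermann,Fabian Rösch,Margret Keuper
Affiliations: Data and Web Science Group, University of Mannheim; Max-Planck-Institute for Informatics
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: Accepted at CVPR 2025 Workshop on Synthetic Data for Computer Vision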
Abstract:Deep learning (DL) has surpassed human performance on standard benchmarks, driving its widespread adoption in computer vision tasks. One such task is disparity estimation, estimating the disparity between matching pixels in stereo image pairs, which is crucial for safety-critical applications like medical surgeries and autonomous navigation. However, DL-based disparity estimation methods are highly susceptible to distribution shifts and adversarial attacks, raising concerns about their reliability and generalization. Despite these concerns, a standardized benchmark for evaluating the robustness of disparity estimation methods remains absent, hindering progress in the field. To address this gap, we introduce DispBench, a comprehensive benchmarking tool for systematically assessing the reliability of disparity estimation methods. DispBench evaluates robustness against synthetic image corruptions such as adversarial attacks and out-of-distribution shifts caused by 2D Common Corruptions across multiple datasets and diverse corruption scenarios. We conduct the most extensive performance and robustness analysis of disparity estimation methods to date, uncovering key correlations between accuracy, reliability, and generalization. Open-source code for DispBench: this https URL
[CV-36] Nonlinear Motion-Guided and Spatio-Temporal Aware Network for Unsupervised Event-Based Optical Flow ICRA2025
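[Quick Read]: This paper tackles the optical flow errors that arise in long-time sequences when event-camera optical flow is estimated with conventional frame-based techniques, which ignore the spatio-temporal characteristics of events and assume linear motion. The key to the solution is the E-NMSTFlow network, which introduces a Spatio-Temporal Motion Feature Aware (STMFA) module and an Adaptive Motion Feature Enhancement (AMFE) module to exploit the rich spatio-temporal information of events and learn spatio-temporal data associations, together with a nonlinear motion compensation loss that uses the accurate nonlinear motion between events to improve unsupervised learning.
Link: https://arxiv.org/abs/2505.05089
Authors: Zuntao Liu,Hao Zhuang,Junjie Jiang,Yuhang Song,Zheng Fang
Affiliations: Northeastern University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to ICRA 2025. Project Page: this https URL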
Abstract:Event cameras have the potential to capture continuous motion information over time and space, making them well-suited for optical flow estimation. However, most existing learning-based methods for event-based optical flow adopt frame-based techniques, ignoring the spatio-temporal characteristics of events. Additionally, these methods assume linear motion between consecutive events within the loss time window, which increases optical flow errors in long-time sequences. In this work, we observe that rich spatio-temporal information and accurate nonlinear motion between events are crucial for event-based optical flow estimation. Therefore, we propose E-NMSTFlow, a novel unsupervised event-based optical flow network focusing on long-time sequences. We propose a Spatio-Temporal Motion Feature Aware (STMFA) module and an Adaptive Motion Feature Enhancement (AMFE) module, both of which utilize rich spatio-temporal information to learn spatio-temporal data associations. Meanwhile, we propose a nonlinear motion compensation loss that utilizes the accurate nonlinear motion between events to improve the unsupervised learning of our network. Extensive experiments demonstrate the effectiveness and superiority of our method. Remarkably, our method ranks first among unsupervised learning methods on the MVSEC and DSEC-Flow datasets. Our project page is available at this https URL.
[CV-37] SSH-Net: A Self-Supervised and Hybrid Network for Noisy Image Watermark Removal
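[Quick Read]: This paper addresses visible watermark removal, which is challenging because of the inherent complexity of watermarks and the noise carried in images. Existing methods mainly rely on supervised learning with paired watermarked and watermark-free images, which are often hard to obtain in practice. The proposed SSH-Net is a self-supervised, hybrid network. Its key ideas are to synthesize reference watermark-free images in a self-supervised manner and to use a dual-network design: the upper network handles noise removal with a lightweight convolutional neural network (CNN), while the lower network removes watermarks and noise simultaneously and incorporates Transformer blocks to model long-range dependencies and capture intricate image features. A shared CNN feature encoder placed before the two networks extracts common features that both can use.
Link: https://arxiv.org/abs/2505.05088
Authors: Wenyang Liu,Jianjun Gao,Kim-Hui Yap
Affiliations: Nanyang Technological University
Categories: Multimedia (cs.MM); Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
Comments: Under Review in JVCI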
Abstract:Visible watermark removal is challenging due to its inherent complexities and the noise carried within images. Existing methods primarily rely on supervised learning approaches that require paired datasets of watermarked and watermark-free images, which are often impractical to obtain in real-world scenarios. To address this challenge, we propose SSH-Net, a Self-Supervised and Hybrid Network specifically designed for noisy image watermark removal. SSH-Net synthesizes reference watermark-free images using the watermark distribution in a self-supervised manner and adopts a dual-network design to address the task. The upper network, focused on the simpler task of noise removal, employs a lightweight CNN-based architecture, while the lower network, designed to handle the more complex task of simultaneously removing watermarks and noise, incorporates Transformer blocks to model long-range dependencies and capture intricate image features. To enhance the model’s effectiveness, a shared CNN-based feature encoder is introduced before dual networks to extract common features that both networks can leverage. Our code will be available at this https URL.
[CV-38] PIDiff: Image Customization for Personalized Identities with Diffusion Models
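[Quick Read]: This paper addresses the semantic entanglement of identity and background information in personalized-identity text-to-image generation, which causes generated images to lose key identity characteristics and markedly reduces diversity. The key to the solution is PIDiff, a fine-tuning-based diffusion model that uses the W+ space and an identity-tailored fine-tuning strategy to avoid semantic entanglement and achieve accurate identity feature extraction and localization. By combining a cross-attention block with a parameter optimization strategy, it preserves identity information while retaining the pre-trained model's ability to generate in-the-wild images during inference.
Link: https://arxiv.org/abs/2505.05081
Authors: Jinyu Gu,Haipeng Liu,Meng Wang,Yang Wang
Affiliations: Hefei University of Technology
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 9 pages, 11 figures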
Abstract:Text-to-image generation for personalized identities aims at incorporating the specific identity into images using a text prompt and an identity image. Based on the powerful generative capabilities of DDPMs, many previous works adopt additional prompts, such as text embeddings and CLIP image embeddings, to represent the identity information, while they fail to disentangle the identity information and background information. As a result, the generated images not only lose key identity characteristics but also suffer from significantly reduced diversity. To address this issue, previous works have combined the W+ space from StyleGAN with diffusion models, leveraging this space to provide a more accurate and comprehensive representation of identity features through multi-level feature extraction. However, the entanglement of identity and background information in in-the-wild images during training prevents accurate identity localization, resulting in severe semantic interference between identity and background. In this paper, we propose a novel fine-tuning-based diffusion model for personalized identities text-to-image generation, named PIDiff, which leverages the W+ space and an identity-tailored fine-tuning strategy to avoid semantic entanglement and achieves accurate feature extraction and localization. Style editing can also be achieved by PIDiff through preserving the characteristics of identity features in the W+ space, which vary from coarse to fine. Through the combination of the proposed cross-attention block and parameter optimization strategy, PIDiff preserves the identity information and maintains the generation capability for in-the-wild images of the pre-trained model during inference. Our experimental results validate the effectiveness of our method in this task.
[CV-39] The City that Never Settles: Simulation-based LiDAR Dataset for Long-Term Place Recognition Under Extreme Structural Changes
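[Quick Read]: This paper addresses the challenge that large-scale construction and demolition pose to long-term place recognition (PR) by drastically reshaping urban and suburban environments. Existing datasets mainly reflect limited or indoor-focused changes and do not adequately represent extensive outdoor transformations. To bridge this gap, the authors introduce the City that Never Settles (CNS) dataset, built with the CARLA simulator, which captures major structural changes such as building construction and demolition across different maps and sequences. They also propose TCR_sym, a symmetric version of the original TCR metric that measures structural change consistently regardless of source-target ordering. CNS covers more extensive changes than existing real-world benchmarks, and state-of-the-art LiDAR-based PR methods degrade substantially on it, underscoring the need for robust algorithms that can handle significant environmental change.
Link: https://arxiv.org/abs/2505.05076
Authors: Hyunho Song,Dongjae Lee,Seunghun Oh,Minwoo Jung,Ayoung Kim
Affiliations: SNU
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Comments: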
Abstract:Large-scale construction and demolition significantly challenge long-term place recognition (PR) by drastically reshaping urban and suburban environments. Existing datasets predominantly reflect limited or indoor-focused changes, failing to adequately represent extensive outdoor transformations. To bridge this gap, we introduce the City that Never Settles (CNS) dataset, a simulation-based dataset created using the CARLA simulator, capturing major structural changes, such as building construction and demolition, across diverse maps and sequences. Additionally, we propose TCR_sym, a symmetric version of the original TCR metric, enabling consistent measurement of structural changes irrespective of source-target ordering. Quantitative comparisons demonstrate that CNS encompasses more extensive transformations than current real-world benchmarks. Evaluations of state-of-the-art LiDAR-based PR methods on CNS reveal substantial performance degradation, underscoring the need for robust algorithms capable of handling significant environmental changes. Our dataset is available at this https URL.
[CV-40] Visual Affordances: Enabling Robots to Understand Object Functionality
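[Quick Read]: This paper addresses the reproducibility problem in object affordance prediction for human-robot interaction, which makes comparative benchmarks across methods unfair and unreliable. The key to the solution is a unified formulation for visual affordance prediction, together with the Affordance Sheet, a document introduced to improve transparency and reproducibility, and a generic framework that links visual affordance prediction to the physical world, carrying information all the way from perception to robot actuation.
Link: https://arxiv.org/abs/2505.05074
Authors: Tommaso Apicella,Alessio Xompero,Andrea Cavallaro
Affiliations: Istituto Italiano di Tecnologia; Queen Mary University of London; Idiap Research Institute; École Polytechnique Fédérale de Lausanne
Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments: 24 pages, 12 figures, 10 tables. Project website at this https URL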
Abstract:Human-robot interaction for assistive technologies relies on the prediction of affordances, which are the potential actions a robot can perform on objects. Predicting object affordances from visual perception is formulated differently for tasks such as grasping detection, affordance classification, affordance segmentation, and hand-object interaction synthesis. In this work, we highlight the reproducibility issue in these redefinitions, making comparative benchmarks unfair and unreliable. To address this problem, we propose a unified formulation for visual affordance prediction, provide a comprehensive and systematic review of previous works highlighting strengths and limitations of methods and datasets, and analyse what challenges reproducibility. To favour transparency, we introduce the Affordance Sheet, a document to detail the proposed solution, the datasets, and the validation. As the physical properties of an object influence the interaction with the robot, we present a generic framework that links visual affordance prediction to the physical world. Using the weight of an object as an example for this framework, we discuss how estimating object mass can affect the affordance prediction. Our approach bridges the gap between affordance perception and robot actuation, and accounts for the complete information about objects of interest and how the robot interacts with them to accomplish its task.
[CV-41] FG-CLIP: Fine-Grained Visual and Textual Alignment ICML2025
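[Quick Read]: This paper addresses the weakness of Contrastive Language-Image Pre-training (CLIP) on fine-grained understanding tasks: because it is pre-trained mainly on coarse-grained short captions, it struggles to capture detailed image semantics. The key to the solution is Fine-Grained CLIP (FG-CLIP), which improves fine-grained understanding through three core innovations: generating 1.6 billion long caption-image pairs with large multimodal models to capture global-level semantic details; constructing a high-quality dataset of 12 million images and 40 million region-specific bounding boxes to ensure precise, context-rich representations; and incorporating 10 million hard fine-grained negative samples to sharpen the model's ability to distinguish subtle semantic differences.
Link: https://arxiv.org/abs/2505.05071
Authors: Chunyu Xie,Bin Wang,Fanjing Kong,Jincheng Li,Dawei Liang,Gengshen Zhang,Dawei Leng,Yuhui Yin
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Accepted at ICML 2025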
Abstract:Contrastive Language-Image Pre-training (CLIP) excels in multimodal tasks such as image-text retrieval and zero-shot classification but struggles with fine-grained understanding due to its focus on coarse-grained short captions. To address this, we propose Fine-Grained CLIP (FG-CLIP), which enhances fine-grained understanding through three key innovations. First, we leverage large multimodal models to generate 1.6 billion long caption-image pairs for capturing global-level semantic details. Second, a high-quality dataset is constructed with 12 million images and 40 million region-specific bounding boxes aligned with detailed captions to ensure precise, context-rich representations. Third, 10 million hard fine-grained negative samples are incorporated to improve the model’s ability to distinguish subtle semantic differences. Corresponding training methods are meticulously designed for these data. Extensive experiments demonstrate that FG-CLIP outperforms the original CLIP and other state-of-the-art methods across various downstream tasks, including fine-grained understanding, open-vocabulary object detection, image-text retrieval, and general multimodal benchmarks. These results highlight FG-CLIP’s effectiveness in capturing fine-grained image details and improving overall model performance. The related data, code, and models are available at this https URL.
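To see why hard negatives matter for a contrastive objective, consider the toy loss below. It is a generic softmax contrastive loss over one positive caption and a list of negatives, not FG-CLIP's actual training objective, which the abstract does not spell out; the temperature value and similarity scores are hypothetical.

```python
import math

def contrastive_loss(sim_pos, sim_negs, temperature=0.07):
    """Negative log softmax probability of the positive among all candidates."""
    logits = [sim_pos / temperature] + [s / temperature for s in sim_negs]
    m = max(logits)  # subtract the max for numerical stability
    log_sum = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_sum - logits[0]

# Hypothetical image-text similarities for one image.
easy = contrastive_loss(0.8, [0.1, 0.0])    # unrelated captions
hard = contrastive_loss(0.8, [0.75, 0.7])   # captions differing in one detail
```

Negatives that score almost as high as the positive produce a much larger loss, so they supply a stronger training signal for telling apart subtle semantic differences.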
[CV-42] ULFine: Unbiased Lightweight Fine-tuning for Foundation-Model-Assisted Long-Tailed Semi-Supervised Learning
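[Quick Read]: This paper addresses the performance degradation caused by imbalanced data distributions in Long-Tailed Semi-Supervised Learning (LTSSL), especially on tail classes. When large-scale visual foundation models such as CLIP are applied, Full Fine-Tuning (FFT) lowers model performance, while Linear Probing (LP) and Lightweight Fine-Tuning (LFT) raise overall performance but bring little benefit to tail classes. The key to the solution is ULFine, an unbiased lightweight fine-tuning strategy that mitigates overconfidence through confidence-aware adaptive fitting of textual prototypes and counteracts pseudo-label and classifier biases through complementary fusion of dual logits, substantially improving prediction accuracy on tail classes.
Link: https://arxiv.org/abs/2505.05062
Authors: Enhao Zhang,Chaohua Li,Chuanxing Geng,Songcan Chen
Affiliations: MIIT Key Laboratory of Pattern Analysis and Machine Intelligence, China; College of Computer Science and Technology, Nanjing University of Aeronautics and Astronautics (NUAA), Nanjing 211106, China
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: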
Abstract:Based on the success of large-scale visual foundation models like CLIP in various downstream tasks, this paper initially attempts to explore their impact on Long-Tailed Semi-Supervised Learning (LTSSL) by employing the foundation model with three strategies: Linear Probing (LP), Lightweight Fine-Tuning (LFT), and Full Fine-Tuning (FFT). Our analysis presents the following insights: i) Compared to LTSSL algorithms trained from scratch, FFT results in a decline in model performance, whereas LP and LFT, although boosting overall model performance, exhibit negligible benefits to tail classes. ii) LP produces numerous false pseudo-labels due to underlearned training data, while LFT can reduce the number of these false labels but becomes overconfident about them owing to biased fitting training data. This exacerbates the pseudo-labeled and classifier biases inherent in LTSSL, limiting performance improvement in the tail classes. With these insights, we propose an Unbiased Lightweight Fine-tuning strategy, ULFine, which mitigates the overconfidence via confidence-aware adaptive fitting of textual prototypes and counteracts the pseudo-labeled and classifier biases via complementary fusion of dual logits. Extensive experiments demonstrate that ULFine markedly decreases training costs by over ten times and substantially increases prediction accuracies compared to state-of-the-art methods.
[CV-43] UncertainSAM: Fast and Efficient Uncertainty Quantification of the Segment Anything Model ICML’25
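[Quick Read]: This paper addresses uncertainty quantification (UQ) for the Segment Anything Model (SAM), a class-agnostic foundation model whose ambiguous nature challenges existing UQ methods. The key to the solution is a theoretically motivated UQ model based on a Bayesian entropy formulation that jointly accounts for aleatoric (data) uncertainty, epistemic (model) uncertainty, and a newly introduced task uncertainty. This formulation is used to train USAM, a lightweight post-hoc UQ method that traces uncertainty back to its root causes: under-parameterised models, insufficient prompts, or image ambiguities.
Link: https://arxiv.org/abs/2505.05049
Authors: Timo Kaiser,Thomas Norrenbrock,Bodo Rosenhahn
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to ICML’25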
Abstract:The introduction of the Segment Anything Model (SAM) has paved the way for numerous semantic segmentation applications. For several tasks, quantifying the uncertainty of SAM is of particular interest. However, the ambiguous nature of the class-agnostic foundation model SAM challenges current uncertainty quantification (UQ) approaches. This paper presents a theoretically motivated uncertainty quantification model based on a Bayesian entropy formulation jointly respecting aleatoric, epistemic, and the newly introduced task uncertainty. We use this formulation to train USAM, a lightweight post-hoc UQ method. Our model traces the root of uncertainty back to under-parameterised models, insufficient prompts or image ambiguities. Our proposed deterministic USAM demonstrates superior predictive capabilities on the SA-V, MOSE, ADE20k, DAVIS, and COCO datasets, offering a computationally cheap and easy-to-use UQ alternative that can support user-prompting, enhance semi-supervised pipelines, or balance the tradeoff between accuracy and cost efficiency.
[CV-44] xTrace: A Facial Expressive Behaviour Analysis Tool for Continuous Affect Recognition
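[Quick Read]: This paper addresses two key challenges in analysing facial expressive behaviour in the wild: the lack of large-scale labelled facial affect video datasets covering the 2D emotion space, and the difficulty of extracting facial video features that are discriminative, interpretable, robust, and computationally efficient. The key to the solution is xTrace, a tool trained on the largest facial affect video dataset, which covers most emotion zones in the 2D emotion space and so applies broadly to naturalistic expressive behaviour. The facial affect descriptors used by xTrace are explainable and achieve high accuracy and robustness at low computational complexity.
Link: https://arxiv.org/abs/2505.05043
Authors: Mani Kumar Tellamekala,Shashank Jaiswal,Thomas Smith,Timur Alamev,Gary McKeown,Anthony Brown,Michel Valstar
Affiliations: Blueskeye AI; Queen’s University Belfast
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: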
Abstract:Recognising expressive behaviours in face videos is a long-standing challenge in Affective Computing. Despite significant advancements in recent years, it still remains a challenge to build a robust and reliable system for naturalistic and in-the-wild facial expressive behaviour analysis in real time. This paper addresses two key challenges in building such a system: (1) the paucity of large-scale labelled facial affect video datasets with extensive coverage of the 2D emotion space, and (2) the difficulty of extracting facial video features that are discriminative, interpretable, robust, and computationally efficient. Toward addressing these challenges, we introduce xTrace, a robust tool for facial expressive behaviour analysis and predicting continuous values of dimensional emotions, namely valence and arousal, from in-the-wild face videos. To address challenge (1), our affect recognition model is trained on the largest facial affect video data set, containing ~450k videos that cover most emotion zones in the dimensional emotion space, making xTrace highly versatile in analysing a wide spectrum of naturalistic expressive behaviours. To address challenge (2), xTrace uses facial affect descriptors that are not only explainable, but can also achieve a high degree of accuracy and robustness with low computational complexity. The key components of xTrace are benchmarked against three existing tools: MediaPipe, OpenFace, and Augsburg Affect Toolbox. On an in-the-wild validation set composed of 50k videos, xTrace achieves 0.86 mean CCC and 0.13 mean absolute error values. We present a detailed error analysis of affect predictions from xTrace, illustrating (a) its ability to recognise emotions with high accuracy across most bins in the 2D emotion space, (b) its robustness to non-frontal head pose angles, and (c) a strong correlation between its uncertainty estimates and its accuracy.
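The two metrics reported in this abstract, the concordance correlation coefficient (CCC) and the mean absolute error, can be computed as in the sketch below. It uses the standard CCC definition with population moments; whether the paper averages per video or over pooled frames is not stated in the abstract, and the sample values are hypothetical.

```python
from statistics import fmean, pvariance

def ccc(pred, gold):
    """Concordance correlation coefficient of two equal-length sequences.

    CCC = 2*cov / (var_pred + var_gold + (mean_pred - mean_gold)^2).
    Undefined (division by zero) if both sequences are constant and equal.
    """
    mp, mg = fmean(pred), fmean(gold)
    cov = fmean([(p - mp) * (g - mg) for p, g in zip(pred, gold)])
    return 2 * cov / (pvariance(pred) + pvariance(gold) + (mp - mg) ** 2)

def mae(pred, gold):
    """Mean absolute error between predictions and labels."""
    return fmean([abs(p - g) for p, g in zip(pred, gold)])

# Hypothetical valence labels and a prediction with a constant offset.
gold = [0.1, 0.5, 0.9]
shifted = [g + 0.5 for g in gold]
```

Unlike Pearson correlation, CCC penalizes bias: `ccc(gold, gold)` is 1, but `ccc(shifted, gold)` falls well below 1 even though the two sequences are perfectly correlated.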
[CV-45] Split Matching for Inductive Zero-shot Semantic Segmentation
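[Quick Read]: This paper addresses two problems in Zero-shot Semantic Segmentation (ZSS): overfitting to seen categories because unseen classes lack supervision, and the tendency of conventional Hungarian matching to misclassify unseen categories as background. The key to the solution is Split Matching (SM), a new assignment strategy that partitions the queries into a seen group and a group of potential unseen candidates and optimizes each independently according to its available supervision, decoupling the matching of seen classes from that of unseen candidates. A Multi-scale Feature Enhancement (MFE) module is also introduced to improve the model's ability to capture spatial details.
Link: https://arxiv.org/abs/2505.05023
Authors: Jialei Chen,Xu Zheng,Dongyue Li,Chong Yi,Seigo Ito,Danda Pani Paudel,Luc Van Gool,Hiroshi Murase,Daisuke Deguchi
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: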
Abstract:Zero-shot Semantic Segmentation (ZSS) aims to segment categories that are not annotated during training. While fine-tuning vision-language models has achieved promising results, these models often overfit to seen categories due to the lack of supervision for unseen classes. As an alternative to fully supervised approaches, query-based segmentation has shown great potential in ZSS, as it enables object localization without relying on explicit labels. However, conventional Hungarian matching, a core component in query-based frameworks, needs full supervision and often misclassifies unseen categories as background in the setting of ZSS. To address this issue, we propose Split Matching (SM), a novel assignment strategy that decouples Hungarian matching into two components: one for seen classes in annotated regions and another for latent classes in unannotated regions (referred to as unseen candidates). Specifically, we partition the queries into seen and candidate groups, enabling each to be optimized independently according to its available supervision. To discover unseen candidates, we cluster CLIP dense features to generate pseudo masks and extract region-level embeddings using CLS tokens. Matching is then conducted separately for the two groups based on both class-level similarity and mask-level consistency. Additionally, we introduce a Multi-scale Feature Enhancement (MFE) module that refines decoder features through residual multi-scale aggregation, improving the model’s ability to capture spatial details across resolutions. SM is the first to introduce decoupled Hungarian matching under the inductive ZSS setting, and achieves state-of-the-art performance on two standard benchmarks.
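The idea of matching the two query groups separately can be illustrated with a small sketch. This is not the authors' implementation: the cost values are made up, the function names `min_cost_assignment` and `split_matching` are hypothetical, and a brute-force search over permutations stands in for the Hungarian algorithm so the example needs only the standard library. Each cost entry is assumed to already combine class-level similarity and mask-level consistency.

```python
from itertools import permutations


def min_cost_assignment(cost):
    """Assign every target (column) to a distinct query (row) at minimum total cost.

    Brute force over permutations, used here as a small stand-in for the
    Hungarian algorithm; only practical for a handful of queries, and it
    expects at least as many queries as targets.
    """
    n_queries = len(cost)
    n_targets = len(cost[0]) if cost else 0
    best_total, best_rows = float("inf"), ()
    for rows in permutations(range(n_queries), n_targets):
        total = sum(cost[q][t] for t, q in enumerate(rows))
        if total < best_total:
            best_total, best_rows = total, rows
    return [(q, t) for t, q in enumerate(best_rows)]


def split_matching(seen_cost, candidate_cost):
    """Match the seen and candidate query groups independently.

    seen_cost[q][t]: cost of seen-group query q against annotated seen target t.
    candidate_cost[q][t]: cost of candidate-group query q against pseudo mask t.
    Candidate query indices are offset so they index the full query list.
    """
    offset = len(seen_cost)
    seen_pairs = min_cost_assignment(seen_cost)
    candidate_pairs = [(q + offset, t) for q, t in min_cost_assignment(candidate_cost)]
    return seen_pairs, candidate_pairs


# Hypothetical costs: two seen queries vs. two seen targets,
# two candidate queries vs. one pseudo mask.
seen_pairs, candidate_pairs = split_matching(
    seen_cost=[[0.1, 0.9], [0.8, 0.2]],
    candidate_cost=[[0.7], [0.3]],
)
```

Because the two groups never compete for the same targets, a candidate query cannot be pushed to the background class merely because no annotated target is available for it.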
[CV-46] SOAP: Style-Omniscient Animatable Portraits
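[Quick read]: This paper tackles the generation of animatable 3D avatars from a single image, where the main challenges are style limitations (realistic, cartoon, anime) and the difficulty of handling accessories or hairstyles. The key to the solution is the SOAP framework, which uses a multiview diffusion model and an adaptive optimization pipeline to deform the FLAME mesh while preserving topology and rigging, producing textured, high-quality 3D avatars that support FACS-driven animation and integrate eyeballs and teeth.
Link: https://arxiv.org/abs/2505.05022
Authors: Tingting Liao, Yujian Zheng, Adilbek Karmanov, Liwen Hu, Leyang Jin, Yuliang Xiu, Hao Li
Affiliations: Mohamed bin Zayed University of Artificial Intelligence, UAE; Pinscreen, USA; Westlake University, China
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: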
Abstract:Creating animatable 3D avatars from a single image remains challenging due to style limitations (realistic, cartoon, anime) and difficulties in handling accessories or hairstyles. While 3D diffusion models advance single-view reconstruction for general objects, outputs often lack animation controls or suffer from artifacts because of the domain gap. We propose SOAP, a style-omniscient framework to generate rigged, topology-consistent avatars from any portrait. Our method leverages a multiview diffusion model trained on 24K 3D heads with multiple styles and an adaptive optimization pipeline to deform the FLAME mesh while maintaining topology and rigging via differentiable rendering. The resulting textured avatars support FACS-based animation, integrate with eyeballs and teeth, and preserve details like braided hair or accessories. Extensive experiments demonstrate the superiority of our method over state-of-the-art techniques for both single-view head modeling and diffusion-based generation of Image-to-3D. Our code and data are publicly available for research purposes at this https URL.
[CV-47] Adaptive Contextual Embedding for Robust Far-View Borehole Detection
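[Quick read]: This paper addresses accurate detection of densely distributed tiny boreholes in far-view imagery for controlled blasting operations, which is critical for operational safety and efficiency. Existing detectors struggle because the targets are small, densely arranged, and visually indistinct. The key to the solution is an adaptive detection approach that explicitly exploits consistent embedding representations obtained through exponential moving average (EMA)-based statistical updates and introduces three synergistic components: adaptive augmentation, embedding stabilization, and contextual refinement. The pervasive use of EMA is particularly advantageous given the low visual complexity and small scale of boreholes, enabling stable and robust representation learning.
Link: https://arxiv.org/abs/2505.05008
Authors: Xuesong Liu, Tianyu Hao, Emmett J. Ientilucci
Affiliations: Rochester Institute of Technology; Guangzhou University
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: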
Abstract:In controlled blasting operations, accurately detecting densely distributed tiny boreholes from far-view imagery is critical for operational safety and efficiency. However, existing detection methods often struggle due to small object scales, highly dense arrangements, and limited distinctive visual features of boreholes. To address these challenges, we propose an adaptive detection approach that builds upon existing architectures (e.g., YOLO) by explicitly leveraging consistent embedding representations derived through exponential moving average (EMA)-based statistical updates. Our method introduces three synergistic components: (1) adaptive augmentation utilizing dynamically updated image statistics to robustly handle illumination and texture variations; (2) embedding stabilization to ensure consistent and reliable feature extraction; and (3) contextual refinement leveraging spatial context for improved detection accuracy. The pervasive use of EMA in our method is particularly advantageous given the limited visual complexity and small scale of boreholes, allowing stable and robust representation learning even under challenging visual conditions. Experiments on a challenging proprietary quarry-site dataset demonstrate substantial improvements over baseline YOLO-based architectures, highlighting our method’s effectiveness in realistic and complex industrial scenarios.
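As a rough illustration of what "dynamically updated image statistics" maintained by an EMA might look like, here is a minimal sketch. The abstract gives no implementation details, so the class name `EmaStats`, the momentum value of 0.9, and the choice of per-image mean and variance as the tracked statistics are all assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass
class EmaStats:
    """Running mean and variance tracked with an exponential moving average."""

    momentum: float = 0.9  # hypothetical value; the abstract does not give one
    mean: float = 0.0
    var: float = 1.0
    initialized: bool = False

    def update(self, batch_mean: float, batch_var: float) -> None:
        if not self.initialized:
            # Seed with the first observation instead of the arbitrary defaults.
            self.mean, self.var = batch_mean, batch_var
            self.initialized = True
            return
        m = self.momentum
        self.mean = m * self.mean + (1.0 - m) * batch_mean
        self.var = m * self.var + (1.0 - m) * batch_var


def image_stats(pixels):
    """Mean and population variance of a flat list of pixel intensities."""
    n = len(pixels)
    mean = sum(pixels) / n
    var = sum((p - mean) ** 2 for p in pixels) / n
    return mean, var


stats = EmaStats(momentum=0.9)
for image in ([0.2, 0.4, 0.6], [0.5, 0.7, 0.9]):  # two toy "images"
    stats.update(*image_stats(image))
```

A high momentum makes the tracked statistics change slowly, so a single unusually bright or dark image shifts them only slightly; that is the stabilizing property the abstract attributes to EMA.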
[CV-48] Driving with Context: Online Map Matching for Complex Roads Using Lane Markings and Scenario Recognition
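[Quick read]: This paper addresses the accuracy of online Standard Definition (SD) map matching in complex road networks, particularly in multilevel road areas where existing methods are error-prone. The key to the solution is a Hidden Markov Model (HMM) with multiple probability factors, designed to make full use of lane markings and driving scenario recognition. First, lane markings are generated by a multi-lane tracking method and associated with the SD map using the HMM to build an enriched SD map, which lets the vehicle re-localize itself within covered areas. Second, a driving scenario recognition model generates the emission probability factor for scenario recognition, improving map matching on elevated roads and the ordinary urban roads beneath them.
Link: https://arxiv.org/abs/2505.05007
Authors: Xin Bi, Zhichao Li, Yuxuan Xia, Panpan Tong, Lijuan Zhang, Yang Chen, Junsheng Fu
Affiliations: Tongji University; Zenseact; Shanghai Jiao Tong University
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 9 pages and 12 figures. Under review at IEEE RA-L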
Abstract:Accurate online map matching is fundamental to vehicle navigation and the activation of intelligent driving functions. Current online map matching methods are prone to errors in complex road networks, especially in multilevel road areas. To address this challenge, we propose an online Standard Definition (SD) map matching method by constructing a Hidden Markov Model (HMM) with multiple probability factors. Our proposed method can achieve accurate map matching even in complex road networks by carefully leveraging lane markings and scenario recognition in the design of the probability factors. First, the lane markings are generated by a multi-lane tracking method and associated with the SD map using HMM to build an enriched SD map. In areas covered by the enriched SD map, the vehicle can re-localize itself by performing Iterative Closest Point (ICP) registration for the lane markings. Then, the probability factor accounting for the lane marking detection can be obtained using the association probability between adjacent lanes and roads. Second, the driving scenario recognition model is applied to generate the emission probability factor of scenario recognition, which improves the performance of map matching on elevated roads and ordinary urban roads underneath them. We validate our method through extensive road tests in Europe and China, and the experimental results show that our proposed method effectively improves the online map matching accuracy as compared to other existing methods, especially in multilevel road areas. Specifically, the experiments show that our proposed method achieves F1 scores of 98.04% and 94.60% on the Zenseact Open Dataset and test data of multilevel road areas in Shanghai respectively, significantly outperforming benchmark methods. The implementation is available at this https URL.
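To make the role of the emission probability factors concrete, the following sketch runs standard Viterbi decoding over two candidate roads that overlap in plan view, so the position factor alone cannot separate them. The state names, all probability values, and the function names are hypothetical; only the general structure (an HMM whose emission probability is a product of factors, one of them coming from scenario recognition) is taken from the abstract.

```python
import math


def safe_log(p):
    """Log that maps zero probability to negative infinity instead of raising."""
    return math.log(p) if p > 0 else float("-inf")


def viterbi(states, start_p, trans_p, emissions):
    """Most likely state sequence; emissions[t][s] is the emission probability."""
    score = {s: safe_log(start_p[s]) + safe_log(emissions[0][s]) for s in states}
    backpointers = []
    for emit in emissions[1:]:
        prev, score, pointers = score, {}, {}
        for s in states:
            best = max(states, key=lambda r: prev[r] + safe_log(trans_p[r][s]))
            score[s] = prev[best] + safe_log(trans_p[best][s]) + safe_log(emit[s])
            pointers[s] = best
        backpointers.append(pointers)
    path = [max(states, key=lambda s: score[s])]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    return path[::-1]


states = ["elevated", "surface"]
start_p = {"elevated": 0.5, "surface": 0.5}
# Switching between vertically stacked roads is assumed to be rare.
trans_p = {
    "elevated": {"elevated": 0.95, "surface": 0.05},
    "surface": {"elevated": 0.05, "surface": 0.95},
}
# Position factor: identical for both roads because they overlap in plan view.
position = {"elevated": 0.5, "surface": 0.5}
# Scenario-recognition factor per time step (made-up classifier outputs).
scenario = [
    {"elevated": 0.8, "surface": 0.2},
    {"elevated": 0.4, "surface": 0.6},
    {"elevated": 0.9, "surface": 0.1},
]
# Emission probability as a product of independent factors.
emissions = [{s: position[s] * f[s] for s in states} for f in scenario]

path = viterbi(states, start_p, trans_p, emissions)
```

In this toy run the middle step on its own favors "surface", but the transition model makes a brief hop between levels unlikely, so the decoded path stays on the elevated road throughout.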
[CV-49] Automated Thoracolumbar Stump Rib Detection and Analysis in a Large CT Cohort
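[Quick read]: This paper addresses automated detection and quantitative morphological analysis of thoracolumbar stump ribs. Earlier work relies on manual assessment and mostly qualitative description; this paper instead trains a high-resolution deep-learning model for automatic rib segmentation that substantially improves segmentation performance (Dice score 0.997 vs. 0.779, p-value < 0.01), and combines an iterative algorithm with piece-wise linear interpolation to measure rib length, with a success rate of 98.2%. The key is high-accuracy rib segmentation by the deep-learning model, combined with algorithmic refinements that improve the accuracy of the morphological analysis.
Link: https://arxiv.org/abs/2505.05004
Authors: Hendrik Möller, Hanna Schön, Alina Dima, Benjamin Keinert-Weth, Robert Graf, Matan Atad, Johannes Paetzold, Friederike Jungmann, Rickmer Braren, Florian Kofler, Bjoern Menze, Daniel Rueckert, Jan S. Kirschke
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: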
Abstract:Thoracolumbar stump ribs are one of the essential indicators of thoracolumbar transitional vertebrae or enumeration anomalies. While some studies manually assess these anomalies and describe the ribs qualitatively, this study aims to automate thoracolumbar stump rib detection and analyze their morphology quantitatively. To this end, we train a high-resolution deep-learning model for rib segmentation and show significant improvements compared to existing models (Dice score 0.997 vs. 0.779, p-value < 0.01). In addition, we use an iterative algorithm and piece-wise linear interpolation to assess the length of the ribs, showing a success rate of 98.2%. When analyzing morphological features, we show that stump ribs articulate more posteriorly at the vertebrae (-19.2 ± 3.8 vs -13.8 ± 2.5, p-value 0.01), are thinner (260.6 ± 103.4 vs. 563.6 ± 127.1, p-value 0.01), and are oriented more downwards and sideways within the first centimeters in contrast to full-length ribs. We show that with partially visible ribs, these features can achieve an F1-score of 0.84 in differentiating stump ribs from regular ones. We publish the model weights and masks for public use.
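For readers unfamiliar with the Dice score quoted in the abstract, a minimal sketch of the standard definition on binary masks follows. The masks are toy data and the function name `dice_score` is chosen for this example; returning 1.0 when both masks are empty is a common convention, assumed here rather than taken from the paper.

```python
def dice_score(pred, target):
    """Dice = 2 * |A ∩ B| / (|A| + |B|) for two binary masks given as flat lists."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same number of elements")
    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    total = sum(1 for p in pred if p) + sum(1 for t in target if t)
    # Convention assumed here: two empty masks count as a perfect match.
    return 1.0 if total == 0 else 2.0 * intersection / total


predicted_mask = [1, 1, 0, 0, 1]
reference_mask = [1, 0, 0, 0, 1]
score = dice_score(predicted_mask, reference_mask)
```

Here the masks share two foreground voxels out of three predicted and two reference voxels, so the score is 2 × 2 / (3 + 2) = 0.8.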
[CV-50] StabStitch++: Unsupervised Online Video Stitching with Spatiotemporal Bidirectional Warps
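[Quick read]: This paper tackles warping shake in video stitching: temporal content shake caused by sequentially unsmooth warps. Even when the input videos are stable, the stitched video can exhibit undesired warping shake that degrades the visual experience. The key to the solution is StabStitch++, a new video stitching framework that performs spatial stitching and temporal stabilization simultaneously with unsupervised learning. Its core contributions are: assuming a virtual midplane onto which the original image planes are projected, with a differentiable bidirectional decomposition module that disentangles the homography transformation and evenly spreads alignment burdens and projective distortions across the two views; deriving a mathematical expression for stitching trajectories, inspired by camera paths in video stabilization; and a warp smoothing model with a hybrid loss that simultaneously encourages content alignment, trajectory smoothness, and online collaboration, achieving temporal stability without sacrificing alignment accuracy.
Link: https://arxiv.org/abs/2505.05001
Authors: Lang Nie, Chunyu Lin, Kang Liao, Yun Zhang, Shuaicheng Liu, Yao Zhao
Affiliations: Beijing Jiaotong University; Visual Intelligence +X International Cooperation Joint Laboratory of MOE; Nanyang Technological University; Communication University of Zhejiang; University of Electronic Science and Technology of China
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: TPAMI 2025; this https URL. arXiv admin note: text overlap with arXiv:2403.06378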
Abstract:We retarget video stitching to an emerging issue, named warping shake, which unveils the temporal content shakes induced by sequentially unsmooth warps when extending image stitching to video stitching. Even if the input videos are stable, the stitched video can inevitably cause undesired warping shakes and affect the visual experience. To address this issue, we propose StabStitch++, a novel video stitching framework to realize spatial stitching and temporal stabilization with unsupervised learning simultaneously. First, different from existing learning-based image stitching solutions that typically warp one image to align with another, we suppose a virtual midplane between original image planes and project them onto it. Concretely, we design a differentiable bidirectional decomposition module to disentangle the homography transformation and incorporate it into our spatial warp, evenly spreading alignment burdens and projective distortions across two views. Then, inspired by camera paths in video stabilization, we derive the mathematical expression of stitching trajectories in video stitching by elaborately integrating spatial and temporal warps. Finally, a warp smoothing model is presented to produce stable stitched videos with a hybrid loss to simultaneously encourage content alignment, trajectory smoothness, and online collaboration. Compared with StabStitch that sacrifices alignment for stabilization, StabStitch++ makes no compromise and optimizes both of them simultaneously, especially in the online mode. To establish an evaluation benchmark and train the learning framework, we build a video stitching dataset with a rich diversity in camera motions and scenes. Experiments exhibit that StabStitch++ surpasses current solutions in stitching performance, robustness, and efficiency, offering compelling advancements in this field by building a real-time online video stitching system.
[CV-51] Inter-Diffusion Generation Model of Speakers and Listeners for Effective Communication ICMR2025
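[Quick read]: This paper addresses the fact that existing work overlooks the role of listeners in interaction and does not fully explore the dynamic interaction between speakers and listeners. The key to the solution is an Inter-Diffusion Generation Model of Speakers and Listeners, which for the first time integrates listeners' full-body gestures into the generation framework and uses a novel inter-diffusion mechanism to capture the complex interaction patterns between speakers and listeners during communication. The model also introduces interaction conditions and a generative adversarial network (GAN) to increase the denoising step size, so that generation is driven dynamically by the speaker's speech while responding in real time to the listener's feedback, enabling synergistic interaction between the two.
Link: https://arxiv.org/abs/2505.04996
Authors: Jinhe Huang, Yongkang Cheng, Yuming Hang, Gaoge Han, Jinewei Li, Jing Zhang, Xingjian Gu
Affiliations: Nanjing Agriculture University; Northwest A&F University; City University of Hong Kong; University of Chinese Academy of Sciences; Northwestern Polytechnical University
Subjects: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Comments: accepted by ICMR 2025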
Abstract:Full-body gestures play a pivotal role in natural interactions and are crucial for achieving effective communication. Nevertheless, most existing studies primarily focus on the gesture generation of speakers, overlooking the vital role of listeners in the interaction process and failing to fully explore the dynamic interaction between them. This paper innovatively proposes an Inter-Diffusion Generation Model of Speakers and Listeners for Effective Communication. For the first time, we integrate the full-body gestures of listeners into the generation framework. By devising a novel inter-diffusion mechanism, this model can accurately capture the complex interaction patterns between speakers and listeners during communication. In the model construction process, based on the advanced diffusion model architecture, we innovatively introduce interaction conditions and the GAN model to increase the denoising step size. As a result, when generating gesture sequences, the model can not only dynamically generate based on the speaker’s speech information but also respond in real time to the listener’s feedback, enabling synergistic interaction between the two. Abundant experimental results demonstrate that compared with the current state-of-the-art gesture generation methods, the model we proposed has achieved remarkable improvements in the naturalness, coherence, and speech-gesture synchronization of the generated gestures. In the subjective evaluation experiments, users highly praised the generated interaction scenarios, believing that they are closer to real-life human communication situations. Objective index evaluations also show that our model outperforms the baseline methods in multiple key indicators, providing more powerful support for effective communication.
[CV-52] Federated Deconfounding and Debiasing Learning for Out-of-Distribution Generalization IJCAI-25
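[Quick read]: This paper addresses attribute bias in federated learning (FL), which typically causes local models to optimize inconsistently because they learn non-causal associations, degrading performance. Existing methods use data augmentation to increase sample diversity or knowledge distillation to learn invariant representations, but they lack a comprehensive analysis of the inference paths, and interference from confounding factors limits their effectiveness. The key to the proposed Federated Deconfounding and Debiasing Learning (FedDDL) method is to construct a structured causal graph to analyze the model inference process and to apply backdoor adjustment to eliminate confounding paths. Its core contributions are an intra-client deconfounding learning module that decouples background and objects and generates counterfactual samples so that the model cannot use the background to infer the label, and an inter-client debiasing learning module that constructs causal prototypes, reduces the proportion of background in prototype components, and bridges the gap between heterogeneous representations through causal prototypical regularization.
Link: https://arxiv.org/abs/2505.04979
Authors: Zhuang Qi, Sijin Zhou, Lei Meng, Han Hu, Han Yu, Xiangxu Meng
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: IJCAI-25 Accepted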
Abstract:Attribute bias in federated learning (FL) typically leads local models to optimize inconsistently due to the learning of non-causal associations, resulting in degraded performance. Existing methods either use data augmentation for increasing sample diversity or knowledge distillation for learning invariant representations to address this problem. However, they lack a comprehensive analysis of the inference paths, and the interference from confounding factors limits their performance. To address these limitations, we propose the Federated Deconfounding and Debiasing Learning (FedDDL) method. It constructs a structured causal graph to analyze the model inference process, and performs backdoor adjustment to eliminate confounding paths. Specifically, we design an intra-client deconfounding learning module for computer vision tasks to decouple background and objects, generating counterfactual samples that establish a connection between the background and any label, which stops the model from using the background to infer the label. Moreover, we design an inter-client debiasing learning module to construct causal prototypes to reduce the proportion of the background in prototype components. Notably, it bridges the gap between heterogeneous representations via causal prototypical regularization. Extensive experiments on 2 benchmarking datasets demonstrate that FedDDL significantly enhances the model capability to focus on main objects in unseen data, leading to 4.5% higher Top-1 Accuracy on average over 9 state-of-the-art existing methods.
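For reference, the backdoor adjustment named in the abstract is a standard identity from causal inference. Writing X for the input, Y for the label, and Z for a confounder that satisfies the backdoor criterion (treating the image background as Z is an assumption made here for illustration, not a detail confirmed by the abstract), it reads:

```latex
P\bigl(Y \mid \mathrm{do}(X = x)\bigr)
  = \sum_{z} P\bigl(Y \mid X = x,\, Z = z\bigr)\, P(Z = z)
```

The sum weights each value of Z by its marginal probability P(Z = z) rather than by P(Z = z | X = x), which is what removes the spurious path X ← Z → Y.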
[CV-53] ReAlign: Bilingual Text-to-Motion Generation via Step-Aware Reward-Guided Alignment
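[Quick read]: This paper addresses the key challenges of bilingual text-to-motion generation: the absence of bilingual motion-language datasets and the misalignment between text and motion distributions in diffusion models, which lead to semantically inconsistent or low-quality motions. The key to the solution is BiHumanML3D, a dataset that serves as a benchmark for bilingual text-to-motion generation models, together with a Bilingual Motion Diffusion model (BiMD) that uses cross-lingual aligned representations to capture semantics and yields a unified bilingual model. The paper further proposes Reward-guided sampling Alignment (ReAlign), which uses a step-aware reward model to assess alignment quality and steers the diffusion process toward an optimally aligned distribution, improving motion quality and semantic consistency.
Link: https://arxiv.org/abs/2505.04974
Authors: Wanjiang Weng, Xiaofeng Tan, Hongsong Wang, Pan Zhou
Affiliations: Southeast University; Singapore Management University
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 17 pages, 9 figures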
Abstract:Bilingual text-to-motion generation, which synthesizes 3D human motions from bilingual text inputs, holds immense potential for cross-linguistic applications in gaming, film, and robotics. However, this task faces critical challenges: the absence of bilingual motion-language datasets and the misalignment between text and motion distributions in diffusion models, leading to semantically inconsistent or low-quality motions. To address these challenges, we propose BiHumanML3D, a novel bilingual human motion dataset, which establishes a crucial benchmark for bilingual text-to-motion generation models. Furthermore, we propose a Bilingual Motion Diffusion model (BiMD), which leverages cross-lingual aligned representations to capture semantics, thereby achieving a unified bilingual model. Building upon this, we propose Reward-guided sampling Alignment (ReAlign) method, comprising a step-aware reward model to assess alignment quality during sampling and a reward-guided strategy that directs the diffusion process toward an optimally aligned distribution. This reward model integrates step-aware tokens and combines a text-aligned module for semantic consistency and a motion-aligned module for realism, refining noisy motions at each timestep to balance probability density and alignment. Experiments demonstrate that our approach significantly improves text-motion alignment and motion quality compared to existing state-of-the-art methods. Project page: this https URL.
[CV-54] AI and Vision based Autonomous Navigation of Nano-Drones in Partially-Known Environments
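[Quick read]: This paper tackles safe autonomous navigation and high-level tasks such as exploration and surveillance on resource-constrained nano-drone platforms. The key to the solution is splitting the navigation task into two parts: a deep learning-based object detector runs on an edge device, while the planning algorithm runs onboard, which accommodates the nano-drone's compute and energy constraints.
Link: https://arxiv.org/abs/2505.04972
Authors: Mattia Sartori, Chetna Singhal, Neelabhro Roy, Davide Brunelli, James Gross
Affiliations: KTH Royal Institute of Technology, Sweden; Inria, France; University of Trento, Italy
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Networking and Internet Architecture (cs.NI)
Comments: in DCOSS-IoT 2025, Wi-DroIT 2025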
Abstract:The miniaturisation of sensors and processors, the advancements in connected edge intelligence, and the exponential interest in Artificial Intelligence are boosting the affirmation of autonomous nano-size drones in the Internet of Robotic Things ecosystem. However, achieving safe autonomous navigation and high-level tasks such as exploration and surveillance with these tiny platforms is extremely challenging due to their limited resources. This work focuses on enabling the safe and autonomous flight of a pocket-size, 30-gram platform called Crazyflie 2.1 in a partially known environment. We propose a novel AI-aided, vision-based reactive planning method for obstacle avoidance under the ambit of Integrated Sensing, Computing and Communication paradigm. We deal with the constraints of the nano-drone by splitting the navigation task into two parts: a deep learning-based object detector runs on the edge (external hardware) while the planning algorithm is executed onboard. The results show the ability to command the drone at ~8 frames per second and a model performance reaching a COCO mean-average-precision of 60.8. Field experiments demonstrate the feasibility of the solution with the drone flying at a top speed of 1 m/s while steering away from an obstacle placed in an unknown position and reaching the target destination. The outcome highlights the compatibility of the communication delay and the model performance with the requirements of the real-time navigation task. We provide a feasible alternative to a fully onboard implementation that can be extended to autonomous exploration with nano-drones.
[CV-55] DenseGrounding: Improving Dense Language-Vision Semantics for Ego-Centric 3D Visual Grounding ICLR2025
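[Quick read]: This paper addresses two key problems in ego-centric 3D visual grounding: the loss of fine-grained visual semantics caused by sparse fusion of point clouds with ego-centric multi-view images, and the limited textual semantic context caused by arbitrary language descriptions. The key to the solution is DenseGrounding, which enhances visual features with a Hierarchical Scene Semantic Enhancer that retains dense scene semantics and facilitates cross-modal alignment, and enriches text with a Language Semantic Enhancer that uses large language models during training to provide rich context and diverse language descriptions.
Link: https://arxiv.org/abs/2505.04965
Authors: Henry Zheng, Hao Shi, Qihang Peng, Yong Xien Chng, Rui Huang, Yepeng Weng, Zhongchao Shi, Gao Huang
Affiliations: Tsinghua University; Lenovo Research
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by ICLR 2025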
Abstract:Enabling intelligent agents to comprehend and interact with 3D environments through natural language is crucial for advancing robotics and human-computer interaction. A fundamental task in this field is ego-centric 3D visual grounding, where agents locate target objects in real-world 3D spaces based on verbal descriptions. However, this task faces two significant challenges: (1) loss of fine-grained visual semantics due to sparse fusion of point clouds with ego-centric multi-view images, (2) limited textual semantic context due to arbitrary language descriptions. We propose DenseGrounding, a novel approach designed to address these issues by enhancing both visual and textual semantics. For visual features, we introduce the Hierarchical Scene Semantic Enhancer, which retains dense semantics by capturing fine-grained global scene features and facilitating cross-modal alignment. For text descriptions, we propose a Language Semantic Enhancer that leverages large language models to provide rich context and diverse language descriptions with additional context during model training. Extensive experiments show that DenseGrounding significantly outperforms existing methods in overall accuracy, with improvements of 5.81% and 7.56% when trained on the comprehensive full dataset and smaller mini subset, respectively, further advancing the SOTA in egocentric 3D visual grounding. Our method also achieves 1st place and receives the Innovation Award in the CVPR 2024 Autonomous Grand Challenge Multi-view 3D Visual Grounding Track, validating its effectiveness and robustness.
[CV-56] CAG-VLM: Fine-Tuning of a Large-Scale Model to Recognize Angiographic Images for Next-Generation Diagnostic Systems
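[Quick read]: This paper aims to reduce the heavy reliance on expert cardiologists in the interpretation of coronary angiography (CAG) images by introducing an AI-based decision support system for diagnosis and treatment planning. The key to the solution is a two-stage, physician-curated pipeline and a bilingual (Japanese/English) CAG image-report dataset, used to fine-tune vision-language models (VLMs) that generate clinical reports and treatment recommendations. A classifier achieves accurate left/right laterality classification even on low-contrast frames, the models are evaluated on independent exams, and the best-performing model is designated CAG-VLM, demonstrating that specialized fine-tuned VLMs can assist in diagnosing cardiovascular disease.
Link: https://arxiv.org/abs/2505.04964
Authors: Yuto Nakamura, Satoshi Kodera, Haruki Settai, Hiroki Shinohara, Masatsugu Tamura, Tomohiro Noguchi, Tatsuki Furusawa, Ryo Takizawa, Tempei Kabayama, Norihiko Takeda
Affiliations: The University of Tokyo; The University of Tokyo Hospital
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: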
Abstract:Coronary angiography (CAG) is the gold-standard imaging modality for evaluating coronary artery disease, but its interpretation and subsequent treatment planning rely heavily on expert cardiologists. To enable AI-based decision support, we introduce a two-stage, physician-curated pipeline and a bilingual (Japanese/English) CAG image-report dataset. First, we sample 14,686 frames from 539 exams and annotate them for key-frame detection and left/right laterality; a ConvNeXt-Base CNN trained on this data achieves 0.96 F1 on laterality classification, even on low-contrast frames. Second, we apply the CNN to 243 independent exams, extract 1,114 key frames, and pair each with its pre-procedure report and expert-validated diagnostic and treatment summary, yielding a parallel corpus. We then fine-tune three open-source VLMs (PaliGemma2, Gemma3, and ConceptCLIP-enhanced Gemma3) via LoRA and evaluate them using VLScore and cardiologist review. Although PaliGemma2 w/LoRA attains the highest VLScore, Gemma3 w/LoRA achieves the top clinician rating (mean 7.20/10); we designate this best-performing model as CAG-VLM. These results demonstrate that specialized, fine-tuned VLMs can effectively assist cardiologists in generating clinical reports and treatment recommendations from CAG images.
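The LoRA fine-tuning mentioned in the abstract keeps the pretrained weight matrix W frozen and learns a low-rank update, giving an effective weight W + (alpha / r) · B · A. The sketch below shows only that arithmetic on tiny hand-written matrices; the values, the helper names, and the pure-Python matrix product are illustrative assumptions and say nothing about how the paper's models were actually configured.

```python
def matmul(a, b):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def lora_effective_weight(w, a, b, alpha):
    """Return W + (alpha / r) * B @ A, where r is the rank (number of rows of A).

    w: frozen pretrained weight, shape d_out x d_in
    a: trainable down-projection, shape r x d_in
    b: trainable up-projection, shape d_out x r
    """
    r = len(a)
    delta = matmul(b, a)
    scale = alpha / r
    return [
        [w_ij + scale * d_ij for w_ij, d_ij in zip(w_row, d_row)]
        for w_row, d_row in zip(w, delta)
    ]


w = [[1.0, 0.0], [0.0, 1.0]]   # frozen 2 x 2 weight (toy values)
a = [[1.0, 2.0]]               # rank r = 1
b = [[0.5], [0.0]]
w_adapted = lora_effective_weight(w, a, b, alpha=2.0)

# Trainable parameters: r * (d_in + d_out) instead of d_in * d_out.
trainable = len(a) * (len(w[0]) + len(w))
```

For realistic layer sizes the rank r is much smaller than either dimension of W, which is why the adapter adds few trainable parameters.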
[CV-57] ViCTr: Vital Consistency Transfer for Pathology Aware Image Synthesis
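[Quick read]: This paper addresses key challenges in medical image synthesis: limited annotated pathological data, modality domain gaps, and the complexity of representing diffuse pathologies such as liver cirrhosis. Existing methods struggle to model pathological features accurately while preserving anatomical fidelity, and often rely on natural-image priors or inefficient multi-step sampling. The proposed solution, ViCTr (Vital Consistency Transfer), combines a rectified flow trajectory with a Tweedie-corrected diffusion process to achieve high-fidelity, pathology-aware image generation. ViCTr is pretrained on the ATLAS-8k dataset and then fine-tuned adversarially with Low-Rank Adaptation (LoRA) modules for precise control over pathology severity; by reformulating Tweedie's formula within a linear trajectory framework, it supports one-step sampling, which substantially improves generation efficiency while preserving anatomical realism.
Link: https://arxiv.org/abs/2505.04963
Authors: Onkar Susladkar, Gayatri Deshmukh, Yalcin Tur, Ulas Bagci
Affiliations: Northwestern University; Stanford University
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: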
Abstract:Synthesizing medical images remains challenging due to limited annotated pathological data, modality domain gaps, and the complexity of representing diffuse pathologies such as liver cirrhosis. Existing methods often struggle to maintain anatomical fidelity while accurately modeling pathological features, frequently relying on priors derived from natural images or inefficient multi-step sampling. In this work, we introduce ViCTr (Vital Consistency Transfer), a novel two-stage framework that combines a rectified flow trajectory with a Tweedie-corrected diffusion process to achieve high-fidelity, pathology-aware image synthesis. First, we pretrain ViCTr on the ATLAS-8k dataset using Elastic Weight Consolidation (EWC) to preserve critical anatomical structures. We then fine-tune the model adversarially with Low-Rank Adaptation (LoRA) modules for precise control over pathology severity. By reformulating Tweedie’s formula within a linear trajectory framework, ViCTr supports one-step sampling, reducing inference from 50 steps to just 4, without sacrificing anatomical realism. We evaluate ViCTr on BTCV (CT), AMOS (MRI), and CirrMRI600+ (cirrhosis) datasets. Results demonstrate state-of-the-art performance, achieving a Medical Frechet Inception Distance (MFID) of 17.01 for cirrhosis synthesis 28% lower than existing approaches and improving nnUNet segmentation by +3.8% mDSC when used for data augmentation. Radiologist reviews indicate that ViCTr-generated liver cirrhosis MRIs are clinically indistinguishable from real scans. To our knowledge, ViCTr is the first method to provide fine-grained, pathology-aware MRI synthesis with graded severity control, closing a critical gap in AI-driven medical imaging research.
[CV-58] An Efficient Method for Accurate Pose Estimation and Error Correction of Cuboidal Objects IROS2022
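[Quick read]: This paper tackles high-precision pose estimation for autonomously picking cuboidal objects from organized or unorganized piles, aiming to reduce target pose error in a time-efficient manner. The key to the solution is an alternative, linear-time approach to pose error estimation and correction, which avoids both the limited accuracy of conventional global point cloud registration and the execution-time overhead and uncertain final pose error of local registration algorithms.
Link: https://arxiv.org/abs/2505.04962
Authors: Utsav Rai, Hardik Mehta, Vismay Vakharia, Aditya Choudhary, Amit Parmar, Rolif Lima, Kaushik Das
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments: Accepted in IEEE/RSJ IROS 2022 Workshop on Mobile Manipulation and Embodied Intelligence (MOMA)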
Abstract:The proposed system outlined in this paper is a solution to a use case that requires the autonomous picking of cuboidal objects from an organized or unorganized pile with high precision. This paper presents an efficient method for precise pose estimation of cuboid-shaped objects, which aims to reduce errors in target pose in a time-efficient manner. Typical pose estimation methods like global point cloud registrations are prone to minor pose errors for which local registration algorithms are generally used to improve pose accuracy. However, due to the execution time overhead and uncertainty in the error of the final achieved pose, an alternate, linear time approach is proposed for pose error estimation and correction. This paper presents an overview of the solution followed by a detailed description of individual modules of the proposed algorithm.
[CV-59] ADD: Physics-Based Motion Imitation with Adversarial Differential Discriminators
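[Quick read]: This paper addresses the performance limits of multi-objective optimization methods that depend on manually tuned aggregation functions, particularly in reinforcement-learning-based motion tracking for physically simulated characters, where conventional methods need intricately crafted reward functions to reach high-fidelity results. Such methods require domain expertise and extensive manual tuning, and they limit how well a reward function transfers across skills. The key to the solution is a novel adversarial multi-objective optimization technique built around an adversarial differential discriminator that receives a single positive sample; it guides the optimization effectively without manually tuned reward functions and achieves quality comparable to state-of-the-art motion-tracking methods.
Link: https://arxiv.org/abs/2505.04961
Authors: Ziyu Zhang, Sergey Bashkirov, Dun Yang, Michael Taylor, Xue Bin Peng
Affiliations: Simon Fraser University; Sony Playstation; NVIDIA
Subjects: Graphics (cs.GR); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments: 19 pages, 15 figures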
Abstract:Multi-objective optimization problems, which require the simultaneous optimization of multiple terms, are prevalent across numerous applications. Existing multi-objective optimization methods often rely on manually tuned aggregation functions to formulate a joint optimization target. The performance of such hand-tuned methods is heavily dependent on careful weight selection, a time-consuming and laborious process. These limitations also arise in the setting of reinforcement-learning-based motion tracking for physically simulated characters, where intricately crafted reward functions are typically used to achieve high-fidelity results. Such solutions not only require domain expertise and significant manual adjustment, but also limit the applicability of the resulting reward function across diverse skills. To bridge this gap, we present a novel adversarial multi-objective optimization technique that is broadly applicable to a range of multi-objective optimization problems, including motion tracking. The proposed adversarial differential discriminator receives a single positive sample, yet is still effective at guiding the optimization process. We demonstrate that our technique can enable characters to closely replicate a variety of acrobatic and agile behaviors, achieving comparable quality to state-of-the-art motion-tracking methods, without relying on manually tuned reward functions. Results are best visualized through this https URL.
[CV-60] Building-Guided Pseudo-Label Learning for Cross-Modal Building Damage Mapping
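[Quick Read]: This paper addresses the challenge of mapping building damage from pre-disaster optical images and post-disaster synthetic aperture radar (SAR) images in order to assess building damage accurately. The key to the solution is a building-guided pseudo-label learning framework: multi-model fusion and test-time augmentation generate low-uncertainty pseudo-labels, and building priors then refine the pseudo-labels for damaged buildings, improving damage classification accuracy.
Link: https://arxiv.org/abs/2505.04941
Authors: Jiepan Li,He Huang,Yu Sheng,Yujun Guo,Wei He
Affiliations: Wuhan University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: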
Abstract:Accurate building damage assessment using bi-temporal multi-modal remote sensing images is essential for effective disaster response and recovery planning. This study proposes a novel Building-Guided Pseudo-Label Learning Framework to address the challenges of mapping building damage from pre-disaster optical and post-disaster SAR images. First, we train a series of building extraction models using pre-disaster optical images and building labels. To enhance building segmentation, we employ multi-model fusion and test-time augmentation strategies to generate pseudo-probabilities, followed by a low-uncertainty pseudo-label training method for further refinement. Next, a change detection model is trained on bi-temporal cross-modal images and damaged building labels. To improve damage classification accuracy, we introduce a building-guided low-uncertainty pseudo-label refinement strategy, which leverages building priors from the previous step to guide pseudo-label generation for damaged buildings, reducing uncertainty and enhancing reliability. Experimental results on the 2025 IEEE GRSS Data Fusion Contest dataset demonstrate the effectiveness of our approach, which achieved the highest mIoU score (54.28%) and secured first place in the competition.
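The fusion and low-uncertainty labelling steps described above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function names, the plain averaging, the 0.2/0.8 confidence thresholds, and the ignore index 255 are all assumptions made for the example.

```python
from statistics import mean

def fuse_probabilities(per_model_probs):
    """Average per-pixel building probabilities over models / augmented views.

    per_model_probs: list of flattened probability maps, one per model or view.
    Plain averaging is an assumption; the paper does not specify the fusion rule here.
    """
    n_pixels = len(per_model_probs[0])
    return [mean(p[i] for p in per_model_probs) for i in range(n_pixels)]

def low_uncertainty_pseudo_labels(fused, low=0.2, high=0.8, ignore=255):
    """Label confident pixels 0/1 and mark the uncertain band with an ignore index.

    The thresholds and ignore value are hypothetical choices for illustration.
    """
    labels = []
    for p in fused:
        if p >= high:
            labels.append(1)        # confident building pixel
        elif p <= low:
            labels.append(0)        # confident background pixel
        else:
            labels.append(ignore)   # too uncertain: excluded from training
    return labels

# Three hypothetical model outputs for a 4-pixel image.
probs = [
    [0.95, 0.10, 0.55, 0.85],
    [0.90, 0.05, 0.40, 0.90],
    [0.85, 0.15, 0.60, 0.80],
]
fused = fuse_probabilities(probs)
pseudo = low_uncertainty_pseudo_labels(fused)
```

Pixels that fall in the uncertain band are masked out instead of being forced to a class, so only confident predictions feed the next training round.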
[CV-61] FF-PNet: A Pyramid Network Based on Feature and Field for Brain Image Registration
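[Quick Read]: This paper addresses the inefficiency of existing deformable medical image registration models in extracting coarse-grained and fine-grained features in parallel. The key to the solution is FF-PNet, a pyramid registration network based on features and deformation fields, which runs two modules in parallel: a Residual Feature Fusion Module (RFFM) for coarse-grained feature extraction and a Residual Deformation Field Fusion Module (RDFFM) for fine-grained image deformation. Together the two modules markedly improve registration accuracy, and the network performs well even though its encoding stage uses only conventional convolutional neural networks, with no attention mechanisms or multilayer perceptrons.
Link: https://arxiv.org/abs/2505.04938
Authors: Ying Zhang,Shuai Guo,Chenxi Sun,Yuchen Zhu,Jinhai Xiang
Affiliations: Huazhong Agricultural University; Agricultural Bioinformatics Key Laboratory of Hubei Province; Key Laboratory of Smart Farming for Agricultural Animals, Ministry of Agriculture
Categories: Computer Vision and Pattern Recognition (cs.CV); Information Retrieval (cs.IR)
Comments: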
Abstract:In recent years, deformable medical image registration techniques have made significant progress. However, existing models still lack efficiency in parallel extraction of coarse and fine-grained features. To address this, we construct a new pyramid registration network based on feature and deformation field (FF-PNet). For coarse-grained feature extraction, we design a Residual Feature Fusion Module (RFFM), for fine-grained image deformation, we propose a Residual Deformation Field Fusion Module (RDFFM). Through the parallel operation of these two modules, the model can effectively handle complex image deformations. It is worth emphasizing that the encoding stage of FF-PNet only employs traditional convolutional neural networks without any attention mechanisms or multilayer perceptrons, yet it still achieves remarkable improvements in registration accuracy, fully demonstrating the superior feature decoding capabilities of RFFM and RDFFM. We conducted extensive experiments on the LPBA and OASIS datasets. The results show our network consistently outperforms popular methods in metrics like the Dice Similarity Coefficient.
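For readers unfamiliar with the metric named above, the Dice Similarity Coefficient between two binary masks A and B is 2|A ∩ B| / (|A| + |B|). The snippet below is a small sketch of that definition on flattened masks; the function name and the toy masks are illustrative only and are not taken from the paper.

```python
def dice_coefficient(mask_a, mask_b):
    """Dice = 2 * |A intersect B| / (|A| + |B|) for flattened binary masks."""
    intersection = sum(1 for a, b in zip(mask_a, mask_b) if a and b)
    total = sum(1 for a in mask_a if a) + sum(1 for b in mask_b if b)
    # Two empty masks are treated as a perfect match by convention here.
    return 2.0 * intersection / total if total else 1.0

warped_labels = [1, 1, 0, 0]   # hypothetical warped moving-image segmentation
fixed_labels = [1, 0, 1, 0]    # hypothetical fixed-image segmentation
score = dice_coefficient(warped_labels, fixed_labels)
```

In registration benchmarks the score is usually computed per anatomical label after warping the moving segmentation, then averaged.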
[CV-62] Canny2Palm: Realistic and Controllable Palmprint Generation for Large-scale Pre-training
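[Quick Read]: This paper addresses the difficulty of improving palmprint recognition accuracy when palmprint data are scarce. The key to the solution is Canny2Palm, a novel synthesis method that extracts palm textures with the Canny edge detector and uses them to condition a Pix2Pix network that generates realistic palmprint images. Re-assembling palmprint textures from different identities creates new identities, giving controllable diversity and allowing datasets to be scaled up with large numbers of new identities.
Link: https://arxiv.org/abs/2505.04922
Authors: Xingzeng Lan,Xing Duan,Chen Chen,Weiyu Lin,Bo Wang
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: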
Abstract:Palmprint recognition is a secure and privacy-friendly method of biometric identification. One of the major challenges to improve palmprint recognition accuracy is the scarcity of palmprint data. Recently, a popular line of research revolves around the synthesis of virtual palmprints for large-scale pre-training purposes. In this paper, we propose a novel synthesis method named Canny2Palm that extracts palm textures with Canny edge detector and uses them to condition a Pix2Pix network for realistic palmprint generation. By re-assembling palmprint textures from different identities, we are able to create new identities by seeding the generator with new assemblies. Canny2Palm not only synthesizes realistic data following the distribution of real palmprints but also enables controllable diversity to generate large-scale new identities. On open-set palmprint recognition benchmarks, models pre-trained with Canny2Palm synthetic data outperform the state-of-the-art with up to 7.2% higher identification accuracy. Moreover, the performance of models pre-trained with Canny2Palm continues to improve given 10,000 synthetic IDs while those with existing methods already saturate, demonstrating the potential of our method for large-scale pre-training.
[CV-63] A Simple Detector with Frame Dynamics is a Strong Tracker CVPR
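[Quick Read]: This paper addresses the challenge of tracking tiny targets in infrared imagery, in particular the reliance of existing trackers on cropped template regions and their limited motion modeling. The key to the solution is to improve tracking by integrating global detection and motion-aware learning with temporal priors. First, a frame dynamics mechanism uses frame difference and optical flow to encode target features and motion characteristics at the input level, making the target easier to separate from background clutter. Second, a trajectory constraint filtering strategy in post-processing uses spatio-temporal priors to suppress false positives and improve tracking robustness.
Link: https://arxiv.org/abs/2505.04917
Authors: Chenxu Peng,Chenxu Wang,Minrui Zou,Danyang Li,Zhengpeng Yang,Yimian Dai,Ming-Ming Cheng,Xiang Li
Affiliations: Nankai University; NKIARI
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 2025 CVPR Anti-UAV Workshop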
Abstract:Infrared object tracking plays a crucial role in Anti-Unmanned Aerial Vehicle (Anti-UAV) applications. Existing trackers often depend on cropped template regions and have limited motion modeling capabilities, which pose challenges when dealing with tiny targets. To address this, we propose a simple yet effective infrared tiny-object tracker that enhances tracking performance by integrating global detection and motion-aware learning with temporal priors. Our method is based on object detection and achieves significant improvements through two key innovations. First, we introduce frame dynamics, leveraging frame difference and optical flow to encode both prior target features and motion characteristics at the input level, enabling the model to better distinguish the target from background clutter. Second, we propose a trajectory constraint filtering strategy in the post-processing stage, utilizing spatio-temporal priors to suppress false positives and enhance tracking robustness. Extensive experiments show that our method consistently outperforms existing approaches across multiple metrics in challenging infrared UAV tracking scenarios. Notably, we achieve state-of-the-art performance in the 4th Anti-UAV Challenge, securing 1st place in Track 1 and 2nd place in Track 2.
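The two ideas in this abstract, frame difference as an extra input channel and a trajectory constraint in post-processing, can be illustrated with a toy sketch. Everything here is an assumption made for the example: the function names, the two-channel layout, and the simple distance gate stand in for the paper's actual design, and optical flow is omitted.

```python
def frame_difference(prev_frame, curr_frame):
    """Absolute per-pixel difference between two grayscale frames (lists of rows)."""
    return [[abs(c - p) for p, c in zip(prow, crow)]
            for prow, crow in zip(prev_frame, curr_frame)]

def stack_dynamics(prev_frame, curr_frame):
    """Build a channels-first input: [current intensities, frame difference]."""
    return [curr_frame, frame_difference(prev_frame, curr_frame)]

def filter_by_trajectory(last_center, detections, max_jump):
    """Keep detections (x, y, score) within max_jump pixels of the last position.

    A hypothetical stand-in for trajectory constraint filtering.
    """
    lx, ly = last_center
    return [d for d in detections
            if ((d[0] - lx) ** 2 + (d[1] - ly) ** 2) ** 0.5 <= max_jump]

# A bright 1-pixel target moves one pixel to the right between frames.
prev = [[10, 10, 10], [10, 200, 10], [10, 10, 10]]
curr = [[10, 10, 10], [10, 10, 200], [10, 10, 10]]
model_input = stack_dynamics(prev, curr)

# A high-scoring but implausibly distant detection is rejected.
kept = filter_by_trajectory((50, 50), [(52, 51, 0.6), (120, 30, 0.9)], max_jump=10)
```

The difference channel lights up only where the target left and arrived, which is what makes a tiny moving object stand out from static clutter.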
[CV-64] GlyphMastero: A Glyph Encoder for High-Fidelity Scene Text Editing CVPR2025
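[Quick Read]: This paper addresses the difficulty of preserving text style consistency and visual coherence in scene text editing, especially for complex characters such as Chinese, where existing diffusion-based methods struggle to produce high-quality, recognizable text. The key to the solution is GlyphMastero, a specialized glyph encoder whose novel glyph attention module explicitly models cross-level interactions between local characters and global text lines, and which uses a feature pyramid network to fuse multi-scale OCR backbone features. The resulting glyph-aware guidance is more detailed and enables precise control over scene text generation.
Link: https://arxiv.org/abs/2505.04915
Authors: Tong Wang,Ting Liu,Xiaochao Qu,Chengjing Wu,Luoqi Liu,Xiaolin Hu
Affiliations: MT Lab, Meitu Inc.; Department of Computer Science and Technology, BNRist, IDG/McGovern Institute for Brain Research, Tsinghua University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: CVPR 2025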
Abstract:Scene text editing, a subfield of image editing, requires modifying texts in images while preserving style consistency and visual coherence with the surrounding environment. While diffusion-based methods have shown promise in text generation, they still struggle to produce high-quality results. These methods often generate distorted or unrecognizable characters, particularly when dealing with complex characters like Chinese. In such systems, characters are composed of intricate stroke patterns and spatial relationships that must be precisely maintained. We present GlyphMastero, a specialized glyph encoder designed to guide the latent diffusion model for generating texts with stroke-level precision. Our key insight is that existing methods, despite using pretrained OCR models for feature extraction, fail to capture the hierarchical nature of text structures - from individual strokes to stroke-level interactions to overall character-level structure. To address this, our glyph encoder explicitly models and captures the cross-level interactions between local-level individual characters and global-level text lines through our novel glyph attention module. Meanwhile, our model implements a feature pyramid network to fuse the multi-scale OCR backbone features at the global-level. Through these cross-level and multi-scale fusions, we obtain more detailed glyph-aware guidance, enabling precise control over the scene text generation process. Our method achieves an 18.02% improvement in sentence accuracy over the state-of-the-art multi-lingual scene text editing baseline, while simultaneously reducing the text-region Fréchet inception distance by 53.28%.
[CV-65] Pro2SAM: Mask Prompt to SAM with Grid Points for Weakly Supervised Object Localization ECCV2024
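[Quick Read]: This paper addresses the insufficient learning of pixel-level fine-grained information in Weakly Supervised Object Localization (WSOL), which results from using only image-level labels and hinders further progress in WSOL. The key to the solution is to exploit the zero-shot generalization and fine-grained segmentation capabilities of the Segment Anything Model (SAM) through Pro2SAM, a mask-prompt network with grid points that boosts the activation of integral object regions. Specifically, a Global Token Transformer (GTFormer) first generates a coarse-grained foreground map that serves as a flexible mask prompt; grid points are then fed to SAM as dense prompts to maximize the probability of the foreground mask; finally, a pixel-level similarity metric matches the mask prompt against the SAM masks.
Link: https://arxiv.org/abs/2505.04905
Authors: Xi Yang,Songsong Duan,Nannan Wang,Xinbo Gao
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by ECCV 2024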
Abstract:Weakly Supervised Object Localization (WSOL), which aims to localize objects by only using image-level labels, has attracted much attention because of its low annotation cost in real applications. Current studies focus on the Class Activation Map (CAM) of CNN and the self-attention map of transformer to identify the region of objects. However, both CAM and self-attention maps can not learn pixel-level fine-grained information on the foreground objects, which hinders the further advance of WSOL. To address this problem, we initiatively leverage the capability of zero-shot generalization and fine-grained segmentation in Segment Anything Model (SAM) to boost the activation of integral object regions. Further, to alleviate the semantic ambiguity issue accrued in single point prompt-based SAM, we propose an innovative mask prompt to SAM (Pro2SAM) network with grid points for WSOL task. First, we devise a Global Token Transformer (GTFormer) to generate a coarse-grained foreground map as a flexible mask prompt, where the GTFormer jointly embeds patch tokens and novel global tokens to learn foreground semantics. Secondly, we deliver grid points as dense prompts into SAM to maximize the probability of foreground mask, which avoids the lack of objects caused by a single point/box prompt. Finally, we propose a pixel-level similarity metric to come true the mask matching from mask prompt to SAM, where the mask with the highest score is viewed as the final localization map. Experiments show that the proposed Pro2SAM achieves state-of-the-art performance on both CUB-200-2011 and ILSVRC, with 84.03% and 66.85% Top-1 Loc, respectively.
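The grid-point prompting and mask-matching steps can be sketched without calling SAM at all. In the sketch below the candidate masks are hard-coded stand-ins for what a segmentation model would return, and intersection-over-union is used as an assumed substitute for the paper's pixel-level similarity metric; all names are hypothetical.

```python
def grid_points(width, height, n_per_side):
    """Evenly spaced (x, y) prompts at the cell centres of an n x n grid."""
    step_x = width / n_per_side
    step_y = height / n_per_side
    return [((i + 0.5) * step_x, (j + 0.5) * step_y)
            for j in range(n_per_side) for i in range(n_per_side)]

def iou(mask_a, mask_b):
    """Intersection over union of two flattened binary masks."""
    inter = sum(1 for a, b in zip(mask_a, mask_b) if a and b)
    union = sum(1 for a, b in zip(mask_a, mask_b) if a or b)
    return inter / union if union else 0.0

def best_mask(coarse_prompt, candidates):
    """Index of the candidate mask most similar to the coarse foreground map."""
    return max(range(len(candidates)),
               key=lambda k: iou(coarse_prompt, candidates[k]))

points = grid_points(width=4, height=4, n_per_side=2)

coarse = [0, 1, 1, 1, 0, 0]            # hypothetical coarse foreground map
candidates = [                          # hypothetical masks, one per grid prompt
    [1, 0, 0, 0, 0, 0],
    [0, 1, 1, 0, 0, 0],
    [0, 1, 1, 1, 1, 0],
]
chosen = best_mask(coarse, candidates)
```

Dense grid prompts yield many candidate masks, so the coarse map acts as a selector instead of having to be accurate at the pixel level itself.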
[CV-66] OWT: A Foundational Organ-Wise Tokenization Framework for Medical Imaging
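[Quick Read]: This paper addresses the entanglement of semantic components in conventional representation learning based on holistic, black-box embeddings, which limits interpretability and generalization and is especially critical in medical imaging. The proposed solution is the Organ-Wise Tokenization (OWT) framework. Its key is the Token Group-based Reconstruction (TGR) training paradigm, which explicitly decomposes an image into separable token groups, each corresponding to a distinct organ or semantic entity, to achieve semantically disentangled representation learning.
Link: https://arxiv.org/abs/2505.04899
Authors: Sifan Song,Siyeop Yoon,Pengfei Jin,Sekeun Kim,Matthew Tivnan,Yujin Oh,Runqi Meng,Ling Chen,Zhiliang Lyu,Dufan Wu,Ning Guo,Xiang Li,Quanzheng Li
Affiliations: Massachusetts General Hospital and Harvard Medical School
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: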
Abstract:Recent advances in representation learning often rely on holistic, black-box embeddings that entangle multiple semantic components, limiting interpretability and generalization. These issues are especially critical in medical imaging. To address these limitations, we propose an Organ-Wise Tokenization (OWT) framework with a Token Group-based Reconstruction (TGR) training paradigm. Unlike conventional approaches that produce holistic features, OWT explicitly disentangles an image into separable token groups, each corresponding to a distinct organ or semantic entity. Our design ensures each token group encapsulates organ-specific information, boosting interpretability, generalization, and efficiency while allowing fine-grained control in downstream tasks. Experiments on CT and MRI datasets demonstrate the effectiveness of OWT in not only achieving strong image reconstruction and segmentation performance, but also enabling novel semantic-level generation and retrieval applications that are out of reach for standard holistic embedding methods. These findings underscore the potential of OWT as a foundational framework for semantically disentangled representation learning, offering broad scalability and applicability to real-world medical imaging scenarios and beyond.
[CV-67] Cross-Branch Orthogonality for Improved Generalization in Face Deepfake Detection
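[Quick Read]: This paper addresses the inability of current deepfake detectors to keep up with the novel deepfake content produced by rapidly evolving generative AI, in particular the poor generalization caused by reliance on specific forgery artifacts. The key to the solution is a feature orthogonality-based disentanglement strategy that leverages coarse-to-fine spatial information, semantic information, and their interactions while keeping features distinctive and reducing redundancy in the modelled features, enabling effective detection of novel deepfake types.
Link: https://arxiv.org/abs/2505.04888
Authors: Tharindu Fernando,Clinton Fookes,Sridha Sridharan,Simon Denman
Affiliations: Queensland University of Technology
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: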
Abstract:Remarkable advancements in generative AI technology have given rise to a spectrum of novel deepfake categories with unprecedented leaps in their realism, and deepfakes are increasingly becoming a nuisance to law enforcement authorities and the general public. In particular, we observe alarming levels of confusion, deception, and loss of faith regarding multimedia content within society caused by face deepfakes, and existing deepfake detectors are struggling to keep up with the pace of improvements in deepfake generation. This is primarily due to their reliance on specific forgery artifacts, which limits their ability to generalise and detect novel deepfake types. To combat the spread of malicious face deepfakes, this paper proposes a new strategy that leverages coarse-to-fine spatial information, semantic information, and their interactions while ensuring feature distinctiveness and reducing the redundancy of the modelled features. A novel feature orthogonality-based disentanglement strategy is introduced to ensure branch-level and cross-branch feature disentanglement, which allows us to integrate multiple feature vectors without adding complexity to the feature space or compromising generalisation. Comprehensive experiments on three public benchmarks: FaceForensics++, Celeb-DF, and the Deepfake Detection Challenge (DFDC) show that these design choices enable the proposed approach to outperform current state-of-the-art methods by 5% on the Celeb-DF dataset and 7% on the DFDC dataset in a cross-dataset evaluation setting.
[CV-68] Learning from Loss Landscape: Generalizable Mixed-Precision Quantization via Adaptive Sharpness-Aware Gradient Aligning
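[Quick Read]: This paper addresses the excessive computational cost of searching for Mixed Precision Quantization (MPQ) policies on large-scale datasets. The key to the solution is to search for quantization policies on a small dataset and then generalize them to large-scale datasets, which removes the need for large-scale quantization fine-tuning and requires only model weight adjustment. The method rests on three techniques: sharpness-aware minimization to improve quantization generalization, implicit gradient direction alignment to handle gradient conflicts among optimization objectives, and an adaptive perturbation radius to accelerate optimization.
Link: https://arxiv.org/abs/2505.04877
Authors: Lianbo Ma,Jianlun Ma,Yuee Zhou,Guoyang Xie,Qiang He,Zhichao Lu
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: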
Abstract:Mixed Precision Quantization (MPQ) has become an essential technique for optimizing neural network by determining the optimal bitwidth per layer. Existing MPQ methods, however, face a major hurdle: they require a computationally expensive search for quantization policies on large-scale datasets. To resolve this issue, we introduce a novel approach that first searches for quantization policies on small datasets and then generalizes them to large-scale datasets. This approach simplifies the process, eliminating the need for large-scale quantization fine-tuning and only necessitating model weight adjustment. Our method is characterized by three key techniques: sharpness-aware minimization for enhanced quantization generalization, implicit gradient direction alignment to handle gradient conflicts among different optimization objectives, and an adaptive perturbation radius to accelerate optimization. Both theoretical analysis and experimental results validate our approach. Using the CIFAR10 dataset (just 0.5% the size of ImageNet training data) for MPQ policy search, we achieved equivalent accuracy on ImageNet with a significantly lower computational cost, while improving efficiency by up to 150% over the baselines.
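As background for the first of the three techniques, a plain sharpness-aware minimization (SAM) update can be sketched on a toy quadratic loss. This shows only the generic SAM step with a fixed perturbation radius; the paper's adaptive radius and gradient alignment are not modelled, and the function names, learning rate, and radius are assumptions.

```python
import math

def sam_step(w, grad_fn, lr=0.1, rho=0.05):
    """One generic SAM update: ascend to a nearby worst-case point, then descend.

    1. g = grad L(w)
    2. w_adv = w + rho * g / ||g||      (perturb towards higher loss)
    3. w_new = w - lr * grad L(w_adv)   (descend using the perturbed gradient)
    """
    g = grad_fn(w)
    norm = math.sqrt(sum(gi * gi for gi in g)) or 1.0  # guard against a zero gradient
    w_adv = [wi + rho * gi / norm for wi, gi in zip(w, g)]
    g_adv = grad_fn(w_adv)
    return [wi - lr * gi for wi, gi in zip(w, g_adv)]

def loss(w):
    """Toy quadratic loss, used only for illustration."""
    return sum(wi * wi for wi in w)

def grad(w):
    """Gradient of the toy quadratic loss."""
    return [2.0 * wi for wi in w]

w = [1.0, -2.0]
initial_loss = loss(w)
for _ in range(50):
    w = sam_step(w, grad)
final_loss = loss(w)
```

Because the descent direction is evaluated at the perturbed point, the update favours regions where the loss stays low in a neighbourhood, which is the property the abstract relies on for policies that transfer across datasets.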
[CV-69] Auto-regressive transformation for image alignment
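[Quick Read]: This paper addresses the poor accuracy of image alignment in feature-sparse regions, under extreme scale and field-of-view differences, and under large deformations. The key to the solution is Auto-Regressive Transformation (ART), a new method that iteratively estimates coarse-to-fine transformations within an auto-regressive framework, improves robustness by focusing on critical regions in multi-scale image representations, and uses guidance from cross-attention layers to achieve precise alignment.
Link: https://arxiv.org/abs/2505.04864
Authors: Kanggeon Lee,Soochahn Lee,Kyoung Mu Lee
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: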
Abstract:Existing methods for image alignment struggle in cases involving feature-sparse regions, extreme scale and field-of-view differences, and large deformations, often resulting in suboptimal accuracy. Robustness to these challenges improves through iterative refinement of the transformation field while focusing on critical regions in multi-scale image representations. We thus propose Auto-Regressive Transformation (ART), a novel method that iteratively estimates the coarse-to-fine transformations within an auto-regressive framework. Leveraging hierarchical multi-scale features, our network refines the transformations using randomly sampled points at each scale. By incorporating guidance from the cross-attention layer, the model focuses on critical regions, ensuring accurate alignment even in challenging, feature-limited conditions. Extensive experiments across diverse datasets demonstrate that ART significantly outperforms state-of-the-art methods, establishing it as a powerful new method for precise image alignment with broad applicability.
[CV-70] Mix-QSAM: Mixed-Precision Quantization of the Segment Anything Model
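[Quick Read]: This paper addresses the high computational and memory demands that hinder deploying the Segment Anything Model (SAM) on resource-constrained devices, and the suboptimal accuracy and efficiency of existing Post-Training Quantization (PTQ) methods that rely on fixed bit-width quantization. The key to the solution is Mix-QSAM, a mixed-precision PTQ framework that introduces a layer-wise importance score and a cross-layer synergy metric and formulates an Integer Quadratic Programming (IQP) problem to find the optimal bit-width allocation, improving computational efficiency while preserving model performance.
Link: https://arxiv.org/abs/2505.04861
Authors: Navin Ranjan,Andreas Savakis
Affiliations: Rochester Institute of Technology
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 12 pages, 2 Figures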
Abstract:The Segment Anything Model (SAM) is a popular vision foundation model; however, its high computational and memory demands make deployment on resource-constrained devices challenging. While Post-Training Quantization (PTQ) is a practical approach for reducing computational overhead, existing PTQ methods rely on fixed bit-width quantization, leading to suboptimal accuracy and efficiency. To address this limitation, we propose Mix-QSAM, a mixed-precision PTQ framework for SAM. First, we introduce a layer-wise importance score, derived using Kullback-Leibler (KL) divergence, to quantify each layer’s contribution to the model’s output. Second, we introduce cross-layer synergy, a novel metric based on causal mutual information, to capture dependencies between adjacent layers. This ensures that highly interdependent layers maintain similar bit-widths, preventing abrupt precision mismatches that degrade feature propagation and numerical stability. Using these metrics, we formulate an Integer Quadratic Programming (IQP) problem to determine optimal bit-width allocation under model size and bit-operation constraints, assigning higher precision to critical layers while minimizing bit-width in less influential layers. Experimental results demonstrate that Mix-QSAM consistently outperforms existing PTQ methods on instance segmentation and object detection tasks, achieving up to 20% higher average precision under 6-bit and 4-bit mixed-precision settings, while maintaining computational efficiency.
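A toy version of the bit-width allocation problem helps make the trade-off concrete. The sketch below brute-forces a tiny integer quadratic program: a linear reward for giving important layers more bits, a quadratic penalty for bit-width mismatch between adjacent layers weighted by their synergy, and a model-size budget. The objective form, the weight `lam`, and all numbers are assumptions for illustration, not the paper's formulation, and a real IQP solver would replace the exhaustive search.

```python
from itertools import product

def allocate_bits(importance, synergy, params, budget_bits,
                  choices=(4, 6, 8), lam=1.0):
    """Exhaustively search per-layer bit-widths under a size budget.

    importance[i]: assumed importance score of layer i
    synergy[i]:    assumed dependency between layer i and layer i + 1
    params[i]:     number of parameters in layer i
    """
    best, best_score = None, float("-inf")
    for bits in product(choices, repeat=len(importance)):
        size = sum(p * b for p, b in zip(params, bits))
        if size > budget_bits:
            continue  # violates the model-size constraint
        gain = sum(s * b for s, b in zip(importance, bits))
        mismatch = sum(synergy[i] * (bits[i] - bits[i + 1]) ** 2
                       for i in range(len(bits) - 1))
        score = gain - lam * mismatch
        if score > best_score:
            best, best_score = bits, score
    return best

importance = [0.9, 0.1, 0.5]
params = [100, 100, 100]

# Weak coupling: the important first layer gets 8 bits, the weak one 4.
weak = allocate_bits(importance, [0.05, 0.05], params, budget_bits=1800)

# Strong coupling between layers 0 and 1 pushes them to the same precision.
strong = allocate_bits(importance, [1.0, 0.0], params, budget_bits=1800)
```

With weak coupling the search spends bits where importance is highest; raising the synergy between the first two layers makes a uniform 6-bit assignment the better choice, which mirrors the abstract's point about avoiding abrupt precision mismatches.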
[CV-71] D-CODA: Diffusion for Coordinated Dual-Arm Data Augmentation
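[Quick Read]: This paper addresses the learning challenges of bimanual manipulation, which stem from its high-dimensional state space and the tight coordination required between two arms, and in particular the high cost and limited diversity of data collection in eye-in-hand imitation learning. The key to the solution is D-CODA (Diffusion for COordinated Dual-arm Data Augmentation), an offline data augmentation method for bimanual manipulation that trains a diffusion model to generate viewpoint-consistent wrist-camera images together with the corresponding joint-space action labels, increasing data diversity. Constrained optimization ensures that augmented states involving gripper-to-object contact satisfy constraints suitable for bimanual coordination.
Link: https://arxiv.org/abs/2505.04860
Authors: I-Chun Arthur Liu,Jason Chen,Gaurav Sukhatme,Daniel Seita
Affiliations: University of Southern California
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: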
Abstract:Learning bimanual manipulation is challenging due to its high dimensionality and tight coordination required between two arms. Eye-in-hand imitation learning, which uses wrist-mounted cameras, simplifies perception by focusing on task-relevant views. However, collecting diverse demonstrations remains costly, motivating the need for scalable data augmentation. While prior work has explored visual augmentation in single-arm settings, extending these approaches to bimanual manipulation requires generating viewpoint-consistent observations across both arms and producing corresponding action labels that are both valid and feasible. In this work, we propose Diffusion for COordinated Dual-arm Data Augmentation (D-CODA), a method for offline data augmentation tailored to eye-in-hand bimanual imitation learning that trains a diffusion model to synthesize novel, viewpoint-consistent wrist-camera images for both arms while simultaneously generating joint-space action labels. It employs constrained optimization to ensure that augmented states involving gripper-to-object contacts adhere to constraints suitable for bimanual coordination. We evaluate D-CODA on 5 simulated and 3 real-world tasks. Our results across 2250 simulation trials and 300 real-world trials demonstrate that it outperforms baselines and ablations, showing its potential for scalable data augmentation in eye-in-hand bimanual manipulation. Our project website is at: this https URL.
[CV-72] ORXE: Orchestrating Experts for Dynamically Configurable Efficiency
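[Quick Read]: This paper addresses real-time configurable efficiency in AI models, that is, how to adjust computation dynamically according to the complexity of each input sample so that inference is both efficient and flexible. The key to the solution is the ORXE framework, which draws on a collection of pre-trained experts with different computational costs and performance levels and allocates computation through a confidence-based gating mechanism, adjusting inference pathways without complex metamodel training.
Link: https://arxiv.org/abs/2505.04850
Authors: Qingyuan Wang,Guoxin Wang,Barry Cardiff,Deepu John
Affiliations: University College Dublin
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: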
Abstract:This paper presents ORXE, a modular and adaptable framework for achieving real-time configurable efficiency in AI models. By leveraging a collection of pre-trained experts with diverse computational costs and performance levels, ORXE dynamically adjusts inference pathways based on the complexity of input samples. Unlike conventional approaches that require complex metamodel training, ORXE achieves high efficiency and flexibility without complicating the development process. The proposed system utilizes a confidence-based gating mechanism to allocate appropriate computational resources for each input. ORXE also supports adjustments to the preference between inference cost and prediction performance across a wide range during runtime. We implemented a training-free ORXE system for image classification tasks, evaluating its efficiency and accuracy across various devices. The results demonstrate that ORXE achieves superior performance compared to individual experts and other dynamic models in most cases. This approach can be extended to other applications, providing a scalable solution for diverse real-world deployment scenarios.
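Confidence-based gating over a set of experts can be pictured as an early-exit cascade. The following is a minimal sketch under assumed names and numbers: experts are tried from cheapest to most expensive, and the first prediction whose confidence clears a runtime threshold is returned. The toy experts and costs are invented for the example and do not reflect ORXE's actual models or gating rule.

```python
def cascade_predict(x, experts, threshold):
    """Run experts cheapest-first and stop at the first confident prediction.

    experts: list of (cost, fn) pairs, where fn(x) returns (label, confidence).
    Returns the chosen label and the total cost spent.
    """
    total_cost = 0.0
    label = None
    for cost, fn in experts:
        total_cost += cost
        label, confidence = fn(x)
        if confidence >= threshold:
            break  # confident enough: skip the more expensive experts
    return label, total_cost

def small_expert(x):
    """Hypothetical cheap model: confident only on easy inputs."""
    return ("cat", 0.95) if x == "easy" else ("dog", 0.55)

def large_expert(x):
    """Hypothetical expensive model: confident on everything."""
    return ("cat", 0.99)

experts = [(1.0, small_expert), (10.0, large_expert)]

easy_result = cascade_predict("easy", experts, threshold=0.9)
hard_result = cascade_predict("hard", experts, threshold=0.9)
cheap_result = cascade_predict("hard", experts, threshold=0.5)
```

Lowering the threshold at runtime trades accuracy for cost without retraining anything, which corresponds to the adjustable cost/performance preference the abstract describes.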
[CV-73] Seeing Cells Clearly: Evaluating Machine Vision Strategies for Microglia Centroid Detection in 3D Images
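[Quick Read]: This paper addresses how to accurately locate the center points of microglia in 3D microscope images. The key to the solution is an evaluation of how three tools (ilastik, 3D Morph, and Omnipose) differ in detecting microglia and how those differences affect the information extracted from the images.
Link: https://arxiv.org/abs/2505.04838
Authors: Youjia Zhang
Affiliations: University of Rochester
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: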
Abstract:Microglia are important cells in the brain, and their shape can tell us a lot about brain health. In this project, I test three different tools for finding the center points of microglia in 3D microscope images. The tools include ilastik, 3D Morph, and Omnipose. I look at how well each one finds the cells and how their results compare. My findings show that each tool sees the cells in its own way, and this can affect the kind of information we get from the images.
[CV-74] Are Synthetic Corruptions A Reliable Proxy For Real-World Corruptions? CVPR2025
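[Quick Read]: This paper addresses how to evaluate the robustness of deep learning (DL) models to real-world distribution shifts, especially corruptions caused by weather and lighting changes. Synthetic corruptions are an attractive substitute because collecting diverse real-world data is costly. The key contribution is a large-scale benchmark that tests whether synthetic corruptions are a reliable proxy for real-world ones; it finds a strong correlation in mean performance between the two, supporting the use of synthetic corruptions for robustness evaluation. The study also analyzes corruption-specific correlations, offering insight into when synthetic corruptions succeed in representing real-world corruptions.
Link: https://arxiv.org/abs/2505.04835
Authors: Shashank Agnihotri,David Schader,Nico Sharei,Mehmet Ege Kaçar,Margret Keuper
Affiliations: University of Mannheim; Max-Planck-Institute for Informatics
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted at CVPR 2025 Workshop on Synthetic Data for Computer Vision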
Abstract:Deep learning (DL) models are widely used in real-world applications but remain vulnerable to distribution shifts, especially due to weather and lighting changes. Collecting diverse real-world data for testing the robustness of DL models is resource-intensive, making synthetic corruptions an attractive alternative for robustness testing. However, are synthetic corruptions a reliable proxy for real-world corruptions? To answer this, we conduct the largest benchmarking study on semantic segmentation models, comparing performance on real-world corruptions and synthetic corruptions datasets. Our results reveal a strong correlation in mean performance, supporting the use of synthetic corruptions for robustness evaluation. We further analyze corruption-specific correlations, providing key insights to understand when synthetic corruptions succeed in representing real-world corruptions. Open-source Code: this https URL
[CV-75] WIR3D: Visually-Informed and Geometry-Aware 3D Shape Abstraction
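[Quick Read]: This paper addresses how to abstract 3D shapes with a sparse set of visually meaningful 3D curves that faithfully represent both geometry and salient visual features such as texture. The key to the solution is to optimize the parameters of Bézier curves, guided by the intermediate activations of a pre-trained foundation model (CLIP), in two phases: the first captures the coarse geometry of the shape, and the second represents fine-grained features under spatial guidance from a novel localized keypoint loss, which gives users control over the abstracted features. A neural SDF loss keeps the curves faithful to the original surface, so they can serve as intuitive deformation handles.
Link: https://arxiv.org/abs/2505.04813
Authors: Richard Liu,Daniel Fu,Noah Tan,Itai Lang,Rana Hanocka
Affiliations: University of Chicago
Categories: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV)
Comments: Project page: this https URL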
Abstract:We present WIR3D, a technique for abstracting 3D shapes through a sparse set of visually meaningful curves in 3D. We optimize the parameters of Bezier curves such that they faithfully represent both the geometry and salient visual features (e.g. texture) of the shape from arbitrary viewpoints. We leverage the intermediate activations of a pre-trained foundation model (CLIP) to guide our optimization process. We divide our optimization into two phases: one for capturing the coarse geometry of the shape, and the other for representing fine-grained features. Our second phase supervision is spatially guided by a novel localized keypoint loss. This spatial guidance enables user control over abstracted features. We ensure fidelity to the original surface through a neural SDF loss, which allows the curves to be used as intuitive deformation handles. We successfully apply our method for shape abstraction over a broad dataset of shapes with varying complexity, geometric structure, and texture, and demonstrate downstream applications for feature control and shape deformation.
[CV-76] DetReIDX: A Stress-Test Dataset for Real-World UAV-Based Person Recognition
[Quick Read]: This paper addresses the sharp performance drop of current person re-identification (ReID) methods in challenging real-world settings, which it attributes mainly to datasets that lack extreme variability factors such as resolution, viewpoint changes, scale variations, occlusions, and appearance shifts from clothing or session drifts. The key to the solution is DetReIDX, a large-scale aerial-ground person dataset designed explicitly as a stress test for ReID under real-world conditions. It contains over 13 million bounding boxes from 509 identities, collected on seven university campuses across three continents, and each subject was recorded in at least two sessions on different days with changes in clothing, daylight, and location, which makes it suitable for evaluating long-term person ReID. The dataset also includes 16 soft biometric attributes and multitask labels for detection, tracking, ReID, and action recognition.
Link: https://arxiv.org/abs/2505.04793
Authors: Kailash A. Hambarde,Nzakiese Mbongo,Pavan Kumar MP,Satish Mekewad,Carolina Fernandes,Gökhan Silahtaroğlu,Alice Nithya,Pawan Wasnik,MD. Rashidunnabi,Pranita Samale,Hugo Proença
Affiliations: Instituto de Telecomunicações; University of Beira Interior; J.N.N. College of Engineering; School of Computational Sciences; SRTM University; Istanbul Medipol University; SRM Institute of Science and Technology
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Person reidentification (ReID) technology has been considered to perform relatively well under controlled, ground-level conditions, but it breaks down when deployed in challenging real-world settings. Evidently, this is due to extreme data variability factors such as resolution, viewpoint changes, scale variations, occlusions, and appearance shifts from clothing or session drifts. Moreover, the publicly available data sets do not realistically incorporate such kinds and magnitudes of variability, which limits the progress of this technology. This paper introduces DetReIDX, a large-scale aerial-ground person dataset that was explicitly designed as a stress test for ReID under real-world conditions. DetReIDX is a multi-session set that includes over 13 million bounding boxes from 509 identities, collected in seven university campuses from three continents, with drone altitudes between 5.8 and 120 meters. More importantly, as a key novelty, DetReIDX subjects were recorded in (at least) two sessions on different days, with changes in clothing, daylight and location, making it suitable to actually evaluate long-term person ReID. In addition, the data were annotated with 16 soft biometric attributes and multitask labels for detection, tracking, ReID, and action recognition. In order to provide empirical evidence of DetReIDX's usefulness, we considered the specific tasks of human detection and ReID, where SOTA methods catastrophically degrade in performance (up to 80% in detection accuracy and over 70% in Rank-1 ReID) when exposed to DetReIDX's conditions. The dataset, annotations, and official evaluation protocols are publicly available at this https URL
[CV-77] Convex Relaxation for Robust Vanishing Point Estimation in Manhattan World CVPR2025 DATE
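[Quick Read]: This paper addresses vanishing point (VP) estimation in a Manhattan world, a fundamental task in many 3D vision applications that consists of jointly inferring the line-VP association and locating each VP. Existing methods are either sub-optimal solvers or pursue global optimality at a high computational cost. The key to the solution is the first use of convex relaxation for this task: a "soft" association scheme, realized via a truncated multi-selection error, allows joint estimation of VP locations and line-VP associations. The resulting problem can be reformulated as a quadratically constrained quadratic programming (QCQP) problem and then relaxed into a convex semidefinite programming (SDP) problem. To solve the SDP efficiently, the paper presents a globally optimal, outlier-robust iterative solver called GlobustVP, which independently searches for one VP and its associated lines in each iteration and treats the remaining lines as outliers. After all VPs have been updated independently, local refinement reinforces the mutual orthogonality of the three VPs in a Manhattan world.
Link: https://arxiv.org/abs/2505.04788
Authors: Bangyan Liao,Zhenjun Zhao,Haoang Li,Yi Zhou,Yingping Zeng,Hao Li,Peidong Liu
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to CVPR 2025 as Award Candidate Oral Presentation. The first two authors contributed equally to this work. Code: this https URL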
Abstract:Determining the vanishing points (VPs) in a Manhattan world, as a fundamental task in many 3D vision applications, consists of jointly inferring the line-VP association and locating each VP. Existing methods are, however, either sub-optimal solvers or pursuing global optimality at a significant cost of computing time. In contrast to prior works, we introduce convex relaxation techniques to solve this task for the first time. Specifically, we employ a "soft" association scheme, realized via a truncated multi-selection error, that allows for joint estimation of VPs' locations and line-VP associations. This approach leads to a primal problem that can be reformulated into a quadratically constrained quadratic programming (QCQP) problem, which is then relaxed into a convex semidefinite programming (SDP) problem. To solve this SDP problem efficiently, we present a globally optimal outlier-robust iterative solver (called GlobustVP), which independently searches for one VP and its associated lines in each iteration, treating other lines as outliers. After each independent update of all VPs, the mutual orthogonality between the three VPs in a Manhattan world is reinforced via local refinement. Extensive experiments on both synthetic and real-world data demonstrate that GlobustVP achieves a favorable balance between efficiency, robustness, and global optimality compared to previous works. The code is publicly available at this https URL.
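As a reading aid for the QCQP-to-SDP step mentioned in this abstract, the block below writes out the textbook lifting (Shor) relaxation in its generic form. It is not the paper's actual formulation: the symbols $x$, $C$, $A_i$, $b_i$ and $X$ are placeholders assumed here for illustration.

```latex
\begin{aligned}
\text{(QCQP)}\quad & \min_{x \in \mathbb{R}^n} \; x^\top C x
  \quad \text{s.t.}\quad x^\top A_i x = b_i,\; i = 1,\dots,m \\
\text{(lifting)}\quad & X = x x^\top \;\Longrightarrow\;
  x^\top C x = \operatorname{tr}(C X),\quad X \succeq 0,\quad \operatorname{rank}(X) = 1 \\
\text{(SDP)}\quad & \min_{X \in \mathbb{S}^n} \; \operatorname{tr}(C X)
  \quad \text{s.t.}\quad \operatorname{tr}(A_i X) = b_i,\; i = 1,\dots,m,\quad X \succeq 0
\end{aligned}
```

Dropping the non-convex rank-one constraint is what makes the last problem convex. When the SDP optimum happens to have rank one, the relaxation is tight and $x$ can be recovered from $X$ up to sign.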
[CV-78] Replay to Remember (R2R): An Efficient Uncertainty-driven Unsupervised Continual Learning Framework Using Generative Replay ECAI-2025
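[Summary]: This paper addresses catastrophic forgetting in neural networks during continual learning, i.e., forgetting previously acquired knowledge while learning new knowledge. The key to the solution is an uncertainty-driven unsupervised continual learning framework, Replay to Remember (R2R), which uses a cluster-level uncertainty-driven feedback mechanism and a generative replay module powered by a vision-language model (VLM) to make efficient use of unlabelled and synthetic labelled data. It adapts continuously to new tasks without any pretraining while generating synthetic labelled data representative of past experiences, which markedly improves knowledge retention.
Link: https://arxiv.org/abs/2505.04787
Authors: Sriram Mandalika,Harsha Vardhan,Athira Nambiar
Affiliations: SRM Institute of Science and Technology
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Submitted to the 28th European Conference on Artificial Intelligence (ECAI-2025)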
Abstract:Continual Learning entails progressively acquiring knowledge from new data while retaining previously acquired knowledge, thereby mitigating "Catastrophic Forgetting" in neural networks. Our work presents a novel uncertainty-driven Unsupervised Continual Learning framework using Generative Replay, namely "Replay to Remember (R2R)". The proposed R2R architecture efficiently uses unlabelled and synthetic labelled data in a balanced proportion using a cluster-level uncertainty-driven feedback mechanism and a VLM-powered generative replay module. Unlike traditional memory-buffer methods that depend on pretrained models and pseudo-labels, our R2R framework operates without any prior training. It leverages visual features from unlabeled data and adapts continuously using clustering-based uncertainty estimation coupled with dynamic thresholding. Concurrently, a generative replay mechanism along with DeepSeek-R1 powered CLIP VLM produces labelled synthetic data representative of past experiences, resembling biological visual thinking that replays memory to remember and act in new, unseen tasks. Extensive experimental analyses are carried out in CIFAR-10, CIFAR-100, CINIC-10, SVHN and TinyImageNet datasets. Our proposed R2R approach improves knowledge retention, achieving a state-of-the-art performance of 98.13%, 73.06%, 93.41%, 95.18%, 59.74%, respectively, surpassing state-of-the-art performance by over 4.36%.
[CV-79] Vision-Language-Action Models: Concepts Progress Applications and Challenges
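[Summary]: This paper addresses how to build a unified framework for perception, natural language understanding, and embodied action in order to realize more intelligent and adaptable embodied AI agents. The key is that Vision-Language-Action (VLA) models integrate vision-language models (VLMs), action planners, and hierarchical controllers, driving the evolution from cross-modal learning toward generalist agents. The review also proposes targeted strategies, including agentic AI adaptation, cross-embodiment generalization, and unified neuro-symbolic planning, to address challenges in real-time control, multimodal action representation, system scalability, and ethical deployment.
Link: https://arxiv.org/abs/2505.04769
Authors: Ranjan Sapkota,Yang Cao,Konstantinos I. Roumeliotis,Manoj Karkee
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 36 pages, 18 Figures, 4 Tables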
Abstract:Vision-Language-Action (VLA) models mark a transformative advancement in artificial intelligence, aiming to unify perception, natural language understanding, and embodied action within a single computational framework. This foundational review presents a comprehensive synthesis of recent advancements in Vision-Language-Action models, systematically organized across five thematic pillars that structure the landscape of this rapidly evolving field. We begin by establishing the conceptual foundations of VLA systems, tracing their evolution from cross-modal learning architectures to generalist agents that tightly integrate vision-language models (VLMs), action planners, and hierarchical controllers. Our methodology adopts a rigorous literature review framework, covering over 80 VLA models published in the past three years. Key progress areas include architectural innovations, parameter-efficient training strategies, and real-time inference accelerations. We explore diverse application domains such as humanoid robotics, autonomous vehicles, medical and industrial robotics, precision agriculture, and augmented reality navigation. The review further addresses major challenges across real-time control, multimodal action representation, system scalability, generalization to unseen tasks, and ethical deployment risks. Drawing from the state-of-the-art, we propose targeted solutions including agentic AI adaptation, cross-embodiment generalization, and unified neuro-symbolic planning. In our forward-looking discussion, we outline a future roadmap where VLA models, VLMs, and agentic AI converge to power socially aligned, adaptive, and general-purpose embodied agents. This work serves as a foundational reference for advancing intelligent, real-world robotics and artificial general intelligence. Keywords: Vision-language-action, Agentic AI, AI Agents, Vision-language Models
[CV-80] Lightweight RGB-D Salient Object Detection from a Speed-Accuracy Tradeoff Perspective
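[Summary]: This paper addresses the difficulty of balancing efficiency and performance in RGB-D salient object detection (RGB-D SOD): methods built on large-scale backbones are accurate but inefficient, while lightweight methods struggle to reach high accuracy. The key to the solution is the Speed-Accuracy Tradeoff Network (SATNet), designed from three fundamental perspectives: depth quality, modality fusion, and feature representation. Specifically, it introduces the Depth Anything Model to generate high-quality depth maps that alleviate multi-modal gaps, designs a Decoupled Attention Module (DAM) to strengthen consistency within and between modalities, develops a Dual Information Representation Module (DIRM) to enlarge the feature space of lightweight backbones, and fuses texture and saliency features through a Dual Feature Aggregation Module (DFAM), improving detection performance while remaining lightweight.
Link: https://arxiv.org/abs/2505.04758
Authors: Songsong Duan,Xi Yang,Nannan Wang,Xinbo Gao
Affiliations: Xidian University; Chongqing University of Posts and Telecommunications
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by TIP 2025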
Abstract:Current RGB-D methods usually leverage large-scale backbones to improve accuracy but sacrifice efficiency. Meanwhile, several existing lightweight methods struggle to achieve high-precision performance. To balance efficiency and performance, we propose a Speed-Accuracy Tradeoff Network (SATNet) for lightweight RGB-D SOD from three fundamental perspectives: depth quality, modality fusion, and feature representation. Concerning depth quality, we introduce the Depth Anything Model to generate high-quality depth maps, which effectively alleviates the multi-modal gaps in the current datasets. For modality fusion, we propose a Decoupled Attention Module (DAM) to explore the consistency within and between modalities. Here, the multi-modal features are decoupled into dual-view feature vectors to project discriminable information of feature maps. For feature representation, we develop a Dual Information Representation Module (DIRM) with a bi-directional inverted framework to enlarge the limited feature space generated by the lightweight backbones. DIRM models texture features and saliency features to enrich the feature space, and employs two-way prediction heads to optimize its parameters through bi-directional backpropagation. Finally, we design a Dual Feature Aggregation Module (DFAM) in the decoder to aggregate texture and saliency features. Extensive experiments on five public RGB-D SOD datasets indicate that the proposed SATNet outperforms state-of-the-art (SOTA) CNN-based heavyweight models and achieves a lightweight framework with 5.2 M parameters and 415 FPS.
[CV-81] Hyb-KAN ViT: Hybrid Kolmogorov-Arnold Networks Augmented Vision Transformer
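[Summary]: This paper tries to address the inherent limitations of Multi-Layer Perceptrons (MLPs) in Vision Transformers (ViTs), particularly in spatial-frequency modeling and computational efficiency. The key to the solution is a new framework, Hyb-KAN ViT, which integrates wavelet-based spectral decomposition with spline-optimized activation functions and introduces two core modules: Efficient-KAN (Eff-KAN) and Wavelet-KAN (Wav-KAN). Eff-KAN replaces MLP layers with spline functions to improve efficiency, while Wav-KAN uses orthogonal wavelet transforms for multi-scale feature extraction, enhancing spatial-frequency modeling while mitigating computational bottlenecks.
Link: https://arxiv.org/abs/2505.04740
Authors: Sainath Dey,Mitul Goswami,Jashika Sethi,Prasant Kumar Pattnaik
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: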
Abstract:This study addresses the inherent limitations of Multi-Layer Perceptrons (MLPs) in Vision Transformers (ViTs) by introducing Hybrid Kolmogorov-Arnold Network (KAN)-ViT (Hyb-KAN ViT), a novel framework that integrates wavelet-based spectral decomposition and spline-optimized activation functions. Prior work has failed to focus on the prebuilt modularity of the ViT architecture and the integration of the edge detection capabilities of wavelet functions. We propose two key modules: Efficient-KAN (Eff-KAN), which replaces MLP layers with spline functions, and Wavelet-KAN (Wav-KAN), which leverages orthogonal wavelet transforms for multi-resolution feature extraction. These modules are systematically integrated in ViT encoder layers and classification heads to enhance spatial-frequency modeling while mitigating computational bottlenecks. Experiments on ImageNet-1K (Image Recognition), COCO (Object Detection and Instance Segmentation), and ADE20K (Semantic Segmentation) demonstrate state-of-the-art performance with Hyb-KAN ViT. Ablation studies validate the efficacy of wavelet-driven spectral priors in segmentation and spline-based efficiency in detection tasks. The framework establishes a new paradigm for balancing parameter efficiency and multi-scale representation in vision architectures.
[CV-82] False Promises in Medical Imaging AI? Assessing Validity of Outperformance Claims
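[Summary]: This paper tries to address the reliability of performance comparisons in medical imaging Artificial Intelligence (AI) research, in particular the question of whether newly proposed methods truly outperform existing ones. The key to the solution is a Bayesian approach that combines reported results with empirically estimated model congruence to quantify the probability of false outperformance claims, thereby assessing whether the relative ranking of methods could be due to chance alone.
Link: https://arxiv.org/abs/2505.04720
Authors: Evangelia Christodoulou,Annika Reinke,Pascaline Andrè,Patrick Godau,Piotr Kalinowski,Rola Houhou,Selen Erkan,Carole H. Sudre,Ninon Burgos,Sofiène Boutaj,Sophie Loizillon,Maëlys Solal,Veronika Cheplygina,Charles Heitz,Michal Kozubek,Michela Antonelli,Nicola Rieke,Antoine Gilson,Leon D. Mayer,Minu D. Tizabi,M. Jorge Cardoso,Amber Simpson,Annette Kopp-Schneider,Gaël Varoquaux,Olivier Colliot,Lena Maier-Hein
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: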
Abstract:Performance comparisons are fundamental in medical imaging Artificial Intelligence (AI) research, often driving claims of superiority based on relative improvements in common performance metrics. However, such claims frequently rely solely on empirical mean performance. In this paper, we investigate whether newly proposed methods genuinely outperform the state of the art by analyzing a representative cohort of medical imaging papers. We quantify the probability of false claims based on a Bayesian approach that leverages reported results alongside empirically estimated model congruence to estimate whether the relative ranking of methods is likely to have occurred by chance. According to our results, the majority (80%) of papers claim outperformance when introducing a new method. Our analysis further revealed a high probability (5%) of false outperformance claims in 86% of classification papers and 53% of segmentation papers. These findings highlight a critical flaw in current benchmarking practices: claims of outperformance in medical imaging AI are frequently unsubstantiated, posing a risk of misdirecting future research efforts.
[CV-83] Lay-Your-Scene: Natural Scene Layout Generation with Diffusion Transformers
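[Summary]: This paper tries to address limitations in layout generation for natural scenes: existing methods are either closed-vocabulary or rely on proprietary large language models for open-vocabulary generation, which limits their modeling capability and broader applicability in controllable image generation. The key to the solution is to use lightweight open-source language models to extract scene elements from text prompts and to introduce a novel aspect-aware diffusion Transformer architecture for conditional layout generation in an open-vocabulary setting.
Link: https://arxiv.org/abs/2505.04718
Authors: Divyansh Srivastava,Xiang Zhang,He Wen,Chenru Wen,Zhuowen Tu
Affiliations: UC San Diego; Tsinghua University
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: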
Abstract:We present Lay-Your-Scene (shorthand LayouSyn), a novel text-to-layout generation pipeline for natural scenes. Prior scene layout generation methods are either closed-vocabulary or use proprietary large language models for open-vocabulary generation, limiting their modeling capabilities and broader applicability in controllable image generation. In this work, we propose to use lightweight open-source language models to obtain scene elements from text prompts and a novel aspect-aware diffusion Transformer architecture trained in an open-vocabulary manner for conditional layout generation. Extensive experiments demonstrate that LayouSyn outperforms existing methods and achieves state-of-the-art performance on challenging spatial and numerical reasoning benchmarks. Additionally, we present two applications of LayouSyn. First, we show that coarse initialization from large language models can be seamlessly combined with our method to achieve better results. Second, we present a pipeline for adding objects to images, demonstrating the potential of LayouSyn in image editing applications.
[CV-84] Comparison of Visual Trackers for Biomechanical Analysis of Running
[Summary]: This paper addresses the accuracy of human pose tracking for the biomechanical analysis of sprinting, in particular the measurement accuracy of key joint angles. The key to the solution lies in joint-based models combined with a post-processing module for outlier detection and fusion prediction, which improves pose tracking accuracy. Experiments show that the approach markedly reduces root mean squared error, making it a valuable tool for sports biomechanics, although further improvement is needed for high-accuracy applications.
Link: https://arxiv.org/abs/2505.04713
Authors: Luis F. Gomez,Gonzalo Garrido-Lopez,Julian Fierrez,Aythami Morales,Ruben Tolosana,Javier Rueda,Enrique Navarro
Affiliations: BiDA Lab, Universidad Autónoma de Madrid; Sport Biomechanics Laboratory, Faculty of Physical Activity and Sports Sciences, INEF, Universidad Politécnica de Madrid
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: Preprint of the paper presented to the Third Workshop on Learning with Few or Without Annotated Face, Body, and Gesture Data on 19th IEEE Conference on Automatic Face and Gesture Recognition 2025
Abstract:Human pose estimation has witnessed significant advancements in recent years, mainly due to the integration of deep learning models, the availability of a vast amount of data, and large computational resources. These developments have led to highly accurate body tracking systems, which have direct applications in sports analysis and performance evaluation. This work analyzes the performance of six trackers: two point trackers and four joint trackers for biomechanical analysis in sprints. The proposed framework compares the results obtained from these pose trackers with the manual annotations of biomechanical experts for more than 5870 frames. The experimental framework employs forty sprints from five professional runners, focusing on three key angles in sprint biomechanics: trunk inclination, hip flex extension, and knee flex extension. We propose a post-processing module for outlier detection and fusion prediction in the joint angles. The experimental results demonstrate that using joint-based models yields root mean squared errors ranging from 11.41° to 4.37°. When integrated with the post-processing modules, these errors can be reduced to 6.99° and 3.88°, respectively. The experimental findings suggest that human pose tracking approaches can be valuable resources for the biomechanical analysis of running. However, there is still room for improvement in applications where high accuracy is required.
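The evaluation described here compares tracker-derived joint angles with expert annotations using root mean squared error. A minimal sketch of that arithmetic follows. It assumes 2D keypoints and hypothetical helper names (`joint_angle`, `rmse`); the paper's own pipeline, keypoint definitions, and post-processing are not reproduced.

```python
import math


def joint_angle(a, b, c):
    """Return the angle in degrees at vertex b formed by 2D points a-b-c.

    For a knee angle one might pass (hip, knee, ankle); that mapping is an
    assumption made for illustration, not the paper's definition.
    """
    v1 = (a[0] - b[0], a[1] - b[1])
    v2 = (c[0] - b[0], c[1] - b[1])
    n1 = math.hypot(v1[0], v1[1])
    n2 = math.hypot(v2[0], v2[1])
    if n1 == 0 or n2 == 0:
        raise ValueError("degenerate segment: two keypoints coincide")
    cos_theta = (v1[0] * v2[0] + v1[1] * v2[1]) / (n1 * n2)
    # Clamp to guard against floating-point drift outside [-1, 1].
    cos_theta = max(-1.0, min(1.0, cos_theta))
    return math.degrees(math.acos(cos_theta))


def rmse(predicted, reference):
    """Root mean squared error between two equal-length angle sequences."""
    if len(predicted) != len(reference) or not predicted:
        raise ValueError("sequences must be non-empty and of equal length")
    squared = [(p - r) ** 2 for p, r in zip(predicted, reference)]
    return math.sqrt(sum(squared) / len(squared))
```

Per-frame angles from a tracker and from the manual annotations would each be collected into a list and passed to `rmse`, giving one error value per angle type.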
[CV-85] Histo-Miner: Deep Learning based Tissue Features Extraction Pipeline from HE Whole Slide Images of Cutaneous Squamous Cell Carcinoma
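[Summary]: This paper addresses the lack of labeled datasets and open-source pipelines for analyzing Whole-Slide Images (WSIs) of skin tissue. The key solution is Histo-Miner, a deep learning-based pipeline for skin WSI analysis, together with two datasets with labeled nuclei and tumor regions. The method uses convolutional neural networks and vision transformers for nucleus segmentation and classification and for tumor region segmentation, reports performance that compares favorably with the state of the art, and produces feature vectors that support various downstream tasks, such as predicting patient response to immunotherapy.
Link: https://arxiv.org/abs/2505.04672
Authors: Lucas Sancéré,Carina Lorenz,Doris Helbig,Oana-Diana Persa,Sonja Dengler,Alexander Kreuter,Martim Laimer,Anne Fröhlich,Jennifer Landsberg,Johannes Brägelmann,Katarzyna Bozek
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Quantitative Methods (q-bio.QM)
Comments: 31 pages including supplement, 5 core figures, 5 supplement figures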
Abstract:Recent advancements in digital pathology have enabled comprehensive analysis of Whole-Slide Images (WSI) from tissue samples, leveraging high-resolution microscopy and computational capabilities. Despite this progress, there is a lack of labeled datasets and open source pipelines specifically tailored for analysis of skin tissue. Here we propose Histo-Miner, a deep learning-based pipeline for analysis of skin WSIs and generate two datasets with labeled nuclei and tumor regions. We develop our pipeline for the analysis of patient samples of cutaneous squamous cell carcinoma (cSCC), a frequent non-melanoma skin cancer. Utilizing the two datasets, comprising 47,392 annotated cell nuclei and 144 tumor-segmented WSIs respectively, both from cSCC patients, Histo-Miner employs convolutional neural networks and vision transformers for nucleus segmentation and classification as well as tumor region segmentation. Performance of trained models positively compares to state of the art with multi-class Panoptic Quality (mPQ) of 0.569 for nucleus segmentation, macro-averaged F1 of 0.832 for nucleus classification and mean Intersection over Union (mIoU) of 0.884 for tumor region segmentation. From these predictions we generate a compact feature vector summarizing tissue morphology and cellular interactions, which can be used for various downstream tasks. Here, we use Histo-Miner to predict cSCC patient response to immunotherapy based on pre-treatment WSIs from 45 patients. Histo-Miner identifies percentages of lymphocytes, the granulocyte to lymphocyte ratio in tumor vicinity and the distances between granulocytes and plasma cells in tumors as predictive features for therapy response. This highlights the applicability of Histo-Miner to clinically relevant scenarios, providing direct interpretation of the classification and insights into the underlying biology.
[CV-86] ChannelExplorer: Exploring Class Separability Through Activation Channel Visualization
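[Summary]: This paper tries to address the difficulty of understanding the internal behavior of deep neural networks (DNNs), in particular how different layers and activation channels contribute to class separability. The key to the solution is ChannelExplorer, an interactive visual analytics tool that analyzes image-based outputs in a data-driven way rather than through architecture analysis, revealing inter-class confusion, the degree of activation overlap, and activation channel patterns, and supporting analysis across a variety of model architectures.
Link: https://arxiv.org/abs/2505.04647
Authors: Md Rahat-uz-Zaman,Bei Wang,Paul Rosen
Affiliations: Unknown
Subjects: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: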
Abstract:Deep neural networks (DNNs) achieve state-of-the-art performance in many vision tasks, yet understanding their internal behavior remains challenging, particularly how different layers and activation channels contribute to class separability. We introduce ChannelExplorer, an interactive visual analytics tool for analyzing image-based outputs across model layers, emphasizing data-driven insights over architecture analysis for exploring class separability. ChannelExplorer summarizes activations across layers and visualizes them using three primary coordinated views: a Scatterplot View to reveal inter- and intra-class confusion, a Jaccard Similarity View to quantify activation overlap, and a Heatmap View to inspect activation channel patterns. Our technique supports diverse model architectures, including CNNs, GANs, ResNet and Stable Diffusion models. We demonstrate the capabilities of ChannelExplorer through four use-case scenarios: (1) generating class hierarchy in ImageNet, (2) finding mislabeled images, (3) identifying activation channel contributions, and (4) locating latent states’ position in Stable Diffusion model. Finally, we evaluate the tool with expert users.
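The Jaccard Similarity View quantifies activation overlap. The sketch below shows one plausible way to compute such an overlap between two classes, under the assumption that each class is summarized by the set of channels whose mean activation exceeds a threshold. The function names, the threshold, and the empty-set convention are all hypothetical and are not taken from ChannelExplorer.

```python
def active_channels(mean_activations, threshold=0.0):
    """Return the set of channel indices whose summary activation exceeds threshold."""
    return {i for i, value in enumerate(mean_activations) if value > threshold}


def jaccard_similarity(channels_a, channels_b):
    """Jaccard index |A & B| / |A | B| of two channel sets.

    Two empty sets are treated as identical (1.0); this is a convention
    chosen for the sketch.
    """
    union = channels_a | channels_b
    if not union:
        return 1.0
    return len(channels_a & channels_b) / len(union)


# Hypothetical per-channel mean activations for two classes at one layer.
class_cat = active_channels([0.9, 0.0, 0.4, 0.7, 0.0])
class_dog = active_channels([0.8, 0.3, 0.0, 0.6, 0.0])
overlap = jaccard_similarity(class_cat, class_dog)
```

A value near 1 would indicate that the two classes activate largely the same channels at that layer, while a value near 0 would indicate largely disjoint channels.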
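[CV-87] OcularAge: A Comparative Study of Iris and Periocular Images for Pediatric Age Estimation
[Summary]: This paper tries to address age estimation from children's ocular biometric images, a task made challenging by subtle physiological changes and limited longitudinal datasets. The key to the solution is a multi-task deep learning framework that jointly performs age prediction and age-group classification, used to compare iris and periocular regions for pediatric age estimation. Using a longitudinal dataset of more than 21,000 near-infrared images and convolutional neural network architectures adapted for non-square ocular inputs, the study shows that periocular models perform better for age estimation, achieving a lower mean absolute error and higher classification accuracy, and offers a feasible approach for child-oriented, privacy-preserving biometric systems.
Link: https://arxiv.org/abs/2505.05374
Authors: Naveenkumar G Venkataswamy,Poorna Ravi,Stephanie Schuckers,Masudul H. Imtiaz
Affiliations: Clarkson University
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: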
Abstract:Estimating a child’s age from ocular biometric images is challenging due to subtle physiological changes and the limited availability of longitudinal datasets. Although most biometric age estimation studies have focused on facial features and adult subjects, pediatric-specific analysis, particularly of the iris and periocular regions, remains relatively unexplored. This study presents a comparative evaluation of iris and periocular images for estimating the ages of children aged between 4 and 16 years. We utilized a longitudinal dataset comprising more than 21,000 near-infrared (NIR) images, collected from 288 pediatric subjects over eight years using two different imaging sensors. A multi-task deep learning framework was employed to jointly perform age prediction and age-group classification, enabling a systematic exploration of how different convolutional neural network (CNN) architectures, particularly those adapted for non-square ocular inputs, capture the complex variability inherent in pediatric eye images. The results show that periocular models consistently outperform iris-based models, achieving a mean absolute error (MAE) of 1.33 years and an age-group classification accuracy of 83.82%. These results mark the first demonstration that reliable age estimation is feasible from children’s ocular images, enabling privacy-preserving age checks in child-centric applications. This work establishes the first longitudinal benchmark for pediatric ocular age estimation, providing a foundation for designing robust, child-focused biometric systems. The developed models proved resilient across different imaging sensors, confirming their potential for real-world deployment. They also achieved inference speeds of less than 10 milliseconds per image on resource-constrained VR headsets, demonstrating their suitability for real-time applications.
[CV-88] Augmented Deep Contexts for Spatially Embedded Video Coding CVPR
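[Summary]: This paper addresses the poor performance of conventional Neural Video Codecs (NVCs) on large motions or emerging objects, caused by limited contexts and a misaligned latent prior. The key to the solution is a Spatially Embedded Video Codec (SEVC), which combines spatial and temporal references to generate augmented motion vectors and hybrid spatial-temporal contexts, introduces a spatial-guided latent prior augmented by multiple temporal latent representations to address the misalignment and enrich the prior, and uses joint spatial-temporal optimization to learn quality-adaptive bit allocation, improving coding performance.
Link: https://arxiv.org/abs/2505.05309
Authors: Yifan Bian,Chuanbo Tang,Li Li,Dong Liu
Affiliations: University of Science and Technology of China
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: 15 pages, CVPR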
Abstract:Most Neural Video Codecs (NVCs) only employ temporal references to generate temporal-only contexts and latent prior. These temporal-only NVCs fail to handle large motions or emerging objects due to limited contexts and misaligned latent prior. To relieve these limitations, we propose a Spatially Embedded Video Codec (SEVC), in which the low-resolution video is compressed for spatial references. First, our SEVC leverages both spatial and temporal references to generate augmented motion vectors and hybrid spatial-temporal contexts. Second, to address the misalignment issue in latent prior and enrich the prior information, we introduce a spatial-guided latent prior augmented by multiple temporal latent representations. Finally, we design a joint spatial-temporal optimization to learn quality-adaptive bit allocation for spatial references, further boosting rate-distortion performance. Experimental results show that our SEVC effectively alleviates the limitations in handling large motions or emerging objects, and also reduces 11.9% more bitrate than the previous state-of-the-art NVC while providing an additional low-resolution bitstream. Our code and model are available at this https URL.
[CV-89] Benchmarking Ophthalmology Foundation Models for Clinically Significant Age Macular Degeneration Detection
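[Summary]: This paper investigates whether, in retinal imaging, Vision Transformers (ViTs) pretrained on in-domain data improve identification of moderate-to-late age-related macular degeneration (AMD) more than general-purpose models pretrained on natural images. The key is benchmarking six ViTs pretrained with self-supervised learning (SSL) on seven digital fundus image (DFI) datasets, which shows that iBOT pretrained on natural images generalizes better out of distribution than domain-specific models and a ViT-L without pretraining, highlighting the value of foundation models for AMD identification and questioning the need for in-domain pretraining.
Link: https://arxiv.org/abs/2505.05291
Authors: Benjamin A. Cohen,Jonathan Fhima,Meishar Meisel,Baskin Meital,Luis Filipe Nakayama,Eran Berkowitz,Joachim A. Behar
Affiliations: Unknown
Subjects: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Tissues and Organs (q-bio.TO)
Comments: 10 pages, 3 figures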
Abstract:Self-supervised learning (SSL) has enabled Vision Transformers (ViTs) to learn robust representations from large-scale natural image datasets, enhancing their generalization across domains. In retinal imaging, foundation models pretrained on either natural or ophthalmic data have shown promise, but the benefits of in-domain pretraining remain uncertain. To investigate this, we benchmark six SSL-pretrained ViTs on seven digital fundus image (DFI) datasets totaling 70,000 expert-annotated images for the task of moderate-to-late age-related macular degeneration (AMD) identification. Our results show that iBOT pretrained on natural images achieves the highest out-of-distribution generalization, with AUROCs of 0.80-0.97, outperforming domain-specific models, which achieved AUROCs of 0.78-0.96 and a baseline ViT-L with no pretraining, which achieved AUROCs of 0.68-0.91. These findings highlight the value of foundation models in improving AMD identification and challenge the assumption that in-domain pretraining is necessary. Furthermore, we release BRAMD, an open-access dataset (n=587) of DFIs with AMD labels from Brazil.
[CV-90] White Light Specular Reflection Data Augmentation for Deep Learning Polyp Detection
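[Summary]: This paper tries to address false detections in deep learning (DL) polyp detection for colorectal cancer, caused by white light reflections from the endoscope being mistaken for polyps. The key to the solution is a new data augmentation method that artificially adds more white light reflections to create harder training scenarios: it generates a bank of artificial lights, identifies regions where lights should not be added, and uses a sliding window method to add artificial lights to suitable regions, producing augmented images that improve the model's learning and detection performance.
Link: https://arxiv.org/abs/2505.05248
Authors: Jose Angel Nuñez,Fabian Vazquez,Diego Adame,Xiaoyan Fu,Pengfei Gu,Bin Fu
Affiliations: Unknown
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: 5 pages, 4 Figures, paper accepted by the ISBI (International Symposium on Biomedical Imaging) 2025 Conference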
Abstract:Colorectal cancer is one of the deadliest cancers today, but it can be prevented through early detection of malignant polyps in the colon, primarily via colonoscopies. While this method has saved many lives, human error remains a significant challenge, as missing a polyp could have fatal consequences for the patient. Deep learning (DL) polyp detectors offer a promising solution. However, existing DL polyp detectors often mistake white light reflections from the endoscope for polyps, which can lead to false detections. To address this challenge, in this paper, we propose a novel data augmentation approach that artificially adds more white light reflections to create harder training scenarios. Specifically, we first generate a bank of artificial lights using the training dataset. Then we find the regions of the training images on which these artificial lights should not be added. Finally, we propose a sliding window method to add the artificial lights to suitable areas of the training images, resulting in augmented images. By providing the model with more opportunities to make mistakes, we hypothesize that it will also have more chances to learn from those mistakes, ultimately improving its performance in polyp detection. Experimental results demonstrate the effectiveness of our new data augmentation method.
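The three steps above (a bank of artificial lights, regions to avoid, a sliding window to place lights) can be sketched roughly as follows. This is only an illustration under stated assumptions: images are grayscale nested lists, the forbidden mask is given rather than computed, and lights are blended with a per-pixel maximum. The function names and the blending rule are hypothetical and are not the paper's method.

```python
def find_valid_windows(forbidden, size, stride=1):
    """Slide a size x size window over a boolean mask.

    Returns the (top, left) corners of windows that contain no forbidden
    pixel, i.e. candidate positions for an artificial reflection.
    """
    height, width = len(forbidden), len(forbidden[0])
    corners = []
    for top in range(0, height - size + 1, stride):
        for left in range(0, width - size + 1, stride):
            blocked = any(
                forbidden[top + dy][left + dx]
                for dy in range(size)
                for dx in range(size)
            )
            if not blocked:
                corners.append((top, left))
    return corners


def paste_light(image, light_patch, top, left):
    """Return a copy of image with light_patch blended in at (top, left).

    Blending keeps the brighter of the two values per pixel, a simple
    stand-in for a specular highlight.
    """
    augmented = [row[:] for row in image]  # leave the input untouched
    for dy, patch_row in enumerate(light_patch):
        for dx, value in enumerate(patch_row):
            y, x = top + dy, left + dx
            augmented[y][x] = max(augmented[y][x], value)
    return augmented
```

In practice one position would be drawn at random from `find_valid_windows` and one patch from the light bank for each augmented image. How the forbidden mask is built (for example from polyp annotations or existing reflections) is left open here because the abstract does not specify it.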
[CV-91] Improved Brain Tumor Detection in MRI: Fuzzy Sigmoid Convolution in Deep Learning IJCNN2025
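[Summary]: This paper addresses the problem that existing convolutional neural network (CNN) models for tumor detection are overparameterized, which limits their performance gains. The key solution is fuzzy sigmoid convolution (FSC), combined with top-of-the-funnel and middle-of-the-funnel modules; a novel convolutional operator effectively dilates the receptive field while preserving input data integrity, significantly reducing the number of trainable parameters while maintaining classification accuracy. The method achieves classification accuracies of 99.17% to 99.89% on three benchmark datasets with 100 times fewer parameters than large-scale transfer learning architectures, demonstrating computational efficiency and a lightweight design.
Link: https://arxiv.org/abs/2505.05208
Authors: Muhammad Irfan,Anum Nawaz,Riku Klen,Abdulhamit Subasi,Tomi Westerlund,Wei Chen
Affiliations: University of Turku; Fudan University; University at Albany; University of Sydney
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: IEEE IJCNN 2025 has accepted the paper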
Abstract:Early detection and accurate diagnosis are essential to improving patient outcomes. The use of convolutional neural networks (CNNs) for tumor detection has shown promise, but existing models often suffer from overparameterization, which limits their performance gains. In this study, fuzzy sigmoid convolution (FSC) is introduced along with two additional modules: top-of-the-funnel and middle-of-the-funnel. The proposed methodology significantly reduces the number of trainable parameters without compromising classification accuracy. A novel convolutional operator is central to this approach, effectively dilating the receptive field while preserving input data integrity. This enables efficient feature map reduction and enhances the model’s tumor detection capability. In the FSC-based model, fuzzy sigmoid activation functions are incorporated within convolutional layers to improve feature extraction and classification. The inclusion of fuzzy logic into the architecture improves its adaptability and robustness. Extensive experiments on three benchmark datasets demonstrate the superior performance and efficiency of the proposed model. The FSC-based architecture achieved classification accuracies of 99.17%, 99.75%, and 99.89% on three different datasets. The model employs 100 times fewer parameters than large-scale transfer learning architectures, highlighting its computational efficiency and suitability for detecting brain tumors early. This research offers lightweight, high-performance deep-learning models for medical imaging applications.
[CV-92] MDAA-Diff: CT-Guided Multi-Dose Adaptive Attention Diffusion Model for PET Denoising
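[Summary]: This paper addresses the reduced diagnostic reliability caused by degraded image quality in low-dose Positron Emission Tomography (LPET); the core challenge is preserving diagnostic quality while reducing radiation exposure. Existing studies mostly focus on denoising single low-dose PET images, neglecting dose-response discrepancies caused by inter-patient variability and the anatomical constraints available from CT images. The key to the solution is a CT-Guided Multi-dose Adaptive Attention Denoising Diffusion Model (MDAA-Diff), which introduces a CT-guided high-frequency wavelet attention module and a dose-adaptive attention module to jointly exploit anatomical guidance and dose-level adaptation, improving detail preservation and the diagnostic value of PET images under low-dose conditions.
Link: https://arxiv.org/abs/2505.05112
Authors: Xiaolong Niu,Zanting Ye,Xu Han,Yanchao Huang,Hao Sun,Hubing Wu,Lijun Lu
Affiliations: Unknown
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: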
Abstract:Acquiring high-quality Positron Emission Tomography (PET) images requires administering high-dose radiotracers, which increases radiation exposure risks. Generating standard-dose PET (SPET) from low-dose PET (LPET) has become a potential solution. However, previous studies have primarily focused on single low-dose PET denoising, neglecting two critical factors: discrepancies in dose response caused by inter-patient variability, and complementary anatomical constraints derived from CT images. In this work, we propose a novel CT-Guided Multi-dose Adaptive Attention Denoising Diffusion Model (MDAA-Diff) for multi-dose PET denoising. Our approach integrates anatomical guidance and dose-level adaptation to achieve superior denoising performance under low-dose conditions. Specifically, this approach incorporates a CT-Guided High-frequency Wavelet Attention (HWA) module, which uses wavelet transforms to separate high-frequency anatomical boundary features from CT images. These extracted features are then incorporated into PET imaging through an adaptive weighted fusion mechanism to enhance edge details. Additionally, we propose the Dose-Adaptive Attention (DAA) module, a dose-conditioned enhancement mechanism that dynamically integrates dose levels into channel-spatial attention weight calculation. Extensive experiments on 18F-FDG and 68Ga-FAPI datasets demonstrate that MDAA-Diff outperforms state-of-the-art approaches in preserving diagnostic quality under reduced-dose conditions. Our code is publicly available.
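The abstract says the HWA module uses wavelet transforms to separate high-frequency boundary features from CT images. As a rough illustration only, the sketch below applies a one-level orthonormal 2D Haar transform in NumPy and measures the energy in the detail bands. The choice of the Haar wavelet, the function names, and the toy images are all assumptions; the abstract does not say which wavelet or decomposition depth MDAA-Diff uses.

```python
import numpy as np

def haar_dwt2(img):
    """One-level orthonormal 2D Haar transform of an array with even height and width."""
    a = img[0::2, 0::2]  # top-left pixel of each 2x2 block
    b = img[0::2, 1::2]  # top-right
    c = img[1::2, 0::2]  # bottom-left
    d = img[1::2, 1::2]  # bottom-right
    approx = (a + b + c + d) / 2.0       # low-frequency content
    across_cols = (a - b + c - d) / 2.0  # changes from left to right
    across_rows = (a + b - c - d) / 2.0  # changes from top to bottom
    diagonal = (a - b - c + d) / 2.0     # diagonal changes
    return approx, (across_cols, across_rows, diagonal)

def high_frequency_energy(img):
    """Sum of squared detail coefficients, a crude measure of edge content."""
    _, details = haar_dwt2(img)
    return float(sum(np.sum(band ** 2) for band in details))

# Toy inputs (hypothetical): a flat image and an image with one vertical edge.
flat = np.ones((8, 8))
step = np.zeros((8, 8))
step[:, 3:] = 1.0
```

On the flat image the detail bands are empty, while the vertical edge in `step` shows up in the `across_cols` band; in a CT-guided setting it is these detail bands that would carry the anatomical boundaries.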
[CV-93] RepSNet: A Nucleus Instance Segmentation model based on Boundary Regression and Structural Re-parameterization
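[Quick read]: This paper addresses computational efficiency and the handling of overlapping targets in nucleus instance segmentation for digital pathology analysis. The key to the solution is RepSNet, a neural network built on nucleus boundary regression and structural re-parameterization. The model estimates, for each pixel, the boundary position information (BPI) of its parent nucleus and aggregates the BPIs through a proposed boundary voting mechanism (BVM) to obtain nucleus boundaries, yielding more precise instance segmentation. Structural re-parameterization of the encoder-decoder improves feature fusion and reduces the parameter count and computational burden at inference time.
Link: https://arxiv.org/abs/2505.05073
Authors: Shengchun Xiong,Xiangru Li,Yunpeng Zhong,Wanfen Peng
Affiliations: Unknown
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: 25 pages, 7 figures, 5 tables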
Abstract:Pathological diagnosis is the gold standard for tumor diagnosis, and nucleus instance segmentation is a key step in digital pathology analysis and pathological diagnosis. However, the computational efficiency of the model and the treatment of overlapping targets remain the major challenges in this problem. To this end, a neural network model, RepSNet, was designed based on nucleus boundary regression and a structural re-parameterization scheme for segmenting and classifying the nuclei in H&E-stained histopathological images. First, RepSNet estimates the boundary position information (BPI) of the parent nucleus for each pixel. The BPI estimation incorporates the local information of the pixel and the contextual information of the parent nucleus. Then, the nucleus boundary is estimated by aggregating the BPIs from a series of pixels using a proposed boundary voting mechanism (BVM), and the instance segmentation results are computed from the estimated nucleus boundary using connected component analysis. The BVM intrinsically achieves a kind of synergistic belief enhancement among the BPIs from various pixels. Therefore, unlike methods in the literature that obtain nucleus boundaries through direct pixel recognition, RepSNet bases its boundary decisions on guidance from macroscopic information through an integration mechanism. In addition, RepSNet employs a re-parameterizable encoder-decoder structure. This model not only aggregates features from receptive fields of various scales, which helps improve segmentation accuracy, but also reduces the parameter count and computational burden in the inference phase through structural re-parameterization. Extensive experiments demonstrated the superiority of RepSNet over several typical benchmark models.
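Structural re-parameterization, in its generic form, trains with several parallel branches and then folds them into a single equivalent operator for inference. The sketch below shows that idea for the simplest case, a 3x3 branch plus a 1x1 branch on a single-channel image, and checks numerically that the merged kernel gives the same output. The branch layout is an assumption for illustration; the abstract does not describe RepSNet's actual branches, and batch normalization folding is left out.

```python
import numpy as np

def conv2d_same(x, k):
    """Naive zero-padded 'same' cross-correlation with an odd-sized square kernel."""
    r = k.shape[0] // 2
    xp = np.pad(x, r)
    out = np.zeros(x.shape, dtype=float)
    for i in range(x.shape[0]):
        for j in range(x.shape[1]):
            out[i, j] = np.sum(xp[i:i + k.shape[0], j:j + k.shape[1]] * k)
    return out

def merge_branches(k3, k1):
    """Fold a 1x1 kernel into the centre tap of a 3x3 kernel."""
    merged = k3.copy()
    merged[1, 1] += k1[0, 0]
    return merged

rng = np.random.default_rng(0)
x = rng.normal(size=(8, 8))
k3 = rng.normal(size=(3, 3))
k1 = rng.normal(size=(1, 1))

two_branch = conv2d_same(x, k3) + conv2d_same(x, k1)  # training-time form
one_branch = conv2d_same(x, merge_branches(k3, k1))   # inference-time form
max_gap = float(np.max(np.abs(two_branch - one_branch)))
```

Because convolution is linear, the two forms agree up to floating-point error, which is why the merged model can drop the extra branch at inference without changing its predictions.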
[CV-94] Direct Image Classification from Fourier Ptychographic Microscopy Measurements without Reconstruction
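[Quick read]: This paper tackles the high computational cost of reconstructing a high-resolution image from a large number of measurements in Fourier Ptychographic Microscopy (FPM). The key idea is to classify image content directly from the FPM measurements rather than reconstructing a high-resolution image first. The study shows that convolutional neural networks (CNNs) can extract useful information from measurement sequences, outperforming classification on a single band-limited image by up to 12% while being more computationally efficient. In addition, learned multiplexing of several raw measurements maintains classification accuracy while substantially reducing data volume and acquisition time.
Link: https://arxiv.org/abs/2505.05054
Authors: Navya Sonal Agarwal,Jan Philipp Schneider,Kanchana Vaishnavi Gandikota,Syed Muhammad Kazim,John Meshreki,Ivo Ihrke,Michael Moeller
Affiliations: Kalinga Institute of Industrial Technology, India; Chair for Computer Vision, University of Siegen, Germany; Chair for Computational Sensing and Communications Engineering, University of Siegen, Germany.
Categories: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: ISCS 2025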
Abstract:The computational imaging technique of Fourier Ptychographic Microscopy (FPM) enables high-resolution imaging with a wide field of view and can serve as an extremely valuable tool, e.g. in the classification of cells in medical applications. However, reconstructing a high-resolution image from tens or even hundreds of measurements is computationally expensive, particularly for a wide field of view. Therefore, in this paper, we investigate the idea of classifying the image content in the FPM measurements directly without performing a reconstruction step first. We show that Convolutional Neural Networks (CNN) can extract meaningful information from measurement sequences, significantly outperforming the classification on a single band-limited image (up to 12 %) while being significantly more efficient than a reconstruction of a high-resolution image. Furthermore, we demonstrate that a learned multiplexing of several raw measurements allows maintaining the classification accuracy while reducing the amount of data (and consequently also the acquisition time) significantly.
[CV-95] ADNP-15: An Open-Source Histopathological Dataset for Neuritic Plaque Segmentation in Human Brain Whole Slide Images with Frequency Domain Image Enhancement for Stain Normalization
[Quick read]: This paper addresses automatic identification and segmentation of neuritic plaques in Alzheimer's Disease (AD), which is challenging because large-scale annotated datasets are lacking and staining variation affects image analysis. The key to the solution is an open-source dataset (ADNP-15), an evaluation of combinations of several stain normalization techniques and deep learning models, and a new image enhancement method that improves segmentation accuracy in complex tissue structures.
Link: https://arxiv.org/abs/2505.05041
Authors: Chenxi Zhao,Jianqiang Li,Qing Zhao,Jing Bai,Susana Boluda,Benoit Delatour,Lev Stimmer,Daniel Racoceanu,Gabriel Jimenez,Guanghui Fu
Affiliations: Beijing University of Technology; Sorbonne Université; Institut du Cerveau - Paris Brain Institute; ICM; Inserm; CNRS; APHP; Hôpital de la Pitié Salpêtrière; DMU Neuroscience; Inria
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Alzheimer’s Disease (AD) is a neurodegenerative disorder characterized by amyloid-beta plaques and tau neurofibrillary tangles, which serve as key histopathological features. The identification and segmentation of these lesions are crucial for understanding AD progression but remain challenging due to the lack of large-scale annotated datasets and the impact of staining variations on automated image analysis. Deep learning has emerged as a powerful tool for pathology image segmentation; however, model performance is significantly influenced by variations in staining characteristics, necessitating effective stain normalization and enhancement techniques. In this study, we address these challenges by introducing an open-source dataset (ADNP-15) of neuritic plaques (i.e., amyloid deposits combined with a crown of dystrophic tau-positive neurites) in human brain whole slide images. We establish a comprehensive benchmark by evaluating five widely adopted deep learning models across four stain normalization techniques, providing deeper insights into their influence on neuritic plaque segmentation. Additionally, we propose a novel image enhancement method that improves segmentation accuracy, particularly in complex tissue structures, by enhancing structural details and mitigating staining inconsistencies. Our experimental results demonstrate that this enhancement strategy significantly boosts model generalization and segmentation accuracy. All datasets and code are open-source, ensuring transparency and reproducibility while enabling further advancements in the field.
[CV-96] MoRe-3DGSMR: Motion-resolved reconstruction framework for free-breathing pulmonary MRI based on 3D Gaussian representation
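[Quick read]: This paper addresses motion-resolved reconstruction in high-resolution, free-breathing pulmonary magnetic resonance imaging (MRI), in particular how to reconstruct high-quality 3D isotropic images without supervision. The key to the solution is a framework based on a three-dimensional Gaussian representation (3DGS) that smooths data between voxels to obtain a continuous spatial representation, combined with a golden-angle radial sampling trajectory and respiratory motion signal extraction; a patient-specific convolutional neural network then estimates deformation vector fields (DVFs) to generate images for the different respiratory phases. The method was evaluated on datasets from six subjects and showed better image quality than existing methods.
Link: https://arxiv.org/abs/2505.04959
Authors: Tengya Peng,Ruyi Zha,Qing Zou
Affiliations: University of Texas Southwestern Medical Center; Australian National University
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: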
Abstract:This study presents an unsupervised, motion-resolved reconstruction framework for high-resolution, free-breathing pulmonary magnetic resonance imaging (MRI), utilizing a three-dimensional Gaussian representation (3DGS). The proposed method leverages 3DGS to address the challenges of motion-resolved 3D isotropic pulmonary MRI reconstruction by enabling data smoothing between voxels for continuous spatial representation. Pulmonary MRI data acquisition is performed using a golden-angle radial sampling trajectory, with respiratory motion signals extracted from the center of k-space in each radial spoke. Based on the estimated motion signal, the k-space data is sorted into multiple respiratory phases. A 3DGS framework is then applied to reconstruct a reference image volume from the first motion state. Subsequently, a patient-specific convolutional neural network is trained to estimate the deformation vector fields (DVFs), which are used to generate the remaining motion states through spatial transformation of the reference volume. The proposed reconstruction pipeline is evaluated on six datasets from six subjects and bench-marked against three state-of-the-art reconstruction methods. The experimental findings demonstrate that the proposed reconstruction framework effectively reconstructs high-resolution, motion-resolved pulmonary MR images. Compared with existing approaches, it achieves superior image quality, reflected by higher signal-to-noise ratio and contrast-to-noise ratio. The proposed unsupervised 3DGS-based reconstruction method enables accurate motion-resolved pulmonary MRI with isotropic spatial resolution. Its superior performance in image quality metrics over state-of-the-art methods highlights its potential as a robust solution for clinical pulmonary MR imaging.
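The step in which k-space data is sorted into respiratory phases can be pictured with a small sketch. It assumes amplitude-based binning with an equal number of spokes per phase, which is one common convention; the abstract does not state the binning rule, and the function name and the synthetic sinusoidal motion signal are hypothetical.

```python
import numpy as np

def bin_spokes_by_amplitude(motion_signal, n_phases):
    """Split spoke indices into n_phases groups of (nearly) equal size,
    ordered from the lowest to the highest motion-signal amplitude."""
    order = np.argsort(motion_signal, kind="stable")
    return [np.sort(chunk) for chunk in np.array_split(order, n_phases)]

# Synthetic stand-in for a respiratory signal, one value per radial spoke.
t = np.linspace(0.0, 8.0 * np.pi, 400, endpoint=False)
signal = np.sin(t)
bins = bin_spokes_by_amplitude(signal, 4)
```

Each entry of `bins` lists the spokes that would feed the reconstruction of one motion state, with the first bin playing the role of the reference state in a pipeline like the one described.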
[CV-97] Advanced 3D Imaging Approach to TSV/TGV Metrology and Inspection Using Only Optical Microscopy
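[Quick read]: This paper addresses the limitations of conventional optical microscopy in inspecting through-silicon and through-glass vias, in particular its difficulty in visualizing internal structures. The key to the solution is combining hybrid field microscopy with photometric stereo, performing 3D reconstruction from multiple lighting conditions; this enhances detection of micro-scale defects and gives detailed visualization of depth and edge abnormalities, with higher accuracy and repeatability than conventional methods.
Link: https://arxiv.org/abs/2505.04913
Authors: Gugeong Sung
Affiliations: Unknown
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Optics (physics.optics)
Comments: 6 pages, 6 figures, Submitted to arXiv for preprint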
Abstract:This paper introduces an innovative approach to silicon and glass via inspection, which combines hybrid field microscopy with photometric stereo. Conventional optical microscopy techniques are generally limited to superficial inspections and struggle to effectively visualize the internal structures of silicon and glass vias. By utilizing various lighting conditions for 3D reconstruction, the proposed method surpasses these limitations. By integrating photometric stereo into traditional optical microscopy, the proposed method not only enhances the capability to detect micro-scale defects but also provides a detailed visualization of depth and edge abnormalities, which are typically not visible with conventional optical microscopy inspection. The experimental results demonstrated that the proposed method effectively captures intricate surface details and internal structures. Quantitative comparisons between the reconstructed models and actual measurements show that the proposed method can significantly improve the silicon and glass via inspection process. As a result, the proposed method achieves enhanced cost-effectiveness while maintaining high accuracy and repeatability, suggesting substantial advancements in silicon and glass via inspection techniques.
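Photometric stereo recovers surface orientation from several images taken under different lighting. A minimal sketch of the classic Lambertian formulation follows, in which each pixel's intensities satisfy I = L (albedo * n) and are solved by least squares. The Lambertian assumption, the light directions, and the synthetic surface are illustrative choices, not details taken from the paper.

```python
import numpy as np

def photometric_stereo(intensities, light_dirs):
    """Least-squares Lambertian photometric stereo.

    intensities: array (k, h, w), one image per light.
    light_dirs:  array (k, 3), unit light directions.
    Returns unit normals (h, w, 3) and albedo (h, w).
    """
    k, h, w = intensities.shape
    flat = intensities.reshape(k, -1)                      # (k, h*w)
    g, *_ = np.linalg.lstsq(light_dirs, flat, rcond=None)  # (3, h*w), g = albedo * n
    albedo = np.linalg.norm(g, axis=0)
    normals = g / np.maximum(albedo, 1e-12)                # avoid division by zero
    return normals.T.reshape(h, w, 3), albedo.reshape(h, w)

# Synthetic check: a tilted plane with constant albedo under four lights.
lights = np.array([[0.0, 0.0, 1.0],
                   [0.5, 0.0, 0.866],
                   [0.0, 0.5, 0.866],
                   [-0.5, 0.0, 0.866]])
lights /= np.linalg.norm(lights, axis=1, keepdims=True)
true_n = np.array([0.2, -0.1, 1.0])
true_n /= np.linalg.norm(true_n)
true_albedo = 0.7
images = np.stack([np.full((4, 4), true_albedo * (l @ true_n)) for l in lights])
normals, albedo = photometric_stereo(images, lights)
```

Integrating the recovered normal field would give a depth map; that step, and any handling of shadows or specular highlights inside a via, is outside this sketch.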
[CV-98] Integrated Image Reconstruction and Target Recognition based on Deep Learning Technique
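[Quick read]: This paper addresses the significant computational bottleneck that computational microwave imaging (CMI) faces in the image reconstruction stage, in particular the high processing demands of image recovery and target classification. The key to the solution is adding attention gate modules to the ClassiGAN framework; by dynamically focusing on important features and suppressing irrelevant information, they improve feature extraction and overall model performance, yielding faster reconstruction than traditional CMI methods along with better reconstruction quality and classification results.
Link: https://arxiv.org/abs/2505.04836
Authors: Cien Zhang,Jiaming Zhang,Jiajun He,Okan Yurduseven
Affiliations: Wharton Research Data Services; University of Pennsylvania; Centre for Wireless Innovation; Queen’s University Belfast; Department of Electrical Engineering; City University of Hong Kong
Categories: Signal Processing (eess.SP); Computer Vision and Pattern Recognition (cs.CV)
Comments: Submitted to The 2025 15th IEEE International Conference on Signal Processing, Communications and Computing (ICSPCC 2025)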
Abstract:Computational microwave imaging (CMI) has gained attention as an alternative technique for conventional microwave imaging techniques, addressing their limitations such as hardware-intensive physical layer and slow data collection acquisition speed to name a few. Despite these advantages, CMI still encounters notable computational bottlenecks, especially during the image reconstruction stage. In this setting, both image recovery and object classification present significant processing demands. To address these challenges, our previous work introduced ClassiGAN, which is a generative deep learning model designed to simultaneously reconstruct images and classify targets using only back-scattered signals. In this study, we build upon that framework by incorporating attention gate modules into ClassiGAN. These modules are intended to refine feature extraction and improve the identification of relevant information. By dynamically focusing on important features and suppressing irrelevant ones, the attention mechanism enhances the overall model performance. The proposed architecture, named Att-ClassiGAN, significantly reduces the reconstruction time compared to traditional CMI approaches. Furthermore, it outperforms current advanced methods, delivering improved Normalized Mean Squared Error (NMSE), higher Structural Similarity Index (SSIM), and better classification outcomes for the reconstructed targets.
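An attention gate weights each spatial location of a feature map by a learned coefficient between 0 and 1, which is how it can emphasize useful features and suppress the rest. The NumPy sketch below follows the widely used additive form (1x1 projections, ReLU, then a sigmoid). Whether Att-ClassiGAN uses exactly this form is an assumption, and all shapes and weights here are random placeholders.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def attention_gate(x, g, w_x, w_g, psi):
    """Additive attention gate on feature maps of equal spatial size.

    x: (c_x, h, w) features to be gated; g: (c_g, h, w) gating features.
    w_x: (c_int, c_x) and w_g: (c_int, c_g) act as 1x1 convolutions.
    psi: (c_int,) maps the joint features to one score per pixel.
    """
    joint = np.einsum("ic,chw->ihw", w_x, x) + np.einsum("ic,chw->ihw", w_g, g)
    joint = np.maximum(joint, 0.0)                       # ReLU
    alpha = sigmoid(np.einsum("i,ihw->hw", psi, joint))  # per-pixel weight in [0, 1]
    return alpha[None, :, :] * x, alpha

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 5, 5))
g = rng.normal(size=(6, 5, 5))
w_x = rng.normal(size=(3, 4))
w_g = rng.normal(size=(3, 6))
psi = rng.normal(size=3)
gated, alpha = attention_gate(x, g, w_x, w_g, psi)
```

In a trained network the weights would be learned so that `alpha` is close to 1 on target-relevant regions and close to 0 elsewhere.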
[CV-99] Advancing 3D Medical Image Segmentation: Unleashing the Potential of Planarian Neural Networks in Artificial Intelligence
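[Quick read]: This paper aims to improve model performance in 3D medical image segmentation, specifically by borrowing the structure of the planarian neural network (PNN) to guide deep neural network design. The key to the solution is the PNN-UNet architecture, which mimics the PNN's two-nerve-cord structure: a Deep-UNet and a Wide-UNet act as the nerve cords and a densely connected autoencoder acts as the brain, outperforming the conventional UNet and its variants on segmentation.
Link: https://arxiv.org/abs/2505.04664
Authors: Ziyuan Huang,Kevin Huggins,Srikar Bellur
Affiliations: Unknown
Categories: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: 36 pages, 8 figures, 21 tables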
Abstract:Our study presents PNN-UNet as a method for constructing deep neural networks that replicate the planarian neural network (PNN) structure in the context of 3D medical image data. Planarians typically have a cerebral structure comprising two neural cords, where the cerebrum acts as a coordinator, and the neural cords serve slightly different purposes within the organism’s neurological system. Accordingly, PNN-UNet comprises a Deep-UNet and a Wide-UNet as the nerve cords, with a densely connected autoencoder performing the role of the brain. This distinct architecture offers advantages over both monolithic (UNet) and modular networks (Ensemble-UNet). Our outcomes on a 3D MRI hippocampus dataset, with and without data augmentation, demonstrate that PNN-UNet outperforms the baseline UNet and several other UNet variants in image segmentation.
[CV-100] Rethinking Boundary Detection in Deep Learning-Based Medical Image Segmentation
[Quick read]: This paper addresses the difficulty of precisely segmenting boundary regions in medical image segmentation. The key to the solution is a new network architecture named CTO that combines convolutional neural networks (CNNs), Vision Transformer (ViT) models, and explicit edge detection operators. A dual-stream encoder jointly captures local features and long-range dependencies, and a boundary-guided decoder uses binary boundary masks generated by dedicated edge detection operators as explicit guidance, improving the model's ability to learn boundary regions.
Link: https://arxiv.org/abs/2505.04652
Authors: Yi Lin,Dong Zhang,Xiao Fang,Yufan Chen,Kwang-Ting Cheng,Hao Chen
Affiliations: The Hong Kong University of Science and Technology; HKUST Shenzhen-Hong Kong Collaborative Innovation Research Institute
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by Medical Image Analysis
Abstract:Medical image segmentation is a pivotal task within the realms of medical image analysis and computer vision. While current methods have shown promise in accurately segmenting major regions of interest, the precise segmentation of boundary areas remains challenging. In this study, we propose a novel network architecture named CTO, which combines Convolutional Neural Networks (CNNs), Vision Transformer (ViT) models, and explicit edge detection operators to tackle this challenge. CTO surpasses existing methods in terms of segmentation accuracy and strikes a better balance between accuracy and efficiency, without the need for additional data inputs or label injections. Specifically, CTO adheres to the canonical encoder-decoder network paradigm, with a dual-stream encoder network comprising a mainstream CNN stream for capturing local features and an auxiliary StitchViT stream for integrating long-range dependencies. Furthermore, to enhance the model’s ability to learn boundary areas, we introduce a boundary-guided decoder network that employs binary boundary masks generated by dedicated edge detection operators to provide explicit guidance during the decoding process. We validate the performance of CTO through extensive experiments conducted on seven challenging medical image segmentation datasets, namely ISIC 2016, PH2, ISIC 2018, CoNIC, LiTS17, and BTCV. Our experimental results unequivocally demonstrate that CTO achieves state-of-the-art accuracy on these datasets while maintaining competitive model complexity. The codes have been released at: this https URL.
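A binary boundary mask of the kind the decoder is said to use can be produced by thresholding the response of an edge detection operator. The sketch below uses the Sobel operator as a stand-in; the abstract does not name the operator, and the threshold value and toy mask are assumptions.

```python
import numpy as np

SOBEL_X = np.array([[-1.0, 0.0, 1.0],
                    [-2.0, 0.0, 2.0],
                    [-1.0, 0.0, 1.0]])
SOBEL_Y = SOBEL_X.T

def filter3x3(img, kernel):
    """Cross-correlate a 2D image with a 3x3 kernel using edge-replicated padding."""
    padded = np.pad(img.astype(float), 1, mode="edge")
    out = np.zeros(img.shape, dtype=float)
    for di in range(3):
        for dj in range(3):
            out += kernel[di, dj] * padded[di:di + img.shape[0], dj:dj + img.shape[1]]
    return out

def boundary_mask(img, threshold):
    """True where the Sobel gradient magnitude exceeds the threshold."""
    gx = filter3x3(img, SOBEL_X)
    gy = filter3x3(img, SOBEL_Y)
    return np.hypot(gx, gy) > threshold

# Toy segmentation mask: a filled square on an empty background.
img = np.zeros((8, 8))
img[2:6, 2:6] = 1.0
mask = boundary_mask(img, 0.5)
```

The mask is set around the square's outline and cleared in its interior and in the far background, which is the kind of explicit boundary cue a decoder can be conditioned on.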
Artificial Intelligence
[AI-0] Conversational Process Model Redesign
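[Quick read]: This paper asks how large language models (LLMs) can empower domain experts to create and redesign process models effectively in an iterative process. The key to the solution is a conversational process model redesign (CPD) approach whose multi-step procedure yields explainable and reproducible changes: first, identify process change patterns from the literature; second, rephrase the user's natural-language change request into the wording expected for the identified pattern; finally, apply the meaning of the change to the process model. The approach emphasizes continuous interaction between the user and the LLM rather than single-prompt execution and result evaluation.
Link: https://arxiv.org/abs/2505.05453
Authors: Nataliia Klievtsova,Timotheus Kampik,Juergen Mangler,Stefanie Rinderle-Ma
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: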
Abstract:With the recent success of large language models (LLMs), the idea of AI-augmented Business Process Management systems is becoming more feasible. One of their essential characteristics is the ability to be conversationally actionable, allowing humans to interact with the LLM effectively to perform crucial process life cycle tasks such as process model design and redesign. However, most current research focuses on single-prompt execution and evaluation of results, rather than on continuous interaction between the user and the LLM. In this work, we aim to explore the feasibility of using LLMs to empower domain experts in the creation and redesign of process models in an iterative and effective way. The proposed conversational process model redesign (CPD) approach receives as input a process model and a redesign request by the user in natural language. Instead of just letting the LLM make changes, the LLM is employed to (a) identify process change patterns from literature, (b) re-phrase the change request to be aligned with an expected wording for the identified pattern (i.e., the meaning), and then to (c) apply the meaning of the change to the process model. This multi-step approach allows for explainable and reproducible changes. In order to ensure the feasibility of the CPD approach, and to find out how well the patterns from literature can be handled by the LLM, we performed an extensive evaluation. The results show that some patterns are hard to understand by LLMs and by users. Within the scope of the study, we demonstrated that users need support to describe the changes clearly. Overall the evaluation shows that the LLMs can handle most changes well according to a set of completeness and correctness criteria.
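[AI-1] EcoAgent: An Efficient Edge-Cloud Collaborative Multi-Agent Framework for Mobile Automation
[Quick read]: This paper addresses the high latency and cost of cloud-based mobile agents powered by (multimodal) large language models ((M)LLMs), and the tendency of fine-tuned (M)SLMs deployed at the edge to lose general capabilities and struggle with complex tasks. The key to the solution is EcoAgent, an edge-cloud collaborative multi-agent framework for mobile automation. Its core is a closed loop between a cloud-based Planning Agent and two edge-based agents (an Execution Agent and an Observation Agent): the Observation Agent's Pre-Understanding Module compresses screen images into concise text to reduce token usage, and on failure the Planning Agent retrieves the screen history and replans through a Reflection Module.
Link: https://arxiv.org/abs/2505.05440
Authors: Biao Yi,Xavier Hu,Yurun Chen,Shengyu Zhang,Hongxia Yang,Fan Wu,Fei Wu
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: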
Abstract:Cloud-based mobile agents powered by (multimodal) large language models ((M)LLMs) offer strong reasoning abilities but suffer from high latency and cost. While fine-tuned (M)SLMs enable edge deployment, they often lose general capabilities and struggle with complex tasks. To address this, we propose EcoAgent, an Edge-Cloud cOllaborative multi-agent framework for mobile automation. EcoAgent features a closed-loop collaboration among a cloud-based Planning Agent and two edge-based agents: the Execution Agent for action execution and the Observation Agent for verifying outcomes. The Observation Agent uses a Pre-Understanding Module to compress screen images into concise text, reducing token usage. In case of failure, the Planning Agent retrieves screen history and replans via a Reflection Module. Experiments on AndroidWorld show that EcoAgent maintains high task success rates while significantly reducing MLLM token consumption, enabling efficient and practical mobile automation.
[AI-2] CART-ELC: Oblique Decision Tree Induction via Exhaustive Search
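[Quick read]: This paper addresses the limited classification performance of traditional axis-aligned decision trees and the computational challenge of finding oblique splits by exhaustive search. The key to the solution is a new algorithm, Classification and Regression Tree - Exhaustive Linear Combinations (CART-ELC), which performs an exhaustive search over a restricted set of hyperplanes, keeping classification accuracy high while reducing computational complexity and producing shallower, simpler, and more interpretable trees.
Link: https://arxiv.org/abs/2505.05402
Authors: Andrew D. Laack
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Data Structures and Algorithms (cs.DS)
Comments: 16 pages, 4 figures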
Abstract:Oblique decision trees have attracted attention due to their potential for improved classification performance over traditional axis-aligned decision trees. However, methods that rely on exhaustive search to find oblique splits face computational challenges. As a result, they have not been widely explored. We introduce a novel algorithm, Classification and Regression Tree - Exhaustive Linear Combinations (CART-ELC), for inducing oblique decision trees that performs an exhaustive search on a restricted set of hyperplanes. We then investigate the algorithm’s computational complexity and its predictive capabilities. Our results demonstrate that CART-ELC consistently achieves competitive performance on small datasets, often yielding statistically significant improvements in classification accuracy relative to existing decision tree induction algorithms, while frequently producing shallower, simpler, and thus more interpretable trees.
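To make "exhaustive search on a restricted set of hyperplanes" concrete, the sketch below searches one such restricted family in 2D: every line that passes through a pair of training points, scored by weighted Gini impurity. This family is an assumption chosen for illustration; the abstract does not say which hyperplanes CART-ELC enumerates, and the function names and toy data are hypothetical.

```python
from itertools import combinations
import numpy as np

def gini(labels):
    """Gini impurity of a label array (0.0 for an empty or pure set)."""
    if labels.size == 0:
        return 0.0
    _, counts = np.unique(labels, return_counts=True)
    p = counts / labels.size
    return float(1.0 - np.sum(p ** 2))

def weighted_gini(y, mask):
    """Size-weighted impurity of the two sides of a split."""
    n_in, n_out = int(mask.sum()), int((~mask).sum())
    return (n_in * gini(y[mask]) + n_out * gini(y[~mask])) / y.size

def best_oblique_split(X, y, tol=1e-9):
    """Try every line through two training points; return (score, w, b, mask)."""
    best = (np.inf, None, None, None)
    for i, j in combinations(range(len(X)), 2):
        d = X[j] - X[i]
        w = np.array([-d[1], d[0]])  # normal of the line through X[i] and X[j]
        if not np.any(w):
            continue                 # duplicate points define no line
        b = float(w @ X[i])
        s = X @ w - b
        # Points on the line may fall on either side, so test both assignments.
        for mask in (s > tol, s < -tol):
            if mask.all() or not mask.any():
                continue
            score = weighted_gini(y, mask)
            if score < best[0]:
                best = (score, w, b, mask)
    return best

# Toy data that no single axis-aligned threshold separates.
X = np.array([[0, 0], [1, 0], [0, 1], [2, 1], [1, 2], [2, 2]], dtype=float)
y = np.array([0, 0, 0, 1, 1, 1])
score, w, b, mask = best_oblique_split(X, y)
```

The number of candidate lines grows quadratically with the number of samples here, which hints at why the set of hyperplanes has to be restricted for an exhaustive search to stay tractable.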
[AI-3] A Pain Assessment Framework based on multimodal data and Deep Machine Learning methods
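[Quick read]: This paper aims to automate pain assessment in clinical settings, using computational methods to improve its accuracy and applicability. The key to the solution is developing innovative computational methods that achieve high-performance automatic pain assessment and adapt to the needs of different clinical scenarios, together with an in-depth study of significant factors that influence pain perception, such as demographic characteristics.
Link: https://arxiv.org/abs/2505.05396
Authors: Stefanos Gkikas
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
Comments: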
Abstract:From the original abstract: This thesis initially aims to study the pain assessment process from a clinical-theoretical perspective while exploring and examining existing automatic approaches. Building on this foundation, the primary objective of this Ph.D. project is to develop innovative computational methods for automatic pain assessment that achieve high performance and are applicable in real clinical settings. A primary goal is to thoroughly investigate and assess significant factors, including demographic elements that impact pain perception, as recognized in pain research, through a computational standpoint. Within the limits of the available data in this research area, our goal was to design, develop, propose, and offer automatic pain assessment pipelines for unimodal and multimodal configurations that are applicable to the specific requirements of different scenarios. The studies published in this Ph.D. thesis showcased the effectiveness of the proposed methods, achieving state-of-the-art results. Additionally, they paved the way for exploring new approaches in artificial intelligence, foundation models, and generative artificial intelligence.
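[AI-4] Software Development Life Cycle Perspective: A Survey of Benchmarks for CodeLLMs and Agents
[Quick read]: This paper addresses the lack of a comprehensive review of benchmarks for code large language models (CodeLLMs) and agents, with the aim of supporting their effective evaluation and development in software engineering. The key to the solution is systematically collecting and analyzing 181 benchmarks from 461 relevant papers across the phases of the software development life cycle (SDLC), revealing imbalanced coverage in existing benchmarks and proposing future research directions to narrow the gap between theoretical capability and practical application.
Link: https://arxiv.org/abs/2505.05283
Authors: Kaixin Wang,Tianlin Li,Xiaoyu Zhang,Chong Wang,Weisong Sun,Yang Liu,Bin Shi
Affiliations: Unknown
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments: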
Abstract:Code large language models (CodeLLMs) and agents have shown great promise in tackling complex software engineering tasks. Compared to traditional software engineering methods, CodeLLMs and agents offer stronger abilities, and can flexibly process inputs and outputs in both natural language and code. Benchmarking plays a crucial role in evaluating the capabilities of CodeLLMs and agents, guiding their development and deployment. However, despite their growing significance, there remains a lack of comprehensive reviews of benchmarks for CodeLLMs and agents. To bridge this gap, this paper provides a comprehensive review of existing benchmarks for CodeLLMs and agents, studying and analyzing 181 benchmarks from 461 relevant papers, covering the different phases of the software development life cycle (SDLC). Our findings reveal a notable imbalance in the coverage of current benchmarks, with approximately 60% focused on the software development phase in SDLC, while requirements engineering and software design phases receive minimal attention at only 5% and 3%, respectively. Additionally, Python emerges as the dominant programming language across the reviewed benchmarks. Finally, this paper highlights the challenges of current research and proposes future directions, aiming to narrow the gap between the theoretical capabilities of CodeLLMs and agents and their application in real-world scenarios.
[AI-5] Enhancing Cooperative Multi-Agent Reinforcement Learning with State Modelling and Adversarial Exploration ICML2025
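[Quick read]: This paper addresses cooperation in multi-agent deep reinforcement learning (MARL) in distributed, partially observable environments without communication. The core challenge is inferring state representations from individual agents' observations and using them to improve agents' exploration and collaborative task execution policies. The key to the solution is a new state modelling framework in which agents infer meaningful belief representations of the unobservable state with respect to optimizing their own policies, while filtering out redundant and less informative joint state information. Building on this, the paper proposes the SMPE algorithm, which strengthens agents' policy discrimination under partial observability by explicitly feeding beliefs into the policy network and by adopting an adversarial exploration policy.
Link: https://arxiv.org/abs/2505.05262
Authors: Andreas Kontogiannis,Konstantinos Papathanasiou,Yi Shen,Giorgos Stamou,Michael M. Zavlanos,George Vouros
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
Comments: Accepted (Poster) at ICML 2025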
Abstract:Learning to cooperate in distributed partially observable environments with no communication abilities poses significant challenges for multi-agent deep reinforcement learning (MARL). This paper addresses key concerns in this domain, focusing on inferring state representations from individual agent observations and leveraging these representations to enhance agents’ exploration and collaborative task execution policies. To this end, we propose a novel state modelling framework for cooperative MARL, where agents infer meaningful belief representations of the non-observable state, with respect to optimizing their own policies, while filtering redundant and less informative joint state information. Building upon this framework, we propose the MARL SMPE algorithm. In SMPE, agents enhance their own policy’s discriminative abilities under partial observability, explicitly by incorporating their beliefs into the policy network, and implicitly by adopting an adversarial type of exploration policies which encourages agents to discover novel, high-value states while improving the discriminative abilities of others. Experimentally, we show that SMPE outperforms state-of-the-art MARL algorithms in complex fully cooperative tasks from the MPE, LBF, and RWARE benchmarks.
[AI-6] Advancing Neural Network Verification through Hierarchical Safety Abstract Interpretation
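[Quick read]: This paper addresses the coarse safety assessment that results from the binary encoding of safety properties in traditional formal verification (FV) of deep neural networks (DNNs): a model is simply classified as safe or unsafe, which cannot reflect graded safety levels within the model. The key to the solution is a new problem formulation, Abstract DNN-Verification, which verifies a hierarchical structure of unsafe outputs and so gives a finer-grained analysis of DNN safety. Using abstract interpretation and reasoning about output reachable sets, the method can assess multiple safety levels during formal verification, with computational cost that in the worst case matches traditional binary verification and may be lower.
Link: https://arxiv.org/abs/2505.05235
Authors: Luca Marzari,Isabella Mastroeni,Alessandro Farinelli
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: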
Abstract:Traditional methods for formal verification (FV) of deep neural networks (DNNs) are constrained by a binary encoding of safety properties, where a model is classified as either safe or unsafe (robust or not robust). This binary encoding fails to capture the nuanced safety levels within a model, often resulting in either overly restrictive or too permissive requirements. In this paper, we introduce a novel problem formulation called Abstract DNN-Verification, which verifies a hierarchical structure of unsafe outputs, providing a more granular analysis of the safety aspect for a given DNN. Crucially, by leveraging abstract interpretation and reasoning about output reachable sets, our approach enables assessing multiple safety levels during the FV process, requiring the same (in the worst case) or even potentially less computational effort than the traditional binary verification approach. Specifically, we demonstrate how this formulation allows ranking adversarial inputs according to their abstract safety level violation, offering a more detailed evaluation of the model’s safety and robustness. Our contributions include a theoretical exploration of the relationship between our novel abstract safety formulation and existing approaches that employ abstract interpretation for robustness verification, complexity analysis of the novel problem introduced, and an empirical evaluation considering both a complex deep reinforcement learning task (based on Habitat 3.0) and standard DNN-Verification benchmarks.
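The combination of abstract interpretation, output reachable sets, and graded unsafe levels can be illustrated with interval bound propagation on a tiny ReLU network. In this sketch the nested unsafe levels are modelled as increasing output thresholds, which is an assumption made purely for illustration; the network, thresholds, and function names are hypothetical and this is not the paper's algorithm.

```python
import numpy as np

def affine_bounds(lo, hi, W, b):
    """Sound interval bounds of W @ x + b for x in the box [lo, hi]."""
    w_pos, w_neg = np.maximum(W, 0.0), np.minimum(W, 0.0)
    return w_pos @ lo + w_neg @ hi + b, w_pos @ hi + w_neg @ lo + b

def output_bounds(lo, hi, layers):
    """Propagate a box through affine layers with ReLU between them."""
    for k, (W, b) in enumerate(layers):
        lo, hi = affine_bounds(lo, hi, W, b)
        if k < len(layers) - 1:
            lo, hi = np.maximum(lo, 0.0), np.maximum(hi, 0.0)
    return lo, hi

def worst_possible_level(upper, thresholds):
    """Number of nested unsafe levels that the over-approximation cannot rule out.
    Level k counts as unsafe when the output exceeds thresholds[k - 1]."""
    return sum(1 for t in thresholds if upper > t)

layers = [
    (np.array([[1.0, -1.0], [1.0, 1.0]]), np.zeros(2)),
    (np.array([[1.0, 1.0]]), np.zeros(1)),
]
lo_out, hi_out = output_bounds(np.zeros(2), np.ones(2), layers)
level = worst_possible_level(float(hi_out[0]), [1.0, 2.5, 4.0])
```

For this network the computed output interval is [0, 3], so the most severe level (output above 4.0) is proven unreachable in a single pass. The true maximum over the input box is 2, so the second level is flagged only because interval bounds over-approximate the reachable set; tighter abstract domains would reduce such false alarms.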
[AI-7] ChemRxivQuest: A Curated Chemistry Question-Answer Database Extracted from ChemRxiv Preprints
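[Quick read]: This paper addresses the challenge researchers face in efficiently accessing domain knowledge as the chemistry literature rapidly expands. The key to the solution is ChemRxivQuest, a high-quality question-answer (QA) dataset of 970 QA pairs drawn from 155 ChemRxiv preprints across 17 chemistry subfields. The dataset was built with an automated pipeline that combines optical character recognition (OCR), GPT-4o-based QA generation, and fuzzy matching for answer verification, which ensures traceability and contextual accuracy.
Link: https://arxiv.org/abs/2505.05232
Authors: Mahmoud Amiri, Thomas Bocklitz
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: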
Abstract:The rapid expansion of chemistry literature poses significant challenges for researchers seeking to efficiently access domain-specific knowledge. To support advancements in chemistry-focused natural language processing (NLP), we present ChemRxivQuest, a curated dataset of 970 high-quality question-answer (QA) pairs derived from 155 ChemRxiv preprints across 17 subfields of chemistry. Each QA pair is explicitly linked to its source text segment to ensure traceability and contextual accuracy. ChemRxivQuest was constructed using an automated pipeline that combines optical character recognition (OCR), GPT-4o-based QA generation, and a fuzzy matching technique for answer verification. The dataset emphasizes conceptual, mechanistic, applied, and experimental questions, enabling applications in retrieval-based QA systems, search engine development, and fine-tuning of domain-adapted large language models. We analyze the dataset’s structure, coverage, and limitations, and outline future directions for expansion and expert validation. ChemRxivQuest provides a foundational resource for chemistry NLP research, education, and tool development.
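The answer-verification step can be illustrated with a minimal sketch. The abstract does not say which fuzzy matching technique or acceptance threshold the authors use, so everything below is an assumption: the sketch relies on `difflib.SequenceMatcher` from the Python standard library, and `best_window_ratio`, `is_answer_supported`, and the 0.8 threshold are hypothetical names and values chosen only for illustration.

```python
from difflib import SequenceMatcher


def best_window_ratio(answer: str, source: str) -> float:
    """Best similarity between `answer` and any word window of `source`
    that has the same number of words as the answer (1.0 = exact match)."""
    answer_words = answer.lower().split()
    source_words = source.lower().split()
    n = len(answer_words)
    if n == 0 or len(source_words) < n:
        # Fall back to comparing the two strings as a whole.
        return SequenceMatcher(None, answer.lower(), source.lower()).ratio()
    target = " ".join(answer_words)
    best = 0.0
    for start in range(len(source_words) - n + 1):
        window = " ".join(source_words[start:start + n])
        best = max(best, SequenceMatcher(None, target, window).ratio())
    return best


def is_answer_supported(answer: str, source: str, threshold: float = 0.8) -> bool:
    """Accept a generated answer only if it closely matches the source segment.
    The threshold is a hypothetical value, not one taken from the paper."""
    return best_window_ratio(answer, source) >= threshold


segment = "The catalyst lowers the activation energy of the reaction."
print(is_answer_supported("catalyst lowers the activation energy", segment))
print(is_answer_supported("the solvent raises the boiling point", segment))
```

Matching against word windows rather than the whole segment keeps a short answer from being penalized simply because the source passage is long.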
[AI-8] Put CASH on Bandits: A Max K-Armed Problem for Automated Machine Learning
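[Quick read]: This paper addresses Combined Algorithm Selection and Hyperparameter optimization (CASH) in AutoML, a challenging resource allocation problem. The key to the solution is MaxUCB, a max k-armed bandit method that balances exploring different model classes against conducting hyperparameter optimization. MaxUCB is designed for the light-tailed, bounded reward distributions that arise in this setting, and offers a more efficient alternative to classic max k-armed bandit methods that assume heavy-tailed rewards.
Link: https://arxiv.org/abs/2505.05226
Authors: Amir Rezaei Balef, Claire Vernade, Katharina Eggensperger
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: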
Abstract:The Combined Algorithm Selection and Hyperparameter optimization (CASH) is a challenging resource allocation problem in the field of AutoML. We propose MaxUCB, a max k-armed bandit method to trade off exploring different model classes and conducting hyperparameter optimization. MaxUCB is specifically designed for the light-tailed and bounded reward distributions arising in this setting and, thus, provides an efficient alternative compared to classic max k-armed bandit methods assuming heavy-tailed reward distributions. We theoretically and empirically evaluate our method on four standard AutoML benchmarks, demonstrating superior performance over prior approaches.
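To make the max k-armed setting concrete, here is a rough sketch in which each arm stands for a model class, each pull stands for one hyperparameter optimization step returning a bounded score, and the goal is the single best score found rather than the highest average. The abstract does not give the MaxUCB index, so the index below (best score seen on an arm plus a bonus that shrinks with the pull count) is a generic optimistic rule assumed for illustration, not the paper's formula. The function `max_bandit`, the constant `c`, and the toy arms are all hypothetical.

```python
import math
import random
from typing import Callable, List, Tuple


def max_bandit(arms: List[Callable[[], float]], budget: int,
               c: float = 0.5) -> Tuple[float, List[int]]:
    """Spend `budget` pulls to maximise the single best reward observed.

    Illustrative index (an assumption, not the paper's formula):
    best reward seen on the arm + c * sqrt(log(t) / pulls).
    """
    if budget < len(arms):
        raise ValueError("budget must allow one pull per arm")
    best = [arm() for arm in arms]  # pull every arm once
    pulls = [1] * len(arms)
    for t in range(len(arms) + 1, budget + 1):
        index = [best[i] + c * math.sqrt(math.log(t) / pulls[i])
                 for i in range(len(arms))]
        i = max(range(len(arms)), key=index.__getitem__)
        best[i] = max(best[i], arms[i]())  # track the maximum, not the mean
        pulls[i] += 1
    return max(best), pulls


rng = random.Random(0)
toy_arms = [
    lambda: rng.uniform(0.0, 0.6),  # hypothetical weak model class
    lambda: rng.uniform(0.2, 0.9),  # hypothetical strong model class
    lambda: rng.uniform(0.1, 0.7),
]
best_score, pulls = max_bandit(toy_arms, budget=60)
print(best_score, pulls)
```

The only structural difference from a standard UCB loop is that each arm is summarized by its maximum observed reward, which is what matters when only the best configuration is kept.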
[AI-9] Incentive-Aware Machine Learning; Robustness, Fairness, Improvement & Causality
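[Quick read]: This paper addresses the core problem of incentive-aware machine learning: how to design effective models when individuals may strategically modify their inputs to influence algorithmic decisions. The key to the solution is a unified framework that covers three perspectives, namely robustness (resisting "gaming"), fairness (analyzing the societal impact of such systems), and improvement/causality (recognizing genuine personal or societal gains from strategic behavior), with models for offline, online, and causal settings. The framework highlights the key challenges of distinguishing gaming from improvement and of handling heterogeneity among agents.
Link: https://arxiv.org/abs/2505.05211
Authors: Chara Podimata
Affiliations: Unknown
Categories: Computer Science and Game Theory (cs.GT); Artificial Intelligence (cs.AI)
Comments: This literature review was published in SIGEcom Exchanges in 2025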
Abstract:The article explores the emerging domain of incentive-aware machine learning (ML), which focuses on algorithmic decision-making in contexts where individuals can strategically modify their inputs to influence outcomes. It categorizes the research into three perspectives: robustness, aiming to design models resilient to “gaming”; fairness, analyzing the societal impacts of such systems; and improvement/causality, recognizing situations where strategic actions lead to genuine personal or societal improvement. The paper introduces a unified framework encapsulating models for these perspectives, including offline, online, and causal settings, and highlights key challenges such as differentiating between gaming and improvement and addressing heterogeneity among agents. By synthesizing findings from diverse works, we outline theoretical advancements and practical solutions for robust, fair, and causally-informed incentive-aware ML systems.
[AI-10] LAPSO: A Unified Optimization View for Learning-Augmented Power System Operations
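[Quick read]: This paper addresses the challenges that traditional model-based power system operation faces in economy, stability, and robustness under high renewable penetration. The key to the solution is a holistic framework, Learning-Augmented Power System Operations (LAPSO). Adopting a native optimization perspective and centered on the operation stage, it aims to break the boundary between temporally siloed power system tasks (such as forecasting, operation, and control) and to unify the objectives of machine learning and model-based optimization at both training and inference stages. Systematic analysis and simulations demonstrate its effectiveness in designing new integrated algorithms, such as stability-constrained optimization and objective-based forecasting, while enabling end-to-end tracing of different sources of uncertainty.
Link: https://arxiv.org/abs/2505.05203
Authors: Wangkun Xu, Zhongda Chu, Fei Teng
Affiliations: Unknown
Categories: Systems and Control (eess.SY); Artificial Intelligence (cs.AI)
Comments: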
Abstract:With the high penetration of renewables, traditional model-based power system operation is challenged to deliver economic, stable, and robust decisions. Machine learning has emerged as a powerful modeling tool for capturing complex dynamics to address these challenges. However, its separate design often lacks systematic integration with existing methods. To fill the gap, this paper proposes a holistic framework of Learning-Augmented Power System Operations (LAPSO, pronounced as Lap-So). Adopting a native optimization perspective, LAPSO is centered on the operation stage and aims to break the boundary between temporally siloed power system tasks, such as forecast, operation and control, while unifying the objectives of machine learning and model-based optimizations at both training and inference stages. Systematic analysis and simulations demonstrate the effectiveness of applying LAPSO in designing new integrated algorithms, such as stability-constrained optimization (SCO) and objective-based forecasting (OBF), while enabling end-to-end tracing of different sources of uncertainties. In addition, a dedicated Python package-lapso is introduced to automatically augment existing power system optimization models with learnable components. All code and data are available at this https URL.
[AI-11] Societal and technological progress as sewing an ever-growing, ever-changing, patchy, and polychrome quilt
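[Quick read]: This paper addresses a problem in current approaches to the ethical alignment of AI systems: existing solutions tend to be one-size-fits-all and overlook moral diversity, which may provoke social resistance, erode trust, and destabilize institutions. The key proposal is an alternative called the appropriateness framework, grounded in conflict theory, cultural evolution, multi-agent systems, and institutional economics. It treats persistent disagreement as the normal case and designs for it through four principles: contextual grounding, community customization, continual adaptation, and polycentric governance. The aim is to shift the alignment metaphor from moral unification to the more productive one of conflict management.
Link: https://arxiv.org/abs/2505.05197
Authors: Joel Z. Leibo, Alexander Sasha Vezhnevets, William A. Cunningham, Sébastien Krier, Manfred Diaz, Simon Osindero
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
Comments: 16 pages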
Abstract:Artificial Intelligence (AI) systems are increasingly placed in positions where their decisions have real consequences, e.g., moderating online spaces, conducting research, and advising on policy. Ensuring they operate in a safe and ethically acceptable fashion is thus critical. However, most solutions have been a form of one-size-fits-all “alignment”. We are worried that such systems, which overlook enduring moral diversity, will spark resistance, erode trust, and destabilize our institutions. This paper traces the underlying problem to an often-unstated Axiom of Rational Convergence: the idea that under ideal conditions, rational agents will converge in the limit of conversation on a single ethics. Treating that premise as both optional and doubtful, we propose what we call the appropriateness framework: an alternative approach grounded in conflict theory, cultural evolution, multi-agent systems, and institutional economics. The appropriateness framework treats persistent disagreement as the normal case and designs for it by applying four principles: (1) contextual grounding, (2) community customization, (3) continual adaptation, and (4) polycentric governance. We argue here that adopting these design principles is a good way to shift the main alignment metaphor from moral unification to a more productive metaphor of conflict management, and that taking this step is both desirable and urgent.
[AI-12] Stochastic Variational Propagation: Local, Scalable and Efficient Alternative to Backpropagation
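[Quick read]: This paper addresses the limited scalability and high memory overhead of backpropagation (BP), which stem from its reliance on global gradient synchronization. The key to the solution is Stochastic Variational Propagation (SVP), a scalable alternative that reframes training as hierarchical variational inference and optimizes local evidence lower bounds (ELBOs), enabling independent local updates while preserving global coherence. To prevent inter-layer representation collapse, SVP projects activations into low-dimensional spaces via fixed random matrices, preserving information and representational diversity, and adds a feature alignment loss for inter-layer consistency. It achieves accuracy comparable to BP across multiple architectures and datasets while substantially reducing memory usage and improving scalability.
Link: https://arxiv.org/abs/2505.05181
Authors: Bojian Yin, Federico Corradi
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 14 pages, 5 figures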
Abstract:Backpropagation (BP) is the cornerstone of deep learning, but its reliance on global gradient synchronization limits scalability and imposes significant memory overhead. We propose Stochastic Variational Propagation (SVP), a scalable alternative that reframes training as hierarchical variational inference. SVP treats layer activations as latent variables and optimizes local Evidence Lower Bounds (ELBOs), enabling independent, local updates while preserving global coherence. However, directly applying KL divergence in layer-wise ELBOs risks inter-layer representation collapse due to excessive compression. To prevent this, SVP projects activations into low-dimensional spaces via fixed random matrices, ensuring information preservation and representational diversity. Combined with a feature alignment loss for inter-layer consistency, SVP achieves competitive accuracy with BP across diverse architectures (MLPs, CNNs, Transformers) and datasets (MNIST to ImageNet), reduces memory usage by up to 4x, and significantly improves scalability. More broadly, SVP introduces a probabilistic perspective to deep representation learning, opening pathways toward more modular and interpretable neural network design.
[AI-13] MARK: Memory Augmented Refinement of Knowledge
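[Quick read]: This paper addresses the difficulty large language models (LLMs) have in aligning with evolving domain knowledge without costly fine-tuning. The key to the solution is the Memory-Augmented Refinement of Knowledge (MARK) framework, which uses structured refined memory and specialized agents to let LLMs learn continuously without retraining, improving accuracy, adaptability, and personalization in specific domains.
Link: https://arxiv.org/abs/2505.05177
Authors: Anish Ganguli, Prabal Deb, Debleena Banerjee
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: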
Abstract:Large Language Models (LLMs) assist in specialized tasks but struggle to align with evolving domain knowledge without costly fine-tuning. Domain knowledge consists of: Knowledge: Immutable facts (e.g., ‘A stone is solid’) and generally accepted principles (e.g., ethical standards); Refined Memory: Evolving insights shaped by business needs and real-world changes. However, a significant gap often exists between a domain expert’s deep, nuanced understanding and the system’s domain knowledge, which can hinder accurate information retrieval and application. Our Memory-Augmented Refinement of Knowledge (MARK) framework enables LLMs to continuously learn without retraining by leveraging structured refined memory, inspired by the Society of Mind. MARK operates through specialized agents, each serving a distinct role: Residual Refined Memory Agent: Stores and retrieves domain-specific insights to maintain context over time; User Question Refined Memory Agent: Captures user-provided facts, abbreviations, and terminology for better comprehension; LLM Response Refined Memory Agent: Extracts key elements from responses for refinement and personalization. These agents analyse stored refined memory, detect patterns, resolve contradictions, and improve response accuracy. Temporal factors like recency and frequency prioritize relevant information while discarding outdated insights. MARK enhances LLMs in multiple ways: Ground Truth Strategy: Reduces hallucinations by establishing a structured reference; Domain-Specific Adaptation: Essential for fields like healthcare, law, and manufacturing, where proprietary insights are absent from public datasets; Personalized AI Assistants: Improves virtual assistants by remembering user preferences, ensuring coherent responses over time.
[AI-14] Dukawalla: Voice Interfaces for Small Businesses in Africa
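[Quick read]: This paper addresses the challenges small and medium-sized businesses (SMBs) face in data-driven decision making, particularly in African countries, where a lack of advanced analytics tools makes it hard for them to use their data effectively. The key to the solution is Dukawalla, a prototype intelligent assistant that uses voice interaction and generative AI to turn raw business data into actionable insights. It streamlines data collection and provides business insights in a way that fits how SMB workers operate: mobile-first, short on time, and with social and business life tightly coupled.
Link: https://arxiv.org/abs/2505.05170
Authors: Elizabeth Ankrah, Stephanie Nyairo, Mercy Muchai, Kagonya Awori, Millicent Ochieng, Mark Kariuki, Jacki O’Neill
Affiliations: Unknown
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
Comments: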
Abstract:Small and medium-sized businesses often struggle with data-driven decision making due to a lack of advanced analytics tools, especially in African countries where they make up a majority of the workforce. Though many tools exist, they are not designed to fit into the ways of working of SMB workers, who are mobile first, have limited time to learn new workflows, and for whom social and business are tightly coupled. To address this, the Dukawalla prototype was created. This intelligent assistant bridges the gap between raw business data and actionable insights by leveraging voice interaction and the power of generative AI. Dukawalla provides an intuitive way for business owners to interact with their data, aiding in informed decision making. This paper examines Dukawalla’s deployment across SMBs in Nairobi, focusing on their experiences using this voice-based assistant to streamline data collection and provide business insights.
[AI-15] Guiding Evolutionary AutoEncoder Training with Activation-Based Pruning Operators GECCO2025
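[Quick read]: This paper addresses how to prune the encoder and decoder of an autoencoder efficiently and simultaneously in order to obtain more efficient autoencoders. The key to the solution is two new mutation operators that use layer activations to guide weight pruning. The study finds that activation-based guidance works better in low-dimensional pruning environments, whereas in the coevolutionary setting random pruning outperforms guided pruning. This suggests that population-driven strategies, by expanding the pruning dimensionality, achieve more statistically uniform randomness and thereby enhance robustness.
Link: https://arxiv.org/abs/2505.05138
Authors: Steven Jorgensen, Erik Hemberg, Jamal Toutouh, Una-May O’Reilly
Affiliations: Unknown
Categories: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI)
Comments: Accepted to The Genetic and Evolutionary Computation Conference (GECCO 2025)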
Abstract:This study explores a novel approach to neural network pruning using evolutionary computation, focusing on simultaneously pruning the encoder and decoder of an autoencoder. We introduce two new mutation operators that use layer activations to guide weight pruning. Our findings reveal that one of these activation-informed operators outperforms random pruning, resulting in more efficient autoencoders with comparable performance to canonically trained models. Prior work has established that autoencoder training is effective and scalable with a spatial coevolutionary algorithm that cooperatively coevolves a population of encoders with a population of decoders, rather than one autoencoder. We evaluate how the same activity-guided mutation operators transfer to this context. We find that random pruning is better than guided pruning, in the coevolutionary setting. This suggests activation-based guidance proves more effective in low-dimensional pruning environments, where constrained sample spaces can lead to deviations from true uniformity in randomization. Conversely, population-driven strategies enhance robustness by expanding the total pruning dimensionality, achieving statistically uniform randomness that better preserves system dynamics. We experiment with pruning according to different schedules and present best combinations of operator and schedule for the canonical and coevolving populations cases.
[AI-16] Is there a half-life for the success rates of AI agents?
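[Quick read]: This paper seeks to explain how AI agents perform on long tasks, and in particular why their probability of success declines as task duration grows. The key is an extremely simple mathematical model: the agent has a constant per-minute failure rate over the course of the task. This implies a success rate that declines exponentially with task length, so each agent can be characterised by its own half-life. The model gives usable estimates of success rates at different task lengths, and suggests that long tasks fail because they involve growing numbers of subtasks, where failing any one fails the whole task.
Link: https://arxiv.org/abs/2505.05115
Authors: Toby Ord
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: 9 pages, 5 figures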
Abstract:Building on the recent empirical work of Kwa et al. (2025), I show that within their suite of research-engineering tasks the performance of AI agents on longer-duration tasks can be explained by an extremely simple mathematical model – a constant rate of failing during each minute a human would take to do the task. This implies an exponentially declining success rate with the length of the task and that each agent could be characterised by its own half-life. This empirical regularity allows us to estimate the success rate for an agent at different task lengths. And the fact that this model is a good fit for the data is suggestive of the underlying causes of failure on longer tasks – that they involve increasingly large sets of subtasks where failing any one fails the task. Whether this model applies more generally on other suites of tasks is unknown and an important subject for further work.
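The constant-failure-rate model described above has a simple closed form: a constant hazard per human-minute gives a success rate of 0.5 raised to the power (task length / half-life). The sketch below works through that arithmetic. The function names and the example numbers are hypothetical and chosen for illustration; they are not values reported in the paper.

```python
import math


def success_rate(task_minutes: float, half_life_minutes: float) -> float:
    """Success probability under a constant per-minute failure rate:
    S(t) = 0.5 ** (t / T_half)."""
    return 0.5 ** (task_minutes / half_life_minutes)


def half_life_from_observation(task_minutes: float, observed_success: float) -> float:
    """Invert S(t) to recover the half-life from one observed success rate.
    Requires 0 < observed_success < 1."""
    if not 0.0 < observed_success < 1.0:
        raise ValueError("observed_success must lie strictly between 0 and 1")
    return task_minutes * math.log(0.5) / math.log(observed_success)


def task_length_for_success(half_life_minutes: float, target_success: float) -> float:
    """Task length (in human-minutes) at which success drops to `target_success`."""
    if not 0.0 < target_success <= 1.0:
        raise ValueError("target_success must lie in (0, 1]")
    return half_life_minutes * math.log(target_success) / math.log(0.5)


# Hypothetical agent with a 60-minute half-life.
print(success_rate(60, 60))              # one half-life
print(success_rate(120, 60))             # two half-lives
print(task_length_for_success(60, 0.8))  # horizon for an 80% success rate
```

Under this model the success rate halves with every additional half-life of task length, so a task four half-lives long succeeds only about 6% of the time.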
[AI-17] Multi-agent Embodied AI: Advances and Future Directions
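[Quick read]: This paper addresses the research gaps and practical challenges of multi-agent embodied AI in dynamic, open environments. Existing work mostly focuses on single-agent systems in static, closed environments and does not capture the need for multiple agents to collaborate and adapt in complex real-world scenarios. The key contribution is a systematic review of current research that analyzes key contributions and identifies challenges and future directions, in order to advance both the study and the practical application of multi-agent embodied AI.
Link: https://arxiv.org/abs/2505.05108
Authors: Zhaohan Feng, Ruiqi Xue, Lei Yuan, Yang Yu, Ning Ding, Meiqin Liu, Bingzhao Gao, Jian Sun, Gang Wang
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
Comments: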
Abstract:Embodied artificial intelligence (Embodied AI) plays a pivotal role in the application of advanced technologies in the intelligent era, where AI systems are integrated with physical bodies that enable them to perceive, reason, and interact with their environments. Through the use of sensors for input and actuators for action, these systems can learn and adapt based on real-world feedback, allowing them to perform tasks effectively in dynamic and unpredictable environments. As techniques such as deep learning (DL), reinforcement learning (RL), and large language models (LLMs) mature, embodied AI has become a leading field in both academia and industry, with applications spanning robotics, healthcare, transportation, and manufacturing. However, most research has focused on single-agent systems that often assume static, closed environments, whereas real-world embodied AI must navigate far more complex scenarios. In such settings, agents must not only interact with their surroundings but also collaborate with other agents, necessitating sophisticated mechanisms for adaptation, real-time learning, and collaborative problem-solving. Despite increasing interest in multi-agent systems, existing research remains narrow in scope, often relying on simplified models that fail to capture the full complexity of dynamic, open environments for multi-agent embodied AI. Moreover, no comprehensive survey has systematically reviewed the advancements in this area. As embodied AI rapidly evolves, it is crucial to deepen our understanding of multi-agent embodied AI to address the challenges presented by real-world applications. To fill this gap and foster further development in the field, this paper reviews the current state of research, analyzes key contributions, and identifies challenges and future directions, providing insights to guide innovation and progress in this field.
[AI-18] A Neuro-Symbolic Framework for Sequence Classification with Relational and Temporal Knowledge
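[Quick read]: This paper addresses knowledge-driven sequence classification, in which different portions of knowledge must be used at different timesteps and temporal relations are available. The key to the solution is a multi-stage neuro-symbolic architecture that integrates background knowledge and handles the temporal dimension, with the aim of improving performance on learning tasks.
Link: https://arxiv.org/abs/2505.05106
Authors: Luca Salvatore Lorello, Marco Lippi, Stefano Melacci
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: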
Abstract:One of the goals of neuro-symbolic artificial intelligence is to exploit background knowledge to improve the performance of learning tasks. However, most of the existing frameworks focus on the simplified scenario where knowledge does not change over time and does not cover the temporal dimension. In this work we consider the much more challenging problem of knowledge-driven sequence classification where different portions of knowledge must be employed at different timesteps, and temporal relations are available. Our experimental evaluation compares multi-stage neuro-symbolic and neural-only architectures, and it is conducted on a newly-introduced benchmarking framework. Results demonstrate the challenging nature of this novel setting, and also highlight under-explored shortcomings of neuro-symbolic methods, representing a precious reference for future research.
[AI-19] Beyond Low-rank Decomposition: A Shortcut Approach for Efficient On-Device Learning
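[Quick read]: This paper addresses the difficulty of deploying on-device learning under tight memory and compute constraints. The key to the solution is a novel shortcut approach that draws on earlier work on low-rank decomposition and targets the activation memory bottleneck in backpropagation, substantially reducing activation memory usage and lowering training FLOPs on traditional benchmarks.
Link: https://arxiv.org/abs/2505.05086
Authors: Le-Trung Nguyen, Ael Quelennec, Van-Tam Nguyen, Enzo Tartaglione
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: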
Abstract:On-device learning has emerged as a promising direction for AI development, particularly because of its potential to reduce latency issues and mitigate privacy risks associated with device-server communication, while improving energy efficiency. Despite these advantages, significant memory and computational constraints still represent major challenges for its deployment. Drawing on previous studies on low-rank decomposition methods that address activation memory bottlenecks in backpropagation, we propose a novel shortcut approach as an alternative. Our analysis and experiments demonstrate that our method can reduce activation memory usage, even up to 120.09× compared to vanilla training, while also reducing overall training FLOPs up to 1.86× when evaluated on traditional benchmarks.
[AI-20] Enhancing Reinforcement Learning for the Floorplanning of Analog ICs with Beam Search
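[Quick read]: This paper addresses the complex trade-offs in analog integrated circuit (IC) layout, including the challenges posed by device physics and circuit variability. The key to the solution is a hybrid method that combines reinforcement learning (RL) with a beam search (BS) strategy. Beam search enhances the agent's inference process, allowing it to generate flexible floorplans and mitigate congestion without policy retraining or fine-tuning, while preserving the RL agent's generalization ability and its efficient handling of circuit features and constraints.
Link: https://arxiv.org/abs/2505.05059
Authors: Sandro Junior Della Rovere, Davide Basso, Luca Bortolussi, Mirjana Videnovic-Misic, Husni Habal
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Published in Proceedings of the 21st International Conference on Synthesis, Modeling, Analysis and Simulation Methods, and Applications to Circuit Design (SMACD 2025). 4 pages, 3 figures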
Abstract:The layout of analog ICs requires making complex trade-offs, while addressing device physics and variability of the circuits. This makes full automation with learning-based solutions hard to achieve. However, reinforcement learning (RL) has recently reached significant results, particularly in solving the floorplanning problem. This paper presents a hybrid method that combines RL with a beam search (BS) strategy. The BS algorithm enhances the agent’s inference process, allowing for the generation of flexible floorplans by accommodating various objective weightings, and addressing congestion without the need for policy retraining or fine-tuning. Moreover, the RL agent’s generalization ability stays intact, along with its efficient handling of circuit features and constraints. Experimental results show approx. 5-85% improvement in area, dead space and half-perimeter wire length compared to a standard RL application, along with higher rewards for the agent. Moreover, performance and efficiency align closely with those of existing state-of-the-art techniques.
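The idea of wrapping beam search around a fixed policy at inference time can be sketched generically. The abstract does not describe the floorplanning state, actions, or reward, so the example below is only an assumption-laden toy: `TOY_POLICY` is a hypothetical lookup table standing in for the trained RL agent's action probabilities, and `beam_search`, `expand`, and `is_done` are illustrative names.

```python
import math
from typing import Callable, Iterable, List, Tuple

State = Tuple[str, ...]

# Hypothetical stand-in for a trained policy: state -> [(action, probability)].
TOY_POLICY = {
    (): [("A", 0.6), ("B", 0.4)],
    ("A",): [("x", 0.55), ("y", 0.45)],
    ("B",): [("x", 0.9), ("y", 0.1)],
}


def expand(state: State) -> Iterable[Tuple[State, float]]:
    """Yield each successor state with the log-probability of the action taken."""
    for action, prob in TOY_POLICY[state]:
        yield state + (action,), math.log(prob)


def is_done(state: State) -> bool:
    """In this toy, an episode ends after two decisions."""
    return len(state) == 2


def beam_search(initial: State,
                expand: Callable[[State], Iterable[Tuple[State, float]]],
                is_done: Callable[[State], bool],
                beam_width: int) -> List[Tuple[float, State]]:
    """Keep the `beam_width` partial solutions with the highest cumulative
    log-probability until every kept solution is complete."""
    beam = [(0.0, initial)]
    while not all(is_done(state) for _, state in beam):
        candidates = []
        for logp, state in beam:
            if is_done(state):
                candidates.append((logp, state))
                continue
            for nxt, step_logp in expand(state):
                candidates.append((logp + step_logp, nxt))
        candidates.sort(key=lambda c: c[0], reverse=True)
        beam = candidates[:beam_width]
    return beam


print(beam_search((), expand, is_done, beam_width=1)[0][1])  # greedy decoding
print(beam_search((), expand, is_done, beam_width=2)[0][1])  # wider beam
```

With a beam width of 1 the search is greedy and commits to the locally best first action; a wider beam keeps the alternative alive and finds the higher-probability sequence. One plausible use in the floorplanning setting is to re-score the completed candidates in the final beam with whatever objective weighting is wanted, which would need no change to the policy itself.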
[AI-21] A Reputation System for Large Language Model-based Multi-agent Systems to Avoid the Tragedy of the Commons
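[Quick read]: This paper addresses the "tragedy of the commons" in generative multi-agent systems (MASs), where individually self-interested behavior leads to collectively disastrous outcomes. The key to the solution is RepuNet, a dynamic dual-level reputation system that models both agent-level reputation dynamics and system-level network evolution. Driven by direct interactions and indirect gossip, agents form reputations for themselves and their peers and use them to decide whether to connect to or disconnect from other agents, which mitigates the tragedy of the commons and promotes and sustains cooperation.
Link: https://arxiv.org/abs/2505.05029
Authors: Siyue Ren, Wanli Fu, Xinkun Zou, Chen Shen, Yi Cai, Chen Chu, Zhen Wang, Shuyue Hu
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
Comments: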
Abstract:The tragedy of the commons, where individual self-interest leads to collectively disastrous outcomes, is a pervasive challenge in human society. Recent studies have demonstrated that similar phenomena can arise in generative multi-agent systems (MASs). To address this challenge, this paper explores the use of reputation systems as a remedy. We propose RepuNet, a dynamic, dual-level reputation framework that models both agent-level reputation dynamics and system-level network evolution. Specifically, driven by direct interactions and indirect gossip, agents form reputations for both themselves and their peers, and decide whether to connect or disconnect other agents for future interactions. Through two distinct scenarios, we show that RepuNet effectively mitigates the ‘tragedy of the commons’, promoting and sustaining cooperation in generative MASs. Moreover, we find that reputation systems can give rise to rich emergent behaviors in generative MASs, such as the formation of cooperative clusters, the social isolation of exploitative agents, and the preference for sharing positive gossip rather than negative ones.
[AI-22] Generating Reliable Synthetic Clinical Trial Data: The Role of Hyperparameter Optimization and Domain Constraints
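[Quick read]: This paper addresses the generation of synthetic clinical trial data for medical research that has high fidelity and utility and adheres to domain-specific constraints. The key to the solution is improving generative model performance through hyperparameter optimization (HPO), combined with compound metric optimization to obtain more balanced and generalizable synthetic datasets. The study also stresses that HPO alone is insufficient to ensure clinically valid data: preprocessing, postprocessing, and explicit domain knowledge are needed to reduce violations of fundamental survival constraints.
Link: https://arxiv.org/abs/2505.05019
Authors: Waldemar Hahn, Jan-Niklas Eckardt, Christoph Röllig, Martin Sedlmayr, Jan Moritz Middeke, Markus Wolfien
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: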
Abstract:The generation of synthetic clinical trial data offers a promising approach to mitigating privacy concerns and data accessibility limitations in medical research. However, ensuring that synthetic datasets maintain high fidelity, utility, and adherence to domain-specific constraints remains a key challenge. While hyperparameter optimization (HPO) has been shown to improve generative model performance, the effectiveness of different optimization strategies for synthetic clinical data remains unclear. This study systematically evaluates four HPO strategies across eight generative models, comparing single-metric optimization against compound metric optimization approaches. Our results demonstrate that HPO consistently improves synthetic data quality, with TVAE, CTGAN, and CTAB-GAN+ achieving improvements of up to 60%, 39%, and 38%, respectively. Compound metric optimization outperformed single-metric strategies, producing more balanced and generalizable synthetic datasets. Interestingly, HPO alone is insufficient to ensure clinically valid synthetic data, as all models exhibited violations of fundamental survival constraints. Preprocessing and postprocessing played a crucial role in reducing these violations, as models lacking robust processing steps produced invalid data in up to 61% of cases. These findings underscore the necessity of integrating explicit domain knowledge alongside HPO to create high quality synthetic datasets. Our study provides actionable recommendations for improving synthetic data generation, with future research needed to refine metric selection and validate these findings on larger datasets to enhance clinical applicability.
[AI-23] An Agent-Based Modeling Approach to Free-Text Keyboard Dynamics for Continuous Authentication
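[Quick read]: This paper addresses how continuous authentication based on free-text keyboard dynamics can strengthen security in multifactor authentication without affecting user experience. The key to the solution is an Agent-Based Model (ABM) that generates synthetic keystroke data, combined with machine learning for user verification. Random Forest (RF) outperforms One-Class Support Vector Machine (OC-SVM) at capturing user-specific behavioral patterns, and the results reveal a significant impact of keyboard hardware on typing behavior.
Link: https://arxiv.org/abs/2505.05015
Authors: Roberto Dillon, Arushi
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
Comments: 16 pages, 5 figures, 12 tables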
Abstract:Continuous authentication systems leveraging free-text keyboard dynamics offer a promising additional layer of security in a multifactor authentication setup that can be used in a transparent way with no impact on user experience. This study investigates the efficacy of behavioral biometrics by employing an Agent-Based Model (ABM) to simulate diverse typing profiles across mechanical and membrane keyboards. Specifically, we generated synthetic keystroke data from five unique agents, capturing features related to dwell time, flight time, and error rates within sliding 5-second windows updated every second. Two machine learning approaches, One-Class Support Vector Machine (OC-SVM) and Random Forest (RF), were evaluated for user verification. Results revealed a stark contrast in performance: while One-Class SVM failed to differentiate individual users within each group, Random Forest achieved robust intra-keyboard user recognition (Accuracy 0.7) but struggled to generalize across keyboards for the same user, highlighting the significant impact of keyboard hardware on typing behavior. These findings suggest that: (1) keyboard-specific user profiles may be necessary for reliable authentication, and (2) ensemble methods like RF outperform One-Class SVM in capturing fine-grained user-specific patterns.
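The feature extraction described in the abstract (dwell time, flight time, and error rate over sliding 5-second windows updated every second) can be sketched as follows. The exact feature definitions are not given in the abstract, so this sketch assumes the common ones: dwell time is release minus press for one key, flight time is the next key's press minus the previous key's release, and an error is a Backspace press. `KeyEvent`, `window_features`, and `sliding_features` are hypothetical names.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Dict, List


@dataclass
class KeyEvent:
    key: str
    press: float    # press timestamp in seconds
    release: float  # release timestamp in seconds


def window_features(events: List[KeyEvent], start: float,
                    length: float = 5.0) -> Dict[str, float]:
    """Features for keys pressed in [start, start + length).
    `events` must already be sorted by press time."""
    in_win = [e for e in events if start <= e.press < start + length]
    dwell = [e.release - e.press for e in in_win]
    flight = [b.press - a.release for a, b in zip(in_win, in_win[1:])]
    errors = sum(1 for e in in_win if e.key == "Backspace")  # assumed error proxy
    return {
        "mean_dwell": mean(dwell) if dwell else 0.0,
        "mean_flight": mean(flight) if flight else 0.0,
        "error_rate": errors / len(in_win) if in_win else 0.0,
    }


def sliding_features(events: List[KeyEvent], length: float = 5.0,
                     step: float = 1.0) -> List[Dict[str, float]]:
    """One feature row per window; windows start every `step` seconds."""
    if not events:
        return []
    ordered = sorted(events, key=lambda e: e.press)
    rows = []
    start = ordered[0].press
    while start <= ordered[-1].press:
        rows.append(window_features(ordered, start, length))
        start += step
    return rows


sample = [
    KeyEvent("h", 0.0, 0.1),
    KeyEvent("Backspace", 0.3, 0.4),
    KeyEvent("i", 6.0, 6.2),
]
rows = sliding_features(sample)
print(len(rows), rows[0])
```

Each row could then be fed to a classifier as one sample. Empty windows are reported as zeros here, which is a simplification; a real pipeline might drop them instead.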
[AI-24] Foam-Agent: Towards Automated Intelligent CFD Workflows
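[Quick read]: This paper addresses the substantial domain expertise and manual configuration that computational fluid dynamics (CFD) simulation requires, with the aim of lowering the barrier to entry. The key to the solution is Foam-Agent, a framework that automates OpenFOAM simulation workflows from natural language input through three mechanisms: (1) a hierarchical multi-index retrieval system with indices for different simulation aspects; (2) a dependency-aware file generation system that keeps configuration files consistent; and (3) an iterative error correction mechanism that diagnoses and resolves simulation failures without human intervention.
Link: https://arxiv.org/abs/2505.04997
Authors: Ling Yue, Nithin Somasekharan, Yadi Cao, Shaowu Pan
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
Comments: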
Abstract:Computational Fluid Dynamics (CFD) is an essential simulation tool in various engineering disciplines, but it often requires substantial domain expertise and manual configuration, creating barriers to entry. We present Foam-Agent, a multi-agent framework that automates complex OpenFOAM-based CFD simulation workflows from natural language inputs. Our innovation includes (1) a hierarchical multi-index retrieval system with specialized indices for different simulation aspects, (2) a dependency-aware file generation system that provides consistency management across configuration files, and (3) an iterative error correction mechanism that diagnoses and resolves simulation failures without human intervention. Through comprehensive evaluation on the dataset of 110 simulation tasks, Foam-Agent achieves an 83.6% success rate with Claude 3.5 Sonnet, significantly outperforming existing frameworks (55.5% for MetaOpenFOAM and 37.3% for OpenFOAM-GPT). Ablation studies demonstrate the critical contribution of each system component, with the specialized error correction mechanism providing a 36.4% performance improvement. Foam-Agent substantially lowers the CFD expertise threshold while maintaining modeling accuracy, demonstrating the potential of specialized multi-agent systems to democratize access to complex scientific simulation tools. The code is public at this https URL
[AI-25] ChainMarks: Securing DNN Watermark with Cryptographic Chain CCS’25
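[Quick read]: This paper addresses two problems in protecting the intellectual property of deep neural network (DNN) models with watermarks: watermarks are easy to remove, and the criteria for deciding whether a watermark is present are vague. Existing schemes are vulnerable to watermark removal and ambiguity attacks. The key to the solution is ChainMarks, a secure DNN watermarking scheme that generates secure and robust watermarks by introducing a cryptographic chain into the trigger inputs and uses a two-phase Monte Carlo method to determine watermark presence, improving both robustness and security.
Link: https://arxiv.org/abs/2505.04977
Authors: Brian Choi, Shu Wang, Isabelle Choi, Kun Sun
Affiliations: Unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments: Accepted in ACM ASIA Conference on Computer and Communications Security (ASIA CCS '25), August 25-29, 2025, Ha Noi, Vietnam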
Abstract:With the widespread deployment of deep neural network (DNN) models, dynamic watermarking techniques are being used to protect the intellectual property of model owners. However, recent studies have shown that existing watermarking schemes are vulnerable to watermark removal and ambiguity attacks. Besides, the vague criteria for determining watermark presence further increase the likelihood of such attacks. In this paper, we propose a secure DNN watermarking scheme named ChainMarks, which generates secure and robust watermarks by introducing a cryptographic chain into the trigger inputs and utilizes a two-phase Monte Carlo method for determining watermark presence. First, ChainMarks generates trigger inputs as a watermark dataset by repeatedly applying a hash function over a secret key, where the target labels associated with trigger inputs are generated from the digital signature of model owner. Then, the watermarked model is produced by training a DNN over both the original and watermark datasets. To verify watermarks, we compare the predicted labels of trigger inputs with the target labels and determine ownership with a more accurate decision threshold that considers the classification probability of specific models. Experimental results show that ChainMarks exhibits higher levels of robustness and security compared to state-of-the-art watermarking schemes. With a better marginal utility, ChainMarks provides a higher probability guarantee of watermark presence in DNN models with the same level of watermark accuracy.
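The trigger-generation step, repeatedly applying a hash function over a secret key, can be sketched with the standard library. The abstract does not name the hash function, say how a digest becomes an input, or say how the owner's signature is mapped to labels, so those parts are assumptions: SHA-256, the counter-based expansion in `digest_to_trigger`, and the "signature byte modulo class count" labelling are hypothetical choices, and the signature is passed in as opaque bytes rather than computed here.

```python
import hashlib
from typing import List, Tuple


def hash_chain(secret_key: bytes, length: int) -> List[bytes]:
    """h_0 = H(key), h_{i+1} = H(h_i): each link depends on the previous one."""
    chain = []
    digest = hashlib.sha256(secret_key).digest()
    for _ in range(length):
        chain.append(digest)
        digest = hashlib.sha256(digest).digest()
    return chain


def digest_to_trigger(digest: bytes, num_pixels: int) -> List[int]:
    """Expand one digest into `num_pixels` values in [0, 255] (assumed mapping)."""
    out = bytearray()
    counter = 0
    while len(out) < num_pixels:
        out += hashlib.sha256(digest + counter.to_bytes(4, "big")).digest()
        counter += 1
    return list(out[:num_pixels])


def make_watermark_set(secret_key: bytes, signature: bytes, n_triggers: int,
                       num_pixels: int, num_classes: int) -> List[Tuple[List[int], int]]:
    """Pair each chained trigger with a label derived from the owner's signature.
    The byte-modulo labelling is a hypothetical stand-in for the paper's scheme."""
    if len(signature) < n_triggers:
        raise ValueError("signature must provide at least one byte per trigger")
    chain = hash_chain(secret_key, n_triggers)
    return [(digest_to_trigger(d, num_pixels), signature[i] % num_classes)
            for i, d in enumerate(chain)]


def watermark_accuracy(predicted: List[int], targets: List[int]) -> float:
    """Fraction of trigger inputs whose predicted label matches the target label."""
    return sum(p == t for p, t in zip(predicted, targets)) / len(targets)


demo = make_watermark_set(b"owner-secret", bytes(range(10, 20)),
                          n_triggers=5, num_pixels=28 * 28, num_classes=10)
print(len(demo), len(demo[0][0]), [label for _, label in demo])
```

Because every trigger is derived deterministically from the secret key, the owner can regenerate the whole set for verification, while someone without the key cannot produce a trigger set that forms a valid chain.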
[AI-26] Position: The AI Conference Peer Review Crisis Demands Author Feedback and Reviewer Rewards ICML2025
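[Quick read]: This position paper addresses the challenges facing peer review at major artificial intelligence (AI) conferences: a surge in submissions (more than 10,000 per venue) and growing concerns about review quality and reviewer responsibility. The key to the solution is to turn the traditional one-way review system into a bi-directional feedback loop in which authors evaluate review quality and reviewers earn formal accreditation, creating an accountability framework that supports a sustainable, high-quality review system. Two mechanisms are proposed: (1) a two-stage bi-directional review system that lets authors evaluate reviews while minimizing retaliatory behavior; (2) a systematic reviewer reward system that incentivizes high-quality reviewing.
Link: https://arxiv.org/abs/2505.04966
Authors: Jaeho Kim,Yunseok Lee,Seulki Lee
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
Comments: ICML2025 Position Track Oral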
Abstract:The peer review process in major artificial intelligence (AI) conferences faces unprecedented challenges with the surge of paper submissions (exceeding 10,000 submissions per venue), accompanied by growing concerns over review quality and reviewer responsibility. This position paper argues for the need to transform the traditional one-way review system into a bi-directional feedback loop where authors evaluate review quality and reviewers earn formal accreditation, creating an accountability framework that promotes a sustainable, high-quality peer review system. The current review system can be viewed as an interaction between three parties: the authors, reviewers, and system (i.e., conference), where we posit that all three parties share responsibility for the current problems. However, issues with authors can only be addressed through policy enforcement and detection tools, and ethical concerns can only be corrected through self-reflection. As such, this paper focuses on reforming reviewer accountability with systematic rewards through two key mechanisms: (1) a two-stage bi-directional review system that allows authors to evaluate reviews while minimizing retaliatory behavior, (2) a systematic reviewer reward system that incentivizes quality reviewing. We ask for the community’s strong interest in these problems and the reforms that are needed to enhance the peer review process.
[AI-27] Graffe: Graph Representation Learning via Diffusion Probabilistic Models
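[Quick read]: This paper addresses the challenge of applying diffusion probabilistic models (DPMs) to graph representation learning, in particular capturing semantic information in graph-structured data. The key to the solution is Graffe, a self-supervised diffusion model in which a graph encoder compresses the source graph into a compact representation that then conditions the denoising process of the diffusion decoder.
Link: https://arxiv.org/abs/2505.04956
Authors: Dingshuo Chen,Shuchen Xue,Liuji Chen,Yingheng Wang,Qiang Liu,Shu Wu,Zhi-Ming Ma,Liang Wang
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 16 pages, 4 figures, under review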
Abstract:Diffusion probabilistic models (DPMs), widely recognized for their potential to generate high-quality samples, tend to go unnoticed in representation learning. While recent progress has highlighted their potential for capturing visual semantics, adapting DPMs to graph representation learning remains in its infancy. In this paper, we introduce Graffe, a self-supervised diffusion model proposed for graph representation learning. It features a graph encoder that distills a source graph into a compact representation, which, in turn, serves as the condition to guide the denoising process of the diffusion decoder. To evaluate the effectiveness of our model, we first explore the theoretical foundations of applying diffusion models to representation learning, proving that the denoising objective implicitly maximizes the conditional mutual information between data and its representation. Specifically, we prove that the negative logarithm of the denoising score matching loss is a tractable lower bound for the conditional mutual information. Empirically, we conduct a series of case studies to validate our theoretical insights. In addition, Graffe delivers competitive results under the linear probing setting on node and graph classification tasks, achieving state-of-the-art performance on 9 of the 11 real-world datasets. These findings indicate that powerful generative models, especially diffusion models, serve as an effective tool for graph representation learning.
[AI-28] Position: Epistemic Artificial Intelligence is Essential for Machine Learning Models to Know When They Do Not Know
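[Quick read]: This position paper addresses a significant gap in current artificial intelligence (AI): handling uncertainty and generalizing beyond the training data. Traditional machine learning approaches, which overemphasize data fitting and domain adaptation, struggle with unfamiliar or adversarial data. The key to the proposed solution is a paradigm shift toward epistemic AI, in which models learn not only from what they know but also from their own ignorance, so that they can recognize and manage uncertainty and become more robust and resilient.
Link: https://arxiv.org/abs/2505.04950
Authors: Shireen Kudukkil Manchingal,Fabio Cuzzolin
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: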
Abstract:Despite the impressive achievements of AI, including advancements in generative models and large language models, there remains a significant gap in the ability of AI to handle uncertainty and generalize beyond the training data. We argue that AI models, especially in autonomous systems, fail to make robust predictions when faced with unfamiliar or adversarial data, as evidenced by incidents with autonomous vehicles. Traditional machine learning approaches struggle to address these issues due to an overemphasis on data fitting and domain adaptation. This position paper posits a paradigm shift towards epistemic artificial intelligence, emphasizing the need for models to learn not only from what they know but also from their ignorance. This approach, which focuses on recognizing and managing uncertainty, offers a potential solution to improve the resilience and robustness of AI systems, ensuring that they can better handle unpredictable real-world environments.
[AI-29] Structural Alignment in Link Prediction
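[Quick read]: This thesis addresses how to model the link prediction task in knowledge graphs (KGs), noting that existing methods rely mainly on embeddings of individual nodes and edges and overlook the structure of whole triples. The key to the solution is a graph-structure-first perspective that models the information content of a KG in terms of whole triples rather than individual nodes and edges. From this perspective the thesis proposes the Structural Alignment Hypothesis, which holds that link prediction can be understood and modelled as a structural task, and it concludes that the perspective is viable and useful for cross-KG transfer learning.
Link: https://arxiv.org/abs/2505.04939
Authors: Jeffrey Seathrún Sardina
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Ph.D. thesis submitted to Trinity College Dublin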
Abstract:While Knowledge Graphs (KGs) have become increasingly popular across various scientific disciplines for their ability to model and interlink huge quantities of data, essentially all real-world KGs are known to be incomplete. As such, with the growth of KG use has been a concurrent development of machine learning tools designed to predict missing information in KGs, which is referred to as the Link Prediction Task. The majority of state-of-the-art link predictors to date have followed an embedding-based paradigm. In this paradigm, it is assumed that the information content of a KG is best represented by the (individual) vector representations of its nodes and edges, and that therefore node and edge embeddings are particularly well-suited to performing link prediction. This thesis proposes an alternative perspective on the field’s approach to link prediction and KG data modelling. Specifically, this work re-analyses KGs and state-of-the-art link predictors from a graph-structure-first perspective that models the information content of a KG in terms of whole triples, rather than individual nodes and edges. Following a literature review and two core sets of experiments, this thesis concludes that a structure-first perspective on KGs and link prediction is both viable and useful for understanding KG learning and for enabling cross-KG transfer learning for the link prediction task. This observation is used to create and propose the Structural Alignment Hypothesis, which postulates that link prediction can be understood and modelled as a structural task. All code and data used for this thesis are open-sourced. This thesis was written bilingually, with the main document in English and an informal extended summary in Irish. An Irish-language translation dictionary of machine learning terms (the Foclóir Tráchtais) created for this work is open-sourced as well.
[AI-30] Fair Uncertainty Quantification for Depression Prediction
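[Quick read]: This paper addresses the balance between predictive reliability and algorithmic fairness in deep-learning-based depression prediction, focusing on the fairness of uncertainty quantification (UQ). The key to the solution is Fair Uncertainty Quantification (FUQ): participants are grouped by sensitive attributes, conformal prediction quantifies uncertainty within each demographic group, and a fairness-aware optimization strategy formulates fairness as a constrained optimization problem under Equal Opportunity Coverage (EOC) constraints. The model thus preserves predictive reliability while adapting to heterogeneous uncertainty levels across groups, thereby achieving optimal fairness.
Link: https://arxiv.org/abs/2505.04931
Authors: Yonghong Li,Xiuzhuang Zhou
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: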
Abstract:Trustworthy depression prediction based on deep learning, incorporating both predictive reliability and algorithmic fairness across diverse demographic groups, is crucial for clinical application. Recently, achieving reliable depression predictions through uncertainty quantification has attracted increasing attention. However, few studies have focused on the fairness of uncertainty quantification (UQ) in depression prediction. In this work, we investigate the algorithmic fairness of UQ, namely Equal Opportunity Coverage (EOC) fairness, and propose Fair Uncertainty Quantification (FUQ) for depression prediction. FUQ pursues reliable and fair depression predictions through group-based analysis. Specifically, we first group all the participants by different sensitive attributes and leverage conformal prediction to quantify uncertainty within each demographic group, which provides a theoretically guaranteed and valid way to quantify uncertainty for depression prediction and facilitates the investigation of fairness across different demographic groups. Furthermore, we propose a fairness-aware optimization strategy that formulates fairness as a constrained optimization problem under EOC constraints. This enables the model to preserve predictive reliability while adapting to the heterogeneous uncertainty levels across demographic groups, thereby achieving optimal fairness. Through extensive evaluations on several visual and audio depression datasets, our approach demonstrates its effectiveness.
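The group-wise conformal step mentioned in the abstract can be sketched with standard split conformal prediction for a regression-style depression score. This is a minimal illustration, not the FUQ method: the absolute-residual score, the function names, and the toy data are assumptions, and the EOC-constrained optimization is not covered.

```python
import math
from collections import defaultdict


def conformal_quantile(scores, alpha):
    """Split-conformal quantile: the ceil((n+1)(1-alpha))-th smallest score."""
    n = len(scores)
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        # Too few calibration points for this alpha: the interval is unbounded.
        return math.inf
    return sorted(scores)[k - 1]


def groupwise_radii(y_true, y_pred, groups, alpha):
    """Calibrate one interval half-width per demographic group."""
    residuals = defaultdict(list)
    for y, p, g in zip(y_true, y_pred, groups):
        residuals[g].append(abs(y - p))
    return {g: conformal_quantile(s, alpha) for g, s in residuals.items()}


def predict_interval(pred, group, radii):
    """Prediction interval for a new participant from a known group."""
    r = radii[group]
    return (pred - r, pred + r)


# Hypothetical calibration data: true scores, model predictions, group labels.
cal_true = [10, 12, 20, 25, 30]
cal_pred = [11, 12, 18, 25, 33]
cal_group = ["a", "a", "b", "b", "b"]
radii = groupwise_radii(cal_true, cal_pred, cal_group, alpha=0.5)
interval = predict_interval(15, "b", radii)
```

Calibrating separately per group lets the interval width differ between groups, which is what makes coverage comparable across them.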
[AI-31] Belief Filtering for Epistemic Control in Linguistic State Space
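[Quick read]: This paper addresses how to regulate the epistemic states of artificial agents through semantic-level control, with a view to improving AI safety and alignment. The key to the solution is a belief filtering mechanism built within the Semantic Manifold framework: filters act as content-aware operations on natural language fragments, enabling structured interventions in the agent's internal semantic space and offering an interpretable, modular approach to cognitive regulation.
Link: https://arxiv.org/abs/2505.04927
Authors: Sebastian Dumbrava
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: 18 pages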
Abstract:We examine belief filtering as a mechanism for the epistemic control of artificial agents, focusing on the regulation of internal cognitive states represented as linguistic expressions. This mechanism is developed within the Semantic Manifold framework, where belief states are dynamic, structured ensembles of natural language fragments. Belief filters act as content-aware operations on these fragments across various cognitive transitions. This paper illustrates how the inherent interpretability and modularity of such a linguistically-grounded cognitive architecture directly enable belief filtering, offering a principled approach to agent regulation. The study highlights the potential for enhancing AI safety and alignment through structured interventions in an agent’s internal semantic space and points to new directions for architecturally embedded cognitive governance.
[AI-32] Physics-Assisted and Topology-Informed Deep Learning for Weather Prediction IJCAI2025
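[Quick read]: This paper addresses the tendency of deep learning weather prediction models to overlook either the physics of weather evolution or the topology of the Earth's surface. The key to the solution is PASSAT, a model that combines physical constraints with the topology of the Earth's surface: it numerically solves the advection equation and the Navier-Stokes equation on the spherical manifold, uses a spherical graph neural network to capture the Earth-atmosphere interaction, and generates the initial velocity fields needed to solve the advection equation from the same network.
Link: https://arxiv.org/abs/2505.04918
Authors: Jiaqi Zheng,Qing Ling,Yerong Feng
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: International Joint Conferences on Artificial Intelligence (IJCAI 2025)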
Abstract:Although deep learning models have demonstrated remarkable potential in weather prediction, most of them overlook either the physics of the underlying weather evolution or the topology of the Earth’s surface. In light of these disadvantages, we develop PASSAT, a novel Physics-ASSisted And Topology-informed deep learning model for weather prediction. PASSAT attributes the weather evolution to two key factors: (i) the advection process that can be characterized by the advection equation and the Navier-Stokes equation; (ii) the Earth-atmosphere interaction that is difficult to both model and calculate. PASSAT also takes the topology of the Earth’s surface into consideration, other than simply treating it as a plane. With these considerations, PASSAT numerically solves the advection equation and the Navier-Stokes equation on the spherical manifold, utilizes a spherical graph neural network to capture the Earth-atmosphere interaction, and generates the initial velocity fields that are critical to solving the advection equation from the same spherical graph neural network. In the 5.625°-resolution ERA5 data set, PASSAT outperforms both the state-of-the-art deep learning-based weather prediction models and the operational numerical weather prediction model IFS T42. Code and checkpoint are available at this https URL.
[AI-33] Precise gradient descent training dynamics for finite-width multi-layer neural networks
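[Quick read]: This paper aims to characterize the distribution of gradient descent iterates for multi-layer neural networks in the finite-width proportional regime, giving a precise distributional characterization under the single-index regression model. The key to the solution is a non-asymptotic state evolution theory that captures Gaussian fluctuations in first-layer weights and concentration in deeper-layer weights, and remains valid for non-Gaussian features. Unlike existing neural tangent kernel (NTK), mean-field (MF), and tensor program (TP) theories, it operates at finite width, allows weights to evolve from individual initializations beyond the lazy training regime, and characterizes both training and generalization errors for general multi-layer networks rather than mainly two-layer settings.
Link: https://arxiv.org/abs/2505.04898
Authors: Qiyang Han,Masaaki Imaizumi
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Optimization and Control (math.OC); Statistics Theory (math.ST); Machine Learning (stat.ML)
Comments: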
Abstract:In this paper, we provide the first precise distributional characterization of gradient descent iterates for general multi-layer neural networks under the canonical single-index regression model, in the 'finite-width proportional regime' where the sample size and feature dimension grow proportionally while the network width and depth remain bounded. Our non-asymptotic state evolution theory captures Gaussian fluctuations in first-layer weights and concentration in deeper-layer weights, and remains valid for non-Gaussian features. Our theory differs from existing neural tangent kernel (NTK), mean-field (MF) theories and tensor program (TP) in several key aspects. First, our theory operates in the finite-width regime whereas these existing theories are fundamentally infinite-width. Second, our theory allows weights to evolve from individual initializations beyond the lazy training regime, whereas NTK and MF are either frozen at or only weakly sensitive to initialization, and TP relies on special initialization schemes. Third, our theory characterizes both training and generalization errors for general multi-layer neural networks beyond the uniform convergence regime, whereas existing theories study generalization almost exclusively in two-layer settings. As a statistical application, we show that vanilla gradient descent can be augmented to yield consistent estimates of the generalization error at each iteration, which can be used to guide early stopping and hyperparameter tuning. As a further theoretical implication, we show that despite model misspecification, the model learned by gradient descent retains the structure of a single-index function with an effective signal determined by a linear combination of the true signal and the initialization.
[AI-34] Clustering with Communication: A Variational Framework for Single Cell Representation Learning
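[Quick read]: This paper addresses a limitation of cellular heterogeneity analysis in single-cell RNA sequencing (scRNA-seq): treating each cell's transcriptome independently does not capture functional interactions between cells. The key to the solution is CCCVAE, a variational autoencoder framework that incorporates cell-cell communication (CCC) signals into single-cell representation learning. A communication-aware kernel derived from ligand-receptor interactions, together with a sparse Gaussian process, encodes biologically informed priors into the latent space and improves cell clustering.
Link: https://arxiv.org/abs/2505.04891
Authors: Cong Qi,Yeqing Chen,Jie Zhang,Wei Zhi
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
Comments: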
Abstract:Single-cell RNA sequencing (scRNA-seq) has revealed complex cellular heterogeneity, but recent studies emphasize that understanding biological function also requires modeling cell-cell communication (CCC), the signaling interactions mediated by ligand-receptor pairs that coordinate cellular behavior. Tools like CellChat have demonstrated that CCC plays a critical role in processes such as cell differentiation, tissue regeneration, and immune response, and that transcriptomic data inherently encodes rich information about intercellular signaling. We propose CCCVAE, a novel variational autoencoder framework that incorporates CCC signals into single-cell representation learning. By leveraging a communication-aware kernel derived from ligand-receptor interactions and a sparse Gaussian process, CCCVAE encodes biologically informed priors into the latent space. Unlike conventional VAEs that treat each cell independently, CCCVAE encourages latent embeddings to reflect both transcriptional similarity and intercellular signaling context. Empirical results across four scRNA-seq datasets show that CCCVAE improves clustering performance, achieving higher evaluation scores than standard VAE baselines. This work demonstrates the value of embedding biological priors into deep generative models for unsupervised single-cell analysis.
[AI-35] QBR: A Question-Bank-Based Approach to Fine-Grained Legal Knowledge Retrieval for the General Public
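[Quick read]: This paper addresses the difficulty the general public faces in retrieving legal knowledge: the material is highly technical and laypersons lack a basic understanding of the subject, so traditional information retrieval techniques, which assume users can formulate succinct and precise queries, serve them poorly. The key to the solution is QBR, a question-bank-based retrieval method that uses a Questions Bank (QB) as a bridge: training samples derived from the QB enhance the embeddings of knowledge units within documents, enabling fine-grained knowledge retrieval.
Link: https://arxiv.org/abs/2505.04883
Authors: Mingruo Yuan,Ben Kao,Tien-Hsuan Wu
Affiliations: Unknown
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
Comments: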
Abstract:Retrieval of legal knowledge by the general public is a challenging problem due to the technicality of the professional knowledge and the lack of fundamental understanding by laypersons on the subject. Traditional information retrieval techniques assume that users are capable of formulating succinct and precise queries for effective document retrieval. In practice, however, the wide gap between the highly technical contents and untrained users makes legal knowledge retrieval very difficult. We propose a methodology, called QBR, which employs a Questions Bank (QB) as an effective medium for bridging the knowledge gap. We show how the QB is used to derive training samples to enhance the embedding of knowledge units within documents, which leads to effective fine-grained knowledge retrieval. We discuss and evaluate through experiments various advantages of QBR over traditional methods. These include more accurate, efficient, and explainable document retrieval, better comprehension of retrieval results, and highly effective fine-grained knowledge retrieval. We also present some case studies and show that QBR achieves social impact by assisting citizens to resolve everyday legal concerns.
[AI-36] Federated Learning for Cyber Physical Systems: A Comprehensive Survey
[Quick read]: This survey addresses the difficulty of integrating machine learning (ML) into cyber-physical systems (CPS), which stems from challenges in device heterogeneity, data privacy, real-time decision making, safety, and reliability. The approach it examines is federated learning (FL), a distributed form of ML that trains models on data from decentralized sources. The survey reviews recent FL-CPS developments, including application areas, system topologies, and algorithms.
Link: https://arxiv.org/abs/2505.04873
Authors: Minh K. Quan,Pubudu N. Pathirana,Mayuri Wijayasundara,Sujeeva Setunge,Dinh C. Nguyen,Christopher G. Brinton,David J. Love,H. Vincent Poor
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Distributed, Parallel, and Cluster Computing (cs.DC)
Comments: This work has been accepted by IEEE Communications Surveys & Tutorials
Abstract:The integration of machine learning (ML) in cyber physical systems (CPS) is a complex task due to the challenges that arise in terms of real-time decision making, safety, reliability, device heterogeneity, and data privacy. There are also open research questions that must be addressed in order to fully realize the potential of ML in CPS. Federated learning (FL), a distributed approach to ML, has become increasingly popular in recent years. It allows models to be trained using data from decentralized sources. This approach has been gaining popularity in the CPS field, as it integrates computer, communication, and physical processes. Therefore, the purpose of this work is to provide a comprehensive analysis of the most recent developments of FL-CPS, including the numerous application areas, system topologies, and algorithms developed in recent years. The paper starts by discussing recent advances in both FL and CPS, followed by their integration. Then, the paper compares the application of FL in CPS with its applications in the internet of things (IoT) in further depth to show their connections and distinctions. Furthermore, the article scrutinizes how FL is utilized in critical CPS applications, e.g., intelligent transportation systems, cybersecurity services, smart cities, and smart healthcare solutions. The study also includes critical insights and lessons learned from various FL-CPS implementations. The paper’s concluding section delves into significant concerns and suggests avenues for further research in this fast-paced and dynamic era.
[AI-37] PR2: Peephole Raw Pointer Rewriting with LLMs for Translating C to Safer Rust
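[Quick read]: This paper addresses a weakness of Rust code produced from C by tools such as C2RUST: the generated programs rely heavily on unsafe constructs, particularly raw pointers, which undermines Rust's memory-safety guarantees. The key to the solution is PR2, a peephole raw pointer rewriting technique that uses decision-tree-based prompting to lift raw pointers in individual functions to appropriate Rust data structures and uses code change analysis to guide the repair of errors introduced during rewriting, improving the memory safety of the translated Rust code.
Link: https://arxiv.org/abs/2505.04852
Authors: Yifei Gao,Chengpeng Wang,Pengxiang Huang,Xuwei Liu,Mingwei Zheng,Xiangyu Zhang
Affiliations: Unknown
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Programming Languages (cs.PL)
Comments: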
Abstract:There has been a growing interest in translating C code to Rust due to Rust’s robust memory and thread safety guarantees. Tools such as C2RUST enable syntax-guided transpilation from C to semantically equivalent Rust code. However, the resulting Rust programs often rely heavily on unsafe constructs, particularly raw pointers, which undermines Rust’s safety guarantees. This paper aims to improve the memory safety of Rust programs generated by C2RUST by eliminating raw pointers. Specifically, we propose a peephole raw pointer rewriting technique that lifts raw pointers in individual functions to appropriate Rust data structures. Technically, PR2 employs decision-tree-based prompting to guide the pointer lifting process. Additionally, it leverages code change analysis to guide the repair of errors introduced during rewriting, effectively addressing errors encountered during compilation and test case execution. We implement PR2 as a prototype and evaluate it using gpt-4o-mini on 28 real-world C projects. The results show that PR2 successfully eliminates 13.22% of local raw pointers across these projects, significantly enhancing the safety of the translated Rust code. On average, PR2 completes the transformation of a project in 5.44 hours, at an average cost of $1.46.
[AI-38] Large Language Models are Autonomous Cyber Defenders
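[Quick read]: This paper addresses the limited study of multi-agent collaboration and the lack of explainability in autonomous cyber defense (ACD). Existing ACD agents trained with reinforcement learning (RL) can carry out tasks, but training is costly and their reasoning is not always explainable or transferable. The key to the solution is to bring large language models (LLMs) into the CybORG CAGE 4 environment through a new integration and to propose a communication protocol through which LLM and RL agents interact, making it possible to evaluate LLMs in multi-agent ACD scenarios and to identify directions for building future ACD teams.
Link: https://arxiv.org/abs/2505.04843
Authors: Sebastián R. Castro,Roberto Campbell,Nancy Lau,Octavio Villalobos,Jiaqi Duan,Alvaro A. Cardenas
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
Comments: Presented at IEEE CAI Workshop on Adaptive Cyber Defense 2025. Proceedings to appear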
Abstract:Fast and effective incident response is essential to prevent adversarial cyberattacks. Autonomous Cyber Defense (ACD) aims to automate incident response through Artificial Intelligence (AI) agents that plan and execute actions. Most ACD approaches focus on single-agent scenarios and leverage Reinforcement Learning (RL). However, ACD RL-trained agents depend on costly training, and their reasoning is not always explainable or transferable. Large Language Models (LLMs) can address these concerns by providing explainable actions in general security contexts. Researchers have explored LLM agents for ACD but have not evaluated them on multi-agent scenarios or interacting with other ACD agents. In this paper, we show the first study on how LLMs perform in multi-agent ACD environments by proposing a new integration to the CybORG CAGE 4 environment. We examine how ACD teams of LLM and RL agents can interact by proposing a novel communication protocol. Our results highlight the strengths and weaknesses of LLMs and RL and help us identify promising research directions to create, train, and deploy future teams of ACD agents.
[AI-39] Putting the Value Back in RL: Better Test-Time Scaling by Unifying LLM Reasoners With Verifiers
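[Quick read]: This paper addresses a limitation of prevalent reinforcement learning (RL) methods for fine-tuning large language model (LLM) reasoners: by abandoning the learned value function, they hinder test-time compute scaling that relies on the value function for verification. The key to the solution is RL^V, which augments any "value-free" RL method by jointly training the LLM as both a reasoner and a generative verifier on RL-generated data, adding verification capabilities without significant overhead.
Link: https://arxiv.org/abs/2505.04842
Authors: Kusha Sareen,Morgane M Moss,Alessandro Sordoni,Rishabh Agarwal,Arian Hosseini
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: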
Abstract:Prevalent reinforcement learning (RL) methods for fine-tuning LLM reasoners, such as GRPO or Leave-one-out PPO, abandon the learned value function in favor of empirically estimated returns. This hinders test-time compute scaling that relies on using the value-function for verification. In this work, we propose RL^V that augments any "value-free" RL method by jointly training the LLM as both a reasoner and a generative verifier using RL-generated data, adding verification capabilities without significant overhead. Empirically, RL^V boosts MATH accuracy by over 20% with parallel sampling and enables 8-32× efficient test-time compute scaling compared to the base RL method. RL^V also exhibits strong generalization capabilities for both easy-to-hard and out-of-domain tasks. Furthermore, RL^V achieves 1.2-1.6× higher performance when jointly scaling parallel and sequential test-time compute with a long reasoning R1 model.
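To make "empirically estimated returns" concrete, the sketch below shows the two baselines commonly used by value-free methods of this kind: a leave-one-out baseline and a group-normalized (GRPO-style) advantage, both computed from the rewards of several sampled responses to the same prompt. It is an illustration only, not code from the paper; the function names and toy rewards are assumptions, and using the population standard deviation is one possible choice.

```python
from statistics import mean, pstdev


def leave_one_out_advantages(rewards):
    """Advantage of each sample = its reward minus the mean reward of the other samples."""
    n = len(rewards)
    if n < 2:
        raise ValueError("need at least two samples per prompt")
    total = sum(rewards)
    return [r - (total - r) / (n - 1) for r in rewards]


def group_normalized_advantages(rewards, eps=1e-8):
    """GRPO-style advantage: reward standardized within the group of samples."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps keeps the division finite when all rewards in the group are equal.
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical 0/1 correctness rewards for four sampled answers to one prompt.
rewards = [1.0, 0.0, 0.0, 1.0]
loo = leave_one_out_advantages(rewards)
grpo = group_normalized_advantages(rewards)
```

Neither estimator needs a learned value network, which is why such methods have no value function left to reuse as a verifier at test time.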
[AI-40] Is there Value in Reinforcement Learning?
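[Quick read]: This paper addresses the debate over whether action-values are explicitly represented in reinforcement learning (RL) models of behavior. Critics have suggested favoring policy-gradient (PG) models over value-based (VB) ones because PG methods appear to avoid an explicit value representation. The paper argues that this is unsatisfying: PG methods do not need an explicit representation of value for acting (stimulus-response mapping), but they do need it for learning, so switching to PG models does not by itself eliminate value from models of behavior. The requirement for a value representation stems from the optimization objective assumed by the standard RL framework, not from the particular algorithm. The key proposal is to shift the debate from the choice of algorithm to a critical evaluation of the underlying modeling assumptions, since the notion of value must be reconsidered when standard assumptions (e.g., risk neutrality, full observability) are relaxed, as is likely in natural settings.
Link: https://arxiv.org/abs/2505.04822
Authors: Lior Fox,Yonatan Loewenstein
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Accepted to The 6th Multidisciplinary Conference on Reinforcement Learning and Decision Making (RLDM 2025)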
Abstract:Action-values play a central role in popular Reinforcement Learning (RL) models of behavior. Yet, the idea that action-values are explicitly represented has been extensively debated. Critics had therefore repeatedly suggested that policy-gradient (PG) models should be favored over value-based (VB) ones, as a potential solution for this dilemma. Here we argue that this solution is unsatisfying. This is because PG methods are not, in fact, “Value-free”: while they do not rely on an explicit representation of Value for acting (stimulus-response mapping), they do require it for learning. Hence, switching to PG models is, per se, insufficient for eliminating Value from models of behavior. More broadly, the requirement for a representation of Value stems from the underlying assumptions regarding the optimization objective posed by the standard RL framework, not from the particular algorithm chosen to solve it. Previous studies mostly took these standard RL assumptions for granted, as part of their conceptualization or problem modeling, while debating the different methods used to optimize it (i.e., PG or VB). We propose that, instead, the focus of the debate should shift to critically evaluating the underlying modeling assumptions. Such evaluation is particularly important from an experimental perspective. Indeed, the very notion of Value must be reconsidered when standard assumptions (e.g., risk neutrality, full-observability, Markovian environment, exponential discounting) are relaxed, as is likely in natural settings. Finally, we use the Value debate as a case study to argue in favor of a more nuanced, algorithmic rather than statistical, view of what constitutes “a model” in cognitive sciences. Our analysis suggests that besides “parametric” statistical complexity, additional aspects such as computational complexity must also be taken into account when evaluating model complexity.
[AI-41] Piecewise Constant Spectral Graph Neural Network
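[Quick read]: This paper addresses a limitation of existing spectral graph neural networks (spectral GNNs): their low-degree polynomial filters may not fully identify a graph's spectral characteristics. The key to the solution is the Piecewise Constant Spectral Graph Neural Network (PieCoN), which combines constant spectral filters with polynomial filters for more flexible use of the graph structure and adaptively partitions the spectrum into intervals, increasing the range of spectral properties that can be learned effectively.
Link: https://arxiv.org/abs/2505.04808
Authors: Vahan Martirosyan,Jhony H. Giraldo,Fragkiskos D. Malliaros
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Accepted to TMLR 2025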
Abstract:Graph Neural Networks (GNNs) have achieved significant success across various domains by leveraging graph structures in data. Existing spectral GNNs, which use low-degree polynomial filters to capture graph spectral properties, may not fully identify the graph’s spectral characteristics because of the polynomial’s small degree. However, increasing the polynomial degree is computationally expensive and beyond certain thresholds leads to performance plateaus or degradation. In this paper, we introduce the Piecewise Constant Spectral Graph Neural Network (PieCoN) to address these challenges. PieCoN combines constant spectral filters with polynomial filters to provide a more flexible way to leverage the graph structure. By adaptively partitioning the spectrum into intervals, our approach increases the range of spectral properties that can be effectively learned. Experiments on nine benchmark datasets, including both homophilic and heterophilic graphs, demonstrate that PieCoN is particularly effective on heterophilic datasets, highlighting its potential for a wide range of applications.
[AI-42] ORBIT-2: Scaling Exascale Vision Foundation Models for Weather and Climate Downscaling
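[Quick read]: This paper addresses the way sparse observations and coarse-resolution climate models limit effective regional decision-making, which calls for robust downscaling. The key to the solution is ORBIT-2, a scalable foundation model with two core innovations: (1) Residual Slim ViT (Reslim), a lightweight architecture with residual learning and Bayesian regularization for efficient, robust prediction; and (2) TILES, a tile-wise sequence scaling algorithm that reduces self-attention complexity from quadratic to linear, enabling long-sequence processing and massive parallelism.
Link: https://arxiv.org/abs/2505.04802
Authors: Xiao Wang,Jong-Youl Choi,Takuya Kurihaya,Isaac Lyngaas,Hong-Jun Yoon,Ming Fan,Nasik Muhammad Nafi,Aristeidis Tsaris,Ashwin M. Aji,Maliha Hossain,Mohamed Wahib,Dali Wang,Peter Thornton,Prasanna Balaprakash,Moetasim Ashfaq,Dan Lu
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Earth and Planetary Astrophysics (astro-ph.EP); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC); Atmospheric and Oceanic Physics (physics.ao-ph)
Comments: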
Abstract:Sparse observations and coarse-resolution climate models limit effective regional decision-making, underscoring the need for robust downscaling. However, existing AI methods struggle with generalization across variables and geographies and are constrained by the quadratic complexity of Vision Transformer (ViT) self-attention. We introduce ORBIT-2, a scalable foundation model for global, hyper-resolution climate downscaling. ORBIT-2 incorporates two key innovations: (1) Residual Slim ViT (Reslim), a lightweight architecture with residual learning and Bayesian regularization for efficient, robust prediction; and (2) TILES, a tile-wise sequence scaling algorithm that reduces self-attention complexity from quadratic to linear, enabling long-sequence processing and massive parallelism. ORBIT-2 scales to 10 billion parameters across 32,768 GPUs, achieving up to 1.8 ExaFLOPS sustained throughput and 92-98% strong scaling efficiency. It supports downscaling to 0.9 km global resolution and processes sequences up to 4.2 billion tokens. On 7 km resolution benchmarks, ORBIT-2 achieves high accuracy with R^2 scores in the range of 0.98 to 0.99 against observation data.
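The quadratic-to-linear claim can be checked with simple counting. The sketch below assumes, purely for illustration, that each token attends only to tokens in its own fixed-size tile; the abstract does not describe how TILES handles tile boundaries or stitching, so this is not the TILES algorithm itself, and both function names are hypothetical.

```python
def full_attention_pairs(n_tokens: int) -> int:
    """Query-key pairs scored by full self-attention: n^2."""
    return n_tokens * n_tokens


def tiled_attention_pairs(n_tokens: int, tile_size: int) -> int:
    """Query-key pairs when attention is restricted to fixed-size tiles.

    Each full tile costs tile_size^2; a trailing partial tile costs remainder^2.
    For a fixed tile size this grows linearly with n_tokens.
    """
    full_tiles, remainder = divmod(n_tokens, tile_size)
    return full_tiles * tile_size * tile_size + remainder * remainder


# Doubling the sequence quadruples the full cost but only doubles the tiled cost.
for n in (1024, 2048, 4096):
    print(n, full_attention_pairs(n), tiled_attention_pairs(n, tile_size=256))
```

Since tiles are independent under this assumption, they can also be processed in parallel, which is consistent with the parallelism the abstract mentions.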
[AI-43] A Proposal for Evaluating the Operational Risk for ChatBots based on Large Language Models
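[Quick read]: This paper addresses the new operational risks introduced by conversational agents driven by Generative AI and Large Language Models (LLMs), risks that extend beyond traditional cybersecurity. The key to its solution is a new, quantifiable risk-assessment metric that simultaneously evaluates potential threats to three key stakeholders (the service provider, end users, and third parties) and combines technical complexity (from non-induced failures to advanced prompt-injection attacks) with contextual factors (such as target industry, user age range, and vulnerability severity). The approach extends the open-source framework Garak to capture multiple threat vectors and is validated in a retrieval-augmented generation (RAG) chatbot scenario, showing how aggregated risk scores support short-term mitigation and longer-term improvements in model design and deployment.
Link: https://arxiv.org/abs/2505.04784
Authors: Pedro Pinacho-Davidson,Fernando Gutierrez,Pablo Zapata,Rodolfo Vergara,Pablo Aqueveque
Affiliation: Unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
Comments: 21 pages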
Abstract:The emergence of Generative AI (Gen AI) and Large Language Models (LLMs) has enabled more advanced chatbots capable of human-like interactions. However, these conversational agents introduce a broader set of operational risks that extend beyond traditional cybersecurity considerations. In this work, we propose a novel, instrumented risk-assessment metric that simultaneously evaluates potential threats to three key stakeholders: the service-providing organization, end users, and third parties. Our approach incorporates the technical complexity required to induce erroneous behaviors in the chatbot, ranging from non-induced failures to advanced prompt-injection attacks, as well as contextual factors such as the target industry, user age range, and vulnerability severity. To validate our metric, we leverage Garak, an open-source framework for LLM vulnerability testing. We further enhance Garak to capture a variety of threat vectors (e.g., misinformation, code hallucinations, social engineering, and malicious code generation). Our methodology is demonstrated in a scenario involving chatbots that employ retrieval-augmented generation (RAG), showing how the aggregated risk scores guide both short-term mitigation and longer-term improvements in model design and deployment. The results underscore the importance of multi-dimensional risk assessments in operationalizing secure, reliable AI-driven conversational systems.
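[AI-44] Exploring Zero-Shot App Review Classification with ChatGPT: Challenges and Potential
[Quick read]: This paper addresses app-store review classification, specifically sorting reviews into four categories (functional requirement, non-functional requirement, both, or neither) to support app development decisions. Traditional approaches are limited by the need for large domain-specific datasets; the key to the proposed solution is to exploit the zero-shot learning ability of generative AI, using ChatGPT to classify reviews effectively without additional training.
Link: https://arxiv.org/abs/2505.04759
Authors: Mohit Chaudhary,Chirag Jain,Preethu Rose Anish
Affiliation: Unknown
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments: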
Abstract:App reviews are a critical source of user feedback, offering valuable insights into an app’s performance, features, usability, and overall user experience. Effectively analyzing these reviews is essential for guiding app development, prioritizing feature updates, and enhancing user satisfaction. Classifying reviews into functional and non-functional requirements plays a pivotal role in distinguishing feedback related to specific app features (functional requirements) from feedback concerning broader quality attributes, such as performance, usability, and reliability (non-functional requirements). Both categories are integral to informed development decisions. Traditional approaches to classifying app reviews are hindered by the need for large, domain-specific datasets, which are often costly and time-consuming to curate. This study explores the potential of zero-shot learning with ChatGPT for classifying app reviews into four categories: functional requirement, non-functional requirement, both, or neither. We evaluate ChatGPT’s performance on a benchmark dataset of 1,880 manually annotated reviews from ten diverse apps spanning multiple domains. Our findings demonstrate that ChatGPT achieves a robust F1 score of 0.842 in review classification, despite certain challenges and limitations. Additionally, we examine how factors such as review readability and length impact classification accuracy and conduct a manual analysis to identify review categories more prone to misclassification.
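[AI-45] The Promise and Limits of LLMs in Constructing Proofs and Hints for Logic Problems in Intelligent Tutoring Systems
[Quick read]: This paper addresses the limitations of traditional Intelligent Tutoring Systems (ITS) in generating personalized student feedback, in particular the difficulty of giving dynamic, precise feedback with template-based explanations. The key to its solution is to use Large Language Models (LLMs) to generate dynamic logic-reasoning hints that strengthen the tutoring system. The study evaluates the stepwise accuracy of several prompting techniques on multi-step symbolic logic proof construction and finds that DeepSeek-V3 performs best on this task, showing the potential of LLMs for logic tutoring while noting that further refinement is needed to ensure the accuracy and pedagogical appropriateness of generated content.
Link: https://arxiv.org/abs/2505.04736
Authors: Sutapa Dey Tithi,Arun Kumar Ramesh,Clara DiMarco,Xiaoyi Tian,Nazia Alam,Kimia Fazeli,Tiffany Barnes
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: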
Abstract:Intelligent tutoring systems have demonstrated effectiveness in teaching formal propositional logic proofs, but their reliance on template-based explanations limits their ability to provide personalized student feedback. While large language models (LLMs) offer promising capabilities for dynamic feedback generation, they risk producing hallucinations or pedagogically unsound explanations. We evaluated the stepwise accuracy of LLMs in constructing multi-step symbolic logic proofs, comparing six prompting techniques across four state-of-the-art LLMs on 358 propositional logic problems. Results show that DeepSeek-V3 achieved superior performance with 84.4% accuracy on stepwise proof construction and excelled particularly in simpler rules. We further used the best-performing LLM to generate explanatory hints for 1,050 unique student problem-solving states from a logic ITS and evaluated them on 4 criteria with both an LLM grader and human expert ratings on a 20% sample. Our analysis finds that LLM-generated hints were 75% accurate and rated highly by human evaluators on consistency and clarity, but did not perform as well explaining why the hint was provided or its larger context. Our results demonstrate that LLMs may be used to augment tutoring systems with logic tutoring hints, but require additional modifications to ensure accuracy and pedagogical appropriateness.
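[AI-46] QBD-RankedDataGen: Generating Custom Ranked Datasets for Improving Query-By-Document Search Using LLM-Reranking with Reduced Human Effort
[Quick read]: This paper addresses the Query-By-Document (QBD) problem, in which a document serves as the query for retrieving matching documents, usually within a specific domain or query context. Existing methods such as keyword search and document embeddings can be tuned with domain-specific datasets, but building those datasets is costly and time-consuming. The key to the proposed solution is a process for generating custom QBD search datasets (QBD-RankedDatagen), together with a comparison of several methods that use Large Language Models (LLMs); these methods can incorporate domain-expert input to produce document scores, rankings, and explanations, substantially reducing the human cost of dataset construction while retaining enough expert knowledge to tune retrieval models.
Link: https://arxiv.org/abs/2505.04732
Authors: Sriram Gopalakrishnan,Sunandita Patra
Affiliation: Unknown
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
Comments: 13 pages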
Abstract:The Query-By-Document (QBD) problem is an information retrieval problem where the query is a document, and the retrieved candidates are documents that match the query document, often in a domain or query specific manner. This can be crucial for tasks such as patent matching, legal or compliance case retrieval, and academic literature review. Existing retrieval methods, including keyword search and document embeddings, can be optimized with domain-specific datasets to improve QBD search performance. However, creating these domain-specific datasets is often costly and time-consuming. Our work introduces a process to generate custom QBD-search datasets and compares a set of methods to use in this problem, which we refer to as QBD-RankedDatagen. We provide a comparative analysis of our proposed methods in terms of cost, speed, and the human interface with the domain experts. The methods we compare leverage Large Language Models (LLMs) which can incorporate domain expert input to produce document scores and rankings, as well as explanations for human review. The process and methods for it that we present can significantly reduce human effort in dataset creation for custom domains while still obtaining sufficient expert knowledge for tuning retrieval models. We evaluate our methods on QBD datasets from the Text Retrieval Conference (TREC) and finetune the parameters of the BM25 model – which is used in many industrial-strength search engines like OpenSearch – using the generated data.
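The abstract mentions fine-tuning the parameters of BM25. As a reminder of what those parameters do, here is a minimal pure-Python sketch of Okapi BM25 scoring, where `k1` controls term-frequency saturation and `b` controls document-length normalization. It is not the authors' pipeline: the toy corpus, the default values, and the function name `bm25_scores` are assumptions for illustration, and the sketch expects a non-empty, pre-tokenized corpus.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score each tokenized document in `docs` against the tokenized `query`."""
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: number of documents that contain each term.
    doc_freq = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        # Length normalization: b=0 ignores length, b=1 normalizes fully.
        length_norm = 1.0 - b + b * len(d) / avg_len
        score = 0.0
        for term in query:
            if term not in tf:
                continue
            # The +1 inside the log keeps the IDF non-negative.
            idf = math.log((n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5) + 1.0)
            score += idf * tf[term] * (k1 + 1.0) / (tf[term] + k1 * length_norm)
        scores.append(score)
    return scores


if __name__ == "__main__":
    corpus = [
        ["patent", "claim", "battery"],
        ["legal", "case", "contract"],
        ["battery", "cell", "patent", "patent"],
    ]
    # In a query-by-document setting the query is itself a tokenized document.
    print(bm25_scores(["patent", "battery"], corpus))
```

Tuning in this setting would mean searching over `k1` and `b` so that the BM25 ranking agrees as closely as possible with the rankings in the generated dataset.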
[AI-47] Geometric Fault-Tolerant Neural Network Tracking Control of Unknown Systems on Matrix Lie Groups
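[Quick read]: This paper addresses tracking control of systems defined on matrix Lie groups under unknown dynamics, actuator faults, and bounded disturbances. The key to its solution is to exploit the left-invariance of the tangent bundle of matrix Lie groups to derive a set of neural network weight learning rules that are intrinsically compatible with the Lie group structure. The method needs no explicit parameterization, which avoids parameterization singularities and enables a global search for optimal weights. Lyapunov's direct method is used to prove the ultimate boundedness of all error signals, including the neural network weights, the coordinate-free configuration error function, and the tracking velocity error.
Link: https://arxiv.org/abs/2505.04725
Authors: Robin Chhabra,Farzaneh Abdollahi
Affiliation: Unknown
Categories: Systems and Control (eess.SY); Artificial Intelligence (cs.AI); Robotics (cs.RO); Dynamical Systems (math.DS)
Comments: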
Abstract:We present a geometric neural network-based tracking controller for systems evolving on matrix Lie groups under unknown dynamics, actuator faults, and bounded disturbances. Leveraging the left-invariance of the tangent bundle of matrix Lie groups, viewed as an embedded submanifold of the vector space $\mathbb{R}^{N \times N}$, we propose a set of learning rules for neural network weights that are intrinsically compatible with the Lie group structure and do not require explicit parameterization. Exploiting the geometric properties of Lie groups, this approach circumvents parameterization singularities and enables a global search for optimal weights. The ultimate boundedness of all error signals, including the neural network weights, the coordinate-free configuration error function, and the tracking velocity error, is established using Lyapunov’s direct method. To validate the effectiveness of the proposed method, we provide illustrative simulation results for decentralized formation control of multi-agent systems on the Special Euclidean group.
[AI-48] Proceedings The 13th International Workshop on Theorem proving components for Educational software
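[Quick read]: This volume addresses the transition from the intuitive way of learning mathematics in secondary school to the more formal approach used in STEM education. The key to its approach is to support this transition with software that exploits the power of theorem-proving technologies. By bringing together research on automated theorem proving and its applications in educational settings, the ThEdu series aims to promote theorem-proving-based software and to strengthen mutual understanding among stakeholders in computer science, mathematics, and education.
Link: https://arxiv.org/abs/2505.04677
Authors: Julien Narboux(University Paris Cité, France),Walther Neuper(Johannes Kepler University Linz, Austria),Pedro Quaresma(University of Coimbra, Portugal)
Affiliation: Unknown
Categories: Logic in Computer Science (cs.LO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: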
Abstract:The ThEdu series pursues the smooth transition from an intuitive way of doing mathematics at secondary school to a more formal approach to the subject in STEM education while favoring software support for this transition by exploiting the power of theorem-proving technologies. What follows is a brief description of how the present volume contributes to this enterprise. The 13th International Workshop on Theorem Proving Components for Educational Software (ThEdu’24) was a satellite event of CADE-29, part of IJCAR 2024, Nancy, France. ThEdu’24 was a vibrant workshop, with one invited talk by Jeremy Avigad (Carnegie Mellon University) and 14 submitted talks. An open call for papers was then issued and attracted 9 submissions. Eight of those submissions have been accepted by our reviewers. The resulting revised papers are collected in the present volume. The contributions in this volume are a faithful representation of the wide spectrum of ThEdu, ranging from those more focused on automated deduction research, without losing track of the possible applications in an educational setting, to those focused on the applications, in educational settings, of automated deduction tools and methods. We, the volume editors, hope that this collection of papers will further promote the development of theorem-proving-based software and that it will help improve the mutual understanding between computer scientists, mathematicians, and stakeholders in education. While this volume goes to press, the next edition of the ThEdu workshop is being prepared: ThEdu’25 will be a satellite event of the 30th International Conference on Automated Deduction (CADE-30), July 28th - August 2nd, 2025, Stuttgart, Germany.
[AI-49] Dynamic Location Search for Identifying Maximum Weighted Independent Sets in Complex Networks
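[Quick read]: This paper targets the efficiency problem in intelligent transportation systems (ITSs), where techniques such as Generative AI demand substantial training time and computational resources in large-scale, complex scenarios. The key to its solution is a new algorithm, DynLS, which solves the maximum weighted independent set (MWIS) problem through three innovations: a scores-based adaptive vertex perturbation (SAVP) technique that accelerates convergence; a region location mechanism (RLM) that dynamically adjusts the search space to escape local optima; and a novel variable neighborhood descent strategy (ComLS) that combines vertex exchange strategies with a reward mechanism to guide the search toward high-quality solutions.
Link: https://arxiv.org/abs/2505.04674
Authors: Enqiang Zhu,Chenkai Hao,Chanjuan Liu,Yongsheng Rao
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: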
Abstract:While artificial intelligence (AI), including Generative AI, is effective at generating high-quality traffic data and optimization solutions in intelligent transportation systems (ITSs), these techniques often demand significant training time and computational resources, especially in large-scale and complex scenarios. To address this, we introduce a novel and efficient algorithm for solving the maximum weighted independent set (MWIS) problem, which can be used to model many ITS applications, such as traffic signal control and vehicle routing. Given the NP-hard nature of the MWIS problem, our proposed algorithm, DynLS, incorporates three key innovations to solve it effectively. First, it uses a scores-based adaptive vertex perturbation (SAVP) technique to accelerate convergence, particularly in sparse graphs. Second, it includes a region location mechanism (RLM) to help escape local optima by dynamically adjusting the search space. Finally, it employs a novel variable neighborhood descent strategy, ComLS, which combines vertex exchange strategies with a reward mechanism to guide the search toward high-quality solutions. Our experimental results demonstrate DynLS’s superior performance, consistently delivering high-quality solutions within 1000 seconds. DynLS outperformed five leading algorithms across 360 test instances, achieving the best solution for 350 instances and surpassing the second-best algorithm, Cyclic-Fast, by 177 instances. Moreover, DynLS matched Cyclic-Fast’s convergence speed, highlighting its efficiency and practicality. This research represents a significant advancement in heuristic algorithms for the MWIS problem, offering a promising approach to aid AI techniques in optimizing intelligent transportation systems.
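For readers unfamiliar with MWIS heuristics, the sketch below shows a generic baseline: a greedy construction followed by a simple vertex-exchange improvement step. It is not DynLS and implements none of SAVP, RLM, or ComLS; the function names and the toy graph are assumptions chosen only to illustrate what "vertex exchange" and "local optimum" mean in this setting.

```python
def greedy_mwis(weights, adjacency):
    """Greedy baseline: repeatedly pick the vertex with the best
    weight / (remaining degree + 1) ratio and discard its neighbours."""
    remaining = set(weights)
    chosen = set()
    while remaining:
        best = max(
            sorted(remaining),
            key=lambda v: weights[v] / (len(adjacency[v] & remaining) + 1),
        )
        chosen.add(best)
        remaining -= adjacency[best] | {best}
    return chosen


def improve_by_swaps(chosen, weights, adjacency):
    """Vertex exchange: insert one outside vertex and evict its chosen
    neighbours whenever that strictly raises the total weight."""
    chosen = set(chosen)
    improved = True
    while improved:
        improved = False
        for v in sorted(weights):
            if v in chosen:
                continue
            conflicts = adjacency[v] & chosen
            if weights[v] > sum(weights[u] for u in conflicts):
                chosen -= conflicts
                chosen.add(v)
                improved = True
    return chosen


if __name__ == "__main__":
    # Toy star graph: vertex 1 is adjacent to vertices 0, 2 and 3.
    weights = {0: 5, 1: 9, 2: 1, 3: 1}
    adjacency = {0: {1}, 1: {0, 2, 3}, 2: {1}, 3: {1}}
    start = greedy_mwis(weights, adjacency)
    final = improve_by_swaps(start, weights, adjacency)
    print(sorted(start), sum(weights[v] for v in start))
    print(sorted(final), sum(weights[v] for v in final))
```

On this toy graph the greedy step is misled by the low-degree vertex 0 and the exchange step repairs it. On larger graphs such single-vertex exchanges can stall in local optima, which is the situation that perturbation and search-space adjustment mechanisms are meant to handle.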
[AI-50] Multimodal Benchmarking and Recommendation of Text-to-Image Generation Models
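[Quick read]: This paper addresses the lack of a unified framework for evaluating and benchmarking text-to-image generation models, in particular how to quantify the effect of metadata-augmented prompts on generation quality. The key to its solution is an open-source unified evaluation framework that uses the DeepFashion-MultiModal dataset and combines several quantitative metrics (Weighted Score, CLIP-based similarity, LPIPS, FID, and retrieval-based measures) with qualitative analysis to systematically assess visual realism, semantic fidelity, and model robustness across architectures.
Link: https://arxiv.org/abs/2505.04650
Authors: Kapil Wanaskar,Gaytri Jena,Magdalini Eirinaki
Affiliation: Unknown
Categories: Graphics (cs.GR); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Machine Learning (cs.LG)
Comments: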
Abstract:This work presents an open-source unified benchmarking and evaluation framework for text-to-image generation models, with a particular focus on the impact of metadata augmented prompts. Leveraging the DeepFashion-MultiModal dataset, we assess generated outputs through a comprehensive set of quantitative metrics, including Weighted Score, CLIP (Contrastive Language Image Pre-training)-based similarity, LPIPS (Learned Perceptual Image Patch Similarity), FID (Frechet Inception Distance), and retrieval-based measures, as well as qualitative analysis. Our results demonstrate that structured metadata enrichments greatly enhance visual realism, semantic fidelity, and model robustness across diverse text-to-image architectures. While not a traditional recommender system, our framework enables task-specific recommendations for model selection and prompt design based on evaluation metrics.
[AI-51] Computational Irreducibility as the Foundation of Agency: A Formal Model Connecting Undecidability to Autonomous Behavior in Complex Systems
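[Quick read]: This paper addresses how to define autonomy and agency from computational and physical perspectives, in particular their relationship to computational limits such as computational irreducibility and decidability. The key to its solution is a formal model of a "minimal agent"; using algorithmic information theory, the paper argues that the inherent undecidability and computational irreducibility of agent-environment interaction lead to unpredictability and the generation of new information, which enables effective goal-directed behavior. It further argues that genuine autonomy necessarily implies undecidability from an external perspective, providing a theoretical basis for distinguishing autonomous systems from predictable ones.
Link: https://arxiv.org/abs/2505.04646
Authors: Poria Azadi
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI); Computational Complexity (cs.CC); Information Theory (cs.IT)
Comments: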
Abstract:This article explores the emergence of autonomy and agency by connecting fundamental computational limits (decidability, completeness, computational irreducibility) with physical concepts. We introduce a formal model of a “minimal agent” operating within potentially Turing-complete environments. Using algorithmic information theory, we argue that the inherent undecidability and computational irreducibility of agent-environment interaction lead to unpredictability and novel information generation, enabling agency (effective goal-directed action). Computational irreducibility prevents full external prediction, creating necessary conditions for autonomous behavior. We relate this to computational sourcehood, where an agent is the irreducible origin of its behavior, though formalizing this concept remains challenging. Our central thesis, formally proven, is that genuine autonomy necessarily implies undecidability from an external perspective, distinguishing autonomous systems from predictable ones. We propose that agency arises when agent-environment coupling complexity allows mutual information between internal states and relevant environmental variables to increase, particularly where analytical solutions are absent and operational closure is needed for persistence. This framework links agency directly to the computational properties of interaction, offering implications for understanding consciousness, designing autonomous AI, and reconceptualizing free will in a deterministic yet computationally irreducible universe.
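[AI-52] Toward Holistic Evaluation of Recommender Systems Powered by Generative Models
[Quick read]: This paper addresses the new evaluation challenges facing generative recommender systems (Gen-RecSys), which include existing concerns exacerbated by generative outputs (such as bias and privacy leakage) and entirely new risks (such as hallucinated items and contradictory explanations). The key to its solution is a holistic evaluation approach that covers scenario-based assessments and multi-metric checks, integrating relevance, factual grounding, bias detection, and policy compliance, to ensure effective personalization and responsible deployment of Gen-RecSys.
Link: https://arxiv.org/abs/2504.06667
Authors: Yashar Deldjoo,Nikhil Mehta,Maheswaran Sathiamoorthy,Shuai Zhang,Pablo Castells,Julian McAuley
Affiliation: Unknown
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
Comments: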
Abstract:Recommender systems powered by generative models (Gen-RecSys) extend beyond classical item ranking by producing open-ended content, which simultaneously unlocks richer user experiences and introduces new risks. On one hand, these systems can enhance personalization and appeal through dynamic explanations and multi-turn dialogues. On the other hand, they might venture into unknown territory: hallucinating nonexistent items, amplifying bias, or leaking private information. Traditional accuracy metrics cannot fully capture these challenges, as they fail to measure factual correctness, content safety, or alignment with user intent. This paper makes two main contributions. First, we categorize the evaluation challenges of Gen-RecSys into two groups: (i) existing concerns that are exacerbated by generative outputs (e.g., bias, privacy) and (ii) entirely new risks (e.g., item hallucinations, contradictory explanations). Second, we propose a holistic evaluation approach that includes scenario-based assessments and multi-metric checks incorporating relevance, factual grounding, bias detection, and policy compliance. Our goal is to provide a guiding framework so researchers and practitioners can thoroughly assess Gen-RecSys, ensuring effective personalization and responsible deployment.
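[AI-53] High-fidelity Grain Growth Modeling: Leveraging Deep Learning for Fast Computations
[Quick read]: This paper addresses the excessive computational cost of predicting microstructure evolution in metallic materials during annealing; traditional methods based on partial differential equations create computational bottlenecks that limit the efficiency of materials design and manufacturing. The key to its solution is a machine learning framework that combines a Convolutional Long Short-Term Memory network with an autoencoder. The framework efficiently captures the spatial and temporal features of grain evolution, learns patterns by encoding high-dimensional grain structure data into a compact latent space, and uses a novel composite loss function combining mean squared error, the structural similarity index measure, and boundary preservation to maintain the integrity of grain boundary topology.
Link: https://arxiv.org/abs/2505.05354
Authors: Pungponhavoan Tep,Marc Bernacki
Affiliation: Unknown
Categories: Materials Science (cond-mat.mtrl-sci); Artificial Intelligence (cs.AI)
Comments: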
Abstract:Grain growth simulation is crucial for predicting metallic material microstructure evolution during annealing and resulting final mechanical properties, but traditional partial differential equation-based methods are computationally expensive, creating bottlenecks in materials design and manufacturing. In this work, we introduce a machine learning framework that combines a Convolutional Long Short-Term Memory network with an Autoencoder to efficiently predict grain growth evolution. Our approach captures both spatial and temporal aspects of grain evolution while encoding high-dimensional grain structure data into a compact latent space for pattern learning, enhanced by a novel composite loss function combining Mean Squared Error, Structural Similarity Index Measurement, and Boundary Preservation to maintain structural integrity of grain boundary topology of the prediction. Results demonstrated that our machine learning approach accelerates grain growth prediction by up to 89×, reducing computation time from 10 minutes to approximately 10 seconds while maintaining high-fidelity predictions. The best model (S-30-30) achieved a structural similarity score of 86.71% and a mean grain size error of just 0.07%. All models accurately captured grain boundary topology, morphology, and size distributions. This approach enables rapid microstructural prediction for applications where conventional simulations are prohibitively time-consuming, potentially accelerating innovation in materials science and manufacturing.
[AI-54] Decomposition of Probabilities of Causation with Two Mediators
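[Quick read]: This paper addresses how, in the presence of multiple mediators, to decompose the total probability of necessity and sufficiency (total PNS) into path-specific PNS along different causal pathways. The key to its solution is to define the path-specific PNS and provide an identification theorem that enables the decomposition; numerical experiments assess the finite-sample properties of the proposed estimators, and an application to a real-world educational dataset demonstrates their use.
Link: https://arxiv.org/abs/2505.04983
Authors: Yuta Kawakami,Jin Tian
Affiliation: Unknown
Categories: Methodology (stat.ME); Artificial Intelligence (cs.AI)
Comments: arXiv admin note: text overlap with arXiv:2412.14491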
Abstract:Mediation analysis for probabilities of causation (PoC) provides a fundamental framework for evaluating the necessity and sufficiency of treatment in provoking an event through different causal pathways. One of the primary objectives of causal mediation analysis is to decompose the total effect into path-specific components. In this study, we investigate the path-specific probability of necessity and sufficiency (PNS) to decompose the total PNS into path-specific components along distinct causal pathways between treatment and outcome, incorporating two mediators. We define the path-specific PNS for decomposition and provide an identification theorem. Furthermore, we conduct numerical experiments to assess the properties of the proposed estimators from finite samples and demonstrate their practical application using a real-world educational dataset.
[AI-55] Moments of Causal Effects
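[Quick read]: This paper addresses how to estimate, from finite samples, the moments of causal effects (such as mean, variance, and covariance) and their distributional relationships, so as to describe the statistical properties of causal effects more fully. The key to its solution is to define moments and product moments of causal effects and to provide identification theorems and bounds for them, which allow the distribution of causal effects and the relationships among variables to be analyzed; experiments on a real-world medical dataset illustrate the method's feasibility.
Link: https://arxiv.org/abs/2505.04971
Authors: Yuta Kawakami,Jin Tian
Affiliation: Unknown
Categories: Methodology (stat.ME); Artificial Intelligence (cs.AI)
Comments: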
Abstract:The moments of random variables are fundamental statistical measures for characterizing the shape of a probability distribution, encompassing metrics such as mean, variance, skewness, and kurtosis. Additionally, the product moments, including covariance and correlation, reveal the relationships between multiple random variables. On the other hand, the primary focus of causal inference is the evaluation of causal effects, which are defined as the difference between two potential outcomes. While traditional causal effect assessment focuses on the average causal effect, this work provides definitions, identification theorems, and bounds for moments and product moments of causal effects to analyze their distribution and relationships. We conduct experiments to illustrate the estimation of the moments of causal effects from finite samples and demonstrate their practical application using a real-world medical dataset.
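One way to see why bounds are needed for higher moments: the mean of the causal effect Y1 - Y0 depends only on the two marginal distributions, but its variance also depends on the covariance between the potential outcomes, which is never observed jointly. The sketch below computes only the elementary Cauchy-Schwarz bounds, (σ1 - σ0)² ≤ Var(Y1 - Y0) ≤ (σ1 + σ0)²; these are not the bounds derived in the paper. It assumes a randomized experiment, so that each arm estimates one marginal, and the function name is hypothetical.

```python
import math
import statistics


def causal_effect_variance_bounds(treated, control):
    """Elementary bounds on Var(Y1 - Y0) from the two marginal variances.

    Var(Y1 - Y0) = var1 + var0 - 2 * rho * s1 * s0 with rho in [-1, 1],
    so the variance lies between (s1 - s0)^2 and (s1 + s0)^2.
    Assumption: `treated` and `control` come from randomized arms.
    """
    s1 = math.sqrt(statistics.pvariance(treated))
    s0 = math.sqrt(statistics.pvariance(control))
    return (s1 - s0) ** 2, (s1 + s0) ** 2


if __name__ == "__main__":
    lower, upper = causal_effect_variance_bounds([1.0, 3.0], [0.0, 4.0])
    print(lower, upper)
```

The width of this interval shows how little the marginals alone say about the spread of individual-level effects, which is the gap that sharper identification results and bounds aim to narrow.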
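[AI-56] GroverGPT-2: Simulating Grover's Algorithm via Chain-of-Thought Reasoning and Quantum-Native Tokenization
[Quick read]: This paper asks whether classical computing models can learn and simulate quantum algorithms, and in particular explores the potential of Large Language Models (LLMs) for this task. The key to its solution is GroverGPT-2, an LLM-based method that uses Chain-of-Thought (CoT) reasoning and quantum-native tokenization to simulate Grover's algorithm directly from quantum circuit representations while producing logically structured, interpretable outputs. The method shows that classical models can internalize quantum circuit logic by efficiently processing quantum-native tokens, and thereby capture the structure of quantum algorithms.
Link: https://arxiv.org/abs/2505.04880
Authors: Min Chen,Jinglei Cheng,Pingzhi Li,Haoran Wang,Tianlong Chen,Junyu Liu
Affiliation: Unknown
Categories: Quantum Physics (quant-ph); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 26 pages, 12 figures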
Abstract:Quantum computing offers theoretical advantages over classical computing for specific tasks, yet the boundary of practical quantum advantage remains an open question. To investigate this boundary, it is crucial to understand whether, and how, classical machines can learn and simulate quantum algorithms. Recent progress in large language models (LLMs) has demonstrated strong reasoning abilities, prompting exploration into their potential for this challenge. In this work, we introduce GroverGPT-2, an LLM-based method for simulating Grover’s algorithm using Chain-of-Thought (CoT) reasoning and quantum-native tokenization. Building on its predecessor, GroverGPT-2 performs simulation directly from quantum circuit representations while producing logically structured and interpretable outputs. Our results show that GroverGPT-2 can learn and internalize quantum circuit logic through efficient processing of quantum-native tokens, providing direct evidence that classical models like LLMs can capture the structure of quantum algorithms. Furthermore, GroverGPT-2 outputs interleave circuit data with natural language, embedding explicit reasoning into the simulation. This dual capability positions GroverGPT-2 as a prototype for advancing machine understanding of quantum algorithms and modeling quantum circuit logic. We also identify an empirical scaling law for GroverGPT-2 with increasing qubit numbers, suggesting a path toward scalable classical simulation. These findings open new directions for exploring the limits of classical simulatability, enhancing quantum education and research, and laying groundwork for future foundation models in quantum computing.
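As background for what "simulating Grover's algorithm" involves, here is a minimal exact statevector simulation in plain Python. It is the conventional classical baseline, whose memory cost grows as 2^n, and has nothing to do with how GroverGPT-2 works internally. The function name and the single-marked-state setup are assumptions for illustration.

```python
import math


def grover_success_probability(num_qubits: int, marked: int) -> float:
    """Probability of measuring `marked` after the standard number of
    Grover iterations, for a search space with one marked state."""
    size = 2 ** num_qubits
    # Uniform superposition produced by Hadamards on |0...0>.
    amps = [1.0 / math.sqrt(size)] * size
    iterations = math.floor(math.pi / 4.0 * math.sqrt(size))
    for _ in range(iterations):
        # Oracle: flip the phase of the marked basis state.
        amps[marked] = -amps[marked]
        # Diffusion operator: reflect every amplitude about the mean.
        mean = sum(amps) / size
        amps = [2.0 * mean - a for a in amps]
    return amps[marked] ** 2


if __name__ == "__main__":
    for n in (2, 3, 4, 5):
        print(n, grover_success_probability(n, marked=1))
```

Because the amplitude list has 2^n entries, this direct approach stops being practical after a few dozen qubits, which is part of why learned or approximate classical simulators are of interest.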
[AI-57] Quantum-Inspired Optimization Process for Data Imputation
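[Quick read]: This paper addresses the imputation of missing or unreliable values during data pre-processing, particularly in clinical datasets that contain biologically implausible missing values, such as the UCI Diabetes dataset. The key to its solution is a novel quantum-inspired imputation framework that integrates Principal Component Analysis (PCA) with quantum-assisted rotations, optimized with gradient-free classical optimizers (COBYLA, Simulated Annealing, and Differential Evolution), to reconstruct missing values while preserving statistical fidelity. The method constrains reconstructed values to within ±2 standard deviations of the original feature distributions, which avoids unrealistic clustering around central tendencies and markedly improves the realism and variability of the imputed data.
Link: https://arxiv.org/abs/2505.04841
Authors: Nishikanta Mohanty,Bikash K. Behera,Badsah Mukherjee,Christopher Ferrie
Affiliation: Unknown
Categories: Quantum Physics (quant-ph); Artificial Intelligence (cs.AI)
Comments: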
Abstract:Data imputation is a critical step in data pre-processing, particularly for datasets with missing or unreliable values. This study introduces a novel quantum-inspired imputation framework evaluated on the UCI Diabetes dataset, which contains biologically implausible missing values across several clinical features. The method integrates Principal Component Analysis (PCA) with quantum-assisted rotations, optimized through gradient-free classical optimizers (COBYLA, Simulated Annealing, and Differential Evolution), to reconstruct missing values while preserving statistical fidelity. Reconstructed values are constrained within +/-2 standard deviations of original feature distributions, avoiding unrealistic clustering around central tendencies. This approach achieves a substantial and statistically significant improvement, including an average reduction of over 85% in Wasserstein distance and Kolmogorov-Smirnov test p-values between 0.18 and 0.22, compared to p-values of 0.99 in classical methods such as Mean, KNN, and MICE. The method also eliminates zero-value artifacts and enhances the realism and variability of imputed data. By combining quantum-inspired transformations with a scalable classical framework, this methodology provides a robust solution for imputation tasks in domains such as healthcare and AI pipelines, where data quality and integrity are crucial.
[AI-58] Confabulation dynamics in a reservoir computer: Filling in the gaps with untrained attractors
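[Quick read]: This paper addresses the problem of generative AI producing false information while learning, in particular "confabulation", which occurs without any intent to deceive. The study analyzes the role of untrained attractors (UAs) in reservoir computing (RC), showing how they appear when reconstruction fails and how they affect transitions between reconstructed attractors. The results indicate that UAs are an intrinsic feature of learning systems with bounded state spaces and may be present in systems well beyond RC.
Link: https://arxiv.org/abs/2505.04792
Authors: Jack O’Hagan,Andrew Keane,Andrew Flynn
Affiliation: Unknown
Categories: Dynamical Systems (math.DS); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: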
Abstract:Artificial Intelligence has advanced significantly in recent years thanks to innovations in the design and training of artificial neural networks (ANNs). Despite these advancements, we still understand relatively little about how elementary forms of ANNs learn, fail to learn, and generate false information without the intent to deceive, a phenomenon known as ‘confabulation’. To provide some foundational insight, in this paper we analyse how confabulation occurs in reservoir computers (RCs): a dynamical system in the form of an ANN. RCs are particularly useful to study as they are known to confabulate in a well-defined way: when RCs are trained to reconstruct the dynamics of a given attractor, they sometimes construct an attractor that they were not trained to construct, a so-called ‘untrained attractor’ (UA). This paper sheds light on the role played by UAs when reconstruction fails and their influence when modelling transitions between reconstructed attractors. Based on our results, we conclude that UAs are an intrinsic feature of learning systems whose state spaces are bounded, and that this means of confabulation may be present in systems beyond RCs.
Machine Learning
[LG-0] Facets of Disparate Impact: Evaluating Legally Consistent Bias in Machine Learning CIKM2024
Link: https://arxiv.org/abs/2505.05471
Authors: Jarren Briscoe,Assefaw Gebremedhin
Categories: Computers and Society (cs.CY); Machine Learning (cs.LG)
Comments: CIKM 2024
Abstract:Leveraging current legal standards, we define bias through the lens of marginal benefits and objective testing with the novel metric “Objective Fairness Index”. This index combines the contextual nuances of objective testing with metric stability, providing a legally consistent and reliable measure. Utilizing the Objective Fairness Index, we provide fresh insights into sensitive machine learning applications, such as COMPAS (recidivism prediction), highlighting the metric’s practical and theoretical significance. The Objective Fairness Index allows one to differentiate between discriminatory tests and systemic disparities.
[LG-1] RL-DAUNCE: Reinforcement Learning-Driven Data Assimilation with Uncertainty-Aware Constrained Ensembles
Link: https://arxiv.org/abs/2505.05452
Authors: Pouria Behnoudfar,Nan Chen
Categories: Machine Learning (cs.LG); Mathematical Physics (math-ph)
Comments:
Abstract:Machine learning has become a powerful tool for enhancing data assimilation. While supervised learning remains the standard method, reinforcement learning (RL) offers unique advantages through its sequential decision-making framework, which naturally fits the iterative nature of data assimilation by dynamically balancing model forecasts with observations. We develop RL-DAUNCE, a new RL-based method that enhances data assimilation with physical constraints through three key aspects. First, RL-DAUNCE inherits the computational efficiency of machine learning while it uniquely structures its agents to mirror ensemble members in conventional data assimilation methods. Second, RL-DAUNCE emphasizes uncertainty quantification by advancing multiple ensemble members, moving beyond simple mean-state optimization. Third, RL-DAUNCE’s ensemble-as-agents design facilitates the enforcement of physical constraints during the assimilation process, which is crucial to improving the state estimation and subsequent forecasting. A primal-dual optimization strategy is developed to enforce constraints, which dynamically penalizes the reward function to ensure constraint satisfaction throughout the learning process. Also, state variable bounds are respected by constraining the RL action space. Together, these features ensure physical consistency without sacrificing efficiency. RL-DAUNCE is applied to the Madden-Julian Oscillation, an intermittent atmospheric phenomenon characterized by strongly non-Gaussian features and multiple physical constraints. RL-DAUNCE outperforms the standard ensemble Kalman filter (EnKF), which fails catastrophically due to the violation of physical constraints. Notably, RL-DAUNCE matches the performance of constrained EnKF, particularly in recovering intermittent signals, capturing extreme events, and quantifying uncertainties, while requiring substantially less computational effort.
[LG-2] DPQ-HD: Post-Training Compression for Ultra-Low Power Hyperdimensional Computing
Link: https://arxiv.org/abs/2505.05413
Authors: Nilesh Prasad Pandey,Shriniwas Kulkarni,David Wang,Onat Gungor,Flavio Ponzina,Tajana Rosing
Categories: Machine Learning (cs.LG)
Comments:
Abstract:Hyperdimensional Computing (HDC) is emerging as a promising approach for edge AI, offering a balance between accuracy and efficiency. However, current HDC-based applications often rely on high-precision models and/or encoding matrices to achieve competitive performance, which imposes significant computational and memory demands, especially for ultra-low power devices. While recent efforts use techniques like precision reduction and pruning to increase efficiency, most require retraining to maintain performance, making them expensive and impractical. To address this issue, we propose a novel Post Training Compression algorithm, Decomposition-Pruning-Quantization (DPQ-HD), which aims at compressing the end-to-end HDC system, achieving near floating point performance without the need for retraining. DPQ-HD reduces computational and memory overhead by uniquely combining the above three compression techniques and efficiently adapts to hardware constraints. Additionally, we introduce an energy-efficient inference approach that progressively evaluates similarity scores such as cosine similarity and performs early exit to reduce the computation, accelerating prediction inference while maintaining accuracy. We demonstrate that DPQ-HD achieves up to 20-100x reduction in memory for image and graph classification tasks with only a 1-2% drop in accuracy compared to uncompressed workloads. Lastly, we show that DPQ-HD outperforms the existing post-training compression methods and performs better than or on par with retraining-based state-of-the-art techniques, requiring significantly less overall optimization time (up to 100x) and faster inference (up to 56x) on a microcontroller.
[LG-3] Hide & Seek: Transformer Symmetries Obscure Sharpness & Riemannian Geometry Finds It
Link: https://arxiv.org/abs/2505.05409
Authors: Marvin F. da Silva,Felix Dangel,Sageev Oore
Categories: Machine Learning (cs.LG)
Comments:
Abstract:The concept of sharpness has been successfully applied to traditional architectures like MLPs and CNNs to predict their generalization. For transformers, however, recent work reported weak correlation between flatness and generalization. We argue that existing sharpness measures fail for transformers, because they have much richer symmetries in their attention mechanism that induce directions in parameter space along which the network or its loss remain identical. We posit that sharpness must account fully for these symmetries, and thus we redefine it on a quotient manifold that results from quotienting out the transformer symmetries, thereby removing their ambiguities. Leveraging tools from Riemannian geometry, we propose a fully general notion of sharpness, in terms of a geodesic ball on the symmetry-corrected quotient manifold. In practice, we need to resort to approximating the geodesics. Doing so up to first order yields existing adaptive sharpness measures, and we demonstrate that including higher-order terms is crucial to recover correlation with generalization. We present results on diagonal networks with synthetic data, and show that our geodesic sharpness reveals strong correlation for real-world transformers on both text and image classification tasks.
[LG-4] Denoising Diffusion Probabilistic Models for Coastal Inundation Forecasting
Link: https://arxiv.org/abs/2505.05381
Authors: Kazi Ashik Islam,Zakaria Mehrab,Mahantesh Halappanavar,Henning Mortveit,Sridhar Katragadda,Jon Derek Loftis,Madhav Marathe
Categories: Machine Learning (cs.LG)
Comments:
Abstract:Coastal flooding poses significant risks to communities, necessitating fast and accurate forecasting methods to mitigate potential damage. To approach this problem, we present DIFF-FLOOD, a probabilistic spatiotemporal forecasting method based on denoising diffusion models. DIFF-FLOOD predicts the inundation level at a location by taking both spatial and temporal context into account. It utilizes inundation levels at neighboring locations and digital elevation data as spatial context. Inundation history from a context time window, together with additional covariates, is used as temporal context. Convolutional neural networks and a cross-attention mechanism are then employed to capture the spatiotemporal dynamics in the data. We trained and tested DIFF-FLOOD on coastal inundation data from the Eastern Shore of Virginia, a region highly impacted by coastal flooding. Our results show that DIFF-FLOOD outperforms existing forecasting methods in terms of prediction performance (6% to 64% improvement in terms of two performance metrics) and scalability.
[LG-5] Nearly Optimal Sample Complexity for Learning with Label Proportions
Link: https://arxiv.org/abs/2505.05355
Authors: Robert Busa-Fekete,Travis Dick,Claudio Gentile,Haim Kaplan,Tomer Koren,Uri Stemmer
Categories: Machine Learning (cs.LG)
Comments:
Abstract:We investigate Learning from Label Proportions (LLP), a partial information setting where examples in a training set are grouped into bags, and only aggregate label values in each bag are available. Despite the partial observability, the goal is still to achieve small regret at the level of individual examples. We give results on the sample complexity of LLP under square loss, showing that our sample complexity is essentially optimal. From an algorithmic viewpoint, we rely on carefully designed variants of Empirical Risk Minimization and Stochastic Gradient Descent algorithms, combined with ad hoc variance reduction techniques. On one hand, our theoretical results improve in important ways on the existing literature on LLP, specifically in the way the sample complexity depends on the bag size. On the other hand, we validate our algorithmic solutions on several datasets, demonstrating improved empirical performance (better accuracy with fewer samples) against recent baselines.
[LG-6] Performance Estimation in Binary Classification Using Calibrated Confidence
Link: https://arxiv.org/abs/2505.05295
Authors: Juhani Kivimäki,Jakub Białek,Wojtek Kuberski,Jukka K. Nurminen
Categories: Machine Learning (cs.LG)
Comments:
Abstract:Model monitoring is a critical component of the machine learning lifecycle, safeguarding against undetected drops in the model’s performance after deployment. Traditionally, performance monitoring has required access to ground truth labels, which are not always readily available. This can result in unacceptable latency or render performance monitoring altogether impossible. Recently, methods designed to estimate the accuracy of classifier models without access to labels have shown promising results. However, there are various other metrics that might be more suitable for assessing model performance in many cases. Until now, none of these important metrics has received similar interest from the scientific community. In this work, we address this gap by presenting CBPE, a novel method that can estimate any binary classification metric defined using the confusion matrix. In particular, we choose four metrics from this large family: accuracy, precision, recall, and $F_1$, to demonstrate our method. CBPE treats the elements of the confusion matrix as random variables and leverages calibrated confidence scores of the model to estimate their distributions. The desired metric is then also treated as a random variable, whose full probability distribution can be derived from the estimated confusion matrix. CBPE is shown to produce estimates that come with strong theoretical guarantees and valid confidence intervals.
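To make the idea of label-free estimation concrete, here is a minimal sketch of how expected confusion-matrix counts can be read off calibrated scores. It is not the authors' CBPE implementation: it computes only point estimates (expected counts), whereas the abstract describes full distributions and confidence intervals. The function names, the 0.5 threshold, and the example scores are assumptions made for illustration, and plugging expected counts into ratio metrics is itself an approximation.

```python
def expected_confusion(probs, threshold=0.5):
    """Expected confusion-matrix counts, assuming each score is a calibrated P(y=1 | x)."""
    tp = fp = fn = tn = 0.0
    for p in probs:
        if p >= threshold:   # predicted positive: correct with probability p
            tp += p
            fp += 1.0 - p
        else:                # predicted negative: correct with probability 1 - p
            fn += p
            tn += 1.0 - p
    return tp, fp, fn, tn


def estimated_metrics(probs, threshold=0.5):
    """Plug the expected counts into the usual metric formulas (an approximation)."""
    tp, fp, fn, tn = expected_confusion(probs, threshold)
    nan = float("nan")
    return {
        "accuracy": (tp + tn) / len(probs) if probs else nan,
        "precision": tp / (tp + fp) if tp + fp > 0 else nan,
        "recall": tp / (tp + fn) if tp + fn > 0 else nan,
        "f1": 2 * tp / (2 * tp + fp + fn) if 2 * tp + fp + fn > 0 else nan,
    }


# Hypothetical calibrated scores for four unlabeled samples.
scores = [0.9, 0.8, 0.3, 0.1]
print(estimated_metrics(scores))
```

The estimate is only as good as the calibration: if the scores are over- or under-confident, every count above is biased in the same direction.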
[LG-7] Morphologically Symmetric Reinforcement Learning for Ambidextrous Bimanual Manipulation
Link: https://arxiv.org/abs/2505.05287
Authors: Zechu Li,Yufeng Jin,Daniel Ordonez Apraez,Claudio Semini,Puze Liu,Georgia Chalvatzaki
Categories: Robotics (cs.RO); Machine Learning (cs.LG)
Comments:
Abstract:Humans naturally exhibit bilateral symmetry in their gross manipulation skills, effortlessly mirroring simple actions between left and right hands. Bimanual robots, which also feature bilateral symmetry, should similarly exploit this property to perform tasks with either hand. Unlike humans, who often favor a dominant hand for fine dexterous skills, robots should ideally execute ambidextrous manipulation with equal proficiency. To this end, we introduce SYMDEX (SYMmetric DEXterity), a reinforcement learning framework for ambidextrous bi-manipulation that leverages the robot’s inherent bilateral symmetry as an inductive bias. SYMDEX decomposes complex bimanual manipulation tasks into per-hand subtasks and trains dedicated policies for each. By exploiting bilateral symmetry via equivariant neural networks, experience from one arm is inherently leveraged by the opposite arm. We then distill the subtask policies into a global ambidextrous policy that is independent of the hand-task assignment. We evaluate SYMDEX on six challenging simulated manipulation tasks and demonstrate successful real-world deployment on two of them. Our approach strongly outperforms baselines on complex tasks in which the left and right hands perform different roles. We further demonstrate SYMDEX’s scalability by extending it to a four-arm manipulation setup, where our symmetry-aware policies enable effective multi-arm collaboration and coordination. Our results highlight how structural symmetry as inductive bias in policy learning enhances sample efficiency, robustness, and generalization across diverse dexterous manipulation tasks.
[LG-8] Enhancing Treatment Effect Estimation via Active Learning: A Counterfactual Covering Perspective ICML’25
Link: https://arxiv.org/abs/2505.05242
Authors: Hechuan Wen,Tong Chen,Mingming Gong,Li Kheng Chai,Shazia Sadiq,Hongzhi Yin
Categories: Machine Learning (cs.LG)
Comments: Accepted by ICML’25
Abstract:Although numerous complex algorithms for treatment effect estimation have been developed in recent years, their effectiveness remains limited when handling insufficiently labeled training sets due to the high cost of labeling the effect after treatment, e.g., expensive tumor imaging or biopsy procedures needed to evaluate treatment effects. Therefore, it becomes essential to actively incorporate more high-quality labeled data, all while adhering to a constrained labeling budget. To enable data-efficient treatment effect estimation, we formalize the problem through rigorous theoretical analysis within the active learning context, where the derived key measures, the factual and counterfactual covering radii, determine the risk upper bound. To reduce the bound, we propose a greedy radius reduction algorithm, which excels under an idealized, balanced data distribution. To generalize to more realistic data distributions, we further propose FCCM, which transforms the optimization objective into Factual and Counterfactual Coverage Maximization to ensure effective radius reduction during data acquisition. Furthermore, benchmarking FCCM against other baselines demonstrates its superiority across both fully synthetic and semi-synthetic datasets.
[LG-9] Latte: Transfering LLMs’ Latent-level Knowledge for Few-shot Tabular Learning
Link: https://arxiv.org/abs/2505.05237
Authors: Ruxue Shi,Hengrui Gu,Hangting Ye,Yiwei Dai,Xu Shen,Xin Wang
Categories: Machine Learning (cs.LG)
Comments:
Abstract:Few-shot tabular learning, in which machine learning models are trained with a limited amount of labeled data, provides a cost-effective approach to addressing real-world challenges. The advent of Large Language Models (LLMs) has sparked interest in leveraging their pre-trained knowledge for few-shot tabular learning. Despite promising results, existing approaches either rely on test-time knowledge extraction, which introduces undesirable latency, or text-level knowledge, which leads to unreliable feature engineering. To overcome these limitations, we propose Latte, a training-time knowledge extraction framework that transfers the latent prior knowledge within LLMs to optimize a more generalized downstream model. Latte enables general knowledge-guided downstream tabular learning, facilitating the weighted fusion of information across different feature values while reducing the risk of overfitting to limited labeled data. Furthermore, Latte is compatible with existing unsupervised pre-training paradigms and effectively utilizes available unlabeled samples to overcome the performance limitations imposed by an extremely small labeled dataset. Extensive experiments on various few-shot tabular learning benchmarks demonstrate the superior performance of Latte, establishing it as a state-of-the-art approach in this domain.
[LG-10] GFlowNets for Active Learning Based Resource Allocation in Next Generation Wireless Networks
Link: https://arxiv.org/abs/2505.05224
Authors: Charbel Bou Chaaya,Mehdi Bennis
Categories: Machine Learning (cs.LG)
Comments:
Abstract:In this work, we consider the radio resource allocation problem in a wireless system with various integrated functionalities, such as communication, sensing and computing. We design suitable resource management techniques that can simultaneously cater to those heterogeneous requirements, and scale appropriately with the high-dimensional and discrete nature of the problem. We propose a novel active learning framework where resource allocation patterns are drawn sequentially, evaluated in the environment, and then used to iteratively update a surrogate model of the environment. Our method leverages a generative flow network (GFlowNet) to sample favorable solutions, as such models are trained to generate compositional objects proportionally to their training reward, hence providing an appropriate coverage of its modes. As such, GFlowNet generates diverse and high return resource management designs that update the surrogate model and swiftly discover suitable solutions. We provide simulation results showing that our method can allocate radio resources achieving 20% performance gains against benchmarks, while requiring less than half of the number of acquisition rounds.
[LG-11] Long-Term Individual Causal Effect Estimation via Identifiable Latent Representation Learning
Link: https://arxiv.org/abs/2505.05192
Authors: Ruichu Cai,Junjie Wan,Weilin Chen,Zeqin Yang,Zijian Li,Peng Zhen,Jiecheng Guo
Categories: Machine Learning (cs.LG)
Comments:
Abstract:Estimating long-term causal effects by combining long-term observational and short-term experimental data is a crucial but challenging problem in many real-world scenarios. In existing methods, several ideal assumptions, e.g., the latent unconfoundedness assumption or the additive equi-confounding bias assumption, are proposed to address the latent confounder problem raised by the observational data. However, in real-world applications, these assumptions are typically violated, which limits their practical effectiveness. In this paper, we tackle the problem of estimating the long-term individual causal effects without the aforementioned assumptions. Specifically, we propose to utilize the natural heterogeneity of data, such as data from multiple sources, to identify latent confounders, thereby largely avoiding reliance on idealized assumptions. Practically, we devise a latent representation learning-based estimator of long-term causal effects. Theoretically, we establish the identifiability of latent confounders, with which we further achieve long-term effect identification. Extensive experimental studies, conducted on multiple synthetic and semi-synthetic datasets, demonstrate the effectiveness of our proposed method.
[LG-12] OpenworldAUC: Towards Unified Evaluation and Optimization for Open-world Prompt Tuning ICML2025
Link: https://arxiv.org/abs/2505.05180
Authors: Cong Hua,Qianqian Xu,Zhiyong Yang,Zitai Wang,Shilong Bao,Qingming Huang
Categories: Machine Learning (cs.LG)
Comments: This paper has been accepted by ICML2025
Abstract:Prompt tuning adapts Vision-Language Models like CLIP to open-world tasks with minimal training costs. In this direction, one typical paradigm evaluates model performance separately on known classes (i.e., base domain) and unseen classes (i.e., new domain). However, real-world scenarios require models to handle inputs without prior domain knowledge. This practical challenge has spurred the development of open-world prompt tuning, which demands a unified evaluation of two stages: 1) detecting whether an input belongs to the base or new domain (P1), and 2) classifying the sample into its correct class (P2). What’s more, as domain distributions are generally unknown, a proper metric should be insensitive to varying base/new sample ratios (P3). However, we find that current metrics, including HM, overall accuracy, and AUROC, fail to satisfy these three properties simultaneously. To bridge this gap, we propose OpenworldAUC, a unified metric that jointly assesses detection and classification through pairwise instance comparisons. To optimize OpenworldAUC effectively, we introduce Gated Mixture-of-Prompts (GMoP), which employs domain-specific prompts and a gating mechanism to dynamically balance detection and classification. Theoretical guarantees ensure generalization of GMoP under practical conditions. Experiments on 15 benchmarks in open-world scenarios show GMoP achieves SOTA performance on OpenworldAUC and other metrics. We release the code at this https URL
[LG-13] Bandit Max-Min Fair Allocation
Link: https://arxiv.org/abs/2505.05169
Authors: Tsubasa Harada,Shinji Ito,Hanna Sumita
Categories: Machine Learning (cs.LG)
Comments: 23 pages
Abstract:In this paper, we study a new decision-making problem called the bandit max-min fair allocation (BMMFA) problem. The goal of this problem is to maximize the minimum utility among agents with additive valuations by repeatedly assigning indivisible goods to them. One key feature of this problem is that each agent’s valuation for each item can only be observed through the semi-bandit feedback, while existing work supposes that the item values are provided at the beginning of each round. Another key feature is that the algorithm’s reward function is not additive with respect to rounds, unlike most bandit-setting problems. Our first contribution is to propose an algorithm that has an asymptotic regret bound of $O(m\sqrt{T}\ln T/n + m\sqrt{T\ln(mnT)})$, where $n$ is the number of agents, $m$ is the number of items, and $T$ is the time horizon. This is based on a novel combination of bandit techniques and a resource allocation algorithm studied in the literature on competitive analysis. Our second contribution is to provide the regret lower bound of $\Omega(m\sqrt{T}/n)$. When $T$ is sufficiently larger than $n$, the gap between the upper and lower bounds is a logarithmic factor of $T$.
[LG-14] FedTDP: A Privacy-Preserving and Unified Framework for Trajectory Data Preparation via Federated Learning
Link: https://arxiv.org/abs/2505.05155
Authors: Zhihao Zeng,Ziquan Fang,Wei Shao,Lu Chen,Yunjun Gao
Categories: Machine Learning (cs.LG); Cryptography and Security (cs.CR)
Comments:
Abstract:Trajectory data, which capture the movement patterns of people and vehicles over time and space, are crucial for applications like traffic optimization and urban planning. However, issues such as noise and incompleteness often compromise data quality, leading to inaccurate trajectory analyses and limiting the potential of these applications. While Trajectory Data Preparation (TDP) can enhance data quality, existing methods suffer from two key limitations: (i) they do not address data privacy concerns, particularly in federated settings where trajectory data sharing is prohibited, and (ii) they typically design task-specific models that lack generalizability across diverse TDP scenarios. To overcome these challenges, we propose FedTDP, a privacy-preserving and unified framework that leverages the capabilities of Large Language Models (LLMs) for TDP in federated environments. Specifically, we: (i) design a trajectory privacy autoencoder to secure data transmission and protect privacy, (ii) introduce a trajectory knowledge enhancer to improve model learning of TDP-related knowledge, enabling the development of TDP-oriented LLMs, and (iii) propose federated parallel optimization to enhance training efficiency by reducing data transmission and enabling parallel model training. Experiments on 6 real datasets and 10 mainstream TDP tasks demonstrate that FedTDP consistently outperforms 13 state-of-the-art baselines.
[LG-15] Sparse Training from Random Initialization: Aligning Lottery Ticket Masks using Weight Symmetry ICML2025
Link: https://arxiv.org/abs/2505.05143
Authors: Mohammed Adnan,Rohan Jain,Ekansh Sharma,Rahul Krishnan,Yani Ioannou
Categories: Machine Learning (cs.LG)
Comments: Accepted at ICML 2025
Abstract:The Lottery Ticket Hypothesis (LTH) suggests that there exist a sparse LTH mask and weights that achieve the same generalization performance as the dense model while using significantly fewer parameters. However, finding an LTH solution is computationally expensive, and an LTH sparsity mask does not generalize to other random weight initializations. Recent work has suggested that neural networks trained from random initialization find solutions within the same basin modulo permutation, and proposes a method to align trained models within the same loss basin. We hypothesize that misalignment of basins is the reason why LTH masks do not generalize to new random initializations and propose permuting the LTH mask to align with the new optimization basin when performing sparse training from a different random init. We empirically show a significant increase in generalization when performing sparse training from random initialization with the permuted mask as compared to using the non-permuted LTH mask, on multiple datasets (CIFAR-10, CIFAR-100 and ImageNet) and models (VGG11, ResNet20 and ResNet50).
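The permutation symmetry the abstract relies on can be illustrated on a toy two-layer ReLU network: reordering hidden units (rows of the first weight matrix, columns of the second) leaves the function unchanged, so a sparsity mask only matches the reordered weights if it is reordered the same way. The sketch below is not the paper's method; in particular, the permutation `perm` is simply assumed to be given, whereas the abstract describes aligning models within a loss basin. All weights, masks, and names are hypothetical.

```python
def permute_rows(matrix, perm):
    """Reorder the rows (hidden units on the output side of layer 1)."""
    return [matrix[i] for i in perm]


def permute_cols(matrix, perm):
    """Reorder the columns (hidden units on the input side of layer 2)."""
    return [[row[i] for i in perm] for row in matrix]


def apply_mask(weights, mask):
    """Zero out pruned weights with an elementwise 0/1 mask."""
    return [[w * m for w, m in zip(w_row, m_row)] for w_row, m_row in zip(weights, mask)]


def forward(w1, w2, x):
    """Two-layer network: out = W2 @ relu(W1 @ x)."""
    hidden = [max(0.0, sum(w * xi for w, xi in zip(row, x))) for row in w1]
    return [sum(w * h for w, h in zip(row, hidden)) for row in w2]


# Toy network: 2 inputs, 3 hidden units, 1 output (all values hypothetical).
w1 = [[1.0, -2.0], [0.5, 3.0], [-1.0, 1.0]]
w2 = [[2.0, -1.0, 0.5]]
m1 = [[1, 0], [0, 1], [1, 1]]
m2 = [[1, 1, 0]]
perm = [2, 0, 1]  # assumed alignment; finding it is the hard part in practice
x = [1.0, 2.0]

# Same network with hidden units reordered.
w1_p, w2_p = permute_rows(w1, perm), permute_cols(w2, perm)

out_original = forward(apply_mask(w1, m1), apply_mask(w2, m2), x)
out_aligned = forward(
    apply_mask(w1_p, permute_rows(m1, perm)),
    apply_mask(w2_p, permute_cols(m2, perm)),
    x,
)
out_misaligned = forward(apply_mask(w1_p, m1), apply_mask(w2_p, m2), x)

print(out_original, out_aligned, out_misaligned)
```

With the permuted mask, the sparse network computes the same output as the original sparse network; with the unpermuted mask on the reordered weights, a different set of connections is pruned and the output changes.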
[LG-16] Taming OOD Actions for Offline Reinforcement Learning: An Advantage-Based Approach
Link: https://arxiv.org/abs/2505.05126
Authors: Xuyang Chen,Keyu Yan,Lin Zhao
Categories: Machine Learning (cs.LG)
Comments:
Abstract:Offline reinforcement learning (RL) aims to learn decision-making policies from fixed datasets without online interactions, providing a practical solution where online data collection is expensive or risky. However, offline RL often suffers from distribution shift, resulting in inaccurate evaluation and substantial overestimation on out-of-distribution (OOD) actions. To address this, existing approaches incorporate conservatism by indiscriminately discouraging all OOD actions, thereby hindering the agent’s ability to generalize and exploit beneficial ones. In this paper, we propose Advantage-based Diffusion Actor-Critic (ADAC), a novel method that systematically evaluates OOD actions using the batch-optimal value function. Based on this evaluation, ADAC defines an advantage function to modulate the Q-function update, enabling more precise assessment of OOD action quality. We design a custom PointMaze environment and collect datasets to visually reveal that advantage modulation can effectively identify and select superior OOD actions. Extensive experiments show that ADAC achieves state-of-the-art performance on almost all tasks in the D4RL benchmark, with particularly clear margins on the more challenging tasks.
[LG-17] Text2Cypher: Data Pruning using Hard Example Selection
Link: https://arxiv.org/abs/2505.05122
Authors: Makbule Gulcin Ozsoy
Categories: Databases (cs.DB); Machine Learning (cs.LG)
Comments:
Abstract:Database query languages such as SQL for relational databases and Cypher for graph databases have been widely adopted. Recent advancements in large language models (LLMs) enable natural language interactions with databases through models like Text2SQL and Text2Cypher. Fine-tuning these models typically requires large, diverse datasets containing non-trivial examples. However, as dataset size increases, the cost of fine-tuning also rises. This makes smaller, high-quality datasets essential for reducing costs for the same or better performance. In this paper, we propose five hard-example selection techniques for pruning the Text2Cypher dataset, aiming to preserve or improve performance while reducing resource usage. Our results show that these hard-example selection approaches can halve training time and costs with minimal impact on performance, demonstrating that hard-example selection provides a cost-effective solution.
[LG-18] USPR: Learning a Unified Solver for Profiled Routing
Link: https://arxiv.org/abs/2505.05119
Authors: Chuanbo Hua,Federico Berto,Zhikai Zhao,Jiwoo Son,Changhyun Kwon,Jinkyoo Park
Categories: Machine Learning (cs.LG); Multiagent Systems (cs.MA)
Comments:
Abstract:The Profiled Vehicle Routing Problem (PVRP) extends the classical VRP by incorporating vehicle-client-specific preferences and constraints, reflecting real-world requirements such as zone restrictions and service-level preferences. While recent reinforcement learning (RL) solvers have shown promise, they require retraining for each new profile distribution, suffer from poor representation ability, and struggle to generalize to out-of-distribution instances. In this paper, we address these limitations by introducing USPR (Unified Solver for Profiled Routing), a novel framework that natively handles arbitrary profile types. USPR introduces three key innovations: (i) Profile Embeddings (PE) to encode any combination of profile types; (ii) Multi-Head Profiled Attention (MHPA), an attention mechanism that models rich interactions between vehicles and clients; (iii) Profile-aware Score Reshaping (PSR), which dynamically adjusts decoder logits using profile scores to improve generalization. Empirical results on diverse PVRP benchmarks demonstrate that USPR achieves state-of-the-art results among learning-based methods while offering significant gains in flexibility and computational efficiency. We make our source code publicly available to foster future research at this https URL.
[LG-19] Enhancing Text2Cypher with Schema Filtering
Link: https://arxiv.org/abs/2505.05118
Authors: Makbule Gulcin Ozsoy
Categories: Databases (cs.DB); Machine Learning (cs.LG)
Comments:
Abstract:Knowledge graphs represent complex data using nodes, relationships, and properties. Cypher, a powerful query language for graph databases, enables efficient modeling and querying. Recent advancements in large language models allow translation of natural language questions into Cypher queries (Text2Cypher). A common approach is incorporating database schema into prompts. However, complex schemas can introduce noise, increase hallucinations, and raise computational costs. Schema filtering addresses these challenges by including only relevant schema elements, improving query generation while reducing token costs. This work explores various schema filtering methods for the Text2Cypher task and analyzes their impact on token length, performance, and cost. Results show that schema filtering effectively optimizes Text2Cypher, especially for smaller models. Consistent with prior research, we find that larger models benefit less from schema filtering due to their longer context capabilities. However, schema filtering remains valuable for both larger and smaller models in cost reduction.
[LG-20] Balancing Client Participation in Federated Learning Using AoI
Link: https://arxiv.org/abs/2505.05099
Authors: Alireza Javani,Zhiying Wang
Categories: Machine Learning (cs.LG); Information Theory (cs.IT)
Comments:
Abstract:Federated Learning (FL) offers a decentralized framework that preserves data privacy while enabling collaborative model training across distributed clients. However, FL faces significant challenges due to limited communication resources, statistical heterogeneity, and the need for balanced client participation. This paper proposes an Age of Information (AoI)-based client selection policy that addresses these challenges by minimizing load imbalance through controlled selection intervals. Our method employs a decentralized Markov scheduling policy, allowing clients to independently manage participation based on age-dependent selection probabilities, which balances client updates across training rounds with minimal central oversight. We provide a convergence proof for our method, demonstrating that it ensures stable and efficient model convergence. Specifically, we derive optimal parameters for the Markov selection model to achieve balanced and consistent client participation, highlighting the benefits of AoI in enhancing convergence stability. Through extensive simulations, we demonstrate that our AoI-based method, particularly the optimal Markov variant, improves convergence over the FedAvg selection approach across both IID and non-IID data settings by 7.5% and up to 20%. Our findings underscore the effectiveness of AoI-based scheduling for scalable, fair, and efficient FL systems across diverse learning environments.
[LG-21] A Conjoint Graph Representation Learning Framework for Hypertension Comorbidity Risk Prediction
Link: https://arxiv.org/abs/2505.05094
Authors: Leming Zhou,Zuo Wang,Zhixuan Duan
Categories: Machine Learning (cs.LG)
Comments:
Abstract:The comorbidities of hypertension impose a heavy burden on patients and society. Early identification is necessary to prompt intervention, but it remains a challenging task. This study aims to address this challenge by combining joint graph learning with network analysis. To this end, we develop a Conjoint Graph Representation Learning (CGRL) framework that: (a) constructs two networks based on disease coding, the patient network and the disease difference network, and generates three comorbidity network features from the basic difference network to capture the potential relationship between comorbidities and risk diseases; (b) incorporates computational structure intervention and feature representation learning to predict the risks of diabetes and coronary heart disease in patients; and (c) analyzes comorbidity patterns and explores pathways of disease progression, which may reveal the pathogenesis of diabetes and coronary heart disease. The results show that the network features extracted from the difference network are important, and that the proposed framework provides more accurate predictions than other strong models.
[LG-22] ItDPDM: Information-Theoretic Discrete Poisson Diffusion Model
Link: https://arxiv.org/abs/2505.05082
Authors: Sagnik Bhattacharya,Abhiram R. Gorle,Ahmed Mohsin,Ahsan Bilal,Connor Ding,Amit Kumar Singh Yadav,Tsachy Weissman
Categories: Machine Learning (cs.LG); Information Theory (cs.IT); Probability (math.PR)
Comments: Pre-print
Abstract:Existing methods for generative modeling of discrete data, such as symbolic music tokens, face two primary challenges: (1) they either embed discrete inputs into continuous state-spaces or (2) rely on variational losses that only approximate the true negative log-likelihood. Previous efforts have individually targeted these limitations. While information-theoretic Gaussian diffusion models alleviate the suboptimality of variational losses, they still perform modeling in continuous domains. In this work, we introduce the Information-Theoretic Discrete Poisson Diffusion Model (ItDPDM), which simultaneously addresses both limitations by directly operating in a discrete state-space via a Poisson diffusion process inspired by photon arrival processes in camera sensors. We introduce a novel Poisson Reconstruction Loss (PRL) and derive an exact relationship between PRL and the true negative log-likelihood, thereby eliminating the need for approximate evidence lower bounds. Experiments conducted on the Lakh MIDI symbolic music dataset and the CIFAR-10 image benchmark demonstrate that ItDPDM delivers significant improvements, reducing test NLL by up to 80% compared to prior baselines, while also achieving faster convergence.
[LG-23] WaterDrum: Watermarking for Data-centric Unlearning Metric
链接: https://arxiv.org/abs/2505.05064
作者: Xinyang Lu,Xinyuan Niu,Gregory Kang Ruey Lau,Bui Thi Cam Nhung,Rachael Hwee Ling Sim,Fanyu Wen,Chuan-Sheng Foo,See-Kiong Ng,Bryan Kian Hsiang Low
类目: Machine Learning (cs.LG)
*备注:
Abstract:Large language model (LLM) unlearning is critical in real-world applications where it is necessary to efficiently remove the influence of private, copyrighted, or harmful data from some users. However, existing utility-centric unlearning metrics (based on model utility) may fail to accurately evaluate the extent of unlearning in realistic settings such as when (a) the forget and retain set have semantically similar content, (b) retraining the model from scratch on the retain set is impractical, and/or (c) the model owner can improve the unlearning metric without directly performing unlearning on the LLM. This paper presents the first data-centric unlearning metric for LLMs called WaterDrum that exploits robust text watermarking for overcoming these limitations. We also introduce new benchmark datasets for LLM unlearning that contain varying levels of similar data points and can be used to rigorously evaluate unlearning algorithms using WaterDrum. Our code is available at this https URL and our new benchmark datasets are released at this https URL.
[LG-24] Neural Pathways to Program Success: Hopfield Networks for PERT Analysis
链接: https://arxiv.org/abs/2505.05047
作者: Azgar Ali Noor Ahamed
类目: Machine Learning (cs.LG)
*备注:
Abstract:Project and task scheduling under uncertainty remains a fundamental challenge in program and project management, where accurate estimation of task durations and dependencies is critical for delivering complex, multi-project systems. The Program Evaluation and Review Technique (PERT) provides a probabilistic framework to model task variability and critical paths. In this paper, the author presents a novel formulation of PERT scheduling as an energy minimization problem within a Hopfield neural network architecture. By mapping task start times and precedence constraints into a neural computation framework, the network's inherent optimization dynamics are exploited to approximate globally consistent schedules. The author addresses key theoretical issues related to energy function differentiability, constraint encoding, and convergence, and extends the Hopfield model for structured precedence graphs. Numerical simulations on synthetic project networks comprising up to 1000 tasks demonstrate the viability of this approach, achieving near-optimal makespans with minimal constraint violations. The findings suggest that neural optimization models offer a promising direction for scalable and adaptive project task scheduling under uncertainty in areas such as agentic AI workflows and the microservice-based applications on which modern AI systems are built.
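For readers unfamiliar with PERT, the sketch below shows the classical (non-neural) baseline that the abstract refers to: the standard three-point duration estimate and a forward pass over a precedence graph to obtain earliest start times and the makespan. It is not the paper's Hopfield formulation; the function names and the toy task data are hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def pert_estimate(optimistic, most_likely, pessimistic):
    """Classical PERT three-point estimate: (mean, variance) of a task duration."""
    mean = (optimistic + 4 * most_likely + pessimistic) / 6
    variance = ((pessimistic - optimistic) / 6) ** 2
    return mean, variance


def earliest_start_times(durations, precedences):
    """Forward pass over a precedence DAG.

    durations:   dict mapping task name -> expected duration
    precedences: list of (before, after) pairs
    Returns (earliest start per task, makespan).
    """
    successors = defaultdict(list)
    indegree = {task: 0 for task in durations}
    for before, after in precedences:
        successors[before].append(after)
        indegree[after] += 1

    start = {task: 0.0 for task in durations}
    ready = [task for task, deg in indegree.items() if deg == 0]
    visited = 0
    while ready:
        task = ready.pop()
        visited += 1
        finish = start[task] + durations[task]
        for nxt in successors[task]:
            # A task cannot start before all of its predecessors finish.
            start[nxt] = max(start[nxt], finish)
            indegree[nxt] -= 1
            if indegree[nxt] == 0:
                ready.append(nxt)
    if visited != len(durations):
        raise ValueError("precedence graph contains a cycle")

    makespan = max(start[t] + durations[t] for t in durations)
    return start, makespan


# Toy project (hypothetical data): A and B must both finish before C starts.
three_point = {"A": (2, 4, 6), "B": (1, 2, 9), "C": (3, 3, 3)}
durations = {name: pert_estimate(*est)[0] for name, est in three_point.items()}
start, makespan = earliest_start_times(durations, [("A", "C"), ("B", "C")])
```

An energy-based formulation such as the one described in the abstract would replace the exact forward pass with an iterative minimization in which precedence violations are penalized; the exact pass above is useful as a reference for checking such approximate schedules.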
[LG-25] Dequantified Diffusion Schrödinger Bridge for Density Ratio Estimation
链接: https://arxiv.org/abs/2505.05034
作者: Wei Chen,Shigui Li,Jiacheng Li,Junmei Yang,John Paisley,Delu Zeng
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:
Abstract:Density ratio estimation is fundamental to tasks involving f-divergences, yet existing methods often fail under significantly different distributions or inadequately overlapping supports, suffering from the density-chasm and the support-chasm problems. Additionally, prior approaches yield divergent time scores near boundaries, leading to instability. We propose D³RE, a unified framework for robust and efficient density ratio estimation. It introduces the Dequantified Diffusion-Bridge Interpolant (DDBI), which expands support coverage and stabilizes time scores via diffusion bridges and Gaussian dequantization. Building on DDBI, the Dequantified Schrödinger-Bridge Interpolant (DSBI) incorporates optimal transport to solve the Schrödinger bridge problem, enhancing accuracy and efficiency. Our method offers uniform approximation and bounded time scores in theory, and outperforms baselines empirically in mutual information and density estimation tasks.
[LG-26] Generative Models for Long Time Series: Approximately Equivariant Recurrent Network Structures for an Adjusted Training Scheme
链接: https://arxiv.org/abs/2505.05020
作者: Ruwen Fulek,Markus Lange-Hegermann
类目: Machine Learning (cs.LG)
*备注:
Abstract:We present a simple yet effective generative model for time series data based on a Variational Autoencoder (VAE) with recurrent layers, referred to as the Recurrent Variational Autoencoder with Subsequent Training (RVAE-ST). Our method introduces an adapted training scheme that progressively increases the sequence length, addressing the challenge recurrent layers typically face when modeling long sequences. By leveraging the recurrent architecture, the model maintains a constant number of parameters regardless of sequence length. This design encourages approximate time-shift equivariance and enables efficient modeling of long-range temporal dependencies. Rather than introducing a fundamentally new architecture, we show that a carefully composed combination of known components can match or outperform state-of-the-art generative models on several benchmark datasets. Our model performs particularly well on time series that exhibit quasi-periodic structure, while remaining competitive on datasets with more irregular or partially non-stationary behavior. We evaluate its performance using ELBO, Fréchet Distance, discriminative scores, and visualizations of the learned embeddings.
[LG-27] CLAM: Continuous Latent Action Models for Robot Learning from Unlabeled Demonstrations
链接: https://arxiv.org/abs/2505.04999
作者: Anthony Liang,Pavel Czempin,Matthew Hong,Yutai Zhou,Erdem Biyik,Stephen Tu
类目: Robotics (cs.RO); Machine Learning (cs.LG)
*备注: Latent Action Models, Self-supervised Pretraining, Learning from Videos
Abstract:Learning robot policies using imitation learning requires collecting large amounts of costly action-labeled expert demonstrations, which fundamentally limits the scale of training data. A promising approach to address this bottleneck is to harness the abundance of unlabeled observations (e.g., from video demonstrations) to learn latent action labels in an unsupervised way. However, we find that existing methods struggle when applied to complex robot tasks requiring fine-grained motions. We design continuous latent action models (CLAM) which incorporate two key ingredients we find necessary for learning to solve complex continuous control tasks from unlabeled observation data: (a) using continuous latent action labels instead of discrete representations, and (b) jointly training an action decoder to ensure that the latent action space can be easily grounded to real actions with relatively few labeled examples. Importantly, the labeled examples can be collected from non-optimal play data, enabling CLAM to learn performant policies without access to any action-labeled expert data. We demonstrate on continuous control benchmarks in DMControl (locomotion) and MetaWorld (manipulation), as well as on a real WidowX robot arm, that CLAM significantly outperforms prior state-of-the-art methods, with a 2-3x improvement in task success rate compared to the best baseline. Videos and code can be found at this http URL.
[LG-28] Graph Neural Network Aided Deep Reinforcement Learning for Resource Allocation in Dynamic Terahertz UAV Networks
链接: https://arxiv.org/abs/2505.04981
作者: Zhifeng Hu,Chong Han
类目: Machine Learning (cs.LG)
*备注:
Abstract:Terahertz (THz) unmanned aerial vehicle (UAV) networks with flexible topologies and ultra-high data rates are expected to empower numerous applications in security surveillance, disaster response, and environmental monitoring, among others. However, the dynamic topologies hinder the efficient long-term joint power and antenna array resource allocation for THz links among UAVs. Furthermore, the continuous nature of power and the discrete nature of antennas cause this joint resource allocation problem to be a mixed-integer nonlinear programming (MINLP) problem with non-convexity and NP-hardness. Inspired by recent rapid advancements in deep reinforcement learning (DRL), a graph neural network (GNN) aided DRL algorithm for resource allocation in the dynamic THz UAV network with an emphasis on self-node features (GLOVE) is proposed in this paper, with the aim of resource efficiency (RE) maximization. When training the allocation policy for each UAV, GLOVE learns the relationship between this UAV and its neighboring UAVs via GNN, while also emphasizing the important self-node features of this UAV. In addition, a multi-task structure is leveraged by GLOVE to cooperatively train resource allocation decisions for the power and sub-arrays of all UAVs. Experimental results illustrate that GLOVE outperforms benchmark schemes in terms of the highest RE and the lowest latency. Moreover, unlike the benchmark methods with severe packet loss, GLOVE maintains zero packet loss during the entire training process, demonstrating its better robustness under the highly dynamic THz UAV network.
[LG-29] Community and hyperedge inference in multiple hypergraphs
链接: https://arxiv.org/abs/2505.04967
作者: Li Ni,Ziqi Deng,Lin Mu,Lei Zhang,Wenjian Luo,Yiwen Zhang
类目: Social and Information Networks (cs.SI); Machine Learning (cs.LG)
*备注:
Abstract:Hypergraphs, capable of representing high-order interactions via hyperedges, have become a powerful tool for modeling real-world biological and social systems. Inherent relationships within these real-world systems, such as the encoding relationship between genes and their protein products, drive the establishment of interconnections between multiple hypergraphs. Here, we demonstrate how to utilize those interconnections between multiple hypergraphs to synthesize integrated information from multiple higher-order systems, thereby enhancing understanding of underlying structures. We propose a model based on the stochastic block model, which integrates information from multiple hypergraphs to reveal latent high-order structures. Real-world hyperedges exhibit preferential attachment, where certain nodes dominate hyperedge formation. To characterize this phenomenon, our model introduces hyperedge internal degree to quantify nodes’ contributions to hyperedge formation. This model is capable of mining communities, predicting missing hyperedges of arbitrary sizes within hypergraphs, and inferring inter-hypergraph edges between hypergraphs. We apply our model to high-order datasets to evaluate its performance. Experimental results demonstrate strong performance of our model in community detection, hyperedge prediction, and inter-hypergraph edge prediction tasks. Moreover, we show that our model enables analysis of multiple hypergraphs of different types and supports the analysis of a single hypergraph in the absence of inter-hypergraph edges. Our work provides a practical and flexible tool for analyzing multiple hypergraphs, greatly advancing the understanding of the organization in real-world high-order systems.
[LG-30] VaCDA: Variational Contrastive Alignment-based Scalable Human Activity Recognition
链接: https://arxiv.org/abs/2505.04907
作者: Soham Khisa,Avijoy Chakma
类目: Machine Learning (cs.LG)
*备注:
Abstract:Technological advancements have led to the rise of wearable devices with sensors that continuously monitor user activities, generating vast amounts of unlabeled data. This data is challenging to interpret, and manual annotation is labor-intensive and error-prone. Additionally, data distribution is often heterogeneous due to device placement, type, and user behavior variations. As a result, traditional transfer learning methods perform suboptimally, making it difficult to recognize daily activities. To address these challenges, we use a variational autoencoder (VAE) to learn a shared, low-dimensional latent space from available sensor data. This space generalizes data across diverse sensors, mitigating heterogeneity and aiding robust adaptation to the target domain. We integrate contrastive learning to enhance feature representation by aligning instances of the same class across domains while separating different classes. We propose Variational Contrastive Domain Adaptation (VaCDA), a multi-source domain adaptation framework combining VAEs and contrastive learning to improve feature representation and reduce heterogeneity between source and target domains. We evaluate VaCDA on multiple publicly available datasets across three heterogeneity scenarios: cross-person, cross-position, and cross-device. VaCDA outperforms the baselines in cross-position and cross-device scenarios.
[LG-31] CubeDAgger: Improved Robustness of Interactive Imitation Learning without Violation of Dynamic Stability
链接: https://arxiv.org/abs/2505.04897
作者: Taisuke Kobayashi
类目: Robotics (cs.RO); Machine Learning (cs.LG)
*备注: 7 pages, 4 figures
Abstract:Interactive imitation learning makes an agent’s control policy robust through stepwise supervision from an expert. Recent algorithms mostly employ expert-agent switching systems to reduce the expert’s burden by limiting when supervision is requested. However, selecting the timing precisely is difficult, and such switching causes abrupt changes in actions, damaging the dynamic stability. This paper therefore proposes a novel method, called CubeDAgger, which improves robustness while reducing dynamic stability violations by making three improvements to a baseline method, EnsembleDAgger. The first improvement adds a regularization to explicitly activate the threshold for deciding the supervision timing. The second transforms the expert-agent switching system into an optimal consensus system of multiple action candidates. The third adds autoregressive colored noise to the actions to make the stochastic exploration consistent over time. These improvements are verified by simulations, showing that the learned policies are sufficiently robust while maintaining dynamic stability during interaction.
[LG-32] GCN-Based Throughput-Oriented Handover Management in Dense 5G Vehicular Networks
链接: https://arxiv.org/abs/2505.04894
作者: Nazanin Mehregan,Robson E. De Grande
类目: Machine Learning (cs.LG)
*备注: Accepted at IEEE DCOSS-IoT 2025
Abstract:The rapid advancement of 5G has transformed vehicular networks, offering high bandwidth, low latency, and fast data rates essential for real-time applications in smart cities and vehicles. These improvements enhance traffic safety and entertainment services. However, the limited coverage and frequent handovers in 5G networks cause network instability, especially in high-mobility environments due to the ping-pong effect. This paper presents TH-GCN (Throughput-oriented Graph Convolutional Network), a novel approach for optimizing handover management in dense 5G networks. Using graph neural networks (GNNs), TH-GCN models vehicles and base stations as nodes in a dynamic graph enriched with features such as signal quality, throughput, vehicle speed, and base station load. By integrating both user equipment and base station perspectives, this dual-centric approach enables adaptive, real-time handover decisions that improve network stability. Simulation results show that TH-GCN reduces handovers by up to 78 percent and improves signal quality by 10 percent, outperforming existing methods.
[LG-33] FedRE: Robust and Effective Federated Learning with Privacy Preference
链接: https://arxiv.org/abs/2505.04889
作者: Tianzhe Xiao,Yichen Li,Yu Zhou,Yining Qi,Yi Liu,Wei Wang,Haozhao Wang,Yi Wang,Ruixuan Li
类目: Machine Learning (cs.LG); Cryptography and Security (cs.CR)
*备注:
Abstract:Although Federated Learning (FL) employs gradient aggregation at the server for distributed training to prevent the privacy leakage of raw data, private information can still be divulged through the analysis of gradients uploaded by clients. Substantial efforts have been made to integrate local differential privacy (LDP) into the system to achieve a strict privacy guarantee. However, existing methods fail to take practical issues into account by merely perturbing each sample with the same mechanism, while each client may have their own privacy preferences on privacy-sensitive information (PSI), which is not uniformly distributed across the raw data. In such a case, excessive privacy protection of privacy-insensitive information can introduce unnecessary noise, which may degrade the model performance. In this work, we study the PSI within data and develop FedRE, which can simultaneously achieve robustness and effectiveness benefits with LDP protection. More specifically, we first define PSI with regard to the privacy preferences of each client. Then, we optimize the LDP by allocating less privacy budget to gradients with higher PSI in a layer-wise manner, thus providing a stricter privacy guarantee for PSI. Furthermore, to mitigate the performance degradation caused by LDP, we design a parameter aggregation mechanism based on the distribution of the perturbed information. We conducted experiments with text tamper detection on T-SROIE and DocTamper datasets, and FedRE achieves competitive performance compared to state-of-the-art methods.
[LG-34] Fairness Perceptions in Regression-based Predictive Models
链接: https://arxiv.org/abs/2505.04886
作者: Mukund Telukunta,Venkata Sriram Siddhardh Nadendla,Morgan Stuart,Casey Canfield
类目: Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
*备注:
Abstract:Regression-based predictive analytics used in modern kidney transplantation is known to inherit biases from training data. This leads to social discrimination and inefficient organ utilization, particularly in the context of a few social groups. Despite this concern, there is limited research on fairness in regression and its impact on organ utilization and placement. This paper introduces three novel divergence-based group fairness notions: (i) independence, (ii) separation, and (iii) sufficiency to assess the fairness of regression-based analytics tools. In addition, fairness preferences are investigated from crowd feedback, in order to identify a socially accepted group fairness criterion for evaluating these tools. A total of 85 participants were recruited from the Prolific crowdsourcing platform, and a Mixed-Logit discrete choice model was used to model fairness feedback and estimate social fairness preferences. The findings clearly depict a strong preference towards the separation and sufficiency fairness notions, and that the predictive analytics is deemed fair with respect to gender and race groups, but unfair in terms of age groups.
[LG-35] Physics-informed solution reconstruction in elasticity and heat transfer using the explicit constraint force method
链接: https://arxiv.org/abs/2505.04875
作者: Conor Rowan,Kurt Maute,Alireza Doostan
类目: Computational Engineering, Finance, and Science (cs.CE); Machine Learning (cs.LG)
*备注:
Abstract:One use case of "physics-informed neural networks" (PINNs) is solution reconstruction, which aims to estimate the full-field state of a physical system from sparse measurements. Parameterized governing equations of the system are used in tandem with the measurements to regularize the regression problem. However, in real-world solution reconstruction problems, the parameterized governing equation may be inconsistent with the physical phenomena that give rise to the measurement data. We show that due to assuming consistency between the true and parameterized physics, PINNs-based approaches may fail to satisfy three basic criteria of interpretability, robustness, and data consistency. As we argue, these criteria ensure that (i) the quality of the reconstruction can be assessed, (ii) the reconstruction does not depend strongly on the choice of physics loss, and (iii) that in certain situations, the physics parameters can be uniquely recovered. In the context of elasticity and heat transfer, we demonstrate how standard formulations of the physics loss and techniques for constraining the solution to respect the measurement data lead to different "constraint forces", which we define as additional source terms arising from the constraints, and that these constraint forces can significantly influence the reconstructed solution. To avoid the potentially substantial influence of the choice of physics loss and method of constraint enforcement on the reconstructed solution, we propose the "explicit constraint force method" (ECFM) to gain control of the source term introduced by the constraint. We then show that by satisfying the criteria of interpretability, robustness, and data consistency, this approach leads to more predictable and customizable reconstructions from noisy measurement data, even when the parameterization of the missing physics is inconsistent with the measured system.
[LG-36] Steerable Scene Generation with Post Training and Inference-Time Search
链接: https://arxiv.org/abs/2505.04831
作者: Nicholas Pfaff,Hongkai Dai,Sergey Zakharov,Shun Iwase,Russ Tedrake
类目: Robotics (cs.RO); Graphics (cs.GR); Machine Learning (cs.LG)
*备注: Project website: this https URL
Abstract:Training robots in simulation requires diverse 3D scenes that reflect the specific challenges of downstream tasks. However, scenes that satisfy strict task requirements, such as high-clutter environments with plausible spatial arrangement, are rare and costly to curate manually. Instead, we generate large-scale scene data using procedural models that approximate realistic environments for robotic manipulation, and adapt it to task-specific goals. We do this by training a unified diffusion-based generative model that predicts which objects to place from a fixed asset library, along with their SE(3) poses. This model serves as a flexible scene prior that can be adapted using reinforcement learning-based post training, conditional generation, or inference-time search, steering generation toward downstream objectives even when they differ from the original data distribution. Our method enables goal-directed scene synthesis that respects physical feasibility and scales across scene types. We introduce a novel MCTS-based inference-time search strategy for diffusion models, enforce feasibility via projection and simulation, and release a dataset of over 44 million SE(3) scenes spanning five diverse environments. Website with videos, code, data, and model weights: this https URL
[LG-37] Guide your favorite protein sequence generative model
链接: https://arxiv.org/abs/2505.04823
作者: Junhao Xiong,Hunter Nisonoff,Ishan Gaur,Jennifer Listgarten
类目: Machine Learning (cs.LG); Biomolecules (q-bio.BM)
*备注:
Abstract:Generative machine learning models have begun to transform protein engineering, yet no principled framework for conditioning on auxiliary information in a plug-and-play manner exists; one may want to iteratively incorporate experimental feedback, or make use of an existing classifier – such as for predicting enzyme commission number – in order to guide the sampling of the generative model to generate sequences with desired properties. Herein, we present ProteinGuide, a rigorous and general framework to achieve just that: through unifying a broad class of protein generative models that includes masked language, (order-agnostic) autoregressive, diffusion and flow-matching models, we provide an approach to statistically condition pre-trained protein generative models. We demonstrate applicability of our approach by guiding each of two commonly used protein generative models, ProteinMPNN and ESM3, to generate amino acid and structure token sequences conditioned on several user-specified properties, namely, enhanced stability and CATH-labeled fold generation.
[LG-38] Robust ML Auditing using Prior Knowledge ICML25
链接: https://arxiv.org/abs/2505.04796
作者: Jade Garcia Bourrée,Augustin Godinot,Martijn De Vos,Milos Vujasinovic,Sayan Biswas,Gilles Tredan,Erwan Le Merrer,Anne-Marie Kermarrec
类目: Machine Learning (cs.LG)
*备注: Accepted to the 42nd International Conference on Machine Learning ICML25
Abstract:The rapid adoption of ML decision-making systems across products and services has led to a set of regulations on how such systems should behave and be built. Among all the technical challenges to enforcing these regulations, one crucial, yet under-explored problem is the risk of manipulation while these systems are being audited for fairness. This manipulation occurs when a platform deliberately alters its answers to a regulator to pass an audit without modifying its answers to other users. In this paper, we introduce a novel approach to manipulation-proof auditing by taking into account the auditor’s prior knowledge of the task solved by the platform. We first demonstrate that regulators must not rely on public priors (e.g. a public dataset), as platforms could easily fool the auditor in such cases. We then formally establish the conditions under which an auditor can prevent audit manipulations using prior knowledge about the ground truth. Finally, our experiments with two standard datasets exemplify the maximum level of unfairness a platform can hide before being detected as malicious. Our formalization and generalization of manipulation-proof auditing with a prior opens up new research directions for more robust fairness audits.
[LG-39] Prediction via Shapley Value Regression ICML2025
链接: https://arxiv.org/abs/2505.04775
作者: Amr Alkhatib,Roman Bresson,Henrik Boström,Michalis Vazirgiannis
类目: Machine Learning (cs.LG)
*备注: Accepted at ICML 2025
Abstract:Shapley values have several desirable, theoretically well-supported, properties for explaining black-box model predictions. Traditionally, Shapley values are computed post-hoc, leading to additional computational cost at inference time. To overcome this, a novel method, called ViaSHAP, is proposed, that learns a function to compute Shapley values, from which the predictions can be derived directly by summation. Two approaches to implement the proposed method are explored; one based on the universal approximation theorem and the other on the Kolmogorov-Arnold representation theorem. Results from a large-scale empirical investigation are presented, showing that ViaSHAP using Kolmogorov-Arnold Networks performs on par with state-of-the-art algorithms for tabular data. It is also shown that the explanations of ViaSHAP are significantly more accurate than the popular approximator FastSHAP on both tabular data and images.
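The statement that predictions "can be derived directly by summation" rests on the efficiency property of Shapley values: the attributions sum to the difference between the model output and a baseline output. The sketch below checks this on a tiny hand-written model by enumerating all feature orderings. It is only an illustration of that property, not ViaSHAP itself (which learns the attribution function); the toy model, the zero baseline, and the baseline-masking value function are assumptions made for the example.

```python
from itertools import permutations
from math import factorial


def exact_shapley(model, x, baseline):
    """Exact Shapley values by averaging marginal contributions over all orderings.

    Features not yet 'revealed' are held at their baseline value (an assumed
    value function; other choices exist). Cost grows as n!, so toy sizes only.
    """
    n = len(x)
    phi = [0.0] * n
    for order in permutations(range(n)):
        current = list(baseline)
        previous = model(current)
        for i in order:
            current[i] = x[i]
            value = model(current)
            phi[i] += value - previous
            previous = value
    return [p / factorial(n) for p in phi]


def toy_model(features):
    """Hypothetical model with one additive term and one interaction term."""
    a, b, c = features
    return 2.0 * a + b * c


x = [1.0, 2.0, 3.0]
baseline = [0.0, 0.0, 0.0]
phi = exact_shapley(toy_model, x, baseline)

# Efficiency: baseline output plus the sum of attributions recovers the prediction.
reconstructed = toy_model(baseline) + sum(phi)
```

Here the interaction term `b * c` is split equally between the two features involved, and `reconstructed` matches `toy_model(x)`, which is the identity a prediction-by-summation approach relies on.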
[LG-40] Primal-dual algorithm for contextual stochastic combinatorial optimization
链接: https://arxiv.org/abs/2505.04757
作者: Louis Bouvier,Thibault Prunet,Vincent Leclère,Axel Parmentier
类目: Machine Learning (cs.LG); Optimization and Control (math.OC)
*备注:
Abstract:This paper introduces a novel approach to contextual stochastic optimization, integrating operations research and machine learning to address decision-making under uncertainty. Traditional methods often fail to leverage contextual information, which underscores the necessity for new algorithms. In this study, we utilize neural networks with combinatorial optimization layers to encode policies. Our goal is to minimize the empirical risk, which is estimated from past data on uncertain parameters and contexts. To that end, we present a surrogate learning problem and a generic primal-dual algorithm that is applicable to various combinatorial settings in stochastic optimization. Our approach extends classic Fenchel-Young loss results and introduces a new regularization method using sparse perturbations on the distribution simplex. This allows for tractable updates in the original space and can accommodate diverse objective functions. We demonstrate the linear convergence of our algorithm under certain conditions and provide a bound on the non-optimality of the resulting policy in terms of the empirical risk. Experiments on a contextual stochastic minimum weight spanning tree problem show that our algorithm is efficient and scalable, achieving performance comparable to imitation learning of solutions computed using an expensive Lagrangian-based heuristic.
[LG-41] SetONet: A Deep Set-based Operator Network for Solving PDEs with permutation invariant variable input sampling
链接: https://arxiv.org/abs/2505.04738
作者: Stepan Tretiakov,Xingjian Li,Krishna Kumar
类目: Machine Learning (cs.LG)
*备注:
Abstract:Neural operators, particularly the Deep Operator Network (DeepONet), have shown promise in learning mappings between function spaces for solving differential equations. However, standard DeepONet requires input functions to be sampled at fixed locations, limiting its applicability in scenarios with variable sensor configurations, missing data, or irregular grids. We introduce the Set Operator Network (SetONet), a novel architecture that integrates Deep Sets principles into the DeepONet framework to address this limitation. The core innovation lies in the SetONet branch network, which processes the input function as an unordered set of location-value pairs. This design ensures permutation invariance with respect to the input points, making SetONet inherently robust to variations in the number and locations of sensors. SetONet learns richer, spatially-aware input representations by explicitly processing spatial coordinates and function values. We demonstrate SetONet’s effectiveness on several benchmark problems, including derivative/anti-derivative operators, 1D Darcy flow, and 2D elasticity. Results show that SetONet successfully learns operators under variable input sampling conditions where standard DeepONet fails. Furthermore, SetONet is architecturally robust to sensor drop-off, unlike standard DeepONet, which requires methods like interpolation to function with missing data. Notably, SetONet can achieve comparable or improved accuracy over DeepONet on fixed grids, particularly for nonlinear problems, likely due to its enhanced input representation. SetONet provides a flexible and robust extension to the neural operator toolkit, significantly broadening the applicability of operator learning to problems with variable or incomplete input data.
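The permutation invariance described above follows from the Deep Sets pattern: encode each (location, value) pair independently, then pool with a symmetric operation. The sketch below illustrates only that pattern with fixed, hand-written features in place of learned networks; the feature map, the mean pooling, and all names are assumptions for illustration and do not reproduce the SetONet architecture.

```python
import math
import random


def encode_point(location, value):
    """Per-point encoder (fixed toy features standing in for a learned MLP)."""
    return [value, value * math.sin(location), value * math.cos(location)]


def encode_set(points):
    """Mean-pool per-point features into one fixed-size embedding.

    Mean pooling is symmetric, so the result ignores the order of the points,
    and it is defined for any number of points (variable sensor counts).
    math.fsum returns a correctly rounded sum, so even the floating-point
    result does not depend on summation order.
    """
    features = [encode_point(loc, val) for loc, val in points]
    n = len(features)
    return [math.fsum(f[k] for f in features) / n for k in range(3)]


# Sample u(x) = x**2 at five sensor locations (hypothetical input function).
sensors = [(x, x ** 2) for x in (0.0, 0.25, 0.5, 0.75, 1.0)]

shuffled = sensors[:]
random.Random(0).shuffle(shuffled)

embedding = encode_set(sensors)
embedding_shuffled = encode_set(shuffled)  # same embedding, different order
embedding_dropped = encode_set(sensors[:3])  # still well-defined with fewer sensors
```

A fixed-grid branch network, by contrast, reads inputs positionally, so reordering or dropping sensors changes which value lands in which input slot.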
[LG-42] Conformal Prediction with Corrupted Labels: Uncertain Imputation and Robust Re-weighting
链接: https://arxiv.org/abs/2505.04733
作者: Shai Feldman,Stephen Bates,Yaniv Romano
类目: Machine Learning (cs.LG)
*备注:
Abstract:We introduce a framework for robust uncertainty quantification in situations where labeled training data are corrupted, through noisy or missing labels. We build on conformal prediction, a statistical tool for generating prediction sets that cover the test label with a pre-specified probability. The validity of conformal prediction, however, holds under the i.i.d. assumption, which does not hold in our setting due to the corruptions in the data. To account for this distribution shift, the privileged conformal prediction (PCP) method proposed leveraging privileged information (PI) – additional features available only during training – to re-weight the data distribution, yielding valid prediction sets under the assumption that the weights are accurate. In this work, we analyze the robustness of PCP to inaccuracies in the weights. Our analysis indicates that PCP can still yield valid uncertainty estimates even when the weights are poorly estimated. Furthermore, we introduce uncertain imputation (UI), a new conformal method that does not rely on weight estimation. Instead, we impute corrupted labels in a way that preserves their uncertainty. Our approach is supported by theoretical guarantees and validated empirically on both synthetic and real benchmarks. Finally, we show that these techniques can be integrated into a triply robust framework, ensuring statistically valid predictions as long as at least one underlying method is valid.
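As background for the abstract above, the sketch below shows plain split conformal prediction for regression, the baseline whose validity depends on clean, exchangeable calibration data. It does not implement PCP or uncertain imputation; the absolute-residual score, the function names, and the toy data are assumptions chosen for illustration.

```python
import math


def conformal_quantile(scores, alpha):
    """Finite-sample quantile used by split conformal prediction.

    Returns the ceil((n + 1) * (1 - alpha))-th smallest calibration score,
    or infinity when the calibration set is too small for the requested level.
    """
    n = len(scores)
    rank = math.ceil((n + 1) * (1 - alpha))
    if rank > n:
        return math.inf
    return sorted(scores)[rank - 1]


def prediction_interval(predict, calibration, x_new, alpha=0.1):
    """Interval predict(x_new) +/- q built from absolute calibration residuals."""
    scores = [abs(y - predict(x)) for x, y in calibration]
    q = conformal_quantile(scores, alpha)
    center = predict(x_new)
    return center - q, center + q


def predict(x):
    """Hypothetical pre-trained point predictor."""
    return 2 * x


# Toy calibration set (hypothetical data) with residuals 0.5, 1.0 and 2.0.
calibration = [(0, 0.5), (1, 3), (2, 2)]
interval = prediction_interval(predict, calibration, x_new=5, alpha=0.5)
```

If the calibration labels were noisy or missing, the residuals above would no longer reflect the test-time error distribution, which is the failure mode that the re-weighting and imputation methods in the abstract are designed to address.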
[LG-43] MatMMFuse: Multi-Modal Fusion model for Material Property Prediction ICLR2025
Link: https://arxiv.org/abs/2505.04634
Authors: Abhiroop Bhattacharya,Sylvain G. Cloutier
Subjects: Machine Learning (cs.LG); Computational Engineering, Finance, and Science (cs.CE)
Comments: Presented at AI for Accelerated Materials Design (AI4Mat), ICLR 2025 ( this https URL )
Abstract:The recent progress of using graph based encoding of crystal structures for high throughput material property prediction has been quite successful. However, using a single modality model prevents us from exploiting the advantages of an enhanced feature space by combining different representations. Specifically, pre-trained large language models (LLMs) can encode a large amount of knowledge which is beneficial for training of models. Moreover, the graph encoder is able to learn the local features while the text encoder is able to learn global information such as space group and crystal symmetry. In this work, we propose Material Multi-Modal Fusion (MatMMFuse), a fusion based model which uses a multi-head attention mechanism for the combination of structure aware embedding from the Crystal Graph Convolution Network (CGCNN) and text embeddings from the SciBERT model. We train our model in an end-to-end framework using data from the Materials Project Dataset. We show that our proposed model shows an improvement compared to the vanilla CGCNN and SciBERT models for all four key properties: formation energy, band gap, energy above hull and Fermi energy. Specifically, we observe an improvement of 40% compared to the vanilla CGCNN model and 68% compared to the SciBERT model for predicting the formation energy per atom. Importantly, we demonstrate the zero shot performance of the trained model on small curated datasets of Perovskites, Chalcogenides and the Jarvis Dataset. The results show that the proposed model exhibits better zero shot performance than the individual plain vanilla CGCNN and SciBERT models. This enables researchers to deploy the model for specialized industrial applications where collection of training data is prohibitively expensive.
[LG-44] Robustly optimal dynamics for active matter reservoir computing
Link: https://arxiv.org/abs/2505.05420
Authors: Mario U. Gaimann,Miriam Klopotek
Subjects: Adaptation and Self-Organizing Systems (nlin.AO); Soft Condensed Matter (cond-mat.soft); Machine Learning (cs.LG); Computational Physics (physics.comp-ph)
Comments: 55 pages, 30 figures. Supplementary Videos: this https URL . Replication Data: this https URL
Abstract:We study the information processing abilities of active matter in the reservoir computing (RC) paradigm, using a model that is externally driven to infer the future state of a chaotic signal. The simulated system closely follows a previously reported model. We uncover an exceptional dynamical regime of agent dynamics that has been overlooked heretofore. It appears robustly optimal across varying physical parameters and inference tasks, thus providing valuable insights into computation and inference with physical systems more generally. The ability to form effective mechanisms for information processing is primarily determined by the system’s own intrinsic relaxation abilities. These are identifiable when probing the system without a specific inference goal and manifest when testing minimalistic single-particle reservoirs. The regime that achieves optimal computation is situated just below the critical damping threshold, involving a microscopic dynamical relaxation with multiple stages. The optimal system is adaptable under chaotic external driving, due to a diversity in response mechanisms that emerge like rapid alternations between quasi-stationary and highly nonlinear dynamical states. Both coherent and incoherent dynamics contribute to their operation, partly at dissimilar scales of space and delay time. Correlations on agent dynamics can indicate the best-performing regimes and onsets of tight relationships between the responding system and the fluctuating driver. As this model of computation is interpretable in physical terms, it facilitates re-framing inquiries regarding learning and unconventional computing with a fresh rationale for many-body physics out of equilibrium.
[LG-45] Representing spherical tensors with scalar-based machine-learning models
Link: https://arxiv.org/abs/2505.05404
Authors: Michelangelo Domina,Filippo Bigi,Paolo Pegolo,Michele Ceriotti
Subjects: Chemical Physics (physics.chem-ph); Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments:
Abstract:Rotational symmetry plays a central role in physics, providing an elegant framework to describe how the properties of 3D objects – from atoms to the macroscopic scale – transform under the action of rigid rotations. Equivariant models of 3D point clouds are able to approximate structure-property relations in a way that is fully consistent with the structure of the rotation group, by combining intermediate representations that are themselves spherical tensors. The symmetry constraints however make this approach computationally demanding and cumbersome to implement, which motivates increasingly popular unconstrained architectures that learn approximate symmetries as part of the training process. In this work, we explore a third route to tackle this learning problem, where equivariant functions are expressed as the product of a scalar function of the point cloud coordinates and a small basis of tensors with the appropriate symmetry. We also propose approximations of the general expressions that, while lacking universal approximation properties, are fast, simple to implement, and accurate in practical settings.
[LG-46] From Sleep Staging to Spindle Detection: Evaluating End-to-End Automated Sleep Analysis
Link: https://arxiv.org/abs/2505.05371
Authors: Niklas Grieger,Siamak Mehrkanoon,Philipp Ritter,Stephan Bialonski
Subjects: Signal Processing (eess.SP); Machine Learning (cs.LG); Neurons and Cognition (q-bio.NC)
Comments: 10 pages, 4 figures, 2 tables
Abstract:Automation of sleep analysis, including both macrostructural (sleep stages) and microstructural (e.g., sleep spindles) elements, promises to enable large-scale sleep studies and to reduce variance due to inter-rater incongruencies. While individual steps, such as sleep staging and spindle detection, have been studied separately, the feasibility of automating multi-step sleep analysis remains unclear. Here, we evaluate whether a fully automated analysis using state-of-the-art machine learning models for sleep staging (RobustSleepNet) and subsequent spindle detection (SUMOv2) can replicate findings from an expert-based study of bipolar disorder. The automated analysis qualitatively reproduced key findings from the expert-based study, including significant differences in fast spindle densities between bipolar patients and healthy controls, accomplishing in minutes what previously took months to complete manually. While the results of the automated analysis differed quantitatively from the expert-based study, possibly due to biases between expert raters or between raters and the models, the models individually performed at or above inter-rater agreement for both sleep staging and spindle detection. Our results demonstrate that fully automated approaches have the potential to facilitate large-scale sleep research. We are providing public access to the tools used in our automated analysis by sharing our code and introducing SomnoBot, a privacy-preserving sleep analysis platform.
[LG-47] Operator-Level Quantum Acceleration of Non-Logconcave Sampling
Link: https://arxiv.org/abs/2505.05301
Authors: Jiaqi Leng,Zhiyan Ding,Zherui Chen,Lin Lin
Subjects: Quantum Physics (quant-ph); Machine Learning (cs.LG); Optimization and Control (math.OC)
Comments: 43 pages, 7 figures
Abstract:Sampling from probability distributions of the form \sigma \propto e^{-\beta V}, where V is a continuous potential, is a fundamental task across physics, chemistry, biology, computer science, and statistics. However, when V is non-convex, the resulting distribution becomes non-logconcave, and classical methods such as Langevin dynamics often exhibit poor performance. We introduce the first quantum algorithm that provably accelerates a broad class of continuous-time sampling dynamics. For Langevin dynamics, our method encodes the target Gibbs measure into the amplitudes of a quantum state, identified as the kernel of a block matrix derived from a factorization of the Witten Laplacian operator. This connection enables Gibbs sampling via singular value thresholding and yields the first provable quantum advantage with respect to the Poincaré constant in the non-logconcave setting. Building on this framework, we further develop the first quantum algorithm that accelerates replica exchange Langevin diffusion, a widely used method for sampling from complex, rugged energy landscapes.
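As context for the classical baseline named in this abstract, here is a minimal sketch of the unadjusted Langevin algorithm, an Euler discretization of the Langevin dynamics targeting e^{-\beta V}. It is purely classical and is not the quantum algorithm from the paper; the double-well potential V(x) = (x^2 - 1)^2, the step size, and the function names are illustrative assumptions.

```python
import numpy as np

def unadjusted_langevin(grad_V, beta, x0, step, n_steps, rng):
    """Euler discretization of dX = -grad V(X) dt + sqrt(2 / beta) dW.

    Runs one independent chain per entry of x0 and returns the final states.
    """
    x = np.array(x0, dtype=float)
    for _ in range(n_steps):
        noise = rng.normal(size=x.shape)
        x = x - step * grad_V(x) + np.sqrt(2.0 * step / beta) * noise
    return x

def double_well_grad(x):
    """Gradient of the non-convex potential V(x) = (x**2 - 1)**2 (illustrative choice)."""
    return 4.0 * x * (x**2 - 1.0)

rng = np.random.default_rng(0)
samples = unadjusted_langevin(double_well_grad, beta=1.0, x0=np.zeros(2000),
                              step=0.01, n_steps=2000, rng=rng)
```

For a non-convex potential such as this one, chains must cross the barrier between the two wells to mix, and crossings become rarer as beta grows, which is the slow-mixing regime the abstract refers to.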
[LG-48] A Connection Between Learning to Reject and Bhattacharyya Divergences
Link: https://arxiv.org/abs/2505.05273
Authors: Alexander Soen
Subjects: Machine Learning (stat.ML); Information Theory (cs.IT); Machine Learning (cs.LG)
Comments:
Abstract:Learning to reject provides a learning paradigm which allows our models to abstain from making predictions. One way to learn the rejector is to learn an ideal marginal distribution (w.r.t. the input domain), which characterizes a hypothetical best marginal distribution, and compare it to the true marginal distribution via a density ratio. In this paper, we consider learning a joint ideal distribution over both inputs and labels, and develop a link between rejection and thresholding different statistical divergences. We further find that when one considers a variant of the log-loss, the rejector obtained by considering the joint ideal distribution corresponds to the thresholding of the skewed Bhattacharyya divergence between class-probabilities. This is in contrast to the marginal case, which is equivalent to a typical characterization of optimal rejection, Chow’s Rule, and corresponds to a thresholding of the Kullback-Leibler divergence. In general, we find that rejecting via a Bhattacharyya divergence is less aggressive than Chow’s Rule.
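Chow's Rule, the reference point in this abstract, can be stated in a few lines. The sketch below assumes the classical 0-1 loss setting with a fixed rejection cost; the function name `chow_reject` is hypothetical, and the Bhattacharyya-based rejector from the paper is not implemented here.

```python
import numpy as np

def chow_reject(class_probs, cost):
    """Chow's Rule under 0-1 loss: abstain when the top posterior is below 1 - cost.

    class_probs: array of shape (n_samples, n_classes) with rows summing to 1.
    cost: price paid for abstaining (a misclassification costs 1).
    Returns a boolean array that is True where the model should abstain.
    """
    class_probs = np.asarray(class_probs, dtype=float)
    return class_probs.max(axis=-1) < 1.0 - cost

# A confident prediction is kept, an uncertain one is rejected.
decisions = chow_reject([[0.9, 0.1], [0.55, 0.45]], cost=0.3)
```

The rule abstains exactly when the expected misclassification risk, one minus the largest posterior, exceeds the cost of rejecting.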
[LG-49] A Two-Sample Test of Text Generation Similarity
Link: https://arxiv.org/abs/2505.05269
Authors: Jingbin Xu,Chen Qian,Meimei Liu,Feng Guo
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
Comments:
Abstract:The surge in digitized text data requires reliable inferential methods on observed textual patterns. This article proposes a novel two-sample text test for comparing similarity between two groups of documents. The hypothesis is whether the probabilistic mapping generating the textual data is identical across two groups of documents. The proposed test aims to assess text similarity by comparing the entropy of the documents. Entropy is estimated using neural network-based language models. The test statistic is derived from an estimation-and-inference framework, where the entropy is first approximated using an estimation set, followed by inference on the remaining data set. We showed theoretically that under mild conditions, the test statistic asymptotically follows a normal distribution. A multiple data-splitting strategy is proposed to enhance test power, which combines p-values into a unified decision. Various simulation studies and a real data example demonstrated that the proposed two-sample text test maintains the nominal Type one error rate while offering greater power compared to existing methods. The proposed method provides a novel solution to assert differences in document classes, particularly in fields where large-scale textual information is crucial.
[LG-50] ICNN-enhanced 2SP: Leveraging input convex neural networks for solving two-stage stochastic programming
Link: https://arxiv.org/abs/2505.05261
Authors: Yu Liu,Fabricio Oliveira
Subjects: Optimization and Control (math.OC); Machine Learning (cs.LG)
Comments:
Abstract:Two-stage stochastic programming (2SP) offers a basic framework for modelling decision-making under uncertainty, yet scalability remains a challenge due to the computational complexity of recourse function evaluation. Existing learning-based methods like Neural Two-Stage Stochastic Programming (Neur2SP) employ neural networks (NNs) as recourse function surrogates but rely on computationally intensive mixed-integer programming (MIP) formulations. We propose ICNN-enhanced 2SP, a method that leverages Input Convex Neural Networks (ICNNs) to exploit linear programming (LP) representability in convex 2SP problems. By architecturally enforcing convexity and enabling exact inference through LP, our approach eliminates the need for integer variables inherent to the conventional MIP-based formulation while retaining an exact embedding of the ICNN surrogate within the 2SP framework. This results in a more computationally efficient alternative that maintains solution quality. Comprehensive experiments reveal that ICNNs incur only marginally longer training times while achieving validation accuracy on par with their MIP-based counterparts. Across benchmark problems, ICNN-enhanced 2SP often exhibits considerably faster solution times than the MIP-based formulations while preserving solution quality, with these advantages becoming significantly more pronounced as problem scale increases. For the most challenging instances, the method achieves speedups of up to 100× and solution quality superior to MIP-based formulations.
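The phrase "architecturally enforcing convexity" in this abstract can be made concrete with a small input convex neural network. The sketch below is an assumption-laden illustration, not the authors' model: the layer sizes and random weights are placeholders, and convexity comes from keeping the weights on the hidden path non-negative and using ReLU, a convex and non-decreasing activation.

```python
import numpy as np

rng = np.random.default_rng(0)
d, h = 3, 8                              # input and hidden sizes (illustrative)
W0 = rng.normal(size=(d, h))             # first layer: unconstrained
Wz = np.abs(rng.normal(size=(h, h)))     # hidden-to-hidden weights: non-negative
Wx = rng.normal(size=(d, h))             # skip connection from the input: unconstrained
wz_out = np.abs(rng.normal(size=h))      # output weights on hidden units: non-negative
wx_out = rng.normal(size=d)              # linear term in the input: unconstrained

def relu(a):
    return np.maximum(a, 0.0)

def icnn(x):
    """Scalar output that is convex in x by construction."""
    z1 = relu(x @ W0)                    # convex: ReLU of an affine map
    z2 = relu(z1 @ Wz + x @ Wx)          # non-negative mix of convex terms plus affine
    return z2 @ wz_out + x @ wx_out      # non-negative mix of convex terms plus linear
```

With ReLU activations such a network is a convex piecewise-linear function of its input, which is why minimizing it can be posed with linear constraints and no integer variables.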
[LG-51] Local linear Fréchet curve regression in manifolds
Link: https://arxiv.org/abs/2505.05168
Authors: M.D. Ruiz-Medina,A. Torres–Signes
Subjects: Statistics Theory (math.ST); Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments:
Abstract:Global Fréchet functional regression has been recently addressed from time correlated bivariate curve data evaluated in a manifold (see Torres et al. 2025). For this type of curve data sets, the present paper solves the problem of local linear approximation of the Fréchet conditional mean in an extrinsic and intrinsic way. The extrinsic local linear Fréchet functional regression predictor is obtained in the time varying tangent space by projection into an orthonormal basis of the ambient Hilbert space. The conditions assumed ensure the existence and uniqueness of this predictor, and its computation via exponential and logarithmic maps. A weighted Fréchet mean approach is adopted in the computation of an intrinsic local linear Fréchet functional regression predictor. The asymptotic optimality of this intrinsic local approximation is also proved. The performance of the empirical versions of both the extrinsic and intrinsic functional predictors, and of a Nadaraya-Watson type Fréchet curve predictor, is illustrated in the simulation study undertaken. The finite-sample properties are also tested in a real-data application via cross-validation. Specifically, functional prediction of the magnetic vector field from the time-varying geocentric latitude and longitude of NASA’s MAGSAT spacecraft is addressed.
[LG-52] Overcoming Dimensional Factorization Limits in Discrete Diffusion Models through Quantum Joint Distribution Learning
Link: https://arxiv.org/abs/2505.05151
Authors: Chuangtao Chen,Qinglin Zhao,MengChu Zhou,Zhimin He,Haozhen Situ
Subjects: Quantum Physics (quant-ph); Machine Learning (cs.LG)
Comments: Comments are welcome
Abstract:This study explores quantum-enhanced discrete diffusion models to overcome classical limitations in learning high-dimensional distributions. We rigorously prove that classical discrete diffusion models, which calculate per-dimension transition probabilities to avoid exponential computational cost, exhibit worst-case linear scaling of Kullback-Leibler (KL) divergence with data dimension. To address this, we propose a Quantum Discrete Denoising Diffusion Probabilistic Model (QD3PM), which enables joint probability learning through diffusion and denoising in exponentially large Hilbert spaces. By deriving posterior states through quantum Bayes’ theorem, similar to the crucial role of posterior probabilities in classical diffusion models, and by learning the joint probability, we establish a solid theoretical foundation for quantum-enhanced diffusion models. For denoising, we design a quantum circuit using temporal information for parameter sharing and learnable classical-data-controlled rotations for encoding. Exploiting joint distribution learning, our approach enables single-step sampling from pure noise, eliminating iterative requirements of existing models. Simulations demonstrate the proposed model’s superior accuracy in modeling complex distributions compared to factorization methods. Hence, this paper establishes a new theoretical paradigm in generative models by leveraging the quantum advantage in joint distribution learning.
[LG-53] Error Analysis of Deep PDE Solvers for Option Pricing
Link: https://arxiv.org/abs/2505.05121
Authors: Jasper Rou
Subjects: Computational Finance (q-fin.CP); Machine Learning (cs.LG); Mathematical Finance (q-fin.MF)
Comments: 15 pages, 19 figures
Abstract:Option pricing often requires solving partial differential equations (PDEs). Although deep learning-based PDE solvers have recently emerged as quick solutions to this problem, their empirical and quantitative accuracy remains poorly understood, hindering their real-world applicability. In this research, our aim is to offer actionable insights into the utility of deep PDE solvers for practical option pricing implementation. Through comparative experiments in both the Black–Scholes and the Heston model, we assess the empirical performance of two neural network algorithms to solve PDEs: the Deep Galerkin Method and the Time Deep Gradient Flow method (TDGF). We determine their empirical convergence rates and training time as functions of (i) the number of sampling stages, (ii) the number of samples, (iii) the number of layers, and (iv) the number of nodes per layer. For the TDGF, we also consider the order of the discretization scheme and the number of time steps.
[LG-54] Learning dynamically inspired invariant subspaces for Koopman and transfer operator approximation
Link: https://arxiv.org/abs/2505.05085
Authors: Gary Froyland,Kevin Kühl
Subjects: Dynamical Systems (math.DS); Machine Learning (cs.LG); Numerical Analysis (math.NA)
Comments: 20 pages, 13 figures
Abstract:Transfer and Koopman operator methods offer a framework for representing complex, nonlinear dynamical systems via linear transformations, enabling a deeper understanding of the underlying dynamics. The spectra of these operators provide important insights into system predictability and emergent behaviour, although efficiently estimating them from data can be challenging. We tackle this issue through the lens of general operator and representational learning, in which we approximate these linear operators using efficient finite-dimensional representations. Specifically, we machine-learn orthonormal, locally supported basis functions that are dynamically tailored to the system. This learned basis provides a particularly accurate approximation of the operator’s action as well as a nearly invariant finite-dimensional subspace. We illustrate our approach with examples that showcase the retrieval of spectral properties from the estimated operator, and emphasise the dynamically adaptive quality of the machine-learned basis.
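The idea of approximating a Koopman operator on a finite-dimensional subspace can be sketched with a least-squares fit on a fixed dictionary, in the style of extended dynamic mode decomposition. This swaps in a hand-picked monomial basis for the learned, locally supported basis of the paper; the toy map x -> 0.9 x and all names below are illustrative assumptions.

```python
import numpy as np

def edmd_matrix(X, Y, dictionary):
    """Least-squares matrix K such that dictionary(Y) is approximately dictionary(X) @ K.

    X holds states and Y their images one time step later.
    """
    PX, PY = dictionary(X), dictionary(Y)
    K, *_ = np.linalg.lstsq(PX, PY, rcond=None)
    return K

def monomials(x):
    """Fixed dictionary {x, x^2}; for a linear map this span is exactly invariant."""
    return np.stack([x, x**2], axis=1)

x = np.linspace(-1.0, 1.0, 50)
y = 0.9 * x                               # toy dynamics: one step of x -> 0.9 x
K = edmd_matrix(x, y, monomials)
eigenvalues = np.sort(np.linalg.eigvals(K).real)
```

On this toy system the recovered eigenvalues are 0.9 and 0.81, the Koopman eigenvalues associated with x and x^2. For nonlinear systems a fixed dictionary rarely spans an invariant subspace, which is the motivation for learning the basis from data.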
[LG-55] Boosting Statistic Learning with Synthetic Data from Pretrained Large Models
Link: https://arxiv.org/abs/2505.04992
Authors: Jialong Jiang,Wenkang Hu,Jian Huang,Yuling Jiao,Xu Liu
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Applications (stat.AP)
Comments:
Abstract:The rapid advancement of generative models, such as Stable Diffusion, raises a key question: how can synthetic data from these models enhance predictive modeling? While they can generate vast amounts of datasets, only a subset meaningfully improves performance. We propose a novel end-to-end framework that generates and systematically filters synthetic data through domain-specific statistical methods, selectively integrating high-quality samples for effective augmentation. Our experiments demonstrate consistent improvements in predictive performance across various settings, highlighting the potential of our framework while underscoring the inherent limitations of generative models for data augmentation. Despite the ability to produce large volumes of synthetic data, the proportion that effectively improves model performance is limited.
[LG-56] Conformal Prediction with Cellwise Outliers: A Detect-then-Impute Approach
Link: https://arxiv.org/abs/2505.04986
Authors: Qian Peng,Yajie Bao,Haojie Ren,Zhaojun Wang,Changliang Zou
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
Comments: 23 pages, 15 figures
Abstract:Conformal prediction is a powerful tool for constructing prediction intervals for black-box models, providing a finite sample coverage guarantee for exchangeable data. However, this exchangeability is compromised when some entries of the test feature are contaminated, such as in the case of cellwise outliers. To address this issue, this paper introduces a novel framework called detect-then-impute conformal prediction. This framework first employs an outlier detection procedure on the test feature and then utilizes an imputation method to fill in those cells identified as outliers. To quantify the uncertainty in the processed test feature, we adaptively apply the detection and imputation procedures to the calibration set, thereby constructing exchangeable features for the conformal prediction interval of the test label. We develop two practical algorithms, PDI-CP and JDI-CP, and provide a distribution-free coverage analysis under some commonly used detection and imputation procedures. Notably, JDI-CP achieves a finite sample 1-2\alpha coverage guarantee. Numerical experiments on both synthetic and real datasets demonstrate that our proposed algorithms exhibit robust coverage properties and comparable efficiency to the oracle baseline.
[LG-57] Learning Linearized Models from Nonlinear Systems under Initialization Constraints with Finite Data
Link: https://arxiv.org/abs/2505.04954
Authors: Lei Xin,Baike She,Qi Dou,George Chiu,Shreyas Sundaram
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Systems and Control (eess.SY)
Comments: 12 pages, 5 figures. arXiv admin note: substantial text overlap with arXiv:2309.08805
Abstract:The identification of a linear system model from data has wide applications in control theory. The existing work that provides finite sample guarantees for linear system identification typically uses data from a single long system trajectory under i.i.d. random inputs, and assumes that the underlying dynamics is truly linear. In contrast, we consider the problem of identifying a linearized model when the true underlying dynamics is nonlinear, given that there is a certain constraint on the region where one can initialize the experiments. We provide a multiple trajectories-based deterministic data acquisition algorithm followed by a regularized least squares algorithm, and provide a finite sample error bound on the learned linearized dynamics. Our error bound shows that one can consistently learn the linearized dynamics, and demonstrates a trade-off between the error due to nonlinearity and the error due to noise. We validate our results through numerical experiments, where we also show the potential insufficiency of linear system identification using a single trajectory with i.i.d. random inputs, when nonlinearity does exist.
[LG-58] Generalization Analysis for Contrastive Representation Learning under Non-IID Settings ICML
Link: https://arxiv.org/abs/2505.04937
Authors: Nong Minh Hieu,Antoine Ledent
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
Comments: To Appear in ICML, 2025
Abstract:Contrastive Representation Learning (CRL) has achieved impressive success in various domains in recent years. Nevertheless, the theoretical understanding of the generalization behavior of CRL is limited. Moreover, to the best of our knowledge, the current literature only analyzes generalization bounds under the assumption that the data tuples used for contrastive learning are independently and identically distributed. However, in practice, we are often limited to a fixed pool of reusable labeled data points, making it inevitable to recycle data across tuples to create sufficiently large datasets. Therefore, the tuple-wise independence condition imposed by previous works is invalidated. In this paper, we provide a generalization analysis for the CRL framework under non-i.i.d. settings that adheres to practice more realistically. Drawing inspiration from the literature on U-statistics, we derive generalization bounds which indicate the required number of samples in each class scales as the logarithm of the covering number of the class of learnable feature representations associated to each class. Next, we apply our main results to derive excess risk bounds for common function classes such as linear maps and neural networks.
[LG-59] Comparative Study of Generative Models for Early Detection of Failures in Medical Devices
Link: https://arxiv.org/abs/2505.04845
Authors: Binesh Sadanandan,Bahareh Arghavani Nobar,Vahid Behzadan
Subjects: Signal Processing (eess.SP); Machine Learning (cs.LG)
Comments:
Abstract:The medical device industry has significantly advanced by integrating sophisticated electronics like microchips and field-programmable gate arrays (FPGAs) to enhance the safety and usability of life-saving devices. These complex electro-mechanical systems, however, introduce challenging failure modes that are not easily detectable with conventional methods. Effective fault detection and mitigation become vital as reliance on such electronics grows. This paper explores three generative machine learning-based approaches for fault detection in medical devices, leveraging sensor data from surgical staplers, a class 2 medical device. Historically considered low-risk, these devices have recently been linked to an increasing number of injuries and fatalities. The study evaluates the performance and data requirements of these machine-learning approaches, highlighting their potential to enhance device safety.
[LG-60] Quantum QSAR for drug discovery
Link: https://arxiv.org/abs/2505.04648
Authors: Alejandro Giraldo,Daniel Ruiz,Mariano Caruso,Guido Bellomo
Subjects: Quantum Physics (quant-ph); Machine Learning (cs.LG)
Comments:
Abstract:Quantitative Structure-Activity Relationship (QSAR) modeling is key in drug discovery, but classical methods face limitations when handling high-dimensional data and capturing complex molecular interactions. This research proposes enhancing QSAR techniques through Quantum Support Vector Machines (QSVMs), which leverage quantum computing principles to process information in Hilbert spaces. By using quantum data encoding and quantum kernel functions, we aim to develop more accurate and efficient predictive models.
[LG-61] Cryptogenic stroke and migraine: using probabilistic independence and machine learning to uncover latent sources of disease from the electronic health record
Link: https://arxiv.org/abs/2505.04631
Authors: Joshua W. Betts,John M. Still,Thomas A. Lasko
Subjects: Applications (stat.AP); Machine Learning (cs.LG)
Comments: 10 pages, 5 figures, 1 table, LaTeX. Submitted as a student paper to the American Medical Informatics Association 2025 Annual Symposium for presentation
Abstract:Migraine is a common but complex neurological disorder that doubles the lifetime risk of cryptogenic stroke (CS). However, this relationship remains poorly characterized, and few clinical guidelines exist to reduce this associated risk. We therefore propose a data-driven approach to extract probabilistically-independent sources from electronic health record (EHR) data and create a 10-year risk-predictive model for CS in migraine patients. These sources represent external latent variables acting on the causal graph constructed from the EHR data and approximate root causes of CS in our population. A random forest model trained on patient expressions of these sources demonstrated good accuracy (ROC 0.771) and identified the top 10 most predictive sources of CS in migraine patients. These sources revealed that pharmacologic interventions were the most important factor in minimizing CS risk in our population and identified a factor related to allergic rhinitis as a potential causative source of CS in migraine patients.
[LG-62] BitHEP – The Limits of Low-Precision ML in HEP
Link: https://arxiv.org/abs/2504.03387
Authors: Claudius Krause,Daohan Wang,Ramon Winterhalder
Subjects: High Energy Physics - Phenomenology (hep-ph); Machine Learning (cs.LG); High Energy Physics - Experiment (hep-ex)
Comments: 15 pages, 5 figures
Abstract:The increasing complexity of modern neural network architectures demands fast and memory-efficient implementations to mitigate computational bottlenecks. In this work, we evaluate the recently proposed BitNet architecture in HEP applications, assessing its performance in classification, regression, and generative modeling tasks. Specifically, we investigate its suitability for quark-gluon discrimination, SMEFT parameter estimation, and detector simulation, comparing its efficiency and accuracy to state-of-the-art methods. Our results show that while BitNet consistently performs competitively in classification tasks, its performance in regression and generation varies with the size and type of the network, highlighting key limitations and potential areas for improvement.
Information Retrieval
[IR-0] Artifact Sharing for Information Retrieval Research SIGIR2025
Link: https://arxiv.org/abs/2505.05434
Authors: Sean MacAvaney
Subjects: Information Retrieval (cs.IR)
Comments: SIGIR 2025 (demo)
Abstract:Sharing artifacts – such as trained models, pre-built indexes, and the code to use them – aids in reproducibility efforts by allowing researchers to validate intermediate steps and improves the sustainability of research by allowing multiple groups to build off one another’s prior computational work. Although there are de facto consensuses on how to share research code (through a git repository linked to from publications) and trained models (via HuggingFace Hub), there is no consensus for other types of artifacts, such as built indexes. Given the practical utility of using shared indexes, researchers have resorted to self-hosting these resources or performing ad hoc file transfers upon request, ultimately limiting the artifacts’ discoverability and reuse. This demonstration introduces a flexible and interoperable way to share artifacts for Information Retrieval research, improving both their accessibility and usability.
[IR-1] Stealthy LLM-Driven Data Poisoning Attacks Against Embedding-Based Retrieval-Augmented Recommender Systems
Link: https://arxiv.org/abs/2505.05196
Authors: Fatemeh Nazary,Yashar Deldjoo,Tommaso Di Noia,Eugenio Di Sciascio
Subjects: Information Retrieval (cs.IR)
Comments:
Abstract:We present a systematic study of provider-side data poisoning in retrieval-augmented recommender systems (RAG-based). By modifying only a small fraction of tokens within item descriptions – for instance, adding emotional keywords or borrowing phrases from semantically related items – an attacker can significantly promote or demote targeted items. We formalize these attacks under token-edit and semantic-similarity constraints, and we examine their effectiveness in both promotion (long-tail items) and demotion (short-head items) scenarios. Our experiments on MovieLens, using two large language model (LLM) retrieval modules, show that even subtle attacks shift final rankings and item exposures while eluding naive detection. The results underscore the vulnerability of RAG-based pipelines to small-scale metadata rewrites and emphasize the need for robust textual consistency checks and provenance tracking to thwart stealthy provider-side poisoning.
[IR-2] Hybrid Personalization Using Declarative and Procedural Memory Modules of the Cognitive Architecture ACT-R
Link: https://arxiv.org/abs/2505.05083
Authors: Kevin Innerebner,Dominik Kowald,Markus Schedl,Elisabeth Lex
Subjects: Information Retrieval (cs.IR)
Comments: Accepted for publication at the HyPer workshop, co-located with ACM UMAP 2025
Abstract:Recommender systems often rely on sub-symbolic machine learning approaches that operate as opaque black boxes. These approaches typically fail to account for the cognitive processes that shape user preferences and decision-making. In this vision paper, we propose a hybrid user modeling framework based on the cognitive architecture ACT-R that integrates symbolic and sub-symbolic representations of human memory. Our goal is to combine ACT-R’s declarative memory, which is responsible for storing symbolic chunks along sub-symbolic activations, with its procedural memory, which contains symbolic production rules. This integration will help simulate how users retrieve past experiences and apply decision-making strategies. With this approach, we aim to provide more transparent recommendations, enable rule-based explanations, and facilitate the modeling of cognitive biases. We argue that our approach has the potential to inform the design of a new generation of human-centered, psychology-informed recommender systems.
[IR-3] Divide-and-Conquer: Cold-Start Bundle Recommendation via Mixture of Diffusion Experts
Link: https://arxiv.org/abs/2505.05035
Authors: Ming Li, Lin Li, Xiaohui Tao, Dong Zhang, Jimmy Xiangji Huang
Categories: Information Retrieval (cs.IR)
Comments:
Abstract:Cold-start bundle recommendation focuses on modeling new bundles with insufficient information to provide recommendations. Advanced bundle recommendation models usually learn bundle representations from multiple views (e.g., interaction view) at both the bundle and item levels. Consequently, the cold-start problem for bundles is more challenging than that for traditional items due to the dual-level multi-view complexity. In this paper, we propose a novel Mixture of Diffusion Experts (MoDiffE) framework, which employs a divide-and-conquer strategy for cold-start bundle recommendation and follows three steps: (1) Divide: The bundle cold-start problem is divided into independent but similar sub-problems sequentially by level and view, which can be summarized as the poor representation of feature-missing bundles in prior-embedding models. (2) Conquer: Beyond prior-embedding models that fundamentally provide the embedded representations, we introduce a diffusion-based method to solve all sub-problems in a unified way, which directly generates diffusion representations using diffusion models without depending on specific features. (3) Combine: A cold-aware hierarchical Mixture of Experts (MoE) is employed to combine results of the sub-problems for final recommendations, where the two models for each view serve as experts and are adaptively fused for different bundles in a multi-layer manner. Additionally, MoDiffE adopts a multi-stage decoupled training pipeline and introduces a cold-start gating augmentation method to enable the training of gating for cold bundles. Through extensive experiments on three real-world datasets, we demonstrate that MoDiffE significantly outperforms existing solutions in handling cold-start bundle recommendation. It achieves up to a 0.1027 absolute gain in Recall@20 in cold-start scenarios and up to a 47.43% relative improvement in all-bundle scenarios.
[IR-4] LSRP: A Leader-Subordinate Retrieval Framework for Privacy-Preserving Cloud-Device Collaboration
Link: https://arxiv.org/abs/2505.05031
Authors: Yingyi Zhang, Pengyue Jia, Xianneng Li, Derong Xu, Maolin Wang, Yichao Wang, Zhaocheng Du, Huifeng Guo, Yong Liu, Ruiming Tang, Xiangyu Zhao
Categories: Information Retrieval (cs.IR)
Comments:
Abstract:Cloud-device collaboration leverages on-cloud Large Language Models (LLMs) for handling public user queries and on-device Small Language Models (SLMs) for processing private user data, collectively forming a powerful and privacy-preserving solution. However, existing approaches often fail to fully leverage the scalable problem-solving capabilities of on-cloud LLMs while underutilizing the advantage of on-device SLMs in accessing and processing personalized data. This leads to two interconnected issues: 1) Limited utilization of the problem-solving capabilities of on-cloud LLMs, which fail to align with personalized user-task needs, and 2) Inadequate integration of user data into on-device SLM responses, resulting in mismatches in contextual user information. In this paper, we propose a Leader-Subordinate Retrieval framework for Privacy-preserving cloud-device collaboration (LSRP), a novel solution that bridges these gaps by: 1) enhancing on-cloud LLM guidance to on-device SLM through a dynamic selection of task-specific leader strategies named as user-to-user retrieval-augmented generation (U-U-RAG), and 2) integrating the data advantages of on-device SLMs through small model feedback Direct Preference Optimization (SMFB-DPO) for aligning the on-cloud LLM with the on-device SLM. Experiments on two datasets demonstrate that LSRP consistently outperforms state-of-the-art baselines, significantly improving question-answer relevance and personalization, while preserving user privacy through efficient on-device retrieval. Our code is available at: this https URL.
[IR-5] Learning Item Representations Directly from Multimodal Features for Effective Recommendation
Link: https://arxiv.org/abs/2505.04960
Authors: Xin Zhou, Xiaoxiong Zhang, Dusit Niyato, Zhiqi Shen
Categories: Information Retrieval (cs.IR); Multimedia (cs.MM)
Comments: Code: this https URL
Abstract:Conventional multimodal recommender systems predominantly leverage Bayesian Personalized Ranking (BPR) optimization to learn item representations by amalgamating item identity (ID) embeddings with multimodal features. Nevertheless, our empirical and theoretical findings unequivocally demonstrate a pronounced optimization gradient bias in favor of acquiring representations from multimodal features over item ID embeddings. As a consequence, item ID embeddings frequently exhibit suboptimal characteristics despite the convergence of multimodal feature parameters. Given the rich informational content inherent in multimodal features, in this paper, we propose a novel model (i.e., LIRDRec) that learns item representations directly from these features to augment recommendation performance. Recognizing that features derived from each modality may capture disparate yet correlated aspects of items, we propose a multimodal transformation mechanism, integrated with modality-specific encoders, to effectively fuse features from all modalities. Moreover, to differentiate the influence of diverse modality types, we devise a progressive weight copying fusion module within LIRDRec. This module incrementally learns the weight assigned to each modality in synthesizing the final user or item representations. Finally, we utilize the powerful visual understanding of Multimodal Large Language Models (MLLMs) to convert the item images into texts and extract semantics embeddings upon the texts via LLMs. Empirical evaluations conducted on five real-world datasets validate the superiority of our approach relative to competing baselines. It is worth noting the proposed model, equipped with embeddings extracted from MLLMs and LLMs, can further improve the recommendation accuracy of NDCG@20 by an average of 4.21% compared to the original embeddings.
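For readers unfamiliar with the Bayesian Personalized Ranking (BPR) objective mentioned in this abstract, the following is a minimal sketch of the standard pairwise loss, -log σ(s_pos - s_neg), for a single (user, positive item, negative item) triple. It is not the LIRDRec code: the toy vectors and the helper names `dot` and `bpr_loss` are assumptions chosen for illustration, and scores are computed as plain dot products.

```python
import math


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def bpr_loss(pos_score: float, neg_score: float) -> float:
    """Pairwise BPR loss: -log(sigmoid(pos_score - neg_score)).

    Written as softplus(-(pos - neg)) in a numerically stable form.
    """
    diff = pos_score - neg_score
    if diff >= 0:
        return math.log1p(math.exp(-diff))
    return -diff + math.log1p(math.exp(diff))


# Toy embeddings (hypothetical values, not taken from the paper).
user = [0.5, -0.2, 0.1]
pos_item = [0.4, 0.0, 0.3]   # an item the user interacted with
neg_item = [-0.1, 0.2, 0.0]  # a sampled item the user did not interact with

loss = bpr_loss(dot(user, pos_item), dot(user, neg_item))
print(loss)
```

The loss shrinks as the positive item's score pulls ahead of the negative item's score, which is what drives the ranking behavior that metrics such as NDCG@20 then measure.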
[IR-6] Retrieval Augmented Generation Evaluation for Health Documents
Link: https://arxiv.org/abs/2505.04680
Authors: Mario Ceresa, Lorenzo Bertolini, Valentin Comte, Nicholas Spadaro, Barbara Raffael, Brigitte Toussaint, Sergio Consoli, Amalia Muñoz Piñeiro, Alex Patak, Maddalena Querci, Tobias Wiesenthal
Categories: Information Retrieval (cs.IR)
Comments:
Abstract:Safe and trustworthy use of Large Language Models (LLM) in the processing of healthcare documents and scientific papers could substantially help clinicians, scientists and policymakers in overcoming information overload and focusing on the most relevant information at a given moment. Retrieval Augmented Generation (RAG) is a promising method to leverage the potential of LLMs while enhancing the accuracy of their outcomes. This report assesses the potentials and shortcomings of such approaches in the automatic knowledge synthesis of different types of documents in the health domain. To this end, it describes: (1) an internally developed proof of concept pipeline that employs state-of-the-art practices to deliver safe and trustable analysis for healthcare documents and scientific papers called RAGEv (Retrieval Augmented Generation Evaluation); (2) a set of evaluation tools for LLM-based document retrieval and generation; (3) a benchmark dataset to verify the accuracy and veracity of the results called RAGEv-Bench. It concludes that careful implementations of RAG techniques could minimize most of the common problems in the use of LLMs for document processing in the health domain, obtaining very high scores both on short yes/no answers and long answers. There is a high potential for incorporating it into the day-to-day work of policy support tasks, but additional efforts are required to obtain a consistent and trustworthy tool.