This blog post presents the latest papers retrieved from Arxiv.org on 2025-05-14. It is updated automatically and organized into five broad areas: NLP, CV, ML, AI, and IR. If you would like to receive the daily list by email, please leave your email address in the comments.

Note: paper data is fetched from Arxiv.org and updated automatically at around 12:00 every day.

Friendly reminder: if you would like to receive the daily paper data by email, please leave your email address in the comments.

Table of Contents

Overview (2025-05-14)

A total of 493 papers were added today, including:

  • Natural Language Processing: 82 papers (Computation and Language (cs.CL))
  • Artificial Intelligence: 167 papers (Artificial Intelligence (cs.AI))
  • Computer Vision: 98 papers (Computer Vision and Pattern Recognition (cs.CV))
  • Machine Learning: 146 papers (Machine Learning (cs.LG))

Natural Language Processing

[NLP-0] CodePDE: An Inference Framework for LLM-driven PDE Solver Generation

【Quick Read】: This paper tackles the complexity of solving partial differential equations (PDEs): traditional numerical solvers depend on expert knowledge and are computationally expensive, while neural-network-based solvers require large training datasets and lack interpretability. The key to the solution is to recast PDE solving as a code generation task and to introduce the CodePDE framework, which leverages the capabilities of large language models (LLMs) to unlock reasoning, debugging, self-refinement, and test-time scaling without any task-specific tuning.

Link: https://arxiv.org/abs/2505.08783
Authors: Shanda Li, Tanya Marwah, Junhong Shen, Weiwei Sun, Andrej Risteski, Yiming Yang, Ameet Talwalkar
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Numerical Analysis (math.NA)
Comments:

Abstract:Partial differential equations (PDEs) are fundamental to modeling physical systems, yet solving them remains a complex challenge. Traditional numerical solvers rely on expert knowledge to implement and are computationally expensive, while neural-network-based solvers require large training datasets and often lack interpretability. In this work, we frame PDE solving as a code generation task and introduce CodePDE, the first inference framework for generating PDE solvers using large language models (LLMs). Leveraging advanced inference-time algorithms and scaling strategies, CodePDE unlocks critical capacities of LLMs for PDE solving: reasoning, debugging, self-refinement, and test-time scaling – all without task-specific tuning. CodePDE achieves superhuman performance across a range of representative PDE problems. We also present a systematic empirical analysis of LLM-generated solvers, analyzing their accuracy, efficiency, and numerical scheme choices. Our findings highlight the promise and the current limitations of LLMs in PDE solving, offering a new perspective on solver design and opportunities for future model development. Our code is available at this https URL.
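
The abstract describes an inference-time generate-execute-refine loop rather than a trained solver. A minimal sketch of such a loop, under stated assumptions: `llm` is any code-generating model call, and `run_solver` and `score` are hypothetical harness functions (not the paper's API) that execute a candidate solver and measure its error against a reference solution.

def codepde_style_loop(pde_description, llm, run_solver, score, rounds=4):
    # Generate solver code, execute it, and feed errors back for self-refinement.
    prompt = f"Write a Python numerical solver for this PDE:\n{pde_description}"
    best_code, best_err = None, float("inf")
    for _ in range(rounds):
        code = llm(prompt)
        try:
            solution = run_solver(code)           # execute the generated solver
            err = score(solution)                 # e.g. relative error vs. a reference
            feedback = f"The solver ran with relative error {err:.3g}. Improve it."
        except Exception as e:                    # debugging signal for the next round
            err, feedback = float("inf"), f"The solver crashed: {e!r}. Fix the bug."
        if err < best_err:
            best_code, best_err = code, err       # keep the best candidate (test-time scaling)
        prompt = f"{prompt}\n\nPrevious attempt:\n{code}\n\nFeedback: {feedback}"
    return best_code, best_err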

[NLP-1] HealthBench: Evaluating Large Language Models Towards Improved Human Health

【Quick Read】: This paper addresses the evaluation of large language models' performance and safety in healthcare. The key to the solution is HealthBench, an open-source benchmark built from 5,000 multi-turn conversations and 48,562 unique rubric criteria written by 262 physicians, enabling realistic, open-ended evaluation across diverse healthcare scenarios (e.g., emergencies, transforming clinical data, global health) and behavioral dimensions (e.g., accuracy, instruction following, communication). The benchmark provides a systematic evaluation framework for model development, supporting safer and more effective models for healthcare applications.

Link: https://arxiv.org/abs/2505.08775
Authors: Rahul K. Arora, Jason Wei, Rebecca Soskin Hicks, Preston Bowman, Joaquin Quiñonero-Candela, Foivos Tsimpourlas, Michael Sharman, Meghan Shah, Andrea Vallone, Alex Beutel, Johannes Heidecke, Karan Singhal
Affiliations: OpenAI
Categories: Computation and Language (cs.CL)
Comments: Blog: this https URL Code: this https URL

Abstract:We present HealthBench, an open-source benchmark measuring the performance and safety of large language models in healthcare. HealthBench consists of 5,000 multi-turn conversations between a model and an individual user or healthcare professional. Responses are evaluated using conversation-specific rubrics created by 262 physicians. Unlike previous multiple-choice or short-answer benchmarks, HealthBench enables realistic, open-ended evaluation through 48,562 unique rubric criteria spanning several health contexts (e.g., emergencies, transforming clinical data, global health) and behavioral dimensions (e.g., accuracy, instruction following, communication). HealthBench performance over the last two years reflects steady initial progress (compare GPT-3.5 Turbo’s 16% to GPT-4o’s 32%) and more rapid recent improvements (o3 scores 60%). Smaller models have especially improved: GPT-4.1 nano outperforms GPT-4o and is 25 times cheaper. We additionally release two HealthBench variations: HealthBench Consensus, which includes 34 particularly important dimensions of model behavior validated via physician consensus, and HealthBench Hard, where the current top score is 32%. We hope that HealthBench grounds progress towards model development and applications that benefit human health.

[NLP-2] Aya Vision: Advancing the Frontier of Multilingual Multimodality

【Quick Read】: This paper targets the core challenges of building multimodal language models: aligning vision and language modalities, curating high-quality instruction data, and preventing the degradation of text-only capabilities once vision is introduced. These problems are amplified in the multilingual setting, where multilingual multimodal data is scarce, machine translation distorts meaning, and catastrophic forgetting is more pronounced. The key elements of the solution are a synthetic annotation framework that curates high-quality, diverse multilingual multimodal instruction data, and a cross-modal model merging technique that mitigates catastrophic forgetting, preserving text-only capabilities while enhancing multimodal generative performance.

Link: https://arxiv.org/abs/2505.08751
Authors: Saurabh Dash, Yiyang Nan, John Dang, Arash Ahmadian, Shivalika Singh, Madeline Smith, Bharat Venkitesh, Vlad Shmyhlo, Viraat Aryabumi, Walter Beller-Morales, Jeremy Pekmez, Jason Ozuzu, Pierre Richemond, Acyr Locatelli, Nick Frosst, Phil Blunsom, Aidan Gomez, Ivan Zhang, Marzieh Fadaee, Manoj Govindassamy, Sudip Roy, Matthias Gallé, Beyza Ermis, Ahmet Üstün, Sara Hooker
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:

Abstract:Building multimodal language models is fundamentally challenging: it requires aligning vision and language modalities, curating high-quality instruction data, and avoiding the degradation of existing text-only capabilities once vision is introduced. These difficulties are further magnified in the multilingual setting, where the need for multimodal data in different languages exacerbates existing data scarcity, machine translation often distorts meaning, and catastrophic forgetting is more pronounced. To address the aforementioned challenges, we introduce novel techniques spanning both data and modeling. First, we develop a synthetic annotation framework that curates high-quality, diverse multilingual multimodal instruction data, enabling Aya Vision models to produce natural, human-preferred responses to multimodal inputs across many languages. Complementing this, we propose a cross-modal model merging technique that mitigates catastrophic forgetting, effectively preserving text-only capabilities while simultaneously enhancing multimodal generative performance. Aya-Vision-8B achieves best-in-class performance compared to strong multimodal models such as Qwen-2.5-VL-7B, Pixtral-12B, and even much larger Llama-3.2-90B-Vision. We further scale this approach with Aya-Vision-32B, which outperforms models more than twice its size, such as Molmo-72B and LLaMA-3.2-90B-Vision. Our work advances multilingual progress on the multi-modal frontier, and provides insights into techniques that effectively bend the need for compute while delivering extremely high performance.

[NLP-3] AC-Reason: Towards Theory-Guided Actual Causality Reasoning with Large Language Models

【Quick Read】: This paper addresses the limited interpretability of existing LLM-based methods for actual causality (AC) reasoning, which lack grounding in formal AC theory. The key to the solution is AC-Reason, a semi-formal reasoning framework that identifies the causally relevant events in an AC scenario, infers the values of their formal causal factors (e.g., sufficiency, necessity, and normality), and answers AC queries via a theory-guided algorithm, yielding more reliable and interpretable causal reasoning.

Link: https://arxiv.org/abs/2505.08750
Authors: Yanxi Zhang, Xin Cong, Zhong Zhang, Xiao Liu, Dongyan Zhao, Yesai Wu
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments:

Abstract:Actual causality (AC), a fundamental aspect of causal reasoning (CR), is responsible for attribution and responsibility assignment in real-world scenarios. However, existing LLM-based methods lack grounding in formal AC theory, resulting in limited interpretability. Therefore, we propose AC-Reason, a semi-formal reasoning framework that identifies causally relevant events within an AC scenario, infers the values of their formal causal factors (e.g., sufficiency, necessity, and normality), and answers AC queries via a theory-guided algorithm with explanations. While AC-Reason does not explicitly construct a causal graph, it operates over variables in the underlying causal structure to support principled reasoning. To enable comprehensive evaluation, we introduce AC-Bench, a new benchmark built upon and substantially extending Big-Bench Hard Causal Judgment (BBH-CJ). AC-Bench comprises ~1K carefully annotated samples, each with detailed reasoning steps and focuses solely on actual causation. The case study shows that synthesized samples in AC-Bench present greater challenges for LLMs. Extensive experiments on BBH-CJ and AC-Bench show that AC-Reason consistently improves LLM performance over baselines. On BBH-CJ, all tested LLMs surpass the average human rater accuracy of 69.60%, with GPT-4 + AC-Reason achieving 75.04%. On AC-Bench, GPT-4 + AC-Reason again achieves the highest accuracy of 71.82%. AC-Bench further enables fine-grained analysis of reasoning faithfulness, revealing that only Qwen-2.5-72B-Instruct, Claude-3.5-Sonnet, and GPT-4o exhibit faithful reasoning, whereas GPT-4 tends to exploit shortcuts. Finally, our ablation study proves that integrating AC theory into LLMs is highly effective, with the proposed algorithm contributing the most significant performance gains.

[NLP-4] Probability Consistency in Large Language Models: Theoretical Foundations Meet Empirical Discrepancies

【Quick Read】: This paper asks whether autoregressive large language models (LLMs) learn consistent probability distributions when trained on sequences in different token orders. The key to the solution is a proof that, for any well-defined probability distribution, sequence perplexity is invariant under any factorization (forward, backward, or arbitrary permutations). This gives a rigorous theoretical foundation for studying how LLMs learn from data and defines principled protocols for empirical evaluation. Applying these protocols exposes critical methodological flaws in prior studies of ordering effects; retraining GPT-2 models on scientific text in forward, backward, and arbitrarily permuted orders reveals systematic deviations attributable to positional and locality biases in self-attention.

Link: https://arxiv.org/abs/2505.08739
Authors: Xiaoliang Luo, Xinyi Xu, Michael Ramscar, Bradley C. Love
Affiliations: EmpiriQal Inc.; London School of Economics and Political Science; University of Tübingen; Los Alamos National Laboratory
Categories: Computation and Language (cs.CL)
Comments:

Abstract:Can autoregressive large language models (LLMs) learn consistent probability distributions when trained on sequences in different token orders? We prove formally that for any well-defined probability distribution, sequence perplexity is invariant under any factorization, including forward, backward, or arbitrary permutations. This result establishes a rigorous theoretical foundation for studying how LLMs learn from data and defines principled protocols for empirical evaluation. Applying these protocols, we show that prior studies examining ordering effects suffer from critical methodological flaws. We retrain GPT-2 models across forward, backward, and arbitrary permuted orders on scientific text. We find systematic deviations from theoretical invariance across all orderings with arbitrary permutations strongly deviating from both forward and backward models, which largely (but not completely) agreed with one another. Deviations were traceable to differences in self-attention, reflecting positional and locality biases in processing. Our theoretical and empirical results provide novel avenues for understanding positional biases in LLMs and suggest methods for detecting when LLMs’ probability distributions are inconsistent and therefore untrustworthy.
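
The invariance result is a direct consequence of the chain rule of probability: every token order is just a different factorization of the same joint distribution. In our notation (a sketch, not the paper's exact statement), for any permutation \sigma of positions 1,\dots,T:

P(x_1,\dots,x_T) \;=\; \prod_{t=1}^{T} P\big(x_{\sigma(t)} \mid x_{\sigma(1)},\dots,x_{\sigma(t-1)}\big),

so the perplexity \mathrm{PPL}(x) = P(x_1,\dots,x_T)^{-1/T} is identical under forward, backward, or arbitrarily permuted factorizations of a well-defined distribution. Measured disagreement between models trained on different orders therefore signals that the learned distributions are inconsistent.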

[NLP-5] NurValues: Real-World Nursing Values Evaluation for Large Language Models in Clinical Context

【Quick Read】: This paper addresses nursing value alignment, i.e., making large language models (LLMs) adhere to the core values of nursing in clinical scenarios. The key to the solution is a benchmark of 2,200 annotated instances built around five core value dimensions distilled from international nursing codes (Altruism, Human Dignity, Integrity, Justice, and Professionalism): real-world nursing behavior instances are paired with LLM-generated counterfactual cases, forming Easy- and Hard-Level datasets that assess models' value alignment at different levels of complexity.

Link: https://arxiv.org/abs/2505.08734
Authors: Ben Yao, Qiuchi Li, Yazhou Zhang, Siyu Yang, Bohan Zhang, Prayag Tiwari, Jing Qin
Affiliations: The Hong Kong Polytechnic University; University of Copenhagen; Tianjin University; Tongji Hospital of Tongji Medical College of Huazhong University of Science and Technology; Halmstad University
Categories: Computation and Language (cs.CL)
Comments: 25 pages, 10 figures, 16 tables

Abstract:This work introduces the first benchmark for nursing value alignment, consisting of five core value dimensions distilled from international nursing codes: Altruism, Human Dignity, Integrity, Justice, and Professionalism. The benchmark comprises 1,100 real-world nursing behavior instances collected through a five-month longitudinal field study across three hospitals of varying tiers. These instances are annotated by five clinical nurses and then augmented with LLM-generated counterfactuals with reversed ethic polarity. Each original case is paired with a value-aligned and a value-violating version, resulting in 2,200 labeled instances that constitute the Easy-Level dataset. To increase adversarial complexity, each instance is further transformed into a dialogue-based format that embeds contextual cues and subtle misleading signals, yielding a Hard-Level dataset. We evaluate 23 state-of-the-art (SoTA) LLMs on their alignment with nursing values. Our findings reveal three key insights: (1) DeepSeek-V3 achieves the highest performance on the Easy-Level dataset (94.55), where Claude 3.5 Sonnet outperforms other models on the Hard-Level dataset (89.43), significantly surpassing the medical LLMs; (2) Justice is consistently the most difficult nursing value dimension to evaluate; and (3) in-context learning significantly improves alignment. This work aims to provide a foundation for value-sensitive LLMs development in clinical settings. The dataset and the code are available at this https URL.

[NLP-6] Memorization-Compression Cycles Improve Generalization

【Quick Read】: This paper studies how to improve the generalization of large language models (LLMs) during pretraining, in particular by optimizing the compression and memorization of internal representations. The key to the solution is the Information Bottleneck Language Modeling (IBLM) objective, which reframes language modeling as a constrained optimization problem: minimize representation entropy subject to optimal prediction performance. This view reveals a memorization-compression cycle during pretraining and motivates the Gated Phase Transition (GAPT) training algorithm, which adaptively switches between memorization and compression phases to improve generalization and resistance to interference.

Link: https://arxiv.org/abs/2505.08727
Authors: Fangyuan Yu
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Information Theory (cs.IT)
Comments: 12 pages, 6 figures

Abstract:We prove theoretically that generalization improves not only through data scaling but also by compressing internal representations. To operationalize this insight, we introduce the Information Bottleneck Language Modeling (IBLM) objective, which reframes language modeling as a constrained optimization problem: minimizing representation entropy subject to optimal prediction performance. Empirically, we observe an emergent memorization-compression cycle during LLM pretraining, evidenced by oscillating positive/negative gradient alignment between cross-entropy and Matrix-Based Entropy (MBE), a measure of representation entropy. This pattern closely mirrors the predictive-compressive trade-off prescribed by IBLM and also parallels the biological alternation between awake learning and sleep consolidation. Motivated by this observation, we propose Gated Phase Transition (GAPT), a training algorithm that adaptively switches between memorization and compression phases. When applied to GPT-2 pretraining on the FineWeb dataset, GAPT reduces MBE by 50% and improves cross-entropy by 4.8%. GAPT improves OOD generalization by 35% in a pretraining task on arithmetic multiplication. In a setting designed to simulate catastrophic forgetting, GAPT reduces interference by compressing and separating representations, achieving a 97% improvement in separation - paralleling the functional role of sleep consolidation.
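
In constrained-optimization form, the IBLM objective described above can be sketched as follows (our notation, not the paper's exact statement: Z_\theta denotes the internal representations, H entropy, and \mathcal{L}_{\mathrm{CE}} the next-token cross-entropy):

\min_{\theta} \; H\big(Z_\theta\big) \quad \text{subject to} \quad \mathcal{L}_{\mathrm{CE}}(\theta) \le \mathcal{L}^{*},

or, in penalized form, \min_{\theta} \; \mathcal{L}_{\mathrm{CE}}(\theta) + \beta\, H(Z_\theta) — the classic information-bottleneck trade-off between prediction and compression, with the paper measuring the entropy term via Matrix-Based Entropy (MBE).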

[NLP-7] LLM-based Prompt Ensemble for Reliable Medical Entity Recognition from EHRs

【Quick Read】: This paper targets named entity recognition (NER) in electronic health records (EHRs), extracting key medical entities such as problems, tests, and treatments to support downstream clinical applications. The key to the solution is to use large language models (LLMs), specifically GPT-4o and DeepSeek-R1, with a range of prompt engineering techniques, including zero-shot, few-shot, and ensemble approaches. GPT-4o with a prompt ensemble performs best, reaching an F1-score of 0.95 and recall of 0.98 and outperforming DeepSeek-R1, while the ensemble improves reliability by aggregating outputs through embedding-based similarity and majority voting.

Link: https://arxiv.org/abs/2505.08704
Authors: K M Sajjadul Islam, Ayesha Siddika Nipu, Jiawei Wu, Praveen Madiraju
Affiliations: Marquette University; University of Wisconsin-Milwaukee; Medical College of Wisconsin
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: IEEE 26th International Conference on Information Reuse and Integration for Data Science (IRI 2025), San Jose, CA, USA

Abstract:Electronic Health Records (EHRs) are digital records of patient information, often containing unstructured clinical text. Named Entity Recognition (NER) is essential in EHRs for extracting key medical entities like problems, tests, and treatments to support downstream clinical applications. This paper explores prompt-based medical entity recognition using large language models (LLMs), specifically GPT-4o and DeepSeek-R1, guided by various prompt engineering techniques, including zero-shot, few-shot, and an ensemble approach. Among all strategies, GPT-4o with prompt ensemble achieved the highest classification performance with an F1-score of 0.95 and recall of 0.98, outperforming DeepSeek-R1 on the task. The ensemble method improved reliability by aggregating outputs through embedding-based similarity and majority voting.
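
The paper does not publish its aggregation code; as a rough illustration of the majority-voting half of the ensemble (the embedding-similarity step is omitted), a minimal sketch with made-up entities:

from collections import Counter

def ensemble_entities(runs):
    # Keep an entity if it appears in a strict majority of prompt runs.
    votes = Counter(ent for run in runs for ent in set(run))
    return {ent for ent, n in votes.items() if n > len(runs) / 2}

# Three prompt variants over one (hypothetical) clinical note
runs = [
    {("chest pain", "PROBLEM"), ("ECG", "TEST")},
    {("chest pain", "PROBLEM"), ("ECG", "TEST"), ("aspirin", "TREATMENT")},
    {("chest pain", "PROBLEM"), ("aspirin", "TREATMENT")},
]
print(ensemble_entities(runs))  # all three entities survive the vote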

[NLP-8] Adaptive Schema-aware Event Extraction with Retrieval-Augmented Generation

【Quick Read】: This paper tackles two key gaps in event extraction (EE): the rigid schema fixation of existing pipeline systems, and the absence of benchmarks for evaluating joint schema matching and extraction. The key to the proposed solution, Adaptive Schema-aware Event Extraction (ASEE), is to combine schema paraphrasing with schema retrieval-augmented generation, enabling flexible adaptation to diverse schemas and accurate generation of the targeted structures.

Link: https://arxiv.org/abs/2505.08690
Authors: Sheng Liang, Hang Lv, Zhihao Wen, Yaxiong Wu, Yongyue Zhang, Hao Wang, Yong Liu
Affiliations: Huawei Noah's Ark Lab; University of Science and Technology of China
Categories: Computation and Language (cs.CL)
Comments: 15 pages, 3 figures

Abstract:Event extraction (EE) is a fundamental task in natural language processing (NLP) that involves identifying and extracting event information from unstructured text. Effective EE in real-world scenarios requires two key steps: selecting appropriate schemas from hundreds of candidates and executing the extraction process. Existing research exhibits two critical gaps: (1) the rigid schema fixation in existing pipeline systems, and (2) the absence of benchmarks for evaluating joint schema matching and extraction. Although large language models (LLMs) offer potential solutions, their schema hallucination tendencies and context window limitations pose challenges for practical deployment. In response, we propose Adaptive Schema-aware Event Extraction (ASEE), a novel paradigm combining schema paraphrasing with schema retrieval-augmented generation. ASEE adeptly retrieves paraphrased schemas and accurately generates targeted structures. To facilitate rigorous evaluation, we construct the Multi-Dimensional Schema-aware Event Extraction (MD-SEE) benchmark, which systematically consolidates 12 datasets across diverse domains, complexity levels, and language settings. Extensive evaluations on MD-SEE show that our proposed ASEE demonstrates strong adaptability across various scenarios, significantly improving the accuracy of event extraction.

[NLP-9] Revealing economic facts: LLMs know more than they say

【Quick Read】: This paper asks how the hidden states of large language models (LLMs) can be used to estimate and impute economic and financial statistics. The key to the solution is to train simple linear models on the hidden states rather than using the models' text outputs; the results show that hidden states capture richer economic information than the LLMs' direct responses reveal. Only a few dozen labelled examples suffice for training, and a transfer learning method is proposed that improves estimation accuracy without any labelled data for the target variable.

Link: https://arxiv.org/abs/2505.08662
Authors: Marcus Buckmann, Quynh Anh Nguyen, Edward Hill
Affiliations: Bank of England
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG); General Economics (econ.GN)
Comments: 34 pages, 17 figures

Abstract:We investigate whether the hidden states of large language models (LLMs) can be used to estimate and impute economic and financial statistics. Focusing on county-level (e.g. unemployment) and firm-level (e.g. total assets) variables, we show that a simple linear model trained on the hidden states of open-source LLMs outperforms the models’ text outputs. This suggests that hidden states capture richer economic information than the responses of the LLMs reveal directly. A learning curve analysis indicates that only a few dozen labelled examples are sufficient for training. We also propose a transfer learning method that improves estimation accuracy without requiring any labelled data for the target variable. Finally, we demonstrate the practical utility of hidden-state representations in super-resolution and data imputation tasks.
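
A minimal sketch of this probing setup, assuming Hugging Face transformers and scikit-learn; gpt2 stands in for the larger open-source LLMs used in the paper, and the prompts and labels below are made-up illustrations, not the paper's data:

import numpy as np
import torch
from sklearn.linear_model import Ridge
from transformers import AutoModel, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModel.from_pretrained("gpt2", output_hidden_states=True)

def hidden_state(text, layer=-1):
    # Mean-pooled hidden state of `text` at the chosen layer
    with torch.no_grad():
        out = model(**tok(text, return_tensors="pt"))
    return out.hidden_states[layer].mean(dim=1).squeeze().numpy()

prompts = [f"The unemployment rate in {c} is" for c in
           ["Autauga County, Alabama", "Cook County, Illinois"]]
labels = np.array([2.9, 4.6])  # illustrative values only

X = np.stack([hidden_state(p) for p in prompts])
probe = Ridge(alpha=1.0).fit(X, labels)  # a linear model on hidden states, as in the paper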

[NLP-10] Scaling Context Not Parameters: Training a Compact 7B Language Model for Efficient Long-Context Processing ACL2025

【Quick Read】: This paper addresses practical limitations in long-context training to support real-world tasks such as compliance monitoring and verification. The key to the solution is MegaBeam-Mistral-7B, a language model that handles a 512K-token context length and performs strongly on three long-context benchmarks: superior in-context learning on HELMET, robust retrieval and tracing on RULER, and competitive long-range reasoning on BABILong without RAG or targeted fine-tuning.

Link: https://arxiv.org/abs/2505.08651
Authors: Chen Wu, Yin Song
Affiliations: Amazon Web Services
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: 8 pages, 6 figures, ACL 2025 (Industry Track)

Abstract:We present MegaBeam-Mistral-7B, a language model that supports 512K-token context length. Our work addresses practical limitations in long-context training, supporting real-world tasks such as compliance monitoring and verification. Evaluated on three long-context benchmarks, our 7B-parameter model demonstrates superior in-context learning performance on HELMET and robust retrieval and tracing capability on RULER. It is currently the only open model to achieve competitive long-range reasoning on BABILong at 512K context length without RAG or targeted fine-tuning. Released as fully open source under the Apache 2.0 license, the model has been downloaded over 100,000 times on Hugging Face. Model available at: this https URL

[NLP-11] TRAIL: Trace Reasoning and Agentic Issue Localization

【Quick Read】: This paper addresses the difficulty of scalably and systematically evaluating the complex traces produced by increasingly widespread agentic workflows. Current evaluation relies on manual, domain-specific human analysis, which does not scale with the complexity and volume of agentic outputs. The key to the solution is a robust, dynamic evaluation methodology: a dataset of 148 human-annotated traces (TRAIL), grounded in a formal taxonomy of common error types in agentic systems, supporting more effective error analysis and debugging.

Link: https://arxiv.org/abs/2505.08638
Authors: Darshan Deshpande, Varun Gangal, Hersh Mehta, Jitin Krishnan, Anand Kannappan, Rebecca Qian
Affiliations: Patronus AI
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: Dataset link: this https URL

Abstract:The increasing adoption of agentic workflows across diverse domains brings a critical need to scalably and systematically evaluate the complex traces these systems generate. Current evaluation methods depend on manual, domain-specific human analysis of lengthy workflow traces - an approach that does not scale with the growing complexity and volume of agentic outputs. Error analysis in these settings is further complicated by the interplay of external tool outputs and language model reasoning, making it more challenging than traditional software debugging. In this work, we (1) articulate the need for robust and dynamic evaluation methods for agentic workflow traces, (2) introduce a formal taxonomy of error types encountered in agentic systems, and (3) present a set of 148 large human-annotated traces (TRAIL) constructed using this taxonomy and grounded in established agentic benchmarks. To ensure ecological validity, we curate traces from both single and multi-agent systems, focusing on real-world applications such as software engineering and open-world information retrieval. Our evaluations reveal that modern long context LLMs perform poorly at trace debugging, with the best Gemini-2.5-pro model scoring a mere 11% on TRAIL. Our dataset and code are made publicly available to support and accelerate future research in scalable evaluation for agentic workflows.

[NLP-12] Visually Guided Decoding: Gradient-Free Hard Prompt Inversion with Language Models ICLR2025

【Quick Read】: This paper addresses the difficulty of crafting effective textual prompts for text-to-image generative models, where existing prompt inversion methods fall short in interpretability and prompt coherence. The key to the solution is Visually Guided Decoding (VGD), a gradient-free method that combines the text-generation capabilities of large language models (LLMs) with CLIP-based alignment to produce semantically consistent, human-readable prompts, improving the interpretability, generalization, and flexibility of prompt generation.

Link: https://arxiv.org/abs/2505.08622
Authors: Donghoon Kim, Minji Bae, Kyuhong Shim, Byonghyo Shim
Affiliations: Seoul National University; Sungkyunkwan University
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Comments: ICLR 2025

Abstract:Text-to-image generative models like DALL-E and Stable Diffusion have revolutionized visual content creation across various applications, including advertising, personalized media, and design prototyping. However, crafting effective textual prompts to guide these models remains challenging, often requiring extensive trial and error. Existing prompt inversion approaches, such as soft and hard prompt techniques, are not so effective due to the limited interpretability and incoherent prompt generation. To address these issues, we propose Visually Guided Decoding (VGD), a gradient-free approach that leverages large language models (LLMs) and CLIP-based guidance to generate coherent and semantically aligned prompts. In essence, VGD utilizes the robust text generation capabilities of LLMs to produce human-readable prompts. Further, by employing CLIP scores to ensure alignment with user-specified visual concepts, VGD enhances the interpretability, generalization, and flexibility of prompt generation without the need for additional training. Our experiments demonstrate that VGD outperforms existing prompt inversion techniques in generating understandable and contextually relevant prompts, facilitating more intuitive and controllable interactions with text-to-image models.

[NLP-13] Automatic Task Detection and Heterogeneous LLM Speculative Decoding

【Quick Read】: This paper addresses the trade-off between acceptance rate and decoding speed in existing speculative decoding methods, which stems from the limited capacity of the draft model and makes it hard to guarantee efficiency across diverse downstream tasks. The key to the solution is a speculative decoding algorithm tailored for downstream task optimization: an automatic task partitioning and assignment mechanism categorizes downstream tasks into sub-tasks and assigns them to a set of heterogeneous draft models, each aligned with the target model on task-specific data to improve the consistency of inference results, while an online lightweight prompt classifier dynamically routes prompts to the appropriate draft model.

Link: https://arxiv.org/abs/2505.08600
Authors: Danying Ge, Jianhua Gao, Qizhi Jiang, Yifei Feng, Weixing Ji
Affiliations: Beijing Normal University
Categories: Computation and Language (cs.CL)
Comments: 10 pages, 10 figures, 2 tables

Abstract:Speculative decoding, which combines a draft model with a target model, has emerged as an effective approach to accelerate large language model (LLM) inference. However, existing methods often face a trade-off between the acceptance rate and decoding speed in downstream tasks due to the limited capacity of the draft model, making it difficult to ensure efficiency across diverse tasks. To address this problem, we propose a speculative decoding algorithm tailored for downstream task optimization. It includes an automatic task partitioning and assigning method, which automatically categorizes downstream tasks into different sub-tasks and assigns them to a set of heterogeneous draft models. Each draft model is aligned with the target model using task-specific data, thereby enhancing the consistency of inference results. In addition, our proposed method incorporates an online lightweight prompt classifier to dynamically route prompts to the appropriate draft model. Experimental results demonstrate that the proposed method improves draft accuracy by 6% to 50% over vanilla speculative decoding, while achieving a speedup of 1.10x to 2.64x in LLM inference.
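
For readers unfamiliar with the underlying mechanism, here is a bare-bones greedy draft-and-verify loop; the paper's contribution (task partitioning, heterogeneous draft models, prompt routing) sits on top of a loop like this. `draft_next` and `target_next` are assumed callables returning a next-token probability vector; production systems verify all draft tokens in one batched target pass and use rejection sampling to preserve the target distribution exactly.

import numpy as np

def speculative_decode(prefix, draft_next, target_next, k=4, max_new=64):
    tokens = list(prefix)
    while len(tokens) - len(prefix) < max_new:
        # 1) draft k tokens greedily with the cheap model
        draft, ctx = [], list(tokens)
        for _ in range(k):
            t = int(np.argmax(draft_next(ctx)))
            draft.append(t)
            ctx.append(t)
        # 2) verify with the target model; keep the agreeing prefix
        for t in draft:
            t_star = int(np.argmax(target_next(tokens)))
            if t_star == t:
                tokens.append(t)       # draft token accepted
            else:
                tokens.append(t_star)  # first disagreement: target overrides
                break                  # discard the rest of the draft
    return tokens

A higher acceptance rate in step 2 means fewer expensive target calls per emitted token, which is exactly what routing prompts to a task-aligned draft model improves.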

[NLP-14] Enhancing Thyroid Cytology Diagnosis with RAG-Optimized LLMs and Pathology Foundation Models

【Quick Read】: This paper addresses challenges in thyroid fine-needle aspiration cytology, including difficult cytological interpretation, insufficient standardization, and limited diagnostic accuracy. The key to the solution is to combine retrieval-augmented generation (RAG) with domain-specific foundation models: a curated knowledge base lets large language models (LLMs) dynamically retrieve relevant case studies, diagnostic criteria, and expert interpretations to improve contextual understanding, while pathology foundation models trained on high-resolution pathology images strengthen feature extraction and classification. Their fusion improves diagnostic consistency, reduces variability, and supports pathologists in distinguishing benign from malignant thyroid lesions.

Link: https://arxiv.org/abs/2505.08590
Authors: Hussien Al-Asi, Jordan P Reynolds, Shweta Agarwal, Bryan J Dangott, Aziza Nassar, Zeynettin Akkus
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Quantitative Methods (q-bio.QM)
Comments:

Abstract:Advancements in artificial intelligence (AI) are transforming pathology by integrating large language models (LLMs) with retrieval-augmented generation (RAG) and domain-specific foundation models. This study explores the application of RAG-enhanced LLMs coupled with pathology foundation models for thyroid cytology diagnosis, addressing challenges in cytological interpretation, standardization, and diagnostic accuracy. By leveraging a curated knowledge base, RAG facilitates dynamic retrieval of relevant case studies, diagnostic criteria, and expert interpretation, improving the contextual understanding of LLMs. Meanwhile, pathology foundation models, trained on high-resolution pathology images, refine feature extraction and classification capabilities. The fusion of these AI-driven approaches enhances diagnostic consistency, reduces variability, and supports pathologists in distinguishing benign from malignant thyroid lesions. Our results demonstrate that integrating RAG with pathology-specific LLMs significantly improves diagnostic efficiency and interpretability, paving the way for AI-assisted thyroid cytopathology, with foundation model UNI achieving AUC 0.73-0.93 for correct prediction of surgical pathology diagnosis from thyroid cytology samples.

[NLP-15] Small but Significant: On the Promise of Small Language Models for Accessible AIED

【Quick Read】: This paper argues that the AIED field's predominant focus on large language models (LLMs), and resource-intensive GPT-style generative AI in particular, risks neglecting the potential of small language models (SLMs) to give resource-constrained institutions equitable and affordable access to high-quality AI tools. The key to the solution is to show that SLMs such as Phi-2 can effectively tackle critical AIED challenges, for example knowledge component (KC) discovery, without elaborate prompting strategies, underscoring the case for developing SLM-based AIED approaches.

Link: https://arxiv.org/abs/2505.08588
Authors: Yumou Wei, Paulo Carvalho, John Stamper
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Human-Computer Interaction (cs.HC)
Comments: This vision paper advocates using small language models (e.g., Phi-2) in AI for education (AIED)

Abstract:GPT has become nearly synonymous with large language models (LLMs), an increasingly popular term in AIED proceedings. A simple keyword-based search reveals that 61% of the 76 long and short papers presented at AIED 2024 describe novel solutions using LLMs to address some of the long-standing challenges in education, and 43% specifically mention GPT. Although LLMs pioneered by GPT create exciting opportunities to strengthen the impact of AI on education, we argue that the field’s predominant focus on GPT and other resource-intensive LLMs (with more than 10B parameters) risks neglecting the potential impact that small language models (SLMs) can make in providing resource-constrained institutions with equitable and affordable access to high-quality AI tools. Supported by positive results on knowledge component (KC) discovery, a critical challenge in AIED, we demonstrate that SLMs such as Phi-2 can produce an effective solution without elaborate prompting strategies. Hence, we call for more attention to developing SLM-based AIED approaches.

[NLP-16] Are We Paying Attention to Her? Investigating Gender Disambiguation and Attention in Machine Translation

【Quick Read】: This paper addresses gender bias in modern neural machine translation (NMT) systems, in particular the failure of traditional evaluation metrics to capture how fully a system integrates contextual gender cues. The key to the solution is a new evaluation metric, Minimal Pair Accuracy (MPA), which measures a model's reliance on gender cues for gender disambiguation by focusing on its behavior on minimal pairs: sentence pairs that differ only in the gendered pronoun. This probes more directly whether a model actually exploits gender cues to disambiguate gender.

Link: https://arxiv.org/abs/2505.08546
Authors: Chiara Manna, Afra Alishahi, Frédéric Blain, Eva Vanmassenhove
Affiliations: Tilburg University
Categories: Computation and Language (cs.CL)
Comments:

Abstract:While gender bias in modern Neural Machine Translation (NMT) systems has received much attention, traditional evaluation metrics fail to fully capture the extent to which these systems integrate contextual gender cues. We propose a novel evaluation metric called Minimal Pair Accuracy (MPA), which measures the reliance of models on gender cues for gender disambiguation. MPA is designed to go beyond surface-level gender accuracy metrics by focusing on whether models adapt to gender cues in minimal pairs – sentence pairs that differ solely in the gendered pronoun, namely the explicit indicator of the target's entity gender in the source language (EN). We evaluate a number of NMT models on the English-Italian (EN–IT) language pair using this metric and show that they ignore available gender cues in most cases in favor of (statistical) stereotypical gender interpretation. We further show that in anti-stereotypical cases, these models tend to more consistently take masculine gender cues into account while ignoring the feminine cues. Furthermore, we analyze the attention head weights in the encoder component and show that while all models encode gender information to some extent, masculine cues elicit a more diffused response compared to the more concentrated and specialized responses to feminine gender cues.
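
The paper defines MPA precisely; the following is only one plausible operationalization of the idea, with `translate` an EN→IT system and `gender_of` a hypothetical morphological check on the Italian output:

def minimal_pair_accuracy(translate, gender_of, pairs):
    # Fraction of minimal pairs where the translation realizes the gender
    # indicated by the source pronoun in *both* members of the pair.
    hits = 0
    for masc_src, fem_src in pairs:  # e.g. ("He is a teacher", "She is a teacher")
        hits += (gender_of(translate(masc_src)) == "masc"
                 and gender_of(translate(fem_src)) == "fem")
    return hits / len(pairs)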

[NLP-17] Reassessing Graph Linearization for Sequence-to-sequence AMR Parsing: On the Advantages and Limitations of Triple-Based Encoding EMNLP2025

【Quick Read】: This paper addresses limitations of the standard Penman encoding used to linearize Abstract Meaning Representation (AMR) graphs: in deep graphs, closely related nodes can end up far apart in the linearized text, and the tree-based encoding requires inverse roles to handle node re-entrancy, doubling the number of relation types to predict. The key to the solution is a triple-based linearization method that represents nested graph structure more directly.

Link: https://arxiv.org/abs/2505.08504
Authors: Jeongwoo Kang, Maximin Coavoux, Cédric Lopez, Didier Schwab
Affiliations: Univ. Grenoble Alpes, CNRS, Grenoble INP, LIG, 38000 Grenoble, France; Emvista, Immeuble Le 610, 10 Rue Louis Breguet Bâtiment D, 34830 Jacou, France
Categories: Computation and Language (cs.CL)
Comments: published at Insights from Negative Results in NLP (workshop EMNLP 2025)

Abstract:Sequence-to-sequence models are widely used to train Abstract Meaning Representation (Banarescu et al., 2013, AMR) parsers. To train such models, AMR graphs have to be linearized into a one-line text format. While Penman encoding is typically used for this purpose, we argue that it has limitations: (1) for deep graphs, some closely related nodes are located far apart in the linearized text (2) Penman’s tree-based encoding necessitates inverse roles to handle node re-entrancy, doubling the number of relation types to predict. To address these issues, we propose a triple-based linearization method and compare its efficiency with Penman linearization. Although triples are well suited to represent a graph, our results suggest room for improvement in triple encoding to better compete with Penman’s concise and explicit representation of a nested graph structure.

[NLP-18] LCES: Zero-shot Automated Essay Scoring via Pairwise Comparisons Using Large Language Models

【Quick Read】: This paper addresses the deviation of zero-shot automated essay scoring (AES) from human judgments caused by model bias and inconsistent scoring. The key to the solution is LLM-based Comparative Essay Scoring (LCES), which casts AES as a pairwise comparison task: LLMs are instructed to judge which of two essays is better, many such comparisons are collected, and RankNet converts them into continuous scores, improving both accuracy and scalability.

Link: https://arxiv.org/abs/2505.08498
Authors: Takumi Shibata, Yuichi Miyamura
Affiliations: Deloitte Analytics R&D, Deloitte Touche Tohmatsu LLC
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 14 pages, 4 figures

Abstract:Recent advances in large language models (LLMs) have enabled zero-shot automated essay scoring (AES), providing a promising way to reduce the cost and effort of essay scoring in comparison with manual grading. However, most existing zero-shot approaches rely on LLMs to directly generate absolute scores, which often diverge from human evaluations owing to model biases and inconsistent scoring. To address these limitations, we propose LLM-based Comparative Essay Scoring (LCES), a method that formulates AES as a pairwise comparison task. Specifically, we instruct LLMs to judge which of two essays is better, collect many such comparisons, and convert them into continuous scores. Considering that the number of possible comparisons grows quadratically with the number of essays, we improve scalability by employing RankNet to efficiently transform LLM preferences into scalar scores. Experiments using AES benchmark datasets show that LCES outperforms conventional zero-shot methods in accuracy while maintaining computational efficiency. Moreover, LCES is robust across different LLM backbones, highlighting its applicability to real-world zero-shot AES.
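
As a sketch of the comparison-to-score step: RankNet models the probability that essay i beats essay j as a logistic function of the score difference and fits one scalar per essay to the LLM's pairwise judgments. A minimal NumPy version, assuming one binary preference per sampled pair (not the paper's exact training setup):

import numpy as np

def ranknet_scores(n_essays, comparisons, lr=0.1, epochs=200):
    # comparisons: list of (i, j) pairs meaning "the LLM judged essay i better than j".
    # RankNet models P(i beats j) = sigmoid(s_i - s_j).
    s = np.zeros(n_essays)
    for _ in range(epochs):
        for i, j in comparisons:
            p = 1.0 / (1.0 + np.exp(-(s[i] - s[j])))  # predicted P(i > j)
            s[i] += lr * (1.0 - p)                    # gradient ascent on log-likelihood
            s[j] -= lr * (1.0 - p)
    return s  # continuous scores; rescale to the rubric range as needed

scores = ranknet_scores(3, [(0, 1), (0, 2), (1, 2)])  # encodes essay 0 > 1 > 2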

[NLP-19] Judging the Judges: Can Large Vision-Language Models Fairly Evaluate Chart Comprehension and Reasoning? ACL2025

【Quick Read】: This paper addresses the cost and time of evaluating large vision-language models (LVLMs) on chart comprehension and reasoning tasks, which limit real-world deployment. The key to the solution is to evaluate 13 open-source LVLMs as judges of chart comprehension: pairwise and pointwise evaluation tasks are designed around criteria such as factual correctness, informativeness, and relevancy, and the judges are analyzed for format adherence, positional consistency, length bias, and instruction following, in order to validate them as cost-effective, efficient automatic evaluators.

Link: https://arxiv.org/abs/2505.08468
Authors: Md Tahmid Rahman Laskar, Mohammed Saidul Islam, Ridwan Mahbub, Ahmed Masry, Mizanur Rahman, Amran Bhuiyan, Mir Tafseer Nayeem, Shafiq Joty, Enamul Hoque, Jimmy Huang
Affiliations: York University; University of Alberta; Salesforce AI Research
Categories: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted at ACL 2025 Industry Track

Abstract:Charts are ubiquitous as they help people understand and reason with data. Recently, various downstream tasks, such as chart question answering, chart2text, and fact-checking, have emerged. Large Vision-Language Models (LVLMs) show promise in tackling these tasks, but their evaluation is costly and time-consuming, limiting real-world deployment. While using LVLMs as judges to assess the chart comprehension capabilities of other LVLMs could streamline evaluation processes, challenges like proprietary datasets, restricted access to powerful models, and evaluation costs hinder their adoption in industrial settings. To this end, we present a comprehensive evaluation of 13 open-source LVLMs as judges for diverse chart comprehension and reasoning tasks. We design both pairwise and pointwise evaluation tasks covering criteria like factual correctness, informativeness, and relevancy. Additionally, we analyze LVLM judges based on format adherence, positional consistency, length bias, and instruction-following. We focus on cost-effective LVLMs (≤10B parameters) suitable for both research and commercial use, following a standardized evaluation protocol and rubric to measure the LVLM judge's accuracy. Experimental results reveal notable variability: while some open LVLM judges achieve GPT-4-level evaluation performance (about 80% agreement with GPT-4 judgments), others struggle (below ~10% agreement). Our findings highlight that state-of-the-art open-source LVLMs can serve as cost-effective automatic evaluators for chart-related tasks, though biases such as positional preference and length bias persist.

[NLP-20] Large Language Models Meet Stance Detection: A Survey of Tasks Methods Applications Challenges and Future Directions

【Quick Read】: This paper addresses the lack of comprehensive coverage, in existing surveys, of stance detection approaches that specifically leverage large language models (LLMs). The key to the solution is a systematic analysis organized by a novel taxonomy along three dimensions: learning methods, data modalities, and target relationships, offering a thorough review of recent LLM-based advances in stance detection and of their performance and challenges across application scenarios.

Link: https://arxiv.org/abs/2505.08464
Authors: Lata Pangtey, Anukriti Bhatnagar, Shubhi Bansal, Shahid Shafi Dar, Nagendra Kumar
Affiliations: IIT Indore
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG); Social and Information Networks (cs.SI)
Comments:

Abstract:Stance detection is essential for understanding subjective content across various platforms such as social media, news articles, and online reviews. Recent advances in Large Language Models (LLMs) have revolutionized stance detection by introducing novel capabilities in contextual understanding, cross-domain generalization, and multimodal analysis. Despite these progressions, existing surveys often lack comprehensive coverage of approaches that specifically leverage LLMs for stance detection. To bridge this critical gap, our review article conducts a systematic analysis of stance detection, comprehensively examining recent advancements of LLMs transforming the field, including foundational concepts, methodologies, datasets, applications, and emerging challenges. We present a novel taxonomy for LLM-based stance detection approaches, structured along three key dimensions: 1) learning methods, including supervised, unsupervised, few-shot, and zero-shot; 2) data modalities, such as unimodal, multimodal, and hybrid; and 3) target relationships, encompassing in-target, cross-target, and multi-target scenarios. Furthermore, we discuss the evaluation techniques and analyze benchmark datasets and performance trends, highlighting the strengths and limitations of different architectures. Key applications in misinformation detection, political analysis, public health monitoring, and social media moderation are discussed. Finally, we identify critical challenges such as implicit stance expression, cultural biases, and computational constraints, while outlining promising future directions, including explainable stance reasoning, low-resource adaptation, and real-time deployment frameworks. Our survey highlights emerging trends, open challenges, and future directions to guide researchers and practitioners in developing next-generation stance detection systems powered by large language models.

[NLP-21] RepCali: High Efficient Fine-tuning Via Representation Calibration in Latent Space for Pre-trained Language Models

【Quick Read】: This paper addresses the discrepancy that remains, even after fine-tuning, between the representation produced by a pre-trained language model's (PLM's) encoder and the optimal input to its decoder. The key to the solution is a representation calibration method (RepCali) that works in the latent space after the encoder: a dedicated calibration block calibrates the latent representation, and the calibrated output is fed to the decoder, improving downstream task performance.

Link: https://arxiv.org/abs/2505.08463
Authors: Fujun Zhang, XiangDong Su
Affiliations: Inner Mongolia University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 13 pages, 4 figures

Abstract:Fine-tuning pre-trained language models (PLMs) has become a dominant paradigm in applying PLMs to downstream tasks. However, with limited fine-tuning, PLMs still struggle with the discrepancies between the representation obtained from the PLMs’ encoder and the optimal input to the PLMs’ decoder. This paper tackles this challenge by learning to calibrate the representation of PLMs in the latent space. In the proposed representation calibration method (RepCali), we integrate a specific calibration block to the latent space after the encoder and use the calibrated output as the decoder input. The merits of the proposed RepCali include its universality to all PLMs with encoder-decoder architectures, its plug-and-play nature, and ease of implementation. Extensive experiments on 25 PLM-based models across 8 tasks (including both English and Chinese datasets) demonstrate that the proposed RepCali offers desirable enhancements to PLMs (including LLMs) and significantly improves the performance of downstream tasks. Comparison experiments across 4 benchmark tasks indicate that RepCali is superior to the representative fine-tuning baselines.

[NLP-22] IterKey: Iterative Keyword Generation with LLMs for Enhanced Retrieval Augmented Generation

【Quick Read】: This paper addresses the trade-off between accuracy and interpretability in conventional retrieval-augmented generation (RAG): dense retrieval is accurate but opaque, while sparse retrieval is transparent but, relying on keyword matching, often fails to capture the full intent of a query. The key to the solution is IterKey, an LLM-driven iterative keyword generation framework with three LLM-driven stages: generating keywords for retrieval, generating an answer from the retrieved documents, and validating the answer. If validation fails, the process iterates with refined keywords, improving accuracy while preserving interpretability.

Link: https://arxiv.org/abs/2505.08450
Authors: Kazuki Hayashi, Hidetaka Kamigaito, Shinya Kouda, Taro Watanabe
Affiliations: Nara Institute of Science and Technology; TDSE Inc.
Categories: Computation and Language (cs.CL)
Comments:

Abstract:Retrieval-Augmented Generation (RAG) has emerged as a way to complement the in-context knowledge of Large Language Models (LLMs) by integrating external documents. However, real-world applications demand not only accuracy but also interpretability. While dense retrieval methods provide high accuracy, they lack interpretability; conversely, sparse retrieval methods offer transparency but often fail to capture the full intent of queries due to their reliance on keyword matching. To address these issues, we introduce IterKey, an LLM-driven iterative keyword generation framework that enhances RAG via sparse retrieval. IterKey consists of three LLM-driven stages: generating keywords for retrieval, generating answers based on retrieved documents, and validating the answers. If validation fails, the process iteratively repeats with refined keywords. Across four QA tasks, experimental results show that IterKey achieves 5% to 20% accuracy improvements over BM25-based RAG and simple baselines. Its performance is comparable to dense retrieval-based RAG and prior iterative query refinement methods using dense models. In summary, IterKey is a novel BM25-based approach leveraging LLMs to iteratively refine RAG, effectively balancing accuracy with interpretability.
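
A compact sketch of the three-stage loop, under stated assumptions: `llm` is any text-in/text-out model call and `bm25_search` is a hypothetical BM25 retriever over the document collection (neither is the paper's published interface).

def iterkey(question, llm, bm25_search, max_iters=3):
    feedback = ""
    answer = None
    for _ in range(max_iters):
        keywords = llm(f"Generate search keywords for: {question}\n{feedback}")
        docs = bm25_search(keywords, top_k=5)  # sparse, hence interpretable, retrieval
        answer = llm(f"Answer '{question}' using these documents:\n{docs}")
        verdict = llm(f"Is '{answer}' a correct, well-supported answer to "
                      f"'{question}' given these documents?\n{docs}\nReply yes or no.")
        if verdict.strip().lower().startswith("yes"):
            return answer
        feedback = f"The keywords '{keywords}' did not yield a validated answer; refine them."
    return answer  # best effort after max_iters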

[NLP-23] Optimizing Retrieval-Augmented Generation: Analysis of Hyperparameter Impact on Performance and Efficiency

【Quick Read】: This paper starts from the observation that large language models achieve high task performance yet often hallucinate or rely on outdated knowledge, which retrieval-augmented generation (RAG) mitigates by coupling generation with external search. The key is a systematic analysis of how hyperparameters govern the speed-quality trade-off in RAG systems, covering Chroma and Faiss vector stores, chunking policies, cross-encoder re-ranking, and temperature, so that systems can be tuned for accurate, transparent, up-to-date responses.

Link: https://arxiv.org/abs/2505.08445
Authors: Adel Ammar, Anis Koubaa, Omer Nacar, Wadii Boulila
Affiliations: Prince Sultan University; Alfaisal University
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:

Abstract:Large language models achieve high task performance yet often hallucinate or rely on outdated knowledge. Retrieval-augmented generation (RAG) addresses these gaps by coupling generation with external search. We analyse how hyperparameters influence speed and quality in RAG systems, covering Chroma and Faiss vector stores, chunking policies, cross-encoder re-ranking, and temperature, and we evaluate six metrics: faithfulness, answer correctness, answer relevancy, context precision, context recall, and answer similarity. Chroma processes queries 13% faster, whereas Faiss yields higher retrieval precision, revealing a clear speed-accuracy trade-off. Naive fixed-length chunking with small windows and minimal overlap outperforms semantic segmentation while remaining the quickest option. Re-ranking provides modest gains in retrieval quality yet increases runtime by roughly a factor of 5, so its usefulness depends on latency constraints. These results help practitioners balance computational cost and accuracy when tuning RAG systems for transparent, up-to-date responses. Finally, we re-evaluate the top configurations with a corrective RAG workflow and show that their advantages persist when the model can iteratively request additional evidence. We obtain a near-perfect context precision (99%), which demonstrates that RAG systems can achieve extremely high retrieval accuracy with the right combination of hyperparameters, with significant implications for applications where retrieval quality directly impacts downstream task performance, such as clinical decision support in healthcare.
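
Since the best-performing setting reported above is naive fixed-length chunking with small windows and minimal overlap, here is a minimal sketch of that policy; the window and overlap sizes are illustrative, not the paper's tuned values:

def fixed_length_chunks(text, window=256, overlap=32):
    # Fixed-length character windows with a small overlap between neighbors.
    step = window - overlap
    return [text[i:i + window] for i in range(0, max(len(text) - overlap, 1), step)]

chunks = fixed_length_chunks("Large language models achieve high task performance..." * 40)
# each chunk is then embedded and indexed in the vector store (Chroma or Faiss)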

[NLP-24] A document processing pipeline for the construction of a dataset for topic modeling based on the judgments of the Italian Supreme Court

【Quick Read】: This paper addresses the lack of public datasets that limits topic modeling in Italian legal research and the analysis of legal themes in Supreme Court judgments. The key to the solution is a document processing pipeline that integrates document layout analysis (DLA), optical character recognition (OCR), and text anonymization to produce an anonymized dataset optimized for topic modeling. The pipeline achieves strong metrics for the DLA module, the OCR detector, and the text recognizer, and ultimately improves topic modeling diversity and coherence scores.

Link: https://arxiv.org/abs/2505.08439
Authors: Matteo Marulli, Glauco Panattoni, Marco Bertini
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: 51 pages

Abstract:Topic modeling in Italian legal research is hindered by the lack of public datasets, limiting the analysis of legal themes in Supreme Court judgments. To address this, we developed a document processing pipeline that produces an anonymized dataset optimized for topic modeling. The pipeline integrates document layout analysis (YOLOv8x), optical character recognition, and text anonymization. The DLA module achieved a mAP@50 of 0.964 and a mAP@50-95 of 0.800. The OCR detector reached a mAP@50-95 of 0.9022, and the text recognizer (TrOCR) obtained a character error rate of 0.0047 and a word error rate of 0.0248. Compared to OCR-only methods, our dataset improved topic modeling with a diversity score of 0.6198 and a coherence score of 0.6638. We applied BERTopic to extract topics and used large language models to generate labels and summaries. Outputs were evaluated against domain expert interpretations. Claude Sonnet 3.7 achieved a BERTScore F1 of 0.8119 for labeling and 0.9130 for summarization.

[NLP-25] Hakim: Farsi Text Embedding Model

【Quick Read】: This paper addresses the underrepresentation of Persian in large-scale text embedding research and aims to improve Persian natural language understanding (NLU). The key to the solution is Hakim, a new state-of-the-art Persian text embedding model, together with three new datasets - Corpesia, Pairsia-sup, and Pairsia-unsup - supporting supervised and unsupervised training. Hakim is designed for chatbots and retrieval-augmented generation (RAG) systems, particularly retrieval tasks that must incorporate message history; a BERT-based baseline model and a RetroMAE-based model are also proposed, the latter proving especially effective for textual information retrieval. Together, these contributions establish a new foundation for advancing Persian language understanding.

Link: https://arxiv.org/abs/2505.08435
Authors: Mehran Sarmadi, Morteza Alikhani, Erfan Zinvandi, Zahra Pourbahman
Affiliations: Sharif University of Technology
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:Recent advancements in text embedding have significantly improved natural language understanding across many languages, yet Persian remains notably underrepresented in large-scale embedding research. In this paper, we present Hakim, a novel state-of-the-art Persian text embedding model that achieves a 8.5% performance improvement over existing approaches on the FaMTEB benchmark, outperforming all previously developed Persian language models. As part of this work, we introduce three new datasets - Corpesia, Pairsia-sup, and Pairsia-unsup - to support supervised and unsupervised training scenarios. Additionally, Hakim is designed for applications in chatbots and retrieval-augmented generation (RAG) systems, particularly addressing retrieval tasks that require incorporating message history within these systems. We also propose a new baseline model built on the BERT architecture. Our language model consistently achieves higher accuracy across various Persian NLP tasks, while the RetroMAE-based model proves particularly effective for textual information retrieval applications. Together, these contributions establish a new foundation for advancing Persian language understanding.

[NLP-26] TUMS: Enhancing Tool-use Abilities of LLMs with Multi-structure Handlers ICONIP2024

【Quick Read】: This paper addresses the non-executable and improper actions that occur when large language models (LLMs) use external tools, which stem mainly from incorrectly generated parameters. The key to the solution is TUMS, a framework that enhances LLMs' tool-use ability by turning tool-level processing into parameter-level processing, built from four components: an intent recognizer, a task decomposer, a subtask processor with multi-structure handlers, and an executor, enabling more accurate parameter generation and task execution.

Link: https://arxiv.org/abs/2505.08402
Authors: Aiyao He, Sijia Cui, Shuai Xu, Yanna Wang, Bo Xu
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: Accepted to ICONIP 2024

Abstract:Recently, large language models (LLMs) have played an increasingly important role in solving a wide range of NLP tasks, leveraging their capabilities of natural language understanding and generating. Integration with external tools further enhances LLMs’ effectiveness, providing more precise, timely, and specialized responses. However, LLMs still encounter difficulties with non-executable actions and improper actions, which are primarily attributed to incorrect parameters. The process of generating parameters by LLMs is confined to the tool level, employing the coarse-grained strategy without considering the different difficulties of various tools. To address this issue, we propose TUMS, a novel framework designed to enhance the tool-use capabilities of LLMs by transforming tool-level processing into parameter-level processing. Specifically, our framework consists of four key components: (1) an intent recognizer that identifies the user’s intent to help LLMs better understand the task; (2) a task decomposer that breaks down complex tasks into simpler subtasks, each involving a tool call; (3) a subtask processor equipped with multi-structure handlers to generate accurate parameters; and (4) an executor. Our empirical studies have evidenced the effectiveness and efficiency of the TUMS framework with an average improvement of 19.6% and 50.6% on the easy and hard benchmarks of ToolQA, respectively; meanwhile, we demonstrated the key contribution of each part with ablation experiments, offering more insights and stimulating future research on Tool-augmented LLMs.

[NLP-27] Accelerating Chain-of-Thought Reasoning : When Goal-Gradient Importance Meets Dynamic Skipping

【Quick Read】: This paper addresses the excessive verbosity, computational cost, and latency of Chain-of-Thought (CoT) reasoning when large language models tackle complex tasks. Existing CoT compression relies on generic importance metrics and static compression rates, which can delete functionally critical tokens or fail to adapt to varying reasoning complexity. The key to the solution is Adaptive GoGI-Skip, a framework that learns dynamic CoT compression via supervised fine-tuning, with two synergistic innovations: (1) Goal-Gradient Importance (GoGI), an importance metric that identifies functionally relevant tokens by measuring the gradient influence of their intermediate representations on the final answer loss; and (2) Adaptive Dynamic Skipping (ADS), a mechanism that dynamically regulates the compression rate based on runtime model uncertainty while ensuring local coherence through an adaptive N-token constraint.

Link: https://arxiv.org/abs/2505.08392
Authors: Ren Zhuang, Ben Wang, Shuifa Sun
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:Large Language Models leverage Chain-of-Thought (CoT) prompting for complex tasks, but their reasoning traces are often excessively verbose and inefficient, leading to significant computational costs and latency. Current CoT compression techniques typically rely on generic importance metrics and static compression rates, which may inadvertently remove functionally critical tokens or fail to adapt to varying reasoning complexity. To overcome these limitations, we propose Adaptive GoGI-Skip, a novel framework learning dynamic CoT compression via supervised fine-tuning. This approach introduces two synergistic innovations: (1) Goal-Gradient Importance (GoGI), a novel metric accurately identifying functionally relevant tokens by measuring the gradient influence of their intermediate representations on the final answer loss, and (2) Adaptive Dynamic Skipping (ADS), a mechanism dynamically regulating the compression rate based on runtime model uncertainty while ensuring local coherence through an adaptive N-token constraint. To our knowledge, this is the first work unifying a goal-oriented, gradient-based importance metric with dynamic, uncertainty-aware skipping for CoT compression. Trained on compressed MATH data, Adaptive GoGI-Skip demonstrates strong cross-domain generalization across diverse reasoning benchmarks including AIME, GPQA, and GSM8K. It achieves substantial efficiency gains - reducing CoT token counts by over 45% on average and delivering 1.6-2.0 times inference speedups - while maintaining high reasoning accuracy. Notably, it significantly outperforms existing baselines by preserving accuracy even at high effective compression rates, advancing the state of the art in the CoT reasoning efficiency-accuracy trade-off.
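
The abstract specifies GoGI only as "the gradient influence of intermediate representations on the final answer loss"; one plausible formalization (our notation, not necessarily the paper's exact definition) for token t with layer-\ell representation h_t^{(\ell)} and answer loss \mathcal{L}_{\mathrm{ans}} is a gradient-times-activation attribution:

\mathrm{GoGI}(t) \;=\; \big\lVert \nabla_{h_t^{(\ell)}} \mathcal{L}_{\mathrm{ans}} \,\odot\, h_t^{(\ell)} \big\rVert_1 ,

so tokens whose representations the answer loss is most sensitive to are retained, while low-GoGI tokens become candidates for skipping under the uncertainty-gated ADS mechanism.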

[NLP-28] Towards Contamination Resistant Benchmarks

【Quick Read】: This paper addresses contamination in large language model (LLM) evaluation, where overlap between training and test data undermines the reliability of results. The key to the solution is a contamination resistant benchmark based on Caesar ciphers, whose simplicity makes it an excellent example of the idea. Testing mainstream LLMs under various settings shows that they struggle on this benchmark when contamination is controlled, exposing problems in current LLMs and raising important questions about their true capabilities.

Link: https://arxiv.org/abs/2505.08389
Authors: Rahmatullah Musawi, Sheng Lu
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments:

Abstract:The rapid development of large language models (LLMs) has transformed the landscape of natural language processing. Evaluating LLMs properly is crucial for understanding their potential and addressing concerns such as safety. However, LLM evaluation is confronted by various factors, among which contamination stands out as a key issue that undermines the reliability of evaluations. In this work, we introduce the concept of contamination resistance to address this challenge. We propose a benchmark based on Caesar ciphers (e.g., “ab” to “bc” when the shift is 1), which, despite its simplicity, is an excellent example of a contamination resistant benchmark. We test this benchmark on widely used LLMs under various settings, and we find that these models struggle with this benchmark when contamination is controlled. Our findings reveal issues in current LLMs and raise important questions regarding their true capabilities. Our work contributes to the development of contamination resistant benchmarks, enabling more rigorous LLM evaluation and offering insights into the true capabilities and limitations of LLMs.
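
The benchmark's task is simple to state: shift every letter by a fixed offset, wrapping around the alphabet (the abstract's example maps "ab" to "bc" at shift 1). A reference implementation:

import string

def caesar(text, shift):
    # Shift each letter by `shift` positions, wrapping around the alphabet.
    lo, up = string.ascii_lowercase, string.ascii_uppercase
    s = shift % 26
    table = str.maketrans(lo + up, lo[s:] + lo[:s] + up[s:] + up[:s])
    return text.translate(table)

assert caesar("ab", 1) == "bc"  # the example from the abstract

Because any fresh string can be enciphered on the fly, models cannot rely on having memorized benchmark items, which is what makes the task contamination resistant.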

[NLP-29] Alignment Drift in CEFR-prompted LLM s for Interactive Spanish Tutoring

【Quick Read】: This paper investigates whether large language models (LLMs) can act as adaptive tutors in second-language learning by generating text at a difficulty appropriate to a student's proficiency. The key to the solution is constraining LLM output via system prompting so that text difficulty follows CEFR levels, supporting personalized, proficiency-aligned adaptive tutoring. The study finds, however, that prompting alone suffers from alignment drift over long interactions, indicating the need for more robust methods to keep model output consistent.

Link: https://arxiv.org/abs/2505.08351
Authors: Mina Almasi, Ross Deans Kristensen-McLachlan
Affiliations: Aarhus University
Categories: Computation and Language (cs.CL)
Comments:

Abstract:This paper investigates the potentials of Large Language Models (LLMs) as adaptive tutors in the context of second-language learning. In particular, we evaluate whether system prompting can reliably constrain LLMs to generate only text appropriate to the student’s competence level. We simulate full teacher-student dialogues in Spanish using instruction-tuned, open-source LLMs ranging in size from 7B to 12B parameters. Dialogues are generated by having an LLM alternate between tutor and student roles with separate chat histories. The output from the tutor model is then used to evaluate the effectiveness of CEFR-based prompting to control text difficulty across three proficiency levels (A1, B1, C1). Our findings suggest that while system prompting can be used to constrain model outputs, prompting alone is too brittle for sustained, long-term interactional contexts - a phenomenon we term alignment drift. Our results provide insights into the feasibility of LLMs for personalized, proficiency-aligned adaptive tutors and provide a scalable method for low-cost evaluation of model performance without human participants.

[NLP-30] On the Geometry of Semantics in Next-token Prediction

【Quick Read】: This paper asks how language models trained solely on the next-token prediction (NTP) objective nevertheless capture semantic and grammatical concepts of language. The key finding is that NTP optimization implicitly guides models to encode concepts via the singular value decomposition (SVD) factors of a centered data-sparsity matrix: although the model never explicitly constructs this matrix, the learned word and context embeddings effectively factor it, thereby capturing linguistic structure.

Link: https://arxiv.org/abs/2505.08348
Authors: Yize Zhao, Christos Thrampoulidis
Institutions: University of British Columbia
Subjects: Computation and Language (cs.CL)
Comments:

Click to view abstract

Abstract:Modern language models demonstrate a remarkable ability to capture linguistic meaning despite being trained solely through next-token prediction (NTP). We investigate how this conceptually simple training objective leads models to extract and encode latent semantic and grammatical concepts. Our analysis reveals that NTP optimization implicitly guides models to encode concepts via singular value decomposition (SVD) factors of a centered data-sparsity matrix that captures next-word co-occurrence patterns. While the model never explicitly constructs this matrix, learned word and context embeddings effectively factor it to capture linguistic structure. We find that the most important SVD factors are learned first during training, motivating the use of spectral clustering of embeddings to identify human-interpretable semantics, including both classical k-means and a new orthant-based method directly motivated by our interpretation of concepts. Overall, our work bridges distributional semantics, neural collapse geometry, and neural network training dynamics, providing insights into how NTP’s implicit biases shape the emergence of meaning representations in language models.
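
The analysis pipeline can be sketched end to end on synthetic data (the corpus, sizes, and cluster count here are toy assumptions): build a centered data-sparsity matrix of next-word co-occurrence, factor it with SVD, and cluster words in the space of the top factors.

```python
# Toy version of the paper's pipeline: centered sparsity matrix -> SVD -> clustering.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
counts = rng.poisson(0.3, size=(50, 30)).astype(float)  # contexts x next-words
sparsity = (counts > 0).astype(float)                   # data-sparsity pattern
centered = sparsity - sparsity.mean(axis=0, keepdims=True)

U, S, Vt = np.linalg.svd(centered, full_matrices=False)
word_vecs = Vt[:5].T * S[:5]       # words embedded via the top-5 SVD factors

labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(word_vecs)
print(S[:5].round(2), labels)      # the paper finds top factors are learned first
```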

[NLP-31] AM-Thinking-v1: Advancing the Frontier of Reasoning at 32B Scale

【Quick Read】: This paper targets the limits of reasoning in large language models, in particular matching leading Mixture-of-Experts (MoE) models on mathematics and code generation. The key to the solution is a meticulously crafted post-training pipeline on top of the open-source Qwen2.5-32B base model, combining supervised fine-tuning and reinforcement learning, which markedly improves reasoning ability, yields strong results on benchmarks such as AIME 2024, AIME 2025, and LiveCodeBench, and demonstrates that the open-source community can build high-performing reasoning models at the 32B scale.

Link: https://arxiv.org/abs/2505.08311
Authors: Yunjie Ji, Xiaoyu Tian, Sitong Zhao, Haotian Wang, Shuaiting Chen, Yiping Peng, Han Zhao, Xiangang Li
Institutions: Beike; Ke.com
Subjects: Computation and Language (cs.CL)
Comments:

Click to view abstract

Abstract:We present AM-Thinking-v1, a 32B dense language model that advances the frontier of reasoning, embodying the collaborative spirit of open-source innovation. Outperforming DeepSeek-R1 and rivaling leading Mixture-of-Experts (MoE) models like Qwen3-235B-A22B and Seed1.5-Thinking, AM-Thinking-v1 achieves impressive scores of 85.3 on AIME 2024, 74.4 on AIME 2025, and 70.3 on LiveCodeBench, showcasing state-of-the-art mathematical and coding capabilities among open-source models of similar scale. Built entirely from the open-source Qwen2.5-32B base model and publicly available queries, AM-Thinking-v1 leverages a meticulously crafted post-training pipeline - combining supervised fine-tuning and reinforcement learning - to deliver exceptional reasoning capabilities. This work demonstrates that the open-source community can achieve high performance at the 32B scale, a practical sweet spot for deployment and fine-tuning. By striking a balance between top-tier performance and real-world usability, we hope AM-Thinking-v1 inspires further collaborative efforts to harness mid-scale models, pushing reasoning boundaries while keeping accessibility at the core of innovation. We have open-sourced our model on Hugging Face (this https URL).

[NLP-32] Evaluating the Effectiveness of Black-Box Prompt Optimization as the Scale of LLMs Continues to Grow

【Quick Read】: This paper examines whether black-box prompt optimization methods remain effective as large language models (LLMs) grow, i.e., whether they still deliver significant performance gains on very large models such as DeepSeek V3 and Gemini 2.0 Flash. The key to the solution is a systematic set of experiments applying three well-known black-box optimization methods to LLMs of various sizes to isolate the effect of model scale on optimization gains, leading to the hypothesis that model scale is the primary factor limiting their effectiveness.

Link: https://arxiv.org/abs/2505.08303
Authors: Ziyu Zhou, Yihang Wu, Jingyuan Yang, Zhan Xiao, Rongjun Li
Institutions: Huawei Technologies; Zhejiang Lab
Subjects: Computation and Language (cs.CL)
Comments:

Click to view abstract

Abstract:Black-Box prompt optimization methods have emerged as a promising strategy for refining input prompts to better align large language models (LLMs), thereby enhancing their task performance. Although these methods have demonstrated encouraging results, most studies and experiments have primarily focused on smaller-scale models (e.g., 7B, 14B) or earlier versions (e.g., GPT-3.5) of LLMs. As the scale of LLMs continues to increase, such as with DeepSeek V3 (671B), it remains an open question whether these black-box optimization techniques will continue to yield significant performance improvements for models of such scale. In response to this, we select three well-known black-box optimization methods and evaluate them on large-scale LLMs (DeepSeek V3 and Gemini 2.0 Flash) across four NLU and NLG datasets. The results show that these black-box prompt optimization methods offer only limited improvements on these large-scale LLMs. Furthermore, we hypothesize that the scale of the model is the primary factor contributing to the limited benefits observed. To explore this hypothesis, we conducted experiments on LLMs of varying sizes (Qwen 2.5 series, ranging from 7B to 72B) and observed an inverse scaling law, wherein the effectiveness of black-box optimization methods diminished as the model size increased.

[NLP-33] Enhancing Cache-Augmented Generation (CAG) with Adaptive Contextual Compression for Scalable Knowledge Integration

【Quick Read】: This paper tackles the scalability and efficiency problems Cache-Augmented Generation (CAG) faces when extended to large and dynamic knowledge bases. The key to the solution is an Adaptive Contextual Compression (ACC) technique that dynamically compresses and manages context inputs to make better use of the long-context capabilities of modern LLMs, together with a hybrid CAG-RAG framework that augments the preloaded context with selective retrieval whenever additional information is required.

Link: https://arxiv.org/abs/2505.08261
Authors: Rishabh Agrawal, Himanshu Kumar
Institutions: Unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:The rapid progress in large language models (LLMs) has paved the way for novel approaches in knowledge-intensive tasks. Among these, Cache-Augmented Generation (CAG) has emerged as a promising alternative to Retrieval-Augmented Generation (RAG). CAG minimizes retrieval latency and simplifies system design by preloading knowledge into the model’s context. However, challenges persist in scaling CAG to accommodate large and dynamic knowledge bases effectively. This paper introduces Adaptive Contextual Compression (ACC), an innovative technique designed to dynamically compress and manage context inputs, enabling efficient utilization of the extended memory capabilities of modern LLMs. To further address the limitations of standalone CAG, we propose a Hybrid CAG-RAG Framework, which integrates selective retrieval to augment preloaded contexts in scenarios requiring additional information. Comprehensive evaluations on diverse datasets highlight the proposed methods’ ability to enhance scalability, optimize efficiency, and improve multi-hop reasoning performance, offering practical solutions for real-world knowledge integration challenges.

[NLP-34] Large Language Model Psychometrics: A Systematic Review of Evaluation Validation and Enhancement

【Quick Read】: This paper addresses the fact that traditional evaluation methods cannot keep pace with the rapid development of large language models (LLMs), in particular how to measure human-like psychological constructs, move beyond static, task-specific benchmarks, and establish human-centered evaluation. The key to the solution is Psychometrics: using psychometric instruments, theories, and principles to evaluate, understand, and enhance LLMs, playing a central role in shaping benchmarking principles, broadening evaluation scope, refining methodology, validating results, and advancing LLM capabilities.

Link: https://arxiv.org/abs/2505.08245
Authors: Haoran Ye, Jing Jin, Yuhang Xie, Xin Zhang, Guojie Song
Institutions: Peking University; State Key Laboratory of General Artificial Intelligence; School of Psychological and Cognitive Sciences; Key Laboratory of Machine Perception (Ministry of Education); PKU-Wuhan Institute for Artificial Intelligence
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
Comments: 63 pages, 482 references

Click to view abstract

Abstract:The rapid advancement of large language models (LLMs) has outpaced traditional evaluation methodologies. It presents novel challenges, such as measuring human-like psychological constructs, navigating beyond static and task-specific benchmarks, and establishing human-centered evaluation. These challenges intersect with Psychometrics, the science of quantifying the intangible aspects of human psychology, such as personality, values, and intelligence. This survey introduces and synthesizes an emerging interdisciplinary field of LLM Psychometrics, which leverages psychometric instruments, theories, and principles to evaluate, understand, and enhance LLMs. We systematically explore the role of Psychometrics in shaping benchmarking principles, broadening evaluation scopes, refining methodologies, validating results, and advancing LLM capabilities. This paper integrates diverse perspectives to provide a structured framework for researchers across disciplines, enabling a more comprehensive understanding of this nascent field. Ultimately, we aim to provide actionable insights for developing future evaluation paradigms that align with human-level AI and promote the advancement of human-centered AI systems for societal benefit. A curated repository of LLM psychometric resources is available at this https URL.

[NLP-35] Not that Groove: Zero-Shot Symbolic Music Editing

【Quick Read】: This paper addresses the limited industry adoption of audio-based AI music generation due to its rigidity, with the core challenge being symbolic music editing: modifying music precisely according to textual instructions. The key to the solution is showing that LLMs with zero-shot prompting can effectively edit drum grooves, using a creatively designed format that interfaces LLMs with music, while evaluation is facilitated by a dataset of annotated unit tests that aligns closely with musicians' judgment.

Link: https://arxiv.org/abs/2505.08203
Authors: Li Zhang
Institutions: Drexel University
Subjects: Sound (cs.SD); Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)
Comments:

Click to view abstract

Abstract:Most work in AI music generation focused on audio, which has seen limited use in the music production industry due to its rigidity. To maximize flexibility while assuming only textual instructions from producers, we are among the first to tackle symbolic music editing. We circumvent the known challenge of lack of labeled data by proving that LLMs with zero-shot prompting can effectively edit drum grooves. The recipe of success is a creatively designed format that interfaces LLMs and music, while we facilitate evaluation by providing an evaluation dataset with annotated unit tests that highly aligns with musicians’ judgment.

[NLP-36] A Head to Predict and a Head to Question: Pre-trained Uncertainty Quantification Heads for Hallucination Detection in LLM Outputs

【Quick Read】: This paper addresses hallucination in large language models (LLMs), i.e., their tendency to sporadically generate false or fabricated information. The key to the solution is pre-trained uncertainty quantification (UQ) heads: supervised auxiliary modules whose strength comes from a powerful Transformer architecture and informative features derived from LLM attention maps, substantially improving the models' ability to capture uncertainty. Experiments show these UQ heads deliver state-of-the-art claim-level hallucination detection on both in-domain and out-of-domain prompts and generalize well to languages they were not explicitly trained on.

Link: https://arxiv.org/abs/2505.08200
Authors: Artem Shelmanov, Ekaterina Fadeeva, Akim Tsvigun, Ivan Tsvigun, Zhuohan Xie, Igor Kiselev, Nico Daheim, Caiqi Zhang, Artem Vazhentsev, Mrinmaya Sachan, Preslav Nakov, Timothy Baldwin
Institutions: MBZUAI; ETH Zürich; University of Cambridge; Nebius.AI; Behavox; Accenture; Computational Semantics Group
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Large Language Models (LLMs) have the tendency to hallucinate, i.e., to sporadically generate false or fabricated information. This presents a major challenge, as hallucinations often appear highly convincing and users generally lack the tools to detect them. Uncertainty quantification (UQ) provides a framework for assessing the reliability of model outputs, aiding in the identification of potential hallucinations. In this work, we introduce pre-trained UQ heads: supervised auxiliary modules for LLMs that substantially enhance their ability to capture uncertainty compared to unsupervised UQ methods. Their strong performance stems from the powerful Transformer architecture in their design and informative features derived from LLM attention maps. Experimental evaluation shows that these heads are highly robust and achieve state-of-the-art performance in claim-level hallucination detection across both in-domain and out-of-domain prompts. Moreover, these modules demonstrate strong generalization to languages they were not explicitly trained on. We pre-train a collection of UQ heads for popular LLM series, including Mistral, Llama, and Gemma 2. We publicly release both the code and the pre-trained heads.
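
A schematic of what such a head could look like (the pooled attention statistics and the tiny MLP below are our assumptions; the paper's heads use a Transformer architecture and a richer feature set):

```python
# Pool simple statistics from attention maps and feed them to a small
# supervised head that flags likely-hallucinated claims.
import torch
import torch.nn as nn

torch.manual_seed(0)
layers, heads, seq = 4, 8, 12
attn = torch.softmax(torch.randn(layers, heads, seq, seq), dim=-1)  # stand-in maps

entropy = -(attn * attn.clamp_min(1e-9).log()).sum(-1).mean(-1)  # (layers, heads)
diag = attn.diagonal(dim1=-2, dim2=-1).mean(-1)                  # self-attention mass
features = torch.cat([entropy.flatten(), diag.flatten()])        # fixed-size vector

uq_head = nn.Sequential(nn.Linear(features.numel(), 32), nn.ReLU(), nn.Linear(32, 1))
p_hallucination = torch.sigmoid(uq_head(features))   # trained with claim-level labels
print(float(p_hallucination))
```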

[NLP-37] Exploiting Text Semantics for Few and Zero Shot Node Classification on Text-attributed Graph

【Quick Read】: This paper targets few- and zero-shot node classification on text-attributed graphs (TAGs), which has wide applications in fields such as academia and social networks. Existing methods rely mainly on graph-based augmentation to train node and text embeddings, while text-based augmentation remains largely unexplored. The proposed Text Semantics Augmentation (TSA) improves accuracy by introducing additional text semantic supervision signals through two techniques: positive semantics matching, which retrieves texts whose embeddings are similar to a graph node, and negative semantics contrast, which adds a negation prompt to construct a description with the opposite meaning and contrasts it with the original node and text, strengthening the model's grasp of textual semantics.

Link: https://arxiv.org/abs/2505.08168
Authors: Yuxiang Wang, Xiao Yan, Shiyu Jin, Quanqing Xu, Chuang Hu, Yuanyuan Zhu, Bo Du, Jia Wu, Jiawei Jiang
Institutions: Wuhan University; OceanBase; Macquarie University
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Text-attributed graph (TAG) provides a text description for each graph node, and few- and zero-shot node classification on TAGs have many applications in fields such as academia and social networks. Existing work utilizes various graph-based augmentation techniques to train the node and text embeddings, while text-based augmentations are largely unexplored. In this paper, we propose Text Semantics Augmentation (TSA) to improve accuracy by introducing more text semantic supervision signals. Specifically, we design two augmentation techniques, i.e., positive semantics matching and negative semantics contrast, to provide more reference texts for each graph node or text description. Positive semantic matching retrieves texts with similar embeddings to match with a graph node. Negative semantic contrast adds a negative prompt to construct a text description with the opposite semantics, which is contrasted with the original node and text. We evaluate TSA on 5 datasets and compare with 13 state-of-the-art baselines. The results show that TSA consistently outperforms all baselines, and its accuracy improvements over the best-performing baseline are usually over 5%.
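
A rough sketch of the two augmentations, with TF-IDF standing in for the learned embeddings and a hypothetical negation prompt template (both are our assumptions, not the paper's components):

```python
# Positive semantics matching: retrieve the most similar reference text.
# Negative semantics contrast: build an opposite-meaning description via a prompt.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

node_text = "paper on graph neural networks for citation analysis"
corpus = [
    "graph neural network study of citation graphs",
    "a cookbook of mediterranean recipes",
    "social network analysis with deep learning",
]

vec = TfidfVectorizer().fit(corpus + [node_text])
sims = cosine_similarity(vec.transform([node_text]), vec.transform(corpus))[0]
positive_match = corpus[int(sims.argmax())]              # positive semantics matching

negative_prompt = "Rewrite with the opposite meaning: "  # assumed prompt template
negative_input = negative_prompt + node_text             # sent to an LLM in practice
print(positive_match)
```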

[NLP-38] Fusing Bidirectional Chains of Thought and Reward Mechanisms: A Method for Enhancing Question-Answering Capabilities of Large Language Models for Chinese Intangible Cultural Heritage

【Quick Read】: This paper addresses the bias, incorrect knowledge inheritance, and catastrophic forgetting that arise when fine-tuning large language models (LLMs) on Intangible Cultural Heritage (ICH) data. The key to the solution is a novel training method that fuses bidirectional chains of thought with a reward mechanism, built on ICH-Qwen, a model designed for the ICH domain: besides forward reasoning, reverse questioning and reverse reasoning activate the model's latent knowledge, while a reward mechanism introduced during training optimizes the decision process through structural and content evaluations with different weighting schemes, improving the quality and accuracy of outputs.

Link: https://arxiv.org/abs/2505.08167
Authors: Ruilin Liu, Zhixiao Zhao, Jieqiong Li, Chang Liu, Dongbo Wang
Institutions: Unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 22 pages, 5 figures

Click to view abstract

Abstract:The rapid development of large language models (LLMs) has provided significant support and opportunities for the advancement of domain-specific LLMs. However, fine-tuning these large models using Intangible Cultural Heritage (ICH) data inevitably faces challenges such as bias, incorrect knowledge inheritance, and catastrophic forgetting. To address these issues, we propose a novel training method that integrates a bidirectional chains of thought and a reward mechanism. This method is built upon ICH-Qwen, a large language model specifically designed for the field of intangible cultural heritage. The proposed method enables the model to not only perform forward reasoning but also enhances the accuracy of the generated answers by utilizing reverse questioning and reverse reasoning to activate the model’s latent knowledge. Additionally, a reward mechanism is introduced during training to optimize the decision-making process. This mechanism improves the quality of the model’s outputs through structural and content evaluations with different weighting schemes. We conduct comparative experiments on ICH-Qwen, with results demonstrating that our method outperforms 0-shot, step-by-step reasoning, knowledge distillation, and question augmentation methods in terms of accuracy, Bleu-4, and Rouge-L scores on the question-answering task. Furthermore, the paper highlights the effectiveness of combining the bidirectional chains of thought and reward mechanism through ablation experiments. In addition, a series of generalizability experiments are conducted, with results showing that the proposed method yields improvements on various domain-specific datasets and advanced models in areas such as Finance, Wikidata, and StrategyQA. This demonstrates that the method is adaptable to multiple domains and provides a valuable approach for model training in future applications across diverse fields.

[NLP-39] A Large-Scale Empirical Analysis of Custom GPTs' Vulnerabilities in the OpenAI Ecosystem

【Quick Read】: This paper addresses the inadequate safety protections of customized GPT-based models (custom GPTs) and the many exploitable security threats they face in practice. The key to the study is a large-scale empirical analysis evaluating the security risks of these custom models across categories and popularity tiers, with a multi-metric ranking system used to examine the relationship between a custom GPT's popularity and its security risks. The results show that the vast majority of custom GPTs lack adequate protections, with vulnerabilities concentrated in roleplay-based attacks, system prompt leakage, and phishing content generation, underscoring the urgent need for stronger security mechanisms and content moderation.

Link: https://arxiv.org/abs/2505.08148
Authors: Sunday Oyinlola Ogundoyin, Muhammad Ikram, Hassan Jameel Asghar, Benjamin Zi Hao Zhao, Dali Kaafar
Institutions: Macquarie University Cybersecurity Hub; Macquarie University
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:Millions of users leverage generative pretrained transformer (GPT)-based language models developed by leading model providers for a wide range of tasks. To support enhanced user interaction and customization, many platforms, such as OpenAI, now enable developers to create and publish tailored model instances, known as custom GPTs, via dedicated repositories or application stores. These custom GPTs empower users to browse and interact with specialized applications designed to meet specific needs. However, as custom GPTs see growing adoption, concerns regarding their security vulnerabilities have intensified. Existing research on these vulnerabilities remains largely theoretical, often lacking empirical, large-scale, and statistically rigorous assessments of associated risks. In this study, we analyze 14,904 custom GPTs to assess their susceptibility to seven exploitable threats, such as roleplay-based attacks, system prompt leakage, phishing content generation, and malicious code synthesis, across various categories and popularity tiers within the OpenAI marketplace. We introduce a multi-metric ranking system to examine the relationship between a custom GPT's popularity and its associated security risks. Our findings reveal that over 95% of custom GPTs lack adequate security protections. The most prevalent vulnerabilities include roleplay-based vulnerabilities (96.51%), system prompt leakage (92.20%), and phishing (91.22%). Furthermore, we demonstrate that OpenAI's foundational models exhibit inherent security weaknesses, which are often inherited or amplified in custom GPTs. These results highlight the urgent need for enhanced security measures and stricter content moderation to ensure the safe deployment of GPT-based applications.

[NLP-40] Large Language Models for Computer-Aided Design: A Survey

【Quick Read】: This paper addresses the current lack of a systematic review of the integration of generative AI with Computer-Aided Design (CAD). The key is the first systematic exploration of generative AI applications in CAD: by analyzing six major application areas, it reveals their potential to enhance and streamline CAD workflows, and proposes future research directions to drive the development of CAD technology.

Link: https://arxiv.org/abs/2505.08137
Authors: Licheng Zhang, Bach Le, Naveed Akhtar, Siew-Kei Lam, Tuan Ngo
Institutions: The University of Melbourne; Nanyang Technological University
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL); Graphics (cs.GR); Multimedia (cs.MM)
Comments:

Click to view abstract

Abstract:Large Language Models (LLMs) have seen rapid advancements in recent years, with models like ChatGPT and DeepSeek, showcasing their remarkable capabilities across diverse domains. While substantial research has been conducted on LLMs in various fields, a comprehensive review focusing on their integration with Computer-Aided Design (CAD) remains notably absent. CAD is the industry standard for 3D modeling and plays a vital role in the design and development of products across different industries. As the complexity of modern designs increases, the potential for LLMs to enhance and streamline CAD workflows presents an exciting frontier. This article presents the first systematic survey exploring the intersection of LLMs and CAD. We begin by outlining the industrial significance of CAD, highlighting the need for AI-driven innovation. Next, we provide a detailed overview of the foundation of LLMs. We also examine both closed-source LLMs as well as publicly available models. The core of this review focuses on the various applications of LLMs in CAD, providing a taxonomy of six key areas where these models are making considerable impact. Finally, we propose several promising future directions for further advancements, which offer vast opportunities for innovation and are poised to shape the future of CAD technology. Github: this https URL

[NLP-41] ALOHA: Empowering Multilingual Agent for University Orientation with Hierarchical Retrieval NAACL2025

【Quick Read】: This paper addresses the inability of existing public services to meet the campus-specific information needs of university faculty and students, chiefly because LLMs lack domain-specific knowledge and search engines offer limited support for multilingual and time-sensitive scenarios. The proposed solution is ALOHA, a multilingual agent enhanced by hierarchical retrieval, with external APIs integrated into the front end to provide interactive service; the key lies in strengthening domain adaptation through hierarchical retrieval and improving multilinguality and timeliness through external resources.

Link: https://arxiv.org/abs/2505.08130
Authors: Mingxu Tao, Bowen Tang, Mingxuan Ma, Yining Zhang, Hourun Li, Feifan Wen, Hao Ma, Jia Yang
Institutions: Unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: To appear in NAACL 2025 Demo Track

Click to view abstract

Abstract:The rise of Large Language Models (LLMs) revolutionizes information retrieval, allowing users to obtain required answers through complex instructions within conversations. However, publicly available services remain inadequate in addressing the needs of faculty and students to search campus-specific information. This is primarily due to the LLM's lack of domain-specific knowledge and the limitation of search engines in supporting multilingual and timely scenarios. To tackle these challenges, we introduce ALOHA, a multilingual agent enhanced by hierarchical retrieval for university orientation. We also integrate external APIs into the front-end interface to provide interactive service. The human evaluation and case study show our proposed system has strong capabilities to yield correct, timely, and user-friendly responses to the queries in multiple languages, surpassing commercial chatbots and search engines. The system has been deployed and has provided service for more than 12,000 people.

[NLP-42] Putting It All into Context: Simplifying Agents with LCLMs

【Quick Read】: This paper asks whether the increasingly complex architectures of language model (LM) agents for hard real-world tasks are actually necessary, probing whether all the auxiliary scaffolding is needed on challenging tasks such as SWE-bench. The key to the solution is placing the entire environment into the context of a long-context language model (LCLM) and prompting it appropriately, which makes the model competitive with carefully tuned, complex agent scaffolds. Experimentally, Gemini-1.5-Pro with no scaffolding or tools reaches 38% on SWE-Bench-Verified, while Gemini-2.5-Pro with the same scaffold-free approach reaches a 50.8% solve rate, demonstrating the viability of the simplified architecture.

Link: https://arxiv.org/abs/2505.08120
Authors: Mingjian Jiang, Yangjun Ruan, Luis Lastras, Pavan Kapanipathi, Tatsunori Hashimoto
Institutions: Stanford University; IBM Research; University of Toronto
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:Recent advances in language model (LM) agents have demonstrated significant potential for automating complex real-world tasks. To make progress on these difficult tasks, LM agent architectures have become increasingly complex, often incorporating multi-step retrieval tools, multiple agents, and scaffolding adapted to the underlying LM. In this work, we investigate whether all of this complexity is necessary, or if parts of these scaffolds can be removed on challenging tasks like SWE-bench. We show that in the case of SWE-bench, simply putting the entire environment into the context of a long context language model (LCLM) and properly prompting the model makes it competitive with carefully tuned, complex agent scaffolds. We show that a Gemini-1.5-Pro model without any scaffolding or tools achieves 38% on SWE-Bench-Verified, comparable with approaches using carefully tuned agent scaffolds (32%). While the unscaffolded approach with Gemini-1.5-Pro falls short of the strongest agentic architectures, we demonstrate that the more capable Gemini-2.5-Pro using the same unscaffolded approach directly attains a 50.8% solve rate. Additionally, a two-stage approach combining Gemini-1.5-Pro with Claude-3.7 achieves a competitive 48.6% solve rate.

[NLP-43] Are LLMs complicated ethical dilemma analyzers? ALT

【Quick Read】: This paper investigates the open question of whether large language models (LLMs) can emulate human ethical reasoning and act as believable proxies for human judgment. The key to the solution is a benchmark of 196 real-world ethical dilemmas with expert opinions, each segmented into five structured components (Introduction, Key Factors, Historical Theoretical Perspectives, Resolution Strategies, and Key Takeaways), evaluated with a composite metric framework (based on BLEU, Damerau-Levenshtein distance, TF-IDF cosine similarity, and Universal Sentence Encoder similarity) whose metric weights are computed via inversion-based ranking alignment and pairwise AHP analysis, enabling fine-grained comparison of model outputs against expert responses.

Link: https://arxiv.org/abs/2505.08106
Authors: Jiashen (Jason) Du, Jesse Yao, Allen Liu, Zhekai Zhang
Institutions: University of California, Berkeley
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: CS194-280 Advanced LLM Agents project. Project page: this https URL

Click to view abstract

Abstract:One open question in the study of Large Language Models (LLMs) is whether they can emulate human ethical reasoning and act as believable proxies for human judgment. To investigate this, we introduce a benchmark dataset comprising 196 real-world ethical dilemmas and expert opinions, each segmented into five structured components: Introduction, Key Factors, Historical Theoretical Perspectives, Resolution Strategies, and Key Takeaways. We also collect non-expert human responses for comparison, limited to the Key Factors section due to their brevity. We evaluate multiple frontier LLMs (GPT-4o-mini, Claude-3.5-Sonnet, Deepseek-V3, Gemini-1.5-Flash) using a composite metric framework based on BLEU, Damerau-Levenshtein distance, TF-IDF cosine similarity, and Universal Sentence Encoder similarity. Metric weights are computed through an inversion-based ranking alignment and pairwise AHP analysis, enabling fine-grained comparison of model outputs to expert responses. Our results show that LLMs generally outperform non-expert humans in lexical and structural alignment, with GPT-4o-mini performing most consistently across all sections. However, all models struggle with historical grounding and proposing nuanced resolution strategies, which require contextual abstraction. Human responses, while less structured, occasionally achieve comparable semantic similarity, suggesting intuitive moral reasoning. These findings highlight both the strengths and current limitations of LLMs in ethical decision-making.
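
One component of the composite metric, TF-IDF cosine similarity, takes only a few lines; BLEU, Damerau-Levenshtein distance, and USE similarity would be combined analogously with the AHP-derived weights, which are not reproduced here (the two texts below are invented examples):

```python
# TF-IDF cosine similarity between a model answer and an expert reference.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

expert = "The key factors are patient autonomy and informed consent."
model_answer = "Informed consent and the patient's autonomy are the key factors."

vec = TfidfVectorizer().fit([expert, model_answer])
score = cosine_similarity(vec.transform([expert]),
                          vec.transform([model_answer]))[0, 0]
print(round(float(score), 3))   # one of four signals in the composite metric
```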

[NLP-44] Beyond Input Activations: Identifying Influential Latents by Gradient Sparse Autoencoders

【Quick Read】: This paper addresses the limitation that conventional sparse autoencoder (SAE) analyses of large language model (LLM) representations rely solely on input-side activations, ignoring the causal influence of each latent feature on the model's output. The key to the solution is the Gradient Sparse Autoencoder (GradSAE), which incorporates output-side gradient information to identify the latent features with the greatest influence on the output, enabling more effective interpretation and steering of internal representations.

Link: https://arxiv.org/abs/2505.08080
Authors: Dong Shu, Xuansheng Wu, Haiyan Zhao, Mengnan Du, Ninghao Liu
Institutions: Northwestern University; University of Georgia; New Jersey Institute of Technology
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 10 pages, 3 figures

Click to view abstract

Abstract:Sparse Autoencoders (SAEs) have recently emerged as powerful tools for interpreting and steering the internal representations of large language models (LLMs). However, conventional approaches to analyzing SAEs typically rely solely on input-side activations, without considering the causal influence between each latent feature and the model’s output. This work is built on two key hypotheses: (1) activated latents do not contribute equally to the construction of the model’s output, and (2) only latents with high causal influence are effective for model steering. To validate these hypotheses, we propose Gradient Sparse Autoencoder (GradSAE), a simple yet effective method that identifies the most influential latents by incorporating output-side gradient information.
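
A toy rendering of the GradSAE idea (the shapes, the one-layer SAE, and the downstream loss are illustrative stand-ins, not the paper's setup): score each latent by activation times output-side gradient and keep only the high-influence latents for steering.

```python
# Rank SAE latents by a causal-influence proxy: activation x output gradient.
import torch
import torch.nn as nn

torch.manual_seed(0)
d_model, d_latent = 16, 64

encoder = nn.Linear(d_model, d_latent)
decoder = nn.Linear(d_latent, d_model)

hidden = torch.randn(1, d_model)          # an LLM hidden state
latents = torch.relu(encoder(hidden))     # input-side activations
latents.retain_grad()
recon = decoder(latents)

# Stand-in for the model's output loss downstream of this hidden state.
output_loss = (recon.sum() - 1.0) ** 2
output_loss.backward()

influence = (latents * latents.grad).abs().squeeze(0)   # per-latent influence
top = influence.topk(5).indices                          # latents worth steering
print(top)
```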

[NLP-45] HYPERNYM MERCURY: Token Optimization through Semantic Field Constriction and Reconstruction from Hypernyms. A New Text Compression Method

【Quick Read】: This paper addresses compute optimization in large language model (LLM) prompting, specifically improving efficiency by reducing token counts. The key to the solution is a novel text representation scheme and a first-of-its-kind word-level semantic compression of paragraphs, achieving over 90% token reduction while retaining high semantic similarity to the source text. The technique can be lossless, its detail granularity is controllable, and it is validated on open-source data and shown to hold across genres and models.

Link: https://arxiv.org/abs/2505.08058
Authors: Chris Forrester, Octavia Sulea
Institutions: Unknown
Subjects: Computation and Language (cs.CL)
Comments:

Click to view abstract

Abstract:Compute optimization using token reduction of LLM prompts is an emerging task in the fields of NLP and next generation, agentic AI. In this white paper, we introduce a novel (patent pending) text representation scheme and a first-of-its-kind word-level semantic compression of paragraphs that can lead to over 90% token reduction, while retaining high semantic similarity to the source text. We explain how this novel compression technique can be lossless and how the detail granularity is controllable. We discuss benchmark results over open source data (i.e. Bram Stoker’s Dracula available through Project Gutenberg) and show how our results hold at the paragraph level, across multiple genres and models.
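
The paper's method is patent pending and not spelled out, but the general recipe of hypernym substitution can be sketched with WordNet (the word list is invented; the inverse table is what would make reconstruction, and hence losslessness, possible):

```python
# Replace content words with a WordNet hypernym, keeping an inverse table.
import nltk
from nltk.corpus import wordnet as wn

nltk.download("wordnet", quiet=True)

def compress(words):
    mapping, out = {}, []
    for w in words:
        syns = wn.synsets(w, pos=wn.NOUN)
        hyper = syns[0].hypernyms() if syns else []
        if hyper:
            h = hyper[0].lemmas()[0].name()
            mapping[h] = w          # remember how to reconstruct the original
            out.append(h)
        else:
            out.append(w)           # no hypernym found: keep the word as-is
    return out, mapping

tokens, table = compress(["count", "castle", "carriage"])
print(tokens, table)   # hypernyms plus the inverse table for reconstruction
```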

[NLP-46] FalseReject: A Resource for Improving Contextual Safety and Mitigating Over-Refusals in LLM s via Structured Reasoning

【Quick Read】: This paper addresses the over-refusal of benign queries caused by safety alignment in large language models (LLMs), which significantly diminishes their utility in sensitive scenarios. The key to the solution is FalseReject, a comprehensive resource of 16k seemingly toxic queries with structured responses across 44 safety-related categories, generated by a graph-informed adversarial multi-agent interaction framework to produce diverse and complex prompts, with responses structured around explicit reasoning to help models accurately distinguish safe from unsafe contexts.

Link: https://arxiv.org/abs/2505.08054
Authors: Zhehao Zhang, Weijie Xu, Fanyou Wu, Chandan K. Reddy
Institutions: Dartmouth College; Amazon
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Safety alignment approaches in large language models (LLMs) often lead to the over-refusal of benign queries, significantly diminishing their utility in sensitive scenarios. To address this challenge, we introduce FalseReject, a comprehensive resource containing 16k seemingly toxic queries accompanied by structured responses across 44 safety-related categories. We propose a graph-informed adversarial multi-agent interaction framework to generate diverse and complex prompts, while structuring responses with explicit reasoning to aid models in accurately distinguishing safe from unsafe contexts. FalseReject includes training datasets tailored for both standard instruction-tuned models and reasoning-oriented models, as well as a human-annotated benchmark test set. Our extensive benchmarking on 29 state-of-the-art (SOTA) LLMs reveals persistent over-refusal challenges. Empirical results demonstrate that supervised finetuning with FalseReject substantially reduces unnecessary refusals without compromising overall safety or general language capabilities.

[NLP-47] NAZM: Network Analysis of Zonal Metrics in Persian Poetic Tradition

【Quick Read】: This paper quantifies and visualizes the dynamics of influence among classical Persian poets, aiming to reveal structural relationships and the evolution of literary schools through a computational model. The key to the solution is a multi-dimensional similarity network that characterizes each poet's corpus with semantic, lexical, stylistic, thematic, and metrical features and aggregates weighted similarity matrices into a graph of poet-to-poet influence. Centrality measures and Louvain community detection then identify key poets, style hubs, and bridging poets, offering a new data-driven perspective on Persian literature.

Link: https://arxiv.org/abs/2505.08052
Authors: Kourosh Shahnazari, Seyed Moein Ayyoubzadeh
Institutions: Unknown
Subjects: Social and Information Networks (cs.SI); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:This study formalizes a computational model to simulate classical Persian poets' dynamics of influence through constructing a multi-dimensional similarity network. Using a rigorously curated dataset based on Ganjoor's corpus, we draw upon semantic, lexical, stylistic, thematic, and metrical features to demarcate each poet's corpus. Each is contained within weighted similarity matrices, which are then appended to generate an aggregate graph showing poet-to-poet influence. Further network investigation is carried out to identify key poets, style hubs, and bridging poets by calculating degree, closeness, betweenness, eigenvector, and Katz centrality measures. Further, for typological insight, we use the Louvain community detection algorithm to demarcate clusters of poets sharing both style and theme coherence, which correspond closely to acknowledged schools of literature like Sabk-e Hindi, Sabk-e Khorasani, and the Bazgasht-e Adabi phenomenon. Our findings provide a new data-driven view of Persian literature that distinguishes between canonical significance and intertextual influence, thus highlighting relatively lesser-known figures who hold great structural significance. Combining computational linguistics with literary study, this paper produces an interpretable and scalable model for poetic tradition, enabling retrospective reflection as well as forward-looking research within digital humanities.

[NLP-48] TiSpell: A Semi-Masked Methodology for Tibetan Spelling Correction covering Multi-Level Error with Data Augmentation

【Quick Read】: This paper addresses multi-level Tibetan spelling correction, i.e., handling character-level and syllable-level errors in a single model, whereas existing methods focus on a single level and lack effective integration of the two; moreover, no open-source dataset or augmentation method exists for this task in Tibetan. The paper proposes a data augmentation approach that generates multi-level corruptions from unlabeled text and introduces TiSpell, a semi-masked model that corrects both character- and syllable-level errors. The key is that the semi-masked strategy simplifies the harder syllable-level correction, which depends on global context, while nine types of synthesized corruptions on clean sentences yield a robust training set that improves correction performance.

Link: https://arxiv.org/abs/2505.08037
Authors: Yutong Liu, Feng Xiao, Ziyue Zhang, Yongbin Yu, Cheng Huang, Fan Gao, Xiangxiang Wang, Ma-bao Ban, Manping Fan, Thupten Tsering, Cheng Huang, Gadeng Luosang, Renzeng Duojie, Nyima Tashi
Institutions: School of Information and Software Engineering, University of Electronic Science and Technology of China; School of Information Science and Technology, Tibet University; Department of Ophthalmology, University of Texas Southwestern Medical Center
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: 14 pages, 7 figures

Click to view abstract

Abstract:Multi-level Tibetan spelling correction addresses errors at both the character and syllable levels within a unified model. Existing methods focus mainly on single-level correction and lack effective integration of both levels. Moreover, there are no open-source datasets or augmentation methods tailored for this task in Tibetan. To tackle this, we propose a data augmentation approach using unlabeled text to generate multi-level corruptions, and introduce TiSpell, a semi-masked model capable of correcting both character- and syllable-level errors. Although syllable-level correction is more challenging due to its reliance on global context, our semi-masked strategy simplifies this process. We synthesize nine types of corruptions on clean sentences to create a robust training set. Experiments on both simulated and real-world data demonstrate that TiSpell, trained on our dataset, outperforms baseline models and matches the performance of state-of-the-art approaches, confirming its effectiveness.
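
The nine Tibetan-specific corruption types are not reproduced here, but the general recipe of synthesizing (noisy, clean) training pairs from unlabeled text can be sketched with generic character-level edits:

```python
# Corrupt clean sentences to build (noisy, clean) pairs for correction training.
import random

random.seed(0)

def corrupt(sentence: str) -> str:
    chars = list(sentence)
    op = random.choice(["delete", "swap", "duplicate"])
    i = random.randrange(len(chars) - 1)
    if op == "delete":
        del chars[i]
    elif op == "swap":
        chars[i], chars[i + 1] = chars[i + 1], chars[i]
    else:
        chars.insert(i, chars[i])
    return "".join(chars)

clean = "this is a clean training sentence"
pairs = [(corrupt(clean), clean) for _ in range(3)]   # (noisy, clean) pairs
for noisy, target in pairs:
    print(noisy, "->", target)
```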

[NLP-49] Large Language Models and Arabic Content: A Review WWW

【Quick Read】: This paper addresses the challenges Arabic poses for natural language processing (NLP) due to the language's complexity (rich morphology, intricate structure, and diverse writing standards) and the scarcity of resources, datasets, and tools. The key to the solution is pre-trained large language models (LLMs) trained on multilingual corpora, which perform strongly across a range of Arabic NLP tasks and can be further improved through techniques such as fine-tuning and prompt engineering.

Link: https://arxiv.org/abs/2505.08004
Authors: Haneh Rhel, Dmitri Roussinov
Institutions: Unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Original language: English. Submitted to the First International Conference on Artificial Intelligence and Generative AI (FICAILY 2025); accepted for presentation at FICAILY on 9-10 July 2025 and for publication by Springer Nature. 16 pages. Publication status: accepted/in press, 7 Apr 2025. this https URL

Click to view abstract

Abstract:Over the past three years, the rapid advancement of Large Language Models (LLMs) has had a profound impact on multiple areas of Artificial Intelligence (AI), particularly in Natural Language Processing (NLP) across diverse languages, including Arabic. Although Arabic is considered one of the most widely spoken languages across 27 countries in the Arabic world and used as a second language in some other non-Arabic countries as well, there is still a scarcity of Arabic resources, datasets, and tools. Arabic NLP tasks face various challenges due to the complexities of the Arabic language, including its rich morphology, intricate structure, and diverse writing standards, among other factors. Researchers have been actively addressing these challenges, demonstrating that pre-trained Large Language Models (LLMs) trained on multilingual corpora achieve significant success in various Arabic NLP tasks. This study provides an overview of using large language models (LLMs) for the Arabic language, highlighting early pre-trained Arabic Language models across various NLP applications and their ability to handle diverse Arabic content tasks and dialects. It also provides an overview of how techniques like finetuning and prompt engineering can enhance the performance of these models. Additionally, the study summarizes common Arabic benchmarks and datasets while presenting our observations on the persistent upward trend in the adoption of LLMs.

[NLP-50] Task-Adaptive Semantic Communications with Controllable Diffusion-based Data Regeneration

【Quick Read】: This paper addresses how semantic communications can stay bandwidth-efficient while adapting to the diverse downstream tasks of receivers. The key to the solution is a diffusion-based task-adaptive semantic communication framework that dynamically adjusts which semantic information is delivered for different downstream tasks: a deeply compressed general semantic representation is first transmitted to enable coarse diffusion-based reconstruction at the receiver, after which textual prompts fed back by the receiver, integrated through an attention mechanism, let the transmitter refine the semantic transmission with more detail to better align with the intended task objectives.

Link: https://arxiv.org/abs/2505.07980
Authors: Fupei Guo, Achintha Wijesinghe, Songyang Zhang, Zhi Ding
Institutions: University of Louisiana at Lafayette; University of California at Davis
Subjects: Computation and Language (cs.CL)
Comments:

Click to view abstract

Abstract:Semantic communications represent a new paradigm of next-generation networking that shifts bit-wise data delivery to conveying the semantic meanings for bandwidth efficiency. To effectively accommodate various potential downstream tasks at the receiver side, one should adaptively convey the most critical semantic information. This work presents a novel task-adaptive semantic communication framework based on diffusion models that is capable of dynamically adjusting the semantic message delivery according to various downstream tasks. Specifically, we initialize the transmission of a deep-compressed general semantic representation from the transmitter to enable diffusion-based coarse data reconstruction at the receiver. The receiver identifies the task-specific demands and generates textual prompts as feedback. Integrated with the attention mechanism, the transmitter updates the semantic transmission with more details to better align with the objectives of the intended receivers. Our test results demonstrate the efficacy of the proposed method in adaptively preserving critical task-relevant information for semantic communications while preserving high compression efficiency.

[NLP-51] Assessing and Mitigating Medical Knowledge Drift and Conflicts in Large Language Models

【Quick Read】: This paper addresses the adaptation problem large language models (LLMs) face in healthcare as medical knowledge evolves rapidly, particularly the outdated or contradictory treatment suggestions caused by concept drift and internal inconsistencies. The key to the solution is the DriftMedQA benchmark, which simulates guideline evolution to assess the temporal reliability of different LLMs, together with two mitigation strategies, Retrieval-Augmented Generation and preference fine-tuning via Direct Preference Optimization, whose combination yields the most consistent and reliable results.

Link: https://arxiv.org/abs/2505.07968
Authors: Weiyi Wu, Xinwen Xu, Chongyang Gao, Xingjian Diao, Siting Li, Lucas A. Salas, Jiang Gui
Institutions: Unknown
Subjects: Computation and Language (cs.CL)
Comments:

Click to view abstract

Abstract:Large Language Models (LLMs) have great potential in the field of health care, yet they face great challenges in adapting to rapidly evolving medical knowledge. This can lead to outdated or contradictory treatment suggestions. This study investigated how LLMs respond to evolving clinical guidelines, focusing on concept drift and internal inconsistencies. We developed the DriftMedQA benchmark to simulate guideline evolution and assessed the temporal reliability of various LLMs. Our evaluation of seven state-of-the-art models across 4,290 scenarios demonstrated difficulties in rejecting outdated recommendations and frequently endorsing conflicting guidance. Additionally, we explored two mitigation strategies: Retrieval-Augmented Generation and preference fine-tuning via Direct Preference Optimization. While each method improved model performance, their combination led to the most consistent and reliable results. These findings underscore the need to improve LLM robustness to temporal shifts to ensure more dependable applications in clinical practice.

[NLP-52] Re2: A Consistency-ensured Dataset for Full-stage Peer Review and Multi-turn Rebuttal Discussions

【Quick Read】: This paper addresses the strain on peer review in AI caused by surging submission volumes, which leads to reviewer shortages and declining review quality, aggravated by the repeated resubmission of substandard manuscripts. The key to the solution is Re^2, the largest consistency-ensured peer review and rebuttal dataset, comprising 19,926 initial submissions, 70,668 review comments, and 53,818 rebuttals from 24 conferences and 21 workshops, which frames the rebuttal and discussion stage as a multi-turn conversation paradigm to support both traditional static review tasks and dynamic interactive LLM assistants, giving authors more practical guidance for refining manuscripts and helping ease the review burden.

Link: https://arxiv.org/abs/2505.07920
Authors: Daoze Zhang, Zhijian Bao, Sihang Du, Zhiyi Zhao, Kuangling Zhang, Dezheng Bao, Yang Yang
Institutions: Zhejiang University
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 2 figures, 5 tables

Click to view abstract

Abstract:Peer review is a critical component of scientific progress in the fields like AI, but the rapid increase in submission volume has strained the reviewing system, which inevitably leads to reviewer shortages and declines review quality. Besides the growing research popularity, another key factor in this overload is the repeated resubmission of substandard manuscripts, largely due to the lack of effective tools for authors to self-evaluate their work before submission. Large Language Models (LLMs) show great promise in assisting both authors and reviewers, and their performance is fundamentally limited by the quality of the peer review data. However, existing peer review datasets face three major limitations: (1) limited data diversity, (2) inconsistent and low-quality data due to the use of revised rather than initial submissions, and (3) insufficient support for tasks involving rebuttal and reviewer-author interactions. To address these challenges, we introduce the largest consistency-ensured peer review and rebuttal dataset named Re^2, which comprises 19,926 initial submissions, 70,668 review comments, and 53,818 rebuttals from 24 conferences and 21 workshops on OpenReview. Moreover, the rebuttal and discussion stage is framed as a multi-turn conversation paradigm to support both traditional static review tasks and dynamic interactive LLM assistants, providing more practical guidance for authors to refine their manuscripts and helping alleviate the growing review burden. Our data and code are available in this https URL.

[NLP-53] SciCom Wiki: Fact-Checking and FAIR Knowledge Distribution for Scientific Videos and Podcasts

【Quick Read】: This paper addresses the fragmentation and poor scalability of the Science Communication Knowledge Infrastructure (SciCom KI) for non-textual media such as videos and podcasts, amid the need for reliable information and fact-checking in an information flood. The key to the solution is SciCom Wiki, an open-source service system centered on Wikibase that provides FAIR (findable, accessible, interoperable, reusable) media representations, combined with a neurosymbolic computational fact-checking approach that converts heterogeneous media into knowledge graphs, improving machine-readability and allowing statements to be compared against equally represented ground truth.

Link: https://arxiv.org/abs/2505.07912
Authors: Tim Wittenborg, Constantin Sebastian Tremel, Niklas Stehr, Oliver Karras, Markus Stocker, Sören Auer
Institutions: Unknown
Subjects: Digital Libraries (cs.DL); Computation and Language (cs.CL); Multimedia (cs.MM)
Comments: 18 pages, 10 figures, submitted to TPDL 2025

Click to view abstract

Abstract:Democratic societies need accessible, reliable information. Videos and Podcasts have established themselves as the medium of choice for civic dissemination, but also as carriers of misinformation. The emerging Science Communication Knowledge Infrastructure (SciCom KI) curating non-textual media is still fragmented and not adequately equipped to scale against the content flood. Our work sets out to support the SciCom KI with a central, collaborative platform, the SciCom Wiki, to facilitate FAIR (findable, accessible, interoperable, reusable) media representation and the fact-checking of their content, particularly for videos and podcasts. Building an open-source service system centered around Wikibase, we survey requirements from 53 stakeholders, refine these in 11 interviews, and evaluate our prototype based on these requirements with another 14 participants. To address the most requested feature, fact-checking, we developed a neurosymbolic computational fact-checking approach, converting heterogenous media into knowledge graphs. This increases machine-readability and allows comparing statements against equally represented ground-truth. Our computational fact-checking tool was iteratively evaluated through 10 expert interviews, a public user survey with 43 participants verified the necessity and usability of our tool. Overall, our findings identified several needs to systematically support the SciCom KI. The SciCom Wiki, as a FAIR digital library complementing our neurosymbolic computational fact-checking framework, was found suitable to address the raised requirements. Further, we identified that the SciCom KI is severely underdeveloped regarding FAIR knowledge and related systems facilitating its collaborative creation and curation. Our system can provide a central knowledge node, yet a collaborative effort is required to scale against the imminent (mis-)information flood.

[NLP-54] A Reproduction Study: The Kernel PCA Interpretation of Self-Attention Fails Under Scrutiny

【Quick Read】: This paper scrutinizes the recent claim that self-attention implements kernel principal component analysis (KPCA), namely that value vectors capture the eigenvectors of the keys' Gram matrix and that self-attention projects queries onto the principal component axes of the key matrix in feature space. The key contribution is an empirical analysis exposing three critical inconsistencies: learned self-attention value vectors show no meaningful alignment with the eigenvectors posited by the KPCA view; the reported decrease in reconstruction loss, read as evidence that self-attention minimizes the KPCA projection error, is a misinterpretation because the quantities involved differ by orders of magnitude; and the Gram-matrix eigenvalue statistics cited to show that value vectors capture the Gram matrix's eigenvectors cannot be reproduced without undocumented implementation-specific adjustments. The conclusion is that the KPCA interpretation of self-attention lacks empirical support.

Link: https://arxiv.org/abs/2505.07908
Authors: Karahan Sarıtaş, Çağatay Yıldız
Institutions: University of Tübingen; Tübingen AI Center
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Click to view abstract

Abstract:In this reproduction study, we revisit recent claims that self-attention implements kernel principal component analysis (KPCA) (Teo et al., 2024), positing that (i) value vectors V capture the eigenvectors of the Gram matrix of the keys, and (ii) that self-attention projects queries onto the principal component axes of the key matrix K in a feature space. Our analysis reveals three critical inconsistencies: (1) No alignment exists between learned self-attention value vectors and what is proposed in the KPCA perspective, with average similarity metrics (optimal cosine similarity ≤ 0.32, linear CKA (Centered Kernel Alignment) ≤ 0.11, kernel CKA ≤ 0.32) indicating negligible correspondence; (2) Reported decreases in reconstruction loss J_proj, arguably justifying the claim that the self-attention minimizes the projection error of KPCA, are misinterpreted, as the quantities involved differ by orders of magnitude (~10^3); (3) Gram matrix eigenvalue statistics, introduced to justify that V captures the eigenvectors of the Gram matrix, are irreproducible without undocumented implementation-specific adjustments. Across 10 transformer architectures, we conclude that the KPCA interpretation of self-attention lacks empirical support.
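
Linear CKA, the alignment metric at the center of point (1), takes only a few lines of NumPy; values near 0, as reported, mean the two sets of vectors occupy essentially unrelated subspaces (the random matrices below are placeholders for the real value vectors and eigenvectors):

```python
# Linear Centered Kernel Alignment between two feature matrices.
import numpy as np

def linear_cka(X, Y):
    # X: (n, d1), Y: (n, d2); features are centered before comparison.
    X = X - X.mean(axis=0, keepdims=True)
    Y = Y - Y.mean(axis=0, keepdims=True)
    hsic = np.linalg.norm(X.T @ Y, "fro") ** 2
    return hsic / (np.linalg.norm(X.T @ X, "fro") * np.linalg.norm(Y.T @ Y, "fro"))

rng = np.random.default_rng(0)
V = rng.normal(size=(128, 64))            # e.g., learned value vectors
E = rng.normal(size=(128, 64))            # e.g., Gram-matrix eigenvectors
print(round(float(linear_cka(V, E)), 3))  # near 0 for unrelated features
```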

[NLP-55] SEM: Reinforcement Learning for Search-Efficient Large Language Models

【Quick Read】: This paper addresses the problem of teaching large language models (LLMs) when to invoke external tools such as search engines and when to rely on internal knowledge; current reinforcement learning approaches often induce redundant searches, causing inefficiency and excess cost. The key to the solution is SEM, a post-training reinforcement learning framework that builds a balanced dataset combining MuSiQue and MMLU, designs a structured reasoning template, and post-trains search behavior with Group Relative Policy Optimization (GRPO); its reward function encourages accurate answers without unnecessary searches while promoting effective retrieval when it is genuinely needed, improving reasoning efficiency and the judicious use of external knowledge.

Link: https://arxiv.org/abs/2505.07903
Authors: Zeyang Sha, Shiwen Cui, Weiqiang Wang
Institutions: Ant Group
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Recent advancements in Large Language Models(LLMs) have demonstrated their capabilities not only in reasoning but also in invoking external tools, particularly search engines. However, teaching models to discern when to invoke search and when to rely on their internal knowledge remains a significant challenge. Existing reinforcement learning approaches often lead to redundant search behaviors, resulting in inefficiencies and over-cost. In this paper, we propose SEM, a novel post-training reinforcement learning framework that explicitly trains LLMs to optimize search usage. By constructing a balanced dataset combining MuSiQue and MMLU, we create scenarios where the model must learn to distinguish between questions it can answer directly and those requiring external retrieval. We design a structured reasoning template and employ Group Relative Policy Optimization(GRPO) to post-train the model’s search behaviors. Our reward function encourages accurate answering without unnecessary search while promoting effective retrieval when needed. Experimental results demonstrate that our method significantly reduces redundant search operations while maintaining or improving answer accuracy across multiple challenging benchmarks. This framework advances the model’s reasoning efficiency and extends its capability to judiciously leverage external knowledge.
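
The abstract does not give the exact reward, but its described shape (reward accuracy, discourage unnecessary search, encourage retrieval when it is needed) can be mocked up as follows; all coefficients are our assumptions:

```python
# A guessed-at reward shaping for search-efficient RL post-training.
def sem_reward(answer_correct: bool, used_search: bool, search_needed: bool,
               search_cost: float = 0.2) -> float:
    reward = 1.0 if answer_correct else -1.0
    if used_search:
        reward -= search_cost        # every search call has a price
        if not search_needed:
            reward -= 0.5            # extra penalty for a redundant search
    elif search_needed and not answer_correct:
        reward -= 0.5                # should have retrieved but did not
    return reward

print(sem_reward(True, False, False))   # 1.0: direct, correct answer
print(sem_reward(True, True, False))    # 0.3: correct but wasteful
```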

[NLP-56] Multimodal Assessment of Classroom Discourse Quality: A Text-Centered Attention-Based Multi-Task Learning Approach

【Quick Read】: This paper addresses how to assess the quality of discursive practices across entire lesson segments, where traditional manual coding of observation protocols is time-consuming and costly, by using AI for automated assessment. The key to the solution is a text-centered multimodal fusion architecture that uses attention mechanisms to capture inter- and intra-modal interactions across transcript, audio, and video streams, adopts multi-task learning to jointly predict quality scores for three discourse components (Nature of Discourse, Questioning, and Explanations), and formulates the task as ordinal classification to respect the ordering of rating levels.

Link: https://arxiv.org/abs/2505.07902
Authors: Ruikun Hou, Babette Bühler, Tim Fütterer, Efe Bozkir, Peter Gerjets, Ulrich Trautwein, Enkelejda Kasneci
Institutions: Technical University of Munich; University of Tübingen; Leibniz-Institut für Wissensmedien
Subjects: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: The 18th International Conference on Educational Data Mining (EDM 2025)

Click to view abstract

Abstract:Classroom discourse is an essential vehicle through which teaching and learning take place. Assessing different characteristics of discursive practices and linking them to student learning achievement enhances the understanding of teaching quality. Traditional assessments rely on manual coding of classroom observation protocols, which is time-consuming and costly. Despite many studies utilizing AI techniques to analyze classroom discourse at the utterance level, investigations into the evaluation of discursive practices throughout an entire lesson segment remain limited. To address this gap, our study proposes a novel text-centered multimodal fusion architecture to assess the quality of three discourse components grounded in the Global Teaching InSights (GTI) observation protocol: Nature of Discourse, Questioning, and Explanations. First, we employ attention mechanisms to capture inter- and intra-modal interactions from transcript, audio, and video streams. Second, a multi-task learning approach is adopted to jointly predict the quality scores of the three components. Third, we formulate the task as an ordinal classification problem to account for rating level order. The effectiveness of these designed elements is demonstrated through an ablation study on the GTI Germany dataset containing 92 videotaped math lessons. Our results highlight the dominant role of text modality in approaching this task. Integrating acoustic features enhances the model’s consistency with human ratings, achieving an overall Quadratic Weighted Kappa score of 0.384, comparable to human inter-rater reliability (0.326). Our study lays the groundwork for the future development of automated discourse quality assessment to support teacher professional development through timely feedback on multidimensional discourse practices.

[NLP-57] DeltaEdit: Enhancing Sequential Editing in Large Language Models by Controlling Superimposed Noise

【Quick Read】: This paper addresses the marked drop in editing success rates that large language models suffer under long-term sequential knowledge editing, which it attributes to the accumulation of superimposed noise during editing. The key to the solution is identifying the factors that push model outputs away from their targets and proposing DeltaEdit, which optimizes update parameters with a dynamic orthogonal constraints strategy, effectively reducing interference between edits to mitigate the deviation and improve both editing success rates and the retention of generalization ability.

Link: https://arxiv.org/abs/2505.07899
Authors: Ding Cao, Yuchen Cai, Rongxi Guo, Xuesong He, Guiquan Liu
Institutions: University of Science and Technology of China
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Sequential knowledge editing techniques aim to continuously update the knowledge in large language models at a low cost, preventing the models from generating outdated or incorrect information. However, existing sequential editing methods suffer from a significant decline in editing success rates after long-term editing. Through theoretical analysis and experiments, we identify that as the number of edits increases, the model’s output increasingly deviates from the desired target, leading to a drop in editing success rates. We refer to this issue as the accumulation of superimposed noise problem. To address this, we identify the factors contributing to this deviation and propose DeltaEdit, a novel method that optimizes update parameters through a dynamic orthogonal constraints strategy, effectively reducing interference between edits to mitigate deviation. Experimental results demonstrate that DeltaEdit significantly outperforms existing methods in edit success rates and the retention of generalization capabilities, ensuring stable and reliable model performance even under extensive sequential editing.
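
One plausible reading of an orthogonal constraint between edits (the details below are our assumption, not the paper's algorithm) is to project each new parameter update onto the orthogonal complement of the subspace spanned by earlier edit directions:

```python
# Project a new edit direction away from all earlier edit directions.
import numpy as np

def orthogonalized_update(delta, previous):
    """Remove from `delta` all components along earlier edit directions."""
    basis, _ = np.linalg.qr(np.stack(previous, axis=1))  # orthonormal basis of past edits
    return delta - basis @ (basis.T @ delta)

rng = np.random.default_rng(0)
earlier_edits = [rng.normal(size=8) for _ in range(3)]
new_edit = rng.normal(size=8)

constrained = orthogonalized_update(new_edit, earlier_edits)
print([round(float(constrained @ p), 8) for p in earlier_edits])   # all ~0
```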

[NLP-58] LongCodeBench: Evaluating Coding LLMs at 1M Context Windows

【Quick Read】: This paper addresses the shortfall in long-context modeling, in particular the performance drop on tasks with very large contexts. The key to the solution is LongCodeBench (LCB), a benchmark built from real-world GitHub issues with code comprehension (LongCodeQA) and bug fixing (LongSWE-Bench) tasks to evaluate LLM coding ability in long-context scenarios. By carefully stratifying benchmark complexity, it enables systematic evaluation of models at different scales and reveals that long-context handling remains a common weakness across current models.

Link: https://arxiv.org/abs/2505.07897
Authors: Stefano Rando, Luca Romani, Alessio Sampieri, Yuta Kyuragi, Luca Franco, Fabio Galasso, Tatsunori Hashimoto, John Yang
Institutions: Panasonic AI Research; Sapienza University of Rome; ItalAI; Stanford University
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Context lengths for models have grown rapidly, from thousands to millions of tokens in just a few years. The extreme context sizes of modern long-context models have made it difficult to construct realistic long-context benchmarks – not only due to the cost of collecting million-context tasks but also in identifying realistic scenarios that require significant contexts. We identify code comprehension and repair as a natural testbed and challenge task for long-context models and introduce LongCodeBench (LCB), a benchmark to test LLM coding abilities in long-context scenarios. Our benchmark tests both the comprehension and repair capabilities of LCLMs in realistic and important settings by drawing from real-world GitHub issues and constructing QA (LongCodeQA) and bug fixing (LongSWE-Bench) tasks. We carefully stratify the complexity of our benchmark, enabling us to evaluate models across different scales – ranging from Qwen2.5 14B Instruct to Google’s flagship Gemini model. We find that long-context remains a weakness for all models, with performance drops such as from 29% to 3% for Claude 3.5 Sonnet, or from 70.2% to 40% for Qwen2.5.
zh

[NLP-59] TrumorGPT: Graph-Based Retrieval-Augmented Large Language Model for Fact-Checking

【速读】: 该论文试图解决社交媒体时代虚假信息和谣言迅速传播导致的“信息疫情”问题,特别是健康领域中真假信息难以区分的挑战。其解决方案的关键在于提出TrumorGPT,一种基于生成式AI(Generative AI)的新型事实核查方法,通过利用大语言模型(LLM)进行少样本学习构建语义健康知识图谱并进行语义推理,结合图基检索增强生成(GraphRAG)技术,以解决大语言模型的幻觉问题和静态训练数据的局限性,从而实现基于最新医疗新闻和健康信息的高效事实核查。

链接: https://arxiv.org/abs/2505.07891
作者: Ching Nam Hang,Pei-Duo Yu,Chee Wei Tan
机构: Saint Francis University (圣弗朗西斯大学); Chung Yuan Christian University (中原大学); Nanyang Technological University (南洋理工大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:In the age of social media, the rapid spread of misinformation and rumors has led to the emergence of infodemics, where false information poses a significant threat to society. To combat this issue, we introduce TrumorGPT, a novel generative artificial intelligence solution designed for fact-checking in the health domain. TrumorGPT aims to distinguish “trumors”, which are health-related rumors that turn out to be true, providing a crucial tool in differentiating between mere speculation and verified facts. This framework leverages a large language model (LLM) with few-shot learning for semantic health knowledge graph construction and semantic reasoning. TrumorGPT incorporates graph-based retrieval-augmented generation (GraphRAG) to address the hallucination issue common in LLMs and the limitations of static training data. GraphRAG involves accessing and utilizing information from regularly updated semantic health knowledge graphs that consist of the latest medical news and health information, ensuring that fact-checking by TrumorGPT is based on the most recent data. Evaluating with extensive healthcare datasets, TrumorGPT demonstrates superior performance in fact-checking for public health claims. Its ability to effectively conduct fact-checking across various platforms marks a critical step forward in the fight against health-related misinformation, enhancing trust and accuracy in the digital information age.
zh
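
GraphRAG 的要点是“先查图、后生成”。下面是一个示意性片段(迷你知识图谱、实体匹配方式与提示模板均为笔者假设;真实系统会在持续更新的语义健康知识图谱上做图检索与向量检索):

```python
# Hypothetical mini health knowledge graph: (head, relation, tail) triples.
HEALTH_KG = [
    ("vitamin C", "does_not_prevent", "common cold"),
    ("hand washing", "reduces_risk_of", "influenza"),
]

def retrieve_triples(claim, kg=HEALTH_KG, top_k=5):
    """Naive entity-overlap retrieval standing in for real graph search."""
    claim_lower = claim.lower()
    hits = [t for t in kg
            if t[0].lower() in claim_lower or t[2].lower() in claim_lower]
    return hits[:top_k]

def build_fact_check_prompt(claim):
    """Assemble a grounded fact-checking prompt from retrieved triples."""
    evidence = "\n".join(f"- {h} --{r}--> {t}" for h, r, t in retrieve_triples(claim))
    return (
        "You are a health fact-checker. Using ONLY the evidence below, "
        "label the claim as TRUE, FALSE, or UNVERIFIABLE.\n"
        f"Evidence:\n{evidence or '- (no matching triples)'}\n"
        f"Claim: {claim}\nAnswer:"
    )

print(build_fact_check_prompt("Vitamin C prevents the common cold."))
```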

[NLP-60] TSLFormer: A Lightweight Transformer Model for Turkish Sign Language Recognition Using Skeletal Landmarks

【速读】: 该论文旨在解决土耳其手语(Turkish Sign Language, TSL)识别中的高效性和实时性问题,特别是在资源受限设备上实现准确的符号手势识别。其解决方案的关键在于将手语动作视为有序的、类似字符串的语言,并采用基于3D关节位置(由Google的Mediapipe库提取)的输入方式,从而实现高效的维度压缩并保留关键语义信息。同时,该方法借鉴了自然语言处理中Transformer模型的自注意力机制,将其应用于序列到序列的翻译任务中,以有效捕捉手势序列中的时间共现关系并突出有意义的动作模式。

链接: https://arxiv.org/abs/2505.07890
作者: Kutay Ertürk,Furkan Altınışık,İrem Sarıaltın,Ömer Nezih Gerek
机构: Eskişehir Technical University (埃斯基谢希尔技术大学)
类目: Computation and Language (cs.CL); Image and Video Processing (eess.IV)
备注:

点击查看摘要

Abstract:This study presents TSLFormer, a light and robust word-level Turkish Sign Language (TSL) recognition model that treats sign gestures as ordered, string-like language. Instead of using raw RGB or depth videos, our method only works with 3D joint positions - articulation points - extracted using Google’s Mediapipe library, which focuses on the hand and torso skeletal locations. This creates efficient input dimensionality reduction while preserving important semantic gesture information. Our approach revisits sign language recognition as sequence-to-sequence translation, inspired by the linguistic nature of sign languages and the success of transformers in natural language processing. Since TSLFormer uses the self-attention mechanism, it effectively captures temporal co-occurrence within gesture sequences and highlights meaningful motion patterns as words unfold. Evaluated on the AUTSL dataset with over 36,000 samples and 227 different words, TSLFormer achieves competitive performance with minimal computational cost. These results show that joint-based input is sufficient for enabling real-time, mobile, and assistive communication systems for hearing-impaired individuals.
zh
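
该方法以 Mediapipe 提取的 3D 关节坐标序列为输入,把识别当作序列建模任务。下面用 PyTorch 给出一个极简分类器示意(层数、维度等超参数均为笔者假设,与论文配置无关;227 对应 AUTSL 的词表规模):

```python
import torch
import torch.nn as nn

class TinySignTransformer(nn.Module):
    """Sketch: flattened per-frame 3D joints -> transformer encoder -> word logits."""
    def __init__(self, num_joints=54, num_classes=227, d_model=128):
        super().__init__()
        self.proj = nn.Linear(num_joints * 3, d_model)             # per-frame embedding
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)  # self-attention over time
        self.head = nn.Linear(d_model, num_classes)

    def forward(self, joints):                 # joints: (B, T, num_joints, 3)
        x = self.proj(joints.flatten(2))       # (B, T, d_model)
        x = self.encoder(x).mean(dim=1)        # temporal average pooling
        return self.head(x)

model = TinySignTransformer()
logits = model(torch.randn(2, 60, 54, 3))      # 2 clips of 60 frames each
print(logits.shape)                            # torch.Size([2, 227])
```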

[NLP-61] BioProBench: Comprehensive Dataset and Benchmark in Biological Protocol Understanding and Reasoning

【速读】: 该论文旨在解决当前大型语言模型(Large Language Models, LLMs)在处理生物实验协议(biological protocols)这一高度专业化、准确性要求高且具有固有程序性的文本时表现不足的问题。其解决方案的关键在于构建了BioProBench,这是首个大规模、集成多任务的基准测试平台,涵盖五项核心任务:实验协议问答、步骤排序、错误修正、协议生成和协议推理,从而全面评估LLMs在程序性生物文本上的理解与推理能力。该基准基于27,000份原始协议,生成近556,000个高质量结构化实例,为研究者提供了一个标准化框架,以诊断现有模型的局限性并推动更安全自动化复杂科学流程的AI系统发展。

链接: https://arxiv.org/abs/2505.07889
作者: Yuyang Liu,Liuzhenghao Lv,Xiancheng Zhang,Li Yuan,Yonghong Tian
机构: Peking University (北京大学)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Biological protocols are fundamental to reproducible and safe life science research. While LLMs excel on general tasks, their systematic evaluation on these highly specialized, accuracy-critical, and inherently procedural texts remains limited. In this work, we present BioProBench, the first large-scale, integrated multi-task benchmark for biological protocol understanding and reasoning. While limited benchmarks have touched upon specific aspects like protocol QA, BioProBench provides a comprehensive suite of five core tasks: Protocol Question Answering, Step Ordering, Error Correction, Protocol Generation, and Protocol Reasoning, enabling a holistic evaluation of LLMs on procedural biological texts. Built upon 27K original protocols, it yields nearly 556K high-quality structured instances. We evaluate 12 mainstream open/closed-source LLMs on BioProBench. Experimental results reveal that while top models perform well on surface understanding tasks, they struggle significantly with deep reasoning and structured generation tasks like ordering and generation. Furthermore, model comparisons reveal diverse performance: certain open-source models approach closed-source levels on some tasks, yet bio-specific small models lag behind general LLMs, indicating limitations on complex procedural content. Overall, our findings underscore that procedural reasoning within biological protocols represents a significant challenge for current LLMs. BioProBench serves as a standardized framework to diagnose these specific limitations and guide the development of AI systems better equipped for safely automating complex scientific procedures. The code and data are available at: this https URL and this https URL.
zh

[NLP-62] Implementing Long Text Style Transfer with LLMs through Dual-Layered Sentence and Paragraph Structure Extraction and Mapping

【速读】: 该论文旨在解决在长文本风格迁移中利用大语言模型(Large Language Models, LLMs)进行零样本学习时所面临的挑战,特别是如何在保持原文句法和语义一致性的同时,实现句子层面的风格适应与段落层面的结构连贯性。其解决方案的关键在于提出了一种分层框架——ZeroStylus,该框架通过两个系统阶段实现:从参考文本中获取分层模板以及基于模板的多粒度匹配生成。该框架动态构建句子和段落模板库,以在保留句间逻辑关系的同时实现上下文感知的转换,从而显著提升了风格一致性、内容保留和表达质量。

链接: https://arxiv.org/abs/2505.07888
作者: Yusen Wu,Xiaotie Deng
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:This paper addresses the challenge in long-text style transfer using zero-shot learning of large language models (LLMs), proposing a hierarchical framework that combines sentence-level stylistic adaptation with paragraph-level structural coherence. We argue that in the process of effective paragraph-style transfer, to preserve the consistency of original syntactic and semantic information, it is essential to perform style transfer not only at the sentence level but also to incorporate paragraph-level semantic considerations, while ensuring structural coherence across inter-sentential relationships. Our proposed framework, ZeroStylus, operates through two systematic phases: hierarchical template acquisition from reference texts and template-guided generation with multi-granular matching. The framework dynamically constructs sentence and paragraph template repositories, enabling context-aware transformations while preserving inter-sentence logical relationships. Experimental evaluations demonstrate significant improvements over baseline methods, with structured rewriting achieving a 6.90 average score compared to 6.70 for direct prompting approaches in tri-axial metrics assessing style consistency, content preservation, and expression quality. Ablation studies validate the necessity of both template hierarchies during style transfer, showing a higher content preservation win rate against sentence-only approaches through paragraph-level structural encoding, as well as against the direct prompting method through sentence-level pattern extraction and matching. The results establish new capabilities for coherent long-text style transfer without requiring parallel corpora or LLM fine-tuning.
zh

[NLP-63] PLHF: Prompt Optimization with Few-Shot Human Feedback

【速读】: 该论文试图解决在缺乏明确输出质量评估指标的情况下,如何有效且高效地优化大型语言模型(Large Language Models, LLMs)的提示(prompt)问题。传统方法在处理需要固定解的回答任务时表现良好,但在输出质量难以通过标准黄金样本进行比较的情况下,定义评估指标变得复杂。为了解决这一问题,本文提出了一种名为PLHF(Prompt Learning with Human Feedback)的少样本提示优化框架,其灵感来源于著名的RLHF(Reinforcement Learning from Human Feedback)技术。PLHF的关键在于引入了一个特定的评估模块作为度量标准,用于估计输出质量,并且仅需一次人类反馈即可完成整个提示优化过程。

链接: https://arxiv.org/abs/2505.07886
作者: Chun-Pai Yang,Kan Zheng,Shou-De Lin
机构: Intelli-Train.ai(智能训练人工智能); ZRT Technology(智睿科技); National Taiwan University(台湾大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Automatic prompt optimization frameworks are developed to obtain suitable prompts for large language models (LLMs) with respect to desired output quality metrics. Although existing approaches can handle conventional tasks such as fixed-solution question answering, defining the metric becomes complicated when the output quality cannot be easily assessed by comparisons with standard golden samples. Consequently, optimizing the prompts effectively and efficiently without a clear metric becomes a critical challenge. To address the issue, we present PLHF (which stands for “Prompt Learning with Human Feedback”), a few-shot prompt optimization framework inspired by the well-known RLHF technique. Different from naive strategies, PLHF employs a specific evaluator module acting as the metric to estimate the output quality. PLHF requires only a single round of human feedback to complete the entire prompt optimization process. Empirical results on both public and industrial datasets show that PLHF outperforms prior output grading strategies for LLM prompt optimizations.
zh
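
PLHF 用学习到的评估器模块充当质量指标来筛选提示。以下为控制流示意(llm、evaluator 均为假设接口;评估器本身应由那一轮人类反馈拟合得到,此处不展开):

```python
def optimize_prompt(candidates, dev_inputs, llm, evaluator):
    """Pick the candidate prompt whose outputs the evaluator scores highest.
    Assumed interfaces: llm(prompt, x) -> completion string;
    evaluator(x, y) -> quality score in [0, 1]. Sketch only."""
    def avg_score(prompt):
        outputs = [llm(prompt, x) for x in dev_inputs]
        return sum(evaluator(x, y) for x, y in zip(dev_inputs, outputs)) / len(outputs)
    return max(candidates, key=avg_score)
```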

[NLP-64] Development of a WAZOBIA-Named Entity Recognition System

【速读】: 该论文试图解决非洲语言在命名实体识别(NER)领域中资源匮乏的问题,当前的NER系统主要针对英语、欧洲语言及其他少数全球性语言,而对像豪萨语、约鲁巴语和伊博语这样的尼日利亚主要语言关注不足。解决方案的关键在于构建一个专门针对这三种语言的WAZOBIA-NER系统,通过综合编译标注数据集以应对数据稀缺和语言多样性挑战,并采用条件随机场(CRF)、双向长短期记忆网络(BiLSTM)、基于Transformer的双向编码器表示(BERT)以及微调循环神经网络(RNN)等先进的机器学习与深度学习技术进行实体识别,同时结合光学字符识别(OCR)技术实现对文本图像的处理,从而提升系统的适用性和性能。

链接: https://arxiv.org/abs/2505.07884
作者: S.E Emedem,I.E Onyenwe,E. G Onyedinma
机构: 未知
类目: Computation and Language (cs.CL); Human-Computer Interaction (cs.HC); Information Retrieval (cs.IR); Machine Learning (cs.LG)
备注: 6 pages, 3 figures, 1 table

点击查看摘要

Abstract:Named Entity Recognition (NER) is crucial for various natural language processing applications, including information extraction, machine translation, and sentiment analysis. Despite the ever-increasing interest in African languages within computational linguistics, existing NER systems focus mainly on English, European, and a few other global languages, leaving a significant gap for under-resourced languages. This research presents the development of a WAZOBIA-NER system tailored for the three most prominent Nigerian languages: Hausa, Yoruba, and Igbo. This research begins with a comprehensive compilation of annotated datasets for each language, addressing data scarcity and linguistic diversity challenges. Exploring the state-of-the-art machine learning technique of Conditional Random Fields (CRF) and deep learning models such as Bidirectional Long Short-Term Memory (BiLSTM) and Bidirectional Encoder Representations from Transformers (BERT) fine-tuned with a Recurrent Neural Network (RNN), the study evaluates the effectiveness of these approaches in recognizing three entities: persons, organizations, and locations. The system utilizes optical character recognition (OCR) technology to convert textual images into machine-readable text, thereby enabling the Wazobia system to accept both input text and textual images for extraction purposes. The system achieved a performance of 0.9511 in precision, 0.9400 in recall, 0.9564 in F1-score, and 0.9301 in accuracy. The model’s evaluation was conducted across three languages, with precision, recall, F1-score, and accuracy as key assessment metrics. The Wazobia-NER system demonstrates that it is feasible to build robust NER tools for under-resourced African languages using current NLP frameworks and transfer learning.
zh

[NLP-65] Recovering Event Probabilities from Large Language Model Embeddings via Axiomatic Constraints

【速读】: 该论文试图解决大型语言模型(Large Language Models, LLMs)生成的事件概率存在不一致性的问题,这种不一致性违反了概率论的公理。为了解决这一问题,研究提出在扩展变分自编码器(extended variational autoencoder, VAE)学习的潜在空间中强制执行概率论的公理约束,如概率的可加性规则。该方法的关键在于通过VAE同时重建原始嵌入并预测语义相关事件的嵌入,从而在潜在空间中自然地生成一致的事件概率。实验结果表明,从嵌入中恢复的概率比模型直接报告的概率更具一致性,并且与真实概率高度吻合。

链接: https://arxiv.org/abs/2505.07883
作者: Jian-Qiao Zhu,Haijiang Yan,Thomas L. Griffiths
机构: Princeton University (普林斯顿大学); University of Warwick (华威大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Rational decision-making under uncertainty requires coherent degrees of belief in events. However, event probabilities generated by Large Language Models (LLMs) have been shown to exhibit incoherence, violating the axioms of probability theory. This raises the question of whether coherent event probabilities can be recovered from the embeddings used by the models. If so, those derived probabilities could be used as more accurate estimates in events involving uncertainty. To explore this question, we propose enforcing axiomatic constraints, such as the additive rule of probability theory, in the latent space learned by an extended variational autoencoder (VAE) applied to LLM embeddings. This approach enables event probabilities to naturally emerge in the latent space as the VAE learns to both reconstruct the original embeddings and predict the embeddings of semantically related events. We evaluate our method on complementary events (i.e., event A and its complement, event not-A), where the true probabilities of the two events must sum to 1. Experiment results on open-weight language models demonstrate that probabilities recovered from embeddings exhibit greater coherence than those directly reported by the corresponding models and align closely with the true probabilities.
zh
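
其中最直观的公理约束是互补事件概率之和为 1。下面用 PyTorch 展示如何把该约束作为惩罚项加入 VAE 的训练损失(网络结构与“潜变量到概率”的映射均为示意,并非论文的完整目标函数):

```python
import torch
import torch.nn.functional as F

def axiomatic_vae_loss(recon_a, target_a, recon_not_a, target_not_a,
                       p_a, p_not_a, lam=1.0):
    """Reconstruct both event embeddings while penalising violations of
    P(A) + P(not A) = 1 in the latent space. Sketch only."""
    rec = F.mse_loss(recon_a, target_a) + F.mse_loss(recon_not_a, target_not_a)
    additivity = ((p_a + p_not_a - 1.0) ** 2).mean()   # axiom penalty term
    return rec + lam * additivity

# Toy call with random tensors standing in for LLM embeddings.
emb = torch.randn(4, 768)
p = torch.sigmoid(torch.randn(4))
print(axiomatic_vae_loss(emb, emb, emb, emb, p, 1.0 - p))
```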

[NLP-66] The Sound of Populism: Distinct Linguistic Features Across Populist Variants

【速读】: 该论文试图解决如何通过语言分析识别政治演讲中民粹主义(populism)的语调特征问题,具体是通过结合经典的LIWC(Linguistic Inquiry and Word Count)特征与微调后的RoBERTa模型来捕捉语言的情感和风格特征,从而揭示美国总统就职演说和国情咨文中的政治修辞的听觉维度。解决方案的关键在于将传统语言分析工具与先进的上下文感知语言模型相结合,以检测民粹主义的细微表达,并揭示不同民粹主义维度在语言标记中的表现差异。

链接: https://arxiv.org/abs/2505.07874
作者: Yu Wang,Runxi Yu,Zhongyuan Wang,Jing He
机构: 未知
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:This study explores the sound of populism by integrating the classic Linguistic Inquiry and Word Count (LIWC) features, which capture the emotional and stylistic tones of language, with a fine-tuned RoBERTa model, a state-of-the-art context-aware language model trained to detect nuanced expressions of populism. This approach allows us to uncover the auditory dimensions of political rhetoric in U.S. presidential inaugural and State of the Union addresses. We examine how four key populist dimensions (i.e., left-wing, right-wing, anti-elitism, and people-centrism) manifest in the linguistic markers of speech, drawing attention to both commonalities and distinct tonal shifts across these variants. Our findings reveal that populist rhetoric consistently features a direct, assertive “sound” that forges a connection with “the people” and constructs a charismatic leadership persona. However, this sound is not simply informal but strategically calibrated. Notably, right-wing populism and people-centrism exhibit a more emotionally charged discourse, resonating with themes of identity, grievance, and crisis, in contrast to the relatively restrained emotional tones of left-wing and anti-elitist expressions.
zh

[NLP-67] Evaluating Financial Sentiment Analysis with Annotators' Instruction Assisted Prompting: Enhancing Contextual Interpretation and Stock Prediction Accuracy

【速读】: 该论文试图解决金融情感分析(Financial Sentiment Analysis, FSA)中由于现有基准数据集的主观性导致的模型性能受限问题。现有数据集如Financial Phrasebank存在未明确定义的情感类别,导致标注者个体化视角带来的显著标注差异,从而使大语言模型(LLMs)在基准测试中面临不公平的预期。论文提出的解决方案之关键是引入Annotators’ Instruction Assisted Prompt (AIAP),通过将原本用于人类标注者的详细任务指令整合到LLMs的提示框架中,以标准化情感理解,提供更公平且上下文丰富的评估基础。实验结果表明,AIAP显著提升了LLM性能,最高提升达9.08,并引入了一种基于模型置信度分数的情感索引方法,增强了股票价格预测模型的效果。

链接: https://arxiv.org/abs/2505.07871
作者: A M Muntasir Rahman,Ajim Uddin,Guiling “Grace” Wang
机构: New Jersey Institute of Technology (新泽西理工学院)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Financial sentiment analysis (FSA) presents unique challenges to LLMs that surpass those in typical sentiment analysis due to the nuanced language used in financial contexts. The prowess of these models is often undermined by the inherent subjectivity of sentiment classifications in existing benchmark datasets like Financial Phrasebank. These datasets typically feature undefined sentiment classes that reflect the highly individualized perspectives of annotators, leading to significant variability in annotations. This variability results in an unfair expectation for LLMs during benchmarking, where they are tasked to conjecture the subjective viewpoints of human annotators without sufficient context. In this paper, we introduce the Annotators’ Instruction Assisted Prompt, a novel evaluation prompt designed to redefine the task definition of FSA for LLMs. By integrating detailed task instructions originally intended for human annotators into the LLMs’ prompt framework, AIAP aims to standardize the understanding of sentiment across both human and machine interpretations, providing a fair and context-rich foundation for sentiment analysis. We utilize a new dataset, WSBS, derived from the WallStreetBets subreddit to demonstrate how AIAP significantly enhances LLM performance by aligning machine operations with the refined task definitions. Experimental results demonstrate that AIAP enhances LLM performance significantly, with improvements up to 9.08. This context-aware approach not only yields incremental gains in performance but also introduces an innovative sentiment-indexing method utilizing model confidence scores. This method enhances stock price prediction models and extracts more value from the financial sentiment analysis, underscoring the significance of WSB as a critical source of financial text. Our research offers insights into improving FSA through better evaluation methods.
zh

[NLP-68] Efficient Fairness Testing in Large Language Models: Prioritizing Metamorphic Relations for Bias Detection

【速读】: 该论文试图解决大型语言模型(Large Language Models, LLMs)在输出中可能存在的公平性问题和潜在偏见,通过元测试中的元变换关系(Metamorphic Relations, MRs)优先级排序来高效检测这些问题。解决方案的关键在于采用基于句子多样性的方法对MRs进行计算和排序,以优化故障检测效果,从而在减少计算成本的同时提升公平性测试的效率和有效性。

链接: https://arxiv.org/abs/2505.07870
作者: Suavis Giramata,Madhusudan Srinivasan,Venkat Naidu Gudivada,Upulee Kanewala
机构: East Carolina University (东卡罗来纳大学); University of North Florida (北佛罗里达大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
备注:

点击查看摘要

Abstract:Large Language Models (LLMs) are increasingly deployed in various applications, raising critical concerns about fairness and potential biases in their outputs. This paper explores the prioritization of metamorphic relations (MRs) in metamorphic testing as a strategy to efficiently detect fairness issues within LLMs. Given the exponential growth of possible test cases, exhaustive testing is impractical; therefore, prioritizing MRs based on their effectiveness in detecting fairness violations is crucial. We apply a sentence diversity-based approach to compute and rank MRs to optimize fault detection. Experimental results demonstrate that our proposed prioritization approach improves fault detection rates by 22% compared to random prioritization and 12% compared to distance-based prioritization, while reducing the time to the first failure by 15% and 8%, respectively. Furthermore, our approach performs within 5% of fault-based prioritization in effectiveness, while significantly reducing the computational cost associated with fault labeling. These results validate the effectiveness of diversity-based MR prioritization in enhancing fairness testing for LLMs.
zh
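
“基于句子多样性”的排序可以理解为:一条元变换关系(MR)生成的测试句彼此差异越大,越优先执行。以下用句向量的平均两两余弦距离近似该多样性得分并排序(embed 为假设的句子编码接口,具体度量方式是笔者的推测性实现):

```python
import numpy as np

def diversity_score(sentences, embed):
    """Mean pairwise cosine distance among an MR's generated test sentences.
    `embed(s) -> 1-D numpy vector` is an assumed sentence encoder."""
    vecs = [embed(s) for s in sentences]
    dists = []
    for i in range(len(vecs)):
        for j in range(i + 1, len(vecs)):
            a, b = vecs[i], vecs[j]
            cos = a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9)
            dists.append(1.0 - cos)
    return float(np.mean(dists)) if dists else 0.0

def prioritize_mrs(mr_to_sentences, embed):
    """Rank metamorphic relations by descending sentence diversity."""
    return sorted(mr_to_sentences,
                  key=lambda mr: diversity_score(mr_to_sentences[mr], embed),
                  reverse=True)
```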

[NLP-69] Arrow-Guided VLM: Enhancing Flowchart Understanding via Arrow Direction Encoding

【速读】: 该论文旨在解决当前视觉-语言模型(VLMs)在理解和解析流程图时对方向箭头和图结构的误判问题。其解决方案的关键在于提出一个七阶段的处理流程,包括节点与箭头端点的感知检测、光学字符识别(OCR)提取节点文本以及构建结构化提示以引导VLMs,从而显著提升模型对流程图的理解准确性。

链接: https://arxiv.org/abs/2505.07864
作者: Takamitsu Omasa,Ryo Koshihara,Masumi Morishige
机构: Galirage Inc. (Galirage 公司)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
备注: 11 pages, 1 figure

点击查看摘要

Abstract:Flowcharts are indispensable tools in software design and business-process analysis, yet current vision-language models (VLMs) frequently misinterpret the directional arrows and graph topology that set these diagrams apart from natural images. We introduce a seven-stage pipeline grouped into three broader processes: (1) arrow-aware detection of nodes and arrow endpoints; (2) optical character recognition (OCR) to extract node text; and (3) construction of a structured prompt that guides the VLMs. Tested on a 90-question benchmark distilled from 30 annotated flowcharts, the method raises overall accuracy from 80% to 89% (+9 percentage points) without any task-specific fine-tuning. The gain is most pronounced for next-step queries (25/30 → 30/30; 100%, +17 pp); branch-result questions improve more modestly, and before-step questions remain difficult. A parallel evaluation with an LLM-as-a-Judge protocol shows the same trends, reinforcing the advantage of explicit arrow encoding. Limitations include dependence on detector and OCR precision, the small evaluation set, and residual errors at nodes with multiple incoming edges. Future work will enlarge the benchmark with synthetic and handwritten flowcharts and assess the approach on Business Process Model and Notation (BPMN) and Unified Modeling Language (UML).
zh
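
管道的第三个阶段是把检测出的节点与带方向的边组织成结构化提示。下面演示一种可能的拼装方式(节点、边的数据结构与提示措辞均为笔者假设):

```python
def build_flowchart_prompt(nodes, edges, question):
    """nodes: {node_id: text}; edges: (src, dst) pairs following arrow direction.
    Produces a topology-explicit prompt for the VLM. Illustrative only."""
    node_lines = "\n".join(f"{nid}: {text}" for nid, text in nodes.items())
    edge_lines = "\n".join(f"{nodes[s]} -> {nodes[d]}" for s, d in edges)
    return (
        f"Flowchart nodes:\n{node_lines}\n"
        f"Directed edges (arrow direction):\n{edge_lines}\n"
        f"Question: {question}\nAnswer:"
    )

nodes = {"n1": "Start", "n2": "Validate input", "n3": "Save record"}
edges = [("n1", "n2"), ("n2", "n3")]
print(build_flowchart_prompt(nodes, edges, "What is the step after 'Validate input'?"))
```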

[NLP-70] QoSBERT: An Uncertainty-Aware Approach based on Pre-trained Language Models for Service Quality Prediction

【速读】: 该论文旨在解决传统服务质量(Quality of Service, QoS)模型在预测服务性能指标时依赖人工特征工程、仅提供点估计且缺乏置信度分析的问题。其解决方案的关键在于提出QoSBERT框架,该框架首次将QoS预测重新定义为基于预训练语言模型的语义回归任务,通过自动编码用户服务元数据为自然语言描述实现深层次语义理解,并集成基于蒙特卡洛Dropout的不确定性估计模块,从而实现可信且风险感知的服务质量预测。此外,QoSBERT采用注意力池化与轻量级多层感知机回归器联合优化,以最小化绝对误差,并利用不确定性估计选择高质量训练样本,提升低资源环境下的鲁棒性。

链接: https://arxiv.org/abs/2505.07863
作者: Ziliang Wang,Xiaohong Zhang,Ze Shi Li,Meng Yan
机构: Peking University (北京大学); Chongqing University (重庆大学); University of Victoria (维多利亚大学)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Accurate prediction of Quality of Service (QoS) metrics is fundamental for selecting and managing cloud-based services. Traditional QoS models rely on manual feature engineering and yield only point estimates, offering no insight into the confidence of their predictions. In this paper, we propose QoSBERT, the first framework that reformulates QoS prediction as a semantic regression task based on pre-trained language models. Unlike previous approaches relying on sparse numerical features, QoSBERT automatically encodes user service metadata into natural language descriptions, enabling deep semantic understanding. Furthermore, we integrate a Monte Carlo Dropout-based uncertainty estimation module, allowing for trustworthy and risk-aware service quality prediction, which is crucial yet underexplored in existing QoS models. QoSBERT applies attentive pooling over contextualized embeddings and a lightweight multilayer perceptron regressor, fine-tuned jointly to minimize absolute error. We further exploit the resulting uncertainty estimates to select high-quality training samples, improving robustness in low-resource settings. On standard QoS benchmark datasets, QoSBERT achieves an average reduction of 11.7% in MAE and 6.7% in RMSE for response time prediction, and 6.9% in MAE for throughput prediction compared to the strongest baselines, while providing well-calibrated confidence intervals for robust and trustworthy service quality estimation. Our approach not only advances the accuracy of service quality prediction but also delivers reliable uncertainty quantification, paving the way for more trustworthy, data-driven service selection and optimization.
zh
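
Monte Carlo Dropout 的通用做法是:推理时保持 dropout 打开,多次采样前向结果,用均值作预测、标准差作不确定性。通用 PyTorch 示意如下(假设模型中的随机性只来自 dropout 层,非 QoSBERT 官方代码):

```python
import torch

@torch.no_grad()
def mc_dropout_predict(model, inputs, n_samples=30):
    """Sample the regressor with dropout active; mean = QoS prediction,
    std = uncertainty proxy. Sketch only."""
    model.train()   # keeps dropout layers stochastic at inference time
    preds = torch.stack([model(inputs) for _ in range(n_samples)])
    model.eval()
    return preds.mean(dim=0), preds.std(dim=0)

reg = torch.nn.Sequential(torch.nn.Linear(8, 32), torch.nn.ReLU(),
                          torch.nn.Dropout(0.2), torch.nn.Linear(32, 1))
mean, std = mc_dropout_predict(reg, torch.randn(5, 8))
print(mean.shape, std.shape)   # torch.Size([5, 1]) torch.Size([5, 1])
```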

[NLP-71] Graph Laplacian Wavelet Transformer via Learnable Spectral Decomposition

【速读】: 该论文试图解决序列到序列模型在结构化语言任务中依赖点积自注意力机制导致的计算和内存复杂度为二次方的问题。解决方案的关键在于引入图小波变换(Graph Wavelet Transformer, GWT),该架构通过在显式图拉普拉斯矩阵上定义的可学习多尺度小波变换替代传统自注意力机制,从而实现更高效且具表达性的图结构序列建模。

链接: https://arxiv.org/abs/2505.07862
作者: Andrew Kiruluta,Eric Lundy,Priscilla Burity
机构: UC Berkeley, CA
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Existing sequence-to-sequence models for structured language tasks rely heavily on the dot-product self-attention mechanism, which incurs quadratic complexity in both computation and memory for input length N. We introduce the Graph Wavelet Transformer (GWT), a novel architecture that replaces this bottleneck with a learnable, multi-scale wavelet transform defined over an explicit graph Laplacian derived from syntactic or semantic parses. Our analysis shows that multi-scale spectral decomposition offers an interpretable, efficient, and expressive alternative to quadratic self-attention for graph-structured sequence modeling.
zh
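
GWT 的基础一步是在显式图拉普拉斯的谱域上做可学习滤波。下面以一条 3 个 token 的链式句法图为例,构造归一化拉普拉斯、做特征分解,并用可学习谱系数过滤节点特征(数值示意,非论文实现,也未体现其多尺度小波设计):

```python
import torch

def normalized_laplacian(adj):
    """L = I - D^{-1/2} A D^{-1/2} for a symmetric adjacency matrix."""
    deg = adj.sum(dim=1)
    d_inv_sqrt = torch.where(deg > 0, deg.pow(-0.5), torch.zeros_like(deg))
    D = torch.diag(d_inv_sqrt)
    return torch.eye(adj.size(0)) - D @ adj @ D

adj = torch.tensor([[0., 1., 0.],
                    [1., 0., 1.],
                    [0., 1., 0.]])               # 3-token chain parse
L = normalized_laplacian(adj)
evals, evecs = torch.linalg.eigh(L)              # graph Fourier basis

x = torch.randn(3, 8)                            # 3 nodes, 8 features
g = torch.nn.Parameter(torch.ones(3))            # learnable spectral filter (one scale)
x_filtered = evecs @ torch.diag(g) @ evecs.T @ x # filtering in the spectral domain
print(x_filtered.shape)                          # torch.Size([3, 8])
```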

[NLP-72] Scalable LLM Math Reasoning Acceleration with Low-rank Distillation

【速读】: 该论文旨在解决高效推理方法在部署大型语言模型(Large Language Model, LLM)时导致数学推理能力显著下降的问题,同时保持语言任务的性能。其解决方案的关键在于提出一种低成本的蒸馏方法——Caprese,通过保留原始权重、引入约1%的额外参数以及使用少量(20K)合成训练样本,恢复因高效推理而丢失的数学能力,同时不影响语言任务的表现,并且能够减少活跃参数数量和降低延迟。

链接: https://arxiv.org/abs/2505.07861
作者: Harry Dong,Bilge Acun,Beidi Chen,Yuejie Chi
机构: Carnegie Mellon University (卡内基梅隆大学); Meta (元); FAIR at Meta (元人工智能研究院)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Due to long generations, large language model (LLM) math reasoning demands significant computational resources and time. While many existing efficient inference methods have been developed with excellent performance preservation on language tasks, they often severely degrade math performance. In this paper, we propose Caprese, a low-cost distillation method to recover lost capabilities from deploying efficient inference methods, focused primarily on feedforward blocks. With original weights unperturbed, roughly 1% of additional parameters, and only 20K synthetic training samples, we are able to recover much if not all of the math capabilities lost to efficient inference for thinking LLMs, without harm to language tasks for instruct LLMs. Moreover, Caprese slashes the number of active parameters (~2B cut for Gemma 2 9B and Llama 3.1 8B) and integrates cleanly into existing model layers to reduce latency (11% reduction to generate 2048 tokens with Qwen 2.5 14B) while encouraging response brevity.
zh
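
“原权重不动、只并联约 1% 低秩参数”可以用如下方式落地:冻结原前馈块,再加一条零初始化的低秩旁路用于蒸馏训练。以下为通用示意(秩与接入方式为笔者假设,并非 Caprese 的确切结构):

```python
import torch
import torch.nn as nn

class LowRankPatchedFFN(nn.Module):
    """Frozen original feedforward plus a small trainable low-rank branch."""
    def __init__(self, ffn, d_model, rank=16):
        super().__init__()
        self.ffn = ffn
        for p in self.ffn.parameters():
            p.requires_grad = False          # original weights stay unperturbed
        self.down = nn.Linear(d_model, rank, bias=False)
        self.up = nn.Linear(rank, d_model, bias=False)
        nn.init.zeros_(self.up.weight)       # branch starts as a no-op

    def forward(self, x):
        return self.ffn(x) + self.up(self.down(x))

ffn = nn.Sequential(nn.Linear(64, 256), nn.GELU(), nn.Linear(256, 64))
patched = LowRankPatchedFFN(ffn, d_model=64, rank=8)
print(patched(torch.randn(2, 10, 64)).shape)   # torch.Size([2, 10, 64])
```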

[NLP-73] Boosting Performance on ARC is a Matter of Perspective

【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)在抽象推理任务中的能力局限性,特别是在面对Abstraction and Reasoning Corpus (ARC-AGI)这一挑战时的表现不足。其解决方案的关键在于在整个训练、生成和评分阶段引入任务特定的数据增强技术,并采用深度优先搜索算法生成多样化的高概率候选解;同时,将LLM不仅作为生成器使用,还作为评分器,利用其输出的概率选择最具潜力的解,从而提升整体性能。

链接: https://arxiv.org/abs/2505.07859
作者: Daniel Franzen,Jan Disselhoff,David Hartmann
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 14 pages, 5 figures, 5 tables

点击查看摘要

Abstract:The Abstraction and Reasoning Corpus (ARC-AGI) poses a significant challenge for large language models (LLMs), exposing limitations in their abstract reasoning abilities. In this work, we leverage task-specific data augmentations throughout the training, generation, and scoring phases, and employ a depth-first search algorithm to generate diverse, high-probability candidate solutions. Furthermore, we utilize the LLM not only as a generator but also as a scorer, using its output probabilities to select the most promising solutions. Our method achieves a score of 71.6% (286.5/400 solved tasks) on the public ARC-AGI evaluation set, demonstrating state-of-the-art performance among publicly available approaches. While concurrent closed-source work has reported higher scores, our method distinguishes itself through its transparency, reproducibility, and remarkably low inference cost, averaging only around 2ct per task on readily available hardware (we assume a price of 36ct/hour for a Nvidia 4090 GPU).
zh
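
“模型既当生成器又当打分器”的打分部分,可以用候选解的累计对数似然实现:把每个候选逐 token 回送模型,累加 log P,再取最高分者。示意如下(token 级接口为假设):

```python
def score_candidate(logprob_fn, prompt_tokens, candidate_tokens):
    """Sum of log-probabilities the model assigns to a candidate solution.
    Assumed interface: logprob_fn(context, tok) -> log P(tok | context)."""
    context, total = list(prompt_tokens), 0.0
    for tok in candidate_tokens:
        total += logprob_fn(context, tok)
        context.append(tok)
    return total

def select_best(logprob_fn, prompt_tokens, candidates):
    """Pick the highest-probability candidate produced by the search."""
    return max(candidates, key=lambda c: score_candidate(logprob_fn, prompt_tokens, c))
```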

[NLP-74] Scaling Laws for Speculative Decoding

【速读】: 该论文旨在解决大规模语言模型(Large Language Models, LLMs)在推理密集型架构中高效解码的问题,特别是针对依赖长链式思维推理的模型如OpenAI-o3和DeepSeek-R1。其解决方案的关键在于通过研究推测解码(speculative decoding)技术,并发现跨预训练语料量、草稿模型容量和解码批大小三个维度的对数线性缩放规律(Log-linear Scaling Laws),从而实现多维缩放协调,提升解码效率。基于此,作者提出了Scylla系统,在多个主流LLM上验证了其在解码速度和接受率上的显著提升。

链接: https://arxiv.org/abs/2505.07858
作者: Siyuan Yan,Mo Zhu,Guo-qing Jiang,Jianfei Wang,Jiaxing Chen,Wentai Zhang,Xiang Liao,Xiao Cui,Chen Zhang,Zhuoran Song,Ran Zhu
机构: Red Note Hi-Lab; Shanghai Jiaotong University; Nanjing University; Zhejiang University; Chinese University of Hong Kong
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 17 pages, 8 figures

点击查看摘要

Abstract:The escalating demand for efficient decoding in large language models (LLMs) is particularly critical for reasoning-intensive architectures like OpenAI-o3 and DeepSeek-R1, which depend on extended chain-of-thought reasoning. This study investigates speculative decoding techniques through dense LLM architectures to establish foundational insights for accelerating reasoning tasks. While speculative decoding methods leveraging parallel draft-verification cycles have emerged as promising acceleration techniques, the scaling laws governing decoding efficiency remain under-explored compared to conventional backbone LLMs developed through Pretraining-SFT-RLHF training paradigms. In this work, we discover Log-linear Scaling Laws (Theorem 1.1, 1.2 and 1.3) governing draft model acceptance rate (or decoding speed) across three dimensions: pretraining token volume, draft model capacity, and decoding batch size. Building on these laws, we achieve Scylla, which coordinates multi-dimensional scaling for popular LLMs (Llama2/3, Qwen2.5). Empirical validation shows Scylla achieves 1.5-2.2 higher acceptance rate than EAGLE2 and 0.3 higher than EAGLE3 at temperature T = 0, with peak performance gains on summarization and QA tasks (Figure 2). Industrial inference engine deployments demonstrate 2X decoding throughput improvements over EAGLE2 (Table 5), validating the transformative potential of systematic scaling for efficient LLM inference. Code will be released later.
zh
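
所谓对数线性缩放律,可用 acceptance ≈ a·log(x) + b 的最小二乘拟合来检验与外推。下面用 numpy 对一组假设的“草稿模型预训练 token 量 vs. 接受率”数据做拟合(数据纯属虚构,仅演示方法):

```python
import numpy as np

# Hypothetical observations: pretraining tokens (billions) vs. acceptance rate.
tokens = np.array([10., 30., 100., 300., 1000.])
accept = np.array([0.52, 0.58, 0.64, 0.69, 0.75])

a, b = np.polyfit(np.log(tokens), accept, deg=1)   # least-squares log-linear fit
print(f"acceptance ~= {a:.3f} * log(tokens) + {b:.3f}")
print("extrapolated at 3000B tokens:", a * np.log(3000.) + b)
```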

[NLP-75] Enhanced Urdu Intent Detection with Large Language Models and Prototype-Informed Predictive Pipelines

【速读】: 该论文旨在解决乌尔都语(Urdu)意图检测(Intent Detection)模型在少样本学习和未见类别预测方面的不足。现有方法在主流语言中已广泛应用少样本学习策略,但乌尔都语缺乏此类方法,传统模型仅限于对训练集中已见类别的预测。该研究提出了一种基于对比学习的独特解决方案,利用未标注的乌尔都语文本重新训练预训练语言模型(Pre-trained Language Models, LLMs),以增强其表征学习能力,并结合原型感知注意力机制(Prototype-informed Attention Mechanism)构建端到端的LLMPIA意图检测流水线。该方案的关键在于通过对比学习提升模型对乌尔都语的适应性,并有效整合预训练模型与注意力机制的优势。

链接: https://arxiv.org/abs/2505.07857
作者: Faiza Hassan,Summra Saleem,Kashif Javed,Muhammad Nabeel Asim,Abdur Rehman,Andreas Dengel
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 42 pages, 10 figures(including 6 graphs)

点击查看摘要

Abstract:Multifarious intent detection predictors are developed for different languages, including English, Chinese and French; however, the field remains underdeveloped for Urdu, the 10th most spoken language. In the realm of well-known languages, intent detection predictors utilize the strategy of few-shot learning and prediction of unseen classes based on the model training on seen classes. However, the Urdu language lacks few-shot-strategy-based intent detection predictors, and traditional predictors are focused on prediction of the same classes which models have seen in the train set. To empower Urdu-specific intent detection, this study introduces a unique contrastive learning approach that leverages unlabeled Urdu data to re-train pre-trained language models. This re-training empowers LLMs’ representation learning for the downstream intent detection task. Finally, it reaps the combined potential of pre-trained LLMs and the prototype-informed attention mechanism to create a comprehensive end-to-end LLMPIA intent detection pipeline. Under the paradigm of the proposed predictive pipeline, it explores the potential of 6 distinct language models and 13 distinct similarity computation methods. The proposed framework is evaluated on 2 public benchmark datasets, namely ATIS encompassing 5836 samples and Web Queries having 8519 samples. Across the ATIS dataset under 4-way 1-shot and 4-way 5-shot experimental settings, LLMPIA achieved 83.28% and 98.25% F1-Score, and on the Web Queries dataset produced 76.23% and 84.42% F1-Score, respectively. In an additional case study on the Web Queries dataset under same-classes train and test set settings, LLMPIA outperformed the state-of-the-art predictor by 53.55% F1-Score.
zh

[NLP-76] Unpacking Robustness in Inflectional Languages: Adversarial Evaluation and Mechanistic Insights

【速读】: 该论文试图解决在屈折语言(inflectional languages)中对抗样本生成的效果及其对模型鲁棒性的影响问题。现有大多数对抗样本生成方法主要针对非屈折语言(如英语)进行开发和评估,而缺乏对屈折语言的系统研究。论文的关键解决方案是设计一种受机制可解释性启发的新评估协议,该协议基于边缘归因补丁(Edge Attribution Patching, EAP)方法,利用包含屈折和同形变体的平行任务特定语料库,以分析模型中与屈折相关的机制性元素及其在攻击下的行为。

链接: https://arxiv.org/abs/2505.07856
作者: Paweł Walkowiak,Marek Klonowski,Marcin Oleksy,Arkadiusz Janz
机构: Wrocław University of Technology (弗罗茨瓦夫科技大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Various techniques are used in the generation of adversarial examples, including methods such as TextBugger which introduce minor, hardly visible perturbations to words leading to changes in model behaviour. Another class of techniques involves substituting words with their synonyms in a way that preserves the text’s meaning but alters its predicted class, with TextFooler being a prominent example of such attacks. Most adversarial example generation methods are developed and evaluated primarily on non-inflectional languages, typically English. In this work, we evaluate and explain how adversarial attacks perform in inflectional languages. To explain the impact of inflection on model behaviour and its robustness under attack, we designed a novel protocol inspired by mechanistic interpretability, based on Edge Attribution Patching (EAP) method. The proposed evaluation protocol relies on parallel task-specific corpora that include both inflected and syncretic variants of texts in two languages – Polish and English. To analyse the models and explain the relationship between inflection and adversarial robustness, we create a new benchmark based on task-oriented dataset MultiEmo, enabling the identification of mechanistic inflection-related elements of circuits within the model and analyse their behaviour under attack.
zh

[NLP-77] CrashSage: A Large Language Model-Centered Framework for Contextual and Interpretable Traffic Crash Analysis

【速读】: 该论文旨在解决传统统计模型和树集成方法在道路碰撞分析中因依赖结构化数据而忽视上下文细节、难以捕捉复杂关系及底层语义的问题,以及由此导致的信息丢失问题,尤其是在多车辆交互、碰撞发展过程和罕见事件特征方面的叙述性元素。其解决方案的关键在于提出CrashSage框架,该框架基于大型语言模型(LLM),通过四种核心创新实现:一是采用表格到文本的转换策略与关系数据集成模式,将原始异构碰撞数据转化为保留关键结构和关系语境的丰富结构化文本叙述;二是利用基础LLM模型进行上下文感知的数据增强,以提升叙述连贯性同时保持事实准确性;三是对LLaMA3-8B模型进行微调,用于碰撞严重程度推理,表现出优于基线方法的性能;四是采用基于梯度的可解释性技术,揭示模型在个体碰撞层面和更广泛风险因素维度上的决策过程,从而提高透明度并支持针对性的道路安全干预。

链接: https://arxiv.org/abs/2505.07853
作者: Hao Zhen,Jidong J. Yang
机构: University of Georgia (佐治亚大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 20 pages, 7 figures

点击查看摘要

Abstract:Road crashes claim over 1.3 million lives annually worldwide and incur global economic losses exceeding $1.8 trillion. Such profound societal and financial impacts underscore the urgent need for road safety research that uncovers crash mechanisms and delivers actionable insights. Conventional statistical models and tree ensemble approaches typically rely on structured crash data, overlooking contextual nuances and struggling to capture complex relationships and underlying semantics. Moreover, these approaches tend to incur significant information loss, particularly in narrative elements related to multi-vehicle interactions, crash progression, and rare event characteristics. This study presents CrashSage, a novel Large Language Model (LLM)-centered framework designed to advance crash analysis and modeling through four key innovations. First, we introduce a tabular-to-text transformation strategy paired with relational data integration schema, enabling the conversion of raw, heterogeneous crash data into enriched, structured textual narratives that retain essential structural and relational context. Second, we apply context-aware data augmentation using a base LLM model to improve narrative coherence while preserving factual integrity. Third, we fine-tune the LLaMA3-8B model for crash severity inference, demonstrating superior performance over baseline approaches, including zero-shot, zero-shot with chain-of-thought prompting, and few-shot learning, with multiple models (GPT-4o, GPT-4o-mini, LLaMA3-70B). Finally, we employ a gradient-based explainability technique to elucidate model decisions at both the individual crash level and across broader risk factor dimensions. This interpretability mechanism enhances transparency and enables targeted road safety interventions by providing deeper insights into the most influential factors.
zh

[NLP-78] Joint Detection of Fraud and Concept Drift in Online Conversations with LLM-Assisted Judgment

【速读】: 该论文试图解决数字通信平台中虚假交互检测的问题,特别是传统静态异常检测方法在面对动态对话变化时的不足,导致误报或漏报。其解决方案的关键在于提出一个两阶段检测框架:首先利用定制的集成分类模型识别可疑对话,随后通过概念漂移分析(Concept Drift Analysis)使用One Class Drift Detector (OCDD)隔离对话中的变化,并结合大语言模型(Large Language Model, LLM)判断变化是否为欺诈性操作或合法话题转换,从而提升检测的准确性和可解释性。

链接: https://arxiv.org/abs/2505.07852
作者: Ali Senol,Garima Agrawal,Huan Liu
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Detecting fake interactions in digital communication platforms remains a challenging and insufficiently addressed problem. These interactions may appear as harmless spam or escalate into sophisticated scam attempts, making it difficult to flag malicious intent early. Traditional detection methods often rely on static anomaly detection techniques that fail to adapt to dynamic conversational shifts. One key limitation is the misinterpretation of benign topic transitions (referred to as concept drift) as fraudulent behavior, leading to either false alarms or missed threats. We propose a two-stage detection framework that first identifies suspicious conversations using a tailored ensemble classification model. To improve the reliability of detection, we incorporate a concept drift analysis step using a One Class Drift Detector (OCDD) to isolate conversational shifts within flagged dialogues. When drift is detected, a large language model (LLM) assesses whether the shift indicates fraudulent manipulation or a legitimate topic change. In cases where no drift is found, the behavior is inferred to be spam-like. We validate our framework using a dataset of social engineering chat scenarios and demonstrate its practical advantages in improving both accuracy and interpretability for real-time fraud detection. To contextualize the trade-offs, we compare our modular approach against a Dual LLM baseline that performs detection and judgment using different language models.
zh
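
两阶段框架可以概括为“集成分类器筛查 → OCDD 漂移检测 → LLM 裁决”。以下控制流示意该决策逻辑(三个组件均为假设接口,仅体现摘要描述的分支顺序):

```python
def analyze_conversation(conv, ensemble, drift_detector, llm_judge):
    """Sketch of the joint fraud / concept-drift decision logic.
    Assumed interfaces: ensemble(conv) -> bool (suspicious?);
    drift_detector(conv) -> bool; llm_judge(conv) -> 'fraud' or 'topic_change'."""
    if not ensemble(conv):
        return "benign"
    if drift_detector(conv):                  # a conversational shift was isolated
        return "fraud" if llm_judge(conv) == "fraud" else "benign topic change"
    return "spam-like"                        # flagged but no drift detected
```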

[NLP-79] A Tale of Two Identities: An Ethical Audit of Human and AI-Crafted Personas

【速读】: 该论文试图解决生成式AI(Generative AI)在合成人格构建中可能引发的代表性伤害问题,尤其是在少数族裔身份表达方面的偏差。研究聚焦于大型语言模型(LLMs)生成的合成人格如何通过过度强调种族标记、过度生产文化编码语言以及构建语法复杂但叙事简化的人格,导致刻板印象、异域化、消解和善意偏见等社会技术性危害。其解决方案的关键在于提出“算法他者化”(algorithmic othering)的概念,并基于此提供面向叙事感知的评估指标和以社区为中心的验证协议,以提升合成身份生成的公平性与真实性。

链接: https://arxiv.org/abs/2505.07850
作者: Pranav Narayanan Venkit,Jiayi Li,Yingfan Zhou,Sarah Rajtmajer,Shomir Wilson
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
备注:

点击查看摘要

Abstract:As LLMs (large language models) are increasingly used to generate synthetic personas, particularly in data-limited domains such as health, privacy, and HCI, it becomes necessary to understand how these narratives represent identity, especially that of minority communities. In this paper, we audit synthetic personas generated by 3 LLMs (GPT4o, Gemini 1.5 Pro, Deepseek 2.5) through the lens of representational harm, focusing specifically on racial identity. Using a mixed-methods approach combining close reading, lexical analysis, and a parameterized creativity framework, we compare 1,512 LLM-generated personas to human-authored responses. Our findings reveal that LLMs disproportionately foreground racial markers, overproduce culturally coded language, and construct personas that are syntactically elaborate yet narratively reductive. These patterns result in a range of sociotechnical harms, including stereotyping, exoticism, erasure, and benevolent bias, that are often obfuscated by superficially positive narrations. We formalize this phenomenon as algorithmic othering, where minoritized identities are rendered hypervisible but less authentic. Based on these findings, we offer design recommendations for narrative-aware evaluation metrics and community-centered validation protocols for synthetic identity generation.
zh

[NLP-80] Polysemy of Synthetic Neurons Towards a New Type of Explanatory Categorical Vector Spaces

【速读】: 该论文试图解决人工语言模型中合成神经元的多义性问题,当前认为这种多义性是潜在空间内分布式特征必要叠加的结果。论文提出了一种替代方案,其关键在于将第n层中的神经元几何地定义为一个具有非正交基的类别向量空间,该空间由前一层(n-1层)神经元提取的类别子维度组成。通过神经元内部的注意力机制,该结构能够识别并利用一个关键的类别区域,从而提高语言模型的效率,该区域更加均匀且位于不同类别子维度的交集处。

链接: https://arxiv.org/abs/2505.07831
作者: Michael Pichat,William Pogrund,Paloma Pichat,Judicael Poumay,Armanouche Gasparian,Samuel Demarchi,Martin Corbet,Alois Georgeon,Michael Veillet-Guillem
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:The polysemantic nature of synthetic neurons in artificial intelligence language models is currently understood as the result of a necessary superposition of distributed features within the latent space. We propose an alternative approach, geometrically defining a neuron in layer n as a categorical vector space with a non-orthogonal basis, composed of categorical sub-dimensions extracted from preceding neurons in layer n-1. This categorical vector space is structured by the activation space of each neuron and enables, via an intra-neuronal attention process, the identification and utilization of a critical categorical zone for the efficiency of the language model - more homogeneous and located at the intersection of these different categorical sub-dimensions.
zh

[NLP-81] CellVerse: Do Large Language Models Really Understand Cell Biology?

【速读】: 该论文旨在解决当前大型语言模型(Large Language Models, LLMs)在基于语言驱动的单细胞分析任务中的性能评估不足问题,特别是其在细胞生物学领域的适用性与有效性尚未得到系统性验证。论文提出的关键解决方案是构建CellVerse,这是一个统一的语言中心问答基准,整合了四种类型的单细胞多组学数据,并涵盖了三个层次的单细胞分析任务:细胞类型注释(细胞级)、药物反应预测(药物级)和扰动分析(基因级)。通过在CellVerse上对14个开源和闭源LLMs进行系统评估,揭示了现有模型在细胞生物学理解上的局限性,并为未来提升LLMs在该领域的表现提供了基础。

链接: https://arxiv.org/abs/2505.07865
作者: Fan Zhang,Tianyu Liu,Zhihong Zhu,Hao Wu,Haixin Wang,Donghao Zhou,Yefeng Zheng,Kun Wang,Xian Wu,Pheng-Ann Heng
机构: CUHK(香港中文大学); Yale University(耶鲁大学); Peking University(北京大学); Tsinghua University(清华大学); UCLA(加州大学洛杉矶分校); Westlake University(西湖大学); NTU(南洋理工大学); Tencent(腾讯)
类目: Quantitative Methods (q-bio.QM); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Cell Behavior (q-bio.CB)
备注:

点击查看摘要

Abstract:Recent studies have demonstrated the feasibility of modeling single-cell data as natural languages and the potential of leveraging powerful large language models (LLMs) for understanding cell biology. However, a comprehensive evaluation of LLMs’ performance on language-driven single-cell analysis tasks still remains unexplored. Motivated by this challenge, we introduce CellVerse, a unified language-centric question-answering benchmark that integrates four types of single-cell multi-omics data and encompasses three hierarchical levels of single-cell analysis tasks: cell type annotation (cell-level), drug response prediction (drug-level), and perturbation analysis (gene-level). Going beyond this, we systematically evaluate the performance across 14 open-source and closed-source LLMs ranging from 160M to 671B on CellVerse. Remarkably, the experimental results reveal: (1) Existing specialist models (C2S-Pythia) fail to make reasonable decisions across all sub-tasks within CellVerse, while generalist models such as Qwen, Llama, GPT, and DeepSeek family models exhibit preliminary understanding capabilities within the realm of cell biology. (2) The performance of current LLMs falls short of expectations and has substantial room for improvement. Notably, in the widely studied drug response prediction task, none of the evaluated LLMs demonstrate significant performance improvement over random guessing. CellVerse offers the first large-scale empirical demonstration that significant challenges still remain in applying LLMs to cell biology. By introducing CellVerse, we lay the foundation for advancing cell biology through natural languages and hope this paradigm could facilitate next-generation single-cell analysis.
zh

计算机视觉

[CV-0] UniSkill: Imitating Human Videos via Cross-Embodiment Skill Representations

【速读】:该论文试图解决将人类模仿能力迁移至机器人时面临的挑战,即由于人类与机器人在视觉外观和物理能力上的固有差异,导致技能迁移困难的问题。解决方案的关键在于提出UniSkill框架,该框架通过从大规模跨体态视频数据中学习与体态无关的技能表示,无需任何标签即可实现从人类视频提示中提取的技能有效迁移到仅使用机器人数据训练的机器人策略中。

链接: https://arxiv.org/abs/2505.08787
作者: Hanjung Kim,Jaehyun Kang,Hyolim Kang,Meedeum Cho,Seon Joo Kim,Youngwoon Lee
机构: Yonsei University (延世大学)
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
备注: Project Page: this https URL

点击查看摘要

Abstract:Mimicry is a fundamental learning mechanism in humans, enabling individuals to learn new tasks by observing and imitating experts. However, applying this ability to robots presents significant challenges due to the inherent differences between human and robot embodiments in both their visual appearance and physical capabilities. While previous methods bridge this gap using cross-embodiment datasets with shared scenes and tasks, collecting such aligned data between humans and robots at scale is not trivial. In this paper, we propose UniSkill, a novel framework that learns embodiment-agnostic skill representations from large-scale cross-embodiment video data without any labels, enabling skills extracted from human video prompts to effectively transfer to robot policies trained only on robot data. Our experiments in both simulation and real-world environments show that our cross-embodiment skills successfully guide robots in selecting appropriate actions, even with unseen video prompts. The project website can be found at: this https URL.
zh

[CV-1] Towards Autonomous UAV Visual Object Search in City Space: Benchmark and Agentic Methodology

【速读】:该论文旨在解决城市环境中空中视觉目标搜索(Aerial Visual Object Search, AVOS)任务中无人机自主搜索与识别目标对象的挑战,这些问题包括冗余语义处理、相似物体区分以及探索与利用的权衡。其解决方案的关键在于提出PRPSearcher,一种基于多模态大语言模型(MLLMs)的新型代理方法,该方法模拟人类三层认知机制,构建三个专用地图:以目标为中心的动态语义地图增强空间感知、基于语义吸引力值的三维认知地图用于目标推理,以及用于平衡探索与利用的三维不确定性地图,并结合去噪机制和灵感促进思维(IPT)提示机制实现自适应动作规划。

链接: https://arxiv.org/abs/2505.08765
作者: Yatai Ji,Zhengqiu Zhu,Yong Zhao,Beidan Liu,Chen Gao,Yihao Zhao,Sihang Qiu,Yue Hu,Quanjun Yin,Yong Li
机构: State Key Lab of Digital-Intelligent Modeling and Simulation(国家数字智能建模与仿真重点实验室); Tsinghua University(清华大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Aerial Visual Object Search (AVOS) tasks in urban environments require Unmanned Aerial Vehicles (UAVs) to autonomously search for and identify target objects using visual and textual cues without external guidance. Existing approaches struggle in complex urban environments due to redundant semantic processing, similar object distinction, and the exploration-exploitation dilemma. To bridge this gap and support the AVOS task, we introduce CityAVOS, the first benchmark dataset for autonomous search of common urban objects. This dataset comprises 2,420 tasks across six object categories with varying difficulty levels, enabling comprehensive evaluation of UAV agents’ search capabilities. To solve the AVOS tasks, we also propose PRPSearcher (Perception-Reasoning-Planning Searcher), a novel agentic method powered by multi-modal large language models (MLLMs) that mimics human three-tier cognition. Specifically, PRPSearcher constructs three specialized maps: an object-centric dynamic semantic map enhancing spatial perception, a 3D cognitive map based on semantic attraction values for target reasoning, and a 3D uncertainty map for balanced exploration-exploitation search. Also, our approach incorporates a denoising mechanism to mitigate interference from similar objects and utilizes an Inspiration Promote Thought (IPT) prompting mechanism for adaptive action planning. Experimental results on CityAVOS demonstrate that PRPSearcher surpasses existing baselines in both success rate and search efficiency (on average: +37.69% SR, +28.96% SPL, -30.69% MSS, and -46.40% NE). While promising, the performance gap compared to humans highlights the need for better semantic reasoning and spatial exploration capabilities in AVOS tasks. This work establishes a foundation for future advances in embodied target search. Dataset and source code are available at this https URL.
zh

[CV-2] Advancing Food Nutrition Estimation via Visual-Ingredient Feature Fusion

【速读】:该论文试图解决营养估算(nutrition estimation)在数据集缺乏营养标注(nutritional annotations)方面的局限性。其解决方案的关键在于引入了一个包含84,446张图像的FastFood数据集,并提出了一种模型无关的视觉-成分特征融合(Visual-Ingredient Feature Fusion, VIF²)方法,通过整合视觉特征与成分信息来提升营养估算的准确性。此外,通过同义词替换和重采样策略增强成分鲁棒性,并利用大模态模型在测试阶段对成分预测进行优化。

链接: https://arxiv.org/abs/2505.08747
作者: Huiyan Qi,Bin Zhu,Chong-Wah Ngo,Jingjing Chen,Ee-Peng Lim
机构: Singapore Management University (新加坡管理大学); Fudan University (复旦大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Accepted for publication in ACM International Conference on Multimedia Retrieval 2025

点击查看摘要

Abstract:Nutrition estimation is an important component of promoting healthy eating and mitigating diet-related health risks. Despite advances in tasks such as food classification and ingredient recognition, progress in nutrition estimation is limited due to the lack of datasets with nutritional annotations. To address this issue, we introduce FastFood, a dataset with 84,446 images across 908 fast food categories, featuring ingredient and nutritional annotations. In addition, we propose a new model-agnostic Visual-Ingredient Feature Fusion (VIF²) method to enhance nutrition estimation by integrating visual and ingredient features. Ingredient robustness is improved through synonym replacement and resampling strategies during training. The ingredient-aware visual feature fusion module combines ingredient features and visual representation to achieve accurate nutritional prediction. During testing, ingredient predictions are refined using large multimodal models by data augmentation and majority voting. Our experiments on both FastFood and Nutrition5k datasets validate the effectiveness of our proposed method built on different backbones (e.g., Resnet, InceptionV3 and ViT), which demonstrates the importance of ingredient information in nutrition estimation. this https URL.
zh

[CV-3] Extending Large Vision-Language Model for Diverse Interactive Tasks in Autonomous Driving

【速读】:该论文旨在解决大型视觉-语言模型(Large Visual-Language Models, LVLMs)在自动驾驶场景中对场景理解不全面的问题,具体表现为现有研究多关注前视视角和场景中的部分物体,难以实现对场景的全面理解,同时LVLMs缺乏2D与3D之间的映射关系以及3D目标定位与指令理解的不足。解决方案的关键在于提出NuInteract数据集,该数据集包含超过150万对多视角图像语言对,涵盖密集场景描述和多样化的交互任务,并设计了DriveMonkey框架,该框架通过一系列可学习查询将LVLMs与空间处理器无缝集成,空间处理器作为即插即用组件,可使用预训练的3D检测器进行初始化以提升3D感知能力。

链接: https://arxiv.org/abs/2505.08725
作者: Zongchuang Zhao,Haoyu Fu,Dingkang Liang,Xin Zhou,Dingyuan Zhang,Hongwei Xie,Bing Wang,Xiang Bai
机构: Huazhong University of Science and Technology (华中科技大学); Xiaomi EV (小米汽车)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: The dataset and code will be released at this https URL

点击查看摘要

Abstract:The Large Visual-Language Models (LVLMs) have significantly advanced image understanding. Their comprehension and reasoning capabilities enable promising applications in autonomous driving scenarios. However, existing research typically focuses on front-view perspectives and partial objects within scenes, struggling to achieve comprehensive scene understanding. Meanwhile, existing LVLMs suffer from the lack of mapping relationship between 2D and 3D and insufficient integration of 3D object localization and instruction understanding. To tackle these limitations, we first introduce NuInteract, a large-scale dataset with over 1.5M multi-view image language pairs spanning dense scene captions and diverse interactive tasks. Furthermore, we propose DriveMonkey, a simple yet effective framework that seamlessly integrates LVLMs with a spatial processor using a series of learnable queries. The spatial processor, designed as a plug-and-play component, can be initialized with pre-trained 3D detectors to improve 3D perception. Our experiments show that DriveMonkey outperforms general LVLMs, especially achieving a 9.86% notable improvement on the 3D visual grounding task. The dataset and code will be released at this https URL.
zh

[CV-4] TiMo: Spatiotemporal Foundation Model for Satellite Image Time Series

【速读】:该论文旨在解决现有时空基础模型在处理卫星图像时间序列(SITS)时,无法显式捕捉地物对象之间多尺度时空关系的问题,这一局限性影响了其在下游任务中的效果。论文提出的解决方案是TiMo,其关键在于引入了一种时空陀螺注意力机制,该机制能够动态捕捉时间和空间上的多尺度演变模式,从而提升模型对SITS数据的表征能力。

链接: https://arxiv.org/abs/2505.08723
作者: Xiaolei Qin,Di Wang,Jing Zhang,Fengxiang Wang,Xin Su,Bo Du,Liangpei Zhang
机构: Wuhan University (武汉大学); Zhongguancun Academy (中关村学院); National University of Defense Technology (国防科技大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Satellite image time series (SITS) provide continuous observations of the Earth’s surface, making them essential for applications such as environmental management and disaster assessment. However, existing spatiotemporal foundation models rely on plain vision transformers, which encode entire temporal sequences without explicitly capturing multiscale spatiotemporal relationships between land objects. This limitation hinders their effectiveness in downstream tasks. To overcome this challenge, we propose TiMo, a novel hierarchical vision transformer foundation model tailored for SITS analysis. At its core, we introduce a spatiotemporal gyroscope attention mechanism that dynamically captures evolving multiscale patterns across both time and space. For pre-training, we curate MillionST, a large-scale dataset of one million images from 100,000 geographic locations, each captured across 10 temporal phases over five years, encompassing diverse geospatial changes and seasonal variations. Leveraging this dataset, we adapt masked image modeling to pre-train TiMo, enabling it to effectively learn and encode generalizable spatiotemporal representations. Extensive experiments across multiple spatiotemporal tasks, including deforestation monitoring, land cover segmentation, crop type classification, and flood detection, demonstrate TiMo’s superiority over state-of-the-art methods. Code, model, and dataset will be released at this https URL.
zh

[CV-5] Controllable Image Colorization with Instance-aware Texts and Masks

【Quick Read】: This paper tackles the color bleeding and color binding errors that current mainstream image colorization models exhibit at the instance level, along with their inability to perform precise instance-aware colorization. The key to the solution is MT-Color, a diffusion-based colorization method with a pixel-level mask attention mechanism: latent features are combined with conditional grayscale image features through cross-attention, and segmentation masks are used to build cross-attention masks that prevent pixel information from leaking between instances. An instance mask and text guidance module further fuses instance masks with text representations via self-attention, preventing instance texts from mis-guiding other regions and thereby mitigating color binding errors. A multi-instance sampling strategy additionally improves colorization accuracy.

Link: https://arxiv.org/abs/2505.08705
Authors: Yanru An,Ling Gui,Qiang Hu,Chunlei Cai,Tianxiao Ye,Xiaoyun Zhang,Yanfeng Wang
Affiliations: Shanghai Jiao Tong University (上海交通大学); Bilibili Inc. (哔哩哔哩公司)
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Recently, the application of deep learning in image colorization has received widespread attention. The maturation of diffusion models has further advanced the development of image colorization models. However, current mainstream image colorization models still face issues such as color bleeding and color binding errors, and cannot colorize images at the instance level. In this paper, we propose a diffusion-based colorization method MT-Color to achieve precise instance-aware colorization with user-provided guidance. To tackle the color bleeding issue, we design a pixel-level mask attention mechanism that integrates latent features and conditional gray image features through cross-attention. We use segmentation masks to construct cross-attention masks, preventing pixel information from exchanging between different instances. We also introduce an instance mask and text guidance module that extracts instance masks and text representations of each instance, which are then fused with latent features through self-attention, utilizing instance masks to form self-attention masks to prevent instance texts from guiding the colorization of other areas, thus mitigating color binding errors. Furthermore, we apply a multi-instance sampling strategy, which involves sampling each instance region separately and then fusing the results. Additionally, we have created a specialized dataset for instance-level colorization tasks, GPT-color, by leveraging large visual language models on existing image datasets. Qualitative and quantitative experiments show that our model and dataset outperform previous methods and datasets.
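The core of the anti-bleeding mechanism is cross-attention that is not allowed to mix pixels from different instances. Below is a minimal sketch of such mask-constrained cross-attention; the single-head formulation, shapes, and names are illustrative assumptions, not the authors' implementation.

```python
# Sketch: instance-constrained cross-attention in the spirit of MT-Color's
# pixel-level mask attention (shapes and names are our assumptions).
import torch
import torch.nn.functional as F

def instance_attention_mask(instance_ids_q, instance_ids_k):
    """Boolean mask [Nq, Nk]: True where query and key pixels belong to
    the same instance, so attention cannot leak across instances."""
    return instance_ids_q[:, None] == instance_ids_k[None, :]

def masked_cross_attention(q, k, v, attn_mask):
    # q: [Nq, d]; k, v: [Nk, d]; attn_mask: [Nq, Nk] boolean
    scores = q @ k.T / q.shape[-1] ** 0.5
    scores = scores.masked_fill(~attn_mask, float("-inf"))
    return F.softmax(scores, dim=-1) @ v

# Toy usage: 6 latent pixels attending to 6 grayscale-feature pixels.
ids = torch.tensor([0, 0, 1, 1, 2, 2])          # instance label per pixel
q, k, v = (torch.randn(6, 32) for _ in range(3))
out = masked_cross_attention(q, k, v, instance_attention_mask(ids, ids))
print(out.shape)  # torch.Size([6, 32])
```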

[CV-6] SPAST: Arbitrary Style Transfer with Style Priors via Pre-trained Large-scale Model

【Quick Read】: This paper addresses the tension in arbitrary style transfer between generating high-quality stylized images and preserving content structure, as well as the long inference times of existing methods. The key to the solution is a new framework, SPAST, which fuses style features into content features through a novel Local-global Window Size Stylization Module (LGWSSM) and introduces a new style prior loss that mines style priors from a pre-trained large-scale model, producing high-quality stylized images with reduced inference time.

Link: https://arxiv.org/abs/2505.08695
Authors: Zhanjie Zhang,Quanwei Zhang,Junsheng Luan,Mengyuan Yang,Yun Wang,Lei Zhao
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by Neural Networks

Click to view abstract

Abstract:Given an arbitrary content and style image, arbitrary style transfer aims to render a new stylized image which preserves the content image’s structure and possesses the style image’s style. Existing arbitrary style transfer methods are based on either small models or pre-trained large-scale models. The small model-based methods fail to generate high-quality stylized images, bringing artifacts and disharmonious patterns. The pre-trained large-scale model-based methods can generate high-quality stylized images but struggle to preserve the content structure and cost long inference time. To this end, we propose a new framework, called SPAST, to generate high-quality stylized images with less inference time. Specifically, we design a novel Local-global Window Size Stylization Module (LGWSSM) to fuse style features into content features. Besides, we introduce a novel style prior loss, which can dig out the style priors from a pre-trained large-scale model into the SPAST and motivate the SPAST to generate high-quality stylized images with short inference time. We conduct abundant experiments to verify that our proposed method can generate high-quality stylized images with less inference time compared with the SOTA arbitrary style transfer methods.

[CV-7] CAD-Coder: Text-Guided CAD Files Code Generation

【Quick Read】: This paper targets the inefficiency of traditional computer-aided design (CAD) for personalized generation and the lack of interactive editability and geometric annotations in the outputs of existing generative methods. The key to the solution is the CAD-Coder framework, which translates natural language instructions into executable CAD script code, producing human-editable CAD files (.Dxf) with both editability and geometric annotations.

Link: https://arxiv.org/abs/2505.08686
Authors: Changqi He,Shuhan Zhang,Liguo Zhang,Jiajun Miao
Affiliations: Harbin Engineering University (哈尔滨工程大学)
Subjects: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:Computer-aided design (CAD) is a way to digitally create 2D drawings and 3D models of real-world products. Traditional CAD typically relies on hand-drawing by experts or modifications of existing library files, which doesn’t allow for rapid personalization. With the emergence of generative artificial intelligence, convenient and efficient personalized CAD generation has become possible. However, existing generative methods typically produce outputs that lack interactive editability and geometric annotations, limiting their practical applications in manufacturing. To enable interactive generative CAD, we propose CAD-Coder, a framework that transforms natural language instructions into CAD script codes, which can be executed in Python environments to generate human-editable CAD files (.Dxf). To facilitate the generation of editable CAD sketches with annotation information, we construct a comprehensive dataset comprising 29,130 Dxf files with their corresponding script codes, where each sketch preserves both editability and geometric annotations. We evaluate CAD-Coder on various 2D/3D CAD generation tasks against existing methods, demonstrating superior interactive capabilities while uniquely providing editable sketches with geometric annotations.
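To make the "executable CAD script code" idea concrete, here is a hypothetical example of the kind of Python script such a framework might emit for an instruction like "a 40x20 mm plate with a centered 5 mm hole". The use of the ezdxf library, the dimensions, and the file name are our assumptions for illustration, not output from CAD-Coder.

```python
# Sketch: a Python CAD script that produces a human-editable, annotated .dxf file.
import ezdxf

doc = ezdxf.new(dxfversion="R2010", setup=True)   # setup=True registers standard styles
msp = doc.modelspace()

# Plate outline as a closed lightweight polyline (units: mm).
msp.add_lwpolyline([(0, 0), (40, 0), (40, 20), (0, 20)], close=True)
# Centered hole, radius 2.5 mm.
msp.add_circle(center=(20, 10), radius=2.5)
# A linear dimension annotation along the bottom edge (geometric annotation).
dim = msp.add_linear_dim(base=(0, -6), p1=(0, 0), p2=(40, 0))
dim.render()

doc.saveas("plate.dxf")   # editable output file
```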

[CV-8] Calibration and Uncertainty for multiRater Volume Assessment in multiorgan Segmentation (CURVAS) challenge results MICCAI2024

【速读】:该论文旨在解决医学图像分割中深度学习(Deep Learning, DL)模型的可靠性与临床适用性问题,特别是针对标注变异、校准和不确定性估计等关键挑战。其解决方案的关键在于引入多标注者(multi-rater)构建更全面的地面真实数据(ground truth),强调分割任务的主观性,并利用标注者间差异来评估模型的不确定性处理能力,从而实现更稳健的模型评价。

链接: https://arxiv.org/abs/2505.08685
作者: Meritxell Riera-Marin,Sikha O K,Julia Rodriguez-Comas,Matthias Stefan May,Zhaohong Pan,Xiang Zhou,Xiaokun Liang,Franciskus Xaverius Erick,Andrea Prenner,Cedric Hemon,Valentin Boussot,Jean-Louis Dillenseger,Jean-Claude Nunes,Abdul Qayyum,Moona Mazher,Steven A Niederer,Kaisar Kushibar,Carlos Martin-Isla,Petia Radeva,Karim Lekadir,Theodore Barfoot,Luis C. Garcia Peraza Herrera,Ben Glocker,Tom Vercauteren,Lucas Gago,Justin Englemann,Joy-Marie Kleiss,Anton Aubanell,Andreu Antolin,Javier Garcia-Lopez,Miguel A. Gonzalez Ballester,Adrian Galdran
机构: Sycai Technologies SL(西卡技术SL公司); BCN Medtech(巴塞罗那医疗技术); Universitätsklinikum Erlangen(埃尔朗根大学医院); University Hospital Erlangen(埃尔朗根大学医院); Hospital de Sant Pau i la Santa Creu(圣保罗和圣十字医院); Institut de Recerca Sant Pau - Centre CERCA(圣保罗研究机构-加泰罗尼亚研究中心); Hospital Universitari Vall d’Hebron(瓦尔德赫龙大学医院); TECNALIA(技术纳利亚); Shenzhen Institute of Advanced Technology(深圳先进技术研究院); University of Chinese Academy of Sciences(中国科学院大学); Friedrich-Alexander-Universität Erlangen-Nürnberg (FAU)(弗里德里希-亚历山大-埃尔朗根-纽伦堡大学); National Heart and Lung Institute(国家心肺研究所); Centre for Medical Image Computing(医学影像计算中心); Barcelona Artificial Intelligence in Medicine Lab (BCN-AIM)(巴塞罗那医学人工智能实验室); IBA(巴塞罗那数学与信息学院); King’s College London (KCL)(国王学院伦敦大学); Universitat de Barcelona (UB)(巴塞罗那大学); University of Edinburgh(爱丁堡大学); Université de Rennes 1(雷恩第一大学); Institució Catalana de Recerca i Estudis Avançats (ICREA)(加泰罗尼亚研究与高级研究机构)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: This challenge was hosted in MICCAI 2024

点击查看摘要

Abstract:Deep learning (DL) has become the dominant approach for medical image segmentation, yet ensuring the reliability and clinical applicability of these models requires addressing key challenges such as annotation variability, calibration, and uncertainty estimation. This is why we created the Calibration and Uncertainty for multiRater Volume Assessment in multiorgan Segmentation (CURVAS), which highlights the critical role of multiple annotators in establishing a more comprehensive ground truth, emphasizing that segmentation is inherently subjective and that leveraging inter-annotator variability is essential for robust model evaluation. Seven teams participated in the challenge, submitting a variety of DL models evaluated using metrics such as Dice Similarity Coefficient (DSC), Expected Calibration Error (ECE), and Continuous Ranked Probability Score (CRPS). By incorporating consensus and dissensus ground truth, we assess how DL models handle uncertainty and whether their confidence estimates align with true segmentation performance. Our findings reinforce the importance of well-calibrated models, as better calibration is strongly correlated with the quality of the results. Furthermore, we demonstrate that segmentation models trained on diverse datasets and enriched with pre-trained knowledge exhibit greater robustness, particularly in cases deviating from standard anatomical structures. Notably, the best-performing models achieved high DSC and well-calibrated uncertainty estimates. This work underscores the need for multi-annotator ground truth, thorough calibration assessments, and uncertainty-aware evaluations to develop trustworthy and clinically reliable DL-based medical image segmentation models.
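For readers unfamiliar with the challenge metrics, here is a minimal sketch of two of them, the Dice Similarity Coefficient (DSC) and Expected Calibration Error (ECE), for binary segmentation; the binning scheme and the foreground-probability formulation are our assumptions.

```python
# Sketch: DSC and a simple equal-width-bin ECE for binary masks.
import numpy as np

def dice(pred, gt, eps=1e-7):
    pred, gt = pred.astype(bool), gt.astype(bool)
    return (2 * (pred & gt).sum() + eps) / (pred.sum() + gt.sum() + eps)

def ece(probs, gt, n_bins=10):
    """Average gap between confidence and accuracy over confidence bins."""
    conf = np.maximum(probs, 1 - probs)          # confidence of the argmax class
    correct = (probs >= 0.5) == gt.astype(bool)  # per-pixel correctness
    bins = np.linspace(0.5, 1.0, n_bins + 1)
    total, err = conf.size, 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        m = (conf >= lo) & (conf < hi) if hi < 1 else (conf >= lo)
        if m.any():
            err += m.sum() / total * abs(correct[m].mean() - conf[m].mean())
    return err

probs = np.random.rand(64, 64)                   # toy foreground probabilities
gt = np.random.rand(64, 64) > 0.5
print(dice(probs >= 0.5, gt), ece(probs, gt))
```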

[CV-9] Claycode: Stylable and Deformable 2D Scannable Codes

【Quick Read】: This paper addresses the limitations of traditional 2D scannable codes (such as QR codes) in stylization and deformation tolerance: the structure of conventional matrix codes restricts their visual expressiveness, and they tend to fail under severe deformation. The key to the solution is Claycode, a new approach that encodes information in a tree structure: bits are mapped to a topology tree, which is then depicted as nested color regions drawn within the boundary of a target polygon. This permits extensive stylization without sacrificing functionality and, in real-world scenarios, shows deformation tolerance superior to traditional 2D scannable codes.

Link: https://arxiv.org/abs/2505.08666
Authors: Marco Maida,Alberto Crescini,Marco Perronet,Elena Camuffo
Affiliations: Unknown
Subjects: Graphics (cs.GR); Computational Geometry (cs.CG); Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC)
Comments:

Click to view abstract

Abstract:This paper introduces Claycode, a novel 2D scannable code designed for extensive stylization and deformation. Unlike traditional matrix-based codes (e.g., QR codes), Claycodes encode their message in a tree structure. During the encoding process, bits are mapped into a topology tree, which is then depicted as a nesting of color regions drawn within the boundaries of a target polygon shape. When decoding, Claycodes are extracted and interpreted in real-time from a camera stream. We detail the end-to-end pipeline and show that Claycodes allow for extensive stylization without compromising their functionality. We then empirically demonstrate Claycode’s high tolerance to heavy deformations, outperforming traditional 2D scannable codes in scenarios where they typically fail.

[CV-10] SkillFormer: Unified Multi-View Video Understanding for Proficiency Estimation

【Quick Read】: This paper addresses skill assessment in complex activities, a problem with broad applications in sports, rehabilitation, and training. The key to the solution is SkillFormer, a parameter-efficient multi-view proficiency estimation method built on the TimeSformer backbone. It introduces a CrossViewFusion module that fuses view-specific features using multi-head cross-attention, learnable gating, and adaptive self-calibration, and it fine-tunes only a small subset of parameters via Low-Rank Adaptation, substantially reducing training cost and improving computational efficiency.

Link: https://arxiv.org/abs/2505.08665
Authors: Edoardo Bianchi,Antonio Liotta
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Click to view abstract

Abstract:Assessing human skill levels in complex activities is a challenging problem with applications in sports, rehabilitation, and training. In this work, we present SkillFormer, a parameter-efficient architecture for unified multi-view proficiency estimation from egocentric and exocentric videos. Building on the TimeSformer backbone, SkillFormer introduces a CrossViewFusion module that fuses view-specific features using multi-head cross-attention, learnable gating, and adaptive self-calibration. We leverage Low-Rank Adaptation to fine-tune only a small subset of parameters, significantly reducing training costs. In fact, when evaluated on the EgoExo4D dataset, SkillFormer achieves state-of-the-art accuracy in multi-view settings while demonstrating remarkable computational efficiency, using 4.5x fewer parameters and requiring 3.75x fewer training epochs than prior baselines. It excels in multiple structured tasks, confirming the value of multi-view integration for fine-grained skill assessment.
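As a rough picture of the fusion step, the following is a minimal sketch of a CrossViewFusion-style block: egocentric tokens attend to exocentric tokens via multi-head cross-attention, and a learnable gate mixes the result back in. Dimensions and the exact gating form are our assumptions, and the paper's adaptive self-calibration is omitted.

```python
# Sketch: gated multi-head cross-attention fusing two camera views.
import torch
import torch.nn as nn

class CrossViewFusion(nn.Module):
    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.gate = nn.Parameter(torch.zeros(dim))   # sigmoid(0)=0.5: equal mix at init
        self.norm = nn.LayerNorm(dim)

    def forward(self, ego, exo):
        # Egocentric queries attend to exocentric tokens: [B, N, dim].
        fused, _ = self.attn(query=ego, key=exo, value=exo)
        g = torch.sigmoid(self.gate)                 # learnable per-channel weight
        return self.norm(g * fused + (1 - g) * ego)

ego, exo = torch.randn(2, 16, 256), torch.randn(2, 16, 256)
print(CrossViewFusion()(ego, exo).shape)             # torch.Size([2, 16, 256])
```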

[CV-11] DLO-Splatting: Tracking Deformable Linear Objects Using 3D Gaussian Splatting ICRA2025

【Quick Read】: This paper addresses estimating the 3D shape of a Deformable Linear Object (DLO) from multi-view RGB images and gripper state information, which is challenging for existing vision-only methods. The key to the solution is the DLO-Splatting algorithm, which predicts the object shape with a position-based dynamics model augmented with shape smoothness and rigidity dampening corrections, then optimizes against a 3D Gaussian Splatting-based rendering loss, iteratively rendering and refining the prediction to align it with visual observations in the update step.

Link: https://arxiv.org/abs/2505.08644
Authors: Holly Dinkel,Marcel Büsching,Alberta Longhini,Brian Coltin,Trey Smith,Danica Kragic,Mårten Björkman,Timothy Bretl
Affiliations: University of Illinois Urbana-Champaign (伊利诺伊大学厄巴纳-香槟分校); KTH Royal Institute of Technology (皇家理工学院); NASA Ames Research Center (美国国家航空航天局艾姆斯研究中心)
Subjects: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments: 5 pages, 2 figures, presented at the 2025 5th Workshop: Reflections on Representations and Manipulating Deformable Objects at the IEEE International Conference on Robotics and Automation. RMDO workshop ( this https URL )

Click to view abstract

Abstract:This work presents DLO-Splatting, an algorithm for estimating the 3D shape of Deformable Linear Objects (DLOs) from multi-view RGB images and gripper state information through prediction-update filtering. The DLO-Splatting algorithm uses a position-based dynamics model with shape smoothness and rigidity dampening corrections to predict the object shape. Optimization with a 3D Gaussian Splatting-based rendering loss iteratively renders and refines the prediction to align it with the visual observations in the update step. Initial experiments demonstrate promising results in a knot tying scenario, which is challenging for existing vision-only methods.

[CV-12] OpenThinkIMG: Learning to Think with Images via Visual Tool Reinforcement Learning

【Quick Read】: This paper addresses the difficulty Large Vision-Language Models (LVLMs) have in learning adaptive behaviors for dynamic tool invocation, especially given the lack of standardized infrastructure, which hinders integrating diverse tools, generating rich interaction data, and training robust agents. The key to the solution is OpenThinkIMG, an open-source end-to-end framework with standardized vision tool interfaces, scalable trajectory generation, and a flexible training environment. To improve policy generalization, the paper further introduces V-ToolRL, a reinforcement learning framework that lets LVLMs autonomously discover optimal tool-usage strategies by directly optimizing for task success.

Link: https://arxiv.org/abs/2505.08617
Authors: Zhaochen Su,Linjie Li,Mingyang Song,Yunzhuo Hao,Zhengyuan Yang,Jun Zhang,Guanjie Chen,Jiawei Gu,Juntao Li,Xiaoye Qu,Yu Cheng
Affiliations: Soochow University (苏州大学); Microsoft (微软); Fudan University (复旦大学); Shanghai Jiao Tong University (上海交通大学); University of Electronic Science and Technology of China (中国电子科技大学); Sun Yat-sen University (中山大学); Huazhong University of Science and Technology (华中科技大学); The Chinese University of Hong Kong (香港中文大学)
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Work in progress

Click to view abstract

Abstract:While humans can flexibly leverage interactive visual cognition for complex problem-solving, enabling Large Vision-Language Models (LVLMs) to learn similarly adaptive behaviors with visual tools remains challenging. A significant hurdle is the current lack of standardized infrastructure, which hinders integrating diverse tools, generating rich interaction data, and training robust agents effectively. To address these gaps, we introduce OpenThinkIMG, the first open-source, comprehensive end-to-end framework for tool-augmented LVLMs. It features standardized vision tool interfaces, scalable trajectory generation for policy initialization, and a flexible training environment. Furthermore, considering supervised fine-tuning (SFT) on static demonstrations offers limited policy generalization for dynamic tool invocation, we propose a novel reinforcement learning (RL) framework V-ToolRL to train LVLMs to learn adaptive policies for invoking external vision tools. V-ToolRL enables LVLMs to autonomously discover optimal tool-usage strategies by directly optimizing for task success using feedback from tool interactions. We empirically validate V-ToolRL on challenging chart reasoning tasks. Our RL-trained agent, built upon a Qwen2-VL-2B, significantly outperforms its SFT-initialized counterpart (+28.83 points) and surpasses established supervised tool-learning baselines like Taco and CogCom by an average of +12.7 points. Notably, it also surpasses prominent closed-source models like GPT-4.1 by +8.68 accuracy points. We hope OpenThinkIMG can serve as a foundational framework for advancing dynamic, tool-augmented visual reasoning, helping the community develop AI agents that can genuinely “think with images”.

[CV-13] WaveGuard: Robust Deepfake Detection and Source Tracing via Dual-Tree Complex Wavelet and Graph Neural Networks

【Quick Read】: This paper addresses the security risks posed by deepfake technology, such as privacy invasion and identity theft. The key to the solution is WaveGuard, a proactive watermarking framework that strengthens robustness and imperceptibility via frequency-domain embedding and graph-based structural consistency: watermarks are embedded into high-frequency sub-bands with the Dual-Tree Complex Wavelet Transform (DT-CWT), a Structural Consistency Graph Neural Network (SC-GNN) preserves visual quality, and an attention module refines embedding precision.

Link: https://arxiv.org/abs/2505.08614
Authors: Ziyuan He,Zhiqing Guo,Liejun Wang,Gaobo Yang,Yunfeng Diao,Dan Ma
Affiliations: Xinjiang University (新疆大学); Xinjiang Multimodal Intelligent Processing and Information Security Engineering Technology Research Center (新疆多模态智能处理与信息安全工程技术研究中心); Hunan University (湖南大学); Hefei University of Technology (合肥工业大学)
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 11 pages, 5 figures, 4 tables

Abstract:Deepfake technology poses increasing risks such as privacy invasion and identity theft. To address these threats, we propose WaveGuard, a proactive watermarking framework that enhances robustness and imperceptibility via frequency-domain embedding and graph-based structural consistency. Specifically, we embed watermarks into high-frequency sub-bands using Dual-Tree Complex Wavelet Transform (DT-CWT) and employ a Structural Consistency Graph Neural Network (SC-GNN) to preserve visual quality. We also design an attention module to refine embedding precision. Experimental results on face swap and reenactment tasks demonstrate that WaveGuard outperforms state-of-the-art methods in both robustness and visual quality. Code is available at this https URL.
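To illustrate the frequency-domain step, here is a minimal sketch of embedding a bit string into DT-CWT high-frequency sub-bands. The dtcwt Python package, the chosen level and orientation, and the additive embedding rule are all our assumptions; the paper's SC-GNN quality module is not modeled here.

```python
# Sketch: additive payload embedding in the finest DT-CWT highpass sub-band.
import numpy as np
import dtcwt

img = np.random.rand(128, 128)                  # stand-in grayscale face image
bits = np.array([1, 0, 1, 1, 0, 1], dtype=float)

t = dtcwt.Transform2d()
pyr = t.forward(img, nlevels=3)

hp = pyr.highpasses[0]                          # finest level: complex, [H, W, 6]
coeffs = hp[..., 0].ravel()                     # orientation-0 coefficients (copy)
alpha = 0.05 * np.abs(coeffs[: bits.size]).mean() + 1e-8
coeffs[: bits.size] += alpha * (2 * bits - 1)   # +/- perturbation encodes each bit
hp[..., 0] = coeffs.reshape(hp[..., 0].shape)   # write modified band back

watermarked = t.inverse(pyr)                    # image carrying the hidden payload
print(np.abs(watermarked - img).max())          # embedding stays small in magnitude
```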

[CV-14] Boosting Zero-shot Stereo Matching using Large-scale Mixed Images Sources in the Real World

【Quick Read】: This paper addresses the high cost of acquiring dense pixel-level ground truth for stereo matching and the domain gap between synthetic and real images. The key to the solution is BooSTer, a framework that combines vision foundation models with large-scale mixed image sources (synthetic, real, and single-view images). It couples monocular depth estimation with diffusion models to generate dense stereo matching data from single-view images, and it handles sparse labels in real-world datasets with pseudo-mono depth labels and a dynamic scale- and shift-invariant loss, improving both accuracy and generalization.

Link: https://arxiv.org/abs/2505.08607
Authors: Yuran Wang,Yingping Liang,Ying Fu
Affiliations: Beijing Institute of Technology (北京理工大学)
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Click to view abstract

Abstract:Stereo matching methods rely on dense pixel-wise ground truth labels, which are laborious to obtain, especially for real-world datasets. The scarcity of labeled data and domain gaps between synthetic and real-world images also pose notable challenges. In this paper, we propose a novel framework, BooSTer, that leverages both vision foundation models and large-scale mixed image sources, including synthetic, real, and single-view images. First, to fully unleash the potential of large-scale single-view images, we design a data generation strategy combining monocular depth estimation and diffusion models to generate dense stereo matching data from single-view images. Second, to tackle sparse labels in real-world datasets, we transfer knowledge from monocular depth estimation models, using pseudo-mono depth labels and a dynamic scale- and shift-invariant loss for additional supervision. Furthermore, we incorporate vision foundation model as an encoder to extract robust and transferable features, boosting accuracy and generalization. Extensive experiments on benchmark datasets demonstrate the effectiveness of our approach, achieving significant improvements in accuracy over existing methods, particularly in scenarios with limited labeled data and domain shifts.
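A scale- and shift-invariant loss aligns each prediction to its pseudo-label by a closed-form least-squares scale and shift before measuring the error, so supervision survives the unknown scale of monocular depth. Below is a minimal sketch following the standard MiDaS-style formulation; treating it as the paper's exact loss is our assumption.

```python
# Sketch: scale- and shift-invariant loss against sparse pseudo-mono depth labels.
import torch

def ssi_loss(pred, target, mask):
    # pred, target: [B, N] disparities; mask: [B, N] valid pseudo-label pixels
    losses = []
    for p, t, m in zip(pred, target, mask):
        p, t = p[m], t[m]
        A = torch.stack([p, torch.ones_like(p)], dim=1)       # [n, 2]
        st = torch.linalg.lstsq(A, t.unsqueeze(1)).solution   # optimal scale, shift
        aligned = (A @ st).squeeze(1)                         # prediction after alignment
        losses.append((aligned - t).abs().mean())
    return torch.stack(losses).mean()

pred = torch.rand(2, 1024)
target = 3.0 * pred + 0.5 + 0.01 * torch.randn(2, 1024)       # same depth up to scale/shift
mask = torch.rand(2, 1024) > 0.3                              # sparse supervision
print(ssi_loss(pred, target, mask))                           # small residual loss
```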

[CV-15] Leveraging Multi-Modal Information to Enhance Dataset Distillation

【Quick Read】: This paper addresses the limited modeling of multi-modal information and object-level detail in conventional dataset distillation when producing compact, representative synthetic datasets. The key to the solution is two enhancements: caption-guided supervision and object-centric masking. The former injects textual information through feature concatenation and a caption matching strategy, improving semantic consistency between synthetic and real data; the latter uses segmentation masks to isolate target objects and suppress background distractions, combined with a masked feature alignment loss and a masked gradient matching loss to strengthen object-level learning.

Link: https://arxiv.org/abs/2505.08605
Authors: Zhe Li,Hadrien Reynaud,Bernhard Kainz
Affiliations: FAU Erlangen-Nürnberg (弗劳恩霍夫大学埃尔朗根-纽伦堡分校); Imperial College London (帝国理工学院)
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 10 pages

Click to view abstract

Abstract:Dataset distillation aims to create a compact and highly representative synthetic dataset that preserves the knowledge of a larger real dataset. While existing methods primarily focus on optimizing visual representations, incorporating additional modalities and refining object-level information can significantly improve the quality of distilled datasets. In this work, we introduce two key enhancements to dataset distillation: caption-guided supervision and object-centric masking. To integrate textual information, we propose two strategies for leveraging caption features: feature concatenation, where caption embeddings are fused with visual features at the classification stage, and caption matching, which introduces a caption-based alignment loss during training to ensure semantic coherence between real and synthetic data. Additionally, we apply segmentation masks to isolate target objects and remove background distractions, introducing two loss functions designed for object-centric learning: masked feature alignment loss and masked gradient matching loss. Comprehensive evaluations demonstrate that integrating caption-based guidance and object-centric masking enhances dataset distillation, leading to synthetic datasets that achieve superior performance on downstream tasks.
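The object-centric idea can be seen in a few lines: mask out background before comparing features of real and synthetic batches. The sketch below uses a plain MSE between mask-pooled features, which is our simplification of the masked feature alignment loss.

```python
# Sketch: background-masked feature alignment between real and synthetic batches.
import torch

def masked_feature_alignment(feat_real, feat_syn, mask_real, mask_syn, eps=1e-6):
    # feat_*: [B, C, H, W]; mask_*: [B, 1, H, W] with 1 on the target object
    def pooled(feat, mask):
        # Average features over object pixels only, ignoring background.
        return (feat * mask).sum(dim=(2, 3)) / (mask.sum(dim=(2, 3)) + eps)
    return torch.nn.functional.mse_loss(pooled(feat_real, mask_real),
                                        pooled(feat_syn, mask_syn))

f_real, f_syn = torch.randn(4, 64, 8, 8), torch.randn(4, 64, 8, 8)
m = (torch.rand(4, 1, 8, 8) > 0.5).float()
print(masked_feature_alignment(f_real, f_syn, m, m))
```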

[CV-16] Unsupervised Out-of-Distribution Detection in Medical Imaging Using Multi-Exit Class Activation Maps and Feature Masking

【Quick Read】: This paper addresses out-of-distribution (OOD) detection for deep learning models in medical imaging to improve their reliability. The key to the solution is the Multi-Exit Class Activation Map (MECAM) framework combined with feature masking: by analyzing how an input image's class activation maps change across resolutions and depths, it captures both global and local feature representations and uses them to distinguish OOD data effectively.

Link: https://arxiv.org/abs/2505.08604
Authors: Yu-Jen Chen,Xueyang Li,Yiyu Shi,Tsung-Yi Ho
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 10 pages, 2 figures

Click to view abstract

Abstract:Out-of-distribution (OOD) detection is essential for ensuring the reliability of deep learning models in medical imaging applications. This work is motivated by the observation that class activation maps (CAMs) for in-distribution (ID) data typically emphasize regions that are highly relevant to the model’s predictions, whereas OOD data often lacks such focused activations. By masking input images with inverted CAMs, the feature representations of ID data undergo more substantial changes compared to those of OOD data, offering a robust criterion for differentiation. In this paper, we introduce a novel unsupervised OOD detection framework, Multi-Exit Class Activation Map (MECAM), which leverages multi-exit CAMs and feature masking. By utilizing multi-exit networks that combine CAMs from varying resolutions and depths, our method captures both global and local feature representations, thereby enhancing the robustness of OOD detection. We evaluate MECAM on multiple ID datasets, including ISIC19 and PathMNIST, and test its performance against three medical OOD datasets, RSNA Pneumonia, COVID-19, and HeadCT, and one natural image OOD dataset, iSUN. Comprehensive comparisons with state-of-the-art OOD detection methods validate the effectiveness of our approach. Our findings emphasize the potential of multi-exit networks and feature masking for advancing unsupervised OOD detection in medical imaging, paving the way for more reliable and interpretable models in clinical practice.
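The CAM-masking criterion at a single exit can be sketched as follows: mask the input with the inverted CAM and score OOD-ness by how little the feature representation changes. The toy convolutional "model" and the cosine-similarity score are our assumptions for illustration.

```python
# Sketch: single-exit CAM-masking OOD score (higher similarity = more OOD-like).
import torch
import torch.nn.functional as F

feat = torch.nn.Conv2d(3, 16, 3, padding=1)        # stand-in feature extractor
head = torch.nn.Linear(16, 10)                     # stand-in classifier

def ood_score(x):
    fmap = feat(x)                                 # [1, 16, H, W]
    logits = head(fmap.mean(dim=(2, 3)))
    w = head.weight[logits.argmax(1)]              # CAM weights of predicted class
    cam = F.relu((w[:, :, None, None] * fmap).sum(1, keepdim=True))
    cam = cam / (cam.amax(dim=(2, 3), keepdim=True) + 1e-6)
    masked = x * (1 - cam)                         # hide the class-relevant evidence
    f1 = fmap.mean(dim=(2, 3))
    f2 = feat(masked).mean(dim=(2, 3))
    # ID inputs change a lot when their evidence is masked; OOD inputs change little.
    return F.cosine_similarity(f1, f2).item()

print(ood_score(torch.rand(1, 3, 32, 32)))
```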

[CV-17] Rejoining fragmented ancient bamboo slips with physics-driven deep learning

【Quick Read】: This paper addresses the rejoining of fragmented bamboo slips in archaeology, a step that is vital yet extremely challenging for understanding ancient texts. The key to the solution is WisePanda, a framework grounded in the physics of fracture and material deterioration that automatically generates synthetic training data capturing the physical properties of bamboo fragmentation, enabling a matching network to be trained without manually paired samples and markedly improving rejoining accuracy and efficiency.

Link: https://arxiv.org/abs/2505.08601
Authors: Jinchi Zhu,Zhou Zhao,Hailong Lei,Xiaoguang Wang,Jialiang Lu,Jing Li,Qianqian Tang,Jiachen Shen,Gui-Song Xia,Bo Du,Yongchao Xu
Affiliations: Wuhan University(武汉大学); Central China Normal University(华中师范大学); School of Computer Science, Wuhan University(武汉大学计算机学院); Center of Bamboo and Silk Manuscripts, Wuhan University(武汉大学竹简与丝织品研究中心); School of Information Management, Wuhan University(武汉大学信息管理学院); School of Artificial Intelligence, Wuhan University(武汉大学人工智能学院)
Subjects: Computer Vision and Pattern Recognition (cs.CV); Materials Science (cond-mat.mtrl-sci)
Comments:

Click to view abstract

Abstract:Bamboo slips are a crucial medium for recording ancient civilizations in East Asia, and offers invaluable archaeological insights for reconstructing the Silk Road, studying material culture exchanges, and global history. However, many excavated bamboo slips have been fragmented into thousands of irregular pieces, making their rejoining a vital yet challenging step for understanding their content. Here we introduce WisePanda, a physics-driven deep learning framework designed to rejoin fragmented bamboo slips. Based on the physics of fracture and material deterioration, WisePanda automatically generates synthetic training data that captures the physical properties of bamboo fragmentations. This approach enables the training of a matching network without requiring manually paired samples, providing ranked suggestions to facilitate the rejoining process. Compared to the leading curve matching method, WisePanda increases Top-50 matching accuracy from 36% to 52%. Archaeologists using WisePanda have experienced substantial efficiency improvements (approximately 20 times faster) when rejoining fragmented bamboo slips. This research demonstrates that incorporating physical principles into deep learning models can significantly enhance their performance, transforming how archaeologists restore and study fragmented artifacts. WisePanda provides a new paradigm for addressing data scarcity in ancient artifact restoration through physics-driven machine learning.

[CV-18] MESSI: A Multi-Elevation Semantic Segmentation Image Dataset of an Urban Environment

【Quick Read】: This paper addresses the challenges drones face in semantic segmentation over dense urban environments, particularly the effect of depth on segmentation performance and the visual diversity of different urban regions. The key to the solution is the Multi-Elevation Semantic Segmentation Image (MESSI) dataset: 2525 images captured at various altitudes over multiple urban regions, annotated with location, orientation, and camera intrinsics, supporting the training of deep neural networks for semantic segmentation as well as applications such as localization, navigation, and tracking.

Link: https://arxiv.org/abs/2505.08589
Authors: Barak Pinkovich,Boaz Matalon,Ehud Rivlin,Hector Rotstein
Affiliations: Technion-Israel Institute of Technology(以色列技术学院)
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Robotics (cs.RO)
Comments:

Click to view abstract

Abstract:This paper presents a Multi-Elevation Semantic Segmentation Image (MESSI) dataset comprising 2525 images taken by a drone flying over dense urban environments. MESSI is unique in two main features. First, it contains images from various altitudes, allowing us to investigate the effect of depth on semantic segmentation. Second, it includes images taken from several different urban regions (at different altitudes). This is important since the variety covers the visual richness captured by a drone’s 3D flight, performing horizontal and vertical maneuvers. MESSI contains images annotated with location, orientation, and the camera’s intrinsic parameters and can be used to train a deep neural network for semantic segmentation or other applications of interest (e.g., localization, navigation, and tracking). This paper describes the dataset and provides annotation details. It also explains how semantic segmentation was performed using several neural network models and shows several relevant statistics. MESSI will be published in the public domain to serve as an evaluation benchmark for semantic segmentation using images captured by a drone or similar vehicle flying over a dense urban environment.

[CV-19] PrePrompt: Predictive prompting for class incremental learning

【Quick Read】: This paper addresses an inherent limitation of correlation-based strategies in pre-trained-model-based class incremental learning (CIL): fitting the entire feature space of all tasks with only a few trainable prompts is fundamentally difficult. The key to the solution is Predictive Prompting (PrePrompt), which exploits a pre-trained model's natural classification ability to predict task-specific prompts, decomposing CIL into two stages, task-specific prompt prediction followed by label prediction, and thereby bypassing correlation-based limitations. To mitigate the bias toward recent classes caused by missing historical data for calibrating older classifiers, PrePrompt further incorporates feature translation to dynamically balance stability and plasticity.

Link: https://arxiv.org/abs/2505.08586
Authors: Libo Huang,Zhulin An,Chuanguang Yang,Boyu Diao,Fei Wang,Yan Zeng,Zhifeng Hao,Yongjun Xu
Affiliations: Institute of Computing Technology, Chinese Academy of Sciences(中国科学院计算技术研究所); Department of Mathematics and Statistics, Beijing Technology and Business University(北京工商大学数学与统计学院); College of Science, Shantou University(汕头大学理学院)
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 16 pages, 29 figures, conference

Click to view abstract

Abstract:Class Incremental Learning (CIL) based on pre-trained models offers a promising direction for open-world continual learning. Existing methods typically rely on correlation-based strategies, where an image’s classification feature is used as a query to retrieve the most related key prompts and select the corresponding value prompts for training. However, these approaches face an inherent limitation: fitting the entire feature space of all tasks with only a few trainable prompts is fundamentally challenging. We propose Predictive Prompting (PrePrompt), a novel CIL framework that circumvents correlation-based limitations by leveraging pre-trained models’ natural classification ability to predict task-specific prompts. Specifically, PrePrompt decomposes CIL into a two-stage prediction framework: task-specific prompt prediction followed by label prediction. While theoretically appealing, this framework risks bias toward recent classes due to missing historical data for older classifier calibration. PrePrompt then mitigates this by incorporating feature translation, dynamically balancing stability and plasticity. Experiments across multiple benchmarks demonstrate PrePrompt’s superiority over state-of-the-art prompt-based CIL methods. The code will be released upon acceptance.

[CV-20] A Large-scale Benchmark on Geological Fault Delineation Models: Domain Shift, Training Dynamics, Generalizability, Evaluation and Inferential Behavior

【Quick Read】: This paper addresses the limited generalization of pretrained models in seismic interpretation pipelines, especially for fault delineation, across varied geologic, acquisition, and processing conditions. The key to the solution is a large-scale benchmark that systematically evaluates pretraining, fine-tuning, and joint training strategies under varying degrees of domain shift, providing guidance on domain transfer strategies, exposing the fragility of current fine-tuning practices and the difficulty of interpreting performance, and pointing toward more generalizable, interpretable, and effective machine learning models.

Link: https://arxiv.org/abs/2505.08585
Authors: Jorge Quesada,Chen Zhou,Prithwijit Chowdhury,Mohammad Alotaibi,Ahmad Mustafa,Yusufjon Kumamnov,Mohit Prabhushankar,Ghassan AlRegib
Affiliations: Georgia Institute of Technology (佐治亚理工学院); Occidental Petroleum Corporation (西方石油公司); Tashkent State Technical University (塔什干国立技术大学)
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Click to view abstract

Abstract:Machine learning has taken a critical role in seismic interpretation workflows, especially in fault delineation tasks. However, despite the recent proliferation of pretrained models and synthetic datasets, the field still lacks a systematic understanding of the generalizability limits of these models across seismic data representing a variety of geologic, acquisition and processing settings. Distributional shifts between different data sources, limitations in fine-tuning strategies and labeled data accessibility, and inconsistent evaluation protocols all represent major roadblocks in the deployment of reliable and robust models in real-world exploration settings. In this paper, we present the first large-scale benchmarking study explicitly designed to provide answers and guidelines for domain shift strategies in seismic interpretation. Our benchmark encompasses over 200 models trained and evaluated on three heterogeneous datasets (synthetic and real data) including FaultSeg3D, CRACKS, and Thebe. We systematically assess pretraining, fine-tuning, and joint training strategies under varying degrees of domain shift. Our analysis highlights the fragility of current fine-tuning practices, the emergence of catastrophic forgetting, and the challenges of interpreting performance in a systematic manner. We establish a robust experimental baseline to provide insights into the tradeoffs inherent to current fault delineation workflows, and shed light on directions for developing more generalizable, interpretable and effective machine learning models for seismic interpretation. The insights and analyses reported provide a set of guidelines on the deployment of fault delineation models within seismic interpretation workflows.

[CV-21] ReSurgSAM2: Referring Segment Anything in Surgical Video via Credible Long-term Tracking MICCAI2025

【Quick Read】: This paper addresses the low efficiency and short-term tracking of existing surgical scene segmentation methods, which limit their use in complex real-world surgical scenarios. The key to the solution is ReSurgSAM2, a two-stage referring surgical segmentation framework that leverages Segment Anything Model 2 for text-referred target detection, followed by tracking with credible initial frame identification and diversity-driven long-term memory. The detection stage introduces a cross-modal spatial-temporal Mamba for precise detection and segmentation; the tracking stage maintains a credible and diverse memory bank to ensure consistent long-term tracking.

Link: https://arxiv.org/abs/2505.08581
Authors: Haofeng Liu,Mingqi Gao,Xuxiao Luo,Ziyue Wang,Guanyi Qin,Junde Wu,Yueming Jin
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV); Tissues and Organs (q-bio.TO)
Comments: Early accepted by MICCAI 2025

Click to view abstract

Abstract:Surgical scene segmentation is critical in computer-assisted surgery and is vital for enhancing surgical quality and patient outcomes. Recently, referring surgical segmentation is emerging, given its advantage of providing surgeons with an interactive experience to segment the target object. However, existing methods are limited by low efficiency and short-term tracking, hindering their applicability in complex real-world surgical scenarios. In this paper, we introduce ReSurgSAM2, a two-stage surgical referring segmentation framework that leverages Segment Anything Model 2 to perform text-referred target detection, followed by tracking with reliable initial frame identification and diversity-driven long-term memory. For the detection stage, we propose a cross-modal spatial-temporal Mamba to generate precise detection and segmentation results. Based on these results, our credible initial frame selection strategy identifies the reliable frame for the subsequent tracking. Upon selecting the initial frame, our method transitions to the tracking stage, where it incorporates a diversity-driven memory mechanism that maintains a credible and diverse memory bank, ensuring consistent long-term tracking. Extensive experiments demonstrate that ReSurgSAM2 achieves substantial improvements in accuracy and efficiency compared to existing methods, operating in real-time at 61.2 FPS. Our code and datasets will be available at this https URL.

[CV-22] Thermal Detection of People with Mobility Restrictions for Barrier Reduction at Traffic Lights Controlled Intersections

【Quick Read】: This paper addresses the shortcomings of conventional RGB camera-based adaptive traffic light systems in serving people with mobility restrictions and visually impaired pedestrians, along with the degraded detection performance of RGB cameras in adverse weather or low visibility and their privacy concerns. The key to the solution is a fully automated thermal-detector-based traffic light system that dynamically adjusts signal durations for individuals with mobility impairments and triggers auditory signals for visually impaired users. To overcome thermal imaging's lack of color, fine texture, and resolution, the authors develop YOLO-Thermal, a YOLO variant that integrates advanced feature extraction and attention mechanisms for more accurate and robust detection in thermal imagery.

Link: https://arxiv.org/abs/2505.08568
Authors: Xiao Ni,Carsten Kuehnel,Xiaoyi Jiang
Affiliations: University of Münster(明斯特大学); University of Applied Sciences Erfurt(埃尔福特应用科学大学)
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Click to view abstract

Abstract:Rapid advances in deep learning for computer vision have driven the adoption of RGB camera-based adaptive traffic light systems to improve traffic safety and pedestrian comfort. However, these systems often overlook the needs of people with mobility restrictions. Moreover, the use of RGB cameras presents significant challenges, including limited detection performance under adverse weather or low-visibility conditions, as well as heightened privacy concerns. To address these issues, we propose a fully automated, thermal detector-based traffic light system that dynamically adjusts signal durations for individuals with walking impairments or mobility burden and triggers the auditory signal for visually impaired individuals, thereby advancing towards barrier-free intersection for all users. To this end, we build the thermal dataset for people with mobility restrictions (TD4PWMR), designed to capture diverse pedestrian scenarios, particularly focusing on individuals with mobility aids or mobility burden under varying environmental conditions, such as different lighting, weather, and crowded urban settings. While thermal imaging offers advantages in terms of privacy and robustness to adverse conditions, it also introduces inherent hurdles for object detection due to its lack of color and fine texture details and generally lower resolution of thermal images. To overcome these limitations, we develop YOLO-Thermal, a novel variant of the YOLO architecture that integrates advanced feature extraction and attention mechanisms for enhanced detection accuracy and robustness in thermal imaging. Experiments demonstrate that the proposed thermal detector outperforms existing detectors, while the proposed traffic light system effectively enhances barrier-free intersection. The source codes and dataset are available at this https URL.

[CV-23] Reinforcement Learning meets Masked Video Modeling: Trajectory-Guided Adaptive Token Selection

【Quick Read】: This paper addresses the performance bottleneck caused by poorly chosen masking strategies in video pre-training, and in particular how to capture motion information effectively to improve learned representations. The key to the solution is a novel and generalizable Trajectory-Aware Adaptive Token Sampler (TATS) that models the motion dynamics of tokens and selects motion-centric tokens within the masked autoencoder (MAE) framework, improving video representation learning.

Link: https://arxiv.org/abs/2505.08561
Authors: Ayush K. Rai,Kyle Min,Tarun Krishna,Feiyan Hu,Alan F. Smeaton,Noel E. O’Connor
Affiliations: Dublin City University (都柏林城市大学); Intel Labs (英特尔实验室)
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Click to view abstract

Abstract:Masked video modeling (MVM) has emerged as a highly effective pre-training strategy for visual foundation models, whereby the model reconstructs masked spatiotemporal tokens using information from visible tokens. However, a key challenge in such approaches lies in selecting an appropriate masking strategy. Previous studies have explored predefined masking techniques, including random and tube-based masking, as well as approaches that leverage key motion priors, optical flow and semantic cues from externally pre-trained models. In this work, we introduce a novel and generalizable Trajectory-Aware Adaptive Token Sampler (TATS), which models the motion dynamics of tokens and can be seamlessly integrated into the masked autoencoder (MAE) framework to select motion-centric tokens in videos. Additionally, we propose a unified training strategy that enables joint optimization of both MAE and TATS from scratch using Proximal Policy Optimization (PPO). We show that our model allows for aggressive masking without compromising performance on the downstream task of action recognition while also ensuring that the pre-training remains memory efficient. Extensive experiments of the proposed approach across four benchmarks, including Something-Something v2, Kinetics-400, UCF101, and HMDB51, demonstrate the effectiveness, transferability, generalization, and efficiency of our work compared to other state-of-the-art methods.
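To see what "motion-centric token selection" means operationally, here is a minimal sketch that scores each spatiotemporal patch by the motion energy of frame differences and keeps the most dynamic tokens visible. TATS learns this policy (optimized with PPO); the hand-crafted frame-difference score below is our stand-in for it.

```python
# Sketch: keep the top-k most dynamic patch tokens per frame pair.
import torch

def motion_centric_tokens(video, patch=16, keep_ratio=0.25):
    # video: [T, C, H, W]; returns indices of tokens to keep visible
    diff = (video[1:] - video[:-1]).abs().mean(1)            # [T-1, H, W]
    T, H, W = diff.shape
    tokens = diff.reshape(T, H // patch, patch, W // patch, patch)
    energy = tokens.mean(dim=(2, 4)).flatten(1)              # [T-1, n_tokens]
    k = max(1, int(keep_ratio * energy.shape[1]))
    return energy.topk(k, dim=1).indices                     # motion-centric tokens

video = torch.rand(8, 3, 64, 64)
print(motion_centric_tokens(video).shape)                    # torch.Size([7, 4])
```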

[CV-24] DFA-CON: A Contrastive Learning Approach for Detecting Copyright Infringement in DeepFake Art

【Quick Read】: This paper addresses copyright infringement and forgery arising from generative AI in visual content creation. The core challenge is that the large-scale datasets used to train these models usually mix copyrighted and non-copyrighted artworks, and generative models can memorize training patterns, leading to varying degrees of copyright violation. The key to the solution is DFA-CON, a contrastive learning framework for detecting copyright-infringing or forged AI-generated art: it learns a discriminative representation space that poses affinity between original artworks and their forged counterparts within a contrastive learning framework, enabling effective detection across attack types such as inpainting, style transfer, adversarial perturbation, and cutmix.

Link: https://arxiv.org/abs/2505.08552
Authors: Haroon Wahab,Hassan Ugail,Irfan Mehmood
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:Recent proliferation of generative AI tools for visual content creation, particularly in the context of visual artworks, has raised serious concerns about copyright infringement and forgery. The large-scale datasets used to train these models often contain a mixture of copyrighted and non-copyrighted artworks. Given the tendency of generative models to memorize training patterns, they are susceptible to varying degrees of copyright violation. Building on the recently proposed DeepfakeArt Challenge benchmark, this work introduces DFA-CON, a contrastive learning framework designed to detect copyright-infringing or forged AI-generated art. DFA-CON learns a discriminative representation space, posing affinity among original artworks and their forged counterparts within a contrastive learning framework. The model is trained across multiple attack types, including inpainting, style transfer, adversarial perturbation, and cutmix. Evaluation results demonstrate robust detection performance across most attack types, outperforming recent pretrained foundation models. Code and model checkpoints will be released publicly upon acceptance.
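The affinity objective can be written as a standard InfoNCE/NT-Xent loss in which an original artwork and its forged counterpart form a positive pair and the rest of the batch serves as negatives; treating this as DFA-CON's exact loss is our assumption.

```python
# Sketch: symmetric InfoNCE over (original, forgery) embedding pairs.
import torch
import torch.nn.functional as F

def forgery_affinity_loss(z_orig, z_forged, tau=0.1):
    # z_orig, z_forged: [B, d] embeddings of originals and their forgeries
    z1, z2 = F.normalize(z_orig, dim=1), F.normalize(z_forged, dim=1)
    logits = z1 @ z2.T / tau                    # [B, B] similarity matrix
    labels = torch.arange(z1.shape[0])          # positives lie on the diagonal
    return 0.5 * (F.cross_entropy(logits, labels) +
                  F.cross_entropy(logits.T, labels))

z_o, z_f = torch.randn(8, 128), torch.randn(8, 128)
print(forgery_affinity_loss(z_o, z_f))
```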

[CV-25] The RaspGrade Dataset: Towards Automatic Raspberry Ripeness Grading with Deep Learning

【Quick Read】: This paper addresses real-time, accurate, and non-invasive quality grading of raspberries moving along a conveyor belt in an industrial setting, specifically classification into five distinct grades. The key to the solution is the dedicated, meticulously annotated RaspGrade dataset, which supports computer-vision instance segmentation experiments yielding accurate fruit-level masks, although some grades remain hard to classify due to color similarity and occlusion while others are readily distinguishable by color.

Link: https://arxiv.org/abs/2505.08537
Authors: Mohamed Lamine Mekhalfi,Paul Chippendale,Fabio Poiesi,Samuele Bonecher,Gilberto Osler,Nicola Zancanella
Affiliations: Fondazione Bruno Kessler(布鲁诺·凯塞尔基金会); Sant’Orsola(圣奥尔索拉)
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Click to view abstract

Abstract:This research investigates the application of computer vision for rapid, accurate, and non-invasive food quality assessment, focusing on the novel challenge of real-time raspberry grading into five distinct classes within an industrial environment as the fruits move along a conveyor belt. To address this, a dedicated dataset of raspberries, namely RaspGrade, was acquired and meticulously annotated. Instance segmentation experiments revealed that accurate fruit-level masks can be obtained; however, the classification of certain raspberry grades presents challenges due to color similarities and occlusion, while others are more readily distinguishable based on color. The acquired and annotated RaspGrade dataset is accessible on HuggingFace at: this https URL.

[CV-26] GradMix: Gradient-based Selective Mixup for Robust Data Augmentation in Class-Incremental Learning

【Quick Read】: This paper addresses the challenge in continual learning of acquiring new knowledge while preserving old knowledge, and in particular catastrophic forgetting in class-incremental learning. The key to the solution is GradMix, which performs gradient-based selective mixup using a class-based criterion: it mixes samples only from class pairs that help reduce catastrophic forgetting, not from detrimental class pairs, improving model performance while minimizing forgetting of prior knowledge.

Link: https://arxiv.org/abs/2505.08528
Authors: Minsu Kim,Seong-Hyeon Hwang,Steven Euijong Whang
Affiliations: KAIST(韩国科学技术院)
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Click to view abstract

Abstract:In the context of continual learning, acquiring new knowledge while maintaining previous knowledge presents a significant challenge. Existing methods often use experience replay techniques that store a small portion of previous task data for training. In experience replay approaches, data augmentation has emerged as a promising strategy to further improve the model performance by mixing limited previous task data with sufficient current task data. However, we theoretically and empirically analyze that training with mixed samples from random sample pairs may harm the knowledge of previous tasks and cause greater catastrophic forgetting. We then propose GradMix, a robust data augmentation method specifically designed for mitigating catastrophic forgetting in class-incremental learning. GradMix performs gradient-based selective mixup using a class-based criterion that mixes only samples from helpful class pairs and not from detrimental class pairs for reducing catastrophic forgetting. Our experiments on various real datasets show that GradMix outperforms data augmentation baselines in accuracy by minimizing the forgetting of previous knowledge.
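The contrast with vanilla mixup is easiest to see in code: mixing happens only when a pair's classes are marked helpful. How the helpful set is derived from gradients is the paper's contribution; here it is supplied as an input, which is our simplification.

```python
# Sketch: selective mixup restricted to a set of "helpful" class pairs.
import torch

def selective_mixup(x, y, helpful_pairs, alpha=0.2):
    lam = torch.distributions.Beta(alpha, alpha).sample().item()
    perm = torch.randperm(x.size(0))
    ok = torch.tensor([(int(a), int(b)) in helpful_pairs
                       for a, b in zip(y, y[perm])])
    lam_vec = ok.float() * lam + (~ok).float()           # detrimental pairs: keep x as-is
    x_mix = lam_vec.view(-1, 1) * x + (1 - lam_vec).view(-1, 1) * x[perm]
    return x_mix, y, y[perm], lam_vec

x, y = torch.randn(6, 10), torch.tensor([0, 1, 2, 0, 1, 2])
helpful = {(0, 1), (1, 0), (2, 2)}                       # assumed gradient-based verdicts
x_mix, ya, yb, lam = selective_mixup(x, y, helpful)
print(lam)                                               # 1.0 where the pair was skipped
```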

[CV-27] Leveraging Segment Anything Model for Source-Free Domain Adaptation via Dual Feature Guided Auto-Prompting

【Quick Read】: This paper addresses source-free domain adaptation (SFDA) for image segmentation, where a model must perform well on the target domain given only the source model and unlabeled target data. The key to the solution is a Dual Feature Guided (DFG) auto-prompting approach: the source model first adapts preliminarily to the target domain during a feature aggregation phase, which also prepares a feature distribution for box prompt search; box prompts are then progressively expanded under the joint guidance of target model features and Segment Anything Model (SAM) features to handle class-wise clustered and class-wise dispersed target features, respectively. Pseudo-labels produced by SAM are finally post-processed with connectivity analysis to remove false-positive regions enlarged by the target model's over-confident predictions.

Link: https://arxiv.org/abs/2505.08527
Authors: Zheang Huai,Hui Tang,Yi Li,Zhuangzhuang Chen,Xiaomeng Li
Affiliations: The Hong Kong University of Science and Technology (香港科技大学)
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Click to view abstract

Abstract:Source-free domain adaptation (SFDA) for segmentation aims at adapting a model trained in the source domain to perform well in the target domain with only the source model and unlabeled target data. Inspired by the recent success of Segment Anything Model (SAM), which exhibits the generality of segmenting images of various modalities and in different domains given human-annotated prompts like bounding boxes or points, we for the first time explore the potential of Segment Anything Model for SFDA via automatically finding an accurate bounding box prompt. We find that the bounding boxes directly generated with existing SFDA approaches are defective due to the domain gap. To tackle this issue, we propose a novel Dual Feature Guided (DFG) auto-prompting approach to search for the box prompt. Specifically, the source model is first trained in a feature aggregation phase, which not only preliminarily adapts the source model to the target domain but also builds a feature distribution well-prepared for box prompt search. In the second phase, based on two feature distribution observations, we gradually expand the box prompt with the guidance of the target model feature and the SAM feature to handle the class-wise clustered target features and the class-wise dispersed target features, respectively. To remove the potentially enlarged false positive regions caused by the over-confident prediction of the target model, the refined pseudo-labels produced by SAM are further postprocessed based on connectivity analysis. Experiments on 3D and 2D datasets indicate that our approach yields superior performance compared to conventional methods. Code is available at this https URL.

[CV-28] Dynamic Snake Upsampling Operator and Boundary-Skeleton Weighted Loss for Tubular Structure Segmentation

【Quick Read】: This paper addresses the failure of conventional upsampling operators in dense prediction tasks (such as semantic segmentation and super-resolution) to accommodate the slenderness and morphological curvature of tubular structures. The key to the solution is a dynamic snake upsampling operator and a boundary-skeleton weighted loss: the operator adapts its sampling stride over an adaptive sampling domain and selects subpixel sampling points along a serpentine path, enabling more accurate subpixel-level feature recovery for tubular structures, while the loss balances body and boundary weight allocation based on mask class ratio and distance field, preserving body overlap while sharpening topological continuity and boundary alignment.

Link: https://arxiv.org/abs/2505.08525
Authors: Yiqi Chen,Ganghai Huang,Sheng Zhang,Jianglin Dai
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Click to view abstract

Abstract:Accurate segmentation of tubular topological structures (e.g., fissures and vasculature) is critical in various fields to guarantee dependable downstream quantitative analysis and modeling. However, in dense prediction tasks such as semantic segmentation and super-resolution, conventional upsampling operators cannot accommodate the slenderness of tubular structures and the curvature of morphology. This paper introduces a dynamic snake upsampling operator and a boundary-skeleton weighted loss tailored for topological tubular structures. Specifically, we design a snake upsampling operator based on an adaptive sampling domain, which dynamically adjusts the sampling stride according to the feature map and selects a set of subpixel sampling points along the serpentine path, enabling more accurate subpixel-level feature recovery for tubular structures. Meanwhile, we propose a skeleton-to-boundary increasing weighted loss that trades off main body and boundary weight allocation based on mask class ratio and distance field, preserving main body overlap while enhancing focus on target topological continuity and boundary alignment precision. Experiments across various domain datasets and backbone networks show that this plug-and-play dynamic snake upsampling operator and boundary-skeleton weighted loss boost both pixel-wise segmentation accuracy and topological consistency of results.

[CV-29] Attention-based Generative Latent Replay: A Continual Learning Approach for WSI Analysis

【Quick Read】: This paper addresses domain incremental learning for whole slide image (WSI) classification, in particular the domain shifts caused by differences across organs, diseases, or institutions. The key to the solution is the Attention-based Generative Latent Replay Continual Learning framework (AGLR-CL), which uses Gaussian Mixture Models (GMMs) to synthesize WSI representations and patch count distributions, preserving knowledge of past domains without explicitly storing original data, while an attention-based filtering step focuses on the most salient patch embeddings to ensure high-quality synthetic samples.

Link: https://arxiv.org/abs/2505.08524
Authors: Pratibha Kumari,Daniel Reisenbüchler,Afshin Bozorgpour,Nadine S. Schaadt,Friedrich Feuerhake,Dorit Merhof
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Emerging Technologies (cs.ET)
Comments:

Click to view abstract

Abstract:Whole slide image (WSI) classification has emerged as a powerful tool in computational pathology, but remains constrained by domain shifts, e.g., due to different organs, diseases, or institution-specific variations. To address this challenge, we propose an Attention-based Generative Latent Replay Continual Learning framework (AGLR-CL), in a multiple instance learning (MIL) setup for domain incremental WSI classification. Our method employs Gaussian Mixture Models (GMMs) to synthesize WSI representations and patch count distributions, preserving knowledge of past domains without explicitly storing original data. A novel attention-based filtering step focuses on the most salient patch embeddings, ensuring high-quality synthetic samples. This privacy-aware strategy obviates the need for replay buffers and outperforms other buffer-free counterparts while matching the performance of buffer-based solutions. We validate AGLR-CL on clinically relevant biomarker detection and molecular status prediction across multiple public datasets with diverse centers, organs, and patient cohorts. Experimental results confirm its ability to retain prior knowledge and adapt to new domains, offering an effective, privacy-preserving avenue for domain incremental continual learning in WSI classification.
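Generative latent replay with GMMs reduces to a short recipe: after each domain, fit a GMM on that domain's patch embeddings, and later sample synthetic embeddings to rehearse alongside the new domain, with no raw WSI data stored. The component count and embedding size below are our assumptions, and the attention-based filtering step is omitted.

```python
# Sketch: GMM-based latent replay across two domains.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
dom_a = rng.normal(0.0, 1.0, size=(500, 64))    # embeddings from a past domain
dom_b = rng.normal(3.0, 1.0, size=(500, 64))    # embeddings from the current domain

gmm_a = GaussianMixture(n_components=8, covariance_type="diag").fit(dom_a)

replay, _ = gmm_a.sample(200)                   # synthetic stand-ins for domain A
train_batch = np.concatenate([dom_b, replay])   # rehearse old + learn new
print(train_batch.shape)                        # (700, 64)
```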

[CV-30] A Deep Learning-Driven Framework for Inhalation Injury Grading Using Bronchoscopy Images

【Quick Read】: This paper addresses the difficulty of clinically diagnosing and grading inhalation injuries, since traditional approaches such as the Abbreviated Injury Score (AIS) rely on subjective assessment and correlate weakly with clinical outcomes. The key to the solution is a deep learning framework that grades injuries from bronchoscopy images using the duration of mechanical ventilation as an objective metric, together with an enhanced StarGAN that generates high-quality, clinically relevant synthetic medical images to compensate for scarce imaging data.

Link: https://arxiv.org/abs/2505.08517
Authors: Yifan Li,Alan W Pang,Jo Woon Chong
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:Inhalation injuries face a challenge in clinical diagnosis and grading due to the limitations of traditional methods, such as Abbreviated Injury Score (AIS), which rely on subjective assessments and show weak correlations with clinical outcomes. This study introduces a novel deep learning-based framework for grading inhalation injuries using bronchoscopy images with the duration of mechanical ventilation as an objective metric. To address the scarcity of medical imaging data, we propose enhanced StarGAN, a generative model that integrates Patch Loss and SSIM Loss to improve synthetic images’ quality and clinical relevance. The augmented dataset generated by enhanced StarGAN significantly improved classification performance when evaluated using the Swin Transformer, achieving an accuracy of 77.78%, an 11.11% improvement over the original dataset. Image quality was assessed using the Fréchet Inception Distance (FID), where Enhanced StarGAN achieved the lowest FID of 30.06, outperforming baseline models. Burn surgeons confirmed the realism and clinical relevance of the generated images, particularly the preservation of bronchial structures and color distribution. These results highlight the potential of enhanced StarGAN in addressing data limitations and improving classification accuracy for inhalation injury grading.

[CV-31] VCRBench: Exploring Long-form Causal Reasoning Capabilities of Large Video Language Models

【Quick Read】: This paper addresses the underexplored video-based causal reasoning abilities of Large Video Language Models (LVLMs), in particular the absence of a dedicated benchmark for visually grounded, goal-driven evaluation. It introduces Video-based long-form Causal Reasoning (VCRBench), a benchmark built from procedural videos of everyday activities with deliberately shuffled steps, testing whether models can identify, reason about, and correctly sequence the causal events needed to accomplish a goal. The key contribution is Recognition-Reasoning Decomposition (RRD), a modular approach that splits video-based causal reasoning into video recognition and causal reasoning sub-tasks, substantially improving performance on VCRBench.

Link: https://arxiv.org/abs/2505.08455
Authors: Pritam Sarkar,Ali Etemad
Affiliations: Queen’s University, Canada; Vector Institute
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Click to view abstract

Abstract:Despite recent advances in video understanding, the capabilities of Large Video Language Models (LVLMs) to perform video-based causal reasoning remains underexplored, largely due to the absence of relevant and dedicated benchmarks for evaluating causal reasoning in visually grounded and goal-driven settings. To fill this gap, we introduce a novel benchmark named Video-based long-form Causal Reasoning (VCRBench). We create VCRBench using procedural videos of simple everyday activities, where the steps are deliberately shuffled with each clip capturing a key causal event, to test whether LVLMs can identify, reason about, and correctly sequence the events needed to accomplish a specific goal. Moreover, the benchmark is carefully designed to prevent LVLMs from exploiting linguistic shortcuts, as seen in multiple-choice or binary QA formats, while also avoiding the challenges associated with evaluating open-ended QA. Our evaluation of state-of-the-art LVLMs on VCRBench suggests that these models struggle with video-based long-form causal reasoning, primarily due to their difficulty in modeling long-range causal dependencies directly from visual observations. As a simple step toward enabling such capabilities, we propose Recognition-Reasoning Decomposition (RRD), a modular approach that breaks video-based causal reasoning into two sub-tasks of video recognition and causal reasoning. Our experiments on VCRBench show that RRD significantly boosts accuracy on VCRBench, with gains of up to 25.2%. Finally, our thorough analysis reveals interesting insights, for instance, that LVLMs primarily rely on language knowledge for complex video-based long-form causal reasoning tasks.

[CV-32] A Survey of 3D Reconstruction with Event Cameras: From Event-based Geometry to Neural 3D Rendering

【Quick Read】: This paper addresses the limitations of conventional frame-based cameras for 3D reconstruction in extreme settings such as high dynamic range, low light, or high-speed motion, proposing event cameras as a new class of sensor for more accurate reconstruction. The key lies in exploiting the asynchronous per-pixel brightness changes captured by event cameras, which yield sparse but temporally rich data streams as a more robust basis for 3D reconstruction. The survey systematically reviews event-based 3D reconstruction across stereo, monocular, and multimodal systems, covering geometry-based methods, deep learning approaches, and neural rendering techniques such as Neural Radiance Fields and 3D Gaussian Splatting.

Link: https://arxiv.org/abs/2505.08438
Authors: Chuanzhi Xu,Haoxian Zhou,Langyi Chen,Haodong Chen,Ying Zhou,Vera Chung,Qiang Qu
Affiliations: The University of Sydney (悉尼大学)
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 35 pages, 12 figures, 11 tables

Click to view abstract

Abstract:Event cameras have emerged as promising sensors for 3D reconstruction due to their ability to capture per-pixel brightness changes asynchronously. Unlike conventional frame-based cameras, they produce sparse and temporally rich data streams, which enable more accurate 3D reconstruction and open up the possibility of performing reconstruction in extreme environments such as high-speed motion, low light, or high dynamic range scenes. In this survey, we provide the first comprehensive review focused exclusively on 3D reconstruction using event cameras. The survey categorises existing works into three major types based on input modality - stereo, monocular, and multimodal systems, and further classifies them by reconstruction approach, including geometry-based, deep learning-based, and recent neural rendering techniques such as Neural Radiance Fields and 3D Gaussian Splatting. Methods with a similar research focus were organised chronologically into the most subdivided groups. We also summarise public datasets relevant to event-based 3D reconstruction. Finally, we highlight current research limitations in data availability, evaluation, representation, and dynamic scene handling, and outline promising future research directions. This survey aims to serve as a comprehensive reference and a roadmap for future developments in event-driven 3D reconstruction.

[CV-33] TT-DF: A Large-Scale Diffusion-Based Dataset and Benchmark for Human Body Forgery Detection

【Quick Read】: This paper addresses the persistent lack of datasets and detection methods for human body forgery, a gap rooted in the later inception and greater complexity of body generation techniques. The key of the solution is TikTok-DeepFake (TT-DF), a large-scale diffusion-based body forgery dataset with 6,120 forged videos and 1,378,857 synthetic frames, covering multiple forgery methods, generative configurations based on identity/pose disentanglement, and different compression versions, so as to simulate potential unseen forged data in the wild as comprehensively as possible. The paper further proposes an adapted detection model, the Temporal Optical Flow Network (TOF-Net), which exploits spatiotemporal inconsistencies and optical-flow distribution differences between natural and forged data; experiments show it performs favorably on TT-DF and outperforms current state-of-the-art extendable facial forgery detectors.

Link: https://arxiv.org/abs/2505.08437
Authors: Wenkui Yang, Zhida Zhang, Xiaoqiang Zhou, Junxian Duan, Jie Cao
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to PRCV 2024

Abstract:The emergence and popularity of facial deepfake methods spur the vigorous development of deepfake datasets and facial forgery detection, which to some extent alleviates the security concerns about facial-related artificial intelligence technologies. However, when it comes to human body forgery, there has been a persistent lack of datasets and detection methods, due to the later inception and complexity of human body generation methods. To mitigate this issue, we introduce TikTok-DeepFake (TT-DF), a novel large-scale diffusion-based dataset containing 6,120 forged videos with 1,378,857 synthetic frames, specifically tailored for body forgery detection. TT-DF offers a wide variety of forgery methods, involving multiple advanced human image animation models utilized for manipulation, two generative configurations based on the disentanglement of identity and pose information, as well as different compressed versions. The aim is to simulate any potential unseen forged data in the wild as comprehensively as possible, and we also furnish a benchmark on TT-DF. Additionally, we propose an adapted body forgery detection model, Temporal Optical Flow Network (TOF-Net), which exploits the spatiotemporal inconsistencies and optical flow distribution differences between natural data and forged data. Our experiments demonstrate that TOF-Net achieves favorable performance on TT-DF, outperforming current state-of-the-art extendable facial forgery detection models. For our TT-DF dataset, please refer to this https URL.

[CV-34] Visual Image Reconstruction from Brain Activity via Latent Representation

【Quick Read】: This paper tackles the challenge of decoding visual images from brain activity, i.e., visual image reconstruction. The key of the solution lies in combining deep neural networks (DNNs) with generative models, using hierarchical latent representations, compositional strategies, and modular architectures to reconstruct complex, subjective visual experiences with high fidelity. The review also stresses the importance of building diverse datasets, refining evaluation metrics aligned with human perceptual judgments, and developing compositional representations with stronger robustness and generalizability.

Link: https://arxiv.org/abs/2505.08429
Authors: Yukiyasu Kamitani, Misato Tanaka, Ken Shirakawa
Affiliations: Kyoto University; ATR Computational Neuroscience Laboratories
Subjects: Computer Vision and Pattern Recognition (cs.CV); Neurons and Cognition (q-bio.NC)
Comments:

Abstract:Visual image reconstruction, the decoding of perceptual content from brain activity into images, has advanced significantly with the integration of deep neural networks (DNNs) and generative models. This review traces the field’s evolution from early classification approaches to sophisticated reconstructions that capture detailed, subjective visual experiences, emphasizing the roles of hierarchical latent representations, compositional strategies, and modular architectures. Despite notable progress, challenges remain, such as achieving true zero-shot generalization for unseen images and accurately modeling the complex, subjective aspects of perception. We discuss the need for diverse datasets, refined evaluation metrics aligned with human perceptual judgments, and compositional representations that strengthen model robustness and generalizability. Ethical issues, including privacy, consent, and potential misuse, are underscored as critical considerations for responsible development. Visual image reconstruction offers promising insights into neural coding and enables new psychological measurements of visual experiences, with applications spanning clinical diagnostics and brain-machine interfaces.

[CV-35] DHECA-SuperGaze: Dual Head-Eye Cross-Attention and Super-Resolution for Unconstrained Gaze Estimation

【Quick Read】: This paper addresses the challenges of gaze estimation in unconstrained settings, particularly low-resolution in-the-wild images and the insufficient modeling of head-eye interaction in existing methods. The key of the solution is DHECA-SuperGaze, a deep-learning method that combines super-resolution (SR) with a dual head-eye cross-attention (DHECA) module: a dual-branch convolutional backbone processes eye and multiscale SR head images, and cross-attention enables bidirectional refinement between the extracted features, improving gaze prediction accuracy.

Link: https://arxiv.org/abs/2505.08426
Authors: Franko Šikić, Donik Vršnak, Sven Lončarić
Affiliations: University of Zagreb Faculty of Electrical Engineering and Computing
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Unconstrained gaze estimation is the process of determining where a subject is directing their visual attention in uncontrolled environments. Gaze estimation systems are important for a myriad of tasks such as driver distraction monitoring, exam proctoring, accessibility features in modern software, etc. However, these systems face challenges in real-world scenarios, partially due to the low resolution of in-the-wild images and partially due to insufficient modeling of head-eye interactions in current state-of-the-art (SOTA) methods. This paper introduces DHECA-SuperGaze, a deep learning-based method that advances gaze prediction through super-resolution (SR) and a dual head-eye cross-attention (DHECA) module. Our dual-branch convolutional backbone processes eye and multiscale SR head images, while the proposed DHECA module enables bidirectional feature refinement between the extracted visual features through cross-attention mechanisms. Furthermore, we identified critical annotation errors in one of the most diverse and widely used gaze estimation datasets, Gaze360, and rectified the mislabeled data. Performance evaluation on Gaze360 and GFIE datasets demonstrates superior within-dataset performance of the proposed method, reducing angular error (AE) by 0.48° (Gaze360) and 2.95° (GFIE) in static configurations, and 0.59° (Gaze360) and 3.00° (GFIE) in temporal settings compared to prior SOTA methods. Cross-dataset testing shows improvements in AE of more than 1.53° (Gaze360) and 3.99° (GFIE) in both static and temporal settings, validating the robust generalization properties of our approach.
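
A minimal PyTorch sketch of the bidirectional cross-attention idea described above, purely as an illustration: the shared embedding size, head count, and the residual/LayerNorm layout are assumptions, not details taken from the paper.

```python
import torch
import torch.nn as nn

class BidirectionalCrossAttention(nn.Module):
    def __init__(self, dim: int = 256, num_heads: int = 8):
        super().__init__()
        # One attention block per direction: eye attends to head, head attends to eye.
        self.eye_to_head = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.head_to_eye = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm_eye = nn.LayerNorm(dim)
        self.norm_head = nn.LayerNorm(dim)

    def forward(self, eye_feats, head_feats):
        # eye_feats: (B, N_e, dim), head_feats: (B, N_h, dim)
        eye_ref, _ = self.eye_to_head(query=eye_feats, key=head_feats, value=head_feats)
        head_ref, _ = self.head_to_eye(query=head_feats, key=eye_feats, value=eye_feats)
        # Residual connections keep the original unimodal features.
        return self.norm_eye(eye_feats + eye_ref), self.norm_head(head_feats + head_ref)

if __name__ == "__main__":
    eye = torch.randn(2, 49, 256)    # e.g., a flattened 7x7 eye feature map
    head = torch.randn(2, 196, 256)  # e.g., flattened multiscale head features
    e, h = BidirectionalCrossAttention()(eye, head)
    print(e.shape, h.shape)  # torch.Size([2, 49, 256]) torch.Size([2, 196, 256])
```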

[CV-36] DArFace: Deformation Aware Robustness for Low Quality Face Recognition

【Quick Read】: This paper targets the performance degradation of face recognition systems on low-quality face images, which are common in surveillance footage or standoff imaging and exhibit low resolution, motion blur, and various distortions, creating a substantial domain gap from the high-quality training data. The key of the solution is the DArFace framework, which adversarially integrates global transformations (e.g., rotation, translation) and local elastic deformations during training to simulate realistic low-quality conditions, and introduces a contrastive objective that enforces identity consistency across different deformed views, improving robustness to such degradations.

Link: https://arxiv.org/abs/2505.08423
Authors: Sadaf Gulshad, Abdullah Aldahlawi Thakaa
Affiliations: Advanced AI and Information Technology LLC; Thakaa LLC
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Facial recognition systems have achieved remarkable success by leveraging deep neural networks, advanced loss functions, and large-scale datasets. However, their performance often deteriorates in real-world scenarios involving low-quality facial images. Such degradations, common in surveillance footage or standoff imaging, include low resolution, motion blur, and various distortions, resulting in a substantial domain gap from the high-quality data typically used during training. While existing approaches attempt to address robustness by modifying network architectures or modeling global spatial transformations, they frequently overlook local, non-rigid deformations that are inherently present in real-world settings. In this work, we introduce DArFace, a Deformation-Aware robust Face recognition framework that enhances robustness to such degradations without requiring paired high- and low-quality training samples. Our method adversarially integrates both global transformations (e.g., rotation, translation) and local elastic deformations during training to simulate realistic low-quality conditions. Moreover, we introduce a contrastive objective to enforce identity consistency across different deformed views. Extensive evaluations on low-quality benchmarks including TinyFace, IJB-B, and IJB-C demonstrate that DArFace surpasses state-of-the-art methods, with significant gains attributed to the inclusion of local deformation modeling.
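
As a rough illustration of the local elastic deformations DArFace integrates during training, here is a sketch built on torch.nn.functional.grid_sample; the displacement strength and smoothing kernel size are assumed values, not the paper's settings.

```python
import torch
import torch.nn.functional as F

def elastic_deform(img: torch.Tensor, alpha: float = 0.05, kernel: int = 31) -> torch.Tensor:
    """img: (B, C, H, W) in [0, 1]. Returns a locally warped copy."""
    b, _, h, w = img.shape
    # Random displacement field, smoothed with average pooling to make it elastic.
    disp = torch.rand(b, 2, h, w) * 2 - 1
    disp = F.avg_pool2d(disp, kernel, stride=1, padding=kernel // 2) * alpha
    # Base sampling grid in the [-1, 1] coordinates expected by grid_sample.
    ys, xs = torch.meshgrid(torch.linspace(-1, 1, h), torch.linspace(-1, 1, w), indexing="ij")
    grid = torch.stack((xs, ys), dim=-1).unsqueeze(0).expand(b, -1, -1, -1)
    grid = grid + disp.permute(0, 2, 3, 1)  # add per-pixel (dx, dy) displacements
    return F.grid_sample(img, grid, align_corners=True, padding_mode="border")

faces = torch.rand(4, 3, 112, 112)
warped = elastic_deform(faces)
print(warped.shape)  # torch.Size([4, 3, 112, 112])
```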

[CV-37] STORYANCHORS: Generating Consistent Multi-Scene Story Frames for Long-Form Narratives

【Quick Read】: This paper aims to resolve the poor temporal consistency, weak character continuity, and unnatural scene transitions in multi-scene story frame generation. The key of the solution is the StoryAnchors framework, whose bidirectional story generator integrates past and future context to ensure temporal consistency, character continuity, and smooth scene transitions across the narrative. It further introduces Multi-Event Story Frame Labeling and Progressive Story Frame Training, strengthening the model's grasp of both the overarching narrative flow and event-level dynamics, and improving the quality and editability of the generated story frames.

Link: https://arxiv.org/abs/2505.08350
Authors: Bo Wang, Haoyang Huang, Zhiyin Lu, Fengyuan Liu, Guoqing Ma, Jianlong Yuan, Yuan Zhang, Nan Duan
Affiliations: Beijing Institute of Technology; StepFun; University of Science and Technology of China; Peking University
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:This paper introduces StoryAnchors, a unified framework for generating high-quality, multi-scene story frames with strong temporal consistency. The framework employs a bidirectional story generator that integrates both past and future contexts to ensure temporal consistency, character continuity, and smooth scene transitions throughout the narrative. Specific conditions are introduced to distinguish story frame generation from standard video synthesis, facilitating greater scene diversity and enhancing narrative richness. To further improve generation quality, StoryAnchors integrates Multi-Event Story Frame Labeling and Progressive Story Frame Training, enabling the model to capture both overarching narrative flow and event-level dynamics. This approach supports the creation of editable and expandable story frames, allowing for manual modifications and the generation of longer, more complex sequences. Extensive experiments show that StoryAnchors outperforms existing open-source models in key areas such as consistency, narrative coherence, and scene diversity. Its performance in narrative consistency and story richness is also on par with GPT-4o. Ultimately, StoryAnchors pushes the boundaries of story-driven frame generation, offering a scalable, flexible, and highly editable foundation for future research.

[CV-38] FAD: Frequency Adaptation and Diversion for Cross-domain Few-shot Learning

【Quick Read】: This paper addresses the difficulty of cross-domain few-shot learning (CD-FSL), where models struggle to generalize under significant distribution shifts. Existing methods improve adaptability with lightweight task-specific modules but operate only in the spatial domain, ignoring critical frequency-domain differences. The core of the proposed Frequency Adaptation and Diversion (FAD) framework is a frequency-aware adaptation mechanism: intermediate features are transformed into the frequency domain with the discrete Fourier transform (DFT), partitioned into low, mid, and high bands via radial masks, and then adapted band by band with tailored convolutional branches, enabling more precise and disentangled cross-frequency adaptation and better generalization in CD-FSL.

Link: https://arxiv.org/abs/2505.08349
Authors: Ruixiao Shi, Fu Feng, Yucheng Xie, Jing Wang, Xin Geng
Affiliations: Southeast University; Ministry of Education
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:Cross-domain few-shot learning (CD-FSL) requires models to generalize from limited labeled samples under significant distribution shifts. While recent methods enhance adaptability through lightweight task-specific modules, they operate solely in the spatial domain and overlook frequency-specific variations that are often critical for robust transfer. We observe that spatially similar images across domains can differ substantially in their spectral representations, with low and high frequencies capturing complementary semantic information at coarse and fine levels. This indicates that uniform spatial adaptation may overlook these spectral distinctions, thus constraining generalization. To address this, we introduce Frequency Adaptation and Diversion (FAD), a frequency-aware framework that explicitly models and modulates spectral components. At its core is the Frequency Diversion Adapter, which transforms intermediate features into the frequency domain using the discrete Fourier transform (DFT), partitions them into low, mid, and high-frequency bands via radial masks, and reconstructs each band using inverse DFT (IDFT). Each frequency band is then adapted using a dedicated convolutional branch with a kernel size tailored to its spectral scale, enabling targeted and disentangled adaptation across frequencies. Extensive experiments on the Meta-Dataset benchmark demonstrate that FAD consistently outperforms state-of-the-art methods on both seen and unseen domains, validating the utility of frequency-domain representations and band-wise adaptation for improving generalization in CD-FSL.
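
A compact sketch of the frequency-diversion step described above (DFT, radial band masks, per-band convolution); the band radii and kernel sizes here are illustrative guesses rather than the paper's configuration.

```python
import torch
import torch.nn as nn

def radial_masks(h, w, r1=0.15, r2=0.4):
    ys = torch.linspace(-0.5, 0.5, h).view(-1, 1)
    xs = torch.linspace(-0.5, 0.5, w).view(1, -1)
    r = torch.sqrt(ys**2 + xs**2)
    return (r <= r1).float(), ((r > r1) & (r <= r2)).float(), (r > r2).float()

class FrequencyDiversion(nn.Module):
    def __init__(self, channels=64):
        super().__init__()
        # Larger kernels for coarse (low-frequency) content, smaller for fine detail.
        self.branches = nn.ModuleList([
            nn.Conv2d(channels, channels, k, padding=k // 2) for k in (7, 5, 3)
        ])

    def forward(self, x):
        b, c, h, w = x.shape
        spec = torch.fft.fftshift(torch.fft.fft2(x), dim=(-2, -1))
        out = 0
        for mask, branch in zip(radial_masks(h, w), self.branches):
            band = torch.fft.ifft2(torch.fft.ifftshift(spec * mask, dim=(-2, -1))).real
            out = out + branch(band)  # dedicated adaptation per frequency band
        return out

x = torch.randn(2, 64, 32, 32)
print(FrequencyDiversion()(x).shape)  # torch.Size([2, 64, 32, 32])
```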

[CV-39] A computer vision-based model for occupancy detection using low-resolution thermal images

【Quick Read】: This paper addresses both the energy and operational inefficiency of traditional heating, ventilation, and air conditioning (HVAC) systems that ignore occupancy, and the privacy risks posed by RGB-based occupancy detection. The key of the solution is to perform occupancy detection from low-resolution thermal images with computer vision (CV) techniques, fine-tuning YOLOv5 via transfer learning, thereby achieving efficient occupancy recognition while preserving privacy.

Link: https://arxiv.org/abs/2505.08336
Authors: Xue Cui, Vincent Gbouna Zakka, Minhyun Lee
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:Occupancy plays an essential role in influencing the energy consumption and operation of heating, ventilation, and air conditioning (HVAC) systems. Traditional HVAC systems typically operate on fixed schedules without considering occupancy. Advanced occupant-centric control (OCC) adopts occupancy status in regulating HVAC operations. RGB images combined with computer vision (CV) techniques are widely used for occupancy detection; however, the detailed facial and body features they capture raise significant privacy concerns. Low-resolution thermal images offer a non-invasive solution that mitigates privacy issues. The study developed an occupancy detection model utilizing low-resolution thermal images and CV techniques, where transfer learning was applied to fine-tune the You Only Look Once version 5 (YOLOv5) model. The developed model ultimately achieved satisfactory performance, with precision, recall, mAP50, and mAP50-95 values approaching 1.000. The contributions of this model lie not only in mitigating privacy concerns but also in reducing computing resource demands.

[CV-40] An incremental algorithm for non-convex AI-enhanced medical image processing

【Quick Read】: This paper targets the difficulty of solving non-convex regularized inverse problems, which are challenging because of their complex optimization landscapes and multiple local minima. The key is the incDG framework, which couples deep learning with model-based incremental optimization to efficiently approximate the \ell_0-optimal solution of imaging inverse problems. At its core, incDG uses a deep neural network to produce effective initial guesses and then refines the reconstruction through regularized incremental iterations, combining the efficiency of AI tools with the theoretical guarantees of model-based optimization to ensure robustness and stability.

Link: https://arxiv.org/abs/2505.08324
Authors: Elena Morotti
Affiliations: University of Bologna
Subjects: Computer Vision and Pattern Recognition (cs.CV); Numerical Analysis (math.NA)
Comments:

Abstract:Solving non-convex regularized inverse problems is challenging due to their complex optimization landscapes and multiple local minima. However, these models remain widely studied as they often yield high-quality, task-oriented solutions, particularly in medical imaging, where the goal is to enhance clinically relevant features rather than merely minimizing global error. We propose incDG, a hybrid framework that integrates deep learning with incremental model-based optimization to efficiently approximate the \ell_0-optimal solution of imaging inverse problems. Built on the Deep Guess strategy, incDG exploits a deep neural network to generate effective initializations for a non-convex variational solver, which refines the reconstruction through regularized incremental iterations. This design combines the efficiency of Artificial Intelligence (AI) tools with the theoretical guarantees of model-based optimization, ensuring robustness and stability. We validate incDG on TpV-regularized optimization tasks, demonstrating its effectiveness in medical image deblurring and tomographic reconstruction across diverse datasets, including synthetic images, brain CT slices, and chest-abdomen scans. Results show that incDG outperforms both conventional iterative solvers and deep learning-based methods, achieving superior accuracy and stability. Moreover, we confirm that training incDG without ground truth does not significantly degrade performance, making it a practical and powerful tool for solving non-convex inverse problems in imaging and beyond.

[CV-41] Improving Unsupervised Task-driven Models of Ventral Visual Stream via Relative Position Predictivity

【Quick Read】: This paper questions a limitation of current contrastive-learning-based unsupervised task-driven models of the ventral visual stream (VVS): they focus only on object recognition and ignore other functions the VVS may serve, such as relative position (RP) prediction. The key of the solution is to treat RP prediction as an additional function of the VVS and integrate RP learning with contrastive learning into a new unsupervised task-driven method that models the VVS in a more biologically realistic way. Experiments show the method improves downstream object recognition while enhancing RP predictivity, and that RP predictivity generally improves model-brain similarity.

Link: https://arxiv.org/abs/2505.08316
Authors: Dazhong Rong, Hao Dong, Xing Gao, Jiyu Wei, Di Hong, Yaoyao Hao, Qinming He, Yueming Wang
Affiliations: Zhejiang University; Rutgers University; University of Electronic Science and Technology of China
Subjects: Computational Engineering, Finance, and Science (cs.CE); Computer Vision and Pattern Recognition (cs.CV)
Comments: This paper has been accepted for full publication at CogSci 2025 (this https URL)

Abstract:Based on the concept that ventral visual stream (VVS) mainly functions for object recognition, current unsupervised task-driven methods model VVS by contrastive learning, and have achieved good brain similarity. However, we believe functions of VVS extend beyond just object recognition. In this paper, we introduce an additional function involving VVS, named relative position (RP) prediction. We first theoretically explain why contrastive learning may be unable to yield the model capability of RP prediction. Motivated by this, we subsequently integrate RP learning with contrastive learning, and propose a new unsupervised task-driven method to model VVS, which is more in line with biological reality. We conduct extensive experiments, demonstrating that: (i) our method significantly improves downstream performance of object recognition while enhancing RP predictivity; (ii) RP predictivity generally improves the model brain similarity. Our results provide strong evidence for the involvement of VVS in location perception (especially RP prediction) from a computational perspective.

[CV-42] Knowledge-Informed Deep Learning for Irrigation Type Mapping from Remote Sensing IJCAI-25

【Quick Read】: This paper targets accurate mapping of agricultural irrigation methods, which is crucial for sustainable agriculture and food systems; models relying solely on spectral features of satellite imagery fall short due to complex agricultural landscapes and limited training data. The key is Knowledge-Informed Irrigation Mapping (KIIM), a Swin-Transformer-based method with (i) a specialized projection matrix encoding crop-to-irrigation probabilities, (ii) a spatial attention map separating agricultural from non-agricultural land, (iii) bi-directional cross-attention fusing complementary information across modalities, and (iv) a weighted ensemble combining predictions from imagery and crop information. A two-phase transfer-learning approach is also proposed to improve cross-state irrigation mapping.

Link: https://arxiv.org/abs/2505.08302
Authors: Oishee Bintey Hoque, Nibir Chandra Mandal, Abhijin Adiga, Samarth Swarup, Sayjro Kossi Nouwakpo, Amanda Wilson, Madhav Marathe
Affiliations: University of Virginia; Biocomplexity Institute; US Department of Agriculture
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: The full version of the paper will appear in the Proceedings of the Thirty-Third International Joint Conference on Artificial Intelligence (IJCAI-25), Special Track on AI for Good

Abstract:Accurate mapping of irrigation methods is crucial for sustainable agricultural practices and food systems. However, existing models that rely solely on spectral features from satellite imagery are ineffective due to the complexity of agricultural landscapes and limited training data, making this a challenging problem. We present Knowledge-Informed Irrigation Mapping (KIIM), a novel Swin-Transformer based approach that uses (i) a specialized projection matrix to encode crop to irrigation probability, (ii) a spatial attention map to identify agricultural lands from non-agricultural lands, (iii) bi-directional cross-attention to focus complementary information from different modalities, and (iv) a weighted ensemble for combining predictions from images and crop information. Our experimentation on five states in the US shows up to 22.9% (IoU) improvement over baseline with a 71.4% (IoU) improvement for hard-to-classify drip irrigation. In addition, we propose a two-phase transfer learning approach to enhance cross-state irrigation mapping, achieving a 51% IoU boost in a state with limited labeled data. The ability to achieve baseline performance with only 40% of the training data highlights its efficiency, reducing the dependency on extensive manual labeling efforts and making large-scale, automated irrigation mapping more feasible and cost-effective.

[CV-43] Efficient Unstructured Pruning of Mamba State-Space Models for Resource-Constrained Environments

【Quick Read】: This paper addresses the deployment challenge posed by the large parameter counts of state-space models (SSMs), particularly the Mamba architecture, in resource-constrained environments. The key is an unstructured pruning framework tailored for Mamba with three core innovations: (1) gradient-aware magnitude pruning that combines weight magnitude and gradient information to identify less critical parameters; (2) an iterative pruning schedule that gradually increases sparsity to maintain model stability; and (3) a global pruning strategy that optimizes parameter allocation across the whole model. The method achieves up to 70% parameter reduction while retaining over 95% of the original performance.

Link: https://arxiv.org/abs/2505.08299
Authors: Ibne Farabi Shihab, Sanjeda Akter, Anuj Sharma
Affiliations: Iowa State University
Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:State-space models (SSMs), particularly the Mamba architecture, have emerged as powerful alternatives to Transformers for sequence modeling, offering linear-time complexity and competitive performance across diverse tasks. However, their large parameter counts pose significant challenges for deployment in resource-constrained environments. We propose a novel unstructured pruning framework tailored for Mamba models that achieves up to 70% parameter reduction while retaining over 95% of the original performance. Our approach integrates three key innovations: (1) a gradient-aware magnitude pruning technique that combines weight magnitude and gradient information to identify less critical parameters, (2) an iterative pruning schedule that gradually increases sparsity to maintain model stability, and (3) a global pruning strategy that optimizes parameter allocation across the entire model. Through extensive experiments on WikiText-103, Long Range Arena, and ETT time-series benchmarks, we demonstrate significant efficiency gains with minimal performance degradation. Our analysis of pruning effects on Mamba’s components reveals critical insights into the architecture’s redundancy and robustness, enabling practical deployment in resource-constrained settings while broadening Mamba’s applicability.
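
A compact sketch of the three pruning ingredients listed above (gradient-aware importance scores, a global threshold, a gradually increasing sparsity schedule); the epsilon and the cubic ramp are illustrative choices, not the paper's hyperparameters.

```python
import torch

def prune_step(model: torch.nn.Module, target_sparsity: float, eps: float = 1e-8):
    """Zero out the globally least important weights. Call after a backward pass."""
    scores, tensors = [], []
    for p in model.parameters():
        if p.grad is None:
            continue
        # Importance combines weight magnitude with gradient magnitude.
        s = p.detach().abs() * (p.grad.detach().abs() + eps)
        scores.append(s.flatten())
        tensors.append((p, s))
    all_scores = torch.cat(scores)
    k = int(target_sparsity * all_scores.numel())
    threshold = torch.kthvalue(all_scores, max(k, 1)).values  # global cutoff
    for p, s in tensors:
        p.data.mul_((s > threshold).float())  # mask low-importance weights

def sparsity_schedule(step: int, total_steps: int, final: float = 0.7) -> float:
    # Cubic ramp: prune gently at first, harder later, for training stability.
    return final * (step / total_steps) ** 3
```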

[CV-44] FauForensics: Boosting Audio-Visual Deepfake Detection with Facial Action Units

【Quick Read】: This paper targets the weak performance of multimodal deepfake detection caused by inadequate handling of heterogeneous modality features and poor cross-dataset generalization. The key is introducing biologically invariant facial action units (FAUs), quantitative descriptors of facial muscle activity linked to emotion physiology, as forgery-resistant representations that reduce domain dependency while capturing subtle dynamics that synthesis tends to disrupt. In addition, instead of comparing entire clips, a dedicated fusion module with learnable cross-modal queries computes fine-grained frame-wise audio-visual similarities, dynamically aligning temporal-spatial lip-audio relations while mitigating multimodal feature heterogeneity.

Link: https://arxiv.org/abs/2505.08294
Authors: Jian Wang, Baoyuan Wu, Li Liu, Qingshan Liu
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:The rapid evolution of generative AI has increased the threat of realistic audio-visual deepfakes, demanding robust detection methods. Existing solutions primarily address unimodal (audio or visual) forgeries but struggle with multimodal manipulations due to inadequate handling of heterogeneous modality features and poor generalization across datasets. To this end, we propose a novel framework called FauForensics by introducing biologically invariant facial action units (FAUs), which is a quantitative descriptor of facial muscle activity linked to emotion physiology. It serves as forgery-resistant representations that reduce domain dependency while capturing subtle dynamics often disrupted in synthetic content. Besides, instead of comparing entire video clips as in prior works, our method computes fine-grained frame-wise audiovisual similarities via a dedicated fusion module augmented with learnable cross-modal queries. It dynamically aligns temporal-spatial lip-audio relationships while mitigating multi-modal feature heterogeneity issues. Experiments on FakeAVCeleb and LAV-DF show state-of-the-art (SOTA) performance and superior cross-dataset generalizability, with an average improvement of up to 4.83% over existing methods.

[CV-45] M3G: Multi-Granular Gesture Generator for Audio-Driven Full-Body Human Motion Synthesis NIPS2025

【Quick Read】: This paper addresses the limited expressiveness and naturalness of generating full-body human gestures (face, body, hands, and global movements) from audio, which arises because fixed-granularity gesture tokens cannot accommodate the varying number of frames that different gesture patterns require. The key of the proposed Multi-Granular Gesture Generator (M3G) is the Multi-Granular VQ-VAE (MGVQ-VAE), which tokenizes motion patterns and reconstructs motion sequences at different temporal granularities, paired with a multi-granular token predictor that extracts multi-granular information from audio and predicts the corresponding motion tokens, yielding more natural and expressive full-body gestures.

Link: https://arxiv.org/abs/2505.08293
Authors: Zhizhuo Yin, Yuk Hang Tsui, Pan Hui
Affiliations: The Hong Kong University of Science and Technology (Guangzhou); The Hong Kong University of Science and Technology
Subjects: Graphics (cs.GR); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Comments: 9 pages, 4 figures, submitted to NIPS 2025

Abstract:Generating full-body human gestures encompassing face, body, hands, and global movements from audio is a valuable yet challenging task in virtual avatar creation. Previous systems focused on tokenizing the human gestures framewisely and predicting the tokens of each frame from the input audio. However, one observation is that the number of frames required for a complete expressive human gesture, defined as granularity, varies among different human gesture patterns. Existing systems fail to model these gesture patterns due to the fixed granularity of their gesture tokens. To solve this problem, we propose a novel framework named Multi-Granular Gesture Generator (M3G) for audio-driven holistic gesture generation. In M3G, we propose a novel Multi-Granular VQ-VAE (MGVQ-VAE) to tokenize motion patterns and reconstruct motion sequences from different temporal granularities. Subsequently, we proposed a multi-granular token predictor that extracts multi-granular information from audio and predicts the corresponding motion tokens. Then M3G reconstructs the human gestures from the predicted tokens using the MGVQ-VAE. Both objective and subjective experiments demonstrate that our proposed M3G framework outperforms the state-of-the-art methods in terms of generating natural and expressive full-body human gestures.

[CV-46] Disruptive Transformation of Artworks in Master-Disciple Relationships: The Case of Ukiyo-e Artworks

【Quick Read】: This paper notes that creativity in Eastern paintings, especially Ukiyo-e, lacks systematic quantitative study, whereas Western paintings have been comprehensively analyzed from many angles with large databases and machine learning. The key of the solution is to quantitatively assess the creativity of Ukiyo-e artworks and artists using network-based measures of creativity, analyzing 11,000 high-resolution images to reveal how the creativity of Ukiyo-e changed as the culture matured and how its styles evolved.

Link: https://arxiv.org/abs/2505.08284
Authors: Honna Shinichi, Akira Matsui
Affiliations: Japan Advanced Institute of Science and Technology; Kobe University
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:

Abstract:Artwork research has long relied on human sensibility and subjective judgment, but recent developments in machine learning have enabled the quantitative assessment of features that humans could not discover. In Western paintings, comprehensive analyses have been conducted from various perspectives in conjunction with large databases, but such extensive analysis has not been sufficiently conducted for Eastern paintings. We therefore focus on Ukiyo-e, a traditional Japanese art form, as a case study of Eastern paintings, and conduct a quantitative analysis of creativity in works of art using 11,000 high-resolution images. This involves using the concept of calculating creativity from networks to analyze both the creativity of the artwork and that of the artists. As a result, in terms of Ukiyo-e as a whole, it was found that the creativity of its appearance has declined with the maturation of the culture, but in terms of style, it has become more segmented and has maintained a high level of creativity. This not only provides new insights into the study of Ukiyo-e but also shows how Ukiyo-e has evolved within the ongoing cultural history, playing a culturally significant role in the analysis of Eastern art.

[CV-47] Decoupled Multimodal Prototypes for Visual Recognition with Missing Modalities

【Quick Read】: This paper addresses the performance degradation of multimodal learning when modalities are missing, a common situation in real-world applications where the availability of all modalities cannot be guaranteed. The key of the solution is a decoupled prototype-based output head that uses missing-case-aware class-wise prototypes tailored to each individual modality, dynamically adapting to different missing-modality scenarios and integrating seamlessly with existing prompt-based methods.

Link: https://arxiv.org/abs/2505.08283
Authors: Jueqing Lu, Yuanyuan Qi, Xiaohao Yang, Shujie Zhou, Lan Du
Affiliations: Monash University
Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Multimodal learning enhances deep learning models by enabling them to perceive and understand information from multiple data modalities, such as visual and textual inputs. However, most existing approaches assume the availability of all modalities, an assumption that often fails in real-world applications. Recent works have introduced learnable missing-case-aware prompts to mitigate performance degradation caused by missing modalities while reducing the need for extensive model fine-tuning. Building upon the effectiveness of missing-case-aware handling for missing modalities, we propose a novel decoupled prototype-based output head, which leverages missing-case-aware class-wise prototypes tailored for each individual modality. This approach dynamically adapts to different missing modality scenarios and can be seamlessly integrated with existing prompt-based methods. Extensive experiments demonstrate that our proposed output head significantly improves performance across a wide range of missing-modality scenarios and varying missing rates.
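
A minimal sketch of a decoupled, per-modality prototype head in the spirit of this entry: prediction averages cosine similarities over only the modalities actually present. Shapes, modality names, and the plain averaging rule are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PrototypeHead(nn.Module):
    def __init__(self, num_classes: int, dim: int, modalities=("image", "text")):
        super().__init__()
        # One set of class-wise prototypes per modality.
        self.protos = nn.ParameterDict({
            m: nn.Parameter(torch.randn(num_classes, dim)) for m in modalities
        })

    def forward(self, feats: dict) -> torch.Tensor:
        # feats maps modality name -> (B, dim) features; missing modalities are absent.
        logits = []
        for m, f in feats.items():
            sims = F.normalize(f, dim=-1) @ F.normalize(self.protos[m], dim=-1).t()
            logits.append(sims)
        return torch.stack(logits).mean(0)  # average over available modalities

head = PrototypeHead(num_classes=10, dim=64)
out = head({"image": torch.randn(4, 64)})  # text modality missing for this batch
print(out.shape)  # torch.Size([4, 10])
```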

[CV-48] Ultra Lowrate Image Compression with Semantic Residual Coding and Compression-aware Diffusion

【Quick Read】: This paper targets the suboptimal reconstruction fidelity and coding efficiency of existing multimodal-large-model-based image compression frameworks, which stem from the fragmented integration of semantic retrieval, latent compression, and generative models. The key is ResULIC, a residual-guided ultra-low-bitrate image compression method that injects residual signals into both semantic retrieval and the diffusion-based generation process. Specifically, Semantic Residual Coding (SRC) captures the semantic disparity between the original image and its compressed latent representation, a perceptual fidelity optimizer improves reconstruction quality, and the Compression-aware Diffusion Model (CDM) establishes an optimal alignment between bitrates and diffusion time steps, strengthening the synergy between compression and reconstruction.

Link: https://arxiv.org/abs/2505.08281
Authors: Anle Ke, Xu Zhang, Tong Chen, Ming Lu, Chao Zhou, Jiawen Gu, Zhan Ma
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
Comments:

Abstract:Existing multimodal large model-based image compression frameworks often rely on a fragmented integration of semantic retrieval, latent compression, and generative models, resulting in suboptimal performance in both reconstruction fidelity and coding efficiency. To address these challenges, we propose a residual-guided ultra lowrate image compression named ResULIC, which incorporates residual signals into both semantic retrieval and the diffusion-based generation process. Specifically, we introduce Semantic Residual Coding (SRC) to capture the semantic disparity between the original image and its compressed latent representation. A perceptual fidelity optimizer is further applied for superior reconstruction quality. Additionally, we present the Compression-aware Diffusion Model (CDM), which establishes an optimal alignment between bitrates and diffusion time steps, improving compression-reconstruction synergy. Extensive experiments demonstrate the effectiveness of ResULIC, achieving superior objective and subjective performance compared to state-of-the-art diffusion-based methods, with -80.7% and -66.3% BD-rate savings in terms of LPIPS and FID. The project page is available at this https URL.

[CV-49] IrrMap: A Large-Scale Comprehensive Dataset for Irrigation Method Mapping

【Quick Read】: This paper addresses the lack of a large-scale, multi-source, high-quality dataset for irrigation method mapping, which has limited ML-based irrigation analysis and modeling. The key is IrrMap, a dataset of 1.1 million patches combining multi-resolution satellite imagery (LandSat and Sentinel) with key auxiliary data such as crop type, land use, and vegetation indices, delivered as standardized 224x224 GeoTIFF patches with multimodal inputs, principled train-test splits, and accompanying dataloaders for training and benchmarking deep-learning models for irrigation mapping. IrrMap also ships with a complete data-generation pipeline, so researchers can extend it to new regions or adapt it with minimal effort to similar agricultural and geospatial analysis tasks.

Link: https://arxiv.org/abs/2505.08273
Authors: Nibir Chandra Mandal, Oishee Bintey Hoque, Abhijin Adiga, Samarth Swarup, Mandy Wilson, Lu Feng, Yangfeng Ji, Miaomiao Zhang, Geoffrey Fox, Madhav Marathe
Affiliations: University of Virginia; Biocomplexity Institute
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:We introduce IrrMap, the first large-scale dataset (1.1 million patches) for irrigation method mapping across regions. IrrMap consists of multi-resolution satellite imagery from LandSat and Sentinel, along with key auxiliary data such as crop type, land use, and vegetation indices. The dataset spans 1,687,899 farms and 14,117,330 acres across multiple western U.S. states from 2013 to 2023, providing a rich and diverse foundation for irrigation analysis and ensuring geospatial alignment and quality control. The dataset is ML-ready, with standardized 224x224 GeoTIFF patches, multiple input modalities, carefully chosen train-test splits, and accompanying dataloaders for seamless deep-learning model training and benchmarking in irrigation mapping. The dataset is also accompanied by a complete pipeline for dataset generation, enabling researchers to extend IrrMap to new regions for irrigation data collection or adapt it with minimal effort for other similar applications in agricultural and geospatial analysis. We also analyze the irrigation method distribution across crop groups, spatial irrigation patterns (using Shannon diversity indices), and irrigated area variations for both LandSat and Sentinel, providing insights into regional and resolution-based differences. To promote further exploration, we openly release IrrMap, along with the derived datasets, benchmark models, and pipeline code, through a GitHub repository: this https URL and Data repository: this https URL, providing comprehensive documentation and implementation details.
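
For readers who want to feed IrrMap-style 224x224 GeoTIFF patches into a model, a minimal rasterio-based Dataset sketch follows; the file layout, label encoding, and per-patch normalization are assumptions, not the dataset's published loaders.

```python
import glob
import numpy as np
import rasterio
import torch
from torch.utils.data import Dataset

class GeoTiffPatches(Dataset):
    def __init__(self, pattern: str, labels: dict):
        self.paths = sorted(glob.glob(pattern))
        self.labels = labels  # path -> irrigation-method class id (hypothetical)

    def __len__(self):
        return len(self.paths)

    def __getitem__(self, i):
        path = self.paths[i]
        with rasterio.open(path) as src:
            patch = src.read().astype(np.float32)  # (bands, 224, 224)
        patch /= max(patch.max(), 1e-6)            # crude per-patch normalization
        return torch.from_numpy(patch), self.labels[path]

# Usage sketch: ds = GeoTiffPatches("patches/*.tif", labels_by_path)
```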

[CV-50] Open the Eyes of MPNN: Vision Enhances MPNN in Link Prediction ICML2025

【Quick Read】: This paper argues that message-passing graph neural networks (MPNNs) used for link prediction lack visual-structural awareness, overlooking the potential of visual perception for understanding graph structure. The key is the Graph Vision Network (GVN), an effective framework (with a more efficient variant, E-GVN) that equips MPNNs with vision structural awareness. Experiments show GVN consistently benefits from the vision enhancement across seven link prediction datasets, including challenging large-scale graphs, and achieves new state-of-the-art (SOTA) results.

Link: https://arxiv.org/abs/2505.08266
Authors: Yanbin Wei, Xuehao Wang, Zhan Zhuang, Yang Chen, Shuhao Chen, Yulong Zhang, Yu Zhang, James Kwok
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: ICML 2025

Abstract:Message-passing graph neural networks (MPNNs) and structural features (SFs) are cornerstones for the link prediction task. However, as a common and intuitive mode of understanding, the potential of visual perception has been overlooked in the MPNN community. For the first time, we equip MPNNs with vision structural awareness by proposing an effective framework called Graph Vision Network (GVN), along with a more efficient variant (E-GVN). Extensive empirical results demonstrate that with the proposed frameworks, GVN consistently benefits from the vision enhancement across seven link prediction datasets, including challenging large-scale graphs. Such improvements are compatible with existing state-of-the-art (SOTA) methods and GVNs achieve new SOTA results, thereby underscoring a promising novel direction for link prediction.

[CV-51] Few-shot Novel Category Discovery

【Quick Read】: This paper addresses the limitations of conventional Novel Category Discovery (NCD) in realistic settings, in particular how to improve adaptability and generalization when only a few labeled samples exist for some of the novel categories. The key is a new setting, Few-Shot Novel Category Discovery (FSNCD), in which an agent, leveraging knowledge learned from a handful of support samples, can flexibly switch between recognizing known classes and clustering novel classes as the number of query samples grows. Semi-supervised Hierarchical Clustering (SHC) and Uncertainty-aware K-means Clustering (UKC) are proposed to examine the model's reasoning capabilities, achieving leading performance across five commonly used datasets.

Link: https://arxiv.org/abs/2505.08260
Authors: Chunming Li, Shidong Wang, Haofeng Zhang
Affiliations: Nanjing University of Science and Technology; Newcastle University
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:The transductive-learning paradigm adopted by the recently proposed Novel Category Discovery (NCD) hinders its application in more real-world scenarios. In fact, a few labeled samples for some of the new categories can substantially alleviate this burden, which matches how easily people can label a handful of new-category examples. Therefore, this paper presents a new setting in which a trained agent is able to flexibly switch between the tasks of identifying examples of known (labelled) classes and clustering novel (completely unlabeled) classes as the number of query examples increases, by leveraging knowledge learned from only a handful of support examples. Drawing inspiration from the discovery of novel categories using prior-based clustering algorithms, we introduce a novel framework that further relaxes its assumptions to the real-world open set level by unifying the concept of model adaptability in few-shot learning. We refer to this setting as Few-Shot Novel Category Discovery (FSNCD) and propose Semi-supervised Hierarchical Clustering (SHC) and Uncertainty-aware K-means Clustering (UKC) to examine the model's reasoning capabilities. Extensive experiments and detailed analysis on five commonly used datasets demonstrate that our methods can achieve leading performance levels across different task settings and scenarios.

[CV-52] CNN and ViT Efficiency Study on Tiny ImageNet and DermaMNIST Datasets

【Quick Read】: This paper studies the performance-efficiency trade-off between convolutional and transformer-based architectures on medical and general-purpose image classification benchmarks, aiming to cut inference latency and model complexity for deployment in resource-constrained environments. The key is a fine-tuning strategy applied to four Vision Transformer variants (Tiny, Small, Base, Large) with systematic hyperparameter variation, so that appropriately tuned models match or exceed the ResNet-18 baseline while running faster with fewer parameters at acceptable accuracy.

Link: https://arxiv.org/abs/2505.08259
Authors: Aidar Amangeldi, Angsar Taigonyrov, Muhammad Huzaid Jawad, Chinedu Emmanuel Mbonu
Affiliations: Nazarbayev University
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:This study evaluates the trade-offs between convolutional and transformer-based architectures on both medical and general-purpose image classification benchmarks. We use ResNet-18 as our baseline and introduce a fine-tuning strategy applied to four Vision Transformer variants (Tiny, Small, Base, Large) on DermatologyMNIST and TinyImageNet. Our goal is to reduce inference latency and model complexity with acceptable accuracy degradation. Through systematic hyperparameter variations, we demonstrate that appropriately fine-tuned Vision Transformers can match or exceed the baseline’s performance, achieve faster inference, and operate with fewer parameters, highlighting their viability for deployment in resource-constrained environments.
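
The kind of accuracy-versus-efficiency comparison run in this study can be reproduced roughly as follows with timm; the specific model variants and the 7-class head (DermaMNIST has 7 classes) are assumptions on our part, not the paper's exact setup.

```python
import time
import timm
import torch

variants = ["vit_tiny_patch16_224", "vit_small_patch16_224",
            "vit_base_patch16_224", "vit_large_patch16_224"]
x = torch.randn(1, 3, 224, 224)
for name in variants:
    model = timm.create_model(name, pretrained=False, num_classes=7).eval()
    params = sum(p.numel() for p in model.parameters()) / 1e6
    with torch.no_grad():
        t0 = time.perf_counter()
        for _ in range(10):
            model(x)
        dt = (time.perf_counter() - t0) / 10
    print(f"{name}: {params:.1f}M params, {dt * 1000:.1f} ms/img (CPU)")
```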

[CV-53] Where the Devil Hides: Deepfake Detectors Can No Longer Be Trusted CVPR2025

【Quick Read】: This paper investigates backdoors injected into deepfake detectors when third-party training datasets are maliciously poisoned, which makes detectors behave abnormally on inputs containing specific triggers and undermines their trustworthiness. The key of the solution is a trigger generator that synthesizes passcode-controlled, semantic-suppression, adaptive, and invisible trigger patterns, ensuring both the stealthiness and the effectiveness of the triggers.

Link: https://arxiv.org/abs/2505.08255
Authors: Shuaiwei Yuan, Junyu Dong, Yuezun Li
Affiliations: Ocean University of China
Subjects: Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV)
Comments: CVPR 2025

Abstract:With the advancement of AI generative techniques, Deepfake faces have become incredibly realistic and nearly indistinguishable to the human eye. To counter this, Deepfake detectors have been developed as reliable tools for assessing face authenticity. These detectors are typically developed on Deep Neural Networks (DNNs) and trained using third-party datasets. However, this protocol raises a new security risk that can seriously undermine the trustfulness of Deepfake detectors: once the third-party data providers insert poisoned (corrupted) data maliciously, Deepfake detectors trained on these datasets will have "backdoors" injected that cause abnormal behavior when presented with samples containing specific triggers. This is a practical concern, as third-party providers may distribute or sell these triggers to malicious users, allowing them to manipulate detector performance and escape accountability. This paper investigates this risk in depth and describes a solution to stealthily infect Deepfake detectors. Specifically, we develop a trigger generator that can synthesize passcode-controlled, semantic-suppression, adaptive, and invisible trigger patterns, ensuring both the stealthiness and effectiveness of these triggers. Then we discuss two poisoning scenarios, dirty-label poisoning and clean-label poisoning, to accomplish the injection of backdoors. Extensive experiments demonstrate the effectiveness, stealthiness, and practicality of our method compared to several baselines.

[CV-54] Identifying Memorization of Diffusion Models through p-Laplace Analysis

【Quick Read】: This paper tackles the identification of training-data memorization in generative models, i.e., detecting whether a generative model reproduces specific features of its training data in its outputs. The key is to leverage the estimated score function to compute higher-order differential operators, namely p-Laplace operators, and to use them to identify key features of the probability landscape, enabling the detection of memorized training data.

Link: https://arxiv.org/abs/2505.08246
Authors: Jonathan Brokman, Amit Giloni, Omer Hofman, Roman Vainshtein, Hisashi Kojima, Guy Gilboa
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Numerical Analysis (math.NA)
Comments: To be published in SSVM 2025 (proceedings of the 10th International Conference on Scale Space and Variational Methods in Computer Vision)

Abstract:Diffusion models, today’s leading image generative models, estimate the score function, i.e. the gradient of the log probability of (perturbed) data samples, without direct access to the underlying probability distribution. This work investigates whether the estimated score function can be leveraged to compute higher-order differentials, namely p-Laplace operators. We show here these operators can be employed to identify memorized training data. We propose a numerical p-Laplace approximation based on the learned score functions, showing its effectiveness in identifying key features of the probability landscape. We analyze the structured case of Gaussian mixture models, and demonstrate the results carry over to image generative models, where memorization identification based on the p-Laplace operator is performed for the first time.

[CV-55] Congenital Heart Disease recognition using Deep Learning/Transformer models

【Quick Read】: This paper addresses the high infant morbidity and mortality of congenital heart disease (CHD), in particular the frequent false negatives of non-invasive screening methods. The key of the solution is a dual-modality (sound and image) deep-learning approach whose automatic feature extraction helps doctors detect CHD more effectively and accurately.

Link: https://arxiv.org/abs/2505.08242
Authors: Aidar Amangeldi, Vladislav Yarovenko, Angsar Taigonyrov
Affiliations: NU (Nazarbayev University)
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Congenital Heart Disease (CHD) remains a leading cause of infant morbidity and mortality, yet non-invasive screening methods often yield false negatives. Deep learning models, with their ability to automatically extract features, can assist doctors in detecting CHD more effectively. In this work, we investigate the use of dual-modality (sound and image) deep learning methods for CHD diagnosis. We achieve 73.9% accuracy on the ZCHSound dataset and 80.72% accuracy on the DICOM Chest X-ray dataset.

[CV-56] ACT-R: Adaptive Camera Trajectories for 3D Reconstruction from Single Image

【Quick Read】: This paper aims to improve occlusion revelation and 3D consistency in single-view 3D reconstruction by introducing adaptive view planning. The key is computing an adaptive camera trajectory (ACT), an orbit of camera views that maximizes the visibility of the occluded regions of the 3D object to be reconstructed, instead of relying on a pre-determined camera setup. A video diffusion model then generates novel views along the orbit, which are fed to a multi-view 3D reconstruction model; the whole pipeline requires no run-time training or optimization, only forward inference with pre-trained models, making the multi-view synthesis efficient.

Link: https://arxiv.org/abs/2505.08239
Authors: Yizhi Wang, Mingrui Zhao, Ali Mahdavi-Amiri, Hao Zhang
Affiliations: Simon Fraser University
Subjects: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:We introduce adaptive view planning to multi-view synthesis, aiming to improve both occlusion revelation and 3D consistency for single-view 3D reconstruction. Instead of generating an unordered set of views independently or simultaneously, we generate a sequence of views, leveraging temporal consistency to enhance 3D coherence. Most importantly, our view sequence is not determined by a pre-determined camera setup. Instead, we compute an adaptive camera trajectory (ACT), specifically, an orbit of camera views, which maximizes the visibility of occluded regions of the 3D object to be reconstructed. Once the best orbit is found, we feed it to a video diffusion model to generate novel views around the orbit, which in turn, are passed to a multi-view 3D reconstruction model to obtain the final reconstruction. Our multi-view synthesis pipeline is quite efficient since it involves no run-time training/optimization, only forward inferences by applying the pre-trained models for occlusion analysis and multi-view synthesis. Our method predicts camera trajectories that reveal occlusions effectively and produce consistent novel views, significantly improving 3D reconstruction over SOTA on the unseen GSO dataset, both quantitatively and qualitatively.

[CV-57] EventDiff: A Unified and Efficient Diffusion Model Framework for Event-based Video Frame Interpolation

【Quick Read】: This paper addresses the challenges of video frame interpolation (VFI) under large motion, occlusion, and lighting variation, and especially the degraded high-fidelity reconstruction of existing event-based methods in subtle-motion scenes caused by their reliance on explicit motion modeling. The key is EventDiff, a unified and efficient event-based diffusion framework featuring a novel Event-Frame Hybrid AutoEncoder (HAE) with a lightweight Spatial-Temporal Cross Attention (STCA) module that fuses dynamic event streams with static frames; interpolation is performed directly in latent space through a denoising diffusion process, avoiding explicit motion estimation or warping and improving robustness across diverse, challenging VFI scenarios.

Link: https://arxiv.org/abs/2505.08235
Authors: Hanle Zheng, Xujie Han, Zegang Peng, Shangbin Zhang, Guangxun Du, Zhuo Zou, Xilin Wang, Jibin Wu, Hao Guo, Lei Deng
Affiliations: Tsinghua University; Taiyuan University of Technology; China Electronics Technology Group Corporation; Fudan University; The Hong Kong Polytechnic University
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Video Frame Interpolation (VFI) is a fundamental yet challenging task in computer vision, particularly under conditions involving large motion, occlusion, and lighting variation. Recent advancements in event cameras have opened up new opportunities for addressing these challenges. While existing event-based VFI methods have succeeded in recovering large and complex motions by leveraging handcrafted intermediate representations such as optical flow, these designs often compromise high-fidelity image reconstruction under subtle motion scenarios due to their reliance on explicit motion modeling. Meanwhile, diffusion models provide a promising alternative for VFI by reconstructing frames through a denoising process, eliminating the need for explicit motion estimation or warping operations. In this work, we propose EventDiff, a unified and efficient event-based diffusion model framework for VFI. EventDiff features a novel Event-Frame Hybrid AutoEncoder (HAE) equipped with a lightweight Spatial-Temporal Cross Attention (STCA) module that effectively fuses dynamic event streams with static frames. Unlike previous event-based VFI methods, EventDiff performs interpolation directly in the latent space via a denoising diffusion process, making it more robust across diverse and challenging VFI scenarios. Through a two-stage training strategy that first pretrains the HAE and then jointly optimizes it with the diffusion model, our method achieves state-of-the-art performance across multiple synthetic and real-world event VFI datasets. The proposed method outperforms existing state-of-the-art event-based VFI methods by up to 1.98dB in PSNR on Vimeo90K-Triplet and shows superior performance in SNU-FILM tasks with multiple difficulty levels. Compared to the emerging diffusion-based VFI approach, our method achieves up to 5.72dB PSNR gain on Vimeo90K-Triplet and 4.24X faster inference.

[CV-58] Removing Watermarks with Partial Regeneration using Semantic Information

【Quick Read】: This paper exposes the fragility of current invisible watermarks against semantics-aware adaptive attackers, in particular the insufficient robustness of semantic watermarks embedded in generative-AI images. The key is SemanticRegen, a three-stage, label-free attack that (i) obtains fine-grained captions with a vision-language model, (ii) extracts foreground masks with zero-shot segmentation, and (iii) inpaints only the background with an LLM-guided diffusion model, erasing the watermark while leaving the image's apparent meaning intact. Evaluated against several watermarking systems, the attack is highly effective while maintaining high perceptual quality.

Link: https://arxiv.org/abs/2505.08234
Authors: Krti Tallam, John Kevin Cava, Caleb Geniesse, N. Benjamin Erichson, Michael W. Mahoney
Affiliations: International Computer Science Institute; Arizona State University; Lawrence Berkeley National Laboratory; University of California at Berkeley
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
Comments:

Abstract:As AI-generated imagery becomes ubiquitous, invisible watermarks have emerged as a primary line of defense for copyright and provenance. The newest watermarking schemes embed semantic signals - content-aware patterns that are designed to survive common image manipulations - yet their true robustness against adaptive adversaries remains under-explored. We expose a previously unreported vulnerability and introduce SemanticRegen, a three-stage, label-free attack that erases state-of-the-art semantic and invisible watermarks while leaving an image’s apparent meaning intact. Our pipeline (i) uses a vision-language model to obtain fine-grained captions, (ii) extracts foreground masks with zero-shot segmentation, and (iii) inpaints only the background via an LLM-guided diffusion model, thereby preserving salient objects and style cues. Evaluated on 1,000 prompts across four watermarking systems - TreeRing, StegaStamp, StableSig, and DWT/DCT - SemanticRegen is the only method to defeat the semantic TreeRing watermark (p = 0.10 > 0.05) and reduces bit-accuracy below 0.75 for the remaining schemes, all while maintaining high perceptual quality (masked SSIM = 0.94 ± 0.01). We further introduce masked SSIM (mSSIM) to quantify fidelity within foreground regions, showing that our attack achieves up to 12 percent higher mSSIM than prior diffusion-based attackers. These results highlight an urgent gap between current watermark defenses and the capabilities of adaptive, semantics-aware adversaries, underscoring the need for watermarking algorithms that are resilient to content-preserving regenerative attacks.

[CV-59] G-MSGINet: A Grouped Multi-Scale Graph-Involution Network for Contactless Fingerprint Recognition

【Quick Read】: This paper targets limitations of contactless fingerprint recognition, where existing approaches rely on multi-branch architectures, orientation labels, or complex preprocessing steps that hurt scalability and generalization. The key is the GMSGI layer, a novel computational module that integrates grouped pixel-level involution, dynamic multi-scale kernel generation, and graph-based relational modeling into a single processing unit, jointly optimizing minutiae localization and identity embedding end to end. Without explicit orientation supervision, graph connectivity is adapted directly from the learned kernel descriptors, capturing meaningful structural relations among fingerprint regions without fixed heuristics.

Link: https://arxiv.org/abs/2505.08233
Authors: Santhoshkumar Peddi, Soham Bandyopadhyay, Debasis Samanta
Affiliations: Indian Institute of Technology Kharagpur
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:This paper presents G-MSGINet, a unified and efficient framework for robust contactless fingerprint recognition that jointly performs minutiae localization and identity embedding directly from raw input images. Existing approaches rely on multi-branch architectures, orientation labels, or complex preprocessing steps, which limit scalability and generalization across real-world acquisition scenarios. In contrast, the proposed architecture introduces the GMSGI layer, a novel computational module that integrates grouped pixel-level involution, dynamic multi-scale kernel generation, and graph-based relational modelling into a single processing unit. Stacked GMSGI layers progressively refine both local minutiae-sensitive features and global topological representations through end-to-end optimization. The architecture eliminates explicit orientation supervision and adapts graph connectivity directly from learned kernel descriptors, thereby capturing meaningful structural relationships among fingerprint regions without fixed heuristics. Extensive experiments on three benchmark datasets, namely PolyU, CFPose, and Benchmark 2D/3D, demonstrate that G-MSGINet consistently achieves minutiae F1-scores in the range of 0.83 ± 0.02 and Rank-1 identification accuracies between 97.0% and 99.1%, while maintaining an Equal Error Rate (EER) as low as 0.5%. These results correspond to improvements of up to 4.8% in F1-score and 1.4% in Rank-1 accuracy when compared to prior methods, using only 0.38 million parameters and 6.63 giga floating-point operations, which represents up to ten times fewer parameters than competitive baselines. This highlights the scalability and effectiveness of G-MSGINet in real-world contactless biometric recognition scenarios.

[CV-60] HMPNet: A Feature Aggregation Architecture for Maritime Object Detection from a Shipborne Perspective ICME2025

【Quick Read】: This paper addresses the difficulty of deploying visual perception for intelligent maritime navigation from a shipborne perspective, caused by the scarcity of maritime-specific data. The key is HMPNet, a lightweight architecture for shipborne object detection with a hierarchical dynamic modulation backbone, a matrix cascading poly-scale neck, and a polymerization weight-sharing detector, achieving efficient multi-scale feature aggregation and expression with high accuracy and computational efficiency.

Link: https://arxiv.org/abs/2505.08231
Authors: Yu Zhang, Fengyuan Liu, Juan Lyu, Yi Wei, Changdong Yu
Affiliations: Tianjin University of Science and Technology; Tianjin Institute of Navigation Instruments; Dalian Maritime University
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: This paper has been accepted to ICME 2025

Abstract:In the realm of intelligent maritime navigation, object detection from a shipborne perspective is paramount. Despite the criticality, the paucity of maritime-specific data impedes the deployment of sophisticated visual perception techniques, akin to those utilized in autonomous vehicular systems, within the maritime context. To bridge this gap, we introduce Navigation12, a novel dataset annotated for 12 object categories under diverse maritime environments and weather conditions. Based upon this dataset, we propose HMPNet, a lightweight architecture tailored for shipborne object detection. HMPNet incorporates a hierarchical dynamic modulation backbone to bolster feature aggregation and expression, complemented by a matrix cascading poly-scale neck and a polymerization weight sharing detector, facilitating efficient multi-scale feature aggregation. Empirical evaluations indicate that HMPNet surpasses current state-of-the-art methods in terms of both accuracy and computational efficiency, realizing a 3.3% improvement in mean Average Precision over YOLOv11n, the prevailing model, and reducing parameters by 23%.

[CV-61] Object detection in adverse weather conditions for autonomous vehicles using Instruct Pix2Pix IJCNN

【Quick Read】: This paper aims to improve the robustness of object detection systems under adverse weather, a crucial step for advancing autonomous driving. The key is using the diffusion model Instruct Pix2Pix to generate realistic weather-augmented datasets, mitigating the impact of adverse weather on the perception of state-of-the-art detectors such as Faster R-CNN and YOLOv10. Experiments in the CARLA simulator and on the real-world image datasets BDD100K and ACDC validate the effectiveness of the approach in real environments.

Link: https://arxiv.org/abs/2505.08228
Authors: Unai Gurbindo, Axel Brando, Jaume Abella, Caroline König
Affiliations: High-Performance Embedded Systems (HPES) Lab, Barcelona Supercomputing Center; Computer Science Dept., Univ. Politècnica de Catalunya; IDEAI-UPC Research Center
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 8 pages, 5 figures. Accepted at the International Joint Conference on Neural Networks (IJCNN) 2025 (to appear)

Abstract:Enhancing the robustness of object detection systems under adverse weather conditions is crucial for the advancement of autonomous driving technology. This study presents a novel approach leveraging the diffusion model Instruct Pix2Pix to develop prompting methodologies that generate realistic datasets with weather-based augmentations aiming to mitigate the impact of adverse weather on the perception capabilities of state-of-the-art object detection models, including Faster R-CNN and YOLOv10. Experiments were conducted in two environments, in the CARLA simulator where an initial evaluation of the proposed data augmentation was provided, and then on the real-world image data sets BDD100K and ACDC demonstrating the effectiveness of the approach in real environments. The key contributions of this work are twofold: (1) identifying and quantifying the performance gap in object detection models under challenging weather conditions, and (2) demonstrating how tailored data augmentation strategies can significantly enhance the robustness of these models. This research establishes a solid foundation for improving the reliability of perception systems in demanding environmental scenarios, and provides a pathway for future advancements in autonomous driving.
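
As a rough illustration of the augmentation step, the minimal sketch below edits a driving frame with the public Instruct Pix2Pix checkpoint in Hugging Face diffusers. The prompts, file names, and guidance values are illustrative assumptions, not the paper's exact prompting methodology.

```python
# Sketch: weather-based augmentation with Instruct Pix2Pix via diffusers.
# Prompts and parameters below are assumptions for illustration only.
import torch
from diffusers import StableDiffusionInstructPix2PixPipeline
from PIL import Image

pipe = StableDiffusionInstructPix2PixPipeline.from_pretrained(
    "timbrooks/instruct-pix2pix", torch_dtype=torch.float16
).to("cuda")

image = Image.open("clear_day.png").convert("RGB")  # hypothetical input frame
weather_prompts = ["make it heavy rain", "add dense fog", "make it snow"]

for prompt in weather_prompts:
    out = pipe(
        prompt,
        image=image,
        num_inference_steps=20,
        image_guidance_scale=1.5,  # how closely to follow the input image
        guidance_scale=7.0,        # how strongly to follow the text prompt
    ).images[0]
    out.save(f"aug_{prompt.replace(' ', '_')}.png")
```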

[CV-62] Visual Watermarking in the Era of Diffusion Models: Advances and Challenges

【Quick Read】: This paper addresses the misuse of visual content brought about by advances in generative AI, in particular the risk of copyright infringement. The key to the solution is leveraging diffusion models to improve the robustness and detection accuracy of watermarking: by learning effective features, imperceptible yet highly tamper-resistant watermarks can be embedded, strengthening ownership protection for digital content.

Link: https://arxiv.org/abs/2505.08197
Authors: Junxian Duan, Jiyang Guang, Wenkui Yang, Ran He
Institutions: State Key Laboratory of Multimodal Artificial Intelligence Systems, CASIA, Beijing, China
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:As generative artificial intelligence technologies like Stable Diffusion advance, visual content becomes more vulnerable to misuse, raising concerns about copyright infringement. Visual watermarks serve as effective protection mechanisms, asserting ownership and deterring unauthorized use. Traditional deepfake detection methods often rely on passive techniques that struggle with sophisticated manipulations. In contrast, diffusion models enhance detection accuracy by allowing for the effective learning of features, enabling the embedding of imperceptible and robust watermarks. We analyze the strengths and challenges of watermark techniques related to diffusion models, focusing on their robustness and application in watermark generation. By exploring the integration of advanced diffusion models and watermarking security, we aim to advance the discourse on preserving watermark robustness against evolving forgery threats. It emphasizes the critical importance of developing innovative solutions to protect digital content and ensure the preservation of ownership rights in the era of generative AI.

[CV-63] ADC-GS: Anchor-Driven Deformable and Compressed Gaussian Splatting for Dynamic Scene Reconstruction

【Quick Read】: This paper tackles the redundancy among adjacent Gaussian primitives in existing 4D Gaussian Splatting methods, which deform each Gaussian independently from a canonical space to target frames and thus degrade performance. The key is the proposed Anchor-Driven Deformable and Compressed Gaussian Splatting (ADC-GS): it organizes Gaussian primitives into an anchor-based structure within the canonical space, refined by a temporal-significance-based anchor refinement strategy to reduce deformation redundancy, introduces a hierarchical coarse-to-fine pipeline to capture motion at varying granularities, and adopts rate-distortion optimization to achieve an optimal balance between bitrate consumption and representation fidelity.

Link: https://arxiv.org/abs/2505.08196
Authors: He Huang, Qi Yang, Mufan Liu, Yiling Xu, Zhu Li
Institutions: Shanghai Jiao Tong University; University of Missouri-Kansas City
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Existing 4D Gaussian Splatting methods rely on per-Gaussian deformation from a canonical space to target frames, which overlooks redundancy among adjacent Gaussian primitives and results in suboptimal performance. To address this limitation, we propose Anchor-Driven Deformable and Compressed Gaussian Splatting (ADC-GS), a compact and efficient representation for dynamic scene reconstruction. Specifically, ADC-GS organizes Gaussian primitives into an anchor-based structure within the canonical space, enhanced by a temporal significance-based anchor refinement strategy. To reduce deformation redundancy, ADC-GS introduces a hierarchical coarse-to-fine pipeline that captures motions at varying granularities. Moreover, a rate-distortion optimization is adopted to achieve an optimal balance between bitrate consumption and representation fidelity. Experimental results demonstrate that ADC-GS outperforms the per-Gaussian deformation approaches in rendering speed by 300%-800% while achieving state-of-the-art storage efficiency without compromising rendering quality. The code is released at this https URL.

[CV-64] SpNeRF: Memory Efficient Sparse Volumetric Neural Rendering Accelerator for Edge Devices DATE2025

【Quick Read】: This paper targets the memory bottleneck that prevents real-time neural rendering on edge devices, caused by large voxel-grid data and irregular access patterns that trigger frequent off-chip memory accesses and high on-chip memory usage. The key is SpNeRF, a software-hardware co-design: it identifies memory-bound rendering inefficiencies, analyzes the inherent sparsity of voxel-grid data, reduces the voxel-grid memory footprint through preprocessing and online decoding steps, and adds a dedicated hardware architecture for sparse voxel-grid processing, yielding substantial gains in performance and energy efficiency.

Link: https://arxiv.org/abs/2505.08191
Authors: Yipu Zhang, Jiawei Liang, Jian Peng, Jiang Xu, Wei Zhang
Institutions: The Hong Kong University of Science and Technology; The Hong Kong University of Science and Technology (GZ)
Subjects: Hardware Architecture (cs.AR); Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by DATE 2025

Abstract:Neural rendering has gained prominence for its high-quality output, which is crucial for AR/VR applications. However, its large voxel grid data size and irregular access patterns challenge real-time processing on edge devices. While previous works have focused on improving data locality, they have not adequately addressed the issue of large voxel grid sizes, which necessitate frequent off-chip memory access and substantial on-chip memory. This paper introduces SpNeRF, a software-hardware co-design solution tailored for sparse volumetric neural rendering. We first identify memory-bound rendering inefficiencies and analyze the inherent sparsity in the voxel grid data of neural rendering. To enhance efficiency, we propose novel preprocessing and online decoding steps, reducing the memory size of the voxel grid. The preprocessing step employs hash mapping to support irregular data access while maintaining a minimal memory size. The online decoding step enables efficient on-chip sparse voxel grid processing, incorporating bitmap masking to mitigate PSNR loss caused by hash collisions. To further optimize performance, we design a dedicated hardware architecture supporting our sparse voxel grid processing technique. Experimental results demonstrate that SpNeRF achieves an average 21.07× reduction in memory size while maintaining comparable PSNR levels. When benchmarked against Jetson XNX, Jetson ONX, this http URL and this http URL, our design achieves speedups of 95.1×, 63.5×, 1.5× and 10.3×, and improves energy efficiency by 625.6×, 529.1×, 4×, and 4.4×, respectively.
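
To make the preprocessing idea concrete, here is a minimal sketch of hash-mapped sparse voxel storage with bitmap masking. The grid size, hash function, and bucket layout are assumptions for the demo, not SpNeRF's actual design.

```python
# Sketch: keep only occupied voxels in a hash table plus an occupancy
# bitmap, so lookups of empty space never probe the table. Illustrative
# only; SpNeRF's hardware data layout differs.
import numpy as np

GRID = 128                       # dense grid would be GRID^3 (assumed)
TABLE = 1 << 16                  # hash-table size (power of two, assumed)

def voxel_hash(x, y, z):
    # Classic spatial hash with large primes, reduced to the table size.
    return ((x * 73856093) ^ (y * 19349663) ^ (z * 83492791)) % TABLE

def build(sparse_voxels):
    """sparse_voxels: dict {(x, y, z): feature vector} of occupied cells."""
    bitmap = np.zeros(GRID**3, dtype=bool)      # one flag per voxel
    table = [[] for _ in range(TABLE)]          # buckets absorb collisions
    for (x, y, z), feat in sparse_voxels.items():
        bitmap[(x * GRID + y) * GRID + z] = True
        table[voxel_hash(x, y, z)].append(((x, y, z), feat))
    return bitmap, table

def lookup(bitmap, table, x, y, z):
    # Bitmap masking: reject empty voxels (and stray collisions) up front.
    if not bitmap[(x * GRID + y) * GRID + z]:
        return None
    for key, feat in table[voxel_hash(x, y, z)]:
        if key == (x, y, z):
            return feat
    return None

voxels = {(1, 2, 3): np.ones(8), (40, 41, 42): np.zeros(8)}
bitmap, table = build(voxels)
print(lookup(bitmap, table, 1, 2, 3))   # stored feature vector
print(lookup(bitmap, table, 5, 5, 5))   # None: empty space, no table probe
```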

[CV-65] Unsupervised Raindrop Removal from a Single Image using Conditional Diffusion Models

【Quick Read】: This paper addresses raindrop removal from a single image, a challenging task in image processing. The key to the solution is applying diffusion-based image inpainting to restore the background in raindrop regions, removing raindrops effectively without relying on multiple images or additional information.

Link: https://arxiv.org/abs/2505.08190
Authors: Lhuqita Fazry, Valentino Vito
Institutions: Universitas Indonesia
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:

Abstract:Raindrop removal is a challenging task in image processing. Removing raindrops while relying solely on a single image further increases the difficulty of the task. Common approaches include the detection of raindrop regions in the image, followed by performing a background restoration process conditioned on those regions. While various methods can be applied for the detection step, the most common architecture used for background restoration is the Generative Adversarial Network (GAN). Recent advances in the use of diffusion models have led to state-of-the-art image inpainting techniques. In this paper, we introduce a novel technique for raindrop removal from a single image using diffusion-based image inpainting.
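
A minimal sketch of the restoration step follows, assuming a raindrop mask has already been produced by the detection stage. The checkpoint name is one publicly available inpainting model, not necessarily the one used in the paper.

```python
# Sketch: diffusion-based inpainting over detected raindrop regions.
# Mask convention assumed: white = raindrop pixels to repaint.
import torch
from diffusers import StableDiffusionInpaintPipeline
from PIL import Image

pipe = StableDiffusionInpaintPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-inpainting", torch_dtype=torch.float16
).to("cuda")

image = Image.open("rainy_frame.png").convert("RGB").resize((512, 512))
mask = Image.open("raindrop_mask.png").convert("L").resize((512, 512))

restored = pipe(
    prompt="a clean photo, no raindrops on the lens",
    image=image,
    mask_image=mask,   # inpaint only where the mask is white
    num_inference_steps=30,
).images[0]
restored.save("derained.png")
```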

[CV-66] Monocular Depth Guided Occlusion-Aware Disparity Refinement via Semi-supervised Learning in Laparoscopic Images

【Quick Read】: This paper addresses two key problems in disparity estimation for stereo laparoscopic images: occlusion and the scarcity of labeled surgical data. The key is the proposed Depth Guided Occlusion-Aware Disparity Refinement Network (DGORNet), which refines disparity maps using monocular depth information that is unaffected by occlusion. A Position Embedding module provides explicit spatial context, strengthening the network's ability to localize and refine features, and an Optical Flow Difference Loss (OFDLoss) exploits temporal continuity across video frames to improve robustness in dynamic surgical scenes.

Link: https://arxiv.org/abs/2505.08178
Authors: Ziteng Liu, Dongdong He, Chenghong Zhang, Wenpeng Gao, Yili Fu
Institutions: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Occlusion and the scarcity of labeled surgical data are significant challenges in disparity estimation for stereo laparoscopic images. To address these issues, this study proposes a Depth Guided Occlusion-Aware Disparity Refinement Network (DGORNet), which refines disparity maps by leveraging monocular depth information unaffected by occlusion. A Position Embedding (PE) module is introduced to provide explicit spatial context, enhancing the network’s ability to localize and refine features. Furthermore, we introduce an Optical Flow Difference Loss (OFDLoss) for unlabeled data, leveraging temporal continuity across video frames to improve robustness in dynamic surgical scenes. Experiments on the SCARED dataset demonstrate that DGORNet outperforms state-of-the-art methods in terms of End-Point Error (EPE) and Root Mean Squared Error (RMSE), particularly in occlusion and texture-less regions. Ablation studies confirm the contributions of the Position Embedding and Optical Flow Difference Loss, highlighting their roles in improving spatial and temporal consistency. These results underscore DGORNet’s effectiveness in enhancing disparity estimation for laparoscopic surgery, offering a practical solution to challenges in disparity estimation and data limitations.
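
The sketch below gives one plausible reading of an optical-flow-difference-style loss for unlabeled video: warp the previous frame's disparity with the estimated flow and penalize disagreement with the current disparity. The paper's exact formulation may differ.

```python
# Sketch: flow-warped temporal consistency loss for disparity (assumed
# form of OFDLoss; shapes and loss choice are illustrative).
import torch
import torch.nn.functional as F

def warp_with_flow(disp_prev, flow):
    """disp_prev: (B,1,H,W); flow: (B,2,H,W) forward flow in pixels."""
    b, _, h, w = disp_prev.shape
    ys, xs = torch.meshgrid(
        torch.arange(h, device=flow.device),
        torch.arange(w, device=flow.device),
        indexing="ij",
    )
    grid = torch.stack((xs, ys), dim=0).float().unsqueeze(0) + flow
    # Normalize sampling coordinates to [-1, 1] for grid_sample.
    grid[:, 0] = 2.0 * grid[:, 0] / (w - 1) - 1.0
    grid[:, 1] = 2.0 * grid[:, 1] / (h - 1) - 1.0
    return F.grid_sample(disp_prev, grid.permute(0, 2, 3, 1),
                         align_corners=True)

def ofd_loss(disp_prev, disp_curr, flow):
    return F.smooth_l1_loss(warp_with_flow(disp_prev, flow), disp_curr)

disp_t  = torch.rand(2, 1, 64, 64)
disp_t1 = torch.rand(2, 1, 64, 64)
flow    = torch.zeros(2, 2, 64, 64)   # dummy flow for the demo
print(ofd_loss(disp_t, disp_t1, flow).item())
```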

[CV-67] Empowering Vision Transformers with Multi-Scale Causal Intervention for Long-Tailed Image Classification

【Quick Read】: This paper addresses the bias introduced by class imbalance in long-tailed classification, particularly the observation that existing causal models fail to deliver the expected performance gains when the backbone shifts from convolutional neural networks (CNNs) to Vision Transformers (ViTs). The key is TSCNet, a two-stage causal modeling method that discovers fine-grained causal associations through multi-scale causal interventions: the hierarchical causal representation learning stage (HCRL) decouples background and objects and applies backdoor interventions to enhance fine-grained causal representations, while the counterfactual logits bias calibration stage (CLBC) refines the decision boundary by adaptively constructing counterfactually balanced data distributions, removing spurious associations induced by the data distribution.

Link: https://arxiv.org/abs/2505.08173
Authors: Xiaoshuo Yan, Zhaochuan Li, Lei Meng, Zhuang Qi, Wei Wu, Zixuan Li, Xiangxu Meng
Institutions: School of Software, Shandong University; Inspur
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Causal inference has emerged as a promising approach to mitigate long-tail classification by handling the biases introduced by class imbalance. However, along with the change of advanced backbone models from Convolutional Neural Networks (CNNs) to Visual Transformers (ViT), existing causal models may not achieve an expected performance gain. This paper investigates the influence of existing causal models on CNNs and ViT variants, highlighting that ViT's global feature representation makes it hard for causal methods to model associations between fine-grained features and predictions, which leads to difficulties in classifying tail classes with similar visual appearance. To address these issues, this paper proposes TSCNet, a two-stage causal modeling method to discover fine-grained causal associations through multi-scale causal interventions. Specifically, in the hierarchical causal representation learning stage (HCRL), it decouples the background and objects, applying backdoor interventions at both the patch and feature level to prevent the model from using class-irrelevant areas to infer labels, which enhances fine-grained causal representation. In the counterfactual logits bias calibration stage (CLBC), it refines the optimization of the model's decision boundary by adaptively constructing counterfactually balanced data distributions to remove the spurious associations in the logits caused by data distribution. Extensive experiments conducted on various long-tail benchmarks demonstrate that the proposed TSCNet can eliminate multiple biases introduced by data imbalance, outperforming existing methods.

[CV-68] MoKD: Multi-Task Optimization for Knowledge Distillation

【Quick Read】: This paper addresses two key problems in knowledge distillation (KD): balancing learning from the teacher's guidance against the task objective, and handling the disparity in knowledge representation between teacher and student models. The key is the proposed Multi-Task Optimization for Knowledge Distillation (MoKD), which reformulates KD as a multi-objective optimization problem and introduces a subspace learning framework that projects feature representations into a high-dimensional space, improving knowledge transfer while effectively mitigating gradient conflict and gradient dominance.

Link: https://arxiv.org/abs/2505.08170
Authors: Zeeshan Hayder, Ali Cheraghian, Lars Petersson, Mehrtash Harandi
Institutions: Data61, CSIRO, Australia; Australian National University; Monash University
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Compact models can be effectively trained through Knowledge Distillation (KD), a technique that transfers knowledge from larger, high-performing teacher models. Two key challenges in Knowledge Distillation (KD) are: 1) balancing learning from the teacher’s guidance and the task objective, and 2) handling the disparity in knowledge representation between teacher and student models. To address these, we propose Multi-Task Optimization for Knowledge Distillation (MoKD). MoKD tackles two main gradient issues: a) Gradient Conflicts, where task-specific and distillation gradients are misaligned, and b) Gradient Dominance, where one objective’s gradient dominates, causing imbalance. MoKD reformulates KD as a multi-objective optimization problem, enabling better balance between objectives. Additionally, it introduces a subspace learning framework to project feature representations into a high-dimensional space, improving knowledge transfer. Our MoKD is demonstrated to outperform existing methods through extensive experiments on image classification using the ImageNet-1K dataset and object detection using the COCO dataset, achieving state-of-the-art performance with greater efficiency. To the best of our knowledge, MoKD models also achieve state-of-the-art performance compared to models trained from scratch.
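
To illustrate the two gradient pathologies named in the abstract, here is a PCGrad-style stand-in: detect a conflict between the task gradient and the distillation gradient, project it away, and normalize to curb dominance. This is not MoKD's actual multi-objective solver, only a sketch of conflict detection and removal.

```python
# Sketch: gradient conflict removal (PCGrad-style projection) plus a
# normalization step against gradient dominance. Illustrative stand-in.
import torch

def deconflict(g_task: torch.Tensor, g_kd: torch.Tensor) -> torch.Tensor:
    """Combine flattened gradients, removing the conflicting component."""
    dot = torch.dot(g_task, g_kd)
    if dot < 0:  # gradient conflict: the two objectives disagree
        g_kd = g_kd - dot / g_task.norm().pow(2) * g_task
    # Unit-normalize each part so neither objective dominates the update.
    return g_task / (g_task.norm() + 1e-8) + g_kd / (g_kd.norm() + 1e-8)

g_task = torch.tensor([1.0, 0.0])
g_kd   = torch.tensor([-0.5, 1.0])          # conflicts with g_task
print(deconflict(g_task, g_kd))             # conflicting component removed
```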

[CV-69] Decoding Neighborhood Environments with Large Language Models

【Quick Read】: This paper addresses the high resource cost and poor scalability of traditional methods for assessing neighborhood environments. The key is using large language models (LLMs) for automated analysis: optimized prompting strategies and fine-tuning improve feasibility and robustness, and majority voting across multiple LLMs reaches over 88% accuracy without any training effort, offering a viable path to decoding neighborhood environments at scale.

Link: https://arxiv.org/abs/2505.08163
Authors: Andrew Cart, Shaohu Zhang, Melanie Escue, Xugui Zhou, Haitao Zhao, Prashanth BusiReddyGari, Beiyu Lin, Shuang Li
Institutions: University of North Carolina at Pembroke, USA; Louisiana State University, USA; University of Oklahoma, USA; North Carolina A&T State University, USA
Subjects: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: 8 pages

Abstract:Neighborhood environments include physical and environmental conditions such as housing quality, roads, and sidewalks, which significantly influence human health and well-being. Traditional methods for assessing these environments, including field surveys and geographic information systems (GIS), are resource-intensive and challenging to evaluate neighborhood environments at scale. Although machine learning offers potential for automated analysis, the laborious process of labeling training data and the lack of accessible models hinder scalability. This study explores the feasibility of large language models (LLMs) such as ChatGPT and Gemini as tools for decoding neighborhood environments (e.g., sidewalk and powerline) at scale. We train a robust YOLOv11-based model, which achieves an average accuracy of 99.13% in detecting six environmental indicators, including streetlight, sidewalk, powerline, apartment, single-lane road, and multilane road. We then evaluate four LLMs, including ChatGPT, Gemini, Claude, and Grok, to assess their feasibility, robustness, and limitations in identifying these indicators, with a focus on the impact of prompting strategies and fine-tuning. We apply majority voting with the top three LLMs to achieve over 88% accuracy, which demonstrates LLMs could be a useful tool to decode the neighborhood environment without any training effort.
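
The majority-voting step reduces to a few lines once the model calls exist. In the sketch below the LLM calls are stubbed out; `ask_model`, the model names, and the prompt wording are assumptions to be replaced with real API calls.

```python
# Sketch: majority voting across the top three LLMs on a yes/no indicator
# question. Model responses are canned placeholders for the demo.
from collections import Counter

def ask_model(model_name: str, image_path: str, indicator: str) -> str:
    # Placeholder: in practice, send the image with a prompt such as
    # "Does this image contain a {indicator}? Answer yes or no."
    canned = {"chatgpt": "yes", "gemini": "yes", "claude": "no"}
    return canned[model_name]

def majority_vote(image_path: str, indicator: str) -> str:
    votes = [ask_model(m, image_path, indicator)
             for m in ("chatgpt", "gemini", "claude")]
    return Counter(votes).most_common(1)[0][0]

print(majority_vote("street_view.jpg", "sidewalk"))  # -> "yes" (2 of 3)
```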

[CV-70] Asynchronous Multi-Object Tracking with an Event Camera ICRA

【Quick Read】: This paper addresses detecting and tracking multiple objects with a robot in highly dynamic environments. The key is the proposed Asynchronous Event Multi-Object Tracking (AEMOT) algorithm, which processes the raw event stream asynchronously, detects salient event-blob features using a novel Field of Active Flow Directions built from the Surface of Active Events, tracks candidate objects with the recently proposed Asynchronous Event Blob (AEB) tracker, and then classifies and filters candidates with a learnt validation stage, yielding high-precision object state estimates.

Link: https://arxiv.org/abs/2505.08126
Authors: Angus Apps, Ziwei Wang, Vladimir Perejogin, Timothy Molloy, Robert Mahony
Institutions: Australian National University; Defence Science and Technology Group; Centre for Advanced Defence Research in Robotics and Autonomous Systems
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 7 pages, 5 figures, published in IEEE International Conference on Robotics and Automation (ICRA), 2025

Abstract:Event cameras are ideal sensors for enabling robots to detect and track objects in highly dynamic environments due to their low latency output, high temporal resolution, and high dynamic range. In this paper, we present the Asynchronous Event Multi-Object Tracking (AEMOT) algorithm for detecting and tracking multiple objects by processing individual raw events asynchronously. AEMOT detects salient event blob features by identifying regions of consistent optical flow using a novel Field of Active Flow Directions built from the Surface of Active Events. Detected features are tracked as candidate objects using the recently proposed Asynchronous Event Blob (AEB) tracker in order to construct small intensity patches of each candidate object. A novel learnt validation stage promotes or discards candidate objects based on classification of their intensity patches, with promoted objects having their position, velocity, size, and orientation estimated at their event rate. We evaluate AEMOT on a new Bee Swarm Dataset, where it tracks dozens of small bees with precision and recall performance exceeding that of alternative event-based detection and tracking algorithms by over 37%. Source code and the labelled event Bee Swarm Dataset will be open-sourced.

[CV-71] SLAG: Scalable Language-Augmented Gaussian Splatting

【Quick Read】: This paper addresses rapid scene encoding and scalability for large-scale robotics applications, in particular efficient language-augmented 3D scene embedding on compute-constrained robot platforms. The key is the SLAG framework, which integrates 2D vision-language model features into 3D scenes using SAM and CLIP and, unlike prior approaches, derives per-Gaussian embeddings directly from 3D Gaussian scene parameters via a normalized weighted average instead of a loss function, enabling highly parallelized scene encoding. A vector database is also introduced for efficient embedding storage and retrieval.

Link: https://arxiv.org/abs/2505.08124
Authors: Laszlo Szilagyi, Francis Engelmann, Jeannette Bohg
Institutions: Stanford University
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Robotics (cs.RO)
Comments:

Abstract:Language-augmented scene representations hold great promise for large-scale robotics applications such as search-and-rescue, smart cities, and mining. Many of these scenarios are time-sensitive, requiring rapid scene encoding while also being data-intensive, necessitating scalable solutions. Deploying these representations on robots with limited computational resources further adds to the challenge. To address this, we introduce SLAG, a multi-GPU framework for language-augmented Gaussian splatting that enhances the speed and scalability of embedding large scenes. Our method integrates 2D visual-language model features into 3D scenes using SAM and CLIP. Unlike prior approaches, SLAG eliminates the need for a loss function to compute per-Gaussian language embeddings. Instead, it derives embeddings from 3D Gaussian scene parameters via a normalized weighted average, enabling highly parallelized scene encoding. Additionally, we introduce a vector database for efficient embedding storage and retrieval. Our experiments show that SLAG achieves an 18 times speedup in embedding computation on a 16-GPU setup compared to OpenGaussian, while preserving embedding quality on the ScanNet and LERF datasets. For more details, visit our project website: this https URL.
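
A minimal sketch of the loss-free aggregation idea: fuse the 2D features a Gaussian projects to across views with a normalized weighted average. The use of per-view alpha contributions as weights is an assumption; SLAG's exact weighting may differ.

```python
# Sketch: per-Gaussian language embedding as a normalized weighted average
# of per-view 2D features (assumed weighting, illustrative shapes).
import torch

def per_gaussian_embedding(view_feats, weights):
    """
    view_feats: (V, D) CLIP-style feature at the pixel each view projects
                this Gaussian to.
    weights:    (V,) contribution of the Gaussian in each view.
    """
    w = weights / (weights.sum() + 1e-8)          # normalize the weights
    emb = (w.unsqueeze(1) * view_feats).sum(0)    # weighted average
    return emb / (emb.norm() + 1e-8)              # unit-norm embedding

feats = torch.randn(5, 512)      # 5 views, 512-dim features
alphas = torch.rand(5)           # per-view alpha contributions (assumed)
print(per_gaussian_embedding(feats, alphas).shape)  # torch.Size([512])
```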

[CV-72] JSover: Joint Spectrum Estimation and Multi-Material Decomposition from Single-Energy CT Projections

【Quick Read】: This paper aims to remove the dependence of traditional multi-material decomposition (MMD) on spectral CT scanners and pre-measured X-ray energy spectra, improving clinical applicability on conventional single-energy CT (SECT) systems. The key is JSover, a one-step framework that jointly reconstructs multi-material compositions and estimates the energy spectrum directly from SECT projections. By incorporating physics-informed spectral priors, it effectively simulates a virtual spectral CT system, reducing the nonlinear beam-hardening artifacts and noise caused by energy-dependent attenuation, while an implicit neural representation (INR) serves as an unsupervised deep learning solver for the material maps, improving the accuracy and computational efficiency of the decomposition.

Link: https://arxiv.org/abs/2505.08123
Authors: Qing Wu, Hongjiang Wei, Jingyi Yu, S. Kevin Zhou, Yuyao Zhang
Institutions: ShanghaiTech University; Chinese Academy of Sciences; University of Chinese Academy of Sciences; Shanghai Jiao Tong University; University of Science and Technology of China
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 11 pages

Abstract:Multi-material decomposition (MMD) enables quantitative reconstruction of tissue compositions in the human body, supporting a wide range of clinical applications. However, traditional MMD typically requires spectral CT scanners and pre-measured X-ray energy spectra, significantly limiting clinical applicability. To this end, various methods have been developed to perform MMD using conventional (i.e., single-energy, SE) CT systems, commonly referred to as SEMMD. Despite promising progress, most SEMMD methods follow a two-step image decomposition pipeline, which first reconstructs monochromatic CT images using algorithms such as FBP, and then performs decomposition on these images. The initial reconstruction step, however, neglects the energy-dependent attenuation of human tissues, introducing severe nonlinear beam hardening artifacts and noise into the subsequent decomposition. This paper proposes JSover, a fundamentally reformulated one-step SEMMD framework that jointly reconstructs multi-material compositions and estimates the energy spectrum directly from SECT projections. By explicitly incorporating physics-informed spectral priors into the SEMMD process, JSover accurately simulates a virtual spectral CT system from SE acquisitions, thereby improving the reliability and accuracy of decomposition. Furthermore, we introduce implicit neural representation (INR) as an unsupervised deep learning solver for representing the underlying material maps. The inductive bias of INR toward continuous image patterns constrains the solution space and further enhances estimation quality. Extensive experiments on both simulated and real CT datasets show that JSover outperforms state-of-the-art SEMMD methods in accuracy and computational efficiency.

[CV-73] Now you see it, Now you don't: Damage Label Agreement in Drone & Satellite Post-Disaster Imagery

【Quick Read】: This paper examines the consistency of building-damage labels derived from satellite versus drone aerial imagery, finding significant disagreement after hurricanes that poses risks and potential harms for deployed machine learning damage assessment systems. The key is a comparison of 15,814 buildings across three hurricanes using the same damage label schema and building locations, overcoming the inconsistent schemas, misaligned building locations, and insufficient data volumes of prior work.

Link: https://arxiv.org/abs/2505.08117
Authors: Thomas Manzini, Priyankari Perali, Jayesh Tripathi, Robin Murphy
Institutions: Texas A&M University
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 11 pages, 5 figures, 3 tables. Appearing at ACM FAccT'25

Abstract:This paper audits damage labels derived from coincident satellite and drone aerial imagery for 15,814 buildings across Hurricanes Ian, Michael, and Harvey, finding 29.02% label disagreement and significantly different distributions between the two sources, which presents risks and potential harms during the deployment of machine learning damage assessment systems. Currently, there is no known study of label agreement between drone and satellite imagery for building damage assessment. The only prior work that could be used to infer if such imagery-derived labels agree is limited by differing damage label schemas, misaligned building locations, and low data quantities. This work overcomes these limitations by comparing damage labels using the same damage label schemas and building locations from three hurricanes, with the 15,814 buildings representing 19.05 times more buildings considered than the most relevant prior work. The analysis finds satellite-derived labels significantly under-report damage by at least 20.43% compared to drone-derived labels (p < 1.2×10^-117), and satellite- and drone-derived labels represent significantly different distributions (p < 5.1×10^-175). This indicates that computer vision and machine learning (CV/ML) models trained on at least one of these distributions will misrepresent actual conditions, as the differing satellite and drone-derived distributions cannot simultaneously represent the distribution of actual conditions in a scene. This potential misrepresentation poses ethical risks and potential societal harm if not managed. To reduce the risk of future societal harms, this paper offers four recommendations to improve reliability and transparency for decision-makers when deploying CV/ML damage assessment systems in practice.
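
The two headline checks are easy to reproduce on any pair of per-building label vectors: a disagreement rate and a test that the two label distributions differ. The counts below are made up; the paper's exact statistical test is not specified here, so a chi-square test of the class-count table stands in.

```python
# Sketch: label disagreement rate and a chi-square comparison of the
# satellite- vs drone-derived label distributions (synthetic data).
import numpy as np
from scipy.stats import chi2_contingency

sat_labels   = np.array([0, 1, 1, 2, 0, 3, 1, 0])   # hypothetical classes 0..3
drone_labels = np.array([0, 2, 1, 3, 1, 3, 2, 0])

disagreement = np.mean(sat_labels != drone_labels)
print(f"label disagreement: {disagreement:.2%}")

# 2 x K contingency table of class counts per imagery source.
k = 4
table = np.stack([np.bincount(sat_labels, minlength=k),
                  np.bincount(drone_labels, minlength=k)])
chi2, p, dof, _ = chi2_contingency(table)
print(f"chi2={chi2:.2f}, p={p:.3g}")   # small p -> distributions differ
```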

[CV-74] Sleep Position Classification using Transfer Learning for Bed-based Pressure Sensors

【Quick Read】: This paper addresses four-way sleep position classification from bed-based pressure-sensitive mat (PSM) data in clinical settings, where the core challenge is the lack of large labeled datasets. The key is transfer learning: pre-trained deep models (ViTMAE, pre-trained on ImageNet via masked autoencoding, and ViTPose for human pose estimation) are adapted to a low-resolution PSM dataset to estimate sleep positions accurately, clearly outperforming a previous deep learning approach (TCN) as well as traditional machine learning models (SVM, XGBoost, Random Forest).

Link: https://arxiv.org/abs/2505.08111
Authors: Olivier Papillon, Rafik Goubran, James Green, Julien Larivière-Chartier, Caitlin Higginson, Frank Knoefel, Rébecca Robillard
Institutions: Carleton University; Bruyère Health Research Institute; University of Ottawa Institute for Mental Health Research at the Royal
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Conference publication submitted to IEEE I2MTC 2025

Abstract:Bed-based pressure-sensitive mats (PSMs) offer a non-intrusive way of monitoring patients during sleep. We focus on four-way sleep position classification using data collected from a PSM placed under a mattress in a sleep clinic. Sleep positions can affect sleep quality and the prevalence of sleep disorders, such as apnea. Measurements were performed on patients with suspected sleep disorders referred for assessments at a sleep clinic. Training deep learning models can be challenging in clinical settings due to the need for large amounts of labeled data. To overcome the shortage of labeled training data, we utilize transfer learning to adapt pre-trained deep learning models to accurately estimate sleep positions from a low-resolution PSM dataset collected in a polysomnography sleep lab. Our approach leverages Vision Transformer models pre-trained on ImageNet using masked autoencoding (ViTMAE) and a pre-trained model for human pose estimation (ViTPose). These approaches outperform previous work on PSM-based sleep pose classification using deep learning (TCN) as well as traditional machine learning models (SVM, XGBoost, Random Forest) that use engineered features. We evaluate the performance of sleep position classification on 112 nights of patient recordings and validate it on a higher-resolution 13-patient dataset. Despite the challenges of differentiating between sleep positions from low-resolution PSM data, our approach shows promise for real-world deployment in clinical settings.

[CV-75] Topology-Guided Knowledge Distillation for Efficient Point Cloud Processing

【Quick Read】: This paper addresses the challenge of deploying high-performance point cloud models such as Point Transformer V3 in resource-constrained environments, given their high computation and memory demands. The key is a novel distillation framework that combines topology-aware representations with gradient-guided knowledge distillation: it captures the underlying geometric structure of point clouds and selectively guides the lightweight student's learning through gradient-based feature alignment, enabling efficient knowledge transfer.

Link: https://arxiv.org/abs/2505.08101
Authors: Luu Tung Hai, Thinh D. Le, Zhicheng Ding, Qing Tian, Truong-Son Hy
Institutions: The University of Alabama at Birmingham, USA; Soongsil University, South Korea; Bowling Green State University, USA
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:

Abstract:Point cloud processing has gained significant attention due to its critical role in applications such as autonomous driving and 3D object recognition. However, deploying high-performance models like Point Transformer V3 in resource-constrained environments remains challenging due to their high computational and memory demands. This work introduces a novel distillation framework that leverages topology-aware representations and gradient-guided knowledge distillation to effectively transfer knowledge from a high-capacity teacher to a lightweight student model. Our approach captures the underlying geometric structures of point clouds while selectively guiding the student model's learning process through gradient-based feature alignment. Experimental results on the Nuscenes, SemanticKITTI, and Waymo datasets demonstrate that the proposed method achieves competitive performance, with an approximately 16x reduction in model size and a nearly 1.9x decrease in inference time compared to its teacher model. Notably, on NuScenes, our method achieves state-of-the-art performance among knowledge distillation techniques trained solely on LiDAR data, surpassing prior knowledge distillation baselines in segmentation performance. Our implementation is available publicly at: this https URL.

[CV-76] Multi-modal wound classification using wound image and location by Xception and Gaussian Mixture Recurrent Neural Network (GMRNN)

【Quick Read】: This paper aims to improve the accurate diagnosis of acute and hard-to-heal wounds so that wound care practitioners can provide effective patient care. Traditional diagnostics struggle with complications such as infection, peripheral vascular disease, and increasing wound depth, whereas AI-based diagnostic tools can speed up medical image interpretation and improve early disease detection. The key is a transfer learning (TL)-based multi-modal AI model that combines two state-of-the-art architectures, Xception and GMRNN, concatenating features extracted by the transfer learning algorithm with location features to classify common wound types (diabetic, pressure, surgical, and venous ulcers), markedly improving classification accuracy.

Link: https://arxiv.org/abs/2505.08086
Authors: Ramin Mousa, Ehsan Matbooe, Hakimeh Khojasteh, Amirali Bengari, Mohammadmahdi Vahediahmar
Institutions: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:The effective diagnosis of acute and hard-to-heal wounds is crucial for wound care practitioners to provide effective patient care. Poor clinical outcomes are often linked to infection, peripheral vascular disease, and increasing wound depth, which collectively exacerbate these comorbidities. However, diagnostic tools based on Artificial Intelligence (AI) speed up the interpretation of medical images and improve early detection of disease. In this article, we propose a multi-modal AI model based on transfer learning (TL), which combines two state-of-the-art architectures, Xception and GMRNN, for wound classification. The multi-modal network is developed by concatenating the features extracted by a transfer learning algorithm and location features to classify the wound types of diabetic, pressure, surgical, and venous ulcers. The proposed method is comprehensively compared with deep neural networks (DNN) for medical image analysis. The experimental results demonstrate notable wound-class classification accuracy (covering diabetic, pressure, surgical, and venous ulcers), varying from 78.77% to 100% across experiments. The results presented in this study showcase the exceptional accuracy of the proposed methodology in classifying the most commonly occurring wound types using wound images and their corresponding locations.

[CV-77] Visually Interpretable Subtask Reasoning for Visual Question Answering

【Quick Read】: This paper addresses the accuracy and interpretability of multi-step reasoning in complex visual question answering, covering challenges such as object recognition, attribute filtering, and relational understanding. The key is VISTAR (Visually Interpretable Subtask-Aware Reasoning Model), which improves both reasoning and interpretability by generating structured Subtask-of-Thought rationales inside multimodal large language models (MLLMs) rather than relying on external models, raising reasoning accuracy while preserving interpretability.

Link: https://arxiv.org/abs/2505.08084
Authors: Yu Cheng, Arushi Goel, Hakan Bilen
Institutions: University of Edinburgh; NVIDIA
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Answering complex visual questions like `Which red furniture can be used for sitting?’ requires multi-step reasoning, including object recognition, attribute filtering, and relational understanding. Recent work improves interpretability in multimodal large language models (MLLMs) by decomposing tasks into sub-task programs, but these methods are computationally expensive and less accurate due to poor adaptation to target data. To address this, we introduce VISTAR (Visually Interpretable Subtask-Aware Reasoning Model), a subtask-driven training framework that enhances both interpretability and reasoning by generating textual and visual explanations within MLLMs. Instead of relying on external models, VISTAR fine-tunes MLLMs to produce structured Subtask-of-Thought rationales (step-by-step reasoning sequences). Experiments on two benchmarks show that VISTAR consistently improves reasoning accuracy while maintaining interpretability. Our code and dataset will be available at this https URL.

[CV-78] Fréchet Power-Scenario Distance: A Metric for Evaluating Generative AI Models across Multiple Time-Scales in Smart Grids

【Quick Read】: This paper addresses quality assessment of synthetic data produced by generative AI in smart grids, where traditional Euclidean-distance-based metrics fail to capture quality differences between groups of synthetic datasets. The key is a new metric based on the Fréchet Distance (FD) between two datasets in a learned feature space, evaluating generation quality from a distributional perspective and improving the reliability of data-driven decision-making in smart grid operations.

Link: https://arxiv.org/abs/2505.08082
Authors: Yuting Cai, Shaohuai Liu, Chao Tian, Le Xie
Institutions: Texas A&M University; Harvard University
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Signal Processing (eess.SP)
Comments:

Abstract:Generative artificial intelligence (AI) models in smart grids have advanced significantly in recent years due to their ability to generate large amounts of synthetic data, which would otherwise be difficult to obtain in the real world due to confidentiality constraints. A key challenge in utilizing such synthetic data is how to assess the data quality produced from such generative models. Traditional Euclidean distance-based metrics only reflect pair-wise relations between two individual samples, and could fail in evaluating quality differences between groups of synthetic datasets. In this work, we propose a novel metric based on the Fréchet Distance (FD) estimated between two datasets in a learned feature space. The proposed method evaluates the quality of generation from a distributional perspective. Empirical results demonstrate the superiority of the proposed metric across timescales and models, enhancing the reliability of data-driven decision-making in smart grid operations.
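
For two feature sets modeled as Gaussians, the Fréchet distance has the closed form FD = ||mu1 - mu2||^2 + Tr(C1 + C2 - 2(C1 C2)^{1/2}), the same formula FID applies to Inception features. A minimal sketch, assuming features have already been extracted by some encoder:

```python
# Sketch: Fréchet Distance between two datasets in a learned feature space.
import numpy as np
from scipy.linalg import sqrtm

def frechet_distance(feats_a: np.ndarray, feats_b: np.ndarray) -> float:
    """feats_*: (N, D) feature arrays for the two datasets."""
    mu_a, mu_b = feats_a.mean(0), feats_b.mean(0)
    cov_a = np.cov(feats_a, rowvar=False)
    cov_b = np.cov(feats_b, rowvar=False)
    covmean = sqrtm(cov_a @ cov_b)
    if np.iscomplexobj(covmean):          # numerical noise -> tiny imaginary parts
        covmean = covmean.real
    diff = mu_a - mu_b
    return float(diff @ diff + np.trace(cov_a + cov_b - 2.0 * covmean))

real = np.random.randn(1000, 16)
fake = np.random.randn(1000, 16) + 0.5    # shifted distribution
print(frechet_distance(real, fake))       # grows with the shift
```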

[CV-79] RDD: Robust Feature Detector and Descriptor using Deformable Transformer

【Quick Read】: This paper addresses robust feature detection and description under challenging scenarios such as significant viewpoint changes, a core step in structure-from-motion and SLAM that remains unresolved. Existing methods emphasize local features for modeling geometric transformations but fail to learn visual cues from long-range relationships. The key is the Robust Deformable Detector (RDD), a novel keypoint detector/descriptor built on a deformable transformer: deformable self-attention captures global context and geometric invariance, effectively reducing search-space complexity while modeling geometric invariance.

Link: https://arxiv.org/abs/2505.08013
Authors: Gonglin Chen, Tianwen Fu, Haiwei Chen, Wenbin Teng, Hanyuan Xiao, Yajie Zhao
Institutions: Institute for Creative Technologies; University of Southern California
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:As a core step in structure-from-motion and SLAM, robust feature detection and description under challenging scenarios such as significant viewpoint changes remain unresolved despite their ubiquity. While recent works have identified the importance of local features in modeling geometric transformations, these methods fail to learn the visual cues present in long-range relationships. We present Robust Deformable Detector (RDD), a novel and robust keypoint detector/descriptor leveraging the deformable transformer, which captures global context and geometric invariance through deformable self-attention mechanisms. Specifically, we observed that deformable attention focuses on key locations, effectively reducing the search space complexity and modeling the geometric invariance. Furthermore, we collected an Air-to-Ground dataset for training in addition to the standard MegaDepth dataset. Our proposed method outperforms all state-of-the-art keypoint detection/description methods in sparse matching tasks and is also capable of semi-dense matching. To ensure comprehensive evaluation, we introduce two challenging benchmarks: one emphasizing large viewpoint and scale variations, and the other an Air-to-Ground benchmark, an evaluation setting that has recently gained popularity for 3D reconstruction across different altitudes.

[CV-80] Vision Foundation Model Embedding-Based Semantic Anomaly Detection ICRA2025

【Quick Read】: This paper addresses semantic anomalies, contextually invalid or unusual combinations of familiar visual elements that can cause undefined behavior and failures in system-level reasoning for autonomous systems. The key is to leverage the semantic priors of state-of-the-art vision foundation models, operating directly on images: local vision embeddings of runtime images are compared against a database of nominal scenarios in which the system is deemed safe and performant, enabling semantic anomaly detection. A simple filtering mechanism is additionally introduced to suppress false positives and improve robustness.

Link: https://arxiv.org/abs/2505.07998
Authors: Max Peter Ronecker, Matthew Foutter, Amine Elhafsi, Daniele Gammelli, Ihor Barakaiev, Marco Pavone, Daniel Watzenig
Institutions: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: Accepted for the Workshop "Safely Leveraging Vision-Language Foundation Models in Robotics: Challenges and Opportunities" at ICRA 2025

Abstract:Semantic anomalies are contextually invalid or unusual combinations of familiar visual elements that can cause undefined behavior and failures in system-level reasoning for autonomous systems. This work explores semantic anomaly detection by leveraging the semantic priors of state-of-the-art vision foundation models, operating directly on the image. We propose a framework that compares local vision embeddings from runtime images to a database of nominal scenarios in which the autonomous system is deemed safe and performant. In this work, we consider two variants of the proposed framework: one using raw grid-based embeddings, and another leveraging instance segmentation for object-centric representations. To further improve robustness, we introduce a simple filtering mechanism to suppress false positives. Our evaluations on CARLA-simulated anomalies show that the instance-based method with filtering achieves performance comparable to GPT-4o, while providing precise anomaly localization. These results highlight the potential utility of vision embeddings from foundation models for real-time anomaly detection in autonomous systems.
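
The nominal-database comparison reduces to a nearest-neighbor similarity check. A minimal sketch follows, assuming embeddings come from some vision foundation model (CLIP is one example); the similarity threshold is an illustrative value, not the paper's.

```python
# Sketch: flag a frame as anomalous when its best cosine similarity to
# the nominal-scenario database falls below a threshold (assumed value).
import torch
import torch.nn.functional as F

def flag_anomaly(runtime_emb, nominal_db, threshold=0.85):
    """
    runtime_emb: (D,) embedding of the current frame or patch.
    nominal_db:  (N, D) embeddings of known-nominal scenes.
    """
    runtime_emb = F.normalize(runtime_emb, dim=0)
    nominal_db = F.normalize(nominal_db, dim=1)
    best = (nominal_db @ runtime_emb).max()
    return best.item() < threshold, best.item()

db = torch.randn(200, 512)       # stand-in for precomputed nominal embeddings
frame = torch.randn(512)
is_anomaly, score = flag_anomaly(frame, db)
print(is_anomaly, round(score, 3))
```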

[CV-81] MilChat: Introducing Chain of Thought Reasoning and GRPO to a Multimodal Small Language Model for Remote Sensing

【Quick Read】: This paper addresses the limitations of multimodal large language models (MLLMs) in specialized domains, especially scenarios demanding resource efficiency and domain-specific adaptation. The key is MilChat, a lightweight multimodal language model trained with supervised fine-tuning on data annotated with chain-of-thought (CoT) reasoning, combined with Group Relative Policy Optimization (GRPO) to strengthen detection of critical domain-specific cues while reducing false positives on civilian scenes.

Link: https://arxiv.org/abs/2505.07984
Authors: Aybora Koksal, A. Aydin Alatan
Institutions: Middle East Technical University
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Submitted to JSTARS on April 2, 2025. Code and dataset will be available upon acceptance

Abstract:Remarkable capabilities in understanding and generating text-image content have been demonstrated by recent advancements in multimodal large language models (MLLMs). However, their effectiveness in specialized domains-particularly those requiring resource-efficient and domain-specific adaptations-has remained limited. In this work, a lightweight multimodal language model termed MilChat is introduced, specifically adapted to analyze remote sensing imagery in secluded areas, including challenging missile launch sites. A new dataset, MilData, was compiled by verifying hundreds of aerial images through expert review, and subtle military installations were highlighted via detailed captions. Supervised fine-tuning on a 2B-parameter open-source MLLM with chain-of-thought (CoT) reasoning annotations was performed, enabling more accurate and interpretable explanations. Additionally, Group Relative Policy Optimization (GRPO) was leveraged to enhance the model’s ability to detect critical domain-specific cues-such as defensive layouts and key military structures-while minimizing false positives on civilian scenes. Through empirical evaluations, it has been shown that MilChat significantly outperforms both larger, general-purpose multimodal models and existing remote sensing-adapted approaches on open-ended captioning and classification metrics. Over 80% recall and 98% precision were achieved on the newly proposed MilData benchmark, underscoring the potency of targeted fine-tuning and reinforcement learning in specialized real-world applications.

[CV-82] Monocular Online Reconstruction with Enhanced Detail Preservation

【Quick Read】: This paper addresses two key problems in real-time 3D Gaussian-based dense mapping from a monocular image stream: distributing Gaussians without relying on depth maps, and ensuring both local and global consistency of the reconstructed map. The key is two core modules: a Hierarchical Gaussian Management Module for effective Gaussian distribution and a Global Consistency Optimization Module that maintains alignment and coherence at all scales. The paper also presents Multi-level Occupancy Hash Voxels (MOHV), a structure that regularizes Gaussians across multiple levels of granularity, enabling accurate reconstruction of both fine and coarse geometry and texture while preserving intricate details and overall structural integrity.

Link: https://arxiv.org/abs/2505.07887
Authors: Songyin Wu, Zhaoyang Lv, Yufeng Zhu, Duncan Frost, Zhengqin Li, Ling-Qi Yan, Carl Ren, Richard Newcombe, Zhao Dong
Institutions: University of California Santa Barbara; Meta Reality Labs Research
Subjects: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:We propose an online 3D Gaussian-based dense mapping framework for photorealistic details reconstruction from a monocular image stream. Our approach addresses two key challenges in monocular online reconstruction: distributing Gaussians without relying on depth maps and ensuring both local and global consistency in the reconstructed maps. To achieve this, we introduce two key modules: the Hierarchical Gaussian Management Module for effective Gaussian distribution and the Global Consistency Optimization Module for maintaining alignment and coherence at all scales. In addition, we present the Multi-level Occupancy Hash Voxels (MOHV), a structure that regularizes Gaussians for capturing details across multiple levels of granularity. MOHV ensures accurate reconstruction of both fine and coarse geometries and textures, preserving intricate details while maintaining overall structural integrity. Compared to state-of-the-art RGB-only and even RGB-D methods, our framework achieves superior reconstruction quality with high computational efficiency. Moreover, it integrates seamlessly with various tracking systems, ensuring generality and scalability.

[CV-83] OMGM: Orchestrate Multiple Granularities and Modalities for Efficient Multimodal Retrieval

【Quick Read】: This paper addresses how to fuse multimodal information effectively to improve vision-language retrieval-augmented generation (RAG) for knowledge-based visual question answering (KB-VQA). The key is a coarse-to-fine, multi-step multimodal retrieval mechanism that harmonizes the multiple modalities and granularities present in both queries and the knowledge base, enabling more effective cross-modal retrieval and entity selection and thereby improving the final generation.

Link: https://arxiv.org/abs/2505.07879
Authors: Wei Yang, Jingjing Fu, Rui Wang, Jinyu Wang, Lei Song, Jiang Bian
Institutions: Microsoft Research Asia
Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: 19 pages, 6 figures, 17 tables

Abstract:Vision-language retrieval-augmented generation (RAG) has become an effective approach for tackling Knowledge-Based Visual Question Answering (KB-VQA), which requires external knowledge beyond the visual content presented in images. The effectiveness of Vision-language RAG systems hinges on multimodal retrieval, which is inherently challenging due to the diverse modalities and knowledge granularities in both queries and knowledge bases. Existing methods have not fully tapped into the potential interplay between these elements. We propose a multimodal RAG system featuring a coarse-to-fine, multi-step retrieval that harmonizes multiple granularities and modalities to enhance efficacy. Our system begins with a broad initial search aligning knowledge granularity for cross-modal retrieval, followed by a multimodal fusion reranking to capture the nuanced multimodal information for top entity selection. A text reranker then filters out the most relevant fine-grained section for augmented generation. Extensive experiments on the InfoSeek and Encyclopedic-VQA benchmarks show our method achieves state-of-the-art retrieval performance and highly competitive answering results, underscoring its effectiveness in advancing KB-VQA systems.

[CV-84] VIViT: Variable-Input Vision Transformer Framework for 3D MR Image Segmentation

【Quick Read】: This paper addresses the contrast diversity of real-world magnetic resonance (MR) studies caused by differing acquisition protocols, which challenges existing deep learning methods that require a fixed set of input modalities or contrasts for large-scale pretraining and for downstream tasks. The key is the proposed variable-input Vision Transformer (VIViT), a framework that adapts to the contrasts available in each study, maximizing data availability during pretraining and effectively transferring the learned knowledge to downstream tasks with different input requirements.

Link: https://arxiv.org/abs/2505.08693
Authors: Badhan Kumar Das, Ajay Singh, Gengyan Zhao, Han Liu, Thomas J. Re, Dorin Comaniciu, Eli Gibson, Andreas Maier
Institutions: Unknown
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: 9 pages

Abstract:Self-supervised pretraining techniques have been widely used to improve downstream task performance. However, real-world magnetic resonance (MR) studies usually consist of different sets of contrasts due to different acquisition protocols, which poses challenges for current deep learning methods on large-scale pretraining and on downstream tasks with different input requirements, since these methods typically require a fixed set of input modalities, or contrasts. To address this challenge, we propose the variable-input ViT (VIViT), a transformer-based framework designed for self-supervised pretraining and segmentation finetuning for variable contrasts in each study. With this ability, our approach can maximize data availability in pretraining, and can transfer the learned knowledge from pretraining to downstream tasks despite variations in input requirements. We validate our method on brain infarct and brain tumor segmentation, where our method outperforms current CNN and ViT-based models with mean Dice scores of 0.624 and 0.883, respectively. These results highlight the efficacy of our design for better adaptability and performance on tasks with real-world heterogeneous MR data.

[CV-85] A portable diagnosis model for Keratoconus using a smartphone

【Quick Read】: This paper addresses the accessibility problem of keratoconus (KC) diagnosis, which conventionally depends on specialized equipment, by proposing a portable smartphone-based diagnostic framework. The key is to display a Placido disc on the phone screen and capture its corneal reflections, followed by a two-stage detection pipeline: the first stage classifies KC stages with a weighted support vector machine (WSVM), and the second visualizes affected regions with color maps based on inter-disc distance, giving an intuitive picture of disease severity and location.

Link: https://arxiv.org/abs/2505.08616
Authors: Yifan Li, Myeongjun Kim, Yanjing Jin, Peter Ho, Jo Woon Chong
Institutions: Unknown
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:

Abstract:Keratoconus (KC) is a progressive corneal disorder characterized by localized thinning and protrusion, leading to visual distortion. While Placido disc-based topography remains a standard in clinical diagnostics, its dependence on specialized equipment limits accessibility. In this paper, we propose a portable, smartphone-based diagnostic framework that captures corneal reflections of a Placido disc displayed on a phone screen and applies a two-stage detection pipeline, then validates it on 3D-printed emulated eyeball models that simulate normal, moderate, and severe KC stages based on anterior chamber depth (ACD). The first step of the two-stage detection pipeline classifies different stages of KC with features including the height and width of extracted reflections using a weighted support vector machine (WSVM). It achieves a maximum accuracy of 92.93%, and maintains over 90% accuracy across multiple smartphone models, including the Galaxy Z Flip 3, iPhone 15 Pro, and iPhone 16 Pro. For the second step, we visualize the KC-affected protrusion regions on the corneas with color maps based on inter-disc distance, which provides an intuitive representation of disease severity and localization. Moreover, we validate the ability of the extracted features to differentiate between KC stages with ANOVA and Omega Squared, with significant p-values (e.g., p < 10^-6) and large effect sizes (ω² up to 0.8398) among classes.
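
The first-stage classifier is a standard weighted SVM over simple geometric features. A minimal sketch on synthetic (height, width) features, assuming three stage classes and illustrative class weights:

```python
# Sketch: weighted SVM over reflection height/width features (synthetic
# data; class weights and feature distributions are assumptions).
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
# Hypothetical (height, width) per ring; 0=normal, 1=moderate, 2=severe.
X = np.vstack([rng.normal(loc=m, scale=0.3, size=(100, 2))
               for m in (5.0, 4.0, 3.0)])
y = np.repeat([0, 1, 2], 100)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3,
                                          random_state=0)
clf = SVC(kernel="rbf", class_weight={0: 1.0, 1: 2.0, 2: 2.0})  # weighted SVM
clf.fit(X_tr, y_tr)
print(f"accuracy: {clf.score(X_te, y_te):.2%}")
```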

[CV-86] GNCAF: A GNN-based Neighboring Context Aggregation Framework for Tertiary Lymphoid Structures Semantic Segmentation in WSI

【Quick Read】: This paper addresses semantic segmentation of tertiary lymphoid structures (TLS) in whole slide images (WSI), i.e., simultaneously segmenting TLS regions and their maturation stages. Existing methods rely on cell proxy tasks and require extra post-processing steps, whereas this work formulates an end-to-end TLS semantic segmentation (TLS-SS) task. The key is a GNN-based Neighboring Context Aggregation Framework (GNCAF) that progressively aggregates multi-hop neighboring context around the target patch and uses a self-attention mechanism to guide its segmentation, strengthening the model's awareness of context beyond the local patch.

Link: https://arxiv.org/abs/2505.08430
Authors: Lei Su
Institutions: Unknown
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Tertiary lymphoid structures (TLS) are organized clusters of immune cells, whose maturity and area can be quantified in whole slide images (WSI) for various prognostic tasks. Existing methods for assessing these characteristics typically rely on cell proxy tasks and require additional post-processing steps. In this work, we focus on a novel task, TLS Semantic Segmentation (TLS-SS), which segments both the regions and maturation stages of TLS in WSI in an end-to-end manner. Due to the extensive scale of WSI and patch-based segmentation strategies, TLS-SS necessitates integrating context from neighboring patches to guide segmentation of the target patch (target). Previous techniques often employ multi-resolution approaches, constraining the capacity to leverage broader neighboring context while tending to preserve only coarse-grained information. To address this, we propose a GNN-based Neighboring Context Aggregation Framework (GNCAF), which progressively aggregates multi-hop neighboring context from the target and employs a self-attention mechanism to guide the segmentation of the target. GNCAF can be integrated with various segmentation models to enhance their ability to perceive contextual information outside of the patch. We build two TLS-SS datasets, called TCGA-COAD and INHOUSE-PAAD, and make the former (comprising 225 WSIs and 5041 TLSs) publicly available. Experiments on these datasets demonstrate the superiority of GNCAF, achieving a maximum of 22.08% and 26.57% improvement in mF1 and mIoU, respectively. Additionally, we also validate the task scalability of GNCAF on segmentation of lymph node metastases.

[CV-87] An integrated language-vision foundation model for conversational diagnostics and triaging in primary eye care

【Quick Read】: This paper addresses the task-specific nature and lack of user-friendly interfaces of current deep learning models by proposing Meta-EyeFM, a multi-function foundation model that integrates a large language model (LLM) with vision foundation models (VFMs) for ocular disease assessment. The key is a routing mechanism that enables accurate task-specific analysis from text queries, with the VFMs fine-tuned via Low-Rank Adaptation (LoRA) to detect ocular and systemic diseases, differentiate ocular disease severity, and identify common ocular signs.

Link: https://arxiv.org/abs/2505.08414
Authors: Zhi Da Soh, Yang Bai, Kai Yu, Yang Zhou, Xiaofeng Lei, Sahil Thakur, Zann Lee, Lee Ching Linette Phang, Qingsheng Peng, Can Can Xue, Rachel Shujuan Chong, Quan V. Hoang, Lavanya Raghavan, Yih Chung Tham, Charumathi Sabanayagam, Wei-Chi Wu, Ming-Chih Ho, Jiangnan He, Preeti Gupta, Ecosse Lamoureux, Seang Mei Saw, Vinay Nangia, Songhomitra Panda-Jonas, Jie Xu, Ya Xing Wang, Xinxing Xu, Jost B. Jonas, Tien Yin Wong, Rick Siow Mong Goh, Yong Liu, Ching-Yu Cheng
Institutions: Unknown
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Current deep learning models are mostly task specific and lack a user-friendly interface to operate. We present Meta-EyeFM, a multi-function foundation model that integrates a large language model (LLM) with vision foundation models (VFMs) for ocular disease assessment. Meta-EyeFM leverages a routing mechanism to enable accurate task-specific analysis based on text queries. Using Low Rank Adaptation, we fine-tuned our VFMs to detect ocular and systemic diseases, differentiate ocular disease severity, and identify common ocular signs. The model achieved 100% accuracy in routing fundus images to appropriate VFMs, which achieved ≥ 82.2% accuracy in disease detection, ≥ 89% in severity differentiation, and ≥ 76% in sign identification. Meta-EyeFM was 11% to 43% more accurate than the Gemini-1.5-flash and ChatGPT-4o LMMs in detecting various eye diseases and comparable to an ophthalmologist. This system offers enhanced usability and diagnostic performance, making it a valuable decision support tool for primary eye care or an online LLM for fundus evaluation.

[CV-88] Skeleton-Guided Diffusion Model for Accurate Foot X-ray Synthesis in Hallux Valgus Diagnosis

【Quick Read】: This paper addresses the difficulty of balancing image fidelity, skeletal consistency, and physical constraints when synthesizing medical images of foot deformities such as hallux valgus, a particular weakness of diffusion-based methods that lack skeletal guidance. The key is the proposed Skeletal-Constrained Conditional Diffusion Model (SCCDM), which improves the Structural Similarity Index (SSIM) and Peak Signal-to-Noise Ratio (PSNR) through multi-scale feature extraction and attention mechanisms, combined with KCC, a foot evaluation method based on skeletal landmarks, to achieve clinically applicable high-quality image synthesis.

Link: https://arxiv.org/abs/2505.08247
Authors: Midi Wan, Pengfei Li, Yizhuo Liang, Di Wu, Yushan Pan, Guangzhen Zhu, Hao Wang
Institutions: Unknown
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Medical image synthesis plays a crucial role in providing anatomically accurate images for diagnosis and treatment. Hallux valgus, which affects approximately 19% of the global population, requires frequent weight-bearing X-rays for assessment, placing additional strain on both patients and healthcare providers. Existing X-ray models often struggle to balance image fidelity, skeletal consistency, and physical constraints, particularly in diffusion-based methods that lack skeletal guidance. We propose the Skeletal-Constrained Conditional Diffusion Model (SCCDM) and introduce KCC, a foot evaluation method utilizing skeletal landmarks. SCCDM incorporates multi-scale feature extraction and attention mechanisms, improving the Structural Similarity Index (SSIM) by 5.72% (0.794) and Peak Signal-to-Noise Ratio (PSNR) by 18.34% (21.40 dB). When combined with KCC, the model achieves an average score of 0.85, demonstrating strong clinical applicability. The code is available at this https URL.

[CV-89] Image-Guided Microstructure Optimization using Diffusion Models: Validated with Li-Mn-rich Cathode Precursors

【Quick Read】: This paper addresses the difficulty of quantifying, predicting, and optimizing material microstructure, which hinders the design and control of material performance. The key is an image-centric closed-loop framework that integrates a diffusion-based image generation model, a quantitative image analysis pipeline, and a particle swarm optimization (PSO) algorithm: key morphological descriptors are extracted from scanning electron microscopy (SEM) images, enabling the predictive design and optimization of synthesis conditions and, in turn, controllable microstructure design.

Link: https://arxiv.org/abs/2505.07906
Authors: Geunho Choi, Changhwan Lee, Jieun Kim, Insoo Ye, Keeyoung Jung, Inchul Park
Institutions: Unknown
Subjects: Materials Science (cond-mat.mtrl-sci); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 37 pages, 10 figures

Abstract:Microstructure often dictates materials performance, yet it is rarely treated as an explicit design variable because microstructure is hard to quantify, predict, and optimize. Here, we introduce an image-centric, closed-loop framework that makes microstructural morphology into a controllable objective and demonstrate its use case with Li- and Mn-rich layered oxide cathode precursors. This work presents an integrated, AI-driven framework for the predictive design and optimization of lithium-ion battery cathode precursor synthesis. This framework integrates a diffusion-based image generation model, a quantitative image analysis pipeline, and a particle swarm optimization (PSO) algorithm. By extracting key morphological descriptors such as texture, sphericity, and median particle size (D50) from SEM images, the platform accurately predicts SEM-like morphologies resulting from specific coprecipitation conditions, including reaction time-, solution concentration-, and pH-dependent structural changes. Optimization then pinpoints synthesis parameters that yield user-defined target morphologies, as experimentally validated by the close agreement between predicted and synthesized structures. This framework offers a practical strategy for data-driven materials design, enabling both forward prediction and inverse design of synthesis conditions and paving the way toward autonomous, image-guided microstructure engineering.
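
The optimization step is a textbook PSO loop once a descriptor predictor exists. In the sketch below the predictor is a toy surrogate standing in for the paper's diffusion model plus image analysis pipeline; bounds, targets, and PSO coefficients are illustrative assumptions.

```python
# Sketch: PSO over synthesis parameters (time, concentration, pH) to hit
# target morphological descriptors (texture, sphericity, D50).
import numpy as np

def predict_descriptors(params):
    t, conc, ph = params                  # toy surrogate, not the real model
    return np.array([0.1 * t, 0.5 * conc, 2.0 * ph])

target = np.array([1.2, 0.8, 20.0])       # desired texture, sphericity, D50

def fitness(params):
    return np.linalg.norm(predict_descriptors(params) - target)

rng = np.random.default_rng(1)
n, dims = 20, 3
pos = rng.uniform([1, 0.1, 7], [24, 3.0, 12], size=(n, dims))  # assumed bounds
vel = np.zeros_like(pos)
pbest, pbest_f = pos.copy(), np.array([fitness(p) for p in pos])
gbest = pbest[pbest_f.argmin()].copy()

for _ in range(100):
    r1, r2 = rng.random((n, dims)), rng.random((n, dims))
    vel = 0.7 * vel + 1.5 * r1 * (pbest - pos) + 1.5 * r2 * (gbest - pos)
    pos += vel
    f = np.array([fitness(p) for p in pos])
    improved = f < pbest_f
    pbest[improved], pbest_f[improved] = pos[improved], f[improved]
    gbest = pbest[pbest_f.argmin()].copy()

print("best parameters (time, concentration, pH):", np.round(gbest, 3))
```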

[CV-90] Computationally Efficient Diffusion Models in Medical Imaging: A Comprehensive Review

【Quick Read】: This paper addresses the high computational cost of training and sampling diffusion models, focusing on efficiency and inference time in both natural and medical imaging. The key is a systematic categorization and analysis of three major diffusion models, the Denoising Diffusion Probabilistic Model (DDPM), the Latent Diffusion Model (LDM), and the Wavelet Diffusion Model (WDM), examining their computational complexity and performance advantages across application scenarios and pointing to directions for improving the efficiency and quality of medical image generation.

Link: https://arxiv.org/abs/2505.07866
Authors: Abdullah, Tao Huang, Ickjai Lee, Euijoon Ahn
Institutions: James Cook University
Subjects: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: 36 pages, 6 figures

Abstract:The diffusion model has recently emerged as a potent approach in computer vision, demonstrating remarkable performances in the field of generative artificial intelligence. Capable of producing high-quality synthetic images, diffusion models have been successfully applied across a range of applications. However, a significant challenge remains with the high computational cost associated with training and generating these models. This study focuses on the efficiency and inference time of diffusion-based generative models, highlighting their applications in both natural and medical imaging. We present the most recent advances in diffusion models by categorizing them into three key models: the Denoising Diffusion Probabilistic Model (DDPM), the Latent Diffusion Model (LDM), and the Wavelet Diffusion Model (WDM). These models play a crucial role in medical imaging, where producing fast, reliable, and high-quality medical images is essential for accurate analysis of abnormalities and disease diagnosis. We first investigate the general framework of DDPM, LDM, and WDM and discuss the computational complexity gap filled by these models in natural and medical imaging. We then discuss the current limitations of these models as well as the opportunities and future research directions in medical imaging.
zh
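
上述三类模型中最基础的 DDPM,其前向过程为 x_t = sqrt(ᾱ_t)·x_0 + sqrt(1-ᾱ_t)·ε。下面给出该前向加噪步骤的最小示意实现(非论文代码,噪声调度等超参数均为常见假设值):

```python
import torch

# DDPM 前向过程:x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * eps
T = 1000
betas = torch.linspace(1e-4, 0.02, T)           # 线性噪声调度(假设值)
alphas_bar = torch.cumprod(1.0 - betas, dim=0)  # 累积乘积 alpha_bar_t

def q_sample(x0: torch.Tensor, t: torch.Tensor) -> torch.Tensor:
    """对干净图像 x0 在任意时刻 t 一步采样出带噪图像 x_t。"""
    eps = torch.randn_like(x0)
    a_bar = alphas_bar[t].view(-1, 1, 1, 1)
    return a_bar.sqrt() * x0 + (1.0 - a_bar).sqrt() * eps

x0 = torch.randn(8, 1, 64, 64)                  # 一批假设的医学影像
t = torch.randint(0, T, (8,))
xt = q_sample(x0, t)                            # 训练去噪网络时的输入
```

训练时即以 (x_t, t) 为输入让网络预测噪声 ε;LDM 与 WDM 大体沿用同一骨架,只是分别把扩散过程搬到潜空间与小波域中以降低计算开销。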

[CV-91] Pose Estimation for Intra-cardiac Echocardiography Catheter via AI-Based Anatomical Understanding

【速读】:该论文旨在解决心腔内超声(Intra-cardiac Echocardiography, ICE)在电生理(EP)和结构性心脏病(SHD)介入手术中导航精度不足的问题,现有方法依赖易受干扰的电磁(EM)跟踪或依赖操作员经验的手动调整。解决方案的关键在于提出一种基于视觉Transformer(ViT)的解剖感知位姿估计系统,该系统仅通过ICE图像确定导管的位置和方向,无需外部跟踪传感器,从而实现无传感器的实时定位。

链接: https://arxiv.org/abs/2505.07851
作者: Jaeyoung Huh,Ankur Kapoor,Young-Ho Kim
机构: Siemens Healthineers(西门子医疗); Princeton, NJ, USA(美国新泽西州普林斯顿)
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
备注:

点击查看摘要

Abstract:Intra-cardiac Echocardiography (ICE) plays a crucial role in Electrophysiology (EP) and Structural Heart Disease (SHD) interventions by providing high-resolution, real-time imaging of cardiac structures. However, existing navigation methods rely on electromagnetic (EM) tracking, which is susceptible to interference and position drift, or require manual adjustments based on operator expertise. To overcome these limitations, we propose a novel anatomy-aware pose estimation system that determines the ICE catheter position and orientation solely from ICE images, eliminating the need for external tracking sensors. Our approach leverages a Vision Transformer (ViT)-based deep learning model, which captures spatial relationships between ICE images and anatomical structures. The model is trained on a clinically acquired dataset of 851 subjects, including ICE images paired with position and orientation labels normalized to the left atrium (LA) mesh. ICE images are patchified into 16x16 embeddings and processed through a transformer network, where a [CLS] token independently predicts position and orientation via separate linear layers. The model is optimized using a Mean Squared Error (MSE) loss function, balancing positional and orientational accuracy. Experimental results demonstrate an average positional error of 9.48 mm and orientation errors of (16.13 deg, 8.98 deg, 10.47 deg) across the x, y, and z axes, confirming the model's accuracy. Qualitative assessments further validate alignment between predicted and target views within 3D cardiac meshes. This AI-driven system enhances procedural efficiency, reduces operator workload, and enables real-time ICE catheter localization for tracking-free procedures. The proposed method can function independently or complement existing mapping systems like CARTO, offering a transformative approach to ICE-guided interventions.
zh
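
摘要中"[CLS] token 经两个独立线性层分别回归位置与朝向、以 MSE 联合优化"的设计,可用如下最小示意代码表达(维度、欧拉角表示与权重 lam 均为假设,并非论文原实现):

```python
import torch
import torch.nn as nn

class ICEPoseHead(nn.Module):
    """取 ViT 输出的 [CLS] 向量,经两个独立线性层回归位置与朝向。"""
    def __init__(self, dim: int = 768):
        super().__init__()
        self.pos_head = nn.Linear(dim, 3)   # (x, y, z) 位置
        self.ori_head = nn.Linear(dim, 3)   # 三个旋转角(此处假设用欧拉角表示)

    def forward(self, cls_token: torch.Tensor):
        return self.pos_head(cls_token), self.ori_head(cls_token)

head = ICEPoseHead()
cls = torch.randn(4, 768)                   # 假设为 ViT 编码后的 [CLS] 特征
pos_pred, ori_pred = head(cls)
pos_gt, ori_gt = torch.randn(4, 3), torch.randn(4, 3)
lam = 1.0                                   # 平衡位置/朝向两项误差的假设超参
loss = nn.functional.mse_loss(pos_pred, pos_gt) \
     + lam * nn.functional.mse_loss(ori_pred, ori_gt)
```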

[CV-92] Evaluation of UAV-Based RGB and Multispectral Vegetation Indices for Precision Agriculture in Palm Tree Cultivation

【速读】:该论文旨在解决精准农业中植被健康监测的高成本问题,通过评估无人机(UAV)搭载的成像技术在棕榈树种植区的应用,提出一种更具成本效益的解决方案。研究的关键在于证明基于RGB图像的植被指数在植被分类和胁迫检测方面能够达到与多光谱指数相当的性能,从而为大规模农业监测提供经济可行的替代方案。

链接: https://arxiv.org/abs/2505.07840
作者: Alavikunhu Panthakkan,S M Anzar,K. Sherin,Saeed Al Mansoori,Hussain Al-Ahmad
机构: College of Engineering and IT, University of Dubai, U.A.E.; TKM College of Engineering, Kollam, India-691 005; MES College of Engineering, Kuttippuram, India-679 582; Mohammed Bin Rashid Space Centre (MBRSC), UAE
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Precision farming relies on accurate vegetation monitoring to enhance crop productivity and promote sustainable agricultural practices. This study presents a comprehensive evaluation of UAV-based imaging for vegetation health assessment in a palm tree cultivation region in Dubai. By comparing multispectral and RGB image data, we demonstrate that RGB-based vegetation indices offer performance comparable to more expensive multispectral indices, providing a cost-effective alternative for large-scale agricultural monitoring. Using UAVs equipped with multispectral sensors, indices such as NDVI and SAVI were computed to categorize vegetation into healthy, moderate, and stressed conditions. Simultaneously, RGB-based indices like VARI and MGRVI delivered similar results in vegetation classification and stress detection. Our findings highlight the practical benefits of integrating RGB imagery into precision farming, reducing operational costs while maintaining accuracy in plant health monitoring. This research underscores the potential of UAV-based RGB imaging as a powerful tool for precision agriculture, enabling broader adoption of data-driven decision-making in crop management. By leveraging the strengths of both multispectral and RGB imaging, this work advances the state of UAV applications in agriculture, paving the way for more efficient and scalable farming solutions.
zh
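
文中对比的几个植被指数都是逐像素的简单代数运算,可按标准公式写成如下示意函数(输入波段数组与分档阈值均为假设):

```python
import numpy as np

def ndvi(nir, red):
    """NDVI = (NIR - Red) / (NIR + Red),需要近红外波段(多光谱)。"""
    return (nir - red) / (nir + red + 1e-8)

def savi(nir, red, L=0.5):
    """SAVI 在 NDVI 基础上引入土壤调节因子 L(常取 0.5)。"""
    return (nir - red) * (1 + L) / (nir + red + L + 1e-8)

def vari(red, green, blue):
    """VARI = (G - R) / (G + R - B),只需普通 RGB 影像。"""
    return (green - red) / (green + red - blue + 1e-8)

def mgrvi(red, green):
    """MGRVI = (G^2 - R^2) / (G^2 + R^2),同样只需 RGB。"""
    return (green**2 - red**2) / (green**2 + red**2 + 1e-8)

# 假设的 100x100 反射率图块;按阈值把植被分为胁迫/中等/健康三档(阈值为假设值)
r, g, b, n = (np.random.rand(100, 100) for _ in range(4))
rgb_index = vari(r, g, b)
ms_index = ndvi(n, r)
classes = np.digitize(rgb_index, bins=[0.0, 0.2])   # 0=胁迫, 1=中等, 2=健康
```

论文的结论正是:仅用 VARI、MGRVI 这类 RGB 指数做上述分档,即可逼近 NDVI、SAVI 等多光谱指数的效果,从而省去昂贵的多光谱载荷。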

人工智能

[AI-0] ARC-NCA: Towards Developmental Solutions to the Abstraction and Reasoning Corpus

【速读】:该论文试图解决人工通用智能(AGI)中的基础挑战,即在仅有少量(中位数为三个)正确示例的情况下,使人工智能系统具备跨多样化任务的鲁棒抽象与推理能力。解决方案的关键在于引入ARC-NCA方法,该方法基于标准神经细胞自动机(NCA)及其增强型变体——带有隐式记忆的NCA(EngramNCA),利用其模拟复杂动态和涌现模式的能力,模仿生物系统中的发育过程,从而提升AI的适应性推理与抽象能力。

链接: https://arxiv.org/abs/2505.08778
作者: Etienne Guichard,Felix Reimers,Mia Kvalsund,Mikkel Lepperød,Stefano Nichele
机构: 未知
类目: Artificial Intelligence (cs.AI); Neural and Evolutionary Computing (cs.NE)
备注:

点击查看摘要

Abstract:The Abstraction and Reasoning Corpus (ARC), later renamed ARC-AGI, poses a fundamental challenge in artificial general intelligence (AGI), requiring solutions that exhibit robust abstraction and reasoning capabilities across diverse tasks, while only a few (with a median count of three) correct examples are presented. While ARC-AGI remains very challenging for artificial intelligence systems, it is rather easy for humans. This paper introduces ARC-NCA, a developmental approach leveraging standard Neural Cellular Automata (NCA) and NCA enhanced with hidden memories (EngramNCA) to tackle the ARC-AGI benchmark. NCAs are employed for their inherent ability to simulate complex dynamics and emergent patterns, mimicking developmental processes observed in biological systems. Developmental solutions may offer a promising avenue for enhancing AI's problem-solving capabilities beyond mere training data extrapolation. ARC-NCA demonstrates how integrating developmental principles into computational models can foster adaptive reasoning and abstraction. We show that our ARC-NCA proof-of-concept results may be comparable to, and sometimes surpass, those of ChatGPT 4.5, at a fraction of the cost.
zh
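
标准 NCA 的一步更新可概括为"邻域感知卷积 + 小型 MLP + 随机掩码的残差更新"。下面是一个与论文无关的最小示意(通道数、更新率等均为假设,EngramNCA 的隐式记忆未包含在内):

```python
import torch
import torch.nn as nn

class MinimalNCA(nn.Module):
    """神经细胞自动机的单步更新:每个细胞依据邻域感知结果局部修改自身状态。"""
    def __init__(self, channels: int = 16):
        super().__init__()
        # 感知层:3x3 深度卷积,近似 Sobel 等可学习的邻域滤波
        self.perceive = nn.Conv2d(channels, channels * 3, 3, padding=1, groups=channels)
        self.update = nn.Sequential(
            nn.Conv2d(channels * 3, 64, 1), nn.ReLU(),
            nn.Conv2d(64, channels, 1),
        )

    def forward(self, state: torch.Tensor, fire_rate: float = 0.5):
        dx = self.update(self.perceive(state))
        # 随机掩码:每步只有部分细胞更新,模拟异步的"发育"过程
        mask = (torch.rand_like(state[:, :1]) < fire_rate).float()
        return state + dx * mask

nca = MinimalNCA()
grid = torch.rand(1, 16, 10, 10)     # 假设为一个 ARC 任务网格的状态编码
for _ in range(20):                  # 迭代若干步,让目标模式"生长"出来
    grid = nca(grid)
```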

[AI-1] DeepMath-Creative: A Benchmark for Evaluating Mathematical Creativity of Large Language Models

【速读】:该论文试图解决当前大型语言模型(Large Language Models, LLMs)在数学创造性问题解决能力方面的评估不足与性能局限问题。尽管现有数学LLMs在推理能力方面取得进展,但其创造性能力尚未得到充分研究,且缺乏高质量的评估数据集。为应对这一问题,研究者提出了数学创造力的评估标准,并引入了DeepMath-Creative,一个涵盖代数、几何、分析等多个领域的高质量基准数据集。该研究通过系统评估主流LLMs在该数据集上的表现,揭示了当前模型在处理复杂构造性问题时的显著不足,表明其在低难度任务中的表现可能更多依赖于记忆模式的重组,而非真正的创造性洞察或新颖性综合。

链接: https://arxiv.org/abs/2505.08744
作者: Xiaoyang Chen,Xinan Dai,Yu Du,Qian Feng,Naixu Guo,Tingshuo Gu,Yuting Gao,Yingyi Gao,Xudong Han,Xiang Jiang,Yilin Jin,Hongyi Lin,Shisheng Lin,Xiangnan Li,Yuante Li,Yixing Li,Zhentao Lai,Zilu Ma,Yingrong Peng,Jiacheng Qian,Hao-Yu Sun,Jianbo Sun,Zirui Wang,Siwei Wu,Zian Wang,Bin Xu,Jianghao Xu,Yiyang Yu,Zichuan Yang,Hongji Zha,Ruichong Zhang
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 14 pages, 4 figures

点击查看摘要

Abstract:To advance the mathematical proficiency of large language models (LLMs), the DeepMath team has launched an open-source initiative aimed at developing an open mathematical LLM and systematically evaluating its mathematical creativity. This paper represents the initial contribution of this initiative. While recent developments in mathematical LLMs have predominantly emphasized reasoning skills, as evidenced by benchmarks on elementary to undergraduate-level mathematical tasks, the creative capabilities of these models have received comparatively little attention, and evaluation datasets remain scarce. To address this gap, we propose evaluation criteria for mathematical creativity and introduce DeepMath-Creative, a novel, high-quality benchmark comprising constructive problems across algebra, geometry, analysis, and other domains. We conduct a systematic evaluation of mainstream LLMs' creative problem-solving abilities using this dataset. Experimental results show that even under lenient scoring criteria – emphasizing core solution components and disregarding minor inaccuracies, such as small logical gaps, incomplete justifications, or redundant explanations – the best-performing model, O3 Mini, achieves merely 70% accuracy, primarily on basic undergraduate-level constructive tasks. Performance declines sharply on more complex problems, with models failing to provide substantive strategies for open problems. These findings suggest that, although current LLMs display a degree of constructive proficiency on familiar and lower-difficulty problems, such performance is likely attributable to the recombination of memorized patterns rather than authentic creative insight or novel synthesis.
zh

[AI-2] Securing RAG : A Risk Assessment and Mitigation Framework

【速读】:该论文旨在解决Retrieval Augmented Generation (RAG)系统在集成敏感数据时所面临的新型安全与隐私挑战。其解决方案的关键在于构建一个结合RAG特定安全考量与现有通用安全指南、行业标准及最佳实践的框架,以指导实现稳健、合规、安全且可信赖的RAG系统。

链接: https://arxiv.org/abs/2505.08728
作者: Lukas Ammann,Sara Ott,Christoph R. Landolt,Marco P. Lehmann
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
备注: 8 pages, 3 figures, Sara Ott and Lukas Ammann contributed equally

点击查看摘要

Abstract:Retrieval Augmented Generation (RAG) has emerged as the de facto industry standard for user-facing NLP applications, offering the ability to integrate data without re-training or fine-tuning Large Language Models (LLMs). This capability enhances the quality and accuracy of responses but also introduces novel security and privacy challenges, particularly when sensitive data is integrated. With the rapid adoption of RAG, securing data and services has become a critical priority. This paper first reviews the vulnerabilities of RAG pipelines, and outlines the attack surface from data pre-processing and data storage management to integration with LLMs. The identified risks are then paired with corresponding mitigations in a structured overview. In a second step, the paper develops a framework that combines RAG-specific security considerations, with existing general security guidelines, industry standards, and best practices. The proposed framework aims to guide the implementation of robust, compliant, secure, and trustworthy RAG systems.
zh

[AI-3] PWC-MoE: Privacy-Aware Wireless Collaborative Mixture of Experts

【速读】:该论文旨在解决在带宽受限环境下部署大型语言模型(Large Language Models, LLMs)时面临的隐私保护与计算性能之间的矛盾问题。传统方法通过将LLMs托管在云端减轻了本地设备的计算和存储负担,但导致敏感数据传输引发隐私风险,并且需要大量通信带宽,这在资源受限环境中难以实现。而本地运行的小型语言模型(Small Language Models, SLMs)虽然提升了隐私性,但在复杂任务上的性能有限。为了解决这一问题,本文提出了一种隐私感知的无线协同专家混合(Privacy-Aware Wireless Collaborative Mixture of Experts, PWC-MoE)框架,其关键在于通过一个稀疏的隐私感知门控网络动态地将敏感令牌路由至本地客户端的隐私专家,而非敏感令牌则被路由至远程基站的非隐私专家,同时结合分组负载均衡机制和带宽自适应的重要性感知令牌卸载策略,以实现计算效率、模型性能与隐私保护的平衡。

链接: https://arxiv.org/abs/2505.08719
作者: Yang Su,Na Yan,Yansha Deng,Robert Schober
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Large language models (LLMs) hosted on cloud servers alleviate the computational and storage burdens on local devices but raise privacy concerns due to sensitive data transmission and require substantial communication bandwidth, which is challenging in constrained environments. In contrast, small language models (SLMs) running locally enhance privacy but suffer from limited performance on complex tasks. To balance computational cost, performance, and privacy protection under bandwidth constraints, we propose a privacy-aware wireless collaborative mixture of experts (PWC-MoE) framework. Specifically, PWC-MoE employs a sparse privacy-aware gating network to dynamically route sensitive tokens to privacy experts located on local clients, while non-sensitive tokens are routed to non-privacy experts located at the remote base station. To achieve computational efficiency, the gating network ensures that each token is dynamically routed to and processed by only one expert. To enhance scalability and prevent overloading of specific experts, we introduce a group-wise load-balancing mechanism for the gating network that evenly distributes sensitive tokens among privacy experts and non-sensitive tokens among non-privacy experts. To adapt to bandwidth constraints while preserving model performance, we propose a bandwidth-adaptive and importance-aware token offloading scheme. This scheme incorporates an importance predictor to evaluate the importance scores of non-sensitive tokens, prioritizing the most important tokens for transmission to the base station based on their predicted importance and the available bandwidth. Experiments demonstrate that the PWC-MoE framework effectively preserves privacy and maintains high performance even in bandwidth-constrained environments, offering a practical solution for deploying LLMs in privacy-sensitive and bandwidth-limited scenarios.
zh
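
其中"稀疏隐私感知门控:敏感 token 留在本地隐私专家、非敏感 token 路由到基站专家,且每个 token 只经过一个专家"的核心逻辑,可用如下示意代码表达(敏感性判别与专家均为假设的占位实现,未包含负载均衡与带宽自适应卸载):

```python
import torch
import torch.nn as nn

class PrivacyAwareGate(nn.Module):
    """按敏感性打分,将每个 token 路由到本地(隐私)或远端(非隐私)专家之一。"""
    def __init__(self, dim: int = 256):
        super().__init__()
        self.score = nn.Linear(dim, 1)            # 敏感性打分器(假设由训练得到)
        self.local_expert = nn.Linear(dim, dim)   # 本地客户端上的隐私专家
        self.remote_expert = nn.Linear(dim, dim)  # 基站上的非隐私专家

    def forward(self, tokens: torch.Tensor):
        sensitive = torch.sigmoid(self.score(tokens)).squeeze(-1) > 0.5
        out = torch.empty_like(tokens)
        # 每个 token 只被一个专家处理,保证计算效率
        out[sensitive] = self.local_expert(tokens[sensitive])
        # 非敏感 token 在实际部署中需经无线信道上传到基站后处理
        out[~sensitive] = self.remote_expert(tokens[~sensitive])
        return out, sensitive

gate = PrivacyAwareGate()
x = torch.randn(32, 256)          # 一个序列的 32 个 token 表示
y, routed_local = gate(x)
```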

[AI-4] VizCV: AI-assisted visualization of researchers publications tracks

【速读】:该论文旨在解决如何有效分析科学家和研究团队的出版记录演变以评估其专业能力的问题,从而支持学术环境的管理及职业规划与评估。其解决方案的关键在于提出VizCV,一个基于Web的端到端可视化分析框架,通过AI辅助分析实现对科研人员科学轨迹的交互式探索,重点涵盖研究主题演变、出版记录与影响以及合作动态三个维度,并结合人工智能驱动的洞察,提供自动化的职业转型解释与比较分析功能。

链接: https://arxiv.org/abs/2505.08691
作者: Vladimír Lazárik,Marco Agus,Barbora Kozlíková,Pere-Pau Vázquez
机构: 未知
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
备注: 11 pages, 9 figures. Submitted

点击查看摘要

Abstract:Analyzing how the publication records of scientists and research groups have evolved over the years is crucial for assessing their expertise since it can support the management of academic environments by assisting with career planning and evaluation. We introduce VizCV, a novel web-based end-to-end visual analytics framework that enables the interactive exploration of researchers' scientific trajectories. It incorporates AI-assisted analysis and supports automated reporting of career evolution. Our system aims to model career progression through three key dimensions: a) research topic evolution to detect and visualize shifts in scholarly focus over time, b) publication record and the corresponding impact, c) collaboration dynamics depicting the growth and transformation of a researcher's co-authorship network. AI-driven insights provide automated explanations of career transitions, detecting significant shifts in research direction, impact surges, or collaboration expansions. The system also supports comparative analysis between researchers, allowing users to compare topic trajectories and impact growth. Our interactive, multi-tab and multi-view system allows for the exploratory analysis of career milestones under different perspectives, such as the most impactful articles, emerging research themes, or obtaining a detailed analysis of the contribution of the researcher in a subfield. The key contributions include AI/ML techniques for: a) topic analysis, b) dimensionality reduction for visualizing patterns and trends, c) the interactive creation of textual descriptions of facets of data through configurable prompt generation and large language models, including key indicators, to help in understanding the career development of individuals or groups.
zh

[AI-5] AC-PKAN: Attention-Enhanced and Chebyshev Polynomial-Based Physics-Informed Kolmogorov-Arnold Networks

【速读】:该论文旨在解决Kolmogorov-Arnold Networks (KANs)在求解偏微分方程(PDEs)时存在的计算与内存开销大以及表达能力受限的问题。尽管Chebyshev Type-I-based KANs(Chebyshev1KANs)已有所改进,但其仍存在秩坍塌问题,限制了模型的表达能力。解决方案的关键在于通过集成小波激活的多层感知机(MLPs)和可学习参数,结合内部注意力机制,确保雅可比矩阵的满秩性,并提升对任意阶PDE的逼近能力;同时引入外部残差梯度注意力(RGA)机制以缓解由切比雪夫多项式基引起的损失不稳定与不平衡问题。通过内、外部注意力的联合应用,提出了AC-PKAN架构,显著增强了弱监督物理信息神经网络(PINNs)的表达能力。

链接: https://arxiv.org/abs/2505.08687
作者: Hangwei Zhang,Zhimu Huang,Yan Wang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Kolmogorov-Arnold Networks (KANs) have recently shown promise for solving partial differential equations (PDEs). Yet their original formulation is computationally and memory intensive, motivating the introduction of Chebyshev Type-I-based KANs (Chebyshev1KANs). Although Chebyshev1KANs have outperformed the vanilla KANs architecture, our rigorous theoretical analysis reveals that they still suffer from rank collapse, ultimately limiting their expressive capacity. To overcome these limitations, we enhance Chebyshev1KANs by integrating wavelet-activated MLPs with learnable parameters and an internal attention mechanism. We prove that this design preserves a full-rank Jacobian and is capable of approximating solutions to PDEs of arbitrary order. Furthermore, to alleviate the loss instability and imbalance introduced by the Chebyshev polynomial basis, we externally incorporate a Residual Gradient Attention (RGA) mechanism that dynamically re-weights individual loss terms according to their gradient norms and residual magnitudes. By jointly leveraging internal and external attention, we present AC-PKAN, a novel architecture that constitutes an enhancement to weakly supervised Physics-Informed Neural Networks (PINNs) and extends the expressive power of KANs. Experimental results from nine benchmark tasks across three domains show that AC-PKAN consistently outperforms or matches state-of-the-art models such as PINNsFormer, establishing it as a highly effective tool for solving complex real-world engineering problems in zero-data or data-sparse regimes. The code will be made publicly available upon acceptance.
zh
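
Chebyshev1KAN 的基本层可理解为:对 tanh 压缩后的输入逐维展开第一类切比雪夫多项式基,再学习系数做线性组合;外部的 RGA 机制则按各损失项的梯度范数动态加权。下面给出一个与论文细节无关的最小示意(多项式阶数与加权公式均为假设):

```python
import torch
import torch.nn as nn

class Chebyshev1Layer(nn.Module):
    """y_j = sum_{i,k} W[i,j,k] * T_k(tanh(x_i)),T_k 为第一类切比雪夫多项式。"""
    def __init__(self, in_dim, out_dim, degree: int = 4):
        super().__init__()
        self.degree = degree
        self.coeff = nn.Parameter(torch.randn(in_dim, out_dim, degree + 1) * 0.1)

    def forward(self, x):
        x = torch.tanh(x)                       # 压缩到 [-1, 1],切比雪夫基的定义域
        T = [torch.ones_like(x), x]
        for _ in range(2, self.degree + 1):
            T.append(2 * x * T[-1] - T[-2])     # 递推 T_k = 2x*T_{k-1} - T_{k-2}
        basis = torch.stack(T, dim=-1)          # (batch, in_dim, degree+1)
        return torch.einsum('bik,iok->bo', basis, self.coeff)

layer = Chebyshev1Layer(2, 16)
u = layer(torch.randn(8, 2))                    # 假设输入为 PDE 的坐标 (x, t)

def rga_weights(losses, params):
    """RGA 思想的示意:按各损失项的梯度范数重新加权(具体公式为假设)。"""
    norms = []
    for L in losses:
        g = torch.autograd.grad(L, params, retain_graph=True)
        norms.append(torch.sqrt(sum((gi ** 2).sum() for gi in g)))
    total = sum(norms)
    return [total / (n + 1e-8) / len(norms) for n in norms]   # 梯度越小权重越大

params = list(layer.parameters())
losses = [u.pow(2).mean(), (u - 1).pow(2).mean()]   # 假设的残差项与边界项
w = rga_weights(losses, params)
loss = sum(wi * Li for wi, Li in zip(w, losses))
```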

[AI-6] A Mamba-based Network for Semi-supervised Singing Melody Extraction Using Confidence Binary Regularization

【速读】:该论文旨在解决歌唱旋律提取(Singing Melody Extraction, SME)任务中存在的三个主要问题:现有模型使用Transformer架构导致计算复杂度高、效率低;传统方法依赖频率监督来估计基频(f0),忽略了音乐表演基于音符的本质;以及SME任务缺乏足够的标注数据。其解决方案的关键在于提出一种基于Mamba的网络结构SpectMamba,通过引入视觉Mamba实现线性计算复杂度,并设计一种新的音符-基频解码器以更好地模拟音乐表演,同时引入置信度二值正则化(Confidence Binary Regularization, CBR)模块,利用未标注数据提升模型性能。

链接: https://arxiv.org/abs/2505.08681
作者: Xiaoliang He,Kangjie Dong,Jingkai Cao,Shuai Yu,Wei Li,Yi Yu
机构: 未知
类目: ound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
备注:

点击查看摘要

Abstract:Singing melody extraction (SME) is a key task in the field of music information retrieval. However, existing methods face several limitations: firstly, prior models use transformers to capture the contextual dependencies, which requires quadratic computation, resulting in low efficiency at the inference stage. Secondly, prior works typically rely on frequency-supervised methods to estimate the fundamental frequency (f0), which ignores that musical performance is actually based on notes. Thirdly, transformers typically require large amounts of labeled data to achieve optimal performance, but the SME task lacks sufficient annotated data. To address these issues, in this paper, we propose a Mamba-based network, called SpectMamba, for semi-supervised singing melody extraction using confidence binary regularization. In particular, we begin by introducing vision mamba to achieve linear computational complexity. Then, we propose a novel note-f0 decoder that allows the model to better mimic the musical performance. Further, to alleviate the scarcity of the labeled data, we introduce a confidence binary regularization (CBR) module to leverage the unlabeled data by maximizing the probability of the correct classes. The proposed method is evaluated on several public datasets and the conducted experiments demonstrate the effectiveness of our proposed method.
zh
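
摘要并未给出 CBR 模块的具体公式;按"在无标注数据上最大化可信类别概率"的直观理解,一种常见的伪标签式实现如下(阈值与损失形式均为假设,仅作原理示意):

```python
import torch
import torch.nn.functional as F

def cbr_loss(logits_unlabeled: torch.Tensor, tau: float = 0.95) -> torch.Tensor:
    """对无标注帧:若某类别置信度超过阈值 tau,则将其当作伪标签并最大化其概率。"""
    probs = logits_unlabeled.softmax(dim=-1)
    conf, pseudo = probs.max(dim=-1)
    mask = conf > tau                         # 只保留高置信度的帧
    if not mask.any():
        return logits_unlabeled.sum() * 0.0   # 没有可信帧时返回零损失(保持计算图)
    return F.cross_entropy(logits_unlabeled[mask], pseudo[mask])

logits = torch.randn(128, 361)                # 假设共 361 个音高/音符类别
loss_u = cbr_loss(logits)                     # 与有标注损失加权合并即可
```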

[AI-7] A Study of Data-driven Methods for Inventory Optimization

【速读】:该论文试图解决数据驱动下的库存管理优化问题,具体是通过分析时间序列、随机森林(Random Forest)和深度强化学习三种算法在三种库存模型(缺货模型、双源供应模型和多级库存模型)中的应用效果,评估其在预测准确性、市场适应性以及对库存成本和客户满意度的影响。解决方案的关键在于利用数据可视化工具和统计指标进行算法性能的对比分析,从而揭示不同算法在实际应用场景中的有效性及潜在挑战,为库存管理决策提供科学依据。

链接: https://arxiv.org/abs/2505.08673
作者: Lee Yeung Ping,Patrick Wong,Tan Cheng Han
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:This paper presents a comprehensive analysis of three algorithms (Time Series, Random Forest (RF) and Deep Reinforcement Learning) applied to three inventory models (the Lost Sales, Dual-Sourcing and Multi-Echelon Inventory Model). These methodologies are applied in the supermarket context. The main purpose is to analyse efficient methods for data-driven inventory management; their possibilities, potential and current challenges are taken into consideration in this report. By comparing the results in each model, the effectiveness of each algorithm is evaluated based on several key performance indicators, including forecast accuracy, adaptability to market changes, and overall impact on inventory costs and customer satisfaction levels. Data visualization tools and statistical metrics serve as the indicators for these comparisons and reveal clear trends and patterns that can guide decision-making in inventory management. These tools enable managers not only to track the performance of different algorithms in real time but also to drill down into specific data points to understand the underlying causes of inventory fluctuations. This level of detail is crucial for pinpointing inefficiencies and areas for improvement within the supply chain.
zh

[AI-8] A Social Robot with Inner Speech for Dietary Guidance

【速读】:该论文试图解决在医疗场景中,社会机器人提供饮食建议时透明度和信任度不足的问题。解决方案的关键在于引入内省言语(inner speech)机制,通过使机器人的推理过程显性化,提升其解释能力,从而增强用户对机器人决策的信任。该系统整合了大语言模型与知识图谱,以实现自然语言理解和结构化饮食信息的处理,进而生成清晰的决策依据。

链接: https://arxiv.org/abs/2505.08664
作者: Valerio Belcamino,Alessandro Carfì,Valeria Seidita,Fulvio Mastrogiovanni,Antonio Chella
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:We explore the use of inner speech as a mechanism to enhance transparency and trust in social robots for dietary advice. In humans, inner speech structures thought processes and decision-making; in robotics, it improves explainability by making reasoning explicit. This is crucial in healthcare scenarios, where trust in robotic assistants depends on both accurate recommendations and human-like dialogue, which makes interactions more natural and engaging. Building on this, we developed a social robot that provides dietary advice, and we provided the architecture with inner speech capabilities to validate user input, refine reasoning, and generate clear justifications. The system integrates large language models for natural language understanding and a knowledge graph for structured dietary information. By making decisions more transparent, our approach strengthens trust and improves human-robot interaction in healthcare. We validated this by measuring the computational efficiency of our architecture and conducting a small user study, which assessed the reliability of inner speech in explaining the robot's behavior.
zh

[AI-9] A Comparative Study of Human Activity Recognition: Motion Tactile and multi-modal Approaches

【速读】:该论文旨在解决人类活动识别(Human Activity Recognition, HAR)在人机协作(Human-Robot Collaboration, HRC)中的准确性问题,通过评估基于视觉的触觉传感器与惯性测量单元(IMU)数据手套的分类性能,并提出一种融合触觉与运动数据的多模态框架。解决方案的关键在于利用触觉和运动数据的互补优势,通过多模态分类(Multi-Modal Classification, MMC)方法提升HAR系统的整体性能,实验结果表明该方法在控制条件和连续动作序列中均优于单一模态方法。

链接: https://arxiv.org/abs/2505.08657
作者: Valerio Belcamino,Nhat Minh Dinh Le,Quan Khanh Luu,Alessandro Carfì,Van Anh Ho,Fulvio Mastrogiovanni
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Human activity recognition (HAR) is essential for effective Human-Robot Collaboration (HRC), enabling robots to interpret and respond to human actions. This study evaluates the ability of a vision-based tactile sensor to classify 15 activities, comparing its performance to an IMU-based data glove. Additionally, we propose a multi-modal framework combining tactile and motion data to leverage their complementary strengths. We examined three approaches: motion-based classification (MBC) using IMU data, tactile-based classification (TBC) with single or dual video streams, and multi-modal classification (MMC) integrating both. Offline validation on segmented datasets assessed each configuration’s accuracy under controlled conditions, while online validation on continuous action sequences tested online performance. Results showed the multi-modal approach consistently outperformed single-modality methods, highlighting the potential of integrating tactile and motion sensing to enhance HAR systems for collaborative robotics.
zh

[AI-10] WixQA: A Multi-Dataset Benchmark for Enterprise Retrieval-Augmented Generation

【速读】:该论文旨在解决企业问答(QA)系统在真实场景中评估的挑战,特别是针对需要基于具体领域知识库(KB)生成答案的端到端检索增强生成(RAG)系统缺乏合适基准的问题。解决方案的关键在于引入WixQA基准套件,其包含三个精确基于已发布KB语料库的QA数据集,能够全面评估检索与生成组件,从而为企业RAG系统的实际应用提供可靠的评估框架。

链接: https://arxiv.org/abs/2505.08643
作者: Dvir Cohen,Lin Burg,Sviatoslav Pykhnivskyi,Hagit Gur,Stanislav Kovynov,Olga Atzmon,Gilad Barkan
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Retrieval-Augmented Generation (RAG) is a cornerstone of modern question answering (QA) systems, enabling grounded answers based on external knowledge. Although recent progress has been driven by open-domain datasets, enterprise QA systems need datasets that mirror the concrete, domain-specific issues users raise in day-to-day support scenarios. Critically, evaluating end-to-end RAG systems requires benchmarks comprising not only question–answer pairs but also the specific knowledge base (KB) snapshot from which answers were derived. To address this need, we introduce WixQA, a benchmark suite featuring QA datasets precisely grounded in the released KB corpus, enabling holistic evaluation of retrieval and generation components. WixQA includes three distinct QA datasets derived from this http URL customer support interactions and grounded in a snapshot of the public Wix Help Center KB: (i) WixQA-ExpertWritten, 200 real user queries with expert-authored, multi-step answers; (ii) WixQA-Simulated, 200 expert-validated QA pairs distilled from user dialogues; and (iii) WixQA-Synthetic, 6,222 LLM-generated QA pairs, with one pair systematically derived from each article in the knowledge base. We release the KB snapshot alongside the datasets under MIT license and provide comprehensive baseline results, forming a unique benchmark for evaluating enterprise RAG systems in realistic enterprise environments.
zh

[AI-11] Integrating Natural Language Processing and Exercise Monitoring for Early Diagnosis of Metabolic Syndrome: A Deep Learning Approach

【速读】:该论文试图解决代谢综合征(MetS)在临床中常被低估的问题,从而导致患者未能及时获得必要的护理。其解决方案的关键在于利用日常生活中易于获取的最少生理数据和与运动相关活动的自由文本,通过结合自然语言处理(NLP)和运动监测的深度学习框架进行MetS的分类诊断。该方法旨在提高MetS的早期检测效率,降低筛查和管理成本。

链接: https://arxiv.org/abs/2505.08628
作者: Yichen Zhao,Yuhua Wang,Xi Cheng,Junhao Fang,Yang Yang
机构: 未知
类目: Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
备注:

点击查看摘要

Abstract:Metabolic syndrome (MetS) is a medical condition characterized by abdominal obesity, insulin resistance, hypertension and hyperlipidemia. It increases the risk of the majority of chronic diseases, including type 2 diabetes mellitus, and affects about one quarter of the global population. Therefore, early detection and timely intervention for MetS are crucial. Standard diagnosis of MetS components requires blood tests conducted within medical institutions. However, it is frequently underestimated, leading to an unmet need for care in the MetS population. This study aims to use the least physiological data and free texts about exercise-related activities, which are obtained easily in daily life, to diagnose MetS. We collected data from 40 volunteers in a nursing home and used data augmentation to reduce the imbalance. We propose a deep learning framework for classifying MetS that integrates natural language processing (NLP) and exercise monitoring. The results showed that the best model reported a high positive result (AUROC=0.806 and REC=76.3%) through 3-fold cross-validation. Feature importance analysis revealed that text and minimum daily heart rate contribute the most to the classification of MetS. This study demonstrates the potential application of data that are easily measurable in daily life for the early diagnosis of MetS, which could contribute to reducing the cost of screening and management for the MetS population.
zh

[AI-12] Resource-Efficient Language Models: Quantization for Fast and Accessible Inference

【速读】:该论文旨在解决大型语言模型(Large Language Models, LLMs)在推理效率上的问题,特别是其对硬件资源和能耗的高需求。论文提出的解决方案关键在于后训练量化(Post-Training Quantization, PTQ)技术,通过减少模型的精度以优化推理效率,同时在量化方案、粒度及性能与压缩之间的权衡方面进行了系统性综述。

链接: https://arxiv.org/abs/2505.08620
作者: Tollef Emil Jørgensen
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 17 pages, 9 figures, preprint

点击查看摘要

Abstract:Large language models have significantly advanced natural language processing, yet their heavy resource demands pose severe challenges regarding hardware accessibility and energy consumption. This paper presents a focused and high-level review of post-training quantization (PTQ) techniques designed to optimize the inference efficiency of LLMs by the end-user, including details on various quantization schemes, granularities, and trade-offs. The aim is to provide a balanced overview between the theory and applications of post-training quantization.
zh
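
作为文中 PTQ 的一个最小例子,下面演示对称逐张量 int8 量化与反量化(仅为原理示意;量化方案与粒度的取舍正是该综述讨论的重点):

```python
import torch

def quantize_int8(w: torch.Tensor):
    """对称逐张量量化:scale = max|w| / 127,w_q = round(w / scale)。"""
    scale = w.abs().max() / 127.0
    w_q = torch.clamp(torch.round(w / scale), -127, 127).to(torch.int8)
    return w_q, scale

def dequantize(w_q: torch.Tensor, scale: torch.Tensor):
    return w_q.float() * scale

w = torch.randn(4096, 4096) * 0.02            # 假设的一层线性权重
w_q, s = quantize_int8(w)
err = (dequantize(w_q, s) - w).abs().mean()   # 量化误差;逐通道/逐组等更细粒度误差更小
print(f"int8 存储约为 fp32 的 1/4,平均绝对误差: {err:.6f}")
```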

[AI-13] MINIMALIST: switched-capacitor circuits for efficient in-memory computation of gated recurrent units

【速读】:该论文旨在解决在资源受限的嵌入式边缘计算环境中高效处理时序数据的问题,特别是针对传统循环神经网络(Recurrent Neural Networks, RNNs)在硬件实现上的效率与兼容性挑战。其解决方案的关键在于提出一种基于最小门控循环单元(Gated Recurrent Units, GRUs)的简化且硬件兼容的架构,并结合高效的混合信号硬件实现。该设计利用开关电容电路不仅实现存内计算(In-Memory Computation, IMC),还用于门控状态更新,从而实现了低功耗、高可扩展性的硬件部署。

链接: https://arxiv.org/abs/2505.08599
作者: Sebastian Billaudelle,Laura Kriener,Filippo Moro,Tristan Torchet,Melika Payvand
机构: 未知
类目: Hardware Architecture (cs.AR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Signal Processing (eess.SP)
备注:

点击查看摘要

Abstract:Recurrent neural networks (RNNs) have been a long-standing candidate for processing of temporal sequence data, especially in memory-constrained systems that one may find in embedded edge computing environments. Recent advances in training paradigms have now inspired new generations of efficient RNNs. We introduce a streamlined and hardware-compatible architecture based on minimal gated recurrent units (GRUs), and an accompanying efficient mixed-signal hardware implementation of the model. The proposed design leverages switched-capacitor circuits not only for in-memory computation (IMC), but also for the gated state updates. The mixed-signal cores rely solely on commodity circuits consisting of metal capacitors, transmission gates, and a clocked comparator, thus greatly facilitating scaling and transfer to other technology nodes. We benchmark the performance of our architecture on time series data, introducing all constraints required for a direct mapping to the hardware system. The direct compatibility is verified in mixed-signal simulations, reproducing data recorded from the software-only network model.
zh
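
最小门控循环单元常见的单门形式如下;论文的贡献在于用开关电容电路同时实现存内矩阵运算与这一门控状态更新,此处仅给出算法层面的示意(维度为假设):

```python
import torch
import torch.nn as nn

class MinimalGRUCell(nn.Module):
    """最小门控循环单元的一种常见形式:只保留一个门 f_t(电路实现见原文)。"""
    def __init__(self, in_dim, hid_dim):
        super().__init__()
        self.gate = nn.Linear(in_dim + hid_dim, hid_dim)
        self.cand = nn.Linear(in_dim + hid_dim, hid_dim)

    def forward(self, x, h):
        f = torch.sigmoid(self.gate(torch.cat([x, h], dim=-1)))        # 更新门
        h_tilde = torch.tanh(self.cand(torch.cat([x, f * h], dim=-1))) # 候选状态
        return (1 - f) * h + f * h_tilde                               # 门控状态更新

cell = MinimalGRUCell(8, 32)
h = torch.zeros(1, 32)
for x in torch.randn(100, 1, 8):     # 一条长度为 100 的时间序列
    h = cell(x, h)
```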

[AI-14] From Seeing to Doing: Bridging Reasoning and Decision for Robotic Manipulation

【速读】:该论文旨在解决机器人操作中泛化能力不足的问题,特别是在面对未见过的场景和新任务时。现有视觉-语言-动作(Vision-Language-Action, VLA)模型由于具身数据集中的数据稀缺性和异质性,难以实现稳健的零样本性能。论文提出的解决方案关键在于FSD(From Seeing to Doing)模型,该模型通过空间关系推理生成中间表示,为机器人操作提供细粒度指导,其核心创新包括分层数据流水线与自洽机制,以对齐空间坐标与视觉信号。

链接: https://arxiv.org/abs/2505.08548
作者: Yifu Yuan,Haiqin Cui,Yibin Chen,Zibin Dong,Fei Ni,Longxin Kou,Jinyi Liu,Pengyi Li,Yan Zheng,Jianye Hao
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Early version

点击查看摘要

Abstract:Achieving generalization in robotic manipulation remains a critical challenge, particularly for unseen scenarios and novel tasks. Current Vision-Language-Action (VLA) models, while building on top of general Vision-Language Models (VLMs), still fall short of achieving robust zero-shot performance due to the scarcity and heterogeneity prevalent in embodied datasets. To address these limitations, we propose FSD (From Seeing to Doing), a novel vision-language model that generates intermediate representations through spatial relationship reasoning, providing fine-grained guidance for robotic manipulation. Our approach combines a hierarchical data pipeline for training with a self-consistency mechanism that aligns spatial coordinates with visual signals. Through extensive experiments, we comprehensively validated FSD’s capabilities in both “seeing” and “doing,” achieving outstanding performance across 8 benchmarks for general spatial reasoning and embodied reference abilities, as well as on our proposed more challenging benchmark VABench. We also verified zero-shot capabilities in robot manipulation, demonstrating significant performance improvements over baseline methods in both SimplerEnv and real robot settings. Experimental results show that FSD achieves 54.1% success rate in SimplerEnv and 72% success rate across 8 real-world tasks, outperforming the strongest baseline by 30%.
zh

[AI-15] Guiding LLM -based Smart Contract Generation with Finite State Machine

【速读】:该论文试图解决传统智能合约生成方法依赖人工编码和专家审计所带来的高门槛与低效率问题,以及大型语言模型(Large Language Models, LLMs)在智能合约生成中的有效性与安全性挑战。解决方案的关键在于提出一种基于有限状态机(Finite State Machine, FSM)与LLMs的智能合约生成框架FSM-SCG,通过将用户需求抽象为FSM来引导LLMs生成智能合约,并结合编译与安全检查的反馈进行代码迭代优化,从而显著提升生成代码的质量。

链接: https://arxiv.org/abs/2505.08542
作者: Hao Luo,Yuhao Lin,Xiao Yan,Xintong Hu,Yuxiang Wang,Qiming Zeng,Hao Wang,Jiawei Jiang
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Smart contracts are self-executing code based on blockchain technology with a wide range of application scenarios, but the traditional generation method relies on manual coding and expert auditing, which has a high barrier to entry and low efficiency. Although Large Language Models (LLMs) show great potential in programming tasks, they still face challenges in smart contract generation w.r.t. effectiveness and security. To solve these problems, we propose FSM-SCG, a smart contract generation framework based on finite state machines (FSM) and LLMs, which significantly improves the quality of the generated code by abstracting user requirements to generate an FSM, guiding LLMs to generate smart contracts, and iteratively optimizing the code with feedback from compilation and security checks. The experimental results show that FSM-SCG significantly improves the quality of smart contract generation. Compared to the best baseline, FSM-SCG improves the compilation success rate of generated smart contract code by up to 48%, and reduces the average vulnerability risk score by approximately 68%.
zh

[AI-16] he Truth Becomes Clearer Through Debate! Multi-Agent Systems with Large Language Models Unmask Fake News SIGIR2025

【速读】:该论文旨在解决虚假新闻在社交媒体中快速传播所带来的社会挑战,特别是现有检测方法在可解释性和泛化能力方面的不足。其解决方案的关键在于引入一种基于大型语言模型(LLM)的多智能体系统——TruEDebate (TED),通过模拟辩论过程增强虚假新闻检测的可解释性与有效性。TED的核心创新包括两个组件:辩论流代理(DebateFlow Agents)和洞察流代理(InsightFlow Agents),前者通过组织正反方辩论对新闻的真实性进行深入评估,后者则通过合成与分析代理实现对辩论内容的综合判断与角色感知建模。

链接: https://arxiv.org/abs/2505.08532
作者: Yuhan Liu,Yuxuan Liu,Xiaoqing Zhang,Xiuying Chen,Rui Yan
机构: 未知
类目: ocial and Information Networks (cs.SI); Artificial Intelligence (cs.AI)
备注: SIGIR 2025

点击查看摘要

Abstract:In today’s digital environment, the rapid propagation of fake news via social networks poses significant social challenges. Most existing detection methods either employ traditional classification models, which suffer from low interpretability and limited generalization capabilities, or craft specific prompts for large language models (LLMs) to produce explanations and results directly, failing to leverage LLMs’ reasoning abilities fully. Inspired by the saying that “truth becomes clearer through debate,” our study introduces a novel multi-agent system with LLMs named TruEDebate (TED) to enhance the interpretability and effectiveness of fake news detection. TED employs a rigorous debate process inspired by formal debate settings. Central to our approach are two innovative components: the DebateFlow Agents and the InsightFlow Agents. The DebateFlow Agents organize agents into two teams, where one supports and the other challenges the truth of the news. These agents engage in opening statements, cross-examination, rebuttal, and closing statements, simulating a rigorous debate process akin to human discourse analysis, allowing for a thorough evaluation of news content. Concurrently, the InsightFlow Agents consist of two specialized sub-agents: the Synthesis Agent and the Analysis Agent. The Synthesis Agent summarizes the debates and provides an overarching viewpoint, ensuring a coherent and comprehensive evaluation. The Analysis Agent, which includes a role-aware encoder and a debate graph, integrates role embeddings and models the interactions between debate roles and arguments using an attention mechanism, providing the final judgment.
zh

[AI-17] ExEBench: Benchmarking Foundation Models on Extreme Earth Events

【速读】:该论文旨在解决基础模型(Foundation Models, FMs)在极端事件中的可靠性问题,特别是在面对极端值时模型可能因训练数据中的偏差而表现不佳。其解决方案的关键在于引入 ExEBench(Extreme Earth Benchmark),这是一个涵盖七类极端事件的基准数据集,包括洪水、野火、风暴、热带气旋、极端降水、热浪和寒潮,具有全球覆盖范围、多样的数据量和来源,以及不同的空间、时间和光谱特征。该数据集还包含多个与极端事件检测、监测和预测紧密相关的挑战性机器学习任务,以推动新型机器学习方法在灾害管理中的应用,并促进对地球系统在气候变化背景下的理解。

链接: https://arxiv.org/abs/2505.08529
作者: Shan Zhao,Zhitong Xiong,Jie Zhao,Xiao Xiang Zhu
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Our planet is facing increasingly frequent extreme events, which pose major risks to human lives and ecosystems. Recent advances in machine learning (ML), especially with foundation models (FMs) trained on extensive datasets, excel in extracting features and show promise in disaster management. Nevertheless, these models often inherit biases from training data, challenging their performance over extreme values. To explore the reliability of FMs in the context of extreme events, we introduce ExEBench (Extreme Earth Benchmark), a collection of seven extreme event categories across floods, wildfires, storms, tropical cyclones, extreme precipitation, heatwaves, and cold waves. The dataset features global coverage, varying data volumes, and diverse data sources with different spatial, temporal, and spectral characteristics. To broaden the real-world impact of FMs, we include multiple challenging ML tasks that are closely aligned with operational needs in extreme event detection, monitoring, and forecasting. ExEBench aims to (1) assess FM generalizability across diverse, high-impact tasks and domains, (2) promote the development of novel ML methods that benefit disaster management, and (3) offer a platform for analyzing the interactions and cascading effects of extreme events to advance our understanding of the Earth system, especially under the climate change expected in the decades to come. The dataset and code are publicly available at this https URL.
zh

[AI-18] On the Complexity and Properties of Preferential Propositional Dependence Logic

【速读】:该论文旨在探讨在命题逻辑结合团队语义和依赖原子的框架下,KLM风格的优先推理(preferential reasoning)的复杂性与性质。其关键在于分析优先推理是否满足System P,并通过给出直观条件来完全刻画优先命题依赖逻辑满足System P的情形。研究发现,这些条件并不适用于基于团队的优先推理,同时揭示了经典蕴含与依赖逻辑蕴含如何在非平凡的优先模型中表达,并提出了两种自然表示下的优先团队推理复杂性,其中包括对经典(非团队基础)优先推理的新复杂性结果。

链接: https://arxiv.org/abs/2505.08522
作者: Kai Sauerwald,Arne Meier,Juha Kontinen
机构: 未知
类目: Artificial Intelligence (cs.AI); Logic in Computer Science (cs.LO)
备注:

点击查看摘要

Abstract:This paper considers the complexity and properties of KLM-style preferential reasoning in the setting of propositional logic with team semantics and dependence atoms, also known as propositional dependence logic. Preferential team-based reasoning is shown to be cumulative, yet violates System P. We give intuitive conditions that fully characterise those cases where preferential propositional dependence logic satisfies System P. We show that these characterisations do, surprisingly, not carry over to preferential team-based propositional logic. Furthermore, we show how classical entailment and dependence logic entailment can be expressed in terms of non-trivial preferential models. Finally, we present the complexity of preferential team-based reasoning for two natural representations. This includes novel complexity results for classical (non-team-based) preferential reasoning.
zh

[AI-19] Learning Advanced Self-Attention for Linear Transformers in the Singular Value Domain IJCAI25

【速读】:该论文试图解决传统自注意力机制在图信号处理框架下被限制为仅使用一阶多项式矩阵,从而表现为低通滤波器,无法有效利用多频段信息的问题。解决方案的关键在于提出一种名为Attentive Graph Filter (AGF)的新方法,从图信号处理的角度出发,将自注意力解释为在奇异值域中学习图滤波器,该方法具有与输入长度n线性相关的复杂度(即O(nd^2)),并在多个任务中实现了最先进的性能。

链接: https://arxiv.org/abs/2505.08516
作者: Hyowon Wi,Jeongwhan Choi,Noseong Park
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: IJCAI25 Accepted

点击查看摘要

Abstract:Transformers have demonstrated remarkable performance across diverse domains. The key component of Transformers is self-attention, which learns the relationship between any two tokens in the input sequence. Recent studies have revealed that the self-attention can be understood as a normalized adjacency matrix of a graph. Notably, from the perspective of graph signal processing (GSP), the self-attention can be equivalently defined as a simple graph filter, applying GSP using the value vector as the signal. However, the self-attention is a graph filter defined with only the first order of the polynomial matrix, and acts as a low-pass filter preventing the effective leverage of various frequency information. Consequently, existing self-attention mechanisms are designed in a rather simplified manner. Therefore, we propose a novel method, called Attentive Graph Filter (AGF), interpreting the self-attention as learning the graph filter in the singular value domain from the perspective of graph signal processing for directed graphs, with linear complexity w.r.t. the input length n, i.e., O(nd^2). In our experiments, we demonstrate that AGF achieves state-of-the-art performance on various tasks, including the Long Range Arena benchmark and time series classification.
zh
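
摘要中"自注意力等价于只取一阶多项式的图滤波器"这一 GSP 视角,可用下面的示意代码对照说明:普通注意力相当于 A·V,而 k 阶多项式滤波器为 Σ θ_k A^k V。注意此处仅演示多项式视角,并非 AGF 在奇异值域的线性复杂度实现(系数与维度均为假设):

```python
import torch

def attention_as_graph_filter(Q, K, V, coeffs):
    """把归一化注意力矩阵 A 视作图邻接矩阵,施加多项式图滤波 sum_k theta_k * A^k @ V。"""
    A = torch.softmax(Q @ K.transpose(-2, -1) / Q.shape[-1] ** 0.5, dim=-1)
    out, Ak = coeffs[0] * V, V
    for theta in coeffs[1:]:
        Ak = A @ Ak                # 逐阶累乘:A^k @ V
        out = out + theta * Ak
    return out

n, d = 64, 32
Q, K, V = (torch.randn(n, d) for _ in range(3))
vanilla = attention_as_graph_filter(Q, K, V, coeffs=[0.0, 1.0])        # 即普通自注意力 A·V
filtered = attention_as_graph_filter(Q, K, V, coeffs=[0.5, 0.3, 0.2])  # 高阶滤波,可利用更多频段
```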

[AI-20] rialMatchAI: An End-to-End AI-powered Clinical Trial Recommendation System to Streamline Patient-to-Trial Matching

【速读】:该论文试图解决临床试验中患者招募效率低下的问题,这一问题已成为制约临床研究进展的主要瓶颈。其解决方案的关键在于构建一个基于生成式AI(Generative AI)的推荐系统——TrialMatchAI,该系统通过处理结构化和非结构化的临床数据,实现患者与临床试验的自动化匹配。其核心技术包括在检索增强生成框架下微调的开源大语言模型(LLM),以及结合词法和语义相似性的混合搜索策略,辅以医学Chain-of-Thought推理进行纳入标准的细粒度评估,从而实现可解释且可追溯的决策过程。

链接: https://arxiv.org/abs/2505.08508
作者: Majd Abdallah,Sigve Nakken,Mariska Bierkens,Johanna Galvis,Alexis Groppi,Slim Karkar,Lana Meiqari,Maria Alexandra Rujano,Steve Canham,Rodrigo Dienstmann,Remond Fijneman,Eivind Hovig,Gerrit Meijer,Macha Nikolski
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Quantitative Methods (q-bio.QM)
备注:

点击查看摘要

Abstract:Patient recruitment remains a major bottleneck in clinical trials, calling for scalable and automated solutions. We present TrialMatchAI, an AI-powered recommendation system that automates patient-to-trial matching by processing heterogeneous clinical data, including structured records and unstructured physician notes. Built on fine-tuned, open-source large language models (LLMs) within a retrieval-augmented generation framework, TrialMatchAI ensures transparency and reproducibility and maintains a lightweight deployment footprint suitable for clinical environments. The system normalizes biomedical entities, retrieves relevant trials using a hybrid search strategy combining lexical and semantic similarity, re-ranks results, and performs criterion-level eligibility assessments using medical Chain-of-Thought reasoning. This pipeline delivers explainable outputs with traceable decision rationales. In real-world validation, 92 percent of oncology patients had at least one relevant trial retrieved within the top 20 recommendations. Evaluation across synthetic and real clinical datasets confirmed state-of-the-art performance, with expert assessment validating over 90 percent accuracy in criterion-level eligibility classification, particularly excelling in biomarker-driven matches. Designed for modularity and privacy, TrialMatchAI supports Phenopackets-standardized data, enables secure local deployment, and allows seamless replacement of LLM components as more advanced models emerge. By enhancing efficiency and interpretability and offering lightweight, open-source deployment, TrialMatchAI provides a scalable solution for AI-driven clinical trial matching in precision medicine.
zh
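
文中"词法与语义相似度相结合的混合检索"是 RAG 系统的常见做法;一个最小的打分融合示意如下(打分函数、示例向量与权重 alpha 均为假设,真实系统中词法一路通常换成 BM25、语义一路由嵌入模型产生):

```python
import math
from collections import Counter

def lexical_score(query: str, doc: str) -> float:
    """极简词法相似度:归一化词项重叠。"""
    q, d = Counter(query.lower().split()), Counter(doc.lower().split())
    overlap = sum((q & d).values())
    return overlap / (math.sqrt(sum(q.values()) * sum(d.values())) + 1e-8)

def semantic_score(q_vec, d_vec) -> float:
    """语义相似度:句向量余弦。"""
    dot = sum(a * b for a, b in zip(q_vec, d_vec))
    nq = math.sqrt(sum(a * a for a in q_vec))
    nd = math.sqrt(sum(b * b for b in d_vec))
    return dot / (nq * nd + 1e-8)

def hybrid_score(query, doc, q_vec, d_vec, alpha=0.5):
    # alpha 为词法/语义两路的融合权重,实践中在验证集上调节
    return alpha * lexical_score(query, doc) + (1 - alpha) * semantic_score(q_vec, d_vec)

query = "EGFR mutation lung cancer trial"
docs = {"trial A": "phase II trial for EGFR-mutant lung cancer",
        "trial B": "observational cardiology device study"}
vecs = {"trial A": [0.1, 0.8, 0.2], "trial B": [0.9, 0.1, 0.1]}   # 假设的文档向量
q_vec = [0.2, 0.7, 0.1]                                           # 假设的查询向量
ranked = sorted(docs, key=lambda t: -hybrid_score(query, docs[t], q_vec, vecs[t]))
```

检索得到的候选试验随后再经重排序与逐条纳入标准的推理评估,这正是 TrialMatchAI 流水线的后续两步。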

[AI-21] Achieving Scalable Robot Autonomy via neurosymbolic planning using lightweight local LLM

【速读】:该论文旨在解决基于PDDL的符号任务规划在动态人机协作中面临的可扩展性差、重规划需求高以及计划可用性延迟等问题。其关键解决方案是提出Gideon框架,该框架通过集成新型问题生成器,系统地生成大规模的真实领域-问题-计划元组数据集,并适应本地小型大语言模型(LLM)的神经符号规划,从而实现设备端执行和多领域支持。此方法显著提升了推理效率、可扩展性和多领域适应性,尽管训练效率低于基于更大模型的基线,但模型体积缩小了约120倍,具有重要的实际应用价值。

链接: https://arxiv.org/abs/2505.08492
作者: Nicholas Attolino,Alessio Capitanelli,Fulvio Mastrogiovanni
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Robotics (cs.RO)
备注: 19 pages, 3 figures, 4 tables, accepted at IAS 2025

点击查看摘要

Abstract:PDDL-based symbolic task planning remains pivotal for robot autonomy yet struggles with dynamic human-robot collaboration due to scalability, re-planning demands, and delayed plan availability. Although a few neurosymbolic frameworks have previously leveraged LLMs such as GPT-3 to address these challenges, reliance on closed-source, remote models with limited context introduced critical constraints: third-party dependency, inconsistent response times, restricted plan length and complexity, and multi-domain scalability issues. We present Gideon, a novel framework that enables the transition to modern, smaller, local LLMs with extended context length. Gideon integrates a novel problem generator to systematically generate large-scale datasets of realistic domain-problem-plan tuples for any domain, and adapts neurosymbolic planning for local LLMs, enabling on-device execution and extended context for multi-domain support. Preliminary experiments in single-domain scenarios performed on Qwen-2.5 1.5B and trained on 8k-32k samples, demonstrate a valid plan percentage of 66.1% (32k model) and show that the figure can be further scaled through additional data. Multi-domain tests on 16k samples yield an even higher 70.6% planning validity rate, proving extensibility across domains and signaling that data variety can have a positive effect on learning efficiency. Although long-horizon planning and reduced model size make Gideon training much less efficient than baseline models based on larger LLMs, the results are still significant considering that the trained model is about 120x smaller than baseline and that significant advantages can be achieved in inference efficiency, scalability, and multi-domain adaptability, all critical factors in human-robot collaboration. Training inefficiency can be mitigated by Gideon’s streamlined data generation pipeline.
zh

[AI-22] An adaptive sampling algorithm for data-generation to build a data-manifold for physical problem surrogate modeling

【速读】:该论文试图解决在构建代理模型时,由于输入数据分布不均衡导致响应流形表示不佳,从而影响模型预测精度的问题。解决方案的关键在于提出一种自适应采样算法(Adaptive Sampling Algorithm for Data Generation, ASADG),该算法通过迭代地向初始输入数据中添加新的数据点,以更准确地表征高维空间中的响应流形。具体而言,该算法在每个步骤中将流形离散化为单纯复形,并在满足特定阈值时将其中每个单纯形的重心作为新输入数据加入,从而提升数据的代表性与模型的预测性能。

链接: https://arxiv.org/abs/2505.08487
作者: Chetra Mang,Axel TahmasebiMoradi,David Danan,Mouadh Yagoubi
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
备注:

点击查看摘要

Abstract:Physical models classically involve partial differential equations (PDEs), which, depending on their underlying complexity and the level of accuracy required, are known to be computationally expensive to solve numerically. Thus, an idea is to create a surrogate model relying on data generated by such a solver. However, training such a model on imbalanced data has been shown to be a very difficult task. Indeed, if the distribution of inputs leads to a poor representation of the response manifold, the model may not learn well and, consequently, may not predict the outcome with acceptable accuracy. In this work, we present an Adaptive Sampling Algorithm for Data Generation (ASADG) involving a physical model. As the initial input data may not accurately represent the response manifold in higher dimensions, this algorithm iteratively adds input data to it. At each step, the barycenter of each simplex of the simplicial complex into which the manifold is discretized is added as a new input datum, if a certain threshold is satisfied. We demonstrate the efficiency of the data sampling algorithm in comparison with the LHS method for generating more representative input data. To do so, we focus on the construction of a harmonic transport problem metamodel by generating data through a classical solver. By using such an algorithm, it is possible to generate the same number of input data as LHS while providing a better representation of the response manifold.
zh
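
"将响应流形离散为单纯复形、在满足阈值时把各单纯形的重心加入输入数据"这一核心步骤,可以用 Delaunay 三角剖分写成如下示意(阈值准则与玩具求解器均为假设,原文有具体定义):

```python
import numpy as np
from scipy.spatial import Delaunay

def asadg_step(X: np.ndarray, solver, threshold: float = 0.1) -> np.ndarray:
    """一步自适应采样:对现有输入点做三角剖分,把响应变化剧烈的单纯形重心加入数据。"""
    tri = Delaunay(X)
    y = solver(X)                                  # 物理求解器在现有点上的响应
    new_points = []
    for simplex in tri.simplices:
        bary = X[simplex].mean(axis=0)             # 单纯形的重心
        spread = y[simplex].max() - y[simplex].min()
        if spread > threshold:                     # 阈值准则(此处假设用响应跨度)
            new_points.append(bary)
    return np.vstack([X, np.array(new_points)]) if new_points else X

solver = lambda X: np.sin(3 * X[:, 0]) * np.cos(2 * X[:, 1])   # 假设的廉价替身求解器
X = np.random.rand(30, 2)                          # 初始随机/LHS 输入
for _ in range(3):
    X = asadg_step(X, solver)                      # 迭代加密,逐步改善流形表示
```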

[AI-23] BAT: Benchmark for Auto-bidding Task WWW2025

【速读】:该论文旨在解决在线广告位拍卖中出价策略优化这一关键问题,尤其针对实时竞价(Real-Time Bidding, RTB)场景下的预算节奏一致性和点击成本(Cost Per Click, CPC)约束优化问题。其解决方案的关键在于构建一个涵盖两种最常见拍卖格式的基准测试平台,并在该平台上实现一系列稳健的基线算法,以提供一个用户友好且直观的框架,助力研究人员和从业者开发和优化创新的自动出价算法。

链接: https://arxiv.org/abs/2505.08485
作者: Alexandra Khirianova,Ekaterina Solodneva,Andrey Pudovikov,Sergey Osokin,Egor Samosvat,Yuriy Dorn,Alexander Ledovsky,Yana Zenkova
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
备注: 11 pages, 10 figures, WWW 2025 conference

点击查看摘要

Abstract:The optimization of bidding strategies for online advertising slot auctions presents a critical challenge across numerous digital marketplaces. A significant obstacle to the development, evaluation, and refinement of real-time autobidding algorithms is the scarcity of comprehensive datasets and standardized benchmarks. To address this deficiency, we present an auction benchmark encompassing the two most prevalent auction formats. We implement a series of robust baselines on a novel dataset, addressing the most salient Real-Time Bidding (RTB) problem domains: budget pacing uniformity and Cost Per Click (CPC) constraint optimization. This benchmark provides a user-friendly and intuitive framework for researchers and practitioners to develop and refine innovative autobidding algorithms, thereby facilitating advancements in the field of programmatic advertising. The implementation and additional resources can be accessed at the following repository (this https URL, this https URL).
zh
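
基准强调的"预算匀速消耗(budget pacing uniformity)"可以用一个简单的比例控制器说明:按"期望花费进度与实际花费之差"调节出价系数。以下仅为问题形态的示意,并非该基准内置的基线(控制器形式、基础出价等均为假设):

```python
def pacing_multiplier(spent: float, budget: float, t: float, horizon: float,
                      gain: float = 0.5) -> float:
    """比例控制:花得太快则压低出价,花得太慢则抬高出价。"""
    target = budget * t / horizon            # 期望此刻已花掉的预算
    error = (target - spent) / max(budget, 1e-8)
    return max(0.0, 1.0 + gain * error)

budget, horizon = 1000.0, 24.0
spent = 0.0
for hour in range(1, 25):
    m = pacing_multiplier(spent, budget, hour, horizon)
    bid = 2.0 * m                            # 基础出价 2.0(假设)乘以调节系数
    spent += min(bid * 20, budget - spent)   # 假设每小时约赢得 20 次展示
```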

[AI-24] Strategy-Augmented Planning for Large Language Models via Opponent Exploitation IJCNN2025

【速读】:该论文试图解决在对抗性领域中高效建模和利用对手的问题,这是长期存在的挑战。其解决方案的关键在于引入一种两阶段的策略增强规划(Strategy-Augmented Planning, SAP)框架,该框架通过关键组件——策略评估网络(Strategy Evaluation Network, SEN)显著提升了基于大语言模型(Large Language Models, LLMs)的智能体的对手利用能力。SAP框架在离线阶段构建显式策略空间并收集策略-结果对数据以训练SEN,在在线阶段动态识别对手策略并通过搜索最优反应策略进行贪婪利用,最终通过精心设计的提示将策略转化为行动。

链接: https://arxiv.org/abs/2505.08459
作者: Shuai Xu,Sijia Cui,Yanna Wang,Bo Xu,Qi Wang
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: Accepted to IJCNN 2025

点击查看摘要

Abstract:Efficiently modeling and exploiting opponents is a long-standing challenge in adversarial domains. Large Language Models (LLMs) trained on extensive textual data have recently demonstrated outstanding performance in general tasks, introducing new research directions for opponent modeling. Some studies primarily focus on directly using LLMs to generate decisions based on the elaborate prompt context that incorporates opponent descriptions, while these approaches are limited to scenarios where LLMs possess adequate domain expertise. To address that, we introduce a two-stage Strategy-Augmented Planning (SAP) framework that significantly enhances the opponent exploitation capabilities of LLM-based agents by utilizing a critical component, the Strategy Evaluation Network (SEN). Specifically, in the offline stage, we construct an explicit strategy space and subsequently collect strategy-outcome pair data for training the SEN network. During the online phase, SAP dynamically recognizes the opponent’s strategies and greedily exploits them by searching best response strategy on the well-trained SEN, finally translating strategy to a course of actions by carefully designed prompts. Experimental results show that SAP exhibits robust generalization capabilities, allowing it to perform effectively not only against previously encountered opponent strategies but also against novel, unseen strategies. In the MicroRTS environment, SAP achieves a 85.35% performance improvement over baseline methods and matches the competitiveness of reinforcement learning approaches against state-of-the-art (SOTA) rule-based AI.
zh

[AI-25] Adaptive Bias Generalized Rollout Policy Adaptation on the Flexible Job-Shop Scheduling Problem

【速读】:该论文旨在解决柔性作业车间调度问题(Flexible Job-Shop Scheduling Problem, FJSSP),这是一个NP难的组合优化问题,广泛应用于制造领域,目标是高效地将多个操作分配到不同机器上,并确保同一工件的操作按顺序进行。论文提出的解决方案基于广义嵌套回溯策略适应(Generalized Nested Rollout Policy Adaptation)算法,其关键在于通过改进的策略适应机制提升蒙特卡洛树搜索(Monte Carlo Tree Search, MCTS)在FJSSP中的求解性能,实验结果表明该算法在多数情况下优于其他MCTS方法。

链接: https://arxiv.org/abs/2505.08451
作者: Lotfi Kobrosly,Marc-Emmanuel Coupvent des Graviers,Christophe Guettier,Tristan Cazenave
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: The 19th Learning and Intelligent OptimizatioN Conference, LION19 2025

点击查看摘要

Abstract:The Flexible Job-Shop Scheduling Problem (FJSSP) is an NP-hard combinatorial optimization problem, with several application domains, especially for manufacturing purposes. The objective is to efficiently schedule multiple operations on dissimilar machines. These operations are gathered into jobs, and operations pertaining to the same job need to be scheduled sequentially. Different methods have been previously tested to solve this problem, such as Constraint Solving, Tabu Search, Genetic Algorithms, or Monte Carlo Tree Search (MCTS). We propose a novel algorithm derived from the Generalized Nested Rollout Policy Adaptation, developed to solve the FJSSP. We report encouraging experimental results, as our algorithm performs better than other MCTS-based approaches, even if makespans obtained on large instances are still far from known upper bounds.
zh
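
论文所基于的嵌套回溯策略适应(NRPA, Rosin 2011)框架可概括为:低层按 softmax 策略做带权随机走子,高层把策略向当前最优序列偏移。下面给出标准 NRPA 的通用最小示意(动作空间与回报均为玩具化假设,不涉及论文的广义版本与 FJSSP 细节):

```python
import math, random

def legal_moves():
    """假设的动作集合:把工序 op_i 指派到机器 m_j(真实 FJSSP 中依赖当前局面)。"""
    return [f"op{i}@m{j}" for i in range(3) for j in range(2)]

def playout(policy, length=20):
    """按 softmax 策略采样一条动作序列,并返回玩具回报(实际应为负 makespan)。"""
    seq = []
    for _ in range(length):
        moves = legal_moves()
        weights = [math.exp(policy.get(m, 0.0)) for m in moves]
        seq.append(random.choices(moves, weights=weights)[0])
    return seq, sum(1.0 for m in seq if m.endswith("@m0"))   # 玩具目标:偏好机器 m0

def adapt(policy, seq, alpha=1.0):
    """策略适应:提升最优序列中动作的权重,按当前概率扣减其它合法动作。"""
    for m in seq:
        moves = legal_moves()
        z = sum(math.exp(policy.get(b, 0.0)) for b in moves)
        probs = {b: math.exp(policy.get(b, 0.0)) / z for b in moves}
        for b in moves:
            policy[b] = policy.get(b, 0.0) - alpha * probs[b]
        policy[m] = policy.get(m, 0.0) + alpha

def nrpa(level, policy, iters=10):
    if level == 0:
        return playout(policy)
    best_seq, best_score = None, float("-inf")
    for _ in range(iters):
        seq, score = nrpa(level - 1, dict(policy), iters)   # 低层在策略副本上运行
        if score >= best_score:
            best_seq, best_score = seq, score
        adapt(policy, best_seq)                             # 高层向最优序列适应
    return best_seq, best_score

seq, score = nrpa(level=2, policy={})
```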

[AI-26] Agent -as-a-Service based on Agent Network

【速读】:该论文旨在解决多智能体系统(Multi-Agent Systems, MAS)中代理层面协作组织不足的问题,特别是在大规模模型驱动的AI代理兴起背景下,如何实现更高效的任务决策、协作与适应性。其解决方案的关键在于提出基于代理网络的代理即服务(Agent-as-a-Service based on Agent Network, AaaS-AN),该方案通过两个核心组件实现:一是动态代理网络,将代理及代理组建模为顶点,并根据任务和角色依赖关系进行自组织;二是面向服务的代理,集成服务发现、注册与互操作协议,由服务调度器通过执行图实现分布式协调、上下文追踪与运行时任务管理。

链接: https://arxiv.org/abs/2505.08446
作者: Yuhan Zhu,Haojie Liu,Jian Wang,Bing Li,Zikang Yin,Yefei Liao
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: work in progress

点击查看摘要

Abstract:The rise of large model-based AI agents has spurred interest in Multi-Agent Systems (MAS) for their capabilities in decision-making, collaboration, and adaptability. While the Model Context Protocol (MCP) addresses tool invocation and data exchange challenges via a unified protocol, it lacks support for organizing agent-level collaboration. To bridge this gap, we propose Agent-as-a-Service based on Agent Network (AaaS-AN), a service-oriented paradigm grounded in the Role-Goal-Process-Service (RGPS) standard. AaaS-AN unifies the entire agent lifecycle, including construction, integration, interoperability, and networked collaboration, through two core components: (1) a dynamic Agent Network, which models agents and agent groups as vertexes that self-organize within the network based on task and role dependencies; (2) service-oriented agents, incorporating service discovery, registration, and interoperability protocols. These are orchestrated by a Service Scheduler, which leverages an Execution Graph to enable distributed coordination, context tracking, and runtime task management. We validate AaaS-AN on mathematical reasoning and application-level code generation tasks, which outperforms state-of-the-art baselines. Notably, we constructed a MAS based on AaaS-AN containing agent groups, Robotic Process Automation (RPA) workflows, and MCP servers over 100 agent services. We also release a dataset containing 10,000 long-horizon multi-agent workflows to facilitate future research on long-chain collaboration in MAS.
zh

[AI-27] Explaining Autonomous Vehicles with Intention-aware Policy Graphs AAMAS2025 AAMAS

【速读】:该论文试图解决自动驾驶车辆决策过程的不透明性问题,这一问题阻碍了其社会信任和监管接受度。解决方案的关键在于提出一种后处理、模型无关的方法,通过基于意图感知策略图(Intention-aware Policy Graphs)的技术,从全局和局部视角提取可解释且可靠的车辆行为解释,从而评估车辆是否在合法边界内运行,并识别自动驾驶数据集和模型中的潜在漏洞。

链接: https://arxiv.org/abs/2505.08404
作者: Sara Montese,Victor Gimenez-Abalos,Atia Cortés,Ulises Cortés,Sergio Alvarez-Napagao
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: Accepted to Workshop EXTRAAMAS 2025 in AAMAS Conference

点击查看摘要

Abstract:The potential to improve road safety, reduce human driving error, and promote environmental sustainability have enabled the field of autonomous driving to progress rapidly over recent decades. The performance of autonomous vehicles has significantly improved thanks to advancements in Artificial Intelligence, particularly Deep Learning. Nevertheless, the opacity of their decision-making, rooted in the use of accurate yet complex AI models, has created barriers to their societal trust and regulatory acceptance, raising the need for explainability. We propose a post-hoc, model-agnostic solution to provide teleological explanations for the behaviour of an autonomous vehicle in urban environments. Building on Intention-aware Policy Graphs, our approach enables the extraction of interpretable and reliable explanations of vehicle behaviour in the nuScenes dataset from global and local perspectives. We demonstrate the potential of these explanations to assess whether the vehicle operates within acceptable legal boundaries and to identify possible vulnerabilities in autonomous driving datasets and models.
zh

[AI-28] ConDiSim: Conditional Diffusion Models for Simulation Based Inference

【速读】:该论文旨在解决复杂系统中不可计算似然函数的仿真推断问题(simulation-based inference),其核心挑战在于无法直接计算后验分布。解决方案的关键在于提出一种条件扩散模型(conditional diffusion model, ConDiSim),该模型通过前向过程添加高斯噪声并利用反向过程学习去噪,从而近似后验分布,能够有效捕捉后验中的复杂依赖关系和多模态特性。

链接: https://arxiv.org/abs/2505.08403
作者: Mayank Nautiyal,Andreas Hellander,Prashant Singh
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
备注:

点击查看摘要

Abstract:We present a conditional diffusion model - ConDiSim, for simulation-based inference of complex systems with intractable likelihoods. ConDiSim leverages denoising diffusion probabilistic models to approximate posterior distributions, consisting of a forward process that adds Gaussian noise to parameters, and a reverse process learning to denoise, conditioned on observed data. This approach effectively captures complex dependencies and multi-modalities within posteriors. ConDiSim is evaluated across ten benchmark problems and two real-world test problems, where it demonstrates effective posterior approximation accuracy while maintaining computational efficiency and stability in model training. ConDiSim offers a robust and extensible framework for simulation-based inference, particularly suitable for parameter inference workflows requiring fast inference methods.
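
其"前向加噪 + 条件去噪"的训练核心可用标准 DDPM 式 epsilon-预测损失示意如下(通用草图,`denoiser` 的接口与网络结构均为假设,并非论文的具体实现):

```python
import torch
import torch.nn.functional as F

def condisim_training_step(denoiser, theta, x_obs, alphas_cumprod):
    """对模拟器参数 theta 做前向加噪,并在观测 x_obs 条件下预测噪声。"""
    b = theta.shape[0]
    t = torch.randint(0, len(alphas_cumprod), (b,))
    a_bar = alphas_cumprod[t].unsqueeze(-1)                      # \bar{alpha}_t
    noise = torch.randn_like(theta)
    theta_t = a_bar.sqrt() * theta + (1 - a_bar).sqrt() * noise  # 前向加噪过程
    pred = denoiser(theta_t, t, x_obs)                           # 条件去噪网络
    return F.mse_loss(pred, noise)                               # epsilon-预测损失
```

采样时则从纯噪声出发,反复以观测数据为条件去噪,即可得到近似后验样本。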
zh

[AI-29] Adaptive Diffusion Policy Optimization for Robotic Manipulation

【速读】:该论文试图解决如何快速且稳定地优化基于扩散模型的策略(diffusion-based policies)的问题,特别是在强化学习(Reinforcement Learning, RL)中的应用。现有研究虽已展示扩散模型在建模复杂策略、表达多模态性以及处理高维连续控制任务方面的潜力,但针对其优化方法的研究仍较为有限。论文提出的解决方案是Adam-based Diffusion Policy Optimization (ADPO),其关键在于引入自适应梯度下降方法,结合最佳实践,形成一个高效的微调框架,以提升基于扩散模型的策略在机器人控制任务中的性能。

链接: https://arxiv.org/abs/2505.08376
作者: Huiyun Jiang,Zhuang Yang
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Recent studies have shown the great potential of diffusion models in improving reinforcement learning (RL) by modeling complex policies, expressing a high degree of multi-modality, and efficiently handling high-dimensional continuous control tasks. However, there is currently limited research on how to optimize diffusion-based policies (e.g., Diffusion Policy) fast and stably. In this paper, we propose an Adam-based Diffusion Policy Optimization (ADPO), a fast algorithmic framework containing best practices for fine-tuning diffusion-based policies in robotic control tasks using the adaptive gradient descent method in RL. Adaptive gradient methods are less studied in RL training, let alone for diffusion-based policies. We confirm that ADPO outperforms other diffusion-based RL methods in terms of overall effectiveness for fine-tuning on standard robotic tasks. Concretely, we conduct extensive experiments on standard robotic control tasks to test ADPO, where, particularly, six popular diffusion-based RL methods are provided as benchmark methods. Experimental results show that ADPO acquires better or comparable performance than the baseline methods. Finally, we systematically analyze the sensitivity of multiple hyperparameters in standard robotics tasks, providing guidance for subsequent practical applications. Our video demonstrations are released in this https URL.
zh

[AI-30] Learning Like Humans: Advancing LLM Reasoning Capabilities via Adaptive Difficulty Curriculum Learning and Expert-Guided Self-Reformulation

【速读】:该论文旨在解决大型语言模型在处理复杂问题时表现不足的问题(即模型在数学推理等任务中难以持续有效地解决问题)。其解决方案的关键在于提出两种受人类学习策略启发的新方法:Adaptive Difficulty Curriculum Learning (ADCL) 和 Expert-Guided Self-Reformulation (EGSR)。ADCL 通过定期重新评估后续数据批次的难度,以适应模型能力的变化,从而缓解难度偏移现象;EGSR 则通过引导模型在自身概念框架内重构专家解法,而非直接模仿,从而提升模型的理解与知识吸收能力。这两种方法在数学推理基准测试中表现出显著的协同增益效果。

链接: https://arxiv.org/abs/2505.08364
作者: Enci Zhang,Xingang Yan,Wei Lin,Tianxiang Zhang,Qianchun Lu
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 14 pages, 3 figs

点击查看摘要

Abstract:Despite impressive progress in areas like mathematical reasoning, large language models still face significant challenges in consistently solving complex problems. Drawing inspiration from key human learning strategies, we propose two novel strategies to enhance the capability of large language models to solve these complex problems. First, Adaptive Difficulty Curriculum Learning (ADCL) is a novel curriculum learning strategy that tackles the Difficulty Shift phenomenon (i.e., a model’s perception of problem difficulty dynamically changes during training) by periodically re-estimating difficulty within upcoming data batches to maintain alignment with the model’s evolving capabilities. Second, Expert-Guided Self-Reformulation (EGSR) is a novel reinforcement learning strategy that bridges the gap between imitation learning and pure exploration by guiding models to reformulate expert solutions within their own conceptual framework, rather than relying on direct imitation, fostering deeper understanding and knowledge assimilation. Extensive experiments on challenging mathematical reasoning benchmarks, using Qwen2.5-7B as the base model, demonstrate that these human-inspired strategies synergistically and significantly enhance performance. Notably, their combined application improves performance over the standard Zero-RL baseline by 10% on the AIME24 benchmark and 16.6% on AIME25.
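
ADCL 中"周期性重估难度、使课程对齐模型当前能力"的思路可用如下草图说明(`solve_once` 为假设的单次评测函数,目标难度取 0.5 仅作示意):

```python
def estimate_difficulty(model, problem, n_samples=8):
    # 以当前模型的失败率作为难度代理:成功率越低,难度越高
    succ = sum(solve_once(model, problem) for _ in range(n_samples))
    return 1.0 - succ / n_samples

def next_batch(model, pool, batch_size, target=0.5):
    # 每个训练阶段前重估候选池难度,优先选择接近目标难度的样本,
    # 以缓解"难度漂移"(模型感知的难度随训练动态变化)
    ranked = sorted(pool, key=lambda p: abs(estimate_difficulty(model, p) - target))
    return ranked[:batch_size]
```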
zh

[AI-31] Modeling Unseen Environments with Language-guided Composable Causal Components in Reinforcement Learning ICLR2025

【速读】:该论文旨在解决强化学习(Reinforcement Learning, RL)中的泛化问题,特别是在面对具有未见过动态的新环境时,智能体难以有效适应和执行任务。其解决方案的关键在于引入了基于组合因果组件的世界建模框架(World Modeling with Compositional Causal Components, WM3C),该框架通过学习和利用可组合元素之间的因果动态,提升RL的泛化能力。与以往侧重于不变表示学习或元学习的方法不同,WM3C强调对因果关系的识别与利用,并结合语言作为组合性模态,实现潜在空间的有效分解与唯一识别,从而增强智能体在新任务中的适应性与性能。

链接: https://arxiv.org/abs/2505.08361
作者: Xinyue Wang,Biwei Huang
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: Published as a conference paper at ICLR 2025

点击查看摘要

Abstract:Generalization in reinforcement learning (RL) remains a significant challenge, especially when agents encounter novel environments with unseen dynamics. Drawing inspiration from human compositional reasoning – where known components are reconfigured to handle new situations – we introduce World Modeling with Compositional Causal Components (WM3C). This novel framework enhances RL generalization by learning and leveraging compositional causal components. Unlike previous approaches focusing on invariant representation learning or meta-learning, WM3C identifies and utilizes causal dynamics among composable elements, facilitating robust adaptation to new tasks. Our approach integrates language as a compositional modality to decompose the latent space into meaningful components and provides theoretical guarantees for their unique identification under mild assumptions. Our practical implementation uses a masked autoencoder with mutual information constraints and adaptive sparsity regularization to capture high-level semantic information and effectively disentangle transition dynamics. Experiments on numerical simulations and real-world robotic manipulation tasks demonstrate that WM3C significantly outperforms existing methods in identifying latent processes, improving policy learning, and generalizing to unseen tasks.
zh

[AI-32] SHAP-based Explanations are Sensitive to Feature Representation

【速读】:该论文试图解决数据工程选择对基于局部特征的可解释性(Local Feature-Based Explanations)的影响问题,特别是这些选择如何改变特征重要性评估的结果。解决方案的关键在于揭示标准且看似无害的数据工程技术(如将年龄表示为直方图或以特定方式编码种族)能够操纵诸如SHAP等流行解释方法所计算的特征重要性,从而可能被攻击者利用来掩盖如歧视等问题。研究强调了特征表示对解释结果的敏感性,并指出这一现象尚未得到系统性探讨。

链接: https://arxiv.org/abs/2505.08345
作者: Hyunseung Hwang,Andrew Bell,Joao Fonseca,Venetia Pliatsika,Julia Stoyanovich,Steven Euijong Whang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Accepted to ACM FAccT 2025

点击查看摘要

Abstract:Local feature-based explanations are a key component of the XAI toolkit. These explanations compute feature importance values relative to an "interpretable" feature representation. In tabular data, feature values themselves are often considered interpretable. This paper examines the impact of data engineering choices on local feature-based explanations. We demonstrate that simple, common data engineering techniques, such as representing age with a histogram or encoding race in a specific way, can manipulate feature importance as determined by popular methods like SHAP. Notably, the sensitivity of explanations to feature representation can be exploited by adversaries to obscure issues like discrimination. While the intuition behind these results is straightforward, their systematic exploration has been lacking. Previous work has focused on adversarial attacks on feature-based explainers by biasing data or manipulating models. To the best of our knowledge, this is the first study demonstrating that explainers can be misled by standard, seemingly innocuous data engineering techniques.
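
下面的小实验直观展示了这种敏感性:对同一份数据,仅把年龄从连续值改为分箱编码,SHAP 给出的特征重要性就可能明显变化(合成数据、分箱边界均为示意,并非论文的实验设置):

```python
import numpy as np
import pandas as pd
import shap
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
n = 1000
X_raw = pd.DataFrame({"age": rng.uniform(18, 90, n),
                      "income": rng.normal(50, 10, n)})
y = (X_raw["age"] > 60).astype(float) + 0.1 * rng.normal(size=n)

# 同一信息的另一种表示:把年龄分箱(直方图式编码)
X_binned = X_raw.copy()
X_binned["age"] = pd.cut(X_binned["age"], bins=[0, 30, 60, 100], labels=False)

for name, X in [("raw", X_raw), ("binned", X_binned)]:
    model = RandomForestRegressor(random_state=0).fit(X, y)
    sv = shap.TreeExplainer(model).shap_values(X)
    print(name, np.abs(sv).mean(axis=0))   # "age" 的平均 |SHAP| 随表示方式改变
```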
zh

[AI-33] An Identifiable Cost-Aware Causal Decision-Making Framework Using Counterfactual Reasoning

【速读】:该论文旨在解决在异常条件下进行决策时,现有决策框架过于依赖强化学习或根本原因分析,导致常常忽视行动成本或未能充分整合因果机制的问题。其解决方案的关键在于提出一种基于反事实推理的最小成本因果决策(Minimum-Cost Causal Decision, MiCCD)框架,通过放松现有的因果决策框架以解决必要原因,并利用因果图构建代理模型,结合异常模式聚类标签作为监督信号,实现结构因果模型的近似,从而为可识别的反事实推理奠定基础,最终通过优化模型和SLSQP算法实现成本考虑下的最优干预策略。

链接: https://arxiv.org/abs/2505.08343
作者: Ruichu Cai,Xi Chen,Jie Qiao,Zijian Li,Yuequn Liu,Wei Chen,Keli Zhang,Jiale Zheng
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Decision making under abnormal conditions is a critical process that involves evaluating the current state and determining the optimal action to restore the system to a normal state at an acceptable cost. However, in such scenarios, existing decision-making frameworks highly rely on reinforcement learning or root cause analysis, resulting in them frequently neglecting the cost of the actions or failing to incorporate causal mechanisms adequately. By relaxing the existing causal decision framework to solve the necessary cause, we propose a minimum-cost causal decision (MiCCD) framework via counterfactual reasoning to address the above challenges. Emphasis is placed on making counterfactual reasoning processes identifiable in the presence of a large amount of mixed anomaly data, as well as finding the optimal intervention state in a continuous decision space. Specifically, it formulates a surrogate model based on causal graphs, using abnormal pattern clustering labels as supervisory signals. This enables the approximation of the structural causal model among the variables and lays a foundation for identifiable counterfactual reasoning. With the causal structure approximated, we then establish an optimization model based on counterfactual estimation. The Sequential Least Squares Programming (SLSQP) algorithm is further employed to optimize intervention strategies while taking costs into account. Experimental evaluations on both synthetic and real-world datasets reveal that MiCCD outperforms conventional methods across multiple metrics, including F1-score, cost efficiency, and ranking quality (nDCG@k values), thus validating its efficacy and broad applicability.
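
其中"在成本约束下用 SLSQP 求最优干预"这一步可以直接用 scipy 示意(`surrogate` 在论文中是学到的因果代理模型,此处用一个假设的可微函数代替;成本权重与正常度阈值同为假设):

```python
import numpy as np
from scipy.optimize import minimize

def surrogate(x):
    # 假设的代理模型:返回干预 x 后系统的"正常度"(0~1)
    return 1.0 / (1.0 + np.exp(-4.0 * x.sum()))

cost_weights = np.array([1.0, 2.0, 0.5])        # 各干预变量的单位成本(假设)

res = minimize(lambda x: cost_weights @ x**2,   # 目标:最小化干预成本
               x0=np.zeros(3), method="SLSQP",
               bounds=[(-1.0, 1.0)] * 3,
               constraints=[{"type": "ineq",
                             "fun": lambda x: surrogate(x) - 0.95}])
print(res.x, res.fun)  # 满足"恢复正常"约束、成本最低的干预方案
```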
zh

[AI-34] Benchmarking AI scientists in omics data-driven biological research

【速读】:该论文试图解决当前生物领域中AI科学家在自主进行生物学研究时缺乏真实、数据驱动的评估设置的问题。现有基准要么侧重于无需数据的推理,要么侧重于具有预定义统计答案的数据分析,未能提供符合实际研究场景的评价体系。解决方案的关键是引入Biological AI Scientist Benchmark (BaisBench),该基准通过两个任务评估AI科学家的能力:在31个专家标注的单细胞数据集上进行细胞类型注释,以及通过回答从41项近期单细胞研究中提取的198道多选题来实现科学发现。

链接: https://arxiv.org/abs/2505.08341
作者: Erpai Luo,Jinmeng Jia,Yifan Xiong,Xiangyu Li,Xiaobo Guo,Baoqi Yu,Lei Wei,Xuegong Zhang
机构: 未知
类目: Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA); Genomics (q-bio.GN)
备注:

点击查看摘要

Abstract:The rise of large language models and multi-agent systems has sparked growing interest in AI scientists capable of autonomous biological research. However, existing benchmarks either focus on reasoning without data or on data analysis with predefined statistical answers, lacking realistic, data-driven evaluation settings. Here, we introduce the Biological AI Scientist Benchmark (BaisBench), a benchmark designed to assess AI scientists’ ability to generate biological discoveries through data analysis and reasoning with external knowledge. BaisBench comprises two tasks: cell type annotation on 31 expert-labeled single-cell datasets, and scientific discovery through answering 198 multiple-choice questions derived from the biological insights of 41 recent single-cell studies. Systematic experiments on state-of-the-art AI scientists and LLM agents showed that while promising, current models still substantially underperform human experts on both tasks. We hope BaisBench will fill this gap and serve as a foundation for advancing and evaluating AI models for scientific discovery. The benchmark can be found at: this https URL.
zh

[AI-35] Low-Complexity Inference in Continual Learning via Compressed Knowledge Transfer

【速读】:该论文旨在解决持续学习(Continual Learning, CL)中模型在保持旧任务性能(稳定性)与适应新任务(可塑性)之间的平衡问题,特别是在使用大规模预训练模型时面临的推理阶段高计算成本限制实际应用的问题。解决方案的关键在于引入模型压缩技术,包括剪枝和知识蒸馏(Knowledge Distillation, KD),并提出两种针对类别增量学习(Class-Incremental Learning, CIL)的高效框架,通过在不同训练阶段进行压缩或利用教师-学生架构传递知识,实现准确率与推理复杂度之间的更好权衡。

链接: https://arxiv.org/abs/2505.08327
作者: Zhenrong Liu,Janne M. J. Huttunen,Mikko Honkala
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Continual learning (CL) aims to train models that can learn a sequence of tasks without forgetting previously acquired knowledge. A core challenge in CL is balancing stability – preserving performance on old tasks – and plasticity – adapting to new ones. Recently, large pre-trained models have been widely adopted in CL for their ability to support both, offering strong generalization for new tasks and resilience against forgetting. However, their high computational cost at inference time limits their practicality in real-world applications, especially those requiring low latency or energy efficiency. To address this issue, we explore model compression techniques, including pruning and knowledge distillation (KD), and propose two efficient frameworks tailored for class-incremental learning (CIL), a challenging CL setting where task identities are unavailable during inference. The pruning-based framework includes pre- and post-pruning strategies that apply compression at different training stages. The KD-based framework adopts a teacher-student architecture, where a large pre-trained teacher transfers downstream-relevant knowledge to a compact student. Extensive experiments on multiple CIL benchmarks demonstrate that the proposed frameworks achieve a better trade-off between accuracy and inference complexity, consistently outperforming strong baselines. We further analyze the trade-offs between the two frameworks in terms of accuracy and efficiency, offering insights into their use across different scenarios.
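
其中教师-学生蒸馏可用经典的 Hinton 式损失示意(标准配方;论文中"下游相关知识"的选取与类别增量场景的细节不在此展示):

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    """软标签 KL 散度 + 硬标签交叉熵的加权和(Hinton et al., 2015)。"""
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)                       # 温度平方项校正梯度量级
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard
```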
zh

[AI-36] FedRS-Bench: Realistic Federated Learning Datasets and Benchmarks in Remote Sensing

【速读】:该论文试图解决遥感(Remote Sensing, RS)图像在分布式环境下进行联邦学习(Federated Learning, FL)时缺乏真实可比数据集和基准的问题。现有研究多依赖手动划分的单一数据集,无法反映真实世界RS数据的异质性和规模,并且实验设置不一致,阻碍了方法间的公平比较。解决方案的关键在于构建一个名为FedRS的真实联邦RS数据集,其包含8个覆盖多种传感器和分辨率的数据集,构建了135个客户端,具备真实联邦特性如标签分布偏斜、客户端数据量不平衡以及跨客户端的领域异质性,从而支持大规模FL方法的评估。基于FedRS,作者还实现了10种基线FL算法和评估指标,构建了FedRS-Bench,以促进未来研究的标准化和公平比较。

链接: https://arxiv.org/abs/2505.08325
作者: Haodong Zhao,Peng Peng,Chiyu Chen,Linqing Huang,Gongshen Liu
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Remote sensing (RS) images are usually produced at an unprecedented scale, yet they are geographically and institutionally distributed, making centralized model training challenging due to data-sharing restrictions and privacy concerns. Federated learning (FL) offers a solution by enabling collaborative model training across decentralized RS data sources without exposing raw data. However, there lacks a realistic federated dataset and benchmark in RS. Prior works typically rely on manually partitioned single dataset, which fail to capture the heterogeneity and scale of real-world RS data, and often use inconsistent experimental setups, hindering fair comparison. To address this gap, we propose a realistic federated RS dataset, termed FedRS. FedRS consists of eight datasets that cover various sensors and resolutions and builds 135 clients, which is representative of realistic operational scenarios. Data for each client come from the same source, exhibiting authentic federated properties such as skewed label distributions, imbalanced client data volumes, and domain heterogeneity across clients. These characteristics reflect practical challenges in federated RS and support evaluation of FL methods at scale. Based on FedRS, we implement 10 baseline FL algorithms and evaluation metrics to construct the comprehensive FedRS-Bench. The experimental results demonstrate that FL can consistently improve model performance over training on isolated data silos, while revealing performance trade-offs of different methods under varying client heterogeneity and availability conditions. We hope FedRS-Bench will accelerate research on large-scale, realistic FL in RS by providing a standardized, rich testbed and facilitating fair comparisons across future works. The source codes and dataset are available at this https URL.
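
作为基准之一的 FedAvg,其核心聚合步骤只需几行代码即可示意(标准 FedAvg;`client_states` 为各客户端模型的 state_dict,假设参数均为浮点张量):

```python
def fedavg(client_states, client_sizes):
    # 按客户端样本量对各端参数做加权平均,得到新的全局模型
    total = float(sum(client_sizes))
    return {k: sum(sd[k] * (n / total)
                   for sd, n in zip(client_states, client_sizes))
            for k in client_states[0]}
```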
zh

[AI-37] Reciprocity as the Foundational Substrate of Society: How Reciprocal Dynamics Scale into Social Systems DATE

【速读】:该论文试图解决多智能体人工智能中缺乏可模拟的自下而上社会结构形成的模型的问题,以及经济学和社会学中“制度”和“规范”等基础理论往往在事后描述社会结构、依赖隐含的共享文化、道德或符号协议的缺陷。解决方案的关键在于提出一个三阶段的自下而上框架:互惠动力学(Reciprocal Dynamics),捕捉个体层面的互惠交换;规范稳定化(Norm Stabilization),巩固共享期望;制度建构(Institutional Construction),将稳定模式外化为可扩展结构。通过将社会涌现建立在个体层面的互惠性基础上,该框架实现了对道德、文化和制度结构如何从认知最小交互中产生的系统探索。

链接: https://arxiv.org/abs/2505.08319
作者: Egil Diau
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
备注: First draft extending the first position paper. Main framework complete; historical examples and references will be updated

点击查看摘要

Abstract:A major bottleneck in multi-agent AI is the lack of simulateable models for the bottom-up emergence of social structure under realistic behavioral constraints. Similarly, many foundational theories in economics and sociology including the concepts of “institutions” and “norms” tend to describe social structures post hoc, often relying on implicit assumptions of shared culture, morality, or symbolic agreement. These concepts are often treated as primitives rather than reconstructed from agent-level behavior, leaving both their origins and operational definitions under-specified. To address this, we propose a three-stage bottom-up framework: Reciprocal Dynamics, capturing individual-level reciprocal exchanges; Norm Stabilization, the consolidation of shared expectations; and Institutional Construction, the externalization of stable patterns into scalable structures. By grounding social emergence in agent-level reciprocity, our framework enables the systematic exploration of how moral, cultural, and institutional structures emerge from cognitively minimal interactions.
zh

[AI-38] A Practical Introduction to Deep Reinforcement Learning

【速读】:该论文试图解决深度强化学习(Deep Reinforcement Learning, DRL)领域中算法多样性与理论复杂性给初学者带来的入门困难问题。其解决方案的关键在于通过统一的广义策略迭代(Generalized Policy Iteration, GPI)框架组织所有算法,并重点介绍广泛使用的近端策略优化(Proximal Policy Optimization, PPO)算法,以提供一种简洁、直观且实用的学习路径,强调概念理解、实例说明和工程实践,而非冗长的理论推导。

链接: https://arxiv.org/abs/2505.08295
作者: Yinghan Sun,Hongxi Wang,Hua Chen,Wei Zhang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Deep reinforcement learning (DRL) has emerged as a powerful framework for solving sequential decision-making problems, achieving remarkable success in a wide range of applications, including game AI, autonomous driving, biomedicine, and large language models. However, the diversity of algorithms and the complexity of theoretical foundations often pose significant challenges for beginners seeking to enter the field. This tutorial aims to provide a concise, intuitive, and practical introduction to DRL, with a particular focus on the Proximal Policy Optimization (PPO) algorithm, which is one of the most widely used and effective DRL methods. To facilitate learning, we organize all algorithms under the Generalized Policy Iteration (GPI) framework, offering readers a unified and systematic perspective. Instead of lengthy theoretical proofs, we emphasize intuitive explanations, illustrative examples, and practical engineering techniques. This work serves as an efficient and accessible guide, helping readers rapidly progress from basic concepts to the implementation of advanced DRL algorithms.
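
教程重点介绍的 PPO,其核心的截断替代目标可以几行代码写出(标准 PPO-Clip 损失;`ratio` 为新旧策略的概率比,`advantage` 通常由 GAE 估计):

```python
import torch

def ppo_clip_loss(ratio, advantage, eps=0.2):
    """L = -E[min(r * A, clip(r, 1-eps, 1+eps) * A)]"""
    unclipped = ratio * advantage
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * advantage
    return -torch.min(unclipped, clipped).mean()
```

截断项限制了单次更新中策略偏离旧策略的幅度,这正是 PPO 训练稳定性的主要来源。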
zh

[AI-39] LLM Enhancers for GNNs: An Analysis from the Perspective of Causal Mechanism Identification ICML2025

【速读】:该论文试图解决将大规模语言模型(Large Language Models, LLMs)作为特征增强器以优化节点表示,进而输入图神经网络(Graph Neural Networks, GNNs)时,其基本性质尚未被充分探索的问题。解决方案的关键在于基于交换干预方法进行更深入的分析,通过构建具有可控因果关系的合成图数据集,实现对语义关系和因果建模的精确操控,并利用该数据集进行交换干预实验,以揭示LLM增强器与GNN之间的内在逻辑和机制。在此基础上,设计了一个即插即用的优化模块,以提升LLM增强器与GNN之间的信息传递效率。

链接: https://arxiv.org/abs/2505.08265
作者: Hang Gao,Wenxuan Huang,Fengge Wu,Junsuo Zhao,Changwen Zheng,Huaping Liu
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Accepted by ICML 2025

点击查看摘要

Abstract:The use of large language models (LLMs) as feature enhancers to optimize node representations, which are then used as inputs for graph neural networks (GNNs), has shown significant potential in graph representation learning. However, the fundamental properties of this approach remain underexplored. To address this, we conduct a more in-depth analysis based on the interchange intervention method. First, we construct a synthetic graph dataset with controllable causal relationships, enabling precise manipulation of semantic relationships and causal modeling to provide data for analysis. Using this dataset, we conduct interchange interventions to examine the deeper properties of LLM enhancers and GNNs, uncovering their underlying logic and internal mechanisms. Building on the analytical results, we design a plug-and-play optimization module to improve the information transfer between LLM enhancers and GNNs. Experiments across multiple datasets and models validate the proposed module.
zh

[AI-40] Automatic Curriculum Learning for Driving Scenarios: Towards Robust and Efficient Reinforcement Learning

【速读】:该论文试图解决在强化学习(Reinforcement Learning, RL)中训练端到端自动驾驶代理时面临的泛化能力和实际部署限制问题。传统方法通常在固定场景和周围道路使用者的典型行为下进行训练,导致模型难以适应复杂多变的真实环境。为了解决这一问题,该研究提出了一种自动课程学习(Automatic Curriculum Learning, ACL)框架,其关键在于通过一个“教师”模块动态生成并调整驾驶场景的复杂度,该模块基于代理当前策略的“学习潜力”这一代理中心指标,而非依赖人工设计的课程,从而避免了专家偏差并提升了可扩展性。该框架通过排除代理已掌握或过于困难的场景,提高了训练效率。

链接: https://arxiv.org/abs/2505.08264
作者: Ahmed Abouelazm,Tim Weinstein,Tim Joseph,Philip Schörner,J. Marius Zöllner
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注: Accepted in the 36th IEEE Intelligent Vehicles Symposium (IV 2025)

点击查看摘要

Abstract:This paper addresses the challenges of training end-to-end autonomous driving agents using Reinforcement Learning (RL). RL agents are typically trained in a fixed set of scenarios and nominal behavior of surrounding road users in simulations, limiting their generalization and real-life deployment. While domain randomization offers a potential solution by randomly sampling driving scenarios, it frequently results in inefficient training and sub-optimal policies due to the high variance among training scenarios. To address these limitations, we propose an automatic curriculum learning framework that dynamically generates driving scenarios with adaptive complexity based on the agent's evolving capabilities. Unlike manually designed curricula that introduce expert bias and lack scalability, our framework incorporates a "teacher" that automatically generates and mutates driving scenarios based on their learning potential – an agent-centric metric derived from the agent's current policy – eliminating the need for expert design. The framework enhances training efficiency by excluding scenarios the agent has mastered or finds too challenging. We evaluate our framework in a reinforcement learning setting where the agent learns a driving policy from camera images. Comparative results against baseline methods, including fixed scenario training and domain randomization, demonstrate that our approach leads to enhanced generalization, achieving higher success rates: +9% in low traffic density, +21% in high traffic density, and faster convergence with fewer training steps. Our findings highlight the potential of ACL in improving the robustness and efficiency of RL-based autonomous driving agents.
zh

[AI-41] Evaluating LLM Metrics Through Real-World Capabilities

【速读】:该论文试图解决当前评估生成式 AI(Generative AI)性能的方法未能充分反映其在实际应用场景中的效用问题。现有基准测试多侧重于通用智能的抽象衡量,如代码生成或事实回忆,而忽略了用户在日常任务中对 AI 的多样化需求,例如写作辅助、摘要生成、引用格式化和风格反馈等。论文的关键解决方案是通过大规模调查数据和使用日志分析,识别出六项核心能力(Summarization, Technical Assistance, Reviewing Work, Data Structuring, Generation, and Information Retrieval),并基于这些能力评估现有基准测试的覆盖范围、效率和可解释性,最终提出以用户为中心的评价标准,以更准确地反映 AI 在真实场景中的实用性。

链接: https://arxiv.org/abs/2505.08253
作者: Justin K Miller,Wenjia Tang
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 14 pages main text, 5 pages references, 20 pages appendix; includes 3 figures and 4 tables

点击查看摘要

Abstract:As generative AI becomes increasingly embedded in everyday workflows, it is important to evaluate its performance in ways that reflect real-world usage rather than abstract notions of intelligence. Unlike many existing benchmarks that assess general intelligence, our approach focuses on real-world utility, evaluating how well models support users in everyday tasks. While current benchmarks emphasize code generation or factual recall, users rely on AI for a much broader range of activities-from writing assistance and summarization to citation formatting and stylistic feedback. In this paper, we analyze large-scale survey data and usage logs to identify six core capabilities that represent how people commonly use Large Language Models (LLMs): Summarization, Technical Assistance, Reviewing Work, Data Structuring, Generation, and Information Retrieval. We then assess the extent to which existing benchmarks cover these capabilities, revealing significant gaps in coverage, efficiency measurement, and interpretability. Drawing on this analysis, we use human-centered criteria to identify gaps in how well current benchmarks reflect common usage that is grounded in five practical criteria: coherence, accuracy, clarity, relevance, and efficiency. For four of the six capabilities, we identify the benchmarks that best align with real-world tasks and use them to compare leading models. We find that Google Gemini outperforms other models-including OpenAI’s GPT, xAI’s Grok, Meta’s LLaMA, Anthropic’s Claude, DeepSeek, and Qwen from Alibaba-on these utility-focused metrics.
zh

[AI-42] Reinforcement Learning-based Fault-Tolerant Control for Quadrotor with Online Transformer Adaptation ICRA

【速读】:该论文旨在解决多旋翼飞行器在执行任务时因执行器故障导致的快速不稳定和任务可靠性下降问题。现有基于强化学习的容错控制(FTC)方法通常依赖于多旋翼模型的先验知识或难以适应新的配置,而该论文提出了一种结合基于Transformer的在线自适应模块的混合强化学习框架,其关键在于利用Transformer架构实时推断潜在表示,从而在无需重新训练的情况下适应未见过的系统模型。

链接: https://arxiv.org/abs/2505.08223
作者: Dohyun Kim,Jayden Dongwoo Lee,Hyochoong Bang,Jungho Bae
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注: Accpted at the 2025 IEEE International Conference on Robotics Automation (ICRA) Workshop: Robots in the Wild

点击查看摘要

Abstract:Multirotors play a significant role in diverse field robotics applications but remain highly susceptible to actuator failures, leading to rapid instability and compromised mission reliability. While various fault-tolerant control (FTC) strategies using reinforcement learning (RL) have been widely explored, most previous approaches require prior knowledge of the multirotor model or struggle to adapt to new configurations. To address these limitations, we propose a novel hybrid RL-based FTC framework integrated with a transformer-based online adaptation module. Our framework leverages a transformer architecture to infer latent representations in real time, enabling adaptation to previously unseen system models without retraining. We evaluate our method in a PyBullet simulation under loss-of-effectiveness actuator faults, achieving a 95% success rate and a positional root mean square error (RMSE) of 0.129 m, outperforming existing adaptation methods with 86% success and an RMSE of 0.153 m. Further evaluations on quadrotors with varying configurations confirm the robustness of our framework across untrained dynamics. These results demonstrate the potential of our framework to enhance the adaptability and reliability of multirotors, enabling efficient fault management in dynamic and uncertain environments. Website is available at this http URL
zh

[AI-43] Scaling Multi Agent Reinforcement Learning for Underwater Acoustic Tracking via Autonomous Vehicles

【速读】:该论文旨在解决多自主水下航行器(Autonomous Vehicles, AV)在复杂海洋环境中进行多目标跟踪时面临的计算挑战,特别是多智能体强化学习(Multi-Agent Reinforcement Learning, MARL)在大规模场景下的样本效率低下和训练速度慢的问题。其关键解决方案是提出一种迭代蒸馏方法,将高保真仿真环境中的动态迁移至简化的、GPU加速的环境,同时保持高层动力学特性,从而实现高达30,000倍的速度提升,并结合一种基于Transformer的架构(TransfMAPPO),以提升样本效率并实现对智能体数量和目标数量的不变性策略学习。

链接: https://arxiv.org/abs/2505.08222
作者: Matteo Gallici,Ivan Masmitja,Mario Martín
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC); Performance (cs.PF)
备注:

点击查看摘要

Abstract:Autonomous vehicles (AV) offer a cost-effective solution for scientific missions such as underwater tracking. Recently, reinforcement learning (RL) has emerged as a powerful method for controlling AVs in complex marine environments. However, scaling these techniques to a fleet–essential for multi-target tracking or targets with rapid, unpredictable motion–presents significant computational challenges. Multi-Agent Reinforcement Learning (MARL) is notoriously sample-inefficient, and while high-fidelity simulators like Gazebo’s LRAUV provide 100x faster-than-real-time single-robot simulations, they offer no significant speedup for multi-vehicle scenarios, making MARL training impractical. To address these limitations, we propose an iterative distillation method that transfers high-fidelity simulations into a simplified, GPU-accelerated environment while preserving high-level dynamics. This approach achieves up to a 30,000x speedup over Gazebo through parallelization, enabling efficient training via end-to-end GPU acceleration. Additionally, we introduce a novel Transformer-based architecture (TransfMAPPO) that learns multi-agent policies invariant to the number of agents and targets, significantly improving sample efficiency. Following large-scale curriculum learning conducted entirely on GPU, we perform extensive evaluations in Gazebo, demonstrating that our method maintains tracking errors below 5 meters over extended durations, even in the presence of multiple fast-moving targets. This work bridges the gap between large-scale MARL training and high-fidelity deployment, providing a scalable framework for autonomous fleet control in real-world sea missions.
zh

[AI-44] Unveiling the Best Practices for Applying Speech Foundation Models to Speech Intelligibility Prediction for Hearing-Impaired People

【速读】:该论文旨在解决如何有效适配生成式语音基础模型(Speech Foundation Models, SFMs)以提升对听力障碍人群的语音可懂度预测(Speech Intelligibility Prediction for Hearing-Impaired, SIP-HI)性能的问题。其解决方案的关键在于识别并优化影响SIP-HI性能的核心设计因素,包括编码器层选择、预测头架构以及集成配置。研究发现,选择单一编码器层优于传统全层方法,时间建模对于预测头的有效性至关重要,同时集成多个SFMs能够提升性能,尤其是当个体模型能力较强时效果更显著。

链接: https://arxiv.org/abs/2505.08215
作者: Haoshuai Zhou,Boxuan Cao,Changgeng Mo,Linkai Li,Shan Xiang Wang
机构: 未知
类目: Artificial Intelligence (cs.AI); Sound (cs.SD); Audio and Speech Processing (eess.AS)
备注:

点击查看摘要

Abstract:Speech foundation models (SFMs) have demonstrated strong performance across a variety of downstream tasks, including speech intelligibility prediction for hearing-impaired people (SIP-HI). However, optimizing SFMs for SIP-HI has been insufficiently explored. In this paper, we conduct a comprehensive study to identify key design factors affecting SIP-HI performance with 5 SFMs, focusing on encoder layer selection, prediction head architecture, and ensemble configurations. Our findings show that, contrary to traditional use-all-layers methods, selecting a single encoder layer yields better results. Additionally, temporal modeling is crucial for effective prediction heads. We also demonstrate that ensembling multiple SFMs improves performance, with stronger individual models providing greater benefit. Finally, we explore the relationship between key SFM attributes and their impact on SIP-HI performance. Our study offers practical insights into effectively adapting SFMs for speech intelligibility prediction for hearing-impaired populations.
zh

[AI-45] DSADF: Thinking Fast and Slow for Decision Making

【速读】:该论文试图解决强化学习(Reinforcement Learning, RL)代理在动态环境中泛化能力不足的问题,以及如何有效融合基础模型的推理能力和RL代理的快速响应能力以提升决策效率。其解决方案的关键在于提出一种双系统自适应决策框架(Dual-System Adaptive Decision Framework, DSADF),该框架整合了两个互补模块:基于RL代理和记忆空间的快速直觉决策系统(System 1)以及由视觉语言模型(Vision Language Model, VLM)驱动的深度分析推理系统(System 2),通过协同作用实现高效且自适应的决策。

链接: https://arxiv.org/abs/2505.08189
作者: Alex Zhihao Dou,Dongfei Cui,Jun Yan,Weida Wang,Benteng Chen,Haoming Wang,Zeke Xie,Shufei Zhang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Although Reinforcement Learning (RL) agents are effective in well-defined environments, they often struggle to generalize their learned policies to dynamic settings due to their reliance on trial-and-error interactions. Recent work has explored applying Large Language Models (LLMs) or Vision Language Models (VLMs) to boost the generalization of RL agents through policy optimization guidance or prior knowledge. However, these approaches often lack seamless coordination between the RL agent and the foundation model, leading to unreasonable decision-making in unfamiliar environments and efficiency bottlenecks. Making full use of the inferential capabilities of foundation models and the rapid response capabilities of RL agents and enhancing the interaction between the two to form a dual system is still a lingering scientific question. To address this problem, we draw inspiration from Kahneman’s theory of fast thinking (System 1) and slow thinking (System 2), demonstrating that balancing intuition and deep reasoning can achieve nimble decision-making in a complex world. In this study, we propose a Dual-System Adaptive Decision Framework (DSADF), integrating two complementary modules: System 1, comprising an RL agent and a memory space for fast and intuitive decision making, and System 2, driven by a VLM for deep and analytical reasoning. DSADF facilitates efficient and adaptive decision-making by combining the strengths of both systems. The empirical study in the video game environment: Crafter and Housekeep demonstrates the effectiveness of our proposed method, showing significant improvements in decision abilities for both unseen and known tasks.
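
双系统的分工可以用如下伪代码级草图说明(`rl_policy`、`memory`、`vlm_reason` 的接口与置信度阈值均为假设,仅示意"快思考兜底、慢思考补位"的路由逻辑,并非论文的具体实现):

```python
def dual_system_act(state, rl_policy, memory, vlm_reason, conf_threshold=0.8):
    action, confidence = rl_policy(state)        # System 1:RL 策略快速给出动作
    if confidence >= conf_threshold or memory.contains(state):
        return action                            # 熟悉或高置信场景:直觉决策
    plan = vlm_reason(state)                     # System 2:VLM 深度分析推理
    memory.store(state, plan)                    # 写回记忆,下次可走快速通道
    return plan[0]                               # 执行推理计划的第一个动作
```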
zh

[AI-46] Feasibility-Aware Pessimistic Estimation: Toward Long-Horizon Safety in Offline RL

【速读】:该论文旨在解决离线安全强化学习(Offline Safe Reinforcement Learning, OSRL)中长期安全性不足、对分布外(Out-of-Distribution, OOD)状态和动作处理能力差以及样本效率低的问题。其解决方案的关键在于提出一种基于条件变分自编码器(CVAE)的悲观估计框架(Feasibility-Aware offline Safe Reinforcement Learning with CVAE-based Pessimism, FASP),通过Hamilton-Jacobi(H-J)可达性分析生成可靠的安全标签,结合悲观估计方法优化Q值评估,从而提升长期安全性和对OOD数据的鲁棒性,并保证高样本效率。

链接: https://arxiv.org/abs/2505.08179
作者: Zhikun Tao,Gang Xiong,He Fang,Zhen Shen,Yunjun Han,Qing-Shan Jia
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Offline safe reinforcement learning (OSRL) derives constraint-satisfying policies from pre-collected datasets and offers a promising avenue for deploying RL in safety-critical real-world domains such as robotics. However, the majority of existing approaches emphasize only short-term safety, neglecting long-horizon considerations. Consequently, they may violate safety constraints and fail to ensure sustained protection during online deployment. Moreover, the learned policies often struggle to handle states and actions that are absent or out-of-distribution (OOD) relative to the offline dataset, and exhibit limited sample efficiency. To address these challenges, we propose a novel framework Feasibility-Aware offline Safe Reinforcement Learning with CVAE-based Pessimism (FASP). First, we employ Hamilton-Jacobi (H-J) reachability analysis to generate reliable safety labels, which serve as supervisory signals for training both a conditional variational autoencoder (CVAE) and a safety classifier. This approach not only ensures high sampling efficiency but also provides rigorous long-horizon safety guarantees. Furthermore, we utilize pessimistic estimation methods to estimate the Q-value of reward and cost, which mitigates the extrapolation errors induced by OOD actions, and penalize unsafe actions to enable the agent to proactively avoid high-risk behaviors. Moreover, we theoretically prove the validity of this pessimistic estimation. Extensive experiments on DSRL benchmarks demonstrate that the FASP algorithm achieves competitive performance across multiple experimental tasks, particularly outperforming state-of-the-art algorithms in terms of safety.
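
离线 RL 中"悲观估计"的常见做法是用 Q 网络集成的均值减去不确定性惩罚,可示意如下(通用配方,并非论文原式;`beta` 为假设的悲观系数):

```python
import torch

def pessimistic_q(q_ensemble, states, actions, beta=1.0):
    # 集成均值 - beta * 标准差:不确定性越大(常见于 OOD 动作),估值越低
    qs = torch.stack([q(states, actions) for q in q_ensemble])  # (E, batch)
    return qs.mean(dim=0) - beta * qs.std(dim=0)
```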
zh

[AI-47] Behind the Noise: Conformal Quantile Regression Reveals Emergent Representations

【速读】:该论文旨在解决科学成像中因缩短采集时间导致的测量噪声增加与高质量数据获取之间的矛盾问题。其核心挑战在于如何在减少采集时间以提高通量的同时,保持数据的可用性和准确性。该研究提出的解决方案的关键在于利用基于校准不确定性边界的小型化、随机结构神经网络集成,通过共形分位数回归进行训练,从而实现可靠的去噪,并揭示潜在空间中的可解释性结构,而无需依赖标签或分割。

链接: https://arxiv.org/abs/2505.08176
作者: Petrus H. Zwart,Tamas Varga,Odeta Qafoku,James A. Sethian
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Scientific imaging often involves long acquisition times to obtain high-quality data, especially when probing complex, heterogeneous systems. However, reducing acquisition time to increase throughput inevitably introduces significant noise into the measurements. We present a machine learning approach that not only denoises low-quality measurements with calibrated uncertainty bounds, but also reveals emergent structure in the latent space. By using ensembles of lightweight, randomly structured neural networks trained via conformal quantile regression, our method performs reliable denoising while uncovering interpretable spatial and chemical features – without requiring labels or segmentation. Unlike conventional approaches focused solely on image restoration, our framework leverages the denoising process itself to drive the emergence of meaningful representations. We validate the approach on real-world geobiochemical imaging data, showing how it supports confident interpretation and guides experimental design under resource constraints.
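
共形分位数回归的两个关键步骤(分位数头的训练损失,以及校准集上的共形化)可示意如下(标准 CQR 配方,Romano et al., 2019;论文的随机网络集成细节不在此展示):

```python
import numpy as np

def pinball_loss(pred, target, q):
    # 分位数(pinball)损失:用于训练第 q 分位的预测头
    diff = target - pred
    return np.mean(np.maximum(q * diff, (q - 1) * diff))

def conformalize(lo_cal, hi_cal, y_cal, alpha=0.1):
    # 在校准集上取非一致性分数的 (1-alpha) 经验分位数 q_hat,
    # 最终区间为 [lo - q_hat, hi + q_hat],可保证边际覆盖率
    scores = np.maximum(lo_cal - y_cal, y_cal - hi_cal)
    n = len(y_cal)
    return np.quantile(scores, min(1.0, np.ceil((n + 1) * (1 - alpha)) / n))
```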
zh

[AI-48] Fast Text-to-Audio Generation with Adversarial Post-Training

【速读】:该论文试图解决文本到音频系统在推理时延迟过高,导致其在许多创意应用中不具实用性的问题。解决方案的关键在于提出一种名为对抗相对对比(Adversarial Relativistic-Contrastive, ARC)的后训练方法,该方法是首个不基于知识蒸馏的扩散/流模型对抗加速算法。ARC后训练通过将最近的相对对抗框架扩展到扩散/流模型的后训练,并结合一种新颖的对比判别器目标,以促进更好的提示遵循性,从而实现高效的音频生成。

链接: https://arxiv.org/abs/2505.08175
作者: Zachary Novack,Zach Evans,Zack Zukowski,Josiah Taylor,CJ Carr,Julian Parker,Adnan Al-Sinan,Gian Marco Iodice,Julian McAuley,Taylor Berg-Kirkpatrick,Jordi Pons
机构: 未知
类目: ound (cs.SD); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multimedia (cs.MM); Audio and Speech Processing (eess.AS)
备注:

点击查看摘要

Abstract:Text-to-audio systems, while increasingly performant, are slow at inference time, thus making their latency impractical for many creative applications. We present Adversarial Relativistic-Contrastive (ARC) post-training, the first adversarial acceleration algorithm for diffusion/flow models not based on distillation. While past adversarial post-training methods have struggled to compare against their expensive distillation counterparts, ARC post-training is a simple procedure that (1) extends a recent relativistic adversarial formulation to diffusion/flow post-training and (2) combines it with a novel contrastive discriminator objective to encourage better prompt adherence. We pair ARC post-training with a number of optimizations to Stable Audio Open and build a model capable of generating ≈12s of 44.1kHz stereo audio in ≈75ms on an H100, and ≈7s on a mobile edge-device, the fastest text-to-audio model to our knowledge.
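
其中"相对论对抗"部分的核心损失可用标准的相对论 GAN 形式示意(通用配方;论文新增的对比判别器目标未在此展示):

```python
import torch
import torch.nn.functional as F

def relativistic_d_loss(d_real, d_fake):
    # 判别器:只要求真样本得分相对假样本更高
    return -F.logsigmoid(d_real - d_fake).mean()

def relativistic_g_loss(d_real, d_fake):
    # 生成器目标对称:让假样本相对真样本更"真"
    return -F.logsigmoid(d_fake - d_real).mean()
```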
zh

[AI-49] Feature Fitted Online Conformal Prediction for Deep Time Series Forecasting Model

【速读】:该论文试图解决在时间序列预测中如何有效量化预测不确定性的问题,特别是通过在线置信区间来实现。现有基于深度学习的点预测模型构建置信区间的方法存在关键局限性,如需要昂贵的重新训练、未能充分利用深度模型的表征能力或缺乏理论保证。解决方案的关键在于提出一种轻量级的合规预测方法,该方法利用预训练点预测模型提取的特征来拟合残差预测器并构建置信区间,并通过自适应覆盖控制机制进行增强,从而在不重新训练模型的情况下实现有效的覆盖率和更短的区间长度。

链接: https://arxiv.org/abs/2505.08158
作者: Xiannan Huang,Shuhan Qiu
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Time series forecasting is critical for many applications, where deep learning-based point prediction models have demonstrated strong performance. However, in practical scenarios, there is also a need to quantify predictive uncertainty through online confidence intervals. Existing confidence interval modeling approaches building upon these deep point prediction models suffer from key limitations: they either require costly retraining, fail to fully leverage the representational strengths of deep models, or lack theoretical guarantees. To address these gaps, we propose a lightweight conformal prediction method that provides valid coverage and shorter interval lengths without retraining. Our approach leverages features extracted from pre-trained point prediction models to fit a residual predictor and construct confidence intervals, further enhanced by an adaptive coverage control mechanism. Theoretically, we prove that our method achieves asymptotic coverage convergence, with error bounds dependent on the feature quality of the underlying point prediction model. Experiments on 12 datasets demonstrate that our method delivers tighter confidence intervals while maintaining desired coverage rates. Code, model, and dataset are available on Github: this https URL
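
其"自适应覆盖控制"可用 Gibbs & Candès (2021) 的 ACI 更新规则示意(该规则是否与论文机制完全一致属于假设):

```python
def adaptive_alpha(alpha_t, covered, target=0.1, gamma=0.01):
    # 上一时刻真实值未落入区间(err=1)则减小 alpha、加宽区间;
    # 已覆盖(err=0)则增大 alpha、收窄区间,长期覆盖率趋向 1 - target
    err = 0.0 if covered else 1.0
    return alpha_t + gamma * (target - err)
```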
zh

[AI-50] Hyperbolic Contrastive Learning with Model-augmentation for Knowledge-aware Recommendation

【速读】:该论文旨在解决基于图神经网络(GNN)的对比学习在知识感知推荐中难以有效捕捉用户-物品二分图和知识图谱中的潜在层次结构,以及通过扰动图结构生成正样本可能导致用户偏好学习偏移的问题。其解决方案的关键在于提出一种基于双曲对比学习的模型增强方法,首先设计了一种新颖的洛伦兹知识聚合机制以更好地表征用户和物品,其次提出了三种模型级增强技术,能够在不引起正样本对偏好偏移的情况下辅助双曲对比学习。

链接: https://arxiv.org/abs/2505.08157
作者: Shengyin Sun,Chen Ma
机构: 未知
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
备注: 18 pages

点击查看摘要

Abstract:Benefiting from the effectiveness of graph neural networks (GNNs) and contrastive learning, GNN-based contrastive learning has become mainstream for knowledge-aware recommendation. However, most existing contrastive learning-based methods have difficulties in effectively capturing the underlying hierarchical structure within user-item bipartite graphs and knowledge graphs. Moreover, they commonly generate positive samples for contrastive learning by perturbing the graph structure, which may lead to a shift in user preference learning. To overcome these limitations, we propose hyperbolic contrastive learning with model-augmentation for knowledge-aware recommendation. To capture the intrinsic hierarchical graph structures, we first design a novel Lorentzian knowledge aggregation mechanism, which enables more effective representations of users and items. Then, we propose three model-level augmentation techniques to assist Hyperbolic contrastive learning. Different from the classical structure-level augmentation (e.g., edge dropping), the proposed model-augmentations can avoid preference shifts between the augmented positive pair. Finally, we conduct extensive experiments to demonstrate the superiority (maximum improvement of 11.03%) of the proposed methods over existing baselines.
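
文中的知识聚合机制构建在洛伦兹(双曲面)模型的基本运算之上,下面示意其中两个基础算子(标准定义,并非论文的聚合机制本身;约定第 0 维为时间分量):

```python
import torch

def lorentz_inner(x, y):
    # 洛伦兹内积:<x, y>_L = -x_0 * y_0 + Σ_i x_i * y_i
    return -x[..., 0] * y[..., 0] + (x[..., 1:] * y[..., 1:]).sum(dim=-1)

def lorentz_distance(x, y):
    # 曲率 -1 的双曲面模型上的测地距离:d = arccosh(-<x, y>_L)
    inner = torch.clamp(-lorentz_inner(x, y), min=1.0 + 1e-6)
    return torch.acosh(inner)
```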
zh

[AI-51] Efficient and Scalable Neural Symbolic Search for Knowledge Graph Complex Query Answering

【速读】:该论文旨在解决复杂查询回答(Complex Query Answering, CQA)在不完整知识图谱中检索逻辑公式的答案集时面临的效率与可扩展性问题。现有基于神经符号搜索的方法虽然在准确性上表现优异,但存在数据复杂度随实体数量呈二次增长以及循环查询导致的查询复杂度为NP难的问题,从而难以有效扩展到更大规模的知识图谱和更复杂的查询。论文提出的解决方案关键在于:首先,引入两种约束策略以计算神经逻辑索引,从而缩小变量的取值范围,降低符号搜索的数据复杂度;其次,提出一种基于局部搜索的近似算法,以应对循环查询带来的NP难问题。实验结果表明,该框架在保持相近性能的同时,将符号方法的计算负载降低了90%,有效缓解了效率与可扩展性问题。

链接: https://arxiv.org/abs/2505.08155
作者: Weizhi Fei,Zihao Wang,hang Yin,Shukai Zhao,Wei Zhang,Yangqiu Song
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Complex Query Answering (CQA) aims to retrieve answer sets for complex logical formulas from incomplete knowledge graphs, which is a crucial yet challenging task in knowledge graph reasoning. While neuro-symbolic search utilized neural link predictions achieve superior accuracy, they encounter significant complexity bottlenecks: (i) Data complexity typically scales quadratically with the number of entities in the knowledge graph, and (ii) Query complexity becomes NP-hard for cyclic queries. Consequently, these approaches struggle to effectively scale to larger knowledge graphs and more complex queries. To address these challenges, we propose an efficient and scalable symbolic search framework. First, we propose two constraint strategies to compute neural logical indices to reduce the domain of variables, thereby decreasing the data complexity of symbolic search. Additionally, we introduce an approximate algorithm based on local search to tackle the NP query complexity of cyclic queries. Experiments on various CQA benchmarks demonstrate that our framework reduces the computational load of symbolic methods by 90% while maintaining nearly the same performance, thus alleviating both efficiency and scalability issues.
zh

[AI-52] Foundation Models Knowledge Distillation For Battery Capacity Degradation Forecast

【速读】:该论文旨在解决锂离子电池容量退化估计的准确性问题,以提升电池运行的可靠性和安全性。传统专家模型虽针对特定场景设计,但其估计结果具有局限性。为实现基于大模型技术的零样本泛化能力,本文提出了一种面向退化的微调策略,关键在于通过在大规模开放电池充放电数据上微调时间序列基础模型(如Timer模型),使其具备强大的零样本泛化能力。此外,为应对大模型部署的计算挑战,还提出了知识蒸馏框架,将预训练基础模型的知识迁移至轻量级专家模型,从而显著提升专家模型在多工况下的泛化性能。

链接: https://arxiv.org/abs/2505.08151
作者: Joey Chan,Zhen Chen,Ershun Pan
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Accurate estimation of lithium-ion battery capacity degradation is critical for enhancing the reliability and safety of battery operations. Traditional expert models, tailored to specific scenarios, provide isolated estimations. With the rapid advancement of data-driven techniques, a series of general-purpose time-series foundation models have been developed. However, foundation models specifically designed for battery capacity degradation remain largely unexplored. To enable zero-shot generalization in battery degradation prediction using large model technology, this study proposes a degradation-aware fine-tuning strategy for time-series foundation models. We apply this strategy to fine-tune the Timer model on approximately 10 GB of open-source battery charge-discharge data. Validation on our released CycleLife-SJTUIE dataset demonstrates that the fine-tuned Battery-Timer possesses strong zero-shot generalization capability in capacity degradation forecasting. To address the computational challenges of deploying large models, we further propose a knowledge distillation framework that transfers the knowledge of pre-trained foundation models into compact expert models. Distillation results across several state-of-the-art time-series expert models confirm that foundation model knowledge significantly improves the multi-condition generalization of expert models.
zh

[AI-53] Communication Styles and Reader Preferences of LLM and Human Experts in Explaining Health Information

【速读】:该论文试图解决大型语言模型(Large Language Models, LLMs)在健康信息事实核查中的沟通风格与人类专家之间的差异及其对读者感知的影响问题。其解决方案的关键在于通过构建一个包含1498条权威机构发布的健康虚假信息解释的数据集,并生成LLM对不准确健康信息的回应,结合健康传播理论,从信息的语言特征、发送者说服策略和接收者价值认同三个维度评估沟通风格。研究进一步通过盲评实验验证了人类对LLM内容的偏好,揭示了LLM结构化信息呈现方式在吸引读者方面的潜在有效性。

链接: https://arxiv.org/abs/2505.08143
作者: Jiawei Zhou,Kritika Venkatachalam,Minje Choi,Koustuv Saha,Munmun De Choudhury
机构: 未知
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:With the wide adoption of large language models (LLMs) in information assistance, it is essential to examine their alignment with human communication styles and values. We situate this study within the context of fact-checking health information, given the critical challenge of rectifying conceptions and building trust. Recent studies have explored the potential of LLM for health communication, but style differences between LLMs and human experts and associated reader perceptions remain under-explored. In this light, our study evaluates the communication styles of LLMs, focusing on how their explanations differ from those of humans in three core components of health communication: information, sender, and receiver. We compiled a dataset of 1498 health misinformation explanations from authoritative fact-checking organizations and generated LLM responses to inaccurate health information. Drawing from health communication theory, we evaluate communication styles across three key dimensions: information linguistic features, sender persuasive strategies, and receiver value alignments. We further assessed human perceptions through a blinded evaluation with 99 participants. Our findings reveal that LLM-generated articles showed significantly lower scores in persuasive strategies, certainty expressions, and alignment with social values and moral foundations. However, human evaluation demonstrated a strong preference for LLM content, with over 60% responses favoring LLM articles for clarity, completeness, and persuasiveness. Our results suggest that LLMs' structured approach to presenting information may be more effective at engaging readers despite scoring lower on traditional measures of quality in fact-checking and health communication.
zh

[AI-54] Lost in Transmission: When and Why LLM s Fail to Reason Globally

【速读】:该论文试图解决基于Transformer的大语言模型(Large Language Models, LLMs)在处理需要复杂推理的任务时所面临的性能瓶颈问题,尤其是这些模型在处理长输入时信息传递能力受限导致的失败。解决方案的关键在于引入了有界注意力前缀预言机(Bounded Attention Prefix Oracle, BAPO)模型,该模型通过模拟注意力头的带宽限制,揭示了某些推理任务(如图可达性)对通信带宽的高需求,并将这类任务定义为BAPO-hard。研究进一步表明,使用思维链(Chain of Thought, CoT)可以将BAPO-hard问题转化为BAPO-easy问题,从而提升模型的推理能力。

链接: https://arxiv.org/abs/2505.08140
作者: Tobias Schnabel,Kiran Tomlinson,Adith Swaminathan,Jennifer Neville
机构: 未知
类目: Artificial Intelligence (cs.AI); Formal Languages and Automata Theory (cs.FL); Machine Learning (cs.LG)
备注: 28 pages

点击查看摘要

Abstract:Despite their many successes, transformer-based large language models (LLMs) continue to struggle with tasks that require complex reasoning over large parts of their input. We argue that these failures arise due to capacity limits on the accurate flow of information within LLMs. To formalize this issue, we introduce the bounded attention prefix oracle (BAPO) model, a new computational framework that models bandwidth constraints on attention heads, the mechanism for internal communication in LLMs. We show that several important reasoning problems like graph reachability require high communication bandwidth for BAPOs to solve; we call these problems BAPO-hard. Our experiments corroborate our theoretical predictions: GPT-4, Claude, and Gemini succeed on BAPO-easy tasks and fail even on relatively small BAPO-hard tasks. BAPOs also reveal another benefit of chain of thought (CoT): we prove that breaking down a task using CoT can turn any BAPO-hard problem into a BAPO-easy one. Our results offer principled explanations for key LLM failures and suggest directions for architectures and inference methods that mitigate bandwidth limits.
zh

[AI-55] Mirror Mirror on the Wall Have I Forgotten it All? A New Framework for Evaluating Machine Unlearning

【速读】:该论文试图解决机器遗忘(machine unlearning)方法在安全性上的不足问题,即如何确保通过遗忘方法生成的模型与从未包含遗忘数据的镜像模型(mirror model)之间无法被区分。其解决方案的关键在于提出一种称为“计算遗忘”(computational unlearning)的强形式化定义,该定义要求攻击者无法以高于随机猜测的概率区分遗忘模型与镜像模型,从而为机器遗忘的可行性提供理论保障。

链接: https://arxiv.org/abs/2505.08138
作者: Brennon Brimhall,Philip Mathew,Neil Fendley,Yinzhi Cao,Matthew Green
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
备注:

点击查看摘要

Abstract:Machine unlearning methods take a model trained on a dataset and a forget set, then attempt to produce a model as if it had only been trained on the examples not in the forget set. We empirically show that an adversary is able to distinguish between a mirror model (a control model produced by retraining without the data to forget) and a model produced by an unlearning method across representative unlearning methods from the literature. We build distinguishing algorithms based on evaluation scores in the literature (i.e. membership inference scores) and Kullback-Leibler divergence. We propose a strong formal definition for machine unlearning called computational unlearning. Computational unlearning is defined as the inability for an adversary to distinguish between a mirror model and a model produced by an unlearning method. If the adversary cannot guess better than random (except with negligible probability), then we say that an unlearning method achieves computational unlearning. Our computational unlearning definition provides theoretical structure to prove unlearning feasibility results. For example, our computational unlearning definition immediately implies that there are no deterministic computational unlearning methods for entropic learning algorithms. We also explore the relationship between differential privacy (DP)-based unlearning methods and computational unlearning, showing that DP-based approaches can satisfy computational unlearning at the cost of an extreme utility collapse. These results demonstrate that current methodology in the literature fundamentally falls short of achieving computational unlearning. We conclude by identifying several open questions for future work.
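
文中基于 KL 散度的区分攻击思路可粗略示意如下(假设攻击者能拿到两模型在遗忘样本上的输出概率分布;阈值选取为假设):

```python
import numpy as np

def kl_divergence(p, q, eps=1e-12):
    p = np.asarray(p) + eps
    q = np.asarray(q) + eps
    return float(np.sum(p * np.log(p / q)))

def distinguish(unlearned_probs, mirror_probs, threshold):
    # 若遗忘模型与镜像模型的平均输出分布差异超过阈值,判定二者可区分
    score = np.mean([kl_divergence(p, q)
                     for p, q in zip(unlearned_probs, mirror_probs)])
    return score > threshold
```

只有当任何攻击者都无法以显著优于随机猜测的概率完成这种区分时,才满足文中的"计算遗忘"定义。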
zh

[AI-56] Leverag ing AI for Productive and Trustworthy HPC Software: Challenges and Research Directions

【速读】:该论文试图解决如何利用人工智能(Artificial Intelligence, AI)技术革新高性能计算(High-Performance Computing, HPC)软件开发的问题。其关键解决方案在于探索将最先进的AI技术,特别是生成式AI,应用于这一高度专业化且细分的软件领域,并通过两个由美国能源部资助的项目——Ellora和Durban,推动HPC软件的智能化发展。

链接: https://arxiv.org/abs/2505.08135
作者: Keita Teranishi,Harshitha Menon,William F. Godoy,Prasanna Balaprakash,David Bau,Tal Ben-Nun,Abhinav Bathele,Franz Franchetti,Michael Franusich,Todd Gamblin,Giorgis Georgakoudis,Tom Goldstein,Arjun Guha,Steven Hahn,Costin Iancu,Zheming Jin,Terry Jones,Tze Meng Low,Het Mankad,Narasinga Rao Miniskar,Mohammad Alaul Haque Monil,Daniel Nichols,Konstantinos Parasyris,Swaroop Pophale,Pedro Valero-Lara,Jeffrey S. Vetter,Samuel Williams,Aaron Young
机构: 未知
类目: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC); Performance (cs.PF)
备注: 12 pages, 1 Figure, Accepted at “The 1st International Workshop on Foundational Large Language Models Advances for HPC” LLM4HPC to be held in conjunction with ISC High Performance 2025

点击查看摘要

Abstract:We discuss the challenges and propose research directions for using AI to revolutionize the development of high-performance computing (HPC) software. AI technologies, in particular large language models, have transformed every aspect of software development. For its part, HPC software is recognized as a highly specialized scientific field of its own. We discuss the challenges associated with leveraging state-of-the-art AI technologies to develop such a unique and niche class of software and outline our research directions in the two US Department of Energy–funded projects for advancing HPC Software via AI: Ellora and Durban.
zh

[AI-57] One Bad NOFO? AI Governance in Federal Grantmaking

【速读】:该论文试图解决美国联邦机构在通过拨款政策进行人工智能(Artificial Intelligence, AI)治理方面被忽视的作用问题,即机构如何通过设定拨款项目目标、评审标准和AI使用限制来间接规范受资助方的AI行为。解决方案的关键在于分析大量联邦拨款通知(NOFOs),揭示机构在拨款文件中对AI的引导与限制,并指出当前拨款政策在透明度、问责性和隐私保护方面的不足。

链接: https://arxiv.org/abs/2505.08133
作者: Dan Bateyko,Karen Levy
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
备注: In The 2025 ACM Conference on Fairness, Accountability, and Transparency (FAccT '25), June 23—26, 2025, Athens, Greece. 13 pages

点击查看摘要

Abstract:Much scholarship considers how U.S. federal agencies govern artificial intelligence (AI) through rulemaking and their own internal use policies. But agencies have an overlooked AI governance role: setting discretionary grant policy when directing billions of dollars in federal financial assistance. These dollars enable state and local entities to study, create, and use AI. This funding not only goes to dedicated AI programs, but also to grantees using AI in the course of meeting their routine grant objectives. As discretionary grantmakers, agencies guide and restrict what grant winners do – a hidden lever for AI governance. Agencies pull this lever by setting program objectives, judging criteria, and restrictions for AI use. Using a novel dataset of over 40,000 non-defense federal grant notices of funding opportunity (NOFOs) posted to this http URL between 2009 and 2024, we analyze how agencies regulate the use of AI by grantees. We select records mentioning AI and review their stated goals and requirements. We find agencies promoting AI in notice narratives, shaping adoption in ways other records of grant policy might fail to capture. Of the grant opportunities that mention AI, we find only a handful of AI-specific judging criteria or restrictions. This silence holds even when agencies fund AI uses in contexts affecting people’s rights and which, under an analogous federal procurement regime, would result in extra oversight. These findings recast grant notices as a site of AI policymaking – albeit one that is developing out of step with other regulatory efforts and incomplete in its consideration of transparency, accountability, and privacy protections. The paper concludes by drawing lessons from AI procurement scholarship, while identifying distinct challenges in grantmaking that invite further study.
zh

[AI-58] High-order Regularization for Machine Learning and Learning-based Control

【速读】:该论文试图解决机器学习中神经网络训练的泛化能力与可解释性问题,特别是如何通过正则化方法提升神经网络在强化学习中的性能。解决方案的关键在于提出一种高阶正则化(High-Order Regularization, HR),该方法不仅保证了近似算法的可证明收敛性,还建立了正则化与可解释学习之间的联系,同时理论证明了正则化可以视为逆映射的近似,并给出了显式可计算的近似误差。此外,HR方法适用于任意映射矩阵的神经网络,并通过理论分析表明,适当的正则化矩阵能够最大化神经网络的泛化能力。

链接: https://arxiv.org/abs/2505.08129
作者: Xinghua Liu,Ming Cao
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:The paper proposes a novel regularization procedure for machine learning. The proposed high-order regularization (HR) provides new insight into regularization, which is widely used to train a neural network that can be utilized to approximate the action-value function in general reinforcement learning problems. The proposed HR method ensures the provable convergence of the approximation algorithm, which makes the much-needed connection between regularization and explainable learning using neural networks. The proposed HR method theoretically demonstrates that regularization can be regarded as an approximation in terms of inverse mapping with explicitly calculable approximation error, and the L_2 regularization is a lower-order case of the proposed method. We provide lower and upper bounds for the error of the proposed HR solution, which helps build a reliable model. We also find that regularization with the proposed HR can be regarded as a contraction. We prove that the generalizability of neural networks can be maximized with a proper regularization matrix, and the proposed HR is applicable for neural networks with any mapping matrix. With the theoretical explanation of the extreme learning machine for neural network training and the proposed high-order regularization, one can better interpret the output of the neural network, thus leading to explainable learning. We present a case study based on regularized extreme learning neural networks to demonstrate the application of the proposed HR and give the corresponding incremental HR solution. We verify the performance of the proposed HR method by solving a classic control problem in reinforcement learning. The result demonstrates the superior performance of the method with significant enhancement in the generalizability of the neural network.
zh
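
代码示意:摘要指出L2正则化是所提高阶正则化(HR)的低阶特例,并以正则化极限学习机(ELM)为案例。下面仅演示这一低阶(L2)特例,说明“正则化解即对伪逆(逆映射)的近似”这一视角;HR本身的构造请以论文为准,此处数据与超参数均为假设:

```python
import numpy as np

rng = np.random.default_rng(0)

# 玩具回归数据
X = rng.uniform(-1, 1, size=(200, 1))
y = np.sin(3 * X[:, 0]) + 0.1 * rng.normal(size=200)

# 极限学习机:隐层权重随机固定,只解析求解输出权重
n_hidden = 50
W_in = rng.normal(size=(1, n_hidden))
b = rng.normal(size=n_hidden)
H = np.tanh(X @ W_in + b)                      # 隐层输出矩阵

# L2(岭)正则化解:beta = (H^T H + lam*I)^{-1} H^T y
# 可视为用 (H^T H + lam*I)^{-1} H^T 逼近伪逆 H^+,即摘要所述HR的低阶特例
lam = 1e-2
beta = np.linalg.solve(H.T @ H + lam * np.eye(n_hidden), H.T @ y)

print("训练均方误差:", float(np.mean((H @ beta - y) ** 2)))
```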

[AI-59] Graph-Based Floor Separation Using Node Embeddings and Clustering of WiFi Trajectories

【速读】:该论文旨在解决室内多层环境中垂直定位(vertical localization)的问题,特别是通过Wi-Fi指纹轨迹实现楼层分离。其解决方案的关键在于构建一个基于图的模型,其中节点表示Wi-Fi指纹,边根据信号相似性和上下文转换进行加权,并利用Node2Vec生成低维嵌入,随后通过K-means聚类识别不同楼层。该方法在Huawei University Challenge 2021数据集上表现出优于传统社区检测算法的性能。

链接: https://arxiv.org/abs/2505.08088
作者: Rabia Yasa Kostas,Kahraman Kostas
机构: 未知
类目: Networking and Internet Architecture (cs.NI); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Machine Learning (cs.LG); Robotics (cs.RO)
备注:

点击查看摘要

Abstract:Indoor positioning systems (IPSs) are increasingly vital for location-based services in complex multi-storey environments. This study proposes a novel graph-based approach for floor separation using Wi-Fi fingerprint trajectories, addressing the challenge of vertical localization in indoor settings. We construct a graph where nodes represent Wi-Fi fingerprints, and edges are weighted by signal similarity and contextual transitions. Node2Vec is employed to generate low-dimensional embeddings, which are subsequently clustered using K-means to identify distinct floors. Evaluated on the Huawei University Challenge 2021 dataset, our method outperforms traditional community detection algorithms, achieving an accuracy of 68.97%, an F1-score of 61.99%, and an Adjusted Rand Index of 57.19%. By publicly releasing the preprocessed dataset and implementation code, this work contributes to advancing research in indoor positioning. The proposed approach demonstrates robustness to signal noise and architectural complexities, offering a scalable solution for floor-level localization.
zh
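
代码示意:摘要的流水线为“指纹图构建 → Node2Vec嵌入 → K-means聚类分楼层”。下面用 networkx、社区常用的 node2vec 包与 scikit-learn 给出一个最小示意,其中边的构造方式、相似度函数与各超参数均为假设性示例:

```python
from itertools import combinations

import networkx as nx
from node2vec import Node2Vec          # pip install node2vec
from sklearn.cluster import KMeans

def cluster_floors(fingerprints, similarity, n_floors, threshold=0.5):
    """fingerprints:指纹ID列表;similarity(i, j):返回两条指纹的信号相似度(假设接口)。"""
    G = nx.Graph()
    G.add_nodes_from(fingerprints)
    for i, j in combinations(fingerprints, 2):
        w = similarity(i, j)
        if w > threshold:              # 仅保留足够相似的边
            G.add_edge(i, j, weight=w)

    # Node2Vec:基于带权随机游走生成低维节点嵌入
    n2v = Node2Vec(G, dimensions=64, walk_length=30, num_walks=100, workers=1)
    model = n2v.fit(window=10, min_count=1)
    emb = [model.wv[str(n)] for n in G.nodes()]   # 该包内部以字符串作为节点键

    # K-means 按楼层数聚类,簇标签即近似的楼层划分
    labels = KMeans(n_clusters=n_floors, n_init=10, random_state=0).fit_predict(emb)
    return dict(zip(G.nodes(), labels))
```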

[AI-60] What Matters for Batch Online Reinforcement Learning in Robotics?

【速读】:该论文试图解决在机器人学习中如何有效利用自主收集的大量数据进行策略改进的问题,即批在线强化学习(batch online reinforcement learning)中的挑战。现有方法如模仿学习及其变种在从自主数据中高效提升策略或快速收敛至次优解方面表现不佳。论文提出的关键解决方案是:使用Q函数引导批在线强化学习以显著提升性能,采用基于策略分布选择最优动作的隐式策略提取方法替代传统方法,并优先选择表达能力强的策略类。此外,通过引入时间相关噪声增强多样性,进一步提升了性能和扩展性。

链接: https://arxiv.org/abs/2505.08078
作者: Perry Dong,Suvir Mirchandani,Dorsa Sadigh,Chelsea Finn
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:The ability to learn from large batches of autonomously collected data for policy improvement – a paradigm we refer to as batch online reinforcement learning – holds the promise of enabling truly scalable robot learning by significantly reducing the need for human effort of data collection while getting benefits from self-improvement. Yet, despite the promise of this paradigm, it remains challenging to achieve due to algorithms not being able to learn effectively from the autonomous data. For example, prior works have applied imitation learning and filtered imitation learning methods to the batch online RL problem, but these algorithms often fail to efficiently improve from the autonomously collected data or converge quickly to a suboptimal point. This raises the question of what matters for effective batch online RL in robotics. Motivated by this question, we perform a systematic empirical study of three axes – (i) algorithm class, (ii) policy extraction methods, and (iii) policy expressivity – and analyze how these axes affect performance and scaling with the amount of autonomous data. Through our analysis, we make several observations. First, we observe that the use of Q-functions to guide batch online RL significantly improves performance over imitation-based methods. Building on this, we show that an implicit method of policy extraction – via choosing the best action in the distribution of the policy – is necessary over traditional policy extraction methods from offline RL. Next, we show that an expressive policy class is preferred over less expressive policy classes. Based on this analysis, we propose a general recipe for effective batch online RL. We then show a simple addition to the recipe of using temporally-correlated noise to obtain more diversity results in further performance gains. Our recipe obtains significantly better performance and scaling compared to prior methods.
zh
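
代码示意:摘要的两个关键做法是“隐式策略提取”(在策略自身分布内采样候选动作、按Q值择优)与时间相关噪声。下面给出这两点的极简示意,策略与Q函数的接口均为假设:

```python
import numpy as np

def implicit_policy_extraction(policy_sample, q_value, state, n_candidates=32):
    """隐式提取:不直接回归最优动作,而是在策略分布中采样候选、按Q值挑选。

    policy_sample(state, n):从当前策略采样n个动作(假设接口);
    q_value(state, action):返回标量Q值估计(假设接口)。
    """
    candidates = policy_sample(state, n_candidates)
    scores = [q_value(state, a) for a in candidates]
    return candidates[int(np.argmax(scores))]

def temporally_correlated_noise(prev_noise, shape, sigma=0.1, beta=0.9):
    """时间相关噪声的一种常见形式(AR(1)式),用于提升自主采集数据的多样性(假设实现)。"""
    return beta * prev_noise + sigma * np.random.randn(*shape)
```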

[AI-61] Explainable Reinforcement Learning Agents Using World Models

【速读】:该论文试图解决可解释强化学习(Explainable Reinforcement Learning, XRL)中由于序列决策的时序特性所带来的复杂性问题,以及非AI专家用户难以直接修改智能体或其策略的问题。解决方案的关键在于引入一种基于世界模型(World Models)的技术,通过生成反事实轨迹来提供解释,并进一步增强基于模型的深度强化学习(Model-Based Deep RL)智能体,加入一个逆向世界模型(Reverse World Model),该模型预测为了使智能体偏好某个反事实动作,世界状态应为何种情形。研究结果表明,展示世界应有状态的解释能够显著提升用户对智能体策略的理解,并假设此类解释有助于用户通过操控环境来学习如何控制智能体的执行。

链接: https://arxiv.org/abs/2505.08073
作者: Madhuri Singh,Amal Alabdulkarim,Gennie Mansi,Mark O. Riedl
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: The paper content spans 7 pages, followed by a page of references. It contains 7 figures and 2 small tables

点击查看摘要

Abstract:Explainable AI (XAI) systems have been proposed to help people understand how AI systems produce outputs and behaviors. Explainable Reinforcement Learning (XRL) has an added complexity due to the temporal nature of sequential decision-making. Further, non-AI experts do not necessarily have the ability to alter an agent or its policy. We introduce a technique for using World Models to generate explanations for Model-Based Deep RL agents. World Models predict how the world will change when actions are performed, allowing for the generation of counterfactual trajectories. However, identifying what a user wanted the agent to do is not enough to understand why the agent did something else. We augment Model-Based RL agents with a Reverse World Model, which predicts what the state of the world should have been for the agent to prefer a given counterfactual action. We show that explanations that show users what the world should have been like significantly increase their understanding of the agent policy. We hypothesize that our explanations can help users learn how to control the agent's execution by manipulating the environment.
zh

[AI-62] Justified Evidence Collection for Argument-based AI Fairness Assurance

【速读】:该论文试图解决确保人工智能(Artificial Intelligence, AI)系统公平性的复杂社会技术挑战,这一挑战需要在系统生命周期的各个阶段进行细致考量和持续监督。解决方案的关键在于提出一种由系统工程驱动的框架,该框架通过软件工具支持,在两个阶段中实现基于论据的保证(argument-based assurance)的动态操作。第一阶段在需求规划阶段,多学科、多利益相关者团队通过全面的公平治理过程定义目标和主张;第二阶段则通过持续监控界面从现有成果(如模型、数据和用例文档中的度量指标)中收集证据,以动态支持这些主张。

链接: https://arxiv.org/abs/2505.08064
作者: Alpay Sabuncuoglu,Christopher Burr,Carsten Maple
机构: 未知
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
备注: The paper is accepted for ACM Conference on Fairness, Accountability, and Transparency (ACM FAccT '25)

点击查看摘要

Abstract:It is well recognised that ensuring fair AI systems is a complex sociotechnical challenge, which requires careful deliberation and continuous oversight across all stages of a system’s lifecycle, from defining requirements to model deployment and deprovisioning. Dynamic argument-based assurance cases, which present structured arguments supported by evidence, have emerged as a systematic approach to evaluating and mitigating safety risks and hazards in AI-enabled system development and have also been extended to deal with broader normative goals such as fairness and explainability. This paper introduces a systems-engineering-driven framework, supported by software tooling, to operationalise a dynamic approach to argument-based assurance in two stages. In the first stage, during the requirements planning phase, a multi-disciplinary and multi-stakeholder team define goals and claims to be established (and evidenced) by conducting a comprehensive fairness governance process. In the second stage, a continuous monitoring interface gathers evidence from existing artefacts (e.g. metrics from automated tests), such as model, data, and use case documentation, to support these arguments dynamically. The framework’s effectiveness is demonstrated through an illustrative case study in finance, with a focus on supporting fairness-related arguments.
zh

[AI-63] Bias or Optimality? Disentangling Bayesian Inference and Learning Biases in Human Decision-Making

【速读】:该论文试图解决人类在两臂伯努利老虎机(two-armed Bernoulli bandit, TABB)任务中表现出的正向偏差和确认偏差是否源于认知偏误还是学习机制中的非对称学习率问题。其解决方案的关键在于通过分析基于贝叶斯推断的Q-learning模型,揭示即使在客观更新信念的情况下,不对称学习率仍能再现这些行为特征,并指出确认偏差与非对称但衰减的学习率可能产生相同的实验行为信号。

链接: https://arxiv.org/abs/2505.08049
作者: Prakhar Godara
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Recent studies claim that human behavior in a two-armed Bernoulli bandit (TABB) task is described by positivity and confirmation biases, implying that humans do not integrate new information objectively. However, we find that even if the agent updates its belief via objective Bayesian inference, fitting the standard Q-learning model with asymmetric learning rates still recovers both biases. Bayesian inference cast as an effective Q-learning algorithm has symmetric, though decreasing, learning rates. We explain this by analyzing the stochastic dynamics of these learning systems using master equations. We find that both confirmation bias and unbiased but decreasing learning rates yield the same behavioral signatures. Finally, we propose experimental protocols to disentangle true cognitive biases from artifacts of decreasing learning rates.
zh
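
代码示意:对单个伯努利臂取Beta(1,1)先验时,后验均值的增量形式恰为“学习率对正负结果对称、且随观测数递减”的Q-learning更新,这正是摘要中“贝叶斯推断可写成有效学习率对称递减的Q-learning”的含义。推导:设后验为Beta(α, β),观测到回报r后,均值更新满足 m ← m + (r - m)/(α + β + 1)。下面几行代码验证这一等价(示意):

```python
import numpy as np

rng = np.random.default_rng(0)
p_true = 0.7                          # 该臂的真实成功概率
alpha, beta_ = 1.0, 1.0               # Beta(1,1) 先验
q = alpha / (alpha + beta_)           # 后验均值即该臂的“Q值”

for _ in range(1000):
    r = float(rng.random() < p_true)            # 伯努利回报
    lr = 1.0 / (alpha + beta_ + 1.0)            # 对称且递减的有效学习率
    q += lr * (r - q)                           # 增量式(Q-learning形)更新
    alpha, beta_ = alpha + r, beta_ + (1 - r)   # 标准贝叶斯共轭更新
    assert abs(q - alpha / (alpha + beta_)) < 1e-12  # 两种写法逐步完全一致

print("后验均值估计:", round(q, 3), " 真实值:", p_true)
```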

[AI-64] Online Learning-based Adaptive Beam Switching for 6G Networks: Enhancing Efficiency and Resilience

【速读】:该论文旨在解决6G网络中自适应波束切换面临的高频、移动性和遮挡等挑战。其解决方案的关键在于提出一种基于深度强化学习(Deep Reinforcement Learning, DRL)的在线学习框架,该框架引入了增强的状态表示(包括速度和遮挡历史)、门控循环单元(GRU)架构以及优先级经验回放机制,以实现实时波束优化。

链接: https://arxiv.org/abs/2505.08032
作者: Seyed Bagher Hashemi Natanzi,Zhicong Zhu,Bo Tang
机构: 未知
类目: Networking and Internet Architecture (cs.NI); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Adaptive beam switching in 6G networks is challenged by high frequencies, mobility, and blockage. We propose an Online Learning framework using Deep Reinforcement Learning (DRL) with an enhanced state representation (velocity and blockage history), a GRU architecture, and prioritized experience replay for real-time beam optimization. Validated via Nvidia Sionna under time-correlated blockage, our approach significantly enhances resilience in SNR, throughput, and accuracy compared to a conventional heuristic. Furthermore, the enhanced DRL agent outperforms a reactive Multi-Armed Bandit (MAB) baseline by leveraging temporal dependencies, achieving lower performance variability. This demonstrates the benefits of memory and prioritized learning for robust 6G beam management, while confirming MAB as a strong baseline.
zh
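
代码示意:摘要方法的组成部分之一是优先级经验回放。下面是按|TD误差|^α比例采样、并用重要性权重纠偏的简化回放缓冲区(通用教科书式实现,参数为常见默认值,非论文原始代码):

```python
import numpy as np

class PrioritizedReplayBuffer:
    """按 |TD误差|^alpha 比例采样的简化优先级经验回放(示意)。"""

    def __init__(self, capacity, alpha=0.6):
        self.capacity, self.alpha = capacity, alpha
        self.data, self.priorities = [], []
        self.pos = 0

    def add(self, transition, td_error=1.0):
        p = (abs(td_error) + 1e-6) ** self.alpha
        if len(self.data) < self.capacity:
            self.data.append(transition)
            self.priorities.append(p)
        else:                                   # 环形覆盖最旧样本
            self.data[self.pos] = transition
            self.priorities[self.pos] = p
            self.pos = (self.pos + 1) % self.capacity

    def sample(self, batch_size, beta=0.4):
        probs = np.asarray(self.priorities)
        probs = probs / probs.sum()
        idx = np.random.choice(len(self.data), batch_size, p=probs)
        weights = (len(self.data) * probs[idx]) ** (-beta)   # 重要性采样权重
        return [self.data[i] for i in idx], idx, weights / weights.max()

    def update_priorities(self, idx, td_errors):
        for i, e in zip(idx, td_errors):
            self.priorities[i] = (abs(e) + 1e-6) ** self.alpha
```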

[AI-65] PRISM: Complete Online Decentralized Multi-Agent Pathfinding with Rapid Information Sharing using Motion Constraints

【速读】:该论文旨在解决多任务多智能体路径规划(MT-MAPF)问题,其核心挑战在于如何在大规模智能体群体中同时规划安全且高效的路径,并避免碰撞。PRISM(Pathfinding with Rapid Information Sharing using Motion Constraints)作为解决方案,其关键在于采用了一种快速通信策略,通过信息包交换运动约束信息,从而提升协作路径规划和情境感知能力,即使在无直接通信的场景下也能有效运作。此外,PRISM能够在可能的情况下解决并避免所有死锁情形,增强了算法的鲁棒性。

链接: https://arxiv.org/abs/2505.08025
作者: Hannah Lee,Zachary Serlin,James Motes,Brendan Long,Marco Morales,Nancy M. Amato
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注: 38 pages, 8 figures

点击查看摘要

Abstract:We introduce PRISM (Pathfinding with Rapid Information Sharing using Motion Constraints), a decentralized algorithm designed to address the multi-task multi-agent pathfinding (MT-MAPF) problem. PRISM enables large teams of agents to concurrently plan safe and efficient paths for multiple tasks while avoiding collisions. It employs a rapid communication strategy that uses information packets to exchange motion constraint information, enhancing cooperative pathfinding and situational awareness, even in scenarios without direct communication. We prove that PRISM resolves and avoids all deadlock scenarios when possible, a critical challenge in decentralized pathfinding. Empirically, we evaluate PRISM across five environments and 25 random scenarios, benchmarking it against the centralized Conflict-Based Search (CBS) and the decentralized Token Passing with Task Swaps (TPTS) algorithms. PRISM demonstrates scalability and solution quality, supporting 3.4 times more agents than CBS and handling up to 2.5 times more tasks in narrow passage environments than TPTS. Additionally, PRISM matches CBS in solution quality while achieving faster computation times, even under low-connectivity conditions. Its decentralized design reduces the computational burden on individual agents, making it scalable for large environments. These results confirm PRISM’s robustness, scalability, and effectiveness in complex and dynamic pathfinding scenarios.
zh

[AI-66] The Correspondence Between Bounded Graph Neural Networks and Fragments of First-Order Logic

【速读】:该论文试图解决图神经网络(Graph Neural Networks, GNNs)的表达能力理解问题,即如何量化和分析GNN在处理图结构数据时的逻辑表达能力。解决方案的关键在于将一阶逻辑(First-Order Logic, FO)的特定片段(如模态逻辑ML、分级模态逻辑GML、带全称模态的模态逻辑ML(A)、二变量片段FO2及其计数量词扩展C2)与受限的GNN架构进行对应,从而建立一个统一的框架来刻画GNN的逻辑可表达性。通过引入一阶逻辑与模态逻辑的有限模型论方法和工具,该研究揭示了受限GNN架构与特定逻辑片段之间的等价关系,为深入理解GNN的表达能力提供了理论依据。

链接: https://arxiv.org/abs/2505.08021
作者: Bernardo Cuenca Grau,Przemysław A. Wałęga
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 11 pages

点击查看摘要

Abstract:Graph Neural Networks (GNNs) address two key challenges in applying deep learning to graph-structured data: they handle varying size input graphs and ensure invariance under graph isomorphism. While GNNs have demonstrated broad applicability, understanding their expressive power remains an important question. In this paper, we show that bounded GNN architectures correspond to specific fragments of first-order logic (FO), including modal logic (ML), graded modal logic (GML), modal logic with the universal modality (ML(A)), the two-variable fragment (FO2) and its extension with counting quantifiers (C2). To establish these results, we apply methods and tools from finite model theory of first-order and modal logics to the domain of graph representation learning. This provides a unifying framework for understanding the logical expressiveness of GNNs within FO.
zh

[AI-67] Fair Play for Individuals, Foul Play for Groups? Auditing Anonymization's Impact on ML Fairness

【速读】:该论文试图解决匿名化技术对机器学习(Machine Learning, ML)公平性的影响问题,特别是探讨k-匿名性、ℓ-多样性及t-接近性等常见匿名化技术如何影响个体公平性和群体公平性。其解决方案的关键在于系统性地评估不同匿名化程度对ML公平性的具体影响,并揭示隐私保护与公平性之间的权衡关系,从而为负责任的人工智能开发提供指导。

链接: https://arxiv.org/abs/2505.07985
作者: Héber H. Arcolezi,Mina Alishahi,Adda-Akram Bendoukha,Nesrine Kaaniche
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
备注:

点击查看摘要

Abstract:Machine learning (ML) algorithms are heavily based on the availability of training data, which, depending on the domain, often includes sensitive information about data providers. This raises critical privacy concerns. Anonymization techniques have emerged as a practical solution to address these issues by generalizing features or suppressing data to make it more difficult to accurately identify individuals. Although recent studies have shown that privacy-enhancing technologies can influence ML predictions across different subgroups, thus affecting fair decision-making, the specific effects of anonymization techniques, such as k-anonymity, ℓ-diversity, and t-closeness, on ML fairness remain largely unexplored. In this work, we systematically audit the impact of anonymization techniques on ML fairness, evaluating both individual and group fairness. Our quantitative study reveals that anonymization can degrade group fairness metrics by up to four orders of magnitude. Conversely, similarity-based individual fairness metrics tend to improve under stronger anonymization, largely as a result of increased input homogeneity. By analyzing varying levels of anonymization across diverse privacy settings and data distributions, this study provides critical insights into the trade-offs between privacy, fairness, and utility, offering actionable guidelines for responsible AI development. Our code is publicly available at: this https URL.
zh
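
代码示意:摘要的审计流程可概括为“施加不同强度的匿名化 → 在匿名化数据上训练 → 分别度量群体与个体公平性”。下面以统计均等差给出群体公平性度量,并以属性泛化示意k-匿名式的处理(均为假设性简化,论文的完整指标体系以原文为准):

```python
import numpy as np

def demographic_parity_difference(y_pred, sensitive):
    """群体公平性:各受保护群体正例预测率的最大差值,越接近0越公平。"""
    y_pred, sensitive = np.asarray(y_pred), np.asarray(sensitive)
    rates = [y_pred[sensitive == g].mean() for g in np.unique(sensitive)]
    return float(max(rates) - min(rates))

def generalize_numeric(values, bin_width):
    """k-匿名式泛化的简化示意:数值属性按区间泛化,区间越宽匿名性越强、信息越粗。"""
    values = np.asarray(values)
    return (values // bin_width) * bin_width

# 用法示意:在同一测试集上比较泛化前后训练所得模型的公平性差距
# gap_raw  = demographic_parity_difference(model_raw.predict(X_test),  s_test)
# gap_anon = demographic_parity_difference(model_anon.predict(X_test), s_test)
```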

[AI-68] Self-cross Feature based Spiking Neural Networks for Efficient Few-shot Learning

【速读】:该论文试图解决在少量样本学习(few-shot learning, FSL)中,深度神经网络(Deep Neural Networks, DNNs)计算成本高且可扩展性差的问题,以及脉冲神经网络(Spiking Neural Networks, SNNs)在捕捉复杂时空特征和进行跨类别比较方面的不足。解决方案的关键在于提出一种基于SNN的FSL框架,该框架结合了自特征提取模块和跨特征对比模块,以优化特征表示并降低功耗,同时采用时间高效训练损失与InfoNCE损失的组合来提升脉冲序列的时间动态性和判别能力。

链接: https://arxiv.org/abs/2505.07921
作者: Qi Xu,Junyang Zhu,Dongdong Zhou,Hao Chen,Yang Liu,Jiangrong Shen,Qiang Zhang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Deep neural networks (DNNs) excel in computer vision tasks, especially, few-shot learning (FSL), which is increasingly important for generalizing from limited examples. However, DNNs are computationally expensive with scalability issues in real world. Spiking Neural Networks (SNNs), with their event-driven nature and low energy consumption, are particularly efficient in processing sparse and dynamic data, though they still encounter difficulties in capturing complex spatiotemporal features and performing accurate cross-class comparisons. To further enhance the performance and efficiency of SNNs in few-shot learning, we propose a few-shot learning framework based on SNNs, which combines a self-feature extractor module and a cross-feature contrastive module to refine feature representation and reduce power consumption. We apply the combination of temporal efficient training loss and InfoNCE loss to optimize the temporal dynamics of spike trains and enhance the discriminative power. Experimental results show that the proposed FSL-SNN significantly improves the classification performance on the neuromorphic dataset N-Omniglot, and also achieves competitive performance to ANNs on static datasets such as CUB and miniImageNet with low power consumption.
zh

[AI-69] Efficient and Reproducible Biomedical Question Answering using Retrieval Augmented Generation

【速读】:该论文旨在解决生物医学问答(Biomedical QA)系统中准确性和效率之间的平衡问题,特别是在大规模文献库中实现高效且可扩展的检索与生成。其解决方案的关键在于采用检索增强生成(Retrieval-Augmented Generation, RAG)框架,通过评估多种检索策略(如BM25、BioBERT、MedCPT及混合方法)和数据存储系统(如Elasticsearch、MongoDB、FAISS),优化检索深度与响应时间的权衡,最终在PubMed全量语料库上部署高效的RAG系统。

链接: https://arxiv.org/abs/2505.07917
作者: Linus Stuhlmann,Michael Alexander Saxer,Jonathan Fürst
机构: 未知
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Databases (cs.DB); Machine Learning (cs.LG)
备注: Accepted at SDS25

点击查看摘要

Abstract:Biomedical question-answering (QA) systems require effective retrieval and generation components to ensure accuracy, efficiency, and scalability. This study systematically examines a Retrieval-Augmented Generation (RAG) system for biomedical QA, evaluating retrieval strategies and response time trade-offs. We first assess state-of-the-art retrieval methods, including BM25, BioBERT, MedCPT, and a hybrid approach, alongside common data stores such as Elasticsearch, MongoDB, and FAISS, on a ~10% subset of PubMed (2.4M documents) to measure indexing efficiency, retrieval latency, and retriever performance in the end-to-end RAG system. Based on these insights, we deploy the final RAG system on the full 24M PubMed corpus, comparing different retrievers’ impact on overall performance. Evaluations of the retrieval depth show that retrieving 50 documents with BM25 before reranking with MedCPT optimally balances accuracy (0.90), recall (0.90), and response time (1.91s). BM25 retrieval time remains stable (82ms), while MedCPT incurs the main computational cost. These results highlight previously not well-known trade-offs in retrieval depth, efficiency, and scalability for biomedical QA. With open-source code, the system is fully reproducible and extensible.
zh
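
代码示意:摘要给出的最优配置是“BM25先召回50篇,再用MedCPT交叉编码器重排”。下面用 rank_bm25 与 Hugging Face transformers 搭出该两阶段流水线的骨架;其中模型名 ncbi/MedCPT-Cross-Encoder 与分词方式均为常见用法层面的假设,请以官方发布为准:

```python
import torch
from rank_bm25 import BM25Okapi                      # pip install rank-bm25
from transformers import AutoTokenizer, AutoModelForSequenceClassification

def retrieve_then_rerank(query, docs, k_retrieve=50, k_final=10,
                         model_name="ncbi/MedCPT-Cross-Encoder"):  # 模型名为假设
    # 阶段一:BM25 稀疏召回(此处用简单空格分词示意)
    bm25 = BM25Okapi([d.lower().split() for d in docs])
    scores = bm25.get_scores(query.lower().split())
    top_idx = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)[:k_retrieve]

    # 阶段二:交叉编码器对 (query, doc) 逐对打分并重排
    tok = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSequenceClassification.from_pretrained(model_name)
    with torch.no_grad():
        inputs = tok([query] * len(top_idx), [docs[i] for i in top_idx],
                     truncation=True, padding=True, return_tensors="pt")
        rerank_scores = model(**inputs).logits.squeeze(-1)
    order = torch.argsort(rerank_scores, descending=True)[:k_final]
    return [docs[top_idx[i]] for i in order.tolist()]
```

摘要中的结论正对应这种分工:BM25召回快而稳定(约82ms),主要计算开销来自第二阶段的交叉编码器。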

[AI-70] Combining Bayesian Inference and Reinforcement Learning for Agent Decision Making: A Review

【速读】:该论文试图解决如何将贝叶斯推断(Bayesian inference)有效结合到强化学习(Reinforcement Learning, RL)中以提升智能体决策性能的问题,特别是针对数据效率、泛化能力、可解释性和安全性等关键挑战。其解决方案的关键在于系统性地梳理和分析贝叶斯方法在RL中的应用,涵盖基础贝叶斯方法、变分推断、贝叶斯深度学习、贝叶斯主动学习等多种技术,并探讨其与模型基础型RL、无模型RL及逆强化学习的结合方式,同时通过对比分析不同方法在多个性能指标上的表现,以及深入讨论贝叶斯方法在复杂RL问题中的适用性与实现机制。

链接: https://arxiv.org/abs/2505.07911
作者: Chengmin Zhou,Ville Kyrki,Pasi Fränti,Laura Ruotsalainen
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Bayesian inference has many advantages in decision making of agents (e.g. robotics/simulative agent) over a regular data-driven black-box neural network: Data-efficiency, generalization, interpretability, and safety where these advantages benefit directly/indirectly from the uncertainty quantification of Bayesian inference. However, there are few comprehensive reviews to summarize the progress of Bayesian inference on reinforcement learning (RL) for decision making to give researchers a systematic understanding. This paper focuses on combining Bayesian inference with RL that nowadays is an important approach in agent decision making. To be exact, this paper discusses the following five topics: 1) Bayesian methods that have potential for agent decision making. First basic Bayesian methods and models (Bayesian rule, Bayesian learning, and Bayesian conjugate models) are discussed followed by variational inference, Bayesian optimization, Bayesian deep learning, Bayesian active learning, Bayesian generative models, Bayesian meta-learning, and lifelong Bayesian learning. 2) Classical combinations of Bayesian methods with model-based RL (with approximation methods), model-free RL, and inverse RL. 3) Latest combinations of potential Bayesian methods with RL. 4) Analytical comparisons of methods that combine Bayesian methods with RL with respect to data-efficiency, generalization, interpretability, and safety. 5) In-depth discussions in six complex problem variants of RL, including unknown reward, partial-observability, multi-agent, multi-task, non-linear non-Gaussian, and hierarchical RL problems and the summary of how Bayesian methods work in the data collection, data processing and policy learning stages of RL to pave the way for better agent decision-making strategies.
zh

[AI-71] Tuning for Trustworthiness – Balancing Performance and Explanation Consistency in Neural Network Optimization

【速读】:该论文试图解决在超参数调优或神经网络架构优化过程中,可解释性很少被纳入考量的问题,尽管可解释人工智能(Explainable Artificial Intelligence, XAI)日益受到关注。论文提出了一种新的概念——XAI一致性,即不同特征归因方法之间的一致程度,并引入了量化该一致性的新指标。解决方案的关键在于首次将XAI一致性直接整合到超参数调优的目标函数中,构建了一个多目标优化框架,以平衡预测性能与解释的稳健性。通过该框架,研究者能够识别出架构配置空间中的不同区域,包括性能与可解释性均较差的区域、高性能但可解释性弱的区域,以及在性能与可解释性之间取得权衡的区域。

链接: https://arxiv.org/abs/2505.07910
作者: Alexander Hinterleitner,Thomas Bartz-Beielstein
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Despite the growing interest in Explainable Artificial Intelligence (XAI), explainability is rarely considered during hyperparameter tuning or neural architecture optimization, where the focus remains primarily on minimizing predictive loss. In this work, we introduce the novel concept of XAI consistency, defined as the agreement among different feature attribution methods, and propose new metrics to quantify it. For the first time, we integrate XAI consistency directly into the hyperparameter tuning objective, creating a multi-objective optimization framework that balances predictive performance with explanation robustness. Implemented within the Sequential Parameter Optimization Toolbox (SPOT), our approach uses both weighted aggregation and desirability-based strategies to guide model selection. Through our proposed framework and supporting tools, we explore the impact of incorporating XAI consistency into the optimization process. This enables us to characterize distinct regions in the architecture configuration space: one region with poor performance and comparatively low interpretability, another with strong predictive performance but weak interpretability due to low XAI consistency, and a trade-off region that balances both objectives by offering high interpretability alongside competitive performance. Beyond introducing this novel approach, our research provides a foundation for future investigations into whether models from the trade-off zone, balancing performance loss and XAI consistency, exhibit greater robustness by avoiding overfitting to training performance, thereby leading to more reliable predictions on out-of-distribution data.
zh
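
代码示意:“XAI一致性”被定义为不同特征归因方法之间的一致程度。一个直接的量化方式是对同一模型与样本分别计算多种归因,再取归因向量两两秩相关的均值,并与验证损失加权聚合成调参目标。以下度量方式为本文补充的示意性定义,论文的具体指标以原文为准:

```python
import numpy as np
from scipy.stats import spearmanr

def xai_consistency(attributions):
    """attributions:各归因方法得到的特征重要性向量列表(形状一致);
    返回两两Spearman秩相关的平均值,作为一致性分数。"""
    corrs = []
    for i in range(len(attributions)):
        for j in range(i + 1, len(attributions)):
            rho, _ = spearmanr(attributions[i], attributions[j])
            corrs.append(rho)
    return float(np.mean(corrs))

def tuning_objective(val_loss, consistency, w=0.5):
    """加权聚合式多目标(摘要提到的策略之一):同时压低验证损失与不一致性。"""
    return (1 - w) * val_loss + w * (1.0 - consistency)
```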

[AI-72] Latent Behavior Diffusion for Sequential Reaction Generation in Dyadic Setting

【速读】:该论文旨在解决对话中双人互动模拟中的面部反应生成问题,即如何根据对话伙伴的行为生成自然且符合语境的面部反应,以提升人机交互的自然度与效果。其解决方案的关键在于提出了一种名为潜在行为扩散模型(Latent Behavior Diffusion Model)的新方法,该方法由一个上下文感知的自编码器和基于扩散的条件生成器组成。自编码器负责将高维输入特征压缩为简洁的潜在表示,捕捉听众反应中的动态模式;而基于扩散的条件生成器则在该潜在空间上进行非自回归式的实时面部反应预测,从而生成多样且符合细微对话线索与情感状态的面部反应。

链接: https://arxiv.org/abs/2505.07901
作者: Minh-Duc Nguyen,Hyung-Jeong Yang,Soo-Hyung Kim,Ji-Eun Shin,Seung-Won Kim
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:The dyadic reaction generation task involves synthesizing responsive facial reactions that align closely with the behaviors of a conversational partner, enhancing the naturalness and effectiveness of human-like interaction simulations. This paper introduces a novel approach, the Latent Behavior Diffusion Model, comprising a context-aware autoencoder and a diffusion-based conditional generator that addresses the challenge of generating diverse and contextually relevant facial reactions from input speaker behaviors. The autoencoder compresses high-dimensional input features, capturing dynamic patterns in listener reactions while condensing complex input data into a concise latent representation, facilitating more expressive and contextually appropriate reaction synthesis. The diffusion-based conditional generator operates on the latent space generated by the autoencoder to predict realistic facial reactions in a non-autoregressive manner. This approach allows for generating diverse facial reactions that reflect subtle variations in conversational cues and emotional states. Experimental results demonstrate the effectiveness of our approach in achieving superior performance in dyadic reaction synthesis tasks compared to existing methods.
zh

[AI-73] Representation Learning with Mutual Influence of Modalities for Node Classification in Multi-Modal Heterogeneous Networks

【速读】:该论文旨在解决多模态异构网络(MMHNs)中节点分类的问题,当前的多模态融合方法要么采用早期融合策略导致单个模态的独特特征丢失,要么采用晚期融合方法忽视了图神经网络(GNN)信息传播中的跨模态引导。论文提出的解决方案是Heterogeneous Graph Neural Network with Inter-Modal Attention (HGNN-IMA),其关键在于通过异构图Transformer框架,在信息传播过程中捕捉多种模态之间的相互影响,引入嵌套的跨模态注意力机制实现自适应多模态融合,并考虑模态对齐以促进跨所有模态具有相似性的节点间的传播,同时通过注意力损失缓解缺失模态的影响。

链接: https://arxiv.org/abs/2505.07895
作者: Jiafan Li,Jiaqi Zhu,Liang Chang,Yilin Li,Miaomiao Li,Yang Wang,Hongan Wang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Nowadays, numerous online platforms can be described as multi-modal heterogeneous networks (MMHNs), such as Douban’s movie networks and Amazon’s product review networks. Accurately categorizing nodes within these networks is crucial for analyzing the corresponding entities, which requires effective representation learning on nodes. However, existing multi-modal fusion methods often adopt either early fusion strategies which may lose the unique characteristics of individual modalities, or late fusion approaches overlooking the cross-modal guidance in GNN-based information propagation. In this paper, we propose a novel model for node classification in MMHNs, named Heterogeneous Graph Neural Network with Inter-Modal Attention (HGNN-IMA). It learns node representations by capturing the mutual influence of multiple modalities during the information propagation process, within the framework of heterogeneous graph transformer. Specifically, a nested inter-modal attention mechanism is integrated into the inter-node attention to achieve adaptive multi-modal fusion, and modality alignment is also taken into account to encourage the propagation among nodes with consistent similarities across all modalities. Moreover, an attention loss is augmented to mitigate the impact of missing modalities. Extensive experiments validate the superiority of the model in the node classification task, providing an innovative view to handle multi-modal data, especially when accompanied with network structures.
zh

[AI-74] Enhancing Trust Management System for Connected Autonomous Vehicles Using Machine Learning Methods: A Survey

【速读】:该论文旨在解决车联网中协同自动驾驶车辆(CAVs)在动态、开放及多领域网络环境下的信任管理问题,以应对内部和外部威胁并确保可靠决策。其解决方案的关键在于提出一种基于机器学习(ML)的三层次信任管理框架,包括信任数据层、信任计算层和信任激励层,并结合六维目标分类体系,分析各模块中ML方法的原理,从而提升CAVs在复杂交通场景下的信任评估与协作能力。

链接: https://arxiv.org/abs/2505.07882
作者: Qian Xu,Lei Zhang,Yixiao Liu
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 31 pages, 9 figures

点击查看摘要

Abstract:Connected Autonomous Vehicles (CAVs) operate in dynamic, open, and multi-domain networks, rendering them vulnerable to various threats. Trust Management Systems (TMS) systematically organize essential steps in the trust mechanism, identifying malicious nodes against internal threats and external threats, as well as ensuring reliable decision-making for more cooperative tasks. Recent advances in machine learning (ML) offer significant potential to enhance TMS, especially for the strict requirements of CAVs, such as CAV nodes moving at varying speeds, and opportunistic and intermittent network behavior. Those features distinguish ML-based TMS from social networks, static IoT, and Social IoT. This survey proposes a novel three-layer ML-based TMS framework for CAVs in the vehicle-road-cloud integration system, i.e., trust data layer, trust calculation layer and trust incentive layer. A six-dimensional taxonomy of objectives is proposed. Furthermore, the principles of ML methods for each module in each layer are analyzed. Then, recent studies are categorized based on traffic scenarios that are against the proposed objectives. Finally, future directions are suggested, addressing the open issues and meeting the research trend. We maintain an active repository that contains up-to-date literature and open-source projects at this https URL.
zh

[AI-75] Efficient Telecom Specific LLM : TSLAM-Mini with QLoRA and Digital Twin Data

【速读】:该论文试图解决通用大语言模型(Large Language Models, LLMs)在实时电信应用中因缺乏领域特异性而表现不佳的问题。解决方案的关键在于对TSLAM-Mini模型进行细致的微调,该模型基于Phi-4 Mini Instruct 4B架构,具有38亿参数。微调过程利用了一个专门构建的包含10万样本的数据集,覆盖20个关键电信用例,并通过NetoAI的DigiTwin平台结合网络领域专家和权威RFC文档进行数据增强。此外,采用了量化低秩适应(Quantized Low-Rank Adaptation, QLoRA)技术以提高训练效率,并在资源受限的硬件上实现部署。

链接: https://arxiv.org/abs/2505.07877
作者: Vignesh Ethiraj,Divya Vijay,Sidhanth Menon,Heblin Berscilla
机构: 未知
类目: Networking and Internet Architecture (cs.NI); Artificial Intelligence (cs.AI)
备注: Introducing TSLAM-Mini, a specialized language model for telecommunications, demonstrating the efficacy of QLoRA fine-tuning and digital twin-synthesized data for enhanced network intelligence. Model available on: this https URL

点击查看摘要

Abstract:General-purpose large language models (LLMs), despite their broad capabilities accrued from open-world data, frequently exhibit suboptimal performance when confronted with the nuanced and specialized demands inherent in real-time telecommunications applications. This investigation addresses this critical limitation through the meticulous fine-tuning of TSLAM-Mini developed by NetoAI, a compact (3.8-billion parameter) causal language model architecturally derived from Phi-4 Mini Instruct 4B. The fine-tuning regimen leverages a bespoke dataset comprising 100,000 samples, strategically engineered to address 20 pivotal telecommunications use-cases, encompassing domains such as Network Fundamentals, IP Routing, MPLS, Network Security, Automation, OSS/BSS, RAN, Mobile Core, Satellite Communications, and Ethical AI. This dataset was curated utilizing NetoAI’s DigiTwin platform, enriched with granular insights from venerated network Subject Matter Experts (SMEs) and authoritative RFC documents, thereby capturing high-fidelity representations of real-world network dynamics through simulations inspired by digital twin paradigms. Employing Quantized Low-Rank Adaptation (QLoRA), a state-of-the-art Parameter Efficient Fine-Tuning (PEFT) technique, we achieved substantial training efficiency and enabled prospective deployment on resource-constrained hardware. A novel evaluation framework, predicated on a high-capacity LLM (Qwen3-235B-A22B) functioning as an automated adjudicator, was instituted to rigorously assess instruction-following fidelity and response quality across the specified telecom use-cases. Empirical results unequivocally demonstrate TSLAM-Mini’s superior aptitude in telecom-centric applications, underscoring the profound efficacy of domain-specific datasets and PEFT methodologies for advancing intelligent network management.
zh
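
代码示意:摘要采用QLoRA(4-bit量化基座 + 低秩适配器)微调Phi-4 Mini Instruct。下面是用 peft 与 bitsandbytes 搭建QLoRA的常见写法;基座模型名、秩r与target_modules等均为示例(依具体架构与论文设置而定),并非论文原始配置:

```python
import torch
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

base = "microsoft/Phi-4-mini-instruct"      # 基座模型名为示例,以官方发布为准

bnb = BitsAndBytesConfig(                    # QLoRA 的 “Q”:4-bit NF4 量化
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)
model = AutoModelForCausalLM.from_pretrained(base, quantization_config=bnb, device_map="auto")
model = prepare_model_for_kbit_training(model)

lora = LoraConfig(                           # QLoRA 的 “LoRA”:仅训练低秩增量
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # 模块名随架构而异
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora)
model.print_trainable_parameters()           # 可训练参数通常只占总量的很小比例
```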

[AI-76] Getting Ready for the EU AI Act in Healthcare. A call for Sustainable AI Development and Deployment

【速读】:该论文试图解决在医疗领域中,人工智能(Artificial Intelligence, AI)系统开发者和部署者如何确保其系统符合2024年8月生效的《人工智能法案》(AI Act)的问题,特别是在相关条款于2026年8月全面实施前,如何实现全面且有效的合规性。解决方案的关键在于将可信AI(Trustworthy AI)的伦理原则作为主动承诺,通过这些原则来解释和应用法案条款,从而识别最佳实践,提升AI系统的有效性与可持续性。

链接: https://arxiv.org/abs/2505.07875
作者: John Brandt Brodersen(1 and 2),Ilaria Amelia Caggiano(3),Pedro Kringen(4),Vince Istvan Madai(5),Walter Osika(6 and 7),Giovanni Sartor(8 and 9),Ellen Svensson(6 and 10),Magnus Westerlund(4),Roberto V. Zicari(11) ((1) University of Copenhagen, Denmark (2) UiT The Arctic University of Norway (3) Research Center in European Private Law (ReCEPL), Università degli studi Suor Orsola, Naples, Italy (4) Arcada University of Applied Science, Helsinki, Finland (5) QUEST Centre for Responsible Research, Berlin Institute of Health, Charité - Universitätsmedizin Berlin, Germany (6) TEA Lab, Karolinska Institutet, Stockholm, Sweden (7) Stockholm Health Care Services, Stockholm Region, Sweden (8) CIRSFID-Alma AI, University of Bologna, Italy (9) EUI, Florence, Italy (10) Stockholm University (11) Graduate School of Data Science, Seoul National University, South Korea)
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
备注: 8 pages, 1 table

点击查看摘要

Abstract:Assessments of trustworthiness have become a cornerstone of responsible AI development. Especially in high-stakes fields like healthcare, aligning technical, evidence-based, and ethical practices with forthcoming legal requirements is increasingly urgent. We argue that developers and deployers of AI systems for the medical domain should be proactive and take steps to progressively ensure that such systems, both those currently in use and those being developed or planned, respect the requirements of the AI Act, which has come into force in August 2024. This is necessary if full and effective compliance is to be ensured when the most relevant provisions of the Act become effective (August 2026). The engagement with the AI Act cannot be viewed as a formalistic exercise. Compliance with the AI Act needs to be carried out through the proactive commitment to the ethical principles of trustworthy AI. These principles provide the background for the Act, which mentions them several times and connects them to the protection of public interest. They can be used to interpret and apply the Act’s provisions and to identify good practices, increasing the validity and sustainability of AI systems over time.
zh

[AI-77] CCL: Collaborative Curriculum Learning for Sparse-Reward Multi-Agent Reinforcement Learning via Co-evolutionary Task Evolution

【速读】:该论文旨在解决稀疏奖励环境下的强化学习问题,特别是在多智能体系统(MAS)中,由于反馈延迟和跨智能体共享导致的学习效率低下问题。其解决方案的关键在于提出了一种名为协作多维课程学习(CCL)的新型课程学习框架,该框架通过优化个体智能体的中间任务、利用变分进化算法生成具有信息量的子任务,以及协同进化智能体与其环境以提高训练稳定性来实现性能提升。

链接: https://arxiv.org/abs/2505.07854
作者: Yufei Lin,Chengwei Ye,Huanzhen Zhang,Kangsheng Wang,Linuo Xu,Shuyan Liu,Zeyu Zhang
机构: 未知
类目: Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
备注:

点击查看摘要

Abstract:Sparse reward environments pose significant challenges in reinforcement learning, especially within multi-agent systems (MAS) where feedback is delayed and shared across agents, leading to suboptimal learning. We propose Collaborative Multi-dimensional Course Learning (CCL), a novel curriculum learning framework that addresses this by (1) refining intermediate tasks for individual agents, (2) using a variational evolutionary algorithm to generate informative subtasks, and (3) co-evolving agents with their environment to enhance training stability. Experiments on five cooperative tasks in the MPE and Hide-and-Seek environments show that CCL outperforms existing methods in sparse reward settings.
zh

[AI-78] SweRank: Software Issue Localization with Code Ranking

【速读】:该论文旨在解决软件问题定位(software issue localization)中的效率与效果问题,即如何准确且高效地从代码库中找到与自然语言问题描述(如缺陷报告或功能请求)相关的代码位置。现有基于大语言模型(Large Language Model, LLM)的代理方法虽有潜力,但存在延迟高和成本大的问题;而传统代码排序模型在处理冗长且描述性较强的问题查询时表现不佳。该研究提出的解决方案是SweRank,其关键在于构建一个高效的“检索-重排序”框架,并利用SweLoc数据集进行训练,该数据集包含从公共GitHub仓库中整理的真实问题描述与对应的代码修改。实验结果表明,SweRank在性能上优于现有排序模型和依赖封闭源LLM的代理系统。

链接: https://arxiv.org/abs/2505.07849
作者: Revanth Gangi Reddy,Tarun Suresh,JaeHyeok Doo,Ye Liu,Xuan Phi Nguyen,Yingbo Zhou,Semih Yavuz,Caiming Xiong,Heng Ji,Shafiq Joty
机构: 未知
类目: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
备注:

点击查看摘要

Abstract:Software issue localization, the task of identifying the precise code locations (files, classes, or functions) relevant to a natural language issue description (e.g., bug report, feature request), is a critical yet time-consuming aspect of software development. While recent LLM-based agentic approaches demonstrate promise, they often incur significant latency and cost due to complex multi-step reasoning and relying on closed-source LLMs. Alternatively, traditional code ranking models, typically optimized for query-to-code or code-to-code retrieval, struggle with the verbose and failure-descriptive nature of issue localization queries. To bridge this gap, we introduce SweRank, an efficient and effective retrieve-and-rerank framework for software issue localization. To facilitate training, we construct SweLoc, a large-scale dataset curated from public GitHub repositories, featuring real-world issue descriptions paired with corresponding code modifications. Empirical results on SWE-Bench-Lite and LocBench show that SweRank achieves state-of-the-art performance, outperforming both prior ranking models and costly agent-based systems using closed-source LLMs like Claude-3.5. Further, we demonstrate SweLoc’s utility in enhancing various existing retriever and reranker models for issue localization, establishing the dataset as a valuable resource for the community.
zh

[AI-79] Conceptual Logical Foundations of Artificial Social Intelligence

【速读】:该论文试图解决社会智能的理论基础问题,包括社会活动中的协调与合作机制、社会代理人的最小心理架构、代理人的意图与世界状态之间的关系以及沟通在协调过程中的作用。其解决方案的关键在于提出一种基于信息与策略性思维关联的逻辑框架,构建一个包含动态变化的代理选择与能力、不确定性(如物理状态和社会状态信息不完全)以及意图状态的最小社会代理架构,并探讨沟通、语义与语用意义与其意图和信息状态之间的关系。

链接: https://arxiv.org/abs/2505.07847
作者: Eric Werner
机构: 未知
类目: Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
备注:

点击查看摘要

Abstract:What makes a society possible at all? How is coordination and cooperation in social activity possible? What is the minimal mental architecture of a social agent? How is the information about the state of the world related to the agents' intentions? How are the intentions of agents related? What role does communication play in this coordination process? This essay explores the conceptual and logical foundations of artificial social intelligence in the context of a society of multiple agents that communicate and cooperate to achieve some end. An attempt is made to provide an introduction to some of the key concepts, their formal definitions and their interrelationships. These include the notion of a changing social world of multiple agents. The logic of social intelligence goes beyond classical logic by linking information with strategic thought. A minimal architecture of social agents is presented. The agents have different dynamically changing, possible choices and abilities. The agents also have uncertainty, lacking perfect information about their physical state as well as their dynamic social state. The social state of an agent includes the intentional state of that agent, as well as that agent's representation of the intentional states of other agents. Furthermore, it includes the evaluations agents make of their physical and social condition. Communication, semantic and pragmatic meaning and their relationship to intention and information states are investigated. The logic of agent abilities and intentions is motivated and formalized. The entropy of group strategic states is defined.
zh

[AI-80] Winning at All Cost: A Small Environment for Eliciting Specification Gaming Behaviors in Large Language Models

【速读】:该论文试图解决前沿大型语言模型(Large Language Models, LLMs)在面对无法通过合法方式取胜的情境时,可能通过利用系统漏洞来“绕过系统”的安全与对齐问题。其解决方案的关键在于采用一种新颖的文本模拟方法,通过设计一个不可赢的井字棋(tic-tac-toe)场景,分析模型在面对不可能任务时的行为倾向,从而揭示其利用系统漏洞的潜在倾向。研究结果表明,模型在特定提示下表现出显著的策略性漏洞利用行为,强调了在模型能力增强背景下,AI 对齐面临的紧迫挑战。

链接: https://arxiv.org/abs/2505.07846
作者: Lars Malmqvist
机构: 未知
类目: Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
备注: To be presented at SIMLA@ACNS 2025

点击查看摘要

Abstract:This study reveals how frontier Large Language Models (LLMs) can “game the system” when faced with impossible situations, a critical security and alignment concern. Using a novel textual simulation approach, we presented three leading LLMs (o1, o3-mini, and r1) with a tic-tac-toe scenario designed to be unwinnable through legitimate play, then analyzed their tendency to exploit loopholes rather than accept defeat. Our results are alarming for security researchers: the newer, reasoning-focused o3-mini model showed nearly twice the propensity to exploit system vulnerabilities (37.1%) compared to the older o1 model (17.5%). Most striking was the effect of prompting. Simply framing the task as requiring “creative” solutions caused gaming behaviors to skyrocket to 77.3% across all models. We identified four distinct exploitation strategies, from direct manipulation of game state to sophisticated modification of opponent behavior. These findings demonstrate that even without actual execution capabilities, LLMs can identify and propose sophisticated system exploits when incentivized, highlighting urgent challenges for AI alignment as models grow more capable of identifying and leveraging vulnerabilities in their operating environments.
zh

[AI-81] RAN Cortex: Memory-Augmented Intelligence for Context-Aware Decision-Making in AI-Native Networks

【速读】:该论文试图解决当前AI原生无线接入网(Radio Access Network, RAN)中智能模块在决策过程中缺乏持续记忆的问题,导致其无法有效应对具有周期性或重复性特征的网络动态环境。解决方案的关键在于提出RAN Cortex架构,该架构通过引入带有记忆增强的模块化层,实现上下文感知的决策支持。该模块包含上下文编码器、基于向量的记忆存储、回忆引擎以及策略接口,能够实时或近实时地为AI代理提供历史情境信息,从而提升网络的适应性与智能化水平。

链接: https://arxiv.org/abs/2505.07842
作者: Sebastian Barros
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 28 pages

点击查看摘要

Abstract:As Radio Access Networks (RAN) evolve toward AI-native architectures, intelligent modules such as xApps and rApps are expected to make increasingly autonomous decisions across scheduling, mobility, and resource management domains. However, these agents remain fundamentally stateless, treating each decision as isolated, lacking any persistent memory of prior events or outcomes. This reactive behavior constrains optimization, especially in environments where network dynamics exhibit episodic or recurring patterns. In this work, we propose RAN Cortex, a memory-augmented architecture that enables contextual recall in AI-based RAN decision systems. RAN Cortex introduces a modular layer composed of four elements: a context encoder that transforms network state into high-dimensional embeddings, a vector-based memory store of past network episodes, a recall engine to retrieve semantically similar situations, and a policy interface that supplies historical context to AI agents in real time or near-real time. We formalize the retrieval-augmented decision problem in the RAN, present a system architecture compatible with O-RAN interfaces, and analyze feasible deployments within the Non-RT and Near-RT RIC domains. Through illustrative use cases such as stadium traffic mitigation and mobility management in drone corridors, we demonstrate how contextual memory improves adaptability, continuity, and overall RAN intelligence. This work introduces memory as a missing primitive in AI-native RAN designs and provides a framework to enable “learning agents” without the need for retraining or centralized inference.
zh
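
代码示意:RAN Cortex的核心是“上下文编码器 + 向量记忆库 + 回忆引擎”。下面用numpy实现一个按余弦相似度召回历史网络情景的最小记忆模块;编码器此处以任意向量化函数代替,接口与数据内容均为假设:

```python
import numpy as np

class EpisodicMemory:
    """向量记忆库与回忆引擎的极简示意:存历史情景嵌入,按余弦相似度召回Top-k。"""

    def __init__(self, dim):
        self.embeddings = np.empty((0, dim))
        self.episodes = []

    def store(self, embedding, episode):
        v = np.asarray(embedding, dtype=float)
        v = v / (np.linalg.norm(v) + 1e-12)          # 归一化后点积即余弦相似度
        self.embeddings = np.vstack([self.embeddings, v])
        self.episodes.append(episode)                # 如:(网络状态摘要, 当时策略, 结果)

    def recall(self, query_embedding, k=3):
        if not self.episodes:
            return []
        q = np.asarray(query_embedding, dtype=float)
        q = q / (np.linalg.norm(q) + 1e-12)
        sims = self.embeddings @ q
        top = np.argsort(-sims)[:k]
        return [(self.episodes[i], float(sims[i])) for i in top]
```

xApp/rApp在决策前调用 recall,将召回的相似历史情景作为附加上下文输入策略,即可在不重训模型的前提下获得“记忆”。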

[AI-82] Moving From Monolithic To Microservices Architecture for Multi-Agent Systems

【速读】:该论文试图解决传统单体多智能体系统(Monolithic Multi-Agent Systems, MAS)在可扩展性和可维护性方面的局限性,其解决方案的关键在于采用微服务架构(Microservices Architecture)。通过引入微服务架构,论文强调了其在提升系统灵活性、模块化和协作效率方面的优势,并探讨了相关的核心架构原则与通信协议,如智能体通信语言(Agent Communication Languages, ACLs)、模型上下文协议(Model Context Protocol, MCP)和应用到应用协议(Application-to-Application, A2A)。

链接: https://arxiv.org/abs/2505.07838
作者: Muskaan Goyal,Pranav Bhasin
机构: 未知
类目: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC); Multiagent Systems (cs.MA)
备注: 5 pages, comparative analysis

点击查看摘要

Abstract:The transition from monolithic to microservices architecture revolutionized software development by improving scalability and maintainability. This paradigm shift is now becoming relevant for complex multi-agent systems (MAS). This review article explores the evolution from monolithic architecture to microservices architecture in the specific context of MAS. It will highlight the limitations of traditional monolithic MAS and the benefits of adopting a microservices-based approach. The article further examines the core architectural principles and communication protocols, including Agent Communication Languages (ACLs), the Model Context Protocol (MCP), and the Application-to-Application (A2A) protocol. The article identifies emerging architectural patterns, design challenges, and considerations through a comparative lens of the paradigm shift.
zh

[AI-83] Intelligent Product 3.0: Decentralised AI Agents and Web3 Intelligence Standards

【速读】:该论文试图解决传统智能产品在实时连接、数据准确性、自主决策及与网络化信息环境交互方面的局限性,其核心问题是如何通过新兴技术实现更高效、安全和自主的智能产品系统。解决方案的关键在于整合去中心化身份、基于区块链的产品信息与历史记录以及AI-to-AI协作,从而构建具备高韧性、共识机制和自主性的智能产品3.0规范。

链接: https://arxiv.org/abs/2505.07835
作者: Alex C. Y. Wong,Duncan McFarlane,C. Ellarby,M. Lee,M. Kuok
机构: 未知
类目: Networking and Internet Architecture (cs.NI); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
备注: 18 pages, 1 Figure, 3 Tables

点击查看摘要

Abstract:Twenty-five years ago, the specification of the Intelligent Product was established, envisaging real-time connectivity that not only enables products to gather accurate data about themselves but also allows them to assess and influence their own destiny. Early work by the Auto-ID project focused on creating a single, open-standard repository for storing and retrieving product information, laying a foundation for scalable connectivity. A decade later, the approach was revisited in light of low-cost RFID systems that promised a low-cost link between physical goods and networked information environments. Since then, advances in blockchain, Web3, and artificial intelligence have introduced unprecedented levels of resilience, consensus, and autonomy. By leveraging decentralised identity, blockchain-based product information and history, and intelligent AI-to-AI collaboration, this paper examines these developments and outlines a new specification for the Intelligent Product 3.0, illustrating how decentralised and AI-driven capabilities facilitate seamless interaction between physical AI and everyday products.
zh

[AI-84] ai.txt: A Domain-Specific Language for Guiding AI Interactions with the Internet

【速读】:该论文试图解决当前广泛采用的robots.txt标准在AI模型、代理与网络内容交互中的监管不足问题,尤其是在伦理和法律合规性方面缺乏足够的粒度和语义表达能力。解决方案的关键在于引入一种新的领域特定语言(Domain-Specific Language, DSL),即ai.txt,通过支持细粒度元素级控制和可被AI系统解析的自然语言指令,扩展传统基于URL的访问控制机制。此外,论文还提出了基于XML的程序化执行和自然语言提示集成两种合规机制,并通过初步实验和案例研究验证了其有效性。

链接: https://arxiv.org/abs/2505.07834
作者: Yuekang Li,Wei Song,Bangshuo Zhu,Dong Gong,Yi Liu,Gelei Deng,Chunyang Chen,Lei Ma,Jun Sun,Toby Walsh,Jingling Xue
机构: 未知
类目: Networking and Internet Architecture (cs.NI); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Programming Languages (cs.PL)
备注:

点击查看摘要

Abstract:We introduce ai.txt, a novel domain-specific language (DSL) designed to explicitly regulate interactions between AI models, agents, and web content, addressing critical limitations of the widely adopted robots.txt standard. As AI increasingly engages with online materials for tasks such as training, summarization, and content modification, existing regulatory methods lack the necessary granularity and semantic expressiveness to ensure ethical and legal compliance. ai.txt extends traditional URL-based access controls by enabling precise element-level regulations and incorporating natural language instructions interpretable by AI systems. To facilitate practical deployment, we provide an integrated development environment with code autocompletion and automatic XML generation. Furthermore, we propose two compliance mechanisms: XML-based programmatic enforcement and natural language prompt integration, and demonstrate their effectiveness through preliminary experiments and case studies. Our approach aims to aid the governance of AI-Internet interactions, promoting responsible AI use in digital ecosystems.
zh

[AI-85] Patchwork: A Unified Framework for RAG Serving

【速读】:该论文旨在解决Retrieval Augmented Generation (RAG)系统在实际部署中面临的效率瓶颈问题,特别是由于其异构计算流程(包括大型语言模型、数据库和专用处理组件)所带来的技术挑战。解决方案的关键在于Patchwork框架的三个核心创新:首先,提供灵活的规范接口以支持用户自定义RAG流水线;其次,将这些流水线作为分布式推理系统进行部署,并针对各个RAG组件的独特可扩展性特征进行优化;第三,集成在线调度机制,通过动态请求优先级和资源自动扩展持续监控请求负载和执行进度,从而最小化服务等级目标(SLO)违规。

链接: https://arxiv.org/abs/2505.07833
作者: Bodun Hu,Luis Pabon,Saurabh Agarwal,Aditya Akella
机构: 未知
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA); Operating Systems (cs.OS)
备注:

点击查看摘要

Abstract:Retrieval Augmented Generation (RAG) has emerged as a new paradigm for enhancing Large Language Model reliability through integration with external knowledge sources. However, efficient deployment of these systems presents significant technical challenges due to their inherently heterogeneous computational pipelines comprising LLMs, databases, and specialized processing components. We introduce Patchwork, a comprehensive end-to-end RAG serving framework designed to address these efficiency bottlenecks. Patchwork’s architecture offers three key innovations: First, it provides a flexible specification interface enabling users to implement custom RAG pipelines. Secondly, it deploys these pipelines as distributed inference systems while optimizing for the unique scalability characteristics of individual RAG components. Third, Patchwork incorporates an online scheduling mechanism that continuously monitors request load and execution progress, dynamically minimizing SLO violations through strategic request prioritization and resource auto-scaling. Our experimental evaluation across four distinct RAG implementations demonstrates that Patchwork delivers substantial performance improvements over commercial alternatives, achieving throughput gains exceeding 48% while simultaneously reducing SLO violations by ~24%.
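
补充示例(博主根据摘要自行编写,非 Patchwork 的真实接口):摘要提到 Patchwork 通过动态请求优先级来减少 SLO 违约,下面用 Python 勾勒一个按 SLO 裕量排序的调度骨架,其中 Request 的字段与调度规则均为示意性假设。

```python
import heapq
from dataclasses import dataclass, field

@dataclass(order=True)
class Request:
    slack: float                      # SLO 裕量 = 截止时间 - 当前时间 - 预计执行时间
    rid: int = field(compare=False)
    deadline: float = field(compare=False)
    est_cost: float = field(compare=False)

def schedule(requests, now):
    """按 SLO 裕量升序出队:裕量越小越紧迫,越先执行,以减少 SLO 违约。"""
    heap = []
    for r in requests:
        r.slack = r.deadline - now - r.est_cost
        heapq.heappush(heap, r)
    return [heapq.heappop(heap).rid for _ in range(len(heap))]

reqs = [Request(0.0, rid=1, deadline=2.0, est_cost=0.5),
        Request(0.0, rid=2, deadline=1.0, est_cost=0.4),
        Request(0.0, rid=3, deadline=3.0, est_cost=2.5)]
print(schedule(reqs, now=0.0))   # [3, 2, 1]:裕量 0.5 < 0.6 < 1.5
```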

[AI-86] A General Approach of Automated Environment Design for Learning the Optimal Power Flow

【速读】:该论文试图解决如何设计强化学习(Reinforcement Learning, RL)环境以最大化训练性能的问题,特别是在最优潮流(Optimal Power Flow, OPF)场景中。解决方案的关键在于提出一种通用的自动化RL环境设计方法,该方法利用多目标优化(Multi-Objective Optimization)框架,并基于超参数优化(Hyperparameter Optimization, HPO)方法,实现环境设计的自动化。通过在五个OPF基准问题上的实验,验证了该方法在性能上优于手动设计的基线环境,并揭示了影响性能的关键设计因素。

链接: https://arxiv.org/abs/2505.07832
作者: Thomas Wolgast,Astrid Nieße
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Systems and Control (eess.SY)
备注: 14 pages, accepted at ACM e-energy 2025

点击查看摘要

Abstract:Reinforcement learning (RL) algorithms are increasingly used to solve the optimal power flow (OPF) problem. Yet, the question of how to design RL environments to maximize training performance remains unanswered, both for the OPF and the general case. We propose a general approach for automated RL environment design by utilizing multi-objective optimization. For that, we use the hyperparameter optimization (HPO) framework, which allows the reuse of existing HPO algorithms and methods. On five OPF benchmark problems, we demonstrate that our automated design approach consistently outperforms a manually created baseline environment design. Further, we use statistical analyses to determine which environment design decisions are especially important for performance, resulting in multiple novel insights on how RL-OPF environments should be designed. Finally, we discuss the risk of overfitting the environment to the utilized RL algorithm. To the best of our knowledge, this is the first general approach for automated RL environment design.
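
补充示例(非论文原实现):下面的 Python 片段展示"把环境设计当作超参数、用多目标线性标量化打分后搜索"这一思路,搜索空间、train_and_eval 占位函数与权重均为博主假设。

```python
import random

# 假设的环境设计搜索空间:奖励缩放、观测裁剪、回合长度
SPACE = {
    "reward_scale": [0.1, 1.0, 10.0],
    "obs_clip":     [1.0, 5.0, 10.0],
    "episode_len":  [50, 100, 200],
}

def train_and_eval(design):
    """占位函数:实际中应在该环境设计下训练 RL 智能体,返回 (性能, 约束违约)。"""
    rng = random.Random(str(sorted(design.items())))
    return rng.random(), rng.random()

def search_env_design(n_trials=20, w=(1.0, -0.5), seed=0):
    rng = random.Random(seed)
    best, best_score = None, float("-inf")
    for _ in range(n_trials):
        design = {k: rng.choice(v) for k, v in SPACE.items()}
        perf, viol = train_and_eval(design)
        score = w[0] * perf + w[1] * viol    # 多目标的线性标量化
        if score > best_score:
            best, best_score = design, score
    return best, best_score

print(search_env_design())
```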

[AI-87] An Optimized Evacuation Plan for an Active-Shooter Situation Constrained by Network Capacity

【速读】:该论文试图解决在公共枪击事件中,由于紧急疏散过程中的高压力和缺乏可靠实时信息导致的错误决策问题,从而减少人员伤亡。解决方案的关键在于开发了一种多路径路由优化算法,该算法在考虑路径上可用容量的情况下,为每个疏散者确定多条最优安全路径,从而降低拥挤和瓶颈风险。

链接: https://arxiv.org/abs/2505.07830
作者: Joseph Lavalle-Rivera,Aniirudh Ramesh,Subhadeep Chakraborty
机构: 未知
类目: Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
备注: 21 pages, 18 figures

点击查看摘要

Abstract:A total of more than 3400 public shootings have occurred in the United States between 2016 and 2022. Among these, 25.1% of them took place in an educational institution, 29.4% at the workplace including office buildings, 19.6% in retail store locations, and 13.4% in restaurants and bars. During these critical scenarios, making the right decisions while evacuating can make the difference between life and death. However, emergency evacuation is intensely stressful, which along with the lack of verifiable real-time information may lead to fatal incorrect decisions. To tackle this problem, we developed a multi-route routing optimization algorithm that determines multiple optimal safe routes for each evacuee while accounting for available capacity along the route, thus reducing the threat of crowding and bottlenecking. Overall, our algorithm reduces the total casualties by 34.16% and 53.3%, compared to our previous routing algorithm without capacity constraints and an expert-advised routing strategy respectively. Further, our approach to reduce crowding resulted in an approximate 50% reduction in occupancy in key bottlenecking nodes compared to both of the other evacuation algorithms.
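
补充示例(非论文算法本身):"考虑路径容量的多路径疏散"可以抽象为带容量约束的最小费用流问题。下面用 networkx 给出一个玩具示例,拓扑、容量与通行时间均为假设,仅说明容量约束如何迫使人流分摊到多条路径、避免瓶颈拥塞。

```python
import networkx as nx

G = nx.DiGraph()
# 边属性:capacity = 走廊可容纳人数,weight = 通行时间
edges = [("room", "hall_A", 10, 1), ("room", "hall_B", 10, 3),
         ("hall_A", "exit", 6, 1), ("hall_B", "exit", 10, 1)]
for u, v, cap, w in edges:
    G.add_edge(u, v, capacity=cap, weight=w)

G.nodes["room"]["demand"] = -12   # 12 名疏散者从 room 出发
G.nodes["exit"]["demand"] = 12    # 全部需到达 exit

flow = nx.min_cost_flow(G)        # 容量约束迫使人流分摊到多条路径
print(flow)   # hall_A 方向最多 6 人(受出口段容量限制),其余 6 人走 hall_B
```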

[AI-88] Blockbuster Part 1: Block-level AI Operator Fusion

【速读】:该论文旨在解决AI推理程序中操作符融合(operator fusion)的问题,特别是在具有分层内存架构的多处理器系统上提升计算效率。解决方案的关键在于Blockbuster框架中的操作符融合算法,该算法通过直接建模数据在不同内存层级之间的移动,实现了更高效的融合结果。与以往基于规则的融合算法不同,该方法能够生成包含复杂数据流动的优化内核,例如将LayerNorm与矩阵乘法、RMSNorm与FNN-SwiGLU进行融合,从而显著提升性能。

链接: https://arxiv.org/abs/2505.07829
作者: Ofer Dekel
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Blockbuster is a framework for AI operator fusion in inference programs. The Blockbuster framework is compatible with any multiprocessor architecture that has a tiered memory hierarchy, including GPUs, multi-core CPUs, and some AI accelerator chips. It includes a graph-based representation for AI workloads, called a block program, which explicitly models how blocks of data move between the memory tiers. It also includes an operator fusion procedure, which is made up of a candidate selection algorithm and a fusion algorithm that fuses each individual candidate - this two-algorithm structure makes Blockbuster especially suitable for large AI programs. The current paper focuses on the fusion algorithm, which is a rule-based technique. While the literature is full of previous rule-based fusion algorithms, what sets our algorithm apart is its direct modeling of data movement between memory tiers, resulting in uniquely powerful fusion results. As a first sanity check, we demonstrate how our algorithm automatically rediscovers the well-known Flash Attention kernel. Then, we demonstrate the real power of our approach by fusing LayerNorm with matrix multiplication and RMSNorm with FNN-SwiGLU - the latter involves fusing three matrix multiplications, a Hadamard product, a reduction, and a few elementwise operations into a single mega-kernel.
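
补充示例(玩具版本,未建模论文强调的多级内存数据搬运):下面用极简 Python 说明"基于规则的算子融合"的基本形态,即把相邻的逐元素算子合并为单个融合核;可融合算子集合为博主假设。

```python
ELEMENTWISE = {"add", "mul", "relu", "gelu"}   # 可融合的逐元素算子(假设集合)

def fuse(ops):
    """把相邻的逐元素算子串合并为一个"融合核",其余算子原样保留。"""
    fused, buf = [], []
    for op in ops:
        if op in ELEMENTWISE:
            buf.append(op)
        else:
            if buf:
                fused.append("fused(" + "+".join(buf) + ")")
                buf = []
            fused.append(op)
    if buf:
        fused.append("fused(" + "+".join(buf) + ")")
    return fused

print(fuse(["matmul", "add", "relu", "matmul", "mul", "gelu"]))
# ['matmul', 'fused(add+relu)', 'matmul', 'fused(mul+gelu)']
```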

[AI-89] AI-Based Crypto Tokens: The Illusion of Decentralized AI?

【速读】:该论文试图解决当前AI-based tokens在技术实现与商业模式上的局限性,特别是其在去中心化AI平台中的实际价值与承诺之间的差距。论文通过分析主流AI-token项目的技术架构、代币功能、共识机制及商业模型,揭示了现有系统在链上智能能力、可扩展性以及创新性价值创造方面的不足。解决方案的关键在于推动去中心化AI系统的进一步发展,包括链上AI输出验证、区块链支持的联邦学习以及更稳健的激励机制,以缩小当前AI-token实现与理想愿景之间的差距。

链接: https://arxiv.org/abs/2505.07828
作者: Rischan Mafrur
机构: 未知
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Databases (cs.DB)
备注:

点击查看摘要

Abstract:The convergence of blockchain and artificial intelligence (AI) has led to the emergence of AI-based tokens, which are cryptographic assets designed to power decentralized AI platforms and services. This paper provides a comprehensive review of leading AI-token projects, examining their technical architectures, token utilities, consensus mechanisms, and underlying business models. We explore how these tokens operate across various blockchain ecosystems and assess the extent to which they offer value beyond traditional centralized AI services. Based on this assessment, our analysis identifies several core limitations. From a technical perspective, many platforms depend extensively on off-chain computation, exhibit limited capabilities for on-chain intelligence, and encounter significant scalability challenges. From a business perspective, many models appear to replicate centralized AI service structures, simply adding token-based payment and governance layers without delivering truly novel value. In light of these challenges, we also examine emerging developments that may shape the next phase of decentralized AI systems. These include approaches for on-chain verification of AI outputs, blockchain-enabled federated learning, and more robust incentive frameworks. Collectively, while emerging innovations offer pathways to strengthen decentralized AI ecosystems, significant gaps remain between the promises and the realities of current AI-token implementations. Our findings contribute to a growing body of research at the intersection of AI and blockchain, highlighting the need for critical evaluation and more grounded approaches as the field continues to evolve.

[AI-90] Explainable Artificial Intelligence Techniques for Software Development Lifecycle: A Phase-specific Survey

【速读】:该论文试图解决当前人工智能(Artificial Intelligence, AI)在软件工程中的可解释性不足问题,特别是复杂AI模型因缺乏透明性而限制了其信任度和广泛应用的"黑箱问题"(black-box problem)。解决方案的关键在于通过全面调查可解释人工智能(Explainable Artificial Intelligence, XAI)方法在软件开发生命周期(Software Development Life Cycle, SDLC)各阶段的应用,包括需求获取、设计与开发、测试与部署以及演化,来推动XAI技术的集成与实践,从而提升AI系统的可解释性、透明度及可验证性。

链接: https://arxiv.org/abs/2505.07058
作者: Lakshit Arora,Sanjay Surendranath Girija,Shashank Kapoor,Aman Raj,Dipen Pradhan,Ankit Shetgaonkar
机构: 未知
类目: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Accepted to IEEE COMPSAC 2025

点击查看摘要

Abstract:Artificial Intelligence (AI) is rapidly expanding and integrating more into daily life to automate tasks, guide decision making, and enhance efficiency. However, complex AI models, which make decisions without providing clear explanations (known as the “black-box problem”), currently restrict trust and widespread adoption of AI. Explainable Artificial Intelligence (XAI) has emerged to address the black-box problem of making AI systems more interpretable and transparent so stakeholders can trust, verify, and act upon AI-based outcomes. Researchers have developed various techniques to foster XAI in the Software Development Lifecycle. However, there are gaps in applying XAI techniques in the Software Engineering phases. Literature review shows that 68% of XAI in Software Engineering research is focused on maintenance as opposed to 8% on software management and requirements. In this paper, we present a comprehensive survey of the applications of XAI methods such as concept-based explanations, Local Interpretable Model-agnostic Explanations (LIME), SHapley Additive exPlanations (SHAP), rule extraction, attention mechanisms, counterfactual explanations, and example-based explanations to the different phases of the Software Development Life Cycle (SDLC), including requirements elicitation, design and development, testing and deployment, and evolution. To the best of our knowledge, this paper presents the first comprehensive survey of XAI techniques for every phase of the Software Development Life Cycle (SDLC). This survey aims to promote explainable AI in Software Engineering and facilitate the practical application of complex AI models in AI-driven software development.

[AI-91] Reinforcement Learning (RL) Meets Urban Climate Modeling: Investigating the Efficacy and Impacts of RL-Based HVAC Control

【速读】:该论文试图解决基于强化学习(Reinforcement Learning, RL)的供暖、通风与空调(HVAC)控制策略在不同气候背景下的有效性、对室内气候和局部城市气候的影响以及策略的可迁移性问题。解决方案的关键在于构建一个集成框架,将RL与城市气候模型相结合,该模型进一步整合了建筑能耗模型,从而能够全面评估RL策略在不同气候条件下的表现及其对环境的影响。

链接: https://arxiv.org/abs/2505.07045
作者: Junjie Yu,John S. Schreck,David John Gagne,Keith W. Oleson,Jie Li,Yongtu Liang,Qi Liao,Mingfei Sun,David O. Topping,Zhonghua Zheng
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Atmospheric and Oceanic Physics (physics.ao-ph)
备注:

点击查看摘要

Abstract:Reinforcement learning (RL)-based heating, ventilation, and air conditioning (HVAC) control has emerged as a promising technology for reducing building energy consumption while maintaining indoor thermal comfort. However, the efficacy of such strategies is influenced by the background climate and their implementation may potentially alter both the indoor climate and local urban climate. This study proposes an integrated framework combining RL with an urban climate model that incorporates a building energy model, aiming to evaluate the efficacy of RL-based HVAC control across different background climates, impacts of RL strategies on indoor climate and local urban climate, and the transferability of RL strategies across cities. Our findings reveal that the reward (defined as a weighted combination of energy consumption and thermal comfort) and the impacts of RL strategies on indoor climate and local urban climate exhibit marked variability across cities with different background climates. The sensitivity of reward weights and the transferability of RL strategies are also strongly influenced by the background climate. Cities in hot climates tend to achieve higher rewards across most reward weight configurations that balance energy consumption and thermal comfort, and those cities with more varying atmospheric temperatures demonstrate greater RL strategy transferability. These findings underscore the importance of thoroughly evaluating RL-based HVAC control strategies in diverse climatic contexts. This study also provides a new insight that city-to-city learning will potentially aid the deployment of RL-based HVAC control.

[AI-92] Big Data and the Computational Social Science of Entrepreneurship and Innovation

【速读】:该论文试图解决在大规模社会数据爆炸和机器学习方法演进背景下,创业与创新研究者面临的新机遇与独特挑战,特别是如何利用大规模数据识别技术与商业新颖性、记录新创企业起源以及预测新技术与商业模式之间的竞争。其解决方案的关键在于通过结合大规模数据与机器学习模型,构建系统层面的创新与创业观测工具,并利用大数据驱动的人工智能模型生成技术与商业的“数字双胞胎”,为创新与创业过程及政策提供虚拟实验环境。

链接: https://arxiv.org/abs/2505.08706
作者: Ningzi Li,Shiyang Lai,James Evans
机构: 未知
类目: General Economics (econ.GN); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Social and Information Networks (cs.SI); Applications (stat.AP)
备注:

点击查看摘要

Abstract:As large-scale social data explode and machine-learning methods evolve, scholars of entrepreneurship and innovation face new research opportunities but also unique challenges. This chapter discusses the difficulties of leveraging large-scale data to identify technological and commercial novelty, document new venture origins, and forecast competition between new technologies and commercial forms. It suggests how scholars can take advantage of new text, network, image, audio, and video data in two distinct ways that advance innovation and entrepreneurship research. First, machine-learning models, combined with large-scale data, enable the construction of precision measurements that function as system-level observatories of innovation and entrepreneurship across human societies. Second, new artificial intelligence models fueled by big data generate ‘digital doubles’ of technology and business, forming laboratories for virtual experimentation about innovation and entrepreneurship processes and policies. The chapter argues for the advancement of theory development and testing in entrepreneurship and innovation by coupling big data with big models.

[AI-93] A Survey of Deep Learning for Complex Speech Spectrograms

【速读】:该论文试图解决如何利用深度神经网络处理包含幅度和相位信息的复数谱图(complex spectrograms)以提升语音信号处理任务的效果。其解决方案的关键在于设计适用于复数值数据的神经网络架构,探索针对复数谱图的训练策略与损失函数,并结合生成模型等技术,以实现对语音信号的高效分析与重构。

链接: https://arxiv.org/abs/2505.08694
作者: Yuying Xie,Zheng-Hua Tan
机构: 未知
类目: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Recent advancements in deep learning have significantly impacted the field of speech signal processing, particularly in the analysis and manipulation of complex spectrograms. This survey provides a comprehensive overview of the state-of-the-art techniques leveraging deep neural networks for processing complex spectrograms, which encapsulate both magnitude and phase information. We begin by introducing complex spectrograms and their associated features for various speech processing tasks. Next, we explore the key components and architectures of complex-valued neural networks, which are specifically designed to handle complex-valued data and have been applied for complex spectrogram processing. We then discuss various training strategies and loss functions tailored for training neural networks to process and model complex spectrograms. The survey further examines key applications, including phase retrieval, speech enhancement, and speech separation, where deep learning has achieved significant progress by leveraging complex spectrograms or their derived feature representations. Additionally, we examine the intersection of complex spectrograms with generative models. This survey aims to serve as a valuable resource for researchers and practitioners in the field of speech signal processing and complex-valued neural networks.
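
补充示例(参数为任意假设,仅作概念演示):下面的 Python 片段演示如何得到同时携带幅度与相位的复数谱图,将其拆成实部/虚部两个通道作为网络输入,并由复数谱图逆变换重建时域信号。

```python
import numpy as np
from scipy.signal import stft, istft

fs = 16000
t = np.arange(fs) / fs
x = np.sin(2 * np.pi * 440 * t) + 0.1 * np.random.randn(fs)

f, frames, Z = stft(x, fs=fs, nperseg=512)   # Z 为复数谱图,同时携带幅度与相位
feats = np.stack([Z.real, Z.imag])           # (2, 频率, 帧):双通道实值输入的常见做法
print(feats.shape)

# 复数谱图可直接逆变换回时域(复值网络输出谱图后常用这一步)
_, x_rec = istft(Z, fs=fs, nperseg=512)
n = min(len(x), len(x_rec))
print(np.allclose(x[:n], x_rec[:n], atol=1e-6))   # 应为 True
```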

[AI-94] Distributed Quantum Neural Networks on Distributed Photonic Quantum Computing

【速读】:该论文试图解决经典神经网络参数效率低和量子计算在实际部署中受限的问题,其解决方案的关键在于将光子量子神经网络(photonic QNN)与矩阵乘积态(MPS)映射相结合,通过混合量子-经典工作流生成高维概率分布,并将其映射为经典网络权重,从而实现参数高效的训练。该框架在MNIST分类任务中表现出色,使用较少参数即可达到接近经典基线的准确率,并且在降低参数数量的同时保持较高的准确性,同时通过经典部署避免了推理阶段对量子硬件的需求。

链接: https://arxiv.org/abs/2505.08474
作者: Kuan-Cheng Chen,Chen-Yu Liu,Yu Shang,Felix Burt,Kin K. Leung
机构: 未知
类目: Quantum Physics (quant-ph); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC)
备注:

点击查看摘要

Abstract:We introduce a distributed quantum-classical framework that synergizes photonic quantum neural networks (QNNs) with matrix-product-state (MPS) mapping to achieve parameter-efficient training of classical neural networks. By leveraging universal linear-optical decompositions of M-mode interferometers and photon-counting measurement statistics, our architecture generates neural parameters through a hybrid quantum-classical workflow: photonic QNNs with M(M+1)/2 trainable parameters produce high-dimensional probability distributions that are mapped to classical network weights via an MPS model with bond dimension \chi. Empirical validation on MNIST classification demonstrates that photonic QT achieves an accuracy of 95.50% \pm 0.84% using 3,292 parameters (\chi = 10), compared to 96.89% \pm 0.31% for classical baselines with 6,690 parameters. Moreover, a ten-fold compression ratio is achieved at \chi = 4, with a relative accuracy loss of less than 3%. The framework outperforms classical compression techniques (weight sharing/pruning) by 6-12% absolute accuracy while eliminating quantum hardware requirements during inference through classical deployment of compressed parameters. Simulations incorporating realistic photonic noise demonstrate the framework's robustness to near-term hardware imperfections. Ablation studies confirm quantum necessity: replacing photonic QNNs with random inputs collapses accuracy to chance level (10.0% \pm 0.5%). Photonic quantum computing's room-temperature operation, inherent scalability through spatial-mode multiplexing, and HPC-integrated architecture establish a practical pathway for distributed quantum machine learning, combining the expressivity of photonic Hilbert spaces with the deployability of classical neural networks.

[AI-95] Non-contact Vital Signs Detection in Dynamic Environments

【速读】:该论文旨在解决毫米波雷达在复杂环境中进行生命体征检测时,由于时变直流偏移和相位不平衡导致的相位解调性能下降问题。其解决方案的关键在于提出了一种新颖的直流偏移校准方法,并结合希尔伯特变换与差分交叉相乘(HADCM)解调算法,通过从相邻信号峰谷估计时变直流偏移,并利用I/Q通道信号的差分形式与希尔伯特变换提取生命体征信息,从而实现更精确的信号恢复和噪声抑制。

链接: https://arxiv.org/abs/2505.08366
作者: Shuai Sun,Chong-Xi Liang,Chengwei Ye,Huanzhen Zhang,Kangsheng Wang
机构: 未知
类目: ignal Processing (eess.SP); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Accurate phase demodulation is critical for vital sign detection using millimeter-wave radar. However, in complex environments, time-varying DC offsets and phase imbalances can severely degrade demodulation performance. To address this, we propose a novel DC offset calibration method alongside a Hilbert and Differential Cross-Multiply (HADCM) demodulation algorithm. The approach estimates time-varying DC offsets from neighboring signal peaks and valleys, then employs both differential forms and Hilbert transforms of the I/Q channel signals to extract vital sign information. Simulation and experimental results demonstrate that the proposed method maintains robust performance under low signal-to-noise ratios. Compared to existing demodulation techniques, it offers more accurate signal recovery in challenging scenarios and effectively suppresses noise interference.
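
补充示例(信号与参数均为仿真假设,且假定直流偏移已校准):下面给出差分交叉相乘(DACM)相位解调的最小 Python 演示,这是 HADCM 一类方法的核心运算之一。

```python
import numpy as np

fs = 100.0
t = np.arange(0, 10, 1 / fs)
phase_true = 0.5 * np.sin(2 * np.pi * 0.25 * t)   # 模拟呼吸引起的雷达相位
I, Q = np.cos(phase_true), np.sin(phase_true)     # 假定直流偏移已校准的 I/Q 通道

dI, dQ = np.gradient(I, 1 / fs), np.gradient(Q, 1 / fs)
dphi = (I * dQ - Q * dI) / (I**2 + Q**2)   # 差分交叉相乘:相位导数,天然避免相位卷绕
phase_est = np.cumsum(dphi) / fs           # 数值积分恢复相位

print(np.max(np.abs(phase_est - phase_true)))   # 误差应很小
```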

[AI-96] Aitomia: Your Intelligent Assistant for AI-Driven Atomistic and Quantum Chemical Simulations

【速读】:该论文试图解决原子尺度和量子化学(QC)模拟的复杂性与高门槛问题,旨在通过人工智能(AI)技术降低进行此类模拟的难度并加速相关领域的研究与开发。解决方案的关键在于构建一个名为Aitomia的智能助手平台,该平台利用微调的开源大型语言模型(LLMs)、基于规则的智能体以及检索增强生成(RAG)系统,提供从模拟设置、运行监控、结果分析到总结的全流程支持。

链接: https://arxiv.org/abs/2505.08195
作者: Jinming Hu,Hassan Nawaz,Yuting Rui,Lijie Chi,Arif Ullah,Pavlo O. Dral
机构: 未知
类目: Computational Physics (physics.comp-ph); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multiagent Systems (cs.MA); Chemical Physics (physics.chem-ph)
备注:

点击查看摘要

Abstract:We have developed Aitomia - a platform powered by AI to assist in performing AI-driven atomistic and quantum chemical (QC) simulations. This intelligent assistant platform is equipped with chatbots and AI agents to help experts and guide non-experts in setting up and running the atomistic simulations, monitoring their computation status, analyzing the simulation results, and summarizing them for the user in text and graphical forms. We achieve these goals by exploiting fine-tuned open-source large language models (LLMs), rule-based agents, and a retrieval-augmented generation (RAG) system. Aitomia leverages the versatility of our MLatom ecosystem for AI-enhanced computational chemistry. This intelligent assistant is going to be integrated into the Aitomistic Hub and XACS online computing services, with some functionality already publicly available as described at this http URL. Aitomia is expected to lower the barrier to performing atomistic simulations, accelerating research and development in the relevant fields.

[AI-97] Probabilistic approach to longitudinal response prediction: application to radiomics from brain cancer imaging

【速读】:该论文旨在解决疾病进展的纵向预测问题,特别是在处理治疗反应和疾病演变动态变化时的不确定性。其解决方案的关键在于提出一种概率模型,该模型结合基线特征与中间随访数据,通过概率框架自然地处理纵向预测中的固有不确定性,同时有效控制问题维度的增长,从而无需依赖中间随访数据。

链接: https://arxiv.org/abs/2505.07973
作者: Isabella Cama,Michele Piana,Cristina Campi,Sara Garbarino
机构: 未知
类目: Applications (stat.AP); Artificial Intelligence (cs.AI)
备注: 21 pages, 5 figures

点击查看摘要

Abstract:Longitudinal imaging analysis tracks disease progression and treatment response over time, providing dynamic insights into treatment efficacy and disease evolution. Radiomic features extracted from medical imaging can support the study of disease progression and facilitate longitudinal prediction of clinical outcomes. This study presents a probabilistic model for longitudinal response prediction, integrating baseline features with intermediate follow-ups. The probabilistic nature of the model naturally allows it to handle the intrinsic uncertainty of the longitudinal prediction of disease progression. We evaluate the proposed model against state-of-the-art disease progression models in both a synthetic scenario and using a brain cancer dataset. Results demonstrate that the approach is competitive against existing methods while uniquely accounting for uncertainty and controlling the growth of problem dimensionality, eliminating the need for data from intermediate follow-ups.

[AI-98] Bridging Large Language Models and Single-Cell Transcriptomics in Dissecting Selective Motor Neuron Vulnerability

【速读】:该论文试图解决通过单细胞测序数据理解细胞身份和功能这一计算生物学中的关键挑战。其解决方案的关键在于利用NCBI Gene数据库中的基因特异性文本注释,结合大规模语言模型(LLMs)生成具有生物学上下文的细胞嵌入表示。具体而言,通过对每个细胞中表达水平最高的基因进行排序并获取其注释信息,再通过语言模型将其转换为向量嵌入,并采用表达加权平均的方式计算细胞嵌入,从而实现语义丰富且紧凑的表征。这一多模态策略将结构化生物数据与先进的语言建模技术相结合,提升了下游任务如细胞类型聚类、细胞脆弱性分析和轨迹推断的可解释性。

链接: https://arxiv.org/abs/2505.07896
作者: Douglas Jiang,Zilin Dai,Luxuan Zhang,Qiyi Yu,Haoqi Sun,Feng Tian
机构: 未知
类目: Genomics (q-bio.GN); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Understanding cell identity and function through single-cell level sequencing data remains a key challenge in computational biology. We present a novel framework that leverages gene-specific textual annotations from the NCBI Gene database to generate biologically contextualized cell embeddings. For each cell in a single-cell RNA sequencing (scRNA-seq) dataset, we rank genes by expression level, retrieve their NCBI Gene descriptions, and transform these descriptions into vector embedding representations using large language models (LLMs). The models used include OpenAI text-embedding-ada-002, text-embedding-3-small, and text-embedding-3-large (Jan 2024), as well as domain-specific models BioBERT and SciBERT. Embeddings are computed via an expression-weighted average across the top N most highly expressed genes in each cell, providing a compact, semantically rich representation. This multimodal strategy bridges structured biological data with state-of-the-art language modeling, enabling more interpretable downstream applications such as cell-type clustering, cell vulnerability dissection, and trajectory inference.
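
补充示例:论文的细胞嵌入由"表达量加权平均基因文本嵌入"得到。下面用 numpy 给出该加权平均的示意,其中嵌入向量用随机数占位(实际应来自 LLM 对 NCBI Gene 描述的编码),top_n 等取值为博主假设。

```python
import numpy as np

rng = np.random.default_rng(0)
# 基因描述嵌入:此处用随机向量占位,实际应由 LLM 编码 NCBI Gene 文本得到
gene_desc_emb = {g: rng.normal(size=384) for g in ["GENE_A", "GENE_B", "GENE_C"]}

def cell_embedding(expression, top_n=2):
    """expression: {基因名: 表达量};取表达量最高的 top_n 个基因做加权平均。"""
    top = sorted(expression.items(), key=lambda kv: kv[1], reverse=True)[:top_n]
    weights = np.array([v for _, v in top], dtype=float)
    weights /= weights.sum()
    vecs = np.stack([gene_desc_emb[g] for g, _ in top])
    return weights @ vecs   # 表达量加权平均,得到该细胞的语义嵌入

cell = {"GENE_A": 120.0, "GENE_B": 30.0, "GENE_C": 5.0}
print(cell_embedding(cell).shape)   # (384,)
```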

[AI-99] Sub-diffraction terahertz backpropagation compressive imaging

【速读】:该论文旨在解决太赫兹单像素成像(Terahertz Single-Pixel Imaging, TSPI)中由于太赫兹波长较长导致的亚衍射极限成像分辨率受限问题,以及传统方法对苛刻实验条件和耗时过程的依赖。其解决方案的关键在于提出一种基于反向传播的压缩成像技术,利用未训练的神经网络在物理模型约束下,以极低的压缩比(1.5625%)实现图像迭代重建,同时结合角谱传播(Angular Spectrum Propagation, ASP)理论抑制衍射效应,从而获得约λ₀/7(λ₀ = 833.3 μm at 0.36 THz)的亚衍射分辨率,并无需使用超薄光调制器。

链接: https://arxiv.org/abs/2505.07839
作者: Yongsheng Zhu,Shaojing Liu,Ximiao Wang,Runli Li,Haili Yang,Jiali Wang,Hongjia Zhu,Yanlin Ke,Ningsheng Xu,Huanjun Chen,Shaozhi Deng
机构: 未知
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Terahertz single-pixel imaging (TSPI) has garnered significant attention due to its simplicity and cost-effectiveness. However, the relatively long wavelength of THz waves limits sub-diffraction-scale imaging resolution. Although the TSPI technique can achieve sub-wavelength resolution, it requires harsh experimental conditions and time-consuming processes. Here, we propose a sub-diffraction THz backpropagation compressive imaging technique. We illuminate the object with monochromatic continuous-wave THz radiation. The transmitted THz wave is modulated by prearranged patterns generated on the back surface of a 500-\mu m-thick silicon wafer, realized through photoexcited carriers using a 532-nm laser. The modulated THz wave is then recorded by a single-element detector. An untrained neural network is employed to iteratively reconstruct the object image with an ultralow compression ratio of 1.5625% under a physical model constraint, thus reducing the long sampling times. To further suppress the diffraction-field effects, embedded with the angular spectrum propagation (ASP) theory to model the diffraction of THz waves during propagation, the network retrieves near-field information from the object, enabling sub-diffraction imaging with a spatial resolution of ~\lambda_0/7 (\lambda_0 = 833.3 \mu m at 0.36 THz) and eliminating the need for ultrathin photomodulators.
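
补充示例(网格大小、传播距离等参数均为示意性假设):下面用 numpy 实现标准的角谱传播(ASP)正向计算,即论文中用于建模太赫兹波衍射的物理算子。

```python
import numpy as np

def asp_propagate(U0, wavelength, dx, z):
    """角谱法:U(z) = IFFT( FFT(U0) * exp(i * kz * z) ),并滤除倏逝波分量。"""
    n = U0.shape[0]
    k = 2 * np.pi / wavelength
    fx = np.fft.fftfreq(n, d=dx)
    FX, FY = np.meshgrid(fx, fx)
    kz2 = k**2 - (2 * np.pi * FX)**2 - (2 * np.pi * FY)**2
    prop = kz2 > 0                              # 仅保留传播波
    H = np.exp(1j * np.sqrt(np.maximum(kz2, 0.0)) * z) * prop
    return np.fft.ifft2(np.fft.fft2(U0) * H)

lam = 833.3e-6            # 0.36 THz 对应的波长(米)
n, dx = 256, 50e-6        # 网格与采样间隔(假设值)
U0 = np.zeros((n, n), complex)
U0[n//2-5:n//2+5, n//2-5:n//2+5] = 1.0   # 小方孔物体
U1 = asp_propagate(U0, lam, dx, z=2e-3)  # 传播 2 mm 后的场
print(np.abs(U1).max())
```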

机器学习

[LG-0] Addressing the Current Challenges of Quantum Machine Learning through Multi-Chip Ensembles

链接: https://arxiv.org/abs/2505.08782
作者: Junghoon Justin Park,Jiook Cha,Samuel Yen-Chi Chen,Huan-Hsin Tseng,Shinjae Yoo
类目: Machine Learning (cs.LG); Computational Engineering, Finance, and Science (cs.CE)
备注:

点击查看摘要

Abstract:Quantum Machine Learning (QML) holds significant promise for solving computational challenges across diverse domains. However, its practical deployment is constrained by the limitations of noisy intermediate-scale quantum (NISQ) devices, including noise, limited scalability, and trainability issues in variational quantum circuits (VQCs). We introduce the multi-chip ensemble VQC framework, which partitions high-dimensional computations across smaller quantum chips to enhance scalability, trainability, and noise resilience. We show that this approach mitigates barren plateaus, reduces quantum error bias and variance, and maintains robust generalization through controlled entanglement. Designed to align with current and emerging quantum hardware, the framework demonstrates strong potential for enabling scalable QML on near-term devices, as validated by experiments on standard benchmark datasets (MNIST, FashionMNIST, CIFAR-10) and a real-world dataset (PhysioNet EEG).

[LG-1] SPAT: Sensitivity-based Multihead-attention Pruning on Time Series Forecasting Models

链接: https://arxiv.org/abs/2505.08768
作者: Suhan Guo,Jiahong Deng,Mengjun Yi,Furao Shen,Jian Zhao
类目: Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Attention-based architectures have achieved superior performance in multivariate time series forecasting but are computationally expensive. Techniques such as patching and adaptive masking have been developed to reduce their sizes and latencies. In this work, we propose a structured pruning method, SPAT (Sensitivity Pruner for Attention), which selectively removes redundant attention mechanisms and yields highly effective models. Different from previous approaches, SPAT aims to remove the entire attention module, which reduces the risk of overfitting and enables speed-up without demanding specialized hardware. We propose a dynamic sensitivity metric, Sensitivity Enhanced Normalized Dispersion (SEND), that measures the importance of each attention module during the pre-training phase. Experiments on multivariate datasets demonstrate that SPAT-pruned models achieve reductions of 2.842% in MSE, 1.996% in MAE, and 35.274% in FLOPs. Furthermore, SPAT-pruned models outperform existing lightweight, Mamba-based and LLM-based SOTA methods in both standard and zero-shot inference, highlighting the importance of retaining only the most effective attention mechanisms. We have made our code publicly available at this https URL.

[LG-2] Implet: A Post-hoc Subsequence Explainer for Time Series Models

链接: https://arxiv.org/abs/2505.08748
作者: Fanyu Meng,Ziwen Kan,Shahbaz Rezaei,Zhaodan Kong,Xin Chen,Xin Liu
类目: Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Explainability in time series models is crucial for fostering trust, facilitating debugging, and ensuring interpretability in real-world applications. In this work, we introduce Implet, a novel post-hoc explainer that generates accurate and concise subsequence-level explanations for time series models. Our approach identifies critical temporal segments that significantly contribute to the model’s predictions, providing enhanced interpretability beyond traditional feature-attribution methods. Based on it, we propose a cohort-based (group-level) explanation framework designed to further improve the conciseness and interpretability of our explanations. We evaluate Implet on several standard time-series classification benchmarks, demonstrating its effectiveness in improving interpretability. The code is available at this https URL

[LG-3] Sensitivity-Constrained Fourier Neural Operators for Forward and Inverse Problems in Parametric Differential Equations

链接: https://arxiv.org/abs/2505.08740
作者: Abdolmehdi Behroozi,Chaopeng Shen,Daniel Kifer
类目: Machine Learning (cs.LG); Computational Engineering, Finance, and Science (cs.CE)
备注:

点击查看摘要

Abstract:Parametric differential equations of the form du/dt = f(u, x, t, p) are fundamental in science and engineering. While deep learning frameworks such as the Fourier Neural Operator (FNO) can efficiently approximate solutions, they struggle with inverse problems, sensitivity estimation (du/dp), and concept drift. We address these limitations by introducing a sensitivity-based regularization strategy, called Sensitivity-Constrained Fourier Neural Operators (SC-FNO). SC-FNO achieves high accuracy in predicting solution paths and consistently outperforms standard FNO and FNO with physics-informed regularization. It improves performance in parameter inversion tasks, scales to high-dimensional parameter spaces (tested with up to 82 parameters), and reduces both data and training requirements. These gains are achieved with a modest increase in training time (30% to 130% per epoch) and generalize across various types of differential equations and neural operators. Code and selected experiments are available at: this https URL
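
补充示例(用小 MLP 代替 FNO,玩具真解与正则权重 0.1 均为博主假设):下面用 PyTorch 给出"敏感度约束"这一思想的极简示意,即在数据损失之外,再惩罚自动微分得到的 du/dp 与参考敏感度的偏差。

```python
import torch

# 玩具"真解" u = p * sin(x),解析敏感度 du/dp = sin(x)
net = torch.nn.Sequential(torch.nn.Linear(2, 64), torch.nn.Tanh(),
                          torch.nn.Linear(64, 1))
opt = torch.optim.Adam(net.parameters(), lr=1e-3)

x = torch.rand(128, 1)
p = torch.rand(128, 1, requires_grad=True)
u_ref = (torch.sin(x) * p).detach()
du_dp_ref = torch.sin(x)

for step in range(200):
    u = net(torch.cat([x, p], dim=1))
    # 自动微分求 du/dp,并以其与参考敏感度的偏差作为正则项
    du_dp = torch.autograd.grad(u.sum(), p, create_graph=True)[0]
    loss = ((u - u_ref) ** 2).mean() + 0.1 * ((du_dp - du_dp_ref) ** 2).mean()
    opt.zero_grad(); loss.backward(); opt.step()

print(float(loss))   # 数据项 + 敏感度项的总损失
```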

[LG-4] owards Foundation Models for Experimental Readout Systems Combining Discrete and Continuous Data

链接: https://arxiv.org/abs/2505.08736
作者: James Giroux,Cristiano Fanelli
类目: Machine Learning (cs.LG); High Energy Physics - Experiment (hep-ex); Nuclear Experiment (nucl-ex); Instrumentation and Detectors (physics.ins-det)
备注: 19 pages; 14 figures

点击查看摘要

Abstract:We present a (proto) Foundation Model for Nuclear Physics, capable of operating on low-level detector inputs from Imaging Cherenkov Detectors at the future Electron Ion Collider. To address limitations in existing next-token prediction approaches-namely resolution loss from VQ-VAE tokenization and lack of conditional generation-we propose three key innovations: (i) separate vocabularies for discrete spatial features and continuous variates, combined via Causal Multi-Head Cross-Attention (CMHCA), (ii) continuous kinematic conditioning through prepended context embeddings, and (iii) scalable and simple, high-resolution continuous variate tokenization without joint vocabulary inflation. Our model enables fast, high-fidelity generation of pixel and time sequences for Cherenkov photons, validated through closure tests in the High Performance DIRC. We also show our model generalizes to reconstruction tasks such as pion and kaon identification, in which we show its ability to leverage fine-tuning.

[LG-5] Preference Optimization for Combinatorial Optimization Problems ICML2025

链接: https://arxiv.org/abs/2505.08735
作者: Mingjun Pan,Guanquan Lin,You-Wei Luo,Bin Zhu,Zhien Dai,Lijun Sun,Chun Yuan
类目: Machine Learning (cs.LG)
备注: This paper has been accepted by ICML 2025

点击查看摘要

Abstract:Reinforcement Learning (RL) has emerged as a powerful tool for neural combinatorial optimization, enabling models to learn heuristics that solve complex problems without requiring expert knowledge. Despite significant progress, existing RL approaches face challenges such as diminishing reward signals and inefficient exploration in vast combinatorial action spaces, leading to inefficiency. In this paper, we propose Preference Optimization, a novel method that transforms quantitative reward signals into qualitative preference signals via statistical comparison modeling, emphasizing the superiority among sampled solutions. Methodologically, by reparameterizing the reward function in terms of policy and utilizing preference models, we formulate an entropy-regularized RL objective that aligns the policy directly with preferences while avoiding intractable computations. Furthermore, we integrate local search techniques into the fine-tuning rather than post-processing to generate high-quality preference pairs, helping the policy escape local optima. Empirical results on various benchmarks, such as the Traveling Salesman Problem (TSP), the Capacitated Vehicle Routing Problem (CVRP) and the Flexible Flow Shop Problem (FFSP), demonstrate that our method significantly outperforms existing RL algorithms, achieving superior convergence efficiency and solution quality.
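
补充示例(函数签名与数值均为博主假设,并非论文完整算法):下面给出一个 DPO 风格的偏好损失示意,对应"把较优/较差解的对数似然差转为偏好信号"这一思路。

```python
import torch
import torch.nn.functional as F

def preference_loss(logp_w, logp_l, logp_w_ref, logp_l_ref, beta=0.1):
    """DPO 风格:最大化策略相对参考策略对"较优解"的似然优势。"""
    margin = beta * ((logp_w - logp_w_ref) - (logp_l - logp_l_ref))
    return -F.logsigmoid(margin).mean()

# logp_w / logp_l:策略对较优解 / 较差解的对数似然(数值为示意)
logp_w = torch.tensor([-10.0, -12.0]); logp_l = torch.tensor([-11.0, -15.0])
ref_w  = torch.tensor([-10.5, -12.5]); ref_l  = torch.tensor([-10.8, -14.0])
print(preference_loss(logp_w, logp_l, ref_w, ref_l))
```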

[LG-6] Modular Federated Learning: A Meta-Framework Perspective

链接: https://arxiv.org/abs/2505.08646
作者: Frederico Vicente,Cláudia Soares,Dušan Jakovetić
类目: Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Federated Learning (FL) enables distributed machine learning training while preserving privacy, representing a paradigm shift for data-sensitive and decentralized environments. Despite its rapid advancements, FL remains a complex and multifaceted field, requiring a structured understanding of its methodologies, challenges, and applications. In this survey, we introduce a meta-framework perspective, conceptualising FL as a composition of modular components that systematically address core aspects such as communication, optimisation, security, and privacy. We provide a historical contextualisation of FL, tracing its evolution from distributed optimisation to modern distributed learning paradigms. Additionally, we propose a novel taxonomy distinguishing Aggregation from Alignment, introducing the concept of alignment as a fundamental operator alongside aggregation. To bridge theory with practice, we explore available FL frameworks in Python, facilitating real-world implementation. Finally, we systematise key challenges across FL sub-fields, providing insights into open research questions throughout the meta-framework modules. By structuring FL within a meta-framework of modular components and emphasising the dual role of Aggregation and Alignment, this survey provides a holistic and adaptable foundation for understanding and advancing FL research and deployment.

[LG-7] Credit Assignment and Efficient Exploration based on Influence Scope in Multi-agent Reinforcement Learning

链接: https://arxiv.org/abs/2505.08630
作者: Shuai Han,Mehdi Dastani,Shihan Wang
类目: Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Training cooperative agents in sparse-reward scenarios poses significant challenges for multi-agent reinforcement learning (MARL). Without clear feedback on actions at each step in sparse-reward setting, previous methods struggle with precise credit assignment among agents and effective exploration. In this paper, we introduce a novel method to deal with both credit assignment and exploration problems in reward-sparse domains. Accordingly, we propose an algorithm that calculates the Influence Scope of Agents (ISA) on states by taking specific value of the dimensions/attributes of states that can be influenced by individual agents. The mutual dependence between agents’ actions and state attributes are then used to calculate the credit assignment and to delimit the exploration space for each individual agent. We then evaluate ISA in a variety of sparse-reward multi-agent scenarios. The results show that our method significantly outperforms the state-of-art baselines.

[LG-8] Cost Function Estimation Using Inverse Reinforcement Learning with Minimal Observations

链接: https://arxiv.org/abs/2505.08619
作者: Sarmad Mehrdad,Avadesh Meduri,Ludovic Righetti
类目: Machine Learning (cs.LG); Robotics (cs.RO)
备注:

点击查看摘要

Abstract:We present an iterative inverse reinforcement learning algorithm to infer optimal cost functions in continuous spaces. Based on a popular maximum entropy criterion, our approach iteratively finds a weight improvement step and proposes a method to find an appropriate step size that ensures learned cost function features remain similar to the demonstrated trajectory features. In contrast to similar approaches, our algorithm can individually tune the effectiveness of each observation for the partition function and does not need a large sample set, enabling faster learning. We generate sample trajectories by solving an optimal control problem instead of random sampling, leading to more informative trajectories. The performance of our method is compared to two state-of-the-art algorithms to demonstrate its benefits in several simulated environments.

[LG-9] Clustering of Incomplete Data via a Bipartite Graph Structure

链接: https://arxiv.org/abs/2505.08594
作者: Amirhossein Javaheri,Daniel P. Palomar
类目: Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:There are various approaches to graph learning for data clustering, incorporating different spectral and structural constraints through diverse graph structures. Some methods rely on bipartite graph models, where nodes are divided into two classes: centers and members. These models typically require access to data for the center nodes in addition to observations from the member nodes. However, such additional data may not always be available in many practical scenarios. Moreover, popular Gaussian models for graph learning have demonstrated limited effectiveness in modeling data with heavy-tailed distributions, which are common in financial markets. In this paper, we propose a clustering method based on a bipartite graph model that addresses these challenges. First, it can infer clusters from incomplete data without requiring information about the center nodes. Second, it is designed to effectively handle heavy-tailed data. Numerical experiments using real financial data validate the efficiency of the proposed method for data clustering.

[LG-10] MUBox: A Critical Evaluation Framework of Deep Machine Unlearning

链接: https://arxiv.org/abs/2505.08576
作者: Xiang Li,Bhavani Thuraisingham,Wenqi Wei
类目: Machine Learning (cs.LG); Cryptography and Security (cs.CR)
备注:

点击查看摘要

Abstract:Recent legal frameworks have mandated the right to be forgotten, obligating the removal of specific data upon user requests. Machine Unlearning has emerged as a promising solution by selectively removing learned information from machine learning models. This paper presents MUBox, a comprehensive platform designed to evaluate unlearning methods in deep learning. MUBox integrates 23 advanced unlearning techniques, tested across six practical scenarios with 11 diverse evaluation metrics. It allows researchers and practitioners to (1) assess and compare the effectiveness of different machine unlearning methods across various scenarios; (2) examine the impact of current evaluation metrics on unlearning performance; and (3) conduct detailed comparative studies on machine unlearning in a unified framework. Leveraging MUBox, we systematically evaluate these unlearning methods in deep learning and uncover several key insights: (a) Even state-of-the-art unlearning methods, including those published in top-tier venues and winners of unlearning competitions, demonstrate inconsistent effectiveness across diverse scenarios. Prior research has predominantly focused on simplified settings, such as random forgetting and class-wise unlearning, highlighting the need for broader evaluations across more difficult unlearning tasks. (b) Assessing unlearning performance remains a non-trivial problem, as no single evaluation metric can comprehensively capture the effectiveness, efficiency, and preservation of model utility. Our findings emphasize the necessity of employing multiple metrics to achieve a balanced and holistic assessment of unlearning methods. (c) In the context of depoisoning, our evaluation reveals significant variability in the effectiveness of existing approaches, which is highly dependent on the specific type of poisoning attacks.

[LG-11] Online Learning and Unlearning

链接: https://arxiv.org/abs/2505.08557
作者: Yaxi Hu,Bernhard Schölkopf,Amartya Sanyal
类目: Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:We formalize the problem of online learning-unlearning, where a model is updated sequentially in an online setting while accommodating unlearning requests between updates. After a data point is unlearned, all subsequent outputs must be statistically indistinguishable from those of a model trained without that point. We present two online learner-unlearner (OLU) algorithms, both built upon online gradient descent (OGD). The first, passive OLU, leverages OGD’s contractive property and injects noise when unlearning occurs, incurring no additional computation. The second, active OLU, uses an offline unlearning algorithm that shifts the model toward a solution excluding the deleted data. Under standard convexity and smoothness assumptions, both methods achieve regret bounds comparable to those of standard OGD, demonstrating that one can maintain competitive regret bounds while providing unlearning guarantees.
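
补充示例(损失、噪声规模与遗忘请求时刻均为假设,不构成论文的正式保证):下面是被动式 OLU 思想的骨架示意,正常执行在线梯度下降,遇到遗忘请求时注入高斯噪声以实现统计不可区分性。

```python
import numpy as np

rng = np.random.default_rng(0)
w = np.zeros(5)

def ogd_step(w, x, y, lr=0.1):
    grad = 2 * (w @ x - y) * x       # 平方损失的梯度
    return w - lr * grad

for t in range(100):
    x, y = rng.normal(size=5), rng.normal()
    w = ogd_step(w, x, y)            # 正常的在线梯度下降更新
    if t in (30, 70):                # 模拟两次遗忘请求
        w = w + rng.normal(scale=0.05, size=5)   # 注入噪声,使后续输出不可区分

print(w)
```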

[LG-12] OLinear: A Linear Model for Time Series Forecasting in Orthogonally Transformed Domain

链接: https://arxiv.org/abs/2505.08550
作者: Wenzhen Yue,Yong Liu,Haoxuan Li,Hao Wang,Xianghua Ying,Ruohao Guo,Bowei Xing,Ji Shi
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
备注:

点击查看摘要

Abstract:This paper presents OLinear, a linear-based multivariate time series forecasting model that operates in an orthogonally transformed domain. Recent forecasting models typically adopt the temporal forecast (TF) paradigm, which directly encodes and decodes time series in the time domain. However, the entangled step-wise dependencies in series data can hinder the performance of TF. To address this, some forecasters conduct encoding and decoding in the transformed domain using fixed, dataset-independent bases (e.g., sine and cosine signals in the Fourier transform). In contrast, we utilize OrthoTrans, a data-adaptive transformation based on an orthogonal matrix that diagonalizes the series' temporal Pearson correlation matrix. This approach enables more effective encoding and decoding in the decorrelated feature domain and can serve as a plug-in module to enhance existing forecasters. To enhance the representation learning for multivariate time series, we introduce a customized linear layer, NormLin, which employs a normalized weight matrix to capture multivariate dependencies. Empirically, the NormLin module shows a surprising performance advantage over multi-head self-attention, while requiring nearly half the FLOPs. Extensive experiments on 24 benchmarks and 140 forecasting tasks demonstrate that OLinear consistently achieves state-of-the-art performance with high efficiency. Notably, as a plug-in replacement for self-attention, the NormLin module consistently enhances Transformer-based forecasters. The code and datasets are available at this https URL
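
补充示例(数据为随机游走仿真):OrthoTrans 的核心是对时间维 Pearson 相关矩阵做特征分解,用正交特征向量矩阵把序列变换到去相关域。下面用 numpy 验证变换后非对角相关接近零。

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.cumsum(rng.normal(size=(1000, 96)), axis=1)   # 带时间相关性的序列(随机游走)
Xs = (X - X.mean(0)) / X.std(0)                      # 标准化后,相关阵即协方差阵

C = np.corrcoef(Xs, rowvar=False)    # 96x96 时间维 Pearson 相关矩阵
eigvals, Q = np.linalg.eigh(C)       # C 对称 => Q 为正交矩阵,且 Q^T C Q 为对角阵
Z = Xs @ Q                           # 变换到去相关域

C_z = np.corrcoef(Z, rowvar=False)
print(np.abs(C_z - np.diag(np.diag(C_z))).max())   # 非对角元素应接近 0
```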

[LG-13] Diffusion-assisted Model Predictive Control Optimization for Power System Real-Time Operation

链接: https://arxiv.org/abs/2505.08535
作者: Linna Xu,Yongli Zhu
类目: Systems and Control (eess.SY); Machine Learning (cs.LG)
备注: This paper has been accepted by the 2025 IEEE PES General Meeting (PESGM), which will be held in Austin, TX, July 27-31, 2025

点击查看摘要

Abstract:This paper presents a modified model predictive control (MPC) framework for real-time power system operation. The framework incorporates a diffusion model tailored for time series generation to enhance the accuracy of the load forecasting module used in the system operation. In the absence of explicit state transition law, a model-identification procedure is leveraged to derive the system dynamics, thereby eliminating a barrier when applying MPC to a renewables-dominated power system. Case study results on an industry park system and the IEEE 30-bus system demonstrate that using the diffusion model to augment the training dataset significantly improves load-forecasting accuracy, and the inferred system dynamics are applicable to the real-time grid operation with solar and wind.

[LG-14] InfoPO: On Mutual Information Maximization for Large Language Model Alignment NAACL2025

链接: https://arxiv.org/abs/2505.08507
作者: Teng Xiao,Zhen Ge,Sujay Sanghavi,Tian Wang,Julian Katz-Samuels,Marc Versage,Qingjun Cui,Trishul Chilimbi
类目: Machine Learning (cs.LG)
备注: NAACL 2025

点击查看摘要

Abstract:We study the post-training of large language models (LLMs) with human preference data. Recently, direct preference optimization and its variants have shown considerable promise in aligning language models, eliminating the need for reward models and online sampling. Despite these benefits, these methods rely on explicit assumptions about the Bradley-Terry (BT) model, which makes them prone to overfitting and results in suboptimal performance, particularly on reasoning-heavy tasks. To address these challenges, we propose a principled preference fine-tuning algorithm called InfoPO, which effectively and efficiently aligns large language models using preference data. InfoPO eliminates the reliance on the BT model and prevents the likelihood of the chosen response from decreasing. Extensive experiments confirm that InfoPO consistently outperforms established baselines on widely used open benchmarks, particularly in reasoning tasks.

[LG-15] A new methodology to decompose a parametric domain using reduced order data manifold in machine learning

链接: https://arxiv.org/abs/2505.08497
作者: Chetra Mang,Axel TahmasebiMoradi,Mouadh Yagoubi
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
备注:

点击查看摘要

Abstract:We propose a new methodology for parametric domain decomposition using iterative principal component analysis. Starting with iterative principal component analysis, the high-dimensional manifold is reduced to a lower-dimensional one. Moreover, two approaches are developed to reconstruct the inverse projector that maps from the lower-dimensional components back to the original ones. Afterward, we provide a detailed strategy to decompose the parametric domain based on the low-dimensional manifold. Finally, numerical examples of a harmonic transport problem are given to illustrate the efficiency and effectiveness of the proposed method compared to classical meta-models such as neural networks.

[LG-16] Isolation Forest in Novelty Detection Scenario

链接: https://arxiv.org/abs/2505.08489
作者: Adam Ulrich,Jan Krňávek,Roman Šenkeřík,Zuzana Komínková Oplatková,Radek Vala
类目: Machine Learning (cs.LG); Discrete Mathematics (cs.DM)
备注:

点击查看摘要

Abstract:Data mining offers a diverse toolbox for extracting meaningful structures from complex datasets, with anomaly detection emerging as a critical subfield particularly in the context of streaming or real-time data. Within anomaly detection, novelty detection focuses on identifying previously unseen patterns after training solely on regular data. While classic algorithms such as One-Class SVM or Local Outlier Factor (LOF) have been widely applied, they often lack interpretability and scalability. In this work, we explore the Half-Space Tree (HST) algorithm, originally proposed for streaming anomaly detection, and propose a novel theoretical modification to adapt it specifically for novelty detection tasks. Our approach is grounded in the idea that anomalies (i.e., novelties) tend to appear in the higher leaves of the tree, which are less frequently visited by regular instances. We analytically demonstrate the effectiveness of this approach using probabilistic analysis, expected depth (EXD) calculations, and combinatorial reasoning. A comparative analysis of expected depths between our modified HST and the original Isolation Forest highlights that novelty points are significantly more isolated in our approach. This supports the hypothesis that HSTs, with appropriate structural adaptation, can serve as interpretable and efficient novelty detectors. The paper contributes a theoretical foundation and supporting analysis for this adaptation, setting the stage for further application and experimentation.

[LG-17] Parameter Estimation using Reinforcement Learning Causal Curiosity: Limits and Challenges

链接: https://arxiv.org/abs/2505.08453
作者: Miguel Arana-Catania,Weisi Guo
类目: Robotics (cs.RO); Machine Learning (cs.LG)
备注: 24 pages, 10 figures, 9 tables

点击查看摘要

Abstract:Causal understanding is important in many disciplines of science and engineering, where we seek to understand how different factors in the system causally affect an experiment or situation and pave a pathway towards creating effective or optimising existing models. Examples of use cases are autonomous exploration and modelling of unknown environments or assessing key variables in optimising large complex systems. In this paper, we analyse a Reinforcement Learning approach called Causal Curiosity, which aims to estimate as accurately and efficiently as possible, without directly measuring them, the value of factors that causally determine the dynamics of a system. Whilst the idea presents a pathway forward, measurement accuracy is the foundation of methodology effectiveness. Focusing on the robotic manipulator setting used in current Causal Curiosity work, we present for the first time a measurement accuracy analysis of the future potentials and current limitations of this technique, together with an analysis of its sensitivity and confounding-factor disentanglement capability - both crucial for causal analysis. As a result of our work, we put forward proposals for an improved and efficient design of Causal Curiosity methods to be applied to real-world complex scenarios.

[LG-18] Continuous World Coverage Path Planning for Fixed-Wing UAVs using Deep Reinforcement Learning IROS2025

链接: https://arxiv.org/abs/2505.08382
作者: Mirco Theile,Andres R. Zapata Rodriguez,Marco Caccamo,Alberto L. Sangiovanni-Vincentelli
类目: Robotics (cs.RO); Machine Learning (cs.LG); Systems and Control (eess.SY)
备注: Submitted to IROS 2025

点击查看摘要

Abstract:Unmanned Aerial Vehicle (UAV) Coverage Path Planning (CPP) is critical for applications such as precision agriculture and search and rescue. While traditional methods rely on discrete grid-based representations, real-world UAV operations require power-efficient continuous motion planning. We formulate the UAV CPP problem in a continuous environment, minimizing power consumption while ensuring complete coverage. Our approach models the environment with variable-size axis-aligned rectangles and UAV motion with curvature-constrained Bézier curves. We train a reinforcement learning agent using an action-mapping-based Soft Actor-Critic (AM-SAC) algorithm employing a self-adaptive curriculum. Experiments on both procedurally generated and hand-crafted scenarios demonstrate the effectiveness of our method in learning energy-efficient coverage strategies.

[LG-19] Density Ratio-based Causal Discovery from Bivariate Continuous-Discrete Data

链接: https://arxiv.org/abs/2505.08371
作者: Takashi Nicholas Maeda,Shohei Shimizu,Hidetoshi Matsui
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
备注:

点击查看摘要

Abstract:This paper proposes a causal discovery method for mixed bivariate data consisting of one continuous and one discrete variable. Existing constraint-based approaches are ineffective in the bivariate setting, as they rely on conditional independence tests that are not suited to bivariate data. Score-based methods either impose strong distributional assumptions or face challenges in fairly comparing causal directions between variables of different types, due to differences in their information content. We introduce a novel approach that determines causal direction by analyzing the monotonicity of the conditional density ratio of the continuous variable, conditioned on different values of the discrete variable. Our theoretical analysis shows that the conditional density ratio exhibits monotonicity when the continuous variable causes the discrete variable, but not in the reverse direction. This property provides a principled basis for comparing causal directions between variables of different types, free from strong distributional assumptions and bias arising from differences in their information content. We demonstrate its effectiveness through experiments on both synthetic and real-world datasets, showing superior accuracy compared to existing methods.
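
补充示例(数据生成过程与网格均为假设,KDE 估计本身带有噪声):下面用核密度估计粗略演示论文判别因果方向所依赖的性质,即当 X 导致 Y 时,条件密度比 p(x|y=1)/p(x|y=0) 随 x 单调。

```python
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(1)
x = rng.normal(size=4000)                                 # 连续原因 X
y = (x + 0.5 * rng.normal(size=4000) > 0).astype(int)     # X -> Y(离散结果)

kde0 = gaussian_kde(x[y == 0])
kde1 = gaussian_kde(x[y == 1])
grid = np.linspace(-2, 2, 41)
ratio = kde1(grid) / kde0(grid)      # 条件密度比 p(x|y=1) / p(x|y=0)

print("单调递增步的比例:", np.mean(np.diff(ratio) > 0))   # 应接近 1
```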

[LG-20] Localization of Impacts on Thin-Walled Structures by Recurrent Neural Networks: End-to-end Learning from Real-World Data

链接: https://arxiv.org/abs/2505.08362
作者: Alexander Humer,Lukas Grasboeck,Ayech Benjeddou
类目: Machine Learning (cs.LG)
备注: XI ECCOMAS Thematic Conference on Smart Structures and Materials (SMART 2025)

点击查看摘要

Abstract:Today, machine learning is ubiquitous, and structural health monitoring (SHM) is no exception. Specifically, we address the problem of impact localization on shell-like structures, where knowledge of impact locations aids in assessing structural integrity. Impacts on thin-walled structures excite Lamb waves, which can be measured with piezoelectric sensors. Their dispersive characteristics make it difficult to detect and localize impacts by conventional methods. In the present contribution, we explore the localization of impacts using neural networks. In particular, we propose to use recurrent neural networks (RNNs) to estimate impact positions end-to-end, i.e., directly from sequential sensor data. We deal with comparatively long sequences of thousands of samples, since high sampling rates are needed to accurately capture elastic waves. For this reason, the proposed approach builds upon Gated Recurrent Units (GRUs), which are less prone to vanishing gradients as compared to conventional RNNs. Quality and quantity of data are crucial when training neural networks. Often, synthetic data is used, which inevitably introduces a reality gap. Here, by contrast, we train our networks using physical data from experiments, which requires automation to handle the large number of experiments needed. For this purpose, a robot is used to drop steel balls onto an aluminum plate equipped with piezoceramic sensors. Our results show remarkable accuracy in estimating impact positions, even with a comparatively small dataset.

[LG-21] Structural-Temporal Coupling Anomaly Detection with Dynamic Graph Transformer

链接: https://arxiv.org/abs/2505.08330
作者: Chang Zong,Yueting Zhuang,Jian Shao,Weiming Lu
类目: Machine Learning (cs.LG); Social and Information Networks (cs.SI)
备注: 20 pages, 6 figures

点击查看摘要

Abstract:Detecting anomalous edges in dynamic graphs is an important task in many applications over evolving triple-based data, such as social networks, transaction management, and epidemiology. A major challenge with this task is the absence of structural-temporal coupling information, which decreases the ability of the representation to distinguish anomalies from normal instances. Existing methods focus on handling independent structural and temporal features with embedding models, which ignore the deep interaction between these two types of information. In this paper, we propose a structural-temporal coupling anomaly detection architecture with a dynamic graph transformer model. Specifically, we introduce structural and temporal features from two integration levels to provide anomaly-aware graph evolutionary patterns. Then, a dynamic graph transformer enhanced by two-dimensional positional encoding is implemented to capture both discrimination and contextual consistency signals. Extensive experiments on six datasets demonstrate that our method outperforms current state-of-the-art models. Finally, a case study illustrates the strength of our method when applied to a real-world task.

[LG-22] SpecSphere: Dual-Pass Spectral-Spatial Graph Neural Networks with Certified Robustness

链接: https://arxiv.org/abs/2505.08320
作者: Yoonhyuk Choi,Chong-Kwon Kim
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We introduce SpecSphere, the first dual-pass spectral-spatial GNN that certifies every prediction against both \ell_0 edge flips and \ell_\infty feature perturbations, adapts to the full homophily-heterophily spectrum, and surpasses the expressive power of 1-Weisfeiler-Lehman while retaining linear-time complexity. Our model couples a Chebyshev-polynomial spectral branch with an attention-gated spatial branch and fuses their representations through a lightweight MLP trained in a cooperative-adversarial min-max game. We further establish (i) a uniform Chebyshev approximation theorem, (ii) minimax-optimal risk across the homophily-heterophily spectrum, (iii) closed-form robustness certificates, and (iv) universal approximation strictly beyond 1-WL. SpecSphere achieves state-of-the-art node-classification accuracy and delivers tighter certified robustness guarantees on real-world benchmarks. These results demonstrate that high expressivity, heterophily adaptation, and provable robustness can coexist within a single, scalable architecture.
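The Chebyshev-polynomial spectral branch rests on the standard recurrence T_k(L_hat) = 2 L_hat T_{k-1}(L_hat) - T_{k-2}(L_hat) on a rescaled Laplacian. A self-contained numpy sketch of such a filter (our simplification, not the SpecSphere code) is:

```python
# Chebyshev spectral graph filter: y = sum_k theta_k T_k(L_hat) x.
import numpy as np

def chebyshev_filter(L_hat, x, theta):
    """L_hat: rescaled Laplacian (n, n); x: signal (n, f); theta: K coefficients."""
    T_prev, T_curr = x, L_hat @ x
    out = theta[0] * T_prev + theta[1] * T_curr
    for k in range(2, len(theta)):
        T_prev, T_curr = T_curr, 2 * L_hat @ T_curr - T_prev   # Chebyshev recurrence
        out += theta[k] * T_curr
    return out

n = 5
A = np.random.rand(n, n); A = (A + A.T) / 2; np.fill_diagonal(A, 0)
L = np.diag(A.sum(1)) - A
L_hat = 2 * L / np.linalg.eigvalsh(L).max() - np.eye(n)   # rescale spectrum to [-1, 1]
print(chebyshev_filter(L_hat, np.random.rand(n, 3), [0.5, 0.3, 0.2]).shape)
```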

[LG-23] Rapid Overfitting of Multi-Pass Stochastic Gradient Descent in Stochastic Convex Optimization

链接: https://arxiv.org/abs/2505.08306
作者: Shira Vansover-Hager,Tomer Koren,Roi Livni
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We study the out-of-sample performance of multi-pass stochastic gradient descent (SGD) in the fundamental stochastic convex optimization (SCO) model. While one-pass SGD is known to achieve an optimal \Theta(1/\sqrt{n}) excess population loss given a sample of size n, much less is understood about the multi-pass version of the algorithm, which is widely used in practice. Somewhat surprisingly, we show that in the general non-smooth case of SCO, just a few epochs of SGD can already hurt its out-of-sample performance significantly and lead to overfitting. In particular, using a step size \eta = \Theta(1/\sqrt{n}), which gives the optimal rate after one pass, can lead to population loss as large as \Omega(1) after just one additional pass. More generally, we show that the population loss from the second pass onward is of the order \Theta(1/(\eta T) + \eta \sqrt{T}), where T is the total number of steps. These results reveal a certain phase transition in the out-of-sample behavior of SGD after the first epoch, as well as a sharp separation between the rates of overfitting in the smooth and non-smooth cases of SCO. Additionally, we extend our results to with-replacement SGD, proving that the same asymptotic bounds hold after O(n \log n) steps. Finally, we also prove a lower bound of \Omega(\eta \sqrt{n}) on the generalization gap of one-pass SGD in dimension d = \widetilde{O}(n), improving on recent results of Koren et al. (2022) and Schliserman et al. (2024).
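For intuition, here is a toy multi-pass SGD loop on the non-smooth loss f(w; z) = |w - z| with the one-pass-optimal step size. It only illustrates the algorithm being analyzed; the paper's \Omega(1) overfitting example is a carefully constructed high-dimensional instance, not this scalar problem.

```python
# Toy multi-pass SGD with eta = 1/sqrt(n) on a non-smooth convex loss.
import numpy as np

rng = np.random.default_rng(1)
n = 1000
sample = rng.normal(size=n)
eta, w = 1 / np.sqrt(n), 0.0

for epoch in range(3):                      # multiple passes over the same sample
    for z in rng.permutation(sample):
        w -= eta * np.sign(w - z)           # subgradient step for |w - z|
    train = np.abs(w - sample).mean()
    population = np.abs(w - rng.normal(size=100_000)).mean()
    print(f"epoch {epoch + 1}: train loss {train:.3f}, population loss {population:.3f}")
```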

[LG-24] Super-fast rates of convergence for Neural Networks Classifiers under the Hard Margin Condition

链接: https://arxiv.org/abs/2505.08262
作者: Nathanael Tepakbong,Ding-Xuan Zhou,Xiang Zhou
类目: Machine Learning (cs.LG); Statistics Theory (math.ST)
*备注: 31 pages

点击查看摘要

Abstract:We study the classical binary classification problem for hypothesis spaces of Deep Neural Networks (DNNs) with ReLU activation under Tsybakov’s low-noise condition with exponent q > 0, and its limit case q \to \infty, which we refer to as the “hard-margin condition”. We show that DNNs which minimize the empirical risk with square loss surrogate and \ell_p penalty can achieve finite-sample excess risk bounds of order \mathcal{O}\left(n^{-\alpha}\right) for arbitrarily large \alpha > 0 under the hard-margin condition, provided that the regression function \eta is sufficiently smooth. The proof relies on a novel decomposition of the excess risk which might be of independent interest.

[LG-25] Clustering-based Low-Rank Matrix Approximation: An Adaptive Theoretical Analysis with Application to Data Compression

链接: https://arxiv.org/abs/2505.08256
作者: Sisipho Hamlomo,Marcellin Atemkeng
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Low-rank matrix approximation (LoRMA) is a fundamental tool for compressing high-resolution data matrices by extracting important features while suppressing redundancy. Low-rank methods, such as global singular value decomposition (SVD), apply uniform compression across the entire data matrix, often ignoring important local variations and leading to the loss of fine structural details. To address these limitations, we introduce an adaptive LoRMA, which partitions the data matrix into overlapping patches, groups structurally similar patches into several clusters using k-means, and performs SVD within each cluster. We derive the overall compression factor accounting for patch overlap and analyze how patch size influences compression efficiency and computational cost. While the proposed adaptive LoRMA method is applicable to any data exhibiting high local variation, we focus on medical imaging due to its pronounced local variability. We evaluate and compare our adaptive LoRMA against global SVD across four imaging modalities: MRI, ultrasound, CT scan, and chest X-ray. Results demonstrate that adaptive LoRMA effectively preserves structural integrity, edge details, and diagnostic relevance, as measured by peak signal-to-noise ratio (PSNR), structural similarity index (SSIM), mean squared error (MSE), intersection over union (IoU), and edge preservation index (EPI). Adaptive LoRMA significantly minimizes block artifacts and residual errors, particularly in pathological regions, consistently outperforming global SVD in terms of PSNR, SSIM, IoU, EPI, and achieving lower MSE. Adaptive LoRMA prioritizes clinically salient regions while allowing aggressive compression in non-critical regions, optimizing storage efficiency. Although adaptive LoRMA requires higher processing time, its diagnostic fidelity justifies the overhead for high-compression applications.
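A hedged numpy/sklearn sketch of the pipeline (patch extraction, k-means grouping, per-cluster truncated SVD); for brevity it uses non-overlapping patches, and the patch size, rank, and cluster count are illustrative rather than the paper's settings.

```python
# Clustering-based low-rank approximation: cluster patches, then SVD per cluster.
import numpy as np
from sklearn.cluster import KMeans

def adaptive_lorma(img, patch=8, n_clusters=4, rank=4):
    h, w = img.shape
    coords = [(i, j) for i in range(0, h - patch + 1, patch)
                     for j in range(0, w - patch + 1, patch)]
    patches = np.stack([img[i:i+patch, j:j+patch].ravel() for i, j in coords])
    labels = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit_predict(patches)
    out = np.zeros_like(img)
    for c in range(n_clusters):
        idx = np.where(labels == c)[0]
        U, S, Vt = np.linalg.svd(patches[idx], full_matrices=False)
        approx = U[:, :rank] * S[:rank] @ Vt[:rank]        # rank-r reconstruction
        for row, k in zip(approx, idx):
            i, j = coords[k]
            out[i:i+patch, j:j+patch] = row.reshape(patch, patch)
    return out

img = np.random.rand(64, 64)
rec = adaptive_lorma(img)
print("MSE:", np.mean((img - rec) ** 2))
```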

[LG-26] Privacy-Preserving Analytics for Smart Meter (AMI) Data: A Hybrid Approach to Comply with CPUC Privacy Regulations

链接: https://arxiv.org/abs/2505.08237
作者: Benjamin Westrich
类目: Cryptography and Security (cs.CR); Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:Advanced Metering Infrastructure (AMI) data from smart electric and gas meters enables valuable insights for utilities and consumers, but also raises significant privacy concerns. In California, regulatory decisions (CPUC D.11-07-056 and D.11-08-045) mandate strict privacy protections for customer energy usage data, guided by the Fair Information Practice Principles (FIPPs). We comprehensively explore solutions drawn from data anonymization, privacy-preserving machine learning (differential privacy and federated learning), synthetic data generation, and cryptographic techniques (secure multiparty computation, homomorphic encryption). This allows advanced analytics, including machine learning models, statistical and econometric analysis on energy consumption data, to be performed without compromising individual privacy. We evaluate each technique’s theoretical foundations, effectiveness, and trade-offs in the context of utility data analytics, and we propose an integrated architecture that combines these methods to meet real-world needs. The proposed hybrid architecture is designed to ensure compliance with California’s privacy rules and FIPPs while enabling useful analytics, from forecasting and personalized insights to academic research and econometrics, while strictly protecting individual privacy. Mathematical definitions and derivations are provided where appropriate to demonstrate privacy guarantees and utility implications rigorously. We include comparative evaluations of the techniques, an architecture diagram, and flowcharts to illustrate how they work together in practice. The result is a blueprint for utility data scientists and engineers to implement privacy-by-design in AMI data handling, supporting both data-driven innovation and strict regulatory compliance.
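One building block of such a hybrid stack, sketched below, is an epsilon-differentially-private release of an aggregate AMI statistic via the Laplace mechanism; the clipping bounds and epsilon are illustrative assumptions, not values from the paper.

```python
# Laplace mechanism for an epsilon-DP release of a mean smart-meter reading.
import numpy as np

def dp_mean(readings, lower=0.0, upper=5.0, epsilon=1.0, rng=None):
    rng = rng or np.random.default_rng()
    clipped = np.clip(readings, lower, upper)         # bound each user's influence
    sensitivity = (upper - lower) / len(clipped)      # L1 sensitivity of the mean
    noise = rng.laplace(scale=sensitivity / epsilon)
    return clipped.mean() + noise

meter_data = np.random.gamma(shape=2.0, scale=0.6, size=10_000)  # synthetic kWh data
print("private mean:", dp_mean(meter_data, epsilon=0.5))
```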

[LG-27] Deep Probabilistic Modeling of User Behavior for Anomaly Detection via Mixture Density Networks

链接: https://arxiv.org/abs/2505.08220
作者: Lu Dai,Wenxuan Zhu,Xuehui Quan,Renzi Meng,Sheng Cai,Yichen Wang
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:To improve the identification of potential anomaly patterns in complex user behavior, this paper proposes an anomaly detection method based on a deep mixture density network. The method constructs a Gaussian mixture model parameterized by a neural network, enabling conditional probability modeling of user behavior. It effectively captures the multimodal distribution characteristics commonly present in behavioral data. Unlike traditional classifiers that rely on fixed thresholds or a single decision boundary, this approach defines an anomaly scoring function based on probability density using negative log-likelihood. This significantly enhances the model’s ability to detect rare and unstructured behaviors. Experiments are conducted on the real-world network user dataset UNSW-NB15. A series of performance comparisons and stability validation experiments are designed. These cover multiple evaluation aspects, including Accuracy, F1-score, AUC, and loss fluctuation. The results show that the proposed method outperforms several advanced neural network architectures in both performance and training stability. This study provides a more expressive and discriminative solution for user behavior modeling and anomaly detection. It strongly promotes the application of deep probabilistic modeling techniques in the fields of network security and intelligent risk control.
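A compact PyTorch sketch of the idea: a network parameterizes a Gaussian mixture over a behavior feature, and the anomaly score is the negative log-likelihood under that mixture. The architecture sizes are assumptions; this is not the authors' model.

```python
# Mixture density network with NLL-based anomaly scoring.
import torch
import torch.nn as nn

class MDN(nn.Module):
    def __init__(self, in_dim=16, n_comp=5):
        super().__init__()
        self.backbone = nn.Sequential(nn.Linear(in_dim, 64), nn.ReLU())
        self.pi = nn.Linear(64, n_comp)          # mixture weights (logits)
        self.mu = nn.Linear(64, n_comp)          # component means
        self.log_sigma = nn.Linear(64, n_comp)   # component log-scales

    def nll(self, x, y):                         # y: scalar target per sample
        h = self.backbone(x)
        log_pi = torch.log_softmax(self.pi(h), dim=-1)
        comp = torch.distributions.Normal(self.mu(h), self.log_sigma(h).exp())
        log_prob = comp.log_prob(y.unsqueeze(-1))             # (batch, n_comp)
        return -torch.logsumexp(log_pi + log_prob, dim=-1)    # anomaly score

model = MDN()
x, y = torch.randn(8, 16), torch.randn(8)
print(model.nll(x, y))   # higher = more anomalous
```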

[LG-28] An Effective Flow-based Method for Positive-Unlabeled Learning: 2-HNC

链接: https://arxiv.org/abs/2505.08212
作者: Dorit Hochbaum,Torpong Nitayanont
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:In many scenarios of binary classification, only positive instances are provided in the training data, leaving the rest of the data unlabeled. This setup, known as positive-unlabeled (PU) learning, is addressed here with a network flow-based method which utilizes pairwise similarities between samples. The method we propose here, 2-HNC, leverages Hochbaum’s Normalized Cut (HNC) and the set of solutions it provides by solving a parametric minimum cut problem. The set of solutions, that are nested partitions of the samples into two sets, correspond to varying tradeoff values between the two goals: high intra-similarity inside the sets and low inter-similarity between the two sets. This nested sequence is utilized here to deliver a ranking of unlabeled samples by their likelihood of being negative. Building on this insight, our method, 2-HNC, proceeds in two stages. The first stage generates this ranking without assuming any negative labels, using a problem formulation that is constrained only on positive labeled samples. The second stage augments the positive set with likely-negative samples and recomputes the classification. The final label prediction selects among all generated partitions in both stages, the one that delivers a positive class proportion, closest to a prior estimate of this quantity, which is assumed to be given. Extensive experiments across synthetic and real datasets show that 2-HNC yields strong performance and often surpasses existing state-of-the-art algorithms.

[LG-29] A Multi-scale Representation Learning Framework for Long-Term Time Series Forecasting

链接: https://arxiv.org/abs/2505.08199
作者: Boshi Gao,Qingjian Ni,Fanbo Ju,Yu Chen,Ziqi Zhao
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Long-term time series forecasting (LTSF) offers broad utility in practical settings like energy consumption and weather prediction. Accurately predicting long-term changes, however, is demanding due to the intricate temporal patterns and inherent multi-scale variations within time series. This work confronts key issues in LTSF, including the suboptimal use of multi-granularity information, the neglect of channel-specific attributes, and the unique nature of trend and seasonal components, by introducing a proficient MLP-based forecasting framework. Our method adeptly disentangles complex temporal dynamics using clear, concurrent predictions across various scales. These multi-scale forecasts are then skillfully integrated through a system that dynamically assigns importance to information from different granularities, sensitive to individual channel characteristics. To manage the specific features of temporal patterns, a two-pronged structure is utilized to model trend and seasonal elements independently. Experimental results on eight LTSF benchmarks demonstrate that MDMixer improves average MAE performance by 4.64% compared to the recent state-of-the-art MLP-based method (TimeMixer), while achieving an effective balance between training efficiency and model interpretability.

[LG-30] Tensor Sketch: Fast and Scalable Polynomial Kernel Approximation KDD2013

链接: https://arxiv.org/abs/2505.08146
作者: Ninh Pham,Rasmus Pagh
类目: Data Structures and Algorithms (cs.DS); Machine Learning (cs.LG)
*备注: Extension of KDD 2013 and correcting the variance bound

点击查看摘要

Abstract:Approximation of non-linear kernels using random feature maps has become a powerful technique for scaling kernel methods to large datasets. We propose \textit{Tensor Sketch}, an efficient random feature map for approximating polynomial kernels. Given n training samples in \R^d, Tensor Sketch computes low-dimensional embeddings in \R^D in time \mathcal{O}(n(d + D \log D)), making it well-suited for high-dimensional and large-scale settings. We provide theoretical guarantees on the approximation error, ensuring the fidelity of the resulting kernel function estimates. We also discuss extensions and highlight applications where Tensor Sketch serves as a central computational tool.
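Tensor Sketch itself is simple to state: count-sketch p copies of x with independent hash and sign functions, multiply the FFTs of the sketches elementwise, and invert, so that inner products of sketches approximate the degree-p polynomial kernel. A numpy sketch under those standard definitions:

```python
# Tensor Sketch for the degree-p polynomial kernel <x, y>^p.
import numpy as np

def tensor_sketch(x, hashes, signs, D):
    fft_prod = np.ones(D, dtype=complex)
    for h, s in zip(hashes, signs):
        cs = np.zeros(D)
        np.add.at(cs, h, s * x)              # count sketch of x
        fft_prod *= np.fft.fft(cs)           # convolution of sketches in Fourier space
    return np.real(np.fft.ifft(fft_prod))

rng = np.random.default_rng(0)
d, D, p = 50, 4096, 2
hashes = [rng.integers(0, D, size=d) for _ in range(p)]
signs = [rng.choice([-1.0, 1.0], size=d) for _ in range(p)]
x, y = rng.normal(size=d), rng.normal(size=d)
approx = tensor_sketch(x, hashes, signs, D) @ tensor_sketch(y, hashes, signs, D)
print(approx, "vs exact", (x @ y) ** p)
```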

[LG-31] Multi-Layer Hierarchical Federated Learning with Quantization

链接: https://arxiv.org/abs/2505.08145
作者: Seyed Mohammad Azimi-Abarghouyi,Carlo Fischione
类目: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC); Information Theory (cs.IT); Networking and Internet Architecture (cs.NI)
*备注:

点击查看摘要

Abstract:Almost all existing hierarchical federated learning (FL) models are limited to two aggregation layers, restricting scalability and flexibility in complex, large-scale networks. In this work, we propose a Multi-Layer Hierarchical Federated Learning framework (QMLHFL), which appears to be the first study that generalizes hierarchical FL to arbitrary numbers of layers and network architectures through nested aggregation, while employing a layer-specific quantization scheme to meet communication constraints. We develop a comprehensive convergence analysis for QMLHFL and derive a general convergence condition and rate that reveal the effects of key factors, including quantization parameters, hierarchical architecture, and intra-layer iteration counts. Furthermore, we determine the optimal number of intra-layer iterations to maximize the convergence rate while meeting a deadline constraint that accounts for both communication and computation times. Our results show that QMLHFL consistently achieves high learning accuracy, even under high data heterogeneity, and delivers notably improved performance when optimized, compared to using randomly selected values.
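To make the nested-aggregation idea concrete, the sketch below recursively averages client updates up an arbitrary aggregation tree and applies a per-layer uniform quantizer. The tree shape, bit-widths, and quantizer are our illustrative assumptions, not QMLHFL's exact scheme.

```python
# Nested aggregation over an arbitrary tree with layer-specific quantization.
import numpy as np

def quantize(v, bits):
    scale = np.abs(v).max() / (2 ** (bits - 1) - 1) + 1e-12
    return np.round(v / scale) * scale

def aggregate(node, bits_per_layer, depth=0):
    """node is either a client update (ndarray) or a list of child nodes."""
    if isinstance(node, np.ndarray):
        return node
    children = [aggregate(c, bits_per_layer, depth + 1) for c in node]
    return quantize(np.mean(children, axis=0), bits_per_layer[depth])

rng = np.random.default_rng(0)
clients = [rng.normal(size=10) for _ in range(8)]
tree = [clients[:4], clients[4:]]            # two edge servers under one cloud node
print(aggregate(tree, bits_per_layer=[8, 4]))
```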

[LG-32] Fused3S: Fast Sparse Attention on Tensor Cores

链接: https://arxiv.org/abs/2505.08098
作者: Zitong Li,Aparna Chandramowlishwaran
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Sparse attention is a core building block in many leading neural network models, from graph-structured learning to sparse sequence modeling. It can be decomposed into a sequence of three sparse matrix operations (3S): sampled dense-dense matrix multiplication (SDDMM), softmax normalization, and sparse matrix multiplication (SpMM). Efficiently executing the 3S computational pattern on modern GPUs remains challenging due to (a) the mismatch between unstructured sparsity and tensor cores optimized for dense operations, and (b) the high cost of data movement. Previous works have optimized these sparse operations individually or addressed one of these challenges. This paper introduces Fused3S, the first fused 3S algorithm that jointly maximizes tensor core utilization and minimizes data movement. Across real-world graph datasets, Fused3S achieves 1.6-16.3\times and 1.5-14\times speedups over state-of-the-art on H100 and A30 GPUs. Furthermore, integrating Fused3S into Graph Transformer inference accelerates end-to-end performance by 1.05-5.36\times, consistently outperforming all 3S baselines across diverse datasets (single and batched graphs) and GPU architectures.

[LG-33] Manifold Learning with Normalizing Flows: Towards Regularity Expressivity and Iso-Riemannian Geometry

链接: https://arxiv.org/abs/2505.08087
作者: Willem Diepeveen,Deanna Needell
类目: Machine Learning (cs.LG); Differential Geometry (math.DG)
*备注:

点击查看摘要

Abstract:Modern machine learning increasingly leverages the insight that high-dimensional data often lie near low-dimensional, non-linear manifolds, an idea known as the manifold hypothesis. By explicitly modeling the geometric structure of data through learning Riemannian geometry, algorithms can achieve improved performance and interpretability in tasks like clustering, dimensionality reduction, and interpolation. In particular, learned pullback geometry has recently undergone transformative developments that now make it scalable to learn and scalable to evaluate, which further opens the door for principled non-linear data analysis and interpretable machine learning. However, there are still steps to be taken when considering real-world multi-modal data. This work focuses on addressing distortions and modeling errors that can arise in the multi-modal setting and proposes to alleviate both challenges through isometrizing the learned Riemannian structure and balancing regularity and expressivity of the diffeomorphism parametrization. We showcase the effectiveness of the synergy of the proposed approaches in several numerical experiments with both synthetic and real data.

[LG-34] A Federated Random Forest Solution for Secure Distributed Machine Learning

链接: https://arxiv.org/abs/2505.08085
作者: Alexandre Cotorobai,Jorge Miguel Silva,Jose Luis Oliveira
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Privacy and regulatory barriers often hinder centralized machine learning solutions, particularly in sectors like healthcare where data cannot be freely shared. Federated learning has emerged as a powerful paradigm to address these concerns; however, existing frameworks primarily support gradient-based models, leaving a gap for more interpretable, tree-based approaches. This paper introduces a federated learning framework for Random Forest classifiers that preserves data privacy and provides robust performance in distributed settings. By leveraging PySyft for secure, privacy-aware computation, our method enables multiple institutions to collaboratively train Random Forest models on locally stored data without exposing sensitive information. The framework supports weighted model averaging to account for varying data distributions, incremental learning to progressively refine models, and local evaluation to assess performance across heterogeneous datasets. Experiments on two real-world healthcare benchmarks demonstrate that the federated approach maintains competitive predictive accuracy - within a maximum 9% margin of centralized methods - while satisfying stringent privacy requirements. These findings underscore the viability of tree-based federated learning for scenarios where data cannot be centralized due to regulatory, competitive, or technical constraints. The proposed solution addresses a notable gap in existing federated learning libraries, offering an adaptable tool for secure distributed machine learning tasks that demand both transparency and reliable performance. The tool is available at this https URL.
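Setting aside the PySyft plumbing, the core of weighted model averaging for forests can be sketched locally: each site trains its own forest and the global prediction is a data-size-weighted average of the sites' class probabilities. The split sizes below are arbitrary.

```python
# Weighted averaging of per-site random forests (local simulation, no PySyft).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=600, n_features=10, random_state=0)
sites = [(X[:100], y[:100]), (X[100:300], y[100:300]), (X[300:], y[300:])]

forests, weights = [], []
for Xi, yi in sites:
    forests.append(RandomForestClassifier(n_estimators=50, random_state=0).fit(Xi, yi))
    weights.append(len(yi))
weights = np.array(weights) / sum(weights)   # weight each site by its data size

proba = sum(w * f.predict_proba(X) for w, f in zip(weights, forests))
print("federated ensemble accuracy:", (proba.argmax(1) == y).mean())
```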

[LG-35] Mobile Jamming Mitigation in 5G Networks: A MUSIC-Based Adaptive Beamforming Approach

链接: https://arxiv.org/abs/2505.08046
作者: Olivia Holguin,Rachel Donati,Seyed bagher Hashemi Natanzi,Bo Tang
类目: Networking and Internet Architecture (cs.NI); Machine Learning (cs.LG); Signal Processing (eess.SP)
*备注:

点击查看摘要

Abstract:Mobile jammers pose a critical threat to 5G networks, particularly in military communications. We propose an intelligent anti-jamming framework that integrates Multiple Signal Classification (MUSIC) for high-resolution Direction-of-Arrival (DoA) estimation, Minimum Variance Distortionless Response (MVDR) beamforming for adaptive interference suppression, and machine learning (ML) to enhance DoA prediction for mobile jammers. Extensive simulations in a realistic highway scenario demonstrate that our hybrid approach achieves an average Signal-to-Noise Ratio (SNR) improvement of 9.58 dB (maximum 11.08 dB) and up to 99.8% DoA estimation accuracy. The framework’s computational efficiency and adaptability to dynamic jammer mobility patterns outperform conventional anti-jamming techniques, making it a robust solution for securing 5G communications in contested environments.
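The MUSIC stage is classical and easy to sketch: eigendecompose the sample covariance, take the noise subspace, and scan steering vectors over an angle grid. The snippet below assumes a half-wavelength uniform linear array and a single source, and omits the MVDR and ML stages of the paper's pipeline.

```python
# MUSIC DoA estimation for a uniform linear array (single source assumed).
import numpy as np

M, snapshots, true_deg = 8, 200, 25.0
rng = np.random.default_rng(0)
steer = lambda deg: np.exp(1j * np.pi * np.arange(M) * np.sin(np.deg2rad(deg)))

s = rng.normal(size=snapshots) + 1j * rng.normal(size=snapshots)   # jammer signal
X = np.outer(steer(true_deg), s) + 0.1 * (rng.normal(size=(M, snapshots))
                                          + 1j * rng.normal(size=(M, snapshots)))
R = X @ X.conj().T / snapshots
eigval, eigvec = np.linalg.eigh(R)
En = eigvec[:, :-1]                 # noise subspace: all but the largest eigenvector

grid = np.linspace(-90, 90, 721)
p_music = [1 / np.linalg.norm(En.conj().T @ steer(g)) ** 2 for g in grid]
print("estimated DoA:", grid[int(np.argmax(p_music))], "deg")
```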

[LG-36] Demo: A Practical Testbed for Decentralized Federated Learning on Physical Edge Devices

链接: https://arxiv.org/abs/2505.08033
作者: Chao Feng,Nicolas Huber,Alberto Huertas Celdran,Gerome Bovet,Burkhard Stiller
类目: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC)
*备注:

点击查看摘要

Abstract:Federated Learning (FL) enables collaborative model training without sharing raw data, preserving participant privacy. Decentralized FL (DFL) eliminates reliance on a central server, mitigating the single point of failure inherent in the traditional FL paradigm, while introducing deployment challenges on resource-constrained devices. To evaluate real-world applicability, this work designs and deploys a physical testbed using edge devices such as Raspberry Pi and Jetson Nano. The testbed is built upon a DFL training platform, NEBULA, and extends it with a power monitoring module to measure energy consumption during training. Experiments across multiple datasets show that model performance is influenced by the communication topology, with denser topologies leading to better outcomes in DFL settings.

[LG-37] Safety and optimality in learning-based control at low computational cost

链接: https://arxiv.org/abs/2505.08026
作者: Dominik Baumann,Krzysztof Kowalczyk,Cristian R. Rojas,Koen Tiels,Pawel Wachel
类目: Systems and Control (eess.SY); Machine Learning (cs.LG)
*备注: Accepted final version to appear in the IEEE Transactions on Automatic Control

点击查看摘要

Abstract:Applying machine learning methods to physical systems that are supposed to act in the real world requires providing safety guarantees. However, methods that include such guarantees often come at a high computational cost, making them inapplicable to large datasets and embedded devices with low computational power. In this paper, we propose CoLSafe, a computationally lightweight safe learning algorithm whose computational complexity grows sublinearly with the number of data points. We derive both safety and optimality guarantees and showcase the effectiveness of our algorithm on a seven-degrees-of-freedom robot arm.

[LG-38] Dynamical Low-Rank Compression of Neural Networks with Robustness under Adversarial Attacks

链接: https://arxiv.org/abs/2505.08022
作者: Steffen Schotthöfer,H. Lexie Yang,Stefan Schnake
类目: Machine Learning (cs.LG); Numerical Analysis (math.NA)
*备注:

点击查看摘要

Abstract:Deployment of neural networks on resource-constrained devices demands models that are both compact and robust to adversarial inputs. However, compression and adversarial robustness often conflict. In this work, we introduce a dynamical low-rank training scheme enhanced with a novel spectral regularizer that controls the condition number of the low-rank core in each layer. This approach mitigates the sensitivity of compressed models to adversarial perturbations without sacrificing clean accuracy. The method is model- and data-agnostic, computationally efficient, and supports rank adaptivity to automatically compress the network at hand. Extensive experiments across standard architectures, datasets, and adversarial attacks show the regularized networks can achieve over 94% compression while recovering or improving adversarial accuracy relative to uncompressed baselines.
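The flavor of the approach can be sketched with a factored layer W = U diag(s) V^T plus a penalty on the condition number of the core; the exact penalty form and weighting below are our assumptions, not the paper's regularizer.

```python
# Low-rank layer with a condition-number penalty on the diagonal core.
import torch
import torch.nn as nn

class LowRankLinear(nn.Module):
    def __init__(self, d_in, d_out, rank):
        super().__init__()
        self.U = nn.Parameter(torch.randn(d_out, rank) * 0.1)
        self.s = nn.Parameter(torch.ones(rank))          # diagonal low-rank core
        self.V = nn.Parameter(torch.randn(d_in, rank) * 0.1)

    def forward(self, x):
        return x @ self.V * self.s @ self.U.t()          # x V diag(s) U^T

    def cond_penalty(self):
        sv = self.s.abs() + 1e-8
        return sv.max() / sv.min()                       # condition number of the core

layer = LowRankLinear(32, 16, rank=4)
x = torch.randn(5, 32)
loss = layer(x).pow(2).mean() + 1e-3 * layer.cond_penalty()
loss.backward()
print("core condition number:", layer.cond_penalty().item())
```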

[LG-39] A Scalable System to Prove Machine Learning Fairness in Zero-Knowledge

链接: https://arxiv.org/abs/2505.07997
作者: Tianyu Zhang,Shen Dong,O. Deniz Kose,Yanning Shen,Yupeng Zhang
类目: Machine Learning (cs.LG)
*备注: 2025 IEEE Symposium on Security and Privacy (SP). IEEE Computer Society, 2025

点击查看摘要

Abstract:With the rise of machine learning techniques, ensuring the fairness of decisions made by machine learning algorithms has become of great importance in critical applications. However, measuring fairness often requires full access to the model parameters, which compromises the confidentiality of the models. In this paper, we propose a solution using zero-knowledge proofs, which allows the model owner to convince the public that a machine learning model is fair while preserving the secrecy of the model. To circumvent the efficiency barrier of naively proving machine learning inferences in zero-knowledge, our key innovation is a new approach to measure fairness only with model parameters and some aggregated information of the input, but not on any specific dataset. To achieve this goal, we derive new bounds for the fairness of logistic regression and deep neural network models that are tighter and better reflecting the fairness compared to prior work. Moreover, we develop efficient zero-knowledge proof protocols for common computations involved in measuring fairness, including the spectral norm of matrices, maximum, absolute value, and fixed-point arithmetic. We have fully implemented our system, FairZK, that proves machine learning fairness in zero-knowledge. Experimental results show that FairZK is significantly faster than the naive approach and an existing scheme that use zero-knowledge inferences as a subroutine. The prover time is improved by 3.1x-1789x depending on the size of the model and the dataset. FairZK can scale to a large model with 47 million parameters for the first time, and generates a proof for its fairness in 343 seconds. This is estimated to be 4 orders of magnitude faster than existing schemes, which only scale to small models with hundreds to thousands of parameters.

[LG-40] Making Small Language Models Efficient Reasoners: Intervention Supervision Reinforcement

链接: https://arxiv.org/abs/2505.07961
作者: Xuechen Zhang,Zijian Huang,Chenchun Ni,Ziyang Xiong,Jiasi Chen,Samet Oymak
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Recent research enhances language model reasoning by scaling test-time compute via longer chain-of-thought traces. This often improves accuracy but also introduces redundancy and high computational cost, especially for small language models distilled with supervised fine-tuning (SFT). In this work, we propose new algorithms to improve token-efficient reasoning with small-scale models by effectively trading off accuracy and computation. We first show that the post-SFT model fails to determine the optimal stopping point of the reasoning process, resulting in verbose and repetitive outputs. Verbosity also significantly varies across wrong vs correct responses. To address these issues, we propose two solutions: (1) Temperature scaling (TS) to control the stopping point for the thinking phase and thereby trace length, and (2) TLDR: a length-regularized reinforcement learning method based on GRPO that facilitates multi-level trace length control (e.g. short, medium, long reasoning). Experiments on four reasoning benchmarks, MATH500, AMC, AIME24 and OlympiadBench, demonstrate that TS is highly effective compared to s1’s budget forcing approach and TLDR significantly improves token efficiency by about 50% with minimal to no accuracy loss over the SFT baseline. Moreover, TLDR also facilitates flexible control over the response length, offering a practical and effective solution for token-efficient reasoning in small models. Ultimately, our work reveals the importance of stopping time control, highlights shortcomings of pure SFT, and provides effective algorithmic recipes.

[LG-41] Symbolic Regression with Multimodal Large Language Models and Kolmogorov Arnold Networks

链接: https://arxiv.org/abs/2505.07956
作者: Thomas R. Harvey,Fabian Ruehle,Cristofero S. Fraser-Taliente,James Halverson
类目: Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE); Symbolic Computation (cs.SC)
*备注:

点击查看摘要

Abstract:We present a novel approach to symbolic regression using vision-capable large language models (LLMs) and the ideas behind Google DeepMind’s Funsearch. The LLM is given a plot of a univariate function and tasked with proposing an ansatz for that function. The free parameters of the ansatz are fitted using standard numerical optimisers, and a collection of such ansätze make up the population of a genetic algorithm. Unlike other symbolic regression techniques, our method does not require the specification of a set of functions to be used in regression, but with appropriate prompt engineering, we can arbitrarily condition the generative step. By using Kolmogorov Arnold Networks (KANs), we demonstrate that ``univariate is all you need’’ for symbolic regression, and extend this method to multivariate functions by learning the univariate function on each edge of a trained KAN. The combined expression is then simplified by further processing with a language model.

[LG-42] On-Device Crack Segmentation for Edge Structural Health Monitoring

链接: https://arxiv.org/abs/2505.07915
作者: Yuxuan Zhang,Ye Xu,Luciano Sebastian Martinez-Rau,Quynh Nguyen Phuong Vu,Bengt Oelmann,Sebastian Bader
类目: Machine Learning (cs.LG)
*备注: This paper has been accepted for the 2025 IEEE Sensors Applications Symposium (SAS)

点击查看摘要

Abstract:Crack segmentation can play a critical role in Structural Health Monitoring (SHM) by enabling accurate identification of crack size and location, which allows to monitor structural damages over time. However, deploying deep learning models for crack segmentation on resource-constrained microcontrollers presents significant challenges due to limited memory, computational power, and energy resources. To address these challenges, this study explores lightweight U-Net architectures tailored for TinyML applications, focusing on three optimization strategies: filter number reduction, network depth reduction, and the use of Depthwise Separable Convolutions (DWConv2D). Our results demonstrate that reducing convolution kernels and network depth significantly reduces RAM and Flash requirements, and inference times, albeit with some accuracy trade-offs. Specifically, by reducing the filter number to 25%, the network depth to four blocks, and utilizing depthwise convolutions, a good compromise between segmentation performance and resource consumption is achieved. This makes the network particularly suitable for low-power TinyML applications. This study not only advances TinyML-based crack segmentation but also provides the possibility for energy-autonomous edge SHM systems.
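The DWConv2D ingredient is standard: a depthwise 3x3 convolution followed by a 1x1 pointwise convolution, which cuts parameters several-fold relative to a full convolution. A minimal PyTorch comparison:

```python
# Depthwise separable convolution vs. a standard convolution, by parameter count.
import torch
import torch.nn as nn

class DepthwiseSeparableConv(nn.Module):
    def __init__(self, c_in, c_out):
        super().__init__()
        self.depthwise = nn.Conv2d(c_in, c_in, kernel_size=3, padding=1, groups=c_in)
        self.pointwise = nn.Conv2d(c_in, c_out, kernel_size=1)

    def forward(self, x):
        return self.pointwise(self.depthwise(x))

std = nn.Conv2d(32, 64, 3, padding=1)
dws = DepthwiseSeparableConv(32, 64)
count = lambda m: sum(p.numel() for p in m.parameters())
print(count(std), "vs", count(dws), "parameters")   # roughly 7-8x fewer here
```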

[LG-43] LECTOR: Summarizing E-book Reading Content for Personalized Student Support

链接: https://arxiv.org/abs/2505.07898
作者: Erwin Daniel López Zapata,Cheng Tang,Valdemar Švábenský,Fumiya Okubo,Atsushi Shimada
类目: Computers and Society (cs.CY); Machine Learning (cs.LG)
*备注: Published open-access in the International Journal of Artificial Intelligence in Education (IJAIED), see this https URL

点击查看摘要

Abstract:Educational e-book platforms provide valuable information to teachers and researchers through two main sources: reading activity data and reading content data. While reading activity data is commonly used to analyze learning strategies and predict low-performing students, reading content data is often overlooked in these analyses. To address this gap, this study proposes LECTOR (Lecture slides and Topic Relationships), a model that summarizes information from reading content in a format that can be easily integrated with reading activity data. Our first experiment compared LECTOR to representative Natural Language Processing (NLP) models in extracting key information from 2,255 lecture slides, showing an average improvement of 5% in F1-score. These results were further validated through a human evaluation involving 28 students, which showed an average improvement of 21% in F1-score over a model predominantly used in current educational tools. Our second experiment compared reading preferences extracted by LECTOR with traditional reading activity data in predicting low-performing students using 600,712 logs from 218 students. The results showed a tendency to improve the predictive performance by integrating LECTOR. Finally, we proposed examples showing the potential application of the reading preferences extracted by LECTOR in designing personalized interventions for students.

[LG-44] EnvCDiff: Joint Refinement of Environmental Information and Channel Fingerprints via Conditional Generative Diffusion Model

链接: https://arxiv.org/abs/2505.07894
作者: Zhenzhou Jin,Li You,Xiang-Gen Xia,Xiqi Gao
类目: Networking and Internet Architecture (cs.NI); Emerging Technologies (cs.ET); Machine Learning (cs.LG); Signal Processing (eess.SP); Statistics Theory (math.ST)
*备注: 6 pages, 2 figures

点击查看摘要

Abstract:The paradigm shift from environment-unaware communication to intelligent environment-aware communication is expected to facilitate the acquisition of channel state information for future wireless communications. Channel Fingerprint (CF), as an emerging enabling technology for environment-aware communication, provides channel-related knowledge for potential locations within the target communication area. However, due to the limited availability of practical devices for sensing environmental information and measuring channel-related knowledge, most of the acquired environmental information and CF are coarse-grained, insufficient to guide the design of wireless transmissions. To address this, this paper proposes a deep conditional generative learning approach, namely a customized conditional generative diffusion model (CDiff). The proposed CDiff simultaneously refines environmental information and CF, reconstructing a fine-grained CF that incorporates environmental information, referred to as EnvCF, from its coarse-grained counterpart. Experimental results show that the proposed approach significantly improves the performance of EnvCF construction compared to the baselines.

[LG-45] Channel Fingerprint Construction for Massive MIMO: A Deep Conditional Generative Approach

链接: https://arxiv.org/abs/2505.07893
作者: Zhenzhou Jin,Li You,Xudong Li,Zhen Gao,Yuanwei Liu,Xiang-Gen Xia,Xiqi Gao
类目: Networking and Internet Architecture (cs.NI); Machine Learning (cs.LG); Signal Processing (eess.SP); Probability (math.PR); Statistics Theory (math.ST)
*备注: 15 pages, 7 figures

点击查看摘要

Abstract:Accurate channel state information (CSI) acquisition for massive multiple-input multiple-output (MIMO) systems is essential for future mobile communication networks. Channel fingerprint (CF), also referred to as channel knowledge map, is a key enabler for intelligent environment-aware communication and can facilitate CSI acquisition. However, due to the cost limitations of practical sensing nodes and test vehicles, the resulting CF is typically coarse-grained, making it insufficient for wireless transceiver design. In this work, we introduce the concept of CF twins and design a conditional generative diffusion model (CGDM) with strong implicit prior learning capabilities as the computational core of the CF twin to establish the connection between coarse- and fine-grained CFs. Specifically, we employ a variational inference technique to derive the evidence lower bound (ELBO) for the log-marginal distribution of the observed fine-grained CF conditioned on the coarse-grained CF, enabling the CGDM to learn the complicated distribution of the target data. During the denoising neural network optimization, the coarse-grained CF is introduced as side information to accurately guide the conditioned generation of the CGDM. To make the proposed CGDM lightweight, we further leverage the additivity of network layers and introduce a one-shot pruning approach along with a multi-objective knowledge distillation technique. Experimental results show that the proposed approach exhibits significant improvement in reconstruction performance compared to the baselines. Additionally, zero-shot testing on reconstruction tasks with different magnification factors further demonstrates the scalability and generalization ability of the proposed approach.

[LG-46] VoI-Driven Joint Optimization of Control and Communication in Vehicular Digital Twin Network

链接: https://arxiv.org/abs/2505.07892
作者: Lei Lei,Kan Zheng,Jie Mei,Xuemin(Sherman)Shen
类目: Networking and Internet Architecture (cs.NI); Machine Learning (cs.LG); Systems and Control (eess.SY)
*备注:

点击查看摘要

Abstract:The vision of sixth-generation (6G) wireless networks paves the way for the seamless integration of digital twins into vehicular networks, giving rise to a Vehicular Digital Twin Network (VDTN). The large amount of computing resources as well as the massive amount of spatial-temporal data in the Digital Twin (DT) domain can be utilized to enhance the communication and control performance of Internet of Vehicles (IoV) systems. In this article, we first propose the architecture of VDTN, emphasizing key modules that center on functions related to the joint optimization of control and communication. We then delve into the intricacies of the multi-timescale decision process inherent in joint optimization in VDTN, specifically investigating the dynamic interplay between control and communication. To facilitate the joint optimization, we define two Value of Information (VoI) concepts rooted in control performance. Subsequently, utilizing VoI as a bridge between control and communication, we introduce a novel joint optimization framework, which involves iterative processing of two Deep Reinforcement Learning (DRL) modules corresponding to control and communication to derive the optimal policy. Finally, we conduct simulations of the proposed framework applied to a platoon scenario to demonstrate its effectiveness in ensuring joint control and communication performance.

[LG-47] PosterO: Structuring Layout Trees to Enable Language Models in Generalized Content-Aware Layout Generation CVPR2025

链接: https://arxiv.org/abs/2505.07843
作者: HsiaoYuan Hsu,Yuxin Peng
类目: Graphics (cs.GR); Machine Learning (cs.LG)
*备注: Accepted to CVPR 2025. Code and dataset are available at this https URL

点击查看摘要

Abstract:In poster design, content-aware layout generation is crucial for automatically arranging visual-textual elements on the given image. With limited training data, existing work focused on image-centric enhancement. However, this neglects the diversity of layouts and fails to cope with shape-variant elements or diverse design intents in generalized settings. To this end, we proposed a layout-centric approach that leverages layout knowledge implicit in large language models (LLMs) to create posters for omnifarious purposes, hence the name PosterO. Specifically, it structures layouts from datasets as trees in SVG language by universal shape, design intent vectorization, and hierarchical node representation. Then, it applies LLMs during inference to predict new layout trees by in-context learning with intent-aligned example selection. After layout trees are generated, we can seamlessly realize them into poster designs by editing the chat with LLMs. Extensive experimental results have demonstrated that PosterO can generate visually appealing layouts for given images, achieving new state-of-the-art performance across various benchmarks. To further explore PosterO’s abilities under the generalized settings, we built PStylish7, the first dataset with multi-purpose posters and various-shaped elements, further offering a challenging test for advanced research.

[LG-48] Token Communication-Driven Multimodal Large Models in Resource-Constrained Multiuser Networks

链接: https://arxiv.org/abs/2505.07841
作者: Junhe Zhang,Wanli Ni,Pengwei Wang,Dongyu Wang
类目: Networking and Internet Architecture (cs.NI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:The proliferation of intelligent applications at the wireless edge, alongside the exponential growth of multimodal data, poses challenges for deploying multimodal large models (MLMs) in resource-constrained networks. These constraints manifest as limited bandwidth, computational capacity, and stringent latency requirements, particularly under low signal-to-noise ratio (SNR) conditions. To overcome these limitations, we propose a token communication paradigm that facilitates the decentralized deployment of MLMs across user devices and edge infrastructure (e.g., base stations). In this paradigm, task-relevant tokens are extracted from multimodal inputs and serve as the primary medium for communication between distributed model components. To align semantics and optimize transmission efficiency, we propose a dual-pronged approach: 1) We design a contrastive split fine-tuning method to project heterogeneous modalities into a shared feature space, enabling seamless interaction between model components while preserving modal-specific semantics. 2) We employ a lightweight compression technique to reduce the size of transmitted tokens, minimizing bandwidth consumption without sacrificing task-critical information. The proposed framework integrates collaborative fine-tuning of both the foundation model and multimodal transceivers, ensuring that token generation and utilization are tailored to specific downstream tasks. Simulation experiments conducted under different SNR conditions demonstrate that our method results in a 13.7% improvement in test accuracy. Furthermore, our approach exhibits quicker convergence rates, even with reduced token lengths, highlighting the promise of token communication for facilitating more scalable and resilient MLM implementations in practical multiuser networks.

[LG-49] ML-Enabled Eavesdropper Detection in Beyond 5G IIoT Networks

链接: https://arxiv.org/abs/2505.07837
作者: Maria-Lamprini A. Bartsioka,Ioannis A. Bartsiokas,Panagiotis K. Gkonis,Dimitra I. Kaklamani,Iakovos S. Venieris
类目: Networking and Internet Architecture (cs.NI); Machine Learning (cs.LG)
*备注: 6 pages, 5 figures, Accepted in IEEE ISCC 2025

点击查看摘要

Abstract:Advanced fifth generation (5G) and beyond (B5G) communication networks have revolutionized wireless technologies, supporting ultra-high data rates, low latency, and massive connectivity. However, they also introduce vulnerabilities, particularly in decentralized Industrial Internet of Things (IIoT) environments. Traditional cryptographic methods struggle with scalability and complexity, leading researchers to explore Artificial Intelligence (AI)-driven physical layer techniques for secure communications. In this context, this paper focuses on the utilization of Machine and Deep Learning (ML/DL) techniques to tackle the common problem of eavesdropping detection. To this end, a simulated industrial B5G heterogeneous wireless network is used to evaluate the performance of various ML/DL models, including Random Forests (RF), Deep Convolutional Neural Networks (DCNN), and Long Short-Term Memory (LSTM) networks. These models classify users as either legitimate or malicious based on channel state information (CSI), position data, and transmission power. According to the presented numerical results, DCNN and RF models achieve a detection accuracy approaching 100% in identifying eavesdroppers with zero false alarms. In general, this work underlines the great potential of combining AI and Physical Layer Security (PLS) for next-generation wireless networks in order to address evolving security threats.

[LG-50] PCS-UQ: Uncertainty Quantification via the Predictability-Computability-Stability Framework

链接: https://arxiv.org/abs/2505.08784
作者: Abhineet Agarwal,Michael Xiao,Rebecca Barter,Omer Ronen,Boyu Fan,Bin Yu
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Statistics Theory (math.ST); Methodology (stat.ME)
*备注:

点击查看摘要

Abstract:As machine learning (ML) models are increasingly deployed in high-stakes domains, trustworthy uncertainty quantification (UQ) is critical for ensuring the safety and reliability of these models. Traditional UQ methods rely on specifying a true generative model and are not robust to misspecification. On the other hand, conformal inference allows for arbitrary ML models but does not consider model selection, which leads to large interval sizes. We tackle these drawbacks by proposing a UQ method based on the predictability, computability, and stability (PCS) framework for veridical data science proposed by Yu and Kumbier. Specifically, PCS-UQ addresses model selection by using a prediction check to screen out unsuitable models. PCS-UQ then fits these screened algorithms across multiple bootstraps to assess inter-sample variability and algorithmic instability, enabling more reliable uncertainty estimates. Further, we propose a novel calibration scheme that improves local adaptivity of our prediction sets. Experiments across 17 regression and 6 classification datasets show that PCS-UQ achieves the desired coverage and reduces width over conformal approaches by \approx 20%. Further, our local analysis shows PCS-UQ often achieves target coverage across subgroups while conformal methods fail to do so. For large deep-learning models, we propose computationally efficient approximation schemes that avoid the expensive multiple bootstrap trainings of PCS-UQ. Across three computer vision benchmarks, PCS-UQ reduces prediction set size over conformal methods by 20%. Theoretically, we show a modified PCS-UQ algorithm is a form of split conformal inference and achieves the desired coverage with exchangeable data.
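A deliberately simplified sketch of the PCS recipe (screen candidate models with a prediction check, bootstrap the survivors, pool their test predictions into percentile intervals); the screening threshold, bootstrap count, and interval levels are our choices, not PCS-UQ's calibrated scheme.

```python
# Simplified PCS-style UQ: prediction-check screening + bootstrap intervals.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=400, noise=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Prediction check: keep models whose held-out R^2 clears a threshold (ours: 0.5).
candidates = [Ridge(), RandomForestRegressor(n_estimators=50, random_state=0)]
screened = [m for m in candidates
            if 1 - np.mean((y_te - m.fit(X_tr, y_tr).predict(X_te))**2) / np.var(y_te) > 0.5]

rng = np.random.default_rng(0)
preds = []
for m in screened:
    for _ in range(20):                                # bootstrap refits per model
        idx = rng.integers(0, len(X_tr), len(X_tr))
        preds.append(m.fit(X_tr[idx], y_tr[idx]).predict(X_te))
lo, hi = np.percentile(preds, [5, 95], axis=0)
print("empirical coverage:", np.mean((y_te >= lo) & (y_te <= hi)))
```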

[LG-51] Generative Molecular Design with Steerable and Granular Synthesizability Control

链接: https://arxiv.org/abs/2505.08774
作者: Jeff Guo,Víctor Sabanza-Gil,Zlatko Jončev,Jeremy S. Luterbacher,Philippe Schwaller
类目: Biomolecules (q-bio.BM); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Synthesizability in small molecule generative design remains a bottleneck. Existing works that do consider synthesizability can output predicted synthesis routes for generated molecules. However, there has been minimal attention in addressing the ease of synthesis and enabling flexibility to incorporate desired reaction constraints. In this work, we propose a small molecule generative design framework that enables steerable and granular synthesizability control. Generated molecules satisfy arbitrary multi-parameter optimization objectives with predicted synthesis routes containing pre-defined allowed reactions, while optionally avoiding others. One can also enforce that all reactions belong to a pre-defined set. We show the capability to mix-and-match these reaction constraints across the most common medicinal chemistry transformations. Next, we show how our framework can be used to valorize industrial byproducts towards de novo optimized molecules. Going further, we demonstrate how granular control over synthesizability constraints can loosely mimic virtual screening of ultra-large make-on-demand libraries. Using only a single GPU, we generate and dock 15k molecules to identify promising candidates in Freedom 4.0 constituting 142B make-on-demand molecules (assessing only 0.00001% of the library). Generated molecules satisfying the reaction constraints have 90% exact match rate. Lastly, we benchmark our framework against recent synthesizability-constrained generative models and demonstrate the highest sample efficiency even when imposing the additional constraint that all molecules must be synthesizable from a single reaction type. The main theme is demonstrating that a pre-trained generalist molecular generative model can be incentivized to generate property-optimized small molecules under challenging synthesizability constraints through reinforcement learning.

[LG-52] Contrastive Normalizing Flows for Uncertainty-Aware Parameter Estimation

链接: https://arxiv.org/abs/2505.08709
作者: Ibrahim Elsharkawy,Yonatan Kahn
类目: Data Analysis, Statistics and Probability (physics.data-an); Machine Learning (cs.LG); High Energy Physics - Experiment (hep-ex); High Energy Physics - Phenomenology (hep-ph)
*备注: 9 + 8 pages, 2 tables, 10 figures; Contribution to the FAIR Universe Higgs Uncertainty Challenge, winning first place ex aequo

点击查看摘要

Abstract:Estimating physical parameters from data is a crucial application of machine learning (ML) in the physical sciences. However, systematic uncertainties, such as detector miscalibration, induce data distribution distortions that can erode statistical precision. In both high-energy physics (HEP) and broader ML contexts, achieving uncertainty-aware parameter estimation under these domain shifts remains an open problem. In this work, we address this challenge of uncertainty-aware parameter estimation for a broad set of tasks critical for HEP. We introduce a novel approach based on Contrastive Normalizing Flows (CNFs), which achieves top performance on the HiggsML Uncertainty Challenge dataset. Building on the insight that a binary classifier can approximate the model parameter likelihood ratio, we address the practical limitations of expressivity and the high cost of simulating high-dimensional parameter grids by embedding data and parameters in a learned CNF mapping. This mapping yields a tunable contrastive distribution that enables robust classification under shifted data distributions. Through a combination of theoretical analysis and empirical evaluations, we demonstrate that CNFs, when coupled with a classifier and established frequentist techniques, provide principled parameter estimation and uncertainty quantification through classification that is robust to data distribution distortions.

[LG-53] Continuous Temporal Learning of Probability Distributions via Neural ODEs with Applications in Continuous Glucose Monitoring Data

链接: https://arxiv.org/abs/2505.08698
作者: Antonio Álvarez-López,Marcos Matabuena
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Dynamical Systems (math.DS); Applications (stat.AP); Methodology (stat.ME)
*备注:

点击查看摘要

Abstract:Modeling the continuous-time dynamics of probability distributions from time-dependent data samples is a fundamental problem in many fields, including digital health. The aim is to analyze how the distribution of a biomarker, such as glucose, evolves over time and how these changes may reflect the progression of chronic diseases such as diabetes. In this paper, we propose a novel probabilistic model based on a mixture of Gaussian distributions to capture how samples from a continuous-time stochastic process evolve over time. To model potential distribution shifts over time, we introduce a time-dependent function parameterized by a Neural Ordinary Differential Equation (Neural ODE) and estimate it non-parametrically using the Maximum Mean Discrepancy (MMD). The proposed model is highly interpretable, detects subtle temporal shifts, and remains computationally efficient. Through simulation studies, we show that it performs competitively in terms of estimation accuracy against state-of-the-art, less interpretable methods such as normalized gradient flows and non-parametric kernel density estimators. Finally, we demonstrate the utility of our method on digital clinical-trial data, showing how the interventions alter the time-dependent distribution of glucose levels and enabling a rigorous comparison of control and treatment groups from novel mathematical and clinical perspectives.
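The MMD ingredient is easy to make concrete: below is the standard unbiased MMD^2 estimator with a Gaussian kernel, shown for one-dimensional samples (the bandwidth is an illustrative choice).

```python
# Unbiased MMD^2 estimator with a Gaussian kernel (1D samples for brevity).
import numpy as np

def mmd2_unbiased(x, y, bandwidth=1.0):
    k = lambda a, b: np.exp(-(a[:, None] - b[None, :]) ** 2 / (2 * bandwidth ** 2))
    kxx, kyy, kxy = k(x, x), k(y, y), k(x, y)
    n, m = len(x), len(y)
    return ((kxx.sum() - np.trace(kxx)) / (n * (n - 1))     # drop diagonal terms
            + (kyy.sum() - np.trace(kyy)) / (m * (m - 1))
            - 2 * kxy.mean())

rng = np.random.default_rng(0)
print(mmd2_unbiased(rng.normal(0, 1, 500), rng.normal(0.3, 1, 500)))
```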

[LG-54] Uncertainty-Aware Surrogate-based Amortized Bayesian Inference for Computationally Expensive Models

链接: https://arxiv.org/abs/2505.08683
作者: Stefania Scheurer,Philipp Reiser,Tim Brünnette,Wolfgang Nowak,Anneli Guthke,Paul-Christian Bürkner
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Methodology (stat.ME)
*备注: 16 pages, 7 figures

点击查看摘要

Abstract:Bayesian inference typically relies on a large number of model evaluations to estimate posterior distributions. Established methods like Markov Chain Monte Carlo (MCMC) and Amortized Bayesian Inference (ABI) can become computationally challenging. While ABI enables fast inference after training, generating sufficient training data still requires thousands of model simulations, which is infeasible for expensive models. Surrogate models offer a solution by providing approximate simulations at a lower computational cost, allowing the generation of large data sets for training. However, the introduced approximation errors and uncertainties can lead to overconfident posterior estimates. To address this, we propose Uncertainty-Aware Surrogate-based Amortized Bayesian Inference (UA-SABI) - a framework that combines surrogate modeling and ABI while explicitly quantifying and propagating surrogate uncertainties through the inference pipeline. Our experiments show that this approach enables reliable, fast, and repeated Bayesian inference for computationally expensive models, even under tight time constraints.

[LG-55] neuralGAM: An R Package for Fitting Generalized Additive Neural Networks

链接: https://arxiv.org/abs/2505.08610
作者: Ines Ortega-Fernandez,Marta Sestelo
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Computation (stat.CO); Methodology (stat.ME)
*备注:

点击查看摘要

Abstract:Nowadays, Neural Networks are considered one of the most effective methods for various tasks such as anomaly detection, computer-aided disease detection, or natural language processing. However, these networks suffer from the ``black-box’’ problem which makes it difficult to understand how they make decisions. In order to solve this issue, an R package called neuralGAM is introduced. This package implements a Neural Network topology based on Generalized Additive Models, allowing one to fit an independent Neural Network to estimate the contribution of each feature to the output variable, yielding a highly accurate and interpretable Deep Learning model. The neuralGAM package provides a flexible framework for training Generalized Additive Neural Networks, which does not impose any restrictions on the Neural Network architecture. We illustrate the use of the neuralGAM package in both synthetic and real data examples.

[LG-56] Automated Model-Free Sorting of Single-Molecule Fluorescence Events Using a Deep Learning Based Hidden-State Model

链接: https://arxiv.org/abs/2505.08608
作者: Wenqi Zeng,Shuqi Zhou,Yuan Yao,Chunlai Chen
类目: Quantitative Methods (q-bio.QM); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Single-molecule fluorescence assays enable high-resolution analysis of biomolecular dynamics, but traditional analysis pipelines are labor-intensive and rely on users’ experience, limiting scalability and reproducibility. Recent deep learning models have automated aspects of data processing, yet many still require manual thresholds, complex architectures, or extensive labeled data. Therefore, we present DASH, a fully streamlined architecture for trace classification, state assignment, and automatic sorting that requires no user input. DASH demonstrates robust performance across users and experimental conditions both in equilibrium and non-equilibrium systems such as Cas12a-mediated DNA cleavage. This paper proposes a novel strategy for the automatic and detailed sorting of single-molecule fluorescence events. The dynamic cleavage process of Cas12a is used as an example to provide a comprehensive analysis. This approach is crucial for studying biokinetic structural changes at the single-molecule level.

[LG-57] Building-Block Aware Generative Modeling for 3D Crystals of Metal Organic Frameworks

Link: https://arxiv.org/abs/2505.08531
Authors: Chenru Duan, Aditya Nandy, Sizhan Liu, Yuanqi Du, Liu He, Yi Qu, Haojun Jia, Jin-Hu Dou
Subjects: Chemical Physics (physics.chem-ph); Materials Science (cond-mat.mtrl-sci); Machine Learning (cs.LG)
Comments:

Abstract:Metal-organic frameworks (MOFs) marry inorganic nodes, organic edges, and topological nets into programmable porous crystals, yet their astronomical design space defies brute-force synthesis. Generative modeling holds ultimate promise, but existing models either recycle known building blocks or are restricted to small unit cells. We introduce Building-Block-Aware MOF Diffusion (BBA MOF Diffusion), an SE(3)-equivariant diffusion model that learns 3D all-atom representations of individual building blocks while encoding crystallographic topological nets explicitly. Trained on the CoRE-MOF database, BBA MOF Diffusion readily samples MOFs with unit cells containing 1000 atoms, with geometric validity, novelty, and diversity mirroring experimental databases. Its native building-block representation produces unprecedented metal nodes and organic edges, expanding accessible chemical space by orders of magnitude. One high-scoring [Zn(1,4-TDC)(EtOH)2] MOF predicted by the model was synthesized; powder X-ray diffraction, thermogravimetric analysis, and N2 sorption confirm its structural fidelity. BBA-Diff thus furnishes a practical pathway to synthesizable and high-performing MOFs.

[LG-58] SPP-SBL: Space-Power Prior Sparse Bayesian Learning for Block Sparse Recovery

Link: https://arxiv.org/abs/2505.08518
Authors: Yanhao Zhang, Zhihan Zhu, Yong Xia
Subjects: Optimization and Control (math.OC); Machine Learning (cs.LG)
Comments: 12 pages, 6 figures, 4 tables

Abstract:The recovery of block-sparse signals with unknown structural patterns remains a fundamental challenge in structured sparse signal reconstruction. By proposing a variance transformation framework, this paper unifies existing pattern-based block sparse Bayesian learning methods, and introduces a novel space power prior based on undirected graph models to adaptively capture the unknown patterns of block-sparse signals. By combining the EM algorithm with high-order equation root-solving, we develop a new structured sparse Bayesian learning method, SPP-SBL, which effectively addresses the open problem of space coupling parameter estimation in pattern-based methods. We further demonstrate that learning the relative values of space coupling parameters is key to capturing unknown block-sparse patterns and improving recovery accuracy. Experiments validate that SPP-SBL successfully recovers various challenging structured sparse signals (e.g., chain-structured signals and multi-pattern sparse signals) and real-world multi-modal structured sparse signals (images, audio), showing significant advantages in recovery accuracy across multiple metrics.

[LG-59] Understanding molecular ratios in the carbon and oxygen poor outer Milky Way with interpretable machine learning

Link: https://arxiv.org/abs/2505.08410
Authors: Gijs Vermariën, Serena Viti, Johannes Heyl, Francesco Fontani
Subjects: Astrophysics of Galaxies (astro-ph.GA); Machine Learning (cs.LG)
Comments: Accepted for publication in A&A, Sect. 6: Interstellar and circumstellar matter

Abstract:Context. The outer Milky Way has a lower metallicity than our solar neighbourhood, yet many molecules are still detected in the region. Molecular line ratios can serve as probes to better understand the chemistry and physics of these regions. Aims. We use interpretable machine learning to study 9 different molecular ratios, helping us understand the forward connection between the physics of these environments and the carbon and oxygen chemistries. Methods. Using a large grid of astrochemical models generated with UCLCHEM, we study the properties of molecular clouds with low initial oxygen and carbon abundances. We first try to understand the line ratios using a classical analysis. We then move on to interpretable machine learning, namely Shapley Additive Explanations (SHAP), to understand the higher-order dependencies of the ratios over the entire parameter grid. Lastly, we use the Uniform Manifold Approximation and Projection technique (UMAP) as a reduction method to create intuitive groupings of models. Results. We find that the parameter space is well covered by the line ratios, allowing us to investigate all input parameters. SHAP analysis shows that the temperature and density are the most important features, but the carbon and oxygen abundances are important in parts of the parameter space. Lastly, we find that we can group different types of ratios using UMAP. Conclusions. We show that the chosen ratios are mostly sensitive to changes in the initial carbon abundance, together with the temperature and density. The CN/HCN and HNC/HCN ratios in particular are shown to be sensitive to the initial carbon abundance, making them excellent probes of this parameter. Of the ratios, only CS/SO shows sensitivity to the oxygen abundance.
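
As an illustration of the SHAP step only (toy data in place of the UCLCHEM model grid; the feature stand-ins below are assumptions), one can fit a regressor from cloud parameters to a line ratio and inspect global feature importances:

```python
import numpy as np
import shap
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
# stand-ins for temperature, density, initial C abundance, initial O abundance
X = rng.uniform(size=(2000, 4))
ratio = 2.0 * X[:, 0] + X[:, 1] + 0.3 * X[:, 2] + rng.normal(0, 0.05, 2000)

model = RandomForestRegressor(n_estimators=200, random_state=0).fit(X, ratio)
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X[:200])
print(np.abs(shap_values).mean(axis=0))  # mean |SHAP| = global importance per parameter
```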

[LG-60] Learning Treatment Allocations with Risk Control Under Partial Identifiability

Link: https://arxiv.org/abs/2505.08378
Authors: Sofia Ek, Dave Zachariah
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
Comments:

Abstract:Learning beneficial treatment allocations for a patient population is an important problem in precision medicine. Many treatments come with adverse side effects that are not commensurable with their potential benefits. Patients who do not receive benefits after such treatments are thereby subjected to unnecessary harm. This is a 'treatment risk' that we aim to control when learning beneficial allocations. The constrained learning problem is challenged by the fact that the treatment risk is not in general identifiable using either randomized trial or observational data. We propose a certifiable learning method that controls the treatment risk with finite samples in the partially identified setting. The method is illustrated using both simulated and real data.

[LG-61] Iteratively reweighted kernel machines efficiently learn sparse functions

Link: https://arxiv.org/abs/2505.08277
Authors: Libin Zhu, Damek Davis, Dmitriy Drusvyatskiy, Maryam Fazel
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Optimization and Control (math.OC); Statistics Theory (math.ST)
Comments:

Abstract:The impressive practical performance of neural networks is often attributed to their ability to learn low-dimensional data representations and hierarchical structure directly from data. In this work, we argue that these two phenomena are not unique to neural networks, and can be elicited from classical kernel methods. Namely, we show that the derivative of the kernel predictor can detect the influential coordinates with low sample complexity. Moreover, by iteratively using the derivatives to reweight the data and retrain kernel machines, one is able to efficiently learn hierarchical polynomials with finite leap complexity. Numerical experiments illustrate the developed theory.
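
A toy rendering of the loop (our own sketch, not the paper's exact algorithm): fit a kernel ridge machine, estimate each coordinate's average squared derivative by finite differences, and reweight the inputs accordingly before refitting.

```python
import numpy as np
from sklearn.kernel_ridge import KernelRidge

rng = np.random.default_rng(0)
n, d = 400, 10
X = rng.normal(size=(n, d))
y = X[:, 0] * X[:, 1]             # sparse target: only 2 of 10 coordinates matter

w = np.ones(d)                    # coordinate weights
for _ in range(5):
    model = KernelRidge(kernel="rbf", gamma=1.0 / d, alpha=1e-3).fit(X * w, y)
    base = model.predict(X * w)
    grads = np.zeros(d)
    eps = 1e-3
    for j in range(d):            # average squared derivative per coordinate
        Xp = X.copy()
        Xp[:, j] += eps
        grads[j] = np.mean(((model.predict(Xp * w) - base) / eps) ** 2)
    w = np.sqrt(grads / grads.max())  # influential coordinates keep weight near 1
print(np.round(w, 2))             # weights concentrate on coordinates 0 and 1
```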

[LG-62] Lie Group Symmetry Discovery and Enforcement Using Vector Fields

Link: https://arxiv.org/abs/2505.08219
Authors: Ben Shaw, Sasidhar Kunapuli, Abram Magner, Kevin R. Moon
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
Comments:

Abstract:Symmetry-informed machine learning can exhibit advantages over machine learning which fails to account for symmetry. Additionally, recent attention has been given to continuous symmetry discovery using vector fields which serve as infinitesimal generators for Lie group symmetries. In this paper, we extend the notion of non-affine symmetry discovery to functions defined by neural networks. We further extend work in this area by introducing symmetry enforcement of smooth models using vector fields. Finally, we extend work on symmetry discovery using vector fields by providing both theoretical and experimental material on the restriction of the symmetry search space to infinitesimal isometries.

[LG-63] SIM-Shapley: A Stable and Computationally Efficient Approach to Shapley Value Approximation

Link: https://arxiv.org/abs/2505.08198
Authors: Wangxuan Fan, Siqi Li, Doudou Zhou, Yohei Okada, Chuan Hong, Molei Liu, Nan Liu
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
Comments: 21 pages, 6 figures, 5 tables

Abstract:Explainable artificial intelligence (XAI) is essential for trustworthy machine learning (ML), particularly in high-stakes domains such as healthcare and finance. Shapley value (SV) methods provide a principled framework for feature attribution in complex models but incur high computational costs, limiting their scalability in high-dimensional settings. We propose Stochastic Iterative Momentum for Shapley Value Approximation (SIM-Shapley), a stable and efficient SV approximation method inspired by stochastic optimization. We analyze variance theoretically, prove linear Q-convergence, and demonstrate improved empirical stability and low bias in practice on real-world datasets. In our numerical experiments, SIM-Shapley reduces computation time by up to 85% relative to state-of-the-art baselines while maintaining comparable feature attribution quality. Beyond feature attribution, our stochastic mini-batch iterative framework extends naturally to a broader class of sample average approximation problems, offering a new avenue for improving computational efficiency with stability guarantees. Code is publicly available at this https URL.
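
The flavor of the approach can be sketched with a momentum-smoothed permutation-sampling estimator (a simplified stand-in for SIM-Shapley's actual update, with hypothetical parameter choices):

```python
import numpy as np

def shapley_momentum(value_fn, d, n_iter=500, batch=8, beta=0.9, seed=0):
    """Permutation-sampling Shapley estimate smoothed with a momentum average."""
    rng = np.random.default_rng(seed)
    phi = np.zeros(d)
    for _ in range(n_iter):
        est = np.zeros(d)
        for _ in range(batch):                 # mini-batch of sampled permutations
            members = np.zeros(d, dtype=bool)
            prev = value_fn(members)
            for j in rng.permutation(d):
                members[j] = True
                v = value_fn(members)
                est[j] += v - prev             # marginal contribution of feature j
                prev = v
        phi = beta * phi + (1 - beta) * est / batch   # momentum update
    return phi

# toy cooperative game: value of a coalition = sum of its members' weights,
# whose exact Shapley values are the weights themselves
weights = np.array([3.0, 1.0, 0.0, 2.0])
print(shapley_momentum(lambda s: weights[s].sum(), d=4))
```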

[LG-64] Enhancing the Efficiency of Complex Systems Crystal Structure Prediction by Active Learning Guided Machine Learning Potential

Link: https://arxiv.org/abs/2505.08159
Authors: Jiaxiang Li, Junwei Feng, Jie Luo, Bowen Jiang, Xiangyu Zheng, Jian Lv, Keith Butler, Hanyu Liu, Congwei Xie, Yu Xie, Yanming Ma
Subjects: Materials Science (cond-mat.mtrl-sci); Machine Learning (cs.LG)
Comments:

Abstract:Understanding multicomponent complex material systems is essential for design of advanced materials for a wide range of technological applications. While state-of-the-art crystal structure prediction (CSP) methods effectively identify new structures and assess phase stability, they face fundamental limitations when applied to complex systems. This challenge stems from the combinatorial explosion of atomic configurations and the vast stoichiometric space, both of which contribute to computational demands that rapidly exceed practical feasibility. In this work, we propose a flexible and automated workflow to build a highly generalizable and data-efficient machine learning potential (MLP), effectively unlocking the full potential of CSP algorithms. The workflow is validated on both Mg-Ca-H ternary and Be-P-N-O quaternary systems, demonstrating substantial machine learning acceleration in high-throughput structural optimization and enabling the efficient identification of promising compounds. These results underscore the effectiveness of our approach in exploring complex material systems and accelerating the discovery of new multicomponent materials.

[LG-65] Beyond Basic A/B testing: Improving Statistical Efficiency for Business Growth

Link: https://arxiv.org/abs/2505.08128
Authors: Changshuai Wei, Phuc Nguyen, Benjamin Zelditch, Joyce Chen
Subjects: Methodology (stat.ME); Machine Learning (cs.LG); Statistics Theory (math.ST); Computation (stat.CO)
Comments:

Abstract:Standard A/B testing approaches in large-scale industry applications are mostly based on the t-test. These approaches, however, suffer from low statistical power in business settings, owing to small sample sizes, non-Gaussian distributions, or return-on-investment (ROI) considerations. In this paper, we propose several approaches to address these challenges: (i) regression adjustment, generalized estimating equations, Mann-Whitney U, and Zero-Trimmed U, which address each of these issues separately, and (ii) a novel doubly robust generalized U that handles ROI considerations, distributional robustness, and small samples in one framework. We provide theoretical results on asymptotic normality and efficiency bounds, together with insights on the efficiency gain from theoretical analysis. We further conduct comprehensive simulation studies and apply the methods to multiple real A/B tests.
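
Two of the listed ingredients are easy to sketch with standard tooling (toy data; this is not the paper's doubly robust estimator): a Mann-Whitney U test for a heavy-tailed outcome, and regression adjustment with a pre-experiment covariate to tighten the effect estimate.

```python
import numpy as np
from scipy.stats import mannwhitneyu
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 300
x = rng.normal(size=n)                         # pre-experiment covariate
t = rng.integers(0, 2, n)                      # random treatment assignment
y = 0.3 * t + 0.8 * x + rng.standard_t(3, n)   # heavy-tailed outcome

# rank-based test: robust to the non-Gaussian outcome distribution
print(mannwhitneyu(y[t == 1], y[t == 0]))

# regression adjustment: the covariate absorbs variance, shrinking the SE
fit = sm.OLS(y, sm.add_constant(np.column_stack([t, x]))).fit()
print(fit.params[1], fit.bse[1])               # adjusted treatment effect and its SE
```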

[LG-66] Sharp Gaussian approximations for Decentralized Federated Learning

Link: https://arxiv.org/abs/2505.08125
Authors: Soham Bonnerjee, Sayar Karmakar, Wei Biao Wu
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Statistics Theory (math.ST)
Comments:

Abstract:Federated Learning has gained traction in privacy-sensitive collaborative environments, with local SGD emerging as a key optimization method in decentralized settings. While its convergence properties are well-studied, asymptotic statistical guarantees beyond convergence remain limited. In this paper, we present two generalized Gaussian approximation results for local SGD and explore their implications. First, we prove a Berry-Esseen theorem for the final local SGD iterates, enabling valid multiplier bootstrap procedures. Second, motivated by robustness considerations, we introduce two distinct time-uniform Gaussian approximations for the entire trajectory of local SGD. The time-uniform approximations support Gaussian bootstrap-based tests for detecting adversarial attacks. Extensive simulations are provided to support our theoretical results.
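
The object under study, local SGD with periodic averaging, looks as follows on a toy decentralized least-squares problem (our sketch; worker counts and step sizes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
d, workers, local_steps, rounds, lr = 5, 4, 20, 50, 0.05
theta_star = rng.normal(size=d)
data = []
for _ in range(workers):                       # each worker holds private data
    A = rng.normal(size=(200, d))
    data.append((A, A @ theta_star + rng.normal(0, 0.1, 200)))

theta = np.zeros(d)
for _ in range(rounds):
    local_iterates = []
    for A, b in data:                          # workers start from the shared iterate
        th = theta.copy()
        for _ in range(local_steps):           # K local SGD steps on local data
            i = rng.integers(len(b))
            th -= lr * (A[i] @ th - b[i]) * A[i]
        local_iterates.append(th)
    theta = np.mean(local_iterates, axis=0)    # communication: average local iterates
print(np.linalg.norm(theta - theta_star))      # small: consensus near the truth
```

The paper's Berry-Esseen and time-uniform results concern the distribution of exactly these averaged iterates around theta_star.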

[LG-67] Wasserstein Distributionally Robust Nonparametric Regression

Link: https://arxiv.org/abs/2505.07967
Authors: Changyu Liu, Yuling Jiao, Junhui Wang, Jian Huang
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
Comments: 50 pages

Abstract:Distributionally robust optimization has become a powerful tool for prediction and decision-making under model uncertainty. By focusing on the local worst-case risk, it enhances robustness by identifying the most unfavorable distribution within a predefined ambiguity set. While extensive research has been conducted in parametric settings, studies on nonparametric frameworks remain limited. This paper studies the generalization properties of Wasserstein distributionally robust nonparametric estimators, with particular attention to the impact of model misspecification, where non-negligible discrepancies between the estimation function space and target function can impair generalization performance. We establish non-asymptotic error bounds for the excess local worst-case risk by analyzing the regularization effects induced by distributional perturbations and employing feedforward neural networks with Lipschitz constraints. These bounds illustrate how uncertainty levels and neural network structures influence generalization performance and are applicable to both Lipschitz and quadratic loss functions. Furthermore, we investigate the Lagrangian relaxation of the local worst-case risk and derive corresponding non-asymptotic error bounds for these estimators. The robustness of the proposed estimator is evaluated through simulation studies and illustrated with an application to the MNIST dataset.
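
The Lagrangian relaxation discussed here can be sketched as adversarial training with an L2 transport penalty: an inner ascent finds the worst-case data perturbation, and the model descends on the resulting loss (a minimal sketch under assumed hyperparameters, not the paper's estimator).

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
X = torch.randn(256, 2)
y = (X[:, 0] * X[:, 1]).unsqueeze(1)
net = nn.Sequential(nn.Linear(2, 32), nn.ReLU(), nn.Linear(32, 1))
opt = torch.optim.Adam(net.parameters(), lr=1e-2)
lam = 10.0                                   # Lagrange multiplier on transport cost

for _ in range(200):
    delta = torch.zeros_like(X, requires_grad=True)
    for _ in range(5):                       # inner ascent: worst-case perturbation
        inner = ((net(X + delta) - y) ** 2).mean() - lam * (delta ** 2).sum(1).mean()
        g, = torch.autograd.grad(inner, delta)
        delta = (delta + 0.1 * g).detach().requires_grad_(True)
    loss = ((net(X + delta.detach()) - y) ** 2).mean()  # outer descent on worst case
    opt.zero_grad()
    loss.backward()
    opt.step()
print(loss.item())
```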

[LG-68] Diffusion-based supervised learning of generative models for efficient sampling of multimodal distributions

Link: https://arxiv.org/abs/2505.07825
Authors: Hoang Tran, Zezhong Zhang, Feng Bao, Dan Lu, Guannan Zhang
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Probability (math.PR)
Comments:

Abstract:We propose a hybrid generative model for efficient sampling of high-dimensional, multimodal probability distributions for Bayesian inference. Traditional Monte Carlo methods, such as the Metropolis-Hastings and Langevin Monte Carlo sampling methods, are effective for sampling from single-mode distributions in high-dimensional spaces. However, these methods struggle to produce samples with the correct proportions for each mode in multimodal distributions, especially for distributions with well separated modes. To address the challenges posed by multimodality, we adopt a divide-and-conquer strategy. We start by minimizing the energy function with initial guesses uniformly distributed within the prior domain to identify all the modes of the energy function. Then, we train a classifier to segment the domain corresponding to each mode. After the domain decomposition, we train a diffusion-model-assisted generative model for each identified mode within its support. Once each mode is characterized, we employ bridge sampling to estimate the normalizing constant, allowing us to directly adjust the ratios between the modes. Our numerical examples demonstrate that the proposed framework can effectively handle multimodal distributions with varying mode shapes in up to 100 dimensions. An application to Bayesian inverse problem for partial differential equations is also provided.
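
The first stage of the divide-and-conquer strategy, finding all modes from uniformly distributed initial guesses, can be sketched as follows (toy two-mode energy; the later classifier, diffusion, and bridge-sampling stages are omitted):

```python
import numpy as np
from scipy.optimize import minimize

def energy(x):
    # toy energy with two well-separated modes at (2, 2) and (-2, -2)
    return -np.log(np.exp(-np.sum((x - 2) ** 2)) + np.exp(-np.sum((x + 2) ** 2)))

rng = np.random.default_rng(0)
starts = rng.uniform(-5, 5, (50, 2))          # uniform guesses over the prior box
minima = np.array([minimize(energy, s).x for s in starts])

modes = []                                    # deduplicate converged points
for m in minima:
    if not any(np.linalg.norm(m - q) < 0.5 for q in modes):
        modes.append(m)
print(np.round(modes, 2))                     # both modes recovered
```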

[LG-69] Linear to Neural Networks Regression: QSPR of Drugs via Degree-Distance Indices

Link: https://arxiv.org/abs/2505.07821
Authors: M. J. Nadjafi Arani, S. Sorgun, M. Mirzargar
Subjects: Biomolecules (q-bio.BM); Machine Learning (cs.LG)
Comments: 10 figures

Abstract:This study conducts a Quantitative Structure-Property Relationship (QSPR) analysis to explore the correlation between the physical properties of drug molecules and their topological indices using machine learning techniques. While prior studies in drug design have focused on degree-based topological indices, this work analyzes a dataset of 166 drug molecules by computing degree-distance-based topological indices, incorporating vertex-edge weightings with respect to six different atomic properties (atomic number, atomic radius, atomic mass, density, electronegativity, and ionization). Both linear models (Linear Regression, Lasso, and Ridge Regression) and nonlinear approaches (Random Forest, XGBoost, and Neural Networks) were employed to predict molecular properties. The results demonstrate the effectiveness of these indices in predicting specific physicochemical properties and underscore the practical relevance of computational methods in molecular property estimation. The study provides an innovative perspective on integrating topological indices with machine learning to enhance predictive accuracy, highlighting their potential application in drug discovery and development processes. Establishing a reliable relationship between topological indices and physical properties enables chemists to gain preliminary insights into molecular behavior before conducting experimental analyses, thereby optimizing resource utilization in cheminformatics research.
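
For concreteness, a degree-distance-style index with atomic-mass vertex weights can be computed on a toy molecular graph as follows (an illustrative weighting of the classical degree-distance sum, not the paper's exact index definitions):

```python
import networkx as nx

# toy hydrogen-suppressed molecular graph: a central carbon bonded to
# one carbon and two oxygens (roughly an acetate-like skeleton)
G = nx.Graph([(0, 1), (1, 2), (1, 3)])
mass = {0: 12.011, 1: 12.011, 2: 15.999, 3: 15.999}  # vertex weights: atomic mass

dist = dict(nx.all_pairs_shortest_path_length(G))
# degree-distance sum over vertex pairs, scaled by the endpoint mass weights
dd_index = sum((G.degree[u] + G.degree[v]) * dist[u][v] * mass[u] * mass[v]
               for u in G for v in G if u < v)
print(dd_index)
```

Computed over a set of molecules, such indices become the feature columns fed to the linear and nonlinear regressors listed above.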

Information Retrieval

[IR-0] Interest Changes: Considering User Interest Life Cycle in Recommendation System SIGIR2025

Link: https://arxiv.org/abs/2505.08471
Authors: Yinjiang Cai, Jiangpan Hou, Yangping Zhu, Yuan Nie
Subjects: Information Retrieval (cs.IR)
Comments: Accepted by SIGIR 2025

Abstract:In recommendation systems, user interests are always in a state of constant flux. Typically, a user interest experiences an emergent phase, a stable phase, and a declining phase, which are referred to as the "user interest life-cycle". Recent papers on user interest modeling have primarily focused on how to compute the correlation between the target item and the user's historical behaviors, without thoroughly considering the life-cycle features of user interest. In this paper, we propose an effective method called Deep Interest Life-cycle Network (DILN), which not only captures interest life-cycle features efficiently, but can also be easily integrated into existing ranking models. DILN contains two key components: the Interest Life-cycle Encoder Module constructs historical activity histograms of the user interest and encodes them into a dense representation, while the Interest Life-cycle Fusion Module injects the encoded representation into multiple expert networks, so that specific phases of the interest life-cycle activate distinct experts. Online A/B testing reveals that DILN achieves significant improvements of +0.38% in CTR, +1.04% in CVR and +0.25% in duration per user, which demonstrates its effectiveness. In addition, DILN inherently increases the exposure of users' emergent and stable interests while decreasing the exposure of declining interests. DILN has been deployed on the Lofter App.
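
A schematic sketch of the two components, an activity-histogram encoder whose output gates a few experts (dimensions, layer choices, and the toy histogram below are our assumptions; the paper does not publish code here):

```python
import torch
import torch.nn as nn

class LifecycleGate(nn.Module):
    """Encode an interest-activity histogram and gate experts with it."""
    def __init__(self, n_bins=8, n_experts=3, dim=16):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(n_bins, dim), nn.ReLU(),
                                     nn.Linear(dim, dim))
        self.gate = nn.Linear(dim, n_experts)
        self.experts = nn.ModuleList(nn.Linear(dim, 1) for _ in range(n_experts))

    def forward(self, hist):
        z = self.encoder(hist)                   # dense life-cycle representation
        w = torch.softmax(self.gate(z), dim=-1)  # phase-dependent expert weights
        outs = torch.cat([e(z) for e in self.experts], dim=-1)
        return (w * outs).sum(dim=-1)

# a rise-then-decline activity histogram, suggesting a declining interest
hist = torch.tensor([[0.0, 1.0, 4.0, 9.0, 7.0, 3.0, 1.0, 0.0]])
print(LifecycleGate()(hist))
```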

[IR-1] Lost in Transliteration: Bridging the Script Gap in Neural IR SIGIR

Link: https://arxiv.org/abs/2505.08411
Authors: Andreas Chari, Iadh Ounis, Sean MacAvaney
Subjects: Information Retrieval (cs.IR)
Comments: 6 pages, 2 tables. Paper accepted at the Short Paper track of the 48th International ACM SIGIR Conference on Research and Development in Information Retrieval

Abstract:Most human languages use scripts other than the Latin alphabet. Search users in these languages often formulate their information needs in a transliterated – usually Latinized – form for ease of typing. For example, Greek speakers might use Greeklish, and Arabic speakers might use Arabizi. This paper shows that current search systems, including those that use multilingual dense embeddings such as BGE-M3, do not generalise to this setting, and their performance rapidly deteriorates when exposed to transliterated queries. This creates a "script gap" between the performance of the same queries when written in their native or transliterated form. We explore whether adapting the popular "translate-train" paradigm to transliterations can enhance the robustness of multilingual Information Retrieval (IR) methods and bridge the gap between native and transliterated scripts. By exploring various combinations of non-Latin and Latinized query text for training, we investigate whether we can enhance the capacity of existing neural retrieval techniques and enable them to apply to this important setting. We show that by further fine-tuning IR models on an even mixture of native and Latinized text, they can perform this cross-script matching at nearly the same performance as when the query was formulated in the native script. Out-of-domain evaluation and further qualitative analysis show that transliterations can also cause queries to lose some of their nuances, motivating further research in this direction.
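
The data side of this "translate-train on transliterations" recipe amounts to mixing native and Latinized query text evenly; a crude sketch (the character map below is a toy stand-in for a real transliterator, and the query-passage pairs are hypothetical):

```python
import random

# toy Greek -> Greeklish map; real pipelines use a proper transliterator
GREEKLISH = str.maketrans({
    "α": "a", "β": "v", "γ": "g", "δ": "d", "ε": "e", "η": "i", "θ": "th",
    "ι": "i", "κ": "k", "λ": "l", "μ": "m", "ν": "n", "ο": "o", "π": "p",
    "ρ": "r", "σ": "s", "ς": "s", "τ": "t", "υ": "y", "ω": "o",
    "ά": "a", "έ": "e", "ή": "i", "ί": "i", "ό": "o", "ύ": "y", "ώ": "o",
})

def latinize(text: str) -> str:
    return text.translate(GREEKLISH)

native_pairs = [("καλημέρα αθήνα", "passage-1"), ("βιβλίο ιστορίας", "passage-2")]
random.seed(0)
# even native/Latinized mixture for translate-train style fine-tuning
mixed = [(latinize(q) if random.random() < 0.5 else q, p) for q, p in native_pairs]
print(mixed)
```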

[IR-2] TikTok Search Recommendations: Governance and Research Challenges

Link: https://arxiv.org/abs/2505.08385
Authors: Taylor Annabell, Robert Gorwa, Rebecca Scharlach, Jacob van de Kerkhof, Thales Bertaglia
Subjects: Information Retrieval (cs.IR); Computers and Society (cs.CY)
Comments: To appear in the 1st International Workshop on Computational Approaches to Content Moderation and Platform Governance (COMPASS), held at ICWSM 2025

Abstract:Like other social media, TikTok is embracing its use as a search engine, developing search products that steer users to produce searchable content and engage in content discovery. Its recently developed search-recommendations product consists of preformulated search queries recommended to users on videos. However, TikTok provides limited transparency about how search recommendations are generated and moderated, despite requirements under regulatory frameworks like the European Union's Digital Services Act. By suggesting that the platform simply aggregates comments and common searches linked to videos, TikTok sidesteps responsibility for the issues that arise from contextually problematic recommendations, reigniting long-standing concerns about platform liability and moderation. This position paper addresses the novelty of search recommendations on TikTok by highlighting the challenges that this feature poses for platform governance and by offering a computational research agenda, drawing on preliminary qualitative analysis. It sets out the need for transparency in platform documentation, data access, and research to study search recommendations.

Attachments

Download the complete list of today's papers