This blog post presents the latest paper list retrieved from Arxiv.org on 2025-02-14. It is updated automatically and organized into five major areas: NLP, CV, ML, AI, and IR. If you would like to receive the list by scheduled email, please leave your email address in the comments.

Note: The daily paper data is retrieved from Arxiv.org and updated automatically around 12:00 each day.

Friendly reminder: If you would like to receive the daily paper data by email, please leave your email address in the comments.


Overview (2025-02-14)

531 papers were updated today, including:

  • Natural Language Processing: 89 papers (Computation and Language (cs.CL))
  • Artificial Intelligence: 159 papers (Artificial Intelligence (cs.AI))
  • Computer Vision: 94 papers (Computer Vision and Pattern Recognition (cs.CV))
  • Machine Learning: 178 papers (Machine Learning (cs.LG))

Natural Language Processing

[NLP-0] Theoretical Benefit and Limitation of Diffusion Language Model

[Quick Read]: This paper addresses the efficiency-accuracy trade-off of diffusion language models in text generation. Its key contribution is a theoretical analysis showing that, when perplexity is the evaluation metric, the Masked Diffusion Model (MDM) can achieve near-optimal perplexity with a number of sampling steps independent of sequence length, improving efficiency without sacrificing performance. However, when the sequence error rate is the metric, the number of sampling steps required to obtain correct sequences must grow linearly with sequence length, eliminating MDM's efficiency advantage over autoregressive models. With these findings, the paper establishes the first theoretical foundation for understanding the benefits and limitations of MDMs.

Link: https://arxiv.org/abs/2502.09622
Authors: Guhao Feng, Yihan Geng, Jian Guan, Wei Wu, Liwei Wang, Di He
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (stat.ML)
Comments: 32 pages, 3 figures

Abstract:Diffusion language models have emerged as a promising approach for text generation. One would naturally expect this method to be an efficient replacement for autoregressive models since multiple tokens can be sampled in parallel during each diffusion step. However, its efficiency-accuracy trade-off is not yet well understood. In this paper, we present a rigorous theoretical analysis of a widely used type of diffusion language model, the Masked Diffusion Model (MDM), and find that its effectiveness heavily depends on the target evaluation metric. Under mild conditions, we prove that when using perplexity as the metric, MDMs can achieve near-optimal perplexity in sampling steps regardless of sequence length, demonstrating that efficiency can be achieved without sacrificing performance. However, when using the sequence error rate–which is important for understanding the “correctness” of a sequence, such as a reasoning chain–we show that the required sampling steps must scale linearly with sequence length to obtain “correct” sequences, thereby eliminating MDM’s efficiency advantage over autoregressive models. Our analysis establishes the first theoretical foundation for understanding the benefits and limitations of MDMs. All theoretical findings are supported by empirical studies.
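
To make the parallel-sampling idea concrete, below is a minimal sketch of a masked-diffusion decoding loop, assuming a toy uniform denoiser (`toy_denoiser` and all other names are illustrative, not the paper's implementation). Each step fills several masked positions at once, which is why the step count need not grow with sequence length when perplexity is the target.

```python
import random

VOCAB = ["a", "b", "c"]
MASK = "<mask>"

def toy_denoiser(seq):
    # Stand-in for the MDM network: a uniform distribution over VOCAB
    # for every masked position (purely illustrative).
    return {i: [1.0 / len(VOCAB)] * len(VOCAB)
            for i, tok in enumerate(seq) if tok == MASK}

def mdm_sample(length, num_steps):
    # Each step unmasks a batch of positions in parallel, so num_steps
    # can stay fixed as `length` grows.
    seq = [MASK] * length
    for step in range(num_steps):
        probs = toy_denoiser(seq)
        masked = list(probs)
        if not masked:
            break
        k = max(1, len(masked) // (num_steps - step))
        for i in random.sample(masked, k):
            seq[i] = random.choices(VOCAB, weights=probs[i])[0]
    return seq

print(mdm_sample(length=12, num_steps=3))
```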

[NLP-1] MME-CoT: Benchmarking Chain-of-Thought in Large Multimodal Models for Reasoning Quality Robustness and Efficiency

[Quick Read]: This paper aims to systematically evaluate and investigate the impact of Chain-of-Thought (CoT) on the reasoning abilities of Large Multimodal Models (LMMs), an area that still lacks systematic assessment. To this end, the authors introduce the MME-CoT benchmark, which evaluates the CoT reasoning performance of LMMs across six domains: math, science, OCR, logic, space-time, and general scene analysis. The key contribution is a thorough evaluation suite with three novel metrics that assess reasoning quality, robustness, and efficiency at a fine-grained level. Using carefully curated data and a unique evaluation strategy, the study finds that models with a reflection mechanism show superior CoT quality, with Kimi k1.5 performing best among them. However, the paper also notes that CoT prompting can degrade LMM performance on perception-heavy tasks, and that models with reflection exhibit significant inefficiency in both the normal response and self-correction phases.

Link: https://arxiv.org/abs/2502.09621
Authors: Dongzhi Jiang, Renrui Zhang, Ziyu Guo, Yanwei Li, Yu Qi, Xinyan Chen, Liuhui Wang, Jianhan Jin, Claire Guo, Shen Yan, Bo Zhang, Chaoyou Fu, Peng Gao, Hongsheng Li
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: Project Page: this https URL

Abstract:Answering questions with Chain-of-Thought (CoT) has significantly enhanced the reasoning capabilities of Large Language Models (LLMs), yet its impact on Large Multimodal Models (LMMs) still lacks a systematic assessment and in-depth investigation. In this paper, we introduce MME-CoT, a specialized benchmark evaluating the CoT reasoning performance of LMMs, spanning six domains: math, science, OCR, logic, space-time, and general scenes. As the first comprehensive study in this area, we propose a thorough evaluation suite incorporating three novel metrics that assess the reasoning quality, robustness, and efficiency at a fine-grained level. Leveraging curated high-quality data and a unique evaluation strategy, we conduct an in-depth analysis of state-of-the-art LMMs, uncovering several key insights: 1) Models with reflection mechanism demonstrate a superior CoT quality, with Kimi k1.5 outperforming GPT-4o and demonstrating the highest quality results; 2) CoT prompting often degrades LMM performance on perception-heavy tasks, suggesting a potentially harmful overthinking behavior; and 3) Although the CoT quality is high, LMMs with reflection exhibit significant inefficiency in both normal response and self-correction phases. We hope MME-CoT serves as a foundation for advancing multimodal reasoning in LMMs. Project Page: this https URL

[NLP-2] Exploring the Potential of Encoder-free Architectures in 3D LMMs

[Quick Read]: This paper tackles the challenges faced by encoder-based 3D Large Multimodal Models (LMMs), including their inability to adapt to varying point cloud resolutions and the mismatch between encoder-extracted point features and the semantic needs of Large Language Models (LLMs). The key idea is an encoder-free architecture built on two main strategies: 1) an LLM-embedded Semantic Encoding strategy in the pre-training stage, which explores various point cloud self-supervised losses and proposes a Hybrid Semantic Loss to extract high-level semantics; and 2) a Hierarchical Geometry Aggregation strategy in the instruction-tuning stage, which injects inductive bias into the early LLM layers so they attend to local details of the point cloud. Based on these strategies, the paper presents ENEL, the first encoder-free 3D LMM.

Link: https://arxiv.org/abs/2502.09620
Authors: Yiwen Tang, Zoey Guo, Zhuhao Wang, Ray Zhang, Qizhi Chen, Junli Liu, Delin Qu, Zhigang Wang, Dong Wang, Xuelong Li, Bin Zhao
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: The code is released at this https URL

Abstract:Encoder-free architectures have been preliminarily explored in the 2D visual domain, yet it remains an open question whether they can be effectively applied to 3D understanding scenarios. In this paper, we present the first comprehensive investigation into the potential of encoder-free architectures to overcome the challenges of encoder-based 3D Large Multimodal Models (LMMs). These challenges include the failure to adapt to varying point cloud resolutions and the point features from the encoder not meeting the semantic needs of Large Language Models (LLMs). We identify key aspects for 3D LMMs to remove the encoder and enable the LLM to assume the role of the 3D encoder: 1) We propose the LLM-embedded Semantic Encoding strategy in the pre-training stage, exploring the effects of various point cloud self-supervised losses. And we present the Hybrid Semantic Loss to extract high-level semantics. 2) We introduce the Hierarchical Geometry Aggregation strategy in the instruction tuning stage. This incorporates inductive bias into the LLM early layers to focus on the local details of the point clouds. To the end, we present the first Encoder-free 3D LMM, ENEL. Our 7B model rivals the current state-of-the-art model, ShapeLLM-13B, achieving 55.0%, 50.92%, and 42.7% on the classification, captioning, and VQA tasks, respectively. Our results demonstrate that the encoder-free architecture is highly promising for replacing encoder-based architectures in the field of 3D understanding. The code is released at this https URL

[NLP-3] Human-LLM Coevolution: Evidence from Academic Writing

[Quick Read]: This paper examines the impact of large language models (LLMs) on academic writing by statistically analyzing word-frequency changes in arXiv paper abstracts, revealing a mutual adaptation between human authors and LLMs. The key observation is that the frequency of some words overused by ChatGPT has dropped while that of others has kept rising, suggesting that authors have adjusted how they use LLMs, for example by selecting outputs or editing LLM-generated content. These phenomena reflect the difficulty of detecting machine-generated text in real-world scenarios and underline the importance of continuing to track word frequencies to estimate the impact of LLMs on academic writing.

Link: https://arxiv.org/abs/2502.09606
Authors: Mingmeng Geng, Roberto Trotta
Affiliations: International School for Advanced Studies (SISSA); Imperial College London
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Digital Libraries (cs.DL); Machine Learning (cs.LG)
Comments:

Abstract:With a statistical analysis of arXiv paper abstracts, we report a marked drop in the frequency of several words previously identified as overused by ChatGPT, such as “delve”, starting soon after they were pointed out in early 2024. The frequency of certain other words favored by ChatGPT, such as “significant”, has instead kept increasing. These phenomena suggest that some authors of academic papers have adapted their use of large language models (LLMs), for example, by selecting outputs or applying modifications to the LLM-generated content. Such coevolution and cooperation of humans and LLMs thus introduce additional challenges to the detection of machine-generated text in real-world scenarios. Estimating the impact of LLMs on academic writing by examining word frequency remains feasible, and more attention should be paid to words that were already frequently employed, including those that have decreased in frequency.

[NLP-4] SelfCite: Self-Supervised Alignment for Context Attribution in Large Language Models

[Quick Read]: This paper addresses the problem of getting generative large language models (LLMs) to provide high-quality citations for their responses. The key is the SelfCite method, which uses a reward signal derived from context ablation to judge whether a citation is necessary and sufficient, significantly improving citation quality. This reward not only guides an inference-time best-of-N sampling strategy, but is also used to directly fine-tune the model to generate better citations.

Link: https://arxiv.org/abs/2502.09604
Authors: Yung-Sung Chuang, Benjamin Cohen-Wang, Shannon Zejiang Shen, Zhaofeng Wu, Hu Xu, Xi Victoria Lin, James Glass, Shang-Wen Li, Wen-tau Yih
Affiliations: MIT; Meta
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Implementation available at this https URL

Abstract:We introduce SelfCite, a novel self-supervised approach that aligns LLMs to generate high-quality, fine-grained, sentence-level citations for the statements in their generated responses. Instead of only relying on costly and labor-intensive annotations, SelfCite leverages a reward signal provided by the LLM itself through context ablation: If a citation is necessary, removing the cited text from the context should prevent the same response; if sufficient, retaining the cited text alone should preserve the same response. This reward can guide the inference-time best-of-N sampling strategy to improve citation quality significantly, as well as be used in preference optimization to directly fine-tune the models for generating better citations. The effectiveness of SelfCite is demonstrated by increasing citation F1 up to 5.3 points on the LongBench-Cite benchmark across five long-form question answering tasks.
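
The context-ablation reward lends itself to a short sketch. Here `log_prob` is a hypothetical scorer for the LM log-probability of a response given a context; the necessity/sufficiency decomposition follows the abstract's description, while the exact arithmetic is an assumption.

```python
def selfcite_reward(log_prob, response, context, cited_text):
    # log_prob(response, context) -> float is a hypothetical scorer.
    full = log_prob(response, context)
    # Necessity: dropping the cited text should make the response less likely.
    necessity = full - log_prob(response, context.replace(cited_text, ""))
    # Sufficiency: the cited text alone should keep the response likely.
    sufficiency = log_prob(response, cited_text) - full
    return necessity + sufficiency

# Dummy scorer, just to show the call shape:
dummy = lambda resp, ctx: -len(resp) / (1 + len(ctx))
print(selfcite_reward(dummy, "answer", "long context with evidence", "evidence"))
```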

[NLP-5] CoT-Valve: Length-Compressible Chain-of-Thought Tuning

[Quick Read]: This paper aims to reduce the high inference cost caused by long Chain-of-Thought (CoT) reasoning paths. Its key contribution is CoT-Valve, a new tuning and inference strategy that controls the length of generated reasoning chains by manipulating a direction in parameter space, enabling elastic control over reasoning paths. This lets a model adjust its reasoning length dynamically according to task difficulty and thereby lower inference overhead. Experiments show that CoT-Valve achieves controllability and compressibility of the chain and outperforms prompt-based control on GSM8K and AIME.

Link: https://arxiv.org/abs/2502.09601
Authors: Xinyin Ma, Guangnian Wan, Runpeng Yu, Gongfan Fang, Xinchao Wang
Affiliations: National University of Singapore
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: Work in progress. Code will be released at this https URL

Abstract:Chain-of-Thought significantly enhances a model’s reasoning capability, but it also comes with a considerable increase in inference costs due to long chains. With the observation that the reasoning path can be easily compressed under easy tasks but struggle on hard tasks, we explore the feasibility of elastically controlling the length of reasoning paths with only one model, thereby reducing the inference overhead of reasoning models dynamically based on task difficulty. We introduce a new tuning and inference strategy named CoT-Valve, designed to allow models to generate reasoning chains of varying lengths. To achieve this, we propose to identify a direction in the parameter space that, when manipulated, can effectively control the length of generated CoT. Moreover, we show that this property is valuable for compressing the reasoning chain. We construct datasets with chains from long to short for the same questions and explore two enhanced strategies for CoT-Valve: (1) a precise length-compressible CoT tuning method, and (2) a progressive chain length compression approach. Our experiments show that CoT-Valve successfully enables controllability and compressibility of the chain and shows better performance than the prompt-based control. We applied this method to QwQ-32B-Preview, reducing reasoning chains on GSM8K from 741 to 225 tokens with a minor performance drop (95.07% to 94.92%) and on AIME from 6827 to 4629 tokens, with only one additional incorrect answer.
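
One plausible reading of "a direction in parameter space" is linear interpolation between two checkpoints of the same model, one tuned for long chains and one for short ones; the sketch below illustrates that reading. The interpolation form and all names are assumptions, not the authors' exact method.

```python
import numpy as np

def valve(theta_long, theta_short, alpha):
    # Move each weight along the (short - long) direction; alpha = 0 keeps
    # long-chain behaviour, alpha = 1 reaches the short-chain weights,
    # and intermediate values should yield intermediate chain lengths.
    return {name: w + alpha * (theta_short[name] - w)
            for name, w in theta_long.items()}

theta_long = {"mlp.weight": np.zeros(4)}   # checkpoint tuned for long CoT
theta_short = {"mlp.weight": np.ones(4)}   # checkpoint tuned for short CoT
print(valve(theta_long, theta_short, 0.5)["mlp.weight"])  # [0.5 0.5 0.5 0.5]
```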

[NLP-6] Do LLMs Recognize Your Preferences? Evaluating Personalized Preference Following in LLMs ICLR2025

[Quick Read]: This paper addresses the limited ability of large language models (LLMs) to personalize responses to user preferences. The key contribution is the PrefEval benchmark, which comprises 3,000 manually curated user preference and query pairs spanning 20 topics and evaluates LLMs' ability to infer, memorize, and follow user preferences in long-context conversational settings. Using PrefEval, the authors evaluate 10 open-source and proprietary LLMs in multi-session conversations and find that even state-of-the-art LLMs face significant challenges in proactively following user preferences. The paper also shows that fine-tuning on PrefEval significantly improves performance.

Link: https://arxiv.org/abs/2502.09597
Authors: Siyan Zhao, Mingyi Hong, Yang Liu, Devamanyu Hazarika, Kaixiang Lin
Affiliations: Amazon AGI; UCLA; University of Minnesota
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments: Accepted at ICLR 2025 as oral presentation. Code and data at: this https URL

Abstract:Large Language Models (LLMs) are increasingly used as chatbots, yet their ability to personalize responses to user preferences remains limited. We introduce PrefEval, a benchmark for evaluating LLMs’ ability to infer, memorize and adhere to user preferences in a long-context conversational setting. PrefEval comprises 3,000 manually curated user preference and query pairs spanning 20 topics. PrefEval contains user personalization or preference information in both explicit and implicit forms, and evaluates LLM performance using a generation and a classification task. With PrefEval, we evaluated the aforementioned preference following capabilities of 10 open-source and proprietary LLMs in multi-session conversations with varying context lengths up to 100k tokens. We benchmark with various prompting, iterative feedback, and retrieval-augmented generation methods. Our benchmarking effort reveals that state-of-the-art LLMs face significant challenges in proactively following users’ preferences during conversations. In particular, in zero-shot settings, preference following accuracy falls below 10% at merely 10 turns (~3k tokens) across most evaluated models. Even with advanced prompting and retrieval methods, preference following still deteriorates in long-context conversations. Furthermore, we show that fine-tuning on PrefEval significantly improves performance. We believe PrefEval serves as a valuable resource for measuring, understanding, and enhancing LLMs’ preference following abilities, paving the way for personalized conversational agents. Our code and dataset are available at this https URL.

[NLP-7] Logical forms complement probability in understanding language model (and human) performance

[Quick Read]: This paper investigates the logical reasoning abilities of large language models (LLMs) in natural language. It builds a controlled dataset of hypothetical and disjunctive syllogisms in propositional and modal logic as a testbed for evaluating LLM performance. The key insight is that, in addition to the probability of the input, logical form should be considered an orthogonal factor in predicting LLM behavior; by comparing human and LLM behavioral results, the paper also shows similarities and differences between the two in logical reasoning.

Link: https://arxiv.org/abs/2502.09589
Authors: Yixuan Wang, Freda Shi
Affiliations: University of Chicago; University of Waterloo; Vector Institute, Canada CIFAR AI Chair
Subjects: Computation and Language (cs.CL); Logic in Computer Science (cs.LO)
Comments: Preprint

Abstract:With the increasing interest in using large language models (LLMs) for planning in natural language, understanding their behaviors becomes an important research question. This work conducts a systematic investigation of LLMs’ ability to perform logical reasoning in natural language. We introduce a controlled dataset of hypothetical and disjunctive syllogisms in propositional and modal logic and use it as the testbed for understanding LLM performance. Our results lead to novel insights in predicting LLM behaviors: in addition to the probability of input (Gonen et al., 2023; McCoy et al., 2024), logical forms should be considered as orthogonal factors. In addition, we show similarities and differences between the logical reasoning performances of humans and LLMs by comparing LLM and human behavioral results.

[NLP-8] Optimizing GPT for Video Understanding: Zero-Shot Performance and Prompt Engineering

[Quick Read]: This paper addresses industrial challenges in video content classification by exploring and optimizing GPT-based models for zero-shot classification across seven critical video-quality categories. The key to the solution is improving GPT's performance through prompt optimization and policy refinement; in particular, simplifying complex policies significantly reduces false negatives. The paper also introduces a new decomposition-aggregation-based prompt engineering technique that outperforms traditional single-prompt methods. The experiments show that careful prompt design can substantially improve GPT's performance without additional fine-tuning, offering an effective and scalable solution for video classification systems across industry domains.

Link: https://arxiv.org/abs/2502.09573
Authors: Mark Beliaev, Victor Yang, Madhura Raju, Jiachen Sun, Xinghai Hu
Affiliations: TikTok Inc.
Subjects: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments:

Abstract:In this study, we tackle industry challenges in video content classification by exploring and optimizing GPT-based models for zero-shot classification across seven critical categories of video quality. We contribute a novel approach to improving GPT’s performance through prompt optimization and policy refinement, demonstrating that simplifying complex policies significantly reduces false negatives. Additionally, we introduce a new decomposition-aggregation-based prompt engineering technique, which outperforms traditional single-prompt methods. These experiments, conducted on real industry problems, show that thoughtful prompt design can substantially enhance GPT’s performance without additional finetuning, offering an effective and scalable solution for improving video classification systems across various domains in industry.

[NLP-9] MorphNLI: A Stepwise Approach to Natural Language Inference Using Text Morphing NAACL2025

[Quick Read]: This paper addresses the natural language inference (NLI) task of classifying premise-hypothesis pairs as entailment, contradiction, or neutral. The key contribution is MorphNLI, a modular step-by-step approach that uses a language model to generate the edits needed to incrementally transform (i.e., morph) the premise into the hypothesis, tracks with an off-the-shelf NLI model how entailment progresses under these atomic changes, and aggregates the intermediate labels into a final output. The method is particularly strong in cross-domain settings, with relative improvements of up to 12.6%, and it is explainable, since the atomic edits help interpret the overall NLI label.

Link: https://arxiv.org/abs/2502.09567
Authors: Vlad Andrei Negru, Robert Vacareanu, Camelia Lemnaru, Mihai Surdeanu, Rodica Potolea
Affiliations: Department of Computer Science, Technical University of Cluj-Napoca, Romania; Department of Computer Science, University of Arizona, Tucson, USA
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 16 pages, 11 figures, 8 tables. Accepted for NAACL 2025 Findings

Abstract:We introduce MorphNLI, a modular step-by-step approach to natural language inference (NLI). When classifying the premise-hypothesis pairs into entailment, contradiction, neutral, we use a language model to generate the necessary edits to incrementally transform (i.e., morph) the premise into the hypothesis. Then, using an off-the-shelf NLI model we track how the entailment progresses with these atomic changes, aggregating these intermediate labels into a final output. We demonstrate the advantages of our proposed method particularly in realistic cross-domain settings, where our method always outperforms strong baselines with improvements up to 12.6% (relative). Further, our proposed approach is explainable as the atomic edits can be used to understand the overall NLI label.
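
The label-aggregation step can be sketched in a few lines. The dominance rule below (contradiction > neutral > entailment) is one plausible choice for combining per-edit labels, not necessarily the paper's exact rule.

```python
def aggregate_morph_labels(step_labels):
    # Combine per-edit NLI labels into a final verdict:
    # any contradiction dominates, then any neutral, else entailment.
    if "contradiction" in step_labels:
        return "contradiction"
    if "neutral" in step_labels:
        return "neutral"
    return "entailment"

# Morphing "A man sleeps" -> "A person rests": each atomic edit is scored
# by an off-the-shelf NLI model (labels below are illustrative).
print(aggregate_morph_labels(["entailment", "entailment"]))  # entailment
print(aggregate_morph_labels(["entailment", "neutral"]))     # neutral
```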

[NLP-10] Zero-shot generation of synthetic neurosurgical data with large language models

[Quick Read]: This paper addresses restricted access to clinical neurosurgical data, which is limited by data availability, small sample sizes, privacy regulations, and resource-intensive preprocessing and de-identification procedures. The key to the solution is zero-shot generation of synthetic neurosurgical data with a large language model (LLM), GPT-4o, benchmarked against a conditional tabular generative adversarial network (CTGAN) on fidelity, utility, and privacy. The results show that, even without fine-tuning or pre-training on real data, GPT-4o-generated datasets exhibit strong univariate and bivariate fidelity, and an ML classifier trained on GPT-4o data performs comparably to one trained on CTGAN data for the binary task of predicting postoperative functional status deterioration. These findings suggest that GPT-4o can generate high-fidelity synthetic neurosurgical data, augmenting small-sample clinical datasets and supporting the training of models for predicting neurosurgical outcomes.

Link: https://arxiv.org/abs/2502.09566
Authors: Austin A. Barr, Eddie Guo, Emre Sezgin
Affiliations: University of Calgary; The Abigail Wexner Research Institute; Nationwide Children's Hospital; Department of Pediatrics; The Ohio State University College of Medicine
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: 13 pages, 4 figures, 4 tables

Abstract:Clinical data is fundamental to advance neurosurgical research, but access is often constrained by data availability, small sample sizes, privacy regulations, and resource-intensive preprocessing and de-identification procedures. Synthetic data offers a potential solution to challenges associated with accessing and using real-world data (RWD). This study aims to evaluate the capability of zero-shot generation of synthetic neurosurgical data with a large language model (LLM), GPT-4o, by benchmarking with the conditional tabular generative adversarial network (CTGAN). Synthetic datasets were compared to real-world neurosurgical data to assess fidelity (means, proportions, distributions, and bivariate correlations), utility (ML classifier performance on RWD), and privacy (duplication of records from RWD). The GPT-4o-generated datasets matched or exceeded CTGAN performance, despite no fine-tuning or access to RWD for pre-training. Datasets demonstrated high univariate and bivariate fidelity to RWD without directly exposing any real patient records, even at amplified sample size. Training an ML classifier on GPT-4o-generated data and testing on RWD for a binary prediction task showed an F1 score (0.706) with comparable performance to training on the CTGAN data (0.705) for predicting postoperative functional status deterioration. GPT-4o demonstrated a promising ability to generate high-fidelity synthetic neurosurgical data. These findings also indicate that data synthesized with GPT-4o can effectively augment clinical data with small sample sizes, and train ML models for prediction of neurosurgical outcomes. Further investigation is necessary to improve the preservation of distributional characteristics and boost classifier performance.
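
The fidelity checks described above (univariate means, bivariate correlations) are easy to sketch for tabular data. All column names and numbers below are synthetic stand-ins, not the study's data.

```python
import numpy as np
import pandas as pd

def fidelity_report(real: pd.DataFrame, synthetic: pd.DataFrame):
    # Compare column means (univariate) and pairwise correlations
    # (bivariate) between real and synthetic tables; columns must match.
    mean_gap = (real.mean() - synthetic.mean()).abs()
    corr_gap = (real.corr() - synthetic.corr()).abs().to_numpy().mean()
    return mean_gap, corr_gap

rng = np.random.default_rng(1)
real = pd.DataFrame(rng.normal(size=(100, 3)), columns=list("abc"))
synth = pd.DataFrame(rng.normal(size=(100, 3)), columns=list("abc"))
means, corr = fidelity_report(real, synth)
print(means.round(2).to_dict(), round(float(corr), 3))
```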

[NLP-11] EmbodiedBench: Comprehensive Benchmarking Multi-modal Large Language Models for Vision-Driven Embodied Agents

[Quick Read]: This paper addresses the lack of comprehensive evaluation frameworks for building embodied agents from Multimodal Large Language Models (MLLMs). The key contribution is EmbodiedBench, an extensive benchmark for evaluating vision-driven embodied agents. Through a diverse set of test tasks and carefully curated capability subsets, EmbodiedBench not only exposes existing challenges but also offers valuable insights for advancing MLLM-based embodied agents.

Link: https://arxiv.org/abs/2502.09560
Authors: Rui Yang, Hanyang Chen, Junyu Zhang, Mark Zhao, Cheng Qian, Kangrui Wang, Qineng Wang, Teja Venkat Koripella, Marziyeh Movahedi, Manling Li, Heng Ji, Huan Zhang, Tong Zhang
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Comments: 51 pages

Abstract:Leveraging Multi-modal Large Language Models (MLLMs) to create embodied agents offers a promising avenue for tackling real-world tasks. While language-centric embodied agents have garnered substantial attention, MLLM-based embodied agents remain underexplored due to the lack of comprehensive evaluation frameworks. To bridge this gap, we introduce EmbodiedBench, an extensive benchmark designed to evaluate vision-driven embodied agents. EmbodiedBench features: (1) a diverse set of 1,128 testing tasks across four environments, ranging from high-level semantic tasks (e.g., household) to low-level tasks involving atomic actions (e.g., navigation and manipulation); and (2) six meticulously curated subsets evaluating essential agent capabilities like commonsense reasoning, complex instruction understanding, spatial awareness, visual perception, and long-term planning. Through extensive experiments, we evaluated 13 leading proprietary and open-source MLLMs within EmbodiedBench. Our findings reveal that: MLLMs excel at high-level tasks but struggle with low-level manipulation, with the best model, GPT-4o, scoring only 28.9% on average. EmbodiedBench provides a multifaceted standardized evaluation platform that not only highlights existing challenges but also offers valuable insights to advance MLLM-based embodied agents. Our code is available at this https URL.

[NLP-12] Mind the Gap! Choice Independence in Using Multilingual LLM s for Persuasive Co-Writing Tasks in Different Languages

[Quick Read]: This paper studies user behavior when writing with multilingual large language models (LLMs), focusing on how differences in AI performance across languages affect usage. The key analysis asks whether, in a charity advertisement writing task, prior exposure to poor LLM performance in one language affects users' subsequent utilization of the LLM in another language, and how these patterns shape the persuasiveness of the ads and readers' donation decisions. The study finds that prior experience with a Spanish LLM reduces subsequent reliance on an English LLM, violating choice independence. Although this does not significantly affect the aggregate persuasiveness of the ads, beliefs about the source of an ad (human versus AI) do influence donation behavior, especially for Spanish-speaking female participants who believed the ad was AI-generated. Moreover, people generally struggle to distinguish human-written from LLM-generated ads. The results have important implications for the design, development, integration, and adoption of multilingual LLMs as assistive tools.

Link: https://arxiv.org/abs/2502.09532
Authors: Shreyan Biswas, Alexander Erlei, Ujwal Gadiraju
Affiliations: Delft University of Technology; University of Goettingen
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
Comments:

Abstract:Recent advances in generative AI have precipitated a proliferation of novel writing assistants. These systems typically rely on multilingual large language models (LLMs), providing globalized workers the ability to revise or create diverse forms of content in different languages. However, there is substantial evidence indicating that the performance of multilingual LLMs varies between languages. Users who employ writing assistance for multiple languages are therefore susceptible to disparate output quality. Importantly, recent research has shown that people tend to generalize algorithmic errors across independent tasks, violating the behavioral axiom of choice independence. In this paper, we analyze whether user utilization of novel writing assistants in a charity advertisement writing task is affected by the AI’s performance in a second language. Furthermore, we quantify the extent to which these patterns translate into the persuasiveness of generated charity advertisements, as well as the role of peoples’ beliefs about LLM utilization in their donation choices. Our results provide evidence that writers who engage with an LLM-based writing assistant violate choice independence, as prior exposure to a Spanish LLM reduces subsequent utilization of an English LLM. While these patterns do not affect the aggregate persuasiveness of the generated advertisements, people’s beliefs about the source of an advertisement (human versus AI) do. In particular, Spanish-speaking female participants who believed that they read an AI-generated advertisement strongly adjusted their donation behavior downwards. Furthermore, people are generally not able to adequately differentiate between human-generated and LLM-generated ads. Our work has important implications for the design, development, integration, and adoption of multilingual LLMs as assistive agents – particularly in writing tasks.

[NLP-13] Improve LLM-based Automatic Essay Scoring with Linguistic Features AAAI

[Quick Read]: This paper addresses the challenge automatic essay scoring (AES) systems face in handling essays across diverse writing prompts, in particular maintaining consistent and accurate scoring across prompt types. The key to the solution is combining supervised feature-based approaches with large language model (LLM)-based methods by incorporating linguistic features into LLM-based scoring, improving performance while remaining computationally efficient. Experiments show that this hybrid method outperforms baseline models on both in-domain and out-of-domain writing prompts.

Link: https://arxiv.org/abs/2502.09497
Authors: Zhaoyi Joey Hou, Alejandro Ciuba, Xiang Lorraine Li
Affiliations: University of Pittsburgh
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: To be published in the workshop Innovation and Responsibility in AI-Supported Education (iRaise) at the 2025 Conference on Artificial Intelligence (AAAI)

Abstract:Automatic Essay Scoring (AES) assigns scores to student essays, reducing the grading workload for instructors. Developing a scoring system capable of handling essays across diverse prompts is challenging due to the flexibility and diverse nature of the writing task. Existing methods typically fall into two categories: supervised feature-based approaches and large language model (LLM)-based methods. Supervised feature-based approaches often achieve higher performance but require resource-intensive training. In contrast, LLM-based methods are computationally efficient during inference but tend to suffer from lower performance. This paper combines these approaches by incorporating linguistic features into LLM-based scoring. Experimental results show that this hybrid method outperforms baseline models for both in-domain and out-of-domain writing prompts.

[NLP-14] Objective quantification of mood states using large language models

[Quick Read]: This paper explores how large language models (LLMs) can quantify heterogeneous depressive mood states. The key is a framework that assesses individuals' mental states from self-report questionnaires and an LLM's reading of their open-ended answers; using factor analysis and ridge regression, the authors identify subspaces within the LLM's hidden states related to depression and somatic emotional distress, yielding quantitative measures of mental states. The reliability of the approach hinges on how informative the questions are and on the LLM's consistency and generalization over open-ended answers.

Link: https://arxiv.org/abs/2502.09487
Authors: Jakub Onysk, Quentin Huys
Affiliations: Applied Computational Psychiatry Lab, Max Planck UCL Centre for Computational Psychiatry and Ageing Research, Queen Square Institute of Neurology and Mental Health Neuroscience Department, Division of Psychiatry, University College London
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: main text - 9 pages, 5 figures

Abstract:Emotional states influence human behaviour and cognition, leading to diverse thought trajectories. Similarly, Large Language Models (LLMs) showcase an excellent level of response consistency across wide-ranging contexts (prompts). We leverage these parallels to establish a framework for quantifying mental states. Our approach utilises self-report questionnaires that reliably assess these states due to their inherent sensitivity to patterns of co-occurring responses. Specifically, we recruited a large sample of participants (N=422) to investigate how well an LLM (Mistral-7B-OpenOrca) quantifies a heterogenous set of depressive mood states measured with participants’ open-ended responses to a depression questionnaire. We show LLM responses to held-out multiple-choice questions, given participants’ open-ended answers, correlate strongly (r: 0.52-0.84) with true questionnaire scores, demonstrating LLM’s generalisation from mood representations. We explore a link between these representations and factor analysis. Using ridge regression, we find depression-related subspaces within LLM hidden states. We show these subspaces to be predictive of participants’ “Depression” and “Somatic Emotional Distress” factor scores, as well as suicidality severity. Overall, LLMs can provide quantitative measures of mental states. The reliability of these hinges upon how informative the questions we ask participants are. Used correctly, this approach could supplement mental state assessment in a variety of settings.
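
The hidden-state-to-factor-score step amounts to ridge regression, which can be sketched with synthetic data. All shapes and numbers below are illustrative assumptions, not the study's features.

```python
import numpy as np
from sklearn.linear_model import Ridge

# Hypothetical setup: rows are participants, columns are LLM hidden-state
# features extracted from their open-ended answers; targets stand in for
# questionnaire factor scores. All values are synthetic.
rng = np.random.default_rng(0)
hidden_states = rng.normal(size=(422, 64))        # (participants, features)
factor_scores = hidden_states[:, :3].sum(axis=1) + rng.normal(scale=0.1, size=422)

model = Ridge(alpha=1.0).fit(hidden_states, factor_scores)
print("R^2 on training data:", round(model.score(hidden_states, factor_scores), 3))
```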

[NLP-15] The Multilingual Mind: A Survey of Multilingual Reasoning in Language Models

[Quick Read]: This survey addresses the integration and application of multilingual reasoning in language models (LMs). Its key focus is on the misalignment, biases, and low-resource challenges that arise when logical reasoning crosses languages, and it systematically reviews existing methods, standard data resources, and evaluation benchmarks. The paper also analyzes state-of-the-art methods and their performance on these benchmarks, and explores future research opportunities for improving LMs' ability to handle diverse languages and complex reasoning tasks.

Link: https://arxiv.org/abs/2502.09457
Authors: Akash Ghosh, Debayan Datta, Sriparna Saha, Chirag Agarwal
Affiliations: Department of Computer Science and Engineering, Indian Institute of Technology Patna; University of Virginia
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:While reasoning and multilingual capabilities in Language Models (LMs) have achieved remarkable progress in recent years, their integration into a unified paradigm, multilingual reasoning, is at a nascent stage. Multilingual reasoning requires language models to handle logical reasoning across languages while addressing misalignment, biases, and challenges in low-resource settings. This survey provides the first in-depth review of multilingual reasoning in LMs. In this survey, we provide a systematic overview of existing methods that leverage LMs for multilingual reasoning, specifically outlining the challenges, motivations, and foundational aspects of applying language models to reason across diverse languages. We provide an overview of the standard data resources used for training multilingual reasoning in LMs and the evaluation benchmarks employed to assess their multilingual capabilities. Next, we analyze various state-of-the-art methods and their performance on these benchmarks. Finally, we explore future research opportunities to improve multilingual reasoning in LMs, focusing on enhancing their ability to handle diverse languages and complex reasoning tasks.

[NLP-16] Pixel-Level Reasoning Segmentation via Multi-turn Conversations

[Quick Read]: This paper addresses the limitation of existing visual perception systems, which focus on region-level segmentation in single-turn dialogues and rely on complex, explicit query instructions; such systems can neither reason at the pixel level nor follow dynamic user intent that changes over interaction. To tackle this, the paper introduces Pixel-level Reasoning Segmentation (Pixel-level RS) based on multi-turn conversations, tracking evolving user intent across turns for fine-grained segmentation. The key contributions are the Pixel-level ReasonIng Segmentation Dataset Based on Multi-Turn Conversations (PRIST) and the MIRAS (Multi-turn Interactive ReAsoning Segmentation) framework, which integrates pixel-level segmentation with robust multi-turn conversation understanding and generates pixel-grounded explanations aligned with user intent.

Link: https://arxiv.org/abs/2502.09447
Authors: Dexian Cai, Xiaocui Yang, Yongkang Liu, Daling Wang, Shi Feng, Yifei Zhang, Soujanya Poria
Affiliations: School of Computer Science and Engineering, Northeastern University, Shenyang, China; Singapore University of Technology and Design, Singapore
Subjects: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Comments:

Abstract:Existing visual perception systems focus on region-level segmentation in single-turn dialogues, relying on complex and explicit query instructions. Such systems cannot reason at the pixel level and comprehend dynamic user intent that changes over interaction. Our work tackles this issue by introducing a novel task, Pixel-level Reasoning Segmentation (Pixel-level RS) based on multi-turn conversations, tracking evolving user intent via multi-turn interactions for fine-grained segmentation. To establish a benchmark for this novel task, we build a Pixel-level ReasonIng Segmentation Dataset Based on Multi-Turn Conversations (PRIST), comprising 24k utterances from 8.3k multi-turn conversational scenarios with segmentation targets. Building on PRIST, we further propose MIRAS, a Multi-turn Interactive ReAsoning Segmentation framework, integrates pixel-level segmentation with robust multi-turn conversation understanding, generating pixel-grounded explanations aligned with user intent. The PRIST dataset and MIRAS framework fill the gap in pixel-level reasoning segmentation. Experimental results on the PRIST dataset demonstrate that our method outperforms current segmentation-specific baselines in terms of segmentation and LLM-based reasoning metrics. The code and data are available at: this https URL.

[NLP-17] On multi-token prediction for efficient LLM inference

[Quick Read]: This paper investigates the inherent multi-token prediction (MTP) capabilities of large pretrained language models (LLMs) and the challenges of integrating MTP heads into frozen LLMs. The study finds that while these models can perform MTP via numerical marginalization over intermediate token probabilities, their hidden layers are strongly specialized for next-token prediction (NTP), making adaptation to MTP non-trivial. Joint training of MTP heads with the backbone improves performance but cannot fully overcome this barrier. The key open question is how to integrate MTP heads effectively so as to exploit the LLM's capacity and accelerate inference through parallel token prediction.

Link: https://arxiv.org/abs/2502.09419
Authors: Somesh Mehra, Javier Alonso Garcia, Lukas Mauch
Affiliations: European Research Center (EUREC), Sony Europe B.V.; Swiss Federal Institute of Technology Lausanne (EPFL)
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments:

Abstract:We systematically investigate multi-token prediction (MTP) capabilities within LLMs pre-trained for next-token prediction (NTP). We first show that such models inherently possess MTP capabilities via numerical marginalization over intermediate token probabilities, though performance is data-dependent and improves with model scale. Furthermore, we explore the challenges of integrating MTP heads into frozen LLMs and find that their hidden layers are strongly specialized for NTP, making adaptation non-trivial. Finally, we show that while joint training of MTP heads with the backbone improves performance, it cannot fully overcome this barrier, prompting further research in this direction. Our findings provide a deeper understanding of MTP applied to pretrained LLMs, informing strategies for accelerating inference through parallel token prediction.
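
The numerical marginalization the abstract mentions is just the law of total probability over the intermediate token: p(x_{t+2} | x_{<=t}) = sum_v p(x_{t+1}=v | x_{<=t}) * p(x_{t+2} | x_{<=t}, v). A worked toy example over a two-token vocabulary (numbers are made up, not model outputs):

```python
import numpy as np

def two_step_distribution(next_token_probs, cond_probs):
    # next_token_probs: shape (V,), the distribution p(x_{t+1}).
    # cond_probs: shape (V, V); row v is p(x_{t+2} | x_{t+1}=v).
    # Marginalizing out x_{t+1} gives p(x_{t+2}) directly.
    return next_token_probs @ cond_probs

p1 = np.array([0.7, 0.3])                 # p(x_{t+1}) over a 2-token vocab
p2 = np.array([[0.9, 0.1], [0.2, 0.8]])   # p(x_{t+2} | x_{t+1})
print(two_step_distribution(p1, p2))      # [0.69 0.31]
```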

[NLP-18] Rethinking Evaluation Metrics for Grammatical Error Correction: Why Use a Different Evaluation Process than Human?

[Quick Read]: This paper addresses the mismatch between the rankings produced by automatic evaluation metrics and human preferences in grammatical error correction (GEC). The key contribution is a new aggregation method that aligns the evaluation procedure of existing automatic metrics with the way humans evaluate, bridging the gap between automatic and human evaluation. Experiments show that this change improves results for most metrics on the SEEDA benchmark, and that BERT-based metrics sometimes even outperform GPT-4-based ones.

Link: https://arxiv.org/abs/2502.09416
Authors: Takumi Goto, Yusuke Sakai, Taro Watanabe
Affiliations: Nara Institute of Science and Technology
Subjects: Computation and Language (cs.CL)
Comments: 4 pages, 2 figures

Abstract:One of the goals of automatic evaluation metrics in grammatical error correction (GEC) is to rank GEC systems such that it matches human preferences. However, current automatic evaluations are based on procedures that diverge from human evaluation. Specifically, human evaluation derives rankings by aggregating sentence-level relative evaluation results, e.g., pairwise comparisons, using a rating algorithm, whereas automatic evaluation averages sentence-level absolute scores to obtain corpus-level scores, which are then sorted to determine rankings. In this study, we propose an aggregation method for existing automatic evaluation metrics which aligns with human evaluation methods to bridge this gap. We conducted experiments using various metrics, including edit-based metrics, n-gram based metrics, and sentence-level metrics, and show that resolving the gap improves results for most of the metrics on the SEEDA benchmark. We also found that even BERT-based metrics sometimes outperform the metrics of GPT-4. We publish our unified implementation of the metrics and meta-evaluations.
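
The proposed direction can be approximated in a few lines: convert per-sentence metric scores into pairwise wins and feed them to a rating algorithm. Elo is used below purely for illustration; the paper's choice of rating algorithm may differ.

```python
from itertools import combinations

def rank_systems_elo(sentence_scores, k=16, base=1000.0):
    # sentence_scores[system] is a list of per-sentence metric scores.
    # For each sentence, every pair of systems plays a "match" decided by
    # which score is higher; Elo ratings aggregate the match outcomes.
    ratings = {s: base for s in sentence_scores}
    n_sents = len(next(iter(sentence_scores.values())))
    for i in range(n_sents):
        for a, b in combinations(sentence_scores, 2):
            expected = 1 / (1 + 10 ** ((ratings[b] - ratings[a]) / 400))
            sa, sb = sentence_scores[a][i], sentence_scores[b][i]
            result = 1.0 if sa > sb else 0.0 if sa < sb else 0.5
            ratings[a] += k * (result - expected)
            ratings[b] -= k * (result - expected)
    return sorted(ratings.items(), key=lambda kv: -kv[1])

print(rank_systems_elo({"sysA": [0.9, 0.4, 0.7], "sysB": [0.8, 0.5, 0.6]}))
```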

[NLP-19] SQuARE: Sequential Question Answering Reasoning Engine for Enhanced Chain-of-Thought in Large Language Models

[Quick Read]: This paper addresses the shortcomings of large language models (LLMs) on complex reasoning tasks, where traditional methods such as chain-of-thought (CoT) prompting fall short of fully exploiting a model's reasoning abilities. The key contribution is SQuARE (Sequential Question Answering Reasoning Engine), a novel prompting technique that enhances reasoning through a self-interrogation paradigm: the model is prompted to generate and answer multiple auxiliary questions before tackling the main query, systematically decomposing it and encouraging a more thorough exploration of the topic.

Link: https://arxiv.org/abs/2502.09390
Authors: Daniel Fleischer, Moshe Berchansky, Gad Markovits, Moshe Wasserblat
Affiliations: Intel Labs
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 14 pages

Abstract:In the rapidly evolving field of Natural Language Processing, Large Language Models (LLMs) are tasked with increasingly complex reasoning challenges. Traditional methods like chain-of-thought prompting have shown promise but often fall short in fully leveraging a model’s reasoning capabilities. This paper introduces SQuARE (Sequential Question Answering Reasoning Engine), a novel prompting technique designed to improve reasoning through a self-interrogation paradigm. Building upon CoT frameworks, SQuARE prompts models to generate and resolve multiple auxiliary questions before tackling the main query, promoting a more thorough exploration of various aspects of a topic. Our expansive evaluations, conducted with Llama 3 and GPT-4o models across multiple question-answering datasets, demonstrate that SQuARE significantly surpasses traditional CoT prompts and existing rephrase-and-respond methods. By systematically decomposing queries, SQuARE advances LLM capabilities in reasoning tasks. The code is publicly available at this https URL.
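
A sketch of the self-interrogation loop, assuming a hypothetical `llm(prompt) -> str` completion function and illustrative prompt wording (not the paper's templates):

```python
def square_answer(llm, question, n_aux=3):
    # Self-interrogation: ask for auxiliary questions, answer each,
    # then answer the main query with those answers in context.
    aux_block = llm(f"List {n_aux} auxiliary questions, one per line, "
                    f"that would help answer: {question}")
    aux = [q.strip() for q in aux_block.splitlines() if q.strip()][:n_aux]
    qa = [(q, llm(f"Answer briefly: {q}")) for q in aux]
    context = "\n".join(f"Q: {q}\nA: {a}" for q, a in qa)
    return llm(f"{context}\n\nUsing the answers above, answer: {question}")

# Dummy LLM so the sketch runs end-to-end; swap in a real completion call.
echo_llm = lambda prompt: prompt.splitlines()[-1]
print(square_answer(echo_llm, "Why is the sky blue?"))
```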

[NLP-20] Truth Knows No Language: Evaluating Truthfulness Beyond English

[Quick Read]: This paper addresses the question of whether large language models (LLMs) remain truthful across languages. The key contribution is a professionally translated extension of the TruthfulQA benchmark covering Basque, Catalan, Galician, and Spanish, together with a systematic comparison of 12 state-of-the-art open LLMs using human evaluation, multiple-choice metrics, and LLM-as-a-Judge scoring. The results show that although LLMs perform best in English and worst in Basque (the lowest-resourced language), truthfulness discrepancies across languages are smaller than anticipated. The study also finds that LLM-as-a-Judge correlates more closely with human judgments than multiple-choice metrics, that informativeness plays a critical role in truthfulness assessment, and that machine translation offers a viable way to extend truthfulness benchmarks to additional languages. Finally, universal knowledge questions are handled better across languages than context- and time-dependent ones.

Link: https://arxiv.org/abs/2502.09387
Authors: Blanca Calvo Figueras, Eneko Sagarzazu, Julen Etxaniz, Jeremy Barnes, Pablo Gamallo, Iria De Dios Flores, Rodrigo Agerri
Affiliations: HiTZ Center - Ixa, University of the Basque Country, UPV/EHU; Elhuyar; Centro de Investigación en Tecnoloxías Intelixentes (CiTIUS), Universidade de Santiago de Compostela; Departament de Traducció i Ciències del Llenguatge, Universitat Pompeu Fabra
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
Comments: 13 pages, 5 figures, 8 tables

Abstract:We introduce a professionally translated extension of the TruthfulQA benchmark designed to evaluate truthfulness in Basque, Catalan, Galician, and Spanish. Truthfulness evaluations of large language models (LLMs) have primarily been conducted in English. However, the ability of LLMs to maintain truthfulness across languages remains under-explored. Our study evaluates 12 state-of-the-art open LLMs, comparing base and instruction-tuned models using human evaluation, multiple-choice metrics, and LLM-as-a-Judge scoring. Our findings reveal that, while LLMs perform best in English and worst in Basque (the lowest-resourced language), overall truthfulness discrepancies across languages are smaller than anticipated. Furthermore, we show that LLM-as-a-Judge correlates more closely with human judgments than multiple-choice metrics, and that informativeness plays a critical role in truthfulness assessment. Our results also indicate that machine translation provides a viable approach for extending truthfulness benchmarks to additional languages, offering a scalable alternative to professional translation. Finally, we observe that universal knowledge questions are better handled across languages than context- and time-dependent ones, highlighting the need for truthfulness evaluations that account for cultural and temporal variability. Dataset and code are publicly available under open licenses.

[NLP-21] Language Agents as Digital Representatives in Collective Decision-Making

[Quick Read]: This paper addresses how individual preferences can be expressed in collective decision-making. The key is training language models to behave as digital representatives of human agents, appropriately expressing the preferences of the individuals they stand for. Through a case study in consensus-finding among diverse humans, the paper demonstrates the feasibility of fine-tuning large language models to act as digital representatives, with practical implications for multi-agent scenario studies and mechanism design.

Link: https://arxiv.org/abs/2502.09369
Authors: Daniel Jarrett, Miruna Pîslar, Michiel A. Bakker, Michael Henry Tessler, Raphael Köster, Jan Balaguer, Romuald Elie, Christopher Summerfield, Andrea Tacchetti
Affiliations: Google DeepMind
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computers and Society (cs.CY)
Comments:

Abstract:Consider the process of collective decision-making, in which a group of individuals interactively select a preferred outcome from among a universe of alternatives. In this context, "representation" is the activity of making an individual's preferences present in the process via participation by a proxy agent – i.e. their "representative". To this end, learned models of human behavior have the potential to fill this role, with practical implications for multi-agent scenario studies and mechanism design. In this work, we investigate the possibility of training language agents to behave in the capacity of representatives of human agents, appropriately expressing the preferences of those individuals whom they stand for. First, we formalize the setting of collective decision-making – as the episodic process of interaction between a group of agents and a decision mechanism. On this basis, we then formalize the problem of digital representation – as the simulation of an agent's behavior to yield equivalent outcomes from the mechanism. Finally, we conduct an empirical case study in the setting of consensus-finding among diverse humans, and demonstrate the feasibility of fine-tuning large language models to act as digital representatives.

[NLP-22] Beyond English: The Impact of Prompt Translation Strategies across Languages and Tasks in Multilingual LLMs NAACL

[Quick Read]: This paper addresses how to choose optimal pre-translation strategies for large language models (LLMs) in multilingual settings through systematic evaluation. The key idea is to treat the prompt as a modular entity composed of four functional parts: instruction, context, examples, and output, and to assess the effect of pre-translating each part separately, determining the best strategy across a wide range of languages and task settings.

Link: https://arxiv.org/abs/2502.09331
Authors: Itai Mondshine, Tzuf Paz-Argaman, Reut Tsarfaty
Affiliations: Bar-Ilan University, Israel
Subjects: Computation and Language (cs.CL)
Comments: Accepted for NAACL findings 2025

Abstract:Despite advances in the multilingual capabilities of Large Language Models (LLMs) across diverse tasks, English remains the dominant language for LLM research and development. So, when working with a different language, this has led to the widespread practice of pre-translation, i.e., translating the task prompt into English before inference. Selective pre-translation, a more surgical approach, focuses on translating specific prompt components. However, its current use is sporadic and lacks a systematic research foundation. Consequently, the optimal pre-translation strategy for various multilingual settings and tasks remains unclear. In this work, we aim to uncover the optimal setup for pre-translation by systematically assessing its use. Specifically, we view the prompt as a modular entity, composed of four functional parts: instruction, context, examples, and output, either of which could be translated or not. We evaluate pre-translation strategies across 35 languages covering both low and high-resource languages, on various tasks including Question Answering (QA), Natural Language Inference (NLI), Named Entity Recognition (NER), and Abstractive Summarization. Our experiments show the impact of factors as similarity to English, translation quality and the size of pre-trained data, on the model performance with pre-translation. We suggest practical guidelines for choosing optimal strategies in various multilingual settings.

[NLP-23] A Judge-free LLM Open-ended Generation Benchmark Based on the Distributional Hypothesis

[Quick Read]: This paper addresses the challenge of evaluating open-ended text generation in large language models (LLMs), which is hard because there is no clear ground truth and human or LLM-based assessments are costly. The key contribution is a new benchmark that evaluates LLMs with n-gram statistics and rules, without relying on human judgment or LLM-as-a-judge approaches. Using 50 question and reference-answer sets, it introduces three new n-gram- and rule-based metrics: Fluency, Truthfulness, and Helpfulness. The benchmark correlates strongly with GPT-4o-based evaluations while requiring far fewer computational resources, demonstrating its effectiveness as a scalable alternative for assessing LLMs' open-ended generation capabilities.

Link: https://arxiv.org/abs/2502.09316
Authors: Kentaro Imajo, Masanori Hirano, Shuji Suzuki, Hiroaki Mikami
Affiliations: Preferred Networks, Inc.; Preferred Elements, Inc.
Subjects: Computation and Language (cs.CL)
Comments: 13 pages

Abstract:Evaluating the open-ended text generation of large language models (LLMs) is challenging because of the lack of a clear ground truth and the high cost of human or LLM-based assessments. We propose a novel benchmark that evaluates LLMs using n-gram statistics and rules, without relying on human judgement or LLM-as-a-judge approaches. Using 50 question and reference answer sets, we introduce three new metrics based on n-grams and rules: Fluency, Truthfulness, and Helpfulness. Our benchmark strongly correlates with GPT-4o-based evaluations while requiring significantly fewer computational resources, demonstrating its effectiveness as a scalable alternative for assessing LLMs’ open-ended generation capabilities.
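
A judge-free n-gram score of this flavor is straightforward to sketch. The metric below is a generic n-gram overlap against reference answers, meant only to convey the idea; it does not reproduce the paper's Fluency/Truthfulness/Helpfulness definitions.

```python
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def ngram_overlap(candidate, references, n=2):
    # Fraction of the candidate's n-grams that appear in any reference,
    # with clipped counts (BLEU-style, but unweighted and single-order).
    cand = Counter(ngrams(candidate.split(), n))
    ref = Counter()
    for r in references:
        ref |= Counter(ngrams(r.split(), n))  # keep max count per n-gram
    matched = sum(min(c, ref[g]) for g, c in cand.items())
    return matched / max(1, sum(cand.values()))

print(ngram_overlap("the cat sat on the mat",
                    ["a cat sat on a mat", "the cat is on the mat"]))
```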

[NLP-24] When the LM misunderstood the human chuckled: Analyzing garden path effects in humans and language models

[Quick Read]: This paper investigates how large language models (LLMs) handle garden-path constructions in a sentence comprehension task, comparing them with humans. The key is a battery of comprehension questions plus paraphrasing and text-to-image generation tasks used to test whether and how LLMs share humans' difficulties with these notoriously challenging sentences. The study shows that both LLMs and humans struggle with specific syntactic complexities, with some models correlating highly with human comprehension.

Link: https://arxiv.org/abs/2502.09307
Authors: Samuel Joseph Amouyal, Aya Meltzer-Asscher, Jonathan Berant
Affiliations: Blavatnik School of Computer Science, Tel Aviv University; Department of Linguistics, Tel Aviv University; Sagol School of Neuroscience, Tel Aviv University
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:Modern Large Language Models (LLMs) have shown human-like abilities in many language tasks, sparking interest in comparing LLMs’ and humans’ language processing. In this paper, we conduct a detailed comparison of the two on a sentence comprehension task using garden-path constructions, which are notoriously challenging for humans. Based on psycholinguistic research, we formulate hypotheses on why garden-path sentences are hard, and test these hypotheses on human participants and a large suite of LLMs using comprehension questions. Our findings reveal that both LLMs and humans struggle with specific syntactic complexities, with some models showing high correlation with human comprehension. To complement our findings, we test LLM comprehension of garden-path constructions with paraphrasing and text-to-image generation tasks, and find that the results mirror the sentence comprehension question results, further validating our findings on LLM understanding of these constructions.

[NLP-25] SparQLe: Speech Queries to Text Translation Through LLM s

[Quick Read]: This paper aims to integrate speech representations with large language models (LLMs) for more seamless multimodal processing and speech understanding. The key is a novel approach that couples self-supervised speech representations with instruction-tuned LLMs, using a modality adapter to align the extracted speech features with the LLM. This effectively preserves the semantic content of the input speech and provides an effective bridge from self-supervised speech models to instruction-tuned LLMs for a variety of speech understanding applications.

Link: https://arxiv.org/abs/2502.09284
Authors: Amirbek Djanibekov, Hanan Aldarmaki
Affiliations: Mohamed bin Zayed University of Artificial Intelligence
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:With the growing influence of Large Language Models (LLMs), there is increasing interest in integrating speech representations with them to enable more seamless multi-modal processing and speech understanding. This study introduces a novel approach that leverages self-supervised speech representations in combination with instruction-tuned LLMs for speech-to-text translation. The proposed approach leverages a modality adapter to align extracted speech features with instruction-tuned LLMs using English-language data. Our experiments demonstrate that this method effectively preserves the semantic content of the input speech and serves as an effective bridge between self-supervised speech models and instruction-tuned LLMs, offering a promising solution for various speech understanding applications.

[NLP-26] he Joint Entity-Relation Extraction Model Based on Span and Interactive Fusion Representation for Chinese Medical Texts with Complex Semantics

[Quick Read]: This paper addresses joint entity-relation extraction for Chinese text with complex semantics, particularly in the medical domain. The key to the solution is an SEA module that strengthens the extraction of complex contextual semantic information, combined with an interactive fusion representation module that enables bidirectional information exchange between entity recognition and relation extraction, improving performance on such tasks. Experiments show an entity-recognition F1 of 96.73% and a relation-extraction F1 of 78.43% on the CH-DDI dataset, and an entity-recognition precision of 89.54% and relation-extraction accuracy of 71.64% on CoNLL04.

Link: https://arxiv.org/abs/2502.09247
Authors: Danni Feng, Runzhi Li, Jing Wang, Siyu Yan, Lihong Ma, Yunli Xing
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:Joint entity-relation extraction is a critical task in transforming unstructured or semi-structured text into triplets, facilitating the construction of large-scale knowledge graphs, and supporting various downstream applications. Despite its importance, research on Chinese text, particularly with complex semantics in specialized domains like medicine, remains limited. To address this gap, we introduce the CH-DDI, a Chinese drug-drug interactions dataset designed to capture the intricacies of medical text. Leveraging the strengths of attention mechanisms in capturing long-range dependencies, we propose the SEA module, which enhances the extraction of complex contextual semantic information, thereby improving entity recognition and relation extraction. Additionally, to address the inefficiencies of existing methods in facilitating information exchange between entity recognition and relation extraction, we present an interactive fusion representation module. This module employs Cross Attention for bidirectional information exchange between the tasks and further refines feature extraction through BiLSTM. Experimental results on both our CH-DDI dataset and public CoNLL04 dataset demonstrate that our model exhibits strong generalization capabilities. On the CH-DDI dataset, our model achieves an F1-score of 96.73% for entity recognition and 78.43% for relation extraction. On the CoNLL04 dataset, it attains an entity recognition precision of 89.54% and a relation extraction accuracy of 71.64%.

[NLP-27] You Do Not Fully Utilize Transformer's Representation Capacity

[Quick Read]: This paper addresses the representation collapse and suboptimal performance caused by standard Transformers using only the representations of the immediately preceding layer when processing sequences. The key solution is Layer-Integrated Memory (LIMe), which allows access to hidden states from earlier layers, expanding the model's representational capacity while keeping its overall memory footprint.

Link: https://arxiv.org/abs/2502.09245
Authors: Gleb Gerasimov, Yaroslav Aksenov, Nikita Balagansky, Viacheslav Sinii, Daniil Gavrilov
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments:

Abstract:In contrast to RNNs, which compress previous tokens into a single hidden state, Transformers can attend to all previous tokens directly. However, standard Transformers only use representations from the immediately preceding layer. In this paper, we show that this design choice causes representation collapse and leads to suboptimal performance. To address this issue, we introduce Layer-Integrated Memory (LIMe), a simple yet powerful approach that preserves the model’s overall memory footprint while expanding its representational capacity by allowing access to hidden states from earlier layers. Through extensive experiments across various architectures and different lookup mechanisms, we demonstrate consistent performance improvements on a wide range of tasks. Moreover, our analysis of the learned representation dynamics and our exploration of depthwise circuits reveal how LIMe integrates information across layers, pointing to promising directions for future research.
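
A guess at the core mechanism: let a layer read a learned mixture of all earlier layers' hidden states rather than only the last one. The module below is a sketch inferred from the abstract, not the authors' implementation; all names are illustrative.

```python
import torch
import torch.nn as nn

class LayerIntegratedMixer(nn.Module):
    # Learns a convex mixture over the hidden states of earlier layers,
    # so layer L can attend to more than just layer L-1's output.
    def __init__(self, num_prev_layers: int):
        super().__init__()
        self.logits = nn.Parameter(torch.zeros(num_prev_layers))

    def forward(self, prev_states: torch.Tensor) -> torch.Tensor:
        # prev_states: (num_prev_layers, batch, seq, dim)
        weights = torch.softmax(self.logits, dim=0)
        return torch.einsum("l,lbsd->bsd", weights, prev_states)

mixer = LayerIntegratedMixer(num_prev_layers=4)
states = torch.randn(4, 2, 8, 16)
print(mixer(states).shape)  # torch.Size([2, 8, 16])
```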

[NLP-28] Reliable Conversational Agents under ASP Control that Understand Natural Language

[Quick Read]: This paper addresses the lack of understanding and reliability of large language models (LLMs) in human-like conversation. The key to the solution is to use LLMs only as parsers that translate text into knowledge and vice versa, and to carry out the conversation by reasoning over that knowledge with Answer Set Programming (ASP). This framework has been used to build both task-specific chatbots and socialbots, enabling more reliable human-machine interaction.

Link: https://arxiv.org/abs/2502.09237
Authors: Yankai Zeng (The University of Texas at Dallas)
Affiliations: Unknown
Subjects: Logic in Computer Science (cs.LO); Computation and Language (cs.CL)
Comments: In Proceedings ICLP 2024, arXiv:2502.08453

Abstract:Efforts have been made to make machines converse like humans in the past few decades. The recent techniques of Large Language Models (LLMs) make it possible to have human-like conversations with machines, but LLM’s flaws of lacking understanding and reliability are well documented. We believe that the best way to eliminate this problem is to use LLMs only as parsers to translate text to knowledge and vice versa and carry out the conversation by reasoning over this knowledge using the answer set programming. I have been developing a framework based on LLMs and ASP to realize reliable chatbots that “understand” human conversation. This framework has been used to develop task-specific chatbots as well as socialbots. My future research is focused on making these chatbots scalable and trainable.

[NLP-29] Answer Set Counting and its Applications

[Quick Read]: This paper targets exact and approximate answer set counting (ASC). The key contributions are sharpASP, an exact answer set counter that uses a compact encoding of propositional formulas to significantly improve efficiency over existing, often inefficiently encoded methods, and ApproxASP, a hashing-based approximate counter that integrates Gauss-Jordan elimination into the ASP solver clingo. These methods show superior performance on several benchmarks, notably in network reliability estimation.

Link: https://arxiv.org/abs/2502.09231
Authors: Mohimenul Kabir (National University of Singapore)
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Logic in Computer Science (cs.LO)
Comments: In Proceedings ICLP 2024, arXiv:2502.08453

Abstract:We have focused on Answer Set Programming (ASP), more specifically, answer set counting, exploring both exact and approximate methodologies. We developed an exact ASP counter, sharpASP, which utilizes a compact encoding for propositional formulas, significantly enhancing efficiency compared to existing methods that often struggle with inefficient encodings. Our evaluations indicate that sharpASP outperforms current ASP counters on several benchmarks. In addition, we proposed an approximate ASP counter, named ApproxASP, a hashing-based counter integrating Gauss-Jordan elimination within the ASP solver, clingo. As a practical application, we employed ApproxASP for network reliability estimation, demonstrating superior performance over both traditional reliability estimators and #SAT-based methods.
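
Hashing-based counting can be sketched generically: intersect the solution set with random parity (XOR) constraints and scale the surviving count back up. The toy below enumerates solutions explicitly and uses pairwise XORs only so it runs stand-alone; a real counter like ApproxASP instead queries the ASP solver under the constraints, and its XORs span arbitrary variable subsets.

```python
import itertools, random

def approx_count(solutions, vars_, m, trials=10):
    # Add m random XOR constraints; each roughly halves the solution set,
    # so the surviving "cell" count times 2^m estimates the total.
    estimates = []
    for _ in range(trials):
        xors = [(random.sample(vars_, 2), random.getrandbits(1)) for _ in range(m)]
        cell = [s for s in solutions
                if all((s[a] ^ s[b]) == rhs for (a, b), rhs in xors)]
        estimates.append(len(cell) * 2 ** m)
    estimates.sort()
    return estimates[len(estimates) // 2]  # median over trials

vars_ = [0, 1, 2, 3]
solutions = [dict(enumerate(bits)) for bits in itertools.product([0, 1], repeat=4)]
print(approx_count(solutions, vars_, m=2))  # close to 16
```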

[NLP-30] Mind the Gaps: Logical English Prolog and Multi-agent Systems for Autonomous Vehicles

[Quick Read]: This paper addresses the legal aspects of traffic rules governing interactions between autonomous vehicles and human drivers, especially in urban environments. The key solution is a modular system with three main components: a natural language interface in Logical English that encodes the rules; an internal Prolog representation of the rules; and a multi-agent simulation environment built in NetLogo. The components interact by translating between Logical English and Prolog and by interfacing Prolog with NetLogo through predicates. This modular approach lets the components carry different responsibilities within the overall system, allows modules to be swapped, and enables visualization and dynamic validation of the system.

Link: https://arxiv.org/abs/2502.09216
Authors: Galileo Sartor (Swansea University), Adam Wyner (Swansea University), Giuseppe Contissa (University of Bologna)
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Logic in Computer Science (cs.LO); Multiagent Systems (cs.MA)
Comments: In Proceedings ICLP 2024, arXiv:2502.08453

Abstract:In this paper, we present a modular system for representing and reasoning with legal aspects of traffic rules for autonomous vehicles. We focus on a subset of the United Kingdom’s Highway Code (HC) related to junctions. As human drivers and automated vehicles (AVs) will interact on the roads, especially in urban environments, we claim that an accessible, unitary, high-level computational model should exist and be applicable to both users. Autonomous vehicles introduce a shift in liability that should not bring disadvantages or increased burden on human drivers. We develop a system “in silico” of the model. The proposed system is built of three main components: a natural language interface, using Logical English, which encodes the rules; an internal representation of the rules in Prolog; and an multi-agent-based simulation environment, built in NetLogo. The three components interact: Logical English is translated into and out of Prolog (along with some support code); Prolog and NetLogo interface via predicates. Such a modular approach enables the different components to carry different “burdens” in the overall system; it also allows swapping of modules. Given NetLogo, we can visualize the effect of the modeled rules as well as validate the system with a simple dynamic running scenario. Designated agents monitor the behaviour of the vehicles for compliance and record potential violations where they occur. The information on potential violations is then utilized by Validators, to determine whether the violation is punishable, differentiating between exceptions and cases.
zh

[NLP-31] Neuro-Symbolic Contrastive Learning for Cross-domain Inference

【速读】: 该论文旨在解决预训练语言模型(Pre-trained Language Models, PLMs)在自然语言推理(Natural Language Inference, NLI)任务中的两个主要问题:对文本扰动的敏感性和对大规模数据集的依赖,这些问题表明模型过度依赖浅层启发式方法。同时,论文也关注到归纳逻辑编程(Inductive Logic Programming, ILP)在处理离散、稀疏和有限数据集时的优势。论文的关键解决方案是提出神经符号对比学习(Neuro-Symbolic Contrastive Learning),通过这一方法实现平滑且可微优化,从而在原本离散、嘈杂和稀疏的逻辑函数拓扑空间中提升逻辑准确性。该方法通过将数据表示为逻辑程序和逻辑规则集,在神经符号范式下有效嵌入抽象逻辑关系,并能捕捉具有相似语义逻辑关系的高度多样化的文本信息,同时也能区分具有不同逻辑关系的相似文本关系。

链接: https://arxiv.org/abs/2502.09213
作者: Mingyue Liu(Durham University),Ryo Ueda(University of Tokyo),Zhen Wan(Kyoto University),Katsumi Inoue(National Institute of Informatics),Chris G. Willcocks(Durham University)
机构: 未知
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注: In Proceedings ICLP 2024, arXiv:2502.08453

点击查看摘要

Abstract:Pre-trained language models (PLMs) have made significant advances in natural language inference (NLI) tasks, however their sensitivity to textual perturbations and dependence on large datasets indicate an over-reliance on shallow heuristics. In contrast, inductive logic programming (ILP) excels at inferring logical relationships across diverse, sparse and limited datasets, but its discrete nature requires the inputs to be precisely specified, which limits their application. This paper proposes a bridge between the two approaches: neuro-symbolic contrastive learning. This allows for smooth and differentiable optimisation that improves logical accuracy across an otherwise discrete, noisy, and sparse topological space of logical functions. We show that abstract logical relationships can be effectively embedded within a neuro-symbolic paradigm, by representing data as logic programs and sets of logic rules. The embedding space captures highly varied textual information with similar semantic logical relations, but can also separate similar textual relations that have dissimilar logical relations. Experimental results demonstrate that our approach significantly improves the inference capabilities of the models in terms of generalisation and reasoning.
zh

[NLP-32] LP-LM: No Hallucinations in Question Answering with Logic Programming

【速读】: 该论文旨在解决大型语言模型(LLMs)在生成回答时存在的幻觉问题,即生成不准确或虚构的信息。论文的关键解决方案是提出LP-LM系统,该系统通过语义解析和Prolog逻辑编程,将输入问题转换为Prolog谓词逻辑表达,并与知识库中的事实进行匹配,从而确保生成的回答可靠且基于已知事实。LP-LM借助确定子句文法(Definite Clause Grammar, DCG)与表驱动(tabling)技术,在输入句子规模上实现线性时间处理,适用于大规模输入。
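
为帮助理解“把问题解析为逻辑项并与知识库事实匹配”这一思路,下面给出一个高度简化的 Python 示意(假设性示例,并非 LP-LM 的真实实现;原系统使用 Prolog 的 DCG 解析与表驱动执行):

```python
# 从自然语言句子解析得到的事实,以逻辑项(元组)表示
KB = {
    ("capital_of", "paris", "france"),
    ("capital_of", "rome", "italy"),
}

def match(query, fact):
    """将含变量(以 '?' 开头)的查询项与基事实做简单合一,返回绑定或 None。"""
    if len(query) != len(fact):
        return None
    bindings = {}
    for q, f in zip(query, fact):
        if q.startswith("?"):
            if bindings.get(q, f) != f:
                return None
            bindings[q] = f
        elif q != f:
            return None
    return bindings

def answer(query):
    # 答案只来自知识库匹配,不做自由生成,因而不会产生幻觉
    return [b for fact in KB if (b := match(query, fact)) is not None]

# "What is the capital of France?" 解析为 (capital_of, ?x, france)
print(answer(("capital_of", "?x", "france")))  # [{'?x': 'paris'}]
```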

链接: https://arxiv.org/abs/2502.09212
作者: Katherine Wu,Yanhong A. Liu
机构: 未知
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: In Proceedings ICLP 2024, arXiv:2502.08453

点击查看摘要

Abstract:Large language models (LLMs) are able to generate human-like responses to user queries. However, LLMs exhibit inherent limitations, especially because they hallucinate. This paper introduces LP-LM, a system that grounds answers to questions in known facts contained in a knowledge base (KB), facilitated through semantic parsing in Prolog, and always produces answers that are reliable. LP-LM generates a most probable constituency parse tree along with a corresponding Prolog term for an input question via Prolog definite clause grammar (DCG) parsing. The term is then executed against a KB of natural language sentences also represented as Prolog terms for question answering. By leveraging DCG and tabling, LP-LM runs in linear time in the size of input sentences for sufficiently many grammar rules. Performing experiments comparing LP-LM with current well-known LLMs in accuracy, we show that LLMs hallucinate on even simple questions, unlike LP-LM.
zh

[NLP-33] Thinking beyond the anthropomorphic paradigm benefits LLM research

【速读】: 该论文旨在探讨和解决大型语言模型(Large Language Models, LLMs)研究领域中普遍存在的拟人化概念(anthropomorphism)及其限制性影响。论文的关键在于识别并分析了五种核心的拟人化假设,并提出了非拟人化的替代方法,从而为理解和改进LLMs开辟新的研究和发展路径,超越人类类比的局限。

链接: https://arxiv.org/abs/2502.09192
作者: Lujain Ibrahim,Myra Cheng
机构: 未知
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Anthropomorphism, or the attribution of human traits to technology, is an automatic and unconscious response that occurs even in those with advanced technical expertise. In this position paper, we analyze hundreds of thousands of computer science research articles from the past decade and present empirical evidence of the prevalence and growth of anthropomorphic terminology in research on large language models (LLMs). This terminology reflects deeper anthropomorphic conceptualizations which shape how we think about and conduct LLM research. We argue these conceptualizations may be limiting, and that challenging them opens up new pathways for understanding and improving LLMs beyond human analogies. To illustrate this, we identify and analyze five core anthropomorphic assumptions shaping prominent methodologies across the LLM development lifecycle, from the assumption that models must use natural language for reasoning tasks to the assumption that model capabilities should be evaluated through human-centric benchmarks. For each assumption, we demonstrate how non-anthropomorphic alternatives can open new directions for research and development.
zh

[NLP-34] Matina: A Large-Scale 73B Token Persian Text Corpus

【速读】: 该论文旨在解决由于数据资源有限导致的波斯语(Persian)文本数据集规模小且多样性不足的问题,这限制了自然语言处理(NLP)模型和开源大型语言模型(LLMs)的发展。论文的关键解决方案是引入Matina语料库,这是一个包含729亿个词元的新波斯语数据集,通过精心预处理和去重确保高数据质量。

链接: https://arxiv.org/abs/2502.09188
作者: Sara Bourbour Hosseinbeigi,Fatemeh Taherinezhad,Heshaam Faili,Hamed Baghbani,Fatemeh Nadi,Mostafa Amiri
机构: Tarbiat Modares University; University of Tehran (德黑兰大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Text corpora are essential for training models used in tasks like summarization, translation, and large language models (LLMs). While various efforts have been made to collect monolingual and multilingual datasets in many languages, Persian has often been underrepresented due to limited resources for data collection and preprocessing. Existing Persian datasets are typically small and lack content diversity, consisting mainly of weblogs and news articles. This shortage of high-quality, varied data has slowed the development of NLP models and open-source LLMs for Persian. Since model performance depends heavily on the quality of training data, we address this gap by introducing the Matina corpus, a new Persian dataset of 72.9B tokens, carefully preprocessed and deduplicated to ensure high data quality. We further assess its effectiveness by training and evaluating transformer-based models on key NLP tasks. Both the dataset and preprocessing codes are publicly available, enabling researchers to build on and improve this resource for future Persian NLP advancements.
zh

[NLP-35] RefineCoder: Iterative Improving of Large Language Models via Adaptive Critique Refinement for Code Generation

【速读】: 该论文旨在解决现有代码生成方法依赖教师模型蒸馏且忽视自生成代码迭代精炼潜力的问题。关键在于提出了一种自适应批判精炼(Adaptive Critique Refinement, ACR)策略,通过引入大型语言模型作为评判者(LLM-as-a-Judge)的综合评分系统来评估代码响应质量,并采用大型语言模型作为评论者(LLM-as-a-Critic)的筛选性评论策略来批判低质量的自生成代码响应,从而实现模型的自我优化与性能提升。

链接: https://arxiv.org/abs/2502.09183
作者: Changzhi Zhou,Xinyu Zhang,Dandan Song,Xiancai Chen,Wanli Gu,Huipeng Ma,Yuhang Tian,Mengdi Zhang,Linmei Hu
机构: School of Computer Science and Technology, Beijing Institute of Technology(北京理工大学计算机学院), Beijing, China; Peking University(北京大学); Meituan(美团)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: work in process

点击查看摘要

Abstract:Code generation has attracted increasing attention with the rise of Large Language Models (LLMs). Many studies have developed powerful code LLMs by synthesizing code-related instruction data and applying supervised fine-tuning. However, these methods are limited by teacher model distillation and ignore the potential of iterative refinement by self-generated code. In this paper, we propose Adaptive Critique Refinement (ACR), which enables the model to refine itself by self-generated code and external critique, rather than directly imitating the code responses of the teacher model. Concretely, ACR includes a composite scoring system with LLM-as-a-Judge to evaluate the quality of code responses and a selective critique strategy with LLM-as-a-Critic to critique self-generated low-quality code responses. We develop the RefineCoder series by iteratively applying ACR, achieving continuous performance improvement on multiple code generation benchmarks. Compared to the baselines of the same size, our proposed RefineCoder series can achieve comparable or even superior performance using less data.
zh

[NLP-36] FLAME: Flexible LLM-Assisted Moderation Engine

【速读】: 该论文旨在解决大型语言模型(LLMs)在用户交互过程中面临的显著挑战,特别是对抗性攻击中的“越狱”技术,这些技术能够规避内容安全措施。当前主要依赖输入提示过滤的内容审核系统已证明不足以应对这些挑战,例如Best-of-N (BoN) “越狱”技术的成功率可高达80%以上。论文提出的关键解决方案是Flexible LLM-Assisted Moderation Engine (FLAME),它将重点从输入过滤转移到输出审核。FLAME通过评估模型响应而非分析用户查询,从而提供计算效率高、对抗BoN“越狱”攻击具有更强抵抗力,并且可通过自定义主题过滤灵活定义和更新安全标准等优势。实验结果表明,FLAME显著优于现有的审核系统,在减少GPT-4o-mini和DeepSeek-v3的攻击成功率方面效果尤为明显。
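
“从输入过滤转向输出审核”的基本流程可以用如下 Python 草图示意(假设性示例;generate 与 topic_classifier 均为占位实现,并非 FLAME 的真实接口):

```python
BLOCKED_TOPICS = {"weapons", "self_harm"}  # 可自定义、可随时更新的安全标准

def generate(prompt):
    # 底层 LLM 调用的占位实现;此处返回示意文本
    return "Here is a harmless answer to: " + prompt

def topic_classifier(text):
    # 主题过滤器的占位实现;真实系统会使用训练好的分类器
    keywords = {"bomb": "weapons", "gun": "weapons"}
    return {topic for word, topic in keywords.items() if word in text.lower()}

def moderated_chat(prompt):
    response = generate(prompt)            # 输入不做任何过滤
    if topic_classifier(response) & BLOCKED_TOPICS:  # 审核发生在"输出"上
        return "抱歉,该请求无法完成。"
    return response

print(moderated_chat("What is the capital of France?"))
```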

链接: https://arxiv.org/abs/2502.09175
作者: Ivan Bakulin(1 and 2),Ilia Kopanichuk(1 and 2),Iaroslav Bespalov(1),Nikita Radchenko(3),Vladimir Shaposhnikov(1 and 4),Dmitry Dylov(1 and 4),Ivan Oseledets(1 and 4) ((1) AIRI, (2) Moscow Institute of Physics and Technology, (3) SberHealth, (4) Skolkovo Institute of Science and Technology)
机构: AIRI (AIRI); Moscow Institute of Physics and Technology (莫斯科物理技术学院); Skolkovo Institute of Science and Technology (斯科尔科沃科学技术研究院); SberHealth (SberHealth)
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:The rapid advancement of Large Language Models (LLMs) has introduced significant challenges in moderating user-model interactions. While LLMs demonstrate remarkable capabilities, they remain vulnerable to adversarial attacks, particularly “jailbreaking” techniques that bypass content safety measures. Current content moderation systems, which primarily rely on input prompt filtering, have proven insufficient, with techniques like Best-of-N (BoN) jailbreaking achieving success rates of 80% or more against popular LLMs. In this paper, we introduce Flexible LLM-Assisted Moderation Engine (FLAME): a new approach that shifts the focus from input filtering to output moderation. Unlike traditional circuit-breaking methods that analyze user queries, FLAME evaluates model responses, offering several key advantages: (1) computational efficiency in both training and inference, (2) enhanced resistance to BoN jailbreaking attacks, and (3) flexibility in defining and updating safety criteria through customizable topic filtering. Our experiments demonstrate that FLAME significantly outperforms current moderation systems. For example, FLAME reduces attack success rate in GPT-4o-mini and DeepSeek-v3 by a factor of ~9, while maintaining low computational overhead. We provide comprehensive evaluation on various LLMs and analyze the engine’s efficiency against the state-of-the-art jailbreaking. This work contributes to the development of more robust and adaptable content moderation systems for LLMs.
zh

[NLP-37] Musical Heritage Historical Entity Linking

【速读】: 该论文旨在解决历史文本中命名实体识别、分类与链接(Entity Linking, EL)的挑战,特别是针对在著名知识库(KB)中代表性不足或缺失的历史音乐领域命名实体。论文的关键解决方案在于提出了一种新的无监督实体链接模型以及一种方法,通过使用知识图谱(KG)来扩展监督实体链接器,以应对历史文档的主要难题。实验表明,依赖无监督技术,并结合基于KG和启发式规则的逻辑约束来预测未登录实体(NIL实体),可以显著提升历史文档中的实体链接性能。

链接: https://arxiv.org/abs/2502.09168
作者: Arianna Graciotti,Nicolas Lazzari,Valentina Presutti,Rocco Tripodi
机构: unknown
类目: Computation and Language (cs.CL)
备注: To appear in Artificial Intelligence Review Journal

点击查看摘要

Abstract:Linking named entities occurring in text to their corresponding entity in a Knowledge Base (KB) is challenging, especially when dealing with historical texts. In this work, we introduce Musical Heritage named Entities Recognition, Classification and Linking (MHERCL), a novel benchmark consisting of manually annotated sentences extrapolated from historical periodicals of the music domain. MHERCL contains named entities under-represented or absent in the most famous KBs. We experiment with several State-of-the-Art models on the Entity Linking (EL) task and show that MHERCL is a challenging dataset for all of them. We propose a novel unsupervised EL model and a method to extend supervised entity linkers by using Knowledge Graphs (KGs) to tackle the main difficulties posed by historical documents. Our experiments reveal that relying on unsupervised techniques and improving models with logical constraints based on KGs and heuristics to predict NIL entities (entities not represented in the KB of reference) results in better EL performance on historical documents.
zh

[NLP-38] Improving TCM Question Answering through Tree-Organized Self-Reflective Retrieval with LLMs

【速读】: 该论文旨在解决在中医(Traditional Chinese Medicine, TCM)领域内高效检索增强生成(retrieval-augmented generation, RAG)框架的不足。关键解决方案在于引入了一种基于知识组织的树形结构自反思检索(Tree-Organized Self-Reflective Retrieval, TOSRR)框架,通过构建层次化的知识库,并在推理阶段进行跨章节的信息整合,从而提升大型语言模型(Large Language Models, LLMs)在中医问答任务中的表现。
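
“树形组织知识库 + 跨章节检索整合”的核心思想可以用如下 Python 玩具示例说明(假设性示例,章节与条目均为虚构,非论文真实数据):

```python
# 玩具层次化知识库:书 -> 章节 -> 段落
KB = {
    "内科": {"感冒": ["passage A", "passage B"], "咳嗽": ["passage C"]},
    "方剂": {"解表剂": ["passage D"]},
}

def tree_retrieve(query_terms, kb, path=()):
    """沿树结构遍历;命中查询词的章节下所有段落都被收集,
    从而实现跨章节的信息整合。"""
    hits = []
    for key, child in kb.items():
        new_path = path + (key,)
        if isinstance(child, dict):
            hits += tree_retrieve(query_terms, child, new_path)
        elif any(t in key for t in query_terms):
            hits += [(new_path, p) for p in child]
    return hits

# 一个问题同时涉及"感冒"与"解表",检索结果跨越两个章节
print(tree_retrieve(["感冒", "解表"], KB))
```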

链接: https://arxiv.org/abs/2502.09156
作者: Chang Liu,Ying Chang,Jianmin Li,Yiqian Qu,Yu Li,Lingyong Cao,Shuyuan Lin
机构: 未知
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Objectives: Large language models (LLMs) can harness medical knowledge for intelligent question answering (QA), promising support for auxiliary diagnosis and medical talent cultivation. However, there is a deficiency of highly efficient retrieval-augmented generation (RAG) frameworks within the domain of Traditional Chinese Medicine (TCM). Our purpose is to observe the effect of the Tree-Organized Self-Reflective Retrieval (TOSRR) framework on LLMs in TCM QA tasks. Materials and Methods: We introduce the novel approach of knowledge organization, constructing a tree structure knowledge base with hierarchy. At inference time, our self-reflection framework retrieves from this knowledge base, integrating information across chapters. Questions from the TCM Medical Licensing Examination (MLE) and the college Classics Course Exam (CCE) were randomly selected as benchmark datasets. Results: By coupling with GPT-4, the framework can improve the best performance on the TCM MLE benchmark by 19.85% in absolute accuracy, and improve recall accuracy from 27% to 38% on CCE datasets. In manual evaluation, the framework improves a total of 18.52 points across dimensions of safety, consistency, explainability, compliance, and coherence. Conclusion: The TOSRR framework can effectively improve LLM’s capability in QA tasks of TCM.
zh

[NLP-39] A Novel Dialect-Aware Framework for the Classification of Arabic Dialects and Emotions

【速读】: 该论文旨在解决阿拉伯语中方言和情感分类的问题。解决方案的关键在于构建了一个新颖的框架,该框架能够从给定文本中识别和预测阿拉伯语方言及情感,并具备构建新的方言感知情感词典的新颖能力。所提出的框架在阿拉伯语方言分类中达到了88.9%的准确率,在埃及和海湾方言的情感检测中分别达到了89.1%和79%的准确率。

链接: https://arxiv.org/abs/2502.09128
作者: Nasser A Alsadhan
机构: 未知
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Arabic is one of the oldest languages still in use today. As a result, several Arabic-speaking regions have developed dialects that are unique to them. Dialect and emotion recognition have various uses in Arabic text analysis, such as determining an online customer’s origin based on their comments. Furthermore, intelligent chatbots that are aware of a user’s emotions can respond appropriately to the user. Current research in emotion detection in the Arabic language lacks awareness of how emotions are exhibited in different dialects, which motivates the work found in this study. This research addresses the problems of dialect and emotion classification in Arabic. Specifically, this is achieved by building a novel framework that can identify and predict Arabic dialects and emotions from a given text. The framework consists of three modules: A text-preprocessing module, a classification module, and a clustering module with the novel capability of building new dialect-aware emotion lexicons. The proposed framework generated a new emotional lexicon for different dialects. It achieved an accuracy of 88.9% in classifying Arabic dialects, which outperforms the state-of-the-art results by 6.45 percentage points. Furthermore, the framework achieved 89.1% and 79% accuracy in detecting emotions in the Egyptian and Gulf dialects, respectively.
zh

[NLP-40] The influence of visual and linguistic cues on ignorance inference in Vision-Language Models (VLMs)

【速读】: 该论文旨在探究视觉-语言模型(Vision-Language Models, VLMs)如何处理基于视觉和语言线索的无知蕴涵(ignorance implicatures),并特别关注语境(精确与近似语境)和修饰词类型(裸数值、最高级和比较级修饰词)的影响。研究发现,尽管两种模型(GPT-4o 和 Gemini 1.5 Pro)对语言线索(修饰词)敏感,但在处理基于视觉线索(语境)的无知蕴涵时未能像人类一样进行推理。关键在于,模型在语境影响下的表现较弱且不一致,表明VLMs在实用推理方面存在挑战。同时,最高级修饰词比比较级修饰词更强烈地关联到无知蕴涵,支持了语义视角。因此,论文强调需要进一步改进VLMs,使其能够以依赖上下文的方式处理语言-视觉信息,从而实现类似人类的实用推理能力。

链接: https://arxiv.org/abs/2502.09120
作者: Ye-eun Cho,Yunho Maeng
机构: Sungkyunkwan University (成均馆大学); Ewha Womans University (梨花女子大学)
类目: Computation and Language (cs.CL)
备注: 13 pages, 3 figures, 3 tables

点击查看摘要

Abstract:This study explored how Vision-Language Models (VLMs) process ignorance implicatures with visual and linguistic cues. Particularly, we focused on the effects of contexts (precise and approximate contexts) and modifier types (bare numerals, superlative, and comparative modifiers), which were considered pragmatic and semantic factors respectively. Methodologically, we conducted a truth-value judgment task in visually grounded settings using GPT-4o and Gemini 1.5 Pro. The results indicate that while both models exhibited sensitivity to linguistic cues (modifier), they failed to process ignorance implicatures with visual cues (context) as humans do. Specifically, the influence of context was weaker and inconsistent across models, indicating challenges in pragmatic reasoning for VLMs. On the other hand, superlative modifiers were more strongly associated with ignorance implicatures as compared to comparative modifiers, supporting the semantic view. These findings highlight the need for further advancements in VLMs to process language-vision information in a context-dependent way to achieve human-like pragmatic inference.
zh

[NLP-41] Logical Reasoning in Large Language Models: A Survey

【速读】: 该论文旨在探讨大型语言模型(Large Language Models, LLMs)在逻辑推理方面的表现,并评估其进行严格逻辑推理的能力。论文的关键在于分析不同推理范式(演绎、归纳、溯因和类比推理)下LLMs的现有能力,并提出增强推理性能的策略,包括数据集优化调优、强化学习、解码策略以及神经符号方法。

链接: https://arxiv.org/abs/2502.09100
作者: Hanmeng Liu,Zhizhang Fu,Mengru Ding,Ruoxi Ning,Chaoli Zhang,Xiaozhang Liu,Yue Zhang
机构: Westlake University (西湖大学); Zhejiang Normal University (浙江师范大学); Hainan University (海南大学)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:With the emergence of advanced reasoning models like OpenAI o3 and DeepSeek-R1, large language models (LLMs) have demonstrated remarkable reasoning capabilities. However, their ability to perform rigorous logical reasoning remains an open question. This survey synthesizes recent advancements in logical reasoning within LLMs, a critical area of AI research. It outlines the scope of logical reasoning in LLMs, its theoretical foundations, and the benchmarks used to evaluate reasoning proficiency. We analyze existing capabilities across different reasoning paradigms - deductive, inductive, abductive, and analogical - and assess strategies to enhance reasoning performance, including data-centric tuning, reinforcement learning, decoding strategies, and neuro-symbolic approaches. The review concludes with future directions, emphasizing the need for further exploration to strengthen logical reasoning in AI systems.
zh

[NLP-42] A Hybrid Transformer Model for Fake News Detection: Leveraging Bayesian Optimization and Bidirectional Recurrent Unit

【速读】: 该论文旨在解决假新闻分类问题。关键在于提出了一种优化的Transformer模型,该模型集成了双向门控循环单元(BiGRU)与贝叶斯算法。通过使用TF-IDF方法提取新闻文本特征,并将其转化为数值表示以促进后续机器学习任务。实验结果表明,该优化模型在训练集上达到了100%的准确率,在测试集上也表现出色,且贝叶斯算法的引入进一步提升了测试集的准确率至99.73%,展示了其在提高假新闻检测能力方面的有效性与快速分类性能。
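
下面给出“TF-IDF 特征提取 + 双向 GRU 分类头”的最小 PyTorch 示意(假设性示例:为使代码自包含,直接把 TF-IDF 向量按维度展开为序列输入,真实系统的输入表示与超参数可能不同):

```python
import torch
import torch.nn as nn
from sklearn.feature_extraction.text import TfidfVectorizer

texts = ["breaking: celebrity endorses miracle cure",
         "government releases quarterly economic report"]
labels = torch.tensor([1.0, 0.0])  # 1 = 假新闻, 0 = 真新闻(玩具标签)

# TF-IDF 将每条新闻文本转换为数值特征向量
vec = TfidfVectorizer(max_features=32)
X = torch.tensor(vec.fit_transform(texts).toarray(), dtype=torch.float32)

class BiGRUClassifier(nn.Module):
    def __init__(self, feat_dim, hidden=16):
        super().__init__()
        # 简化处理:把每个 TF-IDF 特征当作长度为 feat_dim 的序列中的一步
        self.gru = nn.GRU(input_size=1, hidden_size=hidden,
                          batch_first=True, bidirectional=True)
        self.head = nn.Linear(2 * hidden, 1)

    def forward(self, x):
        out, _ = self.gru(x.unsqueeze(-1))   # (B, feat_dim, 2*hidden)
        return self.head(out[:, -1]).squeeze(-1)

model = BiGRUClassifier(X.shape[1])
loss = nn.BCEWithLogitsLoss()(model(X), labels)
loss.backward()  # 真实训练中此处还需优化器步进;贝叶斯优化用于搜索超参数
print(float(loss))
```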

链接: https://arxiv.org/abs/2502.09097
作者: Tianyi Huang,Zeqiu Xu,Peiyang Yu,Jingyuan Yi,Xiaochuan Xu
机构: 未知
类目: Computation and Language (cs.CL)
备注: 6 pages, 7 figures

点击查看摘要

Abstract:In this paper, we propose an optimized Transformer model that integrates Bayesian algorithms with a Bidirectional Gated Recurrent Unit (BiGRU), and apply it to fake news classification for the first time. First, we employ the TF-IDF method to extract features from news texts and transform them into numeric representations to facilitate subsequent machine learning tasks. Two sets of experiments are then conducted for fake news detection and classification: one using a Transformer model optimized only with BiGRU, and the other incorporating Bayesian algorithms into the BiGRU-based Transformer. Experimental results show that the BiGRU-optimized Transformer achieves 100% accuracy on the training set and 99.67% on the test set, while the addition of the Bayesian algorithm maintains 100% accuracy on the training set and slightly improves test-set accuracy to 99.73%. This indicates that the Bayesian algorithm boosts model accuracy by 0.06%, further enhancing the detection capability for fake news. Moreover, the proposed algorithm converges rapidly at around the 10th training epoch with accuracy nearing 100%, demonstrating both its effectiveness and its fast classification ability. Overall, the optimized Transformer model, enhanced by the Bayesian algorithm and BiGRU, exhibits excellent continuous learning and detection performance, offering a robust technical means to combat the spread of fake news in the current era of information overload.
zh

[NLP-43] A Hybrid Model for Few-Shot Text Classification Using Transfer and Meta-Learning

【速读】: 该论文旨在解决在少量样本场景下获取标注数据昂贵且困难的问题。解决方案的关键在于提出了一种基于迁移学习(Transfer Learning)和元学习(Meta-Learning)的少样本文本分类模型。通过迁移学习利用预训练模型的知识,并通过元学习机制优化模型在少量样本任务中的快速适应能力。实验结果表明,该方法在少量和中等样本条件下显著优于传统机器学习和深度学习方法。

链接: https://arxiv.org/abs/2502.09086
作者: Jia Gao,Shuangquan Lyu,Guiran Liu,Binrong Zhu,Hongye Zheng,Xiaoxuan Liao
机构: 未知
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:With the continuous development of natural language processing (NLP) technology, text classification tasks have been widely used in multiple application fields. However, obtaining labeled data is often expensive and difficult, especially in few-shot learning scenarios. To solve this problem, this paper proposes a few-shot text classification model based on transfer learning and meta-learning. The model uses the knowledge of the pre-trained model for transfer and optimizes the model’s rapid adaptability in few-sample tasks through a meta-learning mechanism. Through a series of comparative experiments and ablation experiments, we verified the effectiveness of the proposed method. The experimental results show that under the conditions of few samples and medium samples, the model based on transfer learning and meta-learning significantly outperforms traditional machine learning and deep learning methods. In addition, ablation experiments further analyzed the contribution of each component to the model performance and confirmed the key role of transfer learning and meta-learning in improving model accuracy. Finally, this paper discusses future research directions and looks forward to the potential of this method in practical applications.
zh

[NLP-44] Show Me the Work: Fact-Checkers' Requirements for Explainable Automated Fact-Checking

【速读】: 该论文旨在解决大型语言模型和生成式 AI (Generative AI) 在线应用中日益增长且复杂的虚假信息问题,通过提供有效的自动化事实核查系统来辅助事实核查员。论文的关键在于探索如何使这些系统的解释与事实核查员的实际决策和推理过程相契合,以实现无缝集成。研究通过半结构化访谈揭示了事实核查员评估证据、做出决策及解释其过程的方法,并识别出事实核查员对于自动化事实核查工具的解释需求。论文指出,有效的解释需能够追踪模型的推理路径,引用具体证据,并明确不确定性及信息缺口。

链接: https://arxiv.org/abs/2502.09083
作者: Greta Warren,Irina Shklovski,Isabelle Augenstein
机构: Department of Computer Science, University of Copenhagen(计算机科学系, 哥本哈根大学); University of Copenhagen(哥本哈根大学); Linköping University(林雪平大学)
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: Conditionally accepted to CHI’25

点击查看摘要

Abstract:The pervasiveness of large language models and generative AI in online media has amplified the need for effective automated fact-checking to assist fact-checkers in tackling the increasing volume and sophistication of misinformation. The complex nature of fact-checking demands that automated fact-checking systems provide explanations that enable fact-checkers to scrutinise their outputs. However, it is unclear how these explanations should align with the decision-making and reasoning processes of fact-checkers to be effectively integrated into their workflows. Through semi-structured interviews with fact-checking professionals, we bridge this gap by: (i) providing an account of how fact-checkers assess evidence, make decisions, and explain their processes; (ii) examining how fact-checkers use automated tools in practice; and (iii) identifying fact-checker explanation requirements for automated fact-checking tools. The findings show unmet explanation needs and identify important criteria for replicable fact-checking explanations that trace the model’s reasoning path, reference specific evidence, and highlight uncertainty and information gaps.
zh

[NLP-45] CoSER: Coordinating LLM-Based Persona Simulation of Established Roles

【速读】: 该论文旨在解决角色扮演语言代理(Role-playing Language Agents, RPLAs)在模拟知名角色时所面临的挑战,主要由于缺乏高质量的角色数据集及精细的评估方法。解决方案的关键在于提出了CoSER,一个包含高质量数据集、开放模型和评估协议的集合。CoSER数据集涵盖了来自771本著名书籍的17,966个角色,并提供了真实对话及多样的数据类型。通过引入基于表演方法论的给定情境表演(given-circumstance acting),用于训练和评估角色扮演大型语言模型(LLMs)。此外,开发了基于LLaMA-3.1模型的CoSER 8B和CoSER 70B模型,展示了其在角色扮演任务中的卓越性能,特别是在InCharacter和LifeChoice基准测试中分别达到了75.80%和93.47%的准确率。

链接: https://arxiv.org/abs/2502.09082
作者: Xintao Wang,Heng Wang,Yifei Zhang,Xinfeng Yuan,Rui Xu,Jen-tse Huang,Siyu Yuan,Haoran Guo,Jiangjie Chen,Wei Wang,Yanghua Xiao,Shuchang Zhou
机构: Fudan University; Johns Hopkins University (约翰霍普金斯大学); StepFun
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Role-playing language agents (RPLAs) have emerged as promising applications of large language models (LLMs). However, simulating established characters presents a challenging task for RPLAs, due to the lack of authentic character datasets and nuanced evaluation methods using such data. In this paper, we present CoSER, a collection of a high-quality dataset, open models, and an evaluation protocol towards effective RPLAs of established characters. The CoSER dataset covers 17,966 characters from 771 renowned books. It provides authentic dialogues with real-world intricacies, as well as diverse data types such as conversation setups, character experiences and internal thoughts. Drawing from acting methodology, we introduce given-circumstance acting for training and evaluating role-playing LLMs, where LLMs sequentially portray multiple characters in book scenes. Using our dataset, we develop CoSER 8B and CoSER 70B, i.e., advanced open role-playing LLMs built on LLaMA-3.1 models. Extensive experiments demonstrate the value of the CoSER dataset for RPLA training, evaluation and retrieval. Moreover, CoSER 70B exhibits state-of-the-art performance surpassing or matching GPT-4o on our evaluation and three existing benchmarks, i.e., achieving 75.80% and 93.47% accuracy on the InCharacter and LifeChoice benchmarks respectively.
zh

[NLP-46] Enhancing RAG with Active Learning on Conversation Records: Reject Incapables and Answer Capables

【速读】: 该论文旨在解决检索增强生成(Retrieval-augmented generation, RAG)技术在大型语言模型(LLMs)中减少幻觉响应不完全的问题。为了解决这一问题,论文的关键在于利用大规模的对话记录来构建高质量数据集,并通过主动学习(active learning)方法选择最合适的对话样本进行标注,从而优化标注预算内的性能表现。此外,论文提出了一种新的距离度量方法以适应RAG的主动学习需求。
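
论文的具体距离度量未在摘要中给出,这里仅以通用的“最远点优先”主动学习选样为例给出 Python 示意(假设性示例):在标注预算内优先挑选嵌入空间中相互距离最远、覆盖面最广的对话样本。

```python
import numpy as np

def select_for_annotation(embeddings, budget):
    """贪心最远点选样:每次挑选距已选集合最远的对话样本,直到用完预算。"""
    selected = [0]  # 从任意样本起步
    dist = np.linalg.norm(embeddings - embeddings[0], axis=1)
    while len(selected) < budget:
        nxt = int(dist.argmax())
        selected.append(nxt)
        dist = np.minimum(dist, np.linalg.norm(embeddings - embeddings[nxt], axis=1))
    return selected

rng = np.random.default_rng(0)
conv_embeddings = rng.normal(size=(1000, 64))  # 玩具对话嵌入
print(select_for_annotation(conv_embeddings, budget=5))
```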

链接: https://arxiv.org/abs/2502.09073
作者: Xuzhao Geng,Haozhao Wang,Jun Wang,Wei Liu,Ruixuan Li
机构: 未知
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Retrieval-augmented generation (RAG) is a key technique for leveraging external knowledge and reducing hallucinations in large language models (LLMs). However, RAG still struggles to fully prevent hallucinated responses. To address this, it is essential to identify samples prone to hallucination or guide LLMs toward correct responses, which experts then annotate to develop high-quality datasets for refining LLMs. However, the growing scarcity of such datasets makes their creation challenging. This paper proposes using the vast amount of conversations from widespread LLM usage to build these datasets, training LLMs to avoid hallucination-prone questions while accurately responding to manageable ones. Given the impracticality of expert-annotating all conversation records, the paper introduces AL4RAG, which uses active learning to select the most suitable conversation samples for annotation, optimizing performance within an annotation budget. Additionally, recognizing that traditional active learning methods are not fully compatible with RAG due to unsuitable distance metrics, we develop a novel sample distance measurement for RAG active learning. Extensive experiments show that our method consistently outperforms baselines across multiple metrics.
zh

[NLP-47] An Open Recipe: Adapting Language-Specific LLMs to a Reasoning Model in One Day via Model Merging

【速读】: 该论文旨在解决低资源语言在增强推理能力方面所面临的挑战,特别是在语言特定的大规模语言模型(LLMs)中。论文的关键解决方案在于提出了一种数据选择和模型合并的方法,能够将DeepSeek R1的高级推理能力融入到特定语言的LLMs中,同时保持其目标语言的能力。论文展示了仅使用公开可用的数据集和 120 的计算预算,即可将特定语言LLMs的推理能力提升至与DeepSeek R1相当的水平,且不损害其在目标语言任务上的表现。

链接: https://arxiv.org/abs/2502.09056
作者: Kunat Pipatanakul,Pittawat Taveekitworachai,Potsawee Manakul,Kasima Tharnpipitchai
机构: SCB 10X R&D(SCB 10X 研发); SCBX Group(SCBX 集团)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 9 pages

点击查看摘要

Abstract:This paper investigates data selection and model merging methodologies aimed at incorporating advanced reasoning capabilities such as those of DeepSeek R1 into language-specific large language models (LLMs), with a particular focus on the Thai LLM. Our goal is to enhance the reasoning capabilities of language-specific LLMs while maintaining their target language abilities. DeepSeek R1 excels in reasoning but primarily benefits high-resource languages such as English and Chinese. However, low-resource languages remain underserved due to the dominance of English-centric training data and model optimizations, which limit performance in these languages. This limitation results in unreliable code-switching and diminished effectiveness on tasks in low-resource languages. Meanwhile, local and regional LLM initiatives have attempted to bridge this gap by developing language-specific LLMs that focus on improving local linguistic fidelity. We demonstrate that, with only publicly available datasets and a computational budget of 120, it is possible to enhance the reasoning capabilities of language-specific LLMs to match the level of DeepSeek R1, without compromising their performance on target language tasks.
zh

[NLP-48] Typhoon T1: An Open Thai Reasoning Model

【速读】: 该论文旨在开发开放性的泰语推理模型Typhoon T1,以解决在低资源语言中构建能够生成推理轨迹的推理模型所面临的挑战。关键解决方案在于采用监督微调(supervised fine-tuning)的方法,利用开放数据集进行训练,而非使用强化学习(reinforcement learning),从而以更具成本效益的方式实现这一目标。

链接: https://arxiv.org/abs/2502.09042
作者: Pittawat Taveekitworachai,Potsawee Manakul,Kasima Tharnpipitchai,Kunat Pipatanakul
机构: SCB 10X R&D(10X研发); SCBX Group(集团)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 25 pages, 6 figures

点击查看摘要

Abstract:This paper introduces Typhoon T1, an open effort to develop an open Thai reasoning model. A reasoning model is a relatively new type of generative model built on top of large language models (LLMs). A reasoning model generates a long chain of thought before arriving at a final answer, an approach found to improve performance on complex tasks. However, details on developing such a model are limited, especially for reasoning models that can generate traces in a low-resource language. Typhoon T1 presents an open effort that dives into the details of developing a reasoning model in a more cost-effective way by leveraging supervised fine-tuning using open datasets, instead of reinforcement learning. This paper shares the details about synthetic data generation and training, as well as our dataset and model weights. Additionally, we provide insights gained from developing a reasoning model that generalizes across domains and is capable of generating reasoning traces in a low-resource language, using Thai as an example. We hope this open effort provides a foundation for further research in this field.
zh

[NLP-49] Diversity Enhances an LLM's Performance in RAG and Long-context Task

【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)在处理长上下文任务时,由于自注意力机制的时间复杂度为 O(N^2)(其中 N 表示上下文窗口长度)而导致的上下文窗口限制问题。特别是在问答(Question Answering, QA)中的检索增强生成(Retrieval-Augmented Generation, RAG)和长上下文摘要任务中,常见的解决方案是选择与查询最相似的内容,这往往会导致冗余并排除多样化但相关的信息。论文的关键解决方案是将多样性整合到内容选择过程中,基于最大边缘相关性(Maximal Marginal Relevance, MMR)和最远点采样(Farthest Point Sampling, FPS)的原则,从而显著提高在LLM基础之上进行QA和摘要前相关句子或片段的选择召回率。
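
MMR 在“与查询相关”和“与已选内容不同”之间加权:score(d) = λ·sim(d, q) − (1−λ)·max_{d′∈S} sim(d, d′)。下面是一个最小的 Python 实现示意(假设性示例,相似度采用余弦):

```python
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

def mmr_select(query_vec, chunk_vecs, k, lam=0.7):
    """选出 k 个片段,在"查询相关性"与"彼此多样性"之间折中(MMR)。"""
    candidates = list(range(len(chunk_vecs)))
    selected = []
    while candidates and len(selected) < k:
        def score(i):
            rel = cosine(query_vec, chunk_vecs[i])
            red = max((cosine(chunk_vecs[i], chunk_vecs[j]) for j in selected),
                      default=0.0)
            return lam * rel - (1 - lam) * red
        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected

rng = np.random.default_rng(1)
q, chunks = rng.normal(size=8), rng.normal(size=(20, 8))
print(mmr_select(q, chunks, k=5))
```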

链接: https://arxiv.org/abs/2502.09017
作者: Zhchao Wang,Bin Bi,Yanqi Luo,Sitaram Asur,Claire Na Cheng
机构: Salesforce
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:The rapid advancements in large language models (LLMs) have highlighted the challenge of context window limitations, primarily due to the quadratic time complexity of the self-attention mechanism (O(N^2), where N denotes the context window length). This constraint impacts tasks such as retrieval-augmented generation (RAG) in question answering (Q&A) and long context summarization. A common approach involves selecting content with the highest similarity to the query; however, this often leads to redundancy and the exclusion of diverse yet relevant information. Building on principles from Maximal Marginal Relevance (MMR) and Farthest Point Sampling (FPS), we integrate diversity into the content selection process. Our findings reveal that incorporating diversity substantially increases the recall of selecting relevant sentences or chunks before LLM-based Q&A and summarization. These results highlight the importance of maintaining diversity in future LLM applications to further improve summarization and Q&A outcomes.
zh

[NLP-50] Hope vs. Hate: Understanding User Interactions with LGBTQ News Content in Mainstream US News Media through the Lens of Hope Speech

【速读】: 该论文旨在分析用户如何与LGBTQ+新闻内容互动,并构建了一个细粒度的希望言论分类器来识别积极(希望言论)、消极、中立及无关内容。论文的关键解决方案在于通过一个包含1,419,047条评论的数据集进行分析,并结合公共健康专家的意见,开展了一项具有平衡和多样化政治代表性的注释研究,发布了包含3,750个实例及其详细注释者人口统计信息的数据集。此外,研究揭示了评分者政治信念与其对边缘化群体相关内容评价之间的强关联,以及基于个体政治信念训练的模型在实际应用中的显著分歧,同时发现零样本大型语言模型(LLMs)更倾向于与自由派评分者保持一致。

链接: https://arxiv.org/abs/2502.09004
作者: Jonathan Pofcher,Christopher M. Homan,Randall Sell,Ashiqur R. KhudaBukhsh
机构: Rochester Institute of Technology(罗切斯特理工学院); Drexel University(德雷塞尔大学)
类目: Computation and Language (cs.CL); Computers and Society (cs.CY); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:This paper makes three contributions. First, via a substantial corpus of 1,419,047 comments posted on 3,161 YouTube news videos of major US cable news outlets, we analyze how users engage with LGBTQ+ news content. Our analyses focus both on positive and negative content. In particular, we construct a fine-grained hope speech classifier that detects positive (hope speech), negative, neutral, and irrelevant content. Second, in consultation with a public health expert specializing on LGBTQ+ health, we conduct an annotation study with a balanced and diverse political representation and release a dataset of 3,750 instances with fine-grained labels and detailed annotator demographic information. Finally, beyond providing a vital resource for the LGBTQ+ community, our annotation study and subsequent in-the-wild assessments reveal (1) strong association between rater political beliefs and how they rate content relevant to a marginalized community; (2) models trained on individual political beliefs exhibit considerable in-the-wild disagreement; and (3) zero-shot large language models (LLMs) align more with liberal raters.
zh

[NLP-51] Tuning-Free Personalized Alignment via Trial-Error-Explain In-Context Learning NAACL2025

【速读】: 该论文旨在解决语言模型在文本生成任务中输出风格泛化的问题,即语言模型倾向于生成符合大众风格而非特定用户偏好的文本。论文提出的关键解决方案是Trial-Error-Explain In-Context Learning (TICL),这是一种无需调参的方法,通过迭代扩展上下文学习提示,结合模型生成的负样本和解释,以指导语言模型更有效地学习特定用户的风格,从而实现个性化对齐。这种方法仅需少量用户示例(少于10个),并在零样本生成中克服了对结构化和正式表达的偏好偏差。
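
TICL 的“试错—解释”迭代扩展提示的流程可以勾勒为如下 Python 草图(假设性示意;llm、judge_match、explain_difference 均为占位函数,非论文原始接口):

```python
def ticl_build_prompt(user_examples, task_input, llm, judge_match,
                      explain_difference, max_rounds=3):
    """迭代地把模型生成的负样本及其解释加入上下文提示,
    引导模型逼近特定用户的写作风格。"""
    prompt = "Write in this user's style.\n" + "\n".join(user_examples)
    for _ in range(max_rounds):
        attempt = llm(prompt + "\nInput: " + task_input)      # 试(trial)
        if judge_match(attempt, user_examples):               # 错(error 检查)
            break
        explanation = explain_difference(attempt, user_examples)  # 解释(explain)
        prompt += (f"\nBad example (do NOT write like this): {attempt}"
                   f"\nWhy it misses the style: {explanation}")
    return prompt  # 测试时直接使用该提示,无需额外生成步骤

# 以下均为玩具占位实现,仅供演示流程
demo = ticl_build_prompt(
    user_examples=["short, punchy sentences."],
    task_input="announce the meeting",
    llm=lambda p: "It is with great pleasure that we announce...",
    judge_match=lambda a, ex: len(a.split()) < 8,
    explain_difference=lambda a, ex: "too formal and too long",
)
print(demo)
```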

链接: https://arxiv.org/abs/2502.08972
作者: Hyundong Cho,Karishma Sharma,Nicolaas Jedema,Leonardo F. R. Ribeiro,Alessandro Moschitti,Ravi Krishnan,Jonathan May
机构: Amazon AGI (亚马逊AGI); University of Southern California, Information Sciences Institute (南加州大学,信息系统研究所)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: NAACL 2025 Findings

点击查看摘要

Abstract:Language models are aligned to the collective voice of many, resulting in generic outputs that do not align with specific users’ styles. In this work, we present Trial-Error-Explain In-Context Learning (TICL), a tuning-free method that personalizes language models for text generation tasks with fewer than 10 examples per user. TICL iteratively expands an in-context learning prompt via a trial-error-explain process, adding model-generated negative samples and explanations that provide fine-grained guidance towards a specific user’s style. TICL achieves favorable win rates on pairwise comparisons with LLM-as-a-judge up to 91.5% against the previous state-of-the-art and outperforms competitive tuning-free baselines for personalized alignment tasks of writing emails, essays and news articles. Both lexical and qualitative analyses show that the negative samples and explanations enable language models to learn stylistic context more effectively and overcome the bias towards structural and formal phrases observed in their zero-shot outputs. By front-loading inference compute to create a user-specific in-context learning prompt that does not require extra generation steps at test time, TICL presents a novel yet simple approach for personalized alignment.
zh

[NLP-52] Medicine on the Edge: Comparative Performance Analysis of On-Device LLMs for Clinical Reasoning

【速读】: 该论文旨在评估在移动设备上部署大型语言模型(Large Language Models, LLM)在实际医疗场景中的性能和准确性。解决方案的关键在于通过使用AMEGA数据集对公开可用的在设备端LLMs进行基准测试,涵盖多种移动设备上的准确性、计算效率以及热限制等指标。研究表明,紧凑型通用模型如Phi-3 Mini能够在速度和准确性之间取得良好平衡,而经过医学领域微调的模型如Med42和Aloe则能达到最高的准确性。此外,论文强调需要更高效的推理方法和专门针对临床推理设计的模型以充分发挥在设备端LLMs在医疗健康领域的潜力。

链接: https://arxiv.org/abs/2502.08954
作者: Leon Nissen,Philipp Zagar,Vishnu Ravi,Aydin Zahedivash,Lara Marie Reimer,Stephan Jonas,Oliver Aalami,Paul Schmiedmayer
机构: Stanford Mussallem Center for Biodesign, Stanford University (斯坦福大学); Institute for Digital Medicine, University Hospital Bonn (波恩大学医院数字医学研究所)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:The deployment of Large Language Models (LLM) on mobile devices offers significant potential for medical applications, enhancing privacy, security, and cost-efficiency by eliminating reliance on cloud-based services and keeping sensitive health data local. However, the performance and accuracy of on-device LLMs in real-world medical contexts remain underexplored. In this study, we benchmark publicly available on-device LLMs using the AMEGA dataset, evaluating accuracy, computational efficiency, and thermal limitation across various mobile devices. Our results indicate that compact general-purpose models like Phi-3 Mini achieve a strong balance between speed and accuracy, while medically fine-tuned models such as Med42 and Aloe attain the highest accuracy. Notably, deploying LLMs on older devices remains feasible, with memory constraints posing a greater challenge than raw processing power. Our study underscores the potential of on-device LLMs for healthcare while emphasizing the need for more efficient inference and models tailored to real-world clinical reasoning.
zh

[NLP-53] Structured Convergence in Large Language Model Representations via Hierarchical Latent Space Folding

【速读】: 该论文旨在解决高维潜在空间中令牌表示的冗余问题,这限制了计算效率并降低了跨模型层的结构一致性。论文的关键解决方案是引入层次化潜在空间折叠机制,通过结构化变换在学习到的嵌入中强制多尺度组织,从而提高表征紧凑性并保持重要的上下文区分度。该方法通过动态折叠操作迭代调整令牌嵌入,影响短程和长程依赖关系,最终实现更稳定的困惑度分布和增强的预测置信度。

链接: https://arxiv.org/abs/2502.08947
作者: Fenella Harcourt,Naderdel Piero,Gilbert Sutherland,Daphne Holloway,Harriet Bracknell,Julian Ormsby
机构: 未知
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Token representations in high-dimensional latent spaces often exhibit redundancy, limiting computational efficiency and reducing structural coherence across model layers. Hierarchical latent space folding introduces a structured transformation mechanism that enforces a multi-scale organization within learned embeddings, refining representational compactness while preserving essential contextual distinctions. The proposed approach incorporates dynamic folding operations that iteratively adjust token embeddings through structured transformations, influencing both short-range and long-range dependencies in sequential processing tasks. Empirical evaluation demonstrates a reduction in representational variance across layers, contributing to more stable perplexity distributions and enhancing predictive confidence in text generation. The structured redistribution of attention head utilization leads to more efficient allocation of computational resources, particularly in deeper layers, where hierarchical refinements improve contextual abstraction. Comparative analysis of activation sparsity patterns suggests that hierarchical adjustments selectively reinforce critical pathways while reducing computational overhead in non-essential regions of the model. Statistical assessments of token reordering frequencies reveal that hierarchical modifications introduce subtle shifts in sequential dependencies, improving contextual alignment while maintaining syntactic correctness. Computational trade-offs associated with hierarchical folding introduce marginal increases in training time per epoch, yet empirical findings indicate that inference efficiency benefits from the structured representation adjustments. The results highlight the impact of hierarchical latent space folding on optimizing model performance through improved representation structuring and computational efficiency.
zh

[NLP-54] The Stochastic Parrot on LLM's Shoulder: A Summative Assessment of Physical Concept Understanding NAACL2025

【速读】: 该论文旨在探讨大型语言模型(LLMs)是否真正理解其所说的内容,这一问题与“随机鹦鹉”现象密切相关。为解决这一问题,作者设计了一个名为PhysiCo的物理概念理解任务,并采用网格格式输入来抽象描述物理现象,从而减轻记忆问题。研究结果表明,最先进的LLMs在该任务上的表现比人类低约40%,这证明了随机鹦鹉现象的存在,即这些模型虽然能够很好地用自然语言描述和识别相关概念,但在特定任务上的表现却较差。关键在于通过这种新颖的任务设计,揭示了LLMs在处理抽象物理概念时的内在困难,而不仅仅是由于不熟悉的输入格式。

链接: https://arxiv.org/abs/2502.08946
作者: Mo Yu,Lemao Liu,Junjie Wu,Tsz Ting Chung,Shunchi Zhang,Jiangnan Li,Dit-Yan Yeung,Jie Zhou
机构: WeChat AI, Tencent(微信AI,腾讯); HKUST(香港科技大学); JHU(约翰霍普金斯大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: NAACL 2025 Main Conference. First 5 authors contributed equally. Project page: this https URL

点击查看摘要

Abstract:In a systematic way, we investigate a widely asked question: Do LLMs really understand what they say?, which relates to the more familiar term Stochastic Parrot. To this end, we propose a summative assessment over a carefully designed physical concept understanding task, PhysiCo. Our task alleviates the memorization issue via the usage of grid-format inputs that abstractly describe physical phenomena. The grids represent varying levels of understanding, from the core phenomenon and application examples to analogies to other abstract patterns in the grid world. A comprehensive study on our task demonstrates: (1) state-of-the-art LLMs, including GPT-4o, o1 and Gemini 2.0 flash thinking, lag behind humans by ~40%; (2) the stochastic parrot phenomenon is present in LLMs, as they fail on our grid task but can describe and recognize the same concepts well in natural language; (3) our task challenges the LLMs due to intrinsic difficulties rather than the unfamiliar grid format, as in-context learning and fine-tuning on same formatted data added little to their performance.
zh

[NLP-55] Beyond the Singular: The Essential Role of Multiple Generations in Effective Benchmark Evaluation and Analysis

【速读】: 该论文旨在解决现有基准评估方法在评估大规模语言模型(Large Language Models, LLMs)时忽视其内在随机性的问题。当前的方法通常采用确定性生成策略或依赖单一随机样本,导致未被考虑的抽样方差和不可靠的基准分数估计。论文的关键解决方案在于提出一个分层统计模型,该模型通过结合基准特性与LLM的随机性,提供了更全面的基准过程表征。此外,论文引入了基于正确比率的提示难度评分 (\mathbb{P}(\text{correct})),以提供对单个提示的细致洞见,并创建了一个数据映射来可视化难度和语义提示,从而实现基准构建中的错误检测和质量控制。
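
“利用多次生成估计提示难度 P(correct) 与基准分数”的计算可示意如下(假设性 Python 示例):

```python
import numpy as np

# generations[i][j] = 1 表示提示 i 的第 j 次采样回答正确
generations = np.array([
    [1, 1, 1, 0, 1],   # 较容易的提示
    [0, 1, 0, 0, 0],   # 较困难的提示
    [1, 0, 1, 1, 0],
])

p_correct = generations.mean(axis=1)   # 每个提示的难度评分 P(correct)
benchmark_score = p_correct.mean()     # 基于多次生成的基准分数估计

# 单次采样估计只用一个随机列,方差显著更大
single_sample = generations[:, 0].mean()

print(p_correct, benchmark_score, single_sample)
```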

链接: https://arxiv.org/abs/2502.08943
作者: Wenbo Zhang,Hengrui Cai,Wenyu Chen
机构: University of California Irvine(加州大学欧文分校); Meta(Meta); Central Applied Science(中央应用科学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 10 pages, 1 table, 4 Figures

点击查看摘要

Abstract:Large language models (LLMs) have demonstrated significant utilities in real-world applications, exhibiting impressive capabilities in natural language processing and understanding. Benchmark evaluations are crucial for assessing the capabilities of LLMs as they can provide a comprehensive assessment of their strengths and weaknesses. However, current evaluation methods often overlook the inherent randomness of LLMs by employing deterministic generation strategies or relying on a single random sample, resulting in unaccounted sampling variance and unreliable benchmark score estimates. In this paper, we propose a hierarchical statistical model that provides a more comprehensive representation of the benchmarking process by incorporating both benchmark characteristics and LLM randomness. We show that leveraging multiple generations improves the accuracy of estimating the benchmark score and reduces variance. We also introduce \mathbb P\left(\textcorrect\right) , a prompt-level difficulty score based on correct ratios, providing fine-grained insights into individual prompts. Additionally, we create a data map that visualizes difficulty and semantic prompts, enabling error detection and quality control in benchmark construction.
zh

[NLP-56] Escaping Collapse: The Strength of Weak Data for Large Language Model Training

【速读】: 该论文旨在解决合成数据在大规模语言模型(LLM)训练过程中可能导致性能停滞或退化的问题。论文的关键在于提出了一种理论框架,证明即使大部分非合成训练数据质量较差,通过适当的筛选和优化,训练过程仍能够收敛至最优的LLM。这一解决方案的关键是借鉴提升算法(boosting)的思想,动态集中标注资源于最具挑战性的样本,从而显著提高模型性能。
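
“将标注资源动态集中到当前模型最没把握的样本”这一提升式(boosting-style)筛选可用如下 Python 草图示意(假设性示例;model_predict 为占位函数):

```python
import numpy as np

def curate_for_labeling(pool, model_predict, budget):
    """类似提升算法的筛选:优先把当前模型置信度最低(最难)的样本
    送交专家标注。"""
    confidence = np.array([model_predict(x) for x in pool])  # 模型自估正确概率
    hardest_first = np.argsort(confidence)                   # 按置信度升序
    return [pool[i] for i in hardest_first[:budget]]

rng = np.random.default_rng(2)
pool = [f"synthetic_example_{i}" for i in range(100)]
picked = curate_for_labeling(pool, lambda x: rng.uniform(), budget=5)
print(picked)
```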

链接: https://arxiv.org/abs/2502.08924
作者: Kareem Amin,Sara Babakniya,Alex Bie,Weiwei Kong,Umar Syed,Sergei Vassilvitskii
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Synthetically-generated data plays an increasingly larger role in training large language models. However, while synthetic data has been found to be useful, studies have also shown that without proper curation it can cause LLM performance to plateau, or even “collapse”, after many training iterations. In this paper, we formalize this question and develop a theoretical framework to investigate how much curation is needed in order to ensure that LLM performance continually improves. We find that the requirements are nearly minimal. We describe a training procedure that converges to an optimal LLM even if almost all of the non-synthetic training data is of poor quality. Our analysis is inspired by boosting, a classic machine learning technique that leverages a very weak learning algorithm to produce an arbitrarily good classifier. Our training procedure subsumes many recently proposed methods for training LLMs on synthetic data, and thus our analysis sheds light on why they are successful, and also suggests opportunities for future improvement. We present experiments that validate our theory, and show that dynamically focusing labeling resources on the most challenging examples – in much the same way that boosting focuses the efforts of the weak learner – leads to improved performance.
zh

[NLP-57] CopySpec: Accelerating LLMs with Speculative Copy-and-Paste Without Compromising Quality

【速读】: 本文旨在解决大型语言模型(LLMs)在生成与先前输出高度相似响应时的低效问题。关键解决方案是CopySpec技术,它通过识别模型聊天记录中的重复序列并推测后续令牌将相同,从而实现无缝复制,而不会影响输出质量或增加额外的GPU内存需求。
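
CopySpec 的核心是“在对话历史中找到与当前后缀相同的片段,并推测其后续 token 可直接复制、再交由模型验证”。下面是一个简化的 Python 示意(假设性示例,省略了与推测解码验证步骤的交互细节):

```python
def speculate_copy(history, suffix_len=4, max_copy=16):
    """若最近 suffix_len 个 token 曾在历史中出现过,
    则推测当时紧随其后的 token 会再次出现。"""
    if len(history) <= suffix_len:
        return []
    suffix = history[-suffix_len:]
    # 从后向前搜索该后缀在历史中更早的出现位置
    for start in range(len(history) - suffix_len - 1, -1, -1):
        if history[start:start + suffix_len] == suffix:
            follow = history[start + suffix_len:start + suffix_len + max_copy]
            return follow  # 候选 token,随后由模型一次性验证
    return []

chat = "please fix the bug in the parser . sure , the bug in the".split()
print(speculate_copy(chat))  # -> ['parser', '.', ...]
```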

链接: https://arxiv.org/abs/2502.08923
作者: Razvan-Gabriel Dumitru,Minglai Yang,Vikas Yadav,Mihai Surdeanu
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 33 pages, 18 figures, 19 tables

点击查看摘要

Abstract:We introduce CopySpec, an innovative technique designed to tackle the inefficiencies LLMs face when generating responses that closely resemble previous outputs. CopySpec identifies repeated sequences in the model’s chat history and speculates that the same tokens will follow, enabling seamless copying without compromising output quality or requiring additional GPU memory. To evaluate the effectiveness of our approach, we conducted experiments using five LLMs and five datasets: MT-Bench, CNN/DM, GSM-8K, HumanEval, and our newly created dataset, MT-Redundant. MT-Redundant, introduced in this paper, transforms the second turn of MT-Bench into a request for variations of the first turn’s answer, simulating real-world scenarios where users request modifications to prior responses. Our results demonstrate significant speed-ups: up to 2.35x on CNN/DM, 3.08x on the second turn of select MT-Redundant categories, and 2.66x on the third turn of GSM-8K’s self-correction tasks. Moreover, we show that CopySpec integrates seamlessly with speculative decoding, yielding an average 49% additional speed-up over speculative decoding for the second turn of MT-Redundant across all eight categories. While LLMs, even with speculative decoding, suffer from slower inference as context sizes grow, CopySpec leverages the expanded context to accelerate inference, making it faster as the context size increases. Our code and dataset are publicly available at this https URL.
zh

[NLP-58] PathFinder: A Multi-Modal Multi-Agent System for Medical Diagnostic Decision-Making Applied to Histopathology

【速读】: 该论文旨在解决通过组织病理学全切片图像(Whole Slide Images, WSIs)进行疾病诊断所面临的挑战,特别是由于WSIs的吉字节级规模和复杂性导致的传统AI方法难以实现全面、迭代、多尺度的诊断过程。论文的关键解决方案是引入PathFinder框架,这是一个多模态、多智能体系统,模拟专家病理学家的决策过程。PathFinder整合了四个AI智能体:分类智能体、导航智能体、描述智能体和诊断智能体,它们协同工作以导航WSI、收集证据并提供具有自然语言解释的综合诊断。这种协作机制使得PathFinder能够超越现有最先进的方法,并在皮肤黑色素瘤诊断中提高8%的性能,同时提供可解释性。

链接: https://arxiv.org/abs/2502.08916
作者: Fatemeh Ghezloo,Mehmet Saygin Seyfioglu,Rustin Soraki,Wisdom O. Ikezogwo,Beibin Li,Tejoram Vivekanandan,Joann G. Elmore,Ranjay Krishna,Linda Shapiro
机构: Institution1; Institution2
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Multiagent Systems (cs.MA)
备注:

点击查看摘要

Abstract:Diagnosing diseases through histopathology whole slide images (WSIs) is fundamental in modern pathology but is challenged by the gigapixel scale and complexity of WSIs. Trained histopathologists overcome this challenge by navigating the WSI, looking for relevant patches, taking notes, and compiling them to produce a final holistic diagnostic. Traditional AI approaches, such as multiple instance learning and transformer-based models, fall short of such a holistic, iterative, multi-scale diagnostic procedure, limiting their adoption in the real-world. We introduce PathFinder, a multi-modal, multi-agent framework that emulates the decision-making process of expert pathologists. PathFinder integrates four AI agents, the Triage Agent, Navigation Agent, Description Agent, and Diagnosis Agent, that collaboratively navigate WSIs, gather evidence, and provide comprehensive diagnoses with natural language explanations. The Triage Agent classifies the WSI as benign or risky; if risky, the Navigation and Description Agents iteratively focus on significant regions, generating importance maps and descriptive insights of sampled patches. Finally, the Diagnosis Agent synthesizes the findings to determine the patient’s diagnostic classification. Our experiments show that PathFinder outperforms state-of-the-art methods in skin melanoma diagnosis by 8% while offering inherent explainability through natural language descriptions of diagnostically relevant patches. Qualitative analysis by pathologists shows that the Description Agent’s outputs are of high quality and comparable to GPT-4o. PathFinder is also the first AI-based system to surpass the average performance of pathologists in this challenging melanoma classification task by 9%, setting a new record for efficient, accurate, and interpretable AI-assisted diagnostics in pathology. Data, code and models available at this https URL
zh

[NLP-59] InfiniteHiP: Extending Language Model Context Up to 3 Million Tokens on a Single GPU

【速读】: 该论文旨在解决现代大型语言模型(LLMs)在处理非常长的上下文长度时面临的显著挑战,包括缓慢的推理速度和增加的内存成本。此外,大多数现有的预训练LLMs无法超越其原始训练序列长度进行泛化。为了解决这些问题,论文提出的关键方案是引入InfiniteHiP,这是一种新颖且实用的LLM推理框架。InfiniteHiP通过动态消除无关上下文标记的模块化分层标记剪枝算法加速处理,并通过选择性应用多种RoPE调整方法来实现对更长序列的泛化。此外,该框架在推理过程中将键值缓存卸载到主机内存,从而显著减轻GPU内存压力。这些措施使得InfiniteHiP能够在单个L40s 48GB GPU上处理高达300万个标记,同时实现18.95倍的注意力解码加速,而无需额外训练。
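
“按与当前查询的注意力得分剪除无关上下文”的最简化形式可示意如下(假设性 PyTorch 示例,仅示意核心思想,与 InfiniteHiP 的模块化分层剪枝算法相去甚远):

```python
import torch

def prune_kv(keys, values, query, keep_ratio=0.25):
    """仅保留与当前 query 打分最高的上下文位置,其余从 KV 缓存中剪除。"""
    scores = (query @ keys.T).squeeze(0)           # (ctx_len,)
    k = max(1, int(keep_ratio * keys.shape[0]))
    top = scores.topk(k).indices.sort().values     # 保持原有顺序
    return keys[top], values[top]

ctx_len, dim = 1024, 64
keys, values = torch.randn(ctx_len, dim), torch.randn(ctx_len, dim)
query = torch.randn(1, dim)
k2, v2 = prune_kv(keys, values, query)
print(k2.shape)  # torch.Size([256, 64])
```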

链接: https://arxiv.org/abs/2502.08910
作者: Heejun Lee,Geon Park,Jaduk Suh,Sung Ju Hwang
机构: 未知
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: 21 pages

点击查看摘要

Abstract:In modern large language models (LLMs), handling very long context lengths presents significant challenges as it causes slower inference speeds and increased memory costs. Additionally, most existing pre-trained LLMs fail to generalize beyond their original training sequence lengths. To enable efficient and practical long-context utilization, we introduce InfiniteHiP, a novel, and practical LLM inference framework that accelerates processing by dynamically eliminating irrelevant context tokens through a modular hierarchical token pruning algorithm. Our method also allows generalization to longer sequences by selectively applying various RoPE adjustment methods according to the internal attention patterns within LLMs. Furthermore, we offload the key-value cache to host memory during inference, significantly reducing GPU memory pressure. As a result, InfiniteHiP enables the processing of up to 3 million tokens on a single L40s 48GB GPU – 3x larger – without any permanent loss of context information. Our framework achieves an 18.95x speedup in attention decoding for a 1 million token context without requiring additional training. We implement our method in the SGLang framework and demonstrate its effectiveness and practicality through extensive evaluations.
zh

[NLP-60] Towards Automated Fact-Checking of Real-World Claims: Exploring Task Formulation and Assessment with LLMs

【速读】: 该论文旨在解决自动事实核查(Automated Fact-Checking, AFC)在处理大规模信息和多样化标签方案中的性能与解释质量问题。解决方案的关键在于利用不同规模的大型语言模型(Large Language Models, LLMs),特别是通过集成检索证据来提升分类准确性及解释质量,并验证了未经过微调的较大规模LLMs在处理复杂标签区分时的优势。研究表明,通过检索增强的AFC系统能够显著提高自动化事实核查任务的表现。

链接: https://arxiv.org/abs/2502.08909
作者: Premtim Sahitaj,Iffat Maab,Junichi Yamagishi,Jawan Kolanowski,Sebastian Möller,Vera Schmitt
机构: Technische Universität Berlin (柏林工业大学); Deutsches Forschungszentrum für Künstliche Intelligenz (德国人工智能研究中心); National Institute of Informatics (国立信息学研究所); Harz University of Applied Sciences (哈尔茨应用科学大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Fact-checking is necessary to address the increasing volume of misinformation. Traditional fact-checking relies on manual analysis to verify claims, but it is slow and resource-intensive. This study establishes baseline comparisons for Automated Fact-Checking (AFC) using Large Language Models (LLMs) across multiple labeling schemes (binary, three-class, five-class) and extends traditional claim verification by incorporating analysis, verdict classification, and explanation in a structured setup to provide comprehensive justifications for real-world claims. We evaluate Llama-3 models of varying sizes (3B, 8B, 70B) on 17,856 claims collected from PolitiFact (2007-2024) using evidence retrieved via restricted web searches. We utilize TIGERScore as a reference-free evaluation metric to score the justifications. Our results show that larger LLMs consistently outperform smaller LLMs in classification accuracy and justification quality without fine-tuning. We find that smaller LLMs in a one-shot scenario provide comparable task performance to fine-tuned Small Language Models (SLMs) with large context sizes, while larger LLMs consistently surpass them. Evidence integration improves performance across all models, with larger LLMs benefiting most. Distinguishing between nuanced labels remains challenging, emphasizing the need for further exploration of labeling schemes and alignment with evidence. Our findings demonstrate the potential of retrieval-augmented AFC with LLMs.
zh

[NLP-61] Can Uniform Meaning Representation Help GPT-4 Translate from Indigenous Languages?

【Quick Read】: This paper addresses the poor performance of generative AI on extremely low-resource and indigenous languages. The key to the solution is incorporating Uniform Meaning Representation (UMR) into GPT-4 prompts to improve translation from these languages. The study shows that in the majority of test cases, integrating UMR yields a statistically significant performance gain, indicating the potential of the UMR formalism for future applications.

Link: https://arxiv.org/abs/2502.08900
Authors: Shira Wein
Affiliations: Amherst College
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:While ChatGPT and GPT-based models are able to effectively perform many tasks without additional fine-tuning, they struggle with tasks related to extremely low-resource languages and indigenous languages. Uniform Meaning Representation (UMR), a semantic representation designed to capture the meaning of texts in many languages, is well-poised to be leveraged in the development of low-resource language technologies. In this work, we explore the downstream technical utility of UMR for low-resource languages by incorporating it into GPT-4 prompts. Specifically, we examine the ability of GPT-4 to perform translation from three indigenous languages (Navajo, Arápaho, and Kukama), with and without demonstrations, as well as with and without UMR annotations. Ultimately we find that in the majority of our test cases, integrating UMR into the prompt results in a statistically significant increase in performance, which is a promising indication of future applications of the UMR formalism.

[NLP-62] Communication is All You Need: Persuasion Dataset Construction via Multi-LLM Communication NAACL2025

【Quick Read】: This paper addresses concerns about the fluency and sophistication of persuasive dialogue generated by large language models (LLMs). The key to the solution is a multi-LLM communication framework that automatically enhances the generation of persuasive data, enabling efficient production of high-quality, diverse linguistic content with minimal human oversight.

Link: https://arxiv.org/abs/2502.08896
Authors: Weicheng Ma, Hefan Zhang, Ivory Yang, Shiyu Ji, Joice Chen, Farnoosh Hashemi, Shubham Mohole, Ethan Gearey, Michael Macy, Saeed Hassanpour, Soroush Vosoughi
Affiliations: Cornell University; Department of Computer Science, Dartmouth College
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Accepted to NAACL 2025 Main Conference

Abstract:Large Language Models (LLMs) have shown proficiency in generating persuasive dialogue, yet concerns about the fluency and sophistication of their outputs persist. This paper presents a multi-LLM communication framework designed to enhance the generation of persuasive data automatically. This framework facilitates the efficient production of high-quality, diverse linguistic content with minimal human oversight. Through extensive evaluations, we demonstrate that the generated data excels in naturalness, linguistic diversity, and the strategic use of persuasion, even in complex scenarios involving social taboos. The framework also proves adept at generalizing across novel contexts. Our results highlight the framework’s potential to significantly advance research in both computational and social science domains concerning persuasive communication.

[NLP-63] LLM-Enhanced Multiple Instance Learning for Joint Rumor and Stance Detection with Social Context Information

【Quick Read】: This paper addresses the high annotation cost of rumor detection and stance detection and the underexploited synergy between the two tasks. The key is an LLM-enhanced multiple instance learning (MIL) approach that jointly predicts post stances and claim class labels supervised only by bag-level labels of claim veracity, thereby capturing the strong correlation between rumor veracity and the stances expressed in responding posts. The method transforms the multi-class problem into several MIL-based binary classification problems and aggregates the classifier outputs with a discriminative attention layer to predict finer-grained classes.

Link: https://arxiv.org/abs/2502.08888
Authors: Ruichao Yang, Jing Ma, Wei Gao, Hongzhan Lin
Affiliations: Department of Computer Science, Hong Kong Baptist University, Hong Kong, China; School of Computing and Information Systems, Singapore Management University, Singapore
Subjects: Computation and Language (cs.CL)
Comments: Accepted by ACM TIST

Abstract:The proliferation of misinformation, such as rumors on social media, has drawn significant attention, prompting various expressions of stance among users. Although rumor detection and stance detection are distinct tasks, they can complement each other. Rumors can be identified by cross-referencing stances in related posts, and stances are influenced by the nature of the rumor. However, existing stance detection methods often require post-level stance annotations, which are costly to obtain. We propose a novel LLM-enhanced MIL approach to jointly predict post stance and claim class labels, supervised solely by claim labels, using an undirected microblog propagation model. Our weakly supervised approach relies only on bag-level labels of claim veracity, aligning with multi-instance learning (MIL) principles. To achieve this, we transform the multi-class problem into multiple MIL-based binary classification problems. We then employ a discriminative attention layer to aggregate the outputs from these classifiers into finer-grained classes. Experiments conducted on three rumor datasets and two stance datasets demonstrate the effectiveness of our approach, highlighting strong connections between rumor veracity and expressed stances in responding posts. Our method shows promising performance in joint rumor and stance detection compared to the state-of-the-art methods.
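The attention-based aggregation step can be sketched in a few lines of PyTorch. The module below is a hypothetical simplification: post embeddings stand in for the propagation model's outputs, and the per-class instance scorer plus attention pooling only illustrate how bag-level (claim) predictions can be derived from instance-level (post) scores:

```python
import torch
import torch.nn as nn

class MILAttentionAggregator(nn.Module):
    """Sketch of MIL-style aggregation: each post (instance) receives a
    score per class, and a learned attention layer pools instance scores
    into a bag-level (claim-level) prediction."""
    def __init__(self, hidden_dim: int, num_classes: int):
        super().__init__()
        self.instance_clf = nn.Linear(hidden_dim, num_classes)  # post-level scorer
        self.attention = nn.Sequential(
            nn.Linear(hidden_dim, 64), nn.Tanh(), nn.Linear(64, 1))

    def forward(self, post_embeds: torch.Tensor):
        # post_embeds: (num_posts, hidden_dim) for one claim thread
        inst_logits = self.instance_clf(post_embeds)              # (P, C) stance scores
        attn = torch.softmax(self.attention(post_embeds), dim=0)  # (P, 1) weights
        bag_logits = (attn * inst_logits).sum(dim=0)              # (C,) claim prediction
        return bag_logits, inst_logits, attn

model = MILAttentionAggregator(hidden_dim=768, num_classes=4)
posts = torch.randn(12, 768)  # 12 responding posts
bag, inst, attn = model(posts)
print(bag.shape, inst.shape, attn.shape)
```

Only `bag_logits` needs a label during training, which is what makes the supervision weak: post-level stances fall out of `inst_logits` as a by-product.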

[NLP-64] BrainWavLM: Fine-tuning Speech Representations with Brain Responses to Language

【Quick Read】: This paper addresses the limitations of existing speech encoding models in predicting human brain responses to speech stimuli, particularly the restriction imposed by linear mappings. The key to the solution is fine-tuning a WavLM-based encoding model end-to-end with low-rank adaptation (LoRA) to enable non-linear optimization. This improves average encoding performance across cortex, and selectively fine-tuning on primary auditory cortex (AC) improves performance there while largely retaining gains elsewhere. The fine-tuning also strengthens semantic representations in the model without any explicit annotations.

Link: https://arxiv.org/abs/2502.08866
Authors: Nishitha Vattikonda, Aditya R. Vaidya, Richard J. Antonello, Alexander G. Huth
Affiliations: The University of Texas at Austin; Zuckerman Institute; Columbia University
Subjects: Computation and Language (cs.CL)
Comments: 15 pages, 8 figures

Abstract:Speech encoding models use auditory representations to predict how the human brain responds to spoken language stimuli. Most performant encoding models linearly map the hidden states of artificial neural networks to brain data, but this linear restriction may limit their effectiveness. In this work, we use low-rank adaptation (LoRA) to fine-tune a WavLM-based encoding model end-to-end on a brain encoding objective, producing a model we name BrainWavLM. We show that fine-tuning across all of cortex improves average encoding performance with greater stability than without LoRA. This improvement comes at the expense of low-level regions like auditory cortex (AC), but selectively fine-tuning on these areas improves performance in AC, while largely retaining gains made in the rest of cortex. Fine-tuned models generalized across subjects, indicating that they learned robust brain-like representations of the speech stimuli. Finally, by training linear probes, we showed that the brain data strengthened semantic representations in the speech model without any explicit annotations. Our results demonstrate that brain fine-tuning produces best-in-class speech encoding models, and that non-linear methods have the potential to bridge the gap between artificial and biological representations of semantics.
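As background, the LoRA mechanism used for the fine-tuning is straightforward to sketch: the pretrained weight stays frozen and only a low-rank update is trained. The snippet below is a generic, minimal LoRA layer, not the BrainWavLM training code:

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Minimal LoRA wrapper: the pretrained weight W stays frozen and only
    the low-rank update B @ A (rank r) is trained."""
    def __init__(self, base: nn.Linear, r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False  # freeze pretrained weights
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))  # zero init: no-op at start
        self.scale = alpha / r

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

layer = LoRALinear(nn.Linear(768, 768))
out = layer(torch.randn(4, 768))
print(out.shape)  # torch.Size([4, 768])
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print("trainable params:", trainable)  # only A and B
```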

[NLP-65] EnigmaEval: A Benchmark of Long Multimodal Reasoning Challenges

【Quick Read】: This paper targets the limitations of current language models on tasks requiring complex reasoning and implicit knowledge synthesis. To this end, the authors introduce EnigmaEval, a dataset of problems and solutions drawn from puzzle competitions, designed to evaluate advanced language models on implicit knowledge synthesis and multi-step deductive reasoning. Unlike existing reasoning and knowledge benchmarks, EnigmaEval uses puzzle solving to test a model's ability to discover hidden connections between seemingly unrelated pieces of information, exposing weaknesses in unstructured and lateral reasoning. The key is evaluating models on difficult puzzles that nonetheless have unambiguous, verifiable answers.

Link: https://arxiv.org/abs/2502.08859
Authors: Clinton J. Wang, Dean Lee, Cristina Menghini, Johannes Mols, Jack Doughty, Adam Khoja, Jayson Lynch, Sean Hendryx, Summer Yue, Dan Hendrycks
Affiliations: Scale AI; Center for AI Safety; MIT
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:

Abstract:As language models master existing reasoning benchmarks, we need new challenges to evaluate their cognitive frontiers. Puzzle-solving events are rich repositories of challenging multimodal problems that test a wide range of advanced reasoning and knowledge capabilities, making them a unique testbed for evaluating frontier language models. We introduce EnigmaEval, a dataset of problems and solutions derived from puzzle competitions and events that probes models’ ability to perform implicit knowledge synthesis and multi-step deductive reasoning. Unlike existing reasoning and knowledge benchmarks, puzzle solving challenges models to discover hidden connections between seemingly unrelated pieces of information to uncover solution paths. The benchmark comprises 1184 puzzles of varying complexity – each typically requiring teams of skilled solvers hours to days to complete – with unambiguous, verifiable solutions that enable efficient evaluation. State-of-the-art language models achieve extremely low accuracy on these puzzles, even lower than other difficult benchmarks such as Humanity’s Last Exam, unveiling models’ shortcomings when challenged with problems requiring unstructured and lateral reasoning.

[NLP-66] Ask in Any Modality: A Comprehensive Survey on Multimodal Retrieval-Augmented Generation

【Quick Read】: This paper addresses hallucinations and outdated knowledge in large language models (LLMs), which stem from their reliance on static training data. The key is Multimodal Retrieval-Augmented Generation (RAG), which integrates external, dynamic information to improve the factuality and timeliness of generated outputs. However, cross-modal alignment and reasoning introduce unique challenges to multimodal RAG that distinguish it from traditional unimodal RAG systems.

Link: https://arxiv.org/abs/2502.08826
Authors: Mohammad Mahdi Abootorabi, Amirhosein Zobeiri, Mahdi Dehghani, Mohammadali Mohammadkhani, Bardia Mohammadi, Omid Ghahroodi, Mahdieh Soleymani Baghshah, Ehsaneddin Asgari
Affiliations: Sharif University of Technology, Tehran, Iran; University of Tehran, Tehran, Iran; K.N. Toosi University of Technology, Tehran, Iran; Qatar Computing Research Institute, Doha, Qatar
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
Comments:

Abstract:Large Language Models (LLMs) struggle with hallucinations and outdated knowledge due to their reliance on static training data. Retrieval-Augmented Generation (RAG) mitigates these issues by integrating external, dynamic information, enhancing factual and up-to-date grounding. Recent advances in multimodal learning have led to the development of Multimodal RAG, incorporating multiple modalities such as text, images, audio, and video to enhance the generated outputs. However, cross-modal alignment and reasoning introduce unique challenges to Multimodal RAG, distinguishing it from traditional unimodal RAG. This survey offers a structured and comprehensive analysis of Multimodal RAG systems, covering datasets, metrics, benchmarks, evaluation, methodologies, and innovations in retrieval, fusion, augmentation, and generation. We precisely review training strategies, robustness enhancements, and loss functions, while also exploring the diverse Multimodal RAG scenarios. Furthermore, we discuss open challenges and future research directions to support advancements in this evolving field. This survey lays the foundation for developing more capable and reliable AI systems that effectively leverage multimodal dynamic external knowledge bases. Resources are available at this https URL.

[NLP-67] Examining and Adapting Time for Multilingual Classification via Mixture of Temporal Experts NAACL2025

【Quick Read】: This paper addresses the limitations of existing classification models in handling temporal variation and multilingual data, focusing on how temporal effects vary across languages. The key is the Mixture of Temporal Experts (MoTE) framework, which leverages both semantic and data-distributional shifts to learn and adapt to temporal trends, improving classifier generalization under temporal data shifts.

Link: https://arxiv.org/abs/2502.08825
Authors: Weisi Liu, Guangzeng Han, Xiaolei Huang
Affiliations: Unknown
Subjects: Computation and Language (cs.CL)
Comments: Accepted to NAACL 2025

Abstract:Time is implicitly embedded in classification process: classifiers are usually built on existing data while to be applied on future data whose distributions (e.g., label and token) may change. However, existing state-of-the-art classification models merely consider the temporal variations and primarily focus on English corpora, which leaves temporal studies less explored, let alone under multilingual settings. In this study, we fill the gap by treating time as domains (e.g., 2024 vs. 2025), examining temporal effects, and developing a domain adaptation framework to generalize classifiers over time on multiple languages. Our framework proposes Mixture of Temporal Experts (MoTE) to leverage both semantic and data distributional shifts to learn and adapt temporal trends into classification models. Our analysis shows classification performance varies over time across different languages, and we experimentally demonstrate that MoTE can enhance classifier generalizability over temporal data shifts. Our study provides analytic insights and addresses the need for time-aware models that perform robustly in multilingual scenarios.

[NLP-68] Can a Single Model Master Both Multi-turn Conversations and Tool Use? CALM: A Unified Conversational Agentic Language Model

【Quick Read】: This paper tackles the dual challenge that large language models (LLMs) face in task-oriented dialogue (TOD) systems and language agents (LAs): TOD systems are typically limited to a fixed set of target APIs and need additional data to maintain quality when interfacing with new services, while LAs struggle to maintain user intent over multi-turn conversations. The key is CALM (Conversational Agentic Language Model), a unified approach that integrates conversational and function-calling capabilities. Using CALM-IT, a carefully constructed multi-task dataset that interleaves multi-turn reasoning with complex API usage, the authors train CALM models at three sizes that outperform top domain-specific models across several benchmarks.

Link: https://arxiv.org/abs/2502.08820
Authors: Emre Can Acikgoz, Jeremiah Greer, Akul Datta, Ze Yang, William Zeng, Oussama Elachqar, Emmanouil Koukoumidis, Dilek Hakkani-Tür, Gokhan Tur
Affiliations: University of Illinois Urbana-Champaign; Oumi
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:

Abstract:Large Language Models (LLMs) with API-calling capabilities enabled building effective Language Agents (LA), while also revolutionizing the conventional task-oriented dialogue (TOD) paradigm. However, current approaches face a critical dilemma: TOD systems are often trained on a limited set of target APIs, requiring new data to maintain their quality when interfacing with new services, while LAs are not trained to maintain user intent over multi-turn conversations. Because both robust multi-turn management and advanced function calling are crucial for effective conversational agents, we evaluate these skills on three popular benchmarks: MultiWOZ 2.4 (TOD), BFCL V3 (LA), and API-Bank (LA), and our analyses reveal that specialized approaches excel in one domain but underperform in the other. To bridge this chasm, we introduce CALM (Conversational Agentic Language Model), a unified approach that integrates both conversational and agentic capabilities. We created CALM-IT, a carefully constructed multi-task dataset that interleaves multi-turn ReAct reasoning with complex API usage. Using CALM-IT, we train three models, CALM 8B, CALM 70B, and CALM 405B, which outperform top domain-specific models, including GPT-4o, across all three benchmarks.

[NLP-69] Lexical Manifold Reconfiguration in Large Language Models: A Novel Architectural Approach for Contextual Modulation

【Quick Read】: This paper addresses the limitations of static token embeddings when handling complex sentence structures and shifts in domain-specific terminology. The key is dynamically reconfiguring token embeddings through continuous geometric transformations, with a manifold-based transformation mechanism that regulates lexical positioning so that embeddings adapt to evolving discourse structures while preserving linguistic relationships. The approach reduces perplexity and improves lexical coherence and sentence-level continuity, particularly in structured and domain-adaptive text generation tasks.

Link: https://arxiv.org/abs/2502.08818
Authors: Koinis Vassilis, Godfrey Milbourne, Harriet Featherstone, Xanthe Peverell, Yorick Bletchley, Zachary Montford
Affiliations: Unknown
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:Contextual adaptation in token embeddings plays a central role in determining how well language models maintain coherence and retain semantic relationships over extended text sequences. Static embeddings often impose constraints on lexical flexibility, leading to suboptimal performance when faced with complex sentence structures or domain-specific terminology shifts. To address this limitation, a structured approach was developed for dynamically reconfiguring token embeddings through continuous geometric transformations, ensuring that representations evolved in response to evolving discourse structures. A manifold-based transformation mechanism was integrated to regulate lexical positioning, allowing embeddings to undergo controlled shifts while preserving linguistic relationships across varying textual contexts. Empirical evaluations demonstrated that embedding reconfiguration contributed to reductions in perplexity, improved lexical coherence, and enhanced sentence-level continuity, particularly in structured and domain-adaptive text generation tasks. Comparative analyses of embedding drift indicated that dynamically restructured representations maintained stronger contextual consistency, reducing misalignment in token dependencies while preserving fluency in language modeling outputs. Computational overhead assessments confirmed that while training complexity increased due to the iterative refinement of embeddings, inference remained efficient, ensuring practical feasibility for real-time generation. Evaluations across multiple datasets further demonstrated that dynamically modulated embeddings exhibited broader lexical diversity, reducing repetitive token patterns and enabling a more adaptable representation learning process.

[NLP-70] A Systematic Review on the Evaluation of Large Language Models in Theory of Mind Tasks

【Quick Read】: This paper addresses how to evaluate the Theory of Mind (ToM) capabilities of large language models (LLMs), i.e., their ability to attribute mental states to themselves and others. The key is to systematically synthesize current evaluation efforts through a taxonomy rooted in cognitive science, critically analyzing evaluation techniques, prompting strategies, and the inherent limitations of LLMs in replicating human-like mental state reasoning. The review finds that although LLMs show emerging competence on ToM tasks, significant gaps remain relative to human cognitive abilities.

Link: https://arxiv.org/abs/2502.08796
Authors: Karahan Sarıtaş, Kıvanç Tezören, Yavuz Durmazkeser
Affiliations: University of Tübingen
Subjects: Computation and Language (cs.CL); Computers and Society (cs.CY); Human-Computer Interaction (cs.HC)
Comments:

Abstract:In recent years, evaluating the Theory of Mind (ToM) capabilities of large language models (LLMs) has received significant attention within the research community. As the field rapidly evolves, navigating the diverse approaches and methodologies has become increasingly complex. This systematic review synthesizes current efforts to assess LLMs’ ability to perform ToM tasks, an essential aspect of human cognition involving the attribution of mental states to oneself and others. Despite notable advancements, the proficiency of LLMs in ToM remains a contentious issue. By categorizing benchmarks and tasks through a taxonomy rooted in cognitive science, this review critically examines evaluation techniques, prompting strategies, and the inherent limitations of LLMs in replicating human-like mental state reasoning. A recurring theme in the literature reveals that while LLMs demonstrate emerging competence in ToM tasks, significant gaps persist in their emulation of human cognitive abilities.

[NLP-71] If Multi-Agent Debate is the Answer, What is the Question?

【Quick Read】: This paper addresses critical shortcomings in the evaluation practices of multi-agent debate (MAD) methods, including limited dataset overlap and inconsistent baselines, which raise serious concerns about generalizability. The key is the Heter-MAD framework, which lets a single large language model (LLM) agent access the outputs of heterogeneous foundation models, significantly boosting the performance of current MAD frameworks.

Link: https://arxiv.org/abs/2502.08788
Authors: Hangfan Zhang, Zhiyao Cui, Xinrun Wang, Qiaosheng Zhang, Zhen Wang, Dinghao Wu, Shuyue Hu
Affiliations: Pennsylvania State University; Northwestern Polytechnical University; Singapore Management University; Shanghai Artificial Intelligence Laboratory
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: This position paper takes a critical view of the status quo of MAD research and outlines multiple potential directions to improve MAD

Abstract:Multi-agent debate (MAD) has emerged as a promising approach to enhance the factual accuracy and reasoning quality of large language models (LLMs) by engaging multiple agents in iterative discussions during inference. Despite its potential, we argue that current MAD research suffers from critical shortcomings in evaluation practices, including limited dataset overlap and inconsistent baselines, raising significant concerns about generalizability. Correspondingly, this paper presents a systematic evaluation of five representative MAD methods across nine benchmarks using four foundational models. Surprisingly, our findings reveal that MAD methods fail to reliably outperform simple single-agent baselines such as Chain-of-Thought and Self-Consistency, even when consuming additional inference-time computation. From our analysis, we found that model heterogeneity can significantly improve MAD frameworks. We propose Heter-MAD enabling a single LLM agent to access the output from heterogeneous foundation models, which boosts the performance of current MAD frameworks. Finally, we outline potential directions for advancing MAD, aiming to spark a broader conversation and inspire future work in this area.

[NLP-72] Zero-Shot Belief: A Hard Problem for LLMs ACL2025

【Quick Read】: This paper addresses zero-shot source-and-target belief prediction, particularly on the FactBank dataset. Two LLM-based approaches are proposed: a unified system that identifies events, sources, and belief labels in a single pass, and a hybrid approach that uses a fine-tuned DeBERTa tagger for event detection. The key is the hybrid approach, which achieves new state-of-the-art results on FactBank, accompanied by a detailed error analysis; the method is further tested on the Italian belief corpus ModaFact.

Link: https://arxiv.org/abs/2502.08777
Authors: John Murzaku, Owen Rambow
Affiliations: Stony Brook University; Department of Computer Science; Institute for Advanced Computational Science
Subjects: Computation and Language (cs.CL)
Comments: Submitted to ACL 2025

Abstract:We present two LLM-based approaches to zero-shot source-and-target belief prediction on FactBank: a unified system that identifies events, sources, and belief labels in a single pass, and a hybrid approach that uses a fine-tuned DeBERTa tagger for event detection. We show that multiple open-sourced, closed-source, and reasoning-based LLMs struggle with the task. Using the hybrid approach, we achieve new state-of-the-art results on FactBank and offer a detailed error analysis. Our approach is then tested on the Italian belief corpus ModaFact.

[NLP-73] Universal Model Routing for Efficient LLM Inference

【Quick Read】: This paper addresses dynamic routing, where previously unobserved language models (LLMs) become available at test time. The key is to represent each LLM as a feature vector derived from its predictions on a set of representative prompts. Two effective strategies are proposed: cluster-based routing and a learned cluster map. Both are shown to be estimates of the theoretically optimal routing rule, with excess risk bounds quantifying their error, and experiments demonstrate their effectiveness when routing among more than 30 unseen LLMs.

Link: https://arxiv.org/abs/2502.08773
Authors: Wittawat Jitkrittum, Harikrishna Narasimhan, Ankit Singh Rawat, Jeevesh Juneja, Zifeng Wang, Chen-Yu Lee, Pradeep Shenoy, Rina Panigrahy, Aditya Krishna Menon, Sanjiv Kumar
Affiliations: Google
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments:

Abstract:Large language models’ significant advances in capabilities are accompanied by significant increases in inference costs. Model routing is a simple technique for reducing inference cost, wherein one maintains a pool of candidate LLMs, and learns to route each prompt to the smallest feasible LLM. Existing works focus on learning a router for a fixed pool of LLMs. In this paper, we consider the problem of dynamic routing, where new, previously unobserved LLMs are available at test time. We propose a new approach to this problem that relies on representing each LLM as a feature vector, derived based on predictions on a set of representative prompts. Based on this, we detail two effective strategies, relying on cluster-based routing and a learned cluster map respectively. We prove that these strategies are estimates of a theoretically optimal routing rule, and provide an excess risk bound to quantify their errors. Experiments on a range of public benchmarks show the effectiveness of the proposed strategies in routing amongst more than 30 unseen LLMs.
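A minimal sketch of cluster-based routing under this representation (per-prompt scores as the LLM's feature vector) might look like the following. The quality bar, cost vector, and all data are made up for illustration and are not the paper's configuration:

```python
import numpy as np
from sklearn.cluster import KMeans

# Each LLM is represented by its per-prompt quality scores on a fixed set
# of representative prompts; prompts are clustered, and a new prompt is
# routed to the cheapest model whose mean score on the matching cluster
# clears a quality bar. All data here is synthetic.
rng = np.random.default_rng(0)
n_prompts, dim, n_models, n_clusters = 200, 32, 5, 8

prompt_embeds = rng.normal(size=(n_prompts, dim))
model_scores = rng.uniform(size=(n_models, n_prompts))  # feature vector per LLM
model_costs = np.array([1.0, 2.0, 3.0, 5.0, 8.0])

kmeans = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit(prompt_embeds)
labels = kmeans.labels_
# Per-model, per-cluster average quality.
cluster_quality = np.stack([
    [model_scores[m, labels == c].mean() for c in range(n_clusters)]
    for m in range(n_models)])

def route(new_embed, quality_bar=0.5):
    c = kmeans.predict(new_embed[None])[0]
    ok = np.where(cluster_quality[:, c] >= quality_bar)[0]
    if len(ok) == 0:                                  # nothing clears the bar:
        return int(cluster_quality[:, c].argmax())    # fall back to best model
    return int(ok[np.argmin(model_costs[ok])])        # cheapest feasible model

print("routed to model", route(rng.normal(size=dim)))
```

A previously unseen LLM can join the pool by simply scoring it on the same probe prompts and recomputing `cluster_quality` for that row, with no retraining of the router.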

[NLP-74] SelfElicit: Your Language Model Secretly Knows Where is the Relevant Evidence

【Quick Read】: This paper addresses the difficulty language models have in using relevant evidence to produce factually correct, grounded responses, especially in realistic settings with noisy and irrelevant information. The key is SelfElicit, an inference-time method that helps a language model focus on key contextual evidence through self-guided explicit highlighting. SelfElicit exploits the model's inherent evidence-finding ability, using attention scores from deeper layers to automatically identify and emphasize key evidence in the input context, yielding more accurate and grounded responses without additional training or iterative prompting.

Link: https://arxiv.org/abs/2502.08767
Authors: Zhining Liu, Rana Ali Amjad, Ravinarayana Adkathimar, Tianxin Wei, Hanghang Tong
Affiliations: University of Illinois Urbana-Champaign; Amazon Science
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 16 pages, 5 figures, 8 tables

Abstract:Providing Language Models (LMs) with relevant evidence in the context (either via retrieval or user-provided) can significantly improve their ability to provide factually correct grounded responses. However, recent studies have found that LMs often struggle to fully comprehend and utilize key evidence from the context, especially when it contains noise and irrelevant information - an issue common in real-world scenarios. To address this, we propose SelfElicit, an inference-time approach that helps LMs focus on key contextual evidence through self-guided explicit highlighting. By leveraging the inherent evidence-finding capabilities of LMs using the attention scores of deeper layers, our method automatically identifies and emphasizes key evidence within the input context, facilitating more accurate and factually grounded responses without additional training or iterative prompting. We demonstrate that SelfElicit brings consistent and significant improvement on multiple evidence-based QA tasks for various LM families while maintaining computational efficiency. Our code and documentation are available at this https URL.
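The core mechanism, scoring context spans by the attention they receive in deeper layers, can be sketched with Hugging Face transformers. The encoder choice (bert-base-uncased), the last-4-layer averaging, and the sentence-selection rule below are illustrative assumptions rather than the paper's exact configuration:

```python
import torch
from transformers import AutoModel, AutoTokenizer

# Sketch: feed question + context through an encoder, average the attention
# that question tokens pay to each context sentence in the deeper layers,
# and highlight the top-scoring sentence as evidence.
tok = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased", output_attentions=True)
model.eval()

question = "Where was the treaty signed?"
sentences = [
    "The delegates met for three days.",
    "The treaty was signed in Vienna in 1815.",
    "Weather delayed several arrivals.",
]
text = question + " " + " ".join(sentences)
enc = tok(text, return_tensors="pt", return_offsets_mapping=True)
offsets = enc.pop("offset_mapping")[0]

with torch.no_grad():
    out = model(**enc)
# Average the last 4 layers over heads: a (seq, seq) attention map.
attn = torch.stack(out.attentions[-4:]).mean(dim=(0, 2))[0]

# Score each sentence by the attention it receives from question tokens.
q_end = len(question)
q_idx = [i for i, (s, e) in enumerate(offsets.tolist()) if e <= q_end and e > 0]
scores, start = [], q_end + 1
for sent in sentences:
    end = start + len(sent)
    s_idx = [i for i, (s, e) in enumerate(offsets.tolist())
             if s >= start and e <= end and e > 0]
    scores.append(attn[q_idx][:, s_idx].mean().item() if s_idx else 0.0)
    start = end + 1
best = max(range(len(sentences)), key=scores.__getitem__)
print("highlighted evidence:", sentences[best])
```

In the full method the highlighted span would then be marked up in the prompt before answering; the sketch stops at the selection step.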

[NLP-75] IHEval: Evaluating Language Models on Following the Instruction Hierarchy NAACL2025

【Quick Read】: This paper addresses the challenges language models (LMs) face in following the instruction hierarchy, which establishes a priority order from system messages to user messages, conversation history, and tool outputs to ensure consistent and safe behavior. The topic has received limited attention, and comprehensive benchmarks for evaluating models' ability to follow the hierarchy are lacking. The key is IHEval, a new benchmark of 3,538 examples across nine tasks covering cases where instructions at different priorities align or conflict, which reveals the shortcomings of current LMs in recognizing and handling instruction priorities.

Link: https://arxiv.org/abs/2502.08745
Authors: Zhihan Zhang, Shiyang Li, Zixuan Zhang, Xin Liu, Haoming Jiang, Xianfeng Tang, Yifan Gao, Zheng Li, Haodong Wang, Zhaoxuan Tan, Yichuan Li, Qingyu Yin, Bing Yin, Meng Jiang
Affiliations: Amazon; University of Notre Dame; Worcester Polytechnic Institute
Subjects: Computation and Language (cs.CL)
Comments: Accepted to NAACL 2025

Abstract:The instruction hierarchy, which establishes a priority order from system messages to user messages, conversation history, and tool outputs, is essential for ensuring consistent and safe behavior in language models (LMs). Despite its importance, this topic receives limited attention, and there is a lack of comprehensive benchmarks for evaluating models’ ability to follow the instruction hierarchy. We bridge this gap by introducing IHEval, a novel benchmark comprising 3,538 examples across nine tasks, covering cases where instructions in different priorities either align or conflict. Our evaluation of popular LMs highlights their struggle to recognize instruction priorities. All evaluated models experience a sharp performance decline when facing conflicting instructions, compared to their original instruction-following performance. Moreover, the most competitive open-source model only achieves 48% accuracy in resolving such conflicts. Our results underscore the need for targeted optimization in the future development of LMs.

[NLP-76] Are Expressions for Music Emotions the Same Across Cultures?

【Quick Read】: This paper addresses the debate over the universality of emotional descriptors in cross-cultural music emotion research and the cultural biases that result from biased stimulus selection and manually curated taxonomies. The key is a balanced experimental design comprising nine online experiments in Brazil, the US, and South Korea, with an open-ended tagging pipeline that gathers emotion terms to build culture-specific taxonomies, thereby reducing cultural bias.

Link: https://arxiv.org/abs/2502.08744
Authors: Elif Celen, Pol van Rijn, Harin Lee, Nori Jacoby
Affiliations: Max Planck Institute for Empirical Aesthetics, Germany; Max Planck Institute for Human Cognitive and Brain Sciences, Germany; Cornell University, United States
Subjects: Computation and Language (cs.CL); Human-Computer Interaction (cs.HC); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Comments: Submitted to CogSci

Abstract:Music evokes profound emotions, yet the universality of emotional descriptors across languages remains debated. A key challenge in cross-cultural research on music emotion is biased stimulus selection and manual curation of taxonomies, predominantly relying on Western music and languages. To address this, we propose a balanced experimental design with nine online experiments in Brazil, the US, and South Korea, involving N=672 participants. First, we sample a balanced set of popular music from these countries. Using an open-ended tagging pipeline, we then gather emotion terms to create culture-specific taxonomies. Finally, using these bottom-up taxonomies, participants rate emotions of each song. This allows us to map emotional similarities within and across cultures. Results show consistency in high arousal, high valence emotions but greater variability in others. Notably, machine translations were often inadequate to capture music-specific meanings. These findings together highlight the need for a domain-sensitive, open-ended, bottom-up emotion elicitation approach to reduce cultural biases in emotion research.

[NLP-77] Data Augmentation to Improve Large Language Models in Food Hazard and Product Detection

【Quick Read】: This paper aims to improve model performance on food hazard and product analysis via data augmentation with ChatGPT-4o-mini. The key is to augment the original dataset with generated data and to train two large language models, RoBERTa-base and Flan-T5-base, on the augmented data. Experiments show that, compared with using only the original dataset, the augmented data significantly improves model performance on key metrics including recall, F1 score, precision, and accuracy.

Link: https://arxiv.org/abs/2502.08687
Authors: Areeg Fahad Rasheed, M. Zarkoosh, Shimam Amer Chasib, Safa F. Abbas
Affiliations: Al-Nahrain University; Ministry of Labour and Social Affairs
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:The primary objective of this study is to demonstrate the impact of data augmentation using ChatGPT-4o-mini on food hazard and product analysis. The augmented data is generated using ChatGPT-4o-mini and subsequently used to train two large language models: RoBERTa-base and Flan-T5-base. The models are evaluated on test sets. The results indicate that using augmented data helped improve model performance across key metrics, including recall, F1 score, precision, and accuracy, compared to using only the provided dataset. The full code, including model training and the augmented dataset, can be found in this repository: this https URL

[NLP-78] Mathematical Reasoning in Large Language Models : Assessing Logical and Arithmetic Errors across Wide Numerical Ranges

【Quick Read】: This paper addresses two shortcomings of existing mathematical reasoning benchmarks: limited numerical ranges and evaluations that only compare model outputs to ground-truth answers, ignoring the reasoning process. The key is the GSM-Ranges dataset generator, derived from GSM8K, which systematically perturbs numerical values to assess model robustness across numerical scales, together with a novel grading methodology that distinguishes logical from non-logical errors, enabling a more precise evaluation of reasoning beyond computational accuracy.

Link: https://arxiv.org/abs/2502.08680
Authors: Safal Shrestha, Minwu Kim, Keith Ross
Affiliations: New York University Abu Dhabi
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:

Abstract:Mathematical reasoning in Large Language Models (LLMs) is often evaluated using benchmarks with limited numerical ranges, failing to reflect real-world problem-solving across diverse scales. Furthermore, most existing evaluation methods only compare model outputs to ground-truth answers, obscuring insights into reasoning processes. To address these limitations, we introduce GSM-Ranges, a dataset generator derived from GSM8K that systematically perturbs numerical values in math problems to assess model robustness across varying numerical scales. Additionally, we propose a novel grading methodology that distinguishes between logical and non-logical errors, offering a more precise evaluation of reasoning processes beyond computational accuracy. Our experiments with various models reveal a significant increase in logical error rates-up to 14 percentage points-as numerical complexity rises, demonstrating a general weakness in reasoning with out-of-distribution numerical values. Moreover, while models demonstrate high accuracy on standalone arithmetic tasks, their performance deteriorates substantially when computations are embedded within word problems. These findings provide a comprehensive evaluation of LLMs’ mathematical reasoning capabilities and inform future research directions for improving numerical generalization in language models.
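A GSM-Ranges-style perturbation is easy to illustrate: rewrite the integers in a word problem at a chosen magnitude. The sketch below only perturbs the text; re-deriving the ground-truth answer for grading, which the benchmark also requires, is omitted, and the sampling rule is an illustrative assumption:

```python
import random
import re

def perturb_numbers(problem: str, scale: int = 1000) -> str:
    """Replace each integer in a word problem with a random value of a
    larger magnitude, keeping the problem text otherwise intact."""
    def repl(match: re.Match) -> str:
        return str(random.randint(scale, scale * 10))
    return re.sub(r"\b\d+\b", repl, problem)

random.seed(0)
base = "Ann has 3 boxes with 12 pencils each. How many pencils in total?"
print(perturb_numbers(base, scale=1))      # small, in-distribution numbers
print(perturb_numbers(base, scale=10**4))  # out-of-distribution magnitudes
```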

[NLP-79] Assessing the Impact of the Quality of Textual Data on Feature Representation and Machine Learning Models

【Quick Read】: This paper assesses and addresses the impact of data quality on machine learning model performance. The study introduces an error rate metric to quantify the quality of textual datasets and uses the Mixtral large language model (LLM) to detect and correct errors. The key is to use LLMs to quantify and repair errors in low-quality datasets and thereby improve model performance across different error rates. Results show that models perform relatively well when the dataset error rate is below 10%, but performance degrades markedly as the error rate rises; assessing dataset quality before use and applying corrective measures is therefore essential for reliable and effective machine learning models.

Link: https://arxiv.org/abs/2502.08669
Authors: Tabinda Sarwar, Antonio Jose Jimeno Yepes, Lawrence Cavedon
Affiliations: Unknown
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:Background: Data collected in controlled settings typically results in high-quality datasets. However, in real-world applications, the quality of data collection is often compromised. It is well established that the quality of a dataset significantly impacts the performance of machine learning models. Methods: A rudimentary error rate metric was developed to evaluate textual dataset quality at the token level. The Mixtral Large Language Model (LLM) was used to quantify and correct errors in low-quality datasets. The study analyzed two healthcare datasets: the high-quality MIMIC-III public hospital dataset and a lower-quality private dataset from Australian aged care homes (ACH). Errors were systematically introduced into MIMIC at varying rates, while the ACH dataset quality was improved using the LLM. Results: For the sampled 35,774 and 6,336 patients from the MIMIC and ACH datasets respectively, we used Mixtral to introduce errors in MIMIC and correct errors in ACH. Mixtral correctly detected errors in 63% of progress notes, with 17% containing a single token misclassified due to medical terminology. LLMs demonstrated potential for improving progress note quality by addressing various errors. Under varying error rates, feature representation performance was tolerant to lower error rates (below 10%) but declined significantly at higher rates. Conclusions: The study revealed that models performed relatively well on datasets with lower error rates (below 10%), but their performance declined significantly as error rates increased (10% or above). Therefore, it is crucial to evaluate the quality of a dataset before utilizing it for machine learning tasks. For datasets with higher error rates, implementing corrective measures is essential to ensure the reliability and effectiveness of machine learning models.

[NLP-80] Style Extraction on Text Embeddings Using VAE and Parallel Dataset

【Quick Read】: This paper uses a Variational Autoencoder (VAE) model to detect and analyze stylistic differences among Bible translations, focusing on distinguishing the American Standard Version (ASV) from other translations. The key is to embed the textual data into a high-dimensional vector space and use the VAE to identify the stylistic distribution unique to each translation. The results show the model is adept at capturing and differentiating textual styles, although it is primarily optimized for distinguishing a single style.

Link: https://arxiv.org/abs/2502.08668
Authors: InJin Kong, Shinyee Kang, Yuna Park, Sooyong Kim, Sanghyun Park
Affiliations: Unknown
Subjects: Computation and Language (cs.CL)
Comments: 9 pages, 6 figures

Abstract:This study investigates the stylistic differences among various Bible translations using a Variational Autoencoder (VAE) model. By embedding textual data into high-dimensional vectors, the study aims to detect and analyze stylistic variations between translations, with a specific focus on distinguishing the American Standard Version (ASV) from other translations. The results demonstrate that each translation exhibits a unique stylistic distribution, which can be effectively identified using the VAE model. These findings suggest that the VAE model is proficient in capturing and differentiating textual styles, although it is primarily optimized for distinguishing a single style. The study highlights the model’s potential for broader applications in AI-based text generation and stylistic analysis, while also acknowledging the need for further model refinement to address the complexity of multi-dimensional stylistic relationships. Future research could extend this methodology to other text domains, offering deeper insights into the stylistic features embedded within various types of textual data.

[NLP-81] Hallucination, Monofacts, and Miscalibration: An Empirical Investigation

【Quick Read】: This paper investigates the theoretical bounds on hallucination in large language models (LLMs) and the factors that influence them, studying the distribution of fact frequencies in training data and the calibration-hallucination trade-off through theory and systematic experiments. The key finding is that the hallucination rate is lower bounded by the training data's monofact rate (related to the classical Good-Turing missing mass estimator) minus model miscalibration, and controlled experiments show that even at a fixed monofact rate, reweighting training samples can substantially reduce hallucination. These results suggest that current practices of aggressive deduplication in training data may need to be reconsidered, and that selective duplication could serve as a principled mechanism for reducing hallucination.

Link: https://arxiv.org/abs/2502.08666
Authors: Muqing Miao, Michael Kearns
Affiliations: Department of Computer and Information Science, University of Pennsylvania, Philadelphia, Pennsylvania, USA
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Code available at this https URL

Abstract:Recent theoretical work by [Kalai and Vempala 2024] proves that a particular notion of hallucination rate in LLMs must be lower bounded by the training data monofact rate (related to the classical Good-Turing missing mass estimator) minus model miscalibration. Through systematic experiments with n-gram models and in-context learning with LLMs, we empirically investigate and validate this theory by examining how different underlying data distributions affect the monofact rate and a model’s tendency to hallucinate. We then vary model miscalibration through controlled upweighting of training samples while holding monofact rates constant, allowing us to isolate miscalibration’s reduction effect on hallucination. These findings suggest that both the distribution of fact frequencies in training data and the calibration-hallucination trade-off are inherent to probabilistic language generation. Our results also suggest that current practices of aggressive deduplication in training data may need to be reconsidered, as selective duplication could serve as a principled mechanism for reducing hallucination.
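The monofact rate at the center of the bound is simple to compute: it is the Good-Turing N1/N statistic over training facts. A toy illustration, with strings standing in for facts:

```python
from collections import Counter

def monofact_rate(facts):
    """Fraction of fact observations whose fact appears exactly once in
    the training data -- the classical Good-Turing N1/N estimate of the
    missing mass, which the theory above uses to lower bound hallucination."""
    counts = Counter(facts)
    n1 = sum(1 for c in counts.values() if c == 1)
    return n1 / len(facts)

corpus = ["a born 1960", "b born 1971", "b born 1971", "c born 1983",
          "d born 1990", "d born 1990", "d born 1990"]
print(f"monofact rate: {monofact_rate(corpus):.3f}")  # 2 singletons / 7 facts
```

Duplicating a fact drops it out of the N1 count, which is why selective duplication lowers this quantity and, per the bound, the floor on hallucination.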

[NLP-82] Hallucination Detection: A Probabilistic Framework Using Embeddings Distance Analysis

【Quick Read】: This paper addresses hallucinations in large language models (LLMs) to facilitate their broader adoption in production systems. The key is a mathematically grounded method for detecting hallucinations: by studying structural differences between hallucinated and correct content and leveraging Minkowski distances in the embedding space, the authors show statistically significant differences in the distributions of embedding distances. This finding enables a tool that detects hallucinated responses with 66% accuracy.

Link: https://arxiv.org/abs/2502.08663
Authors: Emanuele Ricco, Lorenzo Cima, Roberto Di Pietro
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
Comments:

Abstract:Hallucinations are one of the major issues affecting LLMs, hindering their wide adoption in production systems. While current research solutions for detecting hallucinations are mainly based on heuristics, in this paper we introduce a mathematically sound methodology to reason about hallucination, and leverage it to build a tool to detect hallucinations. To the best of our knowledge, we are the first to show that hallucinated content has structural differences with respect to correct content. To prove this result, we resort to the Minkowski distances in the embedding space. Our findings demonstrate statistically significant differences in the embedding distance distributions, that are also scale free – they qualitatively hold regardless of the distance norm used and the number of keywords, questions, or responses. We leverage these structural differences to develop a tool to detect hallucinated responses, achieving an accuracy of 66% for a specific configuration of system parameters – comparable with the best results in the field. In conclusion, the suggested methodology is promising and novel, possibly paving the way for further research in the domain, also along the directions highlighted in our future work.
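A toy version of a distance-based detector can be sketched as follows. The embeddings, the averaging over keyword embeddings, and the fixed threshold are illustrative assumptions (the paper calibrates its configuration empirically):

```python
import numpy as np

def minkowski(u: np.ndarray, v: np.ndarray, p: float = 2.0) -> float:
    return float(np.sum(np.abs(u - v) ** p) ** (1.0 / p))

def flag_hallucination(resp_emb, keyword_embs, threshold, p=2.0):
    """If the response embedding sits unusually far from the question's
    keyword embeddings, flag it. The threshold would be calibrated on
    held-out labeled responses."""
    d = np.mean([minkowski(resp_emb, k, p) for k in keyword_embs])
    return d > threshold, d

rng = np.random.default_rng(1)
keywords = rng.normal(size=(5, 384))                      # keyword embeddings
grounded = keywords.mean(axis=0) + 0.1 * rng.normal(size=384)
drifted = rng.normal(size=384) * 3.0                      # structurally far response
for name, emb in [("grounded", grounded), ("drifted", drifted)]:
    flagged, dist = flag_hallucination(emb, keywords, threshold=30.0)
    print(f"{name}: distance={dist:.1f} hallucination={flagged}")
```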

[NLP-83] RoToR: Towards More Reliable Responses for Order-Invariant Inputs

【Quick Read】: This paper addresses positional bias in language models (LMs) on listwise inputs, particularly in the zero-shot setting. The key contributions are: (1) RoToR, a zero-shot invariant LM for genuinely order-invariant inputs that requires only minimal modifications to positional IDs; and (2) Selective Routing, an adaptive framework that handles both order-invariant and order-sensitive inputs. These methods improve LM performance on the Lost in the Middle (LitM), Knowledge Graph Question Answering (KGQA), and MMLU benchmarks.

Link: https://arxiv.org/abs/2502.08662
Authors: Soyoung Yoon, Dongha Ahn, Youngwon Lee, Minkyu Jung, HyungJoo Jang, Seung-won Hwang
Affiliations: Seoul National University; Channel Corporation
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:Mitigating positional bias of language models (LMs) for listwise inputs is a well-known and important problem (e.g., lost-in-the-middle). While zero-shot order-invariant LMs have been proposed to solve this issue, their success on practical listwise problems has been limited. In this work, as a first contribution, we identify and overcome two limitations to make zero-shot invariant LMs more practical: (1) training and inference distribution mismatch arising from modifying positional ID assignments to enforce invariance, and (2) failure to adapt to a mixture of order-invariant and sensitive inputs in practical listwise problems. To overcome, we propose (1) RoToR, a zero-shot invariant LM for genuinely order-invariant inputs with minimal modifications of positional IDs, and (2) Selective Routing, an adaptive framework that handles both order-invariant and order-sensitive inputs in listwise tasks. On the Lost in the middle (LitM), Knowledge Graph Question Answering (KGQA), and MMLU benchmarks, we show that RoToR with Selective Routing can effectively handle practical listwise input tasks in a zero-shot manner.

[NLP-84] Few-shot LLM Synthetic Data with Distribution Matching WWW2025

【Quick Read】: This paper addresses the mismatch between synthetic data generated by large language models (LLMs) and real data in key linguistic attributes (e.g., style, tone, content proportions). Directly mixing such synthetic data with real data can distort the original data distribution and hinder performance gains. The key is SynAlign, a framework for synthetic data generation and filtering based on key attribute distribution matching. Specifically, SynAlign uses an uncertainty tracker surrogated by a Gaussian Process model to iteratively select data clusters distinct from those already chosen as demonstrations for new data synthesis, enabling efficient exploration of the diversity of the real data. A latent attribute reasoning step then has the LLM summarize the linguistic attributes of the demonstrations and synthesize new data based on them, and Maximum Mean Discrepancy is used as the objective to learn a sampling weight for each synthetic example, ensuring distribution matching with the real data.

Link: https://arxiv.org/abs/2502.08661
Authors: Jiyuan Ren, Zhaocheng Du, Zhihao Wen, Qinglin Jia, Sunhao Dai, Chuhan Wu, Zhenhua Dong
Affiliations: Tsinghua University; Huawei Noah's Ark Lab; Renmin University of China
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 10 pages, 5 figures, accepted at WWW 2025

Abstract:As large language models (LLMs) advance, their ability to perform in-context learning and few-shot language generation has improved significantly. This has spurred using LLMs to produce high-quality synthetic data to enhance the performance of smaller models like online retrievers or weak LLMs. However, LLM-generated synthetic data often differs from the real data in key language attributes (e.g., styles, tones, content proportions, etc.). As a result, mixing these synthetic data directly with real data may distort the original data distribution, potentially hindering performance improvements. To solve this, we introduce SynAlign: a synthetic data generation and filtering framework based on key attribute distribution matching. Before generation, SynAlign employs an uncertainty tracker surrogated by the Gaussian Process model to iteratively select data clusters distinct from selected ones as demonstrations for new data synthesis, facilitating efficient exploration of the diversity of the real data. Then, a latent attribute reasoning method is employed: the LLM summarizes linguistic attributes of demonstrations and then synthesizes new data based on them. This approach facilitates synthesizing diverse data with linguistic attributes that appear in real data. After generation, the Maximum Mean Discrepancy is used as the objective function to learn the sampling weight of each synthetic data point, ensuring distribution matching with the real data. Our experiments on multiple text prediction tasks show significant performance improvements. We also conducted an online A/B test on an online retriever to demonstrate SynAlign's effectiveness.
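The distribution-matching objective is standard: squared Maximum Mean Discrepancy between real and synthetic embeddings under an RBF kernel. The unweighted sketch below shows the quantity being minimized; SynAlign additionally learns per-sample weights, which is omitted here:

```python
import numpy as np

def rbf_mmd2(X: np.ndarray, Y: np.ndarray, sigma: float = 1.0) -> float:
    """Squared MMD with an RBF kernel between two embedding sets."""
    def k(A, B):
        d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
        return np.exp(-d2 / (2 * sigma ** 2))
    return k(X, X).mean() + k(Y, Y).mean() - 2 * k(X, Y).mean()

rng = np.random.default_rng(0)
real = rng.normal(0.0, 1.0, size=(200, 16))        # real-data embeddings
synth_near = rng.normal(0.1, 1.0, size=(200, 16))  # well-matched synthetic set
synth_far = rng.normal(1.5, 1.0, size=(200, 16))   # distribution-shifted set
print(f"MMD^2 near: {rbf_mmd2(real, synth_near):.4f}")
print(f"MMD^2 far:  {rbf_mmd2(real, synth_far):.4f}")
```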

[NLP-85] Semantic Role Labeling: A Systematical Survey

【Quick Read】: This paper addresses the lack of a comprehensive survey of semantic role labeling (SRL). The key is a systematic categorization and synthesis of SRL methods from four core perspectives: model architectures, syntax feature modeling, application scenarios, and multimodal extensions. The survey also covers benchmarks, evaluation metrics, and paradigm modeling approaches, analyzes practical applications across domains, and discusses future research directions, particularly SRL's evolving role in the era of large language models (LLMs) and its potential impact on the broader NLP landscape.

Link: https://arxiv.org/abs/2502.08660
Authors: Huiyao Chen, Meishan Zhang, Jing Li, Min Zhang, Lilja Øvrelid, Jan Hajič, Hao Fei
Affiliations: Harbin Institute of Technology (Shenzhen); University of Oslo; Charles University; National University of Singapore
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:Semantic role labeling (SRL) is a central natural language processing (NLP) task aiming to understand the semantic roles within texts, facilitating a wide range of downstream applications. While SRL has garnered extensive and enduring research, there is currently a lack of a comprehensive survey that thoroughly organizes and synthesizes the field. This paper aims to review the entire research trajectory of the SRL community over the past two decades. We begin by providing a complete definition of SRL. To offer a comprehensive taxonomy, we categorize SRL methodologies into four key perspectives: model architectures, syntax feature modeling, application scenarios, and multi-modal extensions. Further, we discuss SRL benchmarks, evaluation metrics, and paradigm modeling approaches, while also exploring practical applications across various domains. Finally, we analyze future research directions in SRL, addressing the evolving role of SRL in the age of large language models (LLMs) and its potential impact on the broader NLP landscape. We maintain a public repository and consistently update related resources at: this https URL

[NLP-86] Refining Positive and Toxic Samples for Dual Safety Self-Alignment of LLMs with Minimal Human Interventions

【Quick Read】: This paper addresses the heavy reliance of existing instruction tuning and reinforcement learning approaches on manually annotated high-quality positive samples when aligning large language models (LLMs) with human intent, a process that suffers from label noise and minimal distinctions between preferred and dispreferred responses. The key is PT-ALIGN, a novel safety self-alignment approach that minimizes human supervision by automatically refining positive and toxic samples and performing fine-grained dual instruction tuning. PT-ALIGN uses the LLM itself to iteratively generate and refine training instances with fewer than 50 human annotations, and jointly applies a maximum likelihood estimation (MLE) loss and a fine-grained unlikelihood training (UT) loss: the MLE loss maximizes the generation of harmless content from positive samples, while the token-level UT loss minimizes the output of harmful words from toxic samples, steering the model toward safer fine-tuning objectives and increasing the likelihood of helpful, reliable outputs.

Link: https://arxiv.org/abs/2502.08657
Authors: Jingxin Xu, Guoshun Nan, Sheng Guan, Sicong Leng, Yilian Liu, Zixiao Wang, Yuyang Ma, Zhili Zhou, Yanzhao Hou, Xiaofeng Tao
Affiliations: National Engineering Research Center for Mobile Network Technologies, Beijing University of Posts and Telecommunications, Beijing, 100876, China; Nanyang Technological University, Singapore; Guangzhou University, Guangzhou, China
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: This work has been submitted to the IEEE for possible publication

Abstract:Recent AI agents, such as ChatGPT and LLaMA, primarily rely on instruction tuning and reinforcement learning to calibrate the output of large language models (LLMs) with human intentions, ensuring the outputs are harmless and helpful. Existing methods heavily depend on the manual annotation of high-quality positive samples, while contending with issues such as noisy labels and minimal distinctions between preferred and dispreferred response data. However, readily available toxic samples with clear safety distinctions are often filtered out, removing valuable negative references that could aid LLMs in safety alignment. In response, we propose PT-ALIGN, a novel safety self-alignment approach that minimizes human supervision by automatically refining positive and toxic samples and performing fine-grained dual instruction tuning. Positive samples are harmless responses, while toxic samples deliberately contain extremely harmful content, serving as a new supervisory signals. Specifically, we utilize LLM itself to iteratively generate and refine training instances by only exploring fewer than 50 human annotations. We then employ two losses, i.e., maximum likelihood estimation (MLE) and fine-grained unlikelihood training (UT), to jointly learn to enhance the LLM’s safety. The MLE loss encourages an LLM to maximize the generation of harmless content based on positive samples. Conversely, the fine-grained UT loss guides the LLM to minimize the output of harmful words based on negative samples at the token-level, thereby guiding the model to decouple safety from effectiveness, directing it toward safer fine-tuning objectives, and increasing the likelihood of generating helpful and reliable content. Experiments on 9 popular open-source LLMs demonstrate the effectiveness of our PT-ALIGN for safety alignment, while maintaining comparable levels of helpfulness and usefulness.
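The two-part objective can be sketched at the token level: standard cross-entropy on positive samples plus an unlikelihood term on toxic samples. The weighting and tensor shapes below are illustrative, not the paper's exact loss:

```python
import torch
import torch.nn.functional as F

def dual_alignment_loss(pos_logits, pos_targets, tox_logits, tox_targets,
                        ut_weight: float = 0.5):
    """MLE (cross-entropy) pushes up the probability of harmless tokens
    from positive samples; a token-level unlikelihood term -log(1 - p)
    pushes down the probability of harmful tokens from toxic samples."""
    mle = F.cross_entropy(pos_logits.view(-1, pos_logits.size(-1)),
                          pos_targets.view(-1))
    tox_logp = F.log_softmax(tox_logits, dim=-1)
    p_harm = tox_logp.gather(-1, tox_targets.unsqueeze(-1)).squeeze(-1).exp()
    unlikelihood = -torch.log((1.0 - p_harm).clamp(min=1e-6)).mean()
    return mle + ut_weight * unlikelihood

vocab, T = 100, 8
pos_logits = torch.randn(2, T, vocab)   # model logits on harmless responses
tox_logits = torch.randn(2, T, vocab)   # model logits on toxic responses
pos_targets = torch.randint(0, vocab, (2, T))
tox_targets = torch.randint(0, vocab, (2, T))
print(dual_alignment_loss(pos_logits, pos_targets, tox_logits, tox_targets))
```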

[NLP-87] Bridging the Evaluation Gap: Leveraging Large Language Models for Topic Model Evaluation

【Quick Read】: This paper addresses the automated evaluation of dynamically evolving topic taxonomies in scientific literature. The key is to use large language models (LLMs) to measure quality dimensions such as coherence, repetitiveness, diversity, and topic-document alignment without heavy reliance on expert annotators or narrow statistical metrics; tailored prompts guide the LLM assessments, ensuring consistent and interpretable evaluations across datasets and modeling techniques.

Link: https://arxiv.org/abs/2502.07352
Authors: Zhiyin Tan, Jennifer D'Souza
Affiliations: L3S Research Center, Leibniz University Hannover; TIB Leibniz Information Centre for Science and Technology
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Digital Libraries (cs.DL)
Comments: Accepted by IRCDL 2025

Abstract:This study presents a framework for automated evaluation of dynamically evolving topic taxonomies in scientific literature using Large Language Models (LLMs). In digital library systems, topic modeling plays a crucial role in efficiently organizing and retrieving scholarly content, guiding researchers through complex knowledge landscapes. As research domains proliferate and shift, traditional human-centric and static evaluation methods struggle to maintain relevance. The proposed approach harnesses LLMs to measure key quality dimensions, such as coherence, repetitiveness, diversity, and topic-document alignment, without heavy reliance on expert annotators or narrow statistical metrics. Tailored prompts guide LLM assessments, ensuring consistent and interpretable evaluations across various datasets and modeling techniques. Experiments on benchmark corpora demonstrate the method's robustness, scalability, and adaptability, underscoring its value as a more holistic and dynamic alternative to conventional evaluation strategies.

[NLP-88] ReasonFlux: Hierarchical LLM Reasoning via Scaling Thought Templates

【Quick Read】: This paper addresses performance optimization for large language models (LLMs) on mathematical reasoning tasks. The key is scaling a library of thought templates and applying hierarchical reinforcement learning so the model can plan and execute more effectively in the reasoning search space. Concretely, the authors build a structured, generic library of about 500 high-level thought templates and adaptively scale them at inference time, raising mathematical reasoning to state-of-the-art levels: ReasonFlux-32B reaches 91.2% accuracy on the MATH benchmark and solves 56.7% of AIME problems, surpassing OpenAI o1-preview and DeepSeek-V3.

Link: https://arxiv.org/abs/2502.06772
Authors: Ling Yang, Zhaochen Yu, Bin Cui, Mengdi Wang
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Code: this https URL

Abstract:We present that hierarchical LLM reasoning via scaling thought templates can effectively optimize the reasoning search space and outperform the mathematical reasoning capabilities of powerful LLMs like OpenAI o1-preview and DeepSeek V3. We train our ReasonFlux-32B model with only 8 GPUs and introduces three innovations: (i) a structured and generic thought template library, containing around 500 high-level thought templates capable of generalizing to similar or relevant reasoning problems; (ii) performing hierarchical reinforcement learning on a sequence of thought templates instead of long CoTs, optimizing a base LLM to plan out an optimal template trajectory for gradually handling complex problems; (iii) a brand new inference scaling system that enables hierarchical LLM reasoning by adaptively scaling thought templates at inference time. With a template trajectory containing sequential thought templates, our ReasonFlux-32B significantly advances math reasoning capabilities to state-of-the-art levels. Notably, on the MATH benchmark, it achieves an accuracy of 91.2% and surpasses o1-preview by 6.7%. On the USA Math Olympiad (AIME) benchmark, ReasonFlux-32B solves an average of 56.7% of problems, surpassing o1-preview and DeepSeek-V3 by 27% and 45%, respectively. Code: this https URL

Computer Vision

[CV-0] Embed Any NeRF: Graph Meta-Networks for Neural Tasks on Arbitrary NeRF Architectures

【Quick Read】: This paper addresses the lack of generality across architectures when processing NeRFs (Neural Radiance Fields): existing frameworks can only handle NeRFs with a specific, predefined architecture, whereas this work proposes a framework that ingests NeRFs with multiple architectures and performs inference on architectures unseen at training time. The key is training a Graph Meta-Network in a representation learning framework, with a contrastive objective used to obtain an architecture-agnostic latent space. On both MLP-based and tri-planar NeRFs, the approach shows robust classification and retrieval performance that matches or exceeds existing frameworks constrained to single architectures.

Link: https://arxiv.org/abs/2502.09623
Authors: Francesco Ballerini, Pierluigi Zama Ramirez, Samuele Salti, Luigi Di Stefano
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Under review

Abstract:Neural Radiance Fields (NeRFs) have emerged as a groundbreaking paradigm for representing 3D objects and scenes by encoding shape and appearance information into the weights of a neural network. Recent works have shown how such weights can be used as input to frameworks processing them to solve deep learning tasks. Yet, these frameworks can only process NeRFs with a specific, predefined architecture. In this paper, we present the first framework that can ingest NeRFs with multiple architectures and perform inference on architectures unseen at training time. We achieve this goal by training a Graph Meta-Network in a representation learning framework. Moreover, we show how a contrastive objective is conducive to obtaining an architecture-agnostic latent space. In experiments on both MLP-based and tri-planar NeRFs, our approach demonstrates robust performance in classification and retrieval tasks that either matches or exceeds that of existing frameworks constrained to single architectures, thus providing the first architecture-agnostic method to perform tasks on NeRFs by processing their weights.

[CV-1] Can this Model Also Recognize Dogs? Zero-Shot Model Search from Weights

【Quick Read】: This paper addresses the difficulty users face when searching for pretrained models suited to a specific task: current model search is limited to text search over documentation, making relevant models hard to find. The key is ProbeLog, which computes a descriptor for each output dimension (logit) of a model by observing its responses on a fixed set of inputs (probes). The method supports both logit-based retrieval and zero-shot, text-based retrieval, identifying target concepts without access to model metadata or training data. To reduce the cost of probe-based representations, a collaborative-filtering method cuts the cost of encoding repositories by 3x. ProbeLog achieves high retrieval accuracy and scales to full-size repositories.

Link: https://arxiv.org/abs/2502.09619
Authors: Jonathan Kahana, Or Nathan, Eliahu Horwitz, Yedid Hoshen
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:With the increasing numbers of publicly available models, there are probably pretrained, online models for most tasks users require. However, current model search methods are rudimentary, essentially a text-based search in the documentation, thus users cannot find the relevant models. This paper presents ProbeLog, a method for retrieving classification models that can recognize a target concept, such as “Dog”, without access to model metadata or training data. Differently from previous probing methods, ProbeLog computes a descriptor for each output dimension (logit) of each model, by observing its responses on a fixed set of inputs (probes). Our method supports both logit-based retrieval (“find more logits like this”) and zero-shot, text-based retrieval (“find all logits corresponding to dogs”). As probing-based representations require multiple costly feedforward passes through the model, we develop a method, based on collaborative filtering, that reduces the cost of encoding repositories by 3x. We demonstrate that ProbeLog achieves high retrieval accuracy, both in real-world and fine-grained search tasks and is scalable to full-size repositories.
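The descriptor idea is easy to prototype: each (model, logit) pair becomes a vector of responses over a shared probe set, and retrieval is cosine similarity between descriptors. Random matrices stand in for real model outputs below, so everything here is an illustrative assumption:

```python
import numpy as np

rng = np.random.default_rng(0)
n_probes, n_models, n_logits = 64, 10, 20

# descriptors[m, j] = responses of model m's logit j on the probe set
descriptors = rng.normal(size=(n_models, n_logits, n_probes))
descriptors /= np.linalg.norm(descriptors, axis=-1, keepdims=True)

def find_similar_logits(query_desc, top_k=5):
    """'Find more logits like this': rank all (model, logit) pairs by
    cosine similarity with a query descriptor."""
    q = query_desc / np.linalg.norm(query_desc)
    sims = descriptors @ q                     # (models, logits)
    flat = np.argsort(sims.ravel())[::-1][:top_k]
    return [(int(i // n_logits), int(i % n_logits), float(sims.ravel()[i]))
            for i in flat]

# Query with a noisy copy of model 3's logit 7; it should rank first.
query = descriptors[3, 7] + 0.1 * rng.normal(size=n_probes)
for m, j, s in find_similar_logits(query):
    print(f"model {m} logit {j}: cos={s:.3f}")
```

Zero-shot text retrieval would replace `query` with a descriptor produced by a text-aligned encoder over the same probes; that piece is beyond this sketch.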

[CV-2] LIFe-GoM: Generalizable Human Rendering with Learned Iterative Feedback Over Multi-Resolution Gaussians-on-Mesh ICLR2025

【Quick Read】: This paper addresses generalizable rendering of animatable human avatars from sparse inputs. The key is an improved dual shape representation combined with an iterative feedback update framework and a coupled multi-resolution Gaussians-on-Mesh representation, enabling efficient, high-quality reconstruction of the human shape in a single inference pass together with fast, high-resolution rendering.

Link: https://arxiv.org/abs/2502.09617
Authors: Jing Wen, Alexander G. Schwing, Shenlong Wang
Affiliations: University of Illinois Urbana-Champaign
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: ICLR 2025; Project page: this https URL

Abstract:Generalizable rendering of an animatable human avatar from sparse inputs relies on data priors and inductive biases extracted from training on large data to avoid scene-specific optimization and to enable fast reconstruction. This raises two main challenges: First, unlike iterative gradient-based adjustment in scene-specific optimization, generalizable methods must reconstruct the human shape representation in a single pass at inference time. Second, rendering is preferably computationally efficient yet of high resolution. To address both challenges we augment the recently proposed dual shape representation, which combines the benefits of a mesh and Gaussian points, in two ways. To improve reconstruction, we propose an iterative feedback update framework, which successively improves the canonical human shape representation during reconstruction. To achieve computationally efficient yet high-resolution rendering, we study a coupled-multi-resolution Gaussians-on-Mesh representation. We evaluate the proposed approach on the challenging THuman2.0, XHuman and AIST++ data. Our approach reconstructs an animatable representation from sparse inputs in less than 1s, renders views with 95.1FPS at 1024 \times 1024 , and achieves PSNR/LPIPS*/FID of 24.65/110.82/51.27 on THuman2.0, outperforming the state-of-the-art in rendering quality.

[CV-3] Variational Rectified Flow Matching

【速读】: This paper addresses a limitation of classic rectified flow matching when the ground-truth velocity vector-fields are multi-modal: trained with a standard mean-squared-error loss, the learnt velocity field averages the true directions and therefore fails to capture multi-modality. The key solution is Variational Rectified Flow Matching, which learns and samples from multi-modal flow directions, handling multi-modal distributions better.
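
For context, a minimal sketch of the classic rectified flow matching objective that the paper builds on (the velocity network `v_net` and batch tensors are assumptions; the variational variant would additionally condition on a latent variable instead of regressing a single averaged direction):

```python
import torch

def rectified_flow_matching_loss(v_net, x0, x1):
    # x0: batch from the source distribution; x1: coupled batch from the target.
    t = torch.rand(x0.size(0), 1)          # one interpolation time per sample
    xt = (1 - t) * x0 + t * x1             # linear interpolation between the pair
    v_target = x1 - x0                     # "ground-truth" velocity along the path
    # The MSE below averages ambiguous directions that meet at the same (xt, t),
    # which is exactly the multi-modality issue the variational method addresses.
    return ((v_net(xt, t) - v_target) ** 2).mean()
```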

链接: https://arxiv.org/abs/2502.09616
作者: Pengsheng Guo,Alexander G. Schwing
机构: 未知
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注:

Abstract:We study Variational Rectified Flow Matching, a framework that enhances classic rectified flow matching by modeling multi-modal velocity vector-fields. At inference time, classic rectified flow matching "moves" samples from a source distribution to the target distribution by solving an ordinary differential equation via integration along a velocity vector-field. At training time, the velocity vector-field is learnt by linearly interpolating between coupled samples, one drawn from the source and one drawn from the target distribution at random. This leads to "ground-truth" velocity vector-fields that point in different directions at the same location, i.e., the velocity vector-fields are multi-modal/ambiguous. However, since training uses a standard mean-squared-error loss, the learnt velocity vector-field averages the "ground-truth" directions and isn't multi-modal. In contrast, variational rectified flow matching learns and samples from multi-modal flow directions. We show on synthetic data, MNIST, CIFAR-10, and ImageNet that variational rectified flow matching leads to compelling results.

[CV-4] RigAnything: Template-Free Autoregressive Rigging for Diverse 3D Assets WWW

【速读】: This paper addresses automatic rigging (auto-rigging) of 3D assets, in particular template-free auto-rigging that is not restricted to specific object categories. The key contribution is RigAnything, a novel autoregressive transformer model that probabilistically generates joints and skeleton topologies and assigns skinning weights. Unlike most methods that rely on predefined skeleton templates, RigAnything works autoregressively, iteratively predicting the next joint from the global input shape and previous predictions, which lets it learn and represent skeleton tree structures effectively. It further leverages diffusion modeling to improve position prediction, ensuring accurate and consistent joint placement within the hierarchy. This formulation allows the autoregressive model to efficiently capture both spatial and hierarchical relationships within the skeleton.
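
A minimal sketch of the BFS serialization described in the abstract below, which turns a skeleton tree into the (position, parent index) sequence an autoregressive model can predict (the joint/children containers are assumptions):

```python
from collections import deque

def bfs_serialize(joints, children, root=0):
    """joints: {joint_id: (x, y, z)}; children: {joint_id: [child ids]}.
    Returns [(position, parent_index_in_sequence), ...] in BFS order;
    the root's parent index is -1."""
    order, parent_of, queue = [], {root: -1}, deque([root])
    while queue:
        j = queue.popleft()
        order.append(j)
        for c in children.get(j, []):
            parent_of[c] = j
            queue.append(c)
    seq_index = {j: i for i, j in enumerate(order)}
    return [(joints[j], -1 if parent_of[j] == -1 else seq_index[parent_of[j]])
            for j in order]
```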

链接: https://arxiv.org/abs/2502.09615
作者: Isabella Liu,Zhan Xu,Wang Yifan,Hao Tan,Zexiang Xu,Xiaolong Wang,Hao Su,Zifan Shi
机构: UC San Diego(加州大学圣地亚哥分校); Adobe Research(Adobe研究); Hillbot Inc.(Hillbot公司)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project page: this https URL

Abstract:We present RigAnything, a novel autoregressive transformer-based model, which makes 3D assets rig-ready by probabilistically generating joints, skeleton topologies, and assigning skinning weights in a template-free manner. Unlike most existing auto-rigging methods, which rely on predefined skeleton template and are limited to specific categories like humanoid, RigAnything approaches the rigging problem in an autoregressive manner, iteratively predicting the next joint based on the global input shape and the previous prediction. While autoregressive models are typically used to generate sequential data, RigAnything extends their application to effectively learn and represent skeletons, which are inherently tree structures. To achieve this, we organize the joints in a breadth-first search (BFS) order, enabling the skeleton to be defined as a sequence of 3D locations and the parent index. Furthermore, our model improves the accuracy of position prediction by leveraging diffusion modeling, ensuring precise and consistent placement of joints within the hierarchy. This formulation allows the autoregressive model to efficiently capture both spatial and hierarchical relationships within the skeleton. Trained end-to-end on both RigNet and Objaverse datasets, RigAnything demonstrates state-of-the-art performance across diverse object types, including humanoids, quadrupeds, marine creatures, insects, and many more, surpassing prior methods in quality, robustness, generalizability, and efficiency. Please check our website for more details: this https URL.

[CV-5] DexTrack: Towards Generalizable Neural Tracking Control for Dexterous Manipulation from Human References ICLR2025

【速读】: This paper addresses the challenge of developing a generalizable neural tracking controller for dexterous manipulation: a controller that drives a dexterous robot hand to manipulate diverse objects with diverse motions, following kinematically defined human-object interactions. The key is to curate large-scale successful robot tracking demonstrations (pairs of human references and robot actions) to train a neural controller, and to use a data flywheel to iteratively improve both the controller's performance and the number and quality of successful demonstrations. In addition, per-trajectory tracking is individually optimized in a homotopy optimization scheme that leverages the learned tracking controller, further strengthening the controller in dynamic environments. The resulting controller improves success rates by over 10% compared to leading baselines.

链接: https://arxiv.org/abs/2502.09614
作者: Xueyi Liu,Jianibieke Adalibieke,Qianwei Han,Yuzhe Qin,Li Yi
机构: Tsinghua University (清华大学); Shanghai Qi Zhi Institute (上海期智研究院); Shanghai AI Laboratory (上海人工智能实验室); UC San Diego (加州大学圣地亚哥分校)
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: Accepted to ICLR 2025. Website: this https URL Code: this https URL Video: this https URL

Abstract:We address the challenge of developing a generalizable neural tracking controller for dexterous manipulation from human references. This controller aims to manage a dexterous robot hand to manipulate diverse objects for various purposes defined by kinematic human-object interactions. Developing such a controller is complicated by the intricate contact dynamics of dexterous manipulation and the need for adaptivity, generalizability, and robustness. Current reinforcement learning and trajectory optimization methods often fall short due to their dependence on task-specific rewards or precise system models. We introduce an approach that curates large-scale successful robot tracking demonstrations, comprising pairs of human references and robot actions, to train a neural controller. Utilizing a data flywheel, we iteratively enhance the controller’s performance, as well as the number and quality of successful tracking demonstrations. We exploit available tracking demonstrations and carefully integrate reinforcement learning and imitation learning to boost the controller’s performance in dynamic environments. At the same time, to obtain high-quality tracking demonstrations, we individually optimize per-trajectory tracking by leveraging the learned tracking controller in a homotopy optimization method. The homotopy optimization, mimicking chain-of-thought, aids in solving challenging trajectory tracking problems to increase demonstration diversity. We showcase our success by training a generalizable neural controller and evaluating it in both simulation and real world. Our method achieves over a 10% improvement in success rates compared to leading baselines. The project website with animated results is available at this https URL.

[CV-6] Latent Radiance Fields with 3D-aware 2D Representations ICLR2025

【速读】: This paper addresses the domain gap between 2D feature spaces and 3D representations, which degrades rendering performance. The key solution is a novel framework that integrates 3D awareness into the 2D latent space in three stages: (1) a correspondence-aware autoencoding method that enhances the 3D consistency of 2D latent representations; (2) a latent radiance field (LRF) that lifts these 3D-aware 2D representations into 3D space; and (3) a VAE-Radiance Field (VAE-RF) alignment strategy that improves image decoding from the rendered 2D representations.

链接: https://arxiv.org/abs/2502.09613
作者: Chaoyi Zhou,Xi Liu,Feng Luo,Siyu Huang
机构: Clemson University (克莱姆森大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to ICLR 2025; Project page: this https URL

Abstract:Latent 3D reconstruction has shown great promise in empowering 3D semantic understanding and 3D generation by distilling 2D features into the 3D space. However, existing approaches struggle with the domain gap between 2D feature space and 3D representations, resulting in degraded rendering performance. To address this challenge, we propose a novel framework that integrates 3D awareness into the 2D latent space. The framework consists of three stages: (1) a correspondence-aware autoencoding method that enhances the 3D consistency of 2D latent representations, (2) a latent radiance field (LRF) that lifts these 3D-aware 2D representations into 3D space, and (3) a VAE-Radiance Field (VAE-RF) alignment strategy that improves image decoding from the rendered 2D representations. Extensive experiments demonstrate that our method outperforms the state-of-the-art latent 3D reconstruction approaches in terms of synthesis performance and cross-dataset generalizability across diverse indoor and outdoor scenes. To our knowledge, this is the first work showing that radiance field representations constructed from 2D latent representations can yield photorealistic 3D reconstruction performance.

[CV-7] Designing a Conditional Prior Distribution for Flow-Based Generative Models

【速读】: This paper addresses the long average paths that existing flow-based generative models incur in conditional generation tasks such as text-to-image generation. The key idea is to exploit an under-used property of conditional flow models, the ability to design a non-trivial prior distribution: the input condition is mapped to an "average" point in data space, and the flow matching formulation then maps samples from a parametric distribution centered at this point to the target conditional distribution, significantly improving training speed and generation efficiency.
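
A minimal sketch of the prior design described above (the condition-to-mean mapping, the noise scale, and the velocity network are assumptions):

```python
import torch

def conditional_prior_sample(cond_mean, sigma=0.1):
    # Sample from a parametric distribution centered at the "average"
    # data point of the conditional mode, instead of N(0, I).
    return cond_mean + sigma * torch.randn_like(cond_mean)

def flow_matching_step(v_net, cond_mean, x1, cond_emb):
    x0 = conditional_prior_sample(cond_mean)   # prior already near the target mode
    t = torch.rand(x1.size(0), 1)
    xt = (1 - t) * x0 + t * x1
    v_target = x1 - x0                         # much shorter average paths
    return ((v_net(xt, t, cond_emb) - v_target) ** 2).mean()
```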

链接: https://arxiv.org/abs/2502.09611
作者: Noam Issachar,Mohammad Salama,Raanan Fattal,Sagie Benaim
机构: 未知
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注:

Abstract:Flow-based generative models have recently shown impressive performance for conditional generation tasks, such as text-to-image generation. However, current methods transform a general unimodal noise distribution to a specific mode of the target data distribution. As such, every point in the initial source distribution can be mapped to every point in the target distribution, resulting in long average paths. To this end, in this work, we tap into a non-utilized property of conditional flow-based models: the ability to design a non-trivial prior distribution. Given an input condition, such as a text prompt, we first map it to a point lying in data space, representing an "average" data point with the minimal average distance to all data points of the same conditional mode (e.g., class). We then utilize the flow matching formulation to map samples from a parametric distribution centered around this point to the conditional target distribution. Experimentally, our method significantly improves training times and generation efficiency (FID, KID and CLIP alignment scores) compared to baselines, producing high quality samples using fewer sampling steps.

[CV-8] Instance Segmentation of Scene Sketches Using Natural Image Priors

【速读】: This paper addresses instance segmentation of raster scene sketches. Existing image segmentation models face unique challenges on sketches because of their sparsity and wide variation in style. The key solution is SketchSeg, which adapts state-of-the-art segmentation and detection models to the sketch domain via class-agnostic fine-tuning and refines segmentation masks using depth cues. The method further organizes sketches into sorted layers in which occluded instances are inpainted, enabling advanced sketch editing applications. To validate robustness, the authors construct a synthetic scene sketch segmentation dataset featuring diverse brush strokes and varying levels of detail.

链接: https://arxiv.org/abs/2502.09608
作者: Mia Tang,Yael Vinker,Chuan Yan,Lvmin Zhang,Maneesh Agrawala
机构: Stanford University (斯坦福大学); MIT (麻省理工学院)
类目: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
备注:

Abstract:Sketch segmentation involves grouping pixels within a sketch that belong to the same object or instance. It serves as a valuable tool for sketch editing tasks, such as moving, scaling, or removing specific components. While image segmentation models have demonstrated remarkable capabilities in recent years, sketches present unique challenges for these models due to their sparse nature and wide variation in styles. We introduce SketchSeg, a method for instance segmentation of raster scene sketches. Our approach adapts state-of-the-art image segmentation and object detection models to the sketch domain by employing class-agnostic fine-tuning and refining segmentation masks using depth cues. Furthermore, our method organizes sketches into sorted layers, where occluded instances are inpainted, enabling advanced sketch editing applications. As existing datasets in this domain lack variation in sketch styles, we construct a synthetic scene sketch segmentation dataset featuring sketches with diverse brush strokes and varying levels of detail. We use this dataset to demonstrate the robustness of our approach and will release it to promote further research in the field. Project webpage: this https URL

[CV-9] GAIA: A Global Multi-modal Multi-scale Vision-Language Dataset for Remote Sensing Image Analysis

【速读】: This paper addresses the poor performance of existing Vision-Language Models (VLMs) on remote sensing (RS) image tasks: trained mostly on noisy web-scraped image-text data, they have limited exposure to the specialized RS domain. The key solution is GAIA, a new dataset of 205,150 carefully curated RS image-text pairs covering multi-scale, multi-sensor, and multi-modal RS image analysis. GAIA is built in two stages: (1) targeted web-scraping of images and accompanying text from reputable RS-related sources, and (2) generation of five high-quality, scientifically grounded synthetic captions per image using GPT-4o. This substantially improves RS image classification, cross-modal retrieval, and image captioning.

链接: https://arxiv.org/abs/2502.09598
作者: Angelos Zavras,Dimitrios Michail,Xiao Xiang Zhu,Begüm Demir,Ioannis Papoutsis
机构: Orion Lab, School of Rural, Surveying and Geoinformatics Engineering, National Technical University of Athens (国立技术大学雅典分校乡村、测量与地理工程学院); Institute of Astronomy, Astrophysics, Space Applications and Remote Sensing, National Observatory of Athens (国立雅典天文台); Department of Informatics and Telematics, Harokopio University of Athens (哈罗科皮奥斯大学信息学与遥感系); Chair of Data Science in Earth Observation, Technical University of Munich (慕尼黑工业大学地球观测数据科学教席); Munich Center for Machine Learning (慕尼黑机器学习中心); Faculty of Electrical Engineering and Computer Science, Technische Universität Berlin (柏林工业大学电气工程与计算机科学学院); BIFOLD - Berlin Institute for the Foundations of Learning and Data (BIFOLD - 柏林学习与数据基础研究所)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 22 pages, 13 figures

Abstract:The continuous operation of Earth-orbiting satellites generates vast and ever-growing archives of Remote Sensing (RS) images. Natural language presents an intuitive interface for accessing, querying, and interpreting the data from such archives. However, existing Vision-Language Models (VLMs) are predominantly trained on web-scraped, noisy image-text data, exhibiting limited exposure to the specialized domain of RS. This deficiency results in poor performance on RS-specific tasks, as commonly used datasets often lack detailed, scientifically accurate textual descriptions and instead emphasize solely on attributes like date and location. To bridge this critical gap, we introduce GAIA, a novel dataset designed for multi-scale, multi-sensor, and multi-modal RS image analysis. GAIA comprises of 205,150 meticulously curated RS image-text pairs, representing a diverse range of RS modalities associated to different spatial resolutions. Unlike existing vision-language datasets in RS, GAIA specifically focuses on capturing a diverse range of RS applications, providing unique information about environmental changes, natural disasters, and various other dynamic phenomena. The dataset provides a spatially and temporally balanced distribution, spanning across the globe, covering the last 25 years with a balanced temporal distribution of observations. GAIA’s construction involved a two-stage process: (1) targeted web-scraping of images and accompanying text from reputable RS-related sources, and (2) generation of five high-quality, scientifically grounded synthetic captions for each image using carefully crafted prompts that leverage the advanced vision-language capabilities of GPT-4o. Our extensive experiments, including fine-tuning of CLIP and BLIP2 models, demonstrate that GAIA significantly improves performance on RS image classification, cross-modal retrieval and image captioning tasks.

[CV-10] Diffusing DeBias: a Recipe for Turning a Bug into a Feature

【速读】: This paper addresses unrecoverable prediction biases in deep classification models caused by limited quality and quantity of training data and by strong spurious correlations in the data. To improve generalization and trustworthiness, especially in real-world scenarios, the paper proposes Diffusing DeBias (DDB). The key idea is to exploit the inherent bias-learning tendency of diffusion models: a conditional diffusion model generates synthetic bias-aligned images, which are used to train a bias amplifier model, which in turn serves as an auxiliary component in different unsupervised debiasing approaches. Besides mitigating dataset bias, this also tackles the training-set memorization issue common to such techniques, beating the state of the art on multiple benchmark datasets by significant margins.

链接: https://arxiv.org/abs/2502.09564
作者: Massimiliano Ciranni,Vito Paolo Pastore,Roberto Di Via,Enzo Tartaglione,Francesca Odone,Vittorio Murino
机构: MaLGa-DIBRIS, University of Genoa (MaLGa-DIBRIS,热那亚大学), Italy; Istituto Italiano di Tecnologia (意大利技术研究院), Italy; Telècom-Paris, Ecole Polytechnique Superior (泰尔通信-巴黎高等理工学院), France; University of Verona (维罗纳大学), Italy
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注: 29 Pages, 12 Figures

Abstract:Deep learning model effectiveness in classification tasks is often challenged by the quality and quantity of training data which, whenever containing strong spurious correlations between specific attributes and target labels, can result in unrecoverable biases in model predictions. Tackling these biases is crucial in improving model generalization and trust, especially in real-world scenarios. This paper presents Diffusing DeBias (DDB), a novel approach acting as a plug-in for common methods in model debiasing while exploiting the inherent bias-learning tendency of diffusion models. Our approach leverages conditional diffusion models to generate synthetic bias-aligned images, used to train a bias amplifier model, to be further employed as an auxiliary method in different unsupervised debiasing approaches. Our proposed method, which also tackles the common issue of training set memorization typical of this type of technique, beats the current state of the art on multiple benchmark datasets by significant margins, demonstrating its potential as a versatile and effective tool for tackling dataset bias in deep learning applications.

[CV-11] Self-Calibrating Gaussian Splatting for Large Field of View Reconstruction

【速读】: This paper targets accurate and efficient scene reconstruction from large field-of-view (FOV) imagery. The key is a self-calibrating framework that jointly optimizes camera parameters, lens distortion, and 3D Gaussian representations. Complex lens distortion is modeled with a hybrid network combining invertible residual networks with explicit grids, which effectively regularizes the optimization, and a cubemap-based resampling strategy supports large-FOV images without sacrificing resolution or introducing distortion artifacts. The method reconstructs high-quality scenes from fewer images and achieves state-of-the-art performance on both synthetic and real-world datasets.

链接: https://arxiv.org/abs/2502.09563
作者: Youming Deng,Wenqi Xian,Guandao Yang,Leonidas Guibas,Gordon Wetzstein,Steve Marschner,Paul Debevec
机构: Cornell University; Netflix Eyeline Studios; Stanford University
类目: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
备注: Project Page: this https URL

Abstract:In this paper, we present a self-calibrating framework that jointly optimizes camera parameters, lens distortion and 3D Gaussian representations, enabling accurate and efficient scene reconstruction. In particular, our technique enables high-quality scene reconstruction from Large field-of-view (FOV) imagery taken with wide-angle lenses, allowing the scene to be modeled from a smaller number of images. Our approach introduces a novel method for modeling complex lens distortions using a hybrid network that combines invertible residual networks with explicit grids. This design effectively regularizes the optimization process, achieving greater accuracy than conventional camera models. Additionally, we propose a cubemap-based resampling strategy to support large FOV images without sacrificing resolution or introducing distortion artifacts. Our method is compatible with the fast rasterization of Gaussian Splatting, adaptable to a wide variety of camera lens distortion, and demonstrates state-of-the-art performance on both synthetic and real-world datasets.

[CV-12] Long-Term TalkingFace Generation via Motion-Prior Conditional Diffusion Model

【速读】: This paper addresses the challenges of consistent head motion, synchronized facial expressions, and accurate lip synchronization in long-term generation of realistic TalkingFace videos. The key is the MCDM model, which uses motion priors from both archived and current clips to enhance motion prediction and ensure temporal consistency. MCDM has three core components: (1) an archived-clip motion prior that incorporates historical frames and a reference frame to preserve identity and context; (2) a present-clip motion-prior diffusion model that captures multimodal causality for accurate prediction of head motion, lip sync, and expressions; and (3) a memory-efficient temporal attention mechanism that mitigates error accumulation by dynamically storing and updating motion features.

链接: https://arxiv.org/abs/2502.09533
作者: Fei Shen,Cong Wang,Junyao Gao,Qin Guo,Jisheng Dang,Jinhui Tang,Tat-Seng Chua
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

Abstract:Recent advances in conditional diffusion models have shown promise for generating realistic TalkingFace videos, yet challenges persist in achieving consistent head movement, synchronized facial expressions, and accurate lip synchronization over extended generations. To address these, we introduce the Motion-priors Conditional Diffusion Model (MCDM), which utilizes both archived and current clip motion priors to enhance motion prediction and ensure temporal consistency. The model consists of three key elements: (1) an archived-clip motion-prior that incorporates historical frames and a reference frame to preserve identity and context; (2) a present-clip motion-prior diffusion model that captures multimodal causality for accurate predictions of head movements, lip sync, and expressions; and (3) a memory-efficient temporal attention mechanism that mitigates error accumulation by dynamically storing and updating motion features. We also release the TalkingFace-Wild dataset, a multilingual collection of over 200 hours of footage across 10 languages. Experimental results demonstrate the effectiveness of MCDM in maintaining identity and motion continuity for long-term TalkingFace generation. Code, models, and datasets will be publicly available.

[CV-13] SteROI-D: System Design and Mapping for Stereo Depth Inference on Regions of Interest

【速读】: This paper addresses the problem that stereo depth algorithms cannot run effectively on battery-powered augmented and virtual reality (AR/VR) devices because of the high energy consumption of the full image processing stack. The key is the SteROI-D system and its mapping methodology, which exploit Region-of-Interest (ROI) and temporal sparsity to save energy and use a flexible, heterogeneous compute fabric to support diverse ROIs. In particular, a systematic mapping methodology handles dynamic ROIs effectively to maximize energy savings.

链接: https://arxiv.org/abs/2502.09528
作者: Jack Erhardt,Ziang Li,Reid Pinkham,Andrew Berkovich,Zhengya Zhang
机构: University of Michigan(密歇根大学) Ann Arbor(安阿伯); Meta(Meta) Reality Labs - Research(现实实验室-研究) Redmond(雷德蒙德) Washington(华盛顿) USA(美国); University of Michigan(密歇根大学) Ann Arbor(安阿伯) Michigan(密歇根) USA(美国)
类目: Computer Vision and Pattern Recognition (cs.CV); Hardware Architecture (cs.AR)
备注: Accepted as a full paper by the 2025 EDGE AI FOUNDATION Austin

Abstract:Machine learning algorithms have enabled high quality stereo depth estimation to run on Augmented and Virtual Reality (AR/VR) devices. However, high energy consumption across the full image processing stack prevents stereo depth algorithms from running effectively on battery-limited devices. This paper introduces SteROI-D, a full stereo depth system paired with a mapping methodology. SteROI-D exploits Region-of-Interest (ROI) and temporal sparsity at the system level to save energy. SteROI-D’s flexible and heterogeneous compute fabric supports diverse ROIs. Importantly, we introduce a systematic mapping methodology to effectively handle dynamic ROIs, thereby maximizing energy savings. Using these techniques, our 28nm prototype SteROI-D design achieves up to 4.35x reduction in total system energy compared to a baseline ASIC.

[CV-14] SQ-GAN: Semantic Image Communications Using Masked Vector Quantization

【速读】: This paper addresses optimizing image compression for semantic/task-oriented communications. The key is Semantically Masked VQ-GAN (SQ-GAN), which combines generative models with a newly developed semantic-conditioned adaptive mask module (SAMM) to selectively encode the semantically significant features of an image, outperforming compression schemes such as JPEG2000 and BPG in perceptual quality and semantic segmentation accuracy at extremely low bit rates.

链接: https://arxiv.org/abs/2502.09520
作者: Francesco Pezone,Sergio Barbarossa,Giuseppe Caire
机构: CNIT - National Inter-University Consortium for Telecommunications (国家电信跨大学联盟), Parma, Italy; Sapienza University of Rome (罗马大学), Italy; Technical University of Berlin (柏林工业大学), Germany
类目: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
备注:

Abstract:This work introduces Semantically Masked VQ-GAN (SQ-GAN), a novel approach integrating generative models to optimize image compression for semantic/task-oriented communications. SQ-GAN employs off-the-shelf semantic segmentation and a new, specifically developed semantic-conditioned adaptive mask module (SAMM) to selectively encode semantically significant features of the images. SQ-GAN outperforms state-of-the-art image compression schemes such as JPEG2000 and BPG across multiple metrics, including perceptual quality and semantic segmentation accuracy on the post-decoding reconstructed image, at extremely low compression rates expressed in bits per pixel.

[CV-15] When and How Does CLIP Enable Domain and Compositional Generalization?

【速读】: This paper investigates how contrastive vision-language models such as CLIP generalize under different training distributions, specifically generalization to entirely unseen domains (domain generalization) and to unseen classes within partially seen domains (compositional generalization). The key is to train CLIP models on systematically constructed training distributions with controlled domain diversity and object class exposure; the experiments show that successful generalization requires learning shared representations and shared circuitry already in the intermediate layers.

链接: https://arxiv.org/abs/2502.09507
作者: Elias Kempf,Simon Schrodi,Max Argus,Thomas Brox
机构: 未知
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注:

Abstract:The remarkable generalization performance of contrastive vision-language models like CLIP is often attributed to the diversity of their training distributions. However, key questions remain unanswered: Can CLIP generalize to an entirely unseen domain when trained on a diverse mixture of domains (domain generalization)? Can it generalize to unseen classes within partially seen domains (compositional generalization)? What factors affect such generalization? To answer these questions, we trained CLIP models on systematically constructed training distributions with controlled domain diversity and object class exposure. Our experiments show that domain diversity is essential for both domain and compositional generalization, yet compositional generalization can be surprisingly weaker than domain generalization when the training distribution contains a suboptimal subset of the test domain. Through data-centric and mechanistic analyses, we find that successful generalization requires learning of shared representations already in intermediate layers and shared circuitry.

[CV-16] Prior-Constrained Association Learning for Fine-Grained Generalized Category Discovery AAAI2025

【速读】: This paper addresses generalized category discovery (GCD): clustering unlabeled data that may come from known or unknown categories, aided by labeled instances of each known category. GCD is harder than traditional semi-supervised learning because the unlabeled data may belong to novel categories absent from the labeled data. Current state-of-the-art methods typically learn a parametric classifier assisted by self-distillation; while effective, they do not exploit cross-instance similarity to discover class-specific semantics, which are essential for representation learning and category discovery.

The key contribution is a Prior-constrained Association Learning method that captures and learns the semantic relations within the data. In particular, labeled data from known categories provides a unique prior for associating unlabeled data. Unlike previous methods that use the prior only as a pre- or post-clustering refinement, this work fully incorporates the prior into the association process and lets it constrain the association towards reliable grouping. Representation learning is enhanced through non-parametric prototypical contrast on the estimated semantic groups, and combining parametric with non-parametric classification further yields a model that outperforms existing methods by a significant margin on multiple GCD benchmarks.

链接: https://arxiv.org/abs/2502.09501
作者: Menglin Wang,Zhun Zhong,Xiaojin Gong
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to AAAI 2025

Abstract:This paper addresses generalized category discovery (GCD), the task of clustering unlabeled data from potentially known or unknown categories with the help of labeled instances from each known category. Compared to traditional semi-supervised learning, GCD is more challenging because unlabeled data could be from novel categories not appearing in labeled data. Current state-of-the-art methods typically learn a parametric classifier assisted by self-distillation. While being effective, these methods do not make use of cross-instance similarity to discover class-specific semantics which are essential for representation learning and category discovery. In this paper, we revisit the association-based paradigm and propose a Prior-constrained Association Learning method to capture and learn the semantic relations within data. In particular, the labeled data from known categories provides a unique prior for the association of unlabeled data. Unlike previous methods that only adopt the prior as a pre- or post-clustering refinement, we fully incorporate the prior into the association process, and let it constrain the association towards a reliable grouping outcome. The estimated semantic groups are utilized through non-parametric prototypical contrast to enhance the representation learning. Further combining parametric and non-parametric classification lets the two complement each other and leads to a model that outperforms existing methods by a significant margin. On multiple GCD benchmarks, we perform extensive experiments and validate the effectiveness of our proposed method.

[CV-17] Standardisation of Convex Ultrasound Data Through Geometric Analysis and Augmentation

【速读】: This paper addresses the disorganisation and sparsity of ultrasound image data, a particular obstacle for data-driven algorithms. The key is to extract the underlying ultrasound plane from the image and represent it with annulus sector geometry. Building on this, the paper proposes an application that extracts scan lines and linearises convex planes, validates the robustness of the method on both private and public data, and studies the impact of deformation and the invertibility of augmentation using the estimated annulus sector parameters.
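
A minimal numpy sketch of linearising a convex (annulus sector) plane into straight scan lines (the apex and sector parameters are assumptions; the paper estimates them from the image):

```python
import numpy as np

def linearise_sector(img, apex, r0, r1, th0, th1, n_lines=256, n_samples=512):
    """Resample an annulus sector into a rectangular scan-line image.
    apex: (row, col) of the beam origin; r0..r1: radial range in pixels;
    th0..th1: angular extent in radians (0 = straight down)."""
    thetas = np.linspace(th0, th1, n_lines)
    radii = np.linspace(r0, r1, n_samples)
    rows = apex[0] + radii[None, :] * np.cos(thetas[:, None])
    cols = apex[1] + radii[None, :] * np.sin(thetas[:, None])
    rows = np.clip(np.round(rows).astype(int), 0, img.shape[0] - 1)
    cols = np.clip(np.round(cols).astype(int), 0, img.shape[1] - 1)
    return img[rows, cols]        # each row is one straightened scan line
```

With the estimated annulus sector parameters, the same mapping can in principle be inverted to study how augmentations applied in the linearised space deform the original image.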

链接: https://arxiv.org/abs/2502.09482
作者: Alistair Weld,Giovanni Faoro,Luke Dixon,Sophie Camp,Arianna Menciassi,Stamatia Giannarou
机构: The Hamlyn Centre, Imperial College London (帝国理工学院); Scuola Superiore Sant’Anna (圣安娜高等学院); Department of Imaging, Charing Cross Hospital (查林十字医院); Department of Neurosurgery, Charing Cross Hospital (查林十字医院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

Abstract:The application of ultrasound in healthcare has seen increased diversity and importance. Unlike other medical imaging modalities, ultrasound research and development has historically lagged, particularly in the case of applications with data-driven algorithms. A significant issue with ultrasound is the extreme variability of the images, due to the number of different machines available and the possible combination of parameter settings. One outcome of this is the lack of standardised and benchmarking ultrasound datasets. The method proposed in this article is an approach to alleviating this issue of disorganisation. For this purpose, the issue of ultrasound data sparsity is examined and a novel perspective, approach, and solution is proposed; involving the extraction of the underlying ultrasound plane within the image and representing it using annulus sector geometry. An application of this methodology is proposed, which is the extraction of scan lines and the linearisation of convex planes. Validation of the robustness of the proposed method is performed on both private and public data. The impact of deformation and the invertibility of augmentation using the estimated annulus sector parameters is also studied. Keywords: Ultrasound, Annulus Sector, Augmentation, Linearisation.

[CV-18] Wholly-WOOD: Wholly Leverag ing Diversified-quality Labels for Weakly-supervised Oriented Object Detection

【速读】: This paper addresses how to effectively train an oriented object detector (OOD) from weak annotations (single points and horizontal bounding boxes). The key is Wholly-WOOD, a weakly-supervised framework that can uniformly leverage diverse labeling forms (points, horizontal boxes, rotated boxes, and their combinations). Trained with horizontal boxes (HBoxes) only, it achieves performance very close to the rotated-box (RBox)-trained counterpart, greatly reducing the tedious effort of dense annotation for oriented objects.

链接: https://arxiv.org/abs/2502.09471
作者: Yi Yu,Xue Yang,Yansheng Li,Zhenjun Han,Feipeng Da,Junchi Yan
机构: School of Automation, Southeast University, Nanjing, 210096, China (东南大学自动化学院); Department of Automation, Shanghai Jiao Tong University, Shanghai, 200240, China (上海交通大学自动化系); School of Artificial Intelligence, Shanghai Jiao Tong University, Shanghai, 200240, China (上海交通大学人工智能学院); School of Remote Sensing and Information Engineering, Wuhan University, Wuhan, 430079, China (武汉大学遥感信息工程学院); School of Electronic, Electrical and Communication Engineering, University of Chinese Academy of Science, Beijing, 100049, China (中国科学院大学电子电气与通信工程学院)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 18 pages, 9 figures, 9 tables, accepted by TPAMI

Abstract:Accurately estimating the orientation of visual objects with compact rotated bounding boxes (RBoxes) has become a prominent demand, which challenges existing object detection paradigms that only use horizontal bounding boxes (HBoxes). To equip the detectors with orientation awareness, supervised regression/classification modules have been introduced at the high cost of rotation annotation. Meanwhile, some existing datasets with oriented objects are already annotated with horizontal boxes or even single points. It becomes attractive yet remains open for effectively utilizing weaker single point and horizontal annotations to train an oriented object detector (OOD). We develop Wholly-WOOD, a weakly-supervised OOD framework, capable of wholly leveraging various labeling forms (Points, HBoxes, RBoxes, and their combination) in a unified fashion. By only using HBox for training, our Wholly-WOOD achieves performance very close to that of the RBox-trained counterpart on remote sensing and other areas, significantly reducing the tedious efforts on labor-intensive annotation for oriented objects. The source codes are available at this https URL (PyTorch-based) and this https URL (Jittor-based).

[CV-19] Metamorphic Testing for Pose Estimation Systems

【速读】: This paper addresses the challenge of evaluating pose estimation systems across application scenarios, where manual annotation is costly and rarely reusable, leaving test data scarce. The key solution is MET-POSE, a metamorphic testing framework that bypasses the need for manual annotation while assessing pose estimation systems under different conditions. MET-POSE lets users evaluate systems in conditions closer to their application without labeling an ad-hoc test dataset or relying only on existing datasets that may not fit their domain.
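
A minimal sketch of one plausible metamorphic rule, horizontal mirroring (the estimator interface and keypoint format are assumptions, not MET-POSE's actual API):

```python
import numpy as np

def mirror_rule_violations(estimate, image, tol=5.0):
    """Metamorphic rule: flipping the image horizontally should flip the
    predicted keypoints; no hand-labelled ground truth is required.
    (For human pose, symmetric keypoint pairs would also need swapping.)"""
    w = image.shape[1]
    kps = estimate(image)                        # (n_keypoints, 2) as (x, y)
    kps_flipped = estimate(image[:, ::-1].copy())
    expected = kps.copy()
    expected[:, 0] = w - 1 - expected[:, 0]      # mirror the x coordinates
    err = np.linalg.norm(kps_flipped - expected, axis=1)
    return err > tol                             # per-keypoint rule violations
```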

链接: https://arxiv.org/abs/2502.09460
作者: Matias Duran,Thomas Laurent,Ellen Rushe,Anthony Ventresque
机构: 未知
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted for publication at 2025 IEEE Conference on Software Testing, Verification and Validation (ICST)

Abstract:Pose estimation systems are used in a variety of fields, from sports analytics to livestock care. Given their potential impact, it is paramount to systematically test their behaviour and potential for failure. This is a complex task due to the oracle problem and the high cost of manual labelling necessary to build ground truth keypoints. This problem is exacerbated by the fact that different applications require systems to focus on different subjects (e.g., human versus animal) or landmarks (e.g., only extremities versus whole body and face), which makes labelled test data rarely reusable. To combat these problems we propose MET-POSE, a metamorphic testing framework for pose estimation systems that bypasses the need for manual annotation while assessing the performance of these systems under different circumstances. MET-POSE thus allows users of pose estimation systems to assess the systems in conditions that more closely relate to their application without having to label an ad-hoc test dataset or rely only on available datasets, which may not be adapted to their application domain. While we define MET-POSE in general terms, we also present a non-exhaustive list of metamorphic rules that represent common challenges in computer vision applications, as well as a specific way to evaluate these rules. We then experimentally show the effectiveness of MET-POSE by applying it to Mediapipe Holistic, a state of the art human pose estimation system, with the FLIC and PHOENIX datasets. With these experiments, we outline numerous ways in which the outputs of MET-POSE can uncover faults in pose estimation systems at a similar or higher rate than classic testing using hand labelled data, and show that users can tailor the rule set they use to the faults and level of accuracy relevant to their application.

[CV-20] Redistribute Ensemble Training for Mitigating Memorization in Diffusion Models

【速读】: This paper addresses the privacy risks caused by data memorization in diffusion models, from the perspective of the visual modality. The key is a framework in which models learn through proxy model parameters: the training data is split into multiple shards, each shard trains a proxy model, and the proxies are aggregated into the final model. Memorization is further avoided by skipping samples with abnormally low loss values, and the proposed IET-AGC+ redistributes highly memorizable samples across shards to keep them from being skipped too often. To reduce memorization further, samples are dynamically augmented based on their loss values. Together these measures lower the model's memorization while maintaining high performance.
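
A minimal sketch of the loss-based skipping step (the per-sample loss interface and threshold rule are assumptions; the full method additionally shards the data across proxy models and redistributes memorizable samples between shards):

```python
import torch

def filtered_training_step(model, batch, optimizer, ratio=0.5):
    losses = model.per_sample_loss(batch)          # (B,) per-sample denoising losses
    # Abnormally low loss suggests the sample is being memorized: skip it.
    threshold = ratio * losses.detach().mean()
    keep = losses.detach() > threshold
    if keep.any():
        optimizer.zero_grad()
        losses[keep].mean().backward()
        optimizer.step()
    return keep                                    # mask of samples actually used
```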

链接: https://arxiv.org/abs/2502.09434
作者: Xiaoliu Guan,Yu Wu,Huayang Huang,Xiao Liu,Jiaxu Miao,Yi Yang
机构: School of Computer Science, Wuhan University, China(武汉大学计算机学院); School of Cyber Science and Technology, Sun Yat-sen University, China(中山大学网络科学与技术学院); College of Computer Science and Technology, Zhejiang University, Hangzhou, Zhejiang, China(浙江大学计算机科学与技术学院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 12 pages,9 figures. arXiv admin note: substantial text overlap with arXiv:2407.15328

Abstract:Diffusion models, known for their tremendous ability to generate high-quality samples, have recently raised concerns due to their data memorization behavior, which poses privacy risks. Recent methods for memory mitigation have primarily addressed the issue within the context of the text modality in cross-modal generation tasks, restricting their applicability to specific conditions. In this paper, we propose a novel method for diffusion models from the perspective of visual modality, which is more generic and fundamental for mitigating memorization. Directly exposing visual data to the model increases memorization risk, so we design a framework where models learn through proxy model parameters instead. Specially, the training dataset is divided into multiple shards, with each shard training a proxy model, then aggregated to form the final model. Additionally, practical analysis of training losses illustrates that the losses for easily memorable images tend to be obviously lower. Thus, we skip the samples with abnormally low loss values from the current mini-batch to avoid memorizing. However, balancing the need to skip memorization-prone samples while maintaining sufficient training data for high-quality image generation presents a key challenge. Thus, we propose IET-AGC+, which redistributes highly memorizable samples between shards, to mitigate these samples from over-skipping. Furthermore, we dynamically augment samples based on their loss values to further reduce memorization. Extensive experiments and analysis on four datasets show that our method successfully reduces memory capacity while maintaining performance. Moreover, we fine-tune the pre-trained diffusion models, e.g., Stable Diffusion, and decrease the memorization score by 46.7%, demonstrating the effectiveness of our method. Code is available in: this https URL.

[CV-21] A 3D Facial Reconstruction Evaluation Methodology: Comparing Smartphone Scans with Deep Learning Based Methods Using Geometry and Morphometry Criteria

【速读】: This paper addresses how to validate low-cost three-dimensional (3D) facial acquisition and reconstruction techniques for clinical use. The key is a new evaluation methodology that goes beyond traditional geometry-based benchmarks by integrating morphometric shape analysis, providing a statistical framework for assessing how faithfully facial morphology is preserved. As a case study, smartphone-based 3D scans are compared with deep-learning reconstructions from 2D images, using high-end stereophotogrammetry models as ground truth; the methodology quantifies global and local shape differences, giving a biologically meaningful validation approach for low-cost 3D facial acquisition and reconstruction.

链接: https://arxiv.org/abs/2502.09425
作者: Álvaro Heredia-Lidón,Alejandro Moñux-Bernal,Alejandro González,Luis M. Echeverry-Quiceno,Max Rubert,Aroa Casado,María Esther Esteban,Mireia Andreu-Montoriol,Susanna Gallardo,Cristina Ruffo,Neus Martínez-Abadías,Xavier Sevillano
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

Abstract:Three-dimensional (3D) facial shape analysis has gained interest due to its potential clinical applications. However, the high cost of advanced 3D facial acquisition systems limits their widespread use, driving the development of low-cost acquisition and reconstruction methods. This study introduces a novel evaluation methodology that goes beyond traditional geometry-based benchmarks by integrating morphometric shape analysis techniques, providing a statistical framework for assessing facial morphology preservation. As a case study, we compare smartphone-based 3D scans with state-of-the-art deep learning reconstruction methods from 2D images, using high-end stereophotogrammetry models as ground truth. This methodology enables a quantitative assessment of global and local shape differences, offering a biologically meaningful validation approach for low-cost 3D facial acquisition and reconstruction techniques.

[CV-22] ImageRAG : Dynamic Image Retrieval for Reference-Guided Image Generation

【速读】: This paper addresses the difficulty diffusion models have in generating rare or unseen concepts. The key solution is ImageRAG, which retrieves images relevant to a given text prompt and uses them as context to guide the generation process. Unlike prior approaches that trained models specifically for retrieval-based generation, ImageRAG leverages the capabilities of existing image-conditioning models and requires no RAG-specific training, making it highly adaptable and significantly improving the generation of rare and fine-grained concepts.
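
A minimal sketch of the retrieval step using CLIP similarity (the Hugging Face CLIP usage is illustrative; the retrieved images would then be fed to whatever image-conditioning interface the base generator already exposes):

```python
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def retrieve_context_images(prompt, candidate_images, k=3):
    """Rank a pool of candidate PIL images by CLIP similarity to the prompt."""
    inputs = processor(text=[prompt], images=candidate_images,
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        sims = model(**inputs).logits_per_text.squeeze(0)   # (n_images,)
    top = sims.topk(min(k, len(candidate_images))).indices.tolist()
    return [candidate_images[i] for i in top]               # generation context
```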

链接: https://arxiv.org/abs/2502.09411
作者: Rotem Shalev-Arkushin,Rinon Gal,Amit H. Bermano,Ohad Fried
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
备注:

Abstract:Diffusion models enable high-quality and diverse visual content synthesis. However, they struggle to generate rare or unseen concepts. To address this challenge, we explore the usage of Retrieval-Augmented Generation (RAG) with image generation models. We propose ImageRAG, a method that dynamically retrieves relevant images based on a given text prompt, and uses them as context to guide the generation process. Prior approaches that used retrieved images to improve generation trained models specifically for retrieval-based generation. In contrast, ImageRAG leverages the capabilities of existing image conditioning models, and does not require RAG-specific training. Our approach is highly adaptable and can be applied across different model types, showing significant improvement in generating rare and fine-grained concepts using different base models. Our project page is available at: this https URL

[CV-23] Galileo: Learning Global and Local Features in Pretrained Remote Sensing Models

【速读】: This paper addresses the need, across diverse remote sensing applications, for pretrained machine learning models that can flexibly ingest multimodal data and represent Earth-surface phenomena at multiple scales. The key is the Galileo family of pretrained models, which flexibly process multimodal remote sensing data, together with a novel and highly effective self-supervised learning approach that learns both large- and small-scale features, a challenge previous models did not address.

链接: https://arxiv.org/abs/2502.09356
作者: Gabriel Tseng,Anthony Fuller,Marlena Reil,Henry Herzog,Patrick Beukema,Favyen Bastani,James R. Green,Evan Shelhamer,Hannah Kerner,David Rolnick
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

Abstract:From crop mapping to flood detection, machine learning in remote sensing has a wide range of societally beneficial applications. The commonalities between remote sensing data in these applications present an opportunity for pretrained machine learning models tailored to remote sensing to reduce the labeled data and effort required to solve individual tasks. However, such models must be: (i) flexible enough to ingest input data of varying sensor modalities and shapes (i.e., of varying spatial and temporal dimensions), and (ii) able to model Earth surface phenomena of varying scales and types. To solve this gap, we present Galileo, a family of pretrained remote sensing models designed to flexibly process multimodal remote sensing data. We also introduce a novel and highly effective self-supervised learning approach to learn both large- and small-scale features, a challenge not addressed by previous models. Our Galileo models obtain state-of-the-art results across diverse remote sensing tasks.

[CV-24] Wasserstein distributional adversarial training for deep neural networks

【速读】: This paper addresses the robustness of deep neural networks against distributional adversarial attacks. The key is a sensitivity-analysis-based method for Wasserstein distributionally robust optimization, together with an efficient fine-tuning technique that can be deployed on already pretrained models to enhance distributional robustness while maintaining the original level of pointwise robustness. Experiments show the method is effective for a range of pretrained models, with noticeable improvements even for models pretrained on huge synthetic datasets of 20-100M images.
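
For context, a minimal sketch of the pointwise TRADES-style objective that the paper extends to distributional threats (the one-step inner maximization is a simplification; the distributional variant reallocates the perturbation budget across samples via Wasserstein sensitivity analysis):

```python
import torch
import torch.nn.functional as F

def trades_like_loss(model, x, y, eps=8 / 255, beta=6.0):
    # Inner step: find a perturbation that increases the KL to the clean output.
    x_adv = (x + eps * torch.randn_like(x).sign()).clamp(0, 1).requires_grad_(True)
    kl = F.kl_div(F.log_softmax(model(x_adv), dim=1),
                  F.softmax(model(x), dim=1).detach(), reduction="batchmean")
    grad = torch.autograd.grad(kl, x_adv)[0]
    x_adv = (x_adv.detach() + eps * grad.sign()).clamp(0, 1)
    # Outer step: natural loss plus a robustness regularizer.
    robust_kl = F.kl_div(F.log_softmax(model(x_adv), dim=1),
                         F.softmax(model(x), dim=1), reduction="batchmean")
    return F.cross_entropy(model(x), y) + beta * robust_kl
```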

链接: https://arxiv.org/abs/2502.09352
作者: Xingjian Bai,Guangyi He,Yifan Jiang,Jan Obloj
机构: Massachusetts Institute of Technology(麻省理工学院); Imperial College London(帝国理工学院); University of Oxford(牛津大学)
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Optimization and Control (math.OC)
备注: 15 pages, 4 figures

Abstract:Design of adversarial attacks for deep neural networks, as well as methods of adversarial training against them, are the subject of intense research. In this paper, we propose methods to train against distributional attack threats, extending the TRADES method used for pointwise attacks. Our approach leverages recent contributions and relies on sensitivity analysis for Wasserstein distributionally robust optimization problems. We introduce an efficient fine-tuning method which can be deployed on a previously trained model. We test our methods on a range of pre-trained models on RobustBench. These experimental results demonstrate that the additional training enhances Wasserstein distributional robustness, while maintaining original levels of pointwise robustness, even for already very successful networks. The improvements are less marked for models pre-trained using huge synthetic datasets of 20-100M images. However, remarkably, sometimes our methods are still able to improve their performance even when trained using only the original training dataset (50k images).

[CV-25] A Benchmark for Crime Surveillance Video Analysis with Large Models

【速读】: This paper addresses anomaly analysis in crime surveillance video, and in particular the under-studied application of multimodal large language models (MLLMs) in this area. The key is a new benchmark, UCVL, comprising 1,829 videos with reorganized annotations, six designed question types with diverse QA pairs, detailed evaluation instructions, and accurate assessment using OpenAI's GPT-4o. In addition, fine-tuning LLaVA-OneVision on the UCVL training set validates the dataset's high quality for video anomaly analysis.

链接: https://arxiv.org/abs/2502.09325
作者: Haoran Chen,Dong Yi,Moyan Cao,Chensen Huang,Guibo Zhu,Jinqiao Wang
机构: Institute of Automation, Chinese Academy of Sciences (自动化研究所,中国科学院); University of Chinese Academy of Sciences (中国科学院大学); Wuhan AI Research (武汉人工智能研究院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

Abstract:Anomaly analysis in surveillance videos is a crucial topic in computer vision. In recent years, multimodal large language models (MLLMs) have outperformed task-specific models in various domains. Although MLLMs are particularly versatile, their abilities to understand anomalous concepts and details are insufficiently studied because the outdated benchmarks of this field do not provide MLLM-style QAs or efficient algorithms to assess the model's open-ended text responses. To fill this gap, we propose a benchmark for crime surveillance video analysis with large models denoted as UCVL, including 1,829 videos and reorganized annotations from the UCF-Crime and UCF-Crime Annotation datasets. We design six types of questions and generate diverse QA pairs. Then we develop detailed instructions and use OpenAI's GPT-4o for accurate assessment. We benchmark eight prevailing MLLMs ranging from 0.5B to 40B parameters, and the results demonstrate the reliability of this benchmark. Moreover, we finetune LLaVA-OneVision on UCVL's training set. The improvement validates our data's high quality for video anomaly analysis.

[CV-26] Mitigating the Impact of Prominent Position Shift in Drone-based RGBT Object Detection

【速读】: This paper addresses the prominent cross-modality box shift problem in drone-based RGB-thermal (RGBT) object detection: tiny objects appear at noticeably different positions in the two modalities, so the unlabeled (sensed) modality lacks accurate supervision and the detector struggles to learn good representations; mismatched feature points across modalities likewise confuse the detection head with the fused features. The key solutions are a Mean Teacher-based Cross-modality Box Correction (CBC) module, which casts the cross-modality box shift as a label-noise problem and corrects it on the fly, and a Shifted Window-based Cascaded Alignment (SWCA) module that alleviates the feature-map mismatch in RGBT fusion. Experiments show substantial detection gains: the CBC module improves the ground-truth precision of the sensed modality by 25.52 aSim points, and the detector reaches 43.55 mAP_50 on RGBTDronePerson, surpassing a state-of-the-art method by 8.6 mAP_50 on a shifted subset of the DroneVehicle dataset.
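
A minimal sketch of the Mean Teacher ingredient behind CBC, an EMA-updated teacher whose predictions on the sensed modality stand in for the shifted labels (the detector interface is an assumption):

```python
import copy
import torch

def make_teacher(student):
    teacher = copy.deepcopy(student)
    for p in teacher.parameters():
        p.requires_grad_(False)      # the teacher is never trained directly
    return teacher

@torch.no_grad()
def ema_update(teacher, student, decay=0.999):
    # Exponential moving average of the student; the slowly moving teacher
    # produces denoised box targets for the unlabeled (sensed) modality.
    for t_p, s_p in zip(teacher.parameters(), student.parameters()):
        t_p.mul_(decay).add_(s_p.detach(), alpha=1 - decay)
```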

链接: https://arxiv.org/abs/2502.09311
作者: Yan Zhang,Wen Yang,Chang Xu,Qian Hu,Fang Xu,Gui-Song Xia
机构: Wuhan University (武汉大学); École Polytechnique Fédérale de Lausanne (洛桑联邦理工学院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 15 pages

Abstract:Drone-based RGBT object detection plays a crucial role in many around-the-clock applications. However, real-world drone-viewed RGBT data suffers from the prominent position shift problem, i.e., the position of a tiny object differs greatly in different modalities. For instance, a slight deviation of a tiny object in the thermal modality will induce it to drift from the main body of itself in the RGB modality. Considering RGBT data are usually labeled on one modality (reference), this will cause the unlabeled modality (sensed) to lack accurate supervision signals and prevent the detector from learning a good representation. Moreover, the mismatch of the corresponding feature point between the modalities will make the fused features confusing for the detection head. In this paper, we propose to cast the cross-modality box shift issue as the label noise problem and address it on the fly via a novel Mean Teacher-based Cross-modality Box Correction head ensemble (CBC). In this way, the network can learn more informative representations for both modalities. Furthermore, to alleviate the feature map mismatch problem in RGBT fusion, we devise a Shifted Window-Based Cascaded Alignment (SWCA) module. SWCA mines long-range dependencies between the spatially unaligned features inside shifted windows and cascade-aligns the sensed features with the reference ones. Extensive experiments on two drone-based RGBT object detection datasets demonstrate that the correction results are both visually and quantitatively favorable, thereby improving the detection performance. In particular, our CBC module boosts the precision of the sensed modality ground truth by 25.52 aSim points. Overall, the proposed detector achieves an mAP_50 of 43.55 points on RGBTDronePerson and surpasses a state-of-the-art method by 8.6 mAP_50 on a shifted subset of the DroneVehicle dataset. The code and data will be made publicly available.

[CV-27] A Physics-Informed Deep Learning Model for MRI Brain Motion Correction

【速读】: This paper addresses motion artifacts in brain magnetic resonance imaging (MRI) caused by long acquisition times. The key is PI-MoCoNet, a physics-informed motion correction network that integrates spatial and k-space information to remove motion artifacts without explicit motion parameter estimation, guided by three loss functions (an L1 reconstruction loss, an LPIPS perceptual loss, and a data consistency loss Ldc), improving image fidelity and diagnostic reliability.
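
A minimal sketch of a k-space data consistency term of the kind described (FFT-based; the mask of trusted k-space lines and the exact weighting are assumptions):

```python
import torch

def data_consistency_loss(pred_img, acquired_kspace, trusted_mask):
    """Penalize deviation from the acquired k-space data on lines not
    flagged as motion-corrupted. trusted_mask: (H, W) boolean, True
    where the acquisition is trusted."""
    pred_k = torch.fft.fftshift(torch.fft.fft2(pred_img))
    diff = (pred_k - acquired_kspace) * trusted_mask
    return diff.abs().pow(2).mean()
```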

链接: https://arxiv.org/abs/2502.09296
作者: Mojtaba Safari,Shansong Wang,Zach Eidex,Richard Qiu,Chih-Wei Chang,David S. Yu,Xiaofeng Yang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Medical Physics (physics.med-ph)
备注:

Abstract:Background: MRI is crucial for brain imaging but is highly susceptible to motion artifacts due to long acquisition times. This study introduces PI-MoCoNet, a physics-informed motion correction network that integrates spatial and k-space information to remove motion artifacts without explicit motion parameter estimation, enhancing image fidelity and diagnostic reliability. Materials and Methods: PI-MoCoNet consists of a motion detection network (U-net with spatial averaging) to identify corrupted k-space lines and a motion correction network (U-net with Swin Transformer blocks) to reconstruct motion-free images. The correction is guided by three loss functions: reconstruction (L1), perceptual (LPIPS), and data consistency (Ldc). Motion artifacts were simulated via rigid phase encoding perturbations and evaluated on IXI and MR-ART datasets against Pix2Pix, CycleGAN, and U-net using PSNR, SSIM, and NMSE. Results: PI-MoCoNet significantly improved image quality. On IXI, for minor artifacts, PSNR increased from 34.15 dB to 45.95 dB, SSIM from 0.87 to 1.00, and NMSE reduced from 0.55% to 0.04%. For moderate artifacts, PSNR improved from 30.23 dB to 42.16 dB, SSIM from 0.80 to 0.99, and NMSE from 1.32% to 0.09%. For heavy artifacts, PSNR rose from 27.99 dB to 36.01 dB, SSIM from 0.75 to 0.97, and NMSE decreased from 2.21% to 0.36%. On MR-ART, PI-MoCoNet achieved PSNR gains of ~10 dB and SSIM improvements of up to 0.20, with NMSE reductions of ~6%. Ablation studies confirmed the importance of data consistency and perceptual losses, yielding a 1 dB PSNR gain and 0.17% NMSE reduction. Conclusions: PI-MoCoNet effectively mitigates motion artifacts in brain MRI, outperforming existing methods. Its ability to integrate spatial and k-space information makes it a promising tool for clinical use in motion-prone settings. Code: this https URL.

[CV-28] EmoAssist: Emotional Assistant for Visual Impairment Community

【速读】: This paper addresses the fact that existing visual impairment (VI) assistive large multimodal models (LMMs) neglect emotional needs, and that current benchmarks lack emotional evaluation of these models. The key solutions are the EmoAssist Benchmark, a comprehensive benchmark for evaluating the assistive performance of LMMs for the VI community, and the EmoAssist Model, an emotion-assistive LMM designed specifically for this community. The EmoAssist Model uses Direct Preference Optimization (DPO) to align outputs with human emotional preferences, significantly improving recognition of VI users' implicit emotions and intentions and providing more empathetic responses and actionable guidance.

链接: https://arxiv.org/abs/2502.09285
作者: Xingyu Qi,He Li,Linjie Li,Zhenyu Wu
机构: Beijing University of Posts and Telecommunications(北京邮电大学); AI2Robotics
类目: Computer Vision and Pattern Recognition (cs.CV); Computers and Society (cs.CY)
备注:

Abstract:The rapid advancement of large multi-modality models (LMMs) has significantly propelled the integration of artificial intelligence into practical applications. Visual Question Answering (VQA) systems, which can process multi-modal data including vision, text, and audio, hold great potential for assisting the Visual Impairment (VI) community in navigating complex and dynamic real-world environments. However, existing VI assistive LMMs overlook the emotional needs of VI individuals, and current benchmarks lack emotional evaluation of these LMMs. To address these gaps, this paper introduces the EmoAssist Benchmark, a comprehensive benchmark designed to evaluate the assistive performance of LMMs for the VI community. To the best of our knowledge, this is the first benchmark that incorporates emotional intelligence as a key consideration. Furthermore, we propose the EmoAssist Model, an Emotion-Assistive LMM specifically designed for the VI community. The EmoAssist Model utilizes Direct Preference Optimization (DPO) to align outputs with human emotional preferences. Experiment results demonstrate that the EmoAssist Model significantly enhances the recognition of implicit emotions and intentions of VI users, delivers empathetic responses, and provides actionable guidance. Specifically, it shows respective improvements of 147.8% and 89.7% in the Empathy and Suggestion metrics on the EmoAssist Benchmark, compared to the pre-tuning LMM, and even outperforms state-of-the-art LLMs such as GPT-4o.

[CV-29] FE-LWS: Refined Image-Text Representations via Decoder Stacking and Fused Encodings for Remote Sensing Image Captioning

【速读】: This paper targets the quality of image representations in remote sensing image captioning. The key is to fuse features from two distinct convolutional neural network (CNN) encoders so as to capture complementary information and enhance caption generation. The paper further proposes a weighted-averaging technique to combine the outputs of all GRUs in the stacked decoder, and a comparison-based beam search strategy to refine caption selection. Together these components significantly outperform a transformer-based state-of-the-art model and other LSTM-based baselines.
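
A minimal sketch of the two ideas named above, fusing two CNN encoders and taking a learnable weighted average over all stacked GRU outputs (shapes and module choices are assumptions):

```python
import torch
import torch.nn as nn

class FusedCaptioner(nn.Module):
    def __init__(self, feat_a, feat_b, hidden=512, vocab=5000, layers=3):
        super().__init__()
        self.fuse = nn.Linear(feat_a + feat_b, hidden)   # fuse the two encoders
        self.grus = nn.ModuleList(
            [nn.GRU(hidden, hidden, batch_first=True) for _ in range(layers)])
        self.w = nn.Parameter(torch.ones(layers))        # learnable layer weights
        self.out = nn.Linear(hidden, vocab)

    def forward(self, feats_a, feats_b, token_emb):
        h0 = torch.tanh(self.fuse(torch.cat([feats_a, feats_b], -1))).unsqueeze(0)
        x, outputs = token_emb, []
        for gru in self.grus:                            # stacked decoder
            x, _ = gru(x, h0.contiguous())
            outputs.append(x)
        w = torch.softmax(self.w, 0)                     # weighted average of GRUs
        mixed = sum(wi * oi for wi, oi in zip(w, outputs))
        return self.out(mixed)                           # per-step vocab logits
```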

链接: https://arxiv.org/abs/2502.09282
作者: Swadhin Das,Raksha Sharma
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
备注:

Abstract:Remote sensing image captioning aims to generate descriptive text from remote sensing images, typically employing an encoder-decoder framework. In this setup, a convolutional neural network (CNN) extracts feature representations from the input image, which then guide the decoder in a sequence-to-sequence caption generation process. Although much research has focused on refining the decoder, the quality of image representations from the encoder remains crucial for accurate captioning. This paper introduces a novel approach that integrates features from two distinct CNN based encoders, capturing complementary information to enhance caption generation. Additionally, we propose a weighted averaging technique to combine the outputs of all GRUs in the stacked decoder. Furthermore, a comparison-based beam search strategy is incorporated to refine caption selection. The results demonstrate that our fusion-based approach, along with the enhanced stacked decoder, significantly outperforms both the transformer-based state-of-the-art model and other LSTM-based baselines.

[CV-30] ConsistentDreamer: View-Consistent Meshes Through Balanced Multi-View Gaussian Optimization

【速读】: This paper addresses view consistency in 3D generation: when diffusion models produce 3D assets from images, content and quality vary across viewpoints. The key solution, ConsistentDreamer, first generates a fixed set of multi-view prior images and then samples random views between them with another diffusion model through a score distillation sampling (SDS) loss, limiting the discrepancies between the SDS-guided views and ensuring a consistent rough shape. Dynamic task-dependent weights based on homoscedastic uncertainty, updated automatically each iteration, balance the rough-shape and fine-detail optimizations, while opacity, depth distortion, and normal alignment losses refine the surface, further improving visual quality and view consistency.
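
A minimal sketch of homoscedastic-uncertainty task weighting in the style of Kendall et al.'s multi-task formulation (how ConsistentDreamer wires the specific losses is only summarized above; the two-task split is an assumption):

```python
import torch
import torch.nn as nn

class UncertaintyWeighting(nn.Module):
    """total = sum_i exp(-s_i) * L_i + s_i, with s_i = log(sigma_i^2) learned
    jointly, so noisier tasks are automatically down-weighted each iteration."""
    def __init__(self, n_tasks=2):
        super().__init__()
        self.log_vars = nn.Parameter(torch.zeros(n_tasks))

    def forward(self, losses):        # e.g., [rough_shape_sds, fine_detail_recon]
        total = 0.0
        for s, loss in zip(self.log_vars, losses):
            total = total + torch.exp(-s) * loss + s
        return total
```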

链接: https://arxiv.org/abs/2502.09278
作者: Onat Şahin,Mohammad Altillawi,George Eskandar,Carlos Carbone,Ziyuan Liu
机构: Huawei Technologies(华为技术有限公司); TUM(慕尼黑工业大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Manuscript accepted by Pattern Recognition Letters

Abstract:Recent advances in diffusion models have significantly improved 3D generation, enabling the use of assets generated from an image for embodied AI simulations. However, the one-to-many nature of the image-to-3D problem limits their use due to inconsistent content and quality across views. Previous models optimize a 3D model by sampling views from a view-conditioned diffusion prior, but diffusion models cannot guarantee view consistency. Instead, we present ConsistentDreamer, where we first generate a set of fixed multi-view prior images and sample random views between them with another diffusion model through a score distillation sampling (SDS) loss. Thereby, we limit the discrepancies between the views guided by the SDS loss and ensure a consistent rough shape. In each iteration, we also use our generated multi-view prior images for fine-detail reconstruction. To balance between the rough shape and the fine-detail optimizations, we introduce dynamic task-dependent weights based on homoscedastic uncertainty, updated automatically in each iteration. Additionally, we employ opacity, depth distortion, and normal alignment losses to refine the surface for mesh extraction. Our method ensures better view consistency and visual quality compared to the state-of-the-art.

[CV-31] FLARES: Fast and Accurate LiDAR Multi-Range Semantic Segmentation

【速读】: This paper addresses the challenges of 3D scene understanding for autonomous driving, notably the irregularity and sparsity of LiDAR data and the computational demands of processing large-scale point clouds. The key is a comprehensive re-design of the range-view-based LiDAR semantic segmentation workflow: rather than choosing a higher azimuth resolution to reduce information loss, the authors show that decreasing the resolution is advantageous in both efficiency and accuracy, and they optimize the data representation, augmentation, and post-processing accordingly, significantly improving performance across network architectures.

链接: https://arxiv.org/abs/2502.09274
作者: Bin Yang,Alexandru Paul Condurache
机构: Robert Bosch GmbH(罗伯特博世有限公司); University of Lübeck(吕贝克大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

Abstract:3D scene understanding is a critical yet challenging task in autonomous driving, primarily due to the irregularity and sparsity of LiDAR data, as well as the computational demands of processing large-scale point clouds. Recent methods leverage the range-view representation to improve processing efficiency. To mitigate the performance drop caused by information loss inherent to the “many-to-one” problem, where multiple nearby 3D points are mapped to the same 2D grids and only the closest is retained, prior works tend to choose a higher azimuth resolution for range-view projection. However, this can bring the drawback of reducing the proportion of pixels that carry information and heavier computation within the network. We argue that it is not the optimal solution and show that, in contrast, decreasing the resolution is more advantageous in both efficiency and accuracy. In this work, we present a comprehensive re-design of the workflow for range-view-based LiDAR semantic segmentation. Our approach addresses data representation, augmentation, and post-processing methods for improvements. Through extensive experiments on two public datasets, we demonstrate that our pipeline significantly enhances the performance of various network architectures over their baselines, paving the way for more effective LiDAR-based perception in autonomous systems.

[CV-32] Memory-based Ensemble Learning in CMR Semantic Segmentation

【Quick Read】: This paper tackles the weak performance of existing ventricular segmentation models for cardiac cine sequences on the end slices. The key is to exploit spatial continuity to extract global uncertainty from the segmentation variance and use it as memory in the streaming ensemble learning method (Streaming) for classifier weighting, balancing overall and end-slice performance. An End Coefficient (EC) is also introduced to quantify end-slice accuracy.

Link: https://arxiv.org/abs/2502.09269
Authors: Yiwei Liu, Ziyi Wu, Liang Zhong, Linyi Wen, Yuankai Wu
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Existing models typically segment either the entire 3D frame or 2D slices independently to derive clinical functional metrics from ventricular segmentation in cardiac cine sequences. While performing well overall, they struggle at the end slices. To address this, we leverage spatial continuity to extract global uncertainty from segmentation variance and use it as memory in our ensemble learning method, Streaming, for classifier weighting, balancing overall and end-slice performance. Additionally, we introduce the End Coefficient (EC) to quantify end-slice accuracy. Experiments on ACDC and M&Ms datasets show that our framework achieves near-state-of-the-art Dice Similarity Coefficient (DSC) and outperforms all models on end-slice performance, improving patient-specific segmentation accuracy.
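
As a rough illustration of weighting ensemble members by an uncertainty memory, here is a minimal NumPy sketch; the shapes and the inverse-variance weighting rule are assumptions for illustration, not the paper's exact Streaming method:

```python
import numpy as np

def weighted_fusion(probs, uncertainty):
    """probs: (n_models, H, W) probability maps; uncertainty: (n_models,) memory."""
    w = 1.0 / (uncertainty + 1e-6)          # less uncertain models weigh more
    w = w / w.sum()
    return np.tensordot(w, probs, axes=1)   # fused (H, W) probability map

rng = np.random.default_rng(0)
probs = rng.random((3, 64, 64))             # three base segmenters, one slice
memory = np.array([0.02, 0.10, 0.05])       # e.g. variance observed on end slices
mask = weighted_fusion(probs, memory) > 0.5 # fused binary segmentation
print(mask.mean())
```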

[CV-33] DynSegNet:Dynamic Architecture Adjustment for Adversarial Learning in Segmenting Hemorrhagic Lesions from Fundus Images

【Quick Read】: This paper addresses the challenges of segmenting hemorrhagic lesions in fundus images, including high variability in lesion morphology, indistinct boundaries, and low contrast with background tissue. The key is an adversarial-learning-based dynamic architecture adjustment approach that integrates a hierarchical U-shaped encoder-decoder, residual blocks, attention mechanisms, and ASPP modules, dynamically optimizing feature fusion to improve segmentation performance.

Link: https://arxiv.org/abs/2502.09256
Authors: Zesheng Li, Minwen Liao, Haoran Chen, Yan Su, Chengchang Pan, Honggang Qi
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 12 pages, 4 figures

Abstract:The hemorrhagic lesion segmentation plays a critical role in ophthalmic diagnosis, directly influencing early disease detection, treatment planning, and therapeutic efficacy evaluation. However, the task faces significant challenges due to lesion morphological variability, indistinct boundaries, and low contrast with background tissues. To improve diagnostic accuracy and treatment outcomes, developing advanced segmentation techniques remains imperative. This paper proposes an adversarial learning-based dynamic architecture adjustment approach that integrates hierarchical U-shaped encoder-decoder, residual blocks, attention mechanisms, and ASPP modules. By dynamically optimizing feature fusion, our method enhances segmentation performance. Experimental results demonstrate a Dice coefficient of 0.6802, IoU of 0.5602, Recall of 0.766, Precision of 0.6525, and Accuracy of 0.9955, effectively addressing the challenges in fundus image hemorrhage segmentation.

[CV-34] Visual Graph Question Answering with ASP and LLMs for Language Parsing

【Quick Read】: This paper addresses visual question answering (VQA) over images of graph structures (graphs not given in symbolic form). The key is a modular neuro-symbolic approach that integrates Answer-Set Programming (ASP) with vision and natural language modules: optical graph recognition for graph parsing, a pretrained optical character recognition network for parsing labels, a large language model (LLM) for language processing, and ASP for reasoning. Serving as a first baseline, the method reaches an overall average accuracy of 73% on a new dataset, demonstrating the potential of modular neuro-symbolic systems for complex VQA tasks.

Link: https://arxiv.org/abs/2502.09211
Authors: Jakob Johannes Bauer (ETH Zuerich, Switzerland), Thomas Eiter (TU Wien, Austria), Nelson Higuera Ruiz (TU Wien, Austria), Johannes Oetsch (Jonkoping University, Sweden)
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Logic in Computer Science (cs.LO)
Comments: In Proceedings ICLP 2024, arXiv:2502.08453. This work was partially funded by the Bosch Center for AI

Abstract:Visual Question Answering (VQA) is a challenging problem that requires to process multimodal input. Answer-Set Programming (ASP) has shown great potential in this regard to add interpretability and explainability to modular VQA architectures. In this work, we address the problem of how to integrate ASP with modules for vision and natural language processing to solve a new and demanding VQA variant that is concerned with images of graphs (not graphs in symbolic form). Images containing graph-based structures are an ubiquitous and popular form of visualisation. Here, we deal with the particular problem of graphs inspired by transit networks, and we introduce a novel dataset that amends an existing one by adding images of graphs that resemble metro lines. Our modular neuro-symbolic approach combines optical graph recognition for graph parsing, a pretrained optical character recognition neural network for parsing labels, Large Language Models (LLMs) for language processing, and ASP for reasoning. This method serves as a first baseline and achieves an overall average accuracy of 73% on the dataset. Our evaluation provides further evidence of the potential of modular neuro-symbolic systems, in particular with pretrained models that do not involve any further training and logic programming for reasoning, to solve complex VQA tasks.

[CV-35] Faster than real-time detection of shot boundaries, sampling structure and dynamic keyframes in video

【Quick Read】: This paper addresses three fundamental video analysis tasks: shot boundary detection (hardcuts and short dissolves), sampling structure identification (progressive/interlaced/pulldown), and dynamic keyframe detection. The key is a unified algorithm that performs all of these tasks by combining inter-frame and intra-frame measures derived from the motion field and normalized cross correlation. Sparse and selective computation of these measures lets the algorithm run four times faster than real time while remaining extremely robust even for challenging content with large camera or object motion, flashlights, flicker, or low contrast/noise.

Link: https://arxiv.org/abs/2502.09202
Authors: Hannes Fassold
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted for ICISPC 2024

Abstract:The detection of shot boundaries (hardcuts and short dissolves), sampling structure (progressive / interlaced / pulldown) and dynamic keyframes in a video are fundamental video analysis tasks which have to be done before any further high-level analysis tasks. We present a novel algorithm which does all these analysis tasks in an unified way, by utilizing a combination of inter-frame and intra-frame measures derived from the motion field and normalized cross correlation. The algorithm runs four times faster than real-time due to sparse and selective calculation of these measures. An initial evaluation furthermore shows that the proposed algorithm is extremely robust even for challenging content showing large camera or object motion, flashlights, flicker or low contrast / noise.
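
A toy version of one inter-frame measure from this family: normalized cross correlation between consecutive grayscale frames, with a simple hardcut threshold. The threshold value and the synthetic frames are assumptions for the sketch, not the paper's tuned algorithm:

```python
import numpy as np

def ncc(a, b):
    """Normalized cross correlation of two grayscale frames."""
    a = (a - a.mean()) / (a.std() + 1e-8)
    b = (b - b.mean()) / (b.std() + 1e-8)
    return float((a * b).mean())

def detect_hardcuts(frames, threshold=0.5):
    """Flag a hardcut between frames i-1 and i when the NCC drops below threshold."""
    return [i for i in range(1, len(frames)) if ncc(frames[i - 1], frames[i]) < threshold]

rng = np.random.default_rng(0)
base = rng.random((64, 64))
frames = [base + 0.05 * rng.standard_normal((64, 64)) for _ in range(6)]
frames[3] = rng.random((64, 64))   # abrupt content change
print(detect_hardcuts(frames))     # -> [3, 4]
```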

[CV-36] E-MD3C: Taming Masked Diffusion Transformers for Efficient Zero-Shot Object Customization

【Quick Read】: This paper tackles the high computational cost of zero-shot object image customization. Unlike methods that rely on resource-intensive U-Net architectures, it proposes the efficient framework E-MD3C (Efficient Masked Diffusion Transformer with Disentangled Conditions and Compact Collector). The key is a lightweight masked diffusion transformer operating on latent patches, a disentangled condition design that keeps the model compact while preserving background alignment and fine details, and a learnable Conditions Collector. On the VITON-HD dataset the approach shows clear advantages: roughly a quarter of the parameters, 2.5x faster inference, and about two-thirds of the GPU memory of a U-Net-based latent diffusion model.

Link: https://arxiv.org/abs/2502.09164
Authors: Trung X. Pham, Zhang Kang, Ji Woo Hong, Xuran Zheng, Chang D. Yoo
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 16 pages, 14 figures

Abstract:We propose E-MD3C (Efficient Masked Diffusion Transformer with Disentangled Conditions and Compact Collector), a highly efficient framework for zero-shot object image customization. Unlike prior works reliant on resource-intensive Unet architectures, our approach employs lightweight masked diffusion transformers operating on latent patches, offering significantly improved computational efficiency. The framework integrates three core components: (1) an efficient masked diffusion transformer for processing autoencoder latents, (2) a disentangled condition design that ensures compactness while preserving background alignment and fine details, and (3) a learnable Conditions Collector that consolidates multiple inputs into a compact representation for efficient denoising and learning. E-MD3C outperforms the existing approach on the VITON-HD dataset across metrics such as PSNR, FID, SSIM, and LPIPS, demonstrating clear advantages in parameters, memory efficiency, and inference speed. With only 1/4 of the parameters, our Transformer-based 468M model delivers 2.5× faster inference and uses 2/3 of the GPU memory compared to a 1720M Unet-based latent diffusion model.

[CV-37] Shortcut Learning Susceptibility in Vision Classifiers

【Quick Read】: This paper studies shortcut learning, where models rely on spurious correlations in the data rather than capturing meaningful features, a phenomenon that is particularly relevant for vision classifiers such as CNNs, MLPs, and Vision Transformers (ViTs). The key is a controlled setup that injects deliberate shortcuts, positionally correlated with class labels, into the dataset to test whether models latch onto these artificial cues or learn genuinely discriminative features. Quantitative evaluation (training on the shortcut-modified data and testing on sets with and without the shortcuts) is combined with qualitative evaluation using network-inversion-based reconstruction to analyze what the models internalize in their weights, enabling a systematic comparison of how susceptible different architectures are to spurious correlations.

Link: https://arxiv.org/abs/2502.09150
Authors: Pirzada Suhail, Amit Sethi
Affiliations: IIT Bombay
Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Shortcut learning, where machine learning models exploit spurious correlations in data instead of capturing meaningful features, poses a significant challenge to building robust and generalizable models. This phenomenon is prevalent across various machine learning applications, including vision, natural language processing, and speech recognition, where models may find unintended cues that minimize training loss but fail to capture the underlying structure of the data. Vision classifiers such as Convolutional Neural Networks (CNNs), Multi-Layer Perceptrons (MLPs), and Vision Transformers (ViTs) leverage distinct architectural principles to process spatial and structural information, making them differently susceptible to shortcut learning. In this study, we systematically evaluate these architectures by introducing deliberate shortcuts into the dataset that are positionally correlated with class labels, creating a controlled setup to assess whether models rely on these artificial cues or learn actual distinguishing features. We perform both quantitative evaluation by training on the shortcut-modified dataset and testing them on two different test sets – one containing the same shortcuts and another without them – to determine the extent of reliance on shortcuts. Additionally, qualitative evaluation is performed by using network inversion-based reconstruction techniques to analyze what the models internalize in their weights, aiming to reconstruct the training data as perceived by the classifiers. We evaluate shortcut learning behavior across multiple benchmark datasets, including MNIST, Fashion-MNIST, SVHN, and CIFAR-10, to compare the susceptibility of different vision classifier architectures to shortcut reliance and assess their varying degrees of sensitivity to spurious correlations.
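
The controlled setup can be imitated with a few lines of NumPy: stamp a patch whose position is correlated with the class label, so that a classifier can "cheat". Patch size, intensity, and the position rule are arbitrary choices for this sketch, not the paper's exact protocol:

```python
import numpy as np

def add_positional_shortcut(images, labels, patch=3):
    """Stamp a bright patch whose row position encodes the class label."""
    out = images.copy()
    for img, y in zip(out, labels):
        row = int(y) * patch % (28 - patch)   # label-correlated position
        img[row:row + patch, :patch] = 1.0
    return out

rng = np.random.default_rng(0)
x = rng.random((16, 28, 28))                  # stand-in for MNIST images in [0, 1]
y = rng.integers(0, 10, size=16)
x_shortcut = add_positional_shortcut(x, y)    # training set with the artificial cue
```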

[CV-38] Multimodal HIE Lesion Segmentation in Neonates: A Comparative Study of Loss Functions

【Quick Read】: This paper addresses the difficulty of segmenting Hypoxic-Ischemic Encephalopathy (HIE) lesions in neonatal MRI, a task made hard by diffuse, multifocal lesions of highly variable volume and scarce annotated data. Using the BONBID-HIE dataset, a 3D U-Net is implemented with optimized preprocessing, augmentation, and training strategies, with the central goal of identifying the loss function best suited to HIE lesion segmentation. Dice, Dice-Focal, Tversky, and Hausdorff distance (HausdorffDT) losses are evaluated alongside two proposed compound losses, Dice-Focal-HausdorffDT and Tversky-HausdorffDT. The compound losses outperform standalone ones: Tversky-HausdorffDT achieves the highest Dice and Normalized Surface Dice scores, while Dice-Focal-HausdorffDT minimizes the Mean Surface Distance. The work underscores the importance of task-specific loss optimization, showing that combining region-based and boundary-aware losses yields more accurate HIE lesion segmentation even with limited training data.

Link: https://arxiv.org/abs/2502.09148
Authors: Annayah Usman, Abdul Haseeb, Tahir Syed
Affiliations: School of Mathematics and Computer Science, Institute of Business Administration Karachi, Pakistan
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Segmentation of Hypoxic-Ischemic Encephalopathy (HIE) lesions in neonatal MRI is a crucial but challenging task due to diffuse multifocal lesions with varying volumes and the limited availability of annotated HIE lesion datasets. Using the BONBID-HIE dataset, we implemented a 3D U-Net with optimized preprocessing, augmentation, and training strategies to overcome data constraints. The goal of this study is to identify the optimal loss function specifically for the HIE lesion segmentation task. To this end, we evaluated various loss functions, including Dice, Dice-Focal, Tversky, Hausdorff Distance (HausdorffDT) Loss, and two proposed compound losses – Dice-Focal-HausdorffDT and Tversky-HausdorffDT – to enhance segmentation performance. The results show that different loss functions predict distinct segmentation masks, with compound losses outperforming standalone losses. Tversky-HausdorffDT Loss achieves the highest Dice and Normalized Surface Dice scores, while Dice-Focal-HausdorffDT Loss minimizes Mean Surface Distance. This work underscores the significance of task-specific loss function optimization, demonstrating that combining region-based and boundary-aware losses leads to more accurate HIE lesion segmentation, even with limited training data.
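
As a minimal sketch of the compound-loss idea, here is a Dice-Focal combination in PyTorch; the Hausdorff distance term is omitted for brevity, and the weights and smoothing constants are assumptions rather than the paper's settings:

```python
import torch
import torch.nn.functional as F

def dice_loss(logits, targets, smooth=1.0):
    probs = torch.sigmoid(logits)
    inter = (probs * targets).sum()
    return 1 - (2 * inter + smooth) / (probs.sum() + targets.sum() + smooth)

def focal_loss(logits, targets, alpha=0.25, gamma=2.0):
    bce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    p_t = torch.exp(-bce)                     # probability of the true class
    return (alpha * (1 - p_t) ** gamma * bce).mean()

def dice_focal(logits, targets, w_dice=0.5, w_focal=0.5):
    return w_dice * dice_loss(logits, targets) + w_focal * focal_loss(logits, targets)

logits = torch.randn(2, 1, 32, 32, 32)        # toy 3D U-Net output
targets = (torch.rand_like(logits) > 0.9).float()
print(dice_focal(logits, targets).item())
```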

[CV-39] Feature-based Graph Attention Networks Improve Online Continual Learning

【Quick Read】: This paper targets the challenges of online continual learning for image classification, where models must adapt to new data under dynamic environments and shifting data distributions while retaining knowledge of previously learned tasks. The key is a novel online continual learning framework based on Graph Attention Networks (GATs), which capture contextual relationships through learned attention weights and dynamically update task-specific representations. A pre-trained feature extractor converts images into graphs via hierarchical feature maps that represent information at varying granularity; an enhanced global pooling strategy improves classification performance; and a rehearsal memory duplication technique strengthens the representation of previous tasks within a fixed memory budget.

Link: https://arxiv.org/abs/2502.09143
Authors: Adjovi Sim, Zhengkui Wang, Aik Beng Ng, Shalini De Mello, Simon See, Wonmin Byeon
Affiliations: NVIDIA; Singapore Institute of Technology
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 16 pages

Abstract:Online continual learning for image classification is crucial for models to adapt to new data while retaining knowledge of previously learned tasks. This capability is essential to address real-world challenges involving dynamic environments and evolving data distributions. Traditional approaches predominantly employ Convolutional Neural Networks, which are limited to processing images as grids and primarily capture local patterns rather than relational information. Although the emergence of transformer architectures has improved the ability to capture relationships, these models often require significantly larger resources. In this paper, we present a novel online continual learning framework based on Graph Attention Networks (GATs), which effectively capture contextual relationships and dynamically update the task-specific representation via learned attention weights. Our approach utilizes a pre-trained feature extractor to convert images into graphs using hierarchical feature maps, representing information at varying levels of granularity. These graphs are then processed by a GAT and incorporate an enhanced global pooling strategy to improve classification performance for continual learning. In addition, we propose the rehearsal memory duplication technique that improves the representation of the previous tasks while maintaining the memory budget. Comprehensive evaluations on benchmark datasets, including SVHN, CIFAR10, CIFAR100, and MiniImageNet, demonstrate the superiority of our method compared to the state-of-the-art methods.

[CV-40] Replay-free Online Continual Learning with Self-Supervised MultiPatches

【Quick Read】: This paper addresses online continual learning (OCL) in privacy-sensitive settings where replay strategies are forbidden. The key is Continual MultiPatches (CMP), a plug-in for existing self-supervised OCL strategies that generates multiple patches from a single example and projects them into a shared feature space, where patches from the same example are pushed together without collapsing into a single point. CMP surpasses replay and other SSL-based strategies on OCL streams, challenging the role of replay as the go-to solution for self-supervised OCL.

Link: https://arxiv.org/abs/2502.09140
Authors: Giacomo Cignoni, Andrea Cossu, Alex Gomez-Villa, Joost van de Weijer, Antonio Carta
Affiliations: University of Pisa; Computer Vision Center (CVC)
Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted at ESANN 2025

Abstract:Online Continual Learning (OCL) methods train a model on a non-stationary data stream where only a few examples are available at a time, often leveraging replay strategies. However, usage of replay is sometimes forbidden, especially in applications with strict privacy regulations. Therefore, we propose Continual MultiPatches (CMP), an effective plug-in for existing OCL self-supervised learning strategies that avoids the use of replay samples. CMP generates multiple patches from a single example and projects them into a shared feature space, where patches coming from the same example are pushed together without collapsing into a single point. CMP surpasses replay and other SSL-based strategies on OCL streams, challenging the role of replay as a go-to solution for self-supervised OCL.

[CV-41] Automatic Pruning via Structured Lasso with Class-wise Information

【Quick Read】: This paper addresses the loss of class-wise statistical information during neural network pruning. The key is to use structured lasso guided by the Information Bottleneck so that statistical information is retained while pruning, via two innovative adaptive schemes: sparse graph-structured lasso pruning with Information Bottleneck (sGLP-IB) and sparse tree-guided lasso pruning with Information Bottleneck (sTLP-IB). By better capturing class-wise relatedness, the methods achieve stronger model compression and computational savings while preserving accuracy.

Link: https://arxiv.org/abs/2502.09125
Authors: Xiang Liu, Mingchen Li, Xia Li, Leigang Qu, Zifan Peng, Yijun Song, Zemin Liu, Linshan Jiang, Jialin Li
Affiliations: National University of Singapore; University of North Texas; Department of Information Technology and Electrical Engineering, ETH Zürich; Hong Kong University of Science and Technology (Guangzhou); Dongfang College, Zhejiang University of Finance & Economics; College of Computer Science and Technology, Zhejiang University
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 11 pages, 2 figures

Abstract:Most pruning methods concentrate on unimportant filters of neural networks. However, they face the loss of statistical information due to a lack of consideration for class-wise data. In this paper, from the perspective of leveraging precise class-wise information for model pruning, we utilize structured lasso with guidance from Information Bottleneck theory. Our approach ensures that statistical information is retained during the pruning process. With these techniques, we introduce two innovative adaptive network pruning schemes: sparse graph-structured lasso pruning with Information Bottleneck (sGLP-IB) and sparse tree-guided lasso pruning with Information Bottleneck (sTLP-IB). The key aspect is pruning model filters using sGLP-IB and sTLP-IB to better capture class-wise relatedness. Compared to multiple state-of-the-art methods, our approaches demonstrate superior performance across three datasets and six model architectures in extensive experiments. For instance, using the VGG16 model on the CIFAR-10 dataset, we achieve a parameter reduction of 85%, a decrease in FLOPs by 61%, and maintain an accuracy of 94.10% (0.14% higher than the original model); we reduce the parameters by 55% with the accuracy at 76.12% using the ResNet architecture on ImageNet (only drops 0.03%). In summary, we successfully reduce model size and computational resource usage while maintaining accuracy. Our codes are at this https URL.
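
The structured-lasso core of such filter pruning can be sketched as a group lasso penalty over convolutional filters. The graph/tree guidance and the Information Bottleneck term of sGLP-IB/sTLP-IB are omitted, so this is only the generic building block under stated assumptions:

```python
import torch
import torch.nn as nn

def filter_group_lasso(model):
    """Sum of per-filter L2 norms: pushes whole output filters toward zero."""
    penalty = torch.zeros(())
    for m in model.modules():
        if isinstance(m, nn.Conv2d):
            # weight shape: (out_channels, in_channels, kH, kW) -> one norm per filter
            penalty = penalty + m.weight.flatten(1).norm(dim=1).sum()
    return penalty

model = nn.Sequential(nn.Conv2d(3, 16, 3), nn.ReLU(), nn.Conv2d(16, 32, 3))
reg = 1e-4 * filter_group_lasso(model)   # added to the task loss during training
print(reg.item())
```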

[CV-42] Improving Deep Regression with Tightness ICLR2025

【Quick Read】: This paper asks why preserving the ordinality of targets matters in deep regression and shows that it reduces the conditional entropy H(Z|Y) of the representation Z given the target Y. Although this reduction is vital for generalization, typical regression losses do little to achieve it. The key solution is an optimal-transport-based regularizer that preserves the similarity relationships of targets in the feature space, together with a simple and efficient strategy of duplicating the regression targets, both aimed at reducing H(Z|Y) and thereby improving deep regression performance.

Link: https://arxiv.org/abs/2502.09122
Authors: Shihao Zhang, Yuguang Yan, Angela Yao
Affiliations: National University of Singapore; Guangdong University of Technology
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: ICLR 2025, Code: this https URL

Abstract:For deep regression, preserving the ordinality of the targets with respect to the feature representation improves performance across various tasks. However, a theoretical explanation for the benefits of ordinality is still lacking. This work reveals that preserving ordinality reduces the conditional entropy H(Z|Y) of representation Z conditional on the target Y. However, our findings reveal that typical regression losses do little to reduce H(Z|Y), even though it is vital for generalization performance. With this motivation, we introduce an optimal transport-based regularizer to preserve the similarity relationships of targets in the feature space to reduce H(Z|Y). Additionally, we introduce a simple yet efficient strategy of duplicating the regressor targets, also with the aim of reducing H(Z|Y). Experiments on three real-world regression tasks verify the effectiveness of our strategies to improve deep regression. Code: this https URL.

[CV-43] DenseSplat: Densifying Gaussian Splatting SLAM with Neural Radiance Prior

【Quick Read】: This paper addresses the difficulty of deploying Gaussian SLAM systems, which excel at real-time rendering and fine-grained reconstruction but rely on extensive keyframes, under the sparse-view conditions typical of real-world robots. The key is DenseSplat, a system that effectively combines the advantages of NeRF and 3DGS: it uses sparse keyframes and NeRF priors to initialize primitives that densely populate the map and seamlessly fill gaps, applies geometry-aware primitive sampling and pruning to manage granularity and improve rendering efficiency, and integrates loop closure and bundle adjustment to significantly enhance frame-to-frame tracking accuracy.

Link: https://arxiv.org/abs/2502.09111
Authors: Mingrui Li, Shuhong Liu, Tianchen Deng, Hongyu Wang
Affiliations: Department of Computer Science, Dalian University of Technology, Dalian, China; Department of Mechano-informatics, Information Science and Technology, The University of Tokyo, Tokyo, Japan; Institute of Medical Robotics and Department of Automation, Shanghai Jiao Tong University, and Key Laboratory of System Control and Information Processing, Ministry of Education
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Gaussian SLAM systems excel in real-time rendering and fine-grained reconstruction compared to NeRF-based systems. However, their reliance on extensive keyframes is impractical for deployment in real-world robotic systems, which typically operate under sparse-view conditions that can result in substantial holes in the map. To address these challenges, we introduce DenseSplat, the first SLAM system that effectively combines the advantages of NeRF and 3DGS. DenseSplat utilizes sparse keyframes and NeRF priors for initializing primitives that densely populate maps and seamlessly fill gaps. It also implements geometry-aware primitive sampling and pruning strategies to manage granularity and enhance rendering efficiency. Moreover, DenseSplat integrates loop closure and bundle adjustment, significantly enhancing frame-to-frame tracking accuracy. Extensive experiments on multiple large-scale datasets demonstrate that DenseSplat achieves superior performance in tracking and mapping compared to current state-of-the-art methods.

[CV-44] Pulling Back the Curtain: Unsupervised Adversarial Detection via Contrastive Auxiliary Networks

【Quick Read】: This paper addresses the vulnerability of deep learning models to adversarial attacks in safety-critical applications. The key is U-CAN (Unsupervised adversarial detection via Contrastive Auxiliary Networks), which uncovers adversarial behavior within auxiliary feature representations without requiring adversarial examples. U-CAN is embedded within selected intermediate layers of the target model and uses projection layers and ArcFace-based linear layers to refine feature representations, making benign and adversarial inputs easier to distinguish. Experiments across multiple datasets and architectures show that the method surpasses existing unsupervised adversarial detection techniques, achieving higher F1 scores against four distinct attack methods.

Link: https://arxiv.org/abs/2502.09110
Authors: Eylon Mizrahi, Raz Lapid, Moshe Sipper
Affiliations: Ben-Gurion University; DeepKeep
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Deep learning models are widely employed in safety-critical applications yet remain susceptible to adversarial attacks – imperceptible perturbations that can significantly degrade model performance. Conventional defense mechanisms predominantly focus on either enhancing model robustness or detecting adversarial inputs independently. In this work, we propose an Unsupervised adversarial detection via Contrastive Auxiliary Networks (U-CAN) to uncover adversarial behavior within auxiliary feature representations, without the need for adversarial examples. U-CAN is embedded within selected intermediate layers of the target model. These auxiliary networks, comprising projection layers and ArcFace-based linear layers, refine feature representations to more effectively distinguish between benign and adversarial inputs. Comprehensive experiments across multiple datasets (CIFAR-10, Mammals, and a subset of ImageNet) and architectures (ResNet-50, VGG-16, and ViT) demonstrate that our method surpasses existing unsupervised adversarial detection techniques, achieving superior F1 scores against four distinct attack methods. The proposed framework provides a scalable and effective solution for enhancing the security and reliability of deep learning systems.

[CV-45] From Visuals to Vocabulary: Establishing Equivalence Between Image and Text Token Through Autoregressive Pre-training in MLLMs

【Quick Read】: This paper addresses the lack of precise multimodal alignment in multimodal large language models (MLLMs), which perform well on perceptual tasks but are limited by this weakness. The key is Vision Dynamic Embedding-Guided Pretraining (VDEP), a hybrid autoregressive training paradigm that uses dynamic embeddings from the MLP following the visual encoder to supervise image hidden states and integrates image tokens into autoregressive training. The approach reframes multimodal alignment as recovering information from the input data, with particular emphasis on reconstructing detailed visual information, thereby remedying existing models' neglect of effective image-data processing.

Link: https://arxiv.org/abs/2502.09093
Authors: Mingxiao Li, Fang Qu, Zhanpeng Chen, Na Su, Zhizhou Zhong, Ziyang Chen, Nan Du, Xiaolong Li
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:While MLLMs perform well on perceptual tasks, they lack precise multimodal alignment, limiting performance. To address this challenge, we propose Vision Dynamic Embedding-Guided Pretraining (VDEP), a hybrid autoregressive training paradigm for MLLMs. Utilizing dynamic embeddings from the MLP following the visual encoder, this approach supervises image hidden states and integrates image tokens into autoregressive training. Existing MLLMs primarily focused on recovering information from textual inputs, often neglecting the effective processing of image data. In contrast, the key improvement of this work is the reinterpretation of multimodal alignment as a process of recovering information from input data, with particular emphasis on reconstructing detailed visual information. The proposed method seamlessly integrates into standard models without architectural changes. Experiments on 13 benchmarks show VDEP outperforms baselines, surpassing existing methods.

[CV-46] Unsupervised Anomaly Detection on Implicit Shape representations for Sarcopenia Detection

【Quick Read】: This paper addresses the problem of identifying sarcopenic muscles from their shape. The key is to model the shape of normal muscles with an implicit neural representation (INR) and to introduce an unsupervised anomaly detection method that flags sarcopenic muscles based on the reconstruction error of the implicit model. Using a conditional INR with an auto-decoding strategy, the learned latent representation furthermore separates normal from abnormal muscles clearly and in an unsupervised fashion.

Link: https://arxiv.org/abs/2502.09088
Authors: Louise Piecuch, Jeremie Huet (MD), Antoine Frouin (PT), Antoine Nordez, Anne-Sophie Boureau (MD), Diana Mateus
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:

Abstract:Sarcopenia is an age-related progressive loss of muscle mass and strength that significantly impacts daily life. A commonly studied criterion for characterizing the muscle mass has been the combination of 3D imaging and manual segmentations. In this paper, we instead study the muscles’ shape. We rely on an implicit neural representation (INR) to model normal muscle shapes. We then introduce an unsupervised anomaly detection method to identify sarcopenic muscles based on the reconstruction error of the implicit model. Relying on a conditional INR with an auto-decoding strategy, we also learn a latent representation of the muscles that clearly separates normal from abnormal muscles in an unsupervised fashion. Experimental results on a dataset of 103 segmented volumes indicate that our double anomaly detection strategy effectively discriminates sarcopenic and non-sarcopenic muscles.

[CV-47] BevSplat: Resolving Height Ambiguity via Feature-Based Gaussian Primitives for Weakly-Supervised Cross-View Localization

【Quick Read】: This paper addresses weakly supervised cross-view localization, i.e., estimating the pose of a ground camera relative to a satellite image under noisy ground-truth annotations, where existing methods struggle with height ambiguity because ground images lack depth information. The key is BevSplat, which resolves the height ambiguity with feature-based Gaussian primitives: each pixel of the ground image is represented by a 3D Gaussian carrying semantic and spatial features, and the primitives are synthesized into a BEV feature map for relative pose estimation. For the challenge of panoramic query images, an icosphere-based supervision strategy for the Gaussian primitives is introduced. Experiments show that BevSplat significantly improves localization accuracy.

Link: https://arxiv.org/abs/2502.09080
Authors: Qiwei Wang, Shaoxun Wu, Yujiao Shi
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:This paper addresses the problem of weakly supervised cross-view localization, where the goal is to estimate the pose of a ground camera relative to a satellite image with noisy ground truth annotations. A common approach to bridge the cross-view domain gap for pose estimation is Bird’s-Eye View (BEV) synthesis. However, existing methods struggle with height ambiguity due to the lack of depth information in ground images and satellite height maps. Previous solutions either assume a flat ground plane or rely on complex models, such as cross-view transformers. We propose BevSplat, a novel method that resolves height ambiguity by using feature-based Gaussian primitives. Each pixel in the ground image is represented by a 3D Gaussian with semantic and spatial features, which are synthesized into a BEV feature map for relative pose estimation. Additionally, to address challenges with panoramic query images, we introduce an icosphere-based supervision strategy for the Gaussian primitives. We validate our method on the widely used KITTI and VIGOR datasets, which include both pinhole and panoramic query images. Experimental results show that BevSplat significantly improves localization accuracy over prior approaches.

[CV-48] PTZ-Calib: Robust Pan-Tilt-Zoom Camera Calibration ICRA2025

【Quick Read】: This paper addresses PTZ camera parameter estimation, in particular efficient and accurate calibration for arbitrary viewpoints. The key is the two-stage method PTZ-Calib: in the offline stage, a uniformly selected set of overlapping reference images covering a full 360° view is calibrated automatically in a local coordinate system with the novel PTZ-IBA (PTZ Incremental Bundle Adjustment) algorithm, and the parameters can optionally be further optimized and aligned to a geographic coordinate system using extra global reference 3D information; in the online stage, calibrating any new viewpoint is formulated as a relocalization problem, balancing accuracy and computational efficiency.

Link: https://arxiv.org/abs/2502.09075
Authors: Jinhui Guo, Lubin Fan, Bojian Wu, Jiaqi Gu, Shen Cao, Jieping Ye
Affiliations: Alibaba Cloud Computing; Independent Researcher
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by ICRA 2025

Abstract:In this paper, we present PTZ-Calib, a robust two-stage PTZ camera calibration method, that efficiently and accurately estimates camera parameters for arbitrary viewpoints. Our method includes an offline and an online stage. In the offline stage, we first uniformly select a set of reference images that sufficiently overlap to encompass a complete 360° view. We then utilize the novel PTZ-IBA (PTZ Incremental Bundle Adjustment) algorithm to automatically calibrate the cameras within a local coordinate system. Additionally, for practical application, we can further optimize camera parameters and align them with the geographic coordinate system using extra global reference 3D information. In the online stage, we formulate the calibration of any new viewpoints as a relocalization problem. Our approach balances the accuracy and computational efficiency to meet real-world demands. Extensive evaluations demonstrate our robustness and superior performance over state-of-the-art methods on various real and synthetic datasets. Datasets and source code can be accessed online at this https URL

[CV-49] StyleBlend: Enhancing Style-Specific Content Creation in Text-to-Image Diffusion Models

【Quick Read】: This paper tackles the challenge of synthesizing images with text-to-image (T2I) diffusion models that are both visually impressive and faithful to a specific artistic style. The key of StyleBlend is to decompose style into two components, composition and texture, each learned through a different strategy from a small set of reference images, and to use two synthesis branches, each focusing on one component, enabling effective style blending through shared features without affecting content generation. This addresses the text misalignment and weak style representation that earlier methods struggled with.

Link: https://arxiv.org/abs/2502.09064
Authors: Zichong Chen, Shijin Wang, Yang Zhou
Affiliations: Visual Computing Research Center, CSSE, Shenzhen University
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to Eurographics 2025. Project page: this https URL

Abstract:Synthesizing visually impressive images that seamlessly align both text prompts and specific artistic styles remains a significant challenge in Text-to-Image (T2I) diffusion models. This paper introduces StyleBlend, a method designed to learn and apply style representations from a limited set of reference images, enabling content synthesis of both text-aligned and stylistically coherent. Our approach uniquely decomposes style into two components, composition and texture, each learned through different strategies. We then leverage two synthesis branches, each focusing on a corresponding style component, to facilitate effective style blending through shared features without affecting content generation. StyleBlend addresses the common issues of text misalignment and weak style representation that previous methods have struggled with. Extensive qualitative and quantitative comparisons demonstrate the superiority of our approach.

[CV-50] Vision-Language In-Context Learning Driven Few-Shot Visual Inspection Model

【Quick Read】: This paper addresses general visual inspection, i.e., detecting product defects from only a few samples. The key is a Vision-Language Model (VLM) fine-tuned on a diverse dataset of non-defective and defective product images paired with explanatory texts that serve as inspection criteria. For new products, In-Context Learning lets the model inspect with just a few example images and the corresponding explanatory texts, eliminating the need to collect large training sets and retrain for every product. The method reaches an MCC of 0.804 and an F1-score of 0.950 on MVTec AD in a one-shot manner.

Link: https://arxiv.org/abs/2502.09057
Authors: Shiryu Ueno, Yoshikazu Hayashi, Shunsuke Nakatsuka, Yusei Yamada, Hiroaki Aizawa, Kunihito Kato
Affiliations: Faculty of Engineering, Gifu University; Graduate School of Advanced Science and Engineering, Hiroshima University
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: VISAPP 2025

Abstract:We propose a general visual inspection model using a Vision-Language Model (VLM) with few-shot images of non-defective or defective products, along with explanatory texts that serve as inspection criteria. Although existing VLMs exhibit high performance across various tasks, they are not trained on specific tasks such as visual inspection. Thus, we construct a dataset consisting of diverse images of non-defective and defective products collected from the web, along with unified formatted output text, and fine-tune the VLM. For new products, our method employs In-Context Learning, which allows the model to perform inspections with an example of a non-defective or defective image and the corresponding explanatory texts with visual prompts. This approach eliminates the need to collect a large number of training samples and re-train the model for each product. The experimental results show that our method achieves high performance, with MCC of 0.804 and F1-score of 0.950 on MVTec AD in a one-shot manner. Our code is available at this https URL.
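
One way such a one-shot inspection prompt might be assembled for a chat-style VLM is sketched below; the message schema is a common chat-API convention and entirely an assumption, not the authors' format:

```python
def build_one_shot_prompt(example_image, example_label, criteria, query_image):
    """Assemble a one-shot, chat-style multimodal prompt for a VLM."""
    return [
        {"role": "system",
         "content": f"You are a visual inspector. Inspection criteria: {criteria}"},
        {"role": "user",
         "content": [
             {"type": "image", "data": example_image},
             {"type": "text", "data": f"Reference: this product is {example_label}."},
             {"type": "image", "data": query_image},
             {"type": "text", "data": "Is this product defective? Answer OK or NG with a reason."},
         ]},
    ]

# Hypothetical file names for illustration only.
messages = build_one_shot_prompt("ref.png", "non-defective",
                                 "no scratches or dents on the surface", "query.png")
```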

[CV-51] AIDE: Agentically Improve Visual Language Model with Domain Experts

【Quick Read】: This paper addresses the bottleneck of improving Visual Language Models (VLMs) via knowledge distillation from larger, more capable models, which fails when no superior model exists. The key is AIDE (Agentic Improvement through Domain Experts), a framework that lets VLMs autonomously enhance their capabilities by leveraging specialized domain expert models through a four-stage process: identifying instances for refinement, engaging domain experts for targeted analysis, synthesizing expert outputs with existing data, and integrating the enhanced instances into the training pipeline. Experiments show notable performance gains without relying on larger VLMs or human supervision.

Link: https://arxiv.org/abs/2502.09051
Authors: Ming-Chang Chiu, Fuxiao Liu, Karan Sapra, Andrew Tao, Yaser Jacoob, Xuezhe Ma, Zhiding Yu, Guilin Liu
Affiliations: University of Southern California; University of Maryland, College Park; NVIDIA
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
Comments: 6 pages, 4 figures, 2 tables

Abstract:The enhancement of Visual Language Models (VLMs) has traditionally relied on knowledge distillation from larger, more capable models. This dependence creates a fundamental bottleneck for improving state-of-the-art systems, particularly when no superior models exist. We introduce AIDE (Agentic Improvement through Domain Experts), a novel framework that enables VLMs to autonomously enhance their capabilities by leveraging specialized domain expert models. AIDE operates through a four-stage process: (1) identifying instances for refinement, (2) engaging domain experts for targeted analysis, (3) synthesizing expert outputs with existing data, and (4) integrating enhanced instances into the training pipeline. Experiments on multiple benchmarks, including MMMU, MME, MMBench, etc., demonstrate AIDE’s ability to achieve notable performance gains without relying on larger VLMs nor human supervision. Our framework provides a scalable, resource-efficient approach to continuous VLM improvement, addressing critical limitations in current methodologies, particularly valuable when larger models are unavailable to access.

[CV-52] Evolution of Data-driven Single- and Multi-Hazard Susceptibility Mapping and Emergence of Deep Learning Methods

【Quick Read】: This paper surveys methods for multi-hazard susceptibility mapping and argues for applying deep learning models to the task. The key lies in adapting data fusion strategies from multimodal deep learning to multi-hazard susceptibility mapping. The background study shows that deep learning models are a promising and still largely untapped approach here, and data fusion strategies open up a broader space of applicable deep learning models, improving the accuracy and reliability of multi-hazard susceptibility maps.

Link: https://arxiv.org/abs/2502.09045
Authors: Jaya Sreevalsan-Nair, Aswathi Mundayatt
Affiliations: International Institute of Information Technology Bangalore; Graphics-Visualization-Computing Lab
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Data-driven susceptibility mapping of natural hazards has harnessed the advances in classification methods used on heterogeneous sources represented as raster images. Susceptibility mapping is an important step towards risk assessment for any natural hazard. Increasingly, multiple hazards co-occur spatially, temporally, or both, which calls for an in-depth study on multi-hazard susceptibility mapping. In recent years, single-hazard susceptibility mapping algorithms have become well-established and have been extended to multi-hazard susceptibility mapping. Deep learning is also emerging as a promising method for single-hazard susceptibility mapping. Here, we discuss the evolution of methods for a single hazard, their extensions to multi-hazard maps as a late fusion of decisions, and the use of deep learning methods in susceptibility mapping. We finally propose a vision for adapting data fusion strategies in multimodal deep learning to multi-hazard susceptibility mapping. From the background study of susceptibility methods, we demonstrate that deep learning models are promising, untapped methods for multi-hazard susceptibility mapping. Data fusion strategies provide a larger space of deep learning models applicable to multi-hazard susceptibility mapping.

[CV-53] Large Images are Gaussians: High-Quality Large Image Representation with Levels of 2D Gaussian Splatting AAAI2025

【Quick Read】: This paper addresses the challenges of representing large images with 2D Gaussian Splatting (2DGS), especially when a very large number of Gaussian points is required. The key has two parts: a variant of the representation and optimization strategy that facilitates fitting large numbers of Gaussian points, and a Level-of-Gaussian approach that reconstructs a coarse low-frequency initialization plus fine high-frequency details, successfully representing large images as Gaussian points with high quality.

Link: https://arxiv.org/abs/2502.09039
Authors: Lingting Zhu, Guying Lin, Jinnan Chen, Xinjie Zhang, Zhenchao Jin, Zhao Wang, Lequan Yu
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Accepted by 39th Annual AAAI Conference on Artificial Intelligence (AAAI 2025). 10 pages, 4 figures

Abstract:While Implicit Neural Representations (INRs) have demonstrated significant success in image representation, they are often hindered by large training memory and slow decoding speed. Recently, Gaussian Splatting (GS) has emerged as a promising solution in 3D reconstruction due to its high-quality novel view synthesis and rapid rendering capabilities, positioning it as a valuable tool for a broad spectrum of applications. In particular, a GS-based representation, 2DGS, has shown potential for image fitting. In our work, we present Large Images are Gaussians (LIG), which delves deeper into the application of 2DGS for image representations, addressing the challenge of fitting large images with 2DGS in the situation of numerous Gaussian points, through two distinct modifications: 1) we adopt a variant of representation and optimization strategy, facilitating the fitting of a large number of Gaussian points; 2) we propose a Level-of-Gaussian approach for reconstructing both coarse low-frequency initialization and fine high-frequency details. Consequently, we successfully represent large images as Gaussian points and achieve high-quality large image representation, demonstrating its efficacy across various types of large images. Code is available at this https URL.

[CV-54] Billet Number Recognition Based on Test-Time Adaptation

【Quick Read】: This paper addresses the low recognition accuracy of existing scene text recognition methods for machine-printed or handwritten numbers on moving steel billets, caused by image distortions and distribution shifts between training and test data. The key is a billet number recognition method that combines test-time adaptation with prior knowledge. A model using the DB network for text detection and the SVTR network for text recognition adapts to the test distribution by minimizing its entropy at test time, without supervised fine-tuning. The billet number encoding rules then serve as prior knowledge to check the validity of each recognition result, replacing non-compliant ones, and a validation mechanism is added to the CTC algorithm to address its limitations on damaged characters. Experiments on real datasets, covering both machine-printed and handwritten billet numbers, show significant improvements in the evaluation metrics.

Link: https://arxiv.org/abs/2502.09026
Authors: Yuan Wei, Xiuzhuang Zhou
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:During the steel billet production process, it is essential to recognize machine-printed or manually written billet numbers on moving billets in real-time. To address the issue of low recognition accuracy for existing scene text recognition methods, caused by factors such as image distortions and distribution differences between training and test data, we propose a billet number recognition method that integrates test-time adaptation with prior knowledge. First, we introduce a test-time adaptation method into a model that uses the DB network for text detection and the SVTR network for text recognition. By minimizing the model’s entropy during the testing phase, the model can adapt to the distribution of test data without the need for supervised fine-tuning. Second, we leverage the billet number encoding rules as prior knowledge to assess the validity of each recognition result. Invalid results, which do not comply with the encoding rules, are replaced. Finally, we introduce a validation mechanism into the CTC algorithm using prior knowledge to address its limitations in recognizing damaged characters. Experimental results on real datasets, including both machine-printed billet numbers and handwritten billet numbers, show significant improvements in evaluation metrics, validating the effectiveness of the proposed method.
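
Entropy minimization at test time can be sketched in a few lines of PyTorch (in the spirit of TENT); restricting the update to normalization parameters is a common choice and an assumption here, not a detail confirmed by the paper:

```python
import torch
import torch.nn as nn

def entropy(probs):
    return -(probs * probs.clamp_min(1e-8).log()).sum(dim=1).mean()

def adapt_step(model, batch, lr=1e-4):
    """One unlabeled test-time step: minimize prediction entropy on the batch."""
    params = [p for m in model.modules()
              if isinstance(m, nn.BatchNorm2d) for p in m.parameters()]
    opt = torch.optim.SGD(params, lr=lr)
    loss = entropy(model(batch).softmax(dim=1))   # no labels required
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()

model = nn.Sequential(nn.Conv2d(3, 8, 3), nn.BatchNorm2d(8), nn.ReLU(),
                      nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(8, 10))
print(adapt_step(model, torch.randn(4, 3, 32, 32)))
```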

[CV-55] EventSTR: A Benchmark Dataset and Baselines for Event Stream based Scene Text Recognition

【Quick Read】: This paper addresses the difficulty of traditional scene text recognition (STR) under challenging conditions such as low illumination, motion blur, and cluttered backgrounds. The key is to use bio-inspired event cameras: a large-scale benchmark dataset, EventSTR, is collected and annotated, and a new event-based STR framework, SimC-ESTR, is proposed. SimC-ESTR extracts event features with a visual encoder, projects them into tokens with a Q-former module, augments the vision tokens via a memory mechanism before feeding them to a large language model, and embeds a similarity-based error-correction mechanism in the LLM to fix minor errors from contextual information, effectively improving recognition.

Link: https://arxiv.org/abs/2502.09020
Authors: Xiao Wang, Jingtao Jiang, Dong Li, Futian Wang, Lin Zhu, Yaowei Wang, Yongyong Tian, Jin Tang
Affiliations: School of Computer Science and Technology, Anhui University, China; Beijing Institute of Technology, China; Peng Cheng Laboratory, Shenzhen, China; Harbin Institute of Technology (HITSZ), Shenzhen, China; National Key Laboratory for Multimedia Information Processing, School of Computer Science, Peking University, China; School of Electronic and Computer Engineering, Shenzhen Graduate School, Peking University, China
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: In Peer Review

Abstract:Mainstream Scene Text Recognition (STR) algorithms are developed based on RGB cameras which are sensitive to challenging factors such as low illumination, motion blur, and cluttered backgrounds. In this paper, we propose to recognize the scene text using bio-inspired event cameras by collecting and annotating a large-scale benchmark dataset, termed EventSTR. It contains 9,928 high-definition (1280 * 720) event samples and involves both Chinese and English characters. We also benchmark multiple STR algorithms as the baselines for future works to compare. In addition, we propose a new event-based scene text recognition framework, termed SimC-ESTR. It first extracts the event features using a visual encoder and projects them into tokens using a Q-former module. More importantly, we propose to augment the vision tokens based on a memory mechanism before feeding into the large language models. A similarity-based error correction mechanism is embedded within the large language model to correct potential minor errors fundamentally based on contextual information. Extensive experiments on the newly proposed EventSTR dataset and two simulation STR datasets fully demonstrate the effectiveness of our proposed model. We believe that the dataset and algorithmic model can innovatively propose an event-based STR task and are expected to accelerate the application of event cameras in various industries. The source code and pre-trained models will be released on this https URL

[CV-56] Zero-shot Concept Bottleneck Models

【Quick Read】: This paper addresses the cost of concept bottleneck models (CBMs), which must be trained on the target task to learn input-to-concept and concept-to-label mappings, requiring target datasets and training resources. The key is zero-shot concept bottleneck models (Z-CBMs), which predict concepts and labels fully zero-shot by exploiting a large-scale concept bank of millions of vocabulary items extracted from the web to describe arbitrary inputs across domains. The input-to-concept mapping is realized by concept retrieval, a cross-modal search over the concept bank that dynamically finds input-related concepts; concept-to-label inference applies concept regression, selecting essential concepts from the retrieved ones via sparse linear regression. This yields interpretable and intervenable concepts without any additional training.

Link: https://arxiv.org/abs/2502.09018
Authors: Shin'ya Yamaguchi, Kosuke Nishida, Daiki Chijiwa, Yasutoshi Ida
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: 14 pages, 8 figures

Abstract:Concept bottleneck models (CBMs) are inherently interpretable and intervenable neural network models, which explain their final label prediction by the intermediate prediction of high-level semantic concepts. However, they require target task training to learn input-to-concept and concept-to-label mappings, incurring target dataset collections and training resources. In this paper, we present zero-shot concept bottleneck models (Z-CBMs), which predict concepts and labels in a fully zero-shot manner without training neural networks. Z-CBMs utilize a large-scale concept bank, which is composed of millions of vocabulary extracted from the web, to describe arbitrary input in various domains. For the input-to-concept mapping, we introduce concept retrieval, which dynamically finds input-related concepts by the cross-modal search on the concept bank. In the concept-to-label inference, we apply concept regression to select essential concepts from the retrieved concepts by sparse linear regression. Through extensive experiments, we confirm that our Z-CBMs provide interpretable and intervenable concepts without any additional training. Code will be available at this https URL.
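
A toy NumPy/scikit-learn sketch of the two stages described above: cosine-similarity concept retrieval followed by sparse (lasso) concept regression. The random embeddings stand in for CLIP-style features, and the regression setup is one plausible reading of the paper rather than its exact formulation:

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
concept_bank = rng.standard_normal((1000, 64))   # stand-in concept embeddings
concept_bank /= np.linalg.norm(concept_bank, axis=1, keepdims=True)
image_emb = rng.standard_normal(64)
image_emb /= np.linalg.norm(image_emb)

# Stage 1: concept retrieval -- keep the k concepts most similar to the image.
k = 32
top_idx = np.argsort(-(concept_bank @ image_emb))[:k]

# Stage 2: concept regression -- express the image embedding as a sparse
# combination of retrieved concepts; nonzero coefficients mark the essentials.
X = concept_bank[top_idx].T                      # (dim, k), one column per concept
lasso = Lasso(alpha=0.01, fit_intercept=False).fit(X, image_emb)
active = top_idx[np.abs(lasso.coef_) > 1e-6]
print(f"{active.size} of {k} retrieved concepts selected")
```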

[CV-57] Residual Transformer Fusion Network for Salt and Pepper Image Denoising

【Quick Read】: This paper addresses the problem that some image denoising methods require prior knowledge about the noise. The key is the Residual Transformer Fusion Network (RTF-Net), an architecture combining a Convolutional Vision Transformer (CvT) with Residual Networks (ResNet). Processing splits into two parts: a Noise Suppression Network (NSN) that uses residual blocks to learn the noise map of the image, and a Structure Enhancement Network (SEN) that uses the CvT to learn the details to be added back to the image.

Link: https://arxiv.org/abs/2502.09000
Authors: Bintang Pradana Erlangga Putra, Heri Prasetyo, Esti Suryani
Affiliations: Universitas Sebelas Maret
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 8 pages, 17 figures

Abstract:Convolutional Neural Network (CNN) has been widely used in unstructured datasets, one of which is image denoising. Image denoising is a noisy image reconstruction process that aims to reduce additional noise that occurs from the noisy image with various strategies. Image denoising has a problem, namely that some image denoising methods require some prior knowledge of information about noise. To overcome this problem, a combined architecture of Convolutional Vision Transformer (CvT) and Residual Networks (ResNet) is used which is called the Residual Transformer Fusion Network (RTF-Net). In general, the process in this architecture can be divided into two parts, Noise Suppression Network (NSN) and Structure Enhancement Network (SEN). Residual Block is used in the Noise Suppression Network and is used to learn the noise map in the image, while the CvT is used in the Structure Enhancement Network and is used to learn the details that need to be added to the image processed by the Noise Suppression Network. The model was trained using the DIV2K Training Set dataset, and validation using the DIV2K Validation Set. After doing the training, the model was tested using Lena, Bridge, Pepper, and BSD300 images with noise levels ranging from 30%, 50%, and 70% and the PSNR results were compared with the DBA, NASNLM, PARIGI, NLSF, NLSF-MLP and NLSF-CNN methods. The test results show that the proposed method is superior in all cases except for Pepper’s image with a noise level of 30%, where NLSF-CNN is superior with a PSNR value of 32.99 dB, while the proposed method gets a PSNR value of 31.70 dB.
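
For reproducing the 30/50/70% corruption levels used in the evaluation, a small salt-and-pepper helper suffices; the value conventions (salt = 1, pepper = 0, equal split) are the usual ones and an assumption here:

```python
import numpy as np

def salt_and_pepper(img, level, rng=None):
    """Corrupt a [0, 1] grayscale image; `level` is the fraction of noisy pixels."""
    rng = rng or np.random.default_rng()
    out = img.copy()
    noisy = rng.random(img.shape) < level
    salt = rng.random(img.shape) < 0.5
    out[noisy & salt] = 1.0    # salt
    out[noisy & ~salt] = 0.0   # pepper
    return out

img = np.full((64, 64), 0.5)
noisy_img = salt_and_pepper(img, level=0.5)        # one of the paper's noise levels
print(abs(noisy_img - img).astype(bool).mean())    # roughly 0.5
```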

[CV-58] Hierarchical Vision Transformer with Prototypes for Interpretable Medical Image Classification

【Quick Read】: This paper addresses the lack of interpretability in models for high-risk applications such as medicine. The key is HierViT, which combines the high performance of Vision Transformers with enhanced explainability: a hierarchical structure processes domain-specific features, and the target output is derived from human-defined features visualized by prototype images, making the reasoning semantically similar to, and as intuitive as, human reasoning. Attention heatmaps additionally highlight the regions crucial for identifying each feature, giving HierViT a versatile tool for validating its predictions.

Link: https://arxiv.org/abs/2502.08997
Authors: Luisa Gallée, Catharina Silvia Lisson, Meinrad Beer, Michael Götz
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Explainability is a highly demanded requirement for applications in high-risk areas such as medicine. Vision Transformers have mainly been limited to attention extraction to provide insight into the model’s reasoning. Our approach combines the high performance of Vision Transformers with the introduction of new explainability capabilities. We present HierViT, a Vision Transformer that is inherently interpretable and adapts its reasoning to that of humans. A hierarchical structure is used to process domain-specific features for prediction. It is interpretable by design, as it derives the target output with human-defined features that are visualized by exemplary images (prototypes). By incorporating domain knowledge about these decisive features, the reasoning is semantically similar to human reasoning and therefore intuitive. Moreover, attention heatmaps visualize the crucial regions for identifying each feature, thereby providing HierViT with a versatile tool for validating predictions. Evaluated on two medical benchmark datasets, LIDC-IDRI for lung nodule assessment and derm7pt for skin lesion classification, HierViT achieves superior and comparable prediction accuracy, respectively, while offering explanations that align with human reasoning.

[CV-59] Latents of latents to delineate pixels: hybrid Matryoshka autoencoder-to-U-Net pairing for segmenting large medical images in GPU-poor and low-data regimes

【Quick Read】: This paper addresses the loss of important detail when high-resolution medical images are downsampled, which makes pixel-level tasks such as semantic segmentation much less effective on low-dimensional images. The key is a low-rank Matryoshka projection and a hybrid segmentation architecture, MatAE-U-Net, which combines the hierarchical encoding of a Matryoshka Autoencoder with the spatial reconstruction capabilities of a U-Net decoder, leveraging multi-scale feature extraction and skip connections to preserve pixel geometry and improve accuracy and generalization. On the Stanford EchoNet-D dataset, MatAE-U-Net outperforms the baseline U-Net at segmenting the left ventricle in echocardiographic videos.

Link: https://arxiv.org/abs/2502.08988
Authors: Tahir Syed, Ariba Khan, Sawera Hanif
Affiliations: School of Mathematics and Computer Science, Institute of Business Administration Karachi, Pakistan
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Medical images are often high-resolution and lose important detail if downsampled, making pixel-level methods such as semantic segmentation much less efficient if performed on a low-dimensional image. We propose a low-rank Matryoshka projection and a hybrid segmenting architecture that preserves important information while retaining sufficient pixel geometry for pixel-level tasks. We design the Matryoshka Autoencoder (MatAE-U-Net) which combines the hierarchical encoding of the Matryoshka Autoencoder with the spatial reconstruction capabilities of a U-Net decoder, leveraging multi-scale feature extraction and skip connections to enhance accuracy and generalisation. We apply it to the problem of segmenting the left ventricle (LV) in echocardiographic images using the Stanford EchoNet-D dataset, including 1,000 standardised video-mask pairs of cardiac ultrasound videos resized to 112x112 pixels. The MatAE-UNet model achieves a Mean IoU of 77.68%, Mean Pixel Accuracy of 97.46%, and Dice Coefficient of 86.91%, outperforming the baseline U-Net, which attains a Mean IoU of 74.70%, Mean Pixel Accuracy of 97.31%, and Dice Coefficient of 85.20%. The results highlight the potential of using the U-Net in the recursive Matryoshka latent space for imaging problems with low contrast such as echocardiographic analysis.

[CV-60] Text-driven 3D Human Generation via Contrastive Preference Optimization

【Quick Read】: This paper addresses the difficulty existing methods have in accurately aligning generated 3D human models with long and complex textual descriptions. The key is a novel framework that introduces contrastive preferences, in which human-level preference models guided by both positive and negative prompts assist Score Distillation Sampling (SDS) during optimization. Specifically, a preference optimization module integrates multiple models to comprehensively capture the full range of textual features, and a negation preference module uses static-dynamic negation prompts to mitigate over-optimization of irrelevant details, effectively preventing "reward hacking".

Link: https://arxiv.org/abs/2502.08977
Authors: Pengfei Zhou, Xukun Shen, Yong Hu
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 8

Abstract:Recent advances in Score Distillation Sampling (SDS) have improved 3D human generation from textual descriptions. However, existing methods still face challenges in accurately aligning 3D models with long and complex textual inputs. To address this challenge, we propose a novel framework that introduces contrastive preferences, where human-level preference models, guided by both positive and negative prompts, assist SDS for improved alignment. Specifically, we design a preference optimization module that integrates multiple models to comprehensively capture the full range of textual features. Furthermore, we introduce a negation preference module to mitigate over-optimization of irrelevant details by leveraging static-dynamic negation prompts, effectively preventing "reward hacking". Extensive experiments demonstrate that our method achieves state-of-the-art results, significantly enhancing texture realism and visual alignment with textual descriptions, particularly for long and complex inputs.

[CV-61] opo2Seq: Enhanced Topology Reasoning via Topology Sequence Learning

【Quick Read】: This paper addresses two weaknesses of DETR-like frameworks for extracting lane topology from perspective views: their unordered nature and weak long-range perception, which can cause misaligned segment endpoints and limited topology prediction. The key is Topo2Seq, which enhances topology reasoning via topology sequence learning, using a randomized order prompt-to-sequence scheme between a lane segment decoder and a topology sequence decoder. The dual decoder branches jointly learn lane topology sequences extracted from the Directed Acyclic Graph (DAG) and a lane graph containing geometric information, so the lane segment decoder acquires strong long-range perception and accurate topological reasoning, while the topology sequence decoder is introduced only during training and does not affect inference efficiency.

Link: https://arxiv.org/abs/2502.08974
Authors: Yiming Yang, Yueru Luo, Bingkun He, Erlong Li, Zhipeng Cao, Chao Zheng, Shuqi Mei, Zhen Li
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Extracting lane topology from perspective views (PV) is crucial for planning and control in autonomous driving. This approach extracts potential drivable trajectories for self-driving vehicles without relying on high-definition (HD) maps. However, the unordered nature and weak long-range perception of the DETR-like framework can result in misaligned segment endpoints and limited topological prediction capabilities. Inspired by the learning of contextual relationships in language models, the connectivity relations in roads can be characterized as explicit topology sequences. In this paper, we introduce Topo2Seq, a novel approach for enhancing topology reasoning via topology sequences learning. The core concept of Topo2Seq is a randomized order prompt-to-sequence learning between lane segment decoder and topology sequence decoder. The dual-decoder branches simultaneously learn the lane topology sequences extracted from the Directed Acyclic Graph (DAG) and the lane graph containing geometric information. Randomized order prompt-to-sequence learning extracts unordered key points from the lane graph predicted by the lane segment decoder, which are then fed into the prompt design of the topology sequence decoder to reconstruct an ordered and complete lane graph. In this way, the lane segment decoder learns powerful long-range perception and accurate topological reasoning from the topology sequence decoder. Notably, topology sequence decoder is only introduced during training and does not affect the inference efficiency. Experimental evaluations on the OpenLane-V2 dataset demonstrate the state-of-the-art performance of Topo2Seq in topology reasoning.

[CV-62] owards Understanding Why Data Augmentation Improves Generalization

【Quick Read】: This paper seeks to explain how data augmentation improves model generalization. The key is a unified theoretical framework showing that augmentation helps through two effects: partial semantic feature removal and feature mixing. Partial semantic feature removal reduces the model's reliance on any single feature, promoting diverse feature learning and better generalization; feature mixing scales down the original semantic features and injects noise, increasing training complexity and driving the model to develop more robust features.

Link: https://arxiv.org/abs/2502.08940
Authors: Jingyang Li, Jiachun Pan, Kim-Chuan Toh, Pan Zhou
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments:

Abstract:Data augmentation is a cornerstone technique in deep learning, widely used to improve model generalization. Traditional methods like random cropping and color jittering, as well as advanced techniques such as CutOut, Mixup, and CutMix, have achieved notable success across various domains. However, the mechanisms by which data augmentation improves generalization remain poorly understood, and existing theoretical analyses typically focus on individual techniques without a unified explanation. In this work, we present a unified theoretical framework that elucidates how data augmentation enhances generalization through two key effects: partial semantic feature removal and feature mixing. Partial semantic feature removal reduces the model’s reliance on individual feature, promoting diverse feature learning and better generalization. Feature mixing, by scaling down original semantic features and introducing noise, increases training complexity, driving the model to develop more robust features. Advanced methods like CutMix integrate both effects, achieving complementary benefits. Our theoretical insights are further supported by experimental results, validating the effectiveness of this unified perspective.
zh
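
To make the two effects concrete, here is a minimal NumPy sketch of the two augmentation families the framework analyzes, Mixup-style feature mixing and CutOut-style partial feature removal; the function signatures and the Beta-distributed mixing coefficient follow the standard formulations of these techniques rather than code from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def mixup(x1, y1, x2, y2, alpha=0.2):
    """Feature mixing: scale down each sample's semantic features and
    blend in content (plus label mass) from the other sample."""
    lam = rng.beta(alpha, alpha)          # mixing coefficient in (0, 1)
    x = lam * x1 + (1.0 - lam) * x2       # pixel-level mixing
    y = lam * y1 + (1.0 - lam) * y2       # matching soft label
    return x, y

def cutout(x, size=8):
    """Partial semantic feature removal: zero out a random square patch."""
    h, w = x.shape[:2]
    cy, cx = rng.integers(0, h), rng.integers(0, w)
    out = x.copy()
    out[max(0, cy - size // 2):cy + size // 2,
        max(0, cx - size // 2):cx + size // 2] = 0.0
    return out
```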

[CV-63] On the Promise for Assurance of Differentiable Neurosymbolic Reasoning Paradigms

【Quick Read】: This paper assesses the performance assurance of end-to-end fully differentiable neurosymbolic systems under diverse conditions. The key to the solution is using the Scallop library to evaluate adversarial robustness, calibration, user performance parity, and the interpretability of solutions on classification and reasoning tasks in the image and audio domains. The study finds that such methods offer data-efficiency advantages over purely neural models, along with unique opportunities for assurance, when arithmetic operations are well defined and the input space is high-dimensional. However, this superiority is not universal, and the paper highlights the adversarial-vulnerability risks that neurosymbolic models may incur even while delivering data efficiency.

链接: https://arxiv.org/abs/2502.08932
作者: Luke E. Richards,Jessie Yaros,Jasen Babcock,Coung Ly,Robin Cosbey,Timothy Doster,Cynthia Matuszek
机构: 未知
类目: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Creating usable and deployable Artificial Intelligence (AI) systems requires a level of assurance in performance under many different conditions. Deployed machine learning systems will often require more classic logic and reasoning performed through neurosymbolic programs jointly with artificial neural network sensing. While many prior works have examined the assurance of a single component of the system, either the neural network alone or entire enterprise systems, very few works have examined the assurance of integrated neurosymbolic systems. Within this work, we assess the assurance of end-to-end fully differentiable neurosymbolic systems, an emerging method to create data-efficient and more interpretable models. We perform this investigation using Scallop, an end-to-end neurosymbolic library, across classification and reasoning tasks in both the image and audio domains. We assess assurance across adversarial robustness, calibration, user performance parity, and interpretability of solutions for catching misaligned solutions. Our empirical results show that end-to-end neurosymbolic methods present unique opportunities for assurance beyond their data efficiency, though not across the board. We find that this class of neurosymbolic models has higher assurance in cases where arithmetic operations are defined and where there is high dimensionality to the input space, settings in which fully neural counterparts struggle to learn robust reasoning operations. We also identify how the interpretability of neurosymbolic models can catch shortcuts that later result in increased adversarial vulnerability despite performance parity. Finally, we find that the promise of data efficiency typically holds only in the case of class-imbalanced reasoning problems.
zh

[CV-64] Dynamic watermarks in images generated by diffusion models

【Quick Read】: This paper addresses the major ethical concerns raised by the widespread use of high-fidelity text-to-image diffusion models, including intellectual property protection and the misuse of synthetic media. The key to the solution is a multi-stage watermarking framework that embeds a fixed watermark in the diffusion model's learned noise distribution and, via a fine-tuned decoder, a human-imperceptible dynamic watermark in the generated images, enabling copyright protection and tracing generated images back to their source. The method adapts the watermark's shape and color to the generated content while maintaining robustness, verified via the Structural Similarity Index Measure (SSIM) and cosine similarity. Experiments show that the method enables reliable source-model verification with minimal impact on image quality, even when the dynamic watermark is adjusted for specific content.

链接: https://arxiv.org/abs/2502.08927
作者: Yunzhuo Chen,Naveed Akhtar,Nur Al Hasan Haldar,Ajmal Mian
机构: The University of Western Australia (西澳大利亚大学); The University of Melbourne (墨尔本大学); Curtin University (科廷大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:High-fidelity text-to-image diffusion models have revolutionized visual content generation, but their widespread use raises significant ethical concerns, including intellectual property protection and the misuse of synthetic media. To address these challenges, we propose a novel multi-stage watermarking framework for diffusion models, designed to establish copyright and trace generated images back to their source. Our multi-stage watermarking technique involves embedding: (i) a fixed watermark that is localized in the diffusion model’s learned noise distribution, and (ii) a human-imperceptible, dynamic watermark in generated images, leveraging a fine-tuned decoder. By leveraging the Structural Similarity Index Measure (SSIM) and cosine similarity, we adapt the watermark’s shape and color to the generated content while maintaining robustness. We demonstrate that our method enables reliable source verification through watermark classification, even when the dynamic watermark is adjusted for content-specific variations. To support further research, we generate a dataset of watermarked images and introduce a methodology to evaluate the statistical impact of watermarking on generated images. Finally, we rigorously test our framework against various attack scenarios, demonstrating its robustness and minimal impact on image quality. Our work advances the field of AI-generated content security by providing a scalable solution for model ownership verification and misuse prevention.
zh
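
Both similarity measures named in the abstract are standard; the sketch below shows one way they could be combined to score image fidelity and watermark consistency. The function and the toy usage are illustrative assumptions, not the authors' pipeline.

```python
import numpy as np
from skimage.metrics import structural_similarity as ssim

def watermark_scores(generated, reference, wm_embedded, wm_extracted):
    """Return (image fidelity via SSIM, watermark consistency via cosine
    similarity). All argument names are illustrative placeholders."""
    fidelity = ssim(generated, reference, channel_axis=-1, data_range=1.0)
    a, b = wm_embedded.ravel(), wm_extracted.ravel()
    cosine = float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))
    return fidelity, cosine

# Toy usage with random arrays standing in for real images and watermarks.
rng = np.random.default_rng(0)
img, wm = rng.random((64, 64, 3)), rng.random(256)
print(watermark_scores(img, img, wm, wm))  # -> (1.0, 1.0) for identical inputs
```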

[CV-65] Detecting Malicious Concepts Without Image Generation in AIGC

【Quick Read】: This paper tackles the spread of malicious concepts in text-to-image generation. With the rise of concept-generation models, growing demand for personalized content has spawned concept-sharing platforms, creating the risk that concept owners upload malicious content disguised with benign descriptions and example images. To prevent the spread of such content, platforms need a fast way to identify malicious concepts. The key to the solution is Concept QuickLook, a method that performs detection based solely on the concept file without generating any images. Concept QuickLook defines malicious concepts and provides two detection modes: concept matching and fuzzy detection. Extensive experiments verify that it can effectively detect malicious concepts and demonstrate its practicality on real concept-sharing platforms.

链接: https://arxiv.org/abs/2502.08921
作者: Kun Xu,Yushu Zhang,Shuren Qi,Tao Wang,Wenying Wen,Yuming Fang
机构: 未知
类目: Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:The task of text-to-image generation has achieved tremendous success in practice, with emerging concept generation models capable of producing highly personalized and customized content. Fervor for concept generation is increasing rapidly among users, and platforms for concept sharing have sprung up. The concept owners may upload malicious concepts and disguise them with non-malicious text descriptions and example images to deceive users into downloading and generating malicious content. The platform needs a quick method to determine whether a concept is malicious to prevent the spread of malicious concepts. However, simply relying on concept image generation to judge whether a concept is malicious requires time and computational resources. Especially, as the number of concepts uploaded and downloaded on the platform continues to increase, this approach becomes impractical and poses a risk of generating malicious content. In this paper, we propose Concept QuickLook, the first systematic work to incorporate malicious concept detection into research, which performs detection based solely on concept files without generating any images. We define malicious concepts and design two work modes for detection: concept matching and fuzzy detection. Extensive experiments demonstrate that the proposed Concept QuickLook can detect malicious concepts and demonstrate practicality in concept sharing platforms. We also design robustness experiments to further validate the effectiveness of the solution. We hope this work can initiate malicious concept detection tasks and provide some inspiration.
zh

[CV-66] Diffusion Models Through a Global Lens: Are They Culturally Inclusive?

【Quick Read】: This paper evaluates how well state-of-the-art text-to-image diffusion models generate culturally specific images, particularly cultural details across countries such as architecture, clothing, and food. The key to the solution is the CultDiff benchmark together with CultDiff-S, a neural image-image similarity metric trained to predict human judgments of real and generated images containing cultural artifacts, revealing significant gaps in cultural relevance, description fidelity, and realism in existing models. The findings underscore the need for more inclusive generative AI systems and more equitable, culturally diverse datasets.

链接: https://arxiv.org/abs/2502.08914
作者: Zahra Bayramli,Ayhan Suleymanzade,Na Min An,Huzama Ahmad,Eunsu Kim,Junyeong Park,James Thorne,Alice Oh
机构: KAIST
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 17 pages, 17 figures, 3 tables

点击查看摘要

Abstract:Text-to-image diffusion models have recently enabled the creation of visually compelling, detailed images from textual prompts. However, their ability to accurately represent various cultural nuances remains an open question. In our work, we introduce the CultDiff benchmark, evaluating whether state-of-the-art diffusion models can generate culturally specific images spanning ten countries. We show that these models often fail to generate cultural artifacts in architecture, clothing, and food, especially for underrepresented countries and regions, by conducting a fine-grained analysis of different similarity aspects, revealing significant disparities in cultural relevance, description fidelity, and realism compared to real-world reference images. With the collected human evaluations, we develop a neural-based image-image similarity metric, namely, CultDiff-S, to predict human judgment on real and generated images with cultural artifacts. Our work highlights the need for more inclusive generative AI systems and equitable dataset representation over a wide range of cultures.
zh

[CV-67] DiffoRA: Enabling Parameter-Efficient LLM Fine-Tuning via Differential Low-Rank Matrix Adaptation

【Quick Read】: This paper addresses two shortcomings of existing Low-Rank Adaptation (LoRA) methods for fine-tuning large language models: they ignore differences between modules and offer limited adaptive adjustment. The key to the solution is DiffoRA, a new parameter-efficient fine-tuning (PEFT) scheme. DiffoRA introduces a Differential Adaptation Matrix (DAM) that enables module-wise adaptation, more effectively determining which modules are most suitable and essential for fine-tuning. The design of the DAM affects both the convergence rate and the generalization capability of the pre-trained model. DiffoRA further constructs the DAM via continuous relaxation and discretization with weight-sharing optimization, improving effectiveness and performance. Experiments show that DiffoRA achieves the best model accuracy across multiple benchmarks.

链接: https://arxiv.org/abs/2502.08905
作者: Tangyu Jiang,Haodi Wang,Chun Yuan
机构: Tsinghua University(清华大学); City University of Hong Kong(香港城市大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:The Parameter-Efficient Fine-Tuning (PEFT) methods have been extensively researched for large language models in the downstream tasks. Among all the existing approaches, the Low-Rank Adaptation (LoRA) has gained popularity for its streamlined design by incorporating low-rank matrices into existing pre-trained models. Though effective, LoRA allocates every module an identical low-rank matrix, which ignores the varying properties and contributions across different components. Moreover, the existing adaptive LoRA solutions rely highly on intuitive importance scoring indicators to adjust the interior rank of the decomposition matrices. In this paper, we propose a new PEFT scheme called DiffoRA, which is theoretically grounded and enables module-wise adoption of LoRA. At the core of our DiffoRA lies a Differential Adaptation Matrix (DAM) to determine which module is the most suitable and essential for fine-tuning. We explain how the designed matrix impacts the convergence rate and generalization capability of a pre-trained model. Furthermore, we construct the DAM via continuous relaxation and discretization with weight-sharing optimizations. We fully implement our DiffoRA and design comprehensive experiments to evaluate its performance. The experimental results demonstrate that our approach achieves the best model accuracy over all the state-of-the-art baselines across various benchmarks.
zh
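
For context, the generic LoRA update that DiffoRA builds on adds a trainable low-rank correction to a frozen pretrained weight. A minimal PyTorch sketch of that baseline follows; DiffoRA's Differential Adaptation Matrix for module selection is not reproduced here.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen pretrained linear layer plus a trainable low-rank update:
    y = W x + (alpha / r) * B (A x), with A, B of rank r."""
    def __init__(self, base: nn.Linear, r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                 # keep pretrained weights fixed
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))  # zero init: no-op at start
        self.scale = alpha / r

    def forward(self, x):
        return self.base(x) + self.scale * ((x @ self.A.T) @ self.B.T)
```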

[CV-68] CoL3D: Collaborative Learning of Single-view Depth and Camera Intrinsics for Metric 3D Shape Recovery ICRA2025

【Quick Read】: This paper addresses metric 3D shape recovery from a single image, which is essential for navigation and interaction in robotics and embodied-intelligence applications that require accurate spatial understanding. Without camera intrinsics, metric 3D shape cannot be recovered from depth alone. The key to the solution is the CoL3D framework, which uses a unified network to collaboratively optimize at three levels: depth, camera intrinsics, and 3D point clouds. In particular, CoL3D introduces a canonical incidence field mechanism as prior knowledge and a shape-similarity loss in point-cloud space to improve 3D shape quality, substantially boosting the shape accuracy required for robotic perception.

链接: https://arxiv.org/abs/2502.08902
作者: Chenghao Zhang,Lubin Fan,Shen Cao,Bojian Wu,Jieping Ye
机构: Alibaba Cloud Computing(阿里云); Independent Researcher(独立研究员)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted at ICRA 2025

点击查看摘要

Abstract:Recovering the metric 3D shape from a single image is particularly relevant for robotics and embodied intelligence applications, where accurate spatial understanding is crucial for navigation and interaction with environments. Usually, the mainstream approaches achieve it through monocular depth estimation. However, without camera intrinsics, the 3D metric shape can not be recovered from depth alone. In this study, we theoretically demonstrate that depth serves as a 3D prior constraint for estimating camera intrinsics and uncover the reciprocal relations between these two elements. Motivated by this, we propose a collaborative learning framework for jointly estimating depth and camera intrinsics, named CoL3D, to learn metric 3D shapes from single images. Specifically, CoL3D adopts a unified network and performs collaborative optimization at three levels: depth, camera intrinsics, and 3D point clouds. For camera intrinsics, we design a canonical incidence field mechanism as a prior that enables the model to learn the residual incident field for enhanced calibration. Additionally, we incorporate a shape similarity measurement loss in the point cloud space, which improves the quality of 3D shapes essential for robotic applications. As a result, when training and testing on a single dataset with in-domain settings, CoL3D delivers outstanding performance in both depth estimation and camera calibration across several indoor and outdoor benchmark datasets, which leads to remarkable 3D shape quality for the perception capabilities of robots.
zh
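
The coupling between depth and intrinsics is easiest to see in the pinhole back-projection that lifts a depth map to the metric point cloud CoL3D optimizes at its third level. A minimal sketch under the standard pinhole model (not the paper's network):

```python
import numpy as np

def backproject(depth, fx, fy, cx, cy):
    """Lift a metric depth map of shape (H, W) to an (H*W, 3) point cloud
    with the pinhole model: X = (u - cx) Z / fx, Y = (v - cy) Z / fy."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return np.stack([x, y, depth], axis=-1).reshape(-1, 3)
```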

[CV-69] ShapeLib: designing a library of procedural 3D shape abstractions with Large Language Models

【Quick Read】: This paper addresses the long-standing shape-analysis problem of designing a reusable library of 3D shape abstraction functions. The key is to exploit the priors of frontier large language models (LLMs): given a textual description of design intent and a seed set of exemplar shapes, the system discovers expressive and generalizable shape functions that match the intent. The discovered functions have semantically interpretable parameters and can be successfully manipulated by an LLM via text prompts to produce shape variations.

链接: https://arxiv.org/abs/2502.08884
作者: R. Kenny Jones,Paul Guerrero,Niloy J. Mitra,Daniel Ritchie
机构: Brown University (布朗大学) USA; Adobe Research (Adobe 研究) United Kingdom; University College London (伦敦大学学院) and Adobe Research (Adobe 研究) United Kingdom
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Graphics (cs.GR)
备注:

点击查看摘要

Abstract:Procedural representations are desirable, versatile, and popular shape encodings. Authoring them, either manually or using data-driven procedures, remains challenging, as a well-designed procedural representation should be compact, intuitive, and easy to manipulate. A long-standing problem in shape analysis studies how to discover a reusable library of procedural functions, with semantically aligned exposed parameters, that can explain an entire shape family. We present ShapeLib as the first method that leverages the priors of frontier LLMs to design a library of 3D shape abstraction functions. Our system accepts two forms of design intent: text descriptions of functions to include in the library and a seed set of exemplar shapes. We discover procedural abstractions that match this design intent by proposing, and then validating, function applications and implementations. The discovered shape functions in the library are not only expressive but also generalize beyond the seed set to a full family of shapes. We train a recognition network that learns to infer shape programs based on our library from different visual modalities (primitives, voxels, point clouds). Our shape functions have parameters that are semantically interpretable and can be modified to produce plausible shape variations. We show that this allows inferred programs to be successfully manipulated by an LLM given a text prompt. We evaluate ShapeLib on different datasets and show clear advantages over existing methods and alternative formulations.
zh

[CV-70] Harnessing Vision Models for Time Series Analysis: A Survey

【Quick Read】: This survey examines the advantages of vision models for time series analysis and fills a gap in the existing literature. The key questions are how to encode time series as images and how to model the imaged time series for various tasks. The survey also discusses challenges in the pre- and post-processing steps and outlines future research directions.

链接: https://arxiv.org/abs/2502.08869
作者: Jingchao Ni,Ziming Zhao,ChengAo Shen,Hanghang Tong,Dongjin Song,Wei Cheng,Dongsheng Luo,Haifeng Chen
机构: University of Houston(休斯顿大学); University of Illinois at Urbana-Champaign(伊利诺伊大学香槟分校); University of Connecticut(康涅狄格大学); NEC Laboratories America(NEC美国实验室); Florida International University(佛罗里达国际大学)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Time series analysis has witnessed the inspiring development from traditional autoregressive models, deep learning models, to recent Transformers and Large Language Models (LLMs). Efforts in leveraging vision models for time series analysis have also been made along the way but are less visible to the community due to the predominant research on sequence modeling in this domain. However, the discrepancy between continuous time series and the discrete token space of LLMs, and the challenges in explicitly modeling the correlations of variates in multivariate time series have shifted some research attention to the equally successful Large Vision Models (LVMs) and Vision Language Models (VLMs). To fill the blank in the existing literature, this survey discusses the advantages of vision models over LLMs in time series analysis. It provides a comprehensive and in-depth overview of the existing methods, with dual views of detailed taxonomy that answer the key research questions including how to encode time series as images and how to model the imaged time series for various tasks. Additionally, we address the challenges in the pre- and post-processing steps involved in this framework and outline future directions to further advance time series analysis with vision models.
zh
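
As one concrete answer to the survey's first question, how to encode time series as images, here is a minimal sketch of the widely used Gramian Angular Summation Field; this is a standard encoding of the kind such surveys cover, not a method proposed by the paper.

```python
import numpy as np

def gramian_angular_field(series):
    """Encode a 1D series of length T as a (T, T) image:
    rescale to [-1, 1], set phi = arccos(x), return cos(phi_i + phi_j)."""
    x = np.asarray(series, dtype=float)
    x = 2.0 * (x - x.min()) / (x.max() - x.min() + 1e-12) - 1.0
    x = np.clip(x, -1.0, 1.0)
    s = np.sqrt(1.0 - x ** 2)                    # sin(phi)
    return np.outer(x, x) - np.outer(s, s)       # cos addition formula
```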

[CV-71] Survey on Single-Image Reflection Removal using Deep Learning Techniques

【Quick Read】: This paper addresses the challenges reflections pose in digital images, especially for applications such as computer vision, photography, and image processing that demand high fidelity and robustness. The key contribution is a comprehensive review and assessment of recent single-stage and two-stage deep learning methods for reflection removal. Drawing on work from major conferences and journals, the survey summarizes the latest single-image reflection removal techniques and outlines task hypotheses, current deep learning techniques, public datasets, and relevant evaluation metrics, thereby identifying the main challenges and opportunities in deep learning-based reflection removal.

链接: https://arxiv.org/abs/2502.08836
作者: Kangning Yang,Huiming Sun,Jie Cai,Lan Fu,Jiaming Ding,Jinlong Li,Chiu Man Ho,Zibo Meng
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:The phenomenon of reflection is quite common in digital images, posing significant challenges for various applications such as computer vision, photography, and image processing. Traditional methods for reflection removal often struggle to achieve clean results while maintaining high fidelity and robustness, particularly in real-world scenarios. Over the past few decades, numerous deep learning-based approaches for reflection removal have emerged, yielding impressive results. In this survey, we conduct a comprehensive review of the current literature by focusing on key venues such as ICCV, ECCV, CVPR, NeurIPS, etc., as these conferences and journals have been central to advances in the field. Our review follows a structured paper selection process, and we critically assess both single-stage and two-stage deep learning methods for reflection removal. The contribution of this survey is three-fold: first, we provide a comprehensive summary of the most recent work on single-image reflection removal; second, we outline task hypotheses, current deep learning techniques, publicly available datasets, and relevant evaluation metrics; and third, we identify key challenges and opportunities in deep learning-based reflection removal, highlighting the potential of this rapidly evolving research area.
zh

[CV-72] CSMAE: Cataract Surgical Masked Autoencoder (MAE) based Pre-training

【Quick Read】: This paper aims to automate the analysis of cataract surgery videos to improve surgical training, workflow optimization, and postoperative assessment. The key to the solution is CSMAE, a Masked Autoencoder (MAE)-based pretraining method that selects tokens to mask according to their spatiotemporal importance rather than at random, improving learning efficiency and robustness in low-data regimes. The pretrained model adapts easily to specific downstream tasks via fine-tuning, serving as a strong backbone for further analysis. In rigorous tests on a downstream step-recognition task over two cataract surgery video datasets, D99 and Cataract-101, the method substantially outperforms current state-of-the-art self-supervised pretraining and adapter-based transfer learning approaches. This advance demonstrates the potential of MAE-based pretraining for surgical video analysis and sets a new benchmark for future research.

链接: https://arxiv.org/abs/2502.08822
作者: Nisarg A. Shah,Wele Gedara Chaminda Bandara,Shameema Skider,S. Swaroop Vedula,Vishal M. Patel
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 5 pages, Accepted to IEEE International Symposium on Biomedical Imaging (ISBI 2025)

点击查看摘要

Abstract:Automated analysis of surgical videos is crucial for improving surgical training, workflow optimization, and postoperative assessment. We introduce CSMAE, a Masked Autoencoder (MAE)-based pretraining approach specifically developed for cataract surgery video analysis, where instead of randomly selecting tokens for masking, they are selected based on the spatiotemporal importance of the token. We created a large dataset of cataract surgery videos to improve the model’s learning efficiency and expand its robustness in low-data regimes. Our pre-trained model can be easily adapted for specific downstream tasks via fine-tuning, serving as a robust backbone for further analysis. Through rigorous testing on a downstream step-recognition task on two cataract surgery video datasets, D99 and Cataract-101, our approach surpasses current state-of-the-art self-supervised pretraining and adapter-based transfer learning methods by a significant margin. This advancement not only demonstrates the potential of our MAE-based pretraining in the field of surgical video analysis but also sets a new benchmark for future research.
zh
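
The core departure from vanilla MAE, sampling mask positions by importance rather than uniformly, can be sketched in a few lines of PyTorch; the importance scores themselves (spatiotemporal in CSMAE) are taken as given here, and the function is an assumption rather than the authors' code.

```python
import torch

def importance_masking(tokens, importance, mask_ratio=0.75):
    """tokens: (B, N, D); importance: (B, N) non-negative scores.
    Sample mask positions with probability proportional to importance
    instead of uniformly at random."""
    b, n, _ = tokens.shape
    num_mask = int(n * mask_ratio)
    probs = importance / importance.sum(dim=1, keepdim=True)
    mask_idx = torch.multinomial(probs, num_mask, replacement=False)  # (B, num_mask)
    mask = torch.zeros(b, n, dtype=torch.bool)
    mask.scatter_(1, mask_idx, True)
    return mask  # True = masked; only the unmasked tokens go to the encoder
```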

[CV-73] DejAIvu: Identifying and Explaining AI Art on the Web in Real-Time with Saliency Maps IJCAI2025

【Quick Read】: This paper addresses the challenges posed by the surge of AI-generated images on the internet from advanced generative models (such as diffusion models and GANs), including misinformation, digital forgery, and authenticity verification, as well as concerns over the uncredited use of AI-generated images in media and marketing. The key to the solution is DejAIvu, a Chrome Web extension that combines real-time AI-generated image detection with saliency-based explainability, automatically identifying AI-generated images as users browse and highlighting the image features associated with AI generation.

链接: https://arxiv.org/abs/2502.08821
作者: Jocelyn Dzuong
机构: Florida International University
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 5 pages, 3 figures, submitted to IJCAI 2025 demo track

点击查看摘要

Abstract:The recent surge in advanced generative models, such as diffusion models and generative adversarial networks (GANs), has led to an alarming rise in AI-generated images across various domains on the web. While such technologies offer benefits such as democratizing artistic creation, they also pose challenges in misinformation, digital forgery, and authenticity verification. Additionally, the uncredited use of AI-generated images in media and marketing has sparked significant backlash from online communities. In response to this, we introduce DejAIvu, a Chrome Web extension that combines real-time AI-generated image detection with saliency-based explainability while users browse the web. Using an ONNX-optimized deep learning model, DejAIvu automatically analyzes images on websites such as Google Images, identifies AI-generated content using model inference, and overlays a saliency heatmap to highlight AI-related artifacts. Our approach integrates efficient in-browser inference, gradient-based saliency analysis, and a seamless user experience, ensuring that AI detection is both transparent and interpretable. We also evaluate DejAIvu across multiple pretrained architectures and benchmark datasets, demonstrating high accuracy and low latency, making it a practical and deployable tool for enhancing AI image accountability. The code for this system can be found at this https URL.
zh

[CV-74] Measuring Anxiety Levels with Head Motion Patterns in Severe Depression Population

【Quick Read】: This paper targets the accurate assessment of anxiety levels in patients with depression, which is needed to design effective, personalized treatment plans. The key to the solution is a new method that quantifies anxiety severity by analyzing patients' head movements (speed, acceleration, and angular displacement) during video-recorded interviews. Using the CALYPSO depression dataset, the study extracts head-motion features and applies regression analysis to predict clinically assessed anxiety levels, achieving a mean absolute error (MAE) of 0.35 and suggesting the approach can deepen the understanding of anxiety's role in depression and help psychiatrists refine individualized treatment strategies.

链接: https://arxiv.org/abs/2502.08813
作者: Fouad Boualeb,Emery Pierson,Nicolas Doudeau,Clémence Nineuil,Ali Amad,Mohamed Daoudi
机构: Univ. Lille(里尔大学); CNRS(法国国家科学研究中心); Centrale Lille(中央里尔); Institut Mines-Télécom(矿业与电信学院); UMR 9189 CRIStAL(研究组); IMT Nord Europe(北方理工学院); École Polytechnique(巴黎综合理工大学); Inserm(法国国家健康与医学研究院); CHU Lille(里尔大学医院); U1172 - LilNCog - Lille Neuroscience & Cognition(里尔神经科学与认知实验室)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 19th IEEE International Conference on Automatic Face and Gesture Recognition (FG), 2025

点击查看摘要

Abstract:Depression and anxiety are prevalent mental health disorders that frequently co-occur, with anxiety significantly influencing both the manifestation and treatment of depression. An accurate assessment of anxiety levels in individuals with depression is crucial to develop effective and personalized treatment plans. This study proposes a new noninvasive method for quantifying anxiety severity by analyzing head movements (specifically speed, acceleration, and angular displacement) during video-recorded interviews with patients suffering from severe depression. Using data from the new CALYPSO Depression Dataset, we extracted head motion characteristics and applied regression analysis to predict clinically evaluated anxiety levels. Our results demonstrate a high level of precision, achieving a mean absolute error (MAE) of 0.35 in predicting the severity of psychological anxiety based on head movement patterns. This indicates that our approach can enhance the understanding of anxiety’s role in depression and assist psychiatrists in refining treatment strategies for individuals.
zh
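
A minimal sketch of the feature-extraction-plus-regression pipeline described above, with random arrays standing in for tracked head positions and clinical anxiety scores; the actual CALYPSO features and regressor may differ.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error

def head_motion_features(positions, fps=30.0):
    """Summary statistics of head speed and acceleration from per-frame
    3D head positions of shape (T, 3); angular displacement would be
    computed analogously from head orientation."""
    vel = np.diff(positions, axis=0) * fps
    acc = np.diff(vel, axis=0) * fps
    speed, accel = np.linalg.norm(vel, axis=1), np.linalg.norm(acc, axis=1)
    return np.array([speed.mean(), speed.std(), accel.mean(), accel.std()])

# Placeholder pipeline: one feature vector per interview -> anxiety regression.
rng = np.random.default_rng(0)
X = np.stack([head_motion_features(rng.random((300, 3))) for _ in range(40)])
y = rng.random(40) * 10                        # stand-in anxiety scores
model = LinearRegression().fit(X, y)
print(mean_absolute_error(y, model.predict(X)))
```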

[CV-75] MRUCT: Mixed Reality Assistance for Acupuncture Guided by Ultrasonic Computed Tomography

【Quick Read】: This paper addresses the problem that acupuncture practice relies on muscle memory and tactile feedback, which gives novices a long learning curve and makes it hard to target acupoints precisely. The key to the solution is MRUCT, a system that integrates ultrasonic computed tomography (UCT) with mixed reality (MR) to visualize acupoints in real time. Using a non-rigid registration method to reconstruct anatomical structures together with a carefully designed 3D user interface, the system provides offline image registration and real-time guidance during needle insertion, helping users position needles precisely based on anatomical structures such as bones and muscles along with auto-generated reference points.

链接: https://arxiv.org/abs/2502.08786
作者: Yue Yang,Xinkai Wang,Kehong Zhou,Xue Xie,Lifeng Zhu,Aiguo Song,Bruce Daniel
机构: Stanford University (斯坦福大学); Southeast University (东南大学); Shanghai Sixth People’s Hospital (上海第六人民医院)
类目: Human-Computer Interaction (cs.HC); Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
备注:

点击查看摘要

Abstract:Chinese acupuncture practitioners primarily depend on muscle memory and tactile feedback to insert needles and accurately target acupuncture points, as the current workflow lacks imaging modalities and visual aids. Consequently, new practitioners often learn through trial and error, requiring years of experience to become proficient and earn the trust of patients. Medical students face similar challenges in mastering this skill. To address these challenges, we developed an innovative system, MRUCT, that integrates ultrasonic computed tomography (UCT) with mixed reality (MR) technology to visualize acupuncture points in real-time. This system offers offline image registration and real-time guidance during needle insertion, enabling practitioners to accurately position needles based on anatomical structures such as bones, muscles, and auto-generated reference points, with the potential for clinical implementation. In this paper, we outline the non-rigid registration methods used to reconstruct anatomical structures from UCT data, as well as the key design considerations of the MR system. We evaluated two different 3D user interface (3DUI) designs and compared the performance of our system to traditional workflows for both new practitioners and medical students. The results highlight the potential of MR to enhance therapeutic medical practices and demonstrate the effectiveness of the system we developed.
zh

[CV-76] SB-Bench: Stereotype Bias Benchmark for Large Multimodal Models

【Quick Read】: This paper addresses stereotype bias in Large Multimodal Models (LMMs), which perpetuates harmful societal prejudices in real applications and undermines the fairness and equity of AI systems. The key to the solution is the Stereotype Bias Benchmark (SB-bench), a comprehensive framework that evaluates stereotype bias across nine categories using non-synthetic images, filling the gap left by existing datasets that lack bias evaluation in real-world visual contexts. SB-bench rigorously evaluates LMMs through carefully curated, visually grounded real-world scenarios, image variations, and a multiple-choice question format, enabling precise assessment of their reasoning about visual stereotypes.

链接: https://arxiv.org/abs/2502.08779
作者: Vishal Narnaware,Ashmal Vayani,Rohit Gupta,Swetha Sirnam,Mubarak Shah
机构: University of Central Florida
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Stereotype biases in Large Multimodal Models (LMMs) perpetuate harmful societal prejudices, undermining the fairness and equity of AI applications. As LMMs grow increasingly influential, addressing and mitigating inherent biases related to stereotypes, harmful generations, and ambiguous assumptions in real-world scenarios has become essential. However, existing datasets evaluating stereotype biases in LMMs often lack diversity and rely on synthetic images, leaving a gap in bias evaluation for real-world visual contexts. To address this, we introduce the Stereotype Bias Benchmark (SB-bench), the most comprehensive framework to date for assessing stereotype biases across nine diverse categories with non-synthetic images. SB-bench rigorously evaluates LMMs through carefully curated, visually grounded scenarios, challenging them to reason accurately about visual stereotypes. It offers a robust evaluation framework featuring real-world visual samples, image variations, and multiple-choice question formats. By introducing visually grounded queries that isolate visual biases from textual ones, SB-bench enables a precise and nuanced assessment of a model’s reasoning capabilities across varying levels of difficulty. Through rigorous testing of state-of-the-art open-source and closed-source LMMs, SB-bench provides a systematic approach to assessing stereotype biases in LMMs across key social dimensions. This benchmark represents a significant step toward fostering fairness in AI systems and reducing harmful biases, laying the groundwork for more equitable and socially responsible LMMs. Our code and dataset are publicly available.
zh

[CV-77] Exploring Test Time Adaptation for Subcortical Segmentation of the Fetal Brain in 3D Ultrasound

【Quick Read】: This paper addresses the performance degradation that occurs when pretrained models are applied to unseen freehand ultrasound (US) volumes, owing to large differences in acquisition and alignment. The key to the solution is test time adaptation (TTA), including a novel TTA technique that incorporates a normative atlas as an anatomical prior, improving model performance under various domain shifts and further facilitating automated monitoring of fetal brain development.

链接: https://arxiv.org/abs/2502.08774
作者: Joshua Omolegan,Pak Hei Yeung,Madeleine K. Wyburd,Linde Hesse,Monique Haak,Intergrowth-21st Consortium,Ana I. L. Namburete,Nicola K. Dinsdale
机构: Oxford Machine Learning in NeuroImaging Lab, University of Oxford (牛津神经影像学机器学习实验室,牛津大学); School of Computer Science and Engineering, Nanyang Technological University, Singapore (南洋理工大学计算机科学与工程学院); Department of Obstetrics and Fetal Medicine, Leiden University Medical Center (莱顿大学医学中心); Google (谷歌); QuantCo (未知)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 5 pages, 5 figures

点击查看摘要

Abstract:Monitoring the growth of subcortical regions of the fetal brain in ultrasound (US) images can help identify the presence of abnormal development. Manually segmenting these regions is a challenging task, but recent work has shown that it can be automated using deep learning. However, applying pretrained models to unseen freehand US volumes often leads to a degradation of performance due to the vast differences in acquisition and alignment. In this work, we first demonstrate that test time adaptation (TTA) can be used to improve model performance in the presence of both real and simulated domain shifts. We further propose a novel TTA method by incorporating a normative atlas as a prior for anatomy. In the presence of various types of domain shifts, we benchmark the performance of different TTA methods and demonstrate the improvements brought by our proposed approach, which may further facilitate automated monitoring of fetal brain development. Our code is available at this https URL.
zh
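
For readers unfamiliar with TTA, below is a minimal sketch of the common entropy-minimization baseline (in the spirit of Tent) that such benchmarks typically include; the paper's atlas-prior method would add an anatomy-consistency term to this loss and is not reproduced here.

```python
import torch
import torch.nn.functional as F

def tta_entropy_step(model, x, optimizer):
    """One adaptation step on an unlabeled test batch: minimize the mean
    per-voxel prediction entropy. In practice only normalization-layer
    parameters are usually left trainable."""
    logits = model(x)                                   # (B, C, ...) class logits
    probs = F.softmax(logits, dim=1)
    entropy = -(probs * torch.log(probs + 1e-8)).sum(dim=1).mean()
    optimizer.zero_grad()
    entropy.backward()
    optimizer.step()
    return float(entropy)
```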

[CV-78] Cluster and Predict Latents Patches for Improved Masked Image Modeling

【Quick Read】: This paper aims to improve masked image modeling (MIM) for self-supervised representation learning. The key is CAPI, a pure-MIM framework based on predicting latent clusterings. CAPI uses a clustering-based loss that is stable to train and shows promising scaling properties. With this approach, a ViT-L backbone reaches 83.8% accuracy on ImageNet and 32.1% mIoU on ADE20K, substantially outperforming previous MIM methods and approaching the current state of the art, DINOv2.

链接: https://arxiv.org/abs/2502.08769
作者: Timothée Darcet,Federico Baldassarre,Maxime Oquab,Julien Mairal,Piotr Bojanowski
机构: Meta FAIR; Univ. Grenoble Alpes, Inria, CNRS, Grenoble INP, LJK; Meta FAIR; Meta FAIR; Univ. Grenoble Alpes, Inria, CNRS, Grenoble INP, LJK; Meta FAIR
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 13 pages, 7 figures, submitted to TMLR

点击查看摘要

Abstract:Masked Image Modeling (MIM) offers a promising approach to self-supervised representation learning, however existing MIM models still lag behind the state-of-the-art. In this paper, we systematically analyze target representations, loss functions, and architectures, to introduce CAPI - a novel pure-MIM framework that relies on the prediction of latent clusterings. Our approach leverages a clustering-based loss, which is stable to train, and exhibits promising scaling properties. Our ViT-L backbone, CAPI, achieves 83.8% accuracy on ImageNet and 32.1% mIoU on ADE20K with simple linear probes, substantially outperforming previous MIM methods and approaching the performance of the current state-of-the-art, DINOv2. We release all our code and models.
zh
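
A generic sketch of what a cluster-prediction objective can look like, with teacher features assigned to their nearest prototype and a student trained to predict that assignment; this is an assumed formulation for illustration, not CAPI's exact loss.

```python
import torch
import torch.nn.functional as F

def cluster_prediction_loss(student_logits, teacher_feats, prototypes, temp=0.1):
    """student_logits: (N, K) predictions for N masked patches over K clusters;
    teacher_feats: (N, D) target patch features; prototypes: (K, D).
    Teacher features are assigned to their nearest prototype (no gradient),
    and the student is trained to predict that assignment."""
    with torch.no_grad():
        sim = F.normalize(teacher_feats, dim=-1) @ F.normalize(prototypes, dim=-1).T
        targets = sim.argmax(dim=-1)                  # (N,) hard cluster ids
    return F.cross_entropy(student_logits / temp, targets)
```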

[CV-79] HistoSmith: Single-Stage Histology Image-Label Generation via Conditional Latent Diffusion for Enhanced Cell Segmentation and Classification

【Quick Read】: This paper addresses the scarcity of high-quality annotated data for analyzing the tissue microenvironment in support of medical diagnosis, prognosis, and treatment planning. The key to the solution is HistoSmith, a novel single-stage method that uses a conditional latent diffusion model to learn the joint distribution of cellular layouts, classification masks, and histology images, enabling generation of customized image-label pairs conditioned on user-defined parameters (such as cell types, quantities, and tissue types) to augment histology datasets. Trained on the Conic HE histopathology dataset and the Nissl-stained CytoDArk0 dataset, it yields clear improvements in cell instance segmentation and classification, especially for underrepresented cell types such as neutrophils.

链接: https://arxiv.org/abs/2502.08754
作者: Valentina Vadori,Jean-Marie Graïc,Antonella Peruffo,Livio Finos,Ujwala Kiran Chaudhari,Enrico Grisan
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Precise segmentation and classification of cell instances are vital for analyzing the tissue microenvironment in histology images, supporting medical diagnosis, prognosis, treatment planning, and studies of brain cytoarchitecture. However, the creation of high-quality annotated datasets for training remains a major challenge. This study introduces a novel single-stage approach (HistoSmith) for generating image-label pairs to augment histology datasets. Unlike state-of-the-art methods that utilize diffusion models with separate components for label and image generation, our approach employs a latent diffusion model to learn the joint distribution of cellular layouts, classification masks, and histology images. This model enables tailored data generation by conditioning on user-defined parameters such as cell types, quantities, and tissue types. Trained on the Conic HE histopathology dataset and the Nissl-stained CytoDArk0 dataset, the model generates realistic and diverse labeled samples. Experimental results demonstrate improvements in cell instance segmentation and classification, particularly for underrepresented cell types like neutrophils in the Conic dataset. These findings underscore the potential of our approach to address data scarcity challenges.
zh

[CV-80] Skrr: Skip and Re-use Text Encoder Layers for Memory Efficient Text-to-Image Generation

【Quick Read】: This paper addresses the excessive memory usage of the large text encoders in text-to-image (T2I) diffusion models. The key to the solution is Skip and Re-use layers (Skrr), an efficient pruning strategy designed specifically for the text encoders of T2I diffusion models. Skrr exploits the inherent redundancy of transformer blocks by selectively skipping or reusing certain layers, significantly reducing memory consumption without sacrificing performance.

链接: https://arxiv.org/abs/2502.08690
作者: Hoigi Seo,Wongi Jeong,Jae-sun Seo,Se Young Chun
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Large-scale text encoders in text-to-image (T2I) diffusion models have demonstrated exceptional performance in generating high-quality images from textual prompts. Unlike denoising modules that rely on multiple iterative steps, text encoders require only a single forward pass to produce text embeddings. However, despite their minimal contribution to total inference time and floating-point operations (FLOPs), text encoders demand significantly higher memory usage, up to eight times more than denoising modules. To address this inefficiency, we propose Skip and Re-use layers (Skrr), a simple yet effective pruning strategy specifically designed for text encoders in T2I diffusion models. Skrr exploits the inherent redundancy in transformer blocks by selectively skipping or reusing certain layers in a manner tailored for T2I tasks, thereby reducing memory consumption without compromising performance. Extensive experiments demonstrate that Skrr maintains image quality comparable to the original model even under high sparsity levels, outperforming existing blockwise pruning methods. Furthermore, Skrr achieves state-of-the-art memory efficiency while preserving performance across multiple evaluation metrics, including the FID, CLIP, DreamSim, and GenEval scores.
zh
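
A minimal sketch of executing a transformer stack under a per-layer skip/re-use policy. The "re-use" interpretation here (re-applying the last executed block in place of a pruned one) and the policy format are assumptions; the paper's selection procedure is not reproduced.

```python
import torch.nn as nn

class SkipReuseEncoder(nn.Module):
    """Run a block stack under a per-layer policy of 'run', 'skip', or
    'reuse' (re-apply the last executed block instead of a pruned one)."""
    def __init__(self, blocks, policy):
        super().__init__()
        self.blocks = nn.ModuleList(blocks)
        self.policy = policy                 # e.g. ['run', 'reuse', 'skip', ...]

    def forward(self, h):
        last = None
        for block, action in zip(self.blocks, self.policy):
            if action == 'run':
                h, last = block(h), block
            elif action == 'reuse' and last is not None:
                h = last(h)                  # weights already resident in memory
            # 'skip': pass h through unchanged
        return h
```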

[CV-81] Multispectral Remote Sensing for Weed Detection in West Australian Agricultural Lands

【Quick Read】: This paper addresses the economic and ecological losses caused by widespread weed infestation in the Kondinin region of Western Australia. The key to the solution is a custom multispectral remote sensing dataset built for agricultural applications together with an end-to-end weed detection framework. The framework includes preprocessing steps such as denoising, radiometric calibration, image alignment, orthorectification, and stitching; combines vegetation indices (NDVI, GNDVI, EVI, SAVI, MSAVI) with the multispectral channels to form classification features; and applies deep learning models (such as ResNet) for weed recognition, ultimately achieving highly accurate weed detection.

链接: https://arxiv.org/abs/2502.08678
作者: Haitian Wang,Muhammad Ibrahim,Yumeng Miao,Dustin Severtson,Atif Mansoor,Ajmal S. Mian
机构: The University of Western Australia; Department of Primary Industries and Regional Development
类目: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
备注: 8 pages, 9 figures, 1 table, Accepted for oral presentation at IEEE 25th International Conference on Digital Image Computing: Techniques and Applications (DICTA 2024). Conference Proceeding: 979-8-3503-7903-7/24/$31.00 © 2024 IEEE

点击查看摘要

Abstract:The Kondinin region in Western Australia faces significant agricultural challenges due to pervasive weed infestations, causing economic losses and ecological impacts. This study constructs a tailored multispectral remote sensing dataset and an end-to-end framework for weed detection to advance precision agriculture practices. Unmanned aerial vehicles were used to collect raw multispectral data from two experimental areas (E2 and E8) over four years, covering 0.6046 km^2, and ground truth annotations were created with GPS-enabled vehicles to manually label weeds and crops. The dataset is specifically designed for agricultural applications in Western Australia. We propose an end-to-end framework for weed detection that includes extensive preprocessing steps, such as denoising, radiometric calibration, image alignment, orthorectification, and stitching. The proposed method combines vegetation indices (NDVI, GNDVI, EVI, SAVI, MSAVI) with multispectral channels to form classification features, and employs several deep learning models to identify weeds based on the input features. Among these models, ResNet achieves the highest performance, with a weed detection accuracy of 0.9213, an F1-Score of 0.8735, an mIOU of 0.7888, and an mDC of 0.8865, validating the efficacy of the dataset and the proposed weed detection method.
zh
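
The five vegetation indices used as classification features have standard definitions; a minimal NumPy sketch computing them from reflectance bands, where the epsilon guard and L = 0.5 for SAVI are conventional choices rather than values taken from the paper:

```python
import numpy as np

def vegetation_indices(nir, red, green, blue, eps=1e-6):
    """Stack NDVI, GNDVI, EVI, SAVI and MSAVI computed from reflectance
    bands in [0, 1] into a feature array of shape (..., 5)."""
    ndvi  = (nir - red) / (nir + red + eps)
    gndvi = (nir - green) / (nir + green + eps)
    evi   = 2.5 * (nir - red) / (nir + 6.0 * red - 7.5 * blue + 1.0 + eps)
    savi  = 1.5 * (nir - red) / (nir + red + 0.5 + eps)   # soil factor L = 0.5
    msavi = (2.0 * nir + 1.0
             - np.sqrt((2.0 * nir + 1.0) ** 2 - 8.0 * (nir - red))) / 2.0
    return np.stack([ndvi, gndvi, evi, savi, msavi], axis=-1)
```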

[CV-82] LIR-LIVO: A LightweightRobust LiDAR/Vision/Inertial Odometry with Illumination-Resilient Deep Features

【Quick Read】: This paper addresses the design of a lightweight and robust LiDAR-inertial-visual odometry system for challenging illumination and degraded environments. The key to the solution is combining deep learning-based illumination-resilient features with LiDAR-Inertial-Visual Odometry (LIVO). By introducing a uniform depth distribution of features (enabled by depth association with LiDAR point clouds) and adaptive feature matching (using SuperPoint and LightGlue), the method achieves excellent accuracy and robustness at low computational cost.

链接: https://arxiv.org/abs/2502.08676
作者: Shujie Zhou,Zihao Wang,Xinye Dai,Weiwei Song,Shengfeng Gu
机构: 未知
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV); Signal Processing (eess.SP); Systems and Control (eess.SY)
备注:

点击查看摘要

Abstract:In this paper, we propose LIR-LIVO, a lightweight and robust LiDAR-inertial-visual odometry system designed for challenging illumination and degraded environments. The proposed method leverages deep learning-based illumination-resilient features and LiDAR-Inertial-Visual Odometry (LIVO). By incorporating advanced techniques such as uniform depth distribution of features enabled by depth association with LiDAR point clouds and adaptive feature matching utilizing Superpoint and LightGlue, LIR-LIVO achieves state-of-the-art (SOTA) accuracy and robustness with low computational cost. Experiments are conducted on benchmark datasets, including NTU-VIRAL, Hilti’22, and R3LIVE-Dataset. The corresponding results demonstrate that our proposed method outperforms other SOTA methods on both standard and challenging datasets. Particularly, the proposed method demonstrates robust pose estimation under poor ambient lighting conditions in the Hilti’22 dataset. The code of this work is publicly accessible on GitHub to facilitate advancements in the robotics community.
zh

[CV-83] COutfitGAN: Learning to Synthesize Compatible Outfits Supervised by Silhouette Masks and Fashion Styles

【Quick Read】: This paper addresses how to generate complementary and compatible new fashion items based on given ones. The key to the solution is COutfitGAN, an outfit generation framework comprising a pyramid style extractor, an outfit generator, a UNet-based real/fake discriminator, and a collocation discriminator, which synthesizes photo-realistic images compatible with the given items.

链接: https://arxiv.org/abs/2502.08674
作者: Dongliang Zhou,Haijun Zhang,Qun Li,Jianghong Ma,Xiaofei Xu
机构: Department of Computer Science, Harbin Institute of Technology, Shenzhen(哈尔滨工业大学深圳校区计算机科学与技术学院)
类目: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR); Multimedia (cs.MM)
备注: This paper was accepted by IEEE TMM

点击查看摘要

Abstract:How to recommend outfits has gained considerable attention in both academia and industry in recent years. Many studies have been carried out regarding fashion compatibility learning, to determine whether the fashion items in an outfit are compatible or not. These methods mainly focus on evaluating the compatibility of existing outfits and rarely consider applying such knowledge to ‘design’ new fashion items. We propose the new task of generating complementary and compatible fashion items based on an arbitrary number of given fashion items. In particular, given some fashion items that can make up an outfit, the aim of this paper is to synthesize photo-realistic images of other, complementary, fashion items that are compatible with the given ones. To achieve this, we propose an outfit generation framework, referred to as COutfitGAN, which includes a pyramid style extractor, an outfit generator, a UNet-based real/fake discriminator, and a collocation discriminator. To train and evaluate this framework, we collected a large-scale fashion outfit dataset with over 200K outfits and 800K fashion items from the Internet. Extensive experiments show that COutfitGAN outperforms other baselines in terms of similarity, authenticity, and compatibility measurements.
zh

[CV-84] DiffRenderGAN: Addressing Training Data Scarcity in Deep Segmentation Networks for Quantitative Nanomaterial Analysis through Differentiable Rendering and Generative Modelling

【Quick Read】: This paper targets the difficulty of quantifying and understanding the key properties that govern how nanomaterials are applied and interact in technological, biological, and environmental contexts. The key to the solution is DiffRenderGAN, a novel generative model that integrates a differentiable renderer into a Generative Adversarial Network (GAN) framework to generate annotated synthetic nanoparticle images from non-annotated real microscopy images, reducing the need for manual annotation and improving segmentation performance. Across several ion and electron microscopy cases, including titanium dioxide (TiO2), silicon dioxide (SiO2), and silver nanowires (AgNW), DiffRenderGAN bridges synthetic and real data, advancing the quantification and understanding of complex nanomaterial systems.

链接: https://arxiv.org/abs/2502.09477
作者: Dennis Possart,Leonid Mill,Florian Vollnhals,Tor Hildebrand,Peter Suter,Mathis Hoffmann,Jonas Utz,Daniel Augsburger,Mareike Thies,Mingxuan Wu,Fabian Wagner,George Sarau,Silke Christiansen,Katharina Breininger
机构: 未知
类目: Materials Science (cond-mat.mtrl-sci); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Nanomaterials exhibit distinctive properties governed by parameters such as size, shape, and surface characteristics, which critically influence their applications and interactions across technological, biological, and environmental contexts. Accurate quantification and understanding of these materials are essential for advancing research and innovation. In this regard, deep learning segmentation networks have emerged as powerful tools that enable automated insights and replace subjective methods with precise quantitative analysis. However, their efficacy depends on representative annotated datasets, which are challenging to obtain due to the costly imaging of nanoparticles and the labor-intensive nature of manual annotations. To overcome these limitations, we introduce DiffRenderGAN, a novel generative model designed to produce annotated synthetic data. By integrating a differentiable renderer into a Generative Adversarial Network (GAN) framework, DiffRenderGAN optimizes textural rendering parameters to generate realistic, annotated nanoparticle images from non-annotated real microscopy images. This approach reduces the need for manual intervention and enhances segmentation performance compared to existing synthetic data methods by generating diverse and realistic data. Tested on multiple ion and electron microscopy cases, including titanium dioxide (TiO_2), silicon dioxide (SiO_2), and silver nanowires (AgNW), DiffRenderGAN bridges the gap between synthetic and real data, advancing the quantification and understanding of complex nanomaterial systems.
zh

[CV-85] Color Universal Design Neural Network for the Color Vision Deficiencies

【Quick Read】: This paper addresses the problem that image information can be hard to recognize for people with color vision deficiencies. The key to the solution is CUD-Net, a color universal design network that generates images visually understandable to color-deficient viewers by regressing the node points of a piecewise linear function and applying an image-specific filter to each input. CUD-Net follows a four-step process: refining a CUD dataset according to criteria set by color experts, expanding the input image information with preprocessing specialized for color-deficient vision, processing the expanded images with a multi-modality fusion architecture, and applying a proposed conjugate loss function, based on the composition of predicted images, to address the one-to-many problem arising from the dataset. The result is high-quality CUD images with stable color and contrast.

链接: https://arxiv.org/abs/2502.08671
作者: Sunyong Seo,Jinho Park
机构: Soongsil University (崇实大学)
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注: 12 pages, 10 figures

点击查看摘要

Abstract:Information regarding images should be visually understood by anyone, including those with color deficiency. However, such information can become unrecognizable when a color that appears distorted to viewers with color deficiencies meets an adjacent object. The aim of this paper is to propose a color universal design network, called CUD-Net, that generates images that are visually understandable by individuals with color deficiency. CUD-Net is a convolutional deep neural network that can preserve color and distinguish colors for input images by regressing the node points of a piecewise linear function and using a specific filter for each image. To generate CUD images for color deficiencies, we follow a four-step process. First, we refine the CUD dataset based on specific criteria by color experts. Second, we expand the input image information through pre-processing that is specialized for color deficiency vision. Third, we employ a multi-modality fusion architecture to combine features and process the expanded images. Finally, we propose a conjugate loss function based on the composition of the predicted image through the model to address one-to-many problems that arise from the dataset. Our approach is able to produce high-quality CUD images that maintain color and contrast stability. The code for CUD-Net is available on the GitHub repository.
zh
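
The piecewise linear mapping at the heart of CUD-Net can be pictured as interpolation between regressed node points applied per color channel; a minimal NumPy sketch with placeholder node values (in the paper the network regresses them per image):

```python
import numpy as np

def piecewise_linear_map(channel, nodes_x, nodes_y):
    """Apply a monotone piecewise linear function, defined by its node
    points, to one color channel with values in [0, 1]."""
    return np.interp(channel, nodes_x, nodes_y)

# Toy usage: slightly compress the mid-tones of the red channel.
red = np.linspace(0.0, 1.0, 5)
print(piecewise_linear_map(red, nodes_x=[0.0, 0.5, 1.0],
                           nodes_y=[0.0, 0.4, 1.0]))
```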

[CV-86] Unpaired Image-to-Image Translation with Content Preserving Perspective: A Review

【Quick Read】: This paper addresses the varying degrees of content preservation required by different image-to-image translation (I2I) tasks. The key is a categorization of I2I methods into three classes, fully content preserving, partially content preserving, and non-content preserving, with the tasks, datasets, methods, and results detailed for each class. Beyond providing evaluation criteria for different model architectures, the paper builds a simulation-to-real translation benchmark for comparing methods. It concludes that because different applications impose different content-preservation requirements, this factor should guide the choice of an I2I model for a given application.

链接: https://arxiv.org/abs/2502.08667
作者: Mehran Safayani,Behnaz Mirzapour,Hanieh Aghaebrahimian,Nasrin Salehi,Hamid Ravaee
机构: 未知
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Image-to-image translation (I2I) transforms an image from a source domain to a target domain while preserving source content. Most computer vision applications are in the field of image-to-image translation, such as style transfer, image segmentation, and photo enhancement. The degree of preservation of the content of the source images in the translation process can be different according to the problem and the intended application. From this point of view, in this paper, we divide the different tasks in the field of image-to-image translation into three categories: Fully Content preserving, Partially Content preserving, and Non-Content preserving. We present different tasks, datasets, methods, and results of methods for these three categories in this paper. We make a categorization for I2I methods based on the architecture of different models and study each category separately. In addition, we introduce well-known evaluation criteria in the I2I translation field. Specifically, nearly 70 different I2I models were analyzed, and more than 10 quantitative evaluation metrics and 30 distinct tasks and datasets relevant to the I2I translation problem were both introduced and assessed. Translating from simulation to real images can be viewed as an application of fully content preserving or partially content preserving unsupervised image-to-image translation methods. So, we provide a benchmark for Sim-to-Real translation, which can be used to evaluate different methods. In general, we conclude that because the required extent of content preservation differs across applications, it is better to consider this issue when choosing a suitable I2I model for a specific application.
zh

Artificial Intelligence

[AI-0] Score-of-Mixture Training: Training One-Step Generative Models Made Simple

链接: https://arxiv.org/abs/2502.09609
作者: Tejas Jayashankar,J. Jon Ryu,Gregory Wornell
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 27 pages, 9 figures

点击查看摘要

Abstract:We propose Score-of-Mixture Training (SMT), a novel framework for training one-step generative models by minimizing a class of divergences called the α-skew Jensen-Shannon divergence. At its core, SMT estimates the score of mixture distributions between real and fake samples across multiple noise levels. Similar to consistency models, our approach supports both training from scratch (SMT) and distillation using a pretrained diffusion model, which we call Score-of-Mixture Distillation (SMD). It is simple to implement, requires minimal hyperparameter tuning, and ensures stable training. Experiments on CIFAR-10 and ImageNet 64x64 show that SMT/SMD are competitive with and can even outperform existing methods.
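
For reference, one common parameterization of the α-skew Jensen-Shannon divergence is shown below; the paper's exact skew and weighting convention may differ.

```latex
% One common parameterization, with mixture m_alpha = alpha*p + (1-alpha)*q;
% the paper's exact variant may weight the two KL terms differently.
\[
\mathrm{JS}_{\alpha}(p \,\|\, q)
  = \alpha\,\mathrm{KL}\!\left(p \,\|\, m_{\alpha}\right)
  + (1-\alpha)\,\mathrm{KL}\!\left(q \,\|\, m_{\alpha}\right),
\qquad m_{\alpha} = \alpha p + (1-\alpha) q.
\]
```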

[AI-1] KIMAs: A Configurable Knowledge Integrated Multi-Agent System

链接: https://arxiv.org/abs/2502.09596
作者: Zitao Li,Fei Wei,Yuexiang Xie,Dawei Gao,Weirui Kuang,Zhijian Ma,Bingchen Qian,Yaliang Li,Bolin Ding
类目: Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
备注:

点击查看摘要

Abstract:Knowledge-intensive conversations supported by large language models (LLMs) have become one of the most popular and helpful applications that can assist people in different aspects. Many current knowledge-intensive applications are centered on retrieval-augmented generation (RAG) techniques. While many open-source RAG frameworks facilitate the development of RAG-based applications, they often fall short in handling practical scenarios complicated by heterogeneous data in topics and formats, conversational context management, and the requirement of low-latency response times. This technical report presents a configurable knowledge integrated multi-agent system, KIMAs, to address these challenges. KIMAs features a flexible and configurable system for integrating diverse knowledge sources with 1) context management and query rewrite mechanisms to improve retrieval accuracy and multi-turn conversational coherency, 2) efficient knowledge routing and retrieval, 3) simple but effective filter and reference generation mechanisms, and 4) optimized parallelizable multi-agent pipeline execution. Our work provides a scalable framework for advancing the deployment of LLMs in real-world settings. To show how KIMAs can help developers build knowledge-intensive applications with different scales and emphases, we demonstrate how we configure the system to three applications already running in practice with reliable performance.

[AI-2] MDCrow: Automating Molecular Dynamics Workflows with Large Language Models

链接: https://arxiv.org/abs/2502.09565
作者: Quintina Campbell,Sam Cox,Jorge Medina,Brittany Watterson,Andrew D. White
类目: Artificial Intelligence (cs.AI); Chemical Physics (physics.chem-ph)
备注:

点击查看摘要

Abstract:Molecular dynamics (MD) simulations are essential for understanding biomolecular systems but remain challenging to automate. Recent advances in large language models (LLM) have demonstrated success in automating complex scientific tasks using LLM-based agents. In this paper, we introduce MDCrow, an agentic LLM assistant capable of automating MD workflows. MDCrow uses chain-of-thought over 40 expert-designed tools for handling and processing files, setting up simulations, analyzing the simulation outputs, and retrieving relevant information from literature and databases. We assess MDCrow’s performance across 25 tasks of varying required subtasks and difficulty, and we evaluate the agent’s robustness to both difficulty and prompt style. gpt-4o is able to complete complex tasks with low variance, followed closely by llama3-405b, a compelling open-source model. While prompt style does not influence the best models’ performance, it has significant effects on smaller models.

[AI-3] Diffusion Models for Molecules: A Survey of Methods and Tasks

链接: https://arxiv.org/abs/2502.09511
作者: Liang Wang,Chao Song,Zhiyuan Liu,Yu Rong,Qiang Liu,Shu Wu,Liang Wang
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE)
备注:

点击查看摘要

Abstract:Generative tasks about molecules, including but not limited to molecule generation, are crucial for drug discovery and material design, and have consistently attracted significant attention. In recent years, diffusion models have emerged as an impressive class of deep generative models, sparking extensive research and leading to numerous studies on their application to molecular generative tasks. Despite the proliferation of related work, there remains a notable lack of up-to-date and systematic surveys in this area. Particularly, due to the diversity of diffusion model formulations, molecular data modalities, and generative task types, the research landscape is challenging to navigate, hindering understanding and limiting the area’s growth. To address this, this paper conducts a comprehensive survey of diffusion model-based molecular generative methods. We systematically review the research from the perspectives of methodological formulations, data modalities, and task types, offering a novel taxonomy. This survey aims to facilitate understanding and further flourishing development in this area. The relevant papers are summarized at: this https URL.

[AI-4] AttentionSmithy: A Modular Framework for Rapid Transformer Development and Customization

链接: https://arxiv.org/abs/2502.09503
作者: Caleb Cranney,Jesse G. Meyer
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Transformer architectures have transformed AI applications but remain complex to customize for domain experts lacking low-level implementation expertise. We introduce AttentionSmithy, a modular software package that simplifies transformer innovation by breaking down key components into reusable building blocks: attention modules, feed-forward networks, normalization layers, and positional encodings. Users can rapidly prototype and evaluate transformer variants without extensive coding. Our framework supports four positional encoding strategies and integrates with neural architecture search for automated design. We validate AttentionSmithy by replicating the original transformer under resource constraints and optimizing translation performance by combining positional encodings. Additionally, we demonstrate its adaptability in gene-specific modeling, achieving over 95% accuracy in cell type classification. These case studies highlight AttentionSmithy’s potential to accelerate research across diverse fields by removing framework implementation barriers.
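
The building-block idea can be pictured as a transformer block that takes its components as constructor arguments; a minimal PyTorch sketch, with class and argument names that are illustrative rather than AttentionSmithy's API. Swapping in a different attention or feed-forward module then requires no change to the block itself.

```python
import torch.nn as nn

class ModularBlock(nn.Module):
    """A transformer block assembled from swappable parts: any attention,
    feed-forward, and norm modules with compatible shapes can be dropped in."""
    def __init__(self, attention: nn.Module, feed_forward: nn.Module,
                 norm1: nn.Module, norm2: nn.Module):
        super().__init__()
        self.attention, self.feed_forward = attention, feed_forward
        self.norm1, self.norm2 = norm1, norm2

    def forward(self, x):
        x = x + self.attention(self.norm1(x))        # pre-norm residual attention
        return x + self.feed_forward(self.norm2(x))  # pre-norm residual MLP
```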

[AI-5] PenTest++: Elevating Ethical Hacking with AI and Automation

链接: https://arxiv.org/abs/2502.09484
作者: Haitham S. Al-Sinani,Chris J. Mitchell
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注: 27 pages, 6 figures

点击查看摘要

Abstract:Traditional ethical hacking relies on skilled professionals and time-intensive command management, which limits its scalability and efficiency. To address these challenges, we introduce PenTest++, an AI-augmented system that integrates automation with generative AI (GenAI) to optimise ethical hacking workflows. Developed in a controlled virtual environment, PenTest++ streamlines critical penetration testing tasks, including reconnaissance, scanning, enumeration, exploitation, and documentation, while maintaining a modular and adaptable design. The system balances automation with human oversight, ensuring informed decision-making at key stages, and offers significant benefits such as enhanced efficiency, scalability, and adaptability. However, it also raises ethical considerations, including privacy concerns and the risks of AI-generated inaccuracies (hallucinations). This research underscores the potential of AI-driven systems like PenTest++ to complement human expertise in cybersecurity by automating routine tasks, enabling professionals to focus on strategic decision-making. By incorporating robust ethical safeguards and promoting ongoing refinement, PenTest++ demonstrates how AI can be responsibly harnessed to address operational and ethical challenges in the evolving cybersecurity landscape.

[AI-6] Relational Conformal Prediction for Correlated Time Series

链接: https://arxiv.org/abs/2502.09443
作者: Andrea Cini,Alexander Jenkins,Danilo Mandic,Cesare Alippi,Filippo Maria Bianchi
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:We address the problem of uncertainty quantification in time series forecasting by exploiting observations at correlated sequences. Relational deep learning methods leveraging graph representations are among the most effective tools for obtaining point estimates from spatiotemporal data and correlated time series. However, the problem of exploiting relational structures to estimate the uncertainty of such predictions has been largely overlooked in the same context. To this end, we propose a novel distribution-free approach based on the conformal prediction framework and quantile regression. Despite the recent applications of conformal prediction to sequential data, existing methods operate independently on each target time series and do not account for relationships among them when constructing the prediction interval. We fill this void by introducing a novel conformal prediction method based on graph deep learning operators. Our method, named Conformal Relational Prediction (CoRel), does not require the relational structure (graph) to be known a priori and can be applied on top of any pre-trained time series predictor. Additionally, CoRel includes an adaptive component to handle non-exchangeable data and changes in the input time series. Our approach provides accurate coverage and achieves state-of-the-art uncertainty quantification in relevant benchmarks.
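
CoRel itself adds graph operators and an adaptive component, but its conformal backbone can be illustrated with plain conformalized quantile regression (CQR). A minimal sketch, using stub quantile predictions in place of a trained model:

```python
import numpy as np

def cqr_interval(q_lo_cal, q_hi_cal, y_cal, q_lo_test, q_hi_test, alpha=0.1):
    """Conformalized quantile regression: widen the predicted quantile band
    by the calibration quantile of the nonconformity scores."""
    scores = np.maximum(q_lo_cal - y_cal, y_cal - q_hi_cal)
    n = len(y_cal)
    level = min(1.0, np.ceil((n + 1) * (1 - alpha)) / n)  # finite-sample correction
    q_hat = np.quantile(scores, level)
    return q_lo_test - q_hat, q_hi_test + q_hat

rng = np.random.default_rng(0)
y_cal = rng.normal(size=500)
q_lo, q_hi = np.full(500, -1.5), np.full(500, 1.5)  # stub model quantile outputs
print(cqr_interval(q_lo, q_hi, y_cal, -1.5, 1.5))   # ~90% coverage interval
```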

[AI-7] Variable Stiffness for Robust Locomotion through Reinforcement Learning

链接: https://arxiv.org/abs/2502.09436
作者: Dario Spoljaric,Yashuai Yan,Dongheui Lee
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
*备注: submitted to IFAC Joint Symposia on Mechatronics Robotics

点击查看摘要

Abstract:Reinforcement-learned locomotion enables legged robots to perform highly dynamic motions but is often accompanied by time-consuming manual tuning of joint stiffness. This paper introduces a novel control paradigm that integrates variable stiffness into the action space alongside joint positions, enabling grouped stiffness control such as per-joint stiffness (PJS), per-leg stiffness (PLS) and hybrid joint-leg stiffness (HJLS). We show that variable stiffness policies, with grouping in per-leg stiffness (PLS), outperform position-based control in velocity tracking and push recovery. In contrast, HJLS excels in energy efficiency. Furthermore, our method showcases robust walking behaviour on diverse outdoor terrains via sim-to-real transfer, although the policy is solely trained on a flat floor. Our approach simplifies design by eliminating per-joint stiffness tuning while remaining competitive across various metrics.
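
To make the extended action space concrete, here is a hypothetical decoding of a per-leg stiffness (PLS) action vector for a quadruped; the layout, gain values, and damping rule are all assumptions for illustration, not the paper's implementation:

```python
import numpy as np

N_LEGS, JOINTS_PER_LEG = 4, 3  # assumed quadruped layout

def decode_pls_action(action: np.ndarray):
    """Split one policy action into joint targets plus per-leg stiffness (PLS).

    Assumed layout: first 12 entries are joint position targets, the last 4
    are one stiffness value per leg, broadcast to that leg's joints.
    """
    n_joints = N_LEGS * JOINTS_PER_LEG
    q_target = action[:n_joints]
    leg_stiffness = action[n_joints:n_joints + N_LEGS]
    kp = np.repeat(leg_stiffness, JOINTS_PER_LEG)  # per-joint stiffness
    kd = 2 * 0.7 * np.sqrt(kp)                     # damping from a fixed ratio
    return q_target, kp, kd

q, kp, kd = decode_pls_action(np.concatenate([np.zeros(12), np.full(4, 60.0)]))
q_meas, qd_meas = np.zeros(12), np.zeros(12)       # stub measured joint state
tau = kp * (q - q_meas) - kd * qd_meas             # PD torque command
```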

[AI-8] Dual Formulation for Non-Rectangular Lp Robust Markov Decision Processes

链接: https://arxiv.org/abs/2502.09432
作者: Navdeep Kumar,Adarsh Gupta,Maxence Mohamed Elfatihi,Giorgia Ramponi,Kfir Yehuda Levy,Shie Mannor
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We study robust Markov decision processes (RMDPs) with non-rectangular uncertainty sets, which capture interdependencies across states unlike traditional rectangular models. While non-rectangular robust policy evaluation is generally NP-hard, even in approximation, we identify a powerful class of L_p-bounded uncertainty sets that avoid these complexity barriers due to their structural simplicity. We further show that this class can be decomposed into infinitely many \texttt{sa}-rectangular L_p-bounded sets and leverage its structural properties to derive a novel dual formulation for L_p RMDPs. This formulation provides key insights into the adversary's strategy and enables the development of the first robust policy evaluation algorithms for non-rectangular RMDPs. Empirical results demonstrate that our approach significantly outperforms brute-force methods, establishing a promising foundation for future investigation into non-rectangular robust MDPs.

[AI-9] A Survey of Reinforcement Learning for Optimization in Automation

链接: https://arxiv.org/abs/2502.09417
作者: Ahmad Farooq,Kamran Iqbal
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Neural and Evolutionary Computing (cs.NE); Robotics (cs.RO); Systems and Control (eess.SY)
*备注: 8 pages, 4 tables, and 1 figure. Accepted at IEEE 20th International Conference on Automation Science and Engineering (CASE) 2024

点击查看摘要

Abstract:Reinforcement Learning (RL) has become a critical tool for optimization challenges within automation, leading to significant advancements in several areas. This review article examines the current landscape of RL within automation, with a particular focus on its roles in manufacturing, energy systems, and robotics. It discusses state-of-the-art methods, major challenges, and upcoming avenues of research within each sector, highlighting RL’s capacity to solve intricate optimization challenges. The paper reviews the advantages and constraints of RL-driven optimization methods in automation. It points out prevalent challenges encountered in RL optimization, including issues related to sample efficiency and scalability; safety and robustness; interpretability and trustworthiness; transfer learning and meta-learning; and real-world deployment and integration. It further explores prospective strategies and future research pathways to navigate these challenges. Additionally, the survey includes a comprehensive list of relevant research papers, making it an indispensable guide for scholars and practitioners keen on exploring this domain.

[AI-10] S2-Diffusion: Generalizing from Instance-level to Category-level Skills in Robot Manipulation

链接: https://arxiv.org/abs/2502.09389
作者: Quantao Yang,Michael C. Welle,Danica Kragic,Olov Andersson
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Recent advances in skill learning have propelled robot manipulation to new heights by enabling it to learn complex manipulation tasks from a practical number of demonstrations. However, these skills are often limited to the particular action, object, and environment \textit{instances} that are shown in the training data, and have trouble transferring to other instances of the same category. In this work we present an open-vocabulary Spatial-Semantic Diffusion policy (S^2-Diffusion) which enables generalization from instance-level training data to category-level, enabling skills to be transferable between instances of the same category. We show that functional aspects of skills can be captured via a promptable semantic module combined with a spatial representation. We further propose leveraging depth estimation networks to allow the use of only a single RGB camera. Our approach is evaluated and compared on a diverse number of robot manipulation tasks, both in simulation and in the real world. Our results show that S^2-Diffusion is invariant to changes in category-irrelevant factors and achieves satisfactory performance on other instances within the same category, even if it was not trained on that specific instance. Full videos of all real-world experiments are available in the supplementary material.

[AI-11] TRIFFID: Autonomous Robotic Aid For Increasing First Responders Efficiency

链接: https://arxiv.org/abs/2502.09379
作者: Jorgen Cani,Panagiotis Koletsis,Konstantinos Foteinos,Ioannis Kefaloukos,Lampros Argyriou,Manolis Falelakis,Iván Del Pino,Angel Santamaria-Navarro,Martin Čech,Ondřej Severa,Alessandro Umbrico,Francesca Fracasso,AndreA Orlandini,Dimitrios Drakoulis,Evangelos Markakis,Georgios Th. Papadopoulos
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:The increasing complexity of natural disaster incidents demands innovative technological solutions to support first responders in their efforts. This paper introduces the TRIFFID system, a comprehensive technical framework that integrates unmanned ground and aerial vehicles with advanced artificial intelligence functionalities to enhance disaster response capabilities across wildfires, urban floods, and post-earthquake search and rescue missions. By leveraging state-of-the-art autonomous navigation, semantic perception, and human-robot interaction technologies, TRIFFID provides a sophisticated system composed of the following key components: hybrid robotic platform, centralized ground station, custom communication infrastructure, and smartphone application. The defined research and development activities demonstrate how deep neural networks, knowledge graphs, and multimodal information fusion can enable robots to autonomously navigate and analyze disaster environments, reducing personnel risks and accelerating response times. The proposed system enhances emergency response teams by providing advanced mission planning, safety monitoring, and adaptive task execution capabilities. Moreover, it ensures real-time situational awareness and operational support in complex and risky situations, facilitating rapid and precise information collection and coordinated actions.

[AI-12] A Deep Inverse-Mapping Model for a Flapping Robotic Wing ICLR2025

链接: https://arxiv.org/abs/2502.09378
作者: Hadar Sharvit,Raz Karl,Tsevi Beatus
类目: Artificial Intelligence (cs.AI); Robotics (cs.RO)
*备注: Accepted to ICLR 2025. 10 Pages 5 figures + 2 figures in appendix

点击查看摘要

Abstract:In systems control, the dynamics of a system are governed by modulating its inputs to achieve a desired outcome. For example, to control the thrust of a quad-copter propeller the controller modulates its rotation rate, relying on a straightforward mapping between the input rotation rate and the resulting thrust. This mapping can be inverted to determine the rotation rate needed to generate a desired thrust. However, in complex systems, such as flapping-wing robots where intricate fluid motions are involved, mapping inputs (wing kinematics) to outcomes (aerodynamic forces) is nontrivial and inverting this mapping for real-time control is computationally impractical. Here, we report a machine-learning solution for the inverse mapping of a flapping-wing system based on data from an experimental system we have developed. Our model learns the input wing motion required to generate a desired aerodynamic force outcome. We used a sequence-to-sequence model tailored for time-series data and augmented it with a novel adaptive-spectrum layer that implements representation learning in the frequency domain. To train our model, we developed a flapping wing system that simultaneously measures the wing's aerodynamic force and its 3D motion using high-speed cameras. We demonstrate the performance of our system on an additional open-source dataset of a flapping wing in a different flow regime. Results show superior performance compared with more complex state-of-the-art transformer-based models, with an 11% improvement in median loss on the test datasets. Moreover, our model shows superior inference time, making it practical for onboard robotic control. Our open-source data and framework may improve modeling and real-time control of systems governed by complex dynamics, from biomimetic robots to biomedical devices.
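
The "adaptive-spectrum layer" points to representation learning in the frequency domain. One plausible reading (an assumption on our part; the paper's exact design may differ) is a learnable complex gain applied per frequency bin of the input sequence:

```python
import torch
import torch.nn as nn

class SpectrumLayer(nn.Module):
    """Learned per-frequency reweighting of a time series (illustrative sketch)."""
    def __init__(self, seq_len: int):
        super().__init__()
        n_freq = seq_len // 2 + 1
        # One learnable complex gain per frequency bin, shared over channels.
        self.weight = nn.Parameter(torch.ones(n_freq, dtype=torch.cfloat))

    def forward(self, x):                      # x: (batch, seq_len, channels)
        spec = torch.fft.rfft(x, dim=1)        # to frequency domain
        spec = spec * self.weight.view(1, -1, 1)
        return torch.fft.irfft(spec, n=x.shape[1], dim=1)  # back to time domain

layer = SpectrumLayer(seq_len=128)
y = layer(torch.randn(8, 128, 3))              # e.g. 3D wing-motion channels
```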

[AI-13] Simple Path Structural Encoding for Graph Transformers

链接: https://arxiv.org/abs/2502.09365
作者: Louis Airale,Antonio Longa,Mattia Rigon,Andrea Passerini,Roberto Passerone
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Graph transformers extend global self-attention to graph-structured data, achieving notable success in graph learning. Recently, random walk structural encoding (RWSE) has been found to further enhance their predictive power by encoding both structural and positional information into the edge representation. However, RWSE cannot always distinguish between edges that belong to different local graph patterns, which reduces its ability to capture the full structural complexity of graphs. This work introduces Simple Path Structural Encoding (SPSE), a novel method that utilizes simple path counts for edge encoding. We show theoretically and experimentally that SPSE overcomes the limitations of RWSE, providing a richer representation of graph structures, particularly for capturing local cyclic patterns. To make SPSE computationally tractable, we propose an efficient approximate algorithm for simple path counting. SPSE demonstrates significant performance improvements over RWSE on various benchmarks, including molecular and long-range graph datasets, achieving statistically significant gains in discriminative tasks. These results pose SPSE as a powerful edge encoding alternative for enhancing the expressivity of graph transformers.
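
The quantity at the heart of SPSE, counts of simple paths between node pairs, can be computed exactly on small graphs with networkx (the paper proposes an approximate counting algorithm precisely because exact enumeration does not scale):

```python
import networkx as nx

def simple_path_counts(G: nx.Graph, u, v, max_len: int):
    """Count simple paths between u and v, binned by length in edges
    (the per-edge quantity SPSE encodes; exact version for small graphs)."""
    counts = [0] * (max_len + 1)
    for path in nx.all_simple_paths(G, u, v, cutoff=max_len):
        counts[len(path) - 1] += 1  # a path of k nodes has k-1 edges
    return counts

G = nx.cycle_graph(6)
print(simple_path_counts(G, 0, 3, max_len=5))  # two disjoint routes around the cycle
```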

[AI-14] Neural Spatiotemporal Point Processes: Trends and Challenges

链接: https://arxiv.org/abs/2502.09341
作者: Sumantrak Mukherjee,Mouad Elhamdi,George Mohler,David A. Selby,Yao Xie,Sebastian Vollmer,Gerrit Grossmann
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Spatiotemporal point processes (STPPs) are probabilistic models for events occurring in continuous space and time. Real-world event data often exhibit intricate dependencies and heterogeneous dynamics. By incorporating modern deep learning techniques, STPPs can model these complexities more effectively than traditional approaches. Consequently, the fusion of neural methods with STPPs has become an active and rapidly evolving research area. In this review, we categorize existing approaches, unify key design choices, and explain the challenges of working with this data modality. We further highlight emerging trends and diverse application domains. Finally, we identify open challenges and gaps in the literature.

[AI-15] Graph Diffusion Network for Drug-Gene Prediction

链接: https://arxiv.org/abs/2502.09335
作者: Jiayang Wu,Wensheng Gan,Philip S. Yu
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: IEEE/ACM TCBB. 14 pages

点击查看摘要

Abstract:Predicting drug-gene associations is crucial for drug development and disease treatment. While graph neural networks (GNN) have shown effectiveness in this task, they face challenges with data sparsity and efficient contrastive learning implementation. We introduce a graph diffusion network for drug-gene prediction (GDNDGP), a framework that addresses these limitations through two key innovations. First, it employs meta-path-based homogeneous graph learning to capture drug-drug and gene-gene relationships, ensuring similar entities share embedding spaces. Second, it incorporates a parallel diffusion network that generates hard negative samples during training, eliminating the need for exhaustive negative sample retrieval. Our model achieves superior performance on the DGIdb 4.0 dataset and demonstrates strong generalization capability on tripartite drug-gene-disease networks. Results show significant improvements over existing methods in drug-gene prediction tasks, particularly in handling complex heterogeneous relationships. The source code is publicly available at this https URL.

[AI-16] Predicting Drive Test Results in Mobile Networks Using Optimization Techniques

链接: https://arxiv.org/abs/2502.09305
作者: MohammadJava Taheri,Abolfazl Diyanat,MortezaAli Ahmadi,Ali Nazari
类目: Networking and Internet Architecture (cs.NI); Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
*备注:

点击查看摘要

Abstract:Mobile network operators constantly optimize their networks to ensure superior service quality and coverage. This optimization is crucial for maintaining an optimal user experience and requires extensive data collection and analysis. One of the primary methods for gathering this data is through drive tests, where technical teams use specialized equipment to collect signal information across various regions. However, drive tests are both costly and time-consuming, and they face challenges such as traffic conditions, environmental factors, and limited access to certain areas. These constraints make it difficult to replicate drive tests under similar conditions. In this study, we propose a method that enables operators to predict received signal strength at specific locations using data from other drive test points. By reducing the need for widespread drive tests, this approach allows operators to save time and resources while still obtaining the necessary data to optimize their networks and mitigate the challenges associated with traditional drive tests.
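
The abstract does not detail the optimization techniques used, but the task itself, inferring signal strength at unvisited points from nearby drive-test samples, can be sketched with a simple distance-weighted regression baseline (synthetic coordinates and RSRP values; every number below is made up):

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

# Hypothetical drive-test samples: (lat, lon) -> measured RSRP in dBm.
rng = np.random.default_rng(1)
coords = rng.uniform([35.69, 51.33], [35.75, 51.45], size=(200, 2))
rsrp = (-70 - 30 * np.linalg.norm(coords - coords.mean(0), axis=1) / 0.05
        + rng.normal(0, 3, 200))  # signal decays away from a notional site

# Distance-weighted k-nearest-neighbour interpolation as a baseline predictor.
model = KNeighborsRegressor(n_neighbors=5, weights="distance").fit(coords, rsrp)
print(model.predict([[35.72, 51.40]]))  # estimate where no test was driven
```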

[AI-17] Indeterminacy in Affective Computing: Considering Meaning and Context in Data Collection Practices

链接: https://arxiv.org/abs/2502.09294
作者: Bernd Dudzik,Tiffany Matej Hrkalovic,Chenxu Hao,Chirag Raman,Masha Tsfasman
类目: Artificial Intelligence (cs.AI)
*备注: Accepted at: 12th International Conference on Affective Computing and Intelligent Interaction Workshops and Demos (ACIIW)

点击查看摘要

Abstract:Automatic Affect Prediction (AAP) uses computational analysis of input data such as text, speech, images, and physiological signals to predict various affective phenomena (e.g., emotions or moods). These models are typically constructed using supervised machine-learning algorithms, which rely heavily on labeled training datasets. In this position paper, we posit that all AAP training data are derived from human Affective Interpretation Processes (AIPs), resulting in a form of Affective Meaning. Research on human affect indicates a form of complexity that is fundamental to such meaning: it can possess what we refer to here broadly as Qualities of Indeterminacy (QIs) - encompassing Subjectivity (meaning depends on who is interpreting), Uncertainty (lack of confidence regarding meanings' correctness), Ambiguity (meaning contains mutually exclusive concepts) and Vagueness (meaning is situated at different levels in a nested hierarchy). Failing to appropriately consider QIs leads to results incapable of meaningful and reliable predictions. Based on this premise, we argue that a crucial step in adequately addressing indeterminacy in AAP is the development of data collection practices for modeling corpora that involve the systematic consideration of 1) a relevant set of QIs and 2) context for the associated interpretation processes. To this end, we 1) outline a conceptual model of AIPs and the QIs associated with the meaning they produce, together with a conceptual structure of relevant context that supports understanding of its role. Finally, we 2) use our framework to discuss examples of context-sensitivity-related challenges for addressing QIs in data collection setups. We believe our efforts can stimulate a structured discussion of both the role of aspects of indeterminacy and context in research on AAP, informing the development of better practices for data collection and analysis.

[AI-18] LiSA: Leveraging Link Recommender to Attack Graph Neural Networks via Subgraph Injection

链接: https://arxiv.org/abs/2502.09271
作者: Wenlun Zhang,Enyan Dai,Kentaro Yoshioka
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Graph Neural Networks (GNNs) have demonstrated remarkable proficiency in modeling data with graph structures, yet recent research reveals their susceptibility to adversarial attacks. Traditional attack methodologies, which rely on manipulating the original graph or adding links to artificially created nodes, often prove impractical in real-world settings. This paper introduces a novel adversarial scenario involving the injection of an isolated subgraph to deceive both the link recommender and the node classifier within a GNN system. Specifically, the link recommender is misled into proposing links between targeted victim nodes and the subgraph, encouraging users to unintentionally establish connections that degrade node classification accuracy, thereby facilitating a successful attack. To address this, we present the LiSA framework, which employs a dual surrogate model and bi-level optimization to simultaneously meet two adversarial objectives. Extensive experiments on real-world datasets demonstrate the effectiveness of our method.

[AI-19] Bandit Multiclass List Classification

链接: https://arxiv.org/abs/2502.09257
作者: Liad Erez,Tomer Koren
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:We study the problem of multiclass list classification with (semi-)bandit feedback, where input examples are mapped into subsets of size m of a collection of K possible labels, and the feedback consists of the predicted labels which lie in the set of true labels of the given example. Our main result is for the (\varepsilon,\delta)-PAC variant of the problem for which we design an algorithm that returns an \varepsilon-optimal hypothesis with high probability using a sample complexity of O\big((\mathrm{poly}(K/m) + sm/\varepsilon^2) \log(|H|/\delta)\big) where H is the underlying (finite) hypothesis class and s is an upper bound on the number of true labels for a given example. This bound improves upon known bounds for combinatorial semi-bandits whenever s \ll K. Moreover, in the regime where s = O(1) the leading terms in our bound match the corresponding full-information rates, implying that bandit feedback essentially comes at no cost. Our PAC learning algorithm is also computationally efficient given access to an ERM oracle for H. Additionally, we consider the regret minimization setting where data can be generated adversarially, and establish a regret bound of \widetilde{O}(|H| + \sqrt{smT \log |H|}). Our results generalize and extend those of Erez et al. (2024) who consider the simpler single-label setting corresponding to s=m=1, and in fact hold for the more general contextual combinatorial semi-bandit problem with s-sparse rewards.

[AI-20] AnomalyGFM: Graph Foundation Model for Zero/Few-shot Anomaly Detection

链接: https://arxiv.org/abs/2502.09254
作者: Hezhe Qiao,Chaoxi Niu,Ling Chen,Guansong Pang
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: 14 pages

点击查看摘要

Abstract:Graph anomaly detection (GAD) aims to identify abnormal nodes that differ from the majority of the nodes in a graph, which has been attracting significant attention in recent years. Existing generalist graph models have achieved remarkable success in different graph tasks but struggle to generalize to the GAD task. This limitation arises from their difficulty in learning generalized knowledge for capturing the inherently infrequent, irregular and heterogeneous abnormality patterns in graphs from different domains. To address this challenge, we propose AnomalyGFM, a GAD-oriented graph foundation model that supports zero-shot inference and few-shot prompt tuning for GAD in diverse graph datasets. One key insight is that graph-agnostic representations for normal and abnormal classes are required to support effective zero/few-shot GAD across different graphs. Motivated by this, AnomalyGFM is pre-trained to align data-independent, learnable normal and abnormal class prototypes with node representation residuals (i.e., representation deviation of a node from its neighbors). The residual features essentially project the node information into a unified feature space where we can effectively measure the abnormality of nodes from different graphs in a consistent way. This provides a driving force for the learning of graph-agnostic, discriminative prototypes for the normal and abnormal classes, which can be used to enable zero-shot GAD on new graphs, including very large-scale graphs. If there are few-shot labeled normal nodes available in the new graphs, AnomalyGFM can further support prompt tuning to leverage these nodes for better adaptation. Comprehensive experiments on 11 widely-used GAD datasets with real anomalies demonstrate that AnomalyGFM significantly outperforms state-of-the-art competing methods under both zero- and few-shot GAD settings.
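
The key quantity, a node's representation residual relative to its neighbors, together with prototype-based scoring, can be sketched in a few lines (toy embeddings and random prototypes; the real model learns both end to end):

```python
import numpy as np

def residual_features(H: np.ndarray, A: np.ndarray) -> np.ndarray:
    """Node representation minus mean neighbor representation: the
    'representation residual' that prototypes are aligned against."""
    deg = A.sum(1, keepdims=True).clip(min=1)
    return H - (A @ H) / deg

def anomaly_scores(R, proto_normal, proto_abnormal):
    """Score nodes by relative cosine similarity to the two class prototypes."""
    def cos(X, p):
        return (X @ p) / (np.linalg.norm(X, axis=1) * np.linalg.norm(p) + 1e-9)
    return cos(R, proto_abnormal) - cos(R, proto_normal)

rng = np.random.default_rng(0)
H = rng.normal(size=(6, 8))                   # toy node embeddings
A = (rng.random((6, 6)) < 0.4).astype(float)  # toy adjacency matrix
scores = anomaly_scores(residual_features(H, A),
                        proto_normal=rng.normal(size=8),
                        proto_abnormal=rng.normal(size=8))
```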

[AI-21] From large language models to multimodal AI: A scoping review on the potential of generative AI in medicine

链接: https://arxiv.org/abs/2502.09242
作者: Lukas Buess,Matthias Keicher,Nassir Navab,Andreas Maier,Soroosh Tayebi Arasteh
类目: Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Generative artificial intelligence (AI) models, such as diffusion models and OpenAI’s ChatGPT, are transforming medicine by enhancing diagnostic accuracy and automating clinical workflows. The field has advanced rapidly, evolving from text-only large language models for tasks such as clinical documentation and decision support to multimodal AI systems capable of integrating diverse data modalities, including imaging, text, and structured data, within a single model. The diverse landscape of these technologies, along with rising interest, highlights the need for a comprehensive review of their applications and potential. This scoping review explores the evolution of multimodal AI, highlighting its methods, applications, datasets, and evaluation in clinical settings. Adhering to PRISMA-ScR guidelines, we systematically queried PubMed, IEEE Xplore, and Web of Science, prioritizing recent studies published up to the end of 2024. After rigorous screening, 144 papers were included, revealing key trends and challenges in this dynamic field. Our findings underscore a shift from unimodal to multimodal approaches, driving innovations in diagnostic support, medical report generation, drug discovery, and conversational AI. However, critical challenges remain, including the integration of heterogeneous data types, improving model interpretability, addressing ethical concerns, and validating AI systems in real-world clinical settings. This review summarizes the current state of the art, identifies critical gaps, and provides insights to guide the development of scalable, trustworthy, and clinically impactful multimodal AI solutions in healthcare.

[AI-22] Hybrid Answer Set Programming: Foundations and Applications

链接: https://arxiv.org/abs/2502.09235
作者: Nicolas Rühling
类目: Artificial Intelligence (cs.AI); Logic in Computer Science (cs.LO)
*备注: In Proceedings ICLP 2024, arXiv:2502.08453

点击查看摘要

Abstract:Answer Set Programming (ASP) is a powerful tool for solving real-world problems. However, many problems involve numeric values and complex constraints beyond the capabilities of standard ASP solvers. Hybrid solvers like CLINGCON and CLINGO[DL] address this by using specialized methods for specific constraints. However, these solvers lack a strong theoretical foundation. This issue has first been addressed by introducing the Logic of Here-and-There with constraints (HT_c) as an extension of the Logic of Here-and-There (HT) and its non-monotone extension Equilibrium Logic. Nowadays, HT serves as a logical foundation for ASP and has facilitated a broader understanding of this paradigm. The idea is that HT_c (and other extensions) play an analogous role for hybrid ASP. There remain many open questions about these logics regarding their fundamental characteristics as well as their practical use in solvers, i.e., how they can guide the implementation. Having a formal understanding of these hybrid logics is also needed to better understand the inherent structure of the (real-world) problems they are applied to and to improve their representations in ASP. As an example of an application of ASP we use product configuration.

[AI-23] Commonsense Reasoning-Aided Autonomous Vehicle Systems

链接: https://arxiv.org/abs/2502.09233
作者: Keegan Kimbrell(University of Texas at Dallas)
类目: Artificial Intelligence (cs.AI)
*备注: In Proceedings ICLP 2024, arXiv:2502.08453

点击查看摘要

Abstract:Autonomous Vehicle (AV) systems have been developed with a strong reliance on machine learning techniques. While machine learning approaches, such as deep learning, are extremely effective at tasks that involve observation and classification, they struggle when it comes to performing higher level reasoning about situations on the road. This research involves incorporating commonsense reasoning models that use image data to improve AV systems. This will allow AV systems to perform more accurate reasoning while also making them more adjustable, explainable, and ethical. This paper will discuss the findings so far and motivate its direction going forward.

[AI-24] Logical foundations of Smart Contracts

链接: https://arxiv.org/abs/2502.09232
作者: Kalonji Kalala(University of Ottawa)
类目: Logic in Computer Science (cs.LO); Artificial Intelligence (cs.AI)
*备注: In Proceedings ICLP 2024, arXiv:2502.08453

点击查看摘要

Abstract:Nowadays, sophisticated domains are emerging which require appropriate formalisms to be specified accurately in order to reason about them. One such domain is constituted of smart contracts that have emerged in cyber physical systems as a way of enforcing formal agreements between components of these systems. Smart contracts self-execute to run and share business processes through blockchain, in decentralized systems, with many different participants. Legal contracts are in many cases complex documents, with a number of exceptions, and many subcontracts. The implementation of smart contracts based on legal contracts is a long and laborious task, that needs to include all actions, procedures, and the effects of actions related to the execution of the contract. An ongoing open problem in this area is to formally account for smart contracts using a uniform and somewhat universal formalism. This thesis proposes logical foundations to smart contracts using the Situation Calculus, a logic for reasoning about actions. Situation Calculus is one of the prominent logic-based artificial intelligence approaches that provides enough logical mechanism to specify and implement dynamic and complex systems such as contracts. Situation Calculus is suitable to show how worlds dynamically change. Smart contracts will be implemented with Golog (written in Prolog), a Situation Calculus-based programming language for modeling complex and dynamic behaviors.

[AI-25] Relating Answer Set Programming and Many-sorted Logics for Formal Verification

链接: https://arxiv.org/abs/2502.09230
作者: Zachary Hansen(University of Nebraska Omaha)
类目: Logic in Computer Science (cs.LO); Artificial Intelligence (cs.AI); Programming Languages (cs.PL)
*备注: In Proceedings ICLP 2024, arXiv:2502.08453

点击查看摘要

Abstract:Answer Set Programming (ASP) is an important logic programming paradigm within the field of Knowledge Representation and Reasoning. As a concise, human-readable, declarative language, ASP is an excellent tool for developing trustworthy (especially, artificially intelligent) software systems. However, formally verifying ASP programs offers some unique challenges, such as 1. a lack of modularity (the meanings of rules are difficult to define in isolation from the enclosing program), 2. the ground-and-solve semantics (the meanings of rules are dependent on the input data with which the program is grounded), and 3. limitations of existing tools. My research agenda has been focused on addressing these three issues with the intention of making ASP verification an accessible, routine task that is regularly performed alongside program development. In this vein, I have investigated alternative semantics for ASP based on translations into the logic of here-and-there and many-sorted first-order logic. These semantics promote a modular understanding of logic programs, bypass grounding, and enable us to use automated theorem provers to automatically verify properties of programs.

[AI-26] Computational methods for Dynamic Answer Set Programming

链接: https://arxiv.org/abs/2502.09228
作者: Susana Hahn(University of Potsdam, Germany)
类目: Artificial Intelligence (cs.AI); Formal Languages and Automata Theory (cs.FL); Logic in Computer Science (cs.LO)
*备注: In Proceedings ICLP 2024, arXiv:2502.08453

点击查看摘要

Abstract:In our daily lives and industrial settings, we often encounter dynamic problems that require reasoning over time and metric constraints. These include tasks such as scheduling, routing, and production sequencing. Dynamic logics have traditionally addressed these needs but often lack the flexibility and integration required for comprehensive problem modeling. This research aims to extend Answer Set Programming (ASP), a powerful declarative problem-solving approach, to handle dynamic domains effectively. By integrating concepts from dynamic, temporal, and metric logics into ASP, we seek to develop robust systems capable of modeling complex dynamic problems and performing efficient reasoning tasks, thereby enhancing ASP's applicability in industrial contexts.

[AI-27] Generating Causally Compliant Counterfactual Explanations using ASP

链接: https://arxiv.org/abs/2502.09226
作者: Sopam Dasgupta(The University of Texas at Dallas, USA)
类目: Artificial Intelligence (cs.AI)
*备注: In Proceedings ICLP 2024, arXiv:2502.08453

点击查看摘要

Abstract:This research is focused on generating achievable counterfactual explanations. Given a negative outcome computed by a machine learning model or a decision system, the novel CoGS approach generates (i) a counterfactual solution that represents a positive outcome and (ii) a path that will take us from the negative outcome to the positive one, where each node in the path represents a change in an attribute (feature) value. CoGS computes paths that respect the causal constraints among features. Thus, the counterfactuals computed by CoGS are realistic. CoGS utilizes rule-based machine learning algorithms to model causal dependencies between features. The paper discusses the current status of the research and the preliminary results obtained.

[AI-28] Order-Sorted Intensional Logic: Expressing Subtyping Polymorphism with Typing Assertions and Quantification over Concepts

链接: https://arxiv.org/abs/2502.09224
作者: Đorđe Marković,Marc Denecker
类目: Artificial Intelligence (cs.AI); Logic in Computer Science (cs.LO)
*备注: In Proceedings ICLP 2024, arXiv:2502.08453

点击查看摘要

Abstract:Subtyping, also known as subtype polymorphism, is a concept extensively studied in programming language theory, delineating the substitutability relation among datatypes. This property ensures that programs designed for supertype objects remain compatible with their subtypes. In this paper, we explore the capability of order-sorted logic for utilizing these ideas in the context of Knowledge Representation. We recognize two fundamental limitations: First, the inability of this logic to address the concept rather than the value of non-logical symbols, and second, the lack of language constructs for constraining the type of terms. Consequently, we propose guarded order-sorted intensional logic, where guards are language constructs for annotating typing information and intensional logic provides support for quantification over concepts.

[AI-29] ASP-driven User-interaction with Clinguin

链接: https://arxiv.org/abs/2502.09222
作者: Alexander Beiser,Susana Hahn,Torsten Schaub
类目: Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Logic in Computer Science (cs.LO); Software Engineering (cs.SE)
*备注: In Proceedings ICLP 2024, arXiv:2502.08453

点击查看摘要

Abstract:We present clinguin, a system for ASP-driven user interface design. Clinguin streamlines the development of user interfaces for ASP developers by letting them build interactive prototypes directly in ASP, eliminating the need for separate frontend languages. To this end, clinguin uses a few dedicated predicates to define user interfaces and the treatment of user-triggered events. This simple design greatly facilitates the specification of user interactions with an ASP system, in our case clingo.

[AI-30] Pearce's Characterisation in an Epistemic Domain

链接: https://arxiv.org/abs/2502.09221
作者: Ezgi Iraz Su(Sinop University)
类目: Artificial Intelligence (cs.AI); Logic in Computer Science (cs.LO); Programming Languages (cs.PL)
*备注: In Proceedings ICLP 2024, arXiv:2502.08453

点击查看摘要

Abstract:Answer-set programming (ASP) is a successful problem-solving approach in logic-based AI. In ASP, problems are represented as declarative logic programs, and solutions are identified through their answer sets. Equilibrium logic (EL) is a general-purpose nonmonotonic reasoning formalism, based on a monotonic logic called here-and-there logic. EL was originally proposed by Pearce as a foundational framework of ASP. Epistemic specifications (ES) are extensions of ASP-programs with subjective literals. These new modal constructs in the ASP-language make it possible to check whether a regular literal of ASP is true in every (or some) answer-set of a program. ES-programs are interpreted by world-views, which are essentially collections of answer-sets. (Reflexive) autoepistemic logic is a nonmonotonic formalism, modeling self-belief (knowledge) of ideally rational agents. A relatively new semantics for ES is based on a combination of EL and (reflexive) autoepistemic logic. In this paper, we first propose an overarching framework in the epistemic ASP domain. We then establish a correspondence between existing (reflexive) (auto)epistemic equilibrium logics and our easily-adaptable comprehensive framework, building on Pearce's characterisation of answer-sets as equilibrium models. We achieve this by extending Ferraris' work on answer sets for propositional theories to the epistemic case and reveal the relationship between some ES-semantic proposals.

[AI-31] Graphical Conditions for the Existence, Unicity and Number of Regular Models

链接: https://arxiv.org/abs/2502.09220
作者: Van-Giang Trinh(LIRICA team, LIS, Aix-Marseille University, Marseille, France),Belaid Benhamou(LIRICA team, LIS, Aix-Marseille University, Marseille, France),Sylvain Soliman(Inria Saclay, EP Lifeware, Palaiseau, France),François Fages(Inria Saclay, EP Lifeware, Palaiseau, France)
类目: Logic in Computer Science (cs.LO); Artificial Intelligence (cs.AI); Discrete Mathematics (cs.DM)
*备注: In Proceedings ICLP 2024, arXiv:2502.08453

点击查看摘要

Abstract:The regular models of a normal logic program are a particular type of partial (i.e. 3-valued) models which correspond to stable partial models with minimal undefinedness. In this paper, we explore graphical conditions on the dependency graph of a finite ground normal logic program to analyze the existence, unicity and number of regular models for the program. We show three main results: 1) a necessary condition for the existence of non-trivial (i.e. non-2-valued) regular models, 2) a sufficient condition for the unicity of regular models, and 3) two upper bounds for the number of regular models based on positive feedback vertex sets. The first two conditions generalize the finite cases of the two existing results obtained by You and Yuan (1994) for normal logic programs with well-founded stratification. The third result is also new to the best of our knowledge. Key to our proofs is a connection that we establish between finite ground normal logic programs and Boolean network theory.

[AI-32] Abduction of Domain Relationships from Data for VQA

链接: https://arxiv.org/abs/2502.09219
作者: Al Mehdi Saadat Chowdhury,Paulo Shakarian,Gerardo I. Simari
类目: Logic in Computer Science (cs.LO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: In Proceedings ICLP 2024, arXiv:2502.08453

点击查看摘要

Abstract:In this paper, we study the problem of visual question answering (VQA) where the image and query are represented by ASP programs that lack domain data. We provide an approach that is orthogonal and complementary to existing knowledge augmentation techniques where we abduce domain relationships of image constructs from past examples. After framing the abduction problem, we provide a baseline approach, and an implementation that significantly improves the accuracy of query answering yet requires few examples.

[AI-33] Data2Concept2Text: An Explainable Multilingual Framework for Data Analysis Narration

链接: https://arxiv.org/abs/2502.09218
作者: Flavio Bertini(UNIPR),Alessandro Dal Palù(UNIPR),Federica Zaglio(UNIPR),Francesco Fabiano(NMSU),Andrea Formisano(UNIUD)
类目: Logic in Computer Science (cs.LO); Artificial Intelligence (cs.AI)
*备注: In Proceedings ICLP 2024, arXiv:2502.08453

点击查看摘要

Abstract:This paper presents a complete explainable system that interprets a set of data, abstracts the underlying features and describes them in a natural language of choice. The system relies on two crucial stages: (i) identifying emerging properties from data and transforming them into abstract concepts, and (ii) converting these concepts into natural language. Despite the impressive natural language generation capabilities demonstrated by Large Language Models, their statistical nature and the intricacy of their internal mechanism still force us to employ these techniques as black boxes, forgoing trustworthiness. Developing an explainable pipeline for data interpretation would facilitate its use in safety-critical environments like processing medical information, and would allow non-experts and visually impaired people to access narrated information. To this end, we believe that the fields of knowledge representation and automated reasoning research could present a valid alternative. Expanding on prior research that tackled the first stage (i), we focus on the second stage, named Concept2Text. Being explainable, data translation is easily modeled through logic-based rules, once again emphasizing the role of declarative programming in achieving AI explainability. This paper explores a Prolog/CLP-based rewriting system to interpret concepts (articulated in terms of classes and relations, plus common knowledge) derived from a generic ontology, generating natural language text. Its main features include hierarchical tree rewritings, modular multilingual generation, support for equivalent variants across semantic, grammar, and lexical levels, and a transparent rule-based system. We outline the architecture and demonstrate its flexibility through some examples capable of generating numerous diverse and equivalent rewritings based on the input concept.

[AI-34] Architecture for Simulating Behavior Mode Changes in Norm-Aware Autonomous Agents

链接: https://arxiv.org/abs/2502.09215
作者: Sean Glaze(Miami University),Daniela Inclezan(Miami University)
类目: Logic in Computer Science (cs.LO); Artificial Intelligence (cs.AI)
*备注: In Proceedings ICLP 2024, arXiv:2502.08453

点击查看摘要

Abstract:This paper presents an architecture for simulating the actions of a norm-aware intelligent agent whose behavior with respect to norm compliance is set, and can later be changed, by a human controller. Updating an agent’s behavior mode from a norm-abiding to a riskier one may be relevant when the agent is involved in time-sensitive rescue operations, for example. We base our work on the Authorization and Obligation Policy Language AOPL designed by Gelfond and Lobo for the specification of norms. We introduce an architecture and a prototype software system that can be used to simulate an agent’s plans under different behavior modes that can later be changed by the controller. We envision such software to be useful to policy makers, as they can more readily understand how agents may act in certain situations based on the agents’ attitudes towards norm-compliance. Policy makers may then refine their policies if simulations show unwanted consequences.

[AI-35] On LLM-generated Logic Programs and their Inference Execution Methods

链接: https://arxiv.org/abs/2502.09209
作者: Paul Tarau(University of North Texas)
类目: Artificial Intelligence (cs.AI)
*备注: In Proceedings ICLP 2024, arXiv:2502.08453

点击查看摘要

Abstract:Large Language Models (LLMs) trained on petabytes of data are highly compressed repositories of a significant proportion of the knowledge accumulated and distilled so far. In this paper we study techniques to elicit this knowledge in the form of several classes of logic programs, including propositional Horn clauses, Dual Horn clauses, relational triplets and Definite Clause Grammars. Exposing this knowledge as logic programs enables sound reasoning methods that can verify alignment of LLM outputs to their intended uses and extend their inference capabilities. We study new execution methods for the generated programs, including soft-unification of abducible facts against LLM-generated content stored in a vector database as well as GPU-based acceleration of minimal model computation that supports inference with large LLM-generated programs.
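
One of the proposed execution methods, soft-unification of abducible facts against embedded LLM-generated content, reduces to a similarity lookup. A minimal sketch with stand-in embedding vectors (a real system would query a vector database):

```python
import numpy as np

def soft_unify(query_vec, fact_vecs, facts, threshold=0.8):
    """Match an abducible fact to stored content by embedding similarity
    instead of exact symbol equality (sketch of the soft-unification idea)."""
    sims = fact_vecs @ query_vec / (
        np.linalg.norm(fact_vecs, axis=1) * np.linalg.norm(query_vec) + 1e-9)
    best = int(np.argmax(sims))
    if sims[best] >= threshold:
        return facts[best], float(sims[best])
    return None, 0.0  # no fact unifies softly enough

facts = ["parent(alice, bob)", "owns(bob, car)"]
vecs = np.array([[0.9, 0.1, 0.0], [0.1, 0.8, 0.3]])  # stand-ins for embeddings
print(soft_unify(np.array([0.85, 0.15, 0.05]), vecs, facts))
```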

[AI-36] Efficient OWL2QL Meta-reasoning Using ASP-based Hybrid Knowledge Bases

链接: https://arxiv.org/abs/2502.09206
作者: Haya Majid Qureshi(University of Klagenfurt),Wolfgang Faber(University of Klagenfurt)
类目: Logic in Computer Science (cs.LO); Artificial Intelligence (cs.AI); Symbolic Computation (cs.SC)
*备注: In Proceedings ICLP 2024, arXiv:2502.08453

点击查看摘要

Abstract:Metamodeling refers to scenarios in ontologies in which classes and roles can be members of classes or occur in roles. This is a desirable modelling feature in several applications, but allowing it without restrictions is problematic for several reasons, mainly because it causes undecidability. Therefore, practical languages either forbid metamodeling explicitly or treat occurrences of classes as instances to be semantically different from other occurrences, thereby not allowing metamodeling semantically. Several extensions have been proposed to provide metamodeling to some extent. Building on earlier work that reduces metamodeling query answering to Datalog query answering, reductions to query answering over hybrid knowledge bases were recently proposed with the aim of using the Datalog transformation only where necessary. Preliminary work showed that the approach works, but the hoped-for performance improvements were not observed yet. In this work we expand on this body of work by improving the theoretical basis of the reductions and by using alternative tools that show competitive performance.

[AI-37] Counterfactual Explanations as Plans

链接: https://arxiv.org/abs/2502.09205
作者: Vaishak Belle(University of Edinburgh)
类目: Artificial Intelligence (cs.AI); Logic in Computer Science (cs.LO)
*备注: In Proceedings ICLP 2024, arXiv:2502.08453

点击查看摘要

Abstract:There has been considerable recent interest in explainability in AI, especially with black-box machine learning models. As correctly observed by the planning community, when the application at hand is not a single-shot decision or prediction, but a sequence of actions that depend on observations, a richer notion of explanation is desirable. In this paper, we look to provide a formal account of "counterfactual explanations", based in terms of action sequences. We then show that this naturally leads to an account of model reconciliation, which might take the form of the user correcting the agent's model, or suggesting actions to the agent's plan. For this, we will need to articulate what is true versus what is known, and we appeal to a modal fragment of the situation calculus to formalise these intuitions. We consider various settings: the agent knowing partial truths, weakened truths and having false beliefs, and show that our definitions easily generalize to these different settings.

[AI-38] Logical Lease Litigation: Prolog and LLMs for Rental Law Compliance in New York

链接: https://arxiv.org/abs/2502.09204
作者: Sanskar Sehgal,Yanhong A. Liu
类目: Artificial Intelligence (cs.AI); Logic in Computer Science (cs.LO)
*备注: In Proceedings ICLP 2024, arXiv:2502.08453

点击查看摘要

Abstract:Legal cases require careful logical reasoning following the laws, whereas interactions with non-technical users must be in natural language. As an application combining logical reasoning using Prolog and natural language processing using large language models (LLMs), this paper presents a novel approach and system, LogicLease, to automate the analysis of landlord-tenant legal cases in the state of New York. LogicLease determines compliance with relevant legal requirements by analyzing case descriptions and citing all relevant laws. It leverages LLMs for information extraction and Prolog for legal reasoning. By separating information extraction from legal reasoning, LogicLease achieves greater transparency and control over the legal logic applied to each case. We evaluate the accuracy, efficiency, and robustness of LogicLease through a series of tests, achieving 100% accuracy and an average processing time of 2.57 seconds. LogicLease presents advantages over state-of-the-art LLM-based legal analysis systems by providing clear, step-by-step reasoning, citing specific laws, and distinguishing itself by its ability to avoid hallucinations - a common issue in LLMs.
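
The extraction/reasoning split can be sketched with pyswip, a Python bridge to SWI-Prolog (which must be installed locally). The rule and facts below are invented for illustration; they are not LogicLease's actual rule base and not New York law:

```python
from pyswip import Prolog  # pip install pyswip; requires SWI-Prolog

prolog = Prolog()
# Toy compliance rule (invented): a case is compliant if the security
# deposit is at most one month's rent.
prolog.assertz("compliant(Case) :- deposit_months(Case, M), M =< 1")

# Stage 1 (stubbed): an LLM would extract structured facts from the case text.
extracted_facts = ["deposit_months(case42, 1)"]
for fact in extracted_facts:
    prolog.assertz(fact)

# Stage 2: Prolog performs the legal reasoning over the extracted facts.
print(bool(list(prolog.query("compliant(case42)"))))  # True
```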

[AI-39] Two-Stage Representation Learning for Analyzing Movement Behavior Dynamics in People Living with Dementia AAAI2025 ALT

链接: https://arxiv.org/abs/2502.09173
作者: Jin Cui,Alexander Capstick,Payam Barnaghi,Gregory Scott
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: AAAI 2025 Workshop on Large Language Models and Generative AI for Health

点击查看摘要

Abstract:In remote healthcare monitoring, time series representation learning reveals critical patient behavior patterns from high-frequency data. This study analyzes home activity data from individuals living with dementia by proposing a two-stage, self-supervised learning approach tailored to uncover low-rank structures. The first stage converts time-series activities into text sequences encoded by a pre-trained language model, providing a rich, high-dimensional latent state space using a PageRank-based method. This PageRank vector captures latent state transitions, effectively compressing complex behaviour data into a succinct form that enhances interpretability. This low-rank representation not only enhances model interpretability but also facilitates clustering and transition analysis, revealing key behavioral patterns correlated with clinical metrics such as MMSE and ADAS-COG scores. Our findings demonstrate the framework's potential in supporting cognitive status prediction, personalized care interventions, and large-scale health monitoring.
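
The PageRank idea, compressing a day of discretised activity states into one vector over the state-transition graph, can be sketched with networkx (state names and the transition-weighting scheme are simplifications of the paper's pipeline):

```python
import networkx as nx

def pagerank_vector(day_events, states):
    """Compress a sequence of activity states into a PageRank vector
    over the observed state-transition graph (simplified illustration)."""
    G = nx.DiGraph()
    G.add_nodes_from(states)
    for a, b in zip(day_events, day_events[1:]):
        w = G[a][b]["weight"] + 1 if G.has_edge(a, b) else 1
        G.add_edge(a, b, weight=w)  # count each observed transition
    pr = nx.pagerank(G, weight="weight")
    return [pr[s] for s in states]

states = ["kitchen", "bedroom", "bathroom", "lounge"]
day = ["bedroom", "bathroom", "kitchen", "lounge", "kitchen", "bedroom"]
print(pagerank_vector(day, states))  # one interpretable vector per day
```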

[AI-40] One-shot Federated Learning Methods: A Practical Guide

链接: https://arxiv.org/abs/2502.09104
作者: Xiang Liu,Zhenheng Tang,Xia Li,Yijun Song,Sijie Ji,Zemin Liu,Bo Han,Linshan Jiang,Jialin Li
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: 10 pages, 1 figure

点击查看摘要

Abstract:One-shot Federated Learning (OFL) is a distributed machine learning paradigm that constrains client-server communication to a single round, addressing privacy and communication overhead issues associated with multiple rounds of data exchange in traditional Federated Learning (FL). OFL demonstrates the practical potential for integration with future approaches that require collaborative training models, such as large language models (LLMs). However, current OFL methods face two major challenges: data heterogeneity and model heterogeneity, which result in subpar performance compared to conventional FL methods. Worse still, despite numerous studies addressing these limitations, a comprehensive summary is still lacking. To address these gaps, this paper presents a systematic analysis of the challenges faced by OFL and thoroughly reviews the current methods. We also offer an innovative categorization method and analyze the trade-offs of various techniques. Additionally, we discuss the most promising future directions and the technologies that should be integrated into the OFL field. This work aims to provide guidance and insights for future research.
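
The baseline that every OFL method is measured against is a single round of size-weighted parameter averaging; a minimal sketch (many surveyed methods replace this step with distillation or generative data synthesis):

```python
import numpy as np

def one_shot_aggregate(client_weights, client_sizes):
    """Single-round, size-weighted parameter averaging: the simplest
    one-shot FL aggregation (illustrative baseline only)."""
    total = sum(client_sizes)
    return [
        sum(w[i] * n / total for w, n in zip(client_weights, client_sizes))
        for i in range(len(client_weights[0]))
    ]

# Two clients, each holding one weight matrix and one bias vector.
w_a = [np.ones((3, 3)), np.zeros(3)]
w_b = [np.zeros((3, 3)), np.ones(3)]
global_model = one_shot_aggregate([w_a, w_b], client_sizes=[100, 300])
```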

[AI-41] Exploring the Needs of Practising Musicians in Co-Creative AI Through Co-Design

链接: https://arxiv.org/abs/2502.09055
作者: Stephen James Krol,Maria Teresa Llano Rodriguez,Miguel Loor Paredes
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
*备注: Paper accepted into CHI 2025, Yokohama Japan, April 26th - May 1st

点击查看摘要

Abstract:Recent advances in generative AI music have resulted in new technologies that are being framed as co-creative tools for musicians with early work demonstrating their potential to add to music practice. While the field has seen many valuable contributions, work that involves practising musicians in the design and development of these tools is limited, with the majority of work including them only once a tool has been developed. In this paper, we present a case study that explores the needs of practising musicians through the co-design of a musical variation system, highlighting the importance of involving a diverse range of musicians throughout the design process and uncovering various design insights. This was achieved through two workshops and a two week ecological evaluation, where musicians from different musical backgrounds offered valuable insights not only on a musical system’s design but also on how a musical AI could be integrated into their musical practices.

[AI-42] Cost-Saving LLM Cascades with Early Abstention

链接: https://arxiv.org/abs/2502.09054
作者: Michael J. Zellinger,Rex Liu,Matt Thomson
类目: Artificial Intelligence (cs.AI)
*备注: 6 pages, 1 figure

点击查看摘要

Abstract:LLM cascades are based on the idea that processing all queries with the largest and most expensive LLMs is inefficient. Instead, cascades deploy small LLMs to answer the majority of queries, limiting the use of large and expensive LLMs to only the most difficult queries. This approach can significantly reduce costs without impacting performance. However, risk-sensitive domains such as finance or medicine place an additional premium on avoiding model errors. Recognizing that even the most expensive models may make mistakes, applications in these domains benefit from allowing LLM systems to completely abstain from answering a query when the chance of making a mistake is significant. However, giving a cascade the ability to abstain poses an immediate design question for LLM cascades: should abstention only be allowed at the final model or also at earlier models? Since the error patterns of small and large models are correlated, the latter strategy may further reduce inference costs by letting inexpensive models anticipate abstention decisions by expensive models, thereby obviating the need to run the expensive models. We investigate the benefits of “early abstention” in LLM cascades and find that it reduces the overall test loss by 2.2% on average across six benchmarks (GSM8K, MedMCQA, MMLU, TriviaQA, TruthfulQA, and XSum). These gains result from a more effective use of abstention, which trades a 4.1% average increase in the overall abstention rate for a 13.0% reduction in cost and a 5.0% reduction in error rate. Our findings demonstrate that it is possible to leverage correlations between the error patterns of different language models to drive performance improvements for LLM systems with abstention.
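
A minimal sketch of a cascade with early abstention may help make the design question concrete. Everything here (the `Stage` structure, the confidence scores, and the thresholds) is a hypothetical stand-in for the paper's tuned setup.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

@dataclass
class Stage:
    answer: Callable[[str], Tuple[str, float]]  # returns (answer, confidence)
    accept_threshold: float    # answer here if confidence >= this
    abstain_threshold: float   # abstain early if confidence <= this

def run_cascade(query: str, stages: List[Stage]) -> Optional[str]:
    for stage in stages:
        ans, conf = stage.answer(query)
        if conf >= stage.accept_threshold:
            return ans        # cheap model is confident enough to answer
        if conf <= stage.abstain_threshold:
            return None       # early abstention: skip the costlier models
        # otherwise defer the query to the next, larger model
    return None               # the final model also abstains

small = Stage(lambda q: ("42", 0.9), accept_threshold=0.8, abstain_threshold=0.1)
large = Stage(lambda q: ("42", 0.95), accept_threshold=0.7, abstain_threshold=0.2)
print(run_cascade("What is 6 x 7?", [small, large]))  # answered by `small`
```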

[AI-43] Game Theory Meets Large Language Models: A Systematic Survey

链接: https://arxiv.org/abs/2502.09053
作者: Haoran Sun,Yusen Wu,Yukun Cheng,Xu Chu
类目: Artificial Intelligence (cs.AI); Computer Science and Game Theory (cs.GT); Machine Learning (cs.LG)
*备注: 10 pages

点击查看摘要

Abstract:Game theory establishes a fundamental framework for analyzing strategic interactions among rational decision-makers. The rapid advancement of large language models (LLMs) has sparked extensive research exploring the intersection of these two fields. Specifically, game-theoretic methods are being applied to evaluate and enhance LLM capabilities, while LLMs themselves are reshaping classic game models. This paper presents a comprehensive survey of the intersection of these fields, exploring a bidirectional relationship from three perspectives: (1) Establishing standardized game-based benchmarks for evaluating LLM behavior; (2) Leveraging game-theoretic methods to improve LLM performance through algorithmic innovations; (3) Characterizing the societal impacts of LLMs through game modeling. Among these three aspects, we also highlight how the equilibrium analysis for traditional game models is impacted by LLMs’ advanced language understanding, which in turn extends the study of game theory. Finally, we identify key challenges and future research directions, assessing their feasibility based on the current state of the field. By bridging theoretical rigor with emerging AI capabilities, this survey aims to foster interdisciplinary collaboration and drive progress in this evolving research area.

[AI-44] Leveraging Member-Group Relations via Multi-View Graph Filtering for Effective Group Recommendation WWW2025

链接: https://arxiv.org/abs/2502.09050
作者: Chae-Hyun Kim,Yoon-Ryung Choi,Jin-Duk Park,Won-Yong Shin
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Information Theory (cs.IT); Machine Learning (cs.LG); Social and Information Networks (cs.SI)
*备注: 5 pages, 3 figures, 4 tables; ACM Web Conference (WWW 2025) (to appear) (Please cite our conference version.)

点击查看摘要

Abstract:Group recommendation aims at providing optimized recommendations tailored to diverse groups, enabling groups to enjoy appropriate items. On the other hand, most existing group recommendation methods are built upon deep neural network (DNN) architectures designed to capture the intricate relationships between member-level and group-level interactions. While these DNN-based approaches have proven their effectiveness, they require complex and expensive training procedures to incorporate group-level interactions in addition to member-level interactions. To overcome such limitations, we introduce Group-GF, a new approach for extremely fast recommendations of items to each group via multi-view graph filtering (GF) that offers a holistic view of complex member-group dynamics, without the need for costly model training. Specifically, in Group-GF, we first construct three item similarity graphs manifesting different viewpoints for GF. Then, we discover a distinct polynomial graph filter for each similarity graph and judiciously aggregate the three graph filters. Extensive experiments demonstrate the effectiveness of Group-GF in terms of significantly reducing runtime and achieving state-of-the-art recommendation accuracy.
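
To illustrate the training-free flavor of polynomial graph filtering, here is a small sketch. The view construction, filter coefficients, and aggregation weights are assumptions for illustration (Group-GF uses three carefully built views; the toy below uses two random ones).

```python
import numpy as np

def polynomial_filter(S, coeffs):
    """Apply the polynomial graph filter sum_k c_k * S^k to an
    item-item similarity matrix S (no training, just arithmetic)."""
    out = np.zeros_like(S)
    power = np.eye(S.shape[0])
    for c in coeffs:
        out = out + c * power
        power = power @ S
    return out

def group_item_scores(r_group, views, coeffs_per_view, weights):
    """Weighted aggregation of one polynomial filter per similarity view."""
    combined = sum(w * polynomial_filter(S, c)
                   for S, c, w in zip(views, coeffs_per_view, weights))
    return r_group @ combined   # smoothed group-item preference scores

# Toy usage: two views over four items and one group's interaction vector.
rng = np.random.default_rng(0)
views = [rng.random((4, 4)) for _ in range(2)]
scores = group_item_scores(np.array([1.0, 0.0, 1.0, 0.0]), views,
                           coeffs_per_view=[[0.5, 0.5], [1.0, 0.0]],
                           weights=[0.6, 0.4])
```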

[AI-45] Criteria-Aware Graph Filtering: Extremely Fast Yet Accurate Multi-Criteria Recommendation WWW2025

链接: https://arxiv.org/abs/2502.09046
作者: Jin-Duk Park,Jaemin Yoo,Won-Yong Shin
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Information Theory (cs.IT); Machine Learning (cs.LG); Social and Information Networks (cs.SI)
*备注: 12 pages, 8 figures, 7 tables; ACM Web Conference (WWW 2025) (to appear) (Please cite our conference version.)

点击查看摘要

Abstract:Multi-criteria (MC) recommender systems, which utilize MC rating information for recommendation, are increasingly widespread in various e-commerce domains. However, the MC recommendation using training-based collaborative filtering, requiring consideration of multiple ratings compared to single-criterion counterparts, often poses practical challenges in achieving state-of-the-art performance along with scalable model training. To solve this problem, we propose CA-GF, a training-free MC recommendation method, which is built upon criteria-aware graph filtering for efficient yet accurate MC recommendations. Specifically, first, we construct an item-item similarity graph using an MC user-expansion graph. Next, we design CA-GF composed of the following key components, including 1) criterion-specific graph filtering where the optimal filter for each criterion is found using various types of polynomial low-pass filters and 2) criteria preference-infused aggregation where the smoothed signals from each criterion are aggregated. We demonstrate that CA-GF is (a) efficient: providing the computational efficiency, offering the extremely fast runtime of less than 0.2 seconds even on the largest benchmark dataset, (b) accurate: outperforming benchmark MC recommendation methods, achieving substantial accuracy gains up to 24% compared to the best competitor, and (c) interpretable: providing interpretations for the contribution of each criterion to the model prediction based on visualizations.

[AI-46] AoI-Sensitive Data Forwarding with Distributed Beamforming in UAV-Assisted IoT

链接: https://arxiv.org/abs/2502.09038
作者: Zifan Lang,Guixia Liu,Geng Sun,Jiahui Li,Zemin Sun,Jiacheng Wang,Victor C.M. Leung
类目: Artificial Intelligence (cs.AI)
*备注: 6 pages, 4 figures, ICC2025

点击查看摘要

Abstract:This paper proposes a UAV-assisted forwarding system based on distributed beamforming to enhance age of information (AoI) in the Internet of Things (IoT). Specifically, UAVs collect and relay data between sensor nodes (SNs) and the remote base station (BS). However, flight delays increase the AoI and degrade the network performance. To mitigate this, we adopt distributed beamforming to extend the communication range, reduce the flight frequency, and ensure continuous data relay and efficient energy utilization. Then, we formulate an optimization problem to minimize AoI and UAV energy consumption by jointly optimizing the UAV trajectories and communication schedules. The problem is non-convex and highly dynamic, so we propose a deep reinforcement learning (DRL)-based algorithm to solve it, enhancing stability and accelerating convergence. Simulation results show that the proposed algorithm effectively addresses the problem and outperforms other benchmark algorithms.

[AI-47] Mechanistic Unveiling of Transformer Circuits: Self-Influence as a Key to Model Reasoning NAACL2025

链接: https://arxiv.org/abs/2502.09022
作者: Lin Zhang,Lijie Hu,Di Wang
类目: Artificial Intelligence (cs.AI)
*备注: Accepted by NAACL2025

点击查看摘要

Abstract:Transformer-based language models have achieved notable success, yet their internal reasoning mechanisms remain largely opaque due to complex non-linear interactions and high-dimensional operations. While previous research suggests that these models implicitly encode reasoning structures, it is still unclear which specific multi-step thought processes they employ to solve complex tasks. To address this gap, we propose a novel mechanistic interpretability framework, SICAF, designed to trace and analyze the reasoning strategies that language models use in multi-step inference tasks. By employing circuit analysis and self-influence functions, we quantify the evolving importance of each token throughout the reasoning process, thereby mapping the pathways the model uses for inference. Applying SICAF to the GPT-2 model on the Indirect Object Identification (IOI) prediction task, we demonstrate how underlying circuits can reveal a reasoning process that aligns with human interpretability, offering new insights into the model’s internal logic.

[AI-48] RoSTE: An Efficient Quantization-Aware Supervised Fine-Tuning Approach for Large Language Models

链接: https://arxiv.org/abs/2502.09003
作者: Quan Wei,Chung-Yiu Yau,Hoi-To Wai,Yang(Katie)Zhao,Dongyeop Kang,Youngsuk Park,Mingyi Hong
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: 18 pages, 6 figures

点击查看摘要

Abstract:Supervised fine-tuning is a standard method for adapting pre-trained large language models (LLMs) to downstream tasks. Quantization has been recently studied as a post-training technique for efficient LLM deployment. To obtain quantized fine-tuned LLMs, conventional pipelines would first fine-tune the pre-trained models, followed by post-training quantization. This often yields suboptimal performance as it fails to leverage the synergy between fine-tuning and quantization. To effectively realize low-bit quantization of weights, activations, and KV caches in LLMs, we propose an algorithm named Rotated Straight-Through-Estimator (RoSTE), which combines quantization-aware supervised fine-tuning (QA-SFT) with an adaptive rotation strategy that identifies an effective rotation configuration to reduce activation outliers. We provide theoretical insights on RoSTE by analyzing its prediction error when applied to an overparameterized least square quantized training problem. Our findings reveal that the prediction error is directly proportional to the quantization error of the converged weights, which can be effectively managed through an optimized rotation configuration. Experiments on Pythia and Llama models of different sizes demonstrate the effectiveness of RoSTE. Compared to existing post-SFT quantization baselines, our method consistently achieves superior performances across various tasks and different LLM architectures.

[AI-49] PixLift: Accelerating Web Browsing via AI Upscaling

链接: https://arxiv.org/abs/2502.08995
作者: Yonas Atinafu,Sarthak Malla,HyunSeok Daniel Jang,Nouar Aldahoul,Matteo Varvello,Yasir Zaki
类目: Performance (cs.PF); Artificial Intelligence (cs.AI)
*备注: 9 pages, 2 figures

点击查看摘要

Abstract:Accessing the internet in regions with expensive data plans and limited connectivity poses significant challenges, restricting information access and economic growth. Images, as a major contributor to webpage sizes, exacerbate this issue, despite advances in compression formats like WebP and AVIF. The continued growth of complex and curated web content, coupled with suboptimal optimization practices in many regions, has prevented meaningful reductions in web page sizes. This paper introduces PixLift, a novel solution to reduce webpage sizes by downscaling their images during transmission and leveraging AI models on user devices to upscale them. By trading computational resources for bandwidth, PixLift enables more affordable and inclusive web access. We address key challenges, including the feasibility of scaled image requests on popular websites, the implementation of PixLift as a browser extension, and its impact on user experience. Through the analysis of 71.4k webpages, evaluations of three mainstream upscaling models, and a user study, we demonstrate PixLift’s ability to significantly reduce data usage without compromising image quality, fostering a more equitable internet.

[AI-50] RLSA-PFL: Robust Lightweight Secure Aggregation with Model Inconsistency Detection in Privacy-Preserving Federated Learning

链接: https://arxiv.org/abs/2502.08989
作者: Nazatul H. Sultan,Yan Bo,Yansong Gao,Seyit Camtepe,Arash Mahboubi,Hang Thanh Bui,Aufeef Chauhan,Hamed Aboutorab,Michael Bewong,Praveen Gauravaram,Rafiqul Islam,Sharif Abuadbba
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
*备注: 16 pages, 10 Figures

点击查看摘要

Abstract:Federated Learning (FL) allows users to collaboratively train a global machine learning model by sharing local model only, without exposing their private data to a central server. This distributed learning is particularly appealing in scenarios where data privacy is crucial, and it has garnered substantial attention from both industry and academia. However, studies have revealed privacy vulnerabilities in FL, where adversaries can potentially infer sensitive information from the shared model parameters. In this paper, we present an efficient masking-based secure aggregation scheme utilizing lightweight cryptographic primitives to mitigate privacy risks. Our scheme offers several advantages over existing methods. First, it requires only a single setup phase for the entire FL training session, significantly reducing communication overhead. Second, it minimizes user-side overhead by eliminating the need for user-to-user interactions, utilizing an intermediate server layer and a lightweight key negotiation method. Third, the scheme is highly resilient to user dropouts, and the users can join at any FL round. Fourth, it can detect and defend against malicious server activities, including recently discovered model inconsistency attacks. Finally, our scheme ensures security in both semi-honest and malicious settings. We provide security analysis to formally prove the robustness of our approach. Furthermore, we implemented an end-to-end prototype of our scheme. We conducted comprehensive experiments and comparisons, which show that it outperforms existing solutions in terms of communication and computation overhead, functionality, and security.

[AI-51] Neural Force Field: Learning Generalized Physical Representation from a Few Examples

链接: https://arxiv.org/abs/2502.08987
作者: Shiqian Li,Ruihong Shen,Chi Zhang,Yixin Zhu
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: 20 pages

点击查看摘要

Abstract:Physical reasoning is a remarkable human ability that enables rapid learning and generalization from limited experience. Current AI models, despite extensive training, still struggle to achieve similar generalization, especially in Out-of-distribution (OOD) settings. This limitation stems from their inability to abstract core physical principles from observations. A key challenge is developing representations that can efficiently learn and generalize physical dynamics from minimal data. Here we present Neural Force Field (NFF), a modeling framework built on Neural Ordinary Differential Equation (NODE) that learns interpretable force field representations which can be efficiently integrated through an Ordinary Differential Equation (ODE) solver to predict object trajectories. Unlike existing approaches that rely on high-dimensional latent spaces, NFF captures fundamental physical concepts such as gravity, support, and collision in an interpretable manner. Experiments on two challenging physical reasoning tasks demonstrate that NFF, trained with only a few examples, achieves strong generalization to unseen scenarios. This physics-grounded representation enables efficient forward-backward planning and rapid adaptation through interactive refinement. Our work suggests that incorporating physics-inspired representations into learning systems can help bridge the gap between artificial and human physical reasoning capabilities.
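
A rough sketch of the idea, a small learned force field rolled out with a simple integration step, is below. The network sizes and the semi-implicit Euler integrator are illustrative simplifications; the paper uses a NODE formulation with a proper ODE solver.

```python
import torch
import torch.nn as nn

class ForceField(nn.Module):
    """Tiny stand-in for a learned force field: maps object state
    (position, velocity) to a force vector. Sizes are illustrative."""
    def __init__(self, dim=2, hidden=64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(2 * dim, hidden), nn.Tanh(),
                                 nn.Linear(hidden, dim))

    def forward(self, pos, vel):
        return self.net(torch.cat([pos, vel], dim=-1))

def rollout(field, pos, vel, dt=0.01, steps=100):
    """Semi-implicit Euler rollout of the learned dynamics (unit mass,
    so force equals acceleration); a crude substitute for an ODE solver."""
    traj = [pos]
    for _ in range(steps):
        acc = field(pos, vel)
        vel = vel + dt * acc
        pos = pos + dt * vel
        traj.append(pos)
    return torch.stack(traj)

field = ForceField()
trajectory = rollout(field, torch.zeros(1, 2), torch.ones(1, 2))
```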

[AI-52] Few is More: Task-Efficient Skill-Discovery for Multi-Task Offline Multi-Agent Reinforcement Learning

链接: https://arxiv.org/abs/2502.08985
作者: Xun Wang,Zhuoran Li,Hai Zhong,Longbo Huang
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
*备注:

点击查看摘要

Abstract:As a data-driven approach, offline MARL learns superior policies solely from offline datasets, ideal for domains rich in historical data but with high interaction costs and risks. However, most existing methods are task-specific, requiring retraining for new tasks, leading to redundancy and inefficiency. To address this issue, in this paper, we propose a task-efficient multi-task offline MARL algorithm, Skill-Discovery Conservative Q-Learning (SD-CQL). Unlike existing offline skill-discovery methods, SD-CQL discovers skills by reconstructing the next observation. It then evaluates fixed and variable actions separately and employs behavior-regularized conservative Q-learning to execute the optimal action for each skill. This approach eliminates the need for local-global alignment and enables strong multi-task generalization from limited small-scale source tasks. Substantial experiments on StarCraftII demonstrate the superior generalization performance and task-efficiency of SD-CQL. It achieves the best performance on 10 out of 14 task sets, with up to 65% improvement on individual task sets, and is within 4% of the best baseline on the remaining four.

[AI-53] SkyRover: A Modular Simulator for Cross-Domain Pathfinding

链接: https://arxiv.org/abs/2502.08969
作者: Wenhui Ma,Wenhao Li,Bo Jin,Changhong Lu,Xiangfeng Wang
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
*备注: 9 pages

点击查看摘要

Abstract:Unmanned Aerial Vehicles (UAVs) and Automated Guided Vehicles (AGVs) increasingly collaborate in logistics, surveillance, inspection, and other tasks. However, existing simulators often focus on a single domain, limiting cross-domain study. This paper presents SkyRover, a modular simulator for UAV-AGV multi-agent pathfinding (MAPF). SkyRover supports realistic agent dynamics, configurable 3D environments, and convenient APIs for external solvers and learning methods. By unifying ground and aerial operations, it facilitates cross-domain algorithm design, testing, and benchmarking. Experiments highlight SkyRover’s capacity for efficient pathfinding and high-fidelity simulations in UAV-AGV coordination. Project is available at this https URL.

[AI-54] RTBAS: Defending LLM Agents Against Prompt Injection and Privacy Leakage

链接: https://arxiv.org/abs/2502.08966
作者: Peter Yong Zhong,Siyuan Chen,Ruiqi Wang,McKenna McCall,Ben L. Titzer,Heather Miller
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Tool-Based Agent Systems (TBAS) allow Language Models (LMs) to use external tools for tasks beyond their standalone capabilities, such as searching websites, booking flights, or making financial transactions. However, these tools greatly increase the risks of prompt injection attacks, where malicious content hijacks the LM agent to leak confidential data or trigger harmful actions. Existing defenses (OpenAI GPTs) require user confirmation before every tool call, placing onerous burdens on users. We introduce Robust TBAS (RTBAS), which automatically detects and executes tool calls that preserve integrity and confidentiality, requiring user confirmation only when these safeguards cannot be ensured. RTBAS adapts Information Flow Control to the unique challenges presented by TBAS. We present two novel dependency screeners, using LM-as-a-judge and attention-based saliency, to overcome these challenges. Experimental results on the AgentDojo Prompt Injection benchmark show RTBAS prevents all targeted attacks with only a 2% loss of task utility when under attack, and further tests confirm its ability to obtain near-oracle performance on detecting both subtle and direct privacy leaks.

[AI-55] Biologically Plausible Brain Graph Transformer ICLR2025

链接: https://arxiv.org/abs/2502.08958
作者: Ciyuan Peng,Yuelong Huang,Qichao Dong,Shuo Yu,Feng Xia,Chengqi Zhang,Yaochu Jin
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: 27pages, 16figures, published as a conference paper at ICLR 2025

点击查看摘要

Abstract:State-of-the-art brain graph analysis methods fail to fully encode the small-world architecture of brain graphs (accompanied by the presence of hubs and functional modules), and therefore lack biological plausibility to some extent. This limitation hinders their ability to accurately represent the brain’s structural and functional properties, thereby restricting the effectiveness of machine learning models in tasks such as brain disorder detection. In this work, we propose a novel Biologically Plausible Brain Graph Transformer (BioBGT) that encodes the small-world architecture inherent in brain graphs. Specifically, we present a network entanglement-based node importance encoding technique that captures the structural importance of nodes in global information propagation during brain graph communication, highlighting the biological properties of the brain structure. Furthermore, we introduce a functional module-aware self-attention to preserve the functional segregation and integration characteristics of brain graphs in the learned representations. Experimental results on three benchmark datasets demonstrate that BioBGT outperforms state-of-the-art models, enhancing biologically plausible brain graph representations for various brain graph analytical tasks

[AI-56] Language in the Flow of Time: Time-Series-Paired Texts Weaved into a Unified Temporal Narrative

链接: https://arxiv.org/abs/2502.08942
作者: Zihao Li,Xiao Lin,Zhining Liu,Jiaru Zou,Ziwei Wu,Lecheng Zheng,Dongqi Fu,Yada Zhu,Hendrik Hamann,Hanghang Tong,Jingrui He
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: Preprint, 37 pages

点击查看摘要

Abstract:While many advances in time series models focus exclusively on numerical data, research on multimodal time series, particularly those involving contextual textual information commonly encountered in real-world scenarios, remains in its infancy. Consequently, effectively integrating the text modality remains challenging. In this work, we highlight an intuitive yet significant observation that has been overlooked by existing works: time-series-paired texts exhibit periodic properties that closely mirror those of the original time series. Building on this insight, we propose a novel framework, Texts as Time Series (TaTS), which considers the time-series-paired texts to be auxiliary variables of the time series. TaTS can be plugged into any existing numerical-only time series models and enable them to handle time series data with paired texts effectively. Through extensive experiments on both multimodal time series forecasting and imputation tasks across benchmark datasets with various existing time series models, we demonstrate that TaTS can enhance predictive performance and achieve outperformance without modifying model architectures.
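
The plug-in idea can be sketched as widening the numerical series with auxiliary channels derived from the paired text embeddings, so that any numerical-only forecaster can consume both. The SVD-based projection below is a stand-in for TaTS's actual construction.

```python
import numpy as np

def augment_with_text(series, text_embeddings, n_components=4):
    """Project paired text embeddings to a few auxiliary channels and
    append them to the numerical series as extra variables."""
    centered = text_embeddings - text_embeddings.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    aux = centered @ vt[:n_components].T           # (time, n_components)
    return np.concatenate([series, aux], axis=1)   # widened series

series = np.random.randn(200, 3)    # numerical channels
texts = np.random.randn(200, 768)   # one text embedding per time step
augmented = augment_with_text(series, texts)       # shape (200, 7)
```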

[AI-57] Analysis of Off-Policy n-Step TD-Learning with Linear Function Approximation

链接: https://arxiv.org/abs/2502.08941
作者: Han-Dong Lim,Donghwan Lee
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: arXiv admin note: substantial text overlap with arXiv:2402.15781

点击查看摘要

Abstract:This paper analyzes multi-step temporal difference (TD)-learning algorithms within the "deadly triad" scenario, characterized by linear function approximation, off-policy learning, and bootstrapping. In particular, we prove that n-step TD-learning algorithms converge to a solution as the sampling horizon n increases sufficiently. The paper is divided into two parts. In the first part, we comprehensively examine the fundamental properties of their model-based deterministic counterparts, including projected value iteration and gradient descent algorithms, which can be viewed as prototype deterministic algorithms whose analysis plays a pivotal role in understanding and developing their model-free reinforcement learning counterparts. In particular, we prove that these algorithms converge to meaningful solutions when n is sufficiently large. Based on these findings, in the second part, two n-step TD-learning algorithms are proposed and analyzed, which can be seen as the model-free reinforcement learning counterparts of the model-based deterministic algorithms.
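
For readers less familiar with the setting, a minimal sketch of one off-policy n-step TD update with linear function approximation follows. The per-step importance weighting shown is one common variant, not necessarily the exact algorithm analyzed in the paper.

```python
import numpy as np

def n_step_td_update(theta, traj, n, gamma=0.99, alpha=0.1):
    """One off-policy n-step TD update for a linear value function
    V(s) = phi(s) @ theta. `traj` holds n+1 tuples (phi, reward, rho),
    where rho is a per-step importance weight."""
    phi_0 = traj[0][0]
    g, rho = 0.0, 1.0
    for k in range(n):                      # discounted n-step return
        phi_k, r_k, rho_k = traj[k]
        rho *= rho_k                        # cumulative importance weight
        g += (gamma ** k) * r_k
    g += (gamma ** n) * traj[n][0] @ theta  # bootstrap from V(s_{t+n})
    return theta + alpha * rho * (g - phi_0 @ theta) * phi_0

# Toy usage with 3-dimensional features and n = 2.
rng = np.random.default_rng(0)
traj = [(rng.random(3), 1.0, 1.0) for _ in range(3)]
theta = n_step_td_update(np.zeros(3), traj, n=2)
```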

[AI-58] TokenSynth: A Token-based Neural Synthesizer for Instrument Cloning and Text-to-Instrument ICASSP2025

链接: https://arxiv.org/abs/2502.08939
作者: Kyungsu Kim,Junghyun Koo,Sungho Lee,Haesun Joung,Kyogu Lee
类目: ound (cs.SD); Artificial Intelligence (cs.AI)
*备注: 5 pages, 1 figure, to be published in ICASSP 2025

点击查看摘要

Abstract:Recent advancements in neural audio codecs have enabled the use of tokenized audio representations in various audio generation tasks, such as text-to-speech, text-to-audio, and text-to-music generation. Leveraging this approach, we propose TokenSynth, a novel neural synthesizer that utilizes a decoder-only transformer to generate desired audio tokens from MIDI tokens and CLAP (Contrastive Language-Audio Pretraining) embedding, which has timbre-related information. Our model is capable of performing instrument cloning, text-to-instrument synthesis, and text-guided timbre manipulation without any fine-tuning. This flexibility enables diverse sound design and intuitive timbre control. We evaluated the quality of the synthesized audio, the timbral similarity between synthesized and target audio/text, and synthesis accuracy (i.e., how accurately it follows the input MIDI) using objective measures. TokenSynth demonstrates the potential of leveraging advanced neural audio codecs and transformers to create powerful and versatile neural synthesizers. The source code, model weights, and audio demos are available at: this https URL

[AI-59] Self-Consistency of the Internal Reward Models Improves Self-Rewarding Language Models

链接: https://arxiv.org/abs/2502.08922
作者: Xin Zhou,Yiwen Guo,Ruotian Ma,Tao Gui,Qi Zhang,Xuanjing Huang
类目: Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Aligning Large Language Models (LLMs) with human preferences is crucial for their deployment in real-world applications. Recent advancements in Self-Rewarding Language Models suggest that an LLM can use its internal reward models (such as LLM-as-a-Judge) to generate preference data, improving alignment performance without costly human annotation. However, we find that different internal reward models within the same LLM often generate inconsistent preferences. This inconsistency raises concerns about the reliability of self-generated preference data, hinders overall alignment performance, and highlights the need for further research to ensure reliable and coherent alignment with human preferences. To address this limitation, we propose Self-Consistent Internal Rewards (SCIR), a novel framework designed to enhance consistency among internal reward models during training. In each training step, we collect preference predictions from multiple pre-defined internal reward models and enforce consistency and confidence through an inconsistency penalty mechanism, thereby improving the reliability of these internal reward models. We selectively use data with consistent predictions for preference optimization, ensuring the quality of the preference data. By employing self-consistent internal rewards, our method significantly improves the alignment performance and reward modeling capability of LLMs, outperforming baseline methods by a notable margin.
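
A toy sketch of the consistency-based data selection may clarify the idea. SCIR's inconsistency penalty applied during training is not reproduced here, only a unanimity filter on self-generated preferences; the length-based reward models are purely illustrative.

```python
def consistent_preferences(triples, reward_models):
    """Keep only preference pairs on which every internal reward model
    agrees about which response is better."""
    kept = []
    for prompt, resp_a, resp_b in triples:
        votes = [rm(prompt, resp_a) > rm(prompt, resp_b) for rm in reward_models]
        if all(votes) or not any(votes):          # unanimous either way
            chosen, rejected = (resp_a, resp_b) if votes[0] else (resp_b, resp_a)
            kept.append((prompt, chosen, rejected))
    return kept

# Toy reward models scoring by response length (illustration only).
rms = [lambda p, r: len(r), lambda p, r: len(r) + 0.1]
pairs = consistent_preferences([("q", "short", "a longer answer")], rms)
```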

[AI-60] Exploring Emotion-Sensitive LLM-Based Conversational AI

链接: https://arxiv.org/abs/2502.08920
作者: Antonin Brun,Ruying Liu,Aryan Shukla,Frances Watson,Jonathan Gratch
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
*备注: 7 pages, 2 figures, 1 table

点击查看摘要

Abstract:Conversational AI chatbots have become increasingly common within the customer service industry. Despite improvements in their emotional development, they often lack the authenticity of real customer service interactions or the competence of service providers. By comparing emotion-sensitive and emotion-insensitive LLM-based chatbots across 30 participants, we aim to explore how emotional sensitivity in chatbots influences perceived competence and overall customer satisfaction in service interactions. Additionally, we employ sentiment analysis techniques to analyze and interpret the emotional content of user inputs. We highlight that perceptions of chatbot trustworthiness and competence were higher in the case of the emotion-sensitive chatbot, even if issue resolution rates were not affected. We discuss implications of improved user satisfaction from emotion-sensitive chatbots and potential applications in support services.

[AI-61] Reinforced Large Language Model is a formal theorem prover

链接: https://arxiv.org/abs/2502.08908
作者: Zhiling Luo
类目: Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:To take advantage of Large Language Models in theorem formalization and proof, we propose a reinforcement learning framework that iteratively optimizes the pretrained LLM by rolling out next tactics and comparing them with the expected ones. The experimental results show that it achieves higher accuracy compared with a directly fine-tuned LLM.

[AI-62] MIH-TCCT: Mitigating Inconsistent Hallucinations in LLMs via Event-Driven Text-Code Cyclic Training

链接: https://arxiv.org/abs/2502.08904
作者: Xinxin You,Xien Liu,Qixin Sun,Huan Zhang,Kaiyin Zhou,Shaohui Liu,GuoPing Hu,ShiJin Wang,Si Liu,Ji Wu
类目: Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Recent methodologies utilizing synthetic datasets have aimed to address inconsistent hallucinations in large language models (LLMs); however, these approaches are primarily tailored to specific tasks, limiting their generalizability. Inspired by the strong performance of code-trained models in logic-intensive domains, we propose a novel framework that leverages event-based text to generate corresponding code and employs cyclic training to transfer the logical consistency of code to natural language effectively. Our method significantly reduces inconsistent hallucinations across three leading LLMs and two categories of natural language tasks while maintaining overall performance. This framework effectively alleviates hallucinations without necessitating adaptation to downstream tasks, demonstrating generality and providing new perspectives to tackle the challenge of inconsistent hallucinations.

[AI-63] 3D-Grounded Vision-Language Framework for Robotic Task Planning: Automated Prompt Synthesis and Supervised Reasoning

链接: https://arxiv.org/abs/2502.08903
作者: Guoqin Tang,Qingxuan Jia,Zeyuan Huang,Gang Chen,Ning Ji,Zhipeng Yao
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Vision-language models (VLMs) have achieved remarkable success in scene understanding and perception tasks, enabling robots to plan and execute actions adaptively in dynamic environments. However, most multimodal large language models lack robust 3D scene localization capabilities, limiting their effectiveness in fine-grained robotic operations. Additionally, challenges such as low recognition accuracy, inefficiency, poor transferability, and low reliability hinder their use in precision tasks. To address these limitations, we propose a novel framework that integrates a 2D prompt synthesis module by mapping 2D images to point clouds, and incorporates a small language model (SLM) for supervising VLM outputs. The 2D prompt synthesis module enables VLMs, trained on 2D images and text, to autonomously extract precise 3D spatial information without manual intervention, significantly enhancing 3D scene understanding. Meanwhile, the SLM supervises VLM outputs, mitigating hallucinations and ensuring reliable, executable robotic control code generation. Our framework eliminates the need for retraining in new environments, thereby improving cost efficiency and operational robustness. Experimental results show that the proposed framework achieved a 96.0% Task Success Rate (TSR), outperforming other methods. Ablation studies demonstrated the critical role of both the 2D prompt synthesis module and the output supervision module (which, when removed, caused a 67% TSR drop). These findings validate the framework’s effectiveness in improving 3D recognition, task planning, and robotic task execution.

[AI-64] Learning in Strategic Queuing Systems with Small Buffers

链接: https://arxiv.org/abs/2502.08898
作者: Ariana Abel,Yoav Kolumbus,Jeronimo Martin Duque,Eva Tardos
类目: Computer Science and Game Theory (cs.GT); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
*备注:

点击查看摘要

Abstract:Routers in networking use simple learning algorithms to find the best way to deliver packets to their desired destination. This simple, myopic and distributed decision system makes large queuing systems simple to operate, but at the same time, the system needs more capacity than would be required if all traffic were centrally coordinated. In a recent paper, Gaitonde and Tardos (EC 2020 and JACM 2023) initiate the study of such systems, modeling them as an infinitely repeated game in which routers compete for servers and the system maintains a state (number of packets held by each queue) resulting from outcomes of previous rounds. Queues get to send a packet at each step to one of the servers, and servers attempt to process only one of the arriving packets, modeling routers. However, their model assumes that servers have no buffers at all, so queues have to resend all packets that were not served successfully. They show that, even with hugely increased server capacity relative to what is needed in the centrally-coordinated case, ensuring that the system is stable requires using timestamps and priority for older packets. We consider a system with two important changes, which make the model more realistic: first we add a very small buffer to each server, allowing it to hold on to a single packet to be served later (even if it fails to serve it); and second, we do not require timestamps or priority for older packets. Our main result is to show that when queues are learning, a small constant factor increase in server capacity, compared to what would be needed if centrally coordinating, suffices to keep the system stable, even if servers select randomly among packets arriving simultaneously. This work contributes to the growing literature on the impact of selfish learning in systems with carryover effects between rounds: when outcomes in the present round affect the game in the future.

[AI-65] Generative AI for Internet of Things Security: Challenges and Opportunities

链接: https://arxiv.org/abs/2502.08886
作者: Yan Lin Aung,Ivan Christian,Ye Dong,Xiaodong Ye,Sudipta Chattopadhyay,Jianying Zhou
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:As Generative AI (GenAI) continues to gain prominence and utility across various sectors, their integration into the realm of Internet of Things (IoT) security evolves rapidly. This work delves into an examination of the state-of-the-art literature and practical applications on how GenAI could improve and be applied in the security landscape of IoT. Our investigation aims to map the current state of GenAI implementation within IoT security, exploring their potential to fortify security measures further. Through the compilation, synthesis, and analysis of the latest advancements in GenAI technologies applied to IoT, this paper not only introduces fresh insights into the field, but also lays the groundwork for future research directions. It explains the prevailing challenges within IoT security, discusses the effectiveness of GenAI in addressing these issues, and identifies significant research gaps through MITRE Mitigations. Accompanied with three case studies, we provide a comprehensive overview of the progress and future prospects of GenAI applications in IoT security. This study serves as a foundational resource to improve IoT security through the innovative application of GenAI, thus contributing to the broader discourse on IoT security and technology integration.

[AI-66] Data Sensor Fusion In Digital Twin Technology For Enhanced Capabilities In A Home Environment

链接: https://arxiv.org/abs/2502.08874
作者: Benjamin Momoh,Salisu Yahaya
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Signal Processing (eess.SP)
*备注:

点击查看摘要

Abstract:This paper investigates the integration of data sensor fusion in digital twin technology to bolster home environment capabilities, particularly in the context of challenges brought on by the coronavirus pandemic and its economic effects. The study underscores the crucial role of digital transformation in not just adapting to, but also mitigating disruptions during the fourth industrial revolution. Using the Wit Motion sensor, data was collected for activities such as walking, working, sitting, and lying, with sensors measuring accelerometers, gyroscopes, and magnetometers. The research integrates Cyber-physical systems, IoT, AI, and robotics to fortify digital twin capabilities. The paper compares sensor fusion methods, including feature-level fusion, decision-level fusion, and Kalman filter fusion, alongside machine learning models like SVM, GBoost, and Random Forest to assess model effectiveness. Results show that sensor fusion significantly improves the accuracy and reliability of these models, as it compensates for individual sensor weaknesses, particularly with magnetometers. Despite higher accuracy in ideal conditions, integrating data from multiple sensors ensures more consistent and reliable results in real-world settings, thereby establishing a robust system that can be confidently applied in practical scenarios.
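
Feature-level fusion, the simplest of the compared schemes, amounts to concatenating per-sensor features before classification. Below is a minimal sketch with synthetic windows; the feature set, window size, and classifier settings are illustrative assumptions, not the paper's pipeline.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def window_features(signal):
    """Per-window statistics for one 3-axis stream of shape (time, 3)."""
    return np.concatenate([signal.mean(axis=0), signal.std(axis=0)])

def fuse(accel, gyro, mag):
    """Feature-level fusion: concatenate per-sensor features so the
    classifier can compensate for any single sensor's weaknesses."""
    return np.concatenate([window_features(accel),
                           window_features(gyro),
                           window_features(mag)])

rng = np.random.default_rng(0)
# Synthetic stand-ins for windowed Wit Motion recordings.
windows = [tuple(rng.normal(size=(100, 3)) for _ in range(3)) for _ in range(40)]
labels = rng.integers(0, 4, size=40)        # walking/working/sitting/lying
X = np.stack([fuse(a, g, m) for a, g, m in windows])
clf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, labels)
```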

[AI-67] Off-Switching Not Guaranteed

链接: https://arxiv.org/abs/2502.08864
作者: Sven Neth
类目: Artificial Intelligence (cs.AI)
*备注: Forthcoming in Philosophical Studies

点击查看摘要

Abstract:Hadfield-Menell et al. (2017) propose the Off-Switch Game, a model of Human-AI cooperation in which AI agents always defer to humans because they are uncertain about our preferences. I explain two reasons why AI agents might not defer. First, AI agents might not value learning. Second, even if AI agents value learning, they might not be certain to learn our actual preferences.

[AI-68] Estimating Probabilities of Causation with Machine Learning Models UAI2025

链接: https://arxiv.org/abs/2502.08858
作者: Shuai Wang,Ang Li
类目: Artificial Intelligence (cs.AI)
*备注: 8 pages + 2 pages reference + 3 pages supplementary material, 5 figures, submitted to UAI 2025

点击查看摘要

Abstract:Probabilities of causation play a crucial role in modern decision-making. This paper addresses the challenge of predicting probabilities of causation for subpopulations with insufficient data using machine learning models. Tian and Pearl first defined and derived tight bounds for three fundamental probabilities of causation: the probability of necessity and sufficiency (PNS), the probability of sufficiency (PS), and the probability of necessity (PN). However, estimating these probabilities requires both experimental and observational distributions specific to each subpopulation, which are often unavailable or impractical to obtain with limited population-level data. We assume that the probabilities of causation for each subpopulation are determined by its characteristics. To estimate these probabilities for subpopulations with insufficient data, we propose using machine learning models that draw insights from subpopulations with sufficient data. Our evaluation of multiple machine learning models indicates that, given sufficient population-level data and an appropriate choice of machine learning model and activation function, PNS can be effectively predicted. Through simulation studies, we show that our multilayer perceptron (MLP) model with the Mish activation function achieves a mean absolute error (MAE) of approximately 0.02 in predicting PNS for 32,768 subpopulations using data from around 2,000 subpopulations.
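
For reference, the Tian-Pearl bounds on PNS combine experimental and observational quantities. The sketch below states them as commonly quoted; verify against the original derivation before relying on it.

```python
def pns_bounds(p_y_do_x, p_y_do_xp, p_xy, p_xpy, p_xyp, p_xpyp):
    """Tian-Pearl style bounds on PNS from experimental terms P(y|do(x)),
    P(y|do(x')) plus the observational joint P(X, Y)."""
    p_y = p_xy + p_xpy                                  # observational P(y)
    lower = max(0.0,
                p_y_do_x - p_y_do_xp,
                p_y - p_y_do_xp,
                p_y_do_x - p_y)
    upper = min(p_y_do_x,
                1.0 - p_y_do_xp,                        # = P(y'|do(x'))
                p_xy + p_xpyp,
                p_y_do_x - p_y_do_xp + p_xyp + p_xpy)
    return lower, upper

# Toy numbers: do-probabilities plus an observational joint over (X, Y).
print(pns_bounds(0.7, 0.2, p_xy=0.35, p_xpy=0.15, p_xyp=0.15, p_xpyp=0.35))
```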

[AI-69] A Reversible Solver for Diffusion SDEs

链接: https://arxiv.org/abs/2502.08834
作者: Zander W. Blasingame,Chen Liu
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
*备注: Preprint

点击查看摘要

Abstract:Diffusion models have quickly become the state-of-the-art for generation tasks across many different data modalities. An important capability of diffusion models is the ability to encode samples from the data distribution back into the sampling prior distribution. This is useful for performing alterations to real data samples along with guided generation via the continuous adjoint equations. We propose an algebraically reversible solver for diffusion SDEs that can exactly invert real data samples into the prior distribution.

[AI-70] A Survey on Data-Centric AI: Tabular Learning from Reinforcement Learning and Generative AI Perspective

链接: https://arxiv.org/abs/2502.08828
作者: Wangyang Ying,Cong Wei,Nanxu Gong,Xinyuan Wang,Haoyue Bai,Arun Vignesh Malarkkan,Sixun Dong,Dongjie Wang,Denghui Zhang,Yanjie Fu
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Tabular data is one of the most widely used data formats across various domains such as bioinformatics, healthcare, and marketing. As artificial intelligence moves towards a data-centric perspective, improving data quality is essential for enhancing model performance in tabular data-driven applications. This survey focuses on data-driven tabular data optimization, specifically exploring reinforcement learning (RL) and generative approaches for feature selection and feature generation as fundamental techniques for refining data spaces. Feature selection aims to identify and retain the most informative attributes, while feature generation constructs new features to better capture complex data patterns. We systematically review existing generative methods for tabular data engineering, analyzing their latest advancements, real-world applications, and respective strengths and limitations. This survey emphasizes how RL-based and generative techniques contribute to the automation and intelligence of feature engineering. Finally, we summarize the existing challenges and discuss future research directions, aiming to provide insights that drive continued innovation in this field.

[AI-71] CLOVER: A Test Case Generation Benchmark with Coverage, Long-Context, and Verification

链接: https://arxiv.org/abs/2502.08806
作者: Jiacheng Xu,Bo Pang,Jin Qu,Hiroaki Hayashi,Caiming Xiong,Yingbo Zhou
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: 16 pages

点击查看摘要

Abstract:Software testing is a critical aspect of software development, yet generating test cases remains a routine task for engineers. This paper presents a benchmark, CLOVER, to evaluate models’ capabilities in generating and completing test cases under specific conditions. Spanning from simple assertion completions to writing test cases that cover specific code blocks across multiple files, these tasks are based on 12 Python repositories, analyzing 845 problems with context lengths ranging from 4k to 128k tokens. Utilizing code testing frameworks, we propose a method to construct retrieval contexts using coverage information. While models exhibit comparable performance with short contexts, notable differences emerge with 16k contexts. Notably, models like GPT-4o and Claude 3.5 can effectively leverage relevant snippets; however, all models score below 35% on the complex Task III, even with the oracle context provided, underscoring the benchmark’s significance and the potential for model improvement. The benchmark is containerized for code execution across tasks, and we will release the code, data, and construction methodologies.

[AI-72] Auction Design using Value Prediction with Hallucinations

链接: https://arxiv.org/abs/2502.08792
作者: Ilan Lobel,Humberto Moreira,Omar Mouchtaki
类目: Computer Science and Game Theory (cs.GT); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:We investigate a Bayesian mechanism design problem where a seller seeks to maximize revenue by selling an indivisible good to one of n buyers, incorporating potentially unreliable predictions (signals) of buyers’ private values derived from a machine learning model. We propose a framework where these signals are sometimes reflective of buyers’ true valuations but other times are hallucinations, which are uncorrelated with the buyers’ true valuations. Our main contribution is a characterization of the optimal auction under this framework. Our characterization establishes a near-decomposition of how to treat types above and below the signal. For the one buyer case, the seller’s optimal strategy is to post one of three fairly intuitive prices depending on the signal, which we call the “ignore”, “follow” and “cap” actions.

[AI-73] Acoustic Wave Manipulation Through Sparse Robotic Actuation ICRA2025

链接: https://arxiv.org/abs/2502.08784
作者: Tristan Shah,Noam Smilovich,Samer Gerges,Feruza Amirkulova,Stas Tiomkin
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
*备注: ICRA 2025

点击查看摘要

Abstract:Recent advancements in robotics, control, and machine learning have facilitated progress in the challenging area of object manipulation. These advancements include, among others, the use of deep neural networks to represent dynamics that are partially observed by robot sensors, as well as effective control using sparse control signals. In this work, we explore a more general problem: the manipulation of acoustic waves, which are partially observed by a robot capable of influencing the waves through spatially sparse actuators. This problem holds great potential for the design of new artificial materials, ultrasonic cutting tools, energy harvesting, and other applications. We develop an efficient data-driven method for robot learning that is applicable to either focusing scattered acoustic energy in a designated region or suppressing it, depending on the desired task. The proposed method is better in terms of solution quality and computational complexity than a state-of-the-art learning-based method for manipulation of dynamical systems governed by partial differential equations. Furthermore, our proposed method is competitive with a classical semi-analytical method in acoustics research on the demonstrated tasks. We have made the project code publicly available, along with a web page featuring video demonstrations: this https URL.

[AI-74] Contextual bandits with entropy-based human feedback

链接: https://arxiv.org/abs/2502.08759
作者: Raihan Seraj,Lili Meng,Tristan Sylvain
类目: Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:In recent years, preference-based human feedback mechanisms have become essential for enhancing model performance across diverse applications, including conversational AI systems such as ChatGPT. However, existing approaches often neglect critical aspects, such as model uncertainty and the variability in feedback quality. To address these challenges, we introduce an entropy-based human feedback framework for contextual bandits, which dynamically balances exploration and exploitation by soliciting expert feedback only when model entropy exceeds a predefined threshold. Our method is model-agnostic and can be seamlessly integrated with any contextual bandit agent employing stochastic policies. Through comprehensive experiments, we show that our approach achieves significant performance improvements while requiring minimal human feedback, even under conditions of suboptimal feedback quality. This work not only presents a novel strategy for feedback solicitation but also highlights the robustness and efficacy of incorporating human guidance into machine learning systems. Our code is publicly available: this https URL
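
The solicitation rule itself is simple to sketch: act from the stochastic policy unless its entropy exceeds a threshold, in which case query the expert. The entropy cutoff and the expert interface below are hypothetical stand-ins for the paper's tuned setup.

```python
import numpy as np

def entropy(p):
    p = np.clip(p, 1e-12, 1.0)
    return float(-(p * np.log(p)).sum())

def select_action(policy_probs, query_expert, threshold=1.0, rng=None):
    """Solicit expert feedback only when policy entropy (in nats)
    exceeds `threshold`; otherwise sample from the policy."""
    rng = rng or np.random.default_rng()
    if entropy(policy_probs) > threshold:
        return query_expert(), True               # feedback solicited
    return int(rng.choice(len(policy_probs), p=policy_probs)), False

# A confident policy acts alone; a near-uniform one asks the expert.
print(select_action(np.array([0.97, 0.02, 0.01]), lambda: 0))
print(select_action(np.array([0.34, 0.33, 0.33]), lambda: 0))
```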

[AI-75] From PowerPoint UI Sketches to Web-Based Applications: Pattern-Driven Code Generation for GIS Dashboard Development Using Knowledge-Augmented LLMs, Context-Aware Visual Prompting, and the React Framework

链接: https://arxiv.org/abs/2502.08756
作者: Haowen Xu,Xiao-Ying Yu
类目: Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
*备注:

点击查看摘要

Abstract:Developing web-based GIS applications, commonly known as CyberGIS dashboards, for querying and visualizing GIS data in environmental research often demands repetitive and resource-intensive efforts. While Generative AI offers automation potential for code generation, it struggles with complex scientific applications due to challenges in integrating domain knowledge, software engineering principles, and UI design best practices. This paper introduces a knowledge-augmented code generation framework that retrieves software engineering best practices, domain expertise, and advanced technology stacks from a specialized knowledge base to enhance Generative Pre-trained Transformers (GPT) for front-end development. The framework automates the creation of GIS-based web applications (e.g., dashboards, interfaces) from user-defined UI wireframes sketched in tools like PowerPoint or Adobe Illustrator. A novel Context-Aware Visual Prompting method, implemented in Python, extracts layouts and interface features from these wireframes to guide code generation. Our approach leverages Large Language Models (LLMs) to generate front-end code by integrating structured reasoning, software engineering principles, and domain knowledge, drawing inspiration from Chain-of-Thought (CoT) prompting and Retrieval-Augmented Generation (RAG). A case study demonstrates the framework’s capability to generate a modular, maintainable web platform hosting multiple dashboards for visualizing environmental and energy data (e.g., time-series, shapefiles, rasters) from user-sketched wireframes. By employing a knowledge-driven approach, the framework produces scalable, industry-standard front-end code using design patterns such as Model-View-ViewModel (MVVM) and frameworks like React. This significantly reduces manual effort in design and coding, pioneering an automated and efficient method for developing smart city software.

[AI-76] Scalable Discrete Diffusion Samplers: Combinatorial Optimization and Statistical Physics ICLR2025

链接: https://arxiv.org/abs/2502.08696
作者: Sebastian Sanokowski,Wilhelm Berghammer,Martin Ennemoser,Haoyu Peter Wang,Sepp Hochreiter,Sebastian Lehner
类目: Machine Learning (cs.LG); Statistical Mechanics (cond-mat.stat-mech); Artificial Intelligence (cs.AI); Computational Physics (physics.comp-ph); Machine Learning (stat.ML)
*备注: Accepted at ICLR 2025

点击查看摘要

Abstract:Learning to sample from complex unnormalized distributions over discrete domains emerged as a promising research direction with applications in statistical physics, variational inference, and combinatorial optimization. Recent work has demonstrated the potential of diffusion models in this domain. However, existing methods face limitations in memory scaling and thus the number of attainable diffusion steps since they require backpropagation through the entire generative process. To overcome these limitations we introduce two novel training methods for discrete diffusion samplers, one grounded in the policy gradient theorem and the other one leveraging Self-Normalized Neural Importance Sampling (SN-NIS). These methods yield memory-efficient training and achieve state-of-the-art results in unsupervised combinatorial optimization. Numerous scientific applications additionally require the ability of unbiased sampling. We introduce adaptations of SN-NIS and Neural Markov Chain Monte Carlo that enable for the first time the application of discrete diffusion models to this problem. We validate our methods on Ising model benchmarks and find that they outperform popular autoregressive approaches. Our work opens new avenues for applying diffusion models to a wide range of scientific applications in discrete domains that were hitherto restricted to exact likelihood models.

[AI-77] AgentSociety: Large-Scale Simulation of LLM-Driven Generative Agents Advances Understanding of Human Behaviors and Society

链接: https://arxiv.org/abs/2502.08691
作者: Jinghua Piao,Yuwei Yan,Jun Zhang,Nian Li,Junbo Yan,Xiaochong Lan,Zhihong Lu,Zhiheng Zheng,Jing Yi Wang,Di Zhou,Chen Gao,Fengli Xu,Fang Zhang,Ke Rong,Jun Su,Yong Li
类目: ocial and Information Networks (cs.SI); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Understanding human behavior and society is a central focus in social sciences, with the rise of generative social science marking a significant paradigmatic shift. By leveraging bottom-up simulations, it replaces costly and logistically challenging traditional experiments with scalable, replicable, and systematic computational approaches for studying complex social dynamics. Recent advances in large language models (LLMs) have further transformed this research paradigm, enabling the creation of human-like generative social agents and realistic simulacra of society. In this paper, we propose AgentSociety, a large-scale social simulator that integrates LLM-driven agents, a realistic societal environment, and a powerful large-scale simulation engine. Based on the proposed simulator, we generate social lives for over 10k agents, simulating their 5 million interactions both among agents and between agents and their environment. Furthermore, we explore the potential of AgentSociety as a testbed for computational social experiments, focusing on four key social issues: polarization, the spread of inflammatory messages, the effects of universal basic income policies, and the impact of external shocks such as hurricanes. These four issues serve as valuable cases for assessing AgentSociety’s support for typical research methods – such as surveys, interviews, and interventions – as well as for investigating the patterns, causes, and underlying mechanisms of social issues. The alignment between AgentSociety’s outcomes and real-world experimental results not only demonstrates its ability to capture human behaviors and their underlying mechanisms, but also underscores its potential as an important platform for social scientists and policymakers.

[AI-78] Advancing machine fault diagnosis: A detailed examination of convolutional neural networks

链接: https://arxiv.org/abs/2502.08689
作者: Govind Vashishtha,Sumika Chauhan,Mert Sehri,Justyna Hebda-Sobkowicz,Radoslaw Zimroz,Patrick Dumond,Rajesh Kumar
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:The growing complexity of machinery and the increasing demand for operational efficiency and safety have driven the development of advanced fault diagnosis techniques. Among these, convolutional neural networks (CNNs) have emerged as a powerful tool, offering robust and accurate fault detection and classification capabilities. This comprehensive review delves into the application of CNNs in machine fault diagnosis, covering its theoretical foundation, architectural variations, and practical implementations. The strengths and limitations of CNNs are analyzed in this domain, discussing their effectiveness in handling various fault types, data complexities, and operational environments. Furthermore, we explore the evolving landscape of CNN-based fault diagnosis, examining recent advancements in data augmentation, transfer learning, and hybrid architectures. Finally, we highlight future research directions and potential challenges to further enhance the application of CNNs for reliable and proactive machine fault diagnosis.

[AI-79] EEG Artifact Detection and Correction with Deep Autoencoders

Link: https://arxiv.org/abs/2502.08686
Authors: David Aquilué-Llorens,Aureli Soria-Frisch
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*Comments:

Abstract:EEG signals convey important information about brain activity both in healthy and pathological conditions. However, they are inherently noisy, which poses significant challenges for accurate analysis and interpretation. Traditional EEG artifact removal methods, while effective, often require extensive expert intervention. This study presents LSTEEG, a novel LSTM-based autoencoder designed for the detection and correction of artifacts in EEG signals. Leveraging deep learning, particularly LSTM layers, LSTEEG captures non-linear dependencies in sequential EEG data. LSTEEG demonstrates superior performance in both artifact detection and correction tasks compared to other state-of-the-art convolutional autoencoders. Our methodology enhances the interpretability and utility of the autoencoder's latent space, enabling data-driven automated artifact removal in EEG and its application in downstream tasks. This research advances the field of efficient and accurate multi-channel EEG preprocessing, and promotes the implementation and usage of automated EEG analysis pipelines for brain health applications.
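
The paper describes LSTEEG only at a high level; as a rough illustration of the architecture family it belongs to, a minimal LSTM autoencoder for windows of multi-channel EEG might look like the sketch below. The layer sizes and the repeat-last-state decoding scheme are illustrative assumptions, not LSTEEG's actual design:

```python
import torch
import torch.nn as nn

class LSTMAutoencoder(nn.Module):
    """Minimal LSTM autoencoder for (batch, time, channels) EEG windows."""
    def __init__(self, n_channels: int, hidden_dim: int = 64):
        super().__init__()
        self.encoder = nn.LSTM(n_channels, hidden_dim, batch_first=True)
        self.decoder = nn.LSTM(hidden_dim, hidden_dim, batch_first=True)
        self.head = nn.Linear(hidden_dim, n_channels)

    def forward(self, x):
        # Compress the sequence; keep the final hidden state as the latent code
        _, (h, _) = self.encoder(x)
        # Repeat the latent state as decoder input at every time step
        dec_in = h[-1].unsqueeze(1).repeat(1, x.size(1), 1)
        y, _ = self.decoder(dec_in)
        return self.head(y)                     # reconstructed EEG window

model = LSTMAutoencoder(n_channels=32)
x = torch.randn(8, 256, 32)                     # 8 windows, 256 samples, 32 channels
loss = nn.functional.mse_loss(model(x), x)      # reconstruction objective
```

Training such a model to minimize reconstruction error on clean EEG is the usual basis for flagging high-error windows as artifacts and replacing them with the reconstruction.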

[AI-80] Beyond Models! Explainable Data Valuation and Metric Adaption for Recommendation

Link: https://arxiv.org/abs/2502.08685
Authors: Renqi Jia,Xiaokun Zhang,Bowei He,Qiannan Zhu,Weitao Xu,Jiehao Chen,Chen Ma
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*Comments:

Abstract:User behavior records serve as the foundation for recommender systems. While behavior data is easy to acquire, it often suffers from varying quality. Current methods employ data valuation to discern high-quality data from low-quality data. However, they tend to employ black-box designs, lacking transparency and interpretability. Besides, they are typically tailored to specific evaluation metrics, leading to limited generality across various tasks. To overcome these issues, we propose an explainable and versatile framework DVR which can enhance the efficiency of data utilization tailored to any requirements of the model architectures and evaluation metrics. For explainable data valuation, a data valuator is presented to evaluate the data quality via calculating its Shapley value from the game-theoretic perspective, ensuring robust mathematical properties and reliability. In order to accommodate various evaluation metrics, including differentiable and non-differentiable ones, a metric adapter is devised based on reinforcement learning, where a metric is treated as the reinforcement reward that guides model optimization. Extensive experiments conducted on various benchmarks verify that our framework can improve the performance of current recommendation algorithms on various metrics including ranking accuracy, diversity, and fairness. Specifically, our framework achieves up to 34.7% improvement over existing methods in terms of the representative NDCG metric. The code is available at this https URL.
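
The abstract's game-theoretic valuator assigns each training example a Shapley value. A standard Monte Carlo permutation estimator conveys the idea; the logistic-regression model and validation-accuracy utility below are placeholder assumptions, not DVR's implementation:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def shapley_values(X, y, X_val, y_val, n_perms=50, seed=0):
    """Monte Carlo permutation estimate of per-example Shapley values,
    using validation accuracy as the utility function."""
    rng = np.random.default_rng(seed)
    n = len(X)
    values = np.zeros(n)

    def utility(idx):
        if len(np.unique(y[idx])) < 2:           # cannot fit a classifier yet
            return 0.0
        clf = LogisticRegression(max_iter=200).fit(X[idx], y[idx])
        return clf.score(X_val, y_val)

    for _ in range(n_perms):
        perm = rng.permutation(n)
        prev = 0.0
        for k in range(1, n + 1):
            curr = utility(perm[:k])
            values[perm[k - 1]] += curr - prev   # marginal contribution
            prev = curr
    return values / n_perms

rng = np.random.default_rng(1)
X = rng.normal(size=(30, 5)); y = (X[:, 0] > 0).astype(int)
vals = shapley_values(X, y, X, y, n_perms=5)     # toy self-validation example
```

In practice such estimators are truncated or batched, since every marginal contribution requires retraining the model.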

[AI-81] Self-Evaluation for Job-Shop Scheduling

Link: https://arxiv.org/abs/2502.08684
Authors: Imanol Echeverria,Maialen Murua,Roberto Santana
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*Comments:

Abstract:Combinatorial optimization problems, such as scheduling and route planning, are crucial in various industries but are computationally intractable due to their NP-hard nature. Neural Combinatorial Optimization methods leverage machine learning to address these challenges but often depend on sequential decision-making, which is prone to error accumulation as small mistakes propagate throughout the process. Inspired by self-evaluation techniques in Large Language Models, we propose a novel framework that generates and evaluates subsets of assignments, moving beyond traditional stepwise approaches. Applied to the Job-Shop Scheduling Problem, our method integrates a heterogeneous graph neural network with a Transformer to build a policy model and a self-evaluation function. Experimental validation on challenging, well-known benchmarks demonstrates the effectiveness of our approach, surpassing state-of-the-art methods.

[AI-82] On the Role of Pre-trained Embeddings in Binary Code Analysis

Link: https://arxiv.org/abs/2502.08682
Authors: Alwin Maier,Felix Weissberg,Konrad Rieck
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*Comments:

Abstract:Deep learning has enabled remarkable progress in binary code analysis. In particular, pre-trained embeddings of assembly code have become a gold standard for solving analysis tasks, such as measuring code similarity or recognizing functions. These embeddings are capable of learning a vector representation from unlabeled code. In contrast to natural language processing, however, label information is not scarce for many tasks in binary code analysis. For example, labeled training data for function boundaries, optimization levels, and argument types can be easily derived from debug information provided by a compiler. Consequently, the main motivation of embeddings does not transfer directly to binary code analysis. In this paper, we explore the role of pre-trained embeddings from a critical perspective. To this end, we systematically evaluate recent embeddings for assembly code on five downstream tasks using a corpus of 1.2 million functions from the Debian distribution. We observe that several embeddings perform similarly when sufficient labeled data is available, and that differences reported in prior work are hardly noticeable. Surprisingly, we find that end-to-end learning without pre-training performs best on average, which calls into question the need for specialized embeddings. By varying the amount of labeled data, we eventually derive guidelines for when embeddings offer advantages and when end-to-end learning is preferable for binary code analysis.

[AI-83] Centrally Coordinated Multi-Agent Reinforcement Learning for Power Grid Topology Control

Link: https://arxiv.org/abs/2502.08681
Authors: Barbera de Mol,Davide Barbieri,Jan Viebahn,Davide Grossi
Subjects: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*Comments:

Abstract:Power grid operation is becoming more complex due to the increase in generation of renewable energy. The recent series of Learning To Run a Power Network (L2RPN) competitions has encouraged the use of artificial agents to assist human dispatchers in operating power grids. However, the combinatorial nature of the action space poses a challenge to both conventional optimizers and learned controllers. Action space factorization, which breaks down decision-making into smaller sub-tasks, is one approach to tackle the curse of dimensionality. In this study, we propose a centrally coordinated multi-agent (CCMA) architecture for action space factorization. In this approach, regional agents propose actions and subsequently a coordinating agent selects the final action. We investigate several implementations of the CCMA architecture, and benchmark them in different experimental settings against various L2RPN baseline approaches. The CCMA architecture exhibits higher sample efficiency and superior final performance than the baseline approaches. The results suggest high potential of the CCMA approach for further application in higher-dimensional L2RPN as well as real-world power grid settings.

[AI-84] Deep Learning-Driven Malware Classification with API Call Sequence Analysis and Concept Drift Handling

Link: https://arxiv.org/abs/2502.08679
Authors: Bishwajit Prasad Gond,Durga Prasad Mohapatra
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
*Comments:

Abstract:Malware classification in dynamic environments presents a significant challenge due to concept drift, where the statistical properties of malware data evolve over time, complicating detection efforts. To address this issue, we propose a deep learning framework enhanced with a genetic algorithm to improve malware classification accuracy and adaptability. Our approach incorporates mutation operations and fitness score evaluations within genetic algorithms to continuously refine the deep learning model, ensuring robustness against evolving malware threats. Experimental results demonstrate that this hybrid method significantly enhances classification performance and adaptability, outperforming traditional static models. Our proposed approach offers a promising solution for real-time malware classification in ever-changing cybersecurity landscapes.
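The mutation-and-fitness loop described can be sketched generically as follows; the quadratic toy fitness stands in for the validation accuracy of the deep malware classifier, whose details the abstract does not specify:

```python
import numpy as np

def evolve(fitness, init_pop, n_gens=20, mut_std=0.1, seed=0):
    """Toy genetic loop: mutate candidate parameter vectors, score them
    with a fitness function, and keep the fittest individuals."""
    rng = np.random.default_rng(seed)
    pop = np.array(init_pop, dtype=float)
    for _ in range(n_gens):
        children = pop + rng.normal(0.0, mut_std, pop.shape)   # mutation
        everyone = np.vstack([pop, children])
        scores = np.array([fitness(p) for p in everyone])      # fitness evaluation
        pop = everyone[np.argsort(scores)[-len(pop):]]         # survival of the fittest
    return pop[-1]

# Example: maximize a concave surrogate "validation accuracy"
best = evolve(lambda p: -np.sum((p - 0.3) ** 2), init_pop=np.zeros((8, 4)))
```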

[AI-85] High-Throughput SAT Sampling

Link: https://arxiv.org/abs/2502.08673
Authors: Arash Ardakani,Minwoo Kang,Kevin He,Qijing Huang,John Wawrzynek
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*Comments: 7 pages

Abstract:In this work, we present a novel technique for GPU-accelerated Boolean satisfiability (SAT) sampling. Unlike conventional sampling algorithms that directly operate on conjunctive normal form (CNF), our method transforms the logical constraints of SAT problems by factoring their CNF representations into simplified multi-level, multi-output Boolean functions. It then leverages gradient-based optimization to guide the search for a diverse set of valid solutions. Our method operates directly on the circuit structure of refactored SAT instances, reinterpreting the SAT problem as a supervised multi-output regression task. This differentiable technique enables independent bit-wise operations on each tensor element, allowing parallel execution of learning processes. As a result, we achieve GPU-accelerated sampling with significant runtime improvements ranging from 33.6× to 523.6× over state-of-the-art heuristic samplers. We demonstrate the superior performance of our sampling method through an extensive evaluation on 60 instances from a public domain benchmark suite utilized in previous studies.
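
As a toy illustration of gradient-guided SAT sampling, one can relax each Boolean variable to a probability and minimize a differentiable penalty on falsified clauses. Note the paper operates on refactored multi-level circuits rather than raw CNF, so this sketch only conveys the underlying idea:

```python
import torch

# Toy CNF: (x0 or ~x1) and (x1 or x2) and (~x0 or x2)
# Each clause is a list of (var_index, is_positive) literals.
cnf = [[(0, True), (1, False)], [(1, True), (2, True)], [(0, False), (2, True)]]

theta = torch.zeros(3, requires_grad=True)          # one logit per variable
opt = torch.optim.Adam([theta], lr=0.1)

for step in range(200):
    p = torch.sigmoid(theta)                        # P(var = True)
    loss = 0.0
    for clause in cnf:
        # P(clause falsified) = product over literals of P(literal is false)
        p_false = torch.ones(())
        for v, pos in clause:
            p_false = p_false * (1 - p[v] if pos else p[v])
        loss = loss + (-torch.log(1 - p_false + 1e-9))
    opt.zero_grad(); loss.backward(); opt.step()

assignment = (torch.sigmoid(theta) > 0.5).tolist()  # rounded candidate solution
```

Restarting from random logits (or injecting noise) yields different valid assignments, which is what makes a gradient formulation useful for sampling rather than single-solution solving.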

[AI-86] Motion Forecasting for Autonomous Vehicles: A Survey

Link: https://arxiv.org/abs/2502.08664
Authors: Jianxin Shi,Jinhao Chen,Yuandong Wang,Li Sun,Chunyang Liu,Wei Xiong,Tianyu Wo
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI)
*Comments: 31 pages, 7 figures

Abstract:In recent years, the field of autonomous driving has attracted increasingly significant public interest. Accurately forecasting the future behavior of various traffic participants is essential for the decision-making of Autonomous Vehicles (AVs). In this paper, we focus on both scenario-based and perception-based motion forecasting for AVs. We propose a formal problem formulation for motion forecasting and summarize the main challenges confronting this area of research. We also detail representative datasets and evaluation metrics pertinent to this field. Furthermore, this study classifies recent research into two main categories: supervised learning and self-supervised learning, reflecting the evolving paradigms in both scenario-based and perception-based motion forecasting. In the context of supervised learning, we thoroughly examine and analyze each key element of the methodology. For self-supervised learning, we summarize commonly adopted techniques. The paper concludes and discusses potential research directions, aiming to propel progress in this vital area of AV technology.

[AI-87] Analyzable Parameters Dominated Vehicle Platoon Dynamics Modeling and Analysis: A Physics-Encoded Deep Learning Approach

Link: https://arxiv.org/abs/2502.08658
Authors: Hao Lyu,Yanyong Guo,Pan Liu,Shuo Feng,Weilin Ren,Quansheng Yue
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI)
*Comments:

Abstract:Recently, artificial intelligence (AI)-enabled nonlinear vehicle platoon dynamics modeling has come to play a crucial role in predicting and optimizing the interactions between vehicles. Existing efforts lack the extraction and capture of vehicle behavior interaction features at the platoon scale. More importantly, maintaining high modeling accuracy without losing physical analyzability remains to be solved. To this end, this paper proposes a novel physics-encoded deep learning network, named PeMTFLN, to model nonlinear vehicle platoon dynamics. Specifically, an analyzable parameters encoded computational graph (APeCG) is designed to guide the platoon to respond to the driving behavior of the lead vehicle while ensuring local stability. Besides, a multi-scale trajectory feature learning network (MTFLN) is constructed to capture platoon following patterns and infer the physical parameters required for APeCG from trajectory data. The human-driven vehicle trajectory dataset HIGHSIM was used to train the proposed PeMTFLN. The trajectory prediction experiments show that PeMTFLN exhibits superior performance compared to the baseline models in terms of predictive accuracy in speed and gap. The stability analysis shows that the physical parameters in APeCG are able to reproduce platoon stability in real-world conditions. In simulation experiments, PeMTFLN achieves low inference error in platoon trajectory generation. Moreover, PeMTFLN also accurately reproduces ground-truth safety statistics. The code of the proposed PeMTFLN is open source.

[AI-88] Personalizing Education through an Adaptive LMS with Integrated LLMs

Link: https://arxiv.org/abs/2502.08655
Authors: Kyle Spriggs,Meng Cheng Lau,Kalpdrum Passi
Subjects: Artificial Intelligence (cs.AI)
*Comments:

Abstract:The widespread adoption of large language models (LLMs) marks a transformative era in technology, especially within the educational sector. This paper explores the integration of LLMs within learning management systems (LMSs) to develop an adaptive learning management system (ALMS) personalized for individual learners across various educational stages. Traditional LMSs, while facilitating the distribution of educational materials, fall short in addressing the nuanced needs of diverse student populations, particularly in settings with limited instructor availability. Our proposed system leverages the flexibility of AI to provide a customizable learning environment that adjusts to each user’s evolving needs. By integrating a suite of general-purpose and domain-specific LLMs, this system aims to minimize common issues such as factual inaccuracies and outdated information, characteristic of general LLMs like OpenAI’s ChatGPT. This paper details the development of an ALMS that not only addresses privacy concerns and the limitations of existing educational tools but also enhances the learning experience by maintaining engagement through personalized educational content.

[AI-89] LegalScore: Development of a Benchmark for Evaluating AI Models in Legal Career Exams in Brazil

Link: https://arxiv.org/abs/2502.08652
Authors: Roberto Caparroz,Marcelo Roitman,Beatriz G. Chow,Caroline Giusti,Larissa Torhacs,Pedro A. Sola,João H. M. Diogo,Luiza Balby,Carolina D. L. Vasconcelos,Leonardo R. Caparroz,Albano P. Franco
Subjects: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
*Comments: Main article 25 pages, Appendices from page 26

Abstract:This research introduces LegalScore, a specialized index for assessing how generative artificial intelligence models perform in a selected range of career exams that require a legal background in Brazil. The index evaluates the performance of fourteen different artificial intelligence models, from proprietary to open-source ones, in answering the objective questions applied in these exams. The research examines how English-trained large language models respond when applied to Brazilian legal contexts, leading us to reflect on the importance of and need for Brazil-specific training data in generative artificial intelligence models. Performance analysis shows that while proprietary and better-known models achieved better results overall, local and smaller models indicated promising performance due to their alignment with the Brazilian context in training. By establishing an evaluation framework with metrics including accuracy, confidence intervals, and normalized scoring, LegalScore enables systematic assessment of artificial intelligence performance in legal examinations in Brazil. While the study demonstrates artificial intelligence's potential value for exam preparation and question development, it concludes that significant improvements are needed before AI can match human performance in advanced legal assessments. The benchmark creates a foundation for continued research, highlighting the importance of local adaptation in artificial intelligence development.

[AI-90] Cracking the Code: Enhancing Development finance understanding with artificial intelligence

Link: https://arxiv.org/abs/2502.09495
Authors: Pierre Beaucoral
Subjects: General Economics (econ.GN); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*Comments:

Abstract:Analyzing development projects is crucial for understanding donors' aid strategies and recipients' priorities, and for assessing the capacity of development finance to address development issues through on-the-ground actions. In this area, the Organisation for Economic Co-operation and Development's (OECD) Creditor Reporting System (CRS) dataset is a reference data source. This dataset provides a vast collection of project narratives from various sectors (approximately 5 million projects). While the OECD CRS provides a rich source of information on development strategies, it falls short in informing project purposes due to its reporting process based on donors' self-declared main objectives and pre-defined industrial sectors. This research employs a novel approach that combines Machine Learning (ML) techniques, specifically Natural Language Processing (NLP), with an innovative Python topic modeling technique called BERTopic, to categorise (cluster) and label development projects based on their narrative descriptions. By revealing existing yet hidden topics of development finance, this application of artificial intelligence enables a better understanding of donor priorities and overall development funding, and provides methods to analyse public and private project narratives.

[AI-91] Transformer-Enhanced Variational Autoencoder for Crystal Structure Prediction

Link: https://arxiv.org/abs/2502.09423
Authors: Ziyi Chen,Yang Yuan,Siming Zheng,Jialong Guo,Sihan Liang,Yangang Wang,Zongguo Wang
Subjects: Materials Science (cond-mat.mtrl-sci); Artificial Intelligence (cs.AI)
*Comments:

Abstract:Crystal structure forms the foundation for understanding the physical and chemical properties of materials. Generative models have emerged as a new paradigm in crystal structure prediction (CSP); however, accurately capturing key characteristics of crystal structures, such as periodicity and symmetry, remains a significant challenge. In this paper, we propose a Transformer-Enhanced Variational Autoencoder for Crystal Structure Prediction (TransVAE-CSP), which learns the characteristic distribution space of stable materials, enabling both the reconstruction and generation of crystal structures. TransVAE-CSP integrates adaptive distance expansion with irreducible representation to effectively capture the periodicity and symmetry of crystal structures, and the encoder is a transformer network based on an equivariant dot product attention mechanism. Experimental results on the carbon_24, perov_5, and mp_20 datasets demonstrate that TransVAE-CSP outperforms existing methods in structure reconstruction and generation tasks under various modeling metrics, offering a powerful tool for crystal structure design and optimization.

Machine Learning

[LG-0] Censor Dependent Variational Inference

Link: https://arxiv.org/abs/2502.09591
Authors: Chuanhui Liu,Xiao Wang
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
*Comments:

Abstract:This paper provides a comprehensive analysis of variational inference in latent variable models for survival analysis, emphasizing the distinctive challenges associated with applying variational methods to survival data. We identify a critical weakness in the existing methodology, demonstrating how a poorly designed variational distribution may hinder the objective of survival analysis tasks–modeling time-to-event distributions. We prove that the optimal variational distribution, which perfectly bounds the log-likelihood, may depend on the censoring mechanism. To address this issue, we propose censor-dependent variational inference (CDVI), tailored for latent variable models in survival analysis. More practically, we introduce CD-CVAE, a V-structure Variational Autoencoder (VAE) designed for the scalable implementation of CDVI. Further discussion extends some existing theories and training techniques to survival analysis. Extensive experiments validate our analysis and demonstrate significant improvements in the estimation of individual survival distributions.

[LG-1] Rolling Ahead Diffusion for Traffic Scene Simulation AAAI2025

Link: https://arxiv.org/abs/2502.09587
Authors: Yunpeng Liu,Matthew Niedoba,William Harvey,Adam Scibior,Berend Zwartsenberg,Frank Wood
Subjects: Machine Learning (cs.LG); Robotics (cs.RO)
*Comments: Accepted to Workshop on Machine Learning for Autonomous Driving at AAAI 2025

Abstract:Realistic driving simulation requires that NPCs not only mimic natural driving behaviors but also react to the behavior of other simulated agents. Recent developments in diffusion-based scenario generation focus on creating diverse and realistic traffic scenarios by jointly modelling the motion of all the agents in the scene. However, these traffic scenarios do not react when the motion of agents deviates from their modelled trajectories. For example, the ego-agent can be controlled by a stand-alone motion planner. To produce reactive scenarios with joint scenario models, the model must regenerate the scenario at each timestep based on new observations in a Model Predictive Control (MPC) fashion. Although reactive, this method is time-consuming, as one complete possible future for all NPCs is generated per simulation step. Alternatively, one can utilize an autoregressive model (AR) to predict only the immediate next-step future for all NPCs. Although faster, this method lacks the capability for advanced planning. We present a rolling diffusion based traffic scene generation model which mixes the benefits of both methods by predicting the immediate next step while simultaneously predicting partially noised further future steps. We show that such a model is efficient compared to diffusion-based AR, achieving a beneficial compromise between reactivity and computational efficiency.
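
Schematically, the rolling scheme keeps a buffer of future steps at increasing noise levels, advances all of them by one denoising increment per simulation tick, and emits the now-clean head of the buffer. A minimal sketch, with a `denoise(x, noise_level)` callable standing in for the learned diffusion model:

```python
import numpy as np

def rolling_step(buffer, denoise, noise_levels):
    """One simulation tick of a rolling-window diffusion sampler.
    buffer[k] holds the future at horizon k, noised at noise_levels[k]
    (increasing in k). Each tick partially denoises every slot, emits the
    now-clean first slot, and appends a fresh pure-noise far-future slot."""
    buffer = [denoise(x, s) for x, s in zip(buffer, noise_levels)]
    next_state = buffer.pop(0)                          # fully denoised prediction
    buffer.append(np.random.randn(*next_state.shape))   # new far-future slot
    return next_state, buffer

# Toy usage: a contraction toward zero stands in for the learned denoiser
buffer = [np.random.randn(4) for _ in range(3)]
state, buffer = rolling_step(buffer, lambda x, s: 0.9 * x, [0.1, 0.5, 1.0])
```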

[LG-2] Learning to Coordinate with Experts

Link: https://arxiv.org/abs/2502.09583
Authors: Mohamad H. Danesh,Tu Trinh,Benjamin Plaut,Nguyen X. Khanh
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
*Comments:

Abstract:When deployed in dynamic environments, AI agents will inevitably encounter challenges that exceed their individual capabilities. Leveraging assistance from expert agents, whether human or AI, can significantly enhance safety and performance in such situations. However, querying experts is often costly, necessitating the development of agents that can efficiently request and utilize expert guidance. In this paper, we introduce a fundamental coordination problem called Learning to Yield and Request Control (YRC), where the objective is to learn a strategy that determines when to act autonomously and when to seek expert assistance. We consider a challenging practical setting in which an agent does not interact with experts during training but must adapt to novel environmental changes and expert interventions at test time. To facilitate empirical research, we introduce YRC-Bench, an open-source benchmark featuring diverse domains. YRC-Bench provides a standardized Gym-like API, simulated experts, an evaluation pipeline, and implementations of competitive baselines. Towards tackling the YRC problem, we propose a novel validation approach and investigate the performance of various learning methods across diverse environments, yielding insights that can guide future research.

[LG-3] DiffMS: Diffusion Generation of Molecules Conditioned on Mass Spectra

Link: https://arxiv.org/abs/2502.09571
Authors: Montgomery Bohde,Mrunali Manjrekar,Runzhong Wang,Shuiwang Ji,Connor W. Coley
Subjects: Machine Learning (cs.LG); Quantitative Methods (q-bio.QM)
*Comments: Preprint

Abstract:Mass spectrometry plays a fundamental role in elucidating the structures of unknown molecules and subsequent scientific discoveries. One formulation of the structure elucidation task is the conditional de novo generation of molecular structure given a mass spectrum. Toward a more accurate and efficient scientific discovery pipeline for small molecules, we present DiffMS, a formula-restricted encoder-decoder generative network that achieves state-of-the-art performance on this task. The encoder utilizes a transformer architecture and models mass spectra domain knowledge such as peak formulae and neutral losses, and the decoder is a discrete graph diffusion model restricted by the heavy-atom composition of a known chemical formula. To develop a robust decoder that bridges latent embeddings and molecular structures, we pretrain the diffusion decoder with fingerprint-structure pairs, which are available in virtually infinite quantities, compared to structure-spectrum pairs that number in the tens of thousands. Extensive experiments on established benchmarks show that DiffMS outperforms existing models on de novo molecule generation. We provide several ablations to demonstrate the effectiveness of our diffusion and pretraining approaches and show consistent performance scaling with increasing pretraining dataset size. DiffMS code is publicly available at this https URL.

[LG-4] Enhancing the Utility of Higher-Order Information in Relational Learning

Link: https://arxiv.org/abs/2502.09570
Authors: Raphael Pellegrin,Lukas Fesser,Melanie Weber
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
*Comments:

Abstract:Higher-order information is crucial for relational learning in many domains where relationships extend beyond pairwise interactions. Hypergraphs provide a natural framework for modeling such relationships, which has motivated recent extensions of graph neural network architectures to hypergraphs. However, comparisons between hypergraph architectures and standard graph-level models remain limited. In this work, we systematically evaluate a selection of hypergraph-level and graph-level architectures, to determine their effectiveness in leveraging higher-order information in relational learning. Our results show that graph-level architectures applied to hypergraph expansions often outperform hypergraph-level ones, even on inputs that are naturally parametrized as hypergraphs. As an alternative approach for leveraging higher-order information, we propose hypergraph-level encodings based on classical hypergraph characteristics. While these encodings do not significantly improve hypergraph architectures, they yield substantial performance gains when combined with graph-level models. Our theoretical analysis shows that hypergraph-level encodings provably increase the representational power of message-passing graph neural networks beyond that of their graph-level counterparts.
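
Two ingredients from the abstract are easy to make concrete: expanding a hypergraph into a graph so that graph-level models apply, and attaching node encodings derived from classical hypergraph characteristics. The specific characteristics below (hyperedge degree and mean incident-edge size) are illustrative choices, not necessarily those used in the paper:

```python
import numpy as np

def clique_expansion(n_nodes, hyperedges):
    """Adjacency matrix of the clique expansion: every hyperedge becomes
    a clique over its members, so graph-level models can consume it."""
    A = np.zeros((n_nodes, n_nodes))
    for edge in hyperedges:
        for u in edge:
            for v in edge:
                if u != v:
                    A[u, v] = 1.0
    return A

def hypergraph_encodings(n_nodes, hyperedges):
    """Simple node encodings from hypergraph characteristics:
    hyperedge degree and mean size of incident hyperedges."""
    deg = np.zeros(n_nodes)
    size_sum = np.zeros(n_nodes)
    for edge in hyperedges:
        for u in edge:
            deg[u] += 1
            size_sum[u] += len(edge)
    mean_size = np.divide(size_sum, np.maximum(deg, 1))
    return np.stack([deg, mean_size], axis=1)   # (n_nodes, 2) feature matrix

H = [(0, 1, 2), (2, 3), (0, 2, 3)]
A = clique_expansion(4, H)
enc = hypergraph_encodings(4, H)
```

Concatenating such encodings to node features is what allows a plain message-passing GNN on the expansion to retain some of the higher-order structure the expansion itself discards.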

[LG-5] SyntheticPop: Attacking Speaker Verification Systems With Synthetic VoicePops

Link: https://arxiv.org/abs/2502.09553
Authors: Eshaq Jamdar,Amith Kamath Belman
Subjects: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*Comments:

Abstract:Voice Authentication (VA), also known as Automatic Speaker Verification (ASV), is a widely adopted authentication method, particularly in automated systems like banking services, where it serves as a secondary layer of user authentication. Despite its popularity, VA systems are vulnerable to various attacks, including replay, impersonation, and the emerging threat of deepfake audio that mimics the voice of legitimate users. To mitigate these risks, several defense mechanisms have been proposed. One such solution, Voice Pops, aims to distinguish an individual’s unique phoneme pronunciations during the enrollment process. While promising, the effectiveness of VA+VoicePop against a broader range of attacks, particularly logical or adversarial attacks, remains insufficiently explored. We propose a novel attack method, which we refer to as SyntheticPop, designed to target the phoneme recognition capabilities of the VA+VoicePop system. The SyntheticPop attack involves embedding synthetic “pop” noises into spoofed audio samples, significantly degrading the model’s performance. We achieve an attack success rate of over 95% while poisoning 20% of the training dataset. Our experiments demonstrate that VA+VoicePop achieves 69% accuracy under normal conditions, 37% accuracy when subjected to a baseline label flipping attack, and just 14% accuracy under our proposed SyntheticPop attack, emphasizing the effectiveness of our method.
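
The attack's core operation, embedding synthetic "pop" noises into spoofed audio, can be sketched as adding a short decaying burst to the waveform; the burst shape and parameters below are illustrative assumptions rather than the paper's exact synthesis:

```python
import numpy as np

def add_synthetic_pop(audio, sr, t_pop=0.05, dur=0.01, gain=0.8, seed=0):
    """Embed a short burst ('pop') into a waveform at time t_pop seconds.
    The burst here is an exponentially decaying noise click; the actual
    SyntheticPop synthesis may differ."""
    rng = np.random.default_rng(seed)
    out = audio.astype(float).copy()
    n0, n = int(t_pop * sr), int(dur * sr)
    burst = rng.standard_normal(n) * np.exp(-np.linspace(0, 6, n))
    out[n0:n0 + n] += gain * burst
    return np.clip(out, -1.0, 1.0)

poisoned = add_synthetic_pop(np.zeros(16000), sr=16000)   # 1 s of silence at 16 kHz
```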

[LG-6] Fast Tensor Completion via Approximate Richardson Iteration

Link: https://arxiv.org/abs/2502.09534
Authors: Mehrdad Ghadiri,Matthew Fahrbach,Yunbum Kook,Ali Jadbabaie
Subjects: Data Structures and Algorithms (cs.DS); Machine Learning (cs.LG); Statistics Theory (math.ST)
*Comments: 20 pages, 4 figures

Abstract:We study tensor completion (TC) through the lens of low-rank tensor decomposition (TD). Many TD algorithms use fast alternating minimization methods, which solve highly structured linear regression problems at each step (e.g., for CP, Tucker, and tensor-train decompositions). However, such algebraic structure is lost in TC regression problems, making direct extensions unclear. To address this, we propose a lifting approach that approximately solves TC regression problems using structured TD regression algorithms as blackbox subroutines, enabling sublinear-time methods. We theoretically analyze the convergence rate of our approximate Richardson iteration based algorithm, and we demonstrate on real-world tensors that its running time can be 100x faster than direct methods for CP completion.
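
The classical Richardson iteration that the method approximates updates x <- x + omega (b - A x); the paper replaces the exact residual correction with fast, structured-but-approximate TD regression solves. A generic sketch, with `solve_approx` standing in for such an inexact solver:

```python
import numpy as np

def richardson(apply_A, b, omega, n_iters=100, solve_approx=None):
    """(Approximate) Richardson iteration for A x = b:
    x <- x + omega * M(b - A x), where M is an inexact solver or
    preconditioner (identity recovers classical Richardson)."""
    x = np.zeros_like(b)
    for _ in range(n_iters):
        r = b - apply_A(x)                 # residual
        x = x + omega * (solve_approx(r) if solve_approx else r)
    return x

# Example with a small well-conditioned SPD system
A = np.array([[2.0, 0.5], [0.5, 1.5]])
b = np.array([1.0, 2.0])
x = richardson(lambda v: A @ v, b, omega=0.5)
```

Convergence requires the spectral radius of (I - omega M A) to be below 1, which is why the accuracy of the blackbox subroutine directly controls the number of outer iterations.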

[LG-7] Robust Learning of Multi-index Models via Iterative Subspace Approximation

Link: https://arxiv.org/abs/2502.09525
Authors: Ilias Diakonikolas,Giannis Iakovidis,Daniel M. Kane,Nikos Zarifis
Subjects: Machine Learning (cs.LG); Data Structures and Algorithms (cs.DS); Statistics Theory (math.ST); Machine Learning (stat.ML)
*Comments:

Abstract:We study the task of learning Multi-Index Models (MIMs) with label noise under the Gaussian distribution. A K-MIM is any function f that only depends on a K-dimensional subspace. We focus on well-behaved MIMs with finite ranges that satisfy certain regularity properties. Our main contribution is a general robust learner that is qualitatively optimal in the Statistical Query (SQ) model. Our algorithm iteratively constructs better approximations to the defining subspace by computing low-degree moments conditional on the projection to the subspace computed thus far, and adding directions with relatively large empirical moments. This procedure efficiently finds a subspace V so that f(\mathbf{x}) is close to a function of the projection of \mathbf{x} onto V. Conversely, for functions for which these conditional moments do not help, we prove an SQ lower bound suggesting that no efficient learner exists. As applications, we provide faster robust learners for the following concept classes: **Multiclass Linear Classifiers.** We give a constant-factor approximate agnostic learner with sample complexity N = O(d) \cdot 2^{\mathrm{poly}(K/\epsilon)} and computational complexity \mathrm{poly}(N, d). This is the first constant-factor agnostic learner for this class whose complexity is a fixed-degree polynomial in d. **Intersections of Halfspaces.** We give an approximate agnostic learner for this class achieving 0-1 error K \tilde{O}(\mathrm{OPT}) + \epsilon with sample complexity N = O(d^2) \cdot 2^{\mathrm{poly}(K/\epsilon)} and computational complexity \mathrm{poly}(N, d). This is the first agnostic learner for this class with near-linear error dependence and complexity a fixed-degree polynomial in d. Furthermore, we show that in the presence of random classification noise, the complexity of our algorithm scales polynomially with 1/\epsilon.

[LG-8] EQ-VAE: Equivariance Regularized Latent Space for Improved Generative Image Modeling

Link: https://arxiv.org/abs/2502.09509
Authors: Theodoros Kouzelis,Ioannis Kakogeorgiou,Spyros Gidaris,Nikos Komodakis
Subjects: Machine Learning (cs.LG)
*Comments: Preprint

Abstract:Latent generative models have emerged as a leading approach for high-quality image synthesis. These models rely on an autoencoder to compress images into a latent space, followed by a generative model to learn the latent distribution. We identify that existing autoencoders lack equivariance to semantic-preserving transformations like scaling and rotation, resulting in complex latent spaces that hinder generative performance. To address this, we propose EQ-VAE, a simple regularization approach that enforces equivariance in the latent space, reducing its complexity without degrading reconstruction quality. By finetuning pre-trained autoencoders with EQ-VAE, we enhance the performance of several state-of-the-art generative models, including DiT, SiT, REPA and MaskGIT, achieving a 7× speedup on DiT-XL/2 with only five epochs of SD-VAE fine-tuning. EQ-VAE is compatible with both continuous and discrete autoencoders, thus offering a versatile enhancement for a wide range of latent generative models. Project page and code: this https URL.
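
A minimal version of the kind of equivariance penalty described might look as follows, assuming a convolutional autoencoder whose latent is a spatial feature map, and using a 90-degree rotation as the semantic-preserving transformation (EQ-VAE's actual regularizer, which also covers scaling, may differ in detail):

```python
import torch

def eq_regularizer(encoder, decoder, x):
    """Sketch of an equivariance penalty: decoding a transformed latent
    should match transforming the image. tau is applied consistently in
    pixel and latent space (both are (B, C, H, W) tensors)."""
    tau = lambda t: torch.rot90(t, k=1, dims=(-2, -1))
    z = encoder(x)
    x_hat = decoder(tau(z))                  # decode the transformed latent
    return torch.mean((x_hat - tau(x)) ** 2)

# Toy usage with shape-preserving conv stand-ins for encoder/decoder
enc = torch.nn.Conv2d(3, 8, 3, padding=1)
dec = torch.nn.Conv2d(8, 3, 3, padding=1)
penalty = eq_regularizer(enc, dec, torch.randn(2, 3, 32, 32))
```

Adding such a term to the usual reconstruction loss encourages the latent space to transform consistently with the image, which is the property the abstract argues simplifies the latent distribution for the downstream generative model.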

[LG-9] Scalable First-order Method for Certifying Optimal k-Sparse GLMs

Link: https://arxiv.org/abs/2502.09502
Authors: Jiachang Liu,Soroosh Shafiee,Andrea Lodi
Subjects: Machine Learning (cs.LG); Optimization and Control (math.OC)
*Comments:

Abstract:This paper investigates the problem of certifying optimality for sparse generalized linear models (GLMs), where sparsity is enforced through an \ell_0 cardinality constraint. While branch-and-bound (BnB) frameworks can certify optimality by pruning nodes using dual bounds, existing methods for computing these bounds are either computationally intensive or exhibit slow convergence, limiting their scalability to large-scale problems. To address this challenge, we propose a first-order proximal gradient algorithm designed to solve the perspective relaxation of the problem within a BnB framework. Specifically, we formulate the relaxed problem as a composite optimization problem and demonstrate that the proximal operator of the non-smooth component can be computed exactly in log-linear time complexity, eliminating the need to solve a computationally expensive second-order cone program. Furthermore, we introduce a simple restart strategy that enhances convergence speed while maintaining low per-iteration complexity. Extensive experiments on synthetic and real-world datasets show that our approach significantly accelerates dual bound computations and is highly effective in providing optimality certificates for large-scale problems.

[LG-10] Eidetic Learning: an Efficient and Provable Solution to Catastrophic Forgetting

Link: https://arxiv.org/abs/2502.09500
Authors: Nicholas Dronen,Randall Balestriero
Subjects: Machine Learning (cs.LG)
*Comments: 16 pages, 6 figures; code is available at this https URL

Abstract:Catastrophic forgetting – the phenomenon of a neural network learning a task t1 and losing the ability to perform it after being trained on some other task t2 – is a long-standing problem for neural networks [McCloskey and Cohen, 1989]. We present a method, Eidetic Learning, that provably solves catastrophic forgetting. A network trained with Eidetic Learning – here, an EideticNet – requires no rehearsal or replay. We consider successive discrete tasks and show how at inference time an EideticNet automatically routes new instances without auxiliary task information. An EideticNet bears a family resemblance to the sparsely-gated Mixture-of-Experts layer of Shazeer et al. [2016] in that network capacity is partitioned across tasks and the network itself performs data-conditional routing. An EideticNet is easy to implement and train, is efficient, and has time and space complexity linear in the number of parameters. The guarantee of our method holds for normalization layers of modern neural networks during both pre-training and fine-tuning. We show with a variety of network architectures and sets of tasks that EideticNets are immune to forgetting. While the practical benefits of EideticNets are substantial, we believe they can benefit practitioners and theorists alike. The code for training EideticNets is available at this https URL.

[LG-11] On Agnostic PAC Learning in the Small Error Regime

Link: https://arxiv.org/abs/2502.09496
Authors: Julian Asilis,Mikael Møller Høgsgaard,Grigoris Velegkas
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
*Comments: 44 pages

Abstract:Binary classification in the classic PAC model exhibits a curious phenomenon: Empirical Risk Minimization (ERM) learners are suboptimal in the realizable case yet optimal in the agnostic case. Roughly speaking, this owes itself to the fact that non-realizable distributions \mathcal{D} are simply more difficult to learn than realizable distributions – even when one discounts a learner's error by \mathrm{err}(h^*_{\mathcal{D}}), the error of the best hypothesis in \mathcal{H} for \mathcal{D}. Thus, optimal agnostic learners are permitted to incur excess error on (easier-to-learn) distributions \mathcal{D} for which \tau = \mathrm{err}(h^*_{\mathcal{D}}) is small. Recent work of Hanneke, Larsen, and Zhivotovskiy (FOCS '24) addresses this shortcoming by including \tau itself as a parameter in the agnostic error term. In this more fine-grained model, they demonstrate tightness of the error lower bound \tau + \Omega\left(\sqrt{\frac{\tau (d + \log(1/\delta))}{m}} + \frac{d + \log(1/\delta)}{m}\right) in a regime where \tau \gg d/m, and leave open the question of whether there may be a higher lower bound when \tau \approx d/m, with d denoting \mathrm{VC}(\mathcal{H}). In this work, we resolve this question by exhibiting a learner which achieves error c \cdot \tau + O\left(\sqrt{\frac{\tau (d + \log(1/\delta))}{m}} + \frac{d + \log(1/\delta)}{m}\right) for a constant c \leq 2.1, thus matching the lower bound when \tau \approx d/m. Further, our learner is computationally efficient and is based upon careful aggregations of ERM classifiers, making progress on two other questions of Hanneke, Larsen, and Zhivotovskiy (FOCS '24). We leave open the interesting question of whether our approach can be refined to lower the constant from 2.1 to 1, which would completely settle the complexity of agnostic learning.

[LG-12] Inverse Design with Dynamic Mode Decomposition

Link: https://arxiv.org/abs/2502.09490
Authors: Yunpeng Zhu,Liangliang Cheng,Anping Jing,Hanyu Huo,Ziqiang Lang,Bo Zhang,J. Nathan Kutz
Subjects: Machine Learning (cs.LG); Systems and Control (eess.SY); Dynamical Systems (math.DS); Optimization and Control (math.OC); Fluid Dynamics (physics.flu-dyn)
*Comments: 29 pages, 19 figures

Abstract:We introduce a computationally efficient method for the automation of inverse design in science and engineering. Based on simple least-squares regression, the underlying dynamic mode decomposition algorithm can be used to construct a low-rank subspace spanning multiple experiments in parameter space. The proposed inverse design dynamic mode decomposition (ID-DMD) algorithm leverages the computed low-dimensional subspace to enable fast digital design and optimization on laptop-level computing, including the potential to prescribe the dynamics themselves. Moreover, the method is robust to noise, physically interpretable, and can provide uncertainty quantification metrics. The architecture can also efficiently scale to large-scale design problems using randomized algorithms in the ID-DMD. The simplicity of the method and its implementation are highly attractive in practice, and ID-DMD has been demonstrated to be an order of magnitude more accurate than competing methods while simultaneously being 3-5 orders of magnitude faster on challenging engineering design problems ranging from structural vibrations to fluid dynamics. Due to its speed, robustness, interpretability, and ease of use, ID-DMD represents a significant advancement over other leading machine learning methods for data-driven inverse design and optimization, promising a paradigm shift in how inverse design is approached in practice.
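
The least-squares core that ID-DMD builds on is standard exact DMD: fit a low-rank linear operator mapping each snapshot to its successor. A minimal sketch of that core (the inverse-design layer on top is not shown):

```python
import numpy as np

def dmd(X, Xp, r):
    """Exact DMD: fit a rank-r linear operator A with Xp ~= A X,
    where columns of X are states and Xp the states one step later."""
    U, s, Vh = np.linalg.svd(X, full_matrices=False)
    U, s, Vh = U[:, :r], s[:r], Vh[:r]               # truncate to rank r
    A_tilde = U.conj().T @ Xp @ Vh.conj().T / s      # projected operator
    eigvals, W = np.linalg.eig(A_tilde)
    modes = Xp @ Vh.conj().T / s @ W                 # DMD modes
    return eigvals, modes

# Snapshots of a toy linear system x_{k+1} = A x_k
A = np.array([[0.9, 0.2], [0.0, 0.8]])
X = np.zeros((2, 20)); X[:, 0] = [1.0, 1.0]
for k in range(19):
    X[:, k + 1] = A @ X[:, k]
eigvals, modes = dmd(X[:, :-1], X[:, 1:], r=2)       # recovers eig(A) = {0.9, 0.8}
```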

[LG-13] Learning to Predict Global Atrial Fibrillation Dynamics from Sparse Measurements

Link: https://arxiv.org/abs/2502.09473
Authors: Alexander Jenkins,Andrea Cini,Joseph Barker,Alexander Sharp,Arunashis Sau,Varun Valentine,Srushti Valasang,Xinyang Li,Tom Wong,Timothy Betts,Danilo Mandic,Cesare Alippi,Fu Siong Ng
Subjects: Machine Learning (cs.LG); Signal Processing (eess.SP)
*Comments: Under review

Abstract:Catheter ablation of Atrial Fibrillation (AF) consists of a one-size-fits-all treatment with limited success in persistent AF. This may be due to our inability to map the dynamics of AF with the limited resolution and coverage provided by sequential contact mapping catheters, preventing effective patient phenotyping for personalised, targeted ablation. Here we introduce FibMap, a graph recurrent neural network model that reconstructs global AF dynamics from sparse measurements. Trained and validated on 51 non-contact whole atria recordings, FibMap reconstructs whole atria dynamics from 10% surface coverage, achieving a 210% lower mean absolute error and an order of magnitude higher performance in tracking phase singularities compared to baseline methods. Clinical utility of FibMap is demonstrated on real-world contact mapping recordings, achieving reconstruction fidelity comparable to non-contact mapping. FibMap’s state-spaces and patient-specific parameters offer insights for electrophenotyping AF. Integrating FibMap into clinical practice could enable personalised AF care and improve outcomes.

[LG-14] A hierarchical approach for assessing the vulnerability of tree-based classification models to membership inference attack

Link: https://arxiv.org/abs/2502.09396
Authors: Richard J. Preen,Jim Smith
Subjects: Machine Learning (cs.LG); Cryptography and Security (cs.CR)
*Comments:

Abstract:Machine learning models can inadvertently expose confidential properties of their training data, making them vulnerable to membership inference attacks (MIA). While numerous evaluation methods exist, many require computationally expensive processes, such as training multiple shadow models. This article presents two new complementary approaches for efficiently identifying vulnerable tree-based models: an ante-hoc analysis of hyperparameter choices and a post-hoc examination of trained model structure. While these new methods cannot certify whether a model is safe from MIA, they provide practitioners with a means to significantly reduce the number of models that need to undergo expensive MIA assessment through a hierarchical filtering approach. More specifically, it is shown that the rank order of disclosure risk for different hyperparameter combinations remains consistent across datasets, enabling the development of simple, human-interpretable rules for identifying relatively high-risk models before training. While this ante-hoc analysis cannot determine absolute safety since this also depends on the specific dataset, it allows the elimination of unnecessarily risky configurations during hyperparameter tuning. Additionally, computationally inexpensive structural metrics serve as indicators of MIA vulnerability, providing a second filtering stage to identify risky models after training but before conducting expensive attacks. Empirical results show that hyperparameter-based risk prediction rules can achieve high accuracy in predicting the most at risk combinations of hyperparameters across different tree-based model types, while requiring no model training. Moreover, target model accuracy is not seen to correlate with privacy risk, suggesting opportunities to optimise model configurations for both performance and privacy.

[LG-15] Robot Pouring: Identifying Causes of Spillage and Selecting Alternative Action Parameters Using Probabilistic Actual Causation

Link: https://arxiv.org/abs/2502.09395
Authors: Jaime Maldonado,Jonas Krumme,Christoph Zetzsche,Vanessa Didelez,Kerstin Schill
Subjects: Robotics (cs.RO); Machine Learning (cs.LG)
*Comments: 20 pages, 13 figures

Abstract:In everyday life, we perform tasks (e.g., cooking or cleaning) that involve a large variety of objects and goals. When confronted with an unexpected or unwanted outcome, we take corrective actions and try again until achieving the desired result. The reasoning performed to identify a cause of the observed outcome and to select an appropriate corrective action is a crucial aspect of human reasoning for successful task execution. Central to this reasoning is the assumption that a factor is responsible for producing the observed outcome. In this paper, we investigate the use of probabilistic actual causation to determine whether a factor is the cause of an observed undesired outcome. Furthermore, we show how the actual causation probabilities can be used to find alternative actions to change the outcome. We apply the probabilistic actual causation analysis to a robot pouring task. When spillage occurs, the analysis indicates whether a task parameter is the cause and how it should be changed to avoid spillage. The analysis requires a causal graph of the task and the corresponding conditional probability distributions. To fulfill these requirements, we perform a complete causal modeling procedure (i.e., task analysis, definition of variables, determination of the causal graph structure, and estimation of conditional probability distributions) using data from a realistic simulation of the robot pouring task, covering a large combinatorial space of task parameters. Based on the results, we discuss the implications of the variables’ representation and how the alternative actions suggested by the actual causation analysis would compare to the alternative solutions proposed by a human observer. The practical use of the analysis of probabilistic actual causation to select alternative action parameters is demonstrated.

[LG-16] LoRA Training Provably Converges to a Low-Rank Global Minimum or It Fails Loudly (But it Probably Won't Fail)

Link: https://arxiv.org/abs/2502.09376
Authors: Junsu Kim,Jaeyeon Kim,Ernest K. Ryu
Subjects: Machine Learning (cs.LG)
*Comments:

Abstract:Low-rank adaptation (LoRA) has become a standard approach for fine-tuning large foundation models. However, our theoretical understanding of LoRA remains limited as prior analyses of LoRA's training dynamics either rely on linearization arguments or consider highly simplified setups. In this work, we analyze the LoRA loss landscape without such restrictive assumptions. We define two regimes: a "special regime", which includes idealized setups where linearization arguments hold, and a "generic regime" representing more realistic setups where linearization arguments do not hold. In the generic regime, we show that LoRA training converges to a global minimizer with low rank and small magnitude, or a qualitatively distinct solution with high rank and large magnitude. Finally, we argue that the zero-initialization and weight decay in LoRA training induce an implicit bias toward the low-rank, small-magnitude region of the parameter space – where global minima lie – thus shedding light on why LoRA training usually succeeds in finding global minima.
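
For reference, the object being analyzed is the low-rank update LoRA adds to a frozen weight matrix. A minimal sketch, with the rank and scaling chosen arbitrarily; note the zero-initialized B factor, whose role in the implicit-bias argument the abstract highlights:

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen pre-trained weight W plus a trainable low-rank update
    (alpha / r) * B A. B is zero-initialized, so training starts exactly
    at the pre-trained model."""
    def __init__(self, base: nn.Linear, r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                       # freeze W (and bias)
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))
        self.scale = alpha / r

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

layer = LoRALinear(nn.Linear(512, 512))
out = layer(torch.randn(4, 512))                          # same shape as base layer
```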

[LG-17] Mitigating multiple single-event upsets during deep neural network inference using fault-aware training

Link: https://arxiv.org/abs/2502.09374
Authors: Toon Vinck,Naïn Jonckers,Gert Dekkers,Jeffrey Prinzie,Peter Karsmakers
Subjects: Machine Learning (cs.LG)
*Comments: 7 pages, 4 figures, Topical Workshop on Electronics for Particle Physics

Abstract:Deep neural networks (DNNs) are increasingly used in safety-critical applications. Reliable fault analysis and mitigation are essential to ensure their functionality in harsh environments that contain high radiation levels. This study analyses the impact of multiple single-bit single-event upsets in DNNs by performing fault injection at the level of a DNN model. Additionally, a fault-aware training (FAT) methodology is proposed that improves the DNNs' robustness to faults without any modification to the hardware. Experimental results show that the FAT methodology improves the tolerance to faults by up to a factor of 3.

[LG-18] The Accuracy Cost of Weakness: A Theoretical Analysis of Fixed-Segment Weak Labeling for Events in Time

Link: https://arxiv.org/abs/2502.09363
Authors: John Martinsson,Olof Mogren,Tuomas Virtanen,Maria Sandsten
Subjects: Machine Learning (cs.LG)
*Comments: Submitted to TMLR

Abstract:Accurate labels are critical for deriving robust machine learning models. Labels are used to train supervised learning models and to evaluate most machine learning paradigms. In this paper, we model the accuracy and cost of a common weak labeling process where annotators assign presence or absence labels to fixed-length data segments for a given event class. The annotator labels a segment as “present” if it sufficiently covers an event from that class, e.g., a birdsong sound event in audio data. We analyze how the segment length affects the label accuracy and the required number of annotations, and compare this fixed-length labeling approach with an oracle method that uses the true event activations to construct the segments. Furthermore, we quantify the gap between these methods and verify that in most realistic scenarios the oracle method is better than the fixed-length labeling method in both accuracy and cost. Our findings provide a theoretical justification for adaptive weak labeling strategies that mimic the oracle process, and a foundation for optimizing weak labeling processes in sequence labeling tasks.
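
The fixed-length weak labeling process being modeled can be made concrete as follows; the coverage rule (a segment is "present" when it covers at least a fraction `min_cov` of some event) is one plausible reading of "sufficiently covers", used here purely for illustration:

```python
import numpy as np

def weak_labels(events, total_dur, seg_len, min_cov=0.5):
    """Presence/absence labels for fixed-length segments: a segment is
    labeled 'present' when it covers at least min_cov of some event,
    given events as (start, end) times in seconds."""
    starts = np.arange(0.0, total_dur, seg_len)
    labels = np.zeros(len(starts), dtype=int)
    for i, s in enumerate(starts):
        e = min(s + seg_len, total_dur)
        for ev_s, ev_e in events:
            overlap = max(0.0, min(e, ev_e) - max(s, ev_s))
            if overlap >= min_cov * (ev_e - ev_s):
                labels[i] = 1
    return starts, labels

# One 1.2 s birdsong event in a 10 s recording, labeled in 2 s segments
starts, labels = weak_labels([(3.1, 4.3)], total_dur=10.0, seg_len=2.0)
# labels -> [0, 1, 0, 0, 0]: only the segment [2, 4) covers enough of the event
```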

[LG-19] Machine learning for modelling unstructured grid data in computational physics: a review

Link: https://arxiv.org/abs/2502.09346
Authors: Sibo Cheng,Marc Bocquet,Weiping Ding,Tobias Sebastian Finn,Rui Fu,Jinlong Fu,Yike Guo,Eleda Johnson,Siyi Li,Che Liu,Eric Newton Moro,Jie Pan,Matthew Piggott,Cesar Quilodran,Prakhar Sharma,Kun Wang,Dunhui Xiao,Xiao Xue,Yong Zeng,Mingrui Zhang,Hao Zhou,Kewei Zhu,Rossella Arcucci
Subjects: Machine Learning (cs.LG); Computational Engineering, Finance, and Science (cs.CE); Data Analysis, Statistics and Probability (physics.data-an); Fluid Dynamics (physics.flu-dyn)
*Comments:

Abstract:Unstructured grid data are essential for modelling complex geometries and dynamics in computational physics. Yet, their inherent irregularity presents significant challenges for conventional machine learning (ML) techniques. This paper provides a comprehensive review of advanced ML methodologies designed to handle unstructured grid data in high-dimensional dynamical systems. Key approaches discussed include graph neural networks, transformer models with spatial attention mechanisms, interpolation-integrated ML methods, and meshless techniques such as physics-informed neural networks. These methodologies have proven effective across diverse fields, including fluid dynamics and environmental simulations. This review is intended as a guidebook for computational scientists seeking to apply ML approaches to unstructured grid data in their domains, as well as for ML researchers looking to address challenges in computational physics. It places special focus on how ML methods can overcome the inherent limitations of traditional numerical techniques and, conversely, how insights from computational physics can inform ML development. To support benchmarking, this review also provides a summary of open-access datasets of unstructured grid data in computational physics. Finally, emerging directions such as generative models with unstructured data, reinforcement learning for mesh generation, and hybrid physics-data-driven paradigms are discussed to inspire future advancements in this evolving field.

[LG-20] This looks like what? Challenges and Future Research Directions for Part-Prototype Models

Link: https://arxiv.org/abs/2502.09340
Authors: Khawla Elhadri,Tomasz Michalski,Adam Wróbel,Jörg Schlötterer,Bartosz Zieliński,Christin Seifert
Subjects: Machine Learning (cs.LG)
*Comments:

Abstract:The growing interest in eXplainable Artificial Intelligence (XAI) has prompted research into models with built-in interpretability, the most prominent of which are part-prototype models. Part-Prototype Models (PPMs) make decisions by comparing an input image to a set of learned prototypes, providing human-understandable explanations in the form of "this looks like that". Despite their inherent interpretability, PPMs are not yet considered a valuable alternative to post-hoc models. In this survey, we investigate the reasons for this and provide directions for future research. We analyze papers from 2019 to 2024, and derive a taxonomy of the challenges that current PPMs face. Our analysis shows that the open challenges are quite diverse. The main concern is the quality and quantity of prototypes. Other concerns are the lack of generalization to a variety of tasks and contexts, and general methodological issues, including non-standardized evaluation. We provide ideas for future research in five broad directions: improving predictive performance, developing novel architectures grounded in theory, establishing frameworks for human-AI collaboration, aligning models with humans, and establishing metrics and benchmarks for evaluation. We hope that this survey will stimulate research and promote intrinsically interpretable models for application domains. Our list of surveyed papers is available at this https URL.

[LG-21] Full Swap Regret and Discretized Calibration

链接: https://arxiv.org/abs/2502.09332
作者: Maxwell Fishelson,Robert Kleinberg,Princewill Okoroafor,Renato Paes Leme,Jon Schneider,Yifeng Teng
类目: Machine Learning (cs.LG); Computer Science and Game Theory (cs.GT)
*备注:

点击查看摘要

Abstract:We study the problem of minimizing swap regret in structured normal-form games. Players have a very large (potentially infinite) number of pure actions, but each action has an embedding into d-dimensional space and payoffs are given by bilinear functions of these embeddings. We provide an efficient learning algorithm for this setting that incurs at most \tilde{O}(T^{(d+1)/(d+3)}) swap regret after T rounds. To achieve this, we introduce a new online learning problem we call full swap regret minimization. In this problem, a learner repeatedly takes a (randomized) action in a bounded convex d-dimensional action set \mathcal{K} and then receives a loss from the adversary, with the goal of minimizing their regret with respect to the worst-case swap function mapping \mathcal{K} to \mathcal{K}. For varied assumptions about the convexity and smoothness of the loss functions, we design algorithms with full swap regret bounds ranging from O(T^{d/(d+2)}) to O(T^{(d+1)/(d+2)}). Finally, we apply these tools to the problem of online forecasting to minimize calibration error, showing that several notions of calibration can be viewed as specific instances of full swap regret. In particular, we design efficient algorithms for online forecasting that guarantee at most O(T^{1/3}) \ell_2-calibration error and O(\max(\sqrt{\epsilon} T, T^{1/3})) discretized-calibration error (when the forecaster is restricted to predicting multiples of \epsilon).
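作为补充,摘要中的“完全交换遗憾(full swap regret)”大致可写成如下形式(记号为笔者整理,精确定义以原文为准):

```latex
% 相对于最坏情形交换函数 \phi 的累计损失差
\mathrm{FullSwapReg}_T
  \;=\; \max_{\phi:\,\mathcal{K}\to\mathcal{K}}
        \sum_{t=1}^{T} \bigl( \ell_t(x_t) - \ell_t(\phi(x_t)) \bigr)
```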

[LG-22] Bayesian Optimization for Simultaneous Selection of Machine Learning Algorithms and Hyperparameters on Shared Latent Space

链接: https://arxiv.org/abs/2502.09329
作者: Kazuki Ishikawa,Ryota Ozaki,Yohei Kanzaki,Ichiro Takeuchi,Masayuki Karasuyama
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Selecting the optimal combination of a machine learning (ML) algorithm and its hyper-parameters is crucial for the development of high-performance ML systems. However, since the space of combinations of ML algorithms and hyper-parameters is enormous, exhaustive validation requires a significant amount of time. Many existing studies use Bayesian optimization (BO) to accelerate the search. A significant difficulty, however, is that each candidate ML algorithm generally has its own hyper-parameter space. BO-based approaches typically build a surrogate model independently for each hyper-parameter space, so that sufficient observations are required for every candidate ML algorithm. In this study, our proposed method embeds the different hyper-parameter spaces into a shared latent space, in which a multi-task surrogate model for BO is estimated. This approach shares information across observations from different ML algorithms, so that efficient optimization can be expected with a smaller total number of observations. We further propose pre-training the latent space embedding with an adversarial regularization, and a ranking model for selecting an effective pre-trained embedding for a given target dataset. Our empirical study on datasets from OpenML demonstrates the effectiveness of the proposed method.
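下面用 scikit-learn 给出一个极简草图,演示“把不同算法的超参数空间嵌入共享潜空间、再在其上拟合单一代理模型”的核心想法(线性嵌入矩阵 W、两个候选算法及其维度均为笔者假设;原文还包含对抗正则预训练、多任务 GP 与排序模型等组件):

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

rng = np.random.default_rng(0)
# 两个算法:SVM(2 个超参数)与随机森林(3 个超参数),线性嵌入到共享的 2 维潜空间
W = {"svm": rng.normal(size=(2, 2)), "rf": rng.normal(size=(3, 2))}

def embed(algo, x):
    # 把算法各自的超参数向量映射进共享潜空间
    return x @ W[algo]

# 汇总两种算法的历史观测(超参数 -> 验证得分),在潜空间上拟合单一 GP 代理模型
Z = np.vstack([embed("svm", rng.random((5, 2))), embed("rf", rng.random((4, 3)))])
y = rng.random(9)

gp = GaussianProcessRegressor(kernel=RBF(), normalize_y=True).fit(Z, y)
mu, std = gp.predict(embed("svm", rng.random((3, 2))), return_std=True)
```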

[LG-23] Depth-Bounds for Neural Networks via the Braid Arrangement

链接: https://arxiv.org/abs/2502.09324
作者: Moritz Grillo,Christoph Hertrich,Georg Loho
类目: Machine Learning (cs.LG); Discrete Mathematics (cs.DM); Neural and Evolutionary Computing (cs.NE); Combinatorics (math.CO)
*备注:

点击查看摘要

Abstract:We contribute towards resolving the open question of how many hidden layers are required in ReLU networks for exactly representing all continuous and piecewise linear functions on \mathbb{R}^d. While the question has been resolved in special cases, the best known lower bound in general is still 2. We focus on neural networks that are compatible with certain polyhedral complexes, more precisely with the braid fan. For such neural networks, we prove a non-constant lower bound of \Omega(\log\log d) hidden layers required to exactly represent the maximum of d numbers. Additionally, under our assumption, we provide a combinatorial proof that 3 hidden layers are necessary to compute the maximum of 5 numbers; previously, this had only been verified through extensive computation. Finally, we show that a natural generalization of the best known upper bound to maxout networks is not tight, by demonstrating that a rank-3 maxout layer followed by a rank-2 maxout layer is sufficient to represent the maximum of 7 numbers.

[LG-24] Bridging Jensen Gap for Max-Min Group Fairness Optimization in Recommendation ICLR2025

链接: https://arxiv.org/abs/2502.09319
作者: Chen Xu,Yuxin Li,Wenjie Wang,Liang Pang,Jun Xu,Tat-Seng Chua
类目: Information Retrieval (cs.IR); Machine Learning (cs.LG)
*备注: Accepted in ICLR 2025

点击查看摘要

Abstract:Group max-min fairness (MMF) is commonly used in fairness-aware recommender systems (RS) as an optimization objective, as it aims to protect marginalized item groups and ensures a fair competition platform. However, our theoretical analysis indicates that integrating the MMF constraint violates the assumption of sample independence during optimization, causing the loss function to deviate from linear additivity. This nonlinearity introduces a Jensen gap between the model’s convergence point and the optimal point when mini-batch sampling is applied. Both theoretical and empirical studies show that as the mini-batch size decreases and the group size increases, the Jensen gap widens accordingly. Some methods using heuristic re-weighting or debiasing strategies have the potential to bridge the Jensen gap. However, they either lack theoretical guarantees or suffer from heavy computational costs. To overcome these limitations, we first theoretically demonstrate that the MMF-constrained objective can be essentially reformulated as a group-weighted optimization objective. Then we present an efficient and effective algorithm named FairDual, which utilizes a dual optimization technique to minimize the Jensen gap. Our theoretical analysis demonstrates that FairDual can achieve a sub-linear convergence rate to the globally optimal solution and that the Jensen gap can be well bounded under a mini-batch sampling strategy with random shuffling. Extensive experiments conducted using six large-scale RS backbone models on three publicly available datasets demonstrate that FairDual outperforms all baselines in terms of both accuracy and fairness. Our data and codes are shared at this https URL.

[LG-25] SigGate: Enhancing Recurrent Neural Networks with Signature-Based Gating Mechanisms

链接: https://arxiv.org/abs/2502.09318
作者: Rémi Genet,Hugo Inzirillo
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:In this paper, we propose a novel approach that enhances recurrent neural networks (RNNs) by incorporating path signatures into their gating mechanisms. Our method modifies both Long Short-Term Memory (LSTM) and Gated Recurrent Unit (GRU) architectures by replacing their forget and reset gates, respectively, with learnable path signatures. These signatures, which capture the geometric features of the entire path history, provide a richer context for controlling information flow through the network’s memory. This modification allows the networks to make memory decisions based on the full historical context rather than just the current input and state. Through experimental studies, we demonstrate that our Signature-LSTM (SigLSTM) and Signature-GRU (SigGRU) models outperform their traditional counterparts across various sequential learning tasks. By leveraging path signatures in recurrent architectures, this method offers new opportunities to enhance performance in time series analysis and forecasting applications.
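摘要的核心是“用路径签名驱动门控”。下面是一个极简 PyTorch 示意:先对输入历史计算深度 2 的截断路径签名(离散近似),再将其送入一个 sigmoid 门替代遗忘门(接线方式为笔者假设,SigLSTM 的具体结构以原文为准):

```python
import torch
import torch.nn as nn

def signature_depth2(path: torch.Tensor) -> torch.Tensor:
    """path: (T, d);返回深度 2 截断签名的离散近似。"""
    dx = path[1:] - path[:-1]                   # 一阶增量
    s1 = dx.sum(dim=0)                          # 一阶项: x_T - x_0
    prefix = torch.cumsum(dx, dim=0) - dx       # 每步之前的累计增量
    s2 = torch.einsum('ti,tj->ij', prefix, dx)  # 二阶迭代积分(离散近似)
    return torch.cat([s1, s2.flatten()])        # 维度 d + d^2

d, hidden = 3, 8
forget_gate = nn.Sequential(nn.Linear(d + d * d, hidden), nn.Sigmoid())
path = torch.randn(20, d)                       # 到当前时刻为止的输入历史
f_t = forget_gate(signature_depth2(path))       # 基于全部历史几何特征的遗忘门
```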

[LG-26] Towards Seamless Hierarchical Federated Learning under Intermittent Client Participation: A Stagewise Decision-Making Methodology

链接: https://arxiv.org/abs/2502.09303
作者: Minghong Wu,Minghui Liwang,Yuhan Su,Li Li,Seyyedali Hosseinalipour,Xianbin Wang,Huaiyu Dai,Zhenzhen Jiao
类目: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC)
*备注: 20 pages, 8 figures, 5 tables

点击查看摘要

Abstract:Federated Learning (FL) offers a pioneering distributed learning paradigm that enables devices/clients to build a shared global model. This global model is obtained through frequent model transmissions between clients and a central server, which may cause high latency, energy consumption, and congestion over backhaul links. To overcome these drawbacks, Hierarchical Federated Learning (HFL) has emerged, which organizes clients into multiple clusters and utilizes edge nodes (e.g., edge servers) for intermediate model aggregations between clients and the central server. Current research on HFL mainly focuses on enhancing model accuracy, latency, and energy consumption in scenarios with a stable/fixed set of clients. However, addressing the dynamic availability of clients – a critical aspect of real-world scenarios – remains underexplored. This study delves into optimizing client selection and client-to-edge associations in HFL under intermittent client participation so as to minimize overall system costs (i.e., delay and energy), while achieving fast model convergence. We unveil that achieving this goal involves solving a complex NP-hard problem. To tackle this, we propose a stagewise methodology that splits the solution into two stages, referred to as Plan A and Plan B. Plan A focuses on identifying long-term clients with a high chance of participation in subsequent model training rounds. Plan B serves as a backup, selecting alternative clients when long-term clients are unavailable during model training rounds. This stagewise methodology offers a fresh perspective on client selection that can enhance both HFL and conventional FL via enabling low-overhead decision-making processes. Through evaluations on MNIST and CIFAR-10 datasets, we show that our methodology outperforms existing benchmarks in terms of model accuracy and system costs.

[LG-27] Convex Is Back: Solving Belief MDPs With Convexity-Informed Deep Reinforcement Learning

链接: https://arxiv.org/abs/2502.09298
作者: Daniel Koutas,Daniel Hettegger,Kostas G. Papakonstantinou,Daniel Straub
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We present a novel method for Deep Reinforcement Learning (DRL), incorporating the convex property of the value function over the belief space in Partially Observable Markov Decision Processes (POMDPs). We introduce hard- and soft-enforced convexity as two different approaches, and compare their performance against standard DRL on two well-known POMDP environments, namely the Tiger and FieldVisionRockSample problems. Our findings show that including the convexity feature can substantially increase performance of the agents, as well as increase robustness over the hyperparameter space, especially when testing on out-of-distribution domains. The source code for this work can be found at this https URL.

[LG-28] When do neural networks learn world models?

链接: https://arxiv.org/abs/2502.09297
作者: Tianren Zhang,Guanyu Chen,Feng Chen
类目: Machine Learning (cs.LG)
*备注: 28 pages, 9 figures

点击查看摘要

Abstract:Humans develop world models that capture the underlying generation process of data. Whether neural networks can learn similar world models remains an open problem. In this work, we provide the first theoretical results for this problem, showing that in a multi-task setting, models with a low-degree bias provably recover latent data-generating variables under mild assumptions – even if proxy tasks involve complex, non-linear functions of the latents. However, such recovery is also sensitive to model architecture. Our analysis leverages Boolean models of task solutions via the Fourier-Walsh transform and introduces new techniques for analyzing invertible Boolean transforms, which may be of independent interest. We illustrate the algorithmic implications of our results and connect them to related research areas, including self-supervised learning, out-of-distribution generalization, and the linear representation hypothesis in large language models.

[LG-29] An Uncertainty Principle for Linear Recurrent Neural Networks

链接: https://arxiv.org/abs/2502.09287
作者: Alexandre François,Antonio Orvieto,Francis Bach
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We consider linear recurrent neural networks, which have become a key building block of sequence modeling due to their ability for stable and effective long-range modeling. In this paper, we aim at characterizing this ability on a simple but core copy task, whose goal is to build a linear filter of order S that approximates the filter that looks K time steps into the past (which we refer to as the shift-K filter), where K is larger than S. Using classical signal models and quadratic cost, we fully characterize the problem by providing lower bounds of approximation, as well as explicit filters that achieve this lower bound up to constants. The optimal performance highlights an uncertainty principle: the optimal filter has to average values around the K-th time step in the past with a range (width) that is proportional to K/S.

[LG-30] GEVRM: Goal-Expressive Video Generation Model For Robust Visual Manipulation ICLR2025

链接: https://arxiv.org/abs/2502.09268
作者: Hongyin Zhang,Pengxiang Ding,Shangke Lyu,Ying Peng,Donglin Wang
类目: Robotics (cs.RO); Machine Learning (cs.LG)
*备注: Published as a conference paper at ICLR 2025

点击查看摘要

Abstract:With the rapid development of embodied artificial intelligence, significant progress has been made in vision-language-action (VLA) models for general robot decision-making. However, the majority of existing VLAs fail to account for the inevitable external perturbations encountered during deployment. These perturbations introduce unforeseen state information to the VLA, resulting in inaccurate actions and consequently, a significant decline in generalization performance. The classic internal model control (IMC) principle demonstrates that a closed-loop system with an internal model that includes external input signals can accurately track the reference input and effectively offset the disturbance. We propose a novel closed-loop VLA method GEVRM that integrates the IMC principle to enhance the robustness of robot visual manipulation. The text-guided video generation model in GEVRM can generate highly expressive future visual planning goals. Simultaneously, we evaluate perturbations by simulating responses, which are called internal embeddings and optimized through prototype contrastive learning. This allows the model to implicitly infer and distinguish perturbations from the external environment. The proposed GEVRM achieves state-of-the-art performance on both standard and perturbed CALVIN benchmarks and shows significant improvements in realistic robot tasks.

[LG-31] Unlocking the Potential of Classic GNNs for Graph-level Tasks: Simple Architectures Meet Excellence

链接: https://arxiv.org/abs/2502.09263
作者: Yuankai Luo,Lei Shi,Xiao-Ming Wu
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Message-passing Graph Neural Networks (GNNs) are often criticized for their limited expressiveness, issues like over-smoothing and over-squashing, and challenges in capturing long-range dependencies, while Graph Transformers (GTs) are considered superior due to their global attention mechanisms. Literature frequently suggests that GTs outperform GNNs, particularly in graph-level tasks such as graph classification and regression. In this study, we explore the untapped potential of GNNs through an enhanced framework, GNN+, which integrates six widely used techniques: edge feature integration, normalization, dropout, residual connections, feed-forward networks, and positional encoding, to effectively tackle graph-level tasks. We conduct a systematic evaluation of three classic GNNs, namely GCN, GIN, and GatedGCN, enhanced by the GNN+ framework across 14 well-known graph-level datasets. Our results show that, contrary to the prevailing belief, classic GNNs excel in graph-level tasks, securing top three rankings across all datasets and achieving first place in eight, while also demonstrating greater efficiency than GTs. This highlights the potential of simple GNN architectures, challenging the belief that complex mechanisms in GTs are essential for superior graph-level performance.

[LG-32] On the Importance of Embedding Norms in Self-Supervised Learning

链接: https://arxiv.org/abs/2502.09252
作者: Andrew Draganov,Sharvaree Vadgama,Sebastian Damrich,Jan Niklas Böhm,Lucas Maes,Dmitry Kobak,Erik Bekkers
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Self-supervised learning (SSL) allows training data representations without a supervised signal and has become an important paradigm in machine learning. Most SSL methods employ the cosine similarity between embedding vectors and hence effectively embed data on a hypersphere. While this seemingly implies that embedding norms cannot play any role in SSL, a few recent works have suggested that embedding norms have properties related to network convergence and confidence. In this paper, we resolve this apparent contradiction and systematically establish the embedding norm’s role in SSL training. Using theoretical analysis, simulations, and experiments, we show that embedding norms (i) govern SSL convergence rates and (ii) encode network confidence, with smaller norms corresponding to unexpected samples. Additionally, we show that manipulating embedding norms can have large effects on convergence speed. Our findings demonstrate that SSL embedding norms are integral to understanding and optimizing network behavior.

[LG-33] Revisiting Euclidean Alignment for Transfer Learning in EEG-Based Brain-Computer Interfaces

链接: https://arxiv.org/abs/2502.09203
作者: Dongrui Wu
类目: Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Due to the non-stationarity and large individual differences of EEG signals, EEG-based brain-computer interfaces (BCIs) usually need subject-specific calibration to tailor the decoding algorithm for each new subject, which is time-consuming and user-unfriendly, hindering their real-world applications. Transfer learning (TL) has been extensively used to expedite the calibration, by making use of EEG data from other subjects/sessions. An important consideration in TL for EEG-based BCIs is to reduce the data distribution discrepancies among different subjects/sessions, to avoid negative transfer. Euclidean alignment (EA) was proposed in 2020 to address this challenge. Numerous experiments from 10 different BCI paradigms demonstrated its effectiveness and efficiency. This paper revisits EA, explaining its procedure and correct usage, introducing its applications and extensions, and pointing out potential new research directions. It should be very helpful to BCI researchers, especially those who are working on EEG signal decoding.

[LG-34] Understanding High-Dimensional Bayesian Optimization

链接: https://arxiv.org/abs/2502.09198
作者: Leonard Papenmeier,Matthias Poloczek,Luigi Nardi
类目: Machine Learning (cs.LG)
*备注: 19 pages, 20 figures

点击查看摘要

Abstract:Recent work reported that simple Bayesian optimization methods perform well for high-dimensional real-world tasks, seemingly contradicting prior work and tribal knowledge. This paper investigates the ‘why’. We identify fundamental challenges that arise in high-dimensional Bayesian optimization and explain why recent methods succeed. Our analysis shows that vanishing gradients caused by Gaussian process initialization schemes play a major role in the failures of high-dimensional Bayesian optimization and that methods that promote local search behaviors are better suited for the task. We find that maximum likelihood estimation of Gaussian process length scales suffices for state-of-the-art performance. Based on this, we propose a simple variant of maximum likelihood estimation called MSR that leverages these findings to achieve state-of-the-art performance on a comprehensive set of real-world applications. We also present targeted experiments to illustrate and confirm our findings.
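与文中“对 GP 长度尺度做最大似然估计已足以取得先进性能”的结论相呼应,下面给出一个极简示意:scikit-learn 的 GP 在拟合时默认即通过最大化边际似然来估计各维长度尺度(ARD);数据与维度为笔者虚构,MSR 的细节以原文为准:

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

rng = np.random.default_rng(1)
X = rng.uniform(size=(50, 30))                      # 50 个样本、30 维输入
y = np.sin(3 * X[:, 0]) + 0.1 * rng.normal(size=50) # 只有第 0 维真正起作用

kernel = RBF(length_scale=np.ones(30))              # 每维一个长度尺度 (ARD)
gp = GaussianProcessRegressor(kernel=kernel, normalize_y=True).fit(X, y)
print(gp.kernel_.length_scale[:3])                  # MLE 拟合后的前 3 个长度尺度
```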

[LG-35] Generalizability through Explainability: Countering Overfitting with Counterfactual Examples

链接: https://arxiv.org/abs/2502.09193
作者: Flavio Giorgi,Fabiano Veglianti,Fabrizio Silvestri,Gabriele Tolomei
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Overfitting is a well-known issue in machine learning that occurs when a model struggles to generalize its predictions to new, unseen data beyond the scope of its training set. Traditional techniques to mitigate overfitting include early stopping, data augmentation, and regularization. In this work, we demonstrate that the degree of overfitting of a trained model is correlated with the ability to generate counterfactual examples. The higher the overfitting, the easier it will be to find a valid counterfactual example for a randomly chosen input data point. Therefore, we introduce CF-Reg, a novel regularization term in the training loss that controls overfitting by ensuring enough margin between each instance and its corresponding counterfactual. Experiments conducted across multiple datasets and models show that our counterfactual regularizer generally outperforms existing regularization techniques.
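下面给出 CF-Reg 思路的一个极简 PyTorch 草图:先用若干步梯度上升近似搜索“翻转预测的反事实方向”,再要求模型在该方向半径 tau 内仍保持正确预测,相当于拉大样本与反事实之间的间隔(反事实搜索方式与超参数均为笔者假设,并非原文的精确算法):

```python
import torch
import torch.nn.functional as F

def cf_reg_loss(model, x, y, lam=0.1, tau=1.0, step=0.5, iters=5):
    ce = F.cross_entropy(model(x), y)
    # 用若干步梯度上升近似搜索“最近的反事实方向”(仅作示意)
    x_cf = x.clone().detach().requires_grad_(True)
    for _ in range(iters):
        flip = -F.cross_entropy(model(x_cf), y)   # 最大化真实类损失,推向决策边界另一侧
        grad, = torch.autograd.grad(flip, x_cf)
        x_cf = (x_cf - step * grad).detach().requires_grad_(True)
    direction = x_cf.detach() - x
    direction = direction / (direction.norm(dim=1, keepdim=True) + 1e-8)
    # 要求模型在半径 tau 内仍预测正确,等价于把反事实“推远”、扩大间隔
    reg = F.cross_entropy(model(x + tau * direction), y)
    return ce + lam * reg
```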

[LG-36] LOB-Bench: Benchmarking Generative AI for Finance - an Application to Limit Order Book Data

链接: https://arxiv.org/abs/2502.09172
作者: Peer Nagy,Sascha Frey,Kang Li,Bidipta Sarkar,Svitlana Vyetrenko,Stefan Zohren,Ani Calinescu,Jakob Foerster
类目: Machine Learning (cs.LG); Computational Engineering, Finance, and Science (cs.CE); Computational Finance (q-fin.CP); Trading and Market Microstructure (q-fin.TR)
*备注:

点击查看摘要

Abstract:While financial data presents one of the most challenging and interesting sequence modelling tasks due to high noise, heavy tails, and strategic interactions, progress in this area has been hindered by the lack of consensus on quantitative evaluation paradigms. To address this, we present LOB-Bench, a benchmark, implemented in Python, designed to evaluate the quality and realism of generative message-by-order data for limit order books (LOB) in the LOBSTER format. Our framework measures distributional differences in conditional and unconditional statistics between generated and real LOB data, supporting flexible multivariate statistical evaluation. The benchmark also includes commonly used LOB statistics such as spread, order book volumes, order imbalance, and message inter-arrival times, along with scores from a trained discriminator network. Lastly, LOB-Bench contains “market impact metrics”, i.e. the cross-correlations and price response functions for specific events in the data. We benchmark generative autoregressive state-space models, a (C)GAN, as well as a parametric LOB model and find that the autoregressive GenAI approach beats traditional model classes.

[LG-37] Vertical Federated Continual Learning via Evolving Prototype Knowledge

链接: https://arxiv.org/abs/2502.09152
作者: Shuo Wang,Keke Gai,Jing Yu,Liehuang Zhu,Qi Wu
类目: Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)
*备注:

点击查看摘要

Abstract:Vertical Federated Learning (VFL) has garnered significant attention as a privacy-preserving machine learning framework for sample-aligned feature federation. However, traditional VFL approaches do not address the challenges of class and feature continual learning, resulting in catastrophic forgetting of knowledge from previous tasks. To address the above challenge, we propose a novel vertical federated continual learning method, named Vertical Federated Continual Learning via Evolving Prototype Knowledge (V-LETO), which primarily facilitates the transfer of knowledge from previous tasks through the evolution of prototypes. Specifically, we propose an evolving prototype knowledge method, enabling the global model to retain both previous and current task knowledge. Furthermore, we introduce a model optimization technique that mitigates the forgetting of previous task knowledge by restricting updates to specific parameters of the local model, thereby enhancing overall performance. Extensive experiments conducted in both CIL and FIL settings demonstrate that our method, V-LETO, outperforms the other state-of-the-art methods. For example, our method outperforms the state-of-the-art method by 10.39% and 35.15% for CIL and FIL tasks, respectively. Our code is available at this https URL.

[LG-38] Regularization can make diffusion models more efficient

链接: https://arxiv.org/abs/2502.09151
作者: Mahsa Taheri,Johannes Lederer
类目: Machine Learning (cs.LG); Statistics Theory (math.ST); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:Diffusion models are one of the key architectures of generative AI. Their main drawback, however, is the computational costs. This study indicates that the concept of sparsity, well known especially in statistics, can provide a pathway to more efficient diffusion pipelines. Our mathematical guarantees prove that sparsity can reduce the input dimension’s influence on the computational complexity to that of a much smaller intrinsic dimension of the data. Our empirical findings confirm that inducing sparsity can indeed lead to better samples at a lower cost.

[LG-39] Trust Me, I Know the Way: Predictive Uncertainty in the Presence of Shortcut Learning

链接: https://arxiv.org/abs/2502.09137
作者: Lisa Wimmer,Bernd Bischl,Ludwig Bothmann
类目: Machine Learning (cs.LG)
*备注: Preprint. Under review

点击查看摘要

Abstract:The correct way to quantify predictive uncertainty in neural networks remains a topic of active discussion. In particular, it is unclear whether the state-of-the-art entropy decomposition leads to a meaningful representation of model, or epistemic, uncertainty (EU) in light of a debate that pits ignorance against disagreement perspectives. We aim to reconcile the conflicting viewpoints by arguing that both are valid but arise from different learning situations. Notably, we show that the presence of shortcuts is decisive for EU manifesting as disagreement.
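摘要中争论的“熵分解”可用一个小例子说明:对深度集成的预测取平均,总不确定性(总熵)= 偶然不确定性(期望熵)+ 认知不确定性(成员间分歧,即互信息)。以下为极简 numpy 示意:

```python
import numpy as np

# 3 个集成成员对同一样本的二分类预测分布
probs = np.array([[0.9, 0.1],
                  [0.6, 0.4],
                  [0.2, 0.8]])

def entropy(p):
    return -(p * np.log(p + 1e-12)).sum(axis=-1)

total = entropy(probs.mean(axis=0))   # 总不确定性:平均分布的熵
aleatoric = entropy(probs).mean()     # 偶然不确定性:各成员熵的均值
epistemic = total - aleatoric         # 认知不确定性:成员间分歧(互信息)
```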

[LG-40] Interpreting and Steering Protein Language Models through Sparse Autoencoders

链接: https://arxiv.org/abs/2502.09135
作者: Edith Natalia Villegas Garcia,Alessio Ansuini
类目: Machine Learning (cs.LG); Biomolecules (q-bio.BM)
*备注: 11 pages, 6 figures

点击查看摘要

Abstract:The rapid advancements in transformer-based language models have revolutionized natural language processing, yet understanding the internal mechanisms of these models remains a significant challenge. This paper explores the application of sparse autoencoders (SAE) to interpret the internal representations of protein language models, specifically focusing on the ESM-2 8M parameter model. By performing a statistical analysis on each latent component’s relevance to distinct protein annotations, we identify potential interpretations linked to various protein characteristics, including transmembrane regions, binding sites, and specialized motifs. We then leverage these insights to guide sequence generation, shortlisting the relevant latent components that can steer the model towards desired targets such as zinc finger domains. This work contributes to the emerging field of mechanistic interpretability in biological sequence models, offering new perspectives on model steering for sequence design.
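稀疏自编码器本身结构很简单:对某层激活做“过完备编码 + L1 稀疏惩罚 + 重构”。下面是一个极简 PyTorch 示意(320 取自 ESM-2 8M 的隐层维度,其余超参数与接入层均为笔者假设):

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int, d_latent: int):
        super().__init__()
        self.enc = nn.Linear(d_model, d_latent)   # 过完备编码 (d_latent >> d_model)
        self.dec = nn.Linear(d_latent, d_model)

    def forward(self, h):
        z = torch.relu(self.enc(h))               # 非负稀疏编码
        return self.dec(z), z

sae = SparseAutoencoder(d_model=320, d_latent=4096)  # 320 为 ESM-2 8M 的隐维度
h = torch.randn(16, 320)                             # 某层激活(示意数据)
recon, z = sae(h)
loss = ((recon - h) ** 2).mean() + 1e-3 * z.abs().mean()  # 重构 + L1 稀疏惩罚
```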

[LG-41] Finite-Time Analysis of Discrete-Time Stochastic Interpolants

链接: https://arxiv.org/abs/2502.09130
作者: Yuhao Liu,Yu Chen,Rui Hu,Longbo Huang
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:The stochastic interpolant framework offers a powerful approach for constructing generative models based on ordinary differential equations (ODEs) or stochastic differential equations (SDEs) to transform arbitrary data distributions. However, prior analyses of this framework have primarily focused on the continuous-time setting, assuming a perfect solution of the underlying equations. In this work, we present the first discrete-time analysis of the stochastic interpolant framework, where we introduce an innovative discrete-time sampler and derive a finite-time upper bound on its distribution estimation error. Our result provides a novel quantification of how different factors, including the distance between source and target distributions and estimation accuracy, affect the convergence rate and also offers a new principled way to design efficient schedules for convergence acceleration. Finally, numerical experiments are conducted on the discrete-time sampler to corroborate our theoretical findings.

[LG-42] Scaling Law for Stochastic Gradient Descent in Quadratically Parameterized Linear Regression

链接: https://arxiv.org/abs/2502.09106
作者: Shihong Ding,Haihan Zhang,Hanzhen Zhao,Cong Fang
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:In machine learning, the scaling law describes how model performance improves as the model and data size scale up. From a learning theory perspective, this class of results establishes upper and lower generalization bounds for a specific learning algorithm. Here, the exact algorithm run under a specific model parameterization often provides a crucial implicit regularization effect that leads to good generalization. To characterize the scaling law, previous theoretical studies mainly focus on linear models, whereas feature learning, a notable process that contributes to the remarkable empirical success of neural networks, remains regrettably unexplored. This paper studies the scaling law for a quadratically parameterized linear regression model. We consider infinite-dimensional data and slope ground truth, both signals exhibiting certain power-law decay rates. We study convergence rates for Stochastic Gradient Descent and demonstrate that the learning rates for the variables automatically adapt to the ground truth. As a result, in the canonical linear regression, we provide explicit separations for generalization curves between SGD with and without feature learning, as well as an information-theoretic lower bound that is agnostic to the parametrization method and the algorithm. Our analysis of the decaying ground truth provides a new characterization of the learning dynamics of the model.

[LG-43] Application of Tabular Transformer Architectures for Operating System Fingerprinting

链接: https://arxiv.org/abs/2502.09084
作者: Rubén Pérez-Jove,Cristian R. Munteanu,Alejandro Pazos,Jose Vázquez-Naya
类目: Cryptography and Security (cs.CR); Machine Learning (cs.LG); Networking and Internet Architecture (cs.NI)
*备注: Submitted as a preprint (not peer reviewed). 22 pages, 9 figures. Code and datasets available at: this https URL

点击查看摘要

Abstract:Operating System (OS) fingerprinting is essential for network management and cybersecurity, enabling accurate device identification based on network traffic analysis. Traditional rule-based tools such as Nmap and p0f face challenges in dynamic environments due to frequent OS updates and obfuscation techniques. While Machine Learning (ML) approaches have been explored, Deep Learning (DL) models, particularly Transformer architectures, remain unexploited in this domain. This study investigates the application of Tabular Transformer architectures, specifically TabTransformer and FT-Transformer, for OS fingerprinting, leveraging structured network data from three publicly available datasets. Our experiments demonstrate that FT-Transformer generally outperforms traditional ML models, previous approaches, and TabTransformer across multiple classification levels (OS family, major, and minor versions). The results establish a strong foundation for DL-based OS fingerprinting, improving accuracy and adaptability in complex network environments. Furthermore, we ensure the reproducibility of our research by providing an open-source implementation.

[LG-44] FlowAR: A Standardized Platform for Human Activity Recognition from Binary Sensors

链接: https://arxiv.org/abs/2502.09067
作者: Ali Ncibi,Luc Bouganim,Philippe Pucheral
类目: Machine Learning (cs.LG)
*备注: in French language this https URL

点击查看摘要

Abstract:This demo showcases FlowAR, a platform for developing human activity recognition (AR) systems that focuses on daily activities using sensor data, such as binary sensors. Taking a data-driven approach, the platform features a three-step pipeline (flow): data cleaning, segmentation, and personalized classification. Its modularity provides the flexibility to test different methods and datasets and ensures rigorous evaluations. A concrete use case demonstrates its effectiveness.

[LG-45] CRANE: Reasoning with constrained LLM generation

链接: https://arxiv.org/abs/2502.09061
作者: Debangshu Banerjee,Tarun Suresh,Shubham Ugare,Sasa Misailovic,Gagandeep Singh
类目: Programming Languages (cs.PL); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Code generation, symbolic math reasoning, and other tasks require LLMs to produce outputs that are both syntactically and semantically correct. Constrained LLM generation is a promising direction to enforce adherence to formal grammar, but prior works have empirically observed that strict enforcement of formal constraints often diminishes the reasoning capabilities of LLMs. In this work, we first provide a theoretical explanation for why constraining LLM outputs to very restrictive grammars that only allow syntactically valid final answers reduces the reasoning capabilities of the model. Second, we demonstrate that by augmenting the output grammar with carefully designed additional rules, it is always possible to preserve the reasoning capabilities of the LLM while ensuring syntactic and semantic correctness in its outputs. Building on these theoretical insights, we propose a reasoning-augmented constrained decoding algorithm, CRANE, which effectively balances the correctness of constrained generation with the flexibility of unconstrained generation. Experiments on multiple open-source LLMs and benchmarks show that CRANE significantly outperforms both state-of-the-art constrained decoding strategies and standard unconstrained decoding, showing up to 10 percentage points of accuracy improvement over baselines on the challenging symbolic reasoning benchmarks GSM-symbolic and FOLIO.
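受限解码的基本操作是逐步屏蔽语法不允许的 token。下面用一个极简示意说明这一步(玩具词表与允许集合均为笔者假设;CRANE 的贡献在于如何扩充语法规则以保留推理能力,细节见原文):

```python
import torch

def constrained_step(logits: torch.Tensor, allowed_ids) -> int:
    # 把语法不允许的 token 的 logit 置为 -inf,再取贪心解码
    mask = torch.full_like(logits, float("-inf"))
    mask[allowed_ids] = 0.0
    return (logits + mask).argmax().item()

vocab = {"(": 0, ")": 1, "1": 2, "+": 3}
logits = torch.randn(4)                     # 当前步的模型输出(示意)
next_id = constrained_step(logits, allowed_ids=[vocab["("], vocab["1"]])
```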

[LG-46] End-to-End triplet loss based fine-tuning for network embedding in effective PII detection

链接: https://arxiv.org/abs/2502.09002
作者: Rishika Kohli,Shaifu Gupta,Manoj Singh Gaur
类目: Machine Learning (cs.LG)
*备注: 13 pages, 10 figures, 5 tables

点击查看摘要

Abstract:There are many approaches in the mobile data ecosystem that inspect network traffic generated by applications running on a user’s device to detect personal data exfiltration from the device. State-of-the-art methods rely on features extracted from HTTP requests and, in this context, machine learning involves training classifiers on these features and making predictions using labelled packet traces. However, most of these methods include external feature selection before model training. Deep learning, on the other hand, typically does not require such techniques, as it can autonomously learn and identify patterns in the data without external feature extraction or selection algorithms. In this article, we propose a novel deep learning based end-to-end learning framework for predicting the exposure of personally identifiable information (PII) in mobile packets. The framework employs a pre-trained large language model (LLM) and an autoencoder to generate embeddings of network packets and then uses a triplet-loss based fine-tuning method to train the model, improving detection effectiveness on two real-world datasets. We compare our proposed detection framework with other state-of-the-art works in detecting PII leaks from the user’s device.
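下面给出“三元组损失微调报文嵌入”这一步的极简 PyTorch 示意(编码器结构、输入维度与三元组的构造方式均为笔者假设;原文使用预训练 LLM 与自编码器生成初始嵌入):

```python
import torch
import torch.nn as nn

# 把报文的初始嵌入(此处假设 768 维)映射到更紧凑的检测空间
embedder = nn.Sequential(nn.Linear(768, 256), nn.ReLU(), nn.Linear(256, 64))
triplet = nn.TripletMarginLoss(margin=1.0)

anchor   = embedder(torch.randn(32, 768))   # 锚点:含 PII 泄露的请求
positive = embedder(torch.randn(32, 768))   # 正例:同样泄露 PII 的请求
negative = embedder(torch.randn(32, 768))   # 负例:无泄露的请求
loss = triplet(anchor, positive, negative)  # 拉近同类、推远异类
loss.backward()
```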

[LG-47] Privacy-Preserving Hybrid Ensemble Model for Network Anomaly Detection: Balancing Security and Data Protection

链接: https://arxiv.org/abs/2502.09001
作者: Shaobo Liu,Zihao Zhao,Weijie He,Jiren Wang,Jing Peng,Haoyuan Ma
类目: Machine Learning (cs.LG)
*备注: Accepted by 2024 5th International Conference on Big Data, Artificial Intelligence and Internet of Things Engineering(ICBAIE 2024)

点击查看摘要

Abstract:Privacy-preserving network anomaly detection has become an essential area of research due to growing concerns over the protection of sensitive data. Traditional anomaly detection models often prioritize accuracy while neglecting the critical aspect of privacy. In this work, we propose a hybrid ensemble model that incorporates privacy-preserving techniques to address both detection accuracy and data protection. Our model combines the strengths of several machine learning algorithms, including K-Nearest Neighbors (KNN), Support Vector Machines (SVM), XGBoost, and Artificial Neural Networks (ANN), to create a robust system capable of identifying network anomalies while ensuring privacy. The proposed approach integrates advanced preprocessing techniques that enhance data quality and address the challenges of small sample sizes and imbalanced datasets. By embedding privacy measures into the model design, our solution offers a significant advancement over existing methods, ensuring both enhanced detection performance and strong privacy safeguards.

[LG-48] Task Generalization With AutoRegressive Compositional Structure: Can Learning From d Tasks Generalize to d^T Tasks?

链接: https://arxiv.org/abs/2502.08991
作者: Amirhesam Abedsoltan,Huaqing Zhang,Kaiyue Wen,Hongzhou Lin,Jingzhao Zhang,Mikhail Belkin
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:Large language models (LLMs) exhibit remarkable task generalization, solving tasks they were never explicitly trained on with only a few demonstrations. This raises a fundamental question: When can learning from a small set of tasks generalize to a large task family? In this paper, we investigate task generalization through the lens of AutoRegressive Compositional (ARC) structure, where each task is a composition of T operations, and each operation is among a finite family of d subtasks. This yields a total class of size d^T. We first show that generalization to all d^T tasks is theoretically achievable by training on only \tilde{O}(d) tasks. Empirically, we demonstrate that Transformers achieve such exponential task generalization on sparse parity functions via in-context learning (ICL) and Chain-of-Thought (CoT) reasoning. We further demonstrate this generalization in arithmetic and language translation, extending beyond parity functions.
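摘要中的任务族可以用稀疏奇偶函数具体化:在 n 位输入中选 k 个坐标,标签为这些坐标的异或,不同坐标子集即不同任务。以下为极简 numpy 示意:

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)
n, k = 10, 3
tasks = list(itertools.combinations(range(n), k))   # 共 C(n, k) 个任务

def sample_task(task, m=100):
    X = rng.integers(0, 2, size=(m, n))             # m 个 n 位随机输入
    y = X[:, list(task)].sum(axis=1) % 2            # 所选坐标的异或作为标签
    return X, y

X, y = sample_task(tasks[0])
```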

[LG-49] What exactly has TabPFN learned to do? ICLR2024

链接: https://arxiv.org/abs/2502.08978
作者: Calvin McCarter
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: Originally published in Blogposts Track at ICLR 2024. Appendix contains re-analysis on TabPFN-v2 [Hollmann et al., 2025]

点击查看摘要

Abstract:TabPFN [Hollmann et al., 2023], a Transformer model pretrained to perform in-context learning on fresh tabular classification problems, was presented at the last ICLR conference. To better understand its behavior, we treat it as a black-box function approximator generator and observe its generated function approximations on a varied selection of training datasets. Exploring its learned inductive biases in this manner, we observe behavior that is at turns either brilliant or baffling. We conclude this post with thoughts on how these results might inform the development, evaluation, and application of prior-data fitted networks (PFNs) in the future.

[LG-50] Small Molecule Drug Discovery Through Deep Learning: Progress, Challenges and Opportunities

链接: https://arxiv.org/abs/2502.08975
作者: Kun Li,Yida Xiong,Hongzhi Zhang,Xiantao Cai,Bo Du,Wenbin Hu
类目: Machine Learning (cs.LG); Biomolecules (q-bio.BM)
*备注: 9 pages, 1 figures, 8 tables

点击查看摘要

Abstract:Due to their excellent drug-like and pharmacokinetic properties, small molecule drugs are widely used to treat various diseases, making them a critical component of drug discovery. In recent years, with the rapid development of deep learning (DL) techniques, DL-based small molecule drug discovery methods have achieved excellent performance in prediction accuracy, speed, and complex molecular relationship modeling compared to traditional machine learning approaches. These advancements enhance drug screening efficiency and optimization, and they provide more precise and effective solutions for various drug discovery tasks. To contribute to this field’s development, this paper systematically summarizes the key tasks and representative techniques in recent DL-based small molecule drug discovery. Specifically, we provide an overview of the major tasks in small molecule drug discovery and their interrelationships. Next, we analyze the six core tasks, summarizing the related methods, commonly used datasets, and technological development trends. Finally, we discuss key challenges, such as interpretability and out-of-distribution generalization, and offer our insights into future research directions for DL-assisted small molecule drug discovery.

[LG-51] Modeling Time-evolving Causality over Data Streams KDD’25

链接: https://arxiv.org/abs/2502.08963
作者: Naoki Chihara,Yasuko Matsubara,Ren Fujiwara,Yasushi Sakurai
类目: Machine Learning (cs.LG)
*备注: Accepted by KDD’25

点击查看摘要

Abstract:Given an extensive, semi-infinite collection of multivariate coevolving data sequences (e.g., sensor/web activity streams) whose observations influence each other, how can we discover the time-changing cause-and-effect relationships in co-evolving data streams? How efficiently can we reveal dynamical patterns that allow us to forecast future values? In this paper, we present a novel streaming method, ModePlait, which is designed for modeling such causal relationships (i.e., time-evolving causality) in multivariate co-evolving data streams and forecasting their future values. The solution relies on characteristics of the causal relationships that evolve over time in accordance with the dynamic changes of exogenous variables. ModePlait has the following properties: (a) Effective: it discovers the time-evolving causality in multivariate co-evolving data streams by detecting the transitions of distinct dynamical patterns adaptively. (b) Accurate: it enables both the discovery of time-evolving causality and the forecasting of future values in a streaming fashion. (c) Scalable: our algorithm does not depend on data stream length and thus is applicable to very large sequences. Extensive experiments on both synthetic and real-world datasets demonstrate that our proposed model outperforms state-of-the-art methods in terms of discovering the time-evolving causality as well as forecasting.

[LG-52] A Comprehensive Survey on Imbalanced Data Learning

链接: https://arxiv.org/abs/2502.08960
作者: Xinyi Gao,Dongting Xie,Yihang Zhang,Zhengren Wang,Conghui He,Hongzhi Yin,Wentao Zhang
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:With the expansion of data availability, machine learning (ML) has achieved remarkable breakthroughs in both academia and industry. However, imbalanced data distributions are prevalent in various types of raw data and severely hinder the performance of ML by biasing the decision-making processes. To deepen the understanding of imbalanced data and facilitate the related research and applications, this survey systematically analyzes various real-world data formats and organizes existing research for different data formats into four distinct categories: data re-balancing, feature representation, training strategy, and ensemble learning. This structured analysis helps researchers comprehensively understand the pervasive nature of imbalance across diverse data formats, thereby paving a clearer path toward achieving specific research goals. We also provide an overview of relevant open-source libraries, spotlight current challenges, and offer novel insights aimed at fostering future advancements in this critical area of study.

[LG-53] Integrated Optimization and Game Theory Framework for Fair Cost Allocation in Community Microgrids

链接: https://arxiv.org/abs/2502.08953
作者: K. Victor Sam Moses Babu,Pratyush Chakraborty,Mayukha Pal
类目: ystems and Control (eess.SY); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Fair cost allocation in community microgrids remains a significant challenge due to the complex interactions between multiple participants with varying load profiles, distributed energy resources, and storage systems. Traditional cost allocation methods often fail to adequately address the dynamic nature of participant contributions and benefits, leading to inequitable distribution of costs and reduced participant satisfaction. This paper presents a novel framework integrating multi-objective optimization with cooperative game theory for fair and efficient microgrid operation and cost allocation. The proposed approach combines mixed-integer linear programming for optimal resource dispatch with Shapley value analysis for equitable benefit distribution, ensuring both system efficiency and participant satisfaction. The framework was validated using real-world data across six distinct operational scenarios, demonstrating significant improvements in both technical and economic performance. Results show peak demand reductions ranging from 7.8% to 62.6%, solar utilization rates reaching 114.8% through effective storage integration, and cooperative gains of up to 1,801.01 per day. The Shapley value-based allocation achieved balanced benefit-cost distributions, with net positions ranging from -16.0% to +14.2% across different load categories, ensuring sustainable participant cooperation.
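Shapley 值部分可用一个小例子说明:遍历所有加入顺序、对每个参与者的边际贡献取平均。以下为极简示意,其中特征函数 v(联盟到每日收益的映射)为笔者虚构,实际框架中它来自 MILP 调度结果:

```python
from itertools import permutations

players = ["A", "B", "C"]
# 特征函数 v:每个联盟对应的每日合作收益(虚构数值)
v = {frozenset(): 0, frozenset("A"): 10, frozenset("B"): 20, frozenset("C"): 5,
     frozenset("AB"): 40, frozenset("AC"): 18, frozenset("BC"): 30,
     frozenset("ABC"): 60}

shapley = {p: 0.0 for p in players}
for order in permutations(players):          # 遍历全部 3! 种加入顺序
    coalition = frozenset()
    for p in order:
        shapley[p] += (v[coalition | {p}] - v[coalition]) / 6  # 边际贡献 / 顺序数
        coalition = coalition | {p}
print(shapley)   # 各参与者的 Shapley 值之和等于 v(全体) = 60
```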

[LG-54] Self-Supervised Graph Contrastive Pretraining for Device-level Integrated Circuits

链接: https://arxiv.org/abs/2502.08949
作者: Sungyoung Lee,Ziyi Wang,Seunggeun Kim,Taekyun Lee,David Z. Pan
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Self-supervised graph representation learning has driven significant advancements in domains such as social network analysis, molecular design, and electronics design automation (EDA). However, prior works in EDA have mainly focused on the representation of gate-level digital circuits, failing to capture analog and mixed-signal circuits. To address this gap, we introduce DICE: Device-level Integrated Circuits Encoder, the first self-supervised pretrained graph neural network (GNN) model for any circuit expressed at the device level. DICE is a message-passing neural network (MPNN) trained through graph contrastive learning, and its pretraining process is simulation-free, incorporating two novel data augmentation techniques. Experimental results demonstrate that DICE achieves substantial performance gains across three downstream tasks, underscoring its effectiveness for both analog and digital circuits.

[LG-55] Reevaluating Policy Gradient Methods for Imperfect-Information Games

链接: https://arxiv.org/abs/2502.08938
作者: Max Rudolph,Nathan Lichtle,Sobhan Mohammadpour,Alexandre Bayen,J. Zico Kolter,Amy Zhang,Gabriele Farina,Eugene Vinitsky,Samuel Sokota
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:In the past decade, motivated by the putative failure of naive self-play deep reinforcement learning (DRL) in adversarial imperfect-information games, researchers have developed numerous DRL algorithms based on fictitious play (FP), double oracle (DO), and counterfactual regret minimization (CFR). In light of recent results of the magnetic mirror descent algorithm, we hypothesize that simpler generic policy gradient methods like PPO are competitive with or superior to these FP, DO, and CFR-based DRL approaches. To facilitate the resolution of this hypothesis, we implement and release the first broadly accessible exact exploitability computations for four large games. Using these games, we conduct the largest-ever exploitability comparison of DRL algorithms for imperfect-information games. Over 5600 training runs, FP, DO, and CFR-based approaches fail to outperform generic policy gradient methods. Code is available at this https URL and this https URL .

[LG-56] AutoLike: Auditing Social Media Recommendations through User Interactions

链接: https://arxiv.org/abs/2502.08933
作者: Hieu Le,Salma Elmalaki,Zubair Shafiq,Athina Markopoulou
类目: Machine Learning (cs.LG)
*备注: 17 pages, 6 figures, 3 tables

点击查看摘要

Abstract:Modern social media platforms, such as TikTok, Facebook, and YouTube, rely on recommendation systems to personalize content for users based on user interactions with endless streams of content, such as “For You” pages. However, these complex algorithms can inadvertently deliver problematic content related to self-harm, mental health, and eating disorders. We introduce AutoLike, a framework to audit recommendation systems in social media platforms for topics of interest and their sentiments. To automate the process, we formulate the problem as a reinforcement learning problem. AutoLike drives the recommendation system to serve a particular type of content through interactions (e.g., liking). We apply the AutoLike framework to the TikTok platform as a case study. We evaluate how well AutoLike identifies TikTok content automatically across nine topics of interest; and conduct eight experiments to demonstrate how well it drives TikTok’s recommendation system towards particular topics and sentiments. AutoLike has the potential to assist regulators in auditing recommendation systems for problematic content. (Warning: This paper contains qualitative examples that may be viewed as offensive or harmful.)

[LG-57] CLEAR: Cluster-based Prompt Learning on Heterogeneous Graphs PAKDD2025

链接: https://arxiv.org/abs/2502.08918
作者: Feiyang Wang,Zhongbao Zhang,Junda Ye,Li Sun,Jianzhong Qi
类目: Machine Learning (cs.LG)
*备注: accepted by PAKDD 2025

点击查看摘要

Abstract:Prompt learning has attracted increasing attention in the graph domain as a means to bridge the gap between pretext and downstream tasks. Existing studies on heterogeneous graph prompting typically use feature prompts to modify node features for specific downstream tasks, which do not concern the structure of heterogeneous graphs. Such a design also overlooks information from the meta-paths, which are core to learning the high-order semantics of the heterogeneous graphs. To address these issues, we propose CLEAR, a Cluster-based prompt LEARNING model on heterogeneous graphs. We present cluster prompts that reformulate downstream tasks as heterogeneous graph reconstruction. In this way, we align the pretext and downstream tasks to share the same training objective. Additionally, our cluster prompts are also injected into the meta-paths such that the prompt learning process incorporates high-order semantic information entailed by the meta-paths. Extensive experiments on downstream tasks confirm the superiority of CLEAR. It consistently outperforms state-of-the-art models, achieving up to 5% improvement on the F1 metric for node classification.

[LG-58] Linear-Time User-Level DP-SCO via Robust Statistics

链接: https://arxiv.org/abs/2502.08889
作者: Badih Ghazi,Ravi Kumar,Daogao Liu,Pasin Manurangsi
类目: Machine Learning (cs.LG); Cryptography and Security (cs.CR); Data Structures and Algorithms (cs.DS); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:User-level differentially private stochastic convex optimization (DP-SCO) has garnered significant attention due to the paramount importance of safeguarding user privacy in modern large-scale machine learning applications. Current methods, such as those based on differentially private stochastic gradient descent (DP-SGD), often struggle with high noise accumulation and suboptimal utility due to the need to privatize every intermediate iterate. In this work, we introduce a novel linear-time algorithm that leverages robust statistics, specifically the median and trimmed mean, to overcome these challenges. Our approach uniquely bounds the sensitivity of all intermediate iterates of SGD with gradient estimation based on robust statistics, thereby significantly reducing the gradient estimation noise for privacy purposes and enhancing the privacy-utility trade-off. By sidestepping the repeated privatization required by previous methods, our algorithm not only achieves an improved theoretical privacy-utility trade-off but also maintains computational efficiency. We complement our algorithm with an information-theoretic lower bound, showing that our upper bound is optimal up to logarithmic factors and the dependence on \epsilon . This work sets the stage for more robust and efficient privacy-preserving techniques in machine learning, with implications for future research and application in the field.
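下面用 numpy 给出“用截尾均值聚合用户梯度、再加噪”的极简示意(截尾比例与噪声规模为笔者假设,并非原文算法的精确敏感度标定):

```python
import numpy as np

def trimmed_mean(grads: np.ndarray, trim: float = 0.1) -> np.ndarray:
    g = np.sort(grads, axis=0)                 # 逐坐标排序
    k = int(len(grads) * trim)
    return g[k: len(grads) - k].mean(axis=0)   # 去掉两端各 trim 比例后取均值

user_grads = np.random.randn(100, 20)          # 100 个用户的 20 维梯度
agg = trimmed_mean(user_grads)                 # 鲁棒统计量约束了单用户的影响
noisy = agg + np.random.normal(0, 0.05, size=agg.shape)  # 加入高斯噪声(示意)
```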

[LG-59] 2D Integrated Bayesian Tomography of Plasma Electron Density Profile for HL-3 Based on Gaussian Process

链接: https://arxiv.org/abs/2502.08882
作者: Cong Wang,Renjie Yang,Dong Li,Zongyu Yang,Zhijun Wang,Yixiong Wei,Jing Li
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:This paper introduces an integrated Bayesian model that combines line integral measurements and point values using Gaussian Process (GP). The proposed method leverages Gaussian Process Regression (GPR) to incorporate point values into 2D profiles and employs coordinate mapping to integrate magnetic flux information for 2D inversion. The average relative error of the reconstructed profile, using the integrated Bayesian tomography model with normalized magnetic flux, is as low as 3.60*10^(-4). Additionally, sensitivity tests were conducted on the number of grids, the standard deviation of synthetic diagnostic data, and noise levels, laying a solid foundation for the application of the model to experimental data. This work not only achieves accurate 2D inversion using the integrated Bayesian model but also provides a robust framework for decoupling pressure information from equilibrium reconstruction, thus making it possible to optimize equilibrium reconstruction using inversion results.

[LG-60] WENDy for Nonlinear-in-Parameter ODEs

链接: https://arxiv.org/abs/2502.08881
作者: Nic Rummel,Daniel A. Messenger,Stephen Becker,Vanja Dukic,David M. Bortz
类目: Machine Learning (cs.LG); Methodology (stat.ME); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:The Weak-form Estimation of Non-linear Dynamics (WENDy) algorithm is extended to accommodate systems of ordinary differential equations that are nonlinear-in-parameters (NiP). The extension rests on derived analytic expressions for a likelihood function, its gradient and its Hessian matrix. WENDy makes use of these to approximate a maximum likelihood estimator based on optimization routines suited for non-convex optimization problems. The resulting parameter estimation algorithm has better accuracy, a substantially larger domain of convergence, and is often orders of magnitude faster than the conventional output-error least squares method (based on forward solvers). The algorithm (this http URL) is efficiently implemented in Julia. We demonstrate the algorithm’s ability to accommodate weak-form optimization for both additive normal and multiplicative log-normal noise, and present results on a suite of benchmark systems of ordinary differential equations. To demonstrate the practical benefits of our approach, we present extensive comparisons between our method and output-error methods in terms of accuracy, precision, bias, and coverage.

[LG-61] Robust Graph-Based Semi-Supervised Learning via p-Conductances

Link: https://arxiv.org/abs/2502.08873
Authors: Sawyer Jack Robertson, Chester Holtz, Zhengchao Wan, Gal Mishne, Alexander Cloninger
Subjects: Machine Learning (cs.LG); Discrete Mathematics (cs.DM); Optimization and Control (math.OC)
Comments: 29 pages, 7 figures

Abstract:We study the problem of semi-supervised learning on graphs in the regime where data labels are scarce or possibly corrupted. We propose an approach called p-conductance learning that generalizes the p-Laplace and Poisson learning methods by introducing an objective reminiscent of p-Laplacian regularization and an affine relaxation of the label constraints. This leads to a family of probability measure mincut programs that balance sparse edge removal with accurate distribution separation. Our theoretical analysis connects these programs to well-known variational and probabilistic problems on graphs (including randomized cuts, effective resistance, and Wasserstein distance) and provides motivation for robustness when labels are diffused via the heat kernel. Computationally, we develop a semismooth Newton-conjugate gradient algorithm and extend it to incorporate class-size estimates when converting the continuous solutions into label assignments. Empirical results on computer vision and citation datasets demonstrate that our approach achieves state-of-the-art accuracy in low label-rate, corrupted-label, and partial-label regimes.

[LG-62] When and why randomised exploration works (in linear bandits)

Link: https://arxiv.org/abs/2502.08870
Authors: Marc Abeille, David Janz, Ciara Pike-Burke
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments:

Abstract:We provide an approach for the analysis of randomised exploration algorithms like Thompson sampling that does not rely on forced optimism or posterior inflation. With this, we demonstrate that in the d-dimensional linear bandit setting, when the action space is smooth and strongly convex, randomised exploration algorithms enjoy an n-step regret bound of the order O(d\sqrt{n}\log(n)). Notably, this shows for the first time that there exist non-trivial linear bandit settings where Thompson sampling can achieve optimal dimension dependence in the regret.
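
For readers unfamiliar with the algorithm being analyzed, here is a bare-bones Thompson sampling loop for a linear bandit, written for a finite action set rather than the smooth, strongly convex action spaces the paper studies. Note there is no posterior inflation, matching the vanilla variant the analysis covers; this is a generic textbook sketch, not code from the paper.

```python
import numpy as np

def thompson_sampling(actions, theta_star, n_steps=1000, noise=0.1, reg=1.0, seed=0):
    """Vanilla Thompson sampling; `actions` is a (K, d) array of candidate arms."""
    d = actions.shape[1]
    V, b = reg * np.eye(d), np.zeros(d)       # regularized least-squares statistics
    rng = np.random.default_rng(seed)
    for _ in range(n_steps):
        theta_hat = np.linalg.solve(V, b)
        # Sample from the Gaussian posterior; no forced optimism or inflation.
        theta_tilde = rng.multivariate_normal(theta_hat, noise**2 * np.linalg.inv(V))
        a = actions[np.argmax(actions @ theta_tilde)]   # act greedily on the sample
        r = a @ theta_star + noise * rng.normal()       # observe a noisy reward
        V += np.outer(a, a)
        b += r * a
    return np.linalg.solve(V, b)              # final parameter estimate
```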

[LG-63] A Systematic Evaluation of Generative Models on Tabular Transportation Data

Link: https://arxiv.org/abs/2502.08856
Authors: Chengen Wang, Alvaro Cardenas, Gurcan Comert, Murat Kantarcioglu
Subjects: Machine Learning (cs.LG)
Comments:

Abstract:The sharing of large-scale transportation data is beneficial for transportation planning and policymaking. However, it also raises significant security and privacy concerns, as the data may include identifiable personal information, such as individuals’ home locations. To address these concerns, synthetic data generation based on real transportation data offers a promising solution that allows privacy protection while potentially preserving data utility. Although there are various synthetic data generation techniques, they are often not tailored to the unique characteristics of transportation data, such as the inherent structure of transportation networks formed by all trips in the datasets. In this paper, we use New York City taxi data as a case study to conduct a systematic evaluation of the performance of widely used tabular data generative models. In addition to traditional metrics such as distribution similarity, coverage, and privacy preservation, we propose a novel graph-based metric tailored specifically for transportation data. This metric evaluates the similarity between real and synthetic transportation networks, providing potentially deeper insights into their structural and functional alignment. We also introduced an improved privacy metric to address the limitations of the commonly-used one. Our experimental results reveal that existing tabular data generative models often fail to perform as consistently as claimed in the literature, particularly when applied to transportation data use cases. Furthermore, our novel graph metric reveals a significant gap between synthetic and real data. This work underscores the potential need to develop generative models specifically tailored to take advantage of the unique characteristics of emerging domains, such as transportation.
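
The abstract does not spell out the proposed graph-based metric, so the sketch below only illustrates the general idea of comparing real and synthetic trip networks: build an origin-destination flow matrix from each dataset and compare the resulting flow profiles. The comparison function and normalization are our own illustrative choices, not the paper's metric.

```python
import numpy as np
from collections import Counter

def flow_profile(trips, n_zones):
    """Normalized out-flow per zone from (origin, destination) trip pairs."""
    mat = np.zeros((n_zones, n_zones))
    for (o, d), c in Counter(trips).items():
        mat[o, d] = c
    return mat.sum(axis=1) / max(mat.sum(), 1.0)

def network_similarity(real_trips, synth_trips, n_zones):
    """Total-variation-style similarity of two trip networks (1 = identical flows)."""
    p = flow_profile(real_trips, n_zones)
    q = flow_profile(synth_trips, n_zones)
    return 1.0 - 0.5 * np.abs(p - q).sum()
```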

[LG-64] PLayer-FL: A Principled Approach to Personalized Layer-wise Cross-Silo Federated Learning

Link: https://arxiv.org/abs/2502.08829
Authors: Ahmed Elhussein, Gamze Gürsoy
Subjects: Machine Learning (cs.LG)
Comments:

Abstract:Non-identically distributed data is a major challenge in Federated Learning (FL). Personalized FL tackles this by balancing local model adaptation with global model consistency. One variant, partial FL, leverages the observation that early layers learn more transferable features by federating only early layers. However, current partial FL approaches use predetermined, architecture-specific rules to select layers, limiting their applicability. We introduce Principled Layer-wise-FL (PLayer-FL), which uses a novel federation sensitivity metric to identify layers that benefit from federation. This metric, inspired by model pruning, quantifies each layer’s contribution to cross-client generalization after the first training epoch, identifying a transition point in the network where the benefits of federation diminish. We first demonstrate that our federation sensitivity metric shows strong correlation with established generalization measures across diverse architectures. Next, we show that PLayer-FL outperforms existing FL algorithms on a range of tasks, also achieving more uniform performance improvements across clients.
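
The federation sensitivity metric is described as pruning-inspired; a plausible minimal proxy is the mean |weight x gradient| saliency per layer after the first epoch, shown below in PyTorch. This is our reading of the abstract, not the authors' exact definition, which quantifies cross-client generalization.

```python
import torch

def layer_saliency(model: torch.nn.Module) -> dict:
    """Per-layer mean |weight * grad| after a backward pass (pruning-style saliency).

    In a PLayer-FL-like scheme, layers before the point where this score drops
    off would be federated and the rest kept local. Call after loss.backward().
    """
    scores = {}
    for name, p in model.named_parameters():
        if p.grad is not None and p.dim() > 1:   # weight matrices only
            scores[name] = (p.detach() * p.grad.detach()).abs().mean().item()
    return scores
```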

[LG-65] A First-order Generative Bilevel Optimization Framework for Diffusion Models

Link: https://arxiv.org/abs/2502.08808
Authors: Quan Xiao, Hui Yuan, A F M Saif, Gaowen Liu, Ramana Kompella, Mengdi Wang, Tianyi Chen
Subjects: Machine Learning (cs.LG); Optimization and Control (math.OC); Machine Learning (stat.ML)
Comments:

Abstract:Diffusion models, which iteratively denoise data samples to synthesize high-quality outputs, have achieved empirical success across domains. However, optimizing these models for downstream tasks often involves nested bilevel structures, such as tuning hyperparameters for fine-tuning tasks or noise schedules in training dynamics, where traditional bilevel methods fail due to the infinite-dimensional probability space and prohibitive sampling costs. We formalize this challenge as a generative bilevel optimization problem and address two key scenarios: (1) fine-tuning pre-trained models via an inference-only lower-level solver paired with a sample-efficient gradient estimator for the upper level, and (2) training diffusion models from scratch with noise schedule optimization by reparameterizing the lower-level problem and designing a computationally tractable gradient estimator. Our first-order bilevel framework overcomes the incompatibility of conventional bilevel methods with diffusion processes, offering theoretical grounding and computational practicality. Experiments demonstrate that our method outperforms existing fine-tuning and hyperparameter search baselines.

[LG-66] InTAR: Inter-Task Auto-Reconfigurable Accelerator Design for High Data Volume Variation in DNNs

Link: https://arxiv.org/abs/2502.08807
Authors: Zifan He, Anderson Truong, Yingqi Cao, Jason Cong
Subjects: Hardware Architecture (cs.AR); Machine Learning (cs.LG)
Comments:

Abstract:The rise of deep neural networks (DNNs) has driven a boom in AI services, which results in an increased demand for computing power and memory. In modern DNNs, the data sizes produced and consumed are highly varied across operations (high data volume variation, HDV). Because existing design paradigms use fixed execution patterns that lead to either low computational efficiency due to pipeline stalls or frequent off-chip memory accesses to manage large intermediate data, HDV applications are challenging to accelerate on FPGAs. To address these challenges, we introduce the Inter-Task Auto-Reconfigurable Accelerator (InTAR), a novel accelerator design for HDV applications on FPGAs. InTAR combines the high computational efficiency of sequential execution with the reduced off-chip memory overhead of dataflow execution. It switches execution patterns automatically with a static schedule determined before circuit design based on resource constraints and model parameters. Unlike previous reconfigurable accelerators, InTAR encodes reconfiguration schedules during circuit design, allowing model-specific optimizations that allocate only the necessary logic and interconnects. Thus, InTAR achieves a high clock frequency with fewer resources and low reconfiguration time. Furthermore, InTAR supports high-level tools such as HLS for fast design generation. We implement a set of multi-task kernels in various HDV DNNs using InTAR. Compared with dataflow and sequential accelerators, InTAR exhibits 1.8× and 7.1× speedups, respectively. We also implement InTAR for GPT-2 medium as a more complex example, which achieves a speedup of 3.65 to 39.14× and a 1.72 to 10.44× boost in DSP efficiency compared to the corresponding SoTA accelerators on FPGAs.

[LG-67] Deep EEG Super-Resolution: Upsampling EEG Spatial Resolution with Generative Adversarial Networks

Link: https://arxiv.org/abs/2502.08803
Authors: Isaac Corley, Yufei Huang
Subjects: Machine Learning (cs.LG)
Comments:

Abstract:Electroencephalography (EEG) activity contains a wealth of information about what is happening within the human brain. Recording more of this data has the potential to unlock endless future applications. However, the cost of EEG hardware rises steeply with the number of EEG channels recorded simultaneously. We combat this problem in this paper by proposing a novel deep EEG super-resolution (SR) approach based on Generative Adversarial Networks (GANs). This approach can produce high spatial resolution EEG data from low resolution samples, by generating channel-wise upsampled data to effectively interpolate numerous missing channels, thus reducing the need for expensive EEG equipment. We tested the performance using an EEG dataset from a mental imagery task. Our proposed GAN model provided a 10^4-fold and 10^2-fold reduction in mean-squared error (MSE) and mean-absolute error (MAE), respectively, over the baseline bicubic interpolation method. We further validate our method by training a classifier on the original classification task, which displayed minimal loss in accuracy while using the super-resolved data. The proposed SR EEG by GAN is a promising approach to improve the spatial resolution of low density EEG headsets.

[LG-68] Low-Resolution Neural Networks

Link: https://arxiv.org/abs/2502.08795
Authors: Eduardo Lobo Lustosa Cabral, Larissa Driemeier
Subjects: Machine Learning (cs.LG)
Comments: 22 pages, 13 figures

Abstract:The expanding scale of large neural network models introduces significant challenges, driving efforts to reduce memory usage and enhance computational efficiency. Such measures are crucial to ensure the practical implementation and effective application of these sophisticated models across a wide array of use cases. This study examines the impact of parameter bit precision on model performance compared to standard 32-bit models, with a focus on multiclass object classification in images. The models analyzed include those with fully connected layers, convolutional layers, and transformer blocks, with model weight resolution ranging from 1 bit to 4.08 bits. The findings indicate that models with lower parameter bit precision achieve results comparable to 32-bit models, showing promise for use in memory-constrained devices. While low-resolution models with a small number of parameters require more training epochs to achieve accuracy comparable to 32-bit models, those with a large number of parameters achieve similar performance within the same number of epochs. Additionally, data augmentation can destabilize training in low-resolution models, but including zero as a potential value in the weight parameters helps maintain stability and prevents performance degradation. Overall, 2.32-bit weights offer the optimal balance of memory reduction, performance, and efficiency. However, further research should explore other dataset types and more complex and larger models. These findings suggest a potential new era for optimized neural network models with reduced memory requirements and improved computational efficiency, though advancements in dedicated hardware are necessary to fully realize this potential.
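
As a concrete illustration of what fractional bit precision means here, the sketch below uniformly quantizes a weight tensor to roughly 2^bits levels; 2.32 bits corresponds to about five levels, and forcing an odd level count guarantees that zero is an exact representable value, the stabilizing choice the abstract mentions. The rounding scheme is our own simplification, not the paper's.

```python
import numpy as np

def quantize_weights(w: np.ndarray, bits: float = 2.32, include_zero: bool = True):
    """Uniform symmetric quantization of weights to ~2**bits levels."""
    n_levels = int(round(2 ** bits))            # 2.32 bits -> 5 levels
    if include_zero and n_levels % 2 == 0:
        n_levels += 1                           # odd grid => exact zero level
    scale = np.abs(w).max() + 1e-12
    grid = np.linspace(-scale, scale, n_levels)
    idx = np.abs(w[..., None] - grid).argmin(axis=-1)  # nearest level per weight
    return grid[idx]
```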

[LG-69] Spectral Journey: How Transformers Predict the Shortest Path

Link: https://arxiv.org/abs/2502.08794
Authors: Andrew Cohen, Andrey Gromov, Kaiyu Yang, Yuandong Tian
Subjects: Machine Learning (cs.LG)
Comments: 12 pages

Abstract:Decoder-only transformers lead to a step-change in capability of large language models. However, opinions are mixed as to whether they are really planning or reasoning. A path to making progress in this direction is to study the model's behavior in a setting with carefully controlled data, then to interpret the learned representations and reverse-engineer the computation performed internally. We study decoder-only transformer language models trained from scratch to predict shortest paths on simple, connected and undirected graphs. In this setting, the representations and the dynamics learned by the model are interpretable. We present three major results: (1) Two-layer decoder-only language models can learn to predict shortest paths on simple, connected graphs containing up to 10 nodes. (2) Models learn a graph embedding that is correlated with the spectral decomposition of the line graph. (3) Following these insights, we discover a novel approximate path-finding algorithm, Spectral Line Navigator (SLN), that finds shortest paths by greedily selecting nodes in the space of spectral embeddings of the line graph.
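
Result (3) is concrete enough to sketch. The following NetworkX/NumPy reconstruction embeds the line graph spectrally and then greedily walks toward the target in embedding space; it follows only the abstract's one-sentence description, so details such as the number of eigenvectors, the target point, and tie-breaking are our assumptions rather than the authors' SLN. It assumes a connected graph with enough edges for the requested embedding dimension.

```python
import numpy as np
import networkx as nx

def spectral_line_navigator(G: nx.Graph, src, dst, k: int = 4):
    """Greedy path search in a spectral embedding of the line graph of G."""
    LG = nx.line_graph(G)
    nodes = list(LG.nodes())                     # line-graph nodes are edges of G
    lap = nx.normalized_laplacian_matrix(LG, nodelist=nodes).toarray()
    _, vecs = np.linalg.eigh(lap)
    emb = {e: vecs[i, 1:k + 1] for i, e in enumerate(nodes)}  # drop trivial eigvec

    def ekey(u, v):                              # undirected edges may be stored either way
        return (u, v) if (u, v) in emb else (v, u)

    # aim for the average embedding of edges touching the destination
    target = np.mean([emb[ekey(dst, nb)] for nb in G.neighbors(dst)], axis=0)
    path, cur = [src], src
    while cur != dst and len(path) <= G.number_of_nodes():
        cur = min(G.neighbors(cur),
                  key=lambda nb: np.linalg.norm(emb[ekey(cur, nb)] - target))
        path.append(cur)
    return path
```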

[LG-70] Decision Tree Based Wrappers for Hearing Loss

Link: https://arxiv.org/abs/2502.08785
Authors: Miguel Rabuge, Nuno Lourenço
Subjects: Machine Learning (cs.LG); Sound (cs.SD)
Comments:

Abstract:Audiology entities are using Machine Learning (ML) models to guide their screening towards people at risk. Feature Engineering (FE) focuses on optimizing data for ML models, with evolutionary methods being effective in feature selection and construction tasks. This work aims to benchmark an evolutionary FE wrapper, using models based on decision trees as proxies. The FEDORA framework is applied to a Hearing Loss (HL) dataset, being able to reduce data dimensionality and statistically maintain baseline performance. Compared to traditional methods, FEDORA demonstrates superior performance, with a maximum balanced accuracy of 76.2%, using 57 features. The framework also generated an individual that achieved 72.8% balanced accuracy using a single feature.

[LG-71] Learning Discontinuous Galerkin Solutions to Elliptic Problems via Small Linear Convolutional Neural Networks

Link: https://arxiv.org/abs/2502.08783
Authors: Adrian Celaya, Yimo Wang, David Fuentes, Beatrice Riviere
Subjects: Machine Learning (cs.LG); Numerical Analysis (math.NA)
Comments:

Abstract:In recent years, there has been an increasing interest in using deep learning and neural networks to tackle scientific problems, particularly in solving partial differential equations (PDEs). However, many neural network-based methods, such as physics-informed neural networks, depend on automatic differentiation and the sampling of collocation points, which can result in a lack of interpretability and lower accuracy compared to traditional numerical methods. To address this issue, we propose two approaches for learning discontinuous Galerkin solutions to PDEs using small linear convolutional neural networks. Our first approach is supervised and depends on labeled data, while our second approach is unsupervised and does not rely on any training data. In both cases, our methods use substantially fewer parameters than similar numerics-based neural networks while also demonstrating comparable accuracy to the true and DG solutions for elliptic problems.

[LG-72] Unlocking Mental Health: Exploring College Students' Well-being through Smartphone Behaviors

Link: https://arxiv.org/abs/2502.08766
Authors: Wei Xuan, Meghna Roy Chowdhury, Yi Ding, Yixue Zhao
Subjects: Computers and Society (cs.CY); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG); Software Engineering (cs.SE)
Comments: Published at International Conference on Mobile Software Engineering and Systems (MOBILESoft 2025)

Abstract:The global mental health crisis is a pressing concern, with college students particularly vulnerable to rising mental health disorders. The widespread use of smartphones among young adults, while offering numerous benefits, has also been linked to negative outcomes such as addiction and regret, significantly impacting well-being. Leveraging the longest longitudinal dataset collected over four college years through passive mobile sensing, this study is the first to examine the relationship between students’ smartphone unlocking behaviors and their mental health at scale in real-world settings. We provide the first evidence demonstrating the predictability of phone unlocking behaviors for mental health outcomes based on a large dataset, highlighting the potential of these novel features for future predictive models. Our findings reveal important variations in smartphone usage across genders and locations, offering a deeper understanding of the interplay between digital behaviors and mental health. We highlight future research directions aimed at mitigating adverse effects and promoting digital well-being in this population.

[LG-73] Demand Response Optimization MILP Framework for Microgrids with DERs

Link: https://arxiv.org/abs/2502.08764
Authors: K. Victor Sam Moses Babu, Pratyush Chakraborty, Mayukha Pal
Subjects: Systems and Control (eess.SY); Machine Learning (cs.LG)
Comments:

Abstract:The integration of renewable energy sources in microgrids introduces significant operational challenges due to their intermittent nature and the mismatch between generation and demand patterns. Effective demand response (DR) strategies are crucial for maintaining system stability and economic efficiency, particularly in microgrids with high renewable penetration. This paper presents a comprehensive mixed-integer linear programming (MILP) framework for optimizing DR operations in a microgrid with solar generation and battery storage systems. The framework incorporates load classification, dynamic price thresholding, and multi-period coordination for optimal DR event scheduling. Analysis across seven distinct operational scenarios demonstrates consistent peak load reduction of 10% while achieving energy cost savings ranging from 13.1% to 38.0%. The highest performance was observed in scenarios with high solar generation, where the framework achieved 38.0% energy cost reduction through optimal coordination of renewable resources and DR actions. The results validate the framework’s effectiveness in managing diverse operational challenges while maintaining system stability and economic efficiency.
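
To ground the formulation, here is a toy PuLP MILP with the same ingredients: binary DR-event variables, capped load shedding, battery dynamics, and an energy-cost objective. All limits and parameters are made-up placeholders; the paper's actual framework additionally includes load classification, dynamic price thresholding, and network voltage constraints.

```python
import pulp

def schedule_dr(prices, solar, base_load, batt_cap=10.0, batt_pow=3.0,
                dr_cap=0.10, max_events=4):
    """Toy MILP in the spirit of the framework (hypothetical parameters)."""
    T = range(len(prices))
    m = pulp.LpProblem("demand_response", pulp.LpMinimize)
    ev   = [pulp.LpVariable(f"ev_{t}", cat="Binary") for t in T]   # DR event on/off
    shed = [pulp.LpVariable(f"shed_{t}", lowBound=0) for t in T]
    ch   = [pulp.LpVariable(f"ch_{t}", 0, batt_pow) for t in T]
    dis  = [pulp.LpVariable(f"dis_{t}", 0, batt_pow) for t in T]
    soc  = [pulp.LpVariable(f"soc_{t}", 0, batt_cap) for t in T]
    for t in T:
        m += shed[t] <= dr_cap * base_load[t] * ev[t]   # shed only during an event
        prev = soc[t - 1] if t > 0 else 0.5 * batt_cap
        m += soc[t] == prev + ch[t] - dis[t]            # battery state of charge
    m += pulp.lpSum(ev) <= max_events                   # multi-period coordination
    grid = [base_load[t] - shed[t] - solar[t] + ch[t] - dis[t] for t in T]
    m += pulp.lpSum(prices[t] * grid[t] for t in T)     # objective: energy cost
    m.solve(pulp.PULP_CBC_CMD(msg=False))
    return [pulp.value(s) for s in shed]
```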

[LG-74] Recurrent Memory for Online Interdomain Gaussian Processes

Link: https://arxiv.org/abs/2502.08736
Authors: Wenlong Chen, Naoki Kiyohara, Harrison Bo Hua Zhu, Yingzhen Li
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments: 13 pages, 4 figures

Abstract:We propose a novel online Gaussian process (GP) model that is capable of capturing long-term memory in sequential data in an online regression setting. Our model, Online HiPPO Sparse Variational Gaussian Process Regression (OHSGPR), leverages the HiPPO (High-order Polynomial Projection Operators) framework, which is popularized in the RNN domain due to its long-range memory modeling capabilities. We interpret the HiPPO time-varying orthogonal projections as inducing variables with time-dependent orthogonal polynomial basis functions, which allows the SGPR inducing points to memorize the process history. We show that the HiPPO framework fits naturally into the interdomain GP framework and demonstrate that the kernel matrices can also be updated online in a recurrence form based on the ODE evolution of HiPPO. We evaluate our method on time series regression tasks, showing that it outperforms the existing online GP method in terms of predictive performance and computational efficiency.

[LG-75] New Bounds for Sparse Variational Gaussian Processes

Link: https://arxiv.org/abs/2502.08730
Authors: Michalis K. Titsias
Subjects: Machine Learning (cs.LG); Methodology (stat.ME); Machine Learning (stat.ML)
Comments: 17 pages, 5 figures

Abstract:Sparse variational Gaussian processes (GPs) construct tractable posterior approximations to GP models. At the core of these methods is the assumption that the true posterior distribution over training function values f and inducing variables u is approximated by a variational distribution that incorporates the conditional GP prior p(f | u) in its factorization. While this assumption is considered as fundamental, we show that for model training we can relax it through the use of a more general variational distribution q(f | u) that depends on N extra parameters, where N is the number of training examples. In GP regression, we can analytically optimize the evidence lower bound over the extra parameters and express a tractable collapsed bound that is tighter than the previous bound. The new bound is also amenable to stochastic optimization and its implementation requires minor modifications to existing sparse GP code. Further, we also describe extensions to non-Gaussian likelihoods. On several datasets we demonstrate that our method can reduce bias when learning the hyperparameters and can lead to better predictive performance.

[LG-76] A Comparative Study of Machine Learning Algorithms for Stock Price Prediction Using Insider Trading Data

Link: https://arxiv.org/abs/2502.08728
Authors: Amitabh Chakravorty, Nelly Elsayed
Subjects: Machine Learning (cs.LG)
Comments: 5 pages, accepted to publish

Abstract:The research paper empirically investigates several machine learning algorithms to forecast stock prices depending on insider trading information. Insider trading offers special insights into market sentiment, pointing to upcoming changes in stock prices. This study examines the effectiveness of algorithms like decision trees, random forests, support vector machines (SVM) with different kernels, and K-Means Clustering using a dataset of Tesla stock transactions. Examining past data from April 2020 to March 2023, this study focuses on how well these algorithms identify trends and forecast stock price fluctuations. The paper uses Recursive Feature Elimination (RFE) and feature importance analysis to optimize the feature set and, hence, increase prediction accuracy. While it requires substantially greater processing time than other models, SVM with the Radial Basis Function (RBF) kernel displays the best accuracy. This paper highlights the trade-offs between accuracy and efficiency in machine learning models and proposes the possibility of pooling multiple data sources to raise prediction performance. The results of this paper aim to help financial analysts and investors in choosing strong algorithms to optimize investment strategies.
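
A compact scikit-learn pipeline reproduces the shape of this workflow: RFE for feature selection followed by an RBF-kernel SVM. Since RFE needs coefficient-based rankings, a linear model does the ranking here; the synthetic data is a stand-in for the paper's insider-trading features, so numbers will not match theirs.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Stand-in data; the paper uses engineered features from Tesla insider-trading records.
X, y = make_classification(n_samples=500, n_features=20, n_informative=6, random_state=0)

# RFE requires a model exposing coefficients, so a linear model ranks features first...
selector = RFE(LogisticRegression(max_iter=1000), n_features_to_select=8).fit(X, y)
X_sel = selector.transform(X)

# ...then the RBF-kernel SVM (the paper's most accurate but slowest model) is scored.
svm = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0, gamma="scale"))
print(cross_val_score(svm, X_sel, y, cv=5).mean())
```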

[LG-77] Efficient Split Learning LSTM Models for FPGA-based Edge IoT Devices ICML

Link: https://arxiv.org/abs/2502.08692
Authors: Romina Soledad Molina, Vukan Ninkovic, Dejan Vukobratovic, Maria Liz Crespo, Marco Zennaro
Subjects: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC)
Comments: Accepted for publication at IEEE ICMLCN 2025

Abstract:Split Learning (SL) recently emerged as an efficient paradigm for distributed Machine Learning (ML) suitable for the Internet Of Things (IoT)-Cloud systems. However, deploying SL on resource-constrained edge IoT platforms poses a significant challenge in terms of balancing the model performance against the processing, memory, and energy resources. In this work, we present a practical study of deploying SL framework on a real-world Field-Programmable Gate Array (FPGA)-based edge IoT platform. We address the SL framework applied to a time-series processing model based on Recurrent Neural Networks (RNNs). Set in the context of river water quality monitoring and using real-world data, we train, optimize, and deploy a Long Short-Term Memory (LSTM) model on a given edge IoT FPGA platform in different SL configurations. Our results demonstrate the importance of aligning design choices with specific application requirements, whether it is maximizing speed, minimizing power, or optimizing for resource constraints.

[LG-78] A Deep Learning approach for parametrized and time dependent Partial Differential Equations using Dimensionality Reduction and Neural ODEs

Link: https://arxiv.org/abs/2502.08683
Authors: Alessandro Longhi, Danny Lathouwers, Zoltán Perkó
Subjects: Machine Learning (cs.LG)
Comments:

Abstract:Partial Differential Equations (PDEs) are central to science and engineering. Since solving them is computationally expensive, a lot of effort has been put into approximating their solution operator, both via traditional numerical techniques and, increasingly, via Deep Learning (DL) methods. However, a conclusive methodology capable of accounting for both (continuous) time and parameter dependency in such DL models is still lacking. In this paper, we propose an autoregressive and data-driven method using the analogy with classical numerical solvers for time-dependent, parametric and (typically) nonlinear PDEs. We present how Dimensionality Reduction (DR) can be coupled with Neural Ordinary Differential Equations (NODEs) in order to learn the solution operator of arbitrary PDEs. The idea of our work is that it is possible to map the high-fidelity (i.e., high-dimensional) PDE solution space into a reduced (low-dimensional) space, which subsequently exhibits dynamics governed by a (latent) Ordinary Differential Equation (ODE). Solving this (easier) ODE in the reduced space allows avoiding solving the PDE in the high-dimensional solution space, thus decreasing the computational burden for repeated calculations, e.g., for uncertainty quantification or design optimization purposes. The main outcome of this work is the importance of exploiting DR as opposed to the recent trend of building large and complex architectures: we show that by leveraging DR we can deliver not only more accurate predictions, but also a considerably lighter and faster DL model compared to existing methodologies.

[LG-79] Democratizing AI Governance: Balancing Expertise and Public Participation

Link: https://arxiv.org/abs/2502.08651
Authors: Lucile Ter-Minassian
Subjects: Computers and Society (cs.CY); Machine Learning (cs.LG)
Comments:

Abstract:The development and deployment of artificial intelligence (AI) systems, with their profound societal impacts, raise critical challenges for governance. Historically, technological innovations have been governed by concentrated expertise with limited public input. However, AI’s pervasive influence across domains such as healthcare, employment, and justice necessitates inclusive governance approaches. This article explores the tension between expert-led oversight and democratic participation, analyzing models of participatory and deliberative democracy. Using case studies from France and Brazil, we highlight how inclusive frameworks can bridge the gap between technical complexity and public accountability. Recommendations are provided for integrating these approaches into a balanced governance model tailored to the European Union, emphasizing transparency, diversity, and adaptive regulation to ensure that AI governance reflects societal values while maintaining technical rigor. This analysis underscores the importance of hybrid frameworks that unite expertise and public voice in shaping the future of AI policy.

[LG-80] Communicating Likelihoods with Normalising Flows

Link: https://arxiv.org/abs/2502.09494
Authors: Jack Y. Araz, Anja Beck, Méril Reboud, Michael Spannowsky, Danny van Dyk
Subjects: High Energy Physics - Phenomenology (hep-ph); Machine Learning (cs.LG); High Energy Physics - Experiment (hep-ex); Data Analysis, Statistics and Probability (physics.data-an)
Comments: 4 pages + references, 1 figure

Abstract:We present a machine-learning-based workflow to model an unbinned likelihood from its samples. A key advancement over existing approaches is the validation of the learned likelihood using rigorous statistical tests, such as the Kolmogorov-Smirnov test of the joint distribution. Our method enables the reliable communication of experimental and phenomenological likelihoods for subsequent analyses. We demonstrate its effectiveness through three case studies in high-energy physics. To support broader adoption, we provide an open-source reference implementation, nabu.
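
The exact validation procedure lives in the paper and the nabu package; the snippet below only illustrates the flavor of KS-based checks by comparing flow samples against held-out data along random 1-D projections, a common practical surrogate when the joint CDF is unavailable. The thresholding rule is an assumption.

```python
import numpy as np
from scipy.stats import ks_2samp

def ks_validate(real_samples, flow_samples, n_proj=50, alpha=0.01, seed=0):
    """Two-sample KS tests of real vs. model samples along random projections."""
    rng = np.random.default_rng(seed)
    d = real_samples.shape[1]
    pvals = []
    for _ in range(n_proj):
        u = rng.normal(size=d)
        u /= np.linalg.norm(u)                       # random unit direction
        pvals.append(ks_2samp(real_samples @ u, flow_samples @ u).pvalue)
    # Bonferroni-style decision: reject the fit if any projection clearly mismatches.
    return min(pvals) > alpha / n_proj, pvals
```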

[LG-81] Assessing Generative AI value in a public sector context: evidence from a field experiment

Link: https://arxiv.org/abs/2502.09479
Authors: Trevor Fitzpatrick, Seamus Kelly, Patrick Carey, David Walsh, Ruairi Nugent
Subjects: General Finance (q-fin.GN); Machine Learning (cs.LG); General Economics (econ.GN)
Comments:

Abstract:The emergence of Generative AI (Gen AI) has motivated an interest in understanding how it could be used to enhance productivity across various tasks. We add to the research on the performance impact of Gen AI on complex knowledge-based tasks in a public sector setting. In a pre-registered experiment, after establishing a baseline level of performance, we find mixed evidence for two types of composite tasks related to document understanding and data analysis. For the Documents task, the treatment group using Gen AI had a 17% improvement in answer quality scores (as judged by human evaluators) and a 34% improvement in task completion time compared to a control group. For the Data task, we find the Gen AI treatment group experienced a 12% reduction in quality scores and no significant difference in mean completion time compared to the control group. These results suggest that the benefits of Gen AI may be task and potentially respondent dependent. We also discuss field notes and lessons learned, as well as supplementary insights from a post-trial survey and feedback workshop with participants.

[LG-82] A Differentiable Rank-Based Objective For Better Feature Learning

Link: https://arxiv.org/abs/2502.09445
Authors: Krunoslav Lehman Pavasovic, David Lopez-Paz, Giulio Biroli, Levent Sagun
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
Comments:

Abstract:In this paper, we leverage existing statistical methods to better understand feature learning from data. We tackle this by modifying the model-free variable selection method, Feature Ordering by Conditional Independence (FOCI), which is introduced in \cite{azadkia2021simple}. While FOCI is based on a non-parametric coefficient of conditional dependence, we introduce its parametric, differentiable approximation. With this approximate coefficient of correlation, we present a new algorithm called difFOCI, which is applicable to a wider range of machine learning problems thanks to its differentiable nature and learnable parameters. We present difFOCI in three contexts: (1) as a variable selection method with baseline comparisons to FOCI, (2) as a trainable model parametrized with a neural network, and (3) as a generic, widely applicable neural network regularizer, one that improves feature learning with better management of spurious correlations. We evaluate difFOCI on increasingly complex problems ranging from basic variable selection in toy examples to saliency map comparisons in convolutional networks. We then show how difFOCI can be incorporated in the context of fairness to facilitate classifications without relying on sensitive data.

[LG-83] Non-asymptotic Analysis of Diffusion Annealed Langevin Monte Carlo for Generative Modelling

Link: https://arxiv.org/abs/2502.09306
Authors: Paula Cordero-Encinar, O. Deniz Akyildiz, Andrew B. Duncan
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Probability (math.PR); Computation (stat.CO)
Comments:

Abstract:We investigate the theoretical properties of general diffusion (interpolation) paths and their Langevin Monte Carlo implementation, referred to as diffusion annealed Langevin Monte Carlo (DALMC), under weak conditions on the data distribution. Specifically, we analyse and provide non-asymptotic error bounds for the annealed Langevin dynamics where the path of distributions is defined as Gaussian convolutions of the data distribution as in diffusion models. We then extend our results to recently proposed heavy-tailed (Student’s t) diffusion paths, demonstrating their theoretical properties for heavy-tailed data distributions for the first time. Our analysis provides theoretical guarantees for a class of score-based generative models that interpolate between a simple distribution (Gaussian or Student’s t) and the data distribution in finite time. This approach offers a broader perspective compared to standard score-based diffusion approaches, which are typically based on a forward Ornstein-Uhlenbeck (OU) noising process.
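
Schematically, DALMC with a Gaussian convolution path looks like the loop below: anneal the noise level from large to small, running Langevin steps against the score of the smoothed distribution at each level. The step-size schedule is a common heuristic rather than the one analyzed in the paper, and `score_fn` is assumed given.

```python
import numpy as np

def dalmc(score_fn, sigmas, n_steps=50, base_step=1e-2, dim=2, seed=0):
    """Annealed Langevin Monte Carlo along a Gaussian-convolution path.

    `score_fn(x, sigma)` should return the score of the data distribution
    convolved with N(0, sigma^2 I); `sigmas` decreases toward ~0.
    """
    rng = np.random.default_rng(seed)
    x = sigmas[0] * rng.normal(size=dim)            # start from the simple reference
    for sigma in sigmas:
        eps = base_step * (sigma / sigmas[0]) ** 2  # shrink step with the noise level
        for _ in range(n_steps):
            x = x + eps * score_fn(x, sigma) + np.sqrt(2 * eps) * rng.normal(size=dim)
    return x
```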

[LG-84] Joint Attention Mechanism Learning to Facilitate Opto-physiological Monitoring during Physical Activity

Link: https://arxiv.org/abs/2502.09291
Authors: Xiaoyu Zheng, Sijung Hu, Vincent Dwyer, Mahsa Derakhshani, Laura Barrett
Subjects: Signal Processing (eess.SP); Machine Learning (cs.LG)
Comments:

Abstract:Opto-physiological monitoring is a non-contact technique for measuring cardiac signals, i.e., photoplethysmography (PPG). Quality PPG signals directly lead to reliable physiological readings. However, PPG signal acquisition procedures are often accompanied by spurious motion artefacts (MAs), especially during low-to-high-intensity physical activity. This study proposes a practical adversarial learning approach for opto-physiological monitoring by using a generative adversarial network with an attention mechanism (AM-GAN) to model motion noise and to allow MA removal. The AM-GAN learns an MA-resistant mapping from raw and noisy signals to clear PPG signals in an adversarial manner, guided by an attention mechanism to directly translate the motion reference of triaxial acceleration to the MAs appearing in the raw signal. The AM-GAN was evaluated on three different protocols involving 39 subjects engaged in various physical activities. The average absolute error for heart rate (HR) derived from the MA-free PPG signal via the AM-GAN is 1.81 beats/min for the IEEE-SPC dataset and 3.86 beats/min for the PPGDalia dataset. The same procedure applied to an in-house LU dataset resulted in average absolute errors for HR and respiratory rate (RR) of less than 1.37 beats/min and 2.49 breaths/min, respectively. The study demonstrates the robustness and resilience of AM-GAN, particularly during low-to-high-intensity physical activities.

[LG-85] Dynamic Rolling Horizon Optimization for Network-Constrained V2X Value Stacking of Electric Vehicles Under Uncertainties

Link: https://arxiv.org/abs/2502.09290
Authors: Canchen Jiang, Ariel Liebman, Bo Jie, Hao Wang
Subjects: Optimization and Control (math.OC); Machine Learning (cs.LG); Systems and Control (eess.SY)
Comments: 21 pages, accepted by Renewable Energy

Abstract:Electric vehicle (EV) coordination can provide significant benefits through vehicle-to-everything (V2X) by interacting with the grid, buildings, and other EVs. This work aims to develop a V2X value-stacking framework, including vehicle-to-building (V2B), vehicle-to-grid (V2G), and energy trading, to maximize economic benefits for residential communities while maintaining distribution voltage. This work also seeks to quantify the impact of prediction errors related to building load, renewable energy, and EV arrivals. A dynamic rolling-horizon optimization (RHO) method is employed to leverage multiple revenue streams and maximize the potential of EV coordination. To address energy uncertainties, including hourly local building load, local photovoltaic (PV) generation, and EV arrivals, this work develops a Transformer-based forecasting model named Gated Recurrent Units-Encoder-Temporal Fusion Decoder (GRU-EN-TFD). The simulation results, using real data from Australia’s National Electricity Market, and the Independent System Operators in New England and New York in the US, reveal that V2X value stacking can significantly reduce energy costs. The proposed GRU-EN-TFD model outperforms the benchmark forecast model. Uncertainties in EV arrivals have a more substantial impact on value-stacking performance, highlighting the significance of its accurate forecast. This work provides new insights into the dynamic interactions among residential communities, unlocking the full potential of EV batteries.

[LG-86] Quantifying Cryptocurrency Unpredictability: A Comprehensive Study of Complexity and Forecasting

Link: https://arxiv.org/abs/2502.09079
Authors: Francesco Puoti, Fabrizio Pittorino, Manuel Roveri
Subjects: Statistical Finance (q-fin.ST); Machine Learning (cs.LG); Computational Finance (q-fin.CP)
Comments: This is the author's accepted manuscript, modified per ACM self-archiving policy. The definitive Version of Record is available at this https URL

Abstract:This paper offers a thorough examination of the univariate predictability of cryptocurrency time-series. By exploiting a combination of complexity measures and model predictions, we explore the cryptocurrency time-series forecasting task, focusing on the exchange rate in USD of Litecoin, Binance Coin, Bitcoin, Ethereum, and XRP. On one hand, to assess the complexity and the randomness of these time-series, a comparative analysis has been performed using Brownian and colored noises as a benchmark. The results obtained from the Complexity-Entropy causality plane and power density spectrum analysis reveal that cryptocurrency time-series exhibit characteristics closely resembling those of Brownian noise when analyzed in a univariate context. On the other hand, the application of a wide range of statistical, machine and deep learning models for time-series forecasting demonstrates the low predictability of cryptocurrencies. Notably, our analysis reveals that simpler models such as Naive models consistently outperform the more complex machine and deep learning ones in terms of forecasting accuracy across different forecast horizons and time windows. The combined study of complexity and forecasting accuracies highlights the difficulty of predicting the cryptocurrency market. These findings provide valuable insights into the inherent characteristics of the cryptocurrency data and highlight the need to reassess the challenges associated with predicting cryptocurrency price movements.

[LG-87] Optimal Algorithms in Linear Regression under Covariate Shift: On the Importance of Precondition

Link: https://arxiv.org/abs/2502.09047
Authors: Yuanshi Liu, Haihan Zhang, Qian Chen, Cong Fang
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
Comments:

Abstract:A common pursuit in modern statistical learning is to attain satisfactory generalization out of the source data distribution (OOD). In theory, the challenge remains unsolved even under the canonical setting of covariate shift for the linear model. This paper studies the foundational (high-dimensional) linear regression where the ground truth variables are confined to an ellipse-shape constraint and addresses two fundamental questions in this regime: (i) given the target covariate matrix, what is the min-max optimal algorithm under covariate shift? (ii) for what kinds of target classes, the commonly-used SGD-type algorithms achieve optimality? Our analysis starts with establishing a tight lower generalization bound via a Bayesian Cramer-Rao inequality. For (i), we prove that the optimal estimator can be simply a certain linear transformation of the best estimator for the source distribution. Given the source and target matrices, we show that the transformation can be efficiently computed via a convex program. The min-max optimal analysis for SGD leverages the idea that we recognize both the accumulated updates of the applied algorithms and the ideal transformation as preconditions on the learning variables. We provide sufficient conditions when SGD with its acceleration variants attain optimality.

[LG-88] Off-Policy Evaluation for Recommendations with Missing-Not-At-Random Rewards

Link: https://arxiv.org/abs/2502.08993
Authors: Tatsuki Takahashi, Chihiro Maru, Hiroko Shoji
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
Comments: 4 pages

Abstract:Unbiased recommender learning (URL) and off-policy evaluation/learning (OPE/L) techniques are effective in addressing the data bias caused by display position and logging policies, thereby consistently improving the performance of recommendations. However, when both biases exist in the logged data, these estimators may suffer from significant bias. In this study, we first analyze the position bias of the OPE estimator when rewards are missing not at random. To mitigate both biases, we propose a novel estimator that leverages two probabilities of logging policies and reward observations as propensity scores. Our experiments demonstrate that the proposed estimator achieves superior performance compared to other estimators, even as the level of bias in reward observations increases.
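
A minimal sketch of the idea of weighting by two propensities, assuming both the logging-policy propensity and the reward-observation probability are known per record; the clipping constant and the exact estimator form are simplifications of what the paper proposes, not its definition.

```python
import numpy as np

def dual_propensity_estimate(rewards, match, p_logging, p_observe, observed, eps=1e-6):
    """Value estimate correcting for both the logging policy and MNAR rewards.

    match:    1 if the target policy would take the logged action, else 0
    observed: 1 if the reward for this record was actually observed
    """
    w = match / np.clip(p_logging, eps, None)          # standard inverse-propensity weight
    w = w * observed / np.clip(p_observe, eps, None)   # correction for missing rewards
    return float(np.mean(w * rewards))
```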

[LG-89] Treatment response as a latent variable

Link: https://arxiv.org/abs/2502.08776
Authors: Christopher Tosh, Boyuan Zhang, Wesley Tansey
Subjects: Methodology (stat.ME); Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments:

Abstract:Scientists often need to analyze the samples in a study that responded to treatment in order to refine their hypotheses and find potential causal drivers of response. Natural variation in outcomes makes teasing apart responders from non-responders a statistical inference problem. To handle latent responses, we introduce the causal two-groups (C2G) model, a causal extension of the classical two-groups model. The C2G model posits that treated samples may or may not experience an effect, according to some prior probability. We propose two empirical Bayes procedures for the causal two-groups model, one under semi-parametric conditions and another under fully nonparametric conditions. The semi-parametric model assumes additive treatment effects and is identifiable from observed data. The nonparametric model is unidentifiable, but we show it can still be used to test for response in each treated sample. We show empirically and theoretically that both methods for selecting responders control the false discovery rate at the target level with near-optimal power. We also propose two novel estimands of interest and provide a strategy for deriving estimand intervals in the unidentifiable nonparametric model. On a cancer immunotherapy dataset, the nonparametric C2G model recovers clinically-validated predictive biomarkers of both positive and negative outcomes. Code is available at this https URL.

[LG-90] Compression of Site-Specific Deep Neural Networks for Massive MIMO Precoding ICML

Link: https://arxiv.org/abs/2502.08758
Authors: Ghazal Kasalaee, Ali Hasanzadeh Karkan, Jean-François Frigon, François Leduc-Primeau
Subjects: Signal Processing (eess.SP); Machine Learning (cs.LG)
Comments: This preprint comprises 6 pages and features 3 figures. It has been accepted to the IEEE International Conference on Machine Learning and Computer Networking (ICMLCN) 2025

Abstract:The deployment of deep learning (DL) models for precoding in massive multiple-input multiple-output (mMIMO) systems is often constrained by high computational demands and energy consumption. In this paper, we investigate the compute energy efficiency of mMIMO precoders using DL-based approaches, comparing them to conventional methods such as zero forcing and weighted minimum mean square error (WMMSE). Our energy consumption model accounts for both memory access and calculation energy within DL accelerators. We propose a framework that incorporates mixed-precision quantization-aware training and neural architecture search to reduce energy usage without compromising accuracy. Using a ray-tracing dataset covering various base station sites, we analyze how site-specific conditions affect the energy efficiency of compressed models. Our results show that deep neural network compression generates precoders with up to 35 times higher energy efficiency than WMMSE at equal performance, depending on the scenario and the desired rate. These results establish a foundation and a benchmark for the development of energy-efficient DL-based mMIMO precoders.

[LG-91] A Low-Complexity Plug-and-Play Deep Learning Model for Massive MIMO Precoding Across Sites ICML

Link: https://arxiv.org/abs/2502.08757
Authors: Ali Hasanzadeh Karkan, Ahmed Ibrahim, Jean-François Frigon, François Leduc-Primeau
Subjects: Signal Processing (eess.SP); Machine Learning (cs.LG)
Comments: This preprint comprises 6 pages and features 2 figures. It has been accepted to the IEEE International Conference on Machine Learning and Computer Networking (ICMLCN) 2025

Abstract:Massive multiple-input multiple-output (mMIMO) technology has transformed wireless communication by enhancing spectral efficiency and network capacity. This paper proposes a novel deep learning-based mMIMO precoder to tackle the complexity challenges of existing approaches, such as weighted minimum mean square error (WMMSE), while leveraging meta-learning domain generalization and a teacher-student architecture to improve generalization across diverse communication environments. When deployed to a previously unseen site, the proposed model achieves excellent sum-rate performance while maintaining low computational complexity by avoiding matrix inversions and by using a simpler neural network structure. The model is trained and tested on a custom ray-tracing dataset composed of several base station locations. The experimental results indicate that our method effectively balances computational efficiency with high sum-rate performance while showcasing strong generalization performance in unseen environments. Furthermore, with fine-tuning, the proposed model outperforms WMMSE across all tested sites and SNR conditions while reducing complexity by at least 73×.

[LG-92] A Bayesian Nonparametric Perspective on Mahalanobis Distance for Out of Distribution Detection

Link: https://arxiv.org/abs/2502.08695
Authors: Randolph W. Linderman (1), Yiran Chen (1), Scott W. Linderman (2) ((1) Electrical and Computer Engineering Department, Duke University, Durham, NC, USA, (2) Statistics Department and The Wu Tsai Neurosciences Institute, Stanford University, Stanford, CA, USA)
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
Comments: 32 pages, 5 figures, code is available at this https URL

Abstract:Bayesian nonparametric methods are naturally suited to the problem of out-of-distribution (OOD) detection. However, these techniques have largely been eschewed in favor of simpler methods based on distances between pre-trained or learned embeddings of data points. Here we show a formal relationship between Bayesian nonparametric models and the relative Mahalanobis distance score (RMDS), a commonly used method for OOD detection. Building on this connection, we propose Bayesian nonparametric mixture models with hierarchical priors that generalize the RMDS. We evaluate these models on the OpenOOD detection benchmark and show that Bayesian nonparametric methods can improve upon existing OOD methods, especially in regimes where training classes differ in their covariance structure and where there are relatively few data points per class.
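
For reference, the RMDS the paper builds on can be written in a few lines: score an embedding by its best class-conditional Mahalanobis distance minus its distance under a single background Gaussian fit to all training data. Fitting the means and covariances is left out; this sketch follows the standard RMDS definition rather than anything specific to the proposed Bayesian nonparametric models.

```python
import numpy as np

def rmds(z, class_means, shared_cov, global_mean, global_cov):
    """Relative Mahalanobis distance score (higher => more likely OOD)."""
    def maha(x, mu, cov):
        diff = x - mu
        return float(diff @ np.linalg.solve(cov, diff))
    md_class = min(maha(z, mu, shared_cov) for mu in class_means)  # nearest class
    md_background = maha(z, global_mean, global_cov)               # single Gaussian
    return md_class - md_background
```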

Information Retrieval

[IR-0] FARM: Frequency-Aware Model for Cross-Domain Live-Streaming Recommendation

Link: https://arxiv.org/abs/2502.09375
Authors: Xiaodong Li, Ruochen Yang, Shuang Wen, Shen Wang, Yueyang Liu, Guoquan Wang, Weisong Hu, Qiang Luo, Jiawei Sheng, Tingwen Liu, Jiangxia Cao, Shuang Yang, Zhaojie Liu
Subjects: Information Retrieval (cs.IR)
Comments:

Abstract:Live-streaming services have attracted widespread popularity due to their real-time interactivity and entertainment value. Users can engage with live-streaming authors by participating in live chats, posting likes, or sending virtual gifts to convey their preferences and support. However, live-streaming services face a serious data-sparsity problem, which can be attributed to the following two points: (1) Users' valuable behaviors, e.g., likes, comments and gifts, are usually sparse and easily overlooked by the model, making it difficult to describe users' personalized preferences. (2) The main exposure content on our platform is short-video, whose exposure is 9 times higher than that of live-streaming, so live-streaming content alone cannot fully model user preference. To this end, we propose a Frequency-Aware Model for Cross-Domain Live-Streaming Recommendation, termed FARM. Specifically, we first present the intra-domain frequency-aware module to enable our model to perceive users' sparse yet valuable behaviors, i.e., high-frequency information, supported by the Discrete Fourier Transform (DFT). To transfer user preference across the short-video and live-streaming domains, we propose a novel preference align before fuse strategy, which consists of two parts: the cross-domain preference align module, which aligns user preference in both domains with contrastive learning, and the cross-domain preference fuse module, which further fuses user preference in both domains using a series of tailor-designed attention mechanisms. Extensive offline experiments and online A/B testing on Kuaishou live-streaming services demonstrate the effectiveness and superiority of FARM. FARM has been deployed in online live-streaming services and currently serves hundreds of millions of users on Kuaishou.
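
The intra-domain frequency-aware module is described only at the level of "perceive high-frequency information via the DFT"; one literal reading is to high-pass filter the behavior-embedding sequence, as in the hypothetical sketch below. The cutoff ratio and the choice of a real FFT are our assumptions, not details from the paper.

```python
import numpy as np

def high_frequency_part(seq_emb: np.ndarray, keep_ratio: float = 0.3) -> np.ndarray:
    """High-pass filter a (time, dim) behavior-embedding sequence with the DFT,
    keeping only the top `keep_ratio` of frequency bins; sparse, bursty signals
    such as likes, comments, and gifts live in the higher frequencies."""
    spec = np.fft.rfft(seq_emb, axis=0)
    cutoff = int(spec.shape[0] * (1.0 - keep_ratio))
    spec[:cutoff] = 0.0                      # zero out the low-frequency bins
    return np.fft.irfft(spec, n=seq_emb.shape[0], axis=0)
```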

[IR-1] KET-RAG: A Cost-Efficient Multi-Granular Indexing Framework for Graph-RAG

Link: https://arxiv.org/abs/2502.09304
Authors: Yiqian Huang, Shiqi Zhang, Xiaokui Xiao
Subjects: Information Retrieval (cs.IR)
Comments:

Abstract:Graph-RAG constructs a knowledge graph from text chunks to improve retrieval in Large Language Model (LLM)-based question answering. It is particularly useful in domains such as biomedicine, law, and political science, where retrieval often requires multi-hop reasoning over proprietary documents. Some existing Graph-RAG systems construct KNN graphs based on text chunk relevance, but this coarse-grained approach fails to capture entity relationships within texts, leading to sub-par retrieval and generation quality. To address this, recent solutions leverage LLMs to extract entities and relationships from text chunks, constructing triplet-based knowledge graphs. However, this approach incurs significant indexing costs, especially for large document collections. To ensure good result accuracy while reducing the indexing cost, we propose KET-RAG, a multi-granular indexing framework. KET-RAG first identifies a small set of key text chunks and leverages an LLM to construct a knowledge graph skeleton. It then builds a text-keyword bipartite graph from all text chunks, serving as a lightweight alternative to a full knowledge graph. During retrieval, KET-RAG searches both structures: it follows the local search strategy of existing Graph-RAG systems on the skeleton while mimicking this search on the bipartite graph to improve retrieval quality. We evaluate eight solutions on two real-world datasets, demonstrating that KET-RAG outperforms all competitors in indexing cost, retrieval effectiveness, and generation quality. Notably, it achieves comparable or superior retrieval quality to Microsoft's Graph-RAG while reducing indexing costs by over an order of magnitude. Additionally, it improves the generation quality by up to 32.4% while lowering indexing costs by around 20%.

[IR-2] Use of Air Quality Sensor Network Data for Real-time Pollution-Aware POI Suggestion

Link: https://arxiv.org/abs/2502.09155
Authors: Giuseppe Fasano, Yashar Deldjoo, Tommaso di Noia, Bianca Lau, Sina Adham-Khiabani, Eric Morris, Xia Liu, Ganga Chinna Rao Devarapu, Liam O'Faolain
Subjects: Information Retrieval (cs.IR)
Comments:

Abstract:This demo paper presents AirSense-R, a privacy-preserving mobile application that provides real-time, pollution-aware recommendations for points of interest (POIs) in urban environments. By combining real-time air quality monitoring data with user preferences, the proposed system aims to help users make health-conscious decisions about the locations they visit. The application utilizes collaborative filtering for personalized suggestions and federated learning for privacy protection, and integrates air pollutant readings from AirSENCE sensor networks in cities such as Bari, Italy, and Cork, Ireland. Additionally, the AirSENCE prediction engine can be employed to detect anomalous readings and interpolate air quality readings in areas with sparse sensor coverage. This system offers a promising, health-oriented POI recommendation solution that adapts dynamically to current urban air quality conditions while safeguarding user privacy. The code of AirTOWN and a demonstration video are made available at the following repo: this https URL.

[IR-3] Semantic Ads Retrieval at Walmart eCommerce with Language Models Progressively Trained on Multiple Knowledge Domains

Link: https://arxiv.org/abs/2502.09089
Authors: Zhaodong Wang, Weizhi Du, Md Omar Faruk Rokon, Pooshpendu Adhikary, Yanbing Xue, Jiaxuan Xu, Jianghong Zhou, Kuang-chih Lee, Musen Wen
Subjects: Information Retrieval (cs.IR)
Comments:

Abstract:Sponsored search in e-commerce poses several unique and complex challenges. These challenges stem from factors such as the asymmetric language structure between search queries and product names, the inherent ambiguity in user search intent, and the vast volume of sparse and imbalanced search corpus data. The role of the retrieval component within a sponsored search system is pivotal, serving as the initial step that directly affects the subsequent ranking and bidding systems. In this paper, we present an end-to-end solution tailored to optimize the ads retrieval system on this http URL. First, we pretrain the BERT-like classification model with product category information, enhancing the model's understanding of Walmart product semantics. Second, we design a two-tower Siamese Network structure for the embedding architecture to improve training efficiency. Third, we introduce a Human-in-the-loop Progressive Fusion Training method to ensure robust model performance. Our results demonstrate the effectiveness of this pipeline. It enhances the search relevance metric by up to 16% compared to a baseline DSSM-based model. Moreover, our large-scale online A/B testing demonstrates that our approach surpasses the ad revenue of the existing production model.
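
The two-tower structure itself is standard and easy to sketch in PyTorch; below, hashed token bags stand in for the BERT-like, category-pretrained encoders used in production, and the in-batch softmax loss is one common training choice, not necessarily the one used here.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TwoTower(nn.Module):
    """Minimal two-tower retrieval model: query and product encoders trained so
    that relevant pairs have high cosine similarity (illustrative stand-in)."""
    def __init__(self, vocab_size=100_000, dim=128):
        super().__init__()
        self.query_emb = nn.EmbeddingBag(vocab_size, dim)   # bag of hashed query tokens
        self.item_emb = nn.EmbeddingBag(vocab_size, dim)    # bag of hashed product tokens
        self.proj = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))

    def forward(self, query_tokens, item_tokens):
        q = F.normalize(self.proj(self.query_emb(query_tokens)), dim=-1)
        p = F.normalize(self.proj(self.item_emb(item_tokens)), dim=-1)
        return q, p

def in_batch_softmax_loss(q, p, temperature=0.05):
    """Treat the other products in the batch as negatives."""
    logits = q @ p.T / temperature
    labels = torch.arange(q.size(0))
    return F.cross_entropy(logits, labels)
```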

[IR-4] Unleashing the Power of Large Language Model for Denoising Recommendation WWW2025

Link: https://arxiv.org/abs/2502.09058
Authors: Shuyao Wang, Zhi Zheng, Yongduo Sui, Hui Xiong
Subjects: Information Retrieval (cs.IR)
Comments: 12 pages, 5 figures, 4 tables. Accepted by WWW 2025

Abstract:Recommender systems are crucial for personalizing user experiences but often depend on implicit feedback data, which can be noisy and misleading. Existing denoising studies involve incorporating auxiliary information or learning strategies from interaction data. However, they struggle with the inherent limitations of external knowledge and interaction data, as well as the non-universality of certain predefined assumptions, hindering accurate noise identification. Recently, large language models (LLMs) have gained attention for their extensive world knowledge and reasoning abilities, yet their potential in enhancing denoising in recommendations remains underexplored. In this paper, we introduce LLaRD, a framework leveraging LLMs to improve denoising in recommender systems, thereby boosting overall recommendation performance. Specifically, LLaRD generates denoising-related knowledge by first enriching semantic insights from observational data via LLMs and inferring user-item preference knowledge. It then employs a novel Chain-of-Thought (CoT) technique over user-item interaction graphs to reveal relation knowledge for denoising. Finally, it applies the Information Bottleneck (IB) principle to align LLM-generated denoising knowledge with recommendation targets, filtering out noise and irrelevant LLM knowledge. Empirical results demonstrate LLaRD’s effectiveness in enhancing denoising and recommendation accuracy.

[IR-5] A Contextual-Aware Position Encoding for Sequential Recommendation WWW’25

Link: https://arxiv.org/abs/2502.09027
Authors: Jun Yuan, Guohao Cai, Zhenhua Dong
Subjects: Information Retrieval (cs.IR)
*Comments: Accepted by WWW'25 Industry Track

Click to view abstract

Abstract:Sequential recommendation (SR), which encodes user activity to predict the next action, has emerged as a widely adopted strategy in commercial personalized recommendation systems. A critical component of modern SR models is the attention mechanism, which synthesizes users' historical activities. This mechanism is typically order-invariant and therefore relies on position encoding (PE). Conventional SR models simply assign a learnable vector to each position, yielding only modest gains over traditional recommendation models. Moreover, little research has addressed position encoding tailored to sequential recommendation, leaving a significant gap in meeting its unique requirements. To bridge this gap, we propose a novel Contextual-Aware Position Encoding method for sequential recommendation, abbreviated CAPE. To the best of our knowledge, CAPE is the first PE method designed specifically for sequential recommendation. Comprehensive experiments on benchmark SR datasets demonstrate that CAPE consistently enhances multiple mainstream backbone models and achieves state-of-the-art performance across both small and large model scales. Furthermore, we deployed CAPE on a real-world commercial platform, clearly showcasing the effectiveness of our approach in an industrial setting. Our source code is available at this https URL.
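To illustrate what a "contextual-aware" position encoding means in contrast to a plain learnable vector per position, the sketch below gates each base position vector by the item embedding occupying that slot. This is an assumed, simplified formulation for intuition only; CAPE's actual method is defined in the paper.

```python
# Illustrative sketch of context-dependent position encoding: instead of one
# fixed learnable vector per position, the encoding is modulated by the item
# at that position. Not CAPE's exact formulation.
import torch
import torch.nn as nn

class ContextualPE(nn.Module):
    def __init__(self, max_len=50, dim=64):
        super().__init__()
        self.pos_table = nn.Embedding(max_len, dim)  # base position vectors
        self.gate = nn.Linear(dim, dim)              # context -> modulation

    def forward(self, item_emb):                     # (batch, seq, dim)
        seq_len = item_emb.size(1)
        pos = self.pos_table(torch.arange(seq_len, device=item_emb.device))
        # Each position vector is gated by the item occupying that slot,
        # so the same position encodes differently in different sequences.
        return item_emb + torch.sigmoid(self.gate(item_emb)) * pos

pe = ContextualPE()
items = torch.randn(4, 20, 64)  # toy batch of item-embedding sequences
print(pe(items).shape)          # torch.Size([4, 20, 64])
```

The output feeds into the order-invariant attention layers of an SR backbone, which is where a conventional learnable PE would otherwise be added.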

[IR-6] Optimal Dataset Size for Recommender Systems: Evaluating Algorithms Performance via Downsampling

Link: https://arxiv.org/abs/2502.08845
Authors: Ardalan Arabzadeh, Joeran Beel, Tobias Vente
Subjects: Information Retrieval (cs.IR)
*Comments:

Click to view abstract

Abstract:This thesis investigates dataset downsampling as a strategy to optimize energy efficiency in recommender systems while maintaining competitive performance. With growing dataset sizes posing computational and environmental challenges, this study explores the trade-off between energy efficiency and recommendation quality in Green Recommender Systems, which aim to reduce environmental impact. Applying two downsampling approaches to seven datasets, 12 algorithms, and two levels of core pruning, the research demonstrates significant reductions in runtime and carbon emissions. For example, a 30% downsampling portion can reduce runtime by 52% compared to the full dataset, cutting carbon emissions by up to 51.02 kg CO2e during the training of a single algorithm on a single dataset. The analysis reveals that algorithm performance under different downsampling portions depends on factors such as dataset characteristics, algorithm complexity, and the specific downsampling configuration; the effect is scenario dependent. Some algorithms with lower nDCG@10 scores than the top performers were less sensitive to the amount of training data, offering greater efficiency potential at smaller downsampling portions. On average, these algorithms retained 81% of their full-size performance using only 50% of the training set. In certain configurations, where more users were progressively included while the test-set size stayed fixed, they even achieved higher nDCG@10 scores than with the full dataset. These findings highlight the feasibility of balancing sustainability and effectiveness, providing insights for designing energy-efficient recommender systems and promoting sustainable AI practices.
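A minimal sketch of the downsampling protocol, assuming a pandas interaction table with `user_id`/`item_id`/`rating` columns: training users are sampled at a given portion while the test set stays fixed, so metrics across portions remain comparable. The column names and user-based sampling scheme are assumptions for illustration, not the thesis's exact setup.

```python
# Hedged sketch of user-based training-set downsampling with a fixed test set.
import pandas as pd

def downsample_by_users(train: pd.DataFrame, portion: float, seed: int = 42):
    """Keep all interactions of a randomly chosen `portion` of training users."""
    users = train["user_id"].drop_duplicates()
    kept = users.sample(frac=portion, random_state=seed)
    return train[train["user_id"].isin(kept)]

train = pd.DataFrame({
    "user_id": [1, 1, 2, 2, 3, 3, 4, 4],
    "item_id": [10, 11, 10, 12, 13, 14, 11, 13],
    "rating":  [5, 4, 3, 5, 2, 4, 5, 1],
})
for portion in (0.3, 0.5, 1.0):
    sub = downsample_by_users(train, portion)
    # Here one would train a recommender on `sub`, evaluate nDCG@10 on the
    # fixed held-out test set, and log runtime/energy per portion.
    print(portion, len(sub), "interactions")
```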

Attachment Download

Click to download today's full paper list