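Today's arXiv Papers | 2025-02-17

This post lists the latest papers retrieved from arXiv.org on 2025-02-17. It is updated automatically and grouped into five areas: NLP, CV, ML, AI, and IR. To receive the list by scheduled email, leave your email address in the comments.

Note: The daily paper data is retrieved from arXiv.org and updated automatically at around 12:00 each day.

Tip: To receive the daily paper data by email, leave your email address in the comments.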

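[Quick Read]: This paper addresses the insufficient alignment of Multimodal Large Language Models (MLLMs) with human preferences. Current alignment research focuses on specific areas (such as hallucination reduction), and whether aligning models with human preferences can improve MLLM capability across the board has not been explored systematically. To address this, the paper introduces MM-RLHF, a dataset of 120k fine-grained, human-annotated preference comparison pairs. Building on this dataset, it proposes improvements to both reward model quality and alignment algorithm efficiency. The key innovations are a Critique-Based Reward Model, which critiques the model output before scoring it and so offers better interpretability and richer feedback, and Dynamic Reward Scaling, which adjusts the loss weight of each sample according to the reward signal to make better use of high-quality comparison pairs. These methods markedly improve model performance across multiple dimensions and benchmarks.

Link: https://arxiv.org/abs/2502.10391
Authors: Yi-Fan Zhang,Tao Yu,Haochen Tian,Chaoyou Fu,Peiyan Li,Jianshu Zeng,Wulin Xie,Yang Shi,Huanyu Zhang,Junkang Wu,Xue Wang,Yibo Hu,Bin Wen,Fan Yang,Zhang Zhang,Tingting Gao,Di Zhang,Liang Wang,Rong Jin,Tieniu Tan
Affiliations: CASIA (Institute of Automation, Chinese Academy of Sciences), NJU (Nanjing University), HKUST (Hong Kong University of Science and Technology), NTU (Nanyang Technological University), UCAS (University of Chinese Academy of Sciences), Squirrel AI Learning, Meta AI
Categories: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Comments: Project Page: this https URL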

Abstract:Despite notable advancements in Multimodal Large Language Models (MLLMs), most state-of-the-art models have not undergone thorough alignment with human preferences. This gap exists because current alignment research has primarily achieved progress in specific areas (e.g., hallucination reduction), while the broader question of whether aligning models with human preferences can systematically enhance MLLM capability remains largely unexplored. To this end, we introduce MM-RLHF, a dataset containing 120k fine-grained, human-annotated preference comparison pairs. This dataset represents a substantial advancement over existing resources, offering superior size, diversity, annotation granularity, and quality. Leveraging this dataset, we propose several key innovations to improve both the quality of reward models and the efficiency of alignment algorithms. Notably, we introduce a Critique-Based Reward Model, which generates critiques of model outputs before assigning scores, offering enhanced interpretability and more informative feedback compared to traditional scalar reward mechanisms. Additionally, we propose Dynamic Reward Scaling, a method that adjusts the loss weight of each sample according to the reward signal, thereby optimizing the use of high-quality comparison pairs. Our approach is rigorously evaluated across 10 distinct dimensions and 27 benchmarks, with results demonstrating significant and consistent improvements in model performance. Specifically, fine-tuning LLaVA-ov-7B with MM-RLHF and our alignment algorithm leads to a 19.5% increase in conversational abilities and a 60% improvement in safety. We have open-sourced the preference dataset, reward model, training and evaluation code, as well as reward modeling and safety benchmarks. For more details, please visit our project page: this https URL.
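
The Dynamic Reward Scaling idea described above (a per-sample loss weight driven by the reward signal) can be sketched roughly as follows. This is a minimal illustration, not the authors' formulation: the bounded weight function, its parameters `w` and `k`, and the helper names are all assumptions, and a plain DPO-style pairwise loss stands in for whatever objective the paper actually uses.

```python
import math


def dpo_pair_loss(policy_margin: float, beta: float = 0.1) -> float:
    """DPO-style loss for one pair: -log(sigmoid(beta * policy_margin)).

    policy_margin is the chosen-minus-rejected log-ratio difference.
    """
    return math.log1p(math.exp(-beta * policy_margin))


def sample_weight(reward_margin: float, w: float = 0.5, k: float = 1.0) -> float:
    """Hypothetical bounded weight in [1, 1 + w) that grows with the reward margin."""
    margin = max(reward_margin, 0.0)
    return 1.0 + w * (1.0 - math.exp(-k * margin))


def scaled_batch_loss(policy_margins, reward_margins) -> float:
    """Mean of per-sample losses, each multiplied by its reward-based weight."""
    losses = [
        sample_weight(r) * dpo_pair_loss(p)
        for p, r in zip(policy_margins, reward_margins)
    ]
    return sum(losses) / len(losses)
```

Under these assumptions, a pair whose chosen response beats the rejected one by a large reward margin contributes more to the batch loss than a near-tie, while the bound `1 + w` keeps any single pair from dominating.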

[NLP-1] Aspect-Oriented Summarization for Psychiatric Short-Term Readmission Prediction

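[Quick Read]: This paper addresses the weak zero-shot performance of large language models (LLMs) on complex tasks over long documents. The key observation is that LLM summaries generated with different aspect-oriented prompts carry different information signals; the paper proposes methods to integrate the signals from these summaries effectively for supervised training. The approach aims to capture different important aspects of the original document and thereby improve performance on complex tasks such as patient outcome prediction.

Link: https://arxiv.org/abs/2502.10388
Authors: WonJin Yoon,Boyu Ren,Spencer Thomas,Chanwhi Kim,Guergana Savova,Mei-Hua Hall,Timothy Miller
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: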

Abstract:Recent progress in large language models (LLMs) has enabled the automated processing of lengthy documents even without supervised training on a task-specific dataset. Yet, their zero-shot performance in complex tasks as opposed to straightforward information extraction tasks remains suboptimal. One feasible approach for tasks with lengthy, complex input is to first summarize the document and then apply supervised fine-tuning to the summary. However, the summarization process inevitably results in some loss of information. In this study we present a method for processing the summaries of long documents aimed to capture different important aspects of the original document. We hypothesize that LLM summaries generated with different aspect-oriented prompts contain different information signals, and we propose methods to measure these differences. We introduce approaches to effectively integrate signals from these different summaries for supervised training of transformer models. We validate our hypotheses on a high-impact task – 30-day readmission prediction from a psychiatric discharge – using real-world data from four hospitals, and show that our proposed method increases the prediction performance for the complex task of predicting patient outcome.

[NLP-2] Unknown Word Detection for English as a Second Language (ESL) Learners Using Gaze and Pre-trained Language Models

[Quick Read]: This paper addresses the problem that unknown words hinder comprehension for English as a Second Language (ESL) learners while they read. The key to the solution is EyeLingo, a Transformer-based machine learning method that predicts, in real time and with high accuracy, the probability that a word is unknown from the text content and the eye gaze trajectory.

Link: https://arxiv.org/abs/2502.10378
Authors: Jiexin Ding,Bowen Zhao,Yuntao Wang,Xinyun Liu,Rui Hao,Ishan Chatterjee,Yuanchun Shi
Affiliations: Department of Computer Science and Technology, Global Innovation Exchange (GIX) Institute, Tsinghua University; Electrical & Computer Engineering, University of Washington, Seattle, WA, USA; Groundlight AI, Seattle, WA, USA; Department of Computer Science and Technology, Tsinghua University, Beijing, China; Rice University, Houston, TX, USA; School of Artificial Intelligence, University of Chinese Academy of Sciences, Beijing, China; Paul G. Allen School of Computer Science and Engineering, University of Washington, Seattle, WA, USA; Qinghai University, Xining, China
Categories: Human-Computer Interaction (cs.HC); Computation and Language (cs.CL)
Comments:

Abstract:English as a Second Language (ESL) learners often encounter unknown words that hinder their text comprehension. Automatically detecting these words as users read can enable computing systems to provide just-in-time definitions, synonyms, or contextual explanations, thereby helping users learn vocabulary in a natural and seamless manner. This paper presents EyeLingo, a transformer-based machine learning method that predicts the probability of unknown words based on text content and eye gaze trajectory in real time with high accuracy. A 20-participant user study revealed that our method can achieve an accuracy of 97.6%, and an F1-score of 71.1%. We implemented a real-time reading assistance prototype to show the effectiveness of EyeLingo. The user study shows improvement in willingness to use and usefulness compared to baseline methods.

[NLP-3] OWLS: Scaling Laws for Multilingual Speech Recognition and Translation Models

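[Quick Read]: This paper studies how neural scaling laws behave in multilingual speech tasks. It introduces OWLS, a suite of multilingual speech recognition and translation models spanning 0.25B to 18B parameters, to investigate systematically how data, model, and compute scaling affect performance. Using large-scale public speech data (up to 360K hours across 150 languages), it shows that performance improves with scale and, in particular, that scaling improves performance on low-resource languages and dialects, which helps mitigate bias and makes speech technology more accessible. The paper also shows how OWLS can be used to discover emergent abilities in large-scale speech models, supporting future research directions.

Link: https://arxiv.org/abs/2502.10373
Authors: William Chen,Jinchuan Tian,Yifan Peng,Brian Yan,Chao-Han Huck Yang,Shinji Watanabe
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
Comments: 23 pages, 13 figures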

Abstract:Neural scaling laws offer valuable insights for designing robust sequence processing architectures. While these laws have been extensively characterized in other modalities, their behavior in speech remains comparatively underexplored. In this work, we introduce OWLS, an open-access, reproducible suite of multilingual speech recognition and translation models spanning 0.25B to 18B parameters, with the 18B version being the largest speech model, to the best of our knowledge. OWLS leverages up to 360K hours of public speech data across 150 languages, enabling a systematic investigation into how data, model, and compute scaling each influence performance in multilingual speech tasks. We use OWLS to derive neural scaling laws, showing how final performance can be reliably predicted when scaling. One of our key findings is that scaling enhances performance on low-resource languages/dialects, helping to mitigate bias and improve the accessibility of speech technologies. Finally, we show how OWLS can be used to power new research directions by discovering emergent abilities in large-scale speech models. Model checkpoints will be released on this https URL for future studies.

[NLP-4] Enhancing Multilingual LLM Pretraining with Model-Based Data Selection

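[Quick Read]: This paper addresses the disparity for non-English languages in dataset filtering, which stems from limited research on multilingual data. It proposes a model-based filtering framework that uses Transformer- and FastText-based classifiers to identify diverse, knowledge-rich samples, improving the quality and efficiency of multilingual dataset filtering. The method is validated through extensive studies on the FineWeb-2 dataset and shown to generalize across language families and levels of resource availability.

Link: https://arxiv.org/abs/2502.10361
Authors: Bettina Messmer,Vinko Sabolčec,Martin Jaggi
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: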

Abstract:Dataset curation has become a basis for strong large language model (LLM) performance. While various rule-based filtering heuristics exist for English and multilingual datasets, model-based filtering techniques have primarily focused on English. To address the disparity stemming from limited research on non-English languages, we propose a model-based filtering framework for multilingual datasets that aims to identify a diverse set of structured and knowledge-rich samples. Our approach emphasizes transparency, simplicity, and efficiency, leveraging Transformer- and FastText-based classifiers to ensure the broad accessibility of our technique and data. We conduct comprehensive ablation studies on the FineWeb-2 web crawl dataset across diverse language families, scripts, and resource availability to demonstrate the effectiveness of our method. Training a 1B-parameter Llama model for 70B and 119B tokens, our approach can match the baseline MMLU score with as little as 15% of the training tokens, while also improving across other benchmarks. These findings provide strong evidence for the generalizability of our approach to other languages. As a result, we extend our framework to 20 languages for which we release the refined pretraining datasets.

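[NLP-5] Agentic Verification for Ambiguous Query Disambiguation

[Quick Read]: This paper addresses query disambiguation in Retrieval-Augmented Generation (RAG), particularly in enterprise settings, where large language models (LLMs) trained on static data struggle with domain-specific disambiguation. The key idea is to unify diversification with verification by incorporating feedback from the retriever and the generator early on, which reduces reliance on multiple retrieval and inference steps and lowers the risk of cascading errors. The proposed method, Verified-Diversification with Consolidation (VERDICT), improves efficiency and robustness; on the widely adopted ASQA benchmark it improves the grounding-aware F1 score by an average of 23% over the strongest baseline.

Link: https://arxiv.org/abs/2502.10352
Authors: Youngwon Lee,Seung-won Hwang,Ruofan Wu,Feng Yan,Danmei Xu,Moutasem Akkad,Zhewei Yao,Yuxiong He
Affiliations: Snowflake AI Research; Seoul National University; University of Houston
Categories: Computation and Language (cs.CL)
Comments: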

Abstract:In this work, we tackle the challenge of disambiguating queries in retrieval-augmented generation (RAG) to diverse yet answerable interpretations. State-of-the-arts follow a Diversify-then-Verify (DtV) pipeline, where diverse interpretations are generated by an LLM, later used as search queries to retrieve supporting passages. Such a process may introduce noise in either interpretations or retrieval, particularly in enterprise settings, where LLMs – trained on static data – may struggle with domain-specific disambiguations. Thus, a post-hoc verification phase is introduced to prune noises. Our distinction is to unify diversification with verification by incorporating feedback from retriever and generator early on. This joint approach improves both efficiency and robustness by reducing reliance on multiple retrieval and inference steps, which are susceptible to cascading errors. We validate the efficiency and effectiveness of our method, Verified-Diversification with Consolidation (VERDICT), on the widely adopted ASQA benchmark to achieve diverse yet verifiable interpretations. Empirical results show that VERDICT improves grounding-aware F1 score by an average of 23% over the strongest baseline across different backbone LLMs.

[NLP-6] Organize the Web: Constructing Domains Enhances Pre-Training Data Curation

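[Quick Read]: This paper addresses the problem that the training corpora of modern language models are huge and unstructured, which makes it hard to reason about their contents and to develop systematic data curation methods. The key solution is WebOrganizer, a framework that classifies web pages by topic and by format and thereby annotates pre-training data automatically. This makes it possible to study how mixing data from different domains improves downstream tasks, and the paper shows that insights about topics and formats can be combined to improve performance further. The approach also improves existing methods that select data by quality. Overall, constructing and mixing domains is a valuable complement to quality-based data curation.

Link: https://arxiv.org/abs/2502.10341
Authors: Alexander Wettig,Kyle Lo,Sewon Min,Hannaneh Hajishirzi,Danqi Chen,Luca Soldaini
Affiliations: Allen Institute for Artificial Intelligence; Princeton Language and Intelligence, Princeton University; University of California, Berkeley; Paul G. Allen School of Computer Science & Engineering, University of Washington
Categories: Computation and Language (cs.CL)
Comments: Project page: this https URL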

Abstract:Modern language models are trained on large, unstructured datasets consisting of trillions of tokens and obtained by crawling the web. The unstructured nature makes it difficult to reason about their contents and develop systematic approaches to data curation. In this paper, we unpack monolithic web corpora by developing taxonomies of their contents and organizing them into domains. We introduce WebOrganizer, a framework for organizing web pages in terms of both their topic and format. Using these two complementary notions of domains, we automatically annotate pre-training data by distilling annotations from a large language model into efficient classifiers. This allows us to study how data from different domains should be mixed to improve models on downstream tasks, and we show that we can combine insights about effective topics and formats to further boost performance. We demonstrate that our domain mixing also improves existing methods that select data based on quality. Furthermore, we study and compare how quality-based methods will implicitly change the domain mixture. Overall, our work demonstrates that constructing and mixing domains provides a valuable complement to quality-based data curation methods, opening new avenues for effective and insightful pre-training data curation.

[NLP-7] STAR: Spectral Truncation and Rescale for Model Merging NAACL2025

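[Quick Read]: This paper addresses the seemingly inevitable drop in task performance as the number of merged models grows. It proposes Spectral Truncation And Rescale (STAR), which mitigates "merging conflicts" by truncating small components in the respective spectral spaces and then applies an automatic parameter rescaling scheme that retains the nuclear norm of the original matrix. STAR requires no additional inference on the original training data and is robust to hyperparameter choice.

Link: https://arxiv.org/abs/2502.10339
Authors: Yu-Ang Lee,Ching-Yun Ko,Tejaswini Pedapati,I-Hsin Chung,Mi-Yen Yeh,Pin-Yu Chen
Affiliations: IBM Research; National Taiwan University; Academia Sinica
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Accepted to NAACL 2025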

Abstract:Model merging is an efficient way of obtaining a multi-task model from several pretrained models without further fine-tuning, and it has gained attention in various domains, including natural language processing (NLP). Despite the efficiency, a key challenge in model merging is the seemingly inevitable decrease in task performance as the number of models increases. In this paper, we propose Spectral Truncation And Rescale (STAR) that aims at mitigating "merging conflicts" by truncating small components in the respective spectral spaces, which is followed by an automatic parameter rescaling scheme to retain the nuclear norm of the original matrix. STAR requires no additional inference on original training data and is robust to hyperparameter choice. We demonstrate the effectiveness of STAR through extensive model merging cases on diverse NLP tasks. Specifically, STAR works robustly across varying model sizes, and can outperform baselines by 4.2% when merging 12 models on Flan-T5. Our code is publicly available at this https URL.
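
The two steps named in the abstract, truncating small spectral components and rescaling to retain the nuclear norm, can be sketched with NumPy as below. This is an illustrative reading only: which matrix is decomposed, the `keep_ratio` rule for choosing the rank, and the function name are assumptions that this listing does not confirm.

```python
import numpy as np


def star_truncate_rescale(delta: np.ndarray, keep_ratio: float = 0.5) -> np.ndarray:
    """Truncate small singular components, then rescale to keep the nuclear norm.

    keep_ratio is a hypothetical knob: the smallest rank whose singular values
    sum to at least keep_ratio of the nuclear norm is retained.
    """
    u, s, vt = np.linalg.svd(delta, full_matrices=False)
    cumulative = np.cumsum(s)
    nuclear_norm = cumulative[-1]
    if nuclear_norm == 0.0:
        return delta.copy()  # nothing to truncate in an all-zero matrix
    rank = int(np.searchsorted(cumulative, keep_ratio * nuclear_norm)) + 1
    rank = min(rank, len(s))
    s_kept = s[:rank]
    # Rescale the surviving singular values so their sum equals the original sum.
    scale = nuclear_norm / s_kept.sum()
    return (u[:, :rank] * (s_kept * scale)) @ vt[:rank]
```

The returned matrix has lower rank than the input but the same sum of singular values, which is the property the abstract attributes to the rescaling step.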

[NLP-8] Evaluating the Meta- and Object-Level Reasoning of Large Language Models for Question Answering AAAI2025

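[Quick Read]: This paper addresses the difficulty large language models (LLMs) have with Question Answering (QA) tasks that require complex, multi-step reasoning. Its key contribution is to distinguish and evaluate LLM ability at meta-level reasoning and at object-level reasoning. It introduces the Franklin dataset and uses it with three other datasets to evaluate four LLMs on these reasoning tasks. The results indicate that LLMs demonstrate meta-level reasoning with high frequency but struggle with object-level reasoning in some of the datasets; on Franklin, LLMs find the object-level reasoning challenging yet perform strongly on the meta-level reasoning requirements.

Link: https://arxiv.org/abs/2502.10338
Authors: Nick Ferguson,Liane Guillou,Alan Bundy,Kwabena Nuamah
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 8 pages. Accepted to the Workshop on Planning in the Era of LLMs (LM4Plan @ AAAI 2025)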

Abstract:Large Language Models (LLMs) excel in natural language tasks but still face challenges in Question Answering (QA) tasks requiring complex, multi-step reasoning. We outline the types of reasoning required in some of these tasks, and reframe them in terms of meta-level reasoning (akin to high-level strategic reasoning or planning) and object-level reasoning (embodied in lower-level tasks such as mathematical reasoning). Franklin, a novel dataset with requirements of meta- and object-level reasoning, is introduced and used along with three other datasets to evaluate four LLMs at question answering tasks requiring multiple steps of reasoning. Results from human annotation studies suggest LLMs demonstrate meta-level reasoning with high frequency, but struggle with object-level reasoning tasks in some of the datasets used. Additionally, evidence suggests that LLMs find the object-level reasoning required for the questions in the Franklin dataset challenging, yet they do exhibit strong performance with respect to the meta-level reasoning requirements.

[NLP-9] DeltaProduct: Increasing the Expressivity of DeltaNet Through Products of Householders

[Quick Read]: This paper addresses the fundamental trade-off between expressivity and efficiency in Linear Recurrent Neural Networks (linear RNNs) for sequence modeling. The key is DeltaProduct, which takes multiple ($n_h$) steps per token and yields diagonal plus rank-$n_h$ state-transition matrices formed as products of $n_h$ generalized Householder transformations. This provides a tunable mechanism for balancing expressivity and efficiency and gives a more stable recurrence. Experiments show that DeltaProduct performs well at state tracking and language modeling and extrapolates to longer sequences significantly better than DeltaNet.

Link: https://arxiv.org/abs/2502.10297
Authors: Julien Siems,Timur Carstensen,Arber Zela,Frank Hutter,Massimiliano Pontil,Riccardo Grazzi
Affiliations: Istituto Italiano di Tecnologia; University of Freiburg; ELLIS Institute Tübingen; AI Centre, University College London; Microsoft Research
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL); Formal Languages and Automata Theory (cs.FL)
Comments:

Abstract:Linear Recurrent Neural Networks (linear RNNs) have emerged as competitive alternatives to Transformers for sequence modeling, offering efficient training and linear-time inference. However, existing architectures face a fundamental trade-off between expressivity and efficiency, dictated by the structure of their state-transition matrices. While diagonal matrices used in architectures like Mamba, GLA, or mLSTM yield fast runtime, they suffer from severely limited expressivity. To address this, recent architectures such as (Gated) DeltaNet and RWKVv7 adopted a diagonal plus rank-1 structure, allowing simultaneous token-channel mixing, which overcomes some expressivity limitations with only a slight decrease in training efficiency. Building on the interpretation of DeltaNet’s recurrence as performing one step of online gradient descent per token on an associative recall loss, we introduce DeltaProduct, which instead takes multiple ($n_h$) steps per token. This naturally leads to diagonal plus rank-$n_h$ state-transition matrices, formed as products of $n_h$ generalized Householder transformations, providing a tunable mechanism to balance expressivity and efficiency and a stable recurrence. Through extensive experiments, we demonstrate that DeltaProduct achieves superior state-tracking and language modeling capabilities while exhibiting significantly improved length extrapolation compared to DeltaNet. Additionally, we also strengthen the theoretical foundation of DeltaNet’s expressivity by proving that it can solve dihedral group word problems in just two layers.
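
The structure described in the abstract, a state-transition matrix built as a product of $n_h$ generalized Householder transformations, can be checked numerically with a small NumPy sketch. The form $I - \beta k k^\top$ with a unit-norm key and $\beta \in [0, 2]$ is an assumption taken from the usual definition of a generalized Householder matrix; the function names and the random keys are hypothetical and stand in for values a trained model would produce.

```python
import numpy as np


def generalized_householder(key: np.ndarray, beta: float) -> np.ndarray:
    """Return I - beta * k k^T for the unit-normalized key k."""
    k = key / np.linalg.norm(key)
    return np.eye(len(k)) - beta * np.outer(k, k)


def delta_product_transition(keys: np.ndarray, betas: np.ndarray) -> np.ndarray:
    """Multiply n_h generalized Householder matrices (one per row of keys)."""
    transition = np.eye(keys.shape[1])
    for key, beta in zip(keys, betas):
        transition = generalized_householder(key, beta) @ transition
    return transition


rng = np.random.default_rng(0)
n_h, dim = 3, 8
keys = rng.normal(size=(n_h, dim))
betas = rng.uniform(0.0, 2.0, size=n_h)
A = delta_product_transition(keys, betas)

# A differs from the identity by a matrix of rank at most n_h.
rank_of_update = np.linalg.matrix_rank(A - np.eye(dim), tol=1e-10)
# With beta in [0, 2], every factor has spectral norm 1, so the product cannot grow.
spectral_norm = np.linalg.norm(A, 2)
```

Raising `n_h` raises the attainable rank of the update, which is the tunable expressivity knob the abstract refers to, while the norm bound is one way to read the "stable recurrence" claim.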

[NLP-10] Are Large Language Models the future crowd workers of Linguistics?

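[Quick Read]: This paper examines whether Large Language Models (LLMs) can overcome the challenges of using human participants in traditional empirical linguistic research, such as limited control over participants' attention during tasks, precarious working conditions in crowdsourcing environments, and time-consuming experimental designs. It evaluates LLMs under zero-shot prompting and explores additional prompting techniques, such as Chain-of-Thought (CoT) prompting, to bring their behaviour closer to human performance. The study finds that LLMs are versatile and effective, and that they tend to outperform human participants on some linguistic tasks.

Link: https://arxiv.org/abs/2502.10266
Authors: Iris Ferrazzo
Affiliations: Bonn Center for Digital Humanities (BCDH); Universität Bonn
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: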

Abstract:Data elicitation from human participants is one of the core data collection strategies used in empirical linguistic research. The amount of participants in such studies may vary considerably, ranging from a handful to crowdsourcing dimensions. Even if they provide resourceful extensive data, both of these settings come alongside many disadvantages, such as low control of participants’ attention during task completion, precarious working conditions in crowdsourcing environments, and time-consuming experimental designs. For these reasons, this research aims to answer the question of whether Large Language Models (LLMs) may overcome those obstacles if included in empirical linguistic pipelines. Two reproduction case studies are conducted to gain clarity into this matter: Cruz (2023) and Lombard et al. (2021). The two forced elicitation tasks, originally designed for human participants, are reproduced in the proposed framework with the help of OpenAI’s GPT-4o-mini model. Its performance with our zero-shot prompting baseline shows the effectiveness and high versatility of LLMs, that tend to outperform human informants in linguistic tasks. The findings of the second replication further highlight the need to explore additional prompting techniques, such as Chain-of-Thought (CoT) prompting, which, in a second follow-up experiment, demonstrates higher alignment to human performance on both critical and filler items. Given the limited scale of this study, it is worthwhile to further explore the performance of LLMs in empirical Linguistics and in other future applications in the humanities.

[NLP-11] Large Language Models and Synthetic Data for Monitoring Dataset Mentions in Research Papers

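[Quick Read]: This paper addresses the high cost and poor scalability of manually identifying and classifying dataset mentions in academic literature. It proposes a machine learning framework built on large language models (LLMs), synthetic data, and a two-stage fine-tuning process: zero-shot extraction, an LLM-as-a-Judge for quality assessment, and a reasoning agent for refinement produce a weakly supervised synthetic dataset, on which the model is pre-fine-tuned before being fine-tuned further. At inference, a ModernBERT-based classifier filters dataset mentions efficiently, reducing computational overhead while maintaining high recall.

Link: https://arxiv.org/abs/2502.10263
Authors: Aivin V. Solatorio,Rafael Macalaba,James Liounis
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Databases (cs.DB); Machine Learning (cs.LG)
Comments: Project GitHub repository at this https URL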

Abstract:Tracking how data is mentioned and used in research papers provides critical insights for improving data discoverability, quality, and production. However, manually identifying and classifying dataset mentions across vast academic literature is resource-intensive and not scalable. This paper presents a machine learning framework that automates dataset mention detection across research domains by leveraging large language models (LLMs), synthetic data, and a two-stage fine-tuning process. We employ zero-shot extraction from research papers, an LLM-as-a-Judge for quality assessment, and a reasoning agent for refinement to generate a weakly supervised synthetic dataset. The Phi-3.5-mini instruct model is pre-fine-tuned on this dataset, followed by fine-tuning on a manually annotated subset. At inference, a ModernBERT-based classifier efficiently filters dataset mentions, reducing computational overhead while maintaining high recall. Evaluated on a held-out manually annotated sample, our fine-tuned model outperforms NuExtract-v1.5 and GLiNER-large-v2.1 in dataset extraction accuracy. Our results highlight how LLM-generated synthetic data can effectively address training data scarcity, improving generalization in low-resource settings. This framework offers a pathway toward scalable monitoring of dataset usage, enhancing transparency, and supporting researchers, funders, and policymakers in identifying data gaps and strengthening data accessibility for informed decision-making.

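[NLP-12] VisCon-100K: Leveraging Contextual Web Data for Fine-tuning Vision Language Models PAKDD2025

[Quick Read]: This paper addresses the limited performance of Vision-Language Models (VLMs) caused by the lack of high-quality visual fine-tuning data. It introduces VisCon-100K, a new dataset derived from interleaved image-text web documents. GPT-4V generates image-contextual captions, and the OpenChat 3.5 model converts these captions into diverse free-form and multiple-choice question-answer pairs, yielding 100K image conversation samples. The dataset improves VLM performance on multiple benchmarks, and the paper also finds that a "leaky modality mix", in which the questions in a conversation sample can be answered from both the image and its contextual caption, outperforms non-leaky combinations.

Link: https://arxiv.org/abs/2502.10250
Authors: Gokul Karthik Kumar,Iheb Chaabane,Kebin Wu
Affiliations: Technology Innovation Institute (TII)
Categories: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted at PAKDD 2025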

Abstract:Vision-language models (VLMs) excel in various visual benchmarks but are often constrained by the lack of high-quality visual fine-tuning data. To address this challenge, we introduce VisCon-100K, a novel dataset derived from interleaved image-text web documents. Our approach transforms 45K web documents from the OBELICS dataset into 100K image conversation samples. We utilize GPT-4V to generate image-contextual captions and OpenChat 3.5 model to convert these captions into diverse free-form and multiple-choice question-answer pairs. Integrating this dataset for fine-tuning considerably enhances VLM performance across multiple benchmarks. Unlike methods that focus solely on fine-grained visual content, our approach leverages accompanying web context, yielding superior results. We also discover that a "leaky modality mix," where conversation samples contain questions answerable from both the image and its contextual caption, outperforms non-leaky combinations of captions and Q&A pairs. The VisCon-100K dataset shows strong performance with two popular VLM approaches: text-only large language model (LLM) aligned with a vision encoder using image captions data (ShareGPT4V-7b) and multimodally pretrained LLM (IDEFICS2-8b) using interleaved image-text data. In addition to releasing the VisCon-100K dataset, we provide a contextual captioner trained on this dataset, facilitating scalable fine-tuning data generation for future research and open-source applications. Using the same pipeline, but substituting our trained contextual captioner for GPT-4V, we also release the larger VisCon-1M dataset.

[NLP-13] Step-Video-T2V Technical Report: The Practice, Challenges, and Future of Video Foundation Model

[Quick Read]: This paper addresses the quality and efficiency of Text-to-Video (T2V) generation. The key components are: a deep-compression Variational Autoencoder (VAE) that achieves 16x16 spatial and 8x temporal compression ratios; bilingual text encoders that handle English and Chinese input; a DiT with 3D full attention, trained with Flow Matching, for denoising; and a video-based DPO approach that reduces artifacts and improves the visual quality of the generated videos. Together these techniques improve the performance and diversity of text-to-video generation.

Link: https://arxiv.org/abs/2502.10248
Authors: Guoqing Ma,Haoyang Huang,Kun Yan,Liangyu Chen,Nan Duan,Shengming Yin,Changyi Wan,Ranchen Ming,Xiaoniu Song,Xing Chen,Yu Zhou,Deshan Sun,Deyu Zhou,Jian Zhou,Kaijun Tan,Kang An,Mei Chen,Wei Ji,Qiling Wu,Wen Sun,Xin Han,Yanan Wei,Zheng Ge,Aojie Li,Bin Wang,Bizhu Huang,Bo Wang,Brian Li,Changxing Miao,Chen Xu,Chenfei Wu,Chenguang Yu,Dapeng Shi,Dingyuan Hu,Enle Liu,Gang Yu,Ge Yang,Guanzhe Huang,Gulin Yan,Haiyang Feng,Hao Nie,Haonan Jia,Hanpeng Hu,Hanqi Chen,Haolong Yan,Heng Wang,Hongcheng Guo,Huilin Xiong,Huixin Xiong,Jiahao Gong,Jianchang Wu,Jiaoren Wu,Jie Wu,Jie Yang,Jiashuai Liu,Jiashuo Li,Jingyang Zhang,Junjing Guo,Junzhe Lin,Kaixiang Li,Lei Liu,Lei Xia,Liang Zhao,Liguo Tan,Liwen Huang,Liying Shi,Ming Li,Mingliang Li,Muhua Cheng,Na Wang,Qiaohui Chen,Qinglin He,Qiuyan Liang,Quan Sun,Ran Sun,Rui Wang,Shaoliang Pang,Shiliang Yang,Sitong Liu,Siqi Liu,Shuli Gao,Tiancheng Cao,Tianyu Wang,Weipeng Ming,Wenqing He,Xu Zhao,Xuelin Zhang,Xianfang Zeng,Xiaojia Liu,Xuan Yang,Yaqi Dai,Yanbo Yu,Yang Li,Yineng Deng,Yingming Wang,Yilei Wang,Yuanwei Lu,Yu Chen,Yu Luo,Yuchu Luo
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Comments: 35 pages, 14 figures

Abstract:We present Step-Video-T2V, a state-of-the-art text-to-video pre-trained model with 30B parameters and the ability to generate videos up to 204 frames in length. A deep compression Variational Autoencoder, Video-VAE, is designed for video generation tasks, achieving 16x16 spatial and 8x temporal compression ratios, while maintaining exceptional video reconstruction quality. User prompts are encoded using two bilingual text encoders to handle both English and Chinese. A DiT with 3D full attention is trained using Flow Matching and is employed to denoise input noise into latent frames. A video-based DPO approach, Video-DPO, is applied to reduce artifacts and improve the visual quality of the generated videos. We also detail our training strategies and share key observations and insights. Step-Video-T2V’s performance is evaluated on a novel video generation benchmark, Step-Video-T2V-Eval, demonstrating its state-of-the-art text-to-video quality when compared with both open-source and commercial engines. Additionally, we discuss the limitations of current diffusion-based model paradigm and outline future directions for video foundation models. We make both Step-Video-T2V and Step-Video-T2V-Eval available at this https URL. The online version can be accessed from this https URL as well. Our goal is to accelerate the innovation of video foundation models and empower video content creators.

[NLP-14] Can Post-Training Quantization Benefit from an Additional QLoRA Integration? NAACL2025

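[Quick Read]: This paper addresses the high compute requirements and cost of deploying large language models (LLMs) in practice. The key solution is to combine 4-bit Post-training Quantization (PTQ) with QLoRA. Extensive experiments show that this integration outperforms standard PTQ and, in some cases, even 16-bit full-parameter fine-tuning, offering an effective way to deploy powerful LLMs in resource-constrained environments without compromising performance.

Link: https://arxiv.org/abs/2502.10202
Authors: Xiliang Zhu,Elena Khasanova,Cheng Chen
Affiliations: Dialpad Inc.
Categories: Computation and Language (cs.CL)
Comments: Accepted to NAACL 2025 Industry Track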

Abstract:Large language models (LLMs) have transformed natural language processing but pose significant challenges for real-world deployment. These models necessitate considerable computing resources, which can be costly and frequently unavailable. Model compression techniques such as quantization are often leveraged to alleviate resource demand, but they may have a negative impact on the generation quality. In this study, we explore the integration of 4-bit Post-training Quantization (PTQ) with QLoRA to address these issues. We demonstrate through extensive experiments that this integration outperforms standard PTQ, and in some cases even 16-bit full-parameter fine-tuning on LLMs, validated across proprietary and public datasets with different quantization algorithms. The results demonstrate the efficacy of PTQ-QLoRA integration, offering a viable solution for deploying powerful LLMs in resource-constrained environments without compromising on performance.
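
As a rough picture of what combining 4-bit quantization with a low-rank adapter means, the sketch below freezes a quantized weight and adds a trainable low-rank correction on top. It is not the paper's pipeline: uniform symmetric rounding stands in for the PTQ algorithms the authors evaluate and for the NF4 data type that QLoRA uses, and every function and variable name here is hypothetical.

```python
import numpy as np


def quantize_int4(weight: np.ndarray):
    """Uniform symmetric 4-bit quantization with one scale for the whole tensor."""
    scale = float(np.abs(weight).max()) / 7.0 or 1.0  # fall back to 1.0 for all-zero weights
    codes = np.clip(np.round(weight / scale), -8, 7).astype(np.int8)
    return codes, scale


def dequantize(codes: np.ndarray, scale: float) -> np.ndarray:
    """Map 4-bit integer codes back to floating point."""
    return codes.astype(np.float32) * scale


def forward(x, codes, scale, lora_a, lora_b):
    """Frozen quantized layer plus low-rank update: x W_q^T + (x A^T) B^T."""
    base = x @ dequantize(codes, scale).T
    update = (x @ lora_a.T) @ lora_b.T
    return base + update
```

Only `lora_a` and `lora_b` would be trained in this picture; with `lora_b` initialized to zeros the layer initially reproduces the quantized model exactly, and fine-tuning can then compensate for part of the rounding error.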

[NLP-15] Prediction hubs are context-informed frequent tokens in LLM s

【速读】：该论文旨在探讨自回归大语言模型（LLMs）在高维表示空间中是否受到hubness现象的影响。论文的关键在于理论分析表明LLMs在进行上下文向量与unembedding向量比较以确定后续预测概率时，并不受典型导致hubness现象的距离集中化现象的影响。然而，实验结果显示尽管不存在干扰性的hubness，仍然存在一定程度的hubness，这些hub是由频繁出现的标记在可能的下一个标记预测候选池中引起的。对于其他涉及LLMs表示的距离计算，论文没有相同的理论保证，并观察到干扰性hub的存在。总结而言，论文指出hubness在高维空间中普遍存在，但并非总是需要缓解的负面属性，同时展示了广泛使用的LLMs发展出了一种策略，即持续为频繁标记分配高概率。

Link: https://arxiv.org/abs/2502.10201
Authors: Beatrix M. G. Nielsen, Iuri Macocco, Marco Baroni
Affiliations: Technical University of Denmark; Universitat Pompeu Fabra; Universitat Pompeu Fabra/ICREA
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:Hubness, the tendency for few points to be among the nearest neighbours of a disproportionate number of other points, commonly arises when applying standard distance measures to high-dimensional data, often negatively impacting distance-based analysis. As autoregressive large language models (LLMs) operate on high-dimensional representations, we ask whether they are also affected by hubness. We first show, theoretically, that the only representation comparison operation performed by LLMs, namely that between context and unembedding vectors to determine continuation probabilities, is not characterized by the concentration of distances phenomenon that typically causes the appearance of nuisance hubness. We then empirically show that this comparison still leads to a high degree of hubness, but the hubs in this case do not constitute a disturbance. They are rather the result of context-modulated frequent tokens often appearing in the pool of likely candidates for next token prediction. On the other hand, when other distance computations involving LLM representations are performed, we do not have the same theoretical guarantees, and, indeed, we see nuisance hubs appear. In summary, our work highlights, on the one hand, how hubness, while omnipresent in high-dimensional spaces, is not always a negative property that needs to be mitigated, and, on the other hand, it shows that various widely-used LLMs have developed a guessing strategy that consists in constantly assigning a high probability to frequent tokens.
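
Hubness is commonly quantified through k-occurrence: how many times each point lands among the k nearest neighbours of the other points. The sketch below counts this for dot-product comparisons between context vectors and unembedding vectors. The toy vectors and the function name are assumptions for illustration, not the paper's setup.

```python
from collections import Counter


def k_occurrence(queries, candidates, k):
    """Count how often each candidate appears in a query's top-k by dot product."""
    counts = Counter()
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, c)) for c in candidates]
        top = sorted(range(len(candidates)), key=lambda i: scores[i], reverse=True)[:k]
        counts.update(top)
    return [counts[i] for i in range(len(candidates))]


# Toy "unembedding" vectors: the third has a large norm, like a frequent token might.
candidates = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]
# Toy "context" vectors.
queries = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
print(k_occurrence(queries, candidates, k=1))
```

A heavily skewed count, where one candidate is the top choice for most queries, is the signature of a hub.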

[NLP-16] Revisiting Generalization Power of a DNN in Terms of Symbolic Interactions

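Quick read: This paper analyzes the generalization power of deep neural networks (DNNs) from the perspective of interactions. Unlike earlier analyses carried out in a high-dimensional feature space, it argues that a DNN's generalization power can be explained as the generalization power of its interactions. The key finding is that generalizable interactions follow a decay-shaped distribution, whereas non-generalizable interactions follow a spindle-shaped distribution. The theory can also effectively disentangle these two types of interactions within a DNN, and experiments verify that it matches real interactions well.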

Link: https://arxiv.org/abs/2502.10162
Authors: Lei Cheng, Junpeng Zhang, Qihan Ren, Quanshi Zhang
Affiliations: Shanghai Jiao Tong University
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Comments: arXiv admin note: text overlap with arXiv:2407.19198

Abstract:This paper aims to analyze the generalization power of deep neural networks (DNNs) from the perspective of interactions. Unlike previous analyses of a DNN’s generalization power in a high-dimensional feature space, we find that the generalization power of a DNN can be explained as the generalization power of the interactions. We found that the generalizable interactions follow a decay-shaped distribution, while non-generalizable interactions follow a spindle-shaped distribution. Furthermore, our theory can effectively disentangle these two types of interactions from a DNN. Our experiments verify that the theory matches real interactions in a DNN well.

[NLP-17] Small Models, Big Impact: Efficient Corpus and Graph-Based Adaptation of Small Multilingual Language Models for Low-Resource Languages

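Quick read: This paper addresses the data scarcity that low-resource languages (LRLs) face in natural language processing (NLP). Its key solution is to adapt small multilingual models (mLMs) such as mBERT and XLM-R to LRLs with parameter-efficient adapter-based methods. It systematically evaluates three architectures: Sequential Bottleneck, Invertible Bottleneck, and Low-Rank Adaptation. Even small adaptation datasets (for example, up to 1 GB of free text or a few MB of knowledge graph data) yield gains on intrinsic tasks (masked language modeling) and extrinsic tasks (topic classification, sentiment analysis, and named entity recognition). Sequential Bottleneck adapters excel at language modeling, while Invertible Bottleneck adapters are slightly ahead on downstream tasks thanks to better embedding alignment and larger parameter counts. These methods match or outperform full fine-tuning while using far fewer parameters, and small mLMs prove more effective for LRLs than massive LLMs such as LLaMA-3, GPT-4, and DeepSeek-R1-based distilled models.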

Link: https://arxiv.org/abs/2502.10140
Authors: Daniil Gurgurov, Ivan Vykopal, Josef van Genabith, Simon Ostermann
Affiliations: University of Saarland; German Research Center for Artificial Intelligence (DFKI); Brno University of Technology; Kempelen Institute of Intelligent Technologies (KInIT)
Categories: Computation and Language (cs.CL)
Comments: Pre-print

Abstract:Low-resource languages (LRLs) face significant challenges in natural language processing (NLP) due to limited data. While current state-of-the-art large language models (LLMs) still struggle with LRLs, smaller multilingual models (mLMs) such as mBERT and XLM-R offer greater promise due to a better fit of their capacity to low training data sizes. This study systematically investigates parameter-efficient adapter-based methods for adapting mLMs to LRLs, evaluating three architectures: Sequential Bottleneck, Invertible Bottleneck, and Low-Rank Adaptation. Using unstructured text from GlotCC and structured knowledge from ConceptNet, we show that small adaptation datasets (e.g., up to 1 GB of free-text or a few MB of knowledge graph data) yield gains in intrinsic (masked language modeling) and extrinsic tasks (topic classification, sentiment analysis, and named entity recognition). We find that Sequential Bottleneck adapters excel in language modeling, while Invertible Bottleneck adapters slightly outperform other methods on downstream tasks due to better embedding alignment and larger parameter counts. Adapter-based methods match or outperform full fine-tuning while using far fewer parameters, and smaller mLMs prove more effective for LRLs than massive LLMs like LLaMA-3, GPT-4, and DeepSeek-R1-based distilled models. While adaptation improves performance, pre-training data size remains the dominant factor, especially for languages with extensive pre-training coverage.
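
A bottleneck adapter is usually described as a small down-projection, a nonlinearity, an up-projection, and a residual connection placed after a Transformer sublayer. The sketch below follows that common description; it is not taken from the paper, the function names are hypothetical, and biases and layer normalization are omitted.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def bottleneck_adapter(hidden, w_down, w_up):
    """Return hidden + W_up . relu(W_down . hidden)."""
    squeezed = [max(0.0, v) for v in matvec(w_down, hidden)]  # project down, then ReLU
    expanded = matvec(w_up, squeezed)                         # project back up
    return [h + e for h, e in zip(hidden, expanded)]          # residual connection


hidden = [1.0, -2.0, 3.0, 0.5]                  # hidden size 4
w_down = [[1.0, 0.0, 0.0, 0.0],
          [0.0, 1.0, 0.0, 0.0]]                 # 4 -> 2 bottleneck
w_up = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0], [0.0, 0.0]]  # 2 -> 4
print(bottleneck_adapter(hidden, w_down, w_up))
```

Only w_down and w_up would be trained while the pretrained model stays frozen, which is why the parameter count is small: roughly two times the hidden size times the bottleneck size per adapter.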

[NLP-18] Hands-off Image Editing: Language-guided Editing without any Task-specific Labeling, Masking or even Training COLING2025

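Quick read: This paper addresses instruction-guided image editing without supervision, that is, editing an image according to an instruction without any task-specific labeling, masking, or training. Its key contribution is a novel approach that needs no such task-specific supervision and therefore offers better potential for improvement. The evaluation shows that the approach is highly effective and achieves very competitive performance.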

Link: https://arxiv.org/abs/2502.10064
Authors: Rodrigo Santos, António Branco, João Silva, João Rodrigues
Affiliations: University of Lisbon; NLX-Natural Language and Speech Group, Department of Informatics; Faculdade de Ciências, Campo Grande, 1749-016 Lisboa, Portugal
Categories: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Comments: Published in COLING 2025

Abstract:Instruction-guided image editing consists in taking an image and an instruction and delivering that image altered according to that instruction. State-of-the-art approaches to this task suffer from the typical scaling up and domain adaptation hindrances related to supervision as they eventually resort to some kind of task-specific labelling, masking or training. We propose a novel approach that does without any such task-specific supervision and offers thus a better potential for improvement. Its assessment demonstrates that it is highly effective, achieving very competitive performance.

[NLP-19] Annotating Compositionality Scores for Irish Noun Compounds is Hard Work

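Quick read: This paper addresses the challenge that noun compounds pose for NLP applications because of their variability in idiomaticity and interpretation. It analyzes noun compounds identified by expert annotators in Irish text, focusing on compositionality as the key feature while also examining domain specificity and the annotators' familiarity and confidence. The findings contribute to a better understanding of how these constructions appear in Irish and suggest how they might be treated separately from English noun compounds.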

Link: https://arxiv.org/abs/2502.10061
Authors: Abigail Walsh, Teresa Clifford, Emma Daly, Jane Dunne, Brian Davis, Gearóid Ó Cleircín
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: 6 pages, 1 figure

Abstract:Noun compounds constitute a challenging construction for NLP applications, given their variability in idiomaticity and interpretation. In this paper, we present an analysis of compound nouns identified in Irish text of varied domains by expert annotators, focusing on compositionality as a key feature, but also domain specificity, as well as familiarity and confidence of the annotator giving the ratings. Our findings and the discussion that ensued contribute towards a greater understanding of how these constructions appear in the Irish language, and how they might be treated separately from English noun compounds.

[NLP-20] MTLM: an Innovative Language Model Training Paradigm for ASR

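Quick read: This paper addresses the inability of traditional unidirectional language models (ULMs) to use right-hand context in automatic speech recognition (ASR). Its key solution is a training method that lets traditional unidirectional LMs fully use both left and right context, so that ASR hypotheses are transcribed more consistently and with less semantic ambiguity. Experiments on LibriSpeech show that the proposed model outperforms traditional unidirectional LMs whether n-best rescoring or shallow fusion is used as the decoding algorithm.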

Link: https://arxiv.org/abs/2502.10058
Authors: Qingliang Meng, Pengju Ren, Tian Li, Changsong Dai
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)
Comments:

Abstract:Pre-training Transformer-based language models (LMs) on a large amount of text has proven crucial for improving automatic speech recognition (ASR) performance. Generally, traditional LMs are unidirectional and unable to access the context on the right. This paper proposes a method for training LMs that enables traditional unidirectional LMs to fully utilize left and right contexts. Compared with the unidirectional LMs, our LM facilitates ASR to transcribe hypotheses more consistently and in a more semantically unambiguous way, as it incorporates richer contextual representations. Finally, our experimental results on the LibriSpeech corpus demonstrate that our model outperforms traditional unidirectional LMs, whether n-best rescoring or shallow fusion is used as the decoding algorithm.
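
N-best rescoring, one of the two decoding setups named in the abstract, can be sketched as re-ranking the recognizer's candidate transcripts by a weighted sum of the ASR score and a language model score. The toy scores, the interpolation weight, and the lm_logprob callable below are hypothetical and only illustrate the mechanism.

```python
def rescore_nbest(hypotheses, lm_logprob, lm_weight=0.5):
    """Re-rank (text, asr_logprob) pairs by asr_logprob + lm_weight * lm_logprob(text)."""
    scored = [(text, asr + lm_weight * lm_logprob(text)) for text, asr in hypotheses]
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Toy n-best list: the acoustically preferred hypothesis is the implausible one.
nbest = [("recognize speech", -4.0), ("wreck a nice beach", -3.8)]

# Stand-in for a real language model: a lookup table of made-up log-probabilities.
toy_lm_scores = {"recognize speech": -2.0, "wreck a nice beach": -6.0}

best_text, best_score = rescore_nbest(nbest, toy_lm_scores.get)[0]
print(best_text, best_score)
```

Shallow fusion applies the same kind of weighted combination, but at each step of beam search rather than once on complete hypotheses.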

[NLP-21] ORI: O Routing Intelligence

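Quick read: This paper addresses the observation that a single large language model (LLM) often falls short across an ever-growing range of tasks, which makes a single-model approach insufficient. Its key solution is ORI (O Routing Intelligence), a dynamic framework that intelligently routes each query to the most suitable model, improving task-specific accuracy while maintaining efficiency. The approach shows consistent accuracy gains across multiple benchmarks while keeping computational overhead under control.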

Link: https://arxiv.org/abs/2502.10051
Authors: Ahmad Shadid, Rahul Kumar, Mohit Mayank
Affiliations: O.SYSTEMS Foundation
Categories: Computation and Language (cs.CL)
Comments: 13 pages, 2 figures

Abstract:Single large language models (LLMs) often fall short when faced with the ever-growing range of tasks, making a single-model approach insufficient. We address this challenge by proposing ORI (O Routing Intelligence), a dynamic framework that leverages a set of LLMs. By intelligently routing incoming queries to the most suitable model, ORI not only improves task-specific accuracy, but also maintains efficiency. Comprehensive evaluations across diverse benchmarks demonstrate consistent accuracy gains while controlling computational overhead. By intelligently routing queries, ORI outperforms the strongest individual models by up to 2.7 points on MMLU and 1.8 points on MuSR, and ties the top performance on ARC and BBH. These results underscore the benefits of a multi-model strategy and demonstrate how ORI’s adaptive architecture can more effectively handle diverse tasks, offering a scalable, high-performance solution for a system of multiple large language models.

[NLP-22] Probabilistic Lexical Manifold Construction in Large Language Models via Hierarchical Vector Field Interpolation

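Quick read: This paper addresses discontinuities in word embedding representations, which are common in Transformer-based models. Its key idea is Hierarchical Vector Field Interpolation, which constructs a probabilistic function space in which word representations transition smoothly under topological consistency rather than being confined to discrete token mappings. Divergence minimization techniques keep the interpolated embeddings probabilistically consistent while remaining computationally feasible for large-scale implementations.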

Link: https://arxiv.org/abs/2502.10013
Authors: Clive Pendleton, Ewan Harrington, Giles Fairbrother, Jasper Arkwright, Nigel Fenwick, Richard Katrix
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments:

Abstract:Hierarchical vector field interpolation introduces a structured probabilistic framework for lexical representation, ensuring that word embeddings transition smoothly across a continuous manifold rather than being constrained to discrete token mappings. The proposed methodology constructs a probabilistic function space where word representations adhere to topological consistency, mitigating representational discontinuities commonly observed in transformer-based embeddings. Empirical evaluations reveal that probabilistic constraints enhance lexical coherence by refining contextual relationships, leading to improvements in semantic stability across multiple linguistic distributions. The application of divergence minimization techniques ensures that interpolated embeddings maintain probabilistic consistency while preserving computational feasibility for large-scale implementations. Experimental findings demonstrate that interpolated lexical manifolds improve representation density alignment, reducing anisotropic distortions in contextual embedding distributions. Comparative analyses with standard transformer-based models highlight that structured interpolation yields more stable representations, particularly in tasks requiring fine-grained semantic differentiation. The statistical evaluation of embedding divergence confirms that probabilistic lexical manifolds reduce representational inconsistencies while maintaining coherence across varying scales of contextual abstraction. An assessment of computational efficiency reveals that while interpolation introduces minor processing overhead, the structured representation learning approach remains scalable for practical deployment.

[NLP-23] SciClaimHunt: A Large Dataset for Evidence-based Scientific Claim Verification

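Quick read: This paper addresses the lack of large-scale datasets for benchmarking and training effective models for scientific claim verification. Its key solution is to introduce two large-scale datasets, SciClaimHunt and SciClaimHunt_Num, derived from scientific papers, together with several baseline models tailored to scientific claim verification that are used to assess the datasets. Human evaluation and error analysis further assess the effectiveness of the proposed baselines.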

Link: https://arxiv.org/abs/2502.10003
Authors: Sujit Kumar, Anshul Sharma, Siddharth Hemant Khincha, Gargi Shroff, Sanasam Ranbir Singh, Rahul Mishra
Affiliations: Department of Computer Science and Engineering, Indian Institute of Technology, Guwahati, Assam, India; International Institute of Information Technology Hyderabad
Categories: Computation and Language (cs.CL)
Comments:

Abstract:Verifying scientific claims presents a significantly greater challenge than verifying political or news-related claims. Unlike the relatively broad audience for political claims, the users of scientific claim verification systems can vary widely, ranging from researchers testing specific hypotheses to everyday users seeking information on a medication. Additionally, the evidence for scientific claims is often highly complex, involving technical terminology and intricate domain-specific concepts that require specialized models for accurate verification. Despite considerable interest from the research community, there is a noticeable lack of large-scale scientific claim verification datasets to benchmark and train effective models. To bridge this gap, we introduce two large-scale datasets, SciClaimHunt and SciClaimHunt_Num, derived from scientific research papers. We propose several baseline models tailored for scientific claim verification to assess the effectiveness of these datasets. Additionally, we evaluate models trained on SciClaimHunt and SciClaimHunt_Num against existing scientific claim verification datasets to gauge their quality and reliability. Furthermore, we conduct human evaluations of the claims in proposed datasets and perform error analysis to assess the effectiveness of the proposed baseline models. Our findings indicate that SciClaimHunt and SciClaimHunt_Num serve as highly reliable resources for training models in scientific claim verification.

[NLP-24] EmbBERT-Q: Breaking Memory Barriers in Embedded NLP

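Quick read: This paper addresses the fact that the compute and memory demands of Large Language Models (LLMs) make them impractical for technologically constrained tiny devices such as wearables and Internet-of-Things units. Its key solution is EmbBERT-Q, a novel tiny language model that combines architectural innovations with hardware-compatible 8-bit quantization. With a total memory footprint of just 781 kB, a 25x reduction in size relative to other state-of-the-art (SotA) models, it achieves SotA accuracy on natural language processing tasks in this setting.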

Link: https://arxiv.org/abs/2502.10001
Authors: Riccardo Bravin, Massimo Pavan, Hazem Hesham Yousef Shalby, Fabrizio Pittorino, Manuel Roveri
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Hardware Architecture (cs.AR); Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)
Comments: 24 pages, 4 figures, 14 tables

Abstract:Large Language Models (LLMs) have revolutionized natural language processing, setting new standards across a wide range of applications. However, their relevant memory and computational demands make them impractical for deployment on technologically-constrained tiny devices such as wearable devices and Internet-of-Things units. To address this limitation, we introduce EmbBERT-Q, a novel tiny language model specifically designed for tiny devices with stringent memory constraints. EmbBERT-Q achieves state-of-the-art (SotA) accuracy in Natural Language Processing tasks in this scenario, with a total memory footprint (weights and activations) of just 781 kB, representing a 25x reduction in size with respect to SotA models. By combining architectural innovations with hardware-compatible 8-bit quantization, EmbBERT-Q consistently outperforms several baseline models scaled down to a 2 MB memory budget (i.e., the maximum memory typically available in tiny devices), including heavily compressed versions of BERT and MAMBA. Extensive experimental evaluations on both a selected benchmark dataset, TinyNLP, specifically curated to evaluate Tiny Language Models in NLP tasks and real-world scenarios, and the GLUE benchmark, demonstrate EmbBERT-Q's ability to deliver competitive accuracy with respect to existing approaches, achieving an unmatched balance between memory and performance. To ensure the complete and immediate reproducibility of all our results, we release all code, scripts, and model checkpoints at this https URL.

[NLP-25] Large Language Diffusion Models

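Quick read: This paper challenges the conventional view that autoregressive models (ARMs) are the cornerstone of large language models (LLMs). It introduces LLaDA, a diffusion model that models distributions through a forward data masking process and a reverse process, parameterized by a standard Transformer that predicts masked tokens. The key is that, by optimizing a likelihood bound, it provides a principled generative approach to probabilistic inference. Experiments show that LLaDA scales strongly across many benchmarks and is competitive with strong LLMs such as LLaMA3 in in-context learning and instruction following. LLaDA also addresses the reversal curse, surpassing GPT-4o on a reversal poem completion task. These findings establish diffusion models as a viable alternative to ARMs and challenge the assumption that key LLM capabilities are inherently tied to ARMs.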

Link: https://arxiv.org/abs/2502.09992
Authors: Shen Nie, Fengqi Zhu, Zebin You, Xiaolu Zhang, Jingyang Ou, Jun Hu, Jun Zhou, Yankai Lin, Ji-Rong Wen, Chongxuan Li
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments:

Abstract:Autoregressive models (ARMs) are widely regarded as the cornerstone of large language models (LLMs). We challenge this notion by introducing LLaDA, a diffusion model trained from scratch under the pre-training and supervised fine-tuning (SFT) paradigm. LLaDA models distributions through a forward data masking process and a reverse process, parameterized by a vanilla Transformer to predict masked tokens. By optimizing a likelihood bound, it provides a principled generative approach for probabilistic inference. Across extensive benchmarks, LLaDA demonstrates strong scalability, outperforming our self-constructed ARM baselines. Remarkably, LLaDA 8B is competitive with strong LLMs like LLaMA3 8B in in-context learning and, after SFT, exhibits impressive instruction-following abilities in case studies such as multi-turn dialogue. Moreover, LLaDA addresses the reversal curse, surpassing GPT-4o in a reversal poem completion task. Our findings establish diffusion models as a viable and promising alternative to ARMs, challenging the assumption that key LLM capabilities discussed above are inherently tied to ARMs.
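
The forward data masking process mentioned in the abstract can be sketched as follows, assuming each token is replaced by a mask symbol independently with probability t. That independence assumption, the MASK string, and the function names are illustrative choices rather than details confirmed by the abstract.

```python
import random

MASK = "[MASK]"  # hypothetical mask symbol


def forward_mask(tokens, t, rng):
    """Independently replace each token with MASK with probability t (0 <= t <= 1)."""
    return [MASK if rng.random() < t else tok for tok in tokens]


def masked_positions(noisy):
    """Positions the reverse process (the mask predictor) would have to fill in."""
    return [i for i, tok in enumerate(noisy) if tok == MASK]


rng = random.Random(0)
tokens = ["the", "cat", "sat", "down"]
noisy = forward_mask(tokens, t=0.5, rng=rng)
print(noisy, masked_positions(noisy))
```

At t = 0 the sequence is untouched and at t = 1 it is fully masked. A reverse process would start from the fully masked sequence and use a Transformer to predict the masked tokens, which is the part this sketch leaves out.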

[NLP-26] X-Boundary: Establishing Exact Safety Boundary to Shield LLMs from Multi-Turn Jailbreaks without Compromising Usability

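Quick read: This paper addresses the problem that existing defenses against multi-turn jailbreaks improve the safety of large language models (LLMs) but compromise their usability. Its key solution is X-Boundary, which pushes harmful feature representations away from boundary-safe representations to obtain an exact boundary between safe and harmful features. This defends effectively against multi-turn jailbreaks while nearly preserving general capability, reduces the over-refusal rate by about 20%, and accelerates convergence during training.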

Link: https://arxiv.org/abs/2502.09990
Authors: Xiaoya Lu, Dongrui Liu, Yi Yu, Luxin Xu, Jing Shao
Affiliations: Unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:

Abstract:Despite the rapid development of safety alignment techniques for LLMs, defending against multi-turn jailbreaks is still a challenging task. In this paper, we conduct a comprehensive comparison, revealing that some existing defense methods can improve the robustness of LLMs against multi-turn jailbreaks but compromise usability, i.e., reducing general capabilities or causing the over-refusal problem. From the perspective of mechanism interpretability of LLMs, we discover that these methods fail to establish a boundary that exactly distinguishes safe and harmful feature representations. Therefore, boundary-safe representations close to harmful representations are inevitably disrupted, leading to a decline in usability. To address this issue, we propose X-Boundary to push harmful representations away from boundary-safe representations and obtain an exact distinction boundary. In this way, harmful representations can be precisely erased without disrupting safe ones. Experimental results show that X-Boundary achieves state-of-the-art defense performance against multi-turn jailbreaks, while reducing the over-refusal rate by about 20% and maintaining nearly complete general capability. Furthermore, we theoretically prove and empirically verify that X-Boundary can accelerate the convergence process during training. Please see our code at: this https URL.

[NLP-27] LaRA: Benchmarking Retrieval-Augmented Generation and Long-Context LLMs - No Silver Bullet for LC or RAG Routing

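Quick read: This paper addresses how to incorporate external knowledge into Large Language Models (LLMs) effectively, and asks whether Retrieval-Augmented Generation (RAG) remains necessary given long-context (LC) LLMs. Its key contribution is LaRA, a new benchmark for rigorously comparing RAG and LC LLMs. A systematic evaluation of open-source and proprietary models shows that the optimal choice depends on a complex interplay of factors, including the model's parameter size, long-text capabilities, context length, task type, and the characteristics of the retrieved chunks.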

Link: https://arxiv.org/abs/2502.09977
Authors: Kuan Li, Liwen Zhang, Yong Jiang, Pengjun Xie, Fei Huang, Shuai Wang, Minhao Cheng
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 22 pages

Abstract:Effectively incorporating external knowledge into Large Language Models (LLMs) is crucial for enhancing their capabilities and addressing real-world needs. Retrieval-Augmented Generation (RAG) offers an effective method for achieving this by retrieving the most relevant fragments into LLMs. However, the advancements in context window size for LLMs offer an alternative approach, raising the question of whether RAG remains necessary for effectively handling external knowledge. Several existing studies provide inconclusive comparisons between RAG and long-context (LC) LLMs, largely due to limitations in the benchmark designs. In this paper, we present LaRA, a novel benchmark specifically designed to rigorously compare RAG and LC LLMs. LaRA encompasses 2,326 test cases across four practical QA task categories and three types of naturally occurring long texts. Through systematic evaluation of seven open-source and four proprietary LLMs, we find that the optimal choice between RAG and LC depends on a complex interplay of factors, including the model’s parameter size, long-text capabilities, context length, task type, and the characteristics of the retrieved chunks. Our findings provide actionable guidelines for practitioners to effectively leverage both RAG and LC approaches in developing and deploying LLM applications. Our code and dataset are provided at: this https URL.

[NLP-28] Data Valuation using Neural Networks for Efficient Instruction Fine-Tuning

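Quick read: This paper addresses the high computational cost and limited generalization of existing methods for computing influence functions on large models and datasets. Its key solution is a small neural network, the InfluenceNetwork, which estimates influence values with up to 99% cost reduction using models only 0.0027% the size of full language models. The method, called NN-CIFT (Neural Networks for effiCient Instruction Fine-Tuning), delivers large speedups with no compromise in performance.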

Link: https://arxiv.org/abs/2502.09969
Authors: Ishika Agarwal, Dilek Hakkani-Tur
Affiliations: UIUC
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:

Abstract:Influence functions provide crucial insights into model training, but existing methods suffer from large computational costs and limited generalization. Particularly, recent works have proposed various metrics and algorithms to calculate the influence of data using language models, which do not scale well with large models and datasets. This is because of the expensive forward and backward passes required for computation, substantial memory requirements to store large models, and poor generalization of influence estimates to new data. In this paper, we explore the use of small neural networks – which we refer to as the InfluenceNetwork – to estimate influence values, achieving up to 99% cost reduction. Our evaluation demonstrates that influence values can be estimated with models just 0.0027% the size of full language models (we use 7B and 8B versions). We apply our algorithm of estimating influence values (called NN-CIFT: Neural Networks for effiCient Instruction Fine-Tuning) to the downstream task of subset selection for general instruction fine-tuning. In our study, we include four state-of-the-art influence functions and show no compromise in performance, despite large speedups, between NN-CIFT and the original influence functions. We provide an in-depth hyperparameter analysis of NN-CIFT. The code for our method can be found here: this https URL.
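
To make the 0.0027% figure concrete, the short calculation below converts it into a parameter count. It assumes that "7B" and "8B" mean exactly 7 and 8 billion parameters, which real checkpoints only approximate, so the results are rough orders of magnitude rather than the paper's reported network size.

```python
PARTS_PER_MILLION = 27  # 0.0027% is the same as 27 parts per million


def influence_network_size(lm_params):
    """Approximate parameter count of a network 0.0027% the size of the language model."""
    return lm_params * PARTS_PER_MILLION // 1_000_000


for name, params in [("7B", 7_000_000_000), ("8B", 8_000_000_000)]:
    print(name, influence_network_size(params))
```

Under that assumption the InfluenceNetwork would have on the order of two hundred thousand parameters.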

[NLP-29] KGGen: Extracting Knowledge Graphs from Plain Text with Language Models

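Quick read: This paper addresses the scarcity of knowledge graph (KG) data. Its key solution is a text-to-KG generator (KGGen) that uses language models to create high-quality knowledge graphs from plain text. Unlike other KG extractors, KGGen clusters related entities to reduce sparsity in the extracted KGs.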

Link: https://arxiv.org/abs/2502.09956
Authors: Belinda Mo, Kyssen Yu, Joshua Kazdan, Proud Mpala, Lisa Yu, Chris Cundy, Charilaos Kanatsoulis, Sanmi Koyejo
Affiliations: Stanford University; University of Toronto; FAR AI
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Machine Learning (cs.LG)
Comments:

Abstract:Recent interest in building foundation models for KGs has highlighted a fundamental challenge: knowledge-graph data is relatively scarce. The best-known KGs are primarily human-labeled, created by pattern-matching, or extracted using early NLP techniques. While human-generated KGs are in short supply, automatically extracted KGs are of questionable quality. We present a solution to this data scarcity problem in the form of a text-to-KG generator (KGGen), a package that uses language models to create high-quality graphs from plaintext. Unlike other KG extractors, KGGen clusters related entities to reduce sparsity in extracted KGs. KGGen is available as a Python library (pip install kg-gen), making it accessible to everyone. Along with KGGen, we release the first benchmark, Measure of Information in Nodes and Edges (MINE), that tests an extractor’s ability to produce a useful KG from plain text. We benchmark our new tool against existing extractors and demonstrate far superior performance.
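
The effect of clustering related entities can be illustrated with a toy example. This is not the kg-gen API and not the paper's method: KGGen uses language models to decide which entities belong together, whereas the sketch below substitutes a hand-written alias table and hypothetical function names, purely to show how merging nodes reduces sparsity.

```python
def cluster_entities(triples, aliases):
    """Rewrite triples so aliased names share one canonical node, dropping duplicates."""
    def canon(name):
        key = name.strip().lower()
        return aliases.get(key, key)

    merged, seen = [], set()
    for subj, rel, obj in triples:
        triple = (canon(subj), rel, canon(obj))
        if triple not in seen:  # identical facts collapse once their nodes are merged
            seen.add(triple)
            merged.append(triple)
    return merged


triples = [
    ("NYC", "is_a", "city"),
    ("New York City", "located_in", "USA"),
    ("nyc", "is_a", "City"),
]
aliases = {"nyc": "new york city"}  # hypothetical output of a clustering step
print(cluster_entities(triples, aliases))
```

Before merging, the three surface forms of the same city are separate nodes with one edge each; afterwards a single node carries both facts.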

[NLP-30] Self-Supervised Learning for Neural Topic Models with Variance-Invariance-Covariance Regularization

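Quick read: This paper aims to improve topic modeling by combining self-supervised learning with a neural topic model (NTM). Its key solution is to explicitly regularize the latent topic representations to improve their quality, and to replace the heuristic sampling method with an adversarial data augmentation method. It also develops several model variants, including ones built on an NTM that incorporates contrastive learning with both positive and negative samples. Experiments on three datasets show that the proposed models outperform baselines and state-of-the-art models.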

Link: https://arxiv.org/abs/2502.09944
Authors: Weiran Xu, Kengo Hirami, Koji Eguchi
Affiliations: Graduate School of Advanced Science and Engineering, Hiroshima University
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments: Preprint accepted in Springer Knowledge and Information Systems (KAIS), in press

Abstract:In our study, we propose a self-supervised neural topic model (NTM) that combines the power of NTMs and regularized self-supervised learning methods to improve performance. NTMs use neural networks to learn latent topics hidden behind the words in documents, enabling greater flexibility and the ability to estimate more coherent topics compared to traditional topic models. On the other hand, some self-supervised learning methods use a joint embedding architecture with two identical networks that produce similar representations for two augmented versions of the same input. Regularizations are applied to these representations to prevent collapse, which would otherwise result in the networks outputting constant or redundant representations for all inputs. Our model enhances topic quality by explicitly regularizing latent topic representations of anchor and positive samples. We also introduced an adversarial data augmentation method to replace the heuristic sampling method. We further developed several variation models including those on the basis of an NTM that incorporates contrastive learning with both positive and negative samples. Experimental results on three datasets showed that our models outperformed baselines and state-of-the-art models both quantitatively and qualitatively.
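
The title refers to variance-invariance-covariance regularization. As commonly formulated for joint-embedding methods, it combines an invariance term that pulls paired representations together, a variance term that keeps each dimension from collapsing to a constant, and a covariance term that discourages redundant dimensions. The plain-Python sketch below follows that common formulation; the paper's exact loss, term weights, and hyperparameters may differ, and gamma and eps here are assumed defaults.

```python
import math


def _column_means(z):
    return [sum(row[j] for row in z) / len(z) for j in range(len(z[0]))]


def invariance(z_a, z_b):
    """Mean squared distance between paired representations (anchor vs. positive)."""
    return sum(sum((a - b) ** 2 for a, b in zip(ra, rb)) for ra, rb in zip(z_a, z_b)) / len(z_a)


def variance(z, gamma=1.0, eps=1e-4):
    """Hinge penalty when a dimension's standard deviation falls below gamma."""
    n, d = len(z), len(z[0])
    means = _column_means(z)
    total = 0.0
    for j in range(d):
        var = sum((row[j] - means[j]) ** 2 for row in z) / (n - 1)
        total += max(0.0, gamma - math.sqrt(var + eps))
    return total / d


def covariance(z):
    """Sum of squared off-diagonal covariance entries, divided by the dimension."""
    n, d = len(z), len(z[0])
    means = _column_means(z)
    total = 0.0
    for i in range(d):
        for j in range(d):
            if i != j:
                cov = sum((row[i] - means[i]) * (row[j] - means[j]) for row in z) / (n - 1)
                total += cov ** 2
    return total / d


collapsed = [[1.0, 1.0], [1.0, 1.0], [1.0, 1.0]]  # every input maps to the same point
print(invariance(collapsed, collapsed), variance(collapsed), covariance(collapsed))
```

A fully collapsed batch scores zero on invariance but is penalized by the variance term, which is how this kind of regularization prevents the constant-output failure described in the abstract.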

[NLP-31] A Preliminary Exploration with GPT-4o Voice Mode

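Quick read: This paper evaluates the audio processing and reasoning abilities of the multimodal large language model GPT-4o. Through a series of experiments, it analyzes GPT-4o's performance on audio understanding, speech recognition, music analysis, and multilingual tasks, and examines its robustness against hallucinations and how its safety mechanisms affect task execution.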

Link: https://arxiv.org/abs/2502.09940
Authors: Yu-Xiang Lin, Chih-Kai Yang, Wei-Chih Chen, Chen-An Li, Chien-yu Huang, Xuanjun Chen, Hung-yi Lee
Affiliations: National Taiwan University
Categories: Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Comments: Work in progress

Abstract:With the rise of multimodal large language models, GPT-4o stands out as a pioneering model, driving us to evaluate its capabilities. This report assesses GPT-4o across various tasks to analyze its audio processing and reasoning abilities. We find that GPT-4o exhibits strong knowledge in audio, speech, and music understanding, performing well in tasks like intent classification, spoken command classification, semantic and grammatical reasoning, multilingual speech recognition, and singing analysis. It also shows greater robustness against hallucinations than other large audio-language models (LALMs). However, it struggles with tasks such as audio duration prediction and instrument classification. Additionally, GPT-4o’s safety mechanisms cause it to decline tasks like speaker identification, age classification, MOS prediction, and audio deepfake detection. Notably, the model exhibits a significantly different refusal rate when responding to speaker verification tasks on different datasets. This is likely due to variations in the accompanying instructions or the quality of the input audio, suggesting the sensitivity of its built-in safeguards. Finally, we acknowledge that model performance varies with evaluation protocols. This report only serves as a preliminary exploration of the current state of LALMs.

[NLP-32] MIR-Bench: Benchmarking LLM's Long-Context Intelligence via Many-Shot In-Context Inductive Reasoning

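Quick read: This paper addresses the limitations of existing inductive reasoning (IR) benchmarks, which focus on few-shot settings (usually fewer than 10 examples) and do not evaluate the aggregation of many pieces of information from long contexts. It also notes that current many-shot In-Context Learning (ICL) evaluations focus mostly on classification and overlook complex information integration. To address both issues, the paper proposes MIR-Bench, a new many-shot in-context inductive reasoning benchmark that asks Large Language Models (LLMs) to induce outputs from input-output examples of underlying functions, covering tasks with diverse data formats. The key lies in the design of MIR-Bench, which both scales up the number of shots and introduces complex reasoning tasks, giving a more comprehensive evaluation of LLMs' inductive reasoning.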

Link: https://arxiv.org/abs/2502.09933
Authors: Kai Yan, Zhan Ling, Kang Liu, Yifan Yang, Ting-Han Fan, Lingfeng Shen, Zhengyin Du, Jiecao Chen
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: 32 pages, 11 figures

Abstract:Inductive Reasoning (IR), the ability to summarize rules from examples and apply them to new ones, has long been viewed as a primal ability for general intelligence and widely studied by cognitive science and AI researchers. Many benchmarks have been proposed to measure such ability for Large Language Models (LLMs); however, they focus on the few-shot setting (usually fewer than 10 examples) and lack evaluation for aggregating many pieces of information from long contexts. On the other hand, the ever-growing context length of LLMs has brought forth the novel paradigm of many-shot In-Context Learning (ICL), which addresses new tasks with hundreds to thousands of examples without expensive and inefficient fine-tuning. However, many-shot evaluations are mostly focused on classification (a very limited aspect of IR), and popular long-context LLM tasks such as Needle-In-A-Haystack (NIAH) seldom require complicated intelligence for integrating many pieces of information. To fix the issues from both worlds, we propose MIR-Bench, the first many-shot in-context inductive reasoning benchmark that asks LLMs to induce output via input-output examples from underlying functions with diverse data formats. Based on MIR-Bench, we study many novel problems for inductive reasoning and many-shot ICL, including robustness against erroneous shots and the effect of Chain-of-Thought (CoT), and acquired insightful findings.
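
The task format described in the abstract, inducing an output from many input-output examples of an underlying function, can be sketched as a prompt builder. The prompt layout, the hidden function, and the names below are hypothetical and are not MIR-Bench's actual format.

```python
def build_many_shot_prompt(func, shot_inputs, query):
    """Render input-output shots from a hidden function, then the unanswered query."""
    blocks = [f"Input: {x!r}\nOutput: {func(x)!r}" for x in shot_inputs]
    blocks.append(f"Input: {query!r}\nOutput:")
    return "\n\n".join(blocks)


def hidden_function(xs):
    """The rule the model must induce from the shots alone (here: sort ascending)."""
    return sorted(xs)


shots = [[3, 1, 2], [5, 4], [9, 7, 8]]
prompt = build_many_shot_prompt(hidden_function, shots, query=[2, 6, 4])
print(prompt)
```

A many-shot evaluation would use hundreds or thousands of shots instead of three and compare the model's completion with hidden_function(query).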

[NLP-33] A Taxonomy of Linguistic Expressions That Contribute To Anthropomorphism of Language Technologies

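[Quick read]: This paper examines the impacts of anthropomorphism in language technologies and the contexts in which it is appropriate. Its key contribution is a taxonomy of textual expressions that can contribute to anthropomorphism, developed from existing literature and an analysis of empirical cases of user interactions with language technologies, to support more precise and effective discussions and decisions. The paper also highlights challenges and tensions in understanding linguistic anthropomorphism, such as the fact that all language is fundamentally human, and that efforts to characterize and shift perceptions of humanness in machines can also dehumanize certain humans.

Link: https://arxiv.org/abs/2502.09870
Authors: Alicia DeVrio,Myra Cheng,Lisa Egede,Alexandra Olteanu,Su Lin Blodgett
Affiliations: Human-Computer Interaction Institute, Carnegie Mellon University; Stanford University; Microsoft Research
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 18 pages, 1 figure, to appear at CHI 2025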

Abstract:Recent attention to anthropomorphism – the attribution of human-like qualities to non-human objects or entities – of language technologies like LLMs has sparked renewed discussions about potential negative impacts of anthropomorphism. To productively discuss the impacts of this anthropomorphism and in what contexts it is appropriate, we need a shared vocabulary for the vast variety of ways that language can be anthropomorphic. In this work, we draw on existing literature and analyze empirical cases of user interactions with language technologies to develop a taxonomy of textual expressions that can contribute to anthropomorphism. We highlight challenges and tensions involved in understanding linguistic anthropomorphism, such as how all language is fundamentally human and how efforts to characterize and shift perceptions of humanness in machines can also dehumanize certain humans. We discuss ways that our taxonomy supports more precise and effective discussions of and decisions about anthropomorphism of language technologies.

[NLP-34] Solvable Dynamics of Self-Supervised Word Embeddings and the Emergence of Analogical Reasoning

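[Quick read]: This paper studies analytically solvable quadratic word embedding models as a simpler surrogate for representation learning in language models. Its key contribution is analytical solutions for the training dynamics (under certain hyperparameter choices) and for the final word embeddings, expressed only in terms of corpus statistics. The solutions show that these models learn orthogonal linear subspaces one at a time, each incrementing the effective rank of the embeddings until model capacity is saturated, and this process lets the models gradually acquire the ability to complete analogies.

Link: https://arxiv.org/abs/2502.09863
Authors: Dhruva Karkada,James B. Simon,Yasaman Bahri,Michael R. DeWeese
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL); Machine Learning (stat.ML)
Comments: 26 pages, 10 figures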

Abstract:The remarkable success of large language models relies on their ability to implicitly learn structured latent representations from the pretraining corpus. As a simpler surrogate for representation learning in language modeling, we study a class of solvable contrastive self-supervised algorithms which we term quadratic word embedding models. These models resemble the word2vec algorithm and perform similarly on downstream tasks. Our main contributions are analytical solutions for both the training dynamics (under certain hyperparameter choices) and the final word embeddings, given in terms of only the corpus statistics. Our solutions reveal that these models learn orthogonal linear subspaces one at a time, each one incrementing the effective rank of the embeddings until model capacity is saturated. Training on WikiText, we find that the top subspaces represent interpretable concepts. Finally, we use our dynamical theory to predict how and when models acquire the ability to complete analogies.

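[NLP-35] Automated Hypothesis Validation with Agentic Sequential Falsifications

[Quick read]: This paper addresses the difficulty of directly validating complex, abstract hypotheses, a task made harder by the volume of hypotheses generated by large language models (LLMs). The key solution is Popper, a framework based on Karl Popper's principle of falsification in which LLM agents design and execute falsification experiments targeting a hypothesis's measurable implications. A novel sequential testing framework ensures strict Type-I error control while actively gathering evidence from diverse observations, yielding robust error control, high power, and scalability.

Link: https://arxiv.org/abs/2502.09858
Authors: Kexin Huang,Ying Jin,Ryan Li,Michael Y. Li,Emmanuel Candès,Jure Leskovec
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Quantitative Methods (q-bio.QM)
Comments: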

Abstract:Hypotheses are central to information acquisition, decision-making, and discovery. However, many real-world hypotheses are abstract, high-level statements that are difficult to validate directly. This challenge is further intensified by the rise of hypothesis generation from Large Language Models (LLMs), which are prone to hallucination and produce hypotheses in volumes that make manual validation impractical. Here we propose Popper, an agentic framework for rigorous automated validation of free-form hypotheses. Guided by Karl Popper’s principle of falsification, Popper validates a hypothesis using LLM agents that design and execute falsification experiments targeting its measurable implications. A novel sequential testing framework ensures strict Type-I error control while actively gathering evidence from diverse observations, whether drawn from existing data or newly conducted procedures. We demonstrate Popper on six domains including biology, economics, and sociology. Popper delivers robust error control, high power, and scalability. Furthermore, compared to human scientists, Popper achieved comparable performance in validating complex biological hypotheses while reducing time tenfold, providing a scalable, rigorous solution for hypothesis validation.
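
One standard way to combine evidence from a sequence of experiments while keeping the Type-I error below a target level is to turn each p-value into an e-value and multiply the e-values, rejecting once the product reaches 1/alpha. The sketch below illustrates that general idea only; whether Popper uses exactly this calibrator or stopping rule is an assumption, and the names `p_to_e` and `sequential_falsification` are hypothetical.

```python
def p_to_e(p: float, kappa: float = 0.5) -> float:
    """Calibrate a p-value into an e-value with e = kappa * p**(kappa - 1).

    This family is a valid p-to-e calibrator for 0 < kappa < 1.
    """
    if not 0.0 < p <= 1.0:
        raise ValueError("p must lie in (0, 1]")
    return kappa * p ** (kappa - 1.0)


def sequential_falsification(p_values, alpha: float = 0.1, kappa: float = 0.5):
    """Multiply e-values one experiment at a time.

    Returns (rejected, steps_used, accumulated_evidence). The null is
    rejected as soon as the running product reaches 1 / alpha.
    """
    evidence = 1.0
    steps = 0
    for p in p_values:
        steps += 1
        evidence *= p_to_e(p, kappa)
        if evidence >= 1.0 / alpha:
            return True, steps, evidence
    return False, steps, evidence


# Illustrative p-values only; they do not come from the paper.
rejected, steps, evidence = sequential_falsification([0.01, 0.02, 0.5])
print(rejected, steps)
```

The early stop is what makes the procedure sequential: experiments that are never run cost nothing, and weak evidence (p-values near 1) shrinks the running product instead of being ignored.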

[NLP-36] Efficient Multitask Learning in Small Language Models Through Upside-Down Reinforcement Learning

[Quick read]: This paper asks how a resource-constrained model can match the performance of large language models (LLMs) on multitask prompt generation. The key solution combines upside-down reinforcement learning with synthetic data distillation from a powerful LLM, Llama-3, to train a small language model (SLM), namely a 100M-parameter GPT-2. The resulting SLM achieves relevance scores within 5% of state-of-the-art models (including Llama-3, Qwen2, and Mistral) while being up to 80 times smaller, showing that SLMs can be efficient multitask learners for resource-constrained and real-time applications.

Link: https://arxiv.org/abs/2502.09854
Authors: Yu-Chen Lin,Sanat Sharma,Hari Manikandan,Jayant Kumar,Tracy Holloway King,Jing Zheng
Affiliations: Adobe; Carnegie Mellon University; Meta
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:In this work, we demonstrate that small language models (SLMs), specifically a 100M parameter GPT-2 model, can achieve competitive performance in multitask prompt generation tasks while requiring only a fraction of the computational resources needed by large language models (LLMs). Through a novel combination of upside-down reinforcement learning and synthetic data distillation from a powerful LLM, Llama-3, we train an SLM that achieves relevance scores within 5% of state-of-the-art models, including Llama-3, Qwen2, and Mistral, despite being up to 80 times smaller, making it highly suitable for resource-constrained and real-time applications. This study highlights the potential of SLMs as efficient multitask learners in multimodal settings, providing a promising alternative to LLMs for scalable, low-latency deployments.

[NLP-37] Statistical Coherence Alignment for Large Language Model Representation Learning Through Tensor Field Convergence

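[Quick read]: This paper addresses statistical coherence in the representation learning of language models, with the aim of improving the coherence and contextual consistency of generated text. The key solution is Statistical Coherence Alignment, which uses tensor field convergence to guide embeddings to reflect the statistical dependencies in linguistic data. The method establishes a mathematical framework to quantify coherence alignment and integrates a loss function that optimizes representational consistency across training iterations.

Link: https://arxiv.org/abs/2502.09815
Authors: Jonathan Gale,Godfrey Aldington,Harriet Thistlewood,Thomas Tattershall,Basil Wentworth,Vincent Enoasmo
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: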

Abstract:Representation learning plays a central role in structuring internal embeddings to capture the statistical properties of language, influencing the coherence and contextual consistency of generated text. Statistical Coherence Alignment is introduced as a method to enforce structured token representations through tensor field convergence, guiding embeddings to reflect statistical dependencies inherent in linguistic data. A mathematical framework is established to quantify coherence alignment, integrating a loss function that optimizes representational consistency across training iterations. Empirical evaluations demonstrate that applying coherence constraints improves perplexity, enhances classification accuracy, and refines rare word embeddings, contributing to a more stable representation space. Comparative analyses with baseline models reveal that the proposed method fosters a more interpretable internal structure, ensuring that embeddings retain contextual dependencies while mitigating representation collapse. The impact on coherence score distributions suggests that the alignment mechanism strengthens semantic integrity across diverse linguistic constructs, leading to a more balanced organization of learned embeddings. Computational assessments indicate that while the method introduces additional memory and training costs, the structured optimization process justifies the trade-offs in applications requiring heightened contextual fidelity. Experimental results validate the effectiveness of coherence alignment in optimizing token representations, providing insights into how statistical dependencies can be leveraged to improve language model training.

[NLP-38] INJONGO: A Multicultural Intent Detection and Slot-filling Dataset for 16 African Languages

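[Quick read]: This paper addresses the fact that existing large-scale benchmarks in conversational AI mainly target English and lack evaluation of low-resource languages. The key solution is Injongo, a multicultural, open-source dataset covering 16 African languages, with utterances generated by native speakers across domains such as banking, travel, home, and dining. Experiments that fine-tune multilingual transformer models and prompt large language models on slot-filling and intent detection show that African-cultural utterances improve cross-lingual transfer from English more than Western-centric ones do.

Link: https://arxiv.org/abs/2502.09814
Authors: Hao Yu,Jesujoba O. Alabi,Andiswa Bukula,Jian Yun Zhuang,En-Shiun Annie Lee,Tadesse Kebede Guge,Israel Abebe Azime,Happy Buzaaba,Blessing Kudzaishe Sibanda,Godson K. Kalipe,Jonathan Mukiibi,Salomon Kabongo Kabenamualu,Mmasibidi Setaka,Lolwethu Ndolela,Nkiruka Odu,Rooweither Mabuya,Shamsuddeen Hassan Muhammad,Salomey Osei,Sokhar Samb,Juliet W. Murage,Dietrich Klakow,David Ifeoluwa Adelani
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: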

Abstract:Slot-filling and intent detection are well-established tasks in Conversational AI. However, current large-scale benchmarks for these tasks often exclude evaluations of low-resource languages and rely on translations from English benchmarks, thereby predominantly reflecting Western-centric concepts. In this paper, we introduce Injongo – a multicultural, open-source benchmark dataset for 16 African languages with utterances generated by native speakers across diverse domains, including banking, travel, home, and dining. Through extensive experiments, we benchmark the fine-tuning multilingual transformer models and the prompting large language models (LLMs), and show the advantage of leveraging African-cultural utterances over Western-centric utterances for improving cross-lingual transfer from the English language. Experimental results reveal that current LLMs struggle with the slot-filling task, with GPT-4o achieving an average performance of 26 F1-score. In contrast, intent detection performance is notably better, with an average accuracy of 70.6%, though it still falls behind the fine-tuning baselines. Compared to the English language, GPT-4o and fine-tuning baselines perform similarly on intent detection, achieving an accuracy of approximately 81%. Our findings suggest that the performance of LLMs is still behind for many low-resource African languages, and more work is needed to further improve their downstream performance.

[NLP-39] Improving Acoustic Side-Channel Attacks on Keyboards Using Transformers and Large Language Models

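[Quick read]: This paper addresses the security threat posed by acoustic side-channel attacks (ASCAs) on keyboard input. The key is the use of deep learning, specifically vision transformers (VTs) and large language models (LLMs), to improve the effectiveness and practicality of such attacks. The study introduces a noise mitigation method that uses LLMs for contextual understanding to detect and correct erroneous keystrokes in noisy environments, improving ASCA performance. In addition, lightweight language models fine-tuned with Low-Rank Adaptation (LoRA) perform comparably to heavyweight models with 67 times more parameters, which makes ASCAs considerably more feasible in practice.

Link: https://arxiv.org/abs/2502.09782
Authors: Jin Hyun Park,Seyyed Ali Ayati,Yichen Cai
Affiliations: Texas A&M University; University of Toronto
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)
Comments: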

Abstract:The increasing prevalence of microphones in everyday devices and the growing reliance on online services have amplified the risk of acoustic side-channel attacks (ASCAs) targeting keyboards. This study explores deep learning techniques, specifically vision transformers (VTs) and large language models (LLMs), to enhance the effectiveness and applicability of such attacks. We present substantial improvements over prior research, with the CoAtNet model achieving state-of-the-art performance. Our CoAtNet shows a 5.0% improvement for keystrokes recorded via smartphone (Phone) and 5.9% for those recorded via Zoom compared to previous benchmarks. We also evaluate transformer architectures and language models, with the best VT model matching CoAtNet’s performance. A key advancement is the introduction of a noise mitigation method for real-world scenarios. By using LLMs for contextual understanding, we detect and correct erroneous keystrokes in noisy environments, enhancing ASCA performance. Additionally, fine-tuned lightweight language models with Low-Rank Adaptation (LoRA) deliver comparable performance to heavyweight models with 67X more parameters. This integration of VTs and LLMs improves the practical applicability of ASCA mitigation, marking the first use of these technologies to address ASCAs and error correction in real-world scenarios.

[NLP-40] Prompt and circumstance: A word-by-word LLM prompting approach to interlinear glossing for low-resource languages

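[Quick read]: This paper addresses the partly automated creation of interlinear glossed text (IGT) to assist linguistic documentation. The key is to exploit the ability of large language models (LLMs) to follow natural-language instructions, which makes IGT creation more accessible. Using a retrieval-based prompting approach, the system beats the BERT-based shared-task baseline on the morpheme-level score for every one of the seven languages from the SIGMORPHON 2023 shared task. In a case study on Tsez, the LLM automatically creates and follows linguistic instructions, reducing errors on a confusing grammatical feature.

Link: https://arxiv.org/abs/2502.09778
Authors: Micha Elsner,David Liu
Affiliations: The Ohio State University; Sylvania High School
Categories: Computation and Language (cs.CL)
Comments: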

Abstract:Partly automated creation of interlinear glossed text (IGT) has the potential to assist in linguistic documentation. We argue that LLMs can make this process more accessible to linguists because of their capacity to follow natural-language instructions. We investigate the effectiveness of a retrieval-based LLM prompting approach to glossing, applied to the seven languages from the SIGMORPHON 2023 shared task. Our system beats the BERT-based shared task baseline for every language in the morpheme-level score category, and we show that a simple 3-best oracle has higher word-level scores than the challenge winner (a tuned sequence model) in five languages. In a case study on Tsez, we ask the LLM to automatically create and follow linguistic instructions, reducing errors on a confusing grammatical feature. Our results thus demonstrate the potential contributions which LLMs can make in interactive systems for glossing, both in making suggestions to human annotators and following directions.

[NLP-41] Non-Markovian Discrete Diffusion with Causal Language Models

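[Quick read]: This paper aims to narrow the expressiveness gap between discrete diffusion models and causal language models. The key solution is CaDDi (Causal Discrete Diffusion), a causal discrete diffusion model that unifies sequential and temporal modeling within a non-Markovian diffusion framework and, by integrating the temporal trajectory, makes generation more expressive and controllable. CaDDi also treats causal language models as a special case, so pretrained large language models (LLMs) can be adopted for discrete diffusion without architectural modifications.

Link: https://arxiv.org/abs/2502.09767
Authors: Yangtian Zhang,Sizhuang He,Daniel Levine,Lawrence Zhao,David Zhang,Syed A Rizvi,Emanuele Zappala,Rex Ying,David van Dijk
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: Under Review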

Abstract:Discrete diffusion models have emerged as a flexible and controllable paradigm for structured sequence modeling, yet they still lag behind causal language models in expressiveness. To bridge the gap between two paradigms, we introduce CaDDi, a causal discrete diffusion model that unifies sequential and temporal modeling within a non-Markovian diffusion framework. Unlike conventional diffusion models that operate step by step with no access to prior states, CaDDi integrates the temporal trajectory, enabling more expressive and controllable generation. Our approach also treats causal language models as a special case, allowing seamless adoption of pretrained large language models (LLMs) for discrete diffusion without the need for architectural modifications. Empirically, we demonstrate that CaDDi outperforms state-of-the-art discrete diffusion models on both natural language and biological sequence tasks, narrowing the gap between diffusion-based methods and large-scale autoregressive transformers.

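[NLP-42] The Widespread Adoption of Large Language Model-Assisted Writing Across Society

[Quick read]: This paper systematically analyzes LLM-assisted writing across four domains: consumer complaints, corporate communications, job postings, and international organization press releases. The key is a robust population-level statistical framework that reveals how LLM usage changed from January 2022 to September 2024, including adoption rates in each domain and geographic patterns. The study finds that LLM usage surged after the release of ChatGPT in November 2022 and stabilized by 2024, which reflects either saturation in LLM adoption or the increasing subtlety of more advanced models.

Link: https://arxiv.org/abs/2502.09747
Authors: Weixin Liang,Yaohui Zhang,Mihai Codreanu,Jiayu Wang,Hancheng Cao,James Zou
Affiliations: Stanford University; University of Washington; Emory University
Categories: Computation and Language (cs.CL)
Comments: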

Abstract:The recent advances in large language models (LLMs) have attracted significant public and policymaker interest in their adoption patterns. In this paper, we systematically analyze LLM-assisted writing across four domains (consumer complaints, corporate communications, job postings, and international organization press releases) from January 2022 to September 2024. Our dataset includes 687,241 consumer complaints, 537,413 corporate press releases, 304.3 million job postings, and 15,919 United Nations (UN) press releases. Using a robust population-level statistical framework, we find that LLM usage surged following the release of ChatGPT in November 2022. By late 2024, roughly 18% of financial consumer complaint text appears to be LLM-assisted, with adoption patterns spread broadly across regions and slightly higher in urban areas. For corporate press releases, up to 24% of the text is attributable to LLMs. In job postings, LLM-assisted writing accounts for just below 10% in small firms, and is even more common among younger firms. UN press releases also reflect this trend, with nearly 14% of content being generated or modified by LLMs. Although adoption climbed rapidly post-ChatGPT, growth appears to have stabilized by 2024, reflecting either saturation in LLM adoption or increasing subtlety of more advanced models. Our study shows the emergence of a new reality in which firms, consumers and even international organizations substantially rely on generative AI for communications.

[NLP-43] Partial Colexifications Improve Concept Embeddings

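[Quick read]: This paper addresses shortcomings of concept embeddings for computational linguistics tasks, especially those involving cross-linguistic data or sparse data from low-resource languages. The key solution is to use partial colexifications to improve concept embeddings. The paper shows that concept embeddings learned from automatically inferred partial colexifications perform better when evaluated against lexical similarity ratings, recorded instances of semantic shift, and word association data.

Link: https://arxiv.org/abs/2502.09743
Authors: Arne Rubehn,Johann-Mattis List
Affiliations: University of Passau
Categories: Computation and Language (cs.CL)
Comments: Submitted to the 63rd Annual Meeting of the Association for Computational Linguistics, Vienna, Austria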

Abstract:While the embedding of words has revolutionized the field of Natural Language Processing, the embedding of concepts has received much less attention so far. A dense and meaningful representation of concepts, however, could prove useful for several tasks in computational linguistics, especially those involving cross-linguistic data or sparse data from low resource languages. The first methods proposed so far embed concepts from automatically constructed colexification networks. While these approaches start from automatically inferred polysemies, attested across a larger number of languages, they are restricted to the word level, ignoring lexical relations that would only hold for parts of the words in a given language. Building on recently introduced methods for the inference of partial colexifications, we show how they can be used to improve concept embeddings in meaningful ways. The learned embeddings are evaluated against lexical similarity ratings, recorded instances of semantic shift, and word association data. We show that in all evaluation tasks, the inclusion of partial colexifications leads to improved concept representations and better results. Our results further show that the learned embeddings are able to capture and represent different semantic relationships between concepts.

[NLP-44] FoNE: Precise Single-Token Number Embeddings via Fourier Features

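[Quick read]: This paper addresses the fragmentation that arises when large language models (LLMs) represent a number with multiple tokens, which makes training and inference less efficient and hurts performance on number-related tasks. The key solution is Fourier Number Embedding (FoNE), which maps numbers directly into the embedding space using their Fourier features, so that each number is a single token with only two embedding dimensions per digit. This avoids fragmentation, speeds up training and inference, reduces computational overhead, and achieves higher accuracy on a range of numerical tasks.

Link: https://arxiv.org/abs/2502.09741
Authors: Tianyi Zhou,Deqing Fu,Mahdi Soltanolkotabi,Robin Jia,Vatsal Sharan
Affiliations: University of Southern California
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: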

Abstract:Large Language Models (LLMs) typically represent numbers using multiple tokens, which requires the model to aggregate these tokens to interpret numerical values. This fragmentation makes both training and inference less efficient and adversely affects the model’s performance on number-related tasks. Inspired by the observation that pre-trained LLMs internally learn Fourier-like features for number tokens, we propose Fourier Number Embedding (FoNE), a novel method that directly maps numbers into the embedding space with their Fourier features. FoNE encodes each number as a single token with only two embedding dimensions per digit, effectively capturing numerical values without fragmentation. This compact representation accelerates both training and inference. Compared to traditional subword and digit-wise embeddings, FoNE not only reduces computational overhead but also achieves higher accuracy across various numerical tasks including addition, subtraction and multiplication. On 6-digit decimal addition, FoNE requires 64× less data to achieve 99% accuracy than subword and digit-wise embeddings while using 3× and 6× fewer tokens per number, respectively. Furthermore, FoNE is the only method that yields 100% accuracy on over 100,000 test examples for addition, subtraction, and multiplication. The codes and visualization are available at this https URL.
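
The "two embedding dimensions per digit" idea can be illustrated with a short sketch. It assumes, for illustration rather than as a statement of the authors' implementation, that digit position i is represented by the cosine and sine of 2πx/10^i; the function names `fone_embed` and `fone_decode` are hypothetical, and only non-negative integers are handled.

```python
import math


def fone_embed(x: int, num_digits: int) -> list[float]:
    """Encode a non-negative integer as (cos, sin) pairs, one pair per digit.

    Pair i (1-based) uses period 10**i, so it carries x mod 10**i.
    """
    embedding = []
    for i in range(1, num_digits + 1):
        period = 10 ** i
        angle = 2.0 * math.pi * (x % period) / period
        embedding.extend([math.cos(angle), math.sin(angle)])
    return embedding


def fone_decode(embedding: list[float]) -> int:
    """Recover the integer by reading one decimal digit from each pair."""
    value = 0
    for i in range(1, len(embedding) // 2 + 1):
        cos_part = embedding[2 * (i - 1)]
        sin_part = embedding[2 * (i - 1) + 1]
        period = 10 ** i
        angle = math.atan2(sin_part, cos_part) % (2.0 * math.pi)
        residue = round(angle / (2.0 * math.pi) * period)  # approx. x mod 10**i
        digit = (residue // 10 ** (i - 1)) % 10
        value += digit * 10 ** (i - 1)
    return value


vector = fone_embed(123456, num_digits=6)
print(len(vector), fone_decode(vector))
```

A six-digit number therefore occupies 12 dimensions of a single token's embedding, and each digit can be read back independently from its own pair.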

[NLP-45] Making Them a Malicious Database: Exploiting Query Code to Jailbreak Aligned Large Language Models

[Quick read]: This paper addresses the security and ethical challenges facing large language models (LLMs), in particular well-designed jailbreak attacks that circumvent existing safety alignment. The key solution is QueryAttack, a new framework for systematically examining how well safety alignment generalizes. By translating malicious queries into code-style structured queries, QueryAttack bypasses the safety alignment of LLMs. Results show high attack success rates (ASRs) on mainstream LLMs from different developers and with different capabilities. The paper also evaluates QueryAttack against common defenses and proposes a tailored defense that reduces the ASR by up to 64% on GPT-4-1106.

Link: https://arxiv.org/abs/2502.09723
Authors: Qingsong Zou,Jingyu Xiao,Qing Li,Zhi Yan,Yuhang Wang,Li Xu,Wenxuan Wang,Kuofeng Gao,Ruoyu Li,Yong Jiang
Affiliations: Tsinghua Shenzhen International Graduate School; The Chinese University of Hong Kong; Pengcheng Laboratory; Jilin University; Southwest University; University of Electronic Science and Technology of China; Shenzhen University
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 15 pages, 11 figures

Abstract:Recent advances in large language models (LLMs) have demonstrated remarkable potential in the field of natural language processing. Unfortunately, LLMs face significant security and ethical risks. Although techniques such as safety alignment are developed for defense, prior research reveals the possibility of bypassing such defenses through well-designed jailbreak attacks. In this paper, we propose QueryAttack, a novel framework to systematically examine the generalizability of safety alignment. By treating LLMs as knowledge databases, we translate malicious queries in natural language into code-style structured queries to bypass the safety alignment mechanisms of LLMs. We conduct extensive experiments on mainstream LLMs, and the results show that QueryAttack achieves high attack success rates (ASRs) across LLMs with different developers and capabilities. We also evaluate QueryAttack’s performance against common defenses, confirming that it is difficult to mitigate with general defensive techniques. To defend against QueryAttack, we tailor a defense method which can reduce ASR by up to 64% on GPT-4-1106. The code of QueryAttack can be found at this https URL.

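[NLP-46] Evaluating GPT's Capability in Identifying Stages of Cognitive Impairment from Electronic Health Data

[Quick read]: This paper addresses the automatic identification of stages of cognitive impairment from electronic health records (EHRs). The key is to use zero-shot GPT-4o to determine the stage of cognitive impairment in two tasks: assigning the global Clinical Dementia Rating (CDR) from specialist notes, and distinguishing normal cognition, mild cognitive impairment (MCI), and dementia. GPT-4o reaches weighted kappa scores of 0.83 and 0.91 on the two tasks, and 0.96 on high-confidence cases, which shows its potential as a scalable chart review tool for creating research datasets and assisting clinical diagnosis.

Link: https://arxiv.org/abs/2502.09715
Authors: Yu Leng,Yingnan He,Colin Magdamo,Ana-Maria Vranceanu,Christine S. Ritchie,Shibani S. Mukerji,Lidia M. V. R. Moura,John R. Dickson,Deborah Blacker,Sudeshna Das
Affiliations: mgh.harvard.edu (Massachusetts General Hospital and Harvard Medical School); mgb.org (unknown)
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: Findings paper presented at Machine Learning for Health (ML4H) symposium 2024, December 15-16, 2024, Vancouver, Canada, 7 pages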

Abstract:Identifying cognitive impairment within electronic health records (EHRs) is crucial not only for timely diagnoses but also for facilitating research. Information about cognitive impairment often exists within unstructured clinician notes in EHRs, but manual chart reviews are both time-consuming and error-prone. To address this issue, our study evaluates an automated approach using zero-shot GPT-4o to determine stage of cognitive impairment in two different tasks. First, we evaluated the ability of GPT-4o to determine the global Clinical Dementia Rating (CDR) on specialist notes from 769 patients who visited the memory clinic at Massachusetts General Hospital (MGH), and achieved a weighted kappa score of 0.83. Second, we assessed GPT-4o’s ability to differentiate between normal cognition, mild cognitive impairment (MCI), and dementia on all notes in a 3-year window from 860 Medicare patients. GPT-4o attained a weighted kappa score of 0.91 in comparison to specialist chart reviews and 0.96 on cases that the clinical adjudicators rated with high confidence. Our findings demonstrate GPT-4o’s potential as a scalable chart review tool for creating research datasets and assisting diagnosis in clinical settings in the future.
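
The weighted kappa scores reported above measure agreement on ordered categories and penalize large disagreements more than small ones. The abstract does not say whether linear or quadratic weights were used, so the sketch below supports both; the function name `weighted_kappa` and the toy ratings are made up for illustration and are not data from the study.

```python
def weighted_kappa(rater_a, rater_b, num_categories, weighting="quadratic"):
    """Cohen's weighted kappa for ordinal labels 0..num_categories-1.

    Assumes num_categories >= 2 and two equally long, non-empty rating lists.
    """
    k = num_categories
    n = len(rater_a)
    # Observed joint distribution of (rater_a label, rater_b label).
    observed = [[0.0] * k for _ in range(k)]
    for a, b in zip(rater_a, rater_b):
        observed[a][b] += 1.0 / n
    # Marginal label distribution of each rater.
    row = [sum(observed[i]) for i in range(k)]
    col = [sum(observed[i][j] for i in range(k)) for j in range(k)]
    observed_disagreement = 0.0
    expected_disagreement = 0.0
    for i in range(k):
        for j in range(k):
            distance = abs(i - j) / (k - 1)
            weight = distance ** 2 if weighting == "quadratic" else distance
            observed_disagreement += weight * observed[i][j]
            expected_disagreement += weight * row[i] * col[j]
    if expected_disagreement == 0.0:
        return 1.0  # both raters used one identical category throughout
    return 1.0 - observed_disagreement / expected_disagreement


# Toy labels: 0 = normal cognition, 1 = MCI, 2 = dementia (illustrative only).
specialist = [0, 0, 1, 1, 2, 2]
model = [0, 0, 1, 2, 2, 2]
print(weighted_kappa(specialist, model, num_categories=3))
```

A value of 1 means perfect agreement and 0 means agreement no better than chance, which is why scores such as 0.83 and 0.91 are read as strong agreement with specialist review.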

[NLP-47] Trust at Your Own Peril: A Mixed Methods Exploration of the Ability of Large Language Models to Generate Expert-Like Systems Engineering Artifacts and a Characterization of Failure Modes

Link: https://arxiv.org/abs/2502.09690
Authors: Taylan G. Topcu,Mohammed Husain,Max Ofsa,Paul Wach
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 41 pages, 10 figures

[NLP-48] Large Language Models and Provenance Metadata for Determining the Relevance of Images and Videos in News Stories

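[Quick read]: This paper addresses the spread of multimodal misinformation, especially campaigns that pair text with images and videos that are taken out of context or entirely fabricated to support a given narrative. The key solution is a system built around a large language model that analyzes both the article's text and the provenance metadata of the included images and videos to determine whether they are relevant.

Link: https://arxiv.org/abs/2502.09689
Authors: Tomas Peterka,Matyas Bohacek
Affiliations: Stanford University
Categories: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Computers and Society (cs.CY)
Comments: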

Abstract:The most effective misinformation campaigns are multimodal, often combining text with images and videos taken out of context – or fabricating them entirely – to support a given narrative. Contemporary methods for detecting misinformation, whether in deepfakes or text articles, often miss the interplay between multiple modalities. Built around a large language model, the system proposed in this paper addresses these challenges. It analyzes both the article’s text and the provenance metadata of included images and videos to determine whether they are relevant. We open-source the system prototype and interactive web interface.

[NLP-49] Mind What You Ask For: Emotional and Rational Faces of Persuasion by Large Language Models

Link: https://arxiv.org/abs/2502.09687
Authors: Wiktoria Mieleszczenko-Kowszewicz,Beata Bajcar,Jolanta Babiak,Berenika Dyczek,Jakub Świstak,Przemysław Biecek
Affiliations: Warsaw University of Technology; Wrocław University of Science and Technology; University of Warsaw; Lincoln University College, Petaling Jaya, Malaysia
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
Comments:

[NLP-50] Multi-level Conflict-Aware Network for Multi-modal Sentiment Analysis

Link: https://arxiv.org/abs/2502.09675
Authors: Yubo Gao,Haotian Wu,Lei Zhang
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 5 pages, 1 figure

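[NLP-51] The Hidden Dimensions of LLM Alignment: A Multi-Dimensional Safety Analysis

[Quick read]: This paper investigates how the safety-aligned behaviors of large language models (LLMs), such as refusing harmful queries, are jointly controlled by multi-dimensional directions in activation space rather than by a single direction. The key is to discover and study these directions and their interactions, including the relationship between the dominant direction and secondary directions, and to show that removing certain trigger tokens from harmful queries can mitigate these directions and thereby bypass the learned safety capability. This multi-dimensional perspective offers new insight into the vulnerability of safety alignment.

Link: https://arxiv.org/abs/2502.09674
Authors: Wenbo Pan,Zhichao Liu,Qiguang Chen,Xiangyang Zhou,Haining Yu,Xiaohua Jia
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Code and artifacts: this https URL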

Abstract:Large Language Models’ safety-aligned behaviors, such as refusing harmful queries, can be represented by linear directions in activation space. Previous research modeled safety behavior with a single direction, limiting mechanistic understanding to an isolated safety feature. In this work, we discover that safety-aligned behavior is jointly controlled by multi-dimensional directions. Namely, we study the vector space of representation shifts during safety fine-tuning on Llama 3 8B for refusing jailbreaks. By studying orthogonal directions in the space, we first find that a dominant direction governs the model’s refusal behavior, while multiple smaller directions represent distinct and interpretable features like hypothetical narrative and role-playing. We then measure how different directions promote or suppress the dominant direction, showing the important role of secondary directions in shaping the model’s refusal representation. Finally, we demonstrate that removing certain trigger tokens in harmful queries can mitigate these directions to bypass the learned safety capability, providing new insights on understanding safety alignment vulnerability from a multi-dimensional perspective. Code and artifacts are available at this https URL.

[NLP-52] Are Smarter LLMs Safer? Exploring Safety-Reasoning Trade-offs in Prompting and Fine-Tuning

Link: https://arxiv.org/abs/2502.09673
Authors: Ang Li,Yichuan Mo,Mingjie Li,Yifei Wang,Yisen Wang
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

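[NLP-53] The Science of Evaluating Foundation Models

[Quick read]: This paper addresses the challenges of evaluating large foundation models, which stem from their size, capabilities, and deployment across diverse applications. The key solution is a structured evaluation framework tailored to specific use-case contexts, together with actionable tools such as checklists and templates that keep evaluations thorough, reproducible, and practical. The paper also gives a targeted review of recent advances in LLM evaluation, with an emphasis on real-world applications.

Link: https://arxiv.org/abs/2502.09670
Authors: Jiayi Yuan,Jiamu Zhang,Andrew Wen,Xia Hu
Affiliations: Rice University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: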

Abstract:The emergent phenomena of large foundation models have revolutionized natural language processing. However, evaluating these models presents significant challenges due to their size, capabilities, and deployment across diverse applications. Existing literature often focuses on individual aspects, such as benchmark performance or specific tasks, but fails to provide a cohesive process that integrates the nuances of diverse use cases with broader ethical and operational considerations. This work focuses on three key aspects: (1) Formalizing the Evaluation Process by providing a structured framework tailored to specific use-case contexts, (2) Offering Actionable Tools and Frameworks such as checklists and templates to ensure thorough, reproducible, and practical evaluations, and (3) Surveying Recent Work with a targeted review of advancements in LLM evaluation, emphasizing real-world applications.

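[NLP-54] k-LLMmeans: Summaries as Centroids for Interpretable and Scalable LLM-Based Text Clustering

[Quick read]: This paper addresses the loss of contextual and semantic detail that occurs when traditional k-means clustering of text relies on purely numerical document embeddings. The key idea is k-LLMmeans, which uses large language models (LLMs) to generate textual summaries that serve as cluster centroids. This preserves the properties of k-means while improving interpretability: each centroid is represented by an LLM-generated summary, whose embedding guides cluster assignment. The paper also proposes a mini-batch variant for efficient online clustering of streaming text, with real-time interpretability of the evolving centroids.

Link: https://arxiv.org/abs/2502.09667
Authors: Jairo Diaz-Rodriguez
Affiliations: York University
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments: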

Abstract:We introduce k-LLMmeans, a novel modification of the k-means clustering algorithm that utilizes LLMs to generate textual summaries as cluster centroids, thereby capturing contextual and semantic nuances often lost when relying on purely numerical means of document embeddings. This modification preserves the properties of k-means while offering greater interpretability: the cluster centroid is represented by an LLM-generated summary, whose embedding guides cluster assignments. We also propose a mini-batch variant, enabling efficient online clustering for streaming text data and providing real-time interpretability of evolving cluster centroids. Through extensive simulations, we show that our methods outperform vanilla k-means on multiple metrics while incurring only modest LLM usage that does not scale with dataset size. Finally, we present a case study showcasing the interpretability of evolving cluster centroids in sequential text streams. As part of our evaluation, we compile a new dataset from StackExchange, offering a benchmark for text-stream clustering.
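
The loop described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' code: `embed` is a toy bag-of-words embedding and `summarize` is a word-frequency stand-in for the LLM call, both hypothetical, as is the name `k_summary_means`.

```python
import math
from collections import Counter


def embed(text, vocab):
    # Toy L2-normalised bag-of-words vector (assumption: stands in for a real text encoder).
    counts = Counter(text.lower().split())
    vec = [float(counts[w]) for w in vocab]
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


def summarize(texts, top_n=3):
    # Hypothetical stand-in for the LLM summary: the top_n most frequent words.
    counts = Counter(w for t in texts for w in t.lower().split())
    return " ".join(w for w, _ in counts.most_common(top_n))


def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def k_summary_means(texts, init_summaries, iters=5):
    vocab = sorted({w for t in texts for w in t.lower().split()})
    doc_vecs = [embed(t, vocab) for t in texts]
    summaries = list(init_summaries)
    labels = [0] * len(texts)
    for _ in range(iters):
        # Each centroid is the embedding of the cluster's current textual summary.
        centroids = [embed(s, vocab) for s in summaries]
        # Assignment step: nearest summary embedding.
        labels = [
            min(range(len(centroids)), key=lambda j: sq_dist(v, centroids[j]))
            for v in doc_vecs
        ]
        # Update step: re-summarise each non-empty cluster instead of averaging vectors.
        for j in range(len(summaries)):
            members = [t for t, lab in zip(texts, labels) if lab == j]
            if members:
                summaries[j] = summarize(members)
    return labels, summaries
```

With a real encoder and LLM, only `embed` and `summarize` would change; the number of summary calls per iteration equals the number of clusters, which is consistent with the abstract's remark that LLM usage does not scale with dataset size.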

[NLP-55] Cancer Vaccine Adjuvant Name Recognition from Biomedical Literature using Large Language Models

Link: https://arxiv.org/abs/2502.09659
Authors: Hasin Rehana,Jie Zheng,Leo Yeh,Benu Bansal,Nur Bengisu Çam,Christianah Jemiyo,Brett McGregor,Arzucan Özgür,Yongqun He,Junguk Hur
Affiliations: unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
Comments: 10 pages, 6 figures, 4 tables

[NLP-56] Neuro-Conceptual Artificial Intelligence: Integrating OPM with Deep Learning to Enhance Question Answering Quality

Link: https://arxiv.org/abs/2502.09658
Authors: Xin Kang,Veronika Shteingardt,Yuhan Wang,Dov Dori
Affiliations: unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 15 pages, 3 figures

[NLP-57] AI-VERDE: A Gateway for Egalitarian Access to Large Language Model-Based Resources For Educational Institutions NAACL

Link: https://arxiv.org/abs/2502.09651
Authors: Paul Mithun,Enrique Noriega-Atala,Nirav Merchant,Edwin Skidmore
Affiliations: University of Arizona
Categories: Computation and Language (cs.CL); Computers and Society (cs.CY)
Comments: 7 pages, includes appendix. Submitted to NAACL System demonstrations track 2025

[NLP-58] Principled Data Selection for Alignment: The Hidden Risks of Difficult Examples

Link: https://arxiv.org/abs/2502.09650
Authors: Chengqian Gao,Haonan Li,Liu Liu,Zeke Xie,Peilin Zhao,Zhiqiang Xu
Affiliations: unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

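[NLP-59] UKTA: Unified Korean Text Analyzer

[Quick read]: This paper addresses the challenges that Korean automated writing evaluation tools face in multi-view analysis, error propagation, and evaluation explainability. The key solution is UKTA (Unified Korean Text Analyzer), a comprehensive system for Korean text analysis and writing evaluation. UKTA provides accurate low-level morpheme analysis, key lexical features for mid-level explainability, and transparent rubric-based writing scores, improving evaluation accuracy and scoring consistency.

Link: https://arxiv.org/abs/2502.09648
Authors: Seokho Ahn,Junhyung Park,Ganghee Go,Chulhui Kim,Jiho Jung,Myung Sun Shin,Do-Guk Kim,Young-Duk Seo
Affiliations: Inha University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Accepted by SAC 2025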

Abstract:Evaluating writing quality is complex and time-consuming, often delaying feedback to learners. While automated writing evaluation tools are effective for English, Korean automated writing evaluation tools face challenges due to their inability to address multi-view analysis, error propagation, and evaluation explainability. To overcome these challenges, we introduce UKTA (Unified Korean Text Analyzer), a comprehensive Korean text analysis and writing evaluation system. UKTA provides accurate low-level morpheme analysis, key lexical features for mid-level explainability, and transparent high-level rubric-based writing scores. Our approach enhances accuracy and quadratic weighted kappa over the existing baseline, positioning UKTA as a leading multi-perspective tool for Korean text analysis and writing evaluation.
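
Quadratic weighted kappa, one of the metrics the abstract reports, measures agreement between two sets of ordinal scores and penalises large disagreements more heavily than small ones. The sketch below follows the standard definition and assumes integer scores in `0..num_classes-1`; the function and variable names are illustrative and not taken from UKTA.

```python
def quadratic_weighted_kappa(rater_a, rater_b, num_classes):
    """Standard QWK: 1 - sum(w * observed) / sum(w * expected)."""
    n = len(rater_a)
    # Observed confusion matrix between the two raters.
    observed = [[0.0] * num_classes for _ in range(num_classes)]
    for a, b in zip(rater_a, rater_b):
        observed[a][b] += 1
    # Marginal score histograms for each rater.
    hist_a = [sum(row) for row in observed]
    hist_b = [sum(observed[i][j] for i in range(num_classes)) for j in range(num_classes)]
    numerator = 0.0
    denominator = 0.0
    for i in range(num_classes):
        for j in range(num_classes):
            weight = (i - j) ** 2 / (num_classes - 1) ** 2
            numerator += weight * observed[i][j]
            # Expected count if the two raters were independent.
            denominator += weight * hist_a[i] * hist_b[j] / n
    # Note: the denominator is zero when every score falls in a single class.
    return 1.0 - numerator / denominator
```

A value of 1 means perfect agreement, 0 means agreement no better than chance, and negative values mean systematic disagreement.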

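[NLP-60] Unveiling Simplicities of Attention: Adaptive Long-Context Head Identification

[Quick read]: This paper addresses the difficulty of processing long contexts in natural language processing, in particular the limited understanding of how attention heads behave on long sequences. Its key contribution is a method that uses only local keys to predict which attention heads need long-context information to predict the next token accurately. The core idea is to estimate long-context scores with second-moment approximations, which reveals simple properties of attention on long sequences and opens the door to potential efficiency gains.

Link: https://arxiv.org/abs/2502.09647
Authors: Konstantin Donhauser,Charles Arnal,Mohammad Pezeshki,Vivien Cabannes,David Lopez-Paz,Kartik Ahuja
Affiliations: unknown
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: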

Abstract:The ability to process long contexts is crucial for many natural language processing tasks, yet it remains a significant challenge. While substantial progress has been made in enhancing the efficiency of attention mechanisms, there is still a gap in understanding how attention heads function in long-context settings. In this paper, we observe that while certain heads consistently attend to local information only, others swing between attending to local and long-context information depending on the query. This raises the question: can we identify which heads require long-context information to predict the next token accurately? We demonstrate that it’s possible to predict which heads are crucial for long-context processing using only local keys. The core idea here is to exploit a simple model for the long-context scores via second moment approximations. These findings unveil simple properties of attention in the context of long sequences, and open the door to potentially significant gains in efficiency.

[NLP-61] Language Shift or Maintenance? An Intergenerational Study of the Tibetan Community in Saudi Arabia

Link: https://arxiv.org/abs/2502.09646
Authors: Sumaiyah Turkistani,Mohammad Almoaily
Affiliations: unknown
Categories: Computation and Language (cs.CL); Computers and Society (cs.CY)
Comments:

[NLP-62] From No to Know: Taxonomy Challenges and Opportunities for Negation Understanding in Multimodal Foundation Models

Link: https://arxiv.org/abs/2502.09645
Authors: Mayank Vatsa,Aparna Bharati,Surbhi Mittal,Richa Singh
Affiliations: IITJ; Lehigh University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

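[NLP-63] From Argumentation to Deliberation: Perspectivized Stance Vectors for Fine-grained (Dis)agreement Analysis NAACL

[Quick read]: This paper addresses how to identify and analyze the perspectives of different participants (arguers or stakeholders) in a debate in order to support conflict resolution. The key solution is a framework called Perspectivized Stance Vectors, which enables a fine-grained analysis of each participant's perspectivized stance on a given issue. The approach identifies not only opposing views but also shared perspectives arising from attitudes, values, or needs, so that a perspective-modulated degree of (dis)agreement can be measured and actionable points for resolving the conflict can be identified.

Link: https://arxiv.org/abs/2502.09644
Authors: Moritz Plenz,Philipp Heinisch,Janosch Gehring,Philipp Cimiano,Anette Frank
Affiliations: Heidelberg University; Bielefeld University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
Comments: Accepted at NAACL Findings 2025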

Abstract:Debating over conflicting issues is a necessary first step towards resolving conflicts. However, intrinsic perspectives of an arguer are difficult to overcome by persuasive argumentation skills. Proceeding from a debate to a deliberative process, where we can identify actionable options for resolving a conflict, requires a deeper analysis of arguments and the perspectives they are grounded in - as it is only from there that one can derive mutually agreeable resolution steps. In this work we develop a framework for a deliberative analysis of arguments in a computational argumentation setup. We conduct a fine-grained analysis of perspectivized stances expressed in the arguments of different arguers or stakeholders on a given issue, aiming not only to identify their opposing views, but also shared perspectives arising from their attitudes, values or needs. We formalize this analysis in Perspectivized Stance Vectors that characterize the individual perspectivized stances of all arguers on a given issue. We construct these vectors by determining issue- and argument-specific concepts, and predict an arguer’s stance relative to each of them. The vectors allow us to measure a modulated (dis)agreement between arguers, structured by perspectives, which allows us to identify actionable points for conflict resolution, as a first step towards deliberation.
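
To make the idea of comparing stance vectors concrete, here is a small sketch under strong assumptions: each arguer's vector is a dictionary mapping a concept to a stance in {-1, 0, +1}, and (dis)agreement is scored by counting concepts where both arguers take a non-neutral stance. The encoding, the scoring rule, and the name `compare_stances` are all illustrative; the abstract does not specify how the paper computes its modulated (dis)agreement.

```python
def compare_stances(stance_a, stance_b):
    """Return (score, shared, opposed) for two concept -> stance dictionaries.

    Stances are assumed to be -1 (against), 0 (neutral), or +1 (in favour).
    score is in [-1, 1]: +1 if the arguers agree on every concept where both
    take a side, -1 if they disagree on all of them, 0.0 if there is no overlap.
    """
    shared, opposed = [], []
    for concept in sorted(set(stance_a) & set(stance_b)):
        product = stance_a[concept] * stance_b[concept]
        if product > 0:
            shared.append(concept)    # same side: a possible common ground
        elif product < 0:
            opposed.append(concept)   # opposite sides: a point of conflict
    total = len(shared) + len(opposed)
    score = (len(shared) - len(opposed)) / total if total else 0.0
    return score, shared, opposed
```

Under this toy scheme, the `shared` list plays the role of the shared perspectives that the abstract suggests can point to actionable steps, even when the two arguers disagree on the issue overall.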

[NLP-64] Krutrim LLM: Multilingual Foundational Model for over a Billion People

Link: https://arxiv.org/abs/2502.09642
Authors: Aditya Kallappa,Palash Kamble,Abhinav Ravi,Akshat Patidar,Vinayak Dhruv,Deepak Kumar,Raghav Awasthi,Arveti Manjunath,Shubham Agarwal,Kumar Ashish,Gautam Bhargava,Chandra Khatri
Affiliations: Krutrim AI Team, Bangalore, India
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

[NLP-65] Online Social Support Detection in Spanish Social Media Texts

Link: https://arxiv.org/abs/2502.09640
Authors: Moein Shahiki Tash,Luis Ramos,Zahra Ahani,Raul Monroy,Olga Kolesnikova,Hiram Calvo,Grigori Sidorov
Affiliations: unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

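[NLP-66] Jailbreaking to Jailbreak

[Quick read]: This paper addresses the fact that large language models (LLMs) remain vulnerable to automated and human-crafted jailbreaks even after refusal training. The key idea is a novel red-teaming strategy, LLM-as-red-teamer: a human jailbreaks a refusal-trained LLM so that it is willing to jailbreak itself or other LLMs, yielding what the authors call J_2 attackers. These attackers systematically evaluate target models with red-teaming strategies and improve through in-context learning from previous failures. In the experiments, Sonnet 3.5 and Gemini 1.5 pro reach attack success rates of 93.0% and 91.0% respectively against GPT-4o on Harmbench, outperforming other LLMs as J_2 attackers.

Link: https://arxiv.org/abs/2502.09638
Authors: Jeremy Kritz,Vaughn Robinson,Robert Vacareanu,Bijan Varjavand,Michael Choi,Bobby Gogov,Scale Red Team,Summer Yue,Willow E. Primack,Zifan Wang
Affiliations: Scale AI
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: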

Abstract:Refusal training on Large Language Models (LLMs) prevents harmful outputs, yet this defense remains vulnerable to both automated and human-crafted jailbreaks. We present a novel LLM-as-red-teamer approach in which a human jailbreaks a refusal-trained LLM to make it willing to jailbreak itself or other LLMs. We refer to the jailbroken LLMs as J_2 attackers, which can systematically evaluate target models using various red teaming strategies and improve their performance via in-context learning from the previous failures. Our experiments demonstrate that Sonnet 3.5 and Gemini 1.5 pro outperform other LLMs as J_2, achieving 93.0% and 91.0% attack success rates (ASRs) respectively against GPT-4o (and similar results across other capable LLMs) on Harmbench. Our work not only introduces a scalable approach to strategic red teaming, drawing inspiration from human red teamers, but also highlights jailbreaking-to-jailbreak as an overlooked failure mode of the safeguard. Specifically, an LLM can bypass its own safeguards by employing a jailbroken version of itself that is willing to assist in further jailbreaking. To prevent any direct misuse with J_2, while advancing research in AI safety, we publicly share our methodology while keeping specific prompting details private.

[NLP-67] Meta-Cultural Competence: Climbing the Right Hill of Cultural Awareness

Link: https://arxiv.org/abs/2502.09637
Authors: Sougata Saha,Saurabh Kumar Pandey,Monojit Choudhury
Affiliations: Mohamed bin Zayed University of Artificial Intelligence
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:

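[NLP-68] Reading between the Lines: Can LLMs Identify Cross-Cultural Communication Gaps?

[Quick read]: This paper investigates how cultural differences affect the understandability of book reviews, especially reviews containing culture-specific elements. A user study on 57 book reviews from Goodreads finds that 83% of them contain at least one culture-specific element that is hard for readers from other cultural backgrounds to understand. The paper also evaluates how well GPT-4o identifies such elements given the reader's cultural background; the results are mixed, indicating significant room for improvement.

Link: https://arxiv.org/abs/2502.09636
Authors: Sougata Saha,Saurabh Kumar Pandey,Harshit Gupta,Monojit Choudhury
Affiliations: Mohamed bin Zayed University of Artificial Intelligence; IIIT Hyderabad
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: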

Abstract:In a rapidly globalizing and digital world, content such as book and product reviews created by people from diverse cultures is read and consumed by others from different corners of the world. In this paper, we investigate the extent and patterns of gaps in understandability of book reviews due to the presence of culturally-specific items and elements that might be alien to users from another culture. Our user-study on 57 book reviews from Goodreads reveals that 83% of the reviews had at least one culture-specific difficult-to-understand element. We also evaluate the efficacy of GPT-4o in identifying such items, given the cultural background of the reader; the results are mixed, implying a significant scope for improvement. Our datasets are available here: this https URL

[NLP-69] CORRECT: Context- and Reference-Augmented Reasoning and Prompting for Fact-Checking NAACL-25

Link: https://arxiv.org/abs/2502.09635
Authors: Delvin Ce Zhang,Dongwon Lee
Affiliations: The Pennsylvania State University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Accepted to NAACL-25

Computer Vision

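[CV-0] Text-guided Sparse Voxel Pruning for Efficient 3D Visual Grounding

[Quick read]: This paper addresses the need for real-time inference in 3D visual grounding, which conventional two-stage or point-based methods struggle to meet. The key solution is an efficient multi-level convolutional architecture combined with text-guided pruning (TGP) and completion-based addition (CBA), which fuse the 3D scene representation with text features efficiently through gradual region pruning and target completion. The method achieves top inference speed together with state-of-the-art accuracy.

Link: https://arxiv.org/abs/2502.10392
Authors: Wenxuan Guo,Xiuwei Xu,Ziwei Wang,Jianjiang Feng,Jie Zhou,Jiwen Lu
Affiliations: Tsinghua University; Nanyang Technological University
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: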

Abstract:In this paper, we propose an efficient multi-level convolution architecture for 3D visual grounding. Conventional methods struggle to meet the requirements of real-time inference due to their two-stage or point-based architectures. Inspired by the success of multi-level fully sparse convolutional architecture in 3D object detection, we aim to build a new 3D visual grounding framework following this technical route. However, because the 3D scene representation must interact deeply with text features in 3D visual grounding, a sparse convolution-based architecture is inefficient for this interaction due to the large number of voxel features. To this end, we propose text-guided pruning (TGP) and completion-based addition (CBA) to deeply fuse 3D scene representation and text features in an efficient way by gradual region pruning and target completion. Specifically, TGP iteratively sparsifies the 3D scene representation and thus efficiently lets the voxel features interact with text features through cross-attention. To mitigate the effect of pruning on delicate geometric information, CBA adaptively fixes the over-pruned region by voxel completion with negligible computational overhead. Compared with previous single-stage methods, our method achieves top inference speed and surpasses the previous fastest method by 100% in FPS. Our method also achieves state-of-the-art accuracy even compared with two-stage methods, with +1.13 lead of Acc@0.5 on ScanRefer, and +2.6 and +3.2 leads on NR3D and SR3D respectively. The code is available at this https URL.

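[CV-1] Region-Adaptive Sampling for Diffusion Transformers

[Quick read]: This paper addresses the real-time performance limits of diffusion models (DMs). Although DMs excel at generative tasks, their reliance on multiple sequential forward passes significantly limits real-time use. The key solution is a new sampling strategy called RAS, which dynamically assigns different sampling ratios to different regions of the image. Exploiting the flexibility of Diffusion Transformers (DiTs) in handling a variable number of tokens, RAS updates only the regions currently in focus, while the other regions are updated using noise cached from the previous step. This takes advantage of the temporal consistency of the model's focus regions across consecutive steps, substantially improving sampling efficiency while preserving generation quality.

Link: https://arxiv.org/abs/2502.10389
Authors: Ziming Liu,Yifan Yang,Chengruidong Zhang,Yiqi Zhang,Lili Qiu,Yang You,Yuqing Yang
Affiliations: National University of Singapore; Microsoft Research
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: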

Abstract:Diffusion models (DMs) have become the leading choice for generative tasks across diverse domains. However, their reliance on multiple sequential forward passes significantly limits real-time performance. Previous acceleration methods have primarily focused on reducing the number of sampling steps or reusing intermediate results, failing to leverage variations across spatial regions within the image due to the constraints of convolutional U-Net structures. By harnessing the flexibility of Diffusion Transformers (DiTs) in handling variable number of tokens, we introduce RAS, a novel, training-free sampling strategy that dynamically assigns different sampling ratios to regions within an image based on the focus of the DiT model. Our key observation is that during each sampling step, the model concentrates on semantically meaningful regions, and these areas of focus exhibit strong continuity across consecutive steps. Leveraging this insight, RAS updates only the regions currently in focus, while other regions are updated using cached noise from the previous step. The model’s focus is determined based on the output from the preceding step, capitalizing on the temporal consistency we observed. We evaluate RAS on Stable Diffusion 3 and Lumina-Next-T2I, achieving speedups up to 2.36x and 2.51x, respectively, with minimal degradation in generation quality. Additionally, a user study reveals that RAS delivers comparable qualities under human evaluation while achieving a 1.6x speedup. Our approach makes a significant step towards more efficient diffusion transformers, enhancing their potential for real-time applications.
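
The core update rule described above (refresh only the tokens in focus, reuse cached noise elsewhere) can be sketched as follows. This is a simplified illustration, not the RAS implementation: tokens are plain floats, `focus_scores` is assumed to come from the previous step's output, and `predict_noise` is a hypothetical callable standing in for a DiT forward pass over the selected tokens.

```python
def ras_step(tokens, cached_noise, focus_scores, sample_ratio, predict_noise):
    """One region-adaptive step over a flat list of image tokens.

    Only the top `sample_ratio` fraction of tokens (by focus score) is passed
    to the model; all other tokens keep the noise cached from the previous step.
    """
    n = len(tokens)
    k = max(1, int(n * sample_ratio))
    # Indices of the k tokens the model is currently focused on.
    active = sorted(range(n), key=lambda i: focus_scores[i], reverse=True)[:k]
    # The model processes a shorter sequence, which is where the speedup comes from.
    fresh = predict_noise([tokens[i] for i in active])
    noise = list(cached_noise)
    for idx, value in zip(active, fresh):
        noise[idx] = value
    return noise, sorted(active)
```

The returned noise list would then serve as the cache for the next step. Because the model call only sees `k` tokens, the per-step cost shrinks roughly with `sample_ratio`, at the price of stale noise in unfocused regions.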

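[CV-2] Simplifying DINO via Coding Rate Regularization

[Quick read]: This paper addresses the performance bottleneck that complex and unstable training pipelines cause for existing self-supervised models (such as DINO and DINOv2) when pre-training on large-scale unlabeled image data. The key solution is to add an explicit coding rate term to the loss function to avoid representation collapse, which simplifies the original design. The resulting models, SimDINO and SimDINOv2, are more robust to choices of network architecture and hyperparameters and perform better on downstream tasks, a Pareto improvement over the originals.

Link: https://arxiv.org/abs/2502.10385
Authors: Ziyang Wu,Jingyuan Zhang,Druv Pai,XuDong Wang,Chandan Singh,Jianwei Yang,Jianfeng Gao,Yi Ma
Affiliations: unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 17 pages, 5 figures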

Abstract:DINO and DINOv2 are two model families being widely used to learn representations from unlabeled imagery data at large scales. Their learned representations often enable state-of-the-art performance for downstream tasks, such as image classification and segmentation. However, they employ many empirically motivated design choices and their training pipelines are highly complex and unstable – many hyperparameters need to be carefully tuned to ensure that the representations do not collapse – which poses considerable difficulty to improving them or adapting them to new domains. In this work, we posit that we can remove most such-motivated idiosyncrasies in the pre-training pipelines, and only need to add an explicit coding rate term in the loss function to avoid collapse of the representations. As a result, we obtain highly simplified variants of the DINO and DINOv2 which we call SimDINO and SimDINOv2, respectively. Remarkably, these simplified models are more robust to different design choices, such as network architecture and hyperparameters, and they learn even higher-quality representations, measured by performance on downstream tasks, offering a Pareto improvement over the corresponding DINO and DINOv2 models. This work highlights the potential of using simplifying design principles to improve the empirical practice of deep learning.
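
To illustrate why a coding rate term discourages collapse, the sketch below computes the coding rate in the form commonly used in the rate-reduction literature, R(Z) = 1/2 * logdet(I + d / (n * eps^2) * Z^T Z), for n feature vectors of dimension d. This form and the default `eps` are assumptions made for illustration; the abstract does not give the exact scaling or how SimDINO combines the term with the rest of its loss.

```python
import math


def logdet_spd(matrix):
    """Log-determinant of a symmetric positive-definite matrix via Cholesky."""
    n = len(matrix)
    chol = [[0.0] * n for _ in range(n)]
    total = 0.0
    for i in range(n):
        for j in range(i + 1):
            s = matrix[i][j] - sum(chol[i][k] * chol[j][k] for k in range(j))
            if i == j:
                chol[i][j] = math.sqrt(s)
                total += 2.0 * math.log(chol[i][j])
            else:
                chol[i][j] = s / chol[j][j]
    return total


def coding_rate(features, eps=0.5):
    """Coding rate of a list of n feature vectors, each of dimension d."""
    n, d = len(features), len(features[0])
    scale = d / (n * eps ** 2)
    # Gram matrix Z^T Z (d x d), then I + scale * Z^T Z.
    gram = [[sum(row[a] * row[b] for row in features) for b in range(d)] for a in range(d)]
    shifted = [
        [scale * gram[a][b] + (1.0 if a == b else 0.0) for b in range(d)]
        for a in range(d)
    ]
    return 0.5 * logdet_spd(shifted)
```

Features that all point in the same direction yield a lower coding rate than features spread across directions, so maximising this quantity (or subtracting it from a loss) pushes representations away from collapse.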

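[CV-3] ReStyle3D: Scene-Level Appearance Transfer with Semantic Correspondences

[Quick read]: This paper addresses appearance transfer from a single style image to a real-world scene captured in multiple views. The key is combining explicit semantic correspondences with multi-view consistency: a training-free semantic-attention mechanism in a diffusion model applies the style to a single view, and a learned warp-and-refine network, guided by monocular depth and pixel-wise correspondences, lifts the stylization to the other views. This ensures that each object receives semantically matched textures, improving structure preservation, perceptual style similarity, and multi-view coherence.

Link: https://arxiv.org/abs/2502.10377
Authors: Liyuan Zhu,Shengqu Cai,Shengyu Huang,Gordon Wetzstein,Naji Khosravan,Iro Armeni
Affiliations: Stanford University; ETH Zurich; Zillow Group
Categories: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
Comments: Project page: this https URL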

Abstract:We introduce ReStyle3D, a novel framework for scene-level appearance transfer from a single style image to a real-world scene represented by multiple views. The method combines explicit semantic correspondences with multi-view consistency to achieve precise and coherent stylization. Unlike conventional stylization methods that apply a reference style globally, ReStyle3D uses open-vocabulary segmentation to establish dense, instance-level correspondences between the style and real-world images. This ensures that each object is stylized with semantically matched textures. It first transfers the style to a single view using a training-free semantic-attention mechanism in a diffusion model. It then lifts the stylization to additional views via a learned warp-and-refine network guided by monocular depth and pixel-wise correspondences. Experiments show that ReStyle3D consistently outperforms prior methods in structure preservation, perceptual style similarity, and multi-view coherence. User studies further validate its ability to produce photo-realistic, semantically faithful results. Our code, pretrained models, and dataset will be publicly released, to support new applications in interior design, virtual staging, and 3D-consistent stylization.

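[CV-4] Ocular Disease Classification Using CNN with Deep Convolutional Generative Adversarial Network

[Quick read]: This paper addresses overfitting and poor generalization caused by insufficient data when training convolutional neural networks (CNNs) to classify fundus images of ocular disease. The key solution is to synthesize a training dataset with a generative adversarial network (GAN), increasing the amount of data and improving generalization, and then to test the model on original ocular images containing disease. The model reaches accuracies of 78.6% for myopia, 88.6% for glaucoma, and 84.6% for cataract, with an overall classification accuracy of 84.6%.

Link: https://arxiv.org/abs/2502.10334
Authors: Arun Kunwar,Dibakar Raj Pant,Jukka Heikkonen,Rajeev Kanth
Affiliations: Institute of Engineering, Nepal; University of Turku, Turku, Finland; Savonia University of Applied Sciences, Kuopio, Finland
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: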

Abstract:The Convolutional Neural Network (CNN) has shown impressive performance in image classification because of its strong learning capabilities. However, it demands a substantial and balanced dataset for effective training. Otherwise, networks frequently exhibit overfitting and struggle to generalize to new examples. The publicly available datasets of fundus images of ocular disease are insufficient to train a classification model to satisfactory accuracy. We therefore propose a Generative Adversarial Network (GAN) based data generation technique to synthesize a dataset for training a CNN-based classification model, and later use original ocular images containing disease to test the model. When tested on the original ocular images, the model achieves an accuracy of 78.6% for myopia, 88.6% for glaucoma, and 84.6% for cataract, with an overall classification accuracy of 84.6%.

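[CV-5] Object Detection and Tracking

[Quick read]: This paper addresses efficient and accurate object detection, in particular achieving high accuracy together with real-time performance. The key solution is to solve the end-to-end object detection problem entirely with deep learning techniques rather than relying on other computer vision algorithms. The network is trained on the most challenging publicly available dataset, which is used for an annual object detection challenge. Applications that need object detection can benefit from the system's fast and accurate detection.

Link: https://arxiv.org/abs/2502.10310
Authors: Md Pranto,Omar Faruk
Affiliations: unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Computers and Society (cs.CY)
Comments: 10 pages, 5 figures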

Abstract:Efficient and accurate object detection is an important topic in the development of computer vision systems. With the advent of deep learning techniques, the accuracy of object detection has increased significantly. The project aims to integrate a modern object detection technique that achieves high accuracy with real-time performance. A significant obstacle is that many object identification systems rely on other computer vision algorithms, which results in poor and ineffective performance. In this research, we solve the end-to-end object detection problem entirely using deep learning techniques. The network is trained using the most difficult publicly available dataset, which is used for an annual item detection challenge. Applications that need object detection can benefit from the system’s quick and precise detection.

[CV-6] SPIRIT: Short-term Prediction of solar IRradIance for zero-shot Transfer learning using Foundation Models

[Quick read]: This paper addresses accurate solar irradiance forecasting for new photovoltaic farms that lack multi-year historical irradiance data. The key solution is SPIRIT, a method that uses foundation models for zero-shot transfer learning, enabling forecasts at new sites without historical data and outperforming state-of-the-art models by about 70%. Fine-tuning further improves accuracy as more location-specific data becomes available. According to the paper, these improvements are statistically significant.

Link: https://arxiv.org/abs/2502.10307
Authors: Aditya Mishra,Ravindra T,Srinivasan Iyengar,Shivkumar Kalyanaraman,Ponnurangam Kumaraguru
Affiliations: International Institute of Information Technology, Hyderabad; Microsoft Corporation
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Traditional solar forecasting models are based on several years of site-specific historical irradiance data, often spanning five or more years, which are unavailable for newer photovoltaic farms. As renewable energy is highly intermittent, building accurate solar irradiance forecasting systems is essential for efficient grid management and enabling the ongoing proliferation of solar energy, which is crucial to achieve the United Nations’ net zero goals. In this work, we propose SPIRIT, a novel approach leveraging foundation models for solar irradiance forecasting, making it applicable to newer solar installations. Our approach outperforms state-of-the-art models in zero-shot transfer learning by about 70%, enabling effective performance at new locations without relying on any historical data. Further improvements in performance are achieved through fine-tuning, as more location-specific data becomes available. These findings are supported by statistical significance, further validating our approach. SPIRIT represents a pivotal step towards rapid, scalable, and adaptable solar forecasting solutions, advancing the integration of renewable energy into global power systems.

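[CV-7] QMaxViT-Unet: A Query-Based MaxViT-Unet with Edge Enhancement for Scribble-Supervised Segmentation of Medical Images

[Quick read]: This paper addresses the need for large, precisely annotated datasets in medical image segmentation. It proposes QMaxViT-Unet+, a new framework based on weakly supervised learning. The key is replacing the encoder and decoder of the U-Net architecture with Multi-Axis Vision Transformer (MaxViT) blocks, which strengthens the model's ability to learn local and global features, together with a query-based Transformer decoder that refines features and an edge enhancement module that compensates for limited boundary information.

Link: https://arxiv.org/abs/2502.10294
Authors: Thien B. Nguyen-Tat,Hoang-An Vo,Phuoc-Sang Dang
Affiliations: unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: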

Abstract:The deployment of advanced deep learning models for medical image segmentation is often constrained by the requirement for extensively annotated datasets. Weakly-supervised learning, which allows less precise labels, has become a promising solution to this challenge. Building on this approach, we propose QMaxViT-Unet+, a novel framework for scribble-supervised medical image segmentation. This framework is built on the U-Net architecture, with the encoder and decoder replaced by Multi-Axis Vision Transformer (MaxViT) blocks. These blocks enhance the model’s ability to learn local and global features efficiently. Additionally, our approach integrates a query-based Transformer decoder to refine features and an edge enhancement module to compensate for the limited boundary information in the scribble label. We evaluate the proposed QMaxViT-Unet+ on four public datasets focused on cardiac structures, colorectal polyps, and breast cancer: ACDC, MS-CMRSeg, SUN-SEG, and BUSI. Evaluation metrics include the Dice similarity coefficient (DSC) and the 95th percentile of Hausdorff distance (HD95). Experimental results show that QMaxViT-Unet+ achieves 89.1% DSC and 1.316mm HD95 on ACDC, 88.4% DSC and 2.226mm HD95 on MS-CMRSeg, 71.4% DSC and 4.996mm HD95 on SUN-SEG, and 69.4% DSC and 50.122mm HD95 on BUSI. These results demonstrate that our method outperforms existing approaches in terms of accuracy, robustness, and efficiency while remaining competitive with fully-supervised learning approaches. This makes it ideal for medical image analysis, where high-quality annotations are often scarce and require significant effort and expense. The code is available at: this https URL
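
For readers unfamiliar with the Dice similarity coefficient (DSC) reported above, it is twice the overlap between the predicted and reference masks divided by the sum of their sizes. A minimal sketch on flattened binary masks follows; the function name and the convention of returning 1.0 for two empty masks are illustrative choices, not taken from the paper's code.

```python
def dice_coefficient(pred, target):
    """Dice similarity coefficient for two equal-length binary masks."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same number of pixels")
    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    size_sum = sum(1 for p in pred if p) + sum(1 for t in target if t)
    if size_sum == 0:
        # Both masks are empty; treated here as perfect agreement (a convention).
        return 1.0
    return 2.0 * intersection / size_sum
```

A DSC of 89.1%, as reported on ACDC, therefore means the overlap is 89.1% of the average mask size. HD95 complements it by measuring boundary distance rather than area overlap.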

[CV-8] Artificial Intelligence to Assess Dental Findings from Panoramic Radiographs – A Multinational Study

[Quick read]: This paper addresses the difficulty of interpreting dental panoramic radiographs (DPRs) in clinical practice, which arises from overlapping structures and time constraints. The key solution is to develop and evaluate an AI system that combines object detection and semantic segmentation to identify findings per tooth, validate it on multinational datasets, and compare its performance with that of human readers.

Link: https://arxiv.org/abs/2502.10277
Authors: Yin-Chih Chelsea Wang,Tsao-Lun Chen,Shankeeth Vinayahalingam,Tai-Hsien Wu,Chu Wei Chang,Hsuan Hao Chang,Hung-Jen Wei,Mu-Hsiung Chen,Ching-Chang Ko,David Anssari Moin,Bram van Ginneken,Tong Xi,Hsiao-Cheng Tsai,Min-Huey Chen,Tzu-Ming Harry Hsu,Hye Chou
Affiliations: National Taiwan University; National Taiwan University of Science and Technology; Radboud University Medical Center; The Ohio State University; Promaton; Radboud University Medical Center; International Academia of Biomedical Innovation Technology
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Dental panoramic radiographs (DPRs) are widely used in clinical practice for comprehensive oral assessment but present challenges due to overlapping structures and time constraints in interpretation. This study aimed to establish a solid baseline for the AI-automated assessment of findings in DPRs by developing and evaluating an AI system and comparing its performance with that of human readers across multinational data sets. We analyzed 6,669 DPRs from three data sets (the Netherlands, Brazil, and Taiwan), focusing on 8 types of dental findings. The AI system combined object detection and semantic segmentation techniques for per-tooth finding identification. Performance metrics included sensitivity, specificity, and area under the receiver operating characteristic curve (AUC-ROC). AI generalizability was tested across data sets, and performance was compared with human dental practitioners. The AI system demonstrated comparable or superior performance to human readers, particularly +67.9% (95% CI: 54.0%-81.9%; p < .001) sensitivity for identifying periapical radiolucencies and +4.7% (95% CI: 1.4%-8.0%; p = .008) sensitivity for identifying missing teeth. The AI achieved a macro-averaged AUC-ROC of 96.2% (95% CI: 94.6%-97.8%) across 8 findings. AI agreements with the reference were comparable to inter-human agreements in 7 of 8 findings except for caries (p = .024). The AI system demonstrated robust generalization across diverse imaging and demographic settings and processed images 79 times faster (95% CI: 75-82) than human readers. The AI system effectively assessed findings in DPRs, achieving performance on par with or better than human experts while significantly reducing interpretation time. These results highlight the potential for integrating AI into clinical workflows to improve diagnostic efficiency, accuracy, and patient management.
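
As a reminder of how two of the reported metrics are defined, sensitivity is the fraction of true findings that are detected and specificity is the fraction of healthy cases correctly left unflagged. A small sketch on per-tooth binary labels follows; the function name and the 0/1 label encoding are illustrative assumptions, not part of the study's pipeline.

```python
def sensitivity_specificity(y_true, y_pred):
    """Return (sensitivity, specificity) for binary labels, 1 = finding present."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    # Each ratio is undefined (NaN) when its class is absent from y_true.
    sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")
    specificity = tn / (tn + fp) if (tn + fp) else float("nan")
    return sensitivity, specificity
```

The "+67.9% sensitivity" figure in the abstract is a difference between the AI system's sensitivity and that of the human readers on the same finding, not a sensitivity value in itself.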

[CV-9] Probing Perceptual Constancy in Large Vision Language Models

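Quick read: This paper investigates whether Vision-Language Models (VLMs) possess perceptual constancy, the ability to recognize objects stably under changing sensory input such as variations in distance, angle, or lighting. The study evaluates 33 VLMs with 253 experiments across three domains: color constancy, size constancy, and shape constancy. The key to the approach is a task design that covers single-image and video adaptations of classic cognitive tasks together with novel tasks under in-the-wild conditions, giving a comprehensive assessment of how well models recognize object properties under varying conditions. The results show significant variability in VLM performance across the three domains, with shape constancy clearly dissociated from color and size constancy.

Link: https://arxiv.org/abs/2502.10273
Authors: Haoran Sun, Suyang Yu, Yijiang Li, Qingying Gao, Haiyun Lyu, Hokin Deng, Dezhi Luo
Affiliations: Johns Hopkins University; University of California, San Diego; University of North Carolina at Chapel Hill; Carnegie Mellon University; University of Michigan
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract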

Abstract:Perceptual constancy is the ability to maintain stable perceptions of objects despite changes in sensory input, such as variations in distance, angle, or lighting. This ability is crucial for recognizing visual information in a dynamic world, making it essential for Vision-Language Models (VLMs). However, whether VLMs are currently and theoretically capable of mastering this ability remains underexplored. In this study, we evaluated 33 VLMs using 253 experiments across three domains: color, size, and shape constancy. The experiments included single-image and video adaptations of classic cognitive tasks, along with novel tasks in in-the-wild conditions, to evaluate the models’ recognition of object properties under varying conditions. We found significant variability in VLM performance, with model performance in shape constancy clearly dissociated from that in color and size constancy.

[CV-10] MITO: Enabling Non-Line-of-Sight Perception using Millimeter-waves through Real-World Datasets and Simulation Tools

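Quick read: This paper addresses the scarcity of millimeter-wave (mmWave) imaging data and the resulting difficulty of developing non-line-of-sight perception algorithms and models. The key contribution is MITO, a real-world dataset of multi-spectral mmWave images, together with an open-source simulation tool. Using a UR5 robotic arm fitted with two mmWave radars operating at different frequencies and an RGB-D camera, the authors captured over 580 3D mmWave images of more than 76 objects, and they provide real-world mmWave images in line-of-sight and non-line-of-sight settings along with RGB-D images and ground-truth segmentation masks. The simulation tool generates synthetic mmWave images for any 3D triangle mesh and reaches a median F-Score of 94% against real mmWave images. These contributions significantly advance non-line-of-sight perception tasks in computer vision.

Link: https://arxiv.org/abs/2502.10259
Authors: Laura Dodds, Tara Boroushaki, Fadel Adib
Affiliations: Massachusetts Institute of Technology; Cartesian Systems
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Click to view abstract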

Abstract:We present MITO, the first dataset of multi-spectral millimeter-wave (mmWave) images of everyday objects. Unlike visible light, mmWave signals can image through everyday occlusions (e.g., cardboard boxes, fabric, plastic). However, due to the dearth of publicly-available mmWave images and the interdisciplinary challenges in collecting and processing mmWave signals, it remains difficult today for computer vision researchers to develop mmWave-based non-line-of-sight perception algorithms and models. To overcome these challenges, we introduce a real-world dataset and open-source simulation tool for mmWave imaging. The dataset is acquired using a UR5 robotic arm with two mmWave radars operating at different frequencies and an RGB-D camera. Through a signal processing pipeline, we capture and create over 580 real-world 3D mmWave images from over 76 different objects in the YCB dataset, a standard dataset for robotics manipulation. We provide real-world mmWave images in line-of-sight and non-line-of-sight, as well as RGB-D images and ground truth segmentation masks. We also develop an open-source simulation tool that can be used to generate synthetic mmWave images for any 3D triangle mesh, which achieves a median F-Score of 94% when compared to real-world mmWave images. We show the usefulness of this dataset and simulation tool in multiple CV tasks in non-line-of-sight. First, we perform object segmentation for mmWave images using the segment anything model (SAM), and achieve a median precision and recall of 92.6% and 64%. Second, we train a classifier that can recognize objects in non-line-of-sight. It is trained on synthetic images and can classify real-world images with 85% accuracy. We believe MITO will be a valuable resource for computer vision researchers in developing non-line-of-sight perception, similar to how early camera-based datasets shaped the field. 
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Cite as: arXiv:2502.10259 [cs.CV] (or arXiv:2502.10259v1 [cs.CV] for this version), https://doi.org/10.48550/arXiv.2502.10259
Submission history: From Laura Dodds, [v1] Fri, 14 Feb 2025 16:12:14 UTC (7,180 KB)

[CV-11] PromptArtisan: Multi-instruction Image Editing in Single Pass with Complete Attention Control ICASSP2025

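Quick read: This paper addresses the complexity and inefficiency of multi-instruction image editing. The key to the solution is PromptArtisan, which combines a pre-trained InstructPix2Pix model with a novel Complete Attention Control Mechanism (CACM) so that multiple editing instructions are followed precisely in a single pass, including complex mask operations. The approach improves processing efficiency and gives finer control over the editing process.

Link: https://arxiv.org/abs/2502.10258
Authors: Kunal Swami, Raghu Chittersu, Pranav Adlinge, Rajeev Irny, Shashavali Doodekula, Alok Shukla
Affiliations: Samsung Research India Bangalore
Categories: Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC)
Comments: Accepted in ICASSP 2025

Click to view abstract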

Abstract:We present PromptArtisan, a groundbreaking approach to multi-instruction image editing that achieves remarkable results in a single pass, eliminating the need for time-consuming iterative refinement. Our method empowers users to provide multiple editing instructions, each associated with a specific mask within the image. This flexibility allows for complex edits involving mask intersections or overlaps, enabling the realization of intricate and nuanced image transformations. PromptArtisan leverages a pre-trained InstructPix2Pix model in conjunction with a novel Complete Attention Control Mechanism (CACM). This mechanism ensures precise adherence to user instructions, granting fine-grained control over the editing process. Furthermore, our approach is zero-shot, requiring no additional training, and boasts improved processing complexity compared to traditional iterative methods. By seamlessly integrating multi-instruction capabilities, single-pass efficiency, and complete attention control, PromptArtisan unlocks new possibilities for creative and efficient image editing workflows, catering to both novice and expert users alike.

[CV-12] Mapping bathymetry of inland water bodies on the North Slope of Alaska with Landsat using Random Forest

Quick read: This paper addresses the scarcity of depth information for the small waterbodies on the North Slope of Alaska, which provide critical ecosystem services for local populations and wildlife. Detailed depth data are hard to obtain because collecting them is challenging. The key to the approach is a Random Forest Regressor that predicts waterbody depth from multispectral Landsat data. Because in situ measurements are scarce and costly to acquire, the study uses modeled depth predictions from a prior study as synthetic training data, which yields a more diverse training set. The resulting Random Forest model is more robust than models trained directly on in situ data and reaches an overall r^2 of 0.76 on validation.

Link: https://arxiv.org/abs/2502.10214
Authors: Mark L. Carroll (1), Margaret R. Wooten (2 and 3), Claire E. Simpson (4), Caleb S. Spradlin (1 and 5), Melanie J. Frost (1 and 5), Mariana Blanco-Rojas (1), Zachary W. Williams (1 and 5), Jordan A. Caraballo-Vega (1), Christopher S. R. Neigh (2) ((1) NASA Data Science Group, Goddard Space Flight Center, 8800 Greenbelt Rd. mail code 606.3 Greenbelt, MD 20771, USA, (2) NASA Biospheric Sciences Laboratory, Goddard Space Flight Center, 8800 Greenbelt Rd. mail code 618 Greenbelt, MD 20771, USA, (3) Science Systems and Applications Incorporated, 10210 Greenbelt Rd Suite 600 Lanham, MD 20706, USA, (4) Department of Geography, University of Colorado Boulder, Boulder, Colorado, 80309, USA, (5) ASRC Federal Goddard Space Flight Center, 8800 Greenbelt Rd. mail code 606.3 Greenbelt, MD 20771, USA)
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 24 Pages, 6 Figures, 1 Table. This article is a US Government work. Landsat data from the US Geological Survey Earth Explorer system: this https URL . Sonar training measurements: this https URL . Output maps from the Oak Ridge National Laboratory Distributed Active Archive Center (ORNL-DAAC): this https URL

Click to view abstract

Abstract:The North Slope of Alaska is dominated by small waterbodies that provide critical ecosystem services for local populations and wildlife. Detailed information on the depth of the waterbodies is scarce due to the challenges of collecting such information. In this work we have trained a machine learning (Random Forest Regressor) model to predict depth from multispectral Landsat data in waterbodies across the North Slope of Alaska. The greatest challenge in training the model is the scarcity of in situ data, which is expensive and difficult to obtain. We overcame this challenge by using modeled depth predictions from a prior study as synthetic training data to provide a more diverse training data pool for the Random Forest. The final Random Forest model was more robust than models trained directly on the in situ data and, when applied to 208 Landsat 8 scenes from 2016 to 2018, yielded a map with an overall r^2 value of 0.76 on validation. The final map has been made available through the Oak Ridge National Laboratory Distributed Active Archive Center (ORNL-DAAC). This map represents a first-of-its-kind regional assessment of waterbody depth with per-pixel estimates of depth for the entire North Slope of Alaska.
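To make the reported validation score concrete, here is a minimal sketch of how an r^2 value can be computed from held-out depth measurements and model predictions. It assumes r^2 means the coefficient of determination (one minus the ratio of residual to total sum of squares); the abstract does not say which definition was used, and the depth values below are hypothetical.

```python
def r_squared(observed, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot


# Hypothetical validation depths in meters (not taken from the paper).
observed_depth_m = [0.5, 1.0, 1.5, 2.0, 3.0]
predicted_depth_m = [0.7, 0.9, 1.6, 1.8, 2.6]

score = r_squared(observed_depth_m, predicted_depth_m)
print(f"r^2 = {score:.3f}")
```

A value of 1.0 would mean the predictions match the observations exactly, while 0.0 corresponds to predicting the mean observed depth everywhere.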

[CV-13] Exploring the Camera Bias of Person Re-identification ICLR2025

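Quick read: This paper addresses camera bias in person re-identification (ReID) models, particularly on unseen domains and under data distribution shifts. The key is to revisit feature normalization on embedding vectors as a way to reduce this bias. The study shows that this simple method not only reduces camera bias effectively but also applies to finer-grained bias factors such as low-level image properties and body angle, and its generality is validated across a range of models and benchmarks. The paper also examines the inherent risk of camera bias in unsupervised learning of ReID models and proposes simple training strategies to mitigate it.

Link: https://arxiv.org/abs/2502.10195
Authors: Myungseo Song, Jin-Woo Park, Jong-Seok Lee
Affiliations: mAy-I Inc.; Yonsei University
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: ICLR 2025 (Spotlight)

Click to view abstract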

Abstract:We empirically investigate the camera bias of person re-identification (ReID) models. Previously, camera-aware methods have been proposed to address this issue, but they are largely confined to training domains of the models. We measure the camera bias of ReID models on unseen domains and reveal that camera bias becomes more pronounced under data distribution shifts. As a debiasing method for unseen domain data, we revisit feature normalization on embedding vectors. While the normalization has been used as a straightforward solution, its underlying causes and broader applicability remain unexplored. We analyze why this simple method is effective at reducing bias and show that it can be applied to detailed bias factors such as low-level image properties and body angle. Furthermore, we validate its generalizability across various models and benchmarks, highlighting its potential as a simple yet effective test-time postprocessing method for ReID. In addition, we explore the inherent risk of camera bias in unsupervised learning of ReID models. The unsupervised models remain highly biased towards camera labels even for seen domain data, indicating substantial room for improvement. Based on observations of the negative impact of camera-biased pseudo labels on training, we suggest simple training strategies to mitigate the bias. By applying these strategies to existing unsupervised learning algorithms, we show that significant performance improvements can be achieved with minor modifications.
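The abstract does not spell out the exact form of the feature normalization. One plausible reading, assumed here purely for illustration, is a test-time step that standardizes each embedding dimension using the statistics of the camera that produced it. The function name `camera_normalize` and the toy embeddings are hypothetical.

```python
from collections import defaultdict
from statistics import fmean, pstdev


def camera_normalize(embeddings, camera_ids, eps=1e-6):
    """Standardize every embedding dimension with its own camera's mean and std.

    Assumption: per-camera standardization stands in for the paper's
    normalization, which the abstract does not define precisely.
    """
    indices_by_camera = defaultdict(list)
    for idx, cam in enumerate(camera_ids):
        indices_by_camera[cam].append(idx)

    normalized = [None] * len(embeddings)
    for idxs in indices_by_camera.values():
        # Transpose so that each tuple holds one dimension across the camera's samples.
        dims = list(zip(*(embeddings[i] for i in idxs)))
        means = [fmean(d) for d in dims]
        stds = [pstdev(d) for d in dims]
        for i in idxs:
            normalized[i] = [
                (x - m) / (s + eps)
                for x, m, s in zip(embeddings[i], means, stds)
            ]
    return normalized


# Toy data: camera "cam_b" adds a constant offset of 10 to every dimension.
embeddings = [[1.0, 2.0], [3.0, 4.0], [11.0, 12.0], [13.0, 14.0]]
camera_ids = ["cam_a", "cam_a", "cam_b", "cam_b"]

normalized = camera_normalize(embeddings, camera_ids)
print(normalized)  # the camera-specific offset no longer separates the two groups
```

Because the step only needs embeddings and camera labels, it can be applied after inference without retraining, which matches the abstract's framing of normalization as test-time postprocessing.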

[CV-14] MonoForce: Learnable Image-conditioned Physics Engine

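Quick read: This paper addresses trajectory prediction for robots on rough off-road terrain. The key is a hybrid model built around a physics-aware neural-symbolic layer. The model remains end-to-end differentiable, so it can learn from large-scale data, while a black-box component predicts robot-terrain interaction forces and a differentiable physics engine computes the robot's trajectory from the forces at the points of contact. Because the architecture embeds substantial geometric and physical priors, it reduces the sim-to-real gap and mitigates out-of-distribution sensitivity.

Link: https://arxiv.org/abs/2502.10156
Authors: Ruslan Agishev, Karel Zimmermann
Affiliations: The VRAS group, Faculty of Electrical Engineering, Czech Technical University in Prague
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Comments: Submitted to IEEE Transactions on Robotics (T-RO), 2025. Code: this https URL

Click to view abstract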

Abstract:We propose a novel model for the prediction of robot trajectories on rough offroad terrain from the onboard camera images. This model enforces the laws of classical mechanics through a physics-aware neural symbolic layer while preserving the ability to learn from large-scale data as it is end-to-end differentiable. The proposed hybrid model integrates a black-box component that predicts robot-terrain interaction forces with a neural-symbolic layer. This layer includes a differentiable physics engine that computes the robot’s trajectory by querying these forces at the points of contact with the terrain. As the proposed architecture comprises substantial geometrical and physics priors, the resulting model can also be seen as a learnable physics engine conditioned on real images that delivers 10^4 trajectories per second. We argue and empirically demonstrate that this architecture reduces the sim-to-real gap and mitigates out-of-distribution sensitivity. The differentiability, in conjunction with the rapid simulation speed, makes the model well-suited for various applications including model predictive control, trajectory shooting, supervised and reinforcement learning or SLAM. The codes and data are publicly available.
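The split between a force predictor and a physics layer that integrates a trajectory can be illustrated with a deliberately tiny toy: a single contact point moving vertically over flat ground. This is not the authors' engine. The spring-damper `terrain_force` function is a hand-written, hypothetical stand-in for the learned black-box force predictor, and all constants are arbitrary.

```python
def terrain_force(z, vz, ground_z=0.0, stiffness=1000.0, damping=50.0):
    """Hypothetical stand-in for the learned force predictor.

    Returns an upward normal force that is active only while the contact
    point is below the ground surface; the ground can push but never pull.
    """
    penetration = ground_z - z
    if penetration <= 0.0:
        return 0.0
    return max(0.0, stiffness * penetration - damping * vz)


def rollout(z0, vz0=0.0, mass=1.0, gravity=9.81, dt=1e-3, steps=5000):
    """Integrate the vertical motion by querying the force model at each step."""
    z, vz = z0, vz0
    trajectory = [z]
    for _ in range(steps):
        net_force = terrain_force(z, vz) - mass * gravity
        vz += dt * net_force / mass  # semi-implicit Euler: update velocity first
        z += dt * vz                 # then advance the position with the new velocity
        trajectory.append(z)
    return trajectory


# Drop the contact point from 10 cm above the ground and simulate 5 seconds.
heights = rollout(z0=0.1)
print(f"final height: {heights[-1]:.4f} m")
```

Every operation in the loop is a simple arithmetic update, which is what makes this kind of rollout straightforward to differentiate when it is written in an autodiff framework instead of plain Python.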

[CV-15] Interpretable Concept-based Deep Learning Framework for Multimodal Human Behavior Modeling

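Quick read: This paper addresses the trade-off between interpretability and performance in Affective Computing (AC), as well as the failure of existing methods to provide meaningful, domain-specific explanations. The key contribution is the Attention-Guided Concept Model (AGCM), which provides learnable conceptual explanations by identifying which concepts lead to a prediction and where they are observed. AGCM extends to any spatial and temporal signals through multimodal concept alignment and co-learning, giving stakeholders deeper insight into the model's decision-making process.

Link: https://arxiv.org/abs/2502.10145
Authors: Xinyu Li, Marwa Mahmoud
Affiliations: School of Computing Science, University of Glasgow, United Kingdom
Categories: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
Comments:

Click to view abstract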

Abstract:In the contemporary era of intelligent connectivity, Affective Computing (AC), which enables systems to recognize, interpret, and respond to human behavior states, has become an integrated part of many AI systems. As one of the most critical components of responsible AI and trustworthiness in all human-centered systems, explainability has been a major concern in AC. Particularly, the recently released EU General Data Protection Regulation requires any high-risk AI systems to be sufficiently interpretable, including biometric-based systems and emotion recognition systems widely used in the affective computing field. Existing explainable methods often compromise between interpretability and performance. Most of them focus only on highlighting key network parameters without offering meaningful, domain-specific explanations to the stakeholders. Additionally, they also face challenges in effectively co-learning and explaining insights from multimodal data sources. To address these limitations, we propose a novel and generalizable framework, namely the Attention-Guided Concept Model (AGCM), which provides learnable conceptual explanations by identifying which concepts lead to the predictions and where they are observed. AGCM is extendable to any spatial and temporal signals through multimodal concept alignment and co-learning, empowering stakeholders with deeper insights into the model’s decision-making process. We validate the efficiency of AGCM on well-established Facial Expression Recognition benchmark datasets while also demonstrating its generalizability on more complex real-world human behavior understanding applications.

[CV-16] Leveraging V2X for Collaborative HD Maps Construction Using Scene Graph Generation

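Quick read: This paper addresses the problem of updating high-definition (HD) maps in real time. Traditional methods rely on dedicated mapping vehicles, which are costly and cannot capture infrastructure changes promptly. The key contribution is the HDMapLaneNet framework, which uses vehicle-to-everything (V2X) communication and scene graph generation: lane centerlines are extracted from front-facing camera images, represented as graphs, and transmitted over V2X to the cloud for global aggregation. This enables collaborative, real-time construction of HD maps.

Link: https://arxiv.org/abs/2502.10127
Authors: Gamal Elghazaly, Raphael Frank
Affiliations: Interdisciplinary Center for Security, Reliability and Trust (SnT), University of Luxembourg
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Click to view abstract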

Abstract:High-Definition (HD) maps play a crucial role in autonomous vehicle navigation, complementing onboard perception sensors for improved accuracy and safety. Traditional HD map generation relies on dedicated mapping vehicles, which are costly and fail to capture real-time infrastructure changes. This paper presents HDMapLaneNet, a novel framework leveraging V2X communication and Scene Graph Generation to collaboratively construct a localized geometric layer of HD maps. The approach extracts lane centerlines from front-facing camera images, represents them as graphs, and transmits the data for global aggregation to the cloud via V2X. Preliminary results on the nuScenes dataset demonstrate superior association prediction performance compared to a state-of-the-art method.

[CV-17] Compress image to patches for Vision Transformer

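Quick read: This paper addresses the sharp rise in computational cost that deep Vision Transformer (ViT) models incur on high-resolution input images. The key is a module called CI2P, which compresses the image with the CompressAI encoder and then generates a sequence of patches through a series of convolutions, replacing the Patch Embedding component of the ViT. The approach significantly reduces the computational cost of ViT and also improves accuracy by introducing inductive bias. On the Animals-10 dataset, CI2P-ViT reaches 92.37% accuracy while reducing computation by 63.35% and doubling training speed.

Link: https://arxiv.org/abs/2502.10120
Authors: Xinfeng Zhao, Yaoru Sun
Affiliations: Tongji University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 15 pages, 5 figures

Click to view abstract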

Abstract:The Vision Transformer (ViT) has made significant strides in the field of computer vision. However, as the depth of the model and the resolution of the input images increase, the computational cost associated with training and running ViT models has surged […]. This paper proposes a hybrid model based on CNN and Vision Transformer, named CI2P-ViT. The model incorporates a module called CI2P, which utilizes the CompressAI encoder to compress images and subsequently generates a sequence of patches through a series of convolutions. CI2P can replace the Patch Embedding component in the ViT model, enabling seamless integration into existing ViT […]. Compared to ViT-B/16, CI2P-ViT has the number of patches input to the self-attention layer reduced to a quarter of the […]. This design not only significantly reduces the computational cost of the ViT model but also effectively enhances the model’s accuracy by introducing the inductive bias properties of […]. […] ViT model’s precision is markedly […]. When trained from the ground up on the Animals-10 dataset, CI2P-ViT achieved an accuracy rate of 92.37%, representing a 3.3% improvement over the ViT-B/16 baseline. Additionally, the model’s computational cost, measured in floating-point operations (FLOPs), was reduced by 63.35%, and it exhibited a 2-fold increase in training speed on identical hardware configurations.
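A quick back-of-the-envelope calculation shows why cutting the patch count to a quarter matters for self-attention. The sketch assumes a 224x224 input (the abstract does not state the resolution) and ignores the class token; `num_patches` is a hypothetical helper.

```python
def num_patches(image_size: int, patch_size: int) -> int:
    """Number of non-overlapping square patches in a square image."""
    return (image_size // patch_size) ** 2


# Assumption: 224x224 input, the common ViT-B/16 setting.
vit_patches = num_patches(224, 16)   # 14 * 14 = 196
ci2p_patches = vit_patches // 4      # a quarter of the patches = 49

# Self-attention compares every token with every other token, so the size of
# the attention matrix grows with the square of the sequence length.
attention_ratio = (vit_patches ** 2) / (ci2p_patches ** 2)

print(vit_patches, ci2p_patches, attention_ratio)  # 196 49 16.0
```

The 16x figure applies only to the token-pair term of attention. The reported overall FLOPs reduction of 63.35% is smaller because other parts of the network, such as the MLP blocks and the CI2P module itself, scale differently.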

[CV-18] Image Embedding Sampling Method for Diverse Captioning

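Quick read: This paper addresses the fact that large vision-language models (VLMs) improve image captioning quality at the cost of higher computational complexity, which limits their use in resource-constrained applications. Small VLMs are computationally efficient but tend to give high-level scene descriptions that overlook image details. The key contribution is a training-free framework that uses BLIP as the backbone and applies structured segmentation to produce hierarchical representations, strengthening attention to distinct image regions and thereby improving caption diversity and informativeness while keeping the model small. With this method, small VLMs reach performance comparable to larger models in image-caption alignment, semantic integrity, and diversity.

Link: https://arxiv.org/abs/2502.10118
Authors: Sania Waheed, Na Min An
Affiliations: University of Southampton; KAIST
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 15 pages, 5 figures, 6 tables

Click to view abstract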

Abstract:Image Captioning for state-of-the-art VLMs has significantly improved over time; however, this comes at the cost of increased computational complexity, making them less accessible for resource-constrained applications such as mobile devices and assistive technologies. Alternatively, smaller VLMs prioritize high-level scene descriptions, overlooking finer details that contribute to a richer understanding of an image. In this paper, we introduce a training-free framework that enhances caption diversity and informativeness by explicitly attending to distinct image regions using a comparably small VLM, BLIP, as the backbone. Our approach leverages structured segmentation to produce hierarchical representations that capture both global and localized semantics. Without requiring additional model training, we demonstrate that our method allows smaller VLMs to achieve performance comparable to larger models in terms of image-caption alignment, semantic integrity, and diversity. We evaluate our framework on MSCOCO, Flickr30k, and Nocaps test datasets, achieving a Div-2 score of 0.735, 0.750, and 0.748 for each dataset respectively, while maintaining strong image-caption relevancy and semantic integrity with the human-annotated captions.
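The Div-2 scores quoted above measure how varied a set of captions is. The sketch below assumes one common definition, the ratio of distinct bigrams to the total number of bigrams across all captions generated for an image. Definitions vary between papers (some divide by the total word count instead), so this may differ from the exact metric the authors used. The captions and the `div_n` name are hypothetical.

```python
def div_n(captions, n=2):
    """Ratio of distinct n-grams to total n-grams over a set of captions."""
    ngrams = []
    for caption in captions:
        tokens = caption.lower().split()
        ngrams.extend(
            tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)
        )
    if not ngrams:
        return 0.0
    return len(set(ngrams)) / len(ngrams)


# Hypothetical captions for a single image.
captions = [
    "a dog runs on grass",
    "a dog sits on grass",
    "two birds fly over water",
]

print(f"Div-2 = {div_n(captions):.3f}")  # 10 distinct bigrams out of 12
```

A score near 1.0 means the captions rarely repeat phrases, while a score near 0 means they are close to copies of one another.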

[CV-19] DiSciPLE: Learning Interpretable Programs for Scientific Visual Discovery

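Quick read: This paper addresses the problem of understanding the mechanisms behind predictions on visual data in scientific workflows. As the volume of observation data grows, accurate predictions alone are not enough; the mechanisms underlying them must also be understood. To that end, the paper proposes an automatic way of obtaining interpretable-by-design models by learning programs that interleave neural networks. The key contribution is DiSciPLE (Discovering Scientific Programs using LLMs and Evolution), an evolutionary algorithm that draws on the common sense and prior knowledge of large language models to create Python programs that explain visual data. The paper also introduces two improvements, a program critic and a program simplifier, that help the method synthesize better programs.

Link: https://arxiv.org/abs/2502.10060
Authors: Utkarsh Mall, Cheng Perng Phoo, Mia Chiquier, Bharath Hariharan, Kavita Bala, Carl Vondrick
Affiliations: Columbia University; Cornell University
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:

Click to view abstract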

Abstract:Visual data is used in numerous scientific workflows ranging from remote sensing to ecology. As the amount of observation data increases, the challenge is not just to make accurate predictions but also to understand the underlying mechanisms for those predictions. Good interpretation is important in scientific workflows, as it allows for better decision-making by providing insights into the data. This paper introduces an automatic way of obtaining such interpretable-by-design models, by learning programs that interleave neural networks. We propose DiSciPLE (Discovering Scientific Programs using LLMs and Evolution), an evolutionary algorithm that leverages common sense and prior knowledge of large language models (LLMs) to create Python programs explaining visual data. Additionally, we propose two improvements, a program critic and a program simplifier, that further improve our method's ability to synthesize good programs. On three different real-world problems, DiSciPLE learns state-of-the-art programs on novel tasks with no prior literature. For example, we can learn programs with 35% lower error than the closest non-interpretable baseline for population density estimation.

[CV-20] RealCam-I2V: Real-World Image-to-Video Generation with Interactive Complex Camera Control

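Quick read: This paper addresses the precision and controllability challenges that camera-trajectory-guided image-to-video generation faces in practice, especially when users struggle to provide precise camera parameters. The key contribution is the RealCam-I2V framework, which integrates monocular metric depth estimation to reconstruct the 3D scene in a preprocessing step. This allows camera parameters to be converted from relative to absolute values, ensuring consistency and compatibility across different real-world images. RealCam-I2V also offers an intuitive interface in which users draw camera trajectories precisely within the 3D scene, and it uses scene-constrained noise shaping to improve camera control precision and scene consistency.

Link: https://arxiv.org/abs/2502.10059
Authors: Teng Li, Guangcong Zheng, Rui Jiang, Shuigenzhan, Tao Wu, Yehao Lu, Yining Lin, Xi Li
Affiliations: College of Computer Science & Technology, Zhejiang University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Click to view abstract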

Abstract:Recent advancements in camera-trajectory-guided image-to-video generation offer higher precision and better support for complex camera control compared to text-based approaches. However, they also introduce significant usability challenges, as users often struggle to provide precise camera parameters when working with arbitrary real-world images without knowledge of their depth or scene scale. To address these real-world application issues, we propose RealCam-I2V, a novel diffusion-based video generation framework that integrates monocular metric depth estimation to establish 3D scene reconstruction in a preprocessing step. During training, the reconstructed 3D scene enables scaling camera parameters from relative to absolute values, ensuring compatibility and scale consistency across diverse real-world images. In inference, RealCam-I2V offers an intuitive interface where users can precisely draw camera trajectories by dragging within the 3D scene. To further enhance precise camera control and scene consistency, we propose scene-constrained noise shaping, which shapes high-level noise and also allows the framework to maintain dynamic, coherent video generation in lower noise stages. RealCam-I2V achieves significant improvements in controllability and video quality on the RealEstate10K and out-of-domain images. We further enable applications like camera-controlled looping video generation and generative frame interpolation. We will release our absolute-scale annotation, codes, and all checkpoints. Please see dynamic results in this https URL.

[CV-21] Towards Polyp Counting In Full-Procedure Colonoscopy Videos

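Quick read: This paper addresses the identification, tracking, and re-association (ReID) of multiple polyps for automated colonoscopy reporting, with the goal of precise polyp counting and automated computation of key quality metrics such as Adenoma Detection Rate (ADR) and Polyps Per Colonoscopy (PPC). The task is difficult because polyps vary in appearance, frequently leave the field of view, and are often occluded. The key to the approach is to use the REAL-Colon dataset to define tasks, data splits, and evaluation metrics, to re-implement SimCLR-based methods for learning representations of polyp tracklets, and to adapt them to polyp counting. An Affinity Propagation-based clustering method then further improves polyp re-association on top of these learned representations, which in turn improves counting accuracy.

Link: https://arxiv.org/abs/2502.10054
Authors: Luca Parolari, Andrea Cherubini, Lamberto Ballan, Carlo Biffi
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to ISBI 2025

Click to view abstract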

Abstract:Automated colonoscopy reporting holds great potential for enhancing quality control and improving cost-effectiveness of colonoscopy procedures. A major challenge lies in the automated identification, tracking, and re-association (ReID) of polyp tracklets across full-procedure colonoscopy videos. This is essential for precise polyp counting and enables automated computation of key quality metrics, such as Adenoma Detection Rate (ADR) and Polyps Per Colonoscopy (PPC). However, polyp ReID is challenging due to variations in polyp appearance, frequent disappearance from the field of view, and occlusions. In this work, we leverage the REAL-Colon dataset, the first open-access dataset providing full-procedure videos, to define tasks, data splits and metrics for the problem of automatically counting polyps in full-procedure videos, establishing an open-access framework. We re-implement previously proposed SimCLR-based methods for learning representations of polyp tracklets, both single-frame and multi-view, and adapt them to the polyp counting task. We then propose an Affinity Propagation-based clustering method to further improve ReID based on these learned representations, ultimately enhancing polyp counting. Our approach achieves state-of-the-art performance, with a polyp fragmentation rate of 6.30 and a false positive rate (FPR) below 5% on the REAL-Colon dataset. We release code at this https URL.

[CV-22] ViRAC: A Vision-Reasoning Agent Head Movement Control Framework in Arbitrary Virtual Environments

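Quick read: This paper addresses the generation of natural head rotations for virtual agents in interactive environments, a key challenge for believable agent behavior. Existing methods rely mainly on data-driven or saliency-based approaches, which underperform in diverse settings and struggle to capture deeper cognitive factors such as risk assessment, information seeking, and contextual prioritization. The paper proposes ViRAC, a Vision-Reasoning Agent Head Movement Control framework that draws on the common-sense knowledge and reasoning capabilities of large-scale models, including Vision-Language Models (VLMs) and Large Language Models (LLMs). Instead of explicitly modeling every cognitive mechanism, it relies on the behavioral biases and patterns these models have internalized, thereby emulating human-like perceptual processes. The key is to exploit what these models have learned during training to produce more natural, context-aware head rotations.

Link: https://arxiv.org/abs/2502.10046
Authors: Juyeong Hwang, Seong-Eun Hong, Hyeongyeop Kang
Affiliations: Kyung Hee University; Korea University
Categories: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Click to view abstract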

Abstract:Creating lifelike virtual agents capable of interacting with their environments is a longstanding goal in computer graphics. This paper addresses the challenge of generating natural head rotations, a critical aspect of believable agent behavior for visual information gathering and dynamic responses to environmental cues. Although earlier methods have made significant strides, many rely on data-driven or saliency-based approaches, which often underperform in diverse settings and fail to capture deeper cognitive factors such as risk assessment, information seeking, and contextual prioritization. Consequently, generated behaviors can appear rigid or overlook critical scene elements, thereby diminishing the sense of realism. In this paper, we propose ViRAC, a Vision-Reasoning Agent Head Movement Control framework, which exploits the common-sense knowledge and reasoning capabilities of large-scale models, including Vision-Language Models (VLMs) and Large-Language Models (LLMs). Rather than explicitly modeling every cognitive mechanism, ViRAC leverages the biases and patterns internalized by these models from extensive training, thus emulating human-like perceptual processes without hand-tuned heuristics. Experimental results in multiple scenarios reveal that ViRAC produces more natural and context-aware head rotations than recent state-of-the-art techniques. Quantitative evaluations show a closer alignment with real human head-movement data, while user studies confirm improved realism and cognitive plausibility.

[CV-23] ManiTrend: Bridging Future Generation and Action Prediction with 3D Flow for Robotic Manipulation

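Quick read: This paper tackles language-conditioned manipulation, in which a robot must achieve task goals expressed at a high level of abstraction in natural language. The key contribution is ManiTrend, a unified framework that uses a causal transformer to model the dynamics of 3D particles, visual observations, and manipulation actions. Features for 3D flow prediction serve as additional conditions for future image generation and action prediction, which eases the complexity of pixel-wise spatiotemporal modeling and provides seamless action guidance. 3D flow can also substitute for missing or heterogeneous action labels during large-scale pretraining on cross-embodiment demonstrations.

Link: https://arxiv.org/abs/2502.10028
Authors: Yuxin He, Qiang Nie
Affiliations: The Hong Kong University of Science and Technology
Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments: 15 pages, 9 figures

Click to view abstract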

Abstract:Language-conditioned manipulation is a vital but challenging robotic task due to the high-level abstraction of language. To address this, researchers have sought improved goal representations derived from natural language. In this paper, we highlight 3D flow - representing the motion trend of 3D particles within a scene - as an effective bridge between language-based future image generation and fine-grained action prediction. To this end, we develop ManiTrend, a unified framework that models the dynamics of 3D particles, vision observations and manipulation actions with a causal transformer. Within this framework, features for 3D flow prediction serve as additional conditions for future image generation and action prediction, alleviating the complexity of pixel-wise spatiotemporal modeling and providing seamless action guidance. Furthermore, 3D flow can substitute missing or heterogeneous action labels during large-scale pretraining on cross-embodiment demonstrations. Experiments on two comprehensive benchmarks demonstrate that our method achieves state-of-the-art performance with high efficiency. Our code and model checkpoints will be available upon acceptance.

[CV-24] Navigating Label Ambiguity for Facial Expression Recognition in the Wild AAAI2025

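Quick read: This paper addresses label ambiguity and class imbalance in facial expression recognition (FER). The key contribution is a new framework called Navigating Label Ambiguity (NLA), which has two main components: Noise-aware Adaptive Weighting (NAW) and consistency regularization. NAW dynamically raises the importance of ambiguous samples and lowers the importance of noisy ones, reducing the model's bias toward majority classes, while consistency regularization keeps the latent distributions consistent. As a result, the model can focus in the later stages of training on challenging ambiguous samples, which mostly belong to minority classes.

Link: https://arxiv.org/abs/2502.09993
Authors: JunGyu Lee, Yeji Choi, Haksub Kim, Ig-Jae Kim, Gi Pyo Nam
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by AAAI2025

Click to view abstract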

Abstract:Facial expression recognition (FER) remains a challenging task due to label ambiguity caused by the subjective nature of facial expressions and noisy samples. Additionally, class imbalance, which is common in real-world datasets, further complicates FER. Although many studies have shown impressive improvements, they typically address only one of these issues, leading to suboptimal results. To tackle both challenges simultaneously, we propose a novel framework called Navigating Label Ambiguity (NLA), which is robust under real-world conditions. The motivation behind NLA is that dynamically estimating and emphasizing ambiguous samples at each iteration helps mitigate noise and class imbalance by reducing the model’s bias toward majority classes. To achieve this, NLA consists of two main components: Noise-aware Adaptive Weighting (NAW) and consistency regularization. Specifically, NAW adaptively assigns higher importance to ambiguous samples and lower importance to noisy ones, based on the correlation between the intermediate prediction scores for the ground truth and the nearest negative. Moreover, we incorporate a regularization term to ensure consistent latent distributions. Consequently, NLA enables the model to progressively focus on more challenging ambiguous samples, which primarily belong to the minority class, in the later stages of training. Extensive experiments demonstrate that NLA outperforms existing methods in both overall and mean accuracy, confirming its robustness against noise and class imbalance. To the best of our knowledge, this is the first framework to address both problems simultaneously.

[CV-25] V2V-LLM : Vehicle-to-Vehicle Cooperative Autonomous Driving with Multi-Modal Large Language Models

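[Quick read]: This paper addresses the reliability problem of current autonomous vehicles, which rely mainly on their own sensors to understand the surrounding scene and plan future trajectories, particularly when those sensors malfunction or are occluded. The key to the solution is integrating Large Language Models (LLMs) into cooperative autonomous driving, together with the proposed Vehicle-to-Vehicle Question-Answering (V2V-QA) dataset and benchmark. The V2V-LLM method uses an LLM to fuse perception information from multiple connected autonomous vehicles (CAVs) and to answer driving-related questions covering grounding, notable object identification, and planning. Experiments show that V2V-LLM is a promising unified model architecture for cooperative autonomous driving tasks and outperforms baselines that use other fusion approaches. The work also opens a new research direction for improving the safety of future autonomous driving systems.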

Link: https://arxiv.org/abs/2502.09980
Authors: Hsu-kuang Chiu, Ryo Hachiuma, Chien-Yi Wang, Stephen F. Smith, Yu-Chiang Frank Wang, Min-Hung Chen
Affiliations: NVIDIA; Carnegie Mellon University
Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments:

Abstract:Current autonomous driving vehicles rely mainly on their individual sensors to understand surrounding scenes and plan for future trajectories, which can be unreliable when the sensors are malfunctioning or occluded. To address this problem, cooperative perception methods via vehicle-to-vehicle (V2V) communication have been proposed, but they have tended to focus on detection and tracking. How those approaches contribute to overall cooperative planning performance is still under-explored. Inspired by recent progress using Large Language Models (LLMs) to build autonomous driving systems, we propose a novel problem setting that integrates an LLM into cooperative autonomous driving, with the proposed Vehicle-to-Vehicle Question-Answering (V2V-QA) dataset and benchmark. We also propose our baseline method Vehicle-to-Vehicle Large Language Model (V2V-LLM), which uses an LLM to fuse perception information from multiple connected autonomous vehicles (CAVs) and answer driving-related questions: grounding, notable object identification, and planning. Experimental results show that our proposed V2V-LLM can be a promising unified model architecture for performing various tasks in cooperative autonomous driving, and outperforms other baseline methods that use different fusion approaches. Our work also creates a new research direction that can improve the safety of future autonomous driving systems. Our project website: this https URL .

[CV-26] Conditional Latent Coding with Learnable Synthesized Reference for Deep Image Compression

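[Quick read]: This paper studies how to synthesize a dynamic reference from an external dictionary to perform conditional coding of the input image in the latent domain, and how to learn the conditional latent synthesis and coding modules end to end. The key is to build a universal image feature dictionary with a multi-stage approach that includes modified spatial pyramid pooling, dimension reduction, and multi-scale feature clustering. For each input image, the model learns to synthesize a conditioning latent by selecting and synthesizing relevant features from the dictionary, which significantly strengthens its ability to capture and exploit image source correlation. Specifically, it uses a correlation-based feature matching and alignment strategy comprising a Conditional Latent Matching (CLM) module and a Conditional Latent Synthesis (CLS) module. Experiments show that the proposed Conditional Latent Coding (CLC) method improves coding performance by up to 1.2 dB with a very small overhead of approximately 0.5% bits per pixel.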

Link: https://arxiv.org/abs/2502.09971
Authors: Siqi Wu, Yinda Chen, Dong Liu, Zhihai He
Affiliations: Southern University of Science and Technology
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:In this paper, we study how to synthesize a dynamic reference from an external dictionary to perform conditional coding of the input image in the latent domain and how to learn the conditional latent synthesis and coding modules in an end-to-end manner. Our approach begins by constructing a universal image feature dictionary using a multi-stage approach involving modified spatial pyramid pooling, dimension reduction, and multi-scale feature clustering. For each input image, we learn to synthesize a conditioning latent by selecting and synthesizing relevant features from the dictionary, which significantly enhances the model’s capability in capturing and exploring image source correlation. This conditional latent synthesis involves a correlation-based feature matching and alignment strategy, comprising a Conditional Latent Matching (CLM) module and a Conditional Latent Synthesis (CLS) module. The synthesized latent is then used to guide the encoding process, allowing for more efficient compression by exploiting the correlation between the input image and the reference dictionary. According to our theoretical analysis, the proposed conditional latent coding (CLC) method is robust to perturbations in the external dictionary samples and the selected conditioning latent, with an error bound that scales logarithmically with the dictionary size, ensuring stability even with large and diverse dictionaries. Experimental results on benchmark datasets show that our new method improves the coding performance by a large margin (up to 1.2 dB) with a very small overhead of approximately 0.5% bits per pixel. Our code is publicly available at this https URL.
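
The idea of selecting relevant dictionary features by correlation can be illustrated with a minimal sketch. This is not the paper's CLM or CLS module: the cosine-similarity score, the top-k averaging, and the names `cosine`, `top_k_matches`, and `synthesize_reference` are all assumptions chosen only to show the general lookup pattern.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu > 0 and nv > 0 else 0.0

def top_k_matches(query, dictionary, k):
    """Return the indices of the k dictionary entries most similar to the query."""
    ranked = sorted(range(len(dictionary)),
                    key=lambda i: cosine(query, dictionary[i]),
                    reverse=True)
    return ranked[:k]

def synthesize_reference(query, dictionary, k):
    """Average the k best-matching entries into one reference vector (hypothetical rule; k >= 1)."""
    idx = top_k_matches(query, dictionary, k)
    return [sum(dictionary[i][d] for i in idx) / len(idx)
            for d in range(len(query))]

# Toy dictionary of 2-D features; a real dictionary would hold learned image features.
dictionary = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]]
reference = synthesize_reference([1.0, 0.1], dictionary, k=2)
```

In a learned codec the averaging step would presumably be replaced by a trained synthesis network; the sketch only shows why a larger dictionary makes a close match more likely.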

[CV-27] VicKAM: Visual Conceptual Knowledge Guided Action Map for Weakly Supervised Group Activity Recognition

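[Quick read]: This paper addresses the problem that weakly supervised group activity recognition methods, which rely on object detectors or attention mechanisms to capture key areas automatically, overlook the semantic information of those areas, which may hurt recognition performance. The key to the solution is a new framework named Visual Conceptual Knowledge Guided Action Map (VicKAM). VicKAM generates individual action prototypes from the training set as visual conceptual knowledge that bridges action semantics and visual representations, and, guided by this knowledge, produces action maps that indicate the likelihood of each action occurring at each location. It then incorporates statistical information related to group activities to connect the action maps with specific group activities, improving group activity recognition.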

Link: https://arxiv.org/abs/2502.09967
Authors: Zhuming Wang, Yihao Zheng, Jiarui Li, Yaofei Wu, Yan Huang, Zun Li, Lifang Wu, Liang Wang
Affiliations: School of Information Science and Technology, Beijing University of Technology, Beijing, China; Institute of Automation, Chinese Academy of Sciences, China
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Existing weakly supervised group activity recognition methods rely on object detectors or attention mechanisms to capture key areas automatically. However, they overlook the semantic information associated with captured areas, which may adversely affect the recognition performance. In this paper, we propose a novel framework named Visual Conceptual Knowledge Guided Action Map (VicKAM) which effectively captures the locations of individual actions and integrates them with action semantics for weakly supervised group activity recognition. It generates individual action prototypes from training set as visual conceptual knowledge to bridge action semantics and visual representations. Guided by this knowledge, VicKAM produces action maps that indicate the likelihood of each action occurring at various locations, based on image correlation theorem. It further augments individual action maps using group activity related statistical information, representing individual action distribution under different group activities, to establish connections between action maps and specific group activities. The augmented action map is incorporated with action semantic representations for group activity recognition. Extensive experiments on two public benchmarks, the Volleyball and the NBA datasets, demonstrate the effectiveness of our proposed method, even in cases of limited training data. The code will be released later.

[CV-28] Generating on Generated: An Approach Towards Self-Evolving Diffusion Models

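[Quick read]: This paper addresses the training collapse caused by synthetic data when Recursive Self-Improvement (RSI) is applied to text-to-image diffusion models. It identifies two main causes of the collapse: a lack of perceptual alignment and the accumulation of generative hallucinations. To mitigate them, the paper proposes three key strategies: (1) a prompt construction and filtering pipeline that promotes the generation of perceptually aligned data; (2) a preference sampling method that identifies human-preferred samples and filters out generative hallucinations; and (3) a distribution-based weighting scheme that penalizes selected samples containing hallucinatory errors. Extensive experiments validate the effectiveness of these approaches.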

Link: https://arxiv.org/abs/2502.09963
Authors: Xulu Zhang, Xiaoyong Wei, Jinlin Wu, Jiaxin Wu, Zhaoxiang Zhang, Zhen Lei, Qing Li
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Recursive Self-Improvement (RSI) enables intelligence systems to autonomously refine their capabilities. This paper explores the application of RSI in text-to-image diffusion models, addressing the challenge of training collapse caused by synthetic data. We identify two key factors contributing to this collapse: the lack of perceptual alignment and the accumulation of generative hallucinations. To mitigate these issues, we propose three strategies: (1) a prompt construction and filtering pipeline designed to facilitate the generation of perceptual aligned data, (2) a preference sampling method to identify human-preferred samples and filter out generative hallucinations, and (3) a distribution-based weighting scheme to penalize selected samples with hallucinatory errors. Our extensive experiments validate the effectiveness of these approaches.

[CV-29] Using MRNet to Predict Lunar Rock Categories Detected by Chang'e 5 Probe

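[Quick read]: This paper addresses the insufficient accuracy of lunar rock type identification. The key to the solution is a new network architecture, MRNet (MoonRockNet), which uses VGG16 for feature extraction and adds dilated convolution and a U-Net structure in the decoder to make full use of the global information in lunar images, so that finer-grained and more sparsely distributed rock types can be identified more accurately. Experiments on the newly established CE5ROCK dataset show that MRNet identifies rock types more accurately than other existing mainstream algorithms.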

Link: https://arxiv.org/abs/2502.09952
Authors: Jin Cui, Yifei Zou, Siyuan Zhang
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Published at the 8th International Conference on Advances in Machinery, Material Science and Engineering Application (MMSE 2022)

Abstract:China’s Chang’e 5 mission has been a remarkable success, with the chang’e 5 lander traveling on the Oceanus Procellarum to collect images of the lunar surface. Over the past half century, people have brought back some lunar rock samples, but its quantity does not meet the need for research. Under current circumstances, people still mainly rely on the analysis of rocks on the lunar surface through the detection of lunar rover. The Oceanus Procellarum, chosen by Chang’e 5 mission, contains various kind of rock species. Therefore, we first applied to the National Astronomical Observatories of the China under the Chinese Academy of Sciences for the Navigation and Terrain Camera (NaTeCam) of the lunar surface image, and established a lunar surface rock image data set CE5ROCK. The data set contains 100 images, which randomly divided into training, validation and test set. Experimental results show that the identification accuracy testing on convolutional neural network (CNN) models like AlexNet or MobileNet is about to 40.0%. In order to make full use of the global information in Moon images, this paper proposes the MRNet (MoonRockNet) network architecture. The encoding structure of the network uses VGG16 for feature extraction, and the decoding part adds dilated convolution and commonly used U-Net structure on the original VGG16 decoding structure, which is more conducive to identify more refined but more sparsely distributed types of lunar rocks. We have conducted extensive experiments on the established CE5ROCK data set, and the experimental results show that MRNet can achieve more accurate rock type identification, and outperform other existing mainstream algorithms in the identification performance.
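
The abstract mentions dilated convolution as a way to use more global context. A minimal 1D sketch, assuming a hypothetical helper `dilated_conv1d` and no padding, shows how dilation widens the span of input each output sees without adding kernel weights. It is only an illustration of the operation, not MRNet's decoder.

```python
def dilated_conv1d(signal, kernel, dilation):
    """Valid 1D cross-correlation with `dilation` steps between kernel taps."""
    # Number of input samples covered by one output value.
    span = (len(kernel) - 1) * dilation + 1
    return [
        sum(kernel[k] * signal[i + k * dilation] for k in range(len(kernel)))
        for i in range(len(signal) - span + 1)
    ]

signal = [1, 2, 3, 4, 5]
dense = dilated_conv1d(signal, [1, 1], dilation=1)   # each output covers 2 samples
sparse = dilated_conv1d(signal, [1, 1], dilation=2)  # each output covers 3 samples
```

With the same two weights, the dilated version reaches one sample further, which is the property that makes stacked dilated layers attractive for sparse, widely spaced targets.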

[CV-30] A Lightweight and Effective Image Tampering Localization Network with Vision Mamba

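[Quick read]: This paper addresses blind image tampering localization, where existing methods rely mainly on Convolutional Neural Networks (CNNs) and Transformers. CNNs are limited by local receptive fields, while Transformers provide global context modeling at the cost of quadratic computational complexity. To address this, the paper proposes ForMa, a lightweight and effective FORensic network based on vision MAmba. The key is that ForMa captures multi-scale global features with linear complexity, giving efficient global dependency modeling, and uses a parameter-free pixel shuffle layer for upsampling. It also proposes a noise-assisted decoding strategy that integrates complementary manipulation traces from tampered images, making the decoder more sensitive to forgery cues. Experiments show that ForMa achieves state-of-the-art generalization ability and robustness while maintaining the lowest computational complexity.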

Link: https://arxiv.org/abs/2502.09941
Authors: Kun Guo, Gang Cao, Zijie Lou, Xianglin Huang, Jiaoyun Liu
Affiliations: School of Computer and Cyber Sciences, Communication University of China, Beijing 100024, China; State Key Laboratory of Media Convergence and Communication, Communication University of China, Beijing 100024, China; School of Information Engineering, Changsha Medical University, Changsha 410219, China
Categories: Computer Vision and Pattern Recognition (cs.CV); Cryptography and Security (cs.CR)
Comments:

Abstract:Current image tampering localization methods primarily rely on Convolutional Neural Networks (CNNs) and Transformers. While CNNs suffer from limited local receptive fields, Transformers offer global context modeling at the expense of quadratic computational complexity. Recently, the state space model Mamba has emerged as a competitive alternative, enabling linear-complexity global dependency modeling. Inspired by it, we propose a lightweight and effective FORensic network based on vision MAmba (ForMa) for blind image tampering localization. Firstly, ForMa captures multi-scale global features that achieves efficient global dependency modeling through linear complexity. Then the pixel-wise localization map is generated by a lightweight decoder, which employs a parameter-free pixel shuffle layer for upsampling. Additionally, a noise-assisted decoding strategy is proposed to integrate complementary manipulation traces from tampered images, boosting decoder sensitivity to forgery cues. Experimental results on 10 standard datasets demonstrate that ForMa achieves state-of-the-art generalization ability and robustness, while maintaining the lowest computational complexity. Code is available at this https URL.
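
The parameter-free pixel shuffle upsampling mentioned above can be sketched in plain Python. The version below assumes the common channel layout in which input channel `c * r * r + i * r + j` fills output offset `(i, j)` of output channel `c`; the function name `pixel_shuffle` and the nested-list tensor format are choices made for this illustration, not details taken from ForMa.

```python
def pixel_shuffle(x, r):
    """Rearrange a (C*r*r, H, W) nested list into a (C, H*r, W*r) nested list."""
    channels_in = len(x)
    if channels_in % (r * r) != 0:
        raise ValueError("channel count must be divisible by r*r")
    c_out = channels_in // (r * r)
    h, w = len(x[0]), len(x[0][0])
    out = [[[0] * (w * r) for _ in range(h * r)] for _ in range(c_out)]
    for c in range(c_out):
        for i in range(r):          # row offset inside each r x r output block
            for j in range(r):      # column offset inside each r x r output block
                src = x[c * r * r + i * r + j]
                for y in range(h):
                    for z in range(w):
                        out[c][y * r + i][z * r + j] = src[y][z]
    return out

# Four 1x1 channels become one 2x2 channel: no weights are involved.
upsampled = pixel_shuffle([[[0]], [[1]], [[2]], [[3]]], r=2)
```

Because the operation only moves values from the channel axis to the spatial axes, it adds no learnable parameters, which is consistent with the lightweight decoder described in the abstract.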

[CV-31] Temporal Scale and Shift Invariant Automatic Event Recognition using the Mellin Transform

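[Quick read]: This paper addresses accurate automatic event recognition for videos running at different speeds, and the filtering of unwanted events from a video database. It builds on the spatio-temporal holographic correlator, which combines traditional 2D optical image correlation with inhomogeneously broadened arrays of cold atoms to achieve 3D time-space correlation, and proposes a method that extends such recognition to videos played at different speeds, greatly improving recognition accuracy.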

Link: https://arxiv.org/abs/2502.09939
Authors: Xi Shen, Julian Gamboa, Tabassom Hamidfar, Shamima A. Mitu, Selim M. Shahriar
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:The Spatio-temporal holographic correlator combines the traditional 2D optical image correlation techniques with inhomogeneously broadened arrays of cold atoms to achieve 3D time-space correlation to realize automatic event recognition at an ultra-high speed. Here we propose a method to realize such event recognition for videos running at different speeds. With this method, we can highly improve recognition accuracy and filter almost all the unwanted events in the video database.
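
A video played at a different speed is a temporally scaled signal $f(at)$ with $a > 0$. The standard scaling property of the Mellin transform, sketched below, shows why its magnitude on the imaginary axis does not change under such scaling. This is textbook background, assuming the integrals converge; it is not a reconstruction of the authors' optical pipeline.

```latex
\begin{align}
\mathcal{M}\{f\}(s) &= \int_0^\infty f(t)\, t^{s-1}\, dt, \\
\mathcal{M}\{f(at)\}(s) &= \int_0^\infty f(at)\, t^{s-1}\, dt
  = a^{-s} \int_0^\infty f(u)\, u^{s-1}\, du
  = a^{-s}\, \mathcal{M}\{f\}(s), \qquad u = at, \\
\left| \mathcal{M}\{f(at)\}(i\omega) \right|
  &= \left| a^{-i\omega} \right| \left| \mathcal{M}\{f\}(i\omega) \right|
  = \left| \mathcal{M}\{f\}(i\omega) \right|.
\end{align}
```

The last step uses $|a^{-i\omega}| = |e^{-i\omega \ln a}| = 1$ for real $\omega$, so a speed change only alters the phase of the transform.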

[CV-32] Precise Parameter Localization for Textual Generation in Diffusion Models ICLR2025

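[Quick read]: This paper addresses the control and improvement of textual content in images generated by diffusion models. The key finding, obtained through attention activation patching, is that fewer than 1% of the parameters, all located in attention layers, influence the generation of text. The paper then targets the cross and joint attention layers to improve the efficiency and performance of text generation. This localization approach strengthens the text generation ability of large diffusion models while preserving the quality and diversity of their outputs, and it also enables editing of generated text and cost-free prevention of toxic text generation.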

Link: https://arxiv.org/abs/2502.09935
Authors: Łukasz Staniszewski, Bartosz Cywiński, Franziska Boenisch, Kamil Deja, Adam Dziedzic
Affiliations: Warsaw University of Technology; CISPA Helmholtz Center for Information Security
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: ICLR 2025

Abstract:Novel diffusion models can synthesize photo-realistic images with integrated high-quality text. Surprisingly, we demonstrate through attention activation patching that only less than 1% of diffusion models’ parameters, all contained in attention layers, influence the generation of textual content within the images. Building on this observation, we improve textual generation efficiency and performance by targeting cross and joint attention layers of diffusion models. We introduce several applications that benefit from localizing the layers responsible for textual content generation. We first show that a LoRA-based fine-tuning solely of the localized layers enhances, even more, the general text-generation capabilities of large diffusion models while preserving the quality and diversity of the diffusion models’ generations. Then, we demonstrate how we can use the localized layers to edit textual content in generated images. Finally, we extend this idea to the practical use case of preventing the generation of toxic text in a cost-free manner. In contrast to prior work, our localization approach is broadly applicable across various diffusion model architectures, including U-Net (e.g., LDM and SDXL) and transformer-based (e.g., DeepFloyd IF and Stable Diffusion 3), utilizing diverse text encoders (e.g., from CLIP to the large language models like T5). Project page available at this https URL.
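
The LoRA-based fine-tuning of only the localized layers can be pictured with the usual low-rank update $W + \frac{\alpha}{r} BA$. The sketch below uses plain nested lists and hypothetical helper names (`matmul`, `lora_weight`); which layers receive the adapter, and the values of `alpha` and `rank`, are assumptions rather than settings reported in the paper.

```python
def matmul(a, b):
    """Multiply two matrices stored as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def lora_weight(w, a, b, alpha, rank):
    """Effective weight W + (alpha / rank) * B @ A of a LoRA-adapted layer.

    w: frozen weight, shape (d_out, d_in)
    a: trainable factor, shape (rank, d_in)
    b: trainable factor, shape (d_out, rank)
    """
    delta = matmul(b, a)
    scale = alpha / rank
    return [[w[i][j] + scale * delta[i][j] for j in range(len(w[0]))]
            for i in range(len(w))]

w = [[1.0, 2.0], [3.0, 4.0]]
a = [[1.0, 1.0]]
# With B initialised to zero, the adapted layer starts identical to the frozen one.
unchanged = lora_weight(w, a, [[0.0], [0.0]], alpha=2.0, rank=1)
adapted = lora_weight(w, a, [[1.0], [2.0]], alpha=2.0, rank=1)
```

Only `a` and `b` would be trained, so restricting the adapter to the small set of localized attention layers keeps the number of trainable parameters low.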

[CV-33] AffectSRNet : Facial Emotion-Aware Super-Resolution Network

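[Quick read]: This paper addresses the difficulty facial expression recognition (FER) systems have in identifying expressions accurately at low resolution, a common situation in applications such as surveillance and mobile communications where it degrades recognition accuracy. Traditional single-image face super-resolution (FSR) techniques often fail to preserve the emotional intent of expressions, introducing distortions that obscure the original affective content. The key to the solution is AffectSRNet, a new framework that enhances image quality while maintaining the intensity and fidelity of facial expressions. AffectSRNet uses an expression-preserving loss function tailored to FER applications to bridge the gap between image resolution and expression accuracy, and it introduces a new metric that measures emotion preservation in super-resolved images, giving a more nuanced evaluation of FER performance in low-resolution scenarios. Experiments show that AffectSRNet outperforms existing FSR methods in both visual quality and emotion fidelity, indicating its potential for integration into practical FER applications.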

Link: https://arxiv.org/abs/2502.09932
Authors: Syed Sameen Ahmad Rizvi, Soham Kumar, Aryan Seth, Pratik Narang
Affiliations: Department of CSIS, Birla Institute of Technology & Science, Pilani, RJ, India
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Facial expression recognition (FER) systems in low-resolution settings face significant challenges in accurately identifying expressions due to the loss of fine-grained facial details. This limitation is especially problematic for applications like surveillance and mobile communications, where low image resolution is common and can compromise recognition accuracy. Traditional single-image face super-resolution (FSR) techniques, however, often fail to preserve the emotional intent of expressions, introducing distortions that obscure the original affective content. Given the inherently ill-posed nature of single-image super-resolution, a targeted approach is required to balance image quality enhancement with emotion retention. In this paper, we propose AffectSRNet, a novel emotion-aware super-resolution framework that reconstructs high-quality facial images from low-resolution inputs while maintaining the intensity and fidelity of facial expressions. Our method effectively bridges the gap between image resolution and expression accuracy by employing an expression-preserving loss function, specifically tailored for FER applications. Additionally, we introduce a new metric to assess emotion preservation in super-resolved images, providing a more nuanced evaluation of FER system performance in low-resolution scenarios. Experimental results on standard datasets, including CelebA, FFHQ, and Helen, demonstrate that AffectSRNet outperforms existing FSR approaches in both visual quality and emotion fidelity, highlighting its potential for integration into practical FER applications. This work not only improves image clarity but also ensures that emotion-driven applications retain their core functionality in suboptimal resolution environments, paving the way for broader adoption in FER systems.

[CV-34] TransGUNet: Transformer Meets Graph-based Skip Connection for Medical Image Segmentation

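[Quick read]: This paper addresses the semantic gap between encoder and decoder in medical image segmentation and the need to understand relationships among complex anatomical structures, while overcoming the limitation that existing transformer-based approaches capture detailed local features poorly and at high computational cost. The key to the solution is an attentional cross-scale graph neural network (ACS-GNN), which converts cross-scale feature maps into a graph structure and captures complex anatomical structures through node attention. The paper also proposes entropy-driven feature selection (EFS), combined with spatial attention, to improve the quality of the feature maps. With the resulting framework, TransGUNet, the paper achieves superior segmentation performance and higher efficiency on data from multiple modalities.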

Link: https://arxiv.org/abs/2502.09931
Authors: Ju-Hyeon Nam, Nur Suriza Syazwany, Sang-Chul Lee
Affiliations: Department of Electrical and Computer Engineering, Inha University
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 24 pages, 12 figures

Abstract:Skip connection engineering is primarily employed to address the semantic gap between the encoder and decoder, while also integrating global dependencies to understand the relationships among complex anatomical structures in medical image segmentation. Although several models have proposed transformer-based approaches to incorporate global dependencies within skip connections, they often face limitations in capturing detailed local features with high computational complexity. In contrast, graph neural networks (GNNs) exploit graph structures to effectively capture local and global features. Leveraging these properties, we introduce an attentional cross-scale graph neural network (ACS-GNN), which enhances the skip connection framework by converting cross-scale feature maps into a graph structure and capturing complex anatomical structures through node attention. Additionally, we observed that deep learning models often produce uninformative feature maps, which degrades the quality of spatial attention maps. To address this problem, we integrated entropy-driven feature selection (EFS) with spatial attention, calculating an entropy score for each channel and filtering out high-entropy feature maps. Our innovative framework, TransGUNet, comprises ACS-GNN and EFS-based spatial attention to effectively enhance domain generalizability across various modalities by leveraging GNNs alongside a reliable spatial attention map, ensuring more robust features within the skip connection. Through comprehensive experiments and analysis, TransGUNet achieved superior segmentation performance on six seen and eight unseen datasets, demonstrating significantly higher efficiency compared to previous methods.
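
The entropy-driven feature selection step, which scores each channel by entropy and filters out high-entropy feature maps, might look roughly like the sketch below. The spatial softmax used to turn a feature map into a probability distribution, the keep-the-lowest-k rule, and the function names are assumptions for illustration; the paper may define the score differently.

```python
import math

def channel_entropy(fmap):
    """Shannon entropy (in nats) of a 2D feature map after a spatial softmax."""
    flat = [v for row in fmap for v in row]
    peak = max(flat)  # subtract the maximum for numerical stability
    exps = [math.exp(v - peak) for v in flat]
    total = sum(exps)
    probs = [e / total for e in exps]
    return -sum(p * math.log(p) for p in probs if p > 0)

def select_low_entropy(fmaps, keep):
    """Return the indices of the `keep` lowest-entropy channels, in channel order."""
    order = sorted(range(len(fmaps)), key=lambda c: channel_entropy(fmaps[c]))
    return sorted(order[:keep])

flat_map = [[0.0, 0.0], [0.0, 0.0]]     # uniform response: maximal entropy
peaked_map = [[10.0, 0.0], [0.0, 0.0]]  # one strong activation: low entropy
kept = select_low_entropy([flat_map, peaked_map, flat_map], keep=1)
```

A uniform map carries no spatial preference, so under this scoring it is the first to be dropped, which matches the abstract's description of uninformative feature maps degrading spatial attention.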

[CV-35] Deep Tree Tensor Networks for Image Recognition

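[Quick read]: This paper addresses the limited applicability of existing tensor networks (TNs) to natural image processing, in particular the fact that typical models such as Matrix Product States (MPS) are mostly used to compress parameters and so lose their ability to enhance high-order feature interactions. The key to the solution is a new architecture, the Deep Tree Tensor Network (DTTN), which captures $2^L$-order multiplicative interactions across features through multilinear operations and essentially unfolds into a tree-like TN topology with parameter sharing. DTTN is built by stacking multiple antisymmetric interacting modules (AIMs), a design that allows an efficient implementation.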

Link: https://arxiv.org/abs/2502.09928
Authors: Chang Nie, Junfang Chen, Yajie Chen
Affiliations: Nanjing University of Science and Technology; East China Normal University; Li Auto Inc.
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:Originating in quantum physics, tensor networks (TNs) have been widely adopted as exponential machines and parameter decomposers for recognition tasks. Typical TN models, such as Matrix Product States (MPS), have not yet achieved successful application in natural image processing. When employed, they primarily serve to compress parameters within off-the-shelf networks, thus losing their distinctive capability to enhance exponential-order feature interactions. This paper introduces a novel architecture named Deep Tree Tensor Network (DTTN), which captures $2^L$-order multiplicative interactions across features through multilinear operations, while essentially unfolding into a tree-like TN topology with the parameter-sharing property. DTTN is stacked with multiple antisymmetric interacting modules (AIMs), and this design facilitates efficient implementation. Moreover, we theoretically reveal the equivalency among quantum-inspired TN models and polynomial and multilinear networks under certain conditions, and we believe that DTTN can inspire more interpretable studies in this field. We evaluate the proposed model against a series of benchmarks and achieve excellent performance compared to its peers and cutting-edge architectures. Our code will soon be publicly available.
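
One way to see where a $2^L$ order can come from: if each layer multiplies two linear maps of its input elementwise, the polynomial degree doubles per layer, so $L$ layers give degree $2^L$. The sketch below checks this through homogeneity, since scaling the input by $c$ must scale the output by $c^{2^L}$. The bias-free layer `interact` is an illustrative stand-in and not the paper's antisymmetric interacting module.

```python
def matvec(m, x):
    """Matrix-vector product for nested-list matrices."""
    return [sum(mij * xj for mij, xj in zip(row, x)) for row in m]

def interact(a, b, x):
    """One multiplicative layer: elementwise product of two linear maps of x."""
    return [u * v for u, v in zip(matvec(a, x), matvec(b, x))]

def stack(layers, x):
    """Apply a list of (A, B) multiplicative layers in sequence."""
    for a, b in layers:
        x = interact(a, b, x)
    return x

# Three layers sharing the same (hypothetical) weights: L = 3, so degree 2**3 = 8.
layers = [([[1, 2], [0, 1]], [[1, 0], [1, 1]])] * 3
y = stack(layers, [1, 2])
y_scaled = stack(layers, [2, 4])  # input scaled by c = 2
```

Here `y_scaled` equals `2**8` times `y` in every coordinate, confirming that three stacked multiplicative layers produce an order-8 interaction of the input features.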

[CV-36] Granite Vision: a lightweight open-source multimodal model for enterprise Intelligence

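[Quick read]: This paper introduces Granite Vision, a lightweight large language model with vision capabilities, designed for enterprise use cases and in particular for visual document understanding. The key to the solution is an architecture centered on visual modality alignment with a decoder-only, 2-billion-parameter Granite large language model. The paper also introduces a dedicated test-time safety classification approach that uses a sparse set of attention vectors to identify potentially harmful inputs. Despite its lightweight architecture, Granite Vision performs strongly on standard visual document understanding benchmarks and on the LiveXiv benchmark.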

Link: https://arxiv.org/abs/2502.09927
Authors: Granite Vision Team: Leonid Karlinsky, Assaf Arbelle, Abraham Daniels, Ahmed Nassar, Amit Alfassi, Bo Wu, Eli Schwartz, Dhiraj Joshi, Jovana Kondic, Nimrod Shabtay, Pengyuan Li, Roei Herzig, Shafiq Abedin, Shaked Perek, Sivan Harary, Udi Barzelay, Adi Raz Goldfarb, Aude Oliva, Ben Wieles, Bishwaranjan Bhattacharjee, Brandon Huang, Christoph Auer, Dan Gutfreund, David Beymer, David Wood, Hilde Kuehne, Jacob Hansen, Joseph Shtok, Ken Wong, Luis Angel Bathen, Mayank Mishra, Maksym Lysak, Michele Dolfi, Mikhail Yurochkin, Nikolaos Livathinos, Nimrod Harel, Ophir Azulai, Oshri Naparstek, Rafael Teixeira de Lima, Rameswar Panda, Sivan Doveh, Shubham Gupta, Subhro Das, Syed Zawad, Yusik Kim, Zexue He, Alexander Brooks, Gabe Goodhart, Anita Govindjee, Derek Leist, Ibrahim Ibrahim, Aya Soffer, David Cox, Kate Soule, Luis Lastras, Nirmit Desai, Shila Ofek-koifman, Sriram Raghavan, Tanveer Syeda-Mahmood, Peter Staar, Tal Drory, Rogerio Feris
Affiliations: IBM Research
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:We introduce Granite Vision, a lightweight large language model with vision capabilities, specifically designed to excel in enterprise use cases, particularly in visual document understanding. Our model is trained on a comprehensive instruction-following dataset, including document-related tasks, such as content extraction from tables, charts, diagrams, sketches, and infographics, as well as general image tasks. The architecture of Granite Vision is centered around visual modality alignment with a decoder-only, 2 billion parameter Granite large language model. Additionally, we introduce a dedicated safety classification approach in test-time that leverages a sparse set of attention vectors to identify potential harmful inputs. Despite its lightweight architecture, Granite Vision achieves strong results in standard benchmarks related to visual document understanding, as well as on the LiveXiv benchmark, which is designed to avoid test set contamination by using a constantly updated corpus of recently published Arxiv papers. We are releasing the model under the Apache-2 license, allowing for both research and commercial use, while offering complete visibility into the training data and other relevant details. See this https URL for model weights.

[CV-37] TaskGalaxy: Scaling Multi-modal Instruction Fine-tuning with Tens of Thousands Vision Task Types

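[Quick read]: This paper addresses the performance limits of multimodal visual language models in open-world applications, which stem mainly from insufficient task-specific data and lead to poor generalization and biased outputs. The key to the solution is TaskGalaxy, a large-scale multimodal instruction fine-tuning dataset comprising 19,227 hierarchical task types and 413,648 samples. TaskGalaxy uses GPT-4o to expand task diversity from a small set of manually defined tasks, uses CLIP and GPT-4o to filter the tasks that best match open-source images, and generates relevant question-answer pairs. This automated process increases task diversity, improves data quality, and reduces the need for manual intervention.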

Link: https://arxiv.org/abs/2502.09925
Authors: Jiankang Chen, Tianke Zhang, Changyi Liu, Haojie Ding, Yaya Shi, Feng Cheng, Huihui Xiao, Bin Wen, Fan Yang, Tingting Gao, Di Zhang
Affiliations: Kuaishou Technology
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:Multimodal visual language models are gaining prominence in open-world applications, driven by advancements in model architectures, training techniques, and high-quality data. However, their performance is often limited by insufficient task-specific data, leading to poor generalization and biased outputs. Existing efforts to increase task diversity in fine-tuning datasets are hindered by the labor-intensive process of manual task labeling, which typically produces only a few hundred task types. To address this, we propose TaskGalaxy, a large-scale multimodal instruction fine-tuning dataset comprising 19,227 hierarchical task types and 413,648 samples. TaskGalaxy utilizes GPT-4o to enrich task diversity by expanding from a small set of manually defined tasks, with CLIP and GPT-4o filtering those that best match open-source images, and generating relevant question-answer pairs. Multiple models are employed to ensure sample quality. This automated process enhances both task diversity and data quality, reducing manual intervention. Incorporating TaskGalaxy into LLaVA-v1.5 and InternVL-Chat-v1.0 models shows substantial performance improvements across 16 benchmarks, demonstrating the critical importance of task diversity. TaskGalaxy is publicly released at this https URL.

[CV-38] Self-Consistent Model-based Adaptation for Visual Reinforcement Learning

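[Quick read]: This paper addresses the serious performance decline that visual reinforcement learning agents suffer in real-world applications because of visual distractions. The key to the solution is Self-Consistent Model-based Adaptation (SCMA), which uses a denoising model to convert cluttered observations into clean ones, making the agent more robust without modifying the policy. As a plug-and-play enhancement, SCMA can mitigate distractions for a variety of policies. To optimize the denoising model in an unsupervised manner, the paper derives an unsupervised distribution matching objective and analyzes its optimality theoretically.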

Link: https://arxiv.org/abs/2502.09923
Authors: Xinning Zhou, Chengyang Ying, Yao Feng, Hang Su, Jun Zhu
Affiliations: Tsinghua University; Department of Computer Science & Technology; Institute for AI; BNRist Center; Tsinghua-Bosch Joint ML Center; THBI Lab
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:

Abstract:Visual reinforcement learning agents typically face serious performance declines in real-world applications caused by visual distractions. Existing methods rely on fine-tuning the policy’s representations with hand-crafted augmentations. In this work, we propose Self-Consistent Model-based Adaptation (SCMA), a novel method that fosters robust adaptation without modifying the policy. By transferring cluttered observations to clean ones with a denoising model, SCMA can mitigate distractions for various policies as a plug-and-play enhancement. To optimize the denoising model in an unsupervised manner, we derive an unsupervised distribution matching objective with a theoretical analysis of its optimality. We further present a practical algorithm to optimize the objective by estimating the distribution of clean observations with a pre-trained world model. Extensive experiments on multiple visual generalization benchmarks and real robot data demonstrate that SCMA effectively boosts performance across various distractions and exhibits better sample efficiency.

[CV-39] Insect-Foundation: A Foundation Model and Large Multimodal Dataset for Vision-Language Insect Understanding

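[Quick read]: This paper addresses the lack of insect knowledge in existing conversational models for visual insect understanding. The key to the solution is a new large-scale Multimodal Insect Dataset with Visual Insect Instruction Data, together with a new multimodal conversational model, Insect-LLaVA (Insect Large Language and Vision Assistant). In addition, the paper develops a new micro-feature self-supervised learning method with a Patch-wise Relevant Attention mechanism, combined with a Description Consistency loss, to improve the learning of subtle features in insect images and strengthen the model's understanding of insect characteristics.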

Link: https://arxiv.org/abs/2502.09906
Authors: Thanh-Dat Truong, Hoang-Quan Nguyen, Xuan-Bac Nguyen, Ashley Dowling, Xin Li, Khoa Luu
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Multimodal conversational generative AI has shown impressive capabilities in various vision and language understanding through learning massive text-image data. However, current conversational models still lack knowledge about visual insects since they are often trained on the general knowledge of vision-language data. Meanwhile, understanding insects is a fundamental problem in precision agriculture, helping to promote sustainable development in agriculture. Therefore, this paper proposes a novel multimodal conversational model, Insect-LLaVA, to promote visual understanding in insect-domain knowledge. In particular, we first introduce a new large-scale Multimodal Insect Dataset with Visual Insect Instruction Data that enables the capability of learning the multimodal foundation models. Our proposed dataset enables conversational models to comprehend the visual and semantic features of the insects. Second, we propose a new Insect-LLaVA model, a new general Large Language and Vision Assistant in Visual Insect Understanding. Then, to enhance the capability of learning insect features, we develop an Insect Foundation Model by introducing a new micro-feature self-supervised learning with a Patch-wise Relevant Attention mechanism to capture the subtle differences among insect images. We also present Description Consistency loss to improve micro-feature learning via text descriptions. The experimental results evaluated on our new Visual Insect Question Answering benchmarks illustrate the effective performance of our proposed approach in visual insect understanding and achieve State-of-the-Art performance on standard benchmarks of insect-related tasks.

[CV-40] FrGNet: A fourier-guided weakly-supervised framework for nuclear instance segmentation

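[Quick read]: This paper addresses nuclear instance segmentation under weak supervision. The key to the solution is a Fourier guidance framework containing a Fourier guidance module that fuses prior information into model training, helping the model capture the relevant features. A guide-based instance-level contrastive module further improves the model's ability to represent nuclear features. On two public datasets, the method outperforms current SOTA methods in the fully supervised setting and, with only a small amount of labeling, still performs close to full supervision.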

Link: https://arxiv.org/abs/2502.09874
Authors: Peng Ling
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:Nuclear instance segmentation has played a critical role in pathology image analysis. The main challenges arise from the difficulty in accurately segmenting instances and the high cost of precise mask-level annotations for fully-supervised this http URL this work, we propose a fourier guidance framework for solving the weakly-supervised nuclear instance segmentation problem. In this framework, we construct a fourier guidance module to fuse the priori information into the training process of the model, which facilitates the model to capture the relevant features of the this http URL, in order to further improve the model’s ability to represent the features of nuclear, we propose the guide-based instance level contrastive module. This module makes full use of the framework’s own properties and guide information to effectively enhance the representation features of nuclear. We show on two public datasets that our model can outperform current SOTA methods under fully-supervised design, and in weakly-supervised experiments, with only a small amount of labeling our model still maintains close to the performance under full this http URL addition, we also perform generalization experiments on a private dataset, and without any labeling, our model is able to segment nuclear images that have not been seen during training quite effectively. As open science, all codes and pre-trained models are available at this https URL.

[CV-41] Compression-Aware One-Step Diffusion Model for JPEG Artifact Removal

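[Quick read]: This paper addresses two problems in image restoration: the heavy computational overhead of the multi-step denoising process in diffusion models, and the difficulty existing methods have in removing JPEG artifacts from heavily compressed images. The key solution is CODiff, a compression-aware one-step diffusion model. It includes a Compression-Aware Visual Embedder (CaVE) that extracts and uses JPEG compression priors to guide the diffusion model. CODiff also adopts a dual learning strategy that combines explicit and implicit learning, giving a deeper and more comprehensive understanding of JPEG compression.

Link: https://arxiv.org/abs/2502.09873
Authors: Jinpei Guo,Zheng Chen,Wenbo Li,Yong Guo,Yulun Zhang
Affiliations: Carnegie Mellon University; Shanghai Jiao Tong University; Huawei Noah’s Ark Lab; South China University of Technology
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: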

Abstract:Diffusion models have demonstrated remarkable success in image restoration tasks. However, their multi-step denoising process introduces significant computational overhead, limiting their practical deployment. Furthermore, existing methods struggle to effectively remove severe JPEG artifacts, especially in highly compressed images. To address these challenges, we propose CODiff, a compression-aware one-step diffusion model for JPEG artifact removal. The core of CODiff is the compression-aware visual embedder (CaVE), which extracts and leverages JPEG compression priors to guide the diffusion model. We propose a dual learning strategy that combines explicit and implicit learning. Specifically, explicit learning enforces a quality prediction objective to differentiate low-quality images with different compression levels. Implicit learning employs a reconstruction objective that enhances the model’s generalization. This dual learning allows for a deeper and more comprehensive understanding of JPEG compression. Experimental results demonstrate that CODiff surpasses recent leading methods in both quantitative and visual quality metrics. The code and models will be released at this https URL.

[CV-42] Learning to Calibrate for Reliable Visual Fire Detection

[Quick read]: This paper addresses the overconfidence of deep learning models in multi-class fire detection and aims to improve their uncertainty modeling. The key solution is to turn the Expected Calibration Error (ECE) into a differentiable ECE loss and combine it with the cross-entropy loss to guide training. In addition, a curriculum-learning-based approach dynamically adjusts the ECE loss weight to balance classification accuracy against reliable decisions.

Link: https://arxiv.org/abs/2502.09872
Authors: Ziqi Zhang,Xiuzhuang Zhou,Xiangyang Gong
Affiliations: School of Intelligent Engineering and Automation, Beijing University of Posts and Telecommunications; Information and Digital Management Department, China Petrochemical Corporation
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:

Abstract:Fire is characterized by its sudden onset and destructive power, making early fire detection crucial for ensuring human safety and protecting property. With the advancement of deep learning, the application of computer vision in fire detection has significantly improved. However, deep learning models often exhibit a tendency toward overconfidence, and most existing works focus primarily on enhancing classification performance, with limited attention given to uncertainty modeling. To address this issue, we propose transforming the Expected Calibration Error (ECE), a metric for measuring uncertainty, into a differentiable ECE loss function. This loss is then combined with the cross-entropy loss to guide the training process of multi-class fire detection models. Additionally, to achieve a good balance between classification accuracy and reliable decision, we introduce a curriculum learning-based approach that dynamically adjusts the weight of the ECE loss during training. Extensive experiments are conducted on two widely used multi-class fire detection datasets, DFAN and EdgeFireSmoke, validating the effectiveness of our uncertainty modeling method.
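
The loss combination described in this abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: it computes the standard binned ECE, which is not itself differentiable (the paper proposes a differentiable surrogate whose form is not given here), and the linear ramp in `ece_weight` is a hypothetical curriculum schedule chosen only for illustration.

```python
import math


def softmax(logits):
    """Numerically stable softmax over one list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(probs, labels):
    """Mean negative log-likelihood of the true classes."""
    return -sum(math.log(p[y]) for p, y in zip(probs, labels)) / len(labels)


def expected_calibration_error(probs, labels, n_bins=10):
    """Standard binned ECE: sum over bins of bin_weight * |accuracy - confidence|."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        conf = max(p)
        pred = p.index(conf)
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, 1.0 if pred == y else 0.0))
    n = len(labels)
    ece = 0.0
    for b in bins:
        if b:
            avg_conf = sum(c for c, _ in b) / len(b)
            acc = sum(a for _, a in b) / len(b)
            ece += len(b) / n * abs(acc - avg_conf)
    return ece


def ece_weight(epoch, total_epochs, max_weight=1.0):
    """Hypothetical linear curriculum: the weight ramps from 0 up to max_weight."""
    return max_weight * min(1.0, epoch / max(1, total_epochs - 1))


def combined_loss(logits_batch, labels, epoch, total_epochs):
    """Cross-entropy plus a curriculum-weighted calibration term."""
    probs = [softmax(z) for z in logits_batch]
    weight = ece_weight(epoch, total_epochs)
    return cross_entropy(probs, labels) + weight * expected_calibration_error(probs, labels)
```

Early in training the calibration term has little influence, so the model first learns to classify; later epochs put more weight on matching confidence to accuracy. A real training loop would need a differentiable version of the ECE term for gradients to flow through it.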

[CV-43] HealthGPT: A Medical Large Vision-Language Model for Unifying Comprehension and Generation via Heterogeneous Knowledge Adaptation

[Quick read]: This paper addresses the unified handling of visual comprehension and generation tasks in the medical domain. The key solution is HealthGPT, a Medical Large Vision-Language Model (Med-LVLM) that integrates medical visual comprehension and generation within a unified autoregressive paradigm through a novel Heterogeneous Low-Rank Adaptation (H-LoRA) technique, a tailored hierarchical visual perception approach, and a three-stage learning strategy.

Link: https://arxiv.org/abs/2502.09838
Authors: Tianwei Lin,Wenqiao Zhang,Sijing Li,Yuqian Yuan,Binhe Yu,Haoyuan Li,Wanggui He,Hao Jiang,Mengze Li,Xiaohui Song,Siliang Tang,Jun Xiao,Hui Lin,Yueting Zhuang,Beng Chin Ooi
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:We present HealthGPT, a powerful Medical Large Vision-Language Model (Med-LVLM) that integrates medical visual comprehension and generation capabilities within a unified autoregressive paradigm. Our bootstrapping philosophy is to progressively adapt heterogeneous comprehension and generation knowledge to pre-trained large language models (LLMs). This is achieved through a novel heterogeneous low-rank adaptation (H-LoRA) technique, which is complemented by a tailored hierarchical visual perception approach and a three-stage learning strategy. To effectively learn the HealthGPT, we devise a comprehensive medical domain-specific comprehension and generation dataset called VL-Health. Experimental results demonstrate exceptional performance and scalability of HealthGPT in medical visual unified tasks. Our project can be accessed at this https URL.
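
For context, the sketch below shows plain low-rank adaptation (LoRA), the general technique that H-LoRA builds on. It is not the heterogeneous variant proposed in the paper, whose details are not given in this abstract; the function names and the `alpha` scaling parameter follow common LoRA convention and are assumptions here.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def lora_forward(W, A, B, x, alpha=1.0):
    """Compute y = W x + (alpha / r) * B (A x).

    W is the frozen pre-trained weight (d_out x d_in).
    A (r x d_in) and B (d_out x r) are the small trainable matrices,
    where r is the adapter rank.
    """
    r = len(A)
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    scale = alpha / r
    return [b + scale * u for b, u in zip(base, update)]


# Usage: a rank-1 adapter on top of an identity weight matrix.
W = [[1.0, 0.0], [0.0, 1.0]]
A = [[1.0, 1.0]]
B = [[1.0], [2.0]]
y = lora_forward(W, A, B, [1.0, 2.0])
```

Only `A` and `B` are trained, so the number of updated parameters grows with the rank `r` rather than with the full size of `W`.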

[CV-44] PUGS: Perceptual Uncertainty for Grasp Selection in Underwater Environments ICRA

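[Quick read]: This paper addresses how robots can make reliable decisions in challenging environments where sensory information is imperfect and incomplete. The key solution quantifies and represents perceptual uncertainty in 3D reconstruction through occupancy uncertainty estimation and incorporates it into a grasp selection framework for autonomous manipulation in underwater environments. By propagating the uncertainty inherent in multi-view reconstruction into grasp selection, the method makes grasp selection more robust to partial and noisy measurements.

Link: https://arxiv.org/abs/2502.09824
Authors: Onur Bagoren,Marc Micatka,Katherine A. Skinner,Aaron Marburg
Affiliations: Department of Robotics, University of Michigan; Applied Physics Laboratory, University of Washington
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Comments: 8 pages, 4 figures. Accepted to International Conference on Robotics and Automation (ICRA) 2024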

Abstract:When navigating and interacting in challenging environments where sensory information is imperfect and incomplete, robots must make decisions that account for these shortcomings. We propose a novel method for quantifying and representing such perceptual uncertainty in 3D reconstruction through occupancy uncertainty estimation. We develop a framework to incorporate it into grasp selection for autonomous manipulation in underwater environments. Instead of treating each measurement equally when deciding which location to grasp from, we present a framework that propagates uncertainty inherent in the multi-view reconstruction process into the grasp selection. We evaluate our method with both simulated and real-world data, showing that by accounting for uncertainty, the grasp selection becomes robust against partial and noisy measurements. Code will be made available at this https URL.

[CV-45] A Solver-Aided Hierarchical Language for LLM-Driven CAD Design

[Quick read]: This paper addresses the difficulty large language models (LLMs) have in generating procedural geometry, particularly in computer-aided design (CAD). The key solution is AIDL, a solver-aided, hierarchical domain-specific language (DSL) that offloads spatial reasoning to a geometric constraint solver. This lets LLMs perform CAD design more effectively. In the few-shot regime, AIDL outperforms OpenSCAD, a language present in the training data, both in producing visual results closer to the prompt and in creating objects that are easier to post-process and reason about.

Link: https://arxiv.org/abs/2502.09819
Authors: Benjamin T. Jones,Felix Hähnlein,Zihan Zhang,Maaz Ahmad,Vladimir Kim,Adriana Schulz
Affiliations: MIT CSAIL; University of Washington; Adobe
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Graphics (cs.GR); Machine Learning (cs.LG); Programming Languages (cs.PL)
Comments:

Abstract:Large language models (LLMs) have been enormously successful in solving a wide variety of structured and unstructured generative tasks, but they struggle to generate procedural geometry in Computer Aided Design (CAD). These difficulties arise from an inability to do spatial reasoning and the necessity to guide a model through complex, long range planning to generate complex geometry. We enable generative CAD Design with LLMs through the introduction of a solver-aided, hierarchical domain specific language (DSL) called AIDL, which offloads the spatial reasoning requirements to a geometric constraint solver. Additionally, we show that in the few-shot regime, AIDL outperforms even a language with in-training data (OpenSCAD), both in terms of generating visual results closer to the prompt and creating objects that are easier to post-process and reason about.

[CV-46] On the robustness of multimodal language model towards distractions

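[Quick read]: This paper assesses the robustness of Vision-Language Models (VLMs) to visual and textual distractions in science question answering. It builds a new benchmark that injects visual and textual distractions into the input to measure their effect on current state-of-the-art VLMs such as GPT-4. Although some models, such as InternVL2, are more robust, most models show noticeable degradation in reasoning when distracted and are more sensitive to textual distractions than visual ones. To mitigate this, the paper explores several strategies including prompt engineering; these improve performance to some extent but leave room for further improvement. The key lies in systematic benchmarking and strategy optimization to improve the reliability of VLMs in practical applications.

Link: https://arxiv.org/abs/2502.09818
Authors: Ming Liu,Hao Chen,Jindong Wang,Wensheng Zhang
Affiliations: Iowa State University; Carnegie Mellon University; William & Mary University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: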

Abstract:Although vision-language models (VLMs) have achieved significant success in various applications such as visual question answering, their resilience to prompt variations remains an under-explored area. Understanding how distractions affect VLMs is crucial for improving their real-world applicability, as inputs could have noisy and irrelevant information in many practical scenarios. This paper aims to assess the robustness of VLMs against both visual and textual distractions in the context of science question answering. Built on the ScienceQA dataset, we developed a new benchmark that introduces distractions in both the visual and textual contexts to evaluate the reasoning capacity of VLMs amid these distractions. Our findings reveal that most state-of-the-art VLMs, including GPT-4, are vulnerable to various types of distractions, experiencing noticeable degradation in reasoning capabilities when confronted with distractions. Notably, models such as InternVL2 demonstrate a higher degree of robustness to these distractions. We also found that models exhibit greater sensitivity to textual distractions than visual ones. Additionally, we explored various mitigation strategies, such as prompt engineering, to counteract the impact of distractions. While these strategies improved solution accuracy, our analysis shows that there remain significant opportunities for improvement.

[CV-47] Face Deepfakes - A Comprehensive Review

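[Quick read]: This paper aims to fill the gap in systematic study of deepfake technology through an in-depth analysis of state-of-the-art face deepfake generation and detection methods. Its key contributions are a thorough theoretical analysis and a systematic evaluation of the impact of deepfakes on face recognition methods, together with a discussion of application areas, gaps in existing research, and future research directions.

Link: https://arxiv.org/abs/2502.09812
Authors: Tharindu Fernando,Darshana Priyasad,Sridha Sridharan,Arun Ross,Clinton Fookes
Affiliations: Queensland University of Technology, Australia; Michigan State University, United States
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: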

Abstract:In recent years, remarkable advancements in deepfake generation technology have led to unprecedented leaps in its realism and capabilities. Despite these advances, we observe a notable lack of structured and deep analysis of deepfake technology. The principal aim of this survey is to contribute a thorough theoretical analysis of state-of-the-art face deepfake generation and detection methods. Furthermore, we provide a coherent and systematic evaluation of the implications of deepfakes on face biometric recognition approaches. In addition, we outline key applications of face deepfake technology, elucidating both positive and negative applications of the technology, provide a detailed discussion regarding the gaps in existing research, and propose key research directions for further investigation.

[CV-48] Vision-based Geo-Localization of Future Mars Rotorcraft in Challenging Illumination Conditions

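[Quick read]: This paper addresses the failure of traditional Map-based Localization (MbL) systems caused by illumination changes during long flights of Mars rotorcraft. The key solution is Geo-LoFTR, a geometry-aided deep learning model that is more robust than prior models under large illumination differences. A custom simulation framework uses real orbital maps to generate large amounts of realistic Martian terrain imagery, enabling more accurate localization, especially under significant lighting and scale variations.

Link: https://arxiv.org/abs/2502.09795
Authors: Dario Pisanti,Robert Hewitt,Roland Brockers,Georgios Georgakis
Affiliations: Jet Propulsion Lab, California Institute of Technology; Torc Robotics
Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments: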

Abstract:Planetary exploration using aerial assets has the potential for unprecedented scientific discoveries on Mars. While NASA’s Mars helicopter Ingenuity proved flight in the Martian atmosphere is possible, future Mars rotorcraft will require advanced navigation capabilities for long-range flights. One such critical capability is Map-based Localization (MbL), which registers an onboard image to a reference map during flight in order to mitigate cumulative drift from visual odometry. However, significant illumination differences between rotorcraft observations and a reference map prove challenging for traditional MbL systems, restricting the operational window of the vehicle. In this work, we investigate a new MbL system and propose Geo-LoFTR, a geometry-aided deep learning model for image registration that is more robust under large illumination differences than prior models. The system is supported by a custom simulation framework that uses real orbital maps to produce large amounts of realistic images of the Martian terrain. Comprehensive evaluations show that our proposed system outperforms prior MbL efforts in terms of localization accuracy under significant lighting and scale variations. Furthermore, we demonstrate the validity of our approach across a simulated Martian day.

[CV-49] Noise Controlled CT Super-Resolution with Conditional Diffusion Model

[Quick read]: This paper addresses the challenge of improving the spatial resolution of CT images while controlling noise amplification. The key solution is a framework based on a conditional diffusion model, trained on hybrid datasets that combine noise-matched simulation data with segmented details from real data. Experiments on real CT images validate its effectiveness for practical CT imaging.

Link: https://arxiv.org/abs/2502.09793
Authors: Yuang Wang,Siyeop Yoon,Rui Hu,Baihui Yu,Duhgoon Lee,Rajiv Gupta,Li Zhang,Zhiqiang Chen,Dufan Wu
Affiliations: Department of Radiology, Massachusetts General Hospital and Harvard Medical School, Boston MA 02114, USA; Neurologica Corp., Danvers MA 01923, USA; Department of Engineering Physics, Tsinghua University, Beijing 100084, China
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: The 8th International Conference on Image Formation in X-Ray Computed Tomography, Bamberg, Germany, August 5 - 9, 2024

Abstract:Improving the spatial resolution of CT images is a meaningful yet challenging task, often accompanied by the issue of noise amplification. This article introduces an innovative framework for noise-controlled CT super-resolution utilizing the conditional diffusion model. The model is trained on hybrid datasets, combining noise-matched simulation data with segmented details from real data. Experimental results with real CT images validate the effectiveness of our proposed framework, showing its potential for practical applications in CT imaging.

[CV-50] A CNN Approach to Automated Detection and Classification of Brain Tumors

[Quick read]: This paper addresses the challenges that noise and incomplete MRI images pose for brain tumor diagnosis. The key solution combines denoising with Convolutional Neural Networks (CNNs). Specifically, MRI images are denoised with an anisotropic diffusion filter and then classified using several CNN models, including ResNet152V2, VGG, ViT, and EfficientNet. EfficientNet reached a classification accuracy of 98%, demonstrating the effectiveness of the approach.

Link: https://arxiv.org/abs/2502.09731
Authors: Md. Zahid Hasan,Abdullah Tamim,D.M. Asadujjaman,Md. Mahfujur Rahman,Md. Abu Ahnaf Mollick,Nosin Anjum Dristi,Abdullah-Al-Noman
Affiliations: Dept. of Computer Science & Engineering, University of Rajshahi, Bangladesh; Dept. of Computer Science & Engineering, Khulna University of Engineering & Technology, Khulna, Bangladesh; Dept. of Computer Science & Engineering, Rajshahi University of Engineering & Technology, Rajshahi, Bangladesh; Dept. of Computer Science & Engineering, Varendra University, Rajshahi, Bangladesh
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:Brain tumors require an assessment to ensure timely diagnosis and effective patient treatment. Morphological factors such as size, location, texture, and variable appearance complicate tumor inspection. Medical imaging presents challenges, including noise and incomplete images. This research article presents a methodology for processing Magnetic Resonance Imaging (MRI) data, encompassing techniques for image classification and denoising. The effective use of MRI images allows medical professionals to detect brain disorders, including tumors. This research aims to categorize healthy brain tissue and brain tumors by analyzing the provided MRI data. Unlike alternative methods like Computed Tomography (CT), MRI technology offers a more detailed representation of internal anatomical components, making it a suitable option for studying data related to brain tumors. The MRI image is first subjected to a denoising technique utilizing an anisotropic diffusion filter. The dataset utilized for model creation is a publicly accessible and validated Brain Tumour Classification (MRI) database, comprising 3,264 brain MRI scans. SMOTE was employed for data augmentation and dataset balancing. Convolutional Neural Networks (CNNs) such as ResNet152V2, VGG, ViT, and EfficientNet were employed for the classification procedure. EfficientNet attained an accuracy of 98%, the highest recorded.
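
The denoising step mentioned in this abstract can be illustrated with a small sketch of anisotropic diffusion. It assumes the classic Perona-Malik formulation with an exponential conductance function; the abstract does not say which variant or which parameters the authors used, so `n_iter`, `kappa`, and `step` are illustrative values only.

```python
import math


def anisotropic_diffusion(img, n_iter=10, kappa=20.0, step=0.2):
    """Perona-Malik diffusion on a 2D list of floats (4-neighbour scheme).

    kappa sets how large an intensity difference counts as an edge.
    step should stay at or below 0.25 for this explicit scheme to be stable.
    """
    h, w = len(img), len(img[0])
    cur = [row[:] for row in img]

    def conductance(diff):
        # Close to 1 in flat regions, close to 0 across strong edges.
        return math.exp(-((diff / kappa) ** 2))

    for _ in range(n_iter):
        nxt = [row[:] for row in cur]
        for i in range(h):
            for j in range(w):
                centre = cur[i][j]
                flux = 0.0
                for di, dj in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                    ni, nj = i + di, j + dj
                    if 0 <= ni < h and 0 <= nj < w:
                        diff = cur[ni][nj] - centre
                        flux += conductance(diff) * diff
                nxt[i][j] = centre + step * flux
        cur = nxt
    return cur
```

Small differences between neighbours (noise) are smoothed away, while differences much larger than `kappa` (edges such as tissue boundaries) block the diffusion and are preserved.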

[CV-51] ZeroBench: An Impossible Visual Benchmark for Contemporary Large Multimodal Models

Link: https://arxiv.org/abs/2502.09696
Authors: Jonathan Roberts,Mohammad Reza Taesiri,Ansh Sharma,Akash Gupta,Samuel Roberts,Ioana Croitoru,Simion-Vlad Bogolin,Jialu Tang,Florian Langer,Vyas Raina,Vatsal Raina,Hanyi Xiong,Vishaal Udandarao,Jingyi Lu,Shiyang Chen,Sam Purkis,Tianshuo Yan,Wenye Lin,Gyungin Shin,Qiaochu Yang,Anh Totti Nguyen,Kai Han,Samuel Albanie
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

[CV-52] Towards Virtual Clinical Trials of Radiology AI with Conditional Generative Modeling

Link: https://arxiv.org/abs/2502.09688
Authors: Benjamin D. Killeen,Bohua Wan,Aditya V. Kulkarni,Nathan Drenkow,Michael Oberst,Paul H. Yi,Mathias Unberath
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 35 pages

[CV-53] Object-Centric Latent Action Learning

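[Quick read]: This paper addresses the bottleneck in using large amounts of internet video for Embodied AI, caused by the lack of action annotations and the presence of action-correlated distractors. The key solution is an object-centric latent action learning approach that decomposes scenes into object representations in a self-supervised way and annotates video data with proxy-action labels. The approach disentangles causal agent-object interactions from irrelevant background noise and reduces the performance degradation that distractors cause in latent action learning.

Link: https://arxiv.org/abs/2502.09680
Authors: Albina Klepach,Alexander Nikulin,Ilya Zisman,Denis Tarasov,Alexander Derevyagin,Andrei Polubarov,Nikita Lyubaykin,Vladislav Kurenkov
Affiliations: AIRI; MIPT; Skoltech; ETH Zürich; Innopolis University
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Preprint. In review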

Abstract:Leveraging vast amounts of internet video data for Embodied AI is currently bottlenecked by the lack of action annotations and the presence of action-correlated distractors. We propose a novel object-centric latent action learning approach, based on VideoSaur and LAPO, that employs self-supervised decomposition of scenes into object representations and annotates video data with proxy-action labels. This method effectively disentangles causal agent-object interactions from irrelevant background noise and reduces the performance degradation of latent action learning approaches caused by distractors. Our preliminary experiments with the Distracting Control Suite show that latent action pretraining based on object decompositions improves the quality of inferred latent actions by 2.7x and the efficiency of downstream fine-tuning with a small set of labeled actions, increasing return by 2.6x on average.

[CV-54] IMM-MOT: A Novel 3D Multi-object Tracking Framework with Interacting Multiple Model Filter

Link: https://arxiv.org/abs/2502.09672
Authors: Xiaohong Liu,Xulong Zhao,Gang Liu,Zili Wu,Tao Wang,Lei Meng,Yuhan Wang
Affiliations: School of Computer Science and Technology, Xidian University, Xi’an, 710126, Shaanxi, China
Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments: 8 pages, 5 figures

[CV-55] Meta-INR: Efficient Encoding of Volumetric Data via Meta-Learning Implicit Neural Representation

Link: https://arxiv.org/abs/2502.09669
Authors: Maizhe Yang,Kaiyuan Tang,Chaoli Wang
Affiliations: University of Notre Dame
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Graphics (cs.GR)
Comments: Accepted by PVIS Short Paper Track

[CV-56] Revealing Subtle Phenotypes in Small Microscopy Datasets Using Latent Diffusion Models

Link: https://arxiv.org/abs/2502.09665
Authors: Anis Bourou,Biel Castaño Segade,Thomas Boye,Valérie Mezger,Auguste Genovesio
Affiliations: Ecole Normale Supérieure; Université Paris Cité
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

[CV-57] Image Super-Resolution with Guarantees via Conformal Generative Models

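[Quick read]: This paper addresses the lack of robust and interpretable uncertainty quantification when generative ML foundation models are used for image super-resolution. The key solution is a new method based on conformal prediction that produces a "confidence mask", which reliably and intuitively communicates which regions of the generated image can be trusted. The method applies to any black-box generative model, needs only easily obtainable data for calibration, and is highly customizable through the choice of a local image similarity metric.

Link: https://arxiv.org/abs/2502.09664
Authors: Eduardo Adame,Daniel Csillag,Guilherme Tegoni Goedert
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments: 11 pages, 7 figures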

Abstract:The increasing use of generative ML foundation models for image super-resolution calls for robust and interpretable uncertainty quantification methods. We address this need by presenting a novel approach based on conformal prediction techniques to create a “confidence mask” capable of reliably and intuitively communicating where the generated image can be trusted. Our method is adaptable to any black-box generative model, including those locked behind an opaque API, requires only easily attainable data for calibration, and is highly customizable via the choice of a local image similarity metric. We prove strong theoretical guarantees for our method that span fidelity error control (according to our local image similarity metric), reconstruction quality, and robustness in the face of data leakage. Finally, we empirically evaluate these results and establish our method’s solid performance.
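
To give a feel for the calibration step that conformal methods rely on, here is a minimal sketch of the generic split-conformal quantile. It is not the paper's confidence-mask construction, which is not described in this abstract; the function name and the idea of feeding it per-example error scores are assumptions for illustration.

```python
import math


def conformal_threshold(calibration_scores, alpha=0.1):
    """Return the split-conformal quantile of held-out nonconformity scores.

    With n exchangeable calibration scores, a fresh score falls at or below
    the returned value with probability at least 1 - alpha.
    """
    n = len(calibration_scores)
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        # Too few calibration points to guarantee this coverage level.
        return float("inf")
    return sorted(calibration_scores)[k - 1]


# Usage: ten hypothetical calibration errors, target coverage 90%.
scores = [0.12, 0.05, 0.30, 0.08, 0.22, 0.15, 0.02, 0.27, 0.10, 0.18]
threshold = conformal_threshold(scores, alpha=0.1)
```

The guarantee depends only on the calibration data being exchangeable with future data, not on how the underlying model works, which is why the approach suits black-box generators.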

[CV-58] DiffEx: Explaining a Classifier with Diffusion Models to Identify Microscopic Cellular Variations

Link: https://arxiv.org/abs/2502.09663
Authors: Anis Bourou,Saranga Kingkor Mahanta,Thomas Boyer,Valérie Mezger,Auguste Genovesio
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Cell Behavior (q-bio.CB)
Comments:

[CV-59] Towards Fine-grained Interactive Segmentation in Images and Videos

Link: https://arxiv.org/abs/2502.09660
Authors: Yuan Yao,Qiushi Yang,Miaomiao Cui,Liefeng Bo
Affiliations: Institute for Intelligent Computing, Alibaba Group
Categories: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
Comments:

[CV-60] Integrating Spatiotemporal Vision Transformer into Digital Twins for High-Resolution Heat Stress Forecasting in Campus Environments

Link: https://arxiv.org/abs/2502.09657
Authors: Wenjing Gong,Xinyue Ye,Keshu Wu,Suphanut Jamonnak,Wenyu Zhang,Yifan Yang,Xiao Huang
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

[CV-61] Bidirectional Diffusion Bridge Models

Link: https://arxiv.org/abs/2502.09655
Authors: Duc Kieu,Kien Do,Toan Nguyen,Dang Nguyen,Thin Nguyen
Affiliations: Applied Artificial Intelligence Institute (A2I2), Deakin University
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Source code: this https URL

[CV-62] GraphCompNet: A Position-Aware Model for Predicting and Compensating Shape Deviations in 3D Printing

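[Quick read]: This paper addresses shape deviations in additive manufacturing (AM), particularly the challenges of geometric accuracy and batch production. Traditional methods such as analytical models and metrology laid the groundwork for geometric precision but are of limited use in large-scale production. Recent advances in machine learning (ML) have improved compensation precision, yet generalizing across complex geometries and adapting to position-dependent variations remain problematic. The key solution is GraphCompNet, a framework that combines graph-based neural networks with a generative adversarial network (GAN)-inspired training process. Using point cloud data and Dynamic Graph Convolutional Neural Networks (DGCNNs), it models complex shapes and incorporates position-specific thermal and mechanical factors. Through a two-stage adversarial training procedure, GraphCompNet iteratively refines compensated designs and provides real-time feedback, improving compensation accuracy across the entire print space by 35 to 65 percent and adapting to position-dependent variations. The work advances Digital Twin technology for AM, enabling scalable, real-time monitoring and compensation and addressing critical gaps in AM process control.

Link: https://arxiv.org/abs/2502.09652
Authors: Lei (Rachel) Chen,Juheon Lee,Juan Carlos Catana,Tsegai Yhdego,Nathan Moroney,Mohammad Amin Nabian,Hui Wang,Jun Zeng
Affiliations: IEEE Publication Technology Group
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 13 pages, 11 figures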

Abstract:This paper introduces a data-driven algorithm for modeling and compensating shape deviations in additive manufacturing (AM), addressing challenges in geometric accuracy and batch production. While traditional methods, such as analytical models and metrology, laid the groundwork for geometric precision, they are often impractical for large-scale production. Recent advancements in machine learning (ML) have improved compensation precision, but issues remain in generalizing across complex geometries and adapting to position-dependent variations. We present a novel approach for powder bed fusion (PBF) processes, using GraphCompNet, which is a computational framework combining graph-based neural networks with a generative adversarial network (GAN)-inspired training process. By leveraging point cloud data and dynamic graph convolutional neural networks (DGCNNs), GraphCompNet models complex shapes and incorporates position-specific thermal and mechanical factors. A two-stage adversarial training procedure iteratively refines compensated designs via a compensator-predictor architecture, offering real-time feedback and optimization. Experimental validation across diverse shapes and positions shows the framework significantly improves compensation accuracy (35 to 65 percent) across the entire print space, adapting to position-dependent variations. This work advances the development of Digital Twin technology for AM, enabling scalable, real-time monitoring and compensation, and addressing critical gaps in AM process control. The proposed method supports high-precision, automated industrial-scale design and manufacturing systems.

[CV-63] Imit Diff: Semantics Guided Diffusion Transformer with Dual Resolution Fusion for Imitation Learning

Link: https://arxiv.org/abs/2502.09649
Authors: Yuhang Dong,Haizhou Ge,Yupei Zeng,Jiangning Zhang,Beiwen Tian,Guanzhong Tian,Hongrui Zhu,Yufei Jia,Ruixiang Wang,Ran Yi,Guyue Zhou,Longhua Ma
Affiliations: Zhejiang University; Tsinghua University; Youtu Lab (Tencent); Harbin Institute of Technology (Weihai); Shanghai Jiao Tong University
Categories: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Robotics (cs.RO)
Comments:

[CV-64] iFormer: Integrating ConvNet and Transformer for Mobile Application ICLR2025

Link: https://arxiv.org/abs/2501.15369
Authors: Chuanyang Zheng
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Accepted to ICLR 2025. Code: this https URL

[CV-65] Dynamic-Computed Tomography Angiography for Cerebral Vessel Templates and Segmentation

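[Quick read]: This paper aims to develop and evaluate two techniques for segmenting vessels directly on CTA images: (1) creating and registering population-based vessel atlases, and (2) using deep learning (DL). Using 4D-CT data, the atlas-based approach generates artery and vein segmentations through non-linear registration combined with a CT attenuation threshold, while the DL approach trains a model on segmentations produced by the iCafe tool, with bone-in CT images as input. The DL model gives more accurate segmentation of arteries (average modified Dice coefficient, amDC, 0.856 vs. 0.324) and veins (amDC 0.743 vs. 0.495), clearly outperforming the atlas-based approach.

Link: https://arxiv.org/abs/2502.09893
Authors: Shrikanth Yadav,Jisoo Kim,Geoffrey Young,Lei Qin
Affiliations: Unknown
Categories: Medical Physics (physics.med-ph); Computer Vision and Pattern Recognition (cs.CV)
Comments: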

Abstract:Background: Computed Tomography Angiography (CTA) is crucial for cerebrovascular disease diagnosis. Dynamic CTA is a type of imaging that captures temporal information about the We aim to develop and evaluate two segmentation techniques to segment vessels directly on CTA images: (1) creating and registering population-averaged vessel atlases and (2) using deep learning (DL). Methods: We retrieved 4D-CT of the head from our institutional research database, with bone and soft tissue subtracted from post-contrast images. An Advanced Normalization Tools pipeline was used to create angiographic atlases from 25 patients. Then, atlas-driven ROIs were identified by a CT attenuation threshold to generate segmentation of the arteries and veins using non-linear registration. To create DL vessel segmentations, arterial and venous structures were segmented using the MRA vessel segmentation tool, iCafe, in 29 patients. These were then used to train a DL model, with bone-in CT images as input. Multiple phase images in the 4D-CT were used to increase the training and validation dataset. Both segmentation approaches were evaluated on a test 4D-CT dataset of 11 patients, which were also processed by iCafe and validated by a neuroradiologist. Specifically, branch-wise segmentation accuracy was quantified with 20 labels for arteries and one for veins. DL outperformed the atlas-based segmentation models for arteries (average modified Dice coefficient (amDC) 0.856 vs. 0.324) and veins (amDC 0.743 vs. 0.495) overall. For ICAs, vertebral and basilar arteries, DL and atlas-based segmentation had an amDC of 0.913 and 0.402, respectively. The amDC values for MCA-M1, PCA-P1, and ACA-A1 segments were 0.932 and 0.474, respectively. Conclusion: Angiographic CT templates are developed for the first time in the literature. Using 4D-CTA enables the use of tools like iCafe, lessening the burden of manual annotation.

[CV-66] Towards Patient-Specific Surgical Planning for Bicuspid Aortic Valve Repair: Fully Automated Segmentation of the Aortic Valve in 4D CT

Link: https://arxiv.org/abs/2502.09805
Authors: Zaiyang Guo,Ningjun J Dong,Harold Litt,Natalie Yushkevich,Melanie Freas,Jessica Nunez,Victor Ferrari,Jilei Hao,Shir Goldfinger,Matthew A. Jolley,Joseph Bavaria,Nimesh Desai,Alison M. Pouch
Affiliations: Unknown
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments:

[CV-67] Acute Lymphoblastic Leukemia Diagnosis Employing YOLOv11 YOLOv8 ResNet50 and Inception-ResNet-v2 Deep Learning Models

Link: https://arxiv.org/abs/2502.09804
Authors: Alaa Awad,Salah A. Aly
Affiliations: CS & IT Department, E-Japan University of Science and Tech.; Faculty of Computing and Data Science, Badya University; CS & Math Section, Faculty of Science, Fayoum University
Categories: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 12 pages, 28 figures, 5 tables

[CV-68] Atom identification in bilayer moire materials with Gomb-Net

Link: https://arxiv.org/abs/2502.09791
Authors: Austin C. Houston,Sumner B. Harris,Hao Wang,Yu-Chuan Lin,David B. Geohegan,Kai Xiao,Gerd Duscher
Affiliations: Unknown
Categories: Materials Science (cond-mat.mtrl-sci); Computer Vision and Pattern Recognition (cs.CV)
Comments:

[CV-69] Automated Muscle and Fat Segmentation in Computed Tomography for Comprehensive Body Composition Analysis

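[Quick read]: This paper addresses the clinical need for body composition assessment, in particular the lack of widely available tools for segmenting skeletal muscle, subcutaneous adipose tissue (SAT), and visceral adipose tissue (VAT) in CT images. The key solution is a publicly accessible, end-to-end segmentation and feature calculation model for CT body composition analysis. The model segments skeletal muscle, SAT, and VAT in axial CT images of the chest, abdomen, and pelvis, and provides body composition metrics including muscle density, VAT/SAT ratio, muscle area/volume, and skeletal muscle index (SMI), supporting both 2D and 3D assessments. It achieves high Dice coefficients on both internal and external datasets and outperforms the existing benchmark on skeletal muscle and SAT segmentation.

Link: https://arxiv.org/abs/2502.09779
Authors: Yaqian Chen,Hanxue Gu,Yuwen Chen,Jicheng Yang,Haoyu Dong,Joseph Y. Cao,Adrian Camarena,Christopher Mantyh,Roy Colglazier,Maciej A. Mazurowski
Affiliations: Unknown
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: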

Abstract:Body composition assessment using CT images can potentially be used for a number of clinical applications, including the prognostication of cardiovascular outcomes, evaluation of metabolic health, monitoring of disease progression, assessment of nutritional status, prediction of treatment response in oncology, and risk stratification for surgical and critical care outcomes. While multiple groups have developed in-house segmentation tools for this analysis, there are very limited publicly available tools that could be consistently used across different applications. To mitigate this gap, we present a publicly accessible, end-to-end segmentation and feature calculation model specifically for CT body composition analysis. Our model performs segmentation of skeletal muscle, subcutaneous adipose tissue (SAT), and visceral adipose tissue (VAT) across the chest, abdomen, and pelvis area in axial CT images. It also provides various body composition metrics, including muscle density, visceral-to-subcutaneous fat (VAT/SAT) ratio, muscle area/volume, and skeletal muscle index (SMI), supporting both 2D and 3D assessments. The model is shared for public use. To evaluate the model, the segmentation was applied to both internal and external datasets, with body composition metrics analyzed across different age, sex, and race groups. The model achieved high dice coefficients on both internal and external datasets, exceeding 89% for skeletal muscle, SAT, and VAT segmentation. The model outperforms the benchmark by 2.40% on skeletal muscle and 10.26% on SAT compared to the manual annotations given by the publicly available dataset. Body composition metrics show mean relative absolute errors (MRAEs) under 10% for all measures. Furthermore, the model provided muscular fat segmentation with a Dice coefficient of 56.27%, which can be utilized for additional analyses as needed.
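
Two of the quantities reported in this abstract are simple to compute, and a short sketch may help make them concrete. The Dice formula is the standard one. The skeletal muscle index uses the conventional definition (cross-sectional muscle area in cm² divided by height in metres squared), which is an assumption about how the authors compute it; the function names and the example numbers are illustrative only.

```python
def dice_coefficient(pred, target):
    """Dice = 2 |P ∩ T| / (|P| + |T|) for binary masks given as nested lists of 0/1."""
    intersection = pred_sum = target_sum = 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            intersection += p * t
            pred_sum += p
            target_sum += t
    if pred_sum + target_sum == 0:
        # Both masks empty: treat as perfect agreement.
        return 1.0
    return 2.0 * intersection / (pred_sum + target_sum)


def skeletal_muscle_index(muscle_area_cm2, height_m):
    """Conventional SMI: muscle cross-sectional area normalised by height squared."""
    return muscle_area_cm2 / height_m ** 2


# Usage with made-up values.
predicted = [[1, 1], [0, 0]]
reference = [[1, 0], [0, 0]]
overlap = dice_coefficient(predicted, reference)
smi = skeletal_muscle_index(150.0, 1.75)
```

A Dice value of 1 means the predicted and reference masks coincide exactly, and 0 means they do not overlap at all.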

[CV-70] CellFlow: Simulating Cellular Morphology Changes via Flow Matching

Link: https://arxiv.org/abs/2502.09775
Authors: Yuhui Zhang,Yuchang Su,Chenyu Wang,Tianhong Li,Zoe Wefers,Jeffrey Nirschl,James Burgess,Daisy Ding,Alejandro Lozano,Emma Lundberg,Serena Yeung-Levy
Affiliation: Unknown
Categories: Quantitative Methods (q-bio.QM); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Biomolecules (q-bio.BM); Cell Behavior (q-bio.CB)
Comments:

Click to view abstract

[CV-71] Generalizable Cervical Cancer Screening via Large-scale Pretraining and Test-Time Adaptation

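[Quick Read]: This paper addresses the limited generalization of existing AI-assisted cytology screening systems in complex clinical scenarios. The key to the solution is the Smart-CCS paradigm, which combines large-scale self-supervised pretraining with test-time adaptation to build cervical cancer screening systems that generalize well. This lets the system maintain high performance across different clinical settings, improving accuracy and reliability in practical use.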

Link: https://arxiv.org/abs/2502.09662
Authors: Hao Jiang,Cheng Jin,Huangjing Lin,Yanning Zhou,Xi Wang,Jiabo Ma,Li Ding,Jun Hou,Runsheng Liu,Zhizhong Chai,Luyang Luo,Huijuan Shi,Yinling Qian,Qiong Wang,Changzhong Li,Anjia Han,Ronald Cheong Kin Chan,Hao Chen
Affiliation: Unknown
Categories: Quantitative Methods (q-bio.QM); Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
Comments:

Click to view abstract

Abstract:Cervical cancer is a leading malignancy of the female reproductive system. While AI-assisted cytology offers a cost-effective and non-invasive screening solution, current systems struggle with generalizability in complex clinical scenarios. To address this issue, we introduced Smart-CCS, a generalizable Cervical Cancer Screening paradigm based on pretraining and adaptation to create robust and generalizable screening systems. To develop and validate Smart-CCS, we first curated a large-scale, multi-center dataset named CCS-127K, which comprises a total of 127,471 cervical cytology whole-slide images collected from 48 medical centers. By leveraging large-scale self-supervised pretraining, our CCS models are equipped with strong generalization capability, potentially generalizing across diverse scenarios. Then, we incorporated test-time adaptation to specifically optimize the trained CCS model for complex clinical settings, which adapts and refines predictions, improving real-world applicability. We conducted large-scale system evaluation among various cohorts. In retrospective cohorts, Smart-CCS achieved an overall area under the curve (AUC) value of 0.965 and sensitivity of 0.913 for cancer screening on 11 internal test datasets. In external testing, system performance remained high, with an AUC of 0.950 across 6 independent test datasets. In prospective cohorts, our Smart-CCS achieved AUCs of 0.947, 0.924, and 0.986 in three prospective centers, respectively. Moreover, the system demonstrated superior sensitivity in diagnosing cervical cancer, confirming the accuracy of our cancer screening results by using histology findings for validation. Interpretability analysis with cell and slide predictions further indicated that the system’s decision-making aligns with clinical practice. Smart-CCS represents a significant advancement in cancer screening across diverse clinical contexts.

[CV-72] Multi-Omics Fusion with Soft Labeling for Enhanced Prediction of Distant Metastasis in Nasopharyngeal Carcinoma Patients after Radiotherapy

Link: https://arxiv.org/abs/2502.09656
Authors: Jiabao Sheng,SaiKit Lam,Jiang Zhang,Yuanpeng Zhang,Jing Cai
Affiliation: Unknown
Categories: Quantitative Methods (q-bio.QM); Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
Comments:

Click to view abstract

[CV-73] Heterogeneous Mixture of Experts for Remote Sensing Image Super-Resolution

Link: https://arxiv.org/abs/2502.09654
Authors: Bowen Chen,Keyan Chen,Mohan Yang,Zhengxia Zou,Zhenwei Shi
Affiliation: Beihang University; State Key Laboratory of Virtual Reality Technology and Systems
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Click to view abstract

[CV-74] SASVi - Segment Any Surgical Video

Link: https://arxiv.org/abs/2502.09653
Authors: Ssharvien Kumar Sivakumar,Yannik Frisch,Amin Ranem,Anirban Mukhopadhyay
Affiliation: Unknown
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Click to view abstract

Artificial Intelligence

[AI-0] Representation and Interpretation in Artificial and Natural Computing

Link: https://arxiv.org/abs/2502.10383
Authors: Luis A. Pineda
Categories: Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Artificial computing machinery transforms representations through an objective process, to be interpreted subjectively by humans, so the machine and the interpreter are different entities, but in the putative natural computing both processes are performed by the same agent. The method or process that transforms a representation is called here the mode of computing. The mode used by digital computers is the algorithmic one, but there are others, such as quantum computers and diverse forms of non-conventional computing, and there is an open-ended set of representational formats and modes that could be used in artificial and natural computing. A mode based on a notion of computing different from Turing’s may perform feats beyond what the Turing Machine does but the modes would not be of the same kind and could not be compared. For a mode of computing to be more powerful than the algorithmic one, it ought to compute functions lacking an effective algorithm, and Church Thesis would not hold. Here, a thought experiment including a computational demon using a hypothetical mode for such an effect is presented. If there is natural computing, there is a mode of natural computing whose properties may be causal to the phenomenological experience. Discovering it would come with solving the hard problem of consciousness; but if it turns out that such a mode does not exist, there is no such thing as natural computing, and the mind is not a computational process.

[AI-1] BeamDojo: Learning Agile Humanoid Locomotion on Sparse Footholds

Link: https://arxiv.org/abs/2502.10363
Authors: Huayi Wang,Zirui Wang,Junli Ren,Qingwei Ben,Tao Huang,Weinan Zhang,Jiangmiao Pang
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Project website: this https URL

Click to view abstract

Abstract:Traversing risky terrains with sparse footholds poses a significant challenge for humanoid robots, requiring precise foot placements and stable locomotion. Existing approaches designed for quadrupedal robots often fail to generalize to humanoid robots due to differences in foot geometry and unstable morphology, while learning-based approaches for humanoid locomotion still face great challenges on complex terrains due to sparse foothold reward signals and inefficient learning processes. To address these challenges, we introduce BeamDojo, a reinforcement learning (RL) framework designed for enabling agile humanoid locomotion on sparse footholds. BeamDojo begins by introducing a sampling-based foothold reward tailored for polygonal feet, along with a double critic to balance the learning process between dense locomotion rewards and sparse foothold rewards. To encourage sufficient trial-and-error exploration, BeamDojo incorporates a two-stage RL approach: the first stage relaxes the terrain dynamics by training the humanoid on flat terrain while providing it with task terrain perceptive observations, and the second stage fine-tunes the policy on the actual task terrain. Moreover, we implement an onboard LiDAR-based elevation map to enable real-world deployment. Extensive simulation and real-world experiments demonstrate that BeamDojo achieves efficient learning in simulation and enables agile locomotion with precise foot placement on sparse footholds in the real world, maintaining a high success rate even under significant external disturbances.

[AI-2] Process Reward Models for LLM Agents: Practical Framework and Directions

Link: https://arxiv.org/abs/2502.10325
Authors: Sanjiban Choudhury
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 17 pages, 7 figures

Click to view abstract

Abstract:We introduce Agent Process Reward Models (AgentPRM), a simple and scalable framework for training LLM agents to continually improve through interactions. AgentPRM follows a lightweight actor-critic paradigm, using Monte Carlo rollouts to compute reward targets and optimize policies. It requires minimal modifications to existing RLHF pipelines, making it easy to integrate at scale. Beyond AgentPRM, we propose InversePRM, which learns process rewards directly from demonstrations without explicit outcome supervision. We also explore key challenges and opportunities, including exploration, process reward shaping, and model-predictive reasoning. We evaluate on ALFWorld benchmark, show that small 3B models trained with AgentPRM and InversePRM outperform strong GPT-4o baselines, and analyze test-time scaling, reward hacking, and more. Our code is available at: this https URL.
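The abstract mentions using Monte Carlo rollouts to compute reward targets. A minimal sketch of that general idea, assuming each rollout ends with a single outcome reward and that a step's target is the average outcome over the rollouts that pass through it, could look like the following. The data layout and the name `monte_carlo_targets` are hypothetical and are not AgentPRM's actual interface.

```python
from collections import defaultdict


def monte_carlo_targets(rollouts):
    """Average the final outcome reward over all rollouts that visit each step.

    rollouts: list of (steps, outcome), where steps is a list of
    (state, action) pairs and outcome is the terminal reward
    (for example 1.0 for success and 0.0 for failure).
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for steps, outcome in rollouts:
        for step in set(steps):  # count each step once per rollout
            totals[step] += outcome
            counts[step] += 1
    return {step: totals[step] / counts[step] for step in totals}


# Three made-up rollouts in a text-game-like setting.
rollouts = [
    ([("s0", "open"), ("s1", "take")], 1.0),
    ([("s0", "open"), ("s1", "drop")], 0.0),
    ([("s0", "wait")], 0.0),
]
targets = monte_carlo_targets(rollouts)
print(targets[("s0", "open")])
```

Targets like these could then serve as regression labels for a learned process reward model, which is the role the abstract assigns to the rollouts.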

[AI-3] ExplainReduce: Summarising local explanations via proxies

Link: https://arxiv.org/abs/2502.10311
Authors: Lauri Seppäläinen,Mudong Guo,Kai Puolamäki
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
Comments: 22 pages with a 7 page appendix, 7 + 5 figures, 2 tables. The datasets and source code used in the paper are available at this https URL

Click to view abstract

Abstract:Most commonly used non-linear machine learning methods are closed-box models, uninterpretable to humans. The field of explainable artificial intelligence (XAI) aims to develop tools to examine the inner workings of these closed boxes. An often-used model-agnostic approach to XAI involves using simple models as local approximations to produce so-called local explanations; examples of this approach include LIME, SHAP, and SLISEMAP. This paper shows how a large set of local explanations can be reduced to a small “proxy set” of simple models, which can act as a generative global explanation. This reduction procedure, ExplainReduce, can be formulated as an optimisation problem and approximated efficiently using greedy heuristics.
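The abstract describes reducing many local explanations to a small proxy set with greedy heuristics. The sketch below illustrates one plausible version of such a heuristic, a greedy maximum-coverage selection in which a local model "covers" an item when its loss on that item is at most a threshold. The objective, the threshold `epsilon`, and the function name are assumptions for illustration; the paper's exact formulation may differ.

```python
def greedy_proxy_set(losses, epsilon, k):
    """Greedily pick up to k models, each time adding the one that
    covers the most items not yet covered.

    losses[i][j] is the loss of local model i on item j; model i
    covers item j when that loss is at most epsilon.
    """
    n_items = len(losses[0])
    covers = [{j for j in range(n_items) if row[j] <= epsilon} for row in losses]
    chosen, covered = [], set()
    for _ in range(k):
        best = max(range(len(losses)), key=lambda i: len(covers[i] - covered))
        if not covers[best] - covered:
            break  # no remaining model adds coverage
        chosen.append(best)
        covered |= covers[best]
    return chosen, len(covered) / n_items


# Four made-up local models evaluated on four items.
losses = [
    [0.1, 0.2, 0.9, 0.9],
    [0.9, 0.9, 0.1, 0.9],
    [0.1, 0.9, 0.9, 0.9],
    [0.9, 0.9, 0.2, 0.1],
]
proxies, coverage = greedy_proxy_set(losses, epsilon=0.3, k=2)
print(proxies, coverage)
```

Here two of the four local models are enough to explain every item within the threshold, which is the kind of compression the abstract refers to.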

[AI-4] LLM-Powered Preference Elicitation in Combinatorial Assignment

Link: https://arxiv.org/abs/2502.10308
Authors: Ermis Soumalias,Yanchen Jiang,Kehang Zhu,Michael Curry,Sven Seuken,David C. Parkes
Categories: Artificial Intelligence (cs.AI); Computer Science and Game Theory (cs.GT); Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:We study the potential of large language models (LLMs) as proxies for humans to simplify preference elicitation (PE) in combinatorial assignment. While traditional PE methods rely on iterative queries to capture preferences, LLMs offer a one-shot alternative with reduced human effort. We propose a framework for LLM proxies that can work in tandem with SOTA ML-powered preference elicitation schemes. Our framework handles the novel challenges introduced by LLMs, such as response variability and increased computational costs. We experimentally evaluate the efficiency of LLM proxies against human queries in the well-studied course allocation domain, and we investigate the model capabilities required for success. We find that our approach improves allocative efficiency by up to 20%, and these results are robust across different LLMs and to differences in quality and accuracy of reporting.

[AI-5] Reinforcement Learning in Strategy-Based and Atari Games: A Review of Google DeepMind's Innovations

Link: https://arxiv.org/abs/2502.10303
Authors: Abdelrhman Shaheen,Anas Badr,Ali Abohendy,Hatem Alsaadawy,Nadine Alsayad
Categories: Artificial Intelligence (cs.AI); Computer Science and Game Theory (cs.GT)
Comments:

Click to view abstract

Abstract:Reinforcement Learning (RL) has been widely used in many applications, particularly in gaming, which serves as an excellent training ground for AI models. Google DeepMind has pioneered innovations in this field, employing reinforcement learning algorithms, including model-based, model-free, and deep Q-network approaches, to create advanced AI models such as AlphaGo, AlphaGo Zero, and MuZero. AlphaGo, the initial model, integrates supervised learning and reinforcement learning to master the game of Go, surpassing professional human players. AlphaGo Zero refines this approach by eliminating reliance on human gameplay data, instead utilizing self-play for enhanced learning efficiency. MuZero further extends these advancements by learning the underlying dynamics of game environments without explicit knowledge of the rules, achieving adaptability across various games, including complex Atari games. This paper reviews the significance of reinforcement learning applications in Atari and strategy-based games, analyzing these three models, their key innovations, training processes, challenges encountered, and improvements made. Additionally, we discuss advancements in the field of gaming, including MiniZero and multi-agent models, highlighting future directions and emerging AI models from Google DeepMind.

[AI-6] A Hybrid Cross-Stage Coordination Pre-ranking Model for Online Recommendation Systems WWW2025

Link: https://arxiv.org/abs/2502.10284
Authors: Binglei Zhao,Houying Qi,Guang Xu,Mian Ma,Xiwei Zhao,Feng Mei,Sulong Xu,Jinghe Hu
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
Comments: Accepted by WWW 2025

Click to view abstract

Abstract:Large-scale recommendation systems often adopt cascading architecture consisting of retrieval, pre-ranking, ranking, and re-ranking stages. With strict latency requirements, pre-ranking utilizes lightweight models to perform a preliminary selection from massive retrieved candidates. However, recent works focus solely on improving consistency with ranking, relying exclusively on downstream stages. Since downstream input is derived from the pre-ranking output, they will exacerbate the sample selection bias (SSB) issue and Matthew effect, leading to sub-optimal results. To address the limitation, we propose a novel Hybrid Cross-Stage Coordination Pre-ranking model (HCCP) to integrate information from upstream (retrieval) and downstream (ranking, re-ranking) stages. Specifically, cross-stage coordination refers to the pre-ranking’s adaptability to the entire stream and the role of serving as a more effective bridge between upstream and downstream. HCCP consists of Hybrid Sample Construction and Hybrid Objective Optimization. Hybrid sample construction captures multi-level unexposed data from the entire stream and rearranges them to become the optimal guiding “ground truth” for pre-ranking learning. Hybrid objective optimization contains the joint optimization of consistency and long-tail precision through our proposed Margin InfoNCE loss. It is specifically designed to learn from such hybrid unexposed samples, improving the overall performance and mitigating the SSB issue. The appendix describes a proof of the efficacy of the proposed loss in selecting potential positives. Extensive offline and online experiments indicate that HCCP outperforms SOTA methods by improving cross-stage coordination. It contributes up to 14.9% UCVR and 1.3% UCTR in the JD E-commerce recommendation system. Concerning code privacy, we provide a pseudocode for reference.

[AI-7] Efficient Zero-Order Federated Finetuning of Language Models for Resource-Constrained Devices

Link: https://arxiv.org/abs/2502.10239
Authors: Mohamed Aboelenien Ahmed,Kilian Pfeiffer,Ramin Khalili,Heba Khdr,Jörg Henkel
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Federated fine-tuning offers a promising approach for tuning Large Language Models (LLMs) on edge devices while preserving data privacy. However, fine-tuning these models on edge devices remains challenging due to high memory, communication, and computational demands. Zero-order optimization with task alignment provides a potential solution, enabling fine-tuning with inference-level memory requirements, but it requires a longer convergence time. In this paper, we propose Federated Split-Perturbation Zero-order Optimization (FedSPZO) that divides the network into two blocks, applying a different number of perturbations per block in a computationally effective way, achieving faster convergence. Our evaluation shows a 2.5× to 7× reduction in computation overhead compared to state-of-the-art zero-order techniques in federated learning.
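Zero-order optimization estimates gradients from forward passes only, which is why it needs no more memory than inference. The sketch below shows the generic two-point estimator on a toy quadratic loss; it is not FedSPZO's split-perturbation scheme, and the step size, perturbation count, and function names are illustrative assumptions.

```python
import random


def zo_gradient(loss, theta, eps=1e-3, num_perturbations=1, rng=random):
    """Two-point zero-order estimate: the average over random directions z of
    (loss(theta + eps*z) - loss(theta - eps*z)) / (2*eps) * z."""
    grad = [0.0] * len(theta)
    for _ in range(num_perturbations):
        z = [rng.gauss(0.0, 1.0) for _ in theta]
        plus = loss([t + eps * zi for t, zi in zip(theta, z)])
        minus = loss([t - eps * zi for t, zi in zip(theta, z)])
        scale = (plus - minus) / (2.0 * eps)
        grad = [g + scale * zi / num_perturbations for g, zi in zip(grad, z)]
    return grad


def quadratic(theta):
    """Toy loss with its minimum at the origin."""
    return sum(t * t for t in theta)


rng = random.Random(0)
theta = [1.0, -2.0, 0.5]
for _ in range(500):
    g = zo_gradient(quadratic, theta, num_perturbations=4, rng=rng)
    theta = [t - 0.02 * gi for t, gi in zip(theta, g)]
print(quadratic(theta))  # far below the starting loss of 5.25
```

Each perturbation costs two forward passes and no backward pass. More perturbations lower the variance of the estimate at a higher compute cost, which is the trade-off the abstract says FedSPZO tunes per block.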

[AI-8] Shaping Inductive Bias in Diffusion Models through Frequency-Based Noise Control

Link: https://arxiv.org/abs/2502.10236
Authors: Thomas Jiralerspong,Berton Earnshaw,Jason Hartford,Yoshua Bengio,Luca Scimeca
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Diffusion Probabilistic Models (DPMs) are powerful generative models that have achieved unparalleled success in a number of generative tasks. In this work, we aim to build inductive biases into the training and sampling of diffusion models to better accommodate the target distribution of the data to model. For topologically structured data, we devise a frequency-based noising operator to purposefully manipulate, and set, these inductive biases. We first show that appropriate manipulations of the noising forward process can lead DPMs to focus on particular aspects of the distribution to learn. We show that different datasets necessitate different inductive biases, and that appropriate frequency-based noise control induces increased generative performance compared to standard diffusion. Finally, we demonstrate the possibility of ignoring information at particular frequencies while learning. We show this in an image corruption and recovery task, where we train a DPM to recover the original target distribution after severe noise corruption.

[AI-9] A Multiagent Path Search Algorithm for Large-Scale Coalition Structure Generation AAAI2025

Link: https://arxiv.org/abs/2502.10226
Authors: Redha Taguelmimt,Samir Aknine,Djamila Boukredera,Narayan Changder,Tuomas Sandholm
Categories: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI); Computer Science and Game Theory (cs.GT)
Comments: Long and updated version to the published paper in the Proceedings of the 39th Annual AAAI Conference on Artificial Intelligence (AAAI 2025)

Click to view abstract

Abstract:Coalition structure generation (CSG), i.e. the problem of optimally partitioning a set of agents into coalitions to maximize social welfare, is a fundamental computational problem in multiagent systems. This problem is important for many applications where small run times are necessary, including transportation and disaster response. In this paper, we develop SALDAE, a multiagent path finding algorithm for CSG that operates on a graph of coalition structures. Our algorithm utilizes a variety of heuristics and strategies to perform the search and guide it. It is an anytime algorithm that can handle large problems with hundreds and thousands of agents. We show empirically on nine standard value distributions, including disaster response and electric vehicle allocation benchmarks, that our algorithm enables a rapid finding of high-quality solutions and compares favorably with other state-of-the-art methods.
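To make the problem statement concrete, the sketch below solves coalition structure generation by exhaustive enumeration: it lists every partition of the agents and keeps the one with the highest total value. This is only a brute-force baseline for tiny instances, not the SALDAE algorithm, and the value function is made up for illustration.

```python
def partitions(agents):
    """Yield every partition of `agents` (a list) into non-empty coalitions."""
    if not agents:
        yield []
        return
    first, rest = agents[0], agents[1:]
    for partial in partitions(rest):
        # Put `first` into each existing coalition in turn...
        for i in range(len(partial)):
            yield partial[:i] + [[first] + partial[i]] + partial[i + 1:]
        # ...or into a new coalition of its own.
        yield [[first]] + partial


def welfare(structure, value):
    """Social welfare: the sum of coalition values."""
    return sum(value(frozenset(c)) for c in structure)


def best_coalition_structure(agents, value):
    return max(partitions(agents), key=lambda cs: welfare(cs, value))


def toy_value(coalition):
    """Made-up value that depends only on coalition size."""
    return {1: 1.0, 2: 3.0, 3: 3.5}[len(coalition)]


best = best_coalition_structure(["a", "b", "c"], toy_value)
print(best, welfare(best, toy_value))
```

The number of partitions is the Bell number of the agent count, which grows faster than exponentially. That growth is why anytime search methods such as the one in the abstract matter for hundreds or thousands of agents.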

[AI-10] Forget the Data and Fine-Tuning! Just Fold the Network to Compress ICLR

Link: https://arxiv.org/abs/2502.10216
Authors: Dong Wang,Haris Šikić,Lothar Thiele,Olga Saukh
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: This paper has been accepted by The Thirteenth International Conference on Learning Representations (ICLR), 2025

Click to view abstract

Abstract:We introduce model folding, a novel data-free model compression technique that merges structurally similar neurons across layers, significantly reducing the model size without the need for fine-tuning or access to training data. Unlike existing methods, model folding preserves data statistics during compression by leveraging k-means clustering, and using novel data-free techniques to prevent variance collapse or explosion. Our theoretical framework and experiments across standard benchmarks, including ResNet18 and LLaMA-7B, demonstrate that model folding achieves comparable performance to data-driven compression techniques and outperforms recently proposed data-free methods, especially at high sparsity levels. This approach is particularly effective for compressing large-scale models, making it suitable for deployment in resource-constrained environments.

[AI-11] Do Large Language Models Reason Causally Like Us? Even Better?

Link: https://arxiv.org/abs/2502.10215
Authors: Hanna M. Dettki,Brenden M. Lake,Charley M. Wu,Bob Rehder
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:Causal reasoning is a core component of intelligence. Large language models (LLMs) have shown impressive capabilities in generating human-like text, raising questions about whether their responses reflect true understanding or statistical patterns. We compared causal reasoning in humans and four LLMs using tasks based on collider graphs, rating the likelihood of a query variable occurring given evidence from other variables. We find that LLMs reason causally along a spectrum from human-like to normative inference, with alignment shifting based on model, context, and task. Overall, GPT-4o and Claude showed the most normative behavior, including “explaining away”, whereas Gemini-Pro and GPT-3.5 did not. Although all agents deviated from the expected independence of causes - Claude the least - they exhibited strong associative reasoning and predictive inference when assessing the likelihood of the effect given its causes. These findings underscore the need to assess AI biases as they increasingly assist human decision-making.
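"Explaining away" in a collider graph C1 → E ← C2 means that, once the effect E is observed, learning that C2 is present lowers the probability of C1. The sketch below computes this normative answer by enumeration for a toy noisy-OR model; the prior, causal strength, and background rate are illustrative assumptions and are not the parameters used in the study.

```python
from itertools import product


def p_effect(c1, c2, strength=0.8, background=0.1):
    """Noisy-OR likelihood P(E=1 | C1=c1, C2=c2); parameter values are illustrative."""
    return 1.0 - (1.0 - background) * (1.0 - strength) ** (c1 + c2)


def posterior_c1(prior=0.3, c2_observed=None):
    """P(C1=1 | E=1), or P(C1=1 | E=1, C2=c2_observed) when c2_observed is given.

    The two causes are independent a priori, each present with probability `prior`.
    """
    num = den = 0.0
    for c1, c2 in product((0, 1), repeat=2):
        if c2_observed is not None and c2 != c2_observed:
            continue
        joint = (prior if c1 else 1 - prior) * (prior if c2 else 1 - prior) * p_effect(c1, c2)
        den += joint
        num += joint * c1
    return num / den


effect_only = posterior_c1()
effect_and_other_cause = posterior_c1(c2_observed=1)
print(effect_only, effect_and_other_cause)
print("explaining away:", effect_and_other_cause < effect_only)
```

A rater who reports the same likelihood for C1 in both conditions fails to explain away, which is the pattern the abstract attributes to Gemini-Pro and GPT-3.5.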

[AI-12] Dynamic Reinforcement Learning for Actors

Link: https://arxiv.org/abs/2502.10200
Authors: Katsunari Shibata
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Neural and Evolutionary Computing (cs.NE)
Comments: 31 pages, 20 figures

Click to view abstract

Abstract:Dynamic Reinforcement Learning (Dynamic RL), proposed in this paper, directly controls system dynamics, instead of the actor (action-generating neural network) outputs at each moment, bringing about a major qualitative shift in reinforcement learning (RL) from static to dynamic. The actor is initially designed to generate chaotic dynamics through the loop with its environment, enabling the agent to perform flexible and deterministic exploration. Dynamic RL controls global system dynamics using a local index called “sensitivity,” which indicates how much the input neighborhood contracts or expands into the corresponding output neighborhood through each neuron’s processing. While sensitivity adjustment learning (SAL) prevents excessive convergence of the dynamics, sensitivity-controlled reinforcement learning (SRL) adjusts them – to converge more to improve reproducibility around better state transitions with positive TD error and to diverge more to enhance exploration around worse transitions with negative TD error. Dynamic RL was applied only to the actor in an Actor-Critic RL architecture while applying it to the critic remains a challenge. It was tested on two dynamic tasks and functioned effectively without external exploration noise or backward computation through time. Moreover, it exhibited excellent adaptability to new environments, although some problems remain. Drawing parallels between ‘exploration’ and ‘thinking,’ the author hypothesizes that “exploration grows into thinking through learning” and believes this RL could be a key technique for the emergence of thinking, including inspiration that cannot be reconstructed from massive existing text data. Finally, despite being presumptuous, the author presents the argument that this research should not proceed due to its potentially fatal risks, aiming to encourage discussion.

[AI-13] MathConstruct: Challenging LLM Reasoning with Constructive Proofs

Link: https://arxiv.org/abs/2502.10197
Authors: Mislav Balunović,Jasper Dekoninck,Nikola Jovanović,Ivo Petrov,Martin Vechev
Categories: Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:While Large Language Models (LLMs) demonstrate impressive performance in mathematics, existing math benchmarks come with significant limitations. Many focus on problems with fixed ground-truth answers, and are often saturated due to problem simplicity or the viability of guessing or memorization. Crucially, they capture only a narrow subset of relevant math problems. To address this research gap, we introduce MathConstruct, a new benchmark of 126 challenging problems sourced from various math competitions, which targets constructive proofs, a widely encountered problem type requiring the construction of mathematical objects with specific properties. These proofs are particularly suitable for LLM evaluation, as solution correctness can be easily verified. Our automated verifiers also enable MathConstruct to generate problem variations, used to evaluate robustness. State-of-the-art LLMs solve only 54% of MathConstruct problems, highlighting its complexity and importance for LLM evaluation.

[AI-14] Merging public elementary schools to reduce racial/ethnic segregation

Link: https://arxiv.org/abs/2502.10193
Authors: Madison Landry,Nabeel Gillani
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
Comments: Forthcoming in PNAS Nexus

Click to view abstract

Abstract:Diverse schools can help address implicit biases and increase empathy, mutual respect, and reflective thought by fostering connections between students from different racial/ethnic, socioeconomic, and other backgrounds. Unfortunately, demographic segregation remains rampant in US public schools, despite over 70 years since the passing of federal legislation formally outlawing segregation by race. However, changing how students are assigned to schools can help foster more integrated learning environments. In this paper, we explore “school mergers” as one such under-explored, yet promising, student assignment policy change. School mergers involve merging the school attendance boundaries, or catchment areas, of schools and subsequently changing the grades each school offers. We develop an algorithm to simulate elementary school mergers across 200 large school districts serving 4.5 million elementary school students and find that pairing or tripling schools in this way could reduce racial/ethnic segregation by a median relative 20% – and as much as nearly 60% in some districts – while increasing driving times to schools by an average of a few minutes each way. Districts with many interfaces between racially/ethnically-disparate neighborhoods tend to be prime candidates for mergers. We also compare the expected results of school mergers to other typical integration policies, like redistricting, and find that different policies may be more or less suitable in different places. Finally, we make our results available through a public dashboard for policymakers and community members to explore further (this https URL). Together, our study offers new findings and tools to support integration policy-making across US public school districts.

[AI-15] From Markov to Laplace: How Mamba In-Context Learns Markov Chains

Link: https://arxiv.org/abs/2502.10178
Authors: Marco Bondaschi,Nived Rajaraman,Xiuying Wei,Kannan Ramchandran,Razvan Pascanu,Caglar Gulcehre,Michael Gastpar,Ashok Vardhan Makkuva
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Information Theory (cs.IT)
Comments:

Click to view abstract

Abstract:While transformer-based language models have driven the AI revolution thus far, their computational complexity has spurred growing interest in viable alternatives, such as structured state space sequence models (SSMs) and Selective SSMs. Among these, Mamba (S6) and its variant Mamba-2 have shown remarkable inference speed ups over transformers while achieving comparable or superior performance on complex language modeling tasks. However, despite these architectural innovations and empirical successes, the fundamental learning capabilities of Mamba remain poorly understood. In this paper, we address this gap by studying in-context learning (ICL) on Markov chains and uncovering a surprising phenomenon: unlike transformers, even a single-layer Mamba efficiently learns the in-context Laplacian smoothing estimator, which is both Bayes and minimax optimal, for all Markovian orders. To explain this, we theoretically characterize the representation capacity of Mamba and reveal the fundamental role of convolution in enabling it to represent the optimal Laplacian smoothing. These theoretical insights align strongly with empirical results and, to the best of our knowledge, represent the first formal connection between Mamba and optimal statistical estimators. Finally, we outline promising research directions inspired by these findings.

[AI-16] STMA: A Spatio-Temporal Memory Agent for Long-Horizon Embodied Task Planning

Link: https://arxiv.org/abs/2502.10177
Authors: Mingcong Lei,Yiming Zhao,Ge Wang,Zhixin Mai,Shuguang Cui,Yatong Han,Jinke Ren
Categories: Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:A key objective of embodied intelligence is enabling agents to perform long-horizon tasks in dynamic environments while maintaining robust decision-making and adaptability. To achieve this goal, we propose the Spatio-Temporal Memory Agent (STMA), a novel framework designed to enhance task planning and execution by integrating spatio-temporal memory. STMA is built upon three critical components: (1) a spatio-temporal memory module that captures historical and environmental changes in real time, (2) a dynamic knowledge graph that facilitates adaptive spatial reasoning, and (3) a planner-critic mechanism that iteratively refines task strategies. We evaluate STMA in the TextWorld environment on 32 tasks, involving multi-step planning and exploration under varying levels of complexity. Experimental results demonstrate that STMA achieves a 31.25% improvement in success rate and a 24.7% increase in average score compared to the state-of-the-art model. The results highlight the effectiveness of spatio-temporal memory in advancing the memory capabilities of embodied agents.

[AI-17] Technical Risks of (Lethal) Autonomous Weapons Systems

Link: https://arxiv.org/abs/2502.10174
Authors: Heramb Podar,Alycia Colijn
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Systems and Control (eess.SY)
Comments:

Click to view abstract

Abstract:The autonomy and adaptability of (Lethal) Autonomous Weapons Systems, (L)AWS in short, promise unprecedented operational capabilities, but they also introduce profound risks that challenge the principles of control, accountability, and stability in international security. This report outlines the key technological risks associated with (L)AWS deployment, emphasizing their unpredictability, lack of transparency, and operational unreliability, which can lead to severe unintended consequences. Key Takeaways: 1. Proposed advantages of (L)AWS can only be achieved through objectification and classification, but a range of systematic risks limit the reliability and predictability of classifying algorithms. 2. These systematic risks include the black-box nature of AI decision-making, susceptibility to reward hacking, goal misgeneralization and potential for emergent behaviors that escape human control. 3. (L)AWS could act in ways that are not just unexpected but also uncontrollable, undermining mission objectives and potentially escalating conflicts. 4. Even rigorously tested systems may behave unpredictably and harmfully in real-world conditions, jeopardizing both strategic stability and humanitarian principles. 
Subjects: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Systems and Control (eess.SY) Cite as: arXiv:2502.10174 [cs.CY] (or arXiv:2502.10174v1 [cs.CY] for this version) https://doi.org/10.48550/arXiv.2502.10174 Focus to learn more arXiv-issued DOI via DataCite (pending registration) Submission history From: Heramb Podar [view email] [v1] Fri, 14 Feb 2025 14:09:43 UTC (446 KB) Full-text links: Access Paper: View a PDF of the paper titled Technical Risks of (Lethal) Autonomous Weapons Systems, by Heramb Podar and 1 other authorsView PDFOther Formats view license Current browse context: cs.CY prev | next new | recent | 2025-02 Change to browse by: cs cs.AI cs.SY eess eess.SY References Citations NASA ADSGoogle Scholar Semantic Scholar a export BibTeX citation Loading… BibTeX formatted citation loading… Data provided by: Bookmark checked=“checked”> Bibliographic Tools Bibliographic and Citation Tools Bibliographic Explorer Toggle Bibliographic Explorer (What is the Explorer?) Connected Papers Toggle Connected Papers (What is Connected Papers?) Litmaps Toggle Litmaps (What is Litmaps?) scite.ai Toggle scite Smart Citations (What are Smart Citations?) Code, Data, Media Code, Data and Media Associated with this Article alphaXiv Toggle alphaXiv (What is alphaXiv?) Links to Code Toggle CatalyzeX Code Finder for Papers (What is CatalyzeX?) DagsHub Toggle DagsHub (What is DagsHub?) GotitPub Toggle Gotit.pub (What is GotitPub?) Huggingface Toggle Hugging Face (What is Huggingface?) Links to Code Toggle Papers with Code (What is Papers with Code?) ScienceCast Toggle ScienceCast (What is ScienceCast?) Demos Demos Replicate Toggle Replicate (What is Replicate?) Spaces Toggle Hugging Face Spaces (What is Spaces?) Spaces Toggle TXYZ.AI (What is TXYZ.AI?) Related Papers Recommenders and Search Tools Link to Influence Flower Influence Flower (What are Influence Flowers?) Core recommender toggle CORE Recommender (What is CORE?) 

[AI-18] SessionRec: Next Session Prediction Paradigm For Generative Sequential Recommendation

Link: https://arxiv.org/abs/2502.10157
Authors: Lei Huang, Hao Guo, Linzhi Peng, Long Zhang, Xiaoteng Wang, Daoyuan Wang, Shichao Wang, Jinpeng Wang, Lei Wang, Sheng Chen
Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)

Abstract:We introduce SessionRec, a novel next-session prediction paradigm (NSPP) for generative sequential recommendation, addressing the fundamental misalignment between conventional next-item prediction paradigm (NIPP) and real-world recommendation scenarios. Unlike NIPP’s item-level autoregressive generation that contradicts actual session-based user interactions, our framework introduces a session-aware representation learning through hierarchical sequence aggregation (intra/inter-session), reducing attention computation complexity while enabling implicit modeling of massive negative interactions, and a session-based prediction objective that better captures users’ diverse interests through multi-item recommendation in next sessions. Moreover, we found that incorporating a rank loss for items within the session under the next session prediction paradigm can significantly improve the ranking effectiveness of generative sequence recommendation models. We also verified that SessionRec exhibits clear power-law scaling laws similar to those observed in LLMs. Extensive experiments conducted on public datasets and online A/B test in Meituan App demonstrate the effectiveness of SessionRec. The proposed paradigm establishes new foundations for developing industrial-scale generative recommendation systems through its model-agnostic architecture and computational efficiency.

[AI-19] Video Soundtrack Generation by Aligning Emotions and Temporal Boundaries IJCAI

Link: https://arxiv.org/abs/2502.10154
Authors: Serkan Sulun, Paula Viana, Matthew E. P. Davies
Subjects: Sound (cs.SD); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multimedia (cs.MM); Audio and Speech Processing (eess.AS); Image and Video Processing (eess.IV)
Comments: Submitted to International Joint Conference on Artificial Intelligence (IJCAI) 2025

Abstract:We introduce EMSYNC, a video-based symbolic music generation model that aligns music with a video’s emotional content and temporal boundaries. It follows a two-stage framework, where a pretrained video emotion classifier extracts emotional features, and a conditional music generator produces MIDI sequences guided by both emotional and temporal cues. We introduce boundary offsets, a novel temporal conditioning mechanism that enables the model to anticipate and align musical chords with scene cuts. Unlike existing models, our approach retains event-based encoding, ensuring fine-grained timing control and expressive musical nuances. We also propose a mapping scheme to bridge the video emotion classifier, which produces discrete emotion categories, with the emotion-conditioned MIDI generator, which operates on continuous-valued valence-arousal inputs. In subjective listening tests, EMSYNC outperforms state-of-the-art models across all subjective metrics, both for participants with music theory knowledge and for general listeners.

[AI-20] Cooperative Multi-Agent Planning with Adaptive Skill Synthesis

Link: https://arxiv.org/abs/2502.10148
Authors: Zhiyuan Li, Wenshuai Zhao, Joni Pajarinen
Subjects: Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)

Abstract:Despite much progress in training distributed artificial intelligence (AI), building cooperative multi-agent systems with multi-agent reinforcement learning (MARL) faces challenges in sample efficiency, interpretability, and transferability. Unlike traditional learning-based methods that require extensive interaction with the environment, large language models (LLMs) demonstrate remarkable capabilities in zero-shot planning and complex reasoning. However, existing LLM-based approaches heavily rely on text-based observations and struggle with the non-Markovian nature of multi-agent interactions under partial observability. We present COMPASS, a novel multi-agent architecture that integrates vision-language models (VLMs) with a dynamic skill library and structured communication for decentralized closed-loop decision-making. The skill library, bootstrapped from demonstrations, evolves via planner-guided tasks to enable adaptive strategies. COMPASS propagates entity information through multi-hop communication under partial observability. Evaluations on the improved StarCraft Multi-Agent Challenge (SMACv2) demonstrate COMPASS achieves up to 30% higher win rates than state-of-the-art MARL algorithms in symmetric scenarios.

[AI-21] Learning Relational Tabular Data without Shared Features

Link: https://arxiv.org/abs/2502.10125
Authors: Zhaomin Wu, Shida Wang, Ziyang Wang, Bingsheng He
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:Learning relational tabular data has gained significant attention recently, but most studies focus on single tables, overlooking the potential of cross-table learning. Cross-table learning, especially in scenarios where tables lack shared features and pre-aligned data, offers vast opportunities but also introduces substantial challenges. The alignment space is immense, and determining accurate alignments between tables is highly complex. We propose Latent Entity Alignment Learning (Leal), a novel framework enabling effective cross-table training without requiring shared features or pre-aligned data. Leal operates on the principle that properly aligned data yield lower loss than misaligned data, a concept embodied in its soft alignment mechanism. This mechanism is coupled with a differentiable cluster sampler module, ensuring efficient scaling to large relational tables. Furthermore, we provide a theoretical proof of the cluster sampler’s approximation capacity. Extensive experiments on five real-world and five synthetic datasets show that Leal achieves up to a 26.8% improvement in predictive performance compared to state-of-the-art methods, demonstrating its effectiveness and scalability.

[AI-22] Causal Information Prioritization for Efficient Reinforcement Learning

Link: https://arxiv.org/abs/2502.10097
Authors: Hongye Cao, Fan Feng, Tianpei Yang, Jing Huo, Yang Gao
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Abstract:Current Reinforcement Learning (RL) methods often suffer from sample-inefficiency, resulting from blind exploration strategies that neglect causal relationships among states, actions, and rewards. Although recent causal approaches aim to address this problem, they lack grounded modeling of reward-guided causal understanding of states and actions for goal-orientation, thus impairing learning efficiency. To tackle this issue, we propose a novel method named Causal Information Prioritization (CIP) that improves sample efficiency by leveraging factored MDPs to infer causal relationships between different dimensions of states and actions with respect to rewards, enabling the prioritization of causal information. Specifically, CIP identifies and leverages causal relationships between states and rewards to execute counterfactual data augmentation to prioritize high-impact state features under the causal understanding of the environments. Moreover, CIP integrates a causality-aware empowerment learning objective, which significantly enhances the agent’s execution of reward-guided actions for more efficient exploration in complex environments. To fully assess the effectiveness of CIP, we conduct extensive experiments across 39 tasks in 5 diverse continuous control environments, encompassing both locomotion and manipulation skills learning with pixel-based and sparse reward settings. Experimental results demonstrate that CIP consistently outperforms existing RL methods across a wide range of scenarios.

[AI-23] A novel approach to data generation in generative model

Link: https://arxiv.org/abs/2502.10092
Authors: JaeHong Kim (1), Jaewon Shim (2) ((1) Healthcare, Legal and Policy Center, Graduate school of Law, Korea University, Seoul 02841, Korea, Human-Inspired AI Research, Korea University, Seoul 02841, Korea, (2) Center for 0D Nanofluidics, Institute of Applied Physics, Department of Physics and Astronomy, Seoul National University, Seoul 08826, Korea)
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 47 pages, 2 tables, 9 figures

Abstract:Variational Autoencoders (VAEs) and other generative models are widely employed in artificial intelligence to synthesize new data. However, current approaches rely on Euclidean geometric assumptions and statistical approximations that fail to capture the structured and emergent nature of data generation. This paper introduces the Convergent Fusion Paradigm (CFP) theory, a novel geometric framework that redefines data generation by integrating dimensional expansion accompanied by qualitative transformation. By modifying the latent space geometry to interact with emergent high-dimensional structures, CFP theory addresses key challenges such as identifiability issues and unintended artifacts like hallucinations in Large Language Models (LLMs). CFP theory is based on two key conceptual hypotheses that redefine how generative models structure relationships between data and algorithms. Through the lens of CFP theory, we critically examine existing metric-learning approaches. CFP theory advances this perspective by introducing time-reversed metric embeddings and structural convergence mechanisms, leading to a novel geometric approach that better accounts for data generation as a structured epistemic process. Beyond its computational implications, CFP theory provides philosophical insights into the ontological underpinnings of data generation. By offering a systematic framework for high-dimensional learning dynamics, CFP theory contributes to establishing a theoretical foundation for understanding the data-relationship structures in AI. Finally, future research on CFP theory will address its implications for fully realizing qualitative transformations, introducing the potential of Hilbert space in generative modeling.

[AI-24] Manual2Skill: Learning to Read Manuals and Acquire Robotic Skills for Furniture Assembly Using Vision-Language Models

Link: https://arxiv.org/abs/2502.10090
Authors: Chenrui Tie, Shengxiang Sun, Jinxuan Zhu, Yiwei Liu, Jingxiang Guo, Yue Hu, Haonan Chen, Junting Chen, Ruihai Wu, Lin Shao
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI)

Abstract:Humans possess an extraordinary ability to understand and execute complex manipulation tasks by interpreting abstract instruction manuals. For robots, however, this capability remains a substantial challenge, as they cannot interpret abstract instructions and translate them into executable actions. In this paper, we present Manual2Skill, a novel framework that enables robots to perform complex assembly tasks guided by high-level manual instructions. Our approach leverages a Vision-Language Model (VLM) to extract structured information from instructional images and then uses this information to construct hierarchical assembly graphs. These graphs represent parts, subassemblies, and the relationships between them. To facilitate task execution, a pose estimation model predicts the relative 6D poses of components at each assembly step. At the same time, a motion planning module generates actionable sequences for real-world robotic implementation. We demonstrate the effectiveness of Manual2Skill by successfully assembling several real-world IKEA furniture items. This application highlights its ability to manage long-horizon manipulation tasks with both efficiency and precision, significantly enhancing the practicality of robot learning from instruction manuals. This work marks a step forward in advancing robotic systems capable of understanding and executing complex manipulation tasks in a manner akin to human capabilities.

[AI-25] A Hybrid Edge Classifier: Combining TinyML-Optimised CNN with RRAM-CMOS ACAM for Energy-Efficient Inference

Link: https://arxiv.org/abs/2502.10089
Authors: Kieran Woodward, Eiman Kanjo, Georgios Papandroulidakis, Shady Agwa, Themis Prodromakis
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Hardware Architecture (cs.AR)

Abstract:In recent years, the development of smart edge computing systems that process information locally has been on the rise. Many near-sensor machine learning (ML) approaches have been implemented to introduce accurate and energy-efficient template matching operations in resource-constrained edge sensing systems, such as wearables. To introduce novel solutions that can be viable for extreme edge cases, hybrid solutions combining conventional and emerging technologies have started to be proposed. Deep Neural Networks (DNNs) optimised for edge applications, alongside new approaches to computing (both device- and architecture-wise), could be strong candidates for implementing edge ML solutions that aim at competitive classification accuracy while using a fraction of the power of conventional ML solutions. In this work, we propose a hybrid software-hardware edge classifier aimed at extreme edge near-sensor systems. The classifier consists of two parts: (i) an optimised digital tinyML network, working as a front-end feature extractor, and (ii) a back-end RRAM-CMOS analogue content addressable memory (ACAM), working as a final-stage template matching system. The combined hybrid system exhibits a competitive trade-off between accuracy and energy, with E_front-end = 96.23 nJ and E_back-end = 1.45 nJ for each classification operation compared with 78.06 μJ for the original teacher model, representing a 792-fold reduction, making it a viable solution for extreme edge applications.

[AI-26] Towards Empowerment Gain through Causal Structure Learning in Model-Based RL

Link: https://arxiv.org/abs/2502.10077
Authors: Hongye Cao, Fan Feng, Meng Fang, Shaokang Dong, Tianpei Yang, Jing Huo, Yang Gao
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Abstract:In Model-Based Reinforcement Learning (MBRL), incorporating causal structures into dynamics models provides agents with a structured understanding of the environments, enabling efficient decision-making. Empowerment as an intrinsic motivation enhances the ability of agents to actively control their environments by maximizing the mutual information between future states and actions. We posit that empowerment coupled with causal understanding can improve controllability, while enhanced empowerment gain can further facilitate causal reasoning in MBRL. To improve learning efficiency and controllability, we propose a novel framework, Empowerment through Causal Learning (ECL), where an agent with the awareness of causal dynamics models achieves empowerment-driven exploration and optimizes its causal structure for task learning. Specifically, ECL operates by first training a causal dynamics model of the environment based on collected data. We then maximize empowerment under the causal structure for exploration, simultaneously using data gathered through exploration to update the causal dynamics model so that it is more controllable than a dense dynamics model without causal structure. In downstream task learning, an intrinsic curiosity reward is included to balance the causality, mitigating overfitting. Importantly, ECL is method-agnostic and is capable of integrating various causal discovery methods. We evaluate ECL combined with 3 causal discovery methods across 6 environments including pixel-based tasks, demonstrating its superior performance compared to other causal MBRL methods, in terms of causal discovery, sample efficiency, and asymptotic performance.
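
For reference, the empowerment quantity mentioned in the abstract (mutual information between actions and future states) is commonly written in its one-step form as shown here. This is the standard textbook definition, not necessarily the exact objective ECL optimizes under its causal structure; the symbols \pi (action distribution) and p (environment dynamics) are notation assumed for this note.

```latex
\mathcal{E}(s_t) \;=\; \max_{\pi}\; I\big(S_{t+1};\, A_t \mid s_t\big)
\;=\; \max_{\pi}\; \mathbb{E}_{\pi(a_t \mid s_t)\, p(s_{t+1} \mid s_t, a_t)}
\left[ \log \frac{p(s_{t+1} \mid s_t, a_t)}{\sum_{a'} \pi(a' \mid s_t)\, p(s_{t+1} \mid s_t, a')} \right]
```

A state has high empowerment when different actions lead to clearly distinguishable next states, which is why the quantity is read as a measure of controllability.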

[AI-27] Strassen Multisystolic Array Hardware Architectures

Link: https://arxiv.org/abs/2502.10063
Authors: Trevor E. Pogue, Nicola Nicolici
Subjects: Hardware Architecture (cs.AR); Artificial Intelligence (cs.AI); Performance (cs.PF)
Comments: Accepted for publication in IEEE Transactions on Very Large Scale Integration (VLSI) Systems; Associated source code available on GitHub at this https URL

Abstract:While Strassen’s matrix multiplication algorithm reduces the complexity of naive matrix multiplication, general-purpose hardware is not suitable for achieving the algorithm’s promised theoretical speedups. This leaves the question of whether it could be better exploited in custom hardware architectures designed specifically for executing the algorithm. However, there is limited prior work on this and it is not immediately clear how to derive such architectures or whether they can ultimately lead to real improvements. We bridge this gap, presenting and evaluating new systolic array architectures that efficiently translate the theoretical complexity reductions of Strassen’s algorithm directly into hardware resource savings. Furthermore, the architectures are multisystolic array designs that can multiply smaller matrices with higher utilization than single-systolic array designs. The proposed designs implemented on FPGA reduce DSP requirements by a factor of 1.14^r for r implemented Strassen recursion levels, and otherwise require overall similar soft logic resources when instantiated to support matrix sizes down to 32x32 and 24x24 at 1-2 levels of Strassen recursion, respectively. We evaluate the proposed designs both in isolation and in an end-to-end machine learning accelerator compared to baseline designs and prior works, achieving state-of-the-art performance.
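
The 1.14^r figure is consistent with the well-known property that each Strassen recursion level replaces 8 block multiplications with 7, giving a ratio of 8/7 ≈ 1.14 per level. The sketch below only counts scalar multiplications to illustrate that ratio; the function names are hypothetical, and reading the DSP saving as a multiplication-count saving is an assumption of this note, not a statement from the paper.

```python
def multiplications(n: int, levels: int) -> int:
    """Scalar multiplications for an n x n product using `levels` Strassen
    recursion levels, with the naive n^3 algorithm below that depth."""
    if levels == 0:
        return n ** 3
    if n % 2:
        raise ValueError("n must be even at every Strassen level")
    # One Strassen level: 7 half-size products instead of 8.
    return 7 * multiplications(n // 2, levels - 1)


def saving_factor(n: int, levels: int) -> float:
    """Ratio of naive to Strassen multiplication counts."""
    return multiplications(n, 0) / multiplications(n, levels)


if __name__ == "__main__":
    for r in (1, 2):
        print(r, round(saving_factor(32, r), 3))
```

For a 32x32 product this gives (8/7)^r, independent of the matrix size as long as the size stays divisible by 2 at each level.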

[AI-28] Adaptive Bi-Level Multi-Robot Task Allocation and Learning under Uncertainty with Temporal Logic Constraints AAMAS2025

Link: https://arxiv.org/abs/2502.10062
Authors: Xiaoshan Lin, Roberto Tron
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI); Formal Languages and Automata Theory (cs.FL)
Comments: Accepted as a full paper at AAMAS 2025

Abstract:This work addresses the problem of multi-robot coordination under unknown robot transition models, ensuring that tasks specified by Time Window Temporal Logic are satisfied with user-defined probability thresholds. We present a bi-level framework that integrates (i) high-level task allocation, where tasks are assigned based on the robots’ estimated task completion probabilities and expected rewards, and (ii) low-level distributed policy learning and execution, where robots independently optimize auxiliary rewards while fulfilling their assigned tasks. To handle uncertainty in robot dynamics, our approach leverages real-time task execution data to iteratively refine expected task completion probabilities and rewards, enabling adaptive task allocation without explicit robot transition models. We theoretically validate the proposed algorithm, demonstrating that the task assignments meet the desired probability thresholds with high confidence. Finally, we demonstrate the effectiveness of our framework through comprehensive simulations.

[AI-29] A Survey on LLM-powered Agents for Recommender Systems

Link: https://arxiv.org/abs/2502.10050
Authors: Qiyao Peng, Hongtao Liu, Hua Huang, Qing Yang, Minglai Shao
Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)

Abstract:Recommender systems are essential components of many online platforms, yet traditional approaches still struggle with understanding complex user preferences and providing explainable recommendations. The emergence of Large Language Model (LLM)-powered agents offers a promising approach by enabling natural language interactions and interpretable reasoning, potentially transforming research in recommender systems. This survey provides a systematic review of the emerging applications of LLM-powered agents in recommender systems. We identify and analyze three key paradigms in current research: (1) Recommender-oriented approaches, which leverage intelligent agents to enhance the fundamental recommendation mechanisms; (2) Interaction-oriented approaches, which facilitate dynamic user engagement through natural dialogue and interpretable suggestions; and (3) Simulation-oriented approaches, which employ multi-agent frameworks to model complex user-item interactions and system dynamics. Beyond paradigm categorization, we analyze the architectural foundations of LLM-powered recommendation agents, examining their essential components: profile construction, memory management, strategic planning, and action execution. Our investigation extends to a comprehensive analysis of benchmark datasets and evaluation frameworks in this domain. This systematic examination not only illuminates the current state of LLM-powered agent recommender systems but also charts critical challenges and promising research directions in this transformative field.

[AI-30] Janus: Collaborative Vision Transformer Under Dynamic Network Environment

Link: https://arxiv.org/abs/2502.10047
Authors: Linyi Jiang, Silvery D. Fu, Yifei Zhu, Bo Li
Subjects: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI)
Comments: Accepted for publication in IEEE INFOCOM 2025

Abstract:Vision Transformers (ViTs) have outperformed traditional Convolutional Neural Network architectures and achieved state-of-the-art results in various computer vision tasks. Since ViTs are computationally expensive, the models either have to be pruned to run on resource-limited edge devices only or have to be executed on remote cloud servers after receiving the raw data transmitted over fluctuating networks. The resulting degraded performance or high latency all hinder their widespread applications. In this paper, we present Janus, the first framework for low-latency cloud-device collaborative Vision Transformer inference over dynamic networks. Janus overcomes the intrinsic model limitations of ViTs and realizes collaboratively executing ViT models on both cloud and edge devices, achieving low latency, high accuracy, and low communication overhead. Specifically, Janus judiciously combines token pruning techniques with a carefully designed fine-to-coarse model splitting policy and non-static mixed pruning policy. It attains a balance between accuracy and latency by dynamically selecting the optimal pruning level and split point. Experimental results across various tasks demonstrate that Janus enhances throughput by up to 5.15 times and reduces latency violation ratios by up to 98.7% when compared with baseline approaches under various network environments.

[AI-31] Unsupervised Entity Alignment Based on Personalized Discriminative Rooted Tree

Link: https://arxiv.org/abs/2502.10044
Authors: Yaming Yang, Zhe Wang, Ziyu Guan, Wei Zhao, Xinyan Huang, Xiaofei He
Subjects: Artificial Intelligence (cs.AI)

Abstract:Entity Alignment (EA) aims to link potentially equivalent entities across different knowledge graphs (KGs). Most existing EA methods are supervised as they require the supervision of seed alignments, i.e., manually specified aligned entity pairs. Very recently, several EA studies have attempted to get rid of seed alignments. Despite achieving preliminary progress, they still suffer from two limitations: (1) The entity embeddings produced by their GNN-like encoders lack personalization since some of the aggregation subpaths are shared between different entities. (2) They cannot fully alleviate the distribution distortion issue between candidate KGs due to the absence of the supervised signal. In this work, we propose a novel unsupervised entity alignment approach called UNEA to address the above two issues. First, we parametrically sample a tree neighborhood rooted at each entity, and accordingly develop a tree attention aggregation mechanism to extract a personalized embedding for each entity. Second, we introduce an auxiliary task of maximizing the mutual information between the input and the output of the KG encoder, to regularize the model and prevent the distribution distortion. Extensive experiments show that our UNEA achieves a new state-of-the-art for the unsupervised EA task, and can even outperform many existing supervised EA baselines.

[AI-32] POI-Enhancer: An LLM-based Semantic Enhancement Framework for POI Representation Learning

Link: https://arxiv.org/abs/2502.10038
Authors: Jiawei Cheng, Jingyuan Wang, Yichuan Zhang, Jiahao Ji, Yuanshao Zhu, Zhibo Zhang, Xiangyu Zhao
Subjects: Artificial Intelligence (cs.AI)

Abstract:POI representation learning plays a crucial role in handling tasks related to user mobility data. Recent studies have shown that enriching POI representations with multimodal information can significantly enhance their task performance. Previously, the textual information incorporated into POI representations typically involved only POI categories or check-in content, leading to relatively weak textual features in existing methods. In contrast, large language models (LLMs) trained on extensive text data have been found to possess rich textual knowledge. However, leveraging such knowledge to enhance POI representation learning presents two key challenges: first, how to extract POI-related knowledge from LLMs effectively, and second, how to integrate the extracted information to enhance POI representations. To address these challenges, we propose POI-Enhancer, a portable framework that leverages LLMs to improve POI representations produced by classic POI learning models. We first design three specialized prompts to extract semantic information from LLMs efficiently. Then, the Dual Feature Alignment module enhances the quality of the extracted information, while the Semantic Feature Fusion module preserves its integrity. The Cross Attention Fusion module then adaptively integrates this high-quality information into POI representations, and Multi-View Contrastive Learning further injects human-understandable semantic information into these representations. Extensive experiments on three real-world datasets demonstrate the effectiveness of our framework, showing significant improvements across all baseline representations.

[AI-33] Dream to Drive: Model-Based Vehicle Control Using Analytic World Models

Link: https://arxiv.org/abs/2502.10012
Authors: Asen Nachkov, Danda Pani Paudel, Jan-Nico Zaech, Davide Scaramuzza, Luc Van Gool
Subjects: Artificial Intelligence (cs.AI); Robotics (cs.RO)

Abstract:Differentiable simulators have recently shown great promise for training autonomous vehicle controllers. Being able to backpropagate through them, they can be placed into an end-to-end training loop where their known dynamics turn into useful priors for the policy to learn, removing the typical black box assumption of the environment. So far, these systems have only been used to train policies. However, this is not the end of the story in terms of what they can offer. Here, for the first time, we use them to train world models. Specifically, we present three new task setups that allow us to learn next state predictors, optimal planners, and optimal inverse states. Unlike analytic policy gradients (APG), which requires the gradient of the next simulator state with respect to the current actions, our proposed setups rely on the gradient of the next state with respect to the current state. We call this approach Analytic World Models (AWMs) and showcase its applications, including how to use it for planning in the Waymax simulator. Apart from pushing the limits of what is possible with such simulators, we offer an improved training recipe that increases performance on the large-scale Waymo Open Motion dataset by up to 12% compared to baselines at essentially no additional cost.
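
The distinction the abstract draws between APG and AWMs can be written compactly. Assuming the differentiable simulator is a function f that maps the current state and action to the next state (the symbols f, s_t and a_t are notation assumed for this note, not taken from the paper), the two approaches differentiate through different arguments:

```latex
s_{t+1} = f(s_t, a_t), \qquad
\underbrace{\frac{\partial s_{t+1}}{\partial a_t}}_{\text{used by APG}}
\qquad \text{vs.} \qquad
\underbrace{\frac{\partial s_{t+1}}{\partial s_t}}_{\text{used by AWMs}}
```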

[AI-34] Decision Information Meets Large Language Models: The Future of Explainable Operations Research

Link: https://arxiv.org/abs/2502.09994
Authors: Yansen Zhang, Qingcan Kang, Wing Yin Yu, Hailei Gong, Xiaojin Fu, Xiongwei Han, Tao Zhong, Chen Ma
Subjects: Artificial Intelligence (cs.AI)

Abstract:Operations Research (OR) is vital for decision-making in many industries. While recent OR methods have seen significant improvements in automation and efficiency through integrating Large Language Models (LLMs), they still struggle to produce meaningful explanations. This lack of clarity raises concerns about transparency and trustworthiness in OR applications. To address these challenges, we propose a comprehensive framework, Explainable Operations Research (EOR), emphasizing actionable and understandable explanations accompanying optimization. The core of EOR is the concept of Decision Information, which emerges from what-if analysis and focuses on evaluating the impact of complex constraints (or parameters) changes on decision-making. Specifically, we utilize bipartite graphs to quantify the changes in the OR model and adopt LLMs to improve the explanation capabilities. Additionally, we introduce the first industrial benchmark to rigorously evaluate the effectiveness of explanations and analyses in OR, establishing a new standard for transparency and clarity in the field.

[AI-35] Has My System Prompt Been Used? Large Language Model Prompt Membership Inference

Link: https://arxiv.org/abs/2502.09974
Authors: Roman Levin, Valeriia Cherepanova, Abhimanyu Hans, Avi Schwarzschild, Tom Goldstein
Subjects: Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)

Abstract:Prompt engineering has emerged as a powerful technique for optimizing large language models (LLMs) for specific applications, enabling faster prototyping and improved performance, and giving rise to the interest of the community in protecting proprietary system prompts. In this work, we explore a novel perspective on prompt privacy through the lens of membership inference. We develop Prompt Detective, a statistical method to reliably determine whether a given system prompt was used by a third-party language model. Our approach relies on a statistical test comparing the distributions of two groups of model outputs corresponding to different system prompts. Through extensive experiments with a variety of language models, we demonstrate the effectiveness of Prompt Detective for prompt membership inference. Our work reveals that even minor changes in system prompts manifest in distinct response distributions, enabling us to verify prompt usage with statistical significance.
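
The abstract describes a statistical test that compares two groups of model outputs. One generic way to run such a comparison is a permutation test on a scalar feature of each output, sketched here. This is an illustration of the general idea only: the choice of test statistic, the scalar feature (a made-up per-response score) and the function name `permutation_test` are all assumptions of this note and are not taken from the paper.

```python
import random
from statistics import mean


def permutation_test(a, b, n_perm=2000, seed=0):
    """Two-sided permutation test on the difference of group means.

    Returns an estimated p-value for the null hypothesis that both
    groups of values come from the same distribution.
    """
    rng = random.Random(seed)
    observed = abs(mean(a) - mean(b))
    pooled = list(a) + list(b)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        perm_a, perm_b = pooled[:len(a)], pooled[len(a):]
        if abs(mean(perm_a) - mean(perm_b)) >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly positive.
    return (hits + 1) / (n_perm + 1)


if __name__ == "__main__":
    # Hypothetical per-response scores under two different system prompts.
    group_known = [10, 12, 11, 13, 12, 11, 10, 12]
    group_third_party = [20, 22, 21, 23, 22, 21, 20, 22]
    print(permutation_test(group_known, group_third_party))
```

A small p-value suggests the two groups of responses were produced under different conditions; a large one means the test found no evidence of a difference.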

[AI-36] Diverse Inference and Verification for Advanced Reasoning

Link: https://arxiv.org/abs/2502.09955
Authors: Iddo Drori, Gaston Longhitano, Mao Mao, Seunghwan Hyun, Yuke Zhang, Sungjun Park, Zachary Meeks, Xin-Yu Zhang, Ben Segev, Howard Yong, Nakul Verma, Avi Shporer, Alon Amit, Madeleine Udell
Subjects: Artificial Intelligence (cs.AI)
Comments: 165 pages. arXiv admin note: text overlap with arXiv:2001.04383 by other authors

Abstract:Reasoning LLMs such as OpenAI o1, o3 and DeepSeek R1 have made significant progress in mathematics and coding, yet still struggle with advanced tasks such as International Mathematical Olympiad (IMO) combinatorics problems, Abstraction and Reasoning Corpus (ARC) puzzles, and Humanity’s Last Exam (HLE) questions. We use a diverse inference approach that combines multiple models and methods at test time. We find that verifying mathematics and code problems, and rejection sampling on other problems, is simple and effective. We automatically verify the correctness of solutions to IMO problems with Lean and of ARC puzzles with code, and find that best-of-N effectively answers HLE questions. Our approach increases answer accuracy on IMO combinatorics problems from 33.3% to 77.8%, accuracy on HLE questions from 8% to 37%, and solves 80% of ARC puzzles that 948 humans could not and 26.5% of ARC puzzles that o3 high compute does not. Test-time simulations, reinforcement learning, and meta-learning with inference feedback improve generalization by adapting agent graph representations and varying prompts, code, and datasets. Our approach is reliable, robust, and scalable, and in the spirit of reproducible research, we will make it publicly available upon publication.
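
The two selection strategies named in the abstract, verification with rejection of failing candidates and best-of-N, can be sketched generically as follows. The candidate pool, the verifier and the scoring function are toy stand-ins invented for this note; the paper's actual verifiers (Lean proofs, code execution) are not reproduced here.

```python
from typing import Callable, Optional, Sequence, TypeVar

T = TypeVar("T")


def first_verified(candidates: Sequence[T], verify: Callable[[T], bool]) -> Optional[T]:
    """Rejection sampling: return the first candidate the verifier accepts."""
    for candidate in candidates:
        if verify(candidate):
            return candidate
    return None


def best_of_n(candidates: Sequence[T], score: Callable[[T], float]) -> T:
    """Best-of-N: return the candidate with the highest score."""
    return max(candidates, key=score)


if __name__ == "__main__":
    # Toy task: find x with x * x == 49 among sampled answers.
    samples = [5, 6, 7, 8]
    print(first_verified(samples, lambda x: x * x == 49))
    print(best_of_n(samples, lambda x: -abs(x * x - 50)))
```

Verification applies when correctness can be checked exactly; best-of-N is the fallback when only a graded score is available.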

[AI-37] Analyzing Patient Daily Movement Behavior Dynamics Using Two-Stage Encoding Model NEURIPS2024

Link: https://arxiv.org/abs/2502.09947
Authors: Jin Cui,Alexander Capstick,Payam Barnaghi,Gregory Scott
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*Comments: NeurIPS 2024 workshop Time Series in the Age of Large Models. arXiv admin note: substantial text overlap with arXiv:2502.09173

Click to view abstract

Abstract:In the analysis of remote healthcare monitoring data, time series representation learning offers substantial value in uncovering deeper patterns of patient behavior, especially given the fine temporal granularity of the data. In this study, we focus on a dataset of home activity records from people living with dementia. We propose a two-stage self-supervised learning approach. The first stage involves converting time-series activities into text strings, which are then encoded by a fine-tuned language model. In the second stage, these time-series vectors are projected into two dimensions so that the PageRank method can be applied to analyze latent state transitions, quantitatively assess participants' behavioral patterns, and identify activity biases. These insights, combined with diagnostic data, aim to support personalized care interventions.
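
The PageRank step mentioned above can be illustrated with a plain power iteration. This is a sketch under assumptions: it presumes the encoded vectors have already been discretized into a small number of latent states and summarized as a matrix of transition counts, which is not spelled out in the abstract; the function name and damping value are illustrative.

```python
def pagerank(transition_counts, damping=0.85, iterations=100):
    """Rank latent states from a square matrix of observed transitions.

    transition_counts[i][j] is the (assumed) number of times state i
    was followed by state j.
    """
    n = len(transition_counts)
    # Row-normalize counts into transition probabilities; a state with
    # no outgoing transitions is treated as jumping uniformly at random.
    probs = []
    for row in transition_counts:
        total = sum(row)
        if total == 0:
            probs.append([1.0 / n] * n)
        else:
            probs.append([count / total for count in row])

    rank = [1.0 / n] * n
    for _ in range(iterations):
        new_rank = [(1.0 - damping) / n] * n
        for i in range(n):
            for j in range(n):
                new_rank[j] += damping * rank[i] * probs[i][j]
        rank = new_rank
    return rank
```

States with a high score are those a participant keeps returning to, which is one way an activity bias could show up.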

[AI-38] AttenGluco: Multimodal Transformer-Based Blood Glucose Forecasting on AI-READI Dataset

Link: https://arxiv.org/abs/2502.09919
Authors: Ebrahim Farahmand,Reza Rahimi Azghan,Nooshin Taheri Chatrudi,Eric Kim,Gautham Krishna Gudur,Edison Thomaz,Giulia Pedrielli,Pavan Turaga,Hassan Ghasemzadeh
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*Comments:

Click to view abstract

Abstract:Diabetes is a chronic metabolic disorder characterized by persistently high blood glucose levels (BGLs), leading to severe complications such as cardiovascular disease, neuropathy, and retinopathy. Predicting BGLs enables patients to maintain glucose levels within a safe range and allows caregivers to take proactive measures through lifestyle modifications. Continuous Glucose Monitoring (CGM) systems provide real-time tracking, offering a valuable tool for monitoring BGLs. However, accurately forecasting BGLs remains challenging because of fluctuations caused by physical activity, diet, and other factors. Recent deep learning models show promise in improving BGL prediction. Nonetheless, forecasting BGLs accurately from multimodal, irregularly sampled data over long prediction horizons remains a challenging research problem. In this paper, we propose AttenGluco, a multimodal Transformer-based framework for long-term blood glucose prediction. AttenGluco employs cross-attention to effectively integrate CGM and activity data, addressing challenges in fusing data with different sampling rates. Moreover, it employs multi-scale attention to capture long-term dependencies in temporal data, enhancing forecasting accuracy. To evaluate the performance of AttenGluco, we conduct forecasting experiments on the recently released AI-READI dataset, analyzing its predictive accuracy across different subject cohorts including healthy individuals, people with prediabetes, and those with type 2 diabetes. Furthermore, we investigate its performance improvements and forgetting behavior as new cohorts are introduced. Our evaluations show that AttenGluco improves all error metrics, such as root mean square error (RMSE), mean absolute error (MAE), and correlation, compared to the multimodal LSTM model. AttenGluco outperforms this baseline model by about 10% and 15% in terms of RMSE and MAE, respectively.
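
For readers unfamiliar with the error metrics cited above, the following sketch shows how RMSE, MAE, and a relative improvement over a baseline are commonly computed. The function names and the sample values in the usage note are illustrative and are not taken from the paper.

```python
import math


def rmse(actual, predicted):
    """Root mean square error between two equal-length sequences."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))


def mae(actual, predicted):
    """Mean absolute error between two equal-length sequences."""
    absolute = [abs(a - p) for a, p in zip(actual, predicted)]
    return sum(absolute) / len(absolute)


def relative_improvement(baseline_error, model_error):
    """Fraction by which the model reduces the baseline's error."""
    return (baseline_error - model_error) / baseline_error
```

With made-up glucose readings `[100, 110]` and forecasts `[103, 106]`, the MAE is 3.5 and the RMSE is the square root of 12.5; a model whose error is 9.0 against a baseline of 10.0 shows a relative improvement of about 10%.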

[AI-39] AutoS$^2$earch: Unlocking the Reasoning Potential of Large Models for Web-based Source Search

Link: https://arxiv.org/abs/2502.09913
Authors: Zhengqiu Zhu,Yatai Ji,Jiaheng Huang,Yong Zhao,Sihang Qiu,Rusheng Ju
Categories: Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
*Comments:

Click to view abstract

Abstract:Web-based management systems have been widely used in risk control and industrial safety. However, effectively integrating source search capabilities into these systems, to enable decision-makers to locate and address the hazard (e.g., gas leak detection) remains a challenge. While prior efforts have explored using web crowdsourcing and AI algorithms for source search decision support, these approaches suffer from overheads in recruiting human participants and slow response times in time-sensitive situations. To address this, we introduce AutoS$^2$earch, a novel framework leveraging large models for zero-shot source search in web applications. AutoS$^2$earch operates on a simplified visual environment projected through a web-based display, utilizing a chain-of-thought prompt designed to emulate human reasoning. The multi-modal large language model (MLLM) dynamically converts visual observations into language descriptions, enabling the LLM to perform linguistic reasoning on four directional choices. Extensive experiments demonstrate that AutoS$^2$earch achieves performance nearly equivalent to human-AI collaborative source search while eliminating dependency on crowdsourced labor. Our work offers valuable insights into using web engineering to design such autonomous systems in other industrial applications.

[AI-40] The Ann Arbor Architecture for Agent-Oriented Programming

Link: https://arxiv.org/abs/2502.09903
Authors: Wei Dong
Categories: Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Software Engineering (cs.SE)
*Comments:

Click to view abstract

Abstract:In this paper, we reexamine prompt engineering for large language models through the lens of automata theory. We argue that language models function as automata and, like all automata, should be programmed in the languages they accept, a unified collection of all natural and formal languages. Therefore, traditional software engineering practices, conditioned on the clear separation of programming languages and natural languages, must be rethought. We introduce the Ann Arbor Architecture, a conceptual framework for agent-oriented programming of language models, as a higher-level abstraction over raw token generation, and provide a new perspective on in-context learning. Based on this framework, we present the design of our agent platform Postline, and report on our initial experiments in agent training.

[AI-41] Artificial Intelligence in Spectroscopy: Advancing Chemistry from Prediction to Generation and Beyond

Link: https://arxiv.org/abs/2502.09897
Authors: Kehan Guo,Yili Shen,Gisela Abigail Gonzalez-Montiel,Yue Huang,Yujun Zhou,Mihir Surve,Zhichun Guo,Prayel Das,Nitesh V Chawla,Olaf Wiest,Xiangliang Zhang
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*Comments:

Click to view abstract

Abstract:The rapid advent of machine learning (ML) and artificial intelligence (AI) has catalyzed major transformations in chemistry, yet the application of these methods to spectroscopic and spectrometric data, referred to as Spectroscopy Machine Learning (SpectraML), remains relatively underexplored. Modern spectroscopic techniques (MS, NMR, IR, Raman, UV-Vis) generate an ever-growing volume of high-dimensional data, creating a pressing need for automated and intelligent analysis beyond traditional expert-based workflows. In this survey, we provide a unified review of SpectraML, systematically examining state-of-the-art approaches for both forward tasks (molecule-to-spectrum prediction) and inverse tasks (spectrum-to-molecule inference). We trace the historical evolution of ML in spectroscopy, from early pattern recognition to the latest foundation models capable of advanced reasoning, and offer a taxonomy of representative neural architectures, including graph-based and transformer-based methods. Addressing key challenges such as data quality, multimodal integration, and computational scalability, we highlight emerging directions such as synthetic data generation, large-scale pretraining, and few- or zero-shot learning. To foster reproducible research, we also release an open-source repository containing recent papers and their corresponding curated datasets (this https URL). Our survey serves as a roadmap for researchers, guiding progress at the intersection of spectroscopy and AI.

[AI-42] ArchRAG: Attributed Community-based Hierarchical Retrieval-Augmented Generation

Link: https://arxiv.org/abs/2502.09891
Authors: Shu Wang,Yixiang Fang,Yingli Zhou,Xilin Liu,Yuchi Ma
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
*Comments:

Click to view abstract

Abstract:Retrieval-Augmented Generation (RAG) has proven effective in integrating external knowledge into large language models (LLMs) for question-answer (QA) tasks. State-of-the-art RAG approaches often use graph data as the external data, since it captures rich semantic information and link relationships between entities. However, existing graph-based RAG approaches cannot accurately identify the relevant information from the graph and also consume large numbers of tokens in the online retrieval process. To address these issues, we introduce a novel graph-based RAG approach, called Attributed Community-based Hierarchical RAG (ArchRAG), by augmenting the question using attributed communities, and also introducing a novel LLM-based hierarchical clustering method. To retrieve the most relevant information from the graph for the question, we build a novel hierarchical index structure for the attributed communities and develop an effective online retrieval method. Experimental results demonstrate that ArchRAG outperforms existing methods in terms of both accuracy and token cost.

[AI-43] Evaluating and Improving Graph-based Explanation Methods for Multi-Agent Coordination

Link: https://arxiv.org/abs/2502.09889
Authors: Siva Kailas,Shalin Jain,Harish Ravichandar
Categories: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Robotics (cs.RO)
*Comments: 19 pages, 8 figures, 6 tables

Click to view abstract

Abstract:Graph Neural Networks (GNNs), developed by the graph learning community, have been adopted and shown to be highly effective in multi-robot and multi-agent learning. Inspired by this successful cross-pollination, we investigate and characterize the suitability of existing GNN explanation methods for explaining multi-agent coordination. We find that these methods have the potential to identify the most-influential communication channels that impact the team’s behavior. Informed by our initial analyses, we propose an attention entropy regularization term that renders GAT-based policies more amenable to existing graph-based explainers. Intuitively, minimizing attention entropy incentivizes agents to limit their attention to the most influential or impactful agents, thereby easing the challenge faced by the explainer. We theoretically ground this intuition by showing that minimizing attention entropy increases the disparity between the explainer-generated subgraph and its complement. Evaluations across three tasks and three team sizes i) provide insights into the effectiveness of existing explainers, and ii) demonstrate that our proposed regularization consistently improves explanation quality without sacrificing task performance.
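
The intuition behind an attention entropy penalty can be sketched in a few lines. This is a hedged illustration rather than the paper's formulation: it assumes each agent's attention weights over its neighbors form a probability distribution, and the way the penalty is averaged and weighted (`coefficient`) is an assumption made for the example.

```python
import math


def attention_entropy(weights):
    """Shannon entropy of one agent's attention distribution."""
    return -sum(w * math.log(w) for w in weights if w > 0.0)


def regularized_loss(task_loss, attention_rows, coefficient=0.1):
    """Task loss plus a penalty on the mean attention entropy.

    attention_rows holds one attention distribution per agent; lower
    entropy means an agent concentrates on fewer teammates.
    """
    mean_entropy = (sum(attention_entropy(row) for row in attention_rows)
                    / len(attention_rows))
    return task_loss + coefficient * mean_entropy
```

A sharply peaked distribution such as `[1.0, 0.0, 0.0]` contributes zero penalty, whereas uniform attention over the same three neighbors contributes the maximum, `log(3)`.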

[AI-44] Video2Policy: Scaling up Manipulation Tasks in Simulation through Internet Videos

Link: https://arxiv.org/abs/2502.09886
Authors: Weirui Ye,Fangchen Liu,Zheng Ding,Yang Gao,Oleh Rybkin,Pieter Abbeel
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*Comments:

Click to view abstract

Abstract:Simulation offers a promising approach for cheaply scaling training data for generalist policies. To scalably generate data from diverse and realistic tasks, existing algorithms either rely on large language models (LLMs) that may hallucinate tasks not interesting for robotics; or digital twins, which require careful real-to-sim alignment and are hard to scale. To address these challenges, we introduce Video2Policy, a novel framework that leverages internet RGB videos to reconstruct tasks based on everyday human behavior. Our approach comprises two phases: (1) task generation in simulation from videos; and (2) reinforcement learning utilizing in-context LLM-generated reward functions iteratively. We demonstrate the efficacy of Video2Policy by reconstructing over 100 videos from the Something-Something-v2 (SSv2) dataset, which depicts diverse and complex human behaviors on 9 different tasks. Our method can successfully train RL policies on such tasks, including complex and challenging tasks such as throwing. Finally, we show that the generated simulation data can be scaled up for training a general policy, and it can be transferred back to the real robot in a Real2Sim2Real way.

[AI-45] Comprehensive Review of Neural Differential Equations for Time Series Analysis

Link: https://arxiv.org/abs/2502.09885
Authors: YongKyung Oh,Seungsu Kam,Jonghun Lee,Dong-Young Lim,Sungil Kim,Alex Bui
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*Comments:

Click to view abstract

Abstract:Time series modeling and analysis has become critical in various domains. Conventional methods such as RNNs and Transformers, while effective for discrete-time and regularly sampled data, face significant challenges in capturing the continuous dynamics and irregular sampling patterns inherent in real-world scenarios. Neural Differential Equations (NDEs) represent a paradigm shift by combining the flexibility of neural networks with the mathematical rigor of differential equations. This paper presents a comprehensive review of NDE-based methods for time series analysis, including neural ordinary differential equations, neural controlled differential equations, and neural stochastic differential equations. We provide a detailed discussion of their mathematical formulations, numerical methods, and applications, highlighting their ability to model continuous-time dynamics. Furthermore, we address key challenges and future research directions. This survey serves as a foundation for researchers and practitioners seeking to leverage NDEs for advanced time series analysis.

[AI-46] Nonasymptotic CLT and Error Bounds for Two-Time-Scale Stochastic Approximation

Link: https://arxiv.org/abs/2502.09884
Authors: Seo Taek Kong,Sihan Zeng,Thinh T. Doan,R. Srikant
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*Comments:

Click to view abstract

Abstract:We consider linear two-time-scale stochastic approximation algorithms driven by martingale noise. Recent applications in machine learning motivate the need to understand finite-time error rates, but conventional stochastic approximation analyses focus on either asymptotic convergence in distribution or finite-time bounds that are far from optimal. Prior work on asymptotic central limit theorems (CLTs) suggests that two-time-scale algorithms may be able to achieve $1/\sqrt{n}$ error in expectation, with a constant given by the expected norm of the limiting Gaussian vector. However, the best known finite-time rates are much slower. We derive the first non-asymptotic central limit theorem with respect to the Wasserstein-1 distance for two-time-scale stochastic approximation with Polyak-Ruppert averaging. As a corollary, we show that the expected error achieved by Polyak-Ruppert averaging decays at rate $1/\sqrt{n}$, which significantly improves on the rates of convergence in prior works.

[AI-47] How Users Who are Blind or Low Vision Play Mobile Games: Perceptions, Challenges, and Strategies

Link: https://arxiv.org/abs/2502.09866
Authors: Zihe Ran,Xiyu Li,Qing Xiao,Xianzhe Fan,Franklin Mingzhe Li,Yanyun Wang,Zhicong Lu
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Machine Learning (cs.LG)
*Comments: 18 pages, 3 figures, Accepted by CHI '25

Click to view abstract

Abstract:As blind and low-vision (BLV) players engage more deeply with games, accessibility features have become essential. While some research has explored tools and strategies to enhance game accessibility, the specific experiences of these players with mobile games remain underexamined. This study addresses this gap by investigating how BLV users experience mobile games with varying accessibility levels. Through interviews with 32 experienced BLV mobile players, we explore their perceptions, challenges, and strategies for engaging with mobile games. Our findings reveal that BLV players turn to mobile games to alleviate boredom, achieve a sense of accomplishment, and build social connections, but face barriers depending on the game’s accessibility level. We also compare mobile games to other forms of gaming, highlighting the relative advantages of mobile games, such as the inherent accessibility of smartphones. This study contributes to understanding BLV mobile gaming experiences and provides insights for enhancing accessible mobile game design.

[AI-48] A Scoresheet for Explainable AI AAMAS2025

Link: https://arxiv.org/abs/2502.09861
Authors: Michael Winikoff,John Thangarajah,Sebastian Rodriguez
Categories: Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA); Software Engineering (cs.SE)
*Comments: To appear at AAMAS 2025 - arXiv version also includes appendices

Click to view abstract

Abstract:Explainability is important for the transparency of autonomous and intelligent systems and for helping to support the development of appropriate levels of trust. There has been considerable work on developing approaches for explaining systems and there are standards that specify requirements for transparency. However, there is a gap: the standards are too high-level and do not adequately specify requirements for explainability. This paper develops a scoresheet that can be used to specify explainability requirements or to assess the explainability aspects provided for particular applications. The scoresheet is developed by considering the requirements of a range of stakeholders and is applicable to Multiagent Systems as well as other AI technologies. We also provide guidance for how to use the scoresheet and illustrate its generality and usefulness by applying it to a range of applications.

[AI-49] MuDoC: An Interactive Multimodal Document-grounded Conversational AI System AAAI

Link: https://arxiv.org/abs/2502.09843
Authors: Karan Taneja,Ashok K. Goel
Categories: Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Multimedia (cs.MM)
*Comments: 5 pages, 3 figures, AAAI-MAKE 2025

Click to view abstract

Abstract:Multimodal AI is an important step towards building effective tools to leverage multiple modalities in human-AI communication. Building a multimodal document-grounded AI system to interact with long documents remains a challenge. Our work aims to fill the research gap of directly leveraging grounded visuals from documents alongside textual content in documents for response generation. We present an interactive conversational AI agent ‘MuDoC’ based on GPT-4o to generate document-grounded responses with interleaved text and figures. MuDoC’s intelligent textbook interface promotes trustworthiness and enables verification of system responses by allowing instant navigation to source text and figures in the documents. We also discuss qualitative observations based on MuDoC responses highlighting its strengths and limitations.

[AI-50] Efficient Evaluation of Multi-Task Robot Policies With Active Experiment Selection

Link: https://arxiv.org/abs/2502.09829
Authors: Abrar Anwar,Rohan Gupta,Zain Merchant,Sayan Ghosh,Willie Neiswanger,Jesse Thomason
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*Comments:

Click to view abstract

Abstract:Evaluating learned robot control policies to determine their physical task-level capabilities costs experimenter time and effort. The growing number of policies and tasks exacerbates this issue. It is impractical to test every policy on every task multiple times; each trial requires a manual environment reset, and each task change involves re-arranging objects or even changing robots. Naively selecting a random subset of tasks and policies to evaluate is a high-cost solution with unreliable, incomplete results. In this work, we formulate robot evaluation as an active testing problem. We propose to model the distribution of robot performance across all tasks and policies as we sequentially execute experiments. Tasks often share similarities that can reveal potential relationships in policy behavior, and we show that natural language is a useful prior in modeling these relationships between tasks. We then leverage this formulation to reduce the experimenter effort by using a cost-aware expected information gain heuristic to efficiently select informative trials. Our framework accommodates both continuous and discrete performance outcomes. We conduct experiments on existing evaluation data from real robots and simulations. By prioritizing informative trials, our framework reduces the cost of calculating evaluation metrics for robot policies across many tasks.

[AI-51] AgentGuard: Repurposing Agentic Orchestrator for Safety Evaluation of Tool Orchestration

Link: https://arxiv.org/abs/2502.09809
Authors: Jizhou Chen,Samuel Lee Cong
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
*Comments: Project report of AgentGuard in LLM Agent MOOC Hackathon hosted by UC Berkeley in 2024

Click to view abstract

Abstract:The integration of tool use into large language models (LLMs) enables agentic systems with real-world impact. At the same time, unlike standalone LLMs, compromised agents can execute malicious workflows with more consequential impact because of their tool-use capability. We propose AgentGuard, a framework to autonomously discover and validate unsafe tool-use workflows, followed by generating safety constraints to confine the behaviors of agents, achieving a baseline safety guarantee at deployment. AgentGuard leverages the LLM orchestrator’s innate capabilities - knowledge of tool functionalities, scalable and realistic workflow generation, and tool execution privileges - to act as its own safety evaluator. The framework operates through four phases: identifying unsafe workflows, validating them in real-world execution, generating safety constraints, and validating constraint efficacy. The output, an evaluation report with unsafe workflows, test cases, and validated constraints, enables multiple security applications. We empirically demonstrate AgentGuard’s feasibility with experiments. With this exploratory work, we hope to inspire the establishment of standardized testing and hardening procedures for LLM agents to enhance their trustworthiness in real-world applications.

[AI-52] Co-designing Large Language Model Tools for Project-Based Learning with K12 Educators

Link: https://arxiv.org/abs/2502.09799
Authors: Prerna Ravi,John Masla,Gisella Kakoti,Grace Lin,Emma Anderson,Matt Taylor,Anastasia Ostrowski,Cynthia Breazeal,Eric Klopfer,Hal Abelson
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
*Comments: 25 pages

Click to view abstract

Abstract:The emergence of generative AI, particularly large language models (LLMs), has opened the door for student-centered and active learning methods like project-based learning (PBL). However, PBL poses practical implementation challenges for educators around project design and management, assessment, and balancing student guidance with student autonomy. The following research documents a co-design process with interdisciplinary K-12 teachers to explore and address the current PBL challenges they face. Through teacher-driven interviews, collaborative workshops, and iterative design of wireframes, we gathered evidence for ways LLMs can support teachers in implementing high-quality PBL pedagogy by automating routine tasks and enhancing personalized learning. Teachers in the study advocated for supporting their professional growth and augmenting their current roles without replacing them. They also identified affordances and challenges around classroom integration, including resource requirements and constraints, ethical concerns, and potential immediate and long-term impacts. Drawing on these, we propose design guidelines for future deployment of LLM tools in PBL.

[AI-53] A Survey on LLM-based News Recommender Systems

Link: https://arxiv.org/abs/2502.09797
Authors: Rongyao Wang,Veronica Liesaputra,Zhiyi Huang
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
*Comments:

Click to view abstract

Abstract:News recommender systems play a critical role in mitigating the information overload problem. In recent years, due to the successful applications of large language model technologies, researchers have utilized Discriminative Large Language Models (DLLMs) or Generative Large Language Models (GLLMs) to improve the performance of news recommender systems. Although several recent surveys review significant challenges for deep learning-based news recommender systems, such as fairness, privacy-preserving, and responsibility, there is a lack of a systematic survey on Large Language Model (LLM)-based news recommender systems. In order to review different core methodologies and explore potential issues systematically, we categorize DLLM-based and GLLM-based news recommender systems under the umbrella of LLM-based news recommender systems. In this survey, we first overview the development of deep learning-based news recommender systems. Then, we review LLM-based news recommender systems based on three aspects: news-oriented modeling, user-oriented modeling, and prediction-oriented modeling. Next, we examine the challenges from various perspectives, including datasets, benchmarking tools, and methodologies. Furthermore, we conduct extensive experiments to analyze how large language model technologies affect the performance of different news recommender systems. Finally, we comprehensively explore the future directions for LLM-based news recommendations in the era of LLMs.

[AI-54] TableTalk: Scaffolding Spreadsheet Development with a Language Agent

Link: https://arxiv.org/abs/2502.09787
Authors: Jenny T. Liang,Aayush Kumar,Yasharth Bajpai,Sumit Gulwani,Vu Le,Chris Parnin,Arjun Radhakrishna,Ashish Tiwari,Emerson Murphy-Hill,Guastavo Soares
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
*Comments:

Click to view abstract

Abstract:Despite its ubiquity in the workforce, spreadsheet programming remains challenging as programmers need both spreadsheet-specific knowledge (e.g., APIs to write formulas) and problem-solving skills to create complex spreadsheets. Large language models (LLMs) can help automate aspects of this process, and recent advances in planning and reasoning have enabled language agents, which dynamically plan, use tools, and take iterative actions to complete complex tasks. These agents observe, plan, and act, making them well-suited to scaffold spreadsheet programming by following expert processes. We present TableTalk, a language agent that helps programmers build spreadsheets conversationally. Its design reifies three design principles – scaffolding, flexibility, and incrementality – which we derived from two studies of seven programmers and 62 Excel templates. TableTalk structures spreadsheet development by generating step-by-step plans and suggesting three next steps users can choose from. It also integrates tools that enable incremental spreadsheet construction. A user study with 20 programmers shows that TableTalk produces spreadsheets 2.3 times more likely to be preferred over a baseline agent, while reducing cognitive load and time spent reasoning about spreadsheet actions by 12.6%. TableTalk’s approach has implications for human-agent collaboration. This includes providing persistent direct manipulation interfaces for stopping or undoing agent actions, while ensuring that such interfaces for accepting actions can be deactivated.

[AI-55] Incentivize without Bonus: Provably Efficient Model-based Online Multi-agent RL for Markov Games

Link: https://arxiv.org/abs/2502.09780
Authors: Tong Yang,Bo Dai,Lin Xiao,Yuejie Chi
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Science and Game Theory (cs.GT); Optimization and Control (math.OC)
*Comments:

Click to view abstract

Abstract:Multi-agent reinforcement learning (MARL) lies at the heart of a plethora of applications involving the interaction of a group of agents in a shared unknown environment. A prominent framework for studying MARL is Markov games, with the goal of finding various notions of equilibria in a sample-efficient manner, such as the Nash equilibrium (NE) and the coarse correlated equilibrium (CCE). However, existing sample-efficient approaches either require tailored uncertainty estimation under function approximation, or careful coordination of the players. In this paper, we propose a novel model-based algorithm, called VMG, that incentivizes exploration by biasing the empirical estimate of the model parameters towards those with higher collective best-response values of all the players when fixing the other players’ policies, thus encouraging the policy to deviate from its current equilibrium for more exploration. VMG is oblivious to different forms of function approximation, and permits simultaneous and uncoupled policy updates of all players. Theoretically, we also establish that VMG achieves a near-optimal regret for finding both the NEs of two-player zero-sum Markov games and CCEs of multi-player general-sum Markov games under linear function approximation in an online environment, nearly matching its counterparts that rely on sophisticated uncertainty quantification.

[AI-56] On the existence of EFX allocations in multigraphs

Link: https://arxiv.org/abs/2502.09777
Authors: Alkmini Sgouritsa,Minas Marios Sotiriou
Categories: Computer Science and Game Theory (cs.GT); Artificial Intelligence (cs.AI)
*Comments:

Click to view abstract

Abstract:We study the problem of “fairly” dividing indivisible goods among several agents that have valuation set functions over the sets of goods. As fair we consider the allocations that are envy-free up to any good (EFX), i.e., no agent envies any proper subset of the goods given to any other agent. The existence or not of EFX allocations is a major open problem in Fair Division, and there are only positive results for special cases. [George Christodoulou, Amos Fiat, Elias Koutsoupias, Alkmini Sgouritsa 2023] introduced a restriction on the agents’ valuations according to a graph structure: the vertices correspond to agents and the edges to goods, and each vertex/agent has zero marginal value (or in other words, they are indifferent) for the edges/goods that are not adjacent to them. The existence of EFX allocations has been shown for simple graphs with general monotone valuations [George Christodoulou, Amos Fiat, Elias Koutsoupias, Alkmini Sgouritsa 2023], and for multigraphs for restricted additive valuations [Alireza Kaviani, Masoud Seddighin, Amir Mohammad Shahrezaei 2024]. In this work, we push the state-of-the-art further, and show that EFX allocations always exist in multigraphs with general monotone valuations if any of the following three conditions holds: either (a) the multigraph is bipartite, or (b) each agent has at most $\lceil \frac{n}{4} \rceil - 1$ neighbors, where $n$ is the total number of agents, or (c) the shortest cycle with non-parallel edges has length at least 6.
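
The EFX condition defined above can be checked mechanically for a given allocation. The sketch below is a simplification: it assumes additive, non-negative valuations (the paper treats general monotone valuations), and the function name and data layout are illustrative. For monotone valuations it is enough to test bundles with exactly one good removed, since every proper subset is contained in one of them.

```python
def is_efx(valuations, allocation):
    """Check whether an allocation is envy-free up to any good (EFX).

    valuations[i][g] is agent i's (assumed additive, non-negative) value
    for good g; allocation[i] is the set of goods given to agent i.
    """
    def value(agent, bundle):
        return sum(valuations[agent][g] for g in bundle)

    agents = range(len(allocation))
    for i in agents:
        own_value = value(i, allocation[i])
        for j in agents:
            if i == j:
                continue
            # Agent i must not envy j's bundle after removing any one good.
            for g in allocation[j]:
                if own_value < value(i, allocation[j] - {g}):
                    return False
    return True
```

For instance, with valuations `[[3, 1, 1], [1, 3, 1]]`, giving good 0 to the first agent and goods 1 and 2 to the second satisfies the check.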

[AI-57] Differential Adjusted Parity for Learning Fair Representations

Link: https://arxiv.org/abs/2502.09765
Authors: Bucher Sahyouni,Matthew Vowels,Liqun Chen,Simon Hadfield
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*Comments:

Click to view abstract

Abstract:The development of fair and unbiased machine learning models remains an ongoing objective for researchers in the field of artificial intelligence. We introduce the Differential Adjusted Parity (DAP) loss to produce unbiased informative representations. It utilises a differentiable variant of the adjusted parity metric to create a unified objective function. By combining downstream task classification accuracy and its inconsistency across sensitive feature domains, it provides a single tool to increase performance and mitigate bias. A key element in this approach is the use of soft balanced accuracies. In contrast to previous non-adversarial approaches, DAP does not suffer a degeneracy where the metric is satisfied by performing equally poorly across all sensitive domains. It outperforms several adversarial models on downstream task accuracy and fairness in our analysis. Specifically, it improves the demographic parity, equalized odds and sensitive feature accuracy by as much as 22.5%, 44.1% and 40.1%, respectively, when compared to the best performing adversarial approaches on these metrics. Overall, the DAP loss and its associated metric can play a significant role in creating more fair machine learning models.
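
Two of the quantities mentioned above can be sketched for a binary classifier. This is not the DAP loss itself, whose exact form is not given in the abstract: the "soft" balanced accuracy below is one plausible reading (using predicted probabilities in place of hard decisions), and the demographic parity gap is the standard difference in positive-prediction rates between two groups. All names are illustrative.

```python
def soft_balanced_accuracy(probabilities, labels):
    """Balanced accuracy computed from predicted probabilities of class 1.

    Assumes binary labels (0 or 1) with at least one example per class.
    """
    positives = [p for p, y in zip(probabilities, labels) if y == 1]
    negatives = [1.0 - p for p, y in zip(probabilities, labels) if y == 0]
    recall_positive = sum(positives) / len(positives)
    recall_negative = sum(negatives) / len(negatives)
    return 0.5 * (recall_positive + recall_negative)


def demographic_parity_gap(predictions, groups):
    """Absolute difference in positive-prediction rate between group 0
    and group 1 of a binary sensitive attribute."""
    rates = []
    for group in (0, 1):
        members = [p for p, g in zip(predictions, groups) if g == group]
        rates.append(sum(members) / len(members))
    return abs(rates[0] - rates[1])
```

Because the soft version is a smooth function of the probabilities, it can in principle be used inside a differentiable training objective, which is the property the abstract emphasizes.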

[AI-58] Adaptive Teaming in Multi-Drone Pursuit: Simulation Training and Deployment

Link: https://arxiv.org/abs/2502.09762
Authors: Yang Li,Junfan Chen,Feng Xue,Jiabin Qiu,Wenbin Li,Qingrui Zhang,Ying Wen,Wei Pan
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments: 17 pages

Abstract:Adaptive teaming, the ability to collaborate with unseen teammates without prior coordination, remains an underexplored challenge in multi-robot collaboration. This paper focuses on adaptive teaming in multi-drone cooperative pursuit, a critical task with real-world applications such as border surveillance, search-and-rescue, and counter-terrorism. We first define and formalize the Adaptive Teaming in Multi-Drone Pursuit (AT-MDP) problem and introduce the AT-MDP framework, a comprehensive framework that integrates simulation, algorithm training and real-world deployment. The AT-MDP framework provides a flexible experiment configurator and interface for simulation, a distributed training framework with an extensive algorithm zoo (including two newly proposed baseline methods) and an unseen drone zoo for evaluating adaptive teaming, as well as a real-world deployment system that utilizes edge computing and Crazyflie drones. To the best of our knowledge, the AT-MDP framework is the first adaptive framework for continuous-action decision-making in complex real-world drone tasks, enabling multiple drones to coordinate effectively with unseen teammates. Extensive experiments in four multi-drone pursuit environments of increasing difficulty confirm the effectiveness of the AT-MDP framework, while real-world deployments further validate its feasibility in physical systems. Videos and code are available at this https URL.

[AI-59] The AI-Therapist Duo: Exploring the Potential of Human-AI Collaboration in Personalized Art Therapy for PICS Intervention

Link: https://arxiv.org/abs/2502.09757
Authors: Bereket A. Yilma,Chan Mi Kim,Geke Ludden,Thomas van Rompay,Luis A. Leiva
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
Comments:

Abstract:Post-intensive care syndrome (PICS) is a multifaceted condition that arises from prolonged stays in an intensive care unit (ICU). While preventing PICS among ICU patients is becoming increasingly important, interventions remain limited. Building on evidence supporting the effectiveness of art exposure in addressing the psychological aspects of PICS, we propose a novel art therapy solution through a collaborative Human-AI approach that enhances personalized therapeutic interventions using state-of-the-art Visual Art Recommendation Systems. We developed two Human-in-the-Loop (HITL) personalization methods and assessed their impact through a large-scale user study (N=150). Our findings demonstrate that this Human-AI collaboration not only enhances the personalization and effectiveness of art therapy but also supports therapists by streamlining their workload. While our study centres on PICS intervention, the results suggest that human-AI collaborative Art therapy could potentially benefit other areas where emotional support is critical, such as cases of anxiety and depression.

[AI-60] Vote-Tree-Planner: Optimizing Execution Order in LLM-based Task Planning Pipeline via Voting

Link: https://arxiv.org/abs/2502.09749
Authors: Chaoyuan Zhang,Zhaowei Li,Wentao Yuan
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments: Accepted to RSS24-W: TaskSpec

Abstract:Integrating large language models (LLMs) into closed-loop robotic task planning has become increasingly popular within embodied artificial intelligence. Previous efforts mainly focused on leveraging the strong reasoning abilities of LLMs to enhance task planning performance while often overlooking task planning efficiency and executability due to repetitive queries to LLMs. This paper addresses the synergy between LLMs and task planning systems, aiming to minimize redundancy while enhancing planning effectiveness. Specifically, building upon Prog-Prompt and the high-level concept of Tree-Planner, we propose Vote-Tree-Planner. This sampling strategy utilizes votes to guide plan traversal during the decision-making process. Our approach is motivated by a straightforward observation: assigning weights to agents during decision-making enables the evaluation of critical paths before execution. With this simple vote-tree construction, our method further improves the success rate and reduces the number of queries to LLMs. The experimental results highlight that our Vote-Tree-Planner demonstrates greater stability and shows a higher average success rate and goal condition recall on the unseen dataset compared with previous baseline methods. These findings underscore the potential of the Vote-Tree-Planner to enhance planning accuracy, reliability, and efficiency in LLM-based planning systems.

[AI-61] NeuralCFD: Deep Learning on High-Fidelity Automotive Aerodynamics Simulations

Link: https://arxiv.org/abs/2502.09692
Authors: Maurits Bleeker,Matthias Dorfer,Tobias Kronlachner,Reinhard Sonnleitner,Benedikt Alkin,Johannes Brandstetter
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Preprint

Abstract:Recent advancements in neural operator learning are paving the way for transformative innovations in fields such as automotive aerodynamics. However, key challenges must be overcome before neural network-based simulation surrogates can be implemented at an industry scale. First, surrogates must become scalable to large surface and volume meshes, especially when using raw geometry inputs only, i.e., without relying on the simulation mesh. Second, surrogates must be trainable with a limited number of high-fidelity numerical simulation samples while still reaching the required performance levels. To this end, we introduce Geometry-preserving Universal Physics Transformer (GP-UPT), which separates geometry encoding and physics predictions, ensuring flexibility with respect to geometry representations and surface sampling strategies. GP-UPT enables independent scaling of the respective parts of the model according to practical requirements, offering scalable solutions to open challenges. GP-UPT circumvents the creation of high-quality simulation meshes, enables accurate 3D velocity field predictions at 20 million mesh cells, and excels in transfer learning from low-fidelity to high-fidelity simulation datasets, requiring less than half of the high-fidelity data to match the performance of models trained from scratch.

[AI-62] Efficient and Trustworthy Block Propagation for Blockchain-enabled Mobile Embodied AI Networks: A Graph Resfusion Approach

Link: https://arxiv.org/abs/2502.09624
Authors: Jiawen Kang,Jiana Liao,Runquan Gao,Jinbo Wen,Huawei Huang,Maomao Zhang,Changyan Yi,Tao Zhang,Dusit Niyato,Zibin Zheng
Categories: Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
Comments: 15 pages, 11 figures

[AI-63] Machine Learning for Phase Estimation in Satellite-to-Earth Quantum Communication

Link: https://arxiv.org/abs/2502.09920
Authors: Nathan K Long,Robert Malaney,Kenneth J Grant
Categories: Quantum Physics (quant-ph); Artificial Intelligence (cs.AI); Signal Processing (eess.SP)
Comments:

Machine Learning

[LG-0] (How) Can Transformers Predict Pseudo-Random Numbers?

Link: https://arxiv.org/abs/2502.10390
Authors: Tao Tao,Darshil Doshi,Dayal Singh Kalra,Tianyu He,Maissam Barkeshli
Categories: Machine Learning (cs.LG); Disordered Systems and Neural Networks (cond-mat.dis-nn); Cryptography and Security (cs.CR); Machine Learning (stat.ML)
Comments: 10+16 pages, 12+20 figures

Abstract:Transformers excel at discovering patterns in sequential data, yet their fundamental limitations and learning mechanisms remain crucial topics of investigation. In this paper, we study the ability of Transformers to learn pseudo-random number sequences from linear congruential generators (LCGs), defined by the recurrence relation $x_{t+1} = a x_t + c \;\mathrm{mod}\; m$. Our analysis reveals that with sufficient architectural capacity and training data variety, Transformers can perform in-context prediction of LCG sequences with unseen moduli ($m$) and parameters ($a, c$). Through analysis of embedding layers and attention patterns, we uncover how Transformers develop algorithmic structures to learn these sequences in two scenarios of increasing complexity. First, we analyze how Transformers learn LCG sequences with unseen ($a, c$) but fixed modulus, and we demonstrate successful learning up to $m = 2^{32}$. Our analysis reveals that models learn to factorize the modulus and utilize digit-wise number representations to make sequential predictions. In the second, more challenging scenario of unseen moduli, we show that Transformers can generalize to unseen moduli up to $m_{\text{test}} = 2^{16}$. In this case, the model employs a two-step strategy: first estimating the unknown modulus from the context, then utilizing prime factorizations to generate predictions. For this task, we observe a sharp transition in the accuracy at a critical depth of 3. We also find that the number of in-context sequence elements needed to reach high accuracy scales sublinearly with the modulus.
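
The LCG recurrence in this abstract is simple to generate directly, which makes clear what the in-context prediction task looks like. The sketch below is illustrative only: the function name `lcg_sequence` and the small parameters are assumptions, not values from the paper.

```python
from typing import List


def lcg_sequence(a: int, c: int, m: int, seed: int, length: int) -> List[int]:
    """Generate `length` terms of x_{t+1} = (a * x_t + c) mod m, starting at seed mod m."""
    xs = []
    x = seed % m
    for _ in range(length):
        xs.append(x)
        x = (a * x + c) % m
    return xs


# Small illustrative parameters (hypothetical; the paper studies moduli up to 2^32).
seq = lcg_sequence(a=5, c=3, m=16, seed=7, length=6)
print(seq)  # [7, 6, 1, 8, 11, 10]

# An in-context prediction example: the model sees the context and must predict the target.
context, target = seq[:-1], seq[-1]
```

With these parameters the generator visits all 16 residues before repeating, so a predictor cannot succeed by memorising a short cycle.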

[LG-1] Balancing the Scales: A Theoretical and Algorithmic Framework for Learning from Imbalanced Data

Link: https://arxiv.org/abs/2502.10381
Authors: Corinna Cortes,Anqi Mao,Mehryar Mohri,Yutao Zhong
Categories: Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments:

Abstract:Class imbalance remains a major challenge in machine learning, especially in multi-class problems with long-tailed distributions. Existing methods, such as data resampling, cost-sensitive techniques, and logistic loss modifications, though popular and often effective, lack solid theoretical foundations. As an example, we demonstrate that cost-sensitive methods are not Bayes consistent. This paper introduces a novel theoretical framework for analyzing generalization in imbalanced classification. We propose a new class-imbalanced margin loss function for both binary and multi-class settings, prove its strong $H$-consistency, and derive corresponding learning guarantees based on empirical loss and a new notion of class-sensitive Rademacher complexity. Leveraging these theoretical results, we devise novel and general learning algorithms, IMMAX (Imbalanced Margin Maximization), which incorporate confidence margins and are applicable to various hypothesis sets. While our focus is theoretical, we also present extensive empirical results demonstrating the effectiveness of our algorithms compared to existing baselines.

[LG-2] AffinityFlow: Guided Flows for Antibody Affinity Maturation

Link: https://arxiv.org/abs/2502.10365
Authors: Can Chen,Karla-Luise Herpoldt,Chenchao Zhao,Zichen Wang,Marcus Collins,Shang Shang,Ron Benson
Categories: Machine Learning (cs.LG)
Comments: 14 pages, 5 figures

Abstract:Antibodies are widely used as therapeutics, but their development requires costly affinity maturation, involving iterative mutations to enhance binding affinity. This paper explores a sequence-only scenario for affinity maturation, using solely antibody and antigen sequences. Recently, AlphaFlow wrapped AlphaFold within flow matching to generate diverse protein structures, enabling a sequence-conditioned generative model of structure. Building on this, we propose an alternating optimization framework that (1) fixes the sequence to guide structure generation toward high binding affinity using a structure-based affinity predictor, then (2) applies inverse folding to create sequence mutations, refined by a sequence-based affinity predictor for post selection. To address this, we develop a co-teaching module that incorporates valuable information from noisy biophysical energies into predictor refinement. The sequence-based predictor selects consensus samples to teach the structure-based predictor, and vice versa. Our method, AffinityFlow, achieves state-of-the-art performance in affinity maturation experiments. We plan to open-source our code after acceptance.

[LG-3] Proper Learnability and the Role of Unlabeled Data ALT2025

Link: https://arxiv.org/abs/2502.10359
Authors: Julian Asilis,Siddartha Devic,Shaddin Dughmi,Vatsal Sharan,Shang-Hua Teng
Categories: Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments: ALT 2025, 22 pages

Abstract:Proper learning refers to the setting in which learners must emit predictors in the underlying hypothesis class H , and often leads to learners with simple algorithmic forms (e.g. empirical risk minimization (ERM), structural risk minimization (SRM)). The limitation of proper learning, however, is that there exist problems which can only be learned improperly, e.g. in multiclass classification. Thus, we ask: Under what assumptions on the hypothesis class or the information provided to the learner is a problem properly learnable? We first demonstrate that when the unlabeled data distribution is given, there always exists an optimal proper learner governed by distributional regularization, a randomized generalization of regularization. We refer to this setting as the distribution-fixed PAC model, and continue to evaluate the learner on its worst-case performance over all distributions. Our result holds for all metric loss functions and any finite learning problem (with no dependence on its size). Further, we demonstrate that sample complexities in the distribution-fixed PAC model can shrink by only a logarithmic factor from the classic PAC model, strongly refuting the role of unlabeled data in PAC learning (from a worst-case perspective). We complement this with impossibility results which obstruct any characterization of proper learnability in the realizable PAC model. First, we observe that there are problems whose proper learnability is logically undecidable, i.e., independent of the ZFC axioms. We then show that proper learnability is not a monotone property of the underlying hypothesis class, and that it is not a local property (in a precise sense). Our impossibility results all hold even for the fundamental setting of multiclass classification, and go through a reduction of EMX learning (Ben-David et al., 2019) to proper classification which may be of independent interest. 

[LG-4] Dimension-free Score Matching and Time Bootstrapping for Diffusion Models

Link: https://arxiv.org/abs/2502.10354
Authors: Syamantak Kumar,Dheeraj Nagaraj,Purnamrita Sarkar
Categories: Machine Learning (cs.LG); Statistics Theory (math.ST); Machine Learning (stat.ML)
Comments:

Abstract:Diffusion models generate samples by estimating the score function of the target distribution at various noise levels. The model is trained using samples drawn from the target distribution, progressively adding noise. In this work, we establish the first (nearly) dimension-free sample complexity bounds for learning these score functions, achieving a double exponential improvement in dimension over prior results. A key aspect of our analysis is the use of a single function approximator to jointly estimate scores across noise levels, a critical feature of diffusion models in practice which enables generalization across timesteps. Our analysis introduces a novel martingale-based error decomposition and sharp variance bounds, enabling efficient learning from dependent data generated by Markov processes, which may be of independent interest. Building on these insights, we propose Bootstrapped Score Matching (BSM), a variance reduction technique that utilizes previously learned scores to improve accuracy at higher noise levels. These results provide crucial insights into the efficiency and effectiveness of diffusion models for generative modeling.

[LG-5] Assortment Optimization for Patient-Provider Matching

Link: https://arxiv.org/abs/2502.10353
Authors: Naveen Raman,Holly Wiberg
Categories: Computers and Society (cs.CY); Machine Learning (cs.LG); Optimization and Control (math.OC)
Comments: 36 pages, 11 Figures

Abstract:Rising provider turnover forces healthcare administrators to frequently rematch patients to available providers, which can be cumbersome and labor-intensive. To reduce the burden of rematching, we study algorithms for matching patients and providers through assortment optimization. We develop a patient-provider matching model in which we simultaneously offer each patient a menu of providers, and patients subsequently respond and select providers. By offering assortments upfront, administrators can balance logistical ease and patient autonomy. We study policies for assortment optimization and characterize their performance under different problem settings. We demonstrate that the selection of assortment policy is highly dependent on problem specifics and, in particular, on a patient’s willingness to match and the ratio between patients and providers. On real-world data, we show that our best policy can improve match quality by 13% over a greedy solution by tailoring assortment sizes based on patient characteristics. We conclude with recommendations for running a real-world patient-provider matching system inspired by our results.

[LG-6] InfoPos: A ML-Assisted Solution Design Support Framework for Industrial Cyber-Physical Systems

Link: https://arxiv.org/abs/2502.10331
Authors: Uraz Odyurt,Richard Loendersloot,Tiedo Tinga
Categories: Machine Learning (cs.LG)
Comments:

Abstract:The variety of building blocks and algorithms incorporated in data-centric and ML-assisted solutions is high, contributing to two challenges: selecting the most effective set and order of building blocks, and achieving such a selection at minimum cost. Considering that ML-assisted solution design is influenced by the extent of available data, as well as available knowledge of the target system, it is advantageous to be able to select matching building blocks. We introduce the first iteration of our InfoPos framework, allowing the placement of use-cases considering the available positions (levels), i.e., from poor to rich, of knowledge and data dimensions. With that input, designers and developers can reveal the most effective corresponding choice(s), streamlining the solution design process. The results from our demonstrator, an anomaly identification use-case for industrial Cyber-Physical Systems, reflect the effects achieved by using different building blocks across knowledge and data positions. The achieved ML model performance is considered as the indicator. Our data processing code and the composed data sets are publicly available.

[LG-7] DiOpt: Self-supervised Diffusion for Constrained Optimization

Link: https://arxiv.org/abs/2502.10330
Authors: Shutong Ding,Yimiao Zhou,Ke Hu,Xi Yao,Junchi Yan,Xiaoying Tang,Ye Shi
Categories: Machine Learning (cs.LG)
Comments:

Abstract:Recent advances in diffusion models show promising potential for learning-based optimization by leveraging their multimodal sampling capability to escape local optima. However, existing diffusion-based optimization approaches, often reliant on supervised training, lack a mechanism to ensure the strict constraint satisfaction that is often required in real-world applications. One resulting observation is distributional misalignment, i.e., the generated solution distribution often exhibits small overlap with the feasible domain. In this paper, we propose DiOpt, a novel diffusion paradigm that systematically learns near-optimal feasible solution distributions through iterative self-training. Our framework introduces several key innovations: a target distribution specifically designed to maximize overlap with the constrained solution manifold; a bootstrapped self-training mechanism that adaptively weights candidate solutions based on the severity of constraint violations and optimality gaps; and a dynamic memory buffer that accelerates convergence by retaining high-quality solutions over training iterations. To our knowledge, DiOpt represents the first successful integration of self-supervised diffusion with hard constraint satisfaction. Evaluations on diverse tasks, including power grid control, motion retargeting, and wireless allocation, demonstrate its superiority in terms of both optimality and constraint satisfaction.

[LG-8] Fenchel-Young Variational Learning

Link: https://arxiv.org/abs/2502.10295
Authors: Sophia Sklaviadis,Sweta Agrawal,Antonio Farinhas,Andre Martins,Mario Figueiredo
Categories: Machine Learning (cs.LG)
Comments: Under review

Abstract:From a variational perspective, many statistical learning criteria involve seeking a distribution that balances empirical risk and regularization. In this paper, we broaden this perspective by introducing a new general class of variational methods based on Fenchel-Young (FY) losses, treated as divergences that generalize (and encompass) the familiar Kullback-Leibler divergence at the core of classical variational learning. Our proposed formulation – FY variational learning – includes as key ingredients new notions of FY free energy, FY evidence, FY evidence lower bound, and FY posterior. We derive alternating minimization and gradient backpropagation algorithms to compute (or lower bound) the FY evidence, which enables learning a wider class of models than previous variational formulations. This leads to generalized FY variants of classical algorithms, such as an FY expectation-maximization (FYEM) algorithm, and latent-variable models, such as an FY variational autoencoder (FYVAE). Our new methods are shown to be empirically competitive, often outperforming their classical counterparts, and most importantly, to have qualitatively novel features. For example, FYEM has an adaptively sparse E-step, while the FYVAE can support models with sparse observations and sparse posteriors.
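
For readers unfamiliar with Fenchel-Young losses, the standard definition and its Kullback-Leibler special case can be written out as follows. This is background from the general FY-loss literature rather than a formula quoted from this abstract; the symbols $\Omega$, $\theta$, $y$ and $p$ are notation assumed here.

```latex
% Fenchel-Young loss generated by a convex regularizer \Omega,
% with \Omega^* denoting its convex conjugate:
L_\Omega(\theta; y) \;=\; \Omega^*(\theta) + \Omega(y) - \langle \theta, y \rangle \;\ge\; 0 .

% Special case: \Omega(p) = \sum_i p_i \log p_i on the probability simplex
% (negative Shannon entropy), for which \Omega^*(\theta) = \log \sum_i e^{\theta_i}:
L_\Omega(\theta; p) \;=\; \log \sum_i e^{\theta_i} + \sum_i p_i \log p_i - \sum_i p_i \theta_i
\;=\; \mathrm{KL}\bigl(p \,\|\, \mathrm{softmax}(\theta)\bigr) .
```

Non-negativity is the Fenchel-Young inequality; choosing other regularizers $\Omega$ can yield sparse minimizers, which is consistent with the sparse E-step and sparse posteriors mentioned in the abstract.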

[LG-9] Small Loss Bounds for Online Learning Separated Function Classes: A Gaussian Process Perspective

Link: https://arxiv.org/abs/2502.10292
Authors: Adam Block,Abhishek Shetty
Categories: Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments:

Abstract:In order to develop practical and efficient algorithms while circumventing overly pessimistic computational lower bounds, recent work has been interested in developing oracle-efficient algorithms in a variety of learning settings. Two such settings of particular interest are online and differentially private learning. While seemingly different, these two fields are fundamentally connected by the requirement that successful algorithms in each case satisfy stability guarantees; in particular, recent work has demonstrated that algorithms for online learning whose performance adapts to beneficial problem instances, attaining the so-called small-loss bounds, require a form of stability similar to that of differential privacy. In this work, we identify the crucial role that separation plays in allowing oracle-efficient algorithms to achieve this strong stability. Our notion, which we term $\rho$-separation, generalizes and unifies several previous approaches to enforcing this strong stability, including the existence of small-separator sets and the recent notion of $\gamma$-approximability. We present an oracle-efficient algorithm that is capable of achieving small-loss bounds with improved rates in greater generality than previous work, as well as a variant for differentially private learning that attains optimal rates, again under our separation condition. In so doing, we prove a new stability result for minimizers of a Gaussian process that strengthens and generalizes previous work.

[LG-10] Adversarial Mixup Unlearning ICLR2025

Link: https://arxiv.org/abs/2502.10288
Authors: Zhuoyi Peng,Yixuan Tang,Yi Yang
Categories: Machine Learning (cs.LG)
Comments: ICLR 2025

Abstract:Machine unlearning is a critical area of research aimed at safeguarding data privacy by enabling the removal of sensitive information from machine learning models. One unique challenge in this field is catastrophic unlearning, where erasing specific data from a well-trained model unintentionally removes essential knowledge, causing the model to deviate significantly from a retrained one. To address this, we introduce a novel approach that regularizes the unlearning process by utilizing synthesized mixup samples, which simulate the data susceptible to catastrophic effects. At the core of our approach is a generator-unlearner framework, MixUnlearn, where a generator adversarially produces challenging mixup examples, and the unlearner effectively forgets target information based on these synthesized data. Specifically, we first introduce a novel contrastive objective to train the generator in an adversarial direction: generating examples that prompt the unlearner to reveal information that should be forgotten, while losing essential knowledge. Then the unlearner, guided by two other contrastive loss terms, processes the synthesized and real data jointly to ensure accurate unlearning without losing critical knowledge, overcoming catastrophic effects. Extensive evaluations across benchmark datasets demonstrate that our method significantly outperforms state-of-the-art approaches, offering a robust solution to machine unlearning. This work not only deepens understanding of unlearning mechanisms but also lays the foundation for effective machine unlearning with mixup augmentation.

[LG-11] Probabilistic Super-Resolution for High-Fidelity Physical System Simulations with Uncertainty Quantification

Link: https://arxiv.org/abs/2502.10280
Authors: Pengyu Zhang,Connor Duffin,Alex Glyn-Davies,Arnaud Vadeboncoeur,Mark Girolami
Categories: Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments:

Abstract:Super-resolution (SR) is a promising tool for generating high-fidelity simulations of physical systems from low-resolution data, enabling fast and accurate predictions in engineering applications. However, existing deep-learning-based SR methods require large labeled datasets and lack reliable uncertainty quantification (UQ), limiting their applicability in real-world scenarios. To overcome these challenges, we propose a probabilistic SR framework that leverages the Statistical Finite Element Method and energy-based generative modeling. Our method enables efficient high-resolution predictions with inherent UQ, while eliminating the need for extensive labeled datasets. The method is validated on a 2D Poisson example and compared with bicubic interpolation upscaling. Results demonstrate a computational speed-up over high-resolution numerical solvers while providing reliable uncertainty estimates.

[LG-12] Learning to Solve the Min-Max Mixed-Shelves Picker-Routing Problem via Hierarchical and Parallel Decoding

Link: https://arxiv.org/abs/2502.10233
Authors: Laurin Luttmann,Lin Xie
Categories: Multiagent Systems (cs.MA); Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments:

Abstract:The Mixed-Shelves Picker Routing Problem (MSPRP) is a fundamental challenge in warehouse logistics, where pickers must navigate a mixed-shelves environment to retrieve SKUs efficiently. Traditional heuristics and optimization-based approaches struggle with scalability, while recent machine learning methods often rely on sequential decision-making, leading to high solution latency and suboptimal agent coordination. In this work, we propose a novel hierarchical and parallel decoding approach for solving the min-max variant of the MSPRP via multi-agent reinforcement learning. While our approach generates a joint distribution over agent actions, allowing for fast decoding and effective picker coordination, our method introduces a sequential action selection to avoid conflicts in the multi-dimensional action space. Experiments show state-of-the-art performance in both solution quality and inference speed, particularly for large-scale and out-of-distribution instances. Our code is publicly available at this http URL.

[LG-13] ProReco: A Process Discovery Recommender System

Link: https://arxiv.org/abs/2502.10230
Authors: Tsung-Hao Huang,Tarek Junied,Marco Pegoraro,Wil M. P. van der Aalst
Categories: Machine Learning (cs.LG); Information Retrieval (cs.IR)
Comments: 8 pages, 5 figures, 9 references

Abstract:Process discovery aims to automatically derive process models from historical execution data (event logs). While various process discovery algorithms have been proposed in the last 25 years, there is no consensus on a dominating discovery algorithm. Selecting the most suitable discovery algorithm remains a challenge due to competing quality measures and diverse user requirements. Manually selecting the most suitable process discovery algorithm from a range of options for a given event log is a time-consuming and error-prone task. This paper introduces ProReco, a Process discovery Recommender system designed to recommend the most appropriate algorithm based on user preferences and event log characteristics. ProReco incorporates state-of-the-art discovery algorithms, extends the feature pools from previous work, and utilizes eXplainable AI (XAI) techniques to provide explanations for its recommendations.

[LG-14] Comparison of Deep Recurrent Neural Networks and Bayesian Neural Networks for Detecting Electric Motor Damage Through Sound Signal Analysis

Link: https://arxiv.org/abs/2502.10224
Authors: Waldemar Bauer,Jerzy Baranowski
Categories: Machine Learning (cs.LG)
Comments: Draft articles. arXiv admin note: substantial text overlap with arXiv:2409.08309

Abstract:Fault detection in electric motors is a critical challenge in various industries, where failures can result in significant operational disruptions. This study investigates the use of Recurrent Neural Networks (RNNs) and Bayesian Neural Networks (BNNs) for diagnosing motor damage using acoustic signal analysis. A novel approach is proposed, leveraging frequency domain representation of sound signals for enhanced diagnostic accuracy. The architectures of both RNNs and BNNs are designed and evaluated on real-world acoustic data collected from household appliances using smartphones. Experimental results demonstrate that BNNs provide superior fault detection performance, particularly for imbalanced datasets, offering more robust and interpretable predictions compared to traditional methods. The findings suggest that BNNs, with their ability to incorporate uncertainty, are well-suited for industrial diagnostic applications. Further analysis and benchmarks are suggested to explore resource efficiency and classification capabilities of these architectures.

[LG-15] Control-flow anomaly detection by process mining-based feature extraction and dimensionality reduction

Link: https://arxiv.org/abs/2502.10211
Authors: Francesco Vitale,Marco Pegoraro,Wil M. P. van der Aalst,Nicola Mazzocca
Categories: Machine Learning (cs.LG)
Comments: 16 pages, 9 figures, 7 tables, 56 references

Abstract:The business processes of organizations may deviate from normal control flow due to disruptive anomalies, including unknown, skipped, and wrongly-ordered activities. To identify these control-flow anomalies, process mining can check control-flow correctness against a reference process model through conformance checking, an explainable set of algorithms that allows linking any deviations with model elements. However, the effectiveness of conformance checking-based techniques is negatively affected by noisy event data and low-quality process models. To address these shortcomings and support the development of competitive and explainable conformance checking-based techniques for control-flow anomaly detection, we propose a novel process mining-based feature extraction approach with alignment-based conformance checking. This variant aligns the deviating control flow with a reference process model; the resulting alignment can be inspected to extract additional statistics such as the number of times a given activity caused mismatches. We integrate this approach into a flexible and explainable framework for developing techniques for control-flow anomaly detection. The framework combines process mining-based feature extraction and dimensionality reduction to handle high-dimensional feature sets, achieve detection effectiveness, and support explainability. The results show that the framework techniques implementing our approach outperform the baseline conformance checking-based techniques while maintaining the explainable nature of conformance checking. We also provide an explanation of why existing conformance checking-based techniques may be ineffective.

[LG-16] SGS-GNN: A Supervised Graph Sparsification method for Graph Neural Networks

Link: https://arxiv.org/abs/2502.10208
Authors: Siddhartha Shankar Das,Naheed Anjum Arafat,Muftiqur Rahman,S M Ferdous,Alex Pothen,Mahantesh M Halappanavar
Categories: Machine Learning (cs.LG)

Abstract:We propose SGS-GNN, a novel supervised graph sparsifier that learns the sampling probability distribution of edges and samples sparse subgraphs of a user-specified size to reduce the computational costs required by GNNs for inference tasks on large graphs. SGS-GNN employs regularizers in the loss function to enhance homophily in sparse subgraphs, boosting the accuracy of GNNs on heterophilic graphs, where a significant number of the neighbors of a node have dissimilar labels. SGS-GNN also supports conditional updates of the probability distribution learning module based on a prior, which helps narrow the search space for sparse graphs. SGS-GNN requires fewer epochs to obtain high accuracies since it learns the search space of subgraphs more effectively than methods using fixed distributions such as random sampling. Extensive experiments using 33 homophilic and heterophilic graphs demonstrate the following: (i) with only 20% of edges retained in the sparse subgraphs, SGS-GNN improves the F1-scores by a geometric mean of 4% relative to the original graph; on heterophilic graphs, the prediction accuracy is better up to 30%. (ii) SGS-GNN outperforms state-of-the-art methods with improvement in F1-scores of 4-7% in geometric mean with similar sparsities in the sampled subgraphs, and (iii) compared to sparsifiers that employ fixed distributions, SGS-GNN requires about half the number of epochs to converge.
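
As a rough illustration of the sampling step this abstract describes (not the authors' implementation), the sketch below draws a sparse subgraph of a fixed size from a per-edge probability distribution. The function name `sample_sparse_subgraph`, the toy edge list, and the probabilities are all hypothetical; in SGS-GNN the distribution is learned, whereas here it is simply given.

```python
import numpy as np

def sample_sparse_subgraph(edges, probs, keep_ratio, seed=0):
    """Sample a subset of edges without replacement according to probs.

    edges: (E, 2) integer array of node pairs.
    probs: (E,) non-negative scores; normalized internally.
    keep_ratio: fraction of edges to retain (e.g. 0.2 for 20%).
    """
    edges = np.asarray(edges)
    probs = np.asarray(probs, dtype=float)
    probs = probs / probs.sum()
    k = max(1, int(round(keep_ratio * len(edges))))
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(edges), size=k, replace=False, p=probs)
    return edges[np.sort(idx)]

# Toy path graph with 10 edges; keep 20% of them.
toy_edges = np.array([[i, i + 1] for i in range(10)])
toy_probs = np.linspace(1.0, 10.0, num=10)  # assumed scores, not learned
sparse_edges = sample_sparse_subgraph(toy_edges, toy_probs, keep_ratio=0.2)
```

A GNN would then run message passing only over `sparse_edges`, which is where the cost reduction would come from under these assumptions.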

[LG-17] Looking around you: external information enhances representations for event sequences

Link: https://arxiv.org/abs/2502.10205
Authors: Maria Kovaleva,Petr Sokerin,Sofia Krehova,Alexey Zaytsev
Categories: Machine Learning (cs.LG)

Abstract:Representation learning produces models in different domains, such as store purchases, client transactions, and general people’s behaviour. However, such models for sequential data usually process a single sequence, ignoring context from other relevant ones, even in domains with rapidly changing external environments like finance or misguiding the prediction for a user with no recent events. We are the first to propose a method that aggregates information from multiple user representations augmenting a specific user one for a scenario of multiple co-occurring event sequences. Our study considers diverse aggregation approaches, ranging from simple pooling techniques to trainable attention-based approaches, especially Kernel attention aggregation, that can highlight more complex information flow from other users. The proposed method operates atop an existing encoder and supports its efficient fine-tuning. Across considered datasets of financial transactions and downstream tasks, Kernel attention improves ROC AUC scores, both with and without fine-tuning, while mean pooling yields a smaller but still significant gain.

[LG-18] AI-in-the-Loop Sensing and Communication Joint Design for Edge Intelligence

Link: https://arxiv.org/abs/2502.10203
Authors: Zhijie Cai,Xiaowen Cao,Xu Chen,Yuanhao Cui,Guangxu Zhu,Kaibin Huang,Shuguang Cui
Categories: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC)

Abstract:Recent breakthroughs in artificial intelligence (AI), wireless communications, and sensing technologies have accelerated the evolution of edge intelligence. However, conventional systems still grapple with issues such as low communication efficiency, redundant data acquisition, and poor model generalization. To overcome these challenges, we propose an innovative framework that enhances edge intelligence through AI-in-the-loop joint sensing and communication (JSAC). This framework features an AI-driven closed-loop control architecture that jointly optimizes system resources, thereby delivering superior system-level performance. A key contribution of our work is establishing an explicit relationship between validation loss and the system’s tunable parameters. This insight enables dynamic reduction of the generalization error through AI-driven closed-loop control. Specifically, for sensing control, we introduce an adaptive data collection strategy based on gradient importance sampling, allowing edge devices to autonomously decide when to terminate data acquisition and how to allocate sample weights based on real-time model feedback. For communication control, drawing inspiration from stochastic gradient Langevin dynamics (SGLD), our joint optimization of transmission power and batch size converts channel and data noise into gradient perturbations that help mitigate overfitting. Experimental evaluations demonstrate that our framework reduces communication energy consumption by up to 77 percent and sensing costs measured by the number of collected samples by up to 52 percent while significantly improving model generalization – with up to 58 percent reductions of the final validation loss. It validates that the proposed scheme can harvest the mutual benefit of AI and JSAC systems by incorporating the model itself into the control loop of the system.

[LG-19] A Powerful Random Forest Featuring Linear Extensions (RaFFLE)

Link: https://arxiv.org/abs/2502.10185
Authors: Jakob Raymaekers,Peter J. Rousseeuw,Thomas Servotte,Tim Verdonck,Ruicong Yao
Categories: Machine Learning (cs.LG)

Abstract:Random forests are widely used in regression. However, the decision trees used as base learners are poor approximators of linear relationships. To address this limitation we propose RaFFLE (Random Forest Featuring Linear Extensions), a novel framework that integrates the recently developed PILOT trees (Piecewise Linear Organic Trees) as base learners within a random forest ensemble. PILOT trees combine the computational efficiency of traditional decision trees with the flexibility of linear model trees. To ensure sufficient diversity of the individual trees, we introduce an adjustable regularization parameter and use node-level feature sampling. These modifications improve the accuracy of the forest. We establish theoretical guarantees for the consistency of RaFFLE under weak conditions, and its faster convergence when the data are generated by a linear model. Empirical evaluations on 136 regression datasets demonstrate that RaFFLE outperforms the classical CART and random forest methods, the regularized linear methods Lasso and Ridge, and the state-of-the-art XGBoost algorithm, across both linear and nonlinear datasets. By balancing predictive accuracy and computational efficiency, RaFFLE proves to be a versatile tool for tackling a wide variety of regression problems.

[LG-20] Realistic Evaluation of Deep Partial-Label Learning Algorithms ICLR2025

Link: https://arxiv.org/abs/2502.10184
Authors: Wei Wang,Dong-Dong Wu,Jindong Wang,Gang Niu,Min-Ling Zhang,Masashi Sugiyama
Categories: Machine Learning (cs.LG)
Comments: ICLR 2025 Spotlight

Abstract:Partial-label learning (PLL) is a weakly supervised learning problem in which each example is associated with multiple candidate labels and only one is the true label. In recent years, many deep PLL algorithms have been developed to improve model performance. However, we find that some early developed algorithms are often underestimated and can outperform many later algorithms with complicated designs. In this paper, we delve into the empirical perspective of PLL and identify several critical but previously overlooked issues. First, model selection for PLL is non-trivial, but has never been systematically studied. Second, the experimental settings are highly inconsistent, making it difficult to evaluate the effectiveness of the algorithms. Third, there is a lack of real-world image datasets that can be compatible with modern network architectures. Based on these findings, we propose PLENCH, the first Partial-Label learning bENCHmark to systematically compare state-of-the-art deep PLL algorithms. We investigate the model selection problem for PLL for the first time, and propose novel model selection criteria with theoretical guarantees. We also create Partial-Label CIFAR-10 (PLCIFAR10), an image dataset of human-annotated partial labels collected from Amazon Mechanical Turk, to provide a testbed for evaluating the performance of PLL algorithms in more realistic scenarios. Researchers can quickly and conveniently perform a comprehensive and fair evaluation and verify the effectiveness of newly developed algorithms based on PLENCH. We hope that PLENCH will facilitate standardized, fair, and practical evaluation of PLL algorithms in the future.

[LG-21] Provably Efficient RL under Episode-Wise Safety in Linear CMDPs

Link: https://arxiv.org/abs/2502.10138
Authors: Toshinori Kitamura,Arnob Ghosh,Tadashi Kozuno,Wataru Kumagai,Kazumi Kasaura,Kenta Hoshino,Yohei Hosoe,Yutaka Matsuo
Categories: Machine Learning (cs.LG)

Abstract:We study the reinforcement learning (RL) problem in a constrained Markov decision process (CMDP), where an agent explores the environment to maximize the expected cumulative reward while satisfying a single constraint on the expected total utility value in every episode. While this problem is well understood in the tabular setting, theoretical results for function approximation remain scarce. This paper closes the gap by proposing an RL algorithm for linear CMDPs that achieves $\widetilde{\mathcal{O}}(\sqrt{K})$ regret with an episode-wise zero-violation guarantee. Furthermore, our method is computationally efficient, scaling polynomially with problem-dependent parameters while remaining independent of the state space size. Our results significantly improve upon recent linear CMDP algorithms, which either violate the constraint or incur exponential computational costs.

[LG-22] Modern Hopfield Networks with Continuous-Time Memories

Link: https://arxiv.org/abs/2502.10122
Authors: Saul Santos,António Farinhas,Daniel C. McNamee,André F.T. Martins
Categories: Machine Learning (cs.LG)

Abstract:Recent research has established a connection between modern Hopfield networks (HNs) and transformer attention heads, with guarantees of exponential storage capacity. However, these models still face challenges scaling storage efficiently. Inspired by psychological theories of continuous neural resource allocation in working memory, we propose an approach that compresses large discrete Hopfield memories into smaller, continuous-time memories. Leveraging continuous attention, our new energy function modifies the update rule of HNs, replacing the traditional softmax-based probability mass function with a probability density, over the continuous memory. This formulation aligns with modern perspectives on human executive function, offering a principled link between attractor dynamics in working memory and resource-efficient memory allocation. Our framework maintains competitive performance with HNs while leveraging a compressed memory, reducing computational costs across synthetic and video datasets.

[LG-23] SeWA: Selective Weight Average via Probabilistic Masking

Link: https://arxiv.org/abs/2502.10119
Authors: Peng Wang,Shengchao Hu,Zerui Tao,Guoxia Wang,Dianhai Yu,Li Shen,Quan Zheng,Dacheng Tao
Categories: Machine Learning (cs.LG)

Abstract:Weight averaging has become a standard technique for enhancing model performance. However, methods such as Stochastic Weight Averaging (SWA) and Latest Weight Averaging (LAWA) often require manually designed procedures to sample from the training trajectory, and the results depend heavily on hyperparameter tuning. To minimize human effort, this paper proposes a simple yet efficient algorithm called Selective Weight Averaging (SeWA), which adaptively selects checkpoints during the final stages of training for averaging. Based on SeWA, we show that only a few points are needed to achieve better generalization and faster convergence. Theoretically, solving the discrete subset selection problem is inherently challenging. To address this, we transform it into a continuous probabilistic optimization framework and employ the Gumbel-Softmax estimator to learn the non-differentiable mask for each checkpoint. Further, we theoretically derive the SeWA’s stability-based generalization bounds, which are sharper than that of SGD under both convex and non-convex assumptions. Finally, solid extended experiments in various domains, including behavior cloning, image classification, and text classification, further validate the effectiveness of our approach.
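
To make the masking idea in this abstract more concrete, here is a minimal sketch, assuming a two-way (keep, drop) Gumbel-Softmax relaxation per checkpoint. It shows only the forward sampling and the masked average; how SeWA actually parameterizes and trains the mask is not specified in the abstract, and the names `gumbel_softmax_keep_weight` and `masked_average`, the toy checkpoints, and the temperature are hypothetical.

```python
import numpy as np

def gumbel_softmax_keep_weight(logits, tau, rng):
    """Relaxed binary mask: one (keep, drop) Gumbel-Softmax sample per checkpoint.

    logits: (K, 2) unnormalized log-probabilities for [keep, drop].
    Returns a (K,) array of soft keep-weights in [0, 1].
    """
    u = rng.uniform(low=1e-9, high=1.0, size=logits.shape)
    gumbel = -np.log(-np.log(u))
    z = (logits + gumbel) / tau
    z = z - z.max(axis=1, keepdims=True)  # numerical stability
    soft = np.exp(z) / np.exp(z).sum(axis=1, keepdims=True)
    return soft[:, 0]

def masked_average(checkpoints, mask):
    """Average checkpoint weight vectors, weighted by the (soft) mask."""
    mask = np.asarray(mask, dtype=float)
    return (mask[:, None] * checkpoints).sum(axis=0) / mask.sum()

rng = np.random.default_rng(0)
checkpoints = rng.normal(size=(5, 3))   # 5 toy checkpoints, 3 parameters each
logits = np.zeros((5, 2))               # assumed: no prior preference
soft_mask = gumbel_softmax_keep_weight(logits, tau=1.0, rng=rng)
averaged = masked_average(checkpoints, soft_mask)
```

Lowering `tau` pushes the soft weights toward 0 or 1, which is the usual way such a relaxation approximates a discrete selection.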

[LG-24] Accelerometry-based Energy Expenditure Estimation During Activities of Daily Living: A Comparison Among Different Accelerometer Compositions

Link: https://arxiv.org/abs/2502.10112
Authors: Shuhao Que,Remco Poelarends,Peter Veltink,Miriam Vollenbroek-Hutten,Ying Wang
Categories: Machine Learning (cs.LG)
Comments: This work has been submitted to the IEEE for possible publication

Abstract:Physical activity energy expenditure (PAEE) can be measured from breath-by-breath respiratory data, which can serve as a reference. Alternatively, PAEE can be predicted from the body movements, which can be measured and estimated with accelerometers. The body center of mass (COM) acceleration reflects the movements of the whole body and thus serves as a good predictor for PAEE. However, the wrist has also become a popular location due to recent advancements in wrist-worn devices. Therefore, in this work, using the respiratory data measured by COSMED K5 as the reference, we evaluated and compared the performances of COM-based settings and wrist-based settings. The COM-based settings include two different accelerometer compositions, using only the pelvis accelerometer (pelvis-acc) and the pelvis accelerometer with two accelerometers from two thighs (3-acc). The wrist-based settings include using only the left wrist accelerometer (l-wrist-acc) and only the right wrist accelerometer (r-wrist-acc). We implemented two existing PAEE estimation methods on our collected dataset, where 9 participants performed activities of daily living while wearing 5 accelerometers (i.e., pelvis, two thighs, and two wrists). These two methods include a linear regression (LR) model and a CNN-LSTM model. Both models yielded the best results with the COM-based 3-acc setting (LR: R^2 = 0.41, CNN-LSTM: R^2 = 0.53). No significant difference was found between the 3-acc and pelvis-acc settings (p-value = 0.278). For both models, neither the l-wrist-acc nor the r-wrist-acc settings demonstrated predictive power on PAEE with R^2 values close to 0, significantly outperformed by the two COM-based settings (p-values < 0.05). No significant difference was found between the two wrists (p-value = 0.329).

[LG-25] COMBINEX: A Unified Counterfactual Explainer for Graph Neural Networks via Node Feature and Structural Perturbations

Link: https://arxiv.org/abs/2502.10111
Authors: Flavio Giorgi,Fabrizio Silvestri,Gabriele Tolomei
Categories: Machine Learning (cs.LG)

Abstract:Counterfactual explanations have emerged as a powerful tool to unveil the opaque decision-making processes of graph neural networks (GNNs). However, existing techniques primarily focus on edge modifications, often overlooking the crucial role of node feature perturbations in shaping model predictions. To address this limitation, we propose COMBINEX, a novel GNN explainer that generates counterfactual explanations for both node and graph classification tasks. Unlike prior methods, which treat structural and feature-based changes independently, COMBINEX optimally balances modifications to edges and node features by jointly optimizing these perturbations. This unified approach ensures minimal yet effective changes required to flip a model’s prediction, resulting in realistic and interpretable counterfactuals. Additionally, COMBINEX seamlessly handles both continuous and discrete node features, enhancing its versatility across diverse datasets and GNN architectures. Extensive experiments on real-world datasets and various GNN architectures demonstrate the effectiveness and robustness of our approach over existing baselines.

[LG-26] NeuroXVocal: Detection and Explanation of Alzheimer’s Disease through Non-invasive Analysis of Picture-prompted Speech

Link: https://arxiv.org/abs/2502.10108
Authors: Nikolaos Ntampakis,Konstantinos Diamantaras,Ioanna Chouvarda,Magda Tsolaki,Vasileios Argyriou,Panagiotis Sarigianndis
Categories: Machine Learning (cs.LG); Neurons and Cognition (q-bio.NC)

Abstract:The early diagnosis of Alzheimer’s Disease (AD) through non-invasive methods remains a significant healthcare challenge. We present NeuroXVocal, a novel dual-component system that not only classifies but also explains potential AD cases through speech analysis. The classification component (Neuro) processes three distinct data streams: acoustic features capturing speech patterns and voice characteristics, textual features extracted from speech transcriptions, and precomputed embeddings representing linguistic patterns. These streams are fused through a custom transformer-based architecture that enables robust cross-modal interactions. The explainability component (XVocal) implements a Retrieval-Augmented Generation (RAG) approach, leveraging Large Language Models combined with a domain-specific knowledge base of AD research literature. This architecture enables XVocal to retrieve relevant clinical studies and research findings to generate evidence-based context-sensitive explanations of the acoustic and linguistic markers identified in patient speech. Using the IS2021 ADReSSo Challenge benchmark dataset, our system achieved state-of-the-art performance with 95.77% accuracy in AD classification, significantly outperforming previous approaches. The explainability component was qualitatively evaluated using a structured questionnaire completed by medical professionals, validating its clinical relevance. NeuroXVocal’s unique combination of high-accuracy classification and interpretable, literature-grounded explanations demonstrates its potential as a practical tool for supporting clinical AD diagnosis.

[LG-27] Data-Adaptive Low-Rank Sparse Subspace Clustering

Link: https://arxiv.org/abs/2502.10106
Authors: Ivica Kopriva
Categories: Machine Learning (cs.LG)
Comments: 5 pages, 1 figure, 1 table

Abstract:Low-rank sparse subspace clustering (LRSSC) algorithms built on self-expressive model effectively capture both the global and local structure of the data. However, existing solutions, primarily based on proximal operators associated with $S_p/L_p$, $p \in \{0, 1/2, 2/3, 1\}$, norms are not data-adaptive. In this work, we propose an LRSSC algorithm incorporating a data-adaptive surrogate for the $S_0/L_0$ quasi-norm. We provide a numerical solution for the corresponding proximal operator in cases where an analytical expression is unavailable. The proposed LRSSC algorithm is formulated within the proximal mapping framework, and we present theoretical proof of its global convergence toward a stationary point. We evaluate the performance of the proposed method on three well known datasets, comparing it against LRSSC algorithms constrained by $S_p/L_p$, $p \in \{0, 1/2, 2/3, 1\}$, norms.

[LG-28] Representation Learning on Out of Distribution in Tabular Data

Link: https://arxiv.org/abs/2502.10095
Authors: Achmad Ginanjar,Xue Li,Priyanka Singh,Wen Hua
Categories: Machine Learning (cs.LG)

Abstract:The open-world assumption in model development suggests that a model might lack sufficient information to adequately handle data that is entirely distinct or out of distribution (OOD). While deep learning methods have shown promising results in handling OOD data through generalization techniques, they often require specialized hardware that may not be accessible to all users. We present TCL, a lightweight yet effective solution that operates efficiently on standard CPU hardware. Our approach adapts contrastive learning principles specifically for tabular data structures, incorporating full matrix augmentation and simplified loss calculation. Through comprehensive experiments across 10 diverse datasets, we demonstrate that TCL outperforms existing models, including FT-Transformer and ResNet, particularly in classification tasks, while maintaining competitive performance in regression problems. TCL achieves these results with significantly reduced computational requirements, making it accessible to users with limited hardware capabilities. This study also provides practical guidance for detecting and evaluating OOD data through straightforward experiments and visualizations. Our findings show that TCL offers a promising balance between performance and efficiency in handling OOD prediction tasks, which is particularly beneficial for general machine learning practitioners working with computational constraints.

[LG-29] Classification of Temporal Graphs using Persistent Homology

Link: https://arxiv.org/abs/2502.10076
Authors: Siddharth Pritam,Rohit Roy,Madhav Cherupilil Sajeev
Categories: Machine Learning (cs.LG); Computational Geometry (cs.CG); Algebraic Topology (math.AT)

Abstract:Temporal graphs effectively model dynamic systems by representing interactions as timestamped edges. However, analytical tools for temporal graphs are limited compared to static graphs. We propose a novel method for analyzing temporal graphs using Persistent Homology. Our approach leverages $\delta$-temporal motifs (recurrent subgraphs) to capture temporal dynamics. By evolving these motifs, we define the average filtration and compute PH on the associated clique complex. This method captures both local and global temporal structures and is stable with respect to reference models. We demonstrate the applicability of our approach to the temporal graph classification task. Experiments verify the effectiveness of our approach, achieving over 92% accuracy, with some cases reaching 100%. Unlike existing methods that require node classes, our approach is node class free, offering flexibility for a wide range of temporal graph analysis.

[LG-30] Topological Neural Networks over the Air

Link: https://arxiv.org/abs/2502.10070
Authors: Simone Fiorellino,Claudio Battiloro,Paolo Di Lorenzo
Categories: Information Theory (cs.IT); Machine Learning (cs.LG)

Abstract:Topological neural networks (TNNs) are information processing architectures that model representations from data lying over topological spaces (e.g., simplicial or cell complexes) and allow for decentralized implementation through localized communications over different neighborhoods. Existing TNN architectures have not yet been considered in realistic communication scenarios, where channel effects typically introduce disturbances such as fading and noise. This paper aims to propose a novel TNN design, operating on regular cell complexes, that performs over-the-air computation, incorporating the wireless communication model into its architecture. Specifically, during training and inference, the proposed method considers channel impairments such as fading and noise in the topological convolutional filtering operation, which takes place over different signal orders and neighborhoods. Numerical results illustrate the architecture’s robustness to channel impairments during testing and the superior performance with respect to existing architectures, which are either communication-agnostic or graph-based.

[LG-31] Heterogeneous Resource Allocation with Multi-task Learning for Wireless Networks

Link: https://arxiv.org/abs/2502.10027
Authors: Nikos A. Mitsiou,Pavlos S. Bouzinis,Panagiotis G. Sarigiannidis,George K. Karagiannidis
Categories: Machine Learning (cs.LG)

Abstract:The optimal solution to an optimization problem depends on the problem’s objective function, constraints, and size. While deep neural networks (DNNs) have proven effective in solving optimization problems, changes in the problem’s size, objectives, or constraints often require adjustments to the DNN architecture to maintain effectiveness, or even retraining a new DNN from scratch. Given the dynamic nature of wireless networks, which involve multiple and diverse objectives that can have conflicting requirements and constraints, we propose a multi-task learning (MTL) framework to enable a single DNN to jointly solve a range of diverse optimization problems. In this framework, optimization problems with varying dimensionality values, objectives, and constraints are treated as distinct tasks. To jointly address these tasks, we propose a conditional computation-based MTL approach with routing. The multi-task DNN consists of two components, the base DNN (bDNN), which is the single DNN used to extract the solutions for all considered optimization problems, and the routing DNN (rDNN), which manages which nodes and layers of the bDNN to be used during the forward propagation of each task. The output of the rDNN is a binary vector which is multiplied with all bDNN’s weights during the forward propagation, creating a unique computational path through the bDNN for each task. This setup allows the tasks to either share parameters or use independent ones, with the decision controlled by the rDNN. The proposed framework supports both supervised and unsupervised learning scenarios. Numerical results demonstrate the efficiency of the proposed MTL approach in solving diverse optimization problems. In contrast, benchmark DNNs lacking the rDNN mechanism were unable to achieve similar levels of performance, highlighting the effectiveness of the proposed architecture.
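
The routing mechanism described above can be pictured with a small sketch, assuming a one-hidden-layer base network and a binary mask over its hidden units. The masks are hard-coded here, whereas in the paper they are produced by the rDNN; `routed_forward`, the layer sizes, and the two task masks are hypothetical.

```python
import numpy as np

def routed_forward(x, w1, w2, hidden_mask):
    """Forward pass of a tiny base network with masked hidden units.

    hidden_mask: (H,) binary vector; a 0 removes that hidden unit from the path.
    """
    h = np.maximum(x @ (w1 * hidden_mask), 0.0)  # mask the columns of w1, then ReLU
    return h @ w2

rng = np.random.default_rng(0)
w1 = rng.normal(size=(4, 6))
w2 = rng.normal(size=(6, 2))
x = rng.normal(size=(3, 4))

# Assumed routing decisions for two tasks; unit 2 is shared, the rest are task-specific.
mask_task_a = np.array([1, 1, 1, 0, 0, 0])
mask_task_b = np.array([0, 0, 1, 1, 1, 1])

out_a = routed_forward(x, w1, w2, mask_task_a)
out_b = routed_forward(x, w1, w2, mask_task_b)
```

With this setup each task follows its own path through the same weights, and overlapping entries in the masks correspond to shared parameters.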

[LG-32] InterGridNet: An Electric Network Frequency Approach for Audio Source Location Classification Using Convolutional Neural Networks

Link: https://arxiv.org/abs/2502.10011
Authors: Christos Korgialas,Ioannis Tsingalis,Georgios Tzolopoulos,Constantine Kotropoulos
Categories: Sound (cs.SD); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
Comments: The 10th International Conference on Advances in Signal, Image and Video Processing (SIGNAL 2025)

Abstract:A novel framework, called InterGridNet, is introduced, leveraging a shallow RawNet model for geolocation classification of Electric Network Frequency (ENF) signatures in the SP Cup 2016 dataset. During data preparation, recordings are sorted into audio and power groups based on inherent characteristics, further divided into 50 Hz and 60 Hz groups via spectrogram analysis. Residual blocks within the classification model extract frame-level embeddings, aiding decision-making through softmax activation. The topology and the hyperparameters of the shallow RawNet are optimized using a Neural Architecture Search. The overall accuracy of InterGridNet in the test recordings is 92%, indicating its effectiveness against the state-of-the-art methods tested in the SP Cup 2016. These findings underscore InterGridNet’s effectiveness in accurately classifying audio recordings from diverse power grids, advancing state-of-the-art geolocation estimation methods.

[LG-33] Exploring Neural Granger Causality with xLSTMs: Unveiling Temporal Dependencies in Complex Data

Link: https://arxiv.org/abs/2502.09981
Authors: Harsh Poonia,Felix Divo,Kristian Kersting,Devendra Singh Dhami
Categories: Machine Learning (cs.LG)

Abstract:Causality in time series can be difficult to determine, especially in the presence of non-linear dependencies. The concept of Granger causality helps analyze potential relationships between variables, thereby offering a method to determine whether one time series can predict (Granger-cause) future values of another. Although successful, Granger causal methods still struggle with capturing long-range relations between variables. To this end, we leverage the recently successful Extended Long Short-Term Memory (xLSTM) architecture and propose Granger causal xLSTMs (GC-xLSTM). It first enforces sparsity between the time series components by using a novel dynamic lasso penalty on the initial projection. Specifically, we adaptively improve the model and identify sparsity candidates. Our joint optimization procedure then ensures that the Granger causal relations are recovered in a robust fashion. Our experimental evaluations on three datasets demonstrate the overall efficacy of our proposed GC-xLSTM model.
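
For readers unfamiliar with the underlying notion, the sketch below illustrates classical linear Granger causality, not the GC-xLSTM model: it compares how well past values of `y` alone predict `y` against past values of `y` and `x` together. The helper names, the lag, and the synthetic data are assumptions made for the example.

```python
import numpy as np

def lagged_design(series_list, lag):
    """Stack lagged copies of each series into a design matrix with an intercept."""
    n = len(series_list[0])
    cols = [np.ones(n - lag)]
    for s in series_list:
        for k in range(1, lag + 1):
            cols.append(s[lag - k : n - k])  # value k steps before each target
    return np.column_stack(cols)

def residual_sum_of_squares(design, target):
    coef, *_ = np.linalg.lstsq(design, target, rcond=None)
    resid = target - design @ coef
    return float(resid @ resid)

def granger_rss_ratio(x, y, lag=2):
    """RSS(past y only) / RSS(past y and past x) for predicting y.

    Values well above 1 suggest that past x helps predict y.
    """
    target = y[lag:]
    rss_restricted = residual_sum_of_squares(lagged_design([y], lag), target)
    rss_full = residual_sum_of_squares(lagged_design([y, x], lag), target)
    return rss_restricted / rss_full

# Synthetic data in which x drives y with a one-step delay.
rng = np.random.default_rng(0)
n = 500
x = rng.normal(size=n)
y = np.zeros(n)
y[1:] = 0.8 * x[:-1] + 0.1 * rng.normal(size=n - 1)

ratio_x_to_y = granger_rss_ratio(x, y)  # expected to be large
ratio_y_to_x = granger_rss_ratio(y, x)  # expected to stay near 1
```

A formal test would turn these residual sums into an F-statistic; the ratio is used here only to keep the sketch short.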

[LG-34] On Space Folds of ReLU Neural Networks

链接: https://arxiv.org/abs/2502.09954
作者: Michal Lewandowski,Hamid Eghbalzadeh,Bernhard Heinzl,Raphael Pisoni,Bernhard A.Moser
类目: Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)
*备注: Accepted at Transactions on Machine Learning Research (TMLR), 2025

点击查看摘要

Abstract:Recent findings suggest that the consecutive layers of ReLU neural networks can be understood geometrically as space folding transformations of the input space, revealing patterns of self-similarity. In this paper, we present the first quantitative analysis of this space folding phenomenon in ReLU neural networks. Our approach focuses on examining how straight paths in the Euclidean input space are mapped to their counterparts in the Hamming activation space. In this process, the convexity of straight lines is generally lost, giving rise to non-convex folding behavior. To quantify this effect, we introduce a novel measure based on range metrics, similar to those used in the study of random walks, and provide the proof for the equivalence of convexity notions between the input and activation spaces. Furthermore, we provide empirical analysis on a geometrical analysis benchmark (CantorNet) as well as an image classification benchmark (MNIST). Our work advances the understanding of the activation space in ReLU neural networks by leveraging the phenomena of geometric folding, providing valuable insights on how these models process input information.
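
As a loose illustration of the mapping this abstract refers to, the sketch below walks along a straight segment in input space, records the binary on/off pattern of every ReLU unit at each point, and reports the Hamming distance to the starting pattern. It does not reproduce the paper's range-based folding measure; the random two-layer network, the endpoints, and the function names are hypothetical.

```python
import numpy as np

def activation_pattern(x, weights, biases):
    """Binary on/off pattern of all ReLU units for input x."""
    bits = []
    h = x
    for w, b in zip(weights, biases):
        pre = h @ w + b
        bits.append(pre > 0)
        h = np.maximum(pre, 0.0)
    return np.concatenate(bits).astype(int)

def hamming_along_path(x_start, x_end, weights, biases, steps=50):
    """Hamming distance to the start pattern at evenly spaced points on a segment."""
    ts = np.linspace(0.0, 1.0, steps)
    patterns = [
        activation_pattern((1 - t) * x_start + t * x_end, weights, biases)
        for t in ts
    ]
    return np.array([int(np.sum(p != patterns[0])) for p in patterns])

rng = np.random.default_rng(0)
weights = [rng.normal(size=(2, 8)), rng.normal(size=(8, 8))]  # toy 2 -> 8 -> 8 network
biases = [rng.normal(size=8), rng.normal(size=8)]
distances = hamming_along_path(
    np.array([-2.0, -1.0]), np.array([2.0, 1.5]), weights, biases
)
```

If `distances` rises and then falls along the segment, the straight input path has moved back toward its starting pattern in activation space, which is the kind of behavior a folding measure would aim to quantify.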

[LG-35] Tradeoffs in Processing Queries and Supporting Updates over an ML-Enhanced R-tree

Link: https://arxiv.org/abs/2502.09937
Authors: Abdullah Al-Mamun,Ch. Md. Rakin Haider,Jianguo Wang,Walid G. Aref
Categories: Databases (cs.DB); Machine Learning (cs.LG)
Comments: arXiv admin note: text overlap with arXiv:2207.00550

Abstract:Machine Learning (ML) techniques have been successfully applied to design various learned database index structures for both the one- and multi-dimensional spaces. Particularly, a class of traditional multi-dimensional indexes has been augmented with ML models to design ML-enhanced variants of their traditional counterparts. This paper focuses on the R-tree multi-dimensional index structure as it is widely used for indexing multi-dimensional data. The R-tree has been augmented with machine learning models to enhance the R-tree performance. The AI+R-tree is an ML-enhanced R-tree index structure that augments a traditional disk-based R-tree with an ML model to enhance the R-tree’s query processing performance, mainly, to avoid navigating the overlapping branches of the R-tree that do not yield query results, e.g., in the presence of high-overlap among the rectangles of the R-tree nodes. We investigate the empirical tradeoffs in processing dynamic query workloads and in supporting updates over the AI+R-tree. Particularly, we investigate the impact of the choice of ML models over the AI+R-tree query processing performance. Moreover, we present a case study of designing a custom loss function for a neural network model tailored to the query processing requirements of the AI+R-tree. Furthermore, we present the design tradeoffs for adopting various strategies for supporting dynamic inserts, updates, and deletes with the vision of realizing a mutable AI+R-tree. Experiments on real datasets demonstrate that the AI+R-tree can enhance the query processing performance of a traditional R-tree for high-overlap range queries by up to 5.4X while achieving up to 99% average query recall.

[LG-36] Fused Partial Gromov-Wasserstein for Structured Objects

Link: https://arxiv.org/abs/2502.09934
Authors: Yikun Bai, Huy Tran, Hengrong Du, Xinran Liu, Soheil Kolouri
Categories: Machine Learning (cs.LG)
Comments: arXiv admin note: text overlap with arXiv:2402.03664

Abstract:Structured data, such as graphs, are vital in machine learning due to their capacity to capture complex relationships and interactions. In recent years, the Fused Gromov-Wasserstein (FGW) distance has attracted growing interest because it enables the comparison of structured data by jointly accounting for feature similarity and geometric structure. However, as a variant of optimal transport (OT), classical FGW assumes an equal mass constraint on the compared data. In this work, we relax this mass constraint and propose the Fused Partial Gromov-Wasserstein (FPGW) framework, which extends FGW to accommodate unbalanced data. Theoretically, we establish the relationship between FPGW and FGW and prove the metric properties of FPGW. Numerically, we introduce Frank-Wolfe solvers for the proposed FPGW framework and provide a convergence analysis. Finally, we evaluate the FPGW distance through graph classification and clustering experiments, demonstrating its robust performance, especially when data is corrupted by outlier noise.

[LG-37] Robust Anomaly Detection via Tensor Chidori Pseudoskeleton Decomposition

Link: https://arxiv.org/abs/2502.09926
Authors: Bowen Su
Categories: Machine Learning (cs.LG)

Abstract:Anomaly detection plays a critical role in modern data-driven applications, from identifying fraudulent transactions and safeguarding network infrastructure to monitoring sensor systems for irregular patterns. Traditional approaches, such as distance, density, or cluster-based methods, face significant challenges when applied to high dimensional tensor data, where complex interdependencies across dimensions amplify noise and computational complexity. To address these limitations, this paper leverages Tensor Chidori pseudoskeleton decomposition within a tensor-robust principal component analysis framework to extract low Tucker rank structure while isolating sparse anomalies, ensuring robustness to anomaly detection. We establish theoretical results regarding convergence, and estimation error, demonstrating the stability and accuracy of the proposed approach. Numerical experiments on real-world spatiotemporal data from New York City taxi trip records validate the superiority of the proposed method in detecting anomalous urban events compared to existing benchmark methods. The results underscore the potential of Tensor Chidori pseudoskeleton decomposition to enhance anomaly detection for large-scale, high-dimensional data.

[LG-38] Thompson Sampling for Repeated Newsvendor

Link: https://arxiv.org/abs/2502.09900
Authors: Weizhou Zhang, Chen Li, Hanzhang Qin, Yunbei Xu, Ruihao Zhu
Categories: Machine Learning (cs.LG)

Abstract:In this paper, we investigate the performance of Thompson Sampling (TS) for online learning with censored feedback, focusing primarily on the classic repeated newsvendor model–a foundational framework in inventory management–and demonstrating how our techniques can be naturally extended to a broader class of problems. We model demand using a Weibull distribution and initialize TS with a Gamma prior to dynamically adjust order quantities. Our analysis establishes optimal (up to logarithmic factors) frequentist regret bounds for TS without imposing restrictive prior assumptions. More importantly, it yields novel and highly interpretable insights on how TS addresses the exploration-exploitation trade-off in the repeated newsvendor setting. Specifically, our results show that when past order quantities are sufficiently large to overcome censoring, TS accurately estimates the unknown demand parameters, leading to near-optimal ordering decisions. Conversely, when past orders are relatively small, TS automatically increases future order quantities to gather additional demand information. Extensive numerical simulations further demonstrate that TS outperforms more conservative and widely-used approaches such as online convex optimization, upper confidence bounds, and myopic Bayesian dynamic programming. This study also lays the foundation for exploring general online learning problems with censored feedback.
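
The posterior-sampling loop described in this abstract can be sketched as follows. This is a minimal illustration, not the authors' algorithm: it assumes the Weibull shape k is known, that demand has survival function P(D > d) = exp(-theta * d^k) with an unknown rate theta, and that the Gamma prior is placed on theta. Under those assumptions the Gamma prior remains conjugate even when sales are censored at the order quantity. All function names, the prior hyperparameters, and the critical-ratio parameter are hypothetical.

```python
import math
import random


def update_posterior(alpha, beta, sale, order, shape):
    """Gamma(alpha, rate=beta) update after observing sale = min(demand, order).

    Assumption: demand survival is exp(-theta * d**shape) with known shape.
    An uncensored sale (sale < order) adds one to alpha; every observation,
    censored or not, adds sale**shape to the rate beta.
    """
    uncensored = sale < order
    return alpha + (1 if uncensored else 0), beta + sale ** shape


def order_quantity(theta, shape, critical_ratio):
    """Newsvendor quantile: the q with P(D <= q) = critical_ratio."""
    return (-math.log(1.0 - critical_ratio) / theta) ** (1.0 / shape)


def run_thompson_sampling(true_theta, shape, critical_ratio, periods, seed=0):
    """Simulate TS on a repeated newsvendor with censored demand feedback."""
    rng = random.Random(seed)
    alpha, beta = 1.0, 1.0  # hypothetical prior hyperparameters
    scale = true_theta ** (-1.0 / shape)  # Weibull scale implied by theta
    orders = []
    for _ in range(periods):
        # Sample a rate from the posterior (gammavariate takes a scale, not a rate).
        theta_sample = rng.gammavariate(alpha, 1.0 / beta)
        q = order_quantity(theta_sample, shape, critical_ratio)
        demand = rng.weibullvariate(scale, shape)
        sale = min(demand, q)  # demand above q is never observed
        alpha, beta = update_posterior(alpha, beta, sale, q, shape)
        orders.append(q)
    return orders, alpha / beta  # order history and posterior mean of theta


orders, theta_mean = run_thompson_sampling(
    true_theta=1.0, shape=2.0, critical_ratio=0.8, periods=200
)
```

In this sketch a censored period raises beta without raising alpha, which lowers the posterior mean of theta and pushes later orders up. That is one simple way to see the behaviour the abstract describes, where small past orders lead TS to order more in order to learn about demand.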

[LG-39] Optimal lower Lipschitz bounds for ReLU layers, saturation, and phase retrieval

Link: https://arxiv.org/abs/2502.09898
Authors: Daniel Freeman, Daniel Haider
Categories: Machine Learning (cs.LG); Functional Analysis (math.FA); Numerical Analysis (math.NA)
Comments: 22 pages

Abstract:The injectivity of ReLU layers in neural networks, the recovery of vectors from clipped or saturated measurements, and (real) phase retrieval in $\mathbb{R}^n$ allow for a similar problem formulation and characterization using frame theory. In this paper, we revisit all three problems with a unified perspective and derive lower Lipschitz bounds for ReLU layers and clipping which are analogous to the previously known result for phase retrieval and are optimal up to a constant factor.

[LG-40] Symmetry-Preserving Diffusion Models via Target Symmetrization

Link: https://arxiv.org/abs/2502.09890
Authors: Vinh Tong, Yun Ye, Trung-Dung Hoang, Anji Liu, Guy Van den Broeck, Mathias Niepert
Categories: Machine Learning (cs.LG)

Abstract:Diffusion models are powerful tools for capturing complex distributions, but modeling data with inherent symmetries, such as molecular structures, remains challenging. Equivariant denoisers are commonly used to address this, but they introduce architectural complexity and optimization challenges, including noisy gradients and convergence issues. We propose a novel approach that enforces equivariance through a symmetrized loss function, which applies a time-dependent weighted averaging operation over group actions to the model’s prediction target. This ensures equivariance without explicit architectural constraints and reduces gradient variance, leading to more stable and efficient optimization. Our method uses Monte Carlo sampling to estimate the average, incurring minimal computational overhead. We provide theoretical guarantees of equivariance for the minimizer of our loss function and demonstrate its effectiveness on synthetic datasets and the molecular conformation generation task using the GEOM-QM9 dataset. Experiments show improved sample quality compared to existing methods, highlighting the potential of our approach to enhance the scalability and practicality of equivariant diffusion models in generative tasks.

[LG-41] Elastic Representation: Mitigating Spurious Correlations for Group Robustness AISTATS2025

Link: https://arxiv.org/abs/2502.09850
Authors: Tao Wen, Zihan Wang, Quan Zhang, Qi Lei
Categories: Machine Learning (cs.LG)
Comments: Accepted at AISTATS 2025

Abstract:Deep learning models can suffer from severe performance degradation when relying on spurious correlations between input features and labels, making the models perform well on training data but have poor prediction accuracy for minority groups. This problem arises especially when training data are limited or imbalanced. While most prior work focuses on learning invariant features (with consistent correlations to y), it overlooks the potential harm of spurious correlations between features. We hereby propose Elastic Representation (ElRep) to learn features by imposing Nuclear- and Frobenius-norm penalties on the representation from the last layer of a neural network. Similar to the elastic net, ElRep enjoys the benefits of learning important features without losing feature diversity. The proposed method is simple yet effective. It can be integrated into many deep learning approaches to mitigate spurious correlations and improve group robustness. Moreover, we theoretically show that ElRep has minimum negative impacts on in-distribution predictions. This is a remarkable advantage over approaches that prioritize minority groups at the cost of overall performance.

[LG-42] A Survey on Human-Centered Evaluation of Explainable AI Methods in Clinical Decision Support Systems

Link: https://arxiv.org/abs/2502.09849
Authors: Alessandro Gambetti, Qiwei Han, Hong Shen, Claudia Soares
Categories: Machine Learning (cs.LG); Human-Computer Interaction (cs.HC)
Comments: 10 pages, 1 table

Abstract:Explainable AI (XAI) has become a crucial component of Clinical Decision Support Systems (CDSS) to enhance transparency, trust, and clinical adoption. However, while many XAI methods have been proposed, their effectiveness in real-world medical settings remains underexplored. This paper provides a survey of human-centered evaluations of Explainable AI methods in Clinical Decision Support Systems. By categorizing existing works based on XAI methodologies, evaluation frameworks, and clinical adoption challenges, we offer a structured understanding of the landscape. Our findings reveal key challenges in the integration of XAI into healthcare workflows and propose a structured framework to align the evaluation methods of XAI with the clinical needs of stakeholders.

[LG-43] Solving Empirical Bayes via Transformers

Link: https://arxiv.org/abs/2502.09844
Authors: Anzo Teh, Mark Jabbour, Yury Polyanskiy
Categories: Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments: 27 pages, 14 figures, 11 tables

Abstract:This work applies modern AI tools (transformers) to solving one of the oldest statistical problems: Poisson means under empirical Bayes (Poisson-EB) setting. In Poisson-EB a high-dimensional mean vector $\theta$ (with iid coordinates sampled from an unknown prior $\pi$) is estimated on the basis of $X=\mathrm{Poisson}(\theta)$. A transformer model is pre-trained on a set of synthetically generated pairs $(X,\theta)$ and learns to do in-context learning (ICL) by adapting to unknown $\pi$. Theoretically, we show that a sufficiently wide transformer can achieve vanishing regret with respect to an oracle estimator who knows $\pi$ as dimension grows to infinity. Practically, we discover that already very small models (100k parameters) are able to outperform the best classical algorithm (non-parametric maximum likelihood, or NPMLE) both in runtime and validation loss, which we compute on out-of-distribution synthetic data as well as real-world datasets (NHL hockey, MLB baseball, BookCorpusOpen). Finally, by using linear probes, we confirm that the transformer’s EB estimator appears to internally work differently from either NPMLE or Robbins’ estimators.
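
For context, the Robbins estimator named at the end of this abstract is one of the classical Poisson-EB baselines. Below is a minimal sketch, assuming the standard textbook form theta_hat(x) = (x + 1) * N(x + 1) / N(x), where N(x) is the number of observations equal to x. The function name is illustrative, and this is neither the paper's transformer nor its NPMLE baseline.

```python
from collections import Counter


def robbins_estimate(counts):
    """Estimate the Poisson mean behind each observed count.

    Uses the empirical frequencies N(x) in place of the unknown marginal
    distribution of X, so no prior on theta has to be specified.
    """
    freq = Counter(counts)  # freq[x] is 0 for values that never occur
    return [(x + 1) * freq[x + 1] / freq[x] for x in counts]


observed = [0, 0, 1, 1, 1, 2]
estimates = robbins_estimate(observed)
```

Note that the largest observed value always receives an estimate of 0, because no larger count exists in the sample. This instability is a well-known weakness of the plain Robbins estimator.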

[LG-44] Learning Fair Policies for Infectious Diseases Mitigation using Path Integral Control

Link: https://arxiv.org/abs/2502.09831
Authors: Zhuangzhuang Jia, Hyuk Park, Gökçe Dayanıklı, Grani A. Hanasusanto
Categories: Machine Learning (cs.LG); Optimization and Control (math.OC)

Abstract:Infectious diseases pose major public health challenges to society, highlighting the importance of designing effective policies to reduce economic loss and mortality. In this paper, we propose a framework for sequential decision-making under uncertainty to design fairness-aware disease mitigation policies that incorporate various measures of unfairness. Specifically, our approach learns equitable vaccination and lockdown strategies based on a stochastic multi-group SIR model. To address the challenges of solving the resulting sequential decision-making problem, we adopt the path integral control algorithm as an efficient solution scheme. Through a case study, we demonstrate that our approach effectively improves fairness compared to conventional methods and provides valuable insights for policymakers.

[LG-45] ATM-Net: Adaptive Termination and Multi-Precision Neural Networks for Energy-Harvested Edge Intelligence

Link: https://arxiv.org/abs/2502.09822
Authors: Neeraj Solanki, Sepehr Tabrizchi, Samin Sohrabi, Jason Schmidt, Arman Roohi
Categories: Machine Learning (cs.LG)

Abstract:ATM-Net is a novel neural network architecture tailored for energy-harvested IoT devices, integrating adaptive termination points with multi-precision computing. It dynamically adjusts computational precision (32/8/4-bit) and network depth based on energy availability via early exit points. An energy-aware task scheduler optimizes the energy-accuracy trade-off. Experiments on CIFAR-10, PlantVillage, and TissueMNIST show ATM-Net achieves up to 96.93% accuracy while reducing power consumption by 87.5% with Q4 quantization compared to 32-bit operations. The power-delay product improves from 13.6J to 0.141J for DenseNet-121 and from 10.3J to 0.106J for ResNet-18, demonstrating its suitability for energy-harvesting systems.

[LG-46] Medical Applications of Graph Convolutional Networks Using Electronic Health Records: A Survey

Link: https://arxiv.org/abs/2502.09781
Authors: Garrik Hoyt, Noyonica Chatterjee, Fortunato Battaglia, Paramita Basu
Categories: Machine Learning (cs.LG)
Comments: 5 pages, 4 figures

Abstract:Graph Convolutional Networks (GCNs) have emerged as a promising approach to machine learning on Electronic Health Records (EHRs). By constructing a graph representation of patient data and performing convolutions on neighborhoods of nodes, GCNs can capture complex relationships and extract meaningful insights to support medical decision making. This survey provides an overview of the current research in applying GCNs to EHR data. We identify the key medical domains and prediction tasks where these models are being utilized, common benchmark datasets, and architectural patterns to provide a comprehensive survey of this field. While this is a nascent area of research, GCNs demonstrate strong potential to leverage the complex information hidden in EHRs. Challenges and opportunities for future work are also discussed.

[LG-47] Enhancing Jailbreak Attacks via Compliance-Refusal-Based Initialization

Link: https://arxiv.org/abs/2502.09755
Authors: Amit Levi, Rom Himelstein, Yaniv Nemcovsky, Avi Mendelson, Chaim Baskin
Categories: Cryptography and Security (cs.CR); Machine Learning (cs.LG)

Abstract:Jailbreak attacks aim to exploit large language models (LLMs) and pose a significant threat to their proper conduct; they seek to bypass models’ safeguards and often provoke transgressive behaviors. However, existing automatic jailbreak attacks require extensive computational resources and are prone to converge on suboptimal solutions. In this work, we propose Compliance Refusal Initialization (CRI), a novel, attack-agnostic framework that efficiently initializes the optimization in the proximity of the compliance subspace of harmful prompts. By narrowing the initial gap to the adversarial objective, CRI substantially improves adversarial success rates (ASR) and drastically reduces computational overhead, often requiring just a single optimization step. We evaluate CRI on the widely-used AdvBench dataset over the standard jailbreak attacks of GCG and AutoDAN. Results show that CRI boosts ASR and decreases the median steps to success by up to $60\times$. The project page, along with the reference implementation, is publicly available at this https URL.

[LG-48] Fine-Tuning Foundation Models with Federated Learning for Privacy Preserving Medical Time Series Forecasting

Link: https://arxiv.org/abs/2502.09744
Authors: Mahad Ali, Curtis Lisle, Patrick W. Moore, Tammer Barkouki, Brian J. Kirkwood, Laura J. Brattain
Categories: Machine Learning (cs.LG); Cryptography and Security (cs.CR)
Comments: submitted to IEEE EMBC 2025; 7 pages, 4 figures

Abstract:Federated Learning (FL) provides a decentralized machine learning approach, where multiple devices or servers collaboratively train a model without sharing their raw data, thus enabling data privacy. This approach has gained significant interest in academia and industry due to its privacy-preserving properties, which are particularly valuable in the medical domain where data availability is often protected under strict regulations. A relatively unexplored area is the use of FL to fine-tune Foundation Models (FMs) for time series forecasting, potentially enhancing model efficacy by overcoming data limitation while maintaining privacy. In this paper, we fine-tuned time series FMs with Electrocardiogram (ECG) and Impedance Cardiography (ICG) data using different FL techniques. We then examined various scenarios and discussed the challenges FL faces under different data heterogeneity configurations. Our empirical results demonstrated that while FL can be effective for fine-tuning FMs on time series forecasting tasks, its benefits depend on the data distribution across clients. We highlighted the trade-offs in applying FL to FM fine-tuning.

[LG-49] Navigating the Social Welfare Frontier: Portfolios for Multi-objective Reinforcement Learning

Link: https://arxiv.org/abs/2502.09724
Authors: Cheol Woo Kim, Jai Moondra, Shresth Verma, Madeleine Pollack, Lingkai Kong, Milind Tambe, Swati Gupta
Categories: Machine Learning (cs.LG)

Abstract:In many real-world applications of reinforcement learning (RL), deployed policies have varied impacts on different stakeholders, creating challenges in reaching consensus on how to effectively aggregate their preferences. Generalized $p$-means form a widely used class of social welfare functions for this purpose, with broad applications in fair resource allocation, AI alignment, and decision-making. This class includes well-known welfare functions such as Egalitarian, Nash, and Utilitarian welfare. However, selecting the appropriate social welfare function is challenging for decision-makers, as the structure and outcomes of optimal policies can be highly sensitive to the choice of $p$. To address this challenge, we study the concept of an $\alpha$-approximate portfolio in RL, a set of policies that are approximately optimal across the family of generalized $p$-means for all $p \in [-\infty, 1]$. We propose algorithms to compute such portfolios and provide theoretical guarantees on the trade-offs among approximation factor, portfolio size, and computational efficiency. Experimental results on synthetic and real-world datasets demonstrate the effectiveness of our approach in summarizing the policy space induced by varying $p$ values, empowering decision-makers to navigate this landscape more effectively.
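
To make the family of welfare functions in this abstract concrete, here is a small sketch of the generalized p-mean of a vector of stakeholder utilities. It assumes the usual normalized definition M_p(u) = ((1/n) * sum(u_i^p))^(1/p) for strictly positive utilities, with p = 1 giving the Utilitarian (arithmetic) mean, the limit p -> 0 giving the Nash (geometric) mean, and p = -infinity giving the Egalitarian (minimum) value. The function name is illustrative and the paper may use a different normalization.

```python
import math


def p_mean(utilities, p):
    """Generalized p-mean of strictly positive utilities for p in [-inf, 1]."""
    if not utilities or any(u <= 0 for u in utilities):
        raise ValueError("utilities must be non-empty and strictly positive")
    n = len(utilities)
    if p == -math.inf:
        return min(utilities)  # Egalitarian welfare
    if p == 0:
        # Nash welfare as a geometric mean, computed in log space.
        return math.exp(sum(math.log(u) for u in utilities) / n)
    return (sum(u ** p for u in utilities) / n) ** (1.0 / p)


utilities = [1.0, 4.0]
for p in (-math.inf, -1, 0, 1):
    print(p, p_mean(utilities, p))
```

The p-mean is non-decreasing in p, so lowering p puts more weight on the worst-off stakeholder. This sensitivity to p is what motivates computing a portfolio of policies rather than committing to a single welfare function.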

[LG-50] NestQuant: Nested Lattice Quantization for Matrix Products and LLMs

Link: https://arxiv.org/abs/2502.09720
Authors: Semyon Savkin, Eitan Porat, Or Ordentlich, Yury Polyanskiy
Categories: Machine Learning (cs.LG)
Comments: 16 pages

Abstract:Post-training quantization (PTQ) has emerged as a critical technique for efficient deployment of large language models (LLMs). This work proposes NestQuant, a novel PTQ scheme for weights and activations that is based on self-similar nested lattices. Recent work have mathematically shown such quantizers to be information-theoretically optimal for low-precision matrix multiplication. We implement a practical low-complexity version of NestQuant based on Gosset lattice, making it a drop-in quantizer for any matrix multiplication step (e.g., in self-attention, MLP etc). For example, NestQuant quantizes weights, KV-cache, and activations of Llama-3-8B to 4 bits, achieving perplexity of 6.6 on wikitext2. This represents more than 55% reduction in perplexity gap with respect to unquantized model (perplexity of 6.14) compared to state-of-the-art Meta’s SpinQuant (perplexity 7.3). Comparisons on various LLM evaluation benchmarks also show a reduction in performance degradation induced by quantization.
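
The "more than 55%" figure can be checked from the three perplexities quoted in this abstract. The sketch below assumes that "perplexity gap" means the difference between a quantized model's perplexity and that of the unquantized model; the function and argument names are illustrative.

```python
def gap_reduction(baseline, competitor, proposed):
    """Fraction by which `proposed` closes the competitor's gap to `baseline`."""
    competitor_gap = competitor - baseline  # SpinQuant: 7.3 - 6.14
    proposed_gap = proposed - baseline      # NestQuant: 6.6 - 6.14
    return 1.0 - proposed_gap / competitor_gap


reduction = gap_reduction(baseline=6.14, competitor=7.3, proposed=6.6)
print(f"{reduction:.1%}")
```

With the quoted numbers the gap shrinks from 1.16 to 0.46, a reduction of roughly 60%, which is consistent with the abstract's "more than 55%".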

[LG-51] Leveraging Machine Learning and Deep Learning Techniques for Improved Pathological Staging of Prostate Cancer

Link: https://arxiv.org/abs/2502.09686
Authors: Raziehsadat Ghalamkarian, Marziehsadat Ghalamkarian, MortezaAli Ahmadi, Sayed Mohammad Ahmadi, Abolfazl Diyanat
Categories: Machine Learning (cs.LG)

[LG-52] A Novel Hybrid Approach to Contraceptive Demand Forecasting: Integrating Point Predictions with Probabilistic Distributions

Link: https://arxiv.org/abs/2502.09685
Authors: Harsha Chamara Hewage, Bahman Rostami-Tabar, Aris Syntetos, Federico Liberatore, Glenn Milano
Categories: Machine Learning (cs.LG); Applications (stat.AP); Methodology (stat.ME)

[LG-53] Channel Dependence, Limited Lookback Windows, and the Simplicity of Datasets: How Biased is Time Series Forecasting?

Link: https://arxiv.org/abs/2502.09683
Authors: Ibram Abdelmalak, Kiran Madhusudhanan, Jungmin Choi, Maximilian Stubbemann, Lars Schmidt-Thieme
Categories: Machine Learning (cs.LG)

Abstract:Time-series forecasting research has converged to a small set of datasets and a standardized collection of evaluation scenarios. Such a standardization is to a specific extent needed for comparable research. However, the underlying assumption is, that the considered setting is a representative for the problem as a whole. In this paper, we challenge this assumption and show that the current scenario gives a strongly biased perspective on the state of time-series forecasting research. To be more detailed, we show that the current evaluation scenario is heavily biased by the simplicity of the current datasets. We furthermore emphasize, that when the lookback-window is properly tuned, current models usually do not need any information flow across channels. However, when using more complex benchmark data, the situation changes: Here, modeling channel-interactions in a sophisticated manner indeed enhances performances. Furthermore, in this complex evaluation scenario, Crossformer, a method regularly neglected as an important baseline, is the SOTA method for time series forecasting. Based on this, we present the Fast Channel-dependent Transformer (FaCT), a simplified version of Crossformer which closes the runtime gap between Crossformer and TimeMixer, leading to an efficient model for complex forecasting datasets.

[LG-54] Learning Euler Factors of Elliptic Curves

Link: https://arxiv.org/abs/2502.10357
Authors: Angelica Babei, François Charton, Edgar Costa, Xiaoyu Huang, Kyu-Hwan Lee, David Lowry-Duda, Ashvni Narayanan, Alexey Pozdnyakov
Categories: Number Theory (math.NT); Machine Learning (cs.LG)
Comments: 18 pages

[LG-55] Studying number theory with deep learning: a case study with the Möbius and squarefree indicator functions

Link: https://arxiv.org/abs/2502.10335
Authors: David Lowry-Duda
Categories: Number Theory (math.NT); Machine Learning (cs.LG)
Comments: 10 pages

Abstract:Building on work of Charton, we train small transformer models to calculate the Möbius function $\mu(n)$ and the squarefree indicator function $\mu^2(n)$. The models attain nontrivial predictive power. We then iteratively train additional models to understand how the model functions, ultimately finding a theoretical explanation.
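
For readers unfamiliar with the two target functions, the following sketch computes them exactly by trial division. It only illustrates the standard definitions (mu(n) is 0 when a square greater than 1 divides n, and otherwise (-1) raised to the number of prime factors); it says nothing about how the paper's transformer models arrive at their predictions. The function names are illustrative.

```python
def mobius(n: int) -> int:
    """Return the Möbius function mu(n) for a positive integer n."""
    if n < 1:
        raise ValueError("n must be a positive integer")
    result = 1
    p = 2
    while p * p <= n:
        if n % p == 0:
            n //= p
            if n % p == 0:
                return 0  # p**2 divides the original n
            result = -result  # one more distinct prime factor
        p += 1
    if n > 1:
        result = -result  # the remaining cofactor is a single prime
    return result


def squarefree_indicator(n: int) -> int:
    """Return mu(n)**2, which is 1 if n is squarefree and 0 otherwise."""
    return mobius(n) ** 2


print([mobius(n) for n in range(1, 11)])
```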

[LG-56] Generalised Parallel Tempering: Flexible Replica Exchange via Flows and Diffusions

Link: https://arxiv.org/abs/2502.10328
Authors: Leo Zhang, Peter Potaptchik, Arnaud Doucet, Hai-Dang Dau, Saifuddin Syed
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG)

[LG-57] AdaPTS: Adapting Univariate Foundation Models to Probabilistic Multivariate Time Series Forecasting

Link: https://arxiv.org/abs/2502.10235
Authors: Abdelhakim Benechehab, Vasilii Feofanov, Giuseppe Paolo, Albert Thomas, Maurizio Filippone, Balázs Kégl
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG)

Abstract:Pre-trained foundation models (FMs) have shown exceptional performance in univariate time series forecasting tasks. However, several practical challenges persist, including managing intricate dependencies among features and quantifying uncertainty in predictions. This study aims to tackle these critical limitations by introducing adapters; feature-space transformations that facilitate the effective use of pre-trained univariate time series FMs for multivariate tasks. Adapters operate by projecting multivariate inputs into a suitable latent space and applying the FM independently to each dimension. Inspired by the literature on representation learning and partially stochastic Bayesian neural networks, we present a range of adapters and optimization/inference strategies. Experiments conducted on both synthetic and real-world datasets confirm the efficacy of adapters, demonstrating substantial enhancements in forecasting accuracy and uncertainty quantification compared to baseline methods. Our framework, AdaPTS, positions adapters as a modular, scalable, and effective solution for leveraging time series FMs in multivariate contexts, thereby promoting their wider adoption in real-world applications. We release the code at this https URL.

[LG-58] Agentic End-to-End De Novo Protein Design for Tailored Dynamics Using a Language Diffusion Model

Link: https://arxiv.org/abs/2502.10173
Authors: Bo Ni, Markus J. Buehler
Categories: Biomolecules (q-bio.BM); Mesoscale and Nanoscale Physics (cond-mat.mes-hall); Materials Science (cond-mat.mtrl-sci); Machine Learning (cs.LG)

Abstract:Proteins are dynamic molecular machines whose biological functions, spanning enzymatic catalysis, signal transduction, and structural adaptation, are intrinsically linked to their motions. Designing proteins with targeted dynamic properties, however, remains a challenge due to the complex, degenerate relationships between sequence, structure, and molecular motion. Here, we introduce VibeGen, a generative AI framework that enables end-to-end de novo protein design conditioned on normal mode vibrations. VibeGen employs an agentic dual-model architecture, comprising a protein designer that generates sequence candidates based on specified vibrational modes and a protein predictor that evaluates their dynamic accuracy. This approach synergizes diversity, accuracy, and novelty during the design process. Via full-atom molecular simulations as direct validation, we demonstrate that the designed proteins accurately reproduce the prescribed normal mode amplitudes across the backbone while adopting various stable, functionally relevant structures. Notably, generated sequences are de novo, exhibiting no significant similarity to natural proteins, thereby expanding the accessible protein space beyond evolutionary constraints. Our work integrates protein dynamics into generative protein design, and establishes a direct, bidirectional link between sequence and vibrational behavior, unlocking new pathways for engineering biomolecules with tailored dynamical and functional properties. This framework holds broad implications for the rational design of flexible enzymes, dynamic scaffolds, and biomaterials, paving the way toward dynamics-informed AI-driven protein engineering.

[LG-59] Enhancing anomaly detection with topology-aware autoencoders

Link: https://arxiv.org/abs/2502.10163
Authors: Vishal S. Ngairangbam, Błażej Rozwoda, Kazuki Sakurai, Michael Spannowsky
Categories: High Energy Physics - Phenomenology (hep-ph); Machine Learning (cs.LG); High Energy Physics - Experiment (hep-ex)
Comments: 12 pages, 5 figures, 2 tables

Abstract:Anomaly detection in high-energy physics is essential for identifying new physics beyond the Standard Model. Autoencoders provide a signal-agnostic approach but are limited by the topology of their latent space. This work explores topology-aware autoencoders, embedding phase-space distributions onto compact manifolds that reflect energy-momentum conservation. We construct autoencoders with spherical ($S^n$), product ($S^2 \otimes S^2$), and projective ($\mathbb{RP}^2$) latent spaces and compare their anomaly detection performance against conventional Euclidean embeddings. Our results show that autoencoders with topological priors significantly improve anomaly separation by preserving the global structure of the data manifold and reducing spurious reconstruction errors. Applying our approach to simulated hadronic top-quark decays, we show that latent spaces with appropriate topological constraints enhance sensitivity and robustness in detecting anomalous events. This study establishes topology-aware autoencoders as a powerful tool for unsupervised searches for new physics in particle-collision data.

[LG-60] Combinatorial Reinforcement Learning with Preference Feedback

Link: https://arxiv.org/abs/2502.10158
Authors: Joongkyu Lee, Min-hwan Oh
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG)
Comments: Preprint. Under review

Abstract:In this paper, we consider combinatorial reinforcement learning with preference feedback, where a learning agent sequentially offers an action (an assortment of multiple items) to a user, whose preference feedback follows a multinomial logistic (MNL) model. This framework allows us to model real-world scenarios, particularly those involving long-term user engagement, such as in recommender systems and online advertising. However, this framework faces two main challenges: (1) the unknown value of each item, unlike traditional MNL bandits that only address single-step preference feedback, and (2) the difficulty of ensuring optimism while maintaining tractable assortment selection in the combinatorial action space with unknown values. In this paper, we assume a contextual MNL preference model, where the mean utilities are linear, and the value of each item is approximated by a general function. We propose an algorithm, MNL-VQL, that addresses these challenges, making it both computationally and statistically efficient. As a special case, for linear MDPs (with the MNL preference feedback), we establish the first regret lower bound in this framework and show that MNL-VQL achieves nearly minimax-optimal regret. To the best of our knowledge, this is the first work to provide statistical guarantees in combinatorial RL with preference feedback.

[LG-61] Improved Online Confidence Bounds for Multinomial Logistic Bandits

Link: https://arxiv.org/abs/2502.10020
Authors: Joongkyu Lee, Min-hwan Oh
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG)
Comments: Preprint. Under review

[LG-62] Estimation of the Learning Coefficient Using Empirical Loss

Link: https://arxiv.org/abs/2502.09998
Authors: Tatsuyoshi Takio, Joe Suzuki
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG)
Comments: 15 pages, 6 figures, 4 tables

[LG-63] On Volume Minimization in Conformal Regression

Link: https://arxiv.org/abs/2502.09985
Authors: Batiste Le Bars (MAGNET), Pierre Humbert (LPSM (UMR_8001))
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG)

[LG-64] Universal Machine Learning Interatomic Potentials are Ready for Solid Ion Conductors

Link: https://arxiv.org/abs/2502.09970
Authors: Hongwei Du, Jian Hui, Lanting Zhang, Hong Wang
Categories: Materials Science (cond-mat.mtrl-sci); Machine Learning (cs.LG)

[LG-65] Interpretable Early Warnings using Machine Learning in an Online Game-experiment

Link: https://arxiv.org/abs/2502.09880
Authors: Guillaume Falmagne, Anna B. Stephenson, Simon A. Levin
Categories: Physics and Society (physics.soc-ph); Machine Learning (cs.LG); Social and Information Networks (cs.SI); Adaptation and Self-Organizing Systems (nlin.AO); Machine Learning (stat.ML)

[LG-66] Gradient GA: Gradient Genetic Algorithm for Drug Molecular Design

Link: https://arxiv.org/abs/2502.09860
Authors: Chris Zhuang, Debadyuti Mukherjee, Yingzhou Lu, Tianfan Fu, Ruqi Zhang
Categories: Biomolecules (q-bio.BM); Computational Engineering, Finance, and Science (cs.CE); Machine Learning (cs.LG); Machine Learning (stat.ML)

[LG-67] Algorithmic contiguity from low-degree conjecture and applications in correlated random graphs

Link: https://arxiv.org/abs/2502.09832
Authors: Zhangsong Li
Categories: Machine Learning (stat.ML); Data Structures and Algorithms (cs.DS); Machine Learning (cs.LG); Probability (math.PR); Statistics Theory (math.ST)
Comments: 40 pages. arXiv admin note: text overlap with arXiv:2311.00289 by other authors

[LG-68] ΛCDM and early dark energy in latent space: a data-driven parametrization of the CMB temperature power spectrum

Link: https://arxiv.org/abs/2502.09810
Authors: Davide Piras, Laura Herold, Luisa Lucie-Smith, Eiichiro Komatsu
Categories: Cosmology and Nongalactic Astrophysics (astro-ph.CO); Instrumentation and Methods for Astrophysics (astro-ph.IM); Machine Learning (cs.LG)
Comments: 17 pages, 12 figures, comments welcome

[LG-69] Reconstruction of frequency-localized functions from pointwise samples via least squares and deep learning

Link: https://arxiv.org/abs/2502.09794
Authors: A. Martina Neuman, Andres Felipe Lerma Pineda, Jason J. Bramburger, Simone Brugiapaglia
Categories: Classical Analysis and ODEs (math.CA); Machine Learning (cs.LG)

[LG-70] ExoMiner on TESS with Transfer Learning from Kepler: Transit Classification and Vetting Catalog for 2-min Data

Link: https://arxiv.org/abs/2502.09790
Authors: Hamed Valizadegan, Miguel J. S. Martinho, Jon M. Jenkins, Joseph D. Twicken, Douglas A. Caldwell, Patrick Maynard, Hongbo Wei, William Zhong, Charles Yates, Sam Donald, Karen A. Collins, David Latham, Khalid Barkaoui, Perry Berlind, Michael L. Calkins, Kylee Carden, Nikita Chazov, Gilbert A. Esquerdo, Tristan Guillot, Vadim Krushinsky, Grzegorz Nowak, Benjamin V. Rackham, Amaury Triaud, Richard P. Schwarz, Denise Stephens, Chris Stockdale, Jiaqi Wang, Cristilyn N. Watkins, Francis P. Wilkin
Categories: Earth and Planetary Astrophysics (astro-ph.EP); Instrumentation and Methods for Astrophysics (astro-ph.IM); Machine Learning (cs.LG)

[LG-71] Iterative quantum optimisation with a warm-started quantum state

Link: https://arxiv.org/abs/2502.09704
Authors: Haomu Yuan, Songqinghao Yang, Crispin H. W. Barnes
Categories: Quantum Physics (quant-ph); Disordered Systems and Neural Networks (cond-mat.dis-nn); Machine Learning (cs.LG); Optimization and Control (math.OC); Computational Physics (physics.comp-ph)
Comments: feedback welcome, 13 pages, 12 figures

[LG-72] Lifespan tree of brain anatomy: diagnostic values for motor and cognitive neurodegenerative diseases

Link: https://arxiv.org/abs/2502.09682
Authors: Pierrick Coupé, Boris Mansencal, José V. Manjón, Patrice Péran, Wassilios G. Meissner, Thomas Tourdias, Vincent Planche
Categories: Image and Video Processing (eess.IV); Machine Learning (cs.LG)

[LG-73] On the Bias, Fairness, and Bias Mitigation for a Wearable-based Freezing of Gait Detection in Parkinson's Disease

Link: https://arxiv.org/abs/2502.09626
Authors: Timothy Odonga, Christine D. Esper, Stewart A. Factor, J. Lucas McKay, Hyeokhyen Kwon
Categories: Signal Processing (eess.SP); Machine Learning (cs.LG)
Comments: Submitted to IMWUT 2025

[LG-74] Transformer Based Time-Series Forecasting for Stock

Link: https://arxiv.org/abs/2502.09625
Authors: Shuozhe Li, Zachery B Schulwol, Risto Miikkulainen
Categories: Computational Finance (q-fin.CP); Machine Learning (cs.LG)

Information Retrieval

[IR-0] Semantica: Decentralized Search using a LLM-Guided Semantic Tree Overlay

Link: https://arxiv.org/abs/2502.10151
Authors: Petru Neague, Quinten Stokkink, Naman Goel, Johan Pouwelse
Categories: Information Retrieval (cs.IR)

Abstract: Centralized search engines are key for the Internet, but lead to undesirable concentration of power. Decentralized alternatives fail to offer equal document retrieval accuracy and speed. Nevertheless, Semantic Overlay Networks can come close to the performance of centralized solutions when the semantics of documents are properly captured. This work uses embeddings from Large Language Models to capture semantics and fulfill the promise of Semantic Overlay Networks. Our proposed algorithm, called Semantica, constructs a prefix tree (trie) utilizing document embeddings calculated by a language model. Users connect to each other based on the embeddings of their documents, ensuring that semantically similar users are directly linked. Thereby, this construction makes it more likely for user searches to be answered by the users that they are directly connected to, or by the users they are close to in the network connection graph. The implementation of our algorithm also accommodates the semantic diversity of individual users by spawning “clone” user identifiers in the tree. Our experiments use emulation with a real-world workload to show Semantica’s ability to identify and connect to similar users quickly. Semantica finds up to ten times more semantically similar users than current state-of-the-art approaches. At the same time, Semantica can retrieve more than two times the number of relevant documents given the same network load. We also make our code publicly available to facilitate further research in the area.
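
As a rough illustration of how a prefix tree over embeddings can group similar users, here is a minimal sketch. It is not the Semantica algorithm: the abstract does not say how embeddings become trie keys, so the sign-based binarization in `embedding_to_key`, the `EmbeddingTrie` class, and the toy user names and vectors are all assumptions made for this example.

```python
from typing import Dict, List, Sequence


def embedding_to_key(embedding: Sequence[float]) -> str:
    """Hypothetical keying: one bit per dimension, '1' if the value is positive."""
    return "".join("1" if x > 0 else "0" for x in embedding)


class TrieNode:
    def __init__(self) -> None:
        self.children: Dict[str, "TrieNode"] = {}
        self.users: List[str] = []  # users whose key passes through this node


class EmbeddingTrie:
    """Toy prefix tree keyed by binarized embeddings (assumed design)."""

    def __init__(self) -> None:
        self.root = TrieNode()

    def insert(self, user: str, embedding: Sequence[float]) -> None:
        node = self.root
        for bit in embedding_to_key(embedding):
            node = node.children.setdefault(bit, TrieNode())
            node.users.append(user)

    def closest_users(self, embedding: Sequence[float]) -> List[str]:
        """Return the users that share the longest key prefix with the query."""
        node = self.root
        best: List[str] = []
        for bit in embedding_to_key(embedding):
            if bit not in node.children:
                break
            node = node.children[bit]
            best = node.users
        return list(best)


trie = EmbeddingTrie()
trie.insert("alice", [0.9, 0.2, -0.4, 0.1])
trie.insert("bob", [0.7, 0.3, -0.1, -0.5])
trie.insert("carol", [-0.8, -0.2, 0.6, 0.3])

# Users whose binarized embedding shares the longest prefix with the query.
neighbours = trie.closest_users([0.8, 0.1, -0.2, 0.4])
```

In this toy setup, users whose documents embed close together tend to share long key prefixes, so a search can start with the users found deepest along the query's path.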

[IR-1] An Efficient Large Recommendation Model: Towards a Resource-Optimal Scaling Law

Link: https://arxiv.org/abs/2502.09888
Authors: Songpei Xu, Shijia Wang, Da Guo, Xianwen Guo, Qiang Xiao, Fangjian Li, Chuanjiang Luo
Categories: Information Retrieval (cs.IR)

Abstract: The pursuit of scaling up recommendation models confronts intrinsic tensions between expanding model capacity and preserving computational tractability. While prior studies have explored scaling laws for recommendation systems, their resource-intensive paradigms (often requiring tens of thousands of A100 GPU hours) remain impractical for most industrial applications. This work addresses a critical gap: achieving sustainable model scaling under strict computational budgets. We propose Climber, a resource-efficient recommendation framework comprising two synergistic components: the ASTRO model architecture for algorithmic innovation and the TURBO acceleration framework for engineering optimization. ASTRO (Adaptive Scalable Transformer for RecOmmendation) adopts two core innovations: (1) multi-scale sequence partitioning that reduces attention complexity from O(n^2 d) to O(n^2 d / N_b) via hierarchical blocks, enabling more efficient scaling with sequence length; (2) dynamic temperature modulation that adaptively adjusts attention scores for multimodal distributions arising from inherent multi-scenario and multi-behavior interactions. Complemented by TURBO (Two-stage Unified Ranking with Batched Output), a co-designed acceleration framework integrating gradient-aware feature compression and memory-efficient Key-Value caching, Climber achieves 5.15x throughput gains without performance degradation. Comprehensive offline experiments on multiple datasets validate that Climber exhibits a more ideal scaling curve. To our knowledge, this is the first publicly documented framework where controlled model scaling drives continuous online metric growth (12.19% overall lift) without prohibitive resource costs. Climber has been successfully deployed on Netease Cloud Music, one of China’s largest music streaming platforms, serving tens of millions of users daily.
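
The complexity reduction quoted in the abstract can be checked with simple arithmetic. The sketch below is not the ASTRO implementation. It assumes that "Nb" denotes a number of equal, non-overlapping blocks and that attention is computed only within each block, and the function names are hypothetical.

```python
def full_attention_cost(n: int, d: int) -> int:
    """Multiply-accumulate count for full self-attention: n^2 * d."""
    return n * n * d


def blocked_attention_cost(n: int, d: int, num_blocks: int) -> int:
    """Cost when the sequence is split into equal blocks (assumed scheme).

    Each block of length n / num_blocks attends only to itself, so the
    total is num_blocks * (n / num_blocks)^2 * d = n^2 * d / num_blocks.
    """
    if num_blocks <= 0 or n % num_blocks != 0:
        raise ValueError("n must be divisible by a positive num_blocks")
    block_len = n // num_blocks
    return num_blocks * block_len * block_len * d


# Toy sizes chosen for illustration only.
n, d, num_blocks = 1024, 64, 8
full = full_attention_cost(n, d)
blocked = blocked_attention_cost(n, d, num_blocks)
speedup = full // blocked  # equals num_blocks when n is divisible by it
```

Under these assumptions the saving is exactly a factor of the block count, which matches the O(n^2 d) to O(n^2 d / N_b) statement; the cost remains quadratic in n for a fixed number of blocks.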

[IR-2] Data and Decision Traceability for the Welder's Arc

Link: https://arxiv.org/abs/2502.09827
Authors: Yasir Latif, Latha Pratti, Samya Bagchi
Categories: Information Retrieval (cs.IR); Cryptography and Security (cs.CR)

Abstract: Space Protocol is applying the principles derived from MITRE and NIST’s Supply Chain Traceability: Manufacturing Meta-Framework (NIST IR 8536) to a complex multi-party system to achieve introspection, auditing, and replay of data and decisions that ultimately lead to an end decision. The core goal of decision traceability is to ensure transparency, accountability, and integrity within the Welder's Arc (WA) system. This is accomplished by providing a clear, auditable path from the system’s inputs all the way to the final decision. This traceability enables the system to track the various algorithms and data flows that have influenced a particular outcome.

[IR-3] Prioritized Ranking Experimental Design Using Recommender Systems in Two-Sided Platforms

Link: https://arxiv.org/abs/2502.09806
Authors: Mahyar Habibi, Zahra Khanalizadeh, Negar Ziaeian
Categories: Econometrics (econ.EM); Information Retrieval (cs.IR); Social and Information Networks (cs.SI); Methodology (stat.ME)

Abstract: Interdependencies between units in online two-sided marketplaces complicate estimating causal effects in experimental settings. We propose a novel experimental design to mitigate the interference bias in estimating the total average treatment effect (TATE) of item-side interventions in online two-sided marketplaces. Our Two-Sided Prioritized Ranking (TSPR) design uses the recommender system as an instrument for experimentation. TSPR strategically prioritizes items based on their treatment status in the listings displayed to users. We designed TSPR to provide users with a coherent platform experience by ensuring access to all items and a consistent realization of their treatment by all users. We evaluate our experimental design through simulations using a search impression dataset from an online travel agency. Our methodology closely estimates the true simulated TATE, while a baseline item-side estimator significantly overestimates TATE.

Attachment Download

Click to download the full list of today's papers
