This post contains the latest paper list retrieved from Arxiv.org on 2025-01-31. It is updated automatically and organized into five broad areas: NLP, CV, ML, AI, and IR. If you would like to receive the daily digest by email, please leave your email address in the comments.

Note: paper data is fetched from Arxiv.org and updated automatically around 12:00 each day.

Friendly reminder: if you would like to receive the daily paper data by email, please leave your email address in the comments.

Table of Contents

Overview (2025-01-31)

A total of 356 papers were updated today, including:

  • Natural Language Processing: 35 papers (Computation and Language (cs.CL))
  • Artificial Intelligence: 89 papers (Artificial Intelligence (cs.AI))
  • Computer Vision: 62 papers (Computer Vision and Pattern Recognition (cs.CV))
  • Machine Learning: 130 papers (Machine Learning (cs.LG))

Natural Language Processing

[NLP-0] Thoughts Are All Over the Place: On the Underthinking of o1-Like LLMs

Quick Read: This paper tackles the frequent "underthinking" of large language models (LLMs) on complex reasoning tasks: the model prematurely switches between lines of thought without fully exploring promising paths to a correct solution, leading to shallow reasoning and degraded performance, especially on challenging mathematical problems. The key contribution is a decoding strategy with a thought switching penalty (TIP) that suppresses premature thought transitions and encourages deeper exploration of each reasoning path, improving accuracy on challenging datasets without any model fine-tuning.

Link: https://arxiv.org/abs/2501.18585
Authors: Yue Wang, Qiuzhi Liu, Jiahao Xu, Tian Liang, Xingyu Chen, Zhiwei He, Linfeng Song, Dian Yu, Juntao Li, Zhuosheng Zhang, Rui Wang, Zhaopeng Tu, Haitao Mi, Dong Yu
Affiliations: Tencent AI Lab
Categories: Computation and Language (cs.CL)
Comments:

Abstract:Large language models (LLMs) such as OpenAI’s o1 have demonstrated remarkable abilities in complex reasoning tasks by scaling test-time compute and exhibiting human-like deep thinking. However, we identify a phenomenon we term underthinking, where o1-like LLMs frequently switch between different reasoning thoughts without sufficiently exploring promising paths to reach a correct solution. This behavior leads to inadequate depth of reasoning and decreased performance, particularly on challenging mathematical problems. To systematically analyze this issue, we conduct experiments on three challenging test sets and two representative open-source o1-like models, revealing that frequent thought switching correlates with incorrect responses. We introduce a novel metric to quantify underthinking by measuring token efficiency in incorrect answers. To address underthinking, we propose a decoding strategy with thought switching penalty TIP that discourages premature transitions between thoughts, encouraging deeper exploration of each reasoning path. Experimental results demonstrate that our approach improves accuracy across challenging datasets without requiring model fine-tuning. Our findings contribute to understanding reasoning inefficiencies in o1-like LLMs and offer a practical solution to enhance their problem-solving capabilities.
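
To make the idea concrete, here is a minimal sketch of how a thought-switching penalty could be applied at decoding time. The marker tokens and the penalty value are illustrative assumptions, not the paper's actual configuration.

```python
# Hypothetical markers that tend to open a new line of thought.
THOUGHT_SWITCH_MARKERS = {"Alternatively", "Wait", "However"}

def apply_thought_switch_penalty(logits, penalty=5.0):
    """Subtract a fixed penalty from the logits of tokens that would start
    a new thought, discouraging premature switching between reasoning paths."""
    return {
        token: logit - penalty if token in THOUGHT_SWITCH_MARKERS else logit
        for token, logit in logits.items()
    }

# Next-token logits before the penalty: "Alternatively" would have won.
logits = {"Alternatively": 2.1, "Therefore": 1.9, "the": 0.3}
print(apply_thought_switch_penalty(logits))  # "Therefore" now ranks first
```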

[NLP-1] R.I.P.: Better Models by Survival of the Fittest Prompts

Quick Read: This paper studies how training-data quality drives final model quality, proposing to evaluate data integrity under the hypothesis that low-quality input prompts yield high-variance, low-quality responses. The key idea, Rejecting Instruction Preferences (RIP), measures the quality of the rejected response and the reward gap between the chosen and rejected preferences. RIP can filter prompts from existing training sets or be used to build high-quality synthetic datasets, yielding large performance gains across multiple benchmarks.

Link: https://arxiv.org/abs/2501.18578
Authors: Ping Yu, Weizhe Yuan, Olga Golovneva, Tianhao Wu, Sainbayar Sukhbaatar, Jason Weston, Jing Xu
Affiliations: Meta; NYU; UC Berkeley
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:Training data quality is one of the most important drivers of final model quality. In this work, we introduce a method for evaluating data integrity based on the assumption that low-quality input prompts result in high variance and low quality responses. This is achieved by measuring the rejected response quality and the reward gap between the chosen and rejected preference pair. Our method, Rejecting Instruction Preferences (RIP) can be used to filter prompts from existing training sets, or to make high quality synthetic datasets, yielding large performance gains across various benchmarks compared to unfiltered data. Using Llama 3.1-8B-Instruct, RIP improves AlpacaEval2 LC Win Rate by 9.4%, Arena-Hard by 8.7%, and WildBench by 9.9%. Using Llama 3.3-70B-Instruct, RIP improves Arena-Hard from 67.5 to 82.9, which is from 18th place to 6th overall in the leaderboard.
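
As a rough illustration of the filtering rule, the sketch below keeps a prompt only when its rejected response still scores reasonably well and the chosen-rejected reward gap is small; the thresholds and data layout are assumptions for illustration, not the paper's tuned values.

```python
def rip_filter(examples, min_rejected_reward=0.3, max_reward_gap=0.5):
    """Keep prompts whose rejected response quality is high and whose
    chosen-vs-rejected reward gap is small (low variance, trustworthy)."""
    kept = []
    for ex in examples:  # ex: {"prompt", "chosen_reward", "rejected_reward"}
        gap = ex["chosen_reward"] - ex["rejected_reward"]
        if ex["rejected_reward"] >= min_rejected_reward and gap <= max_reward_gap:
            kept.append(ex)
    return kept

data = [
    {"prompt": "Explain DPO.", "chosen_reward": 0.8, "rejected_reward": 0.6},
    {"prompt": "asdf??", "chosen_reward": 0.7, "rejected_reward": 0.1},
]
print([ex["prompt"] for ex in rip_filter(data)])  # the noisy prompt is dropped
```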

[NLP-2] Can we Retrieve Everything All at Once? ARM: An Alignment-Oriented LLM-based Retrieval Method

Quick Read: When decomposing complex queries, large language models (LLMs) are unaware of what data is available and how it is organized, which leads to suboptimal retrieval. The paper proposes ARM, an LLM-based retrieval method whose key idea is to explore relationships among data objects beyond merely matching the query utterance, aligning retrieval with the organization of the data collection and enabling a retrieve-all-at-once solution for complex queries.

Link: https://arxiv.org/abs/2501.18539
Authors: Peter Baile Chen, Yi Zhang, Michael Cafarella, Dan Roth
Affiliations: MIT; AWS AI; University of Pennsylvania
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
Comments:

Abstract:Real-world open-domain questions can be complicated, particularly when answering them involves information from multiple information sources. LLMs have demonstrated impressive performance in decomposing complex tasks into simpler steps, and previous work has used it for better retrieval in support of complex questions. However, LLM’s decomposition of questions is unaware of what data is available and how data is organized, often leading to a sub-optimal retrieval performance. Recent effort in agentic RAG proposes to perform retrieval in an iterative fashion, where a followup query is derived as an action based on previous rounds of retrieval. While this provides one way of interacting with the data collection, agentic RAG’s exploration of data is inefficient because successive queries depend on previous results rather than being guided by the organization of available data in the collection. To address this problem, we propose an LLM-based retrieval method – ARM, that aims to better align the question with the organization of the data collection by exploring relationships among data objects beyond matching the utterance of the query, thus leading to a retrieve-all-at-once solution for complex queries. We evaluated ARM on two datasets, Bird and OTT-QA. On Bird, it outperforms standard RAG with query decomposition by up to 5.2 pt in execution accuracy and agentic RAG (ReAct) by up to 15.9 pt. On OTT-QA, it achieves up to 5.5 pt and 19.3 pt higher F1 match scores compared to these approaches.

[NLP-3] Rethinking Bottlenecks in Safety Fine-Tuning of Vision Language Models

Quick Read: This paper targets the safety challenges of deploying large vision-language models (VLMs) in safety-critical domains. Existing safety fine-tuning methods focus on textual or multimodal content and either fail on challenging cases or disrupt the balance between helpfulness and harmlessness. The key contribution is the Multi-Image Safety (MIS) dataset, which pairs multi-image inputs with safety Chain-of-Thought (CoT) labels to strengthen visual perception and reasoning in safety-critical contexts. Experiments show that fine-tuning InternVL2.5-8B on MIS excels at challenging multi-image tasks requiring safety-related visual reasoning, raises average accuracy across five general benchmarks, and substantially reduces the Attack Success Rate (ASR) on multiple safety benchmarks, all while preserving general capabilities without trade-offs.

Link: https://arxiv.org/abs/2501.18533
Authors: Yi Ding, Lijun Li, Bing Cao, Jing Shao
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Cryptography and Security (cs.CR)
Comments:

Abstract:Large Vision-Language Models (VLMs) have achieved remarkable performance across a wide range of tasks. However, their deployment in safety-critical domains poses significant challenges. Existing safety fine-tuning methods, which focus on textual or multimodal content, fall short in addressing challenging cases or disrupt the balance between helpfulness and harmlessness. Our evaluation highlights a safety reasoning gap: these methods lack safety visual reasoning ability, leading to such bottlenecks. To address this limitation and enhance both visual perception and reasoning in safety-critical contexts, we propose a novel dataset that integrates multi-image inputs with safety Chain-of-Thought (CoT) labels as fine-grained reasoning logic to improve model performance. Specifically, we introduce the Multi-Image Safety (MIS) dataset, an instruction-following dataset tailored for multi-image safety scenarios, consisting of training and test splits. Our experiments demonstrate that fine-tuning InternVL2.5-8B with MIS significantly outperforms both powerful open-source models and API-based models in challenging multi-image tasks requiring safety-related visual reasoning. This approach not only delivers exceptional safety performance but also preserves general capabilities without any trade-offs. Specifically, fine-tuning with MIS increases average accuracy by 0.83% across five general benchmarks and reduces the Attack Success Rate (ASR) on multiple safety benchmarks by a large margin. Data and Models are released under: this https URL

[NLP-4] Differentially Private Steering for Large Language Model Alignment ICLR 2025

Quick Read: This paper addresses the risk of leaking private information when aligning large language models (LLMs) with private data. The key contribution is the Private Steering for LLM Alignment (PSA) algorithm, which edits LLM activations with differential privacy (DP) guarantees, preventing leakage of private information while preserving model performance.

Link: https://arxiv.org/abs/2501.18532
Authors: Anmol Goel, Yaxi Hu, Iryna Gurevych, Amartya Sanyal
Affiliations: Ubiquitous Knowledge Processing Lab (UKP Lab), Department of Computer Science and Hessian Center for AI (hessian.AI), Technical University of Darmstadt; Max Planck Institute for Intelligent Systems, Tübingen, Germany; Department of Computer Science, University of Copenhagen, Denmark
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: ICLR 2025; Code: this https URL

Abstract:Aligning Large Language Models (LLMs) with human values and away from undesirable behaviors (such as hallucination) has become increasingly important. Recently, steering LLMs towards a desired behavior via activation editing has emerged as an effective method to mitigate harmful generations at inference-time. Activation editing modifies LLM representations by preserving information from positive demonstrations (e.g., truthful) and minimising information from negative demonstrations (e.g., hallucinations). When these demonstrations come from a private dataset, the aligned LLM may leak private information contained in those private samples. In this work, we present the first study of aligning LLM behavior with private datasets. Our work proposes the Private Steering for LLM Alignment (PSA) algorithm to edit LLM activations with differential privacy (DP) guarantees. We conduct extensive experiments on seven different benchmarks with open-source LLMs of different sizes (0.5B to 7B) and model families (LlaMa, Qwen, Mistral and Gemma). Our results show that PSA achieves DP guarantees for LLM alignment with minimal loss in performance, including alignment metrics, open-ended text generation quality, and general-purpose reasoning. We also develop the first Membership Inference Attack (MIA) for evaluating and auditing the empirical privacy for the problem of LLM steering via activation editing. Our attack is tailored for activation editing and relies solely on the generated texts without their associated probabilities. Our experiments support the theoretical guarantees by showing improved guarantees for our PSA algorithm compared to several existing non-private techniques.
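
For intuition, here is a minimal sketch of a differentially private steering vector in the spirit of PSA: the direction is a clipped mean difference of activations with Gaussian noise added. The clipping norm, noise calibration, and update rule are simplifying assumptions; the paper's actual mechanism differs in detail.

```python
import numpy as np

def dp_steering_vector(pos_acts, neg_acts, clip_norm=1.0, noise_mult=1.0):
    """Mean-difference steering direction with per-sample clipping and
    Gaussian noise, the standard recipe for a DP mean estimate."""
    def clip(rows):
        rows = np.asarray(rows, dtype=float)
        norms = np.linalg.norm(rows, axis=1, keepdims=True)
        return rows * np.minimum(1.0, clip_norm / np.maximum(norms, 1e-12))

    pos, neg = clip(pos_acts), clip(neg_acts)
    direction = pos.mean(axis=0) - neg.mean(axis=0)
    sigma = noise_mult * clip_norm / min(len(pos), len(neg))  # noise scale
    return direction + np.random.normal(0.0, sigma, size=direction.shape)

vec = dp_steering_vector(np.random.randn(64, 16), np.random.randn(64, 16))
```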

[NLP-5] Streaming DiLoCo with overlapping communication: Towards a Distributed Free Lunch

Quick Read: This paper targets the high-bandwidth communication required for distributed training of large language models (LLMs). The key ideas are to synchronize subsets of parameters in sequence, allow training to continue during synchronization, and quantize the exchanged data, which together sharply reduce peak bandwidth, matching prior training quality while requiring two orders of magnitude less bandwidth.

Link: https://arxiv.org/abs/2501.18512
Authors: Arthur Douillard, Yanislav Donchev, Keith Rush, Satyen Kale, Zachary Charles, Zachary Garrett, Gabriel Teston, Dave Lacey, Ross McIlroy, Jiajun Shen, Alexandre Ramé, Arthur Szlam, Marc'Aurelio Ranzato, Paul Barham
Affiliations: Google DeepMind; Google Research; Google; Apple
Categories: Computation and Language (cs.CL)
Comments:

Abstract:Training of large language models (LLMs) is typically distributed across a large number of accelerators to reduce training time. Since internal states and parameter gradients need to be exchanged at each and every single gradient step, all devices need to be co-located using low-latency high-bandwidth communication links to support the required high volume of exchanged bits. Recently, distributed algorithms like DiLoCo have relaxed such co-location constraint: accelerators can be grouped into "workers", where synchronizations between workers only occur infrequently. This in turn means that workers can afford being connected by lower bandwidth communication links without affecting learning quality. However, in these methods, communication across workers still requires the same peak bandwidth as before, as the synchronizations require all parameters to be exchanged across all workers. In this paper, we improve DiLoCo in three ways. First, we synchronize only subsets of parameters in sequence, rather than all at once, which greatly reduces peak bandwidth. Second, we allow workers to continue training while synchronizing, which decreases wall clock time. Third, we quantize the data exchanged by workers, which further reduces bandwidth across workers. By properly combining these modifications, we show experimentally that we can distribute training of billion-scale parameters and reach similar quality as before, but reducing required bandwidth by two orders of magnitude.
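
A toy simulation of these modifications, under heavy simplifying assumptions (two simulated workers, int8 quantization, plain averaging; local training between shard syncs is elided):

```python
import numpy as np

def quantize_int8(x):
    """Symmetric int8 quantization of a parameter shard."""
    scale = np.abs(x).max() / 127.0 + 1e-12
    return (x / scale).round().astype(np.int8), scale

def sync_shard(worker_params, shard):
    """Average a single named shard across workers, exchanging int8 data."""
    dequantized = []
    for params in worker_params:
        q, scale = quantize_int8(params[shard])  # only int8 bytes go on the wire
        dequantized.append(q.astype(np.float32) * scale)
    mean = np.mean(dequantized, axis=0)
    for params in worker_params:
        params[shard] = mean

workers = [{"ffn": np.random.randn(4), "attn": np.random.randn(4)} for _ in range(2)]
for shard in ("ffn", "attn"):   # shards are synced in sequence, not all at once,
    sync_shard(workers, shard)  # so peak bandwidth is one shard, not the full model
```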

[NLP-6] WILDCHAT-50M: A Deep Dive Into the Role of Synthetic Data in Post-Training

Quick Read: This paper addresses the lack of open resources supporting research on LLM post-training techniques (from DPO to distillation) that refine behavior and unlock new skills. The key contribution is WILDCHAT-50M, the largest public chat dataset to date, containing responses from over 50 different open-weight models ranging from 0.5B to 104B parameters. Extending the WildChat dataset and running an extensive comparative analysis, the authors demonstrate its potential with RE-WILD, a public SFT mix that outperforms Allen AI's recent Tulu-3 SFT mixture with only 40% as many samples.

Link: https://arxiv.org/abs/2501.18511
Authors: Benjamin Feuer, Chinmay Hegde
Affiliations: NYU
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments:

Abstract:Language model (LLM) post-training, from DPO to distillation, can refine behaviors and unlock new skills, but the open science supporting these post-training techniques is still in its infancy. One limiting factor has been the difficulty of conducting large-scale comparative analyses of synthetic data generating models and LLM judges. To close this gap, we introduce WILDCHAT-50M, the largest public chat dataset to date. We extend the existing WildChat dataset to include responses not only from GPT, but from over 50 different open-weight models, ranging in size from 0.5B to 104B parameters. We conduct an extensive comparative analysis and demonstrate the potential of this dataset by creating RE-WILD, our own public SFT mix, which outperforms the recent Tulu-3 SFT mixture from Allen AI with only 40% as many samples. Our dataset, samples and code are available at this https URL.

[NLP-7] CALM: Unleashing the Cross-Lingual Self-Aligning Ability of Language Model Question Answering NAACL 2025

Quick Read: This paper addresses the inconsistent answers cross-lingual large language models (LLMs) give to culture-independent questions. The key contribution is the Cross-Lingual Self-Aligning (CALM) ability: sampling multiple responses across languages and using direct preference optimization (DPO) to align knowledge across languages, improving the consistency and accuracy of cross-lingual knowledge question answering.

Link: https://arxiv.org/abs/2501.18457
Authors: Yumeng Wang, Zhiyuan Fan, Qingyun Wang, May Fung, Heng Ji
Affiliations: University of Illinois Urbana-Champaign; HKUST
Categories: Computation and Language (cs.CL)
Comments: Accepted by NAACL 2025

Abstract:Large Language Models (LLMs) are pretrained on extensive multilingual corpora to acquire both language-specific cultural knowledge and general knowledge. Ideally, while LLMs should provide consistent responses to culture-independent questions across languages, we observe significant performance disparities. To address this, we explore the Cross-Lingual Self-Aligning ability of Language Models (CALM) to align knowledge across languages. Specifically, for a given question, we sample multiple responses across different languages, and select the most self-consistent response as the target, leaving the remaining responses as negative examples. We then employ direct preference optimization (DPO) to align the model’s knowledge across different languages. Evaluations on the MEDQA and X-CSQA datasets demonstrate CALM’s effectiveness in enhancing cross-lingual knowledge question answering, both in zero-shot and retrieval augmented settings. We also found that increasing the number of languages involved in CALM training leads to even higher accuracy and consistency. We offer a qualitative analysis of how cross-lingual consistency can enhance knowledge alignment and explore the method’s generalizability. The source code and data of this paper are available on GitHub.
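
A minimal sketch of the pair construction described in the abstract: sample answers to the same question in several languages, pick the most self-consistent one as the chosen response, and treat the others as rejected for DPO. Exact-match voting here is a simplification of the paper's consistency selection.

```python
from collections import Counter

def build_calm_pairs(question, answers_by_lang):
    """Turn per-language answers into DPO preference pairs."""
    answers = list(answers_by_lang.values())
    chosen, _ = Counter(answers).most_common(1)[0]  # most self-consistent answer
    return [
        {"prompt": question, "chosen": chosen, "rejected": a}
        for a in answers if a != chosen
    ]

pairs = build_calm_pairs(
    "What is the capital of Brazil?",
    {"en": "Brasília", "zh": "Brasília", "de": "Rio de Janeiro"},
)
print(pairs)  # one pair: Brasília preferred over Rio de Janeiro
```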

[NLP-8] GENIE: Generative Note Information Extraction model for structuring EHR data

Quick Read: This paper addresses the challenges posed by the unstructured nature of clinical text in electronic health records (EHRs), where traditional methods struggle to adapt to clinical notes from diverse healthcare settings. The key contribution is GENIE, a generative note information extraction system that uses large language models (LLMs) to efficiently convert unstructured clinical text into a standardized format. Its unified, end-to-end approach simplifies workflows, improves accuracy, and reduces the need for manual intervention.

Link: https://arxiv.org/abs/2501.18435
Authors: Huaiyuan Ying, Hongyi Yuan, Jinsen Lu, Zitian Qu, Yang Zhao, Zhengyun Zhao, Isaac Kohane, Tianxi Cai, Sheng Yu
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments:

Abstract:Electronic Health Records (EHRs) hold immense potential for advancing healthcare, offering rich, longitudinal data that combines structured information with valuable insights from unstructured clinical notes. However, the unstructured nature of clinical text poses significant challenges for secondary applications. Traditional methods for structuring EHR free-text data, such as rule-based systems and multi-stage pipelines, are often limited by their time-consuming configurations and inability to adapt across clinical notes from diverse healthcare settings. Few systems provide a comprehensive attribute extraction for terminologies. While giant large language models (LLMs) like GPT-4 and LLaMA 405B excel at structuring tasks, they are slow, costly, and impractical for large-scale use. To overcome these limitations, we introduce GENIE, a Generative Note Information Extraction system that leverages LLMs to streamline the structuring of unstructured clinical text into usable data with standardized format. GENIE processes entire paragraphs in a single pass, extracting entities, assertion statuses, locations, modifiers, values, and purposes with high accuracy. Its unified, end-to-end approach simplifies workflows, reduces errors, and eliminates the need for extensive manual intervention. Using a robust data preparation pipeline and fine-tuned small scale LLMs, GENIE achieves competitive performance across multiple information extraction tasks, outperforming traditional tools like cTAKES and MetaMap and can handle extra attributes to be extracted. GENIE strongly enhances real-world applicability and scalability in healthcare systems. By open-sourcing the model and test data, we aim to encourage collaboration and drive further advancements in EHR structurization.

[NLP-9] RbFT: Robust Fine-tuning for Retrieval-Augmented Generation against Retrieval Defects

Quick Read: This paper addresses the reduced trustworthiness of retrieval-augmented generation (RAG) systems caused by unreliable retrievers and knowledge bases. The key contribution is Robust Fine-Tuning (RbFT), which strengthens the resilience of large language models (LLMs) against retrieval defects via two targeted fine-tuning tasks, substantially improving the robustness of RAG systems across diverse retrieval conditions.

Link: https://arxiv.org/abs/2501.18365
Authors: Yiteng Tu, Weihang Su, Yujia Zhou, Yiqun Liu, Qingyao Ai
Affiliations: DCST, Tsinghua University
Categories: Computation and Language (cs.CL); Information Retrieval (cs.IR)
Comments:

Abstract:Retrieval-augmented generation (RAG) enhances large language models (LLMs) by integrating external knowledge retrieved from a knowledge base. However, its effectiveness is fundamentally constrained by the reliability of both the retriever and the knowledge base. In real-world scenarios, imperfections in these components often lead to the retrieval of noisy, irrelevant, or misleading counterfactual information, ultimately undermining the trustworthiness of RAG systems. To address this challenge, we propose Robust Fine-Tuning (RbFT), a method designed to enhance the resilience of LLMs against retrieval defects through two targeted fine-tuning tasks. Experimental results demonstrate that RbFT significantly improves the robustness of RAG systems across diverse retrieval conditions, surpassing existing methods while maintaining high inference efficiency and compatibility with other robustness techniques.

[NLP-10] MedXpertQA: Benchmarking Expert-Level Medical Reasoning and Understanding

Quick Read: This paper addresses the inadequacy of existing medical benchmarks, particularly their limited difficulty and clinical relevance. The key contribution is MedXpertQA, a dataset of expert-level, multimodal exam questions with rich clinical information and diverse images, combined with rigorous filtering and augmentation to raise difficulty and realism, enabling better assessment of advanced medical knowledge and reasoning.

Link: https://arxiv.org/abs/2501.18362
Authors: Yuxin Zuo, Shang Qu, Yifei Li, Zhangren Chen, Xuekai Zhu, Ermo Hua, Kaiyan Zhang, Ning Ding, Bowen Zhou
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:

Abstract:We introduce MedXpertQA, a highly challenging and comprehensive benchmark to evaluate expert-level medical knowledge and advanced reasoning. MedXpertQA includes 4,460 questions spanning 17 specialties and 11 body systems. It includes two subsets, Text for text evaluation and MM for multimodal evaluation. Notably, MM introduces expert-level exam questions with diverse images and rich clinical information, including patient records and examination results, setting it apart from traditional medical multimodal benchmarks with simple QA pairs generated from image captions. MedXpertQA applies rigorous filtering and augmentation to address the insufficient difficulty of existing benchmarks like MedQA, and incorporates specialty board questions to improve clinical relevance and comprehensiveness. We perform data synthesis to mitigate data leakage risk and conduct multiple rounds of expert reviews to ensure accuracy and reliability. We evaluate 16 leading models on MedXpertQA. Moreover, medicine is deeply connected to real-world decision-making, providing a rich and representative setting for assessing reasoning abilities beyond mathematics and code. To this end, we develop a reasoning-oriented subset to facilitate the assessment of o1-like models.

[NLP-11] State Stream Transformer (SST): Emergent Metacognitive Behaviours Through Latent State Persistence

Quick Read: This paper addresses the lack of latent computational continuity in traditional Transformer models during autoregressive generation. The key contribution is the State Stream Transformer (SST), which adds a sliding-window latent state (FFN) cache with weighted decay to maintain and evolve persistent latent processes. Without any weight updates, this modification yields markedly stronger reasoning and emergent metacognition-like behavior.

Link: https://arxiv.org/abs/2501.18356
Authors: Thea Aviss
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 25 pages, 3 figures

Abstract:We introduce the State Stream Transformer (SST), a novel LLM architecture that reveals emergent reasoning behaviours and capabilities latent in pretrained weights through addressing a fundamental limitation in traditional transformer models: the lack of latent computational continuity across autoregressive generations in the state space. SST introduces a sliding window latent state (FFN) cache with weighted decay that maintains and evolves persistent latent processes throughout autoregressive generations. Through controlled experiments comparing base and SST architectures using the same frozen weights, we demonstrate that this architectural modification alone enables enhanced reasoning capabilities which appear best explained by some form of potential higher-order processing, as evidenced by emergent metacognitive behaviours. These behaviours persist under controlled conditions designed to eliminate confounding factors such as stochastic variation or learned response patterns. Analysis of latent state distributions and processing dynamics provides evidence that it is solely the ‘state stream’ that is responsible for these phenomena. In quantitative evaluations, the SST achieves substantial performance improvements over the base model on two reasoning benchmarks, reaching 89.01% accuracy on GSM-8K (0-shot) and 91.04% on ARC Challenge (0-shot CoT). These findings indicate that persistent computation in the latent state space enables fundamentally different information processing and internal reasoning strategies, with implications for our understanding of artificial intelligence systems.
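
A minimal sketch of the core mechanism as described in the abstract: a latent state that persists across autoregressive steps, updated from each step's FFN output with weighted decay. The update rule, decay value, and blending are assumptions based on the abstract, not the paper's exact formulation.

```python
import torch

class StateStreamCache:
    """Persistent latent state carried across generation steps."""

    def __init__(self, hidden_dim, decay=0.9):
        self.state = torch.zeros(hidden_dim)
        self.decay = decay

    def update(self, ffn_out):
        # Evolve the persistent state, then blend it back into the stream.
        self.state = self.decay * self.state + (1.0 - self.decay) * ffn_out
        return ffn_out + self.state

cache = StateStreamCache(hidden_dim=8)
for _ in range(3):                 # successive autoregressive steps
    out = cache.update(torch.randn(8))
```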

[NLP-12] A Video-grounded Dialogue Dataset and Metric for Event-driven Activities AAAI 2025

Quick Read: This paper targets video-grounded dialogue over event-driven activities, which demands advanced contextual understanding to generate accurate responses. The key contributions are the VDAct dataset and the VDEval metric: VDAct contains longer, more complex video sequences with diverse activity scenarios, while VDEval evaluates responses by integrating dialogue session history with video content summaries extracted from knowledge graphs, correlating significantly better with human judgments.

Link: https://arxiv.org/abs/2501.18324
Authors: Wiradee Imrattanatrai, Masaki Asada, Kimihiro Hasegawa, Zhi-Qi Cheng, Ken Fukuda, Teruko Mitamura
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Comments: Accepted at AAAI 2025

Abstract:This paper presents VDAct, a dataset for a Video-grounded Dialogue on Event-driven Activities, alongside VDEval, a session-based context evaluation metric specially designed for the task. Unlike existing datasets, VDAct includes longer and more complex video sequences that depict a variety of event-driven activities that require advanced contextual understanding for accurate response generation. The dataset comprises 3,000 dialogues with over 30,000 question-and-answer pairs, derived from 1,000 videos with diverse activity scenarios. VDAct displays a notably challenging characteristic due to its broad spectrum of activity scenarios and wide range of question types. Empirical studies on state-of-the-art vision foundation models highlight their limitations in addressing certain question types on our dataset. Furthermore, VDEval, which integrates dialogue session history and video content summaries extracted from our supplementary Knowledge Graphs to evaluate individual responses, demonstrates a significantly higher correlation with human assessments on the VDAct dataset than existing evaluation metrics that rely solely on the context of single dialogue turns.

[NLP-13] Citation Recommendation based on Argumentative Zoning of User Queries

Quick Read: This paper aims to improve citation recommendation by exploiting argumentative information in citing sentences. The key contribution is a multi-task learning model that jointly performs citation recommendation and argumentative zoning classification, along with an annotated corpus built on a new argumentative zoning schema. Experiments show that incorporating argumentative information substantially improves citation recommendation.

Link: https://arxiv.org/abs/2501.18292
Authors: Shutian Ma, Chengzhi Zhang, Heng Zhang, Zheng Gao
Affiliations: Unknown
Categories: Information Retrieval (cs.IR); Computation and Language (cs.CL); Digital Libraries (cs.DL)
Comments:

Abstract:Citation recommendation aims to locate the important papers for scholars to cite. When writing the citing sentences, the authors usually hold different citing intents, which are referred to citation function in citation analysis. Since argumentative zoning is to identify the argumentative and rhetorical structure in scientific literature, we want to use this information to improve the citation recommendation task. In this paper, a multi-task learning model is built for citation recommendation and argumentative zoning classification. We also generated an annotated corpus of the data from PubMed Central based on a new argumentative zoning schema. The experimental results show that, by considering the argumentative information in the citing sentence, citation recommendation model will get better performance.

[NLP-14] Mining for Species, Locations, Habitats, and Ecosystems from Scientific Papers in Invasion Biology: A Large-Scale Exploratory Study with Large Language Models ALT

Quick Read: This paper addresses efficient extraction of key ecological entities from the invasion biology literature, focusing on species names, locations, associated habitats, and ecosystems, information critical to understanding species spread, predicting future invasions, and informing conservation. Traditional text mining struggles with the complexity of ecological terminology and the subtle linguistic patterns in these texts. The key idea is to apply large language models (LLMs) without domain-specific fine-tuning, revealing both the promise and the limitations of such models for this application and laying the groundwork for more advanced automated knowledge extraction tools to help researchers and practitioners understand and manage biological invasions.

Link: https://arxiv.org/abs/2501.18287
Authors: Jennifer D'Souza, Zachary Laubach, Tarek Al Mustafa, Sina Zarrieß, Robert Frühstückl, Phyllis Illari
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Digital Libraries (cs.DL)
Comments: 8 pages, 2 figures, accepted to the NLP4Ecology Workshop 2025 ( this https URL ) co-located with the Joint 25th Nordic Conference on Computational Linguistics and 11th Baltic Conference on Human Language Technologies

Abstract:This paper presents an exploratory study that harnesses the capabilities of large language models (LLMs) to mine key ecological entities from invasion biology literature. Specifically, we focus on extracting species names, their locations, associated habitats, and ecosystems, information that is critical for understanding species spread, predicting future invasions, and informing conservation efforts. Traditional text mining approaches often struggle with the complexity of ecological terminology and the subtle linguistic patterns found in these texts. By applying general-purpose LLMs without domain-specific fine-tuning, we uncover both the promise and limitations of using these models for ecological entity extraction. In doing so, this study lays the groundwork for more advanced, automated knowledge extraction tools that can aid researchers and practitioners in understanding and managing biological invasions.

[NLP-15] Jailbreaking LLM's Safeguard with Universal Magic Words for Text Embedding Models

Quick Read: This paper studies the security of large language model (LLM) safeguards built on text embedding models. The key finding is that the output distribution of text embedding models is significantly biased with a large mean, which motivates an efficient search for universal magic words: appended as suffixes, these words move any text's embedding toward the bias direction, manipulating the similarity of any text pair and misleading existing safeguards. To remove this security risk, the paper also proposes training-free defenses that correct the biased distribution of text embeddings.

Link: https://arxiv.org/abs/2501.18280
Authors: Haoyu Liang, Youran Sun, Yunfeng Cai, Jun Zhu, Bo Zhang
Affiliations: Tsinghua University; Tsinghua University; Beijing Inst. of Math. Sci. and App. (BIMSA)
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)
Comments:

Abstract:The security issue of large language models (LLMs) has gained significant attention recently, with various defense mechanisms developed to prevent harmful outputs, among which safeguards based on text embedding models serve as a fundamental defense. Through testing, we discover that the distribution of text embedding model outputs is significantly biased with a large mean. Inspired by this observation, we propose novel efficient methods to search for universal magic words that can attack text embedding models. The universal magic words as suffixes can move the embedding of any text towards the bias direction, therefore manipulate the similarity of any text pair and mislead safeguards. By appending magic words to user prompts and requiring LLMs to end answers with magic words, attackers can jailbreak the safeguard. To eradicate this security risk, we also propose defense mechanisms against such attacks, which can correct the biased distribution of text embeddings in a train-free manner.
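
A toy version of the search suggested by the paper's observation: estimate the biased mean direction of the embedding space, then pick the suffix that pushes texts furthest along it. The embed() function and vocabulary below are stand-ins for a real embedding model, not the paper's method in detail.

```python
import zlib
import numpy as np

VOCAB = ["aa", "bb", "cc", "dd"]            # toy candidate suffixes

def embed(text):
    """Stand-in for a real text embedding model (deterministic toy)."""
    rng = np.random.default_rng(zlib.crc32(text.encode()))
    return rng.standard_normal(16)

corpus = ["hello world", "model safety", "weather today"]
bias = np.mean([embed(t) for t in corpus], axis=0)   # estimated bias direction
bias /= np.linalg.norm(bias)

def magic_score(word):
    # How far does appending `word` push texts along the bias direction?
    return np.mean([embed(t + " " + word) @ bias for t in corpus])

print(max(VOCAB, key=magic_score))   # best universal suffix in the toy vocab
```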

[NLP-16] Collecting Cost-Effective High-Quality Truthfulness Assessments with LLM Summarized Evidence

Quick Read: This paper studies how to combat mis- and disinformation online effectively. The key finding is that crowd-sourced truthfulness assessments based on condensed, large language model (LLM)-generated summaries of online sources are significantly faster than assessments based on the original web pages while matching their quality; workers complete more assessments in the same time, lowering the cost of acquiring truthfulness labels. Summarized evidence also maximizes inter-annotator agreement and the reliance on and perceived usefulness of evidence, demonstrating its utility without sacrificing assessment quality.

Link: https://arxiv.org/abs/2501.18265
Authors: Kevin Roitero, Dustin Wright, Michael Soprano, Isabelle Augenstein, Stefano Mizzaro
Affiliations: Department of Computer Science, University of Udine, Italy; Department of Computer Science, University of Copenhagen, Denmark
Categories: Information Retrieval (cs.IR); Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)
Comments: 18 pages; 7 figures; 5 tables

Abstract:With the degradation of guardrails against mis- and disinformation online, it is more critical than ever to be able to effectively combat it. In this paper, we explore the efficiency and effectiveness of using crowd-sourced truthfulness assessments based on condensed, large language model (LLM) generated summaries of online sources. We compare the use of generated summaries to the use of original web pages in an A/B testing setting, where we employ a large and diverse pool of crowd-workers to perform the truthfulness assessment. We evaluate the quality of assessments, the efficiency with which assessments are performed, and the behavior and engagement of participants. Our results demonstrate that the Summary modality, which relies on summarized evidence, offers no significant change in assessment accuracy over the Standard modality, while significantly increasing the speed with which assessments are performed. Workers using summarized evidence produce a significantly higher number of assessments in the same time frame, reducing the cost needed to acquire truthfulness assessments. Additionally, the Summary modality maximizes both the inter-annotator agreements as well as the reliance on and perceived usefulness of evidence, demonstrating the utility of summarized evidence without sacrificing the quality of assessments.

[NLP-17] How to Select Datapoints for Efficient Human Evaluation of NLG Models?

Quick Read: This paper addresses the economic inefficiency of evaluating text generation models with randomly sampled test data for human evaluation. The key contribution is a suite of selectors that pick the most informative datapoints while accounting for evaluation cost: selectors based on variance in automatic metric scores, diversity in model outputs, or Item Response Theory outperform random selection. The authors also introduce source-based estimators that predict an item's usefulness for human evaluation from the source text alone, before model outputs are available. With these selectors, only about 50% of the test data is needed to reproduce the evaluation result obtained on the full set.

Link: https://arxiv.org/abs/2501.18251
Authors: Vilém Zouhar, Peng Cui, Mrinmaya Sachan
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments:

Abstract:Human evaluation is the gold-standard for evaluating text generation models. It is also expensive, and to fit budgetary constraints, a random subset of the test data is often chosen in practice. The randomly selected data may not accurately represent test performance, making this approach economically inefficient for model comparison. Thus, in this work, we develop a suite of selectors to get the most informative datapoints for human evaluation while taking the evaluation costs into account. We show that selectors based on variance in automated metric scores, diversity in model outputs, or Item Response Theory outperform random selection. We further develop an approach to distill these selectors to the scenario where the model outputs are not yet available. In particular, we introduce source-based estimators, which predict item usefulness for human evaluation just based on the source texts. We demonstrate the efficacy of our selectors in two common NLG tasks, machine translation and summarization, and show that up to only ~50% of the test data is needed to produce the same evaluation result as the entire data. Our implementations are published in the subset2evaluate package.
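
To illustrate the simplest of these selectors, the sketch below ranks test items by the variance of automatic metric scores across systems; items where systems disagree most are the most informative to send to human judges. The data layout is an assumption; the authors' subset2evaluate package provides the real implementation.

```python
import statistics

def select_by_metric_variance(items, k):
    """Pick the k items whose automatic scores vary most across systems."""
    def score_variance(item):
        return statistics.pvariance(item["metric_scores"].values())
    return sorted(items, key=score_variance, reverse=True)[:k]

items = [
    {"src": "sentence 1", "metric_scores": {"sysA": 0.9, "sysB": 0.1}},  # disagreement
    {"src": "sentence 2", "metric_scores": {"sysA": 0.8, "sysB": 0.8}},  # agreement
]
print([it["src"] for it in select_by_metric_variance(items, k=1)])  # ['sentence 1']
```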

[NLP-18] Contextually Structured Token Dependency Encoding for Large Language Models

Quick Read: This paper addresses the fact that conventional token embeddings in large-scale neural architectures rarely encode structured relationships explicitly. The key contribution is a dependency-aware token encoding mechanism that refines token interactions through dependency-weighted attention computations, ensuring syntactic and semantic dependencies are retained across processing layers. This improves contextual coherence and predictive consistency, enhances lexical variation and dependency retention, and yields better hierarchical consistency on long sequences.

Link: https://arxiv.org/abs/2501.18205
Authors: James Blades, Frederick Somerfield, William Langley, Susan Everingham, Maurice Witherington
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments:

Abstract:Token representation strategies within large-scale neural architectures often rely on contextually refined embeddings, yet conventional approaches seldom encode structured relationships explicitly within token interactions. Self-attention mechanisms effectively capture dynamic contextual dependencies, but their reliance on learned weight distributions limits the preservation of long-range hierarchical structures in generated sequences. Dependency-aware token encoding introduces a structured approach to embedding initialization, ensuring that relational constraints are embedded within token representations rather than inferred solely through attention dynamics. The proposed encoding mechanism refines token interactions through dependency-weighted attention computations, ensuring that syntactic and semantic dependencies are retained across multiple processing layers. Empirical evaluations indicate reductions in perplexity across diverse linguistic benchmarks, suggesting improvements in contextual coherence and predictive consistency in autoregressive text generation. Computational efficiency assessments reveal a moderate increase in memory consumption and training time, attributed to additional matrix computations within the encoding module, yet scalability remains feasible within conventional transformer architectures. Structured encoding enhances lexical variation and dependency retention, reinforcing linguistic coherence without requiring external syntactic annotations or auxiliary training objectives. Statistical comparisons highlight improvements in dependency alignment, particularly in longer sequences where conventional self-attention models exhibit degradation in hierarchical consistency. Sentence length distributions indicate a reduction in abrupt phrase transitions, further supporting the hypothesis that explicit dependency encoding facilitates more structured phrase generation.

[NLP-19] Mixed-Precision Graph Neural Quantization for Low Bit Large Language Models ICASSP 2025

Quick Read: This paper addresses the poor performance of existing post-training quantization (PTQ) strategies at low bit widths. The key contribution is a Mixed-precision Graph Neural PTQ (MG-PTQ) method that uses a graph neural network (GNN) module to capture dependencies among weights and adaptively assign quantization bit-widths, yielding a more accurate assessment of weight importance and a better allocation of quantization strategies.

Link: https://arxiv.org/abs/2501.18154
Authors: Wanlong Liu, Yichen Xiao, Dingyi Zeng, Hongyang Zhao, Wenyu Chen, Malu Zhang
Affiliations: School of Computer Science and Engineering, University of Electronic Science and Technology of China
Categories: Computation and Language (cs.CL)
Comments: ICASSP 2025

Abstract:Post-Training Quantization (PTQ) is pivotal for deploying large language models (LLMs) within resource-limited settings by significantly reducing resource demands. However, existing PTQ strategies underperform at low bit levels < 3 bits due to the significant difference between the quantized and original weights. To enhance the quantization performance at low bit widths, we introduce a Mixed-precision Graph Neural PTQ (MG-PTQ) approach, employing a graph neural network (GNN) module to capture dependencies among weights and adaptively assign quantization bit-widths. Through the information propagation of the GNN module, our method more effectively captures dependencies among target weights, leading to a more accurate assessment of weight importance and optimized allocation of quantization strategies. Extensive experiments on the WikiText2 and C4 datasets demonstrate that our MG-PTQ method outperforms previous state-of-the-art PTQ method GPTQ, setting new benchmarks for quantization performance under low-bit conditions.

[NLP-20] Unraveling the Capabilities of Language Models in News Summarization

Quick Read: This work benchmarks recently introduced language models on news summarization, focusing on smaller models. The key methodological choice is a comprehensive evaluation covering zero-shot and few-shot settings and combining automatic metrics, human evaluation, and LLM-as-a-judge. Notably, adding demonstration examples in the few-shot setting did not improve performance and sometimes degraded summary quality. GPT-3.5-Turbo and GPT-4 dominate, while publicly available models such as Qwen1.5-7B, SOLAR-10.7B-Instruct-v1.0, Meta-Llama-3-8B, and Zephyr-7B-Beta also show significant promise.

Link: https://arxiv.org/abs/2501.18128
Authors: Abdurrahman Odabaşı, Göksel Biricik
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:Given the recent introduction of multiple language models and the ongoing demand for improved Natural Language Processing tasks, particularly summarization, this work provides a comprehensive benchmarking of 20 recent language models, focusing on smaller ones for the news summarization task. In this work, we systematically test the capabilities and effectiveness of these models in summarizing news article texts which are written in different styles and presented in three distinct datasets. Specifically, we focus in this study on zero-shot and few-shot learning settings and we apply a robust evaluation methodology that combines different evaluation concepts including automatic metrics, human evaluation, and LLM-as-a-judge. Interestingly, including demonstration examples in the few-shot learning setting did not enhance models’ performance and, in some cases, even led to worse quality of the generated summaries. This issue arises mainly due to the poor quality of the gold summaries that have been used as reference summaries, which negatively impacts the models’ performance. Furthermore, our study’s results highlight the exceptional performance of GPT-3.5-Turbo and GPT-4, which generally dominate due to their advanced capabilities. However, among the public models evaluated, certain models such as Qwen1.5-7B, SOLAR-10.7B-Instruct-v1.0, Meta-Llama-3-8B and Zephyr-7B-Beta demonstrated promising results. These models showed significant potential, positioning them as competitive alternatives to large models for the task of news summarization.

[NLP-21] Self-supervised Quantized Representation for Seamlessly Integrating Knowledge Graphs with Large Language Models

Quick Read: This paper tackles the natural gap between knowledge graph (KG) structures and natural language, aiming for effective, seamless integration of KGs with large language models (LLMs). The key contribution is a two-stage framework: a self-supervised quantized representation (SSQR) method compresses KG structural and semantic knowledge into discrete codes (tokens) that match the format of language sentences, and KG instruction-following data treats these learned codes as features fed directly to LLMs. SSQR outperforms existing unsupervised quantization methods, and fine-tuned LLaMA2 and LLaMA3.1 excel at KG link prediction and triple classification using only 16 tokens per entity instead of the thousands required by conventional prompting.

Link: https://arxiv.org/abs/2501.18119
Authors: Qika Lin, Tianzhe Zhao, Kai He, Zhen Peng, Fangzhi Xu, Ling Huang, Jingying Ma, Mengling Feng
Affiliations: National University of Singapore; Xi'an Jiaotong University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:Due to the presence of the natural gap between Knowledge Graph (KG) structures and the natural language, the effective integration of holistic structural information of KGs with Large Language Models (LLMs) has emerged as a significant question. To this end, we propose a two-stage framework to learn and apply quantized codes for each entity, aiming for the seamless integration of KGs with LLMs. Firstly, a self-supervised quantized representation (SSQR) method is proposed to compress both KG structural and semantic knowledge into discrete codes (i.e., tokens) that align the format of language sentences. We further design KG instruction-following data by viewing these learned codes as features to directly input to LLMs, thereby achieving seamless integration. The experiment results demonstrate that SSQR outperforms existing unsupervised quantized methods, producing more distinguishable codes. Further, the fine-tuned LLaMA2 and LLaMA3.1 also have superior performance on KG link prediction and triple classification tasks, utilizing only 16 tokens per entity instead of thousands in conventional prompting methods.

[NLP-22] Scaling Inference-Efficient Language Models

Quick Read: This paper addresses the failure of current scaling laws to account for inference costs, as well as the influence of model architecture on inference latency. The key contribution is a modification of the Chinchilla scaling laws to co-optimize parameter count, training tokens, and model architecture, along with a new method for training inference-efficient models based on the revised laws. Backed by extensive empirical studies, the authors release Morph-1B, which improves inference latency by 1.8x while maintaining downstream accuracy.

Link: https://arxiv.org/abs/2501.18107
Authors: Song Bian, Minghao Yan, Shivaram Venkataraman
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 17 pages, 16 figures

Abstract:Scaling laws are powerful tools to predict the performance of large language models. However, current scaling laws fall short of accounting for inference costs. In this work, we first show that model architecture affects inference latency, where models of the same size can have up to 3.5x difference in latency. To tackle this challenge, we modify the Chinchilla scaling laws to co-optimize the model parameter count, the number of training tokens, and the model architecture. Due to the reason that models of similar training loss exhibit gaps in downstream evaluation, we also propose a novel method to train inference-efficient models based on the revised scaling laws. We perform extensive empirical studies to fit and evaluate our inference-aware scaling laws. We vary model parameters from 80M to 1B, training tokens from 1.6B to 30B, and model shapes, training a total of 63 models. Guided by our inference-efficient scaling law and model selection method, we release the Morph-1B model, which improves inference latency by 1.8x while maintaining accuracy on downstream tasks compared to open-source models, pushing the Pareto frontier of accuracy-latency tradeoff.
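
For reference, the Chinchilla-style law the paper starts from models loss as a function of parameter count N and training tokens D; the paper's revision extends this baseline with architecture terms and an inference-cost objective (the exact revised form is given in the paper):

```latex
% Chinchilla scaling law (the baseline the paper modifies):
% N = parameters, D = training tokens; E, A, B, \alpha, \beta are fitted.
L(N, D) = E + \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}}
```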

[NLP-23] Beyond Turn-taking: Introducing Text-based Overlap into Human-LLM Interactions

Quick Read: This paper addresses the unnatural feel of strict turn-taking in traditional text interactions. The key contribution is OverlapBot, a prototype chatbot in which both the AI and the user can initiate overlapping messages, mirroring the overlaps of natural human conversation and fostering faster, more natural exchanges.

Link: https://arxiv.org/abs/2501.18103
Authors: JiWoo Kim, Minsuk Chang, JinYeong Bak
Affiliations: Sungkyunkwan University, Suwon, South Korea; Google DeepMind, Seattle, USA; Sungkyunkwan University, Suwon, South Korea
Categories: Human-Computer Interaction (cs.HC); Computation and Language (cs.CL)
Comments: 16 pages, 9 figures

Abstract:Traditional text-based human-AI interactions often adhere to a strict turn-taking approach. In this research, we propose a novel approach that incorporates overlapping messages, mirroring natural human conversations. Through a formative study, we observed that even in text-based contexts, users instinctively engage in overlapping behaviors like “A: Today I went to-” “B: yeah.” To capitalize on these insights, we developed OverlapBot, a prototype chatbot where both AI and users can initiate overlapping. Our user study revealed that OverlapBot was perceived as more communicative and immersive than traditional turn-taking chatbot, fostering faster and more natural interactions. Our findings contribute to the understanding of design space for overlapping interactions. We also provide recommendations for implementing overlap-capable AI interactions to enhance the fluidity and engagement of text-based conversations.

[NLP-24] Diverse Preference Optimization

Quick Read: This paper addresses how post-training (reinforcement learning, preference optimization, or supervised fine-tuning) sharpens the output distribution and reduces response diversity, which is especially problematic for creative generation tasks that call for varied responses. The key contribution is Diverse Preference Optimization (DivPO), an online optimization method that selects rarer but high-quality samples as chosen examples and more common, lower-quality samples as rejected examples. DivPO generates 45.6% more diverse persona attributes and 74.6% more diverse stories while maintaining win rates similar to standard baselines.

Link: https://arxiv.org/abs/2501.18101
Authors: Jack Lanchantin, Angelica Chen, Shehzaad Dhuliawala, Ping Yu, Jason Weston, Sainbayar Sukhbaatar, Ilia Kulikov
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments:

Abstract:Post-training of language models, either through reinforcement learning, preference optimization or supervised finetuning, tends to sharpen the output probability distribution and reduce the diversity of generated responses. This is particularly a problem for creative generative tasks where varied responses are desired. In this work we introduce Diverse Preference Optimization (DivPO), an online optimization method which learns to generate much more diverse responses than standard pipelines, while maintaining the quality of the generations. In DivPO, preference pairs are selected by first considering a pool of responses, and a measure of diversity among them, and selecting chosen examples as being more rare but high quality, while rejected examples are more common, but low quality. DivPO results in generating 45.6% more diverse persona attributes, and an 74.6% increase in story diversity, while maintaining similar win rates as standard baselines.
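
A minimal sketch of the pair-selection rule: from a pool of responses to one prompt, take a rare but high-quality response as chosen and a common, low-quality one as rejected. The quality threshold and rarity scores are illustrative assumptions.

```python
def divpo_pair(responses, quality_cutoff=0.7):
    """responses: [{"text", "quality", "rarity"}]; higher rarity = less common."""
    good = [r for r in responses if r["quality"] >= quality_cutoff]
    bad = [r for r in responses if r["quality"] < quality_cutoff]
    chosen = max(good, key=lambda r: r["rarity"])    # rare but high quality
    rejected = min(bad, key=lambda r: r["rarity"])   # common and low quality
    return chosen["text"], rejected["text"]

pool = [
    {"text": "a knight-librarian", "quality": 0.9, "rarity": 0.8},
    {"text": "a brave knight", "quality": 0.8, "rarity": 0.2},
    {"text": "a knight", "quality": 0.4, "rarity": 0.1},
]
print(divpo_pair(pool))  # ('a knight-librarian', 'a knight')
```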

[NLP-25] Panacea: Mitigating Harmful Fine-tuning for Large Language Models via Post-fine-tuning Perturbation

Quick Read: This paper addresses the significant security risk that harmful fine-tuning attacks pose to fine-tuning services. Mainstream defenses try to "vaccinate" the model against subsequent harmful fine-tuning, but experiments show they are fragile: after a few fine-tuning steps, the model can still learn harmful knowledge. The key findings are that purely random perturbations can recover a model from harmful behavior, and, building on this, the proposed Panacea optimizes an adaptive perturbation applied to the model after fine-tuning, preserving safety alignment without hurting downstream fine-tuning performance.

Link: https://arxiv.org/abs/2501.18100
Authors: Yibo Wang, Tiansheng Huang, Li Shen, Huanjin Yao, Haotian Luo, Rui Liu, Naiqiang Tan, Jiaxing Huang, Dacheng Tao
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:Harmful fine-tuning attack introduces significant security risks to the fine-tuning services. Mainstream defenses aim to vaccinate the model such that the later harmful fine-tuning attack is less effective. However, our evaluation results show that such defenses are fragile – with a few fine-tuning steps, the model still can learn the harmful knowledge. To this end, we do further experiment and find that an embarrassingly simple solution – adding purely random perturbations to the fine-tuned model, can recover the model from harmful behavior, though it leads to a degradation in the model’s fine-tuning performance. To address the degradation of fine-tuning performance, we further propose Panacea, which optimizes an adaptive perturbation that will be applied to the model after fine-tuning. Panacea maintains model’s safety alignment performance without compromising downstream fine-tuning performance. Comprehensive experiments are conducted on different harmful ratios, fine-tuning tasks and mainstream LLMs, where the average harmful scores are reduced by up-to 21.5%, while maintaining fine-tuning performance. As a by-product, we analyze the optimized perturbation and show that different layers in various LLMs have distinct safety coefficients. Source code available at this https URL
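
The paper's starting observation, that adding purely random perturbations to the fine-tuned weights can wash out harmful updates, is easy to sketch; Panacea itself replaces this uniform noise with an optimized, adaptive perturbation. The noise scale below is an arbitrary illustrative value.

```python
import torch

def perturb_state_dict(state_dict, sigma=0.01):
    """Add isotropic Gaussian noise to every floating-point weight."""
    return {
        name: (t + sigma * torch.randn_like(t)) if t.is_floating_point() else t
        for name, t in state_dict.items()
    }

# Usage with any torch model after (possibly harmful) fine-tuning:
# model.load_state_dict(perturb_state_dict(model.state_dict()))
```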

[NLP-26] Learning to Plan & Reason for Evaluation with Thinking-LLM-as-a-Judge

Quick Read: This paper addresses the challenges facing LLM-as-a-Judge models that generate chain-of-thought (CoT) sequences, in particular the absence of human-annotated CoTs for evaluation. The key contribution is EvalPlanner, a preference optimization algorithm for a Thinking-LLM-as-a-Judge that first generates an unconstrained evaluation plan, then executes it, and finally renders a verdict. Through a self-training loop, EvalPlanner iteratively optimizes synthetically constructed evaluation plans and executions, achieving new state-of-the-art performance on RewardBench despite training on fewer, synthetically generated preference pairs.

Link: https://arxiv.org/abs/2501.18099
Authors: Swarnadeep Saha, Xian Li, Marjan Ghazvininejad, Jason Weston, Tianlu Wang
Affiliations: FAIR at Meta
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:

Abstract:LLM-as-a-Judge models generate chain-of-thought (CoT) sequences intended to capture the step-bystep reasoning process that underlies the final evaluation of a response. However, due to the lack of human annotated CoTs for evaluation, the required components and structure of effective reasoning traces remain understudied. Consequently, previous approaches often (1) constrain reasoning traces to hand-designed components, such as a list of criteria, reference answers, or verification questions and (2) structure them such that planning is intertwined with the reasoning for evaluation. In this work, we propose EvalPlanner, a preference optimization algorithm for Thinking-LLM-as-a-Judge that first generates an unconstrained evaluation plan, followed by its execution, and then the final judgment. In a self-training loop, EvalPlanner iteratively optimizes over synthetically constructed evaluation plans and executions, leading to better final verdicts. Our method achieves a new state-of-the-art performance for generative reward models on RewardBench (with a score of 93.9), despite being trained on fewer amount of, and synthetically generated, preference pairs. Additional experiments on other benchmarks like RM-Bench, JudgeBench, and FollowBenchEval further highlight the utility of both planning and reasoning for building robust LLM-as-a-Judge reasoning models.

[NLP-27] LLMs can see and hear without any training KR

Quick Read: This work endows large language models (LLMs) with multimodal capabilities without any training. The key idea, the Multimodal Iterative LLM Solver (MILS), iteratively generates candidate outputs and scores them, steering the model step by step toward a solution. The approach sets a new state of the art for zero-shot image, video, and audio captioning, extends to media generation applications such as text-to-image generation and style transfer, and can invert multimodal embeddings back into text, enabling cross-modal arithmetic.

Link: https://arxiv.org/abs/2501.18096
Authors: Kumar Ashutosh, Yossi Gandelsman, Xinlei Chen, Ishan Misra, Rohit Girdhar
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: Code: this https URL

Abstract:We present MILS: Multimodal Iterative LLM Solver, a surprisingly simple, training-free approach, to imbue multimodal capabilities into your favorite LLM. Leveraging their innate ability to perform multi-step reasoning, MILS prompts the LLM to generate candidate outputs, each of which are scored and fed back iteratively, eventually generating a solution to the task. This enables various applications that typically require training specialized models on task-specific data. In particular, we establish a new state-of-the-art on emergent zero-shot image, video and audio captioning. MILS seamlessly applies to media generation as well, discovering prompt rewrites to improve text-to-image generation, and even edit prompts for style transfer! Finally, being a gradient-free optimization approach, MILS can invert multimodal embeddings into text, enabling applications like cross-modal arithmetic.

[NLP-28] FinanceQA: A Benchmark for Evaluating Financial Analysis Capabilities of Large Language Models

Quick Read: This paper addresses the shortfall of large language models (LLMs) on complex numerical financial analysis tasks that mirror real-world investment work: current models fail roughly 60% of tasks that mimic on-the-job analyses at hedge funds, private equity firms, investment banks, and other financial institutions. The key takeaway is that higher-quality training data is needed to support such tasks, which the authors explore using OpenAI's fine-tuning API. The FinanceQA test suite is publicly released to support research in this area.

Link: https://arxiv.org/abs/2501.18062
Authors: Spencer Mateega, Carlos Georgescu, Danny Tang
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments: 10 pages, 7 figures

Abstract:FinanceQA is a testing suite that evaluates LLMs’ performance on complex numerical financial analysis tasks that mirror real-world investment work. Despite recent advances, current LLMs fail to meet the strict accuracy requirements of financial institutions, with models failing approximately 60% of realistic tasks that mimic on-the-job analyses at hedge funds, private equity firms, investment banks, and other financial institutions. The primary challenges include hand-spreading metrics, adhering to standard accounting and corporate valuation conventions, and performing analysis under incomplete information - particularly in multi-step tasks requiring assumption generation. This performance gap highlights the disconnect between existing LLM capabilities and the demands of professional financial analysis that are inadequately tested by current testing architectures. Results show that higher-quality training data is needed to support such tasks, which we experiment with using OpenAI’s fine-tuning API. FinanceQA is publicly released at this https URL.

[NLP-29] From tools to thieves: Measuring and understanding public perceptions of AI through crowdsourced metaphors

Quick Read: This paper investigates public perceptions of increasingly prevalent artificial intelligence (AI) technologies, collecting over 12,000 responses from a nationally representative U.S. sample. The key methodology combines quantitative clustering with qualitative coding to identify 20 dominant metaphors shaping public understanding of AI, plus a scalable framework that integrates language-model (LM)-based techniques to quantify key dimensions of perception: anthropomorphism (attribution of human-like qualities), warmth, and competence. Americans generally view AI as warm and competent, and over the past year perceptions of AI's human-likeness and warmth have risen significantly. These metaphors and perception dimensions strongly predict trust in and willingness to adopt AI, providing a basis for understanding and forecasting attitudes across demographic groups.

Link: https://arxiv.org/abs/2501.18045
Authors: Myra Cheng, Angela Y. Lee, Kristina Rapuano, Kate Niederhoffer, Alex Liebscher, Jeffrey Hancock
Affiliations: Stanford University; BetterUp
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)
Comments:

Abstract:How has the public responded to the increasing prevalence of artificial intelligence (AI)-based technologies? We investigate public perceptions of AI by collecting over 12,000 responses over 12 months from a nationally representative U.S. sample. Participants provided open-ended metaphors reflecting their mental models of AI, a methodology that overcomes the limitations of traditional self-reported measures. Using a mixed-methods approach combining quantitative clustering and qualitative coding, we identify 20 dominant metaphors shaping public understanding of AI. To analyze these metaphors systematically, we present a scalable framework integrating language modeling (LM)-based techniques to measure key dimensions of public perception: anthropomorphism (attribution of human-like qualities), warmth, and competence. We find that Americans generally view AI as warm and competent, and that over the past year, perceptions of AI’s human-likeness and warmth have significantly increased (+34%, r = 0.80, p < 0.01; +41%, r = 0.62, p < 0.05). Furthermore, these implicit perceptions, along with the identified dominant metaphors, strongly predict trust in and willingness to adopt AI (r^2 = 0.21, 0.18, p < 0.001). We further explore how differences in metaphors and implicit perceptions–such as the higher propensity of women, older individuals, and people of color to anthropomorphize AI–shed light on demographic disparities in trust and adoption. In addition to our dataset and framework for tracking evolving public attitudes, we provide actionable insights on using metaphors for inclusive and responsible AI development.

[NLP-30] InnerThoughts: Disentangling Representations and Predictions in Large Language Models AISTATS 2025

Quick Read: This paper addresses the underuse of information in large language models (LLMs) for multiple-choice question answering, where only the final layer's hidden state is used for prediction. The key contribution is a small, separate neural predictor module that takes the hidden states from all layers at the last token position as input and outputs the prediction, effectively disentangling the LLM's representational abilities from its predictive abilities. This yields considerable gains on several hard benchmarks at a fraction of the computational cost of supervised fine-tuning.

Link: https://arxiv.org/abs/2501.17994
Authors: Didier Chételat, Joseph Cotnareanu, Rylee Thompson, Yingxue Zhang, Mark Coates
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: Accepted at AISTATS 2025

Abstract:Large language models (LLMs) contain substantial factual knowledge which is commonly elicited by multiple-choice question-answering prompts. Internally, such models process the prompt through multiple transformer layers, building varying representations of the problem within its hidden states. Ultimately, however, only the hidden state corresponding to the final layer and token position are used to predict the answer label. In this work, we propose instead to learn a small separate neural network predictor module on a collection of training questions, that take the hidden states from all the layers at the last temporal position as input and outputs predictions. In effect, such a framework disentangles the representational abilities of LLMs from their predictive abilities. On a collection of hard benchmarks, our method achieves considerable improvements in performance, sometimes comparable to supervised fine-tuning procedures, but at a fraction of the computational cost.

[NLP-31] DReSS: Data-driven Regularized Structured Streamlining for Large Language Models

Quick read: This paper targets the high computational and memory costs that come with the growing scale of large language models (LLMs). Its key contribution is a novel paradigm: first regularize, then prune, and finally finetune. Based on this paradigm, the authors introduce DReSS, a simple and effective data-driven regularized structured streamlining method. By using a small amount of data to transfer important information to the remaining parts of the model in advance, DReSS markedly reduces the information loss caused by parameter removal and thereby preserves language modeling capability. Experiments show that DReSS significantly outperforms existing pruning methods even under extreme pruning ratios, while substantially reducing latency and increasing throughput.

Link: https://arxiv.org/abs/2501.17905
Authors: Mingkuan Feng, Jinyang Wu, Shuai Zhang, Pengpeng Shao, Ruihan Jin, Zhengqi Wen, Jianhua Tao, Feihu Che
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:

Abstract:Large language models (LLMs) have achieved significant progress across various domains, but their increasing scale results in high computational and memory costs. Recent studies have revealed that LLMs exhibit sparsity, providing the potential to reduce model size through pruning techniques. However, existing pruning methods typically follow a prune-then-finetune paradigm. Since the pruned components still contain valuable information, their direct removal often leads to irreversible performance degradation, imposing a substantial computational burden to recover performance during finetuning. In this paper, we propose a novel paradigm that first applies regularization, then prunes, and finally finetunes. Based on this paradigm, we introduce DReSS, a simple and effective Data-driven Regularized Structured Streamlining method for LLMs. By leveraging a small amount of data to regularize the components to be pruned, DReSS explicitly transfers the important information to the remaining parts of the model in advance. Compared to direct pruning, this can reduce the information loss caused by parameter removal, thereby enhancing its language modeling capabilities. Experimental results demonstrate that DReSS significantly outperforms existing pruning methods even under extreme pruning ratios, significantly reducing latency and increasing throughput.
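
The regularize-then-prune-then-finetune paradigm can be sketched on a single linear layer: an L2 penalty first drives the channels slated for removal toward zero, so the optimizer shifts useful information into the surviving channels before they are physically dropped. The penalty strength, channel choice, and toy task below are illustrative assumptions, not DReSS's exact recipe.

```python
# Sketch: push a chosen set of output channels toward zero with an L2
# penalty (regularize), remove them (prune), then continue training (finetune).
import torch
import torch.nn as nn

layer = nn.Linear(64, 64)
prune_idx = torch.arange(32, 64)          # assumed channels to remove
opt = torch.optim.SGD(layer.parameters(), lr=1e-2)

# 1) Regularize: task loss + a penalty that drives the doomed channels to
#    zero, letting the optimizer move useful information into the survivors.
for _ in range(100):
    x = torch.randn(16, 64)
    task_loss = (layer(x) - x).pow(2).mean()           # toy reconstruction task
    reg = layer.weight[prune_idx].pow(2).sum() * 1e-2  # structured L2 penalty
    opt.zero_grad(); (task_loss + reg).backward(); opt.step()

# 2) Prune: physically drop the regularized rows (structured pruning).
keep = torch.arange(0, 32)
pruned = nn.Linear(64, 32)
with torch.no_grad():
    pruned.weight.copy_(layer.weight[keep])
    pruned.bias.copy_(layer.bias[keep])

# 3) Finetune the smaller layer on the task as usual (omitted here).
print(pruned.weight.shape)  # torch.Size([32, 64])
```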

[NLP-32] Docling: An Efficient Open-Source Toolkit for AI-driven Document Conversion AAAI25

Quick read: This paper tackles document conversion and parsing, specifically turning several popular document formats into a unified, richly structured representation. The key to the solution is a set of state-of-the-art specialized AI models, including DocLayNet for layout analysis and TableFormer for table structure recognition; Docling's efficient architecture and low resource footprint let it run efficiently on commodity hardware.

Link: https://arxiv.org/abs/2501.17887
Authors: Nikolaos Livathinos, Christoph Auer, Maksym Lysak, Ahmed Nassar, Michele Dolfi, Panos Vagenas, Cesar Berrospi Ramis, Matteo Omenetti, Kasper Dinkla, Yusik Kim, Shubham Gupta, Rafael Teixeira de Lima, Valery Weber, Lucas Morin, Ingmar Meijer, Viktor Kuropiatnyk, Peter W. J. Staar
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Software Engineering (cs.SE)
Comments: Accepted to AAAI 25: Workshop on Open-Source AI for Mainstream Use

Abstract:We introduce Docling, an easy-to-use, self-contained, MIT-licensed, open-source toolkit for document conversion, that can parse several types of popular document formats into a unified, richly structured representation. It is powered by state-of-the-art specialized AI models for layout analysis (DocLayNet) and table structure recognition (TableFormer), and runs efficiently on commodity hardware in a small resource budget. Docling is released as a Python package and can be used as a Python API or as a CLI tool. Docling’s modular architecture and efficient document representation make it easy to implement extensions, new features, models, and customizations. Docling has been already integrated in other popular open-source frameworks (e.g., LangChain, LlamaIndex, spaCy), making it a natural fit for the processing of documents and the development of high-end applications. The open-source community has fully engaged in using, promoting, and developing for Docling, which gathered 10k stars on GitHub in less than a month and was reported as the No. 1 trending repository in GitHub worldwide in November 2024.
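
For readers who want to try the toolkit, a minimal usage sketch follows. It is based on the project's public documentation at the time of writing; the exact class and method names may differ between releases, and the input path is a placeholder.

```python
# Sketch: convert a PDF into Docling's unified structured representation
# and export it as Markdown; consult the project docs for the current API.
from docling.document_converter import DocumentConverter

converter = DocumentConverter()
result = converter.convert("paper.pdf")       # placeholder input file
print(result.document.export_to_markdown())   # unified structured output
```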

[NLP-33] Provence: efficient and robust context pruning for retrieval-augmented generation ICLR2025

Quick read: This paper addresses the computational overhead that retrieval-augmented generation (RAG) imposes on large language models (LLMs) through long contexts and the propagation of irrelevant retrieved information. The key solution is Provence, which formulates context pruning as a sequence labeling task, unifies context pruning with reranking, and trains on diverse data, yielding an efficient and robust context pruner. The method dynamically detects how much pruning a given context needs and works out of the box across domains, cutting computation substantially while preserving performance.

Link: https://arxiv.org/abs/2501.16214
Authors: Nadezhda Chirkova, Thibault Formal, Vassilina Nikoulina, Stéphane Clinchant
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Information Retrieval (cs.IR)
Comments: Accepted to ICLR 2025

Abstract:Retrieval-augmented generation improves various aspects of large language models (LLMs) generation, but suffers from computational overhead caused by long contexts as well as the propagation of irrelevant retrieved information into generated responses. Context pruning deals with both aspects, by removing irrelevant parts of retrieved contexts before LLM generation. Existing context pruning approaches are however limited, and do not provide a universal model that would be both efficient and robust in a wide range of scenarios, e.g., when contexts contain a variable amount of relevant information or vary in length, or when evaluated on various domains. In this work, we close this gap and introduce Provence (Pruning and Reranking Of retrieVEd relevaNt ContExts), an efficient and robust context pruner for Question Answering, which dynamically detects the needed amount of pruning for a given context and can be used out-of-the-box for various domains. The three key ingredients of Provence are formulating the context pruning task as sequence labeling, unifying context pruning capabilities with context reranking, and training on diverse data. Our experimental results show that Provence enables context pruning with negligible to no drop in performance, in various domains and settings, at almost no cost in a standard RAG pipeline. We also conduct a deeper analysis alongside various ablations to provide insights into training context pruners for future work.
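
The sequence-labeling view of context pruning can be sketched as scoring each sentence of the retrieved context against the question and dropping low scorers before the LLM call. The toy lexical-overlap scorer below merely stands in for the trained labeling model; the threshold is an assumption.

```python
# Sketch: prune retrieved context by scoring each sentence against the
# question and keeping only sentences above a threshold. A trained
# sequence-labeling model would replace the toy overlap scorer.
import re

def score(question: str, sentence: str) -> float:
    q = set(re.findall(r"\w+", question.lower()))
    s = set(re.findall(r"\w+", sentence.lower()))
    return len(q & s) / max(len(q), 1)  # crude lexical-overlap relevance

def prune_context(question: str, context: str, threshold: float = 0.2) -> str:
    sentences = re.split(r"(?<=[.!?])\s+", context)
    kept = [s for s in sentences if score(question, s) >= threshold]
    return " ".join(kept)

ctx = ("Provence was released in 2025. The weather is nice today. "
       "It formulates context pruning as sequence labeling.")
print(prune_context("What is context pruning in Provence?", ctx))
```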

[NLP-34] Statistical multi-metric evaluation and visualization of LLM system predictive performance

Quick read: This paper addresses the evaluation of large language model (LLM)-based generative or discriminative systems in complex, multi-dimensional settings. Its key contribution is an automated framework that performs the correct statistical tests, properly aggregates statistical results across metrics and datasets (a nontrivial task), and visualizes the outcomes. The framework is demonstrated on the multilingual code generation benchmark CrossCodeEval with several state-of-the-art LLMs.

Link: https://arxiv.org/abs/2501.18243
Authors: Samuel Ackerman, Eitan Farchi, Orna Raz, Assaf Toledo
Affiliations: Unknown
Subjects: Applications (stat.AP); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments:

Abstract:The evaluation of generative or discriminative large language model (LLM)-based systems is often a complex multi-dimensional problem. Typically, a set of system configuration alternatives are evaluated on one or more benchmark datasets, each with one or more evaluation metrics, which may differ between datasets. We often want to evaluate – with a statistical measure of significance – whether systems perform differently either on a given dataset according to a single metric, on aggregate across metrics on a dataset, or across datasets. Such evaluations can be done to support decision-making, such as deciding whether a particular system component change (e.g., choice of LLM or hyperparameter values) significantly improves performance over the current system configuration, or, more generally, whether a fixed set of system configurations (e.g., a leaderboard list) have significantly different performances according to metrics of interest. We present a framework implementation that automatically performs the correct statistical tests, properly aggregates the statistical results across metrics and datasets (a nontrivial task), and can visualize the results. The framework is demonstrated on the multi-lingual code generation benchmark CrossCodeEval, for several state-of-the-art LLMs.
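
The kind of analysis such a framework automates can be sketched as paired significance tests per metric followed by a multiple-comparison correction across metrics. The data below are synthetic, and the choice of Wilcoxon plus Holm is an illustrative assumption, not necessarily the framework's exact procedure.

```python
# Sketch: compare two system configurations on several metrics with paired
# Wilcoxon signed-rank tests, then Holm-correct across metrics.
import numpy as np
from scipy.stats import wilcoxon
from statsmodels.stats.multitest import multipletests

rng = np.random.default_rng(0)
metrics = ["exact_match", "edit_sim", "pass_rate"]  # hypothetical metric names
pvals = []
for m in metrics:
    a = rng.normal(0.60, 0.05, 30)        # per-example scores, system A
    b = a + rng.normal(0.02, 0.05, 30)    # system B, slightly better
    stat, p = wilcoxon(a, b)
    pvals.append(p)

reject, p_adj, _, _ = multipletests(pvals, alpha=0.05, method="holm")
for m, p, r in zip(metrics, p_adj, reject):
    print(f"{m}: adjusted p={p:.3f}, significant={bool(r)}")
```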

Computer Vision

[CV-0] ROSA: Reconstructing Object Shape and Appearance Textures by Adaptive Detail Transfer

Quick read: This paper addresses the ill-posed problem of reconstructing an object's shape and a spatially-varying bidirectional reflectance distribution function (SVBRDF) texture as a mesh from a limited set of views. Existing state-of-the-art approaches either reconstruct appearance directly on the geometry or additionally use texture normals as appearance features; they require detailed but inefficiently large meshes that must be simplified in post-processing, or they inherit well-known limitations of normal maps such as missing shadows and incorrect silhouettes. A fixed, typically low texture resolution further causes loss of surface detail. ROSA addresses these problems with an inverse rendering method that directly optimizes mesh geometry with spatially adaptive mesh resolution based solely on image data. The key is to refine the mesh and locally condition surface smoothness using the estimated normal texture and mesh curvature, while a novel tile-based approach reconstructs high-resolution appearance detail with a single pre-trained decoder network, unconstrained by the network's output resolution.

Link: https://arxiv.org/abs/2501.18595
Authors: Julian Kaltheuner, Patrick Stotko, Reinhard Klein
Affiliations: University of Bonn
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Reconstructing an object’s shape and appearance in terms of a mesh textured by a spatially-varying bidirectional reflectance distribution function (SVBRDF) from a limited set of images captured under collocated light is an ill-posed problem. Previous state-of-the-art approaches either aim to reconstruct the appearance directly on the geometry or additionally use texture normals as part of the appearance features. However, this requires detailed but inefficiently large meshes, that would have to be simplified in a post-processing step, or suffers from well-known limitations of normal maps such as missing shadows or incorrect silhouettes. Another limiting factor is the fixed and typically low resolution of the texture estimation resulting in loss of important surface details. To overcome these problems, we present ROSA, an inverse rendering method that directly optimizes mesh geometry with spatially adaptive mesh resolution solely based on the image data. In particular, we refine the mesh and locally condition the surface smoothness based on the estimated normal texture and mesh curvature. In addition, we enable the reconstruction of fine appearance details in high-resolution textures through a pioneering tile-based method that operates on a single pre-trained decoder network but is not limited by the network output resolution.

[CV-1] Foundational Models for 3D Point Clouds: A Survey and Outlook

Quick read: This survey addresses 3D point cloud representation for geometric fidelity and the understanding of complex 3D environments. The key is to examine how multimodal information, including existing 2D knowledge and large pre-trained language models (LLMs), can be combined to build foundation models (FMs), overcoming the scarcity of labeled data and the high computational overheads of the 3D domain. The paper presents a comprehensive review of strategies for enhancing 3D visual understanding with FMs and offers insights into future research directions.

Link: https://arxiv.org/abs/2501.18594
Authors: Vishal Thengane, Xiatian Zhu, Salim Bouzerdoum, Son Lam Phung, Yunpeng Li
Affiliations: University of Surrey; University of Wollongong; King's College London
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Initial submission

Abstract:The 3D point cloud representation plays a crucial role in preserving the geometric fidelity of the physical world, enabling more accurate complex 3D environments. While humans naturally comprehend the intricate relationships between objects and variations through a multisensory system, artificial intelligence (AI) systems have yet to fully replicate this capacity. To bridge this gap, it becomes essential to incorporate multiple modalities. Models that can seamlessly integrate and reason across these modalities are known as foundation models (FMs). The development of FMs for 2D modalities, such as images and text, has seen significant progress, driven by the abundant availability of large-scale datasets. However, the 3D domain has lagged due to the scarcity of labelled data and high computational overheads. In response, recent research has begun to explore the potential of applying FMs to 3D tasks, overcoming these challenges by leveraging existing 2D knowledge. Additionally, language, with its capacity for abstract reasoning and description of the environment, offers a promising avenue for enhancing 3D understanding through large pre-trained language models (LLMs). Despite the rapid development and adoption of FMs for 3D vision tasks in recent years, there remains a gap in comprehensive and in-depth literature reviews. This article aims to address this gap by presenting a comprehensive overview of the state-of-the-art methods that utilize FMs for 3D visual understanding. We start by reviewing various strategies employed in the building of various 3D FMs. Then we categorize and summarize use of different FMs for tasks such as perception tasks. Finally, the article offers insights into future directions for research and development in this field. To help reader, we have curated list of relevant papers on the topic: this https URL.

[CV-2] Diffusion Autoencoders are Scalable Image Tokenizers

Quick read: This paper addresses learning compact visual representations for image generative models. The key is a simple diffusion tokenizer (DiTo) trained with a single objective, the diffusion L2 loss, enabling efficient, high-quality image tokenization and greatly simplifying the complex training recipes required by current state-of-the-art supervised tokenizers.

Link: https://arxiv.org/abs/2501.18593
Authors: Yinbo Chen, Rohit Girdhar, Xiaolong Wang, Sai Saketh Rambhatla, Ishan Misra
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Project page: this https URL

Abstract:Tokenizing images into compact visual representations is a key step in learning efficient and high-quality image generative models. We present a simple diffusion tokenizer (DiTo) that learns compact visual representations for image generation models. Our key insight is that a single learning objective, diffusion L2 loss, can be used for training scalable image tokenizers. Since diffusion is already widely used for image generation, our insight greatly simplifies training such tokenizers. In contrast, current state-of-the-art tokenizers rely on an empirically found combination of heuristics and losses, thus requiring a complex training recipe that relies on non-trivially balancing different losses and pretrained supervised models. We show design decisions, along with theoretical grounding, that enable us to scale DiTo for learning competitive image representations. Our results show that DiTo is a simpler, scalable, and self-supervised alternative to the current state-of-the-art image tokenizer which is supervised. DiTo achieves competitive or better quality than state-of-the-art in image reconstruction and downstream image generation tasks.
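
The single training objective can be sketched in a few lines: the decoder is a denoiser conditioned on the tokenizer latent and trained with a plain diffusion L2 (noise-prediction) loss, so the latent must carry enough information to reconstruct the image. Network sizes and the noising schedule below are toy assumptions, not DiTo's actual architecture.

```python
# Sketch: train an encoder + conditional denoiser with only a diffusion
# L2 loss, so the latent must carry enough information to reconstruct x.
import torch
import torch.nn as nn

enc = nn.Linear(784, 32)                  # toy encoder: image -> compact latent
den = nn.Sequential(nn.Linear(784 + 32 + 1, 256), nn.SiLU(),
                    nn.Linear(256, 784))  # toy denoiser eps(x_t, z, t)
opt = torch.optim.Adam(list(enc.parameters()) + list(den.parameters()), 1e-3)

x = torch.rand(64, 784)                       # stand-in images
for _ in range(10):
    z = enc(x)                                # latent tokens for each image
    t = torch.rand(x.size(0), 1)              # diffusion time in [0, 1]
    eps = torch.randn_like(x)
    x_t = (1 - t) * x + t * eps               # simple linear noising schedule
    eps_hat = den(torch.cat([x_t, z, t], dim=1))
    loss = (eps_hat - eps).pow(2).mean()      # the single diffusion L2 loss
    opt.zero_grad(); loss.backward(); opt.step()
print(f"final loss: {loss.item():.4f}")
```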

[CV-3] Advances in Multimodal Adaptation and Generalization: From Traditional Approaches to Foundation Models

Quick read: This survey addresses multimodal domain adaptation, multimodal test-time adaptation, and domain adaptation and generalization assisted by multimodal foundation models. The key is leveraging large pre-trained multimodal foundation models such as CLIP to enhance performance on these tasks or to adapt the models themselves to downstream tasks. The paper formally defines each problem, comprehensively reviews existing methods, analyzes relevant datasets and applications, and highlights open challenges and potential future research directions.

Link: https://arxiv.org/abs/2501.18592
Authors: Hao Dong, Moru Liu, Kaiyang Zhou, Eleni Chatzi, Juho Kannala, Cyrill Stachniss, Olga Fink
Affiliations: ETH Zürich; Technical University of Munich; Hong Kong Baptist University; Aalto University; University of Bonn; University of Oxford; Lamarr Institute for Machine Learning and Artificial Intelligence; EPFL
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Robotics (cs.RO)
Comments: Project page: this https URL

Abstract:In real-world scenarios, achieving domain adaptation and generalization poses significant challenges, as models must adapt to or generalize across unknown target distributions. Extending these capabilities to unseen multimodal distributions, i.e., multimodal domain adaptation and generalization, is even more challenging due to the distinct characteristics of different modalities. Significant progress has been made over the years, with applications ranging from action recognition to semantic segmentation. Besides, the recent advent of large-scale pre-trained multimodal foundation models, such as CLIP, has inspired works leveraging these models to enhance adaptation and generalization performances or adapting them to downstream tasks. This survey provides the first comprehensive review of recent advances from traditional approaches to foundation models, covering: (1) Multimodal domain adaptation; (2) Multimodal test-time adaptation; (3) Multimodal domain generalization; (4) Domain adaptation and generalization with the help of multimodal foundation models; and (5) Adaptation of multimodal foundation models. For each topic, we formally define the problem and thoroughly review existing methods. Additionally, we analyze relevant datasets and applications, highlighting open challenges and potential future research directions. We maintain an active repository that contains up-to-date literature at this https URL.

[CV-4] DiffusionRenderer: Neural Inverse and Forward Rendering with Video Diffusion Models

Quick read: This paper addresses the difficulty of performing physically-based rendering (PBR) in real scenes, where precise scene representations (explicit 3D geometry, high-quality material properties, and lighting conditions) are hard to obtain. The key is DiffusionRenderer, a neural approach that handles both inverse and forward rendering within a unified framework. Leveraging strong video diffusion model priors, the inverse rendering model accurately estimates G-buffers from real-world videos, providing an interface for image editing tasks and training data for the rendering model; the forward rendering model then generates photorealistic images directly from G-buffers, with no explicit light transport simulation.

Link: https://arxiv.org/abs/2501.18590
Authors: Ruofan Liang, Zan Gojcic, Huan Ling, Jacob Munkberg, Jon Hasselgren, Zhi-Hao Lin, Jun Gao, Alexander Keller, Nandita Vijaykumar, Sanja Fidler, Zian Wang
Affiliations: NVIDIA; University of Toronto; Vector Institute; University of Illinois Urbana-Champaign
Subjects: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
Comments: Project page: this http URL

Abstract:Understanding and modeling lighting effects are fundamental tasks in computer vision and graphics. Classic physically-based rendering (PBR) accurately simulates the light transport, but relies on precise scene representations–explicit 3D geometry, high-quality material properties, and lighting conditions–that are often impractical to obtain in real-world scenarios. Therefore, we introduce DiffusionRenderer, a neural approach that addresses the dual problem of inverse and forward rendering within a holistic framework. Leveraging powerful video diffusion model priors, the inverse rendering model accurately estimates G-buffers from real-world videos, providing an interface for image editing tasks, and training data for the rendering model. Conversely, our rendering model generates photorealistic images from G-buffers without explicit light transport simulation. Experiments demonstrate that DiffusionRenderer effectively approximates inverse and forwards rendering, consistently outperforming the state-of-the-art. Our model enables practical applications from a single video input–including relighting, material editing, and realistic object insertion.

[CV-5] Inkspire: Supporting Design Exploration with Generative AI through Analogical Sketching

Quick read: This paper tackles two problems: Text-to-Image (T2I) AI models struggle to interpret abstract language, and current tools tend to induce design fixation rather than iterative, exploratory work. The key is Inkspire, a sketch-driven tool that supports designers in prototyping product design concepts with analogical inspirations and a complete sketch-to-design-to-sketch feedback loop.

Link: https://arxiv.org/abs/2501.18588
Authors: David Chuan-En Lin, Hyeonsu B. Kang, Nikolas Martelaro, Aniket Kittur, Yan-Ying Chen, Matthew K. Hong
Affiliations: Carnegie Mellon University; Toyota Research Institute
Subjects: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
Comments: Accepted to CHI 2025

Abstract:With recent advancements in the capabilities of Text-to-Image (T2I) AI models, product designers have begun experimenting with them in their work. However, T2I models struggle to interpret abstract language and the current user experience of T2I tools can induce design fixation rather than a more iterative, exploratory process. To address these challenges, we developed Inkspire, a sketch-driven tool that supports designers in prototyping product design concepts with analogical inspirations and a complete sketch-to-design-to-sketch feedback loop. To inform the design of Inkspire, we conducted an exchange session with designers and distilled design goals for improving T2I interactions. In a within-subjects study comparing Inkspire to ControlNet, we found that Inkspire supported designers with more inspiration and exploration of design ideas, and improved aspects of the co-creative process by allowing designers to effectively grasp the current state of the AI to guide it towards novel design intentions.

[CV-6] UDC-VIT: A Real-World Video Dataset for Under-Display Cameras

【速读】:该论文旨在解决在屏下摄像头(Under Display Camera, UDC)视频中由于显示面板引起的图像退化问题,包括低透射率、模糊、噪声和耀斑等。解决方案的关键在于提出一个名为UDC-VIT的真实世界屏下摄像头视频数据集,并开发了一个同时获取非退化和UDC退化视频的视频捕捉系统。通过离散傅里叶变换(DFT)对齐帧与帧之间的视频,从而实现有效的UDC视频恢复。实验结果表明,先前基于合成UDC视频数据集训练的模型无法有效反映实际UDC视频的特性,而UDC-VIT能够更好地推动UDC视频恢复的研究进展。

链接: https://arxiv.org/abs/2501.18545
作者: Kyusu Ahn,JiSoo Kim,Sangik Lee,HyunGyu Lee,Byeonghyun Ko,Chanwoo Park,Jaejin Lee
机构: Dept. of Data Science, Seoul National University (首尔国立大学); Dept. of Computer Science and Engineering, Seoul National University (首尔国立大学); Research Center, Samsung Display Co., Ltd. (三星显示有限公司); Mobile Display Electronics Development Team, Samsung Display Co., Ltd. (三星显示有限公司)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Main body (10 pages, 9 Figures, 3 Tables), References (4 pages), Appendix (15 pages, 11 Figures, 6 Tables)

Abstract:Under Display Camera (UDC) is an advanced imaging system that places a digital camera lens underneath a display panel, effectively concealing the camera. However, the display panel significantly degrades captured images or videos, introducing low transmittance, blur, noise, and flare issues. Tackling such issues is challenging because of the complex degradation of UDCs, including diverse flare patterns. Despite extensive research on UDC images and their restoration models, studies on videos have yet to be significantly explored. While two UDC video datasets exist, they primarily focus on unrealistic or synthetic UDC degradation rather than real-world UDC degradation. In this paper, we propose a real-world UDC video dataset called UDC-VIT. Unlike existing datasets, only UDC-VIT exclusively includes human motions that target facial recognition. We propose a video-capturing system to simultaneously acquire non-degraded and UDC-degraded videos of the same scene. Then, we align a pair of captured videos frame by frame, using discrete Fourier transform (DFT). We compare UDC-VIT with six representative UDC still image datasets and two existing UDC video datasets. Using six deep-learning models, we compare UDC-VIT and an existing synthetic UDC video dataset. The results indicate the ineffectiveness of models trained on earlier synthetic UDC video datasets, as they do not reflect the actual characteristics of UDC-degraded videos. We also demonstrate the importance of effective UDC restoration by evaluating face recognition accuracy concerning PSNR, SSIM, and LPIPS scores. UDC-VIT enables further exploration in the UDC video restoration and offers better insights into the challenge. UDC-VIT is available at our project site.
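
Frame-by-frame DFT alignment can be illustrated with classic phase correlation: the peak of the inverse FFT of the normalized cross-power spectrum gives the integer shift between two frames. The NumPy sketch below assumes pure translation on synthetic frames; the dataset's actual alignment pipeline may differ.

```python
# Sketch: estimate the translation between two frames via phase correlation
# (a standard DFT-based alignment technique).
import numpy as np

def phase_correlation_shift(a: np.ndarray, b: np.ndarray):
    Fa, Fb = np.fft.fft2(a), np.fft.fft2(b)
    cross = Fa * np.conj(Fb)
    cross /= np.abs(cross) + 1e-12           # normalized cross-power spectrum
    corr = np.fft.ifft2(cross).real
    dy, dx = np.unravel_index(np.argmax(corr), corr.shape)
    # Map peak locations past the midpoint to negative shifts.
    dy = dy - a.shape[0] if dy > a.shape[0] // 2 else dy
    dx = dx - a.shape[1] if dx > a.shape[1] // 2 else dx
    return dy, dx

frame = np.random.rand(128, 128)
shifted = np.roll(frame, shift=(5, -3), axis=(0, 1))  # known ground truth
print(phase_correlation_shift(shifted, frame))        # expected: (5, -3)
```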

[CV-7] Learning Priors of Human Motion With Vision Transformers

【速读】:该论文旨在解决在特定场景下人类移动路径、速度以及停留位置的理解问题,这对于城市区域的流动性研究及人类居住环境中机器人导航任务至关重要。论文提出了一种基于视觉变换器(Vision Transformers, ViTs)的神经架构作为解决方案,其关键在于利用ViTs更有效地捕捉空间相关性,相比卷积神经网络(Convolutional Neural Networks, CNNs),从而提升模型性能指标。

链接: https://arxiv.org/abs/2501.18543
作者: Placido Falqueto,Alberto Sanfeliu,Luigi Palopoli,Daniele Fontanelli
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
备注: 2024 IEEE 48th Annual Computers, Software, and Applications Conference (COMPSAC). IEEE, 2024

Abstract:A clear understanding of where humans move in a scenario, their usual paths and speeds, and where they stop, is very important for different applications, such as mobility studies in urban areas or robot navigation tasks within human-populated environments. We propose in this article, a neural architecture based on Vision Transformers (ViTs) to provide this information. This solution can arguably capture spatial correlations more effectively than Convolutional Neural Networks (CNNs). In the paper, we describe the methodology and proposed neural architecture and show the experiments’ results with a standard dataset. We show that the proposed ViT architecture improves the metrics compared to a method based on a CNN.

[CV-8] Mini-ResEmoteNet: Leveraging Knowledge Distillation for Human-Centered Design

【速读】:该论文旨在通过采用知识蒸馏框架来开发轻量级的学生模型Mini-ResEmoteNet,从而扩展ResEmoteNet模型,以更好地适用于可用性测试。解决方案的关键在于通过减少教师模型中每一层特征通道的数量,分别约50%、75%和87.5%,来开发三种学生模型架构(Student Model A, Student Model B, 和 Student Model C),进而实现高性能且资源高效的面部表情识别。

链接: https://arxiv.org/abs/2501.18538
作者: Amna Murtada,Omnia Abdelrhman,Tahani Abdalla Attia
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 5 pages with 4 figures

Abstract:Facial Emotion Recognition has emerged as increasingly pivotal in the domain of User Experience, notably within modern usability testing, as it facilitates a deeper comprehension of user satisfaction and engagement. This study aims to extend the ResEmoteNet model by employing a knowledge distillation framework to develop Mini-ResEmoteNet models - lightweight student models - tailored for usability testing. Experiments were conducted on the FER2013 and RAF-DB datasets to assess the efficacy of three student model architectures: Student Model A, Student Model B, and Student Model C. Their development involves reducing the number of feature channels in each layer of the teacher model by approximately 50%, 75%, and 87.5%. Demonstrating exceptional performance on the FER2013 dataset, Student Model A (E1) achieved a test accuracy of 76.33%, marking a 0.21% absolute improvement over EmoNeXt. Moreover, the results exhibit absolute improvements in terms of inference speed and memory usage during inference compared to the ResEmoteNet model. The findings indicate that the proposed methods surpass other state-of-the-art approaches.
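
The distillation setup can be illustrated with the standard soft-target loss: the student matches the teacher's temperature-softened logits while also fitting the ground-truth labels. The heads, temperature, and mixing weight below are illustrative stand-ins, not the paper's exact configuration.

```python
# Sketch: classic knowledge distillation loss combining a KL term on
# temperature-softened logits with ordinary cross-entropy on the labels.
import torch
import torch.nn as nn
import torch.nn.functional as F

teacher = nn.Linear(512, 7)          # stand-in teacher head
student = nn.Linear(512, 7)          # stand-in student head (real student is a slimmed network)
T, alpha = 4.0, 0.7                  # assumed temperature / mixing weight

x = torch.randn(32, 512)             # stand-in features
labels = torch.randint(0, 7, (32,))  # 7 emotion classes (FER-style)

with torch.no_grad():
    t_logits = teacher(x)
s_logits = student(x)

kd = F.kl_div(F.log_softmax(s_logits / T, dim=1),
              F.softmax(t_logits / T, dim=1),
              reduction="batchmean") * (T * T)  # gradient-scale correction
ce = F.cross_entropy(s_logits, labels)
loss = alpha * kd + (1 - alpha) * ce
print(f"distillation loss: {loss.item():.4f}")
```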

[CV-9] Integrating Spatial and Frequency Information for Under-Display Camera Image Restoration

【速读】:该论文旨在解决屏下摄像头(Under-Display Camera, UDC)图像退化问题,特别是针对噪声、模糊、透光率下降和耀斑等复杂退化现象。论文的关键在于提出了一种新颖的多层级深度神经网络架构SFIM,通过结合局部信息和全局信息来有效恢复UDC引起的图像退化。具体而言,SFIM架构包含空间域块(SDB)、频域块(FDB)和基于注意力机制的多层级整合块(AMIB),分别处理细节纹理如噪声和模糊,以及在大面积区域中的不规则纹理损失如耀斑,并促进跨域交互,从而实现更优的图像恢复效果。

链接: https://arxiv.org/abs/2501.18517
作者: Kyusu Ahn,Jinpyo Kim,Chanwoo Park,JiSoo Kim,Jaejin Lee
机构: Seoul National University(首尔国立大学); Samsung Display Co., Ltd.(三星显示有限公司)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Main body (10 pages, 9 Figures, 5 Tables), References (3 pages), Appendix (8 pages, 6 Figures, 6 Tables)

Abstract:Under-Display Camera (UDC) houses a digital camera lens under a display panel. However, UDC introduces complex degradations such as noise, blur, decrease in transmittance, and flare. Despite the remarkable progress, previous research on UDC mainly focuses on eliminating diffraction in the spatial domain and rarely explores its potential in the frequency domain. It is essential to consider both the spatial and frequency domains effectively. For example, degradations, such as noise and blur, can be addressed by local information (e.g., CNN kernels in the spatial domain). At the same time, tackling flares may require leveraging global information (e.g., the frequency domain). In this paper, we revisit the UDC degradations in the Fourier space and figure out intrinsic frequency priors that imply the presence of the flares. Based on this observation, we propose a novel multi-level DNN architecture called SFIM. It efficiently restores UDC-distorted images by integrating local and global (the collective contribution of all points in the image) information. The architecture exploits CNNs to capture local information and FFT-based models to capture global information. SFIM comprises a spatial domain block (SDB), a Frequency Domain Block (FDB), and an Attention-based Multi-level Integration Block (AMIB). Specifically, SDB focuses more on detailed textures such as noise and blur, FDB emphasizes irregular texture loss in extensive areas such as flare, and AMIB enables effective cross-domain interaction. SFIM’s superior performance over state-of-the-art approaches is demonstrated through rigorous quantitative and qualitative assessments across three UDC benchmarks.

[CV-10] Deconstruct Complexity (DeComplex): A Novel Perspective on Tackling Dense Action Detection

【速读】:该论文旨在解决密集动作检测中的挑战任务,即在未剪辑视频中检测多个共存的动作,而这些动作类别通常具有模糊性和重叠概念。为了解决这一问题,论文提出了一种新颖的方法,即将复杂任务分解为可管理的子任务。关键在于将问题分解为检测密集静态概念和密集动态概念,并分别使用专门化的网络进行处理。此外,论文通过引入一种基于语言引导的对比学习损失函数,在网络优化过程中提供共存概念的显式监督,以克服当前网络无法有效学习动作间关系的局限性。实验结果表明,该方法显著优于现有技术,相对提升了23.4%和2.5%的mAP值。

链接: https://arxiv.org/abs/2501.18509
作者: Faegheh Sardari,Armin Mustafa,Philip J. B. Jackson,Adrian Hilton
机构: Centre for Vision, Speech and Signal Processing (CVSSP) (视觉、语音和信号处理中心); University of Surrey (萨里大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Computer Vision

Abstract:Dense action detection involves detecting multiple co-occurring actions in an untrimmed video while action classes are often ambiguous and represent overlapping concepts. To address this challenge task, we introduce a novel perspective inspired by how humans tackle complex tasks by breaking them into manageable sub-tasks. Instead of relying on a single network to address the entire problem, as in current approaches, we propose decomposing the problem into detecting key concepts present in action classes, specifically, detecting dense static concepts and detecting dense dynamic concepts, and assigning them to distinct, specialized networks. Furthermore, simultaneous actions in a video often exhibit interrelationships, and exploiting these relationships can improve performance. However, we argue that current networks fail to effectively learn these relationships due to their reliance on binary cross-entropy optimization, which treats each class independently. To address this limitation, we propose providing explicit supervision on co-occurring concepts during network optimization through a novel language-guided contrastive learning loss. Our extensive experiments demonstrate the superiority of our approach over state-of-the-art methods, achieving substantial relative improvements of 23.4% and 2.5% mAP on the challenging benchmark datasets, Charades and MultiTHUMOS.

[CV-11] CLEAR: Cue Learning using Evolution for Accurate Recognition Applied to Sustainability Data Extraction

【速读】:该论文旨在解决大型语言模型(LLM)在图像识别任务中因提示不足导致准确性下降的问题,特别是针对专业领域任务时需要依赖领域专家。为了解决这一问题,论文提出了一种名为Cue Learning using Evolution for Accurate Recognition (CLEAR) 的方法。该方法结合了LLM和进化计算技术,通过自动生成新颖的领域特定表示,并利用遗传算法优化合适的文本提示,从而提升图像中特定特征识别的准确性。

链接: https://arxiv.org/abs/2501.18504
作者: Peter J. Bentley,Soo Ling Lim,Fuyuki Ishikawa
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Neural and Evolutionary Computing (cs.NE)
备注: 9 pages plus 2 pages of supplemental material

Abstract:Large Language Model (LLM) image recognition is a powerful tool for extracting data from images, but accuracy depends on providing sufficient cues in the prompt - requiring a domain expert for specialized tasks. We introduce Cue Learning using Evolution for Accurate Recognition (CLEAR), which uses a combination of LLMs and evolutionary computation to generate and optimize cues such that recognition of specialized features in images is improved. It achieves this by auto-generating a novel domain-specific representation and then using it to optimize suitable textual cues with a genetic algorithm. We apply CLEAR to the real-world task of identifying sustainability data from interior and exterior images of buildings. We investigate the effects of using a variable-length representation compared to fixed-length and show how LLM consistency can be improved by refactoring from categorical to real-valued estimates. We show that CLEAR enables higher accuracy compared to expert human recognition and human-authored prompts in every task with error rates improved by up to two orders of magnitude and an ablation study evincing solution concision.
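
The evolutionary loop can be sketched as a small genetic algorithm over sets of textual cues. Everything below, from the cue pool to the fitness function, is a hypothetical illustration; in CLEAR the fitness would come from LLM recognition accuracy on labeled images.

```python
# Sketch: a tiny genetic algorithm that evolves a set of textual cues.
# evaluate() is a stub; a real fitness would score LLM recognition accuracy.
import random

CUE_POOL = ["solar panels on roof", "double glazing", "LED lighting",
            "insulated walls", "heat pump unit", "EV charging point"]

def evaluate(cues):                 # stub fitness: prefer diverse, short sets
    return len(set(cues)) - 0.1 * len(cues) + random.random() * 0.01

def mutate(cues):
    cues = list(cues)
    cues[random.randrange(len(cues))] = random.choice(CUE_POOL)
    return cues

def crossover(a, b):
    cut = random.randrange(1, min(len(a), len(b)))
    return a[:cut] + b[cut:]

random.seed(0)
pop = [random.sample(CUE_POOL, 3) for _ in range(20)]
for gen in range(30):
    pop.sort(key=evaluate, reverse=True)
    parents = pop[:10]                             # truncation selection
    children = [mutate(crossover(random.choice(parents),
                                 random.choice(parents)))
                for _ in range(10)]
    pop = parents + children
print("best cue set:", max(pop, key=evaluate))
```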

[CV-12] HSRMamba: Contextual Spatial-Spectral State Space Model for Single Hyperspectral Super-Resolution

【速读】:该论文旨在解决在高光谱图像超分辨率(HSISR)任务中,Mamba模型因将图像转换为1D序列而忽略局部相邻像素之间的空间-光谱结构关系,以及其性能高度依赖于输入顺序的问题。解决方案的关键在于提出了HSRMamba模型,通过设计局部空间-光谱分区机制来建立三维特征中相邻像素的块状因果关系,从而缓解局部遗忘问题;同时采用基于光谱相似性的全局光谱重排序策略,增强空间和光谱维度中相似像素的因果表示。实验结果表明,HSRMamba在定量质量和视觉效果方面优于现有先进技术。

链接: https://arxiv.org/abs/2501.18500
作者: Shi Chen,Lefei Zhang,Liangpei Zhang
机构: School of Computer Science, Wuhan University (武汉大学); Aerospace Information Research Institute, Henan Academy of Sciences (河南科学院航空航天信息研究所)
类目: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
备注:

Abstract:Mamba has demonstrated exceptional performance in visual tasks due to its powerful global modeling capabilities and linear computational complexity, offering considerable potential in hyperspectral image super-resolution (HSISR). However, in HSISR, Mamba faces challenges as transforming images into 1D sequences neglects the spatial-spectral structural relationships between locally adjacent pixels, and its performance is highly sensitive to input order, which affects the restoration of both spatial and spectral details. In this paper, we propose HSRMamba, a contextual spatial-spectral modeling state space model for HSISR, to address these issues both locally and globally. Specifically, a local spatial-spectral partitioning mechanism is designed to establish patch-wise causal relationships among adjacent pixels in 3D features, mitigating the local forgetting issue. Furthermore, a global spectral reordering strategy based on spectral similarity is employed to enhance the causal representation of similar pixels across both spatial and spectral dimensions. Finally, experimental results demonstrate our HSRMamba outperforms the state-of-the-art methods in quantitative quality and visual results. Code will be available soon.

[CV-13] Runway vs. Taxiway: Challenges in Automated Line Identification and Notation Approaches

【速读】:该论文旨在解决在复杂环境下跑道和滑行道标记的精确检测与可靠标注问题,特别是现有算法如ALINA在识别跑道标记时面临的挑战。解决方案的关键在于引入一个名为AssistNet的分类步骤,通过整合这一分类步骤,检测流程能够更好地适应环境变化并减少误分类,从而提升跑道标记检测的鲁棒性和准确性。

链接: https://arxiv.org/abs/2501.18494
作者: Parth Ganeriwala,Amy Alvarez,Abdullah AlQahtani,Siddhartha Bhattacharyya,Mohammed Abdul Hafeez Khan,Natasha Neogi
机构: Florida Institute of Technology(佛罗里达理工学院); Langley Research Center(NASA兰利研究中心)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: Accepted at SysCon 2025

Abstract:The increasing complexity of autonomous systems has amplified the need for accurate and reliable labeling of runway and taxiway markings to ensure operational safety. Precise detection and labeling of these markings are critical for tasks such as navigation, landing assistance, and ground control automation. Existing labeling algorithms, like the Automated Line Identification and Notation Algorithm (ALINA), have demonstrated success in identifying taxiway markings but encounter significant challenges when applied to runway markings. This limitation arises due to notable differences in line characteristics, environmental context, and interference from elements such as shadows, tire marks, and varying surface conditions. To address these challenges, we modified ALINA by adjusting color thresholds and refining region of interest (ROI) selection to better suit runway-specific contexts. While these modifications yielded limited improvements, the algorithm still struggled with consistent runway identification, often mislabeling elements such as the horizon or non-relevant background features. This highlighted the need for a more robust solution capable of adapting to diverse visual interferences. In this paper, we propose integrating a classification step using a Convolutional Neural Network (CNN) named AssistNet. By incorporating this classification step, the detection pipeline becomes more resilient to environmental variations and misclassifications. This work not only identifies the challenges but also outlines solutions, paving the way for improved automated labeling techniques essential for autonomous aviation systems.

[CV-14] Track-On: Transformer-based Online Point Tracking with Memory ICLR2025

【速读】:该论文致力于解决长期点跟踪(Long-term Point Tracking)的问题,即在视频中的多帧间一致识别点,尽管存在外观变化、光照变化、视角变化及遮挡等问题。论文的关键解决方案是提出了一种名为Track-On的简单Transformer模型,该模型基于在线逐帧处理,无需访问未来帧,通过因果方式处理视频帧。其关键创新在于利用空间记忆模块和上下文记忆模块来捕捉时间信息,并维持长时间跨度内可靠的目标点跟踪。此外,Track-On在推理阶段采用补丁分类和精炼方法以实现高精度的点对应识别与跟踪。

链接: https://arxiv.org/abs/2501.18487
作者: Görkay Aydemir,Xiongyi Cai,Weidi Xie,Fatma Güney
机构: Department of Computer Engineering, Koç University (科奇大学); KUIS AI Center (KUIS人工智能中心); School of Artificial Intelligence, Shanghai Jiao Tong University (上海交通大学人工智能学院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: ICLR 2025

Abstract:In this paper, we consider the problem of long-term point tracking, which requires consistent identification of points across multiple frames in a video, despite changes in appearance, lighting, perspective, and occlusions. We target online tracking on a frame-by-frame basis, making it suitable for real-world, streaming scenarios. Specifically, we introduce Track-On, a simple transformer-based model designed for online long-term point tracking. Unlike prior methods that depend on full temporal modeling, our model processes video frames causally without access to future frames, leveraging two memory modules – spatial memory and context memory – to capture temporal information and maintain reliable point tracking over long time horizons. At inference time, it employs patch classification and refinement to identify correspondences and track points with high accuracy. Through extensive experiments, we demonstrate that Track-On sets a new state-of-the-art for online models and delivers superior or competitive results compared to offline approaches on seven datasets, including the TAP-Vid benchmark. Our method offers a robust and scalable solution for real-time tracking in diverse applications. Project page: this https URL

[CV-15] SimpleDepthPose: Fast and Reliable Human Pose Estimation with RGBD-Images

【速读】:该论文旨在解决多视角多人姿态估计中的准确性与可靠性挑战。关键在于引入了一种新颖的算法,该算法通过整合深度信息来提升多视角多人姿态估计的性能。所提出的算法不仅在未见数据集上表现出良好的泛化能力,而且运行速度快,并且适应于不同的关键点检测。

链接: https://arxiv.org/abs/2501.18478
作者: Daniel Bermuth,Alexander Poeppel,Wolfgang Reif
机构: University of Augsburg, Germany(奥格斯堡大学, 德国); ISSE(未知缩写); University of Augsburg(奥格斯堡大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

Abstract:In the rapidly advancing domain of computer vision, accurately estimating the poses of multiple individuals from various viewpoints remains a significant challenge, especially when reliability is a key requirement. This paper introduces a novel algorithm that excels in multi-view, multi-person pose estimation by incorporating depth information. An extensive evaluation demonstrates that the proposed algorithm not only generalizes well to unseen datasets, and shows a fast runtime performance, but also is adaptable to different keypoints. To support further research, all of the work is publicly accessible.

[CV-16] Tuning Vision Foundation Model via Test-Time Prompt-Guided Training for VFSS Segmentations

【速读】:该论文旨在解决在医学影像领域中,由于获取完全标注的数据集既耗时又昂贵,导致基础模型(vision foundation models)在下游任务中的性能与专门化模型存在差距的问题。论文的关键解决方案在于提出了一种测试时训练范式,通过使用简单的点提示(point prompts)引导测试时半自监督训练任务,使模型能够在无需完整标注的情况下,通过对点提示的多种增强实现性能提升。该方法显著增强了基础模型在下游数据集上的表现,特别是在实例分割任务中,在新构建的Videofluoroscopy数据集(VFSS-5k)上实现了平均Dice系数为0.868的成绩。

链接: https://arxiv.org/abs/2501.18474
作者: Chengxi Zeng,David Smithard,Alberto M Gambaruto,Tilo Burghardt
机构: University of Bristol (布里斯托大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

Abstract:Vision foundation models have demonstrated exceptional generalization capabilities in segmentation tasks for both generic and specialized images. However, a performance gap persists between foundation models and task-specific, specialized models. Fine-tuning foundation models on downstream datasets is often necessary to bridge this gap. Unfortunately, obtaining fully annotated ground truth for downstream datasets is both challenging and costly. To address this limitation, we propose a novel test-time training paradigm that enhances the performance of foundation models on downstream datasets without requiring full annotations. Specifically, our method employs simple point prompts to guide a test-time semi-self-supervised training task. The model learns by resolving the ambiguity of the point prompt through various augmentations. This approach directly tackles challenges in the medical imaging field, where acquiring annotations is both time-intensive and expensive. We conducted extensive experiments on our new Videofluoroscopy dataset (VFSS-5k) for the instance segmentation task, achieving an average Dice coefficient of 0.868 across 12 anatomies with a single model.

[CV-17] A Benchmark and Evaluation for Real-World Out-of-Distribution Detection Using Vision-Language Models

【速读】:该论文旨在解决现有常规基准在检测分布外(Out-of-distribution, OOD)样本方面的性能饱和问题,这使得难以比较近期的OOD检测方法。为了解决这一挑战,论文提出了三个新的OOD检测基准:ImageNet-X用于评估在具有挑战性的语义转换下的性能;ImageNet-FS-X用于全面的OOD检测,评估协变量转换(特征分布转换)下的鲁棒性;Wilds-FS-X则将这些评估扩展到真实世界的数据集,提供一个更全面的测试平台。论文的关键解决方案在于引入这些新的基准,以更深入地理解方法特性,并反映现实世界的条件。

链接: https://arxiv.org/abs/2501.18463
作者: Shiho Noda,Atsuyuki Miyai,Qing Yu,Go Irie,Kiyoharu Aizawa
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

Abstract:Out-of-distribution (OOD) detection is a task that detects OOD samples during inference to ensure the safety of deployed models. However, conventional benchmarks have reached performance saturation, making it difficult to compare recent OOD detection methods. To address this challenge, we introduce three novel OOD detection benchmarks that enable a deeper understanding of method characteristics and reflect real-world conditions. First, we present ImageNet-X, designed to evaluate performance under challenging semantic shifts. Second, we propose ImageNet-FS-X for full-spectrum OOD detection, assessing robustness to covariate shifts (feature distribution shifts). Finally, we propose Wilds-FS-X, which extends these evaluations to real-world datasets, offering a more comprehensive testbed. Our experiments reveal that recent CLIP-based OOD detection methods struggle to varying degrees across the three proposed benchmarks, and none of them consistently outperforms the others. We hope the community goes beyond specific benchmarks and includes more challenging conditions reflecting real-world scenarios. The code is this https URL.

[CV-18] Transfer Learning for Keypoint Detection in Low-Resolution Thermal TUG Test Images

【速读】:该论文旨在解决低分辨率热图像中人体关键点检测的问题。解决方案的关键在于采用了一种基于迁移学习的技术,并引入了Timed Up and Go (TUG)测试在热图像计算机视觉中的应用。研究采用了MobileNetV3-Small编码器和ViTPose解码器,通过一个复合损失函数(平衡潜在表示对齐和热图精度)进行训练。这一方法在AP、AP50和AP75指标上分别达到了0.861、0.942和0.887,显著优于传统的监督学习方法如Mask R-CNN和ViTPose-Base,同时展示了更好的计算效率。

链接: https://arxiv.org/abs/2501.18453
作者: Wei-Lun Chen,Chia-Yeh Hsieh,Yu-Hsiang Kao,Kai-Chun Liu,Sheng-Yu Peng,Yu Tsao
机构: Research Center for Information Technology Innovation, Academic Sinica (学术 Sinica 研究中心), Taiwan; Graduate Institute of Electrical Engineering, National Taiwan University (台湾国立清华大学电机工程研究生院), Taiwan; Bachelor’s Program in Medical Informatics and Innovative Applications, Fu Jen Catholic University (辅仁天主教大学医学信息学和创新应用本科项目), Taiwan; Department of Electrical Engineering, National Taiwan University (台湾国立清华大学电机工程系), Taiwan; College of Information and Computer Sciences, University of Massachusetts Amherst (马萨诸塞大学阿默斯特分校信息与计算机科学学院), MA, 01003, USA; Department of Electrical Engineering, National Taiwan University of Science of Technology (台湾科技大学电机工程系), Taiwan
类目: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
备注: Accepted to AICAS 2025. This is the preprint version

Abstract:This study presents a novel approach to human keypoint detection in low-resolution thermal images using transfer learning techniques. We introduce the first application of the Timed Up and Go (TUG) test in thermal image computer vision, establishing a new paradigm for mobility assessment. Our method leverages a MobileNetV3-Small encoder and a ViTPose decoder, trained using a composite loss function that balances latent representation alignment and heatmap accuracy. The model was evaluated using the Object Keypoint Similarity (OKS) metric from the COCO Keypoint Detection Challenge. The proposed model achieves better performance with AP, AP50, and AP75 scores of 0.861, 0.942, and 0.887 respectively, outperforming traditional supervised learning approaches like Mask R-CNN and ViTPose-Base. Moreover, our model demonstrates superior computational efficiency in terms of parameter count and FLOPS. This research lays a solid foundation for future clinical applications of thermal imaging in mobility assessment and rehabilitation monitoring.

[CV-19] Adaptive Object Detection for Indoor Navigation Assistance: A Performance Evaluation of Real-Time Algorithms

【速读】:该论文旨在解决视觉障碍人士在室内导航辅助技术中对准确且高效的目标检测需求。研究的关键在于评估YOLO、SSD、Faster R-CNN和Mask R-CNN四种实时目标检测算法在室内导航辅助场景下的表现,通过分析检测精度、处理速度以及对室内环境的适应性,揭示精度与效率之间的权衡,从而为实时辅助导航选择最优算法提供见解。这项研究推进了自适应机器学习应用,提升了视觉障碍人士的室内导航解决方案,并促进了无障碍技术的发展。

链接: https://arxiv.org/abs/2501.18444
作者: Abhinav Pratap,Sushant Kumar,Suchinton Chakravarty
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 5 pages, 2 figures, 3 tables

Abstract:This study addresses the need for accurate and efficient object detection in assistive technologies for visually impaired individuals. We evaluate four real-time object detection algorithms YOLO, SSD, Faster R-CNN, and Mask R-CNN within the context of indoor navigation assistance. Using the Indoor Objects Detection dataset, we analyze detection accuracy, processing speed, and adaptability to indoor environments. Our findings highlight the trade-offs between precision and efficiency, offering insights into selecting optimal algorithms for realtime assistive navigation. This research advances adaptive machine learning applications, enhancing indoor navigation solutions for the visually impaired and promoting accessibility.

[CV-20] SANA 1.5: Efficient Scaling of Training-Time and Inference-Time Compute in Linear Diffusion Transformer

【速读】:该论文旨在解决文本到图像生成中的高效扩展问题。解决方案的关键在于三个创新点:(1) 效率训练扩展(Efficient Training Scaling),采用深度增长范式(depth-growth paradigm)使参数规模从16亿扩展至48亿,同时使用内存高效的8位优化器;(2) 模型深度剪枝(Model Depth Pruning),通过区块重要性分析技术实现高效的模型压缩,以任意大小输出且保持较小的质量损失;(3) 推理时间扩展(Inference-time Scaling),通过重复采样策略在推理阶段平衡计算资源与模型容量,使得较小模型能够达到较大模型的质量水平。这些策略使得SANA-1.5在GenEval基准测试中达到了0.72的文本图像对齐得分,并通过推理时间扩展进一步提升至0.80,从而确立了新的顶级水平(SoTA)。

链接: https://arxiv.org/abs/2501.18427
作者: Enze Xie,Junsong Chen,Yuyang Zhao,Jincheng Yu,Ligeng Zhu,Yujun Lin,Zhekai Zhang,Muyang Li,Junyu Chen,Han Cai,Bingchen Liu,Daquan Zhou,Song Han
机构: NVIDIA; MIT; Tsinghua University; Playground; Peking University
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

Abstract:This paper presents SANA-1.5, a linear Diffusion Transformer for efficient scaling in text-to-image generation. Building upon SANA-1.0, we introduce three key innovations: (1) Efficient Training Scaling: A depth-growth paradigm that enables scaling from 1.6B to 4.8B parameters with significantly reduced computational resources, combined with a memory-efficient 8-bit optimizer. (2) Model Depth Pruning: A block importance analysis technique for efficient model compression to arbitrary sizes with minimal quality loss. (3) Inference-time Scaling: A repeated sampling strategy that trades computation for model capacity, enabling smaller models to match larger model quality at inference time. Through these strategies, SANA-1.5 achieves a text-image alignment score of 0.72 on GenEval, which can be further improved to 0.80 through inference scaling, establishing a new SoTA on GenEval benchmark. These innovations enable efficient model scaling across different compute budgets while maintaining high quality, making high-quality image generation more accessible.
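
The inference-time scaling component boils down to repeated sampling plus selection. The sketch below shows a generic best-of-n loop with stub generate/score functions; SANA-1.5's actual diffusion sampler and selection model are more involved.

```python
# Sketch: best-of-n repeated sampling, trading extra inference compute for
# quality. generate() and score() are stand-ins for the diffusion sampler
# and a text-image alignment scorer.
import random

def generate(prompt: str, seed: int) -> str:
    return f"image(prompt={prompt!r}, seed={seed})"   # stub sample

def score(prompt: str, image: str) -> float:
    return random.random()                            # stub alignment score

def best_of_n(prompt: str, n: int = 8) -> str:
    candidates = [generate(prompt, seed) for seed in range(n)]
    return max(candidates, key=lambda img: score(prompt, img))

random.seed(0)
print(best_of_n("a red cube on a blue sphere"))
```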

[CV-21] Real Time Scheduling Framework for Multi Object Detection via Spiking Neural Networks

【速读】:该论文旨在解决自主移动平台(AMAs)中多目标检测(MOD)系统的实时性保证(R1)和高精度(R2)问题。解决方案的关键在于RT-SNN系统设计,通过利用脉冲神经网络(SNNs)在多个时间步长内迭代计算膜电位的特点,提供可调整时间步长的多种执行选项,并引入了一种新颖的方法来重用膜电位以支持R1。此外,RT-SNN通过引入平均绝对误差和膜置信度的新概念,评估这些执行策略对R2的影响,并开发了一个新的调度框架,包括离线调度性分析以确保R1和运行时调度算法以优化R2。

链接: https://arxiv.org/abs/2501.18412
作者: Donghwa Kang,Woojin Shin,Cheol-Ho Hong,Minsuk Koo,Brent ByungHoon Kang,Jinkyu Lee,Hyeongboo Baek
机构: KAIST(韩国科学技术院); University of Seoul(首尔大学); Chung-Ang University(中央大学); Sungkyunkwan University(成均馆大学)
类目: ystems and Control (eess.SY); Computer Vision and Pattern Recognition (cs.CV); Neural and Evolutionary Computing (cs.NE)
备注: 7 pages

Abstract:Given the energy constraints in autonomous mobile agents (AMAs), such as unmanned vehicles, spiking neural networks (SNNs) are increasingly favored as a more efficient alternative to traditional artificial neural networks. AMAs employ multi-object detection (MOD) from multiple cameras to identify nearby objects while ensuring two essential objectives, (R1) timing guarantee and (R2) high accuracy for safety. In this paper, we propose RT-SNN, the first system design, aiming at achieving R1 and R2 in SNN-based MOD systems on AMAs. Leveraging the characteristic that SNNs gather feature data of input image termed as membrane potential, through iterative computation over multiple timesteps, RT-SNN provides multiple execution options with adjustable timesteps and a novel method for reusing membrane potential to support R1. Then, it captures how these execution strategies influence R2 by introducing a novel notion of mean absolute error and membrane confidence. Further, RT-SNN develops a new scheduling framework consisting of offline schedulability analysis for R1 and a run-time scheduling algorithm for R2 using the notion of membrane confidence. We deployed RT-SNN to Spiking-YOLO, the SNN-based MOD model derived from ANN-to-SNN conversion, and our experimental evaluation confirms its effectiveness in meeting the R1 and R2 requirements while providing significant energy efficiency.

[CV-22] Efficient Transformer for High Resolution Image Motion Deblurring

【速读】:该论文旨在解决高分辨率图像运动去模糊任务中的模型复杂度与性能之间的权衡问题。关键解决方案在于引入了架构修改,通过优化注意力机制将模型复杂度降低了18.4%,同时保持或提升了性能。此外,增强的训练流程中加入了色彩抖动(color jitter)、高斯模糊(Gaussian blur)和透视变换(perspective transforms),并引入了一种新的频率损失项(frequency loss term),以提高模型的鲁棒性。这些改进使得模型展现出更好的收敛行为和缩短的训练时间,并在具有挑战性的场景中保持了竞争力。

链接: https://arxiv.org/abs/2501.18403
作者: Amanturdieva Akmaral,Muhammad Hamza Zafar
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 14 pages, 18 figures Submitted as a preprint, no prior journal/conference submission

Abstract:This paper presents a comprehensive study and improvement of the Restormer architecture for high-resolution image motion deblurring. We introduce architectural modifications that reduce model complexity by 18.4% while maintaining or improving performance through optimized attention mechanisms. Our enhanced training pipeline incorporates additional transformations including color jitter, Gaussian blur, and perspective transforms to improve model robustness as well as a new frequency loss term. Extensive experiments on the RealBlur-R, RealBlur-J, and Ultra-High-Definition Motion blurred (UHDM) datasets demonstrate the effectiveness of our approach. The improved architecture shows better convergence behavior and reduced training time while maintaining competitive performance across challenging scenarios. We also provide detailed ablation studies analyzing the impact of our modifications on model behavior and performance. Our results suggest that thoughtful architectural simplification combined with enhanced training strategies can yield more efficient yet equally capable models for motion deblurring tasks. Code and Data Available at: this https URL
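
The added frequency loss term can be illustrated as a distance between the Fourier spectra of the restored and ground-truth images, which penalizes residual blur that a purely spatial loss under-weights. The following PyTorch sketch is a generic formulation; the paper's exact definition and weighting may differ.

```python
# Sketch: a frequency-domain loss term combined with an ordinary spatial
# L1 loss for deblurring; lambda_freq is an assumed weighting.
import torch

def frequency_loss(pred: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    # Compare complex spectra of the two images (B, C, H, W).
    pf = torch.fft.fft2(pred, norm="ortho")
    tf = torch.fft.fft2(target, norm="ortho")
    return (pf - tf).abs().mean()

def total_loss(pred, target, lambda_freq=0.1):
    spatial = (pred - target).abs().mean()            # standard L1 term
    return spatial + lambda_freq * frequency_loss(pred, target)

pred = torch.rand(2, 3, 64, 64, requires_grad=True)
target = torch.rand(2, 3, 64, 64)
loss = total_loss(pred, target)
loss.backward()                                       # differentiable end-to-end
print(f"loss: {loss.item():.4f}")
```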

[CV-23] MatIR: A Hybrid Mamba-Transformer Image Restoration Model

【速读】:该论文旨在解决现有图像修复模型在处理长范围依赖和复杂上下文特征提取方面的不足。具体而言,Transformers模型在上下文学习能力方面表现出色,但计算效率较低;而Mamba模型则具有显著的计算效率优势,但在长范围依赖处理方面存在局限。为克服这些限制,论文提出了一种名为MatIR的Mamba-Transformer混合图像修复模型。关键解决方案在于通过交叉循环Transformer层和Mamba层的模块来充分利用两者的优势,并引入了Image Inpainting State Space (IRSS)模块以高效处理长序列数据,同时结合三角窗局部注意力和基于通道的全局注意力机制,以更广泛地激活图像像素的注意力机制。

链接: https://arxiv.org/abs/2501.18401
作者: Juan Wen(1 and 2),Weiyan Hou(1),Luc Van Gool(2 and 3 and 4),Radu Timofte(5) ((1) Zhengzhou University, (2) ETH Zurich, (3) KU Leuven, (4) INSAIT, Sofia University, (5) University of Wurzburg)
机构: Zhengzhou University (郑州大学); ETH Zürich (瑞士苏黎世联邦理工学院); KU Leuven (比利时鲁汶大学); INSAIT (INSAIT); Sofia University (索非亚大学); University of Wurzburg (维尔茨堡大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: arXiv admin note: text overlap with arXiv:2402.15648 by other authors

Abstract:In recent years, Transformers-based models have made significant progress in the field of image restoration by leveraging their inherent ability to capture complex contextual features. Recently, Mamba models have made a splash in the field of computer vision due to their ability to handle long-range dependencies and their significant computational efficiency compared to Transformers. However, Mamba currently lags behind Transformers in contextual learning capabilities. To overcome the limitations of these two models, we propose a Mamba-Transformer hybrid image restoration model called MatIR. Specifically, MatIR cross-cycles the blocks of the Transformer layer and the Mamba layer to extract features, thereby taking full advantage of the advantages of the two architectures. In the Mamba module, we introduce the Image Inpainting State Space (IRSS) module, which traverses along four scan paths to achieve efficient processing of long sequence data. In the Transformer module, we combine triangular window-based local attention with channel-based global attention to effectively activate the attention mechanism over a wider range of image pixels. Extensive experimental results and ablation studies demonstrate the effectiveness of our approach.

[CV-24] Cracks in concrete

【速读】:该论文旨在解决在混凝土图像中寻找和精确分割裂缝的难题。由于裂缝在三维计算机断层扫描图像中的对比度极弱,加之混凝土基质的异质性和图像尺寸的影响,这一任务变得尤为复杂。论文的关键解决方案在于生成半合成图像数据以训练卷积神经网络(CNN)如3D U-Net或随机森林来实现裂缝的分割,并引入了一种名为RieszNet的网络设计,以确保分割方法对尺度变化具有不变性。

链接: https://arxiv.org/abs/2501.18376
作者: Tin Barisin,Christian Jung,Anna Nowacka,Claudia Redenbach,Katja Schladitz
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV); Applications (stat.AP)
备注: This is a preprint of the chapter: T. Barisin, C. Jung, A. Nowacka, C. Redenbach, K. Schladitz: Cracks in concrete, published in Statistical Machine Learning for Engineering with Applications (LNCS), edited by J. Franke, A. Schöbel, reproduced with permission of Springer Nature Switzerland AG 2024. The final authenticated version is available online at: this https URL

Abstract:Finding and properly segmenting cracks in images of concrete is a challenging task. Cracks are thin and rough and being air filled do yield a very weak contrast in 3D images obtained by computed tomography. Enhancing and segmenting dark lower-dimensional structures is already demanding. The heterogeneous concrete matrix and the size of the images further increase the complexity. ML methods have proven to solve difficult segmentation problems when trained on enough and well annotated data. However, so far, there is not much 3D image data of cracks available at all, let alone annotated. Interactive annotation is error-prone as humans can easily tell cats from dogs or roads without from roads with cars but have a hard time deciding whether a thin and dark structure seen in a 2D slice continues in the next one. Training networks by synthetic, simulated images is an elegant way out, bears however its own challenges. In this contribution, we describe how to generate semi-synthetic image data to train CNN like the well known 3D U-Net or random forests for segmenting cracks in 3D images of concrete. The thickness of real cracks varies widely, both, within one crack as well as from crack to crack in the same sample. The segmentation method should therefore be invariant with respect to scale changes. We introduce the so-called RieszNet, designed for exactly this purpose. Finally, we discuss how to generalize the ML crack segmentation methods to other concrete types.

[CV-25] Video-based Surgical Tool-tip and Keypoint Tracking using Multi-frame Context-driven Deep Learning Models

【速读】:该论文旨在解决自动化追踪机器人手术视频中的手术工具关键点(surgical tool keypoints)的问题,特别是在技能评估、专家评估和划定安全区域等下游应用场景中。解决方案的关键在于提出了一种新颖的多帧上下文驱动深度学习框架,通过利用复杂的深度学习模型和多帧上下文信息,实现了在手术视频中定位和追踪工具关键点,达到了90%的关键点检测准确率和5.27像素的定位均方根误差(RMS error)。

链接: https://arxiv.org/abs/2501.18361
作者: Bhargav Ghanekar,Lianne R. Johnson,Jacob L. Laughlin,Marcia K. O’Malley,Ashok Veeraraghavan
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

Abstract:Automated tracking of surgical tool keypoints in robotic surgery videos is an essential task for various downstream use cases such as skill assessment, expertise assessment, and the delineation of safety zones. In recent years, the explosion of deep learning for vision applications has led to many works in surgical instrument segmentation, while lesser focus has been on tracking specific tool keypoints, such as tool tips. In this work, we propose a novel, multi-frame context-driven deep learning framework to localize and track tool keypoints in surgical videos. We train and test our models on the annotated frames from the 2015 EndoVis Challenge dataset, resulting in state-of-the-art performance. By leveraging sophisticated deep learning models and multi-frame context, we achieve 90% keypoint detection accuracy and a localization RMS error of 5.27 pixels. Results on a self-annotated JIGSAWS dataset with more challenging scenarios also show that the proposed multi-frame models can accurately track tool-tip and tool-base keypoints, with 4.2-pixel RMS error overall. Such a framework paves the way for accurately tracking surgical instrument keypoints, enabling further downstream use cases. Project and dataset webpage: this https URL
zh

[CV-26] CodeBrain: Impute Any Brain MRI via Instance-specific Scalar-quantized Codes

【Quick Read】: This paper tackles missing-modality MRI by synthesizing the missing modality from one or more available ones, which reduces scanning costs and delivers comprehensive MRI information to enhance clinical diagnosis. The key is a unified model, CodeBrain, whose core design casts the various inter-modality transformations as a full-modality code prediction task. CodeBrain is trained in two stages: reconstruction and code prediction. In the reconstruction stage, each MRI modality is mapped into a shared latent space and scalar-quantized; in the code prediction stage, a second encoder is trained with a customized grading loss to predict full-modality codes from randomly masked MRI samples, thereby realizing inter-modality transformation. By mapping instance-specific codes within a finite scalar space, this design substantially improves multi-modal MRI synthesis.

Link: https://arxiv.org/abs/2501.18328
Authors: Yicheng Wu, Tao Song, Zhonghua Wu, Zongyuan Ge, Zhaolin Chen, Jianfei Cai
Institutions: Monash University, Clayton, Australia; Fudan University, Shanghai, China; Nanyang Technological University, Singapore, Singapore
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Note:

Click to view abstract

Abstract:MRI imputation aims to synthesize the missing modality from one or more available ones, which is highly desirable since it reduces scanning costs and delivers comprehensive MRI information to enhance clinical diagnosis. In this paper, we propose a unified model, CodeBrain, designed to adapt to various brain MRI imputation scenarios. The core design lies in casting various inter-modality transformations as a full-modality code prediction task. To this end, CodeBrain is trained in two stages: Reconstruction and Code Prediction. First, in the Reconstruction stage, we reconstruct each MRI modality, which is mapped into a shared latent space followed by a scalar quantization. Since such quantization is lossy and the code is low dimensional, another MRI modality belonging to the same subject is randomly selected to generate common features to supplement the code and boost the target reconstruction. In the second stage, we train another encoder by a customized grading loss to predict the full-modality codes from randomly masked MRI samples, supervised by the corresponding quantized codes generated from the first stage. In this way, the inter-modality transformation is achieved by mapping the instance-specific codes in a finite scalar space. We evaluated the proposed CodeBrain model on two public brain MRI datasets (i.e., IXI and BraTS 2023). Extensive experiments demonstrate that our CodeBrain model achieves superior imputation performance compared to four existing methods, establishing a new state of the art for unified brain MRI imputation. Codes will be released.
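
The scalar quantization at the heart of CodeBrain's shared latent space can be illustrated with a minimal sketch. This is not the authors' code: the level count, the tanh bounding, and the straight-through gradient trick are all assumptions about how such a finite scalar quantizer is commonly implemented.

```python
import torch

def scalar_quantize(z: torch.Tensor, levels: int = 8) -> torch.Tensor:
    """Quantize each latent dimension to one of `levels` values in [-1, 1].

    A straight-through estimator lets gradients flow to the encoder, a
    common trick for training through the non-differentiable rounding.
    """
    z = torch.tanh(z)                                   # bound latent to [-1, 1]
    scale = (levels - 1) / 2.0
    z_q = torch.round((z + 1.0) * scale) / scale - 1.0  # snap to the finite grid
    return z + (z_q - z).detach()                       # straight-through gradient

# Toy usage: a latent vector standing in for one MRI modality's code.
latent = torch.randn(1, 16, requires_grad=True)
code = scalar_quantize(latent)
print(code)                     # values lie on the 8-level grid
code.sum().backward()           # gradients still reach `latent`
print(latent.grad is not None)  # True
```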
zh

[CV-27] Surface Defect Identification using Bayesian Filtering on a 3D Mesh

【Quick Read】: This paper targets automated surface defect detection. The key to the solution is to leverage the a-priori knowledge embedded in a CAD model and fuse it with point cloud data acquired from commercially available stereo and depth cameras. The CAD model is converted into a high-density polygonal mesh, and a weighted least squares algorithm iteratively estimates the state of the scanned workpiece from the captured point cloud measurements, enabling accurate inspection. The approach demonstrates the potential of commercially available stereo cameras for high-precision quality control applications.

Link: https://arxiv.org/abs/2501.18315
Authors: Matteo Dalle Vedove, Matteo Bonetto, Edoardo Lamon, Luigi Palopoli, Matteo Saveriano, Daniele Fontanelli
Institutions: Not listed
Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Note: Presented at IMEKO2024 World Congress, Hamburg, Germany, 26-29 October 2024

Click to view abstract

Abstract:This paper presents a CAD-based approach for automated surface defect detection. We leverage the a-priori knowledge embedded in a CAD model and integrate it with point cloud data acquired from commercially available stereo and depth cameras. The proposed method first transforms the CAD model into a high-density polygonal mesh, where each vertex represents a state variable in 3D space. Subsequently, a weighted least squares algorithm is employed to iteratively estimate the state of the scanned workpiece based on the captured point cloud measurements. This framework offers the potential to incorporate information from diverse sensors into the CAD domain, facilitating a more comprehensive analysis. Preliminary results demonstrate promising performance, with the algorithm achieving convergence to a sub-millimeter standard deviation in the region of interest using only approximately 50 point cloud samples. This highlights the potential of utilising commercially available stereo cameras for high-precision quality control applications.
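
The iterative weighted least squares estimation described above can be sketched on a toy linear measurement model. In the paper the state variables are mesh vertices and the weights come from point cloud measurements; here both are replaced by hypothetical stand-ins, and residual-based re-weighting is just one plausible weighting scheme.

```python
import numpy as np

def weighted_least_squares(H, y, w):
    """One weighted LS solve: argmin_x sum_i w_i * (y_i - H_i @ x)^2."""
    W = np.diag(w)
    return np.linalg.solve(H.T @ W @ H, H.T @ W @ y)

def iterative_wls(H, y, n_iters=10, eps=1e-6):
    """Iteratively re-weight residuals so that outlier measurements
    (e.g., clutter points in a noisy scan) contribute less."""
    w = np.ones(len(y))
    x = weighted_least_squares(H, y, w)
    for _ in range(n_iters):
        r = np.abs(y - H @ x)
        w = 1.0 / (r + eps)       # down-weight large residuals
        x = weighted_least_squares(H, y, w)
    return x

# Toy example: recover a 3-parameter state from 50 noisy measurements.
rng = np.random.default_rng(0)
x_true = np.array([0.5, -1.2, 2.0])
H = rng.normal(size=(50, 3))
y = H @ x_true + 0.01 * rng.normal(size=50)
y[::10] += 5.0                    # inject gross outliers
print(iterative_wls(H, y))        # close to x_true despite the outliers
```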
zh

[CV-28] AGAV-Rater: Adapting Large Multimodal Model for AI-Generated Audio-Visual Quality Assessment

【Quick Read】: This paper addresses quality assessment for AI-generated audio-visual content (AGAV). Existing methods struggle with distortions unique to AGAVs, such as unrealistic and inconsistent elements. To address this, the paper introduces the AGAVQA dataset and proposes AGAV-Rater, an LMM-based model that scores AGAVs, as well as audio and music generated from text, across multiple dimensions and selects the best result to present to the user. The key lies in combining a large-scale dataset with multi-dimensional scoring to effectively identify and assess the complex quality issues in AGAVs.

Link: https://arxiv.org/abs/2501.18314
Authors: Yuqin Cao, Xiongkuo Min, Yixuan Gao, Wei Sun, Guangtao Zhai
Institutions: Not listed
Categories: Multimedia (cs.MM); Computer Vision and Pattern Recognition (cs.CV); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Note:

Click to view abstract

Abstract:Many video-to-audio (VTA) methods have been proposed for dubbing silent AI-generated videos. An efficient quality assessment method for AI-generated audio-visual content (AGAV) is crucial for ensuring audio-visual quality. Existing audio-visual quality assessment methods struggle with unique distortions in AGAVs, such as unrealistic and inconsistent elements. To address this, we introduce AGAVQA, the first large-scale AGAV quality assessment dataset, comprising 3,382 AGAVs from 16 VTA methods. AGAVQA includes two subsets: AGAVQA-MOS, which provides multi-dimensional scores for audio quality, content consistency, and overall quality, and AGAVQA-Pair, designed for optimal AGAV pair selection. We further propose AGAV-Rater, an LMM-based model that can score AGAVs, as well as audio and music generated from text, across multiple dimensions, and select the best AGAV generated by VTA methods to present to the user. AGAV-Rater achieves state-of-the-art performance on AGAVQA, Text-to-Audio, and Text-to-Music datasets. Subjective tests also confirm that AGAV-Rater enhances VTA performance and user experience. The project page is available at this https URL.
zh

[CV-29] Simulation of microstructures and machine learning MICRO

【Quick Read】: This paper addresses a critical bottleneck of machine learning for image processing: the limited availability of training data. The key to the solution is to use synthetic images generated from realizations of stochastic geometry models, which naturally capture within-structure variation and provide ground truth without manual annotation. The approach is particularly suited to optical quality control and to segmenting crack systems in 3D images of concrete.

Link: https://arxiv.org/abs/2501.18313
Authors: Katja Schladitz, Claudia Redenbach, Tin Barisin, Christian Jung, Natascha Jeziorski, Lovro Bosnar, Juraj Fulir, Petra Gospodnetić
Institutions: Not listed
Categories: Computer Vision and Pattern Recognition (cs.CV)
Note: Preprint of: K. Schladitz, C. Redenbach, T. Barisin, C. Jung, N. Jeziorski, L. Bosnar, J. Fulir, P. Gospodnetić: Simulation of Microstructures and Machine Learning, published in Continuum Models and Discrete Systems by F. Willot, J. Dirrenberger, S. Forest, D. Jeulin, A.V. Cherkaev (eds), 2024, Springer Cham. The final version is this https URL

Click to view abstract

Abstract:Machine learning offers attractive solutions to challenging image processing tasks. Tedious development and parametrization of algorithmic solutions can be replaced by training a convolutional neural network or a random forest with a high potential to generalize. However, machine learning methods rely on huge amounts of representative image data along with a ground truth, usually obtained by manual annotation. Thus, limited availability of training data is a critical bottleneck. We discuss two use cases: optical quality control in industrial production and segmenting crack structures in 3D images of concrete. For optical quality control, all defect types have to be trained but are typically not evenly represented in the training data. Additionally, manual annotation is costly and often inconsistent. It is nearly impossible in the second case: segmentation of crack systems in 3D images of concrete. Synthetic images, generated based on realizations of stochastic geometry models, offer an elegant way out. A wide variety of structure types can be generated. The within-structure variation is naturally captured by the stochastic nature of the models, and the ground truth comes for free. Many new questions arise, in particular which characteristics of the real image data have to be met, and to which degree of fidelity.
zh

[CV-30] A Comprehensive Analysis on Machine Learning based Methods for Lung Cancer Level Classification

【Quick Read】: This paper investigates precise classification of lung cancer stages with a range of machine learning methods. The key lies in carefully analyzing parameters such as minimum child weight and learning rate to overcome overfitting, and in systematically comparing models including XGBoost (XGB), LGBM, Adaboost, Logistic Regression (LR), Decision Tree (DT), Random Forest (RF), CatBoost, and k-Nearest Neighbor (k-NN). A deep neural network (DNN) is additionally used to probe the correlation between features and targets and thereby identify complex patterns. The study finds that, despite the complexity of DNN architectures, traditional ML models such as XGBoost, LGBM, and Logistic Regression perform strongly on metrics including accuracy, precision, recall, and F-1 score.

Link: https://arxiv.org/abs/2501.18294
Authors: Shayli Farshchiha, Salman Asoudeh, Maryam Shavali Kuhshuri, Mehrshad Eisaeid, Mohamadreza Azadie, Saba Hesaraki
Institutions: Not listed
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Note:

Click to view abstract

Abstract:Lung cancer is a major issue in worldwide public health, requiring early diagnosis using stable techniques. This work begins a thorough investigation of the use of machine learning (ML) methods for precise classification of lung cancer stages. A careful analysis is performed to overcome overfitting issues in model performance, taking into account minimum child weight and learning rate. A set of machine learning (ML) models including XGBoost (XGB), LGBM, Adaboost, Logistic Regression (LR), Decision Tree (DT), Random Forest (RF), CatBoost, and k-Nearest Neighbor (k-NN) are run methodically and contrasted. Furthermore, the correlation between features and targets is examined using the deep neural network (DNN) model, and thus their capability in detecting complex patterns is established. It is argued that several ML models can be capable of classifying lung cancer stages with great accuracy. In spite of the complexity of DNN architectures, traditional ML models like XGBoost, LGBM, and Logistic Regression excel with superior performance. The models perform better than the others in lung cancer prediction on the complete set of comparative metrics like accuracy, precision, recall, and F-1 score.
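
A minimal sketch of the kind of model comparison the paper performs, using scikit-learn classifiers and the four reported metrics. The dataset here is synthetic stand-in data; the boosted models (XGB, LGBM, CatBoost) would slot into the same dictionary via their respective packages.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Placeholder data standing in for a lung-cancer feature table.
X, y = make_classification(n_samples=500, n_features=20, n_classes=3,
                           n_informative=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

models = {
    "LR": LogisticRegression(max_iter=1000),
    "DT": DecisionTreeClassifier(random_state=0),
    "RF": RandomForestClassifier(random_state=0),
    "k-NN": KNeighborsClassifier(),
}

for name, model in models.items():
    y_pred = model.fit(X_tr, y_tr).predict(X_te)
    print(f"{name}: acc={accuracy_score(y_te, y_pred):.3f} "
          f"prec={precision_score(y_te, y_pred, average='macro'):.3f} "
          f"rec={recall_score(y_te, y_pred, average='macro'):.3f} "
          f"f1={f1_score(y_te, y_pred, average='macro'):.3f}")
```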
zh

[CV-31] MAMS: Model-Agnostic Module Selection Framework for Video Captioning AAAI2025

【Quick Read】: This paper addresses a key challenge in multimodal video captioning caused by extracting a fixed number of frames: important frames may be missed, while extracting too many frames introduces redundant visual tokens. The key is the first model-agnostic module selection framework for video captioning, which selects a caption generation module of appropriate size based on the visual tokens extracted from the video frames and constructs subsets of these tokens for the selected module. In addition, a new adaptive attention masking scheme is proposed to enhance attention on important visual tokens.

Link: https://arxiv.org/abs/2501.18269
Authors: Sangho Lee, Il Yong Chun, Hogun Park
Institutions: Not listed
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Note: Accepted to the AAAI 2025 Main Technical Track. This is an extended version of the original submission

Click to view abstract

Abstract:Multi-modal transformers are rapidly gaining attention in video captioning tasks. Existing multi-modal video captioning methods typically extract a fixed number of frames, which raises critical challenges. When a limited number of frames are extracted, important frames with essential information for caption generation may be missed. Conversely, extracting an excessive number of frames includes consecutive frames, potentially causing redundancy in visual tokens extracted from consecutive video frames. To extract an appropriate number of frames for each video, this paper proposes the first model-agnostic module selection framework in video captioning that has two main functions: (1) selecting a caption generation module with an appropriate size based on visual tokens extracted from video frames, and (2) constructing subsets of visual tokens for the selected caption generation module. Furthermore, we propose a new adaptive attention masking scheme that enhances attention on important visual tokens. Our experiments on three different benchmark datasets demonstrate that the proposed framework significantly improves the performance of three recent video captioning models.
zh

[CV-32] Ground Awareness in Deep Learning for Large Outdoor Point Cloud Segmentation

【Quick Read】: This paper addresses the problem that, in semantic segmentation of large-scale outdoor point clouds, the high point density can make a machine learning model's receptive field too small to accurately determine a point's surroundings and context. The key to the solution is to compute Digital Terrain Models (DTMs) from the point clouds and extract the relative elevation feature (the vertical distance from the terrain to a point), which is integrated into RandLA-Net to strengthen the modeling of long-range dependencies. Experiments show consistent performance gains across different datasets, most notably on the Hessigheim dataset, where the average F1 score increases by 3.7 percentage points.

Link: https://arxiv.org/abs/2501.18246
Authors: Kevin Qiu, Dimitri Bulatov, Dorota Iwaszczuk
Institutions: Fraunhofer IOSB Ettlingen, Gutleuthausstrasse 1, 76275 Ettlingen, Germany; Technical University of Darmstadt, Civil and Environmental Engineering Sciences, Darmstadt, Germany
Categories: Computer Vision and Pattern Recognition (cs.CV)
Note: This paper has been accepted for presentation at the GRAPP 2025 conference

Click to view abstract

Abstract:This paper presents an analysis of utilizing elevation data to aid outdoor point cloud semantic segmentation through existing machine-learning networks in remote sensing, specifically in urban, built-up areas. In dense outdoor point clouds, the receptive field of a machine learning model may be too small to accurately determine the surroundings and context of a point. By computing Digital Terrain Models (DTMs) from the point clouds, we extract the relative elevation feature, which is the vertical distance from the terrain to a point. RandLA-Net is employed for efficient semantic segmentation of large-scale point clouds. We assess its performance across three diverse outdoor datasets captured with varying sensor technologies and sensor locations. Integration of relative elevation data leads to consistent performance improvements across all three datasets, most notably in the Hessigheim dataset, with an increase of 3.7 percentage points in average F1 score from 72.35% to 76.01%, by establishing long-range dependencies between ground and objects. We also explore additional local features such as planarity, normal vectors, and 2D features, but their efficacy varied based on the characteristics of the point cloud. Ultimately, this study underscores the important role of the non-local relative elevation feature for semantic segmentation of point clouds in remote sensing applications.
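
The relative elevation feature — the vertical distance from the terrain to each point — can be approximated with a simple grid-based DTM, sketched below. The per-cell minimum-z ground estimate is an assumption for illustration; production DTMs typically use more robust morphological filtering.

```python
import numpy as np

def relative_elevation(points: np.ndarray, cell: float = 1.0) -> np.ndarray:
    """Approximate the relative-elevation feature: height of each point
    above a coarse Digital Terrain Model (here, the per-cell minimum z).

    points: (N, 3) array of x, y, z coordinates.
    """
    ij = np.floor(points[:, :2] / cell).astype(np.int64)
    ij -= ij.min(axis=0)                          # shift indices to start at 0
    flat = ij[:, 0] * (ij[:, 1].max() + 1) + ij[:, 1]
    dtm = np.full(flat.max() + 1, np.inf)
    np.minimum.at(dtm, flat, points[:, 2])        # per-cell ground estimate
    return points[:, 2] - dtm[flat]               # height above local terrain

# Toy cloud: flat ground near z=0 plus one "roof" block near z=5.
rng = np.random.default_rng(0)
ground = np.c_[rng.uniform(0, 10, (200, 2)), rng.normal(0, 0.05, 200)]
roof = np.c_[rng.uniform(4, 6, (50, 2)), 5 + rng.normal(0, 0.05, 50)]
cloud = np.vstack([ground, roof])
rel = relative_elevation(cloud)
print(rel[:200].mean(), rel[200:].mean())  # ground near 0, roof well above
```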
zh

[CV-33] Arbitrary Data as Images: Fusion of Patient Data Across Modalities and Irregular Intervals with Vision Transformers

【Quick Read】: This paper aims to simplify the processing and modeling of multimodal medical data, addressing the difficulty that different modalities (such as time series, discrete measurements, and therapeutic interventions) each require dedicated modeling when training neural networks. The key is the proposed Vision Transformer for irregular sampled Multi-modal Measurements (ViTiMM), which visualizes all information as images together with unstructured text and then trains a conventional vision-text transformer. This not only simplifies data preprocessing and modeling, but also outperforms current state-of-the-art methods on in-hospital mortality prediction and phenotyping.

Link: https://arxiv.org/abs/2501.18237
Authors: Malte Tölle, Mohamad Scharaf, Samantha Fischer, Christoph Reich, Silav Zeid, Christoph Dieterich, Benjamin Meder, Norbert Frey, Philipp Wild, Sandy Engelhardt
Institutions: Not listed
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Note:

Click to view abstract

Abstract:A patient undergoes multiple examinations in each hospital stay, where each provides different facets of the health status. These assessments include temporal data with varying sampling rates, discrete single-point measurements, therapeutic interventions such as medication administration, and images. While physicians are able to process and integrate diverse modalities intuitively, neural networks need specific modeling for each modality complicating the training procedure. We demonstrate that this complexity can be significantly reduced by visualizing all information as images along with unstructured text and subsequently training a conventional vision-text transformer. Our approach, Vision Transformer for irregular sampled Multi-modal Measurements (ViTiMM), not only simplifies data preprocessing and modeling but also outperforms current state-of-the-art methods in predicting in-hospital mortality and phenotyping, as evaluated on 6,175 patients from the MIMIC-IV dataset. The modalities include patient’s clinical measurements, medications, X-ray images, and electrocardiography scans. We hope our work inspires advancements in multi-modal medical AI by reducing the training complexity to (visual) prompt engineering, thus lowering entry barriers and enabling no-code solutions for training. The source code will be made publicly available.
zh

[CV-34] Free-T2M: Frequency Enhanced Text-to-Motion Diffusion Model With Consistency Loss

【Quick Read】: This paper addresses the missing frequency-domain analysis in text-to-motion generation, where existing methods focus on temporal modeling and neglect the frequency domain. The paper identifies two key phases in motion denoising: a semantic planning stage and a fine-grained improving stage. To handle both effectively, it proposes the Frequency enhanced text-to-motion diffusion model (Free-T2M), which introduces stage-specific consistency losses that strengthen the robustness of static features and improve fine-grained accuracy. On StableMoFusion, the method reduces the FID from 0.189 to 0.051, establishing new SOTA performance within the diffusion architecture. These findings underscore the importance of incorporating frequency-domain insights into text-to-motion generation for more precise and robust results.

Link: https://arxiv.org/abs/2501.18232
Authors: Wenshuo Chen, Haozhe Jia, Songning Lai, Keming Wu, Hongru Xiao, Lijie Hu, Yutao Yue
Institutions: Not listed
Categories: Computer Vision and Pattern Recognition (cs.CV)
Note:

Click to view abstract

Abstract:Rapid progress in text-to-motion generation has been largely driven by diffusion models. However, existing methods focus solely on temporal modeling, thereby overlooking frequency-domain analysis. We identify two key phases in motion denoising: the semantic planning stage and the fine-grained improving stage. To address these phases effectively, we propose Frequency enhanced text-to-motion diffusion model (Free-T2M), incorporating stage-specific consistency losses that enhance the robustness of static features and improve fine-grained accuracy. Extensive experiments demonstrate the effectiveness of our method. Specifically, on StableMoFusion, our method reduces the FID from 0.189 to 0.051, establishing a new SOTA performance within the diffusion architecture. These findings highlight the importance of incorporating frequency-domain insights into text-to-motion generation for more precise and robust results.
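
The abstract does not give the exact form of the stage-specific consistency losses, but a frequency-domain consistency term of the following flavor — matching low temporal frequencies of predicted and reference motion with an FFT — illustrates the general idea. Tensor shapes and the number of retained frequencies are assumptions.

```python
import torch

def low_frequency_consistency_loss(pred: torch.Tensor,
                                   target: torch.Tensor,
                                   keep: int = 4) -> torch.Tensor:
    """Penalize disagreement in the lowest temporal frequencies of a motion
    sequence; pred and target have shape (batch, time, joints).

    Low frequencies carry the coarse, "semantic" trajectory, so matching
    them is one plausible way to stabilize the early denoising phase.
    """
    pf = torch.fft.rfft(pred, dim=1)    # FFT along the time axis
    tf = torch.fft.rfft(target, dim=1)
    return (pf[:, :keep] - tf[:, :keep]).abs().pow(2).mean()

# Toy usage with random motion tensors (2 sequences, 60 frames, 22 joints).
pred = torch.randn(2, 60, 22, requires_grad=True)
target = torch.randn(2, 60, 22)
loss = low_frequency_consistency_loss(pred, target)
loss.backward()
print(float(loss))
```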
zh

[CV-35] Machine Learning Fairness for Depression Detection using EEG Data

【Quick Read】: This paper presents the first evaluation of machine learning fairness for depression detection using electroencephalogram (EEG) data. It experiments with different deep learning architectures, including Convolutional Neural Networks (CNN), Long Short-Term Memory (LSTM) networks, and Gated Recurrent Unit (GRU) networks, on three EEG datasets (Mumtaz, MODMA, and Rest) to expose and address bias. The key lies in applying five bias mitigation strategies at the pre-, in-, and post-processing stages and evaluating their effectiveness. Results show that bias exists in existing EEG datasets and algorithms for depression detection, and that different mitigation methods address bias to different degrees across different fairness measures.

Link: https://arxiv.org/abs/2501.18192
Authors: Angus Man Ho Kwok, Jiaee Cheong, Sinan Kalkan, Hatice Gunes
Institutions: Not listed
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Signal Processing (eess.SP)
Note: To appear as part of the International Symposium on Biomedical Imaging (ISBI) 2025 proceedings

Click to view abstract

Abstract:This paper presents the very first attempt to evaluate machine learning fairness for depression detection using electroencephalogram (EEG) data. We conduct experiments using different deep learning architectures such as Convolutional Neural Networks (CNN), Long Short-Term Memory (LSTM) networks, and Gated Recurrent Unit (GRU) networks across three EEG datasets: Mumtaz, MODMA and Rest. We employ five different bias mitigation strategies at the pre-, in- and post-processing stages and evaluate their effectiveness. Our experimental results show that bias exists in existing EEG datasets and algorithms for depression detection, and different bias mitigation methods address bias at different levels across different fairness measures.
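
Fairness evaluation of the kind reported here rests on group-wise metrics. The sketch below computes two common ones (statistical parity difference and equal opportunity difference) for a binary detector; the specific fairness measures used in the paper may differ.

```python
import numpy as np

def fairness_gaps(y_true, y_pred, group):
    """Two common group-fairness measures for a binary detector:
    statistical parity difference and equal-opportunity difference."""
    y_true, y_pred, group = map(np.asarray, (y_true, y_pred, group))
    rates, tprs = {}, {}
    for g in np.unique(group):
        m = group == g
        rates[g] = y_pred[m].mean()                            # P(pred=1 | g)
        pos = m & (y_true == 1)
        tprs[g] = y_pred[pos].mean() if pos.any() else np.nan  # TPR per group
    return {
        "statistical_parity_diff": max(rates.values()) - min(rates.values()),
        "equal_opportunity_diff": (np.nanmax(list(tprs.values()))
                                   - np.nanmin(list(tprs.values()))),
    }

# Toy example: predictions for two demographic groups.
y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 1, 0, 0, 0, 0, 0]
group  = [0, 0, 0, 0, 1, 1, 1, 1]
print(fairness_gaps(y_true, y_pred, group))
```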
zh

[CV-36] IROAM: Improving Roadside Monocular 3D Object Detection Learning from Autonomous Vehicle Data Domain ICRA2025

【Quick Read】: This paper addresses the viewpoint domain gap in autonomous driving that makes monocular detection methods designed for vehicle cameras unsuitable for roadside cameras. To bridge this gap and improve roadside monocular 3D object detection, the paper proposes IROAM, a semantic-geometry decoupled contrastive learning framework. Its key lies in two modules: In-Domain Query Interaction uses a transformer to learn content and depth information for each domain and outputs object queries; Cross-Domain Query Enhancement decouples queries into semantic and geometry parts and uses only the former for contrastive learning, thereby learning better feature representations from both domains.

Link: https://arxiv.org/abs/2501.18162
Authors: Zhe Wang, Xiaoliang Huo, Siqi Fan, Jingjing Liu, Ya-Qin Zhang, Yan Wang
Institutions: Institute for AI Industry Research (AIR), Tsinghua University, Beijing, China; School of Software, Beihang University, Beijing, China
Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Note: 7 pages, 5 figures, ICRA2025

Click to view abstract

Abstract:In autonomous driving, the perception capabilities of the ego-vehicle can be improved with roadside sensors, which can provide a holistic view of the environment. However, existing monocular detection methods designed for vehicle cameras are not suitable for roadside cameras due to viewpoint domain gaps. To bridge this gap and Improve ROAdside Monocular 3D object detection, we propose IROAM, a semantic-geometry decoupled contrastive learning framework, which takes vehicle-side and roadside data as input simultaneously. IROAM has two significant modules. The In-Domain Query Interaction module utilizes a transformer to learn content and depth information for each domain and outputs object queries. To learn better feature representations from the two domains, the Cross-Domain Query Enhancement module decouples queries into semantic and geometry parts, and only the former is used for contrastive learning. Experiments demonstrate the effectiveness of IROAM in improving the roadside detector’s performance. The results validate that IROAM has the capability to learn cross-domain information.
zh

[CV-37] Efficient Audiovisual Speech Processing via MUTUD: Multimodal Training and Unimodal Deployment

【Quick Read】: This paper addresses practical obstacles faced by multimodal systems, such as increased sensory requirements, computational cost, and modality synchronization. The key is the Multimodal Training and Unimodal Deployment (MUTUD) framework, which includes a Temporally Aligned Modality feature Estimation (TAME) module that estimates information from missing modalities using the modalities present at inference time. This enables deployment or inference with a single modality or a reduced set of modalities, compensating for absent modalities by fusing the strengths of different ones, and considerably narrows the performance gap between multimodal models and their unimodal counterparts while reducing model size and compute.

Link: https://arxiv.org/abs/2501.18157
Authors: Joanna Hong, Sanjeel Parekh, Honglie Chen, Jacob Donley, Ke Tan, Buye Xu, Anurag Kumar
Institutions: Meta
Categories: Sound (cs.SD); Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM); Audio and Speech Processing (eess.AS)
Note:

Click to view abstract

Abstract:Building reliable speech systems often requires combining multiple modalities, like audio and visual cues. While such multimodal solutions frequently lead to improvements in performance and may even be critical in certain cases, they come with several constraints such as increased sensory requirements, computational cost, and modality synchronization, to mention a few. These challenges constrain the direct uses of these multimodal solutions in real-world applications. In this work, we develop approaches where the learning happens with all available modalities but the deployment or inference is done with just one or reduced modalities. To do so, we propose a Multimodal Training and Unimodal Deployment (MUTUD) framework which includes a Temporally Aligned Modality feature Estimation (TAME) module that can estimate information from missing modality using modalities present during inference. This innovative approach facilitates the integration of information across different modalities, enhancing the overall inference process by leveraging the strengths of each modality to compensate for the absence of certain modalities during inference. We apply MUTUD to various audiovisual speech tasks and show that it can reduce the performance gap between the multimodal and corresponding unimodal models to a considerable extent. MUTUD can achieve this while reducing the model size and compute compared to multimodal models, in some cases by almost 80%.
zh

[CV-38] REMOTE: Real-time Ego-motion Tracking for Various Endoscopes via Multimodal Visual Feature Learning

【Quick Read】: This paper addresses real-time ego-motion tracking for endoscopes, enabling efficient navigation and robotic automation of endoscopy. The key is a novel framework with a multi-modal visual feature learning network for relative pose prediction, which extracts the motion feature from optical flow, scene features, and the joint feature of two adjacent observations. A new attention-based feature extractor integrates multi-dimensional information from the concatenation of consecutive frames, and a new pose decoder predicts the pose transformation from the fused feature map; the absolute pose of the endoscope is then computed by accumulating relative poses. Experiments on three datasets of different endoscopic scenes show that the method outperforms the state of the art, with an inference speed above 30 frames per second, meeting real-time requirements.

Link: https://arxiv.org/abs/2501.18124
Authors: Liangjing Shao, Benshuang Chen, Shuting Zhao, Xinrong Chen
Institutions: Academy for Engineering & Technology, Fudan University; Shanghai Key Laboratory of Medical Image Computing and Computer-Assisted Intervention, Fudan University
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Note:

Click to view abstract

Abstract:Real-time ego-motion tracking for endoscope is a significant task for efficient navigation and robotic automation of endoscopy. In this paper, a novel framework is proposed to perform real-time ego-motion tracking for endoscope. Firstly, a multi-modal visual feature learning network is proposed to perform relative pose prediction, in which the motion feature from the optical flow, the scene features and the joint feature from two adjacent observations are all extracted for prediction. Due to more correlation information in the channel dimension of the concatenated image, a novel feature extractor is designed based on an attention mechanism to integrate multi-dimensional information from the concatenation of two continuous frames. To extract more complete feature representation from the fused features, a novel pose decoder is proposed to predict the pose transformation from the concatenated feature map at the end of the framework. At last, the absolute pose of the endoscope is calculated based on relative poses. The experiment is conducted on three datasets of various endoscopic scenes and the results demonstrate that the proposed method outperforms state-of-the-art methods. Besides, the inference speed of the proposed method is over 30 frames per second, which meets the real-time requirement. The project page is here: this https URL
zh

[CV-39] DeepFRC: An End-to-End Deep Learning Model for Functional Registration and Classification

【Quick Read】: This paper addresses the efficiency and performance limitations caused by decoupling registration and classification in functional data analysis (FDA). The key is DeepFRC, an end-to-end deep learning framework that integrates an alignment module (learning time warping functions via elastic function registration) and a learnable basis representation module (for dimensionality reduction on the aligned data), so that registration and classification are optimized jointly. This integration significantly improves both alignment accuracy and predictive performance.

Link: https://arxiv.org/abs/2501.18116
Authors: Siyuan Jiang, Yihan Hu, Wenjie Li, Pengcheng Zeng
Institutions: Institute of Mathematical Sciences, ShanghaiTech University
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Machine Learning (stat.ML)
Note: 27 pages, 8 figures

Click to view abstract

Abstract:Functional data analysis (FDA) is essential for analyzing continuous, high-dimensional data, yet existing methods often decouple functional registration and classification, limiting their efficiency and performance. We present DeepFRC, an end-to-end deep learning framework that unifies these tasks within a single model. Our approach incorporates an alignment module that learns time warping functions via elastic function registration and a learnable basis representation module for dimensionality reduction on aligned data. This integration enhances both alignment accuracy and predictive performance. Theoretical analysis establishes that DeepFRC achieves low misalignment and generalization error, while simulations elucidate the progression of registration, reconstruction, and classification during training. Experiments on real-world datasets demonstrate that DeepFRC consistently outperforms state-of-the-art methods, particularly in addressing complex registration challenges. Code is available at: this https URL.
zh

[CV-40] Lifelong 3D Mapping Framework for Hand-held Robot-mounted LiDAR Mapping Systems

【Quick Read】: This paper addresses lifelong 3D mapping for both hand-held and robot-mounted 3D LiDAR mapping systems. The key components are a modular, cloud-native framework comprising dynamic point removal, multi-session map alignment, map change detection, and map version control. The dynamic point removal algorithm works across sensor setups to produce clean static 3D maps; multi-session map alignment aligns maps automatically via feature descriptor matching and fine registration, without manual parameter tuning; map change detection identifies positive and negative changes between two aligned maps; and the map version control system maintains a base map representing the current state of the environment while storing detected changes and boundary information, allowing users to query changes between any two mapping sessions without storing the raw session data.

Link: https://arxiv.org/abs/2501.18110
Authors: Liudi Yang, Sai Manoj Prakhya, Senhua Zhu, Ziyuan Liu
Institutions: Huawei, Munich, Germany; Huawei Cloud, China
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Note:

Click to view abstract

Abstract:We propose a lifelong 3D mapping framework that is modular, cloud-native by design and, more importantly, works for both hand-held and robot-mounted 3D LiDAR mapping systems. Our proposed framework comprises dynamic point removal, multi-session map alignment, map change detection and map version control. First, our sensor-setup-agnostic dynamic point removal algorithm works seamlessly with both hand-held and robot-mounted setups to produce clean static 3D maps. Second, the multi-session map alignment aligns these clean static maps automatically, without manual parameter fine-tuning, into a single reference frame, using a two-stage approach based on feature descriptor matching and fine registration. Third, our novel map change detection identifies positive and negative changes between two aligned maps. Finally, the map version control maintains a single base map that represents the current state of the environment, and stores the detected positive and negative changes, and boundary information. Our map version control system can reconstruct any of the previous clean session maps and allows users to query changes between any two random mapping sessions, all without storing any input raw session maps. Extensive experiments are performed using hand-held commercial LiDAR mapping devices and open-source robot-mounted LiDAR SLAM algorithms to evaluate each module and the whole 3D lifelong mapping framework.
zh

[CV-41] Disentangling Safe and Unsafe Corruptions via Anisotropy and Locality

【Quick Read】: This paper addresses the inability of existing threat models (such as the $\ell_p$ norm) to capture common image corruptions in computer vision such as blur, compression, or occlusion. The key is a new threat model, Projected Displacement (PD), which measures the threat of a perturbation via its alignment with unsafe directions, defined as directions in input space along which a perturbation of sufficient magnitude changes the ground-truth class label. The PD threat model is anisotropic and local: unsafe directions are identified locally for each input, and the model can be applied to arbitrary classification tasks without pre-training or fine-tuning, making it a flexible, task-driven threat specification.

Link: https://arxiv.org/abs/2501.18098
Authors: Ramchandran Muthukumar, Ambar Pal, Jeremias Sulam, Rene Vidal
Institutions: Johns Hopkins University; Amazon Web Services; University of Pennsylvania
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Note:

Click to view abstract

Abstract:State-of-the-art machine learning systems are vulnerable to small perturbations to their input, where "small" is defined according to a threat model that assigns a positive threat to each perturbation. Most prior works define a task-agnostic, isotropic, and global threat, like the $\ell_p$ norm, where the magnitude of the perturbation fully determines the degree of the threat and neither the direction of the attack nor its position in space matter. However, common corruptions in computer vision, such as blur, compression, or occlusions, are not well captured by such threat models. This paper proposes a novel threat model called Projected Displacement (PD) to study robustness beyond existing isotropic and global threat models. The proposed threat model measures the threat of a perturbation via its alignment with unsafe directions, defined as directions in the input space along which a perturbation of sufficient magnitude changes the ground truth class label. Unsafe directions are identified locally for each input based on observed training data. In this way, the PD threat model exhibits anisotropy and locality. Experiments on Imagenet-1k data indicate that, for any input, the set of perturbations with small PD threat includes safe perturbations of large $\ell_p$ norm that preserve the true label, such as noise, blur and compression, while simultaneously excluding unsafe perturbations that alter the true label. Unlike perceptual threat models based on embeddings of large-vision models, the PD threat model can be readily computed for arbitrary classification tasks without pre-training or finetuning. Further, additional task annotations such as sensitivity to image regions or concept hierarchies can be easily integrated into the assessment of threat, and thus the PD threat model presents practitioners with a flexible, task-driven threat specification.
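
The core PD idea — scoring a perturbation by its alignment with unsafe directions rather than by its norm — reduces to projections, as in the hedged sketch below. How unsafe directions are mined from training data is the substance of the paper and is not reproduced here; random directions stand in for them.

```python
import numpy as np

def pd_threat(delta: np.ndarray, unsafe_dirs: np.ndarray) -> float:
    """Threat of perturbation `delta` as its largest positive projection
    onto any unit-norm unsafe direction (anisotropic: direction matters)."""
    dirs = unsafe_dirs / np.linalg.norm(unsafe_dirs, axis=1, keepdims=True)
    return float(np.max(dirs @ delta, initial=0.0))

rng = np.random.default_rng(0)
unsafe = rng.normal(size=(5, 100))                     # stand-in unsafe directions
noise = rng.normal(size=100)                           # l2 norm around 10
aligned = 2.0 * unsafe[0] / np.linalg.norm(unsafe[0])  # l2 norm exactly 2
# Large-norm random noise typically scores a threat well below its norm,
# while a small perturbation aligned with an unsafe direction scores 2.0:
print(np.linalg.norm(noise), pd_threat(noise, unsafe))
print(np.linalg.norm(aligned), pd_threat(aligned, unsafe))
```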
zh

[CV-42] Generative AI for Vision: A Comprehensive Study of Frameworks and Applications

【Quick Read】: This paper systematically classifies image generation techniques and examines the principles and key frameworks behind them. Methods are organized by input modality (such as noisy vectors, latent representations, and conditional inputs) in order to address practical challenges of generative AI such as computational cost, data bias, and aligning outputs with user intent. The key contribution is this input-centric perspective, which helps researchers and practitioners comprehensively understand and harness generative AI for real-world applications.

Link: https://arxiv.org/abs/2501.18033
Authors: Fouad Bousetouane
Institutions: The University of Chicago; 2ndsight.ai
Categories: Computer Vision and Pattern Recognition (cs.CV)
Note: 53 pages, 18 figures

Click to view abstract

Abstract:Generative AI is transforming image synthesis, enabling the creation of high-quality, diverse, and photorealistic visuals across industries like design, media, healthcare, and autonomous systems. Advances in techniques such as image-to-image translation, text-to-image generation, domain transfer, and multimodal alignment have broadened the scope of automated visual content creation, supporting a wide spectrum of applications. These advancements are driven by models like Generative Adversarial Networks (GANs), conditional frameworks, and diffusion-based approaches such as Stable Diffusion. This work presents a structured classification of image generation techniques based on the nature of the input, organizing methods by input modalities like noisy vectors, latent representations, and conditional inputs. We explore the principles behind these models, highlight key frameworks including DALL-E, ControlNet, and DeepSeek Janus-Pro, and address challenges such as computational costs, data biases, and output alignment with user intent. By offering this input-centric perspective, this study bridges technical depth with practical insights, providing researchers and practitioners with a comprehensive resource to harness generative AI for real-world applications.
zh

[CV-43] Anatomy Might Be All You Need: Forecasting What to Do During Surgery

【Quick Read】: This paper addresses how to provide surgical guidance at a finer scale in neurosurgery, specifically by forecasting the trajectory of the surgical instrument to determine what to do next. The key is a model that leverages not only the historical locations of surgical instruments but also anatomical features. Notably, the approach does not rely on explicit ground-truth labels for instrument trajectories; instead, labels are generated by a detection model trained to recognize anatomical structures and instruments in surgical videos. By analyzing the interaction between anatomy and instrument movements and forecasting future instrument motion, the work shows that anatomical features are a valuable asset for this challenging task. To the authors' knowledge, this is the first attempt to address this task for manually operated surgeries.

Link: https://arxiv.org/abs/2501.18011
Authors: Gary Sarwin, Alessandro Carretta, Victor Staartjes, Matteo Zoli, Diego Mazzatenta, Luca Regli, Carlo Serra, Ender Konukoglu
Institutions: Not listed
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Note:

Click to view abstract

Abstract:Surgical guidance can be delivered in various ways. In neurosurgery, spatial guidance and orientation are predominantly achieved through neuronavigation systems that reference pre-operative MRI scans. Recently, there has been growing interest in providing live guidance by analyzing video feeds from tools such as endoscopes. Existing approaches, including anatomy detection, orientation feedback, phase recognition, and visual question-answering, primarily focus on aiding surgeons in assessing the current surgical scene. This work aims to provide guidance on a finer scale by forecasting the trajectory of the surgical instrument, essentially addressing the question of what to do next. To address this task, we propose a model that not only leverages the historical locations of surgical instruments but also integrates anatomical features. Importantly, our work does not rely on explicit ground truth labels for instrument trajectories. Instead, the ground truth is generated by a detection model trained to detect both anatomical structures and instruments within surgical videos of a comprehensive dataset containing pituitary surgery videos. By analyzing the interaction between anatomy and instrument movements in these videos and forecasting future instrument movements, we show that anatomical features are a valuable asset in addressing this challenging task. To the best of our knowledge, this work is the first attempt to address this task for manually operated surgeries.
zh

[CV-44] Pressure Field Reconstruction with SIREN: A Mesh-Free Approach for Image Velocimetry in Complex Noisy Environments

【Quick Read】: This paper addresses pressure field reconstruction from image velocimetry data, particularly in noisy environments and in the absence of a mesh. The key is a novel approach based on SIREN (Sinusoidal Representation Network), an implicit neural representation that performs well under noise and is mesh-free by nature. SIREN reconstructs the pressure field directly, sidestepping the grid connectivity and ill-conditioned cells that burden traditional mesh-based methods, which gives it a distinct advantage on unstructured data. Moreover, changes to the SIREN architecture can be used to filter out the noise inherent in velocimetry data.

Link: https://arxiv.org/abs/2501.17987
Authors: Renato F. Miotto, William R. Wolf, Fernando Zigunov
Institutions: Not listed
Categories: Computer Vision and Pattern Recognition (cs.CV); Fluid Dynamics (physics.flu-dyn)
Note:

Click to view abstract

Abstract:This work presents a novel approach for pressure field reconstruction from image velocimetry data using SIREN (Sinusoidal Representation Network), emphasizing its effectiveness as an implicit neural representation in noisy environments and its mesh-free nature. While we briefly assess two recently proposed methods - one-shot matrix-omnidirectional integration (OS-MODI) and Green’s function integral (GFI) - the primary focus is on the advantages of the SIREN approach. The OS-MODI technique performs well in noise-free conditions and with structured meshes but struggles when applied to unstructured meshes with high aspect ratio. Similarly, the GFI method encounters difficulties due to singularities inherent to the Newtonian kernel. In contrast, the proposed SIREN approach is a mesh-free method that directly reconstructs the pressure field, bypassing the need for an intrinsic grid connectivity and, hence, avoiding the challenges associated with ill-conditioned cells and unstructured meshes. This provides a distinct advantage over traditional mesh-based methods. Moreover, it is shown that changes in the architecture of the SIREN can be used to filter out inherent noise from velocimetry data. This work positions SIREN as a robust and versatile solution for pressure reconstruction, particularly in noisy environments characterized by the absence of mesh structure, opening new avenues for innovative applications in this field.
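
A SIREN is a coordinate MLP with sine activations and a specific initialization. The sketch below shows such a network mapping 2D coordinates to pressure, with spatial gradients obtained by autograd — which is what makes the approach mesh-free. Layer widths and the w0 value are conventional defaults, not the configuration used in the paper.

```python
import torch
import torch.nn as nn

class SineLayer(nn.Module):
    """One SIREN layer: y = sin(w0 * (W x + b)), with the uniform weight
    initialization from the original SIREN paper (Sitzmann et al.)."""
    def __init__(self, in_f, out_f, w0=30.0, is_first=False):
        super().__init__()
        self.w0 = w0
        self.linear = nn.Linear(in_f, out_f)
        bound = 1.0 / in_f if is_first else (6.0 / in_f) ** 0.5 / w0
        nn.init.uniform_(self.linear.weight, -bound, bound)

    def forward(self, x):
        return torch.sin(self.w0 * self.linear(x))

# A tiny coordinate network mapping (x, y) -> pressure.
model = nn.Sequential(
    SineLayer(2, 64, is_first=True),
    SineLayer(64, 64),
    nn.Linear(64, 1),
)
coords = torch.rand(1000, 2, requires_grad=True)  # scattered sample points
pressure = model(coords)
# Spatial pressure gradients come from autograd -- no grid connectivity is
# needed, which is what makes the representation mesh-free:
grad_p = torch.autograd.grad(pressure.sum(), coords, create_graph=True)[0]
print(pressure.shape, grad_p.shape)
```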
zh

[CV-45] Efficient Feature Fusion for UAV Object Detection

【Quick Read】: This paper addresses object detection in UAV remote sensing images, which is hampered by unstable image quality, small object sizes, complex backgrounds, and environmental occlusion; small objects occupy only a tiny portion of the image and are especially hard to localize accurately. Existing multi-scale feature fusion methods help to some extent but struggle to balance classification and localization performance for small objects. The key is a new feature fusion framework that integrates hybrid upsampling and downsampling modules, so feature maps from different network depths can be flexibly resized to arbitrary resolutions, enabling cross-layer connections and multi-scale feature fusion that strengthen small-object representation while aggregating global context for more robust classification. Integrated into the YOLO-V10 model, the method improves average precision (AP) by 2% over the baseline while keeping the parameter count unchanged.

Link: https://arxiv.org/abs/2501.17983
Authors: Xudong Wang, Chaomin Shen, Yaxin Peng
Institutions: East China Normal University; East China Normal University; Shanghai University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Note:

Click to view abstract

Abstract:Object detection in unmanned aerial vehicle (UAV) remote sensing images poses significant challenges due to unstable image quality, small object sizes, complex backgrounds, and environmental occlusions. Small objects, in particular, occupy minimal portions of images, making their accurate detection highly difficult. Existing multi-scale feature fusion methods address these challenges to some extent by aggregating features across different resolutions. However, these methods often fail to effectively balance classification and localization performance for small objects, primarily due to insufficient feature representation and imbalanced network information flow. In this paper, we propose a novel feature fusion framework specifically designed for UAV object detection tasks to enhance both localization accuracy and classification performance. The proposed framework integrates hybrid upsampling and downsampling modules, enabling feature maps from different network depths to be flexibly adjusted to arbitrary resolutions. This design facilitates cross-layer connections and multi-scale feature fusion, ensuring improved representation of small objects. Our approach leverages hybrid downsampling to enhance fine-grained feature representation, improving spatial localization of small targets, even under complex conditions. Simultaneously, the upsampling module aggregates global contextual information, optimizing feature consistency across scales and enhancing classification robustness in cluttered scenes. Experimental results on two public UAV datasets demonstrate the effectiveness of the proposed framework. Integrated into the YOLO-V10 model, our method achieves a 2% improvement in average precision (AP) compared to the baseline YOLO-V10 model, while maintaining the same number of parameters. These results highlight the potential of our framework for accurate and efficient UAV object detection.
zh

[CV-46] VoD-3DGS: View-opacity-Dependent 3D Gaussian Splatting

【Quick Read】: This paper addresses the difficulty of reconstructing 3D scenes from images, where light interacts with surfaces differently depending on viewpoint and material; the standard 3D Gaussian Splatting model struggles with view-dependent details such as specular reflections and highlights. The key is to augment each 3D Gaussian's opacity representation with an additional symmetric matrix, making opacity view-dependent so that certain Gaussians can be suppressed from particular viewpoints. This yields a more accurate representation of view-dependent reflections and specular highlights while preserving the scene's integrity, achieving state-of-the-art performance on the Mip-Nerf, Tanks&Temples, Deep Blending, and Nerf-Synthetic datasets at 60 FPS with only a slight increase in memory use.

Link: https://arxiv.org/abs/2501.17978
Authors: Nowak Mateusz, Jarosz Wojciech, Chin Peter
Institutions: Dartmouth College
Categories: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR); Machine Learning (cs.LG)
Note:

Click to view abstract

Abstract:Reconstructing a 3D scene from images is challenging due to the different ways light interacts with surfaces depending on the viewer’s position and the surface’s material. In classical computer graphics, materials can be classified as diffuse or specular, interacting with light differently. The standard 3D Gaussian Splatting model struggles to represent view-dependent content, since it cannot differentiate an object within the scene from the light interacting with its specular surfaces, which produce highlights or reflections. In this paper, we propose to extend the 3D Gaussian Splatting model by introducing an additional symmetric matrix to enhance the opacity representation of each 3D Gaussian. This improvement allows certain Gaussians to be suppressed based on the viewer’s perspective, resulting in a more accurate representation of view-dependent reflections and specular highlights without compromising the scene’s integrity. By allowing the opacity to be view-dependent, our enhanced model achieves state-of-the-art performance on the Mip-Nerf, Tanks&Temples, Deep Blending, and Nerf-Synthetic datasets without a significant loss in rendering speed, achieving 60 FPS, and only incurring a minimal increase in memory used.
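
One plausible reading of the added symmetric matrix is a learnable quadratic form in the view direction that modulates each Gaussian's opacity. The sketch below implements that reading; the paper's exact parameterization and activation may differ.

```python
import torch

def view_dependent_opacity(base_opacity: torch.Tensor,
                           sym_params: torch.Tensor,
                           view_dir: torch.Tensor) -> torch.Tensor:
    """Scale each Gaussian's opacity by a quadratic form d^T M d in the view
    direction d, with M a learnable symmetric 3x3 matrix per Gaussian.

    base_opacity: (N,)   sym_params: (N, 6)   view_dir: (3,) unit vector
    """
    i, j = torch.triu_indices(3, 3)
    M = torch.zeros(sym_params.shape[0], 3, 3)
    M[:, i, j] = sym_params                    # fill the upper triangle
    M = 0.5 * (M + M.transpose(1, 2))          # symmetrize
    q = torch.einsum("i,nij,j->n", view_dir, M, view_dir)
    return base_opacity * torch.sigmoid(q)     # suppress per viewpoint

n = 4
alpha = torch.rand(n)
params = torch.randn(n, 6, requires_grad=True)  # 6 dof of a symmetric matrix
d = torch.tensor([0.0, 0.0, 1.0])               # camera view direction
print(view_dependent_opacity(alpha, params, d))
```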
zh

[CV-47] ransRAD: Retentive Vision Transformer for Enhanced Radar Object Detection

【Quick Read】: This paper addresses the unreliability of cameras and LiDAR in low-light conditions and adverse weather, and the inherent weaknesses of radar data for object detection, such as low resolution, high noise, and lack of visual information. The key is TransRAD, a novel 3D radar object detection model that leverages the Retentive Manhattan Self-Attention (MaSA) mechanism of the Retentive Vision Transformer (RMT) to learn features more effectively from information-dense radar Range-Azimuth-Doppler (RAD) data, enabling precise 3D radar detection. In addition, a Location-Aware NMS is proposed to effectively mitigate duplicate bounding boxes, a common issue in deep radar object detection.

Link: https://arxiv.org/abs/2501.17977
Authors: Lei Cheng, Siyang Cao
Institutions: The University of Arizona
Categories: Computer Vision and Pattern Recognition (cs.CV); Systems and Control (eess.SY)
Note: Accepted by IEEE Transactions on Radar Systems

Click to view abstract

Abstract:Despite significant advancements in environment perception capabilities for autonomous driving and intelligent robotics, cameras and LiDARs remain notoriously unreliable in low-light conditions and adverse weather, which limits their effectiveness. Radar serves as a reliable and low-cost sensor that can effectively complement these limitations. However, radar-based object detection has been underexplored due to the inherent weaknesses of radar data, such as low resolution, high noise, and lack of visual information. In this paper, we present TransRAD, a novel 3D radar object detection model designed to address these challenges by leveraging the Retentive Vision Transformer (RMT) to more effectively learn features from information-dense radar Range-Azimuth-Doppler (RAD) data. Our approach leverages the Retentive Manhattan Self-Attention (MaSA) mechanism provided by RMT to incorporate explicit spatial priors, thereby enabling more accurate alignment with the spatial saliency characteristics of radar targets in RAD data and achieving precise 3D radar detection across Range-Azimuth-Doppler dimensions. Furthermore, we propose Location-Aware NMS to effectively mitigate the common issue of duplicate bounding boxes in deep radar object detection. The experimental results demonstrate that TransRAD outperforms state-of-the-art methods in both 2D and 3D radar detection tasks, achieving higher accuracy, faster inference speed, and reduced computational complexity. Code is available at this https URL
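
Location-Aware NMS is not specified in detail in the abstract, but the underlying idea — suppressing duplicates by detection-center distance rather than box IoU, which suits radar detections whose boxes overlap poorly — can be sketched as follows. The distance threshold and the RAD-bin coordinates are assumptions.

```python
import numpy as np

def location_aware_nms(centers: np.ndarray, scores: np.ndarray,
                       dist_thresh: float = 2.0) -> list[int]:
    """Greedy NMS keyed on detection-center distance instead of box IoU.

    centers: (N, 3) detection centers (range, azimuth, Doppler bins).
    scores:  (N,)   confidence scores.
    """
    order = np.argsort(-scores)               # highest confidence first
    keep: list[int] = []
    for idx in order:
        d = [np.linalg.norm(centers[idx] - centers[k]) for k in keep]
        if not d or min(d) > dist_thresh:     # far from every kept detection
            keep.append(int(idx))
    return keep

dets = np.array([[10.0, 5.0, 1.0],
                 [10.5, 5.2, 1.1],            # near-duplicate of the first
                 [30.0, 2.0, -1.0]])
conf = np.array([0.9, 0.8, 0.7])
print(location_aware_nms(dets, conf))         # -> [0, 2]
```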
zh

[CV-48] Unsupervised Patch-GAN with Targeted Patch Ranking for Fine-Grained Novelty Detection in Medical Imaging

【Quick Read】: This paper addresses the challenge of detecting rare anomalies in medical imaging, where limited labeled data makes small, highly variable, and subtle abnormalities hard to find. The key is an unsupervised Patch-GAN framework that learns fine-grained normal-specific features and divides images into patches to capture both local detail and global structure, enabling sensitive detection and localization of subtle anomalies. A patch-ranking mechanism prioritizes regions with higher abnormality scores, reinforcing the alignment between local patch discrepancies and the global image context and overcoming the limitations of whole-image evaluation.

Link: https://arxiv.org/abs/2501.17906
Authors: Jingkun Chen, Guang Yang, Xiao Zhang, Jingchao Peng, Tianlu Zhang, Jianguo Zhang, Jungong Han, Vicente Grau
Institutions: Not listed
Categories: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
Note:

Click to view abstract

Abstract:Detecting novel anomalies in medical imaging is challenging due to the limited availability of labeled data for rare abnormalities, which often display high variability and subtlety. This challenge is further compounded when small abnormal regions are embedded within larger normal areas, as whole-image predictions frequently overlook these subtle deviations. To address these issues, we propose an unsupervised Patch-GAN framework designed to detect and localize anomalies by capturing both local detail and global structure. Our framework first reconstructs masked images to learn fine-grained, normal-specific features, allowing for enhanced sensitivity to minor deviations from normality. By dividing these reconstructed images into patches and assessing the authenticity of each patch, our approach identifies anomalies at a more granular level, overcoming the limitations of whole-image evaluation. Additionally, a patch-ranking mechanism prioritizes regions with higher abnormal scores, reinforcing the alignment between local patch discrepancies and the global image context. Experimental results on the ISIC 2016 skin lesion and BraTS 2019 brain tumor datasets validate our framework’s effectiveness, achieving AUCs of 95.79% and 96.05%, respectively, and outperforming three state-of-the-art baselines.
zh

[CV-49] VidSole: A Multimodal Dataset for Joint Kinetics Quantification and Disease Detection with Deep Learning AAAI2025

【Quick Read】: This paper addresses large-scale, cost-effective biomechanical analysis of joint loading for diagnosing gait-related diseases such as knee osteoarthritis. The key contributions are the development and deployment of novel instrumented insoles, the creation of a large multimodal biomechanics dataset (VidSole), and a baseline deep learning pipeline for predicting internal joint loading factors. The insoles measure tri-axial forces and moments at five high-pressure points under the foot, and VidSole pairs these measurements with RGB video from two viewpoints, 3D motion capture, and force plate data for over 2,600 trials of 52 participants performing four activities of daily living (sit-to-stand, stand-to-sit, walking, and running). Feeding the insole data and video-extractable kinematic parameters (e.g., pose, knee angle) into the deep learning pipeline yields 99.02% activity classification accuracy and knee adduction moment (KAM) estimation with a mean absolute error below 0.5% of body weight times height, the current threshold for accurately detecting knee osteoarthritis with KAM.

Link: https://arxiv.org/abs/2501.17890
Authors: Archit Kambhamettu, Samantha Snyder, Maliheh Fakhar, Samuel Audia, Ross Miller, Jae Kun Shim, Aniket Bera
Institutions: Not listed
Categories: Computer Vision and Pattern Recognition (cs.CV); Signal Processing (eess.SP)
Note: Accepted by AAAI 2025 Special Track on AI for Social Impact

Click to view abstract

Abstract:Understanding internal joint loading is critical for diagnosing gait-related diseases such as knee osteoarthritis; however, current methods of measuring joint risk factors are time-consuming, expensive, and restricted to lab settings. In this paper, we enable the large-scale, cost-effective biomechanical analysis of joint loading via three key contributions: the development and deployment of novel instrumented insoles, the creation of a large multimodal biomechanics dataset (VidSole), and a baseline deep learning pipeline to predict internal joint loading factors. Our novel instrumented insole measures the tri-axial forces and moments across five high-pressure points under the foot. VidSole consists of the forces and moments measured by these insoles along with corresponding RGB video from two viewpoints, 3D body motion capture, and force plate data for over 2,600 trials of 52 diverse participants performing four fundamental activities of daily living (sit-to-stand, stand-to-sit, walking, and running). We feed the insole data and kinematic parameters extractable from video (i.e., pose, knee angle) into a deep learning pipeline consisting of an ensemble Gated Recurrent Unit (GRU) activity classifier followed by activity-specific Long Short Term Memory (LSTM) regression networks to estimate knee adduction moment (KAM), a biomechanical risk factor for knee osteoarthritis. The successful classification of activities at an accuracy of 99.02 percent and KAM estimation with mean absolute error (MAE) less than 0.5 percent of body weight times height (the current threshold for accurately detecting knee osteoarthritis with KAM) illustrate the usefulness of our dataset for future research and clinical settings.
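
The two-stage pipeline (an activity classifier routing each trial to an activity-specific regressor) can be sketched in PyTorch. Input dimensionality, hidden sizes, and the single-model classifier (the paper uses an ensemble of GRUs) are simplifying assumptions.

```python
import torch
import torch.nn as nn

ACTIVITIES = ["sit_to_stand", "stand_to_sit", "walking", "running"]

class ActivityClassifier(nn.Module):
    """GRU over the insole/kinematics sequence -> activity logits."""
    def __init__(self, in_dim=36, hidden=64):
        super().__init__()
        self.gru = nn.GRU(in_dim, hidden, batch_first=True)
        self.head = nn.Linear(hidden, len(ACTIVITIES))

    def forward(self, x):                       # x: (batch, time, features)
        _, h = self.gru(x)
        return self.head(h[-1])                 # logits over activities

class KamRegressor(nn.Module):
    """Activity-specific LSTM -> per-timestep KAM estimate."""
    def __init__(self, in_dim=36, hidden=64):
        super().__init__()
        self.lstm = nn.LSTM(in_dim, hidden, batch_first=True)
        self.head = nn.Linear(hidden, 1)

    def forward(self, x):
        out, _ = self.lstm(x)
        return self.head(out).squeeze(-1)       # (batch, time)

# Route each trial to the regressor of its predicted activity.
clf = ActivityClassifier()
regressors = {a: KamRegressor() for a in ACTIVITIES}
trial = torch.randn(1, 200, 36)                 # one 200-step trial
activity = ACTIVITIES[clf(trial).argmax(dim=-1).item()]
kam = regressors[activity](trial)
print(activity, kam.shape)
```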
zh

[CV-50] Task-based Regularization in Penalized Least-Squares for Binary Signal Detection Tasks in Medical Image Denoising

【Quick Read】: This paper addresses the shortcomings of traditional penalized least-squares (PLS) methods and CNN-based supervised learning for medical image denoising, in particular how to preserve task-relevant feature information while removing noise. The key is a task-based regularization strategy associated with the likelihood of linear test statistics of noisy images under Gaussian noise models. The method requires no ground-truth noise-free image data and solves an individual optimization problem for each image, effectively improving signal detectability in the denoised images.

Link: https://arxiv.org/abs/2501.18418
Authors: Wentao Chen, Tianming Xu, Weimin Zhou
Institutions: University of Michigan-Shanghai Jiao Tong University Joint Institute, Shanghai Jiao Tong University, China; Global Institute of Future Technology, Shanghai Jiao Tong University, China; Wyant College of Optical Sciences, University of Arizona, USA; Department of Medical Imaging, University of Arizona, USA
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Note:

Click to view abstract

Abstract:Image denoising algorithms have been extensively investigated for medical imaging. To perform image denoising, penalized least-squares (PLS) problems can be designed and solved, in which the penalty term encodes prior knowledge of the object being imaged. Sparsity-promoting penalties, such as total variation (TV), have been a popular choice for regularizing image denoising problems. However, such hand-crafted penalties may not be able to preserve task-relevant information in measured image data and can lead to oversmoothed image appearances and patchy artifacts that degrade signal detectability. Supervised learning methods that employ convolutional neural networks (CNNs) have emerged as a popular approach to denoising medical images. However, studies have shown that CNNs trained with loss functions based on traditional image quality measures can lead to a loss of task-relevant information in images. Some previous works have investigated task-based loss functions that employ model observers for training the CNN denoising models. However, such training processes typically require a large number of noisy and ground-truth (noise-free or low-noise) image data pairs. In this work, we propose a task-based regularization strategy for use with PLS in medical image denoising. The proposed task-based regularization is associated with the likelihood of linear test statistics of noisy images for Gaussian noise models. The proposed method does not require ground-truth image data and solves an individual optimization problem for denoising each image. Computer-simulation studies are conducted that consider a multivariate-normally distributed (MVN) lumpy background and a binary texture background. It is demonstrated that the proposed regularization strategy can effectively improve signal detectability in denoised images.
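
Since the method solves an individual optimization problem per image, a per-image gradient descent sketch conveys the structure. The quadratic smoothness penalty and the specific task term below (protecting a linear observer's test statistic w @ f) are hedged stand-ins for the paper's likelihood-based regularizer, not its actual formulation.

```python
import numpy as np

def denoise_task_pls(g, w, lam=0.5, gamma=5.0, iters=500, lr=0.05):
    """Gradient descent on a PLS denoising objective with a task-aware term:

        J(f) = ||f - g||^2 + lam * ||D f||^2 + gamma * (w @ (f - g))^2

    D is a forward-difference operator (a simple smoothness prior standing
    in for the paper's penalty); the last term protects the linear test
    statistic w @ f, which a signal-agnostic smoother would otherwise shrink.
    """
    f = g.astype(float).copy()
    for _ in range(iters):
        df = np.diff(f)
        grad_smooth = np.zeros_like(f)
        grad_smooth[:-1] -= 2.0 * df           # d/df_k of sum (f_{k+1}-f_k)^2
        grad_smooth[1:] += 2.0 * df
        grad = (2.0 * (f - g) + lam * grad_smooth
                + 2.0 * gamma * (w @ (f - g)) * w)
        f -= lr * grad
    return f

rng = np.random.default_rng(0)
signal = np.zeros(64); signal[28:36] = 1.0     # small "lesion" to detect
noisy = signal + 0.3 * rng.normal(size=64)
w = signal / np.linalg.norm(signal)            # linear observer template
plain = denoise_task_pls(noisy, w, gamma=0.0)  # smoothing only
task = denoise_task_pls(noisy, w)              # with the task term
print(w @ noisy, w @ plain, w @ task)          # task term keeps w@f closer
```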
zh

[CV-51] The iToBoS dataset: skin region images extracted from 3D total body photographs for lesion detection

【Quick Read】: This paper addresses the lack of surrounding-skin context in existing skin cancer diagnosis datasets. The key is the creation of the iToBoS dataset, containing 16,954 skin region images from 100 participants captured with 3D total body photography; each image corresponds to roughly a 7×9 cm section of skin, with all suspicious lesions annotated with bounding boxes. The dataset additionally provides metadata such as anatomical location, age group, and sun damage score, to improve algorithm training and benchmarking, enable early skin cancer detection, and support deployment of the technology in non-clinical environments.

Link: https://arxiv.org/abs/2501.18270
Authors: Anup Saha, Joseph Adeola, Nuria Ferrera, Adam Mothershaw, Gisele Rezze, Séraphin Gaborit, Brian D’Alessandro, James Hudson, Gyula Szabó, Balazs Pataki, Hayat Rajani, Sana Nazari, Hassan Hayat, Clare Primiero, H. Peter Soyer, Josep Malvehy, Rafael Garcia
Institutions: Computer Vision and Robotics Research Institute, University of Girona; Dermatology Department, Hospital Clínic Barcelona, Universitat de Barcelona; Frazer Institute, The University of Queensland, Dermatology Research Center; ISAHIT; Canfield Scientific, Inc.; V7; HUN-REN Institute for Computer Science and Control; Dermatology Department, University of Trieste
Categories: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Note: Article Submitted to Scientific Data

Click to view abstract

Abstract:Artificial intelligence has significantly advanced skin cancer diagnosis by enabling rapid and accurate detection of malignant lesions. In this domain, most publicly available image datasets consist of single, isolated skin lesions positioned at the center of the image. While these lesion-centric datasets have been fundamental for developing diagnostic algorithms, they lack the context of the surrounding skin, which is critical for improving lesion detection. The iToBoS dataset was created to address this challenge. It includes 16,954 images of skin regions from 100 participants, captured using 3D total body photography. Each image roughly corresponds to a 7×9 cm section of skin with all suspicious lesions annotated using bounding boxes. Additionally, the dataset provides metadata such as anatomical location, age group, and sun damage score for each image. This dataset aims to facilitate training and benchmarking of algorithms, with the goal of enabling early detection of skin cancer and deployment of this technology in non-clinical environments.
zh

[CV-52] Revisiting PsiDONet: microlocally inspired filters for incomplete-data tomographic reconstructions

【Quick Read】: This paper is devoted to improving reconstruction quality in sparse-angle tomography. The key is to refine the original ΨDONet method with a filter structure specifically designed around the streak artifact singularities caused by incomplete data. This considerably lowers the number of learnable parameters while preserving (or even slightly improving) the quality of reconstructions from limited-angle data, and provides a proof-of-concept for sparse-angle tomographic data.

Link: https://arxiv.org/abs/2501.18219
Authors: Tatiana A. Bubba, Luca Ratti, Andrea Sebastiani
Institutions: Department of Mathematics and Computer Science, University of Ferrara; Department of Mathematics, University of Bologna; Department of Physics, Computer Science and Mathematics, University of Modena and Reggio Emilia
Categories: Optimization and Control (math.OC); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Note:

Click to view abstract

Abstract:In this paper, we revisit a supervised learning approach based on unrolling, known as ΨDONet, by providing a deeper microlocal interpretation for its theoretical analysis, and extending its study to the case of sparse-angle tomography. Furthermore, we refine the implementation of the original ΨDONet considering special filters whose structure is specifically inspired by the streak artifact singularities characterizing tomographic reconstructions from incomplete data. This allows us to considerably lower the number of (learnable) parameters while preserving (or even slightly improving) the same quality for the reconstructions from limited-angle data and providing a proof-of-concept for the case of sparse-angle tomographic data.
zh

[CV-53] Scattering approach to diffusion quantifies axonal damage in brain injury

【Quick Read】: This paper addresses the need, for early diagnosis and noninvasive monitoring of neurological disorders, to sensitively detect cellular-level changes that occur well before the volumetric changes observable at the millimeter resolution of medical imaging modalities. The key is to reveal the sensitivity of time-dependent diffusion MRI (dMRI) to micrometer-scale changes in axonal morphology: scattering theory identifies the two parameters that determine the diffusive dynamics of water inside axons, namely the average reciprocal cross-section and the variance of long-range cross-sectional fluctuations. This theoretical advance makes it possible to predict dMRI metrics associated with alterations across tens of thousands of axons in seconds rather than months of simulation, demonstrated in a rat model of traumatic brain injury. The approach bridges the gap between micrometer and millimeter resolutions, offering quantitative, objective biomarkers applicable to a broad spectrum of neurological disorders.

Link: https://arxiv.org/abs/2501.18167
Authors: Ali Abdollahzadeh, Ricardo Coronado-Leija, Hong-Hsi Lee, Alejandra Sierra, Els Fieremans, Dmitry S. Novikov
Institutions: Not listed
Categories: Medical Physics (physics.med-ph); Computer Vision and Pattern Recognition (cs.CV); Biological Physics (physics.bio-ph)
Note:

Click to view abstract

Abstract:Early diagnosis and noninvasive monitoring of neurological disorders require sensitivity to elusive cellular-level alterations that occur much earlier than volumetric changes observable with the millimeter-resolution of medical imaging modalities. Morphological changes in axons, such as axonal varicosities or beadings, are observed in neurological disorders, as well as in development and aging. Here, we reveal the sensitivity of time-dependent diffusion MRI (dMRI) to axonal morphology at the micrometer scale. Scattering theory uncovers the two parameters that determine the diffusive dynamics of water in axons: the average reciprocal cross-section and the variance of long-range cross-sectional fluctuations. This theoretical development allowed us to predict dMRI metrics sensitive to axonal alterations across tens of thousands of axons in seconds rather than months of simulations in a rat model of traumatic brain injury. Our approach bridges the gap between micrometers and millimeters in resolution, offering quantitative, objective biomarkers applicable to a broad spectrum of neurological disorders.
zh

[CV-54] Using Computer Vision for Skin Disease Diagnosis in Bangladesh Enhancing Interpretability and Transparency in Deep Learning Models for Skin Cancer Classification

【Quick Read】: This paper addresses the lack of interpretability of deep learning models for skin cancer diagnosis, particularly in Bangladesh. The key to the solution is the combined use of saliency maps and attention maps to visualize the critical features that influence the model's diagnoses, thereby enhancing interpretability.

Link: https://arxiv.org/abs/2501.18161
Authors: Rafiul Islam, Jihad Khan Dipu, Mehedi Hasan Tusar
Institutions: Not listed
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Note: 18 pages

Click to view abstract

Abstract:With over 2 million new cases identified annually, skin cancer is the most prevalent type of cancer globally and the second most common in Bangladesh, following breast cancer. Early detection and treatment are crucial for enhancing patient outcomes; however, Bangladesh faces a shortage of dermatologists and qualified medical professionals capable of diagnosing and treating skin cancer. As a result, many cases are diagnosed only at advanced stages. Research indicates that deep learning algorithms can effectively classify skin cancer images. However, these models typically lack interpretability, making it challenging to understand their decision-making processes. This lack of clarity poses barriers to utilizing deep learning in improving skin cancer detection and treatment. In this article, we present a method aimed at enhancing the interpretability of deep learning models for skin cancer classification in Bangladesh. Our technique employs a combination of saliency maps and attention maps to visualize critical features influencing the model’s diagnoses.
zh
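下面是基于梯度的显著性图(Saliency Map)的最小示意(PyTorch),用于说明摘要中“可视化影响模型诊断的关键特征”的基本思路;其中的 resnet18 与随机输入均为占位假设,并非论文所用的模型或数据:

```python
import torch
import torchvision.models as models

model = models.resnet18(weights=None)   # 占位分类器,实际应为训练好的皮肤癌分类模型
model.eval()

image = torch.rand(1, 3, 224, 224, requires_grad=True)  # 占位输入,实际应为皮肤镜图像

logits = model(image)
score = logits[0, logits.argmax()]      # 取预测类别对应的得分
score.backward()                        # 反向传播,得到得分对输入像素的梯度

# 显著性图:对通道维取梯度绝对值的最大值,得到每个像素对诊断结果的影响强度
saliency = image.grad.abs().max(dim=1)[0].squeeze(0)
print(saliency.shape)                   # torch.Size([224, 224])
```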

[CV-55] Influence of High-Performance Image-to-Image Translation Networks on Clinical Visual Assessment and Outcome Prediction: Utilizing Ultrasound to MRI Translation in Prostate Cancer

【速读】:该论文旨在评估图像到图像翻译(Image-to-Image Translation, I2I)网络在临床常规应用中的有效性和适应性,特别关注其在将超声(Ultrasound, US)图像转换为磁共振成像(MRI)扫描中的表现。论文的关键解决方案在于采用2D-Pix2Pix网络进行图像转换,并通过引入放射组学特征(Radiomic Features, RF)分析来评估其性能。研究结果表明,2D-Pix2Pix网络在低级特征发现和整体误差与相似性指标方面表现出色,但仍然需要改进以提高低级特征识别能力。此外,基于合成图像的分类方法优于基于原始超声图像的方法。

链接: https://arxiv.org/abs/2501.18109
作者: Mohammad R. Salmanpour,Amin Mousavi,Yixi Xu,William B Weeks,Ilker Hacihaliloglu
机构: 未知
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Biological Physics (physics.bio-ph)
备注: 9 pages, 4 figures and 1 table

点击查看摘要

Abstract:Purpose: This study examines the core traits of image-to-image translation (I2I) networks, focusing on their effectiveness and adaptability in everyday clinical settings. Methods: We have analyzed data from 794 patients diagnosed with prostate cancer (PCa), using ten prominent 2D/3D I2I networks to convert ultrasound (US) images into MRI scans. We also introduced a new analysis of Radiomic features (RF) via the Spearman correlation coefficient to explore whether networks with high performance (SSIM > 85%) could detect subtle RFs. Our study further examined synthetic images by 7 invited physicians. As a final evaluation study, we investigated the improvements achieved using the synthetic MRI data on two traditional machine learning methods and one deep learning method. Results: In quantitative assessment, the 2D-Pix2Pix network substantially outperformed the other 7 networks, with an average SSIM~0.855. The RF analysis revealed that 76 out of 186 RFs were identified using the 2D-Pix2Pix algorithm alone, although half of the RFs were lost during the translation process. A detailed qualitative review by 7 medical doctors noted a deficiency in low-level feature recognition in I2I tasks. Furthermore, the study found that synthesized image-based classification outperformed US image-based classification with an average accuracy and AUC~0.93. Conclusion: This study showed that while 2D-Pix2Pix outperformed cutting-edge networks in low-level feature discovery and overall error and similarity metrics, it still requires improvement in low-level feature performance, as highlighted by Group 3. Further, the study found that synthetic image-based classification outperformed original US image-based methods.
zh
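下面用一段最小示意代码(SciPy)说明摘要中基于 Spearman 相关系数的放射组学特征(RF)分析思路;特征矩阵为随机占位数据,阈值 0.7 亦为示例假设,并非论文的实际数据或判定标准:

```python
import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(0)
n_patients, n_features = 100, 186       # 摘要提到共 186 个放射组学特征

rf_real = rng.normal(size=(n_patients, n_features))        # 真实 MRI 的 RF(占位)
rf_synth = rf_real + 0.5 * rng.normal(size=rf_real.shape)  # 合成 MRI 的 RF(占位)

# 对每个特征计算患者间的 Spearman 相关,统计被“保留”的特征个数
rhos = np.array([spearmanr(rf_real[:, j], rf_synth[:, j]).correlation
                 for j in range(n_features)])
print(f"features preserved (rho > 0.7): {(rhos > 0.7).sum()} / {n_features}")
```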

[CV-56] Visualization of Organ Movements Using Automatic Region Segmentation of Swallowing CT

【速读】:该论文旨在开发一种用于自动分割吞咽过程中四维计算机断层扫描(4D-CT)图像区域的人工智能(AI)。解决方案的关键在于使用基于nnU-Net的三维卷积模型,并采用留一法交叉验证进行学习和评估。训练nnU-Net的轮数(epochs)设定为100,Dice系数被用作评估AI区域分割精度的指标。

链接: https://arxiv.org/abs/2501.17897
作者: Yukihiro Michiwaki,Takahiro Kikuchi,Takashi Ijiri,Yoko Inamoto,Hiroshi Moriya,Takumi Ogawa,Ryota Nakatani,Yuto Masaki,Yoshito Otake,Yoshinobu Sato
机构: 未知
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Medical Physics (physics.med-ph)
备注: 8 pages, 5 figures, 1 table

点击查看摘要

Abstract:This study presents the first report on the development of an artificial intelligence (AI) for automatic region segmentation of four-dimensional computer tomography (4D-CT) images during swallowing. The material consists of 4D-CT images taken during swallowing. Additionally, data for verifying the practicality of the AI were obtained from 4D-CT images during mastication and swallowing. The ground truth data for the region segmentation for the AI were created from five 4D-CT datasets of swallowing. A 3D convolutional model of nnU-Net was used for the AI. The learning and evaluation method for the AI was leave-one-out cross-validation. The number of epochs for training the nnU-Net was 100. The Dice coefficient was used as a metric to assess the AI’s region segmentation accuracy. Regions with a median Dice coefficient of 0.7 or higher included the bolus, bones, tongue, and soft palate. Regions with a Dice coefficient below 0.7 included the thyroid cartilage and epiglottis. Factors that reduced the Dice coefficient included metal artifacts caused by dental crowns in the bolus and the speed of movement for the thyroid cartilage and epiglottis. In practical verification of the AI, no significant misrecognition was observed for facial bones, jaw bones, or the tongue. However, regions such as the hyoid bone, thyroid cartilage, and epiglottis were not fully delineated during fast movement. It is expected that future research will improve the accuracy of the AI’s region segmentation, though the risk of misrecognition will always exist. Therefore, the development of tools for efficiently correcting the AI’s segmentation results is necessary. AI-based visualization is expected to contribute not only to the deepening of motion analysis of organs during swallowing but also to improving the accuracy of swallowing CT by clearly showing the current state of its precision.
zh
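下面给出 Dice 系数(摘要中用于评估 AI 区域分割精度的指标)的最小计算示例(NumPy);掩码为随机占位数据,实际应为 nnU-Net 的预测结果与人工标注:

```python
import numpy as np

def dice_coefficient(pred: np.ndarray, gt: np.ndarray, eps: float = 1e-8) -> float:
    """Dice = 2|P ∩ G| / (|P| + |G|),pred 与 gt 为同形状的二值掩码。"""
    pred, gt = pred.astype(bool), gt.astype(bool)
    intersection = np.logical_and(pred, gt).sum()
    return 2.0 * intersection / (pred.sum() + gt.sum() + eps)

rng = np.random.default_rng(0)
pred = rng.random((64, 64, 64)) > 0.5   # 占位预测掩码
gt = rng.random((64, 64, 64)) > 0.5     # 占位人工标注
print(f"Dice: {dice_coefficient(pred, gt):.3f}")
```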

人工智能

[AI-0] DeltaLLM: Compress LLMs with Low-Rank Deltas between Shared Weights

链接: https://arxiv.org/abs/2501.18596
作者: Liana Mikaelyan,Ayyoob Imani,Mathew Salvaris,Parth Pathak,Mohsen Fayyaz
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:We introduce DeltaLLM, a new post-training compression technique to reduce the memory footprint of LLMs. We propose an alternative way of structuring LLMs with weight sharing between layers in subsequent Transformer blocks, along with additional low-rank difference matrices between them. For training, we adopt the progressing module replacement method and show that the lightweight training of the low-rank modules with approximately 30M-40M tokens is sufficient to achieve performance on par with LLMs of comparable sizes trained from scratch. We release the resultant models, DeltaLLAMA and DeltaPHI, with a 12% parameter reduction, retaining 90% of the performance of the base Llama and Phi models on common knowledge and reasoning benchmarks. Our method also outperforms compression techniques JointDrop, LaCo, ShortGPT and SliceGPT with the same number of parameters removed. For example, DeltaPhi 2.9B with a 24% reduction achieves similar average zero-shot accuracies as recovery fine-tuned SlicedPhi 3.3B with a 12% reduction, despite being approximately 400M parameters smaller with no fine-tuning applied. This work provides new insights into LLM architecture design and compression methods when storage space is critical.
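按摘要所述“后续 Transformer 块之间共享权重,并叠加低秩差分矩阵”的结构,下面给出一种可能实现的最小示意(PyTorch);类名 DeltaLinear、维度与秩均为本文假设,并非 DeltaLLM 的官方代码:

```python
import torch
import torch.nn as nn

class DeltaLinear(nn.Module):
    """y = x (W_shared + B A)^T:多层共享同一满秩权重,各层叠加独立的低秩差分。"""
    def __init__(self, shared_weight: torch.Tensor, rank: int = 8):
        super().__init__()
        self.shared_weight = shared_weight                     # 跨层共享,不作为本层参数
        out_f, in_f = shared_weight.shape
        self.A = nn.Parameter(torch.randn(rank, in_f) * 0.01)  # 低秩因子(可训练)
        self.B = nn.Parameter(torch.zeros(out_f, rank))        # 零初始化:初始等价于共享权重

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x @ (self.shared_weight + self.B @ self.A).T

shared = torch.randn(64, 64)                   # 两个相邻块共享同一权重
layer1, layer2 = DeltaLinear(shared), DeltaLinear(shared)
x = torch.randn(4, 64)
print(layer1(x).shape, layer2(x).shape)        # torch.Size([4, 64]) torch.Size([4, 64])
```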

[AI-1] BounTCHA: A CAPTCHA Utilizing Boundary Identification in AI-extended Videos

链接: https://arxiv.org/abs/2501.18565
作者: Lehao Lin,Ke Wang,Maha Abdallah,Wei Cai
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
*备注: 22 pages, 15 figures

点击查看摘要

Abstract:In recent years, the rapid development of artificial intelligence (AI) especially multi-modal Large Language Models (MLLMs), has enabled it to understand text, images, videos, and other multimedia data, allowing AI systems to execute various tasks based on human-provided prompts. However, AI-powered bots have increasingly been able to bypass most existing CAPTCHA systems, posing significant security threats to web applications. This makes the design of new CAPTCHA mechanisms an urgent priority. We observe that humans are highly sensitive to shifts and abrupt changes in videos, while current AI systems still struggle to comprehend and respond to such situations effectively. Based on this observation, we design and implement BounTCHA, a CAPTCHA mechanism that leverages human perception of boundaries in video transitions and disruptions. By utilizing AI’s capability to expand original videos with prompts, we introduce unexpected twists and changes to create a pipeline for generating short videos for CAPTCHA purposes. We develop a prototype and conduct experiments to collect data on humans’ time biases in boundary identification. This data serves as a basis for distinguishing between human users and bots. Additionally, we perform a detailed security analysis of BounTCHA, demonstrating its resilience against various types of attacks. We hope that BounTCHA will act as a robust defense, safeguarding millions of web applications in the AI-driven era.

[AI-2] Semantic Web and Creative AI – A Technical Report from ISWS 2023

链接: https://arxiv.org/abs/2501.18542
作者: Raia Abu Ahmad,Reham Alharbi,Roberto Barile,Martin Böckling,Francisco Bolanos,Sara Bonfitto,Oleksandra Bruns,Irene Celino,Yashrajsinh Chudasama,Martin Critelli,Claudia d’Amato,Giada D’Ippolito,Ioannis Dasoulas,Stefano De Giorgis,Vincenzo De Leo,Chiara Di Bonaventura,Marco Di Panfilo,Daniil Dobriy,John Domingue,Xuemin Duan,Michel Dumontier,Sefika Efeoglu,Ruben Eschauzier,Fakih Ginwa,Nicolas Ferranti,Arianna Graciotti,Philipp Hanisch,George Hannah,Golsa Heidari,Aidan Hogan,Hassan Hussein,Alexane Jouglar,Jan-Christoph Kalo,Manoé Kieffer,Antonis Klironomos,Inês Koch,Weronika Lajewska,Nicolas Lazzari,Mikael Lindekrans,Anna Sofia Lippolis,Majlinda Llugiqi,Eleonora Mancini,Eleonora Marzi,Laura Menotti,Daniela Milon Flores,Soulakshmee Nagowah,Kerstin Neubert,Emetis Niazmand,Ebrahim Norouzi,Beatriz Olarte Martinez,Anouk Michelle Oudshoorn,Andrea Poltronieri,Valentina Presutti,Disha Purohit,Ensiyeh Raoufi,Celian Ringwald,Johanna Rockstroh,Sebastian Rudolph,Harald Sack,Zafar Saeed,Mohammad Javad Saeedizade,Aya Sahbi,Cristian Santini,Aleksandra Simic,Dennis Sommer,Rita Sousa,Mary Ann Tan,Vidyashree Tarikere,Tabea Tietz,Liam Tirpitz,Arnaldo Tomasino,Frank van Harmelen,Joao Vissoci,Caitlin Woods,Bohui Zhang,Xinyue Zhang,Heng Zheng
类目: Artificial Intelligence (cs.AI)
*备注: Technical Report

点击查看摘要

Abstract:The International Semantic Web Research School (ISWS) is a week-long intensive program designed to immerse participants in the field. This document reports a collaborative effort performed by ten teams of students, each guided by a senior researcher as their mentor, attending ISWS 2023. Each team provided a different perspective to the topic of creative AI, substantiated by a set of research questions as the main subject of their investigation. The 2023 edition of ISWS focuses on the intersection of Semantic Web technologies and Creative AI. ISWS 2023 explored various intersections between Semantic Web technologies and creative AI. A key area of focus was the potential of LLMs as support tools for knowledge engineering. Participants also delved into the multifaceted applications of LLMs, including legal aspects of creative content production, humans in the loop, decentralised approaches to multimodal generative AI models, nanopublications and AI for personal scientific knowledge graphs, commonsense knowledge in automatic story and narrative completion, generative AI for art critique, prompt engineering, automatic music composition, commonsense prototyping and conceptual blending, and elicitation of tacit knowledge. As Large Language Models and semantic technologies continue to evolve, new exciting prospects are emerging: a future where the boundaries between creative expression and factual knowledge become increasingly permeable and porous, leading to a world of knowledge that is both informative and inspiring.

[AI-3] A Hybrid Data-Driven Approach For Analyzing And Predicting Inpatient Length Of Stay In Health Centre

链接: https://arxiv.org/abs/2501.18535
作者: Tasfia Noor Chowdhury,Sanjida Afrin Mou,Kazi Naimur Rahman
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: 8 pages, 15 figures

点击查看摘要

Abstract:Patient length of stay (LoS) is a critical metric for evaluating the efficacy of hospital management. The primary objectives are to improve efficiency and reduce costs while enhancing patient outcomes and hospital capacity within the patient journey. By seamlessly merging data-driven techniques with simulation methodologies, the study proposes an all-encompassing framework for the optimization of patient flow. Using a comprehensive dataset of 2.3 million de-identified patient records, we analyzed demographics, diagnoses, treatments, services, costs, and charges with machine learning models (Decision Tree, Logistic Regression, Random Forest, Adaboost, LightGBM) and Python tools (Spark, AWS clusters, dimensionality reduction). Our model predicts patient length of stay (LoS) upon admission using supervised learning algorithms. This hybrid approach enables the identification of key factors influencing LoS, offering a robust framework for hospitals to streamline patient flow and resource utilization. The research focuses on patient flow, corroborating the efficacy of the approach, illustrating decreased patient length of stay within a real healthcare environment. The findings underscore the potential of hybrid data-driven models in transforming hospital management practices. This innovative methodology provides flexible decision-making, training, and patient-flow enhancement; such a system could have major implications for healthcare administration and overall satisfaction with healthcare.

[AI-4] GuardReasoner: Towards Reasoning-based LLM Safeguards

链接: https://arxiv.org/abs/2501.18492
作者: Yue Liu,Hongcheng Gao,Shengfang Zhai,Jun Xia,Tianyi Wu,Zhiwei Xue,Yulin Chen,Kenji Kawaguchi,Jiaheng Zhang,Bryan Hooi
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: 22 pages, 18 figures

点击查看摘要

Abstract:As LLMs increasingly impact safety-critical applications, ensuring their safety using guardrails remains a key challenge. This paper proposes GuardReasoner, a new safeguard for LLMs, by guiding the guard model to learn to reason. Concretely, we first create the GuardReasonerTrain dataset, which consists of 127K samples with 460K detailed reasoning steps. Then, we introduce reasoning SFT to unlock the reasoning capability of guard models. In addition, we present hard sample DPO to further strengthen their reasoning ability. In this manner, GuardReasoner achieves better performance, explainability, and generalizability. Extensive experiments and analyses on 13 benchmarks of 3 guardrail tasks demonstrate its superiority. Remarkably, GuardReasoner 8B surpasses GPT-4o+CoT by 5.74% and LLaMA Guard 3 8B by 20.84% F1 score on average. We release the training data, code, and models with different scales (1B, 3B, 8B) of GuardReasoner: this https URL.

[AI-5] Curriculum-based Sample Efficient Reinforcement Learning for Robust Stabilization of a Quadrotor

链接: https://arxiv.org/abs/2501.18490
作者: Fausto Mauricio Lagos Suarez,Akshit Saradagi,Vidya Sumathy,Shruti Kotpaliwar,George Nikolakopoulos
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
*备注: 8 pages, 7 figures

点击查看摘要

Abstract:This article introduces a curriculum learning approach to develop a reinforcement learning-based robust stabilizing controller for a Quadrotor that meets predefined performance criteria. The learning objective is to achieve desired positions from random initial conditions while adhering to both transient and steady-state performance specifications. This objective is challenging for conventional one-stage end-to-end reinforcement learning, due to the strong coupling between position and orientation dynamics, the complexity in designing and tuning the reward function, and poor sample efficiency, which necessitates substantial computational resources and leads to extended convergence times. To address these challenges, this work decomposes the learning objective into a three-stage curriculum that incrementally increases task complexity. The curriculum begins with learning to achieve stable hovering from a fixed initial condition, followed by progressively introducing randomization in initial positions, orientations and velocities. A novel additive reward function is proposed, to incorporate transient and steady-state performance specifications. The results demonstrate that the Proximal Policy Optimization (PPO)-based curriculum learning approach, coupled with the proposed reward structure, achieves superior performance compared to a single-stage PPO-trained policy with the same reward function, while significantly reducing computational resource requirements and convergence time. The curriculum-trained policy’s performance and robustness are thoroughly validated under random initial conditions and in the presence of disturbances.

[AI-6] CLoQ: Enhancing Fine-Tuning of Quantized LLMs via Calibrated LoRA Initialization

链接: https://arxiv.org/abs/2501.18475
作者: Yanxia Deng,Aozhong Zhang,Naigang Wang,Selcuk Gurses,Zi Yang,Penghang Yin
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Fine-tuning large language models (LLMs) using low-rank adaptation (LoRA) has become a highly efficient approach for downstream tasks, particularly in scenarios with limited computational resources. However, applying LoRA techniques to quantized LLMs poses unique challenges due to the reduced representational precision of quantized weights. In this paper, we introduce CLoQ (Calibrated LoRA initialization for Quantized LLMs), a simplistic initialization strategy designed to overcome these challenges. Our approach focuses on minimizing the layer-wise discrepancy between the original LLM and its quantized counterpart with LoRA components during initialization. By leveraging a small calibration dataset, CLoQ quantizes a pre-trained LLM and determines the optimal LoRA components for each layer, ensuring a strong foundation for subsequent fine-tuning. A key contribution of this work is a novel theoretical result that enables the accurate and closed-form construction of these optimal LoRA components. We validate the efficacy of CLoQ across multiple tasks such as language generation, arithmetic reasoning, and commonsense reasoning, demonstrating that it consistently outperforms existing LoRA fine-tuning methods for quantized LLMs, especially at ultra low-bit widths.
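摘要提到最优 LoRA 组件可以闭式构造。若以 Frobenius 范数度量逐层差异,一个自然的猜测是对量化残差做截断 SVD(由 Eckart-Young 定理给出最优低秩近似);下面的示意代码(NumPy)仅演示这一假设性思路,量化方式亦为简化占位,并非 CLoQ 的官方算法:

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(128, 128))          # 原始全精度权重(占位)

# 简化的对称均匀量化(仅作占位,并非 CLoQ 的量化方案)
scale = np.abs(W).max() / 7
Q = np.round(W / scale) * scale

# 对量化残差 R = W - Q 做截断 SVD,取前 r 个奇异分量作为 LoRA 的初始化
R = W - Q
U, S, Vt = np.linalg.svd(R, full_matrices=False)
r = 16
B = U[:, :r] * S[:r]                     # (out, r)
A = Vt[:r, :]                            # (r, in)

print(f"||W - Q||_F        = {np.linalg.norm(R):.3f}")
print(f"||W - (Q + BA)||_F = {np.linalg.norm(W - (Q + B @ A)):.3f}")
```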

[AI-7] Beyond Instructed Tasks: Recognizing In-the-Wild Reading Behaviors in the Classroom Using Eye Tracking

链接: https://arxiv.org/abs/2501.18468
作者: Eduardo Davalos,Jorge Alberto Salas,Yike Zhang,Namrata Srivastava,Yashvitha Thatigotla,Abbey Gonzales,Sara McFadden,Sun-Joo Cho,Gautam Biswas,Amanda Goodwin
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
*备注: 24 pages, 16 figures, 6 tables, conference

点击查看摘要

Abstract:Understanding reader behaviors such as skimming, deep reading, and scanning is essential for improving educational instruction. While prior eye-tracking studies have trained models to recognize reading behaviors, they often rely on instructed reading tasks, which can alter natural behaviors and limit the applicability of these findings to in-the-wild settings. Additionally, there is a lack of clear definitions for reading behavior archetypes in the literature. We conducted a classroom study to address these issues by collecting instructed and in-the-wild reading data. We developed a mixed-method framework, including a human-driven theoretical model, statistical analyses, and an AI classifier, to differentiate reading behaviors based on their velocity, density, and sequentiality. Our lightweight 2D CNN achieved an F1 score of 0.8 for behavior recognition, providing a robust approach for understanding in-the-wild reading. This work advances our ability to provide detailed behavioral insights to educators, supporting more targeted and effective assessment and instruction.

[AI-8] Conversation Games and a Strategic View of the Turing Test

链接: https://arxiv.org/abs/2501.18455
作者: Kaveh Aryan
类目: Artificial Intelligence (cs.AI); Computer Science and Game Theory (cs.GT)
*备注:

点击查看摘要

Abstract:Although many game-theoretic models replicate real interactions that often rely on natural language, explicit study of games where language is central to strategic interaction remains limited. This paper introduces the conversation game, a multi-stage, extensive-form game based on linguistic strategic interaction. We focus on a subset of these games, called verdict games. In a verdict game, two players alternate to contribute to a conversation, which is evaluated at each stage by a non-strategic judge who may render a conclusive binary verdict, or a decision to continue the dialogue. The game ends once a limit is reached or a verdict is given. We show that many familiar processes, such as interrogation or a court process, fall under this category. We also show that the Turing test is an instance of a verdict game, and discuss the significance of a strategic view of the Turing test in the age of advanced AI deception. We show the practical relevance of the proposed concepts by simulation experiments, and show that a strategic agent outperforms a naive agent by a high margin.

[AI-9] Clustering Properties of Self-Supervised Learning

链接: https://arxiv.org/abs/2501.18452
作者: Xi Weng,Jianing An,Xudong Ma,Binhang Qi,Jie Luo,Xi Yang,Jin Song Dong,Lei Huang
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Self-supervised learning (SSL) methods via joint embedding architectures have proven remarkably effective at capturing semantically rich representations with strong clustering properties, magically in the absence of label supervision. Despite this, few of them have explored leveraging these untapped properties to improve themselves. In this paper, we provide an evidence through various metrics that the encoder’s output encoding exhibits superior and more stable clustering properties compared to other components. Building on this insight, we propose a novel positive-feedback SSL method, termed Representation Soft Assignment (ReSA), which leverages the model’s clustering properties to promote learning in a self-guided manner. Extensive experiments on standard SSL benchmarks reveal that models pretrained with ReSA outperform other state-of-the-art SSL methods by a significant margin. Finally, we analyze how ReSA facilitates better clustering properties, demonstrating that it effectively enhances clustering performance at both fine-grained and coarse-grained levels, shaping representations that are inherently more structured and semantically meaningful.

[AI-10] Autonomy and Safety Assurance in the Early Development of Robotics and Autonomous Systems

链接: https://arxiv.org/abs/2501.18448
作者: Dhaminda B. Abeywickrama,Michael Fisher,Frederic Wheeler,Louise Dennis
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
*备注: 7 pages, 2 figures

点击查看摘要

Abstract:This report provides an overview of the workshop titled Autonomy and Safety Assurance in the Early Development of Robotics and Autonomous Systems, hosted by the Centre for Robotic Autonomy in Demanding and Long-Lasting Environments (CRADLE) on September 2, 2024, at The University of Manchester, UK. The event brought together representatives from six regulatory and assurance bodies across diverse sectors to discuss challenges and evidence for ensuring the safety of autonomous and robotic systems, particularly autonomous inspection robots (AIR). The workshop featured six invited talks by the regulatory and assurance bodies. CRADLE aims to make assurance an integral part of engineering reliable, transparent, and trustworthy autonomous systems. Key discussions revolved around three research questions: (i) challenges in assuring safety for AIR; (ii) evidence for safety assurance; and (iii) how assurance cases need to differ for autonomous systems. Following the invited talks, the breakout groups further discussed the research questions using case studies from ground (rail), nuclear, underwater, and drone-based AIR. This workshop offered a valuable opportunity for representatives from industry, academia, and regulatory bodies to discuss challenges related to assured autonomy. Feedback from participants indicated a strong willingness to adopt a design-for-assurance process to ensure that robots are developed and verified to meet regulatory expectations.

[AI-11] o3-mini vs DeepSeek-R1: Which One is Safer?

链接: https://arxiv.org/abs/2501.18438
作者: Aitor Arrieta,Miriam Ugarte,Pablo Valle,José Antonio Parejo,Sergio Segura
类目: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
*备注: arXiv admin note: substantial text overlap with arXiv:2501.17749

点击查看摘要

Abstract:The irruption of DeepSeek-R1 constitutes a turning point for the AI industry in general and the LLMs in particular. Its capabilities have demonstrated outstanding performance in several tasks, including creative thinking, code generation, maths and automated program repair, at apparently lower execution cost. However, LLMs must adhere to an important qualitative property, i.e., their alignment with safety and human values. A clear competitor of DeepSeek-R1 is its American counterpart, OpenAI’s o3-mini model, which is expected to set high standards in terms of performance, safety and cost. In this paper we conduct a systematic assessment of the safety level of both, DeepSeek-R1 (70b version) and OpenAI’s o3-mini (beta version). To this end, we make use of our recently released automated safety testing tool, named ASTRAL. By leveraging this tool, we automatically and systematically generate and execute a total of 1260 unsafe test inputs on both models. After conducting a semi-automated assessment of the outcomes provided by both LLMs, the results indicate that DeepSeek-R1 is highly unsafe as compared to OpenAI’s o3-mini. Based on our evaluation, DeepSeek-R1 answered unsafely to 11.98% of the executed prompts whereas o3-mini only to 1.19%.

[AI-12] Guaranteed confidence-band enclosures for PDE surrogates

链接: https://arxiv.org/abs/2501.18426
作者: Ander Gray,Vignesh Gopakumar,Sylvain Rousseau,Sébastien Destercke
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:We propose a method for obtaining statistically guaranteed confidence bands for functional machine learning techniques: surrogate models which map between function spaces, motivated by the need to build reliable PDE emulators. The method constructs nested confidence sets on a low-dimensional representation (an SVD) of the surrogate model's prediction error, and then maps these sets to the prediction space using set-propagation techniques. The results are conformal-like coverage-guaranteed prediction sets for functional surrogate models. We use zonotopes as the basis of the set construction, due to their well-studied set-propagation and verification properties. The method is model-agnostic and can thus be applied to complex Sci-ML models, including Neural Operators, but also in simpler settings. We also elicit a technique to capture the truncation error of the SVD, ensuring the guarantees of the method.

[AI-13] GBFRS: Robust Fuzzy Rough Sets via Granular-ball Computing

链接: https://arxiv.org/abs/2501.18413
作者: Shuyin Xia,Xiaoyu Lian,Binbin Sang,Guoyin Wang,Xinbo Gao
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Fuzzy rough set theory is effective for processing datasets with complex attributes, supported by a solid mathematical foundation and closely linked to kernel methods in machine learning. Attribute reduction algorithms and classifiers based on fuzzy rough set theory exhibit promising performance in the analysis of high-dimensional multivariate complex data. However, most existing models operate at the finest granularity, rendering them inefficient and sensitive to noise, especially for high-dimensional big data. Thus, enhancing the robustness of fuzzy rough set models is crucial for effective feature selection. Multi-granularity granular-ball computing, a recent development, uses granular-balls of different sizes to adaptively represent and cover the sample space, performing learning based on these granular-balls. This paper proposes integrating multi-granularity granular-ball computing into fuzzy rough set theory, using granular-balls to replace sample points. The coarse-grained characteristics of granular-balls make the model more robust. Additionally, we propose a new method for generating granular-balls, scalable to the entire supervised method based on granular-ball computing. A forward search algorithm is used to select feature sequences by defining the correlation between features and categories through dependence functions. Experiments demonstrate the proposed model's effectiveness and superiority over baseline methods.

[AI-14] Gravity-Bench-v1: A Benchmark on Gravitational Physics Discovery for Agents

链接: https://arxiv.org/abs/2501.18411
作者: Nolan Koblischke,Hyunseok Jang,Kristen Menou,Mohamad Ali-Dib
类目: Artificial Intelligence (cs.AI)
*备注: Technical report - Work in progress

点击查看摘要

Abstract:Modern science emerged from reasoning over repeatedly-observed planetary motions. We present Gravity-Bench-v1, an environment-based benchmark that challenges AI agents on tasks that parallel this historical development. Gravity-Bench-v1 evaluates agents on the discovery of physics concealed within a dynamic environment, using rigorous gravitational dynamics simulations. Gravity-Bench includes out-of-distribution cases, i.e. with physics that deviates from the real world, to evaluate true scientific generalization capabilities. Agents must plan to collect data within an experimental budget and must perform a dynamic form of data analysis and reasoning to solve tasks efficiently. Our benchmark admits an open-ended space of solutions. PhD-level solutions for each task are provided, to calibrate AI performance against human expertise. Technically at an upper-undergraduate level, our benchmark proves challenging to baseline AI agents. Gravity-Bench-v1 and planned extensions should help map out AI progress towards scientific discovery capabilities.

[AI-15] A Learnable Multi-views Contrastive Framework with Reconstruction Discrepancy for Medical Time-Series

链接: https://arxiv.org/abs/2501.18367
作者: Yifan Wang,Hongfeng Ai,Ruiqi Li,Maowei Jiang,Cheng Jiang,Chenzhong Li
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: 15 pages,6 figures

点击查看摘要

Abstract:In medical time series disease diagnosis, two key challenges arise. First, the high annotation cost of medical data leads to overfitting in models trained on label-limited, single-center datasets. To address this, we propose incorporating external data from related tasks and leveraging AE-GAN to extract prior knowledge, providing valuable references for downstream tasks. Second, many existing studies employ contrastive learning to derive more generalized medical sequence representations for diagnostic tasks, usually relying on manually designed diverse positive and negative sample pairs. However, these approaches are complex, lack generalizability, and fail to adaptively capture disease-specific features across different datasets. To overcome this, we introduce LMCF (Learnable Multi-views Contrastive Framework), a framework that integrates a multi-head attention mechanism and adaptively learns representations from different views through inter-view and intra-view contrastive learning strategies. Additionally, the pre-trained AE-GAN is used to reconstruct discrepancies in the target data as disease probabilities, which are then integrated into the contrastive learning process. Experiments on three target datasets demonstrate that our method consistently outperforms seven other baselines, highlighting its significant impact on healthcare applications such as the diagnosis of myocardial infarction, Alzheimer’s disease, and Parkinson’s disease.

[AI-16] Transfer Learning of Surrogate Models: Integrating Domain Warping and Affine Transformations

链接: https://arxiv.org/abs/2501.18344
作者: Shuaiqun Pan,Diederick Vermetten,Manuel López-Ibáñez,Thomas Bäck,Hao Wang
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Surrogate models provide efficient alternatives to computationally demanding real-world processes but often require large datasets for effective training. A promising solution to this limitation is the transfer of pre-trained surrogate models to new tasks. Previous studies have investigated the transfer of differentiable and non-differentiable surrogate models, typically assuming an affine transformation between the source and target functions. This paper extends previous research by addressing a broader range of transformations, including linear and nonlinear variations. Specifically, we consider the combination of an unknown input warping, such as one modelled by the beta cumulative distribution function, with an unspecified affine transformation. Our approach achieves transfer learning by employing a limited number of data points from the target task to optimize these transformations, minimizing empirical loss on the transfer dataset. We validate the proposed method on the widely used Black-Box Optimization Benchmark (BBOB) testbed and a real-world transfer learning task from the automobile industry. The results underscore the significant advantages of the approach, revealing that the transferred surrogate significantly outperforms both the original surrogate and the one built from scratch using the transfer dataset, particularly in data-scarce scenarios.
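下面用一段最小示意代码(SciPy)演示摘要所述“未知输入扭曲(beta 累积分布函数)与仿射变换的组合”,即用少量目标任务数据点最小化经验损失来拟合变换参数;源代理函数、数据与优化器设置均为示例假设,并非论文的原始实验:

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import beta

def source_surrogate(x):
    return np.sin(3 * x)                 # 预训练的源代理模型(占位)

def transferred(params, x):
    la, lb, s, t = params
    a, b = np.exp(la), np.exp(lb)        # 用指数参数化保证 beta 参数为正
    return s * source_surrogate(beta.cdf(x, a, b)) + t

# 构造目标任务:源函数经过 beta 扭曲(a=2, b=5)与仿射变换(s=2, t=0.5)
rng = np.random.default_rng(0)
x_transfer = rng.uniform(0, 1, size=10)  # 仅用 10 个迁移数据点
y_transfer = 2.0 * source_surrogate(beta.cdf(x_transfer, 2.0, 5.0)) + 0.5

# 在迁移数据上最小化经验损失,拟合扭曲与仿射参数
loss = lambda p: np.mean((transferred(p, x_transfer) - y_transfer) ** 2)
res = minimize(loss, x0=[0.0, 0.0, 1.0, 0.0], method="Nelder-Mead")
print(res.x, f"final loss: {res.fun:.2e}")
```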

[AI-17] Leveraging LLM Agents for Automated Optimization Modeling for SASP Problems: A Graph-RAG based Approach

链接: https://arxiv.org/abs/2501.18320
作者: Tianpeng Pan,Wenqiang Pu,Licheng Zhao,Rui Zhou
类目: Artificial Intelligence (cs.AI); Signal Processing (eess.SP)
*备注:

点击查看摘要

Abstract:Automated optimization modeling (AOM) has evoked considerable interest with the rapid evolution of large language models (LLMs). Existing approaches predominantly rely on prompt engineering, utilizing meticulously designed expert response chains or structured guidance. However, prompt-based techniques have failed to perform well in the sensor array signal processing (SASP) area due to the lack of specific domain knowledge. To address this issue, we propose an automated modeling approach based on the retrieval-augmented generation (RAG) technique, which consists of two principal components: a multi-agent (MA) structure and a graph-based RAG (Graph-RAG) process. The MA structure is tailored for the architectural AOM process, with each agent being designed based on principles of the human modeling procedure. The Graph-RAG process serves to match user queries with specific SASP modeling knowledge, thereby enhancing the modeling result. Results on ten classical signal processing problems demonstrate that the proposed approach (termed MAG-RAG) outperforms several AOM benchmarks.

[AI-18] Efficient Neural Theorem Proving via Fine-grained Proof Structure Analysis

链接: https://arxiv.org/abs/2501.18310
作者: Haoxiong Liu,Jiacheng Sun,Zhenguo Li,Andrew C Yao
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:The synergy between deep learning models and traditional automation tools plays a pivotal role in developing robust neural theorem provers (NTPs). However, for proof synthesis with LLMs, previous work applies automation tools either only when the model explicitly calls the method, or only at a single granularity level, failing to fully exploit the power of built-in tactics and off-the-shelf automated theorem provers. In this work, we propose ProofAug, a novel theorem proving method that enjoys superior sample efficiency through equipping proof-generation LLMs with automation methods in different granularities via fine-grained structure analysis of model-generated proof proposals. Furthermore, ProofAug serves as a versatile plug-and-play module that seamlessly integrates with any tree-search algorithm, enabling our construction of an efficient recursive proving (ERP) module to further enhance performance. The superiority of our method is validated on the miniF2F-test benchmark using the open-source deepseek-math-7b-base model and the Isabelle proof assistant. Notably, by additionally employing a mixed prompting strategy, we achieve a cumulative pass rate of 66.0% after curation of the dataset (61.9% for the original version), setting a new SOTA across all proof languages with a total sample budget of only 2100. Our code is available at this https URL.

[AI-19] Model-Free RL Agents Demonstrate System 1-Like Intentionality

链接: https://arxiv.org/abs/2501.18299
作者: Hal Ashton,Matija Franklin
类目: Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:This paper argues that model-free reinforcement learning (RL) agents, while lacking explicit planning mechanisms, exhibit behaviours that can be analogised to System 1 (“thinking fast”) processes in human cognition. Unlike model-based RL agents, which operate akin to System 2 (“thinking slow”) reasoning by leveraging internal representations for planning, model-free agents react to environmental stimuli without anticipatory modelling. We propose a novel framework linking the dichotomy of System 1 and System 2 to the distinction between model-free and model-based RL. This framing challenges the prevailing assumption that intentionality and purposeful behaviour require planning, suggesting instead that intentionality can manifest in the structured, reactive behaviours of model-free agents. By drawing on interdisciplinary insights from cognitive psychology, legal theory, and experimental jurisprudence, we explore the implications of this perspective for attributing responsibility and ensuring AI safety. These insights advocate for a broader, contextually informed interpretation of intentionality in RL systems, with implications for their ethical deployment and regulation.

[AI-20] Extending the design space of ontologization practices: Using bCLEARer as an example

链接: https://arxiv.org/abs/2501.18296
作者: Chris Partridge,Andrew Mitchell,Sergio de Cesare,John Beverley
类目: Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Our aim in this paper is to outline how the design space for the ontologization process is richer than current practice would suggest. We point out that engineering processes, as well as products, need to be designed, and we identify some components of the design. We investigate the possibility of designing a range of radically new practices, providing examples of the new practices from our work over the last three decades with an outlier methodology, bCLEARer. We also suggest that setting an evolutionary context for ontologization helps one to better understand the nature of these new practices and provides the conceptual scaffolding that shapes fertile processes. This evolutionary perspective positions digitalization (the evolutionary emergence of computing technologies) as the latest step in a long evolutionary trail of information transitions, reframing ontologization as a strategic tool for leveraging the emerging opportunities offered by digitalization.

[AI-21] CueTip: An Interactive and Explainable Physics-aware Pool Assistant

链接: https://arxiv.org/abs/2501.18291
作者: Sean Memery,Kevin Denamganai,Jiaxin Zhang,Zehai Tu,Yiwen Guo,Kartic Subr
类目: Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
*备注:

点击查看摘要

Abstract:We present an interactive and explainable automated coaching assistant called CueTip for a variant of pool/billiards. CueTip’s novelty lies in its combination of three features: a natural-language interface, an ability to perform contextual, physics-aware reasoning, and that its explanations are rooted in a set of predetermined guidelines developed by domain experts. We instrument a physics simulator so that it generates event traces in natural language alongside traditional state traces. Event traces lend themselves to interpretation by language models, which serve as the interface to our assistant. We design and train a neural adaptor that decouples tactical choices made by CueTip from its interactivity and explainability allowing it to be reconfigured to mimic any pool playing agent. Our experiments show that CueTip enables contextual query-based assistance and explanations while maintaining the strength of the agent in terms of win rate (improving it in some situations). The explanations generated by CueTip are physically-aware and grounded in the expert rules and are therefore more reliable.

[AI-22] Pre-Trained Vision-Language Model Selection and Reuse for Downstream Tasks

链接: https://arxiv.org/abs/2501.18271
作者: Hao-Zhe Tan,Zhi Zhou,Lan-Zhe Guo,Yu-Feng Li
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Pre-trained Vision-Language Models (VLMs) are becoming increasingly popular across various visual tasks, and several open-sourced VLM variants have been released. However, selecting the best-performing pre-trained VLM for a specific downstream task is challenging since no single VLM can achieve promising performance on all downstream tasks, and evaluating all available VLMs is impossible due to time and data limitations. To address this problem, this paper proposes a novel paradigm to select and reuse VLMs for downstream tasks, called Model Label Learning (MLL). The proposal contains three key modules: model labeling, which assigns labels to each VLM to describe their specialty and utility; model selection, which matches the requirements of the target task with model labels; and model reuse, which applies selected VLMs to the target task in an ensemble manner. The proposal is highly computationally efficient and growable since the model labeling process is completed independently of the target task, and its capability can grow with the number of candidate VLMs. We also introduce a new benchmark for evaluating VLM selection methods, including 49 VLMs and 17 target task datasets. Experimental results clearly demonstrate the effectiveness of the proposed method for selecting and reusing VLMs.

[AI-23] PDE-DKL: PDE-constrained deep kernel learning in high dimensionality

链接: https://arxiv.org/abs/2501.18258
作者: Weihao Yan,Christoph Brune,Mengwu Guo
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
*备注: 22 pages, 9 figures

点击查看摘要

Abstract:Many physics-informed machine learning methods for PDE-based problems rely on Gaussian processes (GPs) or neural networks (NNs). However, both face limitations when data are scarce and the dimensionality is high. Although GPs are known for their robust uncertainty quantification in low-dimensional settings, their computational complexity becomes prohibitive as the dimensionality increases. In contrast, while conventional NNs can accommodate high-dimensional input, they often require extensive training data and do not offer uncertainty quantification. To address these challenges, we propose a PDE-constrained Deep Kernel Learning (PDE-DKL) framework that combines DL and GPs under explicit PDE constraints. Specifically, NNs learn a low-dimensional latent representation of the high-dimensional PDE problem, reducing the complexity of the problem. GPs then perform kernel regression subject to the governing PDEs, ensuring accurate solutions and principled uncertainty quantification, even when available data are limited. This synergy unifies the strengths of both NNs and GPs, yielding high accuracy, robust uncertainty estimates, and computational efficiency for high-dimensional PDEs. Numerical experiments demonstrate that PDE-DKL achieves high accuracy with reduced data requirements. They highlight its potential as a practical, reliable, and scalable solver for complex PDE-based applications in science and engineering.
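下面给出“NN 低维潜在表示 + GP 核回归”这一组合框架的粗略示意(scikit-learn);为简洁起见,编码器用固定随机投影代替训练好的神经网络,也未包含 PDE 约束,仅用于说明结构,并非 PDE-DKL 的原实现:

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

rng = np.random.default_rng(0)
d_high, d_latent, n = 50, 2, 40

# 高维输入与一个只依赖低维结构的目标(占位数据)
X = rng.normal(size=(n, d_high))
W_enc = rng.normal(size=(d_high, d_latent)) / np.sqrt(d_high)  # “编码器”:随机投影占位
Z = np.tanh(X @ W_enc)                                         # 低维潜在表示
y = np.sin(Z[:, 0]) + 0.1 * rng.normal(size=n)

# GP 在潜在空间上做核回归,天然带有不确定性量化
gp = GaussianProcessRegressor(kernel=RBF(length_scale=1.0), alpha=1e-2)
gp.fit(Z, y)
mean, std = gp.predict(Z[:5], return_std=True)
print(mean.round(3), std.round(3))
```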

[AI-24] Exploring Large Protein Language Models in Constrained Evaluation Scenarios within the FLIP Benchmark

链接: https://arxiv.org/abs/2501.18223
作者: Manuel F. Mollon,Joaquin Gonzalez-Rodriguez,Alicia Lozano-Diez,Daniel Ramos,Doroteo T. Toledano
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:In this study, we expand upon the FLIP benchmark, designed for evaluating protein fitness prediction models in small, specialized prediction tasks, by assessing the performance of state-of-the-art large protein language models, including ESM-2 and SaProt, on the FLIP dataset. Unlike larger, more diverse benchmarks such as ProteinGym, which cover a broad spectrum of tasks, FLIP focuses on constrained settings where data availability is limited. This makes it an ideal framework to evaluate model performance in scenarios with scarce task-specific data. We investigate whether recent advances in protein language models lead to significant improvements in such settings. Our findings provide valuable insights into the performance of large-scale models in specialized protein prediction tasks.

[AI-25] On Scaling Neurosymbolic Programming through Guided Logical Inference

链接: https://arxiv.org/abs/2501.18202
作者: Thomas Jean-Michel Valentin(ENS Paris Saclay),Luisa Sophie Werner(UGA, LIG),Pierre Genevès(LIG),Nabil Layaïda(LIG, TYREX)
类目: Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Probabilistic neurosymbolic learning seeks to integrate neural networks with symbolic programming. Many state-of-the-art systems rely on a reduction to the Probabilistic Weighted Model Counting Problem (PWMC), which requires computing a Boolean formula called the logical provenance. However, PWMC is #P-hard, and the number of clauses in the logical provenance formula can grow exponentially, creating a major bottleneck that significantly limits the applicability of PNL solutions in practice. We propose a new approach centered around an exact algorithm, DPNL, that enables bypassing the computation of the logical provenance. The DPNL approach relies on the principles of an oracle and a recursive DPLL-like decomposition in order to guide and speed up logical inference. Furthermore, we show that this approach can be adapted for approximate reasoning with \epsilon or (\epsilon, \delta) guarantees, called ApproxDPNL. Experiments show significant performance gains. DPNL enables scaling exact inference further, resulting in more accurate models. Moreover, ApproxDPNL shows potential for advancing the scalability of neurosymbolic programming by incorporating approximations even further, while simultaneously ensuring guarantees for the reasoning process.

[AI-26] Neural Operator based Reinforcement Learning for Control of first-order PDEs with Spatially-Varying State Delay

链接: https://arxiv.org/abs/2501.18201
作者: Jiaqi Hu,Jie Qi,Jing Zhang
类目: Artificial Intelligence (cs.AI); Systems and Control (eess.SY)
*备注: 6 Pages, 7 Figures

点击查看摘要

Abstract:Control of distributed parameter systems affected by delays is a challenging task, particularly when the delays depend on spatial variables. The idea of integrating analytical control theory with learning-based control within a unified control scheme is becoming increasingly promising and advantageous. In this paper, we address the problem of controlling an unstable first-order hyperbolic PDE with spatially-varying delays by combining PDE backstepping control strategies and deep reinforcement learning (RL). To eliminate the assumption on the delay function required for the backstepping design, we propose a soft actor-critic (SAC) architecture incorporating a DeepONet to approximate the backstepping controller. The DeepONet extracts features from the backstepping controller and feeds them into the policy network. In simulations, our algorithm outperforms the baseline SAC without prior backstepping knowledge and the analytical controller.

[AI-27] HKAN: Hierarchical Kolmogorov-Arnold Network without Backpropagation

链接: https://arxiv.org/abs/2501.18199
作者: Grzegorz Dudek,Tomasz Rodak
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: 13 pages, 9 figures

点击查看摘要

Abstract:This paper introduces the Hierarchical Kolmogorov-Arnold Network (HKAN), a novel network architecture that offers a competitive alternative to the recently proposed Kolmogorov-Arnold Network (KAN). Unlike KAN, which relies on backpropagation, HKAN adopts a randomized learning approach, where the parameters of its basis functions are fixed, and linear aggregations are optimized using least-squares regression. HKAN utilizes a hierarchical multi-stacking framework, with each layer refining the predictions from the previous one by solving a series of linear regression problems. This non-iterative training method simplifies computation and eliminates sensitivity to local minima in the loss function. Empirical results show that HKAN delivers comparable, if not superior, accuracy and stability relative to KAN across various regression tasks, while also providing insights into variable importance. The proposed approach seamlessly integrates theoretical insights with practical applications, presenting a robust and efficient alternative for neural network modeling.
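下面给出 HKAN 式随机化学习的最小示意(NumPy):基函数参数固定,仅用最小二乘求解线性聚合,且层与层之间以拟合残差的方式逐层细化;堆叠细节为本文的简化假设,并非论文的原始结构:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(200, 3))
y = np.sin(X[:, 0]) + X[:, 1] ** 2 + 0.05 * rng.normal(size=200)

def fit_random_layer(X, residual, n_basis=50):
    """固定随机基函数 phi(x) = tanh(Wx + b),仅对线性输出权重做最小二乘。"""
    W = rng.normal(size=(X.shape[1], n_basis))
    b = rng.normal(size=n_basis)
    beta, *_ = np.linalg.lstsq(np.tanh(X @ W + b), residual, rcond=None)
    return lambda Xq: np.tanh(Xq @ W + b) @ beta

# 分层堆叠:每层用最小二乘拟合上一层的残差(非迭代训练、无反向传播)
pred = np.zeros_like(y)
for _ in range(3):
    layer = fit_random_layer(X, y - pred)
    pred = pred + layer(X)
print(f"train MSE: {np.mean((pred - y) ** 2):.4f}")
```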

[AI-28] Economic Rationality under Specialization: Evidence of Decision Bias in AI Agents

链接: https://arxiv.org/abs/2501.18190
作者: ShuiDe Wen,Juan Feng
类目: Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:In the study by Chen et al. (2023) [01], the large language model GPT demonstrated economic rationality comparable to or exceeding the average human level in tasks such as budget allocation and risk preference. Building on this finding, this paper further incorporates specialized agents, such as biotechnology experts and economists, for a horizontal comparison to explore whether specialization can enhance or maintain economic rationality equivalent to that of GPT in similar decision-making scenarios. The results indicate that when agents invest more effort in specialized fields, their decision-making behavior is more prone to ‘rationality shift,’ specifically manifested as increased violations of GARP (Generalized Axiom of Revealed Preference), decreased CCEI (Critical Cost Efficiency Index), and more significant decision deviations under high-risk conditions. In contrast, GPT and more generalized basic agents maintain a more stable and consistent level of rationality across multiple tasks. This study reveals the inherent conflict between specialization and economic rationality, providing new insights for constructing AI decision-making systems that balance specialization and generalization across various scenarios.

[AI-29] In-Context Learning of Polynomial Kernel Regression in Transformers with GLU Layers

链接: https://arxiv.org/abs/2501.18187
作者: Haoyuan Sun,Ali Jadbabaie,Navid Azizan
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Transformer-based models have demonstrated remarkable ability in in-context learning (ICL), where they can adapt to unseen tasks from a prompt with a few examples, without requiring parameter updates. Recent research has provided insight into how linear Transformers can perform ICL by implementing gradient descent estimators. In particular, it has been shown that the optimal linear self-attention (LSA) mechanism can implement one step of gradient descent with respect to a linear least-squares objective when trained on random linear regression tasks. However, the theoretical understanding of ICL for nonlinear function classes remains limited. In this work, we address this gap by first showing that LSA is inherently restricted to solving linear least-squares objectives and thus, the solutions in prior works cannot readily extend to nonlinear ICL tasks. To overcome this limitation, drawing inspiration from modern architectures, we study a mechanism that combines LSA with GLU-like feed-forward layers and show that this allows the model to perform one step of gradient descent on a polynomial kernel regression. Further, we characterize the scaling behavior of the resulting Transformer model, highlighting the necessary model size to effectively handle quadratic ICL tasks. Our findings highlight the distinct roles of attention and feed-forward layers in nonlinear ICL and identify key challenges when extending ICL to nonlinear function classes.

[AI-30] Tensor Completion for Surrogate Modeling of Material Property Prediction AAAI

链接: https://arxiv.org/abs/2501.18137
作者: Shaan Pakala,Dawon Ahn,Evangelos Papalexakis
类目: Machine Learning (cs.LG); Materials Science (cond-mat.mtrl-sci); Artificial Intelligence (cs.AI)
*备注: 2 page paper accepted to AAAI KGML 2025 bridge program

点击查看摘要

Abstract:When designing materials to optimize certain properties, there are often many possible configurations of designs that need to be explored. For example, the materials’ composition of elements will affect properties such as strength or conductivity, which are necessary to know when developing new materials. Exploring all combinations of elements to find optimal materials becomes very time consuming, especially when there are more design variables. For this reason, there is growing interest in using machine learning (ML) to predict a material’s properties. In this work, we model the optimization of certain material properties as a tensor completion problem, to leverage the structure of our datasets and navigate the vast number of combinations of material configurations. Across a variety of material property prediction tasks, our experiments show tensor completion methods achieving 10-20% decreased error compared with baseline ML models such as GradientBoosting and Multilayer Perceptron (MLP), while maintaining similar training speed.
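下面用一段最小示意代码(NumPy)演示“把材料性质预测建成掩码张量补全问题”的思路:对低秩 CP 分解的因子做梯度下降,只在已观测条目上计算误差;张量构造、秩与学习率均为示例假设,并非论文所用的具体补全算法:

```python
import numpy as np

rng = np.random.default_rng(0)
I, J, K, rank = 10, 10, 10, 3

# 构造一个低秩真值张量,并随机遮住 60% 的条目(模拟未测量的材料配置)
A0, B0, C0 = (rng.normal(size=(d, rank)) for d in (I, J, K))
T = np.einsum("ir,jr,kr->ijk", A0, B0, C0)
mask = rng.random(T.shape) < 0.4                    # True = 已观测

# 对 CP 因子做梯度下降,只在已观测条目上计算误差(学习率与步数为演示用设置)
A, B, C = (0.1 * rng.normal(size=(d, rank)) for d in (I, J, K))
lr = 0.01
for _ in range(5000):
    E = np.where(mask, np.einsum("ir,jr,kr->ijk", A, B, C) - T, 0.0)
    gA = np.einsum("ijk,jr,kr->ir", E, B, C)
    gB = np.einsum("ijk,ir,kr->jr", E, A, C)
    gC = np.einsum("ijk,ir,jr->kr", E, A, B)
    A, B, C = A - lr * gA, B - lr * gB, C - lr * gC

rmse = np.sqrt(np.mean((np.einsum("ir,jr,kr->ijk", A, B, C) - T)[~mask] ** 2))
print(f"RMSE on unobserved entries: {rmse:.4f}")
```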

[AI-31] Entropy-Synchronized Neural Hashing for Unsupervised Ransomware Detection

链接: https://arxiv.org/abs/2501.18131
作者: Peter Idliman,Wilfred Balfour,Benedict Featheringham,Hugo Chesterfield
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Entropy-based detection methodologies have gained significant attention due to their ability to analyze structural irregularities within executable files, particularly in the identification of malicious software employing advanced obfuscation techniques. The Entropy-Synchronized Neural Hashing (ESNH) framework introduces a novel approach that leverages entropy-driven hash representations to classify software binaries based on their underlying entropy characteristics. Through the synchronization of entropy profiles with neural network architectures, the model generates robust and unique hash values that maintain stability even when faced with polymorphic and metamorphic transformations. Comparative analysis against traditional detection approaches revealed superior performance in identifying novel threats, reducing false-positive rates, and achieving consistent classification across diverse ransomware families. The incorporation of a self-regulating hash convergence mechanism further ensured that entropy-synchronized hashes remained invariant across executions, minimizing classification inconsistencies that often arise due to dynamic modifications in ransomware payloads. Experimental results demonstrated high detection rates across contemporary ransomware strains, with the model exhibiting resilience against encryption-based evasion mechanisms, code injection strategies, and reflective loading techniques. Unlike conventional detection mechanisms that rely on static signatures and heuristic analysis, the proposed entropy-aware classification framework adapts to emerging threats through an inherent ability to capture entropy anomalies within executable structures. The findings reinforce the potential of entropy-based detection in addressing the limitations of traditional methodologies while enhancing detection robustness against obfuscation and adversarial evasion techniques.

[AI-32] Revisiting gender bias research in bibliometrics: Standardizing methodological variability using Scholarly Data Analysis (SoDA) Cards

链接: https://arxiv.org/abs/2501.18129
作者: HaeJin Lee,Shubhanshu Mishra,Apratim Mishra,Zhiwen You,Jinseok Kim,Jana Diesner
类目: Digital Libraries (cs.DL); Artificial Intelligence (cs.AI); Social and Information Networks (cs.SI)
*备注: 33 pg, 7 figures. Soda Cards: this https URL

点击查看摘要

Abstract:Gender biases in scholarly metrics remain a persistent concern, despite numerous bibliometric studies exploring their presence and absence across productivity, impact, acknowledgment, and self-citations. However, methodological inconsistencies, particularly in author name disambiguation and gender identification, limit the reliability and comparability of these studies, potentially perpetuating misperceptions and hindering effective interventions. A review of 70 relevant publications over the past 12 years reveals a wide range of approaches, from name-based and manual searches to more algorithmic and gold-standard methods, with no clear consensus on best practices. This variability, compounded by challenges such as accurately disambiguating Asian names and managing unassigned gender labels, underscores the urgent need for standardized and robust methodologies. To address this critical gap, we propose the development and implementation of "Scholarly Data Analysis (SoDA) Cards". These cards will provide a structured framework for documenting and reporting key methodological choices in scholarly data analysis, including author name disambiguation and gender identification procedures. By promoting transparency and reproducibility, SoDA Cards will facilitate more accurate comparisons and aggregations of research findings, ultimately supporting evidence-informed policymaking and enabling the longitudinal tracking of analytical approaches in the study of gender and other social biases in academia.

[AI-33] VQLTI: Long-Term Tropical Cyclone Intensity Forecasting with Physical Constraints

链接: https://arxiv.org/abs/2501.18122
作者: Xinyu Wang,Lei Liu,Kang Chen,Tao Han,Bin Li,Lei Bai
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Tropical cyclone (TC) intensity forecasting is crucial for early disaster warning and emergency decision-making. Numerous researchers have explored deep-learning methods to address computational and post-processing issues in operational forecasting. Regrettably, they exhibit subpar long-term forecasting capabilities. We use two strategies to enhance long-term forecasting. (1) By enhancing the matching between TC intensity and spatial information, we can improve long-term forecasting performance. (2) Incorporating physical knowledge and physical constraints can help mitigate the accumulation of forecasting errors. To achieve the above strategies, we propose the VQLTI framework. VQLTI transfers the TC intensity information to a discrete latent space while retaining the spatial information differences, using large-scale spatial meteorological data as conditions. Furthermore, we leverage the forecast from the weather prediction model FengWu to provide additional physical knowledge for VQLTI. Additionally, we calculate the potential intensity (PI) to impose physical constraints on the latent variables. In the global long-term TC intensity forecasting, VQLTI achieves state-of-the-art results for the 24h to 120h, with the MSW (Maximum Sustained Wind) forecast error reduced by 35.65%-42.51% compared to ECMWF-IFS.
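The discrete latent space at the heart of VQLTI rests on vector quantization. A generic nearest-codebook lookup, with toy dimensions and no claim to match the authors' architecture, looks like this:

```python
import numpy as np

def vector_quantize(z: np.ndarray, codebook: np.ndarray):
    """Map each continuous latent vector to its nearest codebook entry.

    z: (batch, d) encoder outputs; codebook: (K, d) learned code vectors.
    Returns the quantized latents and the chosen code indices."""
    # Squared Euclidean distance between every latent and every code.
    d2 = ((z[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    idx = d2.argmin(axis=1)
    return codebook[idx], idx

rng = np.random.default_rng(0)
codebook = rng.normal(size=(64, 8))   # K=64 codes of dimension 8 (toy sizes)
z = rng.normal(size=(5, 8))           # e.g. encoded TC-intensity features
zq, idx = vector_quantize(z, codebook)
print(idx)
```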

[AI-34] Investigating an Intelligent System to Monitor & Explain Abnormal Activity Patterns of Older Adults

链接: https://arxiv.org/abs/2501.18108
作者: Min Hun Lee,Daniel P. Siewiorek,Alexandre Bernardino
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Despite the growing potential of older adult care technologies, the adoption of these technologies remains challenging. In this work, we conducted a focus-group session with family caregivers to scope designs of the older adult care technology. We then developed a high-fidelity prototype and conducted a qualitative study with professional caregivers and older adults to understand their perspectives on the system functionalities. This system monitors abnormal activity patterns of older adults using wireless motion sensors and machine learning models and supports interactive dialogue responses to explain abnormal activity patterns of older adults to caregivers, and it allows older adults to proactively share their status with caregivers for an adequate intervention. Both older adults and professional caregivers appreciated that our system can provide a faster, personalized service while proactively controlling what information is to be shared through interactive dialogue responses. We further discuss other considerations to realize older adult care technology in practice.

[AI-35] DIAL: Distribution-Informed Adaptive Learning of Multi-Task Constraints for Safety-Critical Systems

链接: https://arxiv.org/abs/2501.18086
作者: Se-Wook Yoo,Seung-Woo Seo
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Robotics (cs.RO); Systems and Control (eess.SY)
*备注: 16 pages, 14 figures, 6 tables, submission to T-RO in 2024

点击查看摘要

Abstract:Safe reinforcement learning has traditionally relied on predefined constraint functions to ensure safety in complex real-world tasks, such as autonomous driving. However, defining these functions accurately for varied tasks is a persistent challenge. Recent research highlights the potential of leveraging pre-acquired task-agnostic knowledge to enhance both safety and sample efficiency in related tasks. Building on this insight, we propose a novel method to learn shared constraint distributions across multiple tasks. Our approach identifies the shared constraints through imitation learning and then adapts to new tasks by adjusting risk levels within these learned distributions. This adaptability addresses variations in risk sensitivity stemming from expert-specific biases, ensuring consistent adherence to general safety principles even with imperfect demonstrations. Our method can be applied to control and navigation domains, including multi-task and meta-task scenarios, accommodating constraints such as maintaining safe distances or adhering to speed limits. Experimental results validate the efficacy of our approach, demonstrating superior safety performance and success rates compared to baselines, all without requiring task-specific constraint definitions. These findings underscore the versatility and practicality of our method across a wide range of real-world tasks.

[AI-36] Normative Evaluation of Large Language Models with Everyday Moral Dilemmas

链接: https://arxiv.org/abs/2501.18081
作者: Pratik S. Sachdeva,Tom van Nuenen
类目: Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
*备注:

点击查看摘要

Abstract:The rapid adoption of large language models (LLMs) has spurred extensive research into their encoded moral norms and decision-making processes. Much of this research relies on prompting LLMs with survey-style questions to assess how well models are aligned with certain demographic groups, moral beliefs, or political ideologies. While informative, the adherence of these approaches to relatively superficial constructs tends to oversimplify the complexity and nuance underlying everyday moral dilemmas. We argue that auditing LLMs along more detailed axes of human interaction is of paramount importance to better assess the degree to which they may impact human beliefs and actions. To this end, we evaluate LLMs on complex, everyday moral dilemmas sourced from the “Am I the Asshole” (AITA) community on Reddit, where users seek moral judgments on everyday conflicts from other community members. We prompted seven LLMs to assign blame and provide explanations for over 10,000 AITA moral dilemmas. We then compared the LLMs’ judgments and explanations to those of Redditors and to each other, aiming to uncover patterns in their moral reasoning. Our results demonstrate that large language models exhibit distinct patterns of moral judgment, varying substantially from human evaluations on the AITA subreddit. LLMs demonstrate moderate to high self-consistency but low inter-model agreement. Further analysis of model explanations reveals distinct patterns in how models invoke various moral principles. These findings highlight the complexity of implementing consistent moral reasoning in artificial systems and the need for careful evaluation of how different models approach ethical judgment. As LLMs continue to be used in roles requiring ethical decision-making such as therapists and companions, careful evaluation is crucial to mitigate potential biases and limitations.

[AI-37] Towards Transparent and Accurate Diabetes Prediction Using Machine Learning and Explainable Artificial Intelligence

链接: https://arxiv.org/abs/2501.18071
作者: Pir Bakhsh Khokhar,Viviana Pentangelo,Fabio Palomba,Carmine Gravino
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
*备注:

点击查看摘要

Abstract:Diabetes mellitus (DM) is a global health issue of significance that must be diagnosed as early as possible and managed well. This study presents a framework for diabetes prediction using Machine Learning (ML) models, complemented with eXplainable Artificial Intelligence (XAI) tools, to investigate both the predictive accuracy and interpretability of the predictions from ML models. Data preprocessing uses the Synthetic Minority Oversampling Technique (SMOTE) and feature scaling on the Diabetes Binary Health Indicators dataset to deal with class imbalance and the variability of clinical features. The ensemble model provided high accuracy, with a test accuracy of 92.50% and an ROC-AUC of 0.975. BMI, Age, General Health, Income, and Physical Activity were the most influential predictors obtained from the model explanations. The results of this study suggest that ML combined with XAI is a promising means of developing accurate and computationally transparent tools for use in healthcare systems.

[AI-38] Learning the Optimal Stopping for Early Classification within Finite Horizons via Sequential Probability Ratio Test ICLR

链接: https://arxiv.org/abs/2501.18059
作者: Akinori F. Ebihara,Taiki Miyagawa,Kazuyuki Sakurai,Hitoshi Imaoka
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: Accepted to International Conference on Learning Representations (ICLR) 2025

点击查看摘要

Abstract:Time-sensitive machine learning benefits from Sequential Probability Ratio Test (SPRT), which provides an optimal stopping time for early classification of time series. However, in finite horizon scenarios, where input lengths are finite, determining the optimal stopping rule becomes computationally intensive due to the need for backward induction, limiting practical applicability. We thus introduce FIRMBOUND, an SPRT-based framework that efficiently estimates the solution to backward induction from training data, bridging the gap between optimal stopping theory and real-world deployment. It employs density ratio estimation and convex function learning to provide statistically consistent estimators for sufficient statistic and conditional expectation, both essential for solving backward induction; consequently, FIRMBOUND minimizes Bayes risk to reach optimality. Additionally, we present a faster alternative using Gaussian process regression, which significantly reduces training time while retaining low deployment overhead, albeit with potential compromise in statistical consistency. Experiments across independent and identically distributed (i.i.d.), non-i.i.d., binary, multiclass, synthetic, and real-world datasets show that FIRMBOUND achieves optimalities in the sense of Bayes risk and speed-accuracy tradeoff. Furthermore, it advances the tradeoff boundary toward optimality when possible and reduces decision-time variance, ensuring reliable decision-making. Code is publicly available at this https URL
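FIRMBOUND builds on the classic SPRT, whose stopping rule is short enough to state in code. The sketch below is the textbook infinite-horizon test with Wald's thresholds on a Gaussian mean-shift example; the paper's actual contribution, estimating the finite-horizon backward-induction solution, is not reproduced here.

```python
import numpy as np

def sprt(samples, logpdf_h1, logpdf_h0, alpha=0.05, beta=0.05):
    """Classic SPRT: accumulate the log-likelihood ratio, stop at a boundary."""
    upper = np.log((1 - beta) / alpha)   # accept H1 above this (Wald's A)
    lower = np.log(beta / (1 - alpha))   # accept H0 below this (Wald's B)
    llr = 0.0
    for t, x in enumerate(samples, start=1):
        llr += logpdf_h1(x) - logpdf_h0(x)
        if llr >= upper:
            return "H1", t
        if llr <= lower:
            return "H0", t
    return "undecided", len(samples)     # horizon exhausted without a decision

# Gaussian mean-shift: H0 mean 0 vs H1 mean 1, unit variance
# (normalizing constants cancel in the ratio, so they are omitted).
rng = np.random.default_rng(0)
xs = rng.normal(loc=1.0, size=100)
decision, t = sprt(xs,
                   lambda x: -0.5 * (x - 1.0) ** 2,
                   lambda x: -0.5 * x ** 2)
print(decision, "after", t, "samples")
```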

[AI-39] Current Pathology Foundation Models are unrobust to Medical Center Differences

链接: https://arxiv.org/abs/2501.18055
作者: Edwin D. de Jong,Eric Marcus,Jonas Teuwen
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Pathology Foundation Models (FMs) hold great promise for healthcare. Before they can be used in clinical practice, it is essential to ensure they are robust to variations between medical centers. We measure whether pathology FMs focus on biological features like tissue and cancer type, or on the well known confounding medical center signatures introduced by staining procedure and other differences. We introduce the Robustness Index. This novel robustness metric reflects to what degree biological features dominate confounding features. Ten current publicly available pathology FMs are evaluated. We find that all current pathology foundation models evaluated represent the medical center to a strong degree. Significant differences in the robustness index are observed. Only one model so far has a robustness index greater than one, meaning biological features dominate confounding features, but only slightly. A quantitative approach to measure the influence of medical center differences on FM-based prediction performance is described. We analyze the impact of unrobustness on classification performance of downstream models, and find that cancer-type classification errors are not random, but specifically attributable to same-center confounders: images of other classes from the same medical center. We visualize FM embedding spaces, and find these are more strongly organized by medical centers than by biological factors. As a consequence, the medical center of origin is predicted more accurately than the tissue source and cancer type. The robustness index introduced here is provided with the aim of advancing progress towards clinical adoption of robust and reliable pathology FMs.
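One illustrative way to probe this phenomenon is to compare how well a k-NN probe can read biological labels versus medical-center labels off the embeddings; if the center is easier to predict, the representation is dominated by confounders. Note that the paper defines its Robustness Index differently; the ratio below and all data shapes are assumptions for illustration only.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

def knn_predictability(embeddings, labels, k=5):
    """Cross-validated accuracy of a k-NN probe reading a label off embeddings."""
    return cross_val_score(KNeighborsClassifier(n_neighbors=k),
                           embeddings, labels, cv=5).mean()

# Stand-ins: emb would be (n_patches, d) foundation-model embeddings,
# bio the tissue/cancer-type labels, center the medical center of origin.
rng = np.random.default_rng(0)
emb = rng.normal(size=(500, 32))
bio = rng.integers(0, 4, size=500)
center = rng.integers(0, 5, size=500)

ratio = knn_predictability(emb, bio) / knn_predictability(emb, center)
print(f"biology/center predictability ratio: {ratio:.2f}")  # >1 suggests robustness
```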

[AI-40] SAeUron: Interpretable Concept Unlearning in Diffusion Models with Sparse Autoencoders

链接: https://arxiv.org/abs/2501.18052
作者: Bartosz Cywiński,Kamil Deja
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Recent machine unlearning approaches offer a promising solution for removing unwanted concepts from diffusion models. However, traditional methods, which largely rely on fine-tuning, provide little insight into the changes they introduce to the base model, making it unclear whether concepts are truly removed or only masked. In this work, we introduce SAeUron, a novel method leveraging features learned by sparse autoencoders (SAEs) to unlearn unwanted concepts in text-to-image diffusion models. First, we demonstrate that SAEs, trained in an unsupervised manner on activations from multiple denoising timesteps of the diffusion model, capture sparse and interpretable features corresponding to specific concepts. Building on this, we propose a method of selecting concept-specific features. This enables precise interventions on the model's activations to block targeted content while preserving the model's overall performance. Evaluation on the competitive UnlearnCanvas benchmark on object and style unlearning highlights SAeUron's state-of-the-art performance. Moreover, we show that with a single SAE, we can remove multiple concepts simultaneously and that, in contrast to other methods, SAeUron dismisses the possibility of generating unwanted content, even under adversarial attack.

[AI-41] Digital Twin-Enabled Real-Time Control in Robotic Additive Manufacturing via Soft Actor-Critic Reinforcement Learning

链接: https://arxiv.org/abs/2501.18016
作者: Matsive Ali,Sandesh Giri,Sen Liu,Qin Yang
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Systems and Control (eess.SY)
*备注:

点击查看摘要

Abstract:Smart manufacturing systems increasingly rely on adaptive control mechanisms to optimize complex processes. This research presents a novel approach integrating Soft Actor-Critic (SAC) reinforcement learning with digital twin technology to enable real-time process control in robotic additive manufacturing. We demonstrate our methodology using a Viper X300s robot arm, implementing two distinct control scenarios: static target acquisition and dynamic trajectory following. The system architecture combines Unity's simulation environment with ROS2 for seamless digital twin synchronization, while leveraging transfer learning to efficiently adapt trained models across tasks. Our hierarchical reward structure addresses common reinforcement learning challenges including local minima avoidance, convergence acceleration, and training stability. Experimental results show rapid policy convergence and robust task execution in both simulated and physical environments, with performance metrics including cumulative reward, value prediction accuracy, policy loss, and discrete entropy coefficient demonstrating the effectiveness of our approach. This work advances the integration of reinforcement learning with digital twins for industrial robotics applications, providing a framework for enhanced adaptive real-time control for smart additive manufacturing processes.

[AI-42] Large Language Models Think Too Fast To Explore Effectively

链接: https://arxiv.org/abs/2501.18009
作者: Lan Pan,Hanbo Xie,Robert C. Wilson
类目: Artificial Intelligence (cs.AI); Neurons and Cognition (q-bio.NC)
*备注: 16 pages, 13 figures, under review

点击查看摘要

Abstract:Large Language Models have developed many intellectual capacities. While numerous benchmarks assess their intelligence, limited attention has been given to their ability to explore, an essential capacity for discovering new information and adapting to novel environments in both natural and artificial systems. The extent to which LLMs can effectively explore, particularly in open-ended tasks, remains unclear. This study investigates whether LLMs can surpass humans in exploration during an open-ended task, using Little Alchemy 2 as a paradigm, where agents combine elements to discover new ones. Results show most LLMs underperform compared to humans, except for the o1 model, with traditional LLMs relying primarily on uncertainty-driven strategies, unlike humans who balance uncertainty and empowerment. Representational analysis of the models with Sparse Autoencoders revealed that uncertainty and choices are represented at earlier transformer blocks, while empowerment values are processed later, causing LLMs to think too fast and make premature decisions, hindering effective exploration. These findings shed light on the limitations of LLM exploration and suggest directions for improving their adaptability.

[AI-43] Topological Signatures of Adversaries in Multimodal Alignments

链接: https://arxiv.org/abs/2501.18006
作者: Minh Vu,Geigh Zollicoffer,Huy Mai,Ben Nebgen,Boian Alexandrov,Manish Bhattarai
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
*备注:

点击查看摘要

Abstract:Multimodal Machine Learning systems, particularly those aligning text and image data like CLIP/BLIP models, have become increasingly prevalent, yet remain susceptible to adversarial attacks. While substantial research has addressed adversarial robustness in unimodal contexts, defense strategies for multimodal systems are underexplored. This work investigates the topological signatures that arise between image and text embeddings and shows how adversarial attacks disrupt their alignment, introducing distinctive signatures. We specifically leverage persistent homology and introduce two novel Topological-Contrastive losses based on Total Persistence and Multi-scale kernel methods to analyze the topological signatures introduced by adversarial perturbations. We observe a pattern of monotonic changes in the proposed topological losses emerging in a wide range of attacks on image-text alignments, as more adversarial samples are introduced in the data. By designing an algorithm to back-propagate these signatures to input samples, we are able to integrate these signatures into Maximum Mean Discrepancy tests, creating a novel class of tests that leverage topological signatures for better adversarial detection.
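As a taste of the machinery, the total 0-dimensional persistence of a point cloud can be read off a single-linkage dendrogram, because H0 components of the Vietoris-Rips filtration are born at 0 and die exactly at the merge heights. This shows one ingredient only; the paper's Topological-Contrastive losses and multi-scale kernels go well beyond it, and the embeddings here are random stand-ins.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

def total_h0_persistence(points: np.ndarray) -> float:
    """Total 0-dim persistence = sum of single-linkage merge distances."""
    merges = linkage(points, method="single")  # column 2 holds merge heights
    return float(merges[:, 2].sum())

rng = np.random.default_rng(0)
clean = rng.normal(size=(100, 16))                       # e.g. image-text embeddings
perturbed = clean + rng.normal(scale=0.5, size=clean.shape)
print(f"clean: {total_h0_persistence(clean):.2f}, "
      f"perturbed: {total_h0_persistence(perturbed):.2f}")
```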

[AI-44] Investigating the Monte-Carlo Tree Search Approach for the Job Shop Scheduling Problem

链接: https://arxiv.org/abs/2501.17991
作者: Laurie Boveroux,Damien Ernst,Quentin Louveaux
类目: Artificial Intelligence (cs.AI); Optimization and Control (math.OC)
*备注:

点击查看摘要

Abstract:The Job Shop Scheduling Problem (JSSP) is a well-known optimization problem in manufacturing, where the goal is to determine the optimal sequence of jobs across different machines to minimize a given objective. In this work, we focus on minimising the weighted sum of job completion times. We explore the potential of Monte Carlo Tree Search (MCTS), a heuristic-based reinforcement learning technique, to solve large-scale JSSPs, especially those with recirculation. We propose several Markov Decision Process (MDP) formulations to model the JSSP for the MCTS algorithm. In addition, we introduce a new synthetic benchmark derived from real manufacturing data, which captures the complexity of large, non-rectangular instances often encountered in practice. Our experimental results show that MCTS effectively produces good-quality solutions for large-scale JSSP instances, outperforming our constraint programming approach.
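For readers unfamiliar with MCTS, a generic UCT skeleton (selection, expansion, rollout, backpropagation) is sketched below. The paper's MDP formulations for the JSSP and its benchmark are not reproduced; states, actions, and rewards are left as user-supplied callables, so this is a template rather than a JSSP solver.

```python
import math
import random

class Node:
    def __init__(self, state, parent=None):
        self.state, self.parent = state, parent
        self.children, self.visits, self.value = {}, 0, 0.0

def uct_select(node, c=1.4):
    # Child maximizing the UCB1 score (exploitation + exploration bonus).
    return max(node.children.values(),
               key=lambda ch: ch.value / ch.visits
               + c * math.sqrt(math.log(node.visits) / ch.visits))

def mcts(root_state, actions_fn, step_fn, reward_fn, iters=1000):
    root = Node(root_state)
    for _ in range(iters):
        node = root
        # 1. Selection: descend while the node is fully expanded.
        while node.children and len(node.children) == len(actions_fn(node.state)):
            node = uct_select(node)
        # 2. Expansion: try one untried action, if any.
        untried = [a for a in actions_fn(node.state) if a not in node.children]
        if untried:
            a = random.choice(untried)
            node.children[a] = Node(step_fn(node.state, a), parent=node)
            node = node.children[a]
        # 3. Rollout: random playout to a terminal state (a complete schedule).
        state = node.state
        while actions_fn(state):
            state = step_fn(state, random.choice(actions_fn(state)))
        r = reward_fn(state)  # for JSSP: negative weighted sum of completion times
        # 4. Backpropagation.
        while node:
            node.visits += 1
            node.value += r
            node = node.parent
    return max(root.children, key=lambda a: root.children[a].visits)
```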

[AI-45] Belief Roadmaps with Uncertain Landmark Evanescence

链接: https://arxiv.org/abs/2501.17982
作者: Erick Fuentes,Jared Strader,Ethan Fahnestock,Nicholas Roy
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:We would like a robot to navigate to a goal location while minimizing state uncertainty. To aid the robot in this endeavor, maps provide a prior belief over the location of objects and regions of interest. To localize itself within the map, a robot identifies mapped landmarks using its sensors. However, as the time between map creation and robot deployment increases, portions of the map can become stale, and landmarks, once believed to be permanent, may disappear. We refer to the propensity of a landmark to disappear as landmark evanescence. Reasoning about landmark evanescence during path planning, and the associated impact on localization accuracy, requires analyzing the presence or absence of each landmark, leading to an exponential number of possible outcomes of a given motion plan. To address this complexity, we develop BRULE, an extension of the Belief Roadmap. During planning, we replace the belief over future robot poses with a Gaussian mixture which is able to capture the effects of landmark evanescence. Furthermore, we show that belief updates can be made efficient, and that maintaining a random subset of mixture components is sufficient to find high quality solutions. We demonstrate performance in simulated and real-world experiments. Software is available at this https URL.

[AI-46] Limits to AI Growth: The Ecological and Social Consequences of Scaling

链接: https://arxiv.org/abs/2501.17980
作者: Eshta Bhardwaj,Rohan Alexander,Christoph Becker
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
*备注: 14 pages

点击查看摘要

Abstract:The accelerating development and deployment of AI technologies depend on the continued ability to scale their infrastructure. This has implied increasing amounts of monetary investment and natural resources. Frontier AI applications have thus resulted in rising financial, environmental, and social costs. While the factors that AI scaling depends on reach their limits, the push for its accelerated advancement and entrenchment continues. In this paper, we provide a holistic review of AI scaling using four lenses (technical, economic, ecological, and social) and review the relationships between these lenses to explore the dynamics of AI growth. We do so by drawing on system dynamics concepts including archetypes such as "limits to growth" to model the dynamic complexity of AI scaling and synthesize several perspectives. Our work maps out the entangled relationships between the technical, economic, ecological and social perspectives and the apparent limits to growth. The analysis explains how industry's responses to external limits enable continued (but temporary) scaling and how this benefits Big Tech while externalizing social and environmental damages. To avoid an "overshoot and collapse" trajectory, we advocate for realigning priorities and norms around scaling to prioritize sustainable and mindful advancements.

[AI-47] Think Smarter not Harder: Adaptive Reasoning with Inference Aware Optimization

链接: https://arxiv.org/abs/2501.17974
作者: Zishun Yu,Tengyu Xu,Di Jin,Karthik Abinav Sankararaman,Yun He,Wenxuan Zhou,Zhouhao Zeng,Eryk Helenowski,Chen Zhu,Sinong Wang,Hao Ma,Han Fang
类目: Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Solving mathematics problems has been an intriguing capability of large language models, and many efforts have been made to improve reasoning by extending reasoning length, such as through self-correction and extensive long chain-of-thoughts. While promising in problem-solving, advanced long reasoning chain models exhibit an undesired single-modal behavior, where trivial questions require unnecessarily tedious long chains of thought. In this work, we propose a way to allow models to be aware of inference budgets by formulating it as utility maximization with respect to an inference budget constraint, hence naming our algorithm Inference Budget-Constrained Policy Optimization (IBPO). In a nutshell, models fine-tuned through IBPO learn to "understand" the difficulty of queries and allocate inference budgets to harder ones. With different inference budgets, our best models are able to achieve a 4.14% and 5.74% absolute improvement (8.08% and 11.2% relative improvement) on MATH500 using 2.16x and 4.32x inference budgets respectively, relative to LLaMA3.1 8B Instruct. These improvements are approximately 2x those of self-consistency under the same budgets.

[AI-48] Deep Ensembles Secretly Perform Empirical Bayes

链接: https://arxiv.org/abs/2501.17917
作者: Gabriel Loaiza-Ganem,Valentin Villecroze,Yixin Wang
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:Quantifying uncertainty in neural networks is a highly relevant problem which is essential to many applications. The two predominant paradigms to tackle this task are Bayesian neural networks (BNNs) and deep ensembles. Despite some similarities between these two approaches, they are typically surmised to lack a formal connection and are thus understood as fundamentally different. BNNs are often touted as more principled due to their reliance on the Bayesian paradigm, whereas ensembles are perceived as more ad-hoc; yet, deep ensembles tend to empirically outperform BNNs, with no satisfying explanation as to why this is the case. In this work we bridge this gap by showing that deep ensembles perform exact Bayesian averaging with a posterior obtained with an implicitly learned data-dependent prior. In other words, deep ensembles are Bayesian, or more specifically, they implement an empirical Bayes procedure wherein the prior is learned from the data. This perspective offers two main benefits: (i) it theoretically justifies deep ensembles and thus provides an explanation for their strong empirical performance; and (ii) inspection of the learned prior reveals it is given by a mixture of point masses – the use of such a strong prior helps elucidate observed phenomena about ensembles. Overall, our work delivers a newfound understanding of deep ensembles which is not only of interest in and of itself, but which is also likely to generate future insights that drive empirical improvements for these models.
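The object of study is easy to instantiate: train one architecture from several random initializations and average the predictive distributions. A toy scikit-learn version (library and sizes are my choices, not the authors'):

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.neural_network import MLPClassifier

X, y = make_moons(n_samples=400, noise=0.25, random_state=0)

# A deep ensemble: same architecture, independent random initializations.
members = [MLPClassifier(hidden_layer_sizes=(32, 32), max_iter=2000,
                         random_state=seed).fit(X, y)
           for seed in range(5)]

# The ensemble prediction is the average of member predictive distributions,
# i.e. the Bayesian model average the paper argues this secretly implements.
probs = np.mean([m.predict_proba(X) for m in members], axis=0)

# Spread across members is a simple signal of epistemic uncertainty.
spread = np.std([m.predict_proba(X)[:, 1] for m in members], axis=0)
print(f"mean confidence: {probs.max(axis=1).mean():.3f}, "
      f"max disagreement: {spread.max():.3f}")
```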

[AI-49] Free Agent in Agent-Based Mixture-of-Experts Generative AI Framework

链接: https://arxiv.org/abs/2501.17903
作者: Jung-Hua Liu
类目: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Multi-agent systems commonly distribute tasks among specialized, autonomous agents, yet they often lack mechanisms to replace or reassign underperforming agents in real time. Inspired by the free-agency model of Major League Baseball, the Reinforcement Learning Free Agent (RLFA) algorithm introduces a reward-based mechanism to detect and remove agents exhibiting persistent underperformance and seamlessly insert more capable ones. Each agent internally uses a mixture-of-experts (MoE) approach, delegating incoming tasks to specialized sub-models under the guidance of a gating function. A primary use case is fraud detection, where RLFA promptly swaps out an agent whose detection accuracy dips below a preset threshold. A new agent is tested in a probationary mode, and upon demonstrating superior performance, fully replaces the underperformer. This dynamic, free-agency cycle ensures sustained accuracy, quicker adaptation to emerging threats, and minimal disruption to ongoing operations. By continually refreshing its roster of agents, the system fosters ongoing improvements and more resilient collaboration in multi-agent Generative AI environments.

[AI-50] The Right to AI

链接: https://arxiv.org/abs/2501.17899
作者: Rashid Mushkani,Hugo Berard,Allison Cohen,Shin Koeski
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
*备注: 19 pages, 2 figures, 1 table

点击查看摘要

Abstract:This paper proposes a Right to AI, which asserts that individuals and communities should meaningfully participate in the development and governance of the AI systems that shape their lives. Motivated by the increasing deployment of AI in critical domains and inspired by Henri Lefebvre’s concept of the Right to the City, we reconceptualize AI as a societal infrastructure, rather than merely a product of expert design. In this paper, we critically evaluate how generative agents, large-scale data extraction, and diverse cultural values bring new complexities to AI oversight. The paper proposes that grassroots participatory methodologies can mitigate biased outcomes and enhance social responsiveness. It asserts that data is socially produced and should be managed and owned collectively. Drawing on Sherry Arnstein’s Ladder of Citizen Participation and analyzing nine case studies, the paper develops a four-tier model for the Right to AI that situates the current paradigm and envisions an aspirational future. It proposes recommendations for inclusive data ownership, transparent design processes, and stakeholder-driven oversight. We also discuss market-led and state-centric alternatives and argue that participatory approaches offer a better balance between technical efficiency and democratic legitimacy.

[AI-51] Task and Perception-aware Distributed Source Coding for Correlated Speech under Bandwidth-constrained Channels AAAI2025

链接: https://arxiv.org/abs/2501.17879
作者: Sagnik Bhattacharya,Muhammad Ahmed Mohsin,Ahsan Bilal,John M. Cioffi
类目: Information Theory (cs.IT); Artificial Intelligence (cs.AI); Sound (cs.SD); Audio and Speech Processing (eess.AS); Signal Processing (eess.SP)
*备注: Published at AAAI 2025 Workshop

点击查看摘要

Abstract:Emerging wireless AR/VR applications require real-time transmission of correlated high-fidelity speech from multiple resource-constrained devices over unreliable, bandwidth-limited channels. Existing autoencoder-based speech source coding methods fail to address the combination of the following - (1) dynamic bitrate adaptation without retraining the model, (2) leveraging correlations among multiple speech sources, and (3) balancing downstream task loss with realism of reconstructed speech. We propose a neural distributed principal component analysis (NDPCA)-aided distributed source coding algorithm for correlated speech sources transmitting to a central receiver. Our method includes a perception-aware downstream task loss function that balances perceptual realism with task-specific performance. Experiments show significant PSNR improvements under bandwidth constraints over naive autoencoder methods in task-agnostic (19%) and task-aware settings (52%). It also approaches the theoretical upper bound, where all correlated sources are sent to a single encoder, especially in low-bandwidth scenarios. Additionally, we present a rate-distortion-perception trade-off curve, enabling adaptive decisions based on application-specific realism needs.

[AI-52] Prediction-Powered Inference with Imputed Covariates and Nonuniform Sampling

链接: https://arxiv.org/abs/2501.18577
作者: Dan M. Kluger,Kerri Lu,Tijana Zrnic,Sherrie Wang,Stephen Bates
类目: Methodology (stat.ME); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:Machine learning models are increasingly used to produce predictions that serve as input data in subsequent statistical analyses. For example, computer vision predictions of economic and environmental indicators based on satellite imagery are used in downstream regressions; similarly, language models are widely used to approximate human ratings and opinions in social science research. However, failure to properly account for errors in the machine learning predictions renders standard statistical procedures invalid. Prior work uses what we call the Predict-Then-Debias estimator to give valid confidence intervals when machine learning algorithms impute missing variables, assuming a small complete sample from the population of interest. We expand the scope by introducing bootstrap confidence intervals that apply when the complete data is a nonuniform (i.e., weighted, stratified, or clustered) sample and to settings where an arbitrary subset of features is imputed. Importantly, the method can be applied to many settings without requiring additional calculations. We prove that these confidence intervals are valid under no assumptions on the quality of the machine learning model and are no wider than the intervals obtained by methods that do not use machine learning predictions.
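For the simplest case, estimating a population mean from a uniform sample, the Predict-Then-Debias estimator and a paired bootstrap interval fit in a few lines. The paper's actual contribution (weighted, stratified, or clustered samples and arbitrary subsets of imputed features) is not shown, and the synthetic bias numbers below are made up.

```python
import numpy as np

rng = np.random.default_rng(0)

# Large unlabeled sample with model predictions, small labeled sample with truth.
N, n = 100_000, 500
f_unlabeled = rng.normal(loc=2.1, size=N)   # predictions; biased, true mean is 2.0
y_labeled = rng.normal(loc=2.0, size=n)     # gold labels
f_labeled = y_labeled + rng.normal(loc=0.1, scale=0.3, size=n)  # paired predictions

def predict_then_debias(f_unl, y_lab, f_lab):
    # Plug-in estimate from predictions, corrected by the labeled residuals.
    return f_unl.mean() + (y_lab - f_lab).mean()

# Paired bootstrap over both samples (sampling weights would enter here
# in the nonuniform designs the paper covers).
boots = []
for _ in range(200):
    iu = rng.integers(0, N, N)
    il = rng.integers(0, n, n)
    boots.append(predict_then_debias(f_unlabeled[iu], y_labeled[il], f_labeled[il]))
lo, hi = np.percentile(boots, [2.5, 97.5])
est = predict_then_debias(f_unlabeled, y_labeled, f_labeled)
print(f"estimate {est:.3f}, 95% CI [{lo:.3f}, {hi:.3f}]")
```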

[AI-53] Beyond Prior Limits: Addressing Distribution Misalignment in Particle Filtering

链接: https://arxiv.org/abs/2501.18501
作者: Yiwei Shi,Jingyu Hu,Yu Zhang,Mengyue Yang,Weinan Zhang,Cunjia Liu,Weiru Liu
类目: Machine Learning (stat.ML); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Particle filtering is a Bayesian inference method and a fundamental tool in state estimation for dynamic systems, but its effectiveness is often limited by the constraints of the initial prior distribution, a phenomenon we define as the Prior Boundary Phenomenon. This challenge arises when target states lie outside the prior’s support, rendering traditional particle filtering methods inadequate for accurate estimation. Although techniques like unbounded priors and larger particle sets have been proposed, they remain computationally prohibitive and lack adaptability in dynamic scenarios. To systematically overcome these limitations, we propose the Diffusion-Enhanced Particle Filtering Framework, which introduces three key innovations: adaptive diffusion through exploratory particles, entropy-driven regularisation to prevent weight collapse, and kernel-based perturbations for dynamic support expansion. These mechanisms collectively enable particle filtering to explore beyond prior boundaries, ensuring robust state estimation for out-of-boundary targets. Theoretical analysis and extensive experiments validate framework’s effectiveness, indicating significant improvements in success rates and estimation accuracy across high-dimensional and non-convex scenarios.

[AI-54] Solving Drone Routing Problems with Quantum Computing: A Hybrid Approach Combining Quantum Annealing and Gate-Based Paradigms CEC2025

链接: https://arxiv.org/abs/2501.18432
作者: Eneko Osaba,Pablo Miranda-Rodriguez,Andreas Oikonomakis,Matic Petrič,Sebastian Bock,Michail-Alexandros Kourtis
类目: Quantum Physics (quant-ph); Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET)
*备注: 8 pages, 5 figures. Paper submitted to IEEE Congress on Evolutionary Computation (IEEE CEC 2025)

点击查看摘要

Abstract:This paper presents a novel hybrid approach to solving real-world drone routing problems by leveraging the capabilities of quantum computing. The proposed method, coined Quantum for Drone Routing (Q4DR), integrates the two most prominent paradigms in the field: quantum gate-based computing, through the Eclipse Qrisp programming language; and quantum annealers, by means of D-Wave System’s devices. The algorithm is divided into two different phases: an initial clustering phase executed using a Quantum Approximate Optimization Algorithm (QAOA), and a routing phase employing quantum annealers. The efficacy of Q4DR is demonstrated through three use cases of increasing complexity, each incorporating real-world constraints such as asymmetric costs, forbidden paths, and itinerant charging points. This research contributes to the growing body of work in quantum optimization, showcasing the practical applications of quantum computing in logistics and route planning.

[AI-55] Unfaithful Probability Distributions in Binary Triple of Causality Directed Acyclic Graph

链接: https://arxiv.org/abs/2501.18337
作者: Jingwei Liu
类目: Machine Learning (stat.ML); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Faithfulness is the foundation of probability distribution and graph in causal discovery and causal inference. In this paper, several unfaithful probability distribution examples are constructed in the three-vertex binary causality directed acyclic graph (DAG) structure, which are not faithful to the causal DAGs described in J.M. Robins et al., Uniform consistency in causal inference, Biometrika (2003), 90(3): 491–515. A general unfaithful probability distribution with multiple independences and conditional independences in the binary triple causal DAG is also given.
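The flavor of unfaithfulness is easiest to see in the standard linear-Gaussian analogue of path cancellation (the paper's constructions are for binary variables, which are more delicate): pick edge weights so the direct and mediated X-to-Z effects cancel, making X and Z marginally independent even though both edges into Z are active.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000

# DAG X -> Y, X -> Z, Y -> Z with the direct effect a chosen to cancel
# the mediated effect b*c, so cov(X, Z) = a + b*c = 0.
c, b = 0.8, 1.0
a = -b * c
X = rng.normal(size=n)
Y = c * X + rng.normal(size=n)
Z = a * X + b * Y + rng.normal(size=n)

# Marginally, X and Z look independent despite two active edges into Z...
print(f"corr(X, Z)      = {np.corrcoef(X, Z)[0, 1]:+.4f}")   # ~ 0
# ...but the dependence reappears once Y's contribution is removed,
# which is exactly a violation of faithfulness.
print(f"corr(X, Z - bY) = {np.corrcoef(X, Z - b * Y)[0, 1]:+.4f}")  # clearly nonzero
```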

[AI-56] Learning Metal Microstructural Heterogeneity through Spatial Mapping of Diffraction Latent Space Features

链接: https://arxiv.org/abs/2501.18064
作者: Mathieu Calvat,Chris Bean,Dhruv Anjaria,Hyoungryul Park,Haoren Wang,Kenneth Vecchio,J.C. Stinville
类目: Materials Science (cond-mat.mtrl-sci); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:To leverage advancements in machine learning for metallic materials design and property prediction, it is crucial to develop a data-reduced representation of metal microstructures that surpasses the limitations of current physics-based discrete microstructure descriptors. This need is particularly relevant for metallic materials processed through additive manufacturing, which exhibit complex hierarchical microstructures that cannot be adequately described using the conventional metrics typically applied to wrought materials. Furthermore, capturing the spatial heterogeneity of microstructures at different scales is necessary within such a framework to accurately predict their properties. To address these challenges, we propose the physical spatial mapping of metal diffraction latent space features. This approach integrates (i) point diffraction data encoding via variational autoencoders or contrastive learning and (ii) the physical mapping of the encoded values. Together, these steps offer a novel means to comprehensively describe metal microstructures. We demonstrate this approach on a wrought and additively manufactured alloy, showing that it effectively encodes microstructural information and enables direct identification of microstructural heterogeneity not directly possible by physics-based models. This data-reduced microstructure representation opens the application of machine learning models in accelerating metallic material design and accurately predicting their properties.

[AI-57] Progress in Artificial Intelligence and its Determinants

链接: https://arxiv.org/abs/2501.17894
作者: Michael R. Douglas,Sergiy Verstyuk
类目: General Economics (econ.GN); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Machine Learning (cs.LG); Physics and Society (physics.soc-ph)
*备注:

点击查看摘要

Abstract:We study long-run progress in artificial intelligence in a quantitative way. Many measures, including traditional ones such as patents and publications, machine learning benchmarks, and a new Aggregate State of the Art in ML (or ASOTA) Index we have constructed from these, show exponential growth at roughly constant rates over long periods. Production of patents and publications doubles every ten years, by contrast with the growth of computing resources driven by Moore’s Law, roughly a doubling every two years. We argue that the input of AI researchers is also crucial and its contribution can be objectively estimated. Consequently, we give a simple argument that explains the 5:1 relation between these two rates. We then discuss the application of this argument to different output measures and compare our analyses with predictions based on machine learning scaling laws proposed in existing literature. Our quantitative framework facilitates understanding, predicting, and modulating the development of these important technologies.

[AI-58] Knoop: Practical Enhancement of Knockoff with Over-Parameterization for Variable Selection

链接: https://arxiv.org/abs/2501.17889
作者: Xiaochen Zhang,Yunfeng Cai,Haoyi Xiong
类目: Machine Learning (stat.ML); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: An earlier version of our paper at Machine Learning

点击查看摘要

Abstract:Variable selection plays a crucial role in enhancing modeling effectiveness across diverse fields, addressing the challenges posed by high-dimensional datasets of correlated variables. This work introduces a novel approach, namely Knockoff with over-parameterization (Knoop), to enhance Knockoff filters for variable selection. Specifically, Knoop first generates multiple knockoff variables for each original variable and integrates them with the original variables into an over-parameterized Ridgeless regression model. For each original variable, Knoop evaluates the coefficient distribution of its knockoffs and compares these with the original coefficients to conduct an anomaly-based significance test, ensuring robust variable selection. Extensive experiments demonstrate superior performance compared to existing methods in both simulation and real-world datasets. Knoop achieves a notably higher Area under the Curve (AUC) of the Receiver Operating Characteristic (ROC) Curve for effectively identifying relevant variables against the ground truth by controlled simulations, while showcasing enhanced predictive accuracy across diverse regression and classification tasks. The analytical results further back up our observations.

[AI-59] RadioLLM: Introducing Large Language Model into Cognitive Radio via Hybrid Prompt and Token Reprogrammings

链接: https://arxiv.org/abs/2501.17888
作者: Shuai Chen,Yong Zu,Zhixi Feng,Shuyuan Yang,Mengchang Li,Yue Ma,Jun Liu,Qiukai Pan,Xinlei Zhang,Changjun Sun
类目: Signal Processing (eess.SP); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:The increasing scarcity of spectrum resources and the rapid growth of wireless devices have made efficient management of radio networks a critical challenge. Cognitive Radio Technology (CRT), when integrated with deep learning (DL), offers promising solutions for tasks such as radio signal classification (RSC), signal denoising, and spectrum allocation. However, existing DL-based CRT frameworks are often task-specific and lack scalability to diverse real-world scenarios. Meanwhile, Large Language Models (LLMs) have demonstrated exceptional generalization capabilities across multiple domains, making them a potential candidate for advancing CRT technologies. In this paper, we introduce RadioLLM, a novel framework that incorporates Hybrid Prompt and Token Reprogramming (HPTR) and a Frequency Attuned Fusion (FAF) module to enhance LLMs for CRT tasks. HPTR enables the integration of radio signal features with expert knowledge, while FAF improves the modeling of high-frequency features critical for precise signal processing. These innovations allow RadioLLM to handle diverse CRT tasks, bridging the gap between LLMs and traditional signal processing methods. Extensive empirical studies on multiple benchmark datasets demonstrate that the proposed RadioLLM achieves superior performance over current baselines.

[AI-60] Explainable and Robust Millimeter Wave Beam Alignment for AI-Native 6G Networks

链接: https://arxiv.org/abs/2501.17883
作者: Nasir Khan,Asmaa Abdallah,Abdulkadir Celik,Ahmed M. Eltawil,Sinem Coleri
类目: Signal Processing (eess.SP); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Integrated artificial intelligence (AI) and communication has been recognized as a key pillar of 6G and beyond networks. In line with AI-native 6G vision, explainability and robustness in AI-driven systems are critical for establishing trust and ensuring reliable performance in diverse and evolving environments. This paper addresses these challenges by developing a robust and explainable deep learning (DL)-based beam alignment engine (BAE) for millimeter-wave (mmWave) multiple-input multiple-output (MIMO) systems. The proposed convolutional neural network (CNN)-based BAE utilizes received signal strength indicator (RSSI) measurements over a set of wide beams to accurately predict the best narrow beam for each UE, significantly reducing the overhead associated with exhaustive codebook-based narrow beam sweeping for initial access (IA) and data transmission. To ensure transparency and resilience, the Deep k-Nearest Neighbors (DkNN) algorithm is employed to assess the internal representations of the network via nearest neighbor approach, providing human-interpretable explanations and confidence metrics for detecting out-of-distribution inputs. Experimental results demonstrate that the proposed DL-based BAE exhibits robustness to measurement noise, reduces beam training overhead by 75% compared to the exhaustive search while maintaining near-optimal performance in terms of spectral efficiency. Moreover, the proposed framework improves outlier detection robustness by up to 5x and offers clearer insights into beam prediction decisions compared to traditional softmax-based classifiers.

[AI-61] RayLoc: Wireless Indoor Localization via Fully Differentiable Ray-tracing

链接: https://arxiv.org/abs/2501.17881
作者: Xueqiang Han,Tianyue Zheng,Tony Xiao Han,Jun Luo
类目: Signal Processing (eess.SP); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Networking and Internet Architecture (cs.NI)
*备注:

点击查看摘要

Abstract:Wireless indoor localization has been a pivotal area of research over the last two decades, becoming a cornerstone for numerous sensing applications. However, conventional wireless localization methods rely on channel state information to perform blind modelling and estimation of a limited set of localization parameters. This oversimplification neglects many sensing scene details, resulting in suboptimal localization accuracy. To address this limitation, this paper presents a novel approach to wireless indoor localization by reformulating it as an inverse problem of wireless ray-tracing, inferring scene parameters that generate the measured CSI. At the core of our solution is a fully differentiable ray-tracing simulator that enables backpropagation to comprehensive parameters of the sensing scene, allowing for precise localization. To establish a robust localization context, RayLoc constructs a high-fidelity sensing scene by refining a coarse-grained background model. Furthermore, RayLoc overcomes the challenges of sparse gradient and local minima by convolving the signal generation process with a Gaussian kernel. Extensive experiments showcase that RayLoc outperforms traditional localization baselines and is able to generalize to different sensing environments.

[AI-62] Assessment of the January 2025 Los Angeles County wildfires: A multi-modal analysis of impact response and population exposure

链接: https://arxiv.org/abs/2501.17880
作者: Seyd Teymoor Seydi
类目: Signal Processing (eess.SP); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Numerical Analysis (math.NA)
*备注:

点击查看摘要

Abstract:This study presents a comprehensive analysis of four significant California wildfires: Palisades, Eaton, Kenneth, and Hurst, examining their impacts through multiple dimensions, including land cover change, jurisdictional management, structural damage, and demographic vulnerability. Using the Chebyshev-Kolmogorov-Arnold network model applied to Sentinel-2 imagery, the extent of burned areas was mapped, ranging from 315.36 to 10,960.98 hectares. Our analysis revealed that shrubland ecosystems were consistently the most affected, comprising 57.4-75.8% of burned areas across all events. The jurisdictional assessment demonstrated varying management complexities, from singular authority (98.7% in the Palisades Fire) to distributed management across multiple agencies. A structural impact analysis revealed significant disparities between urban interface fires (Eaton: 9,869 structures; Palisades: 8,436 structures) and rural events (Kenneth: 24 structures; Hurst: 17 structures). The demographic analysis showed consistent gender distributions, with 50.9% of the population identified as female and 49.1% as male. Working-age populations made up the majority of the affected populations, ranging from 53.7% to 54.1%, with notable temporal shifts in post-fire periods. The study identified strong correlations between urban interface proximity, structural damage, and population exposure. The Palisades and Eaton fires affected over 20,000 people each, compared to fewer than 500 in rural events. These findings offer valuable insights for the development of targeted wildfire management strategies, particularly in wildland urban interface zones, and emphasize the need for age- and gender-conscious approaches in emergency response planning.

机器学习

[LG-0] Accuracy and Robustness of Weight-Balancing Methods for Training PINNs

链接: https://arxiv.org/abs/2501.18582
作者: Matthieu Barreau,Haoming Shen
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Physics-Informed Neural Networks (PINNs) have emerged as powerful tools for integrating physics-based models with data by minimizing both data and physics losses. However, this multi-objective optimization problem is notoriously challenging, with some benchmark problems leading to unfeasible solutions. To address these issues, various strategies have been proposed, including adaptive weight adjustments in the loss function. In this work, we introduce clear definitions of accuracy and robustness in the context of PINNs and propose a novel training algorithm based on the Primal-Dual (PD) optimization framework. Our approach enhances the robustness of PINNs while maintaining comparable performance to existing weight-balancing methods. Numerical experiments demonstrate that the PD method consistently achieves reliable solutions across all investigated cases and can be easily implemented, facilitating its practical adoption. The code is available at this https URL.

[LG-1] Bias-variance decompositions: the exclusive privilege of Bregman divergences

链接: https://arxiv.org/abs/2501.18581
作者: Tom Heskes
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Bias-variance decompositions are widely used to understand the generalization performance of machine learning models. While the squared error loss permits a straightforward decomposition, other loss functions - such as zero-one loss or L_1 loss - either fail to sum bias and variance to the expected loss or rely on definitions that lack the essential properties of meaningful bias and variance. Recent research has shown that clean decompositions can be achieved for the broader class of Bregman divergences, with the cross-entropy loss as a special case. However, the necessary and sufficient conditions for these decompositions remain an open question. In this paper, we address this question by studying continuous, nonnegative loss functions that satisfy the identity of indiscernibles under mild regularity conditions. We prove that so-called g-Bregman divergences are the only such loss functions that have a clean bias-variance decomposition. A g-Bregman divergence can be transformed into a standard Bregman divergence through an invertible change of variables. This makes the squared Mahalanobis distance, up to such a variable transformation, the only symmetric loss function with a clean bias-variance decomposition. We also examine the impact of relaxing the restrictions on the loss functions and how this affects our results.

[LG-2] Node Classification and Search on the Rubik's Cube Graph with GNNs

链接: https://arxiv.org/abs/2501.18580
作者: Alessandro Barro
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:This study focuses on the application of deep geometric models to solve the 3x3x3 Rubik's Cube. We begin by discussing the cube's graph representation and defining distance as the model's optimization objective. The distance approximation task is reformulated as a node classification problem, effectively addressed using Graph Neural Networks (GNNs). After training the model on a random subgraph, the predicted classes are used to construct a heuristic for A* search. We conclude with experiments comparing our heuristic to that of the DeepCubeA model.
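Once the GNN has turned its predicted distance classes into a heuristic h(s), plugging it into A* is routine. Below is a minimal A* sketch on a toy state space (integers with +/-1 moves standing in for cube states); the learned predictor would replace the hand-written h.

```python
import heapq

def a_star(start, goal_test, neighbors, h):
    """A* search with unit move cost; h(state) is the heuristic."""
    frontier = [(h(start), 0, start, [start])]
    best_g = {}
    while frontier:
        f, g, state, path = heapq.heappop(frontier)
        if goal_test(state):
            return path
        if best_g.get(state, float("inf")) <= g:
            continue                      # already reached at least as cheaply
        best_g[state] = g
        for nxt in neighbors(state):
            heapq.heappush(frontier, (g + 1 + h(nxt), g + 1, nxt, path + [nxt]))
    return None

# Toy stand-in: states are integers, the goal is 0, moves are +/-1, and the
# "learned" heuristic is an exact distance; swap in gnn_predicted_distance(s).
path = a_star(7, lambda s: s == 0,
              lambda s: [s - 1, s + 1],
              lambda s: abs(s))
print(path)   # [7, 6, 5, 4, 3, 2, 1, 0]
```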

[LG-3] Token-Hungry Yet Precise: DeepSeek R1 Highlights the Need for Multi-Step Reasoning Over Speed in MATH

链接: https://arxiv.org/abs/2501.18576
作者: Evgenii Evstafev
类目: Machine Learning (cs.LG)
*备注: 5 pages, 1 figure, 1 table

点击查看摘要

Abstract:This study investigates the performance of the DeepSeek R1 language model on 30 challenging mathematical problems derived from the MATH dataset, problems that previously proved unsolvable by other models under time constraints. Unlike prior work, this research removes time limitations to explore whether DeepSeek R1’s architecture, known for its reliance on token-based reasoning, can achieve accurate solutions through a multi-step process. The study compares DeepSeek R1 with four other models (gemini-1.5-flash-8b, gpt-4o-mini-2024-07-18, llama3.1:8b, and mistral-8b-latest) across 11 temperature settings. Results demonstrate that DeepSeek R1 achieves superior accuracy on these complex problems but generates significantly more tokens than other models, confirming its token-intensive approach. The findings highlight a trade-off between accuracy and efficiency in mathematical problem-solving with large language models: while DeepSeek R1 excels in accuracy, its reliance on extensive token generation may not be optimal for applications requiring rapid responses. The study underscores the importance of considering task-specific requirements when selecting an LLM and emphasizes the role of temperature settings in optimizing performance.

[LG-4] No Equations Needed: Learning System Dynamics Without Relying on Closed-Form ODEs ICLR2025

链接: https://arxiv.org/abs/2501.18563
作者: Krzysztof Kacprzyk,Mihaela van der Schaar
类目: Machine Learning (cs.LG)
*备注: To appear in the Proceedings of the Thirteenth International Conference on Learning Representations (ICLR 2025)

点击查看摘要

Abstract:Data-driven modeling of dynamical systems is a crucial area of machine learning. In many scenarios, a thorough understanding of the model’s behavior becomes essential for practical applications. For instance, understanding the behavior of a pharmacokinetic model, constructed as part of drug development, may allow us to both verify its biological plausibility (e.g., the drug concentration curve is non-negative and decays to zero) and to design dosing guidelines. Discovery of closed-form ordinary differential equations (ODEs) can be employed to obtain such insights by finding a compact mathematical equation and then analyzing it (a two-step approach). However, its widespread use is currently hindered because the analysis process may be time-consuming, requiring substantial mathematical expertise, or even impossible if the equation is too complex. Moreover, if the found equation’s behavior does not satisfy the requirements, editing it or influencing the discovery algorithms to rectify it is challenging as the link between the symbolic form of an ODE and its behavior can be elusive. This paper proposes a conceptual shift to modeling low-dimensional dynamical systems by departing from the traditional two-step modeling process. Instead of first discovering a closed-form equation and then analyzing it, our approach, direct semantic modeling, predicts the semantic representation of the dynamical system (i.e., description of its behavior) directly from data, bypassing the need for complex post-hoc analysis. This direct approach also allows the incorporation of intuitive inductive biases into the optimization algorithm and editing the model’s behavior directly, ensuring that the model meets the desired specifications. Our approach not only simplifies the modeling pipeline but also enhances the transparency and flexibility of the resulting models compared to traditional closed-form ODEs.

[LG-5] Bandits with Anytime Knapsacks

链接: https://arxiv.org/abs/2501.18560
作者: Eray Can Elumar,Cem Tekin,Osman Yagan
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We consider bandits with anytime knapsacks (BwAK), a novel version of the BwK problem where there is an anytime cost constraint instead of a total cost budget. This problem setting introduces additional complexities as it mandates adherence to the constraint throughout the decision-making process. We propose SUAK, an algorithm that utilizes upper confidence bounds to identify the optimal mixture of arms while maintaining a balance between exploration and exploitation. SUAK is an adaptive algorithm that strategically utilizes the available budget in each round in the decision-making process and skips a round when it is possible to violate the anytime cost constraint. In particular, SUAK slightly under-utilizes the available cost budget to reduce the need for skipping rounds. We show that SUAK attains the same problem-dependent regret upper bound of O(K \log T) established in prior work under the simpler BwK framework. Finally, we provide simulations to verify the utility of SUAK in practical settings.
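
下面给出一个基于摘要思路的极简 Python 示意(非论文官方实现):用 UCB 估计报酬、用 LCB 估计成本,并在本轮拉臂可能突破 anytime 预算约束时跳过该轮。其中预算留存系数 rho、奖励/成本的噪声分布以及按报酬成本比选臂的判据,均为本文为演示而假设的简化设定。

```python
import numpy as np

def suak_sketch(T, K, true_means, true_costs, budget_rate=0.5, rho=0.95, seed=0):
    """在每一轮保证累计花费不超过 budget_rate * t(anytime 约束)的 UCB 变体示意。"""
    rng = np.random.default_rng(seed)
    counts = np.zeros(K)
    reward_sum = np.zeros(K)
    cost_sum = np.zeros(K)
    spent, total_reward = 0.0, 0.0
    for t in range(1, T + 1):
        if counts.min() == 0:
            arm = int(np.argmin(counts))  # 先把每个臂各拉一次
        else:
            ucb_r = reward_sum / counts + np.sqrt(2 * np.log(t) / counts)
            lcb_c = np.maximum(cost_sum / counts - np.sqrt(2 * np.log(t) / counts), 0)
            arm = int(np.argmax(ucb_r / (lcb_c + 1e-8)))  # 报酬/成本比的乐观估计
        est_cost = cost_sum[arm] / max(counts[arm], 1)
        # anytime 约束:若本轮拉动可能突破 rho * 可用预算,则跳过该轮
        if spent + est_cost > rho * budget_rate * t:
            continue
        r = rng.normal(true_means[arm], 0.1)
        c = np.clip(rng.normal(true_costs[arm], 0.05), 0, None)
        counts[arm] += 1
        reward_sum[arm] += r
        cost_sum[arm] += c
        spent += c
        total_reward += r
    return total_reward
```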

[LG-6] Loss Functions and Operators Generated by f-Divergences

链接: https://arxiv.org/abs/2501.18537
作者: Vincent Roulet,Tianlin Liu,Nino Vieillard,Michael E. Sander,Mathieu Blondel
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:The logistic loss (a.k.a. cross-entropy loss) is one of the most popular loss functions used for multiclass classification. It is also the loss function of choice for next-token prediction in language modeling. It is associated with the Kullback–Leibler (KL) divergence and the softargmax operator. In this work, we propose to construct new convex loss functions based on f-divergences. Our loss functions generalize the logistic loss in two directions: i) by replacing the KL divergence with f-divergences and ii) by allowing non-uniform reference measures. We instantiate our framework for numerous f-divergences, recovering existing losses and creating new ones. By analogy with the logistic loss, the loss function generated by an f-divergence is associated with an operator, that we dub f-softargmax. We derive a novel parallelizable bisection algorithm for computing the f-softargmax associated with any f-divergence. On the empirical side, one of the goals of this paper is to determine the effectiveness of loss functions beyond the classical cross-entropy in a language model setting, including on pre-training, post-training (SFT) and distillation. We show that the loss function generated by the \alpha-divergence (which is equivalent to Tsallis \alpha-negentropy in the case of unit reference measures) with \alpha=1.5 performs well across several tasks.
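
作为该框架的一个可验证实例,下面用二分法计算 \alpha=1.5 时的 f-softargmax(即熟知的 entmax 变换,Peters et al., 2019)。这只是示意实现,并非论文中并行化算法的原样复现,且假设 \alpha > 1。

```python
import numpy as np

def alpha_entmax_bisect(z, alpha=1.5, n_iter=50):
    """二分求解 p_i = [(alpha-1) * z_i - tau]_+^{1/(alpha-1)},使 sum(p) = 1(假设 alpha > 1)。"""
    z = (alpha - 1) * np.asarray(z, dtype=float)
    # tau 的上下界:tau = max(z) 时 sum(p) = 0;tau = max(z) - 1 时 sum(p) >= 1
    lo, hi = z.max() - 1.0, z.max()
    for _ in range(n_iter):
        tau = (lo + hi) / 2
        p = np.clip(z - tau, 0, None) ** (1 / (alpha - 1))
        if p.sum() < 1.0:
            hi = tau   # tau 偏大,概率总和不足 1
        else:
            lo = tau
    p = np.clip(z - (lo + hi) / 2, 0, None) ** (1 / (alpha - 1))
    return p / p.sum()  # 数值兜底归一化

print(alpha_entmax_bisect([2.0, 1.0, -1.0]))  # 稀疏的概率分布,可能有精确为 0 的分量
```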

[LG-7] Graph Learning for Bidirectional Disease Contact Tracing on Real Human Mobility Data

链接: https://arxiv.org/abs/2501.18531
作者: Sofia Hurtado,Radu Marculescu
类目: Social and Information Networks (cs.SI); Machine Learning (cs.LG)
*备注: Accepted into International Workshop on Disaster Network Science for Building Resilient Communities (REINFORCE) held at the Advances in Social Networks Analysis and Mining conference

点击查看摘要

Abstract:For rapidly spreading diseases where many cases show no symptoms, swift and effective contact tracing is essential. While exposure notification applications provide alerts on potential exposures, a fully automated system is needed to track the infectious transmission routes. To this end, our research leverages large-scale contact networks from real human mobility data to identify the path of transmission. More precisely, we introduce a new Infectious Path Centrality network metric that informs a graph learning edge classifier to identify important transmission events, achieving an F1-score of 94%. Additionally, we explore bidirectional contact tracing, which quarantines individuals both retroactively and proactively, and compare its effectiveness against traditional forward tracing, which only isolates individuals after testing positive. Our results indicate that when only 30% of symptomatic individuals are tested, bidirectional tracing can reduce infectious effective reproduction rate by 71%, thus significantly controlling the outbreak.

[LG-8] Joint Learning of Energy-based Models and their Partition Function

链接: https://arxiv.org/abs/2501.18528
作者: Michael E. Sander,Vincent Roulet,Tianlin Liu,Mathieu Blondel
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:Energy-based models (EBMs) offer a flexible framework for parameterizing probability distributions using neural networks. However, learning EBMs by exact maximum likelihood estimation (MLE) is generally intractable, due to the need to compute the partition function (normalization constant). In this paper, we propose a novel formulation for approximately learning probabilistic EBMs in combinatorially-large discrete spaces, such as sets or permutations. Our key idea is to jointly learn both an energy model and its log-partition, both parameterized as a neural network. Our approach not only provides a novel tractable objective criterion to learn EBMs by stochastic gradient descent (without relying on MCMC), but also a novel means to estimate the log-partition function on unseen data points. On the theoretical side, we show that our approach recovers the optimal MLE solution when optimizing in the space of continuous functions. Furthermore, we show that our approach naturally extends to the broader family of Fenchel-Young losses, allowing us to obtain the first tractable method for optimizing the sparsemax loss in combinatorially-large spaces. We demonstrate our approach on multilabel classification and label ranking.

[LG-9] Neural Discovery in Mathematics: Do Machines Dream of Colored Planes?

链接: https://arxiv.org/abs/2501.18527
作者: Konrad Mundinger,Max Zimmer,Aldo Kiem,Christoph Spiegel,Sebastian Pokutta
类目: Machine Learning (cs.LG); Combinatorics (math.CO)
*备注: 8 pages main paper, 10 pages references and appendix, 17 figures, 1 table

点击查看摘要

Abstract:We demonstrate how neural networks can drive mathematical discovery through a case study of the Hadwiger-Nelson problem, a long-standing open problem from discrete geometry and combinatorics about coloring the plane avoiding monochromatic unit-distance pairs. Using neural networks as approximators, we reformulate this mixed discrete-continuous geometric coloring problem as an optimization task with a probabilistic, differentiable loss function. This enables gradient-based exploration of admissible configurations that most significantly led to the discovery of two novel six-colorings, providing the first improvements in thirty years to the off-diagonal variant of the original problem (Mundinger et al., 2024a). Here, we establish the underlying machine learning approach used to obtain these results and demonstrate its broader applicability through additional results and numerical insights.
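
摘要中"概率化、可微的着色损失"可以用如下 PyTorch 草图还原其基本思想(网络结构、采样范围与步数均为演示假设):网络把平面上的点映射为颜色分布,损失是单位距离点对取到同色的概率。

```python
import torch

n_colors = 6
net = torch.nn.Sequential(
    torch.nn.Linear(2, 128), torch.nn.ReLU(),
    torch.nn.Linear(128, 128), torch.nn.ReLU(),
    torch.nn.Linear(128, n_colors),
)
opt = torch.optim.Adam(net.parameters(), lr=1e-3)

for step in range(1000):
    x = (torch.rand(4096, 2) - 0.5) * 20.0  # 平面上的随机点
    theta = torch.rand(4096, 1) * 2 * torch.pi
    y = x + torch.cat([torch.cos(theta), torch.sin(theta)], dim=1)  # 与 x 距离恰为 1
    px, py = net(x).softmax(-1), net(y).softmax(-1)
    loss = (px * py).sum(-1).mean()  # 单位距离点对同色的期望概率
    opt.zero_grad()
    loss.backward()
    opt.step()
```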

[LG-10] MolGraph-xLSTM: A graph-based dual-level xLSTM framework with multi-head mixture-of-experts for enhanced molecular representation and interpretability

链接: https://arxiv.org/abs/2501.18439
作者: Yan Sun,Yutong Lu,Yan Yi Li,Zihao Jing,Carson K. Leung,Pingzhao Hu
类目: Machine Learning (cs.LG); Biomolecules (q-bio.BM)
*备注:

点击查看摘要

Abstract:Predicting molecular properties is essential for drug discovery, and computational methods can greatly enhance this process. Molecular graphs have become a focus for representation learning, with Graph Neural Networks (GNNs) widely used. However, GNNs often struggle with capturing long-range dependencies. To address this, we propose MolGraph-xLSTM, a novel graph-based xLSTM model that enhances feature extraction and effectively models molecule long-range interactions. Our approach processes molecular graphs at two scales: atom-level and motif-level. For atom-level graphs, a GNN-based xLSTM framework with jumping knowledge extracts local features and aggregates multilayer information to capture both local and global patterns effectively. Motif-level graphs provide complementary structural information for a broader molecular view. Embeddings from both scales are refined via a multi-head mixture of experts (MHMoE), further enhancing expressiveness and performance. We validate MolGraph-xLSTM on 10 molecular property prediction datasets, covering both classification and regression tasks. Our model demonstrates consistent performance across all datasets, with improvements of up to 7.03% on the BBBP dataset for classification and 7.54% on the ESOL dataset for regression compared to baselines. On average, MolGraph-xLSTM achieves an AUROC improvement of 3.18% for classification tasks and an RMSE reduction of 3.83% across regression datasets compared to the baseline methods. These results confirm the effectiveness of our model, offering a promising solution for molecular representation learning for drug discovery.

[LG-11] Causal Inference Real-Time Anomaly Detection with Synthetic Anomaly Monitoring (SAM)

链接: https://arxiv.org/abs/2501.18417
作者: Emanuele Luzio,Moacir Antonelli Ponti
类目: Machine Learning (cs.LG)
*备注: 19 pages, 3 figures, submitted for publication

点击查看摘要

Abstract:Anomaly detection is essential for identifying rare and significant events across diverse domains such as finance, cybersecurity, and network monitoring. This paper presents Synthetic Anomaly Monitoring (SAM), an innovative approach that applies synthetic control methods from causal inference to improve both the accuracy and interpretability of anomaly detection processes. By modeling normal behavior through the treatment of each feature as a control unit, SAM identifies anomalies as deviations within this causal framework. We conducted extensive experiments comparing SAM with established benchmark models, including Isolation Forest, Local Outlier Factor (LOF), k-Nearest Neighbors (kNN), and One-Class Support Vector Machine (SVM), across five diverse datasets, including Credit Card Fraud, HTTP Dataset CSIC 2010, and KDD Cup 1999, among others. Our results demonstrate that SAM consistently delivers robust performance, highlighting its potential as a powerful tool for real-time anomaly detection in dynamic and complex environments.

[LG-12] Exploring Potential Prompt Injection Attacks in Federated Military LLM s and Their Mitigation

链接: https://arxiv.org/abs/2501.18416
作者: Youngjoon Lee,Taehyun Park,Yunho Lee,Jinu Gong,Joonhyuk Kang
类目: Machine Learning (cs.LG)
*备注: 7 pages

点击查看摘要

Abstract:Federated Learning (FL) is increasingly being adopted in military collaborations to develop Large Language Models (LLMs) while preserving data sovereignty. However, prompt injection attacks-malicious manipulations of input prompts-pose new threats that may undermine operational security, disrupt decision-making, and erode trust among allies. This perspective paper highlights four potential vulnerabilities in federated military LLMs: secret data leakage, free-rider exploitation, system disruption, and misinformation spread. To address these potential risks, we propose a human-AI collaborative framework that introduces both technical and policy countermeasures. On the technical side, our framework uses red/blue team wargaming and quality assurance to detect and mitigate adversarial behaviors of shared LLM weights. On the policy side, it promotes joint AI-human policy development and verification of security protocols. Our findings will guide future research and emphasize proactive strategies for emerging military contexts.

[LG-13] Segmentation of cracks in 3d images of fiber reinforced concrete using deep learning

链接: https://arxiv.org/abs/2501.18405
作者: Anna Nowacka,Katja Schladitz,Szymon Grzesiak,Matthias Pahn
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Cracks in concrete structures are very common and are an integral part of this heterogeneous material. Characteristics of cracks induced by standardized tests yield valuable information about the tested concrete formulation and its mechanical properties. Observing cracks on the surface of the concrete structure leaves a wealth of structural information unused. Computed tomography enables looking into the sample without interfering or destroying the microstructure. The reconstructed tomographic images are 3d images, consisting of voxels whose gray values represent local X-ray absorption. In order to identify voxels belonging to the crack, that is, to segment the crack structure in the images, appropriate algorithms need to be developed. Convolutional neural networks are known to solve this type of task very well given enough and consistent training data. We adapted a 3d version of the well-known U-Net and trained it on semi-synthetic 3d images of real concrete samples equipped with simulated crack structures. Here, we explain the general approach. Moreover, we show how to teach the network to detect also real crack systems in 3d images of varying types of real concrete, in particular of fiber reinforced concrete.

[LG-14] Improved Replicable Boosting with Majority-of-Majorities

链接: https://arxiv.org/abs/2501.18388
作者: Kasper Green Larsen,Markus Engelund Mathiasen,Clement Svendsen
类目: Machine Learning (cs.LG)
*备注: 13 pages

点击查看摘要

Abstract:We introduce a new replicable boosting algorithm which significantly improves the sample complexity compared to previous algorithms. The algorithm works by doing two layers of majority voting, using an improved version of the replicable boosting algorithm introduced by Impagliazzo et al. [2022] in the bottom layer.

[LG-15] Function Encoders: A Principled Approach to Transfer Learning in Hilbert Spaces

链接: https://arxiv.org/abs/2501.18373
作者: Tyler Ingebrand,Adam J. Thorpe,Ufuk Topcu
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:A central challenge in transfer learning is designing algorithms that can quickly adapt and generalize to new tasks without retraining. Yet, the conditions under which algorithms can effectively transfer to new tasks are poorly characterized. We introduce a geometric characterization of transfer in Hilbert spaces and define three types of inductive transfer: interpolation within the convex hull, extrapolation to the linear span, and extrapolation outside the span. We propose a method grounded in the theory of function encoders to achieve all three types of transfer. Specifically, we introduce a novel training scheme for function encoders using least-squares optimization, prove a universal approximation theorem for function encoders, and provide a comprehensive comparison with existing approaches such as transformers and meta-learning on four diverse benchmarks. Our experiments demonstrate that the function encoder outperforms state-of-the-art methods on four benchmark tasks and on all three types of transfer.
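
函数编码器的核心步骤是用最小二乘在一组已学好的基函数上求新任务的系数,这一步可以用几行 NumPy 说明(这里用 sin/cos 充当基函数,论文中它们是训练得到的神经网络,属演示假设):

```python
import numpy as np

def fit_coefficients(basis_fns, X, y):
    """给定一组(已训练的)基函数 g_1..g_k,用最小二乘求新任务 f ≈ Σ c_i g_i 的系数。"""
    G = np.stack([g(X) for g in basis_fns], axis=1)  # (n_samples, k)
    c, *_ = np.linalg.lstsq(G, y, rcond=None)
    return c

# 玩具示例:用两个固定基函数拟合一个新函数
basis = [np.sin, np.cos]
X = np.linspace(0, 3, 100)
y = 2.0 * np.sin(X) - 0.5 * np.cos(X)
c = fit_coefficients(basis, X, y)  # 约为 [2.0, -0.5]
```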

[LG-16] A Cartesian Encoding Graph Neural Network for Crystal Structures Property Prediction: Application to Thermal Ellipsoid Estimation

链接: https://arxiv.org/abs/2501.18369
作者: Àlex Solé,Albert Mosella-Montoro,Joan Cardona,Silvia Gómez-Coca,Daniel Aravena,Eliseo Ruiz,Javier Ruiz-Hidalgo
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:In diffraction-based crystal structure analysis, thermal ellipsoids, quantified via Anisotropic Displacement Parameters (ADPs), are critical yet challenging to determine. ADPs capture atomic vibrations, reflecting thermal and structural properties, but traditional computation is often expensive. This paper introduces CartNet, a novel graph neural network (GNN) for efficiently predicting crystal properties by encoding atomic geometry into Cartesian coordinates alongside the crystal temperature. CartNet integrates a neighbour equalization technique to emphasize covalent and contact interactions, and a Cholesky-based head to ensure valid ADP predictions. We also propose a rotational SO(3) data augmentation strategy during training to handle unseen orientations. An ADP dataset with over 200,000 experimental crystal structures from the Cambridge Structural Database (CSD) was curated to validate the approach. CartNet significantly reduces computational costs and outperforms existing methods in ADP prediction by 10.87%, while delivering a 34.77% improvement over theoretical approaches. We further evaluated CartNet on other datasets covering formation energy, band gap, total energy, energy above the convex hull, bulk moduli, and shear moduli, achieving 7.71% better results on the Jarvis Dataset and 13.16% on the Materials Project Dataset. These gains establish CartNet as a state-of-the-art solution for diverse crystal property predictions. Project website and online demo: this https URL
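
摘要提到的"基于 Cholesky 的预测头保证 ADP 有效"可以用下面的 PyTorch 草图说明(输入维度与激活选择为演示假设):预测下三角矩阵 L 并输出 L L^T,从而保证结果对称且半正定。

```python
import torch

class CholeskyHead(torch.nn.Module):
    """输出 3x3 对称半正定矩阵(有效 ADP)的预测头示意:预测下三角 L,返回 L @ L^T。"""
    def __init__(self, in_dim):
        super().__init__()
        self.fc = torch.nn.Linear(in_dim, 6)  # 3x3 下三角共 6 个自由度

    def forward(self, h):
        p = self.fc(h)
        L = h.new_zeros(h.shape[0], 3, 3)
        idx = torch.tril_indices(3, 3)
        L[:, idx[0], idx[1]] = p
        diag = torch.arange(3)
        L[:, diag, diag] = torch.nn.functional.softplus(L[:, diag, diag])  # 对角线取正
        return L @ L.transpose(-1, -2)  # 对称半正定

head = CholeskyHead(in_dim=32)
U = head(torch.randn(5, 32))  # (5, 3, 3),每个都是有效的 ADP 张量
```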

[LG-17] Robust Online Conformal Prediction under Uniform Label Noise

链接: https://arxiv.org/abs/2501.18363
作者: Huajun Xi,Kangdao Liu,Hao Zeng,Wenguang Sun,Hongxin Wei
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Conformal prediction is an emerging technique for uncertainty quantification that constructs prediction sets guaranteed to contain the true label with a predefined probability. Recent work develops online conformal prediction methods that adaptively construct prediction sets to accommodate distribution shifts. However, existing algorithms typically assume perfect label accuracy which rarely holds in practice. In this work, we investigate the robustness of online conformal prediction under uniform label noise with a known noise rate, in both constant and dynamic learning rate schedules. We show that label noise causes a persistent gap between the actual mis-coverage rate and the desired rate \alpha, leading to either overestimated or underestimated coverage guarantees. To address this issue, we propose Noise Robust Online Conformal Prediction (dubbed NR-OCP) by updating the threshold with a novel robust pinball loss, which provides an unbiased estimate of the clean pinball loss without requiring ground-truth labels. Our theoretical analysis shows that NR-OCP eliminates the coverage gap in both constant and dynamic learning rate schedules, achieving a convergence rate of \mathcal{O}(T^{-1/2}) for both empirical and expected coverage errors under uniform label noise. Extensive experiments demonstrate the effectiveness of our method by achieving both precise coverage and improved efficiency.
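
下面是在线共形预测阈值更新(pinball 次梯度步)的极简示意;其中针对均匀标签噪声的去偏校正只是本文为演示而假设的一阶形式,论文中 robust pinball loss 的精确构造请参考原文。

```python
import numpy as np

def online_conformal(scores_stream, labels_stream, alpha=0.1, lr=0.05,
                     eps=0.0, n_classes=10):
    """在线共形预测阈值更新示意(Gibbs & Candès 风格)。
    eps > 0 时的去偏校正为本文假设的简化形式,非论文精确构造。"""
    q = 0.5  # 非一致性分数阈值,预测集 = {分数 <= q 的类别}
    for s, y in zip(scores_stream, labels_stream):
        err = float(s[y] > q)  # 真标签分数超过阈值 => 本轮未覆盖
        if eps > 0:
            # 假设的均匀标签噪声一阶去偏(演示用)
            err = (err - eps / n_classes) / (1 - eps)
        q += lr * (err - alpha)  # pinball 损失的(次)梯度步
    return q
```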

[LG-18] Contrastive Learning Meets Pseudo-label-assisted Mixup Augmentation: A Comprehensive Graph Representation Framework from Local to Global

链接: https://arxiv.org/abs/2501.18357
作者: Jinlu Wang,Yanfeng Sun,Jiapu Wang,Junbin Gao,Shaofan Wang,Jipeng Guo
类目: Machine Learning (cs.LG)
*备注: 9 pages, 2 figures

点击查看摘要

Abstract:Graph Neural Networks (GNNs) have demonstrated remarkable effectiveness in various graph representation learning tasks. However, most existing GNNs focus primarily on capturing local information through explicit graph convolution, often neglecting global message-passing. This limitation hinders the establishment of a collaborative interaction between global and local information, which is crucial for comprehensively understanding graph data. To address these challenges, we propose a novel framework called Comprehensive Graph Representation Learning (ComGRL). ComGRL integrates local information into global information to derive powerful representations. It achieves this by implicitly smoothing local information through flexible graph contrastive learning, ensuring reliable representations for subsequent global exploration. Then ComGRL transfers the locally derived representations to a multi-head self-attention module, enhancing their discriminative ability by uncovering diverse and rich global correlations. To further optimize local information dynamically under the self-supervision of pseudo-labels, ComGRL employs a triple sampling strategy to construct mixed node pairs and applies reliable Mixup augmentation across attributes and structure for local contrastive learning. This approach broadens the receptive field and facilitates coordination between local and global representation learning, enabling them to reinforce each other. Experimental results across six widely used graph datasets demonstrate that ComGRL achieves excellent performance in node classification tasks. The code is available at this https URL.

[LG-19] Stream-Based Monitoring of Algorithmic Fairness

链接: https://arxiv.org/abs/2501.18331
作者: Jan Baumeister,Bernd Finkbeiner,Frederik Scheerer,Julian Siber,Tobias Wagenpfeil
类目: Machine Learning (cs.LG); Logic in Computer Science (cs.LO); Software Engineering (cs.SE)
*备注: 31st International Conference on Tools and Algorithms for the Construction and Analysis of Systems (TACAS 2025)

点击查看摘要

Abstract:Automatic decision and prediction systems are increasingly deployed in applications where they significantly impact the livelihood of people, such as for predicting the creditworthiness of loan applicants or the recidivism risk of defendants. These applications have given rise to a new class of algorithmic-fairness specifications that require the systems to decide and predict without bias against social groups. Verifying these specifications statically is often out of reach for realistic systems, since the systems may, e.g., employ complex learning components, and reason over a large input space. In this paper, we therefore propose stream-based monitoring as a solution for verifying the algorithmic fairness of decision and prediction systems at runtime. Concretely, we present a principled way to formalize algorithmic fairness over temporal data streams in the specification language RTLola and demonstrate the efficacy of this approach on a number of benchmarks. Besides synthetic scenarios that particularly highlight its efficiency on streams with a scaling amount of data, we notably evaluate the monitor on real-world data from the recidivism prediction tool COMPAS.
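
论文采用 RTLola 规范语言做流式监控;作为直观示意,下面用 Python 实现一个最简的"人口均等"(demographic parity)流式监控器。阈值与告警判据均为演示假设,并非 RTLola 规范本身。

```python
from collections import defaultdict

class FairnessMonitor:
    """流式监控各群体正向决策率之差的极简示意。"""
    def __init__(self, threshold=0.1):
        self.pos = defaultdict(int)
        self.total = defaultdict(int)
        self.threshold = threshold

    def observe(self, group, decision):
        """逐事件更新统计;任意两组通过率之差超过阈值即告警。"""
        self.total[group] += 1
        self.pos[group] += int(decision)
        rates = [self.pos[g] / self.total[g] for g in self.total]
        return len(rates) > 1 and (max(rates) - min(rates)) > self.threshold

mon = FairnessMonitor()
for group, decision in [("A", 1), ("B", 0), ("A", 1), ("B", 0), ("B", 1)]:
    if mon.observe(group, decision):
        print("fairness alert at this event")
```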

[LG-20] A Unified Perspective on the Dynamics of Deep Transformers

链接: https://arxiv.org/abs/2501.18322
作者: Valérie Castin,Pierre Ablin,José Antonio Carrillo,Gabriel Peyré
类目: Machine Learning (cs.LG); Analysis of PDEs (math.AP)
*备注:

点击查看摘要

Abstract:Transformers, which are state-of-the-art in most machine learning tasks, represent the data as sequences of vectors called tokens. This representation is then exploited by the attention function, which learns dependencies between tokens and is key to the success of Transformers. However, the iterative application of attention across layers induces complex dynamics that remain to be fully understood. To analyze these dynamics, we identify each input sequence with a probability measure and model its evolution as a Vlasov equation called Transformer PDE, whose velocity field is non-linear in the probability measure. Our first set of contributions focuses on compactly supported initial data. We show the Transformer PDE is well-posed and is the mean-field limit of an interacting particle system, thus generalizing and extending previous analysis to several variants of self-attention: multi-head attention, L2 attention, Sinkhorn attention, Sigmoid attention, and masked attention–leveraging a conditional Wasserstein framework. In a second set of contributions, we are the first to study non-compactly supported initial conditions, by focusing on Gaussian initial data. Again for different types of attention, we show that the Transformer PDE preserves the space of Gaussian measures, which allows us to analyze the Gaussian case theoretically and numerically to identify typical behaviors. This Gaussian analysis captures the evolution of data anisotropy through a deep Transformer. In particular, we highlight a clustering phenomenon that parallels previous results in the non-normalized discrete case.

[LG-21] Update Estimation and Scheduling for Over-the-Air Federated Learning with Energy Harvesting Devices

链接: https://arxiv.org/abs/2501.18298
作者: Furkan Bagci,Busra Tegin,Mohammad Kazemi,Tolga M. Duman
类目: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC)
*备注: 6 pages

点击查看摘要

Abstract:We study over-the-air (OTA) federated learning (FL) for energy harvesting devices with heterogeneous data distribution over a wireless fading multiple access channel (MAC). To address the impact of low energy arrivals and data heterogeneity on global learning, we propose user scheduling strategies. Specifically, we develop two approaches: 1) entropy-based scheduling for known data distributions and 2) least-squares-based user representation estimation for scheduling with unknown data distributions at the parameter server. Both methods aim to select diverse users, mitigating bias and enhancing convergence. Numerical and analytical results demonstrate improved learning performance by reducing redundancy and conserving energy.

[LG-22] Leveraging Sparsity for Sample-Efficient Preference Learning: A Theoretical Perspective

链接: https://arxiv.org/abs/2501.18282
作者: Yunzhen Yao,Lie He,Michael Gastpar
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:This paper considers the sample-efficiency of preference learning, which models and predicts human choices based on comparative judgments. The minimax optimal estimation rate \Theta(d/n) in traditional estimation theory requires that the number of samples n scales linearly with the dimensionality of the feature space d. However, the high dimensionality of the feature space and the high cost of collecting human-annotated data challenge the efficiency of traditional estimation methods. To remedy this, we leverage sparsity in the preference model and establish sharp estimation rates. We show that under the sparse random utility model, where the parameter of the reward function is k-sparse, the minimax optimal rate can be reduced to \Theta(k \log(d/k)/n). Furthermore, we analyze the \ell_1-regularized estimator and show that it achieves a near-optimal rate under mild assumptions on the Gram matrix. Experiments on synthetic data and LLM alignment data validate our theoretical findings, showing that sparsity-aware methods significantly reduce sample complexity and improve prediction accuracy.
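
摘要中分析的 \ell_1 正则估计量可以在 Bradley-Terry 型偏好模型上用 scikit-learn 快速验证。数据为合成,维度、稀疏度与正则强度均为演示假设:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
d, k, n = 200, 5, 400
w = np.zeros(d)
w[:k] = rng.normal(size=k)                 # k-稀疏的真实奖励参数
A, B = rng.normal(size=(n, d)), rng.normal(size=(n, d))
diff = A - B                               # 成对比较的特征差
prob = 1 / (1 + np.exp(-diff @ w))         # Bradley-Terry 选择概率
y = (rng.random(n) < prob).astype(int)

# L1 正则的 MLE,对应摘要中分析的 ell_1-regularized estimator
clf = LogisticRegression(penalty="l1", C=0.5, solver="liblinear", fit_intercept=False)
clf.fit(diff, y)
print("非零系数个数:", np.sum(clf.coef_ != 0))  # 远小于 d,体现稀疏恢复
```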

[LG-23] ReactEmbed: A Cross-Domain Framework for Protein-Molecule Representation Learning via Biochemical Reaction Networks

链接: https://arxiv.org/abs/2501.18278
作者: Amitay Sicherman,Kira Radinsky
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:The challenge in computational biology and drug discovery lies in creating comprehensive representations of proteins and molecules that capture their intrinsic properties and interactions. Traditional methods often focus on unimodal data, such as protein sequences or molecular structures, limiting their ability to capture complex biochemical relationships. This work enhances these representations by integrating biochemical reactions encompassing interactions between molecules and proteins. By leveraging reaction data alongside pre-trained embeddings from state-of-the-art protein and molecule models, we develop ReactEmbed, a novel method that creates a unified embedding space through contrastive learning. We evaluate ReactEmbed across diverse tasks, including drug-target interaction, protein-protein interaction, protein property prediction, and molecular property prediction, consistently surpassing all current state-of-the-art models. Notably, we showcase ReactEmbed’s practical utility through successful implementation in lipid nanoparticle-based drug delivery, enabling zero-shot prediction of blood-brain barrier permeability for protein-nanoparticle complexes. The code and a comprehensive database of reaction pairs are available for open use on GitHub: this https URL.

[LG-24] Sebra: Debiasing Through Self-Guided Bias Ranking ICLR2025

链接: https://arxiv.org/abs/2501.18277
作者: Adarsh Kappiyath,Abhra Chaudhuri,Ajay Jaiswal,Ziquan Liu,Yunpeng Li,Xiatian Zhu,Lu Yin
类目: Machine Learning (cs.LG)
*备注: Accepted to ICLR 2025

点击查看摘要

Abstract:Ranking samples by fine-grained estimates of spuriosity (the degree to which spurious cues are present) has recently been shown to significantly benefit bias mitigation, over the traditional binary biased-vs-unbiased partitioning of train sets. However, this spuriosity ranking comes with the requirement of human supervision. In this paper, we propose a debiasing framework based on our novel Self-Guided Bias Ranking (Sebra), that mitigates biases (spurious correlations) via an automatic ranking of data points by spuriosity within their respective classes. Sebra leverages a key local symmetry in Empirical Risk Minimization (ERM) training – the ease of learning a sample via ERM inversely correlates with its spuriosity; the fewer spurious correlations a sample exhibits, the harder it is to learn, and vice versa. However, globally across iterations, ERM tends to deviate from this symmetry. Sebra dynamically steers ERM to correct this deviation, facilitating the sequential learning of attributes in increasing order of difficulty, i.e., decreasing order of spuriosity. As a result, the sequence in which Sebra learns samples naturally provides spuriosity rankings. We use the resulting fine-grained bias characterization in a contrastive learning framework to mitigate biases from multiple sources. Extensive experiments show that Sebra consistently outperforms previous state-of-the-art unsupervised debiasing techniques across multiple standard benchmarks, including UrbanCars, BAR, CelebA, and ImageNet-1K. Code, pre-trained models, and training logs are available at this https URL.

[LG-25] Reducing Aleatoric and Epistemic Uncertainty through Multi-modal Data Acquisition

链接: https://arxiv.org/abs/2501.18268
作者: Arthur Hoarau,Benjamin Quost,Sébastien Destercke,Willem Waegeman
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:To generate accurate and reliable predictions, modern AI systems need to combine data from multiple modalities, such as text, images, audio, spreadsheets, and time series. Multi-modal data introduces new opportunities and challenges for disentangling uncertainty: it is commonly assumed in the machine learning community that epistemic uncertainty can be reduced by collecting more data, while aleatoric uncertainty is irreducible. However, this assumption is challenged in modern AI systems when information is obtained from different modalities. This paper introduces an innovative data acquisition framework where uncertainty disentanglement leads to actionable decisions, allowing sampling in two directions: sample size and data modality. The main hypothesis is that aleatoric uncertainty decreases as the number of modalities increases, while epistemic uncertainty decreases by collecting more observations. We provide proof-of-concept implementations on two multi-modal datasets to showcase our data acquisition framework, which combines ideas from active learning, active feature acquisition and uncertainty quantification.

[LG-26] Fundamental Challenges in Evaluating Text2SQL Solutions and Detecting Their Limitations

链接: https://arxiv.org/abs/2501.18197
作者: Cedric Renggli,Ihab F. Ilyas,Theodoros Rekatsinas
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:In this work, we dive into the fundamental challenges of evaluating Text2SQL solutions and highlight potential failure causes and the potential risks of relying on aggregate metrics in existing benchmarks. We identify two largely unaddressed limitations in current open benchmarks: (1) data quality issues in the evaluation data, mainly attributed to the lack of capturing the probabilistic nature of translating a natural language description into a structured query (e.g., NL ambiguity), and (2) the bias introduced by using different match functions as approximations for SQL equivalence. To put both limitations into context, we propose a unified taxonomy of all Text2SQL limitations that can lead to both prediction and evaluation errors. We then motivate the taxonomy by providing a survey of Text2SQL limitations using state-of-the-art Text2SQL solutions and benchmarks. We describe the causes of limitations with real-world examples and propose potential mitigation solutions for each category in the taxonomy. We conclude by highlighting the open challenges encountered when deploying such mitigation strategies or attempting to automatically apply the taxonomy.

[LG-27] GDformer: Going Beyond Subsequence Isolation for Multivariate Time Series Anomaly Detection

链接: https://arxiv.org/abs/2501.18196
作者: Qingxiang Liu,Chenghao Liu,Sheng Sun,Di Yao,Yuxuan Liang
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Unsupervised anomaly detection of multivariate time series is a challenging task, given the requirements of deriving a compact detection criterion without accessing the anomaly points. The existing methods are mainly based on reconstruction error or association divergence, which are both confined to isolated subsequences with limited horizons, hardly promising a unified series-level criterion. In this paper, we propose the Global Dictionary-enhanced Transformer (GDformer) with a renovated dictionary-based cross attention mechanism to cultivate the global representations shared by all normal points in the entire series. Accordingly, the cross-attention maps reflect the correlation weights between the point and global representations, which naturally leads to the representation-wise similarity-based detection criterion. To foster a more compact detection boundary, prototypes are introduced to capture the distribution of normal point-global correlation weights. GDformer consistently achieves state-of-the-art unsupervised anomaly detection performance on five real-world benchmark datasets. Further experiments validate that the global dictionary has great transferability among various datasets. The code is available at this https URL.
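
摘要中的"全局字典交叉注意力"可以用如下 PyTorch 草图示意(GlobalDictionaryAttention 及其维度均为本文假设的演示命名):以可学习字典为 key/value,注意力图即点与全局表示的相关权重,可作为基于相似度的检测判据。

```python
import torch

class GlobalDictionaryAttention(torch.nn.Module):
    """以可学习全局字典为 key/value 的交叉注意力示意。"""
    def __init__(self, d_model=64, n_atoms=16):
        super().__init__()
        self.dictionary = torch.nn.Parameter(torch.randn(n_atoms, d_model))
        self.q = torch.nn.Linear(d_model, d_model)
        self.k = torch.nn.Linear(d_model, d_model)
        self.v = torch.nn.Linear(d_model, d_model)

    def forward(self, x):  # x: (batch, seq_len, d_model)
        Q = self.q(x)      # 序列中的每个时间点作为 query
        K, V = self.k(self.dictionary), self.v(self.dictionary)
        attn = torch.softmax(Q @ K.T / K.shape[-1] ** 0.5, dim=-1)  # (B, L, n_atoms)
        return attn @ V, attn  # attn 可用作点-全局相关权重

x = torch.randn(2, 100, 64)
out, attn = GlobalDictionaryAttention()(x)
```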

[LG-28] Neural Network Modeling of Microstructure Complexity Using Digital Libraries

链接: https://arxiv.org/abs/2501.18189
作者: Yingjie Zhao,Zhiping Xu
类目: Machine Learning (cs.LG); Materials Science (cond-mat.mtrl-sci); Computational Engineering, Finance, and Science (cs.CE); Pattern Formation and Solitons (nlin.PS); Computational Physics (physics.comp-ph)
*备注:

点击查看摘要

Abstract:Microstructure evolution in matter is often modeled numerically using field or level-set solvers, mirroring the dual representation of spatiotemporal complexity in terms of pixel or voxel data, and geometrical forms in vector graphics. Motivated by this analog, as well as the structural and event-driven nature of artificial and spiking neural networks, respectively, we evaluate their performance in learning and predicting fatigue crack growth and Turing pattern development. Predictions are made based on digital libraries constructed from computer simulations, which can be replaced by experimental data to lift the mathematical overconstraints of physics. Our assessment suggests that the leaky integrate-and-fire neuron model offers superior predictive accuracy with fewer parameters and less memory usage, alleviating the accuracy-cost tradeoff in contrast to the common practices in computer vision tasks. Examination of network architectures shows that these benefits arise from its reduced weight range and sparser connections. The study highlights the capability of event-driven models in tackling problems with evolutionary bulk-phase and interface behaviors using the digital library approach.
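
摘要中表现突出的泄漏积分发放(LIF)神经元模型,其动力学只需几行代码即可复现(时间常数、阈值等参数取值为演示假设):

```python
import numpy as np

def lif_simulate(input_current, dt=1.0, tau=20.0, v_rest=0.0, v_th=1.0, v_reset=0.0):
    """LIF 神经元的欧拉法仿真:dv/dt = (v_rest - v + I) / tau,过阈值则发放并复位。"""
    v, spikes, trace = v_rest, [], []
    for I in input_current:
        v += dt * (v_rest - v + I) / tau
        if v >= v_th:
            spikes.append(1)
            v = v_reset
        else:
            spikes.append(0)
        trace.append(v)
    return np.array(spikes), np.array(trace)

spikes, trace = lif_simulate(np.full(200, 1.2))  # 恒定电流输入下周期性发放
```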

[LG-29] Genetic Algorithm with Border Trades (GAB)

链接: https://arxiv.org/abs/2501.18184
作者: Qingchuan Lyu
类目: Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE); Computation (stat.CO)
*备注:

点击查看摘要

Abstract:This paper introduces a novel approach to improving Genetic Algorithms (GA) in large or complex problem spaces by incorporating new chromosome patterns in the breeding process through border trade activities. These strategies increase chromosome diversity, preventing premature convergence and enhancing the GA’s ability to explore the solution space more effectively. Empirical evidence demonstrates significant improvements in convergence behavior. This approach offers a promising pathway to addressing challenges in optimizing large or complex problem domains.

[LG-30] Advancing Personalized Federated Learning: Integrative Approaches with AI for Enhanced Privacy and Customization

链接: https://arxiv.org/abs/2501.18174
作者: Kevin Cooper,Michael Geller
类目: Machine Learning (cs.LG); Signal Processing (eess.SP)
*备注: arXiv admin note: substantial text overlap with arXiv:2501.16758

点击查看摘要

Abstract:In the age of data-driven decision making, preserving privacy while providing personalized experiences has become paramount. Personalized Federated Learning (PFL) offers a promising framework by decentralizing the learning process, thus ensuring data privacy and reducing reliance on centralized data repositories. However, the integration of advanced Artificial Intelligence (AI) techniques within PFL remains underexplored. This paper proposes a novel approach that enhances PFL with cutting-edge AI methodologies including adaptive optimization, transfer learning, and differential privacy. We present a model that not only boosts the performance of individual client models but also ensures robust privacy-preserving mechanisms and efficient resource utilization across heterogeneous networks. Empirical results demonstrate significant improvements in model accuracy and personalization, along with stringent privacy adherence, as compared to conventional federated learning models. This work paves the way for a new era of truly personalized and privacy-conscious AI systems, offering significant implications for industries requiring compliance with stringent data protection regulations.

[LG-31] Continually Evolved Multimodal Foundation Models for Cancer Prognosis

链接: https://arxiv.org/abs/2501.18170
作者: Jie Peng,Shuang Zhou,Longwei Yang,Yiran Song,Mohan Zhang,Kaixiong Zhou,Feng Xie,Mingquan Lin,Rui Zhang,Tianlong Chen
类目: Machine Learning (cs.LG)
*备注: 8 pages, 1 figure

点击查看摘要

Abstract:Cancer prognosis is a critical task that involves predicting patient outcomes and survival rates. To enhance prediction accuracy, previous studies have integrated diverse data modalities, such as clinical notes, medical images, and genomic data, leveraging their complementary information. However, existing approaches face two major limitations. First, they struggle to incorporate newly arrived data with varying distributions into training, such as patient records from different hospitals, thus rendering sub-optimal generalizability and limited utility in real-world applications. Second, most multimodal integration methods rely on simplistic concatenation or task-specific pipelines, which fail to capture the complex interdependencies across modalities. To address these, we propose a continually evolving multi-modal foundation model. Extensive experiments on the TCGA dataset demonstrate the effectiveness of our approach, highlighting its potential to advance cancer prognosis by enabling robust and adaptive multimodal integration.

[LG-32] Faster Convergence of Riemannian Stochastic Gradient Descent with Increasing Batch Size

链接: https://arxiv.org/abs/2501.18164
作者: Kanata Oowada,Hideaki Iiduka
类目: Machine Learning (cs.LG); Optimization and Control (math.OC)
*备注:

点击查看摘要

Abstract:Many models used in machine learning have become so large that even computing the full gradient of the loss function is impractical. This has made it necessary to efficiently train models using limited available information, such as batch size and learning rate. We have theoretically analyzed the use of Riemannian stochastic gradient descent (RSGD) and found that using an increasing batch size leads to faster RSGD convergence than using a constant batch size not only with a constant learning rate but also with a decaying learning rate, such as cosine annealing decay and polynomial decay. In particular, RSGD has a better convergence rate O(\frac{1}{\sqrt{T}}) than the existing rate O(\frac{\sqrt{\log T}}{\sqrt[4]{T}}) with a diminishing learning rate, where T is the number of iterations. The results of experiments on principal component analysis and low-rank matrix completion problems confirmed that, except for the MovieLens dataset and a constant learning rate, using a polynomial growth batch size or an exponential growth batch size results in better performance than using a constant batch size.
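
论文的核心训练技巧(递增 batch size)可以用一个简单的调度函数示意;增长倍率、步长与上限均为演示假设,论文同时讨论了多项式与指数两种增长方式。

```python
def batch_size_schedule(epoch, b0=32, growth=2.0, every=10, b_max=4096):
    """每 every 个 epoch 将 batch size 乘以 growth,封顶 b_max(指数增长示意)。"""
    return min(int(b0 * growth ** (epoch // every)), b_max)

sizes = [batch_size_schedule(e) for e in range(0, 60, 10)]
# [32, 64, 128, 256, 512, 1024]
```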

[LG-33] Large Language Models for Cryptocurrency Transaction Analysis: A Bitcoin Case Study

链接: https://arxiv.org/abs/2501.18158
作者: Yuchen Lei,Yuexin Xiang,Qin Wang,Rafael Dowsley,Tsz Hon Yuen,Jiangshan Yu
类目: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Cryptocurrencies are widely used, yet current methods for analyzing transactions heavily rely on opaque, black-box models. These lack interpretability and adaptability, failing to effectively capture behavioral patterns. Many researchers, including us, believe that Large Language Models (LLMs) could bridge this gap due to their robust reasoning abilities for complex tasks. In this paper, we test this hypothesis by applying LLMs to real-world cryptocurrency transaction graphs, specifically within the Bitcoin network. We introduce a three-tiered framework to assess LLM capabilities: foundational metrics, characteristic overview, and contextual interpretation. This includes a new, human-readable graph representation format, LLM4TG, and a connectivity-enhanced sampling algorithm, CETraS, which simplifies larger transaction graphs. Experimental results show that LLMs excel at foundational metrics and offer detailed characteristic overviews. Their effectiveness in contextual interpretation suggests they can provide useful explanations of transaction behaviors, even with limited labeled data.

[LG-34] Dual-Bounded Nonlinear Optimal Transport for Size Constrained Min Cut Clustering

链接: https://arxiv.org/abs/2501.18143
作者: Fangyuan Xie,Jinghui Yuan,Feiping Nie,Xuelong Li
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Min cut is an important graph partitioning method. However, current solutions to the min cut problem suffer from slow speeds, difficulty in solving, and often converge to simple solutions. To address these issues, we relax the min cut problem into a dual-bounded constraint and, for the first time, treat the min cut problem as a dual-bounded nonlinear optimal transport problem. Additionally, we develop a method for solving dual-bounded nonlinear optimal transport based on the Frank-Wolfe method (abbreviated as DNF). Notably, DNF not only solves the size constrained min cut problem but is also applicable to all dual-bounded nonlinear optimal transport problems. We prove that for convex problems satisfying Lipschitz smoothness, the DNF method can achieve a convergence rate of \mathcal{O}(\frac{1}{t}). We apply the DNF method to the min cut problem and find that it achieves state-of-the-art performance in terms of both the loss function and clustering accuracy at the fastest speed, with a convergence rate of \mathcal{O}(\frac{1}{\sqrt{t}}). Moreover, the DNF method for the size constrained min cut problem requires no parameters and exhibits better stability.

[LG-35] B3C: A Minimalist Approach to Offline Multi-Agent Reinforcement Learning

链接: https://arxiv.org/abs/2501.18138
作者: Woojun Kim,Katia Sycara
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Overestimation arising from selecting unseen actions during policy evaluation is a major challenge in offline reinforcement learning (RL). A minimalist approach in the single-agent setting – adding behavior cloning (BC) regularization to existing online RL algorithms – has been shown to be effective; however, this approach is understudied in multi-agent settings. In particular, overestimation becomes worse in multi-agent settings due to the presence of multiple actions, resulting in the BC regularization-based approach easily suffering from either over-regularization or critic divergence. To address this, we propose a simple yet effective method, Behavior Cloning regularization with Critic Clipping (B3C), which clips the target critic value in policy evaluation based on the maximum return in the dataset and pushes the limit of the weight on the RL objective over BC regularization, thereby improving performance. Additionally, we leverage existing value factorization techniques, particularly non-linear factorization, which is understudied in offline settings. Integrated with non-linear value factorization, B3C outperforms state-of-the-art algorithms on various offline multi-agent benchmarks.
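
B3C 的两个要点(用数据集最大回报裁剪目标 Q 值、在 RL 目标上叠加 BC 正则)可以用如下 PyTorch 草图说明;函数名与权重取值均为演示假设。

```python
import torch

def b3c_critic_target(reward, not_done, next_q, max_return, gamma=0.99):
    """核心技巧示意:用数据集中的最大回报裁剪 TD 目标,抑制离线多智能体下的过估计。"""
    target = reward + gamma * not_done * next_q
    return torch.clamp(target, max=max_return)

def b3c_actor_loss(q_value, policy_action, data_action, bc_weight=0.1):
    """RL 目标 + 行为克隆正则;论文主张在裁剪的保护下尽量加大 RL 项的比重。"""
    bc = ((policy_action - data_action) ** 2).mean()
    return -q_value.mean() + bc_weight * bc
```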

[LG-36] HyperZero: A Customized End-to-End Auto-Tuning System for Recommendation with Hourly Feedback

链接: https://arxiv.org/abs/2501.18126
作者: Xufeng Cai,Ziwei Guan,Lei Yuan,Ali Selman Aydin,Tengyu Xu,Boying Liu,Wenbo Ren,Renkai Xiang,Songyi He,Haichuan Yang,Serena Li,Mingze Gao,Yue Weng,Ji Liu
类目: Information Retrieval (cs.IR); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Modern recommendation systems can be broadly divided into two key stages: the ranking stage, where the system predicts various user engagements (e.g., click-through rate, like rate, follow rate, watch time), and the value model stage, which aggregates these predictive scores through a function (e.g., a linear combination defined by a weight vector) to measure the value of each content by a single numerical score. Both stages play roughly equally important roles in real industrial systems; however, how to optimize the model weights for the second stage still lacks systematic study. This paper focuses on optimizing the second stage through auto-tuning technology. Although general auto-tuning systems and solutions - both from established production practices and open-source solutions - can address this problem, they typically require weeks or even months to identify a feasible solution. Such prolonged tuning processes are unacceptable in production environments for recommendation systems, as suboptimal value models can severely degrade user experience. An effective auto-tuning solution is required to identify a viable model within 2-3 days, rather than the extended timelines typically associated with existing approaches. In this paper, we introduce a practical auto-tuning system named HyperZero that addresses these time constraints while effectively solving the unique challenges inherent in modern recommendation systems. Moreover, this framework has the potential to be expanded to broader tuning tasks within recommendation systems.

[LG-37] Battery State of Health Estimation Using LLM Framework

链接: https://arxiv.org/abs/2501.18123
作者: Aybars Yunusoglu,Dexter Le,Karn Tiwari,Murat Isik,I. Can Dikmen
类目: Machine Learning (cs.LG); Signal Processing (eess.SP)
*备注: Accepted at The 26th International Symposium on Quality Electronic Design (ISQED’25)

点击查看摘要

Abstract:Battery health monitoring is critical for the efficient and reliable operation of electric vehicles (EVs). This study introduces a transformer-based framework for estimating the State of Health (SoH) and predicting the Remaining Useful Life (RUL) of lithium titanate (LTO) battery cells by utilizing both cycle-based and instantaneous discharge data. Testing on eight LTO cells under various cycling conditions over 500 cycles, we demonstrate the impact of charge durations on energy storage trends and apply Differential Voltage Analysis (DVA) to monitor capacity changes (dQ/dV) across voltage ranges. Our LLM model achieves superior performance, with a Mean Absolute Error (MAE) as low as 0.87% and varied latency metrics that support efficient processing, demonstrating its strong potential for real-time integration into EVs. The framework effectively identifies early signs of degradation through anomaly detection in high-resolution data, facilitating predictive maintenance to prevent sudden battery failures and enhance energy efficiency.
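
摘要中用于监测容量变化的差分电压分析(DVA)本质上是对容量-电压曲线求数值微分 dQ/dV,可用 NumPy 直接示意(放电曲线为假设的玩具数据):

```python
import numpy as np

def differential_voltage_analysis(voltage, capacity):
    """由放电曲线计算 dQ/dV:对容量-电压曲线做数值微分(示意)。"""
    v = np.asarray(voltage, dtype=float)
    q = np.asarray(capacity, dtype=float)
    return np.gradient(q, v)  # dQ/dV,可跟踪不同电压区间的容量变化

# 玩具数据:电压从 2.8V 降到 1.5V(LTO 电池的典型放电区间)
v = np.linspace(2.8, 1.5, 100)
q = 1.0 - (v - 1.5) / 1.3  # 假设的线性放电曲线
dqdv = differential_voltage_analysis(v, q)
```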

[LG-38] ACTGNN: Assessment of Clustering Tendency with Synthetically-Trained Graph Neural Networks

链接: https://arxiv.org/abs/2501.18112
作者: Yiran Luo,Evangelos E. Papalexakis
类目: Machine Learning (cs.LG)
*备注: 10 pages, 4 figures

点击查看摘要

Abstract:Determining clustering tendency in datasets is a fundamental but challenging task, especially in noisy or high-dimensional settings where traditional methods, such as the Hopkins Statistic and Visual Assessment of Tendency (VAT), often struggle to produce reliable results. In this paper, we propose ACTGNN, a graph-based framework designed to assess clustering tendency by leveraging graph representations of data. Node features are constructed using Locality-Sensitive Hashing (LSH), which captures local neighborhood information, while edge features incorporate multiple similarity metrics, such as the Radial Basis Function (RBF) kernel, to model pairwise relationships. A Graph Neural Network (GNN) is trained exclusively on synthetic datasets, enabling robust learning of clustering structures under controlled conditions. Extensive experiments demonstrate that ACTGNN significantly outperforms baseline methods on both synthetic and real-world datasets, exhibiting superior performance in detecting faint clustering structures, even in high-dimensional or noisy data. Our results highlight the generalizability and effectiveness of the proposed approach, making it a promising tool for robust clustering tendency assessment.

[LG-39] AlphaAdam:Asynchronous Masked Optimization with Dynamic Alpha for Selective Updates

链接: https://arxiv.org/abs/2501.18094
作者: Da Chang,Yu Li,Ganzhao Yuan
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:In the training of large language models (LLMs), updating parameters more efficiently and stably has always been an important challenge. To achieve efficient parameter updates, existing methods usually achieve performance comparable to full parameter updates through methods such as low-dimensional decomposition or layer-wise selective updates. In this work, we propose AlphaAdam, an optimization framework for LLM from the perspective of intra-layer parameter updates. By decoupling parameter updates and dynamically adjusting their strength, AlphaAdam accelerates convergence and improves training stability. We construct parameter masks based on the consistency of historical momentum and gradient direction and combine them with an adaptive mask strength strategy to ensure efficient optimization and theoretical convergence guarantees, which is also applicable to most momentum-based optimizers. Extensive experiments show that AlphaAdam outperforms state-of-the-art methods such as AdamW in terms of convergence speed and computational efficiency across tasks, including GPT-2 pre-training and fine-tuning of RoBERTa and Llama-7B. Our AlphaAdam implements an optimizer enhancement framework for LLMs through intra-layer asynchronous masked adaptive updates. Our code is available at this https URL.
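
摘要中"基于历史动量与梯度方向一致性构造参数掩码"的思想可以用如下草图示意;打分函数与保留比例均为本文假设的简化形式,并非论文的精确定义。

```python
import torch

def direction_consistency_mask(momentum, grad, keep_ratio=0.5):
    """按动量与当前梯度方向一致性打分,仅保留得分最高的一部分参数参与本步更新(示意)。"""
    score = momentum * grad  # 同号(方向一致)时得分为正
    k = max(1, int(keep_ratio * score.numel()))
    thresh = score.flatten().kthvalue(score.numel() - k + 1).values  # 第 k 大的得分
    return (score >= thresh).float()  # 1 = 本步更新该参数

m, g = torch.randn(10, 10), torch.randn(10, 10)
mask = direction_consistency_mask(m, g)
update = mask * g  # 与任意动量类优化器的更新量逐元素相乘
```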

[LG-40] Reward Prediction Error Prioritisation in Experience Replay: The RPE-PER Method

链接: https://arxiv.org/abs/2501.18093
作者: Hoda Yamani,Yuning Xing,Lee Violet C. Ong,Bruce A. MacDonald,Henry Williams
类目: Machine Learning (cs.LG); Robotics (cs.RO)
*备注: This paper was accepted for presentation at the 2024 Australasian Conference on Robotics and Automation (ACRA 2024). It consists of 10 pages, including four figures and two tables

点击查看摘要

Abstract:Reinforcement Learning algorithms aim to learn optimal control strategies through iterative interactions with an environment. A critical element in this process is the experience replay buffer, which stores past experiences, allowing the algorithm to learn from a diverse range of interactions rather than just the most recent ones. This buffer is especially essential in dynamic environments with limited experiences. However, efficiently selecting high-value experiences to accelerate training remains a challenge. Drawing inspiration from the role of reward prediction errors (RPEs) in biological systems, where they are essential for adaptive behaviour and learning, we introduce Reward Predictive Error Prioritised Experience Replay (RPE-PER). This novel approach prioritises experiences in the buffer based on RPEs. Our method employs a critic network, EMCN, that predicts rewards in addition to the Q-values produced by standard critic networks. The discrepancy between these predicted and actual rewards is computed as RPE and utilised as a signal for experience prioritisation. Experimental evaluations across various continuous control tasks demonstrate RPE-PER’s effectiveness in enhancing the learning speed and performance of off-policy actor-critic algorithms compared to baseline approaches.
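
RPE-PER 的优先级即奖励预测误差的大小,核心计算只有几行;其中 alpha 次幂与 eps 平滑沿用 PER 的常见做法,属演示假设。

```python
import numpy as np

def rpe_priorities(predicted_rewards, actual_rewards, alpha=0.6, eps=1e-3):
    """以奖励预测误差(RPE)作为优先级,按 PER 惯例做 alpha 次幂并归一化为采样概率。"""
    rpe = np.abs(np.asarray(predicted_rewards) - np.asarray(actual_rewards))
    p = (rpe + eps) ** alpha
    return p / p.sum()

probs = rpe_priorities([0.2, 1.0, -0.5], [0.0, 1.1, 0.9])
idx = np.random.choice(len(probs), size=2, p=probs)  # 按 RPE 优先级采样经验
```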

[LG-41] Learning Provably Improves the Convergence of Gradient Descent

链接: https://arxiv.org/abs/2501.18092
作者: Qingyu Song,Wei Lin,Hong Xu
类目: Machine Learning (cs.LG); Optimization and Control (math.OC)
*备注: 46 pages, 11 figures

点击查看摘要

Abstract:As a specialized branch of deep learning, Learning to Optimize (L2O) tackles optimization problems by training DNN-based solvers. Despite achieving significant success in various scenarios, such as faster convergence in solving convex optimizations and improved optimality in addressing non-convex cases, there remains a deficiency in theoretical support. Current research heavily relies on stringent assumptions that do not align with the intricacies of the training process. To address this gap, our study aims to establish L2O’s convergence through its training methodology. We demonstrate that learning an algorithm’s hyperparameters significantly enhances its convergence. Focusing on the gradient descent (GD) algorithm for quadratic programming, we prove the convergence of L2O’s training using the neural tangent kernel theory. Moreover, we conduct empirical evaluations using synthetic datasets. Our findings indicate an improvement exceeding 50% over the standard GD methods.

[LG-42] ISAM-MTL: Cross-subject multi-task learning model with identifiable spikes and associative memory networks

链接: https://arxiv.org/abs/2501.18089
作者: Junyan Li,Bin Hu,Zhi-Hong Guan
类目: Neural and Evolutionary Computing (cs.NE); Machine Learning (cs.LG); Neurons and Cognition (q-bio.NC)
*备注:

点击查看摘要

Abstract:Cross-subject variability in EEG degrades performance of current deep learning models, limiting the development of brain-computer interface (BCI). This paper proposes ISAM-MTL, which is a multi-task learning (MTL) EEG classification model based on identifiable spiking (IS) representations and associative memory (AM) networks. The proposed model treats EEG classification of each subject as an independent task and leverages cross-subject data training to facilitate feature sharing across subjects. ISAM-MTL consists of a spiking feature extractor that captures shared features across subjects and a subject-specific bidirectional associative memory network that is trained by Hebbian learning for efficient and fast within-subject EEG classification. ISAM-MTL integrates learned spiking neural representations with bidirectional associative memory for cross-subject EEG classification. The model employs label-guided variational inference to construct identifiable spike representations, enhancing classification accuracy. Experimental results on two BCI Competition datasets demonstrate that ISAM-MTL improves the average accuracy of cross-subject EEG classification while reducing performance variability among subjects. The model further exhibits the characteristics of few-shot learning and identifiable neural activity beneath EEG, enabling rapid and interpretable calibration for BCI systems.

[LG-43] Joint Pricing and Resource Allocation: An Optimal Online-Learning Approach

链接: https://arxiv.org/abs/2501.18049
作者: Jianyu Xu,Xuan Wang,Yu-Xiang Wang,Jiashuo Jiang
类目: Machine Learning (cs.LG); Optimization and Control (math.OC); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:We study an online learning problem on dynamic pricing and resource allocation, where we make joint pricing and inventory decisions to maximize the overall net profit. We consider the stochastic dependence of demands on the price, which complicates the resource allocation process and introduces significant non-convexity and non-smoothness to the problem. To solve this problem, we develop an efficient algorithm that utilizes a “Lower-Confidence Bound (LCB)” meta-strategy over multiple OCO agents. Our algorithm achieves \tilde{O}(\sqrt{Tmn}) regret (for m suppliers and n consumers), which is optimal with respect to the time horizon T. Our results illustrate an effective integration of statistical learning methodologies with complex operations research problems.

[LG-44] KNN and K-means in Gini Prametric Spaces

链接: https://arxiv.org/abs/2501.18028
作者: Cassandra Mussard,Arthur Charpentier,Stéphane Mussard
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:This paper introduces innovative enhancements to the K-means and K-nearest neighbors (KNN) algorithms based on the concept of Gini prametric spaces. Unlike traditional distance metrics, Gini-based measures incorporate both value-based and rank-based information, improving robustness to noise and outliers. The main contributions of this work include: proposing a Gini-based measure that captures both rank information and value distances; presenting a Gini K-means algorithm that is proven to converge and demonstrates resilience to noisy data; and introducing a Gini KNN method that performs competitively with state-of-the-art approaches such as Hassanat’s distance in noisy environments. Experimental evaluations on 14 datasets from the UCI repository demonstrate the superior performance and efficiency of Gini-based algorithms in clustering and classification tasks. This work opens new avenues for leveraging rank-based measures in machine learning and statistical analysis.
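
The exact Gini prametric is defined in the paper; the following hypothetical sketch only conveys the flavor of blending value distances with rank information in a KNN-style dissimilarity (the function name, the L1 value term, and the `alpha` blend are our assumptions):

```python
# Hypothetical rank-plus-value dissimilarity in the spirit described above;
# the paper's actual Gini prametric may differ. Ranks are taken feature-wise
# against a reference sample, so outliers affect the rank term only mildly.
import numpy as np

def gini_like_dissimilarity(x, y, X_ref, alpha=0.5):
    d_val = np.mean(np.abs(x - y))                       # value-based term
    cols = [np.sort(X_ref[:, j]) for j in range(X_ref.shape[1])]
    rank = lambda c, v: np.searchsorted(c, v) / len(c)   # fractional rank
    rx = np.array([rank(c, v) for c, v in zip(cols, x)])
    ry = np.array([rank(c, v) for c, v in zip(cols, y)])
    d_rank = np.mean(np.abs(rx - ry))                    # rank-based term
    return alpha * d_val + (1 - alpha) * d_rank

rng = np.random.default_rng(0)
X_ref = rng.normal(size=(100, 3))
print(gini_like_dissimilarity(X_ref[0], X_ref[1], X_ref))
```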

[LG-45] Perforated Backpropagation: A Neuroscience Inspired Extension to Artificial Neural Networks

链接: https://arxiv.org/abs/2501.18018
作者: Rorry Brenner,Laurent Itti
类目: Neural and Evolutionary Computing (cs.NE); Machine Learning (cs.LG); Neurons and Cognition (q-bio.NC)
*备注: 23 Pages, 2 Tables, 11 Figures

点击查看摘要

Abstract:The neurons of artificial neural networks were originally invented when much less was known about biological neurons than is known today. Our work explores a modification to the core neuron unit to make it more parallel to a biological neuron. The modification is made with the knowledge that biological dendrites are not simply passive activation funnels, but also compute complex non-linear functions as they transmit activation to the cell body. The paper explores a novel system of “Perforated” backpropagation empowering the artificial neurons of deep neural networks to achieve better performance coding for the same features they coded for in the original architecture. After an initial network training phase, additional “Dendrite Nodes” are added to the network and separately trained with a different objective: to correlate their output with the remaining error of the original neurons. The trained Dendrite Nodes are then frozen, and the original neurons are further trained, now taking into account the additional error signals provided by the Dendrite Nodes. The cycle of training the original neurons and then adding and training Dendrite Nodes can be repeated several times until satisfactory performance is achieved. Our algorithm was successfully added to modern state-of-the-art PyTorch networks across multiple domains, improving upon original accuracies and allowing for significant model compression without a loss in accuracy.
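
The training cycle described above can be rendered schematically as follows (architectures, losses, and the way the dendrite's output is combined are our assumptions, not the authors' implementation):

```python
# Schematic of the perforated-backpropagation cycle: (1) train the original
# network, (2) train a "Dendrite Node" to correlate with the remaining error
# and freeze it, (3) continue training the original neurons with the frozen
# dendrite's correction added in. Modules below are toy placeholders.
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
X = torch.randn(512, 8)
y = (X[:, 0] * X[:, 1] > 0).float().unsqueeze(1)   # a nonlinear toy target

base = nn.Sequential(nn.Linear(8, 8), nn.Tanh(), nn.Linear(8, 1))
dendrite = nn.Sequential(nn.Linear(8, 4), nn.Tanh(), nn.Linear(4, 1))
bce = nn.BCEWithLogitsLoss()

def train(module, objective, steps=400):
    opt = torch.optim.Adam(module.parameters(), lr=1e-2)
    for _ in range(steps):
        opt.zero_grad()
        objective().backward()
        opt.step()

train(base, lambda: bce(base(X), y))               # phase 1: ordinary training

with torch.no_grad():                              # phase 2: fit the residual
    residual = y - torch.sigmoid(base(X))
train(dendrite, lambda: F.mse_loss(dendrite(X), residual))
for p in dendrite.parameters():
    p.requires_grad_(False)                        # freeze the Dendrite Node

train(base, lambda: bce(base(X) + dendrite(X), y))  # phase 3: joint refinement
```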

[LG-46] A Proximal Operator for Inducing 2:4-Sparsity

链接: https://arxiv.org/abs/2501.18015
作者: Jonas M Kübler,Yu-Xiang Wang,Shoham Sabach,Navid Ansari,Matthäus Kleindessner,Kailash Budhathoki,Volkan Cevher,George Karypis
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Recent hardware advancements in AI accelerators and GPUs make it possible to efficiently compute sparse matrix multiplications, especially when 2 out of 4 consecutive weights are set to zero. However, this so-called 2:4 sparsity usually comes at the cost of decreased model accuracy. We derive a regularizer that exploits the local correlation of features to find better sparsity masks in trained models. We minimize the regularizer jointly with a local squared loss by deriving the proximal operator, for which we show that it has an efficient solution in the 2:4-sparse case. After optimizing the mask, we use masked-gradient updates to further minimize the local squared loss. We illustrate our method on toy problems and apply it to pruning entire large language models up to 70B parameters. On models up to 13B we improve over previous state-of-the-art algorithms, whilst on 70B models we match their performance.
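
As a reminder of what the 2:4 pattern itself looks like (the paper's contribution is the proximal operator, which is not reproduced here), a plain magnitude-based 2:4 mask can be written as:

```python
# Keep the 2 largest-magnitude weights in every group of 4 consecutive weights
# and zero the rest: the hardware-friendly 2:4 pattern discussed above.
import torch

def apply_2_4_mask(w: torch.Tensor) -> torch.Tensor:
    groups = w.reshape(-1, 4)                      # assumes numel % 4 == 0
    idx = groups.abs().topk(2, dim=1).indices      # 2 survivors per group
    mask = torch.zeros_like(groups)
    mask.scatter_(1, idx, 1.0)
    return (groups * mask).reshape(w.shape)

w = torch.randn(2, 8)
print(apply_2_4_mask(w))
```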

[LG-47] When less is more: evolving large neural networks from small ones

链接: https://arxiv.org/abs/2501.18012
作者: Anil Radhakrishnan,John F. Lindner,Scott T. Miller,Sudeshna Sinha,William L. Ditto
类目: Machine Learning (cs.LG); Disordered Systems and Neural Networks (cond-mat.dis-nn)
*备注: 8 pages, 7 figures

点击查看摘要

Abstract:In contrast to conventional artificial neural networks, which are large and structurally static, we study feed-forward neural networks that are small and dynamic, whose nodes can be added (or subtracted) during training. A single neuronal weight in the network controls the network’s size, while the weight itself is optimized by the same gradient-descent algorithm that optimizes the network’s other weights and biases, but with a size-dependent objective or loss function. We train and evaluate such Nimble Neural Networks on nonlinear regression and classification tasks where they outperform the corresponding static networks. Growing networks to minimal, appropriate, or optimal sizes while training elucidates network dynamics and contrasts with pruning large networks after training but before deployment.

[LG-48] Fault Localization via Fine-tuning Large Language Models with Mutation Generated Stack Traces

链接: https://arxiv.org/abs/2501.18005
作者: Neetha Jambigi,Bartosz Bogacz,Moritz Mueller,Thomas Bach,Michael Felderer
类目: Software Engineering (cs.SE); Machine Learning (cs.LG)
*备注: Accepted at ICST 2025

点击查看摘要

Abstract:Abrupt and unexpected terminations of software are termed as software crashes. They can be challenging to analyze. Finding the root cause requires extensive manual effort and expertise to connect information sources like stack traces, source code, and logs. Typical approaches to fault localization require either test failures or source code. Crashes occurring in production environments, such as that of SAP HANA, provide solely crash logs and stack traces. We present a novel approach to localize faults based only on the stack trace information and no additional runtime information, by fine-tuning large language models (LLMs). We address complex cases where the root cause of a crash differs from the technical cause, and is not located in the innermost frame of the stack trace. As the number of historic crashes is insufficient to fine-tune LLMs, we augment our dataset by leveraging code mutators to inject synthetic crashes into the code base. By fine-tuning on 64,369 crashes resulting from 4.1 million mutations of the HANA code base, we can correctly predict the root cause location of a crash with an accuracy of 66.9% while baselines only achieve 12.6% and 10.6%. We substantiate the generalizability of our approach by evaluating on two additional open-source databases, SQLite and DuckDB, achieving accuracies of 63% and 74%, respectively. Across all our experiments, fine-tuning consistently outperformed prompting non-finetuned LLMs for localizing faults in our datasets.

[LG-49] KoopAGRU: A Koopman-based Anomaly Detection in Time-Series using Gated Recurrent Units

链接: https://arxiv.org/abs/2501.17976
作者: Issam Ait Yahia,Ismail Berrada
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Anomaly detection in real-world time-series data is a challenging task due to the complex and nonlinear temporal dynamics involved. This paper introduces KoopAGRU, a new deep learning model designed to tackle this problem by combining Fast Fourier Transform (FFT), Deep Dynamic Mode Decomposition (DeepDMD), and Koopman theory. FFT allows KoopAGRU to decompose temporal data into time-variant and time-invariant components, providing precise modeling of complex patterns. To better control these two components, KoopAGRU utilizes Gated Recurrent Unit (GRU) encoders to learn Koopman observables, enhancing the detection capability across multiple temporal scales. KoopAGRU is trained in a single process and offers fast inference times. Extensive tests show that KoopAGRU outperforms other leading methods, achieving a new best average F1-score of 90.88% on well-known time-series anomaly detection benchmarks, and proves to be efficient and reliable in detecting anomalies in real-world scenarios.
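
The FFT-based split into time-invariant and time-variant components can be illustrated crudely as below (a simplification of the paper's pipeline, which couples this decomposition with DeepDMD and GRU encoders; the choice of keeping the `k` strongest components is our assumption):

```python
# Crude illustration of an FFT split: treat the k strongest frequency
# components as the "time-invariant" part of a series and the residual as
# "time-variant". KoopAGRU builds much more on top of this decomposition.
import numpy as np

rng = np.random.default_rng(0)
t = np.linspace(0, 10, 1000)
x = np.sin(2 * np.pi * 1.5 * t) + 0.3 * rng.normal(size=t.size)

spec = np.fft.rfft(x)
keep = np.argsort(np.abs(spec))[-5:]               # k = 5 dominant components
filtered = np.zeros_like(spec)
filtered[keep] = spec[keep]
time_invariant = np.fft.irfft(filtered, n=x.size)
time_variant = x - time_invariant
```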

[LG-50] Variational Combinatorial Sequential Monte Carlo for Bayesian Phylogenetics in Hyperbolic Space

链接: https://arxiv.org/abs/2501.17965
作者: Alex Chen,Philipe Chlenski,Kenneth Munyuza,Antonio Khalil Moretti,Christian A. Naesseth,Itsik Pe’er
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: 24 pages, 10 figures

点击查看摘要

Abstract:Hyperbolic space naturally encodes hierarchical structures such as phylogenies (binary trees), where inward-bending geodesics reflect paths through least common ancestors, and the exponential growth of neighborhoods mirrors the super-exponential scaling of topologies. This scaling challenge limits the efficiency of Euclidean-based approximate inference methods. Motivated by the geometric connections between trees and hyperbolic space, we develop novel hyperbolic extensions of two sequential search algorithms: Combinatorial and Nested Combinatorial Sequential Monte Carlo (CSMC and NCSMC). Our approach introduces consistent and unbiased estimators, along with variational inference methods (H-VCSMC and H-VNCSMC), which outperform their Euclidean counterparts. Empirical results demonstrate improved speed, scalability and performance in high-dimensional phylogenetic inference tasks.

[LG-51] Shared DIFF Transformer

链接: https://arxiv.org/abs/2501.17900
作者: Yueyang Cang,Yuhang Liu,Xiaoteng Zhang,Xiangju Wang
类目: Machine Learning (cs.LG)
*备注: arXiv admin note: text overlap with arXiv:2501.17486

点击查看摘要

Abstract:DIFF Transformer improves attention allocation by enhancing focus on relevant context while suppressing noise. It introduces a differential attention mechanism that calculates the difference between two independently generated attention distributions, effectively reducing noise and promoting sparse attention patterns. However, the independent signal generation in DIFF Transformer results in parameter redundancy and suboptimal utilization of information. In this work, we propose Shared DIFF Transformer, which draws on the idea of a differential amplifier by introducing a shared base matrix to model global patterns and incorporating low-rank updates to enhance task-specific flexibility. This design significantly reduces parameter redundancy, improves efficiency, and retains strong noise suppression capabilities. Experimental results show that, compared to DIFF Transformer, our method achieves better performance in tasks such as long-sequence modeling, key information retrieval, and in-context learning. Our work provides a novel and efficient approach to optimizing differential attention mechanisms and advancing robust Transformer architectures.
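
A schematic single-head sketch of the shared-base-plus-low-rank idea might look as follows; the shapes, the placement of the low-rank updates, and the learnable difference weight are all our interpretation of the abstract, not the paper's parameterization:

```python
# Schematic differential attention with a shared base projection and per-branch
# low-rank updates; an interpretation of the abstract, not the authors' code.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SharedDiffAttention(nn.Module):
    def __init__(self, d_model=64, rank=4):
        super().__init__()
        self.q_base = nn.Linear(d_model, d_model, bias=False)  # shared base
        self.k_base = nn.Linear(d_model, d_model, bias=False)
        low_rank = lambda: nn.Sequential(                      # cheap updates
            nn.Linear(d_model, rank, bias=False),
            nn.Linear(rank, d_model, bias=False))
        self.q_lr = nn.ModuleList([low_rank() for _ in range(2)])
        self.k_lr = nn.ModuleList([low_rank() for _ in range(2)])
        self.v = nn.Linear(d_model, d_model, bias=False)
        self.lam = nn.Parameter(torch.tensor(0.5))             # difference weight

    def forward(self, x):                                      # x: (B, T, d)
        maps = []
        for i in range(2):
            q = self.q_base(x) + self.q_lr[i](x)
            k = self.k_base(x) + self.k_lr[i](x)
            maps.append(F.softmax(q @ k.transpose(-2, -1) / x.size(-1) ** 0.5, dim=-1))
        attn = maps[0] - self.lam * maps[1]                    # differential attention
        return attn @ self.v(x)

print(SharedDiffAttention()(torch.randn(2, 10, 64)).shape)     # (2, 10, 64)
```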

[LG-52] Explainable Machine Learning: An Illustration of Kolmogorov-Arnold Network Model for Airfoil Lift Prediction

链接: https://arxiv.org/abs/2501.17896
作者: Sudhanva Kulkarni
类目: Machine Learning (cs.LG)
*备注: 3 pages, 2 tables, 3 figures

点击查看摘要

Abstract:Data science has emerged as the fourth paradigm of scientific exploration. However, many machine learning models operate as black boxes, offering limited insight into the reasoning behind their predictions. This lack of transparency is a major obstacle to generating new knowledge from data. Recently, the Kolmogorov-Arnold Network (KAN) has been proposed as an alternative model that embeds explainable AI. This study demonstrates the potential of KAN for new scientific exploration. KAN, along with five other popular supervised machine learning models, is applied to the well-known problem of airfoil lift prediction in aerospace engineering. Standard data generated from an earlier study on 2900 different airfoils is used. KAN performed the best with an R2 score of 96.17 percent on the test data, surpassing both the baseline model and the Multi-Layer Perceptron. The explainability of KAN is shown by pruning and symbolizing the model, resulting in an equation for the coefficient of lift in terms of the input variables. The explainable information retrieved from the KAN model is found to be consistent with the known physics of lift generation by airfoils, thus demonstrating its potential to aid in scientific exploration.

[LG-53] Optimal generalisation and learning transition in extensive-width shallow neural networks near interpolation

链接: https://arxiv.org/abs/2501.18530
作者: Jean Barbier,Francesco Camilli,Minh-Toan Nguyen,Mauro Pastore,Rudy Skerk
类目: Machine Learning (stat.ML); Disordered Systems and Neural Networks (cond-mat.dis-nn); Statistical Mechanics (cond-mat.stat-mech); Information Theory (cs.IT); Machine Learning (cs.LG)
*备注: 8 pages + appendix, 3 figures

点击查看摘要

Abstract:We consider a teacher-student model of supervised learning with a fully-trained 2-layer neural network whose width k and input dimension d are large and proportional. We compute the Bayes-optimal generalisation error of the network for any activation function in the regime where the number of training data n scales quadratically with the input dimension, i.e., around the interpolation threshold where the number of trainable parameters kd+k and of data points n are comparable. Our analysis tackles generic weight distributions. Focusing on binary weights, we uncover a discontinuous phase transition separating a “universal” phase from a “specialisation” phase. In the first, the generalisation error is independent of the weight distribution and decays slowly with the sampling rate n/d^2 , with the student learning only some non-linear combinations of the teacher weights. In the latter, the error is weight distribution-dependent and decays faster due to the alignment of the student towards the teacher network. We thus unveil the existence of a highly predictive solution near interpolation, which is however potentially hard to find.

[LG-54] Resampling Filter Design for Multirate Neural Audio Effect Processing

链接: https://arxiv.org/abs/2501.18470
作者: Alistair Carson,Vesa Välimäki,Alec Wright,Stefan Bilbao
类目: Audio and Speech Processing (eess.AS); Machine Learning (cs.LG); Sound (cs.SD); Signal Processing (eess.SP)
*备注: Preprint

点击查看摘要

Abstract:Neural networks have become ubiquitous in audio effects modelling, especially for guitar amplifiers and distortion pedals. One limitation of such models is that the sample rate of the training data is implicitly encoded in the model weights and therefore not readily adjustable at inference. Recent work explored modifications to recurrent neural network architecture to approximate a sample rate independent system, enabling audio processing at a rate that differs from the original training rate. This method works well for integer oversampling and can reduce aliasing caused by nonlinear activation functions. For small fractional changes in sample rate, fractional delay filters can be used to approximate sample rate independence, but in some cases this method fails entirely. Here, we explore the use of signal resampling at the input and output of the neural network as an alternative solution. We investigate several resampling filter designs and show that a two-stage design consisting of a half-band IIR filter cascaded with a Kaiser window FIR filter can give similar or better results to the previously proposed model adjustment method with many fewer operations per sample and less than one millisecond of latency at typical audio rates. Furthermore, we investigate interpolation and decimation filters for the task of integer oversampling and show that cascaded half-band IIR and FIR designs can be used in conjunction with the model adjustment method to reduce aliasing in a range of distortion effect models.
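
For the FIR half of such a design, scipy's polyphase resampler already accepts a Kaiser window; a rough single-stage sketch is shown below (the paper's two-stage IIR+FIR cascade and its exact filter specifications are not reproduced, and `beta=8.0` is a guessed design parameter):

```python
# Single-stage polyphase resampling with a Kaiser-window FIR anti-aliasing
# filter; a sketch only, not the paper's two-stage half-band IIR + FIR design.
import numpy as np
from scipy import signal

fs_in, fs_out = 44100, 48000
up, down = 160, 147                   # 48000/44100 reduced to lowest terms
x = np.random.randn(fs_in)            # one second of placeholder "audio"
y = signal.resample_poly(x, up, down, window=("kaiser", 8.0))
print(len(x), "->", len(y))           # 44100 -> 48000 samples
```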

[LG-55] adabmDCA 2.0 – a flexible but easy-to-use package for Direct Coupling Analysis

链接: https://arxiv.org/abs/2501.18456
作者: Lorenzo Rosset,Roberto Netti,Anna Paola Muntoni,Martin Weigt,Francesco Zamponi
类目: Quantitative Methods (q-bio.QM); Machine Learning (cs.LG); Biological Physics (physics.bio-ph)
*备注:

点击查看摘要

Abstract:In this methods article, we provide a flexible but easy-to-use implementation of Direct Coupling Analysis (DCA) based on Boltzmann machine learning, together with a tutorial on how to use it. The package adabmDCA 2.0 is available in different programming languages (C++, Julia, Python) usable on different architectures (single-core and multi-core CPU, GPU) through a common front-end interface. In addition to several learning protocols for dense and sparse generative DCA models, it allows one to directly address common downstream tasks like residue-residue contact prediction, mutational-effect prediction, scoring of sequence libraries and generation of artificial sequences for sequence design. It is readily applicable to protein and RNA sequence data.

[LG-56] DeepExtractor: Time-domain reconstruction of signals and glitches in gravitational wave data with deep learning

链接: https://arxiv.org/abs/2501.18423
作者: Tom Dooney,Harsh Narola,Stefano Bromuri,R. Lyana Curier,Chris Van Den Broeck,Sarah Caudill,Daniel Stanley Tan
类目: General Relativity and Quantum Cosmology (gr-qc); Instrumentation and Methods for Astrophysics (astro-ph.IM); Machine Learning (cs.LG); Data Analysis, Statistics and Probability (physics.data-an); Instrumentation and Detectors (physics.ins-det)
*备注: 22 pages, 16 figures, 4 tables

点击查看摘要

Abstract:Gravitational wave (GW) interferometers detect faint signals from distant astrophysical events, such as binary black hole mergers. However, their high sensitivity also makes them susceptible to background noise, which can obscure these signals. This noise often includes transient artifacts called “glitches” that can mimic astrophysical signals or mask their characteristics. Fast and accurate reconstruction of both signals and glitches is crucial for reliable scientific inference. In this study, we present DeepExtractor, a deep learning framework designed to reconstruct signals and glitches with power exceeding interferometer noise, regardless of their source. We design DeepExtractor to model the inherent noise distribution of GW interferometers, following conventional assumptions that the noise is Gaussian and stationary over short time scales. It operates by predicting and subtracting the noise component of the data, retaining only the clean reconstruction. Our approach achieves superior generalization capabilities for arbitrary signals and glitches compared to methods that directly map inputs to the clean training waveforms. We validate DeepExtractor’s effectiveness through three experiments: (1) reconstructing simulated glitches injected into simulated detector noise, (2) comparing performance with the state-of-the-art BayesWave algorithm, and (3) analyzing real data from the Gravity Spy dataset to demonstrate effective glitch subtraction from LIGO strain data. DeepExtractor achieves a median mismatch of only 0.9% for simulated glitches, outperforming several deep learning baselines. Additionally, DeepExtractor surpasses BayesWave in glitch recovery, offering a dramatic computational speedup by reconstructing one glitch sample in approx. 0.1 seconds on a CPU, compared to BayesWave’s processing time of approx. one hour per glitch.

[LG-57] Consensus statement on the credibility assessment of ML predictors

链接: https://arxiv.org/abs/2501.18415
作者: Alessandra Aldieri,Thiranja Prasad Babarenda Gamage,Antonino Amedeo La Mattina,Yi Li,Axel Loewe,Francesco Pappalardo,Marco Viceconti
类目: Quantitative Methods (q-bio.QM); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:The rapid integration of machine learning (ML) predictors into in silico medicine has revolutionized the estimation of quantities of interest (QIs) that are otherwise challenging to measure directly. However, the credibility of these predictors is critical, especially when they inform high-stakes healthcare decisions. This position paper presents a consensus statement developed by experts within the In Silico World Community of Practice. We outline twelve key statements forming the theoretical foundation for evaluating the credibility of ML predictors, emphasizing the necessity of causal knowledge, rigorous error quantification, and robustness to biases. By comparing ML predictors with biophysical models, we highlight unique challenges associated with implicit causal knowledge and propose strategies to ensure reliability and applicability. Our recommendations aim to guide researchers, developers, and regulators in the rigorous assessment and deployment of ML predictors in clinical and biomedical contexts.

[LG-58] Implicit Riemannian Optimism with Applications to Min-Max Problems

链接: https://arxiv.org/abs/2501.18381
作者: Christophe Roux,David Martínez-Rubio,Sebastian Pokutta
类目: Optimization and Control (math.OC); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We introduce a Riemannian optimistic online learning algorithm for Hadamard manifolds based on inexact implicit updates. Unlike prior work, our method can handle in-manifold constraints, and matches the best known regret bounds in the Euclidean setting with no dependence on geometric constants, like the minimum curvature. Building on this, we develop algorithms for g-convex, g-concave smooth min-max problems on Hadamard manifolds. Notably, one method nearly matches the gradient oracle complexity of the lower bound for Euclidean problems, for the first time.

[LG-59] Contextual Online Decision Making with Infinite-Dimensional Functional Regression

链接: https://arxiv.org/abs/2501.18359
作者: Haichen Hu,Rui Ai,Stephen Bates,David Simchi-Levi
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注: 30 pages

点击查看摘要

Abstract:Contextual sequential decision-making problems play a crucial role in machine learning, encompassing a wide range of downstream applications such as bandits, sequential hypothesis testing and online risk control. These applications often require different statistical measures, including expectation, variance and quantiles. In this paper, we provide a universal admissible algorithm framework for dealing with all kinds of contextual online decision-making problems that directly learns the whole underlying unknown distribution instead of focusing on individual statistics. This is much more difficult because the dimension of the regression is uncountably infinite, and any existing linear contextual bandits algorithm will result in infinite regret. To overcome this issue, we propose an efficient infinite-dimensional functional regression oracle for contextual cumulative distribution functions (CDFs), where each data point is modeled as a combination of context-dependent CDF basis functions. Our analysis reveals that the decay rate of the eigenvalue sequence of the design integral operator governs the regression error rate and, consequently, the utility regret rate. Specifically, when the eigenvalue sequence exhibits a polynomial decay of order $\frac{1}{\gamma}\ge 1$, the utility regret is bounded by $\tilde{\mathcal{O}}\big(T^{\frac{3\gamma+2}{2(\gamma+2)}}\big)$. By setting $\gamma=0$, this recovers the existing optimal regret rate for contextual bandits with finite-dimensional regression and is optimal under a stronger exponential decay assumption. Additionally, we provide a numerical method to compute the eigenvalue sequence of the integral operator, enabling the practical implementation of our framework.

[LG-60] Random Feature Representation Boosting

链接: https://arxiv.org/abs/2501.18283
作者: Nikita Zozoulenko,Thomas Cass,Lukas Gonon
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We introduce Random Feature Representation Boosting (RFRBoost), a novel method for constructing deep residual random feature neural networks (RFNNs) using boosting theory. RFRBoost uses random features at each layer to learn the functional gradient of the network representation, enhancing performance while preserving the convex optimization benefits of RFNNs. In the case of MSE loss, we obtain closed-form solutions to greedy layer-wise boosting with random features. For general loss functions, we show that fitting random feature residual blocks reduces to solving a quadratically constrained least squares problem. We demonstrate, through numerical experiments on 91 tabular datasets for regression and classification, that RFRBoost significantly outperforms traditional RFNNs and end-to-end trained MLP ResNets, while offering substantial computational advantages and theoretical guarantees stemming from boosting theory.
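
A toy rendition of greedy random-feature boosting under MSE loss shows why each stage has a closed form; note this is our simplification, fitting prediction residuals, whereas RFRBoost boosts the network representation itself with residual blocks:

```python
# Each boosting stage draws random features and fits the current residual by
# ridge least squares, which is solvable in closed form under MSE loss.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 10))
y = np.sin(2 * X[:, 0]) + 0.1 * rng.normal(size=500)

pred, lam, width = np.zeros(500), 1e-2, 64
for stage in range(10):
    W = rng.normal(size=(X.shape[1], width))       # random projection
    Phi = np.tanh(X @ W)                           # random feature map
    resid = y - pred
    beta = np.linalg.solve(Phi.T @ Phi + lam * np.eye(width), Phi.T @ resid)
    pred += Phi @ beta                             # greedy stage-wise update
    print(f"stage {stage}: mse = {np.mean((y - pred) ** 2):.4f}")
```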

[LG-61] Decentralized Projection-free Online Upper-Linearizable Optimization with Applications to DR-Submodular Optimization

链接: https://arxiv.org/abs/2501.18183
作者: Yiyang Lu,Mohammad Pedramfar,Vaneet Aggarwal
类目: Optimization and Control (math.OC); Computational Complexity (cs.CC); Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:We introduce a novel framework for decentralized projection-free optimization, extending projection-free methods to a broader class of upper-linearizable functions. Our approach leverages decentralized optimization techniques with the flexibility of upper-linearizable function frameworks, effectively generalizing traditional DR-submodular function optimization. We obtain the regret of $O(T^{1-\theta/2})$ with communication complexity of $O(T^{\theta})$ and number of linear optimization oracle calls of $O(T^{2\theta})$ for decentralized upper-linearizable function optimization, for any $0\le \theta \le 1$. This approach allows for the first results for monotone up-concave optimization with general convex constraints and non-monotone up-concave optimization with general convex constraints. Further, the above results for first order feedback are extended to zeroth order, semi-bandit, and bandit feedback.

[LG-62] Estimating Multi-chirp Parameters using Curvature-guided Langevin Monte Carlo

链接: https://arxiv.org/abs/2501.18178
作者: Sattwik Basu,Debottam Dutta,Yu-Lin Wei,Romit Roy Choudhury
类目: Signal Processing (eess.SP); Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:This paper considers the problem of estimating chirp parameters from a noisy mixture of chirps. While a rich body of work exists in this area, challenges remain when extending these techniques to chirps of higher order polynomials. We formulate this as a non-convex optimization problem and propose a modified Langevin Monte Carlo (LMC) sampler that exploits the average curvature of the objective function to reliably find the minimizer. Results show that our Curvature-guided LMC (CG-LMC) algorithm is robust and succeeds even in low SNR regimes, making it viable for practical applications.
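
A plain (uncurved) LMC update for minimization takes only a few lines; the paper's contribution is modulating the step using average-curvature information, which is only indicated in the comments below:

```python
# Vanilla Langevin Monte Carlo for minimization: gradient step plus Gaussian
# noise. CG-LMC would additionally scale eta using an estimate of the
# objective's average curvature (not implemented in this sketch).
import numpy as np

rng = np.random.default_rng(0)

def lmc(grad_f, x0, eta=1e-3, temp=0.05, steps=20000):
    x = np.asarray(x0, dtype=float).copy()
    for _ in range(steps):
        x += -eta * grad_f(x) + np.sqrt(2 * eta * temp) * rng.normal(size=x.shape)
    return x

# Toy multimodal objective f(x) = x^4 - 4x^2 + 0.5x standing in for a
# chirp-parameter loss surface; the injected noise lets the chain explore
# beyond the nearest basin.
grad_f = lambda x: 4 * x**3 - 8 * x + 0.5
print(lmc(grad_f, x0=np.array([2.5])))
```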

[LG-63] Optimal Survey Design for Private Mean Estimation

链接: https://arxiv.org/abs/2501.18121
作者: Yu-Wei Chen,Raghu Pasupathy,Jordan A. Awan
类目: Machine Learning (stat.ML); Cryptography and Security (cs.CR); Machine Learning (cs.LG); Statistics Theory (math.ST)
*备注:

点击查看摘要

Abstract:This work identifies the first privacy-aware stratified sampling scheme that minimizes the variance for general private mean estimation under the Laplace, Discrete Laplace (DLap) and Truncated-Uniform-Laplace (TuLap) mechanisms within the framework of differential privacy (DP). We view stratified sampling as a subsampling operation, which amplifies the privacy guarantee; however, to have the same final privacy guarantee for each group, different nominal privacy budgets need to be used depending on the subsampling rate. Ignoring the effect of DP, traditional stratified sampling strategies risk significant variance inflation. We phrase our optimal survey design as an optimization problem, where we determine the optimal subsampling sizes for each group with the goal of minimizing the variance of the resulting estimator. We establish strong convexity of the variance objective, propose an efficient algorithm to identify the integer-optimal design, and offer insights on the structure of the optimal design.

[LG-64] A spectral clustering-type algorithm for the consistent estimation of the Hurst distribution in moderately high dimensions

链接: https://arxiv.org/abs/2501.18115
作者: Patrice Abry,Gustavo Didier,Oliver Orejola,Herwig Wendt
类目: Methodology (stat.ME); Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:Scale invariance (fractality) is a prominent feature of the large-scale behavior of many stochastic systems. In this work, we construct an algorithm for the statistical identification of the Hurst distribution (in particular, the scaling exponents) undergirding a high-dimensional fractal system. The algorithm is based on wavelet random matrices, modified spectral clustering and a model selection step for picking the value of the clustering precision hyperparameter. In a moderately high-dimensional regime where the dimension, the sample size and the scale go to infinity, we show that the algorithm consistently estimates the Hurst distribution. Monte Carlo simulations show that the proposed methodology is efficient for realistic sample sizes and outperforms another popular clustering method based on mixed-Gaussian modeling. We apply the algorithm in the analysis of real-world macroeconomic time series to unveil evidence for cointegration.

[LG-65] DCatalyst: A Unified Accelerated Framework for Decentralized Optimization

链接: https://arxiv.org/abs/2501.18114
作者: Tianyu Cao,Xiaokai Chen,Gesualdo Scutari
类目: Optimization and Control (math.OC); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We study decentralized optimization over a network of agents, modeled as graphs, with no central server. The goal is to minimize $f+r$, where $f$ represents a (strongly) convex function averaging the local agents' losses, and $r$ is a convex, extended-value function. We introduce DCatalyst, a unified black-box framework that integrates Nesterov acceleration into decentralized optimization algorithms. At its core, DCatalyst operates as an inexact, momentum-accelerated proximal method (forming the outer loop) that seamlessly incorporates any selected decentralized algorithm (as the inner loop). We demonstrate that DCatalyst achieves optimal communication and computational complexity (up to log-factors) across various decentralized algorithms and problem instances. Notably, it extends acceleration capabilities to problem classes previously lacking accelerated solution methods, thereby broadening the effectiveness of decentralized methods. On the technical side, our framework introduces the inexact estimating sequences, a novel extension of the well-known Nesterov's estimating sequences, tailored for the minimization of composite losses in decentralized settings. This method adeptly handles consensus errors and inexact solutions of agents' subproblems, challenges not addressed by existing models.

[LG-66] U-aggregation: Unsupervised Aggregation of Multiple Learning Algorithms

链接: https://arxiv.org/abs/2501.18084
作者: Rui Duan
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Across various domains, the growing advocacy for open science and open-source machine learning has made an increasing number of models publicly available. These models allow practitioners to integrate them into their own contexts, reducing the need for extensive data labeling, training, and calibration. However, selecting the best model for a specific target population remains challenging due to issues like limited transferability, data heterogeneity, and the difficulty of obtaining true labels or outcomes in real-world settings. In this paper, we propose an unsupervised model aggregation method, U-aggregation, designed to integrate multiple pre-trained models for enhanced and robust performance in new populations. Unlike existing supervised model aggregation or super learner approaches, U-aggregation assumes no observed labels or outcomes in the target population. Our method addresses limitations in existing unsupervised model aggregation techniques by accommodating more realistic settings, including heteroskedasticity at both the model and individual levels, and the presence of adversarial models. Drawing on insights from random matrix theory, U-aggregation incorporates a variance stabilization step and an iterative sparse signal recovery process. These steps improve the estimation of individuals’ true underlying risks in the target population and evaluate the relative performance of candidate models. We provide a theoretical investigation and systematic numerical experiments to elucidate the properties of U-aggregation. We demonstrate its potential real-world application by using U-aggregation to enhance genetic risk prediction of complex traits, leveraging publicly available models from the PGS Catalog.

[LG-67] Noise-Adaptive Conformal Classification with Marginal Coverage

链接: https://arxiv.org/abs/2501.18060
作者: Teresa Bortolotti,Y. X. Rachel Wang,Xin Tong,Alessandra Menafoglio,Simone Vantini,Matteo Sesia
类目: Methodology (stat.ME); Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:Conformal inference provides a rigorous statistical framework for uncertainty quantification in machine learning, enabling well-calibrated prediction sets with precise coverage guarantees for any classification model. However, its reliance on the idealized assumption of perfect data exchangeability limits its effectiveness in the presence of real-world complications, such as low-quality labels – a widespread issue in modern large-scale data sets. This work tackles this open problem by introducing an adaptive conformal inference method capable of efficiently handling deviations from exchangeability caused by random label noise, leading to informative prediction sets with tight marginal coverage guarantees even in those challenging scenarios. We validate our method through extensive numerical experiments demonstrating its effectiveness on synthetic and real data sets, including CIFAR-10H and BigEarthNet.

[LG-68] Reinforcement-Learning Portfolio Allocation with Dynamic Embedding of Market Information

链接: https://arxiv.org/abs/2501.17992
作者: Jinghai He,Cheng Hua,Chunyang Zhou,Zeyu Zheng
类目: Portfolio Management (q-fin.PM); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We develop a portfolio allocation framework that leverages deep learning techniques to address challenges arising from high-dimensional, non-stationary, and low-signal-to-noise market information. Our approach includes a dynamic embedding method that reduces the non-stationary, high-dimensional state space into a lower-dimensional representation. We design a reinforcement learning (RL) framework that integrates generative autoencoders and online meta-learning to dynamically embed market information, enabling the RL agent to focus on the most impactful parts of the state space for portfolio allocation decisions. Empirical analysis based on the top 500 U.S. stocks demonstrates that our framework outperforms common portfolio benchmarks and the predict-then-optimize (PTO) approach using machine learning, particularly during periods of market stress. Traditional factor models do not fully explain this superior performance. The framework’s ability to time volatility reduces its market exposure during turbulent times. Ablation studies confirm the robustness of this performance across various reinforcement learning algorithms. Additionally, the embedding and meta-learning techniques effectively manage the complexities of high-dimensional, noisy, and non-stationary financial data, enhancing both portfolio performance and risk management.

[LG-69] A Robust Support Vector Machine Approach for Raman COVID-19 Data Classification

链接: https://arxiv.org/abs/2501.17904
作者: Marco Piazza,Andrea Spinelli,Francesca Maggioni,Marzia Bedoni,Enza Messina
类目: Quantitative Methods (q-bio.QM); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Recent advances in healthcare technologies have led to the availability of large amounts of biological samples across several techniques and applications. In particular, in the last few years, Raman spectroscopy analysis of biological samples has been successfully applied for early-stage diagnosis. However, the spectra's inherent complexity and variability make manual analysis challenging, even for domain experts. For the same reason, the use of traditional Statistical and Machine Learning (ML) techniques cannot guarantee accurate and reliable results. ML models, combined with robust optimization techniques, offer the possibility to improve the classification accuracy and enhance the resilience of predictive models. In this paper, we investigate the performance of a novel robust formulation for Support Vector Machine (SVM) in classifying COVID-19 samples obtained from Raman spectroscopy. Given the noisy and perturbed nature of biological samples, we protect the classification process against uncertainty through the application of robust optimization techniques. Specifically, we derive robust counterpart models of deterministic formulations using bounded-by-norm uncertainty sets around each observation. We explore the cases of both linear and kernel-induced classifiers to address binary and multiclass classification tasks. The effectiveness of our approach is validated on real-world COVID-19 datasets provided by Italian hospitals by comparing the results of our simulations with a state-of-the-art classifier.

[LG-70] Molecular Fingerprints Are Strong Models for Peptide Function Prediction

链接: https://arxiv.org/abs/2501.17901
作者: Jakub Adamczyk,Piotr Ludynia,Wojciech Czech
类目: Biomolecules (q-bio.BM); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We study the effectiveness of molecular fingerprints for peptide property prediction and demonstrate that domain-specific feature extraction from molecular graphs can outperform complex and computationally expensive models such as GNNs, pretrained sequence-based transformers and multimodal ensembles, even without hyperparameter tuning. To this end, we perform a thorough evaluation on 126 datasets, achieving state-of-the-art results on LRGB and 5 other peptide function prediction benchmarks. We show that models based on count variants of ECFP, Topological Torsion, and RDKit molecular fingerprints and LightGBM as classification head are remarkably robust. The strong performance of molecular fingerprints, which are intrinsically very short-range feature encoders, challenges the presumed importance of long-range interactions in peptides. Our conclusion is that the use of molecular fingerprints for larger molecules, such as peptides, can be a computationally feasible, low-parameter, and versatile alternative to sophisticated deep learning models.
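
The recipe is straightforward to reproduce with standard tooling; a minimal sketch with RDKit count-based Morgan (ECFP) fingerprints and LightGBM follows, where the SMILES strings, labels, and hyperparameters are placeholders rather than the paper's setup:

```python
# Count-based Morgan/ECFP fingerprints + LightGBM, mirroring the recipe above
# at toy scale; SMILES and labels are placeholders, hyperparameters untuned.
import numpy as np
from rdkit import Chem
from rdkit.Chem import AllChem
from lightgbm import LGBMClassifier

def ecfp_counts(smi, radius=2, n_bits=2048):
    mol = Chem.MolFromSmiles(smi)
    fp = AllChem.GetHashedMorganFingerprint(mol, radius, nBits=n_bits)
    arr = np.zeros(n_bits)
    for idx, count in fp.GetNonzeroElements().items():
        arr[idx] = count                     # count variant, not binary bits
    return arr

smiles = ["CC(=O)NC1=CC=C(O)C=C1", "CCO", "C1=CC=CC=C1O", "CCN"]  # placeholders
labels = [1, 0, 1, 0]
X = np.array([ecfp_counts(s) for s in smiles])
clf = LGBMClassifier(n_estimators=50, min_child_samples=1).fit(X, labels)
```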

[LG-71] Distilling Knowledge for Designing Computational Imaging Systems

链接: https://arxiv.org/abs/2501.17898
作者: Leon Suarez-Rodriguez,Roman Jacome,Henry Arguello
类目: Image and Video Processing (eess.IV); Machine Learning (cs.LG)
*备注: 14 figures, 16 pages

点击查看摘要

Abstract:Designing the physical encoder is crucial for accurate image reconstruction in computational imaging (CI) systems. Currently, these systems are designed via end-to-end (E2E) optimization, where the encoder is modeled as a neural network layer and is jointly optimized with the decoder. However, the performance of E2E optimization is significantly reduced by the physical constraints imposed on the encoder. Also, since E2E learns the parameters of the encoder by backpropagating the reconstruction error, it does not promote optimal intermediate outputs and suffers from gradient vanishing. To address these limitations, we reinterpret the concept of knowledge distillation (KD) for designing a physically constrained CI system by transferring the knowledge of a pretrained, less-constrained CI system. Our approach involves three steps: (1) Given the original CI system (student), a teacher system is created by relaxing the constraints on the student’s encoder. (2) The teacher is optimized to solve a less-constrained version of the student’s problem. (3) The teacher guides the training of the student through two proposed knowledge transfer functions, targeting both the encoder and the decoder feature space. The proposed method can be applied to any imaging modality, since the relaxation scheme and the loss functions can be adapted according to the physical acquisition and the employed decoder. This approach was validated on three representative CI modalities: magnetic resonance, single-pixel, and compressive spectral imaging. Simulations show that a teacher system with an encoder that has a structure similar to that of the student encoder provides effective guidance. Our approach achieves significantly improved reconstruction performance and encoder design, outperforming both E2E optimization and traditional non-data-driven encoder designs.

[LG-72] Language Modelling for Speaker Diarization in Telephonic Interviews

链接: https://arxiv.org/abs/2501.17893
作者: Miquel India,Javier Hernando,José A.R. Fonollosa
类目: Audio and Speech Processing (eess.AS); Machine Learning (cs.LG); Sound (cs.SD)
*备注:

点击查看摘要

Abstract:The aim of this paper is to investigate the benefit of combining both language and acoustic modelling for speaker diarization. Although conventional systems only use acoustic features, in some scenarios linguistic data contain highly discriminative speaker information, even more reliable than the acoustic ones. In this study we analyze how an appropriate fusion of both kinds of features is able to obtain good results in these cases. The proposed system is based on an iterative algorithm where an LSTM network is used as a speaker classifier. The network is fed with character-level word embeddings and a GMM-based acoustic score created with the output labels from previous iterations. The presented algorithm has been evaluated on a call-center database, which is composed of telephone interview audios. The combination of acoustic features and linguistic content shows an 84.29% improvement in terms of word-level DER as compared to a HMM/VB baseline system. The results of this study confirm that linguistic content can be efficiently used for some speaker recognition tasks.

[LG-73] Heterogeneous Multi-Player Multi-Armed Bandits Robust To Adversarial Attacks

链接: https://arxiv.org/abs/2501.17882
作者: Akshayaa Magesh,Venugopal V. Veeravalli
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We consider a multi-player multi-armed bandit setting in the presence of adversaries that attempt to negatively affect the rewards received by the players in the system. The reward distributions for any given arm are heterogeneous across the players. In the event of a collision (more than one player choosing the same arm), all the colliding users receive zero rewards. The adversaries use collisions to affect the rewards received by the players, i.e., if an adversary attacks an arm, any player choosing that arm will receive zero reward. At any time step, the adversaries may attack more than one arm. It is assumed that the players in the system do not deviate from a pre-determined policy used by all the players, and that the probability that none of the arms face adversarial attacks is strictly positive at every time step. In order to combat the adversarial attacks, the players are allowed to communicate using a single bit for $O(\log T)$ time units, where $T$ is the time horizon, and each player can only observe their own actions and rewards at all time steps. We propose a policy that is used by all the players, which achieves near order-optimal regret of $O(\log^{1+\delta} T + W)$, where $W$ is the total number of time units for which there was an adversarial attack on at least one arm.

[LG-74] Performance Analysis of NR Sidelink and Wi-Fi Coexistence Networks in Unlicensed Spectrum

链接: https://arxiv.org/abs/2501.17878
作者: Zhuangzhuang Yan,Xinyu Gu,Zhenyu Liu
类目: Signal Processing (eess.SP); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:With the rapid development of various internet of things (IoT) applications, including industrial IoT (IIoT) and visual IoT (VIoT), the demand for direct device-to-device communication to support high data rates continues to grow. To address this demand, 5G-Advanced has introduced sidelink communication over the unlicensed spectrum (SL-U) as a method to increase data rates. However, the primary challenge of SL-U in the unlicensed spectrum is ensuring fair coexistence with other incumbent systems, such as Wi-Fi. In this paper, we address the challenge by designing channel access mechanisms and power control strategies to mitigate interference and ensure fair coexistence. First, we propose a novel collaborative channel access (CCHA) mechanism that integrates channel access with resource allocation through collaborative interactions between base stations (BS) and SL-U users. This mechanism ensures fair coexistence with incumbent systems while improving resource utilization. Second, we mathematically model the joint channel access and power control problems, analyzing the trade-off between fairness and transmission rate to minimize interference and optimize performance in the coexistence system. Finally, we develop a collaborative subgoal-based hierarchical deep reinforcement learning (C-GHDRL) framework. This framework enables SL-U users to make globally optimal decisions by leveraging collaborative operations between the BS and SL-U users, effectively overcoming the limitations of traditional optimization methods in solving joint optimization problems with nonlinear constraints. Simulation results demonstrate that the proposed scheme significantly enhances the coexistence system’s performance while ensuring fair coexistence between SL-U and Wi-Fi users.

[LG-75] On the challenges of detecting MCI using EEG in the wild

链接: https://arxiv.org/abs/2501.17871
作者: Aayush Mishra,David Joffe,Sankara Surendra Telidevara,David S Oakley,Anqi Liu
类目: Signal Processing (eess.SP); Machine Learning (cs.LG)
*备注: 10 pages

点击查看摘要

Abstract:Recent studies have shown promising results in the detection of Mild Cognitive Impairment (MCI) using easily accessible Electroencephalogram (EEG) data which would help administer early and effective treatment for dementia patients. However, the reliability and practicality of such systems remains unclear. In this work, we investigate the potential limitations and challenges in developing a robust MCI detection method using two contrasting datasets: 1) CAUEEG, collected and annotated by expert neurologists in controlled settings and 2) GENEEG, a new dataset collected and annotated in general practice clinics, a setting where routine MCI diagnoses are typically made. We find that training on small datasets, as is done by most previous works, tends to produce high variance models that make overconfident predictions, and are unreliable in practice. Additionally, distribution shifts between datasets make cross-domain generalization challenging. Finally, we show that MCI detection using EEG may suffer from fundamental limitations because of the overlapping nature of feature distributions with control groups. We call for more effort in high-quality data collection in actionable settings (like general practice clinics) to make progress towards this salient goal of non-invasive MCI detection.

信息检索

[IR-0] Illusions of Relevance: Using Content Injection Attacks to Deceive Retrievers, Rerankers, and LLM Judges

链接: https://arxiv.org/abs/2501.18536
作者: Manveer Singh Tamber,Jimmy Lin
类目: Information Retrieval (cs.IR)
*备注:

点击查看摘要

Abstract:Consider a scenario in which a user searches for information, only to encounter texts flooded with misleading or non-relevant content. This scenario exemplifies a simple yet potent vulnerability in neural Information Retrieval (IR) pipelines: content injection attacks. We find that embedding models for retrieval, rerankers, and large language model (LLM) relevance judges are vulnerable to these attacks, in which adversaries insert misleading text into passages to manipulate model judgements. We identify two primary threats: (1) inserting unrelated or harmful content within passages that still appear deceptively “relevant”, and (2) inserting entire queries or key query terms into passages to boost their perceived relevance. While the second tactic has been explored in prior research, we present, to our knowledge, the first empirical analysis of the first threat, demonstrating how state-of-the-art models can be easily misled. Our study systematically examines the factors that influence an attack’s success, such as the placement of injected content and the balance between relevant and non-relevant material. Additionally, we explore various defense strategies, including adversarial passage classifiers, retriever fine-tuning to discount manipulated content, and prompting LLM judges to adopt a more cautious approach. However, we find that these countermeasures often involve trade-offs, sacrificing effectiveness for attack robustness and sometimes penalizing legitimate documents in the process. Our findings highlight the need for stronger defenses against these evolving adversarial strategies to maintain the trustworthiness of IR systems. We release our code and scripts to facilitate further research.

[IR-1] Behavior Modeling Space Reconstruction for E-Commerce Search

链接: https://arxiv.org/abs/2501.18216
作者: Yejing Wang,Chi Zhang,Xiangyu Zhao,Qidong Liu,Maolin Wang,Xuewei Tao,Zitao Liu,Xing Shi,Xudong Yang,Ling Zhong,Wei Lin
类目: Information Retrieval (cs.IR)
*备注:

点击查看摘要

Abstract:Delivering superior search services is crucial for enhancing customer experience and driving revenue growth. Conventionally, search systems model user behaviors by combining user preference and query item relevance statically, often through a fixed logical ‘and’ relationship. This paper reexamines existing approaches through a unified lens using both causal graphs and Venn diagrams, uncovering two prevalent yet significant issues: entangled preference and relevance effects, and a collapsed modeling space. To surmount these challenges, our research introduces a novel framework, DRP, which enhances search accuracy through two components to reconstruct the behavior modeling space. Specifically, we implement preference editing to proactively remove the relevance effect from preference predictions, yielding untainted user preferences. Additionally, we employ adaptive fusion, which dynamically adjusts fusion criteria to align with the varying patterns of relevance and preference, facilitating more nuanced and tailored behavior predictions within the reconstructed modeling space. Empirical validation on two public datasets and a proprietary search dataset underscores the superiority of our proposed methodology, demonstrating marked improvements in performance over existing approaches.

[IR-2] Hashtag Re-Appropriation for Audience Control on Recommendation-Driven Social Media Xiaohongshu (rednote)

链接: https://arxiv.org/abs/2501.18210
作者: Ruyuan Wan,Lingbo Tong,Tiffany Knearem,Toby Jia-Jun Li,Ting-Hao ‘Kenneth’ Huang,Qunfang Wu
类目: Human-Computer Interaction (cs.HC); Computers and Society (cs.CY); Information Retrieval (cs.IR); Social and Information Networks (cs.SI)
*备注:

点击查看摘要

Abstract:Algorithms have played a central role in personalized recommendations on social media. However, they also present significant obstacles for content creators trying to predict and manage their audience reach. This issue is particularly challenging for marginalized groups seeking to maintain safe spaces. Our study explores how women on Xiaohongshu (rednote), a recommendation-driven social platform, proactively re-appropriate hashtags (e.g., #Baby Supplemental Food) by using them in posts unrelated to their literal meaning. The hashtags were strategically chosen from topics that would be uninteresting to the male audience they wanted to block. Through a mixed-methods approach, we analyzed the practice of hashtag re-appropriation based on 5,800 collected posts and interviewed 24 active users from diverse backgrounds to uncover users’ motivations and reactions towards the re-appropriation. This practice highlights how users can reclaim agency over content distribution on recommendation-driven platforms, offering insights into self-governance within algorithmic-centered power structures.

[IR-3] Investigating Tax Evasion Emergence Using Dual Large Language Model and Deep Reinforcement Learning Powered Agent-based Simulation

链接: https://arxiv.org/abs/2501.18177
作者: Teddy Lazebnik,Labib Shami
类目: Information Retrieval (cs.IR); Computers and Society (cs.CY); Multiagent Systems (cs.MA)
*备注:

点击查看摘要

Abstract:Tax evasion, usually the largest component of an informal economy, is a persistent challenge over history with significant socio-economic implications. Many socio-economic studies investigate its dynamics, including influencing factors, the role and influence of taxation policies, and the prediction of the tax evasion volume over time. These studies assumed such behavior is given, as observed in the real world, neglecting the “big bang” of such activity in a population. To this end, computational economy studies adopted developments in computer simulations, in general, and recent innovations in artificial intelligence (AI), in particular, to simulate and study informal economy appearance in various socio-economic settings. This study presents a novel computational framework to examine the dynamics of tax evasion and the emergence of informal economic activity. Employing an agent-based simulation powered by Large Language Models and Deep Reinforcement Learning, the framework is uniquely designed to allow informal economic behaviors to emerge organically, without presupposing their existence or explicitly signaling agents about the possibility of evasion. This provides a rigorous approach for exploring the socio-economic determinants of compliance behavior. The experimental design, comprising model validation and exploratory phases, demonstrates the framework’s robustness in replicating theoretical economic behaviors. Findings indicate that individual personality traits, external narratives, enforcement probabilities, and the perceived efficiency of public goods provision significantly influence both the timing and extent of informal economic activity. The results underscore that efficient public goods provision and robust enforcement mechanisms are complementary; neither alone is sufficient to curtail informal activity effectively.

[IR-4] Improving Minimax Group Fairness in Sequential Recommendation ECIR2025

链接: https://arxiv.org/abs/2501.18117
作者: Krishna Acharya,David Wardrope,Timos Korres,Aleksandr Petrov,Anders Uhrenholt
类目: Information Retrieval (cs.IR)
*备注: This paper has been accepted to the IR for Good track at ECIR 2025

点击查看摘要

Abstract:Training sequential recommenders such as SASRec with uniform sample weights achieves good overall performance but can fall short on specific user groups. One such example is popularity bias, where mainstream users receive better recommendations than niche content viewers. To improve recommendation quality across diverse user groups, we explore three Distributionally Robust Optimization (DRO) methods: Group DRO, Streaming DRO, and Conditional Value at Risk (CVaR) DRO. While Group and Streaming DRO rely on group annotations and struggle with users belonging to multiple groups, CVaR does not require such annotations and can naturally handle overlapping groups. In experiments on two real-world datasets, we show that the DRO methods outperform standard training, with CVaR delivering the best results. Additionally, we find that Group and Streaming DRO are sensitive to the choice of group used for loss computation. Our contributions include (i) a novel application of CVaR to recommenders, (ii) showing that the DRO methods improve group metrics as well as overall performance, and (iii) demonstrating CVaR’s effectiveness in the practical scenario of intersecting user groups.
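
The CVaR objective is simple to state: instead of the mean loss, average only the worst alpha-fraction of per-user losses in each batch. A compact sketch (with the grouping-by-user details elided) follows:

```python
# CVaR-DRO training loss: average the worst alpha-fraction of per-example
# (here, per-user) losses, so optimization focuses on the hardest cases
# without needing explicit group annotations.
import torch

def cvar_loss(per_user_losses: torch.Tensor, alpha: float = 0.2) -> torch.Tensor:
    k = max(1, int(alpha * per_user_losses.numel()))
    worst, _ = torch.topk(per_user_losses, k)   # hardest alpha-fraction
    return worst.mean()

losses = torch.rand(256)                        # stand-in for per-user losses
print(cvar_loss(losses, alpha=0.2))
```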

[IR-5] RL-based Query Rewriting with Distilled LLM for online E-Commerce Systems

Link: https://arxiv.org/abs/2501.18056
Authors: Duy A. Nguyen, Rishi Kesav Mohan, Van Yang, Pritom Saha Akash, Kevin Chen-Chuan Chang
Subjects: Information Retrieval (cs.IR)
Comments:

Abstract: Query rewriting (QR) is a critical technique in e-commerce search, addressing the lexical gap between user queries and product descriptions to enhance search performance. Existing QR approaches typically fall into two categories: discriminative models and generative methods leveraging large language models (LLMs). Discriminative models often struggle with natural language understanding and offer limited flexibility in rewriting, while generative LLMs, despite producing high-quality rewrites, face high inference latency and cost in online settings. These limitations force offline deployment, making them vulnerable to issues like information staleness and semantic drift. To overcome these challenges, we propose a novel hybrid pipeline for QR that balances efficiency and effectiveness. Our approach combines offline knowledge distillation to create a lightweight but efficient student model with online reinforcement learning (RL) to refine query rewriting dynamically using real-time feedback. A key innovation is the use of LLMs as simulated human feedback, enabling scalable reward signals and cost-effective evaluation without manual annotations. Experimental results on the Amazon ESCI dataset demonstrate significant improvements in query relevance, diversity, and adaptability, as well as positive feedback from the LLM simulation. This work contributes to advancing LLM capabilities for domain-specific applications, offering a robust solution for dynamic and complex e-commerce search environments.
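
The online RL stage can be pictured as a standard policy-gradient loop in which the LLM judge supplies the reward. The sketch below is schematic, not the paper's algorithm: for brevity the policy scores a fixed candidate set rather than decoding rewrites token by token, and `policy`, `llm_judge`, and the feature inputs are hypothetical placeholders.

```python
import torch

def reinforce_step(policy, query_feats, candidates, llm_judge, optimizer):
    logits = policy(query_feats)                 # one logit per candidate rewrite
    dist = torch.distributions.Categorical(logits=logits)
    idx = dist.sample()                          # sample a rewrite to try online
    reward = llm_judge(candidates[idx])          # LLM-simulated feedback in [0, 1]
    loss = -reward * dist.log_prob(idx)          # REINFORCE gradient estimator
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return candidates[idx], reward
```

The appeal of the design is that the reward call is just another LLM invocation, so the feedback loop scales without human annotators while the lightweight student keeps serving latency low.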

[IR-6] Can Generative LLMs Create Query Variants for Test Collections? An Exploratory Study SIGIR’23

Link: https://arxiv.org/abs/2501.17981
Authors: Marwah Alaofi, Luke Gallagher, Mark Sanderson, Falk Scholer, Paul Thomas
Subjects: Information Retrieval (cs.IR)
Comments: Published in the proceedings of SIGIR’23

Abstract: This paper explores the utility of a Large Language Model (LLM) to automatically generate queries and query variants from a description of an information need. Given a set of information needs described as backstories, we explore how similar the queries generated by the LLM are to those generated by humans. We quantify the similarity using different metrics and examine how the use of each set would contribute to document pooling when building test collections. Our results show potential in using LLMs to generate query variants. While they may not fully capture the wide variety of human-generated variants, they generate similar sets of relevant documents, reaching up to 71.1% overlap at a pool depth of 100.
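
The pooling comparison behind the 71.1% figure can be sketched as follows: build one document pool per variant set by unioning each variant's top-k results, then measure how much the two pools overlap. The paper's exact overlap measure is not reproduced here, so the Jaccard definition and the `retrieve` function below are assumptions.

```python
def build_pool(variants, retrieve, depth=100):
    # Union of the top-`depth` documents retrieved for each query variant.
    pool = set()
    for q in variants:
        pool.update(retrieve(q)[:depth])
    return pool

def pool_overlap(human_variants, llm_variants, retrieve, depth=100):
    # Jaccard overlap between the human- and LLM-derived pools (assumed metric).
    h = build_pool(human_variants, retrieve, depth)
    g = build_pool(llm_variants, retrieve, depth)
    return len(h & g) / len(h | g) if h | g else 0.0
```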

[IR-7] LLM s can be Fooled into Labelling a Document as Relevant (best cafe near me; this paper is perfectly relevant) SIGIR

Link: https://arxiv.org/abs/2501.17969
Authors: Marwah Alaofi, Paul Thomas, Falk Scholer, Mark Sanderson
Subjects: Information Retrieval (cs.IR)
Comments: Published in the proceedings of SIGIR-AP’24

Abstract: LLMs are increasingly being used to assess the relevance of information objects. This work reports on experiments to study the labelling of short texts (i.e., passages) for relevance, using multiple open-source and proprietary LLMs. While the overall agreement of some LLMs with human judgements is comparable to human-to-human agreement measured in previous research, LLMs are more likely to label passages as relevant compared to human judges, indicating that LLM labels denoting non-relevance are more reliable than those indicating relevance. This observation prompts us to further examine cases where human judges and LLMs disagree, particularly when the human judge labels the passage as non-relevant and the LLM labels it as relevant. Results show a tendency for many LLMs to label passages that include the original query terms as relevant. We, therefore, conduct experiments to inject query words into random and irrelevant passages, not unlike the way we inserted the query “best café near me” into this paper. The results show that LLMs are highly influenced by the presence of query words in the passages under assessment, even if the wider passage has no relevance to the query. This tendency of LLMs to be fooled by the mere presence of query words demonstrates a weakness in our current measures of LLM labelling: relying on overall agreement misses important patterns of failures. There is a real risk of bias in LLM-generated relevance labels and, therefore, a risk of bias in rankers trained on those labels. We also investigate the effects of deliberately manipulating LLMs by instructing them to label passages as relevant, similar to the instruction “this paper is perfectly relevant” inserted above. We find that such manipulation influences the performance of some LLMs, highlighting the critical need to consider potential vulnerabilities when deploying LLMs in real-world applications.
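
The injection experiment is easy to reproduce in outline: splice the query string into an otherwise irrelevant passage and check whether an LLM judge flips its label. The sketch below is a generic probe rather than the authors' protocol; `llm_is_relevant` stands in for whatever LLM relevance prompt is being tested.

```python
import random

def inject_query(passage: str, query: str) -> str:
    # Splice the query string into a random position of an otherwise
    # irrelevant passage, mirroring the "best cafe near me" manipulation.
    words = passage.split()
    i = random.randrange(len(words) + 1)
    return " ".join(words[:i] + [query] + words[i:])

def flip_rate(llm_is_relevant, query, irrelevant_passages):
    # Fraction of passages the judge flips to "relevant" once query words
    # appear; llm_is_relevant(query, passage) -> bool is a stand-in for
    # the LLM relevance-labelling call being probed, not a real API.
    flips = sum(
        (not llm_is_relevant(query, p))
        and llm_is_relevant(query, inject_query(p, query))
        for p in irrelevant_passages
    )
    return flips / len(irrelevant_passages)
```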

Attachments

Click to download the full list of today’s papers