This post lists the latest papers retrieved from arXiv.org on 2025-01-16. It is updated automatically and organized into five broad areas: NLP, CV, ML, AI, and IR. If you would like to receive the list by scheduled email, please leave your email address in the comments.
Friendly reminder: if you want the daily paper digest delivered to your inbox, leave your email address in the comments.
Table of Contents
Overview (2025-01-16)
306 new papers today, including:
- Natural Language Processing: 43 papers (Computation and Language (cs.CL))
- Artificial Intelligence: 80 papers (Artificial Intelligence (cs.AI))
- Computer Vision: 75 papers (Computer Vision and Pattern Recognition (cs.CV))
- Machine Learning: 96 papers (Machine Learning (cs.LG))
Natural Language Processing
[NLP-0] Multimodal LLMs Can Reason about Aesthetics in Zero-Shot
[Quick Read]: This paper addresses how the reasoning ability of Multimodal LLMs (MLLMs) can be elicited and evaluated for assessing artistic aesthetics, particularly on artistic stylization tasks. The authors construct MM-StyleBench, a high-quality dataset for benchmarking artistic stylization, propose a principled method for human preference modeling, and systematically analyze the correlation between MLLM responses and human preferences. Their experiments reveal a hallucination issue in art evaluation driven by response subjectivity. To address it, the paper proposes ArtCoT, which strengthens MLLMs' aesthetic reasoning through art-specific task decomposition and the use of concrete language. The key to the solution is improving the accuracy and reliability of MLLM-based aesthetic evaluation via task decomposition and language refinement.
Link: https://arxiv.org/abs/2501.09012
Authors: Ruixiang Jiang, Changwen Chen
Affiliations: The Hong Kong Polytechnic University
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Multimedia (cs.MM)
Comments: WIP, Homepage this https URL
Abstract:We present the first study on how Multimodal LLMs’ (MLLMs) reasoning ability shall be elicited to evaluate the aesthetics of artworks. To facilitate this investigation, we construct MM-StyleBench, a novel high-quality dataset for benchmarking artistic stylization. We then develop a principled method for human preference modeling and perform a systematic correlation analysis between MLLMs’ responses and human preference. Our experiments reveal an inherent hallucination issue of MLLMs in art evaluation, associated with response subjectivity. ArtCoT is proposed, demonstrating that art-specific task decomposition and the use of concrete language boost MLLMs’ reasoning ability for aesthetics. Our findings offer valuable insights into MLLMs for art and can benefit a wide range of downstream applications, such as style transfer and artistic image generation. Code available at this https URL.
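The correlation analysis between MLLM responses and human preference can be illustrated with a minimal Spearman rank-correlation sketch. The scores below are hypothetical; the paper's actual preference-modeling pipeline and the MM-StyleBench data are not reproduced here.

```python
# Minimal sketch of a response-vs-human-preference correlation analysis.
# All scores are hypothetical illustrations, not data from the paper.

def rank(xs):
    """Average 1-based ranks, with ties sharing their mean rank."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    ranks = [0.0] * len(xs)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and xs[order[j + 1]] == xs[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of 1-based positions i+1..j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman(a, b):
    """Spearman's rho = Pearson correlation of the rank vectors."""
    ra, rb = rank(a), rank(b)
    n = len(ra)
    ma, mb = sum(ra) / n, sum(rb) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(ra, rb))
    va = sum((x - ma) ** 2 for x in ra) ** 0.5
    vb = sum((y - mb) ** 2 for y in rb) ** 0.5
    return cov / (va * vb)

# Hypothetical aesthetic scores from an MLLM vs. aggregated human preference:
mllm_scores = [0.9, 0.4, 0.7, 0.2, 0.6]
human_scores = [0.8, 0.3, 0.9, 0.1, 0.5]
print(round(spearman(mllm_scores, human_scores), 3))
```

A higher rank correlation would indicate that the model orders artworks more like human raters do, which is the quantity such an analysis tracks.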
[NLP-1] Aegis2.0: A Diverse AI Safety Dataset and Risks Taxonomy for Alignment of LLM Guardrails
[Quick Read]: This paper tackles content safety for Large Language Models (LLMs) and generative AI as they become widely deployed, in particular the lack of high-quality, human-annotated datasets that cover the full spectrum of LLM-related safety risks. The paper proposes a comprehensive, adaptable taxonomy that organizes safety risks into 12 top-level hazard categories, extended with 9 fine-grained subcategories to meet the diverse needs of downstream users. The key elements of the solution are: 1) a hybrid data generation pipeline combining human annotation with a multi-LLM "jury" system for assessing response safety, yielding the Aegis 2.0 dataset of 34,248 annotated human-LLM interaction samples; 2) lightweight models trained with parameter-efficient techniques on Aegis 2.0, shown to be competitive with models fully fine-tuned on much larger, non-commercial datasets; and 3) a novel training blend combining safety with topic following, which improves the adaptability of guard models so they generalize to new risk categories defined at inference time. The authors plan to open-source the Aegis 2.0 data and models to support research on LLM safety guardrails.
Link: https://arxiv.org/abs/2501.09004
Authors: Shaona Ghosh, Prasoon Varshney, Makesh Narsimhan Sreedhar, Aishwarya Padmakumar, Traian Rebedea, Jibin Rajan Varghese, Christopher Parisien
Affiliations: NVIDIA
Categories: Computation and Language (cs.CL)
Comments: arXiv admin note: text overlap with arXiv:2404.05993
Abstract:As Large Language Models (LLMs) and generative AI become increasingly widespread, concerns about content safety have grown in parallel. Currently, there is a clear lack of high-quality, human-annotated datasets that address the full spectrum of LLM-related safety risks and are usable for commercial applications. To bridge this gap, we propose a comprehensive and adaptable taxonomy for categorizing safety risks, structured into 12 top-level hazard categories with an extension to 9 fine-grained subcategories. This taxonomy is designed to meet the diverse requirements of downstream users, offering more granular and flexible tools for managing various risk types. Using a hybrid data generation pipeline that combines human annotations with a multi-LLM “jury” system to assess the safety of responses, we obtain Aegis 2.0, a carefully curated collection of 34,248 samples of human-LLM interactions, annotated according to our proposed taxonomy. To validate its effectiveness, we demonstrate that several lightweight models, trained using parameter-efficient techniques on Aegis 2.0, achieve performance competitive with leading safety models fully fine-tuned on much larger, non-commercial datasets. In addition, we introduce a novel training blend that combines safety with topic following. This approach enhances the adaptability of guard models, enabling them to generalize to new risk categories defined during inference. We plan to open-source Aegis 2.0 data and models to the research community to aid in the safety guardrailing of LLMs.
[NLP-2] Personality Modeling for Persuasion of Misinformation using AI Agent
[Quick Read]: This paper studies how the spread of misinformation on social media platforms is shaped by individual personality traits, and how understanding these effects can inform effective interventions. Using agent-based modeling, the study simulates AI agents embodying different Big Five personality traits interacting across six misinformation topics, revealing how trait combinations influence susceptibility to and propagation of misinformation.
The key to the approach is the AgentScope framework with the GLM-4-Flash model, which generated 90 unique interaction scenarios for analyzing how different trait combinations perform in evidence-based discussions. The study finds that agents with analytical and critical traits are more effective in evidence-based discussions, while non-aggressive persuasion strategies show unexpected success in correcting misinformation. Notably, agents with critical traits achieved a 59.4% success rate in discussions of HIV-related misinformation, and agents using non-aggressive approaches maintained persuasion rates above 40% across personality combinations. These findings offer key insights for developing personality-aware interventions and suggest prioritizing emotional connection and trust-building over confrontational tactics in digital environments.
Link: https://arxiv.org/abs/2501.08985
Authors: Qianmin Lou, Wentao Xu
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Science and Game Theory (cs.GT)
Comments:
Abstract:The proliferation of misinformation on social media platforms has highlighted the need to understand how individual personality traits influence susceptibility to and propagation of misinformation. This study employs an innovative agent-based modeling approach to investigate the relationship between personality traits and misinformation dynamics. Using six AI agents embodying different dimensions of the Big Five personality traits (Extraversion, Agreeableness, and Neuroticism), we simulated interactions across six diverse misinformation topics. The experiment, implemented through the AgentScope framework using the GLM-4-Flash model, generated 90 unique interactions, revealing complex patterns in how personality combinations affect persuasion and resistance to misinformation. Our findings demonstrate that analytical and critical personality traits enhance effectiveness in evidence-based discussions, while non-aggressive persuasion strategies show unexpected success in misinformation correction. Notably, agents with critical traits achieved a 59.4% success rate in HIV-related misinformation discussions, while those employing non-aggressive approaches maintained consistent persuasion rates above 40% across different personality combinations. The study also revealed a non-transitive pattern in persuasion effectiveness, challenging conventional assumptions about personality-based influence. These results provide crucial insights for developing personality-aware interventions in digital environments and suggest that effective misinformation countermeasures should prioritize emotional connection and trust-building over confrontational approaches. The findings contribute to both theoretical understanding of personality-misinformation dynamics and practical strategies for combating misinformation in social media contexts.
[NLP-3] Learning to Extract Cross-Domain Aspects and Understanding Sentiments Using Large Language Models
[Quick Read]: This paper examines how Large Language Models (LLMs) can be used for cross-domain aspect-based sentiment analysis (ABSA), and how to define a framework for it that transfers across products and services. ABSA extracts specific aspects from text (e.g., quality, price, service) and classifies the sentiment toward each, yielding a more fine-grained analysis of customer opinions than assigning one overall score to a review. The key contribution is validating the effectiveness of LLMs for ABSA by combining NLP and machine learning techniques on the SemEval-2015 Task 12 dataset, reaching 92% accuracy. This approach can help businesses pinpoint strengths and areas for improvement in customer feedback, and provides deeper sentiment insight for personalizing customer experiences.
Link: https://arxiv.org/abs/2501.08974
Authors: Karukriti Kaushik Ghosh, Chiranjib Sur
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments:
Abstract:Aspect-based sentiment analysis (ABSA) is a refined approach to sentiment analysis that aims to extract and classify sentiments based on specific aspects or features of a product, service, or entity. Unlike traditional sentiment analysis, which assigns a general sentiment score to entire reviews or texts, ABSA focuses on breaking down the text into individual components or aspects (e.g., quality, price, service) and evaluating the sentiment towards each. This allows for a more granular level of understanding of customer opinions, enabling businesses to pinpoint specific areas of strength and improvement. The process involves several key steps, including aspect extraction, sentiment classification, and aspect-level sentiment aggregation for a review paragraph or any other form that the users have provided. ABSA has significant applications in areas such as product reviews, social media monitoring, customer feedback analysis, and market research. By leveraging techniques from natural language processing (NLP) and machine learning, ABSA facilitates the extraction of valuable insights, enabling companies to make data-driven decisions that enhance customer satisfaction and optimize offerings. As ABSA evolves, it holds the potential to greatly improve personalized customer experiences by providing a deeper understanding of sentiment across various product aspects. In this work, we have analyzed the strength of LLMs for a complete cross-domain aspect-based sentiment analysis with the aim of defining the framework for certain products and using it for other similar situations. We argue that it is possible to achieve 92% accuracy on the Aspect-Based Sentiment Analysis dataset of SemEval-2015 Task 12.
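The aspect-level sentiment aggregation step described above can be sketched as a majority vote per aspect over extracted (aspect, sentiment) pairs. The pairs below are hypothetical extractor outputs (e.g., from an LLM), not data from the paper.

```python
# Minimal sketch of aspect-level sentiment aggregation for one review:
# collect every sentiment mention per aspect, then take the majority label.
from collections import Counter, defaultdict

def aggregate_aspects(pairs):
    """Majority sentiment per aspect across all sentences of a review."""
    by_aspect = defaultdict(list)
    for aspect, sentiment in pairs:
        by_aspect[aspect].append(sentiment)
    return {a: Counter(s).most_common(1)[0][0] for a, s in by_aspect.items()}

# Hypothetical (aspect, sentiment) pairs from an upstream extractor:
extracted = [
    ("price", "negative"),
    ("service", "positive"),
    ("price", "negative"),
    ("quality", "positive"),
    ("service", "negative"),
    ("service", "positive"),
]
print(aggregate_aspects(extracted))
```

A production system would also weight mentions by confidence or recency; the majority vote here is only the simplest aggregation rule.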
[NLP-4] Applying General Turn-taking Models to Conversational Human-Robot Interaction
[Quick Read]: This paper targets the unnatural pauses and interruptions caused by simple silence-based turn-taking models in Human-Robot Interaction (HRI) systems. For the first time, it applies general turn-taking models, TurnGPT and Voice Activity Projection (VAP), to an HRI system to improve conversational dynamics. These models are trained on human-human dialogue data with self-supervised objectives and require no domain-specific fine-tuning. The key idea is to use the two models in tandem to predict when the robot should start preparing a response, take the turn, and handle potential interruptions. In a within-subject evaluation using the Furhat robot with 39 adults in a conversational setting, participants significantly preferred the proposed system, which also significantly reduced response delays and interruptions.
Link: https://arxiv.org/abs/2501.08946
Authors: Gabriel Skantze, Bahar Irfan
Affiliations: KTH Royal Institute of Technology
Categories: Computation and Language (cs.CL); Robotics (cs.RO)
Comments: Accepted at HRI 2025 (the IEEE/ACM International Conference on Human-Robot Interaction)
Abstract:Turn-taking is a fundamental aspect of conversation, but current Human-Robot Interaction (HRI) systems often rely on simplistic, silence-based models, leading to unnatural pauses and interruptions. This paper investigates, for the first time, the application of general turn-taking models, specifically TurnGPT and Voice Activity Projection (VAP), to improve conversational dynamics in HRI. These models are trained on human-human dialogue data using self-supervised learning objectives, without requiring domain-specific fine-tuning. We propose methods for using these models in tandem to predict when a robot should begin preparing responses, take turns, and handle potential interruptions. We evaluated the proposed system in a within-subject study against a traditional baseline system, using the Furhat robot with 39 adults in a conversational setting, in combination with a large language model for autonomous response generation. The results show that participants significantly prefer the proposed system, and it significantly reduces response delays and interruptions.
[NLP-5] Disentangling Exploration of Large Language Models by Optimal Exploitation
[Quick Read]: This paper investigates how effectively large language models (LLMs) explore a state space. Existing evaluations mostly focus on the exploration-exploitation trade-off, typically assessed with multi-armed bandit problems. This work instead isolates exploration as the sole objective, tasking the model with delivering information that enhances future returns. The key to the evaluation method is decomposing missing rewards into exploration and exploitation components by measuring the optimal achievable return over the states already explored. Experiments show that most models struggle to sufficiently explore the state space, and that exploration performance correlates positively with model size, with larger models exploring better. The decomposition also surfaces behavioral differences driven by different agent instructions during prompt engineering, offering a valuable tool for refining LLM performance on exploratory tasks.
Link: https://arxiv.org/abs/2501.08925
Authors: Tim Grams, Patrick Betz, Christian Bartelt
Affiliations: Technical University of Clausthal; University of Mannheim
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:
Abstract:Exploration is a crucial skill for self-improvement and open-ended problem-solving. However, it remains uncertain whether large language models can effectively explore the state-space. Existing evaluations predominantly focus on the trade-off between exploration and exploitation, often assessed in multi-armed bandit problems. In contrast, this work isolates exploration as the sole objective, tasking the agent with delivering information that enhances future returns. For the evaluation, we propose to decompose missing rewards into exploration and exploitation components by measuring the optimal achievable return for the states already explored. Our experiments with various LLMs reveal that most models struggle to sufficiently explore the state-space and that weak exploration is insufficient. We observe a positive correlation between model size and exploration performance, with larger models demonstrating superior capabilities. Furthermore, we show that our decomposition provides insights into differences in behaviors driven by agent instructions during prompt engineering, offering a valuable tool for refining LLM performance in exploratory tasks.
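The proposed decomposition can be sketched under simplifying assumptions: the reward an agent failed to collect splits into an exploration gap (value it never discovered) and an exploitation gap (discovered value it failed to act on). The state values below are hypothetical, and this single-step view is only an illustration of the idea, not the paper's formal setup.

```python
# Sketch of decomposing missing reward into exploration and exploitation
# components. All state values are hypothetical.

def decompose_missing_reward(state_values, explored, achieved_return):
    optimal = max(state_values.values())                       # best return overall
    optimal_explored = max(state_values[s] for s in explored)  # best among visited states
    exploration_gap = optimal - optimal_explored               # value never discovered
    exploitation_gap = optimal_explored - achieved_return      # discovered but not realized
    return exploration_gap, exploitation_gap

values = {"s0": 1.0, "s1": 4.0, "s2": 10.0}  # true optimal return per state
explored = {"s0", "s1"}                       # the agent never reached s2
expl, expt = decompose_missing_reward(values, explored, achieved_return=3.0)
print(expl, expt)  # 6.0 (unexplored value) and 1.0 (unexploited value)
```

The useful property is that the two gaps sum to the total missing reward, so a model that achieves little can be diagnosed as either a weak explorer or a weak exploiter.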
[NLP-6] GenAI Content Detection Task 3: Cross-Domain Machine-Generated Text Detection Challenge COLING2025
[Quick Read]: This shared task addresses a gap in the detection of LLM-generated text: prior shared tasks focus either on a single domain or on many domains, some unseen at test time. Using the newly released RAID benchmark, the task asks whether models can detect generated text from a large but fixed set of domains and LLMs, all of which are seen during training. The key is using RAID to evaluate detectors submitted by multiple teams on many domains and many generator models at once. The results show that several participating teams achieved over 99% detection accuracy on machine-generated text from RAID while maintaining a 5% false positive rate, indicating that detectors can robustly detect text from many domains and models simultaneously.
Link: https://arxiv.org/abs/2501.08913
Authors: Liam Dugan, Andrew Zhu, Firoj Alam, Preslav Nakov, Marianna Apidianaki, Chris Callison-Burch
Affiliations: University of Pennsylvania; Qatar Computing Research Institute; MBZUAI
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: COLING 2025
Abstract:Recently there have been many shared tasks targeting the detection of generated text from Large Language Models (LLMs). However, these shared tasks tend to focus either on cases where text is limited to one particular domain or cases where text can be from many domains, some of which may not be seen during test time. In this shared task, using the newly released RAID benchmark, we aim to answer whether or not models can detect generated text from a large, yet fixed, number of domains and LLMs, all of which are seen during training. Over the course of three months, our task was attempted by 9 teams with 23 detector submissions. We find that multiple participants were able to obtain accuracies of over 99% on machine-generated text from RAID while maintaining a 5% False Positive Rate – suggesting that detectors are able to robustly detect text from many domains and models simultaneously. We discuss potential interpretations of this result and provide directions for future research.
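Evaluating a detector at a fixed 5% false-positive rate, as in the task's metric, can be sketched as follows: pick the score threshold from the human-written texts so that at most 5% of them are falsely flagged, then measure how many machine-generated texts exceed it. The scores below are hypothetical detector outputs, not RAID data.

```python
# Sketch of "detection accuracy at 5% FPR". Higher score = "more likely
# machine-generated". Scores are hypothetical.

def accuracy_at_fpr(human_scores, machine_scores, target_fpr=0.05):
    # Choose the threshold so that at most target_fpr of human texts
    # are (falsely) flagged as machine-generated.
    hs = sorted(human_scores)
    cutoff_index = int(len(hs) * (1 - target_fpr))
    threshold = hs[min(cutoff_index, len(hs) - 1)]
    detected = sum(1 for s in machine_scores if s > threshold)
    return detected / len(machine_scores)

human = [0.01 * i for i in range(100)]   # detector scores on human-written text
machine = [0.96, 0.99, 0.97, 0.5, 0.98]  # detector scores on machine text
print(accuracy_at_fpr(human, machine))
```

Fixing the false-positive rate before comparing detectors matters because a detector can trivially reach high recall by flagging everything; the 5% FPR constraint makes the 99% figures in the task directly comparable.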
[NLP-7] ToMATO: Verbalizing the Mental States of Role-Playing LLMs for Benchmarking Theory of Mind AAAI2025
[Quick Read]: This paper addresses three ways in which existing Theory of Mind (ToM) benchmarks diverge from real-world scenarios: 1) they assess only a limited range of mental states (e.g., beliefs); 2) false beliefs are not comprehensively explored; and 3) the diverse personality traits of characters are overlooked. To address these issues, the authors introduce ToMATO, a new ToM benchmark formulated as multiple-choice QA over conversations generated via LLM-LLM dialogues with information asymmetry. The key is a prompting method that requires role-playing LLMs to verbalize their thoughts before each utterance, capturing first- and second-order mental states across five categories: belief, intention, desire, emotion, and knowledge. These verbalized thoughts serve as answers to questions assessing the mental states of characters in the conversations. The information asymmetry introduced by hiding thoughts from others induces false beliefs about various mental states, and assigning distinct personality traits to the LLMs further diversifies both utterances and thoughts. ToMATO contains 5.4k questions, 753 conversations, and 15 personality trait patterns. Analysis shows that this construction frequently generates false beliefs thanks to the information asymmetry between role-playing LLMs, and that it effectively reflects diverse personalities.
Link: https://arxiv.org/abs/2501.08838
Authors: Kazutoshi Shinoda, Nobukatsu Hojo, Kyosuke Nishida, Saki Mizuno, Keita Suzuki, Ryo Masumura, Hiroaki Sugiyama, Kuniko Saito
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Accepted by AAAI 2025
Abstract:Existing Theory of Mind (ToM) benchmarks diverge from real-world scenarios in three aspects: 1) they assess a limited range of mental states such as beliefs, 2) false beliefs are not comprehensively explored, and 3) the diverse personality traits of characters are overlooked. To address these challenges, we introduce ToMATO, a new ToM benchmark formulated as multiple-choice QA over conversations. ToMATO is generated via LLM-LLM conversations featuring information asymmetry. By employing a prompting method that requires role-playing LLMs to verbalize their thoughts before each utterance, we capture both first- and second-order mental states across five categories: belief, intention, desire, emotion, and knowledge. These verbalized thoughts serve as answers to questions designed to assess the mental states of characters within conversations. Furthermore, the information asymmetry introduced by hiding thoughts from others induces the generation of false beliefs about various mental states. Assigning distinct personality traits to LLMs further diversifies both utterances and thoughts. ToMATO consists of 5.4k questions, 753 conversations, and 15 personality trait patterns. Our analysis shows that this dataset construction approach frequently generates false beliefs due to the information asymmetry between role-playing LLMs, and effectively reflects diverse personalities. We evaluate nine LLMs on ToMATO and find that even GPT-4o mini lags behind human performance, especially in understanding false beliefs, and lacks robustness to various personality traits.
[NLP-8] MMDocIR: Benchmarking Multi-Modal Retrieval for Long Documents
[Quick Read]: This paper addresses the lack of an effective benchmark for multi-modal document retrieval, which involves identifying and retrieving various forms of content (figures, tables, charts, and layout information) from extensive documents. To fill this gap, the paper introduces MMDocIR, a benchmark with two tasks: page-level retrieval, which localizes the most relevant pages within a long document, and layout-level retrieval, which targets specific layouts (textual paragraphs, equations, figures, tables, charts) at a finer granularity than whole-page analysis. MMDocIR provides a rich dataset with expertly annotated labels for 1,685 questions and bootstrapped labels for 173,843 questions, making it a pivotal resource for both training and evaluation. Experiments show that visual retrievers significantly outperform text retrievers, that the MMDocIR training set effectively benefits multi-modal document retrieval training, and that text retrievers using VLM-text outperform those using OCR-text, underscoring the advantages of integrating visual elements.
Link: https://arxiv.org/abs/2501.08828
Authors: Kuicai Dong, Yujing Chang, Xin Deik Goh, Dexun Li, Ruiming Tang, Yong Liu
Affiliations: Noah's Ark Lab, Huawei
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Comments: this https URL
Abstract:Multi-modal document retrieval is designed to identify and retrieve various forms of multi-modal content, such as figures, tables, charts, and layout information from extensive documents. Despite its significance, there is a notable lack of a robust benchmark to effectively evaluate the performance of systems in multi-modal document retrieval. To address this gap, this work introduces a new benchmark, named MMDocIR, encompassing two distinct tasks: page-level and layout-level retrieval. The former focuses on localizing the most relevant pages within a long document, while the latter targets the detection of specific layouts, offering a more fine-grained granularity than whole-page analysis. A layout can refer to a variety of elements such as textual paragraphs, equations, figures, tables, or charts. The MMDocIR benchmark comprises a rich dataset featuring expertly annotated labels for 1,685 questions and bootstrapped labels for 173,843 questions, making it a pivotal resource for advancing multi-modal document retrieval for both training and evaluation. Through rigorous experiments, we reveal that (i) visual retrievers significantly outperform their text counterparts, (ii) MMDocIR train set can effectively benefit the training process of multi-modal document retrieval and (iii) text retrievers leveraging on VLM-text perform much better than those using OCR-text. These findings underscore the potential advantages of integrating visual elements for multi-modal document retrieval.
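Page-level retrieval as described above reduces, at its core, to ranking page embeddings by similarity to a query embedding. The tiny vectors and page IDs below are hypothetical, not outputs of any retriever evaluated on MMDocIR.

```python
# Minimal sketch of page-level retrieval: rank document pages by cosine
# similarity between a query embedding and per-page embeddings.
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def rank_pages(query_vec, page_vecs, top_k=2):
    """Return the top_k page IDs most similar to the query."""
    scored = [(cosine(query_vec, v), pid) for pid, v in page_vecs.items()]
    return [pid for _, pid in sorted(scored, reverse=True)[:top_k]]

# Hypothetical 3-dimensional embeddings (real retrievers use hundreds of dims):
query = [1.0, 0.0, 1.0]
pages = {
    "page_3": [0.9, 0.1, 0.8],   # close to the query
    "page_7": [0.0, 1.0, 0.1],
    "page_12": [0.5, 0.5, 0.5],
}
print(rank_pages(query, pages))
```

The benchmark's finding that visual retrievers win would correspond here to the page embeddings coming from a vision-language encoder rather than OCR text.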
[NLP-9] SAIF: A Comprehensive Framework for Evaluating the Risks of Generative AI in the Public Sector AAAI2025
[Quick Read]: This paper addresses the insufficient risk assessment of generative AI in the public sector. Despite its transformative potential across applications such as automated public assistance, welfare services, and immigration processes, the associated risks remain under-evaluated. Building on an established taxonomy of AI risks, the paper extends the scope to generative AI's multimodal capabilities and proposes SAIF, a Systematic dAta generatIon Framework for evaluating the risks of generative AI. The key lies in SAIF's four stages: breaking down risks, designing scenarios, applying jailbreak methods, and exploring prompt types. This ensures systematic and consistent generation of prompt data, enabling comprehensive evaluation and providing a solid foundation for mitigating risks. SAIF is also designed to accommodate emerging jailbreak methods and evolving prompt types, enabling effective responses to unforeseen risk scenarios. The study aims to foster the safe and responsible integration of generative AI into the public sector.
Link: https://arxiv.org/abs/2501.08814
Authors: Kyeongryul Lee, Heehyeon Kim, Joyce Jiyoung Whang
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computers and Society (cs.CY)
Comments: 6 pages, 2 figures, 1 table. AI for Public Missions (AIPM) Workshop at the 39th AAAI Conference on Artificial Intelligence (AAAI 2025)
Abstract:The rapid adoption of generative AI in the public sector, encompassing diverse applications ranging from automated public assistance to welfare services and immigration processes, highlights its transformative potential while underscoring the pressing need for thorough risk assessments. Despite its growing presence, evaluations of risks associated with AI-driven systems in the public sector remain insufficiently explored. Building upon an established taxonomy of AI risks derived from diverse government policies and corporate guidelines, we investigate the critical risks posed by generative AI in the public sector while extending the scope to account for its multimodal capabilities. In addition, we propose a Systematic dAta generatIon Framework for evaluating the risks of generative AI (SAIF). SAIF involves four key stages: breaking down risks, designing scenarios, applying jailbreak methods, and exploring prompt types. It ensures the systematic and consistent generation of prompt data, facilitating a comprehensive evaluation while providing a solid foundation for mitigating the risks. Furthermore, SAIF is designed to accommodate emerging jailbreak methods and evolving prompt types, thereby enabling effective responses to unforeseen risk scenarios. We believe that this study can play a crucial role in fostering the safe and responsible integration of generative AI into the public sector.
[NLP-10] Enhanced Large Language Models for Effective Screening of Depression and Anxiety
[Quick Read]: This paper addresses the timely identification and management of depressive and anxiety disorders, which are widespread while traditional screening remains costly and inefficient. The proposed solution is EmoScan, an LLM-based emotional disorder screening system built on a pipeline that synthesizes clinical interviews into PsyInterview, a set of 1,157 interactive dialogues. EmoScan distinguishes between coarse-grained (e.g., anxiety or depressive disorders) and fine-grained (e.g., major depressive disorder) emotional disorders and conducts high-quality interviews. In evaluations, EmoScan outperformed base models and other LLMs such as GPT-4 at screening emotional disorders (F1-score = 0.7467), delivered superior explanations (BERTScore = 0.9408), and generalized robustly (F1-score of 0.67 on an external dataset). It also outperformed baselines in interviewing skills, as validated by automated ratings and human evaluations. The work highlights the importance of scalable data-generation pipelines for building effective mental health LLM tools.
Link: https://arxiv.org/abs/2501.08769
Authors: June M. Liu, Mengxia Gao, Sahand Sabour, Zhuang Chen, Minlie Huang, Tatia M.C. Lee
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments:
Abstract:Depressive and anxiety disorders are widespread, necessitating timely identification and management. Recent advances in Large Language Models (LLMs) offer potential solutions, yet high costs and ethical concerns about training data remain challenges. This paper introduces a pipeline for synthesizing clinical interviews, resulting in 1,157 interactive dialogues (PsyInterview), and presents EmoScan, an LLM-based emotional disorder screening system. EmoScan distinguishes between coarse (e.g., anxiety or depressive disorders) and fine disorders (e.g., major depressive disorders) and conducts high-quality interviews. Evaluations showed that EmoScan exceeded the performance of base models and other LLMs like GPT-4 in screening emotional disorders (F1-score=0.7467). It also delivers superior explanations (BERTScore=0.9408) and demonstrates robust generalizability (F1-score of 0.67 on an external dataset). Furthermore, EmoScan outperforms baselines in interviewing skills, as validated by automated ratings and human evaluations. This work highlights the importance of scalable data-generative pipelines for developing effective mental health LLM tools.
[NLP-11] Expanding Vietnamese SentiWordNet to Improve Performance of Vietnamese Sentiment Analysis Models
[Quick Read]: This paper addresses sentiment analysis of Vietnamese reviews, aiming to improve classification accuracy by combining pre-trained language models (PLMs) with the SentiWordNet lexical resource. The key is the PhoBERT-V2 model, which optimizes BERT's pre-training method following RoBERTa and is tailored to Vietnamese, combined with SentiWordNet, a lexical resource explicitly designed to support sentiment classification applications. Experiments on the VLSP 2016 and AIVIVN 2019 datasets demonstrate the model's excellent performance on Vietnamese sentiment analysis compared with other models.
Link: https://arxiv.org/abs/2501.08758
Authors: Hong-Viet Tran, Van-Tan Bui, Lam-Quan Tran
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments:
Abstract:Sentiment analysis is one of the most crucial tasks in Natural Language Processing (NLP), involving the training of machine learning models to classify text based on the polarity of opinions. Pre-trained Language Models (PLMs) can be applied to downstream tasks through fine-tuning, eliminating the need to train the model from scratch. Specifically, PLMs have been employed for Sentiment Analysis, a process that involves detecting, analyzing, and extracting the polarity of text sentiments. Numerous models have been proposed to address this task, with pre-trained PhoBERT-V2 models standing out as the state-of-the-art language models for Vietnamese. The PhoBERT-V2 pre-training approach is based on RoBERTa, optimizing the BERT pre-training method for more robust performance. In this paper, we introduce a novel approach that combines PhoBERT-V2 and SentiWordNet for Sentiment Analysis of Vietnamese reviews. Our proposed model utilizes PhoBERT-V2 for Vietnamese, offering a robust optimization for the prominent BERT model in the context of Vietnamese language, and leverages SentiWordNet, a lexical resource explicitly designed to support sentiment classification applications. Experimental results on the VLSP 2016 and AIVIVN 2019 datasets demonstrate that our sentiment analysis system has achieved excellent performance in comparison to other models.
[NLP-12] The Inherent Limits of Pretrained LLMs: The Unexpected Convergence of Instruction Tuning and In-Context Learning Capabilities
[Quick Read]: This paper asks whether instruction-tuned large language models (LLMs) possess fundamentally different capabilities from their base models, and what instruction tuning actually contributes to model performance. Through experiments, the authors find that the performance of instruction-tuned models correlates significantly with the in-context learning performance of their base counterparts. This suggests that instruction tuning does not confer entirely new capabilities but relies on priors the base model acquired from its pretraining data. The key to the study is extensive experimentation, comparing instruction-tuned and base models across model families, scales, and task types to reveal the actual effect of instruction tuning on model capabilities.
Link: https://arxiv.org/abs/2501.08716
Authors: Irina Bigoulaeva, Harish Tayyar Madabushi, Iryna Gurevych
Affiliations: Ubiquitous Knowledge Processing Lab, Technical University of Darmstadt; Department of Computer Science, The University of Bath
Categories: Computation and Language (cs.CL)
Comments: The code for this paper is available at: this https URL
Abstract:Large Language Models (LLMs), trained on extensive web-scale corpora, have demonstrated remarkable abilities across diverse tasks, especially as they are scaled up. Nevertheless, even state-of-the-art models struggle in certain cases, sometimes failing at problems solvable by young children, indicating that traditional notions of task complexity are insufficient for explaining LLM capabilities. However, exploring LLM capabilities is complicated by the fact that most widely-used models are also “instruction-tuned” to respond appropriately to prompts. With the goal of disentangling the factors influencing LLM performance, we investigate whether instruction-tuned models possess fundamentally different capabilities from base models that are prompted using in-context examples. Through extensive experiments across various model families, scales and task types, which included instruction tuning 90 different LLMs, we demonstrate that the performance of instruction-tuned models is significantly correlated with the in-context performance of their base counterparts. By clarifying what instruction-tuning contributes, we extend prior research into in-context learning, which suggests that base models use priors from pretraining data to solve tasks. Specifically, we extend this understanding to instruction-tuned models, suggesting that their pretraining data similarly sets a limiting boundary on the tasks they can solve, with the added influence of the instruction-tuning dataset.
[NLP-13] Deep Learning-Based Feature Fusion for Emotion Analysis and Suicide Risk Differentiation in Chinese Psychological Support Hotlines
[Quick Read]: This paper addresses the recognition and analysis of emotional expression in psychological support hotline calls, particularly for early identification of suicide risk, an area underexplored in current research. The proposed method combines pitch acoustic features with deep learning-based features to analyze and understand emotional expression in hotline interactions. Using data from China's largest psychological support hotline, it achieves an F1-score of 79.13% on negative binary emotion classification, and it outperforms state-of-the-art methods on a public multi-class emotion classification dataset. To explore clinical relevance, the model was applied to analyze the frequency of negative emotions and the rate of emotional change in conversations, comparing 46 subjects with suicidal behavior to those without. Although the suicidal group exhibited more frequent emotional changes than the non-suicidal group, the difference was not statistically significant. The findings suggest that the intensity and frequency of emotional fluctuation could serve as novel features for psychological assessment scales and suicide risk assessment. The method offers valuable insight into emotional dynamics and, integrated with clinical tools and assessments, has the potential to advance early intervention and improve suicide prevention strategies.
Link: https://arxiv.org/abs/2501.08696
Authors: Han Wang, Jianqiang Li, Qing Zhao, Zhonglong Chen, Changwei Song, Jing Tang, Yuning Huang, Wei Zhai, Yongsheng Tong, Guanghui Fu
Affiliations: School of Software Engineering, Beijing University of Technology, Beijing, China; Peking University Huilongguan Clinical Medical School, Beijing, China; WHO Collaborating Center for Research and Training in Suicide Prevention, Beijing, China; Sorbonne Université, Institut du Cerveau – Paris Brain Institute - ICM, CNRS, Inria, Inserm, AP-HP, Hôpital de la Pitié-Salpêtrière, Paris, France
Categories: Computation and Language (cs.CL)
Comments:
Abstract:Mental health is a critical global public health issue, and psychological support hotlines play a pivotal role in providing mental health assistance and identifying suicide risks at an early stage. However, the emotional expressions conveyed during these calls remain underexplored in current research. This study introduces a method that combines pitch acoustic features with deep learning-based features to analyze and understand emotions expressed during hotline interactions. Using data from China’s largest psychological support hotline, our method achieved an F1-score of 79.13% for negative binary emotion classification. Additionally, the proposed approach was validated on an open dataset for multi-class emotion classification, where it demonstrated better performance compared to the state-of-the-art methods. To explore its clinical relevance, we applied the model to analysis the frequency of negative emotions and the rate of emotional change in the conversation, comparing 46 subjects with suicidal behavior to those without. While the suicidal group exhibited more frequent emotional changes than the non-suicidal group, the difference was not statistically significant. Nevertheless, our findings suggest that emotional fluctuation intensity and frequency could serve as novel features for psychological assessment scales and suicide risk detection. The proposed method provides valuable insights into emotional dynamics and has the potential to advance early intervention and improve suicide prevention strategies through integration with clinical tools and assessments. The source code is publicly available at this https URL.
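The two conversation-level features discussed above, the frequency of negative emotions and the rate of emotional change across a call, can be sketched from a per-utterance label sequence. The label sequence below is hypothetical, not hotline data, and the paper's exact feature definitions may differ.

```python
# Sketch of conversation-level emotion-dynamics features computed from a
# sequence of per-utterance emotion labels (hypothetical example data).

def emotion_features(labels):
    """Return (negative-emotion frequency, emotional change rate)."""
    negative_freq = sum(1 for l in labels if l == "negative") / len(labels)
    changes = sum(1 for a, b in zip(labels, labels[1:]) if a != b)
    change_rate = changes / (len(labels) - 1)  # fraction of transitions that switch label
    return negative_freq, change_rate

call = ["neutral", "negative", "negative", "neutral", "negative", "neutral"]
freq, rate = emotion_features(call)
print(freq, rate)  # 0.5 0.8
```

Features like these could then be compared across subject groups, which is roughly the analysis the paper performs between callers with and without suicidal behavior.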
[NLP-14] Knowledge Graph-based Retrieval-Augmented Generation for Schema Matching
[Quick Read]: This paper addresses the inability of traditional similarity-based schema matching methods to resolve semantic ambiguities and conflicts in domain-specific, complex mapping scenarios due to missing commonsense and domain-specific knowledge, while the hallucination problem of large language models (LLMs) also makes LLM-based schema matching challenging. The paper proposes KG-RAG4SM, a Knowledge Graph-based Retrieval-Augmented Generation model for Schema Matching. The key is its novel vector-based, graph-traversal-based, and query-based graph retrievals, together with a hybrid approach and ranking schemes, which identify the most relevant subgraphs from external large knowledge graphs (KGs). Experiments show that KG-RAG4SM generates more accurate results for complex matching cases without any retraining, significantly outperforming LLM-based state-of-the-art (SOTA) methods on the MIMIC dataset and PLM-based SOTA methods on the Synthea dataset. The approach is also more efficient in end-to-end schema matching, scales to retrieval from large KGs, and effectively mitigates the hallucination problem of LLMs for schema matching.
Link: https://arxiv.org/abs/2501.08686
Authors: Chuangtao Ma, Sriom Chakrabarti, Arijit Khan, Bálint Molnár
Affiliations: Department of Computer Science, Aalborg University, Aalborg, Denmark; Department of Information Systems, Faculty of Informatics, Eötvös Loránd University, Budapest, Hungary
Categories: Databases (cs.DB); Computation and Language (cs.CL); Information Retrieval (cs.IR)
Comments: Under Review
Abstract:Traditional similarity-based schema matching methods are incapable of resolving semantic ambiguities and conflicts in domain-specific complex mapping scenarios due to missing commonsense and domain-specific knowledge. The hallucination problem of large language models (LLMs) also makes it challenging for LLM-based schema matching to address the above issues. Therefore, we propose a Knowledge Graph-based Retrieval-Augmented Generation model for Schema Matching, referred to as the KG-RAG4SM. In particular, KG-RAG4SM introduces novel vector-based, graph traversal-based, and query-based graph retrievals, as well as a hybrid approach and ranking schemes that identify the most relevant subgraphs from external large knowledge graphs (KGs). We showcase that KG-based retrieval-augmented LLMs are capable of generating more accurate results for complex matching cases without any re-training. Our experimental results show that KG-RAG4SM outperforms the LLM-based state-of-the-art (SOTA) methods (e.g., Jellyfish-8B) by 35.89% and 30.50% in terms of precision and F1 score on the MIMIC dataset, respectively; KG-RAG4SM with GPT-4o-mini outperforms the pre-trained language model (PLM)-based SOTA methods (e.g., SMAT) by 69.20% and 21.97% in terms of precision and F1 score on the Synthea dataset, respectively. The results also demonstrate that our approach is more efficient in end-to-end schema matching, and scales to retrieve from large KGs. Our case studies on the dataset from the real-world schema matching scenario exhibit that the hallucination problem of LLMs for schema matching is well mitigated by our solution.
[NLP-15] MAGNET: Augmenting Generative Decoders with Representation Learning and Infilling Capabilities
【Quick Read】: This paper addresses the limitations of unidirectional generative large language models (LLMs) on bidirectional modeling tasks. Unidirectional and bidirectional models are conventionally trained separately, focusing on generation and representation learning respectively, a separation that overlooks the potential complementarity of the two objectives. The proposed solution, MAGNET, introduces three self-supervised training objectives and an attention mechanism combining bidirectional and causal attention, enabling unified training across all objectives. This key innovation allows MAGNET to enhance LLMs' ability to generate robust representations and infill missing text spans while preserving their existing knowledge and text-generation capabilities. Experiments show that MAGNET surpasses strong text encoders on representation-learning tasks, generates contextually appropriate text infills, and retains open-ended text generation without repetition problems.
Link: https://arxiv.org/abs/2501.08648
Authors: Savya Khosla, Kushal Kafle, Simon Jenni, Handong Zhao, John Collomosse, Jing Shi
Affiliations: Adobe Research; University of Illinois Urbana-Champaign
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:
Abstract:While originally designed for unidirectional generative modeling, decoder-only large language models (LLMs) are increasingly being adapted for bidirectional modeling. However, unidirectional and bidirectional models are typically trained separately with distinct objectives (generation and representation learning, respectively). This separation overlooks the opportunity for developing a more versatile language model and for these objectives to complement each other. In this work, we introduce MAGNET, an adaptation of decoder-only LLMs that enhances their ability to generate robust representations and infill missing text spans, while preserving their knowledge and text generation capabilities. MAGNET employs three self-supervised training objectives and introduces an attention mechanism that combines bidirectional and causal attention, enabling unified training across all objectives. Our results demonstrate that LLMs adapted with MAGNET (1) surpass strong text encoders on token-level and sentence-level representation learning tasks, (2) generate contextually appropriate text infills by leveraging future context, (3) retain the ability for open-ended text generation without exhibiting repetition problem, and (4) preserve the knowledge gained by the LLM during pretraining.
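MAGNET's central mechanism is an attention mask that combines causal attention with bidirectional attention. A minimal sketch of such a hybrid mask, assuming a simple formulation where declared spans get full bidirectional visibility on top of a causal default (the paper's exact masking scheme may differ):

```python
def hybrid_attention_mask(seq_len, bidir_spans):
    # 1 = position j visible to position i. Default: causal (lower-triangular).
    mask = [[1 if j <= i else 0 for j in range(seq_len)] for i in range(seq_len)]
    # Within each declared span, allow full bidirectional visibility.
    for start, end in bidir_spans:
        for i in range(start, end):
            for j in range(start, end):
                mask[i][j] = 1
    return mask

m = hybrid_attention_mask(5, bidir_spans=[(1, 4)])
for row in m:
    print(row)
```

Positions inside the span can attend to future tokens within it (useful for representation learning and infilling), while positions outside remain strictly causal for generation.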
[NLP-16] Reassessing the Role of Chain-of-Thought in Sentiment Analysis: Insights and Limitations
【Quick Read】: This paper asks whether a language model's grasp of semantic meaning depends on thought processes; specifically, the authors investigate whether reasoning techniques such as chain-of-thought prompting facilitate semantic understanding. The key idea is to conceptualize thought as reasoning and to experimentally measure the effect of chain-of-thought prompting on sentiment analysis tasks. The experiments show that chain-of-thought prompting has minimal impact on sentiment analysis: when handling sentiment tasks, the model relies mainly on information from the demonstrations rather than on the reasoning process. This result supports the view that language and thought are independent.
Link: https://arxiv.org/abs/2501.08641
Authors: Kaiyuan Zheng, Qinghua Zhao, Lei Li
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:
Abstract:The relationship between language and thought remains an unresolved philosophical issue. Existing viewpoints can be broadly categorized into two schools: one asserting their independence, and another arguing that language constrains thought. In the context of large language models, this debate raises a crucial question: Does a language model’s grasp of semantic meaning depend on thought processes? To explore this issue, we investigate whether reasoning techniques can facilitate semantic understanding. Specifically, we conceptualize thought as reasoning, employ chain-of-thought prompting as a reasoning technique, and examine its impact on sentiment analysis tasks. The experiments show that chain-of-thought has a minimal impact on sentiment analysis tasks. Both the standard and chain-of-thought prompts focus on aspect terms rather than sentiment in the generated content. Furthermore, counterfactual experiments reveal that the model’s handling of sentiment tasks primarily depends on information from demonstrations. The experimental results support the first viewpoint.
[NLP-17] SWSC: Shared Weight for Similar Channel in LLM
【Quick Read】: This paper tackles the storage and computation burden caused by the enormous parameter counts of large language models (LLMs). The authors propose SWSC (Shared Weight for Similar Channel), a model compression method. Its key idea is to cluster model weights channel-by-channel with the K-Means algorithm, producing clusters of highly similar vectors, and to select one representative vector per cluster to approximately replace all vectors in that cluster, substantially reducing the number of weight parameters. To counter the performance loss introduced by this approximate restoration, singular value decomposition (SVD) is applied to the weight-error values before and after compression, and the larger singular values with their corresponding singular vectors are retained to compensate for the loss of accuracy. Experiments show the method effectively preserves the performance of the compressed LLM even under low-precision conditions.
Link: https://arxiv.org/abs/2501.08631
Authors: Binrui Zeng, Yongtao Tang, Xiaodong Liu, Xiaopeng Li
Affiliations: College of Computer Science and Technology, National University of Defense Technology
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments: 5 pages, 3 figures, work in progress
Abstract:Large language models (LLMs) have spurred development in multiple industries. However, the growing number of their parameters brings substantial storage and computing burdens, making it essential to explore model compression techniques for parameter reduction and easier deployment. We propose SWSC, an LLM compression method based on the concept of Shared Weight for Similar Channel. It uses the K-Means clustering algorithm to cluster model weights channel-by-channel, generating clusters with highly similar vectors within each. A representative vector from each cluster is selected to approximately replace all vectors in the cluster, significantly reducing the number of model weight parameters. However, approximate restoration will inevitably cause damage to the performance of the model. To tackle this issue, we perform singular value decomposition on the weight error values before and after compression and retain the larger singular values and their corresponding singular vectors to compensate for the accuracy. The experimental results show that our method can effectively ensure the performance of the compressed LLM even under low-precision conditions.
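The clustering-and-sharing stage of SWSC can be sketched in a few lines. This is an illustrative toy (a hand-rolled K-Means over four hypothetical weight channels); the SVD-based error-compensation stage described in the abstract is deliberately omitted here, though the per-channel error it would operate on is computed:

```python
import random

def kmeans(vectors, k, iters=20, seed=0):
    rng = random.Random(seed)
    centroids = rng.sample(vectors, k)
    for _ in range(iters):
        # Assign each channel vector to its nearest centroid.
        groups = [[] for _ in range(k)]
        assign = []
        for v in vectors:
            c = min(range(k),
                    key=lambda i: sum((a - b) ** 2 for a, b in zip(v, centroids[i])))
            groups[c].append(v)
            assign.append(c)
        # Recompute centroids as the mean of their members.
        for i, g in enumerate(groups):
            if g:
                centroids[i] = [sum(col) / len(g) for col in zip(*g)]
    return centroids, assign

# Each row is one weight channel; similar channels share one representative vector.
channels = [[1.0, 2.0], [1.1, 2.1], [8.0, 9.0], [7.9, 9.2]]
centroids, assign = kmeans(channels, k=2)
compressed = [centroids[c] for c in assign]  # shared-weight approximation
# The per-channel error (original - shared) is what SWSC compensates via SVD.
errors = [[o - s for o, s in zip(orig, rep)]
          for orig, rep in zip(channels, compressed)]
print(assign, [round(e, 2) for e in errors[0]])
```

Only the cluster assignments and `k` representative vectors need to be stored, which is the source of the parameter reduction.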
[NLP-18] ViBidirectionMT-Eval: Machine Translation for Vietnamese-Chinese and Vietnamese-Lao language pair
【Quick Read】: This paper addresses machine translation between Vietnamese-Chinese and Vietnamese-Lao, covering four translation directions. The key to the solution is building effective machine translation systems and evaluating them comprehensively with both automatic metrics (BLEU and SacreBLEU) and human judgments from language experts. The automatic metrics provide objective quantitative results, while human assessment ensures the accuracy and practical quality of the translations, particularly on news- and general-domain test data. This dual evaluation mechanism ensures the reliability and effectiveness of the machine translation models in practical applications.
Link: https://arxiv.org/abs/2501.08621
Authors: Hong-Viet Tran, Minh-Quy Nguyen, Van-Vinh Nguyen
Affiliations: University of Engineering and Technology, Vietnam National University, Hanoi, Vietnam
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:
Abstract:This paper presents the results of the VLSP 2022-2023 Machine Translation Shared Tasks, focusing on Vietnamese-Chinese and Vietnamese-Lao machine translation. The tasks were organized as part of the 9th and 10th annual workshops on Vietnamese Language and Speech Processing (VLSP 2022, VLSP 2023). The objective of the shared task was to build machine translation systems, specifically targeting Vietnamese-Chinese and Vietnamese-Lao translation (corresponding to 4 translation directions). The submissions were evaluated on 1,000 pairs for testing (news and general domains) using established metrics like BLEU [11] and SacreBLEU [12]. Additionally, system outputs were also evaluated with human judgment provided by experts in the Chinese and Lao languages. These human assessments played a crucial role in ranking the performance of the machine translation models, ensuring a more comprehensive evaluation.
[NLP-19] Disjoint Processing Mechanisms of Hierarchical and Linear Grammars in Large Language Models
【Quick Read】: This paper studies the neural encoding of hierarchical structure in natural language processing, asking in particular whether large language models (LLMs), exposed only to large-scale language distributions, develop functionally distinct components sensitive to hierarchical structure. Inputs are generated in English, Italian, Japanese, or nonce words, conforming to either hierarchical or linear/positional rules. The key elements of the study are: 1) observing that language models behave differently on hierarchically versus linearly structured inputs; 2) causally verifying, via ablation experiments, that the components processing hierarchical grammars are functionally distinct from those processing linear grammars; and 3) finding that hierarchy-sensitive components are equally active on nonce grammars, indicating that this sensitivity depends neither on meaning nor on in-distribution inputs. These findings reveal the neural mechanisms of hierarchical processing and their independence within language models.
Link: https://arxiv.org/abs/2501.08618
Authors: Aruna Sankaranarayanan, Dylan Hadfield-Menell, Aaron Mueller
Affiliations: CSAIL, MIT; Northeastern University
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:
Abstract:All natural languages are structured hierarchically. In humans, this structural restriction is neurologically coded: when two grammars are presented with identical vocabularies, brain areas responsible for language processing are only sensitive to hierarchical grammars. Using large language models (LLMs), we investigate whether such functionally distinct hierarchical processing regions can arise solely from exposure to large-scale language distributions. We generate inputs using English, Italian, Japanese, or nonce words, varying the underlying grammars to conform to either hierarchical or linear/positional rules. Using these grammars, we first observe that language models show distinct behaviors on hierarchical versus linearly structured inputs. Then, we find that the components responsible for processing hierarchical grammars are distinct from those that process linear grammars; we causally verify this in ablation experiments. Finally, we observe that hierarchy-selective components are also active on nonce grammars; this suggests that hierarchy sensitivity is not tied to meaning, nor in-distribution inputs.
[NLP-20] RLHS: Mitigating Misalignment in RLHF with Hindsight Simulation
【Quick Read】: This paper addresses the alignment problem of generative AI systems trained with reinforcement learning from human feedback (RLHF). Existing RLHF pipelines rely predominantly on immediate feedback, which can fail to reflect the downstream impact of an interaction on users' utility, leading to behavior misaligned with human values and even undesirable behaviors such as sycophancy and deception. The paper's key innovation is to decouple evaluation from prediction by refocusing on hindsight feedback: the proposed Reinforcement Learning from Hindsight Simulation (RLHS) first simulates plausible consequences and then elicits feedback to assess which behaviors were genuinely beneficial in hindsight. Applying RLHS to two widely used preference-optimization methods, Proximal Policy Optimization (PPO) and Direct Preference Optimization (DPO), the experiments show that misalignment is significantly reduced, and a user study finds higher user satisfaction and goal-completion rates. The solution underscores the importance of attending to long-term consequences, even simulated ones, for mitigating misalignment in RLHF.
Link: https://arxiv.org/abs/2501.08617
Authors: Kaiqu Liang, Haimin Hu, Ryan Liu, Thomas L. Griffiths, Jaime Fernández Fisac
Affiliations: Department of Computer Science, Princeton University; Department of Electrical and Computer Engineering, Princeton University; Department of Psychology, Princeton University
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:
Abstract:Generative AI systems like foundation models (FMs) must align well with human values to ensure their behavior is helpful and trustworthy. While Reinforcement Learning from Human Feedback (RLHF) has shown promise for optimizing model performance using human judgments, existing RLHF pipelines predominantly rely on immediate feedback, which can fail to accurately reflect the downstream impact of an interaction on users’ utility. We demonstrate that feedback based on evaluators’ foresight estimates of downstream consequences systematically induces Goodhart’s Law dynamics, incentivizing misaligned behaviors like sycophancy and deception and ultimately degrading user outcomes. To alleviate this, we propose decoupling evaluation from prediction by refocusing RLHF on hindsight feedback. Our theoretical analysis reveals that conditioning evaluator feedback on downstream observations mitigates misalignment and improves expected human utility, even when these observations are simulated by the AI system itself. To leverage this insight in a practical alignment algorithm, we introduce Reinforcement Learning from Hindsight Simulation (RLHS), which first simulates plausible consequences and then elicits feedback to assess what behaviors were genuinely beneficial in hindsight. We apply RLHS to two widely-employed online and offline preference optimization methods – Proximal Policy Optimization (PPO) and Direct Preference Optimization (DPO) – and show empirically that misalignment is significantly reduced with both methods. Through an online human user study, we show that RLHS consistently outperforms RLHF in helping users achieve their goals and earns higher satisfaction ratings, despite being trained solely with simulated hindsight feedback. These results underscore the importance of focusing on long-term consequences, even simulated ones, to mitigate misalignment in RLHF.
[NLP-21] Assessing the Alignment of FOL Closeness Metrics with Human Judgement
【Quick Read】: This paper addresses the problem of verifying the correctness of first-order logic (FOL) statements generated when tool-augmented large language models (LLMs) perform logical reasoning. Because reliable evaluation metrics for comparing generated and ground-truth FOL are lacking, existing evaluation methods often fail to verify FOL correctness effectively. The key to the solution is applying carefully designed perturbations to ground-truth FOL to assess the sensitivity of existing metrics, and measuring the ranking alignment between automatic metrics and human annotators. The study finds that the BLEU metric is oversensitive to text perturbations, Smatch++ to structural perturbations, and the FOL metric to operator perturbations; BertScore aligns more closely with human judgement, and combining multiple metrics significantly improves both alignment and sensitivity.
Link: https://arxiv.org/abs/2501.08613
Authors: Ramya Keerthy Thatikonda, Wray Buntine, Ehsan Shareghi
Affiliations: Department of Data Science & AI, Monash University; College of Engineering and Computer Science, VinUniversity
Subjects: Computation and Language (cs.CL)
Comments: Code: this https URL
Abstract:The recent successful paradigm of solving logical reasoning problems with tool-augmented large language models (LLMs) leverages translation of natural language statements into First-Order Logic~(FOL) and external theorem provers. However, the correctness of FOL statements, comprising operators and text predicates, often goes unverified due to the lack of a reliable evaluation metric for comparing generated and ground-truth FOLs. In this paper, we present a comprehensive study of sensitivity of existing metrics and their alignment with human judgement on FOL evaluation. Using ground-truth FOLs, we carefully designed various perturbations on the ground-truth to assess metric sensitivity. We sample FOL translation candidates for natural language statements and measure the ranking alignment between automatic metrics and human annotators. Our empirical findings highlight oversensitivity in the n-gram metric BLEU for text perturbations, the semantic graph metric Smatch++ for structural perturbations, and FOL metric for operator perturbation. We also observe a closer alignment between BertScore and human judgement. Additionally, we show that combining metrics enhances both alignment and sensitivity compared to using individual metrics.
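The paper's ranking-alignment analysis rests on rank correlation between metric scores and human scores. A minimal sketch of that idea, with hypothetical scores for four FOL candidates and a naive metric combination by averaging (the paper's actual metrics and combination scheme differ):

```python
def ranks(xs):
    # Rank positions (1 = smallest); ties not handled for simplicity.
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    r = [0] * len(xs)
    for rank, i in enumerate(order, start=1):
        r[i] = rank
    return r

def spearman(a, b):
    # Spearman correlation via the rank-difference formula (no ties assumed).
    ra, rb = ranks(a), ranks(b)
    n = len(a)
    d2 = sum((x - y) ** 2 for x, y in zip(ra, rb))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))

human = [0.9, 0.4, 0.7, 0.1]           # annotator scores for 4 FOL candidates
metric_a = [0.8, 0.7, 0.6, 0.2]        # hypothetical automatic metric A
metric_b = [0.85, 0.35, 0.65, 0.15]    # hypothetical automatic metric B
combined = [(x + y) / 2 for x, y in zip(metric_a, metric_b)]
print(spearman(human, metric_a), spearman(human, combined))
```

In this toy example the combined metric ranks the candidates exactly as the human does, mirroring the paper's finding that combining metrics improves alignment.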
[NLP-22] Dynamic Knowledge Integration for Enhanced Vision-Language Reasoning
【Quick Read】: This paper addresses the limited performance of large vision-language models (LVLMs) on knowledge-intensive tasks such as visual question answering and reasoning, which stems from the lack of external knowledge integration. The proposed method, Adaptive Knowledge-Guided Pretraining for Large Vision-Language Models (AKGP-LVLM), dynamically incorporates structured and unstructured knowledge into LVLMs during pretraining and fine-tuning. Its key components are a knowledge encoder that represents external knowledge, a retrieval mechanism that selects task-relevant information, and a dynamic adaptor that effectively aligns multimodal and knowledge representations. Experiments on four benchmark datasets show significant improvements over state-of-the-art models, and human evaluations confirm the superior correctness and relevance of the model's outputs.
Link: https://arxiv.org/abs/2501.08597
Authors: Julian Perry, Surasakdi Siripong, Thanakorn Phonchai
Affiliations: Unknown
Subjects: Computation and Language (cs.CL)
Comments:
Abstract:Large Vision-Language Models (LVLMs) have demonstrated impressive capabilities in multimodal tasks, but their performance is often constrained by the lack of external knowledge integration, limiting their ability to handle knowledge-intensive tasks such as visual question answering and reasoning. To address this challenge, we propose a novel method, Adaptive Knowledge-Guided Pretraining for Large Vision-Language Models (AKGP-LVLM), which dynamically incorporates structured and unstructured knowledge into LVLMs during pretraining and fine-tuning. Our approach employs a knowledge encoder to represent external knowledge, a retrieval mechanism to select task-relevant information, and a dynamic adaptor to align multimodal and knowledge representations effectively. We evaluate our method on four benchmark datasets, demonstrating significant performance improvements over state-of-the-art models. Furthermore, human evaluations highlight the superior correctness and relevance of our model’s outputs. Extensive analyses confirm the robustness, efficiency, and scalability of AKGP-LVLM, making it a compelling solution for real-world knowledge-intensive tasks.
[NLP-23] LoRS: Efficient Low-Rank Adaptation for Sparse Large Language Model
【Quick Read】: This paper addresses the problem that existing low-rank adaptation (LoRA) methods, when applied to sparse large language models (LLMs), fail to maintain sparsity and incur extra memory and computation overhead. The proposed method, LoRS, incorporates weight-recompute and computational-graph-rearrangement strategies when fine-tuning sparse LLMs, markedly reducing memory and computation consumption. An improved adapter initialization further enhances LoRS's effectiveness. These innovations allow LoRS to preserve sparsity while outperforming existing LoRA approaches.
Link: https://arxiv.org/abs/2501.08582
Authors: Yuxuan Hu, Jing Zhang, Xiaodong Chen, Zhe Zhao, Cuiping Li, Hong Chen
Affiliations: School of Information, Renmin University of China, Beijing, China; Key Laboratory of Data Engineering and Knowledge Engineering, Beijing, China; Engineering Research Center of Database and Business Intelligence, Beijing, China; Tencent AI Lab, Beijing, China
Subjects: Computation and Language (cs.CL)
Comments: 12 pages, 4 figures
Abstract:Existing low-rank adaptation (LoRA) methods face challenges on sparse large language models (LLMs) due to the inability to maintain sparsity. Recent works introduced methods that maintain sparsity by augmenting LoRA techniques with additional masking mechanisms. Despite these successes, such approaches suffer from an increased memory and computation overhead, which affects efficiency of LoRA methods. In response to this limitation, we introduce LoRS, an innovative method designed to achieve both memory and computation efficiency when fine-tuning sparse LLMs. To mitigate the substantial memory and computation demands associated with preserving sparsity, our approach incorporates strategies of weight recompute and computational graph rearrangement. In addition, we also improve the effectiveness of LoRS through better adapter initialization. These innovations lead to a notable reduction in memory and computation consumption during the fine-tuning phase, all while achieving performance levels that outperform existing LoRA approaches.
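The sparsity problem LoRS targets can be made concrete: a LoRA update W + B·A is dense, so naively adding it to a pruned weight destroys the sparsity pattern. A minimal sketch of the "recompute the effective weight and re-apply the mask" idea, using tiny hand-written matrices (the paper's actual recompute and graph-rearrangement strategies are more involved):

```python
def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def effective_weight(W, A, B, M):
    # Recompute (W + B·A) on the fly and re-apply the sparsity mask M,
    # so the dense low-rank update never destroys the pruned pattern.
    BA = matmul(B, A)
    return [[(w + d) * m for w, d, m in zip(wr, dr, mr)]
            for wr, dr, mr in zip(W, BA, M)]

W = [[0.5, 0.0], [0.0, -0.3]]          # pruned base weight (zeros = pruned)
M = [[1, 0], [0, 1]]                   # its sparsity mask
B = [[0.1], [0.2]]                     # low-rank factors, rank r = 1
A = [[0.4, 0.3]]
print(effective_weight(W, A, B, M))
```

Masking after the low-rank addition keeps the pruned entries at exactly zero, which is the invariant a sparse-LLM adapter must preserve.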
[NLP-24] What Limits LLM-based Human Simulation: LLMs or Our Design?
【Quick Read】: This paper addresses the dual challenges of LLM-based human simulation: the inherent limitations of large language models themselves, and design challenges in the simulation frameworks. Research shows a significant gap between current LLM-based human simulations and real-world observations, highlighting both challenges. The paper presents a comprehensive analysis of LLM limitations together with framework design issues and proposes targeted solutions for both aspects. The key is to tackle both challenges simultaneously, particularly through improvements in data collection, LLM generation, and evaluation. The paper also discusses future research directions and provides a curated collection of resources to support further work in this field.
Link: https://arxiv.org/abs/2501.08579
Authors: Qian Wang, Jiaying Wu, Zhenheng Tang, Bingqiao Luo, Nuo Chen, Wei Chen, Bingsheng He
Affiliations: Unknown
Subjects: Computation and Language (cs.CL)
Comments:
Abstract:We argue that advancing LLM-based human simulation requires addressing both LLM's inherent limitations and simulation framework design challenges. Recent studies have revealed significant gaps between LLM-based human simulations and real-world observations, highlighting these dual challenges. To address these gaps, we present a comprehensive analysis of LLM limitations and our design issues, proposing targeted solutions for both aspects. Furthermore, we explore future directions that address both challenges simultaneously, particularly in data collection, LLM generation, and evaluation. To support further research in this field, we provide a curated collection of LLM-based human simulation resources at this https URL.
[NLP-25] Information Entropy Invariance: Enhancing Length Extrapolation in Attention Mechanisms
【Quick Read】: This paper addresses the challenge of length extrapolation in large language models (LLMs): maintaining performance on contexts longer than those seen in training. Existing approaches typically modify the scaled dot-product attention mechanism but often lack rigorous theoretical justification. The paper proposes a new approach based on information entropy invariance, with two key components: first, a training-free method, InfoScale, designed for dot-product attention, which preserves focus on the original tokens during length extrapolation by keeping information entropy consistent; second, a theoretical analysis of the effect of scaling (CosScale) on cosine attention. Experiments show that combining InfoScale and CosScale achieves state-of-the-art performance on the GAU-α model with the context window extended to 64 times the training length, outperforming seven existing methods. The analysis further shows that greatly increasing CosScale approximates windowed attention and identifies attention-score dilution as a key challenge in long-range context handling.
Link: https://arxiv.org/abs/2501.08570
Authors: Kewei Li, Yanwen Kong, Yiping Xu, Lan Huang, Ruochi Zhang, Fengfeng Zhou
Affiliations: Unknown
Subjects: Computation and Language (cs.CL)
Comments:
Abstract:Improving the length extrapolation capabilities of Large Language Models (LLMs) remains a critical challenge in natural language processing. Many recent efforts have focused on modifying the scaled dot-product attention mechanism, and often introduce scaled temperatures without rigorous theoretical justification. To fill this gap, we introduce a novel approach based on information entropy invariance. We propose two new scaled temperatures to enhance length extrapolation. First, a training-free method InfoScale is designed for dot-product attention, and preserves focus on original tokens during length extrapolation by ensuring information entropy remains consistent. Second, we theoretically analyze the impact of scaling (CosScale) on cosine attention. Experimental data demonstrates that combining InfoScale and CosScale achieves state-of-the-art performance on the GAU-α model with a context window extended to 64 times the training length, and outperforms seven existing methods. Our analysis reveals that significantly increasing CosScale approximates windowed attention, and highlights the significance of attention score dilution as a key challenge in long-range context handling. The code and data are available at this https URL.
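The entropy-dilution problem the abstract refers to can be demonstrated numerically: as the number of attended positions n grows, softmax attention over similar logits flattens and its entropy rises. A common entropy-motivated remedy, shown below as an illustrative sketch only (this is the general log(n)/log(n_train) scaling idea, not necessarily the paper's exact InfoScale formula), is to amplify the logits so that attention stays sharp beyond the training length:

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def entropy(ps):
    return -sum(p * math.log(p) for p in ps if p > 0)

def scaled_logits(logits, n, n_train):
    # Entropy-motivated rescaling: amplify logits by log(n)/log(n_train) so
    # attention does not flatten as context length n grows past n_train.
    kappa = math.log(n) / math.log(n_train)
    return [x * kappa for x in logits]

n_train, n = 512, 4096
logits = [0.1 * (i % 7) for i in range(n)]   # toy attention scores
h_plain = entropy(softmax(logits))
h_scaled = entropy(softmax(scaled_logits(logits, n, n_train)))
print(round(h_plain, 3), round(h_scaled, 3))
```

The scaled version has lower entropy than the unscaled one at the longer context, illustrating how a temperature tied to context length counteracts attention-score dilution.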
[NLP-26] Knowledge prompt chaining for semantic modeling
【Quick Read】: This paper addresses the challenging problem of building semantics for structured data such as CSV, JSON, and XML files. Although the internet holds vast amounts of structured data, mapping it to domain ontologies to build semantics remains difficult, because it requires building models that understand and learn graph-structured knowledge; otherwise the task demands substantial human effort and cost. The proposed solution, Knowledge Prompt Chaining, is a novel automatic semantic modeling framework. Its key idea is to serialize graph-structured knowledge and inject it into large language models (LLMs) in a prompt-chaining architecture, so that the model learns the graph's structural information and latent space and naturally generates semantic labels and semantic graphs following the chain's instructions. Experimental results show the method outperforms existing leading techniques despite using less structured input data.
Link: https://arxiv.org/abs/2501.08540
Authors: Ning Pei Ding, Jingge Du, Zaiwen Feng
Affiliations: College of Informatics, Huazhong Agricultural University
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Databases (cs.DB)
Comments:
Abstract:The task of building semantics for structured data such as CSV, JSON, and XML files is highly relevant in the knowledge representation field. Even though vast amounts of structured data exist on the internet, mapping them to domain ontologies to build semantics for them is still very challenging, as it requires building models that understand and learn graph-structured knowledge. Otherwise, the task demands substantial human effort and cost. In this paper, we propose a novel automatic semantic modeling framework: Knowledge Prompt Chaining. It serializes graph-structured knowledge and injects it into the LLMs properly in a Prompt Chaining architecture. Through this knowledge injection and prompt chaining, the model in our framework can learn the structure information and latent space of the graph and naturally generate the semantic labels and semantic graphs following the chain's instructions. Based on experimental results, our method achieves better performance than existing leading techniques, despite using reduced structured input data.
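The serialization-and-chaining idea can be sketched concretely. The triple notation, prompt wording, and three-step chain below are all hypothetical illustrations of the general pattern (serialize the graph, then condition a sequence of prompts on it), not the paper's actual templates:

```python
def serialize_graph(triples):
    # Linearize graph-structured knowledge into text an LLM can consume.
    return "\n".join(f"({s}) --[{p}]--> ({o})" for s, p, o in triples)

def build_prompt_chain(triples, column_name):
    # Each step's instruction is conditioned on the serialized knowledge;
    # a real system would feed each prompt to an LLM and thread the answers.
    knowledge = serialize_graph(triples)
    return [
        f"Given the knowledge graph:\n{knowledge}\n"
        f"List candidate classes for column '{column_name}'.",
        f"Given the knowledge graph:\n{knowledge}\n"
        f"Pick the best semantic label for '{column_name}' from the candidates.",
        f"Connect the chosen labels into a semantic graph for the table "
        f"containing '{column_name}'.",
    ]

chain = build_prompt_chain(
    [("Person", "hasName", "Name"), ("Person", "bornOn", "Date")], "birth_date")
print(len(chain))
print(chain[0].splitlines()[1])
```

Threading the model's answer from one prompt into the next is what turns the serialized graph into a chain rather than a single flat prompt.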
[NLP-27] Complexity Control Facilitates Reasoning-Based Compositional Generalization in Transformers
【Quick Read】: This paper investigates the internal mechanisms of Transformer models on compositional tasks, in particular whether models solve such tasks by learning primitive-level rules (reasoning-based solutions) or by relying on memorized mappings (memory-based solutions). The study finds that complexity control strategies significantly influence which kind of solution a model learns. By applying masking strategies to the model's information circuits and employing multiple complexity metrics, the paper reveals distinct internal working mechanisms associated with the two solution types. Reasoning-based solutions exhibit a lower complexity bias, consistent with the well-studied neuron condensation phenomenon; this low complexity bias is hypothesized to be the key factor enabling models to learn reasoning rules. The conclusions are validated on several real-world datasets, including image generation and natural language processing tasks, demonstrating their broad applicability.
Link: https://arxiv.org/abs/2501.08537
Authors: Zhongwang Zhang, Pengxiao Lin, Zhiwei Wang, Yaoyu Zhang, Zhi-Qin John Xu
Affiliations: Institute of Natural Sciences, School of Mathematical Sciences, MOE-LSC, Shanghai Jiao Tong University, Shanghai 200240, China; School of Artificial Intelligence, Shanghai Jiao Tong University, Shanghai 200240, China; Center for LLM, Institute for Advanced Algorithms Research, Shanghai Seres Information Technology Co., Ltd, Shanghai 200040, China
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: Mistakenly submitted as a replacement to 2405.05409v4
Abstract:Transformers have demonstrated impressive capabilities across various tasks, yet their performance on compositional problems remains a subject of debate. In this study, we investigate the internal mechanisms underlying Transformers’ behavior in compositional tasks. We find that complexity control strategies significantly influence whether the model learns primitive-level rules that generalize out-of-distribution (reasoning-based solutions) or relies solely on memorized mappings (memory-based solutions). By applying masking strategies to the model’s information circuits and employing multiple complexity metrics, we reveal distinct internal working mechanisms associated with different solution types. Further analysis reveals that reasoning-based solutions exhibit a lower complexity bias, which aligns with the well-studied neuron condensation phenomenon. This lower complexity bias is hypothesized to be the key factor enabling these solutions to learn reasoning rules. We validate these conclusions across multiple real-world datasets, including image generation and natural language processing tasks, confirming the broad applicability of our findings.
[NLP-28] Doc-Guided Sent2Sent: A Sent2Sent Agent with Doc-Guided memory for Document-level Machine Translation
【Quick Read】: This paper addresses quality, consistency, and fluency issues in document-level machine translation (DocMT). Existing approaches such as Doc2Doc and Doc2Sent either omit sentences or compromise fluency during translation. The proposed Doc-Guided Sent2Sent++ employs an incremental sentence-level forced decoding strategy that ensures every sentence is translated while improving the fluency of adjacent sentences. It further introduces a Doc-Guided Memory mechanism that attends only to the document summary and its translation, which proves an efficient way to maintain translation consistency. Extensive tests across multiple languages and domains show that Sent2Sent++ outperforms other methods in quality, consistency, and fluency, with significant gains on metrics such as s-COMET, d-COMET, LTCR-1_f, and document-level perplexity (d-ppl).
Link: https://arxiv.org/abs/2501.08523
Authors: Jiaxin Guo, Yuanchang Luo, Daimeng Wei, Ling Zhang, Zongyao Li, Hengchao Shang, Zhiqiang Rao, Shaojun Li, Jinlong Yang, Zhanglin Wu, Hao Yang
Affiliations: Huawei Translation Services Center, Beijing, China
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:
Abstract:The field of artificial intelligence has witnessed significant advancements in natural language processing, largely attributed to the capabilities of Large Language Models (LLMs). These models form the backbone of Agents designed to address long-context dependencies, particularly in Document-level Machine Translation (DocMT). DocMT presents unique challenges, with quality, consistency, and fluency being the key metrics for evaluation. Existing approaches, such as Doc2Doc and Doc2Sent, either omit sentences or compromise fluency. This paper introduces Doc-Guided Sent2Sent++, an Agent that employs an incremental sentence-level forced decoding strategy to ensure every sentence is translated while enhancing the fluency of adjacent sentences. Our Agent leverages a Doc-Guided Memory, focusing solely on the summary and its translation, which we find to be an efficient approach to maintaining consistency. Through extensive testing across multiple languages and domains, we demonstrate that Sent2Sent++ outperforms other methods in terms of quality, consistency, and fluency. The results indicate that our approach has achieved significant improvements in metrics such as s-COMET, d-COMET, LTCR-1_f, and document-level perplexity (d-ppl). The contributions of this paper include a detailed analysis of current DocMT research, the introduction of the Sent2Sent++ decoding method, the Doc-Guided Memory mechanism, and validation of its effectiveness across languages and domains.
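The control flow of a sentence-by-sentence agent with a summary-only memory can be sketched as below. The translators are stand-in stubs (uppercasing instead of real MT), and the memory structure is an assumption for illustration; the point is that every sentence passes through the loop exactly once while being conditioned on the same compact document memory:

```python
def translate_sentence(sentence, memory, translate_fn):
    # The memory carries only the document summary and its translation,
    # which conditions each sentence-level translation for consistency.
    return translate_fn(sentence, memory)

def sent2sent_translate(doc_sentences, summary, summary_mt, translate_fn):
    memory = {"summary": summary, "summary_translation": summary_mt(summary)}
    out = []
    for sent in doc_sentences:                  # incremental sentence-level loop:
        out.append(translate_sentence(sent, memory, translate_fn))  # none skipped
    return out

# Toy "translators" (uppercase as a stand-in for real MT models).
fake_mt = lambda s, mem: s.upper()
fake_summary_mt = lambda s: s.upper()
doc = ["hello world.", "goodbye world."]
print(sent2sent_translate(doc, "greetings.", fake_summary_mt, fake_mt))
```

Keeping only the summary in memory, rather than the full translated prefix, is the design choice that keeps the context compact while still enforcing document-level consistency.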
[NLP-29] Adapting Whisper for Regional Dialects: Enhancing Public Services for Vulnerable Populations in the United Kingdom
【Quick Read】: This paper addresses performance disparities of automatic speech recognition (ASR) models across regional UK accents, focusing on two Scottish accents with distinct dialects. The issue is particularly acute in public services, where biased ASR models can cause miscommunication that disadvantages individuals with regional accents, especially vulnerable populations. The key elements of the solution are: first, evaluating the out-of-the-box performance of the Whisper large-v3 model on a baseline dataset and on the authors' newly collected data; second, fine-tuning Whisper to improve performance in the two UK regions while examining, through manual inspection of model errors, how well existing model-evaluation techniques serve real-world applications. The fine-tuned models perform better on test data from the same domain and accent and show some transferability to test data from outside the training region. Manual inspection of model outputs further reveals the strengths and weaknesses of word error rate (WER) as an evaluation metric and of fine-tuning as a means of adapting to regional dialects.
Link: https://arxiv.org/abs/2501.08502
Authors: Melissa Torgbi, Andrew Clayman, Jordan J. Speight, Harish Tayyar Madabushi
Affiliations: Department of Computer Science, University of Bath, UK; Wyser LTD, UK
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:
Abstract:We collect novel data in the public service domain to evaluate the capability of the state-of-the-art automatic speech recognition (ASR) models in capturing regional differences in accents in the United Kingdom (UK), specifically focusing on two accents from Scotland with distinct dialects. This study addresses real-world problems where biased ASR models can lead to miscommunication in public services, disadvantaging individuals with regional accents, particularly those in vulnerable populations. We first examine the out-of-the-box performance of the Whisper large-v3 model on a baseline dataset and our data. We then explore the impact of fine-tuning Whisper on the performance in the two UK regions and investigate the effectiveness of existing model evaluation techniques for our real-world application through manual inspection of model errors. We observe that the Whisper model has a higher word error rate (WER) on our test datasets compared to the baseline data, and that fine-tuning on given data improves performance on test data from the same domain and accent. The fine-tuned models also appear to show improved performance when applied to test data from outside the region they were trained on, suggesting that fine-tuned models may be transferable within parts of the UK. Our manual analysis of model outputs reveals the benefits and drawbacks of using WER as an evaluation metric and fine-tuning to adapt to regional dialects.
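The WER metric discussed above is the word-level Levenshtein distance between reference and hypothesis, normalized by the reference length. A minimal self-contained implementation (the paper presumably uses an established toolkit, so treat this as a reference sketch):

```python
def wer(reference: str, hypothesis: str) -> float:
    # Word error rate = word-level edit distance / number of reference words.
    ref, hyp = reference.split(), hypothesis.split()
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i                      # deletions
    for j in range(len(hyp) + 1):
        d[0][j] = j                      # insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(ref)][len(hyp)] / len(ref)

# One substitution ("sat" -> "sit") and one deletion ("the") in 6 reference words.
print(wer("the cat sat on the mat", "the cat sit on mat"))
```

A limitation the paper's manual analysis highlights: WER counts a dialect-faithful transcription variant as an error just like a genuine misrecognition, which is why manual inspection complements the metric.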
[NLP-30] Quantifying the Importance of Data Alignment in Downstream Model Performance
【Quick Read】: This paper examines how data alignment, an often-overlooked aspect of data quality, affects model performance when training large language models (LLMs), in contrast to the conventional emphasis on dataset size. It uses the Task2Vec-based alignment coefficient to quantify how alignment between training data and evaluation data impacts downstream task performance. The key contribution is two interventional experiments: 1) increasing the alignment coefficient between pretraining and evaluation datasets; and 2) increasing the alignment coefficient between domain-specific fine-tuning and domain-specific evaluation datasets. In both settings, the experiments find a strong, predictable negative correlation between the alignment coefficient of a model's training and evaluation data and the model's loss/perplexity on the downstream task. These findings suggest re-evaluating LLM training approaches: especially for specialized downstream tasks such as Autoformalization, data alignment matters more than data quantity.
Link: https://arxiv.org/abs/2501.08496
Authors: Krrish Chawla, Aryan Sahai, Mario DePavia, Sudharsan Sundar, Brando Miranda
Affiliations: Stanford University
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Programming Languages (cs.PL)
Comments:
Abstract:Contrary to the conventional emphasis on dataset size, we explore the role of data alignment – an often overlooked aspect of data quality – in training capable Large Language Models (LLMs). To do so, we use the Task2Vec-based alignment coefficient, a quantitative measure of the similarity between two datasets, to quantify the impact of alignment between training data and evaluation data on downstream performance. In particular, we conduct controlled interventional experiments for two settings: 1. the impact of increased alignment coefficients between various pre-training (pt) against evaluation datasets, and 2. the impact of increased alignment coefficients between domain specific fine-tuning (ft) against domain specific evaluation. The domain specific task we explore is Autoformalization – the machine translation task between natural language and code for formal verification. In both settings, we find a strong, predictable negative correlation between the alignment coefficient of a model’s training and evaluation data and the model’s loss/perplexity on the respective downstream task. These findings suggest a re-evaluation of LLM training approaches, demonstrating the relevance of data alignment compared to data quantity, especially in specialized downstream tasks such as Autoformalization.
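The notion of an alignment coefficient between two datasets can be illustrated with a deliberately crude proxy. The real Task2Vec coefficient compares Fisher-information-based task embeddings; the sketch below instead compares word distributions with cosine similarity, purely to show the shape of the computation and the expected ordering between aligned and misaligned data:

```python
from collections import Counter
import math

def distribution(texts):
    c = Counter(w for t in texts for w in t.lower().split())
    total = sum(c.values())
    return {w: n / total for w, n in c.items()}

def alignment_coefficient(train_texts, eval_texts):
    # Toy proxy for the Task2Vec alignment coefficient: cosine similarity
    # between the word distributions of the two datasets.
    p, q = distribution(train_texts), distribution(eval_texts)
    dot = sum(p[w] * q.get(w, 0.0) for w in p)
    norm_p = math.sqrt(sum(v * v for v in p.values()))
    norm_q = math.sqrt(sum(v * v for v in q.values()))
    return dot / (norm_p * norm_q)

train = ["theorem proof lemma", "proof by induction"]
aligned_eval = ["lemma and proof", "induction theorem"]
misaligned_eval = ["stock prices rose", "market closed higher"]
print(alignment_coefficient(train, aligned_eval)
      > alignment_coefficient(train, misaligned_eval))
```

Under the paper's finding, the higher-coefficient pair would predict lower downstream loss/perplexity than the lower-coefficient one.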
[NLP-31] The Theater Stage as Laboratory: Review of Real-Time Comedy LLM Systems for Live Performance COLING2025
【Quick Read】: This position paper examines the challenges and feasibility of evaluating generative AI humor systems in live performance settings. The core question is how to evaluate AI-generated humor effectively in such dynamic, interactive environments, with improvised comedy as the ideal substrate. The key is to deploy AI humor systems in real performance conditions, interacting with live audiences, and to examine three sets of challenges: 1) robotic embodiment, anthropomorphism, and human-machine competition; 2) comedic timing and the nature of audience interaction; and 3) human interpretation of seemingly absurd AI-generated humor. Through examples of successful AI-infused shows, the paper argues that these questions shape the choice of methodologies for evaluating computational humor in live settings and highlights the different kinds of collaborative relationship between human comedians and AI tools.
链接: https://arxiv.org/abs/2501.08474
作者: Piotr Wojciech Mirowski,Boyd Branch,Kory Wallace Mathewson
机构: Improbotics; Improbotics; Improbotics
类目: Computation and Language (cs.CL)
备注: 8 pages, 1st Workshop on Computational Humor (CHum), COLING 2025
点击查看摘要
Abstract:In this position paper, we review the eclectic recent history of academic and artistic works involving computational systems for humor generation, and focus specifically on live performance. We make the case that AI comedy should be evaluated in live conditions, in front of audiences sharing either physical or online spaces, and under real-time constraints. We further suggest that improvised comedy is therefore the perfect substrate for deploying and assessing computational humor systems. Using examples of successful AI-infused shows, we demonstrate that live performance raises three sets of challenges for computational humor generation: 1) questions around robotic embodiment, anthropomorphism and competition between humans and machines, 2) questions around comedic timing and the nature of audience interaction, and 3) questions about the human interpretation of seemingly absurd AI-generated humor. We argue that these questions impact the choice of methodologies for evaluating computational humor, as any such method needs to work around the constraints of live audiences and performance spaces. These interrogations also highlight different types of collaborative relationship of human comedians towards AI tools.
zh
[NLP-32] Selective Attention Merging for low resource tasks: A case study of Child ASR ICASSP2025
【速读】: 该论文试图解决低资源任务(如儿童自动语音识别,ASR)中,由于预训练数据有限,导致语音基础模型(Speech Foundation Models, SFMs)性能受限的问题。解决方案的关键在于探索不同的模型融合技术,特别是引入了一种新颖的选择性注意力融合(Selective Attention Merge, SA Merge)方法。该方法通过选择性地融合来自注意力矩阵的任务向量,从而提升SFM在低资源任务中的表现。实验结果表明,结合数据增强技术和SA Merge方法,能够在MyST数据库上显著降低相对词错误率(WER),最高可达14%,并在Whisper-small模型上实现了8.69的WER,达到了新的最优水平。
链接: https://arxiv.org/abs/2501.08468
作者: Natarajan Balaji Shankar,Zilai Wang,Eray Eren,Abeer Alwan
机构: Department of Electrical and Computer Engineering, University of California Los Angeles (加州大学洛杉矶分校电气与计算机工程系)
类目: Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
备注: To appear in ICASSP 2025
点击查看摘要
Abstract:While Speech Foundation Models (SFMs) excel in various speech tasks, their performance for low-resource tasks such as child Automatic Speech Recognition (ASR) is hampered by limited pretraining data. To address this, we explore different model merging techniques to leverage knowledge from models trained on larger, more diverse speech corpora. This paper also introduces Selective Attention (SA) Merge, a novel method that selectively merges task vectors from attention matrices to enhance SFM performance on low-resource tasks. Experiments on the MyST database show significant reductions in relative word error rate of up to 14%, outperforming existing model merging and data augmentation techniques. By combining data augmentation techniques with SA Merge, we achieve a new state-of-the-art WER of 8.69 on the MyST database for the Whisper-small model, highlighting the potential of SA Merge for improving low-resource ASR.
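摘要中 SA Merge 的核心思路——任务向量(微调权重减基座权重)加上“仅合并注意力参数”的选择性——可以用如下极简 Python 示意(假设:用标量代替注意力矩阵,参数名 attn.q、mlp.w 及合并系数均为虚构,并非论文实现):

```python
def task_vector(base_weights, finetuned_weights):
    # 任务向量 = 微调权重 - 基座权重(逐参数)
    return {k: finetuned_weights[k] - base_weights[k] for k in base_weights}

def selective_attention_merge(base_weights, finetuned_weights, lam=0.5,
                              is_attention=lambda name: "attn" in name):
    # 仅对注意力参数叠加缩放后的任务向量, 其余参数保持基座权重不变
    tv = task_vector(base_weights, finetuned_weights)
    return {k: base_weights[k] + lam * tv[k] if is_attention(k) else base_weights[k]
            for k in base_weights}
```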
zh
[NLP-33] Towards Zero-Shot Explainable Video Description by Reasoning over Graphs of Events in Space and Time
【速读】: 该论文试图解决的是视觉与语言之间关系的理解问题,特别是在描述视频内容时如何生成连贯、丰富且相关的自然语言描述。尽管现有的基于Transformer的模型在各自领域(如语言生成、图像分类等)取得了显著成果,但视觉与语言之间的关联仍然是一个未解决的难题。论文提出的解决方案关键在于通过基于时空事件的可解释和程序化方法,将基于学习的视觉模型与语言模型结合起来,从而为视频生成自然语言描述提供了一种新的途径。该方法通过标准评估指标(如Bleu、ROUGE)和现代LLM-as-a-Jury方法进行了验证,证明了其有效性。
链接: https://arxiv.org/abs/2501.08460
作者: Mihai Masala,Marius Leordeanu
机构: Institute of Mathematics of Romanian Academy(罗马尼亚科学院数学研究所)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:
点击查看摘要
Abstract:In the current era of Machine Learning, Transformers have become the de facto approach across a variety of domains, such as computer vision and natural language processing. Transformer-based solutions are the backbone of current state-of-the-art methods for language generation, image and video classification, segmentation, action and object recognition, among many others. Interestingly enough, while these state-of-the-art methods produce impressive results in their respective domains, the problem of understanding the relationship between vision and language is still beyond our reach. In this work, we propose a common ground between vision and language based on events in space and time in an explainable and programmatic way, to connect learning-based vision and language state of the art models and provide a solution to the long standing problem of describing videos in natural language. We validate that our algorithmic approach is able to generate coherent, rich and relevant textual descriptions on videos collected from a variety of datasets, using both standard metrics (e.g. Bleu, ROUGE) and the modern LLM-as-a-Jury approach.
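摘要提到用 Bleu、ROUGE 等标准指标评估生成的文本描述。下面给出 ROUGE-1 召回率的一个极简示意(未做重复词截断等细节处理,仅说明“候选描述覆盖了参考描述中多少比例的词”这一直觉):

```python
def rouge1_recall(reference, candidate):
    # ROUGE-1 召回的简化版: 参考描述中的词被候选描述命中的比例
    ref_tokens = reference.lower().split()
    cand_tokens = set(candidate.lower().split())
    hits = sum(1 for w in ref_tokens if w in cand_tokens)
    return hits / len(ref_tokens)
```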
zh
[NLP-34] Large Language Models For Text Classification: Case Study And Comprehensive Review
【速读】: 该论文旨在探索大型语言模型(LLMs)在数据分类任务中的潜力,并与现有的深度学习(deep-learning)和机器学习(machine-learning)模型进行性能比较。研究聚焦于两种不同的分类场景:1)基于在线发布的职位评论对员工工作地点进行分类(多类别分类,multiclass classification);2)将新闻文章分类为虚假或非虚假(二分类,binary classification)。研究的关键在于评估不同规模、量化和架构的语言模型在这些任务中的表现,并探讨不同提示技术(prompting techniques)对模型性能的影响。通过加权F1分数(weighted F1-score)和推理响应时间(inference response time)的权衡分析,研究揭示了LLMs(特别是Llama3和GPT-4)在复杂分类任务中优于传统方法,尽管其推理时间较长;而在简单的二分类任务中,传统机器学习模型在性能与时间之间提供了更好的权衡。
链接: https://arxiv.org/abs/2501.08457
作者: Arina Kostina,Marios D. Dikaiakos,Dimosthenis Stefanidis,George Pallis
机构: Computer Science, University of Cyprus (塞浦路斯大学计算机科学)
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:
点击查看摘要
Abstract:Unlocking the potential of Large Language Models (LLMs) in data classification represents a promising frontier in natural language processing. In this work, we evaluate the performance of different LLMs in comparison with state-of-the-art deep-learning and machine-learning models, in two different classification scenarios: i) the classification of employees’ working locations based on job reviews posted online (multiclass classification), and 2) the classification of news articles as fake or not (binary classification). Our analysis encompasses a diverse range of language models differentiating in size, quantization, and architecture. We explore the impact of alternative prompting techniques and evaluate the models based on the weighted F1-score. Also, we examine the trade-off between performance (F1-score) and time (inference response time) for each language model to provide a more nuanced understanding of each model’s practical applicability. Our work reveals significant variations in model responses based on the prompting strategies. We find that LLMs, particularly Llama3 and GPT-4, can outperform traditional methods in complex classification tasks, such as multiclass classification, though at the cost of longer inference times. In contrast, simpler ML models offer better performance-to-time trade-offs in simpler binary classification tasks.
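论文以加权 F1 分数(weighted F1-score)作为主要评估指标,即按各类别样本数(support)加权的类别 F1 平均,可示意如下:

```python
def weighted_f1(per_class_f1, per_class_support):
    # 加权 F1: sum(F1_c * n_c) / sum(n_c), 其中 n_c 为类别 c 的样本数
    total = sum(per_class_support)
    return sum(f * n for f, n in zip(per_class_f1, per_class_support)) / total
```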
zh
[NLP-35] TagTab: Pretraining Data Detection in Large Language Models Using Keyword-Based Membership Inference Attack
【速读】: 该论文试图解决在大语言模型(LLMs)预训练数据检测中现有方法在准确性和语义重要性考虑上的不足。现有方法主要依赖于句子级或段落级的成员推断攻击(MIAs),通常通过对目标模型预测标记的概率分析来进行,但这些方法在准确性方面表现较差,未能充分考虑文本内容的语义重要性和词汇的显著性。为了解决这些问题,论文提出了一种名为TagTab的新方法。该方法的关键在于利用先进的自然语言处理(NLP)技术对输入文本中的关键词进行标记(Tagging),然后使用LLM获取这些关键词的概率,并计算其平均对数似然以确定输入文本的成员关系(Tabbing)。实验结果表明,TagTab在多个基准数据集和不同规模的LLM上,相较于现有最先进方法,AUC分数平均提高了4.1%至12.1%,显著提升了数据泄露检测的性能。
链接: https://arxiv.org/abs/2501.08454
作者: Sagiv Antebi,Edan Habler,Asaf Shabtai,Yuval Elovici
机构: Department of Software and Information Systems Engineering, Ben-Gurion University of the Negev, Israel (本古里安大学软件与信息系统工程系, 以色列)
类目: Cryptography and Security (cs.CR); Computation and Language (cs.CL)
备注:
点击查看摘要
Abstract:Large language models (LLMs) have become essential digital task assistance tools. Their training relies heavily on the collection of vast amounts of data, which may include copyright-protected or sensitive information. Recent studies on the detection of pretraining data in LLMs have primarily focused on sentence-level or paragraph-level membership inference attacks (MIAs), usually involving probability analysis of the target model prediction tokens. However, the proposed methods often demonstrate poor performance, specifically in terms of accuracy, failing to account for the semantic importance of textual content and word significance. To address these shortcomings, we propose TagTab, a novel approach for detecting data that has been used as part of the LLM pretraining. Our method leverages advanced natural language processing (NLP) techniques to tag keywords in the input text - a process we term Tagging. Then, the LLM is used to obtain the probabilities of these keywords and calculate their average log-likelihood to determine input text membership, a process we refer to as Tabbing. Our experiments on three benchmark datasets (BookMIA, MIMIR, and the Pile) and several open-source LLMs of varying sizes demonstrate an average increase in the AUC scores ranging from 4.1% to 12.1% over state-of-the-art methods. TagTab not only sets a new standard for data leakage detection in LLMs, but its outstanding performance is a testament to the importance of words in MIAs on LLMs.
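TagTab 的两步流程——Tagging(取关键词)后由 LLM 给出关键词概率,Tabbing(求平均对数似然并按阈值判定成员关系)——可示意如下(假设:关键词概率已由目标 LLM 返回,阈值 -2.0 为任意示例值,并非论文设定):

```python
import math

def average_log_likelihood(keyword_probs):
    # Tabbing: 对各关键词的概率取对数后求平均
    return sum(math.log(p) for p in keyword_probs) / len(keyword_probs)

def is_member(keyword_probs, threshold=-2.0):
    # 平均对数似然高于阈值, 则判定该文本出现在预训练数据中(阈值仅为示意)
    return average_log_likelihood(keyword_probs) > threshold
```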
zh
[NLP-36] Jochre 3 and the Yiddish OCR corpus
【速读】: 该论文旨在解决意第绪语(Yiddish)文本的自动光学字符识别(OCR)问题,特别是针对意第绪语书籍的数字化和可搜索性。解决方案的关键在于构建了一个公开可用的意第绪语OCR语料库(Yiddish OCR Corpus),并开发了一套开源OCR工具套件Jochre 3。该工具套件包括用于语料库注释的Alto编辑器、用于生成Alto OCR层的OCR软件,以及一个可定制的OCR搜索引擎。Jochre 3采用了经过微调的YOLOv8模型进行自上而下的页面布局分析,并使用自定义的卷积神经网络(CNN)进行字形识别。该工具在意第绪语测试语料库上达到了1.5%的字符错误率(CER),显著优于现有的其他公开模型。通过Jochre 3,研究团队还分析了包含6.6亿词的意第绪语书籍中心(Yiddish Book Center)的全部内容,并通过其OCR搜索引擎实现了可搜索性。
链接: https://arxiv.org/abs/2501.08442
作者: Assaf Urieli,Amber Clooney,Michelle Sigiel,Grisha Leyfer
机构: Joliciel Informatique; Yiddish Book Center (意第绪语书籍中心)
类目: Computation and Language (cs.CL)
备注: 10 pages, 4 figures
点击查看摘要
Abstract:We describe the construction of a publicly available Yiddish OCR Corpus, and describe and evaluate the open source OCR tool suite Jochre 3, including an Alto editor for corpus annotation, OCR software for Alto OCR layer generation, and a customizable OCR search engine. The current version of the Yiddish OCR corpus contains 658 pages, 186K tokens and 840K glyphs. The Jochre 3 OCR tool uses various fine-tuned YOLOv8 models for top-down page layout analysis, and a custom CNN network for glyph recognition. It attains a CER of 1.5% on our test corpus, far out-performing all other existing public models for Yiddish. We analyzed the full 660M word Yiddish Book Center with Jochre 3 OCR, and the new OCR is searchable through the Yiddish Book Center OCR search engine.
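摘要中的 CER(字符错误率)即参考文本与识别结果之间的编辑距离除以参考文本字符数,可用标准动态规划实现:

```python
def levenshtein(a, b):
    # 单行滚动数组的经典编辑距离动态规划
    dp = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, cb in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1,        # 删除
                                     dp[j - 1] + 1,    # 插入
                                     prev + (ca != cb))  # 替换/匹配
    return dp[len(b)]

def cer(reference, hypothesis):
    # 字符错误率 = 编辑距离 / 参考文本字符数
    return levenshtein(reference, hypothesis) / len(reference)
```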
zh
[NLP-37] Religious Bias Landscape in Language and Text-to-Image Models: Analysis Detection and Debiasing Strategies
【速读】: 该论文旨在解决语言模型(language models)和文本到图像生成模型(text-to-image generation models)中存在的宗教偏见(religious bias)问题。研究通过构建约400个自然发生的提示(prompts),系统地探究了开源和闭源系统中的宗教偏见,涵盖了掩码填充(mask filling)、提示补全(prompt completion)和图像生成(image generation)等多种任务。实验揭示了语言模型在文本和图像生成任务中持续表现出显著的偏见,特别是某些宗教被不成比例地与刻板印象相关联。此外,研究还探讨了宗教偏见与性别、年龄和国籍等人口统计因素的交叉影响。解决方案的关键在于采用针对性的去偏技术(debiasing techniques),通过设计纠正性提示(corrective prompts)来减轻已识别的偏见。研究结果表明,开发更公平的语言模型以实现全球可接受性具有紧迫性。
链接: https://arxiv.org/abs/2501.08441
作者: Ajwad Abrar,Nafisa Tabassum Oeshy,Mohsinul Kabir,Sophia Ananiadou
机构: Islamic University of Technology (伊斯兰科技大学); University of Manchester (曼彻斯特大学)
类目: Computation and Language (cs.CL)
备注:
点击查看摘要
Abstract:Note: This paper includes examples of potentially offensive content related to religious bias, presented solely for academic purposes. The widespread adoption of language models highlights the need for critical examinations of their inherent biases, particularly concerning religion. This study systematically investigates religious bias in both language models and text-to-image generation models, analyzing both open-source and closed-source systems. We construct approximately 400 unique, naturally occurring prompts to probe language models for religious bias across diverse tasks, including mask filling, prompt completion, and image generation. Our experiments reveal concerning instances of underlying stereotypes and biases associated disproportionately with certain religions. Additionally, we explore cross-domain biases, examining how religious bias intersects with demographic factors such as gender, age, and nationality. This study further evaluates the effectiveness of targeted debiasing techniques by employing corrective prompts designed to mitigate the identified biases. Our findings demonstrate that language models continue to exhibit significant biases in both text and image generation tasks, emphasizing the urgent need to develop fairer language models to achieve global acceptability.
zh
[NLP-38] Ensemble of Large Language Models for Curated Labeling and Rating of Free-text Data
【速读】: 该论文试图解决在心理学研究中,自由文本数据的标注问题。由于自由文本数据通常包含丰富的定性信息,传统的定量方法难以捕捉这些信息,而通过多名训练有素的人工编码员进行标注则耗时且劳动密集。尽管大语言模型(LLMs)在语言处理方面表现出色,但依赖闭源LLMs的标注技术无法在未经明确同意的情况下直接应用于自由文本数据。为此,论文提出了一种在隐私约束下,通过集成本地可部署的开源LLMs来增强自由文本数据中预定主题标注的框架。该框架的关键在于利用多个开源LLMs的异质性,通过一种基于嵌入距离的相关性评分方法,平衡LLMs之间的一致性和分歧,从而提高标注的准确性和精确度-敏感度的权衡。实验结果表明,集成LLMs在预测人工标注方面达到了最高的准确性,并且相关性评分方法有效缓解了LLMs标注的异质性。
链接: https://arxiv.org/abs/2501.08413
作者: Jiaxing Qiu,Dongliang Guo,Papini Natalie,Peace Noelle,Levinson Cheri,Teague R. Henry
机构: 未知
类目: Computation and Language (cs.CL)
备注:
点击查看摘要
Abstract:Free-text responses are commonly collected in psychological studies, providing rich qualitative insights that quantitative measures may not capture. Labeling curated topics of research interest in free-text data by multiple trained human coders is typically labor-intensive and time-consuming. Though large language models (LLMs) excel in language processing, LLM-assisted labeling techniques relying on closed-source LLMs cannot be directly applied to free-text data, without explicit consent for external use. In this study, we propose a framework of assembling locally-deployable LLMs to enhance the labeling of predetermined topics in free-text data under privacy constraints. Analogous to annotation by multiple human raters, this framework leverages the heterogeneity of diverse open-source LLMs. The ensemble approach seeks a balance between the agreement and disagreement across LLMs, guided by a relevancy scoring methodology that utilizes embedding distances between topic descriptions and LLMs’ reasoning. We evaluated the ensemble approach using both publicly accessible Reddit data from eating disorder related forums, and free-text responses from eating disorder patients, both complemented by human annotations. We found that: (1) there is heterogeneity in the performance of labeling among same-sized LLMs, with some showing low sensitivity but high precision, while others exhibit high sensitivity but low precision. (2) Compared to individual LLMs, the ensemble of LLMs achieved the highest accuracy and optimal precision-sensitivity trade-off in predicting human annotations. (3) The relevancy scores across LLMs showed greater agreement than dichotomous labels, indicating that the relevancy scoring method effectively mitigates the heterogeneity in LLMs’ labeling. 
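集成打标的聚合步骤可示意如下(假设:各 LLM 的相关性分数已按论文方法由主题描述与 LLM 推理文本的嵌入距离得到,此处只演示“在一致与分歧之间取平衡”的聚合;用分数极差刻画分歧、阈值 0.5 均为示意,并非论文设定):

```python
def ensemble_label(llm_scores, threshold=0.5):
    # llm_scores: 多个开源 LLM 对同一 文本-主题 对给出的相关性分数
    mean = sum(llm_scores) / len(llm_scores)          # 一致程度: 平均分
    disagreement = max(llm_scores) - min(llm_scores)  # 分歧程度: 分数极差
    return {"label": mean >= threshold,
            "mean": mean,
            "disagreement": disagreement}
```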
zh
[NLP-39] OptiChat: Bridging Optimization Models and Practitioners with Large Language Models
【速读】: 该论文旨在解决优化模型(optimization models)在实际应用中由非优化专家使用时所面临的困难,特别是这些用户在独立解释模型、诊断不可行性、分析敏感性、检索信息、评估修改以及提供反事实解释等方面的挑战。为了解决这一问题,论文提出了OptiChat,一个基于自然语言对话的系统,通过增强大型语言模型(LLMs)的功能调用和代码生成能力,专门为优化模型设计。OptiChat的关键在于其能够无缝地与用户交互,并减少幻觉(hallucinations)的风险,从而帮助用户更好地理解和操作优化模型。通过开发新的数据集来评估OptiChat的性能,实验表明该系统能够有效地弥合优化模型与用户之间的鸿沟,提供自主、准确且即时的响应。
链接: https://arxiv.org/abs/2501.08406
作者: Hao Chen,Gonzalo Esteban Constante-Flores,Krishna Sri Ipsit Mantri,Sai Madhukiran Kompalli,Akshdeep Singh Ahluwalia,Can Li
机构: Purdue University(普渡大学)
类目: Human-Computer Interaction (cs.HC); Computation and Language (cs.CL); Machine Learning (cs.LG); Optimization and Control (math.OC)
备注:
点击查看摘要
Abstract:Optimization models have been applied to solve a wide variety of decision-making problems. These models are usually developed by optimization experts but are used by practitioners without optimization expertise in various application domains. As a result, practitioners often struggle to interact with and draw useful conclusions from optimization models independently. To fill this gap, we introduce OptiChat, a natural language dialogue system designed to help practitioners interpret model formulation, diagnose infeasibility, analyze sensitivity, retrieve information, evaluate modifications, and provide counterfactual explanations. By augmenting large language models (LLMs) with functional calls and code generation tailored for optimization models, we enable seamless interaction and minimize the risk of hallucinations in OptiChat. We develop a new dataset to evaluate OptiChat’s performance in explaining optimization models. Experiments demonstrate that OptiChat effectively bridges the gap between optimization models and practitioners, delivering autonomous, accurate, and instant responses.
zh
[NLP-40] Towards Best Practices for Open Datasets for LLM Training
【速读】: 该论文探讨了当前许多AI公司在未经版权所有者许可的情况下使用数据训练大语言模型(LLMs)所引发的法律和伦理问题。尽管在不同司法管辖区(如欧盟和日本)这种行为在特定限制下是被允许的,但在美国等地的法律环境则较为模糊。这种行为引发了创意生产者的担忧,并导致了多起高调的版权诉讼。为了避免法律风险,企业和公共利益相关者逐渐减少公开训练数据集的信息,这种做法阻碍了透明度、问责制和整个生态系统的创新。论文指出,解决这一问题的关键在于通过法律、技术和政策领域的合作,推动使用开放许可和公共领域数据进行模型训练,并投资于元数据标准、数字化以及开放文化的培养。然而,目前尚未有大规模训练的开放数据模型,主要由于技术和社会学上的挑战,如不完整和不可靠的元数据、数字化物理记录的成本和复杂性,以及确保相关性和责任所需的多样化法律和技术技能。
链接: https://arxiv.org/abs/2501.08365
作者: Stefan Baack,Stella Biderman,Kasia Odrozek,Aviya Skowron,Ayah Bdeir,Jillian Bommarito,Jennifer Ding,Maximilian Gahntz,Paul Keller,Pierre-Carl Langlais,Greg Lindahl,Sebastian Majstorovic,Nik Marda,Guilherme Penedo,Maarten Van Segbroeck,Jennifer Wang,Leandro von Werra,Mitchell Baker,Julie Belião,Kasia Chmielinski,Marzieh Fadaee,Lisa Gutermuth,Hynek Kydlíček,Greg Leppert,EM Lewis-Jong,Solana Larsen,Shayne Longpre,Angela Oduor Lungati,Cullen Miller,Victor Miller,Max Ryabinin,Kathleen Siminyu,Andrew Strait,Mark Surman,Anna Tumadóttir,Maurice Weber,Rebecca Weiss,Lee White,Thomas Wolf
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:
点击查看摘要
Abstract:Many AI companies are training their large language models (LLMs) on data without the permission of the copyright owners. The permissibility of doing so varies by jurisdiction: in countries like the EU and Japan, this is allowed under certain restrictions, while in the United States, the legal landscape is more ambiguous. Regardless of the legal status, concerns from creative producers have led to several high-profile copyright lawsuits, and the threat of litigation is commonly cited as a reason for the recent trend towards minimizing the information shared about training datasets by both corporate and public interest actors. This trend in limiting data information causes harm by hindering transparency, accountability, and innovation in the broader ecosystem by denying researchers, auditors, and impacted individuals access to the information needed to understand AI models. While this could be mitigated by training language models on open access and public domain data, at the time of writing, there are no such models (trained at a meaningful scale) due to the substantial technical and sociological challenges in assembling the necessary corpus. These challenges include incomplete and unreliable metadata, the cost and complexity of digitizing physical records, and the diverse set of legal and technical skills required to ensure relevance and responsibility in a quickly changing landscape. Building towards a future where AI systems can be trained on openly licensed data that is responsibly curated and governed requires collaboration across legal, technical, and policy domains, along with investments in metadata standards, digitization, and fostering a culture of openness. 
zh
[NLP-41] MERaLiON-TextLLM : Cross-Lingual Understanding of Large Language Models in Chinese Indonesian Malay and Singlish
【速读】: 该论文旨在解决多语言大语言模型(MLLMs)在不同语言家族中表现差异显著的问题,尤其是对于语言资源有限的语种。为了解决这一问题,作者提出了MERaLiON-TextLLM,这是一系列专门针对中文、印尼语、马来语和新加坡式英语(Singlish)优化的开源语言模型。解决方案的关键在于基于Llama-3-8B-Base模型,通过精心设计的持续预训练和权重合并过程进行改进。这一方法在多个基准测试中显著提升了这些语言的性能,超越了官方Llama-3模型的表现。作者还提供了模型检查点,以支持跨语言理解领域的进一步研究和开发。
链接: https://arxiv.org/abs/2501.08335
作者: Xin Huang,Tarun Kumar Vangani,Minh Duc Pham,Xunlong Zou,Bin Wang,Zhengyuan Liu,Ai Ti Aw
机构: Institute for Infocomm Research (I2R), A*STAR, Singapore(新加坡科技研究局信息通信研究院)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:Multilingual large language models (MLLMs) have shown impressive capabilities across a variety of languages. However, efficacy can differ greatly between different language families, especially for those with limited linguistic resources. This report presents MERaLiON-TextLLM, a series of open-source language models specifically tailored to improve understanding and generation in Chinese, Indonesian, Malay, and Singlish. The initial released model is built on Llama-3-8B-Base and refined through a meticulously crafted process of continued pre-training and weight merging. Our approach achieves performance improvements across benchmarks in these languages, exceeding the capabilities of the official Llama-3 models. We provide the model checkpoints as a resource to support further research and development in cross-lingual language understanding.
zh
[NLP-42] SEAL: Speaker Error Correction using Acoustic-conditioned Large Language Models ICASSP2025
【速读】: 该论文试图解决传统说话人分离(Speaker Diarization, SD)系统在说话人转换和重叠语音时引入的说话人错误问题。传统SD系统通常基于音频并独立于自动语音识别(ASR)运行,导致在复杂场景下错误率较高。论文提出了一种新颖的声学条件方法,通过从声学分离器中提取更细粒度的信息,并将其输入到微调的大语言模型(LLMs)中,以利用转录输出中的词汇上下文进行二次校正。此外,论文还提出了一种简化的约束解码策略,减少了LLM的幻觉现象,同时避免了复杂的后处理步骤。实验结果表明,该方法在Fisher、Callhome和RT03-CTS数据集上显著降低了说话人错误率,相比第一遍声学SD系统,错误率减少了24-43%。
链接: https://arxiv.org/abs/2501.08421
作者: Anurag Kumar,Rohit Paturi,Amber Afshan,Sundararajan Srinivasan
机构: Ohio State University(俄亥俄州立大学); AWS AI Labs(AWS AI实验室); AWS AI Labs(AWS AI实验室); AWS AI Labs(AWS AI实验室)
类目: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Sound (cs.SD)
备注: Accepted at ICASSP 2025
点击查看摘要
Abstract:Speaker Diarization (SD) is a crucial component of modern end-to-end ASR pipelines. Traditional SD systems, which are typically audio-based and operate independently of ASR, often introduce speaker errors, particularly during speaker transitions and overlapping speech. Recently, language models including fine-tuned large language models (LLMs) have shown to be effective as a second-pass speaker error corrector by leveraging lexical context in the transcribed output. In this work, we introduce a novel acoustic conditioning approach to provide more fine-grained information from the acoustic diarizer to the LLM. We also show that a simpler constrained decoding strategy reduces LLM hallucinations, while avoiding complicated post-processing. Our approach significantly reduces the speaker error rates by 24-43% across Fisher, Callhome, and RT03-CTS datasets, compared to the first-pass Acoustic SD.
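摘要中“简化的约束解码”思想——把每一步的候选输出限制在合法的说话人标签集合内,从结构上杜绝词表外的“幻觉”输出——可示意如下(logits 用字典模拟,标签名均为虚构示例):

```python
def constrained_decode(logits_per_step, allowed_tokens):
    # 每一步仅在允许的词表子集(如说话人标签)中取 argmax,
    # 集合外的 token 即使得分更高也不会被选中
    output = []
    for logits in logits_per_step:
        best = max(allowed_tokens, key=lambda t: logits.get(t, float("-inf")))
        output.append(best)
    return output
```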
zh
计算机视觉
[CV-0] Ouroboros-Diffusion: Exploring Consistent Content Generation in Tuning-free Long Video Diffusion
【速读】:该论文试图解决基于预训练文本到视频(text-to-video)模型的FIFO(First-In-First-Out)视频扩散方法在生成长视频时难以保持长程时间一致性的问题。具体来说,FIFO-Diffusion由于缺乏跨帧的对应关系建模,导致生成的视频在结构和内容(主体)一致性上表现不佳。为解决这一问题,论文提出了Ouroboros-Diffusion框架,其关键创新点包括:1)在队列尾部引入新的潜在采样技术,以增强结构一致性,确保帧间过渡的感知平滑性;2)设计了主体感知跨帧注意力机制(Subject-Aware Cross-Frame Attention, SACFA),通过在短片段内对齐帧间主体,提升视觉连贯性;3)引入自回归引导技术,利用队列前端所有先前较干净的帧信息,指导队列末端较噪声帧的去噪过程,促进全局信息的丰富交互。这些改进显著提升了生成视频的主体一致性、运动平滑性和时间一致性。
链接: https://arxiv.org/abs/2501.09019
作者: Jingyuan Chen,Fuchen Long,Jie An,Zhaofan Qiu,Ting Yao,Jiebo Luo,Tao Mei
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:The first-in-first-out (FIFO) video diffusion, built on a pre-trained text-to-video model, has recently emerged as an effective approach for tuning-free long video generation. This technique maintains a queue of video frames with progressively increasing noise, continuously producing clean frames at the queue’s head while Gaussian noise is enqueued at the tail. However, FIFO-Diffusion often struggles to keep long-range temporal consistency in the generated videos due to the lack of correspondence modeling across frames. In this paper, we propose Ouroboros-Diffusion, a novel video denoising framework designed to enhance structural and content (subject) consistency, enabling the generation of consistent videos of arbitrary length. Specifically, we introduce a new latent sampling technique at the queue tail to improve structural consistency, ensuring perceptually smooth transitions among frames. To enhance subject consistency, we devise a Subject-Aware Cross-Frame Attention (SACFA) mechanism, which aligns subjects across frames within short segments to achieve better visual coherence. Furthermore, we introduce self-recurrent guidance. This technique leverages information from all previous cleaner frames at the front of the queue to guide the denoising of noisier frames at the end, fostering rich and contextual global information interaction. Extensive experiments of long video generation on the VBench benchmark demonstrate the superiority of our Ouroboros-Diffusion, particularly in terms of subject consistency, motion smoothness, and temporal consistency.
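FIFO 视频扩散的队列机制——队首噪声最低、队尾最高;每步对全队各帧去噪一次,弹出已变干净的队首帧,并在队尾补入一帧高斯噪声——可以用如下玩具代码示意(假设:帧和去噪函数均为标量玩具,并非真实扩散模型):

```python
import random
from collections import deque

def fifo_step(queue, denoise, max_noise):
    # queue 中元素为 (帧, 噪声水平), 从队首(噪声最低)到队尾(噪声最高)排列
    # 1) 对队列中所有帧按各自噪声水平去噪一步
    stepped = deque((denoise(f, n), n - 1) for f, n in queue)
    # 2) 队首帧噪声降为 0, 出队作为干净输出帧
    clean_frame, _ = stepped.popleft()
    # 3) 队尾入队一帧纯高斯噪声, 队列长度保持不变
    stepped.append((random.gauss(0.0, 1.0), max_noise))
    return clean_frame, stepped
```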
zh
[CV-1] SimGen: A Diffusion-Based Framework for Simultaneous Surgical Image and Segmentation Mask Generation
【速读】:该论文旨在解决获取和标注手术数据时面临的资源密集、伦理限制和专家参与需求高等问题。为了解决这些问题,论文提出了一种名为SimGen的新任务和方法,用于同时生成高保真手术图像及其对应的分割掩码(segmentation masks)。SimGen基于DDPM(Denoising Diffusion Probabilistic Models)框架和残差U-Net(Residual U-Net),通过利用跨相关先验(cross-correlation priors)来捕捉连续图像和离散掩码分布之间的依赖关系。此外,模型采用了规范斐波那契格点(Canonical Fibonacci Lattice, CFL)来增强掩码在RGB空间中的类别可分性和均匀性。SimGen在六个公开数据集上的评估表明,其在图像和语义起始距离(inception distance)指标上优于基线模型。消融实验显示,CFL显著提高了掩码质量和空间分离效果。下游实验表明,生成的图像-掩码对在法规限制人类数据发布的情况下可用于研究。该研究为生成配对的手术图像和复杂标签提供了一种经济高效的解决方案,减少了昂贵的手动标注需求,推动了手术AI的发展。
链接: https://arxiv.org/abs/2501.09008
作者: Aditya Bhat,Rupak Bose,Chinedu Innocent Nwoye,Nicolas Padoy
机构: ICube(ICube); UMR7357(UMR7357); CNRS(法国国家科学研究中心); INSERM(法国国家健康与医学研究院); University of Strasbourg(斯特拉斯堡大学); IHU Strasbourg(斯特拉斯堡大学医院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 12 pages, 17 figures, 4 tables, project page at this https URL
点击查看摘要
Abstract:Acquiring and annotating surgical data is often resource-intensive, ethical constraining, and requiring significant expert involvement. While generative AI models like text-to-image can alleviate data scarcity, incorporating spatial annotations, such as segmentation masks, is crucial for precision-driven surgical applications, simulation, and education. This study introduces both a novel task and method, SimGen, for Simultaneous Image and Mask Generation. SimGen is a diffusion model based on the DDPM framework and Residual U-Net, designed to jointly generate high-fidelity surgical images and their corresponding segmentation masks. The model leverages cross-correlation priors to capture dependencies between continuous image and discrete mask distributions. Additionally, a Canonical Fibonacci Lattice (CFL) is employed to enhance class separability and uniformity in the RGB space of the masks. SimGen delivers high-fidelity images and accurate segmentation masks, outperforming baselines across six public datasets assessed on image and semantic inception distance metrics. Ablation study shows that the CFL improves mask quality and spatial separation. Downstream experiments suggest generated image-mask pairs are usable if regulations limit human data release for research. This work offers a cost-effective solution for generating paired surgical images and complex labels, advancing surgical AI development by reducing the need for expensive manual annotations.
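规范斐波那契格点(CFL)被用来在 RGB 空间为各类别掩码分配分布均匀、彼此易区分的颜色。下面是一个简化示意(假设:在单位球面上按斐波那契格点取点再线性映射到 [0,1]^3,具体构造与论文实现可能不同):

```python
import math

GOLDEN_RATIO = (1 + math.sqrt(5)) / 2

def fibonacci_lattice_rgb(num_classes):
    # 在单位球面上取 num_classes 个近似均匀分布的斐波那契格点,
    # 再映射到 [0,1]^3 作为各类别的 RGB 颜色
    colors = []
    for i in range(num_classes):
        z = 1.0 - 2.0 * (i + 0.5) / num_classes       # 均匀分层的纬度
        r = math.sqrt(max(0.0, 1.0 - z * z))
        theta = 2.0 * math.pi * i / GOLDEN_RATIO      # 黄金角递增的经度
        x, y = r * math.cos(theta), r * math.sin(theta)
        colors.append(((x + 1) / 2, (y + 1) / 2, (z + 1) / 2))
    return colors
```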
zh
[CV-2] RepVideo: Rethinking Cross-Layer Representation for Video Generation
【速读】:该论文试图解决视频生成过程中由于中间层特征(intermediate layer features)的注意力图(attention maps)在不同层之间存在显著差异,导致语义表示不稳定,进而影响相邻帧之间的相似性和时间一致性(temporal coherence)的问题。为了解决这一问题,论文提出了RepVideo,一种增强的表示框架,通过累积相邻层的特征来形成更丰富的表示,从而捕捉更稳定的语义信息。这些增强的表示随后被用作注意力机制的输入,以提高语义表达能力,同时确保相邻帧之间的特征一致性。实验结果表明,RepVideo不仅显著提升了生成准确空间外观(如捕捉多个对象之间的复杂空间关系)的能力,还改善了视频生成的时间一致性。
链接: https://arxiv.org/abs/2501.08994
作者: Chenyang Si,Weichen Fan,Zhengyao Lv,Ziqi Huang,Yu Qiao,Ziwei Liu
机构: S-Lab, Nanyang Technological University, Singapore, 639798 (新加坡南洋理工大学S实验室); Shanghai Artificial Intelligence Laboratory, China (上海人工智能实验室)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project page: this https URL
点击查看摘要
Abstract:Video generation has achieved remarkable progress with the introduction of diffusion models, which have significantly improved the quality of generated videos. However, recent research has primarily focused on scaling up model training, while offering limited insights into the direct impact of representations on the video generation process. In this paper, we initially investigate the characteristics of features in intermediate layers, finding substantial variations in attention maps across different layers. These variations lead to unstable semantic representations and contribute to cumulative differences between features, which ultimately reduce the similarity between adjacent frames and negatively affect temporal coherence. To address this, we propose RepVideo, an enhanced representation framework for text-to-video diffusion models. By accumulating features from neighboring layers to form enriched representations, this approach captures more stable semantic information. These enhanced representations are then used as inputs to the attention mechanism, thereby improving semantic expressiveness while ensuring feature consistency across adjacent frames. Extensive experiments demonstrate that our RepVideo not only significantly enhances the ability to generate accurate spatial appearances, such as capturing complex spatial relationships between multiple objects, but also improves temporal consistency in video generation.
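RepVideo 的核心操作——把相邻若干层的特征累积成更稳定的富集表示——可以用如下玩具示意(假设:用标量代替特征张量,对相邻层取滑动窗口平均,并非论文的具体实现):

```python
def accumulate_layer_features(layer_features, window=3):
    # 对每一层, 取其与之前最多 window-1 层特征的平均, 作为富集后的表示
    enriched = []
    for i in range(len(layer_features)):
        lo = max(0, i - window + 1)
        group = layer_features[lo:i + 1]
        enriched.append(sum(group) / len(group))
    return enriched
```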
zh
[CV-3] CityDreamer4D: Compositional Generative Model of Unbounded 4D Cities
【速读】:该论文旨在解决生成4D城市场景的挑战,特别是由于城市环境中结构复杂、视觉多样化的物体(如建筑物和车辆)以及人类对城市环境中的失真高度敏感所导致的困难。为了解决这些问题,论文提出了CityDreamer4D,一个专门用于生成无边界4D城市的组合生成模型。其关键解决方案包括:1)将动态物体(如车辆)与静态场景(如建筑物和道路)分离;2)4D场景中的所有物体应由不同类型的神经场(neural fields)组成,分别用于建筑物、车辆和背景物体。具体而言,论文提出了交通场景生成器(Traffic Scenario Generator)和无边界布局生成器(Unbounded Layout Generator),使用高度紧凑的鸟瞰图(BEV)表示来生成动态交通场景和静态城市布局。此外,通过结合面向背景物体和实例的神经场,生成4D城市中的物体,并采用定制的生成哈希网格(generative hash grids)和周期性位置嵌入(periodic positional embeddings)作为场景参数化方法。CityDreamer4D还支持一系列下游应用,如实例编辑、城市风格化和城市模拟,并在生成逼真的4D城市场景方面表现出色。
链接: https://arxiv.org/abs/2501.08983
作者: Haozhe Xie,Zhaoxi Chen,Fangzhou Hong,Ziwei Liu
机构: S-Lab, Nanyang Technological University (南洋理工大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:3D scene generation has garnered growing attention in recent years and has made significant progress. Generating 4D cities is more challenging than 3D scenes due to the presence of structurally complex, visually diverse objects like buildings and vehicles, and heightened human sensitivity to distortions in urban environments. To tackle these issues, we propose CityDreamer4D, a compositional generative model specifically tailored for generating unbounded 4D cities. Our main insights are 1) 4D city generation should separate dynamic objects (e.g., vehicles) from static scenes (e.g., buildings and roads), and 2) all objects in the 4D scene should be composed of different types of neural fields for buildings, vehicles, and background stuff. Specifically, we propose Traffic Scenario Generator and Unbounded Layout Generator to produce dynamic traffic scenarios and static city layouts using a highly compact BEV representation. Objects in 4D cities are generated by combining stuff-oriented and instance-oriented neural fields for background stuff, buildings, and vehicles. To suit the distinct characteristics of background stuff and instances, the neural fields employ customized generative hash grids and periodic positional embeddings as scene parameterizations. Furthermore, we offer a comprehensive suite of datasets for city generation, including OSM, GoogleEarth, and CityTopia. The OSM dataset provides a variety of real-world city layouts, while the Google Earth and CityTopia datasets deliver large-scale, high-quality city imagery complete with 3D instance annotations. Leveraging its compositional design, CityDreamer4D supports a range of downstream applications, such as instance editing, city stylization, and urban simulation, while delivering state-of-the-art performance in generating realistic 4D cities.
zh
[CV-4] CityLoc: 6 DoF Localization of Text Descriptions in Large-Scale Scenes with Gaussian Representation
【速读】:该论文旨在解决在大规模3D场景中基于文本描述进行定位(Localization)时存在的模糊性问题。具体来说,当描述一般性概念(如城市中的所有交通灯)时,传统的定位方法难以准确捕捉这些概念的分布。为此,论文提出了一种基于扩散模型(diffusion-based architecture)的解决方案,通过生成条件于文本描述的相机位姿(6DoF camera poses)分布来实现文本定位。关键创新点在于利用预训练的文本编码器(pre-trained text encoders)和视觉-语言模型(Vision-Language-Model, 如CLIP)建立文本描述与位姿分布之间的联系,并通过3D高斯泼溅(3D Gaussian splatting)技术对候选位姿进行优化,使其更符合文本描述。实验表明,该方法在五个大规模数据集上均优于传统的检索方法和基于学习的方法。
链接: https://arxiv.org/abs/2501.08982
作者: Qi Ma,Runyi Yang,Bin Ren,Ender Konukoglu,Luc Van Gool,Danda Pani Paudel
机构: Computer Vision Lab, ETH Zurich(苏黎世联邦理工学院计算机视觉实验室); INSAIT, Sofia University(索非亚大学INSAIT); University of Pisa(比萨大学); University of Trento(特伦托大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Localizing text descriptions in large-scale 3D scenes is inherently an ambiguous task. This nonetheless arises while describing general concepts, e.g. all traffic lights in a city. To facilitate reasoning based on such concepts, text localization in the form of distribution is required. In this paper, we generate the distribution of the camera poses conditioned upon the textual description. To facilitate such generation, we propose a diffusion-based architecture that conditionally diffuses the noisy 6DoF camera poses to their plausible locations. The conditional signals are derived from the text descriptions, using the pre-trained text encoders. The connection between text descriptions and pose distribution is established through pretrained Vision-Language-Model, i.e. CLIP. Furthermore, we demonstrate that the candidate poses for the distribution can be further refined by rendering potential poses using 3D Gaussian splatting, guiding incorrectly posed samples towards locations that better align with the textual description, through visual reasoning. We demonstrate the effectiveness of our method by comparing it with both standard retrieval methods and learning-based approaches. Our proposed method consistently outperforms these baselines across all five large-scale datasets. Our source code and dataset will be made publicly available. 
zh
[CV-5] An analysis of data variation and bias in image-based dermatological datasets for machine learning classification
【速读】:该论文旨在解决在临床皮肤病学中,基于智能手机摄像头拍摄的RGB图像进行恶性皮肤病变分类时,由于数据分布差异(如分辨率、环境控制、肤色变化、视角变化、数据噪声和标签噪声以及类别不平衡)导致的模型性能下降问题。论文的关键解决方案是通过评估皮肤镜(dermoscopic)数据集与临床数据集之间的分布差异,理解这些差异对模型训练的影响,并探讨如何通过结合不同分布的数据来减少对模型最终准确性的负面影响。具体来说,论文通过实验评估了不同架构下的模型表现,并提出了如何有效利用迁移学习(transfer learning)来应对临床图像数据量不足和分布差异的挑战。
链接: https://arxiv.org/abs/2501.08962
作者: Francisco Mauro,Emanoel Thyago,Othon Vinicius,Rodrigo Abreu,Kelvin Cunha,José Gabriel,Rafael Barros,Thales Bezerra,Manoel Henriques,Natalia Lopes,Érico Moutinho,Jéssica Guido,Tsang Ing Ren,Paulo Borba
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 10 pages, 1 figure
点击查看摘要
Abstract:AI algorithms have become valuable in aiding professionals in healthcare. The increasing confidence obtained by these models is helpful in critical decision demands. In clinical dermatology, classification models can detect malignant lesions on patients’ skin using only RGB images as input. However, most learning-based methods employ data acquired from dermoscopic datasets on training, which are large and validated by a gold standard. Clinical models aim to deal with classification on users’ smartphone cameras that do not contain the corresponding resolution provided by dermoscopy. Also, clinical applications bring new challenges. It can contain captures from uncontrolled environments, skin tone variations, viewpoint changes, noises in data and labels, and unbalanced classes. A possible alternative would be to use transfer learning to deal with the clinical images. However, as the number of samples is low, it can cause degradations on the model’s performance; the source distribution used in training differs from the test set. This work aims to evaluate the gap between dermoscopic and clinical samples and understand how the dataset variations impact training. It assesses the main differences between distributions that disturb the model’s prediction. Finally, from experiments on different architectures, we argue how to combine the data from divergent distributions, decreasing the impact on the model’s final accuracy.
zh
[CV-6] Visual WetlandBirds Dataset: Bird Species Identification and Behavior Recognition in Videos
【速读】:该论文试图解决当前生物多样性丧失危机中鸟类监测数据稀缺的问题,特别是缺乏包含详细鸟类行为注释的视频数据集。为解决这一问题,研究团队首次引入了一个细粒度的视频数据集,专门用于鸟类行为检测和物种分类。该数据集的关键在于其提供了西班牙湿地中13种不同鸟类执行7种不同行为类别的178个视频,填补了现有数据集的空白。此外,研究还展示了使用最先进模型在鸟类行为识别和物种分类任务上的基线结果,为开发深度学习模型以识别鸟类行为提供了基础,类似于人类行为识别领域的进展。
链接: https://arxiv.org/abs/2501.08931
作者: Javier Rodriguez-Juan,David Ortiz-Perez,Manuel Benavent-Lledo,David Mulero-Pérez,Pablo Ruiz-Ponce,Adrian Orihuela-Torres,Jose Garcia-Rodriguez,Esther Sebastián-González
机构: Department of Computer Technology, University of Alicante (阿利坎特大学计算机技术系); Department of Ecology, University of Alicante (阿利坎特大学生态学系); ‘Ramón Margalef’ Multidisciplinary Institute for the study of the Environment, University of Alicante (阿利坎特大学环境研究多学科研究所)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:The current biodiversity loss crisis makes animal monitoring a relevant field of study. In light of this, data collected through monitoring can provide essential insights, and information for decision-making aimed at preserving global biodiversity. Despite the importance of such data, there is a notable scarcity of datasets featuring videos of birds, and none of the existing datasets offer detailed annotations of bird behaviors in video format. In response to this gap, our study introduces the first fine-grained video dataset specifically designed for bird behavior detection and species classification. This dataset addresses the need for comprehensive bird video datasets and provides detailed data on bird actions, facilitating the development of deep learning models to recognize these, similar to the advancements made in human action recognition. The proposed dataset comprises 178 videos recorded in Spanish wetlands, capturing 13 different bird species performing 7 distinct behavior classes. In addition, we also present baseline results using state of the art models on two tasks: bird behavior recognition and species classification.
zh
[CV-7] Learning Joint Denoising Demosaicing and Compression from the Raw Natural Image Noise Dataset
【速读】:该论文旨在解决图像去噪(denoising)模型在不同传感器、图像处理流程和风格之间的泛化问题。为此,作者提出了Raw Natural Image Noise Dataset (RawNIND),这是一个多样化的原始图像数据集,用于支持去噪模型的开发。论文提出了两种去噪方法:第一种方法直接在原始Bayer数据(raw Bayer data)上进行处理,具有较高的计算效率;第二种方法则处理线性RGB图像(linear RGB images),以提高对不同传感器的泛化能力。这两种方法都保留了后续图像处理的灵活性,并且在性能上优于传统的基于处理后的图像的去噪方法。此外,论文还展示了在原始数据层面集成去噪和压缩(compression)可以显著提升率失真性能(rate-distortion performance)和计算效率。这些发现表明,采用原始数据工作流(raw data workflows)可以实现更高效和灵活的图像处理。
链接: https://arxiv.org/abs/2501.08924
作者: Benoit Brummer,Christophe De Vleeschouwer
机构: University of Louvain, Louvain-la-Neuve, Belgium (鲁汶大学, 鲁汶拉讷夫, 比利时)
类目: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
备注:
点击查看摘要
Abstract:This paper introduces the Raw Natural Image Noise Dataset (RawNIND), a diverse collection of paired raw images designed to support the development of denoising models that generalize across sensors, image development workflows, and styles. Two denoising methods are proposed: one operates directly on raw Bayer data, leveraging computational efficiency, while the other processes linear RGB images for improved generalization to different sensors, with both preserving flexibility for subsequent development. Both methods outperform traditional approaches which rely on developed images. Additionally, the integration of denoising and compression at the raw data level significantly enhances rate-distortion performance and computational efficiency. These findings suggest a paradigm shift toward raw data workflows for efficient and flexible image processing.
zh
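上文提到 RawNIND 的一种去噪方法直接在原始 Bayer 数据上运行。这类网络常见的预处理步骤,是把单通道的 RGGB 马赛克打包成半分辨率的 4 通道张量,使每个通道在空间上对齐。下面是一个极简示意(假设 RGGB 排列;论文实际采用的打包与归一化流程可能不同):

```python
import numpy as np

def pack_bayer_rggb(raw):
    """将 (H, W) 的 RGGB Bayer 马赛克打包为 (H/2, W/2, 4) 张量。

    通道顺序为 R、G(红行)、G(蓝行)、B。仅作示意,
    实际流水线还需处理黑电平、白平衡等步骤。
    """
    return np.stack([raw[0::2, 0::2],   # R
                     raw[0::2, 1::2],   # G(红行)
                     raw[1::2, 0::2],   # G(蓝行)
                     raw[1::2, 1::2]],  # B
                    axis=-1)
```

打包后卷积网络无需再区分 Bayer 相位,这也是此类方法计算效率较高的原因之一。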
[CV-8] Empowering Agricultural Insights: RiceLeafBD - A Novel Dataset and Optimal Model Selection for Rice Leaf Disease Diagnosis through Transfer Learning Technique
【速读】:该论文试图解决水稻(Oryza sativa)病害早期检测的难题,特别是在孟加拉国等农业国家中,由于人口增长和耕地减少,水稻病害导致的产量下降已成为粮食危机的主要威胁。论文的关键解决方案是通过深度学习(deep learning)和迁移学习(transfer learning)模型,利用从孟加拉国田间收集的数据集进行病害检测。研究采用了轻量级卷积神经网络(CNN)模型以及预训练的InceptionNet-V2、EfficientNet-V2和MobileNet-V2模型,其中EfficientNet-V2模型在数据集上的表现达到了91.5%的准确率,优于其他模型并超越了现有技术水平。研究表明,该数据集能够精确有效地识别水稻叶片病害,为减少水稻病害提供了重要的研究基础。
链接: https://arxiv.org/abs/2501.08912
作者: Sadia Afrin Rimi,Md. Jalal Uddin Chowdhury,Rifat Abdullah,Iftekhar Ahmed,Mahrima Akter Mim,Mohammad Shoaib Rahman
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:The number of people living in this agricultural nation of ours, which is surrounded by lush greenery, is growing on a daily basis. As a result of this, the level of arable land is decreasing, as well as residential houses and industrial factories. The food crisis is becoming the main threat for us in the upcoming days. Because on the one hand, the population is increasing, and on the other hand, the amount of food crop production is decreasing due to the attack of diseases. Rice is one of the most significant cultivated crops since it provides food for more than half of the world’s population. Bangladesh is dependent on rice (Oryza sativa) as a vital crop for its agriculture, but it faces a significant problem as a result of the ongoing decline in rice yield brought on by common diseases. Early disease detection is the main difficulty in rice crop cultivation. In this paper, we proposed our own dataset, which was collected from the Bangladesh field, and also applied deep learning and transfer learning models for the evaluation of the datasets. We elaborately explain our dataset and also give direction for further research work to serve society using this dataset. We applied a light CNN model and pre-trained InceptionNet-V2, EfficientNet-V2, and MobileNet-V2 models, which achieved 91.5% performance for the EfficientNet-V2 model of this work. The results obtained assaulted other models and even exceeded approaches that are considered to be part of the state of the art. It has been demonstrated by this study that it is possible to precisely and effectively identify diseases that affect rice leaves using this unbiased datasets. After analysis of the performance of different models, the proposed datasets are significant for the society for research work to provide solutions for decreasing rice leaf disease.
zh
[CV-9] Lights Camera Matching: The Role of Image Illumination in Fair Face Recognition
【速读】:该论文旨在解决面部亮度(facial brightness)对跨人口群体(如白种人和非裔美国人女性)人脸识别准确率差异的影响问题。具体而言,研究试图通过平衡不同人口群体之间的面部亮度,减少白种人和非裔美国人女性配对图像对的相似度评分分布之间的准确率差距(measured by d’ between distributions)。解决方案的关键在于开展三组实验来平衡各人口群体的面部亮度:将人脸皮肤区域的亮度分别解释为中值像素值(median pixel value)或像素值分布(distribution of pixel values)进行平衡。实验结果表明,仅基于中值亮度的平衡可使d’减少高达46.8%,而基于亮度分布的平衡可使d’减少高达57.6%。此外,所有实验均提高了个体分布的相似度评分,白种人女性的平均评分最大提高了5.9%,非裔美国人女性的平均评分最大提高了3.7%。
链接: https://arxiv.org/abs/2501.08910
作者: Gabriella Pangelinan,Grace Bezold,Haiyu Wu,Michael C. King,Kevin W. Bowyer
机构: Florida Institute of Technology(佛罗里达理工学院); University of Notre Dame(圣母大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 14 pages, 11 figures, Conference submission
点击查看摘要
Abstract:Facial brightness is a key image quality factor impacting face recognition accuracy differentials across demographic groups. In this work, we aim to decrease the accuracy gap between the similarity score distributions for Caucasian and African American female mated image pairs, as measured by d’ between distributions. To balance brightness across demographic groups, we conduct three experiments, interpreting brightness in the face skin region either as median pixel value or as the distribution of pixel values. Balancing based on median brightness alone yields up to a 46.8% decrease in d’, while balancing based on brightness distribution yields up to a 57.6% decrease. In all three cases, the similarity scores of the individual distributions improve, with mean scores maximally improving 5.9% for Caucasian females and 3.7% for African American females.
zh
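上述基于中值亮度的平衡思路可以用一个小草图表示:对人脸皮肤区域的像素整体平移,使其中值等于目标中值(论文中皮肤区域的提取与截断策略可能更复杂,此处仅为假设性示意):

```python
import numpy as np

def balance_median_brightness(face_pixels, target_median):
    """平移像素值, 使区域中值等于 target_median, 并截断到 [0, 255]。

    仅为中值亮度平衡的玩具示意, 非论文原始实现。
    """
    shift = target_median - np.median(face_pixels)
    return np.clip(face_pixels + shift, 0, 255)
```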
[CV-10] Enhanced Multi-Scale Cross-Attention for Person Image Generation ECCV2020
【速读】:该论文试图解决的是人物图像生成任务中的挑战性问题,特别是如何有效地融合不同模态的特征(如外观和形状)以生成高质量的人物图像。解决方案的关键在于提出了一种基于交叉注意力机制(cross-attention)的生成对抗网络(GAN),称为XingGAN(或CrossingGAN)。该网络包含两个生成分支,分别捕捉人物的外观和形状特征。此外,论文提出了两种新颖的交叉注意力块(cross-attention blocks),用于有效地传递和更新人物的形状和外观嵌入(embeddings),从而实现相互提升。为了进一步学习不同姿态和尺度下的长程相关性,论文还引入了多尺度交叉注意力块(multi-scale cross-attention blocks)。为了解决交叉注意力机制中独立相关性计算导致的噪声和模糊注意力权重问题,论文提出了增强注意力模块(enhanced attention, EA)。最后,论文还引入了一种密集连接的共注意力模块(densely connected co-attention module),用于在不同阶段有效地融合外观和形状特征。实验结果表明,该方法在生成质量和速度上均优于现有的GAN方法,并与基于扩散模型(diffusion-based methods)的方法性能相当,但训练和推理速度显著更快。
链接: https://arxiv.org/abs/2501.08900
作者: Hao Tang,Ling Shao,Nicu Sebe,Luc Van Gool
机构: School of Computer Science, Peking University (北京大学计算机学院); UCAS-Terminus AI Lab, University of Chinese Academy of Sciences (中国科学院大学UCAS-Terminus AI实验室); Department of Information Engineering and Computer Science (DISI), University of Trento (特伦托大学信息工程与计算机科学系); Department of Information Technology and Electrical Engineering, ETH Zurich (苏黎世联邦理工学院信息技术与电气工程系); Department of Electrical Engineering, KU Leuven (鲁汶大学电气工程系); INSAIT, Sofia University (索非亚大学INSAIT)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to TPAMI, an extended version of a paper published in ECCV2020. arXiv admin note: substantial text overlap with arXiv:2007.09278
点击查看摘要
Abstract:In this paper, we propose a novel cross-attention-based generative adversarial network (GAN) for the challenging person image generation task. Cross-attention is a novel and intuitive multi-modal fusion method in which an attention/correlation matrix is calculated between two feature maps of different modalities. Specifically, we propose the novel XingGAN (or CrossingGAN), which consists of two generation branches that capture the person’s appearance and shape, respectively. Moreover, we propose two novel cross-attention blocks to effectively transfer and update the person’s shape and appearance embeddings for mutual improvement. This has not been considered by any other existing GAN-based image generation work. To further learn the long-range correlations between different person poses at different scales and sub-regions, we propose two novel multi-scale cross-attention blocks. To tackle the issue of independent correlation computations within the cross-attention mechanism leading to noisy and ambiguous attention weights, which hinder performance improvements, we propose a module called enhanced attention (EA). Lastly, we introduce a novel densely connected co-attention module to fuse appearance and shape features at different stages effectively. Extensive experiments on two public datasets demonstrate that the proposed method outperforms current GAN-based methods and performs on par with diffusion-based methods. However, our method is significantly faster than diffusion-based methods in both training and inference.
zh
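交叉注意力的核心,即摘要中所说的"在两个不同模态的特征图之间计算注意力/相关性矩阵",可以用如下最简化的示意代码表示(XingGAN 的实际模块还包含投影、残差连接和多尺度设计,此处全部省略):

```python
import numpy as np

def cross_attention(shape_feat, appear_feat):
    """用外观特征更新形状特征的最简交叉注意力。

    shape_feat: (N, d) 形状分支的查询特征
    appear_feat: (M, d) 外观分支的键/值特征
    返回 (N, d) 融合了外观上下文的形状特征。
    """
    d = shape_feat.shape[-1]
    scores = shape_feat @ appear_feat.T / np.sqrt(d)  # (N, M) 相关性矩阵
    scores -= scores.max(axis=-1, keepdims=True)      # 数值稳定性
    attn = np.exp(scores)
    attn /= attn.sum(axis=-1, keepdims=True)          # 按行 softmax
    return attn @ appear_feat                         # 聚合外观特征
```

论文提出的增强注意力(EA)模块正是针对这种独立相关性计算可能产生的噪声注意力权重所做的改进。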
[CV-11] Feature-based One-For-All: A Universal Framework for Heterogeneous Knowledge Distillation
【速读】:该论文试图解决知识蒸馏(Knowledge Distillation, KD)在异构架构模型之间的知识迁移问题。随着技术的发展,模型架构从最初的卷积神经网络(Convolutional Neural Networks, CNNs)发展到视觉变换器(Vision Transformers, ViTs)和多层感知器(Multi-Layer Perceptrons, MLPs),而传统的知识蒸馏方法通常假设教师模型和学生模型是同构的,无法有效处理异构架构之间的知识迁移。为了解决这一问题,论文提出了一种基于特征的通用(Feature-based One-For-All, FOFA)知识蒸馏框架。该框架的关键在于两个核心组件:首先,设计了包含学生反馈的提示调优块(prompt tuning blocks),使教师模型的特征能够适应学生模型的学习过程;其次,提出了区域感知注意力机制(region-aware attention),以缓解异构架构之间的视角不匹配问题。通过这两个模块,论文实现了在异构架构之间有效蒸馏中间特征的目标,并在CIFAR、ImageNet和COCO数据集上验证了该方法的优越性。
链接: https://arxiv.org/abs/2501.08885
作者: Jhe-Hao Lin,Yi Yao,Chan-Feng Hsu,Hongxia Xie,Hong-Han Shuai,Wen-Huang Cheng
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Knowledge distillation (KD) involves transferring knowledge from a pre-trained heavy teacher model to a lighter student model, thereby reducing the inference cost while maintaining comparable effectiveness. Prior KD techniques typically assume homogeneity between the teacher and student models. However, as technology advances, a wide variety of architectures have emerged, ranging from initial Convolutional Neural Networks (CNNs) to Vision Transformers (ViTs), and Multi-Level Perceptrons (MLPs). Consequently, developing a universal KD framework compatible with any architecture has become an important research topic. In this paper, we introduce a feature-based one-for-all (FOFA) KD framework to enable feature distillation across diverse architecture. Our framework comprises two key components. First, we design prompt tuning blocks that incorporate student feedback, allowing teacher features to adapt to the student model’s learning process. Second, we propose region-aware attention to mitigate the view mismatch problem between heterogeneous architecture. By leveraging these two modules, effective distillation of intermediate features can be achieved across heterogeneous architectures. Extensive experiments on CIFAR, ImageNet, and COCO demonstrate the superiority of the proposed method.
zh
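基于特征的知识蒸馏通常先用一个投影器把学生特征对齐到教师的特征维度,再计算两者之间的 L2 损失。下面是一个通用示意(FOFA 的提示调优块与区域感知注意力并未在此复现,投影矩阵 proj 为假设的可学习参数):

```python
import numpy as np

def feature_kd_loss(student_feat, teacher_feat, proj):
    """投影对齐后的 L2 特征蒸馏损失。

    student_feat: (N, ds), teacher_feat: (N, dt), proj: (ds, dt)。
    仅为特征蒸馏的一般形式, 非论文原始实现。
    """
    aligned = student_feat @ proj      # 对齐到教师维度
    diff = aligned - teacher_feat
    return float(np.mean(diff ** 2))   # 均方误差
```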
[CV-12] Generative Planning with 3D-vision Language Pre-training for End-to-End Autonomous Driving
【速读】:该论文试图解决自动驾驶领域中的视觉理解、决策推理和场景泛化等挑战。现有的基于视觉的端到端模型虽然在感知和理解周围环境方面取得了一定成果,但在这些方面仍存在不足。为了解决这些问题,论文提出了一种名为GPVL的生成式规划模型,该模型结合了3D视觉语言预训练(3D-vision language pre-training)和跨模态语言模型(cross-modal language model)。关键解决方案包括两个方面:首先,设计了一个3D视觉语言预训练模块,旨在桥接鸟瞰图中的视觉感知与语言理解之间的差距;其次,引入了一个跨模态语言模型,以自回归的方式生成整体的驾驶决策和细粒度的轨迹,结合感知和导航信息。通过在nuScenes数据集上的实验,GPVL展示了优异的性能、强大的泛化能力以及处理各种场景中高级命令的实时潜力,这对于未来自动驾驶系统的实际应用至关重要。
链接: https://arxiv.org/abs/2501.08861
作者: Tengpeng Li,Hanli Wang,Xianfei Li,Wenlong Liao,Tao He,Pai Peng
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Autonomous driving is a challenging task that requires perceiving and understanding the surrounding environment for safe trajectory planning. While existing vision-based end-to-end models have achieved promising results, these methods are still facing the challenges of vision understanding, decision reasoning and scene generalization. To solve these issues, a generative planning with 3D-vision language pre-training model named GPVL is proposed for end-to-end autonomous driving. The proposed paradigm has two significant aspects. On one hand, a 3D-vision language pre-training module is designed to bridge the gap between visual perception and linguistic understanding in the bird’s eye view. On the other hand, a cross-modal language model is introduced to generate holistic driving decisions and fine-grained trajectories with perception and navigation information in an auto-regressive manner. Experiments on the challenging nuScenes dataset demonstrate that the proposed scheme achieves excellent performances compared with state-of-the-art methods. Besides, the proposed GPVL presents strong generalization ability and real-time potential when handling high-level commands in various scenarios. It is believed that the effective, robust and efficient performance of GPVL is crucial for the practical application of future autonomous driving systems. Code is available at this https URL
zh
[CV-13] Exploring Task-Level Optimal Prompts for Visual In-Context Learning
【速读】:该论文试图解决视觉上下文学习(Visual In-Context Learning, VICL)中为每个测试样本寻找最优提示(prompt)所产生的高计算成本问题。当前,为每个测试样本构建提示需要大量的计算资源,这阻碍了VICL的实际部署。论文发现了一个反直觉的现象:大多数测试样本在相同的提示下都能达到最优性能,而为每个样本单独搜索提示不仅耗时,且最终得到的提示几乎完全相同。基于这一发现,论文提出了任务级提示(task-level prompting)方法,以减少推理阶段搜索提示的成本,并引入了两种节省时间且有效的任务级提示搜索策略。实验结果表明,该方法能够以极低的成本识别接近最优的提示,并达到最佳的VICL性能,这是以往工作未曾实现的。
链接: https://arxiv.org/abs/2501.08841
作者: Yan Zhu,Huan Ma,Changqing Zhang
机构: 未知
类目: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:With the development of Vision Foundation Models (VFMs) in recent years, Visual In-Context Learning (VICL) has become a better choice compared to modifying models in most scenarios. Different from retraining or fine-tuning model, VICL does not require modifications to the model’s weights or architecture, and only needs a prompt with demonstrations to teach VFM how to solve tasks. Currently, significant computational cost for finding optimal prompts for every test sample hinders the deployment of VICL, as determining which demonstrations to use for constructing prompts is very costly. In this paper, however, we find a counterintuitive phenomenon that most test samples actually achieve optimal performance under the same prompts, and searching for sample-level prompts only costs more time but results in completely identical prompts. Therefore, we propose task-level prompting to reduce the cost of searching for prompts during the inference stage and introduce two time-saving yet effective task-level prompt search strategies. Extensive experimental results show that our proposed method can identify near-optimal prompts and reach the best VICL performance with a minimal cost that prior work has never achieved.
zh
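任务级提示搜索的基本思想是:在验证集上为整个任务挑选一个平均得分最高的提示,而不是逐测试样本搜索。下面是一个穷举式的示意(score_fn 为假设的占位函数,代表运行一次视觉基础模型并打分;论文提出的两种搜索策略比这种穷举更省时):

```python
def select_task_level_prompt(candidate_prompts, val_samples, score_fn):
    """在候选提示中选出验证集平均得分最高的一个。

    score_fn(prompt, sample) -> float 为假设的评分接口。
    仅为任务级提示搜索思想的示意。
    """
    def avg_score(prompt):
        return sum(score_fn(prompt, s) for s in val_samples) / len(val_samples)
    return max(candidate_prompts, key=avg_score)
```

由于论文观察到多数测试样本在同一提示下即可达到最优,这样选出的单个提示在推理阶段可被所有样本复用。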
[CV-14] MANTA: Diffusion Mamba for Efficient and Effective Stochastic Long-Term Dense Anticipation
【速读】:该论文致力于解决随机长期密集预测(stochastic long-term dense anticipation)问题,其目标是根据提供的视频观测数据,预测未来几分钟内的动作及其持续时间。由于长时间跨度的预测引入了高度不确定性,单一观测可能导致多种可能的未来结果。为解决这一问题,论文提出了使用随机模型(stochastic models)来预测多个潜在的未来动作序列。近期研究进一步提出通过同时预测每帧的过去和未来动作,以统一方式对观测帧的不确定性进行建模。然而,现有方法由于受限和/或稀疏的感受野(receptive field),难以实现长距离的时间理解。为此,论文提出了一种新颖的MANTA(MAmba for ANTicipation)网络,该模型能够在保持序列长度线性复杂度的同时,有效进行长时间跨度的时序建模。实验表明,该方法在Breakfast、50Salads和Assembly101三个数据集上取得了最先进的结果,并显著提升了计算和内存效率。
链接: https://arxiv.org/abs/2501.08837
作者: Olga Zatsarynna,Emad Bahrami,Yazan Abu Farha,Gianpiero Francesca,Juergen Gall
机构: University of Bonn(波恩大学); Birzeit University(比尔泽特大学); Toyota Motor Europe(丰田汽车欧洲); Lamarr Institute for Machine Learning and Artificial Intelligence(拉马尔机器学习和人工智能研究所)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Our work addresses the problem of stochastic long-term dense anticipation. The goal of this task is to predict actions and their durations several minutes into the future based on provided video observations. Anticipation over extended horizons introduces high uncertainty, as a single observation can lead to multiple plausible future outcomes. To address this uncertainty, stochastic models are designed to predict several potential future action sequences. Recent work has further proposed to incorporate uncertainty modelling for observed frames by simultaneously predicting per-frame past and future actions in a unified manner. While such joint modelling of actions is beneficial, it requires long-range temporal capabilities to connect events across distant past and future time points. However, the previous work struggles to achieve such a long-range understanding due to its limited and/or sparse receptive field. To alleviate this issue, we propose a novel MANTA (MAmba for ANTicipation) network. Our model enables effective long-term temporal modelling even for very long sequences while maintaining linear complexity in sequence length. We demonstrate that our approach achieves state-of-the-art results on three datasets - Breakfast, 50Salads, and Assembly101 - while also significantly improving computational and memory efficiency.
zh
[CV-15] IDEA: Image Description Enhanced CLIP-Adapter
【速读】:该论文旨在解决将CLIP(Contrastive Language-Image Pre-training)模型应用于少样本图像分类任务时,未能充分利用图像-文本对之间的互补信息和相关性的问题。当前的解决方案主要集中在文本的提示学习(prompt learning)或视觉的适配器调优(adapter tuning),而忽略了图像和文本之间的协同作用。为此,论文提出了一种名为Image Description Enhanced CLIP-Adapter(IDEA)的方法,通过同时利用图像的视觉特征和文本描述来捕捉细粒度特征,从而提升CLIP在少样本分类任务中的表现。IDEA是一种无需训练的CLIP适配方法,能够在多个任务上达到或超越现有最先进模型的性能。此外,论文还提出了Trainable-IDEA(T-IDEA),通过引入两个轻量级可学习组件(即投影器和可学习的潜在空间)进一步提升了模型性能,并在11个数据集上取得了SOTA(state-of-the-art)结果。关键贡献之一是使用Llama模型设计了一个全面的流程,为11个数据集的图像生成文本描述,构建了包含1,637,795个图像-文本对的“IMD-11”数据集。
链接: https://arxiv.org/abs/2501.08816
作者: Zhipeng Ye,Feng Jiang,Qiufeng Wang,Kaizhu Huang,Jiaqi Huang
机构: Nanjing University of Science and Technology (南京理工大学); Xi’an Jiaotong-Liverpool University (西交利物浦大学); Duke University (杜克大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
点击查看摘要
Abstract:CLIP (Contrastive Language-Image Pre-training) has attained great success in pattern recognition and computer vision. Transferring CLIP to downstream tasks (e.g. zero- or few-shot classification) is a hot topic in multimodal learning. However, current studies primarily focus on either prompt learning for text or adapter tuning for vision, without fully exploiting the complementary information and correlations among image-text pairs. In this paper, we propose an Image Description Enhanced CLIP-Adapter (IDEA) method to adapt CLIP to few-shot image classification tasks. This method captures fine-grained features by leveraging both visual features and textual descriptions of images. IDEA is a training-free method for CLIP, and it can be comparable to or even exceeds state-of-the-art models on multiple tasks. Furthermore, we introduce Trainable-IDEA (T-IDEA), which extends IDEA by adding two lightweight learnable components (i.e., a projector and a learnable latent space), further enhancing the model’s performance and achieving SOTA results on 11 datasets. As one important contribution, we employ the Llama model and design a comprehensive pipeline to generate textual descriptions for images of 11 datasets, resulting in a total of 1,637,795 image-text pairs, named “IMD-11”. Our code and data are released at this https URL.
zh
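像 IDEA 这样无需训练的 CLIP 适配方法,通常采用缓存模型(如 Tip-Adapter 一类)的思路:把零样本 CLIP logits 与测试特征对少样本键特征的相似度相融合。下面是一个示意(alpha、beta 为假设的超参数;IDEA 还额外融合了图像的文本描述特征,此处省略):

```python
import numpy as np

def cache_adapter_logits(test_feat, cache_keys, cache_values,
                         clip_logits, alpha=1.0, beta=5.5):
    """缓存式少样本分类 logits 的示意。

    test_feat: (d,) 已归一化的测试图像特征
    cache_keys: (K, d) 少样本图像特征, cache_values: (K, C) one-hot 标签
    clip_logits: (C,) 零样本 CLIP logits。仅为此类方法的一般形式。
    """
    affinity = test_feat @ cache_keys.T                      # 余弦相似度
    cache_logits = np.exp(-beta * (1 - affinity)) @ cache_values
    return clip_logits + alpha * cache_logits
```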
[CV-16] Human Pose-Constrained UV Map Estimation
【速读】:该论文试图解决在计算机视觉中用于人体姿态或活动详细分析的UV映射(UV map)估计问题。传统方法通过独立比较像素描述符将像素分配到身体模型顶点,缺乏全局一致性和合理性。论文提出的解决方案是Pose-Constrained Continuous Surface Embeddings (PC-CSE),其关键是将估计的2D人体姿态整合到像素到顶点的分配过程中。通过引入姿态提供的全局解剖学约束,确保UV映射在保持局部精度的同时具有全局一致性。实验表明,该方法在DensePose COCO数据集上表现出一致的改进,且不受所选2D人体姿态模型的影响。全身姿态通过包含手和脚的额外细节提供了更好的约束,减少了无效映射并增强了解剖学合理性。此外,论文还指出了地面真实标注中的不一致性。
链接: https://arxiv.org/abs/2501.08815
作者: Matej Suchanek,Miroslav Purkrabek,Jiri Matas
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:UV map estimation is used in computer vision for detailed analysis of human posture or activity. Previous methods assign pixels to body model vertices by comparing pixel descriptors independently, without enforcing global coherence or plausibility in the UV map. We propose Pose-Constrained Continuous Surface Embeddings (PC-CSE), which integrates estimated 2D human pose into the pixel-to-vertex assignment process. The pose provides global anatomical constraints, ensuring that UV maps remain coherent while preserving local precision. Evaluation on DensePose COCO demonstrates consistent improvement, regardless of the chosen 2D human pose model. Whole-body poses offer better constraints by incorporating additional details about the hands and feet. Conditioning UV maps with human pose reduces invalid mappings and enhances anatomical plausibility. In addition, we highlight inconsistencies in the ground-truth annotations.
zh
[CV-17] Multi-visual modality micro drone-based structural damage detection
【速读】:该论文旨在解决结构损伤检测中目标检测器(object detector)的鲁棒性问题,特别是在复杂环境下的泛化能力不足。目标检测器的鲁棒性对于确保民用基础设施的持续使用至关重要。论文提出的解决方案DetectorX框架通过引入两个创新模块来增强目标检测器的鲁棒性:一是动态视觉模态的stem block,二是螺旋池化技术(spiral pooling technique)。stem block通过结合两个深度卷积神经网络(DCNN)模型的输出,引入了动态视觉模态,并结合事件驱动的奖励强化学习(event-based reward reinforcement learning)来约束父模型和子模型的行为,从而增强感知和适应能力。螺旋池化技术则通过在线图像增强方法,拼接螺旋池化与平均/最大池化特征,增加了特征表示,进一步提升了框架的性能。实验结果表明,与YOLOX-m等竞争检测器相比,DetectorX在精度(0.88)、召回率(0.84)、平均精度(0.91)等多项指标上表现令人满意,并在复杂环境中展现出鲁棒性和适应性。
链接: https://arxiv.org/abs/2501.08807
作者: Isaac Osei Agyemang,Liaoyuan Zeng,Jianwen Chen,Isaac Adjei-Mensah,Daniel Acheampong
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Accurate detection and resilience of object detectors in structural damage detection are important in ensuring the continuous use of civil infrastructure. However, achieving robustness in object detectors remains a persistent challenge, impacting their ability to generalize effectively. This study proposes DetectorX, a robust framework for structural damage detection coupled with a micro drone. DetectorX addresses the challenges of object detector robustness by incorporating two innovative modules: a stem block and a spiral pooling technique. The stem block introduces a dynamic visual modality by leveraging the outputs of two Deep Convolutional Neural Network (DCNN) models. The framework employs the proposed event-based reward reinforcement learning to constrain the actions of a parent and child DCNN model leading to a reward. This results in the induction of two dynamic visual modalities alongside the Red, Green, and Blue (RGB) data. This enhancement significantly augments DetectorX’s perception and adaptability in diverse environmental situations. Further, a spiral pooling technique, an online image augmentation method, strengthens the framework by increasing feature representations by concatenating spiraled and average/max pooled features. In three extensive experiments: (1) comparative and (2) robustness, which use the Pacific Earthquake Engineering Research Hub ImageNet dataset, and (3) field-experiment, DetectorX performed satisfactorily across varying metrics, including precision (0.88), recall (0.84), average precision (0.91), mean average precision (0.76), and mean average recall (0.73), compared to the competing detectors including You Only Look Once X-medium (YOLOX-m) and others. The study’s findings indicate that DetectorX can provide satisfactory results and demonstrate resilience in challenging environments.
zh
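摘要中提到的螺旋池化(spiral pooling)通过拼接“螺旋展开特征”与“平均/最大池化特征”来增强表示。下面给出一个基于该文字描述的极简示意(假设性实现,非论文官方代码;螺旋展开顺序与具体拼接方式均为本文推测):

```python
import numpy as np

def spiral_flatten(fmap):
    """按顺时针螺旋顺序展开二维特征图(示意性实现)。"""
    m = fmap.tolist()
    out = []
    while m:
        out += m.pop(0)              # 顶行
        if m and m[0]:
            for row in m:            # 右列,自上而下
                out.append(row.pop())
        if m:
            out += m.pop()[::-1]     # 底行,反向
        if m and m[0]:
            for row in m[::-1]:      # 左列,自下而上
                out.append(row.pop(0))
    return np.array(out)

def spiral_pool_features(fmap):
    """拼接螺旋展开特征与平均/最大池化值,得到增强后的特征向量。"""
    spiral = spiral_flatten(fmap)
    pooled = np.array([fmap.mean(), fmap.max()])
    return np.concatenate([spiral, pooled])

fmap = np.arange(1, 10, dtype=float).reshape(3, 3)
feat = spiral_pool_features(fmap)
```

实际方法作用于 DCNN 特征图并作为在线增广的一部分,此处仅演示特征组织与拼接的方式。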
[CV-18] Exploring ChatGPT for Face Presentation Attack Detection in Zero and Few-Shot in-Context Learning WACV
【速读】:该论文探讨了ChatGPT(特别是GPT-4o)在面部呈现攻击检测(Face Presentation Attack Detection, PAD)中的潜力,旨在解决现有PAD模型在特定场景下性能不足的问题。研究结果表明,GPT-4o在少样本上下文学习(few-shot in-context learning)中表现出色,尤其在提供更多参考数据时性能显著提升。关键解决方案包括使用详细的提示(prompts)来提高模型的评分可靠性,并通过解释性提示(explanation-seeking prompts)增强模型的解释能力。此外,GPT-4o展现出一定的推理能力,能够在少样本场景中准确预测攻击类型(如打印或重放攻击),尽管并未明确指示其进行分类。然而,GPT-4o在零样本任务(zero-shot tasks)中的表现仍不及专门的PAD系统。实验基于SOTERIA数据集的子集进行,确保符合数据隐私法规。这些发现为未来研究提供了基础,以解决更广泛的数据隐私问题并提升跨数据集的泛化能力。
链接: https://arxiv.org/abs/2501.08799
作者: Alain Komaty,Hatef Otroshi Shahreza,Anjith George,Sebastien Marcel
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Cryptography and Security (cs.CR)
备注: Accepted in WACV workshop 2025
点击查看摘要
Abstract:This study highlights the potential of ChatGPT (specifically GPT-4o) as a competitive alternative for Face Presentation Attack Detection (PAD), outperforming several PAD models, including commercial solutions, in specific scenarios. Our results show that GPT-4o demonstrates high consistency, particularly in few-shot in-context learning, where its performance improves as more examples are provided (reference data). We also observe that detailed prompts enable the model to provide scores reliably, a behavior not observed with concise prompts. Additionally, explanation-seeking prompts slightly enhance the model’s performance by improving its interpretability. Remarkably, the model exhibits emergent reasoning capabilities, correctly predicting the attack type (print or replay) with high accuracy in few-shot scenarios, despite not being explicitly instructed to classify attack types. Despite these strengths, GPT-4o faces challenges in zero-shot tasks, where its performance is limited compared to specialized PAD systems. Experiments were conducted on a subset of the SOTERIA dataset, ensuring compliance with data privacy regulations by using only data from consenting individuals. These findings underscore GPT-4o’s promise in PAD applications, laying the groundwork for future research to address broader data privacy concerns and improve cross-dataset generalization. Code available here: this https URL
zh
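摘要指出 GPT-4o 的表现随参考样例(少样本上下文学习)增多而提升,且详细提示比简洁提示更能稳定输出评分。下面以常见的多轮消息格式勾勒少样本提示的组织方式(系统提示措辞、图像描述与标签均为示意,并非论文实际使用的 prompt):

```python
def build_fewshot_pad_prompt(examples, query_desc):
    """构造少样本上下文学习的消息列表(措辞为示意性假设)。
    examples: [(图像描述, 'bonafide' 或 'attack'), ...]
    """
    messages = [{
        "role": "system",
        # 详细提示 + 解释性提示:要求给出 [0,1] 活体评分并说明理由
        "content": ("You are a face presentation attack detection expert. "
                    "Given a face image, output a liveness score in [0, 1] "
                    "and explain your reasoning."),
    }]
    for desc, label in examples:      # 参考数据:逐条加入已标注示例
        messages.append({"role": "user", "content": f"Image: {desc}"})
        messages.append({"role": "assistant", "content": f"Label: {label}"})
    messages.append({"role": "user", "content": f"Image: {query_desc}"})
    return messages

msgs = build_fewshot_pad_prompt(
    [("live face, natural depth", "bonafide"),
     ("paper print, visible glare", "attack")],
    "screen replay with moire pattern")
```

消息条数为 1 条系统提示 + 每个示例 2 条 + 1 条查询,示例越多即“参考数据”越多。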
[CV-19] Admitting Ignorance Helps the Video Question Answering Models to Answer
【速读】:该论文试图解决视频问答(VideoQA)领域中存在的虚假相关性问题。现有方法在训练过程中通常只关注最大化答案与视频-问题对之间的相关性,导致模型容易建立捷径,产生问题与答案之间的虚假相关性,尤其是在视频与文本数据对齐不理想的情况下。为解决这一问题,论文提出了一种新的训练框架,该框架通过干预问题(如位移和扰动)来迫使模型在面对干预问题时承认其无知,而不是仅基于表面的问题-答案相关性进行猜测。关键解决方案包括设计干预问题的方法,并在多选和开放式VideoQA设置中构建模型承认其知识不足的框架。实验结果表明,该框架在仅进行最小结构修改的情况下,显著提升了VideoQA模型的性能。
链接: https://arxiv.org/abs/2501.08771
作者: Haopeng Li,Tom Drummond,Mingming Gong,Mohammed Bennamoun,Qiuhong Ke
机构: School of Computing and Information Systems, University of Melbourne(墨尔本大学计算与信息系统学院); School of Mathematics and Statistics, University of Melbourne(墨尔本大学数学与统计学院); School of Physics, Maths and Computing, The University of Western Australia(西澳大利亚大学物理、数学与计算学院); Department of Data Science & AI, Monash University(莫纳什大学数据科学与人工智能学院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Significant progress has been made in the field of video question answering (VideoQA) thanks to deep learning and large-scale pretraining. Despite the presence of sophisticated model structures and powerful video-text foundation models, most existing methods focus solely on maximizing the correlation between answers and video-question pairs during training. We argue that these models often establish shortcuts, resulting in spurious correlations between questions and answers, especially when the alignment between video and text data is suboptimal. To address these spurious correlations, we propose a novel training framework in which the model is compelled to acknowledge its ignorance when presented with an intervened question, rather than making guesses solely based on superficial question-answer correlations. We introduce methodologies for intervening in questions, utilizing techniques such as displacement and perturbation, and design frameworks for the model to admit its lack of knowledge in both multi-choice VideoQA and open-ended settings. In practice, we integrate a state-of-the-art model into our framework to validate its effectiveness. The results clearly demonstrate that our framework can significantly enhance the performance of VideoQA models with minimal structural modifications.
zh
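该论文的核心训练思路是:对问题施加干预(如词序位移、扰动)后,监督目标改为“承认无知”的选项,而非让模型基于表面相关性猜测答案。下面是这一思路在多选设置下的极简示意(干预方式与选项构造均为本文假设,非论文确切实现):

```python
import random

IDK = "I don't know"

def intervene_question(question, rng):
    """通过词序扰动(displacement)构造干预问题(示意性实现)。"""
    words = question.split()
    rng.shuffle(words)
    return " ".join(words)

def make_training_pairs(question, answer, choices, rng):
    """原问题保留真实答案;干预问题的监督目标改为"承认无知"选项。"""
    choices_idk = choices + [IDK]     # 在候选中加入"不知道"选项
    return [
        (question, choices_idk, answer),
        (intervene_question(question, rng), choices_idk, IDK),
    ]

rng = random.Random(0)
pairs = make_training_pairs("what is the man holding", "a cup",
                            ["a cup", "a phone"], rng)
```

真实框架中,干预问题与视频一同输入模型,并通过损失函数迫使模型对干预样本选择“无知”选项。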
[CV-20] Few-Shot Learner Generalizes Across AI-Generated Image Detection
【速读】:该论文旨在解决当前基于大规模合成图像数据集训练的假图像检测器在面对未见过的生成模型时性能显著下降的问题。此外,收集足够的在线生成模型训练数据通常成本高昂或不可行。为解决这些问题,论文提出了一种名为Few-Shot Detector (FSD)的新型AI生成图像检测器。FSD通过学习一个专门的度量空间,利用极少量的样本有效地区分未见过的假图像。实验表明,FSD在GenImage数据集上实现了+7.4%的平均准确率(ACC),并且在不进一步训练的情况下,能够更好地捕捉未见图像中的类内共同特征。
链接: https://arxiv.org/abs/2501.08763
作者: Shiyu Wu,Jing Liu,Jing Li,Yequan Wang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 11 pages, 5 figures
点击查看摘要
Abstract:Current fake image detectors trained on large synthetic image datasets perform satisfactorily on limited studied generative models. However, they suffer a notable performance decline over unseen models. Besides, collecting adequate training data from online generative models is often expensive or infeasible. To overcome these issues, we propose Few-Shot Detector (FSD), a novel AI-generated image detector which learns a specialized metric space to effectively distinguish unseen fake images by utilizing very few samples. Experiments show FSD achieves state-of-the-art performance by +7.4% average ACC on GenImage dataset. More importantly, our method is better able to capture the intra-category common features of unseen images without further training.
zh
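FSD 学习一个专门的度量空间,用极少样本区分未见过的假图像。其判别过程可以用“类别原型 + 最近邻”的度量学习范式来示意(二维玩具嵌入、原型取均值、欧氏距离等均为本文假设,非论文的确切设定):

```python
import numpy as np

def prototypes(support_emb, support_labels, n_cls):
    """每类少量支持样本的嵌入取均值,得到类别原型。"""
    return np.stack([support_emb[support_labels == c].mean(axis=0)
                     for c in range(n_cls)])

def classify(query_emb, protos):
    """查询嵌入按到各原型的欧氏距离最近者分类(度量空间判别示意)。"""
    d = np.linalg.norm(query_emb[:, None, :] - protos[None, :, :], axis=-1)
    return d.argmin(axis=1)

# 类别 0 = 真实图像,1 = AI 生成图像;每类仅 2 个支持样本
support = np.array([[0.0, 0.0], [0.2, 0.0], [3.0, 3.0], [3.0, 3.2]])
labels = np.array([0, 0, 1, 1])
protos = prototypes(support, labels, n_cls=2)
pred = classify(np.array([[0.1, 0.1], [2.9, 3.1]]), protos)
```

面对新的生成模型,只需替换支持样本即可更新原型,无需重新训练嵌入网络,这正是少样本检测的吸引力所在。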
[CV-21] InfoHier: Hierarchical Information Extraction via Encoding and Embedding
【速读】:该论文试图解决在处理大规模、复杂且高维数据(如图像)时,自监督学习(SSL)通常忽略数据中的多层次关系,而传统的层次聚类(HC)方法又难以捕捉多样化数据类型的复杂性的问题。解决方案的关键在于提出了一个名为 InfoHier 的框架,该框架将自监督学习与层次聚类相结合,通过联合学习鲁棒的潜在表示和层次结构来克服上述挑战。具体而言,InfoHier 利用自监督学习提供自适应表示,增强层次聚类捕捉复杂模式的能力,同时通过引入层次聚类损失来优化自监督学习的训练过程,从而使表示更加符合数据的潜在信息层次结构。这一方法有望提升聚类和表示学习的表达能力和性能,为数据分析、管理和信息检索带来显著优势。
链接: https://arxiv.org/abs/2501.08717
作者: Tianru Zhang,Li Ju,Prashant Singh,Salman Toor
机构: 未知
类目: Information Retrieval (cs.IR); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: 10 pages, 4 figures
点击查看摘要
Abstract:Analyzing large-scale datasets, especially involving complex and high-dimensional data like images, is particularly challenging. While self-supervised learning (SSL) has proven effective for learning representations from unlabelled data, it typically focuses on flat, non-hierarchical structures, missing the multi-level relationships present in many real-world datasets. Hierarchical clustering (HC) can uncover these relationships by organizing data into a tree-like structure, but it often relies on rigid similarity metrics that struggle to capture the complexity of diverse data types. To address these challenges, we envision InfoHier, a framework that combines SSL with HC to jointly learn robust latent representations and hierarchical structures. This approach leverages SSL to provide adaptive representations, enhancing HC’s ability to capture complex patterns. Simultaneously, it integrates HC loss to refine SSL training, resulting in representations that are more attuned to the underlying information hierarchy. InfoHier has the potential to improve the expressiveness and performance of both clustering and representation learning, offering significant benefits for data analysis, management, and information retrieval.
zh
[CV-22] Self-supervised Transformation Learning for Equivariant Representations NEURIPS2024
【速读】:该论文试图解决在无监督表示学习(Unsupervised Representation Learning)中,现有方法在处理需要精确特征的任务(如定位或花卉分类)时性能下降的问题。具体来说,传统的基于不变性表示学习的方法通过随机裁剪和颜色抖动等变换来获得语义上相同的输入表示,但这些方法在处理复杂变换时表现不佳。论文提出了一种自监督变换学习(Self-supervised Transformation Learning, STL)方法,通过从图像对中推导出变换表示来替代传统的变换标签,从而捕捉变换敏感的信息。该方法的关键在于确保变换表示是图像不变的,并学习相应的等变变换(equivariant transformations),从而在不增加批次复杂性的情况下提升性能。实验结果表明,该方法在多种分类和检测任务中表现优异,尤其在检测任务中表现突出,并且能够兼容多种基础模型,展示了其广泛的适用性和灵活性。
链接: https://arxiv.org/abs/2501.08712
作者: Jaemyung Yu,Jaehyun Choi,Dong-Jae Lee,HyeongGwon Hong,Junmo Kim
机构: Korea Advanced Institute of Science and Technology (KAIST)(韩国科学技术院)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 38th Conference on Neural Information Processing Systems (NeurIPS 2024)
点击查看摘要
Abstract:Unsupervised representation learning has significantly advanced various machine learning tasks. In the computer vision domain, state-of-the-art approaches utilize transformations like random crop and color jitter to achieve invariant representations, embedding semantically the same inputs despite transformations. However, this can degrade performance in tasks requiring precise features, such as localization or flower classification. To address this, recent research incorporates equivariant representation learning, which captures transformation-sensitive information. However, current methods depend on transformation labels and thus struggle with interdependency and complex transformations. We propose Self-supervised Transformation Learning (STL), replacing transformation labels with transformation representations derived from image pairs. The proposed method ensures transformation representation is image-invariant and learns corresponding equivariant transformations, enhancing performance without increased batch complexity. We demonstrate the approach’s effectiveness across diverse classification and detection tasks, outperforming existing methods in 7 out of 11 benchmarks and excelling in detection. By integrating complex transformations like AugMix, unusable by prior equivariant methods, this approach enhances performance across tasks, underscoring its adaptability and resilience. Additionally, its compatibility with various base models highlights its flexibility and broad applicability. The code is available at this https URL.
zh
[CV-23] RealVVT: Towards Photorealistic Video Virtual Try-on via Spatio-Temporal Consistency
【速读】:该论文试图解决在动态视频序列中保持服装外观一致性和真实性的问题,特别是在单图像虚拟试衣(VTO)任务中,现有方法在长时间视频中难以维持服装的真实表现。这一挑战主要源于捕捉动态人体姿态和保持目标服装特性的复杂性。论文提出的解决方案RealVVT框架,通过利用现有的视频基础模型,引入了三个关键技术:1) 服装时间一致性策略(Clothing Temporal Consistency),确保服装在视频序列中的一致性;2) 无指导注意力聚焦损失机制(Agnostic-guided Attention Focus Loss),用于保证空间一致性;3) 姿态引导的长视频VTO技术(Pose-guided Long Video VTO),专门处理长时间视频的虚拟试衣任务。实验结果表明,该方法在单图像和视频VTO任务中均优于现有最先进模型,为时尚电商和虚拟试衣环境提供了可行的解决方案。
链接: https://arxiv.org/abs/2501.08682
作者: Siqi Li,Zhengkai Jiang,Jiawei Zhou,Zhihong Liu,Xiaowei Chi,Haoqian Wang
机构: Intellifusion(云天励飞); HKUST(香港科技大学); THU(清华大学); FDU(复旦大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
备注: 10 pages (8 pages main text, 2 pages references), 5 figures in the main text, and 4 pages supplementary materials with 3 additional figures
点击查看摘要
Abstract:Virtual try-on has emerged as a pivotal task at the intersection of computer vision and fashion, aimed at digitally simulating how clothing items fit on the human body. Despite notable progress in single-image virtual try-on (VTO), current methodologies often struggle to preserve a consistent and authentic appearance of clothing across extended video sequences. This challenge arises from the complexities of capturing dynamic human pose and maintaining target clothing characteristics. We leverage pre-existing video foundation models to introduce RealVVT, a photoRealistic Video Virtual Try-on framework tailored to bolster stability and realism within dynamic video contexts. Our methodology encompasses a Clothing Temporal Consistency strategy, an Agnostic-guided Attention Focus Loss mechanism to ensure spatial consistency, and a Pose-guided Long Video VTO technique adept at handling extended video sequences. Extensive experiments across various datasets confirm that our approach outperforms existing state-of-the-art models in both single-image and video VTO tasks, offering a viable solution for practical applications within the realms of fashion e-commerce and virtual fitting environments.
zh
[CV-24] FlexiClip: Locality-Preserving Free-Form Character Animation
【速读】:该论文试图解决在保持视觉保真度和时间一致性的同时,为剪贴画(clipart)生成无缝动画的挑战。现有方法如AniClipart虽然在空间变形建模上表现良好,但在确保平滑时间过渡方面存在不足,导致动画中出现突然运动和几何失真等问题。此外,文本到视频(T2V)和图像到视频(I2V)模型在处理剪贴画时也面临统计特性不匹配的困难。论文提出的解决方案FlexiClip通过引入多项关键创新来克服这些限制:首先,使用时间雅可比矩阵(temporal Jacobians)逐步校正运动动力学;其次,通过概率流常微分方程(pfODEs)进行连续时间建模,以减少时间噪声;最后,采用基于GFlowNet原理的流匹配损失(flow matching loss)来优化平滑运动过渡。这些改进确保了在涉及快速运动和非刚性变形的复杂场景中生成连贯的动画。FlexiClip通过将空间和时间建模与预训练的视频扩散模型相结合,为高质量剪贴画动画设定了新标准,并在广泛的视觉内容上表现出色。
链接: https://arxiv.org/abs/2501.08676
作者: Anant Khandelwal
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
备注: 13 pages, 4 figures, 7 tables
点击查看摘要
Abstract:Animating clipart images with seamless motion while maintaining visual fidelity and temporal coherence presents significant challenges. Existing methods, such as AniClipart, effectively model spatial deformations but often fail to ensure smooth temporal transitions, resulting in artifacts like abrupt motions and geometric distortions. Similarly, text-to-video (T2V) and image-to-video (I2V) models struggle to handle clipart due to the mismatch in statistical properties between natural video and clipart styles. This paper introduces FlexiClip, a novel approach designed to overcome these limitations by addressing the intertwined challenges of temporal consistency and geometric integrity. FlexiClip extends traditional Bézier curve-based trajectory modeling with key innovations: temporal Jacobians to correct motion dynamics incrementally, continuous-time modeling via probability flow ODEs (pfODEs) to mitigate temporal noise, and a flow matching loss inspired by GFlowNet principles to optimize smooth motion transitions. These enhancements ensure coherent animations across complex scenarios involving rapid movements and non-rigid deformations. Extensive experiments validate the effectiveness of FlexiClip in generating animations that are not only smooth and natural but also structurally consistent across diverse clipart types, including humans and animals. By integrating spatial and temporal modeling with pre-trained video diffusion models, FlexiClip sets a new standard for high-quality clipart animation, offering robust performance across a wide range of visual content. Project Page: this https URL
zh
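FlexiClip 扩展了传统的基于贝塞尔曲线的轨迹建模。作为背景,下面用 De Casteljau 算法演示三次贝塞尔轨迹的逐帧采样(仅为轨迹建模的通用基础示意,不涉及论文提出的时间雅可比与 pfODE 部分;控制点数值为虚构):

```python
import numpy as np

def bezier_point(ctrl, t):
    """De Casteljau 算法:反复线性插值求贝塞尔曲线上 t 处的点。"""
    pts = np.asarray(ctrl, dtype=float)
    while len(pts) > 1:
        pts = (1 - t) * pts[:-1] + t * pts[1:]
    return pts[0]

def sample_trajectory(ctrl, n_frames):
    """在 [0,1] 上均匀采样 n_frames 个时刻,得到逐帧的控制点位置。"""
    return np.stack([bezier_point(ctrl, t)
                     for t in np.linspace(0.0, 1.0, n_frames)])

ctrl = [(0, 0), (1, 2), (3, 2), (4, 0)]   # 三次贝塞尔的 4 个控制点
traj = sample_trajectory(ctrl, n_frames=5)
```

论文在这样的逐帧位置之上,再用时间雅可比逐步修正运动动力学,以缓解突兀运动。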
[CV-25] GS-LIVO: Real-Time LiDAR Inertial and Visual Multi-sensor Fused Odometry with Gaussian Mapping
【速读】:该论文旨在解决现有基于视觉的3D高斯泼溅(3D Gaussian Splatting, 3D-GS)方法在处理遮挡、高GPU内存和计算消耗方面的挑战。现有方法通常依赖于手工设计的启发式方法进行点云稠密化,难以有效应对复杂场景。论文提出了一种基于多传感器融合的实时高斯泼溅同时定位与建图(SLAM)系统,关键解决方案包括:1)采用全局高斯地图和滑动窗口高斯地图相结合的方式,通过哈希索引的体素组织在递归八叉树中,有效覆盖稀疏空间体积并适应不同细节和尺度;2)通过多传感器融合初始化高斯地图,并利用光度梯度进行优化;3)引入基于迭代误差状态卡尔曼滤波(IESKF)的紧耦合多传感器融合里程计,实时更新和渲染高斯地图,显著降低GPU计算和内存消耗。该框架首次实现了在资源受限的嵌入式系统(如NVIDIA Jetson Orin NX平台)上部署的实时高斯泼溅SLAM系统,具备鲁棒的多传感器融合能力。
链接: https://arxiv.org/abs/2501.08672
作者: Sheng Hong,Chunran Zheng,Yishu Shen,Changze Li,Fu Zhang,Tong Qin,Shaojie Shen
机构: Department of Electronic Computer Engineering, The Hong Kong University of Science and Technology, Hong Kong SAR, China (香港科技大学电子计算机工程系); Department of Mechanical Engineering, The University of Hong Kong, Hong Kong SAR, China (香港大学机械工程系); Global Institute of Future Technology, Shanghai Jiao Tong University, Shanghai, China (上海交通大学未来技术学院)
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:In recent years, 3D Gaussian splatting (3D-GS) has emerged as a novel scene representation approach. However, existing vision-only 3D-GS methods often rely on hand-crafted heuristics for point-cloud densification and face challenges in handling occlusions and high GPU memory and computation consumption. LiDAR-Inertial-Visual (LIV) sensor configuration has demonstrated superior performance in localization and dense mapping by leveraging complementary sensing characteristics: rich texture information from cameras, precise geometric measurements from LiDAR, and high-frequency motion data from IMU. Inspired by this, we propose a novel real-time Gaussian-based simultaneous localization and mapping (SLAM) system. Our map system comprises a global Gaussian map and a sliding window of Gaussians, along with an IESKF-based odometry. The global Gaussian map consists of hash-indexed voxels organized in a recursive octree, effectively covering sparse spatial volumes while adapting to different levels of detail and scales. The Gaussian map is initialized through multi-sensor fusion and optimized with photometric gradients. Our system incrementally maintains a sliding window of Gaussians, significantly reducing GPU computation and memory consumption by only optimizing the map within the sliding window. Moreover, we implement a tightly coupled multi-sensor fusion odometry with an iterative error state Kalman filter (IESKF), leveraging real-time updating and rendering of the Gaussian map. Our system represents the first real-time Gaussian-based SLAM framework deployable on resource-constrained embedded systems, demonstrated on the NVIDIA Jetson Orin NX platform. The framework achieves real-time performance while maintaining robust multi-sensor fusion capabilities. All implementation algorithms, hardware designs, and CAD models will be publicly available.
zh
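全局高斯地图采用“哈希索引的体素”组织以稀疏覆盖空间,只为有观测的区域分配存储。下面用字典模拟最基本的体素哈希(不含递归八叉树与多尺度细节,体素内容也简化为点列表,均为示意性假设):

```python
def voxel_key(p, voxel_size):
    """将三维点映射到体素整数坐标,作为哈希键(示意性实现)。"""
    return tuple(int(c // voxel_size) for c in p)

class HashVoxelMap:
    """用字典模拟哈希索引的体素地图:仅为有观测的体素分配存储。"""
    def __init__(self, voxel_size):
        self.voxel_size = voxel_size
        self.voxels = {}          # key -> 该体素内的点列表

    def insert(self, p):
        self.voxels.setdefault(voxel_key(p, self.voxel_size), []).append(p)

    def query(self, p):
        return self.voxels.get(voxel_key(p, self.voxel_size), [])

m = HashVoxelMap(voxel_size=0.5)
m.insert((0.1, 0.2, 0.3))
m.insert((0.2, 0.1, 0.4))   # 落入同一体素
m.insert((1.0, 1.0, 1.0))   # 落入另一体素
```

实际系统中每个体素存放的是高斯参数,并在递归八叉树中按细节层级组织;字典只是哈希索引这一思想的最小化示意。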
[CV-26] A Survey on Facial Image Privacy Preservation in Cloud-Based Services
【速读】:该论文旨在解决基于云服务的人脸识别模型中存在的隐私保护问题。随着商业企业、政府机构和云服务提供商广泛使用人脸识别技术进行身份验证、消费者服务和监控,大量人脸数据在云端处理和存储,引发了严重的隐私担忧。用户的未经同意的面部图像可能被滥用,导致数据泄露和误用。论文提出了两种主要的解决方案:基于图像模糊化(image obfuscation-based protection)的保护方法和基于对抗性扰动(adversarial perturbation-based protection)的保护方法。通过对这两种方法的深入分析和定量定性比较,论文评估了它们在保护面部图像隐私方面的有效性,并指出了当前未解决的挑战,提出了未来在云计算环境中改进隐私保护的研究方向。
链接: https://arxiv.org/abs/2501.08665
作者: Chen Chen,Mengyuan Sun,Xueluan Gong,Yanjiao Chen,Qian Wang
机构: Nanyang Technological University(南洋理工大学); School of Cyber Science and Engineering, Wuhan University(武汉大学网络空间安全学院); College of Electrical Engineering, Zhejiang University(浙江大学电气工程学院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Facial recognition models are increasingly employed by commercial enterprises, government agencies, and cloud service providers for identity verification, consumer services, and surveillance. These models are often trained using vast amounts of facial data processed and stored in cloud-based platforms, raising significant privacy concerns. Users’ facial images may be exploited without their consent, leading to potential data breaches and misuse. This survey presents a comprehensive review of current methods aimed at preserving facial image privacy in cloud-based services. We categorize these methods into two primary approaches: image obfuscation-based protection and adversarial perturbation-based protection. We provide an in-depth analysis of both categories, offering qualitative and quantitative comparisons of their effectiveness. Additionally, we highlight unresolved challenges and propose future research directions to improve privacy preservation in cloud computing environments.
zh
[CV-27] BRIGHT-VO: Brightness-Guided Hybrid Transformer for Visual Odometry with Multi-modality Refinement Module
【速读】:该论文旨在解决视觉里程计(Visual Odometry, VO)在低光照条件下性能下降的问题。传统的数据驱动VO方法,尤其是基于深度学习的技术,在低光照环境中由于特征可见性降低和关键点匹配难度增加,往往表现不佳。为解决这一问题,论文提出了BrightVO模型,其关键创新在于结合了Transformer架构和多模态优化模块。BrightVO不仅在前端进行视觉特征提取,还在后端集成了惯性测量单元(Inertial Measurement Unit, IMU)数据,通过姿态图优化(pose graph optimization)迭代修正姿态估计,从而减少误差并提高精度和鲁棒性。此外,论文还创建了一个合成低光照数据集KiC4R,用于训练和评估VO框架在复杂环境中的性能。实验结果表明,BrightVO在KiC4R数据集和KITTI基准测试中均达到了最先进的性能,尤其在低光照条件下,姿态估计精度提升了259%。
链接: https://arxiv.org/abs/2501.08659
作者: Dongzhihan Wang,Yang Yang,Liang Xu
机构: Shanghai University(上海大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 9 pages, 7 figures
点击查看摘要
Abstract:Visual odometry (VO) plays a crucial role in autonomous driving, robotic navigation, and other related tasks by estimating the position and orientation of a camera based on visual input. Significant progress has been made in data-driven VO methods, particularly those leveraging deep learning techniques to extract image features and estimate camera poses. However, these methods often struggle in low-light conditions because of the reduced visibility of features and the increased difficulty of matching keypoints. To address this limitation, we introduce BrightVO, a novel VO model based on Transformer architecture, which not only performs front-end visual feature extraction, but also incorporates a multi-modality refinement module in the back-end that integrates Inertial Measurement Unit (IMU) data. Using pose graph optimization, this module iteratively refines pose estimates to reduce errors and improve both accuracy and robustness. Furthermore, we create a synthetic low-light dataset, KiC4R, which includes a variety of lighting conditions to facilitate the training and evaluation of VO frameworks in challenging environments. Experimental results demonstrate that BrightVO achieves state-of-the-art performance on both the KiC4R dataset and the KITTI benchmarks. Specifically, it provides an average improvement of 20% in pose estimation accuracy in normal outdoor environments and 259% in low-light conditions, outperforming existing methods. For widespread use and further development, the research work is fully open-source at this https URL.
zh
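BrightVO 后端的姿态图优化用视觉与 IMU 的相对测量作为约束,迭代修正姿态估计。下面以一维平移位姿图为例,给出相对测量约束下的加权最小二乘求解示意(边的数值为虚构;真实系统在 SE(3) 上迭代优化,并融合 IMU 预积分):

```python
import numpy as np

def optimize_pose_graph(n, edges, anchor=0):
    """一维位姿图的线性最小二乘求解(示意)。
    edges: [(i, j, 测量的相对位移, 权重), ...],最小化加权残差平方和。
    """
    A, b = [], []
    for i, j, meas, w in edges:
        row = np.zeros(n)
        row[j], row[i] = 1.0, -1.0        # 残差 r = (x_j - x_i) - meas
        A.append(np.sqrt(w) * row)
        b.append(np.sqrt(w) * meas)
    row = np.zeros(n)
    row[anchor] = 1.0                     # 固定首个位姿,消除规范自由度
    A.append(row)
    b.append(0.0)
    x, *_ = np.linalg.lstsq(np.array(A), np.array(b), rcond=None)
    return x

# 相邻帧里程计边 + 一条回环/跨帧边(数值为虚构示例)
edges = [(0, 1, 1.0, 1.0), (1, 2, 1.0, 1.0), (0, 2, 2.2, 1.0)]
x = optimize_pose_graph(3, edges)
```

跨帧边的测量(2.2)与两条相邻边之和(2.0)矛盾,最小二乘把误差均摊到各位姿,这正是姿态图优化“迭代修正、降低累计误差”的作用。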
[CV-28] StereoGen: High-quality Stereo Image Generation from a Single Image
【速读】:该论文试图解决现有监督式立体匹配方法在真实场景中泛化能力不足的问题,主要原因是缺乏真实世界的标注数据。为此,论文提出了一种名为StereoGen的新型立体图像生成管道。该管道的核心解决方案包括:1)利用任意单张图像作为左图像,并通过单目深度估计模型生成的伪视差图来合成高质量的右图像;2)通过微调扩散修复模型来恢复遮挡区域的背景,而不是像以往方法那样使用随机背景或卷积操作选择附近像素;3)提出了无需训练的置信度生成和自适应视差选择机制,前者在立体训练过程中抑制有害伪真值的负面影响,后者则有助于生成更广泛的视差分布和更好的合成图像。实验表明,基于该管道训练的模型在零样本泛化能力上达到了当前最优水平。
链接: https://arxiv.org/abs/2501.08654
作者: Xianqi Wang,Hao Yang,Gangwei Xu,Junda Cheng,Min Lin,Yong Deng,Jinliang Zang,Yurui Chen,Xin Yang
机构: Huazhong University of Science and Technology (华中科技大学); Autel Robotics (道通智能)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:State-of-the-art supervised stereo matching methods have achieved amazing results on various benchmarks. However, these data-driven methods suffer from generalization to real-world scenarios due to the lack of real-world annotated data. In this paper, we propose StereoGen, a novel pipeline for high-quality stereo image generation. This pipeline utilizes arbitrary single images as left images and pseudo disparities generated by a monocular depth estimation model to synthesize high-quality corresponding right images. Unlike previous methods that fill the occluded area in warped right images using random backgrounds or using convolutions to take nearby pixels selectively, we fine-tune a diffusion inpainting model to recover the background. Images generated by our model possess better details and undamaged semantic structures. Besides, we propose Training-free Confidence Generation and Adaptive Disparity Selection. The former suppresses the negative effect of harmful pseudo ground truth during stereo training, while the latter helps generate a wider disparity distribution and better synthetic images. Experiments show that models trained under our pipeline achieve state-of-the-art zero-shot generalization results among all published methods. The code will be available upon publication of the paper.
zh
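StereoGen 用单目深度模型产生的伪视差把左图变换(warp)为右图,未被覆盖的遮挡区域再交给扩散修复模型补全。下面用一行像素演示这一前向变换与遮挡空洞的产生(重叠处采用简化的“先写入保留”策略,为示意性假设;真实实现需按深度排序):

```python
import numpy as np

def warp_left_to_right(left_row, disparity):
    """按伪视差将左图一行前向变换到右图;未被映射到的位置即遮挡空洞,
    以 -1 标记,留待修复模型补全(示意性实现)。"""
    w = len(left_row)
    right = np.full(w, -1.0)          # -1 表示遮挡/空洞
    for x in range(w - 1, -1, -1):
        xr = x - int(disparity[x])    # 视差越大,像素左移越多
        if 0 <= xr < w and right[xr] < 0:
            right[xr] = left_row[x]   # 先写入者保留(简化的重叠处理)
    return right

left = np.array([10.0, 20.0, 30.0, 40.0, 50.0])
disp = np.array([0, 0, 2, 2, 2])      # 后三个像素属于近处物体
right = warp_left_to_right(left, disp)
```

右端出现的 -1 空洞对应论文中需要修复的遮挡区域;以往方法用随机背景或邻近像素填充,StereoGen 改用微调的扩散修复模型恢复背景。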
[CV-29] Joint Learning of Depth and Appearance for Portrait Image Animation
【速读】:该论文旨在解决二维肖像动画(2D portrait animation)领域中,现有方法主要关注生成RGB图像而忽略了视觉外观与三维输出(3D output)一致性联合生成的问题。为此,作者提出了一种基于扩散模型(diffusion model)的肖像图像生成方法,能够同时学习视觉外观和深度信息。解决方案的关键在于引入了一种新的架构,该架构结合了参考网络(reference network)和通道扩展的扩散主干网络(channel-expanded diffusion backbone),以端到端的方式学习条件联合分布(conditional joint distribution)。这一框架不仅能够高效适应多种下游应用,如面部深度到图像的生成、图像到深度的生成、肖像重光照(portrait relighting)以及音频驱动的说话头动画(audio-driven talking head animation),还能确保生成的三维输出具有一致性。
链接: https://arxiv.org/abs/2501.08649
作者: Xinya Ji,Gaspard Zoss,Prashanth Chandran,Lingchen Yang,Xun Cao,Barbara Solenthaler,Derek Bradley
机构: ETH Zürich(苏黎世联邦理工学院); Nanjing University(南京大学); DisneyResearch|Studios(迪士尼研究院|工作室)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:
点击查看摘要
Abstract:2D portrait animation has experienced significant advancements in recent years. Much research has utilized the prior knowledge embedded in large generative diffusion models to enhance high-quality image manipulation. However, most methods only focus on generating RGB images as output, and the co-generation of consistent visual plus 3D output remains largely under-explored. In our work, we propose to jointly learn the visual appearance and depth simultaneously in a diffusion-based portrait image generator. Our method embraces the end-to-end diffusion paradigm and introduces a new architecture suitable for learning this conditional joint distribution, consisting of a reference network and a channel-expanded diffusion backbone. Once trained, our framework can be efficiently adapted to various downstream applications, such as facial depth-to-image and image-to-depth generation, portrait relighting, and audio-driven talking head animation with consistent 3D output.
zh
[CV-30] MonSter: Marry Monodepth to Stereo Unleashes Power
【速读】:该论文旨在解决立体匹配(stereo matching)在处理遮挡(occlusions)和无纹理区域(textureless areas)等匹配线索有限的病态区域(ill-posed regions)时的困难。现有方法在这些区域表现不佳,难以准确恢复深度信息。为此,作者提出了一种名为MonSter的新方法,该方法结合了单目深度估计(monocular depth estimation)和立体匹配的互补优势。MonSter通过双分支架构(dual-branch architecture)将单目深度和立体匹配集成在一起,并利用置信度引导(confidence-based guidance)自适应地选择可靠的立体匹配线索,用于单目深度尺度和偏移的恢复。同时,优化后的单目深度信息反过来有效指导立体匹配在病态区域的表现。这种迭代的相互增强机制使得MonSter能够从粗略的物体级结构逐步细化到像素级几何,充分释放立体匹配的潜力。实验结果表明,MonSter在多个常用基准测试(如SceneFlow、KITTI 2012、KITTI 2015、Middlebury和ETH3D)中均排名第一,并在ETH3D上相比之前的最佳方法提升了49.5%。
链接: https://arxiv.org/abs/2501.08643
作者: Junda Cheng,Longliang Liu,Gangwei Xu,Xianqi Wang,Zhaoxing Zhang,Yong Deng,Jinliang Zang,Yurui Chen,Zhipeng Cai,Xin Yang
机构: Huazhong University of Science and Technology (华中科技大学); Autel Robotics (道通智能); Intel Labs (英特尔实验室)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Stereo matching recovers depth from image correspondences. Existing methods struggle to handle ill-posed regions with limited matching cues, such as occlusions and textureless areas. To address this, we propose MonSter, a novel method that leverages the complementary strengths of monocular depth estimation and stereo matching. MonSter integrates monocular depth and stereo matching into a dual-branch architecture to iteratively improve each other. Confidence-based guidance adaptively selects reliable stereo cues for monodepth scale-shift recovery. The refined monodepth in turn guides stereo matching effectively at ill-posed regions. Such iterative mutual enhancement enables MonSter to evolve monodepth priors from coarse object-level structures to pixel-level geometry, fully unlocking the potential of stereo matching. As shown in Fig.1, MonSter ranks 1st across the five most commonly used leaderboards (SceneFlow, KITTI 2012, KITTI 2015, Middlebury, and ETH3D), achieving up to a 49.5% improvement (Bad 1.0 on ETH3D) over the previous best method. Comprehensive analysis verifies the effectiveness of MonSter in ill-posed regions. In terms of zero-shot generalization, MonSter significantly and consistently outperforms state-of-the-art across the board. The code is publicly available at: this https URL.
zh
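MonSter 的关键一步是用置信度筛选出的可靠立体匹配线索,恢复单目深度的尺度与偏移(scale-shift recovery)。下面给出加权最小二乘的闭式求解示意(阈值筛选与加权方式为本文假设,数据为构造的玩具示例):

```python
import numpy as np

def recover_scale_shift(d_mono, d_stereo, conf, thresh=0.5):
    """用高置信度像素最小二乘求解单目深度的尺度 s 与偏移 b:
    min Σ conf · (s · d_mono + b - d_stereo)^2(示意性实现)。"""
    mask = conf > thresh                      # 置信度引导:只用可靠的立体线索
    w = conf[mask]
    A = np.stack([d_mono[mask], np.ones(mask.sum())], axis=1)
    s, b = np.linalg.lstsq(A * w[:, None], d_stereo[mask] * w, rcond=None)[0]
    return s, b

d_mono = np.array([1.0, 2.0, 3.0, 4.0])
d_stereo = 2.0 * d_mono + 0.5                 # 构造真值:s=2, b=0.5
conf = np.array([0.9, 0.9, 0.9, 0.1])         # 最后一个像素不可靠,被剔除
s, b = recover_scale_shift(d_mono, d_stereo, conf)
```

校正后的单目深度 s·d_mono + b 与立体深度同尺度,即可反过来在病态区域引导立体匹配,形成论文所述的迭代互促。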
[CV-31] Detecting Wildfire Flame and Smoke through Edge Computing using Transfer Learning Enhanced Deep Learning Models
【速读】:该论文旨在解决在有限数据集上训练的目标检测器在识别野火烟雾和火焰时的性能问题,特别是在边缘计算设备上实时处理数据时的延迟问题。解决方案的关键在于利用迁移学习(Transfer Learning, TL)来提升目标检测器的性能,尤其是在使用YOLO(You Only Look Once)模型时。研究通过两阶段级联迁移学习方法,首先在D-Fire或FASDD数据集上进行预训练,然后在AFSE数据集上进行微调,显著提高了检测精度,达到了79.2%的平均精度(mAP@0.5),并减少了训练时间。然而,研究也发现,迁移学习并未显著改善边缘计算设备的推理时间、功耗和能耗等指标。此外,研究还表明,在没有硬件加速的情况下,YOLOv5n模型的处理速度几乎是其新版本YOLO11n的两倍。总体而言,研究证实了迁移学习在提高目标检测器精度方面的作用,但也指出需要进一步优化以提升边缘计算性能。
链接: https://arxiv.org/abs/2501.08639
作者: Giovanny Vazquez,Shengjie Zhai,Mei Yang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
备注: 11 pages, 7 figures
点击查看摘要
Abstract:Autonomous unmanned aerial vehicles (UAVs) integrated with edge computing capabilities empower real-time data processing directly on the device, dramatically reducing latency in critical scenarios such as wildfire detection. This study underscores Transfer Learning’s (TL) significance in boosting the performance of object detectors for identifying wildfire smoke and flames, especially when trained on limited datasets, and investigates the impact TL has on edge computing metrics, with the latter focusing on how TL-enhanced You Only Look Once (YOLO) models perform in terms of inference time, power usage, and energy consumption when using edge computing devices. This study utilizes the Aerial Fire and Smoke Essential (AFSE) dataset as the target, with the Flame and Smoke Detection Dataset (FASDD) and the Microsoft Common Objects in Context (COCO) dataset serving as source datasets. We explore a two-stage cascaded TL method, utilizing D-Fire or FASDD as initial stage target datasets and AFSE as the subsequent stage. Through fine-tuning, TL significantly enhances detection precision, achieving up to 79.2% mean Average Precision (mAP@0.5), reduces training time, and increases model generalizability across the AFSE dataset. However, cascaded TL yielded no notable improvements and TL alone did not benefit the edge computing metrics evaluated. Lastly, this work found that YOLOv5n remains a powerful model when hardware acceleration is unavailable, processing images nearly twice as fast as its newer counterpart, YOLO11n. Overall, the results affirm TL’s role in augmenting the accuracy of object detectors while also illustrating that additional enhancements are needed to improve edge computing performance.
zh
[CV-32] Self-Organizing Edge Computing Distribution Framework for Visual SLAM
【速读】:该论文试图解决在资源有限的移动机器人上实现实时定位与地图构建(SLAM)的挑战。传统的边缘辅助SLAM方法通过将计算密集型任务卸载到服务器来实现实时执行,但这种客户端-服务器架构对服务器和网络故障敏感。论文提出了一种新型的边缘辅助SLAM框架,能够在设备网络中自组织完全分布式的SLAM执行,或在无连接的情况下在单一设备上运行。该框架的关键在于其三层架构设计,具有设备无关性、对网络故障的弹性以及对核心SLAM系统的最小侵入性。实验结果表明,该框架在完全分布式和独立SLAM配置下均能与ORB SLAM3的精度和资源利用率相匹配,同时支持协作执行。
链接: https://arxiv.org/abs/2501.08629
作者: Jussi Kalliola,Lauri Suomela,Sergio Moreschini,David Hästbacka
机构: Tampere University (坦佩雷大学)
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV); Distributed, Parallel, and Cluster Computing (cs.DC)
备注: 8 pages, 5 figures
点击查看摘要
Abstract:Localization within a known environment is a crucial capability for mobile robots. Simultaneous Localization and Mapping (SLAM) is a prominent solution to this problem. SLAM is a framework that consists of a diverse set of computational tasks ranging from real-time tracking to computation-intensive map optimization. This combination can present a challenge for resource-limited mobile robots. Previously, edge-assisted SLAM methods have demonstrated promising real-time execution capabilities by offloading heavy computations while performing real-time tracking onboard. However, the common approach of utilizing a client-server architecture for offloading is sensitive to server and network failures. In this article, we propose a novel edge-assisted SLAM framework capable of self-organizing fully distributed SLAM execution across a network of devices or functioning on a single device without connectivity. The architecture consists of three layers and is designed to be device-agnostic, resilient to network failures, and minimally invasive to the core SLAM system. We have implemented and demonstrated the framework for monocular ORB SLAM3 and evaluated it in both fully distributed and standalone SLAM configurations against the ORB SLAM3. The experiment results demonstrate that the proposed design matches the accuracy and resource utilization of the monolithic approach while enabling collaborative execution.
zh
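上述框架的核心思想是:设备网络可用时分布式执行,网络故障时退化为单机运行。下面用一个极简的 Python 草图示意这种自组织的任务分配逻辑(任务名、设备结构与轮转策略均为示意假设,并非原文实现):

```python
def assign_slam_tasks(devices, tasks):
    # 自组织分配:有在线设备时分布式执行,否则全部退化到本机
    online = [d for d in devices if d["online"]]
    if not online:
        online = [{"name": "local", "online": True}]
    assignment = {}
    for i, task in enumerate(tasks):
        assignment[task] = online[i % len(online)]["name"]  # 轮转分配
    return assignment

tasks = ["tracking", "local_mapping", "loop_closing"]
# 网络故障:所有远端设备离线,SLAM 退化为单机执行
a = assign_slam_tasks([{"name": "edge-1", "online": False}], tasks)
assert set(a.values()) == {"local"}
# 两台设备在线时,任务分布到网络中的多台设备
b = assign_slam_tasks([{"name": "edge-1", "online": True},
                       {"name": "edge-2", "online": True}], tasks)
assert set(b.values()) == {"edge-1", "edge-2"}
```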
[CV-33] Computerized Assessment of Motor Imitation for Distinguishing Autism in Video (CAMI-2DNet)
【速读】:该论文试图解决自闭症谱系障碍(ASCs)患者运动模仿能力评估中的主观性和劳动密集型问题。传统的运动模仿评估方法依赖于主观判断和大量的人工训练,而现有的计算机化评估方法(如CAMI-3D和CAMI-2D)虽然减少了主观性,但仍需进行数据标准化、清理和人工标注。为解决这些问题,论文提出了CAMI-2DNet,一种基于深度学习的可扩展且可解释的运动模仿评估方法,适用于视频数据。CAMI-2DNet采用编码器-解码器架构,将视频映射到与身体形状和摄像机视角等干扰因素解耦的运动编码。通过使用虚拟角色运动重定向生成的合成数据以及真实参与者数据,CAMI-2DNet能够自动计算个体与演员运动编码之间的相似度评分,从而区分ASCs患者与神经典型(NT)个体。实验表明,CAMI-2DNet在区分ASCs与NT儿童方面优于CAMI-2D,并与CAMI-3D表现相当,同时具有更高的实用性,无需额外的数据标准化和人工标注。
链接: https://arxiv.org/abs/2501.08609
作者: Kaleab A. Kinfu,Carolina Pacheco,Alice D. Sperry,Deana Crocetti,Bahar Tunçgenç,Stewart H. Mostofsky,René Vidal
机构: Center for Innovation in Data Engineering and Science at the University of Pennsylvania(宾夕法尼亚大学数据工程与科学创新中心); Department of Biomedical Engineering at Johns Hopkins University(约翰霍普金斯大学生物医学工程系); Center for Neurodevelopmental and Imaging Research at the Kennedy Krieger Institute(肯尼迪克里格研究所神经发育与影像研究中心); Department of Neurology and the Department of Psychiatry and Behavioral Sciences at the Johns Hopkins University School of Medicine(约翰霍普金斯大学医学院神经病学系与精神病学和行为科学系); Department of Psychology at the Nottingham Trent University(诺丁汉特伦特大学心理学系)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: This work has been submitted to the IEEE for possible publication
点击查看摘要
Abstract:Motor imitation impairments are commonly reported in individuals with autism spectrum conditions (ASCs), suggesting that motor imitation could be used as a phenotype for addressing autism heterogeneity. Traditional methods for assessing motor imitation are subjective, labor-intensive, and require extensive human training. Modern Computerized Assessment of Motor Imitation (CAMI) methods, such as CAMI-3D for motion capture data and CAMI-2D for video data, are less subjective. However, they rely on labor-intensive data normalization and cleaning techniques, and human annotations for algorithm training. To address these challenges, we propose CAMI-2DNet, a scalable and interpretable deep learning-based approach to motor imitation assessment in video data, which eliminates the need for data normalization, cleaning and annotation. CAMI-2DNet uses an encoder-decoder architecture to map a video to a motion encoding that is disentangled from nuisance factors such as body shape and camera views. To learn a disentangled representation, we employ synthetic data generated by motion retargeting of virtual characters through the reshuffling of motion, body shape, and camera views, as well as real participant data. To automatically assess how well an individual imitates an actor, we compute a similarity score between their motion encodings, and use it to discriminate individuals with ASCs from neurotypical (NT) individuals. Our comparative analysis demonstrates that CAMI-2DNet has a strong correlation with human scores while outperforming CAMI-2D in discriminating ASC vs NT children. Moreover, CAMI-2DNet performs comparably to CAMI-3D while offering greater practicality by operating directly on video data and without the need for ad-hoc data normalization and human annotations.
zh
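CAMI-2DNet 的判别依据是个体与演员运动编码之间的相似度评分。下面是一个仅作示意的最小草图(假设运动编码为定长向量、采用余弦相似度;具体相似度定义以原文为准):

```python
import math

def cosine_similarity(a, b):
    # 余弦相似度:两向量夹角的余弦,范围 [-1, 1]
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def imitation_score(child_encoding, actor_encoding):
    # 将相似度线性映射到 [0, 1],便于作为模仿质量评分
    return (cosine_similarity(child_encoding, actor_encoding) + 1) / 2

actor = [0.2, 0.8, -0.1, 0.5]
good = [0.19, 0.82, -0.08, 0.48]   # 接近演员动作的编码
poor = [-0.6, 0.1, 0.9, -0.3]      # 偏离演员动作的编码
assert imitation_score(good, actor) > imitation_score(poor, actor)
```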
[CV-34] PACF: Prototype Augmented Compact Features for Improving Domain Adaptive Object Detection
【速读】:该论文试图解决跨领域目标检测中由于领域差异(domain gap)导致的性能显著下降问题。具体表现为,目标领域中的类条件分布(class-conditional distributions)相较于源领域具有更高的方差和均值偏移(mean shift)。为解决这一问题,论文提出了原型增强紧凑特征(Prototype Augmented Compact Features, PACF)框架,通过正则化类内特征的分布来减少目标领域特征的方差和均值偏移。关键解决方案包括:1)对目标特征相关似然的下界进行理论分析,并推导出原型交叉熵损失(prototype cross entropy loss)以校准目标感兴趣区域(RoI)特征的分布;2)设计了一种互正则化策略(mutual regularization strategy),使线性分类器和基于原型的分类器能够相互学习,从而提升特征的紧凑性和判别性。通过这些方法,PACF框架显著降低了目标领域特征的类条件分布方差,并进一步减少了跨领域的类均值偏移,最终在不同适应设置下取得了最先进的性能。
链接: https://arxiv.org/abs/2501.08605
作者: Chenguang Liu,Yongchao Feng,Yanan Zhang,Qingjie Liu,Yunhong Wang
机构: State Key Laboratory of Virtual Reality Technology and Systems, Beihang University (北京航空航天大学虚拟现实技术与系统国家重点实验室); School of Computer Science and Information Engineering, Hefei University of Technology (合肥工业大学计算机与信息工程学院); Hangzhou Innovation Institute, Beihang University (北京航空航天大学杭州创新研究院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:In recent years, there has been significant advancement in object detection. However, applying off-the-shelf detectors to a new domain leads to significant performance drop, caused by the domain gap. These detectors exhibit higher-variance class-conditional distributions in the target domain than that in the source domain, along with mean shift. To address this problem, we propose the Prototype Augmented Compact Features (PACF) framework to regularize the distribution of intra-class features. Specifically, we provide an in-depth theoretical analysis on the lower bound of the target features-related likelihood and derive the prototype cross entropy loss to further calibrate the distribution of target RoI features. Furthermore, a mutual regularization strategy is designed to enable the linear and prototype-based classifiers to learn from each other, promoting feature compactness while enhancing discriminability. Thanks to this PACF framework, we have obtained a more compact cross-domain feature space, within which the variance of the target features’ class-conditional distributions has significantly decreased, and the class-mean shift between the two domains has also been further reduced. The results on different adaptation settings are state-of-the-art, which demonstrate the board applicability and effectiveness of the proposed approach.
zh
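原型交叉熵损失的直觉可用如下最小草图说明:以特征到各类原型的负欧氏距离作为 logit 计算交叉熵,使特征向其类别原型收拢、压缩类内方差(原型与特征数值均为假设,非原文实现):

```python
import math

def prototype_cross_entropy(feature, prototypes, label):
    # 原型交叉熵:logit 取特征到各类原型的负欧氏距离,再计算交叉熵
    def neg_dist(f, p):
        return -math.sqrt(sum((a - b) ** 2 for a, b in zip(f, p)))
    logits = [neg_dist(feature, p) for p in prototypes]
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_sum - logits[label]  # -log softmax(logits)[label]

protos = [[0.0, 0.0], [4.0, 4.0]]  # 两个类别的原型
near_0 = [0.1, -0.1]               # 靠近类 0 原型的 RoI 特征
# 特征越靠近其类别原型,损失越小
assert prototype_cross_entropy(near_0, protos, 0) < prototype_cross_entropy(near_0, protos, 1)
```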
[CV-35] Watermarking in Diffusion Model: Gaussian Shading with Exact Diffusion Inversion via Coupled Transformations (EDICT)
【速读】:该论文旨在解决高斯着色(Gaussian Shading)水印技术中由于逆过程不精确而导致的水印失真问题。高斯着色传统上通过在噪声潜在空间中嵌入水印,并通过迭代去噪生成图像和添加噪声来恢复水印,但其逆过程并不精确,可能导致水印失真。论文提出的解决方案关键是通过集成精确扩散反演(Exact Diffusion Inversion via Coupled Transformations, EDICT)框架来改进这一过程。具体方法包括复制包含水印的噪声潜在空间,并在两个潜在空间之间采用交替的去噪和加噪方案,利用EDICT实现精确的反演映射。这种方法能够更精确地重建图像和嵌入的水印,实验结果表明,该方法在水印恢复保真度上取得了轻微但统计显著的提升。这一研究首次探索了EDICT与高斯着色在数字水印中的协同作用,为高保真和鲁棒的水印嵌入与提取开辟了新的研究方向。
链接: https://arxiv.org/abs/2501.08604
作者: Krishna Panthi
机构: Clemson University(克莱姆森大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 5 pages
点击查看摘要
Abstract:This paper introduces a novel approach to enhance the performance of Gaussian Shading, a prevalent watermarking technique, by integrating the Exact Diffusion Inversion via Coupled Transformations (EDICT) framework. While Gaussian Shading traditionally embeds watermarks in a noise latent space, followed by iterative denoising for image generation and noise addition for watermark recovery, its inversion process is not exact, leading to potential watermark distortion. We propose to leverage EDICT’s ability to derive exact inverse mappings to refine this process. Our method involves duplicating the watermark-infused noisy latent and employing a reciprocal, alternating denoising and noising scheme between the two latents, facilitated by EDICT. This allows for a more precise reconstruction of both the image and the embedded watermark. Empirical evaluation on standard datasets demonstrates that our integrated approach yields a slight, yet statistically significant improvement in watermark recovery fidelity. These results highlight the potential of EDICT to enhance existing diffusion-based watermarking techniques by providing a more accurate and robust inversion mechanism. To the best of our knowledge, this is the first work to explore the synergy between EDICT and Gaussian Shading for digital watermarking, opening new avenues for research in robust and high-fidelity watermark embedding and extraction.
zh
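EDICT 的关键性质是耦合线性变换可精确求逆。下面用标量对这一性质作最小示意(混合系数与数值均为假设;真实方法作用于高维潜变量,并与扩散去噪/加噪步交替进行):

```python
P = 0.93  # 耦合变换的混合系数(此处数值仅为示例)

def couple_forward(x, y):
    # 交替更新两个潜变量:每步都是可精确求逆的线性混合
    x_new = P * x + (1 - P) * y
    y_new = P * y + (1 - P) * x_new
    return x_new, y_new

def couple_inverse(x_new, y_new):
    # 精确逆变换:按相反顺序解出原始潜变量
    y = (y_new - (1 - P) * x_new) / P
    x = (x_new - (1 - P) * y) / P
    return x, y

x0, y0 = 1.25, -0.4  # 复制的含水印噪声潜变量(示意为标量)
x1, y1 = couple_forward(x0, y0)
xr, yr = couple_inverse(x1, y1)
assert abs(xr - x0) < 1e-12 and abs(yr - y0) < 1e-12  # 往返可精确恢复
```

正因为逆映射是精确的,嵌入在初始噪声潜变量中的水印在反演后不会被近似误差扭曲。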
[CV-36] Image-to-Force Estimation for Soft Tissue Interaction in Robotic-Assisted Surgery Using Structured Light
【速读】:该论文试图解决微创手术机器人(Minimally Invasive Surgical, MIS)在空间受限条件下无法通过硬件传感器直接测量与软组织交互力的问题。解决方案的关键在于提出了一种基于视觉的方案,利用一次性结构光投影(One-Shot structured light projection)在软组织上投射设计好的图案,并通过训练的图像到力的神经网络(image-to-force neural network)进行触觉信息处理。通过内窥镜立体相机捕获的图像,重建软组织变形的高分辨率三维点云,并基于此提出了一种改进的基于PointNet的力估计方法,该方法能够有效表征软组织的复杂力学特性。实验验证了该方案在不同刚度硅材料上的有效性。
链接: https://arxiv.org/abs/2501.08593
作者: Jiayin Wang,Mingfeng Yao,Yanran Wei,Xiaoyu Guo,Ayong Zheng,Weidong Zhao
机构: 1School of Electronic Information Engineering, Tongji University (同济大学电子与信息工程学院), 200092, Shanghai, China; 2MicroPort MedBot (Group) Company Ltd. (微创医疗机器人(集团)有限公司), 201203, Shanghai, China; 3College of Engineering, Peking University (北京大学工学院), 100871, Beijing, China; 4Department of Biomedical Engineering, City University of Hong Kong (香港城市大学生物医学工程系), 999077, Kowloon, Hong Kong
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:For Minimally Invasive Surgical (MIS) robots, accurate haptic interaction force feedback is essential for ensuring the safety of interacting with soft tissue. However, most existing MIS robotic systems cannot facilitate direct measurement of the interaction force with hardware sensors due to space limitations. This letter introduces an effective vision-based scheme that utilizes a One-Shot structured light projection with a designed pattern on soft tissue coupled with haptic information processing through a trained image-to-force neural network. The images captured from the endoscopic stereo camera are analyzed to reconstruct high-resolution 3D point clouds for soft tissue deformation. Based on this, a modified PointNet-based force estimation method is proposed, which excels in representing the complex mechanical properties of soft tissue. Numerical force interaction experiments are conducted on three silicon materials with different stiffness. The results validate the effectiveness of the proposed scheme.
zh
[CV-37] Densely Connected Parameter-Efficient Tuning for Referring Image Segmentation AAAI2025
【速读】:该论文试图解决在计算机视觉领域中,参数高效调优(Parameter-Efficient Tuning, PET)方法在多模态场景下,特别是面对未对齐编码器(misaligned encoders)时的性能不足问题。现有的PET方法主要针对单模态优化设计,虽然一些研究已经进行了初步探索,但仍局限于对齐编码器(如CLIP),未能有效处理未对齐编码器的情况,导致在多模态特征对齐和适应方面表现不佳。
论文提出的解决方案是DETRIS框架,其关键创新在于通过在每个层与所有前序层之间建立密集的互连,增强了低秩视觉特征的传播,从而实现了有效的跨模态特征交互和对未对齐编码器的适应。此外,论文还建议使用文本适配器(text adapters)来改进文本特征。该方法仅需更新0.9%到1.8%的主干网络参数,便在多个具有挑战性的基准测试中显著超越了现有最先进的方法。
链接: https://arxiv.org/abs/2501.08580
作者: Jiaqi Huang,Zunnan Xu,Ting Liu,Yong Liu,Haonan Han,Kehong Yuan,Xiu Li
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by AAAI2025
点击查看摘要
Abstract:In the domain of computer vision, Parameter-Efficient Tuning (PET) is increasingly replacing the traditional paradigm of pre-training followed by full fine-tuning. PET is particularly favored for its effectiveness in large foundation models, as it streamlines transfer learning costs and optimizes hardware utilization. However, the current PET methods are mainly designed for single-modal optimization. While some pioneering studies have undertaken preliminary explorations, they still remain at the level of aligned encoders (e.g., CLIP) and lack exploration of misaligned encoders. These methods show sub-optimal performance with misaligned encoders, as they fail to effectively align the multimodal features during fine-tuning. In this paper, we introduce DETRIS, a parameter-efficient tuning framework designed to enhance low-rank visual feature propagation by establishing dense interconnections between each layer and all preceding layers, which enables effective cross-modal feature interaction and adaptation to misaligned encoders. We also suggest using text adapters to improve textual features. Our simple yet efficient approach greatly surpasses state-of-the-art methods with 0.9% to 1.8% backbone parameter updates, evaluated on challenging benchmarks. Our project is available at this https URL.
zh
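“每一层与所有前序层建立密集互连”的思想可用如下草图示意:每个适配层的输入聚合了此前所有层(含原始输入)的输出(以求和聚合、用简单函数代替适配层,均为示意假设):

```python
def densely_connected_adapters(x, adapters):
    # 每个适配层接收其前所有层输出的聚合(此处以逐元素求和示意)
    outputs = [x]
    for adapter in adapters:
        aggregated = [sum(vals) for vals in zip(*outputs)]  # 聚合全部前序输出
        outputs.append(adapter(aggregated))
    return outputs[-1]

double = lambda v: [2 * e for e in v]
x = [1.0, -1.0]
# 两层:第一层输入 x,第二层输入 x 与第一层输出之和
y = densely_connected_adapters(x, [double, double])
assert y == [6.0, -6.0]  # 2 * (x + 2x) = 6x
```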
[CV-38] Scalable and High-Quality Neural Implicit Representation for 3D Reconstruction
【速读】:该论文旨在解决现有基于神经隐式表面重建方法(SDF-based neural implicit surface reconstruction)在重建精度和规模上的局限性。现有方法由于单一网络的全局性质和有限表示能力,难以实现高精度和大规模的重建。为此,论文提出了一种基于“分而治之”策略的神经隐式表示方法,通过将物体或场景建模为多个独立的局部神经SDF(Signed Distance Function)的融合,并在重叠区域进行优化。解决方案的关键包括三个步骤:(1) 基于物体结构或数据分布构建局部辐射场(local radiance fields)的分布和重叠关系,(2) 对相邻局部SDF进行相对位姿配准(relative pose registration),(3) 通过SDF混合(SDF blending)实现高保真表面重建。该方法不仅提升了重建精度,还支持可扩展的场景重建。
链接: https://arxiv.org/abs/2501.08577
作者: Leyuan Yang,Bailin Deng,Juyong Zhang
机构: School of Mathematical Sciences, University of Science and Technology of China(中国科学技术大学数学科学学院); School of Computer Science and Informatics, Cardiff University(卡迪夫大学计算机科学与信息学院)
类目: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
备注:
点击查看摘要
Abstract:Various SDF-based neural implicit surface reconstruction methods have been proposed recently, and have demonstrated remarkable modeling capabilities. However, due to the global nature and limited representation ability of a single network, existing methods still suffer from many drawbacks, such as limited accuracy and scale of the reconstruction. In this paper, we propose a versatile, scalable and high-quality neural implicit representation to address these issues. We integrate a divide-and-conquer approach into the neural SDF-based reconstruction. Specifically, we model the object or scene as a fusion of multiple independent local neural SDFs with overlapping regions. The construction of our representation involves three key steps: (1) constructing the distribution and overlap relationship of the local radiance fields based on object structure or data distribution, (2) relative pose registration for adjacent local SDFs, and (3) SDF blending. Thanks to the independent representation of each local region, our approach can not only achieve high-fidelity surface reconstruction, but also enable scalable scene reconstruction. Extensive experimental results demonstrate the effectiveness and practicality of our proposed method.
zh
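多个局部 SDF 在重叠区域的融合可以示意为按权重归一化的加权平均(以下一维示例中的 SDF 与权重函数均为假设,仅演示“分而治之 + SDF 混合”的思路):

```python
def blend_sdf(point, local_sdfs, weight_fns):
    # 在重叠区域,用归一化权重融合各局部 SDF 的符号距离值
    weights = [w(point) for w in weight_fns]
    total = sum(weights)
    if total == 0:
        raise ValueError("point not covered by any local SDF")
    return sum(w * f(point) for w, f in zip(weights, local_sdfs)) / total

# 两个一维“局部 SDF”:分别以 0 和 2 为中心的单位球面距离场
sdf_a = lambda x: abs(x - 0.0) - 1.0
sdf_b = lambda x: abs(x - 2.0) - 1.0
# 权重随离各自中心的距离衰减,在重叠区附近平滑过渡
w_a = lambda x: max(0.0, 1.5 - abs(x - 0.0))
w_b = lambda x: max(0.0, 1.5 - abs(x - 2.0))

v = blend_sdf(1.0, [sdf_a, sdf_b], [w_a, w_b])
assert abs(v - 0.0) < 1e-12  # 两球面在 x=1 处相接,融合后距离为 0
```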
[CV-39] GOTLoc: General Outdoor Text-based Localization Using Scene Graph Retrieval with OpenStreetMap
【速读】:该论文旨在解决在GPS信号不可用的户外环境中实现鲁棒定位(robust localization)的问题。现有基于文本的定位方法通常将地图表示为点云(point clouds),并通过比较文本和点云数据的嵌入(embeddings)来识别最相似的场景。然而,点云地图的可扩展性有限,因为预先生成所有户外空间的地图是不现实的,且其数据量庞大,难以在实际机器人上直接存储和使用。为解决这些问题,GOTLoc提出了一种利用场景图(scene graphs)存储空间信息的紧凑数据结构,使单个机器人能够携带和利用大量地图数据。此外,GOTLoc通过利用公开的地图数据(如OpenStreetMap)来获取全球户外空间信息,从而无需额外创建定制地图数据。实验结果表明,GOTLoc在精度上与依赖点云地图的算法相当,且在城市场景测试中显著减少了存储需求,整体处理时间仅为几秒,验证了其在实际机器人应用中的可行性。
链接: https://arxiv.org/abs/2501.08575
作者: Donghwi Jung,Keonwoo Kim,Seong-Woo Kim
机构: Seoul National University(首尔国立大学)
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:We propose GOTLoc, a robust localization method capable of operating even in outdoor environments where GPS signals are unavailable. The method achieves this robust localization by leveraging comparisons between scene graphs generated from text descriptions and maps. Existing text-based localization studies typically represent maps as point clouds and identify the most similar scenes by comparing embeddings of text and point cloud data. However, point cloud maps have limited scalability as it is impractical to pre-generate maps for all outdoor spaces. Furthermore, their large data size makes it challenging to store and utilize them directly on actual robots. To address these issues, GOTLoc leverages compact data structures, such as scene graphs, to store spatial information, enabling individual robots to carry and utilize large amounts of map data. Additionally, by utilizing publicly available map data, such as OpenStreetMap, which provides global information on outdoor spaces, we eliminate the need for additional effort to create custom map data. For performance evaluation, we utilized the KITTI360Pose dataset in conjunction with corresponding OpenStreetMap data to compare the proposed method with existing approaches. Our results demonstrate that the proposed method achieves accuracy comparable to algorithms relying on point cloud maps. Moreover, in city-scale tests, GOTLoc required significantly less storage compared to point cloud-based methods and completed overall processing within a few seconds, validating its applicability to real-world robotics. Our code is available at this https URL.
zh
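基于场景图检索的定位可以粗略示意为:把场景图表示为 (主体, 关系, 客体) 三元组集合,用 Jaccard 相似度在地图场景中找最匹配的一个(三元组与相似度定义均为示意假设,非原文算法):

```python
def graph_similarity(graph_a, graph_b):
    # 以 (主体, 关系, 客体) 三元组集合的 Jaccard 相似度近似场景图匹配
    a, b = set(graph_a), set(graph_b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def retrieve(query_graph, map_graphs):
    # 返回与查询场景图最相似的地图场景索引
    scores = [graph_similarity(query_graph, g) for g in map_graphs]
    return max(range(len(scores)), key=scores.__getitem__)

query = [("building", "left_of", "road"), ("tree", "near", "building")]
maps = [
    [("car", "on", "road")],
    [("building", "left_of", "road"), ("tree", "near", "building"), ("sign", "on", "road")],
]
assert retrieve(query, maps) == 1  # 第二个地图场景与文本描述最匹配
```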
[CV-40] MIAFEx: An Attention-based Feature Extraction Method for Medical Image Classification
【速读】:该论文旨在解决医学图像分类中特征提取技术的局限性问题。传统特征提取器和机器学习分类器在处理复杂图像集时,往往无法提供足够的判别信息。尽管卷积神经网络(CNNs)和视觉Transformer(ViT)在特征提取方面表现出潜力,但由于医学影像数据的小样本量和高类内方差等固有特性,这些模型容易出现过拟合现象。论文提出了一种名为医学图像注意力特征提取器(MIAFEx)的新方法,其关键创新在于引入了一种可学习的精炼机制,该机制通过调整Transformer编码器架构中的分类标记(classification token),基于学习到的权重来增强显著特征的提取,从而提高模型对医学影像数据挑战的适应性。MIAFEx在多个复杂医学影像数据集上的分类任务中表现出优于传统和现代模型的准确性和鲁棒性,尤其在训练数据有限的情况下,其优势更为显著。
链接: https://arxiv.org/abs/2501.08562
作者: Oscar Ramos-Soto,Jorge Ramos-Frutos,Ezequiel Perez-Zarate,Diego Oliva,Sandra E. Balderas-Mata
机构: Universidad de Guadalajara (瓜达拉哈拉大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: In preparation for Journal Submission
点击查看摘要
Abstract:Feature extraction techniques are crucial in medical image classification; however, classical feature extractors in addition to traditional machine learning classifiers often exhibit significant limitations in providing sufficient discriminative information for complex image sets. While Convolutional Neural Networks (CNNs) and Vision Transformer (ViT) have shown promise in feature extraction, they are prone to overfitting due to the inherent characteristics of medical imaging data, including small sample sizes or high intra-class variance. In this work, the Medical Image Attention-based Feature Extractor (MIAFEx) is proposed, a novel method that employs a learnable refinement mechanism to enhance the classification token within the Transformer encoder architecture. This mechanism adjusts the token based on learned weights, improving the extraction of salient features and enhancing the model’s adaptability to the challenges presented by medical imaging data. The MIAFEx output features quality is compared against classical feature extractors using traditional and hybrid classifiers. Also, the performance of these features is compared against modern CNN and ViT models in classification tasks, demonstrating its superiority in accuracy and robustness across multiple complex classification medical imaging datasets. This advantage is particularly pronounced in scenarios with limited training data, where traditional and modern models often struggle to generalize effectively. The source code of this proposal can be found at this https URL
zh
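“基于可学习权重调整分类标记”的机制可示意为对标记逐维加权的门控操作(权重数值为假设;原文中该权重由训练得到并嵌入 Transformer 编码器):

```python
def refine_cls_token(cls_token, weights):
    # 用可学习权重对分类标记逐维调整(示意为逐元素门控)
    assert len(cls_token) == len(weights)
    return [t * w for t, w in zip(cls_token, weights)]

cls_token = [0.5, -1.2, 0.3, 2.0]
weights = [1.1, 0.0, 0.9, 1.0]   # 训练后得到的权重(此处为假设值)
refined = refine_cls_token(cls_token, weights)
assert refined[1] == 0.0          # 被权重抑制的维度
assert abs(refined[0] - 0.55) < 1e-12
```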
[CV-41] DynamicFace: High-Quality and Consistent Video Face Swapping using Composable 3D Facial Priors
【速读】:该论文旨在解决现有面部交换(face swapping)方法在将源面部身份信息转移到目标面部时,往往会无意中保留目标面部的身份信息,从而影响表情细节和身份准确性的问题。为解决这一问题,论文提出了一种名为DynamicFace的新方法,其关键创新在于利用扩散模型(diffusion model)和即插即用的时序层(temporal layers)来实现视频面部交换。具体而言,该方法引入了四种基于3D面部先验的细粒度面部条件(fine-grained face conditions),这些条件被设计为相互解耦,以实现精确且独立的控制。此外,通过采用Face Former和ReferenceNet进行高层级和细节化的身份注入(identity injection),该方法在FF++数据集上实现了最先进的面部交换效果,展示了卓越的图像质量、身份保持和表情准确性。该方法还可通过时序注意力层(temporal attention layer)轻松扩展到视频领域。
链接: https://arxiv.org/abs/2501.08553
作者: Runqi Wang,Sijie Xu,Tianyao He,Yang Chen,Wei Zhu,Dejia Song,Nemo Chen,Xu Tang,Yao Hu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Face swapping transfers the identity of a source face to a target face while retaining the attributes like expression, pose, hair, and background of the target face. Advanced face swapping methods have achieved attractive results. However, these methods often inadvertently transfer identity information from the target face, compromising expression-related details and accurate identity. We propose a novel method DynamicFace that leverages the power of diffusion model and plug-and-play temporal layers for video face swapping. First, we introduce four fine-grained face conditions using 3D facial priors. All conditions are designed to be disentangled from each other for precise and unique control. Then, we adopt Face Former and ReferenceNet for high-level and detailed identity injection. Through experiments on the FF++ dataset, we demonstrate that our method achieves state-of-the-art results in face swapping, showcasing superior image quality, identity preservation, and expression accuracy. Besides, our method could be easily transferred to video domain with temporal attention layer. Our code and results will be available on the project page: this https URL
zh
[CV-42] he Devil is in Temporal Token: High Quality Video Reasoning Segmentation
【速读】:该论文旨在解决现有视频推理分割(Video Reasoning Segmentation)方法在捕捉空间复杂性和帧间运动方面的不足。现有方法主要依赖单一的特殊标记(token)来表示关键帧或整个视频中的对象,导致对空间和时序信息的捕捉不够充分。为解决这一问题,论文提出了VRS-HQ,一种端到端的视频推理分割方法,通过引入多模态大语言模型(Multimodal Large Language Models, MLLMs)来注入丰富的时空特征。其关键创新包括时间动态聚合(Temporal Dynamic Aggregation, TDA)和基于标记的关键帧选择(Token-driven Keyframe Selection, TKS)。具体而言,VRS-HQ设计了帧级SEG标记和时序级TAK标记,利用MLLM的自回归学习能力有效捕捉局部和全局信息。随后,通过基于相似度的加权融合和帧选择策略,结合SAM2进行关键帧分割和传播。TKS在推理过程中根据SAM2的遮挡分数筛选关键帧,以提高关键帧定位的准确性。实验结果表明,VRS-HQ在ReVOS数据集上达到了最先进的性能,J&F分数在三个子集上分别比VISA高出5.9%、12.5%和9.1%,展示了其强大的时序推理和分割能力。
链接: https://arxiv.org/abs/2501.08549
作者: Sitong Gong,Yunzhi Zhuge,Lu Zhang,Zongxin Yang,Pingping Zhang,Huchuan Lu
机构: Dalian University of Technology(大连理工大学); Harvard University(哈佛大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:Existing methods for Video Reasoning Segmentation rely heavily on a single special token to represent the object in the keyframe or the entire video, inadequately capturing spatial complexity and inter-frame motion. To overcome these challenges, we propose VRS-HQ, an end-to-end video reasoning segmentation approach that leverages Multimodal Large Language Models (MLLMs) to inject rich spatiotemporal features into hierarchical tokens. Our key innovations include a Temporal Dynamic Aggregation (TDA) and a Token-driven Keyframe Selection (TKS). Specifically, we design frame-level SEG and temporal-level TAK tokens that utilize MLLM’s autoregressive learning to effectively capture both local and global information. Subsequently, we apply a similarity-based weighted fusion and frame selection strategy, then utilize SAM2 to perform keyframe segmentation and propagation. To enhance keyframe localization accuracy, the TKS filters keyframes based on SAM2’s occlusion scores during inference. VRS-HQ achieves state-of-the-art performance on ReVOS, surpassing VISA by 5.9%/12.5%/9.1% in J&F scores across the three subsets. These results highlight the strong temporal reasoning and segmentation capabilities of our method. Code and model weights will be released at VRS-HQ.
zh
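基于遮挡分数的关键帧筛选,与基于相似度的加权融合,可示意如下(阈值、分数与标记数值均为假设,非原文实现):

```python
def select_keyframe(frame_tokens, occlusion_scores, threshold=0.5):
    # 先按遮挡分数过滤被遮挡的帧,再在剩余帧中选遮挡最轻的一帧
    candidates = [i for i, s in enumerate(occlusion_scores) if s < threshold]
    if not candidates:                  # 全部高遮挡时退化为全局最小
        candidates = range(len(occlusion_scores))
    return min(candidates, key=lambda i: occlusion_scores[i])

def weighted_fusion(frame_tokens, similarities):
    # 按与时序标记的相似度做加权平均,得到融合后的全局标记
    total = sum(similarities)
    dim = len(frame_tokens[0])
    return [sum(s * tok[d] for s, tok in zip(similarities, frame_tokens)) / total
            for d in range(dim)]

tokens = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
occ = [0.9, 0.2, 0.4]              # 第 0 帧遮挡严重
assert select_keyframe(tokens, occ) == 1
fused = weighted_fusion(tokens, [0.2, 0.5, 0.3])
assert abs(sum(fused) - 1.0) < 1e-9  # 各帧标记分量和为 1,融合后仍为 1
```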
[CV-43] Comprehensive Subjective and Objective Evaluation Method for Text-generated Video
【速读】:该论文旨在解决文本生成视频(Text-to-Video, T2V)质量评估的难题。由于文本生成的视频中常存在复杂的失真现象,如不自然的动作和违背人类认知的现象,传统的质量评估方法难以准确衡量其感知质量。为此,论文提出了一个大规模的基准数据集T2VEval-Bench,包含148个文本描述和12个模型生成的1,783个视频,并通过主观评估收集了五个关键评分:整体印象、视频质量、美学质量、真实性和文本-视频一致性。在客观评估方面,论文开发了T2VEval模型,该模型通过三个分支(质量、真实性和一致性)对视频进行评估,并利用基于注意力的融合模块整合各分支特征,借助大型预训练模型进行评分预测。此外,论文采用了渐进式训练策略,确保各分支在保持协同的同时学习特定知识。实验结果表明,T2VEval在多个指标上达到了最先进的性能。
链接: https://arxiv.org/abs/2501.08545
作者: Zelu Qi,Ping Shi,Shuqi Wang,Zhaoyang Zhang,Zefeng Ying,Da Pan
机构: School of Information and Communication Engineering, Communication University of China, Chaoyang District, Beijing, China (中国传媒大学信息与通信工程学院,北京市朝阳区)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Recent text-to-video (T2V) technology advancements, as demonstrated by models such as Gen3, Pika, and Sora, have significantly broadened its applicability and popularity. This progress has created a growing demand for accurate quality assessment metrics to evaluate the perceptual quality of text-generated videos and optimize video generation models. However, assessing the quality of text-generated videos remains challenging due to the presence of highly complex distortions, such as unnatural actions and phenomena that defy human cognition. To address these challenges, we constructed a large-scale benchmark dataset for Text-generated Video evaluation, T2VEval-Bench, comprising 148 textual prompts and 1,783 videos generated by 12 models. During the subjective evaluation, we collected five key scores: overall impression, video quality, aesthetic quality, realness, and text-video consistency. For objective evaluation, we developed the T2VEval model, which assesses videos across three branches: quality, authenticity, and consistency. Using an attention-based fusion module, T2VEval effectively integrates features from each branch and predicts scores with the aid of a large oracle model. Additionally, we implemented a progressive training strategy, enabling each branch to learn targeted knowledge while maintaining synergy with the others. Experimental results demonstrate that T2VEval achieves state-of-the-art performance across multiple metrics. The dataset and code will be open-sourced upon completion of the follow-up work.
zh
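三分支(质量/真实性/一致性)得分的注意力融合可以示意为 softmax 加权求和(分支得分与注意力 logit 均为假设值,仅演示融合机制):

```python
import math

def softmax(xs):
    m = max(xs)                      # 减去最大值,保证数值稳定
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def fuse_branch_scores(branch_scores, attn_logits):
    # 基于注意力的融合:对各分支得分做 softmax 加权求和
    weights = softmax(attn_logits)
    return sum(w * s for w, s in zip(weights, branch_scores))

scores = [4.2, 3.1, 3.8]   # 质量、真实性、一致性分支的预测分(假设值)
logits = [0.0, 0.0, 0.0]   # 注意力权重相等时退化为简单平均
assert abs(fuse_branch_scores(scores, logits) - sum(scores) / 3) < 1e-9
# 任意权重下,融合分始终落在各分支得分的范围内
assert min(scores) <= fuse_branch_scores(scores, [2.0, -1.0, 0.5]) <= max(scores)
```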
[CV-44] Multimodal Fake News Video Explanation Generation
【速读】:该论文试图解决的问题是如何为多模态(multimodal)新闻内容(包含视频和字幕文本)生成自然语言解释,以揭示预测结果的真实性。具体来说,论文提出了一个新的问题——假新闻视频解释(Fake News Video Explanation, FNVE),旨在通过生成自然语言解释来评估多模态新闻的真实性。解决方案的关键在于开发了一个名为FakeNVE的新数据集,该数据集包含真实的多模态新闻帖子及其对应的自然语言解释。此外,论文采用了一种基于多模态Transformer的架构作为基准模型,并使用基于BART的自回归解码器作为生成器。实验结果表明,该方法在多个评估指标上均表现出色,并且在人类评估中,生成的解释在充分性和流畅性方面均获得了高分。
链接: https://arxiv.org/abs/2501.08514
作者: Lizhi Chen,Zhong Qian,Peifeng Li,Qiaoming Zhu
机构: Soochow University (苏州大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
备注:
点击查看摘要
Abstract:Multi-modal explanation involves the assessment of the veracity of a variety of different content, and relies on multiple information modalities to comprehensively consider the relevance and consistency between modalities. Most existing fake news video detection methods focus on improving accuracy while ignoring the importance of providing explanations. In this paper, we propose a novel problem - Fake News Video Explanation (FNVE) - Given a multimodal news containing both video and caption text, we aim to generate natural language explanations to reveal the truth of predictions. To this end, we develop FakeNVE, a new dataset of explanations for truthfully multimodal posts, where each explanation is a natural language (English) sentence describing the attribution of a news thread. We benchmark FakeNVE by using a multimodal transformer-based architecture. Subsequently, a BART-based autoregressive decoder is used as the generator. Empirical results show compelling results for various baselines (applicable to FNVE) across multiple evaluation metrics. We also perform human evaluation on explanation generation, achieving high scores for both adequacy and fluency.
zh
[CV-45] Exploring the Efficacy of Meta-Learning: Unveiling Superior Data Diversity Utilization of MAML Over Pre-training
【速读】:该论文试图解决的问题是:在当前大规模模型训练中,数据和模型规模占据了主导地位,但关于训练数据集的其他属性(如多样性)对模型性能的影响却缺乏深入探索。论文假设数据集多样性(dataset diversity)会影响视觉模型的性能,并通过实验验证了这一假设。
解决方案的关键在于:通过分析预训练(pre-training)和模型无关元学习(model-agnostic meta-learning, MAML)方法在12个常用视觉数据集(如Omniglot、CIFAR-FS、Aircraft等)和5种模型配置上的表现,论文发现测试集准确率与数据多样性之间存在中度到强正相关(R-squared: 0.15-0.42),损失与多样性之间也存在较弱但显著的相关性(R-squared: ~0.2)。这些结果表明,数据多样性是影响模型性能的重要因素,为进一步研究数据集属性提供了依据,并强调了理解数据集对于构建更强大和泛化能力更强的模型的重要性。
链接: https://arxiv.org/abs/2501.08506
作者: Kavita Selva,Satita Vittayaareekul,Brando Miranda
机构: Stanford University (斯坦福大学)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Currently, data and model size dominate the narrative in the training of super-large, powerful models. However, there has been a lack of exploration on the effect of other attributes of the training dataset on model performance. We hypothesize that dataset diversity can impact the performance of vision models. Our study shows positive correlations between test set accuracy and data diversity, providing an argument for furthering the research of dataset attributes beyond size. We analyzed pre-training and model-agnostic meta-learning methods on twelve popular visual datasets (e.g., Omniglot, CIFAR-FS, Aircraft) and five model configurations, including MAML variants with different numbers of inner gradient steps and supervised learning. We show moderate to strong positive correlations (R-squared: 0.15-0.42) between accuracy and data diversity and weaker but significant correlations (R-squared: ~0.2) between loss and diversity. These findings support our hypothesis and demonstrate a promising way for a deeper exploration of how formal data diversity influences model performance. This initial study highlights the potential of (Task2Vec) data diversity as a valuable measure in the rapidly evolving field of large-scale learning and emphasizes that understanding the dataset is key to building more powerful and generalizable models.
zh
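文中报告的 R-squared 即线性拟合的决定系数,对一元情形等于皮尔逊相关系数的平方,可按如下方式计算(示例数据为假设值,仅演示计算过程):

```python
def r_squared(xs, ys):
    # 决定系数:一元线性情形下等于皮尔逊相关系数的平方
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov * cov / (vx * vy)

diversity = [0.1, 0.3, 0.5, 0.7, 0.9]      # 假设的数据集多样性度量
accuracy = [0.52, 0.58, 0.61, 0.70, 0.74]  # 对应的测试集准确率(假设值)
r2 = r_squared(diversity, accuracy)
assert 0.9 < r2 <= 1.0  # 本例数据近似线性,相关性很强
```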
[CV-46] Yuan: Yielding Unblemished Aesthetics Through A Unified Network for Visual Imperfections Removal in Generated Images
【速读】:该论文试图解决生成式 AI(Generative AI)在文本到图像合成(text-to-image synthesis)过程中产生的视觉缺陷问题,如解剖学不准确、物体位置不当以及文本元素错位等。这些缺陷严重影响了生成图像的实际应用价值。为解决这些问题,论文提出了一个名为 Yuan 的新框架,其关键创新在于能够自主纠正这些视觉缺陷。Yuan 通过同时结合文本提示(textual prompt)和分割图像(segmented image)生成精确的掩码(masks),自动识别需要修复的区域,而无需人工干预。随后,通过一个先进的修复模块(inpainting module),将上下文一致的内容无缝集成到识别出的区域中,从而保持原始图像和文本提示的完整性和保真度。实验结果表明,Yuan 在多个公开数据集和自定义数据集上均表现出色,显著提升了生成图像的质量和适用性。
链接: https://arxiv.org/abs/2501.08505
作者: Zhenyu Yu,Chee Seng Chan
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
备注:
点击查看摘要
Abstract:Generative AI presents transformative potential across various domains, from creative arts to scientific visualization. However, the utility of AI-generated imagery is often compromised by visual flaws, including anatomical inaccuracies, improper object placements, and misplaced textual elements. These imperfections pose significant challenges for practical applications. To overcome these limitations, we introduce Yuan, a novel framework that autonomously corrects visual imperfections in text-to-image synthesis. Yuan uniquely conditions on both the textual prompt and the segmented image, generating precise masks that identify areas in need of refinement without requiring manual intervention – a common constraint in previous methodologies. Following the automated masking process, an advanced inpainting module seamlessly integrates contextually coherent content into the identified regions, preserving the integrity and fidelity of the original image and associated text prompts. Through extensive experimentation on publicly available datasets such as ImageNet100 and Stanford Dogs, along with a custom-generated dataset, Yuan demonstrated superior performance in eliminating visual imperfections. Our approach consistently achieved higher scores in quantitative metrics, including NIQE, BRISQUE, and PI, alongside favorable qualitative evaluations. These results underscore Yuan’s potential to significantly enhance the quality and applicability of AI-generated images across diverse fields.
zh
[CV-47] SuperSAM: Crafting a SAM Supernetwork via Structured Pruning and Unstructured Parameter Prioritization
【速读】:该论文旨在解决基于视觉Transformer(Vision Transformer, ViT)架构的神经架构搜索(Neural Architecture Search, NAS)中的搜索空间设计问题。传统的一体化NAS方法虽然有效,但在设计搜索空间时仍面临挑战。为此,论文提出了一种新的搜索空间设计策略,将Segment Anything Model(SAM)转换为一个权重共享的超网络(SuperSAM)。该策略的关键在于通过层次化结构化剪枝(layer-wise structured pruning)和参数优先级排序(parameter prioritization)来自动化搜索空间设计。结构化剪枝通过概率性地移除某些Transformer层,而参数优先级排序则对剩余层中的MLP块进行权重重排序和切片。此外,论文采用三明治规则(sandwich rule)训练超网络,并利用程序自动调优器(program autotuner)来发现搜索空间内的高效子网络。实验结果表明,所得到的子网络比预训练的SAM ViT-B模型小30-70%,且性能更优。该研究为ViT NAS的搜索空间设计提供了一种新的有效方法。
链接: https://arxiv.org/abs/2501.08504
作者: Waqwoya Abebe,Sadegh Jafari,Sixing Yu,Akash Dutta,Jan Strube,Nathan R. Tallent,Luanzheng Guo,Pablo Munoz,Ali Jannesari
机构: Iowa State University(爱荷华州立大学); Pacific Northwest National Laboratory(太平洋西北国家实验室); Intel Labs(英特尔实验室)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:
点击查看摘要
Abstract:Neural Architecture Search (NAS) is a powerful approach of automating the design of efficient neural architectures. In contrast to traditional NAS methods, recently proposed one-shot NAS methods prove to be more efficient in performing NAS. One-shot NAS works by generating a singular weight-sharing supernetwork that acts as a search space (container) of subnetworks. Despite its achievements, designing the one-shot search space remains a major challenge. In this work we propose a search space design strategy for Vision Transformer (ViT)-based architectures. In particular, we convert the Segment Anything Model (SAM) into a weight-sharing supernetwork called SuperSAM. Our approach involves automating the search space design via layer-wise structured pruning and parameter prioritization. While the structured pruning applies probabilistic removal of certain transformer layers, parameter prioritization performs weight reordering and slicing of MLP-blocks in the remaining layers. We train supernetworks on several datasets using the sandwich rule. For deployment, we enhance subnetwork discovery by utilizing a program autotuner to identify efficient subnetworks within the search space. The resulting subnetworks are 30-70% smaller in size compared to the original pre-trained SAM ViT-B, yet outperform the pretrained model. Our work introduces a new and effective method for ViT NAS search-space design.
zh
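论文中的“参数优先级排序”可以用一个极简的 NumPy 草图来说明(编者的假设性实现,非论文代码;此处以权重 L1 范数作为重要性代理,论文的具体准则可能不同):按重要性对 MLP 隐藏单元重排序后,截取前 k 个单元即得到保留最重要神经元的子网络。

```python
import numpy as np

def prioritize_mlp(W_in, W_out):
    """Reorder hidden units of a 2-layer MLP block by importance.

    Importance of hidden unit j is taken as the L1 norm of its incoming
    and outgoing weights (a simple proxy, assumed for illustration).
    After reordering, slicing the first k units keeps the most
    important neurons, as in supernetwork weight slicing.
    """
    importance = np.abs(W_in).sum(axis=1) + np.abs(W_out).sum(axis=0)
    order = np.argsort(-importance)          # most important first
    return W_in[order], W_out[:, order]

def mlp_forward(x, W_in, W_out, k=None):
    """Forward pass; if k is given, use only the first k hidden units."""
    if k is not None:
        W_in, W_out = W_in[:k], W_out[:, :k]
    h = np.maximum(x @ W_in.T, 0.0)          # ReLU hidden layer
    return h @ W_out.T
```

重排序后的完整网络与原网络在功能上完全等价;子网络则通过切片前 k 行/列得到。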
[CV-48] FLAVARS: A Multimodal Foundational Language and Vision Alignment Model for Remote Sensing
【速读】:该论文试图解决在遥感图像处理中,结合对比学习(contrastive learning)和掩码建模(masked modeling)的预训练方法在视觉任务中性能下降的问题。具体来说,尽管像CLIP这样的对比图像-文本方法能够实现视觉-语言对齐和零样本分类能力,但在仅视觉任务(如KNN分类和语义分割)中,其性能往往不如仅使用图像预训练的方法(如MAE)。论文提出的解决方案FLAVARS结合了对比学习和掩码建模的优势,并通过对比位置编码(contrastive location encoding)实现地理空间对齐。实验表明,FLAVARS在仅视觉任务中显著优于基线方法SkyCLIP,例如在SpaceNet1数据集上提升了6%的mIOU(mean Intersection over Union),同时保留了零样本分类能力,这是MAE预训练方法所不具备的。
链接: https://arxiv.org/abs/2501.08490
作者: Isaac Corley,Simone Fobi Nsutezo,Anthony Ortiz,Caleb Robinson,Rahul Dodhia,Juan M. Lavista Ferres,Peyman Najafirad
机构: University of Texas at San Antonio (德克萨斯大学圣安东尼奥分校); Microsoft AI for Good Research Lab (微软AI for Good研究实验室)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:
点击查看摘要
Abstract:Remote sensing imagery is dense with objects and contextual visual information. There is a recent trend to combine paired satellite images and text captions for pretraining performant encoders for downstream tasks. However, while contrastive image-text methods like CLIP enable vision-language alignment and zero-shot classification ability, vision-only downstream performance tends to degrade compared to image-only pretraining, such as MAE. In this paper, we propose FLAVARS, a pretraining method that combines the best of both contrastive learning and masked modeling, along with geospatial alignment via contrastive location encoding. We find that FLAVARS significantly outperforms a baseline of SkyCLIP for vision-only tasks such as KNN classification and semantic segmentation, +6% mIOU on SpaceNet1, while retaining the ability to perform zero-shot classification, unlike MAE pretrained methods.
zh
[CV-49] Benchmarking Classical Deep and Generative Models for Human Activity Recognition
【速读】:该论文旨在解决人类活动识别(Human Activity Recognition, HAR)领域中不同模型在不同数据集上的性能评估问题。通过对经典机器学习模型(如决策树和随机森林)、深度学习架构(如卷积神经网络CNN和深度信念网络DBNs)以及受限玻尔兹曼机(Restricted Boltzmann Machines, RBMs)在五个关键基准数据集(UCI-HAR、OPPORTUNITY、PAMAP2、WISDM和Berkeley MHAD)上的性能进行全面比较,论文评估了这些模型在准确性、精确度、召回率和F1分数等指标上的表现。研究结果表明,CNN模型在所有数据集上表现优异,尤其是在Berkeley MHAD数据集上;经典模型如随机森林在较小数据集上表现良好,但在处理更大、更复杂的数据时面临挑战;RBM模型在特征学习方面显示出显著潜力。论文通过详细比较为研究人员选择最适合HAR任务的模型提供了依据。
链接: https://arxiv.org/abs/2501.08471
作者: Md Meem Hossain, TheAnh Han,Safina Showkat Ara,Zia Ush Shamszaman
机构: Department of Computing and Games, Teesside University(提赛德大学); University of Sunderland(桑德兰大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 48 pages, 21 Figures
点击查看摘要
Abstract:Human Activity Recognition (HAR) has gained significant importance with the growing use of sensor-equipped devices and large datasets. This paper evaluates the performance of three categories of models : classical machine learning, deep learning architectures, and Restricted Boltzmann Machines (RBMs) using five key benchmark datasets of HAR (UCI-HAR, OPPORTUNITY, PAMAP2, WISDM, and Berkeley MHAD). We assess various models, including Decision Trees, Random Forests, Convolutional Neural Networks (CNN), and Deep Belief Networks (DBNs), using metrics such as accuracy, precision, recall, and F1-score for a comprehensive comparison. The results show that CNN models offer superior performance across all datasets, especially on the Berkeley MHAD. Classical models like Random Forest do well on smaller datasets but face challenges with larger, more complex data. RBM-based models also show notable potential, particularly for feature learning. This paper offers a detailed comparison to help researchers choose the most suitable model for HAR tasks.
zh
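摘要中用于比较各模型的准确率、精确度、召回率与宏平均 F1 可以用如下纯 NumPy 函数复现(示意性实现,与论文的评测脚本无关):

```python
import numpy as np

def classification_metrics(y_true, y_pred, n_classes):
    """Accuracy plus macro-averaged precision, recall, and F1.

    A minimal NumPy re-implementation of the comparison metrics
    (no scikit-learn dependency).
    """
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    acc = float((y_true == y_pred).mean())
    precs, recs, f1s = [], [], []
    for c in range(n_classes):
        tp = np.sum((y_pred == c) & (y_true == c))
        fp = np.sum((y_pred == c) & (y_true != c))
        fn = np.sum((y_pred != c) & (y_true == c))
        p = tp / (tp + fp) if tp + fp else 0.0
        r = tp / (tp + fn) if tp + fn else 0.0
        f = 2 * p * r / (p + r) if p + r else 0.0
        precs.append(p); recs.append(r); f1s.append(f)
    return acc, float(np.mean(precs)), float(np.mean(recs)), float(np.mean(f1s))
```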
[CV-50] Detecting Contextual Anomalies by Discovering Consistent Spatial Regions
【速读】:该论文旨在解决视频异常检测中的空间上下文建模问题。其核心解决方案是通过高斯混合模型(Gaussian Mixture Models, GMM)对联合对象属性进行聚类,从而发现具有相似对象级活动的区域。这种方法的关键在于利用较少的参数(相比其他竞争模型少几个数量级),在具有挑战性的空间上下文依赖的Street Scene数据集中实现了最先进的性能。此外,模型学习到的高分辨率区域还为人类操作员提供了可解释的正常性地图,而无需依赖任何预训练的分割模型。
链接: https://arxiv.org/abs/2501.08470
作者: Zhengye Yang,Richard J. Radke
机构: Rensselaer Polytechnic Institute(伦斯勒理工学院)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:We describe a method for modeling spatial context to enable video anomaly detection. The main idea is to discover regions that share similar object-level activities by clustering joint object attributes using Gaussian mixture models. We demonstrate that this straightforward approach, using orders of magnitude fewer parameters than competing models, achieves state-of-the-art performance in the challenging spatial-context-dependent Street Scene dataset. As a side benefit, the high-resolution discovered regions learned by the model also provide explainable normalcy maps for human operators without the need for any pre-trained segmentation model.
zh
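下面是对“用高斯混合模型聚类联合对象属性”这一核心思路的玩具级草图(编者的假设性实现:对角协方差、朴素 EM 迭代、确定性初始化,均非论文代码)。在论文中,X 对应对象级属性(如位置、尺寸、类别),每个混合成分对应一个发现的空间区域:

```python
import numpy as np

def fit_gmm(X, k, iters=50):
    """Fit a diagonal-covariance Gaussian mixture with plain EM.

    Each mixture component plays the role of one discovered spatial
    region; deterministic initialization keeps the toy reproducible.
    """
    n, d = X.shape
    idx = np.linspace(0, n - 1, k).astype(int)
    mu = X[idx].copy()                             # (k, d) component means
    var = np.tile(X.var(axis=0) + 1e-6, (k, 1))    # (k, d) diagonal variances
    pi = np.full(k, 1.0 / k)                       # mixing weights
    for _ in range(iters):
        # E-step: responsibilities under each diagonal Gaussian
        log_p = -0.5 * (((X[:, None, :] - mu) ** 2) / var
                        + np.log(2 * np.pi * var)).sum(-1) + np.log(pi)
        log_p -= log_p.max(axis=1, keepdims=True)
        r = np.exp(log_p)
        r /= r.sum(axis=1, keepdims=True)
        # M-step: re-estimate parameters from responsibilities
        nk = r.sum(axis=0) + 1e-9
        mu = (r[:, :, None] * X[:, None, :]).sum(0) / nk[:, None]
        var = (r[:, :, None] * (X[:, None, :] - mu) ** 2).sum(0) / nk[:, None] + 1e-6
        pi = nk / n
    return mu, var, pi, r.argmax(axis=1)
```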
[CV-51] Predicting Performance of Object Detection Models in Electron Microscopy Using Random Forests
【速读】:该论文旨在解决在应用深度学习目标检测模型(object detection models)于新的、未标注数据集时,如何量化预测不确定性的问题,特别是在透射电子显微镜(TEM)图像中检测金属合金中辐照诱导空腔的缺陷。解决方案的关键在于开发了一种随机森林回归模型(random forest regression model),该模型通过从目标检测模型的预测中提取特征,快速预测新图像上的F1分数(F1 score),从而评估模型的性能。该方法的平均绝对误差(MAE)为0.09,R^2得分为0.77,表明预测值与真实缺陷检测F1分数之间存在显著相关性。该方法在三个不同的TEM图像数据集上表现出鲁棒性,能够帮助用户评估缺陷检测和分割模型预测的可靠性,并判断模型是否需要针对特定数据集进行微调或进一步训练。
链接: https://arxiv.org/abs/2501.08465
作者: Ni Li,Ryan Jacobs,Matthew Lynch,Vidit Agrawal,Kevin Field,Dane Morgan
机构: University of Wisconsin-Madison (威斯康星大学麦迪逊分校); University of Michigan (密歇根大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Materials Science (cond-mat.mtrl-sci)
备注: 14 pages, 9 figures, 3 tables
点击查看摘要
Abstract:Quantifying prediction uncertainty when applying object detection models to new, unlabeled datasets is critical in applied machine learning. This study introduces an approach to estimate the performance of deep learning-based object detection models for quantifying defects in transmission electron microscopy (TEM) images, focusing on detecting irradiation-induced cavities in TEM images of metal alloys. We developed a random forest regression model that predicts the object detection F1 score, a statistical metric used to evaluate the ability to accurately locate and classify objects of interest. The random forest model uses features extracted from the predictions of the object detection model whose uncertainty is being quantified, enabling fast prediction on new, unlabeled images. The mean absolute error (MAE) for predicting F1 of the trained model on test data is 0.09, and the R^2 score is 0.77, indicating there is a significant correlation between the random forest regression model predicted and true defect detection F1 scores. The approach is shown to be robust across three distinct TEM image datasets with varying imaging and material domains. Our approach enables users to estimate the reliability of a defect detection and segmentation model predictions and assess the applicability of the model to their specific datasets, providing valuable information about possible domain shifts and whether the model needs to be fine-tuned or trained on additional data to be maximally effective for the desired use case.
zh
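随机森林所预测的目标检测 F1 分数本身可按 IoU 匹配计算;下面给出一个示意性实现(贪心一对一匹配策略为编者假设,论文可能采用不同的匹配规则):

```python
def iou(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0

def detection_f1(pred_boxes, gt_boxes, thr=0.5):
    """Greedy one-to-one matching at an IoU threshold, then F1."""
    matched = set()
    tp = 0
    for p in pred_boxes:
        best, best_iou = None, thr
        for j, g in enumerate(gt_boxes):
            if j in matched:
                continue
            v = iou(p, g)
            if v >= best_iou:
                best, best_iou = j, v
        if best is not None:
            matched.add(best)
            tp += 1
    fp = len(pred_boxes) - tp
    fn = len(gt_boxes) - tp
    prec = tp / (tp + fp) if tp + fp else 0.0
    rec = tp / (tp + fn) if tp + fn else 0.0
    return 2 * prec * rec / (prec + rec) if prec + rec else 0.0
```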
[CV-52] Vchitect-2.0: Parallel Transformer for Scaling Up Video Diffusion Models
【速读】:该论文旨在解决大规模文本到视频生成(text-to-video generation)中的关键挑战,包括文本描述与生成视频帧之间的一致性、时间序列的连贯性,以及训练长视频序列时的内存和计算瓶颈问题。解决方案的关键在于三个方面:(1) 引入了一种新颖的多模态扩散块(Multimodal Diffusion Block),确保文本描述与生成视频帧之间的一致性,并保持时间序列的连贯性;(2) 提出了一个内存高效训练框架(Memory-efficient Training framework),结合混合并行(hybrid parallelism)和其他内存优化技术,使得在分布式系统上高效训练长视频序列成为可能;(3) 通过增强的数据处理流程,创建了一个高质量的大规模训练数据集 Vchitect T2V DataVerse,确保了数据的严格标注和美学评估。这些设计使得 Vchitect-2.0 在视频质量、训练效率和可扩展性方面优于现有方法,成为高保真视频生成的理想基础。
链接: https://arxiv.org/abs/2501.08453
作者: Weichen Fan,Chenyang Si,Junhao Song,Zhenyu Yang,Yinan He,Long Zhuo,Ziqi Huang,Ziyue Dong,Jingwen He,Dongwei Pan,Yi Wang,Yuming Jiang,Yaohui Wang,Peng Gao,Xinyuan Chen,Hengjie Li,Dahua Lin,Yu Qiao,Ziwei Liu
机构: S-Lab, Nanyang Technological University, Singapore, 639798 (新加坡南洋理工大学S-Lab); Shanghai Artificial Intelligence Laboratory, China (上海人工智能实验室); The Chinese University of Hong Kong (香港中文大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:
点击查看摘要
Abstract:We present Vchitect-2.0, a parallel transformer architecture designed to scale up video diffusion models for large-scale text-to-video generation. The overall Vchitect-2.0 system has several key designs. (1) By introducing a novel Multimodal Diffusion Block, our approach achieves consistent alignment between text descriptions and generated video frames, while maintaining temporal coherence across sequences. (2) To overcome memory and computational bottlenecks, we propose a Memory-efficient Training framework that incorporates hybrid parallelism and other memory reduction techniques, enabling efficient training of long video sequences on distributed systems. (3) Additionally, our enhanced data processing pipeline ensures the creation of Vchitect T2V DataVerse, a high-quality million-scale training dataset through rigorous annotation and aesthetic evaluation. Extensive benchmarking demonstrates that Vchitect-2.0 outperforms existing methods in video quality, training efficiency, and scalability, serving as a suitable base for high-fidelity video generation.
zh
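论文的内存高效训练框架涉及混合并行等系统级技术;这里仅以“按查询分块计算注意力”示意其中一类内存削减思路(编者提供的玩具实现,与 Vchitect-2.0 的实际代码无关)。分块计算在数学上与一次性计算完全一致,但无需物化整个 n×n 的注意力矩阵:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention_full(Q, K, V):
    """Reference attention; the score matrix costs O(n^2) memory."""
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    return softmax(scores) @ V

def attention_chunked(Q, K, V, chunk=64):
    """Process queries in blocks: peak score memory is O(chunk * n)."""
    out = np.empty((Q.shape[0], V.shape[1]))
    for s in range(0, Q.shape[0], chunk):
        block = Q[s:s + chunk] @ K.T / np.sqrt(Q.shape[-1])
        out[s:s + chunk] = softmax(block) @ V
    return out
```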
[CV-53] Poseidon: A ViT-based Architecture for Multi-Frame Pose Estimation with Adaptive Frame Weighting and Multi-Scale Feature Fusion
【速读】:该论文试图解决单帧姿态估计(single-frame pose estimation)在捕捉复杂连续运动时无法有效利用时间动态信息的问题。为了解决这一局限性,论文提出了Poseidon,一种新颖的多帧姿态估计架构,通过集成时间信息来提高准确性和鲁棒性。解决方案的关键在于三个创新点:(1) 自适应帧加权机制(Adaptive Frame Weighting, AFW),动态地根据帧的相关性进行优先级排序,确保模型聚焦于最具信息量的数据;(2) 多尺度特征融合模块(Multi-Scale Feature Fusion, MSFF),聚合来自不同骨干网络层的特征,以捕捉细粒度细节和高层次语义;(3) 跨注意力模块(Cross-Attention),用于中心帧和上下文帧之间的有效信息交换,增强模型的时间一致性。这些创新使得Poseidon在复杂视频场景中表现出色,并在PoseTrack21和PoseTrack18数据集上达到了88.3和87.8的mAP分数,超越了现有方法。
链接: https://arxiv.org/abs/2501.08446
作者: Cesare Davide Pace,Alessandro Marco De Nunzio,Claudio De Stefano,Francesco Fontanella,Mario Molinara
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Human pose estimation, a vital task in computer vision, involves detecting and localising human joints in images and videos. While single-frame pose estimation has seen significant progress, it often fails to capture the temporal dynamics for understanding complex, continuous movements. We propose Poseidon, a novel multi-frame pose estimation architecture that extends the ViTPose model by integrating temporal information for enhanced accuracy and robustness to address these limitations. Poseidon introduces key innovations: (1) an Adaptive Frame Weighting (AFW) mechanism that dynamically prioritises frames based on their relevance, ensuring that the model focuses on the most informative data; (2) a Multi-Scale Feature Fusion (MSFF) module that aggregates features from different backbone layers to capture both fine-grained details and high-level semantics; and (3) a Cross-Attention module for effective information exchange between central and contextual frames, enhancing the model’s temporal coherence. The proposed architecture improves performance in complex video scenarios and offers scalability and computational efficiency suitable for real-world applications. Our approach achieves state-of-the-art performance on the PoseTrack21 and PoseTrack18 datasets, achieving mAP scores of 88.3 and 87.8, respectively, outperforming existing methods.
zh
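自适应帧加权(AFW)的核心是对帧相关性分数做 softmax 归一化,再对帧特征加权聚合。下面是编者的假设性草图(产生相关性分数的打分网络省略,分数视为给定):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def adaptive_frame_weighting(frame_feats, relevance_scores):
    """Aggregate per-frame features with softmax-normalized relevance.

    frame_feats: (T, D) features for T frames; relevance_scores: (T,).
    Frames with higher relevance contribute more to the aggregate.
    """
    w = softmax(np.asarray(relevance_scores, dtype=float))
    return w @ np.asarray(frame_feats, dtype=float), w
```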
[CV-54] Instruction-Guided Fusion of Multi-Layer Visual Features in Large Vision-Language Models
【速读】:该论文试图解决当前大型视觉-语言模型(Large Vision-Language Models, LVLMs)在利用视觉编码器特征时主要依赖最终层特征,而忽略了浅层特征中的互补信息的问题。尽管已有方法探索了多层特征的利用,但这些方法通常是任务无关的。论文通过研究不同编码器层视觉特征在18个基准测试和6个任务类别中的贡献,发现多层特征在不同任务中具有互补优势,而均匀融合效果不佳。基于这些发现,论文提出了一种基于指令引导的视觉聚合器(instruction-guided vision aggregator),该聚合器能够根据文本指令动态整合多层特征,且不增加视觉标记的数量。实验结果表明,该方法在性能上表现优异,分析揭示了中高层特征在语义任务中的主导作用以及低层特征在细粒度感知中的关键作用。这一研究为LVLMs中分层视觉特征的自适应使用提供了有价值的见解,推动了更灵活的多模态系统的发展。
链接: https://arxiv.org/abs/2501.08443
作者: Xu Li,Yi Zheng,Haotian Chen,Xiaolei Chen,Yuxuan Liang,Chenghang Lai
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:
点击查看摘要
Abstract:Large Vision-Language Models (LVLMs) have achieved significant success in multimodal tasks by combining pre-trained vision encoders and large language models. However, current LVLMs mainly rely on features from the final layers of the vision encoder, neglecting complementary information in shallower layers. While recent methods have explored multi-layer features, they are often task-agnostic. We investigate the contributions of visual features from different encoder layers across 18 benchmarks and 6 task categories. Our results show that multi-layer features provide complementary strengths with varying task dependencies, and uniform fusion performs suboptimally. Based on these findings, we propose an instruction-guided vision aggregator that dynamically integrates multi-layer features based on textual instructions, without increasing the number of visual tokens. Extensive evaluations show superior performance, and analysis reveals the dominance of mid-to-high-level features in semantic tasks and the critical role of low-level features in fine-grained perception. This work provides valuable insights into the adaptive use of hierarchical visual features in LVLMs, advancing more flexible multimodal systems.
zh
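指令引导的多层特征聚合可示意为:由文本指令嵌入经一个门控头预测各层权重,再对层特征加权求和,且不增加视觉 token 数量(其中线性门控头 W 为编者假设的简化形式):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def instruction_guided_fusion(layer_feats, text_emb, W):
    """Fuse per-layer visual features with instruction-predicted weights.

    layer_feats: (L, N, D) features from L encoder layers over N visual
    tokens; text_emb: (E,) instruction embedding; W: (L, E) hypothetical
    linear gating head. The token count N is unchanged by fusion.
    """
    weights = softmax(W @ text_emb)                      # (L,) one weight per layer
    fused = np.tensordot(weights, layer_feats, axes=1)   # (N, D)
    return fused, weights
```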
[CV-55] FARE: A Deep Learning-Based Framework for Radar-based Face Recognition and Out-of-distribution Detection ICASSP2025
【速读】:该论文旨在解决基于短距离调频连续波(FMCW)雷达的人脸识别(face recognition)和分布外检测(out-of-distribution, OOD detection)问题。解决方案的关键在于提出了一种新颖的管道架构,该架构包含一个主路径(primary path, PP)用于分布内(in-distribution, ID)人脸的分类,以及多个中间路径(intermediate paths, IPs)专门用于OOD检测。网络训练分为两个阶段:首先,主路径通过三元组损失(triplet loss)进行优化,以提高ID人脸分类的准确性;其次,主路径被冻结,中间路径(由简单的线性自编码器网络组成)专门用于OOD检测的训练。实验结果表明,该方法在使用60 GHz FMCW雷达生成的数据集上,ID分类准确率达到99.30%,OOD检测的AUROC(Area Under the Receiver Operating Characteristic curve)达到96.91%。
链接: https://arxiv.org/abs/2501.08440
作者: Sabri Mustafa Kahya,Boran Hamdi Sivrikaya,Muhammet Sami Yavuz,Eckehard Steinbach
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Signal Processing (eess.SP)
备注: Accepted at ICASSP 2025
点击查看摘要
Abstract:In this work, we propose a novel pipeline for face recognition and out-of-distribution (OOD) detection using short-range FMCW radar. The proposed system utilizes Range-Doppler and micro Range-Doppler Images. The architecture features a primary path (PP) responsible for the classification of in-distribution (ID) faces, complemented by intermediate paths (IPs) dedicated to OOD detection. The network is trained in two stages: first, the PP is trained using triplet loss to optimize ID face classification. In the second stage, the PP is frozen, and the IPs-comprising simple linear autoencoder networks-are trained specifically for OOD detection. Using our dataset generated with a 60 GHz FMCW radar, our method achieves an ID classification accuracy of 99.30% and an OOD detection AUROC of 96.91%.
zh
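第一阶段主路径使用的三元组损失,以及中间路径自编码器可用于 OOD 评分的重构误差,可示意如下(欧氏距离版本,属编者的假设性实现,编码/解码函数以参数形式传入):

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Euclidean triplet loss: max(0, d(a, p) - d(a, n) + margin)."""
    d_ap = np.linalg.norm(anchor - positive)
    d_an = np.linalg.norm(anchor - negative)
    return max(0.0, d_ap - d_an + margin)

def reconstruction_ood_score(x, encode, decode):
    """OOD score of an intermediate-path autoencoder: reconstruction
    error, expected to be high for out-of-distribution inputs."""
    return float(np.mean((x - decode(encode(x))) ** 2))
```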
[CV-56] Cross-Modal Transferable Image-to-Video Attack on Video Quality Metrics
【速读】:该论文试图解决现代图像和视频质量评估(IQA/VQA)模型在面对对抗攻击时的脆弱性问题。具体来说,现有研究主要集中在白盒攻击(white-box attacks)上,而黑盒攻击(black-box attacks)在VQA领域的应用较少受到关注。此外,针对一个模型生成的对抗样本在迁移到另一个模型时缺乏可迁移性(transferability)。为解决这些问题,论文提出了一种跨模态攻击方法IC2VQA,旨在探索现代VQA模型的漏洞。该方法的关键在于利用图像和视频在低层特征空间上的相似性,通过在白盒IQA模型上生成对抗扰动,并结合CLIP模块(Contrastive Language–Image Pretraining)来增强对抗样本的可迁移性。实验表明,IC2VQA在攻击三个黑盒VQA模型时具有较高的成功率,并且在相同迭代次数和攻击强度下优于现有的黑盒攻击策略。
链接: https://arxiv.org/abs/2501.08415
作者: Georgii Gotin,Ekaterina Shumitskaya,Anastasia Antsiferova,Dmitriy Vatolin
机构: Lomonosov Moscow State University, Moscow, Russia(莫斯科国立大学); ISP RAS Research Center for Trusted Artificial Intelligence, Moscow, Russia(俄罗斯科学院ISP RAS可信人工智能研究中心); MSU Institute for Artificial Intelligence, Moscow, Russia(莫斯科国立大学人工智能研究所); Laboratory of Innovative Technologies for Processing Video Content, Innopolis University, Innopolis, Russia(因诺波利斯大学视频内容处理创新技术实验室)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Accepted for VISAPP 2025
点击查看摘要
Abstract:Recent studies have revealed that modern image and video quality assessment (IQA/VQA) metrics are vulnerable to adversarial attacks. An attacker can manipulate a video through preprocessing to artificially increase its quality score according to a certain metric, despite no actual improvement in visual quality. Most of the attacks studied in the literature are white-box attacks, while black-box attacks in the context of VQA have received less attention. Moreover, some research indicates a lack of transferability of adversarial examples generated for one model to another when applied to VQA. In this paper, we propose a cross-modal attack method, IC2VQA, aimed at exploring the vulnerabilities of modern VQA models. This approach is motivated by the observation that the low-level feature spaces of images and videos are similar. We investigate the transferability of adversarial perturbations across different modalities; specifically, we analyze how adversarial perturbations generated on a white-box IQA model with an additional CLIP module can effectively target a VQA model. The addition of the CLIP module serves as a valuable aid in increasing transferability, as the CLIP model is known for its effective capture of low-level semantics. Extensive experiments demonstrate that IC2VQA achieves a high success rate in attacking three black-box VQA models. We compare our method with existing black-box attack strategies, highlighting its superiority in terms of attack success within the same number of iterations and levels of attack strength. We believe that the proposed method will contribute to the deeper analysis of robust VQA metrics.
zh
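IC2VQA 本身是借助 CLIP 模块的跨模态黑盒攻击;这里仅示意其底层的梯度符号(FGSM 式)扰动步骤,并以数值梯度代替对白盒指标的反向传播(玩具“质量指标”与实现均为编者假设):

```python
import numpy as np

def numeric_grad(f, x, eps=1e-4):
    """Central-difference gradient (a stand-in for backpropagation
    through a white-box quality metric)."""
    g = np.zeros_like(x)
    for i in range(x.size):
        d = np.zeros_like(x)
        d.flat[i] = eps
        g.flat[i] = (f(x + d) - f(x - d)) / (2 * eps)
    return g

def fgsm_boost(metric, x, eps=0.01):
    """One FGSM-style step that increases the metric score while
    keeping the perturbation inside an L-infinity ball of radius eps."""
    return np.clip(x + eps * np.sign(numeric_grad(metric, x)), 0.0, 1.0)
```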
[CV-57] BiDepth Multimodal Neural Network: Bidirectional Depth Deep Learning Arcitecture for Spatial-Temporal Prediction
【速读】:该论文试图解决动态系统中时空信息(Spatial-Temporal, ST)的准确预测问题,特别是在城市交通和天气模式等复杂系统中,由于空间邻近性和时间相关性之间的复杂交互作用,传统统计方法和常规神经网络难以同时兼顾长期趋势和短期波动,导致预测结果不准确。为解决这一问题,论文提出了双向深度调制多模态神经网络(BiDepth Multimodal Neural Network, BDMNN),其关键创新在于通过双向深度调制机制,能够同时捕捉长期季节性和短期波动,从而适应复杂的时空上下文。实验结果表明,BDMNN在城市交通预测和降雨预报中显著提升了预测精度,分别减少了12%的均方误差和提高了15%的预测准确率,且无需额外的计算资源。
链接: https://arxiv.org/abs/2501.08411
作者: Sina Ehsani,Fenglian Pan,Qingpei Hu,Jian Liu
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Applications (stat.AP)
备注: This paper has been submitted to Applied Intelligence for review
点击查看摘要
Abstract:Accurate prediction of spatial-temporal (ST) information in dynamic systems, such as urban mobility and weather patterns, is a crucial yet challenging problem. The complexity stems from the intricate interplay between spatial proximity and temporal relevance, where both long-term trends and short-term fluctuations are present in convoluted patterns. Existing approaches, including traditional statistical methods and conventional neural networks, may provide inaccurate results due to the lack of an effective mechanism that simultaneously incorporates information at variable temporal depths while maintaining spatial context, resulting in a trade-off between comprehensive long-term historical analysis and responsiveness to short-term new information. To bridge this gap, this paper proposes the BiDepth Multimodal Neural Network (BDMNN) with bidirectional depth modulation that enables a comprehensive understanding of both long-term seasonality and short-term fluctuations, adapting to the complex ST context. Case studies with real-world public data demonstrate significant improvements in prediction accuracy, with a 12% reduction in Mean Squared Error for urban traffic prediction and a 15% improvement in rain precipitation forecasting compared to state-of-the-art benchmarks, without demanding extra computational resources.
zh
[CV-58] Leveraging 2D Masked Reconstruction for Domain Adaptation of 3D Pose Estimation
【速读】:该论文试图解决基于RGB图像的3D姿态估计(3D pose estimation)方法在测试图像分布与训练数据分布差异较大时表现不佳的问题。现有的方法通常依赖于高质量标注的3D姿态数据集,但在实际应用中,收集具有多样性的标注数据(尤其是3D姿态数据)非常困难。为此,论文提出了一种无监督域适应(unsupervised domain adaptation)框架,通过引入未标注数据来增强模型的泛化能力。解决方案的关键在于利用掩码图像建模(masked image modeling, MIM)框架,结合前景中心重建(foreground-centric reconstruction)和注意力正则化(attention regularization)技术,有效提升未标注数据的使用效果。实验结果表明,该方法在跨域场景下的人体和手部姿态估计任务中均达到了最先进的精度。
链接: https://arxiv.org/abs/2501.08408
作者: Hansoo Park,Chanwoo Kim,Jihyeon Kim,Hoseong Cho,Nhat Nguyen Bao Truong,Taehwan Kim,Seungryul Baek
机构: Institution1; Institution2
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: 16 pages, 7 figures
点击查看摘要
Abstract:RGB-based 3D pose estimation methods have been successful with the development of deep learning and the emergence of high-quality 3D pose datasets. However, most existing methods do not operate well for testing images whose distribution is far from that of training data. This problem might be alleviated by involving diverse data during training; however, it is non-trivial to collect such diverse data with corresponding labels (i.e. 3D pose). In this paper, we introduce an unsupervised domain adaptation framework for 3D pose estimation that utilizes unlabeled data in addition to labeled data via the masked image modeling (MIM) framework. Foreground-centric reconstruction and attention regularization are further proposed to increase the effectiveness of unlabeled data usage. Experiments are conducted on various datasets for human and hand pose estimation tasks, especially under the cross-domain scenario. We demonstrate the effectiveness of our approach by achieving state-of-the-art accuracy on all datasets.
zh
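掩码图像建模(MIM)的两个要点(随机掩码图像块,且仅在被掩码的块上计算重构损失)可示意如下(编者的假设性实现,可见块不提供重构监督信号):

```python
import numpy as np

def random_patch_mask(n_patches, mask_ratio, rng):
    """Boolean mask: True marks patches hidden from the encoder."""
    n_mask = int(round(n_patches * mask_ratio))
    mask = np.zeros(n_patches, dtype=bool)
    mask[rng.choice(n_patches, n_mask, replace=False)] = True
    return mask

def mim_loss(pred_patches, target_patches, mask):
    """MIM reconstruction loss, computed only on masked patches."""
    diff = (pred_patches - target_patches) ** 2
    return float(diff[mask].mean())
```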
[CV-59] 3D Gaussian Splatting with Normal Information for Mesh Extraction and Improved Rendering ICASSP2025
【速读】:该论文试图解决基于可微分3D高斯泼溅(Differentiable 3D Gaussian Splatting)渲染技术在复杂场景重建中由于依赖光度损失(photometric losses)而导致几何重建不精确、提取的网格(mesh)质量不高的问题,尤其是在高曲率或细节丰富的区域。为了解决这一问题,论文提出了一种新的正则化方法,通过利用从高斯分布估计的有符号距离函数(signed distance function)的梯度来改善渲染质量,并同时提取表面网格。该方法的关键在于引入法向监督(normal supervision)作为正则化项,从而在保持网格质量的同时提升渲染效果。这一改进对于视频生成、动画、增强现实-虚拟现实(AR-VR)以及游戏等下游应用具有重要意义。实验结果表明,该方法在Mip-NeRF360、Tanks and Temples和Deep-Blending等数据集上,相较于其他网格提取渲染方法,在光真实感(photorealism)指标上表现更优。
链接: https://arxiv.org/abs/2501.08370
作者: Meenakshi Krishnan,Liam Fowl,Ramani Duraiswami
机构: 未知
类目: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV)
备注: ICASSP 2025: Workshop on Generative Data Augmentation for Real-World Signal Processing Applications
点击查看摘要
Abstract:Differentiable 3D Gaussian splatting has emerged as an efficient and flexible rendering technique for representing complex scenes from a collection of 2D views and enabling high-quality real-time novel-view synthesis. However, its reliance on photometric losses can lead to imprecisely reconstructed geometry and extracted meshes, especially in regions with high curvature or fine detail. We propose a novel regularization method using the gradients of a signed distance function estimated from the Gaussians, to improve the quality of rendering while also extracting a surface mesh. The regularizing normal supervision facilitates better rendering and mesh reconstruction, which is crucial for downstream applications in video generation, animation, AR-VR and gaming. We demonstrate the effectiveness of our approach on datasets such as Mip-NeRF360, Tanks and Temples, and Deep-Blending. Our method scores higher on photorealism metrics compared to other mesh extracting rendering methods without compromising mesh quality.
zh
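正则化所依赖的“由 SDF 梯度得到法向”可以用有限差分示意:对球面 SDF,归一化后的梯度即为单位法向(编者的示意性实现,非论文代码;论文中的 SDF 由高斯分布估计而来):

```python
import numpy as np

def sphere_sdf(p, radius=1.0):
    """Signed distance to a sphere centered at the origin."""
    return np.linalg.norm(p) - radius

def sdf_normal(sdf, p, eps=1e-5):
    """Unit normal as the normalized central-difference gradient of
    the SDF; this is the quantity such a regularizer supervises."""
    grad = np.array([
        (sdf(p + eps * e) - sdf(p - eps * e)) / (2 * eps)
        for e in np.eye(3)
    ])
    return grad / np.linalg.norm(grad)
```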
[CV-60] Weight Averaging for Out-of-Distribution Generalization and Few-Shot Domain Adaptation
【速读】:该论文试图解决的是在数据分布发生变化时,模型在测试数据上的泛化能力不足的问题,即分布外泛化(out-of-distribution generalization)问题。具体来说,当测试数据的分布与训练数据的分布不同时,传统的经验风险最小化(Empirical Risk Minimization, ERM)方法表现不佳。论文提出了两种关键解决方案:权重平均(Weight Averaging, WA)和锐度感知最小化(Sharpness-Aware Minimization, SAM)。WA通过训练多个具有不同超参数的模型并对其权重进行平均,显著提升了分布外泛化性能。SAM则通过优化神经网络以寻找平坦区域的最小值,这些区域在分布变化下表现良好。此外,论文进一步提出通过引入梯度相似性作为损失正则化器来显式增加WA中的模型多样性,并结合WA和SAM来解决少样本域适应(few-shot domain adaptation)问题。实验结果表明,结合WA和SAM的方法在多个数据集上显著提升了分布外泛化性能和少样本域适应精度。
链接: https://arxiv.org/abs/2501.08361
作者: Shijian Xu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: Master Thesis
点击查看摘要
Abstract:Empirical risk minimization (ERM) is not robust to changes in the distribution of data. When the distribution of test data is different from that of training data, the problem is known as out-of-distribution generalization. Recently, two techniques have been developed for addressing out-of-distribution generalization in computer vision: weight averaging (WA) and sharpness-aware minimization (SAM). WA involves training multiple models with different hyperparameters and then averaging the weights of these models, which can significantly improve out-of-distribution generalization performance. SAM optimizes a neural network to find minima in flat regions, which have been proven to perform well under distribution shifts. While these techniques have made great progress, there is still room for improvement and further exploration. In this thesis, we propose increasing the model diversity in WA explicitly by introducing gradient similarity as a loss regularizer to further improve out-of-distribution generalization performance. We also propose combining WA and SAM to solve the problem of few-shot domain adaptation. Our extensive experiments on digits datasets (MNIST, SVHN, USPS, MNIST-M) and other domain adaptation datasets (VLCS, PACS) show that combining WA and SAM leads to improved out-of-distribution generalization performance and significantly increases few-shot domain adaptation accuracy.
zh
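权重平均(WA)这一步本身只是对若干组同构模型权重逐参数取平均,可示意如下(编者的假设性实现,以字典形式表示模型权重):

```python
import numpy as np

def average_weights(state_dicts):
    """Average a list of model state dicts with identical keys/shapes:
    the WA step applied to models trained with different hyperparameters."""
    keys = state_dicts[0].keys()
    return {k: np.mean([sd[k] for sd in state_dicts], axis=0) for k in keys}
```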
[CV-61] A Preliminary Survey of Semantic Descriptive Model for Images
【速读】:该论文试图解决中国古代绘画(Ancient Chinese Paintings, ACP)领域在图像描述和深层次文化分析方面缺乏统一框架的问题。具体而言,研究旨在通过整合图像学理论(iconological theory)和新的术语提取与映射流程,开发一个语义模型(semantic model),以支持对ACP的主题级文化探索和知识组织。解决方案的关键在于利用北京故宫博物院的ACP藏品,构建了一个有效的语义模型(SDM),该模型能够支持进一步的艺术相关知识的组织和文化探索。
链接: https://arxiv.org/abs/2501.08352
作者: Chengxi Yan,Jie Jian,Yang Li
机构: 未知
类目: Digital Libraries (cs.DL); Computer Vision and Pattern Recognition (cs.CV)
备注: 3 pages, 2 figures
点击查看摘要
Abstract:Considering the lack of a unified framework for image description and deep cultural analysis at the subject level in the field of Ancient Chinese Paintings (ACP), this study utilized the Beijing Palace Museum’s ACP collections to develop a semantic model integrating the iconological theory with a new workflow for term extraction and mapping. Our findings underscore the model’s effectiveness. SDM can be used to support further art-related knowledge organization and cultural exploration of ACPs.
zh
[CV-62] SCOT: Self-Supervised Contrastive Pretraining For Zero-Shot Compositional Retrieval WACV2025
【速读】:该论文试图解决组合图像检索(Compositional Image Retrieval, CIR)任务中的两个主要挑战:一是标注三元组数据集(如FashionIQ和CIRR)的构建过程耗时且劳动密集;二是现有模型在未见过的对象和领域上缺乏泛化能力。为解决这些问题,论文提出了一种名为SCOT(Self-supervised COmpositional Training)的零样本组合预训练策略。该策略的关键在于利用现有的大规模图像-文本对数据集,结合大语言模型的生成能力,通过对比学习训练一个嵌入组合网络。具体而言,SCOT利用大规模对比预训练的视觉-语言模型生成的文本嵌入作为代理目标监督,替代目标图像嵌入,从而在零样本设置下超越了现有的零样本组合检索方法,并在FashionIQ和CIRR等标准基准上优于许多全监督方法。
链接: https://arxiv.org/abs/2501.08347
作者: Bhavin Jawade,Joao V. B. Soares,Kapil Thadani,Deen Dayal Mohan,Amir Erfan Eshratifar,Benjamin Culpepper,Paloma de Juan,Srirangaraj Setlur,Venu Govindaraju
机构: Yahoo Research; University at Buffalo, SUNY(纽约州立大学布法罗分校)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Paper accepted at WACV 2025 in round 1
点击查看摘要
Abstract:Compositional image retrieval (CIR) is a multimodal learning task where a model combines a query image with a user-provided text modification to retrieve a target image. CIR finds applications in a variety of domains including product retrieval (e-commerce) and web search. Existing methods primarily focus on fully-supervised learning, wherein models are trained on datasets of labeled triplets such as FashionIQ and CIRR. This poses two significant challenges: (i) curating such triplet datasets is labor intensive; and (ii) models lack generalization to unseen objects and domains. In this work, we propose SCOT (Self-supervised COmpositional Training), a novel zero-shot compositional pretraining strategy that combines existing large image-text pair datasets with the generative capabilities of large language models to contrastively train an embedding composition network. Specifically, we show that the text embedding from a large-scale contrastively-pretrained vision-language model can be utilized as proxy target supervision during compositional pretraining, replacing the target image embedding. In zero-shot settings, this strategy surpasses SOTA zero-shot compositional retrieval methods as well as many fully-supervised methods on standard benchmarks such as FashionIQ and CIRR.
zh
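SCOT 以对比预训练视觉-语言模型的文本嵌入作为代理目标监督;其批内对比(InfoNCE 式)损失可示意如下(温度等超参数为编者假设):第 i 行的组合嵌入(查询图像加修改文本)应与第 i 行的代理目标嵌入匹配,批内其余行作为负样本。

```python
import numpy as np

def l2_normalize(x, axis=-1):
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def info_nce(composed, targets, temperature=0.07):
    """In-batch contrastive loss: row i of `composed` should match
    row i of `targets` against all other rows as negatives."""
    c, t = l2_normalize(composed), l2_normalize(targets)
    logits = c @ t.T / temperature                     # (B, B) similarities
    logits -= logits.max(axis=1, keepdims=True)        # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_prob)))
```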
[CV-63] Vision Foundation Models for Computed Tomography
【速读】:该论文旨在解决放射学领域中复杂任务的处理问题,特别是在多种成像模态下的任务。为此,作者开发了CT-FM,一个基于大规模3D图像预训练的模型,专门设计用于执行各种放射学任务。CT-FM通过无标签对比学习(label-agnostic contrastive learning)在148,000个计算机断层扫描(CT)图像上进行预训练。该模型在全身和肿瘤分割、头部CT分诊、医学图像检索和语义理解等四类任务中表现出色,超越了现有的最先进模型。CT-FM不仅展示了在定量任务上的成功,还具备解剖区域聚类和跨扫描识别相似解剖结构的能力,同时在测试-重测设置中表现出鲁棒性,并能够生成与其嵌入相关的显著区域。通过开源模型权重、代码和数据,该研究旨在支持更适应性强、可靠且可解释的AI解决方案在放射学中的应用。
链接: https://arxiv.org/abs/2501.09001
作者: Suraj Pai(1 and 2 and 3),Ibrahim Hadzic(1 and 2 and 3),Dennis Bontempi(1 and 2 and 3),Keno Bressem(4 and 5),Benjamin H. Kann(1 and 3),Andriy Fedorov(6),Raymond H. Mak(1 and 3),Hugo J. W. L. Aerts(1 and 2 and 3 and 6) ((1) Artificial Intelligence in Medicine (AIM) Program, Mass General Brigham, Harvard Medical School, (2) Radiology and Nuclear Medicine, CARIM & GROW, Maastricht University, (3) Department of Radiation Oncology, Brigham and Women’s Hospital, Dana-Farber Cancer Institute, Harvard Medical School, (4) Department of Diagnostic and Interventional Radiology, Technical University of Munich, School of Medicine and Health, Klinikum rechts der Isar, TUM University Hospital, (5) Department of Cardiovascular Radiology and Nuclear Medicine, Technical University of Munich, School of Medicine and Health, German Heart Center, TUM University Hospital, (6) Department of Radiology, Brigham and Women’s Hospital, Dana-Farber Cancer Institute, Harvard Medical School)
机构: 未知
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注: 6 figures, followed by 9 Extended Data Figures and a Supplementary Information document
点击查看摘要
Abstract:Foundation models (FMs) have shown transformative potential in radiology by performing diverse, complex tasks across imaging modalities. Here, we developed CT-FM, a large-scale 3D image-based pre-trained model designed explicitly for various radiological tasks. CT-FM was pre-trained using 148,000 computed tomography (CT) scans from the Imaging Data Commons through label-agnostic contrastive learning. We evaluated CT-FM across four categories of tasks, namely, whole-body and tumor segmentation, head CT triage, medical image retrieval, and semantic understanding, showing superior performance against state-of-the-art models. Beyond quantitative success, CT-FM demonstrated the ability to cluster regions anatomically and identify similar anatomical and structural concepts across scans. Furthermore, it remained robust across test-retest settings and indicated reasonable salient regions attached to its embeddings. This study demonstrates the value of large-scale medical imaging foundation models and by open-sourcing the model weights, code, and data, aims to support more adaptable, reliable, and interpretable AI solutions in radiology.
zh
[CV-64] Multi-View Transformers for Airway-To-Lung Ratio Inference on Cardiac CT Scans: The C4R Study
【速读】:该论文试图解决的问题是如何从广泛可用的心脏CT图像中推断气道树腔与肺大小的比值(ALR),以研究ALR与严重COVID-19及SARS-CoV-2感染后急性后遗症(PASC)之间的关系。心脏CT扫描通常只包含约2/3的肺体积,且切片厚度比高分辨率全肺CT(HR FL CT)大5-6倍,这限制了其直接用于ALR推断的准确性。论文提出的解决方案是使用一种基于注意力机制的多视图Swin Transformer网络,从分割后的心脏CT图像中推断全肺ALR值。该网络在监督训练中利用了多民族动脉粥样硬化研究(MESA)中获取的成对全肺和心脏CT图像,显著优于直接在心脏CT图像上推断ALR的代理方法,并达到了与全肺ALR真实值的扫描-重扫描重现性相当的准确性和重现性。
链接: https://arxiv.org/abs/2501.08902
作者: Sneha N. Naik,Elsa D. Angelini,Eric A. Hoffman,Elizabeth C. Oelsner,R. Graham Barr,Benjamin M. Smith,Andrew F. Laine
机构: 未知
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: Accepted to appear in Proceedings of International Symposium on Biomedical Imaging (ISBI), 2025
点击查看摘要
Abstract:The ratio of airway tree lumen to lung size (ALR), assessed at full inspiration on high resolution full-lung computed tomography (CT), is a major risk factor for chronic obstructive pulmonary disease (COPD). There is growing interest to infer ALR from cardiac CT images, which are widely available in epidemiological cohorts, to investigate the relationship of ALR to severe COVID-19 and post-acute sequelae of SARS-CoV-2 infection (PASC). Previously, cardiac scans included approximately 2/3 of the total lung volume with 5-6x greater slice thickness than high-resolution (HR) full-lung (FL) CT. In this study, we present a novel attention-based Multi-view Swin Transformer to infer FL ALR values from segmented cardiac CT scans. For the supervised training we exploit paired full-lung and cardiac CTs acquired in the Multi-Ethnic Study of Atherosclerosis (MESA). Our network significantly outperforms a proxy direct ALR inference on segmented cardiac CT scans and achieves accuracy and reproducibility comparable with a scan-rescan reproducibility of the FL ALR ground-truth.
zh
[CV-65] Boosting Diffusion Guidance via Learning Degradation-Aware Models for Blind Super Resolution WACV2025
【速读】:该论文试图解决基于扩散模型(diffusion models)的盲超分辨率(blind super-resolution, SR)问题,即在未知退化核(degradation kernels)的情况下生成高分辨率图像时,如何在保持高频细节的同时确保图像的高保真度。现有方法在生成高频细节时往往牺牲了图像的保真度,而依赖已知退化核的方法则难以应用于盲超分辨率任务。为此,论文提出了一种退化感知模型(degradation-aware models),该模型可以集成到扩散引导框架(diffusion guidance framework)中,从而无需预先知道退化核。此外,论文还提出了两种新技术:输入扰动(input perturbation)和引导标量(guidance scalar),以进一步提升性能。实验结果表明,该方法在盲超分辨率基准测试中优于现有最先进的方法。
链接: https://arxiv.org/abs/2501.08819
作者: Shao-Hao Lu,Ren Wang,Ching-Chun Huang,Wei-Chen Chiu
机构: 未知
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注: To appear in WACV 2025. Code is available at: this https URL
点击查看摘要
Abstract:Recently, diffusion-based blind super-resolution (SR) methods have shown great ability to generate high-resolution images with abundant high-frequency detail, but the detail is often achieved at the expense of fidelity. Meanwhile, another line of research focusing on rectifying the reverse process of diffusion models (i.e., diffusion guidance) has demonstrated the power to generate high-fidelity results for non-blind SR. However, these methods rely on known degradation kernels, making them difficult to apply to blind SR. To address these issues, we introduce degradation-aware models that can be integrated into the diffusion guidance framework, eliminating the need to know degradation kernels. Additionally, we propose two novel techniques, input perturbation and guidance scalar, to further improve our performance. Extensive experimental results show that our proposed method has superior performance over state-of-the-art methods on blind SR benchmarks.
zh
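下面用一维玩具问题示意"扩散引导中的数据一致性修正步"以及论文提到的两个技巧(输入扰动、引导标量)的作用位置。退化算子、步长与噪声尺度均为笔者假设,并非论文实现:

```python
import numpy as np

rng = np.random.default_rng(1)

def degrade(x, kernel):
    # 假设的线性退化算子 A(一维模糊)
    return np.convolve(x, kernel, mode="same")

def guidance_step(x, y, kernel, scalar=0.5, perturb_std=0.01):
    # 输入扰动:计算数据一致性梯度前,对当前估计加入少量噪声
    x_pert = x + rng.normal(scale=perturb_std, size=x.shape)
    residual = degrade(x_pert, kernel) - y
    # 线性算子的伴随是相关运算;引导标量 scalar 控制每步修正的强度
    grad = np.correlate(residual, kernel, mode="same")
    return x - scalar * grad

kernel = np.array([0.25, 0.5, 0.25])
x_true = np.sin(np.linspace(0, np.pi, 32))
y = degrade(x_true, kernel)        # 低质量观测

x = rng.normal(size=32)            # 从随机估计出发
for _ in range(200):
    x = guidance_step(x, y, kernel)

mse = float(np.mean((degrade(x, kernel) - y) ** 2))
print(mse < 0.01)
```

实际的盲超分中,论文用学习到的退化感知模型取代这里已知的 kernel,使同样的引导框架无需先验退化核。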
[CV-66] TimeFlow: Longitudinal Brain Image Registration and Aging Progression Analysis
【速读】:该论文旨在解决纵向脑部 MRI(Magnetic Resonance Imaging)配准中的关键问题,包括无法预测未来脑部状态、对密集纵向数据的依赖以及配准精度与时间平滑性之间的平衡难题。论文提出的解决方案是 TimeFlow,这是一种基于 U-Net 架构并结合扩散模型(diffusion models)时间条件的新型框架。TimeFlow 通过时间条件机制实现了准确的纵向配准,并能够预测未来图像,从而支持前瞻性分析。与传统方法不同,TimeFlow 无需依赖显式的时间平滑正则化或密集序列数据,即可实现时间一致性和连续性。实验结果表明,TimeFlow 在未来时间点预测和配准精度方面均优于现有方法。此外,TimeFlow 还支持新型的脑老化生物学分析,能够有效区分神经退行性疾病与健康老化,且无需分割,避免了复杂标注和不一致分割误差的挑战。
链接: https://arxiv.org/abs/2501.08667
作者: Bailiang Jian,Jiazhen Pan,Yitong Li,Fabian Bongratz,Ruochen Li,Daniel Rueckert,Benedikt Wiestler,Christian Wachinger
机构: 未知
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Predicting future brain states is crucial for understanding healthy aging and neurodegenerative diseases. Longitudinal brain MRI registration, a cornerstone for such analyses, has long been limited by its inability to forecast future developments, reliance on extensive, dense longitudinal data, and the need to balance registration accuracy with temporal smoothness. In this work, we present TimeFlow, a novel framework for longitudinal brain MRI registration that overcomes all these challenges. Leveraging a U-Net architecture with temporal conditioning inspired by diffusion models, TimeFlow enables accurate longitudinal registration and facilitates prospective analyses through future image prediction. Unlike traditional methods that depend on explicit smoothness regularizers and dense sequential data, TimeFlow achieves temporal consistency and continuity without these constraints. Experimental results highlight its superior performance in both future timepoint prediction and registration accuracy compared to state-of-the-art methods. Additionally, TimeFlow supports novel biological brain aging analyses, effectively differentiating neurodegenerative conditions from healthy aging. It eliminates the need for segmentation, thereby avoiding the challenges of non-trivial annotation and inconsistent segmentation errors. TimeFlow paves the way for accurate, data-efficient, and annotation-free prospective analyses of brain aging and chronic diseases.
zh
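TimeFlow 借鉴了扩散模型的"时间条件化"思路。下面是一个示意:用扩散模型常见的正弦时间编码把标量时间映射为向量,再以 FiLM 式的逐通道调制注入网络特征。具体的条件注入方式(FiLM)是笔者为演示而假设的,未必与论文一致:

```python
import numpy as np

def sinusoidal_time_embedding(t, dim=8):
    # 扩散模型常用的正弦编码:把标量时间(间隔)映射为 dim 维向量
    half = dim // 2
    freqs = np.exp(-np.log(10000.0) * np.arange(half) / half)
    angles = t * freqs
    return np.concatenate([np.sin(angles), np.cos(angles)])

def film_condition(features, t_emb, W_scale, W_shift):
    # FiLM 式调制:时间嵌入产生逐通道的缩放与偏移(假设的注入方式)
    scale = 1.0 + t_emb @ W_scale
    shift = t_emb @ W_shift
    return features * scale + shift

rng = np.random.default_rng(2)
dim, channels = 8, 4
W_scale = rng.normal(scale=0.1, size=(dim, channels))
W_shift = rng.normal(scale=0.1, size=(dim, channels))
feat = rng.normal(size=channels)

out_t1 = film_condition(feat, sinusoidal_time_embedding(1.0), W_scale, W_shift)
out_t5 = film_condition(feat, sinusoidal_time_embedding(5.0), W_scale, W_shift)
print(np.allclose(out_t1, out_t5))   # 不同时间间隔应产生不同的条件化输出
```

这种连续时间条件使同一网络可以在任意时间间隔上做配准或外推,这正是"预测未来时间点"能力的来源。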
[CV-67] Product of Gaussian Mixture Diffusion Model for non-linear MRI Inversion
【速读】:该论文旨在解决磁共振成像(MRI)重建中的两个主要问题:一是现有扩散模型(diffusion models)通常作为黑箱估计器,参数数量庞大,导致解释性差且重建时间较长;二是并行成像重建算法依赖于离线线圈灵敏度估计,容易产生错位并限制采样轨迹,或进行逐线圈重建,导致计算成本与线圈数量成正比。为解决这些问题,论文提出了一种联合重建图像和线圈灵敏度的方法,采用轻量级、参数高效且可解释的高斯混合扩散模型(product of Gaussian mixture diffusion model)作为图像先验,并结合线圈灵敏度的经典平滑先验。该方法在保证快速推理的同时,表现出对对比度分布外数据和采样轨迹的鲁棒性,效果与经典变分惩罚(如总变差)相当。此外,概率公式允许计算后验期望和像素级方差,进一步增强了方法的实用性和解释性。
链接: https://arxiv.org/abs/2501.08662
作者: Laurenz Nagler,Martin Zach,Thomas Pock
机构: 未知
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:
点击查看摘要
Abstract:Diffusion models have recently shown remarkable results in magnetic resonance imaging reconstruction. However, the employed networks typically are black-box estimators of the (smoothed) prior score with tens of millions of parameters, restricting interpretability and increasing reconstruction time. Furthermore, parallel imaging reconstruction algorithms either rely on off-line coil sensitivity estimation, which is prone to misalignment and restricting sampling trajectories, or perform per-coil reconstruction, making the computational cost proportional to the number of coils. To overcome this, we jointly reconstruct the image and the coil sensitivities using the lightweight, parameter-efficient, and interpretable product of Gaussian mixture diffusion model as an image prior and a classical smoothness priors on the coil sensitivities. The proposed method delivers promising results while allowing for fast inference and demonstrating robustness to contrast out-of-distribution data and sampling trajectories, comparable to classical variational penalties such as total variation. Finally, the probabilistic formulation allows the calculation of the posterior expectation and pixel-wise variance.
zh
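摘要提到概率化建模允许计算后验期望与逐像素方差。下面用一维玩具例子演示其数学基础:当先验是高斯混合、观测为加性高斯噪声时,后验仍是高斯混合,期望与方差均可闭式计算。这只是标准结论的演示,与论文的图像先验实现无关:

```python
import numpy as np

def gmm_gaussian_posterior(y, weights, means, variances, noise_var):
    # 先验 x ~ GMM,观测 y = x + n, n ~ N(0, noise_var):后验为闭式 GMM
    variances = np.asarray(variances, dtype=float)
    means = np.asarray(means, dtype=float)
    post_vars = 1.0 / (1.0 / variances + 1.0 / noise_var)
    post_means = post_vars * (means / variances + y / noise_var)
    # 各分量的边缘似然决定后验混合权重
    evid_var = variances + noise_var
    log_evid = -0.5 * ((y - means) ** 2 / evid_var + np.log(2 * np.pi * evid_var))
    w = np.asarray(weights, dtype=float) * np.exp(log_evid - log_evid.max())
    w /= w.sum()
    mean = float(np.sum(w * post_means))                       # 后验期望
    var = float(np.sum(w * (post_vars + post_means ** 2)) - mean ** 2)  # 后验方差
    return mean, var

mean, var = gmm_gaussian_posterior(y=0.8, weights=[0.5, 0.5],
                                   means=[-1.0, 1.0], variances=[0.2, 0.2],
                                   noise_var=0.1)
print(round(mean, 3), round(var, 4))
```

观测 y=0.8 使后验权重几乎全部落在均值为 1 的分量上,后验期望因而靠近该分量的收缩估计。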
[CV-68] A Systematic Review of Machine Learning Methods for Multimodal EEG Data in Clinical Application
【速读】:该论文旨在探讨多模态脑电图(EEG)数据在机器学习和深度学习模型中的临床应用,以解决神经精神疾病、神经发育障碍(如自闭症谱系障碍)、神经疾病(如癫痫检测)以及睡眠阶段分类等临床挑战。解决方案的关键在于通过整合多模态数据(如EEG与其他生理信号)来提升模型的准确性。具体而言,数据融合在信号、特征和决策三个层次上进行,常用的机器学习模型包括支持向量机(SVM)和决策树。研究表明,多模态EEG数据在16项研究中有11项显著提高了模型的准确性,凸显了其在临床诊断和问题解决中的潜力。
链接: https://arxiv.org/abs/2501.08585
作者: Siqi Zhao(1),Wangyang Li(1),Xiru Wang(1),Stevie Foglia(2),Hongzhao Tan(1),Bohan Zhang(1),Ameer Hamoodi(2),Aimee Nelson(2 and 3),Zhen Gao(1 and 2) ((1) WBooth School of Engineering Practice and Technology, McMaster University, Hamilton, Ontario Canada, (2) School of Biomedical Engineering, McMaster University, Hamilton, Ontario, Canada, (3) Department of Kinesiology, McMaster University, Hamilton, Ontario, Canada)
机构: 未知
类目: Signal Processing (eess.SP); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: This paper includes 4 figures, 6 tables, and totals 18 pages
点击查看摘要
Abstract:Machine learning (ML) and deep learning (DL) techniques have been widely applied to analyze electroencephalography (EEG) signals for disease diagnosis and brain-computer interfaces (BCI). The integration of multimodal data has been shown to enhance the accuracy of ML and DL models. Combining EEG with other modalities can improve clinical decision-making by addressing complex tasks in clinical populations. This systematic literature review explores the use of multimodal EEG data in ML and DL models for clinical applications. A comprehensive search was conducted across PubMed, Web of Science, and Google Scholar, yielding 16 relevant studies after three rounds of filtering. These studies demonstrate the application of multimodal EEG data in addressing clinical challenges, including neuropsychiatric disorders, neurological conditions (e.g., seizure detection), neurodevelopmental disorders (e.g., autism spectrum disorder), and sleep stage classification. Data fusion occurred at three levels: signal, feature, and decision levels. The most commonly used ML models were support vector machines (SVM) and decision trees. Notably, 11 out of the 16 studies reported improvements in model accuracy with multimodal EEG data. This review highlights the potential of multimodal EEG-based ML models in enhancing clinical diagnostics and problem-solving.
zh
[CV-69] Automotive Elevation Mapping with Interferometric Synthetic Aperture Radar
【速读】:该论文试图解决车载雷达在进行到达方向分析时,由于阵列分辨率和灵敏度限制而难以在复杂驾驶环境中精确定位的问题。解决方案的关键在于利用合成孔径雷达(SAR)技术,特别是干涉合成孔径雷达(InSAR),通过相位测量变化提取高程信息,从而提升雷达的方位分辨率和灵敏度。通过将InSAR与专门为自动驾驶设计的信号处理方案相结合,论文展示了如何在城市和农业环境中使用低分辨率雷达阵列生成三维点云,实现精确的三维空间定位。这种低计算复杂度的方案使得雷达能够作为主要传感器,用于绘制复杂驾驶环境中的细节,并支持自主感知决策。
链接: https://arxiv.org/abs/2501.08495
作者: Leyla A. Kabuli,Griffin Foster
机构: University of California, Berkeley (加州大学伯克利分校); Zendar Inc.
类目: Signal Processing (eess.SP); Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
备注: 9 pages, 6 figures
点击查看摘要
Abstract:Radar is a low-cost and ubiquitous automotive sensor, but is limited by array resolution and sensitivity when performing direction of arrival analysis. Synthetic Aperture Radar (SAR) is a class of techniques to improve azimuth resolution and sensitivity for radar. Interferometric SAR (InSAR) can be used to extract elevation from the variations in phase measurements in SAR images. Utilizing InSAR we show that a typical, low-resolution radar array mounted on a vehicle can be used to accurately localize detections in 3D space for both urban and agricultural environments. We generate point clouds in each environment by combining InSAR with a signal processing scheme tailored to automotive driving. This low-compute approach allows radar to be used as a primary sensor to map fine details in complex driving environments, and be used to make autonomous perception decisions.
zh
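InSAR 由相位提取高程的基本几何可以用一个玩具数值例子说明:两副天线到目标的路径差产生干涉相位,再经远场近似反演俯仰角与高程。波长、基线与几何参数均为假设数值,且省略了实际系统必需的相位解缠等步骤:

```python
import numpy as np

WAVELENGTH = 0.0039   # 假设的 77 GHz 车载雷达波长(约 3.9 mm)
BASELINE = 0.01       # 假设的 1 cm 垂直基线

ant_a = np.array([0.0, 0.0, 0.5])
ant_b = ant_a + np.array([0.0, 0.0, BASELINE])
target = np.array([20.0, 0.0, 2.0])

# 干涉相位正比于两副天线到目标的路径差(单程几何)
r_a = np.linalg.norm(target - ant_a)
r_b = np.linalg.norm(target - ant_b)
phi = 2 * np.pi * (r_a - r_b) / WAVELENGTH

# 由相位反推路径差,再用远场近似 path_diff ≈ B * sin(俯仰角) 反演高程
path_diff = phi * WAVELENGTH / (2 * np.pi)
sin_elev = path_diff / BASELINE
mid = (ant_a + ant_b) / 2
z_est = mid[2] + np.linalg.norm(target - mid) * sin_elev
print(round(float(z_est), 3))
```

对真实高程 2.0 m 的目标,反演结果应非常接近 2.0,说明低分辨率阵列配合干涉几何即可在俯仰向定位。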
[CV-70] RWKV-UNet: Improving UNet with Long-Range Cooperation for Effective Medical Image Segmentation
【速读】:该论文旨在解决医学图像分析中卷积神经网络(CNNs)和Transformer模型在捕捉长程依赖性和计算复杂度方面的局限性。具体来说,CNNs在捕捉长程依赖性方面存在不足,而Transformer模型则面临高计算复杂度的挑战。为解决这些问题,论文提出了一种名为RWKV-UNet的新型模型,该模型将RWKV(Receptance Weighted Key Value)结构集成到U-Net架构中。这一集成增强了模型捕捉长程依赖性的能力,并提升了上下文理解能力,这对于精确的医学图像分割至关重要。此外,论文还提出了一个跨通道混合(Cross-Channel Mix, CCM)模块,通过多尺度特征融合改进跳跃连接,实现了全局通道信息的整合。实验结果表明,RWKV-UNet在多个基准数据集上达到了最先进的性能,并且其小型变体RWKV-UNet-S和RWKV-UNet-T在准确性和计算效率之间取得了平衡,适合更广泛的临床应用。
链接: https://arxiv.org/abs/2501.08458
作者: Juntao Jiang,Jiangning Zhang,Weixuan Liu,Muxuan Gao,Xiaobin Hu,Xiaoxiao Yan,Feiyue Huang,Yong Liu
机构: Zhejiang University(浙江大学); Youtu Lab, Tencent(腾讯优图实验室); Northeastern University(东北大学); Ruijin Hospital(瑞金医院)
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:In recent years, there have been significant advancements in deep learning for medical image analysis, especially with convolutional neural networks (CNNs) and transformer models. However, CNNs face limitations in capturing long-range dependencies while transformers suffer high computational complexities. To address this, we propose RWKV-UNet, a novel model that integrates the RWKV (Receptance Weighted Key Value) structure into the U-Net architecture. This integration enhances the model’s ability to capture long-range dependencies and improve contextual understanding, which is crucial for accurate medical image segmentation. We build a strong encoder with developed inverted residual RWKV (IR-RWKV) blocks combining CNNs and RWKVs. We also propose a Cross-Channel Mix (CCM) module to improve skip connections with multi-scale feature fusion, achieving global channel information integration. Experiments on benchmark datasets, including Synapse, ACDC, BUSI, CVC-ClinicDB, CVC-ColonDB, Kvasir-SEG, ISIC 2017 and GLAS show that RWKV-UNet achieves state-of-the-art performance on various types of medical image segmentation. Additionally, smaller variants, RWKV-UNet-S and RWKV-UNet-T, balance accuracy and computational efficiency, making them suitable for broader clinical applications.
zh
[CV-71] High-throughput digital twin framework for predicting neurite deterioration using MetaFormer attention
【速读】:该论文旨在解决神经发育障碍(NDDs)研究中面临的诊断和治疗挑战,特别是由于高共病性和复杂病因导致的实验数据稀缺和实验过程耗时的问题。论文提出了一种高通量数字孪生框架,通过整合合成数据生成、实验图像和机器学习模型来模拟与NDDs相关的神经突触退化。解决方案的关键在于利用基于等几何分析(IGA)的相场模型生成多样化的神经突触退化模式(如神经突触回缩、萎缩和碎片化),并通过基于MetaFormer的门控时空注意力架构的机器学习模型进行快速预测。该框架能够有效捕捉长时程依赖性和复杂的形态学变化,显著降低了实验成本和时间,并为探索复杂的神经机制提供了可扩展的解决方案。
链接: https://arxiv.org/abs/2501.08334
作者: Kuanren Qian,Genesis Omana Suarez,Toshihiko Nambara,Takahisa Kanekiyo,Yongjie Jessica Zhang
机构: Department of Mechanical Engineering, Carnegie Mellon University (卡内基梅隆大学机械工程系); Department of Neuroscience, Mayo Clinic (梅奥诊所神经科学系); Department of Biomedical Engineering, Carnegie Mellon University (卡内基梅隆大学生物医学工程系)
类目: Neurons and Cognition (q-bio.NC); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: 17 pages, 8 figures
点击查看摘要
Abstract:Neurodevelopmental disorders (NDDs) cover a variety of conditions, including autism spectrum disorder, attention-deficit/hyperactivity disorder, and epilepsy, which impair the central and peripheral nervous systems. Their high comorbidity and complex etiologies present significant challenges for accurate diagnosis and effective treatments. Conventional clinical and experimental studies are time-intensive, burdening research progress considerably. This paper introduces a high-throughput digital twin framework for modeling neurite deteriorations associated with NDDs, integrating synthetic data generation, experimental images, and machine learning (ML) models. The synthetic data generator utilizes an isogeometric analysis (IGA)-based phase field model to capture diverse neurite deterioration patterns such as neurite retraction, atrophy, and fragmentation while mitigating the limitations of scarce experimental data. The ML model utilizes MetaFormer-based gated spatiotemporal attention architecture with deep temporal layers and provides fast predictions. The framework effectively captures long-range temporal dependencies and intricate morphological transformations with average errors of 1.9641% and 6.0339% for synthetic and experimental neurite deterioration, respectively. Seamlessly integrating simulations, experiments, and ML, the digital twin framework can guide researchers to make informed experimental decisions by predicting potential experimental outcomes, significantly reducing costs and saving valuable time. It can also advance our understanding of neurite deterioration and provide a scalable solution for exploring complex neurological mechanisms, contributing to the development of targeted treatments.
zh
人工智能
[AI-0] How Do Generative Models Draw a Software Engineer? A Case Study on Stable Diffusion Bias
链接: https://arxiv.org/abs/2501.09014
作者: Tosin Fadahunsi,Giordano d’Aloisio,Antinisca Di Marco,Federica Sarro
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:Generative models are nowadays widely used to generate graphical content used for multiple purposes, e.g. web, art, advertisement. However, it has been shown that the images generated by these models could reinforce societal biases already existing in specific contexts. In this paper, we focus on understanding if this is the case when one generates images related to various software engineering tasks. In fact, the Software Engineering (SE) community is not immune from gender and ethnicity disparities, which could be amplified by the use of these models. Hence, if used without consciousness, artificially generated images could reinforce these biases in the SE domain. Specifically, we perform an extensive empirical evaluation of the gender and ethnicity bias exposed by three versions of the Stable Diffusion (SD) model (a very popular open-source text-to-image model) - SD 2, SD XL, and SD 3 - towards SE tasks. We obtain 6,720 images by feeding each model with two sets of prompts describing different software-related tasks: one set includes the Software Engineer keyword, and one set does not include any specification of the person performing the task. Next, we evaluate the gender and ethnicity disparities in the generated images. Results show how all models are significantly biased towards male figures when representing software engineers. On the contrary, while SD 2 and SD XL are strongly biased towards White figures, SD 3 is slightly more biased towards Asian figures. Nevertheless, all models significantly under-represent Black and Arab figures, regardless of the prompt style used. The results of our analysis highlight severe concerns about adopting those models to generate content for SE tasks and open the field for future research on bias mitigation in this context.
[AI-1] AI-RAN: Transforming RAN with AI-driven Computing Infrastructure
链接: https://arxiv.org/abs/2501.09007
作者: Lopamudra Kundu,Xingqin Lin,Rajesh Gadiyar,Jean-Francois Lacasse,Shuvo Chowdhury
类目: Artificial Intelligence (cs.AI); Networking and Internet Architecture (cs.NI); Signal Processing (eess.SP)
*备注: 7 pages, 5 figures
[AI-2] Development and Validation of the Provider Documentation Summarization Quality Instrument for Large Language Models
链接: https://arxiv.org/abs/2501.08977
作者: Emma Croxford,Yanjun Gao,Nicholas Pellegrino,Karen K. Wong,Graham Wills,Elliot First,Miranda Schnier,Kyle Burton,Cris G. Ebby,Jillian Gorskic,Matthew Kalscheur,Samy Khalil,Marie Pisani,Tyler Rubeor,Peter Stetson,Frank Liao,Cherodeep Goswami,Brian Patterson,Majid Afshar
类目: Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:As Large Language Models (LLMs) are integrated into electronic health record (EHR) workflows, validated instruments are essential to evaluate their performance before implementation. Existing instruments for provider documentation quality are often unsuitable for the complexities of LLM-generated text and lack validation on real-world data. The Provider Documentation Summarization Quality Instrument (PDSQI-9) was developed to evaluate LLM-generated clinical summaries. Multi-document summaries were generated from real-world EHR data across multiple specialties using several LLMs (GPT-4o, Mixtral 8x7b, and Llama 3-8b). Validation included Pearson correlation for substantive validity, factor analysis and Cronbach’s alpha for structural validity, inter-rater reliability (ICC and Krippendorff’s alpha) for generalizability, a semi-Delphi process for content validity, and comparisons of high- versus low-quality summaries for discriminant validity. Seven physician raters evaluated 779 summaries and answered 8,329 questions, achieving over 80% power for inter-rater reliability. The PDSQI-9 demonstrated strong internal consistency (Cronbach’s alpha = 0.879; 95% CI: 0.867-0.891) and high inter-rater reliability (ICC = 0.867; 95% CI: 0.867-0.868), supporting structural validity and generalizability. Factor analysis identified a 4-factor model explaining 58% of the variance, representing organization, clarity, accuracy, and utility. Substantive validity was supported by correlations between note length and scores for Succinct (rho = -0.200, p = 0.029) and Organized (rho = -0.190, p = 0.037). Discriminant validity distinguished high- from low-quality summaries (p < 0.001). The PDSQI-9 demonstrates robust construct validity, supporting its use in clinical practice to evaluate LLM-generated summaries and facilitate safer integration of LLMs into healthcare workflows.
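文中用于结构效度的 Cronbach's alpha 是标准的内部一致性指标,计算方式如下。数据为随机构造的假设评分(9 个条目对应 PDSQI-9 的条目数),仅演示公式,与论文数据无关:

```python
import numpy as np

def cronbach_alpha(ratings):
    # ratings: (n_受试, k_条目)
    # alpha = k/(k-1) * (1 - sum(各条目方差) / 总分方差)
    ratings = np.asarray(ratings, dtype=float)
    k = ratings.shape[1]
    item_vars = ratings.var(axis=0, ddof=1)
    total_var = ratings.sum(axis=1).var(ddof=1)
    return k / (k - 1) * (1 - item_vars.sum() / total_var)

rng = np.random.default_rng(3)
latent = rng.normal(size=(30, 1))                  # 共同潜变量使条目间正相关
items = latent + 0.3 * rng.normal(size=(30, 9))   # 9 个条目的模拟评分
alpha = cronbach_alpha(items)
print(round(float(alpha), 3))
```

条目间相关性越强,总分方差相对条目方差之和越大,alpha 越接近 1;论文报告的 0.879 即属"强内部一致性"区间。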
[AI-3] Trusted Machine Learning Models Unlock Private Inference for Problems Currently Infeasible with Cryptography
链接: https://arxiv.org/abs/2501.08970
作者: Ilia Shumailov,Daniel Ramage,Sarah Meiklejohn,Peter Kairouz,Florian Hartmann,Borja Balle,Eugene Bagdasarian
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:We often interact with untrusted parties. Prioritization of privacy can limit the effectiveness of these interactions, as achieving certain goals necessitates sharing private data. Traditionally, addressing this challenge has involved either seeking trusted intermediaries or constructing cryptographic protocols that restrict how much data is revealed, such as multi-party computations or zero-knowledge proofs. While significant advances have been made in scaling cryptographic approaches, they remain limited in terms of the size and complexity of applications they can be used for. In this paper, we argue that capable machine learning models can fulfill the role of a trusted third party, thus enabling secure computations for applications that were previously infeasible. In particular, we describe Trusted Capable Model Environments (TCMEs) as an alternative approach for scaling secure computation, where capable machine learning model(s) interact under input/output constraints, with explicit information flow control and explicit statelessness. This approach aims to achieve a balance between privacy and computational efficiency, enabling private inference where classical cryptographic solutions are currently infeasible. We describe a number of use cases that are enabled by TCME, and show that even some simple classic cryptographic problems can already be solved with TCME. Finally, we outline current limitations and discuss the path forward in implementing them.
[AI-4] Kolmogorov-Arnold Networks for Time Series Granger Causality Inference
链接: https://arxiv.org/abs/2501.08958
作者: Meiliang Liu,Yunfang Xu,Zijin Li,Zhengye Si,Xiaoxiao Yang,Xinyue Yang,Zhiwen Zhao
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:We introduce Granger Causality Kolmogorov-Arnold Networks (GCKAN), an innovative architecture that extends the recently proposed Kolmogorov-Arnold Networks (KAN) to the domain of causal inference. By extracting base weights from KAN layers and incorporating the sparsity-inducing penalty along with ridge regularization, GCKAN infers the Granger causality from time series while enabling automatic time lag selection. Additionally, we propose an algorithm leveraging time-reversed Granger causality to enhance inference accuracy. The algorithm compares prediction and sparse-inducing losses derived from the original and time-reversed series, automatically selecting the causal relationship with the higher score or integrating the results to mitigate spurious connectivities. Comprehensive experiments conducted on Lorenz-96, gene regulatory networks, fMRI BOLD signals, and VAR datasets demonstrate that the proposed model achieves competitive performance to state-of-the-art methods in inferring Granger causality from nonlinear, high-dimensional, and limited-sample time series.
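Granger 因果的基本判据是:加入候选原因序列的滞后项后,预测误差是否显著下降。下面用线性 VAR 的"受限模型 vs 完整模型"比较演示这一判据(GCKAN 用 KAN 的稀疏输入权重实现非线性版本,这里只是线性简化示意):

```python
import numpy as np

rng = np.random.default_rng(4)

def lagged_design(series, lags):
    # 把一维序列转成滞后设计矩阵,第 k 列为滞后 k+1
    T = len(series)
    return np.column_stack([series[lags - k - 1: T - k - 1] for k in range(lags)])

def residual_var(y, X):
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    r = y - X @ beta
    return float(r @ r / len(r))

T, lags = 500, 2
x = rng.normal(size=T)
y = np.zeros(T)
for t in range(2, T):               # x 以滞后 1 驱动 y
    y[t] = 0.5 * y[t - 1] + 0.8 * x[t - 1] + 0.1 * rng.normal()

Y = y[lags:]
X_own = lagged_design(y, lags)                       # 仅用 y 自身滞后(受限模型)
X_full = np.column_stack([X_own, lagged_design(x, lags)])  # 加入 x 的滞后(完整模型)

var_restricted = residual_var(Y, X_own)
var_full = residual_var(Y, X_full)
# 加入 x 的滞后使预测误差大幅下降 => 判定 x Granger 导致 y
print(var_restricted > 2 * var_full)
```

论文的时间反转技巧相当于在反转序列上重复同样的比较,用两侧得分筛除伪连接。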
[AI-5] Analyzing the Ethical Logic of Six Large Language Models
链接: https://arxiv.org/abs/2501.08951
作者: W. Russell Neuman,Chad Coleman,Manan Shah
类目: Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
*备注:
点击查看摘要
Abstract:This study examines the ethical reasoning of six prominent generative large language models: OpenAI GPT-4o, Meta LLaMA 3.1, Perplexity, Anthropic Claude 3.5 Sonnet, Google Gemini, and Mistral 7B. The research explores how these models articulate and apply ethical logic, particularly in response to moral dilemmas such as the Trolley Problem, and Heinz Dilemma. Departing from traditional alignment studies, the study adopts an explainability-transparency framework, prompting models to explain their ethical reasoning. This approach is analyzed through three established ethical typologies: the consequentialist-deontological analytic, Moral Foundations Theory, and the Kohlberg Stages of Moral Development Model. Findings reveal that LLMs exhibit largely convergent ethical logic, marked by a rationalist, consequentialist emphasis, with decisions often prioritizing harm minimization and fairness. Despite similarities in pre-training and model architecture, a mixture of nuanced and significant differences in ethical reasoning emerge across models, reflecting variations in fine-tuning and post-training processes. The models consistently display erudition, caution, and self-awareness, presenting ethical reasoning akin to a graduate-level discourse in moral philosophy. In striking uniformity these systems all describe their ethical reasoning as more sophisticated than what is characteristic of typical human moral logic.
[AI-6] Modeling Melt Pool Features and Spatter Using Symbolic Regression and Machine Learning
链接: https://arxiv.org/abs/2501.08922
作者: Olabode T. Ajenifujah,Amir Barati Farimani
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:Additive manufacturing (AM) is a rapidly evolving technology that has attracted applications across a wide range of fields due to its ability to fabricate complex geometries. However, one of the key challenges in AM is achieving consistent print quality. This inconsistency is often attributed to uncontrolled melt pool dynamics, partly caused by spatter which can lead to defects. Therefore, capturing and controlling the evolution of the melt pool is crucial for enhancing process stability and part quality. In this study, we developed a framework to support decision-making in AM operations, facilitating quality control and minimizing defects via machine learning (ML) and polynomial symbolic regression models. We implemented experimentally validated computational tools as a cost-effective approach to collect large datasets from laser powder bed fusion (LPBF) processes. For a dataset consisting of 281 process conditions, parameters such as melt pool dimensions (length, width, depth), melt pool geometry (area, volume), and volume indicated as spatter were extracted. Using machine learning (ML) and polynomial symbolic regression models, a high R2 of over 95 % was achieved in predicting the melt pool dimensions and geometry features for both the training and testing datasets, with either process conditions (power and velocity) or melt pool dimensions as the model inputs. In the case of volume indicated as spatter, R2 improved after logarithmically transforming the model inputs, which was either the process conditions or the melt pool dimensions. Among the investigated ML models, the ExtraTree model achieved the highest R2 values of 96.7 % and 87.5 %.
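摘要提到对模型输入做对数变换后 R² 提升。其直觉是:熔池特征常随能量密度(功率/速度)呈幂律变化,取对数后幂律变成线性关系。下面用合成数据演示这一点;幂律指数、噪声与取值范围均为笔者假设,并非论文数据:

```python
import numpy as np

rng = np.random.default_rng(5)

power = rng.uniform(100, 400, size=200)      # 激光功率(W),假设范围
velocity = rng.uniform(0.2, 2.0, size=200)   # 扫描速度(m/s),假设范围
# 合成的"熔池面积":随 power/velocity 幂律增长,叠加乘性噪声
area = 0.05 * (power / velocity) ** 0.8 * np.exp(0.05 * rng.normal(size=200))

# 对输入输出取对数 -> 幂律变成线性,可用最小二乘拟合
X = np.column_stack([np.ones(200), np.log(power), np.log(velocity)])
y = np.log(area)
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
pred = X @ beta
r2 = 1 - np.sum((y - pred) ** 2) / np.sum((y - y.mean()) ** 2)
print(round(float(r2), 3))
```

拟合出的系数 beta[1] 应接近真实幂指数 0.8,且对数空间 R² 远高于直接在原始尺度上做线性拟合。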
[AI-7] Projection Implicit Q-Learning with Support Constraint for Offline Reinforcement Learning
链接: https://arxiv.org/abs/2501.08907
作者: Xinchen Han,Hossam Afifi,Michel Marot
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:Offline Reinforcement Learning (RL) faces a critical challenge of extrapolation errors caused by out-of-distribution (OOD) actions. Implicit Q-Learning (IQL) algorithm employs expectile regression to achieve in-sample learning, effectively mitigating the risks associated with OOD actions. However, the fixed hyperparameter in policy evaluation and density-based policy improvement method limit its overall efficiency. In this paper, we propose Proj-IQL, a projective IQL algorithm enhanced with the support constraint. In the policy evaluation phase, Proj-IQL generalizes the one-step approach to a multi-step approach through vector projection, while maintaining in-sample learning and expectile regression framework. In the policy improvement phase, Proj-IQL introduces support constraint that is more aligned with the policy evaluation approach. Furthermore, we theoretically demonstrate that Proj-IQL guarantees monotonic policy improvement and enjoys a progressively more rigorous criterion for superior actions. Empirical results demonstrate the Proj-IQL achieves state-of-the-art performance on D4RL benchmarks, especially in challenging navigation domains.
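IQL 系方法的核心是 expectile 回归:对正负误差施加不对称权重,使价值估计在不查询分布外动作的前提下逼近"样本内最大值"。下面是该损失的极简实现与一维数值验证(仅演示损失本身,与 Proj-IQL 的投影和支持约束无关):

```python
import numpy as np

def expectile_loss(diff, tau=0.7):
    # diff = target - v;正误差(价值被低估)加权 tau,负误差加权 1 - tau
    weight = np.where(diff > 0, tau, 1 - tau)
    return np.mean(weight * diff ** 2)

def fit_expectile(samples, tau=0.7, lr=0.5, steps=2000):
    # 对一批标量样本做梯度下降,收敛到它们的 tau-expectile
    v = 0.0
    for _ in range(steps):
        diff = samples - v
        grad = -2 * np.mean(np.where(diff > 0, tau, 1 - tau) * diff)
        v -= lr * grad
    return v

samples = np.array([-1.0, 0.0, 1.0, 2.0])
v_mean = fit_expectile(samples, tau=0.5)   # tau = 0.5 退化为均值
v_high = fit_expectile(samples, tau=0.9)   # tau -> 1 时逼近样本内最大值
print(round(v_mean, 3), round(v_high, 3))
```

tau=0.5 收敛到均值 0.5,tau=0.9 收敛到 1.5:权重越偏向正误差,估计越靠近批内上界,这正是 in-sample learning 规避分布外动作的机制。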
[AI-8] Computing Game Symmetries and Equilibria That Respect Them AAAI DATE AAAI2025
链接: https://arxiv.org/abs/2501.08905
作者: Emanuel Tewolde,Brian Hu Zhang,Caspar Oesterheld,Tuomas Sandholm,Vincent Conitzer
类目: Computer Science and Game Theory (cs.GT); Artificial Intelligence (cs.AI); Computational Complexity (cs.CC); Multiagent Systems (cs.MA)
*备注: Long and updated version to the published paper in the Proceedings of the 39th Annual AAAI Conference on Artificial Intelligence (AAAI 2025). 24 pages, 2 figures, 1 table
点击查看摘要
Abstract:Strategic interactions can be represented more concisely, and analyzed and solved more efficiently, if we are aware of the symmetries within the multiagent system. Symmetries also have conceptual implications, for example for equilibrium selection. We study the computational complexity of identifying and using symmetries. Using the classical framework of normal-form games, we consider game symmetries that can be across some or all players and/or actions. We find a strong connection between game symmetries and graph automorphisms, yielding graph automorphism and graph isomorphism completeness results for characterizing the symmetries present in a game. On the other hand, we also show that the problem becomes polynomial-time solvable when we restrict the consideration of actions in one of two ways. Next, we investigate when exactly game symmetries can be successfully leveraged for Nash equilibrium computation. We show that finding a Nash equilibrium that respects a given set of symmetries is PPAD- and CLS-complete in general-sum and team games respectively – that is, exactly as hard as Brouwer fixed point and gradient descent problems. Finally, we present polynomial-time methods for the special cases where we are aware of a vast number of symmetries, or where the game is two-player zero-sum and we do not even know the symmetries.
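"博弈对称性"与"尊重对称性的均衡"可以用一个具体小例子说明:石头剪刀布在交换两名玩家时对称(收益矩阵满足 B = Aᵀ),而均匀混合策略是尊重该对称性的纳什均衡。以下仅为概念演示,与论文的一般性算法和复杂度结果无关:

```python
import numpy as np

def is_player_symmetric(A, B):
    # 双人博弈在交换两名玩家下对称 <=> 列玩家收益矩阵 B 等于 A 的转置
    return np.array_equal(B, A.T)

# 石头剪刀布:零和,且玩家对称
A = np.array([[0, -1, 1],
              [1, 0, -1],
              [-1, 1, 0]])
B = -A                              # 零和:B = -A;又因 A 反对称,B = A.T

sym = is_player_symmetric(A, B)

# 均匀混合策略是尊重对称性的均衡:所有纯策略对 p 的期望收益相同(均为 0),
# 因此没有任何单方面偏离能获得更高收益
p = np.ones(3) / 3
payoffs = A @ p
is_equilibrium = np.allclose(payoffs, payoffs[0])
print(sym, is_equilibrium)
```

论文研究的正是一般情形:识别这类对称性与图自同构等价,而寻找尊重给定对称性的均衡在一般和博弈中是 PPAD-完全的。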
[AI-9] Leveraging Large Language Models as Knowledge-Driven Agents for Reliable Retrosynthesis Planning
链接: https://arxiv.org/abs/2501.08897
作者: Qinyu Ma,Yuhao Zhou,Jianfeng Li
类目: Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:Identifying reliable synthesis pathways in materials chemistry is a complex task, particularly in polymer science, due to the intricate and often non-unique nomenclature of macromolecules. To address this challenge, we propose an agent system that integrates large language models (LLMs) and knowledge graphs (KGs). By leveraging LLMs’ powerful capabilities for extracting and recognizing chemical substance names, and storing the extracted data in a structured knowledge graph, our system fully automates the retrieval of relevant literature, extraction of reaction data, database querying, construction of retrosynthetic pathway trees, further expansion through the retrieval of additional literature and recommendation of optimal reaction pathways. A novel Multi-branched Reaction Pathway Search (MBRPS) algorithm enables the exploration of all pathways, with a particular focus on multi-branched ones, helping LLMs overcome weak reasoning in multi-branched paths. This work represents the first attempt to develop a fully automated retrosynthesis planning agent tailored specifically for macromolecules, powered by LLMs. Applied to polyimide synthesis, our new approach constructs a retrosynthetic pathway tree with hundreds of pathways and recommends optimized routes, including both known and novel pathways, demonstrating its effectiveness and potential for broader applications.
[AI-10] Karatsuba Matrix Multiplication and its Efficient Custom Hardware Implementations
链接: https://arxiv.org/abs/2501.08889
作者: Trevor E. Pogue,Nicola Nicolici
类目: Hardware Architecture (cs.AR); Artificial Intelligence (cs.AI); Performance (cs.PF)
*备注: Accepted for publication in IEEE Transactions on Computers; Associated source code available on github at this https URL
点击查看摘要
Abstract:While the Karatsuba algorithm reduces the complexity of large integer multiplication, the extra additions required minimize its benefits for smaller integers of more commonly-used bitwidths. In this work, we propose the extension of the scalar Karatsuba multiplication algorithm to matrix multiplication, showing how this maintains the reduction in multiplication complexity of the original Karatsuba algorithm while reducing the complexity of the extra additions. Furthermore, we propose new matrix multiplication hardware architectures for efficiently exploiting this extension of the Karatsuba algorithm in custom hardware. We show that the proposed algorithm and hardware architectures can provide real area or execution time improvements for integer matrix multiplication compared to scalar Karatsuba or conventional matrix multiplication algorithms, while also supporting implementation through proven systolic array and conventional multiplier architectures at the core. We provide a complexity analysis of the algorithm and architectures and evaluate the proposed designs both in isolation and in an end-to-end deep learning accelerator system compared to baseline designs and prior state-of-the-art works implemented on the same type of compute platform, demonstrating their ability to increase the performance-per-area of matrix multiplication hardware.
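The core algorithmic idea can be sketched in a few lines: split each integer entry into k-bit high/low halves so that the full product needs three half-width matrix multiplications instead of four, at the cost of matrix additions, which are cheap relative to multiplications. This is our illustration of the principle, not the paper's hardware design; function names are ours.

```python
def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def madd(A, B):
    return [[a + b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]

def msub(A, B):
    return [[a - b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]

def karatsuba_matmul(A, B, k):
    """Karatsuba-style integer matrix product: three half-width
    matrix multiplications replace the naive four, mirroring the
    scalar Karatsuba identity applied blockwise to bit-halves."""
    mask = (1 << k) - 1
    A1 = [[x >> k for x in r] for r in A]; A0 = [[x & mask for x in r] for r in A]
    B1 = [[x >> k for x in r] for r in B]; B0 = [[x & mask for x in r] for r in B]
    P1 = matmul(A1, B1)                      # high x high
    P2 = matmul(A0, B0)                      # low x low
    P3 = matmul(madd(A1, A0), madd(B1, B0))  # combined term
    mid = msub(msub(P3, P1), P2)             # = A1@B0 + A0@B1
    return [[(p1 << 2 * k) + (m << k) + p2
             for p1, m, p2 in zip(r1, rm, r2)]
            for r1, rm, r2 in zip(P1, mid, P2)]

# 8-bit entries split into 4-bit halves; result matches direct multiply
A = [[200, 17], [3, 255]]
B = [[9, 250], [77, 1]]
assert karatsuba_matmul(A, B, 4) == matmul(A, B)
```

In hardware, the three half-width products map onto smaller multiplier arrays, which is where the area/latency savings the paper reports come from.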
[AI-11] Incrementally Learning Multiple Diverse Data Domains via Multi-Source Dynamic Expansion Model
链接: https://arxiv.org/abs/2501.08878
作者: Runqing Wu,Fei Ye,Qihe Liu,Guoxi Huang,Jinyu Guo,Rongyao Hu
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: 10 pages, 5 figures
点击查看摘要
Abstract:Continual Learning seeks to develop a model capable of incrementally assimilating new information while retaining prior knowledge. However, current research predominantly addresses a straightforward learning context, wherein all data samples originate from a singular data domain. This paper shifts focus to a more complex and realistic learning environment, characterized by data samples sourced from multiple distinct domains. We tackle this intricate learning challenge by introducing a novel methodology, termed the Multi-Source Dynamic Expansion Model (MSDEM), which leverages various pre-trained models as backbones and progressively establishes new experts based on them to adapt to emerging tasks. Additionally, we propose an innovative dynamic expandable attention mechanism designed to selectively harness knowledge from multiple backbones, thereby accelerating the new task learning. Moreover, we introduce a dynamic graph weight router that strategically reuses all previously acquired parameters and representations for new task learning, maximizing the positive knowledge transfer effect, which further improves generalization performance. We conduct a comprehensive series of experiments, and the empirical findings indicate that our proposed approach achieves state-of-the-art performance.
[AI-12] Silent Abandonment in Text-Based Contact Centers: Identifying Quantifying and Mitigating its Operational Impacts
链接: https://arxiv.org/abs/2501.08869
作者: Antonio Castellanos,Galit B. Yom-Tov,Yair Goldberg,Jaeyoung Park
类目: Social and Information Networks (cs.SI); Artificial Intelligence (cs.AI)
*备注: arXiv admin note: text overlap with arXiv:2304.11754
点击查看摘要
Abstract:In the quest to improve services, companies offer customers the option to interact with agents via texting. Such contact centers face unique challenges compared to traditional call centers, as measuring customer experience proxies like abandonment and patience involves uncertainty. A key source of this uncertainty is silent abandonment, where customers leave without notifying the system, wasting agent time and leaving their status unclear. Silent abandonment also obscures whether a customer was served or left. Our goals are to measure the magnitude of silent abandonment and mitigate its effects. Classification models show that 3%-70% of customers across 17 companies abandon silently. In one study, 71.3% of abandoning customers did so silently, reducing agent efficiency by 3.2% and system capacity by 15.3%, incurring $5,457 in annual costs per agent. We develop an expectation-maximization (EM) algorithm to estimate customer patience under uncertainty and identify influencing covariates. We find that companies should use classification models to estimate abandonment scope and our EM algorithm to assess patience. We suggest strategies to operationally mitigate the impact of silent abandonment by predicting suspected silent-abandonment behavior or changing service design. Specifically, we show that while allowing customers to write while waiting in the queue creates a missing data challenge, it also significantly increases patience and reduces service time, leading to reduced abandonment and lower staffing requirements.
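The estimation problem here — patience times that are right-censored when a customer gets served before running out of patience — is a classic fit for EM. The sketch below is a minimal illustration under an exponential-patience assumption, not the paper's algorithm (which handles richer censoring and covariates); all names are ours.

```python
import random

def em_patience(abandon_times, censored_times, iters=100):
    """EM sketch for exponentially distributed patience under
    right-censoring: a customer served at time c only reveals
    patience > c. Toy version, not the paper's EM algorithm."""
    n_total = len(abandon_times) + len(censored_times)
    lam = n_total / (sum(abandon_times) + sum(censored_times))  # crude init
    for _ in range(iters):
        # E-step: memorylessness gives E[T | T > c] = c + 1/lam
        filled = abandon_times + [c + 1.0 / lam for c in censored_times]
        # M-step: rate MLE treating the filled-in patiences as observed
        lam = n_total / sum(filled)
    return lam

random.seed(0)
patience = [random.expovariate(0.5) for _ in range(5000)]   # true rate 0.5
service = [random.expovariate(1.0) for _ in range(5000)]
abandons = [p for p, s in zip(patience, service) if p <= s]  # observed abandonment
censored = [s for p, s in zip(patience, service) if p > s]   # served first
lam_hat = em_patience(abandons, censored)
# closed-form censored-exponential MLE, for comparison
mle = len(abandons) / (sum(abandons) + sum(censored))
assert abs(lam_hat - mle) < 1e-6
```

For this simple model the EM fixed point coincides with the closed-form censored-exponential MLE; the value of EM in the paper's setting is that it extends to the messier uncertainty introduced by silent abandonment.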
[AI-13] ARMOR: Shielding Unlearnable Examples against Data Augmentation
链接: https://arxiv.org/abs/2501.08862
作者: Xueluan Gong,Yuji Wang,Yanjiao Chen,Haocheng Dong,Yiming Li,Mengyuan Sun,Shuaike Li,Qian Wang,Chen Chen
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
*备注:
点击查看摘要
Abstract:Private data, when published online, may be collected by unauthorized parties to train deep neural networks (DNNs). To protect privacy, defensive noises can be added to original samples to degrade their learnability by DNNs. Recently, unlearnable examples are proposed to minimize the training loss such that the model learns almost nothing. However, raw data are often pre-processed before being used for training, which may restore the private information of protected data. In this paper, we reveal the data privacy violation induced by data augmentation, a commonly used data pre-processing technique to improve model generalization capability, which is the first of its kind as far as we are concerned. We demonstrate that data augmentation can significantly raise the accuracy of the model trained on unlearnable examples from 21.3% to 66.1%. To address this issue, we propose a defense framework, dubbed ARMOR, to protect data privacy from potential breaches of data augmentation. To overcome the difficulty of having no access to the model training process, we design a non-local module-assisted surrogate model that better captures the effect of data augmentation. In addition, we design a surrogate augmentation selection strategy that maximizes distribution alignment between augmented and non-augmented samples, to choose the optimal augmentation strategy for each class. We also use a dynamic step size adjustment algorithm to enhance the defensive noise generation process. Extensive experiments are conducted on 4 datasets and 5 data augmentation methods to verify the performance of ARMOR. Comparisons with 6 state-of-the-art defense methods have demonstrated that ARMOR can preserve the unlearnability of protected private data under data augmentation. ARMOR reduces the test accuracy of the model trained on augmented protected samples by as much as 60% more than baselines.
[AI-14] Digital Phenotyping for Adolescent Mental Health: A Feasibility Study Employing Machine Learning to Predict Mental Health Risk From Active and Passive Smartphone Data
链接: https://arxiv.org/abs/2501.08851
作者: Balasundaram Kadirvelu,Teresa Bellido Bel,Aglaia Freccero,Martina Di Simplicio,Dasha Nicholls,A Aldo Faisal
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:Background: Adolescents are particularly vulnerable to mental disorders, with over 75% of cases manifesting before the age of 25. Research indicates that only 18 to 34% of young people experiencing high levels of depression or anxiety symptoms seek support. Digital tools leveraging smartphones offer scalable and early intervention opportunities. Objective: Using a novel machine learning framework, this study evaluated the feasibility of integrating active and passive smartphone data to predict mental disorders in non-clinical adolescents. Specifically, we investigated the utility of the Mindcraft app in predicting risks for internalising and externalising disorders, eating disorders, insomnia and suicidal ideation. Methods: Participants (N=103; mean age 16.1 years) were recruited from three London schools. Participants completed the Strengths and Difficulties Questionnaire, the Eating Disorders-15 Questionnaire, Sleep Condition Indicator Questionnaire and indicated the presence/absence of suicidal ideation. They used the Mindcraft app for 14 days, contributing active data via self-reports and passive data from smartphone sensors. A contrastive pretraining phase was applied to enhance user-specific feature stability, followed by supervised fine-tuning. The model evaluation employed leave-one-subject-out cross-validation using balanced accuracy as the primary metric. Results: The integration of active and passive data achieved superior performance compared to individual data sources, with mean balanced accuracies of 0.71 for SDQ-High risk, 0.67 for insomnia, 0.77 for suicidal ideation and 0.70 for eating disorders. The contrastive learning framework stabilised daily behavioural representations, enhancing predictive robustness. This study demonstrates the potential of integrating active and passive smartphone data with advanced machine-learning techniques for predicting mental health risks.
[AI-15] Graph Counterfactual Explainable AI via Latent Space Traversal
链接: https://arxiv.org/abs/2501.08850
作者: Andreas Abildtrup Hansen,Paraskevas Pegios,Anna Calissano,Aasa Feragen
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
*备注: Published at Northern Lights Deep Learning Conference 2025
点击查看摘要
Abstract:Explaining the predictions of a deep neural network is a nontrivial task, yet high-quality explanations for predictions are often a prerequisite for practitioners to trust these models. Counterfactual explanations aim to explain predictions by finding the ‘‘nearest’’ in-distribution alternative input whose prediction changes in a pre-specified way. However, it remains an open question how to define this nearest alternative input, whose solution depends on both the domain (e.g. images, graphs, tabular data, etc.) and the specific application considered. For graphs, this problem is complicated i) by their discrete nature, as opposed to the continuous nature of state-of-the-art graph classifiers; and ii) by the node permutation group acting on the graphs. We propose a method to generate counterfactual explanations for any differentiable black-box graph classifier, utilizing a case-specific permutation equivariant graph variational autoencoder. We generate counterfactual explanations in a continuous fashion by traversing the latent space of the autoencoder across the classification boundary of the classifier, allowing for seamless integration of discrete graph structure and continuous graph attributes. We empirically validate the approach on three graph datasets, showing that our model is consistently high-performing and more robust than the baselines.
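The traversal idea — walk a latent code across the classifier's decision boundary and decode the crossing point as the counterfactual — can be sketched generically. Below, a plain vector and a black-box probability function stand in for the paper's permutation-equivariant graph VAE and graph classifier, so this is only a structural sketch under those stand-in assumptions.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def num_grad(f, z, eps=1e-4):
    """Central-difference gradient, so the classifier stays a black box."""
    g = []
    for i in range(len(z)):
        zp, zm = z[:], z[:]
        zp[i] += eps
        zm[i] -= eps
        g.append((f(zp) - f(zm)) / (2 * eps))
    return g

def latent_counterfactual(z, prob, step=0.1, max_steps=500):
    """Walk the latent code along the classifier's gradient until
    the predicted class flips; decoding the returned code would
    yield the counterfactual example."""
    z = z[:]
    for _ in range(max_steps):
        if prob(z) >= 0.5:      # crossed the classification boundary
            return z
        g = num_grad(prob, z)
        norm = math.sqrt(sum(v * v for v in g)) or 1.0
        z = [zi + step * gi / norm for zi, gi in zip(z, g)]
    return z

# stand-in "classifier" on a 3-d latent space (weights are made up)
w, b = [1.0, -2.0, 0.5], -1.0
prob = lambda z: sigmoid(sum(wi * zi for wi, zi in zip(w, z)) + b)
z0 = [0.0, 0.0, 0.0]
assert prob(z0) < 0.5
z_cf = latent_counterfactual(z0, prob)
assert prob(z_cf) >= 0.5
```

The continuous traversal is what lets the method sidestep the discreteness of graphs: all movement happens in the VAE's latent space, and the decoder maps the boundary-crossing code back to a graph.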
[AI-16] RouteNet-Gauss: Hardware-Enhanced Network Modeling with Machine Learning
链接: https://arxiv.org/abs/2501.08848
作者: Carlos Güemes-Palau,Miquel Ferriol-Galmés,Jordi Paillisse-Vilanova,Albert López-Brescó,Pere Barlet-Ros,Albert Cabellos-Aparicio
类目: Networking and Internet Architecture (cs.NI); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: 13 pages, 11 figures
点击查看摘要
Abstract:Network simulation is pivotal in network modeling, assisting with tasks ranging from capacity planning to performance estimation. Traditional approaches such as Discrete Event Simulation (DES) face limitations in terms of computational cost and accuracy. This paper introduces RouteNet-Gauss, a novel integration of a testbed network with a Machine Learning (ML) model to address these challenges. By using the testbed as a hardware accelerator, RouteNet-Gauss generates training datasets rapidly and simulates network scenarios with high fidelity to real-world conditions. Experimental results show that RouteNet-Gauss significantly reduces prediction errors by up to 95% and achieves a 488x speedup in inference time compared to state-of-the-art DES-based methods. RouteNet-Gauss’s modular architecture is dynamically constructed based on the specific characteristics of the network scenario, such as topology and routing. This enables it to understand and generalize to different network configurations beyond those seen during training, including networks up to 10x larger. Additionally, it supports Temporal Aggregated Performance Estimation (TAPE), providing configurable temporal granularity and maintaining high accuracy in flow performance metrics. This approach shows promise in improving both simulation efficiency and accuracy, offering a valuable tool for network operators.
[AI-17] Automatic tuning of communication protocols for vehicular ad hoc networks using metaheuristics
链接: https://arxiv.org/abs/2501.08847
作者: José García-Nieto,Jamal Toutouh,Enrique Alba
类目: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI); Networking and Internet Architecture (cs.NI)
*备注:
点击查看摘要
Abstract:The emerging field of vehicular ad hoc networks (VANETs) deals with a set of communicating vehicles which are able to spontaneously interconnect without any pre-existing infrastructure. In such networks, it is crucial to make an optimal configuration of the communication protocols prior to the final network deployment. This way, a human designer can obtain an optimal QoS of the network beforehand. The problem we consider in this work lies in configuring the File Transfer protocol Configuration (FTC) with the aim of optimizing the transmission time, the number of lost packets, and the amount of data transferred in realistic VANET scenarios. We face the FTC with five representative state-of-the-art optimization techniques and compare their performance. These algorithms are: Particle Swarm Optimization (PSO), Differential Evolution (DE), Genetic Algorithm (GA), Evolutionary Strategy (ES), and Simulated Annealing (SA). For our tests, two typical environment instances of VANETs for Urban and Highway scenarios have been defined. The experiments using ns-2 (a well-known realistic VANET simulator) reveal that PSO outperforms all the compared algorithms for both studied VANET instances.
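Since PSO is the winning metaheuristic here, a minimal version is worth sketching. The quadratic cost below is a toy surrogate for the simulated QoS objective (the paper evaluates candidate protocol configurations with ns-2 instead); all names and parameter values are ours.

```python
import random

def pso(cost, bounds, n_particles=20, iters=100, w=0.7, c1=1.5, c2=1.5, seed=1):
    """Minimal Particle Swarm Optimization: each particle tracks its
    personal best, the swarm tracks a global best, and velocities
    blend inertia with attraction toward both."""
    random.seed(seed)
    dim = len(bounds)
    pos = [[random.uniform(lo, hi) for lo, hi in bounds] for _ in range(n_particles)]
    vel = [[0.0] * dim for _ in range(n_particles)]
    pbest = [p[:] for p in pos]
    pbest_cost = [cost(p) for p in pos]
    gi = min(range(n_particles), key=pbest_cost.__getitem__)
    gbest, gbest_cost = pbest[gi][:], pbest_cost[gi]
    for _ in range(iters):
        for i in range(n_particles):
            for d in range(dim):
                r1, r2 = random.random(), random.random()
                vel[i][d] = (w * vel[i][d]
                             + c1 * r1 * (pbest[i][d] - pos[i][d])
                             + c2 * r2 * (gbest[d] - pos[i][d]))
                lo, hi = bounds[d]
                pos[i][d] = min(max(pos[i][d] + vel[i][d], lo), hi)
            c = cost(pos[i])
            if c < pbest_cost[i]:
                pbest[i], pbest_cost[i] = pos[i][:], c
                if c < gbest_cost:
                    gbest, gbest_cost = pos[i][:], c
    return gbest, gbest_cost

# toy QoS surrogate: best protocol configuration at (2.0, 5.0)
qos = lambda p: (p[0] - 2.0) ** 2 + (p[1] - 5.0) ** 2
best, best_cost = pso(qos, [(0.0, 10.0), (0.0, 10.0)])
assert best_cost < 0.1
```

In the paper's setup, each `cost` evaluation is a full ns-2 simulation of the candidate FTC parameters, which is why the number of evaluations the metaheuristic needs matters as much as final solution quality.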
[AI-18] XMusic: Towards a Generalized and Controllable Symbolic Music Generation Framework
链接: https://arxiv.org/abs/2501.08809
作者: Sida Tian,Can Zhang,Wei Yuan,Wei Tan,Wenjie Zhu
类目: Sound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
*备注: accepted by TMM
点击查看摘要
Abstract:In recent years, remarkable advancements in artificial intelligence-generated content (AIGC) have been achieved in the fields of image synthesis and text generation, generating content comparable to that produced by humans. However, the quality of AI-generated music has not yet reached this standard, primarily due to the challenge of effectively controlling musical emotions and ensuring high-quality outputs. This paper presents a generalized symbolic music generation framework, XMusic, which supports flexible prompts (i.e., images, videos, texts, tags, and humming) to generate emotionally controllable and high-quality symbolic music. XMusic consists of two core components, XProjector and XComposer. XProjector parses the prompts of various modalities into symbolic music elements (i.e., emotions, genres, rhythms and notes) within the projection space to generate matching music. XComposer contains a Generator and a Selector. The Generator generates emotionally controllable and melodious music based on our innovative symbolic music representation, whereas the Selector identifies high-quality symbolic music by constructing a multi-task learning scheme involving quality assessment, emotion recognition, and genre recognition tasks. In addition, we build XMIDI, a large-scale symbolic music dataset that contains 108,023 MIDI files annotated with precise emotion and genre labels. Objective and subjective evaluations show that XMusic significantly outperforms the current state-of-the-art methods with impressive music quality. Our XMusic has been awarded as one of the nine Highlights of Collectibles at WAIC 2023. The project homepage of XMusic is this https URL.
[AI-19] Networked Agents in the Dark: Team Value Learning under Partial Observability AAMAS 2025
链接: https://arxiv.org/abs/2501.08778
作者: Guilherme S. Varela,Alberto Sardinha,Francisco S. Melo
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
*备注: 18 pages, 7 figures, 5 tables. Accepted as supplemental material at Proceedings of the 24th International Conference on Autonomous Agents and Multiagent Systems (AAMAS 2025), Detroit, Michigan, USA, May 19 - 23, 2025, IFAAMAS
点击查看摘要
Abstract:We propose a novel cooperative multi-agent reinforcement learning (MARL) approach for networked agents. In contrast to previous methods that rely on complete state information or joint observations, our agents must learn how to reach shared objectives under partial observability. During training, they collect individual rewards and approximate a team value function through local communication, resulting in cooperative behavior. To describe our problem, we introduce the networked dynamic partially observable Markov game framework, where agents communicate over a switching topology communication network. Our distributed method, DNA-MARL, uses a consensus mechanism for local communication and gradient descent for local computation. DNA-MARL increases the range of the possible applications of networked agents, being well-suited for real world domains that impose privacy and where the messages may not reach their recipients. We evaluate DNA-MARL across benchmark MARL scenarios. Our results highlight the superior performance of DNA-MARL over previous methods.
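The consensus ingredient can be illustrated in isolation: agents repeatedly average their local team-value estimates with their neighbors' over a switching communication graph. The sketch below uses standard Metropolis weights (which keep the network-wide average invariant); DNA-MARL's actual update also interleaves local gradient steps, so treat this as a toy of the communication part only.

```python
def metropolis_step(x, edges):
    """One communication round: each agent nudges its estimate
    toward its neighbors' using Metropolis weights
    w_ij = 1 / (1 + max(deg_i, deg_j)), a standard scheme whose
    symmetric updates preserve the network-wide average exactly."""
    deg = [0] * len(x)
    for a, b in edges:
        deg[a] += 1
        deg[b] += 1
    new = x[:]
    for a, b in edges:
        w = 1.0 / (1 + max(deg[a], deg[b]))
        new[a] += w * (x[b] - x[a])
        new[b] += w * (x[a] - x[b])
    return new

estimates = [1.0, 3.0, 5.0, 7.0]           # per-agent local value estimates
ring = [(0, 1), (1, 2), (2, 3), (3, 0)]
line = [(0, 1), (1, 2), (2, 3)]
for t in range(200):                       # switching topology
    estimates = metropolis_step(estimates, ring if t % 2 == 0 else line)
assert all(abs(v - 4.0) < 1e-6 for v in estimates)
```

Every agent converges to the team average (4.0 here) using only local messages, which is the mechanism that lets DNA-MARL approximate a team value function without sharing full state.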
[AI-20] How Developers Interact with AI: A Taxonomy of Human-AI Collaboration in Software Engineering
链接: https://arxiv.org/abs/2501.08774
作者: Christoph Treude,Marco A. Gerosa
类目: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
*备注: Accepted at 2nd ACM International Conference on AI Foundation Models and Software Engineering (FORGE 2025)
点击查看摘要
Abstract:Artificial intelligence (AI), including large language models and generative AI, is emerging as a significant force in software development, offering developers powerful tools that span the entire development lifecycle. Although software engineering research has extensively studied AI tools in software development, the specific types of interactions between developers and these AI-powered tools have only recently begun to receive attention. Understanding and improving these interactions has the potential to improve productivity, trust, and efficiency in AI-driven workflows. In this paper, we propose a taxonomy of interaction types between developers and AI tools, identifying eleven distinct interaction types, such as auto-complete code suggestions, command-driven actions, and conversational assistance. Building on this taxonomy, we outline a research agenda focused on optimizing AI interactions, improving developer control, and addressing trust and usability challenges in AI-assisted development. By establishing a structured foundation for studying developer-AI interactions, this paper aims to stimulate research on creating more effective, adaptive AI tools for software development.
[AI-21] Leveraging LLM Agents for Translating Network Configurations
链接: https://arxiv.org/abs/2501.08760
作者: Yunze Wei,Xiaohui Xie,Yiwei Zuo,Tianshuo Hu,Xinyi Chen,Kaiwen Chi,Yong Cui
类目: Networking and Internet Architecture (cs.NI); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Software Engineering (cs.SE)
*备注:
点击查看摘要
Abstract:Configuration translation is a critical and frequent task in network operations. When a network device is damaged or outdated, administrators need to replace it to maintain service continuity. The replacement devices may originate from different vendors, necessitating configuration translation to ensure seamless network operation. However, translating configurations manually is a labor-intensive and error-prone process. In this paper, we propose an intent-based framework for translating network configuration with Large Language Model (LLM) Agents. The core of our approach is an Intent-based Retrieval Augmented Generation (IRAG) module that systematically splits a configuration file into fragments, extracts intents, and generates accurate translations. We also design a two-stage verification method to validate the syntax and semantics correctness of the translated configurations. We implement and evaluate the proposed method on real-world network configurations. Experimental results show that our method achieves 97.74% syntax correctness, outperforming state-of-the-art methods in translation accuracy.
[AI-22] SPEQ: Stabilization Phases for Efficient Q-Learning in High Update-To-Data Ratio Reinforcement Learning
链接: https://arxiv.org/abs/2501.08669
作者: Carlo Romeo,Girolamo Macaluso,Alessandro Sestini,Andrew D. Bagdanov
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:A key challenge in Deep Reinforcement Learning is sample efficiency, especially in real-world applications where collecting environment interactions is expensive or risky. Recent off-policy algorithms improve sample efficiency by increasing the Update-To-Data (UTD) ratio and performing more gradient updates per environment interaction. While this improves sample efficiency, it significantly increases computational cost due to the higher number of gradient updates required. In this paper we propose a sample-efficient method to improve computational efficiency by separating training into distinct learning phases in order to exploit gradient updates more effectively. Our approach builds on top of the Dropout Q-Functions (DroQ) algorithm and alternates between an online, low UTD ratio training phase, and an offline stabilization phase. During the stabilization phase, we fine-tune the Q-functions without collecting new environment interactions. This process improves the effectiveness of the replay buffer and reduces computational overhead. Our experimental results on continuous control problems show that our method achieves results comparable to state-of-the-art, high UTD ratio algorithms while requiring 56% fewer gradient updates and 50% less training time than DroQ. Our approach offers an effective and computationally economical solution while maintaining the same sample efficiency as the more costly, high UTD ratio state-of-the-art.
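The alternating schedule is easy to picture as a training loop. The toy below shows only the structure (a single-state value estimate, not DroQ with Q-functions): an online phase doing one update per new environment interaction (low UTD ratio), then an offline stabilization phase that fine-tunes on the replay buffer without collecting anything new. All names and hyperparameters are ours.

```python
import random

def speq_style_training(env_reward, phases=5, online_steps=50,
                        stab_updates=200, lr=0.05, seed=0):
    """Structural sketch of SPEQ's alternating phases on a toy
    single-state problem: online collection with UTD ratio 1,
    then offline replay-only fine-tuning."""
    random.seed(seed)
    q, buffer = 0.0, []
    for _ in range(phases):
        for _ in range(online_steps):        # online phase, UTD ratio = 1
            r = env_reward()
            buffer.append(r)
            q += lr * (r - q)
        for _ in range(stab_updates):        # offline stabilization phase
            r = random.choice(buffer)        # no new interactions collected
            q += lr * (r - q)
    return q

q = speq_style_training(lambda: random.gauss(1.0, 0.1))
assert abs(q - 1.0) < 0.2
```

The point of the split is that the expensive gradient updates are concentrated where they need no fresh data, which is how the paper matches high-UTD baselines with far fewer total updates.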
[AI-23] Application of Deep Reinforcement Learning to UAV Swarming for Ground Surveillance
链接: https://arxiv.org/abs/2501.08655
作者: Raúl Arranz,David Carramiñana,Gonzalo de Miguel,Juan A. Besada,Ana M. Bernardos
类目: Artificial Intelligence (cs.AI); Robotics (cs.RO)
*备注:
点击查看摘要
Abstract:This paper summarizes in depth the state of the art of aerial swarms, covering both classical and new reinforcement-learning-based approaches for their management. Then, it proposes a hybrid AI system, integrating deep reinforcement learning in a multi-agent centralized swarm architecture. The proposed system is tailored to perform surveillance of a specific area, searching and tracking ground targets, for security and law enforcement applications. The swarm is governed by a central swarm controller responsible for distributing different search and tracking tasks among the cooperating UAVs. Each UAV agent is then controlled by a collection of cooperative sub-agents, whose behaviors have been trained using different deep reinforcement learning models, tailored for the different task types proposed by the swarm controller. More specifically, proximal policy optimization (PPO) algorithms were used to train the agents’ behavior. In addition, several metrics to assess the performance of the swarm in this application were defined. The results obtained through simulation show that our system searches the operation area effectively, acquires the targets in a reasonable time, and is capable of tracking them continuously and consistently.
[AI-24] Fine-grained Spatio-temporal Event Prediction with Self-adaptive Anchor Graph SDM’25
链接: https://arxiv.org/abs/2501.08653
作者: Wang-Tao Zhou,Zhao Kang,Sicong Liu,Lizong Zhang,Ling Tian
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Social and Information Networks (cs.SI)
*备注: Accepted to SIAM International Conference on Data Mining 2025 (SDM’25)
点击查看摘要
Abstract:Event prediction tasks often handle spatio-temporal data distributed in a large spatial area. Different regions in the area exhibit different characteristics while having latent correlations. This spatial heterogeneity and correlations greatly affect the spatio-temporal distributions of event occurrences, which has not been addressed by state-of-the-art models. Learning spatial dependencies of events in a continuous space is challenging due to its fine granularity and a lack of prior knowledge. In this work, we propose a novel Graph Spatio-Temporal Point Process (GSTPP) model for fine-grained event prediction. It adopts an encoder-decoder architecture that jointly models the state dynamics of spatially localized regions using neural Ordinary Differential Equations (ODEs). The state evolution is built on the foundation of a novel Self-Adaptive Anchor Graph (SAAG) that captures spatial dependencies. By adaptively localizing the anchor nodes in the space and jointly constructing the correlation edges between them, the SAAG enhances the model’s ability of learning complex spatial event patterns. The proposed GSTPP model greatly improves the accuracy of fine-grained event prediction. Extensive experimental results show that our method greatly improves the prediction accuracy over existing spatio-temporal event prediction approaches.
[AI-25] Monte Carlo Tree Search for Comprehensive Exploration in LLM-Based Automatic Heuristic Design
链接: https://arxiv.org/abs/2501.08603
作者: Zhi Zheng,Zhuoliang Xie,Zhenkun Wang,Bryan Hooi
类目: Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:Handcrafting heuristics for solving complex planning tasks (e.g., NP-hard combinatorial optimization (CO) problems) is a common practice but requires extensive domain knowledge. Recently, Large Language Model (LLM)-based automatic heuristics design (AHD) methods have shown promise in generating high-quality heuristics without manual intervention. Existing LLM-based AHD methods employ a population to maintain a fixed number of top-performing LLM-generated heuristics and introduce evolutionary computation (EC) to enhance the population iteratively. However, the population-based procedure brings greedy properties, often resulting in convergence to local optima. Instead, to more comprehensively explore the space of heuristics, we propose using Monte Carlo Tree Search (MCTS) for LLM-based heuristic evolution while preserving all LLM-generated heuristics in a tree structure. With a novel thought-alignment process and an exploration-decay technique, the proposed MCTS-AHD method delivers significantly higher-quality heuristics on various complex tasks. Our code is available at this https URL.
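The MCTS selection step that replaces the population's greedy choice can be sketched with standard UCT scoring; the caller decays the exploration weight over iterations, mirroring the paper's exploration-decay idea. The LLM-driven expansion/evaluation and thought-alignment steps are omitted, and the node layout is our invention.

```python
import math

def uct_select(children, total_visits, c):
    """UCT child selection over heuristic nodes: balance each
    heuristic's average quality (exploitation) against how rarely
    it has been visited (exploration, weighted by c)."""
    def score(ch):
        if ch["visits"] == 0:
            return float("inf")        # always try unvisited heuristics first
        exploit = ch["value"] / ch["visits"]
        explore = c * math.sqrt(math.log(total_visits) / ch["visits"])
        return exploit + explore
    return max(children, key=score)

children = [
    {"name": "h1", "value": 9.0, "visits": 10},   # avg 0.9
    {"name": "h2", "value": 2.0, "visits": 4},    # avg 0.5
    {"name": "h3", "value": 0.0, "visits": 0},    # never evaluated
]
assert uct_select(children, 14, c=1.4)["name"] == "h3"
children[2].update(value=0.5, visits=1)
# with a decayed exploration weight, the best-average heuristic wins
assert uct_select(children, 15, c=0.01)["name"] == "h1"
```

Because every generated heuristic stays in the tree, a line of heuristics that looked weak early can still be revisited later — the property the paper uses to escape the local optima that population-based evolution converges to.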
[AI-26] AutoRestTest: A Tool for Automated REST API Testing Using LLMs and MARL ICSE
链接: https://arxiv.org/abs/2501.08600
作者: Tyler Stennett,Myeongsoo Kim,Saurabh Sinha,Alessandro Orso
类目: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
*备注: To be published in the 47th IEEE/ACM International Conference on Software Engineering - Demonstration Track (ICSE-Demo 2025)
点击查看摘要
Abstract:As REST APIs have become widespread in modern web services, comprehensive testing of these APIs has become increasingly crucial. Due to the vast search space consisting of operations, parameters, and parameter values along with their complex dependencies and constraints, current testing tools suffer from low code coverage, leading to suboptimal fault detection. To address this limitation, we present a novel tool, AutoRestTest, which integrates the Semantic Operation Dependency Graph (SODG) with Multi-Agent Reinforcement Learning (MARL) and large language models (LLMs) for effective REST API testing. AutoRestTest determines operation-dependent parameters using the SODG and employs five specialized agents (operation, parameter, value, dependency, and header) to identify dependencies of operations and generate operation sequences, parameter combinations, and values. AutoRestTest provides a command-line interface and continuous telemetry on successful operation count, unique server errors detected, and time elapsed. Upon completion, AutoRestTest generates a detailed report highlighting errors detected and operations exercised. In this paper, we introduce our tool and present preliminary results.
[AI-27] LlamaRestTest: Effective REST API Testing with Small Language Models
链接: https://arxiv.org/abs/2501.08598
作者: Myeongsoo Kim,Saurabh Sinha,Alessandro Orso
类目: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
*备注: To be published in the ACM International Conference on the Foundations of Software Engineering (FSE 2025)
点击查看摘要
Abstract:Modern web services rely heavily on REST APIs, typically documented using the OpenAPI specification. The widespread adoption of this standard has resulted in the development of many black-box testing tools that generate tests based on these specifications. Recent advancements in Natural Language Processing (NLP), particularly with Large Language Models (LLMs), have enhanced REST API testing by extracting actionable rules and generating input values from the human-readable portions of the specification. However, these advancements overlook the potential of continuously refining the identified rules and test inputs based on server responses. To address this limitation, we present LlamaRestTest, a novel approach that employs two custom LLMs to generate realistic test inputs and uncover parameter dependencies during the testing process by incorporating server responses. These LLMs are created by fine-tuning the Llama3-8b model, using mined datasets of REST API example values and inter-parameter dependencies. We evaluated LlamaRestTest on 12 real-world services (including popular services such as Spotify), comparing it against RESTGPT, a GPT-powered specification-enhancement tool, as well as several state-of-the-art REST API testing tools, including RESTler, MoRest, EvoMaster, and ARAT-RL. Our results show that fine-tuning enables smaller LLMs to outperform larger models in detecting actionable rules and generating inputs for REST API testing. We evaluated configurations from the base Llama3-8B to fine-tuned versions and explored 2-bit, 4-bit, and 8-bit quantization for efficiency. LlamaRestTest surpasses state-of-the-art tools in code coverage and error detection, even with RESTGPT-enhanced specifications, and an ablation study highlights the impact of its novel components.
[AI-28] OpenMLDB: A Real-Time Relational Data Feature Computation System for Online ML
Link: https://arxiv.org/abs/2501.08591
Authors: Xuanhe Zhou, Wei Zhou, Liguo Qi, Hao Zhang, Dihao Chen, Bingsheng He, Mian Lu, Guoliang Li, Fan Wu, Yuqiang Chen
Subjects: Databases (cs.DB); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*Comments:
Abstract:Efficient and consistent feature computation is crucial for a wide range of online ML applications. Typically, feature computation is divided into two distinct phases, i.e., offline stage for model training and online stage for model serving. These phases often rely on execution engines with different interface languages and function implementations, causing significant inconsistencies. Moreover, many online ML features involve complex time-series computations (e.g., functions over varied-length table windows) that differ from standard streaming and analytical queries. Existing data processing systems (e.g., Spark, Flink, DuckDB) often incur multi-second latencies for these computations, making them unsuitable for real-time online ML applications that demand timely feature updates. This paper presents OpenMLDB, a feature computation system deployed in 4Paradigm’s SageOne platform and over 100 real scenarios. Technically, OpenMLDB first employs a unified query plan generator for consistent computation results across the offline and online stages, significantly reducing feature deployment overhead. Second, OpenMLDB provides an online execution engine that resolves performance bottlenecks caused by long window computations (via pre-aggregation) and multi-table window unions (via data self-adjusting). It also provides a high-performance offline execution engine with window parallel optimization and time-aware data skew resolving. Third, OpenMLDB features a compact data format and stream-focused indexing to maximize memory usage and accelerate data access. Evaluations in testing and real workloads reveal significant performance improvements and resource savings compared to the baseline systems. The open community of OpenMLDB now has over 150 contributors and gained 1.6k stars on GitHub. 
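The long-window pre-aggregation idea described in the abstract can be illustrated with a small sketch (the bucket layout and function names below are illustrative assumptions, not OpenMLDB's actual implementation): partial sums are maintained per fixed-size bucket, so a long window aggregate combines a few bucket totals with a short edge scan instead of re-reading every row.

```python
BUCKET = 4

def build_buckets(values):
    # Precompute the sum of each fixed-size bucket of the stream.
    return [sum(values[i:i + BUCKET]) for i in range(0, len(values), BUCKET)]

def window_sum(values, buckets, window):
    # Sum over the last `window` rows: a short edge scan plus whole-bucket totals.
    start = len(values) - window
    first_full = -(-start // BUCKET)               # first bucket fully inside the window
    edge = sum(values[start:first_full * BUCKET])  # rows before that bucket
    return edge + sum(buckets[first_full:])

values = list(range(1, 21))   # 20 rows: 1, 2, ..., 20
buckets = build_buckets(values)
assert window_sum(values, buckets, 10) == sum(values[-10:])   # 155
assert window_sum(values, buckets, 7) == sum(values[-7:])     # 119
```

The same trick generalizes to other decomposable aggregates (COUNT, MIN/MAX with per-bucket extrema), which is why it helps the long-window computations the paper targets.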
[AI-29] Sound Scene Synthesis at the DCASE 2024 Challenge
Link: https://arxiv.org/abs/2501.08587
Authors: Mathieu Lagrange, Junwon Lee, Modan Tailleur, Laurie M. Heller, Keunwoo Choi, Brian McFee, Keisuke Imoto, Yuki Okamoto
Subjects: Artificial Intelligence (cs.AI); Sound (cs.SD); Audio and Speech Processing (eess.AS)
*Comments:
Abstract:This paper presents Task 7 at the DCASE 2024 Challenge: sound scene synthesis. Recent advances in sound synthesis and generative models have enabled the creation of realistic and diverse audio content. We introduce a standardized evaluation framework for comparing different sound scene synthesis systems, incorporating both objective and subjective metrics. The challenge attracted four submissions, which are evaluated using the Fréchet Audio Distance (FAD) and human perceptual ratings. Our analysis reveals significant insights into the current capabilities and limitations of sound scene synthesis systems, while also highlighting areas for future improvement in this rapidly evolving field.
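For intuition, the Fréchet Audio Distance (FAD) used in the challenge reduces to a simple closed form when the embedding statistics are one-dimensional Gaussians; the real metric applies the multivariate version to learned audio-embedding means and covariances. The helper below is an illustrative sketch, not the DCASE evaluation code:

```python
import math

def fad_1d(mu1, var1, mu2, var2):
    # Fréchet distance between two 1-D Gaussians N(mu1, var1) and N(mu2, var2):
    # (mu1 - mu2)^2 + var1 + var2 - 2 * sqrt(var1 * var2)
    return (mu1 - mu2) ** 2 + var1 + var2 - 2.0 * math.sqrt(var1 * var2)

assert fad_1d(0.0, 1.0, 0.0, 1.0) == 0.0   # identical distributions score zero
assert fad_1d(1.0, 4.0, 3.0, 1.0) == 5.0   # 4 + 4 + 1 - 4
```

Lower values indicate that generated audio statistics are closer to the reference distribution.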
[AI-30] Evaluating SAT and SMT Solvers on Large-Scale Sudoku Puzzles
Link: https://arxiv.org/abs/2501.08569
Authors: Liam Davis, Tairan Ji
Subjects: Artificial Intelligence (cs.AI); Logic in Computer Science (cs.LO)
*Comments:
Abstract:Modern SMT solvers have revolutionized the approach to constraint satisfaction problems by integrating advanced theory reasoning and encoding techniques. In this work, we evaluate the performance of modern SMT solvers in Z3, CVC5 and DPLL(T) against a standard SAT solver in DPLL. By benchmarking these solvers on novel, diverse 25x25 Sudoku puzzles of various difficulty levels created by our improved Sudoku generator, we examine the impact of advanced theory reasoning and encoding techniques. Our findings demonstrate that modern SMT solvers significantly outperform classical SAT solvers. This work highlights the evolution of logical solvers and exemplifies the utility of SMT solvers in addressing large-scale constraint satisfaction problems.
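The SAT encoding being benchmarked can be sketched on a 4x4 board (the paper's puzzles are 25x25; the pairwise at-most-one encoding below is one standard choice among several):

```python
from itertools import combinations

N, B = 4, 2   # board size and box size; the paper's puzzles use N=25, B=5

def var(r, c, d):
    # Map (row, col, digit) to a positive DIMACS-style variable id.
    return r * N * N + c * N + d + 1

clauses = []
# Each cell holds at least one digit, and at most one (pairwise encoding).
for r in range(N):
    for c in range(N):
        clauses.append([var(r, c, d) for d in range(N)])
        for d1, d2 in combinations(range(N), 2):
            clauses.append([-var(r, c, d1), -var(r, c, d2)])
# Each digit appears in every row, column, and box.
for d in range(N):
    for r in range(N):
        clauses.append([var(r, c, d) for c in range(N)])
    for c in range(N):
        clauses.append([var(r, c, d) for r in range(N)])
    for br in range(0, N, B):
        for bc in range(0, N, B):
            clauses.append([var(br + i, bc + j, d)
                            for i in range(B) for j in range(B)])

print(len(clauses))  # 160 clauses for the 4x4 board
```

The clause count grows quickly with N, which is why encoding choices and theory-level reasoning matter at 25x25.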
[AI-31] Towards Lightweight and Stable Zero-shot TTS with Self-distilled Representation Disentanglement
Link: https://arxiv.org/abs/2501.08566
Authors: Qianniu Chen, Xiaoyang Hao, Bowen Li, Yue Liu, Li Lu
Subjects: Sound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
*Comments: 5 pages, 4 figures
Abstract:Zero-shot Text-To-Speech (TTS) synthesis shows great promise for personalized voice customization through voice cloning. However, current methods for achieving zero-shot TTS heavily rely on large model scales and extensive training datasets to ensure satisfactory performance and generalizability across various speakers. This raises concerns regarding both deployment costs and data security. In this paper, we present a lightweight and stable zero-shot TTS system. We introduce a novel TTS architecture designed to effectively model linguistic content and various speaker attributes from source speech and prompt speech, respectively. Furthermore, we present a two-stage self-distillation framework that constructs parallel data pairs for effectively disentangling linguistic content and speakers from the perspective of training data. Extensive experiments show that our system exhibits excellent performance and superior stability on the zero-shot TTS tasks. Moreover, it shows markedly superior computational efficiency, with RTFs of 0.13 and 0.012 on the CPU and GPU, respectively.
[AI-32] DualOpt: A Dual Divide-and-Optimize Algorithm for the Large-scale Traveling Salesman Problem AAAI-25
Link: https://arxiv.org/abs/2501.08565
Authors: Shipei Zhou, Yuandong Ding, Chi Zhang, Zhiguang Cao, Yan Jin
Subjects: Artificial Intelligence (cs.AI)
*Comments: Accepted by AAAI-25, February 2025
Abstract:This paper proposes a dual divide-and-optimize algorithm (DualOpt) for solving the large-scale traveling salesman problem (TSP). DualOpt combines two complementary strategies to improve both solution quality and computational efficiency. The first strategy is a grid-based divide-and-conquer procedure that partitions the TSP into smaller sub-problems, solving them in parallel and iteratively refining the solution by merging nodes and partial routes. The process continues until only one grid remains, yielding a high-quality initial solution. The second strategy involves a path-based divide-and-optimize procedure that further optimizes the solution by dividing it into sub-paths, optimizing each using a neural solver, and merging them back to progressively improve the overall solution. Extensive experiments conducted on two groups of TSP benchmark instances, including randomly generated instances with up to 100,000 nodes and real-world datasets from TSPLIB, demonstrate the effectiveness of DualOpt. The proposed DualOpt achieves highly competitive results compared to 10 state-of-the-art algorithms in the literature. In particular, DualOpt achieves an improvement gap up to 1.40% for the largest instance TSP100K with a remarkable 104x speed-up over the leading heuristic solver LKH3. Additionally, DualOpt demonstrates strong generalization on TSPLIB benchmarks, confirming its capability to tackle diverse real-world TSP applications.
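The grid-based divide step can be sketched as a partition of unit-square TSP nodes into cells, each of which becomes an independent sub-problem (a toy illustration; DualOpt's node-merging and neural sub-path optimizer are not reproduced here):

```python
def grid_partition(points, g):
    # Bucket each (x, y) node in [0, 1)^2 into one of g*g grid cells,
    # each solved as a smaller TSP before the merge phase.
    cells = {}
    for x, y in points:
        key = (min(int(x * g), g - 1), min(int(y * g), g - 1))
        cells.setdefault(key, []).append((x, y))
    return cells

points = [(0.1, 0.1), (0.15, 0.12), (0.5, 0.5), (0.9, 0.9)]
cells = grid_partition(points, 2)
assert len(cells) == 2   # nodes cluster into two occupied cells
```

In the paper, the grid is coarsened iteratively (merging cells and partial routes) until a single grid remains.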
[AI-33] ANSR-DT: An Adaptive Neuro-Symbolic Learning and Reasoning Framework for Digital Twins
Link: https://arxiv.org/abs/2501.08561
Authors: Safayat Bin Hakim, Muhammad Adil, Alvaro Velasquez, Houbing Herbert Song
Subjects: Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG); Symbolic Computation (cs.SC)
*Comments:
Abstract:In this paper, we propose an Adaptive Neuro-Symbolic Learning Framework for digital twin technology called "ANSR-DT". Our approach combines pattern recognition algorithms with reinforcement learning and symbolic reasoning to enable real-time learning and adaptive intelligence. This integration enhances the understanding of the environment and promotes continuous learning, leading to better and more effective decision-making in real-time for applications that require human-machine collaboration. We evaluated the ANSR-DT framework for its ability to learn and adapt to dynamic patterns, observing significant improvements in decision accuracy, reliability, and interpretability when compared to existing state-of-the-art methods. However, challenges still exist in extracting and integrating symbolic rules in complex environments, which limits the full potential of our framework in heterogeneous settings. Moreover, our ongoing research aims to address this issue in the future by ensuring seamless integration of neural models at large. In addition, our open-source implementation promotes reproducibility and encourages future research to build on our foundational work.
[AI-34] LAMS: LLM-Driven Automatic Mode Switching for Assistive Teleoperation
Link: https://arxiv.org/abs/2501.08558
Authors: Yiran Tao, Jehan Yang, Dan Ding, Zackory Erickson
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
*Comments:
Abstract:Teleoperating high degrees-of-freedom (DoF) robotic manipulators via low-DoF controllers like joysticks often requires frequent switching between control modes, where each mode maps controller movements to specific robot actions. Manually performing this frequent switching can make teleoperation cumbersome and inefficient. On the other hand, existing automatic mode-switching solutions, such as heuristic-based or learning-based methods, are often task-specific and lack generalizability. In this paper, we introduce LLM-Driven Automatic Mode Switching (LAMS), a novel approach that leverages Large Language Models (LLMs) to automatically switch control modes based on task context. Unlike existing methods, LAMS requires no prior task demonstrations and incrementally improves by integrating user-generated mode-switching examples. We validate LAMS through an ablation study and a user study with 10 participants on complex, long-horizon tasks, demonstrating that LAMS effectively reduces manual mode switches, is preferred over alternative methods, and improves performance over time. The project website with supplementary materials is at this https URL.
[AI-35] Reinforcement Learning-Enhanced Procedural Generation for Dynamic Narrative-Driven AR Experiences
Link: https://arxiv.org/abs/2501.08552
Authors: Aniruddha Srinivas Joshi
Subjects: Artificial Intelligence (cs.AI); Graphics (cs.GR); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
*Comments: 13 pages, 4 figures. Accepted for presentation at GRAPP 2025 - 20th International Conference on Computer Graphics Theory and Applications (for additional details on the conference visit this https URL). Disclaimer: This preprint may differ from the final version published in the conference proceedings
Abstract:Procedural Content Generation (PCG) is widely used to create scalable and diverse environments in games. However, existing methods, such as the Wave Function Collapse (WFC) algorithm, are often limited to static scenarios and lack the adaptability required for dynamic, narrative-driven applications, particularly in augmented reality (AR) games. This paper presents a reinforcement learning-enhanced WFC framework designed for mobile AR environments. By integrating environment-specific rules and dynamic tile weight adjustments informed by reinforcement learning (RL), the proposed method generates maps that are both contextually coherent and responsive to gameplay needs. Comparative evaluations and user studies demonstrate that the framework achieves superior map quality and delivers immersive experiences, making it well-suited for narrative-driven AR games. Additionally, the method holds promise for broader applications in education, simulation training, and immersive extended reality (XR) experiences, where dynamic and adaptive environments are critical.
[AI-36] Dynamic Portfolio Optimization via Augmented DDPG with Quantum Price Levels-Based Trading Strategy
Link: https://arxiv.org/abs/2501.08528
Authors: Runsheng Lin, Zihan Xing, Mingze Ma, Raymond S.T. Lee
Subjects: Computational Engineering, Finance, and Science (cs.CE); Artificial Intelligence (cs.AI)
*Comments: 8 pages
Abstract:With the development of deep learning, the Dynamic Portfolio Optimization (DPO) problem has received a lot of attention in recent years, not only in the field of finance but also in the field of deep learning. Some advanced research in recent years has proposed the application of Deep Reinforcement Learning (DRL) to the DPO problem, which has been demonstrated to be more advantageous than supervised learning in solving the DPO problem. However, there are still certain unsolved issues: 1) DRL algorithms usually have the problems of slow learning speed and high sample complexity, which is especially problematic when dealing with complex financial data. 2) Researchers use DRL simply for the purpose of obtaining high returns, but pay little attention to the problem of risk control and trading strategy, which will affect the stability of model returns. In order to address these issues, in this study we revamped the intrinsic structure of the model based on the Deep Deterministic Policy Gradient (DDPG) and proposed the Augmented DDPG model. Besides, we also proposed an innovative risk control strategy based on Quantum Price Levels (QPLs) derived from Quantum Finance Theory (QFT). Our experimental results revealed that our model has better profitability as well as risk control ability with less sample complexity in the DPO problem compared to the baseline models.
[AI-37] Mitigating Domain Shift in Federated Learning via Intra- and Inter-Domain Prototypes
Link: https://arxiv.org/abs/2501.08521
Authors: Huy Q. Le, Ye Lin Tun, Yu Qiao, Minh N. H. Nguyen, Keon Oh Kim, Choong Seon Hong
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*Comments: 13 pages, 9 figures, 10 tables
Abstract:Federated Learning (FL) has emerged as a decentralized machine learning technique, allowing clients to train a global model collaboratively without sharing private data. However, most FL studies ignore the crucial challenge of heterogeneous domains, where each client has a distinct feature distribution, which is common in real-world scenarios. Prototype learning, which leverages the mean feature vectors within the same classes, has become a prominent solution for federated learning under domain skew. However, existing federated prototype learning methods only consider inter-domain prototypes on the server and overlook intra-domain characteristics. In this work, we introduce a novel federated prototype learning method, namely I²PFL, which incorporates Intra-domain and Inter-domain Prototypes, to mitigate domain shifts and learn a generalized global model across multiple domains in federated learning. To construct intra-domain prototypes, we propose feature alignment with MixUp-based augmented prototypes to capture the diversity of local domains and enhance the generalization of local features. Additionally, we introduce a reweighting mechanism for inter-domain prototypes to generate generalized prototypes to provide inter-domain knowledge and reduce domain skew across multiple clients. Extensive experiments on the Digits, Office-10, and PACS datasets illustrate the superior performance of our method compared to other baselines.
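The two ingredients the abstract describes, class-mean prototypes and MixUp-style augmentation, can be sketched minimally (the paper's feature-alignment and reweighting schemes are more involved; all names here are illustrative):

```python
def prototype(features):
    # Class prototype: the mean of that class's feature vectors.
    dim = len(features[0])
    return [sum(f[i] for f in features) / len(features) for i in range(dim)]

def mixup(p1, p2, lam):
    # MixUp-style augmented prototype: a convex combination of two
    # prototypes (the paper's exact augmentation may differ).
    return [lam * a + (1 - lam) * b for a, b in zip(p1, p2)]

cls_a = [[1.0, 2.0], [3.0, 4.0]]
cls_b = [[5.0, 6.0]]
p_a, p_b = prototype(cls_a), prototype(cls_b)
aug = mixup(p_a, p_b, 0.5)
assert p_a == [2.0, 3.0] and aug == [3.5, 4.5]
```

In the federated setting, clients share only such prototypes rather than raw features, which keeps private data local.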
[AI-38] Easing Seasickness through Attention Redirection with a Mindfulness-Based Brain–Computer Interface
Link: https://arxiv.org/abs/2501.08518
Authors: Xiaoyu Bao, Kailin Xu, Jiawei Zhu, Haiyun Huang, Kangning Li, Qiyun Huang, Yuanqing Li
Subjects: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Signal Processing (eess.SP); Quantitative Methods (q-bio.QM)
*Comments:
Abstract:Seasickness is a prevalent issue that adversely impacts both passenger experiences and the operational efficiency of maritime crews. While techniques that redirect attention have proven effective in alleviating motion sickness symptoms in terrestrial environments, applying similar strategies to manage seasickness poses unique challenges due to the prolonged and intense motion environment associated with maritime travel. In this study, we propose a mindfulness brain-computer interface (BCI), specifically designed to redirect attention with the aim of mitigating seasickness symptoms in real-world settings. Our system utilizes a single-channel headband to capture prefrontal EEG signals, which are then wirelessly transmitted to computing devices for the assessment of mindfulness states. The results are transferred into real-time feedback as mindfulness scores and audiovisual stimuli, facilitating a shift in attentional focus from physiological discomfort to mindfulness practices. A total of 43 individuals participated in a real-world maritime experiment consisted of three sessions: a real-feedback mindfulness session, a resting session, and a pseudofeedback mindfulness session. Notably, 81.39% of participants reported that the mindfulness BCI intervention was effective, and there was a significant reduction in the severity of seasickness, as measured by the Misery Scale (MISC). Furthermore, EEG analysis revealed a decrease in the theta/beta ratio, corresponding with the alleviation of seasickness symptoms. A decrease in overall EEG band power during the real-feedback mindfulness session suggests that the mindfulness BCI fosters a more tranquil and downregulated state of brain activity. Together, this study presents a novel nonpharmacological, portable, and effective approach for seasickness intervention, with the potential to enhance the cruising experience for both passengers and crews.
[AI-39] A Short-Term Predict-Then-Cluster Framework for Meal Delivery Services
Link: https://arxiv.org/abs/2501.08466
Authors: Jingyi Cheng, Shadi Sharif Azadeh
Subjects: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
*Comments:
Abstract:Micro-delivery services offer promising solutions for on-demand city logistics, but their success relies on efficient real-time delivery operations and fleet management. On-demand meal delivery platforms seek to optimize real-time operations based on anticipatory insights into citywide demand distributions. To address these needs, this study proposes a short-term predict-then-cluster framework for on-demand meal delivery services. The framework utilizes ensemble-learning methods for point and distributional forecasting with multivariate features, including lagged-dependent inputs to capture demand dynamics. We introduce Constrained K-Means Clustering (CKMC) and Contiguity Constrained Hierarchical Clustering with Iterative Constraint Enforcement (CCHC-ICE) to generate dynamic clusters based on predicted demand and geographical proximity, tailored to user-defined operational constraints. Evaluations of European and Taiwanese case studies demonstrate that the proposed methods outperform traditional time series approaches in both accuracy and computational efficiency. Clustering results demonstrate that the incorporation of distributional predictions effectively addresses demand uncertainties, improving the quality of operational insights. Additionally, a simulation study demonstrates the practical value of short-term demand predictions for proactive strategies, such as idle fleet rebalancing, significantly enhancing delivery efficiency. By addressing demand uncertainties and operational constraints, our predict-then-cluster framework provides actionable insights for optimizing real-time operations. The approach is adaptable to other on-demand platform-based city logistics and passenger mobility services, promoting sustainable and efficient urban operations.
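The capacity-constrained flavour of clustering used in the framework can be sketched in one dimension as a greedy nearest-feasible-center assignment (CKMC and CCHC-ICE themselves enforce richer contiguity and user-defined operational constraints; this toy version only illustrates the constraint idea):

```python
def constrained_assign(points, centers, capacity):
    # Greedily assign each point to the nearest center that still has
    # remaining capacity (1-D for brevity).
    load = [0] * len(centers)
    assignment = {}
    for i, p in enumerate(points):
        for c in sorted(range(len(centers)), key=lambda c: (p - centers[c]) ** 2):
            if load[c] < capacity:
                assignment[i] = c
                load[c] += 1
                break
    return assignment

points = [0.0, 0.1, 0.2, 1.0]   # e.g. predicted-demand locations on a line
centers = [0.0, 1.0]
assignment = constrained_assign(points, centers, capacity=2)
assert assignment == {0: 0, 1: 0, 2: 1, 3: 1}
```

Note how the third point spills over to the far center once the nearer one is full, exactly the behaviour a capacity constraint is meant to produce.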
[AI-40] Active Sampling for Node Attribute Completion on Graphs
Link: https://arxiv.org/abs/2501.08450
Authors: Benyuan Liu, Xu Chen, Yanfeng Wang, Ya Zhang, Zhi Cao, Ivor Tsang
Subjects: Artificial Intelligence (cs.AI); Social and Information Networks (cs.SI)
*Comments:
Abstract:Node attribute, a type of crucial information for graph analysis, may be partially or completely missing for certain nodes in real world applications. Restoring the missing attributes is expected to benefit downstream graph learning. Few attempts have been made on node attribute completion, but a novel framework called Structure-attribute Transformer (SAT) was recently proposed by using a decoupled scheme to leverage structures and attributes. However, SAT treats all nodes as contributing equally to the learning schedule, and finding a practical way to model the differing importance of nodes with observed attributes is challenging. This paper proposes a novel AcTive Sampling algorithm (ATS) to restore missing node attributes. The representativeness and uncertainty of each node's information are first measured based on graph structure, representation similarity and learning bias. To select nodes as train samples in the next optimization step, a weighting scheme controlled by a Beta distribution is then introduced to linearly combine the two properties. Extensive experiments on four public benchmark datasets and two downstream tasks have shown the superiority of ATS in node attribute completion.
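The selection step can be sketched as a linear combination of representativeness and uncertainty scores. The paper draws the mixing weight from a Beta distribution; the sketch below uses its mean alpha/(alpha+beta) so the example is deterministic, and all scores are made up for illustration:

```python
def select_nodes(representativeness, uncertainty, k, alpha=2.0, beta=2.0):
    # Deterministic stand-in for the Beta-controlled weighting: use the
    # Beta(alpha, beta) mean as the mixing weight w.
    w = alpha / (alpha + beta)
    scores = [w * r + (1 - w) * u
              for r, u in zip(representativeness, uncertainty)]
    # Return the indices of the k highest-scoring nodes as train samples.
    return sorted(range(len(scores)), key=lambda i: -scores[i])[:k]

picked = select_nodes([0.9, 0.1, 0.5], [0.1, 0.9, 0.6], k=2,
                      alpha=3.0, beta=1.0)
assert picked == [0, 2]
```

Sampling w afresh each step (rather than fixing it) lets the schedule trade off the two properties over the course of training.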
[AI-41] Modeling Discrimination with Causal Abstraction
Link: https://arxiv.org/abs/2501.08429
Authors: Milan Mossé, Kara Schechtman, Frederick Eberhardt, Thomas Icard
Subjects: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
*Comments:
Abstract:A person is directly racially discriminated against only if her race caused her worse treatment. This implies that race is an attribute sufficiently separable from other attributes to isolate its causal role. But race is embedded in a nexus of social factors that resist isolated treatment. If race is socially constructed, in what sense can it cause worse treatment? Some propose that the perception of race, rather than race itself, causes worse treatment. Others suggest that since causal models require modularity, i.e. the ability to isolate causal effects, attempts to causally model discrimination are misguided. This paper addresses the problem differently. We introduce a framework for reasoning about discrimination, in which race is a high-level abstraction of lower-level features. In this framework, race can be modeled as itself causing worse treatment. Modularity is ensured by allowing assumptions about social construction to be precisely and explicitly stated, via an alignment between race and its constituents. Such assumptions can then be subjected to normative and empirical challenges, which lead to different views of when discrimination occurs. By distinguishing constitutive and causal relations, the abstraction framework pinpoints disagreements in the current literature on modeling discrimination, while preserving a precise causal account of discrimination.
[AI-42] Causal vs. Anticausal merging of predictors NEURIPS2024
Link: https://arxiv.org/abs/2501.08426
Authors: Sergio Hernan Garrido Mejia, Patrick Blöbaum, Bernhard Schölkopf, Dominik Janzing
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Methodology (stat.ME); Machine Learning (stat.ML)
*Comments: Presented at the 38th Conference on Neural Information Processing Systems (NeurIPS 2024)
Abstract:We study the differences arising from merging predictors in the causal and anticausal directions using the same data. In particular, we study the asymmetries that arise in a simple model where we merge the predictors using one binary variable as target and two continuous variables as predictors. We use Causal Maximum Entropy (CMAXENT) as the inductive bias to merge the predictors; however, we expect similar differences to hold also when we use other merging methods that take into account asymmetries between cause and effect. We show that if we observe all bivariate distributions, the CMAXENT solution reduces to a logistic regression in the causal direction and Linear Discriminant Analysis (LDA) in the anticausal direction. Furthermore, we study how the decision boundaries of these two solutions differ whenever we observe only some of the bivariate distributions, with implications for Out-Of-Variable (OOV) generalisation.
[AI-43] CVaR-Based Variational Quantum Optimization for User Association in Handoff-Aware Vehicular Networks
Link: https://arxiv.org/abs/2501.08418
Authors: Zijiang Yan, Hao Zhou, Jianhua Pei, Aryan Kaushik, Hina Tabassum, Ping Wang
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Networking and Internet Architecture (cs.NI); Systems and Control (eess.SY)
*Comments: Accepted in IEEE International Conference on Communications (ICC 2025)
Abstract:Efficient resource allocation is essential for optimizing various tasks in wireless networks, which are usually formulated as generalized assignment problems (GAP). GAP, as a generalized version of the linear sum assignment problem, involves both equality and inequality constraints that add computational challenges. In this work, we present a novel Conditional Value at Risk (CVaR)-based Variational Quantum Eigensolver (VQE) framework to address GAP in vehicular networks (VNets). Our approach leverages a hybrid quantum-classical structure, integrating a tailored cost function that balances both objective and constraint-specific penalties to improve solution quality and stability. Using the CVaR-VQE model, we handle the GAP efficiently by focusing optimization on the lower tail of the solution space, enhancing both convergence and resilience on noisy intermediate-scale quantum (NISQ) devices. We apply this framework to a user-association problem in VNets, where our method achieves 23.5% improvement compared to the deep neural network (DNN) approach.
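The CVaR objective used to focus optimization on the lower tail can be sketched directly: instead of averaging all sampled measurement energies, only the best alpha-fraction is averaged (a simplified stand-in for the full variational quantum loop):

```python
def cvar(costs, alpha):
    # Conditional Value at Risk: the mean of the lowest alpha-fraction of
    # sampled costs, so the optimizer chases good tail outcomes rather
    # than the noisy overall mean.
    k = max(1, int(len(costs) * alpha))
    tail = sorted(costs)[:k]
    return sum(tail) / k

samples = [5.0, 1.0, 3.0, 2.0, 4.0, 8.0, 7.0, 6.0]
assert cvar(samples, 1.0) == 4.5    # plain mean when alpha = 1
assert cvar(samples, 0.25) == 1.5   # mean of the two lowest samples
```

Because only low-energy shots contribute to the gradient signal, this aggregation is more robust to the measurement noise of NISQ devices, which is the behaviour the abstract attributes to the CVaR-VQE model.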
[AI-44] A Survey on Recent Advances in Self-Organizing Maps
Link: https://arxiv.org/abs/2501.08416
Authors: Axel Guérin, Pierre Chauvet, Frédéric Saubion
Subjects: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI)
*Comments: 36 pages
Abstract:Self-organising maps are a powerful tool for cluster analysis in a wide range of data contexts. From the pioneer work of Kohonen, many variants and improvements have been proposed. This review focuses on the last decade, in order to provide an overview of the main evolution of the seminal SOM algorithm as well as of the methodological developments that have been achieved in order to better fit to various application contexts and users’ requirements. We also highlight a specific and important application field that is related to commercial use of SOM, which involves specific data management.
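The seminal Kohonen update that all the surveyed variants build on fits in a few lines (a 1-D map with a Gaussian neighbourhood; the learning rate and neighbourhood width are illustrative choices):

```python
import math

def som_update(weights, x, lr=0.5, sigma=1.0):
    # One Kohonen step: find the best-matching unit (BMU), then pull every
    # node toward x, weighted by a Gaussian neighbourhood on the 1-D grid.
    dists = [sum((w - xi) ** 2 for w, xi in zip(node, x)) for node in weights]
    bmu = dists.index(min(dists))
    for j, node in enumerate(weights):
        h = math.exp(-((j - bmu) ** 2) / (2 * sigma ** 2))
        weights[j] = [w + lr * h * (xi - w) for w, xi in zip(node, x)]
    return bmu

weights = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]]
bmu = som_update(weights, [1.1, 0.9])
assert bmu == 1   # the middle node is closest to the input and moves toward it
```

In practice both lr and sigma decay over training, so the map first organizes globally and then fine-tunes locally.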
[AI-45] Addressing Quality Challenges in Deep Learning: The Role of MLOps and Domain Knowledge
Link: https://arxiv.org/abs/2501.08402
Authors: Santiago del Rey, Adrià Medina, Xavier Franch, Silverio Martínez-Fernández
Subjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
*Comments: 6 pages, 1 figure, accepted to the 4th International Conference on AI Engineering - Software Engineering for AI (CAIN)
Abstract:Deep learning (DL) systems present unique challenges in software engineering, especially concerning quality attributes like correctness and resource efficiency. While DL models achieve exceptional performance in specific tasks, engineering DL-based systems is still essential. The effort, cost, and potential diminishing returns of continual improvements must be carefully evaluated, as software engineers often face the critical decision of when to stop refining a system relative to its quality attributes. This experience paper explores the role of MLOps practices – such as monitoring and experiment tracking – in creating transparent and reproducible experimentation environments that enable teams to assess and justify the impact of design decisions on quality attributes. Furthermore, we report on experiences addressing the quality challenges by embedding domain knowledge into the design of a DL model and its integration within a larger system. The findings offer actionable insights into not only the benefits of domain knowledge and MLOps but also the strategic consideration of when to limit further optimizations in DL projects to maximize overall system quality and reliability.
[AI-46] Operator Learning for Reconstructing Flow Fields from Sparse Measurements: an Energy Transformer Approach
Link: https://arxiv.org/abs/2501.08339
Authors: Qian Zhang, Dmitry Krotov, George Em Karniadakis
Subjects: Fluid Dynamics (physics.flu-dyn); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE)
*Comments:
Abstract:Machine learning methods have shown great success in various scientific areas, including fluid mechanics. However, reconstruction problems, where full velocity fields must be recovered from partial observations, remain challenging. In this paper, we propose a novel operator learning framework for solving reconstruction problems by using the Energy Transformer (ET), an architecture inspired by associative memory models. We formulate reconstruction as a mapping from incomplete observed data to full reconstructed fields. The method is validated on three fluid mechanics examples using diverse types of data: (1) unsteady 2D vortex street in flow past a cylinder using simulation data; (2) high-speed under-expanded impinging supersonic jets impingement using Schlieren imaging; and (3) 3D turbulent jet flow using particle tracking. The results demonstrate the ability of ET to accurately reconstruct complex flow fields from highly incomplete data (90% missing), even for noisy experimental measurements, with fast training and inference on a single GPU. This work provides a promising new direction for tackling reconstruction problems in fluid mechanics and other areas in mechanics, geophysics, weather prediction, and beyond.
Machine Learning
[LG-0] Improving Stability Estimates in Adversarial Explainable AI through Alternate Search Methods
Link: https://arxiv.org/abs/2501.09006
Authors: Christopher Burger, Charles Walter
Subjects: Machine Learning (cs.LG)
*Comments: 9 pages, 3 figures, 5 tables. arXiv admin note: text overlap with arXiv:2406.15839
Abstract:Advances in the effectiveness of machine learning models have come at the cost of enormous complexity resulting in a poor understanding of how they function. Local surrogate methods have been used to approximate the workings of these complex models, but recent work has revealed their vulnerability to adversarial attacks where the explanation produced is appreciably different while the meaning and structure of the complex model’s output remains similar. This prior work has focused on the existence of these weaknesses but not on their magnitude. Here we explore using an alternate search method with the goal of finding minimum viable perturbations, the fewest perturbations necessary to achieve a fixed similarity value between the original and altered text’s explanation. Intuitively, a method that requires fewer perturbations to expose a given level of instability is inferior to one which requires more. This nuance allows for superior comparisons of the stability of explainability methods.
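The "minimum viable perturbations" idea can be made concrete with a toy search over feature-importance explanations. Using top-k overlap as the similarity measure and zeroing features as the perturbation are assumptions for illustration; the paper's perturbations act on text:

```python
def top_k(importances, k=3):
    # The k most important features of an explanation.
    return set(sorted(importances, key=importances.get, reverse=True)[:k])

def min_viable_perturbations(importances, perturb_order, threshold, k=3):
    # Count the fewest perturbations needed to push top-k overlap with the
    # original explanation below `threshold`; fewer means less stable.
    base = top_k(importances, k)
    current = dict(importances)
    for n, feature in enumerate(perturb_order, start=1):
        current[feature] = 0.0   # toy perturbation: zero out one feature
        if len(base & top_k(current, k)) / k < threshold:
            return n
    return None

imps = {'good': 5.0, 'great': 4.0, 'fine': 3.0, 'ok': 2.0, 'meh': 1.0}
n = min_viable_perturbations(imps, ['good', 'great', 'fine'], threshold=0.5)
assert n == 2
```

Under this framing, an explainability method whose explanations flip after only one or two perturbations is less stable than one that resists many.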
[LG-1] VECT-GAN: A variationally encoded generative model for overcoming data scarcity in pharmaceutical science
Link: https://arxiv.org/abs/2501.08995
Authors: Youssef Abdalla, Marrisa Taub, Eleanor Hilton, Priya Akkaraju, Alexander Milanovic, Mine Orlu, Abdul W. Basit, Michael T Cook, Tapabrata Chakraborty, David Shorthouse
Subjects: Machine Learning (cs.LG)
*Comments: 30 pages, 6 primary figures, 3 supplementary figures
Abstract:Data scarcity in pharmaceutical research has led to reliance on labour-intensive trial and error approaches for development rather than data driven methods. While Machine Learning offers a solution, existing datasets are often small and noisy, limiting their utility. To address this, we developed a Variationally Encoded Conditional Tabular Generative Adversarial Network (VECT GAN), a novel generative model specifically designed for augmenting small, noisy datasets. We introduce a pipeline where data is augmented before regression model development and demonstrate that this consistently and significantly improves performance over other state of the art tabular generative models. We apply this pipeline across six pharmaceutical datasets, and highlight its real-world applicability by developing novel polymers with medically desirable mucoadhesive properties, which we made and experimentally characterised. Additionally, we pre-train the model on the ChEMBL database of drug-like molecules, leveraging knowledge distillation to enhance its generalisability, making it readily available for use on pharmaceutical datasets containing small molecules, which is an extremely common pharmaceutical task. We demonstrate the power of synthetic data for regularising small tabular datasets, highlighting its potential to become standard practice in pharmaceutical model development, and make our method, including VECT GAN pretrained on ChEMBL available as a pip package.
[LG-2] Training-Aware Risk Control for Intensity Modulated Radiation Therapies Quality Assurance with Conformal Prediction ALT
Link: https://arxiv.org/abs/2501.08963
Authors: Kevin He, David Adam, Sarah Han-Oh, Anqi Liu
Subjects: Machine Learning (cs.LG)
*Comments: 2024 Machine Learning for Health Symposium
Abstract:Measurement quality assurance (QA) practices play a key role in the safe use of Intensity Modulated Radiation Therapies (IMRT) for cancer treatment. These practices have reduced measurement-based IMRT QA failure below 1%. However, these practices are time and labor intensive which can lead to delays in patient care. In this study, we examine how conformal prediction methodologies can be used to robustly triage plans. We propose a new training-aware conformal risk control method by combining the benefit of conformal risk control and conformal training. We incorporate the decision making thresholds based on the gamma passing rate, along with the risk functions used in clinical evaluation, into the design of the risk control framework. Our method achieves high sensitivity and specificity and significantly reduces the number of plans needing measurement without generating a huge confidence interval. Our results demonstrate the validity and applicability of conformal prediction methods for improving efficiency and reducing the workload of the IMRT QA process.
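The split-conformal calibration step that underlies such risk-control pipelines can be sketched in a few lines. This is a generic illustration with our own variable names, not the paper's training-aware procedure with clinical risk functions:

```python
import numpy as np

def conformal_threshold(cal_scores, alpha=0.1):
    """Split-conformal threshold: the ceil((n+1)(1-alpha))/n empirical
    quantile of the calibration nonconformity scores. New scores fall
    below this threshold with probability at least 1 - alpha."""
    n = len(cal_scores)
    q = min(np.ceil((n + 1) * (1 - alpha)) / n, 1.0)
    return np.quantile(cal_scores, q, method="higher")

# toy nonconformity scores standing in for |predicted - measured| QA errors
scores = np.abs(np.random.default_rng(0).normal(size=500))
tau = conformal_threshold(scores, alpha=0.1)
print(float(np.mean(scores <= tau)))  # empirical coverage, >= 0.9
```

Plans whose nonconformity score exceeds `tau` would then be triaged to full measurement-based QA.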
[LG-3] Computing Approximated Fixpoints via Dampened Mann Iteration
Link: https://arxiv.org/abs/2501.08950
Authors: Paolo Baldan, Sebastian Gurke, Barbara König, Tommaso Padoan, Florian Wittbold
Subjects: Logic in Computer Science (cs.LO); Machine Learning (cs.LG)
*Comments:
Abstract:Fixpoints are ubiquitous in computer science and when dealing with quantitative semantics and verification one is commonly led to consider least fixpoints of (higher-dimensional) functions over the nonnegative reals. We show how to approximate the least fixpoint of such functions, focusing on the case in which they are not known precisely, but represented by a sequence of approximating functions that converge to them. We concentrate on monotone and non-expansive functions, for which uniqueness of fixpoints is not guaranteed and standard fixpoint iteration schemes might get stuck at a fixpoint that is not the least. Our main contribution is the identification of an iteration scheme, a variation of Mann iteration with a dampening factor, which, under suitable conditions, is shown to guarantee convergence to the least fixpoint of the function of interest. We then argue that these results are relevant in the context of model-based reinforcement learning for Markov decision processes (MDPs), showing that the proposed iteration scheme instantiates to MDPs and allows us to derive convergence to the optimal expected return. More generally, we show that our results can be used to iterate to the least fixpoint almost surely for systems where the function of interest can be approximated with given probabilistic error bounds, as it happens for probabilistic systems, such as simple stochastic games, that can be explored via sampling.
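As a toy illustration of why a dampening factor helps, consider a monotone, non-expansive map whose fixpoints form a whole interval: plain Mann iteration stalls at whichever fixpoint it starts from, while a dampened variant drifts down to the least one. This is a sketch under our own simplifications, not the paper's exact scheme or its conditions on the dampening sequence:

```python
def dampened_mann(f, x0, lam=0.5, delta=0.99, n_iter=2000):
    """Mann iteration with a dampening factor:
    x_{k+1} = delta * ((1 - lam) * x_k + lam * f(x_k)).
    With delta = 1 this is plain Mann iteration."""
    x = x0
    for _ in range(n_iter):
        x = delta * ((1 - lam) * x + lam * f(x))
    return x

# f is monotone and non-expansive on the nonnegative reals, and every
# x <= 1 is a fixpoint; the least fixpoint is 0.
f = lambda x: min(x, 1.0)

print(dampened_mann(f, 0.5, delta=1.0))   # plain iteration: stuck at 0.5
print(dampened_mann(f, 0.5, delta=0.99))  # dampened: approaches 0
```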
[LG-4] A Reinforcement Learning Approach to Quiet and Safe UAM Traffic Management
Link: https://arxiv.org/abs/2501.08941
Authors: Surya Murthy, John-Paul Clarke, Ufuk Topcu, Zhenyu Gao
Subjects: Multiagent Systems (cs.MA); Machine Learning (cs.LG); Robotics (cs.RO)
*Comments: Paper presented at SciTech 2025
Abstract:Urban air mobility (UAM) is a transformative system that operates various small aerial vehicles in urban environments to reshape urban transportation. However, integrating UAM into existing urban environments presents a variety of complex challenges. Recent analyses of UAM’s operational constraints highlight aircraft noise and system safety as key hurdles to UAM system implementation. Future UAM air traffic management schemes must ensure that the system is both quiet and safe. We propose a multi-agent reinforcement learning approach to manage UAM traffic, aiming at both vertical separation assurance and noise mitigation. Through extensive training, the reinforcement learning agent learns to balance the two primary objectives by employing altitude adjustments in a multi-layer UAM network. The results reveal the tradeoffs among noise impact, traffic congestion, and separation. Overall, our findings demonstrate the potential of reinforcement learning in mitigating UAM’s noise impact while maintaining safe separation using altitude adjustments.
[LG-5] A Two-Stage Pretraining-Finetuning Framework for Treatment Effect Estimation with Unmeasured Confounding KDD25
Link: https://arxiv.org/abs/2501.08888
Authors: Chuan Zhou, Yaxuan Li, Chunyuan Zheng, Haiteng Zhang, Min Zhang, Haoxuan Li, Mingming Gong
Subjects: Machine Learning (cs.LG)
*Comments: KDD 25 Research Track
Abstract:Estimating the conditional average treatment effect (CATE) from observational data plays a crucial role in areas such as e-commerce, healthcare, and economics. Existing studies mainly rely on the strong ignorability assumption that there are no unmeasured confounders, whose presence cannot be tested from observational data and can invalidate any causal conclusion. In contrast, data collected from randomized controlled trials (RCT) do not suffer from confounding, but are usually limited by a small sample size. In this paper, we propose a two-stage pretraining-finetuning (TSPF) framework using both large-scale observational data and small-scale RCT data to estimate the CATE in the presence of unmeasured confounding. In the first stage, a foundational representation of covariates is trained to estimate counterfactual outcomes through large-scale observational data. In the second stage, we propose to train an augmented representation of the covariates, which is concatenated to the foundational representation obtained in the first stage to adjust for the unmeasured confounding. To avoid overfitting caused by the small-scale RCT data in the second stage, we further propose a partial parameter initialization approach, rather than training a separate network. The superiority of our approach is validated on two public datasets with extensive experiments. The code is available at this https URL.
[LG-6] PAC Learnability of Scenario Decision-Making Algorithms: Necessary and Sufficient Conditions
Link: https://arxiv.org/abs/2501.08887
Authors: Guillaume O. Berger, Raphaël M. Jungers
Subjects: Machine Learning (cs.LG); Optimization and Control (math.OC)
*Comments:
Abstract:We study the PAC property of scenario decision-making algorithms, that is, the ability to make a decision that has an arbitrarily low risk of violating an unknown safety constraint, provided sufficiently many realizations (called scenarios) of the safety constraint are sampled. Sufficient conditions for scenario decision-making algorithms to be PAC are available in the literature, such as finiteness of the VC dimension of its associated classifier and existence of a compression scheme. We study the question of whether these sufficient conditions are also necessary. We show with counterexamples that this is not the case in general. This contrasts with binary classification learning, for which the analogous conditions are sufficient and necessary. Popular scenario decision-making algorithms, such as scenario optimization, enjoy additional properties, such as stability and consistency. We show that even under these additional assumptions the above conclusions hold. Finally, we derive a necessary condition for scenario decision-making algorithms to be PAC, inspired by the VC dimension and the so-called no-free-lunch theorem.
[LG-7] Increasing Batch Size Improves Convergence of Stochastic Gradient Descent with Momentum
Link: https://arxiv.org/abs/2501.08883
Authors: Keisuke Kamo, Hideaki Iiduka
Subjects: Machine Learning (cs.LG)
*Comments: 22 pages
Abstract:Stochastic gradient descent with momentum (SGDM), which is defined by adding a momentum term to SGD, has been well studied in both theory and practice. Theoretically investigated results showed that the settings of the learning rate and momentum weight affect the convergence of SGDM. Meanwhile, practical results showed that the setting of batch size strongly depends on the performance of SGDM. In this paper, we focus on mini-batch SGDM with constant learning rate and constant momentum weight, which is frequently used to train deep neural networks in practice. The contribution of this paper is showing theoretically that using a constant batch size does not always minimize the expectation of the full gradient norm of the empirical loss in training a deep neural network, whereas using an increasing batch size definitely minimizes it, that is, increasing batch size improves convergence of mini-batch SGDM. We also provide numerical results supporting our analyses, indicating specifically that mini-batch SGDM with an increasing batch size converges to stationary points faster than with a constant batch size. Python implementations of the optimizers used in the numerical experiments are available at this https URL.
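A minimal NumPy sketch of mini-batch SGDM with an increasing batch-size schedule, on a toy least-squares problem (our own illustration of the setting studied, not the paper's experimental setup; function and variable names are assumptions):

```python
import numpy as np

def sgdm(grad_fn, data, batch_sizes, lr=0.05, beta=0.9, seed=0):
    """Mini-batch SGD with momentum (constant learning rate and constant
    momentum weight); batch_sizes lists the batch size used in each
    epoch, so an increasing schedule can be passed directly."""
    rng = np.random.default_rng(seed)
    w, v = 0.0, 0.0
    for b in batch_sizes:
        rng.shuffle(data)
        for i in range(0, len(data), b):
            g = grad_fn(w, data[i:i + b])
            v = beta * v + g          # momentum accumulation
            w = w - lr * v
    return w

# toy objective: mean over samples of (w - x_i)^2, minimised at the sample mean
data = np.random.default_rng(1).normal(loc=3.0, scale=1.0, size=256)
grad = lambda w, batch: 2.0 * np.mean(w - batch)

increasing = [8, 8, 16, 16, 32, 32, 64, 64, 128, 128]  # batch size grows
w_star = sgdm(grad, data.copy(), increasing)
print(w_star)  # close to data.mean() ~ 3
```

The growing batch size shrinks the gradient-noise variance over the course of training, which is the mechanism behind the paper's convergence claim.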
[LG-8] A Closer Look at the Learnability of Out-of-Distribution (OOD) Detection
Link: https://arxiv.org/abs/2501.08821
Authors: Konstantin Garov, Kamalika Chaudhuri
Subjects: Machine Learning (cs.LG)
*Comments:
Abstract:Machine learning algorithms often encounter different or “out-of-distribution” (OOD) data at deployment time, and OOD detection is frequently employed to detect these examples. While it works reasonably well in practice, existing theoretical results on OOD detection are highly pessimistic. In this work, we take a closer look at this problem, and make a distinction between uniform and non-uniform learnability, following PAC learning theory. We characterize under what conditions OOD detection is uniformly and non-uniformly learnable, and we show that in several cases, non-uniform learnability turns a number of negative results into positive. In all cases where OOD detection is learnable, we provide concrete learning algorithms and a sample-complexity analysis.
[LG-9] Deep learning for temporal super-resolution 4D Flow MRI
Link: https://arxiv.org/abs/2501.08780
Authors: Pia Callmer, Mia Bonini, Edward Ferdian, David Nordsletten, Daniel Giese, Alistair A. Young, Alexander Fyrdahl, David Marlevi
Subjects: Machine Learning (cs.LG)
*Comments: 12 pages, 8 figures
Abstract:4D Flow Magnetic Resonance Imaging (4D Flow MRI) is a non-invasive technique for volumetric, time-resolved blood flow quantification. However, apparent trade-offs between acquisition time, image noise, and resolution limit clinical applicability. In particular, in regions of highly transient flow, coarse temporal resolution can hinder accurate capture of physiologically relevant flow variations. To overcome these issues, post-processing techniques using deep learning have shown promising results to enhance resolution post-scan using so-called super-resolution networks. However, while super-resolution has been focusing on spatial upsampling, temporal super-resolution remains largely unexplored. The aim of this study was therefore to implement and evaluate a residual network for temporal super-resolution 4D Flow MRI. To achieve this, an existing spatial network (4DFlowNet) was re-designed for temporal upsampling, adapting input dimensions, and optimizing internal layer structures. Training and testing were performed using synthetic 4D Flow MRI data originating from patient-specific in-silico models, as well as using in-vivo datasets. Overall, excellent performance was achieved with input velocities effectively denoised and temporally upsampled, with a mean absolute error (MAE) of 1.0 cm/s in an unseen in-silico setting, outperforming deterministic alternatives (linear interpolation MAE = 2.3 cm/s, sinc interpolation MAE = 2.6 cm/s). Further, the network synthesized high-resolution temporal information from unseen low-resolution in-vivo data, with strong correlation observed at peak flow frames. As such, our results highlight the potential of utilizing data-driven neural networks for temporal super-resolution 4D Flow MRI, enabling high-frame-rate flow quantification without extending acquisition times beyond clinically acceptable limits.
[LG-10] MeshMask: Physics-Based Simulations with Masked Graph Neural Networks
Link: https://arxiv.org/abs/2501.08738
Authors: Paul Garnier, Vincent Lannelongue, Jonathan Viquerat, Elie Hachem
Subjects: Machine Learning (cs.LG); Fluid Dynamics (physics.flu-dyn)
*Comments:
Abstract:We introduce a novel masked pre-training technique for graph neural networks (GNNs) applied to computational fluid dynamics (CFD) problems. By randomly masking up to 40% of input mesh nodes during pre-training, we force the model to learn robust representations of complex fluid dynamics. We pair this masking strategy with an asymmetric encoder-decoder architecture and gated multi-layer perceptrons to further enhance performance. The proposed method achieves state-of-the-art results on seven CFD datasets, including a new challenging dataset of 3D intracranial aneurysm simulations with over 250,000 nodes per mesh. Moreover, it significantly improves model performance and training efficiency across such diverse range of fluid simulation tasks. We demonstrate improvements of up to 60% in long-term prediction accuracy compared to previous best models, while maintaining similar computational costs. Notably, our approach enables effective pre-training on multiple datasets simultaneously, significantly reducing the time and data required to achieve high performance on new tasks. Through extensive ablation studies, we provide insights into the optimal masking ratio, architectural choices, and training strategies.
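The node-masking corruption used during pre-training can be sketched as follows (an assumed minimal form of the masking step; the paper pairs it with an asymmetric encoder-decoder that reconstructs the masked nodes):

```python
import numpy as np

def mask_nodes(node_features, mask_ratio=0.4, seed=0):
    """Randomly zero out the features of a fraction of mesh nodes and
    return the corrupted features plus the boolean mask; the GNN is
    then trained to reconstruct the masked nodes."""
    rng = np.random.default_rng(seed)
    n = node_features.shape[0]
    masked_idx = rng.choice(n, size=int(mask_ratio * n), replace=False)
    mask = np.zeros(n, dtype=bool)
    mask[masked_idx] = True
    corrupted = node_features.copy()
    corrupted[mask] = 0.0
    return corrupted, mask

x = np.random.default_rng(1).normal(size=(1000, 3))  # 1000 nodes, 3 features each
x_masked, mask = mask_nodes(x)
print(mask.sum())  # 400 of 1000 nodes masked at the 40% ratio
```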
[LG-11] Resource-Constrained Federated Continual Learning: What Does Matter?
Link: https://arxiv.org/abs/2501.08737
Authors: Yichen Li, Yuying Wang, Jiahua Dong, Haozhao Wang, Yining Qi, Rui Zhang, Ruixuan Li
Subjects: Machine Learning (cs.LG)
*Comments: arXiv admin note: text overlap with arXiv:2303.11165 by other authors
Abstract:Federated Continual Learning (FCL) aims to enable sequentially privacy-preserving model training on streams of incoming data that vary in edge devices by preserving previous knowledge while adapting to new data. Current FCL literature focuses on restricted data privacy and access to previously seen data while imposing no constraints on the training overhead. This is unreasonable for FCL applications in real-world scenarios, where edge devices are primarily constrained by resources such as storage, computational budget, and label rate. We revisit this problem with a large-scale benchmark and analyze the performance of state-of-the-art FCL approaches under different resource-constrained settings. Various typical FCL techniques and six datasets in two incremental learning scenarios (Class-IL and Domain-IL) are involved in our experiments. Through extensive experiments amounting to a total of over 1,000+ GPU hours, we find that, under limited resource-constrained settings, existing FCL approaches, with no exception, fail to achieve the expected performance. Our conclusions are consistent in the sensitivity analysis. This suggests that most existing FCL methods are particularly too resource-dependent for real-world deployment. Moreover, we study the performance of typical FCL techniques with resource constraints and shed light on future research directions in FCL.
[LG-12] GRAPPA - A Hybrid Graph Neural Network for Predicting Pure Component Vapor Pressures
Link: https://arxiv.org/abs/2501.08729
Authors: Marco Hoffmann, Hans Hasse, Fabian Jirasek
Subjects: Machine Learning (cs.LG); Computational Engineering, Finance, and Science (cs.CE)
*Comments: 38 pages, 12 figures
Abstract:Although the pure component vapor pressure is one of the most important properties for designing chemical processes, no broadly applicable, sufficiently accurate, and open-source prediction method has been available. To overcome this, we have developed GRAPPA - a hybrid graph neural network for predicting vapor pressures of pure components. GRAPPA enables the prediction of the vapor pressure curve of basically any organic molecule, requiring only the molecular structure as input. The new model consists of three parts: A graph attention network for the message passing step, a pooling function that captures long-range interactions, and a prediction head that yields the component-specific parameters of the Antoine equation, from which the vapor pressure can readily and consistently be calculated for any temperature. We have trained and evaluated GRAPPA on experimental vapor pressure data of almost 25,000 pure components. We found excellent prediction accuracy for unseen components, outperforming state-of-the-art group contribution methods and other machine learning approaches in applicability and accuracy. The trained model and its code are fully disclosed, and GRAPPA is directly applicable via the interactive website this http URL.
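Once the prediction head has produced the component-specific Antoine parameters, the vapor pressure at any temperature follows directly from the Antoine equation. A small sketch, using standard tabulated constants for water as a stand-in for predicted parameters:

```python
def antoine_pressure(A, B, C, T):
    """Antoine equation: log10(p) = A - B / (C + T).
    Units follow the parameter set used (here T in deg C, p in mmHg)."""
    return 10.0 ** (A - B / (C + T))

# Antoine constants for water, valid roughly 1-100 deg C (standard
# tabulated values; GRAPPA would instead predict A, B, C from the
# molecular graph of an arbitrary organic component)
p = antoine_pressure(8.07131, 1730.63, 233.426, T=100.0)
print(round(p, 1))  # ~760 mmHg: atmospheric pressure at the normal boiling point
```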
[LG-13] Transformed Low-rank Adaptation via Tensor Decomposition and Its Applications to Text-to-image Models
Link: https://arxiv.org/abs/2501.08727
Authors: Zerui Tao, Yuhta Takida, Naoki Murata, Qibin Zhao, Yuki Mitsufuji
Subjects: Machine Learning (cs.LG)
*Comments:
Abstract:Parameter-Efficient Fine-Tuning (PEFT) of text-to-image models has become an increasingly popular technique with many applications. Among the various PEFT methods, Low-Rank Adaptation (LoRA) and its variants have gained significant attention due to their effectiveness, enabling users to fine-tune models with limited computational resources. However, the approximation gap between the low-rank assumption and desired fine-tuning weights prevents the simultaneous acquisition of ultra-parameter-efficiency and better performance. To reduce this gap and further improve the power of LoRA, we propose a new PEFT method that combines two classes of adaptations, namely, transform and residual adaptations. In specific, we first apply a full-rank and dense transform to the pre-trained weight. This learnable transform is expected to align the pre-trained weight as closely as possible to the desired weight, thereby reducing the rank of the residual weight. Then, the residual part can be effectively approximated by more compact and parameter-efficient structures, with a smaller approximation error. To achieve ultra-parameter-efficiency in practice, we design highly flexible and effective tensor decompositions for both the transform and residual adaptations. Additionally, popular PEFT methods such as DoRA can be summarized under this transform plus residual adaptation scheme. Experiments are conducted on fine-tuning Stable Diffusion models in subject-driven and controllable generation. The results manifest that our method can achieve better performances and parameter efficiency compared to LoRA and several baselines.
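The transform-plus-residual idea can be written in one line of linear algebra: apply a learnable dense transform to the frozen pre-trained weight, then add a low-rank residual. A minimal NumPy sketch of the general scheme (the paper additionally compresses both parts with tensor decompositions, which is not shown):

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 64, 4                      # weight dimension and residual rank, r << d

W0 = rng.normal(size=(d, d))      # frozen pre-trained weight
T = np.eye(d) + 0.01 * rng.normal(size=(d, d))  # learnable full-rank transform
B = rng.normal(size=(d, r))       # learnable low-rank residual factors
A = rng.normal(size=(r, d))

# transform-plus-residual adaptation of the pre-trained weight;
# with T fixed to the identity this reduces to vanilla LoRA (W0 + B @ A)
W_adapted = T @ W0 + B @ A
print(W_adapted.shape, np.linalg.matrix_rank(B @ A))
```

Aligning `T @ W0` with the desired fine-tuned weight is what lets the residual `B @ A` get away with a very small rank `r`.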
[LG-14] Disentangled Interleaving Variational Encoding
Link: https://arxiv.org/abs/2501.08710
Authors: Noelle Y. L. Wong, Eng Yeow Cheu, Zhonglin Chiam
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
*Comments:
Abstract:Conflicting objectives present a considerable challenge in interleaving multi-task learning, necessitating the need for meticulous design and balance to ensure effective learning of a representative latent data space across all tasks without mutual negative impact. Drawing inspiration from the concept of marginal and conditional probability distributions in probability theory, we design a principled and well-founded approach to disentangle the original input into marginal and conditional probability distributions in the latent space of a variational autoencoder. Our proposed model, Deep Disentangled Interleaving Variational Encoding (DeepDIVE) learns disentangled features from the original input to form clusters in the embedding space and unifies these features via the cross-attention mechanism in the fusion stage. We theoretically prove that combining the objectives for reconstruction and forecasting fully captures the lower bound and mathematically derive a loss function for disentanglement using Naïve Bayes. Under the assumption that the prior is a mixture of log-concave distributions, we also establish that the Kullback-Leibler divergence between the prior and the posterior is upper bounded by a function minimized by the minimizer of the cross entropy loss, informing our adoption of radial basis functions (RBF) and cross entropy with interleaving training for DeepDIVE to provide a justified basis for convergence. Experiments on two public datasets show that DeepDIVE disentangles the original input and yields forecast accuracies better than the original VAE and comparable to existing state-of-the-art baselines.
[LG-15] Diagonal Over-parameterization in Reproducing Kernel Hilbert Spaces as an Adaptive Feature Model: Generalization and Adaptivity
Link: https://arxiv.org/abs/2501.08679
Authors: Yicheng Li, Qian Lin
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
*Comments: arXiv admin note: text overlap with arXiv:2409.00894
Abstract:This paper introduces a diagonal adaptive kernel model that dynamically learns kernel eigenvalues and output coefficients simultaneously during training. Unlike fixed-kernel methods tied to the neural tangent kernel theory, the diagonal adaptive kernel model adapts to the structure of the truth function, significantly improving generalization over fixed-kernel methods, especially when the initial kernel is misaligned with the target. Moreover, we show that the adaptivity comes from learning the right eigenvalues during training, showing a feature learning behavior. By extending to deeper parameterization, we further show how extra depth enhances adaptability and generalization. This study combines the insights from feature learning and implicit regularization and provides new perspective into the adaptivity and generalization potential of neural networks beyond the kernel regime.
[LG-16] Investigating Parameter-Efficiency of Hybrid QuGANs Based on Geometric Properties of Generated Sea Route Graphs
Link: https://arxiv.org/abs/2501.08678
Authors: Tobias Rohe, Florian Burger, Michael Kölle, Sebastian Wölckert, Maximilian Zorn, Claudia Linnhoff-Popien
Subjects: Machine Learning (cs.LG); Quantum Physics (quant-ph)
*Comments:
Abstract:The demand for artificially generated data for the development, training and testing of new algorithms is omnipresent. Quantum computing (QC), does offer the hope that its inherent probabilistic functionality can be utilised in this field of generative artificial intelligence. In this study, we use quantum-classical hybrid generative adversarial networks (QuGANs) to artificially generate graphs of shipping routes. We create a training dataset based on real shipping data and investigate to what extent QuGANs are able to learn and reproduce inherent distributions and geometric features of this data. We compare hybrid QuGANs with classical Generative Adversarial Networks (GANs), with a special focus on their parameter efficiency. Our results indicate that QuGANs are indeed able to quickly learn and represent underlying geometric properties and distributions, although they seem to have difficulties in introducing variance into the sampled data. Compared to classical GANs of greater size, measured in the number of parameters used, some QuGANs show similar result quality. Our reference to concrete use cases, such as the generation of shipping data, provides an illustrative example and demonstrate the potential and diversity in which QC can be used.
[LG-17] Quantum Reservoir Computing and Risk Bounds
Link: https://arxiv.org/abs/2501.08640
Authors: Naomi Mona Chmielewski (L2S), Nina Amini (L2S, CNRS), Joseph Mikael
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
*Comments:
Abstract:We propose a way to bound the generalisation errors of several classes of quantum reservoirs using the Rademacher complexity. We give specific, parameter-dependent bounds for two particular quantum reservoir classes. We analyse how the generalisation bounds scale with growing numbers of qubits. Applying our results to classes with polynomial readout functions, we find that the risk bounds converge in the number of training samples. The explicit dependence on the quantum reservoir and readout parameters in our bounds can be used to control the generalisation error to a certain extent. It should be noted that the bounds scale exponentially with the number of qubits n. The upper bounds on the Rademacher complexity can be applied to other reservoir classes that fulfill a few hypotheses on the quantum dynamics and the readout function.
[LG-18] Transformer-based Multivariate Time Series Anomaly Localization
Link: https://arxiv.org/abs/2501.08628
Authors: Charalampos Shimillas, Kleanthis Malialis, Konstantinos Fokianos, Marios M. Polycarpou
Subjects: Machine Learning (cs.LG)
*Comments:
Abstract:With the growing complexity of Cyber-Physical Systems (CPS) and the integration of Internet of Things (IoT), the use of sensors for online monitoring generates large volume of multivariate time series (MTS) data. Consequently, the need for robust anomaly diagnosis in MTS is paramount to maintaining system reliability and safety. While significant advancements have been made in anomaly detection, localization remains a largely underexplored area, though crucial for intelligent decision-making. This paper introduces a novel transformer-based model for unsupervised anomaly diagnosis in MTS, with a focus on improving localization performance, through an in-depth analysis of the self-attention mechanism’s learning behavior under both normal and anomalous conditions. We formulate the anomaly localization problem as a three-stage process: time-step, window, and segment-based. This leads to the development of the Space-Time Anomaly Score (STAS), a new metric inspired by the connection between transformer latent representations and space-time statistical models. STAS is designed to capture individual anomaly behaviors and inter-series dependencies, delivering enhanced localization performance. Additionally, the Statistical Feature Anomaly Score (SFAS) complements STAS by analyzing statistical features around anomalies, with their combination helping to reduce false alarms. Experiments on real world and synthetic datasets illustrate the model’s superiority over state-of-the-art methods in both detection and localization tasks.
[LG-19] A Learning Algorithm That Attains the Human Optimum in a Repeated Human-Machine Interaction Game
Link: https://arxiv.org/abs/2501.08626
Authors: Jason T. Isa, Lillian J. Ratliff, Samuel A. Burden
Subjects: Computer Science and Game Theory (cs.GT); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
*Comments:
Abstract:When humans interact with learning-based control systems, a common goal is to minimize a cost function known only to the human. For instance, an exoskeleton may adapt its assistance in an effort to minimize the human’s metabolic cost-of-transport. Conventional approaches to synthesizing the learning algorithm solve an inverse problem to infer the human’s cost. However, these problems can be ill-posed, hard to solve, or sensitive to problem data. Here we show a game-theoretic learning algorithm that works solely by observing human actions to find the cost minimum, avoiding the need to solve an inverse problem. We evaluate the performance of our algorithm in an extensive set of human subjects experiments, demonstrating consistent convergence to the minimum of a prescribed human cost function in scalar and multidimensional instantiations of the game. We conclude by outlining future directions for theoretical and empirical extensions of our results.
[LG-20] CT-PatchTST: Channel-Time Patch Time-Series Transformer for Long-Term Renewable Energy Forecasting
Link: https://arxiv.org/abs/2501.08620
Authors: Menghao Huo, Kuan Lu, Yuxiao Li, Qiang Zhu
Subjects: Machine Learning (cs.LG)
*Comments:
Abstract:Accurately predicting renewable energy output is crucial for the efficient integration of solar and wind power into modern energy systems. This study develops and evaluates an advanced deep learning model, Channel-Time Patch Time-Series Transformer (CT-PatchTST), to forecast the power output of photovoltaic and wind energy systems using annual offshore wind power, onshore wind power, and solar power generation data from Denmark. While the original Patch Time-Series Transformer(PatchTST) model employs a channel-independent (CI) approach, it tends to overlook inter-channel relationships during training, potentially leading to a loss of critical information. To address this limitation and further leverage the benefits of increased data granularity brought by CI, we propose CT-PatchTST. This enhanced model improves the processing of inter-channel information while maintaining the advantages of the channel-independent approach. The predictive performance of CT-PatchTST is rigorously analyzed, demonstrating its ability to provide precise and reliable energy forecasts. This work contributes to improving the predictability of renewable energy systems, supporting their broader adoption and integration into energy grids.
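The per-channel patching that PatchTST-style models apply before the transformer can be sketched as follows (our own minimal illustration; CT-PatchTST's contribution is the additional channel-time mixing, which is not shown):

```python
import numpy as np

def make_patches(series, patch_len=16, stride=8):
    """Split a (channels, time) array into overlapping per-channel
    patches, the tokenisation step used by PatchTST-style models."""
    n_channels, t = series.shape
    starts = range(0, t - patch_len + 1, stride)
    return np.stack([series[:, s:s + patch_len] for s in starts], axis=1)

# e.g. three channels: solar, onshore wind, offshore wind generation
x = np.random.default_rng(0).normal(size=(3, 96))
patches = make_patches(x)
print(patches.shape)  # (3, 11, 16): channels x patches x patch length
```

Each patch then becomes one input token, so attention operates over patches rather than individual time steps.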
[LG-21] Towards Aligned Data Forgetting via Twin Machine Unlearning
Link: https://arxiv.org/abs/2501.08615
Authors: Zhenxing Niu, Haoxuan Ji, Yuyao Sun, Zheng Lin, Fei Gao, Yuhang Wang, Haichao Gao
Subjects: Machine Learning (cs.LG)
*Comments: arXiv admin note: substantial text overlap with arXiv:2408.11433
Abstract:Modern privacy regulations have spurred the evolution of machine unlearning, a technique enabling a trained model to efficiently forget specific training data. In prior unlearning methods, the concept of “data forgetting” is often interpreted and implemented as achieving zero classification accuracy on such data. Nevertheless, the authentic aim of machine unlearning is to achieve alignment between the unlearned model and the gold model, i.e., encouraging them to have identical classification accuracy. On the other hand, the gold model often exhibits non-zero classification accuracy due to its generalization ability. To achieve aligned data forgetting, we propose a Twin Machine Unlearning (TMU) approach, where a twin unlearning problem is defined corresponding to the original unlearning problem. Consequently, the generalization-label predictor trained on the twin problem can be transferred to the original problem, facilitating aligned data forgetting. Comprehensive empirical experiments illustrate that our approach significantly enhances the alignment between the unlearned model and the gold model.
[LG-22] Neural Risk-sensitive Satisficing in Contextual Bandits
链接: https://arxiv.org/abs/2501.08612
作者: Shogo Ito,Tatsuji Takahashi,Yu Kono
类目: Machine Learning (cs.LG)
*备注: Accepted by AROB-ISBC 2025
点击查看摘要
Abstract:The contextual bandit problem, a type of reinforcement learning task, provides an effective framework for solving challenges in recommendation systems, such as satisfying real-time requirements, enabling personalization, and addressing cold-start problems. However, contextual bandit algorithms face challenges since they need to handle large state-action spaces sequentially. These challenges include the high costs of learning and of balancing exploration and exploitation, as well as large variations in performance depending on the domain of application. To address these challenges, Tsuboya et al. proposed the Regional Linear Risk-sensitive Satisficing (RegLinRS) algorithm. RegLinRS switches between exploration and exploitation based on how well the agent has achieved the target. However, the reward expectations in RegLinRS are linearly approximated from features, which limits its applicability when the relationship between features and reward expectations is non-linear. To handle more complex environments, we propose Neural Risk-sensitive Satisficing (NeuralRS), which incorporates neural networks into RegLinRS, and demonstrate its utility.
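The risk-sensitive satisficing (RS) family that RegLinRS and NeuralRS build on scores each arm by how well it meets an aspiration level ℵ. A rough, non-contextual sketch of that switching behavior (the arm means, aspiration value, and simple empirical-mean estimator below are illustrative assumptions, not the paper's implementation):

```python
import numpy as np

rng = np.random.default_rng(0)
true_means = np.array([0.3, 0.5, 0.8])   # hypothetical Bernoulli arms
aleph = 0.7                              # aspiration (target) level
n = np.ones(3)                           # play counts (optimistic init)
mu = np.ones(3)                          # estimated mean rewards

for t in range(2000):
    rs = (n / n.sum()) * (mu - aleph)    # satisficing value per arm
    a = int(np.argmax(rs))
    r = float(rng.random() < true_means[a])
    n[a] += 1
    mu[a] += (r - mu[a]) / n[a]          # incremental mean update

best_arm = int(np.argmax(n))             # the arm played most often
```

When an arm's estimated mean exceeds ℵ, its positive RS value grows with its play share (exploitation); arms below ℵ get negative values shrunk by low play share, steering trials toward under-explored options.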
[LG-23] Molecular Graph Contrastive Learning with Line Graph
链接: https://arxiv.org/abs/2501.08589
作者: Xueyuan Chen,Shangzhe Li,Ruomei Liu,Bowen Shi,Jiaheng Liu,Junran Wu,Ke Xu
类目: Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:Trapped by the label scarcity in molecular property prediction and drug design, graph contrastive learning (GCL) came forward. Leading contrastive learning works show two kinds of view generators, that is, random or learnable data corruption and domain knowledge incorporation. While effective, the two ways also lead to molecular semantics altering and limited generalization capability, respectively. To this end, we relate the LinE graph with MOlecular graph coNtrastive learning and propose a novel method termed LEMON. Specifically, by contrasting the given graph with the corresponding line graph, the graph encoder can freely encode the molecular semantics without omission. Furthermore, we present a new patch with edge attribute fusion and two local contrastive losses to enhance information transmission and tackle hard negative samples. Compared with state-of-the-art (SOTA) methods for view generation, superior performance on molecular property prediction suggests the effectiveness of our proposed framework.
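For intuition, the line graph used as the contrastive view can be built directly: every edge of the molecular graph becomes a node, and two such nodes are adjacent whenever the original edges share an atom. A minimal sketch (the toy "molecule" and helper name are assumptions; LEMON itself contrasts encoder outputs of the two graphs):

```python
def line_graph(edges):
    """Build the line graph: each original edge becomes a node, and
    two such nodes are linked iff the edges share an endpoint."""
    nodes = list(edges)
    adj = []
    for i in range(len(nodes)):
        for j in range(i + 1, len(nodes)):
            if set(nodes[i]) & set(nodes[j]):
                adj.append((i, j))
    return nodes, adj

# toy "molecule": a path of atoms a-b-c-d
mol_edges = [("a", "b"), ("b", "c"), ("c", "d")]
ln_nodes, ln_edges = line_graph(mol_edges)
# (a,b)-(b,c) share atom b; (b,c)-(c,d) share atom c
```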
[LG-24] Normalize Then Propagate: Efficient Homophilous Regularization for Few-shot Semi-Supervised Node Classification AAAI2025
链接: https://arxiv.org/abs/2501.08581
作者: Baoming Zhang,MingCai Chen,Jianqing Song,Shuangjie Li,Jie Zhang,Chongjun Wang
类目: Machine Learning (cs.LG)
*备注: Accepted by AAAI 2025
点击查看摘要
Abstract:Graph Neural Networks (GNNs) have demonstrated remarkable ability in semi-supervised node classification. However, most existing GNNs rely heavily on a large amount of labeled data for training, which is labor-intensive and requires extensive domain knowledge. In this paper, we first analyze the restrictions on GNN generalization from the perspective of supervision signals in the context of few-shot semi-supervised node classification. To address these challenges, we propose a novel algorithm named NormProp, which utilizes the homophily assumption of unlabeled nodes to generate additional supervision signals, thereby enhancing the generalization against label scarcity. The key idea is to efficiently capture both the class information and the consistency of aggregation during message passing, via decoupling the direction and Euclidean norm of node representations. Moreover, we conduct a theoretical analysis to determine the upper bound of the Euclidean norm, and then propose homophilous regularization to constrain the consistency of unlabeled nodes. Extensive experiments demonstrate that NormProp achieves state-of-the-art performance under low-label rate scenarios with low computational complexity.
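The decoupling idea can be sketched in a few lines: project node features onto the unit sphere (keeping only the direction), then average over neighbors. Since a convex combination of unit vectors has Euclidean norm at most 1, the propagated norm acts as a consistency signal. A minimal sketch, with a toy adjacency matrix and function name that are assumptions, not the paper's NormProp:

```python
import numpy as np

def normalize_then_propagate(X, A, steps=2):
    """Decouple direction and magnitude: map node features to the unit
    sphere, then average over neighbors with a row-stochastic adjacency.
    Averaged unit vectors have norm <= 1; a norm near 1 signals that a
    node's neighborhood agrees on direction (i.e., on class)."""
    X = X / np.linalg.norm(X, axis=1, keepdims=True)
    P = A / A.sum(axis=1, keepdims=True)   # row-normalized propagation
    for _ in range(steps):
        X = P @ X
    return X

A = np.array([[1, 1, 0],
              [1, 1, 1],
              [0, 1, 1]], float)           # toy graph with self-loops
X = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]])
H = normalize_then_propagate(X, A)
norms = np.linalg.norm(H, axis=1)          # bounded above by 1
```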
[LG-25] DNMDR: Dynamic Networks and Multi-view Drug Representations for Safe Medication Recommendation
链接: https://arxiv.org/abs/2501.08572
作者: Guanlin Liu,Xiaomei Yu,Zihao Liu,Xue Li,Xingxu Fan,Xiangwei Zheng
类目: Machine Learning (cs.LG); Information Retrieval (cs.IR)
*备注:
点击查看摘要
Abstract:Medication Recommendation (MR) is a promising research topic with diverse applications in the healthcare and clinical domains. However, existing methods mainly rely on sequential modeling and static graphs for representation learning, which ignore the dynamic correlations among the diverse medical events of a patient's temporal visits, leading to insufficient global structural exploration on nodes. Additionally, mitigating drug-drug interactions (DDIs) is another issue determining the utility of MR systems. To address the challenges mentioned above, this paper proposes a novel MR method integrating dynamic networks and multi-view drug representations (DNMDR). Specifically, weighted snapshot sequences for dynamic heterogeneous networks are constructed based on discrete visits in temporal EHRs, and all the dynamic networks are jointly trained to capture both structural correlations among diverse medical events and temporal dependency in historical health conditions, achieving comprehensive patient representations with both semantic features and structural relationships. Moreover, by combining drug co-occurrences and adverse DDIs in an internal view of drug molecule structure and an interactive view of drug pairs, safe drug representations are obtained for high-quality medication combination recommendation. Finally, extensive experiments on real-world datasets are conducted for performance evaluation, and the experimental results demonstrate that the proposed DNMDR method outperforms state-of-the-art baseline models by a large margin on various metrics such as PRAUC, Jaccard, and DDI rates.
[LG-26] Adaptive Sampled Softmax with Inverted Multi-Index: Methods Theory and Applications
链接: https://arxiv.org/abs/2501.08563
作者: Jin Chen,Jin Zhang,Xu huang,Yi Yang,Defu Lian,Enhong Chen
类目: Machine Learning (cs.LG)
*备注: 40 pages
点击查看摘要
Abstract:The softmax function is a cornerstone of multi-class classification, integral to a wide range of machine learning applications, from large-scale retrieval and ranking models to advanced large language models. However, its computational cost grows linearly with the number of classes, which becomes prohibitively expensive in scenarios with millions or even billions of classes. The sampled softmax, which relies on self-normalized importance sampling, has emerged as a powerful alternative, significantly reducing computational complexity. Yet, its estimator remains unbiased only when the sampling distribution matches the true softmax distribution. To improve both approximation accuracy and sampling efficiency, we propose the MIDX Sampler, a novel adaptive sampling strategy based on an inverted multi-index approach. Concretely, we decompose the softmax probability into several multinomial probabilities, each associated with a specific set of codewords and the last associated with the residual score of queries, thus reducing time complexity to the number of codewords instead of the number of classes. To further boost efficiency, we replace the query-specific residual probability with a simple uniform distribution, simplifying the computation while retaining high performance. Our method is backed by rigorous theoretical analysis, addressing key concerns such as sampling bias, gradient bias, convergence rates, and generalization error bounds. The results demonstrate that a smaller divergence from the ideal softmax distribution leads to faster convergence and improved generalization. Extensive experiments on large-scale language models, sequential recommenders, and extreme multi-class classification tasks confirm that the MIDX-Sampler delivers superior effectiveness and efficiency compared to existing approaches.
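The decomposition can be illustrated with a toy two-stage sampler: draw a codeword from a small multinomial over centroid scores, then draw a class within that codeword's bucket, so per-sample cost scales with the number of codewords rather than the number of classes. This sketch replaces the paper's inverted multi-index and residual term with a single flat codebook and a uniform within-bucket draw (all names and sizes below are assumptions):

```python
import numpy as np

rng = np.random.default_rng(1)
d, K, C = 8, 1000, 16                 # dims, classes, codewords
W = rng.normal(size=(K, d))           # class embeddings
code = rng.integers(0, C, size=K)     # class -> codeword assignment
centroids = np.stack([W[code == c].mean(0) for c in range(C)])
q = rng.normal(size=d)                # query vector

# stage 1: multinomial over codewords (C terms, not K)
pc = np.exp(centroids @ q)
pc /= pc.sum()
c = rng.choice(C, p=pc)
# stage 2: uniform draw within the chosen codeword's bucket
bucket = np.flatnonzero(code == c)
y = rng.choice(bucket)
# proposal probability used for self-normalized importance weighting
q_y = pc[c] / len(bucket)
```

The resulting proposal probability `q_y` is what a self-normalized importance-sampling estimate of the softmax would divide by.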
[LG-27] OMEGA: A Low-Latency GNN Serving System for Large Graphs
链接: https://arxiv.org/abs/2501.08547
作者: Geon-Woo Kim,Donghyun Kim,Jeongyoon Moon,Henry Liu,Tarannum Khan,Anand Iyer,Daehyeok Kim,Aditya Akella
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:Graph Neural Networks (GNNs) have been widely adopted for their ability to compute expressive node representations in graph datasets. However, serving GNNs on large graphs is challenging due to the high communication, computation, and memory overheads of constructing and executing computation graphs, which represent information flow across large neighborhoods. Existing approximation techniques in training can mitigate the overheads but, in serving, still lead to high latency and/or accuracy loss. To this end, we propose OMEGA, a system that enables low-latency GNN serving for large graphs with minimal accuracy loss through two key ideas. First, OMEGA employs selective recomputation of precomputed embeddings, which allows for reusing precomputed computation subgraphs while selectively recomputing a small fraction to minimize accuracy loss. Second, we develop computation graph parallelism, which reduces communication overhead by parallelizing the creation and execution of computation graphs across machines. Our evaluation with large graph datasets and GNN models shows that OMEGA significantly outperforms state-of-the-art techniques.
[LG-28] Homophily-aware Heterogeneous Graph Contrastive Learning
链接: https://arxiv.org/abs/2501.08538
作者: Haosen Wang,Chenglong Shi,Can Xu,Surong Yan,Pan Tang
类目: Machine Learning (cs.LG); Social and Information Networks (cs.SI)
*备注:
点击查看摘要
Abstract:Heterogeneous graph pre-training (HGP) has demonstrated remarkable performance across various domains. However, the issue of heterophily in real-world heterogeneous graphs (HGs) has been largely overlooked. To bridge this research gap, we proposed a novel heterogeneous graph contrastive learning framework, termed HGMS, which leverages connection strength and multi-view self-expression to learn homophilous node representations. Specifically, we design a heterogeneous edge dropping augmentation strategy that enhances the homophily of augmented views. Moreover, we introduce a multi-view self-expressive learning method to infer the homophily between nodes. In practice, we develop two approaches to solve the self-expressive matrix. The solved self-expressive matrix serves as an additional augmented view to provide homophilous information and is used to identify false negatives in contrastive loss. Extensive experimental results demonstrate the superiority of HGMS across different downstream tasks.
[LG-29] Learning Hyperplane Tree: A Piecewise Linear and Fully Interpretable Decision-making Framework
链接: https://arxiv.org/abs/2501.08515
作者: Hongyi Li,Jun Xu,William Ward Armstrong
类目: Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:This paper introduces a novel tree-based model, Learning Hyperplane Tree (LHT), which outperforms state-of-the-art (SOTA) tree models for classification tasks on several public datasets. The structure of LHT is simple and efficient: it partitions the data using several hyperplanes to progressively distinguish between target and non-target class samples. Although the separation is not perfect at each stage, LHT effectively improves the distinction through successive partitions. During testing, a sample is classified by evaluating the hyperplanes defined in the branching blocks and traversing down the tree until it reaches the corresponding leaf block. The class of the test sample is then determined using the piecewise linear membership function defined in the leaf blocks, which is derived through least-squares fitting and fuzzy logic. LHT is highly transparent and interpretable: at each branching block, the contribution of each feature to the classification can be clearly observed.
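Classification in such a tree is just a sequence of hyperplane tests followed by a leaf decision. A hand-built toy (the hard class labels at the leaves stand in for LHT's least-squares/fuzzy membership functions, and the tree itself is an illustrative assumption, not learned):

```python
import numpy as np

# A two-level tree of hyperplanes, written by hand for illustration.
# Internal node: (w, b, left, right); leaf: a class label.
tree = (np.array([1.0, 0.0]), -0.5,          # root: split on x0 vs 0.5
        (np.array([0.0, 1.0]), -0.5, 0, 1),  # left child: split on x1
        2)                                    # right side -> class 2

def classify(x, node):
    """Traverse hyperplane tests until a leaf label is reached."""
    if not isinstance(node, tuple):
        return node
    w, b, left, right = node
    return classify(x, left) if w @ x + b <= 0 else classify(x, right)

labels = [classify(np.array(p), tree) for p in
          [(0.2, 0.2), (0.2, 0.9), (0.9, 0.5)]]
```

Transparency comes for free here: at every branching block the weight vector `w` shows each feature's contribution to the split.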
[LG-30] Score-based 3D molecule generation with neural fields NEURIPS2024
链接: https://arxiv.org/abs/2501.08508
作者: Matthieu Kirchmeyer,Pedro O. Pinheiro,Saeed Saremi
类目: Machine Learning (cs.LG); Biomolecules (q-bio.BM)
*备注: NeurIPS 2024
点击查看摘要
Abstract:We introduce a new representation for 3D molecules based on their continuous atomic density fields. Using this representation, we propose a new model based on walk-jump sampling for unconditional 3D molecule generation in the continuous space using neural fields. Our model, FuncMol, encodes molecular fields into latent codes using a conditional neural field, samples noisy codes from a Gaussian-smoothed distribution with Langevin MCMC (walk), denoises these samples in a single step (jump), and finally decodes them into molecular fields. FuncMol performs all-atom generation of 3D molecules without assumptions on the molecular structure and scales well with the size of molecules, unlike most approaches. Our method achieves competitive results on drug-like molecules and easily scales to macro-cyclic peptides, with at least one order of magnitude faster sampling. The code is available at this https URL.
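The walk-jump procedure can be seen on a 1D toy where the smoothed score is known in closed form: for N(0,1) data corrupted with noise scale σ, the smoothed density is N(0, 1+σ²). The Langevin "walk" explores the noisy space and the single-step "jump" applies Tweedie denoising. FuncMol does this with a learned denoiser over neural-field latent codes; everything below is a closed-form stand-in:

```python
import numpy as np

rng = np.random.default_rng(0)
sigma = 1.0                  # smoothing noise scale
var = 1.0 + sigma**2         # smoothed density is N(0, 1 + sigma^2)

def score(y):                # grad log p_sigma, known exactly here
    return -y / var

y, eps = 0.0, 0.1
samples = []
for _ in range(20000):
    # walk: one Langevin MCMC step in the noisy space
    y = y + eps * score(y) + np.sqrt(2 * eps) * rng.normal()
    # jump: single-step Tweedie denoising x = y + sigma^2 * score(y)
    samples.append(y + sigma**2 * score(y))
samples = np.array(samples)
```

Here the jump returns the posterior mean E[x|y] = y/2, so the denoised samples concentrate more tightly than the raw walk samples, which is the expected behavior of single-step denoising at large σ.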
[LG-31] Scalable Bayesian Physics-Informed Kolmogorov-Arnold Networks
链接: https://arxiv.org/abs/2501.08501
作者: Zhiwei Gao,George Em Karniadakis
类目: Numerical Analysis (math.NA); Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:Uncertainty quantification (UQ) plays a pivotal role in scientific machine learning, especially when surrogate models are used to approximate complex systems. Although multilayer perceptrons (MLPs) are commonly employed as surrogates, they often suffer from overfitting due to their large number of parameters. Kolmogorov-Arnold networks (KANs) offer an alternative solution with fewer parameters. However, gradient-based inference methods, such as Hamiltonian Monte Carlo (HMC), may result in computational inefficiency when applied to KANs, especially for large-scale datasets, due to the high cost of gradient computations. To address these challenges, we propose a novel approach, combining the dropout Tikhonov ensemble Kalman inversion (DTEKI) with Chebyshev KANs. This gradient-free method effectively mitigates overfitting and enhances numerical stability. Additionally, we incorporate the active subspace method to reduce the parameter-space dimensionality, allowing us to improve the accuracy of predictions and obtain more reliable uncertainty estimates. Numerical experiments demonstrate the efficacy of our approach in various test cases, including scenarios with large datasets and high noise levels. Our results show that the new method achieves comparable or better accuracy, much higher efficiency as well as stability compared to HMC, in addition to scalability. Moreover, by leveraging the low-dimensional parameter subspace, our method preserves prediction accuracy while substantially reducing further the computational cost.
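At the core of the gradient-free alternative is the ensemble Kalman inversion update, which nudges an ensemble of parameter guesses toward the data using only forward evaluations. A scalar toy with a linear forward model (DTEKI adds dropout, Tikhonov regularization, and a Chebyshev-KAN parameterization; the model, noise level, and ensemble size here are assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

def G(theta):                          # toy forward model (the "solve")
    return 2.0 * theta

y_obs, gamma = 4.0, 0.01               # data and observation noise variance
theta = rng.normal(0.0, 2.0, size=50)  # ensemble of parameter guesses

for _ in range(20):
    g = G(theta)
    c_tg = np.cov(theta, g)[0, 1]      # cross-covariance C_{theta,G}
    c_gg = np.cov(g)                   # output covariance C_{G,G}
    k = c_tg / (c_gg + gamma)          # scalar Kalman gain
    # perturbed observations keep the ensemble spread realistic
    theta = theta + k * (y_obs + np.sqrt(gamma) * rng.normal(size=50) - g)

estimate = theta.mean()                # should approach y_obs / 2 = 2
```

No gradient of `G` is ever taken; only ensemble statistics of forward evaluations drive the update, which is what makes the approach attractive when back-propagation through the surrogate is expensive.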
[LG-32] Time series forecasting for multidimensional telemetry data using GAN and BiLSTM in a Digital Twin
链接: https://arxiv.org/abs/2501.08464
作者: Joao Carmo de Almeida Neto,Claudio Miceli de Farias,Leandro Santiago de Araujo,Leopoldo Andre Dutra Lusquino Filho
类目: Machine Learning (cs.LG); Signal Processing (eess.SP)
*备注:
点击查看摘要
Abstract:Research related to digital twins has been increasing in recent years. Besides mirroring the physical world into the digital one, there is a need to provide services based on the data collected and transferred to the virtual world. One of these services is forecasting the future behavior of the physical counterpart, which could lead to applications such as preventing harmful events or designing improvements for better performance. One strategy used to predict a system's operation is the use of time-series models such as ARIMA or LSTM, and improvements have been implemented using these algorithms. Recently, deep learning techniques based on generative models such as Generative Adversarial Networks (GANs) have been proposed to create time series, and the use of LSTM has gained relevance in time-series forecasting, but both have limitations that restrict the forecasting results. Another issue found in the literature is the challenge of handling multivariate environments/applications in time-series generation. Therefore, new methods need to be studied to fill these gaps and, consequently, provide better resources for creating useful digital twins. This proposal studies the integration of a BiLSTM layer with time series generated by a GAN, in order to improve the forecasting accuracy of all the features provided by the dataset and, consequently, to improve behavior prediction.
[LG-33] Keras Sig: Efficient Path Signature Computation on GPU in Keras 3
链接: https://arxiv.org/abs/2501.08455
作者: Rémi Genet,Hugo Inzirillo
类目: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC)
*备注:
点击查看摘要
Abstract:In this paper we introduce Keras Sig, a high-performance pythonic library designed to compute path signatures for deep learning applications. Entirely built in Keras 3, Keras Sig leverages seamless integration with the most widely used deep learning backends such as PyTorch, JAX and TensorFlow. Inspired by Kidger and Lyons (2021), we propose a novel approach reshaping signature calculations to leverage GPU parallelism. This adjustment allows us to reduce training time by 55% and achieve 5- to 10-fold improvements in direct signature computation compared to existing methods, while maintaining similar CPU performance. Relying on high-level tensor operations instead of low-level C++ code, Keras Sig significantly reduces the versioning and compatibility issues commonly encountered in deep learning libraries, while delivering superior or comparable performance across various hardware configurations. We demonstrate through extensive benchmarking that our approach scales efficiently with the length of input sequences and maintains competitive performance across various signature parameters, though bounded by memory constraints for very large signature dimensions.
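For reference, the quantity being accelerated can be computed naively: truncated at depth 2, the signature of a piecewise-linear path consists of the total increment and the matrix of iterated integrals, accumulated step by step. A plain-numpy sketch (Keras Sig vectorizes the equivalent computation as batched tensor operations on GPU; the function below is an illustrative reference, not its API):

```python
import numpy as np

def signature_depth2(path):
    """Levels 1 and 2 of the path signature of a piecewise-linear
    path (shape [T, d]), via the iterated-integral recursion."""
    inc = np.diff(path, axis=0)          # per-step increments
    s1 = inc.sum(axis=0)                 # level 1: total increment
    d = path.shape[1]
    s2 = np.zeros((d, d))
    run = np.zeros(d)                    # increment accumulated so far
    for dx in inc:
        s2 += np.outer(run, dx) + 0.5 * np.outer(dx, dx)
        run += dx
    return s1, s2

path = np.array([[0.0, 0.0], [1.0, 0.0], [1.0, 1.0]])  # right, then up
s1, s2 = signature_depth2(path)
# antisymmetric part of level 2 = signed (Levy) area of the path
levy_area = 0.5 * (s2[0, 1] - s2[1, 0])
```

On this right-then-up unit path the level-1 term is (1, 1) and the Lévy area is 0.5, matching the textbook values.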
[LG-34] Physics-informed neural networks for phase-resolved data assimilation and prediction of nonlinear ocean waves
链接: https://arxiv.org/abs/2501.08430
作者: Svenja Ehlers,Norbert Hoffmann,Tianning Tang,Adrian H. Callaghan,Rui Cao,Enrique M. Padilla,Yuxin Fang,Merten Stender
类目: Machine Learning (cs.LG); Atmospheric and Oceanic Physics (physics.ao-ph); Fluid Dynamics (physics.flu-dyn)
*备注: 22 pages, 12 Figures, preprint
点击查看摘要
Abstract:The assimilation and prediction of phase-resolved surface gravity waves are critical challenges in ocean science and engineering. Potential flow theory (PFT) has been widely employed to develop wave models and numerical techniques for wave prediction. However, traditional wave prediction methods are often limited. For example, most simplified wave models have a limited ability to capture strong wave nonlinearity, while fully nonlinear PFT solvers often fail to meet the speed requirements of engineering applications. This computational inefficiency also hinders the development of effective data assimilation techniques, which are required to reconstruct spatial wave information from sparse measurements to initialize the wave prediction. To address these challenges, we propose a novel solver method that leverages physics-informed neural networks (PINNs) that parameterize PFT solutions as neural networks. This provides a computationally inexpensive way to assimilate and predict wave data. The proposed PINN framework is validated through comparisons with analytical linear PFT solutions and experimental data collected in a laboratory wave flume. The results demonstrate that our approach accurately captures and predicts irregular, nonlinear, and dispersive wave surface dynamics. Moreover, the PINN can infer the fully nonlinear velocity potential throughout the entire fluid volume solely from surface elevation measurements, enabling the calculation of fluid velocities that are difficult to measure experimentally.
[LG-35] Physics-Informed Latent Neural Operator for Real-time Predictions of Complex Physical Systems
链接: https://arxiv.org/abs/2501.08428
作者: Sharmila Karumuri,Lori Graham-Brady,Somdatta Goswami
类目: Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:Deep operator network (DeepONet) has shown great promise as a surrogate model for systems governed by partial differential equations (PDEs), learning mappings between infinite-dimensional function spaces with high accuracy. However, achieving low generalization errors often requires highly overparameterized networks, posing significant challenges for large-scale, complex systems. To address these challenges, latent DeepONet was proposed, introducing a two-step approach: first, a reduced-order model is used to learn a low-dimensional latent space, followed by operator learning on this latent space. While effective, this method is inherently data-driven, relying on large datasets and making it difficult to incorporate governing physics into the framework. Additionally, the decoupled nature of these steps prevents end-to-end optimization and the ability to handle data scarcity. This work introduces PI-Latent-NO, a physics-informed latent operator learning framework that overcomes these limitations. Our architecture employs two coupled DeepONets in an end-to-end training scheme: the first, termed Latent-DeepONet, identifies and learns the low-dimensional latent space, while the second, Reconstruction-DeepONet, maps the latent representations back to the original physical space. By integrating governing physics directly into the training process, our approach requires significantly fewer data samples while achieving high accuracy. Furthermore, the framework is computationally and memory efficient, exhibiting nearly constant scaling behavior on a single GPU and demonstrating the potential for further efficiency gains with distributed training. We validate the proposed method on high-dimensional parametric PDEs, demonstrating its effectiveness as a proof of concept and its potential scalability for large-scale systems.
[LG-36] Is Stochastic Gradient Descent Effective? A PDE Perspective on Machine Learning processes
链接: https://arxiv.org/abs/2501.08425
作者: Davide Barbieri,Matteo Bonforte,Peio Ibarrondo
类目: Machine Learning (cs.LG); Analysis of PDEs (math.AP); Probability (math.PR)
*备注:
点击查看摘要
Abstract:In this paper we analyze the behaviour of the stochastic gradient descent (SGD), a widely used method in supervised learning for optimizing neural network weights via a minimization of non-convex loss functions. Since the pioneering work of E, Li and Tai (2017), the underlying structure of such processes can be understood via parabolic PDEs of Fokker-Planck type, which are at the core of our analysis. Even if Fokker-Planck equations have a long history and an extensive literature, almost nothing is known when the potential is non-convex or when the diffusion matrix is degenerate, and this is the main difficulty that we face in our analysis. We identify two different regimes: in the initial phase of SGD, the loss function drives the weights to concentrate around the nearest local minimum. We refer to this phase as the drift regime and we provide quantitative estimates on this concentration phenomenon. Next, we introduce the diffusion regime, where stochastic fluctuations help the learning process to escape suboptimal local minima. We analyze the Mean Exit Time (MET) and prove upper and lower bounds of the MET. Finally, we address the asymptotic convergence of SGD, for a non-convex cost function and a degenerate diffusion matrix, that do not allow to use the standard approaches, and require new techniques. For this purpose, we exploit two different methods: duality and entropy methods. We provide new results about the dynamics and effectiveness of SGD, offering a deep connection between stochastic optimization and PDE theory, and some answers and insights to basic questions in the Machine Learning processes: How long does SGD take to escape from a bad minimum? Do neural network parameters converge using SGD? How do parameters evolve in the first stage of training with SGD?
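For orientation, a sketch of the setting: in the diffusion approximation of SGD underlying such Fokker-Planck analyses (scaling conventions vary across papers, so take the constants below as assumptions), the weights follow an SDE whose law solves a Fokker-Planck equation with a potentially degenerate diffusion matrix:

```latex
% SDE approximation of SGD with learning rate \eta and
% (possibly degenerate) gradient-noise covariance \Sigma(\theta):
d\theta_t = -\nabla f(\theta_t)\,dt
          + \sqrt{\eta}\,\Sigma(\theta_t)^{1/2}\,dW_t ,
% Fokker-Planck equation for the law \rho(t,\theta) of \theta_t:
\partial_t \rho = \nabla\cdot\bigl(\rho\,\nabla f\bigr)
  + \frac{\eta}{2}\,\nabla\cdot\bigl(\nabla\cdot(\Sigma\,\rho)\bigr).
```

The drift regime corresponds to the first (transport) term dominating, while the diffusion regime and mean-exit-time questions concern the second term.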
[LG-37] Predict Confidently Predict Right: Abstention in Dynamic Graph Learning
链接: https://arxiv.org/abs/2501.08397
作者: Jayadratha Gayen,Himanshu Pal,Naresh Manwani,Charu Sharma
类目: Machine Learning (cs.LG); Social and Information Networks (cs.SI)
*备注:
点击查看摘要
Abstract:Many real-world systems can be modeled as dynamic graphs, where nodes and edges evolve over time, requiring specialized models to capture their evolving dynamics in risk-sensitive applications effectively. Temporal graph neural networks (GNNs) are one such category of specialized models. For the first time, our approach integrates a reject option strategy within the framework of GNNs for continuous-time dynamic graphs. This allows the model to strategically abstain from making predictions when the uncertainty is high and confidence is low, thus minimizing the risk of critical misclassification and enhancing the results and reliability. We propose a coverage-based abstention prediction model to implement the reject option that maximizes prediction within a specified coverage. It improves the prediction score for link prediction and node classification tasks. Temporal GNNs deal with extremely skewed datasets for the next state prediction or node classification task. In the case of class imbalance, our method can be further tuned to provide a higher weightage to the minority class. Exhaustive experiments are presented on four datasets for dynamic link prediction and two datasets for dynamic node classification tasks. This demonstrates the effectiveness of our approach in improving the reliability and area under the curve (AUC)/ average precision (AP) scores for predictions in dynamic graph scenarios. The results highlight our model’s ability to efficiently handle the trade-offs between prediction confidence and coverage, making it a dependable solution for applications requiring high precision in dynamic and uncertain environments.
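The coverage-based reject option can be sketched as thresholding a confidence score at the quantile that yields the target coverage, then measuring accuracy only on accepted predictions. The synthetic confidence/correctness model below is an assumption for illustration, not the paper's learned abstention model:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10000
conf = rng.random(n)                     # model confidence scores
# toy labels: probability of being correct grows with confidence
correct = rng.random(n) < (0.5 + 0.5 * conf)

target_coverage = 0.6
tau = np.quantile(conf, 1.0 - target_coverage)  # abstain below tau
accept = conf >= tau
coverage = accept.mean()                 # fraction actually predicted on
selective_acc = correct[accept].mean()   # accuracy on accepted samples
full_acc = correct.mean()                # accuracy with no rejection
```

Selective accuracy exceeds full accuracy exactly because abstention removes the low-confidence, more error-prone predictions; in the class-imbalance setting the abstract mentions, the threshold or loss would additionally up-weight the minority class.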
[LG-38] Empathetic Conversational Agents : Utilizing Neural and Physiological Signals for Enhanced Empathetic Interactions
链接: https://arxiv.org/abs/2501.08393
作者: Nastaran Saffaryazdi,Tamil Selvan Gunasekaran,Kate Laveys,Elizabeth Broadbent,Mark Billinghurst
类目: Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:Conversational agents (CAs) are revolutionizing human-computer interaction by evolving from text-based chatbots to empathetic digital humans (DHs) capable of rich emotional expressions. This paper explores the integration of neural and physiological signals into the perception module of CAs to enhance empathetic interactions. By leveraging these cues, the study aims to detect emotions in real-time and generate empathetic responses and expressions. We conducted a user study where participants engaged in conversations with a DH about emotional topics. The DH responded and displayed expressions by mirroring detected emotions in real-time using neural and physiological cues. The results indicate that participants experienced stronger emotions and greater engagement during interactions with the Empathetic DH, demonstrating the effectiveness of incorporating neural and physiological signals for real-time emotion recognition. However, several challenges were identified, including recognition accuracy, emotional transition speeds, individual personality effects, and limitations in voice tone modulation. Addressing these challenges is crucial for further refining Empathetic DHs and fostering meaningful connections between humans and artificial entities. Overall, this research advances human-agent interaction and highlights the potential of real-time neural and physiological emotion recognition in creating empathetic DHs.
[LG-39] Towards Fast Specialized Machine Learning Force Fields: Distilling Foundation Models via Energy Hessians ICLR2025
链接: https://arxiv.org/abs/2501.09009
作者: Ishan Amin,Sanjeev Raja,Aditi Krishnapriyan
类目: Chemical Physics (physics.chem-ph); Materials Science (cond-mat.mtrl-sci); Machine Learning (cs.LG); Biological Physics (physics.bio-ph)
*备注: Under Review at ICLR 2025
点击查看摘要
Abstract:The foundation model (FM) paradigm is transforming Machine Learning Force Fields (MLFFs), leveraging general-purpose representations and scalable training to perform a variety of computational chemistry tasks. Although MLFF FMs have begun to close the accuracy gap relative to first-principles methods, there is still a strong need for faster inference speed. Additionally, while research is increasingly focused on general-purpose models which transfer across chemical space, practitioners typically only study a small subset of systems at a given time. This underscores the need for fast, specialized MLFFs relevant to specific downstream applications, which preserve test-time physical soundness while maintaining train-time scalability. In this work, we introduce a method for transferring general-purpose representations from MLFF foundation models to smaller, faster MLFFs specialized to specific regions of chemical space. We formulate our approach as a knowledge distillation procedure, where the smaller “student” MLFF is trained to match the Hessians of the energy predictions of the “teacher” foundation model. Our specialized MLFFs can be up to 20× faster than the original foundation model, while retaining, and in some cases exceeding, its performance and that of undistilled models. We also show that distilling from a teacher model with a direct force parameterization into a student model trained with conservative forces (i.e., computed as derivatives of the potential energy) successfully leverages the representations from the large-scale teacher for improved accuracy, while maintaining energy conservation during test-time molecular dynamics simulations. More broadly, our work suggests a new paradigm for MLFF development, in which foundation models are released along with smaller, specialized simulation “engines” for common chemical subsets.
[LG-40] CrystalGRW: Generative Modeling of Crystal Structures with Targeted Properties via Geodesic Random Walks
链接: https://arxiv.org/abs/2501.08998
作者: Krit Tangsongcharoen,Teerachote Pakornchote,Chayanon Atthapak,Natthaphon Choomphon-anomakhun,Annop Ektarawong,Björn Alling,Christopher Sutton,Thiti Bovornratanaraks,Thiparat Chotibut
类目: Materials Science (cond-mat.mtrl-sci); Statistical Mechanics (cond-mat.stat-mech); Machine Learning (cs.LG); Computational Physics (physics.comp-ph)
*备注: 10+12 pages, 10 figures
点击查看摘要
Abstract:Determining whether a candidate crystalline material is thermodynamically stable depends on identifying its true ground-state structure, a central challenge in computational materials science. We introduce CrystalGRW, a diffusion-based generative model on Riemannian manifolds that proposes novel crystal configurations and can predict stable phases validated by density functional theory. The crystal properties, such as fractional coordinates, atomic types, and lattice matrices, are represented on suitable Riemannian manifolds, ensuring that new predictions generated through the diffusion process preserve the periodicity of crystal structures. We incorporate an equivariant graph neural network to also account for rotational and translational symmetries during the generation process. CrystalGRW demonstrates the ability to generate realistic crystal structures that are close to their ground states with accuracy comparable to existing models, while also enabling conditional control, such as specifying a desired crystallographic point group. These features help accelerate materials discovery and inverse design by offering stable, symmetry-consistent crystal candidates for experimental validation.
[LG-41] Improved Compression Bounds for Scenario Decision Making
链接: https://arxiv.org/abs/2501.08884
作者: Guillaume O. Berger,Raphaël M. Jungers
类目: Optimization and Control (math.OC); Machine Learning (cs.LG)
*备注:
Abstract:Scenario decision making offers a flexible way of making decisions in an uncertain environment while obtaining probabilistic guarantees on the risk of failure of the decision. The idea of this approach is to draw samples of the uncertainty and make a decision based on the samples, called “scenarios”. The probabilistic guarantees take the form of a bound on the probability of sampling a set of scenarios that will lead to a decision whose risk of failure is above a given maximum tolerance. This bound can be expressed as a function of the number of sampled scenarios, the maximum tolerated risk, and some intrinsic property of the problem called the “compression size”. Several such bounds have been proposed in the literature under various assumptions on the problem. We propose new bounds that improve upon the existing ones without requiring stronger assumptions on the problem.
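作为背景,经典情景方法的界可以写成采样数 N、压缩规模 d 与风险容忍度 ε 的二项式尾部形式。注意这是文献中的标准界,并非本文提出的改进界:

```python
from math import comb

# Classical scenario-approach bound: with N sampled scenarios,
# compression size d, and risk tolerance eps, the probability of drawing
# a "bad" sample set is at most
#   sum_{i=0}^{d-1} C(N, i) * eps^i * (1 - eps)^(N - i).
def scenario_bound(n_scenarios, compression_size, eps):
    return sum(comb(n_scenarios, i) * eps**i * (1 - eps)**(n_scenarios - i)
               for i in range(compression_size))

b_small = scenario_bound(100, 5, 0.1)   # 100 scenarios
b_large = scenario_bound(500, 5, 0.1)   # 500 scenarios: much tighter
print(f"N=100: {b_small:.4f}, N=500: {b_large:.2e}")
```

界随采样数 N 增大而变紧;论文的贡献正是在不加强问题假设的前提下改进这一类界。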
[LG-42] Deep Learning Meets Queue-Reactive: A Framework for Realistic Limit Order Book Simulation
链接: https://arxiv.org/abs/2501.08822
作者: Hamza Bodor,Laurent Carlier
类目: Trading and Market Microstructure (q-fin.TR); Machine Learning (cs.LG)
*备注:
Abstract:The Queue-Reactive model introduced by Huang et al. (2015) has become a standard tool for limit order book modeling, widely adopted by both researchers and practitioners for its simplicity and effectiveness. We present the Multidimensional Deep Queue-Reactive (MDQR) model, which extends this framework in three ways: it relaxes the assumption of queue independence, enriches the state space with market features, and models the distribution of order sizes. Through a neural network architecture, the model learns complex dependencies between different price levels and adapts to varying market conditions, while preserving the interpretable point-process foundation of the original framework. Using data from the Bund futures market, we show that MDQR captures key market properties including the square-root law of market impact, cross-queue correlations, and realistic order size patterns. The model demonstrates particular strength in reproducing both conditional and stationary distributions of order sizes, as well as various stylized facts of market microstructure. The model achieves this while maintaining the computational efficiency needed for practical applications such as strategy development through reinforcement learning or realistic backtesting.
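原始 Queue-Reactive 模型的核心是“订单到达强度依赖于当前队列长度”。下面是一个极简的单队列示意(强度函数为任意示例,并非 Huang et al. 或 MDQR 拟合得到的形式):

```python
import random

# Minimal queue-reactive sketch: order-arrival intensities depend on the
# current queue size q. The intensity functions are illustrative only.
def intensities(q):
    lam_add = 1.0 / (1.0 + q)     # fewer insertions into long queues
    lam_cancel = 0.2 * q          # cancellations scale with queue size
    return lam_add, lam_cancel

def simulate(steps, rng):
    q, path = 5, []
    for _ in range(steps):
        lam_add, lam_cancel = intensities(q)
        total = lam_add + lam_cancel          # lam_add > 0, so total > 0
        q += 1 if rng.random() < lam_add / total else -1
        q = max(q, 0)
        path.append(q)
    return path

path = simulate(1000, random.Random(7))
print(f"mean queue length: {sum(path) / len(path):.2f}")
```

MDQR 的扩展在于:用神经网络替换这里的查表式强度函数,使其同时依赖多个价位的队列状态、市场特征以及订单大小分布。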
[LG-43] Nesterov Acceleration for Ensemble Kalman Inversion and Variants
链接: https://arxiv.org/abs/2501.08779
作者: Sydney Vernon,Eviatar Bach,Oliver R. A. Dunbar
类目: Optimization and Control (math.OC); Machine Learning (cs.LG); Computation (stat.CO)
*备注:
Abstract:Ensemble Kalman inversion (EKI) is a derivative-free, particle-based optimization method for solving inverse problems. It can be shown that EKI approximates a gradient flow, which allows the application of methods for accelerating gradient descent. Here, we show that Nesterov acceleration is effective in speeding up the reduction of the EKI cost function on a variety of inverse problems. We also implement Nesterov acceleration for two EKI variants, unscented Kalman inversion and ensemble transform Kalman inversion. Our specific implementation takes the form of a particle-level nudge that is demonstrably simple to couple in a black-box fashion with any existing EKI variant algorithms, comes with no additional computational expense, and with no additional tuning hyperparameters. This work shows a pathway for future research to translate advances in gradient-based optimization into advances in gradient-free Kalman optimization.
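论文的加速形式是粒子层面的“nudge”:每次更新前先沿迭代行进方向推一步。下面在一个标量二次目标上示意该 Nesterov 式 nudge(这只是梯度下降版的类比,并非 EKI 本身;目标函数、步长与迭代数均为示例):

```python
# Nesterov-style nudge on a toy quadratic f(u) = u^2: before each
# update, the iterate is nudged along its direction of travel.
def run(n_iter, lr, accelerate):
    u, u_prev = 5.0, 5.0
    for k in range(1, n_iter + 1):
        beta = (k - 1) / (k + 2) if accelerate else 0.0
        v = u + beta * (u - u_prev)        # the particle-level "nudge"
        u_prev, u = u, v - lr * 2.0 * v    # gradient step; grad f = 2u
    return abs(u)

plain = run(20, 0.05, accelerate=False)
nudged = run(20, 0.05, accelerate=True)
print(f"|u| after 20 steps -- plain: {plain:.3f}, nudged: {nudged:.4f}")
```

与摘要所述一致的要点在于:nudge 不引入额外的调节超参数,且可以黑盒方式叠加在任意现有的更新规则之上。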
[LG-44] A Theory of Optimistically Universal Online Learnability for General Concept Classes NEURIPS2024
链接: https://arxiv.org/abs/2501.08551
作者: Steve Hanneke,Hongao Wang
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注: NeurIPS 2024
Abstract:We provide a full characterization of the concept classes that are optimistically universally online learnable with {0, 1} labels. The notion of optimistically universal online learning was defined in [Hanneke, 2021] in order to understand learnability under minimal assumptions. In this paper, following the philosophy behind that work, we investigate two questions, namely, for every concept class: (1) What are the minimal assumptions on the data process admitting online learnability? (2) Is there a learning algorithm which succeeds under every data process satisfying the minimal assumptions? Such an algorithm is said to be optimistically universal for the given concept class. We resolve both of these questions for all concept classes, and moreover, as part of our solution, we design general learning algorithms for each case. Finally, we extend these algorithms and results to the agnostic case, showing an equivalence between the minimal assumptions on the data process for learnability in the agnostic and realizable cases, for every concept class, as well as the equivalence of optimistically universal learnability.
[LG-45] Head Motion Degrades Machine Learning Classification of Alzheimers Disease from Positron Emission Tomography
链接: https://arxiv.org/abs/2501.08459
作者: Eléonore V. Lieffrig,Takuya Toyonaga,Jiazhen Zhang,John A. Onofrey
类目: Image and Video Processing (eess.IV); Machine Learning (cs.LG)
*备注: 5 pages
Abstract:Brain positron emission tomography (PET) imaging is broadly used in research and clinical routines to study, diagnose, and stage Alzheimer’s disease (AD). However, its potential cannot be fully exploited yet due to the lack of portable motion correction solutions, especially in clinical settings. Head motion during data acquisition has indeed been shown to degrade image quality and to induce tracer uptake quantification errors. In this study, we demonstrate that it also biases machine learning-based AD classification. We start by proposing a binary classification algorithm solely based on PET images. We find that it reaches a high accuracy in classifying motion corrected images into cognitively normal or AD. We demonstrate that the classification accuracy substantially decreases when images lack motion correction, thereby limiting the algorithm’s effectiveness and biasing image interpretation. We validate these findings in cohorts of 128 ¹¹C-UCB-J and 173 ¹⁸F-FDG scans, two tracers highly relevant to the study of AD. Classification accuracies decreased by 10% and 5% on 20 ¹⁸F-FDG and 20 ¹¹C-UCB-J testing cases, respectively. Our findings underscore the critical need for efficient motion correction methods to make the most of the diagnostic capabilities of PET-based machine learning.
[LG-46] A Constant Velocity Latent Dynamics Approach for Accelerating Simulation of Stiff Nonlinear Systems
链接: https://arxiv.org/abs/2501.08423
作者: William Cole Nockolds,C. G. Krishnanunni,Tan Bui-Thanh
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:
Abstract:Solving stiff ordinary differential equations (StODEs) requires sophisticated numerical solvers, which are often computationally expensive. In particular, StODEs often cannot be solved with traditional explicit time integration schemes and one must resort to costly implicit methods to compute solutions. On the other hand, state-of-the-art machine learning (ML) based methods such as Neural ODE (NODE) poorly handle the timescale separation of various elements of the solutions to StODEs and require expensive implicit solvers for integration at inference time. In this work, we embark on a different path which involves learning a latent dynamics for StODEs, in which one completely avoids numerical integration. To that end, we consider a constant velocity latent dynamical system whose solution is a sequence of straight lines. Given the initial condition and parameters of the ODE, the encoder networks learn the slope (i.e., the constant velocity) and the initial condition for the latent dynamics. In other words, the solution of the original dynamics is encoded into a sequence of straight lines which can be decoded back to retrieve the actual solution as and when required. Another key idea in our approach is a nonlinear transformation of time, which allows for the “stretching/squeezing” of time in the latent space, thereby allowing for varying levels of attention to different temporal regions in the solution. Additionally, we provide a simple universal-approximation-type proof showing that our approach can approximate the solution of a stiff nonlinear system on a compact set to any degree of accuracy ε. We show that the dimension of the latent dynamical system in our approach is independent of ε. Numerical investigation on prototype StODEs suggests that our method outperforms state-of-the-art machine learning approaches for handling StODEs.
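核心想法可以用一个解析小例子说明:经非线性时间变换 τ(t) 拉伸快速瞬态后,解在隐空间中成为一条直线 z(τ) = z₀ + vτ,再由解码器映回原空间。以下的时间变换、解码器与“刚性”解均为人为构造的示例,并非论文中学习得到的网络:

```python
import math

# Constant-velocity latent sketch: a rapidly decaying solution is
# represented as a straight line in warped time, then decoded back.
def time_warp(t, rate=50.0):
    return math.log1p(rate * t)          # stretches the fast transient

def decode(z):
    return math.exp(z)                   # toy decoder, chosen by hand

# Toy fast-decaying solution y(t) = (1 + 50 t)^(-1): in warped time,
# log y = -tau, i.e. exactly a line with z0 = 0 and velocity v = -1,
# so the latent representation reconstructs y without any integration.
z0, v = 0.0, -1.0
errs = []
for t in [0.0, 1e-3, 1e-2, 0.1, 1.0]:
    y_true = 1.0 / (1.0 + 50.0 * t)
    y_hat = decode(z0 + v * time_warp(t))
    errs.append(abs(y_hat - y_true))
print(f"max reconstruction error: {max(errs):.1e}")
```

论文中 z₀、v 与时间变换都由编码器网络从初值和 ODE 参数中学习得到;此处为使直线表示精确成立而人为选定。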
[LG-47] Dissecting a Small Artificial Neural Network
链接: https://arxiv.org/abs/2501.08341
作者: Xiguang Yang,Krish Arora,Michael Bachmann
类目: Disordered Systems and Neural Networks (cond-mat.dis-nn); Statistical Mechanics (cond-mat.stat-mech); Machine Learning (cs.LG); Computational Physics (physics.comp-ph)
*备注: 12 pages, 8 figures, and 2 tables
Abstract:We investigate the loss landscape and backpropagation dynamics of convergence for the simplest possible artificial neural network representing the logical exclusive-OR (XOR) gate. Cross-sections of the loss landscape in the nine-dimensional parameter space are found to exhibit distinct features, which help explain why backpropagation efficiently achieves convergence toward zero loss, whereas values of weights and biases keep drifting. Differences in shapes of cross-sections obtained by nonrandomized and randomized batches are discussed. In reference to statistical physics we introduce the microcanonical entropy as a unique quantity that allows one to characterize the phase behavior of the network. Learning in neural networks can thus be thought of as an annealing process that experiences the analogue of phase transitions known from thermodynamic systems. It also reveals how the loss landscape simplifies as more hidden neurons are added to the network, eliminating entropic barriers caused by finite-size effects.
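文中的“最小 XOR 网络”指 2 输入、2 个隐藏 sigmoid 单元、1 个输出,共九个可训练参数。下面用纯 Python 实现该网络与逐样本反向传播(初始化与学习率为示例选择;不同随机种子可能落入损失地形的不同区域,这正是论文研究的现象):

```python
import math, random

# The smallest XOR network: 2 inputs -> 2 hidden sigmoid units -> 1
# sigmoid output, nine trainable parameters in total, trained by
# per-sample backpropagation on the four XOR patterns.
def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

rng = random.Random(0)
w1 = [[rng.uniform(-1, 1) for _ in range(2)] for _ in range(2)]  # 4 params
b1 = [rng.uniform(-1, 1) for _ in range(2)]                      # 2 params
w2 = [rng.uniform(-1, 1) for _ in range(2)]                      # 2 params
b2 = rng.uniform(-1, 1)                                          # 1 param
data = [([0, 0], 0), ([0, 1], 1), ([1, 0], 1), ([1, 1], 0)]

def forward(x):
    h = [sigmoid(w1[j][0]*x[0] + w1[j][1]*x[1] + b1[j]) for j in range(2)]
    return h, sigmoid(w2[0]*h[0] + w2[1]*h[1] + b2)

def epoch(lr=0.5):
    global b2
    loss = 0.0
    for x, t in data:
        h, y = forward(x)
        loss += (y - t) ** 2 / 4
        d_out = 2 * (y - t) * y * (1 - y) / 4    # dL/d(output pre-act)
        for j in range(2):
            d_h = d_out * w2[j] * h[j] * (1 - h[j])  # uses pre-update w2
            w2[j] -= lr * d_out * h[j]
            for i in range(2):
                w1[j][i] -= lr * d_h * x[i]
            b1[j] -= lr * d_h
        b2 -= lr * d_out
    return loss

losses = [epoch() for _ in range(4000)]
print(f"loss: {losses[0]:.3f} -> {losses[-1]:.3f}")
```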
信息检索
[IR-0] Real-time Indexing for Large-scale Recommendation by Streaming Vector Quantization Retriever
链接: https://arxiv.org/abs/2501.08695
作者: Xingyan Bin,Jianfei Cui,Wujie Yan,Zhichen Zhao,Xintian Han,Chongyang Yan,Feng Zhang,Xun Zhou,Qi Wu,Zuotao Liu
类目: Information Retrieval (cs.IR)
*备注:
Abstract:Retrievers, which form one of the most important recommendation stages, are responsible for efficiently selecting possible positive samples for the later stages under strict latency limitations. Because of this, large-scale systems typically rely on approximate calculations and indexes to roughly shrink the candidate scale, paired with a simple ranking model. Since simple models lack the ability to produce precise predictions, most existing methods focus on incorporating more sophisticated ranking models. However, a second fundamental problem, index effectiveness, remains unresolved, and it also bottlenecks further model sophistication. In this paper, we propose a novel index structure: the streaming Vector Quantization model, as a new generation of retrieval paradigm. Streaming VQ attaches items to indexes in real time, granting it immediacy. Moreover, through meticulous verification of possible variants, it achieves additional benefits like index balancing and reparability, enabling it to support sophisticated ranking models, as existing approaches do. As a lightweight and implementation-friendly architecture, streaming VQ has been deployed and has replaced all major retrievers in Douyin and Douyin Lite, resulting in remarkable user engagement gain.
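流式向量量化的基本机制可以概括为:新物料的嵌入到来时,立即被指派给最近的码字(即其索引),同时该码字以指数滑动平均向物料方向微调。以下为示意性草图(码本大小、学习率与随机嵌入均为示例假设,与抖音线上实现无关):

```python
import random

# Streaming vector-quantization sketch: each arriving item embedding is
# assigned to its nearest codeword (its "index") in real time, and that
# codeword is nudged toward the item with an exponential moving average.
def nearest(codebook, v):
    return min(range(len(codebook)),
               key=lambda k: sum((codebook[k][d] - v[d]) ** 2
                                 for d in range(len(v))))

def stream_assign(codebook, item, lr=0.05):
    k = nearest(codebook, item)
    codebook[k] = [c + lr * (x - c) for c, x in zip(codebook[k], item)]
    return k                  # index attached to the item immediately

rng = random.Random(1)
dim, n_codes = 4, 8
codebook = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(n_codes)]
index = {}
for item_id in range(1000):
    emb = [rng.gauss(0, 1) for _ in range(dim)]
    index[item_id] = stream_assign(codebook, emb)
counts = [list(index.values()).count(k) for k in range(n_codes)]
print(f"items per codeword: {counts}")
```

在线更新使索引无需周期性重建即可跟上物料分布的漂移;论文进一步讨论了索引均衡与可修复性等变体设计。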
[IR-1] Continuous Approach to Phase (Norm) Retrieval Frames
链接: https://arxiv.org/abs/2501.08927
作者: Ramin Farshchian,Rajab Ali Kamyabi-Gol,Fahimeh Arabyani-Neyshaburi,Fatemeh Esmaeelzadeh
类目: Functional Analysis (math.FA); Information Retrieval (cs.IR); Mathematical Physics (math-ph); Numerical Analysis (math.NA); Optics (physics.optics)
*备注:
Abstract:This paper investigates the properties of continuous frames, with a particular focus on phase retrieval and norm retrieval in the context of Hilbert spaces. We introduce the concept of continuous near-Riesz bases and prove their invariance under invertible operators. Some equivalent conditions for phase and norm retrieval property of continuous frames are presented. We study the stability of phase retrieval under perturbations. Furthermore, tensor product frames for separable Hilbert spaces are studied, and we establish the equivalence of phase retrieval and norm retrieval properties between components and their tensor products.
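作为术语背景,论文所推广的(离散框架情形的)相位恢复与范数恢复的通行定义如下(这是文献中的标准定义,并非本文的新结果):

```latex
% For a frame \{f_i\}_{i \in I} of a Hilbert space H:
\[
  \textbf{Phase retrieval:}\quad
  |\langle x, f_i\rangle| = |\langle y, f_i\rangle| \ \ \forall i \in I
  \;\Longrightarrow\; x = c\,y \ \text{ for some scalar } |c| = 1.
\]
\[
  \textbf{Norm retrieval:}\quad
  |\langle x, f_i\rangle| = |\langle y, f_i\rangle| \ \ \forall i \in I
  \;\Longrightarrow\; \|x\| = \|y\|.
\]
```

连续框架情形将对可数指标集 I 的求和条件换为对测度空间的积分条件,论文即在该设定下研究这两种性质的等价刻画与稳定性。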