This post presents the latest list of papers retrieved from arXiv.org on 2024-12-17. It is updated automatically and organized into five broad areas: NLP, CV, ML, AI, and IR. If you would like to receive the daily paper digest by email, please leave your email address in the comments.
Table of Contents
Overview (2024-12-17)
A total of 975 papers were updated today, including:
- Natural Language Processing: 166 papers (Computation and Language, cs.CL)
- Artificial Intelligence: 273 papers (Artificial Intelligence, cs.AI)
- Computer Vision: 300 papers (Computer Vision and Pattern Recognition, cs.CV)
- Machine Learning: 292 papers (Machine Learning, cs.LG)
Natural Language Processing
[NLP-0] SepLLM: Accelerate Large Language Models by Compressing One Segment into One Separator
[Quick Read]: This paper addresses the computational demands and inference speed of large language models (LLMs), particularly the cost incurred by their quadratic complexity. The key finding is that certain seemingly meaningless special tokens (separators) receive a disproportionately large share of attention scores, far exceeding semantically meaningful tokens. Based on this observation, the paper proposes SepLLM, a framework that accelerates inference by compressing the segments between these separator tokens and eliminating redundant tokens. The framework is plug-and-play and includes efficient kernels for training acceleration. Experiments show that SepLLM is effective across different training settings; notably, with a Llama-3-8B backbone it achieves over 50% KV cache reduction on the GSM8K-CoT benchmark while maintaining comparable performance. In streaming settings, SepLLM can process sequences of up to 4 million tokens or more while maintaining consistent language modeling capability.
Link: https://arxiv.org/abs/2412.12094
Authors: Guoxuan Chen,Han Shi,Jiawei Li,Yihang Gao,Xiaozhe Ren,Yimeng Chen,Xin Jiang,Zhenguo Li,Weiyang Liu,Chao Huang
Affiliations: Unknown
Keywords: Large Language Models, language processing tasks, natural language processing, exhibited exceptional performance, Large Language
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:
Abstract:Large Language Models (LLMs) have exhibited exceptional performance across a spectrum of natural language processing tasks. However, their substantial sizes pose considerable challenges, particularly in computational demands and inference speed, due to their quadratic complexity. In this work, we have identified a key pattern: certain seemingly meaningless special tokens (i.e., separators) contribute disproportionately to attention scores compared to semantically meaningful tokens. This observation suggests that information of the segments between these separator tokens can be effectively condensed into the separator tokens themselves without significant information loss. Guided by this insight, we introduce SepLLM, a plug-and-play framework that accelerates inference by compressing these segments and eliminating redundant tokens. Additionally, we implement efficient kernels for training acceleration. Experimental results across training-free, training-from-scratch, and post-training settings demonstrate SepLLM’s effectiveness. Notably, using the Llama-3-8B backbone, SepLLM achieves over 50% reduction in KV cache on the GSM8K-CoT benchmark while maintaining comparable performance. Furthermore, in streaming settings, SepLLM effectively processes sequences of up to 4 million tokens or more while maintaining consistent language modeling capabilities.
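As a rough illustration of the separator-based sparsification idea described above, the sketch below builds a boolean causal attention mask that keeps only a few initial tokens, separator tokens, and a local window of recent tokens. The separator IDs, window size, and number of initial tokens are made-up assumptions for illustration, not the paper's actual configuration or kernel.

```python
import torch

def separator_sparse_mask(token_ids, separator_ids, n_initial=4, window=8):
    """Boolean causal mask: True means the key position is kept for that query.

    Keeps (i) the first n_initial tokens, (ii) separator tokens,
    (iii) a local window of recent tokens. A hypothetical sketch, not the
    official SepLLM implementation.
    """
    n = token_ids.shape[0]
    is_sep = torch.isin(token_ids, separator_ids)
    causal = torch.tril(torch.ones(n, n, dtype=torch.bool))
    keep = torch.zeros(n, n, dtype=torch.bool)
    keep[:, :n_initial] = True                     # initial "sink" tokens
    keep |= is_sep.unsqueeze(0)                    # separator tokens
    idx = torch.arange(n)
    keep |= (idx.unsqueeze(0) - idx.unsqueeze(1)).abs() < window  # local window
    return keep & causal

# toy usage: pretend token ids 13 and 27 are "." and "," separators
ids = torch.randint(0, 100, (32,))
mask = separator_sparse_mask(ids, separator_ids=torch.tensor([13, 27]))
print(mask.shape, mask.float().mean().item())  # fraction of kept query/key pairs
```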
[NLP-1] Making FETCH! Happen: Finding Emergent Dog Whistles Through Common Habitats
[Quick Read]: This paper tackles the problem of identifying novel "dog whistles" in social media corpora. Dog whistles are coded expressions with dual meanings, often used to convey controversial political opinions while maintaining plausible deniability. Existing identification methods rely on manually curated lexicons, which struggle to keep up with newly emerging expressions. The proposed solution, EarShot, combines the strengths of vector databases and large language models (LLMs) to identify new dog whistle expressions efficiently and effectively.
Link: https://arxiv.org/abs/2412.12072
Authors: Kuleen Sasse,Carlos Aguirre,Isabel Cachola,Sharon Levy,Mark Dredze
Affiliations: Johns Hopkins University; Rutgers University
Keywords: upsetting or offensive, WARNING, Dog whistles, Large Language Models, Dog
Subjects: Computation and Language (cs.CL)
Comments:
Abstract:WARNING: This paper contains content that may be upsetting or offensive to some readers. Dog whistles are coded expressions with dual meanings: one intended for the general public (outgroup) and another that conveys a specific message to an intended audience (ingroup). Often, these expressions are used to convey controversial political opinions while maintaining plausible deniability and slip by content moderation filters. Identification of dog whistles relies on curated lexicons, which have trouble keeping up to date. We introduce FETCH!, a task for finding novel dog whistles in massive social media corpora. We find that state-of-the-art systems fail to achieve meaningful results across three distinct social media case studies. We present EarShot, a novel system that combines the strengths of vector databases and Large Language Models (LLMs) to efficiently and effectively identify new dog whistles.
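The EarShot pipeline is only described at a high level above. Purely to illustrate the general pattern (embed posts into a vector index, retrieve candidates near known coded terms, then ask an LLM to verify), here is a hedged sketch; the embedding model name, the seed lexicon, and the `query_llm` helper are placeholders rather than the paper's actual components.

```python
from sentence_transformers import SentenceTransformer
import numpy as np

def find_candidate_dog_whistles(posts, seed_terms, top_k=20):
    """Retrieve posts semantically close to known coded terms (sketch only)."""
    model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed embedding model
    post_vecs = model.encode(posts, normalize_embeddings=True)
    seed_vecs = model.encode(seed_terms, normalize_embeddings=True)
    sims = post_vecs @ seed_vecs.T           # cosine similarity (vectors are normalized)
    scores = sims.max(axis=1)                # similarity to the nearest seed term
    order = np.argsort(-scores)[:top_k]
    return [(posts[i], float(scores[i])) for i in order]

def verify_with_llm(candidate_post, query_llm):
    """query_llm is a hypothetical callable wrapping any chat LLM."""
    prompt = (
        "Does the following post contain a coded expression (dog whistle) "
        "with a hidden in-group meaning? Answer yes or no and name the term.\n\n"
        f"Post: {candidate_post}"
    )
    return query_llm(prompt)
```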
[NLP-2] Semi-automated analysis of audio-recorded lessons: The case of teachers engaging messages
[Quick Read]: This paper addresses the problem of observing and analyzing the engaging messages teachers deliver in class. The key to the solution is an efficient method that uses automatic transcription and keyword-based filtering analysis to extract and classify these messages from a large collection of audio-recorded lessons. The approach reduces the amount of material to be analyzed by 90%, saving substantial time and resources compared with traditional manual coding.
Link: https://arxiv.org/abs/2412.12062
Authors: Samuel Falcon,Carmen Alvarez-Alvarez,Jaime Leon
Affiliations: Unknown
Keywords: influences student outcomes, student outcomes, Engaging messages delivered, Engaging messages, key aspect
Subjects: Computation and Language (cs.CL); Computers and Society (cs.CY)
Comments:
Abstract:Engaging messages delivered by teachers are a key aspect of the classroom discourse that influences student outcomes. However, improving this communication is challenging due to difficulties in obtaining observations. This study presents a methodology for efficiently extracting actual observations of engaging messages from audio-recorded lessons. We collected 2,477 audio-recorded lessons from 75 teachers over two academic years. Using automatic transcription and keyword-based filtering analysis, we identified and classified engaging messages. This method reduced the information to be analysed by 90%, optimising the time and resources required compared to traditional manual coding. Subsequent descriptive analysis revealed that the most used messages emphasised the future benefits of participating in school activities. In addition, the use of engaging messages decreased as the academic year progressed. This study offers insights for researchers seeking to extract information from teachers’ discourse in naturalistic settings and provides useful information for designing interventions to improve teachers’ communication strategies.
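The keyword-based filtering step described above is straightforward to prototype. The sketch below assumes transcripts are already available as plain text and uses an invented keyword list; the study's actual keyword inventory and coding scheme are not reproduced here.

```python
import re

# Hypothetical keyword list; the study's actual engaging-message keywords are not given here.
KEYWORDS = ["future", "useful", "important", "exam", "career", "interesting"]

def filter_engaging_segments(transcript, keywords=KEYWORDS, context_chars=120):
    """Return snippets around keyword hits so only a fraction of the text needs manual review."""
    pattern = "|".join(map(re.escape, keywords))
    hits = []
    for match in re.finditer(pattern, transcript, re.IGNORECASE):
        start = max(0, match.start() - context_chars)
        end = min(len(transcript), match.end() + context_chars)
        hits.append(transcript[start:end])
    return hits

lesson = "Remember, this topic will be really useful for your future career in engineering."
for snippet in filter_engaging_segments(lesson):
    print(snippet)
```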
[NLP-3] Virtual Agent-Based Communication Skills Training to Facilitate Health Persuasion Among Peers
[Quick Read]: This paper addresses the lack of guidance and practice opportunities for laypeople who want to help family or friends improve health behaviors, especially when those behaviors are potentially stigmatizing or controversial. The key is using virtual agents to coach community volunteers in health counseling techniques such as motivational interviewing and to let them practice these skills in role-playing scenarios. The study applies a virtual agent-based system to increase COVID-19 vaccination by empowering users to influence their social networks, and tests in a controlled study how system interactivity and role-playing functionality affect counseling outcomes. Results show that all versions of the system produce peer counselors who score adequately on a standardized measure of counseling competence, and that participants were significantly more satisfied with interactive virtual agents than with passive viewing of the training material.
Link: https://arxiv.org/abs/2412.12061
Authors: Farnaz Nouraei,Keith Rebello,Mina Fallah,Prasanth Murali,Haley Matuszak,Valerie Jap,Andrea Parker,Michael Paasche-Orlow,Timothy Bickmore
Affiliations: Northeastern University; University of Waterloo; Georgia Institute of Technology; Tufts Medical Center
Keywords: health behavior, stigmatizing or controversial, laypeople are motivated, motivated to improve, family or friends
Subjects: Human-Computer Interaction (cs.HC); Computation and Language (cs.CL); Computers and Society (cs.CY)
Comments: Accepted at CSCW '24
Abstract:Many laypeople are motivated to improve the health behavior of their family or friends but do not know where to start, especially if the health behavior is potentially stigmatizing or controversial. We present an approach that uses virtual agents to coach community-based volunteers in health counseling techniques, such as motivational interviewing, and allows them to practice these skills in role-playing scenarios. We use this approach in a virtual agent-based system to increase COVID-19 vaccination by empowering users to influence their social network. In a between-subjects comparative design study, we test the effects of agent system interactivity and role-playing functionality on counseling outcomes, with participants evaluated by standardized patients and objective judges. We find that all versions are effective at producing peer counselors who score adequately on a standardized measure of counseling competence, and that participants were significantly more satisfied with interactive virtual agents compared to passive viewing of the training material. We discuss design implications for interpersonal skills training systems based on our findings.
[NLP-4] How Private are Language Models in Abstractive Summarization?
[Quick Read]: This paper investigates whether language models (LMs) leak personally identifying information (PII) from source documents when used for abstractive summarization. The key is a systematic study of closed- and open-weight LMs of different sizes and families, experimenting with prompting and fine-tuning strategies for privacy preservation on summarization datasets across several domains. The results show that current LMs often cannot prevent PII leakage in their summaries and that widely used evaluation metrics fail to capture context-dependent privacy risks.
Link: https://arxiv.org/abs/2412.12040
Authors: Anthony Hughes,Nikolaos Aletras,Ning Ma
Affiliations: University of Sheffield
Keywords: shown outstanding performance, Language models, medicine and law, shown outstanding, outstanding performance
Subjects: Computation and Language (cs.CL)
Comments:
Abstract:Language models (LMs) have shown outstanding performance in text summarization including sensitive domains such as medicine and law. In these settings, it is important that personally identifying information (PII) included in the source document should not leak in the summary. Prior efforts have mostly focused on studying how LMs may inadvertently elicit PII from training data. However, to what extent LMs can provide privacy-preserving summaries given a non-private source document remains under-explored. In this paper, we perform a comprehensive study across two closed- and three open-weight LMs of different sizes and families. We experiment with prompting and fine-tuning strategies for privacy-preservation across a range of summarization datasets across three domains. Our extensive quantitative and qualitative analysis including human evaluation shows that LMs often cannot prevent PII leakage on their summaries and that current widely-used metrics cannot capture context dependent privacy risks.
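A toy sketch of one way to flag PII carry-over from a source document into its summary follows. The regex patterns and the exact-match criterion are simplifications assumed here for illustration, not the paper's evaluation protocol, which also relies on human judgment.

```python
import re

PII_PATTERNS = {
    "email": r"[\w.+-]+@[\w-]+\.[\w.]+",
    "phone": r"\+?\d[\d\s().-]{7,}\d",
    "date_of_birth": r"\b\d{1,2}/\d{1,2}/\d{2,4}\b",
}

def pii_leakage(source, summary):
    """Return PII strings that appear in both the source and the summary."""
    leaked = {}
    for label, pattern in PII_PATTERNS.items():
        source_hits = set(re.findall(pattern, source))
        summary_hits = set(re.findall(pattern, summary))
        common = source_hits & summary_hits
        if common:
            leaked[label] = sorted(common)
    return leaked

src = "Patient John Doe (DOB 04/12/1987) can be reached at john.doe@mail.com."
summ = "The patient, reachable at john.doe@mail.com, reports chest pain."
print(pii_leakage(src, summ))  # {'email': ['john.doe@mail.com']}
```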
[NLP-5] Can LLM Prompting Serve as a Proxy for Static Analysis in Vulnerability Detection
[Quick Read]: This paper addresses the limited ability of large language models (LLMs) on the applied task of vulnerability detection. The key is a comprehensive prompting strategy that combines natural language descriptions of vulnerabilities, contrastive chain-of-thought reasoning, and contrastive samples drawn from a synthetic dataset. Integrating these elements into a single prompting framework markedly improves LLMs' understanding of vulnerabilities, yielding gains of 23%, 11%, and 14% in accuracy, F1-score, and pairwise accuracy, respectively, on a high-quality vulnerability detection dataset such as SVEN.
Link: https://arxiv.org/abs/2412.12039
Authors: Ira Ceka,Feitong Qiao,Anik Dey,Aastha Valechia,Gail Kaiser,Baishakhi Ray
Affiliations: Columbia University, New York, NY, USA; Amherst College, Amherst, MA, USA
Keywords: large language models, shown limited ability, remarkable success, vulnerability detection, shown limited
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Software Engineering (cs.SE)
Comments:
Abstract:Despite their remarkable success, large language models (LLMs) have shown limited ability on applied tasks such as vulnerability detection. We investigate various prompting strategies for vulnerability detection and, as part of this exploration, propose a prompting strategy that integrates natural language descriptions of vulnerabilities with a contrastive chain-of-thought reasoning approach, augmented using contrastive samples from a synthetic dataset. Our study highlights the potential of LLMs to detect vulnerabilities by integrating natural language descriptions, contrastive reasoning, and synthetic examples into a comprehensive prompting framework. Our results show that this approach can enhance LLM understanding of vulnerabilities. On a high-quality vulnerability detection dataset such as SVEN, our prompting strategies can improve accuracies, F1-scores, and pairwise accuracies by 23%, 11%, and 14%, respectively.
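To make the prompting strategy concrete, here is a hedged sketch of how such a prompt might be assembled. The description text, the contrastive pair, and the field names are invented placeholders rather than the paper's exact templates.

```python
def build_vulnerability_prompt(code_snippet, cwe_description, safe_example, unsafe_example):
    """Assemble a contrastive chain-of-thought prompt for vulnerability detection (sketch)."""
    return f"""You are a security reviewer.

Vulnerability description:
{cwe_description}

Example WITHOUT the vulnerability:
{safe_example}

Example WITH the vulnerability:
{unsafe_example}

Step by step, compare the target code against both examples, explain the
differences that matter for this vulnerability, then answer "vulnerable"
or "not vulnerable".

Target code:
{code_snippet}
"""

prompt = build_vulnerability_prompt(
    code_snippet="strcpy(buf, user_input);",
    cwe_description="CWE-787: out-of-bounds write via an unchecked buffer copy.",  # illustrative
    safe_example="strncpy(buf, user_input, sizeof(buf) - 1);",
    unsafe_example="strcpy(buf, user_input);",
)
print(prompt[:200])
```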
[NLP-6] The Open Source Advantage in Large Language Models (LLMs)
[Quick Read]: This paper examines the transparency, accessibility, and ethical issues that separate closed-source and open-source large language models (LLMs) in NLP. Closed-source models such as GPT-4 lead in performance, but their "black box" nature and restricted access hinder reproducibility and equitable AI development. The paper's key position is support for open-source models such as LLaMA and BLOOM, which, through community-driven development and computational efficiency, have significantly narrowed the performance gap, particularly for linguistic diversity and domain-specific applications. It further argues for hybrid approaches that combine the scaling strengths of closed-source models with the adaptability of open-source models to balance technical performance, accessibility, and ethical deployment.
Link: https://arxiv.org/abs/2412.12004
Authors: Jiya Manchanda,Laura Boettcher,Matheus Westphalen,Jasser Jasser
Affiliations: Rollins College
Keywords: advanced text generation, Large language models, natural language processing, Large language, mark a key
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: 7 pages, 0 figures
Abstract:Large language models (LLMs) mark a key shift in natural language processing (NLP), having advanced text generation, translation, and domain-specific reasoning. Closed-source models like GPT-4, powered by proprietary datasets and extensive computational resources, lead with state-of-the-art performance today. However, they face criticism for their “black box” nature and for limiting accessibility in a manner that hinders reproducibility and equitable AI development. By contrast, open-source initiatives like LLaMA and BLOOM prioritize democratization through community-driven development and computational efficiency. These models have significantly reduced performance gaps, particularly in linguistic diversity and domain-specific applications, while providing accessible tools for global researchers and developers. Notably, both paradigms rely on foundational architectural innovations, such as the Transformer framework by Vaswani et al. (2017). Closed-source models excel by scaling effectively, while open-source models adapt to real-world applications in underrepresented languages and domains. Techniques like Low-Rank Adaptation (LoRA) and instruction-tuning datasets enable open-source models to achieve competitive results despite limited resources. To be sure, the tension between closed-source and open-source approaches underscores a broader debate on transparency versus proprietary control in AI. Ethical considerations further highlight this divide. Closed-source systems restrict external scrutiny, while open-source models promote reproducibility and collaboration but lack standardized auditing documentation frameworks to mitigate biases. Hybrid approaches that leverage the strengths of both paradigms are likely to shape the future of LLM innovation, ensuring accessibility, competitive technical performance, and ethical deployment.
[NLP-7] LLM-RG4: Flexible and Factual Radiology Report Generation across Diverse Input Contexts
[Quick Read]: This paper targets the inflexibility of current radiology report generation (RRG) models, which are constrained to fixed task paradigms, causing mismatches between inputs and outputs and producing input-agnostic hallucinations. The key components are: (1) a data generation pipeline that builds the new MIMIC-RG4 dataset, covering four common radiology report drafting scenarios with fully corresponding inputs and outputs; (2) an LLM-based RRG framework, LLM-RG4, which exploits the flexible instruction-following ability and broad general knowledge of large language models; (3) an adaptive token fusion module that handles diverse input combinations while minimizing the extra computational burden; and (4) a token-level loss weighting strategy that directs the model's attention to positive and uncertain descriptions. Experiments show that LLM-RG4 achieves state-of-the-art performance in both clinical efficacy and natural language generation, with minimal input-agnostic hallucinations.
Link: https://arxiv.org/abs/2412.12001
Authors: Zhuhao Wang,Yihua Sun,Zihan Li,Xuan Yang,Fang Chen,Hongen Liao
Affiliations: 1. Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences; 2. School of Biomedical Engineering, Southern University of Science and Technology; 3. Guangdong Provincial Key Laboratory of Biomedical Measurements and Ultrasound Imaging
Keywords: radiologists tail content, complex task requiring, radiology report drafting, Drafting radiology reports, task requiring flexibility
Subjects: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Drafting radiology reports is a complex task requiring flexibility, where radiologists tailor content to available information and particular clinical demands. However, most current radiology report generation (RRG) models are constrained to a fixed task paradigm, such as predicting the full “finding” section from a single image, inherently involving a mismatch between inputs and outputs. The trained models lack the flexibility for diverse inputs and could generate harmful, input-agnostic hallucinations. To bridge the gap between current RRG models and the clinical demands in practice, we first develop a data generation pipeline to create a new MIMIC-RG4 dataset, which considers four common radiology report drafting scenarios and has perfectly corresponded input and output. Secondly, we propose a novel large language model (LLM) based RRG framework, namely LLM-RG4, which utilizes LLM’s flexible instruction-following capabilities and extensive general knowledge. We further develop an adaptive token fusion module that offers flexibility to handle diverse scenarios with different input combinations, while minimizing the additional computational burden associated with increased input volumes. Besides, we propose a token-level loss weighting strategy to direct the model’s attention towards positive and uncertain descriptions. Experimental results demonstrate that LLM-RG4 achieves state-of-the-art performance in both clinical efficiency and natural language generation on the MIMIC-RG4 and MIMIC-CXR datasets. We quantitatively demonstrate that our model has minimal input-agnostic hallucinations, whereas current open-source models commonly suffer from this problem.
[NLP-8] ExecRepoBench: Multi-level Executable Code Completion Evaluation
[Quick Read]: This paper addresses the shortcomings of existing code completion benchmarks in capturing the dynamic nature of real-world programming, including limited context length, reliance on superficial evaluation metrics, and potential overfitting to training data. The key contributions are a framework built around the repository-level benchmark ExecRepoBench and the instruction corpus Repo-Instruct, aimed at improving open-source LLMs in realistic coding scenarios with complex cross-file dependencies. Specifically: (1) ExecRepoBench is built from 1.2K samples drawn from active Python repositories; (2) a multi-level grammar-based completion method conditioned on the abstract syntax tree masks code fragments at different logical units (statements, expressions, and functions); and (3) a 7B-parameter open-source LLM is fine-tuned on Repo-Instruct to produce the strong baseline Qwen2.5-Coder-Instruct-C, which is rigorously evaluated on existing benchmarks such as MultiPL-E and ExecRepoBench and consistently outperforms prior baselines across all programming languages.
Link: https://arxiv.org/abs/2412.11990
Authors: Jian Yang,Jiajun Zhang,Jiaxi Yang,Ke Jin,Lei Zhang,Qiyao Peng,Ken Deng,Yibo Miao,Tianyu Liu,Zeyu Cui,Binyuan Hui,Junyang Lin
Affiliations: Alibaba Group; University of Science and Technology of China; University of Chinese Academy of Sciences; Shanghai Jiao Tong University
Keywords: daily software development, essential tool, tool for daily, Code completion, daily software
Subjects: Computation and Language (cs.CL)
Comments:
Abstract:Code completion has become an essential tool for daily software development. Existing evaluation benchmarks often employ static methods that do not fully capture the dynamic nature of real-world coding environments and face significant challenges, including limited context length, reliance on superficial evaluation metrics, and potential overfitting to training datasets. In this work, we introduce a novel framework for enhancing code completion in software development through the creation of a repository-level benchmark ExecRepoBench and the instruction corpora Repo-Instruct, aiming at improving the functionality of open-source large language models (LLMs) in real-world coding scenarios that involve complex interdependencies across multiple files. ExecRepoBench includes 1.2K samples from active Python repositories. Plus, we present a multi-level grammar-based completion methodology conditioned on the abstract syntax tree to mask code fragments at various logical units (e.g. statements, expressions, and functions). Then, we fine-tune the open-source LLM with 7B parameters on Repo-Instruct to produce a strong code completion baseline model Qwen2.5-Coder-Instruct-C based on the open-source model. Qwen2.5-Coder-Instruct-C is rigorously evaluated against existing benchmarks, including MultiPL-E and ExecRepoBench, which consistently outperforms prior baselines across all programming languages. The deployment of our method can be used as a high-performance, local service for programming development (this https URL).
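The abstract mentions masking code fragments at different logical units via the abstract syntax tree. The following is a small sketch of that general idea using Python's `ast` module; the mask token and the "first function body" selection rule are assumptions made for illustration, not the benchmark's actual masking scheme.

```python
import ast

MASK = "<MASK>"  # assumed placeholder token

def mask_first_function_body(source_code):
    """Replace the body of the first function in a module with a mask token,
    returning (masked_code, ground_truth_body). A sketch of AST-level masking."""
    tree = ast.parse(source_code)
    for node in ast.walk(tree):
        if isinstance(node, ast.FunctionDef):
            start, end = node.body[0].lineno, node.body[-1].end_lineno
            lines = source_code.splitlines()
            ground_truth = "\n".join(lines[start - 1:end])
            masked = "\n".join(lines[:start - 1] + [f"    {MASK}"] + lines[end:])
            return masked, ground_truth
    return source_code, ""

code = "def add(a, b):\n    result = a + b\n    return result\n"
masked, target = mask_first_function_body(code)
print(masked)
print("---")
print(target)
```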
[NLP-9] SciFaultyQA: Benchmarking LLM s on Faulty Science Question Detection with a GAN-Inspired Approach to Synthetic Dataset Generation
[Quick Read]: This paper addresses the failure of current large language models (LLMs) to recognize that a question itself is flawed when faced with intentionally faulty science questions, instead producing nonsensical answers. The key contribution is SciFaultyQA, a dataset of intentionally faulty science questions used to evaluate and benchmark how well LLMs detect such flaws. By analyzing model behavior on the dataset, the authors develop a GAN-inspired method for generating synthetic datasets and propose new strategies to reduce errors, improving the accuracy and consistency of LLMs when identifying and handling faulty questions.
Link: https://arxiv.org/abs/2412.11988
Authors: Debarshi Kundu
Affiliations: Unknown
Keywords: Gemini Flash frequently, Gemini Flash, woman can produce, Flash frequently answer, woman
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments:
Abstract:Consider the problem: “If one man and one woman can produce one child in one year, how many children will be produced by one woman and three men in 0.5 years?” Current large language models (LLMs) such as GPT-4o, GPT-o1-preview, and Gemini Flash frequently answer “0.5,” which does not make sense. While these models sometimes acknowledge the unrealistic nature of the question, in many cases (8 out of 10 trials), they provide the nonsensical answer of “0.5 child.” Additionally, temporal variation has been observed: if an LLM answers correctly once (by recognizing the faulty nature of the question), subsequent responses are more likely to also reflect this understanding. However, this is inconsistent. These types of questions have motivated us to develop a dataset of science questions, SciFaultyQA, where the questions themselves are intentionally faulty. We observed that LLMs often proceed to answer these flawed questions without recognizing their inherent issues, producing results that are logically or scientifically invalid. By analyzing such patterns, we developed a novel method for generating synthetic datasets to evaluate and benchmark the performance of various LLMs in identifying these flawed questions. We have also developed novel approaches to reduce the errors.
[NLP-10] Speak & Improve Corpus 2025: an L2 English Speech Corpus for Language Assessment and Feedback
[Quick Read]: This paper addresses a major obstacle to developing second-language (L2) spoken language processing systems: the lack of publicly available data with high-quality annotations. The key contribution is the release of the Speak & Improve Corpus 2025, a corpus of L2 learner English with holistic scores and language error annotation that covers a wide range of speaker attributes (such as L1 and speaking ability) and provides manual annotations. The corpus supports research on a range of language-learning tasks, such as speaking proficiency assessment and spoken grammatical error correction (GEC), as well as the underlying technology, including automatic speech recognition (ASR) for low-resource L2 learner English and disfluency detection.
Link: https://arxiv.org/abs/2412.11986
Authors: Kate Knill,Diane Nicholls,Mark J.F. Gales,Mengjie Qian,Pawel Stroinski
Affiliations: Unknown
Keywords: Improve learning platform, language processing systems, introduce the Speak, Improve learning, https URL
Subjects: Computation and Language (cs.CL)
Comments:
Abstract:We introduce the Speak & Improve Corpus 2025, a dataset of L2 learner English data with holistic scores and language error annotation, collected from open (spontaneous) speaking tests on the Speak & Improve learning platform (this https URL). The aim of the corpus release is to address a major challenge to developing L2 spoken language processing systems, the lack of publicly available data with high-quality annotations. It is being made available for non-commercial use on the ELiT website. In designing this corpus we have sought to make it cover a wide range of speaker attributes, from their L1 to their speaking ability, as well as providing manual annotations. This enables a range of language-learning tasks to be examined, such as assessing speaking proficiency or providing feedback on grammatical errors in a learner’s speech. Additionally, the data supports research into the underlying technology required for these tasks including automatic speech recognition (ASR) of low resource L2 learner English, disfluency detection or spoken grammatical error correction (GEC). The corpus consists of around 340 hours of L2 English learners audio with holistic scores, and a subset of audio annotated with transcriptions and error labels.
[NLP-11] Speak & Improve Challenge 2025: Tasks and Baseline Systems
[Quick Read]: This paper aims to advance research on spoken language assessment and feedback through the Speak & Improve Challenge 2025 and the accompanying Speak & Improve (SI) Corpus 2025, with shared tasks in automatic speech recognition (ASR), spoken language assessment (SLA), spoken grammatical error correction (SGEC), and SGEC feedback (SGECF). The challenge offers closed and open tracks, allowing participants to use either a predetermined set of models and data sources or any public resource, thereby encouraging both innovation and practical adoption.
Link: https://arxiv.org/abs/2412.11985
Authors: Mengjie Qian,Kate Knill,Stefano Banno,Siyuan Tang,Penny Karanasou,Mark J.F. Gales,Diane Nicholls
Affiliations: Unknown
Keywords: Spoken Language Assessment, Speak Improve Challenge, Speak Improve, Spoken Grammatical Error, Speak Improve learning
Subjects: Computation and Language (cs.CL)
Comments:
Abstract:This paper presents the “Speak & Improve Challenge 2025: Spoken Language Assessment and Feedback” – a challenge associated with the ISCA SLaTE 2025 Workshop. The goal of the challenge is to advance research on spoken language assessment and feedback, with tasks associated with both the underlying technology and language learning feedback. Linked with the challenge, the Speak & Improve (SI) Corpus 2025 is being pre-released, a dataset of L2 learner English data with holistic scores and language error annotation, collected from open (spontaneous) speaking tests on the Speak & Improve learning platform. The corpus consists of 340 hours of audio data from second language English learners with holistic scores, and a 60-hour subset with manual transcriptions and error labels. The Challenge has four shared tasks: Automatic Speech Recognition (ASR), Spoken Language Assessment (SLA), Spoken Grammatical Error Correction (SGEC), and Spoken Grammatical Error Correction Feedback (SGECF). Each of these tasks has a closed track where a predetermined set of models and data sources are allowed to be used, and an open track where any public resource may be used. Challenge participants may do one or more of the tasks. This paper describes the challenge, the SI Corpus 2025, and the baseline systems released for the Challenge.
[NLP-12] Speech Foundation Models and Crowdsourcing for Efficient High-Quality Data Collection COLING2025
[Quick Read]: This paper addresses the high cost of quality control in crowdsourced speech data collection caused by the involvement of non-experts. The key is using Speech Foundation Models (SFMs) to automate the validation process, examining for the first time the cost/quality trade-off in data acquisition. Experiments on French, German, and Korean data show that SFM-based validation can cut costs by more than 40.0% without degrading final data quality, opening the way to more efficient, cost-effective, and scalable speech data collection.
Link: https://arxiv.org/abs/2412.11978
Authors: Beomseok Lee,Marco Gaido,Ioan Calapodescu,Laurent Besacier,Matteo Negri
Affiliations: University of Trento, Italy; Fondazione Bruno Kessler, Italy; NAVER LABS Europe, France
Keywords: non-experts necessitates protocols, Speech Foundation Models, final data quality, ensure final data, speech data acquisition
Subjects: Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Comments: Accepted at COLING 2025 main conference
Abstract:While crowdsourcing is an established solution for facilitating and scaling the collection of speech data, the involvement of non-experts necessitates protocols to ensure final data quality. To reduce the costs of these essential controls, this paper investigates the use of Speech Foundation Models (SFMs) to automate the validation process, examining for the first time the cost/quality trade-off in data acquisition. Experiments conducted on French, German, and Korean data demonstrate that SFM-based validation has the potential to reduce reliance on human validation, resulting in an estimated cost saving of over 40.0% without degrading final data quality. These findings open new opportunities for more efficient, cost-effective, and scalable speech data acquisition.
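One plausible shape for an SFM-based validation check is to transcribe each submitted recording with an off-the-shelf ASR model and accept it only if the transcript is close enough to the prompted text. The sketch below assumes the open-source Whisper model and a word-error-rate threshold; both the model choice and the threshold are assumptions, not the paper's protocol.

```python
import whisper          # openai-whisper package
from jiwer import wer   # word error rate

ASR_MODEL = whisper.load_model("base")   # assumed off-the-shelf speech foundation model

def validate_recording(audio_path, prompt_text, max_wer=0.3):
    """Accept a crowdsourced recording if the ASR transcript is close enough
    to the text the contributor was asked to read. Threshold is illustrative."""
    transcript = ASR_MODEL.transcribe(audio_path)["text"]
    error_rate = wer(prompt_text.lower(), transcript.lower().strip())
    return error_rate <= max_wer, error_rate

# Example (requires a real audio file):
# accepted, rate = validate_recording("clip_0001.wav", "The quick brown fox jumps over the lazy dog")
```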
[NLP-13] Emma-X: An Embodied Multimodal Action Model with Grounded Chain of Thought and Look-ahead Spatial Reasoning
[Quick Read]: This paper addresses the poor generalization of traditional reinforcement learning-based robotic control across diverse environments and unseen objects and instructions. The key is Emma-X, an Embodied Multimodal Action Model with Grounded Chain of Thought and Look-ahead Spatial Reasoning. Emma-X leverages a hierarchical embodiment dataset built on BridgeV2, containing 60,000 robot manipulation trajectories auto-annotated with grounded task reasoning and spatial guidance, combined with a trajectory segmentation strategy based on gripper states and motion trajectories. This improves long-horizon spatial reasoning and grounded task planning, and yields better performance than competitive baselines on real-world robotic tasks that require spatial reasoning.
Link: https://arxiv.org/abs/2412.11974
Authors: Qi Sun,Pengfei Hong,Tej Deep Pala,Vernon Toh,U-Xuan Tan,Deepanway Ghosal,Soujanya Poria
Affiliations: Unknown
Keywords: Traditional reinforcement learning-based, Traditional reinforcement, reinforcement learning-based robotic, learning-based robotic control, robotic control methods
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Comments: this https URL , this https URL
Abstract:Traditional reinforcement learning-based robotic control methods are often task-specific and fail to generalize across diverse environments or unseen objects and instructions. Visual Language Models (VLMs) demonstrate strong scene understanding and planning capabilities but lack the ability to generate actionable policies tailored to specific robotic embodiments. To address this, Visual-Language-Action (VLA) models have emerged, yet they face challenges in long-horizon spatial reasoning and grounded task planning. In this work, we propose the Embodied Multimodal Action Model with Grounded Chain of Thought and Look-ahead Spatial Reasoning, Emma-X. Emma-X leverages our constructed hierarchical embodiment dataset based on BridgeV2, containing 60,000 robot manipulation trajectories auto-annotated with grounded task reasoning and spatial guidance. Additionally, we introduce a trajectory segmentation strategy based on gripper states and motion trajectories, which can help mitigate hallucination in grounding subtask reasoning generation. Experimental results demonstrate that Emma-X achieves superior performance over competitive baselines, particularly in real-world robotic tasks requiring spatial reasoning.
[NLP-14] DARWIN 1.5: Large Language Models as Materials Science Adapted Learners
[Quick Read]: This paper addresses the poor generalizability and transferability of traditional materials discovery and design methods that rely on complex descriptors, whose effectiveness also degrades in practice because of defects and purity issues in experimental data. The key is Darwin 1.5, an open-source large language model (LLM) tailored for materials science. By taking natural language as input, Darwin 1.5 removes the need for task-specific descriptors and adopts a two-stage training strategy combining question-answering (QA) fine-tuning with multi-task learning (MTL) to inject domain knowledge and enable cross-task knowledge transfer. This raises the prediction accuracy of LLMs for material property prediction and discovery by up to 60% over the LLaMA-7B base model and outperforms traditional machine learning models on various materials science tasks, demonstrating the versatility and scalability of LLMs in this domain.
Link: https://arxiv.org/abs/2412.11970
Authors: Tong Xie,Yuwei Wan,Yixuan Liu,Yuchen Zeng,Wenjie Zhang,Chunyu Kit,Dongzhan Zhou,Bram Hoex
Affiliations: Greendynamics; UNSW
Keywords: diverse search spaces, search spaces, aim to find, find components, components and structures
Subjects: Computation and Language (cs.CL)
Comments:
Abstract:Materials discovery and design aim to find components and structures with desirable properties over highly complex and diverse search spaces. Traditional solutions, such as high-throughput simulations and machine learning (ML), often rely on complex descriptors, which hinder generalizability and transferability across tasks. Moreover, these descriptors may deviate from experimental data due to inevitable defects and purity issues in the real world, which may reduce their effectiveness in practical applications. To address these challenges, we propose Darwin 1.5, an open-source large language model (LLM) tailored for materials science. By leveraging natural language as input, Darwin eliminates the need for task-specific descriptors and enables a flexible, unified approach to material property prediction and discovery. We employ a two-stage training strategy combining question-answering (QA) fine-tuning with multi-task learning (MTL) to inject domain-specific knowledge in various modalities and facilitate cross-task knowledge transfer. Through our strategic approach, we achieved a significant enhancement in the prediction accuracy of LLMs, with a maximum improvement of 60% compared to LLaMA-7B base models. It further outperforms traditional machine learning models on various tasks in material science, showcasing the potential of LLMs to provide a more versatile and scalable foundation model for materials discovery and design.
[NLP-15] Inferring Functionality of Attention Heads from their Parameters
[Quick Read]: This paper seeks a comprehensive mapping of the operations implemented by attention heads in large language models (LLMs) without any model training or inference. The key is MAPS (Mapping Attention head ParameterS), an efficient framework that infers head functionality from head parameters. MAPS can both estimate how strongly heads across the model implement a predefined operation and infer the salient functionality of an individual head. Experiments on 20 operations across 6 popular LLMs show that its estimates correlate with head outputs during inference and are causally linked to the model's predictions, while also revealing attention heads for certain operations that were overlooked in previous studies and offering insights into function universality and architectural biases in LLMs.
Link: https://arxiv.org/abs/2412.11965
Authors: Amit Elhelo,Mor Geva
Affiliations: Blavatnik School of Computer Science, Tel Aviv University
Keywords: large language models, building blocks, blocks of large, large language, Attention
Subjects: Computation and Language (cs.CL)
Comments:
Abstract:Attention heads are one of the building blocks of large language models (LLMs). Prior work on investigating their operation mostly focused on analyzing their behavior during inference for specific circuits or tasks. In this work, we seek a comprehensive mapping of the operations they implement in a model. We propose MAPS (Mapping Attention head ParameterS), an efficient framework that infers the functionality of attention heads from their parameters, without any model training or inference. We showcase the utility of MAPS for answering two types of questions: (a) given a predefined operation, mapping how strongly heads across the model implement it, and (b) given an attention head, inferring its salient functionality. Evaluating MAPS on 20 operations across 6 popular LLMs shows its estimations correlate with the head’s outputs during inference and are causally linked to the model’s predictions. Moreover, its mappings reveal attention heads of certain operations that were overlooked in previous studies, and valuable insights on function universality and architecture biases in LLMs. Next, we present an automatic pipeline and analysis that leverage MAPS to characterize the salient operations of a given head. Our pipeline produces plausible operation descriptions for most heads, as assessed by human judgment, while revealing diverse operations.
[NLP-16] The Impact of Token Granularity on the Predictive Power of Language Model Surprisal
[Quick Read]: This paper studies how the granularity of subword tokens affects the ability of language model surprisal to predict human reading difficulty. The key is to manipulate token granularity experimentally and evaluate its impact on modeling the processing difficulty of naturalistic text and garden-path constructions. The results show that a vocabulary size of 8,000 yields surprisal that is most predictive of naturalistic reading times, whereas on garden-path constructions models trained on coarser-grained tokens are more sensitive to syntax and assign higher surprisal to critical regions. Together, these findings indicate that token granularity plays a large role in the quality of language model surprisal for cognitive modeling.
Link: https://arxiv.org/abs/2412.11940
Authors: Byung-Doh Oh,William Schuler
Affiliations: New York University; The Ohio State University
Keywords: human readers, raises questions, token granularity, surprisal, language modeling influence
Subjects: Computation and Language (cs.CL)
Comments:
Abstract:Word-by-word language model surprisal is often used to model the incremental processing of human readers, which raises questions about how various choices in language modeling influence its predictive power. One factor that has been overlooked in cognitive modeling is the granularity of subword tokens, which explicitly encodes information about word length and frequency, and ultimately influences the quality of vector representations that are learned. This paper presents experiments that manipulate the token granularity and evaluate its impact on the ability of surprisal to account for processing difficulty of naturalistic text and garden-path constructions. Experiments with naturalistic reading times reveal a substantial influence of token granularity on surprisal, with tokens defined by a vocabulary size of 8,000 resulting in surprisal that is most predictive. In contrast, on garden-path constructions, language models trained on coarser-grained tokens generally assigned higher surprisal to critical regions, suggesting their increased sensitivity to syntax. Taken together, these results suggest a large role of token granularity on the quality of language model surprisal for cognitive modeling.
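Because subword tokenization splits a word into several tokens, word-level surprisal is commonly obtained by summing the surprisal of the word's subword pieces under the model's incremental predictions. A small sketch with GPT-2 follows; the example sentence is arbitrary and the word alignment is simplified to whitespace tokenization, which is not the papers' exact preprocessing.

```python
import math
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

def word_surprisals(sentence):
    """Sum subword surprisal (in bits) over each whitespace-delimited word."""
    words = sentence.split()
    surprisals = []
    context_ids = [tokenizer.bos_token_id]
    for word in words:
        piece_ids = tokenizer.encode(" " + word)
        total = 0.0
        for piece in piece_ids:
            with torch.no_grad():
                logits = model(torch.tensor([context_ids])).logits[0, -1]
            log_probs = torch.log_softmax(logits, dim=-1)
            total += -log_probs[piece].item() / math.log(2)  # nats -> bits
            context_ids.append(piece)
        surprisals.append((word, total))
    return surprisals

print(word_surprisals("The horse raced past the barn fell"))
```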
[NLP-17] SEAGraph: Unveiling the Whole Story of Paper Review Comments
[Quick Read]: This paper addresses the problem that authors in traditional peer review often receive vague or insufficiently detailed feedback, leading to inefficient revisions and longer review cycles. The key is SEAGraph, a framework that clarifies review comments by uncovering the intentions behind them. For each paper it constructs two graphs, a semantic mind graph capturing the author's thought process and a hierarchical background graph delineating the related research domains, and designs a retrieval method that extracts relevant content from both graphs to produce coherent explanations for review comments. Experiments show that SEAGraph excels at review comment understanding, substantially helping authors interpret feedback and improve their work.
Link: https://arxiv.org/abs/2412.11939
Authors: Jianxiang Yu,Jiaqi Tan,Zichen Ding,Jiapeng Zhu,Jiahao Li,Yao Cheng,Qier Cui,Yunshi Lan,Xiang Li
Affiliations: East China Normal University
Keywords: Peer review, peer review process, ensures the integrity, traditional peer review, cornerstone of scientific
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:
Abstract:Peer review, as a cornerstone of scientific research, ensures the integrity and quality of scholarly work by providing authors with objective feedback for refinement. However, in the traditional peer review process, authors often receive vague or insufficiently detailed feedback, which provides limited assistance and leads to a more time-consuming review cycle. If authors can identify some specific weaknesses in their paper, they can not only address the reviewer’s concerns but also improve their work. This raises the critical question of how to enhance authors’ comprehension of review comments. In this paper, we present SEAGraph, a novel framework developed to clarify review comments by uncovering the underlying intentions behind them. We construct two types of graphs for each paper: the semantic mind graph, which captures the author’s thought process, and the hierarchical background graph, which delineates the research domains related to the paper. A retrieval method is then designed to extract relevant content from both graphs, facilitating coherent explanations for the review comments. Extensive experiments show that SEAGraph excels in review comment understanding tasks, offering significant benefits to authors.
[NLP-18] Precise Length Control in Large Language Models
[Quick Read]: This paper addresses the difficulty of precisely controlling the response length of large language models (LLMs), which matters for tasks requiring structured outputs or specific levels of detail. The key is a length-difference positional encoding (LDPE) incorporated into the input embeddings that counts down to a user-set response termination length. Fine-tuning with LDPE teaches the model to terminate responses coherently at the desired length, achieving a mean token error of less than 3 tokens. The paper also introduces Max New Tokens++, an extension that enables flexible upper-bound length control rather than an exact target. Experiments on question answering and document summarization show that the method achieves precise length control while preserving response quality.
Link: https://arxiv.org/abs/2412.11937
Authors: Bradley Butcher,Michael O’Keefe,James Titchener
Affiliations: Raytheon UK
Keywords: Large Language Models, Large Language, Language Models, production systems, powering applications
Subjects: Computation and Language (cs.CL)
Comments:
Abstract:Large Language Models (LLMs) are increasingly used in production systems, powering applications such as chatbots, summarization, and question answering. Despite their success, controlling the length of their response remains a significant challenge, particularly for tasks requiring structured outputs or specific levels of detail. In this work, we propose a method to adapt pre-trained decoder-only LLMs for precise control of response length. Our approach incorporates a secondary length-difference positional encoding (LDPE) into the input embeddings, which counts down to a user-set response termination length. Fine-tuning with LDPE allows the model to learn to terminate responses coherently at the desired length, achieving mean token errors of less than 3 tokens. We also introduce Max New Tokens++, an extension that enables flexible upper-bound length control, rather than an exact target. Experimental results on tasks such as question answering and document summarization demonstrate that our method enables precise length control without compromising response quality.
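A minimal sketch of the countdown idea behind LDPE is shown below, assuming a learned embedding of "tokens remaining until the target length" that is added to the token embeddings. The actual encoding form and how it is integrated in the paper may differ.

```python
import torch
import torch.nn as nn

class CountdownPositionalEncoding(nn.Module):
    """Encode 'tokens remaining until the target length' and add it to embeddings.
    A sketch of the length-difference idea, not the paper's exact formulation."""

    def __init__(self, d_model, max_len=4096):
        super().__init__()
        self.embed = nn.Embedding(max_len, d_model)

    def forward(self, token_embeddings, target_length):
        # token_embeddings: (batch, seq_len, d_model)
        seq_len = token_embeddings.size(1)
        positions = torch.arange(seq_len, device=token_embeddings.device)
        remaining = (target_length - positions).clamp(min=0)   # countdown to 0
        return token_embeddings + self.embed(remaining)

emb = torch.randn(2, 10, 64)
ldpe = CountdownPositionalEncoding(d_model=64)
out = ldpe(emb, target_length=25)
print(out.shape)  # torch.Size([2, 10, 64])
```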
[NLP-19] A Survey of Mathematical Reasoning in the Era of Multimodal Large Language Model: Benchmark Method Challenges
[Quick Read]: This survey addresses how to combine mathematical reasoning with large language models (LLMs) in the era of multimodal LLMs (MLLMs). Its key contribution is a comprehensive analysis of recent progress, particularly the multimodal mathematical reasoning pipeline, the role of (M)LLMs, and the associated methodologies. Reviewing more than 200 studies published since 2021, it organizes the field into three dimensions (benchmarks, methodologies, and challenges) and identifies five major challenges to achieving artificial general intelligence (AGI) in this domain, pointing to future directions for enhancing multimodal reasoning capabilities.
Link: https://arxiv.org/abs/2412.11936
Authors: Yibo Yan,Jiamin Su,Jianxiang He,Fangteng Fu,Xu Zheng,Yuanhuiyi Lyu,Kun Wang,Shen Wang,Qingsong Wen,Xuming Hu
Affiliations: Unknown
Keywords: Mathematical reasoning, large language models, human cognition, scientific advancements, core aspect
Subjects: Computation and Language (cs.CL)
Comments:
Abstract:Mathematical reasoning, a core aspect of human cognition, is vital across many domains, from educational problem-solving to scientific advancements. As artificial general intelligence (AGI) progresses, integrating large language models (LLMs) with mathematical reasoning tasks is becoming increasingly significant. This survey provides the first comprehensive analysis of mathematical reasoning in the era of multimodal large language models (MLLMs). We review over 200 studies published since 2021, and examine the state-of-the-art developments in Math-LLMs, with a focus on multimodal settings. We categorize the field into three dimensions: benchmarks, methodologies, and challenges. In particular, we explore multimodal mathematical reasoning pipeline, as well as the role of (M)LLMs and the associated methodologies. Finally, we identify five major challenges hindering the realization of AGI in this domain, offering insights into the future direction for enhancing multimodal reasoning capabilities. This survey serves as a critical resource for the research community in advancing the capabilities of LLMs to tackle complex multimodal reasoning tasks.
[NLP-20] Explainable Procedural Mistake Detection
[Quick Read]: This paper addresses procedural mistake detection (PMD) in automated task guidance, i.e., judging from egocentric video whether a user has correctly executed a task specified by a procedural text. The key is recasting PMD as an explanatory self-dialog of questions and answers that serve as evidence for the decision, making the reasoning process transparent. Using a fine-tuned natural language inference (NLI) model, the paper defines two automated coherence metrics for the generated explanations. Incorporating these metrics into common inference and fine-tuning methods substantially improves the accuracy, coherence, and dialog efficiency of open-source vision-language models (VLMs), and the multi-faceted metrics make common outcomes easy to visualize, highlighting areas for improvement.
Link: https://arxiv.org/abs/2412.11927
Authors: Shane Storks,Itamar Bar-Yossef,Yayuan Li,Zheyuan Zhang,Jason J. Corso,Joyce Chai
Affiliations: University of Michigan, Ann Arbor, Michigan, USA
Keywords: recently attracted attention, research community, Automated task guidance, guidance has recently, recently attracted
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:
Abstract:Automated task guidance has recently attracted attention from the AI research community. Procedural mistake detection (PMD) is a challenging sub-problem of classifying whether a human user (observed through egocentric video) has successfully executed the task at hand (specified by a procedural text). Despite significant efforts in building resources and models for PMD, machine performance remains nonviable, and the reasoning processes underlying this performance are opaque. As such, we recast PMD to an explanatory self-dialog of questions and answers, which serve as evidence for a decision. As this reformulation enables an unprecedented transparency, we leverage a fine-tuned natural language inference (NLI) model to formulate two automated coherence metrics for generated explanations. Our results show that while open-source VLMs struggle with this task off-the-shelf, their accuracy, coherence, and dialog efficiency can be vastly improved by incorporating these coherence metrics into common inference and fine-tuning methods. Furthermore, our multi-faceted metrics can visualize common outcomes at a glance, highlighting areas for improvement.
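The coherence metrics above rest on checking whether generated statements are mutually consistent under an NLI model. As a loose illustration of that flavor of check, the sketch below scores whether a later answer is entailed by an earlier one using an off-the-shelf NLI checkpoint; the model name, the input formatting, and the thresholding are assumptions, and the paper fine-tunes its own NLI model rather than using this one.

```python
from transformers import pipeline

# Assumed off-the-shelf NLI checkpoint, used here only for illustration.
nli = pipeline("text-classification", model="roberta-large-mnli")

def coherence_score(premise, hypothesis):
    """Probability that `hypothesis` is entailed by `premise`."""
    scores = nli(f"{premise} </s></s> {hypothesis}", top_k=None)
    return next(s["score"] for s in scores if s["label"].lower() == "entailment")

earlier_answer = "The user has already added the eggs to the bowl."
later_answer = "The eggs have not been added yet."
print(coherence_score(earlier_answer, later_answer))  # low score -> incoherent self-dialog
```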
[NLP-21] PICLe: Pseudo-Annotations for In-Context Learning in Low-Resource Named Entity Detection
[Quick Read]: This paper addresses named entity detection (NED) in low-resource settings where only a few annotated examples are available. The key is Pseudo-annotated In-Context Learning (PICLe), a framework that uses large language models (LLMs) to pseudo-annotate many demonstrations in a zero-shot first pass, clusters these synthetic demonstrations, samples in-context demonstration sets from each cluster, predicts entity mentions with each set independently, and selects the final mentions via self-verification. Experiments on five biomedical NED datasets show that, with zero human annotation, PICLe clearly outperforms standard in-context learning (ICL) in low-resource settings.
Link: https://arxiv.org/abs/2412.11923
Authors: Sepideh Mamooler,Syrielle Montariol,Alexander Mathis,Antoine Bosselut
Affiliations: Unknown
Keywords: Large Language Models, enables Large Language, Language Models, Large Language, enables Large
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Preprint
Abstract:In-context learning (ICL) enables Large Language Models (LLMs) to perform tasks using few demonstrations, facilitating task adaptation when labeled examples are hard to obtain. However, ICL is sensitive to the choice of demonstrations, and it remains unclear which demonstration attributes enable in-context generalization. In this work, we conduct a perturbation study of in-context demonstrations for low-resource Named Entity Detection (NED). Our surprising finding is that in-context demonstrations with partially correct annotated entity mentions can be as effective for task transfer as fully correct demonstrations. Based off our findings, we propose Pseudo-annotated In-Context Learning (PICLe), a framework for in-context learning with noisy, pseudo-annotated demonstrations. PICLe leverages LLMs to annotate many demonstrations in a zero-shot first pass. We then cluster these synthetic demonstrations, sample specific sets of in-context demonstrations from each cluster, and predict entity mentions using each set independently. Finally, we use self-verification to select the final set of entity mentions. We evaluate PICLe on five biomedical NED datasets and show that, with zero human annotation, PICLe outperforms ICL in low-resource settings where limited gold examples can be used as in-context demonstrations.
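A schematic of the PICLe-style flow (zero-shot pseudo-annotation, clustering, per-cluster demonstration sampling, then aggregation) is sketched below. `llm_annotate` and `llm_predict` are hypothetical callables, the sentence encoder and KMeans are assumed choices, and the final majority vote stands in for the paper's self-verification step.

```python
import random
from collections import Counter
from sklearn.cluster import KMeans
from sentence_transformers import SentenceTransformer

def picle_predict(unlabeled_sentences, test_sentence, llm_annotate, llm_predict,
                  n_clusters=5, demos_per_set=4):
    """PICLe-style sketch: pseudo-annotate, cluster, predict per cluster, then aggregate."""
    # 1) zero-shot pseudo-annotation of candidate demonstrations
    demos = [(s, llm_annotate(s)) for s in unlabeled_sentences]
    # 2) cluster the demonstrations in embedding space
    encoder = SentenceTransformer("all-MiniLM-L6-v2")      # assumed encoder
    vecs = encoder.encode([s for s, _ in demos])
    labels = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(vecs)
    # 3) one prediction per cluster, each using demos sampled from that cluster
    predictions = []
    for c in range(n_clusters):
        cluster_demos = [d for d, l in zip(demos, labels) if l == c]
        sample = random.sample(cluster_demos, min(demos_per_set, len(cluster_demos)))
        predictions.append(llm_predict(sample, test_sentence))
    # 4) simple stand-in for self-verification: majority vote over cluster predictions
    return Counter(map(str, predictions)).most_common(1)[0][0]
```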
[NLP-22] RetroLLM: Empowering Large Language Models to Retrieve Fine-grained Evidence within Generation
[Quick Read]: This paper targets the hallucinations that large language models (LLMs) often produce during generation and the limitations of existing retrieval-augmented generation (RAG) methods, proposing the unified framework RetroLLM. The key is integrating retrieval and generation into a single coherent process so that LLMs directly generate fine-grained evidence from the corpus via constrained decoding. Two innovations mitigate false pruning during constrained evidence generation: (1) hierarchical FM-Index constraints, which identify a subset of relevant documents before evidence generation and reduce the irrelevant decoding space; and (2) a forward-looking constrained decoding strategy that considers the relevance of future sequences to improve evidence accuracy. Together, these improve performance on open-domain question answering tasks.
Link: https://arxiv.org/abs/2412.11919
Authors: Xiaoxi Li,Jiajie Jin,Yujia Zhou,Yongkang Wu,Zhonghua Li,Qi Ye,Zhicheng Dou
Affiliations: Gaoling School of Artificial Intelligence, Renmin University of China; Tsinghua University; Huawei Poisson Lab
Keywords: Large language models, exhibit remarkable generative, remarkable generative capabilities, Large language, language models
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
Comments:
Abstract:Large language models (LLMs) exhibit remarkable generative capabilities but often suffer from hallucinations. Retrieval-augmented generation (RAG) offers an effective solution by incorporating external knowledge, but existing methods still face several limitations: additional deployment costs of separate retrievers, redundant input tokens from retrieved text chunks, and the lack of joint optimization of retrieval and generation. To address these issues, we propose RetroLLM, a unified framework that integrates retrieval and generation into a single, cohesive process, enabling LLMs to directly generate fine-grained evidence from the corpus with constrained decoding. Moreover, to mitigate false pruning in the process of constrained evidence generation, we introduce (1) hierarchical FM-Index constraints, which generate corpus-constrained clues to identify a subset of relevant documents before evidence generation, reducing irrelevant decoding space; and (2) a forward-looking constrained decoding strategy, which considers the relevance of future sequences to improve evidence accuracy. Extensive experiments on five open-domain QA datasets demonstrate RetroLLM’s superior performance across both in-domain and out-of-domain tasks. The code is available at this https URL.
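Constrained decoding restricts generation to token sequences that actually occur in the corpus. A common generic way to express such a constraint is a token-level prefix trie queried for allowed next tokens at each decoding step; the sketch below shows that data structure as a stand-in, not the FM-Index machinery the paper actually uses.

```python
class PrefixTrie:
    """Token-level prefix trie used to constrain decoding to known sequences."""

    def __init__(self):
        self.children = {}
        self.is_end = False

    def add(self, token_ids):
        node = self
        for tok in token_ids:
            node = node.children.setdefault(tok, PrefixTrie())
        node.is_end = True

    def allowed_next_tokens(self, prefix_ids):
        node = self
        for tok in prefix_ids:
            node = node.children.get(tok)
            if node is None:
                return []          # prefix not in corpus: nothing is allowed
        return list(node.children)

# toy corpus of token-id sequences (e.g., tokenized evidence spans)
trie = PrefixTrie()
trie.add([5, 17, 42, 8])
trie.add([5, 17, 99])
print(trie.allowed_next_tokens([5, 17]))  # [42, 99]
```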
[NLP-23] CharacterBench: Benchmarking Character Customization of Large Language Models AAAI2025
[Quick Read]: This paper addresses the shortcomings of existing benchmarks for evaluating the character customization ability of large language models (LLMs): single character categories, limited evaluation dimensions, and the sparsity of character features in responses, which make evaluation ineffective and inefficient. The key is CharacterBench, a large bilingual generative benchmark of 22,859 human-annotated samples covering 3,956 characters from 25 detailed character categories. It defines 11 dimensions across 6 aspects, classified as sparse or dense, and crafts tailored queries for each dimension to induce responses related to that dimension, enabling effective and efficient evaluation. The paper also develops the CharacterJudge model for cost-effective and stable automatic evaluation; experiments show it outperforms state-of-the-art automatic judges such as GPT-4 and demonstrate the benchmark's potential for optimizing LLM character customization.
Link: https://arxiv.org/abs/2412.11912
Authors: Jinfeng Zhou,Yongkang Huang,Bosi Wen,Guanqun Bi,Yuxuan Chen,Pei Ke,Zhuang Chen,Xiyao Xiao,Libiao Peng,Kuntian Tang,Rongsheng Zhang,Le Zhang,Tangjie Lv,Zhipeng Hu,Hongning Wang,Minlie Huang
Affiliations: 1. Tsinghua University; 2. Beijing University of Posts and Telecommunications; 3. Zhejiang University; 4. Alibaba Group; 5. Baidu Inc.
Keywords: Character-based dialogue, freely customize characters, aka role-playing, relies on LLMs, users to freely
Subjects: Computation and Language (cs.CL)
Comments: AAAI 2025
Abstract:Character-based dialogue (aka role-playing) enables users to freely customize characters for interaction, which often relies on LLMs, raising the need to evaluate LLMs’ character customization capability. However, existing benchmarks fail to ensure a robust evaluation as they often only involve a single character category or evaluate limited dimensions. Moreover, the sparsity of character features in responses makes feature-focused generative evaluation both ineffective and inefficient. To address these issues, we propose CharacterBench, the largest bilingual generative benchmark, with 22,859 human-annotated samples covering 3,956 characters from 25 detailed character categories. We define 11 dimensions of 6 aspects, classified as sparse and dense dimensions based on whether character features evaluated by specific dimensions manifest in each response. We enable effective and efficient evaluation by crafting tailored queries for each dimension to induce characters’ responses related to specific dimensions. Further, we develop CharacterJudge model for cost-effective and stable evaluations. Experiments show its superiority over SOTA automatic judges (e.g., GPT-4) and our benchmark’s potential to optimize LLMs’ character customization. Our repository is at this https URL.
[NLP-24] Can Language Models Rival Mathematics Students? Evaluating Mathematical Reasoning through Textual Manipulation and Human Experiments
[Quick Read]: This paper evaluates the ability of recent large language models (LLMs) to solve combinatorics problems, comparing models against each other and against humans (pupils with mathematical olympiad experience and undergraduates) using the newly introduced Combi-Puzzles dataset. The key is generating five variants of each problem by systematically manipulating the problem statements (adversarial additions, numeric parameter changes, and linguistic obfuscation) in order to measure the generalizability of LLM problem solving and to increase confidence that problems were not seen as training instances. The study finds that a GPT-4-based model outperforms all other models in producing correct responses and performs significantly better than humans on the mathematical variants, and that modifications to problem statements significantly affect LLM performance while human performance remains unaffected.
Link: https://arxiv.org/abs/2412.11908
Authors: Andrii Nikolaiev,Yiannos Stathopoulos,Simone Teufel
Affiliations: Taras Shevchenko National University of Kyiv; University of Cambridge
Keywords: recent large language, large language models, ability of recent, recent large, large language
Subjects: Computation and Language (cs.CL)
Comments:
Abstract:In this paper we look at the ability of recent large language models (LLMs) at solving mathematical problems in combinatorics. We compare models LLaMA-2, LLaMA-3.1, GPT-4, and Mixtral against each other and against human pupils and undergraduates with prior experience in mathematical olympiads. To facilitate these comparisons we introduce the Combi-Puzzles dataset, which contains 125 problem variants based on 25 combinatorial reasoning problems. Each problem is presented in one of five distinct forms, created by systematically manipulating the problem statements through adversarial additions, numeric parameter changes, and linguistic obfuscation. Our variations preserve the mathematical core and are designed to measure the generalisability of LLM problem-solving abilities, while also increasing confidence that problems are submitted to LLMs in forms that have not been seen as training instances. We found that a model based on GPT-4 outperformed all other models in producing correct responses, and performed significantly better in the mathematical variation of the problems than humans. We also found that modifications to problem statements significantly impact the LLM’s performance, while human performance remains unaffected.
[NLP-25] Classification of Spontaneous and Scripted Speech for Multilingual Audio
[Quick Read]: This paper asks how to build a classifier that distinguishes scripted from spontaneous speech and generalizes well across formats and languages. The key is a systematic evaluation of models ranging from traditional handcrafted acoustic and prosodic features to modern audio transformers, trained and validated on a large multilingual podcast dataset. The experiments show that transformer-based models consistently outperform traditional feature-based techniques, achieving state-of-the-art performance in distinguishing scripted from spontaneous speech across languages.
Link: https://arxiv.org/abs/2412.11896
Authors: Shahar Elisha,Andrew McDowell,Mariano Beguerisse-Díaz,Emmanouil Benetos
Affiliations: Unknown
Keywords: speech processing research, speech styles influence, styles influence speech, influence speech processing, processing research
Subjects: Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Comments: Accepted to IEEE Spoken Language Technology Workshop 2024
Abstract:Distinguishing scripted from spontaneous speech is an essential tool for better understanding how speech styles influence speech processing research. It can also improve recommendation systems and discovery experiences for media users through better segmentation of large recorded speech catalogues. This paper addresses the challenge of building a classifier that generalises well across different formats and languages. We systematically evaluate models ranging from traditional, handcrafted acoustic and prosodic features to advanced audio transformers, utilising a large, multilingual proprietary podcast dataset for training and validation. We break down the performance of each model across 11 language groups to evaluate cross-lingual biases. Our experimental analysis extends to publicly available datasets to assess the models’ generalisability to non-podcast domains. Our results indicate that transformer-based models consistently outperform traditional feature-based techniques, achieving state-of-the-art performance in distinguishing between scripted and spontaneous speech across various languages.
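As a toy illustration of the feature-based baseline family mentioned above, the sketch below uses transcript-level lexical cues (filler rate, word repetitions) rather than the acoustic and prosodic features the paper actually evaluates; the feature set, labels, and tiny training data are all invented for demonstration.

```python
from sklearn.linear_model import LogisticRegression

FILLERS = {"um", "uh", "erm", "like"}

def lexical_spontaneity_features(transcript):
    """Very rough lexical cues of spontaneity computed from a transcript alone."""
    words = transcript.lower().split()
    n = max(len(words), 1)
    filler_rate = sum(w in FILLERS for w in words) / n
    repeat_rate = sum(words[i] == words[i + 1] for i in range(len(words) - 1)) / n
    mean_word_len = sum(map(len, words)) / n
    return [filler_rate, repeat_rate, mean_word_len]

# toy training data: 1 = spontaneous, 0 = scripted
texts = ["um so i i was like thinking", "welcome to today's carefully prepared episode"]
X = [lexical_spontaneity_features(t) for t in texts]
clf = LogisticRegression().fit(X, [1, 0])
print(clf.predict([lexical_spontaneity_features("uh yeah i i guess so")]))
```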
[NLP-26] Using Instruction-Tuned Large Language Models to Identify Indicators of Vulnerability in Police Incident Narratives
[Quick Read]: This paper evaluates how well instruction-tuned large language models (IT-LLMs) can qualitatively code unstructured narratives of police-public interactions, comparing their labels with those of human coders for indicators of vulnerability (mental ill health, substance misuse, alcohol dependence, and homelessness) and assessing potential bias. The key is comparing IT-LLM and human-generated labels across multiple prompting strategies and model sizes, and using counterfactual methods to measure the effect of the protected characteristics race and gender on model classifications. Results show that IT-LLMs are highly effective at screening out narratives with no vulnerabilities present, that manipulations of gender and race have very limited effects on classifications beyond chance, and that the models can therefore augment human qualitative coding at much lower resource cost while promoting transparency and standardization.
Link: https://arxiv.org/abs/2412.11878
Authors: Sam Relins,Daniel Birks,Charlie Lloyd
Affiliations: Unknown
Keywords: routinely collected unstructured, describes police-public interactions, instruction tuned large, tuned large language, Boston Police Department
Subjects: Computation and Language (cs.CL)
Comments: 33 pages, 6 figures. Submitted to Journal of Quantitative Criminology
Abstract:Objectives: Compare qualitative coding of instruction tuned large language models (IT-LLMs) against human coders in classifying the presence or absence of vulnerability in routinely collected unstructured text that describes police-public interactions. Evaluate potential bias in IT-LLM codings. Methods: Analyzing publicly available text narratives of police-public interactions recorded by Boston Police Department, we provide humans and IT-LLMs with qualitative labelling codebooks and compare labels generated by both, seeking to identify situations associated with (i) mental ill health; (ii) substance misuse; (iii) alcohol dependence; and (iv) homelessness. We explore multiple prompting strategies and model sizes, and the variability of labels generated by repeated prompts. Additionally, to explore model bias, we utilize counterfactual methods to assess the impact of two protected characteristics - race and gender - on IT-LLM classification. Results: Results demonstrate that IT-LLMs can effectively support human qualitative coding of police incident narratives. While there is some disagreement between LLM and human generated labels, IT-LLMs are highly effective at screening narratives where no vulnerabilities are present, potentially vastly reducing the requirement for human coding. Counterfactual analyses demonstrate that manipulations to both gender and race of individuals described in narratives have very limited effects on IT-LLM classifications beyond those expected by chance. Conclusions: IT-LLMs offer effective means to augment human qualitative coding in a way that requires much lower levels of resource to analyze large unstructured datasets. Moreover, they encourage specificity in qualitative coding, promote transparency, and provide the opportunity for more standardized, replicable approaches to analyzing large free-text police data sources.
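The counterfactual bias check described above swaps protected-attribute terms in a narrative and measures how often the model's label changes. A minimal sketch follows; the swap lists are hypothetical and far cruder than the study's manipulations, and `classify` is a placeholder callable wrapping the IT-LLM labelling prompt.

```python
import re

# Hypothetical swap list; the study's counterfactual manipulations are more careful.
GENDER_SWAPS = {"he": "she", "him": "her", "his": "her", "man": "woman", "male": "female"}

def gender_counterfactual(narrative):
    """Swap gendered terms to create a counterfactual version of a narrative."""
    def swap(match):
        word = match.group(0)
        repl = GENDER_SWAPS[word.lower()]
        return repl.capitalize() if word[0].isupper() else repl
    pattern = r"\b(" + "|".join(GENDER_SWAPS) + r")\b"
    return re.sub(pattern, swap, narrative, flags=re.IGNORECASE)

def label_shift_rate(narratives, classify):
    """Fraction of narratives whose predicted label changes under the counterfactual.
    `classify` is a hypothetical callable wrapping the IT-LLM labelling prompt."""
    changed = sum(classify(n) != classify(gender_counterfactual(n)) for n in narratives)
    return changed / len(narratives)

print(gender_counterfactual("The man said he lost his keys."))
# The woman said she lost her keys.
```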
[NLP-27] GeoX: Geometric Problem Solving Through Unified Formalized Vision-Language Pre-training
[Quick Read]: This paper addresses the limitations of multimodal large language models (MLLMs) in automatic geometry problem solving (GPS), which stem from pre-training on natural images and text (leaving them weak at understanding geometric diagrams and symbols) and from the lack of automated verification during problem solving. The key is GeoX, which introduces unimodal pre-training to develop a diagram encoder and a symbol decoder, strengthening the understanding of geometric images and corpora; a geometry-language alignment pre-training paradigm that bridges the modality gap between the unimodal geometric experts; a Generator-And-Sampler Transformer (GS-Former) that produces discriminative queries and removes uninformative representations from unevenly distributed geometric signals; and visual instruction tuning that lets GeoX take geometric images and questions as input and generate verifiable solutions.
Link: https://arxiv.org/abs/2412.11863
Authors: Renqiu Xia,Mingsheng Li,Hancheng Ye,Wenjie Wu,Hongbin Zhou,Jiakang Yuan,Tianshuo Peng,Xinyu Cai,Xiangchao Yan,Bin Wang,Conghui He,Botian Shi,Tao Chen,Junchi Yan,Bo Zhang
Affiliations: Shanghai Jiao Tong University; Shanghai Artificial Intelligence Laboratory; Fudan University; MMLab, The Chinese University of Hong Kong
Keywords: Geometry Problem Solving, automatic Geometry Problem, Multi-modal Large Language, Large Language Models, automatic Geometry
Subjects: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Comments: Our code is available at this https URL
Abstract:Despite their proficiency in general tasks, Multi-modal Large Language Models (MLLMs) struggle with automatic Geometry Problem Solving (GPS), which demands understanding diagrams, interpreting symbols, and performing complex reasoning. This limitation arises from their pre-training on natural images and texts, along with the lack of automated verification in the problem-solving process. Besides, current geometric specialists are limited by their task-specific designs, making them less effective for broader geometric problems. To this end, we present GeoX, a multi-modal large model focusing on geometric understanding and reasoning tasks. Given the significant differences between geometric diagram-symbol and natural image-text, we introduce unimodal pre-training to develop a diagram encoder and symbol decoder, enhancing the understanding of geometric images and corpora. Furthermore, we introduce geometry-language alignment, an effective pre-training paradigm that bridges the modality gap between unimodal geometric experts. We propose a Generator-And-Sampler Transformer (GS-Former) to generate discriminative queries and eliminate uninformative representations from unevenly distributed geometric signals. Finally, GeoX benefits from visual instruction tuning, empowering it to take geometric images and questions as input and generate verifiable solutions. Experiments show that GeoX outperforms both generalists and geometric specialists on publicly recognized benchmarks, such as GeoQA, UniGeo, Geometry3K, and PGPS9k.
zh
[NLP-28] A Benchmark and Robustness Study of In-Context-Learning with Large Language Models in Music Entity Detection
【速读】: 该论文试图解决音乐实体检测(如歌曲标题或艺术家名称)在文本中的识别问题,特别是在处理音乐搜索查询或分析网络音乐消费时。解决方案的关键在于利用大型语言模型(LLMs)在上下文学习(ICL)设置下的表现优于小型语言模型(SLMs),如BERT。论文通过引入一个新颖的用户生成元数据数据集,并进行基准测试和鲁棒性研究,发现LLMs在ICL设置下显著提升了性能。此外,研究还揭示了预训练阶段实体暴露对LLMs性能的显著影响。
链接: https://arxiv.org/abs/2412.11851
作者: Simon Hachmeier,Robert Jäschke
机构: Berlin School of Library and Information Science(柏林图书馆与信息科学学院); Humboldt-Universität zu Berlin(柏林洪堡大学)
关键词: Detecting music entities, processing music search, music search queries, analyzing music consumption, Detecting music
类目: Computation and Language (cs.CL); Multimedia (cs.MM)
备注:
点击查看摘要
Abstract:Detecting music entities such as song titles or artist names is a useful application to help use cases like processing music search queries or analyzing music consumption on the web. Recent approaches incorporate smaller language models (SLMs) like BERT and achieve strong results. However, further research indicates a high influence of entity exposure during pre-training on the performance of the models. With the advent of large language models (LLMs), these outperform SLMs in a variety of downstream tasks. However, researchers are still divided on whether this is applicable to tasks like entity detection in texts due to issues like hallucination. In this paper, we provide a novel dataset of user-generated metadata and conduct a benchmark and a robustness study using recent LLMs with in-context-learning (ICL). Our results indicate that LLMs in the ICL setting yield higher performance than SLMs. We further uncover the large impact of entity exposure on the best performing LLM in our study.
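上下文学习(ICL)在这类实体检测任务中的基本用法,是把少量带标注的示例拼入提示后让 LLM 标注新文本;下面是一个示意性的提示构造函数,示例数据与输出格式均为假设,仅用于说明流程。

```python
# 示意:为音乐实体检测构造 few-shot ICL 提示(示例与输出格式均为假设)
few_shot_examples = [
    ("cover of Yesterday by a street musician",
     {"song": ["Yesterday"], "artist": []}),
    ("Adele live in Berlin 2016",
     {"song": [], "artist": ["Adele"]}),
]

def build_icl_prompt(query: str) -> str:
    lines = ["Extract song titles and artist names from the text."]
    for text, entities in few_shot_examples:
        lines.append(f"Text: {text}\nEntities: {entities}")
    lines.append(f"Text: {query}\nEntities:")
    return "\n\n".join(lines)

print(build_icl_prompt("bohemian rhapsody piano tutorial"))
```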
zh
[NLP-29] Improved Models for Media Bias Detection and Subcategorization
【速读】: 该论文旨在改进对英语新闻文章中新闻媒体偏见(news media bias)的粒度检测和子分类。解决方案的关键在于比较零样本学习(zero-shot)与微调大型预训练神经变换器语言模型(fine-tuned large pre-trained neural transformer language models)的性能,并探讨类别细节层次对27种新闻偏见类型分类性能的影响。此外,论文还展示了如何通过使用合成生成的示例数据来提高分类质量。
链接: https://arxiv.org/abs/2412.11835
作者: Tim Menzner,Jochen L. Leidner
机构: 未知
关键词: present improved models, English news articles, bias in English, present improved, granular detection
类目: Computation and Language (cs.CL)
备注:
点击查看摘要
Abstract:We present improved models for the granular detection and sub-classification of news media bias in English news articles. We compare the performance of zero-shot versus fine-tuned large pre-trained neural transformer language models, explore how the level of detail of the classes affects performance on a novel taxonomy of 27 news bias-types, and demonstrate how synthetically generated example data can be used to improve quality.
zh
[NLP-30] Wonderful Matrices: Combining for a More Efficient and Effective Foundation Model Architecture
【速读】: 该论文旨在提升基础模型的效率和效果,关键解决方案在于结合序列变换和状态变换。首先,通过证明旋转位置嵌入(rotary position embedding)在状态空间对偶算法中的可用性,降低了混合二次因果自注意力与状态空间对偶的困惑度(perplexity)超过4%,确保了序列变换统一位置编码。其次,提出动态掩码注意力(dynamic mask attention),在更具挑战性的多查询关联回忆任务中保持100%的准确率,相比二次因果自注意力和状态空间对偶提升了150%以上,确保序列变换能够选择性过滤相关信息。第三,设计跨领域专家混合(cross domain mixture of experts),使包含超过1024个专家的专家检索计算速度比传统专家混合快8到10倍,确保状态变换能够快速完成专家混合的检索。最终,这些矩阵算法形成了基础模型:Wonderful Matrices,具备与流行模型架构竞争的潜力。
链接: https://arxiv.org/abs/2412.11834
作者: Jingze Shi,Bingheng Wu
机构: 未知
关键词: combining sequence transformation, state space duality, combining sequence, combining state transformation, sequence transformation
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: The code is open-sourced at this https URL
点击查看摘要
Abstract:In order to make the foundation model more efficient and effective, our idea is to combine sequence transformation and state transformation. First, we prove the availability of rotary position embedding in the state space duality algorithm, which reduces the perplexity of the hybrid quadratic causal self-attention and state space duality by more than 4%, to ensure that the combined sequence transformation unifies position encoding. Second, we propose dynamic mask attention, which maintains 100% accuracy in the more challenging multi-query associative recall task, improving by more than 150% compared to quadratic causal self-attention and state space duality, to ensure that the combined sequence transformation selectively filters relevant information. Third, we design cross domain mixture of experts, which makes the computational speed of expert retrieval with more than 1024 experts 8 to 10 times faster than the mixture of experts, to ensure that the combined state transformation can quickly retrieve the mixture of experts. Finally, we summarize these matrix algorithms that can form the foundation model: Wonderful Matrices, which can be a competitor to popular model architectures.
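摘要中反复提到的旋转位置嵌入(rotary position embedding)核心只是"按位置角度旋转成对维度"。下面给出一个最小的 NumPy 示意(并非论文中状态空间对偶算法的实现),仅帮助理解这一操作。

```python
# 旋转位置嵌入(RoPE)的最小示意:将向量维度两两配对,按位置角度做二维旋转
import numpy as np

def rope(x: np.ndarray, base: float = 10000.0) -> np.ndarray:
    """x 形状为 (seq_len, dim),dim 为偶数;返回施加旋转位置编码后的张量。"""
    seq_len, dim = x.shape
    half = dim // 2
    freqs = base ** (-np.arange(half) / half)      # 每对维度对应一个旋转频率
    angles = np.outer(np.arange(seq_len), freqs)   # (seq_len, half)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, :half], x[:, half:]
    # 二维旋转: (x1, x2) -> (x1*cos - x2*sin, x1*sin + x2*cos)
    return np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=-1)

q = np.random.randn(8, 64)
print(rope(q).shape)  # (8, 64)
```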
zh
[NLP-31] Are You Doubtful? Oh It Might Be Difficult Then! Exploring the Use of Model Uncertainty for Question Difficulty Estimation
【速读】: 该论文试图解决多选题(MCQs)难度自动评估的问题,旨在为教师和学生提供有用的难度估计信息。解决方案的关键在于利用大型语言模型(Large Language Models)对题目进行处理,并通过模型的不确定性(model uncertainty)特征来估计题目难度。研究通过结合模型不确定性特征和文本特征,使用随机森林回归器(Random Forest regressor)进行难度预测,发现不确定性特征对难度预测有显著贡献。此外,该方法在公开的BEA数据集上达到了最先进的性能。
链接: https://arxiv.org/abs/2412.11831
作者: Leonidas Zotos,Hedderik van Rijn,Malvina Nissim
机构: University of Groningen (格罗宁根大学); University of Groningen (格罗宁根大学); University of Groningen (格罗宁根大学)
关键词: assess learning progress, educational setting, learning progress, commonly used strategy, strategy to assess
类目: Computation and Language (cs.CL)
备注: 14 pages,7 figures
点击查看摘要
Abstract:In an educational setting, an estimate of the difficulty of multiple-choice questions (MCQs), a commonly used strategy to assess learning progress, constitutes very useful information for both teachers and students. Since human assessment is costly from multiple points of view, automatic approaches to MCQ item difficulty estimation are investigated, yielding however mixed success until now. Our approach to this problem takes a different angle from previous work: asking various Large Language Models to tackle the questions included in two different MCQ datasets, we leverage model uncertainty to estimate item difficulty. By using both model uncertainty features as well as textual features in a Random Forest regressor, we show that uncertainty features contribute substantially to difficulty prediction, where difficulty is inversely proportional to the number of students who can correctly answer a question. In addition to showing the value of our approach, we also observe that our model achieves state-of-the-art results on the BEA publicly available dataset.
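下面是一个简化示意:假设已经为每道选择题收集到若干模型不确定性特征与文本特征,用 scikit-learn 的随机森林回归器拟合难度;特征名称与数据均为虚构示例,并非论文使用的真实特征集。

```python
# 简化示意:结合模型不确定性特征与文本特征,用随机森林回归预测题目难度
# 特征与数据均为虚构示例
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
n = 200
X = np.column_stack([
    rng.uniform(0.3, 1.0, n),   # 答案 token 的平均置信度
    rng.uniform(0.0, 1.5, n),   # 多次采样答案的熵
    rng.integers(20, 200, n),   # 题干长度(文本特征示例)
])
# 示例目标:难度与答对学生比例成反比,此处用模拟值
y = 1.0 - X[:, 0] + 0.1 * rng.normal(size=n)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
model = RandomForestRegressor(n_estimators=200, random_state=0).fit(X_tr, y_tr)
print("MSE:", mean_squared_error(y_te, model.predict(X_te)))
print("特征重要性:", model.feature_importances_)
```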
zh
[NLP-32] Advancements and Challenges in Bangla Question Answering Models: A Comprehensive Review
【速读】: 该论文旨在解决孟加拉语(Bangla)问答系统(QA)领域中的关键问题,特别是缺乏高质量标注数据、阅读理解数据集不足以及上下文中词语意义理解困难等问题。解决方案的关键在于通过综述七篇相关研究文章,探讨了数据收集、预处理、模型设计、实验实施及结果解释等多个方面,并引入了创新方法如基于LSTM和注意力机制的模型、上下文感知问答系统以及基于先验知识的深度学习技术。这些方法旨在克服现有挑战,提升孟加拉语问答系统的精确性和实用性。
链接: https://arxiv.org/abs/2412.11823
作者: Md Iftekhar Islam Tashik,Abdullah Khondoker,Enam Ahmed Taufik,Antara Firoz Parsa,S M Ishtiak Mahmud
机构: Brac University(布拉克大学)
关键词: Bangla Question Answering, Natural Language Processing, Question Answering, experienced notable progress, Language Processing
类目: Computation and Language (cs.CL)
备注:
点击查看摘要
Abstract:The domain of Natural Language Processing (NLP) has experienced notable progress in the evolution of Bangla Question Answering (QA) systems. This paper presents a comprehensive review of seven research articles that contribute to the progress in this domain. These research studies explore different aspects of creating question-answering systems for the Bangla language. They cover areas like collecting data, preparing it for analysis, designing models, conducting experiments, and interpreting results. The papers introduce innovative methods like using LSTM-based models with attention mechanisms, context-based QA systems, and deep learning techniques based on prior knowledge. However, despite the progress made, several challenges remain, including the lack of well-annotated data, the absence of high-quality reading comprehension datasets, and difficulties in understanding the meaning of words in context. Bangla QA models’ precision and applicability are constrained by these challenges. This review emphasizes the significance of these research contributions by highlighting the developments achieved in creating Bangla QA systems as well as the ongoing effort required to get past roadblocks and improve the performance of these systems for actual language comprehension tasks.
zh
[NLP-33] EventSum: A Large-Scale Event-Centric Summarization Dataset for Chinese Multi-News Documents AAAI2025
【速读】: 该论文试图解决动态事件(如重大灾难和大规模体育赛事)的多文档摘要生成问题,这些事件的关键信息通常分散在多个文档中,涉及复杂的事件知识理解和推理,这在以往的研究中较少涉及。解决方案的关键在于提出了事件中心的多文档摘要生成任务(Event-Centric Multi-Document Summarization, ECS),并构建了EventSum数据集,该数据集包含5,100个事件和57,984篇新闻文档,通过多阶段人工标注确保数据质量。此外,论文设计了专门的事件召回率(Event Recall)、论点召回率(Argument Recall)、因果召回率(Causal Recall)和时间召回率(Temporal Recall)等评估指标,以全面评估生成摘要的质量。实验结果表明,现有长上下文大语言模型(LLMs)在该任务上仍面临挑战,而设计的召回率指标对评估摘要信息的全面性至关重要。
链接: https://arxiv.org/abs/2412.11814
作者: Mengna Zhu,Kaisheng Zeng,Mao Wang,Kaiming Xiao,Lei Hou,Hongbin Huang,Juanzi Li
机构: 1. Tsinghua University (清华大学); 2. Beijing Institute of Technology (北京理工大学); 3. Beijing University of Posts and Telecommunications (北京邮电大学)
关键词: large-scale sports events, real life, evolve continuously, continuously over time, major disasters
类目: Computation and Language (cs.CL)
备注: Extended version for paper accepted to AAAI 2025
点击查看摘要
Abstract:In real life, many dynamic events, such as major disasters and large-scale sports events, evolve continuously over time. Obtaining an overview of these events can help people quickly understand the situation and respond more effectively. This is challenging because the key information of the event is often scattered across multiple documents, involving complex event knowledge understanding and reasoning, which is under-explored in previous work. Therefore, we proposed the Event-Centric Multi-Document Summarization (ECS) task, which aims to generate concise and comprehensive summaries of a given event based on multiple related news documents. Based on this, we constructed the EventSum dataset, which was constructed using Baidu Baike entries and underwent extensive human annotation, to facilitate relevant research. It is the first large scale Chinese multi-document summarization dataset, containing 5,100 events and a total of 57,984 news documents, with an average of 11.4 input news documents and 13,471 characters per event. To ensure data quality and mitigate potential data leakage, we adopted a multi-stage annotation approach for manually labeling the test set. Given the complexity of event-related information, existing metrics struggle to comprehensively assess the quality of generated summaries. We designed specific metrics including Event Recall, Argument Recall, Causal Recall, and Temporal Recall along with corresponding calculation methods for evaluation. We conducted comprehensive experiments on EventSum to evaluate the performance of advanced long-context Large Language Models (LLMs) on this task. Our experimental results indicate that: 1) The event-centric multi-document summarization task remains challenging for existing long-context LLMs; 2) The recall metrics we designed are crucial for evaluating the comprehensiveness of the summary information.
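论文设计的事件/论元/因果/时间召回率共享同一计算思路:统计参考摘要中标注的关键要素有多少被生成摘要覆盖。下面给出这一思路的通用示意实现,要素如何抽取与匹配在论文中另有专门方法,此处以简单的包含判断代替(属简化假设)。

```python
# 示意:要素级召回率的通用计算方式(要素抽取与匹配方式为简化假设)
def element_recall(reference_elements, generated_summary):
    """reference_elements: 参考摘要中标注的关键要素(事件、论元等)列表;
    generated_summary: 生成的摘要文本;返回被覆盖要素的比例。"""
    if not reference_elements:
        return 1.0
    covered = sum(1 for e in reference_elements if e in generated_summary)
    return covered / len(reference_elements)

ref_events = ["地震", "救援队", "伤亡统计"]
summary = "某地发生地震,救援队随后抵达现场展开搜救。"
print(round(element_recall(ref_events, summary), 2))  # 0.67
```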
zh
[NLP-34] UAlign: Leveraging Uncertainty Estimations for Factuality Alignment on Large Language Models
【速读】: 该论文试图解决大语言模型 (LLMs) 在表达其所掌握的事实知识时,尤其是在知识边界模糊的情况下,常常难以准确表达的问题。解决方案的关键在于提出了 UAlign 框架,通过利用不确定性估计 (Uncertainty estimations) 来表示知识边界,并将这些表示作为输入特征显式地融入到 LLMs 的提示 (prompts) 中,以实现与事实知识的对齐。具体来说,UAlign 首先通过计算置信度分数 (confidence score) 和语义熵 (semantic entropy) 来准备知识问答 (QA) 样本的数据集,以表示 LLMs 的知识边界。然后,利用该数据集训练一个结合不确定性估计的奖励模型,并采用近端策略优化 (PPO) 算法对 LLMs 进行事实对齐。实验结果表明,UAlign 通过整合不确定性表示,显著提升了 LLMs 在已知问题上的自信回答能力以及在未知问题上的拒绝能力,同时在不同任务上表现出更高的可靠性和泛化能力。
链接: https://arxiv.org/abs/2412.11803
作者: Boyang Xue,Fei Mi,Qi Zhu,Hongru Wang,Rui Wang,Sheng Wang,Erxin Yu,Xuming Hu,Kam-Fai Wong
机构: The Chinese University of Hong Kong; Huawei Noah’s Ark Lab; The University of Hong Kong; The Hong Kong Polytechnic University; Hong Kong University of Science and Technology, Guangzhou
关键词: Large Language Models, Large Language, demonstrating impressive capabilities, Language Models, knowledge boundaries
类目: Computation and Language (cs.CL)
备注:
点击查看摘要
Abstract:Despite demonstrating impressive capabilities, Large Language Models (LLMs) still often struggle to accurately express the factual knowledge they possess, especially in cases where the LLMs’ knowledge boundaries are ambiguous. To improve LLMs’ factual expressions, we propose the UAlign framework, which leverages Uncertainty estimations to represent knowledge boundaries, and then explicitly incorporates these representations as input features into prompts for LLMs to Align with factual knowledge. First, we prepare the dataset on knowledge question-answering (QA) samples by calculating two uncertainty estimations, including confidence score and semantic entropy, to represent the knowledge boundaries for LLMs. Subsequently, using the prepared dataset, we train a reward model that incorporates uncertainty estimations and then employ the Proximal Policy Optimization (PPO) algorithm for factuality alignment on LLMs. Experimental results indicate that, by integrating uncertainty representations in LLM alignment, the proposed UAlign can significantly enhance the LLMs’ capacities to confidently answer known questions and refuse unknown questions on both in-domain and out-of-domain tasks, showing reliability improvements and good generalizability over various prompt- and training-based baselines.
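语义熵的思路是:对同一问题采样多个回答,按语义等价关系聚类,再对簇分布求熵。下面的示意用简单的字符串规范化代替语义等价判断(实际工作中通常用蕴含模型判断等价,此处属于简化假设)。

```python
# 简化示意:用采样回答的语义聚类近似计算语义熵(语义等价判断以字符串规范化代替)
import math
from collections import Counter

def normalize(ans: str) -> str:
    return ans.strip().lower().rstrip(".")

def semantic_entropy(sampled_answers):
    """对多次采样的回答按(近似的)语义等价聚类,返回簇分布的熵;熵越高说明模型越不确定。"""
    clusters = Counter(normalize(a) for a in sampled_answers)
    total = sum(clusters.values())
    probs = [c / total for c in clusters.values()]
    return -sum(p * math.log(p) for p in probs)

answers = ["Paris", "paris.", "Paris", "Lyon"]
print(round(semantic_entropy(answers), 3))  # 两个语义簇,熵较低但非零
```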
zh
[NLP-35] ProsodyFM: Unsupervised Phrasing and Intonation Control for Intelligible Speech Synthesis AAAI2025
【速读】: 该论文试图解决当前文本到语音合成 (TTS) 模型在处理复杂句子结构时,无法准确生成自然语调和断句的问题。解决方案的关键在于提出了一个名为 ProsodyFM 的韵律感知 TTS 模型,该模型基于 flow-matching (FM) 架构,并引入了两个核心组件:Phrase Break Encoder 用于捕捉初始断句位置,并通过 Duration Predictor 灵活调整断句时长;Terminal Intonation Encoder 结合 intonation shape tokens 和 Pitch Processor,以更精确地建模人类感知的语调变化。ProsodyFM 无需显式的韵律标签即可学习广泛的断句时长和语调模式,实验结果表明其在提升韵律的断句和语调方面优于现有的四种最先进 (SOTA) 模型,并展示了在未见过的复杂句子和不同说话人上的优越泛化能力。
链接: https://arxiv.org/abs/2412.11795
作者: Xiangheng He,Junjie Chen,Zixing Zhang,Björn W. Schuller
机构: University of Augsburg(奥格斯堡大学); Hunan University(湖南大学)
关键词: meaning of words, intonation, rich information, literal meaning, Terminal Intonation Encoder
类目: Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
备注: Accepted by AAAI 2025
点击查看摘要
Abstract:Prosody contains rich information beyond the literal meaning of words, which is crucial for the intelligibility of speech. Current models still fall short in phrasing and intonation; they not only miss or misplace breaks when synthesizing long sentences with complex structures but also produce unnatural intonation. We propose ProsodyFM, a prosody-aware text-to-speech synthesis (TTS) model with a flow-matching (FM) backbone that aims to enhance the phrasing and intonation aspects of prosody. ProsodyFM introduces two key components: a Phrase Break Encoder to capture initial phrase break locations, followed by a Duration Predictor for the flexible adjustment of break durations; and a Terminal Intonation Encoder which integrates a set of intonation shape tokens combined with a novel Pitch Processor for more robust modeling of human-perceived intonation change. ProsodyFM is trained with no explicit prosodic labels and yet can uncover a broad spectrum of break durations and intonation patterns. Experimental results demonstrate that ProsodyFM can effectively improve the phrasing and intonation aspects of prosody, thereby enhancing the overall intelligibility compared to four state-of-the-art (SOTA) models. Out-of-distribution experiments show that this prosody improvement can further bring ProsodyFM superior generalizability for unseen complex sentences and speakers. Our case study intuitively illustrates the powerful and fine-grained controllability of ProsodyFM over phrasing and intonation.
zh
[NLP-36] A Method for Detecting Legal Article Competition for Korean Criminal Law Using a Case-augmented Mention Graph
【速读】: 该论文试图解决法律条文之间潜在竞争关系的识别问题,尤其是在制定新法律或适用现有法律时。解决方案的关键在于提出了一个新的法律AI任务,称为法律条文竞争检测 (Legal Article Competition Detection, LACD),并通过创新的检索方法CAM-Re2显著提升了检测性能。CAM-Re2方法在减少误报和漏报方面表现出色,分别降低了20.8%的误报率和8.3%的漏报率,同时在LACD任务中实现了98.2%的精度@5提升。
链接: https://arxiv.org/abs/2412.11787
作者: Seonho An,Young Yik Rhim,Min-Soo Kim
机构: KAIST, Republic of Korea(韩国科学技术院); Infolab(信息实验室); Intellicon(智能控制)
关键词: increasingly complex, growing more intricate, making it progressively, social systems, systems become increasingly
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Machine Learning (cs.LG)
备注: under review
点击查看摘要
Abstract:As social systems become increasingly complex, legal articles are also growing more intricate, making it progressively harder for humans to identify any potential competitions among them, particularly when drafting new laws or applying existing laws. Despite this challenge, no method for detecting such competitions has been proposed so far. In this paper, we propose a new legal AI task called Legal Article Competition Detection (LACD), which aims to identify competing articles within a given law. Our novel retrieval method, CAM-Re2, outperforms existing relevant methods, reducing false positives by 20.8% and false negatives by 8.3%, while achieving a 98.2% improvement in precision@5, for the LACD task. We release our codes at this https URL.
zh
[NLP-37] QUENCH: Measuring the gap between Indic and Non-Indic Contextual General Reasoning in LLM s COLING2025
【速读】: 该论文试图解决大型语言模型 (LLMs) 在传统基准测试系统之外需要更高级的评估工具的问题。解决方案的关键是引入 QUENCH,这是一个新颖的基于文本的英语问答基准,通过手动筛选和转录自 YouTube 问答视频,包含被遮蔽的实体和推理过程,供 LLMs 通过生成方式预测。QUENCH 特别关注地理背景和常识推理的交叉领域,通过零样本、开放域的问答设置,评估 LLMs 的世界知识和推理能力。研究还通过广泛的模型评估(涉及7个LLMs和4个指标),探讨了模型规模、提示风格、地理背景和金标推理生成的影响,并进行了错误分析以揭示 LLMs 的常见错误。
链接: https://arxiv.org/abs/2412.11763
作者: Mohammad Aflah Khan,Neemesh Yadav,Sarah Masud,Md. Shad Akhtar
机构: IIIT Delhi, India(印度国际信息技术学院德里分校)
关键词: English Quizzing Benchmark, large language models, advanced benchmarking systems, rise of large, large language
类目: Computation and Language (cs.CL)
备注: 17 Pages, 6 Figures, 8 Tables, COLING 2025
点击查看摘要
Abstract:The rise of large language models (LLMs) has created a need for advanced benchmarking systems beyond traditional setups. To this end, we introduce QUENCH, a novel text-based English Quizzing Benchmark manually curated and transcribed from YouTube quiz videos. QUENCH possesses masked entities and rationales for the LLMs to predict via generation. At the intersection of geographical context and common sense reasoning, QUENCH helps assess world knowledge and deduction capabilities of LLMs via a zero-shot, open-domain quizzing setup. We perform an extensive evaluation on 7 LLMs and 4 metrics, investigating the influence of model size, prompting style, geographical context, and gold-labeled rationale generation. The benchmarking concludes with an error analysis to which the LLMs are prone.
zh
[NLP-38] SCITAT: A Question Answering Benchmark for Scientific Tables and Text Covering Diverse Reasoning Types
【速读】: 该论文旨在解决当前科学问答(SQA)数据集在推理类型上的局限性以及忽视表格与文本之间关联性的问题。为此,论文提出了一个名为SciTaT的问答基准,该基准涵盖了多样化的推理类型,并要求问题尽可能结合表格和文本。解决方案的关键在于提出了一个强基线模型CaR,该模型通过结合多种推理方法来处理不同的推理类型,并同时处理表格和文本。实验结果显示,CaR在SciTaT上的平均性能比其他基线模型提升了12.9%,验证了其有效性。此外,错误分析揭示了SciTaT面临的挑战,如复杂的数值计算和领域知识需求。
链接: https://arxiv.org/abs/2412.11757
作者: Xuanliang Zhang,Dingzirui Wang,Baoxin Wang,Longxu Dou,Xinyuan Lu,Keyan Xu,Dayong Wu,Qingfu Zhu,Wanxiang Che
机构: Harbin Institute of Technology(哈尔滨工业大学); iFLYTEK Research(科大讯飞研究院); National University of Singapore(新加坡国立大学)
关键词: important task aimed, reasoning types, Scientific question answering, tables and text, current SQA datasets
类目: Computation and Language (cs.CL)
备注:
点击查看摘要
Abstract:Scientific question answering (SQA) is an important task aimed at answering questions based on papers. However, current SQA datasets have limited reasoning types and neglect the relevance between tables and text, creating a significant gap with real scenarios. To address these challenges, we propose a QA benchmark for scientific tables and text with diverse reasoning types (SciTaT). To cover more reasoning types, we summarize various reasoning types from real-world questions. To involve both tables and text, we require the questions to incorporate tables and text as much as possible. Based on SciTaT, we propose a strong baseline (CaR), which combines various reasoning methods to address different reasoning types and process tables and text at the same time. CaR brings average improvements of 12.9% over other baselines on SciTaT, validating its effectiveness. Error analysis reveals the challenges of SciTaT, such as complex numerical calculations and domain knowledge.
zh
[NLP-39] Common Ground Diverse Roots: The Difficulty of Classifying Common Examples in Spanish Varieties
【速读】: 该论文试图解决在处理西班牙语变体时,由于忽视跨地区或文化间的共同语言表达(common examples)而导致的偏差问题,尤其是在设计用于文化敏感任务(如仇恨言论检测或对话代理)的自然语言处理(NLP)系统中。解决方案的关键在于利用训练动态(training dynamics)自动检测现有西班牙语数据集中的共同表达或错误,并通过预测标签置信度来识别难以分类的样本,尤其是这些共同表达,从而提升模型在变体识别任务中的性能。此外,论文还引入了首个专注于识别古巴西班牙语变体的数据集,该数据集包含了共同表达的标注,以促进更准确的变体检测。
链接: https://arxiv.org/abs/2412.11750
作者: Javier A. Lopetegui,Arij Riabi,Djamé Seddah
机构: INRIA Paris, France(法国国家信息与自动化研究所巴黎分部)
关键词: Variations in languages, NLP systems designed, culturally sensitive tasks, hate speech detection, conversational agents
类目: Computation and Language (cs.CL)
备注: Accepted to VARDIAL 2025
点击查看摘要
Abstract:Variations in languages across geographic regions or cultures are crucial to address to avoid biases in NLP systems designed for culturally sensitive tasks, such as hate speech detection or dialog with conversational agents. In languages such as Spanish, where varieties can significantly overlap, many examples can be valid across them, which we refer to as common examples. Ignoring these examples may cause misclassifications, reducing model accuracy and fairness. Therefore, accounting for these common examples is essential to improve the robustness and representativeness of NLP systems trained on such data. In this work, we address this problem in the context of Spanish varieties. We use training dynamics to automatically detect common examples or errors in existing Spanish datasets. We demonstrate the efficacy of using predicted label confidence for our Datamaps (Swayamdipta et al., 2020) implementation for the identification of hard-to-classify examples, especially common examples, enhancing model performance in variety identification tasks. Additionally, we introduce a Cuban Spanish Variety Identification dataset with common examples annotations developed to facilitate more accurate detection of Cuban and Caribbean Spanish varieties. To our knowledge, this is the first dataset focused on identifying the Cuban, or any other Caribbean, Spanish variety.
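Datamaps 式的训练动态分析,核心是在各个 epoch 记录模型赋予每个样本真实标签的概率,再用均值(置信度)与标准差(波动度)定位难分类样本;下面是该统计量的最小示意,概率矩阵为假设的已记录数据。

```python
# 概念示意:基于训练动态(training dynamics)计算每个样本的置信度与波动度
# probs_per_epoch[e][i] 为第 e 个 epoch 模型赋予样本 i 真实标签的概率(假设已记录)
import numpy as np

def data_map_stats(probs_per_epoch: np.ndarray):
    """probs_per_epoch 形状为 (num_epochs, num_examples)。
    返回每个样本的 confidence(均值)与 variability(标准差)。"""
    confidence = probs_per_epoch.mean(axis=0)
    variability = probs_per_epoch.std(axis=0)
    return confidence, variability

probs = np.array([[0.90, 0.40, 0.20],
                  [0.95, 0.60, 0.25],
                  [0.92, 0.35, 0.30]])
conf, var = data_map_stats(probs)
# 低置信度的样本(如第 3 个)即为难分类样本,往往对应跨变体的 common examples
print(conf.round(2), var.round(2))
```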
zh
[NLP-40] Beyond Dataset Creation: Critical View of Annotation Variation and Bias Probing of a Dataset for Online Radical Content Detection COLING2025
【速读】: 该论文试图解决在线平台上激进内容泛滥的问题,特别是现有数据集和模型在处理多语言和多样化数据时的不足。解决方案的关键在于引入了一个公开可用的多语言数据集,该数据集标注了激进化程度、行动呼吁和命名实体,涵盖英语、法语和阿拉伯语,并通过伪名化保护个人隐私。此外,论文分析了标注过程中的偏见和分歧,探讨了社会人口特征对标注模式和模型预测的影响,强调了在激进内容检测中构建稳健数据集的重要性,以及公平性和透明性在模型开发中的关键作用。
链接: https://arxiv.org/abs/2412.11745
作者: Arij Riabi,Virginie Mouilleron,Menel Mahamdi,Wissam Antoun,Djamé Seddah
机构: Inria(法国国家信息与自动化研究所)
关键词: poses significant risks, including inciting violence, spreading extremist ideologies, online platforms poses, platforms poses significant
类目: Computation and Language (cs.CL)
备注: Accepted to COLING 2025
点击查看摘要
Abstract:The proliferation of radical content on online platforms poses significant risks, including inciting violence and spreading extremist ideologies. Despite ongoing research, existing datasets and models often fail to address the complexities of multilingual and diverse data. To bridge this gap, we introduce a publicly available multilingual dataset annotated with radicalization levels, calls for action, and named entities in English, French, and Arabic. This dataset is pseudonymized to protect individual privacy while preserving contextual information. Beyond presenting our freely available dataset (this https URL), we analyze the annotation process, highlighting biases and disagreements among annotators and their implications for model performance. Additionally, we use synthetic data to investigate the influence of socio-demographic traits on annotation patterns and model predictions. Our work offers a comprehensive examination of the challenges and opportunities in building robust datasets for radical content detection, emphasizing the importance of fairness and transparency in model development.
zh
[NLP-41] CSR:Achieving 1 Bit Key-Value Cache via Sparse Representation
【速读】: 该论文试图解决大型语言模型(LLMs)在长上下文文本应用中面临的内存占用问题,特别是由于Key-Value (KV)缓存的线性增长导致的内存消耗过大,可能使模型在内存资源有限的情况下无法正常运行。解决方案的关键在于提出了一种名为Cache Sparse Representation (CSR)的新方法,通过将密集的KV缓存张量转换为稀疏索引和权重,从而实现更高效的内存表示。此外,论文还引入了NeuralDict,一种基于神经网络的自动生成稀疏表示所需字典的方法。实验结果表明,CSR在内存受限的环境中表现出色,且性能与现有的KV缓存量化算法相当。
链接: https://arxiv.org/abs/2412.11741
作者: Hongxuan Zhang,Yao Zhao,Jiaqi Zheng,Chenyi Zhuang,Jinjie Gu,Guihai Chen
机构: Ant Group(蚂蚁集团); 未知
关键词: significant scalability challenges, long-context text applications, text applications utilizing, applications utilizing large, utilizing large language
类目: Computation and Language (cs.CL)
备注:
点击查看摘要
Abstract:The emergence of long-context text applications utilizing large language models (LLMs) has presented significant scalability challenges, particularly in memory footprint. The linear growth of the Key-Value (KV) cache responsible for storing attention keys and values to minimize redundant computations can lead to substantial increases in memory consumption, potentially causing models to fail to serve with limited memory resources. To address this issue, we propose a novel approach called Cache Sparse Representation (CSR), which converts the KV cache by transforming the dense Key-Value cache tensor into sparse indexes and weights, offering a more memory-efficient representation during LLM inference. Furthermore, we introduce NeuralDict, a novel neural network-based method for automatically generating the dictionary used in our sparse representation. Our extensive experiments demonstrate that CSR achieves performance comparable to state-of-the-art KV cache quantization algorithms while maintaining robust functionality in memory-constrained environments.
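下面用一个极简示意说明"把稠密向量改写为字典上的稀疏索引与权重"这一表示思路:先按相关性选出少量字典原子,再用最小二乘求权重。字典在此为随机生成,仅作概念演示,并非论文中由神经网络学习得到的 NeuralDict。

```python
# 概念示意:将稠密 KV 向量压缩为 (索引, 权重) 的稀疏表示,并按需近似重建
# 字典为随机生成,仅作演示;论文中的字典(NeuralDict)由神经网络学习得到
import numpy as np

def build_dictionary(dim: int, num_atoms: int, seed: int = 0) -> np.ndarray:
    rng = np.random.default_rng(seed)
    atoms = rng.normal(size=(num_atoms, dim))
    return atoms / np.linalg.norm(atoms, axis=1, keepdims=True)

def to_sparse(v: np.ndarray, dictionary: np.ndarray, k: int = 16):
    """先按相关性选出 k 个原子,再用最小二乘求这些原子上的权重。"""
    idx = np.argsort(-np.abs(dictionary @ v))[:k]
    weights, *_ = np.linalg.lstsq(dictionary[idx].T, v, rcond=None)
    return idx, weights

def from_sparse(idx, weights, dictionary) -> np.ndarray:
    return weights @ dictionary[idx]

D = build_dictionary(dim=64, num_atoms=512)
v = np.random.default_rng(1).normal(size=64)
idx, w = to_sparse(v, D)
v_hat = from_sparse(idx, w, D)
print("相对重建误差:", round(float(np.linalg.norm(v - v_hat) / np.linalg.norm(v)), 3))
```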
zh
[NLP-42] Personalized LLM for Generating Customized Responses to the Same Query from Different Users
【速读】: 该论文试图解决现有大型语言模型(LLM)个性化工作中忽视提问者多样性的问题,提出了一种新的提问者感知(questioner-aware)LLM个性化方法,即针对同一查询生成不同提问者的个性化响应。解决方案的关键在于设计了一种双塔模型架构,包括一个跨提问者的通用编码器和一个提问者特定的编码器,并通过对比学习(contrastive learning)结合多视图增强(multi-view augmentation)来拉近同一提问者的对话表示,同时拉开不同提问者的对话表示。此外,为减少问题多样性对提问者对比学习的影响,论文采用了基于问题相似度的对话聚类方法,并在每个聚类内限制对比学习的范围。该研究还构建了一个包含173个提问者和12个响应者的多提问者数据集(MQDialog),并通过广泛的评估验证了个性化响应生成质量的显著提升。
链接: https://arxiv.org/abs/2412.11736
作者: Hang Zeng,Chaoyue Niu,Fan Wu,Chengfei Lv,Guihai Chen
机构: Shanghai Jiao Tong University(上海交通大学); Alibaba Group(阿里巴巴集团)
关键词: large language model, Existing work, questioner-aware LLM personalization, large language, assigned different responding
类目: Computation and Language (cs.CL)
备注: 9 pages
点击查看摘要
Abstract:Existing work on large language model (LLM) personalization assigned different responding roles to LLM, but overlooked the diversity of questioners. In this work, we propose a new form of questioner-aware LLM personalization, generating different responses even for the same query from different questioners. We design a dual-tower model architecture with a cross-questioner general encoder and a questioner-specific encoder. We further apply contrastive learning with multi-view augmentation, pulling close the dialogue representations of the same questioner, while pulling apart those of different questioners. To mitigate the impact of question diversity on questioner-contrastive learning, we cluster the dialogues based on question similarity and restrict the scope of contrastive learning within each cluster. We also build a multi-questioner dataset from English and Chinese scripts and WeChat records, called MQDialog, containing 173 questioners and 12 responders. Extensive evaluation with different metrics shows a significant improvement in the quality of personalized response generation.
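"拉近同一提问者、拉开不同提问者"的对比学习目标可以写成标准的 InfoNCE 形式;下面是一个 PyTorch 简化示意,其中两组对话表示如何由双塔编码器得到在此省略(属假设输入)。

```python
# 简化示意:以 InfoNCE 形式实现提问者对比学习,同一提问者的两段对话表示互为正样本
import torch
import torch.nn.functional as F

def questioner_info_nce(z1: torch.Tensor, z2: torch.Tensor, temperature: float = 0.1):
    """z1, z2 形状均为 (batch, dim):同一批提问者在两个增强视角下的对话表示。
    第 i 行互为正样本,批内其他提问者的表示作为负样本。"""
    z1, z2 = F.normalize(z1, dim=-1), F.normalize(z2, dim=-1)
    logits = z1 @ z2.t() / temperature          # (batch, batch) 相似度矩阵
    labels = torch.arange(z1.size(0))
    return F.cross_entropy(logits, labels)

z1 = torch.randn(8, 128)
z2 = torch.randn(8, 128)
print(questioner_info_nce(z1, z2).item())
```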
zh
[NLP-43] Findings of the WMT 2024 Shared Task on Discourse-Level Literary Translation
【速读】: 该论文旨在解决文学翻译中的篇章级翻译问题,特别是针对中文到英语、中文到德语和中文到俄语的翻译任务。解决方案的关键在于通过自动评估和人工评估相结合的方式,对提交的翻译系统进行性能测量,并基于整体人工判断对系统进行官方排名。此外,论文还发布了相关数据、系统输出和排行榜,以促进该领域的进一步研究和发展。
链接: https://arxiv.org/abs/2412.11732
作者: Longyue Wang,Siyou Liu,Chenyang Lyu,Wenxiang Jiao,Xing Wang,Jiahao Xu,Zhaopeng Tu,Yan Gu,Weiyu Chen,Minghao Wu,Liting Zhou,Philipp Koehn,Andy Way,Yulin Yuan
机构: 未知
关键词: WMT translation shared, Discourse-Level Literary Translation, translation shared task, WMT translation, Literary Translation
类目: Computation and Language (cs.CL)
备注: WMT2024
点击查看摘要
Abstract:Following last year, we have continued to host the WMT translation shared task this year, the second edition of the Discourse-Level Literary Translation. We focus on three language directions: Chinese-English, Chinese-German, and Chinese-Russian, with the latter two ones newly added. This year, we totally received 10 submissions from 5 academia and industry teams. We employ both automatic and human evaluations to measure the performance of the submitted systems. The official ranking of the systems is based on the overall human judgments. We release data, system outputs, and leaderboard at this https URL.
zh
[NLP-44] LLM s Can Simulate Standardized Patients via Agent Coevolution
【速读】: 该论文试图解决医疗人员培训中标准化患者(Standardized Patients, SPs)模拟的复杂性问题,特别是现有基于大语言模型(Large Language Model, LLM)的模拟患者研究主要集中在数据检索精度和通过人类反馈调整提示,而忽略了患者代理需要学习标准化的表现模式以通过无监督模拟生成类人患者响应的关键需求。解决方案的关键是提出了EvoPatient框架,通过患者代理和医生代理之间的多轮对话模拟诊断过程,同时积累经验以提升问答质量,最终实现对人类医生的培训。该框架在仅提供总体SP需求的情况下,相比现有推理方法在需求对齐和人类偏好方面提升了超过10%,并在资源消耗上实现了最佳平衡,具有良好的泛化能力。
链接: https://arxiv.org/abs/2412.11716
作者: Zhuoyun Du,Lujie Zheng,Renjun Hu,Yuyang Xu,Xiawei Li,Ying Sun,Wei Chen,Jian Wu,Haolei Cai,Haohao Ying
机构: State Key Lab of CAD & CG, Zhejiang University; Zhejiang Polytechnic Institute, Polytechnic Institute, Zhejiang University; College of Computer Science & Technology, Zhejiang University; Alibaba Group; Sun Yat-sen University Cancer Center; School of Public Health, Zhejiang University; Second Affiliated Hospital, Zhejiang University School of Medicine; Zhejiang Key Laboratory of Medical Imaging Artificial Intelligence
关键词: Large Language Model, Training medical personnel, requiring extensive domain, extensive domain expertise, remains a complex
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Multiagent Systems (cs.MA)
备注: Work in Progress
点击查看摘要
Abstract:Training medical personnel using standardized patients (SPs) remains a complex challenge, requiring extensive domain expertise and role-specific practice. Most research on Large Language Model (LLM)-based simulated patients focuses on improving data retrieval accuracy or adjusting prompts through human feedback. However, this focus has overlooked the critical need for patient agents to learn a standardized presentation pattern that transforms data into human-like patient responses through unsupervised simulations. To address this gap, we propose EvoPatient, a novel simulated patient framework in which a patient agent and doctor agents simulate the diagnostic process through multi-turn dialogues, simultaneously gathering experience to improve the quality of both questions and answers, ultimately enabling human doctor training. Extensive experiments on various cases demonstrate that, by providing only overall SP requirements, our framework improves over existing reasoning methods by more than 10% in requirement alignment and better human preference, while achieving an optimal balance of resource consumption after evolving over 200 cases for 10 hours, with excellent generalizability. The code will be available at this https URL.
zh
[NLP-45] Seeker: Towards Exception Safety Code Generation with Intermediate Language Agents Framework
【速读】: 该论文试图解决在实际软件开发中,由于不当或缺失的异常处理(exception handling)导致的代码脆弱性和可靠性问题。解决方案的关键在于提出了一种名为Seeker的多代理框架,该框架受专家开发者异常处理策略的启发,通过Scanner、Detector、Predator、Ranker和Handler五个代理协助大型语言模型(LLMs)更有效地检测、捕获和解决异常。这一框架旨在系统性地提升异常处理实践,从而增强代码的鲁棒性。
链接: https://arxiv.org/abs/2412.11713
作者: Xuanming Zhang,Yuxuan Chen,Yiming Zheng,Zhexin Zhang,Yuan Yuan,Minlie Huang
机构: 未知
关键词: exception handling, improper or missing, missing exception handling, handling, Distorted Handling Solution
类目: Computation and Language (cs.CL); Software Engineering (cs.SE)
备注: 30 pages, 9 figures, submitted to ARR Dec
点击查看摘要
Abstract:In real world software development, improper or missing exception handling can severely impact the robustness and reliability of code. Exception handling mechanisms require developers to detect, capture, and manage exceptions according to high standards, but many developers struggle with these tasks, leading to fragile code. This problem is particularly evident in open-source projects and impacts the overall quality of the software ecosystem. To address this challenge, we explore the use of large language models (LLMs) to improve exception handling in code. Through extensive analysis, we identify three key issues: Insensitive Detection of Fragile Code, Inaccurate Capture of Exception Block, and Distorted Handling Solution. These problems are widespread across real world repositories, suggesting that robust exception handling practices are often overlooked or mishandled. In response, we propose Seeker, a multi-agent framework inspired by expert developer strategies for exception handling. Seeker uses agents: Scanner, Detector, Predator, Ranker, and Handler to assist LLMs in detecting, capturing, and resolving exceptions more effectively. Our work is the first systematic study on leveraging LLMs to enhance exception handling practices in real development scenarios, providing valuable insights for future improvements in code reliability.
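摘要中归纳的三类问题(脆弱代码检测不敏感、异常块捕获不准、处理方案失真)可以用一个小例子直观感受:下面先给出一段典型的脆弱写法,再给出更接近专家实践的改写。示例与论文框架本身无关,仅说明问题形态。

```python
# 示例:脆弱的异常处理 vs. 更稳健的写法(仅用于说明问题形态,非论文代码)
import json
import logging

logger = logging.getLogger(__name__)

# 脆弱写法:捕获范围过宽、静默吞掉异常、不保留任何上下文信息
def load_config_fragile(path):
    try:
        return json.load(open(path))
    except Exception:
        return None

# 更稳健的写法:精确捕获、记录上下文、明确降级策略
def load_config_robust(path, default=None):
    try:
        with open(path, encoding="utf-8") as f:
            return json.load(f)
    except FileNotFoundError:
        logger.warning("配置文件不存在,使用默认配置: %s", path)
        return default
    except json.JSONDecodeError as e:
        logger.error("配置文件格式错误 %s: %s", path, e)
        raise  # 格式错误应向上抛出,而不是静默忽略
```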
zh
[NLP-46] MiMoTable: A Multi-scale Spreadsheet Benchmark with Meta Operations for Table Reasoning COLING2025
【速读】: 该论文试图解决现有大型语言模型(LLMs)在表格推理任务中与实际应用场景之间的差距问题。解决方案的关键在于提出了一个名为MiMoTable的多尺度电子表格基准,该基准具有两个核心特征:一是使用真实世界场景中的电子表格,涵盖七个领域并包含不同类型;二是定义了六类元操作(meta operations)作为新的难度衡量标准,用于评估每个问题的难度,并为现有基准提供新的难度视角。实验结果表明,尽管Claude-3.5-Sonnet在MiMoTable上取得了77.4%的准确率,但LLMs在该基准上的表现仍有显著提升空间,同时证明了新难度标准的有效性。
链接: https://arxiv.org/abs/2412.11711
作者: Zheng Li,Yang Du,Mao Zheng,Mingyang Song
机构: Tencent Hunyuan(腾讯混元)
关键词: Large Language Models, Language Models, Large Language, Extensive research, capability of Large
类目: Computation and Language (cs.CL)
备注: Accepted by COLING 2025
点击查看摘要
Abstract:Extensive research has been conducted to explore the capability of Large Language Models (LLMs) for table reasoning and has significantly improved the performance on existing benchmarks. However, tables and user questions in real-world applications are more complex and diverse, presenting an unignorable gap compared to the existing benchmarks. To fill the gap, we propose a Multi-scale spreadsheet benchmark with Meta operations for Table reasoning, named as MiMoTable. Specifically, MiMoTable incorporates two key features. First, the tables in MiMoTable are all spreadsheets used in real-world scenarios, which cover seven domains and contain different types. Second, we define a new criterion with six categories of meta operations for measuring the difficulty of each question in MiMoTable, simultaneously as a new perspective for measuring the difficulty of the existing benchmarks. Experimental results show that Claude-3.5-Sonnet achieves the best performance with 77.4% accuracy, indicating that there is still significant room to improve for LLMs on MiMoTable. Furthermore, we grade the difficulty of existing benchmarks according to our new criteria. Experiments have shown that the performance of LLMs decreases as the difficulty of benchmarks increases, thereby proving the effectiveness of our proposed new criterion.
zh
[NLP-47] Context Filtering with Reward Modeling in Question Answering COLING2025
【速读】: 该论文试图解决自然语言处理中的问答系统(Question Answering, QA)在处理检索到的上下文时,由于包含大量相关与不相关信息的混合而导致的性能提升受限问题。解决方案的关键在于引入一种上下文过滤方法,通过奖励建模(Reward Modeling)来去除非必要的细节,并在摘要模型训练过程中保留关键内容。该方法通过识别数据集对中的有用信息,避免了昂贵的人工评估,从而构建高效的问答模型。实验结果表明,该方法显著提升了模型的性能,特别是在低资源环境下,提出的EM Per Token (EPT) 指标显示了6.8倍的效率提升。
链接: https://arxiv.org/abs/2412.11707
作者: Sangryul Kim,James Thorne
机构: KAIST AI(KAIST AI); KAIST AI(KAIST AI)
关键词: Question Answering, relevant context retrieved, retrieval system, finding answers, context retrieved
类目: Computation and Language (cs.CL)
备注: Accepted Main Conference at COLING 2025
点击查看摘要
Abstract:Question Answering (QA) in NLP is the task of finding answers to a query within a relevant context retrieved by a retrieval system. Yet, the mix of relevant and irrelevant information in these contexts can hinder performance enhancements in QA tasks. To address this, we introduce a context filtering approach that removes non-essential details, summarizing crucial content through Reward Modeling. This method emphasizes keeping vital data while omitting the extraneous during summarization model training. We offer a framework for developing efficient QA models by discerning useful information from dataset pairs, bypassing the need for costly human evaluation. Furthermore, we show that our approach can significantly outperform the baseline, as evidenced by a 6.8-fold increase in the EM Per Token (EPT) metric, which we propose as a measure of token efficiency, indicating a notable token-efficiency boost for low-resource settings.
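论文提出的 EM Per Token(EPT)可以理解为"每消耗一个上下文 token 换来多少精确匹配";下面按这一定义给出示意实现,具体的 token 计数与归一化细节以论文为准(此处为假设)。

```python
# 示意:EM Per Token(EPT),用精确匹配得分除以输入上下文的 token 数来衡量 token 效率
# 具体 token 计数与归一化方式为假设,仅体现"效果/代价"的比值思想
def exact_match(prediction: str, gold: str) -> int:
    return int(prediction.strip().lower() == gold.strip().lower())

def em_per_token(predictions, golds, context_token_counts):
    em_sum = sum(exact_match(p, g) for p, g in zip(predictions, golds))
    token_sum = sum(context_token_counts)
    return em_sum / token_sum if token_sum else 0.0

preds = ["Paris", "1945"]
golds = ["Paris", "1944"]
tokens = [120, 80]   # 过滤后的上下文越短,同样的 EM 对应的 EPT 越高
print(em_per_token(preds, golds, tokens))
```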
zh
[NLP-48] Vocabulary Expansion of Chat Models with Unlabeled Target Language Data
【速读】: 该论文试图解决在多语言环境下,如何有效适应聊天模型(Chat models)以支持训练数据中未充分代表或缺失的语言的问题。解决方案的关键在于使用无标签的目标语言数据进行词汇扩展(Vocabulary Expansion, VE),并通过后处理技术从源模型中注入信息,而无需进一步训练。研究表明,这种无监督的适应方法在71%的情况下表现良好,并通过后处理技术进一步提升了87%的适应模型性能。
链接: https://arxiv.org/abs/2412.11704
作者: Atsuki Yamaguchi,Terufumi Morishita,Aline Villavicencio,Nikolaos Aletras
机构: University of Sheffield, United Kingdom; Hitachi, Ltd., Japan; University of Exeter, United Kingdom; The Alan Turing Institute, United Kingdom
关键词: general task-solving abilities, language models trained, models, trained solely, outperform base models
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:Chat models (i.e. language models trained to follow instructions through conversation with humans) outperform base models (i.e. trained solely on unlabeled data) in both conversation and general task-solving abilities. These models are generally English-centric and require further adaptation for languages that are underrepresented in or absent from their training data. A common technique for adapting base models is to extend the model’s vocabulary with target language tokens, i.e. vocabulary expansion (VE), and then continually pre-train it on language-specific data. Using chat data is ideal for chat model adaptation, but often, either this does not exist or is costly to construct. Alternatively, adapting chat models with unlabeled data is a possible solution, but it could result in catastrophic forgetting. In this paper, we investigate the impact of using unlabeled target language data for VE on chat models for the first time. We first show that off-the-shelf VE generally performs well across target language tasks and models in 71% of cases, though it underperforms in scenarios where source chat models are already strong. To further improve adapted models, we propose post-hoc techniques that inject information from the source model without requiring any further training. Experiments reveal the effectiveness of our methods, helping the adapted models to achieve performance improvements in 87% of cases.
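词汇扩展(VE)在工程上的常见做法,是向分词器加入目标语言 token 并同步扩展模型的嵌入矩阵,随后在目标语言数据上继续预训练;下面用 Hugging Face Transformers 给出示意,其中模型名与新增 token 仅为示例。

```python
# 示意:用 Hugging Face Transformers 做词汇扩展(模型与新增 token 仅为示例)
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # 示例模型,实际中替换为需要适配的 chat model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# 假设这些是从目标语言语料中统计得到的高频新 token
new_tokens = ["▁токен", "▁пример"]
num_added = tokenizer.add_tokens(new_tokens)

# 扩展嵌入矩阵,使新 token 获得可训练的嵌入;之后在目标语言数据上继续预训练
model.resize_token_embeddings(len(tokenizer))
print(f"新增 {num_added} 个 token,当前词表大小 {len(tokenizer)}")
```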
zh
[NLP-49] CoinMath: Harnessing the Power of Coding Instruction for Math LLM s
【速读】: 该论文试图解决如何利用编码指令数据提升大型语言模型(LLMs)在数学推理中的表现,特别是探讨不同编码风格对模型学习效果的影响。解决方案的关键在于提出了一种名为CoinMath的学习策略,该策略通过多样化编码风格的代码推理(包括简洁注释、描述性命名和硬编码解决方案)来增强数学推理能力。实验结果表明,CoinMath显著优于当前最先进的数学LLM模型MAmmoTH。
链接: https://arxiv.org/abs/2412.11699
作者: Chengwei Wei,Bin Wang,Jung-jae Kim,Guimei Liu,Nancy F. Chen
机构: Institute for Infocomm Research (I2R), ASTAR, Singapore; Centre for Frontier AI Research (CFAR), ASTAR, Singapore
关键词: Large Language Models, Large Language, solving mathematical problems, Language Models, shown strong performance
类目: Computation and Language (cs.CL)
备注:
点击查看摘要
Abstract:Large Language Models (LLMs) have shown strong performance in solving mathematical problems, with code-based solutions proving particularly effective. However, the best practice to leverage coding instruction data to enhance mathematical reasoning remains underexplored. This study investigates three key questions: (1) How do different coding styles of mathematical code-based rationales impact LLMs’ learning performance? (2) Can general-domain coding instructions improve performance? (3) How does integrating textual rationales with code-based ones during training enhance mathematical reasoning abilities? Our findings reveal that code-based rationales with concise comments, descriptive naming, and hardcoded solutions are beneficial, while improvements from general-domain coding instructions and textual rationales are relatively minor. Based on these insights, we propose CoinMath, a learning strategy designed to enhance mathematical reasoning by diversifying the coding styles of code-based rationales. CoinMath generates a variety of code-based rationales incorporating concise comments, descriptive naming conventions, and hardcoded solutions. Experimental results demonstrate that CoinMath significantly outperforms its baseline model, MAmmoTH, one of the SOTA math LLMs.
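所谓"代码式推理"(code-based rationale),即把数学题的解题过程写成可执行代码;下面是一个体现"描述性命名 + 简洁注释"风格的自拟示例,便于体会论文比较的编码风格差异。

```python
# 示例:代码式推理,采用描述性命名与简洁注释的风格(题目为自拟)
# 题目:书店原有 120 本书,上午卖出 45 本,下午又进货 30 本,现在有多少本?
initial_books = 120
books_sold_morning = 45
books_restocked_afternoon = 30

remaining_after_sales = initial_books - books_sold_morning        # 75
final_books = remaining_after_sales + books_restocked_afternoon   # 105
print(final_books)
```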
zh
[NLP-50] From Specific-MLLM to Omni-MLLM LLM to Omni-MLLM: A Survey about the MLLMs alligned with Multi-Modality
【速读】: 该论文试图解决多模态信息理解和生成的问题,解决方案的关键在于通过Omni-MLLM(全模态多模态语言模型)实现多模态的统一建模和交互。Omni-MLLM将不同模态的特征视为不同的“外语”,在统一的空间内实现跨模态的交互和理解。核心技术包括四个组成部分:统一建模、模态对齐预训练(alignment pretraining)、指令微调(instruction fine-tuning)以及开源数据集和交互能力测试。这些技术共同推动了多模态信息处理的研究进展。
链接: https://arxiv.org/abs/2412.11694
作者: Shixin Jiang,Jiafeng Liang,Ming Liu,Bing Qin
机构: Harbin Institute of Technology(哈尔滨工业大学); Peng Cheng Laboratory(鹏城实验室)
关键词: single-modal tasks, multimodal information, excels in single-modal, extends the range, range of general
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: 13 pages
点击查看摘要
Abstract:From the Specific-MLLM, which excels in single-modal tasks, to the Omni-MLLM, which extends the range of general modalities, this evolution aims to achieve understanding and generation of multimodal information. Omni-MLLM treats the features of different modalities as different “foreign languages,” enabling cross-modal interaction and understanding within a unified space. To promote the advancement of related research, we have compiled 47 relevant papers to provide the community with a comprehensive introduction to Omni-MLLM. We first explain the four core components of Omni-MLLM for unified modeling and interaction of multiple modalities. Next, we introduce the effective integration achieved through “alignment pretraining” and “instruction fine-tuning,” and discuss open-source datasets and testing of interaction capabilities. Finally, we summarize the main challenges facing current Omni-MLLM and outline future directions.
zh
[NLP-51] Multilingual and Explainable Text Detoxification with Parallel Corpora COLING2025
【速读】: 该论文试图解决数字虐待言论(digital abusive speech)在多语言环境下的自动文本解毒(automatic text detoxification)问题。解决方案的关键在于扩展并测试多语言平行语料库(parallel corpora),涵盖德语、中文、阿拉伯语、印地语和阿姆哈拉语,并在此基础上进行自动化的可解释性分析,深入探讨毒性文本与非毒性文本的描述特征差异。此外,论文提出了一种基于链式思维推理(Chain-of-Thoughts reasoning)的新型文本解毒方法,通过聚类相关描述属性来增强提示过程,从而提升解毒效果。
链接: https://arxiv.org/abs/2412.11691
作者: Daryna Dementieva,Nikolay Babakov,Amit Ronen,Abinew Ali Ayele,Naquee Rizwan,Florian Schneider,Xintong Wang,Seid Muhie Yimam,Daniil Moskovskiy,Elisei Stakovskii,Eran Kaufman,Ashraf Elnagar,Animesh Mukherjee,Alexander Panchenko
机构: Technical University of Munich(慕尼黑工业大学); Universidade de Santiago de Compostela(圣地亚哥-德孔波斯特拉大学); Shenkar College(申卡尔学院); University of Hamburg(汉堡大学); Bahir Dar University(巴赫尔大学); IIT Kharagpur(印度理工学院卡哈拉格普尔分校); University of Sharjah(沙迦大学); Skoltech(斯科尔科沃科技学院); AIRI(人工智能研究所); University of North Carolina at Chapel Hill(北卡罗来纳大学教堂山分校)
关键词: Government of India, European Union, European Parliament, social media platforms, digital abusive speech
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: COLING 2025, main conference, long
点击查看摘要
Abstract:Even with various regulations in place across countries and social media platforms (Government of India, 2021; European Parliament and Council of the European Union, 2022), digital abusive speech remains a significant issue. One potential approach to address this challenge is automatic text detoxification, a text style transfer (TST) approach that transforms toxic language into a more neutral or non-toxic form. To date, the availability of parallel corpora for the text detoxification task (Logacheva et al., 2022; Atwell et al., 2022; Dementieva et al., 2024a) has proven to be crucial for state-of-the-art approaches. With this work, we extend the parallel text detoxification corpus to new languages – German, Chinese, Arabic, Hindi, and Amharic – testing TST baselines in an extensive multilingual setup. Next, we conduct a first-of-its-kind automated, explainable analysis of the descriptive features of both toxic and non-toxic sentences, diving deeply into the nuances, similarities, and differences of toxicity and detoxification across 9 languages. Finally, based on the obtained insights, we experiment with a novel text detoxification method inspired by the Chain-of-Thoughts reasoning approach, enhancing the prompting process through clustering on relevant descriptive attributes.
zh
[NLP-52] Bias Vector: Mitigating Biases in Language Models with Task Arithmetic Approach COLING2025
【速读】: 该论文旨在解决语言模型 (Language Models, LMs) 在输出中反映的训练数据中的偏见和刻板印象所引发的社会问题。其关键解决方案是提出了一种名为“偏见向量 (Bias Vector)”的方法,该方法无需手动创建去偏数据。具体步骤包括:(1) 在有偏数据上持续训练预训练的 LMs,使用掩码语言建模 (Masked Language Modeling);(2) 构建偏见向量,作为有偏 LMs 与预训练 LMs 权重之间的差异;(3) 通过从预训练 LMs 的权重中减去偏见向量来实现去偏。实验结果表明,该方法在 SEAT 基准上平均提升了 0.177 分,且在 GLUE 基准的下游任务中未降低模型性能。
链接: https://arxiv.org/abs/2412.11679
作者: Daiki Shirafuji,Makoto Takenaka,Shinya Taguchi
机构: Mitsubishi Electric Corporation (三菱电机公司)
关键词: Bias Vector method, Bias Vector, causing social problems, Vector method, Bias
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Accepted to COLING2025
点击查看摘要
Abstract:The use of language models (LMs) has increased considerably in recent years, and the biases and stereotypes in training data that are reflected in the LM outputs are causing social problems. In this paper, inspired by task arithmetic, we propose the "Bias Vector" method for the mitigation of these LM biases. The Bias Vector method does not require manually created debiasing data. The three main steps of our approach involve: (1) continually training the pre-trained LMs on biased data using masked language modeling; (2) constructing the Bias Vector as the difference between the weights of the biased LMs and those of pre-trained LMs; and (3) subtracting the Bias Vector from the weights of the pre-trained LMs for debiasing. We evaluated the Bias Vector method on the SEAT across three LMs and confirmed an average improvement of 0.177 points. We demonstrated that the Bias Vector method does not degrade the LM performance on downstream tasks in the GLUE benchmark. In addition, we examined the impact of scaling factors, which control the magnitudes of Bias Vectors, with effect sizes on the SEAT and conducted a comprehensive evaluation of our debiased LMs across both the SEAT and GLUE benchmarks.
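摘要中的三步流程落到参数空间上只是很短的一段运算:取有偏模型与预训练模型的权重差作为偏见向量,再按缩放因子从预训练权重中减去。下面按这一描述给出示意,其中模型名、有偏检查点路径与缩放因子均为示例假设。

```python
# 示意:按摘要描述实现 Bias Vector 的参数运算部分(模型名与缩放因子仅为示例)
import torch
from transformers import AutoModelForMaskedLM

pretrained = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")
# 在有偏数据上用 MLM 继续训练得到的检查点,路径为假设
biased = AutoModelForMaskedLM.from_pretrained("path/to/biased-checkpoint")

scaling = 1.0  # 缩放因子,论文中考察了不同取值的影响
with torch.no_grad():
    for (name, p_pre), (_, p_bias) in zip(pretrained.named_parameters(),
                                          biased.named_parameters()):
        bias_vector = p_bias - p_pre        # 第 2 步:权重差即偏见向量
        p_pre.sub_(scaling * bias_vector)   # 第 3 步:从预训练权重中减去

pretrained.save_pretrained("debiased-model")
```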
zh
[NLP-53] BioBridge: Unified Bio-Embedding with Bridging Modality in Code-Switched EMR
【速读】: 该论文试图解决儿科急诊部门(PED)的拥挤问题,并提出了一种名为BioBridge的新框架,通过应用自然语言处理(NLP)技术处理以自由文本形式记录的电子病历(EMRs)来提升决策效率。解决方案的关键在于两个核心模块:“上下文中的桥接模式”(bridging modality in context)和“统一生物嵌入”(unified bio-embedding)。前者提升了对双语和代码转换(Code-Switching, CS)EMRs的上下文理解,后者通过将医学领域模型训练的知识注入基于编码器的模型,弥合了医学领域与通用领域之间的差距。实验结果表明,BioBridge在多个评估指标上显著优于传统的机器学习模型和预训练的编码器模型。
链接: https://arxiv.org/abs/2412.11671
作者: Jangyeong Jeon,Sangyeon Cho,Dongjoon Lee,Changhee Lee,Junyeong Kim
机构: Chung-Ang University(中央大学); Korea University(高丽大学)
关键词: Pediatric Emergency Department, significant global challenge, Pediatric Emergency, Emergency Department, Natural Language Processing
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Accepted at IEEE Access 2024
点击查看摘要
Abstract:Pediatric Emergency Department (PED) overcrowding presents a significant global challenge, prompting the need for efficient solutions. This paper introduces the BioBridge framework, a novel approach that applies Natural Language Processing (NLP) to Electronic Medical Records (EMRs) in written free-text form to enhance decision-making in PED. In non-English speaking countries, such as South Korea, EMR data is often written in a Code-Switching (CS) format that mixes the native language with English, with most code-switched English words having clinical significance. The BioBridge framework consists of two core modules: “bridging modality in context” and “unified bio-embedding.” The “bridging modality in context” module improves the contextual understanding of bilingual and code-switched EMRs. In the “unified bio-embedding” module, the knowledge of the model trained in the medical domain is injected into the encoder-based model to bridge the gap between the medical and general domains. Experimental results demonstrate that the proposed BioBridge significantly performance traditional machine learning and pre-trained encoder-based models on several metrics, including F1 score, area under the receiver operating characteristic curve (AUROC), area under the precision-recall curve (AUPRC), and Brier score. Specifically, BioBridge-XLM achieved enhancements of 0.85% in F1 score, 0.75% in AUROC, and 0.76% in AUPRC, along with a notable 3.04% decrease in the Brier score, demonstrating marked improvements in accuracy, reliability, and prediction calibration over the baseline XLM model. The source code will be made publicly available.
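评估指标中相对少见的 Brier 分数衡量预测概率与真实标签之间的均方差,数值越低表示概率校准越好;下面给出二分类情形下的示意实现。

```python
# 示意:二分类 Brier 分数 = 预测概率与 0/1 标签之间的均方误差,越低表示校准越好
import numpy as np

def brier_score(probs, labels):
    probs = np.asarray(probs, dtype=float)
    labels = np.asarray(labels, dtype=float)
    return float(np.mean((probs - labels) ** 2))

print(brier_score([0.9, 0.2, 0.6], [1, 0, 1]))  # (0.01 + 0.04 + 0.16) / 3 = 0.07
```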
zh
[NLP-54] C3oT: Generating Shorter Chain-of-Thought without Compromising Effectiveness AAAI2025
【速读】: 该论文试图解决生成式链式思维(Chain-of-Thought, CoT)在大语言模型(LLMs)中应用时,由于生成的CoT长度远超最终答案,导致解码成本增加和推理能力下降的问题。解决方案的关键在于提出了一种条件压缩链式思维(Conditioned Compressed Chain-of-Thought, C3oT)框架,该框架通过压缩器将原始较长的CoT压缩为较短的CoT,同时保留关键信息和可解释性;采用条件训练方法,使LLMs能够同时学习长CoT和短CoT之间的关系;并通过条件推理方法,利用从长CoT中学习的推理能力生成短CoT。实验结果表明,该方法能够在不降低CoT有效性的前提下,将生成的CoT长度压缩至多50%以上。
链接: https://arxiv.org/abs/2412.11664
作者: Yu Kang,Xianghui Sun,Liangyu Chen,Wei Zou
机构: 未知
关键词: large language models, CoT, effectively improve, significantly improve, improve the accuracy
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: Accepted by AAAI 2025
点击查看摘要
Abstract:Generating Chain-of-Thought (CoT) before deriving the answer can effectively improve the reasoning capabilities of large language models (LLMs) and significantly improve the accuracy of the generated answer. However, in most cases, the length of the generated CoT is much longer than the desired final answer, which results in additional decoding costs. Furthermore, existing research has discovered that shortening the reasoning steps in CoT, even while preserving the key information, diminishes LLMs’ abilities. These phenomena make it difficult to use LLMs and CoT in many real-world applications that only require the final answer and are sensitive to latency, such as search and recommendation. To reduce the costs of model decoding and shorten the length of the generated CoT, this paper presents Conditioned Compressed Chain-of-Thought (C3oT), a CoT compression framework that involves a compressor to compress an original longer CoT into a shorter CoT while maintaining key information and interpretability, a conditioned training method to train LLMs with both longer CoT and shorter CoT simultaneously to learn the corresponding relationships between them, and a conditioned inference method to gain the reasoning ability learned from longer CoT by generating shorter CoT. We conduct experiments over four datasets from arithmetic and commonsense scenarios, showing that the proposed method is capable of compressing the length of generated CoT by up to more than 50% without compromising its effectiveness.
zh
[NLP-55] Self-Adaptive Paraphrasing and Preference Learning for Improved Claim Verifiability ACL
【速读】: 该论文试图解决在事实核查中,社交媒体内容作为输入时存在的噪声问题,以及如何在不依赖标注数据的情况下提取适合核查的声明。解决方案的关键在于提出了一种自适应方法,该方法仅依赖于一个黑箱事实核查模型和一个生成式语言模型 (Generative Language Model, LM)。通过迭代优化生成式语言模型,使其生成的声明改写能够提升事实核查模型的性能,并通过直接偏好优化 (Direct Preference Optimization) 对齐生成式语言模型与事实核查模型。该方法不仅提高了声明的可核查性,还在被否定的声明上表现优于所有基线方法。
链接: https://arxiv.org/abs/2412.11653
作者: Amelie Wührl,Roman Klinger
机构: University of Stuttgart, Germany; University of Bamberg, Germany
关键词: predict verdicts accurately, claims critically influence, structure and phrasing, verdicts accurately, critically influence
类目: Computation and Language (cs.CL)
备注: Under review at ACL ARR
点击查看摘要
Abstract:In fact-checking, structure and phrasing of claims critically influence a model’s ability to predict verdicts accurately. Social media content in particular rarely serves as optimal input for verification systems, which necessitates pre-processing to extract the claim from noisy context before fact checking. Prior work suggests extracting a claim representation that humans find to be checkworthy and verifiable. This has two limitations: (1) the format may not be optimal for a fact-checking model, and (2), it requires annotated data to learn the extraction task from. We address both issues and propose a method to extract claims that is not reliant on labeled training data. Instead, our self-adaptive approach only requires a black-box fact checking model and a generative language model (LM). Given a tweet, we iteratively optimize the LM to generate a claim paraphrase that increases the performance of a fact checking model. By learning from preference pairs, we align the LM to the fact checker using direct preference optimization. We show that this novel setup extracts a claim paraphrase that is more verifiable than their original social media formulations, and is on par with competitive baselines. For refuted claims, our method consistently outperforms all baselines.
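将生成式 LM 向事实核查模型对齐所用的直接偏好优化(DPO),其损失只依赖"被偏好的改写"与"被拒绝的改写"在策略模型与参考模型下的序列 log 概率;下面是该损失的简化示意,四个 log 概率如何计算在此省略(属假设输入)。

```python
# 简化示意:DPO 损失,输入为偏好/拒绝样本在策略模型与参考模型下的序列级 log 概率
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta: float = 0.1):
    """四个输入均为形状 (batch,) 的序列级 log 概率(如何得到在此省略)。"""
    chosen_ratio = policy_chosen_logp - ref_chosen_logp
    rejected_ratio = policy_rejected_logp - ref_rejected_logp
    return -F.logsigmoid(beta * (chosen_ratio - rejected_ratio)).mean()

logps = [torch.tensor([-12.0, -9.5]), torch.tensor([-15.0, -11.0]),
         torch.tensor([-12.5, -10.0]), torch.tensor([-14.0, -10.5])]
print(dpo_loss(*logps).item())
```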
zh
[NLP-56] SE-GCL: An Event-Based Simple and Effective Graph Contrastive Learning for Text Representation
【速读】: 该论文试图解决当前图对比学习 (Graph Contrastive Learning, GCL) 在文本表示学习中的两个主要问题:一是现有方法通常需要结合领域知识或复杂的计算来指导数据增强过程,导致应用效率和范围受限;二是许多方法仅通过构建词-文档关系来学习文本表示,忽略了文本中丰富的上下文语义信息。为解决这些问题,论文提出了一种基于事件的简单且有效的图对比学习方法 (SE-GCL)。其关键在于从文本中提取事件块并构建内部关系图,以表示语义间的相互联系,确保保留最关键的语义信息。此外,论文设计了一个简化的无监督图对比学习框架,引入事件骨架 (event skeleton) 概念来简化复杂的数据增强技术,并通过多种损失函数促进向量空间中嵌入的多样性和平衡性,从而提升算法效率和文本表示的有效性。
链接: https://arxiv.org/abs/2412.11652
作者: Tao Meng,Wei Ai,Jianbin Li,Ze Wang,Yuntao Shou,Keqin Li
机构: HNU(湖南大学); CSUFT(中南林业科技大学); XJTU(西安交通大学); New Paltz(新帕尔兹)
关键词: natural language processing, graph contrastive learning, Text representation learning, contrastive learning, graph contrastive
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 19 pages, 6 tables
点击查看摘要
Abstract:Text representation learning is significant as the cornerstone of natural language processing. In recent years, graph contrastive learning (GCL) has been widely used in text representation learning due to its ability to represent and capture complex text information in a self-supervised setting. However, current mainstream graph contrastive learning methods often require the incorporation of domain knowledge or cumbersome computations to guide the data augmentation process, which significantly limits the application efficiency and scope of GCL. Additionally, many methods learn text representations only by constructing word-document relationships, which overlooks the rich contextual semantic information in the text. To address these issues and exploit representative textual semantics, we present an event-based, simple, and effective graph contrastive learning (SE-GCL) for text representation. Precisely, we extract event blocks from text and construct internal relation graphs to represent inter-semantic interconnections, which can ensure that the most critical semantic information is preserved. Then, we devise a streamlined, unsupervised graph contrastive learning framework to leverage the complementary nature of the event semantic and structural information for intricate feature data capture. In particular, we introduce the concept of an event skeleton for core representation semantics and simplify the typically complex data augmentation techniques found in existing graph contrastive learning to boost algorithmic efficiency. We employ multiple loss functions to prompt diverse embeddings to converge or diverge within a confined distance in the vector space, ultimately achieving a harmonious equilibrium. We conducted experiments on the proposed SE-GCL on four standard data sets (AG News, 20NG, SougouNews, and THUCNews) to verify its effectiveness in text representation learning.
zh
[NLP-57] On Crowdsourcing Task Design for Discourse Relation Annotation
【速读】: 该论文试图解决隐含话语关系(implicit discourse relations)的解释问题,特别是通过连接词插入的方式进行标注时,自由选择方法(free-choice approach)和强制选择方法(forced-choice approach)之间的差异。解决方案的关键在于重新标注DiscoGeM 1.0语料库,比较这两种方法在标注多样性和一致性上的表现。研究发现,尽管自由选择方法提供了更大的灵活性和直觉性,但其标注结果的多样性较低,且常集中于常见标签。这一结果揭示了任务设计与标注者能力之间的相互作用。
链接: https://arxiv.org/abs/2412.11637
作者: Frances Yung,Vera Demberg
机构: Saarland University(萨尔兰大学); Saarbrücken, Germany(德国萨尔布吕肯)
关键词: involves complex reasoning, Interpreting implicit discourse, relations involves complex, Interpreting implicit, discourse relations involves
类目: Computation and Language (cs.CL)
备注: To appear in the workshop of Context and Meaning - Navigating Disagreements in NLP Annotations
点击查看摘要
Abstract:Interpreting implicit discourse relations involves complex reasoning, requiring the integration of semantic cues with background knowledge, as overt connectives like because or then are absent. These relations often allow multiple interpretations, best represented as distributions. In this study, we compare two established methods that crowdsource English implicit discourse relation annotation by connective insertion: a free-choice approach, which allows annotators to select any suitable connective, and a forced-choice approach, which asks them to select among a set of predefined options. Specifically, we re-annotate the whole DiscoGeM 1.0 corpus – initially annotated with the free-choice method – using the forced-choice approach. The free-choice approach allows for flexible and intuitive insertion of various connectives, which are context-dependent. Comparison among over 130,000 annotations, however, shows that the free-choice strategy produces less diverse annotations, often converging on common labels. Analysis of the results reveals the interplay between task design and the annotators’ abilities to interpret and produce discourse relations.
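文中比较的“标注多样性”可以用标注分布的香农熵来直观衡量:熵越低说明标注越集中于少数常见连接词。以下示例数据为虚构,仅演示计算方式,并非论文实际使用的分析代码。

```python
from collections import Counter
from math import log2

def annotation_entropy(labels: list[str]) -> float:
    """标注分布的香农熵:数值越高表示连接词标注越分散(多样性越高)。"""
    counts = Counter(labels)
    total = sum(counts.values())
    return -sum((c / total) * log2(c / total) for c in counts.values())

free_choice = ["because", "because", "so", "because", "then", "because"]
forced_choice = ["because", "so", "then", "because", "although", "so"]
print(annotation_entropy(free_choice), annotation_entropy(forced_choice))
```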
zh
[NLP-58] Fool Me Fool Me: User Attitudes Toward LLM Falsehoods
【速读】: 该论文试图解决用户对大型语言模型(Large Language Models, LLMs)生成虚假信息的偏好问题,特别是用户对未标记虚假信息与标记虚假信息的偏好,以及对自信的虚假信息与模型承认缺乏知识的声明之间的偏好。研究的关键发现是,尽管虚假信息存在,但大多数用户(61%)更倾向于接受未标记的虚假信息,而不是标记的虚假信息,并且(69%)更倾向于接受自信的虚假信息,而不是模型承认缺乏知识的声明。此外,当用户被要求评估陈述的真实性时,对未标记和虚假信息的偏好略有下降但仍保持较高水平。这些发现表明,用户的偏好可能通过反馈机制无意中鼓励了LLMs生成虚假信息,这为未来的研究提出了伦理和实践上的挑战,特别是如何调整LLMs的行为以符合这些偏好。
链接: https://arxiv.org/abs/2412.11625
作者: Diana Bar-Or Nirman,Ariel Weizman,Amos Azaria
机构: Ariel University(阿里尔大学); Ariel University(阿里尔大学); Ariel University(阿里尔大学)
关键词: Large Language Models, Language Models, Large Language, central tools, provide inaccurate
类目: Computation and Language (cs.CL)
备注: 11 pages, 5 figures, 5 tables
点击查看摘要
Abstract:While Large Language Models (LLMs) have become central tools in various fields, they often provide inaccurate or false information. This study examines user preferences regarding falsehood responses from LLMs. Specifically, we evaluate preferences for LLM responses where false statements are explicitly marked versus unmarked responses and preferences for confident falsehoods compared to LLM disclaimers acknowledging a lack of knowledge. Additionally, we investigate how requiring users to assess the truthfulness of statements influences these preferences. Surprisingly, 61% of users prefer unmarked falsehood responses over marked ones, and 69% prefer confident falsehoods over LLMs admitting lack of knowledge. In all our experiments, a total of 300 users participated, contributing valuable data to our analysis and conclusions. When users are required to evaluate the truthfulness of statements, preferences for unmarked and falsehood responses decrease slightly but remain high. These findings suggest that user preferences, which influence LLM training via feedback mechanisms, may inadvertently encourage the generation of falsehoods. Future research should address the ethical and practical implications of aligning LLM behavior with such preferences.
zh
[NLP-59] MT-LENS: An all-in-one Toolkit for Better Machine Translation Evaluation
【速读】: 该论文试图解决现有机器翻译(Machine Translation, MT)评估工具在全面评估MT系统性能方面的不足,特别是缺乏对翻译质量、性别偏见检测、添加毒性和拼写错误鲁棒性等多方面能力的综合评估。解决方案的关键在于引入MT-LENS框架,该框架扩展了LM-eval-harness的功能,支持最新的数据集和广泛的评估指标,并提供用户友好的平台以进行系统比较和翻译分析。MT-LENS旨在超越传统的翻译质量评估,帮助研究人员和工程师更全面地理解神经机器翻译(NMT)模型的性能,并轻松测量系统的偏见。
链接: https://arxiv.org/abs/2412.11615
作者: Javier García Gilabert,Carlos Escolano,Audrey Mash,Xixian Liao,Maite Melero
机构: Barcelona Supercomputing Center (BSC)
关键词: gender bias detection, evaluate Machine Translation, evaluate Machine, Large Language Models, added toxicity
类目: Computation and Language (cs.CL)
备注: 6 pages, 2 figures
点击查看摘要
Abstract:We introduce MT-LENS, a framework designed to evaluate Machine Translation (MT) systems across a variety of tasks, including translation quality, gender bias detection, added toxicity, and robustness to misspellings. While several toolkits have become very popular for benchmarking the capabilities of Large Language Models (LLMs), existing evaluation tools often lack the ability to thoroughly assess the diverse aspects of MT performance. MT-LENS addresses these limitations by extending the capabilities of LM-eval-harness for MT, supporting state-of-the-art datasets and a wide range of evaluation metrics. It also offers a user-friendly platform to compare systems and analyze translations with interactive visualizations. MT-LENS aims to broaden access to evaluation strategies that go beyond traditional translation quality evaluation, enabling researchers and engineers to better understand the performance of an NMT model and also easily measure a system’s biases.
zh
[NLP-60] SPaR: Self-Play with Tree-Search Refinement to Improve Instruction-Following in Large Language Models
【速读】: 该论文试图解决现有偏好学习方法在生成偏好对时引入与指令遵循无关的内容变异(如不同表达方式),从而干扰模型识别关键差异以提升指令遵循能力的问题。解决方案的关键是引入SPaR(Self-Play with Tree-Search Refinement)框架,通过自博弈和树搜索策略,使语言模型(LLM)在遵循指令的前提下,逐步优化其响应,减少不必要的内容变异,生成有效且可比较的偏好对。实验结果表明,基于SPaR训练的LLaMA3-8B模型在IFEval基准测试中超越了GPT-4-Turbo,同时保持了通用能力,并展示了良好的扩展性和迁移性。
链接: https://arxiv.org/abs/2412.11605
作者: Jiale Cheng,Xiao Liu,Cunxiang Wang,Xiaotao Gu,Yida Lu,Dan Zhang,Yuxiao Dong,Jie Tang,Hongning Wang,Minlie Huang
机构: The Conversational Artificial Intelligence (CoAI) Group, Tsinghua University; Zhipu AI; The Knowledge Engineering Group (KEG), Tsinghua University
关键词: fundamental capability, capability of language, subtle requirements, accurately reflect, Instruction-following
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
点击查看摘要
Abstract:Instruction-following is a fundamental capability of language models, requiring the model to recognize even the most subtle requirements in the instructions and accurately reflect them in its output. Such an ability is well-suited for and often optimized by preference learning. However, existing methods often directly sample multiple independent responses from the model when creating preference pairs. Such practice can introduce content variations irrelevant to whether the instruction is precisely followed (e.g., different expressions about the same semantic), interfering with the goal of teaching models to recognize the key differences that lead to improved instruction following. In light of this, we introduce SPaR, a self-play framework integrating tree-search self-refinement to yield valid and comparable preference pairs free from distractions. By playing against itself, an LLM employs a tree-search strategy to refine its previous responses with respect to the instruction while minimizing unnecessary variations. Our experiments show that a LLaMA3-8B model, trained over three iterations guided by SPaR, surpasses GPT-4-Turbo on the IFEval benchmark without losing general capabilities. Furthermore, SPaR demonstrates promising scalability and transferability, greatly enhancing models like GLM-4-9B and LLaMA3-70B. We also identify how inference scaling in tree search would impact model performance. Our code and data are publicly available at this https URL.
zh
[NLP-61] AUEB-Archimedes at RIRAG-2025: Is obligation concatenation really all you need? COLING2025
【速读】: 该论文旨在解决RIRAG-2025共享任务中的监管问题回答任务,通过检索相关段落生成答案。解决方案的关键在于结合三种检索模型和一个重排序器,利用RePASs中的神经组件从检索段落中提取重要句子(“义务”),并通过迭代优化生成多个候选答案,选择最佳答案并进一步减少矛盾、覆盖更多义务,从而生成可读性强、连贯性好的答案。尽管直接提取的答案在RePASs评分中表现异常高(0.947),但通过生成和优化过程,最终生成的答案获得了更合理且较高的评分(0.639)。
链接: https://arxiv.org/abs/2412.11567
作者: Ioannis Chasandras,Odysseas S. Chlapanis,Ion Androutsopoulos
机构: Athens University of Economics and Business(雅典经济与商业大学); Archimedes/Athena RC(阿基米德/雅典研究与技术中心)
关键词: requires answering regulatory, answering regulatory questions, retrieving relevant passages, paper presents, shared task
类目: Computation and Language (cs.CL)
备注: RIRAG 2025 Shared-Task at RegNLP workshop collocated with COLING 2025
点击查看摘要
Abstract:This paper presents the systems we developed for RIRAG-2025, a shared task that requires answering regulatory questions by retrieving relevant passages. The generated answers are evaluated using RePASs, a reference-free and model-based metric. Our systems use a combination of three retrieval models and a reranker. We show that by exploiting a neural component of RePASs that extracts important sentences (‘obligations’) from the retrieved passages, we achieve a dubiously high score (0.947), even though the answers are directly extracted from the retrieved passages and are not actually generated answers. We then show that by selecting the answer with the best RePASs among a few generated alternatives and then iteratively refining this answer by reducing contradictions and covering more obligations, we can generate readable, coherent answers that achieve a more plausible and relatively high score (0.639).
zh
[NLP-62] The Role of Natural Language Processing Tasks in Automatic Literary Character Network Construction
【速读】: 该论文试图解决在从文学文本中自动提取角色网络时,低层次自然语言处理(NLP)任务(如命名实体识别 (NER) 和共指消解)对性能的影响问题。解决方案的关键在于通过实验验证NER和共指消解在角色共现网络提取中的作用,并揭示这些任务的性能如何影响角色网络的质量。研究通过逐步引入均匀分布的错误来模拟NER和共指消解的性能下降,发现NER的准确性对角色检测有显著影响,而仅依赖NER检测的提及会遗漏大量角色共现,因此需要共指消解来弥补这一不足。此外,研究还对比了基于大型语言模型(LLMs)的方法,发现传统NLP流水线在召回率方面优于这些模型。
链接: https://arxiv.org/abs/2412.11560
作者: Arthur Amalvy(LIA),Vincent Labatut(LIA),Richard Dufour(LS2N - équipe TALN)
机构: 未知
关键词: natural language processing, automatic extraction, texts is generally, generally carried, low-level NLP tasks
类目: Computation and Language (cs.CL)
备注:
点击查看摘要
Abstract:The automatic extraction of character networks from literary texts is generally carried out using natural language processing (NLP) cascading pipelines. While this approach is widespread, no study exists on the impact of low-level NLP tasks on their performance. In this article, we conduct such a study on a literary dataset, focusing on the role of named entity recognition (NER) and coreference resolution when extracting co-occurrence networks. To highlight the impact of these tasks’ performance, we start with gold-standard annotations, progressively add uniformly distributed errors, and observe their impact in terms of character network quality. We demonstrate that NER performance depends on the tested novel and strongly affects character detection. We also show that NER-detected mentions alone miss a lot of character co-occurrences, and that coreference resolution is needed to prevent this. Finally, we present comparison points with 2 methods based on large language models (LLMs), including a fully end-to-end one, and show that these models are outperformed by traditional NLP pipelines in terms of recall.
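共现网络抽取的核心步骤可以用 networkx 简单示意:在每个文本窗口内,将同窗口出现的角色提及两两连边并累计权重。示例中的窗口与提及均为虚构,且省略了论文强调的共指消解环节,仅演示共现统计本身。

```python
import itertools
import networkx as nx

# 每个元素代表一个文本窗口(如一个段落)中检测到的角色提及
windows = [["Elizabeth", "Darcy"], ["Darcy", "Bingley", "Jane"], ["Elizabeth", "Jane"]]

G = nx.Graph()
for mentions in windows:
    for a, b in itertools.combinations(set(mentions), 2):
        # 同一窗口内共现一次,对应边的权重加一
        w = G.get_edge_data(a, b, default={"weight": 0})["weight"]
        G.add_edge(a, b, weight=w + 1)

print(G.edges(data=True))
```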
zh
[NLP-63] Token Prepending: A Training-Free Approach for Eliciting Better Sentence Embeddings from LLMs
【速读】: 该论文试图解决大型语言模型(LLMs)在生成句子嵌入时由于因果注意力机制导致的句子信息编码偏差问题。解决方案的关键在于提出了一种名为Token Prepending(TP)的新技术,通过在每一层的输入中将解码的句子嵌入前置到句子开头,使得早期token能够访问完整的句子信息,从而在因果注意力机制下实现更准确的句子信息编码。该技术无需训练且可即插即用,能够显著提升现有基于提示的句子嵌入方法在各种语义文本相似性(STS)任务和下游分类任务中的性能,同时几乎不增加额外的推理成本。
链接: https://arxiv.org/abs/2412.11556
作者: Yuchen Fu,Zifeng Cheng,Zhiwei Jiang,Zhonghui Wang,Yafeng Yin,Zhengliang Li,Qing Gu
机构: State Key Laboratory for Novel Software Technology, Nanjing University, China (软件新技术国家重点实验室,南京大学,中国)
关键词: Extracting sentence embeddings, semantic understanding capabilities, large language models, demonstrated stronger semantic, stronger semantic understanding
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 14 pages, 5 figures
点击查看摘要
Abstract:Extracting sentence embeddings from large language models (LLMs) is a promising direction, as LLMs have demonstrated stronger semantic understanding capabilities. Previous studies typically focus on prompt engineering to elicit sentence embeddings from LLMs by prompting the model to encode sentence information into the embedding of the last token. However, LLMs are mostly decoder-only models with causal attention and the earlier tokens in the sentence cannot attend to the latter tokens, resulting in biased encoding of sentence information and cascading effects on the final decoded token. To this end, we propose a novel Token Prepending (TP) technique that prepends each layer’s decoded sentence embedding to the beginning of the sentence in the next layer’s input, allowing earlier tokens to attend to the complete sentence information under the causal attention mechanism. The proposed TP technique is a plug-and-play and training-free technique, which means it can be seamlessly integrated with various prompt-based sentence embedding methods and autoregressive LLMs. Extensive experiments on various Semantic Textual Similarity (STS) tasks and downstream classification tasks demonstrate that our proposed TP technique can significantly improve the performance of existing prompt-based sentence embedding methods across different LLMs, while incurring negligible additional inference cost.
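下面用随机张量粗略示意 Token Prepending 的机制:每层前向结束后取末位 token 的表示作为“解码出的句子嵌入”,并把它放到下一层输入序列的最前端,使早期 token 在因果注意力下也能看到整句信息。这只是机制层面的示意(layer_forward 为占位函数),并非论文接入具体 LLM 的实际代码。

```python
import torch

batch, seq_len, dim, n_layers = 2, 16, 64, 4
hidden = torch.randn(batch, seq_len, dim)      # 当前层输入的隐藏状态

def layer_forward(x: torch.Tensor) -> torch.Tensor:
    # 占位:真实场景中这里是 LLM 的一个因果注意力层
    return x + 0.01 * torch.randn_like(x)

for _ in range(n_layers):
    out = layer_forward(hidden)
    sent_emb = out[:, -1:, :]                  # 以末 token 表示作为本层“解码”的句子嵌入
    # Token Prepending:把句子嵌入放到下一层输入的最前面(序列长度保持不变)
    hidden = torch.cat([sent_emb, out[:, 1:, :]], dim=1)

print(hidden.shape)  # torch.Size([2, 16, 64])
```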
zh
[NLP-64] Error Diversity Matters: An Error-Resistant Ensemble Method for Unsupervised Dependency Parsing AAAI
【速读】: 该论文试图解决无监督依存句法分析(unsupervised dependency parsing)中,由于错误累积导致的集成模型(ensemble)鲁棒性不足的问题。解决方案的关键在于提出了一种高效的集成选择方法(ensemble-selection approach),通过避免错误累积来提升集成模型的性能和鲁棒性。实验结果表明,该方法不仅优于单个模型,还显著超越了以往未考虑错误多样性的集成技术。
链接: https://arxiv.org/abs/2412.11543
作者: Behzad Shayegh,Hobie H.-B. Lee,Xiaodan Zhu,Jackie Chi Kit Cheung,Lili Mou
机构: 1. University of Alberta (阿尔伯塔大学); 2. McMaster University (麦克马斯特大学); 3. McGill University (麦吉尔大学); 4. Mila (Mila)
关键词: dependency parse structures, address unsupervised dependency, unsupervised dependency parsing, output dependency parse, post hoc aggregation
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Accepted by the AAAI Conference on Artificial Intelligence (AAAI) 2025
点击查看摘要
Abstract:We address unsupervised dependency parsing by building an ensemble of diverse existing models through post hoc aggregation of their output dependency parse structures. We observe that these ensembles often suffer from low robustness against weak ensemble components due to error accumulation. To tackle this problem, we propose an efficient ensemble-selection approach that avoids error accumulation. Results demonstrate that our approach outperforms each individual model as well as previous ensemble techniques. Additionally, our experiments show that the proposed ensemble-selection method significantly enhances the performance and robustness of our ensemble, surpassing previously proposed strategies, which have not accounted for error diversity.
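依存结构事后聚合最朴素的形式是“逐 token 头结点投票”,如下所示;论文实际采用的聚合方式与考虑错误多样性的集成选择策略要复杂得多,且逐 token 投票并不保证得到合法的树结构,这里仅用于说明聚合的输入输出形态。

```python
from collections import Counter

# 三个无监督解析器对同一句子(4 个 token)给出的头结点序列,0 表示根
parses = [
    [0, 1, 1, 3],
    [0, 1, 2, 3],
    [0, 3, 1, 3],
]

def aggregate_heads(parses: list[list[int]]) -> list[int]:
    """对每个 token 的头结点做多数投票(平票时取最先出现者)。"""
    n_tokens = len(parses[0])
    return [Counter(p[i] for p in parses).most_common(1)[0][0] for i in range(n_tokens)]

print(aggregate_heads(parses))  # [0, 1, 1, 3]
```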
zh
[NLP-65] Towards a Speech Foundation Model for Singapore and Beyond
【速读】: 该论文旨在解决新加坡及东南亚地区多样化的语音处理需求,特别是针对英语(包括新加坡英语)的语音识别任务。解决方案的关键在于开发了MERaLiON Speech Encoder,这是一个基于自监督学习(self-supervised learning)的预训练模型,通过掩码语言建模(masked language modelling)从20万小时的未标注语音数据中从头训练。该模型不仅在自发语音和新加坡英语基准测试中表现出改进,还在其他十个语音任务中与最先进的语音编码器保持竞争力。其核心创新在于针对特定区域语言需求的定制化训练和扩展能力,未来计划逐步支持更多语言。
链接: https://arxiv.org/abs/2412.11538
作者: Muhammad Huzaifah,Tianchi Liu,Hardik B. Sailor,Kye Min Tan,Tarun K. Vangani,Qiongqiong Wang,Jeremy H. M. Wong,Nancy F. Chen,Ai Ti Aw
机构: Institute for Infocomm Research (I2R), A*STAR, Singapore
关键词: MERaLiON Speech Encoder, National Multimodal Large, Singapore National Multimodal, downstream speech applications, foundation model designed
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
备注:
点击查看摘要
Abstract:This technical report describes the MERaLiON Speech Encoder, a foundation model designed to support a wide range of downstream speech applications. Developed as part of Singapore’s National Multimodal Large Language Model Programme, the MERaLiON Speech Encoder is tailored to address the speech processing needs in Singapore and the surrounding Southeast Asian region. The model currently supports mainly English, including the variety spoken in Singapore. We are actively expanding our datasets to gradually cover other languages in subsequent releases. The MERaLiON Speech Encoder was pre-trained from scratch on 200K hours of unlabelled speech data using a self-supervised learning approach based on masked language modelling. We describe our training procedure and hyperparameter tuning experiments in detail below. Our evaluation demonstrates improvements to spontaneous and Singapore speech benchmarks for speech recognition, while remaining competitive to other state-of-the-art speech encoders across ten other speech tasks. We commit to releasing our model, supporting broader research endeavours, both in Singapore and beyond.
zh
[NLP-66] Let your LLM generate a few tokens and you will reduce the need for retrieval
【速读】: 该论文试图解决如何高效训练大型语言模型 (LLM) 以判断其参数记忆中是否已存储某个答案的问题。解决方案的关键在于引入了一个名为“IK (I Know) score”的评分机制,通过蒸馏一个作为“法官”的 LLM 来计算该评分。该方法在检索增强生成 (RAG) 的背景下尤为有效,能够显著减少搜索和重排序步骤(超过 50%),并且仅需约 20,000 个训练样本即可实现良好性能。核心创新在于使用教师模型(LLM 作为法官)生成训练数据,并通过不同类型的教师模型(包括基于字符串的方法和 LLM)评估 IK 分类器的鲁棒性,其中 LLM 表现更优。
链接: https://arxiv.org/abs/2412.11536
作者: Hervé Déjean
机构: 未知
关键词: efficiently large language, large language models, parametric memory, investigate how efficiently, efficiently large
类目: Computation and Language (cs.CL)
备注:
点击查看摘要
Abstract:In this paper, we investigate how efficiently large language models (LLM) can be trained to check whether an answer is already stored in their parametric memory. We distill an LLM-as-a-judge to compute the IK (I Know) score. We found that this method is particularly beneficial in the context of retrieval-assisted augmented generation (RAG), with a respectable accuracy of 80%. It enables a significant reduction (more than 50%) in the number of search and reranking steps required for certain data sets. We have also introduced the IK score, which serves as a useful tool for characterising datasets by facilitating the classification task. Interestingly, through the inclusion of response tokens as input, our results suggest that only about 20,000 training samples are required to achieve good performance. The central element of this work is the use of a teacher model - the LLM as a judge - to generate training data. We also assess the robustness of the IK classifier by evaluating it with various types of teachers, including both string-based methods and LLMs, with the latter providing better results.
zh
[NLP-67] DART: An AIGT Detector using AMR of Rephrased Text
【速读】: 该论文试图解决现有AI生成文本检测方法的两个主要挑战:一是对黑箱大型语言模型(LLMs)的检测性能较低,因为现有模型主要依赖于句法特征;二是大多数检测器仅在单一候选设置下进行测试,假设已知生成文本的来源,这与现实场景不符。解决方案的关键是提出了DART方法,该方法通过四个步骤(重述、语义解析、评分和多类分类),能够在不依赖句法特征和已知文本来源的情况下,有效区分多个黑箱LLMs生成的文本。
链接: https://arxiv.org/abs/2412.11517
作者: Hyeonchu Park,Byungjun Kim,Bugeun Kim
机构: Chung-Ang University (中央大学); Department of Artificial Intelligence (人工智能系)
关键词: large language models, human-like texts, AI-generated texts, generate more human-like, large language
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Under review
点击查看摘要
Abstract:As large language models (LLMs) generate more human-like texts, concerns about the side effects of AI-generated texts (AIGT) have grown. So, researchers have developed methods for detecting AIGT. However, two challenges remain. First, the performance on detecting black-box LLMs is low, because existing models have focused on syntactic features. Second, most AIGT detectors have been tested on a single-candidate setting, which assumes that we know the origin of an AIGT and may deviate from the real-world scenario. To resolve these challenges, we propose DART, which consists of four steps: rephrasing, semantic parsing, scoring, and multiclass classification. We conducted several experiments to test the performance of DART by following previous work. The experimental result shows that DART can discriminate multiple black-box LLMs without using syntactic features and knowing the origin of AIGT.
zh
[NLP-68] Glimpse: Enabling White-Box Methods to Use Proprietary Models for Zero-Shot LLM-Generated Text Detection
【速读】: 该论文试图解决当前零样本检测技术在检测大型语言模型(LLM)生成文本时面临的挑战,特别是白盒方法受限于较弱的开放源代码LLM,而黑盒方法则因无法完全观测到更强的专有LLM而受限。解决方案的关键是提出了一种名为Glimpse的概率分布估计方法,通过从部分观测中预测完整的分布,从而使白盒方法能够扩展到最新的专有模型。实验结果表明,Glimpse结合Fast-DetectGPT和GPT-3.5在五个最新源模型中实现了约0.95的平均AUROC,相较于开放源代码基线提升了51%,证明了该方法的有效性。
链接: https://arxiv.org/abs/2412.11506
作者: Guangsheng Bao,Yanbin Zhao,Juncai He,Yue Zhang
机构: Zhejiang University; School of Engineering; Westlake University; School of Mathematics; Physics and Statistics; Shanghai Polytechnic University; Computer; Electrical and Mathematical Science and Engineering Division; King Abdullah University of Science and Technology
关键词: LLM-generated text detection, Advanced large language, large language models, text detection, generate text
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 10 pages, 9 figures, 10 tables
点击查看摘要
Abstract:Advanced large language models (LLMs) can generate text almost indistinguishable from human-written text, highlighting the importance of LLM-generated text detection. However, current zero-shot techniques face challenges as white-box methods are restricted to use weaker open-source LLMs, and black-box methods are limited by partial observation from stronger proprietary LLMs. It seems impossible to enable white-box methods to use proprietary models because API-level access to the models neither provides full predictive distributions nor inner embeddings. To traverse the divide, we propose Glimpse, a probability distribution estimation approach, predicting the full distributions from partial observations. Despite the simplicity of Glimpse, we successfully extend white-box methods like Entropy, Rank, Log-Rank, and Fast-DetectGPT to latest proprietary models. Experiments show that Glimpse with Fast-DetectGPT and GPT-3.5 achieves an average AUROC of about 0.95 in five latest source models, improving the score by 51% relative to the remaining space of the open source baseline (Table 1). It demonstrates that the latest LLMs can effectively detect their own outputs, suggesting that advanced LLMs may be the best shield against themselves.
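Glimpse 的关键是由 API 仅返回的 top-K 概率推断完整分布。下面给出一个基于“尾部几何衰减”假设的示意实现(衰减率取 top-K 末两项之比,这只是演示用的假设,具体估计方式以论文为准),并用估计出的分布计算熵,对应其扩展的 Entropy 类检测方法。

```python
import numpy as np

def estimate_full_distribution(topk_probs: np.ndarray, vocab_size: int) -> np.ndarray:
    """假设尾部概率按几何衰减,由 top-K 概率外推出整个词表的分布(示意,非论文原式)。"""
    k = len(topk_probs)
    tail_mass = 1.0 - topk_probs.sum()
    n_tail = vocab_size - k
    # 以 top-K 中最后两项的比值作为衰减率,构造归一化的几何尾部
    decay = max(min(topk_probs[-1] / topk_probs[-2], 0.999), 1e-6)
    tail = decay ** np.arange(1, n_tail + 1)
    tail = tail / tail.sum() * tail_mass
    return np.concatenate([topk_probs, tail])

topk = np.array([0.42, 0.21, 0.09, 0.05, 0.03])   # API 仅返回的 top-5 概率
full = estimate_full_distribution(topk, vocab_size=50257)
entropy = -np.sum(full * np.log(full + 1e-12))
print(full.shape, round(float(entropy), 3))
```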
zh
[NLP-69] Intention Knowledge Graph Construction for User Intention Relation Modeling
【速读】: 该论文试图解决在线平台中理解用户意图的挑战,特别是缺乏对意图之间连接的关注,这对于用户行为建模和未来行为预测至关重要。解决方案的关键在于引入了一个自动生成意图知识图谱(intention knowledge graph)的框架,该框架能够捕捉用户意图之间的连接。通过使用Amazon m2数据集构建了一个包含3.51亿条边(351 million)的意图图谱,该模型不仅展示了高度的合理性和接受度,还能有效预测新的会话意图并提升产品推荐效果,显著超越了现有的最先进方法,证明了该方法的实际应用价值。
链接: https://arxiv.org/abs/2412.11500
作者: Jiaxin Bai,Zhaobo Wang,Junfei Cheng,Dan Yu,Zerui Huang,Weiqi Wang,Xin Liu,Chen Luo,Qi He,Yanming Zhu,Bo Li,Yangqiu Song
机构: CSE, Hong Kong University of Science and Technology; CSE, Shanghai Jiaotong University; Amazon.com
关键词: Understanding user intentions, Understanding user, online platforms, challenging for online, Understanding
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:Understanding user intentions is challenging for online platforms. Recent work on intention knowledge graphs addresses this but often lacks focus on connecting intentions, which is crucial for modeling user behavior and predicting future actions. This paper introduces a framework to automatically generate an intention knowledge graph, capturing connections between user intentions. Using the Amazon m2 dataset, we construct an intention graph with 351 million edges, demonstrating high plausibility and acceptance. Our model effectively predicts new session intentions and enhances product recommendations, outperforming previous state-of-the-art methods and showcasing the approach’s practical utility.
zh
[NLP-70] FTP: A Fine-grained Token-wise Pruner for Large Language Models via Token Routing
【速读】: 该论文试图解决大规模语言模型(LLMs)在推理过程中计算开销巨大的问题,特别是在工业应用中的部署障碍。解决方案的关键在于提出了一种细粒度的token级剪枝方法,通过引入一个可学习的路由器(learnable router)来自适应地识别并跳过模型块中不重要的token,从而减少推理时的计算成本。为了高效构建该路由器,论文还提出了一种基于搜索的稀疏调度器(search-based sparsity scheduler),并结合四个低维因子作为输入和三个提出的损失函数进行训练。实验结果表明,该方法在多个基准测试中实现了最先进的剪枝效果,显著优于现有的剪枝方法。
链接: https://arxiv.org/abs/2412.11494
作者: Zekai Li,Jintu Zheng,Ji Liu,Han Liu,Haowei Zhu,Zeping Li,Fuwei Yang,Haiduo Huang,Jinzhang Peng,Dong Li,Lu Tian,Emad Barsoum
机构: Advanced Micro Devices, Inc.(超威半导体公司)
关键词: demonstrated superior performance, large language models, increase model size, significantly increase model, large language
类目: Computation and Language (cs.CL)
备注:
点击查看摘要
Abstract:Recently, large language models (LLMs) have demonstrated superior performance across various tasks by adhering to scaling laws, which significantly increase model size. However, the huge computation overhead during inference hinders the deployment in industrial applications. Many works leverage traditional compression approaches to boost model inference, but these always introduce additional training costs to restore the performance and the pruning results typically show noticeable performance drops compared to the original model when aiming for a specific level of acceleration. To address these issues, we propose a fine-grained token-wise pruning approach for the LLMs, which presents a learnable router to adaptively identify the less important tokens and skip them across model blocks to reduce computational cost during inference. To construct the router efficiently, we present a search-based sparsity scheduler for pruning sparsity allocation, a trainable router combined with our proposed four low-dimensional factors as input and three proposed losses. We conduct extensive experiments across different benchmarks on different LLMs to demonstrate the superiority of our method. Our approach achieves state-of-the-art (SOTA) pruning results, surpassing other existing pruning methods. For instance, our method outperforms BlockPruner and ShortGPT by approximately 10 points on both LLaMA2-7B and Qwen1.5-7B in accuracy retention at comparable token sparsity levels.
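可学习路由器“为每个 token 打重要性分并跳过低分 token”的前向过程可用 PyTorch 粗略示意如下;真实的 FTP 路由器以论文提出的四个低维因子为输入并配合稀疏度调度器训练,此处的线性打分器与固定保留比例均为演示用假设。

```python
import torch
import torch.nn as nn

class TokenRouter(nn.Module):
    def __init__(self, dim: int, keep_ratio: float = 0.7):
        super().__init__()
        self.scorer = nn.Linear(dim, 1)   # 为每个 token 输出一个重要性分数
        self.keep_ratio = keep_ratio

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: [batch, seq, dim];只保留分数最高的一部分 token 送入后续模块
        scores = self.scorer(x).squeeze(-1)                      # [batch, seq]
        k = max(1, int(x.size(1) * self.keep_ratio))
        idx = scores.topk(k, dim=1).indices.sort(dim=1).values   # 保持 token 原有顺序
        return torch.gather(x, 1, idx.unsqueeze(-1).expand(-1, -1, x.size(-1)))

router = TokenRouter(dim=64)
print(router(torch.randn(2, 10, 64)).shape)  # torch.Size([2, 7, 64])
```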
zh
[NLP-71] NoteContrast: Contrastive Language-Diagnostic Pretraining for Medical Text
【速读】: 该论文试图解决医疗记录中诊断编码的自动化问题,旨在提高诊断编码的准确性以支持患者护理、医学研究和无误计费。解决方案的关键在于结合ICD-10诊断编码序列模型、大型语言模型(large language models)以及对比预训练(contrastive pre-training),构建一个集成模型,能够同时处理ICD-10诊断编码和对应的医疗文本。通过对比预训练方法,该模型在MIMIC-III数据集的多个诊断编码任务中显著提升了性能,超越了之前的最先进模型。
链接: https://arxiv.org/abs/2412.11477
作者: Prajwal Kailas,Max Homilius,Rahul C. Deo,Calum A. MacRae
机构: Brigham and Women’s Hospital; Harvard Medical School
关键词: enhancing patient care, medical notes, Accurate diagnostic coding, medical, diagnostic coding
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注:
点击查看摘要
Abstract:Accurate diagnostic coding of medical notes is crucial for enhancing patient care, medical research, and error-free billing in healthcare organizations. Manual coding is a time-consuming task for providers, and diagnostic codes often exhibit low sensitivity and specificity, whereas the free text in medical notes can be a more precise description of a patient’s status. Thus, accurate automated diagnostic coding of medical notes has become critical for a learning healthcare system. Recent developments in long-document transformer architectures have enabled attention-based deep-learning models to adjudicate medical notes. In addition, contrastive loss functions have been used to jointly pre-train large language and image models with noisy labels. To further improve the automated adjudication of medical notes, we developed an approach based on i) models for ICD-10 diagnostic code sequences using a large real-world data set, ii) large language models for medical notes, and iii) contrastive pre-training to build an integrated model of both ICD-10 diagnostic codes and corresponding medical text. We demonstrate that a contrastive approach for pre-training improves performance over prior state-of-the-art models for the MIMIC-III-50, MIMIC-III-rare50, and MIMIC-III-full diagnostic coding tasks.
zh
[NLP-72] Understanding Knowledge Hijack Mechanism in In-context Learning through Associative Memory
【速读】: 该论文试图解决的问题是理解在上下文学习 (In-context Learning, ICL) 过程中,大型语言模型 (LLMs) 如何平衡利用上下文信息和预训练阶段获得的全局知识进行下一个词预测。解决方案的关键在于研究诱导头机制 (induction head mechanism),这是ICL中的一个关键组件。通过理论分析两层transformer在接收由双词模型生成的提示时产生的logits,并设计特定实验提示来验证两层transformer的输出是否与理论结果一致,论文揭示了ICL中上下文信息与预训练知识之间的平衡机制。
链接: https://arxiv.org/abs/2412.11459
作者: Shuo Wang,Issei Sato
机构: The University of Tokyo(东京大学)
关键词: enables large language, large language models, contextual information provided, enables large, large language
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:
点击查看摘要
Abstract:In-context learning (ICL) enables large language models (LLMs) to adapt to new tasks without fine-tuning by leveraging contextual information provided within a prompt. However, ICL relies not only on contextual clues but also on the global knowledge acquired during pretraining for the next token prediction. Analyzing this process has been challenging due to the complex computational circuitry of LLMs. This paper investigates the balance between in-context information and pretrained bigram knowledge in token prediction, focusing on the induction head mechanism, a key component in ICL. Leveraging the fact that a two-layer transformer can implement the induction head mechanism with associative memories, we theoretically analyze the logits when a two-layer transformer is given prompts generated by a bigram model. In the experiments, we design specific prompts to evaluate whether the outputs of a two-layer transformer align with the theoretical results.
zh
[NLP-73] Towards Better Multi-task Learning: A Framework for Optimizing Dataset Combinations in Large Language Models
【速读】: 该论文试图解决在大语言模型中多任务学习(MTL)性能提升时,如何高效选择最佳数据集组合的问题。解决方案的关键在于提出了一种新颖的框架,该框架利用神经网络来预测最佳数据集组合,并通过迭代优化选择过程,显著提高了效率。该框架具有模型、数据集和领域无关的特性,实验结果表明其在多个生物医学数据集上的四个任务(命名实体识别、关系抽取、事件抽取和文本分类)中有效识别出更优的组合,验证了其最大化MTL潜力的潜力。
链接: https://arxiv.org/abs/2412.11455
作者: Zaifu Zhan,Rui Zhang
机构: University of Minnesota, Minneapolis, MN, USA; University of Minnesota, Minneapolis, MN, USA
关键词: enhancing multi-task learning, efficiently select optimal, large language models, select optimal dataset, multi-task learning
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 14 pages, 5 figures, 4 tables
点击查看摘要
Abstract:To efficiently select optimal dataset combinations for enhancing multi-task learning (MTL) performance in large language models, we proposed a novel framework that leverages a neural network to predict the best dataset combinations. The framework iteratively refines the selection, greatly improving efficiency, while being model-, dataset-, and domain-independent. Through experiments on 12 biomedical datasets across four tasks - named entity recognition, relation extraction, event extraction, and text classification - we demonstrate that our approach effectively identifies better combinations, even for tasks that may seem unpromising from a human perspective. This verifies that our framework provides a promising solution for maximizing MTL potential.
zh
[NLP-74] ACE-M3: Automatic Capability Evaluator for Multimodal Medical Models
【速读】: 该论文试图解决医学领域中多模态大语言模型(MLLMs)评估方法的不足问题,特别是传统评估指标(如ROUGE和BLEU)在医学场景下与人类判断不一致的局限性。解决方案的关键在于提出了ACE-M³,这是一个开源的自动能力评估工具,专门用于评估医学MLLMs的问答能力。其核心创新包括采用分支-合并架构进行详细分析和基于标准医学评估标准的简洁评分,以及引入基于奖励令牌的直接偏好优化(RTDPO)策略,以在不影响性能的情况下减少训练时间。
链接: https://arxiv.org/abs/2412.11453
作者: Xiechi Zhang,Shunfan Zheng,Linlin Wang,Gerard de Melo,Zhu Cao,Xiaoling Wang,Liang He
机构: 未知
关键词: multimodal large language, large language models, gain prominence, large language
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:As multimodal large language models (MLLMs) gain prominence in the medical field, the need for precise evaluation methods to assess their effectiveness has become critical. While benchmarks provide a reliable means to evaluate the capabilities of MLLMs, traditional metrics like ROUGE and BLEU employed for open domain evaluation only focus on token overlap and may not align with human judgment. Although human evaluation is more reliable, it is labor-intensive, costly, and not scalable. LLM-based evaluation methods have proven promising, but to date, there is still an urgent need for open-source multimodal LLM-based evaluators in the medical field. To address this issue, we introduce ACE-M³, an open-sourced Automatic Capability Evaluator for Multimodal Medical Models specifically designed to assess the question answering abilities of medical MLLMs. It first utilizes a branch-merge architecture to provide both detailed analysis and a concise final score based on standard medical evaluation criteria. Subsequently, a reward token-based direct preference optimization (RTDPO) strategy is incorporated to save training time without compromising performance of our model. Extensive experiments have demonstrated the effectiveness of our ACE-M³ model (code available at this https URL) in evaluating the capabilities of medical MLLMs.
zh
[NLP-75] Optimized Quran Passage Retrieval Using an Expanded QA Dataset and Fine-Tuned Language Models
【速读】: 该论文旨在解决《古兰经》问答系统中现代标准阿拉伯语与古典阿拉伯语之间的语言障碍问题,并提升问答系统的准确性。解决方案的关键在于更新和扩展原始数据集,从251个问题扩展到629个问题,并通过多样化与重构生成1895个问题,分为单答案、多答案和无答案类型。此外,通过微调多种变压器模型(如AraBERT、RoBERTa、CAMeLBERT、AraELECTRA和BERT),特别是AraBERT-base模型,显著提升了模型性能,MAP@10和MRR分别提高了63%和59%。同时,数据集扩展还改善了对“无答案”情况的处理,成功率从25%提升至75%。这些改进展示了数据集优化和模型架构调整对提升《古兰经》问答系统性能的重要性。
链接: https://arxiv.org/abs/2412.11431
作者: Mohamed Basem,Islam Oshallah,Baraa Hikal,Ali Hamdi,Ammar Mohamed
机构: 未知
关键词: modern standard Arabic, standard Arabic, classical Arabic, Understanding the deep, Holy Qur’an
类目: Computation and Language (cs.CL); Information Retrieval (cs.IR)
备注:
点击查看摘要
Abstract:Understanding the deep meanings of the Qur’an and bridging the language gap between modern standard Arabic and classical Arabic is essential to improve the question-and-answer system for the Holy Qur’an. The Qur’an QA 2023 shared task dataset had a limited number of questions with weak model retrieval. To address this challenge, this work updated the original dataset and improved the model accuracy. The original dataset, which contains 251 questions, was reviewed and expanded to 629 questions with question diversification and reformulation, leading to a comprehensive set of 1895 categorized into single-answer, multi-answer, and zero-answer types. Extensive experiments fine-tuned transformer models, including AraBERT, RoBERTa, CAMeLBERT, AraELECTRA, and BERT. The best model, AraBERT-base, achieved a MAP@10 of 0.36 and MRR of 0.59, representing improvements of 63% and 59%, respectively, compared to the baseline scores (MAP@10: 0.22, MRR: 0.37). Additionally, the dataset expansion led to improvements in handling “no answer” cases, with the proposed approach achieving a 75% success rate for such instances, compared to the baseline’s 25%. These results demonstrate the effect of dataset improvement and model architecture optimization in increasing the performance of QA systems for the Holy Qur’an, with higher accuracy, recall, and precision.
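文中报告的 MRR 与 MAP@10 两个检索指标可按如下方式计算;runs 中的相关性标注为虚构示例,且此处 AP 的分母取前 k 内的相关文档数,是常见的简化写法。

```python
def mrr(results: list[list[int]]) -> float:
    """results: 每个查询按排名排序的相关性 0/1 列表;取首个相关结果排名倒数的均值。"""
    total = 0.0
    for rels in results:
        for rank, rel in enumerate(rels, start=1):
            if rel:
                total += 1.0 / rank
                break
    return total / len(results)

def map_at_k(results: list[list[int]], k: int = 10) -> float:
    """MAP@k:对每个查询计算截断到前 k 的平均精确率,再对查询取均值。"""
    ap_sum = 0.0
    for rels in results:
        rels = rels[:k]
        hits, precisions = 0, []
        for rank, rel in enumerate(rels, start=1):
            if rel:
                hits += 1
                precisions.append(hits / rank)
        ap_sum += sum(precisions) / max(1, sum(rels))  # 分母为前 k 内的相关文档数(简化)
    return ap_sum / len(results)

# 两个查询的检索结果(1 = 相关段落)
runs = [[0, 1, 0, 1, 0, 0, 0, 0, 0, 0], [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]]
print(round(mrr(runs), 3), round(map_at_k(runs), 3))  # 0.75 0.75
```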
zh
[NLP-76] ConceptEdit: Conceptualization-Augmented Knowledge Editing in Large Language Models for Commonsense Reasoning
【速读】: 该论文试图解决常识知识编辑(Knowledge Editing, KE)在大型语言模型(LLM)中的挑战,包括现有资源的知识覆盖有限、大量常识知识难以标注标签以及现有编辑方法对知识格式的严格要求。解决方案的关键是提出了ConceptEdit框架,该框架将概念化(conceptualization)和实例化(instantiation)整合到LLM的KE流程中,通过动态诊断不合理的常识知识并使用验证器LLM进行增强,从而提升LLM的常识推理能力。实验结果表明,采用ConceptEdit增强的LLM在生成合理常识知识和多任务问答基准测试中表现优于其他基线方法。
链接: https://arxiv.org/abs/2412.11418
作者: Liyu Zhang,Weiqi Wang,Tianqing Fang,Yangqiu Song
机构: Department of Computer Science and Engineering, HKUST, Hong Kong SAR, China
关键词: Large Language Model, Large Language, Language Model, improve output consistency, adjust a Large
类目: Computation and Language (cs.CL)
备注:
点击查看摘要
Abstract:Knowledge Editing (KE) aims to adjust a Large Language Model’s (LLM) internal representations and parameters to correct inaccuracies and improve output consistency without incurring the computational expense of re-training the entire model. However, editing commonsense knowledge still faces difficulties, including limited knowledge coverage in existing resources, the infeasibility of annotating labels for an overabundance of commonsense knowledge, and the strict knowledge formats of current editing methods. In this paper, we address these challenges by presenting ConceptEdit, a framework that integrates conceptualization and instantiation into the KE pipeline for LLMs to enhance their commonsense reasoning capabilities. ConceptEdit dynamically diagnoses implausible commonsense knowledge within an LLM using another verifier LLM and augments the source knowledge to be edited with conceptualization for stronger generalizability. Experimental results demonstrate that LLMs enhanced with ConceptEdit successfully generate commonsense knowledge with improved plausibility compared to other baselines and achieve stronger performance across multiple question answering benchmarks.
zh
[NLP-77] Biased or Flawed? Mitigating Stereotypes in Generative Language Models by Addressing Task-Specific Flaws
【速读】: 该论文试图解决生成式语言模型在输出中反映和放大社会偏见的问题,特别是难以区分这些偏见是源于模型固有的偏见还是阅读理解任务中的错误。解决方案的关键在于提出了一种针对刻板印象的缓解框架,通过在通用数据集上的指令微调(instruction-tuning)隐式地减少生成模型中的刻板印象输出。该方法在不依赖显式去偏技术的情况下,显著降低了多个维度(如国籍、年龄、性别、残疾和外貌)的刻板印象输出,证明了其有效性并保持了模型的整体实用性。
链接: https://arxiv.org/abs/2412.11414
作者: Akshita Jha,Sanchit Kabra,Chandan K. Reddy
机构: Virginia Tech(弗吉尼亚理工大学)
关键词: amplify societal biases, Recent studies, reflect and amplify, amplify societal, societal biases
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:
点击查看摘要
Abstract:Recent studies have shown that generative language models often reflect and amplify societal biases in their outputs. However, these studies frequently conflate observed biases with other task-specific shortcomings, such as comprehension failure. For example, when a model misinterprets a text and produces a response that reinforces a stereotype, it becomes difficult to determine whether the issue arises from inherent bias or from a misunderstanding of the given content. In this paper, we conduct a multi-faceted evaluation that distinctly disentangles bias from flaws within the reading comprehension task. We propose a targeted stereotype mitigation framework that implicitly mitigates observed stereotypes in generative models through instruction-tuning on general-purpose datasets. We reduce stereotypical outputs by over 60% across multiple dimensions – including nationality, age, gender, disability, and physical appearance – by addressing comprehension-based failures, and without relying on explicit debiasing techniques. We evaluate several state-of-the-art generative models to demonstrate the effectiveness of our approach while maintaining the overall utility. Our findings highlight the need to critically disentangle the concept of `bias’ from other types of errors to build more targeted and effective mitigation strategies. CONTENT WARNING: Some examples contain offensive stereotypes.
zh
[NLP-78] Attention with Dependency Parsing Augmentation for Fine-Grained Attribution ACL
【速读】: 该论文旨在解决现有细粒度归因方法在验证基于检索增强生成 (RAG) 内容时的高计算复杂度和粗粒度表示问题。现有方法依赖于模型内部的相似度度量(如显著性分数和隐藏状态相似度),但这些方法要么计算复杂度高,要么无法在目标跨度后整合上下文信息。论文提出的解决方案包括两个关键技术:首先,通过集合联合操作聚合逐标记的证据,保持表示的粒度;其次,通过集成依存句法分析 (dependency parsing) 来增强目标跨度的语义完整性。实验结果表明,该方法在性能上优于所有先前的工作。
链接: https://arxiv.org/abs/2412.11404
作者: Qiang Ding,Lvzhou Luo,Yixuan Cao,Ping Luo
机构: Key Lab of Intelligent Information Processing of Chinese Academy of Sciences (CAS)(中国科学院智能信息处理重点实验室); Institute of Computing Technology, CAS (中国科学院计算技术研究所); University of Chinese Academy of Sciences, CAS (中国科学院大学); Peng Cheng Laboratory (鹏城实验室)
关键词: validating RAG-generated content, efficiently validating RAG-generated, fine-grained attribution mechanism, Existing fine-grained attribution, RAG-generated content
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 16 pages, 7 figures, submitted to ACL ARR 2024 October
点击查看摘要
Abstract:To assist humans in efficiently validating RAG-generated content, developing a fine-grained attribution mechanism that provides supporting evidence from retrieved documents for every answer span is essential. Existing fine-grained attribution methods rely on model-internal similarity metrics between responses and documents, such as saliency scores and hidden state similarity. However, these approaches suffer from either high computational complexity or coarse-grained representations. Additionally, a common problem shared by the previous works is their reliance on decoder-only Transformers, limiting their ability to incorporate contextual information after the target span. To address the above problems, we propose two techniques applicable to all model-internals-based methods. First, we aggregate token-wise evidence through set union operations, preserving the granularity of representations. Second, we enhance the attributor by integrating dependency parsing to enrich the semantic completeness of target spans. For practical implementation, our approach employs attention weights as the similarity metric. Experimental results demonstrate that the proposed method consistently outperforms all prior works.
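该方法的两个关键操作(以注意力权重作为相似度、对目标片段内逐 token 的证据做集合并集)可粗略示意如下;这里用随机矩阵代替真实注意力,并省略了依存句法扩展目标片段的步骤,属于示意性写法而非论文官方实现。

```python
import numpy as np

rng = np.random.default_rng(0)
n_answer_tokens, n_doc_tokens = 5, 40
# 模拟答案 token 对检索文档 token 的注意力权重矩阵
attn = rng.random((n_answer_tokens, n_doc_tokens))
attn /= attn.sum(axis=1, keepdims=True)

def span_evidence(attn: np.ndarray, span: range, top_k: int = 3) -> set[int]:
    """对目标片段内每个答案 token 取 top-k 证据 token,再做并集,保持 token 级粒度。"""
    evidence: set[int] = set()
    for i in span:
        topk = np.argsort(attn[i])[::-1][:top_k]
        evidence |= set(int(j) for j in topk)
    return evidence

print(sorted(span_evidence(attn, range(1, 4))))
```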
zh
[NLP-79] INTERACT: Enabling Interactive Question-Driven Learning in Large Language Models
【速读】: 该论文试图解决大型语言模型(LLMs)在知识获取过程中缺乏互动性和主动性的问题。解决方案的关键在于引入交互式学习框架INTERACT,通过“学生”LLM与“教师”LLM之间的迭代对话,使学生模型能够在多种上下文(如歌词、新闻、电影情节、学术论文和图像)中主动提问并精炼知识。实验表明,这种交互式学习在不同场景和模型架构下均能显著提升性能,最高可达25%的改进,并且能在仅五次对话后使“冷启动”学生模型达到静态学习基准的水平。此外,交互式学习还能缓解较弱教师模型的不足,展示了问题驱动学习的鲁棒性。
链接: https://arxiv.org/abs/2412.11388
作者: Aum Kendapadi,Kerem Zaman,Rakesh R. Menon,Shashank Srivastava
机构: UNC Chapel Hill (北卡罗来纳大学教堂山分校)
关键词: Large language models, remain passive learners, Large language, absorbing static data, Adaptive Concept Transfer
类目: Computation and Language (cs.CL)
备注: 30 pages, 8 figures, 14 tables
点击查看摘要
Abstract:Large language models (LLMs) excel at answering questions but remain passive learners–absorbing static data without the ability to question and refine knowledge. This paper explores how LLMs can transition to interactive, question-driven learning through student-teacher dialogues. We introduce INTERACT (INTERactive Learning for Adaptive Concept Transfer), a framework in which a “student” LLM engages a “teacher” LLM through iterative inquiries to acquire knowledge across 1,347 contexts, including song lyrics, news articles, movie plots, academic papers, and images. Our experiments show that across a wide range of scenarios and LLM architectures, interactive learning consistently enhances performance, achieving up to a 25% improvement, with ‘cold-start’ student models matching static learning baselines in as few as five dialogue turns. Interactive setups can also mitigate the disadvantages of weaker teachers, showcasing the robustness of question-driven learning.
zh
[NLP-80] Why Does ChatGPT “Delve” So Much? Exploring the Sources of Lexical Overrepresentation in Large Language Models COLING2025
【速读】: 该论文试图解决科学英语中词汇使用频率变化的问题,特别是由于大型语言模型(LLMs)的使用导致的词汇过度使用现象。解决方案的关键在于开发了一种形式化的、可转移的方法来识别这些词汇变化,并提出了“词汇过度表示之谜”。通过应用该方法,研究识别出21个在科学摘要中出现频率增加的关键词,并探讨了这些词汇过度使用的原因。研究排除了模型架构、算法选择和训练数据的影响,转而评估了人类反馈强化学习(RLHF)是否对此有贡献。实验结果表明,RLHF可能在一定程度上影响了词汇的过度使用,但参与者对某些关键词(如“delve”)的反应与其他关键词不同。论文强调,尽管对LLMs工作机制的洞察已触手可及,但模型开发过程中的透明度不足仍是研究的主要障碍。
链接: https://arxiv.org/abs/2412.11385
作者: Tom S. Juzek,Zina B. Ward
机构: Florida State University (佛罗里达州立大学)
关键词: undergoing rapid change, Scientific English, years ago, undergoing rapid, rapid change
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 15 pages, 8 figures, The 31st International Conference on Computational Linguistics (COLING 2025)
点击查看摘要
Abstract:Scientific English is currently undergoing rapid change, with words like “delve,” “intricate,” and “underscore” appearing far more frequently than just a few years ago. It is widely assumed that scientists’ use of large language models (LLMs) is responsible for such trends. We develop a formal, transferable method to characterize these linguistic changes. Application of our method yields 21 focal words whose increased occurrence in scientific abstracts is likely the result of LLM usage. We then pose “the puzzle of lexical overrepresentation”: WHY are such words overused by LLMs? We fail to find evidence that lexical overrepresentation is caused by model architecture, algorithm choices, or training data. To assess whether reinforcement learning from human feedback (RLHF) contributes to the overuse of focal words, we undertake comparative model testing and conduct an exploratory online study. While the model testing is consistent with RLHF playing a role, our experimental results suggest that participants may be reacting differently to “delve” than to other focal words. With LLMs quickly becoming a driver of global language change, investigating these potential sources of lexical overrepresentation is important. We note that while insights into the workings of LLMs are within reach, a lack of transparency surrounding model development remains an obstacle to such research.
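识别“突然被过度使用的词”最直接的思路是比较两个时期语料的平滑相对词频比,如下示意;示例语料为虚构,论文正式刻画语言变化的方法细节以原文为准。

```python
from collections import Counter

def frequency_ratio(corpus_new: list[str], corpus_old: list[str], alpha: float = 1.0):
    """返回每个词在新旧语料中的平滑相对频率比,比值越大越可能是"过度使用词"。"""
    c_new, c_old = Counter(corpus_new), Counter(corpus_old)
    n_new, n_old = sum(c_new.values()), sum(c_old.values())
    vocab = set(c_new) | set(c_old)
    ratios = {
        w: ((c_new[w] + alpha) / (n_new + alpha * len(vocab)))
           / ((c_old[w] + alpha) / (n_old + alpha * len(vocab)))
        for w in vocab
    }
    return sorted(ratios.items(), key=lambda kv: kv[1], reverse=True)

old = "we study the results and discuss the findings".split()
new = "we delve into the intricate results and underscore the findings".split()
print(frequency_ratio(new, old)[:3])
```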
zh
[NLP-81] ChatTime: A Unified Multimodal Time Series Foundation Model Bridging Numerical and Textual Data AAAI2025
【速读】: 该论文试图解决传统深度学习预测器在处理时间序列数据时仅依赖单一模态(数值数据)、使用固定长度窗口进行训练和预测,且无法适应不同场景的问题。解决方案的关键在于创新性地将时间序列建模为一种“外语”,并构建了ChatTime框架,这是一个统一的时间序列和文本处理模型。ChatTime作为即插即用的多模态时间序列基础模型,具备零样本预测能力,并支持时间序列和文本的双模态输入/输出。通过一系列实验验证了ChatTime在多任务和多场景中的优越性能,并创建了四个多模态数据集以填补数据空白。
链接: https://arxiv.org/abs/2412.11376
作者: Chengsen Wang,Qi Qi,Jingyu Wang,Haifeng Sun,Zirui Zhuang,Jinming Wu,Lei Zhang,Jianxin Liao
机构: 未知
关键词: Human experts typically, experts typically integrate, Human experts, typically integrate numerical, time series
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: Accepted by AAAI 2025
点击查看摘要
Abstract:Human experts typically integrate numerical and textual multimodal information to analyze time series. However, most traditional deep learning predictors rely solely on unimodal numerical data, using a fixed-length window for training and prediction on a single dataset, and cannot adapt to different scenarios. The powered pre-trained large language model has introduced new opportunities for time series analysis. Yet, existing methods are either inefficient in training, incapable of handling textual information, or lack zero-shot forecasting capability. In this paper, we innovatively model time series as a foreign language and construct ChatTime, a unified framework for time series and text processing. As an out-of-the-box multimodal time series foundation model, ChatTime provides zero-shot forecasting capability and supports bimodal input/output for both time series and text. We design a series of experiments to verify the superior performance of ChatTime across multiple tasks and scenarios, and create four multimodal datasets to address data gaps. The experimental results demonstrate the potential and utility of ChatTime.
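“把时间序列当作外语”通常要先把连续数值离散成有限词表中的 token。下面用等宽分箱给出一个极简示意(分箱方式与 token 形式均为演示用假设,ChatTime 实际的数值编码方案以论文为准)。

```python
import numpy as np

def series_to_tokens(values: np.ndarray, n_bins: int = 100) -> list[str]:
    """把归一化后的时间序列离散为 n_bins 个"词",使其可与文本一同送入语言模型。"""
    lo, hi = values.min(), values.max()
    scaled = (values - lo) / (hi - lo + 1e-9)
    bins = np.minimum((scaled * n_bins).astype(int), n_bins - 1)
    return [f"<bin_{b}>" for b in bins]

series = np.sin(np.linspace(0, 3.14, 8)) * 10 + 20
print(series_to_tokens(series, n_bins=10))
```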
zh
[NLP-82] Codenames as a Benchmark for Large Language Models
【速读】: 该论文试图解决如何有效评估大型语言模型(LLMs)在复杂语言推理任务中的表现问题,特别是通过使用流行的基于词汇的桌游《Codenames》作为基准测试。解决方案的关键在于利用LLMs在语言理解、心智理论和认知推理方面的增强能力,克服传统基于词嵌入(word embedding)技术的局限性,并评估不同LLMs在游戏中的表现及其在合作模式下的通用性。
链接: https://arxiv.org/abs/2412.11373
作者: Matthew Stephenson,Matthew Sidji,Benoît Ronval
机构: Flinders University(弗林德斯大学); University of Melbourne(墨尔本大学); UCLouvain(法语天主教鲁汶大学)
关键词: Large Language Models, Large Language, word-based board game, popular word-based board, board game Codenames
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:
点击查看摘要
Abstract:In this paper, we propose the use of the popular word-based board game Codenames as a suitable benchmark for evaluating the reasoning capabilities of Large Language Models (LLMs). Codenames presents a highly interesting challenge for achieving successful AI performance, requiring both a sophisticated understanding of language, theory of mind, and epistemic reasoning capabilities. Prior attempts to develop agents for Codenames have largely relied on word embedding techniques, which have a limited vocabulary range and perform poorly when paired with differing approaches. LLMs have demonstrated enhanced reasoning and comprehension capabilities for language-based tasks, but can still suffer in lateral thinking challenges. We evaluate the capabilities of several state-of-the-art LLMs, including GPT-4o, Gemini 1.5, Claude 3.5 Sonnet, and Llama 3.1, across a variety of board setups. Our results indicate that while certain LLMs perform better than others overall, different models exhibit varying emergent behaviours during gameplay and excel at specific roles. We also evaluate the performance of different combinations of LLMs when playing cooperatively together, demonstrating that LLM agents are more generalisable to a wider range of teammates than prior techniques.
zh
[NLP-83] Can AI Extract Antecedent Factors of Human Trust in AI? An Application of Information Extraction for Scientific Literature in Behavioural and Computer Sciences
【速读】: 该论文试图解决从科学文献中提取信息以将隐藏在文本中的非结构化知识转化为可用于下游任务决策的结构化数据的问题,特别是在人工智能信任(Trust in AI)领域,研究影响人类对人工智能应用信任的因素及其复杂关系。解决方案的关键在于通过领域专家的输入,精心设计标注指南,创建该领域首个英文标注数据集,并探索大语言模型(LLM)引导的标注方法。论文还通过使用最先进的命名实体和关系抽取方法对大语言模型进行基准测试,结果表明该问题需要监督学习,而当前基于提示的大语言模型可能无法实现这一目标。
链接: https://arxiv.org/abs/2412.11344
作者: Melanie McGrath,Harrison Bailey,Necva Bölücü,Xiang Dai,Sarvnaz Karimi,Cecile Paris
机构: CSIRO Data61(CSIRO数据61)
关键词: transform unstructured knowledge, unstructured knowledge hidden, down-stream tasks, scientific literature, main techniques
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:Information extraction from the scientific literature is one of the main techniques to transform unstructured knowledge hidden in the text into structured data which can then be used for decision-making in down-stream tasks. One such area is Trust in AI, where factors contributing to human trust in artificial intelligence applications are studied. The relationships of these factors with human trust in such applications are complex. We hence explore this space from the lens of information extraction where, with the input of domain experts, we carefully design annotation guidelines, create the first annotated English dataset in this domain, investigate an LLM-guided annotation, and benchmark it with state-of-the-art methods using large language models in named entity and relation extraction. Our results indicate that this problem requires supervised learning which may not be currently feasible with prompt-based LLMs.
zh
[NLP-84] Segment-Level Diffusion: A Framework for Controllable Long-Form Generation with Diffusion Language Models
【速读】: 该论文试图解决扩散模型在生成长篇、连贯且上下文准确的文本时面临的挑战。现有方法如词级别扩散忽略了词序依赖性,且强制短输出窗口,而段落级别扩散则在学习长文本的鲁棒表示方面存在困难。解决方案的关键在于提出段级别扩散 (Segment-Level Diffusion, SLD) 框架,通过文本分段、结合对抗学习和对比学习的鲁棒表示训练,以及改进的潜在空间引导来增强扩散模型的文本生成能力。具体来说,SLD将长文本输出分割为独立的潜在表示,并使用自回归解码器进行解码,从而简化了扩散预测并提高了可扩展性。实验结果表明,SLD在流畅性、连贯性和上下文兼容性方面相较于其他扩散和自回归基线模型表现更优。
链接: https://arxiv.org/abs/2412.11333
作者: Xiaochen Zhu,Georgi Karadzhov,Chenxi Whitehouse,Andreas Vlachos
机构: University of Cambridge (剑桥大学)
关键词: contextually accurate text, generating long, models have shown, shown promise, contextually accurate
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:Diffusion models have shown promise in text generation but often struggle with generating long, coherent, and contextually accurate text. Token-level diffusion overlooks word-order dependencies and enforces short output windows, while passage-level diffusion struggles with learning robust representation for long-form text. To address these challenges, we propose Segment-Level Diffusion (SLD), a framework that enhances diffusion-based text generation through text segmentation, robust representation training with adversarial and contrastive learning, and improved latent-space guidance. By segmenting long-form outputs into separate latent representations and decoding them with an autoregressive decoder, SLD simplifies diffusion predictions and improves scalability. Experiments on XSum, ROCStories, DialogSum, and DeliData demonstrate that SLD achieves competitive or superior performance in fluency, coherence, and contextual compatibility across automatic and human evaluation metrics comparing with other diffusion and autoregressive baselines. Ablation studies further validate the effectiveness of our segmentation and representation learning strategies.
zh
[NLP-85] Generics are puzzling. Can language models find the missing piece? COLING2025
【速读】: 该论文试图解决生成式语句(generics)在语义框架中的精确建模问题,特别是生成式语句中隐含的量化和上下文敏感性。解决方案的关键在于利用语言模型(language models)来研究生成式语句的隐含量化和上下文敏感性,并通过创建一个名为ConGen的数据集(包含2873个自然出现的生成式和量化句子)以及定义基于意外度(surprisal)的p-可接受性(p-acceptability)度量来实现。实验结果表明,生成式语句比确定性量化词更具上下文敏感性,并且约20%的生成式语句表达了弱泛化。此外,研究还探讨了语言模型中人类刻板印象的体现。
链接: https://arxiv.org/abs/2412.11318
作者: Gustavo Cilleruelo Calderón,Emily Allaway,Barry Haddow,Alexandra Birch
机构: School of Informatics, University of Edinburgh(爱丁堡大学信息学院)
关键词: world without explicit, generics, explicit quantification, Abstract, naturally occurring
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Accepted at CoLing 2025
点击查看摘要
Abstract:Generic sentences express generalisations about the world without explicit quantification. Although generics are central to everyday communication, building a precise semantic framework has proven difficult, in part because speakers use generics to generalise properties with widely different statistical prevalence. In this work, we study the implicit quantification and context-sensitivity of generics by leveraging language models as models of language. We create ConGen, a dataset of 2873 naturally occurring generic and quantified sentences in context, and define p-acceptability, a metric based on surprisal that is sensitive to quantification. Our experiments show generics are more context-sensitive than determiner quantifiers and about 20% of naturally occurring generics we analyze express weak generalisations. We also explore how human biases in stereotypes can be observed in language models.
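论文的 p-acceptability 基于 surprisal 定义。下面给出用 HuggingFace transformers 计算整句 surprisal(总负对数似然)的常规写法,以 GPT-2 为例(需要联网下载模型;p-acceptability 本身的具体定义请以论文为准)。

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def surprisal(sentence: str) -> float:
    """整句 surprisal(负对数似然,单位 nat):论文的 p-acceptability 在此基础上定义。"""
    ids = tok(sentence, return_tensors="pt").input_ids
    with torch.no_grad():
        loss = model(ids, labels=ids).loss      # 对各预测位置取平均的 NLL
    return float(loss) * (ids.size(1) - 1)      # 还原为整句的总 surprisal

print(surprisal("Ducks lay eggs."), surprisal("Ducks are female."))
```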
zh
[NLP-86] RoLargeSum: A Large Dialect-Aware Romanian News Dataset for Summary Headline and Keyword Generation COLING2024
【速读】: 该论文试图解决自动摘要任务中罗马尼亚语数据集匮乏的问题,关键解决方案是引入了RoLargeSum,一个从罗马尼亚和摩尔多瓦的公开新闻网站爬取并经过严格清洗的大规模摘要数据集。该数据集包含超过61.5万篇新闻文章及其摘要、标题、关键词、方言和其他元数据,为罗马尼亚语的自动摘要模型开发提供了高质量的资源。此外,论文还通过在RoLargeSum上评估多种BART变体和开源大型语言模型的性能,进行了基准测试,并手动评估了最佳系统的结果,以揭示该数据集的潜在问题和未来发展的方向。
链接: https://arxiv.org/abs/2412.11317
作者: Andrei-Marius Avram,Mircea Timpuriu,Andreea Iuga,Vlad-Cristian Matei,Iulian-Marius Tăiatu,Tudor Găină,Dumitru-Clementin Cercel,Florin Pop,Mihaela-Claudia Cercel
机构: National University of Science and Technology POLITEHNICA Bucharest, Romania; Paris 1 Panthéon-Sorbonne University, Paris, France; University of Bucharest, Bucharest, Romania; National Institute for Research and Development in Informatics - ICI Bucharest, Romania
关键词: supervised automatic summarisation, automatic summarisation methods, summarisation methods requires, methods requires sufficient, requires sufficient corpora
类目: Computation and Language (cs.CL)
备注: Accepted at COLING 2024 (long papers)
点击查看摘要
Abstract:Using supervised automatic summarisation methods requires sufficient corpora that include pairs of documents and their summaries. Similarly to many tasks in natural language processing, most of the datasets available for summarization are in English, posing challenges for developing summarization models in other languages. Thus, in this work, we introduce RoLargeSum, a novel large-scale summarization dataset for the Romanian language crawled from various publicly available news websites from Romania and the Republic of Moldova that were thoroughly cleaned to ensure a high-quality standard. RoLargeSum contains more than 615K news articles, together with their summaries, as well as their headlines, keywords, dialect, and other metadata that we found on the targeted websites. We further evaluated the performance of several BART variants and open-source large language models on RoLargeSum for benchmarking purposes. We manually evaluated the results of the best-performing system to gain insight into the potential pitfalls of this data set and future development.
zh
[NLP-87] Reliable, Reproducible, and Really Fast Leaderboards with Evalica COLING2025
【速读】: 该论文试图解决自然语言处理(NLP)技术快速发展背景下,如何建立可靠且可重复的模型评估协议的问题。解决方案的关键在于引入Evalica,一个开源工具包,它通过提供易于使用的Web界面、命令行接口和Python API,支持创建和维护可信的模型排行榜,从而促进模型评估的标准化和透明化。
链接: https://arxiv.org/abs/2412.11314
作者: Dmitry Ustalov
机构: JetBrains
关键词: natural language processing, modern evaluation protocols, instruction-tuned large language, large language models, urges the development
类目: Computation and Language (cs.CL)
备注: accepted at COLING 2025 system demonstration track
点击查看摘要
Abstract:The rapid advancement of natural language processing (NLP) technologies, such as instruction-tuned large language models (LLMs), urges the development of modern evaluation protocols with human and machine feedback. We introduce Evalica, an open-source toolkit that facilitates the creation of reliable and reproducible model leaderboards. This paper presents its design, evaluates its performance, and demonstrates its usability through its Web interface, command-line interface, and Python API.
zh
[NLP-88] Sequence-Level Analysis of Leakage Risk of Training Data in Large Language Models
【速读】: 该论文试图解决量化从大型语言模型(LLMs)中提取训练数据风险的问题,并提出使用序列级概率作为更细粒度的度量方法。解决方案的关键在于重新分析解码方案、模型大小、前缀长度、部分序列泄露和标记位置等因素,以揭示先前工作中由于度量选择而未能发现的新见解。研究发现,传统的提取率度量低估了随机化LLMs中训练数据泄露的威胁,且不同序列在不同模型大小和前缀长度下的提取难度存在显著差异。此外,部分泄露并不比逐字提取训练数据更容易,且序列中后期标记的提取难度显著低于早期标记。这些发现强调了在评估训练数据泄露风险时,应基于每个序列进行分析的重要性。
链接: https://arxiv.org/abs/2412.11302
作者: Trishita Tiwari,G. Edward Suh
机构: Cornell University(康奈尔大学); NVIDIA(英伟达)
关键词: Large Language Models, Large Language, sequence level probabilities, Language Models, previously obtained
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:
点击查看摘要
Abstract:This work advocates for the use of sequence-level probabilities for quantifying the risk of extracting training data from Large Language Models (LLMs), as they provide much finer-grained information than has been previously obtained. We re-analyze the effects of decoding schemes, model size, prefix length, partial sequence leakages, and token positions to uncover new insights that were not possible in prior work due to their choice of metrics. We perform this study on two pre-trained models, LLaMa and OPT, trained on the Common Crawl and Pile respectively. We discover that 1) Extraction rate, the predominant metric used in prior quantification work, underestimates the threat of leakage of training data in randomized LLMs by as much as 2.14x. 2) Though, on average, larger models and longer prefixes can extract more data, this is not true for a substantial portion of individual sequences: 30.4-41.5% of our sequences are easier to extract with either shorter prefixes or smaller models. 3) Contrary to prior belief, partial leakage in commonly used decoding schemes like top-k and top-p is not easier than leaking verbatim training data. 4) Extracting later tokens in a sequence is as much as 912% easier than extracting earlier tokens. The insights gained from our analysis show that it is important to look at leakage of training data on a per-sequence basis.
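作为示意,下面的 Python 草图(假设使用 Hugging Face transformers 与一个小型 OPT 模型)演示"序列级"分析所依赖的基本量:给定前缀,计算某条训练数据后缀的序列级对数概率 log p(suffix | prefix),以衡量其被逐字提取的可能性;具体的风险度量与解码设置以论文为准。

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_name = "facebook/opt-125m"  # 假设的小模型,仅作演示
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

def sequence_log_prob(prefix: str, suffix: str) -> float:
    """返回 suffix 在 prefix 条件下的序列级对数概率 log p(suffix | prefix)。"""
    prefix_ids = tokenizer(prefix, return_tensors="pt").input_ids
    suffix_ids = tokenizer(suffix, add_special_tokens=False, return_tensors="pt").input_ids
    input_ids = torch.cat([prefix_ids, suffix_ids], dim=1)
    with torch.no_grad():
        logits = model(input_ids).logits
    log_probs = torch.log_softmax(logits[0, :-1], dim=-1)
    targets = input_ids[0, 1:]
    start = prefix_ids.shape[1] - 1          # suffix 第一个 token 对应的预测位置
    idx = torch.arange(start, targets.shape[0])
    return log_probs[idx, targets[start:]].sum().item()

# 假设的前缀/后缀对:对数概率越高,通常意味着该序列越容易被逐字提取
print(sequence_log_prob("The quick brown fox", " jumps over the lazy dog"))
```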
zh
[NLP-89] CATER: Leveraging LLM to Pioneer a Multidimensional Reference-Independent Paradigm in Translation Quality Evaluation
【速读】: 该论文试图解决传统机器翻译(MT)评估方法依赖于参考译文、难以全面评估翻译质量的问题。解决方案的关键在于提出了一个全新的、基于提示的评估框架——综合AI辅助翻译编辑比率(CATER)。CATER利用大语言模型(LLMs)通过精心设计的提示协议,实现了多维度的、无需参考的评估,涵盖语言准确性、语义保真度、上下文连贯性、风格适宜性和信息完整性。其核心优势在于通过提供源文本和目标文本以及标准化提示,LLM能够快速识别错误、量化编辑工作量并生成类别和整体评分,无需预计算的参考或特定领域的资源,从而实现对多种语言、体裁和用户优先级的即时适应。
链接: https://arxiv.org/abs/2412.11261
作者: Kurando IIDA,Kenjiro MIMURA
机构: 未知
关键词: Comprehensive AI-assisted Translation, Translation Edit Ratio, evaluating machine translation, AI-assisted Translation Edit, introduces the Comprehensive
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 17pages,1sample prompt
点击查看摘要
Abstract:This paper introduces the Comprehensive AI-assisted Translation Edit Ratio (CATER), a novel and fully prompt-driven framework for evaluating machine translation (MT) quality. Leveraging large language models (LLMs) via a carefully designed prompt-based protocol, CATER expands beyond traditional reference-bound metrics, offering a multidimensional, reference-independent evaluation that addresses linguistic accuracy, semantic fidelity, contextual coherence, stylistic appropriateness, and information completeness. CATER’s unique advantage lies in its immediate implementability: by providing the source and target texts along with a standardized prompt, an LLM can rapidly identify errors, quantify edit effort, and produce category-level and overall scores. This approach eliminates the need for pre-computed references or domain-specific resources, enabling instant adaptation to diverse languages, genres, and user priorities through adjustable weights and prompt modifications. CATER’s LLM-enabled strategy supports more nuanced assessments, capturing phenomena such as subtle omissions, hallucinations, and discourse-level shifts that increasingly challenge contemporary MT systems. By uniting the conceptual rigor of frameworks like MQM and DQF with the scalability and flexibility of LLM-based evaluation, CATER emerges as a valuable tool for researchers, developers, and professional translators worldwide. The framework and example prompts are openly available, encouraging community-driven refinement and further empirical validation.
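下面是一个示意性的提示构造草图(Python),说明这类"纯提示驱动、无需参考译文"的评估大致如何组织输入并汇总分项得分;提示措辞、维度权重以及 call_llm 这一外部调用接口均为本文假设,并非论文公开的原始提示。

```python
import json

# 假设的评估维度与权重(可按用户优先级调整)
DIMENSIONS = {
    "linguistic_accuracy": 0.25,
    "semantic_fidelity": 0.25,
    "contextual_coherence": 0.20,
    "stylistic_appropriateness": 0.15,
    "information_completeness": 0.15,
}

PROMPT_TEMPLATE = """You are a professional translation evaluator.
Source text: {src}
Machine translation: {mt}
For each dimension ({dims}), list the errors found, estimate the edit
effort needed to fix them, and give a score from 0 to 100.
Return a JSON object mapping each dimension to its score."""

def evaluate_translation(src: str, mt: str, call_llm) -> float:
    """call_llm 为假设的外部函数:接收提示字符串,返回 LLM 的文本回复(JSON)。"""
    prompt = PROMPT_TEMPLATE.format(src=src, mt=mt, dims=", ".join(DIMENSIONS))
    scores = json.loads(call_llm(prompt))
    # 按权重加权汇总为整体得分
    return sum(DIMENSIONS[d] * float(scores[d]) for d in DIMENSIONS)
```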
zh
[NLP-90] Beyond Discrete Personas: Personality Modeling Through Journal Intensive Conversations COLING2025
【速读】: 该论文试图解决现有大型语言模型(LLMs)在生成个性化对话时,依赖静态、预定义人格(personas)导致对话无法真实反映人类性格动态变化的问题。解决方案的关键在于引入一个包含约40万对话的新数据集,并基于Reddit上的长篇日记条目构建生成个性化对话的框架。具体方法包括对每个作者的日记条目进行聚类,筛选出最能代表其性格的条目,并通过捕捉大五人格特质(Big Five personality traits)来进一步优化数据,确保对话能够真实反映个体的性格特征。通过在Llama 3 70B模型上进行微调,该方法在捕捉人格特质方面平均提升了11%,生成了更加连贯且富有个性化的对话。
链接: https://arxiv.org/abs/2412.11250
作者: Sayantan Pal,Souvik Das,Rohini K. Srihari
机构: State University of New York at Buffalo (纽约州立大学布法罗分校); Department of Computer Science and Engineering (计算机科学与工程系)
关键词: Large Language Models, Large Language, Synthetic Persona Chat, Blended Skill Talk, Persona Chat
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Accepted in COLING 2025
点击查看摘要
Abstract:Large Language Models (LLMs) have significantly improved personalized conversational capabilities. However, existing datasets like Persona Chat, Synthetic Persona Chat, and Blended Skill Talk rely on static, predefined personas. This approach often results in dialogues that fail to capture human personalities’ fluid and evolving nature. To overcome these limitations, we introduce a novel dataset with around 400,000 dialogues and a framework for generating personalized conversations using long-form journal entries from Reddit. Our approach clusters journal entries for each author and filters them by selecting the most representative cluster, ensuring that the retained entries best reflect the author’s personality. We further refine the data by capturing the Big Five personality traits (openness, conscientiousness, extraversion, agreeableness, and neuroticism), ensuring that dialogues authentically reflect an individual’s personality. Using Llama 3 70B, we generate high-quality, personality-rich dialogues grounded in these journal entries. Fine-tuning models on this dataset leads to an 11% improvement in capturing personality traits on average, outperforming existing approaches in generating more coherent and personality-driven dialogues.
zh
[NLP-91] TrimLLM: Progressive Layer Dropping for Domain-Specific LLMs
【速读】: 该论文试图解决在特定领域本地部署大型语言模型(LLMs)时,如何在满足延迟和隐私约束的同时,实现高效的推理加速和内存节省问题。解决方案的关键在于开发了TrimLLM,该方法基于作者在当代LLMs中观察到的逐层专业化现象,通过渐进式层丢弃来减少模型深度,从而在特定领域保持模型能力的同时,实现推理加速,且不依赖于专用硬件或深度学习框架。实验结果表明,TrimLLM在医疗、法律和金融数据集上的模型适应中,相比最先进的模型压缩算法,在消费级GPU上实现了2.1-5.7倍的推理加速,在A100上实现了3.1倍的加速,且在50~60%的模型压缩比下未损失精度。
链接: https://arxiv.org/abs/2412.11242
作者: Lanxiang Hu,Tajana Rosing,Hao Zhang
机构: University of California, San Diego (加州大学圣地亚哥分校)
关键词: Specializing large language, Specializing large, large language models, privacy constraints, large language
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:
点击查看摘要
Abstract:Specializing large language models (LLMs) for local deployment in domain-specific use cases is necessary for strong performance while meeting latency and privacy constraints. However, conventional task-specific adaptation approaches do not show simultaneous memory saving and inference speedup at deployment time. Practical compression techniques like quantization and pruning require dedicated hardware or kernel support to achieve measured inference speedup. We develop TrimLLM based on the layer-wise specialization phenomenon we empirically observed and verified on contemporary LLMs. TrimLLM reduces the depth of LLMs via progressive layer dropping. We show it retains LLMs’ capacity in specific domains and achieves inference speedup irrespective of hardware and deep learning frameworks. We evaluated TrimLLM on LLMs of various sizes for inference; models adapted on medical, legal, and financial datasets all demonstrate 2.1-5.7x inference speedup on consumer GPUs and up to 3.1x speedup on A100 when compared to state-of-the-art model compression algorithms, with no loss in accuracy at 50~60% model compression ratio.
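下面的 Python 草图(假设使用 Hugging Face transformers 的 Llama 类模型)示意"渐进式丢层"的基本操作:每轮尝试移除一层解码器、在领域验证集上评估,选性能下降最小的一层真正移除,下降超过阈值即停止。丢层顺序与停止判据以论文为准,此处的贪心搜索与 evaluate 接口均为示意。

```python
import copy
from transformers import AutoModelForCausalLM

# 假设的模型名,仅作演示;实际实验中的模型与数据以论文为准
model = AutoModelForCausalLM.from_pretrained("meta-llama/Meta-Llama-3-8B")

def drop_layer(model, layer_idx: int):
    """返回移除第 layer_idx 个解码器层后的模型副本(Llama 类模型的层位于 model.model.layers)。"""
    trimmed = copy.deepcopy(model)            # 仅为示意;实际实现应避免重复拷贝整模型
    del trimmed.model.layers[layer_idx]       # nn.ModuleList 支持按下标删除
    trimmed.config.num_hidden_layers = len(trimmed.model.layers)
    return trimmed

def progressive_layer_dropping(model, evaluate, tolerance: float = 0.01):
    """evaluate(model) 为假设的领域评估函数(分数越高越好);贪心地逐轮丢弃代价最小的一层。"""
    baseline = evaluate(model)
    while len(model.model.layers) > 1:
        scores = [(evaluate(drop_layer(model, i)), i)
                  for i in range(len(model.model.layers))]
        best_score, best_idx = max(scores)
        if baseline - best_score > tolerance:  # 继续丢层会损失过多领域性能,停止
            break
        model = drop_layer(model, best_idx)
    return model
```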
zh
[NLP-92] Smaller Language Models Are Better Instruction Evolvers
【速读】: 该论文试图解决的问题是当前在构建大规模指令时,普遍认为只有大型语言模型(LLMs)才能生成有效指令的假设。研究通过实验证明,小型语言模型(SLMs)在指令生成方面同样具有潜力,甚至能够合成更有效的指令。解决方案的关键在于揭示了SLMs在指令生成过程中具有更广泛的输出空间,从而产生更复杂和多样的指令变体。此外,论文提出了一个新的评估指标——指令复杂性感知的IFD(IC-IFD),通过引入指令复杂性来更准确地评估指令数据的有效性。
链接: https://arxiv.org/abs/2412.11231
作者: Tingfeng Hui,Lulu Zhao,Guanting Dong,Yaqi Zhang,Hua Zhou,Sen Su
机构: Beijing University of Posts and Telecommunications, Beijing, China; Beijing Academy of Artificial Intelligence, BAAI, Beijing, China; Renmin University of China, Beijing, China
关键词: large language models, language models, unleash the complete, smaller language models, Instruction
类目: Computation and Language (cs.CL)
备注: Work in progress
点击查看摘要
Abstract:Instruction tuning has been widely used to unleash the complete potential of large language models. Notably, complex and diverse instructions are of significant importance as they can effectively align models with various downstream tasks. However, current approaches to constructing large-scale instructions predominantly favour powerful models such as GPT-4 or those with over 70 billion parameters, under the empirical presumption that such larger language models (LLMs) inherently possess enhanced capabilities. In this study, we question this prevalent assumption and conduct an in-depth exploration into the potential of smaller language models (SLMs) in the context of instruction evolution. Extensive experiments across three scenarios of instruction evolution reveal that smaller language models (SLMs) can synthesize more effective instructions than LLMs. Further analysis demonstrates that SLMs possess a broader output space during instruction evolution, resulting in more complex and diverse variants. We also observe that the existing metrics fail to focus on the impact of the instructions. Thus, we propose Instruction Complexity-Aware IFD (IC-IFD), which introduces instruction complexity in the original IFD score to evaluate the effectiveness of instruction data more accurately. Our source code is available at: this https URL.
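作为背景,下面的 Python 草图给出 IFD(Instruction-Following Difficulty)分数的一种常见计算方式:回复在"有指令"与"无指令"两种条件下困惑度之比;论文提出的 IC-IFD 在此基础上进一步引入指令复杂度项,其具体公式以原文为准,此处仅示意 IFD 部分,模型为假设的小模型。

```python
import math
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_name = "gpt2"  # 假设的小模型,仅作演示
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

def mean_nll(condition: str, response: str) -> float:
    """response 在 condition 条件下的平均负对数似然;condition 为空串时即无条件损失。"""
    cond_ids = tokenizer(condition, return_tensors="pt").input_ids if condition else None
    resp_ids = tokenizer(response, add_special_tokens=False, return_tensors="pt").input_ids
    input_ids = resp_ids if cond_ids is None else torch.cat([cond_ids, resp_ids], dim=1)
    with torch.no_grad():
        logits = model(input_ids).logits
    log_probs = torch.log_softmax(logits[0, :-1], dim=-1)
    targets = input_ids[0, 1:]
    start = 0 if cond_ids is None else cond_ids.shape[1] - 1
    idx = torch.arange(start, targets.shape[0])
    return (-log_probs[idx, targets[start:]]).mean().item()

def ifd(instruction: str, response: str) -> float:
    # IFD = PPL(response | instruction) / PPL(response);越接近 1 说明指令对回复的约束越弱
    return math.exp(mean_nll(instruction, response)) / math.exp(mean_nll("", response))

print(ifd("Translate the following sentence into English: 今天天气很好。", "The weather is nice today."))
```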
zh
[NLP-93] Task-Oriented Dialog Systems for the Senegalese Wolof Language COLING2025
【速读】: 该论文试图解决大语言模型(LLMs)在实际应用中的局限性问题,特别是幻觉(hallucination)现象和低资源语言(如非洲语言)在系统中的表现不足。解决方案的关键在于采用基于模块化架构的任务导向对话系统(Task-oriented Dialog Systems, ToDS),并通过Rasa框架实现聊天机器人生成引擎,结合自有的机器翻译系统将标注映射到Wolof语言。该方法不仅提升了输出控制能力,还通过语言无关的意图分类器管道,简化了低资源语言聊天机器人的设计,使其在低资源语言上的表现接近资源丰富语言(如法语)。
链接: https://arxiv.org/abs/2412.11203
作者: Derguene Mbaye,Moussa Diallo
机构: Baamtu Technologies; Université Cheikh Anta Diop; Ecole Supérieure Polytechnique (ESP)
关键词: large language models, recent years, interest in conversational, conversational agents, rise of large
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Information Retrieval (cs.IR)
备注: 10 pages, 3 tables, 6 figures, The 31st International Conference on Computational Linguistics (COLING 2025)
点击查看摘要
Abstract:In recent years, we are seeing considerable interest in conversational agents with the rise of large language models (LLMs). Although they offer considerable advantages, LLMs also present significant risks, such as hallucination, which hinder their widespread deployment in industry. Moreover, low-resource languages such as African ones are still underrepresented in these systems limiting their performance in these languages. In this paper, we illustrate a more classical approach based on modular architectures of Task-oriented Dialog Systems (ToDS) offering better control over outputs. We propose a chatbot generation engine based on the Rasa framework and a robust methodology for projecting annotations onto the Wolof language using an in-house machine translation system. After evaluating a generated chatbot trained on the Amazon Massive dataset, our Wolof Intent Classifier performs similarly to the one obtained for French, which is a resource-rich language. We also show that this approach is extensible to other low-resource languages, thanks to the intent classifier’s language-agnostic pipeline, simplifying the design of chatbots in these languages.
zh
[NLP-94] Drawing the Line: Enhancing Trustworthiness of MLLMs Through the Power of Refusal
【速读】: 该论文试图解决多模态大语言模型(MLLMs)在生成幻觉或不准确响应时信任度降低的问题。解决方案的关键在于提出了信息边界感知学习框架(Information Boundary-aware Learning Framework, InBoL),该框架通过系统地定义信息边界,使模型能够在遇到信息不足时拒绝回答用户查询。InBoL引入了全面的数据生成流程和定制的训练策略,以提升模型拒绝回答的准确性,同时不显著影响其有用性。这是首个系统性地定义拒绝条件并提升模型信任度的框架。
链接: https://arxiv.org/abs/2412.11196
作者: Yuhao Wang,Zhiyuan Zhu,Heyang Liu,Yusheng Liao,Hongcheng Liu,Yanfeng Wang,Yu Wang
机构: Cooperative Medianet Innovation Center, Shanghai Jiao Tong University (合作媒体创新中心,上海交通大学); Shanghai Artificial Intelligence Laboratory (上海人工智能实验室)
关键词: Multimodal large language, inaccurate responses undermines, Multimodal large, multimodal perception, large language models
类目: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Multimodal large language models (MLLMs) excel at multimodal perception and understanding, yet their tendency to generate hallucinated or inaccurate responses undermines their trustworthiness. Existing methods have largely overlooked the importance of refusal responses as a means of enhancing MLLMs reliability. To bridge this gap, we present the Information Boundary-aware Learning Framework (InBoL), a novel approach that empowers MLLMs to refuse to answer user queries when encountering insufficient information. To the best of our knowledge, InBoL is the first framework that systematically defines the conditions under which refusal is appropriate for MLLMs using the concept of information boundaries proposed in our paper. This framework introduces a comprehensive data generation pipeline and tailored training strategies to improve the model’s ability to deliver appropriate refusal responses. To evaluate the trustworthiness of MLLMs, we further propose a user-centric alignment goal along with corresponding metrics. Experimental results demonstrate a significant improvement in refusal accuracy without noticeably compromising the model’s helpfulness, establishing InBoL as a pivotal advancement in building more trustworthy MLLMs.
zh
[NLP-95] Leveraging Large Language Models for Active Merchant Non-player Characters
【速读】: 该论文旨在解决当前游戏中商人非玩家角色(NPC)的被动性问题,主要体现在定价和沟通两个方面。论文提出了一个基于大语言模型(LLM)的商人框架,称为MART,包含评估模块和谈判模块,以实现主动的商人NPC。解决方案的关键在于通过监督微调(SFT)和知识蒸馏(KD)等微调方法,利用较小的LLM实现有效的主动交互,并通过实验验证了这些方法的可行性。此外,论文还识别了LLM响应中出现的三种异常情况,为开发者提供了实际指导。
链接: https://arxiv.org/abs/2412.11189
作者: Byungjun Kim,Minju Kim,Dayeon Seo,Bugeun Kim
机构: Chung-Ang University(中央大学)
关键词: merchant non-player characters, current merchant non-player, significant issues leading, non-player characters, highlight two significant
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: Under review
点击查看摘要
Abstract:We highlight two significant issues leading to the passivity of current merchant non-player characters (NPCs): pricing and communication. While immersive interactions have been a focus, negotiations between merchant NPCs and players on item prices have not received sufficient attention. First, we define passive pricing as the limited ability of merchants to modify predefined item prices. Second, passive communication means that merchants can only interact with players in a scripted manner. To tackle these issues and create an active merchant NPC, we propose a merchant framework based on large language models (LLMs), called MART, which consists of an appraiser module and a negotiator module. We conducted two experiments to guide game developers in selecting appropriate implementations by comparing different training methods and LLM sizes. Our findings indicate that finetuning methods, such as supervised finetuning (SFT) and knowledge distillation (KD), are effective in using smaller LLMs to implement active merchant NPCs. Additionally, we found three irregular cases arising from the responses of LLMs. We expect our findings to guide developers in using LLMs for developing active merchant NPCs.
zh
[NLP-96] Analyzing the Attention Heads for Pronoun Disambiguation in Context-aware Machine Translation Models COLING2025
【速读】: 该论文旨在研究上下文感知机器翻译模型中注意力头在代词消歧中的作用,特别是在英语到德语和英语到法语的翻译方向上。解决方案的关键在于分析和调整与代词预测相关的潜在关系的注意力分数,发现某些注意力头并未充分利用,导致模型在代词消歧上的表现不佳。通过微调最有潜力的注意力头,论文展示了代词消歧准确率可提升多达5个百分点,表明通过优化注意力头的使用可以显著提升模型性能。
链接: https://arxiv.org/abs/2412.11187
作者: Paweł Mąka,Yusuf Can Semerci,Jan Scholtes,Gerasimos Spanakis
机构: Maastricht University (马斯特里赫特大学)
关键词: Context-aware Machine Translation, Machine Translation models, Context-aware Machine, Machine Translation, language directions
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: COLING 2025
点击查看摘要
Abstract:In this paper, we investigate the role of attention heads in Context-aware Machine Translation models for pronoun disambiguation in the English-to-German and English-to-French language directions. We analyze their influence by both observing and modifying the attention scores corresponding to the plausible relations that could impact a pronoun prediction. Our findings reveal that while some heads do attend the relations of interest, not all of them influence the models’ ability to disambiguate pronouns. We show that certain heads are underutilized by the models, suggesting that model performance could be improved if only the heads would attend one of the relations more strongly. Furthermore, we fine-tune the most promising heads and observe the increase in pronoun disambiguation accuracy of up to 5 percentage points which demonstrates that the improvements in performance can be solidified into the models’ parameters.
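下面的 Python 草图(假设用 bert-base-uncased 演示)示意其中"观察"这一步:取出每层、每个注意力头上"代词位置对先行词位置"的注意力分数,分数高的头即值得进一步修改与微调的候选;论文实际分析的是上下文感知机器翻译模型,此处仅演示注意力提取的通用做法。

```python
import torch
from transformers import AutoTokenizer, AutoModel

model_name = "bert-base-uncased"  # 假设的演示模型;论文分析的是上下文感知 NMT 模型
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)
model.eval()

text = "The dog could not cross the road because it was too tired."
enc = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    attentions = model(**enc, output_attentions=True).attentions  # 每层一个 [batch, heads, seq, seq]

tokens = tokenizer.convert_ids_to_tokens(enc.input_ids[0])
pronoun_idx = tokens.index("it")      # 代词位置
antecedent_idx = tokens.index("dog")  # 先行词位置

# 统计每个 (层, 头) 上"代词关注先行词"的注意力分数,分数高的头是候选分析对象
scores = [(layer, head, att[0, head, pronoun_idx, antecedent_idx].item())
          for layer, att in enumerate(attentions)
          for head in range(att.shape[1])]
for layer, head, s in sorted(scores, key=lambda x: -x[2])[:5]:
    print(f"layer {layer:2d} head {head:2d}: {s:.3f}")
```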
zh
[NLP-97] Unpacking the Resilience of SNLI Contradiction Examples to Attacks
【速读】: 该论文试图解决预训练模型在自然语言推理(NLI)基准测试中表现优异,但其真正的语言理解能力仍不确定的问题。研究揭示了这些模型在面对对抗性攻击时的脆弱性,表明其高准确率可能依赖于数据集的偏差和虚假相关性。解决方案的关键在于通过应用通用对抗攻击(Universal Adversarial Attack)来识别模型的脆弱点,并通过对包含对抗样本的增强数据集进行微调,恢复模型在标准和挑战集上的性能。这一方法不仅揭示了模型对虚假相关性的依赖,还提升了模型的鲁棒性,同时提供了关于矛盾类别对对抗攻击更具抵抗力的见解。
链接: https://arxiv.org/abs/2412.11172
作者: Chetan Verma,Archit Agarwal
机构: University of Texas at Austin (德克萨斯大学奥斯汀分校)
关键词: Pre-trained models excel, understanding remains uncertain, true language understanding, language understanding remains, SNLI and MultiNLI
类目: Computation and Language (cs.CL)
备注:
点击查看摘要
Abstract:Pre-trained models excel on NLI benchmarks like SNLI and MultiNLI, but their true language understanding remains uncertain. Models trained only on hypotheses and labels achieve high accuracy, indicating reliance on dataset biases and spurious correlations. To explore this issue, we applied the Universal Adversarial Attack to examine the model’s vulnerabilities. Our analysis revealed substantial drops in accuracy for the entailment and neutral classes, whereas the contradiction class exhibited a smaller decline. Fine-tuning the model on an augmented dataset with adversarial examples restored its performance to near-baseline levels for both the standard and challenge sets. Our findings highlight the value of adversarial triggers in identifying spurious correlations and improving robustness while providing insights into the resilience of the contradiction class to adversarial attacks.
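下面的 Python 草图示意通用对抗触发词评估的基本流程:把同一串触发词拼接到每条假设(hypothesis)之前,比较攻击前后模型判断的变化;其中 NLI 模型为公开模型、触发词串只是占位假设,真实触发词需按原论文用梯度搜索得到。

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_name = "roberta-large-mnli"  # 公开的 NLI 模型,作为被攻击对象的示意
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)
model.eval()

trigger = "nobody never nothing"  # 占位触发词;真实触发词需按原论文用梯度搜索得到

def predict(premise: str, hypothesis: str) -> str:
    enc = tokenizer(premise, hypothesis, return_tensors="pt")
    with torch.no_grad():
        logits = model(**enc).logits
    return model.config.id2label[int(logits.argmax(dim=-1))]

def accuracy(examples, attack: bool = False) -> float:
    correct = 0
    for premise, hypothesis, gold in examples:
        hyp = f"{trigger} {hypothesis}" if attack else hypothesis  # 把触发词拼到假设之前
        correct += int(predict(premise, hyp).lower() == gold)
    return correct / len(examples)

examples = [("A man is playing a guitar.", "A person is making music.", "entailment")]
print("clean:", accuracy(examples), "attacked:", accuracy(examples, attack=True))
```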
zh
[NLP-98] Cultural Palette: Pluralising Culture Alignment via Multi-agent Palette
【速读】: 该论文试图解决大型语言模型 (LLMs) 在生成过程中难以与多样文化价值观对齐的问题,这一问题源于模型固有的单一文化偏见以及难以捕捉复杂的文化语义。论文提出的解决方案是 Cultural Palette,一个基于多代理框架的文化对齐方法。其关键在于:首先,通过合成 Pentachromatic Cultural Palette Dataset 来捕捉五大洲的多样化文化价值观;其次,利用 Cultural MoErges 对齐技术,结合五个洲级对齐代理和一个元代理,动态激活相关文化专业知识以适应新文化,从而在文化价值对齐方面优于其他联合和对齐策略。每个洲代理生成初步的文化草案,再由元代理进行精炼和自我调节,最终生成与文化对齐的响应。实验结果表明,Cultural Palette 在文化对齐方面显著优于现有基线方法。
链接: https://arxiv.org/abs/2412.11167
作者: Jiahao Yuan,Zixiang Di,Shangzixin Zhao,Usman Naseem
机构: East China Normal University(华东师范大学); University of Shanghai for Science and Technology(上海理工大学); Macquarie University(麦考瑞大学)
关键词: Large language models, inherent monocultural biases, nuanced cultural semantics, Large language, capturing nuanced cultural
类目: Computation and Language (cs.CL)
备注:
点击查看摘要
Abstract:Large language models (LLMs) face challenges in aligning with diverse cultural values despite their remarkable performance in generation, which stems from inherent monocultural biases and difficulties in capturing nuanced cultural semantics. Existing methods lack adaptability to unknown cultures after finetuning. Inspired by cultural geography across five continents, we propose Cultural Palette, a multi-agent framework for cultural alignment. We first introduce the Pentachromatic Cultural Palette Dataset synthesized using LLMs to capture diverse cultural values from social dialogues across five continents. Building on this, Cultural Palette integrates five continent-level alignment agents with a meta-agent using our superior Cultural MoErges alignment technique, dynamically activating relevant cultural expertise based on user prompts to adapt to new cultures, which outperforms other joint and merging alignment strategies in overall cultural value alignment. Each continent agent generates a cultural draft, which is then refined and self-regulated by the meta-agent to produce the final culturally aligned response. Experiments across various countries demonstrate that Cultural Palette surpasses existing baselines in cultural alignment.
zh
[NLP-99] The Superalignment of Superhuman Intelligence with Large Language Models
【速读】: 该论文试图解决的问题是如何确保超人类智能模型(superhuman models)在复杂任务中的安全性、可靠性和与人类价值观的一致性。解决方案的关键在于提出了一种名为“超对齐”(superalignment)的学习范式,旨在通过设计高效且有效的对齐算法,从噪声标注数据(noisy-labeled data)中进行可扩展的学习,尤其是在任务复杂到人类专家难以标注且模型能力超越人类专家的情况下。论文提出的超对齐框架包括三个核心模块:攻击者(attacker)生成对抗性查询以暴露学习模型的弱点,学习者(learner)通过可扩展的反馈和少量人类专家的指导进行自我改进,以及批评者(critic)生成批评或解释以帮助学习者提升。该框架强调了弱到强泛化(weak-to-strong generalization)、可扩展监督(scalable oversight)和评估等关键研究问题,并提出了自对齐(self-alignment)、自博弈(self-play)和自改进(self-refinement)等相关的研究方向。
链接: https://arxiv.org/abs/2412.11145
作者: Minlie Huang,Yingkang Wang,Shiyao Cui,Pei Ke,Jie Tang
机构: The CoAI group, DCST, Tsinghua University, Beijing, 100084, China; University of Electronic Science and Technology of China, 611731, China; The Knowledge Engineering Group (KEG), Tsinghua University, Beijing, 100084, China
关键词: witnessed superhuman intelligence, large language models, multimodal language models, large language, multimodal language
类目: Computation and Language (cs.CL)
备注: Under review of Science China
点击查看摘要
Abstract:We have witnessed superhuman intelligence thanks to the fast development of large language models and multimodal language models. As the application of such superhuman models becomes more and more common, a critical question rises here: how can we ensure superhuman models are still safe, reliable and aligned well to human values? In this position paper, we discuss the concept of superalignment from the learning perspective to answer this question by outlining the learning paradigm shift from large-scale pretraining, supervised fine-tuning, to alignment training. We define superalignment as designing effective and efficient alignment algorithms to learn from noisy-labeled data (point-wise samples or pair-wise preference data) in a scalable way when the task becomes very complex for human experts to annotate and the model is stronger than human experts. We highlight some key research problems in superalignment, namely, weak-to-strong generalization, scalable oversight, and evaluation. We then present a conceptual framework for superalignment, which consists of three modules: an attacker which generates adversary queries trying to expose the weaknesses of a learner model; a learner which will refine itself by learning from scalable feedbacks generated by a critic model along with minimal human experts; and a critic which generates critics or explanations for a given query-response pair, with a target of improving the learner by criticizing. We discuss some important research problems in each component of this framework and highlight some interesting research ideas that are closely related to our proposed framework, for instance, self-alignment, self-play, self-refinement, and more. Last, we highlight some future research directions for superalignment, including identification of new emergent risks and multi-dimensional alignment.
zh
[NLP-100] AD-LLM : Benchmarking Large Language Models for Anomaly Detection
【速读】: 该论文试图解决大语言模型(LLMs)在自然语言处理(NLP)中的异常检测(Anomaly Detection, AD)应用问题,特别是其在零样本检测、数据增强和模型选择方面的潜力。解决方案的关键在于引入AD-LLM基准,通过实验验证LLMs在零样本AD中的有效性,展示精心设计的数据增强方法的实用性,并指出在特定数据集上解释模型选择的挑战性。
链接: https://arxiv.org/abs/2412.11142
作者: Tiankai Yang,Yi Nian,Shawn Li,Ruiyao Xu,Yuangang Li,Jiaqi Li,Zhuo Xiao,Xiyang Hu,Ryan Rossi,Kaize Ding,Xia Hu,Yue Zhao
机构: University of Southern California; Northwestern University; Arizona State University; Adobe Research; Rice University
关键词: important machine learning, including fraud detection, machine learning task, medical diagnosis, NLP anomaly detection
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:Anomaly detection (AD) is an important machine learning task with many real-world uses, including fraud detection, medical diagnosis, and industrial monitoring. Within natural language processing (NLP), AD helps detect issues like spam, misinformation, and unusual user activity. Although large language models (LLMs) have had a strong impact on tasks such as text generation and summarization, their potential in AD has not been studied enough. This paper introduces AD-LLM, the first benchmark that evaluates how LLMs can help with NLP anomaly detection. We examine three key tasks: (i) zero-shot detection, using LLMs’ pre-trained knowledge to perform AD without tasks-specific training; (ii) data augmentation, generating synthetic data and category descriptions to improve AD models; and (iii) model selection, using LLMs to suggest unsupervised AD models. Through experiments with different datasets, we find that LLMs can work well in zero-shot AD, that carefully designed augmentation methods are useful, and that explaining model selection for specific datasets remains challenging. Based on these results, we outline six future research directions on LLMs for AD.
zh
[NLP-101] Feature engineering vs. deep learning for paper section identification: Toward applications in Chinese medical literature
【速读】: 该论文试图解决中文医学文献中的章节识别问题,特别是从医生视角识别文献中的主题、方法和结果部分,以帮助在实体和关系抽取中过滤噪声。解决方案的关键在于采用条件随机场 (Conditional Random Fields) 结合词袋模型 (bag-of-words)、词性标注 (part-of-speech) 和标题等特征,有效处理句子间的依赖关系。此外,论文设计了一种新的深度学习模型——结构化双向长短期记忆网络 (Structural Bidirectional Long Short-Term Memory, SLSTM),该模型结合了词和句子的依赖关系以及上下文信息,实验结果表明其在精度 (precision) 和召回率 (recall) 上接近90%,优于传统的机器学习方法和其他深度学习模型。
链接: https://arxiv.org/abs/2412.11125
作者: Sijia Zhou,Xin Li
机构: 未知
关键词: Section identification, literature section identification, library science, knowledge management, Chinese literature section
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:
点击查看摘要
Abstract:Section identification is an important task for library science, especially knowledge management. Identifying the sections of a paper would help filter noise in entity and relation extraction. In this research, we studied the paper section identification problem in the context of Chinese medical literature analysis, where the subjects, methods, and results are more valuable from a physician’s perspective. Based on previous studies on English literature section identification, we experiment with the effective features to use with classic machine learning algorithms to tackle the problem. It is found that Conditional Random Fields, which consider sentence interdependency, is more effective in combining different feature sets, such as bag-of-words, part-of-speech, and headings, for Chinese literature section identification. Moreover, we find that classic machine learning algorithms are more effective than generic deep learning models for this problem. Based on these observations, we design a novel deep learning model, the Structural Bidirectional Long Short-Term Memory (SLSTM) model, which models word and sentence interdependency together with the contextual information. Experiments on a human-curated asthma literature dataset show that our approach outperforms the traditional machine learning methods and other deep learning methods and achieves close to 90% precision and recall in the task. The model shows good potential for use in other text mining tasks. The research has significant methodological and practical implications.
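下面的 Python 草图(假设使用 sklearn-crfsuite)示意论文中的核心做法:把每篇文献表示为句子序列,每个句子用词袋、词性和是否为小标题等特征表示,由 CRF 在序列层面建模章节标签之间的依赖;数据与特征设计均为玩具示例。

```python
import sklearn_crfsuite

# 玩具数据:每篇"文献"是一条句子序列,标签为章节类型(subject/method/result,仅为示意)
train_docs = [
    [
        {"text": "目的", "pos": ["NN"], "is_heading": True, "label": "subject"},
        {"text": "本研究 探讨 哮喘 治疗", "pos": ["NN", "VV", "NN", "NN"], "is_heading": False, "label": "subject"},
        {"text": "方法", "pos": ["NN"], "is_heading": True, "label": "method"},
        {"text": "纳入 患者 并 随机 分组", "pos": ["VV", "NN", "CC", "AD", "VV"], "is_heading": False, "label": "method"},
        {"text": "结果", "pos": ["NN"], "is_heading": True, "label": "result"},
        {"text": "治疗组 症状 明显 改善", "pos": ["NN", "NN", "AD", "VV"], "is_heading": False, "label": "result"},
    ]
]

def sentence_features(sent):
    feats = {f"word:{w}": 1.0 for w in sent["text"].split()}   # 词袋特征
    feats.update({f"pos:{p}": 1.0 for p in sent["pos"]})        # 词性特征
    feats["is_heading"] = float(sent["is_heading"])             # 小标题特征
    return feats

X_train = [[sentence_features(s) for s in doc] for doc in train_docs]
y_train = [[s["label"] for s in doc] for doc in train_docs]

# CRF 在句子序列上建模标签间依赖(如 method 常跟在 subject 之后)
crf = sklearn_crfsuite.CRF(algorithm="lbfgs", c1=0.1, c2=0.1, max_iterations=100)
crf.fit(X_train, y_train)
print(crf.predict(X_train))
```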
zh
[NLP-102] Hanprome: Modified Hangeul for Expression of foreign language pronunciation
【速读】: 该论文试图解决的问题是如何在不改变韩文字母(Hangeul)基本形式的前提下,通过修改笔画的形状来表达不同于原字母的发音。解决方案的关键在于保留字母的基本结构,仅通过改变笔画的形状来实现这一目标。据作者所知,这是首次尝试通过改变字母笔画的形状来表达不同发音的方法,此前在任何语言中均未有过类似尝试。
链接: https://arxiv.org/abs/2412.11090
作者: Wonchan Kim,Michelle Meehyun Kim
机构: 未知
关键词: basic form, Hangeul was created, existing alphabets, Hangeul, phonetic alphabet
类目: ound (cs.SD); Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)
备注: 21 pages
点击查看摘要
Abstract:Hangeul was created as a phonetic alphabet and is known to have the best 1:1 correspondence between letters and pronunciation among existing alphabets. In this paper, we examine the possibility of modifying the basic form of Hangeul and using it as a kind of phonetic symbol. The core concept of this approach is to preserve the basic form of the alphabet, modifying only the shape of a stroke rather than the letter itself. To the best of our knowledge, no previous attempts in any language have been made to express pronunciations of an alphabet different from the original simply by changing the shape of the alphabet strokes, and this paper is probably the first attempt in this direction.
zh
[NLP-103] LAW: Legal Agentic Workflows for Custody and Fund Services Contracts COLING2025
【速读】: 该论文试图解决托管和基金服务领域中法律合同处理的问题,特别是由于合同文本长、结构化程度低、法律术语复杂以及大型语言模型(LLM)的上下文窗口限制,导致现成的LLM难以有效处理这些合同。解决方案的关键在于引入了一个名为LAW(Legal Agentic Workflows for Custody and Fund Services Contracts)的系统,该系统采用模块化设计,通过协调一系列领域特定的工具和文本代理来响应用户查询。LAW通过集成多个专业代理和工具,显著优于基线模型,尤其在计算合同终止日期等复杂任务上表现突出,且通过利用可重用的领域特定工具,提供了比传统微调法律LLM更具成本效益的替代方案。
链接: https://arxiv.org/abs/2412.11063
作者: William Watson,Nicole Cho,Nishan Srishankar,Zhen Zeng,Lucas Cecchi,Daniel Scott,Suchetha Siddagangappa,Rachneet Kaur,Tucker Balch,Manuela Veloso
机构: J.P. Morgan AI Research (J.P. 摩根人工智能研究); New York, New York, USA (纽约, 纽约, 美国)
关键词: key provider responsibilities, Large Language Model, domain govern critical, govern critical aspects, services domain govern
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Software Engineering (cs.SE)
备注: Accepted at The 31st International Conference on Computational Linguistics (COLING 2025)
点击查看摘要
Abstract:Legal contracts in the custody and fund services domain govern critical aspects such as key provider responsibilities, fee schedules, and indemnification rights. However, it is challenging for an off-the-shelf Large Language Model (LLM) to ingest these contracts due to the lengthy unstructured streams of text, limited LLM context windows, and complex legal jargon. To address these challenges, we introduce LAW (Legal Agentic Workflows for Custody and Fund Services Contracts). LAW features a modular design that responds to user queries by orchestrating a suite of domain-specific tools and text agents. Our experiments demonstrate that LAW, by integrating multiple specialized agents and tools, significantly outperforms the baseline. LAW excels particularly in complex tasks such as calculating a contract’s termination date, surpassing the baseline by 92.9% points. Furthermore, LAW offers a cost-effective alternative to traditional fine-tuned legal LLMs by leveraging reusable, domain-specific tools.
zh
[NLP-104] NITRO: LLM Inference on Intel Laptop NPUs
【速读】: 该论文试图解决大语言模型 (LLMs) 在神经处理单元 (NPU) 上的动态自回归 token 生成推理问题。解决方案的关键在于开发了 NITRO (NPU Inference for Transformers Optimization) 框架,该框架基于 Intel 的 OpenVINO 构建,通过修改 transformer 架构以支持在 NPU 上的文本和聊天生成任务。这一解决方案的核心在于对 transformer 架构的优化,使其能够在 NPU 上高效执行动态推理任务。
链接: https://arxiv.org/abs/2412.11053
作者: Anthony Fei,Mohamed S. Abdelfattah
机构: Cornell University (康奈尔大学); Intel Labs (英特尔实验室)
关键词: Large Language Models, finding large usage, natural language processing, Large Language, ChatGPT and Gemini
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 11 pages, 7 figures
点击查看摘要
Abstract:Large Language Models (LLMs) have become essential tools in natural language processing, finding large usage in chatbots such as ChatGPT and Gemini, and are a central area of research. A particular area of interest includes designing hardware specialized for these AI applications, with one such example being the neural processing unit (NPU). In 2023, Intel released the Intel Core Ultra processor with codename Meteor Lake, featuring a CPU, GPU, and NPU system-on-chip. However, official software support for the NPU through Intel’s OpenVINO framework is limited to static model inference. The dynamic nature of autoregressive token generation in LLMs is therefore not supported out of the box. To address this shortcoming, we present NITRO (NPU Inference for Transformers Optimization), a Python-based framework built on top of OpenVINO to support text and chat generation on NPUs. In this paper, we discuss in detail the key modifications made to the transformer architecture to enable inference, some performance benchmarks, and future steps towards improving the package. The code repository for NITRO can be found here: this https URL.
zh
[NLP-105] Separate the Wheat from the Chaff: A Post-Hoc Approach to Safety Re-Alignment for Fine-Tuned Language Models
【速读】: 该论文试图解决大语言模型 (LLMs) 在微调过程中安全对齐 (safety alignment) 被削弱的问题。解决方案的关键是提出了一种名为 IRR (Identify, Remove, and Recalibrate for Safety Realignment) 的方法,通过识别、移除微调模型中的不安全增量参数 (unsafe delta parameters),并对保留的参数进行重新校准,从而实现安全对齐的重新调整。实验结果表明,IRR 在提升微调模型在安全基准测试(如有害查询和越狱攻击)中的表现的同时,保持了其在下游任务中的性能。
链接: https://arxiv.org/abs/2412.11041
作者: Di Wu,Xin Lu,Yanyan Zhao,Bing Qin
机构: Harbin Institute of Technology (哈尔滨工业大学); Research Center for Social Computing and Information Retrieval (社会计算与信息检索研究中心)
关键词: achieve effective safety, effective safety alignment, large language models, achieve effective, time of release
类目: Computation and Language (cs.CL)
备注: 14 pages, 12 figures,
点击查看摘要
Abstract:Although large language models (LLMs) achieve effective safety alignment at the time of release, they still face various safety challenges. A key issue is that fine-tuning often compromises the safety alignment of LLMs. To address this issue, we propose a method named IRR (Identify, Remove, and Recalibrate for Safety Realignment) that performs safety realignment for LLMs. The core of IRR is to identify and remove unsafe delta parameters from the fine-tuned models, while recalibrating the retained ones. We evaluate the effectiveness of IRR across various datasets, including both full fine-tuning and LoRA methods. Our results demonstrate that IRR significantly enhances the safety performance of fine-tuned models on safety benchmarks, such as harmful queries and jailbreak attacks, while maintaining their performance on downstream tasks. The source code is available at: this https URL.
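下面的 PyTorch 草图示意 IRR 在"增量参数"层面的基本操作:先计算微调模型与基座模型的参数差 delta,再按某个不安全程度打分把被判定为不安全的增量置零、对保留的增量做缩放校准后加回基座;如何识别不安全增量与具体校准方式是论文的核心贡献,这里的 unsafe_score 接口与缩放方式仅为占位假设。

```python
import torch

def irr_style_realign(base_state, finetuned_state, unsafe_score, keep_ratio=0.9, scale=1.0):
    """base_state / finetuned_state:两个模型的 state_dict;
    unsafe_score(name, delta) 返回与 delta 同形状的打分张量(分数越高越"不安全",接口为假设)。"""
    realigned = {}
    for name, base_w in base_state.items():
        delta = finetuned_state[name] - base_w                    # 微调引入的增量参数
        scores = unsafe_score(name, delta)
        k = max(int(delta.numel() * keep_ratio), 1)
        threshold = scores.flatten().kthvalue(k).values           # 保留打分最低的 keep_ratio 部分
        mask = (scores <= threshold).float()
        realigned[name] = base_w + scale * mask * delta           # 对保留的增量做缩放校准后加回基座
    return realigned

# 玩具示例:用增量幅值充当"不安全程度"打分(仅为占位)
base = {"w": torch.zeros(4)}
finetuned = {"w": torch.tensor([0.1, -2.0, 0.05, 0.3])}
print(irr_style_realign(base, finetuned, lambda n, d: d.abs(), keep_ratio=0.75))
```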
zh
[NLP-106] A Contextualized BERT model for Knowledge Graph Completion NEURIPS2024
【速读】: 该论文试图解决知识图谱补全 (Knowledge Graph Completion, KGC) 中存在的实体和关系缺失问题,特别是传统方法如 TransE 和 ComplEx 在处理未见实体时的局限性,以及基于文本的模型在计算成本、语义一致性和数据不平衡方面的挑战。解决方案的关键在于引入了一个基于上下文感知的 BERT 模型,该模型通过利用邻近实体和关系的上下文信息来预测缺失的尾实体 (tail entities),从而无需依赖实体描述和负样本采样,显著降低了计算需求并提升了性能。实验结果表明,该模型在 FB15k-237 和 WN18RR 数据集上分别将 Hit@1 提升了 5.3% 和 4.88%,达到了新的技术水平。
链接: https://arxiv.org/abs/2412.11016
作者: Haji Gul,Abdul Ghani Naim,Ajaz A. Bhat
机构: School of Digital Science, Universiti Brunei Darussalam (数字科学学院,文莱达鲁萨兰大学)
关键词: Knowledge Graph Completion, representing structured, enabling tasks, recommendation systems, systems and inference
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: MuslML Workshop, 38th Conference on Neural Information Processing Systems (NeurIPS 2024)
点击查看摘要
Abstract:Knowledge graphs (KGs) are valuable for representing structured, interconnected information across domains, enabling tasks like semantic search, recommendation systems and inference. A pertinent challenge with KGs, however, is that many entities (i.e., heads, tails) or relationships are unknown. Knowledge Graph Completion (KGC) addresses this by predicting these missing nodes or links, enhancing the graph’s informational depth and utility. Traditional methods like TransE and ComplEx predict tail entities but struggle with unseen entities. Textual-based models leverage additional semantics but come with high computational costs, semantic inconsistencies, and data imbalance issues. Recent LLM-based models show improvement but overlook contextual information and rely heavily on entity descriptions. In this study, we introduce a contextualized BERT model for KGC that overcomes these limitations by utilizing the contextual information from neighbouring entities and relationships to predict tail entities. Our model eliminates the need for entity descriptions and negative triplet sampling, reducing computational demands while improving performance. Our model outperforms state-of-the-art methods on standard datasets, improving Hit@1 by 5.3% and 4.88% on FB15k-237 and WN18RR respectively, setting a new benchmark in KGC.
zh
[NLP-107] Dual Traits in Probabilistic Reasoning of Large Language Models
【速读】: 该论文试图解决大语言模型 (LLMs) 在评估后验概率时表现出的认知偏差问题。研究发现,LLMs 在判断后验概率时存在两种模式:一种是遵循贝叶斯规则的规范模式 (normative mode),另一种是基于相似性的代表性模式 (representative-based mode),类似于人类的系统1和系统2思维。关键解决方案在于识别并减轻代表性模式带来的偏差,这可能需要通过改进提示工程策略 (prompt engineering strategies) 来实现。此外,论文还推测这种双重判断模式可能是由于强化学习中使用的对比损失函数 (contrastive loss function) 导致的。研究结果强调了减少LLMs认知偏差的重要性,并提醒在关键领域中谨慎部署LLMs的必要性。
链接: https://arxiv.org/abs/2412.11009
作者: Shenxiong Li,Huaxia Rui
机构: 未知
关键词: evaluate posterior probabilities, large language models, conducted three experiments, experiments to investigate, investigate how large
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computers and Society (cs.CY)
备注:
点击查看摘要
Abstract:We conducted three experiments to investigate how large language models (LLMs) evaluate posterior probabilities. Our results reveal the coexistence of two modes in posterior judgment among state-of-the-art models: a normative mode, which adheres to Bayes’ rule, and a representative-based mode, which relies on similarity – paralleling human System 1 and System 2 thinking. Additionally, we observed that LLMs struggle to recall base rate information from their memory, and developing prompt engineering strategies to mitigate representative-based judgment may be challenging. We further conjecture that the dual modes of judgment may be a result of the contrastive loss function employed in reinforcement learning from human feedback. Our findings underscore the potential direction for reducing cognitive biases in LLMs and the necessity for cautious deployment of LLMs in critical areas.
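为说明"规范模式(贝叶斯)"与"代表性模式"两类判断的差别,下面给出一个最小的 Python 算例:在经典的基率问题中按贝叶斯规则计算后验,并与只看描述相似度、忽略基率的判断对照;数值为假设的示例。

```python
# 经典基率问题(数值为假设):某群体中工程师占 30%、律师占 70%,
# 一段"像工程师"的描述:P(描述 | 工程师)=0.8,P(描述 | 律师)=0.2。
p_engineer = 0.30                 # 基率
p_lawyer = 0.70
p_desc_given_engineer = 0.80      # 似然
p_desc_given_lawyer = 0.20

# 规范模式:贝叶斯规则
p_desc = p_desc_given_engineer * p_engineer + p_desc_given_lawyer * p_lawyer
posterior = p_desc_given_engineer * p_engineer / p_desc
print(f"贝叶斯后验 P(工程师 | 描述) = {posterior:.3f}")   # ≈ 0.632

# 代表性模式:只比较描述与类别的相似度(相当于忽略基率),
# 往往会给出接近 0.8 的判断,高估了后验。
print(f"只看相似度的判断 ≈ {p_desc_given_engineer:.2f}")
```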
zh
[NLP-108] Entropy-Regularized Process Reward Model
【速读】: 该论文试图解决大语言模型 (Large Language Models, LLMs) 在数学推理任务中表现不佳的问题,尤其是系统性错误。解决方案的关键在于引入基于过程奖励 (process rewards) 的强化学习 (Reinforcement Learning, RL),并通过熵正则化的过程奖励模型 (Entropy-Regularized Process Reward Model, ER-PRM) 来优化策略模型。ER-PRM 结合了 KL 正则化的马尔可夫决策过程 (Markov Decision Processes, MDP),以平衡策略优化与防止策略偏离初始分布的需求。通过理论分析,论文提出了一种基于初始策略采样的最优奖励构造方法,并在 MATH 和 GSM8K 基准测试中验证了其有效性,显著提升了模型在数学推理任务中的表现。
链接: https://arxiv.org/abs/2412.11006
作者: Hanning Zhang,Pengcheng Wang,Shizhe Diao,Yong Lin,Rui Pan,Hanze Dong,Dylan Zhang,Pavlo Molchanov,Tong Zhang
机构: University of Illinois Urbana-Champaign; University of Toronto; NVIDIA; Princeton University; Salesforce Research
关键词: Large language models, making systematic errors, performing complex multi-step, Large language, complex multi-step reasoning
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注: Preprint
点击查看摘要
Abstract:Large language models (LLMs) have shown promise in performing complex multi-step reasoning, yet they continue to struggle with mathematical reasoning, often making systematic errors. A promising solution is reinforcement learning (RL) guided by reward models, particularly those focusing on process rewards, which score each intermediate step rather than solely evaluating the final outcome. This approach is more effective at guiding policy models towards correct reasoning trajectories. In this work, we propose an entropy-regularized process reward model (ER-PRM) that integrates KL-regularized Markov Decision Processes (MDP) to balance policy optimization with the need to prevent the policy from shifting too far from its initial distribution. We derive a novel reward construction method based on the theoretical results. Our theoretical analysis shows that we could derive the optimal reward model from the initial policy sampling. Our empirical experiments on the MATH and GSM8K benchmarks demonstrate that ER-PRM consistently outperforms existing process reward models, achieving 1% improvement on GSM8K and 2-3% improvement on MATH under best-of-N evaluation, and more than 1% improvement under RLHF. These results highlight the efficacy of entropy-regularization in enhancing LLMs’ reasoning capabilities.
zh
[NLP-109] Navigating Dialectal Bias and Ethical Complexities in Levantine Arabic Hate Speech Detection
【速读】: 该论文试图解决在Levantine阿拉伯语等非主流方言中检测仇恨言论的独特挑战,这些问题涉及文化、伦理和语言层面的复杂性。解决方案的关键在于强调需要开发文化背景和语境敏感的自然语言处理(NLP)工具,以克服现有数据集中存在的方言偏见和数据稀缺问题。论文倡导采用更加细致和包容的方法,以确保在阿拉伯世界中的仇恨言论检测能够更加准确和公正。
链接: https://arxiv.org/abs/2412.10991
作者: Ahmed Haj Ahmed,Rui-Jie Yew,Xerxes Minocher,Suresh Venkatasubramanian
机构: Haverford College; Brown University
关键词: Social media platforms, Social media, hate speech, hate speech detection, Levantine Arabic
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
备注:
点击查看摘要
Abstract:Social media platforms have become central to global communication, yet they also facilitate the spread of hate speech. For underrepresented dialects like Levantine Arabic, detecting hate speech presents unique cultural, ethical, and linguistic challenges. This paper explores the complex sociopolitical and linguistic landscape of Levantine Arabic and critically examines the limitations of current datasets used in hate speech detection. We highlight the scarcity of publicly available, diverse datasets and analyze the consequences of dialectal bias within existing resources. By emphasizing the need for culturally and contextually informed natural language processing (NLP) tools, we advocate for a more nuanced and inclusive approach to hate speech detection in the Arab world.
zh
[NLP-110] Can LLMs Help Create Grammar?: Automating Grammar Creation for Endangered Languages with In-Context Learning COLING2025
【速读】: 该论文试图解决濒危语言的文档化和保存问题,特别是如何利用大型语言模型 (Large Language Models, LLMs) 在数据有限的情况下生成低资源语言的语法信息。解决方案的关键在于通过上下文学习 (in-context learning) 和现有双语词典及平行句子的数据组织,直接利用 LLMs 生成形式化的 XLE 语法,而无需从头构建模型。这种方法不仅展示了 LLMs 在捕捉关键语法结构和词汇信息方面的有效性,还为濒危语言的保存提供了经济高效的解决方案。
链接: https://arxiv.org/abs/2412.10960
作者: Piyapath T Spencer,Nanthipat Kongborrirak
机构: Language and Information Technology Programme, Faculty of Arts, CU, Thailand; Center for Information and Language Processing (CIS), LMU Munich, Germany
关键词: Large Language Models, LLMs, Large Language, application of Large, endangered languages
类目: Computation and Language (cs.CL)
备注: Preprint manuscript. Under revision. Accepted to COLING 2025
点击查看摘要
Abstract:Yes! In present-day efforts to document and preserve endangered languages, the application of Large Language Models (LLMs) presents a promising approach. This paper explores how LLMs, particularly through in-context learning, can assist in generating grammatical information for low-resource languages with a limited amount of data. We take Moklen as a case study to evaluate the efficacy of LLMs in producing coherent grammatical rules and lexical entries using only bilingual dictionaries and parallel sentences of the unknown language without building the model from scratch. Our methodology involves organising the existing linguistic data and prompting the model to efficiently generate a formal XLE grammar. Our results demonstrate that LLMs can successfully capture key grammatical structures and lexical information, although challenges such as the potential for English grammatical biases remain. This study highlights the potential of LLMs to enhance language documentation efforts, providing a cost-effective solution for generating linguistic data and contributing to the preservation of endangered languages.
zh
[NLP-111] Human-Centric NLP or AI-Centric Illusion?: A Critical Investigation ACL
【速读】: 该论文试图解决的问题是当前自然语言处理(NLP)领域中“以人为本”(Human-Centric NLP)理念与实际实施之间的显著差距。论文通过分析语言模型、行为测试和多模态对齐的案例研究,指出当前的NLP实践往往偏离了真正的人本设计原则,将人类因素简化为基准测试,且对现实世界的影响考虑不足。解决方案的关键在于重新定义以人为本的NLP,强调跨学科合作和伦理考量,并呼吁更广泛地关注实际应用和社会影响,以确保语言技术真正服务于用户并赋予其权力。
链接: https://arxiv.org/abs/2412.10939
作者: Piyapath T Spencer
机构: Language and Information Technology Programme, Faculty of Arts, CU, Thailand (语言与信息技术项目,艺术学院,泰国); Center for Information and Language Processing (CIS), LMU Munich, Germany (语言处理中心,慕尼黑大学,德国)
关键词: underlying AI-centric focus, claims to prioritise, implementations reveal, reveal an underlying, underlying AI-centric
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)
备注: Preprint to be published in Proceedings of PACLIC38
点击查看摘要
Abstract:Human-Centric NLP often claims to prioritise human needs and values, yet many implementations reveal an underlying AI-centric focus. Through an analysis of case studies in language modelling, behavioural testing, and multi-modal alignment, this study identifies a significant gap between the ideas of human-centricity and actual practices. Key issues include misalignment with human-centred design principles, the reduction of human factors to mere benchmarks, and insufficient consideration of real-world impacts. The discussion explores whether Human-Centric NLP embodies true human-centred design, emphasising the need for interdisciplinary collaboration and ethical considerations. The paper advocates for a redefinition of Human-Centric NLP, urging a broader focus on real-world utility and societal implications to ensure that language technologies genuinely serve and empower users.
zh
[NLP-112] Enhancing Discoverability in Enterprise Conversational Systems with Proactive Question Suggestions
【速读】: 该论文试图解决企业对话式AI系统中新用户难以提出有效问题的问题,尤其是在功能不熟悉或不断演变的系统中。解决方案的关键在于提出一个框架,通过生成主动的、上下文感知的提问建议,来满足用户的即时需求并提升系统功能的可发现性。该框架结合了在群体层面定期进行用户意图分析与基于聊天会话的提问生成,从而提高了AI助手的实用性和系统功能的可发现性。
链接: https://arxiv.org/abs/2412.10933
作者: Xiaobin Shen,Daniel Lee,Sumit Ranjan,Sai Sree Harsha,Pawan Sevak,Yunyao Li
机构: Carnegie Mellon University; Adobe
关键词: completing daily tasks, customer management, increasingly popular, popular to assist, completing daily
类目: Computation and Language (cs.CL)
备注:
点击查看摘要
Abstract:Enterprise conversational AI systems are becoming increasingly popular to assist users in completing daily tasks such as those in marketing and customer management. However, new users often struggle to ask effective questions, especially in emerging systems with unfamiliar or evolving capabilities. This paper proposes a framework to enhance question suggestions in conversational enterprise AI systems by generating proactive, context-aware questions that try to address immediate user needs while improving feature discoverability. Our approach combines periodic user intent analysis at the population level with chat session-based question generation. We evaluate the framework using real-world data from the AI Assistant for Adobe Experience Platform (AEP), demonstrating the improved usefulness and system discoverability of the AI Assistant.
zh
[NLP-113] Tokens, the oft-overlooked appetizer: Large language models, the distributional hypothesis, and meaning
【速读】: 该论文试图解决的问题是当前语言模型中分词(tokenization)对模型认知的影响被忽视的现象。论文指出,尽管分词是许多语言模型(包括基于Transformer的大型语言模型,LLMs)架构中的必要组成部分,但其对模型认知的影响却常常被忽略。解决方案的关键在于认识到分词不仅是语义基元(semantic primitives),还是传递人类语言中显著分布模式(salient distributional patterns)的载体。论文建议通过语言学指导的干预来改进现有的、语言学无关的分词技术,以优化语义构建块并确保模型能够访问必要的分布模式。此外,论文还强调了分词预训练可能成为偏见和其他不良内容的“后门”,并指出分词算法的优化目标函数对LLM的认知有重要影响,尽管它与主要系统智能相对独立。
链接: https://arxiv.org/abs/2412.10924
作者: Julia Witte Zimmerman,Denis Hudon,Kathryn Cramer,Alejandro J. Ruiz,Calla Beauregard,Ashley Fehr,Mikaela Irene Fudolig,Bradford Demarest,Yoshi Meke Bird,Milo Z. Trujillo,Christopher M. Danforth,Peter Sheridan Dodds
机构: 未知
关键词: transformer-based large language, large language models, including the transformer-based, language models, human-like language performance
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:Tokenization is a necessary component within the current architecture of many language models, including the transformer-based large language models (LLMs) of Generative AI, yet its impact on the model’s cognition is often overlooked. We argue that LLMs demonstrate that the Distributional Hypothesis (DM) is sufficient for reasonably human-like language performance, and that the emergence of human-meaningful linguistic units among tokens motivates linguistically-informed interventions in existing, linguistically-agnostic tokenization techniques, particularly with respect to their roles as (1) semantic primitives and as (2) vehicles for conveying salient distributional patterns from human language to the model. We explore tokenizations from a BPE tokenizer; extant model vocabularies obtained from Hugging Face and tiktoken; and the information in exemplar token vectors as they move through the layers of a RoBERTa (large) model. Besides creating sub-optimal semantic building blocks and obscuring the model’s access to the necessary distributional patterns, we describe how tokenization pretraining can be a backdoor for bias and other unwanted content, which current alignment practices may not remediate. Additionally, we relay evidence that the tokenization algorithm’s objective function impacts the LLM’s cognition, despite being meaningfully insulated from the main system intelligence.
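下面的 Python 草图对应文中两个具体的观察对象:用 RoBERTa 的 BPE 分词器查看一个词如何被切成子词 token(以及前导空格如何改变切分),并取出各层的 token 向量以观察表示随层数的变化;示例词与句子均为假设。

```python
import torch
from transformers import AutoTokenizer, AutoModel

# 论文考察的是 RoBERTa (large);此处用 roberta-base 演示,API 相同
tokenizer = AutoTokenizer.from_pretrained("roberta-base")
model = AutoModel.from_pretrained("roberta-base")
model.eval()

# 观察点 1:BPE 切分出的子词 token,前导空格会改变切分结果
for word in ["unhappiness", " unhappiness"]:
    print(repr(word), "->", tokenizer.tokenize(word))

# 观察点 2:取出各层 hidden state,观察某个 token 的向量表示如何随层数变化
enc = tokenizer("The unhappiness of the crowd was obvious.", return_tensors="pt")
with torch.no_grad():
    hidden_states = model(**enc, output_hidden_states=True).hidden_states  # (层数+1) 个 [1, seq, dim]
print(len(hidden_states), hidden_states[0].shape)
```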
zh
[NLP-114] LLMs-in-the-Loop Part 2: Expert Small AI Models for Anonymization and De-identification of PHI Across Multiple Languages
【速读】: 该论文旨在解决慢性疾病和疫情(如 COVID-19)背景下,如何在保护患者隐私的前提下高效处理患者数据的问题。解决方案的关键在于采用 LLM-in-the-loop 方法,开发领域特定的去标识化命名实体识别(NER)模型。这些模型通过避免传输或存储敏感数据,克服了使用大型语言模型(LLMs)通过 API 带来的隐私风险,并且在去标识化任务中表现优于 LLMs,提供了更高的性能和可靠性。论文展示了这些模型在八种语言中的优异表现,确立了它们作为最精确的医疗数据匿名化解决方案的地位,并为未来的医疗 AI 创新奠定了基础。
链接: https://arxiv.org/abs/2412.10918
作者: Murat Gunay,Bunyamin Keles,Raife Hizlan
机构: 未知
关键词: protected health information, effective patient data, patient data processing, health information, rise of chronic
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 21 pages, 7 tables
点击查看摘要
Abstract:The rise of chronic diseases and pandemics like COVID-19 has emphasized the need for effective patient data processing while ensuring privacy through anonymization and de-identification of protected health information (PHI). Anonymized data facilitates research without compromising patient confidentiality. This paper introduces expert small AI models developed using the LLM-in-the-loop methodology to meet the demand for domain-specific de-identification NER models. These models overcome the privacy risks associated with large language models (LLMs) used via APIs by eliminating the need to transmit or store sensitive data. More importantly, they consistently outperform LLMs in de-identification tasks, offering superior performance and reliability. Our de-identification NER models, developed in eight languages (English, German, Italian, French, Romanian, Turkish, Spanish, and Arabic) achieved f1-micro score averages of 0.966, 0.975, 0.976, 0.970, 0.964, 0.974, 0.978, and 0.953 respectively. These results establish them as the most accurate healthcare anonymization solutions, surpassing existing small models and even general-purpose LLMs such as GPT-4o. While Part-1 of this series introduced the LLM-in-the-loop methodology for bio-medical document translation, this second paper showcases its success in developing cost-effective expert small NER models in de-identification tasks. Our findings lay the groundwork for future healthcare AI innovations, including biomedical entity and relation extraction, demonstrating the value of specialized models for domain-specific challenges.
zh
[NLP-115] Quantifying Extreme Opinions on Reddit Amidst the 2023 Israeli-Palestinian Conflict
【速读】: 该论文试图解决在2023年以色列-巴勒斯坦冲突期间,社交媒体上极端意见的动态变化问题。解决方案的关键在于开发了一种基于词典的无监督方法,通过考虑愤怒、极性和主观性等因素来量化“极端意见”。研究通过分析来自四个Reddit子论坛(r/Palestine, r/Judaism, r/IsraelPalestine, 和 r/worldnews)的超过45万条帖子,识别出与关键现实事件(如IDF轰炸Al Quds医院和Jabalia难民营,以及恐怖袭击后停火的结束)相对应的极端主义得分峰值。此外,研究还探讨了这些得分在不同子论坛间的分布和相关性,并通过Jaccard指数分析词云相似性,提供了推动极端在线意见的因素的细致理解。该方法强调了社交媒体分析在捕捉现实事件与在线讨论之间复杂互动方面的潜力,同时也揭示了在社交媒体环境中测量极端主义的局限性和挑战。
链接: https://arxiv.org/abs/2412.10913
作者: Alessio Guerra,Marcello Lepre,Oktay Karakus
机构: Cardiff University (卡迪夫大学)
关键词: Jabalia Refugee Camp, utilising a comprehensive, Reddit subreddits, investigates the dynamics, comprehensive dataset
类目: ocial and Information Networks (cs.SI); Computation and Language (cs.CL)
备注: 31 pages, 8 figures and 6 tables
点击查看摘要
Abstract:This study investigates the dynamics of extreme opinions on social media during the 2023 Israeli-Palestinian conflict, utilising a comprehensive dataset of over 450,000 posts from four Reddit subreddits (r/Palestine, r/Judaism, r/IsraelPalestine, and r/worldnews). A lexicon-based, unsupervised methodology was developed to measure “extreme opinions” by considering factors such as anger, polarity, and subjectivity. The analysis identifies significant peaks in extremism scores that correspond to pivotal real-life events, such as the IDF’s bombings of Al Quds Hospital and the Jabalia Refugee Camp, and the end of a ceasefire following a terrorist attack. Additionally, this study explores the distribution and correlation of these scores across different subreddits and over time, providing insights into the propagation of polarised sentiments in response to conflict events. By examining the quantitative effects of each score on extremism and analysing word cloud similarities through Jaccard indices, the research offers a nuanced understanding of the factors driving extreme online opinions. This approach underscores the potential of social media analytics in capturing the complex interplay between real-world events and online discourse, while also highlighting the limitations and challenges of measuring extremism in social media contexts.
zh
[NLP-116] SusGen-GPT: A Data-Centric LLM for Financial NLP and Sustainability Report Generation
【速读】: 该论文旨在解决金融领域和ESG(Environmental, Social, and Governance)领域中缺乏开源大型语言模型(LLMs)的问题。解决方案的关键在于引入了一个名为SusGen-30K的类别平衡数据集,该数据集涵盖了七个金融NLP任务和ESG报告生成任务,并提出了TCFD-Bench基准用于评估可持续发展报告生成。基于此数据集,研究团队开发了SusGen-GPT模型套件,在六个适配任务和两个现成任务上实现了最先进的性能,尽管其参数规模仅为7-8B,与GPT-4的1,700B相比显著较小,但性能仅落后2%。此外,论文还提出了SusGen系统,结合检索增强生成(Retrieval-Augmented Generation, RAG)技术,以辅助可持续发展报告的生成,展示了该方法的高效性,推动了金融和ESG领域的研究进展。
链接: https://arxiv.org/abs/2412.10906
作者: Qilong Wu,Xiaoneng Xiang,Hejia Huang,Xuan Wang,Yeo Wei Jie,Ranjan Satapathy,Ricardo Shirota Filho,Bharadwaj Veeravalli
机构: National University of Singapore, Singapore; Nanyang Technological University, Singapore; Institute of High Performance Computing (IHPC), Agency for Science, Technology and Research (A*STAR), Singapore
关键词: advanced NLP tools, focus on Environmental, ESG report generation, considerations highlight, sustainability report generation
类目: Computation and Language (cs.CL); Computational Engineering, Finance, and Science (cs.CE); Machine Learning (cs.LG); Computational Finance (q-fin.CP)
备注:
点击查看摘要
Abstract:The rapid growth of the financial sector and the rising focus on Environmental, Social, and Governance (ESG) considerations highlight the need for advanced NLP tools. However, open-source LLMs proficient in both finance and ESG domains remain scarce. To address this gap, we introduce SusGen-30K, a category-balanced dataset comprising seven financial NLP tasks and ESG report generation, and propose TCFD-Bench, a benchmark for evaluating sustainability report generation. Leveraging this dataset, we developed SusGen-GPT, a suite of models achieving state-of-the-art performance across six adapted and two off-the-shelf tasks, trailing GPT-4 by only 2% despite using 7-8B parameters compared to GPT-4’s 1,700B. Based on this, we propose the SusGen system, integrated with Retrieval-Augmented Generation (RAG), to assist in sustainability report generation. This work demonstrates the efficiency of our approach, advancing research in finance and ESG.
zh
[NLP-117] BgGPT 1.0: Extending English-centric LLM s to other languages
【速读】: 该论文旨在解决保加利亚语理解和生成任务中的性能问题,通过持续预训练和微调Google的Gemma-2模型,创建了专门针对保加利亚语优化的BgGPT-Gemma-2-27B-Instruct和BgGPT-Gemma-2-9B-Instruct模型。解决方案的关键在于利用Gemma-2的多语言能力,结合超过1000亿词元的保加利亚语和英语数据,采用持续学习策略(基于Branch-and-Merge技术)和精心筛选的训练数据,确保模型在保加利亚语任务中表现卓越的同时,保留了原始Gemma-2模型在英语语言任务中的强大能力。此外,论文还通过发布商业友好的模型权重和建立全面的基准测试,推动了该模型在研究、企业和爱好者中的广泛应用。
链接: https://arxiv.org/abs/2412.10893
作者: Anton Alexandrov,Veselin Raychev,Dimitar I. Dimitrov,Ce Zhang,Martin Vechev,Kristina Toutanova
机构: INSAIT; Sofia University “St. Kliment Ohridski”; LogicStar.ai; ETH Zurich; University of Chicago; Together AI
关键词: Bulgarian language understanding, versions of Google, Bulgarian language tasks, Bulgarian language, continually pretrained
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
点击查看摘要
Abstract:We present BgGPT-Gemma-2-27B-Instruct and BgGPT-Gemma-2-9B-Instruct: continually pretrained and fine-tuned versions of Google’s Gemma-2 models, specifically optimized for Bulgarian language understanding and generation. Leveraging Gemma-2’s multilingual capabilities and over 100 billion tokens of Bulgarian and English text data, our models demonstrate strong performance in Bulgarian language tasks, setting a new standard for language-specific AI models. Our approach maintains the robust capabilities of the original Gemma-2 models, ensuring that the English language performance remains intact. To preserve the base model capabilities, we incorporate continual learning strategies based on recent Branch-and-Merge techniques as well as thorough curation and selection of training data. We provide detailed insights into our methodology, including the release of model weights with a commercial-friendly license, enabling broader adoption by researchers, companies, and hobbyists. Further, we establish a comprehensive set of benchmarks based on non-public educational data sources to evaluate models on Bulgarian language tasks as well as safety and chat capabilities. Our findings demonstrate the effectiveness of fine-tuning state-of-the-art models like Gemma 2 to enhance language-specific AI applications while maintaining cross-lingual capabilities.
zh
[NLP-118] A Novel End-To-End Event Geolocation Method Leveraging Hyperbolic Space and Toponym Hierarchies
【速读】: 该论文试图解决基于社交数据的事件检测和地理位置定位问题,特别是在危机响应和资源分配等应用中,现有方法因事件检测错误导致的定位精度不足的问题。解决方案的关键在于提出了一种端到端的基于双曲空间(Hyperbolic space)和地名层次结构(toponym hierarchies)的事件地理位置定位方法(GTOP)。该方法包含两个主要模块:事件检测模块和地理位置定位模块。事件检测模块通过构建异构信息网络和同构消息图,结合文本和时间特征,在双曲空间中更新节点特征并进行事件分类。为减少地理位置误差,论文提出了基于地名层次结构的噪声地名过滤算法(HIST),通过分析事件集群中的地名层次结构,过滤噪声地名并确定粗粒度的事件位置。为进一步提高定位精度,还提出了基于HIST输出的细粒度伪地名生成算法(FIT),结合生成的伪地名和过滤后的地名,基于地理中心点进行事件定位。实验结果表明,该方法在精度和性能上优于现有最先进的方法。
链接: https://arxiv.org/abs/2412.10870
作者: Yaqiong Qiao,Guojun Huang
机构: 未知
关键词: Timely detection, event detection, provide critical information, event detection module, event geolocation method
类目: Computation and Language (cs.CL)
备注:
点击查看摘要
Abstract:Timely detection and geolocation of events based on social data can provide critical information for applications such as crisis response and resource allocation. However, most existing methods are greatly affected by event detection errors, leading to insufficient geolocation accuracy. To this end, this paper proposes a novel end-to-end event geolocation method (GTOP) leveraging Hyperbolic space and toponym hierarchies. Specifically, the proposed method contains one event detection module and one geolocation module. The event detection module constructs a heterogeneous information network based on social data, and then constructs a homogeneous message graph and combines it with the text and time features of the message to learn initial features of nodes. Node features are updated in Hyperbolic space and then fed into a classifier for event detection. To reduce the geolocation error, this paper proposes a noise toponym filtering algorithm (HIST) based on the hierarchical structure of toponyms. HIST analyzes the hierarchical structure of toponyms mentioned in the event cluster, taking the highly frequent city-level locations as the coarse-grained locations for events. By comparing the hierarchical structure of the toponyms within the cluster against those of the coarse-grained locations of events, HIST filters out noisy toponyms. To further improve the geolocation accuracy, we propose a fine-grained pseudo toponym generation algorithm (FIT) based on the output of HIST, and combine generated pseudo toponyms with filtered toponyms to locate events based on the geographic center points of the combined toponyms. Extensive experiments are conducted on the Chinese dataset constructed in this paper and another public English dataset. The experimental results show that the proposed method is superior to the state-of-the-art baselines.
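下面用一小段假设性代码示意“以合并地名的地理中心点定位事件”这一步(编辑所加示例,并非论文实现;此处用经纬度的算术平均近似中心点,跨越180度经线等情形需要更严格的球面计算):

```python
def event_location(toponym_coords):
    """toponym_coords: [(lat, lon), ...],为过滤后地名与生成的伪地名的坐标集合;
    返回其算术平均作为近似的地理中心点(简单示意)。"""
    if not toponym_coords:
        return None
    lats = [lat for lat, _ in toponym_coords]
    lons = [lon for _, lon in toponym_coords]
    return sum(lats) / len(lats), sum(lons) / len(lons)

# 示例:若干北京市内地名坐标的中心点
print(event_location([(39.90, 116.40), (39.92, 116.46), (39.88, 116.38)]))  # (39.90, 116.41...)
```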
zh
[NLP-119] CRENER: A Character Relation Enhanced Chinese NER Model
【速读】: 该论文试图解决中文命名实体识别 (Chinese Named Entity Recognition, NER) 中由于缺乏自然分隔符而导致的实体边界识别不准确的问题。解决方案的关键在于提出了一个字符关系增强的中文NER模型 (CRENER),通过定义四种反映字符间关系的标签,并基于三种类型的关系(字符间的邻接关系、字符与标签的关系、标签间的关系)进行细粒度建模,从而更准确地识别实体边界。具体而言,该模型将中文NER任务转化为字符-字符关系分类任务,并通过联合建模关系标签来确保实体边界识别的准确性。此外,模型还构建了一个结合未缩放的方向感知和距离感知掩码自注意力机制的适配Transformer编码器,以及一个关系表示增强模块,以有效挖掘字符与标签之间的关系表示,从而提升模型的上下文理解能力。
链接: https://arxiv.org/abs/2412.10858
作者: Yaqiong Qiao,Shixuan Peng
机构: 未知
关键词: Chinese NER, Named Entity Recognition, Chinese Named Entity, Chinese NER accuracy, Chinese NER task
类目: Computation and Language (cs.CL); Information Retrieval (cs.IR)
备注:
点击查看摘要
Abstract:Chinese Named Entity Recognition (NER) is an important task in information extraction, which has a significant impact on downstream applications. Due to the lack of natural separators in Chinese, previous NER methods mostly relied on external dictionaries to enrich the semantic and boundary information of Chinese words. However, such methods may introduce noise that affects the accuracy of named entity recognition. To this end, we propose a character relation enhanced Chinese NER model (CRENER). This model defines four types of tags that reflect the relationships between characters, and proposes a fine-grained modeling of the relationships between characters based on three types of relationships: adjacency relations between characters, relations between characters and tags, and relations between tags, to more accurately identify entity boundaries and improve Chinese NER accuracy. Specifically, we transform the Chinese NER task into a character-character relationship classification task, ensuring the accuracy of entity boundary recognition through joint modeling of relation tags. To enhance the model’s ability to understand contextual information, CRENER further constructs an adapted transformer encoder that combines unscaled direction-aware and distance-aware masked self-attention mechanisms. Moreover, a relationship representation enhancement module was constructed to model predefined relationship tags, effectively mining the relationship representations between characters and tags. Experiments conducted on four well-known Chinese NER benchmark datasets have shown that the proposed model outperforms state-of-the-art baselines. The ablation experiment also demonstrated the effectiveness of the proposed model.
zh
[NLP-120] Superhuman performance of a large language model on the reasoning tasks of a physician
【速读】: 该论文试图解决的问题是如何更有效地评估大型语言模型(LLMs)在医疗任务中的表现,特别是临床推理能力,而不仅仅是依赖传统的多选题基准测试。解决方案的关键在于采用更贴近实际临床场景的评估方法,包括差异诊断生成、诊断推理展示、分诊差异诊断、概率推理和管理推理等五个实验,并由医生专家借助经过验证的心理测量工具进行评判。研究结果表明,o1-preview模型在需要复杂批判性思维的诊断和管理任务上表现显著提升,但在概率推理和分诊差异诊断任务上未见改进。这一研究强调了开发新的、更稳健的基准测试和可扩展的评估方法的必要性,以便更好地比较LLMs与人类医生的能力,并推动AI在实际临床环境中的应用。
链接: https://arxiv.org/abs/2412.10849
作者: Peter G. Brodeur,Thomas A. Buckley,Zahir Kanjee,Ethan Goh,Evelyn Bin Ling,Priyank Jain,Stephanie Cabral,Raja-Elie Abdulnour,Adrian Haimovich,Jason A. Freed,Andrew Olson,Daniel J. Morgan,Jason Hom,Robert Gallo,Eric Horvitz,Jonathan Chen,Arjun K. Manrai,Adam Rodman
机构: 未知
关键词: multiple choice question, choice question benchmarks, large language models, large language, traditionally been evaluated
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:
点击查看摘要
Abstract:Performance of large language models (LLMs) on medical tasks has traditionally been evaluated using multiple choice question benchmarks. However, such benchmarks are highly constrained, saturated with repeated impressive performance by LLMs, and have an unclear relationship to performance in real clinical scenarios. Clinical reasoning, the process by which physicians employ critical thinking to gather and synthesize clinical data to diagnose and manage medical problems, remains an attractive benchmark for model performance. Prior LLMs have shown promise in outperforming clinicians in routine and complex diagnostic scenarios. We sought to evaluate OpenAI’s o1-preview model, a model developed to increase run-time via chain of thought processes prior to generating a response. We characterize the performance of o1-preview with five experiments including differential diagnosis generation, display of diagnostic reasoning, triage differential diagnosis, probabilistic reasoning, and management reasoning, adjudicated by physician experts with validated psychometrics. Our primary outcome was comparison of the o1-preview output to identical prior experiments that have historical human controls and benchmarks of previous LLMs. Significant improvements were observed with differential diagnosis generation and quality of diagnostic and management reasoning. No improvements were observed with probabilistic reasoning or triage differential diagnosis. This study highlights o1-preview’s ability to perform strongly on tasks that require complex critical thinking such as diagnosis and management while its performance on probabilistic reasoning tasks was similar to past models. New robust benchmarks and scalable evaluation of LLM capabilities compared to human physicians are needed along with trials evaluating AI in real clinical settings.
zh
[NLP-121] Large Language Models for Medical Forecasting – Foresight 2
【速读】: 该论文试图解决在医疗领域中通过大型语言模型(LLMs)进行患者时间线建模和预测的问题。解决方案的关键在于使用医院数据对FS2模型进行微调,使其能够理解和预测患者的临床笔记中的SNOMED代码,从而应用于诊断建议、风险预测以及治疗和药物推荐等多种生物医学用例。通过在MIMIC-III数据集的自由文本部分提取生物医学概念并构建上下文化的患者时间线,FS2在预测新的生物医学概念和疾病方面显著优于之前的先进模型,并在风险预测任务中表现优于GPT-4-turbo等大型模型,表明高质量、专业化的医院数据对模型性能的提升至关重要。
链接: https://arxiv.org/abs/2412.10848
作者: Zeljko Kraljevic,Joshua Au Yeung,Daniel Bean,James Teo,Richard J. Dobson
机构: King’s College London(伦敦国王学院); King’s College Hospital(国王学院医院)
关键词: modelling patient timelines, large language model, removed for anon, large language, language model fine-tuned
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
点击查看摘要
Abstract:Foresight 2 (FS2) is a large language model fine-tuned on hospital data for modelling patient timelines (GitHub ‘removed for anon’). It can understand patients’ clinical notes and predict SNOMED codes for a wide range of biomedical use cases, including diagnosis suggestions, risk forecasting, and procedure and medication recommendations. FS2 is trained on the free text portion of the MIMIC-III dataset, firstly through extracting biomedical concepts and then creating contextualised patient timelines, upon which the model is then fine-tuned. The results show significant improvement over the previous state-of-the-art for the next new biomedical concept prediction (P/R - 0.73/0.66 vs 0.52/0.32) and a similar improvement specifically for the next new disorder prediction (P/R - 0.69/0.62 vs 0.46/0.25). Finally, on the task of risk forecast, we compare our model to GPT-4-turbo (and a range of open-source biomedical LLMs) and show that FS2 performs significantly better on such tasks (P@5 - 0.90 vs 0.65). This highlights the need to incorporate hospital data into LLMs and shows that small models outperform much larger ones when fine-tuned on high-quality, specialised data.
zh
[NLP-122] Rethinking Chain-of-Thought from the Perspective of Self-Training
【速读】: 该论文试图解决如何通过链式思维推理(Chain-of-thought, CoT)激活大型语言模型(LLMs)的潜在能力,并探讨CoT与自训练(self-training)之间的内在关系。解决方案的关键在于揭示CoT与自训练在语义熵最小化原则上的相似性,并提出一个包含两个核心组件的新CoT框架:(i) 任务特定的提示模块(task-specific prompt module),用于引导LLMs生成高质量的初始推理过程;(ii) 自适应推理迭代模块(adaptive reasoning iteration module),用于逐步优化推理过程。
链接: https://arxiv.org/abs/2412.10827
作者: Zongqian Wu,Baoduo Xu,Ruochen Cui,Mengmeng Zhan,Xiaofeng Zhu,Lei Feng
机构: 未知
关键词: large language models, activating latent capabilities, language models, effective approach, approach for activating
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 16 pages, 12 figures
点击查看摘要
Abstract:Chain-of-thought (CoT) reasoning has emerged as an effective approach for activating latent capabilities in large language models (LLMs). We observe that CoT shares significant similarities with self-training in terms of their learning processes. Motivated by these parallels, this paper explores the underlying relationship between CoT and self-training, demonstrating how insights from self-training can enhance CoT performance. Specifically, our study first reveals that CoT, like self-training, follows the principle of semantic entropy minimization. Leveraging this insight, we propose a novel CoT framework that incorporates two key components: (i) a task-specific prompt module designed to guide LLMs in generating high-quality initial reasoning processes, and (ii) an adaptive reasoning iteration module for progressively refining the reasoning process.
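作为背景补充,语义熵通常按下式定义:对同一输入采样多个回答,按语义等价性聚成若干类,再对聚类分布求熵(此为该概念的常见定义,并非论文原文公式;符号 \mathcal{C} 表示语义等价类集合):

```latex
H_{\mathrm{sem}}(x) = -\sum_{c \in \mathcal{C}} p(c \mid x)\,\log p(c \mid x),
\qquad p(c \mid x) = \sum_{y \in c} p_\theta(y \mid x)
```

按语义熵最小化的原则,优化的方向是让模型生成的回答在语义上更加集中一致。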
zh
[NLP-123] FinGPT: Enhancing Sentiment-Based Stock Movement Prediction with Dissemination-Aware and Context-Enriched LLM s AAAI2025
【速读】: 该论文试图解决现有金融情感分析方法在预测短期股价波动时,仅依赖新闻内容本身而忽视新闻传播广度的问题。解决方案的关键在于通过数据驱动的方法,结合新闻传播广度、上下文数据和明确的指令来增强大型语言模型(LLM)的情感分析能力。具体来说,论文通过聚类公司相关新闻来评估其传播范围和影响力,并在提示中加入更具体的数据和精确的指令,构建指令微调数据集以微调LLM,从而提高短期股价预测的准确性。
链接: https://arxiv.org/abs/2412.10823
作者: Yixuan Liang,Yuncong Liu,Boyu Zhang,Christina Dan Wang,Hongyang Yang
机构: Yixuan Liang1; Yuncong Liu1; Boyu Zhang1; Christina Dan Wang1,2; Hongyang Yang1
关键词: Financial sentiment analysis, Financial sentiment, crucial for understanding, Financial, sentiment analysis
类目: Computation and Language (cs.CL); Machine Learning (cs.LG); Computational Finance (q-fin.CP); Trading and Market Microstructure (q-fin.TR)
备注: 1st Workshop on Preparing Good Data for Generative AI: Challenges and Approaches @ AAAI 2025
点击查看摘要
Abstract:Financial sentiment analysis is crucial for understanding the influence of news on stock prices. Recently, large language models (LLMs) have been widely adopted for this purpose due to their advanced text analysis capabilities. However, these models often only consider the news content itself, ignoring its dissemination, which hampers accurate prediction of short-term stock movements. Additionally, current methods often lack sufficient contextual data and explicit instructions in their prompts, limiting LLMs’ ability to interpret news. In this paper, we propose a data-driven approach that enhances LLM-powered sentiment-based stock movement predictions by incorporating news dissemination breadth, contextual data, and explicit instructions. We cluster recent company-related news to assess its reach and influence, enriching prompts with more specific data and precise instructions. This data is used to construct an instruction tuning dataset to fine-tune an LLM for predicting short-term stock price movements. Our experimental results show that our approach improves prediction accuracy by 8% compared to existing methods.
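对“聚类公司相关新闻以评估传播广度”这一步,可以用下面的假设性示意来理解(编辑所加示例,并非论文实现:用TF-IDF向量与KMeans聚类,并以聚类规模作为传播广度的粗略代理):

```python
from collections import Counter
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

headlines = [
    "Company A beats earnings expectations",
    "Company A quarterly earnings top forecasts",
    "Analysts react to Company A earnings surprise",
    "Company A announces small product update",
]

X = TfidfVectorizer().fit_transform(headlines)            # 新闻标题的TF-IDF向量
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

cluster_sizes = Counter(labels)                           # 聚类规模 ≈ 传播广度的粗略代理
for text, label in zip(headlines, labels):
    print(f"[breadth={cluster_sizes[label]}] {text}")     # 可将该信息写入提示,再据此构建指令微调数据
```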
zh
[NLP-124] Are Language Models Agnostic to Linguistically Grounded Perturbations? A Case Study of Indic Languages
【速读】: 该论文试图解决预训练语言模型 (Pre-trained Language Models, PLMs) 对语言学基础攻击的敏感性问题,特别是那些细微且自然存在的攻击。解决方案的关键在于首次系统性地研究了不同印度语言和各种下游任务中,PLMs 对语言学基础攻击的反应。研究发现,尽管 PLMs 对语言学扰动敏感,但相较于非语言学攻击,其对语言学攻击的敏感性略低,这表明即使受限的攻击也是有效的。此外,研究还探讨了这些结果在不同语言家族和不同书写系统中的普遍性。
链接: https://arxiv.org/abs/2412.10805
作者: Poulami Ghosh,Raj Dabre,Pushpak Bhattacharyya
机构: IIT Bombay(印度理工学院孟买分校); NICT(日本国立信息学研究所)
关键词: Pre-trained language models, linguistically grounded attacks, Pre-trained language, linguistically grounded, input text
类目: Computation and Language (cs.CL)
备注: Work in Progress
点击查看摘要
Abstract:Pre-trained language models (PLMs) are known to be susceptible to perturbations to the input text, but existing works do not explicitly focus on linguistically grounded attacks, which are subtle and more prevalent in nature. In this paper, we study whether PLMs are agnostic to linguistically grounded attacks or not. To this end, we offer the first study addressing this, investigating different Indic languages and various downstream tasks. Our findings reveal that although PLMs are susceptible to linguistic perturbations, when compared to non-linguistic attacks, PLMs exhibit a slightly lower susceptibility to linguistic attacks. This highlights that even constrained attacks are effective. Moreover, we investigate the implications of these outcomes across a range of languages, encompassing diverse language families and different scripts.
zh
[NLP-125] WEPO: Web Element Preference Optimization for LLM -based Web Navigation AAAI2025
【速读】: 该论文试图解决在自主网页导航任务中,如何更有效地利用HTML元素冗余进行对比训练的问题。解决方案的关键在于提出了Web Element Preference Optimization (WEPO)方法,通过无监督的偏好学习,采样基于距离的非显著网页元素作为负样本,并在Direct Preference Optimization (DPO)框架下优化最大似然目标。这一方法在Mind2Web基准测试中显著提升了模型对用户高层意图与输出动作的匹配效果,达到了当前最先进的性能。
链接: https://arxiv.org/abs/2412.10742
作者: Jiarun Liu,Jia Hao,Chunhong Zhang,Zheng Hu
机构: State Key Laboratory of Networking and Switching Technology(网络与交换技术国家重点实验室); Beijing University of Posts and Telecommunications(北京邮电大学)
关键词: grounding pretrained Large, pretrained Large Language, Large Language Models, pretrained Large, Large Language
类目: Computation and Language (cs.CL)
备注: Published at AAAI 2025
点击查看摘要
Abstract:The rapid advancement of autonomous web navigation has significantly benefited from grounding pretrained Large Language Models (LLMs) as agents. However, current research has yet to fully leverage the redundancy of HTML elements for contrastive training. This paper introduces a novel approach to LLM-based web navigation tasks, called Web Element Preference Optimization (WEPO). WEPO utilizes unsupervised preference learning by sampling distance-based non-salient web elements as negative samples, optimizing maximum likelihood objective within Direct Preference Optimization (DPO). We evaluate WEPO on the Mind2Web benchmark and empirically demonstrate that WEPO aligns user high-level intent with output actions more effectively. The results show that our method achieved the state-of-the-art, with an improvement of 13.8% over WebAgent and 5.3% over the visual language model CogAgent baseline. Our findings underscore the potential of preference optimization to enhance web navigation and other web page based tasks, suggesting a promising direction for future research.
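作为背景,WEPO 所基于的 DPO 最大似然目标的标准形式如下(此为 DPO 的通用公式;在 WEPO 中,y_w 对应标注的正确网页元素/动作,y_l 对应按距离采样的非显著元素负样本,β 为温度系数,π_ref 为参考策略,具体符号约定以论文为准):

```latex
\mathcal{L}_{\mathrm{DPO}}(\pi_\theta;\pi_{\mathrm{ref}})
= -\,\mathbb{E}_{(x,\,y_w,\,y_l)\sim\mathcal{D}}
\left[\log \sigma\!\left(
\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
- \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
\right)\right]
```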
zh
[NLP-126] HITgram: A Platform for Experimenting with n-gram Language Models
【速读】: 该论文试图解决大规模语言模型(LLMs)资源消耗大、难以在资源受限环境中使用的问题。解决方案的关键在于HITgram平台,它通过提供轻量级的n-gram模型实验环境来实现这一目标。HITgram支持从unigrams到4-grams的模型,并集成了上下文敏感加权、拉普拉斯平滑(Laplace smoothing)和动态语料库管理等功能,以提高预测准确性,特别是在处理未见过的词序列时。实验结果表明,HITgram在效率上表现出色,能够在资源有限的环境中高效运行,并计划通过多语言支持、高级平滑、并行处理和模型保存等增强功能进一步扩展其应用范围。
链接: https://arxiv.org/abs/2412.10717
作者: Shibaranjani Dasgupta,Chandan Maity,Somdip Mukherjee,Rohan Singh,Diptendu Dutta,Debasish Jana
机构: Indian Institute of Technology Kharagpur (印度理工学院卡拉格普尔分校); Indian Institute of Engineering Science and Technology Shibpur (印度工程科学与技术学院希布尔分校)
关键词: Large language models, Large language, limiting accessibility, resource intensive, powerful but resource
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:Large language models (LLMs) are powerful but resource intensive, limiting accessibility. HITgram addresses this gap by offering a lightweight platform for n-gram model experimentation, ideal for resource-constrained environments. It supports unigrams to 4-grams and incorporates features like context-sensitive weighting, Laplace smoothing, and dynamic corpus management to enhance prediction accuracy, even for unseen word sequences. Experiments demonstrate HITgram’s efficiency, achieving 50,000 tokens/second and generating 2-grams from a 320MB corpus in 62 seconds. HITgram scales efficiently, constructing 4-grams from a 1GB file in under 298 seconds on an 8 GB RAM system. Planned enhancements include multilingual support, advanced smoothing, parallel processing, and model saving, further broadening its utility.
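下面给出一个极简的 bigram 示意(编辑所加的假设性实现,并非 HITgram 源码),演示计数式 n-gram 模型与拉普拉斯平滑如何对未见过的词组合也给出非零概率:

```python
from collections import Counter, defaultdict

class BigramModel:
    def __init__(self):
        self.bigram_counts = defaultdict(Counter)  # 前词 -> 后词计数
        self.unigram_counts = Counter()
        self.vocab = set()

    def train(self, tokens):
        for prev, cur in zip(tokens, tokens[1:]):
            self.bigram_counts[prev][cur] += 1
            self.unigram_counts[prev] += 1
            self.vocab.update((prev, cur))

    def prob(self, prev, cur, alpha=1.0):
        """拉普拉斯(加一)平滑:对未见过的词组合也给出非零概率。"""
        v = len(self.vocab)
        return (self.bigram_counts[prev][cur] + alpha) / (self.unigram_counts[prev] + alpha * v)

    def predict_next(self, prev):
        """返回平滑概率最高的候选后续词。"""
        return max(self.vocab, key=lambda w: self.prob(prev, w))

m = BigramModel()
m.train("the cat sat on the mat".split())
print(m.predict_next("the"))   # 'cat' 或 'mat'(二者计数相同)
print(m.prob("the", "dog"))    # 未见过的组合,概率仍大于0
```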
zh
[NLP-127] owards Effective Efficient and Unsupervised Social Event Detection in the Hyperbolic Space AAAI2025
【速读】: 该论文试图解决社交消息数据在社交事件检测(Social Event Detection, SED)中的复杂性和动态性带来的挑战,特别是消息表示不充分(ineffective)和学习时间过长(inefficient)的问题。解决方案的关键在于提出了一个无监督框架HyperSED(Hyperbolic SED),通过将社交消息建模为基于语义的消息锚点,并利用锚点图的结构和双曲空间的表现力来获取结构和几何感知的锚点表示。最终,HyperSED通过结合可微分的结构信息构建锚点消息图的分区树,以反映检测到的事件。实验结果表明,HyperSED在性能上具有竞争力,并且在效率上显著优于当前最先进的无监督方法。
链接: https://arxiv.org/abs/2412.10712
作者: Xiaoyan Yu,Yifan Wei,Shuaishuai Zhou,Zhiwei Yang,Li Sun,Hao Peng,Liehuang Zhu,Philip S. Yu
机构: 1. Beijing University of Posts and Telecommunications (北京邮电大学); 2. Beijing Institute of Technology (北京理工大学); 3. Tsinghua University (清华大学); 4. Peking University (北京大学); 5. Shanghai Jiao Tong University (上海交通大学); 6. University of Illinois at Chicago (伊利诺伊大学芝加哥分校)
关键词: social message data, social event detection, dynamic nature, data has posed, posed challenges
类目: Computation and Language (cs.CL)
备注: Accepted to AAAI 2025
点击查看摘要
Abstract:The vast, complex, and dynamic nature of social message data has posed challenges to social event detection (SED). Despite considerable effort, these challenges persist, often resulting in inadequately expressive message representations (ineffective) and prolonged learning durations (inefficient). In response to the challenges, this work introduces an unsupervised framework, HyperSED (Hyperbolic SED). Specifically, the proposed framework first models social messages into semantic-based message anchors, and then leverages the structure of the anchor graph and the expressiveness of the hyperbolic space to acquire structure- and geometry-aware anchor representations. Finally, HyperSED builds the partitioning tree of the anchor message graph by incorporating differentiable structural information as the reflection of the detected events. Extensive experiments on public datasets demonstrate HyperSED’s competitive performance, along with a substantial improvement in efficiency compared to the current state-of-the-art unsupervised paradigm. Statistically, HyperSED boosts incremental SED by an average of 2%, 2%, and 25% in NMI, AMI, and ARI, respectively; enhancing efficiency by up to 37.41 times and at least 12.10 times, illustrating the advancement of the proposed framework. Our code is publicly available at this https URL.
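作为背景,双曲空间常用的庞加莱球模型中两点 u、v 的距离定义如下(通用公式,非论文特有记号);其随点靠近球面边界而迅速增大的性质,使低维嵌入也能较好地表达层次或树状结构:

```latex
d_{\mathbb{B}}(\mathbf{u}, \mathbf{v})
= \operatorname{arcosh}\!\left(
1 + \frac{2\,\lVert \mathbf{u} - \mathbf{v} \rVert^{2}}
{\bigl(1 - \lVert \mathbf{u} \rVert^{2}\bigr)\bigl(1 - \lVert \mathbf{v} \rVert^{2}\bigr)}
\right)
```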
zh
[NLP-128] Efficient Adaptation of Multilingual Models for Japanese ASR
【速读】: 该论文试图解决多语言自动语音识别(ASR)模型在特定语言(如日语)中的精确度不足的问题。解决方案的关键在于使用日语特定数据集,结合低秩适应(Low-Rank Adaptation, LoRA)和端到端(end-to-end, E2E)训练方法对OpenAI的Whisper-Tiny模型进行微调。通过这种方法,Whisper-Tiny的字符错误率(CER)显著降低,从32.7降至20.8(使用LoRA)和14.7(使用E2E微调),超越了Whisper-Base的CER(20.2)。尽管如此,领域特定术语的识别仍存在挑战,表明需要专门的语料库。这一研究展示了微调多语言模型在保持灵活性的同时,能够实现强大的语言特定性能,为资源受限环境中的ASR改进提供了可扩展的解决方案。
链接: https://arxiv.org/abs/2412.10705
作者: Mark Bajo,Haruka Fukukawa,Ryuji Morita,Yuma Ogasawara
机构: Georgia Institute of Technology (乔治亚理工学院)
关键词: Automatic Speech Recognition, Automatic Speech, Speech Recognition, specifically OpenAI Whisper-Tiny, study explores fine-tuning
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Sound (cs.SD); Audio and Speech Processing (eess.AS)
备注:
点击查看摘要
Abstract:This study explores fine-tuning multilingual ASR (Automatic Speech Recognition) models, specifically OpenAI’s Whisper-Tiny, to improve performance in Japanese. While multilingual models like Whisper offer versatility, they often lack precision in specific languages. Conversely, monolingual models like ReazonSpeech excel in language-specific tasks but are less adaptable. Using Japanese-specific datasets and Low-Rank Adaptation (LoRA) along with end-to-end (E2E) training, we fine-tuned Whisper-Tiny to bridge this gap. Our results show that fine-tuning reduced Whisper-Tiny’s Character Error Rate (CER) from 32.7 to 20.8 with LoRA and to 14.7 with end-to-end fine-tuning, surpassing Whisper-Base’s CER of 20.2. However, challenges with domain-specific terms remain, highlighting the need for specialized datasets. These findings demonstrate that fine-tuning multilingual models can achieve strong language-specific performance while retaining their flexibility. This approach provides a scalable solution for improving ASR in resource-constrained environments and languages with complex writing systems like Japanese.
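作为参考,下面是用 Hugging Face transformers 与 peft 为 Whisper-Tiny 注入 LoRA 适配器的最小示意(编辑所加示例:r、lora_alpha、target_modules 等超参数均为示例值,并非论文配置;数据加载与训练循环从略):

```python
from transformers import WhisperForConditionalGeneration, WhisperProcessor
from peft import LoraConfig, get_peft_model

model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-tiny")
processor = WhisperProcessor.from_pretrained("openai/whisper-tiny", language="Japanese", task="transcribe")

# 仅在注意力的 q/v 投影上注入低秩适配器,其余参数保持冻结
lora_config = LoraConfig(
    r=8,                 # 低秩矩阵的秩(示例值)
    lora_alpha=32,       # 缩放系数(示例值)
    target_modules=["q_proj", "v_proj"],
    lora_dropout=0.05,
    bias="none",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # 可训练参数通常只占总参数量的很小一部分
```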
zh
[NLP-129] VisDoM: Multi-Document QA with Visually Rich Elements Using Multimodal Retrieval-Augmented Generation
【速读】: 该论文旨在解决多文档环境中,尤其是包含丰富视觉元素(如表格、图表、演示文稿)的文档集合中的问答问题。其关键解决方案是提出了VisDoMRAG,一种新颖的多模态检索增强生成(Retrieval Augmented Generation, RAG)方法。VisDoMRAG通过结合视觉和文本RAG,采用多步推理过程,包括证据整理和链式推理,以同时处理文本和视觉信息。其核心创新在于一致性约束的模态融合机制,该机制在推理时对齐不同模态的推理过程,从而生成连贯的最终答案,提升了在跨模态信息分布场景中的准确性和答案的可验证性。
链接: https://arxiv.org/abs/2412.10704
作者: Manan Suri,Puneet Mathur,Franck Dernoncourt,Kanika Goswami,Ryan A. Rossi,Dinesh Manocha
机构: University of Maryland, College Park(马里兰大学学院公园分校); Adobe Research(Adobe研究); IGDTUW(IGDTUW)
关键词: document-grounded question answering, visually rich elements, question answering, Understanding information, Retrieval Augmented Generation
类目: Computation and Language (cs.CL)
备注:
点击查看摘要
Abstract:Understanding information from a collection of multiple documents, particularly those with visually rich elements, is important for document-grounded question answering. This paper introduces VisDoMBench, the first comprehensive benchmark designed to evaluate QA systems in multi-document settings with rich multimodal content, including tables, charts, and presentation slides. We propose VisDoMRAG, a novel multimodal Retrieval Augmented Generation (RAG) approach that simultaneously utilizes visual and textual RAG, combining robust visual retrieval capabilities with sophisticated linguistic reasoning. VisDoMRAG employs a multi-step reasoning process encompassing evidence curation and chain-of-thought reasoning for concurrent textual and visual RAG pipelines. A key novelty of VisDoMRAG is its consistency-constrained modality fusion mechanism, which aligns the reasoning processes across modalities at inference time to produce a coherent final answer. This leads to enhanced accuracy in scenarios where critical information is distributed across modalities and improved answer verifiability through implicit context attribution. Through extensive experiments involving open-source and proprietary large language models, we benchmark state-of-the-art document QA methods on VisDoMBench. Extensive results show that VisDoMRAG outperforms unimodal and long-context LLM baselines for end-to-end multimodal document QA by 12-20%.
zh
[NLP-130] Learning to Verify Summary Facts with Fine-Grained LLM Feedback COLING2025
【速读】: 该论文试图解决自动摘要事实验证模型训练中缺乏人工标注数据的问题。解决方案的关键在于利用大型语言模型(LLM)生成的反馈来替代传统的人工标注数据。具体来说,论文提出了FineSumFact数据集,该数据集包含对摘要的细粒度事实反馈,通过多样化的LLM生成摘要和Llama-3-70B-Instruct模型提供反馈。利用这一数据集对轻量级开源模型Llama-3-8B-Instruct进行微调,既优化了资源效率,又保持了高性能。实验结果表明,基于LLM生成数据集训练的模型在人类生成的测试集上表现优于基于较小规模人工标注数据集训练的模型,从而证明了使用LLM反馈进行事实验证模型微调的有效性和成本效益。
链接: https://arxiv.org/abs/2412.10689
作者: Jihwan Oh,Jeonghwan Choi,Nicole Hee-Yeon Kim,Taewon Yun,Hwanjun Song
机构: Korea Advanced Institute of Science and Technology (韩国科学技术院)
关键词: Training automatic summary, Training automatic, human-labeled data, leveraging Large Language, Large Language Model
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Accepted at COLING 2025
点击查看摘要
Abstract:Training automatic summary fact verifiers often faces the challenge of a lack of human-labeled data. In this paper, we explore alternative way of leveraging Large Language Model (LLM) generated feedback to address the inherent limitation of using human-labeled data. We introduce FineSumFact, a large-scale dataset containing fine-grained factual feedback on summaries. We employ 10 distinct LLMs for diverse summary generation and Llama-3-70B-Instruct for feedback. We utilize this dataset to fine-tune the lightweight open-source model Llama-3-8B-Instruct, optimizing resource efficiency while maintaining high performance. Our experimental results reveal that the model trained on extensive LLM-generated datasets surpasses that trained on smaller human-annotated datasets when evaluated using human-generated test sets. Fine-tuning fact verification models with LLM feedback can be more effective and cost-efficient than using human feedback. The dataset is available at this https URL.
zh
[NLP-131] Inference Scaling for Bridging Retrieval and Augmented Generation
【速读】: 该论文试图解决检索增强生成 (Retrieval-augmented Generation, RAG) 中存在的生成器偏差问题,即改进检索结果可能会对生成结果产生负面影响。解决方案的关键在于提出了一种名为“干预混合 (Mixture-of-Intervention, MOI)”的方法,通过聚合从不同检索上下文顺序中得到的推理调用来减轻偏差。MOI 通过多次前向传递显式建模每个段落的去偏置效用,并构建新的排名。此外,MOI 利用检索器的先验知识来减少计算成本,通过最小化考虑的排列数量和降低每次语言模型调用的成本。实验结果表明,MOI 在 MS MARCO 和 HotpotQA 基准测试中分别将 ROUGE-L 和 EM 提高了约 7 个百分点。
链接: https://arxiv.org/abs/2412.10684
作者: Youngwon Lee,Seung-won Hwang,Daniel Campos,Filip Graliński,Zhewei Yao,Yuxiong He
机构: Snowflake AI Research; Seoul National University
关键词: Retrieval-augmented generation, large language model, incorporating retrieved contexts, popular approach, approach to steering
类目: Computation and Language (cs.CL)
备注:
点击查看摘要
Abstract:Retrieval-augmented generation (RAG) has emerged as a popular approach to steering the output of a large language model (LLM) by incorporating retrieved contexts as inputs. However, existing work observed the generator bias, such that improving the retrieval results may negatively affect the outcome. In this work, we show such bias can be mitigated, from inference scaling, aggregating inference calls from the permuted order of retrieved contexts. The proposed Mixture-of-Intervention (MOI) explicitly models the debiased utility of each passage with multiple forward passes to construct a new ranking. We also show that MOI can leverage the retriever’s prior knowledge to reduce the computational cost by minimizing the number of permutations considered and lowering the cost per LLM call. We showcase the effectiveness of MOI on diverse RAG tasks, improving ROUGE-L on MS MARCO and EM on HotpotQA benchmarks by ~7 points.
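下面用一段假设性代码示意“对检索段落的不同排列分别推理再聚合”的基本思路(编辑所加示例:llm_answer 为占位参数,聚合方式用多数投票粗略代替论文中对段落效用的显式建模):

```python
import itertools
from collections import Counter

def aggregate_over_permutations(llm_answer, question, passages, max_perms=6):
    """llm_answer(question, passages) 为任意“按给定顺序读取上下文并作答”的调用(占位参数)。
    对检索段落的若干排列分别推理,再对答案做多数投票,以缓解生成器对段落顺序的偏差。"""
    answers = []
    for perm in itertools.islice(itertools.permutations(passages), max_perms):
        answers.append(llm_answer(question, list(perm)))
    return Counter(answers).most_common(1)[0][0]

# 玩具演示:该“模型”只相信排在首位的段落
toy_llm = lambda q, ps: ps[0]
print(aggregate_over_permutations(toy_llm, "capital of France?",
                                  ["Paris.", "Lyon.", "Paris is the capital."]))
```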
zh
[NLP-132] Chasing Progress Not Perfection: Revisiting Strategies for End-to-End LLM Plan Generation AAAI2025
【速读】: 该论文试图解决关于大型语言模型(LLMs)在规划任务中的能力争议,特别是关于通过训练数据提升规划能力的有效性问题。研究的关键在于通过开发一个端到端的LLM规划器,并采用多样化的评估指标,重新审视现有的提升策略。研究发现,仅通过在规划语料库上微调LLMs并不能显著提升其在分布外测试集上的规划能力,但诸如思维链(Chain-of-Thought)等策略确实能提高规划的可执行性。其中,基于“最长连续公共子序列”奖励的强化学习策略被证明最为有效,能够同时提升规划的有效性和可执行性。尽管如此,规划的有效性仍是一个挑战,未来的研究应同时关注这两个方面。
链接: https://arxiv.org/abs/2412.10675
作者: Sukai Huang,Trevor Cohn,Nir Lipovetzky
机构: Google DeepMind
关键词: Large Language Models, Large Language, Language Models, capability of Large, topic of debate
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 8 pages main body, 10 pages appendix, accepted by Workshop on Planning in the Era of LLMs (LM4Plan @ AAAI 2025)
点击查看摘要
Abstract:The capability of Large Language Models (LLMs) to plan remains a topic of debate. Some critics argue that strategies to boost LLMs’ reasoning skills are ineffective in planning tasks, while others report strong outcomes merely from training models on a planning corpus. This study reassesses recent strategies by developing an end-to-end LLM planner and employing diverse metrics for a thorough evaluation. We find that merely fine-tuning LLMs on a corpus of planning instances does not lead to robust planning skills, as indicated by poor performance on out-of-distribution test sets. At the same time, we find that various strategies, including Chain-of-Thought, do enhance the probability of a plan being executable. This indicates progress towards better plan quality, despite not directly enhancing the final validity rate. Among the strategies we evaluated, reinforcement learning with our novel ‘Longest Contiguous Common Subsequence’ reward emerged as the most effective, contributing to both plan validity and executability. Overall, our research addresses key misconceptions in the LLM-planning literature; we validate incremental progress in plan executability, although plan validity remains a challenge. Hence, future strategies should focus on both these aspects, drawing insights from our findings.
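文中效果最好的策略以“最长连续公共子序列”(即最长公共子串)作为强化学习奖励;下面给出该量的标准动态规划计算示意(编辑所加示例,归一化方式仅作演示,并非论文的原始奖励定义):

```python
def lccs_length(pred_actions, gold_actions):
    """最长连续公共子序列(最长公共子串)的长度,动态规划 O(m*n)。"""
    m, n = len(pred_actions), len(gold_actions)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    best = 0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if pred_actions[i - 1] == gold_actions[j - 1]:
                dp[i][j] = dp[i - 1][j - 1] + 1
                best = max(best, dp[i][j])
    return best

def lccs_reward(pred_actions, gold_actions):
    """以参考计划长度归一化,得到 [0, 1] 区间的奖励(归一化方式为示例)。"""
    return lccs_length(pred_actions, gold_actions) / max(len(gold_actions), 1)

gold = ["pick(a)", "move(a,b)", "stack(a,c)", "noop"]
pred = ["pick(a)", "move(a,b)", "stack(b,c)", "noop"]
print(lccs_reward(pred, gold))  # 2/4 = 0.5
```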
zh
[NLP-133] hinking with Knowledge Graphs: Enhancing LLM Reasoning Through Structured Data
【速读】: 该论文试图解决大型语言模型 (LLMs) 在复杂推理任务中表现不佳且容易产生幻觉的问题。解决方案的关键在于将知识图谱 (KGs) 的结构和语义紧密集成到 LLM 的表示中。通过利用 KGs 提供的实体及其关系的结构化表示,论文开发了不同的技术来增强 LLM 的推理能力,并首次使用编程语言表示 KGs 并微调预训练的 LLMs。这种集成不仅显著提升了 LLM 在复杂推理场景中的性能,还使推理过程更加准确和可解释。
链接: https://arxiv.org/abs/2412.10654
作者: Xue Wu,Kostas Tsioutsiouliklis
机构: Yahoo Research(雅虎研究院); Facts.ai
关键词: Large Language Models, natural language understanding, demonstrated remarkable capabilities, Large Language, Language Models
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:
点击查看摘要
Abstract:Large Language Models (LLMs) have demonstrated remarkable capabilities in natural language understanding and generation. However, they often struggle with complex reasoning tasks and are prone to hallucination. Recent research has shown promising results in leveraging knowledge graphs (KGs) to enhance LLM performance. KGs provide a structured representation of entities and their relationships, offering a rich source of information that can enhance the reasoning capabilities of LLMs. For this work, we have developed different techniques that tightly integrate KG structures and semantics into LLM representations. Our results show that we are able to significantly improve the performance of LLMs in complex reasoning scenarios, and ground the reasoning process with KGs. We are the first to represent KGs with programming language and fine-tune pretrained LLMs with KGs. This integration facilitates more accurate and interpretable reasoning processes, paving the way for more advanced reasoning capabilities of LLMs.
zh
[NLP-134] BinarySelect to Improve Accessibility of Black-Box Attack Research COLING2025
【速读】: 该论文试图解决对抗性文本攻击研究中由于Transformer模型计算需求高而导致的效率问题,特别是在资源有限的情况下(如缺乏GPU)。解决方案的关键是提出了一种名为BinarySelect的高效选择方法,该方法结合了二分查找 (binary search) 和攻击选择策略,显著减少了查找单个token所需的查询次数。具体而言,BinarySelect仅需 2×log₂(n) 次查询即可找到第一个token,相较于传统方法的 n 次查询大幅提升效率。通过在多个数据集和分类器上的实验,BinarySelect在减少查询次数的同时,仅轻微降低了攻击效果,为资源有限的研究者提供了更高效的对抗性攻击研究工具。
链接: https://arxiv.org/abs/2412.10617
作者: Shatarupa Ghosh,Jonathan Rusert
机构: Purdue University, Fort Wayne (普渡大学韦恩堡分校)
关键词: robustness of NLP, NLP models, queries, testing the robustness, rise of transformers
类目: Cryptography and Security (cs.CR); Computation and Language (cs.CL)
备注: Accepted to COLING 2025, 17 pages, 5 figures, 11 tables
点击查看摘要
Abstract:Adversarial text attack research is useful for testing the robustness of NLP models; however, the rise of transformers has greatly increased the time required to test attacks, especially when researchers do not have access to adequate resources (e.g., GPUs). This can hinder attack research, as modifying one example for an attack can require hundreds of queries to a model, especially for black-box attacks. Often these attacks remove one token at a time to find the ideal one to change, requiring n queries (the length of the text) right away. We propose a more efficient selection method called BinarySelect which combines binary search and attack selection methods to greatly reduce the number of queries needed to find a token. We find that BinarySelect only needs log_2(n) * 2 queries to find the first token compared to n queries. We also test BinarySelect in an attack setting against 5 classifiers across 3 datasets and find a viable tradeoff between number of queries saved and attack effectiveness. For example, on the Yelp dataset, the number of queries is reduced by 32% (72 less) with a drop in attack effectiveness of only 5 points. We believe that BinarySelect can help future researchers study adversarial attacks and black-box problems more efficiently and opens the door for researchers with access to fewer resources.
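按照摘要的描述,BinarySelect 用二分方式定位“删除后对受害模型分数影响最大”的 token,而非逐个删除;下面是编辑对这一思路的概念性示意(假设性实现,score_drop 为占位的查询函数,具体细节以论文为准):

```python
def binary_select(tokens, score_drop):
    """score_drop(remaining_tokens) 为占位查询函数:返回删除部分token后
    受害模型相对原文的置信度下降(每次调用计一次查询)。
    每层对左右两半各查询一次,沿下降更大的一半递归,
    约 2*log2(n) 次查询即可定位一个影响最大的token(概念性示意)。"""
    lo, hi = 0, len(tokens)                                # 当前候选区间 [lo, hi)
    while hi - lo > 1:
        mid = (lo + hi) // 2
        drop_left = score_drop(tokens[:lo] + tokens[mid:])   # 删除左半 [lo, mid)
        drop_right = score_drop(tokens[:mid] + tokens[hi:])  # 删除右半 [mid, hi)
        if drop_left >= drop_right:
            hi = mid
        else:
            lo = mid
    return lo                                              # 影响最大的token下标

# 玩具演示:假设受害模型只关心 "terrible" 这个词
text = "the movie was absolutely terrible overall".split()
toy_drop = lambda remaining: 1.0 if "terrible" not in remaining else 0.0
print(text[binary_select(text, toy_drop)])  # terrible
```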
zh
[NLP-135] Evaluation of GPT-4o GPT-4o-minis Vision Capabilities for Salt Evaporite Identification
【速读】: 该论文试图解决从盐渍图像中识别盐类的问题,并探讨了使用OpenAI的先进视觉模型(如GPT-4o和GPT-4o-mini)作为即时解决方案的可行性。解决方案的关键在于利用GPT-4o模型在识别盐类方面的显著性能提升,其准确率达到57%,F1得分为0.52,远超随机猜测(8%)和GPT-4o-mini(11%),表明当前的视觉模型可以作为盐渍图像识别的临时解决方案。
链接: https://arxiv.org/abs/2412.10587
作者: Deven B. Dangi,Beni B. Dangi,Oliver Steinbock
机构: 未知
关键词: diverse practical applications, stains’ has diverse, practical applications, Identifying salts, diverse practical
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: 11 pages, 7 figures
点击查看摘要
Abstract:Identifying salts from images of their ‘stains’ has diverse practical applications. While specialized AI models are being developed, this paper explores the potential of OpenAI’s state-of-the-art vision models (GPT-4o and GPT-4o-mini) as an immediate solution. Testing with 12 different types of salts, the GPT-4o model achieved 57% accuracy and a 0.52 F1 score, significantly outperforming both random chance (8%) and GPT-4o mini (11% accuracy). Results suggest that current vision models could serve as an interim solution for salt identification from stain images.
zh
[NLP-136] WHAT-IF: Exploring Branching Narratives by Meta-Prompting Large Language Models
【速读】: 该论文试图解决如何从预先编写的线性故事中生成分支叙事的问题。解决方案的关键在于使用零样本元提示(zero-shot meta-prompting)技术,通过交互式小说(Interactive Fiction, IF)游戏的形式,让玩家在GPT-4生成的主要角色决策分支中进行选择。系统通过元提示引导GPT-4考虑故事中的关键情节点,从而生成连贯且结构良好的替代故事线。此外,系统将分支情节树存储在图中,这不仅有助于跟踪故事以进行提示,还能保持最终IF系统的结构完整性。
链接: https://arxiv.org/abs/2412.10582
作者: Runsheng “Anson” Huang,Lara J. Martin,Chris Callison-Burch
机构: University of Pennsylvania(宾夕法尼亚大学); University of Maryland, Baltimore County(马里兰大学巴尔的摩分校)
关键词: Hero Alternate Timeline, Writing a Hero, Interactive Fiction, create branching narratives, Hero Alternate
类目: Computation and Language (cs.CL)
备注:
点击查看摘要
Abstract:WHAT-IF – Writing a Hero’s Alternate Timeline through Interactive Fiction – is a system that uses zero-shot meta-prompting to create branching narratives from a prewritten story. Played as an interactive fiction (IF) game, WHAT-IF lets the player choose between decisions that the large language model (LLM) GPT-4 generates as possible branches in the story. Starting with an existing linear plot as input, a branch is created at each key decision taken by the main character. By meta-prompting the LLM to consider the major plot points from the story, the system produces coherent and well-structured alternate storylines. WHAT-IF stores the branching plot tree in a graph which helps it to both keep track of the story for prompting and maintain the structure for the final IF system. A video demo of our system can be found here: this https URL.
zh
[NLP-137] Evidence Contextualization and Counterfactual Attribution for Conversational QA over Heterogeneous Data with RAG Systems WSDM2025
【速读】: 该论文试图解决当前检索增强生成 (Retrieval Augmented Generation, RAG) 系统中的两个主要问题:(i) 检索到的段落通常包含原始文本,缺乏适当的文档上下文,影响检索和回答质量;(ii) 归因策略仅依赖于答案与检索段落之间的相似性,导致生成的解释仅是合理的而非因果的。解决方案的关键在于提出RAGONITE系统,通过以下方式解决这些问题:(i) 使用源元数据和周围文本来上下文化证据;(ii) 采用反事实归因 (counterfactual attribution),通过比较原始响应与移除某证据后得到的答案来确定该证据对答案的贡献,从而提供因果解释。实验结果表明,上下文化证据提升了RAG性能,而反事实归因有效解释了RAG答案。
链接: https://arxiv.org/abs/2412.10571
作者: Rishiraj Saha Roy,Joel Schlotthauer,Chris Hinze,Andreas Foltyn,Luzian Hahn,Fabian Kuech
机构: Fraunhofer IIS Audio and Media Technologies (弗劳恩霍夫IIS音频和媒体技术)
关键词: Retrieval Augmented Generation, Retrieval Augmented, Augmented Generation, Conversational Question Answering, RAG
类目: Computation and Language (cs.CL); Information Retrieval (cs.IR)
备注: Extended version of demo paper accepted at WSDM 2025
点击查看摘要
Abstract:Retrieval Augmented Generation (RAG) works as a backbone for interacting with an enterprise’s own data via Conversational Question Answering (ConvQA). In a RAG system, a retriever fetches passages from a collection in response to a question, which are then included in the prompt of a large language model (LLM) for generating a natural language (NL) answer. However, several RAG systems today suffer from two shortcomings: (i) retrieved passages usually contain their raw text and lack appropriate document context, negatively impacting both retrieval and answering quality; and (ii) attribution strategies that explain answer generation usually rely only on similarity between the answer and the retrieved passages, thereby only generating plausible but not causal explanations. In this work, we demonstrate RAGONITE, a RAG system that remedies the above concerns by: (i) contextualizing evidence with source metadata and surrounding text; and (ii) computing counterfactual attribution, a causal explanation approach where the contribution of an evidence to an answer is determined by the similarity of the original response to the answer obtained by removing that evidence. To evaluate our proposals, we release a new benchmark ConfQuestions, with 300 hand-created conversational questions, each in English and German, coupled with ground truth URLs, completed questions, and answers from 215 public Confluence pages, that are typical of enterprise wiki spaces with heterogeneous elements. Experiments with RAGONITE on ConfQuestions show the viability of our ideas: contextualization improves RAG performance, and counterfactual attribution is effective at explaining RAG answers.
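下面用一段假设性代码示意“反事实归因”的计算方式(编辑所加示例:generate_answer 为占位的RAG生成调用,相似度用词级Jaccard粗略代替,实际系统会采用更合适的度量):

```python
def token_similarity(a, b):
    """粗略的词级 Jaccard 相似度,仅作示意。"""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb) if (sa or sb) else 1.0

def counterfactual_attribution(generate_answer, question, evidences):
    """generate_answer(question, evidences) 为占位的RAG生成调用。
    某条证据的贡献 = 1 - sim(原答案, 去掉该证据后的答案):
    去掉后答案变化越大,说明该证据对答案的因果贡献越大。"""
    original = generate_answer(question, evidences)
    scores = {}
    for i, ev in enumerate(evidences):
        ablated = generate_answer(question, evidences[:i] + evidences[i + 1:])
        scores[i] = 1.0 - token_similarity(original, ablated)
    return original, scores

# 玩具演示:生成器简单地拼接包含 "capital" 的证据
toy_gen = lambda q, evs: " ".join(e for e in evs if "capital" in e) or "unknown"
ans, attrib = counterfactual_attribution(toy_gen, "What is the capital of France?",
                                         ["Paris is the capital of France.", "France is in Europe."])
print(ans)      # Paris is the capital of France.
print(attrib)   # 证据0的贡献远大于证据1
```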
zh
[NLP-138] oo Big to Fool: Resisting Deception in Language Models
【速读】: 该论文试图解决大语言模型在处理包含误导性信息的提示时如何平衡其内部知识与上下文信息的问题。解决方案的关键在于,通过实验发现,更大容量的模型表现出更高的抗误导能力,能够更好地解释和整合提示信息与内部知识,而非简单地忽略上下文信息。这种能力源于模型对提示中隐含任务相关信息的更好利用,而非单纯的记忆效应。
链接: https://arxiv.org/abs/2412.10558
作者: Mohammad Reza Samsami,Mats Leon Richter,Juan Rodriguez,Megh Thakkar,Sarath Chandar,Maxime Gasse
机构: 未知
关键词: Large language models, generate accurate responses, Large language, accurate responses, balance their weight-encoded
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
点击查看摘要
Abstract:Large language models must balance their weight-encoded knowledge with in-context information from prompts to generate accurate responses. This paper investigates this interplay by analyzing how models of varying capacities within the same family handle intentionally misleading in-context information. Our experiments demonstrate that larger models exhibit higher resilience to deceptive prompts, showcasing an advanced ability to interpret and integrate prompt information with their internal knowledge. Furthermore, we find that larger models outperform smaller ones in following legitimate instructions, indicating that their resilience is not due to disregarding in-context information. We also show that this phenomenon is likely not a result of memorization but stems from the models’ ability to better leverage implicit task-relevant information from the prompt alongside their internally stored knowledge.
zh
[NLP-139] RAGServe: Fast Quality-Aware RAG Systems with Configuration Adaptation
【速读】: 该论文试图解决RAG系统中生成质量与响应延迟之间的权衡问题。解决方案的关键在于提出了RAGServe,这是首个能够联合调度查询并动态调整RAG配置(如检索文本块数量和合成方法)的系统,以在优化生成质量和减少响应延迟之间实现平衡。通过在四个流行的RAG-QA数据集上的实验,RAGServe在不牺牲生成质量的前提下,将生成延迟降低了1.64到2.54倍。
链接: https://arxiv.org/abs/2412.10543
作者: Siddhant Ray,Rui Pan,Zhuohan Gu,Kuntai Du,Ganesh Ananthanarayanan,Ravi Netravali,Junchen Jiang
机构: University of Chicago; Microsoft Research; Princeton University
关键词: Retrieval Augmented Generation, Retrieval Augmented, large language models, external knowledge, Augmented Generation
类目: Machine Learning (cs.LG); Computation and Language (cs.CL); Information Retrieval (cs.IR)
备注: 17 pages, 18 figures
点击查看摘要
Abstract:RAG (Retrieval Augmented Generation) allows LLMs (large language models) to generate better responses with external knowledge, but using more external knowledge often improves generation quality at the expense of response delay. Prior work either reduces the response delay (through better scheduling of RAG queries) or strives to maximize quality (which involves tuning the RAG workflow), but they fall short in optimizing the tradeoff between the delay and quality of RAG responses. This paper presents RAGServe, the first RAG system that jointly schedules queries and adapts the key RAG configurations of each query, such as the number of retrieved text chunks and synthesis methods, in order to balance quality optimization and response delay reduction. Using 4 popular RAG-QA datasets, we show that compared with the state-of-the-art RAG optimization schemes, RAGServe reduces the generation latency by 1.64-2.54× without sacrificing generation quality.
zh
[NLP-140] On Adversarial Robustness and Out-of-Distribution Robustness of Large Language Models
【速读】: 该论文试图解决大语言模型(LLMs)在对抗性扰动和分布外(OOD)输入情况下的鲁棒性问题,特别是探讨对抗性鲁棒性和OOD鲁棒性之间的关联性。解决方案的关键在于通过应用原本用于提升一种鲁棒性的方法,分析其在对抗性和OOD基准数据集上的表现,并评估不同模型大小和架构下这两种鲁棒性之间的相互作用。研究发现,这两种鲁棒性之间的关联性因模型大小和架构而异,例如LLaMA2-7b表现出中性关联,LLaMA2-13b表现出负相关,而Mixtral则表现出正相关。这些结果强调了针对特定模型和领域定制的混合鲁棒性框架的重要性,以实现更可靠和泛化能力更强的大语言模型。
链接: https://arxiv.org/abs/2412.10535
作者: April Yang,Jordan Tab,Parth Shah,Paul Kotchavong
机构: 未知
关键词: diverse applications necessitates, OOD robustness, robustness, increasing reliance, reliance on large
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:The increasing reliance on large language models (LLMs) for diverse applications necessitates a thorough understanding of their robustness to adversarial perturbations and out-of-distribution (OOD) inputs. In this study, we investigate the correlation between adversarial robustness and OOD robustness in LLMs, addressing a critical gap in robustness evaluation. By applying methods originally designed to improve one robustness type across both contexts, we analyze their performance on adversarial and out-of-distribution benchmark datasets. The input of the model consists of text samples, with the output prediction evaluated in terms of accuracy, precision, recall, and F1 scores in various natural language inference tasks. Our findings highlight nuanced interactions between adversarial robustness and OOD robustness, with results indicating limited transferability between the two robustness types. Through targeted ablations, we evaluate how these correlations evolve with different model sizes and architectures, uncovering model-specific trends: smaller models like LLaMA2-7b exhibit neutral correlations, larger models like LLaMA2-13b show negative correlations, and Mixtral demonstrates positive correlations, potentially due to domain-specific alignment. These results underscore the importance of hybrid robustness frameworks that integrate adversarial and OOD strategies tailored to specific models and domains. Further research is needed to evaluate these interactions across larger models and varied architectures, offering a pathway to more reliable and generalizable LLMs.
zh
[NLP-141] Solving the Inverse Alignment Problem for Efficient RLHF
【速读】: 该论文试图解决在强化学习从人类反馈 (RLHF) 中,由于高质量偏好数据集的收集成本高且困难,导致奖励模型在训练时使用大量离线数据集进行聚合,从而产生平均效应,影响奖励模型的信号质量和对齐过程的问题。解决方案的关键在于提出“逆向对齐问题”,即在固定的策略和离线偏好数据集下,优化评论者的奖励。通过在RLHF过程中,定期冻结策略并对奖励模型进行子集微调,以提高奖励模型的质量和对齐效果。实验结果表明,这种方法相较于传统的RLHF,能够实现更好的对齐效果和更快的收敛速度。
链接: https://arxiv.org/abs/2412.10529
作者: Shambhavi Krishna,Aishwarya Sahoo
机构: University of Massachusetts Amherst(马萨诸塞大学阿默斯特分校)
关键词: Collecting high-quality preference, Collecting high-quality, resource-intensive and challenging, high-quality preference datasets, reinforcement learning
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注:
点击查看摘要
Abstract:Collecting high-quality preference datasets for reinforcement learning from human feedback (RLHF) is resource-intensive and challenging. As a result, researchers often train reward models on extensive offline datasets which aggregate diverse generation sources and scoring/alignment policies. We hypothesize that this aggregation has an averaging effect on reward model scores, which limits signal and impairs the alignment process. Inspired by the field of inverse RL, we define the ‘inverse alignment problem’ in language model training, where our objective is to optimize the critic’s reward for a fixed actor and a fixed offline preference dataset. We hypothesize that solving the inverse alignment problem will improve reward model quality by providing clearer feedback on the policy’s current behavior. To that end, we investigate whether repeatedly fine-tuning a reward model on subsets of the offline preference dataset aligned with a periodically frozen policy during RLHF improves upon vanilla RLHF. Our empirical results demonstrate that this approach facilitates superior alignment and faster convergence compared to using an unaligned or out-of-distribution reward model relative to the LLM policy.
zh
[NLP-142] DEFAME: Dynamic Evidence-based FAct-checking with Multimodal Experts
【速读】: 该论文旨在解决开放域中涉及文本和图像的声明验证问题,特别是面对日益增长的虚假信息对社会信任和民主的威胁。解决方案的关键在于提出了一个名为 DEFAME 的模块化、零样本多模态语言模型(MLLM)管道,该系统将事实核查问题框架化为一个六阶段的流程,动态决定使用外部工具检索文本和视觉证据,并生成包含验证结果和多模态证据的综合报告。与现有方法相比,DEFAME 不仅解决了端到端的事实核查问题,还支持包含图像的声明或需要视觉证据的声明,并在多个基准测试中超越了以往的方法,成为新的最先进的事实核查系统。
链接: https://arxiv.org/abs/2412.10510
作者: Tobias Braun,Mark Rothermel,Marcus Rohrbach,Anna Rohrbach
机构: Technical University of Darmstadt (达姆施塔特工业大学); hessian.AI (hessian.AI)
关键词: present Dynamic Evidence-based, Dynamic Evidence-based FAct-checking, trust and democracy, necessitating robust, scalable Fact-Checking systems
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
备注:
点击查看摘要
Abstract:The proliferation of disinformation presents a growing threat to societal trust and democracy, necessitating robust and scalable Fact-Checking systems. In this work, we present Dynamic Evidence-based FAct-checking with Multimodal Experts (DEFAME), a modular, zero-shot MLLM pipeline for open-domain, text-image claim verification. DEFAME frames the problem of fact-checking as a six-stage process, dynamically deciding about the usage of external tools for the retrieval of textual and visual evidence. In addition to the claim’s veracity, DEFAME returns a justification accompanied by a comprehensive, multimodal fact-checking report. While most alternatives either focus on sub-tasks of fact-checking, lack explainability or are limited to text-only inputs, DEFAME solves the problem of fact-checking end-to-end, including claims with images or those that require visual evidence. Evaluation on the popular benchmarks VERITE, AVeriTeC, and MOCHEG shows that DEFAME surpasses all previous methods, establishing it as the new state-of-the-art fact-checking system.
zh
[NLP-143] Do Large Language Models Show Biases in Causal Learning?
【速读】: 该论文试图解决的问题是探究大型语言模型(LLMs)是否会在因果学习与推理过程中表现出因果错觉(causal illusion)的偏差。解决方案的关键在于构建了一个包含2000多个样本的数据集,涵盖了纯相关性案例、零关联性场景以及通过时间信息排除因果关系可能性的案例。通过让模型在这些结构化环境中进行因果陈述或回答因果问题,研究评估了模型在不同情境下错误推断因果关系的倾向。研究发现,LLMs在涉及虚假相关性的开放生成任务中表现出与人类相似甚至更低的偏差,但在面对零关联性或时间线索否定因果关系的情境中,模型表现出显著更高的偏差。这表明模型并未一致、可靠地内化因果学习所需的规范性原则。
链接: https://arxiv.org/abs/2412.10509
作者: Maria Victoria Carro,Francisca Gauna Selasco,Denise Alejandra Mester,Margarita Gonzales,Mario A. Leiva,Maria Vanina Martinez,Gerardo I. Simari
机构: 未知
关键词: developing the capability, capability of making, making causal inferences, causal inferences based, Causal
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: 15 pages, 6 figures
点击查看摘要
Abstract:Causal learning is the cognitive process of developing the capability of making causal inferences based on available information, often guided by normative principles. This process is prone to errors and biases, such as the illusion of causality, in which people perceive a causal relationship between two variables despite lacking supporting evidence. This cognitive bias has been proposed to underlie many societal problems, including social prejudice, stereotype formation, misinformation, and superstitious thinking. In this research, we investigate whether large language models (LLMs) develop causal illusions, both in real-world and controlled laboratory contexts of causal learning and inference. To this end, we built a dataset of over 2K samples including purely correlational cases, situations with null contingency, and cases where temporal information excludes the possibility of causality by placing the potential effect before the cause. We then prompted the models to make statements or answer causal questions to evaluate their tendencies to infer causation erroneously in these structured settings. Our findings show a strong presence of causal illusion bias in LLMs. Specifically, in open-ended generation tasks involving spurious correlations, the models displayed bias at levels comparable to, or even lower than, those observed in similar studies on human subjects. However, when faced with null-contingency scenarios or temporal cues that negate causal relationships, where it was required to respond on a 0-100 scale, the models exhibited significantly higher bias. These findings suggest that the models have not uniformly, consistently, or reliably internalized the normative principles essential for accurate causal learning.
zh
[NLP-144] MGM: Global Understanding of Audience Overlap Graphs for Predicting the Factuality and the Bias of News Media
【速读】: 该论文旨在解决新闻媒体在政治偏见和事实性方面的分类问题,特别是在处理传统方法如预训练语言模型 (Pre-trained Language Models, PLMs) 和图神经网络 (Graph Neural Networks, GNNs) 的局限性时。传统方法的局限在于PLMs仅依赖文本特征而忽略了实体间的复杂关系,而GNNs则在处理包含不连通组件和标签不足的媒体图时表现不佳。论文提出的解决方案是MediaGraphMind (MGM),它在一个变分期望最大化 (Expectation-Maximization, EM) 框架内,通过利用全局相似节点的特征、结构模式和标签信息,解决了上述问题。MGM不仅增强了GNNs捕捉长程依赖关系的能力,还通过整合结构信息提升了PLMs的性能,从而实现了两者的协同优化,并在实验中取得了新的最先进结果。
链接: https://arxiv.org/abs/2412.10467
作者: Muhammad Arslan Manzoor,Ruihong Zeng,Dilshod Azizov,Preslav Nakov,Shangsong Liang
机构: Mohamed bin Zayed University of Artificial Intelligence, UAE (穆罕默德·本·扎耶德人工智能大学,阿联酋); Sun Yat-sen University, China (中山大学,中国)
关键词: growing digital data, rapidly growing digital, reliable information online, seeking reliable information, political bias
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (stat.ML)
备注:
点击查看摘要
Abstract:In the current era of rapidly growing digital data, evaluating the political bias and factuality of news outlets has become more important for seeking reliable information online. In this work, we study the classification problem of profiling news media from the lens of political bias and factuality. Traditional profiling methods, such as Pre-trained Language Models (PLMs) and Graph Neural Networks (GNNs), have shown promising results, but they face notable challenges. PLMs focus solely on textual features, causing them to overlook the complex relationships between entities, while GNNs often struggle with media graphs containing disconnected components and insufficient labels. To address these limitations, we propose MediaGraphMind (MGM), an effective solution within a variational Expectation-Maximization (EM) framework. Instead of relying on limited neighboring nodes, MGM leverages features, structural patterns, and label information from globally similar nodes. Such a framework not only enables GNNs to capture long-range dependencies for learning expressive node representations but also enhances PLMs by integrating structural information and therefore improving the performance of both models. The extensive experiments demonstrate the effectiveness of the proposed framework and achieve new state-of-the-art results. Further, we share our repository, which contains the dataset, code, and documentation.
zh
[NLP-145] NAT-NL2GQL: A Novel Multi-Agent Framework for Translating Natural Language to Graph Query Language
【速读】: 该论文试图解决将自然语言转换为图查询语言(NL2GQL)的复杂问题,特别是现有方法在处理复杂查询时依赖简化流程、缺乏自主规划和协作能力的局限性。解决方案的关键在于提出了一个新颖的多代理框架NAT-NL2GQL,该框架通过三个协同工作的代理(Preprocessor、Generator和Refiner)来实现自然语言到图查询语言的转换。Preprocessor负责数据处理和上下文管理,Generator是经过微调的大型语言模型(LLM),负责生成相应的图查询语言(GQL)语句,而Refiner则利用查询执行结果中的错误信息来优化GQL或上下文。此外,论文还开发了基于金融市场的图数据库数据集StockGQL,以弥补高质量开源NL2GQL数据集的不足。
链接: https://arxiv.org/abs/2412.10434
作者: Yuanyuan Liang,Tingyu Xie,Gan Peng,Zihao Huang,Yunshi Lan,Weining Qian
机构: School of Data Science and Engineering, East China Normal University; School of Computer Science and Technology, Zhejiang University
关键词: Large Language Models, emergence of Large, Language Models, Large Language, traditional natural language
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Databases (cs.DB)
备注: 12 pages,6 figures
点击查看摘要
Abstract:The emergence of Large Language Models (LLMs) has revolutionized many fields, not only traditional natural language processing (NLP) tasks. Recently, research on applying LLMs to the database field has been booming, and as a typical non-relational database, the use of LLMs in graph database research has naturally gained significant attention. Recent efforts have increasingly focused on leveraging LLMs to translate natural language into graph query language (NL2GQL). Although some progress has been made, these methods have clear limitations, such as their reliance on streamlined processes that often overlook the potential of LLMs to autonomously plan and collaborate with other LLMs in tackling complex NL2GQL challenges. To address this gap, we propose NAT-NL2GQL, a novel multi-agent framework for translating natural language to graph query language. Specifically, our framework consists of three synergistic agents: the Preprocessor agent, the Generator agent, and the Refiner agent. The Preprocessor agent manages data processing as context, including tasks such as name entity recognition, query rewriting, path linking, and the extraction of query-related schemas. The Generator agent is a fine-tuned LLM trained on NL-GQL data, responsible for generating corresponding GQL statements based on queries and their related schemas. The Refiner agent is tasked with refining the GQL or context using error information obtained from the GQL execution results. Given the scarcity of high-quality open-source NL2GQL datasets based on nGQL syntax, we developed StockGQL, a dataset constructed from a financial market graph database. It is available at: this https URL. Experimental results on the StockGQL and SpCQL datasets reveal that our method significantly outperforms baseline approaches, highlighting its potential for advancing NL2GQL research.
zh
[NLP-146] Imitate Before Detect: Aligning Machine Stylistic Preference for Machine-Revised Text Detection AAAI2025
【速读】: 该论文试图解决机器修订文本(如重写、扩展和润色)的检测问题,这类文本仅在细微处与原始人类提示有所不同,现有的检测方法难以识别隐藏在人类贡献内容中的机器风格表达。解决方案的关键在于提出的“先模仿后检测”(Imitate Before Detect, ImBD)方法,该方法首先模仿机器风格的标记分布,然后通过比较待测文本的分布与机器风格分布来判断是否经过机器修订。具体实现中,引入了风格偏好优化(Style Preference Optimization, SPO),通过调整评分LLM模型以适应机器生成的文本风格,并计算风格条件概率曲率(Style-Conditional Probability Curvature, Style-CPC),量化原始文本与条件采样文本之间的对数概率差异,从而实现有效检测。
链接: https://arxiv.org/abs/2412.10432
作者: Jiaqi Chen,Xiaoye Zhu,Tianyang Liu,Ying Chen,Xinhui Chen,Yiwen Yuan,Chak Tou Leong,Zuchao Li,Tang Long,Lei Zhang,Chenyu Yan,Guanghao Mei,Jie Zhang,Lefei Zhang
机构: Fudan University(复旦大学); South China University of Technology(华南理工大学); Wuhan University(武汉大学); Fenz AI; UC San Diego(加州大学圣地亚哥分校); UIUC(伊利诺伊大学厄巴纳-香槟分校); CMU(卡内基梅隆大学); PolyU(香港理工大学); Stanford University(斯坦福大学); NUS (Chongqing) Research Institute(新加坡国立大学重庆研究院); Georgia Tech(乔治亚理工学院)
关键词: Large Language Models, Large Language, making detecting machine-generated, revolutionized text generation, text increasingly challenging
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
备注: Accepted by AAAI 2025
点击查看摘要
Abstract:Large Language Models (LLMs) have revolutionized text generation, making detecting machine-generated text increasingly challenging. Although past methods have achieved good performance on detecting pure machine-generated text, those detectors have poor performance on distinguishing machine-revised text (rewriting, expansion, and polishing), which can have only minor changes from its original human prompt. As the content of text may originate from human prompts, detecting machine-revised text often involves identifying distinctive machine styles, e.g., wording favored by LLMs. However, existing methods struggle to detect machine-style phrasing hidden within the content contributed by humans. We propose the “Imitate Before Detect” (ImBD) approach, which first imitates the machine-style token distribution, and then compares the distribution of the text to be tested with the machine-style distribution to determine whether the text has been machine-revised. To this end, we introduce style preference optimization (SPO), which aligns a scoring LLM to the preference of text styles generated by machines. The aligned scoring model is then used to calculate the style-conditional probability curvature (Style-CPC), quantifying the log probability difference between the original and conditionally sampled texts for effective detection. We conduct extensive comparisons across various scenarios, encompassing text revisions by six LLMs, four distinct text domains, and three machine revision types. Compared to existing state-of-the-art methods, our method yields a 13% increase in AUC for detecting text revised by open-source LLMs, and improves performance by 5% and 19% for detecting GPT-3.5 and GPT-4o revised text, respectively. Notably, our method surpasses the commercially trained GPT-Zero with just 1,000 samples and five minutes of SPO, demonstrating its efficiency and effectiveness.
zh
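【补充示例】下面给出条件概率曲率打分的一个极简示意(假设性实现):用现成的 GPT-2 充当打分模型,演示 Style-CPC 的核心思想,即比较待测文本的对数概率与按模型自身条件分布逐位重采样文本的对数概率;论文中的打分模型需先经 SPO 对齐,此处未包含该步骤,阈值与归一化细节亦以原文为准。
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# "gpt2" is only a stand-in scoring model; the paper first aligns its scoring
# LLM with SPO, which is not reproduced here.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

@torch.no_grad()
def probability_curvature(text: str, n_samples: int = 64) -> float:
    """Compare the log-prob of the observed tokens against log-probs of tokens
    resampled from the model's own conditional distributions; a higher score
    means the text sits closer to the model's preferred (machine) style."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    logits = model(ids).logits[:, :-1]          # position t predicts token t+1
    targets = ids[:, 1:]
    log_probs = torch.log_softmax(logits, dim=-1)

    # Log-likelihood of the text as written.
    ll_observed = log_probs.gather(-1, targets.unsqueeze(-1)).squeeze(-1).sum()

    # Log-likelihoods of token-wise resampled alternatives.
    sampled = torch.distributions.Categorical(
        logits=logits.expand(n_samples, -1, -1)).sample()
    ll_sampled = log_probs.expand(n_samples, -1, -1).gather(
        -1, sampled.unsqueeze(-1)).squeeze(-1).sum(dim=-1)

    return ((ll_observed - ll_sampled.mean()) / ll_sampled.std()).item()

print(probability_curvature("The results demonstrate remarkable efficacy."))
```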
[NLP-147] Identifying and Manipulating Personality Traits in LLMs Through Activation Engineering
【速读】: 该论文试图解决大型语言模型(LLMs)中动态调整个性特征的问题,关键解决方案在于利用“激活工程”(activation engineering)技术,通过识别和调整与个性特质相关的激活方向,实现对LLMs个性的动态微调。这一方法借鉴了Refusal in LLMs Is Mediated by a Single Direction和Steering Llama 2 via Contrastive Activation Addition等研究,旨在提升LLMs的可解释性,并探讨其潜在的伦理影响。
链接: https://arxiv.org/abs/2412.10427
作者: Rumi A. Allbert,James K. Wiles
机构: Wolfram Institute(沃尔夫勒姆研究所)
关键词: large language models, Contrastive Activation Addition, language models, recent years, field of large
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:The field of large language models (LLMs) has grown rapidly in recent years, driven by the desire for better efficiency, interpretability, and safe use. Building on the novel approach of “activation engineering,” this study explores personality modification in LLMs, drawing inspiration from research like Refusal in LLMs Is Mediated by a Single Direction (arXiv:2406.11717) and Steering Llama 2 via Contrastive Activation Addition (arXiv:2312.06681). We leverage activation engineering to develop a method for identifying and adjusting activation directions related to personality traits, which may allow for dynamic LLM personality fine-tuning. This work aims to further our understanding of LLM interpretability while examining the ethical implications of such developments.
zh
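【补充示例】下面是对比激活导向(contrastive activation steering)的最小示意(假设性实现,模型、层号与提示词均为演示用):先用两组语义相反的提示取残差流激活的均值之差作为"人格方向",再在前向钩子中把该方向加到隐藏状态上以影响生成;论文中具体的方向识别与调节方法以原文为准。
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()
LAYER = 6  # arbitrary block index chosen for this demo

@torch.no_grad()
def mean_last_token_activation(prompts, layer):
    """Average residual-stream activation at the final token of each prompt."""
    acts = []
    for p in prompts:
        ids = tok(p, return_tensors="pt").input_ids
        hs = model(ids, output_hidden_states=True).hidden_states[layer]
        acts.append(hs[0, -1])
    return torch.stack(acts).mean(0)

# Contrastive prompt sets for opposite poles of one trait (illustrative only).
pole_a = ["I love meeting new people and talking all night.",
          "Parties energize me; I always start conversations."]
pole_b = ["I prefer a quiet evening alone with a book.",
          "Crowds drain me, so I avoid small talk."]

direction = mean_last_token_activation(pole_a, LAYER) - \
            mean_last_token_activation(pole_b, LAYER)
direction = direction / direction.norm()

def steer(module, inputs, output, alpha=8.0):
    """Add the trait direction to this block's output hidden states."""
    return (output[0] + alpha * direction,) + output[1:]

handle = model.transformer.h[LAYER].register_forward_hook(steer)
ids = tok("When I have a free weekend, I", return_tensors="pt").input_ids
out = model.generate(ids, max_new_tokens=30, do_sample=False,
                     pad_token_id=tok.eos_token_id)
handle.remove()
print(tok.decode(out[0], skip_special_tokens=True))
```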
[NLP-148] CAP: Evaluation of Persuasive and Creative Image Generation
【速读】: 该论文试图解决广告图像生成中的评估难题,特别是针对生成图像的创造性、提示对齐和说服力(Creativity, prompt Alignment, and Persuasiveness, CAP)的评估。现有方法主要关注与明确描述的对齐,但缺乏对视觉隐含提示的对齐、创造性和说服力的评估。论文提出了三个新的评估指标来衡量这些方面,并指出当前的文本到图像(Text-to-Image, T2I)生成模型在处理隐含提示时存在创造性、说服力和对齐方面的不足。解决方案的关键在于引入一种简单而有效的方法,以增强T2I模型在这些方面的能力,从而生成更符合要求、更具创造性和说服力的广告图像。
链接: https://arxiv.org/abs/2412.10426
作者: Aysan Aghazadeh,Adriana Kovashka
机构: University of Pittsburgh (匹兹堡大学)
关键词: CAP, advertisement image generation, Alignment, Persuasiveness, images
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Graphics (cs.GR)
备注:
点击查看摘要
Abstract:We address the task of advertisement image generation and introduce three evaluation metrics to assess Creativity, prompt Alignment, and Persuasiveness (CAP) in generated advertisement images. Despite recent advancements in Text-to-Image (T2I) generation and their performance in generating high-quality images for explicit descriptions, evaluating these models remains challenging. Existing evaluation methods focus largely on assessing alignment with explicit, detailed descriptions, but evaluating alignment with visually implicit prompts remains an open problem. Additionally, creativity and persuasiveness are essential qualities that enhance the effectiveness of advertisement images, yet are seldom measured. To address this, we propose three novel metrics for evaluating the creativity, alignment, and persuasiveness of generated images. Our findings reveal that current T2I models struggle with creativity, persuasiveness, and alignment when the input text is implicit messages. We further introduce a simple yet effective approach to enhance T2I models’ capabilities in producing images that are better aligned, more creative, and more persuasive.
zh
[NLP-149] Active Inference for Self-Organizing Multi-LLM Systems: A Bayesian Thermodynamic Approach to Adaptation
【速读】: 该论文试图解决大语言模型 (LLMs) 在面对新信息和变化环境时适应性不足的问题,其关键解决方案在于将主动推理 (active inference) 框架与 LLMs 集成,形成一个动态调整提示 (prompt) 和搜索策略的认知层。通过引入自由能原理 (free energy principle),该框架能够系统性地探索不同的提示组合和搜索策略,从而实现对环境动态的准确建模和复杂的信息寻求行为。实验结果表明,这种集成方法有效提升了语言代理的适应性和鲁棒性,使其能够在高维度的语言驱动环境中表现出精细的探索-利用行为。
链接: https://arxiv.org/abs/2412.10425
作者: Rithvik Prakki
机构: University of North Carolina (北卡罗来纳大学)
关键词: paper introduces, integrating active inference, active inference, integrating active, large language models
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:This paper introduces a novel approach to creating adaptive language agents by integrating active inference with large language models (LLMs). While LLMs demonstrate remarkable capabilities, their reliance on static prompts limits adaptation to new information and changing environments. We address this by implementing an active inference framework that acts as a cognitive layer above an LLM-based agent, dynamically adjusting prompts and search strategies through principled information-seeking behavior. Our framework models the environment using three state factors (prompt, search, and information states) with seven observation modalities capturing quality metrics. By framing the agent’s learning through the free energy principle, we enable systematic exploration of prompt combinations and search strategies. Experimental results demonstrate the effectiveness of this approach, with the agent developing accurate models of environment dynamics evidenced by emergent structure in observation matrices. Action selection patterns reveal sophisticated exploration-exploitation behavior, transitioning from initial information-gathering to targeted prompt testing. The integration of thermodynamic principles with language model capabilities provides a principled framework for creating robust, adaptable agents, extending active inference beyond traditional low-dimensional control problems to high-dimensional, language-driven environments.
zh
[NLP-150] LLM-AS-AN-INTERVIEWER: Beyond Static Testing Through Dynamic LLM Evaluation
【速读】: 该论文试图解决现有大型语言模型(LLMs)评估方法中的局限性,包括数据污染(data contamination)、冗长偏差(verbosity bias)和自我增强偏差(self enhancement bias)。解决方案的关键在于提出了一种新的评估范式,称为“LLM-as-an-Interviewer”,该范式通过两阶段过程来评估LLMs的真实能力:首先修改基准数据集以生成初始查询,然后通过反馈和后续问题与LLM进行交互。这种多轮评估过程不仅提供了对LLM在实际场景中表现的深入洞察,还特别关注了其对反馈的适应性和处理后续问题的能力。最终,论文提出的“Interview Report”为LLM的能力提供了全面的评估,展示了其在实际应用中的强项和弱项。
链接: https://arxiv.org/abs/2412.10424
作者: Eunsu Kim,Juyoung Suk,Seungone Kim,Niklas Muennighoff,Dongkwan Kim,Alice Oh
机构: KAIST; Carnegie Mellon University; Stanford University; Contextual AI
关键词: large language models, language models, paradigm for large, large language, LLM
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:We introduce a novel evaluation paradigm for large language models (LLMs), LLM-as-an-Interviewer. This approach consists of a two-stage process designed to assess the true capabilities of LLMs: first, modifying benchmark datasets to generate initial queries, and second, interacting with the LLM through feedback and follow-up questions. Compared to existing evaluation methods such as LLM-as-a-Judge, our framework addresses several limitations, including data contamination, verbosity bias, and self-enhancement bias. Additionally, we show that our multi-turn evaluation process provides valuable insights into the LLM’s performance in real-world scenarios, including its adaptability to feedback and its ability to handle follow-up questions, including clarification or requests for additional knowledge. Finally, we propose the Interview Report, which offers a comprehensive reflection of an LLM’s strengths and weaknesses, illustrated with specific examples from the interview process. This report delivers a snapshot of the LLM’s capabilities, providing a detailed picture of its practical performance.
zh
[NLP-151] Look Before You Leap: Enhancing Attention and Vigilance Regarding Harmful Content with GuidelineLLM
【速读】: 该论文试图解决大语言模型 (LLMs) 在面对新兴的越狱攻击 (jailbreak attacks) 时,其对齐机制 (alignment mechanisms) 容易受到威胁的问题。解决方案的关键在于提出了一种名为 GuidelineLLM 的新型防御范式,该范式通过在 LLMs 响应查询之前,识别查询中潜在的有害内容,并将这些风险总结为指导建议,进而将这些建议反馈给 LLMs。这一方法无需对 LLMs 本身进行额外的安全微调 (fine-tuning),仅需对 GuidelineLLM 进行微调,从而增强了其在不同 LLMs 中的通用性。实验结果表明,GuidelineLLM 能够显著降低攻击成功率 (ASR),平均减少 34.17%,同时保持 LLMs 在处理良性查询时的有效性。
链接: https://arxiv.org/abs/2412.10423
作者: Shaoqing Zhang,Zhuosheng Zhang,Kehai Chen,Rongxiang Weng,Muyun Yang,Tiejun Zhao,Min Zhang
机构: 1. Institute of Computing Technology, Chinese Academy of Sciences (中国科学院计算技术研究所); 2. School of Computer Science and Technology, Harbin Institute of Technology (哈尔滨工业大学计算机科学与技术学院); 3. School of Computer Science and Technology, Soochow University (苏州大学计算机科学与技术学院)
关键词: large language models, alignment mechanisms, large language, language models, increasingly vulnerable
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:Despite being empowered with alignment mechanisms, large language models (LLMs) are increasingly vulnerable to emerging jailbreak attacks that can compromise their alignment mechanisms. This vulnerability poses significant risks to the real-world applications. Existing work faces challenges in both training efficiency and generalization capabilities (i.e., Reinforcement Learning from Human Feedback and Red-Teaming). Developing effective strategies to enable LLMs to resist continuously evolving jailbreak attempts represents a significant challenge. To address this challenge, we propose a novel defensive paradigm called GuidelineLLM, which assists LLMs in recognizing queries that may have harmful content. Before LLMs respond to a query, GuidelineLLM first identifies potential risks associated with the query, summarizes these risks into guideline suggestions, and then feeds these guidelines to the responding LLMs. Importantly, our approach eliminates the necessity for additional safety fine-tuning of the LLMs themselves; only the GuidelineLLM requires fine-tuning. This characteristic enhances the general applicability of GuidelineLLM across various LLMs. Experimental results demonstrate that GuidelineLLM can significantly reduce the attack success rate (ASR) against the LLMs (an average reduction of 34.17% ASR) while maintaining the helpfulness of the LLMs in handling benign queries. Code is available at this https URL.
zh
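【补充示例】下面是该"先给指导、再作答"两阶段流程的示意代码(假设性实现:call_llm 为占位函数,模型名与提示词均为演示用),仅用于说明 GuidelineLLM 先识别风险并生成指导建议、应答模型再在建议约束下回复的调用顺序。
```python
def call_llm(model: str, prompt: str) -> str:
    """Placeholder: swap in a real chat/completion client here."""
    return f"[{model} output for a {len(prompt)}-char prompt]"

GUIDELINE_PROMPT = (
    "Analyse the user query below. List the potential safety risks it raises "
    "and summarise them as short guidelines the assistant should follow.\n\n"
    "Query: {query}"
)

def answer_with_guidelines(query: str,
                           guideline_model: str = "guideline-llm",
                           responder_model: str = "target-llm") -> str:
    # Stage 1: the fine-tuned guideline model flags risks before any answer.
    guidelines = call_llm(guideline_model, GUIDELINE_PROMPT.format(query=query))
    # Stage 2: the unmodified responding LLM sees those guidelines as context.
    final_prompt = (f"Safety guidelines for this request:\n{guidelines}\n\n"
                    f"User request:\n{query}\n\nRespond helpfully and safely.")
    return call_llm(responder_model, final_prompt)

print(answer_with_guidelines("How do I pick a lock?"))
```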
[NLP-152] AutoPrep: Natural Language Question-Aware Data Preparation with a Multi-Agent Framework
【速读】: 该论文试图解决表格问答(Tabular Question Answering, TQA)中的问题,特别是如何在自然语言(NL)问题下进行高效且准确的数据准备(data prep)。传统数据准备方法无法满足问答场景中的特定需求,如列增强(column augmentation)、过滤(filtering)和值归一化(value normalization)等。论文提出的解决方案是AUTOPREP,一个基于大型语言模型(LLM)的多智能体框架。其关键在于通过多个专门化的智能体分别处理不同的数据准备任务,确保每个任务都能得到最优处理。AUTOPREP框架包括三个核心组件:规划器(Planner)负责制定逻辑计划,程序员(Programmer)将逻辑计划转化为低级代码,执行器(Executor)则迭代执行和调试代码以确保正确结果。此外,论文还设计了链式思维推理机制(chain-of-thought reasoning mechanism)用于高层次操作建议,以及工具增强方法(tool-augmented method)用于低层次代码生成,从而显著提升了现有TQA解决方案的性能。
链接: https://arxiv.org/abs/2412.10422
作者: Meihao Fan,Ju Fan,Nan Tang,Lei Cao,Xiaoyong Du
机构: Renmin University of China(中国人民大学); HKUST (GZ)(香港科技大学广州校区); University of Arizona/MIT(亚利桑那大学/麻省理工学院)
关键词: Tabular Question Answering, Answering natural language, extract meaningful insights, meaningful insights quickly, Answering natural
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:Answering natural language (NL) questions about tables, which is referred to as Tabular Question Answering (TQA), is important because it enables users to extract meaningful insights quickly and efficiently from structured data, bridging the gap between human language and machine-readable formats. Many of these tables originate from web sources or real-world scenarios, necessitating careful data preparation (or data prep for short) to ensure accurate answers. However, unlike traditional data prep, question-aware data prep introduces new requirements, which include tasks such as column augmentation and filtering for given questions, and question-aware value normalization or conversion. Because each of the above tasks is unique, a single model (or agent) may not perform effectively across all scenarios. In this paper, we propose AUTOPREP, a large language model (LLM)-based multi-agent framework that leverages the strengths of multiple agents, each specialized in a certain type of data prep, ensuring more accurate and contextually relevant responses. Given an NL question over a table, AUTOPREP performs data prep through three key components. Planner: Determines a logical plan, outlining a sequence of high-level operations. Programmer: Translates this logical plan into a physical plan by generating the corresponding low-level code. Executor: Iteratively executes and debugs the generated code to ensure correct outcomes. To support this multi-agent framework, we design a novel chain-of-thought reasoning mechanism for high-level operation suggestion, and a tool-augmented method for low-level code generation. Extensive experiments on real-world TQA datasets demonstrate that AUTOPREP can significantly improve the SOTA TQA solutions through question-aware data prep.
zh
[NLP-153] Personalized and Sequential Text-to-Image Generation WWW
【速读】: 该论文试图解决个性化、交互式文本到图像生成 (Text-to-Image, T2I) 的问题,特别是通过一系列提示扩展来迭代改进生成图像的用户体验。解决方案的关键在于设计了一个强化学习 (Reinforcement Learning, RL) 代理,称为个性化与序列化文本到图像代理 (Personalized And Sequential Text-to-image Agent, PASTA),它利用人类评价者创建的序列偏好数据集,结合大规模开源非序列数据集,通过期望最大化 (Expectation-Maximization, EM) 策略构建用户偏好和选择模型,并识别不同类型的用户偏好。PASTA 利用多模态语言模型 (Large Multimodal Language Model, LMM) 和基于价值的 RL 方法,为用户提供个性化且多样化的提示扩展,从而增强 T2I 模型的多轮交互能力,促进协作式共创,并解决用户意图中的不确定性和未明确指定的问题。
链接: https://arxiv.org/abs/2412.10419
作者: Ofir Nabati,Guy Tennenholtz,ChihWei Hsu,Moonkyung Ryu,Deepak Ramachandran,Yinlam Chow,Xiang Li,Craig Boutilier
机构: Google Research(谷歌研究); Google DeepMind(谷歌深度思维)
关键词: designing a reinforcement, reinforcement learning, address the problem, iteratively improves, improves a set
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Systems and Control (eess.SY)
备注: Link to PASTA dataset: this https URL
点击查看摘要
Abstract:We address the problem of personalized, interactive text-to-image (T2I) generation, designing a reinforcement learning (RL) agent which iteratively improves a set of generated images for a user through a sequence of prompt expansions. Using human raters, we create a novel dataset of sequential preferences, which we leverage, together with large-scale open-source (non-sequential) datasets. We construct user-preference and user-choice models using an EM strategy and identify varying user preference types. We then leverage a large multimodal language model (LMM) and a value-based RL approach to suggest a personalized and diverse slate of prompt expansions to the user. Our Personalized And Sequential Text-to-image Agent (PASTA) extends T2I models with personalized multi-turn capabilities, fostering collaborative co-creation and addressing uncertainty or underspecification in a user’s intent. We evaluate PASTA using human raters, showing significant improvement compared to baseline methods. We also release our sequential rater dataset and simulated user-rater interactions to support future research in personalized, multi-turn T2I generation.
zh
[NLP-154] Constrained Decoding with Speculative Lookaheads
【速读】: 该论文试图解决约束解码方法中存在的效率与性能之间的权衡问题。具体来说,现有的约束解码方法(如CDLH)虽然能够有效对齐生成式语言模型(LLM)的输出与人类偏好,但其基于前瞻性启发式的解码过程计算成本过高,导致实际应用中难以推广。论文提出的解决方案是约束解码与推测性前瞻(CDSL),其关键在于利用一个较小的草稿模型生成候选序列,并通过较大的目标模型和任务特定的奖励函数进行验证,从而在保持高性能的同时显著提升解码效率。
链接: https://arxiv.org/abs/2412.10418
作者: Nishanth Nakshatri,Shamik Roy,Rajarshi Das,Suthee Chaidaroon,Leonid Boytsov,Rashmi Gangadharaiah
机构: Purdue University(普渡大学); AWS AI Labs(AWS AI实验室)
关键词: highly effective method, human preferences, aligning LLM generations, highly effective, effective method
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Under submission
点击查看摘要
Abstract:Constrained decoding with lookahead heuristics (CDLH) is a highly effective method for aligning LLM generations to human preferences. However, the extensive lookahead roll-out operations for each generated token makes CDLH prohibitively expensive, resulting in low adoption in practice. In contrast, common decoding strategies such as greedy decoding are extremely efficient, but achieve very low constraint satisfaction. We propose constrained decoding with speculative lookaheads (CDSL), a technique that significantly improves upon the inference efficiency of CDLH without experiencing the drastic performance reduction seen with greedy decoding. CDSL is motivated by the recently proposed idea of speculative decoding that uses a much smaller draft LLM for generation and a larger target LLM for verification. In CDSL, the draft model is used to generate lookaheads which is verified by a combination of target LLM and task-specific reward functions. This process accelerates decoding by reducing the computational burden while maintaining strong performance. We evaluate CDSL in two constraint decoding tasks with three LLM families and achieve 2.2x to 12.15x speedup over CDLH without significant performance reduction.
zh
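【补充示例】下面用一个简化示例(假设性实现:GPT-2/DistilGPT-2 仅作为目标模型与草稿模型的替身,constraint_reward 是玩具化的关键词奖励)演示"草稿模型前瞻 + 奖励与目标模型共同验证"的基本思路;论文中的验证与接受机制比此处更完整,加速效果也依赖于具体实现。
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
draft = AutoModelForCausalLM.from_pretrained("distilgpt2").eval()   # small draft LM
target = AutoModelForCausalLM.from_pretrained("gpt2").eval()        # larger target LM

def constraint_reward(text: str) -> float:
    """Toy stand-in for a task-specific constraint reward."""
    return float("robot" in text.lower())

@torch.no_grad()
def decode_with_lookaheads(prompt, steps=20, n_candidates=4, lookahead=8):
    ids = tok(prompt, return_tensors="pt").input_ids
    for _ in range(steps):
        logits = target(ids).logits[0, -1]
        log_probs = torch.log_softmax(logits, dim=-1)
        candidates = torch.topk(logits, n_candidates).indices
        best_tok, best_score = None, -float("inf")
        for c in candidates:
            cand_ids = torch.cat([ids, c.view(1, 1)], dim=-1)
            # Cheap lookahead roll-out with the draft model.
            rollout = draft.generate(cand_ids, max_new_tokens=lookahead,
                                     do_sample=False,
                                     pad_token_id=tok.eos_token_id)
            score = log_probs[c].item() + constraint_reward(tok.decode(rollout[0]))
            if score > best_score:
                best_tok, best_score = c, score
        ids = torch.cat([ids, best_tok.view(1, 1)], dim=-1)
    return tok.decode(ids[0], skip_special_tokens=True)

print(decode_with_lookaheads("A story about a helpful"))
```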
[NLP-155] Leveraging Audio and Text Modalities in Mental Health: A Study of LLM s Performance
【速读】: 该论文试图解决心理健康障碍的早期诊断和干预问题,特别是通过文本和音频模态检测抑郁症和创伤后应激障碍(Post Traumatic Stress Disorder, PTSD)。解决方案的关键在于利用大语言模型(Large Language Models, LLMs)进行多模态心理健康诊断,并通过结合文本和音频模态来提升诊断准确性。研究通过对比单一模态(文本和音频)与多模态的性能,发现多模态集成显著提高了诊断效果,尤其是在使用Gemini 1.5 Pro模型时,其二分类抑郁症检测的F1分数达到0.67,平衡准确率(Balanced Accuracy, BA)为77.4%,相较于单一模态有显著提升。此外,所有结果均在零样本推理(zero-shot inferring)条件下获得,表明模型无需特定任务微调即可表现出色。
链接: https://arxiv.org/abs/2412.10417
作者: Abdelrahman A. Ali,Aya E. Fouda,Radwa J. Hanafy,Mohammed E. Fouda
机构: Compumacy for Artificial Intelligence solutions, Cairo, Egypt; Department of Behavioural Health- Saint Elizabeths Hospital, Washington DC, 20032
关键词: Mental health disorders, Traumatic Stress Disorder, increasingly prevalent worldwide, support early diagnosis, Post Traumatic Stress
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Sound (cs.SD); Audio and Speech Processing (eess.AS)
备注:
点击查看摘要
Abstract:Mental health disorders are increasingly prevalent worldwide, creating an urgent need for innovative tools to support early diagnosis and intervention. This study explores the potential of Large Language Models (LLMs) in multimodal mental health diagnostics, specifically for detecting depression and Post Traumatic Stress Disorder through text and audio modalities. Using the E-DAIC dataset, we compare text and audio modalities to investigate whether LLMs can perform equally well or better with audio inputs. We further examine the integration of both modalities to determine if this can enhance diagnostic accuracy, which generally results in improved performance metrics. Our analysis specifically utilizes custom-formulated metrics; Modal Superiority Score and Disagreement Resolvement Score to evaluate how combined modalities influence model performance. The Gemini 1.5 Pro model achieves the highest scores in binary depression classification when using the combined modality, with an F1 score of 0.67 and a Balanced Accuracy (BA) of 77.4%, assessed across the full dataset. These results represent an increase of 3.1% over its performance with the text modality and 2.7% over the audio modality, highlighting the effectiveness of integrating modalities to enhance diagnostic accuracy. Notably, all results are obtained in zero-shot inferring, highlighting the robustness of the models without requiring task-specific fine-tuning. To explore the impact of different configurations on model performance, we conduct binary, severity, and multiclass tasks using both zero-shot and few-shot prompts, examining the effects of prompt variations on performance. The results reveal that models such as Gemini 1.5 Pro in text and audio modalities, and GPT-4o mini in the text modality, often surpass other models in balanced accuracy and F1 scores across multiple tasks.
zh
[NLP-156] SUPERMERGE: An Approach For Gradient-Based Model Merging
【速读】: 该论文试图解决在部署任务特定模型时,面对新任务需求时需要重新微调模型的高计算成本和时间消耗问题。解决方案的关键是提出了一种基于梯度的模型合并方法,称为SUPERMERGE。该方法能够系统地将多个针对不同任务微调的模型合并,生成一个轻量且快速的合并模型,其性能与完全微调的模型相当。此外,论文还提出了分层模型合并策略,以减少峰值空间需求,同时不牺牲合并模型的性能。实验结果表明,SUPERMERGE在常见的自然语言处理和计算机视觉任务中优于现有的模型合并方法。
链接: https://arxiv.org/abs/2412.10416
作者: Haoyu Yang,Zheng Zhang,Saket Sathe
机构: University of Minnesota(明尼苏达大学); Amazon Web Services(亚马逊网络服务)
关键词: simultaneously support thousands, Large language models, Large language, possess the superpower, superpower to simultaneously
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:Large language models, such as ChatGPT, Claude, or LLaMA, are gigantic, monolithic, and possess the superpower to simultaneously support thousands of tasks. However, high-throughput applications often prefer smaller task-specific models because of their lower latency and cost. One challenge of using task-specific models is the incremental need for solving newer tasks after the model is already deployed for existing tasks. A straightforward solution requires fine-tuning the model again for both existing and new tasks, which is computationally expensive and time-consuming. To address this issue, we propose a model merging based approach called SUPERMERGE. SUPERMERGE is a gradient-based method to systematically merge several fine-tuned models trained on existing and new tasks. SUPERMERGE is designed to be lightweight and fast, and the merged model achieves similar performance to fully fine-tuned models on all tasks. Furthermore, we proposed a hierarchical model merging strategy to reduce the peak space requirement without sacrificing the performance of the merged model. We experimentally demonstrate that SUPERMERGE outperforms existing model merging methods on common natural language processing and computer vision tasks.
zh
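【补充示例】作为参照,下面给出最基础的参数加权平均式合并示意(假设性实现,仅为基线形式):SUPERMERGE 本身通过梯度学习各模型的合并系数并支持分层合并以降低峰值内存,这些关键步骤此处并未实现。
```python
import torch

def merge_state_dicts(state_dicts, weights):
    """Weighted parameter averaging across fine-tuned checkpoints.
    This is the simplest merging baseline; SUPERMERGE instead learns the
    mixing coefficients with gradients, which is not reproduced here."""
    merged = {}
    for key in state_dicts[0]:
        merged[key] = sum(w * sd[key].float() for w, sd in zip(weights, state_dicts))
    return merged

# Toy example with two tiny "models" sharing the same parameter names.
sd_a = {"linear.weight": torch.ones(2, 2), "linear.bias": torch.zeros(2)}
sd_b = {"linear.weight": 3 * torch.ones(2, 2), "linear.bias": torch.ones(2)}
print(merge_state_dicts([sd_a, sd_b], weights=[0.5, 0.5])["linear.weight"])
```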
[NLP-157] Generative Adversarial Reviews: When LLM s Become the Critic
【速读】: 该论文试图解决科学同行评审过程中由于学术产出快速增长和知识领域专业化加剧而导致的传统反馈机制压力问题。解决方案的关键在于引入生成式代理评审员 (Generative Agent Reviewers, GAR),利用大型语言模型 (LLM) 赋能的代理模拟忠实的同行评审员。GAR 的核心架构包括扩展具有记忆能力的大型语言模型,并从历史数据中提取评审员角色,同时采用基于图的论文表示方法,将内容逻辑化组织,连接观点与证据及技术细节。GAR 的评审过程结合外部知识评估论文新颖性,并通过图表示和多轮评估进行详细审查,最终由元评审员汇总个体评审结果以预测接受决策。实验表明,GAR 在提供详细反馈和预测论文结果方面与人类评审员表现相当,并能通过提供早期专家级反馈实现评审过程的民主化和透明化。
链接: https://arxiv.org/abs/2412.10415
作者: Nicolas Bougie,Narimasa Watanabe
机构: Woven by Toyota(Woven by 丰田); Woven by Toyota(Woven by 丰田)
关键词: standards for publication, meet the quality, quality standards, scientific progress, strain traditional scientific
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:The peer review process is fundamental to scientific progress, determining which papers meet the quality standards for publication. Yet, the rapid growth of scholarly production and increasing specialization in knowledge areas strain traditional scientific feedback mechanisms. In light of this, we introduce Generative Agent Reviewers (GAR), leveraging LLM-empowered agents to simulate faithful peer reviewers. To enable generative reviewers, we design an architecture that extends a large language model with memory capabilities and equips agents with reviewer personas derived from historical data. Central to this approach is a graph-based representation of manuscripts, condensing content and logically organizing information - linking ideas with evidence and technical details. GAR’s review process leverages external knowledge to evaluate paper novelty, followed by detailed assessment using the graph representation and multi-round assessment. Finally, a meta-reviewer aggregates individual reviews to predict the acceptance decision. Our experiments demonstrate that GAR performs comparably to human reviewers in providing detailed feedback and predicting paper outcomes. Beyond mere performance comparison, we conduct insightful experiments, such as evaluating the impact of reviewer expertise and examining fairness in reviews. By offering early expert-level feedback, typically restricted to a limited group of researchers, GAR democratizes access to transparent and in-depth evaluation.
zh
[NLP-158] Exploring Complex Mental Health Symptoms via Classifying Social Media Data with Explainable LLMs ML4H ALT NEURIPS2024
【速读】: 该论文试图通过训练大型语言模型(LLMs)在社交媒体文本数据分类任务上,解决复杂疾病(如莱姆病和焦虑症)的预测和解释问题。解决方案的关键在于构建一个流程,包括:1) 训练LLMs进行分类任务;2) 获取分类输出的解释;3) 对解释进行定性和定量分析。通过这一流程,研究初步展示了在预测和解释精神健康问题(如未来可能出现的ADHD症状)方面的成果,并展示了可视化解释结果的初步结果。
链接: https://arxiv.org/abs/2412.10414
作者: Kexin Chen,Noelle Lim,Claire Lee,Michael Guerzhoy
机构: University of Toronto(多伦多大学); LinkedIn Corporation(领英公司); Princeton University(普林斯顿大学); University of Toronto(多伦多大学)
关键词: data classification tasks, challenging social media, social media text, media text data, text data classification
类目: Computation and Language (cs.CL)
备注: Accepted to Machine Learning for Health (ML4H) Findings 2024 (co-located with NeurIPS 2024)
点击查看摘要
Abstract:We propose a pipeline for gaining insights into complex diseases by training LLMs on challenging social media text data classification tasks, obtaining explanations for the classification outputs, and performing qualitative and quantitative analysis on the explanations. We report initial results on predicting, explaining, and systematizing the explanations of predicted reports on mental health concerns in people reporting Lyme disease concerns. We report initial results on predicting future ADHD concerns for people reporting anxiety disorder concerns, and demonstrate preliminary results on visualizing the explanations for predicting that a person with anxiety concerns will in the future have ADHD concerns.
zh
[NLP-159] Evaluating Robustness of LLM s on Crisis-Related Microblogs across Events Information Types and Linguistic Features
【速读】: 该论文试图解决在灾难期间从微博平台(如X,前Twitter)获取的实时信息中筛选出相关信息的问题。传统的有监督机器学习模型在处理这些噪声数据时缺乏泛化能力,而论文提出的解决方案关键在于利用大型语言模型(LLMs),特别是GPT-4和GPT-4o,这些模型在理解和处理自然语言方面表现出更好的泛化能力。然而,尽管LLMs在处理不同灾难和信息类型时表现较好,但在处理洪水相关数据和识别紧急请求等关键信息类别时仍面临挑战。论文还探讨了语言特征对模型性能的影响,并指出LLMs在处理某些特征(如拼写错误)时的脆弱性。最后,通过在零样本和少样本设置下的基准测试,论文发现专有模型在所有任务中均优于开源模型。
链接: https://arxiv.org/abs/2412.10413
作者: Muhammad Imran,Abdul Wahab Ziaullah,Kai Chen,Ferda Ofli
机构: Qatar Computing Research Institute(卡塔尔计算研究研究所); Hamad Bin Khalifa University(哈马德·本·哈利法大学); OpenAI
关键词: response authorities, governments and response, microblogging platforms, real-time information, Twitter
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Social and Information Networks (cs.SI)
备注: 12 pages, 10 figs, 5 tables
点击查看摘要
Abstract:The widespread use of microblogging platforms like X (formerly Twitter) during disasters provides real-time information to governments and response authorities. However, the data from these platforms is often noisy, requiring automated methods to filter relevant information. Traditionally, supervised machine learning models have been used, but they lack generalizability. In contrast, Large Language Models (LLMs) show better capabilities in understanding and processing natural language out of the box. This paper provides a detailed analysis of the performance of six well-known LLMs in processing disaster-related social media data from a large-set of real-world events. Our findings indicate that while LLMs, particularly GPT-4o and GPT-4, offer better generalizability across different disasters and information types, most LLMs face challenges in processing flood-related data, show minimal improvement despite the provision of examples (i.e., shots), and struggle to identify critical information categories like urgent requests and needs. Additionally, we examine how various linguistic features affect model performance and highlight LLMs’ vulnerabilities against certain features like typos. Lastly, we provide benchmarking results for all events across both zero- and few-shot settings and observe that proprietary models outperform open-source ones in all tasks.
zh
[NLP-160] Reinforcement Learning Enhanced LLMs: A Survey
【速读】: 该论文旨在系统性地回顾和分析通过强化学习(Reinforcement Learning, RL)增强大型语言模型(LLMs)的研究现状,帮助研究人员理解当前的挑战和进展。解决方案的关键在于利用基于奖励反馈的RL技术,使LLMs能够根据输出质量的反馈进行自我改进,从而生成更准确、连贯且符合上下文的响应。论文详细介绍了RL的基本原理、流行的RL增强LLMs、基于奖励模型的两种广泛使用的RL技术(RLHF和RLAIF),以及直接偏好优化(Direct Preference Optimization, DPO)方法,后者通过直接使用人类偏好数据来对齐LLM输出与人类期望。此外,论文还指出了现有方法的挑战和不足,并提出了进一步改进的方向。
链接: https://arxiv.org/abs/2412.10400
作者: Shuhe Wang,Shengyu Zhang,Jie Zhang,Runyi Hu,Xiaoya Li,Tianwei Zhang,Jiwei Li,Fei Wu,Guoyin Wang,Eduard Hovy
机构: 未知
关键词: enhancing large language, paper surveys research, reinforcement learning, large language models, rapidly growing research
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
点击查看摘要
Abstract:This paper surveys research in the rapidly growing field of enhancing large language models (LLMs) with reinforcement learning (RL), a technique that enables LLMs to improve their performance by receiving feedback in the form of rewards based on the quality of their outputs, allowing them to generate more accurate, coherent, and contextually appropriate responses. In this work, we make a systematic review of the most up-to-date state of knowledge on RL-enhanced LLMs, attempting to consolidate and analyze the rapidly growing research in this field, helping researchers understand the current challenges and advancements. Specifically, we (1) detail the basics of RL; (2) introduce popular RL-enhanced LLMs; (3) review researches on two widely-used reward model-based RL techniques: Reinforcement Learning from Human Feedback (RLHF) and Reinforcement Learning from AI Feedback (RLAIF); and (4) explore Direct Preference Optimization (DPO), a set of methods that bypass the reward model to directly use human preference data for aligning LLM outputs with human expectations. We will also point out current challenges and deficiencies of existing methods and suggest some avenues for further improvements.
zh
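【补充示例】以摘要中提到的 DPO 为例,它绕过奖励模型,直接在偏好对上优化策略;下面是 DPO 损失的一个最小实现示意(假设输入为逐 token 求和后的序列对数概率,数值为演示用)。
```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Direct Preference Optimization loss over a batch of preference pairs.
    Each argument is a tensor of per-sequence log-probabilities log pi(y|x),
    summed over tokens, for the chosen (y_w) and rejected (y_l) responses."""
    chosen_logratio = policy_chosen_logps - ref_chosen_logps
    rejected_logratio = policy_rejected_logps - ref_rejected_logps
    # Maximize the margin between chosen and rejected responses, scaled by beta.
    return -F.logsigmoid(beta * (chosen_logratio - rejected_logratio)).mean()

# Toy usage with made-up log-probabilities.
loss = dpo_loss(torch.tensor([-12.3]), torch.tensor([-15.8]),
                torch.tensor([-13.0]), torch.tensor([-14.9]))
print(loss.item())
```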
[NLP-161] AI-assisted summary of suicide risk Formulation
【速读】: 该论文试图解决在电子健康记录(EHR)中手动审计临床文档以评估自杀风险时面临的挑战,特别是由于临床语言的多样性和非结构化数据的复杂性导致的资源密集型问题。解决方案的关键在于开发先进的自然语言处理(NLP)算法,通过光学字符识别(OCR)技术处理非结构化数据,并利用语义匹配技术对自由文本进行统一和分析,从而自动提取与自杀风险评估相关的信息,并通过加权评分获得置信度水平。
链接: https://arxiv.org/abs/2412.10388
作者: Rajib Rana,Niall Higgins,Kazi N. Haque,John Reilly,Kylie Burke,Kathryn Turner,Anthony R. Pisani,Terry Stedman
机构: 1. University of Southern Queensland (南昆士兰大学); 2. University of Queensland (昆士兰大学); 3. Queensland University of Technology (昆士兰科技大学); 4. Mater Research Institute-University of Queensland (玛特研究所-昆士兰大学); 5. Mater Hospital (玛特医院); 6. University of Queensland Diamantina Institute (昆士兰大学戴安娜研究所); 7. University of Hawaii Cancer Center (夏威夷大学癌症中心)
关键词: suicide risk assessment, risk assessment, individual problems, seeks to understand, understand the idiosyncratic
类目: Computation and Language (cs.CL); Computers and Society (cs.CY)
备注:
点击查看摘要
Abstract:Background: Formulation, associated with suicide risk assessment, is an individualised process that seeks to understand the idiosyncratic nature and development of an individual’s problems. Auditing clinical documentation on an electronic health record (EHR) is challenging as it requires resource-intensive manual efforts to identify keywords in relevant sections of specific forms. Furthermore, clinicians and healthcare professionals often do not use keywords; their clinical language can vary greatly and may contain various jargon and acronyms. Also, the relevant information may be recorded elsewhere. This study describes how we developed advanced Natural Language Processing (NLP) algorithms, a branch of Artificial Intelligence (AI), to analyse EHR data automatically. Method: Advanced Optical Character Recognition techniques were used to process unstructured data sets, such as portable document format (pdf) files. Free text data was cleaned and pre-processed using Normalisation of Free Text techniques. We developed algorithms and tools to unify the free text. Finally, the formulation was checked for the presence of each concept based on similarity using NLP-powered semantic matching techniques. Results: We extracted information indicative of formulation and assessed it to cover the relevant concepts. This was achieved using a Weighted Score to obtain a Confidence Level. Conclusion: The rigour to which formulation is completed is crucial to effectively using EHRs, ensuring correct and timely identification, engagement and interventions that may potentially avoid many suicide attempts and suicides.
zh
[NLP-162] Fully Open Source Moxin-7B Technical Report
【速读】: 该论文试图解决开源大型语言模型(LLMs)在商业化过程中面临的透明度、可重复性和安全性问题。解决方案的关键在于引入Moxin 7B,这是一个完全开源的LLM,遵循模型开放性框架(Model Openness Framework, MOF),通过全面公开预训练代码、配置、训练和微调数据集以及中间和最终检查点,达到了MOF分类系统中最高的“开放科学”级别。这种方法确保了模型的完整性和开放性,促进了创新和研究的进一步发展。
链接: https://arxiv.org/abs/2412.06845
作者: Pu Zhao,Xuan Shen,Zhenglun Kong,Yixin Shen,Sung-En Chang,Timothy Rupprecht,Lei Lu,Enfu Nan,Changdi Yang,Yumei He,Xingchen Xu,Yu Huang,Wei Wang,Yue Chen,Yong He,Yanzhi Wang
机构: Northeastern University(东北大学); Harvard University(哈佛大学); Cornell University(康奈尔大学); Tulane University(杜兰大学); University of Washington(华盛顿大学); Roboraction.ai; Futurewei Technologies; AIBAO LLC
关键词: Large Language Models, Large Language, Language Models, significant transformation, open-source LLMs
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
点击查看摘要
Abstract:Recently, Large Language Models (LLMs) have undergone a significant transformation, marked by a rapid rise in both their popularity and capabilities. Leading this evolution are proprietary LLMs like GPT-4 and GPT-o1, which have captured widespread attention in the AI community due to their remarkable performance and versatility. Simultaneously, open-source LLMs, such as LLaMA and Mistral, have made great contributions to the ever-increasing popularity of LLMs due to the ease to customize and deploy the models across diverse applications. Although open-source LLMs present unprecedented opportunities for innovation and research, the commercialization of LLMs has raised concerns about transparency, reproducibility, and safety. Many open-source LLMs fail to meet fundamental transparency requirements by withholding essential components like training code and data, and some use restrictive licenses whilst claiming to be “open-source,” which may hinder further innovations on LLMs. To mitigate this issue, we introduce Moxin 7B, a fully open-source LLM developed in accordance with the Model Openness Framework (MOF), a ranked classification system that evaluates AI models based on model completeness and openness, adhering to principles of open science, open source, open data, and open access. Our model achieves the highest MOF classification level of “open science” through the comprehensive release of pre-training code and configurations, training and fine-tuning datasets, and intermediate and final checkpoints. Experiments show that our model achieves superior performance in zero-shot evaluation compared with popular 7B models and performs competitively in few-shot evaluation.
zh
[NLP-163] SpeechPrune: Context-aware Token Pruning for Speech Information Retrieval
【速读】: 该论文试图解决语音大语言模型 (Speech LLMs) 在处理长音频序列时面临的计算和表征需求问题。解决方案的关键是提出了一种无需训练的标记修剪策略,称为SpeechPrune。该策略通过利用语音与文本的相似性以及近似的注意力分数,有效地丢弃无关的标记,从而在保持模型性能的同时显著减少计算负担。在SPIRAL基准测试中,SpeechPrune在20%的修剪率下分别比原始模型和随机修剪模型提高了29%和47%的准确率,甚至在80%的修剪率下仍能维持网络性能,展示了标记级修剪在高效和可扩展的长音频理解中的潜力。
链接: https://arxiv.org/abs/2412.12009
作者: Yueqian Lin,Yuzhe Fu,Jingyang Zhang,Yudong Liu,Jianyi Zhang,Jingwei Sun,Hai “Helen” Li,Yiran Chen
机构: Duke University (杜克大学)
关键词: Speech Information Retrieval, Speech Large Language, Large Language Models, introduce Speech Information, Information Retrieval
类目: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Sound (cs.SD)
备注: Project page and dataset is available at this https URL
点击查看摘要
Abstract:We introduce Speech Information Retrieval (SIR), a new long-context task for Speech Large Language Models (Speech LLMs), and present SPIRAL, a 1,012-sample benchmark testing models’ ability to extract critical details from approximately 90-second spoken inputs. While current Speech LLMs excel at short-form tasks, they struggle with the computational and representational demands of longer audio sequences. To address this limitation, we propose SpeechPrune, a training-free token pruning strategy that uses speech-text similarity and approximated attention scores to efficiently discard irrelevant tokens. In SPIRAL, SpeechPrune achieves accuracy improvements of 29% and up to 47% over the original model and the random pruning model at a pruning rate of 20%, respectively. SpeechPrune can maintain network performance even at a pruning level of 80%. This approach highlights the potential of token-level pruning for efficient and scalable long-form speech understanding.
zh
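【补充示例】下面是一个免训练语音 token 剪枝的示意(假设性实现,打分与组合方式为简化版):对每个语音 token,取其与文本 token 的最大余弦相似度,再与近似注意力分数加权组合,只保留得分最高的一部分 token;SpeechPrune 的具体打分与融合细节以论文为准。
```python
import torch

def prune_speech_tokens(speech_feats, text_feats, attn_scores,
                        prune_rate=0.2, alpha=0.5):
    """Keep the (1 - prune_rate) fraction of speech tokens with the highest
    combined relevance score.

    speech_feats: (S, d) speech token embeddings
    text_feats:   (T, d) prompt/text token embeddings
    attn_scores:  (S,)   approximated attention mass on each speech token
    """
    speech_n = torch.nn.functional.normalize(speech_feats, dim=-1)
    text_n = torch.nn.functional.normalize(text_feats, dim=-1)
    # Best cosine similarity of each speech token to any text token.
    sim = (speech_n @ text_n.T).max(dim=-1).values              # (S,)
    score = alpha * sim + (1 - alpha) * attn_scores
    keep = int(round(speech_feats.shape[0] * (1 - prune_rate)))
    kept_idx = torch.topk(score, keep).indices.sort().values    # keep temporal order
    return speech_feats[kept_idx], kept_idx

# Toy example with random features and an 80% pruning rate.
S, T, d = 100, 16, 64
kept, idx = prune_speech_tokens(torch.randn(S, d), torch.randn(T, d),
                                torch.rand(S), prune_rate=0.8)
print(kept.shape)  # roughly 20 speech tokens survive
```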
[NLP-164] Transliterated Zero-Shot Domain Adaptation for Automatic Speech Recognition
【速读】: 该论文试图解决自动语音识别模型在未见过的领域数据上性能下降的问题,特别是在目标语言中缺乏目标领域数据的情况下。解决方案的关键在于提出了一种零样本领域适应 (Zero-shot Domain Adaptation, ZSDA) 方法,通过从源语言中获取目标领域知识,并利用跨语言预训练 (Cross-lingual Pre-training, XLPT) 和目标语言微调来构建最终模型。为了避免预训练知识在微调过程中被遗忘,论文提出了音译的 ZSDA,通过保持预训练和微调标签的一致性,最大限度地保留预训练知识。实验结果表明,这种方法相比wav2vec 2.0基线降低了9.2%的词错误率,并且在自监督和有监督的ZSDA方法中表现优异。
链接: https://arxiv.org/abs/2412.11185
作者: Han Zhu,Gaofeng Cheng,Qingwei Zhao,Pengyuan Zhang
机构: Institute of Acoustics, Chinese Academy of Sciences (中国科学院声学研究所)
关键词: automatic speech recognition, target domain data, target domain, speech recognition models, Domain
类目: Audio and Speech Processing (eess.AS); Computation and Language (cs.CL); Sound (cs.SD)
备注:
点击查看摘要
Abstract:The performance of automatic speech recognition models often degenerates on domains not covered by the training data. Domain adaptation can address this issue, assuming the availability of the target domain data in the target language. However, such assumption does not stand in many real-world applications. To make domain adaptation more applicable, we address the problem of zero-shot domain adaptation (ZSDA), where target domain data is unavailable in the target language. Instead, we transfer the target domain knowledge from another source language where the target domain data is more accessible. To do that, we first perform cross-lingual pre-training (XLPT) to share domain knowledge across languages, then use target language fine-tuning to build the final model. One challenge in this practice is that the pre-trained knowledge can be forgotten during fine-tuning, resulting in sub-optimal adaptation performance. To address this issue, we propose transliterated ZSDA to achieve consistent pre-training and fine-tuning labels, leading to maximum preservation of the pre-trained knowledge. Experimental results show that transliterated ZSDA relatively decreases the word error rate by 9.2% compared with a wav2vec 2.0 baseline. Moreover, transliterated ZSDA consistently outperforms self-supervised ZSDA and performs on par with supervised ZSDA, proving the superiority of transliteration-based pre-training labels.
zh
[NLP-165] Observing Micromotives and Macrobehavior of Large Language Models
【速读】: 该论文试图解决的问题是:尽管研究者们致力于消除大型语言模型(LLMs)中的偏好或偏见(micromotives),但这些模型对社会宏观行为(macrobehavior)的影响尚未得到系统性研究。论文的关键解决方案是通过Schelling的隔离模型设计,观察LLMs的微观动机与社会宏观行为之间的关系。研究结果表明,无论LLMs的偏见程度如何,随着更多人遵循LLMs的建议,社会将出现高度隔离现象。这一发现促使研究者重新评估消除LLMs偏见的假设,并鼓励进一步探讨LLMs对用户和社会的潜在影响。
链接: https://arxiv.org/abs/2412.10428
作者: Yuyang Cheng,Xingwei Qu,Tomas Goldsack,Chenghua Lin,Chung-Chi Chen
机构: University of Manchester(曼彻斯特大学); University of Sheffield(谢菲尔德大学); Artificial Intelligence Research Center, AIST(人工智能研究中心, 日本产业技术综合研究所)
关键词: Nobel Memorial Prize, Nobel Memorial, Economic Sciences, Memorial Prize, Prize in Economic
类目: Physics and Society (physics.soc-ph); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:
点击查看摘要
Abstract:Thomas C. Schelling, awarded the 2005 Nobel Memorial Prize in Economic Sciences, pointed out that “individuals’ decisions (micromotives), while often personal and localized, can lead to societal outcomes (macrobehavior) that are far more complex and different from what the individuals intended.” The current research related to large language models’ (LLMs’) micromotives, such as preferences or biases, assumes that users will make more appropriate decisions once LLMs are devoid of preferences or biases. Consequently, a series of studies has focused on removing bias from LLMs. In the NLP community, while there are many discussions on LLMs’ micromotives, previous studies have seldom conducted a systematic examination of how LLMs may influence society’s macrobehavior. In this paper, we follow the design of Schelling’s model of segregation to observe the relationship between the micromotives and macrobehavior of LLMs. Our results indicate that, regardless of the level of bias in LLMs, a highly segregated society will emerge as more people follow LLMs’ suggestions. We hope our discussion will spark further consideration of the fundamental assumption regarding the mitigation of LLMs’ micromotives and encourage a reevaluation of how LLMs may influence users and society.
zh
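【补充示例】论文沿用 Schelling 隔离模型的设计来联系微观动机与宏观行为;下面是该经典模型的一个最小实现示意(假设性补充):threshold 扮演个体"微观动机"的角色,论文中把个体决策替换为遵循 LLM 的建议,这一耦合部分此处未实现。
```python
import random

def run_schelling(size=30, vacancy=0.1, threshold=0.5, steps=50):
    """Minimal Schelling segregation model with two agent types on a torus grid.
    `threshold` is each agent's micromotive: the minimum fraction of same-type
    neighbours it demands before it is willing to stay put."""
    n_vacant = int(size * size * vacancy)
    n_agents = size * size - n_vacant
    cells = [None] * n_vacant + [i % 2 for i in range(n_agents)]
    random.shuffle(cells)
    grid = [cells[r * size:(r + 1) * size] for r in range(size)]

    def unhappy(r, c):
        same = other = 0
        for dr in (-1, 0, 1):
            for dc in (-1, 0, 1):
                if dr == dc == 0:
                    continue
                n = grid[(r + dr) % size][(c + dc) % size]
                if n is None:
                    continue
                if n == grid[r][c]:
                    same += 1
                else:
                    other += 1
        return (same + other) > 0 and same / (same + other) < threshold

    for _ in range(steps):
        movers = [(r, c) for r in range(size) for c in range(size)
                  if grid[r][c] is not None and unhappy(r, c)]
        empties = [(r, c) for r in range(size) for c in range(size)
                   if grid[r][c] is None]
        random.shuffle(movers)
        for r, c in movers:          # unhappy agents relocate to random vacancies
            if not empties:
                break
            er, ec = empties.pop(random.randrange(len(empties)))
            grid[er][ec], grid[r][c] = grid[r][c], None
            empties.append((r, c))

    satisfied = sum(1 for r in range(size) for c in range(size)
                    if grid[r][c] is not None and not unhappy(r, c))
    print(f"satisfied agents after {steps} rounds: {satisfied}/{n_agents}")

run_schelling()
```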
计算机视觉
[CV-0] PanSplat: 4K Panorama Synthesis with Feed-Forward Gaussian Splatting
【速读】: 该论文试图解决在虚拟现实(VR)、虚拟导览、机器人和自动驾驶等应用中,全景视图合成面临的高分辨率、快速推理和内存效率问题。现有方法通常受限于较低的分辨率(512 × 1024),主要由于内存和计算需求的限制。论文提出的解决方案是PanSplat,一种可泛化的前馈方法,能够支持高达4K(2048 × 4096)的分辨率。其关键在于采用定制的球面3D高斯金字塔(spherical 3D Gaussian pyramid)与Fibonacci晶格排列,以提高图像质量并减少信息冗余。此外,论文提出了一种集成层次化球面代价体积(hierarchical spherical cost volume)和高斯头部的管道,结合局部操作,实现了两步延迟反向传播,从而在单个A100 GPU上实现内存高效的训练。实验结果表明,PanSplat在合成和真实世界数据集上均达到了最先进的效率和图像质量。
链接: https://arxiv.org/abs/2412.12096
作者: Cheng Zhang,Haofei Xu,Qianyi Wu,Camilo Cruz Gambardella,Dinh Phung,Jianfei Cai
机构: 未知
关键词: gained significant attention, virtual reality, virtual tours, advent of portable, autonomous driving
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project Page: this https URL Code: this https URL
点击查看摘要
Abstract:With the advent of portable 360° cameras, panorama has gained significant attention in applications like virtual reality (VR), virtual tours, robotics, and autonomous driving. As a result, wide-baseline panorama view synthesis has emerged as a vital task, where high resolution, fast inference, and memory efficiency are essential. Nevertheless, existing methods are typically constrained to lower resolutions (512 × 1024) due to demanding memory and computational requirements. In this paper, we present PanSplat, a generalizable, feed-forward approach that efficiently supports resolution up to 4K (2048 × 4096). Our approach features a tailored spherical 3D Gaussian pyramid with a Fibonacci lattice arrangement, enhancing image quality while reducing information redundancy. To accommodate the demands of high resolution, we propose a pipeline that integrates a hierarchical spherical cost volume and Gaussian heads with local operations, enabling two-step deferred backpropagation for memory-efficient training on a single A100 GPU. Experiments demonstrate that PanSplat achieves state-of-the-art results with superior efficiency and image quality across both synthetic and real-world datasets. Code will be available at this https URL.
zh
[CV-1] Causal Diffusion Transformers for Generative Modeling
【速读】: 该论文试图解决将扩散模型(Diffusion models)与自回归模型(autoregressive models)结合的问题,特别是如何在这两种生成模式之间实现平滑过渡,并提升模型在多模态数据上的表现。解决方案的关键在于提出了Causal Diffusion框架,并通过CausalFusion模型实现了这一目标。CausalFusion是一个仅包含解码器的Transformer模型,它通过双因子分解(dual-factorization)同时处理序列化令牌(sequential tokens)和扩散噪声水平(diffusion noise levels),从而在ImageNet生成基准上取得了最先进的结果,并保留了自回归模型生成任意数量令牌进行上下文推理的优势。此外,CausalFusion还展示了其在多模态任务中的能力,如联合图像生成与描述模型,以及零样本上下文图像操作。
链接: https://arxiv.org/abs/2412.12095
作者: Chaorui Deng,Deyao Zh,Kunchang Li,Shi Guan,Haoqi Fan
机构: ByteDance(字节跳动)
关键词: introduce Causal Diffusion, introduce Causal, Causal Diffusion, Diffusion, Causal
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 21 pages, 22 figures
点击查看摘要
Abstract:We introduce Causal Diffusion as the autoregressive (AR) counterpart of Diffusion models. It is a next-token(s) forecasting framework that is friendly to both discrete and continuous modalities and compatible with existing next-token prediction models like LLaMA and GPT. While recent works attempt to combine diffusion with AR models, we show that introducing sequential factorization to a diffusion model can substantially improve its performance and enables a smooth transition between AR and diffusion generation modes. Hence, we propose CausalFusion - a decoder-only transformer that dual-factorizes data across sequential tokens and diffusion noise levels, leading to state-of-the-art results on the ImageNet generation benchmark while also enjoying the AR advantage of generating an arbitrary number of tokens for in-context reasoning. We further demonstrate CausalFusion’s multimodal capabilities through a joint image generation and captioning model, and showcase CausalFusion’s ability for zero-shot in-context image manipulations. We hope that this work could provide the community with a fresh perspective on training multimodal models over discrete and continuous data.
zh
[CV-2] CAP4D: Creating Animatable 4D Portrait Avatars with Morphable Multi-View Diffusion Models
【速读】: 该论文试图解决从任意数量的参考图像(1到100张)中重建高保真、动态的4D(动态3D)肖像化身的问题,以满足广告、视觉特效和虚拟现实等应用的需求。解决方案的关键在于提出了一种名为CAP4D的方法,该方法采用可变形的多视角扩散模型(morphable multi-view diffusion model),能够在不同数量的参考图像下实现高质量的化身重建,并支持实时动画和渲染。CAP4D在单图像、少量图像和多图像的4D肖像化身重建中展示了最先进的性能,并致力于缩小单图像和多视角重建技术在视觉保真度上的差距。
链接: https://arxiv.org/abs/2412.12093
作者: Felix Taubner,Ruihang Zhang,Mathieu Tuli,David B. Lindell
机构: University of Toronto(多伦多大学); Vector Institute(向量研究所); LG Electronics(LG电子)
关键词: applications including advertising, Reconstructing photorealistic, including advertising, virtual reality, avatar reconstruction
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 23 pages, 15 figures
点击查看摘要
Abstract:Reconstructing photorealistic and dynamic portrait avatars from images is essential to many applications including advertising, visual effects, and virtual reality. Depending on the application, avatar reconstruction involves different capture setups and constraints - for example, visual effects studios use camera arrays to capture hundreds of reference images, while content creators may seek to animate a single portrait image downloaded from the internet. As such, there is a large and heterogeneous ecosystem of methods for avatar reconstruction. Techniques based on multi-view stereo or neural rendering achieve the highest quality results, but require hundreds of reference images. Recent generative models produce convincing avatars from a single reference image, but visual fidelity yet lags behind multi-view techniques. Here, we present CAP4D: an approach that uses a morphable multi-view diffusion model to reconstruct photoreal 4D (dynamic 3D) portrait avatars from any number of reference images (i.e., one to 100) and animate and render them in real time. Our approach demonstrates state-of-the-art performance for single-, few-, and multi-image 4D portrait avatar reconstruction, and takes steps to bridge the gap in visual fidelity between single-image and multi-view reconstruction techniques.
zh
[CV-3] Wonderland: Navigating 3D Scenes from a Single Image
【速读】: 该论文试图解决从单张任意图像高效生成高质量、广范围的3D场景的问题。解决方案的关键在于提出了一种新颖的管道,利用视频扩散模型(video diffusion model)的潜在空间(latent space)来预测3D场景的3D高斯喷射(3D Gaussian Splattings)。具体来说,视频扩散模型能够精确地生成遵循指定相机轨迹的视频,并压缩包含多视角信息的视频潜在表示,同时保持3D一致性。通过在视频潜在空间上进行渐进式训练策略,该方法能够高效生成高质量、广范围且通用的3D场景。实验结果表明,该模型在单视图3D场景生成任务中显著优于现有方法,尤其是在处理域外图像时。
链接: https://arxiv.org/abs/2412.12091
作者: Hanwen Liang,Junli Cao,Vidit Goel,Guocheng Qian,Sergei Korolev,Demetri Terzopoulos,Konstantinos N. Plataniotis,Sergey Tulyakov,Jian Ren
机构: University of Toronto(多伦多大学); Snap Inc.(Snap公司); University of California, Los Angeles(加州大学洛杉矶分校)
关键词: single arbitrary image, challenging question, paper addresses, addresses a challenging, single arbitrary
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project page: this https URL
点击查看摘要
Abstract:This paper addresses a challenging question: How can we efficiently create high-quality, wide-scope 3D scenes from a single arbitrary image? Existing methods face several constraints, such as requiring multi-view data, time-consuming per-scene optimization, low visual quality in backgrounds, and distorted reconstructions in unseen areas. We propose a novel pipeline to overcome these limitations. Specifically, we introduce a large-scale reconstruction model that uses latents from a video diffusion model to predict 3D Gaussian Splattings for the scenes in a feed-forward manner. The video diffusion model is designed to create videos precisely following specified camera trajectories, allowing it to generate compressed video latents that contain multi-view information while maintaining 3D consistency. We train the 3D reconstruction model to operate on the video latent space with a progressive training strategy, enabling the efficient generation of high-quality, wide-scope, and generic 3D scenes. Extensive evaluations across various datasets demonstrate that our model significantly outperforms existing methods for single-view 3D scene generation, particularly with out-of-domain images. For the first time, we demonstrate that a 3D reconstruction model can be effectively built upon the latent space of a diffusion model to realize efficient 3D scene generation.
zh
[CV-4] Stabilizing Reinforcement Learning in Differentiable Multiphysics Simulation
【速读】: 该论文试图解决在涉及软体(deformables)的机器人任务中,由于软体仿真速度较慢导致强化学习(Reinforcement Learning, RL)样本复杂度要求过高的问题。解决方案的关键在于提出了两种创新技术:一是Soft Analytic Policy Optimization (SAPO)算法,这是一种基于最大熵的一阶模型化actor-critic RL算法,利用可微分仿真的解析梯度来训练随机策略以最大化期望回报和熵;二是Rewarped仿真平台,这是一个支持多种材料(包括非刚体)的并行可微分多物理仿真平台。通过这两者的结合,论文展示了SAPO在涉及刚体、关节和软体相互作用的复杂任务中优于基线方法。
链接: https://arxiv.org/abs/2412.12089
作者: Eliot Xing,Vernon Luk,Jean Oh
机构: Carnegie Mellon University (卡内基梅隆大学)
关键词: deep reinforcement learning, collect large amounts, complex control policies, Recent advances, train complex control
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
备注:
点击查看摘要
Abstract:Recent advances in GPU-based parallel simulation have enabled practitioners to collect large amounts of data and train complex control policies using deep reinforcement learning (RL), on commodity GPUs. However, such successes for RL in robotics have been limited to tasks sufficiently simulated by fast rigid-body dynamics. Simulation techniques for soft bodies are comparatively several orders of magnitude slower, thereby limiting the use of RL due to sample complexity requirements. To address this challenge, this paper presents both a novel RL algorithm and a simulation platform to enable scaling RL on tasks involving rigid bodies and deformables. We introduce Soft Analytic Policy Optimization (SAPO), a maximum entropy first-order model-based actor-critic RL algorithm, which uses first-order analytic gradients from differentiable simulation to train a stochastic actor to maximize expected return and entropy. Alongside our approach, we develop Rewarped, a parallel differentiable multiphysics simulation platform that supports simulating various materials beyond rigid bodies. We re-implement challenging manipulation and locomotion tasks in Rewarped, and show that SAPO outperforms baselines over a range of tasks that involve interaction between rigid bodies, articulations, and deformables.
zh
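以下给出一段示意性代码(非论文官方实现),用以说明 SAPO 的核心思路:让回报沿可微分仿真反向传播一阶解析梯度,同时加入熵正则来训练随机策略。其中 differentiable_step 是假设的玩具动力学,critic 网络与 Rewarped 仿真平台均未涉及。

```python
# 示意性代码:用可微分仿真的一阶解析梯度训练最大熵随机策略(仅示意,非官方实现)
import torch

torch.manual_seed(0)

def differentiable_step(state, action):
    """假设的可微分动力学:一维点质量,目标是把位置推到原点附近。"""
    pos, vel = state[..., 0], state[..., 1]
    vel = vel + 0.1 * action.squeeze(-1)
    pos = pos + 0.1 * vel
    reward = -(pos ** 2) - 0.01 * action.squeeze(-1) ** 2
    return torch.stack([pos, vel], dim=-1), reward

policy = torch.nn.Sequential(torch.nn.Linear(2, 32), torch.nn.Tanh(), torch.nn.Linear(32, 2))
opt = torch.optim.Adam(policy.parameters(), lr=3e-3)
horizon, alpha = 16, 0.01  # alpha 为熵系数(假设值)

for it in range(200):
    state = torch.tensor([[1.0, 0.0]])
    total_return, total_entropy = 0.0, 0.0
    for t in range(horizon):
        out = policy(state)
        mean, log_std = out[..., :1], out[..., 1:].clamp(-2, 1)
        dist = torch.distributions.Normal(mean, log_std.exp())
        action = dist.rsample()              # 重参数化采样,保证梯度可沿仿真回传
        state, reward = differentiable_step(state, action)
        total_return = total_return + reward.sum()
        total_entropy = total_entropy + dist.entropy().sum()
    loss = -(total_return + alpha * total_entropy)  # 最大化期望回报 + 熵
    opt.zero_grad(); loss.backward(); opt.step()
```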
[CV-5] Instruction-based Image Manipulation by Watching How Things Move
【速读】: 该论文试图解决生成式指令驱动图像编辑模型训练数据集的问题,关键在于提出了一种新颖的数据集构建流程。该流程通过从视频中采样帧对,并利用多模态大语言模型 (MLLMs) 生成编辑指令,从而训练基于指令的图像操作模型。视频帧天然保留了主体和场景的身份信息,确保编辑过程中内容的连贯性。此外,视频数据捕捉了多样且自然的动态变化(如非刚性主体运动和复杂摄像机运动),这些难以通过其他方式建模的动态特性使其成为构建可扩展数据集的理想来源。通过这一方法,论文创建了一个新数据集,用于训练名为InstructMove的模型,该模型能够执行复杂的指令驱动编辑任务,如调整主体姿态、重新排列元素和改变摄像机视角,并在这些任务中展示了最先进的性能。
链接: https://arxiv.org/abs/2412.12087
作者: Mingdeng Cao,Xuaner Zhang,Yinqiang Zheng,Zhihao Xia
机构: The University of Tokyo(东京大学); Adobe
关键词: multimodal large language, generate editing instructions, training instruction-based image, large language models, paper introduces
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project page: this https URL
点击查看摘要
Abstract:This paper introduces a novel dataset construction pipeline that samples pairs of frames from videos and uses multimodal large language models (MLLMs) to generate editing instructions for training instruction-based image manipulation models. Video frames inherently preserve the identity of subjects and scenes, ensuring consistent content preservation during editing. Additionally, video data captures diverse, natural dynamics-such as non-rigid subject motion and complex camera movements-that are difficult to model otherwise, making it an ideal source for scalable dataset construction. Using this approach, we create a new dataset to train InstructMove, a model capable of instruction-based complex manipulations that are difficult to achieve with synthetically generated datasets. Our model demonstrates state-of-the-art performance in tasks such as adjusting subject poses, rearranging elements, and altering camera perspectives.
zh
[CV-6] IDArb: Intrinsic Decomposition for Arbitrary Number of Input Views and Illuminations
【速读】: 该论文试图解决从图像中捕捉几何和材质信息这一计算机视觉和图形学中的基本挑战。传统基于优化的方法在处理多视角输入时,计算时间长且难以区分光照和材质的模糊性;而基于学习的模型虽然利用了丰富的3D物体数据集,但难以保持多视角一致性。论文提出的IDArb模型通过扩散机制实现了对任意数量图像在不同光照条件下的内蕴分解,关键在于引入了一种跨视角、跨域的注意力模块(cross-view, cross-domain attention module)和一种光照增强、视角自适应的训练策略(illumination-augmented, view-adaptive training strategy),从而实现了表面法线和材质属性的准确且多视角一致的估计。此外,论文还引入了ARB-Objaverse数据集,提供了大规模多视角内蕴数据和多样光照条件下的渲染,支持鲁棒训练。
链接: https://arxiv.org/abs/2412.12083
作者: Zhibing Li,Tong Wu,Jing Tan,Mengchen Zhang,Jiaqi Wang,Dahua Lin
机构: The Chinese University of Hong Kong; Zhejiang University; Shanghai AI Laboratory
关键词: Capturing geometric, vision and graphics, remains a fundamental, computer vision, material information
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Capturing geometric and material information from images remains a fundamental challenge in computer vision and graphics. Traditional optimization-based methods often require hours of computational time to reconstruct geometry, material properties, and environmental lighting from dense multi-view inputs, while still struggling with inherent ambiguities between lighting and material. On the other hand, learning-based approaches leverage rich material priors from existing 3D object datasets but face challenges with maintaining multi-view consistency. In this paper, we introduce IDArb, a diffusion-based model designed to perform intrinsic decomposition on an arbitrary number of images under varying illuminations. Our method achieves accurate and multi-view consistent estimation on surface normals and material properties. This is made possible through a novel cross-view, cross-domain attention module and an illumination-augmented, view-adaptive training strategy. Additionally, we introduce ARB-Objaverse, a new dataset that provides large-scale multi-view intrinsic data and renderings under diverse lighting conditions, supporting robust training. Extensive experiments demonstrate that IDArb outperforms state-of-the-art methods both qualitatively and quantitatively. Moreover, our approach facilitates a range of downstream tasks, including single-image relighting, photometric stereo, and 3D reconstruction, highlighting its broad applications in realistic 3D content creation.
zh
[CV-7] UniLoc: Towards Universal Place Recognition Using Any Single Modality
【速读】: 该论文试图解决跨模态场景识别问题,旨在通过单一模型实现对自然语言、图像和点云等多种查询模态的统一处理。解决方案的关键在于提出了一种名为UniLoc的通用解决方案,该方案利用大规模对比学习技术,在实例级和场景级两个层次上进行匹配学习。具体而言,论文引入了一种基于自注意力机制的池化模块(Self-Attention based Pooling, SAP),用于评估实例描述符在聚合为场景级描述符时的重要性,从而提升跨模态场景识别的性能。实验结果表明,UniLoc在跨模态设置中表现优异,同时在单模态场景中也具有竞争力。
链接: https://arxiv.org/abs/2412.12079
作者: Yan Xia,Zhendong Li,Yun-Jin Li,Letian Shi,Hu Cao,João F. Henriques,Daniel Cremers
机构: Technical University of Munich (慕尼黑工业大学); Munich Center for Machine Learning (MCML) (慕尼黑机器学习中心); Visual Geometry Group, University of Oxford (牛津大学视觉几何组)
关键词: single-modality retrieval, recognition methods focus, focus on single-modality, methods focus, place recognition
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 14 pages, 10 figures
点击查看摘要
Abstract:To date, most place recognition methods focus on single-modality retrieval. While they perform well in specific environments, cross-modal methods offer greater flexibility by allowing seamless switching between map and query sources. It also promises to reduce computation requirements by having a unified model, and achieving greater sample efficiency by sharing parameters. In this work, we develop a universal solution to place recognition, UniLoc, that works with any single query modality (natural language, image, or point cloud). UniLoc leverages recent advances in large-scale contrastive learning, and learns by matching hierarchically at two levels: instance-level matching and scene-level matching. Specifically, we propose a novel Self-Attention based Pooling (SAP) module to evaluate the importance of instance descriptors when aggregated into a place-level descriptor. Experiments on the KITTI-360 dataset demonstrate the benefits of cross-modality for place recognition, achieving superior performance in cross-modal settings and competitive results also for uni-modal scenarios. Our project page is publicly available at this https URL.
zh
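下面是对自注意力池化(SAP)思路的一个最小示意(非官方实现):先为每个实例级描述符打分,再按 softmax 权重聚合为场景级描述符。特征维度与打分网络结构均为假设。

```python
# 示意性代码:基于自注意力的池化(SAP)思想示意,评估实例描述符的重要性并加权聚合
import torch
import torch.nn as nn

class SelfAttentionPooling(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.score = nn.Sequential(nn.Linear(dim, dim), nn.Tanh(), nn.Linear(dim, 1))

    def forward(self, inst_desc: torch.Tensor) -> torch.Tensor:
        # inst_desc: [B, N, D],N 个实例级描述符
        attn = torch.softmax(self.score(inst_desc), dim=1)   # [B, N, 1] 重要性权重
        place_desc = (attn * inst_desc).sum(dim=1)           # [B, D] 场景级描述符
        return nn.functional.normalize(place_desc, dim=-1)

pool = SelfAttentionPooling(dim=256)
print(pool(torch.randn(2, 32, 256)).shape)  # torch.Size([2, 256])
```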
[CV-8] CPath-Omni: A Unified Multimodal Foundation Model for Patch and Whole Slide Image Analysis in Computational Pathology
【速读】: 该论文试图解决病理学中多模态模型(LMMs)在处理patch级和全片图像(WSI)级任务时的知识整合问题,以及由此导致的模型冗余问题。解决方案的关键在于提出了CPath-Omni,这是一个拥有150亿参数的大型多模态模型,首次实现了patch级和WSI级图像分析的统一,整合了包括分类、视觉问答、图像描述和视觉参考提示在内的多种任务。此外,论文还开发了基于CLIP的病理学专用视觉处理器CPath-CLIP,通过集成不同视觉模型并结合大型语言模型作为文本编码器,构建了一个更强大的CLIP模型,显著提升了零样本和少样本数据集上的性能。这些创新使得CPath-Omni在多个任务和数据集上达到了最先进的性能,展示了其在病理学基础模型领域的潜力。
链接: https://arxiv.org/abs/2412.12077
作者: Yuxuan Sun,Yixuan Si,Chenglu Zhu,Xuan Gong,Kai Zhang,Pingyi Chen,Ye Zhang,Zhongyi Shui,Tao Lin,Lin Yang
机构: Westlake University(西湖大学); Zhejiang University(浙江大学); Harvard University(哈佛大学); The Ohio State University(俄亥俄州立大学); University of the Chinese Academy of Sciences(中国科学院大学)
关键词: brought significant advancements, brought significant, significant advancements, WSI level image, large multimodal models
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 22 pages, 13 figures
点击查看摘要
Abstract:The emergence of large multimodal models (LMMs) has brought significant advancements to pathology. Previous research has primarily focused on separately training patch-level and whole-slide image (WSI)-level models, limiting the integration of learned knowledge across patches and WSIs, and resulting in redundant models. In this work, we introduce CPath-Omni, the first 15-billion-parameter LMM designed to unify both patch and WSI level image analysis, consolidating a variety of tasks at both levels, including classification, visual question answering, captioning, and visual referring prompting. Extensive experiments demonstrate that CPath-Omni achieves state-of-the-art (SOTA) performance across seven diverse tasks on 39 out of 42 datasets, outperforming or matching task-specific models trained for individual tasks. Additionally, we develop a specialized pathology CLIP-based visual processor for CPath-Omni, CPath-CLIP, which, for the first time, integrates different vision models and incorporates a large language model as a text encoder to build a more powerful CLIP model, which achieves SOTA performance on nine zero-shot and four few-shot datasets. Our findings highlight CPath-Omni’s ability to unify diverse pathology tasks, demonstrating its potential to streamline and advance the field of foundation model in pathology.
zh
[CV-9] CG-Bench: Clue-grounded Question Answering Benchmark for Long Video Understanding
【速读】: 该论文试图解决现有多模态大语言模型 (MLLMs) 在长视频理解方面的评估不足问题,特别是基于多项选择题 (MCQs) 的评估方法无法真正反映模型对视频内容的理解能力。解决方案的关键在于引入CG-Bench,这是一个针对长视频线索导向问答的新型基准,强调模型通过检索相关线索来回答问题的能力,从而增强评估的可信度。CG-Bench包含1,219个手工筛选的长视频,分为14个一级类别、171个二级类别和638个三级类别,并提供12,129个问答对,涵盖感知、推理和幻觉三种主要问题类型。论文还设计了两种新的线索导向评估方法:线索导向的白盒和黑盒评估,以验证模型是否基于对视频的正确理解生成答案。
链接: https://arxiv.org/abs/2412.12075
作者: Guo Chen,Yicheng Liu,Yifei Huang,Yuping He,Baoqi Pei,Jilan Xu,Yali Wang,Tong Lu,Limin Wang
机构: Nanjing University(南京大学); Shanghai Artificial Intelligence Laboratory(上海人工智能实验室); Fudan University(复旦大学); Zhejiang University(浙江大学)
关键词: multimodal large language, large language models, video, long video understanding, video understanding
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 14 pages, 9 figures
点击查看摘要
Abstract:Most existing video understanding benchmarks for multimodal large language models (MLLMs) focus only on short videos. The limited number of benchmarks for long video understanding often rely solely on multiple-choice questions (MCQs). However, because of the inherent limitation of MCQ-based evaluation and the increasing reasoning ability of MLLMs, models can give the correct answer purely by combining short video understanding with elimination, without genuinely understanding the video content. To address this gap, we introduce CG-Bench, a novel benchmark designed for clue-grounded question answering in long videos. CG-Bench emphasizes the model's ability to retrieve relevant clues for questions, enhancing evaluation credibility. It features 1,219 manually curated videos categorized by a granular system with 14 primary categories, 171 secondary categories, and 638 tertiary categories, making it the largest benchmark for long video analysis. The benchmark includes 12,129 QA pairs in three major question types: perception, reasoning, and hallucination. Compensating for the drawbacks of pure MCQ-based evaluation, we design two novel clue-based evaluation methods: clue-grounded white box and black box evaluations, to assess whether the model generates answers based on the correct understanding of the video. We evaluate multiple closed-source and open-source MLLMs on CG-Bench. Results indicate that current models significantly underperform in understanding long videos compared to short ones, and a significant gap exists between open-source and commercial models. We hope CG-Bench can advance the development of more trustworthy and capable MLLMs for long video understanding. All annotations and video data are released at this https URL.
zh
[CV-10] SPADE: Spectroscopic Photoacoustic Denoising using an Analytical and Data-free Enhancement Framework
【速读】: 该论文试图解决光声成像(photoacoustic imaging)中由于噪声导致的信噪比(SNR)低和图像质量下降的问题。解决方案的关键在于提出了一种无需调参且无需大量训练数据的增强框架SPADE(SPA denoising using a tuning-free analytical and data-free enhancement)。该框架结合了无数据学习方法和基于BM3D的高效分析方法,能够在保留光谱线性的同时实现噪声抑制,确保功能信息的完整性。通过模拟、仿体、离体和在体实验验证,SPADE在提高SNR和保持光谱信息方面表现优异,尤其在复杂成像条件下优于传统方法。
链接: https://arxiv.org/abs/2412.12068
作者: Fangzhou Lin,Shang Gao,Yichuan Tang,Xihan Ma,Ryo Murakami,Ziming Zhang,John D. Obayemic,Winston W. Soboyejo,Haichong K. Zhang
机构: 未知
关键词: optical absorption spectra, differentiate chromophores based, unique optical absorption, Spectroscopic photoacoustic, absorption spectra
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 20 pages, 7 figures
点击查看摘要
Abstract:Spectroscopic photoacoustic (sPA) imaging uses multiple wavelengths to differentiate chromophores based on their unique optical absorption spectra. This technique has been widely applied in areas such as vascular mapping, tumor detection, and therapeutic monitoring. However, sPA imaging is highly susceptible to noise, leading to poor signal-to-noise ratio (SNR) and compromised image quality. Traditional denoising techniques like frame averaging, though effective in improving SNR, can be impractical for dynamic imaging scenarios due to reduced frame rates. Advanced methods, including learning-based approaches and analytical algorithms, have demonstrated promise but often require extensive training data and parameter tuning, limiting their adaptability for real-time clinical use. In this work, we propose an sPA denoising using a tuning-free analytical and data-free enhancement (SPADE) framework for denoising sPA images. This framework integrates a data-free learning-based method with an efficient BM3D-based analytical approach while preserving spectral linearity, providing noise reduction and ensuring that functional information is maintained. The SPADE framework was validated through simulation, phantom, ex vivo, and in vivo experiments. Results demonstrated that SPADE improved SNR and preserved spectral information, outperforming conventional methods, especially in challenging imaging conditions. SPADE presents a promising solution for enhancing sPA imaging quality in clinical applications where noise reduction and spectral preservation are critical.
zh
[CV-11] Exploring Semantic Consistency and Style Diversity for Domain Generalized Semantic Segmentation AAAI2025
【速读】: 该论文试图解决域泛化语义分割 (Domain Generalized Semantic Segmentation, DGSS) 中存在的两个主要问题:特征归一化方法导致的语义特征混淆和分类误判,以及域随机化方法引入的域无关噪声导致的分割模糊。解决方案的关键在于提出了一种名为SCSD的新框架,包含三个核心组件:1) 语义查询增强器 (Semantic Query Booster),提升掩码解码器中对象查询的语义感知和区分能力,实现跨域语义一致性预测;2) 文本驱动的风格变换模块 (Text-Driven Style Transform),利用域差异文本嵌入控制图像特征的风格变换,增加域间风格多样性;3) 风格协同优化机制 (Style Synergy Optimization),通过协同加权风格对比损失和风格聚合损失,增强域间特征分离和域内特征聚合。实验结果表明,SCSD在未见域数据集上的表现显著优于现有最先进方法。
链接: https://arxiv.org/abs/2412.12050
作者: Hongwei Niu,Linhuang Xie,Jianghang Lin,Shengchuan Zhang
机构: 未知
关键词: Generalized Semantic Segmentation, Domain Generalized Semantic, Generalized Semantic, unknown target domains, source domain data
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by AAAI 2025
点击查看摘要
Abstract:Domain Generalized Semantic Segmentation (DGSS) seeks to utilize source domain data exclusively to enhance the generalization of semantic segmentation across unknown target domains. Prevailing studies predominantly concentrate on feature normalization and domain randomization, these approaches exhibit significant limitations. Feature normalization-based methods tend to confuse semantic features in the process of constraining the feature space distribution, resulting in classification misjudgment. Domain randomization-based methods frequently incorporate domain-irrelevant noise due to the uncontrollability of style transformations, resulting in segmentation ambiguity. To address these challenges, we introduce a novel framework, named SCSD for Semantic Consistency prediction and Style Diversity generalization. It comprises three pivotal components: Firstly, a Semantic Query Booster is designed to enhance the semantic awareness and discrimination capabilities of object queries in the mask decoder, enabling cross-domain semantic consistency prediction. Secondly, we develop a Text-Driven Style Transform module that utilizes domain difference text embeddings to controllably guide the style transformation of image features, thereby increasing inter-domain style diversity. Lastly, to prevent the collapse of similar domain feature spaces, we introduce a Style Synergy Optimization mechanism that fortifies the separation of inter-domain features and the aggregation of intra-domain features by synergistically weighting style contrastive loss and style aggregation loss. Extensive experiments demonstrate that the proposed SCSD significantly outperforms existing state-of-theart methods. Notably, SCSD trained on GTAV achieved an average of 49.11 mIoU on the four unseen domain datasets, surpassing the previous state-of-the-art method by +4.08 mIoU. Code is available at this https URL.
zh
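下面用一段示意性代码说明“风格协同优化”的大致形态(非官方实现):对风格特征同时施加域间对比损失与域内聚合损失,并加权求和。风格特征的来源、损失权重与温度系数均为假设。

```python
# 示意性代码:风格协同优化思想示意,对比损失拉开域间风格、聚合损失收紧域内特征
import torch
import torch.nn.functional as F

def style_synergy_loss(feats, domain_ids, w_contrast=1.0, w_aggr=0.5, tau=0.1):
    # feats: [N, D] 风格特征;domain_ids: [N] 每个样本所属(随机化后)的风格域编号
    feats = F.normalize(feats, dim=-1)
    sim = feats @ feats.t() / tau                          # 余弦相似度 / 温度
    same = (domain_ids[:, None] == domain_ids[None, :]).float()
    eye = torch.eye(len(feats), device=feats.device)

    # 风格对比损失:同域为正样本、异域为负样本的 InfoNCE
    logits = sim - 1e9 * eye                               # 去掉自身
    pos_mask = same - eye
    log_prob = logits - torch.logsumexp(logits, dim=1, keepdim=True)
    contrast = -(pos_mask * log_prob).sum(1) / pos_mask.sum(1).clamp(min=1)

    # 风格聚合损失:把每个样本拉向其所在域的特征均值
    aggr = 0.0
    for d in domain_ids.unique():
        group = feats[domain_ids == d]
        aggr = aggr + ((group - group.mean(0, keepdim=True)) ** 2).sum(1).mean()

    return w_contrast * contrast.mean() + w_aggr * aggr

loss = style_synergy_loss(torch.randn(8, 64), torch.tensor([0, 0, 1, 1, 2, 2, 3, 3]))
```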
[CV-12] A LoRA is Worth a Thousand Pictures
【速读】: 该论文试图解决如何在不依赖原始训练集图像或额外图像生成的情况下,有效描述和区分艺术风格的问题。解决方案的关键在于利用低秩适应 (Low Rank Adaptation, LoRA) 权重作为艺术风格的有效描述符,通过实验证明 LoRA 权重在艺术风格聚类中的表现优于传统的预训练特征(如 CLIP 和 DINO),并且在缺乏训练图像知识的情况下,能够实现更准确的模型检索。这一方法为未来的零样本 LoRA 微调 (zero-shot LoRA fine-tuning) 和模型归属 (model attribution) 等应用提供了可能性。
链接: https://arxiv.org/abs/2412.12048
作者: Chenxi Liu,Towaki Takikawa,Alec Jacobson
机构: University of Toronto; Outerport; Adobe Research
关键词: Low Rank Adaptation, customization widely accessible, Rank Adaptation, Low Rank, Recent advances
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Recent advances in diffusion models and parameter-efficient fine-tuning (PEFT) have made text-to-image generation and customization widely accessible, with Low Rank Adaptation (LoRA) able to replicate an artist’s style or subject using minimal data and computation. In this paper, we examine the relationship between LoRA weights and artistic styles, demonstrating that LoRA weights alone can serve as an effective descriptor of style, without the need for additional image generation or knowledge of the original training set. Our findings show that LoRA weights yield better performance in clustering of artistic styles compared to traditional pre-trained features, such as CLIP and DINO, with strong structural similarities between LoRA-based and conventional image-based embeddings observed both qualitatively and quantitatively. We identify various retrieval scenarios for the growing collection of customized models and show that our approach enables more accurate retrieval in real-world settings where knowledge of the training images is unavailable and additional generation is required. We conclude with a discussion on potential future applications, such as zero-shot LoRA fine-tuning and model attribution.
zh
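按照上文“LoRA 权重本身即可作为风格描述符”的思路,下面给出一个假设性的小例子:把每个 LoRA 的权重展平拼接为向量,归一化后用 K-Means 聚类。真实场景中权重通常从 safetensors 文件读取、且可能只取低秩增量矩阵,这里用随机矩阵代替。

```python
# 示意性代码:把 LoRA 权重展平为风格描述符并做聚类(仅示意,非论文官方流程)
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import normalize

def lora_descriptor(lora_state_dict):
    """lora_state_dict: {参数名: np.ndarray},例如各层的 lora_A / lora_B 矩阵。"""
    return np.concatenate([w.ravel() for _, w in sorted(lora_state_dict.items())])

# 假设已载入若干个结构相同的 LoRA(此处用随机矩阵代替真实权重)
rng = np.random.default_rng(0)
loras = [{"layer1.lora_A": rng.normal(size=(4, 64)),
          "layer1.lora_B": rng.normal(size=(64, 4))} for _ in range(12)]

X = normalize(np.stack([lora_descriptor(l) for l in loras]))  # 每行一个风格描述符
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
print(labels)
```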
[CV-13] FSFM: A Generalizable Face Security Foundation Model via Self-Supervised Facial Representation Learning
【速读】: 该论文试图解决在拥有大量未标记真实人脸数据的情况下,如何学习一种鲁棒且可迁移的面部表示,以提升各种面部安全任务的泛化性能。解决方案的关键在于提出了一个自监督预训练框架FSFM,该框架结合了掩码图像建模 (Masked Image Modeling, MIM) 和实例判别 (Instance Discrimination, ID) 的协同作用。具体来说,论文探索了多种面部掩码策略,并提出了一种简单而强大的CRFR-P掩码策略,该策略强制模型捕捉有意义的区域内部一致性和挑战性的区域间连贯性。此外,论文设计了与MIM自然耦合的ID网络,通过定制的自蒸馏建立从局部到全局的对应关系。这些学习目标(即3C)共同促使模型编码真实人脸的局部特征和全局语义。预训练后,一个标准的ViT模型作为通用视觉基础模型,用于下游的面部安全任务,如跨数据集的深度伪造检测、跨域的面部反欺骗和未见过的扩散面部伪造检测。实验结果表明,该模型在10个公开数据集上的迁移性能优于监督预训练和其他自监督学习方法,甚至超越了任务专用的最先进方法。
链接: https://arxiv.org/abs/2412.12032
作者: Gaojian Wang,Feng Lin,Tong Wu,Zhenguang Liu,Zhongjie Ba,Kui Ren
机构: State Key Laboratory of Blockchain and Data Security, Zhejiang University; Hangzhou High-Tech Zone (Binjiang) Institute of Blockchain and Data Security
关键词: unlabeled real faces, generalization performance, transferable facial representation, robust and transferable, respect to generalization
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 21 pages, 11 figures, project page: this https URL
点击查看摘要
Abstract:This work asks: with abundant, unlabeled real faces, how to learn a robust and transferable facial representation that boosts various face security tasks with respect to generalization performance? We make the first attempt and propose a self-supervised pretraining framework to learn fundamental representations of real face images, FSFM, that leverages the synergy between masked image modeling (MIM) and instance discrimination (ID). We explore various facial masking strategies for MIM and present a simple yet powerful CRFR-P masking, which explicitly forces the model to capture meaningful intra-region consistency and challenging inter-region coherency. Furthermore, we devise the ID network that naturally couples with MIM to establish underlying local-to-global correspondence via tailored self-distillation. These three learning objectives, namely 3C, empower encoding both local features and global semantics of real faces. After pretraining, a vanilla ViT serves as a universal vision foundation model for downstream face security tasks: cross-dataset deepfake detection, cross-domain face anti-spoofing, and unseen diffusion facial forgery detection. Extensive experiments on 10 public datasets demonstrate that our model transfers better than supervised pretraining, visual and facial self-supervised learning arts, and even outperforms task-specialized SOTA methods.
zh
[CV-14] RepFace: Refining Closed-Set Noise with Progressive Label Correction for Face Recognition AAAI2025
【速读】: 该论文试图解决人脸识别中由于标签噪声(label noise),尤其是闭集噪声(closed-set noise)对模型训练造成的影响。解决方案的关键在于提出一种新的框架,通过在训练早期阶段引入辅助闭集噪声样本,使模型能够识别噪声数据,并将样本分为干净(clean)、模糊(ambiguous)和噪声(noisy)三组,分别采用不同的训练策略。具体步骤包括:1)使用生成的辅助闭集噪声样本帮助模型在早期识别噪声;2)根据样本与正类和最近负类中心的相似度进行分组;3)对模糊样本进行标签融合,结合累积的模型预测结果;4)在闭集内应用标签平滑(label smoothing),调整标签至最近负类与初始标签之间的中间值。该方法通过稳定早期训练并优化噪声样本处理,显著提升了主流人脸数据集上的性能。
链接: https://arxiv.org/abs/2412.12031
作者: Jie Zhang,Xun Gong,Zhonglin Sun
机构: 1. Key Laboratory of Urban Environment and Health, Institute of Urban Environment, Chinese Academy of Sciences, Xiamen, Fujian, China; 2. University of the Chinese Academy of Sciences, Beijing, China; 3. CAS Center for Excellence in Urbanization and Global Environmental Change, Beijing, China; 4. State Key Laboratory of Urban and Regional Ecology, Research Center for Eco-Environmental Sciences, Chinese Academy of Sciences, Beijing, China
关键词: made remarkable strides, Face recognition, face recognition performance, remarkable strides, discriminative losses
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 11 pages, 5 figures, AAAI2025
点击查看摘要
Abstract:Face recognition has made remarkable strides, driven by the expanding scale of datasets, advancements in various backbones and discriminative losses. However, face recognition performance is heavily affected by the label noise, especially closed-set noise. While numerous studies have focused on handling label noise, addressing closed-set noise still poses challenges. This paper identifies this challenge as stemming from training not being robust to noise in the early stage, necessitating an appropriate learning strategy for samples with low confidence, which are often misclassified as closed-set noise in later training phases. To address these issues, we propose a new framework to stabilize the training at early stages and split the samples into clean, ambiguous, and noisy groups, which are handled with separate training strategies. Initially, we employ generated auxiliary closed-set noisy samples to enable the model to identify noisy data at the early stages of training. Subsequently, we describe how samples are split into clean, ambiguous and noisy groups by their similarity to the positive and nearest negative centers. Then we perform label fusion for ambiguous samples by incorporating accumulated model predictions. Finally, we apply label smoothing within the closed set, adjusting the label to a point between the nearest negative class and the initially assigned label. Extensive experiments validate the effectiveness of our method on mainstream face datasets, achieving state-of-the-art results. The code will be released upon acceptance.
zh
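下面是对“按与正类中心 / 最近负类中心的相似度分组,并向最近负类做闭集标签平滑”这一步骤的示意性实现(非官方代码),阈值 clean_th、noisy_th 与平滑系数 lam 均为假设的超参数。

```python
# 示意性代码:依相似度差值分组,并仅对疑似闭集噪声样本向最近负类做标签平滑
import torch
import torch.nn.functional as F

def refine_labels(feat, labels, centers, clean_th=0.4, noisy_th=0.0, lam=0.3):
    # feat: [N, D] 特征;labels: [N] 初始标签;centers: [C, D] 各类中心
    sim = F.normalize(feat, dim=-1) @ F.normalize(centers, dim=-1).t()   # [N, C]
    pos_sim = sim.gather(1, labels[:, None]).squeeze(1)                  # 与正类中心的相似度
    sim_neg = sim.scatter(1, labels[:, None], -1.0)                      # 屏蔽正类
    neg_sim, nearest_neg = sim_neg.max(dim=1)                            # 最近负类

    margin = pos_sim - neg_sim
    group = torch.full_like(labels, 2)            # 默认视为疑似闭集噪声
    group[margin > noisy_th] = 1                  # 模糊样本
    group[margin > clean_th] = 0                  # 干净样本

    C = centers.size(0)
    soft = F.one_hot(labels, C).float()
    smooth = (1 - lam) * soft + lam * F.one_hot(nearest_neg, C).float()  # 向最近负类平滑
    target = torch.where((group == 2)[:, None], smooth, soft)            # 仅对噪声组平滑
    return target, group

targets, groups = refine_labels(torch.randn(6, 128),
                                torch.tensor([0, 1, 2, 0, 1, 2]),
                                torch.randn(4, 128))
```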
[CV-15] SAMIC: Segment Anything with In-Context Spatial Prompt Engineering
【速读】: 该论文试图解决少样本分割(few-shot segmentation)问题,即如何利用少量标注的参考图像在特定领域中识别特定类型的对象(如飞机)。当前最先进的方法依赖于为每个新领域特定应用构建资源密集型的模型,这些模型需要在大规模标注数据集上进行训练,以实现知识迁移。论文的关键解决方案是引入SAMIC,一个轻量级网络,通过学习如何提示视觉基础模型(Vision Foundation Models, VFMs)来实现新类型对象的分割。SAMIC将任何任务视为少样本学习问题,其参数数量仅为260万,比主流模型(如ResNet 101)小94%,并且在使用少量训练数据的情况下,在多个少样本和语义分割数据集上达到了与现有技术相当或更优的性能。
链接: https://arxiv.org/abs/2412.11998
作者: Savinay Nagendra,Kashif Rashid,Chaopeng Shen,Daniel Kifer
机构: The Pennsylvania State University, University Park, Pennsylvania(宾夕法尼亚州立大学,大学公园,宾夕法尼亚州); Schlumberger-Doll Research, Cambridge, Massachusetts(斯伦贝谢-多尔研究中心,马萨诸塞州剑桥)
关键词: labeled reference images, identify specific types, reference images, identify specific, types of objects
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Few-shot segmentation is the problem of learning to identify specific types of objects (e.g., airplanes) in images from a small set of labeled reference images. The current state of the art is driven by resource-intensive construction of models for every new domain-specific application. Such models must be trained on enormous labeled datasets of unrelated objects (e.g., cars, trains, animals) so that their "knowledge" can be transferred to new types of objects. In this paper, we show how to leverage existing vision foundation models (VFMs) to reduce the incremental cost of creating few-shot segmentation models for new domains. Specifically, we introduce SAMIC, a small network that learns how to prompt VFMs in order to segment new types of objects in domain-specific applications. SAMIC enables any task to be approached as a few-shot learning problem. At 2.6 million parameters, it is 94% smaller than the leading models (e.g., having a ResNet-101 backbone with 45+ million parameters). Even using 1/5th of the training data provided by one-shot benchmarks, SAMIC is competitive with, or sets the state of the art, on a variety of few-shot and semantic segmentation datasets including COCO-20^i, Pascal-5^i, PerSeg, FSS-1000, and NWPU VHR-10.
zh
[CV-16] Controllable Shadow Generation with Single-Step Diffusion Models from Synthetic Data
【速读】: 该论文试图解决现有方法在高质量图像合成和视觉效果中生成真实阴影的局限性问题。现有方法中,基于物理的方法需要3D场景几何信息,而学习型方法则难以控制并容易产生视觉伪影。论文提出了一种新的快速、可控且无需背景的2D物体图像阴影生成方法。其关键在于使用3D渲染引擎创建大规模合成数据集,并训练扩散模型以实现可控的阴影生成,生成适用于多种光源参数的阴影图。通过修正流目标函数,该方法在单次采样步骤中即可实现高质量结果,支持实时应用,并展示了良好的真实世界图像泛化能力。
链接: https://arxiv.org/abs/2412.11972
作者: Onur Tasar,Clément Chadebec,Benjamin Aubin
机构: Jasper Research(Jasper研究); Jasper Research(Jasper研究); Jasper Research(Jasper研究)
关键词: Physics-based approaches require, Realistic shadow generation, learning-based techniques struggle, existing methods suffer, Physics-based approaches
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Realistic shadow generation is a critical component for high-quality image compositing and visual effects, yet existing methods suffer from certain limitations: Physics-based approaches require a 3D scene geometry, which is often unavailable, while learning-based techniques struggle with control and visual artifacts. We introduce a novel method for fast, controllable, and background-free shadow generation for 2D object images. We create a large synthetic dataset using a 3D rendering engine to train a diffusion model for controllable shadow generation, generating shadow maps for diverse light source parameters. Through extensive ablation studies, we find that rectified flow objective achieves high-quality results with just a single sampling step enabling real-time applications. Furthermore, our experiments demonstrate that the model generalizes well to real-world images. To facilitate further research in evaluating quality and controllability in shadow generation, we release a new public benchmark containing a diverse set of object images and shadow maps in various settings. The project page is available at this https URL
zh
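下面用一个与论文无关的玩具分布,示意“修正流目标 + 单步采样”的关系(非官方实现):训练 v(x_t, t) 回归直线插值的速度场,推理时只做一次 Euler 步。真实模型还需以物体图像与光源参数为条件并输出阴影图,网络结构与数据均为假设。

```python
# 示意性代码:修正流(rectified flow)训练目标与单步 Euler 采样的最小示意
import torch
import torch.nn as nn

net = nn.Sequential(nn.Linear(2 + 1, 64), nn.SiLU(), nn.Linear(64, 2))  # 输入 (x_t, t)
opt = torch.optim.Adam(net.parameters(), lr=1e-3)

def sample_data(n):          # 假设的目标分布(代替真实的“阴影图”数据)
    return torch.randn(n, 2) * 0.3 + torch.tensor([2.0, -1.0])

for step in range(500):      # 训练:v(x_t, t) 回归直线插值的速度场 x1 - x0
    x1, x0 = sample_data(256), torch.randn(256, 2)
    t = torch.rand(256, 1)
    xt = (1 - t) * x0 + t * x1
    v = net(torch.cat([xt, t], dim=1))
    loss = ((v - (x1 - x0)) ** 2).mean()
    opt.zero_grad(); loss.backward(); opt.step()

# 单步采样:x1 ≈ x0 + v(x0, t=0),即一次 Euler 步,这正是实时应用的来源
x0 = torch.randn(1024, 2)
x1_hat = x0 + net(torch.cat([x0, torch.zeros(1024, 1)], dim=1))
print(x1_hat.mean(0))        # 训练充分时应大致接近 [2.0, -1.0]
```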
[CV-17] Gramian Multimodal Representation Learning and Alignment
【速读】: 该论文试图解决多模态学习中现有模型在处理多个模态时表现不佳的问题,特别是在需要联合理解多个模态的任务中。现有模型通常通过对比学习对每对模态进行对齐,但未能确保所有模态之间的全局对齐,导致性能受限。论文提出的解决方案是引入一种新的几何对齐方法,即Gramian Representation Alignment Measure (GRAM),通过最小化模态向量在高维空间中张成的平行多面体的Gramian体积,直接在更高维的嵌入空间中对齐多个模态,从而确保所有模态的几何对齐。GRAM可以替代传统的余弦相似度,适用于从2到n个模态的对齐,并在下游任务中显著提升了多模态模型的性能,如视频-音频-文本检索和音频-视频分类。
链接: https://arxiv.org/abs/2412.11959
作者: Giordano Cicchetti,Eleonora Grassucci,Luigi Sigillo,Danilo Comminiello
机构: Sapienza University of Rome(罗马大学)
关键词: Human perception integrates, perception integrates multiple, Human perception, integrates multiple modalities, multiple modalities
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
点击查看摘要
Abstract:Human perception integrates multiple modalities, such as vision, hearing, and language, into a unified understanding of the surrounding reality. While recent multimodal models have achieved significant progress by aligning pairs of modalities via contrastive learning, their solutions are unsuitable when scaling to multiple modalities. These models typically align each modality to a designated anchor without ensuring the alignment of all modalities with each other, leading to suboptimal performance in tasks requiring a joint understanding of multiple modalities. In this paper, we structurally rethink the pairwise conventional approach to multimodal learning and we present the novel Gramian Representation Alignment Measure (GRAM), which overcomes the above-mentioned limitations. GRAM learns and then aligns n modalities directly in the higher-dimensional space in which modality embeddings lie by minimizing the Gramian volume of the k -dimensional parallelotope spanned by the modality vectors, ensuring the geometric alignment of all modalities simultaneously. GRAM can replace cosine similarity in any downstream method, holding for 2 to n modality and providing more meaningful alignment with respect to previous similarity measures. The novel GRAM-based contrastive loss function enhances the alignment of multimodal models in the higher-dimensional embedding space, leading to new state-of-the-art performance in downstream tasks such as video-audio-text retrieval and audio-video classification. The project page, the code, and the pretrained models are available at this https URL.
zh
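GRAM 的核心量是 k 个模态向量张成的平行多面体体积,可由 Gram 矩阵行列式计算:vol = sqrt(det(AᵀA))。下面给出一个示意性实现,损失的具体组合方式为假设,非官方代码。

```python
# 示意性代码:计算模态嵌入的 Gram 体积,体积越小表示各模态对齐得越好
import torch

def gram_volume(modality_embs):
    # modality_embs: [B, k, D],每个样本有 k 个模态的嵌入
    A = torch.nn.functional.normalize(modality_embs, dim=-1)
    G = A @ A.transpose(1, 2)                                # [B, k, k] Gram 矩阵
    return torch.linalg.det(G).clamp(min=1e-12).sqrt()       # [B] 平行多面体体积

video, audio, text = (torch.randn(4, 256) for _ in range(3))
vol = gram_volume(torch.stack([video, audio, text], dim=1))
align_loss = vol.mean()      # 最小化体积即可替代两两余弦相似度作为对齐目标
print(vol)
```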
[CV-18] Reliable Breast Cancer Molecular Subtype Prediction based on uncertainty-aware Bayesian Deep Learning by Mammography
【速读】: 该论文试图解决乳腺癌分子亚型分类中的预测不确定性问题,并提出了一种基于全乳房X光图像的贝叶斯深度学习模型。解决方案的关键在于引入不确定性量化方法,通过贝叶斯深度学习模型来评估预测的不确定性,从而提高诊断的可靠性。此外,论文还提出了一种新颖的分层分类策略(two-stage classification strategy),以提升多类别分子亚型分类任务的性能。该模型在HER2-enriched、luminal和triple-negative三种亚型的分类任务中分别达到了0.71、0.75和0.86的AUC值,不仅在性能上与其他研究相当,还通过量化预测不确定性增强了模型的可靠性。
链接: https://arxiv.org/abs/2412.11953
作者: Mohaddeseh Chegini,Ali Mahloojifar
机构: 未知
关键词: breast cancer molecular, Breast cancer, cancer molecular subtypes, breast cancer classification, molecular subtypes
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Breast cancer is a heterogeneous disease with different molecular subtypes, clinical behavior, treatment responses as well as survival outcomes. The development of a reliable, accurate, available and inexpensive method to predict the molecular subtypes using medical images plays an important role in the diagnosis and prognosis of breast cancer. Recently, deep learning methods have shown good performance in the breast cancer classification tasks using various medical images. Despite all that success, classical deep learning cannot deliver the predictive uncertainty. The uncertainty represents the validity of the prediction. Therefore, the high predicted uncertainty might cause a negative effect on the accurate diagnosis of breast cancer molecular subtypes. To overcome this, uncertainty quantification methods are used to determine the predictive uncertainty. Accordingly, in this study, we proposed an uncertainty-aware Bayesian deep learning model using the full mammogram images. In addition, to increase the performance of the multi-class molecular subtype classification task, we proposed a novel hierarchical classification strategy, named the two-stage classification strategy. The separate AUC of the proposed model for each subtype was 0.71, 0.75 and 0.86 for HER2-enriched, luminal and triple-negative classes, respectively. The proposed model not only has a comparable performance to other studies in the field of breast cancer molecular subtypes prediction, even using full mammography images, but it is also more reliable, due to quantifying the predictive uncertainty.
zh
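论文使用贝叶斯深度学习来量化预测不确定性;下面以 MC Dropout 这一常见近似为例做示意(不一定与论文采用的贝叶斯建模方式完全一致),网络结构与类别数均为假设。

```python
# 示意性代码:用 MC Dropout 做 T 次随机前向,取预测均值作输出、预测熵作不确定性
import torch
import torch.nn as nn

model = nn.Sequential(nn.Flatten(), nn.Linear(64 * 64, 128), nn.ReLU(),
                      nn.Dropout(p=0.5), nn.Linear(128, 3))   # 3 个分子亚型(假设)

def predict_with_uncertainty(model, x, T=30):
    model.train()                         # 保持 Dropout 打开以实现随机前向
    with torch.no_grad():
        probs = torch.stack([torch.softmax(model(x), dim=-1) for _ in range(T)])
    mean = probs.mean(0)                                       # [B, C] 预测均值
    entropy = -(mean * mean.clamp(min=1e-12).log()).sum(-1)    # [B] 预测熵作为不确定性
    return mean, entropy

mean, unc = predict_with_uncertainty(model, torch.randn(2, 1, 64, 64))
print(mean.argmax(-1), unc)   # 不确定性高的预测可交由医生复核
```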
[CV-19] Advancing Comprehensive Aesthetic Insight with Multi-Scale Text-Guided Self-Supervised Learning AAAI2025
【速读】: 该论文试图解决图像美学评估 (Image Aesthetic Assessment, IAA) 中传统方法依赖单一任务和不足的标注数据集,导致美学理解深度不足的问题。解决方案的关键在于提出了一种综合性的多模态大语言模型 (Multi-modal Large Language Models, MLLMs),并采用了创新的多尺度文本引导自监督学习技术。该技术包括多尺度特征对齐模块,利用大量未标注数据进行自监督学习,从而在结构和功能上增强美学评估能力。通过广泛的指令微调,该模型在美学评分、美学评论和个性化图像美学评估等多个任务上达到了新的最先进水平,并展示了在新兴的美学建议任务中的零样本学习能力。此外,该模型还利用上下文学习展示了个性化图像美学评估的内在优势。
链接: https://arxiv.org/abs/2412.11952
作者: Yuti Liu,Shice Liu,Junyuan Gao,Pengtao Jiang,Hao Zhang,Jinwei Chen,Bo Li
机构: 未知
关键词: Image Aesthetic Assessment, Multi-modal Large Language, Aesthetic, areas for improvement, Large Language Models
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Accepted by AAAI 2025
点击查看摘要
Abstract:Image Aesthetic Assessment (IAA) is a vital and intricate task that entails analyzing and assessing an image’s aesthetic values, and identifying its highlights and areas for improvement. Traditional methods of IAA often concentrate on a single aesthetic task and suffer from inadequate labeled datasets, thus impairing in-depth aesthetic comprehension. Despite efforts to overcome this challenge through the application of Multi-modal Large Language Models (MLLMs), such models remain underdeveloped for IAA purposes. To address this, we propose a comprehensive aesthetic MLLM capable of nuanced aesthetic insight. Central to our approach is an innovative multi-scale text-guided self-supervised learning technique. This technique features a multi-scale feature alignment module and capitalizes on a wealth of unlabeled data in a self-supervised manner to structurally and functionally enhance aesthetic ability. The empirical evidence indicates that accompanied with extensive instruct-tuning, our model sets new state-of-the-art benchmarks across multiple tasks, including aesthetic scoring, aesthetic commenting, and personalized image aesthetic assessment. Remarkably, it also demonstrates zero-shot learning capabilities in the emerging task of aesthetic suggesting. Furthermore, for personalized image aesthetic assessment, we harness the potential of in-context learning and showcase its inherent advantages.
zh
[CV-20] Coconut Palm Tree Counting on Drone Images with Deep Object Detection and Synthetic Training Data
【速读】: 该论文试图解决加纳农场中椰子树数量难以准确统计的问题,尤其是在不同种植阶段导致的手动计数繁琐且易出错的挑战。解决方案的关键在于利用YOLO(You Only Look Once)实时目标检测算法,结合半自动化框架进行椰子树的识别与计数。通过创建合成图像来优化在数据稀缺情况下的模型训练和验证,并调整超参数以提高YOLO的平均精度(mAP)。最终,通过实验验证了合成图像在农业场景中的价值,将mAP@.5从0.65提升至0.88,显著提高了检测精度。
链接: https://arxiv.org/abs/2412.11949
作者: Tobias Rohe,Barbara Böhm,Michael Kölle,Jonas Stein,Robert Müller,Claudia Linnhoff-Popien
机构: Mobile and Distributed Systems Group, LMU Munich, Germany; Aqarios GmbH, Germany
关键词: including agriculture, revolutionized various domains, YOLO, Abstract, trees
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 9 pages
点击查看摘要
Abstract:Drones have revolutionized various domains, including agriculture. Recent advances in deep learning have propelled among other things object detection in computer vision. This study utilized YOLO, a real-time object detector, to identify and count coconut palm trees in Ghanaian farm drone footage. The farm presented has lost track of its trees due to different planting phases. While manual counting would be very tedious and error-prone, accurately determining the number of trees is crucial for efficient planning and management of agricultural processes, especially for optimizing yields and predicting production. We assessed YOLO for palm detection within a semi-automated framework, evaluated accuracy augmentations, and pondered its potential for farmers. Data was captured in September 2022 via drones. To optimize YOLO with scarce data, synthetic images were created for model training and validation. The YOLOv7 model, pretrained on the COCO dataset (excluding coconut palms), was adapted using tailored data. Trees from footage were repositioned on synthetic images, with testing on distinct authentic images. In our experiments, we adjusted hyperparameters, improving YOLO’s mean average precision (mAP). We also tested various altitudes to determine the best drone height. From an initial mAP@.5 of 0.65 , we achieved 0.88, highlighting the value of synthetic images in agricultural scenarios.
zh
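下面示意“合成训练图像”的构造方式(非论文官方流程):把裁剪出的椰子树贴到背景图的随机位置,并同步生成 YOLO 格式标注。文件路径、贴图是否带透明通道等均为假设。

```python
# 示意性代码:用树冠裁剪图 + 背景图合成带标注的训练样本(假设裁剪图带 alpha 通道)
import random
from PIL import Image

def make_synthetic(background_path, tree_crop_paths, n_trees, out_img, out_label):
    bg = Image.open(background_path).convert("RGB")
    W, H = bg.size
    lines = []
    for _ in range(n_trees):
        crop = Image.open(random.choice(tree_crop_paths)).convert("RGBA")
        w, h = crop.size
        x = random.randint(0, W - w)
        y = random.randint(0, H - h)
        bg.paste(crop, (x, y), crop)                      # 以 alpha 通道为掩膜贴图
        # YOLO 标注格式:class x_center y_center width height(均为归一化坐标)
        lines.append(f"0 {(x + w / 2) / W:.6f} {(y + h / 2) / H:.6f} {w / W:.6f} {h / H:.6f}")
    bg.save(out_img)
    with open(out_label, "w") as f:
        f.write("\n".join(lines))

# 用法示意(文件名为假设):
# make_synthetic("field_bg.jpg", ["palm_0.png", "palm_1.png"], 20,
#                "synthetic_000.jpg", "synthetic_000.txt")
```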
[CV-21] Does VLM Classification Benefit from LLM Description Semantics?
【速读】: 该论文试图解决的问题是如何区分描述性文本在图像分类中的实际区分能力与可能依赖于语义无关的集成效应带来的性能提升。解决方案的关键在于提出了一种无需训练的方法,通过在局部标签邻域中提取具有区分能力的描述,从而独立于类别名称的集成效应进行描述选择。具体来说,该方法利用测试图像的局部CLIP标签邻域,并针对一个小型选择集提取能够有效区分每个类别的描述,从而在七个数据集上展示了分类准确率的提升,并深入分析了基于描述的图像分类的可解释性。
链接: https://arxiv.org/abs/2412.11917
作者: Pingchuan Ma,Lennart Rietdorf,Dmytro Kotovenko,Vincent Tao Hu,Björn Ommer
机构: University of Heidelberg(海德堡大学); University of Tübingen(蒂宾根大学)
关键词: Accurately describing images, Accurately describing, Large Language Models, foundation of explainable, Language Models
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Accurately describing images via text is a foundation of explainable AI. Vision-Language Models (VLMs) like CLIP have recently addressed this by aligning images and texts in a shared embedding space, expressing semantic similarities between vision and language embeddings. VLM classification can be improved with descriptions generated by Large Language Models (LLMs). However, it is difficult to determine the contribution of actual description semantics, as the performance gain may also stem from a semantic-agnostic ensembling effect. Considering this, we ask how to distinguish the actual discriminative power of descriptions from performance boosts that potentially rely on an ensembling effect. To study this, we propose an alternative evaluation scenario that shows a characteristic behavior if the used descriptions have discriminative power. Furthermore, we propose a training-free method to select discriminative descriptions that work independently of classname ensembling effects. The training-free method works in the following way: A test image has a local CLIP label neighborhood, i.e., its top-k label predictions. Then, w.r.t. a small selection set, we extract descriptions that distinguish each class well in the local neighborhood. Using the selected descriptions, we demonstrate improved classification accuracy across seven datasets and provide in-depth analysis and insights into the explainability of description-based image classification by VLMs.
zh
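下面用伪造的嵌入函数示意该免训练方法的流程(embed_image、embed_text、类别与描述文本均为假设,实际应替换为 CLIP 编码器):先取测试图像的 top-k 标签邻域,再在小的选择集上挑出“类内相似度高、邻域内其他类相似度低”的描述。

```python
# 示意性代码:基于局部标签邻域的判别性描述筛选(仅示意流程,非官方实现)
import torch

torch.manual_seed(0)
D, classes = 512, ["sparrow", "finch", "wren", "robin"]
descs = {c: [f"{c}, a small bird with feature {i}" for i in range(5)] for c in classes}

# 假设的编码函数,实际应为 CLIP 的图像 / 文本编码器
embed_image = lambda img: torch.nn.functional.normalize(torch.randn(D), dim=0)
embed_text = lambda txt: torch.nn.functional.normalize(torch.randn(D), dim=0)

# 1) 测试图像的 top-k 标签邻域
img_emb = embed_image("test.jpg")
cls_embs = torch.stack([embed_text(f"a photo of a {c}") for c in classes])
k = 3
neighborhood = [classes[i] for i in (cls_embs @ img_emb).topk(k).indices.tolist()]

# 2) 在一个小的选择集(每类少量图像)上,为邻域内每个类挑选判别力最强的描述
selection_set = {c: [embed_image(f"{c}_{j}.jpg") for j in range(4)] for c in neighborhood}
chosen = {}
for c in neighborhood:
    scores = []
    for d in descs[c]:
        t = embed_text(d)
        own = torch.stack(selection_set[c]) @ t
        others = torch.cat([torch.stack(selection_set[o]) for o in neighborhood if o != c]) @ t
        scores.append(own.mean() - others.mean())          # 判别力:类内相似度减类间相似度
    chosen[c] = descs[c][int(torch.stack(scores).argmax())]
print(chosen)
```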
[CV-22] PunchBench: Benchmarking MLLM s in Multimodal Punchline Comprehension
【速读】: 该论文试图解决现有多模态笑话(multimodal punchlines)理解基准的三个主要局限性:1) 语言捷径导致模型仅依赖文本;2) 问题多样性不足;3) 领域范围狭窄。解决方案的关键在于引入了一个名为PunchBench的多模态笑话理解基准,通过生成同义和反义的标题来减少文本捷径的影响,并采用多样化的问答格式和跨领域的图像-标题对,以实现更全面和准确的评估。此外,论文提出了Simple-to-Complex Chain-of-Question (SC-CoQ)策略,通过逐步解决复杂问题来提升模型的表现,显著提高了多模态大语言模型在PunchBench上的性能。
链接: https://arxiv.org/abs/2412.11906
作者: Kun Ouyang,Yuanxin Liu,Shicheng Li,Yi Liu,Hao Zhou,Fandong Meng,Jie Zhou,Xu Sun
机构: State Key Laboratory of Multimedia Information Processing(多媒体信息处理国家重点实验室); School of Computer Science(计算机科学学院); Peking University(北京大学); WeChat AI(微信人工智能); Tencent Inc.(腾讯公司)
关键词: online multimedia platforms, multimedia platforms, involve humor, humor or sarcasm, sarcasm conveyed
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:Multimodal punchlines, which involve humor or sarcasm conveyed in image-caption pairs, are a popular way of communication on online multimedia platforms. With the rapid development of multimodal large language models (MLLMs), it is essential to assess their ability to effectively comprehend these punchlines. However, existing benchmarks on punchline comprehension suffer from three major limitations: 1) language shortcuts that allow models to solely rely on text, 2) lack of question diversity, and 3) narrow focus on a specific domain of multimodal content (e.g., cartoon). To address these limitations, we introduce a multimodal \textbfPunchline comprehension \textbfBenchmark, named \textbfPunchBench, which is tailored for accurate and comprehensive evaluation of punchline comprehension. To enhance the evaluation accuracy, we generate synonymous and antonymous captions by modifying original captions, which mitigates the impact of shortcuts in the captions. To provide a comprehensive evaluation, PunchBench incorporates diverse question formats and image-captions from various domains. On this basis, we conduct extensive evaluations and reveal a significant gap between state-of-the-art MLLMs and humans in punchline comprehension. To improve punchline comprehension, we propose Simple-to-Complex Chain-of-Question (SC-CoQ) strategy, enabling the models to incrementally address complicated questions by first mastering simple ones. SC-CoQ effectively enhances the performance of various MLLMs on PunchBench, surpassing in-context learning and chain-of-thought.
zh
[CV-23] From 2D CAD Drawings to 3D Parametric Models: A Vision-Language Approach AAAI2025
【速读】: 该论文试图解决从2D CAD图纸重建3D参数化模型的问题。解决方案的关键在于采用了一种基于视觉-语言模型 (Vision-Language Models, VLMs) 的新方法,称为CAD2Program。与传统依赖特定数据表示和算法的重建方法不同,该方法将2D CAD图纸视为栅格图像,并使用标准的视觉变换器 (Vision Transformer, ViT) 模型进行编码,从而避免了传统方法对输入格式的严格限制。在输出端,该方法通过自回归预测生成描述3D参数化模型的通用语言文本,相比其他使用固定大小槽位的特定领域序列表示方法,这种基于文本的表示更加灵活,能够轻松扩展到任意几何实体和语义或功能属性。实验结果表明,该方法在大规模橱柜模型数据集上具有显著的有效性。
链接: https://arxiv.org/abs/2412.11892
作者: Xilin Wang,Jia Zheng,Yuanchao Hu,Hao Zhu,Qian Yu,Zihan Zhou
机构: 1. Tsinghua University (清华大学); 2. ByteDance AI Lab (字节跳动人工智能实验室)
关键词: CAD, parametric models, models, CAD drawings, method
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: To Appear in AAAI 2025. The project page is at this https URL
点击查看摘要
Abstract:In this paper, we present CAD2Program, a new method for reconstructing 3D parametric models from 2D CAD drawings. Our proposed method is inspired by recent successes in vision-language models (VLMs), and departs from traditional methods which rely on task-specific data representations and/or algorithms. Specifically, on the input side, we simply treat the 2D CAD drawing as a raster image, regardless of its original format, and encode the image with a standard ViT model. We show that such an encoding scheme achieves competitive performance against existing methods that operate on vector-graphics inputs, while imposing substantially fewer restrictions on the 2D drawings. On the output side, our method auto-regressively predicts a general-purpose language describing 3D parametric models in text form. Compared to other sequence modeling methods for CAD which use domain-specific sequence representations with fixed-size slots, our text-based representation is more flexible, and can be easily extended to arbitrary geometric entities and semantic or functional properties. Experimental results on a large-scale dataset of cabinet models demonstrate the effectiveness of our method.
zh
[CV-24] SegMAN: Omni-scale Context Modeling with State Space Models and Local Attention for Semantic Segmentation
【速读】: 该论文试图解决高质量语义分割中同时实现全局上下文建模、局部细节编码和多尺度特征提取的挑战。解决方案的关键在于引入了一种名为SegMAN的新型线性时间模型,该模型包含一个混合特征编码器(SegMAN Encoder)和一个基于状态空间模型的解码器。SegMAN Encoder通过协同集成滑动局部注意力与动态状态空间模型,实现了高效的全局上下文建模并保留了细粒度的局部细节。同时,解码器中的MMSCopE模块增强了多尺度上下文特征提取,并能自适应地随输入分辨率调整。通过在ADE20K、Cityscapes和COCO-Stuff等数据集上的全面评估,SegMAN在保持较低计算复杂度的同时,显著提升了分割性能。
链接: https://arxiv.org/abs/2412.11890
作者: Yunxiang Fu,Meng Lou,Yizhou Yu
机构: School of Computing and Data Science, The University of Hong Kong
关键词: High-quality semantic segmentation, semantic segmentation relies, global context modeling, local detail encoding, High-quality semantic
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:High-quality semantic segmentation relies on three key capabilities: global context modeling, local detail encoding, and multi-scale feature extraction. However, recent methods struggle to possess all these capabilities simultaneously. Hence, we aim to empower segmentation networks to simultaneously carry out efficient global context modeling, high-quality local detail encoding, and rich multi-scale feature representation for varying input resolutions. In this paper, we introduce SegMAN, a novel linear-time model comprising a hybrid feature encoder dubbed SegMAN Encoder, and a decoder based on state space models. Specifically, the SegMAN Encoder synergistically integrates sliding local attention with dynamic state space models, enabling highly efficient global context modeling while preserving fine-grained local details. Meanwhile, the MMSCopE module in our decoder enhances multi-scale context feature extraction and adaptively scales with the input resolution. We comprehensively evaluate SegMAN on three challenging datasets: ADE20K, Cityscapes, and COCO-Stuff. For instance, SegMAN-B achieves 52.6% mIoU on ADE20K, outperforming SegNeXt-L by 1.6% mIoU while reducing computational complexity by over 15% GFLOPs. On Cityscapes, SegMAN-B attains 83.8% mIoU, surpassing SegFormer-B3 by 2.1% mIoU with approximately half the GFLOPs. Similarly, SegMAN-B improves upon VWFormer-B3 by 1.6% mIoU with lower GFLOPs on the COCO-Stuff dataset. Our code is available at this https URL.
zh
[CV-25] owards Physically-Based Sky-Modeling
【速读】: 该论文试图解决现有天空模型在生成环境映射(environment maps)时无法忠实再现物理捕获的高动态范围图像(HDRI)中的关键特征(如色调、阴影和光照一致性)的问题。解决方案的关键在于提出了一种全天气条件下的天空模型(AllSky),该模型直接从物理捕获的HDR图像中学习,并能够根据用户控制的太阳位置和云层分布,生成具有扩展动态范围(EDR)的环境映射,从而更好地模拟物理捕获的环境映射。
链接: https://arxiv.org/abs/2412.11883
作者: Ian J. Maquignaz
机构: Université Laval(拉瓦尔大学)
关键词: Accurate environment maps, Extended Dynamic Range, rendering photorealistic outdoor, captured HDR imagery, Accurate environment
类目: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
备注:
点击查看摘要
Abstract:Accurate environment maps are a key component in rendering photorealistic outdoor scenes with coherent illumination. They enable captivating visual arts, immersive virtual reality and a wide range of engineering and scientific applications. Recent works have extended sky-models to be more comprehensive and inclusive of cloud formations but existing approaches fall short in faithfully recreating key-characteristics in physically captured HDRI. As we demonstrate, environment maps produced by sky-models do not relight scenes with the same tones, shadows, and illumination coherence as physically captured HDR imagery. Though the visual quality of DNN-generated LDR and HDR imagery has greatly progressed in recent years, we demonstrate this progress to be tangential to sky-modelling. Due to the Extended Dynamic Range (EDR) of 14EV required for outdoor environment maps inclusive of the sun, sky-modelling extends beyond the conventional paradigm of High Dynamic Range Imagery (HDRI). In this work, we propose an all-weather sky-model, learning weathered-skies directly from physically captured HDR imagery. Per user-controlled positioning of the sun and cloud formations, our model (AllSky) allows for emulation of physically captured environment maps with improved retention of the Extended Dynamic Range (EDR) of the sky.
zh
[CV-26] Event-based Motion Deblurring via Multi-Temporal Granularity Fusion
【速读】: 该论文试图解决传统基于帧的相机在曝光时间内因运动产生的模糊问题,并提出了一种基于事件相机的高效去模糊方法。解决方案的关键在于引入基于点云的事件表示(point cloud-based event representation),并提出多时间粒度网络(Multi-Temporal Granularity Network, MTGNet)。该网络结合了空间密集但时间粗粒度的体素表示(voxel-based event representation)和时间细粒度但空间稀疏的点云表示,通过细粒度点分支(Fine-grained Point Branch)实现两种表示的互补融合。此外,论文设计了聚合与映射模块(Aggregation and Mapping Module, AMM)来对齐点云特征与帧特征,并通过自适应特征扩散模块(Adaptive Feature Diffusion Module, AFDM)处理事件数据与图像数据之间的分辨率差异,从而提升去模糊性能。
链接: https://arxiv.org/abs/2412.11866
作者: Xiaopeng Lin,Hongwei Ren,Yulong Huang,Zunchang Liu,Yue Zhou,Haotian Fu,Biao Pan,Bojun Cheng
机构: Hong Kong University of Science and Technology (Guangzhou)(香港科技大学(广州)); Beihang University (北京航空航天大学)
关键词: inevitably produce blurry, produce blurry effects, blurry effects due, Conventional frame-based cameras, cameras inevitably produce
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 12 pages, 8 figures
点击查看摘要
Abstract:Conventional frame-based cameras inevitably produce blurry effects due to motion occurring during the exposure time. Event camera, a bio-inspired sensor offering continuous visual information could enhance the deblurring performance. Effectively utilizing the high-temporal-resolution event data is crucial for extracting precise motion information and enhancing deblurring performance. However, existing event-based image deblurring methods usually utilize voxel-based event representations, losing the fine-grained temporal details that are mathematically essential for fast motion deblurring. In this paper, we first introduce point cloud-based event representation into the image deblurring task and propose a Multi-Temporal Granularity Network (MTGNet). It combines the spatially dense but temporally coarse-grained voxel-based event representation and the temporally fine-grained but spatially sparse point cloud-based event. To seamlessly integrate such complementary representations, we design a Fine-grained Point Branch. An Aggregation and Mapping Module (AMM) is proposed to align the low-level point-based features with frame-based features and an Adaptive Feature Diffusion Module (AFDM) is designed to manage the resolution discrepancies between event data and image data by enriching the sparse point feature. Extensive subjective and objective evaluations demonstrate that our method outperforms current state-of-the-art approaches on both synthetic and real-world datasets.
zh
[CV-27] Sonar-based Deep Learning in Underwater Robotics: Overview Robustness and Challenges
【速读】: 该论文旨在解决水下自主航行器(AUVs)中基于声呐(sonar)的深度学习(DL)模型在实际应用中的鲁棒性问题。由于水下环境主要依赖声呐技术,且存在训练数据有限和固有噪声等挑战,现有的视觉DL模型在实际部署中可能引发安全隐患。论文的关键解决方案包括:系统性地研究声呐感知任务模型(如分类、目标检测、分割和SLAM),梳理当前最先进的声呐数据集、仿真器及鲁棒性方法(如神经网络验证、分布外数据处理和对抗攻击),并提出未来研究方向,特别是建立基准声呐数据集和弥合仿真与现实之间的差距。
链接: https://arxiv.org/abs/2412.11840
作者: Martin Aubard,Ana Madureira,Luís Teixeira,José Pinto
机构: OceanScan Marine Systems & Technology, 4450-718 Matosinhos, Portugal; INESC INOV-Lab and ISRC (ISEP/P.PORTO), 4249-015 Porto, Portugal; Faculty of Engineering, University of Porto (FEUP), 4200-465 Porto, Portugal
关键词: Autonomous Underwater Vehicles, onboard Deep Learning, Autonomous Underwater, Underwater Vehicles, exploration and monitoring
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV); Signal Processing (eess.SP)
备注:
点击查看摘要
Abstract:With the growing interest in underwater exploration and monitoring, Autonomous Underwater Vehicles (AUVs) have become essential. The recent interest in onboard Deep Learning (DL) has advanced real-time environmental interaction capabilities relying on efficient and accurate vision-based DL models. However, the predominant use of sonar in underwater environments, characterized by limited training data and inherent noise, poses challenges to model robustness. This autonomy improvement raises safety concerns for deploying such models during underwater operations, potentially leading to hazardous situations. This paper aims to provide the first comprehensive overview of sonar-based DL under the scope of robustness. It studies sonar-based DL perception task models, such as classification, object detection, segmentation, and SLAM. Furthermore, the paper systematizes sonar-based state-of-the-art datasets, simulators, and robustness methods such as neural network verification, out-of-distribution, and adversarial attacks. This paper highlights the lack of robustness in sonar-based DL research and suggests future research pathways, notably establishing a baseline sonar-based dataset and bridging the simulation-to-reality gap.
zh
[CV-28] UnMA-CapSumT: Unified and Multi-Head Attention-driven Caption Summarization Transformer
【速读】: 该论文试图解决图像描述生成中的词汇外问题(out-of-vocabulary)和重复问题,并提出一种能够同时生成事实性和风格化描述的统一框架。解决方案的关键在于提出了一个名为Unified Attention and Multi-Head Attention-driven Caption Summarization Transformer (UnMA-CapSumT)的框架,该框架结合了基于Modified Adaptive Attention的图像描述模型(MAA-FIC)生成的事实性描述和基于Style Factored Bi-LSTM with attention(SF-Bi-ALSTM)的风格化描述模型生成的浪漫和幽默风格描述。通过引入fastText与Attention Word Embedding(fTA-WE)以及指针生成网络(pointer-generator network)与覆盖机制(coverage mechanism),UnMA-CapSumT能够有效解决词汇外问题和重复问题,并生成风格丰富且连贯的总结性描述。
链接: https://arxiv.org/abs/2412.11836
作者: Dhruv Sharma,Chhavi Dhiman,Dinesh Kumar
机构: 未知
关键词: increased immense popularity, stylized image captioning, natural language descriptions, Image captioning, image captioning model
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Image captioning is the generation of natural language descriptions of images, which has gained immense popularity in the recent past. With this, different deep-learning techniques have been devised for the development of factual and stylized image captioning models. Previous models focused more on the generation of factual and stylized captions separately, providing more than one caption for a single image. The descriptions generated from these suffer from out-of-vocabulary and repetition issues. To the best of our knowledge, no such work exists that provides a description integrating different captioning methods to describe the contents of an image with factual and stylized (romantic and humorous) elements. To overcome these limitations, this paper presents a novel Unified Attention and Multi-Head Attention-driven Caption Summarization Transformer (UnMA-CapSumT) based Captioning Framework. It utilizes both factual captions and stylized captions generated by the Modified Adaptive Attention-based factual image captioning model (MAA-FIC) and Style Factored Bi-LSTM with attention (SF-Bi-ALSTM) driven stylized image captioning model respectively. The SF-Bi-ALSTM-based stylized IC model generates two prominent styles of expression: romance and humor. The proposed summarizer UnMHA-ST combines both factual and stylized descriptions of an input image to generate styled rich coherent summarized captions. The proposed UnMHA-ST transformer learns and summarizes different linguistic styles efficiently by incorporating proposed word embedding fastText with Attention Word Embedding (fTA-WE) and pointer-generator network with coverage mechanism concept to solve the out-of-vocabulary issues and repetition problem. Extensive experiments are conducted on Flickr8K and a subset of FlickrStyle10K with supporting ablation studies to prove the efficiency and efficacy of the proposed framework.
zh
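下面给出覆盖机制(coverage mechanism)核心计算的示意:覆盖向量是历史注意力之和,覆盖损失为 Σ_i min(a_t,i, c_t,i),用于惩罚重复关注同一源词、从而缓解重复问题。注意力分布在此随机生成,仅演示累积方式,非论文官方实现。

```python
# 示意性代码:指针生成网络中覆盖损失的累积计算
import torch

torch.manual_seed(0)
src_len, dec_steps = 10, 6
coverage = torch.zeros(src_len)
cov_loss = 0.0
for t in range(dec_steps):
    attn = torch.softmax(torch.randn(src_len), dim=0)    # 第 t 步解码的注意力分布 a_t
    cov_loss = cov_loss + torch.minimum(attn, coverage).sum()
    coverage = coverage + attn                            # 更新覆盖向量 c_{t+1}
print(cov_loss)   # 加权后与交叉熵等主损失相加即可
```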
[CV-29] Spatiotemporal Blind-Spot Network with Calibrated Flow Alignment for Self-Supervised Video Denoising
【速读】: 该论文试图解决自监督视频去噪中存在的两个主要问题:一是现有方法对帧间和帧内信息的利用不足,二是光学流对齐在自监督条件下的潜力未被充分挖掘。解决方案的关键在于引入时空盲点网络 (SpatioTemporal Blind-spot Network, STBN),通过双向盲点特征传播和盲点对齐块实现精确的时序对齐,并利用空间感受野扩展模块增强全局感知能力。此外,论文提出了一种无监督的光流蒸馏机制,以减少噪声对光流估计的敏感性,从而在光流对齐过程中优化细粒度的帧间交互。这些创新使得该方法在合成和真实世界的视频去噪数据集上表现出优越的性能。
链接: https://arxiv.org/abs/2412.11820
作者: Zikang Chen,Tao Jiang,Xiaowan Hu,Wang Zhang,Huaqiu Li,Haoqian Wang
机构: 未知
关键词: ground truth data, optical flow, recover clean frames, optical flow alignment, truth data
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Self-supervised video denoising aims to remove noise from videos without relying on ground truth data, leveraging the video itself to recover clean frames. Existing methods often rely on simplistic feature stacking or apply optical flow without thorough analysis. This results in suboptimal utilization of both inter-frame and intra-frame information, and it also neglects the potential of optical flow alignment under self-supervised conditions, leading to biased and insufficient denoising outcomes. To this end, we first explore the practicality of optical flow in the self-supervised setting and introduce a SpatioTemporal Blind-spot Network (STBN) for global frame feature utilization. In the temporal domain, we utilize bidirectional blind-spot feature propagation through the proposed blind-spot alignment block to ensure accurate temporal alignment and effectively capture long-range dependencies. In the spatial domain, we introduce the spatial receptive field expansion module, which enhances the receptive field and improves global perception capabilities. Additionally, to reduce the sensitivity of optical flow estimation to noise, we propose an unsupervised optical flow distillation mechanism that refines fine-grained inter-frame interactions during optical flow alignment. Our method demonstrates superior performance across both synthetic and real-world video denoising datasets. The source code is publicly available at this https URL.
zh
[CV-30] HiGDA: Hierarchical Graph of Nodes to Learn Local-to-Global Topology for Semi-Supervised Domain Adaptation AAAI2025
【速读】: 该论文试图解决深度学习模型在域迁移(domain shift)条件下表现不佳的问题,特别是在源域和目标域数据分布不同的情况下。解决方案的关键在于引入了一种分层节点图(Hierarchical Graph of Nodes, HiGDA),该方法在特征层和类别层同时进行表示。在特征层,通过局部图识别图像中最相关的区域,增强对主要对象的表示;在类别层,通过全局图聚合同一类别样本的特征,丰富整体表示。这种双层表示方法显著提升了半监督域适应(semi-supervised domain adaptation, SSDA)任务中的分类性能,并在多个基准数据集上验证了其有效性,成为新的最先进方法。
链接: https://arxiv.org/abs/2412.11819
作者: Ba Hung Ngo,Doanh C. Bui,Nhat-Tuong Do-Tran,Tae Jong Choi
机构: KAIST(韩国科学技术院); University of Science and Technology of Hanoi(河内科学与技术大学)
关键词: enhanced representational power, attracted significant interest, deep learning models, recent years, enhanced representational
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted for presentation at AAAI2025
点击查看摘要
Abstract:The enhanced representational power and broad applicability of deep learning models have attracted significant interest from the research community in recent years. However, these models often struggle to perform effectively under domain shift conditions, where the training data (the source domain) is related to but exhibits different distributions from the testing data (the target domain). To address this challenge, previous studies have attempted to reduce the domain gap between source and target data by incorporating a few labeled target samples during training - a technique known as semi-supervised domain adaptation (SSDA). While this strategy has demonstrated notable improvements in classification performance, the network architectures used in these approaches primarily focus on exploiting the features of individual images, leaving room for improvement in capturing rich representations. In this study, we introduce a Hierarchical Graph of Nodes designed to simultaneously present representations at both feature and category levels. At the feature level, we introduce a local graph to identify the most relevant patches within an image, facilitating adaptability to defined main object representations. At the category level, we employ a global graph to aggregate the features from samples within the same category, thereby enriching overall representations. Extensive experiments on widely used SSDA benchmark datasets, including Office-Home, DomainNet, and VisDA2017, demonstrate that both quantitative and qualitative results substantiate the effectiveness of HiGDA, establishing it as a new state-of-the-art method.
zh
[CV-31] ColorFlow: Retrieval-Augmented Image Sequence Colorization
【速读】: 该论文试图解决自动黑白图像序列着色过程中保持角色和物体身份一致性(identity consistency)的复杂问题,特别是在卡通或漫画系列着色等工业应用中。解决方案的关键在于提出了一种名为ColorFlow的三阶段扩散模型框架,该框架通过一种新颖的检索增强着色(Retrieval Augmented Colorization)流程,避免了现有方法中对每个身份进行微调或显式身份嵌入提取的需求。ColorFlow采用双分支设计,一个分支用于颜色身份提取,另一个用于着色,充分利用了扩散模型的自注意力机制进行强大的上下文学习和颜色身份匹配。此外,论文还引入了ColorFlow-Bench基准测试,以全面评估模型在基于参考的着色任务中的性能。
链接: https://arxiv.org/abs/2412.11815
作者: Junhao Zhuang,Xuan Ju,Zhaoyang Zhang,Yong Liu,Shiyi Zhang,Chun Yuan,Ying Shan
机构: Tsinghua University(清华大学); ARC Lab, Tencent PCG(ARC实验室, 腾讯PCG)
关键词: significant market demand, comic series colorization, image sequence colorization, market demand, preserving character
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project Page: this https URL
点击查看摘要
Abstract:Automatic black-and-white image sequence colorization while preserving character and object identity (ID) is a complex task with significant market demand, such as in cartoon or comic series colorization. Despite advancements in visual colorization using large-scale generative models like diffusion models, challenges with controllability and identity consistency persist, making current solutions unsuitable for industrial applications. To address this, we propose ColorFlow, a three-stage diffusion-based framework tailored for image sequence colorization in industrial applications. Unlike existing methods that require per-ID finetuning or explicit ID embedding extraction, we propose a novel robust and generalizable Retrieval Augmented Colorization pipeline for colorizing images with relevant color references. Our pipeline also features a dual-branch design: one branch for color identity extraction and the other for colorization, leveraging the strengths of diffusion models. We utilize the self-attention mechanism in diffusion models for strong in-context learning and color identity matching. To evaluate our model, we introduce ColorFlow-Bench, a comprehensive benchmark for reference-based colorization. Results show that ColorFlow outperforms existing models across multiple metrics, setting a new standard in sequential image colorization and potentially benefiting the art industry. We release our codes and models on our project page: this https URL.
zh
[CV-32] Designing Semi-Structured Pruning of Graph Convolutional Networks for Skeleton-based Recognition
【速读】: 该论文试图解决在边缘设备上部署深度神经网络(DNNs)时面临的资源限制问题,特别是如何在有限的时间和内存资源下设计轻量级且高效的网络结构。解决方案的关键在于提出了一种新颖的半结构化剪枝方法(semi-structured pruning),该方法结合了结构化剪枝和非结构化剪枝的优点,同时避免了各自的缺点。具体来说,该方法通过可微分的级联参数化实现,包括:(i) 基于权重幅度的带阻机制(band-stop mechanism)进行剪枝,(ii) 权重共享参数化(weight-sharing parametrization)实现个体或分组剪枝,以及 (iii) 用于仲裁分组和个体剪枝的门控机制(gating mechanism)。这些级联参数化基于一个共同的潜在张量,并通过端到端的训练过程,结合分类损失和代理张量秩正则化进行优化。实验结果表明,该半结构化剪枝方法在动作和手势识别等挑战性任务中,相较于单独使用结构化或非结构化剪枝方法,具有显著优势。
链接: https://arxiv.org/abs/2412.11813
作者: Hichem Sahbi
机构: Sorbonne University(索邦大学); CNRS(法国国家科学研究中心); LIP6(巴黎第六大学)
关键词: Deep neural networks, Deep neural, nowadays witnessing, witnessing a major, major success
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Deep neural networks (DNNs) are nowadays witnessing a major success in solving many pattern recognition tasks including skeleton-based classification. The deployment of DNNs on edge-devices, endowed with limited time and memory resources, requires designing lightweight and efficient variants of these networks. Pruning is one of the lightweight network design techniques that operate by removing unnecessary network parts, in a structured or an unstructured manner, including individual weights, neurons or even entire channels. Nonetheless, structured and unstructured pruning methods, when applied separately, may either be inefficient or ineffective. In this paper, we devise a novel semi-structured method that discards the downsides of structured and unstructured pruning while gathering their upsides to some extent. The proposed solution is based on a differentiable cascaded parametrization which combines (i) a band-stop mechanism that prunes weights depending on their magnitudes, (ii) a weight-sharing parametrization that prunes connections either individually or group-wise, and (iii) a gating mechanism which arbitrates between different group-wise and entry-wise pruning. All these cascaded parametrizations are built upon a common latent tensor which is trained end-to-end by minimizing a classification loss and a surrogate tensor rank regularizer. Extensive experiments, conducted on the challenging tasks of action and hand-gesture recognition, show the clear advantage of our proposed semi-structured pruning approach against both structured and unstructured pruning, when taken separately, as well as the related work.
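To make the three ingredients above concrete, here is a minimal, hedged sketch (not the paper's actual parametrization): a magnitude-driven soft "band-stop" mask is mixed with a per-channel group mask through a learnable gate. The threshold, temperature, and mixing form are illustrative assumptions.

```python
# Illustrative sketch only; the learnable threshold, temperature and gating form are assumptions.
import torch
import torch.nn as nn

class SemiStructuredMask(nn.Module):
    def __init__(self, temperature=20.0):
        super().__init__()
        self.threshold = nn.Parameter(torch.tensor(1e-2))   # assumed learnable magnitude cut-off
        self.gate = nn.Parameter(torch.zeros(1))             # arbitrates entry-wise vs. group-wise pruning
        self.temperature = temperature

    def forward(self, weight):                                # weight: (out_ch, in_ch)
        # Entry-wise soft mask: keeps weights whose magnitude exceeds the threshold.
        entry_mask = torch.sigmoid(self.temperature * (weight.abs() - self.threshold))
        # Group-wise (per output channel) mask driven by the channel's mean magnitude.
        group_score = weight.abs().mean(dim=1, keepdim=True)
        group_mask = torch.sigmoid(self.temperature * (group_score - self.threshold))
        # Gating mechanism mixes the two pruning granularities.
        g = torch.sigmoid(self.gate)
        return g * entry_mask + (1.0 - g) * group_mask.expand_as(weight)

layer = nn.Linear(64, 32, bias=False)
mask = SemiStructuredMask()
pruned_weight = layer.weight * mask(layer.weight)            # soft-masked weights used in the forward pass
print(pruned_weight.shape)
```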
zh
[CV-33] CLDA-YOLO: Visual Contrastive Learning Based Domain Adaptive YOLO Detector
【速读】: 该论文试图解决在域迁移(domain shift)条件下,单阶段目标检测器(如 YOLO)性能提升有限的问题。解决方案的关键在于构建了一个基于教师-学生协同系统的无监督域自适应(Unsupervised Domain Adaptive, UDA)架构,并提出了不确定性学习(uncertainty learning)来处理教师模型生成的伪标签中的极端不确定性,同时利用动态数据增强(dynamic data augmentation)逐步适应环境。此外,论文采用统一的视觉对比学习(visual contrastive learning)范式,分别在骨干网络和检测头进行实例对齐,从而提高跨域任务中的检测器鲁棒性。最终,提出的CLDA-YOLO在多个域自适应数据集上实现了具有竞争力的结果,且未降低推理速度。
链接: https://arxiv.org/abs/2412.11812
作者: Tianheng Qiu,Ka Lung Law,Guanghua Pan,Jufei Wang,Xin Gao,Xuan Huang,Hu Wei
机构: University of Science and Technology of China; Hefei Institutes of Physical Science, Chinese Academy of Sciences; SenseTime Research; Tsinghua University
关键词: Unsupervised domain adaptive, domain adaptive YOLO, domain adaptive, adaptive YOLO detector, YOLO detector
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Unsupervised domain adaptive (UDA) algorithms can markedly enhance the performance of object detectors under conditions of domain shifts, thereby reducing the necessity for extensive labeling and retraining. Current domain adaptive object detection algorithms primarily cater to two-stage detectors, which tend to offer minimal improvements when directly applied to single-stage detectors such as YOLO. Intending to benefit the YOLO detector from UDA, we build a comprehensive domain adaptive architecture using a teacher-student cooperative system for the YOLO detector. In this process, we propose uncertainty learning to cope with pseudo-labeling generated by the teacher model with extreme uncertainty and leverage dynamic data augmentation to asymptotically adapt the teacher-student system to the environment. To address the inability of single-stage object detectors to align at multiple stages, we utilize a unified visual contrastive learning paradigm that aligns instance at backbone and head respectively, which steadily improves the robustness of the detectors in cross-domain tasks. In summary, we present an unsupervised domain adaptive YOLO detector based on visual contrastive learning (CLDA-YOLO), which achieves highly competitive results across multiple domain adaptive datasets without any reduction in inference speed.
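As a rough, hedged illustration of the teacher-student loop described above (detector internals omitted), the sketch below shows an EMA teacher update and a simple confidence filter standing in for the paper's uncertainty learning; the momentum and threshold values are assumptions.

```python
# Minimal sketch of the teacher-student idea; a YOLO detector is replaced by a placeholder module.
import copy
import torch

@torch.no_grad()
def ema_update(teacher, student, momentum=0.999):
    # Teacher weights track an exponential moving average of the student weights.
    for t_param, s_param in zip(teacher.parameters(), student.parameters()):
        t_param.mul_(momentum).add_(s_param, alpha=1.0 - momentum)

def filter_pseudo_labels(boxes, scores, conf_thresh=0.8):
    # Confidence filtering here stands in for the paper's uncertainty learning (assumption).
    keep = scores >= conf_thresh
    return boxes[keep], scores[keep]

student = torch.nn.Linear(16, 4)          # placeholder for the student detector
teacher = copy.deepcopy(student)          # frozen teacher, updated only via EMA
for p in teacher.parameters():
    p.requires_grad_(False)

ema_update(teacher, student)
boxes, scores = torch.rand(10, 4), torch.rand(10)
kept_boxes, kept_scores = filter_pseudo_labels(boxes, scores)
print(kept_boxes.shape)
```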
zh
[CV-34] PhysAug: A Physical-guided and Frequency-based Data Augmentation for Single-Domain Generalized Object Detection
【速读】: 该论文试图解决单域泛化目标检测 (Single-Domain Generalized Object Detection, S-DGOD) 中,由于缺乏真实世界先验知识导致的数据增强策略无法有效提升训练数据分布多样性的问题。解决方案的关键在于提出了基于物理模型的非理想成像条件数据增强方法 PhysAug。该方法通过大气光学原理构建了一个通用的扰动模型,利用图像频谱模拟真实世界的视觉扰动,从而在训练过程中增强检测器的泛化能力,使其能够学习到域不变的特征表示。该方法在不改变网络架构或损失函数的情况下,显著提升了在多个 S-DGOD 数据集上的性能,特别是在 DWD 和 Cityscape-C 数据集上分别提升了 7.3% 和 7.2%。
链接: https://arxiv.org/abs/2412.11807
作者: Xiaoran Xu,Jiangang Yang,Wenhui Shi,Siyuan Ding,Luqing Luo,Jian Liu
机构: 1. School of Computer Science and Technology, Soochow University(苏州大学计算机科学与技术学院);
2. Collaborative Innovation Center of Novel Software Technology and Industrialization(新型软件技术与产业化协同创新中心)
关键词: Generalized Object Detection, Single-Domain Generalized Object, single source domain, unseen target domains, Object Detection
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:Single-Domain Generalized Object Detection (S-DGOD) aims to train on a single source domain for robust performance across a variety of unseen target domains by taking advantage of an object detector. Existing S-DGOD approaches often rely on data augmentation strategies, including a composition of visual transformations, to enhance the detector’s generalization ability. However, the absence of real-world prior knowledge hinders data augmentation from contributing to the diversity of training data distributions. To address this issue, we propose PhysAug, a novel physical model-based non-ideal imaging condition data augmentation method, to enhance the adaptability of the S-DGOD tasks. Drawing upon the principles of atmospheric optics, we develop a universal perturbation model that serves as the foundation for our proposed PhysAug. Given that visual perturbations typically arise from the interaction of light with atmospheric particles, the image frequency spectrum is harnessed to simulate real-world variations during training. This approach encourages the detector to learn domain-invariant representations, thereby enhancing its ability to generalize across various settings. Without altering the network architecture or loss function, our approach significantly outperforms the state-of-the-art across various S-DGOD datasets. In particular, it achieves a substantial improvement of 7.3% and 7.2% over the baseline on DWD and Cityscape-C, highlighting its enhanced generalizability in real-world settings.
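The frequency-spectrum idea can be pictured with a toy augmentation: FFT the image, perturb the amplitude spectrum, and invert. The multiplicative Gaussian spectral noise and the `strength` parameter below are illustrative assumptions, not the paper's atmospheric-optics model.

```python
# Toy frequency-domain augmentation sketch; the noise model is an assumption, not PhysAug itself.
import numpy as np

def freq_perturb(img, strength=0.3):
    """img: float32 array in [0, 1], shape (H, W, C)."""
    out = np.empty_like(img)
    for c in range(img.shape[2]):
        spec = np.fft.fft2(img[..., c])
        amp, phase = np.abs(spec), np.angle(spec)
        noise = 1.0 + strength * np.random.randn(*amp.shape)   # multiplicative spectral noise (assumed form)
        amp = amp * noise
        out[..., c] = np.real(np.fft.ifft2(amp * np.exp(1j * phase)))
    return np.clip(out, 0.0, 1.0)

img = np.random.rand(64, 64, 3).astype(np.float32)
aug = freq_perturb(img)
print(aug.shape, aug.dtype)
```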
zh
[CV-35] AMI-Net: Adaptive Mask Inpainting Network for Industrial Anomaly Detection and Localization
【速读】: 该论文试图解决无监督视觉异常检测中,现有重构方法在恢复异常区域时效果不佳的问题。解决方案的关键在于提出了一种新的自适应掩码修复网络 (Adaptive Mask Inpainting Network, AMI-Net),通过预训练网络提取多尺度语义特征作为重构目标,并采用随机位置和数量掩码的训练策略。此外,论文还引入了一个创新的适应性掩码生成器,能够生成适应性掩码,有效遮蔽异常区域同时保留正常区域,从而利用可见的正常全局上下文信息来恢复被遮蔽的异常区域,显著抑制缺陷的重构。
链接: https://arxiv.org/abs/2412.11802
作者: Wei Luo,Haiming Yao,Wenyong Yu,Zhengyong Li
机构: State Key Laboratory of Precision Measurement Technology and Instruments, Department of Precision Instrument, Tsinghua University, Beijing 100084, China; State Key Laboratory of Digital Manufacturing Equipment and Technology, School of Mechanical Science and Engineering, Huazhong University of Science and Technology, Wuhan 430074, China
关键词: Unsupervised visual anomaly, enhancing industrial production, industrial production quality, visual anomaly detection, quality and efficiency
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Accepted by IEEE Transactions on Automation Science and Engineering. Code is available at: this https URL
点击查看摘要
Abstract:Unsupervised visual anomaly detection is crucial for enhancing industrial production quality and efficiency. Among unsupervised methods, reconstruction approaches are popular due to their simplicity and effectiveness. The key aspect of reconstruction methods lies in the restoration of anomalous regions, which current methods have not satisfactorily achieved. To tackle this issue, we introduce a novel Adaptive Mask Inpainting Network (AMI-Net) from the perspective of adaptive mask-inpainting. In contrast to traditional reconstruction methods that treat non-semantic image pixels as targets, our method uses a pre-trained network to extract multi-scale semantic features as reconstruction targets. Given the multiscale nature of industrial defects, we incorporate a training strategy involving random positional and quantitative masking. Moreover, we propose an innovative adaptive mask generator capable of generating adaptive masks that effectively mask anomalous regions while preserving normal regions. In this manner, the model can leverage the visible normal global contextual information to restore the masked anomalous regions, thereby effectively suppressing the reconstruction of defects. Extensive experimental results on the MVTec AD and BTAD industrial datasets validate the effectiveness of the proposed method. Additionally, AMI-Net exhibits exceptional real-time performance, striking a favorable balance between detection accuracy and speed, rendering it highly suitable for industrial applications. Code is available at: this https URL
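A hedged sketch of the random positional and quantitative masking strategy mentioned above: both the masked positions and the masking ratio are resampled on every call. The token layout and the ratio range are assumptions.

```python
# Random positional and quantitative masking over feature tokens; ratio range is an assumption.
import torch

def random_mask(tokens, ratio_range=(0.2, 0.8)):
    """tokens: (B, N, C) multi-scale feature tokens flattened to a sequence."""
    B, N, _ = tokens.shape
    ratio = torch.empty(1).uniform_(*ratio_range).item()       # random masking quantity
    num_mask = int(N * ratio)
    mask = torch.zeros(B, N, dtype=torch.bool)
    for b in range(B):
        idx = torch.randperm(N)[:num_mask]                     # random masking positions
        mask[b, idx] = True
    masked = tokens.masked_fill(mask.unsqueeze(-1), 0.0)       # zero out the masked tokens
    return masked, mask

feats = torch.randn(2, 196, 256)
masked_feats, mask = random_mask(feats)
print(mask.float().mean().item())                              # realized masking ratio
```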
zh
[CV-36] Neural Collapse Inspired Knowledge Distillation AAAI2025
【速读】: 该论文试图解决现有知识蒸馏(Knowledge Distillation, KD)方法中教师网络与学生网络之间的知识差距问题,这一差距可能影响蒸馏过程的效果。解决方案的关键在于将神经崩溃(Neural Collapse, NC)结构引入KD框架,通过将教师的NC结构传递给学生,促使学生学习教师的几何结构,从而缓解知识差距并提升学生网络的泛化能力。具体而言,论文提出了一种新的蒸馏范式,称为神经崩溃启发式知识蒸馏(Neural Collapse-inspired Knowledge Distillation, NCKD),通过实验验证了其简单有效性,并实现了最先进的准确性表现。
链接: https://arxiv.org/abs/2412.11788
作者: Shuoxi Zhang,Zijian Song,Kun He
机构: 未知
关键词: demonstrated their ability, knowledge distillation, distillation, Neural Collapse, Existing knowledge distillation
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 13 pages, 7 figures. Accepted by AAAI 2025
点击查看摘要
Abstract:Existing knowledge distillation (KD) methods have demonstrated their ability in achieving student network performance on par with their teachers. However, the knowledge gap between the teacher and student remains significant and may hinder the effectiveness of the distillation process. In this work, we introduce the structure of Neural Collapse (NC) into the KD framework. NC typically occurs in the final phase of training, resulting in a graceful geometric structure where the last-layer features form a simplex equiangular tight frame. Such phenomenon has improved the generalization of deep network training. We hypothesize that NC can also alleviate the knowledge gap in distillation, thereby enhancing student performance. This paper begins with an empirical analysis to bridge the connection between knowledge distillation and neural collapse. Through this analysis, we establish that transferring the teacher’s NC structure to the student benefits the distillation process. Therefore, instead of merely transferring instance-level logits or features, as done by existing distillation methods, we encourage students to learn the teacher’s NC structure. Thereby, we propose a new distillation paradigm termed Neural Collapse-inspired Knowledge Distillation (NCKD). Comprehensive experiments demonstrate that NCKD is simple yet effective, improving the generalization of all distilled student models and achieving state-of-the-art accuracy performance.
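For readers unfamiliar with the geometry involved, the snippet below constructs a simplex equiangular tight frame (ETF), the structure that neural collapse converges to; methods in this vein can use such a frame as a structural target. This is a generic construction, not the paper's training code.

```python
# Generic simplex ETF construction; pairwise cosine between class directions is -1/(K-1).
import torch

def simplex_etf(num_classes, feat_dim):
    assert feat_dim >= num_classes
    # Random orthonormal columns U of shape (feat_dim, num_classes).
    u, _ = torch.linalg.qr(torch.randn(feat_dim, num_classes))
    center = torch.eye(num_classes) - torch.ones(num_classes, num_classes) / num_classes
    etf = (num_classes / (num_classes - 1)) ** 0.5 * u @ center
    return etf                                               # columns are the K class directions

K, d = 10, 64
etf = simplex_etf(K, d)
cols = etf / etf.norm(dim=0, keepdim=True)
cosine = cols.T @ cols
print(cosine[0, 1].item())                                   # approximately -1/(K-1)
```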
zh
[CV-37] InterDyn: Controllable Interactive Dynamics with Video Diffusion Models
【速读】: 该论文试图解决现有方法在复杂、真实世界环境中预测交互物体动态的局限性问题。解决方案的关键在于提出了一个名为InterDyn的新框架,该框架利用大规模视频数据训练的生成模型,通过学习交互动态,生成基于初始帧和控制信号的视频序列。其核心洞察在于,大型视频基础模型可以同时作为神经渲染器和隐式物理模拟器,通过引入交互控制机制,条件化视频生成过程以适应驱动实体的运动。这使得InterDyn能够生成合理、时间一致的复杂物体交互视频,并在未见过的物体上表现出良好的泛化能力。
链接: https://arxiv.org/abs/2412.11785
作者: Rick Akkerman,Haiwen Feng,Michael J. Black,Dimitrios Tzionas,Victoria Fernández Abrevaya
机构: Max Planck Institute for Intelligent Systems, Tübingen, Germany; University of Amsterdam, the Netherlands
关键词: intelligent systems, humans and intelligent, Predicting, Predicting the dynamics, interacting objects
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Predicting the dynamics of interacting objects is essential for both humans and intelligent systems. However, existing approaches are limited to simplified, toy settings and lack generalizability to complex, real-world environments. Recent advances in generative models have enabled the prediction of state transitions based on interventions, but focus on generating a single future state which neglects the continuous motion and subsequent dynamics resulting from the interaction. To address this gap, we propose InterDyn, a novel framework that generates videos of interactive dynamics given an initial frame and a control signal encoding the motion of a driving object or actor. Our key insight is that large video foundation models can act as both neural renderers and implicit physics simulators by learning interactive dynamics from large-scale video data. To effectively harness this capability, we introduce an interactive control mechanism that conditions the video generation process on the motion of the driving entity. Qualitative results demonstrate that InterDyn generates plausible, temporally consistent videos of complex object interactions while generalizing to unseen objects. Quantitative evaluations show that InterDyn outperforms baselines that focus on static state transitions. This work highlights the potential of leveraging video generative models as implicit physics engines.
zh
[CV-38] Impact of Face Alignment on Face Image Quality
【速读】: 该论文试图解决的问题是评估面部对齐(face alignment)对面部图像质量(face image quality)的影响。解决方案的关键在于通过实验验证不同面部对齐方法(如MTCNN和RetinaFace)对面部图像质量评估方法(如SER-FIQ、FaceQAN、DifFIQA和SDD-FIQA)的影响。研究发现,面部图像质量评估方法对对齐敏感,尤其是在现实生活条件下的挑战性场景中,这强调了在质量评估中考虑对齐的重要性。
链接: https://arxiv.org/abs/2412.11779
作者: Eren Onaran,Erdi Sarıtaş,Hazım Kemal Ekenel
机构: 未知
关键词: face image quality, Face, face image, alignment, facial analysis tasks
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted at EAI ROSENET 2024 - 8th EAI International Conference on Robotic Sensor Networks
点击查看摘要
Abstract:Face alignment is a crucial step in preparing face images for feature extraction in facial analysis tasks. For applications such as face recognition, facial expression recognition, and facial attribute classification, alignment is widely utilized during both training and inference to standardize the positions of key landmarks in the face. It is well known that the application and method of face alignment significantly affect the performance of facial analysis models. However, the impact of alignment on face image quality has not been thoroughly investigated. Current FIQA studies often assume alignment as a prerequisite but do not explicitly evaluate how alignment affects quality metrics, especially with the advent of modern deep learning-based detectors that integrate detection and landmark localization. To address this need, our study examines the impact of face alignment on face image quality scores. We conducted experiments on the LFW, IJB-B, and SCFace datasets, employing MTCNN and RetinaFace models for face detection and alignment. To evaluate face image quality, we utilized several assessment methods, including SER-FIQ, FaceQAN, DifFIQA, and SDD-FIQA. Our analysis included examining quality score distributions for the LFW and IJB-B datasets and analyzing average quality scores at varying distances in the SCFace dataset. Our findings reveal that face image quality assessment methods are sensitive to alignment. Moreover, this sensitivity increases under challenging real-life conditions, highlighting the importance of evaluating alignment’s role in quality assessment.
zh
[CV-39] IDEA-Bench: How Far are Generative Models from Professional Designing?
【速读】: 该论文试图解决现有视觉生成模型在处理复杂、多样化的真实世界设计任务(如绘本创作、电影故事板开发、照片修图、视觉特效和字体转换)时面临的挑战。这些任务需要对指令、描述和参考图像进行深度解读和元素提取,而现有模型在这些专业设计场景中表现不佳,尤其是在涉及多种输入和输出的情况下,即使使用ControlNets和LoRAs等适配器进行增强。解决方案的关键在于引入IDEA-Bench,这是一个包含100个真实世界设计任务的综合基准,涵盖渲染、视觉特效、故事板、绘本、字体、风格化生成和身份保持生成等,共有275个测试案例,用于全面评估模型的通用生成能力。论文通过详细分析现有模型的表现(最高得分仅为22.48,通用模型仅为6.81),指出了改进方向,并提供了基于多模态大语言模型(MLLM)的自动评估技术,以加速模型开发和比较。
链接: https://arxiv.org/abs/2412.11767
作者: Chen Liang,Lianghua Huang,Jingwu Fang,Huanzhang Dou,Wei Wang,Zhi-Fan Wu,Yupeng Shi,Junge Zhang,Xin Zhao,Yu Liu
机构: Tongyi Lab; CASIA(中国科学院自动化研究所); USTB(北京科技大学); ZJU(浙江大学); Alibaba-inc(阿里巴巴); Taobao(淘宝)
关键词: requiring deep interpretation, Real-world design tasks, film storyboard development, picture book creation, photo retouching
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Real-world design tasks - such as picture book creation, film storyboard development using character sets, photo retouching, visual effects, and font transfer - are highly diverse and complex, requiring deep interpretation and extraction of various elements from instructions, descriptions, and reference images. The resulting images often implicitly capture key features from references or user inputs, making it challenging to develop models that can effectively address such varied tasks. While existing visual generative models can produce high-quality images based on prompts, they face significant limitations in professional design scenarios that involve varied forms and multiple inputs and outputs, even when enhanced with adapters like ControlNets and LoRAs. To address this, we introduce IDEA-Bench, a comprehensive benchmark encompassing 100 real-world design tasks, including rendering, visual effects, storyboarding, picture books, fonts, style-based, and identity-preserving generation, with 275 test cases to thoroughly evaluate a model’s general-purpose generation capabilities. Notably, even the best-performing model only achieves 22.48 on IDEA-Bench, while the best general-purpose model only achieves 6.81. We provide a detailed analysis of these results, highlighting the inherent challenges and providing actionable directions for improvement. Additionally, we provide a subset of 18 representative tasks equipped with multimodal large language model (MLLM)-based auto-evaluation techniques to facilitate rapid model development and comparison. We release the benchmark data, evaluation toolkits, and an online leaderboard at this https URL, aiming to drive the advancement of generative models toward more versatile and applicable intelligent design systems.
zh
[CV-40] GS-ProCams: Gaussian Splatting-based Projector-Camera Systems
【速读】: 该论文试图解决现有投影-摄像系统(ProCams)在视角变化时的局限性问题,特别是基于卷积神经网络(CNN)的方法受限于特定视角,而基于神经辐射场(NeRF)的方法则需要额外的光源和大量的计算资源。解决方案的关键在于提出了GS-ProCams框架,该框架利用二维高斯(2D Gaussian)进行场景表示,通过显式建模投影仪响应、目标表面的几何和材料属性以及全局光照组件,实现了高效的视角无关投影映射。此外,GS-ProCams采用可微分的物理渲染方法,从多视角投影中联合估计这些参数,从而在不增加额外设备的情况下,显著提升了投影映射的效率和质量,同时大幅减少了计算时间和GPU内存需求。
链接: https://arxiv.org/abs/2412.11762
作者: Qingyue Deng,Jijiang Li,Haibin Ling,Bingyao Huang
机构: Southwest University(西南大学); Stony Brook University(石溪大学)
关键词: Gaussian Splatting-based framework, Splatting-based framework, Gaussian Splatting-based, projector-camera systems, framework for projector-camera
类目: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR); Multimedia (cs.MM)
备注:
点击查看摘要
Abstract:We present GS-ProCams, the first Gaussian Splatting-based framework for projector-camera systems (ProCams). GS-ProCams significantly enhances the efficiency of projection mapping (PM) that requires establishing geometric and radiometric mappings between the projector and the camera. Previous CNN-based ProCams are constrained to a specific viewpoint, limiting their applicability to novel perspectives. In contrast, NeRF-based ProCams support view-agnostic projection mapping, however, they require an additional colocated light source and demand significant computational and memory resources. To address this issue, we propose GS-ProCams that employs 2D Gaussian for scene representations, and enables efficient view-agnostic ProCams applications. In particular, we explicitly model the complex geometric and photometric mappings of ProCams using projector responses, the target surface’s geometry and materials represented by Gaussians, and global illumination component. Then, we employ differentiable physically-based rendering to jointly estimate them from captured multi-view projections. Compared to state-of-the-art NeRF-based methods, our GS-ProCams eliminates the need for additional devices, achieving superior ProCams simulation quality. It is also 600 times faster and uses only 1/10 of the GPU memory.
zh
[CV-41] Generative Inbetweening through Frame-wise Conditions-Driven Video Generation
【速读】: 该论文试图解决生成式中间帧(Generative Inbetweening)中由于输入关键帧之间运动差距大而导致的时间稳定性问题。解决方案的关键在于提出了帧条件驱动的视频生成方法(Frame-wise Conditions-driven Video Generation, FCVG),通过为每个帧提供显式条件来明确插值路径,从而确保生成视频帧的时间稳定性。具体实现上,该方法从输入帧中提取匹配的线条,并将其作为帧条件逐帧插值,无缝集成到现有的视频生成模型中,显著提升了视频生成的时间一致性。
链接: https://arxiv.org/abs/2412.11755
作者: Tianyi Zhu,Dongwei Ren,Qilong Wang,Xiaohe Wu,Wangmeng Zuo
机构: Harbin Institute of Technology(哈尔滨工业大学); Tianjin University(天津大学)
关键词: Generative inbetweening aims, Generative inbetweening, intermediate frame sequences, video generation models, video generation
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Generative inbetweening aims to generate intermediate frame sequences by utilizing two key frames as input. Although remarkable progress has been made in video generation models, generative inbetweening still faces challenges in maintaining temporal stability due to the ambiguous interpolation path between two key frames. This issue becomes particularly severe when there is a large motion gap between input frames. In this paper, we propose a straightforward yet highly effective Frame-wise Conditions-driven Video Generation (FCVG) method that significantly enhances the temporal stability of interpolated video frames. Specifically, our FCVG provides an explicit condition for each frame, making it much easier to identify the interpolation path between two input frames and thus ensuring temporally stable production of visually plausible video frames. To achieve this, we suggest extracting matched lines from two input frames that can then be easily interpolated frame by frame, serving as frame-wise conditions seamlessly integrated into existing video generation models. In extensive evaluations covering diverse scenarios such as natural landscapes, complex human poses, camera movements and animations, existing methods often exhibit incoherent transitions across frames. In contrast, our FCVG demonstrates the capability to generate temporally stable videos using both linear and non-linear interpolation curves. Our project page and code are available at this https URL.
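A minimal sketch of the frame-wise condition idea: matched line endpoints from the two key frames are interpolated per frame (linearly here; the paper also supports non-linear curves). The array shapes and the endpoint encoding are assumptions.

```python
# Linear per-frame interpolation of matched line endpoints; encoding (x1, y1, x2, y2) is assumed.
import numpy as np

def interpolate_lines(lines_a, lines_b, num_frames):
    """lines_a, lines_b: (L, 4) arrays of matched lines from the two key frames."""
    conditions = []
    for i in range(num_frames):
        t = i / (num_frames - 1)                     # linear interpolation coefficient
        conditions.append((1.0 - t) * lines_a + t * lines_b)
    return np.stack(conditions)                      # (num_frames, L, 4) frame-wise conditions

lines_start = np.random.rand(8, 4) * 256
lines_end = np.random.rand(8, 4) * 256
conds = interpolate_lines(lines_start, lines_end, num_frames=16)
print(conds.shape)
```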
zh
[CV-42] DriveGazen: Event-Based Driving Status Recognition using Conventional Camera AAAI AAAI25
【速读】: 该论文试图解决从驾驶员眼部观察中识别驾驶状态的问题,特别是在光照条件变化下的鲁棒性识别。解决方案的关键在于两个核心技术:一是通过传统强度帧生成事件帧(event frames),二是设计了一种新型的注意力驾驶状态网络(Attention Driving State Network, ADSN)。具体来说,论文首先利用视频帧生成合成动态视觉传感器(DVS)事件,然后采用脉冲神经网络(spiking neural network)解码相关的时间信息。ADSN则从对应的强度帧中提取关键的空间线索,并通过一种新颖的引导注意力模块(guide attention module)在训练和推理过程中将空间注意力传递给卷积脉冲层,从而引导事件帧的特征学习和增强。这种方法首次结合了引导注意力脉冲神经网络和基于传统摄像头的眼部事件帧,用于驾驶状态识别。
链接: https://arxiv.org/abs/2412.11753
作者: Xiaoyin Yang
机构: Dalian University of Technology (大连理工大学)
关键词: real-time method robust, driving status, wearable driving status, identifying driving status, Driving State Network
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 9 pages, 4 figures, (AAAI25)The 39th Annual AAAI Conference on Artificial Intelligence
点击查看摘要
Abstract:We introduce a wearable driving status recognition device and our open-source dataset, along with a new real-time method robust to changes in lighting conditions for identifying driving status from eye observations of drivers. The core of our method is generating event frames from conventional intensity frames, and the other is a newly designed Attention Driving State Network (ADSN). Compared to event cameras, conventional cameras offer complete information and lower hardware costs, enabling captured frames to encode rich spatial information. However, these textures lack temporal information, posing challenges in effectively identifying driving status. DriveGazen addresses this issue from three perspectives. First, we utilize video frames to generate realistic synthetic dynamic vision sensor (DVS) events. Second, we adopt a spiking neural network to decode pertinent temporal information. Lastly, ADSN extracts crucial spatial cues from corresponding intensity frames and conveys spatial attention to convolutional spiking layers during both training and inference through a novel guide attention module to guide the feature learning and feature enhancement of the event frame. We specifically collected the Driving Status (DriveGaze) dataset to demonstrate the effectiveness of our approach. Additionally, we validate the superiority of the DriveGazen on the Single-eye Event-based Emotion (SEE) dataset. To the best of our knowledge, our method is the first to utilize guide attention spiking neural networks and eye-based event frames generated from conventional cameras for driving status recognition. Please refer to our project page for more details: this https URL.
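A hedged sketch of turning consecutive intensity frames into a synthetic DVS-style event frame: an event fires wherever the log-intensity change exceeds a contrast threshold. The threshold value and the signed-frame encoding are assumptions, not the paper's exact event generator.

```python
# Synthetic DVS-style events from two intensity frames; the contrast threshold is an assumption.
import numpy as np

def frames_to_event_frame(prev_frame, cur_frame, threshold=0.15):
    """prev_frame, cur_frame: grayscale arrays in [0, 1]; returns a signed event frame."""
    eps = 1e-3
    diff = np.log(cur_frame + eps) - np.log(prev_frame + eps)
    events = np.zeros_like(diff, dtype=np.int8)
    events[diff >= threshold] = 1                    # ON events (brightness increase)
    events[diff <= -threshold] = -1                  # OFF events (brightness decrease)
    return events

prev = np.random.rand(64, 64)
cur = np.clip(prev + 0.05 * np.random.randn(64, 64), 0, 1)
ev = frames_to_event_frame(prev, cur)
print(np.unique(ev))
```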
zh
[CV-43] Deformable Radial Kernel Splatting
【速读】: 该论文试图解决传统高斯光栅化技术在表示复杂3D形状时存在的局限性问题,特别是在需要大量基元来近似复杂几何形状时,高斯函数的径向对称性和平滑性约束导致效率和精度不足。解决方案的关键在于引入可变形径向核 (Deformable Radial Kernel, DRK),通过可学习的径向基函数,DRK能够灵活调整角度和尺度,从而高效地建模多样化的形状基元,并精确控制边缘锐度和边界曲率。此外,论文还提出了精确的射线-基元交点计算方法和高效的核剔除策略,以提升光栅化效率。实验结果表明,DRK在表示效率和渲染质量上均优于现有方法,显著减少了基元数量并实现了最先进的性能。
链接: https://arxiv.org/abs/2412.11752
作者: Yi-Hua Huang,Ming-Xian Lin,Yang-Tian Sun,Ziyi Yang,Xiaoyang Lyu,Yan-Pei Cao,Xiaojuan Qi
机构: The University of Hong Kong; VAST
关键词: technique for representing, Gaussian splatting, robust technique, Gaussians’ inherent radial, extends Gaussian splatting
类目: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
备注:
点击查看摘要
Abstract:Recently, Gaussian splatting has emerged as a robust technique for representing 3D scenes, enabling real-time rasterization and high-fidelity rendering. However, Gaussians’ inherent radial symmetry and smoothness constraints limit their ability to represent complex shapes, often requiring thousands of primitives to approximate detailed geometry. We introduce Deformable Radial Kernel (DRK), which extends Gaussian splatting into a more general and flexible framework. Through learnable radial bases with adjustable angles and scales, DRK efficiently models diverse shape primitives while enabling precise control over edge sharpness and boundary curvature. Given DRK’s planar nature, we further develop accurate ray-primitive intersection computation for depth sorting and introduce efficient kernel culling strategies for improved rasterization efficiency. Extensive experiments demonstrate that DRK outperforms existing methods in both representation efficiency and rendering quality, achieving state-of-the-art performance while dramatically reducing primitive count.
zh
[CV-44] ransferable Adversarial Face Attack with Text Controlled Attribute
【速读】: 该论文试图解决传统对抗攻击在生成对抗样本时对属性控制有限和低迁移性的问题。解决方案的关键在于提出了一种新的文本控制属性攻击方法(Text Controlled Attribute Attack, TCA²),通过自然语言引导生成逼真的对抗冒充人脸。具体来说,该方法利用类别级别的个人softmax向量精确引导冒充攻击,并采用数据和模型增强策略来提高攻击的迁移性。此外,使用Style-GAN生成具有所需属性的冒充人脸,实验验证了该方法在生成自然且高迁移性的文本引导对抗冒充人脸方面的有效性,并在实际人脸识别系统中展示了其潜在应用。
链接: https://arxiv.org/abs/2412.11735
作者: Wenyun Li,Zheng Zhang,Xiangyuan Lan,Dongmei Jiang
机构: 1. School of Computer Science and Engineering, University of Electronic Science and Technology of China (电子科技大学计算机科学与工程学院);
2. State Key Laboratory of Software Engineering, Wuhan University (武汉大学软件工程国家重点实验室);
3. School of Computer Science, Wuhan University (武汉大学计算机科学学院)
关键词: semantically meaningful perturbations, Traditional adversarial attacks, attacks typically produce, typically produce adversarial, Traditional adversarial
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:Traditional adversarial attacks typically produce adversarial examples under norm-constrained conditions, whereas unrestricted adversarial examples are free-form with semantically meaningful perturbations. Current unrestricted adversarial impersonation attacks exhibit limited control over adversarial face attributes and often suffer from low transferability. In this paper, we propose a novel Text Controlled Attribute Attack (TCA²) to generate photorealistic adversarial impersonation faces guided by natural language. Specifically, the category-level personal softmax vector is employed to precisely guide the impersonation attacks. Additionally, we propose both data and model augmentation strategies to achieve transferable attacks on unknown target models. Finally, a generative model, i.e., Style-GAN, is utilized to synthesize impersonated faces with desired attributes. Extensive experiments on two high-resolution face recognition datasets validate that our TCA² method can generate natural text-guided adversarial impersonation faces with high transferability. We also evaluate our method on real-world face recognition systems, i.e., Face++ and Aliyun, further demonstrating the practical potential of our approach.
zh
[CV-45] Discrepancy-Aware Attention Network for Enhanced Audio-Visual Zero-Shot Learning
【速读】: 该论文试图解决音频-视觉零样本学习(Audio-visual Zero-Shot Learning, ZSL)中的模态不平衡问题,特别是由于模态间质量和内容差异导致的识别能力下降。解决方案的关键在于提出了一个差异感知注意力网络(Discrepancy-Aware Attention Network, DAAN),其中包括两个核心模块:质量差异缓解注意力(Quality-Discrepancy Mitigation Attention, QDMA)单元和对比样本级梯度调制(Contrastive Sample-level Gradient Modulation, CSGM)块。QDMA通过减少高质量模态中的冗余信息来缓解模态间的质量差异,而CSGM则通过调整梯度幅度来平衡模态内的内容差异,从而提升对未见类别的判别能力。
链接: https://arxiv.org/abs/2412.11715
作者: RunLin Yu,Yipu Gong,Wenrui Li,Aiwen Sun,Mengren Zheng
机构: Central South University(中南大学); Harbin Institute of Technology(哈尔滨工业大学); Chong Qing University(重庆大学)
关键词: Audio-visual Zero-Shot Learning, video classification tasks, Zero-Shot Learning, identify unseen classes, attracted significant attention
类目: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM); Sound (cs.SD); Audio and Speech Processing (eess.AS)
备注:
点击查看摘要
Abstract:Audio-visual Zero-Shot Learning (ZSL) has attracted significant attention for its ability to identify unseen classes and perform well in video classification tasks. However, modal imbalance in (G)ZSL leads to over-reliance on the optimal modality, reducing discriminative capabilities for unseen classes. Some studies have attempted to address this issue by modifying parameter gradients, but two challenges still remain: (a) Quality discrepancies, where modalities offer differing quantities and qualities of information for the same concept. (b) Content discrepancies, where sample contributions within a modality vary significantly. To address these challenges, we propose a Discrepancy-Aware Attention Network (DAAN) for Enhanced Audio-Visual ZSL. Our approach introduces a Quality-Discrepancy Mitigation Attention (QDMA) unit to minimize redundant information in the high-quality modality and a Contrastive Sample-level Gradient Modulation (CSGM) block to adjust gradient magnitudes and balance content discrepancies. We quantify modality contributions by integrating optimization and convergence rate for more precise gradient modulation in CSGM. Experiments demonstrate that DAAN achieves state-of-the-art performance on benchmark datasets, with ablation studies validating the effectiveness of individual modules.
zh
[CV-46] Re-Attentional Controllable Video Diffusion Editing AAAI2025 ATC
【速读】: 该论文试图解决文本引导视频编辑中的可控性问题,特别是由于现有方法在处理对象位置错误和对象数量不正确等局限性。解决方案的关键在于提出了Re-Attentional Controllable Video Diffusion Editing (ReAtCo)方法,其中包括两个核心技术:Re-Attentional Diffusion (RAD)和Invariant Region-guided Joint Sampling (IRJS)。RAD通过在去噪阶段重新聚焦编辑文本提示与目标视频之间的交叉注意力激活响应,实现了目标对象空间位置对齐和语义高保真度的视频编辑。IRJS则通过减少不变区域在每个去噪时间步的固有采样误差,确保生成内容与不变区域内容和谐一致,从而减少边界伪影。实验结果表明,ReAtCo显著提升了视频扩散编辑的可控性,并实现了更优的视频编辑性能。
链接: https://arxiv.org/abs/2412.11710
作者: Yuanzhi Wang,Yong Li,Mengyi Liu,Xiaoya Zhang,Xin Liu,Zhen Cui,Antoni B. Chan
机构: Tsinghua University(清华大学); Beijing University of Posts and Telecommunications(北京邮电大学); City University of Hong Kong(香港城市大学); Nanyang Technological University(南洋理工大学)
关键词: garnered popularity due, Video Diffusion Editing, video editing, video, Editing
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Accepted by AAAI 2025. Codes are released at: this https URL
点击查看摘要
Abstract:Editing videos with textual guidance has garnered popularity due to its streamlined process which mandates users to solely edit the text prompt corresponding to the source video. Recent studies have explored and exploited large-scale text-to-image diffusion models for text-guided video editing, resulting in remarkable video editing capabilities. However, they may still suffer from some limitations such as mislocated objects and an incorrect number of objects. Therefore, the controllability of video editing remains a formidable challenge. In this paper, we aim to challenge the above limitations by proposing a Re-Attentional Controllable Video Diffusion Editing (ReAtCo) method. Specifically, to align the spatial placement of the target objects with the edited text prompt in a training-free manner, we propose a Re-Attentional Diffusion (RAD) to refocus the cross-attention activation responses between the edited text prompt and the target video during the denoising stage, resulting in a spatially location-aligned and semantically high-fidelity manipulated video. In particular, to faithfully preserve the invariant region content with fewer border artifacts, we propose an Invariant Region-guided Joint Sampling (IRJS) strategy to mitigate the intrinsic sampling errors w.r.t. the invariant regions at each denoising timestep and constrain the generated content to be harmonized with the invariant region content. Experimental results verify that ReAtCo consistently improves the controllability of video diffusion editing and achieves superior video editing performance.
zh
[CV-47] AsymRnR: Video Diffusion Transformers Acceleration with Asymmetric Reduction and Restoration
【速读】: 该论文试图解决视频扩散变压器 (Video Diffusion Transformers, DiTs) 在生成高质量视频时计算量大的问题。解决方案的关键是提出了一种名为非对称减少与恢复 (Asymmetric Reduction and Restoration, AsymRnR) 的训练免费方法,该方法通过灵活且自适应地减少基于冗余度的标记数量,来提升加速效果和生成质量。此外,论文还提出了匹配缓存 (matching cache) 以进一步加速处理。通过集成到现有的先进视频 DiTs 中,AsymRnR 在不降低生成质量的前提下实现了显著的加速。
链接: https://arxiv.org/abs/2412.11706
作者: Wenhao Sun,Rong-Cheng Tu,Jingyi Liao,Zhao Jin,Dacheng Tao
机构: Nanyang Technological University (南洋理工大学)
关键词: Video Diffusion Transformers, Diffusion Transformers, demonstrated significant potential, generating high-fidelity videos, computationally intensive
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 11 pages, 7 figures
点击查看摘要
Abstract:Video Diffusion Transformers (DiTs) have demonstrated significant potential for generating high-fidelity videos but are computationally intensive. Existing acceleration methods include distillation, which requires costly retraining, and feature caching, which is highly sensitive to network architecture. Recent token reduction methods are training-free and architecture-agnostic, offering greater flexibility and wider applicability. However, they enforce the same sequence length across different components, constraining their acceleration potential. We observe that intra-sequence redundancy in video DiTs varies across features, blocks, and denoising timesteps. Building on this observation, we propose Asymmetric Reduction and Restoration (AsymRnR), a training-free approach to accelerate video DiTs. It offers a flexible and adaptive strategy that reduces the number of tokens based on their redundancy to enhance both acceleration and generation quality. We further propose matching cache to facilitate faster processing. Integrated into state-of-the-art video DiTs, AsymRnR achieves a superior speedup without compromising the quality.
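To picture redundancy-based reduction and restoration, here is a ToMe-style sketch: the most redundant tokens (by cosine similarity) are dropped before a block and copied back from their nearest kept token afterwards. The fixed keep ratio is an assumption; the paper varies the reduction per feature, block, and timestep.

```python
# ToMe-style token reduction/restoration sketch; keep_ratio is a fixed assumption here.
import torch

def reduce_tokens(x, keep_ratio=0.5):
    """x: (N, C). Drop the most redundant tokens; remember a mapping for restoration."""
    n_keep = max(1, int(x.shape[0] * keep_ratio))
    normed = torch.nn.functional.normalize(x, dim=-1)
    sim = normed @ normed.T
    sim.fill_diagonal_(-1.0)
    redundancy = sim.max(dim=-1).values                # similarity of each token to its closest peer
    keep_idx = redundancy.argsort()[:n_keep]           # keep the least redundant tokens
    kept = x[keep_idx]
    # Map every original token to its nearest kept token for later restoration.
    assign = (normed @ normed[keep_idx].T).argmax(dim=-1)
    return kept, keep_idx, assign

def restore_tokens(processed_kept, assign):
    return processed_kept[assign]                      # broadcast kept tokens back to full length

tokens = torch.randn(128, 64)
kept, keep_idx, assign = reduce_tokens(tokens)
restored = restore_tokens(kept, assign)                # (128, 64) again
print(kept.shape, restored.shape)
```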
zh
[CV-48] Flex-PE: Flexible and SIMD Multi-Precision Processing Element for AI Workloads
【速读】: 该论文试图解决现有解决方案在支持多样化精度(precision)和运行时可重配置的非线性激活函数(non-linear activation functions, AF)方面存在的不足。解决方案的关键在于提出了一种灵活的单指令多数据(SIMD)多精度处理单元(FlexPE),该单元支持多种运行时可配置的激活函数(如sigmoid、tanh、ReLU和softmax)以及乘积累加(MAC)操作。FlexPE在流水线模式下实现了显著的吞吐量提升,并在边缘AI应用中通过SIMD脉动阵列实现了面积高效的多精度迭代模式,显著减少了数据传输需求并提高了能效。此外,该设计支持新兴的4位计算,适用于深度学习推理(DL inference),并在FxP8/16模式下提升了Transformer等高性能计算应用的吞吐量。
链接: https://arxiv.org/abs/2412.11702
作者: Mukul Lokhande,Gopal Raut,Santosh Kumar Vishvakarma
机构: NSDCS Research Group, Department of Electrical Engineering, IIT Indore, Simrol-453552, India(NSDCS研究组,电气工程系,印度理工学院印多尔分校,印度西姆罗尔-453552)
关键词: linear activation functions, deep learning inference, Vision Transformers, driven AI models, drives a strong
类目: Hardware Architecture (cs.AR); Computer Vision and Pattern Recognition (cs.CV); Distributed, Parallel, and Cluster Computing (cs.DC); Image and Video Processing (eess.IV)
备注: 10 pages, 5 figures, Preprint, Submitted to TVLSI Regular papers
点击查看摘要
Abstract:The rapid adaptation of data driven AI models, such as deep learning inference, training, Vision Transformers (ViTs), and other HPC applications, drives a strong need for runtime precision configurable different non linear activation functions (AF) hardware support. Existing solutions support diverse precision or runtime AF reconfigurability but fail to address both simultaneously. This work proposes a flexible and SIMD multiprecision processing element (FlexPE), which supports diverse runtime configurable AFs, including sigmoid, tanh, ReLU and softmax, and MAC operation. The proposed design achieves an improved throughput of up to 16X FxP4, 8X FxP8, 4X FxP16 and 1X FxP32 in pipeline mode with 100% time multiplexed hardware. This work proposes an area efficient multiprecision iterative mode in the SIMD systolic arrays for edge AI use cases. The design delivers superior performance with up to 62X and 371X reductions in DMA reads for input feature maps and weight filters in VGG16, with an energy efficiency of 8.42 GOPS / W within the accuracy loss of 2%. The proposed architecture supports emerging 4-bit computations for DL inference while enhancing throughput in FxP8/16 modes for transformers and other HPC applications. The proposed approach enables future energy-efficient AI accelerators in edge and cloud environments.
zh
[CV-49] Ultra-High-Definition Dynamic Multi-Exposure Image Fusion via Infinite Pixel Learning
【速读】: 该论文试图解决在资源受限的设备上高效生成高质量超高清(Ultra-High-Definition, UHD)多曝光动态场景图像融合的问题。解决方案的关键在于提出了一种名为无限像素学习(Infinite Pixel Learning, IPL)的新型学习范式,该范式借鉴了大型语言模型(Large Language Model, LLM)处理无限长文本的思想。具体来说,解决方案包括三个关键组件:首先,通过将输入序列切片来减轻模型处理数据流的压力;其次,开发了一种类似于KV缓存的注意力缓存技术,用于处理无限数据流;最后,设计了一种注意力缓存压缩方法,以缓解缓存对设备存储的负担。此外,论文还提供了一个新的UHD基准来评估该方法的有效性,实验结果表明该方法能够在单个消费级GPU上实时(40fps)融合UHD多曝光动态图像,同时保持高质量的视觉效果。
链接: https://arxiv.org/abs/2412.11685
作者: Xingchi Chen,Zhuoran Zheng,Xuerui Li,Yuying Chen,Shu Wang,Wenqi Ren
机构: 1. School of Computer Science and Technology, Soochow University(苏州大学计算机科学与技术学院);
2. Institute of Artificial Intelligence, Soochow University(苏州大学人工智能研究所);
3. School of Computer Science and Technology, Nanjing University of Science and Technology(南京理工大学计算机科学与技术学院);
4. School of Computer Science, Fudan University(复旦大学计算机科学学院)
关键词: device imaging resolution, UHD multi-exposure dynamic, UHD dynamic multi-exposure, single consumer-grade GPU, Large Language Model
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:With the continuous improvement of device imaging resolution, the popularity of Ultra-High-Definition (UHD) images is increasing. Unfortunately, existing methods for fusing multi-exposure images in dynamic scenes are designed for low-resolution images, which makes them inefficient for generating high-quality UHD images on a resource-constrained device. To alleviate the limitations of extremely long-sequence inputs, inspired by the Large Language Model (LLM) for processing infinitely long texts, we propose a novel learning paradigm to achieve UHD multi-exposure dynamic scene image fusion on a single consumer-grade GPU, named Infinite Pixel Learning (IPL). The design of our approach comes from three key components: The first step is to slice the input sequences to relieve the pressure generated by the model processing the data stream; Second, we develop an attention cache technique, which is similar to KV cache for infinite data stream processing; Finally, we design a method for attention cache compression to alleviate the storage burden of the cache on the device. In addition, we provide a new UHD benchmark to evaluate the effectiveness of our method. Extensive experimental results show that our method maintains high-quality visual performance while fusing UHD dynamic multi-exposure images in real-time (40fps) on a single consumer-grade GPU.
zh
[CV-50] EGP3D: Edge-guided Geometric Preserving 3D Point Cloud Super-resolution for RGB-D camera
【速读】: 该论文试图解决现有点云超分辨率(PCSR)方法在处理由RGB-D相机捕获的低分辨率点云时,存在的几何伪影和边缘细节缺失问题。解决方案的关键在于提出了边缘引导的几何保持3D点云超分辨率(EGP3D)方法,该方法通过在投影的2D空间中引入边缘约束来优化点云,从而在3D PCSR任务中确保高质量的边缘保持。此外,论文还引入了一个多方面的损失函数,同时优化Chamfer距离、Hausdorff距离和梯度平滑度,以应对超分辨率点云中的几何优化挑战。为了弥补现有数据集在真实场景中的不足,论文还构建了一个包含真实世界噪声和杂散光效应的数据集,以更准确地模拟实际环境。
链接: https://arxiv.org/abs/2412.11680
作者: Zheng Fang,Ke Ye,Yaofang Liu,Gongzhe Li,Xianhong Zhao,Jialong Li,Ruxin Wang,Yuchen Zhang,Xiangyang Ji,Qilin Sun
机构: School of Data Science, The Chinese University of Hong Kong (Shenzhen)(香港中文大学(深圳)数据科学学院); The Hong Kong University of Science and Technology (Guangzhou)(香港科技大学(广州)); City University of Hong Kong(香港城市大学); King Abdullah University of Science and Technology(阿卜杜拉国王科技大学); Tsinghua University(清华大学)
关键词: depth images captured, point cloud, point cloud super-resolution, reconstruction and robots, low resolution
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Point clouds or depth images captured by current RGB-D cameras often suffer from low resolution, rendering them insufficient for applications such as 3D reconstruction and robots. Existing point cloud super-resolution (PCSR) methods are either constrained by geometric artifacts or lack attention to edge details. To address these issues, we propose an edge-guided geometric-preserving 3D point cloud super-resolution (EGP3D) method tailored for RGB-D cameras. Our approach innovatively optimizes the point cloud with an edge constraint on a projected 2D space, thereby ensuring high-quality edge preservation in the 3D PCSR task. To tackle geometric optimization challenges in super-resolution point clouds, particularly preserving edge shapes and smoothness, we introduce a multi-faceted loss function that simultaneously optimizes the Chamfer distance, Hausdorff distance, and gradient smoothness. Existing datasets used for point cloud upsampling are predominantly synthetic and inadequately represent real-world scenarios, neglecting noise and stray light effects. To address the scarcity of realistic RGB-D data for PCSR tasks, we built a dataset that captures real-world noise and stray-light effects, offering a more accurate representation of authentic environments. Validated through simulations and real-world experiments, the proposed method exhibited superior performance in preserving edge clarity and geometric details.
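As a hedged illustration of the geometric terms named in the abstract, the sketch below implements Chamfer and Hausdorff distances between point clouds; the edge-projection and gradient-smoothness terms are omitted, and the loss weighting is an assumption.

```python
# Chamfer and Hausdorff distance terms between point clouds; the 0.1 weighting is illustrative.
import torch

def pairwise_dist(a, b):
    """a: (N, 3), b: (M, 3) -> (N, M) squared distances."""
    return ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)

def chamfer_distance(a, b):
    d = pairwise_dist(a, b)
    return d.min(dim=1).values.mean() + d.min(dim=0).values.mean()

def hausdorff_distance(a, b):
    d = pairwise_dist(a, b).sqrt()
    return torch.max(d.min(dim=1).values.max(), d.min(dim=0).values.max())

pred = torch.rand(1024, 3, requires_grad=True)     # upsampled point cloud
gt = torch.rand(2048, 3)                           # reference point cloud
loss = chamfer_distance(pred, gt) + 0.1 * hausdorff_distance(pred, gt)
loss.backward()
print(loss.item())
```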
zh
[CV-51] textttDINO-Foresight: Looking into the Future with DINO
【速读】: 该论文试图解决现有像素级预测方法在计算成本高且常关注无关细节的问题,特别是在自动驾驶和机器人等需要理解环境的应用中。解决方案的关键在于引入 DINO-Foresight 框架,该框架在预训练的视觉基础模型 (Vision Foundation Models, VFMs) 的语义特征空间中操作,通过自监督的方式训练一个掩码特征变换器来预测 VFM 特征随时间的演化。这种方法通过预测这些特征,能够使用现成的、任务特定的头部模块进行多种场景理解任务。关键创新在于将 VFM 特征视为潜在空间,并附加不同的头部模块以执行特定任务的未来帧分析,从而提高了预测的鲁棒性和可扩展性。
链接: https://arxiv.org/abs/2412.11673
作者: Efstathios Karypidis,Ioannis Kakogeorgiou,Spyros Gidaris,Nikos Komodakis
机构: Archimedes/Athena RC; valeo.ai; National Technical University of Athens; University of Crete; IACM-Forth
关键词: Predicting future dynamics, Predicting future, Vision Foundation Models, driving and robotics, environment is key
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Predicting future dynamics is crucial for applications like autonomous driving and robotics, where understanding the environment is key. Existing pixel-level methods are computationally expensive and often focus on irrelevant details. To address these challenges, we introduce DINO-Foresight, a novel framework that operates in the semantic feature space of pretrained Vision Foundation Models (VFMs). Our approach trains a masked feature transformer in a self-supervised manner to predict the evolution of VFM features over time. By forecasting these features, we can apply off-the-shelf, task-specific heads for various scene understanding tasks. In this framework, VFM features are treated as a latent space, to which different heads attach to perform specific tasks for future-frame analysis. Extensive experiments show that our framework outperforms existing methods, demonstrating its robustness and scalability. Additionally, we highlight how intermediate transformer representations in DINO-Foresight improve downstream task performance, offering a promising path for the self-supervised enhancement of VFM features. We provide the implementation code at this https URL.
zh
[CV-52] Online Writer Retrieval with Chinese Handwritten Phrases: A Synergistic Temporal-Frequency Representation Learning Approach
【速读】: 该论文试图解决在线书写检索(online writer retrieval)领域中方法和数据集匮乏的问题。解决方案的关键在于提出了DOLPHIN模型,该模型通过协同的时间-频率分析(temporal-frequency analysis)来增强书写特征的表示。具体来说,模型引入了HFGA块(HFGA block)用于频率特征学习,通过门控交叉注意力(gated cross-attention)机制在原始时间序列和其高频子带之间进行交互,以放大显著的书写细节;同时,提出了CAIR块(CAIR block)用于时间特征学习,促进通道交互并减少通道冗余。此外,论文还引入了大规模数据集OLIWER,包含超过67万条中文手写短语,以解决数据不足的问题。通过这些创新,DOLPHIN模型在性能上显著优于现有方法,并揭示了特征对齐在跨域书写检索中的重要性。
链接: https://arxiv.org/abs/2412.11668
作者: Peirong Zhang,Lianwen Jin
机构: South China University of Technology (华南理工大学)
关键词: accurately search relevant, search relevant handwriting, relevant handwriting instances, effective retrieval systems, spurred a critical
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Currently, the prevalence of online handwriting has spurred a critical need for effective retrieval systems to accurately search relevant handwriting instances from specific writers, known as online writer retrieval. Despite the growing demand, this field suffers from a scarcity of well-established methodologies and public large-scale datasets. This paper tackles these challenges with a focus on Chinese handwritten phrases. First, we propose DOLPHIN, a novel retrieval model designed to enhance handwriting representations through synergistic temporal-frequency analysis. For frequency feature learning, we propose the HFGA block, which performs gated cross-attention between the vanilla temporal handwriting sequence and its high-frequency sub-bands to amplify salient writing details. For temporal feature learning, we propose the CAIR block, tailored to promote channel interaction and reduce channel redundancy. Second, to address data deficit, we introduce OLIWER, a large-scale online writer retrieval dataset encompassing over 670,000 Chinese handwritten phrases from 1,731 individuals. Through extensive evaluations, we demonstrate the superior performance of DOLPHIN over existing methods. In addition, we explore cross-domain writer retrieval and reveal the pivotal role of increasing feature alignment in bridging the distributional gap between different handwriting data. Our findings emphasize the significance of point sampling frequency and pressure features in improving handwriting representation quality and retrieval performance. Code and dataset are available at this https URL.
zh
[CV-53] LMM-Regularized CLIP Embeddings for Image Classification
【速读】: 该论文旨在通过使用CLIP视觉-语言模型的图像编码器来提升图像分类任务的性能。其关键解决方案是提出了一种基于大型多模态模型 (Large Multimodal Model, LMM) 的正则化方法。该方法首先利用LMM提取数据集中图像的语义描述,然后使用冻结的CLIP文本编码器获取相应的文本嵌入,并计算平均语义类别描述。接着,通过添加分类头来调整CLIP的图像编码器,并在训练过程中引入一个辅助目标,使图像编码器输出的嵌入与LMM生成的平均语义类别描述相似。这种正则化方法增强了嵌入的区分能力,从而提高了分类性能。
链接: https://arxiv.org/abs/2412.11663
作者: Maria Tzelepi,Vasileios Mezaris
机构: Information Technologies Institute (ITI); Centre of Research and Technology Hellas (CERTH)
关键词: CLIP vision-language model, Large Multimodal Model, powerful CLIP vision-language, CLIP image encoder, CLIP image
类目: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
备注: Accepted for publication, 26th Int. Symp. on Multimedia (IEEE ISM 2024), Tokyo, Japan, Dec. 2024. This is the authors’ “accepted version”
点击查看摘要
Abstract:In this paper we deal with image classification tasks using the powerful CLIP vision-language model. Our goal is to advance the classification performance using the CLIP’s image encoder, by proposing a novel Large Multimodal Model (LMM) based regularization method. The proposed method uses an LMM to extract semantic descriptions for the images of the dataset. Then, it uses the CLIP’s text encoder, frozen, in order to obtain the corresponding text embeddings and compute the mean semantic class descriptions. Subsequently, we adapt the CLIP’s image encoder by adding a classification head, and we train it along with the image encoder output, apart from the main classification objective, with an additional auxiliary objective. The additional objective forces the embeddings at the image encoder’s output to become similar to their corresponding LMM-generated mean semantic class descriptions. In this way, it produces embeddings with enhanced discrimination ability, leading to improved classification performance. The effectiveness of the proposed regularization method is validated through extensive experiments on three image classification datasets.
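A minimal sketch of the auxiliary objective described above: alongside cross-entropy, each image embedding is pulled toward the mean text embedding of its class. The mean class descriptions would be precomputed offline from LMM-generated descriptions with the frozen CLIP text encoder; here they are random placeholders, and the loss weighting is an assumption.

```python
# Classification loss plus an auxiliary cosine term toward mean class descriptions (placeholders).
import torch
import torch.nn.functional as F

def total_loss(image_emb, logits, labels, class_text_means, aux_weight=0.5):
    """image_emb: (B, D), logits: (B, K), labels: (B,), class_text_means: (K, D)."""
    ce = F.cross_entropy(logits, labels)
    targets = class_text_means[labels]                          # per-sample mean class description
    aux = 1.0 - F.cosine_similarity(image_emb, targets, dim=-1).mean()
    return ce + aux_weight * aux

B, D, K = 8, 512, 10
image_emb = torch.randn(B, D, requires_grad=True)               # would come from CLIP's image encoder
logits = torch.randn(B, K, requires_grad=True)                  # would come from the added classification head
labels = torch.randint(0, K, (B,))
class_text_means = torch.randn(K, D)                            # placeholder for LMM/CLIP-text class means
loss = total_loss(image_emb, logits, labels, class_text_means)
loss.backward()
print(loss.item())
```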
zh
[CV-54] CNNtention: Can CNNs do better with Attention?
【速读】: 该论文旨在解决传统卷积神经网络 (CNN) 与基于注意力机制增强的 CNN 在图像分类任务中的性能比较问题。解决方案的关键在于评估和比较这两种架构在准确性、计算效率等方面的表现,揭示传统 CNN 的局部特征提取优势与注意力增强 CNN 的全局上下文捕捉能力之间的权衡,从而为特定应用需求选择合适的模型提供指导,并深化对深度学习社区中这些架构的理解。
链接: https://arxiv.org/abs/2412.11657
作者: Julian Glattki,Nikhil Kapila,Tejas Rathi
机构: Georgia Institute of Technology (佐治亚理工学院)
关键词: Convolutional Neural Networks, Convolutional Neural, Neural Networks, recently attention-based mechanisms, image classification tasks
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: 10 pages, 11 figures
点击查看摘要
Abstract:Convolutional Neural Networks (CNNs) have been the standard for image classification tasks for a long time, but more recently attention-based mechanisms have gained traction. This project aims to compare traditional CNNs with attention-augmented CNNs across an image classification task. By evaluating and comparing their performance, accuracy and computational efficiency, the project will highlight benefits and trade-off of the localized feature extraction of traditional CNNs and the global context capture in attention-augmented CNNs. By doing this, we can reveal further insights into their respective strengths and weaknesses, guide the selection of models based on specific application needs and ultimately, enhance understanding of these architectures in the deep learning community. This was our final project for CS7643 Deep Learning course at Georgia Tech.
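A hedged sketch of the kind of comparison described: the same small CNN with and without a lightweight self-attention block after the last convolutional stage. The specific block and hyperparameters are illustrative assumptions, not the project's actual models.

```python
# Baseline CNN vs. attention-augmented CNN; the attention block and sizes are assumptions.
import torch
import torch.nn as nn

class SpatialSelfAttention(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.attn = nn.MultiheadAttention(channels, num_heads=4, batch_first=True)

    def forward(self, x):                          # x: (B, C, H, W)
        b, c, h, w = x.shape
        seq = x.flatten(2).transpose(1, 2)         # (B, H*W, C) token sequence
        out, _ = self.attn(seq, seq, seq)          # global context across spatial positions
        return x + out.transpose(1, 2).reshape(b, c, h, w)   # residual connection

def make_cnn(with_attention: bool):
    layers = [nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
              nn.Conv2d(32, 64, 3, padding=1), nn.ReLU()]
    if with_attention:
        layers.append(SpatialSelfAttention(64))
    layers += [nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(64, 10)]
    return nn.Sequential(*layers)

x = torch.randn(2, 3, 32, 32)
print(make_cnn(False)(x).shape, make_cnn(True)(x).shape)
```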
zh
[CV-55] Image Gradient-Aided Photometric Stereo Network
【速读】: 该论文试图解决传统基于深度学习的光度立体法(Photometric Stereo, PS)在处理复杂表面(如皱纹和边缘)时出现的模糊问题。解决方案的关键在于提出了图像梯度辅助的光度立体网络(Image Gradient-Aided Photometric Stereo Network, IGA-PSN),通过双分支框架同时提取光度图像及其梯度的特征,并结合沙漏回归网络(hourglass regression network)和监督机制来规范法向量回归,从而在复杂区域中保持纹理和几何形状,显著提升了表面法向量的估计精度。
链接: https://arxiv.org/abs/2412.11650
作者: Kaixuan Wang,Lin Qi,Shiyu Qin,Kai Luo,Yakun Ju,Xia Li,Junyu Dong
机构: 未知
关键词: endeavors to ascertain, shading clues, photometric images, Photometric stereo, ascertain surface normals
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 13 pages, 5 figures, published to Springer
点击查看摘要
Abstract:Photometric stereo (PS) endeavors to ascertain surface normals using shading clues from photometric images under various illuminations. Recent deep learning-based PS methods often overlook the complexity of object surfaces. These neural network models, which exclusively rely on photometric images for training, often produce blurred results in high-frequency regions characterized by local discontinuities, such as wrinkles and edges with significant gradient changes. To address this, we propose the Image Gradient-Aided Photometric Stereo Network (IGA-PSN), a dual-branch framework extracting features from both photometric images and their gradients. Furthermore, we incorporate an hourglass regression network along with supervision to regularize normal regression. Experiments on DiLiGenT benchmarks show that IGA-PSN outperforms previous methods in surface normal estimation, achieving a mean angular error of 6.46 while preserving textures and geometric shapes in complex regions.
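下面用一个简化草图说明"图像分支 + 梯度分支"的双分支思路(笔者补充,非论文实现;此处用 Sobel 算子近似图像梯度,`img_branch`、`grad_branch`、`fuse_head` 均为假设的子网络):

```python
import torch
import torch.nn.functional as F

def image_gradients(x):                          # x: (B, C, H, W)
    kx = torch.tensor([[-1., 0., 1.],
                       [-2., 0., 2.],
                       [-1., 0., 1.]], dtype=x.dtype, device=x.device).view(1, 1, 3, 3)
    ky = kx.transpose(2, 3).contiguous()
    kx = kx.repeat(x.shape[1], 1, 1, 1)          # 按通道分组卷积
    ky = ky.repeat(x.shape[1], 1, 1, 1)
    gx = F.conv2d(x, kx, padding=1, groups=x.shape[1])
    gy = F.conv2d(x, ky, padding=1, groups=x.shape[1])
    return torch.cat([gx, gy], dim=1)            # 梯度分支输入(通道数翻倍)

def dual_branch_forward(imgs, img_branch, grad_branch, fuse_head):
    f_img = img_branch(imgs)                     # 光度图像分支特征
    f_grad = grad_branch(image_gradients(imgs))  # 图像梯度分支特征
    return fuse_head(torch.cat([f_img, f_grad], dim=1))   # 融合后用于法向量回归
```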
zh
[CV-56] High-speed and High-quality Vision Reconstruction of Spike Camera with Spike Stability Theorem
【速读】: 该论文试图解决从脉冲相机(spike camera)的高频脉冲流中实现高速高质量视觉重建的问题。解决方案的关键在于提出了一种新的脉冲稳定性定理(spike stability theorem),该定理揭示了脉冲流特性与稳定光强之间的关系,并基于此定理设计了两种无参数的实时视觉重建算法。这些算法在重建质量和速度之间取得了最佳平衡,并通过FPGA实现方法达到了20,000 FPS的实时视觉重建性能。
链接: https://arxiv.org/abs/2412.11639
作者: Wei Zhang,Weiquan Yan,Yun Zhao,Wenxiang Cheng,Gang Chen,Huihui Zhou,Yonghong Tian
机构: Department of Networked Intelligence, Pengcheng Laboratory, Shenzhen 518000, China; School of Computer Science and Engineering, Sun Yat-sen University, Guangzhou, China; School of Computer Science, Peking University, Beijing 100871, China
关键词: Neuromorphic vision sensors, spike camera, dynamic vision sensor, gained increasing attention, spike
类目: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
备注:
点击查看摘要
Abstract:Neuromorphic vision sensors, such as the dynamic vision sensor (DVS) and spike camera, have gained increasing attention in recent years. The spike camera can detect fine textures by mimicking the fovea in the human visual system, and output a high-frequency spike stream. Real-time high-quality vision reconstruction from the spike stream can build a bridge to high-level vision task applications of the spike camera. To realize high-speed and high-quality vision reconstruction of the spike camera, we propose a new spike stability theorem that reveals the relationship between spike stream characteristics and stable light intensity. Based on the spike stability theorem, two parameter-free algorithms are designed for the real-time vision reconstruction of the spike camera. To demonstrate the performances of our algorithms, two datasets (a public dataset PKU-Spike-High-Speed and a newly constructed dataset SpikeCityPCL) are used to compare the reconstruction quality and speed of various reconstruction methods. Experimental results show that, compared with the current state-of-the-art (SOTA) reconstruction methods, our reconstruction methods obtain the best tradeoff between the reconstruction quality and speed. Additionally, we design the FPGA implementation method of our algorithms to realize the real-time (running at 20,000 FPS) visual reconstruction. Our work provides new theorem and algorithm foundations for the real-time edge-end vision processing of the spike camera.
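在论文的两种无参数重建算法之外,可以借助经典的"脉冲间隔重建"思路直观理解脉冲流与光强的关系(下面是笔者给出的通用示意,并非论文的新算法;常数 `c` 为假设):

```python
import numpy as np

def intensity_from_isi(spikes, c=255.0):
    """spikes: (T, H, W) 的 0/1 脉冲流;脉冲间隔越短,估计的光强越高。"""
    T, H, W = spikes.shape
    intensity = np.zeros((H, W), dtype=np.float32)
    for y in range(H):
        for x in range(W):
            fire_times = np.flatnonzero(spikes[:, y, x])
            if fire_times.size >= 2:
                isi = np.diff(fire_times).mean()   # 平均脉冲间隔
                intensity[y, x] = c / isi          # 亮处发放更密集
    return intensity
```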
zh
[CV-57] IDProtector: An Adversarial Noise Encoder to Protect Against ID-Preserving Image Generation
【速读】: 该论文试图解决基于编码器的身份保护方法对肖像照片进行未经授权定制的问题。解决方案的关键是引入IDProtector,一种对抗性噪声编码器,通过在单次前向传递中向肖像照片添加不可察觉的对抗性噪声,从而提供对多种最先进的基于编码器方法(如InstantID、IP-Adapter和PhotoMaker)的通用保护,同时确保对常见图像变换(如JPEG压缩、调整大小和仿射变换)的鲁棒性。
链接: https://arxiv.org/abs/2412.11638
作者: Yiren Song,Pei Yang,Hai Ci,Mike Zheng Shou
机构: Show Lab, National University of Singapore (新加坡国立大学Show实验室)
关键词: revolutionized identity-preserving generation, identity-preserving generation, Recently, efficient identity-preserving generation, zero-shot methods
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Recently, zero-shot methods like InstantID have revolutionized identity-preserving generation. Unlike multi-image finetuning approaches such as DreamBooth, these zero-shot methods leverage powerful facial encoders to extract identity information from a single portrait photo, enabling efficient identity-preserving generation through a single inference pass. However, this convenience introduces new threats to the facial identity protection. This paper aims to safeguard portrait photos from unauthorized encoder-based customization. We introduce IDProtector, an adversarial noise encoder that applies imperceptible adversarial noise to portrait photos in a single forward pass. Our approach offers universal protection for portraits against multiple state-of-the-art encoder-based methods, including InstantID, IP-Adapter, and PhotoMaker, while ensuring robustness to common image transformations such as JPEG compression, resizing, and affine transformations. Experiments across diverse portrait datasets and generative models reveal that IDProtector generalizes effectively to unseen data and even closed-source proprietary models.
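其"单次前向传递加不可察觉噪声"的接口可以用如下草图理解(笔者补充,非论文实现;`noise_encoder` 为假设的已训练网络,扰动上限 `eps` 为常见取值假设):

```python
import torch

def protect(portrait, noise_encoder, eps=8.0 / 255.0):
    """portrait: (B, 3, H, W),取值范围 [0, 1]。"""
    delta = noise_encoder(portrait)                 # 单次前向传递得到扰动
    delta = eps * torch.tanh(delta)                 # 将扰动约束在 ±eps 内,保证不可察觉
    return (portrait + delta).clamp(0.0, 1.0)       # 保持合法图像范围
```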
zh
[CV-58] Predicting the Original Appearance of Damaged Historical Documents AAAI2025
【速读】: 该论文试图解决历史文档修复(Historical Document Repair, HDR)问题,即预测受损历史文档的原始外观。解决方案的关键在于提出了一个大规模数据集HDR28K和一个基于扩散网络的模型DiffHDR。HDR28K包含28,552对受损-修复图像,具有字符级标注和多风格退化,而DiffHDR通过结合语义和空间信息以及精心设计的字符感知损失(character perceptual loss),增强了传统扩散框架的上下文和视觉一致性。实验结果表明,DiffHDR在处理真实受损文档时显著优于现有方法,并展示了其在文档编辑和文本块生成方面的灵活性和泛化能力。
链接: https://arxiv.org/abs/2412.11634
作者: Zhenhua Yang,Dezhi Peng,Yongxin Shi,Yuyi Zhang,Chongyu Liu,Lianwen Jin
机构: 未知
关键词: severe damages including, Historical Document Repair, Historical documents encompass, including character missing, damages including character
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to AAAI 2025; Github Page: this https URL
点击查看摘要
Abstract:Historical documents encompass a wealth of cultural treasures but suffer from severe damages including character missing, paper damage, and ink erosion over time. However, existing document processing methods primarily focus on binarization, enhancement, etc., neglecting the repair of these damages. To this end, we present a new task, termed Historical Document Repair (HDR), which aims to predict the original appearance of damaged historical documents. To fill the gap in this field, we propose a large-scale dataset HDR28K and a diffusion-based network DiffHDR for historical document repair. Specifically, HDR28K contains 28,552 damaged-repaired image pairs with character-level annotations and multi-style degradations. Moreover, DiffHDR augments the vanilla diffusion framework with semantic and spatial information and a meticulously designed character perceptual loss for contextual and visual coherence. Experimental results demonstrate that the proposed DiffHDR trained using HDR28K significantly surpasses existing approaches and exhibits remarkable performance in handling real damaged documents. Notably, DiffHDR can also be extended to document editing and text block generation, showcasing its high flexibility and generalization capacity. We believe this study could pioneer a new direction of document processing and contribute to the inheritance of invaluable cultures and civilizations. The dataset and code is available at this https URL.
zh
[CV-59] VG-TVP: Multimodal Procedural Planning via Visually Grounded Text-Video Prompting AAAI
【速读】: 该论文试图解决基于大型语言模型 (LLM) 的代理在处理程序性任务时,如何通过多模态指令(包括文本和视频)来增强用户辅助的潜力问题。解决方案的关键在于提出了视觉基础的文本-视频提示方法 (VG-TVP),这是一种新颖的基于 LLM 的多模态程序规划 (MPP) 框架。VG-TVP 通过利用 LLM 的零样本推理能力、视频字幕模型的视频到文本生成能力以及扩散模型的文本到视频生成能力,生成连贯的文本和视频程序计划。为增强模态间的交互,VG-TVP 提出了融合字幕 (FoC) 方法,并使用文本到视频桥 (T2V-B) 和视频到文本桥 (V2T-B),使 LLM 能够引导生成视觉基础的文本计划和文本基础的视频计划。此外,为解决适用于 MPP 的数据集稀缺问题,论文还创建了新的 Daily-Life Task Procedural Plans (Daily-PP) 数据集,并通过实验验证了 VG-TVP 在 Daily-PP 数据集上优于单模态基线。
链接: https://arxiv.org/abs/2412.11621
作者: Muhammet Furkan Ilaslan,Ali Koksal,Kevin Qinhong Lin,Burak Satar,Mike Zheng Shou,Qianli Xu
机构: 1. Shanghai Jiao Tong University (上海交通大学); 2. Koç University (科克大学)
关键词: Large Language Model, Large Language, users remains under-explored, multimodal instructions augmented, assist users remains
类目: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
备注: Accepted for The 39th Annual AAAI Conference on Artificial Intelligence 2025 in Main Track, 19 pages, 24 figures
点击查看摘要
Abstract:Large Language Model (LLM)-based agents have shown promise in procedural tasks, but the potential of multimodal instructions augmented by texts and videos to assist users remains under-explored. To address this gap, we propose the Visually Grounded Text-Video Prompting (VG-TVP) method which is a novel LLM-empowered Multimodal Procedural Planning (MPP) framework. It generates cohesive text and video procedural plans given a specified high-level objective. The main challenges are achieving textual and visual informativeness, temporal coherence, and accuracy in procedural plans. VG-TVP leverages the zero-shot reasoning capability of LLMs, the video-to-text generation ability of the video captioning models, and the text-to-video generation ability of diffusion models. VG-TVP improves the interaction between modalities by proposing a novel Fusion of Captioning (FoC) method and using Text-to-Video Bridge (T2V-B) and Video-to-Text Bridge (V2T-B). They allow LLMs to guide the generation of visually-grounded text plans and textual-grounded video plans. To address the scarcity of datasets suitable for MPP, we have curated a new dataset called Daily-Life Task Procedural Plans (Daily-PP). We conduct comprehensive experiments and benchmarks to evaluate human preferences (regarding textual and visual informativeness, temporal coherence, and plan accuracy). Our VG-TVP method outperforms unimodal baselines on the Daily-PP dataset.
zh
[CV-60] Combating Semantic Contamination in Learning with Label Noise AAAI2025
【速读】: 该论文试图解决深度神经网络在处理带有噪声标签的数据时性能下降的问题,特别是由于标签重构(label refurbishment)方法引入的语义污染(Semantic Contamination)现象。解决方案的关键在于提出了一种名为协同交叉学习(Collaborative Cross Learning)的新方法,该方法通过在重构标签上应用半监督学习,从不同视角和模型的嵌入中提取适当的语义关联,从而有效平衡单个类别的语义信息并保持跨模型的语义一致性,最终在合成和真实世界的噪声数据集上均表现出优于现有方法的性能。
链接: https://arxiv.org/abs/2412.11620
作者: Wenxiao Fan,Kan Li
机构: 未知
关键词: deep neural networks, Semantic Contamination, neural networks, performance of deep, deep neural
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: AAAI2025
点击查看摘要
Abstract:Noisy labels can negatively impact the performance of deep neural networks. One common solution is label refurbishment, which involves reconstructing noisy labels through predictions and distributions. However, these methods may introduce problematic semantic associations, a phenomenon that we identify as Semantic Contamination. Through an analysis of Robust LR, a representative label refurbishment method, we found that utilizing the logits of views for refurbishment does not adequately balance the semantic information of individual classes. Conversely, using the logits of models fails to maintain consistent semantic relationships across models, which explains why label refurbishment methods frequently encounter issues related to Semantic Contamination. To address this issue, we propose a novel method called Collaborative Cross Learning, which utilizes semi-supervised learning on refurbished labels to extract appropriate semantic associations from embeddings across views and models. Experimental results show that our method outperforms existing approaches on both synthetic and real-world noisy datasets, effectively mitigating the impact of label noise and Semantic Contamination.
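作为背景,论文所分析的"标签重构"通常形如下面的草图(笔者补充的通用写法,并非文中提出的协同交叉学习方法本身;混合系数 `alpha` 为假设):

```python
import torch.nn.functional as F

def refurbish_labels(logits, noisy_labels, num_classes, alpha=0.7):
    """用模型自身预测与含噪标签加权混合,得到软化的重构标签。"""
    one_hot = F.one_hot(noisy_labels, num_classes).float()
    pred = F.softmax(logits.detach(), dim=-1)
    return alpha * one_hot + (1.0 - alpha) * pred
```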
zh
[CV-61] CLIP-SR: Collaborative Linguistic and Image Processing for Super-Resolution
【速读】: 该论文试图解决基于卷积神经网络 (CNN) 的超分辨率 (SR) 方法在严重下采样(如8x或16x)时产生的伪影和模糊问题,以及现有文本引导SR方法在语义对齐上的不一致性。解决方案的关键在于引入多模态语义增强方法,通过结合文本语义与视觉特征,有效解决语义不匹配和细节丢失问题。具体实现上,论文提出了一个多模态协同框架,该框架整合了文本和图像输入,利用提示预测器、文本-图像融合块 (TIFBlock)、迭代优化模块以及CLIP特征,实现细粒度的逐步增强过程。这种方法能够在大幅缩放因子下生成具有清晰细节和语义一致性的高质量超分辨率图像。
链接: https://arxiv.org/abs/2412.11609
作者: Bingwen Hu,Heng Liu,Zhedong Zheng,Ping Liu
机构: 未知
关键词: Convolutional Neural Networks, Convolutional Neural, Neural Networks, CNN-based methods rely, methods rely solely
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 11 pages, 10 figures
点击查看摘要
Abstract:Convolutional Neural Networks (CNNs) have advanced Image Super-Resolution (SR), but most CNN-based methods rely solely on pixel-based transformations, often leading to artifacts and blurring, particularly with severe downsampling (e.g., 8x or 16x). Recent text-guided SR methods attempt to leverage textual information for enhanced detail, but they frequently struggle with effective alignment, resulting in inconsistent semantic coherence. To address these limitations, we introduce a multi-modal semantic enhancement approach that combines textual semantics with visual features, effectively tackling semantic mismatches and detail loss in highly degraded LR images. Our proposed multi-modal collaborative framework enables the production of realistic and high-quality SR images at significant up-scaling factors. The framework integrates text and image inputs, employing a prompt predictor, Text-Image Fusion Block (TIFBlock), and Iterative Refinement Module alongside CLIP (Contrastive Language-Image Pretraining) features to guide a progressive enhancement process with fine-grained alignment. This alignment produces high-resolution outputs with crisp details and semantic coherence, even at large scaling factors. Through extensive comparative experiments and ablation studies, we validate the effectiveness of our approach. Additionally, by incorporating textual semantic guidance, our technique enables a degree of super-resolution editability while maintaining semantic coherence.
zh
[CV-62] owards Adversarial Robustness of Model-Level Mixture-of-Experts Architectures for Semantic Segmentation ICML
【速读】: 该论文试图解决深度神经网络在对抗攻击下的脆弱性问题,特别是针对城市和高速公路交通场景的语义分割任务。解决方案的关键在于引入混合专家模型 (Mixture of Experts, MoE),其通过可学习的门控组件动态预测专家模型的输出权重,从而在组合模型输出时更具灵活性和鲁棒性。与传统的集成方法相比,MoE在大多数情况下表现出更强的对抗攻击抵御能力,尤其是在面对实例级、通用白盒攻击以及迁移攻击时。
链接: https://arxiv.org/abs/2412.11608
作者: Svetlana Pavlitska,Enrico Eisen,J. Marius Zöllner
机构: FZI Research Center for Information Technology; Karlsruhe Institute of Technology (KIT)
关键词: deep neural networks, well-known deficiency, deficiency of deep, deep neural, neural networks
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: Accepted for publication at ICMLA 2024
点击查看摘要
Abstract:Vulnerability to adversarial attacks is a well-known deficiency of deep neural networks. Larger networks are generally more robust, and ensembling is one method to increase adversarial robustness: each model’s weaknesses are compensated by the strengths of others. While an ensemble uses a deterministic rule to combine model outputs, a mixture of experts (MoE) includes an additional learnable gating component that predicts weights for the outputs of the expert models, thus determining their contributions to the final prediction. MoEs have been shown to outperform ensembles on specific tasks, yet their susceptibility to adversarial attacks has not been studied yet. In this work, we evaluate the adversarial vulnerability of MoEs for semantic segmentation of urban and highway traffic scenes. We show that MoEs are, in most cases, more robust to per-instance and universal white-box adversarial attacks and can better withstand transfer attacks. Our code is available at this https URL.
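模型级 MoE 的"可学习门控加权专家输出"可用如下最小草图表示(笔者补充,非论文实现;`experts` 与 `gate` 的具体结构均为假设):

```python
import torch
import torch.nn as nn

class SegmentationMoE(nn.Module):
    def __init__(self, experts, gate):
        super().__init__()
        self.experts = nn.ModuleList(experts)   # 每个专家: 图像 -> (B, 类别数, H, W)
        self.gate = gate                        # 门控: 图像 -> (B, 专家数) 的 logits

    def forward(self, x):
        weights = torch.softmax(self.gate(x), dim=-1)            # (B, E)
        outputs = torch.stack([e(x) for e in self.experts], 1)   # (B, E, K, H, W)
        return (weights[:, :, None, None, None] * outputs).sum(dim=1)
```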
zh
[CV-63] 3D2-Actor: Learning Pose-Conditioned 3D-Aware Denoiser for Realistic Gaussian Avatar Modeling AAAI2025
【速读】: 该论文试图解决从稀疏多视角RGB视频中学习可动画的3D虚拟形象时,传统方法在捕捉姿态依赖细节和泛化到新姿态时面临的挑战。解决方案的关键在于引入了一种新颖的姿态条件3D感知人体建模流程,即3D²-Actor。该方法通过迭代2D去噪和3D校正步骤,结合姿态线索生成详细的多视角图像,为高保真3D重建和姿态渲染提供丰富的特征集。具体来说,2D去噪器生成多视角图像,而基于高斯的3D校正器通过两阶段投影策略和局部坐标表示增强3D一致性。此外,创新的采样策略确保了视频合成中的平滑时间连续性。这种方法有效解决了传统数值方法在处理不适定映射时的局限性,生成了逼真且可动画的3D人体虚拟形象。
链接: https://arxiv.org/abs/2412.11599
作者: Zichen Tang,Hongyu Yang,Hanchen Zhang,Jiaxin Chen,Di Huang
机构: School of Computer Science and Technology, Tianjin University, Tianjin, China (计算机科学与技术学院,天津大学,天津,中国); Tianjin Key Laboratory of Cognitive Computing and Application, Tianjin University, Tianjin, China (天津认知计算与应用重点实验室,天津大学,天津,中国); School of Computer Science and Technology, Shandong University, Jinan, China (计算机科学与技术学院,山东大学,济南,中国)
关键词: sparse multi-view RGB, Advancements in neural, neural implicit representations, multi-view RGB videos, multi-view RGB
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by AAAI 2025
点击查看摘要
Abstract:Advancements in neural implicit representations and differentiable rendering have markedly improved the ability to learn animatable 3D avatars from sparse multi-view RGB videos. However, current methods that map observation space to canonical space often face challenges in capturing pose-dependent details and generalizing to novel poses. While diffusion models have demonstrated remarkable zero-shot capabilities in 2D image generation, their potential for creating animatable 3D avatars from 2D inputs remains underexplored. In this work, we introduce 3D²-Actor, a novel approach featuring a pose-conditioned 3D-aware human modeling pipeline that integrates iterative 2D denoising and 3D rectifying steps. The 2D denoiser, guided by pose cues, generates detailed multi-view images that provide the rich feature set necessary for high-fidelity 3D reconstruction and pose rendering. Complementing this, our Gaussian-based 3D rectifier renders images with enhanced 3D consistency through a two-stage projection strategy and a novel local coordinate representation. Additionally, we propose an innovative sampling strategy to ensure smooth temporal continuity across frames in video synthesis. Our method effectively addresses the limitations of traditional numerical solutions in handling ill-posed mappings, producing realistic and animatable 3D human avatars. Experimental results demonstrate that 3D²-Actor excels in high-fidelity avatar modeling and robustly generalizes to novel poses. Code is available at: this https URL.
zh
[CV-64] MeshArt: Generating Articulated Meshes with Structure-guided Transformers
【速读】: 该论文试图解决生成具有关节结构的3D物体(articulated 3D object)的问题,旨在创建逼真、功能性强且可交互的虚拟资产,而非简单的静态模型。解决方案的关键在于提出了一种基于分层Transformer的方法,称为MeshArt。该方法通过两阶段生成过程实现:首先生成高层次的关节感知对象结构,然后基于该结构合成每个部分的网格面。核心创新在于将关节结构和部件网格建模为量化三角形嵌入序列,形成统一的层次化框架,并利用Transformer进行自回归生成。具体来说,首先生成对象部件的边界原语和关节模式,随后在关节结构的指导下,利用第二个Transformer生成每个部件的网格三角形。为确保生成部件之间的连贯性,引入了结构引导的条件生成,同时考虑局部部件网格的连通性。
链接: https://arxiv.org/abs/2412.11596
作者: Daoyi Gao,Yawar Siddiqui,Lei Li,Angela Dai
机构: Technical University of Munich; Meta
关键词: interactable virtual assets, creating realistic, simply static, fundamental for creating, interactable virtual
类目: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
备注: Project Page: this https URL
点击查看摘要
Abstract:Articulated 3D object generation is fundamental for creating realistic, functional, and interactable virtual assets which are not simply static. We introduce MeshArt, a hierarchical transformer-based approach to generate articulated 3D meshes with clean, compact geometry, reminiscent of human-crafted 3D models. We approach articulated mesh generation in a part-by-part fashion across two stages. First, we generate a high-level articulation-aware object structure; then, based on this structural information, we synthesize each part’s mesh faces. Key to our approach is modeling both articulation structures and part meshes as sequences of quantized triangle embeddings, leading to a unified hierarchical framework with transformers for autoregressive generation. Object part structures are first generated as their bounding primitives and articulation modes; a second transformer, guided by these articulation structures, then generates each part’s mesh triangles. To ensure coherency among generated parts, we introduce structure-guided conditioning that also incorporates local part mesh connectivity. MeshArt shows significant improvements over state of the art, with 57.1% improvement in structure coverage and a 209-point improvement in mesh generation FID.
zh
[CV-65] VersaGen: Unleashing Versatile Visual Control for Text-to-Image Synthesis AAAI2025
【速读】: 该论文试图解决文本到图像 (T2I) 合成中精确视觉控制的问题。现有方法虽然尝试通过多方面控制(如文本和草图)来增强生成图像的创意控制,但研究表明人类的表达能力远超当前方法的局限性。论文提出的解决方案是 VersaGen,一个生成式 AI 代理,能够实现多功能的视觉控制,包括单个视觉对象、多个视觉对象、场景背景以及它们的任意组合或无控制。关键在于通过在冻结的 T2I 模型上训练一个适配器,将视觉信息融入以文本为主的扩散过程中,并在推理阶段引入三种优化策略以提升生成效果和用户体验。实验结果表明 VersaGen 在 COCO 和 Sketchy 数据集上的有效性和灵活性。
链接: https://arxiv.org/abs/2412.11594
作者: Zhipeng Chen,Lan Yang,Yonggang Qi,Honggang Zhang,Kaiyue Pang,Ke Li,Yi-Zhe Song
机构: 1. Beijing Institute of Technology (北京理工大学); 2. University of Surrey (萨里大学)
关键词: enabling precise visual, enabling precise, significant challenge, rapid advancements, remains a significant
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: The paper has been accepted by AAAI 2025. Paper code: this https URL
点击查看摘要
Abstract:Despite the rapid advancements in text-to-image (T2I) synthesis, enabling precise visual control remains a significant challenge. Existing works attempted to incorporate multi-facet controls (text and sketch), aiming to enhance the creative control over generated images. However, our pilot study reveals that the expressive power of humans far surpasses the capabilities of current methods. Users desire a more versatile approach that can accommodate their diverse creative intents, ranging from controlling individual subjects to manipulating the entire scene composition. We present VersaGen, a generative AI agent that enables versatile visual control in T2I synthesis. VersaGen admits four types of visual controls: i) single visual subject; ii) multiple visual subjects; iii) scene background; iv) any combination of the three above or merely no control at all. We train an adaptor upon a frozen T2I model to accommodate the visual information into the text-dominated diffusion process. We introduce three optimization strategies during the inference phase of VersaGen to improve generation results and enhance user experience. Comprehensive experiments on COCO and Sketchy validate the effectiveness and flexibility of VersaGen, as evidenced by both qualitative and quantitative results.
zh
[CV-66] StrandHead: Text to Strand-Disentangled 3D Head Avatars Using Hair Geometric Priors
【速读】: 该论文试图解决现有虚拟头像生成方法在处理实际发型时存在的通用或纠缠表示问题,导致无法准确建模复杂发型。解决方案的关键在于提出了StrandHead,一种新颖的从文本生成3D头像的方法,能够生成具有发丝表示的解耦3D头发。通过蒸馏2D生成扩散模型,StrandHead无需3D数据监督即可生成逼真的发丝,并提出了一系列可靠的先验(如形状初始化、几何基元和统计发型特征),确保了稳定的优化过程和与文本对齐的性能。实验表明,StrandHead在生成的3D头像和头发的真实性与多样性方面达到了最先进水平,并可轻松集成到Unreal Engine中进行物理模拟和其他应用。
链接: https://arxiv.org/abs/2412.11586
作者: Xiaokun Sun,Zeyu Cai,Zhenyu Zhang,Ying Tai,Jian Yang
机构: Nanjing University; The Hong Kong University of Science and Technology (Guangzhou)
关键词: existing avatar generation, generation methods fail, avatar generation methods, practical hair due, distinct personality
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project page: this https URL
点击查看摘要
Abstract:While haircut indicates distinct personality, existing avatar generation methods fail to model practical hair due to the general or entangled representation. We propose StrandHead, a novel text to 3D head avatar generation method capable of generating disentangled 3D hair with strand representation. Without using 3D data for supervision, we demonstrate that realistic hair strands can be generated from prompts by distilling 2D generative diffusion models. To this end, we propose a series of reliable priors on shape initialization, geometric primitives, and statistical haircut features, leading to a stable optimization and text-aligned performance. Extensive experiments show that StrandHead achieves the state-of-the-art reality and diversity of generated 3D head and hair. The generated 3D hair can also be easily implemented in the Unreal Engine for physical simulation and other applications. The code will be available at this https URL.
zh
[CV-67] Oriented Tiny Object Detection: A Dataset Benchmark and Dynamic Unbiased Learning
【速读】: 该论文试图解决面向微小物体检测的问题,这类物体在外观信息有限的情况下广泛存在于现实应用中,但目前仍是一个复杂且未被充分探索的领域。解决方案的关键在于提出了一个动态粗到细学习方案(Dynamic Coarse-to-Fine Learning, DCFL),该方案通过动态更新先验位置以更好地对齐微小物体的有限区域,并在样本分配中平衡数量和质量,从而减轻了学习偏差,提升了检测性能。实验结果表明,DCFL在多个具有挑战性的数据集上实现了最先进的准确性、高效性和广泛的适用性。
链接: https://arxiv.org/abs/2412.11582
作者: Chang Xu,Ruixiang Zhang,Wen Yang,Haoran Zhu,Fang Xu,Jian Ding,Gui-Song Xia
机构: School of Electronic Information, Wuhan University, Wuhan, 430072 China; King Abdullah University of Science and Technology; School of Computer Science and the State Key Lab. LIESMARS, Wuhan University, Wuhan, 430072, China
关键词: Detecting oriented tiny, Detecting oriented, real-world applications, remains an intricate, under-explored problem
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Detecting oriented tiny objects, which are limited in appearance information yet prevalent in real-world applications, remains an intricate and under-explored problem. To address this, we systemically introduce a new dataset, benchmark, and a dynamic coarse-to-fine learning scheme in this study. Our proposed dataset, AI-TOD-R, features the smallest object sizes among all oriented object detection datasets. Based on AI-TOD-R, we present a benchmark spanning a broad range of detection paradigms, including both fully-supervised and label-efficient approaches. Through investigation, we identify a learning bias presents across various learning pipelines: confident objects become increasingly confident, while vulnerable oriented tiny objects are further marginalized, hindering their detection performance. To mitigate this issue, we propose a Dynamic Coarse-to-Fine Learning (DCFL) scheme to achieve unbiased learning. DCFL dynamically updates prior positions to better align with the limited areas of oriented tiny objects, and it assigns samples in a way that balances both quantity and quality across different object shapes, thus mitigating biases in prior settings and sample selection. Extensive experiments across eight challenging object detection datasets demonstrate that DCFL achieves state-of-the-art accuracy, high efficiency, and remarkable versatility. The dataset, benchmark, and code are available at this https URL.
zh
[CV-68] SweepEvGS: Event-Based 3D Gaussian Splatting for Macro and Micro Radiance Field Rendering from a Single Sweep
【速读】: 该论文试图解决传统3D高斯拼接(3D Gaussian Splatting, 3D-GS)方法在动态环境中需要高帧率、高质量图像进行场景重建时,捕捉过程耗时且效率低下的问题。解决方案的关键在于提出了SweepEvGS,这是一种结合事件相机(event cameras)的新型硬件集成方法,利用事件相机的高时间分辨率和异步亮度变化捕捉能力,通过单次扫描即可实现鲁棒且精确的新视角合成。SweepEvGS通过结合初始静态帧和单次扫描期间捕捉的密集事件流,有效重建详细的场景视图,并在合成物体、宏观和微观真实场景中验证了其优越的视觉渲染质量、速度和计算效率。
链接: https://arxiv.org/abs/2412.11579
作者: Jingqian Wu,Shuo Zhu,Chutian Wang,Boxin Shi,Edmund Y. Lam
机构: The University of Hong Kong, Pokfulam, Hong Kong SAR, China; National Key Laboratory for Multimedia Information Processing and National Engineering Research Center of Visual Technology, School of Computer Science, Peking University, Beijing 100871, China
关键词: Gaussian Splatting, continuously calibrated input, Gaussian primitives, Recent advancements, calibrated input views
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Recent advancements in 3D Gaussian Splatting (3D-GS) have demonstrated the potential of using 3D Gaussian primitives for high-speed, high-fidelity, and cost-efficient novel view synthesis from continuously calibrated input views. However, conventional methods require high-frame-rate dense and high-quality sharp images, which are time-consuming and inefficient to capture, especially in dynamic environments. Event cameras, with their high temporal resolution and ability to capture asynchronous brightness changes, offer a promising alternative for more reliable scene reconstruction without motion blur. In this paper, we propose SweepEvGS, a novel hardware-integrated method that leverages event cameras for robust and accurate novel view synthesis across various imaging settings from a single sweep. SweepEvGS utilizes the initial static frame with dense event streams captured during a single camera sweep to effectively reconstruct detailed scene views. We also introduce different real-world hardware imaging systems for real-world data collection and evaluation for future research. We validate the robustness and efficiency of SweepEvGS through experiments in three different imaging settings: synthetic objects, real-world macro-level, and real-world micro-level view synthesis. Our results demonstrate that SweepEvGS surpasses existing methods in visual rendering quality, rendering speed, and computational efficiency, highlighting its potential for dynamic practical applications.
zh
[CV-69] DVP-MVS: Synergize Depth-Edge and Visibility Prior for Multi-View Stereo
【速读】: 该论文试图解决基于补丁变形的多视图立体(multi-view stereo, MVS)方法中存在的变形不稳定性和可见性遮挡问题,这些问题导致边缘跳跃和估计偏差。解决方案的关键在于提出了一种新的深度边缘对齐和跨视图先验(depth-edge aligned and cross-view prior)方法,称为DVP-MVS。具体来说,通过使用Depth Anything V2和Roberts算子初始化粗略的深度和边缘图,并通过侵蚀-膨胀策略对齐这些图以生成细粒度的均匀边界,从而避免边缘跳跃。此外,通过将视图选择权重重构为可见性图,并利用跨视图深度重投影恢复可见区域,作为跨视图先验来促进可见性感知的补丁变形。最后,通过引入基于视图选择的多视图几何一致性和基于极线的局部投影深度差异,改进了传播和细化过程。实验结果表明,该方法在ETH3D和Tanks & Temples基准测试中表现出卓越的鲁棒性和泛化能力。
链接: https://arxiv.org/abs/2412.11578
作者: Zhenlong Yuan,Jinguo Luo,Fei Shen,Zhaoxin Li,Cong Liu,Tianlu Mao,Zhaoqi Wang
机构: 1. School of Computer Science and Technology, Harbin Institute of Technology, Harbin, China(哈尔滨工业大学计算机科学与技术学院,哈尔滨,中国);
2. School of Software, Dalian University of Technology, Dalian, China(大连理工大学软件学院,大连,中国);
3. School of Information Science and Engineering, Northeastern University, Shenyang, China(东北大学信息科学与工程学院,沈阳,中国);
4. School of Computer Science and Technology, Harbin Institute of Technology, Shenzhen, China(哈尔滨工业大学(深圳)计算机科学与技术学院,深圳,中国);
5. Shenzhen Research Institute of Big Data, Shenzhen, China(深圳大数据研究院,深圳,中国)
关键词: recently exhibited substantial, exhibited substantial effectiveness, reconstruct textureless areas, patch deformation, visibility-aware patch deformation
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Patch deformation-based methods have recently exhibited substantial effectiveness in multi-view stereo, due to the incorporation of deformable and expandable perception to reconstruct textureless areas. However, such approaches typically focus on exploring correlative reliable pixels to alleviate match ambiguity during patch deformation, but ignore the deformation instability caused by mistaken edge-skipping and visibility occlusion, leading to potential estimation deviation. To remedy the above issues, we propose DVP-MVS, which innovatively synergizes depth-edge aligned and cross-view prior for robust and visibility-aware patch deformation. Specifically, to avoid unexpected edge-skipping, we first utilize Depth Anything V2 followed by the Roberts operator to initialize coarse depth and edge maps respectively, both of which are further aligned through an erosion-dilation strategy to generate fine-grained homogeneous boundaries for guiding patch deformation. In addition, we reform view selection weights as visibility maps and restore visible areas by cross-view depth reprojection, then regard them as cross-view prior to facilitate visibility-aware patch deformation. Finally, we improve propagation and refinement with multi-view geometry consistency by introducing aggregated visible hemispherical normals based on view selection and local projection depth differences based on epipolar lines, respectively. Extensive evaluations on ETH3D and Tanks & Temples benchmarks demonstrate that our method can achieve state-of-the-art performance with excellent robustness and generalization.
zh
[CV-70] Aligning Visual and Semantic Interpretability through Visually Grounded Concept Bottleneck Models
【速读】: 该论文试图解决神经网络决策过程的透明性和可解释性问题。解决方案的关键在于引入视觉基础概念瓶颈模型 (Visually Grounded Concept Bottleneck Models, GCBM),通过使用分割和检测基础模型在图像层面上推导出可解释的概念,这些概念可以直接与输入图像关联,从而增强模型的可解释性。GCBM 允许用户控制概念的粒度、数量和命名,提供了灵活性,并且可以轻松适应新数据集,无需预训练或额外数据。此外,GCBM 在细粒度分类的可解释性方面表现尤为出色,尤其是在 CUB 数据集上。
链接: https://arxiv.org/abs/2412.11576
作者: Patrick Knab,Katharina Prasse,Sascha Marton,Christian Bartelt,Margret Keuper
机构: Institute for Enterprise Systems, University of Mannheim, Mannheim, Germany; Data and Web Science Group, University of Mannheim, Mannheim, Germany; Max-Planck-Institute for Informatics, Saarland Informatics Campus
关键词: networks increases steadily, neural networks increases, Concept Bottleneck Models, increases steadily, Bottleneck Models
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: *Equal contribution
点击查看摘要
Abstract:The performance of neural networks increases steadily, but our understanding of their decision-making lags behind. Concept Bottleneck Models (CBMs) address this issue by incorporating human-understandable concepts into the prediction process, thereby enhancing transparency and interpretability. Since existing approaches often rely on large language models (LLMs) to infer concepts, their results may contain inaccurate or incomplete mappings, especially in complex visual domains. We introduce visually Grounded Concept Bottleneck Models (GCBM), which derive concepts on the image level using segmentation and detection foundation models. Our method generates inherently interpretable concepts, which can be grounded in the input image using attribution methods, allowing interpretations to be traced back to the image plane. We show that GCBM concepts are meaningful interpretability vehicles, which aid our understanding of model embedding spaces. GCBMs allow users to control the granularity, number, and naming of concepts, providing flexibility and are easily adaptable to new datasets without pre-training or additional data needed. Prediction accuracy is within 0.3-6% of the linear probe and GCBMs perform especially well for fine-grained classification interpretability on CUB, due to their dataset specificity. Our code is available on this https URL.
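"先算概念激活、再由概念做预测"的概念瓶颈结构可参考下面的草图(笔者补充,非论文实现;`concept_bank` 假设为由分割/检测基础模型得到的概念嵌入库):

```python
import torch.nn as nn
import torch.nn.functional as F

class ConceptBottleneckHead(nn.Module):
    def __init__(self, concept_bank, num_classes):
        super().__init__()
        # concept_bank: (K, D) 概念嵌入(假设输入)
        self.register_buffer("concepts", F.normalize(concept_bank, dim=-1))
        self.classifier = nn.Linear(concept_bank.shape[0], num_classes)

    def forward(self, image_embeddings):                  # (B, D)
        z = F.normalize(image_embeddings, dim=-1)
        concept_scores = z @ self.concepts.t()            # (B, K) 可解释的概念瓶颈
        return self.classifier(concept_scores), concept_scores
```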
zh
[CV-71] PyPotteryLens: An Open-Source Deep Learning Framework for Automated Digitisation of Archaeological Pottery Documentation
【速读】: 该论文试图解决考古学中陶器记录和研究的耗时问题,尤其是大量传统出版物中的历史数据难以数字化的问题。解决方案的关键在于引入了一个名为PyPotteryLens的开源框架,该框架利用深度学习技术自动从出版物中数字化和处理考古陶器绘图。其核心技术包括使用YOLO进行实例分割和EfficientNetV2进行分类,结合直观的用户界面,使得非技术背景的考古学家也能轻松使用先进的数字方法。该框架在陶器检测和分类任务中实现了超过97%的精确度和召回率,并将处理时间缩短至手动方法的1/5到1/20,同时具备跨不同考古背景的稳健泛化能力。其模块化架构和标准化输出格式还支持扩展到其他考古材料,并为长期数据保存和机器学习算法的训练提供了坚实基础。
链接: https://arxiv.org/abs/2412.11574
作者: Lorenzo Cardarelli
机构: Sapienza University of Rome (罗马大学)
关键词: aspect of archaeology, study represents, represents a crucial, crucial but time-consuming, time-consuming aspect
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Archaeological pottery documentation and study represents a crucial but time-consuming aspect of archaeology. While recent years have seen advances in digital documentation methods, vast amounts of legacy data remain locked in traditional publications. This paper introduces PyPotteryLens, an open-source framework that leverages deep learning to automate the digitisation and processing of archaeological pottery drawings from published sources. The system combines state-of-the-art computer vision models (YOLO for instance segmentation and EfficientNetV2 for classification) with an intuitive user interface, making advanced digital methods accessible to archaeologists regardless of technical expertise. The framework achieves over 97% precision and recall in pottery detection and classification tasks, while reducing processing time by up to 5x to 20x compared to manual methods. Testing across diverse archaeological contexts demonstrates robust generalisation capabilities. Also, the system’s modular architecture facilitates extension to other archaeological materials, while its standardised output format ensures long-term preservation and reusability of digitised data as well as solid basis for training machine learning algorithms. The software, documentation, and examples are available on GitHub (this https URL).
zh
[CV-72] RADARSAT Constellation Mission Compact Polarisation SAR Data for Burned Area Mapping with Deep Learning
【速读】: 该论文旨在解决光学卫星在监测野火时因云层和烟雾遮挡导致烧毁区域检测效果不佳的问题。解决方案的关键在于利用配备合成孔径雷达(SAR)的卫星数据,特别是紧凑型极化(compact-pol)C波段RADARSAT Constellation Mission (RCM) SAR数据,通过深度学习方法进行烧毁区域制图。研究中采用了紧凑型极化m-chi分解和紧凑型极化雷达植被指数(CpRVI),并结合基于卷积神经网络(ConvNet)和Transformer的深度学习模型,构建了三种不同的输入设置进行烧毁区域制图。结果表明,紧凑型极化m-chi分解和CpRVI图像显著补充了双极化强度图像的不足,其中使用log-ratio、m-chi分解和CpRVI数据的Transformer模型UNETR表现最佳,F1分数为0.718,IoU分数为0.565,相较于仅使用log-ratio图像的模型有显著提升。
链接: https://arxiv.org/abs/2412.11561
作者: Yu Zhao,Yifang Ban
机构: 未知
关键词: Monitoring wildfires, increasingly critical due, burned area mapping, mapping burned areas, burned areas
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Monitoring wildfires has become increasingly critical due to the sharp rise in wildfire incidents in recent years. Optical satellites like Sentinel-2 and Landsat are extensively utilized for mapping burned areas. However, the effectiveness of optical sensors is compromised by clouds and smoke, which obstruct the detection of burned areas. Thus, satellites equipped with Synthetic Aperture Radar (SAR), such as dual-polarization Sentinel-1 and quad-polarization RADARSAT-1/-2 C-band SAR, which can penetrate clouds and smoke, are investigated for mapping burned areas. However, there is limited research on using compact polarisation (compact-pol) C-band RADARSAT Constellation Mission (RCM) SAR data for this purpose. This study aims to investigate the capacity of compact polarisation RCM data for burned area mapping through deep learning. Compact-pol m-chi decomposition and Compact-pol Radar Vegetation Index (CpRVI) are derived from the RCM Multi-look Complex product. A deep-learning-based processing pipeline incorporating ConvNet-based and Transformer-based models is applied for burned area mapping, with three different input settings: using only log-ratio dual-polarization intensity images, using only compact-pol decomposition plus CpRVI, and using all three data sources. The results demonstrate that compact-pol m-chi decomposition and CpRVI images significantly complement log-ratio images for burned area mapping. The best-performing Transformer-based model, UNETR, trained with log-ratio, m-chi decomposition, and CpRVI data, achieved an F1 Score of 0.718 and an IoU Score of 0.565, showing a notable improvement compared to the same model trained using only log-ratio images.
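文中作为输入之一的对数比值(log-ratio)图像,其常见计算方式如下(笔者补充的示意,波段顺序与 `eps` 均为假设):

```python
import numpy as np

def log_ratio(pre_intensity, post_intensity, eps=1e-6):
    """火灾前后 SAR 后向散射强度的逐像素对数比值,常用于变化检测。"""
    return 10.0 * np.log10((post_intensity + eps) / (pre_intensity + eps))
```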
zh
[CV-73] S-SatFire: A Multi-Task Satellite Image Time-Series Dataset for Wildfire Detection and Prediction
【速读】: 该论文试图解决野火监测和预测问题,关键在于通过多任务深度学习模型整合和增强地球观测数据。论文提出了一种综合的多时相遥感数据集,涵盖2017年1月至2021年10月美国本土的野火事件,包含3552幅地表反射率图像及天气、地形、土地覆盖和燃料等辅助数据,总计71 GB。该数据集支持三种任务:a) 活跃火点检测 (Active Fire Detection),b) 每日烧毁区域制图 (Daily Burned Area Mapping),c) 野火进展预测 (Wildfire Progression Prediction)。检测任务采用多光谱、多时相图像的像素级分类,而预测任务则结合卫星数据和辅助数据来建模火势动态。这一数据集及其基准为利用深度学习推进野火研究提供了基础。
链接: https://arxiv.org/abs/2412.11555
作者: Yu Zhao,Sebastian Gerard,Yifang Ban
机构: KTH Royal Institute of Technology, Stockholm, 11428, Sweden(瑞典皇家理工学院,斯德哥尔摩,11428,瑞典)
关键词: understanding wildfire behaviour, essential for understanding, Wildfire, Earth observation data, active fire detection
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:Wildfire monitoring and prediction are essential for understanding wildfire behaviour. With extensive Earth observation data, these tasks can be integrated and enhanced through multi-task deep learning models. We present a comprehensive multi-temporal remote sensing dataset for active fire detection, daily wildfire monitoring, and next-day wildfire prediction. Covering wildfire events in the contiguous U.S. from January 2017 to October 2021, the dataset includes 3552 surface reflectance images and auxiliary data such as weather, topography, land cover, and fuel information, totalling 71 GB. The lifecycle of each wildfire is documented, with labels for active fires (AF) and burned areas (BA), supported by manual quality assurance of AF and BA test labels. The dataset supports three tasks: a) active fire detection, b) daily burned area mapping, and c) wildfire progression prediction. Detection tasks use pixel-wise classification of multi-spectral, multi-temporal images, while prediction tasks integrate satellite and auxiliary data to model fire dynamics. This dataset and its benchmarks provide a foundation for advancing wildfire research using deep learning.
zh
[CV-74] raining Strategies for Isolated Sign Language Recognition
【速读】: 该论文旨在解决孤立手语识别 (Isolated Sign Language Recognition, ISLR) 中的数据质量和手语速度变化问题,提出了一种综合的模型训练流程。解决方案的关键在于:1) 引入精心选择的图像和视频增强技术,以应对低数据质量和手语速度变化;2) 结合回归头和IoU平衡分类损失,增强模型对手势的感知能力,并简化时间信息的捕捉;3) 通过广泛的实验验证,该训练流程能够轻松适应不同的数据集和架构,并在多个ISLR基准测试中显著提升了识别性能,特别是在WLASL和Slovo数据集上分别取得了1.63%和14.12%的性能提升,达到了当前最先进的水平。
链接: https://arxiv.org/abs/2412.11553
作者: Karina Kvanchiani,Roman Kraynov,Elizaveta Petrova,Petr Surovcev,Aleksandr Nagaev,Alexander Kapitanov
机构: SberDevices, Russia(SberDevices, 俄罗斯)
关键词: Isolated Sign Language, Sign Language Recognition, Sign Language, Isolated Sign, comprehensive model training
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: sign language recognition, training strategies, computer vision
点击查看摘要
Abstract:This paper introduces a comprehensive model training pipeline for Isolated Sign Language Recognition (ISLR) designed to accommodate the distinctive characteristics and constraints of the Sign Language (SL) domain. The constructed pipeline incorporates carefully selected image and video augmentations to tackle the challenges of low data quality and varying sign speeds. Including an additional regression head combined with IoU-balanced classification loss enhances the model’s awareness of the gesture and simplifies capturing temporal information. Extensive experiments demonstrate that the developed training pipeline easily adapts to different datasets and architectures. Additionally, the ablation study shows that each proposed component expands the potential to consider ISLR task specifics. The presented strategies improve recognition performance on a broad set of ISLR benchmarks. Moreover, we achieved a state-of-the-art result on the WLASL and Slovo benchmarks with 1.63% and 14.12% improvements compared to the previous best solution, respectively.
zh
[CV-75] MPQ-DM: Mixed Precision Quantization for Extremely Low Bit Diffusion Models AAAI2025
【速读】: 该论文试图解决扩散模型在资源受限场景下的高计算成本问题,特别是低比特量化(2-4 bit)导致的性能严重下降问题。解决方案的关键在于提出了混合精度量化方法MPQ-DM,主要依赖于两个技术:(1) 通过峰度(Kurtosis)量化异常显著的权重通道,并采用优化的层内混合精度比特分配策略(Outlier-Driven Mixed Quantization, OMQ)来减少量化误差,恢复目标效率下的精度性能;(2) 构建时间平滑关系蒸馏(Time-Smoothed Relation Distillation, TRD)方案,将量化模型与全精度模型之间的离散和连续潜在表示映射到统一的关系空间,减少跨时间步的表示不一致性。实验表明,MPQ-DM在极低比特宽度下显著提升了精度,相较于现有最先进方法,在W2A4设置下FID降低了58%。
链接: https://arxiv.org/abs/2412.11549
作者: Weilun Feng,Haotong Qin,Chuanguang Yang,Zhulin An,Libo Huang,Boyu Diao,Fei Wang,Renshuai Tao,Yongjun Xu,Michele Magno
机构: School of Computer Science and Technology, Soochow University (苏州大学计算机科学与技术学院); Institute of Functional Nano & Soft Materials (FUNSOM), Soochow University (苏州大学功能纳米与软物质研究院); Department of Computer Science, ETH Zurich (苏黎世联邦理工学院计算机科学系); School of Information Science and Engineering, Southeast University (东南大学信息科学与工程学院)
关键词: received wide attention, Diffusion models, generation tasks, Diffusion, received wide
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by AAAI 2025
点击查看摘要
Abstract:Diffusion models have received wide attention in generation tasks. However, the expensive computation cost prevents the application of diffusion models in resource-constrained scenarios. Quantization emerges as a practical solution that significantly saves storage and computation by reducing the bit-width of parameters. However, the existing quantization methods for diffusion models still cause severe degradation in performance, especially under extremely low bit-widths (2-4 bit). The primary decrease in performance comes from the significant discretization of activation values at low bit quantization. Too few activation candidates are unfriendly for outlier significant weight channel quantization, and the discretized features prevent stable learning over different time steps of the diffusion model. This paper presents MPQ-DM, a Mixed-Precision Quantization method for Diffusion Models. The proposed MPQ-DM mainly relies on two techniques: (1) To mitigate the quantization error caused by outlier severe weight channels, we propose an Outlier-Driven Mixed Quantization (OMQ) technique that uses Kurtosis to quantify outlier salient channels and apply optimized intra-layer mixed-precision bit-width allocation to recover accuracy performance within target efficiency. (2) To robustly learn representations crossing time steps, we construct a Time-Smoothed Relation Distillation (TRD) scheme between the quantized diffusion model and its full-precision counterpart, transferring discrete and continuous latent to a unified relation space to reduce the representation inconsistency. Comprehensive experiments demonstrate that MPQ-DM achieves significant accuracy gains under extremely low bit-widths compared with SOTA quantization methods. MPQ-DM achieves a 58% FID decrease under W2A4 setting compared with baseline, while all other methods even collapse.
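OMQ 中"用峰度挑出异常显著通道并分配更高比特"的思路可用如下草图示意(笔者补充,非论文实现;比特档位与比例阈值均为假设):

```python
import torch

def channel_kurtosis(weight):                  # weight: (out_channels, ...)
    w = weight.flatten(1)
    z = (w - w.mean(dim=1, keepdim=True)) / (w.std(dim=1, keepdim=True) + 1e-8)
    return (z ** 4).mean(dim=1)                # 每个输出通道的峰度

def allocate_bits(weight, low_bit=2, high_bit=4, outlier_ratio=0.1):
    k = channel_kurtosis(weight)
    n_high = max(1, int(outlier_ratio * k.numel()))
    bits = torch.full_like(k, float(low_bit))
    bits[k.topk(n_high).indices] = float(high_bit)   # 峰度最大的通道给更高比特
    return bits
```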
zh
[CV-76] Meta Curvature-Aware Minimization for Domain Generalization
【速读】: 该论文试图解决领域泛化 (Domain Generalization, DG) 中模型在源域训练后难以有效泛化到未见域的问题。解决方案的关键在于提出了一种改进的模型训练过程,旨在促使模型收敛到更平坦的局部最小值。为此,论文设计了一种曲率度量 (curvature metric),该度量在模型远离收敛时影响最小,而在模型接近局部最小值时逐渐增强,以指示最小值的曲率。基于此度量,论文推导出一种新的算法,称为元曲率感知最小化 (Meta Curvature-Aware Minimization, MeCAM),该算法同时最小化常规训练损失、SAM的代理间隙以及元学习的代理间隙。通过理论分析和实验验证,MeCAM在多个基准数据集上展示了优于现有DG方法的性能。
链接: https://arxiv.org/abs/2412.11542
作者: Ziyang Chen,Yiwen Ye,Feilong Tang,Yongsheng Pan,Yong Xia
机构: School of Computer Science and Engineering, Northwestern Polytechnical University, China; Faculty of Engineering, Monash University, Australia; Research & Development Institute of Northwestern Polytechnical University in Shenzhen, China; Ningbo Institute of Northwestern Polytechnical University, China
关键词: source domains, unseen domains, Domain generalization, aims to enhance, enhance the ability
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: 21 pages, 5 figures, 17 tables
点击查看摘要
Abstract:Domain generalization (DG) aims to enhance the ability of models trained on source domains to generalize effectively to unseen domains. Recently, Sharpness-Aware Minimization (SAM) has shown promise in this area by reducing the sharpness of the loss landscape to obtain more generalized models. However, SAM and its variants sometimes fail to guide the model toward a flat minimum, and their training processes exhibit limitations, hindering further improvements in model generalization. In this paper, we first propose an improved model training process aimed at encouraging the model to converge to a flat minima. To achieve this, we design a curvature metric that has a minimal effect when the model is far from convergence but becomes increasingly influential in indicating the curvature of the minima as the model approaches a local minimum. Then we derive a novel algorithm from this metric, called Meta Curvature-Aware Minimization (MeCAM), to minimize the curvature around the local minima. Specifically, the optimization objective of MeCAM simultaneously minimizes the regular training loss, the surrogate gap of SAM, and the surrogate gap of meta-learning. We provide theoretical analysis on MeCAM’s generalization error and convergence rate, and demonstrate its superiority over existing DG methods through extensive experiments on five benchmark DG datasets, including PACS, VLCS, OfficeHome, TerraIncognita, and DomainNet. Code will be available on GitHub.
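MeCAM 目标中用到的 SAM 代理间隙(邻域内最坏损失与当前损失之差)可参考下面的通用计算草图(笔者补充,属于标准 SAM 机制的写法,并非论文给出的完整算法;半径 `rho` 为假设):

```python
import torch

def sam_surrogate_gap(model, loss_fn, x, y, rho=0.05):
    params = [p for p in model.parameters() if p.requires_grad]
    loss = loss_fn(model(x), y)
    grads = torch.autograd.grad(loss, params)
    scale = rho / (torch.sqrt(sum((g ** 2).sum() for g in grads)) + 1e-12)
    with torch.no_grad():
        for p, g in zip(params, grads):
            p.add_(g * scale)                  # 沿梯度方向扰动到邻域内的"最坏"点
    perturbed = loss_fn(model(x), y)
    with torch.no_grad():
        for p, g in zip(params, grads):
            p.sub_(g * scale)                  # 恢复原始权重
    return (perturbed - loss).detach()         # 代理间隙,可视作局部锐度的粗略度量
```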
zh
[CV-77] SP2T: Sparse Proxy Attention for Dual-stream Point Transformer
【速读】: 该论文试图解决在三维理解任务中,点变换器(point transformers)在扩大感受野时面临的组注意力(grouping attention)限制问题。解决方案的关键在于提出了一个基于局部代理(local proxy-based)的双流点变换器模型SP²T,通过引入空间维度的代理采样(spatial-wise proxy sampling)和基于顶点的点代理关联(vertex-based point proxy associations),确保在多尺度点云中进行稳健的采样。此外,论文还提出了稀疏代理注意力(sparse proxy attention)结合表格式相对偏置(table-based relative bias),以实现低成本且精确的代理与点特征交互,从而在保持局部与全局信息平衡的同时,有效扩展模型的感受野。
链接: https://arxiv.org/abs/2412.11540
作者: Jiaxu Wan,Hong Zhang,Ziqi He,Qishu Wang,Ding Yuan,Yifan Yang
机构: School of Aerospace, BUAA(航空航天学院,北航); School of Artificial Intelligence, BUAA(人工智能学院,北航)
关键词: yielded significant advances, receptive field, yielded significant, significant advances, advances in broadening
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 13 pages, 14 figures, 14 tables
点击查看摘要
Abstract:In 3D understanding, point transformers have yielded significant advances in broadening the receptive field. However, further enhancement of the receptive field is hindered by the constraints of grouping attention. The proxy-based model, as a hot topic in image and language feature extraction, uses global or local proxies to expand the model’s receptive field. But global proxy-based methods fail to precisely determine proxy positions and are not suited for tasks like segmentation and detection in the point cloud, while existing local proxy-based methods for images face difficulties in global-local balance, proxy sampling in various point clouds, and parallel cross-attention computation for sparse association. In this paper, we present SP²T, a local proxy-based dual stream point transformer, which promotes global receptive field while maintaining a balance between local and global information. To tackle robust 3D proxy sampling, we propose a spatial-wise proxy sampling with vertex-based point proxy associations, ensuring robust point-cloud sampling in many scales of point cloud. To resolve economical association computation, we introduce sparse proxy attention combined with table-based relative bias, which enables low-cost and precise interactions between proxy and point features. Comprehensive experiments across multiple datasets reveal that our model achieves SOTA performance in downstream tasks. The code has been released in this https URL.
zh
[CV-78] Near Large Far Small: Relative Distance Based Partition Learning for UAV-view Geo-Localization
【速读】: 该论文试图解决无人机视角地理定位 (UAV-view Geo-Localization, UVGL) 中由于无人机飞行状态变化导致的跨视角尺度不一致问题,这使得基于分区的学习方法性能严重下降。解决方案的关键在于提出了一种基于相对距离的分区学习框架,即距离引导的动态分区学习策略 (Distance Guided Dynamic Partition Learning, DGDPL)。该策略通过平方分区策略提取细粒度特征和全局特征,并利用动态引导调整策略根据无人机和卫星视角之间的相对距离比率调整分区大小,从而对齐分区对的语义信息。此外,论文还提出了显著性引导的细化策略,进一步优化部分级特征,提升检索精度。实验结果表明,该方法在各种尺度不一致的场景中表现出优越的地理定位精度和对尺度变化的鲁棒性。
链接: https://arxiv.org/abs/2412.11535
作者: Quan Chen,Tingyu Wang,Rongfeng Lu,Bolun Zheng,Zhedong Zheng,Chenggang Yan
机构: School of Automation, Hangzhou Dianzi University, Hangzhou 310018, China; School of Communication Engineering, Hangzhou Dianzi University, Hangzhou 310018, China; Lishui Institute of Hangzhou Dianzi University, Lishui 323000, China; Faculty of Science and Technology, and Institute of Collaborative Innovation, University of Macau, Macau, China
关键词: presents substantial challenges, presents substantial, substantial challenges, primarily due, due to appearance
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: In Peer Review
点击查看摘要
Abstract:UAV-view Geo-Localization (UVGL) presents substantial challenges, primarily due to appearance differences between drone-view and satellite-view. Existing methods develop partition learning strategies aimed at mining more comprehensive information by constructing diverse part-level feature representations, which rely on consistent cross-view scales. However, variations of UAV flight state leads to the scale mismatch of cross-views, resulting in serious performance degradation of partition-based methods. To overcome this issue, we propose a partition learning framework based on relative distance, which alleviates the dependence on scale consistency while mining fine-grained features. Specifically, we propose a distance guided dynamic partition learning strategy (DGDPL), consisting of a square partition strategy and a dynamic-guided adjustment strategy. The former is utilized to extract fine-grained features and global features in a simple manner. The latter calculates the relative distance ratio between drone- and satellite-view to adjust the partition size, thereby aligning the semantic information between partition pairs. Furthermore, we propose a saliency-guided refinement strategy to refine part-level features, so as to further improve the retrieval accuracy. Extensive experiments show that our approach achieves superior geo-localization accuracy across various scale-inconsistent scenarios, and exhibits remarkable robustness against scale variations. The code will be released.
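"平方分区 + 按相对距离比例调整分区大小"的做法,可以用下面的方形环分区草图帮助理解(笔者补充,非论文实现;环数与 `ratio` 的使用方式均为假设):

```python
import torch

def square_ring_pool(feat, num_rings=4, ratio=1.0):
    """feat: (B, C, H, W);按到中心的切比雪夫距离划分方形环并平均池化,返回 (B, num_rings, C)。"""
    b, c, h, w = feat.shape
    cy, cx = (h - 1) / 2.0, (w - 1) / 2.0
    ys = torch.arange(h, device=feat.device).view(h, 1).expand(h, w)
    xs = torch.arange(w, device=feat.device).view(1, w).expand(h, w)
    dist = torch.maximum((ys - cy).abs(), (xs - cx).abs())
    ring_width = ratio * (dist.max() + 1e-6) / num_rings    # ratio 模拟按相对距离缩放分区
    ring_id = (dist / ring_width).floor().clamp(max=num_rings - 1).long()
    pooled = []
    for r in range(num_rings):
        mask = (ring_id == r).float()
        denom = mask.sum().clamp(min=1.0)
        pooled.append((feat * mask).flatten(2).sum(-1) / denom)   # (B, C)
    return torch.stack(pooled, dim=1)
```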
zh
[CV-79] RoMeO: Robust Metric Visual Odometry
【速读】: 该论文试图解决单目RGB视觉里程计(Visual Odometry, VO)在无IMU或3D传感器情况下的鲁棒性和泛化能力问题,特别是在户外场景中难以恢复度量尺度姿态的挑战。解决方案的关键在于提出了鲁棒度量视觉里程计(Robust Metric Visual Odometry, RoMeO),通过利用预训练的深度模型先验,结合单目度量深度和多视图立体(Multi-View Stereo, MVS)模型,实现了度量尺度的姿态恢复、简化了对应搜索、提供了更好的初始化并正则化了优化过程。此外,RoMeO通过在训练过程中注入噪声和自适应过滤噪声深度先验,显著提升了在野外数据上的鲁棒性,并在多个室内外数据集上大幅超越了当前的最先进(SOTA)方法。
链接: https://arxiv.org/abs/2412.11530
作者: Junda Cheng,Zhipeng Cai,Zhaoxing Zhang,Wei Yin,Matthias Muller,Michael Paulitsch,Xin Yang
机构: Huazhong University of Science and Technology; Intel Labs; Horizon Robotics
关键词: fundamental building block, estimate camera poses, Metric Visual Odometry, Visual odometry, aims to estimate
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Visual odometry (VO) aims to estimate camera poses from visual inputs – a fundamental building block for many applications such as VR/AR and robotics. This work focuses on monocular RGB VO where the input is a monocular RGB video without IMU or 3D sensors. Existing approaches lack robustness under this challenging scenario and fail to generalize to unseen data (especially outdoors); they also cannot recover metric-scale poses. We propose Robust Metric Visual Odometry (RoMeO), a novel method that resolves these issues leveraging priors from pre-trained depth models. RoMeO incorporates both monocular metric depth and multi-view stereo (MVS) models to recover metric-scale, simplify correspondence search, provide better initialization and regularize optimization. Effective strategies are proposed to inject noise during training and adaptively filter noisy depth priors, which ensure the robustness of RoMeO on in-the-wild data. As shown in Fig.1, RoMeO advances the state-of-the-art (SOTA) by a large margin across 6 diverse datasets covering both indoor and outdoor scenes. Compared to the current SOTA DPVO, RoMeO reduces the relative (align the trajectory scale with GT) and absolute trajectory errors both by 50%. The performance gain also transfers to the full SLAM pipeline (with global BA loop closure). Code will be released upon acceptance.
zh
[CV-80] Cross-View Geo-Localization with Street-View and VHR Satellite Imagery in Decentrality Settings
【速读】: 该论文试图解决在无全球导航卫星系统(GNSS)环境下,通过匹配街景查询图像与带有地理标签的航拍参考图像进行地理定位的问题,特别是当查询图像与参考图像中心存在较大偏移(decentrality)时,定位精度的显著下降问题。解决方案的关键在于提出了CVSat数据集,该数据集强调了地理范围广泛和景观多样性,并着重评估了偏移问题。同时,论文提出了AuxGeo方法,通过多指标优化策略,结合鸟瞰图中介模块(Bird’s-eye view Intermediary Module, BIM)和位置约束模块(Position Constraint Module, PCM),有效简化了跨视图和偏移问题的复杂性,并通过实验证明其在CVSat及其他公开数据集上的优越性能。
链接: https://arxiv.org/abs/2412.11529
作者: Panwang Xia,Lei Yu,Yi Wan,Qiong Wu,Peiqi Chen,Liheng Zhong,Yongxiang Yao,Dong Wei,Xinyi Liu,Lixiang Ru,Yingying Zhang,Jiangwei Lao,Jingdong Chen,Ming Yang,Yongjun Zhang
机构: Wuhan University (武汉大学)
关键词: geo-tagged aerial-view reference, aerial-view reference images, matching street-view query, GNSS-denied environments, environments by matching
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Cross-View Geo-Localization tackles the problem of image geo-localization in GNSS-denied environments by matching street-view query images with geo-tagged aerial-view reference images. However, existing datasets and methods often assume center-aligned settings or only consider limited decentrality (i.e., the offset of the query image from the reference image center). This assumption overlooks the challenges present in real-world applications, where large decentrality can significantly enhance localization efficiency but simultaneously lead to a substantial degradation in localization accuracy. To address this limitation, we introduce CVSat, a novel dataset designed to evaluate cross-view geo-localization with a large geographic scope and diverse landscapes, emphasizing the decentrality issue. Meanwhile, we propose AuxGeo (Auxiliary Enhanced Geo-Localization), which leverages a multi-metric optimization strategy with two novel modules: the Bird’s-eye view Intermediary Module (BIM) and the Position Constraint Module (PCM). BIM uses bird’s-eye view images derived from street-view panoramas as an intermediary, simplifying the cross-view challenge with decentrality to a cross-view problem and a decentrality problem. PCM leverages position priors between cross-view images to establish multi-grained alignment constraints. These modules improve the performance of cross-view geo-localization with the decentrality problem. Extensive experiments demonstrate that AuxGeo outperforms previous methods on our proposed CVSat dataset, mitigating the issue of large decentrality, and also achieves state-of-the-art performance on existing public datasets such as CVUSA, CVACT, and VIGOR.
zh
[CV-81] Sequence Matters: Harnessing Video Models in Super-Resolution
【速读】: 该论文试图解决从低分辨率(LR)多视角图像中重建高保真3D模型的3D超分辨率问题。早期研究主要依赖于单图像超分辨率(SISR)模型,但这些方法在处理每张图像时缺乏视角一致性。尽管后续处理技术被广泛探索以缓解这些不一致性,但问题仍未完全解决。论文的关键解决方案是利用视频超分辨率(VSR)模型,通过确保更高的空间一致性并参考周围的空间信息,从而实现更精确和详细的3D重建。研究表明,VSR模型在缺乏精确空间对齐的序列上也能表现出色,因此提出了一种简单实用的方法来对齐LR图像,无需微调或生成平滑轨迹。实验结果表明,这种简单算法在标准基准数据集(如NeRF-synthetic和MipNeRF-360)上达到了3D超分辨率任务的最新技术水平。
链接: https://arxiv.org/abs/2412.11525
作者: Hyun-kyu Ko,Dongheok Park,Youngin Park,Byeonghyeon Lee,Juhee Han,Eunbyung Park
机构: Korea Advanced Institute of Science and Technology (KAIST)(韩国科学技术院); Korea Institute of Science and Technology (KIST)(韩国科学技术院)
关键词: reconstruct high-fidelity, aims to reconstruct, models, images, multi-view images
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project page: this https URL
点击查看摘要
Abstract:3D super-resolution aims to reconstruct high-fidelity 3D models from low-resolution (LR) multi-view images. Early studies primarily focused on single-image super-resolution (SISR) models to upsample LR images into high-resolution images. However, these methods often lack view consistency because they operate independently on each image. Although various post-processing techniques have been extensively explored to mitigate these inconsistencies, they have yet to fully resolve the issues. In this paper, we perform a comprehensive study of 3D super-resolution by leveraging video super-resolution (VSR) models. By utilizing VSR models, we ensure a higher degree of spatial consistency and can reference surrounding spatial information, leading to more accurate and detailed reconstructions. Our findings reveal that VSR models can perform remarkably well even on sequences that lack precise spatial alignment. Given this observation, we propose a simple yet practical approach to align LR images without involving fine-tuning or generating a ‘smooth’ trajectory from 3D models trained on the LR images. The experimental results show that this surprisingly simple algorithm achieves state-of-the-art results on 3D super-resolution tasks on standard benchmark datasets, such as the NeRF-synthetic and MipNeRF-360 datasets. Project page: this https URL
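下面给出一个把无序多视角低分辨率图像排成“相邻帧相似”的伪视频序列、再交给视频超分模型的极简示意代码(基于摘要理解所写的草图,并非论文官方实现;其中用下采样灰度图作为全局特征、贪心最近邻排序均为本示例的假设):

```python
# 示意性草图(非官方实现):按特征相似度贪心排序,得到近似平滑的图像序列。
import numpy as np

def extract_feature(img: np.ndarray) -> np.ndarray:
    """用缩小后的灰度图作为简易全局特征(假设);img 为 HxWx3 数组。"""
    gray = img.mean(axis=2)
    v = gray[::16, ::16].flatten().astype(np.float32)   # 粗粒度下采样
    return v / (np.linalg.norm(v) + 1e-8)

def order_views_greedy(images: list) -> list:
    """从第 0 张出发,每次选与当前帧特征最相似的未用图像。"""
    feats = np.stack([extract_feature(im) for im in images])   # (N, D)
    n = len(images)
    used = [False] * n
    order = [0]
    used[0] = True
    for _ in range(n - 1):
        sims = feats @ feats[order[-1]]          # 余弦相似度(特征已归一化)
        sims[np.array(used)] = -np.inf
        nxt = int(np.argmax(sims))
        order.append(nxt)
        used[nxt] = True
    return order

# 用法示例:ordered = [images[i] for i in order_views_greedy(images)],随后整段送入 VSR 模型。
```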
zh
[CV-82] EditSplat: Multi-View Fusion and Attention-Guided Optimization for View-Consistent 3D Scene Editing with 3D Gaussian Splatting
【速读】: 该论文试图解决当前3D编辑方法中存在的多视角不一致问题,以及3D高斯拼接(3D Gaussian Splatting, 3DGS)在编辑过程中优化效率低下的问题。解决方案的关键在于提出了EditSplat框架,该框架通过结合多视角融合引导(Multi-view Fusion Guidance, MFG)和注意力引导修剪(Attention-Guided Trimming, AGT)来实现。MFG通过将多视角信息整合到扩散过程中,确保了多视角一致性,而AGT则利用3DGS的显式表示,选择性地修剪和优化3D高斯,从而提高了优化效率并实现了精确且语义丰富的局部编辑。
链接: https://arxiv.org/abs/2412.11520
作者: Dong In Lee,Hyeongcheol Park,Jiyoung Seo,Eunbyung Park,Hyunje Park,Ha Dam Baek,Shin Sangheon,Sangmin kim,Sangpil Kim
机构: Korea University(高丽大学); Sungkyunkwan University(成均馆大学); Hanhwa Systems(韩华系统)
关键词: Recent advancements, highlighted the potential, potential of text-driven, Recent, multi-view
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:Recent advancements in 3D editing have highlighted the potential of text-driven methods in real-time, user-friendly AR/VR applications. However, current methods rely on 2D diffusion models without adequately considering multi-view information, resulting in multi-view inconsistency. While 3D Gaussian Splatting (3DGS) significantly improves rendering quality and speed, its 3D editing process encounters difficulties with inefficient optimization, as pre-trained Gaussians retain excessive source information, hindering optimization. To address these limitations, we propose \textbfEditSplat, a novel 3D editing framework that integrates Multi-view Fusion Guidance (MFG) and Attention-Guided Trimming (AGT). Our MFG ensures multi-view consistency by incorporating essential multi-view information into the diffusion process, leveraging classifier-free guidance from the text-to-image diffusion model and the geometric properties of 3DGS. Additionally, our AGT leverages the explicit representation of 3DGS to selectively prune and optimize 3D Gaussians, enhancing optimization efficiency and enabling precise, semantically rich local edits. Through extensive qualitative and quantitative evaluations, EditSplat achieves superior multi-view consistency and editing quality over existing methods, significantly enhancing overall efficiency.
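下面给出摘要中“注意力引导修剪(AGT):只保留注意力分数高的 3D 高斯参与编辑优化”这一思路的极简示意代码(基于摘要理解所写的草图,并非论文官方实现;每个高斯的注意力分数在此直接作为输入给定,保留比例为假设):

```python
# 示意性草图(非官方实现):按注意力分数选出需要优化的高斯子集,其余冻结或修剪。
import torch

def attention_guided_trim(gaussian_params: torch.Tensor,
                          attn_scores: torch.Tensor,
                          keep_ratio: float = 0.3):
    """gaussian_params: (N, D) 每个高斯的参数;attn_scores: (N,) 注意力分数。
    返回 (参与编辑优化的子集参数, 对应索引)。"""
    k = max(1, int(keep_ratio * gaussian_params.shape[0]))
    idx = torch.topk(attn_scores, k).indices
    editable = gaussian_params[idx].clone().requires_grad_(True)   # 仅这部分参与优化
    return editable, idx
```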
zh
[CV-83] LineArt: A Knowledge-guided Training-free High-quality Appearance Transfer for Design Drawing with Diffusion Model
【速读】: 该论文试图解决从线条图生成高质量图像时面临的准确性、一致性和细粒度控制问题。解决方案的关键在于提出了LineArt框架,通过模拟层次化的视觉认知和整合人类艺术经验来引导扩散过程,从而在保留复杂细节的同时生成高保真度的外观。LineArt框架包括两个主要阶段:多频线条融合模块用于补充输入设计图的结构信息,以及分为基础层塑造和表面层着色的两部分绘画过程。该方法无需精确的3D建模、物理属性规格或网络训练,显著提升了设计任务的便利性。
链接: https://arxiv.org/abs/2412.11519
作者: Xi Wang,Hongzhen Li,Heng Fang,Yichen Peng,Haoran Xie,Xi Yang,Chuntao Li
机构: Jilin University(吉林大学); KTH Royal Institute of Technology(瑞典皇家理工学院); Tokyo Institute of Technology(东京工业大学); Japan Advanced Institute of Science and Technology (JAIST)(日本先进科学技术研究所)
关键词: technologies reduce costs, image generation technologies, generation technologies reduce, reduce costs, Image rendering
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project Page: this https URL
点击查看摘要
Abstract:Image rendering from line drawings is vital in design, and image generation technologies reduce its cost, yet professional line drawings demand that complex details be preserved. Text prompts struggle with accuracy, and image translation struggles with consistency and fine-grained control. We present LineArt, a framework that transfers complex appearance onto detailed design drawings, facilitating design and artistic creation. It generates high-fidelity appearance while preserving structural accuracy by simulating hierarchical visual cognition and integrating human artistic experience to guide the diffusion process. LineArt overcomes the limitations of current methods in terms of difficulty in fine-grained control and style degradation in design drawings. It requires no precise 3D modeling, physical property specs, or network training, making it more convenient for design tasks. LineArt consists of two stages: a multi-frequency lines fusion module to supplement the input design drawing with detailed structural information and a two-part painting process for Base Layer Shaping and Surface Layer Coloring. We also present a new design drawing dataset ProLines for evaluation. The experiments show that LineArt performs better in accuracy, realism, and material precision compared to SOTAs.
zh
[CV-84] IGR: Improving Diffusion Model for Garment Restoration from Person Image
【速读】: 该论文试图解决服装还原(garment restoration)任务中的问题,即从人物图像中恢复标准服装时,现有方法难以保留服装的身份特征或依赖复杂流程。解决方案的关键在于提出了一种改进的扩散模型(diffusion model),通过两个服装提取器分别捕捉低级特征和高级语义,并利用预训练的潜在扩散模型将这些特征整合到去噪过程中。通过服装融合块(garment fusion blocks)结合自注意力和交叉注意力层来对齐恢复的服装与人物图像,同时采用由粗到细的训练策略(coarse-to-fine training strategy)以提高生成服装的保真度和真实性。
链接: https://arxiv.org/abs/2412.11513
作者: Le Shen,Rong Huang,Zhijie Wang
机构: Donghua University(东华大学)
关键词: virtual try-on task, requiring accurate capture, try-on task, requiring accurate, inverse of virtual
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Garment restoration, the inverse of virtual try-on task, focuses on restoring standard garment from a person image, requiring accurate capture of garment details. However, existing methods often fail to preserve the identity of the garment or rely on complex processes. To address these limitations, we propose an improved diffusion model for restoring authentic garments. Our approach employs two garment extractors to independently capture low-level features and high-level semantics from the person image. Leveraging a pretrained latent diffusion model, these features are integrated into the denoising process through garment fusion blocks, which combine self-attention and cross-attention layers to align the restored garment with the person image. Furthermore, a coarse-to-fine training strategy is introduced to enhance the fidelity and authenticity of the generated garments. Experimental results demonstrate that our model effectively preserves garment identity and generates high-quality restorations, even in challenging scenarios such as complex garments or those with occlusions.
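下面给出摘要中“服装融合块:自注意力 + 交叉注意力,将服装特征注入去噪过程”这一结构的极简示意代码(基于摘要理解所写的草图,并非论文官方实现;特征维度、头数与归一化位置均为本示例的假设):

```python
# 示意性草图(非官方实现):一个服装融合块。
import torch
import torch.nn as nn

class GarmentFusionBlock(nn.Module):
    def __init__(self, dim: int = 320, heads: int = 8):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x: torch.Tensor, garment_feat: torch.Tensor) -> torch.Tensor:
        # x: (B, N, C) 去噪分支的 token;garment_feat: (B, M, C) 服装提取器输出
        h = self.norm1(x)
        x = x + self.self_attn(h, h, h, need_weights=False)[0]        # 自注意力
        h = self.norm2(x)
        x = x + self.cross_attn(h, garment_feat, garment_feat,        # 交叉注意力对齐服装
                                need_weights=False)[0]
        return x
```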
zh
[CV-85] SpatialMe: Stereo Video Conversion Using Depth-Warping and Blend-Inpainting
【速读】: 该论文试图解决单目视频转换为沉浸式立体视频的两个主要挑战:高保真度和稳定性结果的实现困难,以及高质量立体视频数据的不足。解决方案的关键在于提出了基于深度扭曲和融合修复的立体视频转换框架SpatialMe,其中包括一个基于掩码的分层特征更新(MHFU)精炼器,用于整合和优化多分支修复模块的输出,以及一种视差扩展策略来解决前景渗漏问题。此外,论文还引入了高质量的真实世界立体视频数据集StereoV1K,以缓解数据不足的问题。
链接: https://arxiv.org/abs/2412.11512
作者: Jiale Zhang,Qianxi Jia,Yang Liu,Wei Zhang,Wei Wei,Xin Tian
机构: JD.com(京东); Beijing, China(北京,中国)
关键词: immersive stereo format, transform monocular videos, video conversion aims, aims to transform, transform monocular
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Stereo video conversion aims to transform monocular videos into an immersive stereo format. Despite the advancements in novel view synthesis, two major challenges remain: i) the difficulty of achieving high-fidelity and stable results, and ii) the insufficiency of high-quality stereo video data. In this paper, we introduce SpatialMe, a novel stereo video conversion framework based on depth-warping and blend-inpainting. Specifically, we propose a mask-based hierarchy feature update (MHFU) refiner, which integrates and refines the outputs from the designed multi-branch inpainting module using a feature update unit (FUU) and a mask mechanism. We also propose a disparity expansion strategy to address the problem of foreground bleeding. Furthermore, we construct a high-quality real-world stereo video dataset – StereoV1K – to alleviate the data shortage. It contains 1,000 stereo videos captured in the real world at a resolution of 1180 x 1180, covering various indoor and outdoor scenes. Extensive experiments demonstrate the superiority of our approach in generating stereo videos over state-of-the-art methods.
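下面给出“深度扭曲(depth-warping):由深度推算视差,把左视图前向扭曲为右视图,未覆盖像素留作空洞交给修复模块”这一步骤的极简示意代码(基于摘要理解所写的草图,并非论文官方实现;焦距与基线等相机参数为假设):

```python
# 示意性草图(非官方实现):基于深度的前向扭曲与空洞掩码生成。
import numpy as np

def depth_warp_to_right(left: np.ndarray, depth: np.ndarray,
                        fx: float = 1000.0, baseline: float = 0.06):
    """left: (H, W, 3);depth: (H, W) 单位为米。返回 (右视图, 空洞掩码)。"""
    H, W, _ = left.shape
    disparity = fx * baseline / np.clip(depth, 1e-3, None)    # 像素视差
    right = np.zeros_like(left)
    filled = np.zeros((H, W), dtype=bool)
    zbuf = np.full((H, W), np.inf)
    ys, xs = np.meshgrid(np.arange(H), np.arange(W), indexing="ij")
    xt = np.round(xs - disparity).astype(int)                 # 右视图中的目标列
    valid = (xt >= 0) & (xt < W)
    for y, x_src, x_dst, d in zip(ys[valid], xs[valid], xt[valid], depth[valid]):
        if d < zbuf[y, x_dst]:                                # z-buffer:近处遮挡远处
            zbuf[y, x_dst] = d
            right[y, x_dst] = left[y, x_src]
            filled[y, x_dst] = True
    return right, ~filled                                     # 第二项为需要修复的空洞掩码
```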
zh
[CV-86] Skip Tuning: Pre-trained Vision-Language Models are Effective and Efficient Adapters Themselves
【速读】: 该论文试图解决在将预训练的视觉-语言模型 (Vision-Language Models, VLMs) 迁移到下游任务时,传统的提示调优 (Prompt Tuning, PT) 方法在冻结模型参数的情况下,既未能显著提升预训练知识的迁移能力,也未显著改善内存和时间效率的问题。解决方案的关键在于减少全量微调 (Full Fine-Tuning, FT) 基线中的特征-梯度传播流的长度和宽度,从而实现更有效和高效的迁移。基于此,论文提出了跳跃调优 (Skip Tuning) 这一新范式,通过在 FT 基线上应用逐层跳跃 (Layer-wise Skipping, LSkip) 和逐类跳跃 (Class-wise Skipping, CSkip),避免了引入额外的上下文向量或适配器模块,从而在广泛的基准测试中展示了其优于传统 PT 和基于适配器方法的有效性和效率。
链接: https://arxiv.org/abs/2412.11509
作者: Shihan Wu,Ji Zhang,Pengpeng Zeng,Lianli Gao,Jingkuan Song,Heng Tao Shen
机构: University of Electronic Science and Technology of China(电子科技大学); Southwest Jiaotong University(西南交通大学); Tongji University(同济大学)
关键词: pre-trained vision-language models, large pre-trained vision-language, transferring large pre-trained, Prompt tuning, context vectors
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Prompt tuning (PT) has long been recognized as an effective and efficient paradigm for transferring large pre-trained vision-language models (VLMs) to downstream tasks by learning a tiny set of context vectors. Nevertheless, in this work, we reveal that freezing the parameters of VLMs during learning the context vectors neither facilitates the transferability of pre-trained knowledge nor improves the memory and time efficiency significantly. Upon further investigation, we find that reducing both the length and width of the feature-gradient propagation flows of the full fine-tuning (FT) baseline is key to achieving effective and efficient knowledge transfer. Motivated by this, we propose Skip Tuning, a novel paradigm for adapting VLMs to downstream tasks. Unlike existing PT or adapter-based methods, Skip Tuning applies Layer-wise Skipping (LSkip) and Class-wise Skipping (CSkip) upon the FT baseline without introducing extra context vectors or adapter modules. Extensive experiments across a wide spectrum of benchmarks demonstrate the superior effectiveness and efficiency of our Skip Tuning over both PT and adapter-based methods. Code: this https URL.
zh
[CV-87] Exploring More from Multiple Gait Modalities for Human Identification
【速读】: 该论文试图解决不同步态表示(如轮廓、人体解析和光流)之间的比较研究不足的问题,特别是这些表示在模型设计和实验设置上的不一致性。解决方案的关键在于提出了一种新的融合策略,即C²融合(C² Fusion),该策略在保留共性的同时突出差异,以丰富步态特征的学习。基于这一策略,论文构建了新的框架MultiGait++,并通过在Gait3D、GREW、CCPG和SUSTech1K等多个数据集上的广泛实验验证了其有效性。
链接: https://arxiv.org/abs/2412.11495
作者: Dongyang Jin,Chao Fan,Weihua Chen,Shiqi Yu
机构: 1. Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences (深圳先进技术研究院,中国科学院);
2. School of Computer Science and Engineering, Nanyang Technological University (计算机科学与工程学院,南洋理工大学)
关键词: distinct walking patterns, soft biometric characteristic, unrestrained human identification, exhibiting a promising, kind of soft
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:The gait, as a kind of soft biometric characteristic, can reflect the distinct walking patterns of individuals at a distance, exhibiting a promising technique for unrestrained human identification. With largely excluding gait-unrelated cues hidden in RGB videos, the silhouette and skeleton, though visually compact, have acted as two of the most prevailing gait modalities for a long time. Recently, several attempts have been made to introduce more informative data forms like human parsing and optical flow images to capture gait characteristics, along with multi-branch architectures. However, due to the inconsistency within model designs and experiment settings, we argue that a comprehensive and fair comparative study among these popular gait modalities, involving the representational capacity and fusion strategy exploration, is still lacking. From the perspectives of fine vs. coarse-grained shape and whole vs. pixel-wise motion modeling, this work presents an in-depth investigation of three popular gait representations, i.e., silhouette, human parsing, and optical flow, with various fusion evaluations, and experimentally exposes their similarities and differences. Based on the obtained insights, we further develop a C ^2 Fusion strategy, consequently building our new framework MultiGait++. C ^2 Fusion preserves commonalities while highlighting differences to enrich the learning of gait features. To verify our findings and conclusions, extensive experiments on Gait3D, GREW, CCPG, and SUSTech1K are conducted. The code is available at this https URL.
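下面给出“C² Fusion:保留多模态共性、突出各模态差异”这一融合思想的极简示意代码(基于摘要理解所写的草图,并非论文官方实现;以均值作为共性、以残差作为差异的具体算子均为本示例的假设):

```python
# 示意性草图(非官方实现):对轮廓/人体解析/光流三个分支特征做 C²-style 融合。
import torch

def c2_fusion(feat_list: list) -> torch.Tensor:
    """feat_list 中每个元素形状为 (B, C),分别来自不同步态模态的分支。"""
    common = torch.stack(feat_list, dim=0).mean(dim=0)     # 共性部分
    diffs = [f - common for f in feat_list]                 # 差异部分(各模态特有信息)
    return torch.cat([common] + diffs, dim=-1)              # (B, C * (1 + 模态数))

# 用法示例:fused = c2_fusion([silhouette_feat, parsing_feat, flow_feat])
```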
zh
[CV-88] HGSFusion: Radar-Camera Fusion with Hybrid Generation and Synchronization for 3D Object Detection AAAI2025 AAAI
【速读】: 该论文试图解决毫米波雷达在自动驾驶3D目标检测中由于点云稀疏性和角度估计误差导致的感知局限性问题。解决方案的关键在于提出了一种名为HGSFusion的雷达-相机融合网络,通过引入Radar Hybrid Generation Module (RHGM)和Dual Sync Module (DSM)来增强雷达和图像特征的融合。RHGM通过利用语义信息和不同的概率密度函数生成更密集的雷达点,以缓解角度估计误差的影响;DSM则通过空间同步和模态同步机制,结合雷达位置信息增强图像特征,从而实现更有效的跨模态融合。实验结果表明,该方法在VoD和TJ4DRadSet数据集上显著优于现有最先进方法。
链接: https://arxiv.org/abs/2412.11489
作者: Zijian Gu,Jianwei Ma,Yan Huang,Honghao Wei,Zhanye Chen,Hui Zhang,Wei Hong
机构: 未知
关键词: Millimeter-wave radar plays, autonomous driving due, Millimeter-wave radar, capabilities for perception, plays a vital
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Robotics (cs.RO)
备注: 12 pages, 8 figures, 7 tables. Accepted by AAAI 2025 , the 39th Annual AAAI Conference on Artificial Intelligence
点击查看摘要
Abstract:Millimeter-wave radar plays a vital role in 3D object detection for autonomous driving due to its all-weather and all-lighting-condition capabilities for perception. However, radar point clouds suffer from pronounced sparsity and unavoidable angle estimation errors. To address these limitations, incorporating a camera may partially help mitigate the shortcomings. Nevertheless, the direct fusion of radar and camera data can lead to negative or even opposite effects due to the lack of depth information in images and low-quality image features under adverse lighting conditions. Hence, in this paper, we present the radar-camera fusion network with Hybrid Generation and Synchronization (HGSFusion), designed to better fuse radar potentials and image features for 3D object detection. Specifically, we propose the Radar Hybrid Generation Module (RHGM), which fully considers the Direction-Of-Arrival (DOA) estimation errors in radar signal processing. This module generates denser radar points through different Probability Density Functions (PDFs) with the assistance of semantic information. Meanwhile, we introduce the Dual Sync Module (DSM), comprising spatial sync and modality sync, to enhance image features with radar positional information and facilitate the fusion of distinct characteristics in different modalities. Extensive experiments demonstrate the effectiveness of our approach, outperforming the state-of-the-art methods in the VoD and TJ4DRadSet datasets by 6.53% and 2.03% in RoI AP and BEV AP, respectively. The code is available at this https URL.
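下面给出摘要中“RHGM:借助语义信息、按不同概率密度函数在雷达点周围采样生成更密的点”这一思路的极简示意代码(基于摘要理解所写的草图,并非论文官方实现;类别到采样标准差的映射、横向使用更大标准差以体现 DOA 误差等设定均为假设):

```python
# 示意性草图(非官方实现):语义引导的雷达点高斯采样加密。
import numpy as np

def hybrid_generate(radar_pts: np.ndarray, sem_labels: np.ndarray,
                    n_samples: int = 8, seed: int = 0) -> np.ndarray:
    """radar_pts: (N, 3) 的 xyz;sem_labels: (N,) 由图像语义投影得到的类别 id。"""
    rng = np.random.default_rng(seed)
    # 假设:不同类别对应不同的 (x, y, z) 采样标准差,横向(y)更大以模拟角度误差
    sigma_by_class = {0: (0.2, 0.6, 0.1), 1: (0.3, 0.8, 0.1), 2: (0.5, 1.2, 0.2)}
    generated = [radar_pts]
    for cls, sigma in sigma_by_class.items():
        pts = radar_pts[sem_labels == cls]
        if len(pts) == 0:
            continue
        noise = rng.normal(0.0, sigma, size=(len(pts), n_samples, 3))
        generated.append((pts[:, None, :] + noise).reshape(-1, 3))
    return np.concatenate(generated, axis=0)    # 加密后的混合点云
```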
zh
[CV-89] Efficient Policy Adaptation with Contrastive Prompt Ensemble for Embodied Agents NEURIPS2023
【速读】: 该论文试图解决强化学习(RL)中实体代理在面对未见过的视觉观测时快速适应策略的挑战,特别是实现零样本适应能力的问题。解决方案的关键在于提出了一种新颖的对比提示集成(Contrastive Prompt Ensemble, ConPE)框架,该框架利用预训练的视觉-语言模型和一组视觉提示,通过引导注意力机制的集成方法,构建鲁棒的状态表示。具体而言,每个提示在影响代理自我中心感知和观测的个体领域因素上进行对比学习,使得状态表示不仅在多个领域中具有泛化能力,还能针对特定任务进行优化。实验结果表明,ConPE在多个实体代理任务中(如AI2THOR中的导航、egocentric-Metaworld中的操作和CARLA中的自动驾驶)表现优于现有算法,并提高了策略学习和适应的样本效率。
链接: https://arxiv.org/abs/2412.11484
作者: Wonje Choi,Woo Kyung Kim,SeungHyun Kim,Honguk Woo
机构: Sungkyunkwan University (成均馆大学)
关键词: achieving zero-shot adaptation, zero-shot adaptation capability, unseen visual observations, embodied reinforcement learning, rapid policy adaptation
类目: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
备注: Accepted at NeurIPS 2023
点击查看摘要
Abstract:For embodied reinforcement learning (RL) agents interacting with the environment, it is desirable to have rapid policy adaptation to unseen visual observations, but achieving zero-shot adaptation capability is considered as a challenging problem in the RL context. To address the problem, we present a novel contrastive prompt ensemble (ConPE) framework which utilizes a pretrained vision-language model and a set of visual prompts, thus enabling efficient policy learning and adaptation upon a wide range of environmental and physical changes encountered by embodied agents. Specifically, we devise a guided-attention-based ensemble approach with multiple visual prompts on the vision-language model to construct robust state representations. Each prompt is contrastively learned in terms of an individual domain factor that significantly affects the agent’s egocentric perception and observation. For a given task, the attention-based ensemble and policy are jointly learned so that the resulting state representations not only generalize to various domains but are also optimized for learning the task. Through experiments, we show that ConPE outperforms other state-of-the-art algorithms for several embodied agent tasks including navigation in AI2THOR, manipulation in egocentric-Metaworld, and autonomous driving in CARLA, while also improving the sample efficiency of policy learning and adaptation.
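下面给出“引导注意力的提示集成:对每个视觉提示得到一个图像嵌入,再用注意力加权求和得到状态表示”这一结构的极简示意代码(基于摘要理解所写的草图,并非论文官方实现;打分网络的结构与嵌入维度均为假设):

```python
# 示意性草图(非官方实现):基于注意力的视觉提示集成。
import torch
import torch.nn as nn

class PromptEnsemble(nn.Module):
    def __init__(self, embed_dim: int = 512):
        super().__init__()
        self.score = nn.Linear(embed_dim, 1)   # 引导注意力打分(结构为假设)

    def forward(self, prompt_embeds: torch.Tensor) -> torch.Tensor:
        # prompt_embeds: (B, P, D),第 p 个切片是第 p 个视觉提示下的图像嵌入
        attn = torch.softmax(self.score(prompt_embeds).squeeze(-1), dim=-1)   # (B, P)
        state = (attn.unsqueeze(-1) * prompt_embeds).sum(dim=1)               # (B, D)
        return state   # 作为策略网络的状态表示
```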
zh
[CV-90] Data-driven Precipitation Nowcasting Using Satellite Imagery AAAI2025
【速读】: 该论文试图解决传统地面雷达系统在空间限制和高维护成本下的降水预报问题,特别是在发展中国家依赖低分辨率全球数值模型的背景下。解决方案的关键是提出了神经降水模型 (Neural Precipitation Model, NPM),该模型利用全球静止卫星图像进行降水预测,能够提供长达六小时的预报,每小时更新一次。NPM通过引入红外辐射 (10.5 μm)、上层水汽 (6.3 μm) 和下层水汽 (7.3 μm) 三个关键通道来区分雨云,并使用位置编码器捕捉季节性和时间性模式,以适应降水变化。实验结果表明,NPM能够以2公里的分辨率实时预测降雨。
链接: https://arxiv.org/abs/2412.11480
作者: Young-Jae Park,Doyi Kim,Minseok Seo,Hae-Gon Jeon,Yeji Choi
机构: 1. Yonsei University(延世大学); 2. Korea Advanced Institute of Science and Technology (KAIST)(韩国科学技术院); 3. LG AI Research(LG人工智能研究院)
关键词: Accurate precipitation forecasting, warnings of disasters, floods and landslides, Accurate precipitation, forecasting is crucial
类目: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
备注: Accepted by AAAI2025
点击查看摘要
Abstract:Accurate precipitation forecasting is crucial for early warnings of disasters, such as floods and landslides. Traditional forecasts rely on ground-based radar systems, which are space-constrained and have high maintenance costs. Consequently, most developing countries depend on a global numerical model with low resolution, instead of operating their own radar systems. To mitigate this gap, we propose the Neural Precipitation Model (NPM), which uses global-scale geostationary satellite imagery. NPM predicts precipitation for up to six hours, with an update every hour. We take three key channels to discriminate rain clouds as input: infrared radiation (at a wavelength of 10.5 \mu m ), upper- (6.3 \mu m ), and lower- (7.3 \mu m ) level water vapor channels. Additionally, NPM introduces positional encoders to capture seasonal and temporal patterns, accounting for variations in precipitation. Our experimental results demonstrate that NPM can predict rainfall in real-time with a resolution of 2 km. The code and dataset are available at this https URL.
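下面给出“把三个关键通道(10.5μm 红外、6.3μm 上层水汽、7.3μm 下层水汽)堆叠为输入,并对季节/昼夜做正弦-余弦位置编码”的极简示意代码(基于摘要理解所写的草图,并非论文官方实现;时间编码的具体形式与拼接方式为假设):

```python
# 示意性草图(非官方实现):卫星通道堆叠 + 时间位置编码。
import numpy as np

def temporal_encoding(day_of_year: int, hour: float) -> np.ndarray:
    """返回 4 维时间编码:[sin(季节), cos(季节), sin(昼夜), cos(昼夜)]。"""
    season = 2 * np.pi * day_of_year / 365.25
    diurnal = 2 * np.pi * hour / 24.0
    return np.array([np.sin(season), np.cos(season),
                     np.sin(diurnal), np.cos(diurnal)], dtype=np.float32)

def build_input(ir_105: np.ndarray, wv_063: np.ndarray, wv_073: np.ndarray,
                day_of_year: int, hour: float) -> np.ndarray:
    """三个通道均为 (H, W) 的亮温图;输出 (3 + 4, H, W),时间编码按通道广播。"""
    H, W = ir_105.shape
    chans = np.stack([ir_105, wv_063, wv_073], axis=0).astype(np.float32)
    t = temporal_encoding(day_of_year, hour).reshape(4, 1, 1)
    t_maps = np.broadcast_to(t, (4, H, W)).copy()
    return np.concatenate([chans, t_maps], axis=0)
```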
zh
[CV-91] OmniVLM: A Token-Compressed Sub-Billion-Parameter Vision-Language Model for Efficient On-Device Inference
【速读】: 该论文试图解决在设备上进行高效视觉-语言模型推理的问题,关键解决方案是引入了一种名为“token compression mechanism”的机制,该机制将视觉token序列长度从729压缩到81,显著减少了计算开销,同时保持了视觉-语义的保真度。通过多阶段的训练流程,包括预训练、监督微调以及最小编辑直接偏好优化(Direct Preference Optimization, DPO),OmniVLM在多个基准测试中超越了现有的基线模型,如nanoLLAVA,并且在相同的硬件条件下实现了更快的推理速度和更高的解码效率,从而实现了在边缘设备上的高效部署。
链接: https://arxiv.org/abs/2412.11475
作者: Wei Chen,Zhiyuan Li,Shuo Xin
机构: Nexa AI
关键词: Direct Preference Optimization, minimal-edit Direct Preference, Preference Optimization, vision-language model, Abstract
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:We present OmniVLM, a sub-billion-parameter vision-language model for efficient on-device inference. OmniVLM introduces a token compression mechanism that reduces visual token sequence length from 729 to 81 tokens, significantly reducing computational overhead while preserving visual-semantic fidelity. Through a multi-stage training pipeline of pretraining, supervised fine-tuning, and minimal-edit Direct Preference Optimization (DPO), OmniVLM matches the performance of larger models. On multiple benchmarks including ScienceQA, POPE, and MMMU, OmniVLM outperforms existing baselines like nanoLLAVA within a 968M-parameter footprint. Empirical results on the same laptop demonstrate 9.1x faster time-to-first-token (0.75s vs 6.82s) and 1.5x higher decoding speed (29.41 vs 19.20 tokens/s) compared to nanoLLAVA, enabling efficient deployment on edge devices. The model weights can be accessed on huggingface: this https URL, and the inference examples can be found in Appendix B.
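摘要中提到视觉 token 从 729 压缩到 81(恰为 27×27 → 9×9 的网格比例)。下面给出一个用 3×3 平均池化实现这种序列压缩的极简示意代码(基于 token 数推断的草图,并非论文官方实现;27×27 网格假设与池化算子均为本示例的假设):

```python
# 示意性草图(非官方实现):视觉 token 序列压缩 729 -> 81。
import torch
import torch.nn.functional as F

def compress_visual_tokens(tokens: torch.Tensor) -> torch.Tensor:
    """tokens: (B, 729, D) -> (B, 81, D)"""
    B, N, D = tokens.shape
    assert N == 729, "假设视觉编码器输出 27x27 的 token 网格"
    grid = tokens.transpose(1, 2).reshape(B, D, 27, 27)     # (B, D, 27, 27)
    pooled = F.avg_pool2d(grid, kernel_size=3, stride=3)     # (B, D, 9, 9)
    return pooled.flatten(2).transpose(1, 2)                 # (B, 81, D)
```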
zh
[CV-92] Exploring Temporal Event Cues for Dense Video Captioning in Cyclic Co-learning AAAI2025
【速读】: 该论文试图解决密集视频字幕生成问题,即在未剪辑的视频中检测并描述所有事件。解决方案的关键在于提出了多概念循环学习网络 (Multi-Concept Cyclic Learning, MCCL),其核心创新包括:(1) 在帧级别检测多个概念,并将这些概念嵌入整合到视频特征中,以提供时间事件线索;(2) 在字幕生成网络中设计生成器与定位器之间的循环协同学习策略,通过语义匹配指导事件定位,同时通过位置匹配增强事件语义感知,从而实现语义感知与事件定位的相互促进。该方法在ActivityNet Captions和YouCook2数据集上取得了最先进的性能,并通过实验验证了其有效性和可解释性。
链接: https://arxiv.org/abs/2412.11467
作者: Zhuyang Xie,Yan Yang,Yankai Yu,Jie Wang,Yongquan Jiang,Xiao Wu
机构: 未知
关键词: Dense video captioning, video captioning aims, Dense video, video captioning, video captioning network
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted at AAAI 2025
点击查看摘要
Abstract:Dense video captioning aims to detect and describe all events in untrimmed videos. This paper presents a dense video captioning network called Multi-Concept Cyclic Learning (MCCL), which aims to: (1) detect multiple concepts at the frame level, using these concepts to enhance video features and provide temporal event cues; and (2) design cyclic co-learning between the generator and the localizer within the captioning network to promote semantic perception and event localization. Specifically, we perform weakly supervised concept detection for each frame, and the detected concept embeddings are integrated into the video features to provide event cues. Additionally, video-level concept contrastive learning is introduced to obtain more discriminative concept embeddings. In the captioning network, we establish a cyclic co-learning strategy where the generator guides the localizer for event localization through semantic matching, while the localizer enhances the generator’s event semantic perception through location matching, making semantic perception and event localization mutually beneficial. MCCL achieves state-of-the-art performance on the ActivityNet Captions and YouCook2 datasets. Extensive experiments demonstrate its effectiveness and interpretability.
zh
[CV-93] MaskCLIP: A Mask-Based CLIP Fine-tuning Framework for Open-Vocabulary Image Segmentation
【速读】: 该论文旨在解决开放词汇图像分割中由于低质量生成掩码(generated masks)导致的视觉与语言对齐问题。其关键解决方案是提出了一种名为MaskCLIP++的微调框架,通过使用真实掩码(ground-truth masks)而非生成掩码来增强CLIP的掩码分类能力。此外,为了缓解由于图像分割数据集标注多样性不足导致的类别偏差问题,论文还引入了在微调过程中的一致性对齐约束(consistency alignment constraint)。通过这种低成本的微调方法,结合现有的基于掩码的开放词汇分割方法,论文在多个数据集上实现了显著的性能提升。
链接: https://arxiv.org/abs/2412.11464
作者: Quan-Sheng Zeng,Yunheng Li,Daquan Zhou,Guanbin Li,Qibin Hou,Ming-Ming Cheng
机构: Nankai University(南开大学); ByteDance Inc.(字节跳动公司); Sun Yat-sen University(中山大学); NKIARI, Futian, Shenzhen(深圳福田国家信息安全与可靠信息技术研究院)
关键词: Contrastive Language-Image Pre-training, Language-Image Pre-training, models like Contrastive, Contrastive Language-Image, Open-vocabulary image segmentation
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 20 pages, 8 figures
点击查看摘要
Abstract:Open-vocabulary image segmentation has been advanced through the synergy between mask generators and vision-language models like Contrastive Language-Image Pre-training (CLIP). Previous approaches focus on generating masks while aligning mask features with text embeddings during training. In this paper, we observe that relying on generated low-quality masks can weaken the alignment of vision and language in regional representations. This motivates us to present a new fine-tuning framework, named MaskCLIP++, which uses ground-truth masks instead of generated masks to enhance the mask classification capability of CLIP. Due to the limited diversity of image segmentation datasets with mask annotations, we propose incorporating a consistency alignment constraint during fine-tuning, which alleviates categorical bias toward the fine-tuning dataset. After low-cost fine-tuning, combining with the mask generator in previous state-of-the-art mask-based open vocabulary segmentation methods, we achieve performance improvements of +1.7, +2.3, +2.1, +3.1, and +0.3 mIoU on the A-847, PC-459, A-150, PC-59, and PAS-20 datasets, respectively.
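下面给出“用真实掩码在 CLIP 稠密图像特征上做掩码池化,再与类别文本嵌入比对完成掩码分类”这一核心步骤的极简示意代码(基于摘要理解所写的草图,并非论文官方实现;稠密特征的来源、维度与温度系数均为假设):

```python
# 示意性草图(非官方实现):掩码池化 + 文本嵌入匹配的掩码分类。
import torch
import torch.nn.functional as F

def mask_classify(dense_feat: torch.Tensor, masks: torch.Tensor,
                  text_embeds: torch.Tensor) -> torch.Tensor:
    """dense_feat: (C, H, W);masks: (M, H, W) 取值 0/1;text_embeds: (K, C) 已归一化。
    返回 (M, K) 的分类 logits。"""
    C, H, W = dense_feat.shape
    flat = dense_feat.reshape(C, H * W)                               # (C, HW)
    m = masks.reshape(masks.shape[0], H * W).float()                  # (M, HW)
    pooled = m @ flat.t() / m.sum(dim=1, keepdim=True).clamp(min=1.0) # 掩码内平均池化
    pooled = F.normalize(pooled, dim=-1)
    return 100.0 * pooled @ text_embeds.t()   # 温度系数 100 为 CLIP 常见设定(假设)
```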
zh
[CV-94] FedCAR: Cross-client Adaptive Re-weighting for Generative Models in Federated Learning
【速读】: 该论文试图解决在医疗图像生成领域中,由于隐私问题导致的多机构数据共享困难,从而影响生成模型性能的问题。解决方案的关键在于提出了一种新的联邦学习 (Federated Learning, FL) 算法,该算法通过自适应地重新加权每个客户端的贡献,优化了生成模型在联邦学习环境下的训练效果。具体来说,服务器在每一轮训练中通过测量客户端生成假图像的分布距离,而不是直接比较每个客户端的Fréchet Inception Distance,从而提高了学习的效率。实验结果表明,该方法在三个公开的胸部X光数据集上表现优异,超越了集中式学习和传统联邦学习算法。
链接: https://arxiv.org/abs/2412.11463
作者: Minjun Kim,Minjee Kim,Jinhoon Jeong
机构: 未知
关键词: Generative models trained, Generative models, trained on multi-institutional, provide an enriched, enriched understanding
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
点击查看摘要
Abstract:Generative models trained on multi-institutional datasets can provide an enriched understanding through diverse data distributions. However, training the models on medical images is often challenging due to hospitals’ reluctance to share data for privacy reasons. Federated learning(FL) has emerged as a privacy-preserving solution for training distributed datasets across data centers by aggregating model weights from multiple clients instead of sharing raw data. Previous research has explored the adaptation of FL to generative models, yet effective aggregation algorithms specifically tailored for generative models remain unexplored. We hereby propose a novel algorithm aimed at improving the performance of generative models within FL. Our approach adaptively re-weights the contribution of each client, resulting in well-trained shared parameters. In each round, the server side measures the distribution distance between fake images generated by clients instead of directly comparing the Fréchet Inception Distance per client, thereby enhancing efficiency of the learning. Experimental results on three public chest X-ray datasets show superior performance in medical image generation, outperforming both centralized learning and conventional FL algorithms. Our code is available at this https URL.
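下面给出“服务器端根据各客户端生成图像分布的距离自适应换算聚合权重,再加权平均模型参数”这一思路的极简示意代码(基于摘要理解所写的草图,并非论文官方实现;其中“距离越大权重越低”的换算规则、特征均值作为分布代理等设定均为本示例的假设):

```python
# 示意性草图(非官方实现):基于分布距离的客户端自适应加权聚合。
import torch

def adaptive_aggregate(client_states: list, client_feats: list,
                       temperature: float = 1.0) -> dict:
    """client_states[i]: 第 i 个客户端的 state_dict;client_feats[i]: (Ni, D) 其生成图像的特征。"""
    means = torch.stack([f.mean(dim=0) for f in client_feats])   # (K, D) 每客户端特征均值
    center = means.mean(dim=0, keepdim=True)
    dist = (means - center).norm(dim=1)                           # 与整体分布的距离
    weights = torch.softmax(-dist / temperature, dim=0)           # 假设:距离越小权重越大
    agg = {}
    for key in client_states[0]:
        agg[key] = sum(w * s[key].float() for w, s in zip(weights, client_states))
    return agg
```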
zh
[CV-95] HResFormer: Hybrid Residual Transformer for Volumetric Medical Image Segmentation
【速读】: 该论文试图解决3D医学图像分割中2D方法忽略切片内信息和3D方法计算成本高、内存消耗大的问题。解决方案的关键在于设计了一种混合模型HResFormer,该模型结合了2D和3D Transformer的优点,通过两个创新设计实现:(1) 混合局部-全局融合模块 (Hybrid Local-Global Fusion Module, HLGM),用于有效融合2D Transformer的切片内信息和3D Transformer的切片间信息,提供局部细粒度和全局长距离表示;(2) 混合模型的残差学习,能够更好地利用切片内和切片间信息,增强对解剖结构的3D理解。实验结果表明,HResFormer在广泛使用的医学图像分割基准上优于现有方法。
链接: https://arxiv.org/abs/2412.11458
作者: Sucheng Ren,Xiaomeng Li
机构: The Hong Kong University of Science and Technology, Hong Kong SAR, China; HKUST Shenzhen-Hong Kong Collaborative Innovation Research Institute, Futian, Shenzhen, China
关键词: textbf, Vision Transformer shows, medical image segmentation, Vision Transformer, medical image
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by TNNLS
点击查看摘要
Abstract:Vision Transformer shows great superiority in medical image segmentation due to the ability in learning long-range dependency. For medical image segmentation from 3D data, such as computed tomography (CT), existing methods can be broadly classified into 2D-based and 3D-based methods. One key limitation in 2D-based methods is that the intra-slice information is ignored, while the limitation in 3D-based methods is the high computation cost and memory consumption, resulting in a limited feature representation for inner-slice information. During the clinical examination, radiologists primarily use the axial plane and then routinely review both axial and coronal planes to form a 3D understanding of anatomy. Motivated by this fact, our key insight is to design a hybrid model which can first learn fine-grained inner-slice information and then generate a 3D understanding of anatomy by incorporating 3D information. We present a novel \textbfHybrid \textbfResidual trans\textbfFormer \textbf(HResFormer) for 3D medical image segmentation. Building upon standard 2D and 3D Transformer backbones, HResFormer involves two novel key designs: \textbf(1) a \textbfHybrid \textbfLocal-\textbfGlobal fusion \textbfModule \textbf(HLGM) to effectively and adaptively fuse inner-slice information from 2D Transformer and intra-slice information from 3D volumes for 3D Transformer with local fine-grained and global long-range representation. \textbf(2) a residual learning of the hybrid model, which can effectively leverage the inner-slice and intra-slice information for better 3D understanding of anatomy. Experiments show that our HResFormer outperforms prior art on widely-used medical image segmentation benchmarks. This paper sheds light on an important but neglected way to design Transformers for 3D medical image segmentation.
zh
[CV-96] MOVIS: Enhancing Multi-Object Novel View Synthesis for Indoor Scenes
【速读】: 该论文试图解决在多物体场景下,预训练扩散模型在新视角合成 (Novel View Synthesis, NVS) 中存在的跨视角一致性差、物体放置错误以及形状和外观不一致的问题。解决方案的关键在于通过增强模型的结构感知能力来提升其在多物体场景中的表现。具体措施包括:1) 向去噪 U-Net 注入结构感知特征(如深度和物体掩码),以增强模型对物体实例及其空间关系的理解;2) 引入辅助任务,要求模型同时预测新视角下的物体掩码,从而提升物体区分和放置能力;3) 设计结构引导的时间步采样调度器,平衡全局物体放置与细节恢复的学习。此外,论文还提出了系统评估合成图像合理性的方法,通过评估跨视角一致性和物体放置的合理性,结合现有的图像级 NVS 指标进行综合评价。
链接: https://arxiv.org/abs/2412.11457
作者: Ruijie Lu,Yixin Chen,Junfeng Ni,Baoxiong Jia,Yu Liu,Diwen Wan,Gang Zeng,Siyuan Huang
机构: State Key Laboratory of General Artificial Intelligence, Peking University; State Key Laboratory of General Artificial Intelligence, BIGAI; Tsinghua University
关键词: Repurposing pre-trained diffusion, Repurposing pre-trained, pre-trained diffusion models, object, Repurposing
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Repurposing pre-trained diffusion models has been proven to be effective for NVS. However, these methods are mostly limited to a single object; directly applying such methods to compositional multi-object scenarios yields inferior results, especially incorrect object placement and inconsistent shape and appearance under novel views. How to enhance and systematically evaluate the cross-view consistency of such models remains under-explored. To address this issue, we propose MOVIS to enhance the structural awareness of the view-conditioned diffusion model for multi-object NVS in terms of model inputs, auxiliary tasks, and training strategy. First, we inject structure-aware features, including depth and object mask, into the denoising U-Net to enhance the model’s comprehension of object instances and their spatial relationships. Second, we introduce an auxiliary task requiring the model to simultaneously predict novel view object masks, further improving the model’s capability in differentiating and placing objects. Finally, we conduct an in-depth analysis of the diffusion sampling process and carefully devise a structure-guided timestep sampling scheduler during training, which balances the learning of global object placement and fine-grained detail recovery. To systematically evaluate the plausibility of synthesized images, we propose to assess cross-view consistency and novel view object placement alongside existing image-level NVS metrics. Extensive experiments on challenging synthetic and realistic datasets demonstrate that our method exhibits strong generalization capabilities and produces consistent novel view synthesis, highlighting its potential to guide future 3D-aware multi-object NVS tasks.
zh
[CV-97] Multilabel Classification for Lung Disease Detection: Integrating Deep Learning and Natural Language Processing
【速读】: 该论文试图解决胸部X光片分类的耗时和挑战性问题,特别是对胸腔积液、气胸和肺炎等疾病的精确区分。解决方案的关键在于提出了一种基于迁移学习的多标签肺病分类模型,利用CheXpert数据集进行训练,并通过RadGraph解析技术提取高效的标注信息,从而提升模型对复杂医学图像的多疾病分类能力。此外,结合自然语言处理(NLP)解析报告元数据,进一步解决了分类中的不确定性问题,最终实现了F1分数为0.69和AUROC为0.86的性能,展示了其在临床应用中的潜力。
链接: https://arxiv.org/abs/2412.11452
作者: Maria Efimovich,Jayden Lim,Vedant Mehta,Ethan Poon
机构: Staten Island Tech High School (斯塔滕岛科技高中); Monta Vista High School (蒙塔维斯塔高中); Lambert High School (兰伯特高中); Edison Academy Magnet School (爱迪生学院磁石学校)
关键词: Classifying chest radiographs, challenging task, experienced radiologists, Classifying chest, time-consuming and challenging
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Image and Video Processing (eess.IV)
备注: All authors contributed equally
点击查看摘要
Abstract:Classifying chest radiographs is a time-consuming and challenging task, even for experienced radiologists. This provides an area for improvement due to the difficulty in precisely distinguishing between conditions such as pleural effusion, pneumothorax, and pneumonia. We propose a novel transfer learning model for multi-label lung disease classification, utilizing the CheXpert dataset with over 12,617 images of frontal radiographs being analyzed. By integrating RadGraph parsing for efficient annotation extraction, we enhance the model’s ability to accurately classify multiple lung diseases from complex medical images. The proposed model achieved an F1 score of 0.69 and an AUROC of 0.86, demonstrating its potential for clinical applications. Also explored was the use of Natural Language Processing (NLP) to parse report metadata and address uncertainties in disease classification. By comparing uncertain reports with more certain cases, the NLP-enhanced model improves its ability to conclusively classify conditions. This research highlights the connection between deep learning and NLP, underscoring their potential to enhance radiological diagnostics and aid in the efficient analysis of chest radiographs.
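下面给出一个多标签胸片分类迁移学习的极简示意代码(基于摘要理解所写的草图,并非论文官方实现;骨干网络选 DenseNet121、14 个病种、较新版 torchvision 的权重接口以及学习率等均为本示例的假设):

```python
# 示意性草图(非官方实现):多标签分类 = 每个病种一个独立的二分类(BCE)。
import torch
import torch.nn as nn
from torchvision import models

num_diseases = 14   # 假设的标签数
model = models.densenet121(weights=models.DenseNet121_Weights.IMAGENET1K_V1)
model.classifier = nn.Linear(model.classifier.in_features, num_diseases)

criterion = nn.BCEWithLogitsLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

def train_step(images: torch.Tensor, labels: torch.Tensor) -> float:
    """images: (B, 3, 224, 224);labels: (B, 14) 取值 0/1(不确定标签可按策略映射)。"""
    optimizer.zero_grad()
    logits = model(images)
    loss = criterion(logits, labels.float())
    loss.backward()
    optimizer.step()
    return loss.item()
```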
zh
[CV-98] GroupFace: Imbalanced Age Estimation Based on Multi-hop Attention Graph Convolutional Network and Group-aware Margin Optimization
【速读】: 该论文试图解决年龄估计中的类别不平衡问题,特别是在长尾分布的年龄组中识别的偏差。解决方案的关键在于提出了一种创新的协同学习框架(GroupFace),该框架结合了多跳注意力图卷积网络和基于强化学习的动态组感知边际策略。具体来说,多跳注意力图卷积网络用于提取不同组的判别特征,通过捕捉不同距离的邻近节点的交互,融合局部和全局信息,以建模面部深度老化并探索不同组的多样表示。此外,基于强化学习的动态组感知边际策略通过将样本分为四个年龄组,并利用马尔可夫决策过程来确定各年龄组的最佳边际,从而减少不同组之间的特征表示偏差和分类边际偏差,平衡类间可分性和类内接近性。通过联合优化,该架构在多个年龄估计基准数据集上取得了优异的性能。
链接: https://arxiv.org/abs/2412.11450
作者: Yiping Zhang,Yuntao Shou,Wei Ai,Tao Meng,Keqin Li
机构: 未知
关键词: groups, age estimation, computer vision, recent advances, advances in computer
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 15 pages, 10 figures
点击查看摘要
Abstract:With the recent advances in computer vision, age estimation has significantly improved in overall accuracy. However, since most common methods do not take into account the class imbalance problem in age estimation datasets, they suffer from a large bias in recognizing long-tailed groups. To achieve high-quality imbalanced learning in long-tailed groups, the dominant solution is for the feature extractor to learn the discriminative features of different groups and for the classifier to provide appropriate and unbiased margins for different groups based on these discriminative features. Therefore, in this work, we propose an innovative collaborative learning framework (GroupFace) that integrates a multi-hop attention graph convolutional network and a dynamic group-aware margin strategy based on reinforcement learning. Specifically, to extract the discriminative features of different groups, we design an enhanced multi-hop attention graph convolutional network. This network is capable of capturing the interactions of neighboring nodes at different distances, fusing local and global information to model facial deep aging, and exploring diverse representations of different groups. In addition, to further address the class imbalance problem, we design a dynamic group-aware margin strategy based on reinforcement learning to provide appropriate and unbiased margins for different groups. The strategy divides the samples into four age groups and identifies the optimum margins for the various age groups by employing a Markov decision process. Under the guidance of the agent, the feature representation bias and the classification margin deviation between different groups can be reduced simultaneously, balancing inter-class separability and intra-class proximity. After joint optimization, our architecture achieves excellent performance on several age estimation benchmark datasets.
zh
[CV-99] Universal Domain Adaptive Object Detection via Dual Probabilistic Alignment AAAI2025
【速读】: 该论文试图解决领域自适应目标检测 (Domain Adaptive Object Detection, DAOD) 中的两个关键问题:1) 领域私有类别对齐对于全局特征的重要性;2) 不同层次特征的领域概率异质性。为解决这些问题,论文提出了一个新颖的双概率对齐 (Dual Probabilistic Alignment, DPA) 框架,通过将领域概率建模为高斯分布,实现异质领域分布的采样和测量。DPA 包含三个定制模块:全局级别领域私有对齐 (Global-level Domain Private Alignment, GDPA)、实例级别领域共享对齐 (Instance-level Domain Shared Alignment, IDSA) 和私有类别约束 (Private Class Constraint, PCC)。GDPA 通过全局级别采样挖掘领域私有类别样本并计算对齐权重,IDSA 通过实例级别采样挖掘领域共享类别样本并利用高斯分布计算对齐权重,PCC 则聚合特征和概率空间中的领域私有类别质心以缓解负迁移。实验结果表明,DPA 在多种数据集和场景下均优于现有的 UniDAOD 和 DAOD 方法。
链接: https://arxiv.org/abs/2412.11443
作者: Yuanfan Zheng,Jinlin Wu,Wuyang Li,Zhen Chen
机构: 1. School of Computer Science and Technology, Soochow University(苏州大学计算机科学与技术学院);
2. Institute of Functional Nano & Soft Materials (FUNSOM), Soochow University(苏州大学功能纳米与软物质研究院);
3. School of Computer Science and Technology, Nanjing University(南京大学计算机科学与技术学院)
关键词: Adaptive Object Detection, Domain Adaptive Object, Object Detection, Adaptive Object, labeled source domain
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: This work is accepted by AAAI 2025
点击查看摘要
Abstract:Domain Adaptive Object Detection (DAOD) transfers knowledge from a labeled source domain to an unannotated target domain under closed-set assumption. Universal DAOD (UniDAOD) extends DAOD to handle open-set, partial-set, and closed-set domain adaptation. In this paper, we first unveil two issues: domain-private category alignment is crucial for global-level features, and the domain probability heterogeneity of features across different levels. To address these issues, we propose a novel Dual Probabilistic Alignment (DPA) framework to model domain probability as Gaussian distribution, enabling the heterogeneity domain distribution sampling and measurement. The DPA consists of three tailored modules: the Global-level Domain Private Alignment (GDPA), the Instance-level Domain Shared Alignment (IDSA), and the Private Class Constraint (PCC). GDPA utilizes the global-level sampling to mine domain-private category samples and calculate alignment weight through a cumulative distribution function to address the global-level private category alignment. IDSA utilizes instance-level sampling to mine domain-shared category samples and calculates alignment weight through Gaussian distribution to conduct the domain-shared category domain alignment to address the feature heterogeneity. The PCC aggregates domain-private category centroids between feature and probability spaces to mitigate negative transfer. Extensive experiments demonstrate that our DPA outperforms state-of-the-art UniDAOD and DAOD methods across various datasets and scenarios, including open, partial, and closed sets. Codes are available at \urlthis https URL.
zh
[CV-100] Learning Implicit Features with Flow Infused Attention for Realistic Virtual Try-On
【速读】: 该论文试图解决基于图像的虚拟试衣问题,特别是如何在不同姿态下将服装图像适配到模特图像上,同时保持服装的特征和细节。解决方案的关键在于提出了FIA-VTON方法,通过引入Flow Infused Attention模块,利用隐式的密集warp流图作为间接指导注意力,增强生成过程中的特征图变形,从而减少对显式服装图像变形的依赖。此外,结合高层空间注意力进一步增强隐式变形指导,使得该方法在VTON-HD和DressCode数据集上的实验结果显著优于现有最先进的方法,展示了其在虚拟试衣中的有效性和鲁棒性。
链接: https://arxiv.org/abs/2412.11435
作者: Delong Zhang,Qiwei Huang,Yuanliu Liu,Yang Sun,Wei-Shi Zheng,Pengfei Xiong,Wei Zhang
机构: Sun Yat-sen University(中山大学); Shopee
关键词: Image-based virtual try-on, Image-based virtual, garment image, garment image firstly, fit the garment
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Image-based virtual try-on is challenging since the generated image should fit the garment to model images in various poses and keep the characteristics and details of the garment simultaneously. A popular research stream warps the garment image firstly to reduce the burden of the generation stage, which relies highly on the performance of the warping module. Other methods without explicit warping often lack sufficient guidance to fit the garment to the model images. In this paper, we propose FIA-VTON, which leverages the implicit warp feature by adopting a Flow Infused Attention module on virtual try-on. The dense warp flow map is projected as indirect guidance attention to enhance the feature map warping in the generation process implicitly, which is less sensitive to the warping estimation accuracy than an explicit warp of the garment image. To further enhance implicit warp guidance, we incorporate high-level spatial attention to complement the dense warp. Experimental results on the VTON-HD and DressCode dataset significantly outperform state-of-the-art methods, demonstrating that FIA-VTON is effective and robust for virtual try-on.
zh
[CV-101] View Transformation Robustness for Multi-View 3D Object Reconstruction with Reconstruction Error-Guided View Selection AAAI25
【速读】: 该论文试图解决多视角三维物体重建模型在视角变换下的鲁棒性问题(View Transformation Robustness, VTR)。解决方案的关键在于利用Stable Diffusion模型生成新的视角图像,并通过重建误差引导的视角选择方法,选择能够覆盖重建误差空间分布的视角,从而提高模型在不同视角变换下的稳定性。该方法避免了在推理阶段增加计算负担,同时通过在大视角变换数据集上的训练和测试,验证了其在提升三维重建模型鲁棒性方面的有效性。
链接: https://arxiv.org/abs/2412.11428
作者: Qi Zhang,Zhouhang Luo,Tao Yu,Hui Huang
机构: Shenzhen University(深圳大学); Guangdong Laboratory of Artificial Intelligence and Digital Economy (Shenzhen)(广东省人工智能与数字经济实验室(深圳))
关键词: Stable Diffusion models, View transformation robustness, Stable Diffusion, view transformations, View
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to AAAI 25
点击查看摘要
Abstract:View transformation robustness (VTR) is critical for deep-learning-based multi-view 3D object reconstruction models, which indicates the methods’ stability under inputs with various view transformations. However, existing research seldom focused on view transformation robustness in multi-view 3D object reconstruction. One direct way to improve the models’ VTR is to produce data with more view transformations and add them to model training. Recent progress on large vision models, particularly Stable Diffusion models, has provided great potential for generating 3D models or synthesizing novel view images with only a single image input. Directly deploying these models at inference consumes heavy computation resources and their robustness to view transformations is not guaranteed either. To fully utilize the power of Stable Diffusion models without extra inference computation burdens, we propose to generate novel views with Stable Diffusion models for better view transformation robustness. Instead of synthesizing random views, we propose a reconstruction error-guided view selection method, which considers the reconstruction errors’ spatial distribution of the 3D predictions and chooses the views that could cover the reconstruction errors as much as possible. The methods are trained and tested on sets with large view transformations to validate the 3D reconstruction models’ robustness to view transformations. Extensive experiments demonstrate that the proposed method can outperform state-of-the-art 3D reconstruction methods and other view transformation robustness comparison methods.
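下面给出“重建误差引导的视角选择:把误差大的 3D 点投影到候选视角,贪心选择能覆盖最多未覆盖误差点的视角”这一思路的极简示意代码(基于摘要理解所写的草图,并非论文官方实现;project_fn 的可见性接口为假设):

```python
# 示意性草图(非官方实现):贪心的误差覆盖式视角选择。
import numpy as np

def select_views(error_pts: np.ndarray, view_poses: list, project_fn, k: int = 4) -> list:
    """error_pts: (N, 3) 重建误差较大的点;project_fn(pts, pose) 返回 (N,) 的可见性布尔数组(假设接口)。"""
    covered = np.zeros(len(error_pts), dtype=bool)
    chosen = []
    for _ in range(k):
        best_view, best_gain = -1, -1
        for i, pose in enumerate(view_poses):
            if i in chosen:
                continue
            visible = project_fn(error_pts, pose)          # 该视角下可见的误差点
            gain = int(np.sum(visible & ~covered))          # 新增覆盖数
            if gain > best_gain:
                best_view, best_gain = i, gain
        chosen.append(best_view)
        covered |= project_fn(error_pts, view_poses[best_view])
    return chosen   # 选出的视角索引,交给 Stable Diffusion 合成后加入训练
```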
zh
[CV-102] Nearly Zero-Cost Protection Against Mimicry by Personalized Diffusion Models
【速读】: 该论文旨在解决扩散模型在图像生成中的滥用问题,特别是复制艺术品和生成深度伪造图像的风险。现有图像保护方法在保护效果、不可见性和延迟之间难以平衡,限制了其实际应用。论文提出了一种新的解决方案,关键在于扰动预训练(perturbation pre-training)和混合扰动方法(mixture-of-perturbations approach),通过动态适应输入图像来最小化性能下降。此外,论文采用了一种新颖的训练策略,在多个变分自编码器(VAE)特征空间中计算保护损失,并在推理时进行自适应目标保护,从而增强了鲁棒性和不可见性。实验结果表明,该方法在保持保护性能的同时,显著提高了不可见性和大幅减少了推理时间。
链接: https://arxiv.org/abs/2412.11423
作者: Namhyuk Ahn,KiYoon Yoo,Wonhyuk Ahn,Daesik Kim,Seung-Hun Nam
机构: Inha University(仁荷大学); NAVER WEBTOON AI(NAVER WEBTOON AI); KRAFTON(KRAFTON)
关键词: diffusion models revolutionize, Recent advancements, models revolutionize image, revolutionize image generation, risks of misuse
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Recent advancements in diffusion models revolutionize image generation but pose risks of misuse, such as replicating artworks or generating deepfakes. Existing image protection methods, though effective, struggle to balance protection efficacy, invisibility, and latency, thus limiting practical use. We introduce perturbation pre-training to reduce latency and propose a mixture-of-perturbations approach that dynamically adapts to input images to minimize performance degradation. Our novel training strategy computes protection loss across multiple VAE feature spaces, while adaptive targeted protection at inference enhances robustness and invisibility. Experiments show comparable protection performance with improved invisibility and drastically reduced inference time. The code and demo are available at \urlthis https URL
zh
[CV-103] Category Level 6D Object Pose Estimation from a Single RGB Image using Diffusion
【速读】: 该论文试图解决从单张RGB图像中对类别级物体进行6D位姿和3D尺寸估计的问题,尤其在没有特定物体模型或深度信息的情况下。解决方案的关键在于利用基于分数的扩散模型(score-based diffusion models)生成物体位姿假设,以建模可能位姿的分布。与以往依赖昂贵的训练似然估计器来去除异常值并使用均值池化进行位姿聚合的方法不同,该论文提出了一种更简单的方法,即使用Mean Shift算法来估计分布的模式作为最终的位姿估计。这种方法在REAL275数据集上显著超越了当前的最先进方法。
链接: https://arxiv.org/abs/2412.11420
作者: Adam Bethell,Ravi Garg,Ian Reid
机构: University of Adelaide (阿德莱德大学)
关键词: computer vision, fundamental task, task in computer, Estimating, single RGB image
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Estimating the 6D pose and 3D size of an object from an image is a fundamental task in computer vision. Most current approaches are restricted to specific instances with known models or require ground truth depth information or point cloud captures from LIDAR. We tackle the harder problem of pose estimation for category-level objects from a single RGB image. We propose a novel solution that eliminates the need for specific object models or depth information. Our method utilises score-based diffusion models to generate object pose hypotheses to model the distribution of possible poses for the object. Unlike previous methods that rely on costly trained likelihood estimators to remove outliers before pose aggregation using mean pooling, we introduce a simpler approach using Mean Shift to estimate the mode of the distribution as the final pose estimate. Our approach outperforms the current state-of-the-art on the REAL275 dataset by a significant margin.
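下面给出“用 Mean Shift 对扩散模型采样出的多组位姿假设求众数”这一聚合步骤的极简示意代码(基于摘要理解所写的草图,并非论文官方实现;这里仅对平移分量演示,旋转分量的处理与带宽取值均为简化假设):

```python
# 示意性草图(非官方实现):Mean Shift 聚合位姿假设。
import numpy as np
from sklearn.cluster import MeanShift

def aggregate_pose_hypotheses(translations: np.ndarray, bandwidth: float = 0.05) -> np.ndarray:
    """translations: (N, 3) 各位姿假设的平移分量;返回分布众数对应的聚类中心。"""
    ms = MeanShift(bandwidth=bandwidth).fit(translations)
    labels, centers = ms.labels_, ms.cluster_centers_
    mode_label = np.bincount(labels).argmax()    # 样本最多的簇即分布的众数
    return centers[mode_label]
```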
zh
[CV-104] V-MIND: Building Versatile Monocular Indoor 3D Detector with Diverse 2D Annotations WACV2025
【速读】: 该论文试图解决室内单目3D物体检测领域中由于3D训练数据稀缺和多样性不足而导致的性能瓶颈问题。解决方案的关键在于提出了V-MIND(多功能单目室内检测器),通过利用公开的大规模2D数据集,结合单目深度估计技术和相机内参预测器,将2D图像转换为3D点云并生成伪3D边界框,从而增强室内3D检测器的性能。为解决转换过程中点云的距离误差问题,论文引入了3D自校准损失进行伪3D边界框的精炼,并提出了模糊性损失以应对从2D数据集引入新类别时的模糊性问题。最终,通过联合训练现有3D数据集和从2D数据集生成的伪3D边界框,V-MIND在Omni3D室内数据集上实现了跨类别的最先进物体检测性能。
链接: https://arxiv.org/abs/2412.11412
作者: Jin-Cheng Jhang,Tao Tu,Fu-En Wang,Ke Zhang,Min Sun,Cheng-Hao Kuo
机构: National Tsing Hua University(国立清华大学); Cornell University(康奈尔大学); Amazon
关键词: gaining significant attention, significant attention, robotic applications, gaining significant, increasing demand
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: WACV 2025
点击查看摘要
Abstract:The field of indoor monocular 3D object detection is gaining significant attention, fueled by the increasing demand in VR/AR and robotic applications. However, its advancement is impeded by the limited availability and diversity of 3D training data, owing to the labor-intensive nature of 3D data collection and annotation processes. In this paper, we present V-MIND (Versatile Monocular INdoor Detector), which enhances the performance of indoor 3D detectors across a diverse set of object classes by harnessing publicly available large-scale 2D datasets. By leveraging well-established monocular depth estimation techniques and camera intrinsic predictors, we can generate 3D training data by converting large-scale 2D images into 3D point clouds and subsequently deriving pseudo 3D bounding boxes. To mitigate distance errors inherent in the converted point clouds, we introduce a novel 3D self-calibration loss for refining the pseudo 3D bounding boxes during training. Additionally, we propose a novel ambiguity loss to address the ambiguity that arises when introducing new classes from 2D datasets. Finally, through joint training with existing 3D datasets and pseudo 3D bounding boxes derived from 2D datasets, V-MIND achieves state-of-the-art object detection performance across a wide range of classes on the Omni3D indoor dataset.
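下面给出“由单目深度与相机内参把 2D 框内像素反投影为点云,再取包围盒作为伪 3D 框”这一数据生成步骤的极简示意代码(基于摘要理解所写的草图,并非论文官方实现;分位数裁剪、最小深度阈值等均为假设):

```python
# 示意性草图(非官方实现):从 2D 框 + 深度图生成轴对齐的伪 3D 包围盒。
import numpy as np

def pseudo_3d_box(depth: np.ndarray, box2d: tuple, fx: float, fy: float,
                  cx: float, cy: float) -> np.ndarray:
    """depth: (H, W) 单位米;box2d: (x1, y1, x2, y2) 整数像素坐标。返回 (2, 3) 的 [min_xyz, max_xyz]。"""
    x1, y1, x2, y2 = box2d
    ys, xs = np.meshgrid(np.arange(y1, y2), np.arange(x1, x2), indexing="ij")
    z = depth[y1:y2, x1:x2].reshape(-1)
    x = (xs.reshape(-1) - cx) * z / fx          # 针孔相机反投影
    y = (ys.reshape(-1) - cy) * z / fy
    pts = np.stack([x, y, z], axis=1)[z > 0.1]  # 去掉无效深度
    lo = np.percentile(pts, 5, axis=0)          # 分位数裁剪以抗外点(假设比例)
    hi = np.percentile(pts, 95, axis=0)
    return np.stack([lo, hi], axis=0)
```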
zh
[CV-105] Multi-modal and Multi-scale Spatial Environment Understanding for Immersive Visual Text-to-Speech AAAI’2025
【速读】: 该论文试图解决视觉文本到语音(Visual Text-to-Speech, VTTS)任务中对空间环境的理解问题,特别是如何从图像中提取全面的局部和全局空间信息。解决方案的关键在于提出了一种多模态和多尺度空间环境理解方案(M2SE-VTTS),通过同时利用RGB和深度(Depth)空间信息,并采用多尺度建模方法,实现了对局部和全局空间知识的综合理解。具体来说,该方案通过将RGB和深度图像分割成小块,并使用Gemini生成的环境描述来引导局部空间理解,随后通过局部感知的全球空间理解来整合多模态和多尺度的特征,从而有效建模了多模态空间环境中局部和全局空间上下文的交互。
链接: https://arxiv.org/abs/2412.11409
作者: Rui Liu,Shuwei He,Yifan Hu,Haizhou Li
机构: 未知
关键词: spatial, spatial environment, global spatial, global spatial visual, spatial visual information
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Multimedia (cs.MM)
备注: 9 pages,2 figures, Accepted by AAAI’2025
点击查看摘要
Abstract:Visual Text-to-Speech (VTTS) aims to take the environmental image as the prompt to synthesize the reverberant speech for the spoken content. The challenge of this task lies in understanding the spatial environment from the image. Many attempts have been made to extract global spatial visual information from the RGB space of a spatial image. However, local and depth image information is crucial for understanding the spatial environment, which previous works have ignored. To address the issues, we propose a novel multi-modal and multi-scale spatial environment understanding scheme to achieve immersive VTTS, termed M2SE-VTTS. The multi-modal aims to take both the RGB and Depth spaces of the spatial image to learn more comprehensive spatial information, and the multi-scale seeks to model the local and global spatial knowledge simultaneously. Specifically, we first split the RGB and Depth images into patches and adopt the Gemini-generated environment captions to guide the local spatial understanding. After that, the multi-modal and multi-scale features are integrated by the local-aware global spatial understanding. In this way, M2SE-VTTS effectively models the interactions between local and global spatial contexts in the multi-modal spatial environment. Objective and subjective evaluations suggest that our model outperforms the advanced baselines in environmental speech generation. The code and audio samples are available at: this https URL.
zh
[CV-106] An Enhanced Classification Method Based on Adaptive Multi-Scale Fusion for Long-tailed Multispectral Point Clouds
【速读】: 该论文试图解决多光谱点云(Multispectral Point Cloud, MPC)分类中面临的稀疏标注目标、地物尺度差异和长尾分布等问题。解决方案的关键在于提出了一种基于自适应多尺度融合的增强分类方法。具体来说,该方法在训练集生成阶段采用网格平衡采样策略,从稀疏标注数据集中可靠地生成训练样本;在特征学习阶段,通过多尺度特征融合模块融合不同尺度的地物浅层特征,以解决因地物尺度变化导致的细粒度特征丢失问题;在分类阶段,设计了自适应混合损失模块,利用具有自适应权重的多分类头来平衡不同类别学习能力,从而提升小类别在多尺度和长尾分布下的分类性能。
链接: https://arxiv.org/abs/2412.11407
作者: TianZhu Liu,BangYan Hu,YanFeng Gu,Xian Li,Aleksandra Pižurica
机构: Harbin Institute of Technology (哈尔滨工业大学); Ghent University (根特大学)
关键词: Multispectral point cloud, Multispectral point, observed scene, scene understanding, point cloud
类目: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
备注: 16 pages, 9 figures, 5 tables
点击查看摘要
Abstract:Multispectral point cloud (MPC) captures 3D spatial-spectral information from the observed scene, which can be used for scene understanding and has a wide range of applications. However, most of the existing classification methods were extensively tested on indoor datasets, and when applied to outdoor datasets they still face problems including sparse labeled targets, differences in land-covers scales, and long-tailed distributions. To address the above issues, an enhanced classification method based on adaptive multi-scale fusion for MPCs with long-tailed distributions is proposed. In the training set generation stage, a grid-balanced sampling strategy is designed to reliably generate training samples from sparse labeled datasets. In the feature learning stage, a multi-scale feature fusion module is proposed to fuse shallow features of land-covers at different scales, addressing the issue of losing fine features due to scale variations in land-covers. In the classification stage, an adaptive hybrid loss module is devised to utilize multi-classification heads with adaptive weights to balance the learning ability of different classes, improving the classification performance of small classes due to various-scales and long-tailed distributions in land-covers. Experimental results on three MPC datasets demonstrate the effectiveness of the proposed method compared with the state-of-the-art methods.
zh
[CV-107] Leveraging Retrieval-Augmented Tags for Large Vision-Language Understanding in Complex Scenes
【速读】: 该论文试图解决视觉-语言任务中的对象感知推理问题,特别是在处理未见对象、减少幻觉以及捕捉复杂视觉场景中的细粒度关系方面。解决方案的关键在于提出了视觉感知检索增强提示框架 (Vision-Aware Retrieval-Augmented Prompting, VRAP),通过将检索增强的对象标签集成到大型视觉-语言模型 (Large Vision-Language Models, LVLMs) 的提示中,增强模型的推理能力。VRAP 引入了一种新颖的流程,利用预训练的视觉编码器和场景图解析器提取结构化标签(包括对象、属性和关系),并结合外部知识丰富这些标签,最终将其嵌入到 LLM 的输入中,从而实现更详细和准确的推理。
链接: https://arxiv.org/abs/2412.11396
作者: Antonio Carlos Rivera,Anthony Moore,Steven Robinson
机构: 未知
关键词: tasks poses significant, poses significant challenges, Large Vision-Language Models, vision-language tasks poses, handling unseen objects
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Object-aware reasoning in vision-language tasks poses significant challenges for current models, particularly in handling unseen objects, reducing hallucinations, and capturing fine-grained relationships in complex visual scenes. To address these limitations, we propose the Vision-Aware Retrieval-Augmented Prompting (VRAP) framework, a generative approach that enhances Large Vision-Language Models (LVLMs) by integrating retrieval-augmented object tags into their prompts. VRAP introduces a novel pipeline where structured tags, including objects, attributes, and relationships, are extracted using pretrained visual encoders and scene graph parsers. These tags are enriched with external knowledge and incorporated into the LLM’s input, enabling detailed and accurate reasoning. We evaluate VRAP across multiple vision-language benchmarks, including VQAv2, GQA, VizWiz, and COCO, achieving state-of-the-art performance in fine-grained reasoning and multimodal understanding. Additionally, our ablation studies highlight the importance of retrieval-augmented tags and contrastive learning, while human evaluations confirm VRAP’s ability to generate accurate, detailed, and contextually relevant responses. Notably, VRAP achieves a 40% reduction in inference latency by eliminating runtime retrieval. These results demonstrate that VRAP is a robust and efficient framework for advancing object-aware multimodal reasoning.
zh
[CV-108] Depth-Centric Dehazing and Depth-Estimation from Real-World Hazy Driving Video AAAI2025
【速读】: 该论文试图解决同时从单目雾霾视频中去除雾霾和估计深度的挑战性问题。解决方案的关键在于提出了一种以深度为中心的学习框架,该框架将大气散射模型 (ASM) 与亮度一致性约束 (BCC) 相结合,并依赖于一个共享的深度估计网络。该网络通过利用相邻的去雾帧来增强深度估计,同时利用改进的深度线索更有效地去除雾霾。此外,论文还设计了两个判别网络(D_MFIR 和 D_MDR),分别用于增强去雾视频的高频细节和减少低纹理区域中的黑洞现象。通过这些创新,该方法在真实雾霾场景中的视频去雾和深度估计任务上显著优于当前最先进的技术。
链接: https://arxiv.org/abs/2412.11395
作者: Junkai Fan,Kun Wang,Zhiqiang Yan,Xiang Chen,Shangbing Gao,Jun Li,Jian Yang
机构: 未知
关键词: depth estimation, real monocular hazy, depth, study the challenging, challenging problem
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by AAAI 2025, Project page: this https URL
点击查看摘要
Abstract:In this paper, we study the challenging problem of simultaneously removing haze and estimating depth from real monocular hazy videos. These tasks are inherently complementary: enhanced depth estimation improves dehazing via the atmospheric scattering model (ASM), while superior dehazing contributes to more accurate depth estimation through the brightness consistency constraint (BCC). To tackle these intertwined tasks, we propose a novel depth-centric learning framework that integrates the ASM model with the BCC constraint. Our key idea is that both ASM and BCC rely on a shared depth estimation network. This network simultaneously exploits adjacent dehazed frames to enhance depth estimation via BCC and uses the refined depth cues to more effectively remove haze through ASM. Additionally, we leverage a non-aligned clear video and its estimated depth to independently regularize the dehazing and depth estimation networks. This is achieved by designing two discriminator networks: D_MFIR enhances high-frequency details in dehazed videos, and D_MDR reduces the occurrence of black holes in low-texture regions. Extensive experiments demonstrate that the proposed method outperforms current state-of-the-art techniques in both video dehazing and depth estimation tasks, especially in real-world hazy scenes. Project page: this https URL.
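大气散射模型 (ASM) 的基本形式为 I(x) = J(x)·t(x) + A·(1−t(x)),其中透射率 t(x) = exp(−β·d(x)) 由深度 d(x) 决定。下面用 numpy 给出一个“在已知深度与大气光时按 ASM 反解无雾图像”的极简草图,仅用于说明该公式的作用;β、A 等取值为假设,并非论文完整的深度中心学习框架。

```python
import numpy as np

def transmission_from_depth(depth, beta=1.0):
    """由深度图计算透射率 t(x) = exp(-beta * d(x))。"""
    return np.exp(-beta * depth)

def dehaze_asm(hazy, depth, airlight, beta=1.0, t_min=0.1):
    """按大气散射模型 I = J*t + A*(1-t) 反解 J = (I - A*(1-t)) / t。"""
    t = transmission_from_depth(depth, beta)
    t = np.clip(t, t_min, 1.0)[..., None]             # 防止除零,并扩展到通道维
    clear = (hazy - airlight * (1.0 - t)) / t
    return np.clip(clear, 0.0, 1.0)

if __name__ == "__main__":
    H, W = 4, 4
    clear_gt = np.random.rand(H, W, 3)                 # 模拟无雾图像
    depth = np.random.uniform(0.2, 2.0, size=(H, W))   # 模拟深度
    A = np.array([0.9, 0.9, 0.9])                      # 大气光
    t = transmission_from_depth(depth)[..., None]
    hazy = clear_gt * t + A * (1.0 - t)                # 按 ASM 合成有雾图像
    recovered = dehaze_asm(hazy, depth, A)
    print("max reconstruction error:", np.abs(recovered - clear_gt).max())
```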
zh
[CV-109] Temporal Contrastive Learning for Video Temporal Reasoning in Large Vision-Language Models
【速读】: 该论文试图解决视频-语言理解中的时间推理问题,即模型在处理视频序列时难以捕捉动态交互和时间依赖性的挑战。解决方案的关键在于提出了时间语义对齐通过动态提示 (Temporal Semantic Alignment via Dynamic Prompting, TSADP) 框架,通过动态任务特定提示和时间对比学习来增强时间推理能力。具体来说,TSADP 利用动态提示生成器 (Dynamic Prompt Generator, DPG) 编码细粒度的时间关系,并通过时间对比损失 (Temporal Contrastive Loss, TCL) 对齐视觉和文本嵌入。该方法在 VidSitu 数据集上进行了评估,显著提升了现有最先进模型在视频内实体关联、时间关系理解和时间顺序预测等任务中的表现。
链接: https://arxiv.org/abs/2412.11391
作者: Rafael Souza,Jia-Hao Lim,Alexander Davis
机构: University of Brasilia (巴西利亚大学)
关键词: semantic concepts consistently, align semantic concepts, Temporal Semantic Alignment, Temporal, Temporal Contrastive Loss
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Temporal reasoning is a critical challenge in video-language understanding, as it requires models to align semantic concepts consistently across time. While existing large vision-language models (LVLMs) and large language models (LLMs) excel at static tasks, they struggle to capture dynamic interactions and temporal dependencies in video sequences. In this work, we propose Temporal Semantic Alignment via Dynamic Prompting (TSADP), a novel framework that enhances temporal reasoning capabilities through dynamic task-specific prompts and temporal contrastive learning. TSADP leverages a Dynamic Prompt Generator (DPG) to encode fine-grained temporal relationships and a Temporal Contrastive Loss (TCL) to align visual and textual embeddings across time. We evaluate our method on the VidSitu dataset, augmented with enriched temporal annotations, and demonstrate significant improvements over state-of-the-art models in tasks such as Intra-Video Entity Association, Temporal Relationship Understanding, and Chronology Prediction. Human evaluations further confirm TSADP’s ability to generate coherent and semantically accurate descriptions. Our analysis highlights the robustness, efficiency, and practical utility of TSADP, making it a step forward in the field of video-language understanding.
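时间对比损失 (TCL) 可以按 InfoNCE 的形式理解:同一时刻配对的视觉/文本嵌入互为正样本,其余时刻的嵌入为负样本。下面是一个 numpy 版的示意实现,温度系数等超参数为假设取值,并非 TSADP 的原始代码。

```python
import numpy as np

def l2_normalize(x, axis=-1):
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + 1e-8)

def info_nce(sim):
    """对相似度矩阵按行做 softmax 交叉熵,正样本位于对角线。"""
    sim = sim - sim.max(axis=1, keepdims=True)
    log_prob = sim - np.log(np.exp(sim).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_prob))

def temporal_contrastive_loss(visual, text, temperature=0.07):
    """visual、text 形状均为 (T, D):第 t 帧的视觉嵌入与第 t 帧的文本嵌入互为正样本。"""
    v, t = l2_normalize(visual), l2_normalize(text)
    sim = v @ t.T / temperature          # (T, T) 跨时间的相似度矩阵
    return 0.5 * (info_nce(sim) + info_nce(sim.T))   # 视觉->文本 与 文本->视觉 双向

if __name__ == "__main__":
    T, D = 8, 64
    rng = np.random.default_rng(0)
    visual = rng.normal(size=(T, D))
    text = visual + 0.1 * rng.normal(size=(T, D))    # 模拟大致对齐的文本嵌入
    print("TCL loss:", temporal_contrastive_loss(visual, text))
```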
zh
[CV-110] Adapting Segment Anything Model (SAM) to Experimental Datasets via Fine-Tuning on GAN-based Simulation: A Case Study in Additive Manufacturing
【速读】: 该论文试图解决在工业X射线计算机断层扫描(XCT)中,传统计算机视觉模型在处理噪声、分辨率变化和复杂内部结构时表现不佳的问题。解决方案的关键在于利用参数高效的微调策略,特别是Conv-LoRa技术,对Segment Anything Model (SAM)进行适应性调整,以应对材料科学领域的特定数据集。此外,通过生成对抗网络(GAN)生成数据来增强训练过程,从而提高模型在复杂XCT数据上的分割性能。实验结果表明,针对特定领域的科学成像数据进行微调显著提升了模型的性能,但其在跨数据集上的泛化能力仍有限,这突显了进一步研究领域特定分割任务的稳健和可扩展解决方案的必要性。
链接: https://arxiv.org/abs/2412.11381
作者: Anika Tabassum,Amirkoushyar Ziabari
机构: Oak Ridge National Laboratory
关键词: X-ray computed tomography, Industrial X-ray computed, computed tomography, XCT commonly accompanied, powerful tool
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Image and Video Processing (eess.IV)
备注:
点击查看摘要
Abstract:Industrial X-ray computed tomography (XCT) is a powerful tool for non-destructive characterization of materials and manufactured components. XCT is commonly accompanied by advanced image analysis and computer vision algorithms to extract relevant information from the images. Traditional computer vision models often struggle due to noise, resolution variability, and complex internal structures, particularly in scientific imaging applications. State-of-the-art foundational models, like the Segment Anything Model (SAM), designed for general-purpose image segmentation, have revolutionized image segmentation across various domains, yet their application in specialized fields like materials science remains under-explored. In this work, we explore the application and limitations of SAM for industrial X-ray CT inspection of additive manufacturing components. We demonstrate that while SAM shows promise, it struggles with out-of-distribution data, multiclass segmentation, and computational efficiency during fine-tuning. To address these issues, we propose a fine-tuning strategy utilizing parameter-efficient techniques, specifically Conv-LoRa, to adapt SAM for material-specific datasets. Additionally, we leverage generative adversarial network (GAN)-generated data to enhance the training process and improve the model’s segmentation performance on complex X-ray CT data. Our experimental results highlight the importance of tailored segmentation models for accurate inspection, showing that fine-tuning SAM on domain-specific scientific imaging data significantly improves performance. However, despite improvements, the model’s ability to generalize across diverse datasets remains limited, highlighting the need for further research into robust, scalable solutions for domain-specific segmentation tasks.
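参数高效微调(如 LoRA 系列方法)的共同思路是冻结预训练权重 W,仅训练低秩增量 ΔW = B·A。下面以全连接层为例,用 numpy 演示 LoRA 前向计算与可训练参数量的对比;rank、alpha 等取值为示意性假设,并非论文针对 SAM 的 Conv-LoRa 实现。

```python
import numpy as np

class LoRALinear:
    """在冻结权重 W 上叠加低秩增量 B @ A 的线性层(仅作原理演示)。"""

    def __init__(self, in_dim, out_dim, rank=4, alpha=8, seed=0):
        rng = np.random.default_rng(seed)
        self.W = rng.normal(size=(out_dim, in_dim))            # 冻结的预训练权重
        self.A = rng.normal(scale=0.01, size=(rank, in_dim))   # 可训练低秩因子
        self.B = np.zeros((out_dim, rank))                     # 可训练,零初始化
        self.scale = alpha / rank

    def forward(self, x):
        # 原始输出 + 低秩旁路输出;初始时 B=0,行为与预训练模型完全一致
        return x @ self.W.T + self.scale * (x @ self.A.T @ self.B.T)

    def trainable_params(self):
        return self.A.size + self.B.size

if __name__ == "__main__":
    layer = LoRALinear(in_dim=256, out_dim=256, rank=4)
    x = np.random.rand(2, 256)
    print("output shape:", layer.forward(x).shape)
    print("trainable params:", layer.trainable_params(), "vs full weight:", layer.W.size)
```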
zh
[CV-111] Relation-Guided Adversarial Learning for Data-free Knowledge Transfer
【速读】: 该论文试图解决数据无知识蒸馏(data-free knowledge distillation)中由于忽视类内多样性(intra-class diversity)和类间相似性(inter-class confusion)导致的样本同质化问题,从而限制了模型性能。解决方案的关键在于提出了一种新的关系引导对抗学习方法(Relation-Guided Adversarial Learning, RGAL),通过引入三元组损失(triplet losses)来分别增强类内多样性和类间混淆。具体来说,该方法设计了两个阶段:图像合成阶段和学生训练阶段。在图像合成阶段,通过优化过程推动相同标签样本的分离和不同标签样本的靠近,从而实现类内多样性和类间混淆;在学生训练阶段,则进行相反的优化,以增强类内样本的聚合和类间样本的分离。此外,为了缓解全局多样性与类间混淆之间的冲突,提出了焦点加权采样策略(focal weighted sampling strategy),通过在有限距离范围内不均匀地选择三元组中的负样本,进一步提升了方法的有效性和数据效率。
链接: https://arxiv.org/abs/2412.11380
作者: Yingping Liang,Ying Fu
机构: 未知
关键词: recovering training data, pre-trained model, data, diversity, samples
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Data-free knowledge distillation transfers knowledge by recovering training data from a pre-trained model. Despite the recent success of seeking global data diversity, the diversity within each class and the similarity among different classes are largely overlooked, resulting in data homogeneity and limited performance. In this paper, we introduce a novel Relation-Guided Adversarial Learning method with triplet losses, which solves the homogeneity problem from two aspects. To be specific, our method aims to promote both intra-class diversity and inter-class confusion of the generated samples. To this end, we design two phases, an image synthesis phase and a student training phase. In the image synthesis phase, we construct an optimization process to push away samples with the same labels and pull close samples with different labels, leading to intra-class diversity and inter-class confusion, respectively. Then, in the student training phase, we perform an opposite optimization, which adversarially attempts to reduce the distance of samples of the same classes and enlarge the distance of samples of different classes. To mitigate the conflict of seeking high global diversity and keeping inter-class confusing, we propose a focal weighted sampling strategy by selecting the negative in the triplets unevenly within a finite range of distance. RGAL shows significant improvement over previous state-of-the-art methods in accuracy and data efficiency. Besides, RGAL can be inserted into state-of-the-art methods on various data-free knowledge transfer applications. Experiments on various benchmarks demonstrate the effectiveness and generalizability of our proposed method on various tasks, specially data-free knowledge distillation, data-free quantization, and non-exemplar incremental learning. Our code is available at this https URL.
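三元组损失通过“拉近正样本、推远负样本”来塑造特征分布:论文在图像合成阶段推远同类、拉近异类(提升类内多样性与类间混淆),在学生训练阶段则做相反方向的优化。下面给出一个通用的 numpy 三元组损失草图,并演示通过交换正负样本得到两个相反的优化目标;margin 取值为假设。

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=1.0):
    """标准三元组损失:max(0, d(a, p) - d(a, n) + margin),输入形状均为 (N, D)。"""
    d_pos = np.linalg.norm(anchor - positive, axis=1)
    d_neg = np.linalg.norm(anchor - negative, axis=1)
    return np.maximum(0.0, d_pos - d_neg + margin).mean()

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    feats = rng.normal(size=(6, 128))
    a, same, diff = feats[0:1], feats[1:2], feats[3:4]   # 锚点、同类样本、异类样本

    # 学生训练阶段:同类为正样本、异类为负样本(拉近同类、推远异类)
    print("student phase loss  :", triplet_loss(a, same, diff))
    # 图像合成阶段:反向优化,交换正负样本(推远同类、拉近异类)
    print("synthesis phase loss:", triplet_loss(a, diff, same))
```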
zh
[CV-112] Text and Image Are Mutually Beneficial: Enhancing Training-Free Few-Shot Classification with CLIP AAAI2025
【速读】: 该论文试图解决在无训练的小样本学习(Few-Shot Learning, FSL)中,基于对比语言-图像预训练(Contrastive Language-Image Pretraining, CLIP)的方法存在的两个主要问题:1)图像模态中的异常匹配;2)生成的文本提示质量不一致。解决方案的关键在于构建了一个文本-图像互指导优化(Text-Image Mutual guidance Optimization, TIMO)机制,其中包括图像引导文本(Image-Guided-Text, IGT)组件和文本引导图像(Text-Guided-Image, TGI)组件。IGT通过图像表示来校正文本提示的质量,而TGI通过文本表示来缓解图像模态的异常匹配。通过这种互指导机制,TIMO显著超越了现有的无训练SOTA方法,并且其增强版本TIMO-S在时间成本大幅降低的情况下,甚至超过了需要训练的最佳方法。
链接: https://arxiv.org/abs/2412.11375
作者: Yayuan Li,Jintao Guo,Lei Qi,Wenbin Li,Yinghuan Shi
机构: Tongji University (同济大学); Shanghai Artificial Intelligence Laboratory (上海人工智能实验室)
关键词: Contrastive Language-Image Pretraining, Contrastive Language-Image, Language-Image Pretraining, vision tasks, CLIP
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by AAAI 2025
点击查看摘要
Abstract:Contrastive Language-Image Pretraining (CLIP) has been widely used in vision tasks. Notably, CLIP has demonstrated promising performance in few-shot learning (FSL). However, existing CLIP-based methods in training-free FSL (i.e., without the requirement of additional training) mainly learn different modalities independently, leading to two essential issues: 1) severe anomalous match in image modality; 2) varying quality of generated text prompts. To address these issues, we build a mutual guidance mechanism, that introduces an Image-Guided-Text (IGT) component to rectify varying quality of text prompts through image representations, and a Text-Guided-Image (TGI) component to mitigate the anomalous match of image modality through text representations. By integrating IGT and TGI, we adopt a perspective of Text-Image Mutual guidance Optimization, proposing TIMO. Extensive experiments show that TIMO significantly outperforms the state-of-the-art (SOTA) training-free method. Additionally, by exploring the extent of mutual guidance, we propose an enhanced variant, TIMO-S, which even surpasses the best training-required methods by 0.33% with approximately 100 times less time cost. Our code is available at this https URL.
zh
[CV-113] BiM-VFI: Bidirectional Motion Field-Guided Frame Interpolation for Video with Non-uniform Motions
【速读】: 该论文试图解决视频帧插值 (Video Frame Interpolation, VFI) 模型在处理非均匀运动(如加速、减速和改变方向)时出现的时序定位模糊问题,导致插值帧模糊的现象。解决方案的关键在于提出了一种双向运动场 (Bidirectional Motion Field, BiM) 来有效描述非均匀运动,并结合 BiM 引导的光流网络 (BiM-guided Flow Net, BiMFN) 和内容感知上采样网络 (Content-Aware Upsampling Network, CAUN) 进行精确的光流估计。此外,通过基于 VFI 的流监督知识蒸馏 (Knowledge Distillation for VFI-centric Flow supervision, KDVCF) 来监督 VFI 模型的运动估计,从而显著减少插值帧的模糊。
链接: https://arxiv.org/abs/2412.11365
作者: Wonyong Seo,Jihyong Oh,Munchurl Kim
机构: Korea Advanced Institute of Science and Technology(韩国高等科学技术研究院); Chung-Ang University(中央大学)
关键词: Existing Video Frame, Video Frame interpolation, Existing Video, yield blurred interpolated, Content-Aware Upsampling Network
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Existing Video Frame interpolation (VFI) models tend to suffer from time-to-location ambiguity when trained with video of non-uniform motions, such as accelerating, decelerating, and changing directions, which often yield blurred interpolated frames. In this paper, we propose (i) a novel motion description map, Bidirectional Motion field (BiM), to effectively describe non-uniform motions; (ii) a BiM-guided Flow Net (BiMFN) with Content-Aware Upsampling Network (CAUN) for precise optical flow estimation; and (iii) Knowledge Distillation for VFI-centric Flow supervision (KDVCF) to supervise the motion estimation of VFI model with VFI-centric teacher flows. The proposed VFI is called a Bidirectional Motion field-guided VFI (BiM-VFI) model. Extensive experiments show that our BiM-VFI model significantly surpasses the recent state-of-the-art VFI methods by 26% and 45% improvements in LPIPS and STLPIPS respectively, yielding interpolated frames with much fewer blurs at arbitrary time instances.
zh
[CV-114] Visual IRL for Human-Like Robotic Manipulation
【速读】: 该论文试图解决协作机器人(cobots)在执行操作任务时如何以类人方式进行学习的问题。解决方案的关键在于采用了一种基于观察学习(learn-from-observation, LfO)的方法,通过视觉逆向强化学习(Visual IRL)直接利用观察到的人类任务执行中的RGB-D关键点作为状态特征,输入到逆向强化学习(IRL)中,从而学习到将关键点映射到奖励值的奖励函数。随后,通过一种新颖的神经符号动力学模型,将人类的运动学特征映射到协作机器人手臂上,实现类似末端执行器的定位,同时最小化关节调整,以保持人类运动的自然动态。与仅关注末端执行器位置的先前技术不同,该方法将人类手臂的多个关节角度映射到协作机器人的相应关节,并使用逆运动学模型进行最小调整,以实现精确的末端执行器定位。
链接: https://arxiv.org/abs/2412.11360
作者: Ehsan Asali,Prashant Doshi
机构: University of Georgia (佐治亚大学)
关键词: introduce Visual IRL, learn manipulation tasks, collaborative robots, human, Visual IRL
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:We present a novel method for collaborative robots (cobots) to learn manipulation tasks and perform them in a human-like manner. Our method falls under the learn-from-observation (LfO) paradigm, where robots learn to perform tasks by observing human actions, which facilitates quicker integration into industrial settings compared to programming from scratch. We introduce Visual IRL that uses the RGB-D keypoints in each frame of the observed human task performance directly as state features, which are input to inverse reinforcement learning (IRL). The inversely learned reward function, which maps keypoints to reward values, is transferred from the human to the cobot using a novel neuro-symbolic dynamics model, which maps human kinematics to the cobot arm. This model allows similar end-effector positioning while minimizing joint adjustments, aiming to preserve the natural dynamics of human motion in robotic manipulation. In contrast with previous techniques that focus on end-effector placement only, our method maps multiple joint angles of the human arm to the corresponding cobot joints. Moreover, it uses an inverse kinematics model to then minimally adjust the joint angles, for accurate end-effector positioning. We evaluate the performance of this approach on two different realistic manipulation tasks. The first task is produce processing, which involves picking, inspecting, and placing onions based on whether they are blemished. The second task is liquid pouring, where the robot picks up bottles, pours the contents into designated containers, and disposes of the empty bottles. Our results demonstrate advances in human-like robotic manipulation, leading to more human-robot compatibility in manufacturing applications.
zh
[CV-115] One-Shot Multilingual Font Generation Via ViT
【速读】: 该论文试图解决多语言字体生成中的独特挑战,特别是针对汉字、日文和韩文(CJK)等表意文字系统中数千个独特字符的单独设计问题。解决方案的关键在于引入了一种基于视觉Transformer(Vision Transformer, ViT)的模型,并通过预训练的视觉前置任务(Masked Autoencoding, MAE)来增强模型的泛化能力,从而无需依赖先前框架中复杂的组件设计。此外,论文还集成了检索增强引导(Retrieval-Augmented Guidance, RAG)模块,以动态检索和适应样式参考,进一步提升了模型的可扩展性和实际应用性。
链接: https://arxiv.org/abs/2412.11342
作者: Zhiheng Wang,Jiarui Liu
机构: University of Massachusetts, Amherst(马萨诸塞大学阿默斯特分校); University of Massachusetts, Amherst(马萨诸塞大学阿默斯特分校)
关键词: poses unique challenges, design poses unique, Font design poses, individually crafted, poses unique
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Font design poses unique challenges for logographic languages like Chinese, Japanese, and Korean (CJK), where thousands of unique characters must be individually crafted. This paper introduces a novel Vision Transformer (ViT)-based model for multi-language font generation, effectively addressing the complexities of both logographic and alphabetic scripts. By leveraging ViT and pretraining with a strong visual pretext task (Masked Autoencoding, MAE), our model eliminates the need for complex design components in prior frameworks while achieving comprehensive results with enhanced generalizability. Remarkably, it can generate high-quality fonts across multiple languages for unseen, unknown, and even user-crafted characters. Additionally, we integrate a Retrieval-Augmented Guidance (RAG) module to dynamically retrieve and adapt style references, improving scalability and real-world applicability. We evaluated our approach in various font generation tasks, demonstrating its effectiveness, adaptability, and scalability.
zh
[CV-116] Modality-Driven Design for Multi-Step Dexterous Manipulation: Insights from Neuroscience DATE
【速读】: 该论文试图解决多步灵巧操作(multi-step dexterous manipulation)在家庭场景中的应用问题,这是一个在机器人领域尚未充分探索的领域。解决方案的关键在于提出了一种模块化方法,将操作过程分解为三个子技能:1) 到达(reaching)、2) 抓取和提升(grasping and lifting)、3) 手内旋转(in-hand rotation),每个子技能基于人类大脑中使用的主导感官模态(dominant sensory modalities),并采用不同的方法来实现:经典控制器、视觉-语言-动作模型(Vision-Language-Action model)和带有力反馈的强化学习策略(reinforcement learning policy with force feedback)。这种方法的核心贡献在于提供了一种受神经科学启发、基于模态驱动(modality-driven)的多步灵巧操作方法。
链接: https://arxiv.org/abs/2412.11337
作者: Naoki Wake,Atsushi Kanehira,Daichi Saito,Jun Takamatsu,Kazuhiro Sasabuchi,Hideki Koike,Katsushi Ikeuchi
机构: Microsoft(微软); Institute of Science Tokyo(东京科学研究所)
关键词: Multi-step dexterous manipulation, household scenarios, fundamental skill, skill in household, remains an underexplored
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注: 8 pages, 5 figures, 2 tables. Last updated on December 14th, 2024
点击查看摘要
Abstract:Multi-step dexterous manipulation is a fundamental skill in household scenarios, yet remains an underexplored area in robotics. This paper proposes a modular approach, where each step of the manipulation process is addressed with dedicated policies based on effective modality input, rather than relying on a single end-to-end model. To demonstrate this, a dexterous robotic hand performs a manipulation task involving picking up and rotating a box. Guided by insights from neuroscience, the task is decomposed into three sub-skills, 1)reaching, 2)grasping and lifting, and 3)in-hand rotation, based on the dominant sensory modalities employed in the human brain. Each sub-skill is addressed using distinct methods from a practical perspective: a classical controller, a Vision-Language-Action model, and a reinforcement learning policy with force feedback, respectively. We tested the pipeline on a real robot to demonstrate the feasibility of our approach. The key contribution of this study lies in presenting a neuroscience-inspired, modality-driven methodology for multi-step dexterous manipulation.
zh
[CV-117] SonicMesh: Enhancing 3D Human Mesh Reconstruction in Vision-Impaired Environments With Acoustic Signals
【速读】: 该论文试图解决在光照不良、隐私问题或遮挡等挑战性环境下,从2D RGB图像进行3D人体网格重建 (3D Human Mesh Reconstruction, HMR) 的难题。解决方案的关键在于引入SonicMesh,这是一种结合声学信号与RGB图像的新方法。为应对声学信号生成图像的低分辨率问题以及缺乏专用处理骨干网络的挑战,研究者对现有的HRNet进行了修改以实现有效的特征提取,并集成了通用特征嵌入技术以增强跨维度特征对齐的精度,从而在复杂环境中实现高精度的3D人体网格重建。
链接: https://arxiv.org/abs/2412.11325
作者: Xiaoxuan Liang,Wuyang Zhang,Hong Zhou,Zhaolong Wei,Sicheng Zhu,Yansong Li,Rui Yin,Jiantao Yuan,Jeremy Gummeson
机构: 未知
关键词: Human Mesh Reconstruction, Mesh Reconstruction, acoustic signals, RGB images faces, privacy concerns
类目: Computer Vision and Pattern Recognition (cs.CV); Sound (cs.SD); Audio and Speech Processing (eess.AS)
备注:
点击查看摘要
Abstract:3D Human Mesh Reconstruction (HMR) from 2D RGB images faces challenges in environments with poor lighting, privacy concerns, or occlusions. These weaknesses of RGB imaging can be complemented by acoustic signals, which are widely available, easy to deploy, and capable of penetrating obstacles. However, no existing methods effectively combine acoustic signals with RGB data for robust 3D HMR. The primary challenges include the low-resolution images generated by acoustic signals and the lack of dedicated processing backbones. We introduce SonicMesh, a novel approach combining acoustic signals with RGB images to reconstruct 3D human mesh. To address the challenges of low resolution and the absence of dedicated processing backbones in images generated by acoustic signals, we modify an existing method, HRNet, for effective feature extraction. We also integrate a universal feature embedding technique to enhance the precision of cross-dimensional feature alignment, enabling SonicMesh to achieve high accuracy. Experimental results demonstrate that SonicMesh accurately reconstructs 3D human mesh in challenging environments such as occlusions, non-line-of-sight scenarios, and poor lighting.
zh
[CV-118] Unimodal and Multimodal Static Facial Expression Recognition for Virtual Reality Users with EmoHeVRDB
【速读】: 该论文试图解决在虚拟现实(VR)环境中由于头戴式显示器(HMD)遮挡导致面部表情识别(Facial Expression Recognition, FER)精度受限的问题。解决方案的关键在于利用Meta Quest Pro VR头盔捕捉的面部表情激活数据(Facial Expression Activations, FEAs),并通过多模态融合技术提升FER的准确性。研究通过对比单模态方法和多模态方法,发现将FEA数据与图像数据结合的中间融合方法在七种情感类别的静态FER任务中达到了80.42%的最高准确率,显著超越了仅使用图像数据的基线结果69.84%。这一发现为VR环境中的FER任务设定了新的基准,并强调了多模态融合在提升FER精度方面的潜力。
链接: https://arxiv.org/abs/2412.11306
作者: Thorben Ortmann,Qi Wang,Larissa Putzar
机构: University of the West of Scotland; Hamburg University of Applied Sciences
关键词: Facial Expression Activations, Pro Virtual Reality, Meta Quest Pro, Quest Pro Virtual, utilizing Facial Expression
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:In this study, we explored the potential of utilizing Facial Expression Activations (FEAs) captured via the Meta Quest Pro Virtual Reality (VR) headset for Facial Expression Recognition (FER) in VR settings. Leveraging the EmojiHeroVR Database (EmoHeVRDB), we compared several unimodal approaches and achieved up to 73.02% accuracy for the static FER task with seven emotion categories. Furthermore, we integrated FEA and image data in multimodal approaches, observing significant improvements in recognition accuracy. An intermediate fusion approach achieved the highest accuracy of 80.42%, significantly surpassing the baseline evaluation result of 69.84% reported for EmoHeVRDB’s image data. Our study is the first to utilize EmoHeVRDB’s unique FEA data for unimodal and multimodal static FER, establishing new benchmarks for FER in VR settings. Our findings highlight the potential of fusing complementary modalities to enhance FER accuracy in VR settings, where conventional image-based methods are severely limited by the occlusion caused by Head-Mounted Displays (HMDs).
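“中间融合”指在两个模态各自的中间特征层做拼接,再接统一的分类头,而不是在输入层或决策层融合。下面是一个极简的 numpy 前向示意:FEA 特征与图像特征各经一层编码后拼接并分类为 7 类情感;其中 63 维 FEA、512 维图像特征以及单层线性编码均为便于说明的假设,并非 EmoHeVRDB 实验所用网络。

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    return np.maximum(0.0, x)

# 两个模态各自的编码分支(此处用单层线性 + ReLU 代替真实骨干网络)
W_fea = rng.normal(scale=0.1, size=(63, 32))    # 假定 FEA 输入为 63 维表情激活
W_img = rng.normal(scale=0.1, size=(512, 32))   # 假定图像骨干输出 512 维特征
W_cls = rng.normal(scale=0.1, size=(64, 7))     # 拼接后的中间融合分类头(7 类情感)

def intermediate_fusion_forward(fea_input, img_feat):
    h_fea = relu(fea_input @ W_fea)                      # FEA 分支中间特征
    h_img = relu(img_feat @ W_img)                       # 图像分支中间特征
    fused = np.concatenate([h_fea, h_img], axis=-1)      # 中间层拼接融合
    return fused @ W_cls

if __name__ == "__main__":
    fea = rng.random((1, 63))
    img = rng.random((1, 512))
    logits = intermediate_fusion_forward(fea, img)
    print("predicted emotion id:", int(np.argmax(logits)))
```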
zh
[CV-119] Detecting Daily Living Gait Amid Huntington's Disease Chorea using a Foundation Deep Learning Model
【速读】: 该论文试图解决神经退行性疾病(Neurodegenerative Diseases, NDDs)患者在日常活动中步态检测的难题,尤其是针对伴有不自主运动的患者。解决方案的关键在于开发了一种名为J-Net的深度学习模型,该模型基于U-Net架构,通过预训练的自监督基础模型进行微调,并结合分割头(segmentation head)来实现步态检测。J-Net利用腕戴式加速度计数据,在实验室和日常环境中对亨廷顿病(Huntington’s Disease, HD)、帕金森病(Parkinson’s Disease, PD)和对照组的数据进行了评估,显著提高了步态检测的准确性,并在临床相关性上得到了验证。
链接: https://arxiv.org/abs/2412.11286
作者: Dafna Schwartz,Lori Quinn,Nora E. Fritz,Lisa M. Muratori,Jeffery M. Hausdorff,Ran Gilad Bachrach
机构: 未知
关键词: Wearable sensors offer, collect physical activity, Wearable sensors, physical activity, key component
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:Wearable sensors offer a non-invasive way to collect physical activity (PA) data, with walking as a key component. Existing models often struggle to detect gait bouts in individuals with neurodegenerative diseases (NDDs) involving involuntary movements. We developed J-Net, a deep learning model inspired by U-Net, which uses a pre-trained self-supervised foundation model fine-tuned with Huntington's disease (HD) in-lab data and paired with a segmentation head for gait detection. J-Net processes wrist-worn accelerometer data to detect gait during daily living. We evaluated J-Net on in-lab and daily-living data from HD, Parkinson's disease (PD), and controls. J-Net achieved a 10-percentage point improvement in ROC-AUC for HD over existing methods, reaching 0.97 for in-lab data. In daily-living environments, J-Net estimates showed no significant differences in median daily walking time between HD and controls (p = 0.23), in contrast to other models, which indicated counterintuitive results (p < 0.005). Walking time measured by J-Net correlated with the UHDRS-TMS clinical severity score (r=-0.52; p=0.02), confirming its clinical relevance. Fine-tuning J-Net on PD data also improved gait detection over current methods. J-Net's architecture effectively addresses the challenges of gait detection in severe chorea and offers robust performance in daily living. The dataset and J-Net model are publicly available, providing a resource for further research into NDD-related gait impairments.
zh
[CV-120] Learning Normal Flow Directly From Event Neighborhoods
【速读】: 该论文试图解决基于事件的动场估计(event-based motion field estimation)中的问题,特别是当前光流(optical flow)方法在跨域迁移性和准确性方面的不足。解决方案的关键在于提出了一种新颖的监督点云编码器方法,用于法向流(normal flow)估计。该方法通过直接从原始事件数据中估计每个事件的法向流,具有以下关键优势:1) 提供时空上更锐利的预测;2) 支持多样化的数据增强,如随机旋转,以提高跨域鲁棒性;3) 通过集成推理自然支持不确定性量化;4) 在归一化相机坐标系中进行训练和推理,增强跨相机迁移性。实验表明,该方法在跨数据集迁移时表现优于现有最先进方法,并结合法向流和IMU数据提出了一个基于最大间隔问题的自运动求解器,以应对复杂场景。
链接: https://arxiv.org/abs/2412.11284
作者: Dehao Yuan,Levi Burner,Jiayi Wu,Minghui Liu,Jingxi Chen,Yiannis Aloimonos,Cornelia Fermüller
机构: University of Maryland, College Park, USA(马里兰大学学院公园分校)
关键词: Event-based motion field, Event-based motion, motion field estimation, normal flow, motion field
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Event-based motion field estimation is an important task. However, current optical flow methods face challenges: learning-based approaches, often frame-based and relying on CNNs, lack cross-domain transferability, while model-based methods, though more robust, are less accurate. To address the limitations of optical flow estimation, recent works have focused on normal flow, which can be more reliably measured in regions with limited texture or strong edges. However, existing normal flow estimators are predominantly model-based and suffer from high errors. In this paper, we propose a novel supervised point-based method for normal flow estimation that overcomes the limitations of existing event learning-based approaches. Using a local point cloud encoder, our method directly estimates per-event normal flow from raw events, offering multiple unique advantages: 1) It produces temporally and spatially sharp predictions. 2) It supports more diverse data augmentation, such as random rotation, to improve robustness across various domains. 3) It naturally supports uncertainty quantification via ensemble inference, which benefits downstream tasks. 4) It enables training and inference on undistorted data in normalized camera coordinates, improving transferability across cameras. Extensive experiments demonstrate our method achieves better and more consistent performance than state-of-the-art methods when transferred across different datasets. Leveraging this transferability, we train our model on the union of datasets and release it for public use. Finally, we introduce an egomotion solver based on a maximum-margin problem that uses normal flow and IMU to achieve strong performance in challenging scenarios.
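摘要中提到该方法可通过集成推理 (ensemble inference) 自然地给出不确定性估计。其通用做法是:多个模型(或多次推理)对同一事件预测法向流,取均值作为估计、取标准差作为不确定性。下面的 numpy 草图用随机线性映射代替真实网络,仅演示这一流程。

```python
import numpy as np

def ensemble_normal_flow(event_feat, models):
    """对同一事件特征用多个模型预测 2D 法向流,返回均值与不确定性(标准差)。"""
    preds = np.stack([m(event_feat) for m in models], axis=0)  # (M, 2)
    return preds.mean(axis=0), preds.std(axis=0)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # 用 M 个随机线性映射模拟模型集成(真实场景中为不同初始化/数据增强训练的网络)
    models = [lambda x, W=rng.normal(scale=0.1, size=(16, 2)): x @ W for _ in range(5)]
    event_feat = rng.random(16)                  # 单个事件邻域的编码特征
    mean_flow, uncertainty = ensemble_normal_flow(event_feat, models)
    print("normal flow estimate:", mean_flow)
    print("per-axis uncertainty:", uncertainty)
```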
zh
[CV-121] VividFace: A Diffusion-Based Hybrid Framework for High-Fidelity Video Face Swapping
【速读】: 该论文试图解决视频人脸交换中的关键问题,包括时间一致性、复杂场景下的表现、身份保持以及对遮挡和姿态变化的鲁棒性。解决方案的关键在于提出了一种基于扩散模型的图像-视频混合训练框架,该框架结合了丰富的静态图像数据和时间序列视频数据,通过VidFaceVAE有效地处理这两类数据,从而更好地维持生成视频的时间一致性。此外,论文构建了Attribute-Identity Disentanglement Triplet (AIDT) 数据集,通过增强的遮挡增强和3D重建技术作为网络输入条件,进一步解耦身份和姿态特征,提升了对遮挡和大姿态变化的鲁棒性。实验结果表明,该框架在身份保持、时间一致性和视觉质量方面均优于现有方法,同时减少了推理步骤。
链接: https://arxiv.org/abs/2412.11279
作者: Hao Shao,Shulun Wang,Yang Zhou,Guanglu Song,Dailan He,Shuo Qin,Zhuofan Zong,Bingqi Ma,Yu Liu,Hongsheng Li
机构: CUHK MMLab(CUHK MMLab); SenseTime Research(商汤科技研究); CPII under InnoHK(InnoHK下的CPII)
关键词: Video face swapping, face swapping, Video face, methods primarily focus, complex scenarios
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Graphics (cs.GR)
备注: project page: this https URL
点击查看摘要
Abstract:Video face swapping is becoming increasingly popular across various applications, yet existing methods primarily focus on static images and struggle with video face swapping because of temporal consistency and complex scenarios. In this paper, we present the first diffusion-based framework specifically designed for video face swapping. Our approach introduces a novel image-video hybrid training framework that leverages both abundant static image data and temporal video sequences, addressing the inherent limitations of video-only training. The framework incorporates a specially designed diffusion model coupled with a VidFaceVAE that effectively processes both types of data to better maintain temporal coherence of the generated videos. To further disentangle identity and pose features, we construct the Attribute-Identity Disentanglement Triplet (AIDT) Dataset, where each triplet has three face images, with two images sharing the same pose and two sharing the same identity. Enhanced with a comprehensive occlusion augmentation, this dataset also improves robustness against occlusions. Additionally, we integrate 3D reconstruction techniques as input conditioning to our network for handling large pose variations. Extensive experiments demonstrate that our framework achieves superior performance in identity preservation, temporal consistency, and visual quality compared to existing methods, while requiring fewer inference steps. Our approach effectively mitigates key challenges in video face swapping, including temporal flickering, identity preservation, and robustness to occlusions and pose variations.
zh
[CV-122] GaussianProperty: Integrating Physical Properties to 3D Gaussians with LMMs
【速读】: 该论文试图解决从视觉数据中估计材料物理属性的问题,这一任务在计算机视觉、图形学和机器人学中具有重要应用,如增强现实、物理仿真和机器人抓取。论文提出的解决方案关键在于引入了一个无需训练的框架GaussianProperty,通过将材料的物理属性分配给3D高斯分布来实现。具体来说,该框架结合了SAM的分割能力和GPT-4V(ision)的识别能力,构建了一个全局-局部物理属性推理模块,用于处理2D图像。随后,通过投票策略将多视角2D图像中的物理属性投影到3D高斯分布上。实验结果表明,带有物理属性标注的3D高斯分布在基于物理的动态仿真和机器人抓取应用中表现出色,特别是在动态仿真中利用了材料点法(MPM),在机器人抓取中开发了抓取力预测策略。
链接: https://arxiv.org/abs/2412.11258
作者: Xinli Xu,Wenhang Ge,Dicong Qiu,ZhiFei Chen,Dongyu Yan,Zhuoyun Liu,Haoyu Zhao,Hanfeng Zhao,Shunsi Zhang,Junwei Liang,Ying-Cong Chen
机构: HKUST(GZ); HKUST; Quwan
关键词: Estimating physical properties, Estimating physical, physical properties, computer vision, augmented reality
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注: 17 pages, 17 figures
点击查看摘要
Abstract:Estimating physical properties for visual data is a crucial task in computer vision, graphics, and robotics, underpinning applications such as augmented reality, physical simulation, and robotic grasping. However, this area remains under-explored due to the inherent ambiguities in physical property estimation. To address these challenges, we introduce GaussianProperty, a training-free framework that assigns physical properties of materials to 3D Gaussians. Specifically, we integrate the segmentation capability of SAM with the recognition capability of GPT-4V(ision) to formulate a global-local physical property reasoning module for 2D images. Then we project the physical properties from multi-view 2D images to 3D Gaussians using a voting strategy. We demonstrate that 3D Gaussians with physical property annotations enable applications in physics-based dynamic simulation and robotic grasping. For physics-based dynamic simulation, we leverage the Material Point Method (MPM) for realistic dynamic simulation. For robot grasping, we develop a grasping force prediction strategy that estimates a safe force range required for object grasping based on the estimated physical properties. Extensive experiments on material segmentation, physics-based dynamic simulation, and robotic grasping validate the effectiveness of our proposed method, highlighting its crucial role in understanding physical properties from visual data. Online demo, code, more cases and annotated datasets are available at this https URL.
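将多视角 2D 物理属性“投票”到 3D 高斯上,本质是对每个 3D 基元收集其在各可见视角中得到的属性标签并取多数票。下面给出一个与渲染器无关的多数投票草图;可见性与材质类别均为假设,并非 GaussianProperty 的完整投影流程。

```python
from collections import Counter

def vote_property(per_view_labels):
    """per_view_labels: 某个 3D 高斯在各可见视角下得到的属性标签列表。
    返回出现次数最多的标签(多数投票);不可见时返回 None。"""
    if not per_view_labels:
        return None
    return Counter(per_view_labels).most_common(1)[0][0]

if __name__ == "__main__":
    # 3 个高斯基元,每个在若干视角中由 2D 物理属性推理得到的材质标签(示意数据)
    gaussian_votes = [
        ["rubber", "rubber", "plastic"],
        ["metal", "metal", "metal", "wood"],
        [],                                     # 在所有视角中都不可见
    ]
    materials = [vote_property(v) for v in gaussian_votes]
    print("per-Gaussian material:", materials)   # 期望: ['rubber', 'metal', None]
```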
zh
[CV-123] Multimodal Class-aware Semantic Enhancement Network for Audio-Visual Video Parsing AAAI-2025
【速读】: 该论文试图解决音频-视觉视频解析任务中,由于每个片段可能包含多个事件,导致整体特征语义混杂,从而在模态内和跨模态交互中引发语义干扰的问题。解决方案的关键在于引入类感知特征解耦模块 (Class-Aware Feature Decoupling, CAFD),将语义混杂的特征明确解耦为不同的类别特征,包括多个事件特定特征和一个背景特征,从而避免无关类别的语义干扰。此外,论文还设计了细粒度语义增强模块,包括片段级事件共现建模 (Segment-wise Event Co-occurrence Modeling, SECM) 和局部-全局语义融合 (Local-Global Semantic Fusion, LGSF),分别用于捕捉同一时间戳内事件的类间依赖关系和增强每个片段的事件语义,进一步提升解析性能。
链接: https://arxiv.org/abs/2412.11248
作者: Pengcheng Zhao,Jinxing Zhou,Dan Guo,Yang Zhao,Yanxiang Chen
机构: 未知
关键词: Parsing task aims, task aims, aims to recognize, recognize and temporally, temporally localize
类目: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
备注: Accepted by AAAI-2025
点击查看摘要
Abstract:The Audio-Visual Video Parsing task aims to recognize and temporally localize all events occurring in either the audio or visual stream, or both. Capturing accurate event semantics for each audio/visual segment is vital. Prior works directly utilize the extracted holistic audio and visual features for intra- and cross-modal temporal interactions. However, each segment may contain multiple events, resulting in semantically mixed holistic features that can lead to semantic interference during intra- or cross-modal interactions: the event semantics of one segment may incorporate semantics of unrelated events from other segments. To address this issue, our method begins with a Class-Aware Feature Decoupling (CAFD) module, which explicitly decouples the semantically mixed features into distinct class-wise features, including multiple event-specific features and a dedicated background feature. The decoupled class-wise features enable our model to selectively aggregate useful semantics for each segment from clearly matched classes contained in other segments, preventing semantic interference from irrelevant classes. Specifically, we further design a Fine-Grained Semantic Enhancement module for encoding intra- and cross-modal relations. It comprises a Segment-wise Event Co-occurrence Modeling (SECM) block and a Local-Global Semantic Fusion (LGSF) block. The SECM exploits inter-class dependencies of concurrent events within the same timestamp with the aid of a new event co-occurrence loss. The LGSF further enhances the event semantics of each segment by incorporating relevant semantics from more informative global video features. Extensive experiments validate the effectiveness of the proposed modules and loss functions, resulting in a new state-of-the-art parsing performance.
zh
[CV-124] Volumetric Mapping with Panoptic Refinement via Kernel Density Estimation for Mobile Robots
【速读】: 该论文试图解决移动机器人在三维场景重建中遇到的语义分割质量问题,特别是在分布外场景下,分割掩码(masks)过度覆盖对象的情况。解决方案的关键在于利用非参数统计方法对分割错误进行精细化处理,通过将预测的掩码映射到深度帧中,利用核密度估计(kernel densities)来估计掩码的分布,并自适应地剔除深度感知中的异常值,从而提高掩码的精度。随后,使用投影符号距离函数(projective signed distance functions, SDFs)进行三维重建。该方法在合成数据集和实际机器人系统中均验证了其有效性。
链接: https://arxiv.org/abs/2412.11241
作者: Khang Nguyen,Tuan Dang,Manfred Huber
机构: University of Texas at Arlington, Department of Computer Science and Engineering, Learning and Adaptive Robotics Laboratory (德克萨斯大学阿灵顿分校,计算机科学与工程系,学习与自适应机器人实验室)
关键词: Reconstructing three-dimensional, robotic applications, semantic understanding, understanding is vital, Reconstructing
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Reconstructing three-dimensional (3D) scenes with semantic understanding is vital in many robotic applications. Robots need to identify which objects, along with their positions and shapes, to manipulate them precisely with given tasks. Mobile robots, especially, usually use lightweight networks to segment objects on RGB images and then localize them via depth maps; however, they often encounter out-of-distribution scenarios where masks over-cover the objects. In this paper, we address the problem of panoptic segmentation quality in 3D scene reconstruction by refining segmentation errors using non-parametric statistical methods. To enhance mask precision, we map the predicted masks into a depth frame to estimate their distribution via kernel densities. The outliers in depth perception are then rejected without the need for additional parameters in an adaptive manner to out-of-distribution scenarios, followed by 3D reconstruction using projective signed distance functions (SDFs). We validate our method on a synthetic dataset, which shows improvements in both quantitative and qualitative results for panoptic mapping. Through real-world testing, the results furthermore show our method’s capability to be deployed on a real-robot system. Our source code is available at: this https URL panoptic mapping.
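该方法的关键一步是:把预测掩码映射到深度帧,用核密度估计掩码内深度值的分布,并剔除密度过低的离群深度点(通常对应掩码误覆盖的背景)。下面用 numpy 手写一维高斯核密度估计并按密度阈值筛除离群点;带宽与阈值比例为示意性假设,并非论文的自适应参数设置。

```python
import numpy as np

def gaussian_kde_1d(samples, query, bandwidth=0.05):
    """一维高斯核密度估计:在 query 各点处评估 samples 的密度。"""
    diff = (query[:, None] - samples[None, :]) / bandwidth
    kernel = np.exp(-0.5 * diff ** 2) / np.sqrt(2 * np.pi)
    return kernel.mean(axis=1) / bandwidth

def reject_depth_outliers(mask_depths, bandwidth=0.05, keep_ratio=0.2):
    """剔除掩码内密度低于 keep_ratio * max_density 的深度点(视为误覆盖区域)。"""
    density = gaussian_kde_1d(mask_depths, mask_depths, bandwidth)
    keep = density >= keep_ratio * density.max()
    return mask_depths[keep], keep

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    object_depth = rng.normal(loc=1.0, scale=0.02, size=200)    # 物体表面深度
    background = rng.normal(loc=2.5, scale=0.02, size=10)       # 掩码误覆盖到的背景
    depths = np.concatenate([object_depth, background])
    kept, keep_mask = reject_depth_outliers(depths)
    print("kept ratio:", keep_mask.mean())       # 背景离群深度点被剔除
```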
zh
[CV-125] On the Generalizability of Iterative Patch Selection for Memory-Efficient High-Resolution Image Classification
【速读】: 该论文试图解决在大图像中识别小或微小感兴趣区域(ROI, Regions of Interest)时面临的计算和内存限制问题。解决方案的关键在于引入了一种基于迭代补丁选择(IPS, Iterative Patch Selection)的内存高效交叉注意力Transformer,并通过实验验证了其在低数据设置下对补丁尺寸的调整如何影响泛化性能。具体来说,IPS在低数据情况下通过缩小补丁尺寸相对于ROI的比例,显著提升了泛化能力,在Megapixel MNIST和Swedish traffic signs数据集上分别提高了15%和5%的性能。此外,论文还探讨了噪声成分与目标对象的相似性对IPS泛化能力的影响,进一步支持了先前的假设。
链接: https://arxiv.org/abs/2412.11237
作者: Max Riffi-Aslett,Christina Fell
机构: 未知
关键词: Classifying large images, Classifying large, megapixel MNIST, regions of interest, memory constraints
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 15 pages, submitted to Springer Nature, International Journal of Computer Vision
点击查看摘要
Abstract:Classifying large images with small or tiny regions of interest (ROI) is challenging due to computational and memory constraints. Weakly supervised memory-efficient patch selectors have achieved results comparable with strongly supervised methods. However, low signal-to-noise ratios and low entropy attention still cause overfitting. We explore these issues using a novel testbed on a memory-efficient cross-attention transformer with Iterative Patch Selection (IPS) as the patch selection module. Our testbed extends the megapixel MNIST benchmark to four smaller O2I (object-to-image) ratios ranging from 0.01% to 0.14% while keeping the canvas size fixed and introducing a noise generation component based on Bézier curves. Experimental results generalize the observations made on CNNs to IPS whereby the O2I threshold below which the classifier fails to generalize is affected by the training dataset size. We further observe that the magnitude of this interaction differs for each task of the Megapixel MNIST. For tasks “Maj” and “Top”, the rate is at its highest, followed by tasks “Max” and “Multi” where in the latter, this rate is almost at 0. Moreover, results show that in a low data setting, tuning the patch size to be smaller relative to the ROI improves generalization, resulting in an improvement of + 15% for the megapixel MNIST and + 5% for the Swedish traffic signs dataset compared to the original object-to-patch ratios in IPS. Further outcomes indicate that the similarity between the thickness of the noise component and the digits in the megapixel MNIST gradually causes IPS to fail to generalize, contributing to previous suspicions.
zh
[CV-126] Uni-AdaFocus: Spatial-temporal Dynamic Computation for Video Recognition CVPR2022 ECCV2022 ICCV2021
【速读】: 该论文旨在解决视频理解中的数据冗余问题,特别是空间冗余,以提高计算效率。解决方案的关键在于提出了一种名为AdaFocus的空间自适应视频识别方法,通过轻量级编码器快速处理全视频序列,并利用策略网络识别任务相关区域,随后由高容量深度网络对选定区域进行推理以完成最终预测。该方法不仅支持端到端训练,还通过扩展考虑时间冗余和样本冗余,形成了更全面的Uni-AdaFocus框架,显著提升了计算效率,并兼容现有的高效骨干网络(如TSM和X3D),从而在多个基准数据集和应用场景中表现出优越的性能。
链接: https://arxiv.org/abs/2412.11228
作者: Yulin Wang,Haoji Zhang,Yang Yue,Shiji Song,Chao Deng,Junlan Feng,Gao Huang
机构: Department of Automation, BNRist, Tsinghua University(自动化系,BNRist,清华大学); Beijing Academy of Artificial Intelligence(北京人工智能研究院); Shenzhen International Graduate School, Tsinghua University(深圳国际研究生院,清华大学); China Mobile Research Institute(中国移动研究院)
关键词: paper presents, aim to improve, improve computational efficiency, data redundancy, video understanding
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Accepted by IEEE TPAMI. Journal version of arXiv:2105.03245 (AdaFocusV1, ICCV 2021 Oral), arXiv:2112.14238 (AdaFocusV2, CVPR 2022), and arXiv:2209.13465 (AdaFocusV3, ECCV 2022). Code and pre-trained models: this https URL
点击查看摘要
Abstract:This paper presents a comprehensive exploration of the phenomenon of data redundancy in video understanding, with the aim to improve computational efficiency. Our investigation commences with an examination of spatial redundancy, which refers to the observation that the most informative region in each video frame usually corresponds to a small image patch, whose shape, size and location shift smoothly across frames. Motivated by this phenomenon, we formulate the patch localization problem as a dynamic decision task, and introduce a spatially adaptive video recognition approach, termed AdaFocus. Specifically, a lightweight encoder is first employed to quickly process the full video sequence, whose features are then utilized by a policy network to identify the most task-relevant regions. Subsequently, the selected patches are inferred by a high-capacity deep network for the final prediction. The full model can be trained end-to-end conveniently. Furthermore, AdaFocus can be extended by further considering temporal and sample-wise redundancies, i.e., allocating the majority of computation to the most task-relevant frames, and minimizing the computation spent on relatively “easier” videos. Our resulting approach, Uni-AdaFocus, establishes a comprehensive framework that seamlessly integrates spatial, temporal, and sample-wise dynamic computation, while it preserves the merits of AdaFocus in terms of efficient end-to-end training and hardware friendliness. In addition, Uni-AdaFocus is general and flexible as it is compatible with off-the-shelf efficient backbones (e.g., TSM and X3D), which can be readily deployed as our feature extractor, yielding a significantly improved computational efficiency. Empirically, extensive experiments based on seven benchmark datasets and three application scenarios substantiate that Uni-AdaFocus is considerably more efficient than the competitive baselines.
zh
[CV-127] GenLit: Reformulating Single-Image Relighting as Video Generation
【速读】: 该论文试图解决单张图像中的光照操控问题,传统方法依赖于逆向渲染技术,需要明确的3D资产重建和昂贵的光线追踪模拟。论文提出了一种新的解决方案,关键在于利用视频扩散模型(如Stable Video Diffusion, SVD)来理解物理世界并执行重新光照任务。具体来说,论文引入了GenLit框架,通过将图形引擎的光照操控能力提炼到视频生成模型中,使用户能够直接在给定图像的3D世界中插入和操控点光源,并生成相应的视频序列。该方法在仅使用少量合成数据集(270个对象)进行微调的情况下,能够推广到真实图像,实现具有真实光线追踪效果和投影阴影的单图像重新光照。这一解决方案展示了视频基础模型在捕捉光照、材质和形状信息方面的潜力,表明这些模型在无需明确的物理资产重建和复杂光线追踪的情况下,能够实现基于物理的渲染和可控的图像合成。
链接: https://arxiv.org/abs/2412.11224
作者: Shrisha Bharadwaj,Haiwen Feng,Victoria Abrevaya,Michael J. Black
机构: Max Planck Institute for Intelligent Systems, Tübingen, Germany(马克斯·普朗克智能系统研究所,蒂宾根,德国)
关键词: Manipulating the illumination, single image represents, represents a fundamental, fundamental challenge, challenge in computer
类目: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
备注:
点击查看摘要
Abstract:Manipulating the illumination within a single image represents a fundamental challenge in computer vision and graphics. This problem has been traditionally addressed using inverse rendering techniques, which require explicit 3D asset reconstruction and costly ray tracing simulations. Meanwhile, recent advancements in visual foundation models suggest that a new paradigm could soon be practical and possible – one that replaces explicit physical models with networks that are trained on massive amounts of image and video data. In this paper, we explore the potential of exploiting video diffusion models, and in particular Stable Video Diffusion (SVD), in understanding the physical world to perform relighting tasks given a single image. Specifically, we introduce GenLit, a framework that distills the ability of a graphics engine to perform light manipulation into a video generation model, enabling users to directly insert and manipulate a point light in the 3D world within a given image and generate the results directly as a video sequence. We find that a model fine-tuned on only a small synthetic dataset (270 objects) is able to generalize to real images, enabling single-image relighting with realistic ray tracing effects and cast shadows. These results reveal the ability of video foundation models to capture rich information about lighting, material, and shape. Our findings suggest that such models, with minimal training, can be used for physically-based rendering without explicit physically asset reconstruction and complex ray tracing. This further suggests the potential of such models for controllable and physically accurate image synthesis tasks.
zh
[CV-128] Distribution-Consistency-Guided Multi-modal Hashing
【速读】: 该论文试图解决多模态哈希方法在实际应用中由于标签噪声(noisy labels)导致的检索性能下降问题。解决方案的关键在于提出了一种基于分布一致性(distribution consistency)的新型多模态哈希方法,称为分布一致性引导的多模态哈希(Distribution-Consistency-Guided Multi-modal Hashing, DCGMH)。该方法通过实验发现标签中类别存在与否的1-0分布与哈希码相对于类别中心相似度的高低分布具有一致性,并利用这一模式来过滤和重构噪声标签。具体步骤包括随机初始化类别中心以计算相似度分布,通过分布一致性模式分离噪声和干净标签,并对过滤后的噪声标签进行校正,高置信度的标签进行修正,低置信度的标签视为未标记进行无监督学习,从而提升模型的检索性能。
链接: https://arxiv.org/abs/2412.11216
作者: Jin-Yu Liu,Xian-Ling Mao,Tian-Yi Che,Rong-Cheng Tu
机构: 未知
关键词: low storage requirements, Multi-modal hashing methods, gained popularity due, Multi-modal hashing, noisy labels
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
备注:
点击查看摘要
Abstract:Multi-modal hashing methods have gained popularity due to their fast speed and low storage requirements. Among them, the supervised methods demonstrate better performance by utilizing labels as supervisory signals compared with unsupervised methods. Currently, for almost all supervised multi-modal hashing methods, there is a hidden assumption that training sets have no noisy labels. However, labels are often annotated incorrectly due to manual labeling in real-world scenarios, which will greatly harm the retrieval performance. To address this issue, we first discover a significant distribution consistency pattern through experiments, i.e., the 1-0 distribution of the presence or absence of each category in the label is consistent with the high-low distribution of similarity scores of the hash codes relative to category centers. Then, inspired by this pattern, we propose a novel Distribution-Consistency-Guided Multi-modal Hashing (DCGMH), which aims to filter and reconstruct noisy labels to enhance retrieval performance. Specifically, the proposed method first randomly initializes several category centers, which are used to compute the high-low distribution of similarity scores; Noisy and clean labels are then separately filtered out via the discovered distribution consistency pattern to mitigate the impact of noisy labels; Subsequently, a correction strategy, which is indirectly designed via the distribution consistency pattern, is applied to the filtered noisy labels, correcting high-confidence ones while treating low-confidence ones as unlabeled for unsupervised learning, thereby further enhancing the model’s performance. Extensive experiments on three widely used datasets demonstrate the superiority of the proposed method compared to state-of-the-art baselines in multi-modal retrieval tasks. The code is available at this https URL.
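“分布一致性模式”指:标签中各类别的 1/0(存在/不存在)分布应与哈希码到各类别中心相似度的高/低分布一致。下面的 numpy 草图据此实现一个最简的噪声标签判别:若所有标注为 1 的类别的相似度都高于标注为 0 的类别,则判为干净标签,否则视为疑似噪声;该判别准则为简化假设,并非论文完整的过滤与校正策略。

```python
import numpy as np

def cosine_sim(a, b):
    a = a / (np.linalg.norm(a, axis=-1, keepdims=True) + 1e-8)
    b = b / (np.linalg.norm(b, axis=-1, keepdims=True) + 1e-8)
    return a @ b.T

def filter_noisy_labels(hash_codes, labels, centers):
    """hash_codes: (N, D);labels: (N, C) 的 0/1 多标签;centers: (C, D) 类别中心。
    标注为 1 的类别相似度均高于标注为 0 的类别时判为干净,否则判为疑似噪声。"""
    sims = cosine_sim(hash_codes, centers)            # (N, C)
    clean = np.ones(len(labels), dtype=bool)
    for i, (s, y) in enumerate(zip(sims, labels)):
        pos, neg = s[y == 1], s[y == 0]
        if len(pos) and len(neg) and pos.min() <= neg.max():
            clean[i] = False                          # 高/低分布与 1/0 分布不一致
    return clean

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    centers = rng.normal(size=(4, 32))                # 4 个随机初始化的类别中心
    codes = np.stack([centers[0] + 0.1 * rng.normal(size=32),
                      centers[2] + 0.1 * rng.normal(size=32)])
    labels = np.array([[1, 0, 0, 0],                  # 与分布一致的干净标签
                       [1, 0, 0, 0]])                 # 实际接近类 2,疑似噪声标签
    print("clean mask:", filter_noisy_labels(codes, labels, centers))
```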
zh
[CV-129] Image Forgery Localization with State Space Models
【速读】: 该论文试图解决图像篡改定位中像素依赖建模的问题,现有方法主要依赖卷积神经网络 (CNN) 或基于 Transformer 的模型,这些方法要么缺乏足够的感受野,要么计算开销过大。论文提出的解决方案 LoMa 的关键在于结合了全局像素依赖建模的 Selective State Space (S6) 模型和局部像素依赖建模的倒置残差 CNN,并通过引入 Mixed-SSM Block 实现高效的全局依赖提取。该模块首先使用空洞选择性扫描遍历空间域,将篡改图像转换为有序的图像块序列,然后应用多方向的 S6 建模,同时通过辅助卷积分支增强局部特征提取。最终,通过简单的 MLP 解码器获得像素级的篡改定位结果。
链接: https://arxiv.org/abs/2412.11214
作者: Zijie Lou,Gang Cao
机构: School of Computer and Cyber Sciences, Communication University of China(中国传媒大学计算机与网络安全学院); State Key Laboratory of Media Convergence and Communication, Communication University of China(中国传媒大学媒体融合与传播国家重点实验室)
关键词: Pixel dependency modeling, Pixel dependency, image forgery localization, Selective State Space, dependency modeling
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Pixel dependency modeling from tampered images is pivotal for image forgery localization. Current approaches predominantly rely on convolutional neural network (CNN) or Transformer-based models, which often either lack sufficient receptive fields or entail significant computational overheads. In this paper, we propose LoMa, a novel image forgery localization method that leverages the Selective State Space (S6) model for global pixel dependency modeling and inverted residual CNN for local pixel dependency modeling. Our method introduces the Mixed-SSM Block, which initially employs atrous selective scan to traverse the spatial domain and convert the tampered image into order patch sequences, and subsequently applies multidirectional S6 modeling. In addition, an auxiliary convolutional branch is introduced to enhance local feature extraction. This design facilitates the efficient extraction of global dependencies while upholding linear complexity. Upon modeling the pixel dependency with the SSM and CNN blocks, the pixel-wise forgery localization results are obtained by a simple MLP decoder. Extensive experimental results validate the superiority of LoMa over CNN-based and Transformer-based state-of-the-arts.
zh
[CV-130] ViPOcc: Leveraging Visual Priors from Vision Foundation Models for Single-View 3D Occupancy Prediction AAAI25
【速读】: 该论文试图解决从单张图像推断场景的三维结构这一病态且具有挑战性的问题,特别是在以视觉为中心的自动驾驶领域。解决方案的关键在于提出了ViPOcc方法,该方法利用视觉基础模型(Vision Foundation Models, VFMs)的视觉先验进行细粒度的三维占据预测。与以往仅使用体渲染(volume rendering)进行RGB和深度图像重建的方法不同,ViPOcc引入了一个度量深度估计分支,并通过逆深度对齐模块(inverse depth alignment module)来弥合VFMs预测与真实深度分布之间的领域差距。此外,ViPOcc还提出了一个语义引导的非重叠高斯混合采样器(semantic-guided non-overlapping Gaussian mixture sampler),用于高效的实例感知光线采样,解决了现有最先进方法中存在的冗余和不平衡采样问题。这些创新使得ViPOcc在KITTI-360和KITTI Raw数据集上的三维占据预测和深度估计任务中表现出色。
链接: https://arxiv.org/abs/2412.11210
作者: Yi Feng,Yu Han,Xijing Zhang,Tanghui Li,Yanting Zhang,Rui Fan
机构: Tongji University(同济大学); National Key Laboratory of Human-Machine Hybrid Augmented Intelligence(人机混合增强智能国家重点实验室); Donghua University(东华大学)
关键词: vision-centric autonomous driving, occupancy prediction, autonomous driving, ill-posed and challenging, challenging problem
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: accepted to AAAI25
点击查看摘要
Abstract:Inferring the 3D structure of a scene from a single image is an ill-posed and challenging problem in the field of vision-centric autonomous driving. Existing methods usually employ neural radiance fields to produce voxelized 3D occupancy, lacking instance-level semantic reasoning and temporal photometric consistency. In this paper, we propose ViPOcc, which leverages the visual priors from vision foundation models (VFMs) for fine-grained 3D occupancy prediction. Unlike previous works that solely employ volume rendering for RGB and depth image reconstruction, we introduce a metric depth estimation branch, in which an inverse depth alignment module is proposed to bridge the domain gap in depth distribution between VFM predictions and the ground truth. The recovered metric depth is then utilized in temporal photometric alignment and spatial geometric alignment to ensure accurate and consistent 3D occupancy prediction. Additionally, we also propose a semantic-guided non-overlapping Gaussian mixture sampler for efficient, instance-aware ray sampling, which addresses the redundant and imbalanced sampling issue that still exists in previous state-of-the-art methods. Extensive experiments demonstrate the superior performance of ViPOcc in both 3D occupancy prediction and depth estimation tasks on the KITTI-360 and KITTI Raw datasets. Our code is available at: \urlthis https URL.
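“逆深度对齐”可理解为:视觉基础模型输出的相对深度与真实度量深度之间存在尺度/平移偏差,可在逆深度空间用最小二乘拟合 scale 与 shift 来对齐(MiDaS 式做法)。下面的 numpy 草图只演示这一通用思想,并非 ViPOcc 中逆深度对齐模块的具体实现。

```python
import numpy as np

def align_inverse_depth(pred_rel_depth, gt_metric_depth, eps=1e-6):
    """在逆深度空间求解 scale、shift,使 scale * (1/pred) + shift ≈ 1/gt,
    再换算回对齐后的度量深度。"""
    inv_pred = 1.0 / (pred_rel_depth.reshape(-1) + eps)
    inv_gt = 1.0 / (gt_metric_depth.reshape(-1) + eps)
    A = np.stack([inv_pred, np.ones_like(inv_pred)], axis=1)   # (N, 2) 设计矩阵
    (scale, shift), *_ = np.linalg.lstsq(A, inv_gt, rcond=None)
    aligned_inv = scale * (1.0 / (pred_rel_depth + eps)) + shift
    return 1.0 / np.clip(aligned_inv, eps, None), scale, shift

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    gt = rng.uniform(2.0, 30.0, size=(8, 8))           # 真实度量深度
    # 模拟 VFM 输出:与真实逆深度只差一个线性变换的相对深度
    pred_rel = 1.0 / (0.5 * (1.0 / gt) + 0.02)
    aligned, s, b = align_inverse_depth(pred_rel, gt)
    print("recovered scale/shift:", round(float(s), 3), round(float(b), 3))
    print("max abs depth error  :", np.abs(aligned - gt).max())
```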
zh
[CV-131] GEM: A Generalizable Ego-Vision Multimodal World Model for Fine-Grained Ego-Motion Object Dynamics and Scene Composition Control
【速读】: 该论文试图解决在多模态数据(如RGB图像、深度图、人体姿态和自我轨迹)中预测未来帧的问题,关键解决方案是提出了一个可泛化的自我视觉多模态世界模型GEM。GEM通过参考帧、稀疏特征、人体姿态和自我轨迹来精确控制物体动态、自我运动和人体姿态,并生成配对的RGB和深度输出以增强空间理解。论文引入了自回归噪声调度(autoregressive noise schedules)以实现稳定的长时间生成,并通过伪标签技术获取深度图、自我轨迹和人体姿态。实验表明,GEM在生成多样性、可控性和长时间生成的时间一致性方面表现优异。
链接: https://arxiv.org/abs/2412.11198
作者: Mariam Hassan,Sebastian Stapf,Ahmad Rahimi,Pedro M B Rezende,Yasaman Haghighi,David Brüggemann,Isinsu Katircioglu,Lin Zhang,Xiaoran Chen,Suman Saha,Marco Cannici,Elie Aljalbout,Botao Ye,Xi Wang,Aram Davtyan,Mathieu Salzmann,Davide Scaramuzza,Marc Pollefeys,Paolo Favaro,Alexandre Alahi
机构: École Polytechnique Fédérale de Lausanne (EPFL); University of Bern; Swiss Data Science Center; University of Zurich; ETH Zurich
关键词: Generalizable Ego-vision Multimodal, Ego-vision Multimodal world, Generalizable Ego-vision, predicts future frames, human poses
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:We present GEM, a Generalizable Ego-vision Multimodal world model that predicts future frames using a reference frame, sparse features, human poses, and ego-trajectories. Hence, our model has precise control over object dynamics, ego-agent motion and human poses. GEM generates paired RGB and depth outputs for richer spatial understanding. We introduce autoregressive noise schedules to enable stable long-horizon generations. Our dataset is comprised of 4000+ hours of multimodal data across domains like autonomous driving, egocentric human activities, and drone flights. Pseudo-labels are used to get depth maps, ego-trajectories, and human poses. We use a comprehensive evaluation framework, including a new Control of Object Manipulation (COM) metric, to assess controllability. Experiments show GEM excels at generating diverse, controllable scenarios and temporal consistency over long generations. Code, models, and datasets are fully open-sourced.
zh
[CV-132] Light-T2M: A Lightweight and Fast Model for Text-to-motion Generation AAAI2025
【速读】: 该论文旨在解决文本到动作生成 (Text-to-Motion, T2M) 任务中模型参数过多和推理速度慢导致的高使用成本问题。解决方案的关键在于设计一个轻量级模型,具体包括以下几个创新点:首先,通过重新考虑人体运动的内在属性,强调局部信息建模的重要性,提出了轻量级的局部信息建模模块 (Local Information Modeling Module);其次,引入Mamba算法以减少参数数量和GPU内存需求,并通过设计伪双向扫描 (Pseudo-bidirectional Scan) 来模拟双向扫描效果而不增加参数数量;最后,提出自适应文本信息注入器 (Adaptive Textual Information Injector),更有效地将文本信息整合到生成的动作中。这些设计共同构成了轻量且快速的Light-T2M模型,显著减少了参数数量和推理时间,同时在性能上超越了现有的最先进方法MoMask。
链接: https://arxiv.org/abs/2412.11193
作者: Ling-An Zeng,Guohong Huang,Gaojie Wu,Wei-Shi Zheng
机构: 未知
关键词: high usage costs, local information modeling, current methods involve, slow inference speeds, reduce usage costs
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to AAAI 2025
点击查看摘要
Abstract:Despite the significant role text-to-motion (T2M) generation plays across various applications, current methods involve a large number of parameters and suffer from slow inference speeds, leading to high usage costs. To address this, we aim to design a lightweight model to reduce usage costs. First, unlike existing works that focus solely on global information modeling, we recognize the importance of local information modeling in the T2M task by reconsidering the intrinsic properties of human motion, leading us to propose a lightweight Local Information Modeling Module. Second, we introduce Mamba to the T2M task, reducing the number of parameters and GPU memory demands, and we have designed a novel Pseudo-bidirectional Scan to replicate the effects of a bidirectional scan without increasing parameter count. Moreover, we propose a novel Adaptive Textual Information Injector that more effectively integrates textual information into the motion during generation. By integrating the aforementioned designs, we propose a lightweight and fast model named Light-T2M. Compared to the state-of-the-art method, MoMask, our Light-T2M model features just 10% of the parameters (4.48M vs 44.85M) and achieves a 16% faster inference time (0.152s vs 0.180s), while surpassing MoMask with an FID of 0.040 (vs. 0.045) on HumanML3D dataset and 0.161 (vs. 0.228) on KIT-ML dataset. The code is available at this https URL.
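“伪双向扫描”的思路是:不新增参数,用同一个前向扫描模块分别处理原序列与时间翻转后的序列,再把反向结果翻转回来与正向结果融合,从而近似双向扫描。下面用一个极简的因果累积扫描代替真实的 Mamba/S6 扫描做 numpy 示意;扫描形式与融合方式均为假设。

```python
import numpy as np

def causal_scan(x, decay=0.9):
    """极简的因果扫描:h_t = decay * h_{t-1} + x_t(代替真实的 SSM 扫描)。"""
    h = np.zeros_like(x[0])
    out = []
    for x_t in x:
        h = decay * h + x_t
        out.append(h.copy())
    return np.stack(out)

def pseudo_bidirectional_scan(x):
    """用同一个扫描函数处理正向与翻转序列,实现“伪双向”且不新增参数。"""
    forward = causal_scan(x)
    backward = causal_scan(x[::-1])[::-1]       # 翻转输入、扫描、再翻转回原顺序
    return 0.5 * (forward + backward)           # 简单平均融合两个方向

if __name__ == "__main__":
    T, D = 6, 4
    x = np.random.default_rng(0).normal(size=(T, D))
    y = pseudo_bidirectional_scan(x)
    print("output shape:", y.shape)             # (6, 4),每个时刻同时携带前后文信息
```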
zh
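针对上文 Light-T2M 中"伪双向扫描 (Pseudo-bidirectional Scan)"的思路,下面给出一个极简的 PyTorch 草图:对同一序列分别做正向扫描和翻转后的反向扫描,两个方向共享同一组扫描参数,再融合输出,从而在不额外增加扫描参数的情况下近似双向建模。这里用单向 GRU 作为 Mamba 块的占位,`PseudoBiScan` 等命名均为示意性假设,并非论文官方实现。

```python
import torch
import torch.nn as nn

class PseudoBiScan(nn.Module):
    """伪双向扫描示意:正向与反向共享同一个单向扫描模块。"""
    def __init__(self, dim: int):
        super().__init__()
        # 用单向 GRU 占位;论文中此处应为 Mamba 块
        self.scan = nn.GRU(dim, dim, batch_first=True)
        self.fuse = nn.Linear(2 * dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, dim)
        fwd, _ = self.scan(x)                         # 正向扫描
        bwd, _ = self.scan(torch.flip(x, dims=[1]))   # 翻转序列后用同一模块扫描
        bwd = torch.flip(bwd, dims=[1])               # 输出翻回原始顺序
        return self.fuse(torch.cat([fwd, bwd], dim=-1))

if __name__ == "__main__":
    motion = torch.randn(2, 196, 64)  # (batch, 帧数, 特征维度)
    out = PseudoBiScan(64)(motion)
    print(out.shape)  # torch.Size([2, 196, 64])
```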
[CV-133] Efficient Quantization-Aware Training on Segment Anything Model in Medical Images and Its Deployment
【速读】: 该论文试图解决MedSAM模型在推理过程中对计算资源需求过高的问题。解决方案的关键在于引入一种量化感知训练(quantization-aware training)流程,通过该流程对Segment Anything Model进行高效量化,并利用OpenVINO推理引擎进行部署。这一方法不仅优化了训练时间和磁盘存储,还在保持可接受准确度的同时显著提升了处理速度。
链接: https://arxiv.org/abs/2412.11186
作者: Haisheng Lu,Yujie Fu,Fan Zhang,Le Zhang
机构: 未知
关键词: Medical image segmentation, clinical practice, advanced this field, critical component, component of clinical
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 14 pages, 3 figures, to be published in LNCS
点击查看摘要
Abstract:Medical image segmentation is a critical component of clinical practice, and the state-of-the-art MedSAM model has significantly advanced this field. Nevertheless, critiques highlight that MedSAM demands substantial computational resources during inference. To address this issue, the CVPR 2024 MedSAM on Laptop Challenge was established to find an optimal balance between accuracy and processing speed. In this paper, we introduce a quantization-aware training pipeline designed to efficiently quantize the Segment Anything Model for medical images and deploy it using the OpenVINO inference engine. This pipeline optimizes both training time and disk storage. Our experimental results confirm that this approach considerably enhances processing speed over the baseline, while still achieving an acceptable accuracy level. The training script, inference script, and quantized model are publicly accessible at this https URL.
zh
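上文的核心是量化感知训练 (QAT):在训练前向中对权重做"伪量化"(量化-反量化),反向用直通估计传梯度,使模型提前适应低比特表示。下面是这一思想的极简草图,其中 `FakeQuant`、比特数等均为示意性假设,与论文针对 MedSAM 的具体流程及 OpenVINO 部署细节无关。

```python
import torch
import torch.nn as nn

class FakeQuant(nn.Module):
    """对称均匀伪量化:前向做量化-反量化,反向按直通估计传梯度。"""
    def __init__(self, num_bits: int = 8):
        super().__init__()
        self.qmax = 2 ** (num_bits - 1) - 1

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        scale = x.detach().abs().max().clamp(min=1e-8) / self.qmax
        q = torch.clamp(torch.round(x / scale), -self.qmax, self.qmax) * scale
        # 直通估计:前向用量化值,反向梯度按恒等映射传回全精度权重
        return x + (q - x).detach()

class QATLinear(nn.Module):
    """在普通线性层上插入权重伪量化的示意封装。"""
    def __init__(self, in_dim: int, out_dim: int, num_bits: int = 8):
        super().__init__()
        self.linear = nn.Linear(in_dim, out_dim)
        self.wq = FakeQuant(num_bits)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        w = self.wq(self.linear.weight)
        return nn.functional.linear(x, w, self.linear.bias)

if __name__ == "__main__":
    layer = QATLinear(256, 128)
    loss = layer(torch.randn(4, 256)).sum()
    loss.backward()  # 梯度可正常回传到全精度权重
    print(layer.linear.weight.grad.shape)
```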
[CV-134] OccScene: Semantic Occupancy-based Cross-task Mutual Learning for 3D Scene Generation
【速读】: 该论文试图解决现有方法在3D场景生成和感知任务中通常分离这两个过程的问题,提出了一种新的互学习范式OccScene。其关键在于通过一个统一的框架,将细粒度的3D感知和高品质的生成任务结合起来,实现跨任务的双赢效果。具体来说,OccScene利用联合训练的扩散框架,通过文本提示生成新的、一致的3D真实场景,并引入基于Mamba的双重对齐模块(Mamba-based Dual Alignment module),将细粒度的语义和几何信息作为感知先验,以对齐占用信息和扩散潜在空间。这种设计使得感知模块能够通过定制化和多样化的生成场景得到有效提升,同时感知先验反过来增强生成性能,从而实现互惠互利。
链接: https://arxiv.org/abs/2412.11183
作者: Bohan Li,Xin Jin,Jianan Wang,Yukai Shi,Yasheng Sun,Xiaofeng Wang,Zhuang Ma,Baao Xie,Chao Ma,Xiaokang Yang,Wenjun Zeng
机构: School of Electronic Information and Electrical Engineering, Shanghai Jiao Tong University, Shanghai, China; Ningbo Institute of Digital Twin, Eastern Institute of Technology, Ningbo, China; Astribot, Shenzhen, China; PhiGent Robotics, Beijing, China
关键词: Recent diffusion models, demonstrated remarkable performance, Recent diffusion, demonstrated remarkable, perception
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Recent diffusion models have demonstrated remarkable performance in both 3D scene generation and perception tasks. Nevertheless, existing methods typically separate these two processes, acting as a data augmenter to generate synthetic data for downstream perception tasks. In this work, we propose OccScene, a novel mutual learning paradigm that integrates fine-grained 3D perception and high-quality generation in a unified framework, achieving a cross-task win-win effect. OccScene generates new and consistent 3D realistic scenes only depending on text prompts, guided with semantic occupancy in a joint-training diffusion framework. To align the occupancy with the diffusion latent, a Mamba-based Dual Alignment module is introduced to incorporate fine-grained semantics and geometry as perception priors. Within OccScene, the perception module can be effectively improved with customized and diverse generated scenes, while the perception priors in return enhance the generation performance for mutual benefits. Extensive experiments show that OccScene achieves realistic 3D scene generation in broad indoor and outdoor scenarios, while concurrently boosting the perception models to achieve substantial performance improvements in the 3D perception task of semantic occupancy prediction.
zh
[CV-135] Benchmarking and Learning Multi-Dimensional Quality Evaluator for Text-to-3D Generation
【速读】: 该论文试图解决文本到3D生成(Text-to-3D generation)评估中的两个主要问题:一是现有基准测试缺乏对不同提示类别和评估维度的细粒度评估;二是之前的评估指标仅关注单一维度(如文本-3D对齐),无法进行多维度的质量评估。为解决这些问题,论文提出了一个名为MATE-3D的综合基准测试,包含八个设计良好的提示类别,涵盖单对象和多对象生成,并生成了1,280个带纹理的网格。通过大规模主观实验,从四个评估维度收集了107,520条注释,并进行了详细分析。基于MATE-3D,论文进一步提出了一个名为HyperScore的新型质量评估器,利用超网络(hypernetwork)为每个评估维度生成特定的映射函数,从而实现多维度的质量评估。HyperScore在MATE-3D上表现出优于现有指标的性能,成为评估和改进文本到3D生成任务的有力工具。
链接: https://arxiv.org/abs/2412.11170
作者: Yujie Zhang,Bingyang Cui,Qi Yang,Zhu Li,Yiling Xu
机构: Shanghai Jiao Tong University(上海交通大学); University of Missouri-Kansas City(密苏里大学堪萨斯城分校)
关键词: achieved remarkable progress, methods remains challenging, lack fine-grained evaluation, benchmarks lack fine-grained, recent years
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Text-to-3D generation has achieved remarkable progress in recent years, yet evaluating these methods remains challenging for two reasons: i) Existing benchmarks lack fine-grained evaluation on different prompt categories and evaluation dimensions. ii) Previous evaluation metrics only focus on a single aspect (e.g., text-3D alignment) and fail to perform multi-dimensional quality assessment. To address these problems, we first propose a comprehensive benchmark named MATE-3D. The benchmark contains eight well-designed prompt categories that cover single and multiple object generation, resulting in 1,280 generated textured meshes. We have conducted a large-scale subjective experiment from four different evaluation dimensions and collected 107,520 annotations, followed by detailed analyses of the results. Based on MATE-3D, we propose a novel quality evaluator named HyperScore. Utilizing hypernetwork to generate specified mapping functions for each evaluation dimension, our metric can effectively perform multi-dimensional quality assessment. HyperScore presents superior performance over existing metrics on MATE-3D, making it a promising metric for assessing and improving text-to-3D generation. The project is available at this https URL.
zh
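HyperScore 的关键是用超网络为每个评估维度生成专属的映射函数。下面用 PyTorch 勾勒这一机制的基本形态:超网络以维度嵌入为输入,输出一个小型线性打分头的权重与偏置,再用它把被评估 3D 资产的特征映射为该维度的质量分数。维度数量、特征维度等均为假设值,仅作示意,并不代表 HyperScore 的真实结构。

```python
import torch
import torch.nn as nn

class DimensionHyperScorer(nn.Module):
    """超网络:为每个评估维度生成一个 (feat_dim -> 1) 的线性打分头。"""
    def __init__(self, feat_dim: int = 512, num_dims: int = 4, emb_dim: int = 64):
        super().__init__()
        self.dim_emb = nn.Embedding(num_dims, emb_dim)   # 每个评估维度一个嵌入
        self.hyper = nn.Sequential(                      # 由嵌入生成权重和偏置
            nn.Linear(emb_dim, 256), nn.ReLU(),
            nn.Linear(256, feat_dim + 1),
        )

    def forward(self, feat: torch.Tensor) -> torch.Tensor:
        # feat: (batch, feat_dim),由被评估的 3D 资产提取的特征
        params = self.hyper(self.dim_emb.weight)         # (num_dims, feat_dim + 1)
        w, b = params[:, :-1], params[:, -1]             # 各维度打分头的权重与偏置
        return feat @ w.t() + b                          # (batch, num_dims) 每维一个分数

if __name__ == "__main__":
    scores = DimensionHyperScorer()(torch.randn(8, 512))
    print(scores.shape)  # torch.Size([8, 4]),对应 4 个评估维度
```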
[CV-136] OTLRM: Orthogonal Learning-based Low-Rank Metric for Multi-Dimensional Inverse Problems AAAI2025
【速读】: 该论文试图解决现有张量奇异值分解 (t-SVD) 方法在定义张量核范数 (TNN) 时依赖手工设计或预设变换,导致缺乏灵活性的问题。解决方案的关键在于引入一种基于可学习正交变换的数据驱动生成式低秩 t-SVD 模型。该模型通过构建适应神经网络的内生正交矩阵,利用Householder变换的线性代数定理实现正交变换,并提出一种低秩求解器作为SVT的泛化,利用生成网络的高效表示来获取低秩结构。这一方法不仅解决了传统SVT在深度神经网络中因数值不稳定性导致的求导问题,还显著提升了恢复效果。
链接: https://arxiv.org/abs/2412.11165
作者: Xiangming Wang,Haijin Zeng,Jiaoyang Chen,Sheng Liu,Yongyong Chen,Guoqing Chao
机构: 未知
关键词: multi-frame videos inherently, videos inherently exhibit, inherently exhibit robust, exhibit robust low-rank, multispectral image denoising
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: AAAI 2025
点击查看摘要
Abstract:In real-world scenarios, complex data such as multispectral images and multi-frame videos inherently exhibit robust low-rank property. This property is vital for multi-dimensional inverse problems, such as tensor completion, spectral imaging reconstruction, and multispectral image denoising. Existing tensor singular value decomposition (t-SVD) definitions rely on hand-designed or pre-given transforms, which lack flexibility for defining tensor nuclear norm (TNN). The TNN-regularized optimization problem is solved by the singular value thresholding (SVT) operator, which leverages the t-SVD framework to obtain the low-rank tensor. However, it is quite complicated to introduce SVT into deep neural networks due to the numerical instability problem in solving the derivatives of the eigenvectors. In this paper, we introduce a novel data-driven generative low-rank t-SVD model based on the learnable orthogonal transform, which can be naturally solved under its representation. Prompted by the linear algebra theorem of the Householder transformation, our learnable orthogonal transform is achieved by constructing an endogenously orthogonal matrix adaptable to neural networks, optimizing it as arbitrary orthogonal matrices. Additionally, we propose a low-rank solver as a generalization of SVT, which utilizes an efficient representation of generative networks to obtain low-rank structures. Extensive experiments highlight its significant restoration enhancements.
zh
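上文提到利用 Householder 变换构造"内生正交"的可学习矩阵。下面的小例子说明这一构造:每个反射 H_i = I - 2 v_i v_iᵀ/‖v_i‖² 都是正交阵,若干反射连乘仍为正交阵,因此无论参数 v_i 如何被梯度更新,得到的矩阵始终满足 QᵀQ = I。反射个数与维度为假设值,仅展示构造思路,并非论文完整的 t-SVD 低秩求解器。

```python
import torch
import torch.nn as nn

class HouseholderOrthogonal(nn.Module):
    """由 k 个可学习 Householder 反射连乘得到的 n x n 正交矩阵。"""
    def __init__(self, n: int, k: int = 8):
        super().__init__()
        self.vs = nn.Parameter(torch.randn(k, n))  # 每行是一个反射向量 v_i

    def forward(self) -> torch.Tensor:
        n = self.vs.shape[1]
        Q = torch.eye(n, device=self.vs.device)
        for v in self.vs:
            v = v / (v.norm() + 1e-8)
            H = torch.eye(n, device=v.device) - 2.0 * torch.outer(v, v)
            Q = H @ Q                              # 正交阵连乘仍为正交阵
        return Q

if __name__ == "__main__":
    Q = HouseholderOrthogonal(n=16, k=8)()
    err = (Q.t() @ Q - torch.eye(16)).abs().max()
    print(f"||Q^T Q - I||_max = {err.item():.2e}")  # 数值上接近 0
```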
[CV-137] Why and How: Knowledge-Guided Learning for Cross-Spectral Image Patch Matching
【速读】: 该论文试图解决现有跨光谱图像块匹配方法中存在的性能瓶颈问题。解决方案的关键在于构建一个稳定的桥接机制,将描述子学习(descriptor learning)与基于特征差异学习的度量学习(metric learning based on feature difference learning)相结合。具体来说,论文发现这两种学习方式在特征提取上具有一致性,从而为桥接提供了理论基础。为了确保桥接的稳定性和效率,论文深入探索了20种组合网络架构,并构建了特征引导损失(feature-guided loss)以实现特征的相互引导。此外,论文还提出了度量学习硬负样本挖掘策略(HNSM-M),以增强度量分支的特征映射能力,这是首次在度量网络中实现硬负样本挖掘,并带来了显著的性能提升。最终,论文提出的知识引导学习网络(KGL-Net)在三种不同的跨光谱图像块匹配场景中达到了最先进的性能。
链接: https://arxiv.org/abs/2412.11161
作者: Chuang Yu,Yunpeng Liu,Jinmiao Zhao,Xiangyu Yue
机构: 未知
关键词: learning, Recently, feature relation learning, learning based, metric learning
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Recently, cross-spectral image patch matching based on feature relation learning has attracted extensive attention. However, performance bottleneck problems have gradually emerged in existing methods. To address this challenge, we make the first attempt to explore a stable and efficient bridge between descriptor learning and metric learning, and construct a knowledge-guided learning network (KGL-Net), which achieves amazing performance improvements while abandoning complex network structures. Specifically, we find that there is feature extraction consistency between metric learning based on feature difference learning and descriptor learning based on Euclidean distance. This provides the foundation for bridge building. To ensure the stability and efficiency of the constructed bridge, on the one hand, we conduct an in-depth exploration of 20 combined network architectures. On the other hand, a feature-guided loss is constructed to achieve mutual guidance of features. In addition, unlike existing methods, we consider that the feature mapping ability of the metric branch should receive more attention. Therefore, a hard negative sample mining for metric learning (HNSM-M) strategy is constructed. To the best of our knowledge, this is the first time that hard negative sample mining for metric networks has been implemented and brings significant performance gains. Extensive experimental results show that our KGL-Net achieves SOTA performance in three different cross-spectral image patch matching scenarios. Our code is available at this https URL.
zh
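论文中的 HNSM-M 是面向度量分支的难负样本挖掘策略。作为背景,下面给出度量学习中常见的"批内最难样本挖掘"triplet 损失草图:对每个锚点,在批内取距离最远的同类样本与距离最近的异类样本参与损失。挖掘准则与损失形式为通用写法,并非 HNSM-M 的精确定义。

```python
import torch

def batch_hard_triplet_loss(emb: torch.Tensor, labels: torch.Tensor,
                            margin: float = 0.3) -> torch.Tensor:
    """批内难样本挖掘的 triplet 损失:最难正样本 + 最难负样本。"""
    dist = torch.cdist(emb, emb)                          # (B, B) 两两欧氏距离
    same = labels.unsqueeze(0) == labels.unsqueeze(1)     # 同类掩码
    eye = torch.eye(len(labels), dtype=torch.bool, device=emb.device)

    # 最难正样本:同类中距离最大者(排除自身)
    pos_dist = dist.masked_fill(~same | eye, float("-inf")).max(dim=1).values
    # 最难负样本:异类中距离最小者
    neg_dist = dist.masked_fill(same, float("inf")).min(dim=1).values

    return torch.relu(pos_dist - neg_dist + margin).mean()

if __name__ == "__main__":
    emb = torch.nn.functional.normalize(torch.randn(16, 128), dim=1)
    labels = torch.randint(0, 4, (16,))
    print(batch_hard_triplet_loss(emb, labels))
```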
[CV-138] From Easy to Hard: Progressive Active Learning Framework for Infrared Small Target Detection with Single Point Supervision
【速读】: 该论文试图解决单帧红外小目标检测(single-frame infrared small target detection)中使用单点监督(single point supervision)的标签演化框架(LESPS)存在的不稳定性、过度标签演化和难以发挥嵌入网络性能的问题。解决方案的关键在于构建了一个渐进主动学习(Progressive Active Learning, PAL)框架,该框架通过逐步主动识别和学习更多难样本(hard samples)来实现持续的性能提升。具体措施包括引入模型预启动概念(model pre-start concept),用于选择部分简单样本以帮助模型具备基本的任务特定学习能力;以及提出精细的双重更新策略(refined dual-update strategy),促进对更难样本的合理学习和伪标签的持续优化。此外,通过合理引入衰减因子(decay factor),缓解了过度标签演化的风险,实现了目标标注扩展与收缩的动态平衡。实验结果表明,配备PAL框架的卷积神经网络(CNNs)在多个公开数据集上达到了最先进(SOTA)水平,并有效连接了全监督和点监督任务。
链接: https://arxiv.org/abs/2412.11154
作者: Chuang Yu,Jinmiao Zhao,Yunpeng Liu,Sicheng Zhao,Xiangyu Yue
机构: 未知
关键词: single-frame infrared small, drawn wide-spread attention, single point supervision, infrared small target, Progressive Active Learning
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Recently, single-frame infrared small target (SIRST) detection with single point supervision has drawn widespread attention. However, the latest label evolution with single point supervision (LESPS) framework suffers from instability, excessive label evolution, and difficulty in exerting embedded network performance. Therefore, we construct a Progressive Active Learning (PAL) framework. Specifically, inspired by organisms gradually adapting to their environment and continuously accumulating knowledge, we propose an innovative progressive active learning idea, which emphasizes that the network progressively and actively recognizes and learns more hard samples to achieve continuous performance enhancement. Based on this, on the one hand, we propose a model pre-start concept, which focuses on selecting a portion of easy samples and can help models have basic task-specific learning capabilities. On the other hand, we propose a refined dual-update strategy, which can promote reasonable learning of harder samples and continuous refinement of pseudo-labels. In addition, to alleviate the risk of excessive label evolution, a decay factor is reasonably introduced, which helps to achieve a dynamic balance between the expansion and contraction of target annotations. Extensive experiments show that convolutional neural networks (CNNs) equipped with our PAL framework have achieved state-of-the-art (SOTA) results on multiple public datasets. Furthermore, our PAL framework can build an efficient and stable bridge between full supervision and point supervision tasks. Our code is available at this https URL.
zh
[CV-139] Dual-Schedule Inversion: Training- and Tuning-Free Inversion for Real Image Editing
【速读】: 该论文试图解决基于扩散模型(diffusion model)的文本条件图像编辑中,DDIM反演(DDIM Inversion)导致的重建失败问题,这影响了后续编辑的效果。解决方案的关键在于提出了一种新的反演和采样方法,称为双调度反演(Dual-Schedule Inversion),并通过设计一个分类器来自适应地结合该方法与不同的编辑方法,以实现用户友好的图像编辑。该方法的优势在于能够完美重建真实图像且无需微调,数学上保证了可逆性,编辑后的对象或场景符合文本提示的语义,同时未编辑的部分保持原始身份。
链接: https://arxiv.org/abs/2412.11152
作者: Jiancheng Huang,Yi Huang,Jianzhuang Liu,Donghao Zhou,Yifan Liu,Shifeng Chen
机构: Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences (深圳先进技术研究院,中国科学院)
关键词: practical AIGC task, Text-conditional image editing, practical AIGC, AIGC task, DDIM Inversion
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Text-conditional image editing is a practical AIGC task that has recently emerged with great commercial and academic value. For real image editing, most diffusion model-based methods use DDIM Inversion as the first stage before editing. However, DDIM Inversion often results in reconstruction failure, leading to unsatisfactory performance for downstream editing. To address this problem, we first analyze why the reconstruction via DDIM Inversion fails. We then propose a new inversion and sampling method named Dual-Schedule Inversion. We also design a classifier to adaptively combine Dual-Schedule Inversion with different editing methods for user-friendly image editing. Our work can achieve superior reconstruction and editing performance with the following advantages: 1) It can reconstruct real images perfectly without fine-tuning, and its reversibility is guaranteed mathematically. 2) The edited object/scene conforms to the semantics of the text prompt. 3) The unedited parts of the object/scene retain the original identity.
zh
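作为背景,下面给出标准 DDIM 反演(把真实图像潜变量逐步映射回噪声)单步更新的示意代码;上文的 Dual-Schedule Inversion 正是针对这一过程的重建误差所提出的改进。记号遵循常见的 DDIM 公式,`eps_model`、alpha 累积表等均为示意性占位,并非论文的实现。

```python
import torch

@torch.no_grad()
def ddim_inversion_step(x_t, t, t_next, eps_model, alphas_cumprod):
    """标准 DDIM 反演的单步:从 x_t 推到噪声更多的 x_{t_next} (t_next > t)。"""
    a_t, a_next = alphas_cumprod[t], alphas_cumprod[t_next]
    eps = eps_model(x_t, t)                                  # 预测噪声
    x0_pred = (x_t - (1 - a_t).sqrt() * eps) / a_t.sqrt()    # 由 x_t 估计 x_0
    return a_next.sqrt() * x0_pred + (1 - a_next).sqrt() * eps

if __name__ == "__main__":
    T = 1000
    alphas_cumprod = torch.linspace(0.999, 0.01, T)          # 示意性的 alpha 累积表
    eps_model = lambda x, t: torch.zeros_like(x)             # 占位的噪声预测网络
    x = torch.randn(1, 4, 64, 64)                            # 潜空间中的图像编码
    for t, t_next in zip(range(0, 980, 20), range(20, 1000, 20)):
        x = ddim_inversion_step(x, t, t_next, eps_model, alphas_cumprod)
    print(x.shape)
```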
[CV-140] A Comprehensive Survey of Action Quality Assessment: Method and Benchmark
【速读】: 该论文试图解决动作质量评估 (Action Quality Assessment, AQA) 领域中缺乏统一基准和系统性分析的问题。解决方案的关键在于通过系统分析超过150篇相关文献,构建了一个分层分类法 (hierarchical taxonomy),该分类法根据输入模态(视频、骨骼、多模态)及其特定特征对AQA方法进行分类,揭示了不同方法之间的演进和相互关系。此外,论文提出了一个统一的基准,整合了多样化的数据集,以评估评估精度和计算效率,从而促进标准化。最后,论文还探讨了新兴的任务特定应用,并指出了AQA领域中未被充分探索的挑战,为未来的研究方向提供了可操作的见解。
链接: https://arxiv.org/abs/2412.11149
作者: Kanglei Zhou,Ruizhi Cai,Liyuan Wang,Hubert P. H. Shum,Xiaohui Liang
机构: State Key Laboratory of Virtual Reality Technology and Systems, Beihang University(虚拟现实技术与系统国家重点实验室,北京航空航天大学); Department of Computer Science and Technology, Institute for AI, BNRist Center, THBI Lab, Tsinghua-Bosch Joint Center for ML, Tsinghua University(计算机科学与技术系,人工智能研究所,BNRist中心,THBI实验室,清华-博世联合机器学习中心,清华大学); Department of Computer Science, Durham University(计算机科学系,杜伦大学); Zhongguancun Laboratory(中关村实验室)
关键词: Action Quality Assessment, Action Quality, human actions, human judgment, Quality Assessment
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Action Quality Assessment (AQA) quantitatively evaluates the quality of human actions, providing automated assessments that reduce biases in human judgment. Its applications span domains such as sports analysis, skill assessment, and medical care. Recent advances in AQA have introduced innovative methodologies, but similar methods often intertwine across different domains, highlighting the fragmented nature that hinders systematic reviews. In addition, the lack of a unified benchmark and limited computational comparisons hinder consistent evaluation and fair assessment of AQA approaches. In this work, we address these gaps by systematically analyzing over 150 AQA-related papers to develop a hierarchical taxonomy, construct a unified benchmark, and provide an in-depth analysis of current trends, challenges, and future directions. Our hierarchical taxonomy categorizes AQA methods based on input modalities (video, skeleton, multi-modal) and their specific characteristics, highlighting the evolution and interrelations across various approaches. To promote standardization, we present a unified benchmark, integrating diverse datasets to evaluate the assessment precision and computational efficiency. Finally, we review emerging task-specific applications and identify under-explored challenges in AQA, providing actionable insights into future research directions. This survey aims to deepen understanding of AQA progress, facilitate method comparison, and guide future innovations. The project web page can be found at this https URL.
zh
[CV-141] Redefining Normal: A Novel Object-Level Approach for Multi-Object Novelty Detection ACCV24
【速读】: 该论文试图解决在多对象场景中新颖检测(novelty detection)中准确识别异常数据的问题,特别是在缺乏特定类别信息的情况下。解决方案的关键在于重新定义训练数据集中的“正常”(normal)概念,从传统的图像级别转向对象级别,即将数据集中最主要的对象视为正常。为此,论文提出了两种创新方法:一是密集特征微调(Dense Feature Fine-tuning on Normal Data, DeFeND),通过自监督损失使教师网络专注于对象级别的特征;二是掩码知识蒸馏(masked knowledge distillation),使学生网络在部分输入被隐藏的情况下学习,从而提高其从不完全数据中推断和泛化的能力。这些方法不仅在单对象新颖检测中表现良好,而且在多对象场景中显著超越现有方法。
链接: https://arxiv.org/abs/2412.11148
作者: Mohammadreza Salehi,Nikolaos Apostolikas,Efstratios Gavves,Cees G. M. Snoek,Yuki M. Asano
机构: 未知
关键词: accurately identifying outliers, specific class information, class information poses, accurately identifying, significant challenge
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted at ACCV24(Oral)
点击查看摘要
Abstract:In the realm of novelty detection, accurately identifying outliers in data without specific class information poses a significant challenge. While current methods excel in single-object scenarios, they struggle with multi-object situations due to their focus on individual objects. Our paper suggests a novel approach: redefining 'normal' at the object level in training datasets. Rather than the usual image-level view, we consider the most dominant object in a dataset as the norm, offering a perspective that is more effective for real-world scenarios. Adapting to our object-level definition of 'normal', we modify knowledge distillation frameworks, where a student network learns from a pre-trained teacher network. Our first contribution, DeFeND (Dense Feature Fine-tuning on Normal Data), integrates dense feature fine-tuning into the distillation process, allowing the teacher network to focus on object-level features with a self-supervised loss. The second is masked knowledge distillation, where the student network works with partially hidden inputs, honing its ability to deduce and generalize from incomplete data. This approach not only fares well in single-object novelty detection but also considerably surpasses existing methods in multi-object contexts. The implementation is available at: this https URL
zh
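下面以 PyTorch 勾勒"掩码知识蒸馏"的核心流程:教师网络看到完整图像,学生网络只看到被随机遮挡的图像,再用特征对齐损失迫使学生从不完整输入中推断教师的表示。教师/学生骨干、遮挡比例等均为示意性占位,并非论文的具体网络配置。

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def random_patch_mask(img: torch.Tensor, patch: int = 16, ratio: float = 0.5):
    """按 patch 网格随机置零,模拟部分隐藏的输入。"""
    b, _, h, w = img.shape
    mask = (torch.rand(b, 1, h // patch, w // patch, device=img.device) > ratio).float()
    mask = F.interpolate(mask, size=(h, w), mode="nearest")
    return img * mask

def masked_distill_loss(teacher: nn.Module, student: nn.Module,
                        img: torch.Tensor) -> torch.Tensor:
    with torch.no_grad():
        t_feat = teacher(img)                  # 教师使用完整图像
    s_feat = student(random_patch_mask(img))   # 学生使用被遮挡的图像
    return F.mse_loss(s_feat, t_feat)          # 特征层面的蒸馏损失

if __name__ == "__main__":
    # 用小型卷积网络占位教师/学生骨干
    make_net = lambda: nn.Sequential(nn.Conv2d(3, 16, 3, 2, 1), nn.ReLU(),
                                     nn.Conv2d(16, 32, 3, 2, 1),
                                     nn.AdaptiveAvgPool2d(1), nn.Flatten())
    teacher, student = make_net(), make_net()
    loss = masked_distill_loss(teacher, student, torch.randn(4, 3, 64, 64))
    loss.backward()
    print(float(loss))
```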
[CV-142] Combating Multimodal LLM Hallucination via Bottom-up Holistic Reasoning AAAI25
【速读】: 该论文试图解决多模态大语言模型 (MLLMs) 在处理视觉-语言任务时面临的幻觉问题,特别是感知层面和认知层面的幻觉。解决方案的关键在于引入了一种自底向上的推理框架,该框架通过验证和整合感知层面的信息与认知层面的常识知识,确保输出更加可靠。具体来说,该框架不仅关注感知层面的错误,还强调了认知层面的事实性常识,并提出了一种更有效的视觉输入表示方法,以减少视觉幻觉。此外,该框架还解决了文本输入错误导致的幻觉问题,这些问题在现有研究中长期被忽视。实验结果表明,该方法在多个幻觉基准测试中显著提升了模型的性能。
链接: https://arxiv.org/abs/2412.11124
作者: Shengqiong Wu,Hao Fei,Liangming Pan,William Yang Wang,Shuicheng Yan,Tat-Seng Chua
机构: National University of Singapore (新加坡国立大学); University of Edinburgh (爱丁堡大学); University of California, Santa Barbara (加州大学圣巴巴拉分校); Nanyang Technological University (南洋理工大学); Sea AI Lab (Sea AI实验室)
关键词: large language models, multimodal large language, shown unprecedented capabilities, Recent advancements, language models
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 16 pages, 10 figures, accepted by AAAI 25
点击查看摘要
Abstract:Recent advancements in multimodal large language models (MLLMs) have shown unprecedented capabilities in advancing various vision-language tasks. However, MLLMs face significant challenges with hallucinations, and misleading outputs that do not align with the input data. While existing efforts are paid to combat MLLM hallucinations, several pivotal challenges are still unsolved. First, while current approaches aggressively focus on addressing errors at the perception level, another important type at the cognition level requiring factual commonsense can be overlooked. In addition, existing methods might fall short in finding a more effective way to represent visual input, which is yet a key bottleneck that triggers visual hallucinations. Moreover, MLLMs can frequently be misled by faulty textual inputs and cause hallucinations, while unfortunately, this type of issue has long been overlooked by existing studies. Inspired by human intuition in handling hallucinations, this paper introduces a novel bottom-up reasoning framework. Our framework systematically addresses potential issues in both visual and textual inputs by verifying and integrating perception-level information with cognition-level commonsense knowledge, ensuring more reliable outputs. Extensive experiments demonstrate significant improvements in multiple hallucination benchmarks after integrating MLLMs with the proposed framework. In-depth analyses reveal the great potential of our methods in addressing perception- and cognition-level hallucinations.
zh
[CV-143] Impact of Adversarial Attacks on Deep Learning Model Explainability
【速读】: 该论文试图解决深度学习模型在对抗攻击下的可解释性问题,特别是当模型遭受对抗攻击时,现有的解释性技术(如GradCAM、SmoothGrad和LIME)是否能够保持其解释的鲁棒性。解决方案的关键在于通过使用对抗攻击方法(如FGSM和BIM)对模型进行攻击,并评估这些攻击对模型准确性和解释性指标(如IoU和RMSE)的影响。研究结果表明,尽管模型准确性显著下降,但解释性指标的变化几乎可以忽略不计,这表明现有的解释性指标可能不足以检测对抗扰动的影响。
链接: https://arxiv.org/abs/2412.11119
作者: Gazi Nazia Nur,Mohammad Ahnaf Sadat
机构: 未知
关键词: autonomous feature extraction, deep learning models, feature extraction, black-box nature, investigate the impact
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注: 29 pages with reference included, submitted to a journal
点击查看摘要
Abstract:In this paper, we investigate the impact of adversarial attacks on the explainability of deep learning models, which are commonly criticized for their black-box nature despite their capacity for autonomous feature extraction. This black-box nature can affect the perceived trustworthiness of these models. To address this, explainability techniques such as GradCAM, SmoothGrad, and LIME have been developed to clarify model decision-making processes. Our research focuses on the robustness of these explanations when models are subjected to adversarial attacks, specifically those involving subtle image perturbations that are imperceptible to humans but can significantly mislead models. For this, we utilize attack methods like the Fast Gradient Sign Method (FGSM) and the Basic Iterative Method (BIM) and observe their effects on model accuracy and explanations. The results reveal a substantial decline in model accuracy, with accuracies dropping from 89.94% to 58.73% and 45.50% under FGSM and BIM attacks, respectively. Despite these declines in accuracy, the explanation of the models measured by metrics such as Intersection over Union (IoU) and Root Mean Square Error (RMSE) shows negligible changes, suggesting that these metrics may not be sensitive enough to detect the presence of adversarial perturbations.
zh
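论文所用的 FGSM 攻击是经典方法:沿损失函数对输入的梯度符号方向叠加幅度为 epsilon 的扰动(BIM 则是其多步迭代版本)。下面给出 FGSM 的标准实现草图;模型结构、epsilon 取值为示意性假设,解释图(如 GradCAM)的 IoU/RMSE 对比不在此示例范围内。

```python
import torch
import torch.nn as nn

def fgsm_attack(model: nn.Module, x: torch.Tensor, y: torch.Tensor,
                eps: float = 8 / 255) -> torch.Tensor:
    """Fast Gradient Sign Method:x_adv = x + eps * sign(grad_x L)。"""
    x = x.clone().detach().requires_grad_(True)
    loss = nn.functional.cross_entropy(model(x), y)
    loss.backward()
    x_adv = x + eps * x.grad.sign()
    return x_adv.clamp(0, 1).detach()          # 保持像素值在合法范围内

if __name__ == "__main__":
    model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 10))  # 占位分类器
    x = torch.rand(4, 3, 32, 32)
    y = torch.randint(0, 10, (4,))
    x_adv = fgsm_attack(model, x, y)
    print((x_adv - x).abs().max())  # 扰动幅度不超过 eps
```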
[CV-144] Empowering LLM s to Understand and Generate Complex Vector Graphics
【速读】: 该论文试图解决大型语言模型(LLMs)在生成可扩展矢量图形(SVG)时面临的两个主要问题:语义模糊和矢量路径渲染顺序的理解不足,这可能导致矢量图元预测中的幻觉和输出图元之间的遮挡。解决方案的关键在于提出了LLM4SVG,通过引入可学习的语义标记(learnable semantic tokens)来精确编码SVG组件及其属性,从而生成语义对齐的SVG输出。此外,论文还开发了一个自动化的数据生成管道,收集了大量SVG数据和SVG-文本指令,采用两阶段训练策略,结合模块化架构,将几何、外观和语言信息紧密结合,显著提升了SVG生成的质量和效果。
链接: https://arxiv.org/abs/2412.11102
作者: Ximing Xing,Juncheng Hu,Guotao Liang,Jing Zhang,Dong Xu,Qian Yu
机构: Beihang University (北京航空航天大学); The University of Hong Kong (香港大学)
关键词: profoundly impacted natural, impacted natural language, natural language processing, Large Language Models, scalable vector graphics
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project Page: this https URL
点击查看摘要
Abstract:The unprecedented advancements in Large Language Models (LLMs) have profoundly impacted natural language processing but have yet to fully embrace the realm of scalable vector graphics (SVG) generation. While LLMs encode partial knowledge of SVG data from web pages during training, recent findings suggest that semantically ambiguous and tokenized representations within LLMs may result in hallucinations in vector primitive predictions. Additionally, LLM training typically lacks modeling and understanding of the rendering sequence of vector paths, which can lead to occlusion between output vector primitives. In this paper, we present LLM4SVG, an initial yet substantial step toward bridging this gap by enabling LLMs to better understand and generate vector graphics. LLM4SVG facilitates a deeper understanding of SVG components through learnable semantic tokens, which precisely encode these tokens and their corresponding properties to generate semantically aligned SVG outputs. Using a series of learnable semantic tokens, a structured dataset for instruction following is developed to support comprehension and generation across two primary tasks. Our method introduces a modular architecture to existing large language models, integrating semantic tags, vector instruction encoders, fine-tuned commands, and powerful LLMs to tightly combine geometric, appearance, and language information. To overcome the scarcity of SVG-text instruction data, we developed an automated data generation pipeline that collected a massive dataset of more than 250k SVG data and 580k SVG-text instructions, which facilitated the adoption of the two-stage training strategy popular in LLM development. By exploring various training strategies, we developed LLM4SVG, which significantly moves beyond optimized rendering-based approaches and language-model-based baselines to achieve remarkable results in human evaluation tasks.
zh
[CV-145] DynamicScaler: Seamless and Scalable Video Generation for Panoramic Scenes
【速读】: 该论文试图解决现有视频扩散模型在生成高质量场景级和360°全景视频时面临的分辨率和宽高比限制问题。解决方案的关键在于提出了DynamicScaler,通过引入Offset Shifting Denoiser和Global Motion Guidance机制,实现了空间可扩展和全景动态场景合成的无缝衔接。Offset Shifting Denoiser通过固定分辨率的扩散模型和无缝旋转窗口,确保了全景场景中任意大小的边界过渡和一致性;而Global Motion Guidance则保证了局部细节的保真度和全局运动的连续性。这一方法在实验中展示了优越的内容和运动质量,提供了一种无需训练、高效且可扩展的沉浸式动态场景生成解决方案,且在输出视频分辨率变化时保持恒定的VRAM消耗。
链接: https://arxiv.org/abs/2412.11100
作者: Jinxiu Liu,Shaoheng Lin,Yinxiao Li,Ming-Hsuan Yang
机构: SCUT; Google DeepMind; UC Merced
关键词: generate high-quality scene-level, increasing demand, applications and spatial, spatial intelligence, intelligence has heightened
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:The increasing demand for immersive AR/VR applications and spatial intelligence has heightened the need to generate high-quality scene-level and 360° panoramic video. However, most video diffusion models are constrained by limited resolution and aspect ratio, which restricts their applicability to scene-level dynamic content synthesis. In this work, we propose the DynamicScaler, addressing these challenges by enabling spatially scalable and panoramic dynamic scene synthesis that preserves coherence across panoramic scenes of arbitrary size. Specifically, we introduce a Offset Shifting Denoiser, facilitating efficient, synchronous, and coherent denoising panoramic dynamic scenes via a diffusion model with fixed resolution through a seamless rotating Window, which ensures seamless boundary transitions and consistency across the entire panoramic space, accommodating varying resolutions and aspect ratios. Additionally, we employ a Global Motion Guidance mechanism to ensure both local detail fidelity and global motion continuity. Extensive experiments demonstrate our method achieves superior content and motion quality in panoramic scene-level video generation, offering a training-free, efficient, and scalable solution for immersive dynamic scene creation with constant VRAM consumption regardless of the output video resolution. Our project page is available at \urlthis https URL.
zh
[CV-146] Seeing the Forest and the Trees: Solving Visual Graph and Tree Based Data Structure Problems using Large Multimodal Models
【速读】: 该论文试图解决生成式 AI (Generative AI) 系统在解决基于图像的图和树数据结构问题方面的能力问题。解决方案的关键在于构建并评估一个包含 9,072 个样本的新型基准数据集,用于测试 GPT-4o、GPT-4v、Gemini 1.5 Pro、Gemini 1.5 Flash、Gemini 1.0 Pro Vision 和 Claude 3 模型家族在处理这些复杂计算问题上的表现。研究结果表明,GPT-4o 在树结构问题上表现最佳,准确率达到 87.6%,而 Gemini 1.5 Flash 在图结构问题上表现最佳,准确率为 56.2%。该研究不仅引入了多模态模型 (LMM) 的基准测试,还强调了结构和视觉变化对模型性能的影响,为教育评估和教学实践提供了重要启示。
链接: https://arxiv.org/abs/2412.11088
作者: Sebastian Gutierrez,Irene Hou,Jihye Lee,Kenneth Angelikas,Owen Man,Sophia Mettille,James Prather,Paul Denny,Stephen MacNeil
机构: Temple University(天普大学); Abilene Christian University(阿比林基督教大学); The University of Auckland(奥克兰大学)
关键词: Recent advancements, integrity among educators, advancements in generative, generative AI systems, systems have raised
类目: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Computers and Society (cs.CY)
备注: 14 pages, 4 figures, to be published in ACE 2025
点击查看摘要
Abstract:Recent advancements in generative AI systems have raised concerns about academic integrity among educators. Beyond excelling at solving programming problems and text-based multiple-choice questions, recent research has also found that large multimodal models (LMMs) can solve Parsons problems based only on an image. However, such problems are still inherently text-based and rely on the capabilities of the models to convert the images of code blocks to their corresponding text. In this paper, we further investigate the capabilities of LMMs to solve graph and tree data structure problems based only on images. To achieve this, we computationally construct and evaluate a novel benchmark dataset comprising 9,072 samples of diverse graph and tree data structure tasks to assess the performance of the GPT-4o, GPT-4v, Gemini 1.5 Pro, Gemini 1.5 Flash, Gemini 1.0 Pro Vision, and Claude 3 model families. GPT-4o and Gemini 1.5 Flash performed best on trees and graphs respectively. GPT-4o achieved 87.6% accuracy on tree samples, while Gemini 1.5 Flash, achieved 56.2% accuracy on graph samples. Our findings highlight the influence of structural and visual variations on model performance. This research not only introduces an LMM benchmark to facilitate replication and further exploration but also underscores the potential of LMMs in solving complex computing problems, with important implications for pedagogy and assessment practices.
zh
[CV-147] Deep Spectral Clustering via Joint Spectral Embedding and Kmeans
【速读】: 该论文试图解决传统谱聚类(Spectral Clustering)方法中的两个主要问题:一是谱聚类将数据映射到谱嵌入空间和使用Kmeans聚类这两个步骤是分离的,无法进行联合优化;二是当数据为高维时,构建样本的相似性图会受到维度灾难(curse of dimensionality)的影响。为解决这些问题,论文提出了深度谱聚类(Deep Spectral Clustering, DSC),其关键在于通过深度神经网络和幂迭代(power iteration)学习将原始样本高效嵌入到谱嵌入空间,并通过贪婪优化策略改进Kmeans在学到的谱嵌入上的聚类结构。DSC通过端到端的方式无缝集成这两个模块,实现了谱嵌入和聚类的联合优化,从而在多个真实世界数据集上实现了最先进的聚类性能。
链接: https://arxiv.org/abs/2412.11080
作者: Wengang Guo,Wei Ye
机构: Tongji University(同济大学)
关键词: popular clustering method, spectral embedding space, spectral embedding, Spectral, spectral embedding module
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Spectral clustering is a popular clustering method. It first maps data into the spectral embedding space and then uses Kmeans to find clusters. However, the two decoupled steps prohibit joint optimization for the optimal solution. In addition, it needs to construct the similarity graph for samples, which suffers from the curse of dimensionality when the data are high-dimensional. To address these two challenges, we introduce \textbfDeep \textbfSpectral \textbfClustering (\textbfDSC), which consists of two main modules: the spectral embedding module and the greedy Kmeans module. The former module learns to efficiently embed raw samples into the spectral embedding space using deep neural networks and power iteration. The latter module improves the cluster structures of Kmeans on the learned spectral embeddings by a greedy optimization strategy, which iteratively reveals the direction of the worst cluster structures and optimizes embeddings in this direction. To jointly optimize spectral embeddings and clustering, we seamlessly integrate the two modules and optimize them in an end-to-end manner. Experimental results on seven real-world datasets demonstrate that DSC achieves state-of-the-art clustering performance.
zh
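DSC 的谱嵌入模块依赖幂迭代来近似谱分解。下面用一个小例子示意"幂迭代求谱嵌入"的做法:对对称归一化的相似度矩阵反复做矩阵乘并用 QR 正交化,近似其前 k 个主特征向量,所得嵌入可交给后续 Kmeans。相似度的构造方式、迭代次数等均为示意性假设,并非 DSC 的完整实现。

```python
import torch

def spectral_embedding_power_iter(X: torch.Tensor, k: int = 3,
                                  iters: int = 50, sigma: float = 1.0):
    """用子空间幂迭代(带 QR 正交化)近似归一化相似度矩阵的前 k 个特征向量。"""
    W = torch.exp(-torch.cdist(X, X) ** 2 / (2 * sigma ** 2))   # 高斯核相似度
    d = W.sum(dim=1)
    S = W / torch.sqrt(d.unsqueeze(0) * d.unsqueeze(1))         # 对称归一化
    V = torch.randn(X.shape[0], k)
    for _ in range(iters):
        V, _ = torch.linalg.qr(S @ V)                           # 幂迭代 + 正交化
    return V                                                    # (N, k) 谱嵌入

if __name__ == "__main__":
    # 两团二维高斯点
    X = torch.cat([torch.randn(50, 2), torch.randn(50, 2) + 6.0])
    emb = spectral_embedding_power_iter(X, k=2)
    print(emb.shape)  # torch.Size([100, 2]),可在其上运行 Kmeans
```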
[CV-148] Reason-before-Retrieve: One-Stage Reflective Chain-of-Thoughts for Training-Free Zero-Shot Composed Image Retrieval
【速读】: 该论文试图解决现有零样本组合图像检索 (Zero-Shot Composed Image Retrieval, ZS-CIR) 方法在处理用户指定文本修改时,由于两阶段流程导致的视觉细节丢失和推理能力有限的问题。解决方案的关键在于提出了一种无需训练的一阶段方法,即一阶段反射链式推理 (One-Stage Reflective Chain-of-Thought Reasoning, OSrCIR),通过多模态大语言模型 (Multimodal Large Language Models) 在单阶段推理过程中保留关键视觉信息,避免了传统两阶段方法中的信息丢失。此外,反射链式推理框架通过将操作意图与参考图像的上下文线索对齐,进一步提高了解释的准确性,从而在多个任务中实现了1.80%到6.44%的性能提升,刷新了ZS-CIR的最新技术水平。
链接: https://arxiv.org/abs/2412.11077
作者: Yuanmin Tang,Xiaoting Qin,Jue Zhang,Jing Yu,Gaopeng Gou,Gang Xiong,Qingwei Ling,Saravan Rajmohan,Dongmei Zhang,Qi Wu
机构: Institute of Information Engineering, Chinese Academy of Sciences; School of Cyber Security, University of Chinese Academy of Sciences; Microsoft; Minzu University of China; University of Adelaide
关键词: Composed Image Retrieval, user-specified textual modifications, Large Language Models, integrating user-specified textual, Composed Image
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Composed Image Retrieval (CIR) aims to retrieve target images that closely resemble a reference image while integrating user-specified textual modifications, thereby capturing user intent more precisely. Existing training-free zero-shot CIR (ZS-CIR) methods often employ a two-stage process: they first generate a caption for the reference image and then use Large Language Models for reasoning to obtain a target description. However, these methods suffer from missing critical visual details and limited reasoning capabilities, leading to suboptimal retrieval performance. To address these challenges, we propose a novel, training-free one-stage method, One-Stage Reflective Chain-of-Thought Reasoning for ZS-CIR (OSrCIR), which employs Multimodal Large Language Models to retain essential visual information in a single-stage reasoning process, eliminating the information loss seen in two-stage methods. Our Reflective Chain-of-Thought framework further improves interpretative accuracy by aligning manipulation intent with contextual cues from reference images. OSrCIR achieves performance gains of 1.80% to 6.44% over existing training-free methods across multiple tasks, setting new state-of-the-art results in ZS-CIR and enhancing its utility in vision-language applications. Our code will be available at this https URL.
zh
[CV-149] MoRe: Class Patch Attention Needs Regularization for Weakly Supervised Semantic Segmentation AAAI2025
【速读】: 该论文试图解决弱监督语义分割 (Weakly Supervised Semantic Segmentation, WSSS) 中使用图像级标签时,基于Vision Transformer (ViT) 生成的定位注意力图 (Localization Attention Maps, LAM) 存在的伪激活问题,即与类别语义无关的区域被错误激活。解决方案的关键在于提出了一种名为MoRe的方法,通过两种模块来增强类-块注意力 (class-patch attention) 的正则化:首先,将注意力视为一种有向图,并引入图类别表示模块 (Graph Category Representation module) 来隐式地正则化类-块实体间的交互,确保类标记动态地聚合相关块信息并抑制无关伪影;其次,设计了定位信息正则化模块 (Localization-informed Regularization module),通过从分类激活图 (CAM) 中挖掘标记关系,显式地正则化类-块注意力,以学习方式监督类标记与块标记之间的一致性。实验结果表明,MoRe有效解决了伪激活问题,并在PASCAL VOC和MS COCO数据集上达到了最先进的性能。
链接: https://arxiv.org/abs/2412.11076
作者: Zhiwei Yang,Yucong Meng,Kexue Fu,Shuo Wang,Zhijian Song
机构: 未知
关键词: Weakly Supervised Semantic, Supervised Semantic Segmentation, Weakly Supervised, Class Activation Maps, image-level labels typically
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: AAAI 2025
点击查看摘要
Abstract:Weakly Supervised Semantic Segmentation (WSSS) with image-level labels typically uses Class Activation Maps (CAM) to achieve dense predictions. Recently, Vision Transformer (ViT) has provided an alternative to generate localization maps from class-patch attention. However, due to insufficient constraints on modeling such attention, we observe that the Localization Attention Maps (LAM) often struggle with the artifact issue, i.e., patch regions with minimal semantic relevance are falsely activated by class tokens. In this work, we propose MoRe to address this issue and further explore the potential of LAM. Our findings suggest that imposing additional regularization on class-patch attention is necessary. To this end, we first view the attention as a novel directed graph and propose the Graph Category Representation module to implicitly regularize the interaction among class-patch entities. It ensures that class tokens dynamically condense the related patch information and suppress unrelated artifacts at a graph level. Second, motivated by the observation that CAM from classification weights maintains smooth localization of objects, we devise the Localization-informed Regularization module to explicitly regularize the class-patch attention. It directly mines the token relations from CAM and further supervises the consistency between class and patch tokens in a learnable manner. Extensive experiments are conducted on PASCAL VOC and MS COCO, validating that MoRe effectively addresses the artifact issue and achieves state-of-the-art performance, surpassing recent single-stage and even multi-stage methods. Code is available at this https URL.
zh
[CV-150] Adapter-Enhanced Semantic Prompting for Continual Learning
【速读】: 该论文试图解决持续学习 (Continual Learning, CL) 中的灾难性遗忘问题 (catastrophic forgetting),即模型在适应新数据时会覆盖之前学到的知识。传统的解决方案通常需要保留过去的训练数据或增加额外的模型分支,导致内存需求较高。论文提出了一种轻量级的持续学习框架,称为适配器增强的语义提示 (Adapter-Enhanced Semantic Prompting, AESP),其关键在于结合了提示调优 (prompt tuning) 和适配器 (adapter) 技术。具体来说,通过设计语义引导的提示来增强视觉特征的泛化能力,并利用适配器高效地融合语义信息,从而学习更具适应性的特征。此外,论文还开发了一种新的提示选择匹配机制,以确保在特征适应过程中选择合适的任务提示。实验结果表明,该方法在多个持续学习数据集上表现优异,展示了其在推进持续学习领域的潜力。
链接: https://arxiv.org/abs/2412.11074
作者: Baocai Yin,Ji Zhao,Huajie Jiang,Ningning Hou,Yongli Hu,Amin Beheshti,Ming-Hsuan Yang,Yuankai Qi
机构: 未知
关键词: evolving data streams, adapt to evolving, enables models, data streams, models to adapt
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:
点击查看摘要
Abstract:Continual learning (CL) enables models to adapt to evolving data streams. A major challenge of CL is catastrophic forgetting, where new knowledge will overwrite previously acquired knowledge. Traditional methods usually retain the past data for replay or add additional branches in the model to learn new knowledge, which has high memory requirements. In this paper, we propose a novel lightweight CL framework, Adapter-Enhanced Semantic Prompting (AESP), which integrates prompt tuning and adapter techniques. Specifically, we design semantic-guided prompts to enhance the generalization ability of visual features and utilize adapters to efficiently fuse the semantic information, aiming to learn more adaptive features for the continual learning task. Furthermore, to choose the right task prompt for feature adaptation, we have developed a novel matching mechanism for prompt selection. Extensive experiments on three CL datasets demonstrate that our approach achieves favorable performance across multiple metrics, showing its potential for advancing CL.
zh
[CV-151] HC-LLM : Historical-Constrained Large Language Models for Radiology Report Generation AAAI2025
【速读】: 该论文试图解决放射学报告生成 (Radiology Report Generation, RRG) 模型在处理患者随访时,未能有效整合历史影像或文本数据的问题。传统方法在处理长序列依赖时表现不佳,而大语言模型 (Large Language Models, LLMs) 擅长上下文学习,适合分析纵向医疗数据。论文提出的解决方案是历史约束大语言模型 (Historical-Constrained Large Language Models, HC-LLM) 框架,其关键在于通过约束纵向影像与其对应报告之间的一致性和差异性,提取时间共享和时间特定特征,捕捉疾病进展。通过应用模态内相似性约束和多模态对比与结构约束,确保特征在不同模态间的一致性,从而指导 LLMs 生成准确反映疾病进展的诊断报告,在 Longitudinal-MIMIC 数据集上达到最先进的结果。此外,该方法在测试时即使没有历史数据也能表现良好,并可轻松适应其他多模态大模型,增强了其通用性。
链接: https://arxiv.org/abs/2412.11070
作者: Tengfei Liu,Jiapu Wang,Yongli Hu,Mingjie Li,Junfei Yi,Xiaojun Chang,Junbin Gao,Baocai Yin
机构: 未知
关键词: Radiology report generation, models typically focus, large language models, Radiology report, individual exams
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by AAAI2025
点击查看摘要
Abstract:Radiology report generation (RRG) models typically focus on individual exams, often overlooking the integration of historical visual or textual data, which is crucial for patient follow-ups. Traditional methods usually struggle with long sequence dependencies when incorporating historical information, but large language models (LLMs) excel at in-context learning, making them well-suited for analyzing longitudinal medical data. In light of this, we propose a novel Historical-Constrained Large Language Models (HC-LLM) framework for RRG, empowering LLMs with longitudinal report generation capabilities by constraining the consistency and differences between longitudinal images and their corresponding reports. Specifically, our approach extracts both time-shared and time-specific features from longitudinal chest X-rays and diagnostic reports to capture disease progression. Then, we ensure consistent representation by applying intra-modality similarity constraints and aligning various features across modalities with multimodal contrastive and structural constraints. These combined constraints effectively guide the LLMs in generating diagnostic reports that accurately reflect the progression of the disease, achieving state-of-the-art results on the Longitudinal-MIMIC dataset. Notably, our approach performs well even without historical data during testing and can be easily adapted to other multimodal large models, enhancing its versatility.
zh
[CV-152] CFSynthesis: Controllable and Free-view 3D Human Video Synthesis
【速读】: 该论文试图解决现有2D扩散方法在处理复杂3D姿态和多变场景背景时的局限性问题。解决方案的关键在于引入CFSynthesis框架,该框架通过以下两个核心技术实现高质量的人类视频生成:1) 采用基于纹理的SMPL表示(texture-SMPL-based representation),确保角色在自由视角下的外观一致性和稳定性;2) 引入一种新的前景-背景分离策略(foreground-background separation strategy),有效分解场景为前景和背景,从而实现用户定义背景的无缝集成。这些技术使得CFSynthesis在复杂人类动画和3D自由视角运动场景中表现出卓越的性能。
链接: https://arxiv.org/abs/2412.11067
作者: Cui Liyuan,Xu Xiaogang,Dong Wenqi,Yang Zesong,Bao Hujun,Cui Zhaopeng
机构: State Key lab of CAD&CG, College of Computer Science, Zhejiang University(浙江大学计算机科学与技术学院CAD&CG国家重点实验室); department of computer science and engineering, the Chinese University of Hong Kong(香港中文大学计算机科学与工程系)
关键词: video synthesis aims, create lifelike characters, Human video synthesis, content creation, synthesis aims
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Human video synthesis aims to create lifelike characters in various environments, with wide applications in VR, storytelling, and content creation. While 2D diffusion-based methods have made significant progress, they struggle to generalize to complex 3D poses and varying scene backgrounds. To address these limitations, we introduce CFSynthesis, a novel framework for generating high-quality human videos with customizable attributes, including identity, motion, and scene configurations. Our method leverages a texture-SMPL-based representation to ensure consistent and stable character appearances across free viewpoints. Additionally, we introduce a novel foreground-background separation strategy that effectively decomposes the scene as foreground and background, enabling seamless integration of user-defined backgrounds. Experimental results on multiple datasets show that CFSynthesis not only achieves state-of-the-art performance in complex human animations but also adapts effectively to 3D motions in free-view and user-specified scenarios.
zh
[CV-153] Classification Drives Geographic Bias in Street Scene Segmentation
【速读】: 该论文旨在解决地理多样性不足的图像数据集对实例分割模型性能的影响问题。研究的关键在于揭示了在欧洲驾驶场景数据集上训练的模型(Eurocentric models)存在地理偏见,且这种偏见主要来源于分类错误而非定位错误。具体来说,分类错误在分割任务中贡献了10-90%的地理偏见,在检测任务中贡献了19-88%的地理偏见。解决方案的关键是通过使用更粗略的类别(如将汽车、公交车和卡车归类为四轮车)来显著减少分类错误带来的地理偏见。
链接: https://arxiv.org/abs/2412.11061
作者: Rahul Nair,Gabriel Tseng,Esther Rolf,Bhanu Tokas,Hannah Kerner
机构: Arizona State University(亚利桑那州立大学); Mila Quebec AI Institute(Mila魁北克人工智能研究所); University of Colorado Boulder(科罗拉多大学博尔德分校)
关键词: lacking geographic diversity, datasets lacking geographic, image datasets lacking, Eurocentric models, lacking geographic
类目: Computer Vision and Pattern Recognition (cs.CV); Computers and Society (cs.CY); Machine Learning (cs.LG)
备注:
点击查看摘要
Abstract:Previous studies showed that image datasets lacking geographic diversity can lead to biased performance in models trained on them. While earlier work studied general-purpose image datasets (e.g., ImageNet) and simple tasks like image recognition, we investigated geo-biases in real-world driving datasets on a more complex task: instance segmentation. We examined if instance segmentation models trained on European driving scenes (Eurocentric models) are geo-biased. Consistent with previous work, we found that Eurocentric models were geo-biased. Interestingly, we found that geo-biases came from classification errors rather than localization errors, with classification errors alone contributing 10-90% of the geo-biases in segmentation and 19-88% of the geo-biases in detection. This showed that while classification is geo-biased, localization (including detection and segmentation) is geographically robust. Our findings show that in region-specific models (e.g., Eurocentric models), geo-biases from classification errors can be significantly mitigated by using coarser classes (e.g., grouping car, bus, and truck as 4-wheeler).
zh
[CV-154] Making Bias Amplification in Balanced Datasets Directional and Interpretable
【速读】: 该论文试图解决在平衡数据集中测量偏差放大(bias amplification)方向性的问题。现有的基于共现的度量方法在处理平衡数据集时失效,而现有的可预测性度量方法(如leakage amplification)虽然可以测量偏差放大,但无法确定偏差放大的方向。论文提出的解决方案是引入一种新的可预测性度量方法,称为方向性可预测性放大(directional predictability amplification, DPA),该方法能够有效测量平衡数据集中的方向性偏差放大,并且相较于leakage amplification,DPA更易于解释且对攻击模型(attacker models)的敏感性较低。
链接: https://arxiv.org/abs/2412.11060
作者: Bhanu Tokas,Rahul Nair,Hannah Kerner
机构: Arizona State University (亚利桑那州立大学)
关键词: amplification, bias amplification, bias, DPA, measure bias amplification
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:
点击查看摘要
Abstract:Most of the ML datasets we use today are biased. When we train models on these biased datasets, they often not only learn dataset biases but can also amplify them – a phenomenon known as bias amplification. Several co-occurrence-based metrics have been proposed to measure bias amplification between a protected attribute A (e.g., gender) and a task T (e.g., cooking). However, these metrics fail to measure biases when A is balanced with T. To measure bias amplification in balanced datasets, recent work proposed a predictability-based metric called leakage amplification. However, leakage amplification cannot identify the direction in which biases are amplified. In this work, we propose a new predictability-based metric called directional predictability amplification (DPA). DPA measures directional bias amplification, even for balanced datasets. Unlike leakage amplification, DPA is easier to interpret and less sensitive to attacker models (a hyperparameter in predictability-based metrics). Our experiments on tabular and image datasets show that DPA is an effective metric for measuring directional bias amplification. The code will be available soon.
zh
[CV-155] SHMT: Self-supervised Hierarchical Makeup Transfer via Latent Diffusion Models NEURIPS2024
【速读】: 该论文试图解决化妆风格迁移任务中的两个主要问题:一是由于缺乏配对数据导致的低质量伪真实数据对模型训练的误导,二是现有方法难以处理不同化妆风格对人脸的多样性影响。解决方案的关键在于提出了一种自监督的分层化妆迁移方法(Self-supervised Hierarchical Makeup Transfer, SHMT),通过潜在扩散模型实现。SHMT采用“解耦-重建”范式,在自监督模式下工作,避免了伪配对数据的误导。此外,通过拉普拉斯金字塔分解分层纹理细节,并选择性地引入内容表示,以适应多种化妆风格。最后,设计了迭代双对齐模块(Iterative Dual Alignment, IDA),动态调整扩散模型的注入条件,纠正内容与化妆表示之间的领域差异导致的对齐误差。
链接: https://arxiv.org/abs/2412.11058
作者: Zhaoyang Sun,Shengwu Xiong,Yaxiong Chen,Fei Du,Weihua Chen,Fan Wang,Yi Rong
机构: School of Computer Science and Artificial Intelligence, Wuhan University of Technology; Sanya Science and Education Innovation Park, Wuhan University of Technology; DAMO Academy, Alibaba Group; Hupan Laboratory; Shanghai AI Laboratory
关键词: apply diverse makeup, diverse makeup styles, makeup styles precisely, facial image, Hierarchical Makeup Transfer
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by NeurIPS 2024
点击查看摘要
Abstract:This paper studies the challenging task of makeup transfer, which aims to apply diverse makeup styles precisely and naturally to a given facial image. Due to the absence of paired data, current methods typically synthesize sub-optimal pseudo ground truths to guide the model training, resulting in low makeup fidelity. Additionally, different makeup styles generally have varying effects on the person face, but existing methods struggle to deal with this diversity. To address these issues, we propose a novel Self-supervised Hierarchical Makeup Transfer (SHMT) method via latent diffusion models. Following a “decoupling-and-reconstruction” paradigm, SHMT works in a self-supervised manner, freeing itself from the misguidance of imprecise pseudo-paired data. Furthermore, to accommodate a variety of makeup styles, hierarchical texture details are decomposed via a Laplacian pyramid and selectively introduced to the content representation. Finally, we design a novel Iterative Dual Alignment (IDA) module that dynamically adjusts the injection condition of the diffusion model, allowing the alignment errors caused by the domain gap between content and makeup representations to be corrected. Extensive quantitative and qualitative analyses demonstrate the effectiveness of our method. Our code is available at \urlthis https URL.
zh
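SHMT 用拉普拉斯金字塔分层提取纹理细节。下面示意这种分解:逐层下采样得到低频内容,每层的高频细节由"原图减去上采样后的低频"得到,之后可按层选择性地注入内容表示;分解本身可精确重建原图。层数与插值方式为常见默认选择,仅作示意。

```python
import torch
import torch.nn.functional as F

def laplacian_pyramid(img: torch.Tensor, levels: int = 3):
    """将图像分解为若干层高频细节 + 一个最低分辨率的低频底图。"""
    highs, cur = [], img
    for _ in range(levels):
        down = F.avg_pool2d(cur, kernel_size=2)                       # 低频
        up = F.interpolate(down, size=cur.shape[-2:], mode="bilinear",
                           align_corners=False)
        highs.append(cur - up)                                        # 高频细节
        cur = down
    return highs, cur                                                 # 各层细节 + 底图

def reconstruct(highs, low):
    cur = low
    for high in reversed(highs):
        cur = F.interpolate(cur, size=high.shape[-2:], mode="bilinear",
                            align_corners=False) + high
    return cur

if __name__ == "__main__":
    img = torch.rand(1, 3, 128, 128)
    highs, low = laplacian_pyramid(img)
    err = (reconstruct(highs, low) - img).abs().max()
    print(f"重建误差: {float(err):.2e}")  # 分解是可精确重建的
```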
[CV-156] Overview of TREC 2024 Medical Video Question Answering (MedVidQA) Track
【速读】: 该论文试图解决在医疗领域中,如何通过自然语言查询与视频内容进行有效交互的问题。解决方案的关键在于开发能够理解医疗视频并生成视觉答案的多模态系统,利用大规模语言-视觉数据集和高效的深度神经网络技术,以支持医疗急救、教育和临床决策等应用。通过引入新的任务,论文旨在推动研究,设计出能够从医疗视频中生成指导步骤的系统,从而为公众和医疗专业人员提供更先进的应用支持。
链接: https://arxiv.org/abs/2412.11056
作者: Deepak Gupta,Dina Demner-Fushman
机构: National Library of Medicine, NIH(国立卫生研究院); National Library of Medicine, NIH(国立卫生研究院)
关键词: artificial intelligence, natural language query, key goals, goals of artificial, facilitates communication
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:One of the key goals of artificial intelligence (AI) is the development of a multimodal system that facilitates communication with the visual world (image and video) using a natural language query. Earlier works on medical question answering primarily focused on textual and visual (image) modalities, which may be inefficient in answering questions requiring demonstration. In recent years, significant progress has been achieved due to the introduction of large-scale language-vision datasets and the development of efficient deep neural techniques that bridge the gap between language and visual understanding. Improvements have been made in numerous vision-and-language tasks, such as visual captioning visual question answering, and natural language video localization. Most of the existing work on language vision focused on creating datasets and developing solutions for open-domain applications. We believe medical videos may provide the best possible answers to many first aid, medical emergency, and medical education questions. With increasing interest in AI to support clinical decision-making and improve patient engagement, there is a need to explore such challenges and develop efficient algorithms for medical language-video understanding and generation. Toward this, we introduced new tasks to foster research toward designing systems that can understand medical videos to provide visual answers to natural language questions, and are equipped with multimodal capability to generate instruction steps from the medical video. These tasks have the potential to support the development of sophisticated downstream applications that can benefit the public and medical professionals.
zh
[CV-157] RAC3: Retrieval-Augmented Corner Case Comprehension for Autonomous Driving with Vision-Language Models
【速读】: 该论文旨在解决自动驾驶系统中视觉-语言模型 (Vision-Language Models, VLMs) 在处理极端情况(corner cases)时面临的挑战,特别是幻觉(hallucination)和缺乏现实世界基础的问题。解决方案的关键在于提出了一个名为 RAC3 的新框架,该框架通过集成检索增强生成 (Retrieval-Augmented Generation, RAG) 来动态引入上下文特定的外部知识,从而缓解幻觉问题。RAC3 的核心创新在于跨模态对齐微调,利用对比学习将图像-文本对嵌入到统一的语义空间中,增强了相似场景的检索能力。实验结果表明,RAC3 在语义对齐、幻觉缓解和性能指标(如余弦相似度和 ROUGE-L 分数)上均有显著提升,从而增强了自动驾驶系统在复杂环境中的鲁棒性和安全性。
链接: https://arxiv.org/abs/2412.11050
作者: Yujin Wang,Quanfeng Liu,Jiaqi Fan,Jinlong Hong,Hongqing Chu,Mengjian Tian,Bingzhao Gao,Hong Chen
机构: Tongji University (同济大学); Shanghai Research Institute for Intelligent Autonomous Systems (上海智能自主系统研究院); Shenzhen Technology University (深圳技术大学); College of Electronic and Information Engineering (电子与信息工程学院)
关键词: Understanding and addressing, autonomous driving systems, addressing corner cases, essential for ensuring, driving systems
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 12 pages, 7 figures
点击查看摘要
Abstract:Understanding and addressing corner cases is essential for ensuring the safety and reliability of autonomous driving systems. Vision-Language Models (VLMs) play a crucial role in enhancing scenario comprehension, yet they face significant challenges, such as hallucination and insufficient real-world grounding, which compromise their performance in critical driving scenarios. In this work, we propose RAC3, a novel framework designed to improve VLMs’ ability to handle corner cases effectively. The framework integrates Retrieval-Augmented Generation (RAG) to mitigate hallucination by dynamically incorporating context-specific external knowledge. A cornerstone of RAC3 is its cross-modal alignment fine-tuning, which utilizes contrastive learning to embed image-text pairs into a unified semantic space, enabling robust retrieval of similar scenarios. We evaluate RAC3 through extensive experiments using a curated dataset of corner case scenarios, demonstrating its ability to enhance semantic alignment, improve hallucination mitigation, and achieve superior performance metrics, such as Cosine Similarity and ROUGE-L scores. For example, for the LLaVA-v1.6-34B VLM, the cosine similarity between the generated text and the reference text has increased by 5.22%. The F1-score in ROUGE-L has increased by 39.91%, the Precision has increased by 55.80%, and the Recall has increased by 13.74%. This work underscores the potential of retrieval-augmented VLMs to advance the robustness and safety of autonomous driving in complex environments.
zh
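RAC3 的检索增强环节依赖统一语义空间中的余弦相似度匹配。下面给出这一环节的极简草图:将查询图文嵌入与场景库嵌入归一化后求内积,取 top-k 作为外部上下文拼入 VLM 的提示。嵌入模型在此用随机向量占位,维度与 k 均为假设值,仅作示意。

```python
import torch
import torch.nn.functional as F

def retrieve_similar_cases(query_emb: torch.Tensor, corpus_emb: torch.Tensor,
                           k: int = 3):
    """余弦相似度 top-k 检索:返回最相似场景的索引与相似度分数。"""
    q = F.normalize(query_emb, dim=-1)
    c = F.normalize(corpus_emb, dim=-1)
    sims = c @ q                      # (N,) 余弦相似度
    scores, idx = sims.topk(k)
    return idx, scores

if __name__ == "__main__":
    corpus_emb = torch.randn(1000, 512)   # 占位:场景库中 1000 个场景的嵌入
    query_emb = torch.randn(512)          # 占位:当前 corner case 的图文嵌入
    idx, scores = retrieve_similar_cases(query_emb, corpus_emb)
    print(idx.tolist(), [round(float(s), 3) for s in scores])
    # 检索到的场景描述可拼接进提示词,供 VLM 生成更有依据的回答
```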
[CV-158] Facial Surgery Preview Based on the Orthognathic Treatment Prediction
【速读】: 该论文旨在解决正颌手术咨询中患者对术后面部外观变化理解不足的问题,特别是现有可视化方法效率低且不准确的问题。解决方案的关键在于开发一个全自动的流程,通过引入新颖的美学损失函数(如嘴部凸度和不对称损失)和专门的参数化模型(如FLAME模型),生成高精度的术后3D面部预览,而无需额外的医学图像。此外,研究还提出了医学相关的损失函数和数据增强方案,以优化潜在代码预测网络并解决数据不足的问题。通过定量和定性评估,该算法展示了其有效性,用户研究表明医生和公众难以区分预测结果与实际术后效果。
链接: https://arxiv.org/abs/2412.11045
作者: Huijun Han,Congyi Zhang,Lifeng Zhu,Pradeep Singh,Richard Tai Chiu Hsung,Yiu Yan Leung,Taku Komura,Wenping Wang,Min Gu
机构: 未知
关键词: facial, Orthognathic surgery, facial appearance, Orthognathic surgery consultation, study
类目: Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC)
备注: 9 pages, 5 figures
点击查看摘要
Abstract:Orthognathic surgery consultation is essential to help patients understand the changes to their facial appearance after surgery. However, current visualization methods are often inefficient and inaccurate due to limited pre- and post-treatment data and the complexity of the treatment. To overcome these challenges, this study aims to develop a fully automated pipeline that generates accurate and efficient 3D previews of postsurgical facial appearances for patients with orthognathic treatment without requiring additional medical images. The study introduces novel aesthetic losses, such as mouth-convexity and asymmetry losses, to improve the accuracy of facial surgery prediction. Additionally, it proposes a specialized parametric model for 3D reconstruction of the patient, medical-related losses to guide latent code prediction network optimization, and a data augmentation scheme to address insufficient data. The study additionally employs FLAME, a parametric model, to enhance the quality of facial appearance previews by extracting facial latent codes and establishing dense correspondences between pre- and post-surgery geometries. Quantitative comparisons showed the algorithm’s effectiveness, and qualitative results highlighted accurate facial contour and detail predictions. A user study confirmed that doctors and the public could not distinguish between machine learning predictions and actual postoperative results. This study aims to offer a practical, effective solution for orthognathic surgery consultations, benefiting doctors and patients.
zh
[CV-159] SAM-IF: Leveraging SAM for Incremental Few-Shot Instance Segmentation
【速读】: 该论文试图解决在少量标注数据下进行增量式少样本实例分割的问题。解决方案的关键在于提出了SAM-IF方法,该方法通过引入多类别分类器并微调Segment Anything Model (SAM),使其专注于特定目标对象,从而实现类无关的实例分割。SAM-IF采用基于余弦相似度的分类器,能够在极少数据下高效适应新类别,并通过更新分类器权重而非重新训练解码器来支持增量学习。这种方法在需要特定对象分割且标注数据有限的情况下,相较于现有方法取得了更具竞争力的合理结果。
链接: https://arxiv.org/abs/2412.11034
作者: Xudong Zhou,Wenhao He
机构: Shanghai Jiao Tong University(上海交通大学); Noematrix Co., Ltd.(诺矩阵有限公司)
关键词: Segment Anything Model, leveraging the Segment, instance segmentation leveraging, Model, instance segmentation
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:We propose SAM-IF, a novel method for incremental few-shot instance segmentation leveraging the Segment Anything Model (SAM). SAM-IF addresses the challenges of class-agnostic instance segmentation by introducing a multi-class classifier and fine-tuning SAM to focus on specific target objects. To enhance few-shot learning capabilities, SAM-IF employs a cosine-similarity-based classifier, enabling efficient adaptation to novel classes with minimal data. Additionally, SAM-IF supports incremental learning by updating classifier weights without retraining the decoder. Our method achieves competitive but more reasonable results compared to existing approaches, particularly in scenarios requiring specific object segmentation with limited labeled data.
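The cosine-similarity-based classifier and the retrain-free incremental step described above follow a common pattern for few-shot and incremental heads. Below is a minimal PyTorch sketch of that pattern, with hypothetical names (`CosineClassifier`, `register_novel_class`) and shapes chosen for illustration; it is not the authors' released code.

```python
import torch
import torch.nn.functional as F

class CosineClassifier(torch.nn.Module):
    """Cosine-similarity classifier head: logits are scaled cosine similarities
    between L2-normalized embeddings and per-class prototype weights."""
    def __init__(self, feat_dim: int, num_classes: int, scale: float = 16.0):
        super().__init__()
        self.weight = torch.nn.Parameter(torch.randn(num_classes, feat_dim))
        self.scale = scale  # assumed temperature, not the paper's setting

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        feats = F.normalize(feats, dim=-1)           # (B, D)
        protos = F.normalize(self.weight, dim=-1)    # (C, D)
        return self.scale * feats @ protos.t()       # (B, C) cosine logits

    @torch.no_grad()
    def register_novel_class(self, support_feats: torch.Tensor) -> None:
        """Incremental step: append the mean support embedding of a novel class
        as its prototype, leaving existing class weights (and the decoder) untouched."""
        proto = F.normalize(support_feats.mean(dim=0, keepdim=True), dim=-1)
        self.weight = torch.nn.Parameter(torch.cat([self.weight.data, proto], dim=0))

# Toy usage: 512-d mask embeddings, 5 base classes, then one novel class from 3 shots.
head = CosineClassifier(feat_dim=512, num_classes=5)
head.register_novel_class(torch.randn(3, 512))
logits = head(torch.randn(2, 512))   # (2, 6)
```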
zh
[CV-160] AURORA: Automated Unleash of 3D Room Outlines for VR Applications
【速读】: 该论文试图解决虚拟现实(VR)体验中真实世界细节复制的劳动密集型问题,提出了一种自动化方法以保持空间准确性和设计灵活性。解决方案的关键是AURORA方法,它利用RGB-D图像自动生成纯虚拟场景和结合现实元素的VR场景。AURORA通过集成图像处理、分割和3D重建等高级技术,从现实环境中高效创建逼真的室内设计,确保了最佳性能和精度,解决了自动化室内设计生成中的关键挑战。
链接: https://arxiv.org/abs/2412.11033
作者: Huijun Han,Yongqing Liang,Yuanlong Zhou,Wenping Wang,Edgar J. Rojas-Munoz,Xin Li
机构: Texas A&M University, College Station, Texas, USA(德克萨斯A&M大学,美国德克萨斯州学院站)
关键词: maintain spatial accuracy, provide design flexibility, accurately replicating real-world, Creating realistic, replicating real-world details
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 8 pages, 4 figures
点击查看摘要
Abstract:Creating realistic VR experiences is challenging due to the labor-intensive process of accurately replicating real-world details into virtual scenes, highlighting the need for automated methods that maintain spatial accuracy and provide design flexibility. In this paper, we propose AURORA, a novel method that leverages RGB-D images to automatically generate both purely virtual reality (VR) scenes and VR scenes combined with real-world elements. This approach can benefit designers by streamlining the process of converting real-world details into virtual scenes. AURORA integrates advanced techniques in image processing, segmentation, and 3D reconstruction to efficiently create realistic and detailed interior designs from real-world environments. The design of this integration ensures optimal performance and precision, addressing key challenges in automated indoor design generation by uniquely combining and leveraging the strengths of foundation models. We demonstrate the effectiveness of our approach through experiments, both on self-captured data and public datasets, showcasing its potential to enhance virtual reality (VR) applications by providing interior designs that conform to real-world positioning.
zh
[CV-161] SceneLLM : Implicit Language Reasoning in LLM for Dynamic Scene Graph Generation
【速读】: 该论文试图解决动态场景中复杂时空信息的解析问题,特别是为移动机器人、无人机和自动驾驶系统生成准确的场景图 (Scene Graph Generation, SGG)。解决方案的关键在于提出了一种名为 SceneLLM 的新框架,该框架利用大型语言模型 (Large Language Models, LLMs) 作为强大的场景分析工具。具体来说,SceneLLM 通过引入视频到语言 (Video-to-Language, V2L) 映射模块将视频帧转换为语言信号(场景标记),并设计了空间信息聚合 (Spatial Information Aggregation, SIA) 方案来编码空间数据。此外,利用最优传输 (Optimal Transport, OT) 生成隐式语言信号,并通过低秩适应 (Low-Rank Adaptation, LoRA) 对模型进行微调,以增强 LLM 处理隐式输入的能力。最终,使用基于 transformer 的 SGG 预测器来解码 LLM 的推理结果并预测语义三元组。
链接: https://arxiv.org/abs/2412.11026
作者: Hang Zhang,Zhuoling Li,Jun Liu
机构: 未知
关键词: make informed decisions, autonomous driving systems, intricate spatio-temporal information, Scene Graph Generation, crucial for mobile
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 27 pages, 4 figures
点击查看摘要
Abstract:Dynamic scenes contain intricate spatio-temporal information, crucial for mobile robots, UAVs, and autonomous driving systems to make informed decisions. Parsing these scenes into semantic triplets Subject-Predicate-Object for accurate Scene Graph Generation (SGG) is highly challenging due to the fluctuating spatio-temporal complexity. Inspired by the reasoning capabilities of Large Language Models (LLMs), we propose SceneLLM, a novel framework that leverages LLMs as powerful scene analyzers for dynamic SGG. Our framework introduces a Video-to-Language (V2L) mapping module that transforms video frames into linguistic signals (scene tokens), making the input more comprehensible for LLMs. To better encode spatial information, we devise a Spatial Information Aggregation (SIA) scheme, inspired by the structure of Chinese characters, which encodes spatial data into tokens. Using Optimal Transport (OT), we generate an implicit language signal from the frame-level token sequence that captures the video’s spatio-temporal information. To further improve the LLM’s ability to process this implicit linguistic input, we apply Low-Rank Adaptation (LoRA) to fine-tune the model. Finally, we use a transformer-based SGG predictor to decode the LLM’s reasoning and predict semantic triplets. Our method achieves state-of-the-art results on the Action Genome (AG) benchmark, and extensive experiments show the effectiveness of SceneLLM in understanding and generating accurate dynamic scene graphs.
zh
[CV-162] From Simple to Professional: A Combinatorial Controllable Image Captioning Agent
【速读】: 该论文试图解决图像描述任务中用户简单指令与专业级输出之间的差距问题。解决方案的关键在于Controllable Image Captioning Agent (CapAgent),它通过自动将用户提供的简单指令转化为详细的专业指令,利用多模态大语言模型 (MLLMs) 和外部工具(如目标检测工具和搜索引擎),确保生成的描述符合指定的情感、关键词、焦点和格式要求。CapAgent在每个步骤中透明地控制描述生成过程,并展示其推理和工具使用情况,从而增强用户信任和参与度。
链接: https://arxiv.org/abs/2412.11025
作者: Xinran Wang,Muxi Diao,Baoteng Li,Haiwen Zhang,Kongming Liang,Zhanyu Ma
机构: PRIS Lab, BUPT (PRIS实验室,北京邮电大学); BUPT (北京邮电大学)
关键词: Image Captioning Agent, Controllable Image Captioning, image captioning tasks, Controllable Image, innovative system designed
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: A technical report. Project: this https URL
点击查看摘要
Abstract:The Controllable Image Captioning Agent (CapAgent) is an innovative system designed to bridge the gap between user simplicity and professional-level outputs in image captioning tasks. CapAgent automatically transforms user-provided simple instructions into detailed, professional instructions, enabling precise and context-aware caption generation. By leveraging multimodal large language models (MLLMs) and external tools such as an object detection tool and search engines, the system ensures that captions adhere to specified guidelines, including sentiment, keywords, focus, and formatting. CapAgent transparently controls each step of the captioning process and showcases its reasoning and tool usage at every step, fostering user trust and engagement. The project code is available at this https URL.
zh
[CV-163] Exploring Diffusion and Flow Matching Under Generator Matching
【速读】: 该论文试图解决生成式建模中扩散模型(diffusion)和流匹配模型(flow matching)之间的理论比较问题。解决方案的关键在于将这两种模型统一在生成器匹配框架(Generator Matching framework)下,通过在相同的生成马尔可夫框架(generative Markov framework)中重新表述这两种模型,揭示了流匹配模型在实践中更为稳健的原因,并展示了如何通过混合确定性和随机性组件来构建新型模型类别。这一分析为当前最先进的生成建模范式之间的关系提供了新的视角。
链接: https://arxiv.org/abs/2412.11024
作者: Zeeshan Patel,James DeLoye,Lance Mathias
机构: 未知
关键词: flow matching, Generator Matching framework, Generator Matching, comprehensive theoretical comparison, diffusion and flow
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:In this paper, we present a comprehensive theoretical comparison of diffusion and flow matching under the Generator Matching framework. Despite their apparent differences, both diffusion and flow matching can be viewed under the unified framework of Generator Matching. By recasting both diffusion and flow matching under the same generative Markov framework, we provide theoretical insights into why flow matching models can be more robust empirically and how novel model classes can be constructed by mixing deterministic and stochastic components. Our analysis offers a fresh perspective on the relationships between state-of-the-art generative modeling paradigms.
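As background for how a single Markov-generator view can cover both model families, the LaTeX sketch below writes the infinitesimal generator of a continuous-time Markov process and specializes it to a deterministic flow (flow matching) and an Itô diffusion (diffusion models). This is standard notation assumed for illustration, not a reproduction of the paper's derivation.

```latex
% Infinitesimal generator of a continuous-time Markov process (X_t):
\begin{align}
(\mathcal{L}_t f)(x) &= \lim_{h \to 0} \frac{\mathbb{E}\,[\,f(X_{t+h}) \mid X_t = x\,] - f(x)}{h} \\
\intertext{Deterministic flow $\dot{x}_t = v_t(x_t)$ (flow matching): only a drift term,}
(\mathcal{L}_t f)(x) &= v_t(x) \cdot \nabla f(x) \\
\intertext{It\^o diffusion $\mathrm{d}X_t = b_t(X_t)\,\mathrm{d}t + \sigma_t\,\mathrm{d}W_t$ (diffusion models): drift plus a second-order term,}
(\mathcal{L}_t f)(x) &= b_t(x) \cdot \nabla f(x) + \tfrac{1}{2}\,\sigma_t^2\, \Delta f(x)
\end{align}
```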
zh
[CV-164] Exploring Enhanced Contextual Information for Video-Level Object Tracking AAAI2025
【速读】: 该论文试图解决现有视觉目标跟踪方法在视频级上下文信息利用上的不足,特别是由于仅使用少量标记导致的信息丢失问题。解决方案的关键在于提出了一种新的视频级视觉目标跟踪框架MCITrack,其核心是通过Mamba的隐藏状态持续记录和传输广泛的上下文信息。MCITrack的核心组件是上下文信息融合模块,包括mamba层和交叉注意力层。mamba层用于存储历史上下文信息,而交叉注意力层则将这些信息整合到当前视觉特征中,从而通过与主干的深度集成,增强模型在多层次上捕捉和利用上下文信息的能力。实验结果表明,MCITrack在多个基准测试中取得了竞争性甚至最先进的性能。
链接: https://arxiv.org/abs/2412.11023
作者: Ben Kang,Xin Chen,Simiao Lai,Yang Liu,Yi Liu,Dong Wang
机构: 未知
关键词: Contextual information, visual object tracking, information, Contextual Information Fusion, increasingly crucial
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: This paper was accepted by AAAI2025
点击查看摘要
Abstract:Contextual information at the video level has become increasingly crucial for visual object tracking. However, existing methods typically use only a few tokens to convey this information, which can lead to information loss and limit their ability to fully capture the context. To address this issue, we propose a new video-level visual object tracking framework called MCITrack. It leverages Mamba’s hidden states to continuously record and transmit extensive contextual information throughout the video stream, resulting in more robust object tracking. The core component of MCITrack is the Contextual Information Fusion module, which consists of the mamba layer and the cross-attention layer. The mamba layer stores historical contextual information, while the cross-attention layer integrates this information into the current visual features of each backbone block. This module enhances the model’s ability to capture and utilize contextual information at multiple levels through deep integration with the backbone. Experiments demonstrate that MCITrack achieves competitive performance across numerous benchmarks. For instance, it gets 76.6% AUC on LaSOT and 80.0% AO on GOT-10k, establishing a new state-of-the-art performance. Code and models are available at this https URL.
zh
[CV-165] On Distilling the Displacement Knowledge for Few-Shot Class-Incremental Learning
【速读】: 该论文试图解决少样本类增量学习 (Few-shot Class-Incremental Learning, FSCIL) 中的灾难性遗忘问题,特别是在数据分布不断变化和数据获取困难的情况下。解决方案的关键在于引入位移知识蒸馏 (Displacement Knowledge Distillation, DKD) 方法,该方法通过利用样本间的位移(而非相似性)来增强知识蒸馏过程中保留的信息密度,从而提升特征表示的质量。此外,论文提出了双蒸馏网络 (Dual Distillation Network, DDNet),该网络分别对基础类和新类采用传统的知识蒸馏和 DKD 方法,并通过实例感知的样本选择器在推理过程中动态调整双分支权重,以充分利用两种方法的互补优势。实验结果表明,DDNet 在多个基准测试中达到了最先进的性能,并验证了 DKD 方法的鲁棒性和广泛适用性。
链接: https://arxiv.org/abs/2412.11017
作者: Pengfei Fang,Yongchun Qin,Hui Xue
机构: 未知
关键词: Few-shot Class-Incremental Learning, Class-Incremental Learning, evolving data distributions, knowledge distillation, addresses the challenges
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Few-shot Class-Incremental Learning (FSCIL) addresses the challenges of evolving data distributions and the difficulty of data acquisition in real-world scenarios. To counteract the catastrophic forgetting typically encountered in FSCIL, knowledge distillation is employed as a way to maintain the knowledge from learned data distribution. Recognizing the limitations of generating discriminative feature representations in a few-shot context, our approach incorporates structural information between samples into knowledge distillation. This structural information serves as a remedy for the low quality of features. Diverging from traditional structured distillation methods that compute sample similarity, we introduce the Displacement Knowledge Distillation (DKD) method. DKD utilizes displacement rather than similarity between samples, incorporating both distance and angular information to significantly enhance the information density retained through knowledge distillation. Observing performance disparities in feature distribution between base and novel classes, we propose the Dual Distillation Network (DDNet). This network applies traditional knowledge distillation to base classes and DKD to novel classes, challenging the conventional integration of novel classes with base classes. Additionally, we implement an instance-aware sample selector during inference to dynamically adjust dual branch weights, thereby leveraging the complementary strengths of each approach. Extensive testing on three benchmarks demonstrates that DDNet achieves state-of-the-art results. Moreover, through rigorous experimentation and comparison, we establish the robustness and general applicability of our proposed DKD method.
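The displacement idea above, distilling pairwise difference vectors with both their magnitude and direction rather than scalar similarities, can be sketched as follows. The loss form and the hypothetical name `displacement_distillation_loss` are illustrative assumptions; the paper's exact formulation may differ.

```python
import torch
import torch.nn.functional as F

def pairwise_displacements(feats: torch.Tensor) -> torch.Tensor:
    """All pairwise displacement vectors d_ij = z_i - z_j for a batch of embeddings (B, D)."""
    return feats.unsqueeze(1) - feats.unsqueeze(0)           # (B, B, D)

def displacement_distillation_loss(student: torch.Tensor,
                                   teacher: torch.Tensor,
                                   eps: float = 1e-8) -> torch.Tensor:
    """Match student and teacher pairwise displacements in both distance and angle."""
    ds, dt = pairwise_displacements(student), pairwise_displacements(teacher.detach())
    # Distance term: compare displacement magnitudes.
    dist_loss = F.smooth_l1_loss(ds.norm(dim=-1), dt.norm(dim=-1))
    # Angular term: compare displacement directions via cosine similarity.
    cos = F.cosine_similarity(ds, dt, dim=-1, eps=eps)        # (B, B)
    mask = ~torch.eye(student.size(0), dtype=torch.bool)      # ignore i == j (zero vectors)
    angle_loss = (1.0 - cos[mask]).mean()
    return dist_loss + angle_loss

# Toy usage with random 64-d embeddings of an 8-sample batch.
loss = displacement_distillation_loss(torch.randn(8, 64, requires_grad=True), torch.randn(8, 64))
loss.backward()
```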
zh
[CV-166] owards Context-aware Convolutional Network for Image Restoration
【速读】: 该论文试图解决图像恢复 (Image Restoration, IR) 任务中现有卷积神经网络 (CNN) 在特征空间映射和长程上下文信息捕捉方面的局限性。解决方案的关键在于提出了两个核心模块:一是高效的残差星模块 (Efficient Residual Star Module, ERSM),通过上下文感知的“星操作”将特征映射到高维非线性空间,增强表示学习能力;二是大动态集成模块 (Large Dynamic Integration Module, LDIM),具有极大的感受野,能够动态高效地整合更多上下文信息,从而显著提升重建性能。通过将这两个模块集成到U形网络结构中,构建了上下文感知的卷积网络 (CCNet),在多个图像恢复任务中实现了优于现有最先进方法的性能。
链接: https://arxiv.org/abs/2412.11008
作者: Fangwei Hao,Ji Du,Weiyun Liang,Jing Xu,Xiaoxuan Xu
机构: 未知
关键词: corrupted observation, recover a high-quality, non-linear feature spaces, long-standing task, Image restoration
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Image restoration (IR) is a long-standing task to recover a high-quality image from its corrupted observation. Recently, transformer-based algorithms and some attention-based convolutional neural networks (CNNs) have presented promising results on several IR tasks. However, existing convolutional residual building modules for IR encounter limited ability to map inputs into high-dimensional and non-linear feature spaces, and their local receptive fields have difficulty in capturing long-range context information like Transformer. Besides, CNN-based attention modules for IR either face static abundant parameters or have limited receptive fields. To address the first issue, we propose an efficient residual star module (ERSM) that includes context-aware “star operation” (element-wise multiplication) to contextually map features into exceedingly high-dimensional and non-linear feature spaces, which greatly enhances representation learning. To further boost the extraction of contextual information, as for the second issue, we propose a large dynamic integration module (LDIM) which possesses an extremely large receptive field. Thus, LDIM can dynamically and efficiently integrate more contextual information that helps to further significantly improve the reconstruction performance. Integrating ERSM and LDIM into an U-shaped backbone, we propose a context-aware convolutional network (CCNet) with powerful learning ability for contextual high-dimensional mapping and abundant contextual information. Extensive experiments show that our CCNet with low model complexity achieves superior performance compared to other state-of-the-art IR methods on several IR tasks, including image dehazing, image motion deblurring, and image desnowing.
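The context-aware "star operation" referenced in the abstract is element-wise multiplication of two learned projections, which implicitly lifts features into a much higher-dimensional nonlinear space. The block below is a generic, hypothetical rendering of a residual star module (names and layer sizes are assumptions), not the paper's exact ERSM or LDIM design.

```python
import torch
import torch.nn as nn

class ResidualStarBlock(nn.Module):
    """Residual block built around the 'star operation' f1(x) * f2(x)."""
    def __init__(self, channels: int, expansion: int = 4):
        super().__init__()
        hidden = channels * expansion
        self.dw = nn.Conv2d(channels, channels, 7, padding=3, groups=channels)  # depth-wise context
        self.f1 = nn.Conv2d(channels, hidden, 1)   # two parallel point-wise projections
        self.f2 = nn.Conv2d(channels, hidden, 1)
        self.act = nn.ReLU6()
        self.proj = nn.Conv2d(hidden, channels, 1) # back to the residual width

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        y = self.dw(x)
        # Star operation: the element-wise product of the two branches acts like a
        # kernel-style mapping into a high-dimensional, nonlinear feature space.
        y = self.act(self.f1(y)) * self.f2(y)
        return x + self.proj(y)

# Toy usage on a 32-channel feature map.
out = ResidualStarBlock(32)(torch.randn(1, 32, 64, 64))   # (1, 32, 64, 64)
```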
zh
[CV-167] RapidNet: Multi-Level Dilated Convolution Based Mobile Backbone WACV2025
【速读】: 该论文试图解决在移动设备上实现高效计算机视觉任务的问题,特别是针对Vision Transformer (ViT)和混合模型在计算成本和速度上的不足。解决方案的关键在于提出了多级空洞卷积 (Multi-Level Dilated Convolutions),通过这种结构设计,不仅扩大了理论感受野,还实现了短程和长程特征的有效交互,从而在保持高精度的同时显著提升了推理速度。实验结果表明,所提出的纯卷积神经网络 (CNN) 架构在图像分类、目标检测、实例分割和语义分割任务中,均优于现有的最先进 (SOTA) 移动CNN、ViT、ViG及混合模型。
链接: https://arxiv.org/abs/2412.10995
作者: Mustafa Munir,Md Mostafijur Rahman,Radu Marculescu
机构: The University of Texas at Austin (德克萨斯大学奥斯汀分校)
关键词: dominated computer vision, recent years, Multi-Level Dilated Convolutions, Vision transformers, Vision
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Accepted in 2025 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV 2025)
点击查看摘要
Abstract:Vision transformers (ViTs) have dominated computer vision in recent years. However, ViTs are computationally expensive and not well suited for mobile devices; this led to the prevalence of convolutional neural network (CNN) and ViT-based hybrid models for mobile vision applications. Recently, Vision GNN (ViG) and CNN hybrid models have also been proposed for mobile vision tasks. However, all of these methods remain slower compared to pure CNN-based models. In this work, we propose Multi-Level Dilated Convolutions to devise a purely CNN-based mobile backbone. Using Multi-Level Dilated Convolutions allows for a larger theoretical receptive field than standard convolutions. Different levels of dilation also allow for interactions between the short-range and long-range features in an image. Experiments show that our proposed model outperforms state-of-the-art (SOTA) mobile CNN, ViT, ViG, and hybrid architectures in terms of accuracy and/or speed on image classification, object detection, instance segmentation, and semantic segmentation. Our fastest model, RapidNet-Ti, achieves 76.3% top-1 accuracy on ImageNet-1K with 0.9 ms inference latency on an iPhone 13 mini NPU, which is faster and more accurate than MobileNetV2x1.4 (74.7% top-1 with 1.0 ms latency). Our work shows that pure CNN architectures can beat SOTA hybrid and ViT models in terms of accuracy and speed when designed properly.
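Multi-level dilated convolution here means running parallel convolutions at several dilation rates so that short-range and long-range responses interact while the theoretical receptive field grows. The sketch below, with the hypothetical module name `MultiLevelDilatedConv` and assumed dilation rates, shows the generic pattern rather than RapidNet's actual block layout.

```python
import torch
import torch.nn as nn

class MultiLevelDilatedConv(nn.Module):
    """Parallel depth-wise convolutions with increasing dilation, fused by a 1x1 conv."""
    def __init__(self, channels: int, dilations=(1, 2, 3)):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Conv2d(channels, channels, kernel_size=3, padding=d, dilation=d,
                      groups=channels, bias=False)
            for d in dilations
        ])
        self.fuse = nn.Sequential(nn.Conv2d(channels, channels, 1, bias=False),
                                  nn.BatchNorm2d(channels), nn.GELU())

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Summing branches lets short-range (dilation 1) and long-range (larger
        # dilation) responses interact at every spatial position.
        y = sum(branch(x) for branch in self.branches)
        return x + self.fuse(y)

feat = torch.randn(1, 64, 56, 56)
print(MultiLevelDilatedConv(64)(feat).shape)   # torch.Size([1, 64, 56, 56])
```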
zh
[CV-168] Point Cloud to Mesh Reconstruction: A Focus on Key Learning-Based Paradigms
【速读】: 该论文旨在解决从点云数据中重建网格(mesh reconstruction)的问题,这一任务在机器人、自主系统和医学成像等领域具有重要意义。论文的关键在于对当前最先进的基于学习的网格重建方法进行了系统分类和深入探讨,将其分为五大范式:PointNet系列、自编码器架构、基于形变的方法、点移动技术和基于原语的方法。通过对这些方法的详细分析和比较,论文为研究人员和从业者提供了全面的指南,帮助他们理解和选择适合的网格重建技术。研究表明,这些基于学习的方法在细节和效率上往往优于传统技术,展现了其变革性的潜力。
链接: https://arxiv.org/abs/2412.10977
作者: Fatima Zahra Iguenfer,Achraf Hsain,Hiba Amissa,Yousra Chtouki
机构: 未知
关键词: Reconstructing meshes, autonomous systems, medical imaging, meshes from point, point clouds
类目: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
备注:
点击查看摘要
Abstract:Reconstructing meshes from point clouds is an important task in fields such as robotics, autonomous systems, and medical imaging. This survey examines state-of-the-art learning-based approaches to mesh reconstruction, categorizing them into five paradigms: PointNet family, autoencoder architectures, deformation-based methods, point-move techniques, and primitive-based approaches. Each paradigm is explored in depth, detailing the primary approaches and their underlying methodologies. By comparing these techniques, our study serves as a comprehensive guide, and equips researchers and practitioners with the knowledge to navigate the landscape of learning-based mesh reconstruction techniques. The findings underscore the transformative potential of these methods, which often surpass traditional techniques in allowing detailed and efficient reconstructions.
zh
[CV-169] DCSEG: Decoupled 3D Open-Set Segmentation using Gaussian Splatting
【速读】: 该论文试图解决开放集3D分割问题,特别是在机器人和增强/虚拟现实应用中的重要性。解决方案的关键在于提出了一种解耦的3D分割流程,确保了模块化和适应性,能够适应新的3D表示和语义分割基础模型。具体来说,该流程首先通过3D场景重建生成类别无关的掩码,然后利用类别感知的2D基础模型为这些3D掩码添加类别标注。通过结合3D高斯Splatting和不同的2D分割模型,该方法在性能上优于专门设计的方案,同时显著提高了模块化程度。
链接: https://arxiv.org/abs/2412.10972
作者: Luis Wiedmann,Luca Wiehe,David Rozenberszki
机构: Technical University of Munich (慕尼黑工业大学)
关键词: virtual reality applications, multiple downstream robotics, robotics and augmented, virtual reality, reality applications
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Open-set 3D segmentation represents a major point of interest for multiple downstream robotics and augmented/virtual reality applications. Recent advances introduce 3D Gaussian Splatting as a computationally efficient representation of the underlying scene. They enable the rendering of novel views while achieving real-time display rates and matching the quality of computationally far more expensive methods. We present a decoupled 3D segmentation pipeline to ensure modularity and adaptability to novel 3D representations and semantic segmentation foundation models. The pipeline proposes class-agnostic masks based on a 3D reconstruction of the scene. Given the resulting class-agnostic masks, we use a class-aware 2D foundation model to add class annotations to the 3D masks. We test this pipeline with 3D Gaussian Splatting and different 2D segmentation models and achieve better performance than more tailored approaches while also significantly increasing the modularity.
zh
[CV-170] SoftVQ-VAE: Efficient 1-Dimensional Continuous Tokenizer
【速读】: 该论文试图解决生成式模型训练中图像标记化(tokenization)与高压缩比之间的效率问题。解决方案的关键在于提出了SoftVQ-VAE,这是一种连续图像标记器,通过利用软分类后验(soft categorical posteriors)将多个码字(codewords)聚合到每个潜在标记(latent token)中,显著提升了潜在空间的表示能力。SoftVQ-VAE在Transformer架构中实现了对256x256和512x512图像的高效压缩,仅需32或64个一维标记。该方法不仅实现了高质量的重建,还显著提升了基于去噪生成模型的图像生成速度,最高提升了18倍(256x256图像)和55倍(512x512图像),同时保持了竞争力的FID分数(1.78和2.21)。此外,SoftVQ-VAE通过减少训练迭代次数(2.3倍)提高了训练效率,且其全可微分设计和语义丰富的潜在空间确保了生成质量不受影响。
链接: https://arxiv.org/abs/2412.10958
作者: Hao Chen,Ze Wang,Xiang Li,Ximeng Sun,Fangyi Chen,Jiang Liu,Jindong Wang,Bhiksha Raj,Zicheng Liu,Emad Barsoum
机构: Carnegie Mellon University; AMD; William & Mary; MBZUAI
关键词: high compression ratios, compression ratios remains, high compression, compression ratios, ratios remains
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Code and model: this https URL
点击查看摘要
Abstract:Efficient image tokenization with high compression ratios remains a critical challenge for training generative models. We present SoftVQ-VAE, a continuous image tokenizer that leverages soft categorical posteriors to aggregate multiple codewords into each latent token, substantially increasing the representation capacity of the latent space. When applied to Transformer-based architectures, our approach compresses 256x256 and 512x512 images using as few as 32 or 64 1-dimensional tokens. Not only does SoftVQ-VAE show consistent and high-quality reconstruction; more importantly, it also achieves state-of-the-art and significantly faster image generation results across different denoising-based generative models. Remarkably, SoftVQ-VAE improves inference throughput by up to 18x for generating 256x256 images and 55x for 512x512 images while achieving competitive FID scores of 1.78 and 2.21 for SiT-XL. It also improves the training efficiency of the generative models by reducing the number of training iterations by 2.3x while maintaining comparable performance. With its fully-differentiable design and semantic-rich latent space, our experiments demonstrate that SoftVQ-VAE achieves efficient tokenization without compromising generation quality, paving the way for more efficient generative models. Code and model are released.
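The central mechanism, aggregating multiple codewords into each latent token through a soft categorical posterior, can be illustrated in a few lines: instead of hard nearest-codeword assignment as in standard VQ, every token becomes a softmax-weighted mixture of the codebook and the whole step stays differentiable. The `SoftVQ` module, codebook size, and temperature below are illustrative assumptions, not the released implementation.

```python
import torch
import torch.nn as nn

class SoftVQ(nn.Module):
    """Soft vector quantization: each latent token is a convex combination of codewords."""
    def __init__(self, num_codes: int = 1024, dim: int = 256, temperature: float = 0.1):
        super().__init__()
        self.codebook = nn.Parameter(torch.randn(num_codes, dim) * 0.02)
        self.temperature = temperature

    def forward(self, z: torch.Tensor) -> torch.Tensor:
        # z: (B, T, D) continuous encoder outputs, T = number of 1-dimensional latent tokens.
        logits = z @ self.codebook.t() / self.temperature      # (B, T, K) token-codeword scores
        posterior = logits.softmax(dim=-1)                     # soft categorical posterior
        # Aggregate many codewords into each token (fully differentiable, no straight-through).
        return posterior @ self.codebook                       # (B, T, D)

tokens = SoftVQ()(torch.randn(2, 32, 256))   # e.g. an image compressed to 32 latent tokens
print(tokens.shape)                          # torch.Size([2, 32, 256])
```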
zh
[CV-171] Deep Learning-Based Noninvasive Screening of Type 2 Diabetes with Chest X-ray Images and Electronic Health Records
【速读】: 该论文试图解决的是2型糖尿病(T2DM)早期检测的挑战,主要由于其无症状起病和依赖于不理想的临床诊断测试,导致其在全球范围内的广泛流行。解决方案的关键在于利用深度学习模型整合多模态数据,包括胸部X光片(CXR)、电子健康记录(EHRs)和心电图信号,以实现对患者健康状况的更全面理解。研究采用了两种深度融合范式:基于早期融合的多模态Transformer和模块化联合融合的ResNet-LSTM架构。最终,ResNet-LSTM模型在仅使用9863个训练样本的情况下,实现了0.86的AUROC,比仅使用CXR的基线模型提高了2.3%,证明了CXR在多模态框架中用于早期识别高风险个体的诊断价值。
链接: https://arxiv.org/abs/2412.10955
作者: Sanjana Gundapaneni,Zhuo Zhi,Miguel Rodrigues
机构: University College London (伦敦大学学院)
关键词: widespread global prevalence, clinical diagnostic tests, suboptimal clinical diagnostic, diabetes mellitus, global prevalence
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:The imperative for early detection of type 2 diabetes mellitus (T2DM) is challenged by its asymptomatic onset and dependence on suboptimal clinical diagnostic tests, contributing to its widespread global prevalence. While research into noninvasive T2DM screening tools has advanced, conventional machine learning approaches remain limited to unimodal inputs due to extensive feature engineering requirements. In contrast, deep learning models can leverage multimodal data for a more holistic understanding of patients’ health conditions. However, the potential of chest X-ray (CXR) imaging, one of the most commonly performed medical procedures, remains underexplored. This study evaluates the integration of CXR images with other noninvasive data sources, including electronic health records (EHRs) and electrocardiography signals, for T2DM detection. Utilising datasets meticulously compiled from the MIMIC-IV databases, we investigated two deep fusion paradigms: an early fusion-based multimodal transformer and a modular joint fusion ResNet-LSTM architecture. The end-to-end trained ResNet-LSTM model achieved an AUROC of 0.86, surpassing the CXR-only baseline by 2.3% with just 9863 training samples. These findings demonstrate the diagnostic value of CXRs within multimodal frameworks for identifying at-risk individuals early. Additionally, the dataset preprocessing pipeline has also been released to support further research in this domain.
zh
[CV-172] SegHeD: Segmentation of Heterogeneous Data for Multiple Sclerosis Lesions with Anatomical Constraints and Lesion-aware Augmentation
【速读】: 该论文试图解决多发性硬化症(MS)病变在脑部磁共振(MR)图像中的分割问题,特别是在面对多源、异构数据集时,如何构建一个统一的分割模型。解决方案的关键在于提出了SegHeD+模型,该模型能够处理多种数据集和任务,适应异构输入数据,并实现对所有病变、新病变和消失病变的分割。通过整合纵向、解剖和体积约束等领域的知识,并采用病变级别的数据增强技术,SegHeD+在五个MS数据集上展示了优越的分割性能,超越了现有的先进方法。
链接: https://arxiv.org/abs/2412.10946
作者: Berke Doga Basaran,Paul M. Matthews,Wenjia Bai
机构: 未知
关键词: brain magnetic resonance, monitoring multiple sclerosis, Assessing lesions, magnetic resonance, images is essential
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: 20 pages, 6 figures, 6 tables
点击查看摘要
Abstract:Assessing lesions and tracking their progression over time in brain magnetic resonance (MR) images is essential for diagnosing and monitoring multiple sclerosis (MS). Machine learning models have shown promise in automating the segmentation of MS lesions. However, training these models typically requires large, well-annotated datasets. Unfortunately, MS imaging datasets are often limited in size, spread across multiple hospital sites, and exhibit different formats (such as cross-sectional or longitudinal) and annotation styles. This data diversity presents a significant obstacle to developing a unified model for MS lesion segmentation. To address this issue, we introduce SegHeD+, a novel segmentation model that can handle multiple datasets and tasks, accommodating heterogeneous input data and performing segmentation for all lesions, new lesions, and vanishing lesions. We integrate domain knowledge about MS lesions by incorporating longitudinal, anatomical, and volumetric constraints into the segmentation model. Additionally, we perform lesion-level data augmentation to enlarge the training set and further improve segmentation performance. SegHeD+ is evaluated on five MS datasets and demonstrates superior performance in segmenting all, new, and vanishing lesions, surpassing several state-of-the-art methods in the field.
zh
[CV-173] Unconstrained Salient and Camouflaged Object Detection
【速读】: 该论文试图解决在不受约束的场景中同时检测显著物体(Salient Object Detection, SOD)和伪装物体(Camouflaged Object Detection, COD)的问题。现有研究通常分别处理仅包含显著物体或仅包含伪装物体的场景,而忽略了两者共存或均不存在的情况。为此,论文提出了一个名为Unconstrained Salient and Camouflaged Object Detection (USCOD)的基准,并构建了一个大规模数据集CS12K,涵盖了四种不同类型的场景。解决方案的关键在于提出了一种名为USCNet的基线模型,该模型通过引入属性区分学习模块(APG module),将属性区分与掩码重建解耦,从而有效区分同一场景中的显著物体和伪装物体。此外,论文还设计了Camouflage-Saliency Confusion Score (CSCS)指标,用于评估模型在区分这两种物体时的性能。
链接: https://arxiv.org/abs/2412.10943
作者: Zhangjun Zhou,Yiping Li,Chunlin Zhong,Jianuo Huang,Jialun Pei,He Tang
机构: Huazhong University of Science and Technology(华中科技大学); The Chinese University of Hong Kong(香港中文大学)
关键词: camouflaged objects, Camouflaged Object Detection, Visual Salient Object, Camouflaged, salient and camouflaged
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 24 pages, 12 figures
点击查看摘要
Abstract:Visual Salient Object Detection (SOD) and Camouflaged Object Detection (COD) are two interrelated yet distinct tasks. Both tasks model the human visual system’s ability to perceive the presence of objects. The traditional SOD datasets and methods are designed for scenes where only salient objects are present, similarly, COD datasets and methods are designed for scenes where only camouflaged objects are present. However, scenes where both salient and camouflaged objects coexist, or where neither is present, are not considered. This simplifies the existing research on SOD and COD. In this paper, to explore a more generalized approach to SOD and COD, we introduce a benchmark called Unconstrained Salient and Camouflaged Object Detection (USCOD), which supports the simultaneous detection of salient and camouflaged objects in unconstrained scenes, regardless of their presence. Towards this, we construct a large-scale dataset, CS12K, that encompasses a variety of scenes, including four distinct types: only salient objects, only camouflaged objects, both, and neither. In our benchmark experiments, we identify a major challenge in USCOD: distinguishing between salient and camouflaged objects within the same scene. To address this challenge, we propose USCNet, a baseline model for USCOD that decouples the learning of attribute distinction from mask reconstruction. The model incorporates an APG module, which learns both sample-generic and sample-specific features to enhance the attribute differentiation between salient and camouflaged objects. Furthermore, to evaluate models’ ability to distinguish between salient and camouflaged objects, we design a metric called Camouflage-Saliency Confusion Score (CSCS). The proposed method achieves state-of-the-art performance on the newly introduced USCOD task. The code and dataset will be publicly available.
zh
[CV-174] Meta-evaluating stability measures: MAX-Senstivity AVG-Sensitivity
【速读】: 该论文试图解决可解释人工智能(eXplainable Artificial Intelligence, XAI)系统中的稳定性评估问题,特别是如何客观评估XAI的鲁棒性(robustness)。解决方案的关键在于提出了一种新的元评估方法,即分析评估指标的正确性。具体而言,论文提出了两个新的测试(AVG-Sensitivity和MAX-Sensitivity),用于评估不同稳定性指标的可靠性。通过测试这些指标在完美解释、随机解释和预测情况下的表现,研究发现这些指标无法有效识别随机解释的错误,从而揭示了其整体不可靠性。
链接: https://arxiv.org/abs/2412.10942
作者: Miquel Miró-Nicolau,Antoni Jaume-i-Capó,Gabriel Moyà-Alcover
机构: UGiVIA Research Group; Laboratory for Artificial Intelligence Applications (LAIA@UIB); University of the Balearic Islands
关键词: eXplainable Artificial Intelligence, Artificial Intelligence, eXplainable Artificial, systems has introduced, introduced a set
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:The use of eXplainable Artificial Intelligence (XAI) systems has introduced a set of challenges that need resolution. XAI robustness, or stability, has been one of the community's goals from the beginning. Multiple authors have proposed evaluating this property with objective evaluation measures. Nonetheless, many questions remain. In this work, we propose a novel approach to meta-evaluate these metrics, i.e., to analyze the correctness of the evaluators. We propose two new tests that allowed us to evaluate two different stability measures: AVG-Sensitivity and MAX-Sensitivity. We tested their reliability in the presence of perfect and robust explanations generated with a Decision Tree, as well as completely random explanations and predictions. The results showed that the metrics were unable to identify the random explanations as erroneous, highlighting their overall unreliability.
zh
[CV-175] Progressive Compression with Universally Quantized Diffusion Models ICLR2025
【速读】: 该论文试图解决扩散概率模型在生成建模任务中的应用问题,特别是如何将这些模型用于渐进式编码(progressive coding),以实现增量传输和解码,从而逐步提高重建质量。解决方案的关键在于提出了一种新的扩散模型,其前向过程采用均匀噪声(uniform noise),使得负变分证据下界(negative ELBO)对应于使用通用量化(universal quantization)的端到端压缩成本。这一方法在图像压缩任务中取得了有竞争力的率失真(rate-distortion)和率真实性(rate-realism)结果,展示了神经编解码器(neural codecs)在实际部署中的潜力。
链接: https://arxiv.org/abs/2412.10935
作者: Yibo Yang,Justus C. Will,Stephan Mandt
机构: University of California, Irvine (加州大学欧文分校)
关键词: inverse problem solving, achieved mainstream success, generative modeling tasks, problem solving, Diffusion probabilistic models
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注: 20 pages, 10 figures, submitted to ICLR 2025
点击查看摘要
Abstract:Diffusion probabilistic models have achieved mainstream success in many generative modeling tasks, from image generation to inverse problem solving. A distinct feature of these models is that they correspond to deep hierarchical latent variable models optimizing a variational evidence lower bound (ELBO) on the data likelihood. Drawing on a basic connection between likelihood modeling and compression, we explore the potential of diffusion models for progressive coding, resulting in a sequence of bits that can be incrementally transmitted and decoded with progressively improving reconstruction quality. Unlike prior work based on Gaussian diffusion or conditional diffusion models, we propose a new form of diffusion model with uniform noise in the forward process, whose negative ELBO corresponds to the end-to-end compression cost using universal quantization. We obtain promising first results on image compression, achieving competitive rate-distortion and rate-realism results on a wide range of bit-rates with a single model, bringing neural codecs a step closer to practical deployment.
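Universal (dithered) quantization, which the abstract connects to the uniform-noise forward process, works by having encoder and decoder share a uniform dither: the encoder transmits the rounded symbol and the reconstruction error is uniform noise independent of the input. The snippet below is standard background for that operation, not the paper's codec.

```python
import numpy as np

def universal_quantize(x: np.ndarray, step: float, rng: np.random.Generator):
    """Dithered (universal) quantization with a shared dither u ~ Uniform(-step/2, step/2)."""
    u = rng.uniform(-step / 2, step / 2, size=x.shape)   # shared via a common seed in practice
    k = np.round((x + u) / step)                          # integer symbols actually transmitted
    x_hat = k * step - u                                  # decoder subtracts the same dither
    return k.astype(np.int64), x_hat

rng = np.random.default_rng(0)
x = rng.normal(size=10_000)
_, x_hat = universal_quantize(x, step=0.5, rng=rng)
# Reconstruction error is uniform on (-step/2, step/2), independent of x:
print(np.abs(x_hat - x).max() <= 0.25)   # True
```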
zh
[CV-176] Video Representation Learning with Joint-Embedding Predictive Architectures
【速读】: 该论文试图解决视频表示学习中的表示崩溃问题,解决方案的关键在于提出了Video JEPA with Variance-Covariance Regularization (VJ-VCR),通过引入方差和协方差正则化来避免表示崩溃。该方法不仅在自监督视频表示学习中表现出色,还能捕捉输入数据的高层次抽象信息,并在需要理解视频中物体运动动态的下游任务中优于生成式基线方法。此外,论文还探讨了在非确定性环境下,如何通过引入潜在变量来捕捉未来不确定性信息的不同方法。
链接: https://arxiv.org/abs/2412.10925
作者: Katrina Drozdov,Ravid Shwartz-Ziv,Yann LeCun
机构: Center for Data Science(数据科学中心); New York University(纽约大学); Meta FAIR(Meta FAIR)
关键词: increasingly important topic, machine learning research, Video representation learning, present Video JEPA, increasingly important
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:Video representation learning is an increasingly important topic in machine learning research. We present Video JEPA with Variance-Covariance Regularization (VJ-VCR): a joint-embedding predictive architecture for self-supervised video representation learning that employs variance and covariance regularization to avoid representation collapse. We show that hidden representations from our VJ-VCR contain abstract, high-level information about the input data. Specifically, they outperform representations obtained from a generative baseline on downstream tasks that require understanding of the underlying dynamics of moving objects in the videos. Additionally, we explore different ways to incorporate latent variables into the VJ-VCR framework that capture information about uncertainty in the future in non-deterministic settings.
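Variance-covariance regularization of the kind named here (in the spirit of VICReg) keeps the per-dimension standard deviation above a margin and drives off-diagonal covariance toward zero so that embeddings cannot collapse to a constant. The function below is a generic sketch of those two terms; the coefficients are illustrative assumptions rather than the paper's settings.

```python
import torch
import torch.nn.functional as F

def variance_covariance_loss(z: torch.Tensor, gamma: float = 1.0, eps: float = 1e-4):
    """Variance + covariance regularizers on a batch of embeddings z of shape (B, D)."""
    z = z - z.mean(dim=0)                                  # center each dimension
    # Variance term: hinge loss keeping each dimension's std above the margin gamma.
    std = torch.sqrt(z.var(dim=0) + eps)
    var_loss = F.relu(gamma - std).mean()
    # Covariance term: penalize off-diagonal entries of the covariance matrix.
    cov = (z.t() @ z) / (z.size(0) - 1)                    # (D, D)
    off_diag = cov - torch.diag(torch.diagonal(cov))
    cov_loss = off_diag.pow(2).sum() / z.size(1)
    return var_loss + cov_loss

print(variance_covariance_loss(torch.randn(256, 128)))
```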
zh
[CV-177] Do large language vision models understand 3D shapes?
【速读】: 该论文试图解决的问题是测试大型视觉语言模型(LVLM)是否真正理解三维形状(3D shapes)。解决方案的关键在于通过测试模型在不同方向和材质/纹理条件下识别和匹配相同三维形状的能力。研究使用计算机生成图像(CGI)创建了大量多样化的物体、材质和场景,结果表明模型在匹配三维形状方面虽显著优于随机猜测,但仍远不及人类。这表明模型已获得一定的三维形状抽象理解能力,但在同时改变物体材质和方向的情况下,模型表现显著低于人类。
链接: https://arxiv.org/abs/2412.10908
作者: Sagi Eppel
机构: 未知
关键词: Large vision language, Large vision, vision language models, general visual understanding, vision language
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Large vision language models (LVLMs) are the leading AI approach for achieving a general visual understanding of the world. Models such as GPT, Claude, Gemini, and LLama can use images to understand and analyze complex visual scenes. 3D objects and shapes are the basic building blocks of the world, and recognizing them is a fundamental part of human perception. The goal of this work is to test whether LVLMs truly understand 3D shapes by testing the models' ability to identify and match objects that share the exact same 3D shape but differ in orientation and material/texture. Test images were created using CGI with a huge number of highly diverse objects, materials, and scenes. The results show that the models' ability to match 3D shapes is significantly below that of humans but much higher than random guessing, suggesting that the models have gained some abstract understanding of 3D shapes yet still trail far behind humans in this task. In particular, the models can easily identify the same object under a different orientation and can match identical 3D shapes of the same orientation but with different material textures. However, when both the object material and the orientation are changed, all models perform poorly relative to humans.
zh
[CV-178] Enhancing Road Crack Detection Accuracy with BsS-YOLO: Optimizing Feature Fusion and Attention Mechanisms
【速读】: 该论文试图解决现有道路裂缝检测方法在面对多尺度目标、复杂背景和环境适应性差等问题时的不足。解决方案的关键在于提出了BsS-YOLO模型,通过优化多尺度特征融合机制,包括增强的Path Aggregation Network (PAN)和Bidirectional Feature Pyramid Network (BiFPN),并引入加权特征融合以提升特征表示能力。此外,模型在骨干网络中嵌入了Simple and Effective Attention Mechanism (SimAM),通过空间和通道注意力机制提高检测精度。检测层则集成了Shuffle Attention机制,通过跨通道重排和混合特征来进一步优化关键特征表示,从而显著提升了检测的准确性和鲁棒性。
链接: https://arxiv.org/abs/2412.10902
作者: Jiaze Tang,Angzehua Feng,Vladimir Korkhov,Yuxi Pu
机构: 未知
关键词: offering significant economic, significant economic benefits, extending road lifespan, Path Aggregation Network, infrastructure preservation
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Effective road crack detection is crucial for road safety, infrastructure preservation, and extending road lifespan, offering significant economic benefits. However, existing methods struggle with varied target scales, complex backgrounds, and low adaptability to different environments. This paper presents the BsS-YOLO model, which optimizes multi-scale feature fusion through an enhanced Path Aggregation Network (PAN) and Bidirectional Feature Pyramid Network (BiFPN). The incorporation of weighted feature fusion improves feature representation, boosting detection accuracy and robustness. Furthermore, a Simple and Effective Attention Mechanism (SimAM) within the backbone enhances precision via spatial and channel-wise attention. The detection layer integrates a Shuffle Attention mechanism, which rearranges and mixes features across channels, refining key representations and further improving accuracy. Experimental results show that BsS-YOLO achieves a 2.8% increase in mean average precision (mAP) for road crack detection, supporting its applicability in diverse scenarios, including urban road maintenance and highway inspections.
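SimAM, cited in the abstract as the backbone attention, is a parameter-free mechanism that reweights each activation by an energy term computed from channel-wise statistics. The sketch below follows the commonly used SimAM formulation; the regularization constant is an assumed default, not necessarily the paper's choice.

```python
import torch

def simam(x: torch.Tensor, lam: float = 1e-4) -> torch.Tensor:
    """Parameter-free SimAM attention for a feature map x of shape (B, C, H, W)."""
    b, c, h, w = x.shape
    n = h * w - 1
    # Squared deviation of every position from its channel mean.
    d = (x - x.mean(dim=(2, 3), keepdim=True)).pow(2)
    # Channel-wise variance estimate.
    v = d.sum(dim=(2, 3), keepdim=True) / n
    # Inverse energy: larger for positions that stand out from their channel statistics.
    e_inv = d / (4 * (v + lam)) + 0.5
    return x * torch.sigmoid(e_inv)

out = simam(torch.randn(1, 64, 40, 40))
print(out.shape)   # torch.Size([1, 64, 40, 40])
```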
zh
[CV-179] PEARL: Input-Agnostic Prompt Enhancement with Negative Feedback Regulation for Class-Incremental Learning AAAI-25
【速读】: 该论文试图解决类增量学习 (Class-incremental learning, CIL) 中由于过度依赖输入信息而导致的灾难性遗忘问题。解决方案的关键在于提出了一种基于预训练模型 (PTMs) 的新方法,称为“输入无关提示增强与负反馈调节 (Input-Agnostic Prompt Enhancement with Negative Feedback Regulation, PEARL)”。PEARL 通过引入一个输入无关的全局提示 (input-agnostic global prompt) 和自适应动量更新策略 (adaptive momentum update strategy),减少模型对数据分布的依赖,从而有效缓解灾难性遗忘。此外,负反馈调节机制 (negative feedback regulation) 解决了固定权重动量更新中的参数敏感性问题,并通过利用 CIL 中不同任务之间的关联性,持续增强提示以适应新任务。实验结果表明,该方法在六个基准测试中达到了最先进的性能。
链接: https://arxiv.org/abs/2412.10900
作者: Yongchun Qin,Pengfei Fang,Hui Xue
机构: 未知
关键词: Class-incremental learning, forgetting previously learned, aims to continuously, Negative Feedback Regulation, CIL
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by AAAI-25
点击查看摘要
Abstract:Class-incremental learning (CIL) aims to continuously introduce novel categories into a classification system without forgetting previously learned ones, thus adapting to evolving data distributions. Researchers are currently focusing on leveraging the rich semantic information of pre-trained models (PTMs) in CIL tasks. Prompt learning has been adopted in CIL for its ability to adjust data distribution to better align with pre-trained knowledge. This paper critically examines the limitations of existing methods from the perspective of prompt learning, which heavily rely on input information. To address this issue, we propose a novel PTM-based CIL method called Input-Agnostic Prompt Enhancement with Negative Feedback Regulation (PEARL). In PEARL, we implement an input-agnostic global prompt coupled with an adaptive momentum update strategy to reduce the model’s dependency on data distribution, thereby effectively mitigating catastrophic forgetting. Guided by negative feedback regulation, this adaptive momentum update addresses the parameter sensitivity inherent in fixed-weight momentum updates. Furthermore, it fosters the continuous enhancement of the prompt for new tasks by harnessing correlations between different tasks in CIL. Experiments on six benchmarks demonstrate that our method achieves state-of-the-art performance. The code is available at: this https URL.
zh
[CV-180] Zigzag Diffusion Sampling: The Path to Success Is Zigzag
【速读】: 该论文试图解决现有文本到图像扩散模型在处理复杂提示时难以同时保持高图像质量和与提示高度一致性的问题。解决方案的关键在于提出了一种名为Zigzag Diffusion Sampling (Z-Sampling)的新型采样方法,该方法通过逐步积累去噪和反演过程中的条件引导差距所捕获的语义信息,从而在整个生成过程中逐步引导潜在变量向期望方向发展,显著提升了生成质量和提示与图像的匹配度。Z-Sampling作为一种即插即用的方法,能够广泛应用于各种扩散模型,且具有较低的编码和计算成本。
链接: https://arxiv.org/abs/2412.10891
作者: Lichen Bai,Shitong Shao,Zikai Zhou,Zipeng Qi,Zhiqiang Xu,Haoyi Xiong,Zeke Xie
机构: xLeaF Lab, The Hong Kong University of Science and Technology (Guangzhou); Mohamed bin Zayed University of Artificial Intelligence; Baidu Inc
关键词: popular generative paradigm, Diffusion models, desired directions, popular generative, generative paradigm
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:
点击查看摘要
Abstract:Diffusion models, the most popular generative paradigm so far, can inject conditional information into the generation path to guide the latent towards desired directions. However, existing text-to-image diffusion models often fail to maintain high image quality and high prompt-image alignment for those challenging prompts. To mitigate this issue and enhance existing pretrained diffusion models, we mainly made three contributions in this paper. First, we theoretically and empirically demonstrate that the conditional guidance gap between the denoising and inversion processes captures prompt-related semantic information. Second, motivated by theoretical analysis, we derive Zigzag Diffusion Sampling (Z-Sampling), a novel sampling method that leverages the guidance gap to accumulate semantic information step-by-step throughout the entire generation process, leading to improved sampling results. Moreover, as a plug-and-play method, Z-Sampling can be generally applied to various diffusion models (e.g., accelerated ones and Transformer-based ones) with very limited coding and computational costs. Third, our extensive experiments demonstrate that Z-Sampling can generally and significantly enhance generation quality across various benchmark datasets, diffusion models, and performance evaluation metrics. For example, Z-Sampling can even make DreamShaper achieve the HPSv2 winning rate higher than 94% over the original results. Moreover, Z-Sampling can further enhance existing diffusion models combined with other orthogonal methods, including Diffusion-DPO.
zh
[CV-181] Heterogeneous Graph Transformer for Multiple Tiny Object Tracking in RGB-T Videos
【速读】: 该论文试图解决多目标跟踪中微小物体由于外观弱和特征有限而导致的跟踪难题。解决方案的关键在于提出了一个名为HGT-Track(基于异构图Transformer的多微小物体跟踪)的新框架,通过整合来自多个远程传感器的互补信息来提升跟踪性能。具体来说,该框架首先使用基于Transformer的编码器对不同模态的图像进行嵌入,然后利用异构图Transformer聚合多模态的空间和时间信息,生成检测和跟踪特征。此外,引入了一个目标重检测模块(ReDet)以确保跨模态的轨迹连续性。论文还首次提出了用于RGB-T融合多微小物体跟踪的基准数据集VT-Tiny-MOT,并通过实验验证了该方法在MOTA和ID-F1分数上的优越性能。
链接: https://arxiv.org/abs/2412.10861
作者: Qingyu Xu,Longguang Wang,Weidong Sheng,Yingqian Wang,Chao Xiao,Chao Ma,Wei An
机构: College of Electronic Science and Technology, National University of Defense Technology (NUDT)(电子科学与技术学院,国防科技大学); Aviation University of Air Force (空军航空大学)
关键词: highly challenging due, Heterogeneous Graph Transformer, tiny object tracking, tiny objects, Heterogeneous Graph
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: N/A
点击查看摘要
Abstract:Tracking multiple tiny objects is highly challenging due to their weak appearance and limited features. Existing multi-object tracking algorithms generally focus on single-modality scenes, and overlook the complementary characteristics of tiny objects captured by multiple remote sensors. To enhance tracking performance by integrating complementary information from multiple sources, we propose a novel framework called HGT-Track (Heterogeneous Graph Transformer based Multi-Tiny-Object Tracking). Specifically, we first employ a Transformer-based encoder to embed images from different modalities. Subsequently, we utilize Heterogeneous Graph Transformer to aggregate spatial and temporal information from multiple modalities to generate detection and tracking features. Additionally, we introduce a target re-detection module (ReDet) to ensure tracklet continuity by maintaining consistency across different modalities. Furthermore, this paper introduces the first benchmark VT-Tiny-MOT (Visible-Thermal Tiny Multi-Object Tracking) for RGB-T fused multiple tiny object tracking. Extensive experiments are conducted on VT-Tiny-MOT, and the results have demonstrated the effectiveness of our method. Compared to other state-of-the-art methods, our method achieves better performance in terms of MOTA (Multiple-Object Tracking Accuracy) and ID-F1 score. The code and dataset will be made available at this https URL.
zh
[CV-182] Robust Recognition of Persian Isolated Digits in Speech using Deep Neural Network
【速读】: 该论文旨在解决在噪声环境下识别孤立的波斯语数字(零到九)的问题,特别是区分语音相似的数字。解决方案的关键在于提出了一种混合结构的神经网络模型,结合了残差卷积神经网络 (Residual Convolutional Neural Network) 和双向门控循环单元 (Bidirectional Gated Recurrent Unit),并以词单元作为输入而非音素单元。该方法通过使用FARSDIGIT1数据库中的音频数据,经过噪声增强和梅尔频率倒谱系数 (MFCC) 特征提取技术,显著提高了在噪声环境下的识别准确率,实验结果显示在训练、验证和测试集上的准确率分别为98.53%、96.10%和95.9%,相比基于音素单元的LSTM方法,平均性能提升了26.88%。
链接: https://arxiv.org/abs/2412.10857
作者: Ali Nasr-Esfahani,Mehdi Bekrani,Roozbeh Rajabi
机构: Queensland University of Technology (QUT); University of Geneva (UNIGE)
关键词: speech recognition applications, artificial intelligence, recent years, advanced significantly, significantly in speech
类目: ound (cs.SD); Computer Vision and Pattern Recognition (cs.CV); Audio and Speech Processing (eess.AS)
备注: 15 pages, submitted to journal
点击查看摘要
Abstract:In recent years, artificial intelligence (AI) has advanced significantly in speech recognition applications. Speech-based interaction with digital systems, particularly AI-driven digit recognition, has emerged as a prominent application. However, existing neural network-based methods often neglect the impact of noise, leading to reduced accuracy in noisy environments. This study tackles the challenge of recognizing isolated spoken Persian numbers (zero to nine), particularly distinguishing phonetically similar numbers, in noisy environments. The proposed method, which is designed for speaker-independent recognition, combines a residual convolutional neural network and a bidirectional gated recurrent unit in a hybrid structure for Persian number recognition. This method employs word units as input instead of phoneme units. Audio data from 51 speakers of the FARSDIGIT1 database are utilized after augmentation with various noises, and the Mel-Frequency Cepstral Coefficients (MFCC) technique is employed for feature extraction. The experimental results show the efficacy of the proposed method, with 98.53%, 96.10%, and 95.9% recognition accuracy on the training, validation, and test sets, respectively. In the noisy environment, the proposed method exhibits an average performance improvement of 26.88% over the phoneme-unit-based LSTM method for Persian numbers. In addition, on the test data of the same dataset, the accuracy of the proposed method is 7.61% better than that of the Mel-scale Two-Dimensional Root Cepstrum Coefficients (MTDRCC) feature extraction technique combined with an MLP model.
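MFCC extraction with noise augmentation of the kind described can be sketched with librosa and NumPy; the coefficient count, sampling rate, and SNR handling below are illustrative assumptions rather than the paper's exact configuration, and the file path in the usage line is hypothetical.

```python
from typing import Optional

import numpy as np
import librosa

def mfcc_features(path: str, sr: int = 16000, n_mfcc: int = 13,
                  snr_db: Optional[float] = None) -> np.ndarray:
    """Load a spoken-digit clip, optionally add white noise at a target SNR, return MFCCs."""
    y, _ = librosa.load(path, sr=sr)
    if snr_db is not None:                       # simple white-noise augmentation
        noise = np.random.randn(len(y))
        scale = np.sqrt(np.mean(y ** 2) / (np.mean(noise ** 2) * 10 ** (snr_db / 10)))
        y = y + scale * noise
    # (n_mfcc, frames) coefficient matrix fed to the recognizer as a word-level unit.
    return librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)

# Usage (hypothetical file path): feats = mfcc_features("digit_7.wav", snr_db=10.0)
```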
zh
[CV-183] SEW: Self-calibration Enhanced Whole Slide Pathology Image Analysis
【速读】: 该论文试图解决病理图像分析中同时提取全局结构信息和局部细节信息的难题。现有方法无法同时兼顾这两方面,而论文的关键解决方案在于提出一种能够有效整合全局和局部特征的算法或模型,以提升癌症诊断和治疗的准确性。
链接: https://arxiv.org/abs/2412.10853
作者: Haoming Luo,Xiaotian Yu,Shengxuming Zhang,Jiabin Xia,Yang Jian,Yuning Sun,Liang Xue,Mingli Song,Jing Zhang,Xiuming Zhang,Zunlei Feng
机构: Zhejiang University (浙江大学); The First Affiliated Hospital of Zhejiang University (浙江大学第一附属医院)
关键词: providing extensive tissue, gigapixel images providing, images providing extensive, Pathology images, gold standard
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Pathology images are considered the "gold standard" for cancer diagnosis and treatment, with gigapixel images providing extensive tissue and cellular information. Existing methods fail to simultaneously extract global structural and local detail features.
zh
[CV-184] Detecting Activities of Daily Living in Egocentric Video to Contextualize Hand Use at Home in Outpatient Neurorehabilitation Settings
【速读】: 该论文试图解决在卒中或脊髓损伤(SCI)后,如何通过可穿戴式以自我为中心的摄像机和机器学习技术,更细致地理解患者在家中的手部使用情况,并据此有效指导康复治疗计划的问题。解决方案的关键在于采用以物体为中心的方法,即关注患者与哪些物体互动,而非他们如何移动。通过利用预训练的物体检测和手-物体交互模型,该系统在复杂的环境中实现了对日常生活活动(ADL)的有效识别,并在不同程度的患者和环境中表现出稳健的性能,生成了临床上可解释的功能性物体使用信息,同时对患者特定的运动变化具有鲁棒性。
链接: https://arxiv.org/abs/2412.10846
作者: Adesh Kadambi,José Zariffa
机构: 未知
关键词: spinal cord injury, Wearable egocentric cameras, cord injury, Wearable egocentric, cameras and machine
类目: Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC)
备注: To be submitted to IEEE Transactions on Neural Systems and Rehabilitation Engineering. 11 pages, 3 figures, 2 tables
点击查看摘要
Abstract:Wearable egocentric cameras and machine learning have the potential to provide clinicians with a more nuanced understanding of patient hand use at home after stroke and spinal cord injury (SCI). However, they require detailed contextual information (i.e., activities and object interactions) to effectively interpret metrics and meaningfully guide therapy planning. We demonstrate that an object-centric approach, focusing on what objects patients interact with rather than how they move, can effectively recognize Activities of Daily Living (ADL) in real-world rehabilitation settings. We evaluated our models on a complex dataset collected in the wild comprising 2261 minutes of egocentric video from 16 participants with impaired hand function. By leveraging pre-trained object detection and hand-object interaction models, our system achieves robust performance across different impairment levels and environments, with our best model achieving a mean weighted F1-score of 0.78 +/- 0.12 and maintaining an F1-score above 0.5 for all participants using leave-one-subject-out cross-validation. Through qualitative analysis, we observe that this approach generates clinically interpretable information about functional object use while being robust to patient-specific movement variations, making it particularly suitable for rehabilitation contexts with prevalent upper limb impairment.
zh
[CV-185] Learning Semantic-Aware Representation in Visual-Language Models for Multi-Label Recognition with Partial Labels
【速读】: 该论文试图解决多标签识别任务中部分标签(MLR-PL)的语义混淆问题,特别是在使用视觉语言模型(如CLIP)时,由于单一全局视觉和文本表示缺乏细粒度信息,导致不同类别之间的语义混淆。解决方案的关键在于引入了一个语义解耦模块(semantic decoupling module)和类别特定的提示优化方法(category-specific prompt optimization method)。语义解耦模块通过视觉编码器后的语义引导空间注意力机制学习类别特定的特征图,而类别特定的提示优化方法则用于学习与类别语义对齐的文本表示。这些方法使得每个类别的预测独立,从而缓解了语义混淆问题,并在多个数据集上显著优于现有方法。
链接: https://arxiv.org/abs/2412.10843
作者: Haoxian Ruan,Zhihua Xu,Zhijing Yang,Yongyi Lu,Jinghui Qin,Tianshui Chen
机构: Guangdong University of Technology(广东工业大学); Guangdong University of Technology(广东工业大学); Guangdong University of Technology(广东工业大学); Guangdong University of Technology(广东工业大学); Guangdong University of Technology(广东工业大学); Guangdong University of Technology(广东工业大学)
关键词: real application scenarios, complete multi-label datasets, Multi-label recognition, complete multi-label, application scenarios
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: ACM Transactions on Multimedia Computing Communications and Applications
点击查看摘要
Abstract:Multi-label recognition with partial labels (MLR-PL), in which only some labels are known while others are unknown for each image, is a practical task in computer vision, since collecting large-scale and complete multi-label datasets is difficult in real application scenarios. Recently, vision language models (e.g., CLIP) have demonstrated impressive transferability to downstream tasks in data-limited or label-limited settings. However, current CLIP-based methods suffer from semantic confusion in the MLR task due to the lack of fine-grained information in the single global visual and textual representation for all categories. In this work, we address this problem by introducing a semantic decoupling module and a category-specific prompt optimization method into a CLIP-based framework. Specifically, the semantic decoupling module following the visual encoder learns category-specific feature maps by utilizing a semantic-guided spatial attention mechanism. Moreover, the category-specific prompt optimization method is introduced to learn text representations aligned with category semantics. Therefore, the prediction of each category is independent, which alleviates the semantic confusion problem. Extensive experiments on the Microsoft COCO 2014 and Pascal VOC 2007 datasets demonstrate that the proposed framework significantly outperforms current state-of-the-art methods with a simpler model structure. Additionally, visual analysis shows that our method effectively separates information from different categories and achieves better performance compared to the CLIP-based baseline method.
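The semantic decoupling module can be read as semantic-guided spatial attention: each category's text embedding attends over visual patch tokens and pools a category-specific feature, so each category can be predicted independently. The function below is a generic sketch under assumed shapes and a hypothetical temperature; the paper's module and prompt optimization are not reproduced.

```python
import torch
import torch.nn.functional as F

def category_specific_features(patch_feats: torch.Tensor,
                               text_embeds: torch.Tensor,
                               tau: float = 0.07) -> torch.Tensor:
    """Semantic-guided spatial attention.
    patch_feats: (B, N, D) visual patch tokens; text_embeds: (C, D) category text embeddings.
    Returns (B, C, D): one pooled, category-specific feature per category."""
    p = F.normalize(patch_feats, dim=-1)
    t = F.normalize(text_embeds, dim=-1)
    attn = torch.einsum('bnd,cd->bcn', p, t) / tau            # category-to-patch affinities
    attn = attn.softmax(dim=-1)                               # spatial attention per category
    return torch.einsum('bcn,bnd->bcd', attn, patch_feats)    # category-specific features

feats = category_specific_features(torch.randn(2, 196, 512), torch.randn(80, 512))
print(feats.shape)   # torch.Size([2, 80, 512]) -> independent per-category predictions
```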
zh
[CV-186] Attention-driven GUI Grounding: Leveraging Pretrained Multimodal Large Language Models without Fine-Tuning AAAI2025
【速读】: 该论文试图解决多模态大语言模型 (Multimodal Large Language Models, MLLMs) 在自主交互和解释图形用户界面 (Graphical User Interfaces, GUIs) 时面临的“grounding”问题,即如何准确识别 GUI 图像中的关键组件(如文本或图标)。传统方法依赖于使用专门的训练数据对 MLLMs 进行微调 (fine-tuning) 以直接预测组件位置,而本文提出了一种无需微调的注意力驱动 grounding 方法 (Tuning-free Attention-driven Grounding, TAG)。该方法的关键在于利用预训练 MLLMs 的固有注意力模式,通过识别和聚合特定查询提示中的注意力图 (attention maps) 来实现组件定位,无需额外训练。实验表明,该方法在 MiniCPM-Llama3-V 2.5 上取得了与微调方法相当的性能,并在文本定位方面表现尤为突出。
链接: https://arxiv.org/abs/2412.10840
作者: Hai-Ming Xu,Qi Chen,Lei Wang,Lingqiao Liu
机构: 1. University of New South Wales (新南威尔士大学); 2. University of Adelaide (阿德莱德大学)
关键词: Large Language Models, Graphical User Interfaces, Multimodal Large Language, interpret Graphical User, Language Models
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to AAAI 2025
点击查看摘要
Abstract:Recent advancements in Multimodal Large Language Models (MLLMs) have generated significant interest in their ability to autonomously interact with and interpret Graphical User Interfaces (GUIs). A major challenge in these systems is grounding: accurately identifying critical GUI components such as text or icons based on a GUI image and a corresponding text query. Traditionally, this task has relied on fine-tuning MLLMs with specialized training data to predict component locations directly. However, in this paper, we propose a novel Tuning-free Attention-driven Grounding (TAG) method that leverages the inherent attention patterns in pretrained MLLMs to accomplish this task without the need for additional fine-tuning. Our method involves identifying and aggregating attention maps from specific tokens within a carefully constructed query prompt. Applied to MiniCPM-Llama3-V 2.5, a state-of-the-art MLLM, our tuning-free approach achieves performance comparable to tuning-based methods, with notable success in text localization. Additionally, we demonstrate that our attention map-based grounding technique significantly outperforms direct localization predictions from MiniCPM-Llama3-V 2.5, highlighting the potential of using attention maps from pretrained MLLMs and paving the way for future innovations in this domain.
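One way to picture attention-driven grounding is to average the attention rows of selected query-prompt tokens over the image-patch keys and take the peak location. The sketch below does exactly that, assuming image-patch keys come first in the sequence and that suitable query-token indices are known; the token selection, layer choice, and aggregation actually used by TAG are not reproduced here.

```python
import torch

def ground_from_attention(attn: torch.Tensor, query_idx, num_patches_hw):
    """attn: (heads, Lq, Lk) attention weights; average the rows of the selected query
    tokens over heads, keep only image-patch keys, and return the peak grid location."""
    h, w = num_patches_hw
    maps = attn[:, query_idx, :].mean(dim=(0, 1))      # (Lk,) aggregated attention
    patch_map = maps[:h * w].reshape(h, w)             # assumption: image patches come first
    y, x = divmod(int(patch_map.argmax()), w)
    return patch_map, (y, x)

# Toy example: 8 heads, 16 query tokens, 24x24 image patches followed by 16 text keys.
attn = torch.rand(8, 16, 24 * 24 + 16)
pm, loc = ground_from_attention(attn, query_idx=[3, 4], num_patches_hw=(24, 24))
print(loc)   # (row, col) of the most-attended patch
```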
zh
[CV-187] SegACIL: Solving the Stability-Plasticity Dilemma in Class-Incremental Semantic Segmentation
【速读】: 该论文试图解决持续学习(continual learning)中面临的灾难性遗忘问题,特别是在语义分割任务中,如何在保留先前知识的同时适应新数据(即稳定性-可塑性困境)。解决方案的关键是提出了SegACIL,一种基于线性闭式解的持续学习方法,该方法仅需单次训练迭代即可完成,显著降低了计算成本。SegACIL通过理论分析证明了其在性能上与联合学习相当,能够有效保留先前数据的知识,从而在保持稳定性和可塑性之间取得平衡。实验结果表明,SegACIL在Pascal VOC2012数据集上的顺序、不连续和重叠设置中均表现出优越的性能,为类别增量语义分割提供了强有力的解决方案。
链接: https://arxiv.org/abs/2412.10834
作者: Jiaxu Li,Songning Lai,Rui Li,Di Fang,Kejia Fan,Jianheng Tang,Yuhan Zhao,Rongchang Zhao,Dongzhan Zhou,Yutao Yue,Huiping Zhuang
机构: Central South University(中南大学); The Hong Kong University of Science and Technology (Guangzhou)(香港科技大学(广州)); South China University of Technology(华南理工大学); Peking University(北京大学); Shanghai Artificial Intelligence Laboratory(上海人工智能实验室)
关键词: made remarkable progress, processing continuously incoming, continuously incoming data, recent years, models continue
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:While deep learning has made remarkable progress in recent years, models continue to struggle with catastrophic forgetting when processing continuously incoming data. This issue is particularly critical in continual learning, where the balance between retaining prior knowledge and adapting to new information-known as the stability-plasticity dilemma-remains a significant challenge. In this paper, we propose SegACIL, a novel continual learning method for semantic segmentation based on a linear closed-form solution. Unlike traditional methods that require multiple epochs for training, SegACIL only requires a single epoch, significantly reducing computational costs. Furthermore, we provide a theoretical analysis demonstrating that SegACIL achieves performance on par with joint learning, effectively retaining knowledge from previous data, which allows it to maintain both stability and plasticity at the same time. Extensive experiments on the Pascal VOC2012 dataset show that SegACIL achieves superior performance in the sequential, disjoint, and overlap settings, offering a robust solution to the challenges of class-incremental semantic segmentation. Code is available at this https URL.
zh
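SegACIL 的单周期闭式解属于解析学习(递归最小二乘)一类方法。下面的玩具示例说明了为什么把各增量阶段的统计量累加后求解,可以得到与联合训练完全一致的最小二乘解,这也是其"与联合学习相当"结论背后的基本性质。示例为简化的分类场景,特征维度、正则项等均为假设,省略了分割任务的具体细节。

```python
import numpy as np

class AnalyticHead:
    """Ridge-regression classifier head updated in closed form, one pass per increment."""

    def __init__(self, feat_dim, num_classes, reg=1e-3):
        self.A = reg * np.eye(feat_dim)               # accumulated X^T X + reg * I
        self.B = np.zeros((feat_dim, num_classes))    # accumulated X^T Y
        self.W = np.zeros((feat_dim, num_classes))

    def update(self, feats, labels):
        """feats: (n, feat_dim) features from a frozen backbone; labels: (n,) int class ids."""
        Y = np.eye(self.B.shape[1])[labels]           # one-hot targets
        self.A += feats.T @ feats
        self.B += feats.T @ Y
        self.W = np.linalg.solve(self.A, self.B)      # closed-form least-squares solution

    def predict(self, feats):
        return (feats @ self.W).argmax(axis=1)

# Toy incremental run: two phases with disjoint classes; data from phase 1 is never
# revisited, yet the accumulated statistics give the same W as joint training.
rng = np.random.default_rng(0)
centers = rng.normal(size=(4, 64))
head = AnalyticHead(feat_dim=64, num_classes=4)
for phase_classes in ([0, 1], [2, 3]):
    labels = rng.choice(phase_classes, size=200)
    feats = centers[labels] + 0.1 * rng.normal(size=(200, 64))
    head.update(feats, labels)
print("predicted classes for one noisy prototype per class:",
      head.predict(centers + 0.05 * rng.normal(size=(4, 64))))   # ideally [0 1 2 3]
```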
[CV-188] Unbiased General Annotated Dataset Generation
【速读】: 该论文试图解决手动收集的标注数据集(如ImageNet)中存在的偏差问题,这种偏差导致模型在跨类别或跨领域任务中的泛化能力下降。解决方案的关键在于提出了一个无偏标注数据集生成框架(ubGen),通过利用多模态基础模型(如CLIP)的优势,在由语言定义的无偏语义空间中对齐图像。具体方法包括设计双层语义对齐损失,通过对抗学习确保生成的图像符合目标数据集的语义分布,并匹配其类别名称的语义描述;同时引入图像质量评分模型作为质量保证损失,以维持生成图像的质量。最终,通过微调预训练的扩散模型,仅使用目标数据集的类别名称作为输入,即可生成无偏图像,从而提升不同任务中骨干网络的泛化能力,尤其是在手动标注样本稀缺的情况下。
链接: https://arxiv.org/abs/2412.10831
作者: Dengyang Jiang,Haoyu Wang,Lei Zhang,Wei Wei,Guang Dai,Mengmeng Wang,Jingdong Wang,Yanning Zhang
机构: Northwestern Polytechnical University(西北工业大学); SGIT AI Lab, State Grid Corporation of China(国家电网公司SGIT AI实验室); Zhejiang University of Technology(浙江工业大学); Baidu Inc(百度公司)
关键词: comprises numerous manually, numerous manually collected, manually collected images, Pre-training backbone networks, generalization capacity
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Preprint
点击查看摘要
Abstract:Pre-training backbone networks on a general annotated dataset (e.g., ImageNet) that comprises numerous manually collected images with category annotations has proven to be indispensable for enhancing the generalization capacity of downstream visual tasks. However, those manually collected images often exhibit bias, which is non-transferable across either categories or domains, thus causing the model’s generalization capacity degeneration. To mitigate this problem, we present an unbiased general annotated dataset generation framework (ubGen). Instead of expensive manual collection, we aim at directly generating unbiased images with category annotations. To achieve this goal, we propose to leverage the advantage of a multimodal foundation model (e.g., CLIP), in terms of aligning images in an unbiased semantic space defined by language. Specifically, we develop a bi-level semantic alignment loss, which not only forces all generated images to be consistent with the semantic distribution of all categories belonging to the target dataset in an adversarial learning manner, but also requires each generated image to match the semantic description of its category name. In addition, we further cast an existing image quality scoring model into a quality assurance loss to preserve the quality of the generated image. By leveraging these two loss functions, we can obtain an unbiased image generation model by simply fine-tuning a pre-trained diffusion model using only all category names in the target dataset as input. Experimental results confirm that, compared with the manually labeled dataset or other synthetic datasets, the utilization of our generated unbiased datasets leads to stable generalization capacity enhancement of different backbone networks across various tasks, especially in tasks where the manually labeled samples are scarce.
zh
[CV-189] Diffusion Model from Scratch
【速读】: 该论文试图解决生成式模型(Generative Models)中扩散模型(Diffusion Models)的复杂性问题,特别是从经典论文《Denoising Diffusion Probability Model (DDPM)》入手时可能遇到的困难。解决方案的关键在于通过详细的数学推导和问题导向的分析方法,追溯从变分自编码器(VAEs)到DDPM的演化过程,帮助读者建立基础理解。此外,论文还探讨了当前主流方法的核心思想和改进策略,为对扩散模型感兴趣的本科生和研究生提供了学习指导。
链接: https://arxiv.org/abs/2412.10824
作者: Wang Zhen,Dong Yunyun
机构: Yunnan University(云南大学)
关键词: Denoising Diffusion Probability, Diffusion Probability Model, popular generative models, paper Denoising Diffusion, Diffusion generative models
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:
点击查看摘要
Abstract:Diffusion generative models are currently the most popular generative models. However, their underlying modeling process is quite complex, and starting directly with the seminal paper Denoising Diffusion Probability Model (DDPM) can be challenging. This paper aims to assist readers in building a foundational understanding of generative models by tracing the evolution from VAEs to DDPM through detailed mathematical derivations and a problem-oriented analytical approach. It also explores the core ideas and improvement strategies of current mainstream methodologies, providing guidance for undergraduate and graduate students interested in learning about diffusion models.
zh
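由于该条目本身是扩散模型教程,这里补充一个 DDPM 训练目标的最小示意:闭式前向过程 q(x_t | x_0) 与简单的噪声预测损失。其中的线性 beta 调度和极简网络只是演示用的替代品,并非论文中的实现。

```python
import torch
import torch.nn as nn

# Linear beta schedule and the closed-form forward process q(x_t | x_0)
T = 1000
betas = torch.linspace(1e-4, 0.02, T)
alphas_bar = torch.cumprod(1.0 - betas, dim=0)

def q_sample(x0, t, noise):
    """Sample x_t from q(x_t | x_0) = N(sqrt(abar_t) x_0, (1 - abar_t) I)."""
    a = alphas_bar[t].sqrt().view(-1, 1, 1, 1)
    s = (1.0 - alphas_bar[t]).sqrt().view(-1, 1, 1, 1)
    return a * x0 + s * noise

class TinyEpsNet(nn.Module):
    """Stand-in for the U-Net noise predictor eps_theta(x_t, t)."""
    def __init__(self, channels=1):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(channels + 1, 32, 3, padding=1), nn.SiLU(),
            nn.Conv2d(32, channels, 3, padding=1))

    def forward(self, x, t):
        # broadcast the normalized timestep as an extra input channel
        tt = (t.float() / T).view(-1, 1, 1, 1).expand(-1, 1, *x.shape[2:])
        return self.net(torch.cat([x, tt], dim=1))

model = TinyEpsNet()
opt = torch.optim.Adam(model.parameters(), lr=2e-4)
x0 = torch.rand(8, 1, 28, 28) * 2 - 1               # toy batch scaled to [-1, 1]
for step in range(5):                                # a few illustrative steps
    t = torch.randint(0, T, (x0.size(0),))
    noise = torch.randn_like(x0)
    loss = ((model(q_sample(x0, t, noise), t) - noise) ** 2).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
    print(f"step {step}: simple DDPM loss = {loss.item():.4f}")
```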
[CV-190] Enhance Vision-Language Alignment with Noise
【速读】: 该论文试图解决预训练视觉-语言模型(VL models)在下游任务中视觉与语言模态对齐的问题。与现有的通过添加额外模块进行微调的方法不同,论文提出了一种基于定制噪声的微调方法,利用正激励噪声(Positive-incentive Noise, Pi-noise)来增强模型的对齐能力。关键在于通过变分推断重新构建CLIP模型的推理过程,并设计了正激励噪声注入器(Positive-incentive Noise Injector, PiNI),将噪声注入视觉和文本编码器中,从而在有限的计算资源下生成更多样化的视觉和语言嵌入,以更好地对齐这两种模态。实验结果表明,该方法在11个数据集上的评估中展现了其有效性。
链接: https://arxiv.org/abs/2412.10817
作者: Sida Huang,Hongyuan Zhang,Xuelong Li
机构: 未知
关键词: noise, pre-trained vision-language, enhancing the alignment, critical challenge, advancement of pre-trained
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:With the advancement of pre-trained vision-language (VL) models, enhancing the alignment between visual and linguistic modalities in downstream tasks has emerged as a critical challenge. Different from existing fine-tuning methods that add extra modules to these two modalities, we investigate whether the frozen model can be fine-tuned by customized noise. Our approach is motivated by the scientific study of beneficial noise, namely Positive-incentive Noise (Pi-noise or π-noise), which quantitatively analyzes the impact of noise. It therefore implies a new scheme to learn beneficial noise distribution that can be employed to fine-tune VL models. Focusing on few-shot classification tasks based on CLIP, we reformulate the inference process of CLIP and apply variational inference, demonstrating how to generate π-noise towards visual and linguistic modalities. Then, we propose Positive-incentive Noise Injector (PiNI), which can fine-tune CLIP via injecting noise into both visual and text encoders. Since the proposed method can learn the distribution of beneficial noise, we can obtain more diverse embeddings of vision and language to better align these two modalities for specific downstream tasks within limited computational resources. We evaluate different noise incorporation approaches and network architectures of PiNI. The evaluation across 11 datasets demonstrates its effectiveness.
zh
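向冻结的视觉/文本特征中注入可学习的高斯"有益噪声",可以用重参数化技巧写成一个通用的小示例。注意这不是论文的 PiNI 模块:噪声的参数化方式、采样次数以及用随机向量代替 CLIP 特征均为演示用的假设。

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class NoiseInjector(nn.Module):
    """Learns a Gaussian 'beneficial noise' distribution added to frozen embeddings
    via the reparameterization trick (mu + sigma * eps)."""
    def __init__(self, dim):
        super().__init__()
        self.mu = nn.Parameter(torch.zeros(dim))
        self.log_sigma = nn.Parameter(torch.full((dim,), -3.0))

    def forward(self, feats, n_samples=4):
        eps = torch.randn(n_samples, *feats.shape, device=feats.device)
        noisy = feats.unsqueeze(0) + self.mu + self.log_sigma.exp() * eps
        return F.normalize(noisy, dim=-1)

def few_shot_loss(image_feats, text_feats, labels, injector, temp=0.07):
    """Cross-entropy over cosine similarities, averaged over noise samples."""
    img = injector(image_feats)                        # (S, B, D)
    txt = F.normalize(text_feats, dim=-1)              # (C, D) frozen class embeddings
    logits = img @ txt.t() / temp                      # (S, B, C)
    return F.cross_entropy(logits.flatten(0, 1), labels.repeat(logits.size(0)))

# Toy usage with random stand-ins for frozen CLIP image/text features.
B, C, D = 16, 5, 512
image_feats = F.normalize(torch.randn(B, D), dim=-1)
text_feats = torch.randn(C, D)
labels = torch.randint(0, C, (B,))
injector = NoiseInjector(D)
opt = torch.optim.Adam(injector.parameters(), lr=1e-3)
for _ in range(3):
    loss = few_shot_loss(image_feats, text_feats, labels, injector)
    opt.zero_grad()
    loss.backward()
    opt.step()
print("loss after a few steps:", loss.item())
```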
[CV-191] Hyper-Fusion Network for Semi-Automatic Segmentation of Skin Lesions
【速读】: 该论文试图解决在训练数据不足的情况下,基于全卷积网络 (FCN) 的自动皮肤病变分割方法在处理具有不常见图像特征的病变时表现不佳的问题。解决方案的关键在于引入了一种超融合网络 (Hyper-Fusion Network, HFN),通过在多个阶段融合用户输入和图像特征,从而在每个融合阶段迭代使用用户输入来细化分割结果。与传统的早期融合方法不同,HFN能够更有效地保留和利用用户输入信息,特别是在处理具有不均匀纹理和模糊边界的复杂皮肤病变时,显著提高了分割的准确性和泛化能力。
链接: https://arxiv.org/abs/2412.10816
作者: Lei Bi,Michael Fulham,Jinman Kim
机构: 未知
关键词: Automatic skin lesion, segmentation methods, semi-automatic segmentation methods, skin lesions, automatic segmentation methods
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by the journal of medical image analysis
点击查看摘要
Abstract:Automatic skin lesion segmentation methods based on fully convolutional networks (FCNs) are regarded as the state-of-the-art for accuracy. However, when there are insufficient training data to cover all the variations in skin lesions, where lesions from different patients may have major differences in size/shape/texture, these methods fail to segment lesions whose image characteristics are less common in the training datasets. FCN-based semi-automatic segmentation methods, which fuse user-inputs with high-level semantic image features derived from FCNs, offer an ideal complement to overcome the limitations of automatic segmentation methods. These semi-automatic methods rely on the automated state-of-the-art FCNs coupled with user-inputs for refinements, and are therefore able to tackle challenging skin lesions. However, there are a limited number of FCN-based semi-automatic segmentation methods, and all of them focus on early fusion, where the first few convolutional layers are used to fuse image features and user-inputs and then derive fused image features for segmentation. For early-fusion based methods, because the user-input information can be lost after the first few convolutional layers, it provides only limited guidance and constraint when segmenting challenging skin lesions with inhomogeneous textures and fuzzy boundaries. Hence, in this work, we introduce a hyper-fusion network (HFN) to fuse the extracted user-inputs and image features over multiple stages. We separately extract complementary features, which then allows for an iterative use of user-inputs along all the fusion stages to refine the segmentation. We evaluated our HFN on the ISIC 2017, ISIC 2016 and PH2 datasets, and our results show that the HFN is more accurate and generalizable than the state-of-the-art methods.
zh
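论文强调的多阶段融合与早期融合的区别,可以用"在每个解码阶段重新注入用户交互图"来示意。下面的模块是在假设特征形状下的粗略草图:通道对齐方式与点击热力图的编码都做了简化,并非 HFN 的真实结构。

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class StageFusion(nn.Module):
    """Fuses a downsampled user-interaction map with image features at one stage."""
    def __init__(self, channels):
        super().__init__()
        self.mix = nn.Conv2d(channels + 1, channels, kernel_size=3, padding=1)

    def forward(self, feat, user_map):
        u = F.interpolate(user_map, size=feat.shape[-2:], mode="bilinear",
                          align_corners=False)
        return F.relu(self.mix(torch.cat([feat, u], dim=1)))

class MultiStageFusionHead(nn.Module):
    """Re-injects the user hint at every stage instead of only at the first layers."""
    def __init__(self, channels=(64, 32, 16)):
        super().__init__()
        self.fusions = nn.ModuleList(StageFusion(c) for c in channels)
        self.out = nn.Conv2d(channels[-1], 1, kernel_size=1)

    def forward(self, stage_feats, user_map):
        x = None
        for feat, fusion in zip(stage_feats, self.fusions):
            if x is not None:                   # upsample previous stage and add (channel-sliced for simplicity)
                x = F.interpolate(x, size=feat.shape[-2:], mode="bilinear",
                                  align_corners=False)
                feat = feat + x[:, :feat.size(1)]
            x = fusion(feat, user_map)
        return torch.sigmoid(self.out(x))       # lesion probability map

# Toy usage: three feature maps from an FCN backbone plus a single-click heat map.
feats = [torch.randn(2, c, s, s) for c, s in [(64, 32), (32, 64), (16, 128)]]
clicks = torch.zeros(2, 1, 256, 256)
clicks[:, :, 128, 128] = 1.0
mask = MultiStageFusionHead()(feats, clicks)
print(mask.shape)   # torch.Size([2, 1, 128, 128])
```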
[CV-192] Medical Manifestation-Aware De-Identification AAAI2025
【速读】: 该论文试图解决医疗场景中面部去识别化(Face de-identification, DeID)研究不足的问题,主要原因是缺乏大规模的患者面部数据集。解决方案的关键在于发布了名为MeMa的数据集,包含超过40,000张逼真的患者面部图像,这些图像是通过重新生成大量真实患者照片而来。通过精心调节生成和数据过滤过程,MeMa在确保丰富的医学表现的同时,避免了侵犯真实患者的隐私。此外,论文还通过招募临床专家对MeMa进行粗粒度和细粒度的标注,建立了首个医疗场景DeID基准。最后,提出了一种将数据驱动的医学语义先验集成到DeID过程中的基线方法,尽管该方法简洁且简单,但在性能上显著优于以往的方法。
链接: https://arxiv.org/abs/2412.10804
作者: Yuan Tian,Shuo Wang,Guangtao Zhai
机构: 1. Shanghai Jiao Tong University (上海交通大学); 2. Shanghai University (上海大学)
关键词: common scenes, large-scale patient face, Face de-identification, widely studied, studied for common
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Accepted to AAAI 2025
点击查看摘要
Abstract:Face de-identification (DeID) has been widely studied for common scenes, but remains under-researched for medical scenes, mostly due to the lack of large-scale patient face datasets. In this paper, we release MeMa, consisting of over 40,000 photo-realistic patient faces. MeMa is re-generated from massive real patient photos. By carefully modulating the generation and data-filtering procedures, MeMa avoids breaching real patient privacy, while ensuring rich and plausible medical manifestations. We recruit expert clinicians to annotate MeMa with both coarse- and fine-grained labels, building the first medical-scene DeID benchmark. Additionally, we propose a baseline approach for this new medical-aware DeID task, by integrating data-driven medical semantic priors into the DeID procedure. Despite its conciseness and simplicity, our approach substantially outperforms previous ones. Dataset is available at this https URL.
zh
[CV-193] Reliable and superior elliptic Fourier descriptor normalization and its application software ElliShape with efficient image processing
【速读】: 该论文试图解决椭圆傅里叶分析 (Elliptic Fourier Analysis, EFA) 在形状分析中面临的两个主要问题:一是椭圆傅里叶描述子 (Elliptic Fourier Descriptors, EFDs) 的归一化问题,导致在基本轮廓变换下难以获得唯一结果;二是现有轮廓/轮廓提取方法在处理复杂数字图像时的不足。解决方案的关键在于重新设计EFDs计算过程以提高计算效率,并引入一种新的归一化方法,称为“真实EFD归一化”,该方法在所有基本轮廓变换下保持不变。此外,论文开发了ElliShape软件,通过结合自动轮廓生成和手动修正的交互式方法,显著提升了轮廓提取的效率和准确性。这些改进使得ElliShape在处理来自不同平台的大量轮廓曲线时表现出色,并为基于深度学习的数据训练提供了高质量的标注图像和EFDs,从而推动了植物学和生物多样性保护等领域的人工智能应用。
链接: https://arxiv.org/abs/2412.10795
作者: Hui Wu(1,2,3,4),Jia-Jie Yang(1,3,4),Chao-Qun Li(5),Jin-Hua Ran(2,4,6),Ren-Hua Peng(6,7),Xiao-Quan Wang(1,2,3,4,6) ((1) Big Data and AI Biodiversity Conservation Research Center, Institute of Botany, Chinese Academy of Sciences, Beijing, China (2) State Key Laboratory of Plant Diversity and Specialty Crops and Key Laboratory of Systematic and Evolutionary Botany, Institute of Botany, Chinese Academy of Sciences, Beijing, China (3) Plant Science Data Center, Chinese Academy of Sciences, Beijing, China (4) China National Botanical Garden, Beijing, China (5) School of Life Sciences, Qilu Normal University, Jinan, China (6) University of Chinese Academy of Sciences, Beijing, China (7) Key Laboratory of Noise and Vibration Control, Institute of Acoustics, Chinese Academy of Sciences, Beijing, China)
机构: 未知
关键词: Elliptic Fourier analysis, Elliptic Fourier, elliptic Fourier descriptors, Fourier analysis, geometric morphometrics
类目: Computer Vision and Pattern Recognition (cs.CV); Quantitative Methods (q-bio.QM)
备注:
点击查看摘要
Abstract:Elliptic Fourier analysis (EFA) is a powerful tool for shape analysis, which is often employed in geometric morphometrics. However, the normalization of elliptic Fourier descriptors has persistently posed challenges in obtaining unique results in basic contour transformations, requiring extensive manual alignment. Additionally, contemporary contour/outline extraction methods often struggle to handle complex digital images. Here, we reformulated the procedure of EFD calculation to improve computational efficiency and introduced a novel approach for EFD normalization, termed true EFD normalization, which remains invariant under all basic contour transformations. These improvements are crucial for processing large sets of contour curves collected from different platforms with varying transformations. Based on these improvements, we developed ElliShape, a user-friendly software. Particularly, the improved contour/outline extraction employs an interactive approach that combines automatic contour generation for efficiency with manual correction for essential modifications and refinements. We evaluated ElliShape’s stability, robustness, and ease of use by comparing it with existing software using standard datasets. ElliShape consistently produced reliable reconstructed shapes and normalized EFD values across different contours and transformations, and it demonstrated superior performance in visualization and efficient processing of various digital images for contour extraction. The output annotated images and EFDs could be utilized in deep learning-based data training, thereby advancing artificial intelligence in botany and offering innovative solutions for critical challenges in biodiversity conservation, species classification, ecosystem function assessment, and related critical issues.
zh
[CV-194] Optimizing Few-Step Sampler for Diffusion Probabilistic Model
【速读】: 该论文试图解决扩散概率模型 (Diffusion Probabilistic Models, DPMs) 在生成高质量图像时面临的计算成本高的问题。解决方案的关键在于优化采样时间表 (sampling schedule),即在求解概率流常微分方程 (Probability-Flow Ordinary Differential Equation, PF-ODE) 时对积分域进行离散化的过程。论文提出了一种两阶段交替优化算法:第一阶段优化预训练DPM的采样时间表,第二阶段在选定的时间步上进一步微调DPM。通过理论推导和蒙特卡洛估计,论文给出了采样时间表离散化误差的上界,并验证了该方法在ImageNet64数据集上的有效性,能够一致性地提升基线模型的性能。
链接: https://arxiv.org/abs/2412.10786
作者: Jen-Yuan Huang
机构: Peking University (北京大学)
关键词: Diffusion Probabilistic Models, demonstrated exceptional capability, intensive computational cost, Ordinary Differential Equation, Diffusion Probabilistic
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:Diffusion Probabilistic Models (DPMs) have demonstrated exceptional capability of generating high-quality and diverse images, but their practical application is hindered by the intensive computational cost during inference. The DPM generation process requires solving a Probability-Flow Ordinary Differential Equation (PF-ODE), which involves discretizing the integration domain into intervals for numerical approximation. This corresponds to the sampling schedule of a diffusion ODE solver, and we notice the solution from a first-order solver can be expressed as a convex combination of model outputs at all scheduled time-steps. We derive an upper bound for the discretization error of the sampling schedule, which can be efficiently optimized with Monte-Carlo estimation. Building on these theoretical results, we propose a two-phase alternating optimization algorithm. In Phase-1, the sampling schedule is optimized for the pre-trained DPM; in Phase-2, the DPM is further tuned on the selected time-steps. Experiments on a pre-trained DPM for the ImageNet64 dataset demonstrate that the proposed method consistently improves the baseline across various numbers of sampling steps.
zh
[CV-195] StyleDiT: A Unified Framework for Diverse Child and Partner Faces Synthesis with Style Latent Diffusion Transformer
【速读】: 该论文试图解决亲属面部合成中的挑战,主要问题在于现有方法难以在生成多样性和保真度之间取得平衡,同时精确控制面部属性(如年龄和性别)。解决方案的关键是提出了Style Latent Diffusion Transformer (StyleDiT)框架,该框架结合了StyleGAN的丰富面部先验和扩散模型的优势,通过条件扩散模型对齐StyleGAN潜在空间中的亲属关系分布,从而实现高质量和多样化的亲属面部生成。此外,引入的Relational Trait Guidance (RTG)机制允许独立控制影响条件(如父母的面部图像),并在生成面部时实现多样性和保真度之间的精细调整。该框架还扩展到预测伴侣面部图像的应用领域,展示了其在多样性和保真度之间取得平衡的优越性能。
链接: https://arxiv.org/abs/2412.10785
作者: Pin-Yen Chiu,Dai-Jie Wu,Po-Hsun Chu,Chia-Hsuan Hsu,Hsiang-Chen Chiu,Chih-Yu Wang,Jun-Cheng Chen
机构: Research Center for Information Technology Innovation, Academia Sinica(信息科技创新研究中心,中央研究院)
关键词: challenging problem due, Kinship face synthesis, challenging problem, problem due, scarcity and low
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Kinship face synthesis is a challenging problem due to the scarcity and low quality of the available kinship data. Existing methods often struggle to generate descendants with both high diversity and fidelity while precisely controlling facial attributes such as age and gender. To address these issues, we propose the Style Latent Diffusion Transformer (StyleDiT), a novel framework that integrates the strengths of StyleGAN with the diffusion model to generate high-quality and diverse kinship faces. In this framework, the rich facial priors of StyleGAN enable fine-grained attribute control, while our conditional diffusion model is used to sample a StyleGAN latent aligned with the kinship relationship of conditioning images by utilizing the advantage of modeling complex kinship relationship distribution. StyleGAN then handles latent decoding for final face generation. Additionally, we introduce the Relational Trait Guidance (RTG) mechanism, enabling independent control of influencing conditions, such as each parent’s facial image. RTG also enables a fine-grained adjustment between the diversity and fidelity in synthesized faces. Furthermore, we extend the application to an unexplored domain: predicting a partner’s facial images using a child’s image and one parent’s image within the same framework. Extensive experiments demonstrate that our StyleDiT outperforms existing methods by striking an excellent balance between generating diverse and high-fidelity kinship faces.
zh
[CV-196] Video Diffusion Transformers are In-Context Learners
【速读】: 该论文旨在解决视频扩散变换器(video diffusion transformers)在无需大量微调的情况下实现上下文生成能力的问题。解决方案的关键在于提出了一种简单的流程:首先将视频沿空间或时间维度拼接,然后联合标注来自同一源的多场景视频片段,最后使用精心挑选的小数据集进行任务特定的微调。该方法无需修改原始模型,能够在不增加计算开销的情况下生成高质量、符合提示要求且角色一致的多场景视频,时长超过30秒。
链接: https://arxiv.org/abs/2412.10783
作者: Zhengcong Fei,Di Qiu,Changqian Yu,Debang Li,Mingyuan Fan,Xiang Wen
机构: 未知
关键词: minimal tuning required, enabling in-context capabilities, video diffusion transformers, diffusion transformers, required for activation
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:This paper investigates a solution for enabling in-context capabilities of video diffusion transformers, with minimal tuning required for activation. Specifically, we propose a simple pipeline to leverage in-context generation: (i) concatenate videos along the spatial or temporal dimension, (ii) jointly caption multi-scene video clips from one source, and (iii) apply task-specific fine-tuning using carefully curated small datasets. Through a series of diverse controllable tasks, we demonstrate qualitatively that existing advanced text-to-video models can effectively perform in-context generation. Notably, it allows for the creation of consistent multi-scene videos exceeding 30 seconds in duration, without additional computational overhead. Importantly, this method requires no modifications to the original models and results in high-fidelity video outputs that better align with prompt specifications and maintain role consistency. Our framework presents a valuable tool for the research community and offers critical insights for advancing product-level controllable video generation systems. The data, code, and model weights are publicly available at: this https URL.
zh
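流程中的第 (i) 步,即沿空间或时间维度拼接视频,在原始帧张量上很容易表达,见下方示例(张量排布为假设;实际方法通常在视频潜变量或 token 上进行拼接)。

```python
import torch

def concat_clips(clip_a, clip_b, mode="time"):
    """Join two clips shaped (frames, channels, height, width) for in-context generation.

    mode="time"  : play clip_a then clip_b (concatenate along the frame axis)
    mode="space" : show the clips side by side in every frame (concatenate along width)
    """
    if mode == "time":
        return torch.cat([clip_a, clip_b], dim=0)
    if mode == "space":
        assert clip_a.shape[0] == clip_b.shape[0], "spatial concat needs equal frame counts"
        return torch.cat([clip_a, clip_b], dim=3)
    raise ValueError(f"unknown mode: {mode}")

# Toy usage: two 16-frame RGB clips at 256x256; the joint caption for step (ii)
# would then describe both scenes, e.g. "[Scene 1] ... [Scene 2] ...".
a, b = torch.rand(16, 3, 256, 256), torch.rand(16, 3, 256, 256)
print(concat_clips(a, b, "time").shape)    # torch.Size([32, 3, 256, 256])
print(concat_clips(a, b, "space").shape)   # torch.Size([16, 3, 256, 512])
```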
[CV-197] Sample-efficient Unsupervised Policy Cloning from Ensemble Self-supervised Labeled Videos
【速读】: 该论文试图解决现有高级策略学习方法在缺乏任务特定奖励、专家标注轨迹和大量环境交互的情况下难以高效学习的问题。解决方案的关键在于提出了一个名为“无监督策略从集成自监督标注视频 (UPESV)”的新框架,通过自监督任务从专家视频中推断出专家动作,并利用这些标注视频和无奖励的环境交互来训练策略。UPESV通过多个有机结合的自监督任务,充分利用专家视频和无奖励交互,实现对环境动态的深入理解和鲁棒预测。最终,该框架在无需任何额外监督的情况下,通过高效的样本训练过程,从视频中模仿学习到高级策略,并在多个挑战性环境中展示了其优越的少样本学习能力。
链接: https://arxiv.org/abs/2412.10778
作者: Xin Liu,Yaran Chen
机构: State Key Laboratory of Multimodal Artificial Intelligence Systems, Institute of Automation, Chinese Academy of Sciences, Beijing 100190, China; School of Artificial Intelligence, University of Chinese Academy of Sciences, Beijing 100049, China.
关键词: develop expert-level strategies, methodologies have demonstrated, demonstrated the ability, ability to develop, develop expert-level
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:Current advanced policy learning methodologies have demonstrated the ability to develop expert-level strategies when provided enough information. However, their requirements, including task-specific rewards, expert-labeled trajectories, and huge environmental interactions, can be expensive or even unavailable in many scenarios. In contrast, humans can efficiently acquire skills within a few trials and errors by imitating easily accessible internet video, in the absence of any other supervision. In this paper, we try to let machines replicate this efficient watching-and-learning process through Unsupervised Policy from Ensemble Self-supervised labeled Videos (UPESV), a novel framework to efficiently learn policies from videos without any other expert supervision. UPESV trains a video labeling model to infer the expert actions in expert videos, through several organically combined self-supervised tasks. Each task performs its own duties, and they together enable the model to make full use of both expert videos and reward-free interactions for advanced dynamics understanding and robust prediction. Simultaneously, UPESV clones a policy from the labeled expert videos, in turn collecting environmental interactions for self-supervised tasks. After a sample-efficient and unsupervised (i.e., reward-free) training process, an advanced video-imitated policy is obtained. Extensive experiments in sixteen challenging procedurally-generated environments demonstrate that the proposed UPESV achieves state-of-the-art few-shot policy learning (outperforming five current advanced baselines on 12/16 tasks) without exposure to any other supervision except videos. Detailed analysis is also provided, verifying the necessity of each self-supervised task employed in UPESV.
zh
[CV-198] VinTAGe: Joint Video and Text Conditioning for Holistic Audio Generation
【速读】: 该论文试图解决文本到音频 (T2A) 和视频到音频 (V2A) 方法无法生成包含屏幕内和屏幕外声音的完整音频的问题。解决方案的关键在于提出了 VinTAGe,一个基于流的 Transformer 模型,该模型通过联合考虑文本和视频来引导音频生成。VinTAGe 框架包括两个核心组件:视觉-文本编码器 (Visual-Text Encoder) 和联合 VT-SiT 模型 (Joint VT-SiT model)。为了减少模态偏差并提高生成质量,研究中还采用了预训练的单模态 T2A 和 V2A 生成模型进行额外指导。此外,论文引入了 VinTAGe-Bench 数据集,用于评估模型在生成完整音频方面的性能,并展示了其在 VGGSound 基准测试中的最先进结果。
链接: https://arxiv.org/abs/2412.10768
作者: Saksham Singh Kushwaha,Yapeng Tian
机构: The University of Texas at Dallas (德克萨斯大学达拉斯分校)
关键词: Recent advances, audio generation, holistic audio generation, offscreen sounds, generation
类目: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM); Sound (cs.SD); Audio and Speech Processing (eess.AS)
备注:
点击查看摘要
Abstract:Recent advances in audio generation have focused on text-to-audio (T2A) and video-to-audio (V2A) tasks. However, T2A or V2A methods cannot generate holistic sounds (onscreen and offscreen). This is because T2A cannot generate sounds aligning with onscreen objects, while V2A cannot generate semantically complete audio (offscreen sounds are missing). In this work, we address the task of holistic audio generation: given a video and a text prompt, we aim to generate both onscreen and offscreen sounds that are temporally synchronized with the video and semantically aligned with text and video. Previous approaches for joint text and video-to-audio generation often suffer from modality bias, favoring one modality over the other. To overcome this limitation, we introduce VinTAGe, a flow-based transformer model that jointly considers text and video to guide audio generation. Our framework comprises two key components: a Visual-Text Encoder and a Joint VT-SiT model. To reduce modality bias and improve generation quality, we employ pretrained uni-modal text-to-audio and video-to-audio generation models for additional guidance. Due to the lack of appropriate benchmarks, we also introduce VinTAGe-Bench, a dataset of 636 video-text-audio pairs containing both onscreen and offscreen sounds. Our comprehensive experiments on VinTAGe-Bench demonstrate that joint text and visual interaction is necessary for holistic audio generation. Furthermore, VinTAGe achieves state-of-the-art results on the VGGSound benchmark. Our source code and pre-trained models will be released. Demo is available at: this https URL.
zh
[CV-199] Neural Network Meta Classifier: Improving the Reliability of Anomaly Segmentation
【速读】: 该论文试图解决在开放环境(open-set environments)中,深度神经网络(DNNs)在语义分割任务中遇到未知物体或异常情况时的检测问题。解决方案的关键在于将传统的基于逻辑回归的元分类器(meta classifier)替换为更具有表达能力的轻量级全连接神经网络,并通过熵最大化(entropy maximization)进行异常分割。此外,论文还引入了信息性分布外样本(informative out-of-distribution examples)的概念,以提升训练效果。尽管神经网络元分类器在性能上优于逻辑回归,但论文也讨论了其解释性(interpretability)的损失,并指出两者行为具有强相关性。
链接: https://arxiv.org/abs/2412.10765
作者: Jurica Runtas,Tomislav Petkovic
机构: University of Zagreb Faculty of Electrical Engineering and Computing(萨格勒布大学电气工程与计算学院)
关键词: predefined closed set, Deep neural networks, Deep neural, set of classes, contemporary solution
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to VISAPP 2025
点击查看摘要
Abstract:Deep neural networks (DNNs) are a contemporary solution for semantic segmentation and are usually trained to operate on a predefined closed set of classes. In open-set environments, it is possible to encounter semantically unknown objects or anomalies. Road driving is an example of such an environment in which, from a safety standpoint, it is important to ensure that a DNN indicates it is operating outside of its learned semantic domain. One possible approach to anomaly segmentation is entropy maximization, which is paired with a logistic regression based post-processing step called meta classification, which is in turn used to improve the reliability of detection of anomalous pixels. We propose to substitute the logistic regression meta classifier with a more expressive lightweight fully connected neural network. We analyze advantages and drawbacks of the proposed neural network meta classifier and demonstrate its better performance over logistic regression. We also introduce the concept of informative out-of-distribution examples which we show to improve training results when using entropy maximization in practice. Finally, we discuss the loss of interpretability and show that the behavior of logistic regression and neural network is strongly correlated.
zh
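论文提出的替换,即用轻量全连接网络代替逻辑回归元分类器、输入保持相同的手工元特征,可以用如下示例说明。其中 8 维合成特征只是熵、softmax 置信度、分割块大小等常用离散度指标的占位,数据与网络宽度均为演示用假设。

```python
import torch
import torch.nn as nn
from sklearn.linear_model import LogisticRegression

# Toy per-segment "meta features" (stand-ins for entropy, max softmax prob, segment size, ...).
torch.manual_seed(0)
X = torch.randn(2000, 8)
w_true = torch.randn(8)
y = ((X @ w_true + 0.5 * torch.randn(2000)) > 0).float()   # 1 = false-positive anomaly

# Baseline: logistic-regression meta classifier.
logreg = LogisticRegression(max_iter=1000).fit(X.numpy(), y.numpy())
print("logistic regression acc:", logreg.score(X.numpy(), y.numpy()))

# Replacement: lightweight fully connected meta classifier on the same features.
mlp = nn.Sequential(nn.Linear(8, 32), nn.ReLU(), nn.Linear(32, 1))
opt = torch.optim.Adam(mlp.parameters(), lr=1e-2)
loss_fn = nn.BCEWithLogitsLoss()
for _ in range(200):
    opt.zero_grad()
    loss = loss_fn(mlp(X).squeeze(1), y)
    loss.backward()
    opt.step()
acc = ((mlp(X).squeeze(1) > 0).float() == y).float().mean().item()
print("MLP meta classifier acc:", acc)
```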
[CV-200] Rebalanced Vision-Language Retrieval Considering Structure-Aware Distillation
【速读】: 该论文试图解决跨模态检索中由于模态不平衡(modal imbalance)导致的检索性能下降问题。模态不平衡通常由噪声干扰和模态信息不足引起,影响了在潜在共同空间中的跨模态匹配表示。论文指出,传统的跨模态匹配方法在模态不平衡的情况下通常表现不佳,因为模态不平衡会影响共同空间中实例的结构,从而挑战跨模态相似性度量。解决方案的关键在于提出了一种结构保留匹配(structure-preserved matching)的方法,通过学习结构保留的匹配表示来重新平衡跨模态匹配。具体而言,论文设计了一种多粒度跨模态匹配(multi-granularity cross-modal matching),结合了结构感知蒸馏(structure-aware distillation)和跨模态匹配损失(cross-modal matching loss)。跨模态匹配损失约束实例级别的匹配,而结构感知蒸馏通过关系匹配(relational matching)进一步规范了学习到的匹配表示与模态内表示之间的几何一致性。
链接: https://arxiv.org/abs/2412.10761
作者: Yang Yang,Wenjuan Xi,Luping Zhou,Jinhui Tang
机构: School of Computer Science and Engineering, Nanjing University of Science and Technology, Nanjing 210094, China; School of Electrical and Information Engineering, The University of Sydney, Sydney, NSW 2006, Australia
关键词: Vision-language retrieval aims, cross-modal matching, matching, Vision-language retrieval, cross-modal
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:Vision-language retrieval aims to search for similar instances in one modality based on queries from another modality. The primary objective is to learn cross-modal matching representations in a latent common space. Actually, the assumption underlying cross-modal matching is modal balance, where each modality contains sufficient information to represent the others. However, noise interference and modality insufficiency often lead to modal imbalance, making it a common phenomenon in practice. The impact of imbalance on retrieval performance remains an open question. In this paper, we first demonstrate that ultimate cross-modal matching is generally sub-optimal for cross-modal retrieval when imbalanced modalities exist. The structure of instances in the common space is inherently influenced when facing imbalanced modalities, posing a challenge to cross-modal similarity measurement. To address this issue, we emphasize the importance of meaningful structure-preserved matching. Accordingly, we propose a simple yet effective method to rebalance cross-modal matching by learning structure-preserved matching representations. Specifically, we design a novel multi-granularity cross-modal matching that incorporates structure-aware distillation alongside the cross-modal matching loss. While the cross-modal matching loss constraints instance-level matching, the structure-aware distillation further regularizes the geometric consistency between learned matching representations and intra-modal representations through the developed relational matching. Extensive experiments on different datasets affirm the superior cross-modal retrieval performance of our approach, simultaneously enhancing single-modal retrieval capabilities compared to the baseline models.
zh
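上文描述的两项损失可以粗略示意为:实例级跨模态匹配损失(对称 InfoNCE),加上一个关系蒸馏项,使学到的匹配空间的两两相似度结构与冻结的模态内特征保持一致。温度、加权系数与 KL 形式均为假设,并非论文的精确目标函数。

```python
import torch
import torch.nn.functional as F

def matching_loss(img, txt, temp=0.07):
    """Standard instance-level cross-modal matching (symmetric InfoNCE)."""
    img, txt = F.normalize(img, dim=-1), F.normalize(txt, dim=-1)
    logits = img @ txt.t() / temp
    target = torch.arange(img.size(0), device=img.device)
    return 0.5 * (F.cross_entropy(logits, target) + F.cross_entropy(logits.t(), target))

def relational_distillation(student, teacher, temp=0.1):
    """Align the pairwise-similarity structure of the learned matching space
    with the structure of frozen intra-modal representations."""
    s = F.normalize(student, dim=-1) @ F.normalize(student, dim=-1).t()
    t = F.normalize(teacher, dim=-1) @ F.normalize(teacher, dim=-1).t()
    p_s = F.log_softmax(s / temp, dim=-1)
    p_t = F.softmax(t / temp, dim=-1)
    return F.kl_div(p_s, p_t, reduction="batchmean")

# Toy usage: a batch of paired image/text embeddings plus frozen intra-modal features.
B, D = 32, 256
img_match, txt_match = torch.randn(B, D), torch.randn(B, D)
img_intra = torch.randn(B, D)                  # e.g. from a frozen visual encoder
loss = matching_loss(img_match, txt_match) + 0.5 * relational_distillation(img_match, img_intra)
print("total loss:", loss.item())
```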
[CV-201] Optimizing Vision-Language Interactions Through Decoder-Only Models
【速读】: 该论文试图解决视觉-语言模型 (Vision-Language Models, VLMs) 在多模态任务中由于依赖单独的视觉编码器而导致的效率、可扩展性和模态对齐问题。解决方案的关键在于提出了一种名为 MUDAIF (Multimodal Unified Decoder with Adaptive Input Fusion) 的解码器架构,该架构通过创新的视觉标记适配器 (Vision-Token Adapter, VTA) 和自适应协同注意力机制,实现了视觉和文本输入的无缝集成,从而消除了对视觉编码器的依赖。这种设计不仅提高了模型的效率和灵活性,还增强了跨模态理解能力,使其在多个基准测试中超越了现有的最先进方法。
链接: https://arxiv.org/abs/2412.10758
作者: Kaito Tanaka,Benjamin Tan,Brian Wong
机构: SANNO University
关键词: encoders introduces challenges, Multimodal Unified Decoder, Adaptive Input Fusion, separate visual encoders, visual encoders introduces
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Vision-Language Models (VLMs) have emerged as key enablers for multimodal tasks, but their reliance on separate visual encoders introduces challenges in efficiency, scalability, and modality alignment. To address these limitations, we propose MUDAIF (Multimodal Unified Decoder with Adaptive Input Fusion), a decoder-only vision-language model that seamlessly integrates visual and textual inputs through a novel Vision-Token Adapter (VTA) and adaptive co-attention mechanism. By eliminating the need for a visual encoder, MUDAIF achieves enhanced efficiency, flexibility, and cross-modal understanding. Trained on a large-scale dataset of 45M image-text pairs, MUDAIF consistently outperforms state-of-the-art methods across multiple benchmarks, including VQA, image captioning, and multimodal reasoning tasks. Extensive analyses and human evaluations demonstrate MUDAIF’s robustness, generalization capabilities, and practical usability, establishing it as a new standard in encoder-free vision-language models.
zh
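这类无视觉编码器的设计,通常依赖一个适配器把 patch 特征映射成固定数量的、位于解码器词嵌入空间中的视觉 token。下面用一个通用的 latent-query 交叉注意力适配器来示意论文中的 Vision-Token Adapter;所有形状与注意力设计均为假设,并非 MUDAIF 的真实结构。

```python
import torch
import torch.nn as nn

class VisionTokenAdapter(nn.Module):
    """Maps patch features into the decoder's token-embedding space so that images can be
    consumed by a decoder-only language model without a dedicated vision encoder stack."""
    def __init__(self, patch_dim, embed_dim, num_latents=64):
        super().__init__()
        self.proj = nn.Linear(patch_dim, embed_dim)
        self.latents = nn.Parameter(torch.randn(num_latents, embed_dim) * 0.02)
        self.attn = nn.MultiheadAttention(embed_dim, num_heads=8, batch_first=True)

    def forward(self, patches):                        # (B, P, patch_dim)
        kv = self.proj(patches)
        q = self.latents.unsqueeze(0).expand(patches.size(0), -1, -1)
        vis_tokens, _ = self.attn(q, kv, kv)           # compress patches into fixed visual tokens
        return vis_tokens                              # (B, num_latents, embed_dim)

# Toy usage: prepend visual tokens to text token embeddings for a causal decoder.
B, P, patch_dim, embed_dim = 2, 196, 1024, 768
adapter = VisionTokenAdapter(patch_dim, embed_dim)
patch_feats = torch.randn(B, P, patch_dim)
text_embeds = torch.randn(B, 20, embed_dim)            # embedded prompt tokens
decoder_input = torch.cat([adapter(patch_feats), text_embeds], dim=1)
print(decoder_input.shape)                             # torch.Size([2, 84, 768])
```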
[CV-202] Damage Assessment after Natural Disasters with UAVs: Semantic Feature Extraction using Deep Learning
【速读】: 该论文试图解决无人机在资源受限环境下进行灾害恢复任务时,由于带宽有限和连接不稳定导致的数据传输问题。解决方案的关键在于提出了一种新颖的语义提取器 (semantic extractor),该提取器可以集成到任何机器学习下游任务中,用于识别决策所需的关键数据。通过在无人机上执行该提取器,能够显著减少需要传输到地面站的数据量,同时保持高准确性,从而有效捕捉任务特定的显著信息。
链接: https://arxiv.org/abs/2412.10756
作者: Nethmi S. Hewawiththi,M. Mahesha Viduranga,Vanodhya G. Warnasooriya,Tharindu Fernando,Himal A. Suraweera,Sridha Sridharan,Clinton Fookes
机构: University of Peradeniya(佩拉德尼亚大学); Vega Innovations(Vega创新); Queensland University of Technology (QUT)(昆士兰科技大学)
关键词: Unmanned aerial vehicle-assisted, promoted recently due, Unmanned aerial, aerial vehicle-assisted disaster, vehicle-assisted disaster recovery
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: 11 pages, 8 figures
点击查看摘要
Abstract:Unmanned aerial vehicle-assisted disaster recovery missions have been promoted recently due to their reliability and flexibility. Machine learning algorithms running onboard significantly enhance the utility of UAVs by enabling real-time data processing and efficient decision-making, despite being in a resource-constrained environment. However, the limited bandwidth and intermittent connectivity make transmitting the outputs to ground stations challenging. This paper proposes a novel semantic extractor that can be adopted into any machine learning downstream task for identifying the critical data required for decision-making. The semantic extractor can be executed onboard which results in a reduction of data that needs to be transmitted to ground stations. We test the proposed architecture together with the semantic extractor on two publicly available datasets, FloodNet and RescueNet, for two downstream tasks: visual question answering and disaster damage level classification. Our experimental results demonstrate the proposed method maintains high accuracy across different downstream tasks while significantly reducing the volume of transmitted data, highlighting the effectiveness of our semantic extractor in capturing task-specific salient information.
zh
[CV-203] Patch-level Sounding Object Tracking for Audio-Visual Question Answering AAAI2025
【速读】: 该论文试图解决音频-视觉场景问答任务 (AVQA) 中的关键挑战,即准确识别和跟踪与问题相关的声源对象。解决方案的关键在于提出了一种新的补丁级声源对象跟踪 (Patch-level Sounding Object Tracking, PSOT) 方法,该方法通过并行运行的运动驱动关键补丁跟踪 (Motion-driven Key Patch Tracking, M-KPT) 和声音驱动关键补丁跟踪 (Sound-driven KPT, S-KPT) 模块,分别利用视觉运动信息和音频-视觉对应关系来识别和跟踪显著的视觉补丁和声源补丁。此外,还引入了问题驱动关键补丁跟踪 (Question-driven KPT, Q-KPT) 模块,以确保模型专注于与问题最相关的信息。通过这些模块的协同作用,最终实现了对音频-视觉-问题特征的更新和聚合,从而提高了问答任务的准确性。
链接: https://arxiv.org/abs/2412.10749
作者: Zhangbin Li,Jinxing Zhou,Jing Zhang,Shengeng Tang,Kun Li,Dan Guo
机构: 未知
关键词: Answering questions related, AVQA task, sounding objects related, Answering questions, Patch-level Sounding Object
类目: Multimedia (cs.MM); Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by AAAI 2025
点击查看摘要
Abstract:Answering questions related to audio-visual scenes, i.e., the AVQA task, is becoming increasingly popular. A critical challenge is accurately identifying and tracking sounding objects related to the question along the timeline. In this paper, we present a new Patch-level Sounding Object Tracking (PSOT) method. It begins with a Motion-driven Key Patch Tracking (M-KPT) module, which relies on visual motion information to identify salient visual patches with significant movements that are more likely to relate to sounding objects and questions. We measure the patch-wise motion intensity map between neighboring video frames and utilize it to construct and guide a motion-driven graph network. Meanwhile, we design a Sound-driven KPT (S-KPT) module to explicitly track sounding patches. This module also involves a graph network, with the adjacency matrix regularized by the audio-visual correspondence map. The M-KPT and S-KPT modules are performed in parallel for each temporal segment, allowing balanced tracking of salient and sounding objects. Based on the tracked patches, we further propose a Question-driven KPT (Q-KPT) module to retain patches highly relevant to the question, ensuring the model focuses on the most informative clues. The audio-visual-question features are updated during the processing of these modules, which are then aggregated for final answer prediction. Extensive experiments on standard datasets demonstrate the effectiveness of our method, achieving competitive performance even compared to recent large-scale pretraining-based approaches.
zh
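方法中运动驱动的部分从相邻帧之间的 patch 级运动强度图出发。下面给出一种最简单的计算方式(帧差加平均池化),并非论文的精确定义;得到的强度图可用于加权 patch 图的邻接矩阵。

```python
import torch
import torch.nn.functional as F

def patch_motion_intensity(prev_frame, next_frame, patch=16):
    """Patch-wise motion intensity between two frames: mean absolute frame difference
    pooled over non-overlapping patches, then normalized to [0, 1] per sample."""
    diff = (next_frame - prev_frame).abs().mean(dim=1, keepdim=True)     # (B, 1, H, W)
    intensity = F.avg_pool2d(diff, kernel_size=patch)                    # (B, 1, H/p, W/p)
    flat = intensity.flatten(1)
    lo = flat.min(dim=1, keepdim=True).values
    hi = flat.max(dim=1, keepdim=True).values
    flat = (flat - lo) / (hi - lo + 1e-6)
    return flat.view_as(intensity)

# Toy usage: high-motion patches (more likely tied to sounding objects) can be kept
# or given larger weights when building the patch graph.
frames = torch.rand(2, 8, 3, 224, 224)                        # (batch, time, C, H, W)
motion = patch_motion_intensity(frames[:, 0], frames[:, 1])   # (2, 1, 14, 14)
top_patches = motion.flatten(1).topk(k=10, dim=1).indices     # indices of high-motion patches
print(motion.shape, top_patches.shape)
```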
[CV-204] A Pioneering Neural Network Method for Efficient and Robust Fuel Sloshing Simulation in Aircraft AAAI-25 AAAI
【速读】: 该论文试图解决飞机油箱内燃油晃动模拟的高计算成本问题。解决方案的关键在于将流体运动视为点云变换,并提出了首个专门用于模拟飞机燃油晃动的神经网络模型。该模型通过三角特征融合设计,在流体动力学建模、动量守恒约束和全局稳定性控制之间实现了最佳平衡。此外,论文构建了首个飞机燃油表面晃动数据集Fueltank,并通过实验验证了该模型在提高精度的同时,显著提升了计算速度,相比传统SPH方法速度提高了约10倍,相比传统流体模拟软件Flow3D速度提高了超过300倍。
链接: https://arxiv.org/abs/2412.10748
作者: Yu Chen,Shuai Zheng,Nianyi Wang,Menglong Jin,Yan Chang
机构: 未知
关键词: aircraft safety research, Simulating fuel sloshing, safety research, Simulating fuel, aircraft
类目: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR); Machine Learning (cs.LG); Fluid Dynamics (physics.flu-dyn)
备注: This paper has been accepted by AAAI Conference on Artificial Intelligence (AAAI-25)
点击查看摘要
Abstract:Simulating fuel sloshing within aircraft tanks during flight is crucial for aircraft safety research. Traditional methods based on Navier-Stokes equations are computationally expensive. In this paper, we treat fluid motion as point cloud transformation and propose the first neural network method specifically designed for simulating fuel sloshing in aircraft. It is also the first deep learning model capable of stably modeling fluid particle dynamics in such complex scenarios. Our triangle feature fusion design achieves an optimal balance among fluid dynamics modeling, momentum conservation constraints, and global stability control. Additionally, we constructed the Fueltank dataset, the first dataset for aircraft fuel surface sloshing. It comprises 320,000 frames across four typical tank types and covers a wide range of flight maneuvers, including multi-directional rotations. We conducted comprehensive experiments on both our dataset and the take-off scenario of the aircraft. Compared to existing neural network-based fluid simulation algorithms, we significantly enhanced accuracy while maintaining high computational speed. Compared to traditional SPH methods, our speed improved by approximately 10 times. Furthermore, compared to traditional fluid simulation software such as Flow3D, our computation speed increased by more than 300 times.
zh
[CV-205] RegMixMatch: Optimizing Mixup Utilization in Semi-Supervised Learning
【速读】: 该论文试图解决半监督学习 (SSL) 中使用 Mixup 进行一致性正则化可能导致人工标签纯度下降的问题,以及伪标签方法中通过阈值策略排除低置信度数据限制了未标记样本利用的问题。解决方案的关键在于提出了一种名为 RegMixMatch 的新框架,该框架通过引入半监督 RegMixup 和类别感知的 Mixup 技术来优化 Mixup 的使用。具体来说,半监督 RegMixup 通过同时使用混合样本和干净样本进行训练,有效解决了人工标签纯度下降的问题;而类别感知的 Mixup 技术则将前两个预测类别的信息整合到低置信度样本及其人工标签中,减少了确认偏差,并提高了这些样本的有效利用率。实验结果表明,RegMixMatch 在多个 SSL 基准测试中达到了最先进的性能。
链接: https://arxiv.org/abs/2412.10741
作者: Haorong Han,Jidong Yuan,Chixuan Wei,Zhongyang Yu
机构: 1. School of Computer Science and Technology, Harbin Institute of Technology, Harbin, China (计算机科学与技术学院,哈尔滨工业大学,哈尔滨,中国);
2. National Engineering Research Center for E-Learning, Central China Normal University, Wuhan, China (国家教育工程研究中心,华中师范大学,武汉,中国);
3. School of Computer Science and Technology, Wuhan University of Technology, Wuhan, China (计算机科学与技术学院,武汉理工大学,武汉,中国)
关键词: Consistency regularization, advanced semi-supervised learning, significantly advanced semi-supervised, SSL, significantly advanced
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (stat.ML)
备注:
点击查看摘要
Abstract:Consistency regularization and pseudo-labeling have significantly advanced semi-supervised learning (SSL). Prior works have effectively employed Mixup for consistency regularization in SSL. However, our findings indicate that applying Mixup for consistency regularization may degrade SSL performance by compromising the purity of artificial labels. Moreover, most pseudo-labeling based methods utilize a thresholding strategy to exclude low-confidence data, aiming to mitigate confirmation bias; however, this approach limits the utility of unlabeled samples. To address these challenges, we propose RegMixMatch, a novel framework that optimizes the use of Mixup with both high- and low-confidence samples in SSL. First, we introduce semi-supervised RegMixup, which effectively addresses the reduced purity of artificial labels by using both mixed samples and clean samples for training. Second, we develop a class-aware Mixup technique that integrates information from the top-2 predicted classes into low-confidence samples and their artificial labels, reducing the confirmation bias associated with these samples and enhancing their effective utilization. Experimental results demonstrate that RegMixMatch achieves state-of-the-art performance across various SSL benchmarks.
zh
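对类别感知 Mixup 的一种理解是:低置信度无标注样本的人工标签由其 top-2 预测类别构成软标签,并与高置信度样本做混合。下面的草图基于这一理解编写,置信度阈值、Beta 参数和配对采样方式均为随意设定的演示值,不代表论文的具体实现。

```python
import torch

def class_aware_mixup(x_u, probs, conf_thresh=0.95, alpha=0.75):
    """Illustrative class-aware Mixup for unlabeled samples.

    High-confidence samples keep a one-hot pseudo-label; low-confidence samples get a
    soft label built from their top-2 predicted classes and are then mixed with a
    randomly chosen high-confidence sample.
    """
    conf, pseudo = probs.max(dim=1)
    num_classes = probs.size(1)
    targets = torch.zeros_like(probs).scatter_(1, pseudo.unsqueeze(1), 1.0)

    low = (conf < conf_thresh).nonzero(as_tuple=True)[0]
    high = (conf >= conf_thresh).nonzero(as_tuple=True)[0]
    if len(low) == 0 or len(high) == 0:
        return x_u, targets

    # soft top-2 label for low-confidence samples
    top2_p, top2_c = probs[low].topk(2, dim=1)
    top2_p = top2_p / top2_p.sum(dim=1, keepdim=True)
    targets[low] = torch.zeros(len(low), num_classes).scatter_(1, top2_c, top2_p)

    # mix each low-confidence sample with a random high-confidence partner
    lam = torch.distributions.Beta(alpha, alpha).sample().item()
    lam = max(lam, 1 - lam)
    partner = high[torch.randint(len(high), (len(low),))]
    x_u = x_u.clone()
    x_u[low] = lam * x_u[low] + (1 - lam) * x_u[partner]
    targets[low] = lam * targets[low] + (1 - lam) * targets[partner]
    return x_u, targets

# Toy usage
x = torch.randn(8, 3, 32, 32)
probs = torch.softmax(torch.randn(8, 10) * 3, dim=1)
mixed_x, soft_y = class_aware_mixup(x, probs)
print(mixed_x.shape, soft_y.sum(dim=1))   # soft labels still sum to 1
```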
[CV-206] DSRC: Learning Density-insensitive and Semantic-aware Collaborative Representation against Corruptions AAAI2025
【速读】: 该论文试图解决多智能体协同感知在复杂现实环境中面对自然损坏时的鲁棒性问题。解决方案的关键在于提出了一种名为DSRC的鲁棒性增强的协同感知方法,该方法通过两个核心设计来实现:一是语义引导的稀疏到密集蒸馏框架,用于构建由真实边界框绘制的多视图密集对象,从而有效学习对密度不敏感且语义感知的协同表示;二是特征到点云的重建方法,用于更好地融合跨智能体的关键协同表示。实验结果表明,DSRC在干净和损坏条件下均优于现有的最先进协同感知方法。
链接: https://arxiv.org/abs/2412.10739
作者: Jingyu Zhang,Yilei Wang,Lang Qian,Peng Sun,Zengwen Li,Sudong Jiang,Maolin Liu,Liang Song
机构: 1. School of Computer Science and Technology, Soochow University(苏州大学计算机科学与技术学院);
2. Institute of Functional Nano & Soft Materials (FUNSOM), Soochow University(苏州大学功能纳米与软物质研究院);
3. School of Computer Science and Engineering, Beihang University(北京航空航天大学计算机科学与工程学院);
4. School of Information Science and Engineering, Southeast University(东南大学信息科学与工程学院)
关键词: achieved significant success, multi-agent collaborative perception, collaborative perception, Semantic-aware collaborative Representation, collaborative perception methods
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by AAAI2025
点击查看摘要
Abstract:As a potential application of Vehicle-to-Everything (V2X) communication, multi-agent collaborative perception has achieved significant success in 3D object detection. While these methods have demonstrated impressive results on standard benchmarks, the robustness of such approaches in the face of complex real-world environments requires additional verification. To bridge this gap, we introduce the first comprehensive benchmark designed to evaluate the robustness of collaborative perception methods in the presence of natural corruptions typical of real-world environments. Furthermore, we propose DSRC, a robustness-enhanced collaborative perception method aiming to learn Density-insensitive and Semantic-aware collaborative Representation against Corruptions. DSRC consists of two key designs: i) a semantic-guided sparse-to-dense distillation framework, which constructs multi-view dense objects painted by ground truth bounding boxes to effectively learn density-insensitive and semantic-aware collaborative representation; ii) a feature-to-point cloud reconstruction approach to better fuse critical collaborative representation across agents. To thoroughly evaluate DSRC, we conduct extensive experiments on real-world and simulated datasets. The results demonstrate that our method outperforms SOTA collaborative perception methods in both clean and corrupted conditions. Code is available at this https URL.
zh
[CV-207] OmniHD-Scenes: A Next-Generation Multimodal Dataset for Autonomous Driving
【速读】: 该论文试图解决自动驾驶算法对高质量、多模态数据集的迫切需求问题。解决方案的关键在于提出了OmniHD-Scenes,这是一个大规模的多模态数据集,整合了128线激光雷达、六台摄像头和六套4D成像雷达的数据,以实现全环境感知。该数据集包含1501个片段,总计超过45万同步帧和585万同步传感器数据点,并采用了创新的4D标注流程,已对200个片段进行了超过51.4万个精确的3D边界框标注,并包含静态场景元素的语义分割标注。此外,论文还引入了自动化的密集占用地面真实值生成管道,利用非关键帧信息提升标注效率。通过建立全面的评估指标、基线模型和基准测试,论文验证了低成本传感器配置在3D检测和语义占用预测任务中的有效性和鲁棒性。
链接: https://arxiv.org/abs/2412.10734
作者: Lianqing Zheng,Long Yang,Qunshu Lin,Wenjin Ai,Minghao Liu,Shouyi Lu,Jianan Liu,Hongze Ren,Jingyue Mo,Xiaokai Bai,Jie Bai,Zhixiong Ma,Xichan Zhu
机构: School of Automotive Studies, Tongji University, Shanghai 201804, China; College of Computer Science and Technology, Zhejiang University, Hangzhou 310027, China; Abaka AI, Singapore; 2077AI Foundation, Singapore; Momoni AI, Gothenburg, Sweden; College of Information Science and Electronic Engineering, Zhejiang University, Hangzhou 310027, China; School of Information and Electrical Engineering, Hangzhou City University, Hangzhou 310015, China
关键词: autonomous driving algorithms, autonomous driving, rapid advancement, advancement of deep, deep learning
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:The rapid advancement of deep learning has intensified the need for comprehensive data for use by autonomous driving algorithms. High-quality datasets are crucial for the development of effective data-driven autonomous driving solutions. Next-generation autonomous driving datasets must be multimodal, incorporating data from advanced sensors that feature extensive data coverage, detailed annotations, and diverse scene representation. To address this need, we present OmniHD-Scenes, a large-scale multimodal dataset that provides comprehensive omnidirectional high-definition data. The OmniHD-Scenes dataset combines data from 128-beam LiDAR, six cameras, and six 4D imaging radar systems to achieve full environmental perception. The dataset comprises 1501 clips, each approximately 30-s long, totaling more than 450K synchronized frames and more than 5.85 million synchronized sensor data points. We also propose a novel 4D annotation pipeline. To date, we have annotated 200 clips with more than 514K precise 3D bounding boxes. These clips also include semantic segmentation annotations for static scene elements. Additionally, we introduce a novel automated pipeline for generation of the dense occupancy ground truth, which effectively leverages information from non-key frames. Alongside the proposed dataset, we establish comprehensive evaluation metrics, baseline models, and benchmarks for 3D detection and semantic occupancy prediction. These benchmarks utilize surround-view cameras and 4D imaging radar to explore cost-effective sensor solutions for autonomous driving applications. Extensive experiments demonstrate the effectiveness of our low-cost sensor configuration and its robustness under adverse conditions. Data will be released at this https URL.
zh
[CV-208] MAL: Cluster-Masked and Multi-Task Pretraining for Enhanced xLSTM Vision Performance
【速读】: 该论文试图解决传统长短期记忆网络 (LSTM) 在视觉任务中面临的扩展性和复杂依赖捕捉能力不足的问题。解决方案的关键在于引入了一种名为 MAL 的新框架,通过创新的预训练策略增强 xLSTM 的性能。具体来说,MAL 采用了一种聚类掩码 (cluster-masked) 的掩码方法,显著提升了局部特征的捕捉能力并优化了图像扫描效率。此外,MAL 还采用了通用编码器-解码器预训练方法,集成了图像自回归、深度估计和图像分割等多任务学习,从而增强了模型在多样视觉任务中的适应性和鲁棒性。
链接: https://arxiv.org/abs/2412.10730
作者: Wenjun Huang,Jianguo Hu
机构: Sun Yat-sen University (中山大学)
关键词: Long Short-Term Memory, traditionally faced challenges, effectively capturing complex, capturing complex dependencies, Long Short-Term
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:The Long Short-Term Memory (LSTM) networks have traditionally faced challenges in scaling and effectively capturing complex dependencies in visual tasks. The xLSTM architecture has emerged to address these limitations, incorporating exponential gating and a parallel matrix memory structure to enhance performance and scalability. Despite these advancements, the potential of xLSTM in visual computing has not been fully realized, particularly in leveraging autoregressive techniques for improved feature extraction. In this paper, we introduce MAL (Cluster-Masked and Multi-Task Pretraining for Enhanced xLSTM Vision Performance), a novel framework that enhances xLSTM’s capabilities through innovative pretraining strategies. We propose a cluster-masked masking method that significantly improves local feature capture and optimizes image scanning efficiency. Additionally, our universal encoder-decoder pretraining approach integrates multiple tasks, including image autoregression, depth estimation, and image segmentation, thereby enhancing the model’s adaptability and robustness across diverse visual tasks. Our experimental results demonstrate that MAL surpasses traditional supervised models and fully leverages the scaling potential of xLSTM, setting a new benchmark in visual task performance.
zh
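聚类掩码预训练的思路是按空间上连续的 patch 块整体掩码,而不是独立随机掩码。下面的辅助函数构造这样的块状掩码;块大小与掩码比例为演示用假设,且省略了框架中多任务解码器的部分。

```python
import torch

def cluster_mask(grid_h, grid_w, cluster=4, mask_ratio=0.5, generator=None):
    """Boolean patch mask that drops whole (cluster x cluster) blocks of patches
    instead of independent random patches; True marks a masked patch."""
    bh, bw = grid_h // cluster, grid_w // cluster
    n_blocks = bh * bw
    n_masked = int(round(mask_ratio * n_blocks))
    perm = torch.randperm(n_blocks, generator=generator)
    block_mask = torch.zeros(n_blocks, dtype=torch.bool)
    block_mask[perm[:n_masked]] = True
    # expand each block decision to its cluster of patches
    mask = (block_mask.view(bh, bw)
            .repeat_interleave(cluster, dim=0)
            .repeat_interleave(cluster, dim=1))
    return mask

mask = cluster_mask(16, 16, cluster=4, mask_ratio=0.5)
print(mask.shape, mask.float().mean().item())   # torch.Size([16, 16]) 0.5
```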
[CV-209] NoisyEQA: Benchmarking Embodied Question Answering Against Noisy Queries
【速读】: 该论文试图解决在真实场景中,由于人类提出的问题中常含有噪声(noise),导致具身问答(Embodied Question Answering, EQA)系统在理解和推理过程中出现困难的问题。解决方案的关键在于引入了一个名为NoisyEQA的基准测试,该基准测试通过自动化数据集创建框架生成四种常见类型的噪声:潜在幻觉噪声(Latent Hallucination Noise)、记忆噪声(Memory Noise)、感知噪声(Perception Noise)和语义噪声(Semantic Noise),以评估代理识别和纠正噪声问题的能力。此外,论文还提出了一种“自我纠正”提示机制(Self-Correction Prompting)和新的评估指标,以增强噪声检测能力和提高答案质量。通过这些方法,论文有效提升了EQA系统在面对噪声问题时的准确性。
链接: https://arxiv.org/abs/2412.10726
作者: Tao Wu,Chuhao Zhou,Yen Heng Wong,Lin Gu,Jianfei Yang
机构: MARS Lab, Nanyang Technological University(南洋理工大学); RIKEN AIP(理化学研究所)
关键词: Embodied Question Answering, enhancing agents’ abilities, Vision-Language Models, development of Embodied, Question Answering
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Robotics (cs.RO)
备注:
点击查看摘要
Abstract:The rapid advancement of Vision-Language Models (VLMs) has significantly advanced the development of Embodied Question Answering (EQA), enhancing agents’ abilities in language understanding and reasoning within complex and realistic scenarios. However, EQA in real-world scenarios remains challenging, as human-posed questions often contain noise that can interfere with an agent’s exploration and response, bringing challenges especially for language beginners and non-expert users. To address this, we introduce a NoisyEQA benchmark designed to evaluate an agent’s ability to recognize and correct noisy questions. This benchmark introduces four common types of noise found in real-world applications: Latent Hallucination Noise, Memory Noise, Perception Noise, and Semantic Noise generated through an automated dataset creation framework. Additionally, we also propose a ‘Self-Correction’ prompting mechanism and a new evaluation metric to enhance and measure both noise detection capability and answer quality. Our comprehensive evaluation reveals that current EQA agents often struggle to detect noise in questions, leading to responses that frequently contain erroneous information. Through our Self-Correct Prompting mechanism, we can effectively improve the accuracy of agent answers.
zh
[CV-210] HEP-NAS: Towards Efficient Few-shot Neural Architecture Search via Hierarchical Edge Partitioning
【速读】: 该论文试图解决神经架构搜索 (Neural Architecture Search, NAS) 中由于权重共享策略导致的性能估计不准确问题,特别是在大规模搜索空间中,传统的Few-shot方法忽略了边之间的关系,导致性能下降。解决方案的关键是提出了HEP-NAS,一种层次化分区算法。HEP-NAS通过将共享相同端节点的边视为一个层次,并在同一层次内进行边的排列和分割,直接搜索每个中间节点的最优操作组合,从而更接近NAS的最终目标。此外,HEP-NAS在每次分割后选择最有前景的子超网,逐步缩小搜索空间,并通过搜索空间互蒸馏 (search space mutual distillation) 来稳定训练过程,加速子超网的收敛,最终在给定预算内实现所有边的分割,逐步搜索出更高精度的架构。
链接: https://arxiv.org/abs/2412.10723
作者: Jianfeng Li,Jiawen Zhang,Feng Wang,Lianbo Ma
机构: 未知
关键词: adopting weight-sharing strategy, reduce search costs, One-shot methods, significantly advanced, advanced the field
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:One-shot methods have significantly advanced the field of neural architecture search (NAS) by adopting weight-sharing strategy to reduce search costs. However, the accuracy of performance estimation can be compromised by co-adaptation. Few-shot methods divide the entire supernet into individual sub-supernets by splitting edge by edge to alleviate this issue, yet neglect relationships among edges and result in performance degradation on huge search space. In this paper, we introduce HEP-NAS, a hierarchy-wise partition algorithm designed to further enhance accuracy. To begin with, HEP-NAS treats edges sharing the same end node as a hierarchy, permuting and splitting edges within the same hierarchy to directly search for the optimal operation combination for each intermediate node. This approach aligns more closely with the ultimate goal of NAS. Furthermore, HEP-NAS selects the most promising sub-supernet after each segmentation, progressively narrowing the search space in which the optimal architecture may exist. To improve performance evaluation of sub-supernets, HEP-NAS employs search space mutual distillation, stabilizing the training process and accelerating the convergence of each individual sub-supernet. Within a given budget, HEP-NAS enables the splitting of all edges and gradually searches for architectures with higher accuracy. Experimental results across various datasets and search spaces demonstrate the superiority of HEP-NAS compared to state-of-the-art methods.
zh
[CV-211] Bridging Vision and Language: Modeling Causality and Temporality in Video Narratives
【速读】: 该论文试图解决现有大规模视觉-语言模型 (LVLMs) 在生成视频描述时难以捕捉复杂视频序列中的因果关系和时间动态的问题。解决方案的关键在于引入了一个因果-时间推理模块 (Causal-Temporal Reasoning Module, CTRM),该模块包含因果动态编码器 (Causal Dynamics Encoder, CDE) 和时间关系学习器 (Temporal Relational Learner, TRL),用于共同编码视频帧中的因果依赖性和时间一致性。此外,论文还设计了一种多阶段学习策略,结合大规模视频-文本数据集的预训练、因果注释数据的微调以及对比对齐,以优化模型并提升嵌入的连贯性。实验结果表明,该方法在标准基准测试中显著优于现有方法,生成的描述更加流畅、连贯且相关。
链接: https://arxiv.org/abs/2412.10720
作者: Ji-jun Park,Soo-joon Choi
机构: Dongguk University (东国大学)
关键词: multimodal machine learning, aiming to generate, critical task, field of multimodal, multimodal machine
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Video captioning is a critical task in the field of multimodal machine learning, aiming to generate descriptive and coherent textual narratives for video content. While large vision-language models (LVLMs) have shown significant progress, they often struggle to capture the causal and temporal dynamics inherent in complex video sequences. To address this limitation, we propose an enhanced framework that integrates a Causal-Temporal Reasoning Module (CTRM) into state-of-the-art LVLMs. CTRM comprises two key components: the Causal Dynamics Encoder (CDE) and the Temporal Relational Learner (TRL), which collectively encode causal dependencies and temporal consistency from video frames. We further design a multi-stage learning strategy to optimize the model, combining pre-training on large-scale video-text datasets, fine-tuning on causally annotated data, and contrastive alignment for better embedding coherence. Experimental results on standard benchmarks such as MSVD and MSR-VTT demonstrate that our method outperforms existing approaches in both automatic metrics (CIDEr, BLEU-4, ROUGE-L) and human evaluations, achieving more fluent, coherent, and relevant captions. These results validate the effectiveness of our approach in generating captions with enriched causal-temporal narratives.
zh
[CV-212] Just a Few Glances: Open-Set Visual Perception with Image Prompt Paradigm AAAI2025
【速读】: 该论文试图解决开放集目标检测 (Open-Set Object Detection, OSOD) 和开放集分割 (Open-Set Segmentation, OSS) 中现有文本提示和视觉提示范式的局限性问题。现有方法在描述特定类别特征时存在困难,且视觉提示方法依赖多轮人工交互,限制了其在全自动化流程中的应用。论文提出了一种新的图像提示范式 (Image Prompt Paradigm),通过使用少量图像实例作为提示,避免了多轮人工干预,实现了单阶段非交互式推理。解决方案的关键在于提出的MI Grounding框架,该框架能够自动编码、选择和融合高质量的图像提示,从而在OSOD和OSS任务中取得了与现有文本和视觉提示方法相媲美的性能,并在特定数据集ADR50K上显著优于现有方法。
链接: https://arxiv.org/abs/2412.10719
作者: Jinrong Zhang,Penghui Wang,Chunxiao Liu,Wei Liu,Dian Jin,Qiong Zhang,Erli Meng,Zhengnan Hu
机构: Xiaomi AI Lab(小米人工智能实验室)
关键词: Open-Set Object Detection, Open-Set Object, Object Detection, prompt paradigm, prompt
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Accepted by AAAI2025
点击查看摘要
Abstract:To break through the limitations of pre-training models on fixed categories, Open-Set Object Detection (OSOD) and Open-Set Segmentation (OSS) have attracted a surge of interest from researchers. Inspired by large language models, mainstream OSOD and OSS methods generally utilize text as a prompt, achieving remarkable performance. Following the SAM paradigm, some researchers use visual prompts, such as points, boxes, and masks that cover detection or segmentation targets. Although these two prompt paradigms exhibit excellent performance, they also reveal inherent limitations. On the one hand, it is difficult to accurately describe the characteristics of a specialized category using textual descriptions. On the other hand, existing visual prompt paradigms heavily rely on multi-round human interaction, which hinders them from being applied to fully automated pipelines. To address the above issues, we propose a novel prompt paradigm in OSOD and OSS, that is, the Image Prompt Paradigm. This brand-new prompt paradigm enables detecting or segmenting specialized categories without multi-round human intervention. To achieve this goal, the proposed image prompt paradigm uses just a few image instances as prompts, and we propose a novel framework named MI Grounding for this new paradigm. In this framework, high-quality image prompts are automatically encoded, selected and fused, achieving single-stage and non-interactive inference. We conduct extensive experiments on public datasets, showing that MI Grounding achieves competitive performance on OSOD and OSS benchmarks compared to text prompt paradigm methods and visual prompt paradigm methods. Moreover, MI Grounding can greatly outperform existing methods on our constructed specialized ADR50K dataset.
zh
[CV-213] GRID: Visual Layout Generation
【速读】: 该论文提出将广泛的视觉生成任务统一重新定义为网格排列问题(类似于电影胶片),以便用单一范式处理多种生成任务。解决方案的关键是GRID范式,它通过将时间序列转换为网格布局,使图像生成模型能够整体处理视觉序列。为实现布局一致性和运动连贯性,论文提出了一种并行流匹配训练策略,结合布局匹配和时间损失,并采用从粗到细的训练计划,逐步从基本布局过渡到精确的运动控制。这种方法显著提高了效率,推理速度最高提升35倍,同时计算资源消耗仅为专用模型的千分之一。
链接: https://arxiv.org/abs/2412.10718
作者: Cong Wan,Xiangyang Luo,Zijian Cai,Yiren Song,Yunlong Zhao,Yifan Bai,Yuhang He,Yihong Gong
机构: 未知
关键词: akin to film, film strips, paradigm that reframes, reframes a broad, broad range
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: preprint, codes: this https URL
点击查看摘要
Abstract:In this paper, we introduce GRID, a novel paradigm that reframes a broad range of visual generation tasks as the problem of arranging grids, akin to film strips. At its core, GRID transforms temporal sequences into grid layouts, enabling image generation models to process visual sequences holistically. To achieve both layout consistency and motion coherence, we develop a parallel flow-matching training strategy that combines layout matching and temporal losses, guided by a coarse-to-fine schedule that evolves from basic layouts to precise motion control. Our approach demonstrates remarkable efficiency, achieving up to 35× faster inference speeds while using 1/1000 of the computational resources compared to specialized models. Extensive experiments show that GRID exhibits exceptional versatility across diverse visual generation tasks, from Text-to-Video to 3D Editing, while maintaining its foundational image generation capabilities. This dual strength in both expanded applications and preserved core competencies establishes GRID as an efficient and versatile omni-solution for visual generation.
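GRID 的核心想法"把时间序列排成一张网格图"可以用几行代码直观表示。以下是一个示意函数,仅演示帧到网格的重排,并非原论文实现:

```python
import torch

def frames_to_grid(frames, rows, cols):
    """把时间序列帧排布成单张网格图(类似胶片条),便于图像模型整体处理。
    frames: (T, C, H, W),要求 T == rows * cols。示意代码,非原论文实现。"""
    T, C, H, W = frames.shape
    assert T == rows * cols
    grid = frames.reshape(rows, cols, C, H, W)     # (rows, cols, C, H, W)
    grid = grid.permute(2, 0, 3, 1, 4)             # (C, rows, H, cols, W)
    return grid.reshape(C, rows * H, cols * W)     # 拼成单张大图

video = torch.randn(16, 3, 64, 64)
print(frames_to_grid(video, 4, 4).shape)  # torch.Size([3, 256, 256])
```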
zh
[CV-214] Virtual Trial Room with Computer Vision and Machine Learning
【速读】: 该论文旨在解决在线购物中顾客因不确定穿戴产品的合适性和尺寸而导致的犹豫购买和高退货率问题。解决方案的关键在于设计了一个基于计算机视觉和机器学习的虚拟试衣间平台,通过使用DECA模型从单张2D图像生成AI驱动的3D人体头部模型,并将根据真实测量数据定制的3D眼镜模型叠加在头部模型上,以模拟真实穿戴效果。此外,通过前端和后端技术的整合,开发了一个全栈应用,使用户能够在网站上查看3D生成结果,提供沉浸式和互动体验。
链接: https://arxiv.org/abs/2412.10710
作者: Tulashi Prasasd Joshi,Amrendra Kumar Yadav,Arjun Chhetri,Suraj Agrahari,Umesh Kanta Ghimire
机构: Tribhuwan University, Institute of Engineering, Thapathali Campus, Kathmandu (Nepal)
关键词: Online shopping, retail industry, convenience and accessibility, shopping has revolutionized, revolutionized the retail
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Online shopping has revolutionized the retail industry, providing customers with convenience and accessibility. However, customers often hesitate to purchase wearable products such as watches, jewelry, glasses, shoes, and clothes due to the lack of certainty regarding fit and suitability. This leads to significant return rates, causing problems for both customers and vendors. To address this issue, a platform called the Virtual Trial Room with Computer Vision and Machine Learning is designed which enables customers to easily check whether a product will fit and suit them or not. To achieve this, an AI-generated 3D model of the human head was created from a single 2D image using the DECA model. This 3D model was then superimposed with a custom-made 3D model of glass which is based on real-world measurements and fitted over the human head. To replicate the real-world look and feel, the model was retouched with textures, lightness, and smoothness. Furthermore, a full-stack application was developed utilizing various front-end and back-end technologies. This application enables users to view 3D-generated results on the website, providing an immersive and interactive experience.
zh
[CV-215] MambaPro: Multi-Modal Object Re-Identification with Mamba Aggregation and Synergistic Prompt AAAI2025
【速读】: 该论文试图解决多模态目标重识别(Multi-modal Object Re-IDentification, ReID)中,现有方法在处理不同模态的长序列时存在的局限性问题。解决方案的关键在于提出了一种名为MambaPro的新框架,该框架通过三个核心组件来实现:首先,使用并行前馈适配器(Parallel Feed-Forward Adapter, PFA)将CLIP模型适配到多模态目标重识别任务;其次,提出协同残差提示(Synergistic Residual Prompt, SRP)来引导多模态特征的联合学习;最后,利用Mamba的优越长序列处理能力,引入Mamba聚合(Mamba Aggregation, MA)来高效建模不同模态间的交互。这些创新使得MambaPro能够在降低复杂度的同时提取更鲁棒的特征,并通过在多个多模态目标重识别基准上的实验验证了其有效性。
链接: https://arxiv.org/abs/2412.10707
作者: Yuhao Wang,Xuehu Liu,Tianyu Yan,Yang Liu,Aihua Zheng,Pingping Zhang,Huchuan Lu
机构: Dalian University of Technology (大连理工大学)
关键词: utilizing complementary image, complementary image information, multi-modal object ReID, Multi-modal object, object ReID
类目: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
备注: This work is accepted by AAAI2025. More modifications may be performed
点击查看摘要
Abstract:Multi-modal object Re-IDentification (ReID) aims to retrieve specific objects by utilizing complementary image information from different modalities. Recently, large-scale pre-trained models like CLIP have demonstrated impressive performance in traditional single-modal object ReID tasks. However, they remain unexplored for multi-modal object ReID. Furthermore, current multi-modal aggregation methods have obvious limitations in dealing with long sequences from different modalities. To address above issues, we introduce a novel framework called MambaPro for multi-modal object ReID. To be specific, we first employ a Parallel Feed-Forward Adapter (PFA) for adapting CLIP to multi-modal object ReID. Then, we propose the Synergistic Residual Prompt (SRP) to guide the joint learning of multi-modal features. Finally, leveraging Mamba’s superior scalability for long sequences, we introduce Mamba Aggregation (MA) to efficiently model interactions between different modalities. As a result, MambaPro could extract more robust features with lower complexity. Extensive experiments on three multi-modal object ReID benchmarks (i.e., RGBNT201, RGBNT100 and MSVR310) validate the effectiveness of our proposed methods. The source code is available at this https URL.
zh
[CV-216] Memory Efficient Matting with Adaptive Token Routing
【速读】: 该论文试图解决Transformer模型在处理高分辨率图像时的内存效率问题,特别是由于全局自注意力机制(global self-attention)的二次复杂度导致的计算和内存瓶颈。解决方案的关键在于提出了一个名为MEMatte的内存高效抠图框架,该框架通过在每个全局注意力块前引入一个路由器(router),将信息量较大的token引导至全局注意力处理,而将其他token路由至轻量级的Token细化模块(Lightweight Token Refinement Module, LTRM)。此外,论文还引入了批约束自适应Token路由机制(Batch-constrained Adaptive Token Routing, BATR),使得路由器能够根据图像内容和网络中注意力块的阶段动态调整token的路由策略。通过这些创新,MEMatte在保持高精度的同时,显著降低了内存使用和计算延迟。
链接: https://arxiv.org/abs/2412.10702
作者: Yiheng Lin,Yihan Hu,Chenyi Zhang,Ting Liu,Xiaochao Qu,Luoqi Liu,Yao Zhao,Yunchao Wei
机构: 1. Tsinghua University (清华大学); 2. Beijing National Research Center for Information Science and Technology (北京国家信息科学与技术研究中心); 3. ByteDance AI Lab (字节跳动人工智能实验室); 4. Peking University (北京大学)
关键词: recently achieved outstanding, achieved outstanding performance, Transformer-based models, models have recently, recently achieved
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Transformer-based models have recently achieved outstanding performance in image matting. However, their application to high-resolution images remains challenging due to the quadratic complexity of global self-attention. To address this issue, we propose MEMatte, a memory-efficient matting framework for processing high-resolution images. MEMatte incorporates a router before each global attention block, directing informative tokens to the global attention while routing other tokens to a Lightweight Token Refinement Module (LTRM). Specifically, the router employs a local-global strategy to predict the routing probability of each token, and the LTRM utilizes efficient modules to simulate global attention. Additionally, we introduce a Batch-constrained Adaptive Token Routing (BATR) mechanism, which allows each router to dynamically route tokens based on image content and the stages of attention block in the network. Furthermore, we construct an ultra high-resolution image matting dataset, UHR-395, comprising 35,500 training images and 1,000 test images, with an average resolution of 4872×6017. This dataset is created by compositing 395 different alpha mattes across 11 categories onto various backgrounds, all with high-quality manual annotation. Extensive experiments demonstrate that MEMatte outperforms existing methods on both high-resolution and real-world datasets, significantly reducing memory usage by approximately 88% and latency by 50% on the Composition-1K benchmark.
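"路由器把信息量大的 token 送入全局注意力、其余 token 走轻量模块"这一思路可以用下面的示意代码理解;打分方式、保留比例与轻量模块结构均为假设,并非 MEMatte 官方实现:

```python
import torch
import torch.nn as nn

class TokenRouterSketch(nn.Module):
    """示意:全局注意力前按得分路由 token;信息量高的走全局注意力,其余走轻量细化模块。"""
    def __init__(self, dim=384, heads=6, keep_ratio=0.25):
        super().__init__()
        self.score = nn.Linear(dim, 1)                      # 预测每个 token 的"信息量"得分
        self.global_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ltrm = nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))
        self.keep_ratio = keep_ratio

    def forward(self, tokens):                              # tokens: (B, N, dim)
        B, N, D = tokens.shape
        k = max(1, int(N * self.keep_ratio))
        scores = self.score(tokens).squeeze(-1)             # (B, N)
        idx = scores.topk(k, dim=1).indices                 # 得分最高的 k 个 token 的下标
        out = self.ltrm(tokens)                             # 默认全部 token 先走轻量模块
        gather_idx = idx.unsqueeze(-1).expand(-1, -1, D)
        picked = torch.gather(tokens, 1, gather_idx)        # 取出高信息量 token
        attn_out, _ = self.global_attn(picked, picked, picked)
        return out.scatter(1, gather_idx, attn_out)         # 用全局注意力结果覆盖对应位置

print(TokenRouterSketch()(torch.randn(2, 196, 384)).shape)  # torch.Size([2, 196, 384])
```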
zh
[CV-217] Linked Adapters: Linking Past and Future to Present for Effective Continual Learning
【速读】: 该论文试图解决深度学习模型在持续学习(continual learning)中面临的灾难性遗忘(catastrophic forgetting)问题,特别是在使用大规模预训练模型(如 transformers)时,为每个新任务从头开始重新训练的高成本问题。解决方案的关键在于提出了一种名为“Linked Adapters”的新方法,通过任务特定适配器(task-specific adapters)之间的加权注意力机制(weighted attention mechanism)实现跨任务的知识传递(knowledge transfer)。Linked Adapters 使用多层感知机(MLP)来建模注意力权重,不仅解决了持续学习中的反向知识传递问题,还增强了正向知识传递的能力。在推理阶段,该方法通过 MLP 基于的注意力权重有效利用了所有横向任务适配器之间的知识传递,从而提升了持续学习任务的性能。
链接: https://arxiv.org/abs/2412.10687
作者: Dupati Srikar Chandra,P. K. Srijith,Dana Rezazadegan,Chris McCarthy
机构: Indian Institute of Technology Hyderabad, India; Swinburne University of Technology, Australia
关键词: adapters, Continual learning, Linked Adapters, system to learn, acquired from previous
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注: 13 Pages, 5 Figures
点击查看摘要
Abstract:Continual learning allows the system to learn and adapt to new tasks while retaining the knowledge acquired from previous tasks. However, deep learning models suffer from catastrophic forgetting of knowledge learned from earlier tasks while learning a new task. Moreover, retraining large models like transformers from scratch for every new task is costly. An effective approach to address continual learning is to use a large pre-trained model with task-specific adapters to adapt to the new tasks. Though this approach can mitigate catastrophic forgetting, they fail to transfer knowledge across tasks as each task is learning adapters separately. To address this, we propose a novel approach Linked Adapters that allows knowledge transfer through a weighted attention mechanism to other task-specific adapters. Linked adapters use a multi-layer perceptron (MLP) to model the attention weights, which overcomes the challenge of backward knowledge transfer in continual learning in addition to modeling the forward knowledge transfer. During inference, our proposed approach effectively leverages knowledge transfer through MLP-based attention weights across all the lateral task adapters. Through numerous experiments conducted on diverse image classification datasets, we effectively demonstrated the improvement in performance on the continual learning tasks using Linked Adapters.
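下面用一个极简草图演示"用 MLP 预测的注意力权重聚合各任务适配器输出"的做法;适配器结构与权重计算细节均为假设,非原论文实现:

```python
import torch
import torch.nn as nn

class LinkedAdaptersSketch(nn.Module):
    """示意:MLP 预测对各任务适配器的注意力权重,再加权聚合其输出(细节为假设)。"""
    def __init__(self, dim=768, bottleneck=64, num_tasks=5):
        super().__init__()
        self.adapters = nn.ModuleList([
            nn.Sequential(nn.Linear(dim, bottleneck), nn.ReLU(), nn.Linear(bottleneck, dim))
            for _ in range(num_tasks)
        ])
        # MLP 输出对每个任务适配器的注意力权重
        self.attn_mlp = nn.Sequential(nn.Linear(dim, dim // 4), nn.ReLU(),
                                      nn.Linear(dim // 4, num_tasks))

    def forward(self, h):                                  # h: (B, dim) 预训练主干特征
        weights = torch.softmax(self.attn_mlp(h), dim=-1)  # (B, num_tasks)
        outs = torch.stack([a(h) for a in self.adapters], dim=1)  # (B, num_tasks, dim)
        return h + (weights.unsqueeze(-1) * outs).sum(dim=1)      # 残差式融合

print(LinkedAdaptersSketch()(torch.randn(4, 768)).shape)  # torch.Size([4, 768])
```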
zh
[CV-218] One Pixel is All I Need
【速读】: 该论文研究Vision Transformers (ViTs)在面对后门攻击时的脆弱性问题。其关键在于利用Perturbation Sensitivity Distribution Map (PSDM)工具,通过计算并累加多个输入的梯度来揭示模型对输入微小变化的敏感性。研究发现,ViTs在中央像素区域表现出更高的敏感性;基于此,论文设计了"WorstVIT"攻击,仅需极低的毒化率、单个epoch的训练以及单像素修改,即可对ViT模型实施有效的后门攻击。
链接: https://arxiv.org/abs/2412.10681
作者: Deng Siqin,Zhou Xiaoyi
机构: Hainan University(海南大学)
关键词: Vision Transformers, achieved record-breaking performance, visual tasks
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Vision Transformers (ViTs) have achieved record-breaking performance in various visual tasks. However, concerns about their robustness against backdoor attacks have grown. Backdoor attacks involve associating a specific trigger with a target label, causing the model to predict the attacker-specified label when the trigger is present, while correctly identifying clean samples. We found that ViTs exhibit higher attack success rates for quasi-triggers (patterns different from but similar to the original training triggers) compared to CNNs. Moreover, some backdoor features in clean samples can suppress the original trigger, making quasi-triggers more effective. To better understand and exploit these vulnerabilities, we developed a tool called the Perturbation Sensitivity Distribution Map (PSDM). PSDM computes and sums gradients over many inputs to show how sensitive the model is to small changes in the input. In ViTs, PSDM reveals a patch-like pattern where central pixels are more sensitive than edges. We use PSDM to guide the creation of quasi-triggers. Based on these findings, we designed “WorstVIT,” a simple yet effective data poisoning backdoor for ViT models. This attack requires an extremely low poisoning rate, trains for just one epoch, and modifies a single pixel to successfully attack all validation images.
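摘要中对 PSDM 的描述是"对许多输入累加梯度,以反映模型对输入微小变化的敏感程度"。下面给出一个按此描述写的示意函数;损失形式与归一化方式为假设,非原论文实现:

```python
import torch
import torch.nn as nn

def psdm_sketch(model, images, labels):
    """示意性的扰动敏感度分布图:对多张输入累加 |∂loss/∂x| 并取平均(非原论文实现)。
    images: (N, C, H, W)"""
    model.eval()
    sensitivity = torch.zeros(images.shape[1:])
    loss_fn = nn.CrossEntropyLoss()
    for x, y in zip(images, labels):
        x = x.unsqueeze(0).clone().requires_grad_(True)
        loss = loss_fn(model(x), y.unsqueeze(0))
        loss.backward()
        sensitivity += x.grad.abs().squeeze(0)            # 累加梯度幅值
    return sensitivity / len(images)                      # 平均后即得敏感度图

# 用法示意(假设的简单模型与随机数据)
model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 10))
imgs, lbls = torch.randn(8, 3, 32, 32), torch.randint(0, 10, (8,))
print(psdm_sketch(model, imgs, lbls).shape)  # torch.Size([3, 32, 32])
```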
zh
[CV-219] UCDR-Adapter: Exploring Adaptation of Pre-Trained Vision-Language Models for Universal Cross-Domain Retrieval WACV2025
【速读】: 该论文试图解决通用跨域检索 (Universal Cross-Domain Retrieval, UCDR) 中在未见过的域和类别上进行图像检索的问题,特别是在没有语义标签的情况下实现鲁棒的泛化。解决方案的关键在于提出了UCDR-Adapter,通过两阶段训练策略增强预训练的视觉-语言模型:首先,源适配器学习 (Source Adapter Learning) 结合类语义和域特定的视觉知识,使用可学习的文本语义模板 (Learnable Textual Semantic Template) 并通过动量更新和双重损失函数优化类和域提示 (Class and Domain Prompts),以实现鲁棒的对齐;其次,目标提示生成 (Target Prompt Generation) 通过关注掩码源提示动态生成提示,从而实现对未见域和类别的无缝适应。与现有方法相比,UCDR-Adapter通过动态适应不断变化的数据分布,提升了灵活性和泛化能力,并在推理阶段仅依赖图像分支和生成的提示,无需文本输入,从而实现高效的检索。
链接: https://arxiv.org/abs/2412.10680
作者: Haoyu Jiang,Zhi-Qi Cheng,Gabriel Moreira,Jiawen Zhu,Jingdong Sun,Bukun Ren,Jun-Yan He,Qi Dai,Xian-Sheng Hua
机构: Zhejiang University; Carnegie Mellon University; Dalian University of Technology; DAMO Academy, Alibaba Group; Microsoft Research; University of Washington
关键词: Universal Cross-Domain Retrieval, Universal Cross-Domain, retrieves relevant images, retrieves relevant, ensuring robust generalization
类目: Computer Vision and Pattern Recognition (cs.CV); Information Retrieval (cs.IR); Multimedia (cs.MM)
备注: Accepted to WACV 2025. Project link: this https URL
点击查看摘要
Abstract:Universal Cross-Domain Retrieval (UCDR) retrieves relevant images from unseen domains and classes without semantic labels, ensuring robust generalization. Existing methods commonly employ prompt tuning with pre-trained vision-language models but are inherently limited by static prompts, reducing adaptability. We propose UCDR-Adapter, which enhances pre-trained models with adapters and dynamic prompt generation through a two-phase training strategy. First, Source Adapter Learning integrates class semantics with domain-specific visual knowledge using a Learnable Textual Semantic Template and optimizes Class and Domain Prompts via momentum updates and dual loss functions for robust alignment. Second, Target Prompt Generation creates dynamic prompts by attending to masked source prompts, enabling seamless adaptation to unseen domains and classes. Unlike prior approaches, UCDR-Adapter dynamically adapts to evolving data distributions, enhancing both flexibility and generalization. During inference, only the image branch and generated prompts are used, eliminating reliance on textual inputs for highly efficient retrieval. Extensive benchmark experiments show that UCDR-Adapter consistently outperforms ProS in most cases and other state-of-the-art methods on UCDR, U(c)CDR, and U(d)CDR settings.
zh
[CV-220] U-FaceBP: Uncertainty-aware Bayesian Ensemble Deep Learning for Face Video-based Blood Pressure Measurement
【速读】: 该论文试图解决基于远程光电容积描记法 (remote photoplethysmography, rPPG) 进行血压 (Blood Pressure, BP) 估计时存在的不确定性和性能限制问题。解决方案的关键在于提出了U-FaceBP,一种基于贝叶斯神经网络 (Bayesian Neural Network, BNN) 的不确定性感知集成深度学习方法。U-FaceBP通过建模数据、模型和集成三种不确定性,并利用多个BNN从面部视频中提取的rPPG信号、估计的PPG信号以及面部图像进行BP估计,从而提高了BP估计的准确性和预测的置信度。
链接: https://arxiv.org/abs/2412.10679
作者: Yusuke Akamatsu,Terumi Umematsu,Hitoshi Imaoka
机构: NEC Corporation(日本电气公司)
关键词: Blood pressure, plays an essential, essential role, role in assessing, daily basis
类目: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
备注:
点击查看摘要
Abstract:Blood pressure (BP) measurement plays an essential role in assessing health on a daily basis. Remote photoplethysmography (rPPG), which extracts pulse waves from camera-captured face videos, has the potential to easily measure BP for daily health monitoring. However, there are many uncertainties in BP estimation using rPPG, resulting in limited estimation performance. In this paper, we propose U-FaceBP, an uncertainty-aware Bayesian ensemble deep learning method for face video-based BP measurement. U-FaceBP models three types of uncertainty, i.e., data, model, and ensemble uncertainties, in face video-based BP estimation with a Bayesian neural network (BNN). We also design U-FaceBP as an ensemble method, with which BP is estimated from rPPG signals, PPG signals estimated from face videos, and face images using multiple BNNs. A large-scale experiment with 786 subjects demonstrates that U-FaceBP outperforms state-of-the-art BP estimation methods. We also show that the uncertainties estimated from U-FaceBP are reasonable and useful for prediction confidence.
zh
[CV-221] Memory-Efficient 4-bit Preconditioned Stochastic Optimization
【速读】: 该论文试图解决预条件随机优化算法(如 Shampoo)在处理大规模神经网络训练时由于非对角预条件矩阵存储需求导致的显著内存开销问题。解决方案的关键在于引入 4-bit 量化技术对 Shampoo 的预条件矩阵进行优化。具体方法包括:首先,通过 Cholesky 分解后对 Cholesky 因子进行量化,利用其下三角结构减少内存使用,同时保持对称性和正定性以最小化信息损失;其次,在量化过程中引入误差反馈,将 Cholesky 因子和误差状态高效存储在同一矩阵的下三角和上三角部分。这些方法在实验中显著提升了内存效率和算法性能,并在理论上证明了量化 Shampoo 在平滑和非平滑随机优化设置下的收敛性。
链接: https://arxiv.org/abs/2412.10663
作者: Jingyang Li,Kuangyu Ding,Kim-Chuan Toh,Pan Zhou
机构: National University of Singapore(新加坡国立大学); Singapore Management University(新加坡管理大学)
关键词: neural network training, Preconditioned stochastic optimization, large-scale neural network, demonstrated superior performance, Preconditioned stochastic
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Optimization and Control (math.OC)
备注:
点击查看摘要
Abstract:Preconditioned stochastic optimization algorithms, exemplified by Shampoo, have demonstrated superior performance over first-order optimizers, providing both theoretical advantages in convergence rates and practical improvements in large-scale neural network training. However, they incur substantial memory overhead due to the storage demands of non-diagonal preconditioning matrices. To address this, we introduce 4-bit quantization for Shampoo’s preconditioners. We introduced two key methods: First, we apply Cholesky decomposition followed by quantization of the Cholesky factors, reducing memory usage by leveraging their lower triangular structure while preserving symmetry and positive definiteness to minimize information loss. To our knowledge, this is the first quantization approach applied to Cholesky factors of preconditioners. Second, we incorporate error feedback in the quantization process, efficiently storing Cholesky factors and error states in the lower and upper triangular parts of the same matrix. Through extensive experiments, we demonstrate that combining Cholesky quantization with error feedback enhances memory efficiency and algorithm performance in large-scale deep-learning tasks. Theoretically, we also provide convergence proofs for quantized Shampoo under both smooth and non-smooth stochastic optimization settings.
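"先做 Cholesky 分解、再对下三角因子做 4-bit 量化并配合误差反馈"的流程可以用如下示意代码理解;其中的量化方案(逐张量均匀量化)为简化假设,并非论文中的具体实现:

```python
import torch

def quantize_4bit(x, levels=16):
    """简化的逐张量 4-bit 均匀量化(示意,非论文中的分块/归一化方案)。"""
    scale = x.abs().max() / (levels // 2 - 1) + 1e-12
    q = torch.clamp(torch.round(x / scale), -(levels // 2), levels // 2 - 1)
    return q * scale                                       # 反量化后的近似值

def store_cholesky_with_error_feedback(precond, prev_error):
    """示意:对预条件矩阵做 Cholesky 分解,量化其下三角因子并记录量化误差用于下次补偿。"""
    L = torch.linalg.cholesky(precond)                     # 下三角 Cholesky 因子
    target = L + prev_error                                # 误差反馈:补偿上一次的量化误差
    L_q = torch.tril(quantize_4bit(target))
    new_error = target - L_q
    return L_q, new_error                                  # L_q 低精度存储;误差留待下次使用

A = torch.randn(8, 8)
precond = A @ A.T + 8 * torch.eye(8)                       # 构造一个正定的预条件矩阵
L_q, err = store_cholesky_with_error_feedback(precond, torch.zeros(8, 8))
print(((L_q @ L_q.T) - precond).norm() / precond.norm())   # 重构的相对误差
```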
zh
[CV-222] MEATRD: Multimodal Anomalous Tissue Region Detection Enhanced with Spatial Transcriptomics AAAI2025
【速读】: 该论文试图解决在病理学和临床诊断中,传统基于组织学图像的异常组织区域(ATRs)检测方法在ATRs与正常组织视觉差异微小时表现不佳的问题。解决方案的关键在于提出了一种名为MEATRD的新型ATRs检测方法,该方法通过整合组织学图像和空间转录组学(ST)数据,利用重建和单类分类的结合策略。MEATRD的核心是一个创新的掩码图双重注意力变换器(MGDAT)网络,它不仅促进了跨模态和跨节点的信息共享,还解决了基于重建的异常检测方法中常见的过泛化问题。此外,MEATRD通过多模态瓶颈编码有效地整合了模态特异性和任务相关的信息,并首次从理论上分析了多模态瓶颈编码的信息特性。实验结果表明,MEATRD在多个真实ST数据集上的ATRs检测性能优于现有的先进方法,特别是在识别与正常组织视觉差异微小的ATRs方面表现出色。
链接: https://arxiv.org/abs/2412.10659
作者: Kaichen Xu,Qilong Wu,Yan Lu,Yinan Zheng,Wenlin Li,Xingjie Tang,Jun Wang,Xiaobo Sun
机构: 未知
关键词: ATR detection, ATR detection methods, anomalous tissue regions, pathological studies, crucial in clinical
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Quantitative Methods (q-bio.QM)
备注: AAAI 2025. Code: this https URL
点击查看摘要
Abstract:The detection of anomalous tissue regions (ATRs) within affected tissues is crucial in clinical diagnosis and pathological studies. Conventional automated ATR detection methods, primarily based on histology images alone, falter in cases where ATRs and normal tissues have subtle visual differences. The recent spatial transcriptomics (ST) technology profiles gene expressions across tissue regions, offering a molecular perspective for detecting ATRs. However, there is a dearth of ATR detection methods that effectively harness complementary information from both histology images and ST. To address this gap, we propose MEATRD, a novel ATR detection method that integrates histology image and ST data. MEATRD is trained to reconstruct image patches and gene expression profiles of normal tissue spots (inliers) from their multimodal embeddings, followed by learning a one-class classification AD model based on latent multimodal reconstruction errors. This strategy harmonizes the strengths of reconstruction-based and one-class classification approaches. At the heart of MEATRD is an innovative masked graph dual-attention transformer (MGDAT) network, which not only facilitates cross-modality and cross-node information sharing but also addresses the model over-generalization issue commonly seen in reconstruction-based AD methods. Additionally, we demonstrate that modality-specific, task-relevant information is collated and condensed in multimodal bottleneck encoding generated in MGDAT, marking the first theoretical analysis of the informational properties of multimodal bottleneck encoding. Extensive evaluations across eight real ST datasets reveal MEATRD’s superior performance in ATR detection, surpassing various state-of-the-art AD methods. Remarkably, MEATRD also proves adept at discerning ATRs that only show slight visual deviations from normal tissues.
zh
[CV-223] LAN: Learning to Adapt Noise for Image Denoising CVPR2024
【速读】: 该论文试图解决图像去噪(image denoising)中,现有深度学习模型在处理训练时未见过的噪声分布时性能下降的问题。解决方案的关键在于提出了一种新的去噪算法,称为“学习适应噪声”(Learning-to-Adapt-Noise, LAN),其核心思想是通过在输入的噪声图像上直接添加可学习的噪声偏移量(learnable noise offset),使输入噪声更接近于预训练去噪网络所熟悉的噪声分布,从而在不调整网络权重的情况下提升模型对未见过噪声的处理能力。
链接: https://arxiv.org/abs/2412.10651
作者: Changjin Kim,Tae Hyun Kim,Sungyong Baik
机构: Hanyang University (汉阳大学)
关键词: Removing noise, capturing environments, noise, challenging task, type and amount
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: CVPR2024
点击查看摘要
Abstract:Removing noise from images, a.k.a image denoising, can be a very challenging task since the type and amount of noise can greatly vary for each image due to many factors including a camera model and capturing environments. While there have been striking improvements in image denoising with the emergence of advanced deep learning architectures and real-world datasets, recent denoising networks struggle to maintain performance on images with noise that has not been seen during training. One typical approach to address the challenge would be to adapt a denoising network to new noise distribution. Instead, in this work, we shift our focus to adapting the input noise itself, rather than adapting a network. Thus, we keep a pretrained network frozen, and adapt an input noise to capture the fine-grained deviations. As such, we propose a new denoising algorithm, dubbed Learning-to-Adapt-Noise (LAN), where a learnable noise offset is directly added to a given noisy image to bring a given input noise closer towards the noise distribution a denoising network is trained to handle. Consequently, the proposed framework exhibits performance improvement on images with unseen noise, displaying the potential of the proposed research direction. The code is available at this https URL
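LAN 的核心机制是"冻结去噪网络、只优化加在输入上的可学习噪声偏移"。下面是一个按此思路写的示意代码;其中用输出图像的总变差作为占位的优化目标,原论文的适应目标与此不同,这里仅用于演示"只更新输入偏移、不动网络权重"这一点:

```python
import torch
import torch.nn as nn

def adapt_noise_offset(denoiser, noisy, steps=50, lr=1e-3):
    """示意:冻结去噪网络,仅优化加在输入上的噪声偏移(目标函数为假设的占位项)。"""
    for p in denoiser.parameters():
        p.requires_grad_(False)                            # 网络权重保持冻结
    offset = torch.zeros_like(noisy, requires_grad=True)   # 可学习的噪声偏移
    opt = torch.optim.Adam([offset], lr=lr)
    for _ in range(steps):
        out = denoiser(noisy + offset)                     # 把输入噪声"拉近"网络熟悉的分布
        tv = (out[..., 1:, :] - out[..., :-1, :]).abs().mean() + \
             (out[..., :, 1:] - out[..., :, :-1]).abs().mean()
        opt.zero_grad()
        tv.backward()
        opt.step()
    return denoiser(noisy + offset).detach()

denoiser = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
                         nn.Conv2d(16, 3, 3, padding=1))   # 假设的简易去噪网络
print(adapt_noise_offset(denoiser, torch.randn(1, 3, 32, 32)).shape)  # torch.Size([1, 3, 32, 32])
```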
zh
[CV-224] DeMo: Decoupled Feature-Based Mixture of Experts for Multi-Modal Object Re-Identification AAAI2025
【速读】: 该论文试图解决多模态目标重识别(Multi-modal Object Re-IDentification, ReID)中存在的两个关键问题:一是多模态成像中的动态质量变化被忽视,二是不同模态间的共享信息可能削弱模态特异性信息。为解决这些问题,论文提出了一种名为DeMo的新型特征学习框架,其关键在于通过混合专家(Mixture of Experts, MoE)自适应地平衡解耦特征。具体来说,DeMo框架首先使用Patch-Integrated Feature Extractor (PIFE)提取多粒度和多模态特征,然后通过Hierarchical Decoupling Module (HDM)将多模态特征解耦为非重叠形式,以保留模态独特性并增加特征多样性。最后,通过Attention-Triggered Mixture of Experts (ATMoE)模块,利用解耦特征动态生成注意力权重,替代传统的门控机制,从而生成更鲁棒的多模态特征。
链接: https://arxiv.org/abs/2412.10650
作者: Yuhao Wang,Yang Liu,Aihua Zheng,Pingping Zhang
机构: Dalian University of Technology (大连理工大学); Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences (中国科学院深圳先进技术研究院)
关键词: combining complementary information, multi-modal object ReID, Multi-modal object Re-IDentification, Multi-modal object, aims to retrieve
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: This work is accepted by AAAI2025. More modifications may be performed
点击查看摘要
Abstract:Multi-modal object Re-IDentification (ReID) aims to retrieve specific objects by combining complementary information from multiple modalities. Existing multi-modal object ReID methods primarily focus on the fusion of heterogeneous features. However, they often overlook the dynamic quality changes in multi-modal imaging. In addition, the shared information between different modalities can weaken modality-specific information. To address these issues, we propose a novel feature learning framework called DeMo for multi-modal object ReID, which adaptively balances decoupled features using a mixture of experts. To be specific, we first deploy a Patch-Integrated Feature Extractor (PIFE) to extract multi-granularity and multi-modal features. Then, we introduce a Hierarchical Decoupling Module (HDM) to decouple multi-modal features into non-overlapping forms, preserving the modality uniqueness and increasing the feature diversity. Finally, we propose an Attention-Triggered Mixture of Experts (ATMoE), which replaces traditional gating with dynamic attention weights derived from decoupled features. With these modules, our DeMo can generate more robust multi-modal features. Extensive experiments on three multi-modal object ReID benchmarks fully verify the effectiveness of our methods. The source code is available at this https URL.
zh
[CV-225] Enhancement of text recognition for hanja handwritten documents of Ancient Korea
【速读】: 该论文试图解决古典手写文档的光学字符识别(Optical Character Recognition, OCR)问题,特别是针对韩文汉字(hanja)手写文档的识别挑战。解决方案的关键在于使用数据增强技术,通过在文档区域内进行高度变化的裁剪(highly variable cropping)来增强训练数据,并采用两阶段目标检测模型——高分辨率神经网络(High resolution neural network, HRNet)进行训练。这种方法显著提高了对草书文档的识别准确率,达到了90%的推理识别率。此外,研究还揭示了简体字、变体字、常见字及难以识别的字符对OCR性能的影响,并提出该方法不仅适用于古典文档,还可应用于现代多语言文档及其他字体类型的识别。
链接: https://arxiv.org/abs/2412.10647
作者: Joonmo Ahna,Taehong Jang,Quan Fengnyu,Hyungil Lee,Jaehyuk Lee,Sojung Lucia Kim
机构: 未知
关键词: optical character recognition, high-performance optical character, highly variable cropping, optical character, classical Chinese characters
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:We implemented a high-performance optical character recognition model for classical handwritten documents using data augmentation with highly variable cropping within the document region. Optical character recognition in handwritten documents, especially classical documents, has been a challenging topic in many countries and research organizations due to its difficulty. Although many researchers have conducted research on this topic, the quality of classical texts over time and the unique stylistic characteristics of various authors have made it difficult, and it is clear that the recognition of hanja handwritten documents is a meaningful and special challenge, especially since hanja, which has been developed by reflecting the vocabulary, semantic, and syntactic features of the Joseon Dynasty, is different from classical Chinese characters. To study this challenge, we used 1100 cursive documents, which are small in size, and augmented each document into 100 training samples by cropping randomly sized regions within the document, then trained a two-stage object detection model, High resolution neural network (HRNet), on them and applied the resulting model to achieve a high inference recognition rate of 90% for cursive documents. Through this study, we also confirmed that the performance of OCR is affected by the simplified characters, variants, variant characters, common characters, and alternators of Chinese characters that are difficult to see in other studies, and we propose that the results of this study can be applied to optical character recognition of modern documents in multiple languages as well as other typefaces in classical documents.
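摘要中的数据扩增方式是"在文档区域内做尺寸高度可变的随机裁剪、每份文档生成上百个训练样本"。下面是一个示意实现,裁剪比例等超参数为假设:

```python
import random
from PIL import Image

def random_region_crops(doc_image, region, num_crops=100,
                        min_ratio=0.3, max_ratio=0.9):
    """示意:在文档区域内做大小高度可变的随机裁剪来扩增训练数据
    (裁剪比例等超参数为假设,非原论文设定)。region = (left, top, right, bottom)"""
    left, top, right, bottom = region
    rw, rh = right - left, bottom - top
    crops = []
    for _ in range(num_crops):
        cw = int(rw * random.uniform(min_ratio, max_ratio))
        ch = int(rh * random.uniform(min_ratio, max_ratio))
        x = random.randint(left, right - cw)
        y = random.randint(top, bottom - ch)
        crops.append(doc_image.crop((x, y, x + cw, y + ch)))
    return crops

doc = Image.new("L", (1200, 1600), color=255)              # 假设的空白文档图像
patches = random_region_crops(doc, (100, 100, 1100, 1500), num_crops=5)
print(len(patches), patches[0].size)
```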
zh
[CV-226] CATALOG: A Camera Trap Language-guided Contrastive Learning Model
【速读】: 该论文试图解决在相机陷阱图像识别中由于数据分布差异(domain shift)导致的动物物种识别难题,特别是在光照、伪装和遮挡等因素变化较大的情况下。解决方案的关键在于提出了Camera Trap Language-guided Contrastive Learning (CATALOG)模型,该模型通过结合多个基础模型(Foundation Models)来提取相机陷阱数据的视觉和文本特征,并利用对比损失函数(contrastive loss function)进行训练。CATALOG在两个基准数据集上的评估结果表明,其在处理训练和测试数据中动物物种或地理区域不一致的情况下,显著优于现有的最先进方法,展示了多模态融合和对比学习在解决相机陷阱图像识别中领域偏移问题的潜力。
链接: https://arxiv.org/abs/2412.10624
作者: Julian D. Santamaria,Claudia Isaza,Jhony H. Giraldo
机构: SISTEMIC, Faculty of Engineering, Universidad de Antioquia-UdeA(安蒂奥基亚大学工程学院); LTCI, Télécom Paris, Institut Polytechnique de Paris(巴黎综合理工学院电信学院)
关键词: computer vision tasks, Foundation Models, object detection, computer vision, Camera Trap Language-guided
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:
点击查看摘要
Abstract:Foundation Models (FMs) have been successful in various computer vision tasks like image classification, object detection and image segmentation. However, these tasks remain challenging when these models are tested on datasets with different distributions from the training dataset, a problem known as domain shift. This is especially problematic for recognizing animal species in camera-trap images where we have variability in factors like lighting, camouflage and occlusions. In this paper, we propose the Camera Trap Language-guided Contrastive Learning (CATALOG) model to address these issues. Our approach combines multiple FMs to extract visual and textual features from camera-trap data and uses a contrastive loss function to train the model. We evaluate CATALOG on two benchmark datasets and show that it outperforms previous state-of-the-art methods in camera-trap image recognition, especially when the training and testing data have different animal species or come from different geographical areas. Our approach demonstrates the potential of using FMs in combination with multi-modal fusion and contrastive learning for addressing domain shifts in camera-trap image recognition. The code of CATALOG is publicly available at this https URL.
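CATALOG 用对比损失对齐相机陷阱图像特征与文本特征。下面给出一个 CLIP 风格的对称 InfoNCE 损失示意;温度等超参数为假设,非原论文设定:

```python
import torch
import torch.nn.functional as F

def contrastive_loss(img_feats, txt_feats, temperature=0.07):
    """示意性的对称对比损失(InfoNCE 形式),用于对齐成对的图像/文本特征。"""
    img_feats = F.normalize(img_feats, dim=-1)
    txt_feats = F.normalize(txt_feats, dim=-1)
    logits = img_feats @ txt_feats.t() / temperature        # (B, B) 相似度矩阵
    targets = torch.arange(logits.size(0), device=logits.device)
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.t(), targets)) / 2       # 图像->文本 与 文本->图像 对称平均

print(contrastive_loss(torch.randn(16, 512), torch.randn(16, 512)).item())
```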
zh
[CV-227] EvalGIM: A Library for Evaluating Generative Image Models KR
【速读】: 该论文试图解决生成式图像模型(text-to-image generative models)评估中缺乏统一、灵活且可操作的基准库的问题。解决方案的关键在于引入了EvalGIM(EvalGym),这是一个支持广泛数据集和评估指标的库,旨在测量生成模型的质量、多样性和一致性。EvalGIM的设计强调用户定制的灵活性,允许轻松添加新的数据集和指标,并通过“评估练习”(Evaluation Exercises)提供具体的评估见解,包括当前最先进的评估方法(如一致性-多样性-真实性帕累托前沿和性能差异的分组测量)以及新的分析方法(如模型排名的鲁棒性分析和不同提示风格下的平衡评估)。
链接: https://arxiv.org/abs/2412.10604
作者: Melissa Hall,Oscar Mañas,Reyhane Askari,Mark Ibrahim,Candace Ross,Pietro Astolfi,Tariq Berrada Ifriqi,Marton Havasi,Yohann Benchetrit,Karen Ullrich,Carolina Braga,Abhishek Charnalia,Maeve Ryan,Mike Rabbat,Michal Drozdzal,Jakob Verbeek,Adriana Romero Soriano
机构: Meta
关键词: Evaluation Exercises, evaluation, adoption of automatic, automatic benchmarking methods, generative models
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: For code, see this https URL
点击查看摘要
Abstract:As the use of text-to-image generative models increases, so does the adoption of automatic benchmarking methods used in their evaluation. However, while metrics and datasets abound, there are few unified benchmarking libraries that provide a framework for performing evaluations across many datasets and metrics. Furthermore, the rapid introduction of increasingly robust benchmarking methods requires that evaluation libraries remain flexible to new datasets and metrics. Finally, there remains a gap in synthesizing evaluations in order to deliver actionable takeaways about model performance. To enable unified, flexible, and actionable evaluations, we introduce EvalGIM (pronounced ‘‘EvalGym’’), a library for evaluating generative image models. EvalGIM contains broad support for datasets and metrics used to measure quality, diversity, and consistency of text-to-image generative models. In addition, EvalGIM is designed with flexibility for user customization as a top priority and contains a structure that allows plug-and-play additions of new datasets and metrics. To enable actionable evaluation insights, we introduce ‘‘Evaluation Exercises’’ that highlight takeaways for specific evaluation questions. The Evaluation Exercises contain easy-to-use and reproducible implementations of two state-of-the-art evaluation methods of text-to-image generative models: consistency-diversity-realism Pareto Fronts and disaggregated measurements of performance disparities across groups. EvalGIM also contains Evaluation Exercises that introduce two new analysis methods for text-to-image generative models: robustness analyses of model rankings and balanced evaluations across different prompt styles. We encourage text-to-image model exploration with EvalGIM and invite contributions at this https URL.
zh
[CV-228] Err on the Side of Texture: Texture Bias on Real Data
【速读】: 该论文试图解决机器学习模型中存在的纹理偏差(texture bias)问题,即模型过度依赖纹理信息而非形状信息,从而影响模型的准确性和鲁棒性。解决方案的关键在于引入了一种新的度量指标——纹理关联值(Texture Association Value, TAV),该指标量化了模型在分类物体时对特定纹理的依赖程度。通过TAV,研究揭示了纹理偏差对模型准确性和鲁棒性的显著影响,并解释了自然对抗样本的存在,其中超过90%的样本由于纹理与真实标签的纹理不一致而导致模型产生自信的错误预测。
链接: https://arxiv.org/abs/2412.10597
作者: Blaine Hoak,Ryan Sheatsley,Patrick McDaniel
机构: 未知
关键词: Bias significantly undermines, machine learning models, significantly undermines, trustworthiness of machine, machine learning
类目: Computer Vision and Pattern Recognition (cs.CV); Cryptography and Security (cs.CR)
备注: Accepted to IEEE Secure and Trustworthy Machine Learning (SaTML)
点击查看摘要
Abstract:Bias significantly undermines both the accuracy and trustworthiness of machine learning models. To date, one of the strongest biases observed in image classification models is texture bias-where models overly rely on texture information rather than shape information. Yet, existing approaches for measuring and mitigating texture bias have not been able to capture how textures impact model robustness in real-world settings. In this work, we introduce the Texture Association Value (TAV), a novel metric that quantifies how strongly models rely on the presence of specific textures when classifying objects. Leveraging TAV, we demonstrate that model accuracy and robustness are heavily influenced by texture. Our results show that texture bias explains the existence of natural adversarial examples, where over 90% of these samples contain textures that are misaligned with the learned texture of their true label, resulting in confident mispredictions.
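TAV 的精确定义以原论文为准;作为直观示意,可以用"模型在只包含某种纹理的图像上对目标类别给出的平均置信度"来近似衡量类别与纹理的关联强度,如下(模型与数据均为假设):

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def texture_association_sketch(model, texture_images, class_idx):
    """示意:用纹理图像批在目标类别上的平均 softmax 概率近似"类别-纹理"关联强度。"""
    model.eval()
    probs = F.softmax(model(texture_images), dim=-1)        # (N, num_classes)
    return probs[:, class_idx].mean().item()

# 用法示意(假设的模型与某一纹理类别的图像批)
model = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(3 * 64 * 64, 1000))
textures = torch.randn(32, 3, 64, 64)
print(texture_association_sketch(model, textures, class_idx=207))
```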
zh
[CV-229] owards Unified Benchmark and Models for Multi-Modal Perceptual Metrics
【速读】: 该论文试图解决多模态输入下人类感知相似性的复杂性问题,并开发能够准确模拟人类感知相似性的自动化度量方法。解决方案的关键在于引入UniSim-Bench基准,涵盖7个多模态感知相似性任务和25个数据集,通过评估通用模型和专门模型的表现,发现通用模型在平均表现上尚可,但在具体任务上往往不如专门模型,而专门模型在未见任务上的泛化能力较差。为此,论文提出对基于编码器和生成式视觉-语言模型进行多任务微调,以期在某些任务上超越专门模型,但仍面临泛化到未见任务的挑战,表明构建一个能够捕捉人类相似性概念的统一感知相似性度量仍是一个持续的挑战。
链接: https://arxiv.org/abs/2412.10594
作者: Sara Ghazanfari,Siddharth Garg,Nicolas Flammarion,Prashanth Krishnamurthy,Farshad Khorrami,Francesco Croce
机构: New York University, US(纽约大学,美国); EPFL, Switzerland(瑞士联邦理工学院,瑞士)
关键词: develop automated metrics, highly complex, making it challenging, multimodal inputs, inputs is highly
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:
点击查看摘要
Abstract:Human perception of similarity across uni- and multimodal inputs is highly complex, making it challenging to develop automated metrics that accurately mimic it. General purpose vision-language models, such as CLIP and large multi-modal models (LMMs), can be applied as zero-shot perceptual metrics, and several recent works have developed models specialized in narrow perceptual tasks. However, the extent to which existing perceptual metrics align with human perception remains unclear. To investigate this question, we introduce UniSim-Bench, a benchmark encompassing 7 multi-modal perceptual similarity tasks, with a total of 25 datasets. Our evaluation reveals that while general-purpose models perform reasonably well on average, they often lag behind specialized models on individual tasks. Conversely, metrics fine-tuned for specific tasks fail to generalize well to unseen, though related, tasks. As a first step towards a unified multi-task perceptual similarity metric, we fine-tune both encoder-based and generative vision-language models on a subset of the UniSim-Bench tasks. This approach yields the highest average performance, and in some cases, even surpasses task-specific models. Nevertheless, these models still struggle with generalization to unseen tasks, highlighting the ongoing challenge of learning a robust, unified perceptual similarity metric capable of capturing the human notion of similarity. The code and models are available at this https URL.
zh
[CV-230] PanSR: An Object-Centric Mask Transformer for Panoptic Segmentation
【速读】: 该论文试图解决当前基于掩码变换器(mask-transformer)的视觉全景分割(panoptic segmentation)方法在处理小物体、拥挤场景和多尺度物体时面临的挑战。具体问题包括:查询提议生成过程偏向大物体,导致小物体漏检;初始定位良好的查询可能漂移到其他物体,导致检测遗漏;空间上分离的实例可能被合并为一个掩码,导致场景解释不一致和错误。解决方案的关键在于重新设计网络的各个组件及其监督机制,提出了一种名为PanSR的新方法。PanSR通过有效缓解实例合并、增强小物体检测以及提升拥挤场景中的表现,显著提高了在具有挑战性的LaRS基准上的性能,相较于现有技术实现了+3.4 PQ的提升,并在Cityscapes数据集上达到了最先进的性能。
链接: https://arxiv.org/abs/2412.10589
作者: Lojze Žust,Matej Kristan
机构: University of Ljubljana (卢布尔雅那大学)
关键词: autonomous vehicles, task in computer, computer vision, perception in autonomous, Panoptic segmentation
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 8 pages, 9 figures
点击查看摘要
Abstract:Panoptic segmentation is a fundamental task in computer vision and a crucial component for perception in autonomous vehicles. Recent mask-transformer-based methods achieve impressive performance on standard benchmarks but face significant challenges with small objects, crowded scenes and scenes exhibiting a wide range of object scales. We identify several fundamental shortcomings of the current approaches: (i) the query proposal generation process is biased towards larger objects, resulting in missed smaller objects, (ii) initially well-localized queries may drift to other objects, resulting in missed detections, (iii) spatially well-separated instances may be merged into a single mask causing inconsistent and false scene interpretations. To address these issues, we rethink the individual components of the network and its supervision, and propose a novel method for panoptic segmentation PanSR. PanSR effectively mitigates instance merging, enhances small-object detection and increases performance in crowded scenes, delivering a notable +3.4 PQ improvement over state-of-the-art on the challenging LaRS benchmark, while reaching state-of-the-art performance on Cityscapes. The code and models will be publicly available at this https URL.
zh
[CV-231] ExeChecker: Where Did I Go Wrong?
【速读】: 该论文试图解决康复训练中对用户执行动作的准确性进行解释和反馈的问题。解决方案的关键在于提出了一种基于对比学习 (contrastive learning) 的框架ExeChecker,结合人体姿态估计 (human pose estimation)、图注意力神经网络 (graph-attention neural networks) 和Transformer可解释性,通过对比正确与错误执行的动作,识别并突出显示需要用户注意的关节部位。该方法在自建的ExeCheck数据集和UI-PRMD数据集上进行了验证,结果表明其优于基于成对序列对齐的基线方法,能够更有效地识别与康复训练相关的关节。
链接: https://arxiv.org/abs/2412.10573
作者: Yiwen Gu,Mahir Patel,Margrit Betke
机构: 未知
关键词: learning based framework, based framework, contrastive learning based, exercises, present a contrastive
类目: Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
备注:
点击查看摘要
Abstract:In this paper, we present a contrastive learning based framework, ExeChecker, for the interpretation of rehabilitation exercises. Our work builds upon state-of-the-art advances in the area of human pose estimation, graph-attention neural networks, and transformer interpretability. The downstream task is to assist rehabilitation by providing informative feedback to users while they are performing prescribed exercises. We utilize a contrastive learning strategy during training. Given a tuple of correctly and incorrectly executed exercises, our model is able to identify and highlight those joints that are involved in an incorrect movement and thus require the user’s attention. We collected an in-house dataset, ExeCheck, with paired recordings of both correct and incorrect execution of exercises. In our experiments, we tested our method on this dataset as well as the UI-PRMD dataset and found ExeChecker outperformed the baseline method using pairwise sequence alignment in identifying joints of physical relevance in rehabilitation exercises.
zh
[CV-232] Learning to Merge Tokens via Decoupled Embedding for Efficient Vision Transformers NEURIPS2024
【速读】: 该论文试图解决现有视觉Transformer (Vision Transformers, ViTs)中token合并方法依赖于中间特征的问题,这些方法无法利用专门为合并设计的特征,并且需要端到端的训练来改进token合并。解决方案的关键是提出了解耦的token嵌入合并方法 (Decoupled Token Embedding for Merging, DTEM),通过引入一个轻量级的嵌入模块,该模块与ViT的前向传播解耦,专门提取用于token合并的特征。这种方法通过在训练过程中应用连续松弛的token合并,实现了可微分的解耦嵌入学习,从而解决了依赖中间特征的限制。DTEM可以无缝集成到现有的ViT骨干网络中,并且可以模块化训练或端到端微调,在多个任务(如分类、描述生成和分割)中均表现出一致的token合并性能提升。
链接: https://arxiv.org/abs/2412.10569
作者: Dong Hoon Lee,Seunghoon Hong
机构: KAIST(韩国科学技术院); KAIST(韩国科学技术院)
关键词: Vision Transformers, Recent token reduction, token merging, Recent token, token
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: NeurIPS 2024
点击查看摘要
Abstract:Recent token reduction methods for Vision Transformers (ViTs) incorporate token merging, which measures the similarities between token embeddings and combines the most similar pairs. However, their merging policies are directly dependent on intermediate features in ViTs, which prevents exploiting features tailored for merging and requires end-to-end training to improve token merging. In this paper, we propose Decoupled Token Embedding for Merging (DTEM) that enhances token merging through a decoupled embedding learned via a continuously relaxed token merging process. Our method introduces a lightweight embedding module decoupled from the ViT forward pass to extract dedicated features for token merging, thereby addressing the restriction from using intermediate features. The continuously relaxed token merging, applied during training, enables us to learn the decoupled embeddings in a differentiable manner. Thanks to the decoupled structure, our method can be seamlessly integrated into existing ViT backbones and trained either modularly by learning only the decoupled embeddings or end-to-end by fine-tuning. We demonstrate the applicability of DTEM on various tasks, including classification, captioning, and segmentation, with consistent improvement in token merging. Especially in the ImageNet-1k classification, DTEM achieves a 37.2% reduction in FLOPs while maintaining a top-1 accuracy of 79.85% with DeiT-small. Code is available at this https URL.
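下面的草图演示"用与主干解耦的轻量嵌入计算 token 相似度并合并最相似的 token 对"这一思想;此处用贪心的硬合并近似推理阶段的行为,训练时的连续松弛与官方实现并非如此,细节均为假设:

```python
import torch
import torch.nn as nn

class DecoupledMergeSketch(nn.Module):
    """示意:用解耦的轻量嵌入度量 token 相似度,贪心合并最相似的 r 对(非 DTEM 官方实现)。"""
    def __init__(self, dim=384, merge_dim=64, r=8):
        super().__init__()
        self.embed = nn.Linear(dim, merge_dim)              # 专用于合并的解耦嵌入
        self.r = r

    def forward(self, tokens):                              # tokens: (N, dim),单样本
        e = nn.functional.normalize(self.embed(tokens), dim=-1)
        sim = e @ e.t()                                     # 余弦相似度矩阵
        sim.fill_diagonal_(-1.0)
        merged = tokens.clone()
        alive = torch.ones(tokens.size(0), dtype=torch.bool)
        for _ in range(self.r):
            idx = sim.argmax()
            i, j = divmod(idx.item(), sim.size(1))
            merged[i] = (merged[i] + merged[j]) / 2          # 平均合并一对 token
            alive[j] = False
            sim[j, :] = -1.0                                 # 屏蔽已被合并掉的 token
            sim[:, j] = -1.0
        return merged[alive]                                 # 剩余 N - r 个 token

print(DecoupledMergeSketch()(torch.randn(196, 384)).shape)  # torch.Size([188, 384])
```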
zh
[CV-233] EVLM: Self-Reflective Multimodal Reasoning for Cross-Dimensional Visual Editing
【速读】: 该论文试图解决视觉-语言建模中基于模糊指令编辑复杂视觉内容的挑战。现有模型虽然能够上下文化内容,但往往难以理解参考图像或场景中的潜在意图,导致编辑结果与预期不符。解决方案的关键在于引入了编辑视觉-语言模型 (Editing Vision-Language Model, EVLM),该系统通过结合参考视觉内容来解释指令,生成精确且上下文感知的编辑提示。EVLM 利用链式思维 (Chain-of-Thought, CoT) 推理和 KL 散度目标优化 (KL-Divergence Target Optimization, KTO) 对齐技术,捕捉主观编辑偏好,无需二元标签。通过在包含 30,000 个 CoT 示例的数据集上进行微调,并由人类评估者对推理路径进行评分,EVLM 在与人意图对齐方面表现出显著改进,并在图像、视频、3D 和 4D 编辑任务中生成连贯、高质量的指令,支持复杂视觉-语言应用的可扩展框架。
链接: https://arxiv.org/abs/2412.10566
作者: Umar Khalid,Hasan Iqbal,Azib Farooq,Nazanin Rahnavard,Jing Hua,Chen Chen
机构: Center for Research in Computer Vision, University of Central Florida, USA; Department of Computer Science, Wayne State University, USA; Miami University, USA
关键词: ambiguous instructions remains, visual content based, based on ambiguous, remains a challenging, challenging problem
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Technical Report
点击查看摘要
Abstract:Editing complex visual content based on ambiguous instructions remains a challenging problem in vision-language modeling. While existing models can contextualize content, they often struggle to grasp the underlying intent within a reference image or scene, leading to misaligned edits. We introduce the Editing Vision-Language Model (EVLM), a system designed to interpret such instructions in conjunction with reference visuals, producing precise and context-aware editing prompts. Leveraging Chain-of-Thought (CoT) reasoning and KL-Divergence Target Optimization (KTO) alignment technique, EVLM captures subjective editing preferences without requiring binary labels. Fine-tuned on a dataset of 30,000 CoT examples, with rationale paths rated by human evaluators, EVLM demonstrates substantial improvements in alignment with human intentions. Experiments across image, video, 3D, and 4D editing tasks show that EVLM generates coherent, high-quality instructions, supporting a scalable framework for complex vision-language applications.
zh
[CV-234] SUGAR: Subject-Driven Video Customization in a Zero-Shot Manner
【速读】: 该论文试图解决在无需测试时微调的情况下,实现基于主题驱动的视频定制问题。解决方案的关键在于提出了SUGAR方法,通过构建一个可扩展的合成数据集(包含250万个图像-视频-文本三元组),并结合特殊注意力设计、改进的训练策略和优化的采样算法,实现了零样本(zero-shot)能力。SUGAR不仅能够在输入图像的基础上生成与用户指定视觉属性(如风格和动作)对齐的视频,还避免了传统方法中测试时额外成本的需求,从而在身份保持、视频动态和视频-文本对齐方面达到了最先进的性能。
链接: https://arxiv.org/abs/2412.10533
作者: Yufan Zhou,Ruiyi Zhang,Jiuxiang Gu,Nanxuan Zhao,Jing Shi,Tong Sun
机构: Adobe Research(Adobe研究)
关键词: subject-driven video customization, present SUGAR, SUGAR, SUGAR achieves, video customization
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: webpage this https URL
点击查看摘要
Abstract:We present SUGAR, a zero-shot method for subject-driven video customization. Given an input image, SUGAR is capable of generating videos for the subject contained in the image and aligning the generation with arbitrary visual attributes such as style and motion specified by user-input text. Unlike previous methods, which require test-time fine-tuning or fail to generate text-aligned videos, SUGAR achieves superior results without the need for extra cost at test-time. To enable zero-shot capability, we introduce a scalable pipeline to construct a synthetic dataset which is specifically designed for subject-driven customization, leading to 2.5 million image-video-text triplets. Additionally, we propose several methods to enhance our model, including special attention designs, improved training strategies, and a refined sampling algorithm. Extensive experiments are conducted. Compared to previous methods, SUGAR achieves state-of-the-art results in identity preservation, video dynamics, and video-text alignment for subject-driven video customization, demonstrating the effectiveness of our proposed method.
zh
[CV-235] RowDetr: End-to-End Row Detection Using Polynomials
【速读】: 该论文试图解决在GPS信号受限的农业环境中(如作物冠层下)进行作物行检测的问题,以实现导航功能。解决方案的关键是提出了RowDetr,一种端到端神经网络,利用平滑多项式函数在图像空间中描绘作物边界。此外,论文引入了一种新颖的基于能量的损失函数(PolyOptLoss),以增强模型在噪声标签情况下的学习鲁棒性。该模型在关键性能指标上比Agronav提升了3%,并且速度快了六倍,非常适合实时应用。
链接: https://arxiv.org/abs/2412.10525
作者: Rahul Harsha Cheppally,Ajay Sharda
机构: 未知
关键词: under-canopy agricultural settings, garnered significant interest, significant interest due, Crop row detection, GPS-denied environments
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
备注: Code will be open sourced upon publication
点击查看摘要
Abstract:Crop row detection has garnered significant interest due to its critical role in enabling navigation in GPS-denied environments, such as under-canopy agricultural settings. To address this challenge, we propose RowDetr, an end-to-end neural network that utilizes smooth polynomial functions to delineate crop boundaries in image space. A novel energy-based loss function, PolyOptLoss, is introduced to enhance learning robustness, even with noisy labels. The proposed model demonstrates a 3% improvement over Agronav in key performance metrics while being six times faster, making it well-suited for real-time applications. Additionally, metrics from lane detection studies were adapted to comprehensively evaluate the system, showcasing its accuracy and adaptability in various scenarios.
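RowDetr 用平滑多项式在图像空间描述作物行。下面用最小二乘拟合演示"多项式行表示"本身(x = f(y) 的拟合与采样),并非论文中端到端网络的输出头:

```python
import numpy as np

def fit_row_polynomial(points, degree=3):
    """示意:用多项式 x = f(y) 在图像空间拟合一条作物行边界(最小二乘,非 RowDetr 网络)。
    points: (N, 2) 的 (x, y) 像素点。"""
    ys, xs = points[:, 1], points[:, 0]
    return np.polyfit(ys, xs, degree)                       # 多项式系数

def sample_row(coeffs, y_min, y_max, num=50):
    """按给定系数在 [y_min, y_max] 上采样该行的 (x, y) 点列,便于可视化或计算损失。"""
    ys = np.linspace(y_min, y_max, num)
    return np.stack([np.polyval(coeffs, ys), ys], axis=1)

pts = np.array([[310, 100], [330, 200], [365, 300], [410, 400]], dtype=float)
coeffs = fit_row_polynomial(pts)
print(sample_row(coeffs, 100, 400, num=4).round(1))
```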
zh
[CV-236] he Language of Motion: Unifying Verbal and Non-verbal Language of 3D Human Motion
【速读】: 该论文试图解决现有运动生成模型在处理多模态输入(如语音、文本和运动数据)时的局限性问题。解决方案的关键在于提出了一种新颖的框架,通过多模态语言模型(multimodal language models)统一了口头语言和非口头语言(verbal and non-verbal language),从而能够灵活地接受文本、语音、运动数据或其任意组合作为输入。该框架结合了创新的预训练策略,不仅在协同语音手势生成(co-speech gesture generation)方面达到了最先进的性能,还显著减少了训练所需的数据量。此外,该模型还支持可编辑手势生成和从运动中预测情绪等新任务。
链接: https://arxiv.org/abs/2412.10523
作者: Changan Chen,Juze Zhang,Shrinidhi K. Lakshmikanth,Yusu Fang,Ruizhi Shao,Gordon Wetzstein,Li Fei-Fei,Ehsan Adeli
机构: Stanford University (斯坦福大学)
关键词: facial expressions, communication is inherently, verbal and non-verbal, Human communication, motion
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project page: this http URL
点击查看摘要
Abstract:Human communication is inherently multimodal, involving a combination of verbal and non-verbal cues such as speech, facial expressions, and body gestures. Modeling these behaviors is essential for understanding human interaction and for creating virtual characters that can communicate naturally in applications like games, films, and virtual reality. However, existing motion generation models are typically limited to specific input modalities – either speech, text, or motion data – and cannot fully leverage the diversity of available data. In this paper, we propose a novel framework that unifies verbal and non-verbal language using multimodal language models for human motion understanding and generation. This model is flexible in taking text, speech, and motion or any combination of them as input. Coupled with our novel pre-training strategy, our model not only achieves state-of-the-art performance on co-speech gesture generation but also requires much less data for training. Our model also unlocks an array of novel tasks such as editable gesture generation and emotion prediction from motion. We believe unifying the verbal and non-verbal language of human motion is essential for real-world applications, and language models offer a powerful approach to achieving this goal. Project page: this http URL.
zh
[CV-237] Automated Image Captioning with CNNs and Transformers
【速读】: 该论文旨在解决图像自动描述生成的问题,通过结合计算机视觉和自然语言处理技术,构建一个能够为输入图像生成自然语言描述的自动化系统。解决方案的关键在于采用多种技术,包括从传统的卷积神经网络-循环神经网络 (CNN-RNN) 到更先进的基于Transformer的技术,并通过在带有描述性字幕的图像数据集上进行训练来实现。此外,论文还涉及对注意力机制的实验、不同架构选择的比较以及超参数优化,以提高描述的准确性和系统的整体效果。评估模型性能时使用了BLEU、METEOR和CIDEr等标准指标。
链接: https://arxiv.org/abs/2412.10511
作者: Joshua Adrian Cahyono,Jeremy Nathan Jusuf
机构: Nanyang Technological University(南洋理工大学); College of Computing and Data Science(计算与数据科学学院)
关键词: natural language processing, generates natural language, natural language descriptions, natural language, language processing
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:This project aims to create an automated image captioning system that generates natural language descriptions for input images by integrating techniques from computer vision and natural language processing. We employ various different techniques, ranging from CNN-RNN to the more advanced transformer-based techniques. Training is carried out on image datasets paired with descriptive captions, and model performance will be evaluated using established metrics such as BLEU, METEOR, and CIDEr. The project will also involve experimentation with advanced attention mechanisms, comparisons of different architectural choices, and hyperparameter optimization to refine captioning accuracy and overall system effectiveness.
zh
[CV-238] SnapGen-V: Generating a Five-Second Video within Five Seconds on a Mobile Device
【速读】: 该论文试图解决视频生成模型因计算需求高而主要依赖云服务器,限制了内容创作者广泛采用的问题。解决方案的关键在于提出一个综合的加速框架,通过从紧凑的图像骨干网络出发,优化时间层的设计和排列以最大化硬件效率,并采用专门的对抗性微调算法,将去噪步骤减少到4步。最终,该模型以仅0.6B参数的规模,在iPhone 16 PM上实现5秒内生成5秒视频,显著加速了生成过程并保持了与服务器端模型相当的生成质量。
链接: https://arxiv.org/abs/2412.10494
作者: Yushu Wu,Zhixing Zhang,Yanyu Li,Yanwu Xu,Anil Kag,Yang Sui,Huseyin Coskun,Ke Ma,Aleksei Lebedev,Ju Hu,Dimitris Metaxas,Yanzhi Wang,Sergey Tulyakov,Jian Ren
机构: Snap Inc.; Northeastern University; Rutgers University
关键词: past year, diffusion-based video generation, witnessed the unprecedented, unprecedented success, success of diffusion-based
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Performance (cs.PF)
备注: this https URL
点击查看摘要
Abstract:We have witnessed the unprecedented success of diffusion-based video generation over the past year. Recently proposed models from the community have wielded the power to generate cinematic and high-resolution videos with smooth motions from arbitrary input prompts. However, as a supertask of image generation, video generation models require more computation and are thus hosted mostly on cloud servers, limiting broader adoption among content creators. In this work, we propose a comprehensive acceleration framework to bring the power of the large-scale video diffusion model to the hands of edge users. From the network architecture scope, we initialize from a compact image backbone and search out the design and arrangement of temporal layers to maximize hardware efficiency. In addition, we propose a dedicated adversarial fine-tuning algorithm for our efficient model and reduce the denoising steps to 4. Our model, with only 0.6B parameters, can generate a 5-second video on an iPhone 16 PM within 5 seconds. Compared to server-side models that take minutes on powerful GPUs to generate a single video, we accelerate the generation by magnitudes while delivering on-par quality.
zh
[CV-239] SafetyDPO: Scalable Safety Alignment for Text-to-Image Generation
【速读】: 该论文试图解决文本生成图像 (Text-to-Image, T2I) 模型在安全性方面的局限性问题,特别是现有安全措施(如基于文本的过滤或概念移除策略)只能移除少量有害概念的不足。解决方案的关键在于引入了一种名为 SafetyDPO 的方法,通过直接偏好优化 (Direct Preference Optimization, DPO) 实现 T2I 模型的安全对齐。具体来说,研究者通过合成生成有害和安全的图像-文本对数据集 CoProV2,并采用自定义的 DPO 策略训练安全专家(以低秩适应矩阵 LoRA 形式),以引导生成过程远离特定的安全相关概念。随后,通过一种新颖的合并策略将这些专家合并为一个 LoRA,从而实现最佳的扩展性能。这种方法显著提高了有害概念的移除能力,比基线方法多移除 7 倍的有害概念,并在多个基准测试中超越了现有技术水平。
链接: https://arxiv.org/abs/2412.10493
作者: Runtao Liu,Chen I Chieh,Jindong Gu,Jipeng Zhang,Renjie Pi,Qifeng Chen,Philip Torr,Ashkan Khakzar,Fabio Pizzati
机构: 未知
关键词: guardrails expose end, expose end users, Direct Preference Optimization, safety guardrails expose, guardrails expose
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
点击查看摘要
Abstract:Text-to-image (T2I) models have become widespread, but their limited safety guardrails expose end users to harmful content and potentially allow for model misuse. Current safety measures are typically limited to text-based filtering or concept removal strategies, able to remove just a few concepts from the model’s generative capabilities. In this work, we introduce SafetyDPO, a method for safety alignment of T2I models through Direct Preference Optimization (DPO). We enable the application of DPO for safety purposes in T2I models by synthetically generating a dataset of harmful and safe image-text pairs, which we call CoProV2. Using a custom DPO strategy and this dataset, we train safety experts, in the form of low-rank adaptation (LoRA) matrices, able to guide the generation process away from specific safety-related concepts. Then, we merge the experts into a single LoRA using a novel merging strategy for optimal scaling performance. This expert-based approach enables scalability, allowing us to remove 7 times more harmful concepts from T2I models compared to baselines. SafetyDPO consistently outperforms the state-of-the-art on many benchmarks and establishes new practices for safety alignment in T2I networks. Code and data will be shared at this https URL.
zh
[CV-240] QSM-RimDS: A highly sensitive paramagnetic rim lesion detection and segmentation tool for multiple sclerosis lesions
【速读】: 该论文旨在解决多发性硬化症(MS)中磁敏感加权成像(QSM)上的顺磁性边缘病变(PRLs)检测问题,特别是现有工具QSM-RimNet需要精确的QSM病变掩膜且无法提供边缘分割的局限性。解决方案的关键是开发了一种基于深度学习的工具QSM-RimDS,该工具能够利用易于获取的FLAIR病变掩膜进行PRLs检测,并提供边缘分割以用于小胶质细胞的定量分析。QSM-RimDS在PRLs检测方面达到了最先进的性能,有望在临床实践中辅助人工读者进行耗时的PRLs检测和分割任务。
链接: https://arxiv.org/abs/2412.10492
作者: Ha Luu,Mert Sisman,Ilhami Kovanlikaya,Tam Vu,Pascal Spincemaille,Yi Wang,Francesca Bagnato,Susan Gauthier,Thanh Nguyen
机构: 未知
关键词: Paramagnetic rim lesions, innate immune response, Paramagnetic rim, QSM lesion mask, PRL detection
类目: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
备注: 11 pages, 6 figures
点击查看摘要
Abstract:Paramagnetic rim lesions (PRLs) are imaging biomarker of the innate immune response in MS lesions. QSM-RimNet, a state-of-the-art tool for PRLs detection on QSM, can identify PRLs but requires precise QSM lesion mask and does not provide rim segmentation. Therefore, the aims of this study are to develop QSM-RimDS algorithm to detect PRLs using the readily available FLAIR lesion mask and to provide rim segmentation for microglial quantification. QSM-RimDS, a deep-learning based tool for joint PRL rim segmentation and PRL detection has been developed. QSM-RimDS has obtained state-of-the art performance in PRL detection and therefore has the potential to be used in clinical practice as a tool to assist human readers for the time-consuming PRL detection and segmentation task. QSM-RimDS is made publicly available [this https URL]
zh
[CV-241] CognitionCapturer: Decoding Visual Stimuli From Human EEG Signal With Multimodal Information
【速读】: 该论文试图解决现有研究中仅关注脑电图(EEG)信号与图像数据对之间的关系,而忽视了EEG信号中嵌入的“超越图像模态”信息的问题。解决方案的关键在于提出了CognitionCapturer框架,该框架通过训练模态专家编码器(Modality Expert Encoders)从EEG模态中提取跨模态信息,并引入扩散先验(diffusion prior)将EEG嵌入空间映射到CLIP嵌入空间,从而利用预训练的生成模型重建视觉刺激,且无需对生成模型进行微调。该框架能够有效利用多模态数据,提升EEG信号的表示能力,并在实验中表现出优于现有最先进方法的性能。
链接: https://arxiv.org/abs/2412.10489
作者: Kaifan Zhang,Lihuo He,Xin Jiang,Wen Lu,Di Wang,Xinbo Gao
机构: School of Computer Science and Engineering, University of Electronic Science and Technology of China (电子科技大学计算机科学与工程学院)
关键词: attracted significant attention, high temporal sensitivity, EEG, EEG signals, attracted significant
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Signal Processing (eess.SP)
备注:
点击查看摘要
Abstract:Electroencephalogram (EEG) signals have attracted significant attention from researchers due to their non-invasive nature and high temporal sensitivity in decoding visual stimuli. However, most recent studies have focused solely on the relationship between EEG and image data pairs, neglecting the valuable ``beyond-image-modality" information embedded in EEG signals. This results in the loss of critical multimodal information in EEG. To address this limitation, we propose CognitionCapturer, a unified framework that fully leverages multimodal data to represent EEG signals. Specifically, CognitionCapturer trains Modality Expert Encoders for each modality to extract cross-modal information from the EEG modality. Then, it introduces a diffusion prior to map the EEG embedding space to the CLIP embedding space, followed by using a pretrained generative model, the proposed framework can reconstruct visual stimuli with high semantic and structural fidelity. Notably, the framework does not require any fine-tuning of the generative models and can be extended to incorporate more modalities. Through extensive experiments, we demonstrate that CognitionCapturer outperforms state-of-the-art methods both qualitatively and quantitatively. Code: this https URL.
zh
[CV-242] SVGBuilder: Component-Based Colored SVG Generation with Text-Guided Autoregressive Transformers
【速读】: 该论文试图解决现有SVG生成方法中高计算成本和复杂性的问题。解决方案的关键在于引入了一种基于组件的自动回归模型SVGBuilder,该模型通过模仿人类设计师使用组件工具的方式,显著降低了计算开销并提高了生成效率。与传统的优化方法相比,SVGBuilder能够将SVG生成速度提升至604倍。此外,为了支持该模型的研究,论文还引入了首个大规模彩色SVG数据集ColorSVG-100K,包含100,000个图形,填补了现有SVG数据集中颜色信息的空白,增强了模型训练的多样性。
链接: https://arxiv.org/abs/2412.10488
作者: Zehao Chen,Rong Pan
机构: 未知
关键词: Scalable Vector Graphics, Scalable Vector, offering resolution independence, essential XML-based formats, Vector Graphics
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Graphics (cs.GR)
备注: Project: this https URL
点击查看摘要
Abstract:Scalable Vector Graphics (SVG) are essential XML-based formats for versatile graphics, offering resolution independence and scalability. Unlike raster images, SVGs use geometric shapes and support interactivity, animation, and manipulation via CSS and JavaScript. Current SVG generation methods face challenges related to high computational costs and complexity. In contrast, human designers use component-based tools for efficient SVG creation. Inspired by this, SVGBuilder introduces a component-based, autoregressive model for generating high-quality colored SVGs from textual input. It significantly reduces computational overhead and improves efficiency compared to traditional methods. Our model generates SVGs up to 604 times faster than optimization-based approaches. To address the limitations of existing SVG datasets and support our research, we introduce ColorSVG-100K, the first large-scale dataset of colored SVGs, comprising 100,000 graphics. This dataset fills the gap in color information for SVG generation models and enhances diversity in model training. Evaluation against state-of-the-art models demonstrates SVGBuilder’s superior performance in practical applications, highlighting its efficiency and quality in generating complex SVG graphics.
zh
[CV-243] Dynamic Entity-Masked Graph Diffusion Model for histopathological image Representation Learning
【速读】: 该论文试图解决自然图像与病理图像特征差异大、病理图像标注稀缺的问题,并提出了一种新的自监督学习方法 H-MGDM (Histopathology image representation learning method through the Dynamic Entity-Masked Graph Diffusion Model)。解决方案的关键在于通过动态实体掩码图扩散模型,利用互补子图作为潜在扩散条件和自监督目标,增强病理实体的空间交互和拓扑关系的嵌入,从而提升病理图像的细粒度重建和表示能力。该方法在多个大型病理数据集上进行了预训练,并在分类和生存分析等下游任务中展示了优越的预测性能和可解释性。
链接: https://arxiv.org/abs/2412.10482
作者: Zhenfeng Zhuang,Min Cen,Yanfeng Li,Fangyu Zhou,Lequan Yu,Baptiste Magnier,Liansheng Wang
机构: 1. Shanghai Jiao Tong University (上海交通大学); 2. Shanghai Artificial Intelligence Laboratory (上海人工智能实验室); 3. University of Hong Kong (香港大学); 4. Meta; 5. Université Paris-Saclay (巴黎萨克雷大学)
关键词: Significant disparities, transfer pre-trained models, natural images, make it challenging, challenging to directly
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:Significant disparities between the features of natural images and those inherent to histopathological images make it challenging to directly apply and transfer pre-trained models from natural images to histopathology tasks. Moreover, the frequent lack of annotations in histopathology patch images has driven researchers to explore self-supervised learning methods like mask reconstruction for learning representations from large amounts of unlabeled data. Crucially, previous mask-based efforts in self-supervised learning have often overlooked the spatial interactions among entities, which are essential for constructing accurate representations of pathological entities. To address these challenges, constructing graphs of entities is a promising approach. In addition, the diffusion reconstruction strategy has recently shown superior performance through its random intensity noise addition technique to enhance the robust learned representation. Therefore, we introduce H-MGDM, a novel self-supervised Histopathology image representation learning method through the Dynamic Entity-Masked Graph Diffusion Model. Specifically, we propose to use complementary subgraphs as latent diffusion conditions and self-supervised targets respectively during pre-training. We note that the graph can embed entities’ topological relationships and enhance representation. Dynamic conditions and targets can improve pathological fine reconstruction. Our model has conducted pretraining experiments on three large histopathological datasets. The advanced predictive performance and interpretability of H-MGDM are clearly evaluated on comprehensive downstream tasks such as classification and survival analysis on six datasets. Our code will be publicly available at this https URL.
zh
[CV-244] CrossVIT-augmented Geospatial-Intelligence Visualization System for Tracking Economic Development Dynamics
【速读】: 该论文试图解决经济数据在时效性和空间分辨率方面的挑战,关键解决方案是引入Senseconomic系统,通过多模态感知(multimodal sensing)和分布式计算(distributed computing)来实时追踪经济动态。该系统基于Transformer框架,利用交叉注意力机制(cross-attention)整合遥感图像和街景图像,并以夜间灯光数据作为弱监督(weak supervision)。其核心创新在于高效的数据处理和预测能力,通过分布式计算将处理时间缩短至23分钟,并在县域经济预测中实现了0.8363的R平方值。此外,系统还具备用户友好的设计,前端采用Vue3和百度地图进行可视化,后端使用Python自动化图像下载和预处理任务。
链接: https://arxiv.org/abs/2412.10474
作者: Yanbing Bai,Jinhua Su,Bin Qiao,Xiaoran Ma
机构: Center for Applied Statistics and School of Statistics Renmin University of China (应用统计中心和统计学院 中国人民大学); School of Information Renmin University of China (信息学院 中国人民大学); Department of Statistics and Applied Probability University of California Santa Barbara (统计与应用概率系 加州大学圣塔芭芭拉分校)
关键词: Timely and accurate, accurate economic data, effective policymaking, crucial for effective, Timely
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:Timely and accurate economic data is crucial for effective policymaking. Current challenges in data timeliness and spatial resolution can be addressed with advancements in multimodal sensing and distributed computing. We introduce Senseconomic, a scalable system for tracking economic dynamics via multimodal imagery and deep learning. Built on the Transformer framework, it integrates remote sensing and street view images using cross-attention, with nighttime light data as weak supervision. The system achieved an R-squared value of 0.8363 in county-level economic predictions and halved processing time to 23 minutes using distributed computing. Its user-friendly design includes a Vue3-based front end with Baidu maps for visualization and a Python-based back end automating tasks like image downloads and preprocessing. Senseconomic empowers policymakers and researchers with efficient tools for resource allocation and economic planning.
zh
[CV-245] VCA: Video Curious Agent for Long Video Understanding
【速读】: 该论文试图解决长视频理解中的时间复杂性和信息密度低的问题。解决方案的关键在于引入了一个具有自探索能力的好奇心驱动视频代理 (VCA),该代理基于视觉语言模型 (VLM),能够自主导航视频片段并高效构建对复杂视频序列的全面理解。VCA采用树搜索结构探索视频片段并收集帧,而不是直接采样帧,同时利用VLM生成的内在奖励来指导其探索,从而捕捉推理中最关键的信息。
链接: https://arxiv.org/abs/2412.10471
作者: Zeyuan Yang,Delin Chen,Xueyang Yu,Maohao Shen,Chuang Gan
机构: University of Massachusetts, Amherst(马萨诸塞大学阿默斯特分校); Massachusetts Institute of Technology(麻省理工学院)
关键词: poses unique challenges, unique challenges due, low information density, understanding poses unique, poses unique
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:Long video understanding poses unique challenges due to their temporal complexity and low information density. Recent works address this task by sampling numerous frames or incorporating auxiliary tools using LLMs, both of which result in high computational costs. In this work, we introduce a curiosity-driven video agent with self-exploration capability, dubbed as VCA. Built upon VLMs, VCA autonomously navigates video segments and efficiently builds a comprehensive understanding of complex video sequences. Instead of directly sampling frames, VCA employs a tree-search structure to explore video segments and collect frames. Rather than relying on external feedback or reward, VCA leverages VLM’s self-generated intrinsic reward to guide its exploration, enabling it to capture the most crucial information for reasoning. Experimental results on multiple long video benchmarks demonstrate our approach’s superior effectiveness and efficiency.
zh
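下面是一个示意性的 Python 片段,用来说明这类"树搜索式片段探索"的控制流程:把整段视频逐层切分,用(此处为占位的)VLM 内在奖励给各子片段打分,沿得分最高的分支继续展开并采帧。函数名、分支数等均为假设,并非 VCA 的官方实现。

```python
# 极简示意:用树搜索在视频片段上迭代选取最有信息量的片段并采帧
# (假设 score_segment 是由 VLM 给出的内在奖励接口,此处用占位函数代替)
from typing import List, Tuple

def score_segment(frames: List[int], question: str) -> float:
    """占位:真实系统中应调用 VLM,对该片段与问题的相关性打分。"""
    return float(len(frames))  # 仅作演示

def tree_search_frames(num_frames: int, question: str,
                       max_rounds: int = 3, branch: int = 4) -> List[int]:
    segment = list(range(num_frames))          # 根节点:整段视频
    collected: List[int] = []
    for _ in range(max_rounds):
        if len(segment) <= branch:
            break
        step = len(segment) // branch
        children = [segment[i * step:(i + 1) * step] for i in range(branch)]
        scored: List[Tuple[float, List[int]]] = [
            (score_segment(c, question), c) for c in children if c
        ]
        best_score, best_child = max(scored, key=lambda x: x[0])
        collected.append(best_child[len(best_child) // 2])  # 采该片段的中间帧
        segment = best_child                    # 向下展开得分最高的子片段
    return collected

print(tree_search_frames(num_frames=1024, question="What happens at the end?"))
```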
[CV-246] Automatic Detection Positioning and Counting of Grape Bunches Using Robots
【速读】: 该论文旨在推动农业自动化采摘和产量估算技术的发展,主要解决葡萄串的自动检测、定位和计数问题。解决方案的关键在于设计了一套结合Yolov3检测网络和局部跟踪算法的葡萄串检测系统。Yolov3用于实现葡萄串的精确检测,而局部跟踪算法则用于消除重新定位的误差。通过深度距离和空间限制方法,进一步获取葡萄串中心点的精确三维空间位置,最终完成葡萄串的计数。该方案在模拟的葡萄园环境中通过农业机器人进行了验证,并公开了项目代码。
链接: https://arxiv.org/abs/2412.10464
作者: Xumin Gao
机构: 未知
关键词: yield estimation technology, promote agricultural automatic, agricultural automatic picking, grape bunches, automatic picking
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Robotics (cs.RO)
备注:
点击查看摘要
Abstract:In order to promote agricultural automatic picking and yield estimation technology, this project designs a set of automatic detection, positioning and counting algorithms for grape bunches, and applies it to agricultural robots. The Yolov3 detection network is used to realize the accurate detection of grape bunches, and the local tracking algorithm is added to eliminate relocation. Then it obtains the accurate 3D spatial position of the central points of grape bunches using the depth distance and the spatial restriction method. Finally, the counting of grape bunches is completed. It is verified using the agricultural robot in the simulated vineyard environment. The project code is released at: this https URL.
zh
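下面给出一个示意片段,说明"深度距离 + 空间限制"中最基础的一步:利用针孔相机模型把检测框中心像素与深度值换算为相机坐标系下的三维位置。内参数值为假设值,仅作演示,并非该项目的原始代码。

```python
# 极简示意:利用针孔相机模型,把检测框中心像素坐标与深度值换算成相机坐标系下的三维位置
# (相机内参 fx, fy, cx, cy 为假设数值,实际应来自标定结果)
def pixel_to_3d(u: float, v: float, depth_m: float,
                fx: float = 615.0, fy: float = 615.0,
                cx: float = 320.0, cy: float = 240.0):
    x = (u - cx) * depth_m / fx
    y = (v - cy) * depth_m / fy
    return x, y, depth_m

# 例:Yolov3 给出的葡萄串检测框中心为 (410, 260),深度相机测得 1.8 m
print(pixel_to_3d(410, 260, 1.8))
```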
[CV-247] Enriching Multimodal Sentiment Analysis through Textual Emotional Descriptions of Visual-Audio Content
【速读】: 该论文旨在解决多模态情感分析 (Multimodal Sentiment Analysis, MSA) 中,音频和视频表达中微妙情感变化的识别难题。关键解决方案是提出了一个名为 DEVA 的渐进融合框架,该框架通过情感描述生成器 (Emotional Description Generator, EDG) 将原始音频和视频数据转换为文本化的情感描述,从而增强其情感特征。随后,这些描述与源数据结合,生成更丰富的特征。DEVA 还引入了文本引导的渐进融合模块 (Text-guided Progressive Fusion Module, TPF),利用不同层次的文本作为核心模态引导,逐步融合视觉和音频的次要模态,以减少文本与视觉-音频模态之间的差异。实验结果表明,DEVA 在多个基准数据集上显著优于现有最先进模型,并展现出对细微情感变化的强大敏感性。
链接: https://arxiv.org/abs/2412.10460
作者: Sheng Wu,Xiaobao Wang,Longbiao Wang,Dongxiao He,Jianwu Dang
机构: 1. Tsinghua University (清华大学); 2. Beijing Key Laboratory of Mobile Computing and Pervasive Device (北京市移动计算与泛在设备重点实验室); 3. Institute of Acoustics, Chinese Academy of Sciences (中国科学院声学研究所); 4. Beijing University of Posts and Telecommunications (北京邮电大学)
关键词: critical research frontier, comprehensively unravel human, unravel human emotions, research frontier, seeking to comprehensively
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Sound (cs.SD); Audio and Speech Processing (eess.AS)
备注:
点击查看摘要
Abstract:Multimodal Sentiment Analysis (MSA) stands as a critical research frontier, seeking to comprehensively unravel human emotions by amalgamating text, audio, and visual data. Yet, discerning subtle emotional nuances within audio and video expressions poses a formidable challenge, particularly when emotional polarities across various segments appear similar. In this paper, our objective is to spotlight emotion-relevant attributes of audio and visual modalities to facilitate multimodal fusion in the context of nuanced emotional shifts in visual-audio scenarios. To this end, we introduce DEVA, a progressive fusion framework founded on textual sentiment descriptions aimed at accentuating emotional features of visual-audio content. DEVA employs an Emotional Description Generator (EDG) to transmute raw audio and visual data into textualized sentiment descriptions, thereby amplifying their emotional characteristics. These descriptions are then integrated with the source data to yield richer, enhanced features. Furthermore, DEVA incorporates the Text-guided Progressive Fusion Module (TPF), leveraging varying levels of text as a core modality guide. This module progressively fuses visual-audio minor modalities to alleviate disparities between text and visual-audio modalities. Experimental results on widely used sentiment analysis benchmark datasets, including MOSI, MOSEI, and CH-SIMS, underscore significant enhancements compared to state-of-the-art models. Moreover, fine-grained emotion experiments corroborate the robust sensitivity of DEVA to subtle emotional variations.
zh
[CV-248] Motion Generation Review: Exploring Deep Learning for Lifelike Animation with Manifold
【速读】: 该论文试图解决人类运动生成中的自然序列生成问题,特别是在游戏、虚拟现实和人机交互等领域的应用。其关键解决方案在于利用流形学习(manifold learning)来降低数据维度并捕捉有效的运动子空间。通过从非结构化数据中提取流形,论文探讨了其在运动生成中的应用,并讨论了其优势和未来发展方向。这种方法旨在提高虚拟角色的运动真实性,并解决传统方法在复杂人类运动与信号关系处理中的不足。
链接: https://arxiv.org/abs/2412.10458
作者: Jiayi Zhao,Dongdong Weng,Qiuxin Du,Zeyu Tian
机构: 未知
关键词: involves creating natural, creating natural sequences, human body poses, generation involves creating, Human motion generation
类目: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR); Human-Computer Interaction (cs.HC)
备注:
点击查看摘要
Abstract:Human motion generation involves creating natural sequences of human body poses, widely used in gaming, virtual reality, and human-computer interaction. It aims to produce lifelike virtual characters with realistic movements, enhancing virtual agents and immersive experiences. While previous work has focused on motion generation based on signals like movement, music, text, or scene background, the complexity of human motion and its relationships with these signals often results in unsatisfactory outputs. Manifold learning offers a solution by reducing data dimensionality and capturing subspaces of effective motion. In this review, we present a comprehensive overview of manifold applications in human motion generation, one of the first in this domain. We explore methods for extracting manifolds from unstructured data, their application in motion generation, and discuss their advantages and future directions. This survey aims to provide a broad perspective on the field and stimulate new approaches to ongoing challenges.
zh
[CV-249] Explaining Model Overfitting in CNNs via GMM Clustering
【速读】: 该论文试图解决卷积神经网络 (CNN) 在计算机视觉领域中决策过程不透明的问题,特别是如何量化评估 CNN 滤波器的性能。解决方案的关键在于通过高斯混合模型 (GMM) 对模型中各个滤波器对应的特征图进行聚类,从而筛选出与异常样本相关的异常滤波器。通过分析这些异常滤波器与模型过拟合之间的关系,提出了三种假设,并通过实验验证了这些假设。该方法无需对不同 CNN 架构进行修改,具有普适性,已在 AlexNet 和 LeNet-5 等模型上成功应用。
链接: https://arxiv.org/abs/2412.10457
作者: Hui Dou,Xinyu Mu,Mengjun Yi,Feng Han,Jian Zhao,Furao Shen
机构: 未知
关键词: Convolutional Neural Networks, Convolutional Neural, Neural Networks, demonstrated remarkable prowess, computer vision
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Convolutional Neural Networks (CNNs) have demonstrated remarkable prowess in the field of computer vision. However, their opaque decision-making processes pose significant challenges for practical applications. In this study, we provide quantitative metrics for assessing CNN filters by clustering the feature maps corresponding to individual filters in the model via Gaussian Mixture Model (GMM). By analyzing the clustering results, we screen out some anomaly filters associated with outlier samples. We further analyze the relationship between the anomaly filters and model overfitting, proposing three hypotheses. This method is universally applicable across diverse CNN architectures without modifications, as evidenced by its successful application to models like AlexNet and LeNet-5. We present three meticulously designed experiments demonstrating our hypotheses from the perspectives of model behavior, dataset characteristics, and filter impacts. Through this work, we offer a novel perspective for evaluating the CNN performance and gain new insights into the operational behavior of model overfitting.
zh
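下面用一个极简的 Python 片段示意"按滤波器对特征图做 GMM 聚类并筛选异常簇"的基本流程;特征维度、聚类数与 5% 的占比阈值均为假设值,并非论文的原始设定。

```python
# 极简示意:对某个卷积滤波器在一批样本上的特征图做 GMM 聚类,
# 把样本数占比很小的簇视为"异常簇"候选(阈值为假设值)
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# 假设:512 个样本,该滤波器的特征图经全局平均池化等压缩为 16 维统计向量
feats = rng.normal(size=(512, 16))

gmm = GaussianMixture(n_components=4, random_state=0).fit(feats)
labels = gmm.predict(feats)

counts = np.bincount(labels, minlength=4)
ratios = counts / counts.sum()
anomaly_clusters = np.where(ratios < 0.05)[0]   # 占比低于 5% 的簇
anomaly_samples = np.where(np.isin(labels, anomaly_clusters))[0]
print("cluster ratios:", ratios, "anomaly sample ids:", anomaly_samples[:10])
```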
[CV-250] FovealNet: Advancing AI-Driven Gaze Tracking Solutions for Optimized Foveated Rendering System Performance in Virtual Reality
【速读】: 该论文试图解决基于深度学习的注视追踪(gaze tracking)在虚拟现实(VR)中应用时,由于追踪误差的长尾分布导致的视觉质量下降和系统性能降低的问题。解决方案的关键在于提出了FovealNet,这是一个先进的AI驱动的注视追踪框架,通过以下几个关键技术优化系统性能:1) 采用事件驱动的裁剪方法,减少64.8%的无关像素输入,降低计算成本;2) 引入简单的动态token修剪策略,实时移除不必要的信息,保持追踪精度;3) 提出系统性能感知的多分辨率训练策略,使注视追踪深度神经网络(DNN)能够根据不同的运行时渲染配置自适应优化整体系统性能。实验结果表明,FovealNet相比之前的方法实现了至少1.42倍的加速,并提升了13%的感知质量。
链接: https://arxiv.org/abs/2412.10456
作者: Wenxuan Liu,Monde Duinkharjav,Qi Sun,Sai Qian Zhang
机构: Tandon School of Engineering, New York University (坦登工程学院,纽约大学)
关键词: Leveraging real-time eye-tracking, quality virtual reality, Leveraging real-time, optimizes hardware efficiency, enhances visual quality
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:Leveraging real-time eye-tracking, foveated rendering optimizes hardware efficiency and enhances visual quality in virtual reality (VR). This approach leverages eye-tracking techniques to determine where the user is looking, allowing the system to render high-resolution graphics only in the foveal region - the small area of the retina where visual acuity is highest - while the peripheral view is rendered at lower resolution. However, modern deep learning-based gaze-tracking solutions often exhibit a long-tail distribution of tracking errors, which can degrade user experience and reduce the benefits of foveated rendering by causing misalignment and decreased visual quality. This paper introduces FovealNet, an advanced AI-driven gaze tracking framework designed to optimize system performance by strategically enhancing gaze tracking accuracy. To further reduce the implementation cost of the gaze tracking algorithm, FovealNet employs an event-based cropping method that eliminates over 64.8% of irrelevant pixels from the input image. Additionally, it incorporates a simple yet effective token-pruning strategy that dynamically removes tokens on the fly without compromising tracking accuracy. Finally, to support different runtime rendering configurations, we propose a system performance-aware multi-resolution training strategy, allowing the gaze tracking DNN to adapt and optimize overall system performance more effectively. Evaluation results demonstrate that FovealNet achieves at least a 1.42× speedup compared to previous methods and a 13% increase in perceptual quality for foveated output.
zh
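下面给出一个示意性的 token 剪枝片段:按显著性分数保留得分最高的一部分 token、丢弃其余部分。分数来源与保留比例均为假设,仅用于说明"动态移除冗余 token"这一思路,并非 FovealNet 的官方实现。

```python
# 极简示意:按注意力/显著性分数动态裁剪 token,仅保留得分最高的一部分
# (这里用随机分数代替真实的注视相关显著性,保留比例为假设值)
import torch

def prune_tokens(tokens: torch.Tensor, scores: torch.Tensor, keep_ratio: float = 0.5):
    # tokens: (B, N, C), scores: (B, N)
    keep = max(1, int(tokens.shape[1] * keep_ratio))
    idx = scores.topk(keep, dim=1).indices                      # (B, keep)
    idx = idx.unsqueeze(-1).expand(-1, -1, tokens.shape[-1])    # (B, keep, C)
    return torch.gather(tokens, 1, idx)

tokens = torch.randn(2, 196, 64)
scores = torch.rand(2, 196)
print(prune_tokens(tokens, scores).shape)   # torch.Size([2, 98, 64])
```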
[CV-251] Geo-LLaVA: A Large Multi-Modal Model for Solving Geometry Math Problems with Meta In-Context Learning
【速读】: 该论文试图解决几何数学问题,特别是涉及视觉元素和空间推理的立体几何问题,这些问题对大型语言模型(LLMs)构成了显著挑战。解决方案的关键在于提出了一个名为Geo-LLaVA的大型多模态模型(LMM)框架,该框架结合了检索增强与监督微调(SFT)的元训练方法,并在推理阶段采用上下文学习(ICL)来提升性能。通过这种综合方法,模型在GeoQA和GeoMath数据集上分别达到了65.25%和42.36%的最新性能水平,并首次赋予了模型解决立体几何问题的能力,同时支持生成合理的立体几何图片描述和问题解决步骤。
链接: https://arxiv.org/abs/2412.10455
作者: Shihao Xu,Yiyang Luo,Wei Shi
机构: Huawei Singapore Research Center(华为新加坡研究中心); Huawei Singapore Research Center(华为新加坡研究中心); Huawei Singapore Research Center(华为新加坡研究中心)
关键词: pose significant challenges, involve visual elements, mathematics problems pose, problems pose significant, Geometry mathematics problems
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computational Geometry (cs.CG)
备注:
点击查看摘要
Abstract:Geometry mathematics problems pose significant challenges for large language models (LLMs) because they involve visual elements and spatial reasoning. Current methods primarily rely on symbolic character awareness to address these problems. Considering geometry problem solving is a relatively nascent field with limited suitable datasets and currently almost no work on solid geometry problem solving, we collect a geometry question-answer dataset by sourcing geometric data from Chinese high school education websites, referred to as GeoMath. It contains solid geometry questions and answers with accurate reasoning steps as compensation for existing plane geometry datasets. Additionally, we propose a Large Multi-modal Model (LMM) framework named Geo-LLaVA, which incorporates retrieval augmentation with supervised fine-tuning (SFT) in the training stage, called meta-training, and employs in-context learning (ICL) during inference to improve performance. Our fine-tuned model with ICL attains the state-of-the-art performance of 65.25% and 42.36% on selected questions of the GeoQA dataset and GeoMath dataset respectively with proper inference steps. Notably, our model initially endows the ability to solve solid geometry problems and supports the generation of reasonable solid geometry picture descriptions and problem-solving steps. Our research sets the stage for further exploration of LLMs in multi-modal math problem-solving, particularly in geometry math problems.
zh
[CV-252] Analysis of Object Detection Models for Tiny Object in Satellite Imagery: A Dataset-Centric Approach
【速读】: 该论文试图解决卫星图像中小目标检测 (Small-Object-Detection, SOD) 的挑战,特别是在卫星图像中由于广域成像范围、目标分布不均以及目标外观多样性带来的困难。解决方案的关键在于构建了一个包含3000张卫星图像的精细数据集,涵盖汽车、船只和飞机等目标,并通过实证评估现有最先进的模型来提供对小目标检测的深入理解。此外,论文还探讨了基于卫星视频的目标跟踪问题,采用了Byte Track算法在SAT-MTB数据集上进行实验,以全面评估这些模型在卫星应用中的有效性。
链接: https://arxiv.org/abs/2412.10453
作者: Kailas PS,Selvakumaran R,Palani Murugan,Ramesh Kumar V,Malaya Kumar Biswal M
机构: 未知
关键词: revolutionizing basic computer, computer vision tasks, basic computer vision, deep learning-based object, learning-based object detection
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Conference Proceesings of AIAA SciTech Forum 2025 and Exposition
点击查看摘要
Abstract:In recent years, significant advancements have been made in deep learning-based object detection algorithms, revolutionizing basic computer vision tasks, notably in object detection, tracking, and segmentation. This paper delves into the intricate domain of Small-Object-Detection (SOD) within satellite imagery, highlighting the unique challenges stemming from wide imaging ranges, object distribution, and their varying appearances in bird’s-eye-view satellite images. Traditional object detection models face difficulties in detecting small objects due to limited contextual information and class imbalances. To address this, our research presents a meticulously curated dataset comprising 3000 images showcasing cars, ships, and airplanes in satellite imagery. Our study aims to provide valuable insights into small object detection in satellite imagery by empirically evaluating state-of-the-art models. Furthermore, we tackle the challenges of satellite video-based object tracking, employing the Byte Track algorithm on the SAT-MTB dataset. Through rigorous experimentation, we aim to offer a comprehensive understanding of the efficacy of state-of-the-art models in Small-Object-Detection for satellite applications. Our findings shed light on the effectiveness of these models and pave the way for future advancements in satellite imagery analysis.
zh
[CV-253] Unlocking Visual Secrets: Inverting Features with Diffusion Priors for Image Reconstruction
【速读】: 该论文试图解决深度神经网络(DNN)中视觉表示的特征反演问题,旨在从预训练DNN生成的未识别目标图像特征中重建原始图像。解决方案的关键在于利用扩散模型(diffusion models)这一图像合成技术来提升特征反演的质量,并通过结合文本提示(textual prompts)和跨帧时间相关性(cross-frame temporal correlations)等替代形式的先验知识,进一步提高反演特征的质量。研究表明,扩散模型能够有效利用DNN特征中的隐藏信息,从而在重建性能上优于以往方法,为基于DNN特征的应用提供了增强隐私和安全性的新视角。
链接: https://arxiv.org/abs/2412.10448
作者: Sai Qian Zhang,Ziyun Li,Chuan Guo,Saeed Mahloujifar,Deeksha Dangwal,Edward Suh,Barbara De Salvo,Chiao Liu
机构: New York University(纽约大学); Meta Research(Meta研究); NVidia Research(英伟达研究)
关键词: Inverting visual representations, deep neural networks, Inverting visual, deep learning, DNN features
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:Inverting visual representations within deep neural networks (DNNs) presents a challenging and important problem in the field of security and privacy for deep learning. The main goal is to invert the features of an unidentified target image generated by a pre-trained DNN, aiming to reconstruct the original image. Feature inversion holds particular significance in understanding the privacy leakage inherent in contemporary split DNN execution techniques, as well as in various applications based on the extracted DNN features. In this paper, we explore the use of diffusion models, a promising technique for image synthesis, to enhance feature inversion quality. We also investigate the potential of incorporating alternative forms of prior knowledge, such as textual prompts and cross-frame temporal correlations, to further improve the quality of inverted features. Our findings reveal that diffusion models can effectively leverage hidden information from the DNN features, resulting in superior reconstruction performance compared to previous methods. This research offers valuable insights into how diffusion models can enhance privacy and security within applications that are reliant on DNN features.
zh
[CV-254] TidyBot++: An Open-Source Holonomic Mobile Manipulator for Robot Learning
【速读】: 该论文试图解决移动操作任务中数据收集的问题,特别是通过模仿学习(imitation learning)实现家庭环境中复杂任务的自动化。解决方案的关键在于设计了一种低成本、坚固且灵活的移动操作机器人,该机器人采用全向动力脚轮(powered casters),使其移动基座具备全向运动能力(holonomic),能够独立且同时控制所有平面自由度。这种设计简化了移动操作任务中的运动学约束,提高了机器人的机动性,并通过直观的手机远程操作界面,便于数据采集以支持模仿学习。实验结果表明,通过该系统收集的数据训练出的策略能够成功执行多种常见的家庭移动操作任务。
链接: https://arxiv.org/abs/2412.10447
作者: Jimmy Wu,William Chong,Robert Holmberg,Aaditya Prasad,Yihuai Gao,Oussama Khatib,Shuran Song,Szymon Rusinkiewicz,Jeannette Bohg
机构: 未知
关键词: Exploiting the promise, mobile manipulation tasks, manipulation tasks, human-guided demonstrations, mobile manipulation
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: Conference on Robot Learning (CoRL), 2024. Project page: this https URL
点击查看摘要
Abstract:Exploiting the promise of recent advances in imitation learning for mobile manipulation will require the collection of large numbers of human-guided demonstrations. This paper proposes an open-source design for an inexpensive, robust, and flexible mobile manipulator that can support arbitrary arms, enabling a wide range of real-world household mobile manipulation tasks. Crucially, our design uses powered casters to enable the mobile base to be fully holonomic, able to control all planar degrees of freedom independently and simultaneously. This feature makes the base more maneuverable and simplifies many mobile manipulation tasks, eliminating the kinematic constraints that create complex and time-consuming motions in nonholonomic bases. We equip our robot with an intuitive mobile phone teleoperation interface to enable easy data acquisition for imitation learning. In our experiments, we use this interface to collect data and show that the resulting learned policies can successfully perform a variety of common household mobile manipulation tasks.
zh
[CV-255] Disentanglement and Compositionality of Letter Identity and Letter Position in Variational Auto-Encoder Vision Models
【速读】: 该论文试图解决的问题是现代深度神经网络模型是否具备与人类相似的组合能力,即能否在处理书面文字时解耦字母位置和字母身份信息。解决方案的关键在于设计并使用一个新的基准测试——CompOrth,来评估模型在解耦字母位置和字母身份方面的表现。通过训练beta变分自编码器(beta-VAE)并利用CompOrth进行一系列实验,研究发现尽管模型能够有效解耦表面特征(如单词在图像中的水平和垂直位置),但在解耦字母位置和字母身份以及理解单词长度方面表现不佳。这表明当前最先进的beta-VAE模型在组合学习能力上存在显著不足,论文因此提出了一个新的挑战和相应的基准测试,以推动神经模型在这方面的改进。
链接: https://arxiv.org/abs/2412.10446
作者: Bruno Bianchi,Aakash Agrawal,Stanislas Dehaene,Emmanuel Chemla,Yair Lakretz
机构: CONICET-UBA, FCEN, ICC-Departamento de Computación. Buenos Aires, Argentina; Cognitive Neuroimaging Unit, CEA, INSERM U 992, Université Paris-Saclay, NeuroSpin center, Gif/Yvette, France; Collège de France, Paris, France; PSL University, Paris, France; LSCP, Ecole Normale Supérieure, CNRS, Paris, France
关键词: letter, disentangle letter position, bufflo or add, accurately count, models
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:Human readers can accurately count how many letters are in a word (e.g., 7 in "buffalo"), remove a letter from a given position (e.g., "bufflo") or add a new one. The human brain of readers must have therefore learned to disentangle information related to the position of a letter and its identity. Such disentanglement is necessary for the compositional, unbounded ability of humans to create and parse new strings, with any combination of letters appearing in any positions. Do modern deep neural models also possess this crucial compositional ability? Here, we tested whether neural models that achieve state-of-the-art on disentanglement of features in visual input can also disentangle letter position and letter identity when trained on images of written words. Specifically, we trained a beta variational autoencoder (β-VAE) to reconstruct images of letter strings and evaluated their disentanglement performance using CompOrth - a new benchmark that we created for studying compositional learning and zero-shot generalization in visual models for orthography. The benchmark suggests a set of tests, of increasing complexity, to evaluate the degree of disentanglement between orthographic features of written words in deep neural models. Using CompOrth, we conducted a set of experiments to analyze the generalization ability of these models, in particular, to unseen word length and to unseen combinations of letter identities and letter positions. We found that while models effectively disentangle surface features, such as horizontal and vertical 'retinal' locations of words within an image, they dramatically fail to disentangle letter position and letter identity and lack any notion of word length. Together, this study demonstrates the shortcomings of state-of-the-art β-VAE models compared to humans and proposes a new challenge and a corresponding benchmark to evaluate neural models.
zh
[CV-256] Boundary Exploration of Next Best View Policy in 3D Robotic Scanning IROS2025
【速读】: 该论文试图解决3D机器人扫描中的Next Best View (NBV)问题,旨在提高物体捕捉和重建的效率。当前方法存在忽略视图重叠、假设虚拟焦点以及依赖体素表示等问题。论文提出的关键解决方案包括:1) 一种基于边界探索的NBV策略,考虑视图重叠并允许灵活调整扫描距离;2) 一种基于模型的迭代搜索方法,通过计算新扫描数据与现有数据的重叠及最终收敛性来确定下一个传感器位置;3) 一种深度学习网络Boundary Exploration NBV network (BENBV-Net),直接从扫描数据中预测NBV,无需参考模型,显著提高了NBV生成的速度并保持了模型方法的性能。实验结果表明,该方法在扫描效率和重叠方面优于现有方法,适用于实际的3D扫描应用。
链接: https://arxiv.org/abs/2412.10444
作者: Leihui Li,Xuping Zhang
机构: Aarhus University(奥胡斯大学)
关键词: Boundary Exploration NBV, NBV, capture and reconstruction, pivotal challenge, potential to greatly
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
备注: Will be submitted to IROS 2025
点击查看摘要
Abstract:The Next Best View (NBV) problem is a pivotal challenge in 3D robotic scanning, with the potential to greatly improve the efficiency of object capture and reconstruction. Current methods for determining the NBV often overlook view overlaps, assume a virtual origin point for the camera’s focus, and rely on voxel representations of 3D data. To address these issues and improve the practicality of scanning unknown objects, we propose an NBV policy in which the next view explores the boundary of the scanned point cloud, and the overlap is intrinsically considered. The scanning distance or camera working distance is adjustable and flexible. To this end, a model-based approach is proposed where the next sensor positions are searched iteratively based on a reference model. A score is calculated by considering the overlaps between newly scanned and existing data, as well as the final convergence. Additionally, following the boundary exploration idea, a deep learning network, Boundary Exploration NBV network (BENBV-Net), is designed and proposed, which can be used to predict the NBV directly from the scanned data without requiring the reference model. It predicts the scores for given boundaries, and the boundary with the highest score is selected as the target point of the next best view. BENBV-Net improves the speed of NBV generation while maintaining the performance of the model-based approach. Our proposed methods are evaluated and compared with existing approaches on the ShapeNet, ModelNet, and 3D Repository datasets. Experimental results demonstrate that our approach outperforms others in terms of scanning efficiency and overlap, both of which are crucial for practical 3D scanning applications. The related code is released at this http URL.
zh
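下面用一个示意片段说明"按重叠率给候选视点打分"的最简化版本:对每个候选视点的模拟新扫描,统计其落在已有点云邻域内的比例,并选出最接近目标重叠率的视点。论文中的评分还包含收敛性等项,此处仅示意重叠项;距离阈值与目标重叠率均为假设值。

```python
# 极简示意:对若干候选"边界视点"对应的模拟新扫描,按与已有点云的重叠率打分,
# 取重叠率最接近目标值(假设 30%)的视点作为下一最佳视点
import numpy as np
from scipy.spatial import cKDTree

def overlap_ratio(new_scan: np.ndarray, existing: np.ndarray, tol: float = 0.005) -> float:
    d, _ = cKDTree(existing).query(new_scan)
    return float(np.mean(d < tol))            # 新扫描中落在已有点云附近的比例

def pick_next_view(candidate_scans, existing, target_overlap: float = 0.3):
    scores = [abs(overlap_ratio(s, existing) - target_overlap) for s in candidate_scans]
    return int(np.argmin(scores))

existing = np.random.rand(2000, 3)
candidates = [np.random.rand(500, 3) for _ in range(5)]
print("next view index:", pick_next_view(candidates, existing))
```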
[CV-257] SweetTokenizer: Semantic-Aware Spatial-Temporal Tokenizer for Compact Visual Discretization
【速读】: 该论文试图解决在视觉数据压缩过程中提高压缩比率并保持重建质量的问题。解决方案的关键在于提出了一种名为 Semantic-aware Warped spatial-temporal Tokenizer (SweetTokenizer) 的方法,通过将图像或视频解耦为空间和时间维度,并利用 Cross-attention Query AutoEncoder (CQAE) 将视觉信息转换为可学习的查询空间和时间令牌。此外,通过从现成的语言模型嵌入中派生的专用码本对这些令牌进行量化,以补充视觉信息并引入语义信息。为了增强训练稳定性和收敛性,还采用了课程学习策略。最终,SweetTokenizer 在减少令牌数量的同时,显著提升了视频和图像的重建质量,并在下游应用中实现了基于语言模型的少样本识别能力。
链接: https://arxiv.org/abs/2412.10443
作者: Zhentao Tan,Ben Xue,Jian Jia,Junhao Wang,Wencai Ye,Shaoyun Shi,Mingjie Sun,Wenjin Wu,Quan Chen,Peng Jiang
机构: Kuaishou Technology(快手科技)
关键词: textbf, effective discretization approach, vision data, paper presents, discretization approach
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:This paper presents the Semantic-aWarE spatial-tEmporal Tokenizer (SweetTokenizer), a compact yet effective discretization approach for vision data. Our goal is to boost tokenizers' compression ratio while maintaining reconstruction fidelity in the VQ-VAE paradigm. Firstly, to obtain compact latent representations, we decouple images or videos into spatial-temporal dimensions, translating visual information into learnable querying spatial and temporal tokens through a Cross-attention Query AutoEncoder (CQAE). Secondly, to complement visual information during compression, we quantize these tokens via a specialized codebook derived from off-the-shelf LLM embeddings to leverage the rich semantics from language modality. Finally, to enhance training stability and convergence, we also introduce a curriculum learning strategy, which proves critical for effective discrete visual representation learning. SweetTokenizer achieves comparable video reconstruction fidelity with only 25% of the tokens used in previous state-of-the-art video tokenizers, and boosts video generation results by 32.9% w.r.t. gFVD. When using the same token number, we significantly improve video and image reconstruction results by 57.1% w.r.t. rFVD on UCF-101 and 37.2% w.r.t. rFID on ImageNet-1K. Additionally, the compressed tokens are imbued with semantic information, enabling few-shot recognition capabilities powered by LLMs in downstream applications.
zh
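下面是一个示意性的最近邻量化片段,对应"把连续 token 量化到固定码本"这一步;论文中的码本来自现成 LLM 的词嵌入,这里用随机矩阵代替,码本大小与维度均为假设值。

```python
# 极简示意:把连续的时空 token 按最近邻量化到一个固定码本上
# (论文中的码本来自现成 LLM 的词嵌入,这里用随机矩阵代替)
import torch

def quantize(tokens: torch.Tensor, codebook: torch.Tensor):
    # tokens: (N, D), codebook: (K, D)
    dist = torch.cdist(tokens, codebook)          # (N, K) 两两欧氏距离
    ids = dist.argmin(dim=1)                      # 每个 token 的码字索引
    quantized = codebook[ids]                     # 离散化后的 token 表示
    return quantized, ids

codebook = torch.randn(8192, 256)                 # 假设码本大小 8192、维度 256
tokens = torch.randn(64, 256)
q, ids = quantize(tokens, codebook)
print(q.shape, ids[:8])
```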
[CV-258] Novel 3D Binary Indexed Tree for Volume Computation of 3D Reconstructed Models from Volumetric Data
【速读】: 该论文旨在解决医学影像领域中精确计算三维体积的问题,这对于后续的三维重建对象定性分析至关重要。解决方案的关键在于结合多元微积分、移动立方体算法 (marching cube algorithm) 和二叉索引树数据结构 (binary indexed tree data structure),开发了一种高效的算法来计算从计算机断层扫描 (CT) 或磁共振 (MR) 恢复的体积数据的固有体积。该算法通过基于多边形网格生成方法的30种体积值配置,实现了与重建算法同步的扫描线顺序处理,并利用Fenwick树确保了更快的查询时间,同时支持用户对切片或模型变换的编辑。
链接: https://arxiv.org/abs/2412.10441
作者: Quoc-Bao Nguyen-Le,Tuan-Hy Le,Anh-Triet Do
机构: Le Hong Phong High School for the Gifted, Ho Chi Minh City, Vietnam; Faculty of Information Technology, University of Science, VNU-HCM; Research laboratory of Center of Talent in AI
关键词: subsequent qualitative analysis, medical imaging, burgeoning field, field of medical, holds a significant
类目: Graphics (cs.GR); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注: 8 pages, 7 figures
点击查看摘要
Abstract:In the burgeoning field of medical imaging, precise computation of 3D volume holds significant importance for subsequent qualitative analysis of 3D reconstructed objects. Combining multivariate calculus, the marching cube algorithm, and the binary indexed tree data structure, we developed an algorithm for efficient computation of the intrinsic volume of any volumetric data recovered from computed tomography (CT) or magnetic resonance (MR). We proposed the 30 configurations of volume values based on the polygonal mesh generation method. Our algorithm processes the data in scan-line order simultaneously with the reconstruction algorithm to create a Fenwick tree, ensuring much faster query times and assisting users' editing of slicing or model transformations. We tested the algorithm's accuracy on simple 3D objects (e.g., sphere, cylinder) to complicated structures (e.g., lungs, cardiac chambers). The result deviated within ±0.004 cm³ and there is still room for further improvement.
zh
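下面给出一个通用写法的三维树状数组(Fenwick tree)示意,支持单体素更新与前缀和查询,可用于在切片或变换编辑后快速查询任意长方体区域的累计体积;这只是该数据结构的标准实现示意,并非论文的原始代码。

```python
# 极简示意:三维树状数组(Fenwick tree),支持单体素体积值更新与前缀和查询
class BIT3D:
    def __init__(self, nx: int, ny: int, nz: int):
        self.nx, self.ny, self.nz = nx, ny, nz
        self.t = [[[0.0] * (nz + 1) for _ in range(ny + 1)] for _ in range(nx + 1)]

    def update(self, x: int, y: int, z: int, delta: float):
        i = x
        while i <= self.nx:
            j = y
            while j <= self.ny:
                k = z
                while k <= self.nz:
                    self.t[i][j][k] += delta
                    k += k & -k
                j += j & -j
            i += i & -i

    def prefix(self, x: int, y: int, z: int) -> float:
        s, i = 0.0, x
        while i > 0:
            j = y
            while j > 0:
                k = z
                while k > 0:
                    s += self.t[i][j][k]
                    k -= k & -k
                j -= j & -j
            i -= i & -i
        return s

bit = BIT3D(64, 64, 64)
bit.update(10, 20, 30, 0.125)      # 某个体素贡献的体积(立方单位)
print(bit.prefix(64, 64, 64))      # 全体积前缀和:0.125
```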
[CV-259] Multi-level Matching Network for Multimodal Entity Linking KDD’25
【速读】: 该论文试图解决多模态实体链接 (Multimodal Entity Linking, MEL) 中的两个主要问题:一是现有方法忽略了同模态中的负样本,二是缺乏捕捉双向跨模态交互的机制。为解决这些问题,论文提出了多层次匹配网络 (Multi-level Matching network for Multimodal Entity Linking, M3EL)。其关键在于三个模块的设计:(i) 多模态特征提取模块,通过多模态编码器提取特定模态的表示,并引入同模态对比学习子模块以获得更具区分性的嵌入;(ii) 同模态匹配网络模块,包含粗粒度和细粒度的全局到全局及全局到局部匹配,实现局部和全局级别的同模态交互;(iii) 跨模态匹配网络模块,采用双向策略(文本到视觉和视觉到文本)实现双向跨模态交互。实验结果表明,M3EL在多个数据集上优于现有的最先进方法。
链接: https://arxiv.org/abs/2412.10440
作者: Zhiwei Hu,Víctor Gutiérrez-Basulto,Ru Li,Jeff Z. Pan
机构: Shanxi University(山西大学); Cardiff University(卡迪夫大学); University of Edinburgh(爱丁堡大学)
关键词: link ambiguous mentions, multimodal knowledge base, Multimodal entity linking, aims to link, knowledge base
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Accepted at KDD’25
点击查看摘要
Abstract:Multimodal entity linking (MEL) aims to link ambiguous mentions within multimodal contexts to corresponding entities in a multimodal knowledge base. Most existing approaches to MEL are based on representation learning or vision-and-language pre-training mechanisms for exploring the complementary effect among multiple modalities. However, these methods suffer from two limitations. On the one hand, they overlook the possibility of considering negative samples from the same modality. On the other hand, they lack mechanisms to capture bidirectional cross-modal interaction. To address these issues, we propose a Multi-level Matching network for Multimodal Entity Linking (M3EL). Specifically, M3EL is composed of three different modules: (i) a Multimodal Feature Extraction module, which extracts modality-specific representations with a multimodal encoder and introduces an intra-modal contrastive learning sub-module to obtain better discriminative embeddings based on uni-modal differences; (ii) an Intra-modal Matching Network module, which contains two levels of matching granularity: Coarse-grained Global-to-Global and Fine-grained Global-to-Local, to achieve local and global level intra-modal interaction; (iii) a Cross-modal Matching Network module, which applies bidirectional strategies, Textual-to-Visual and Visual-to-Textual matching, to implement bidirectional cross-modal interaction. Extensive experiments conducted on WikiMEL, RichpediaMEL, and WikiDiverse datasets demonstrate the outstanding performance of M3EL when compared to the state-of-the-art baselines.
zh
[CV-260] CogNav: Cognitive Process Modeling for Object Goal Navigation with LLM s
【速读】: 该论文试图解决的是在未知环境中进行目标物体导航(ObjectNav)的任务,这一任务要求代理在未见过的环境中找到特定目标物体,涉及感知和认知过程的结合。论文提出的解决方案关键在于引入认知导航(CogNav),通过大型语言模型(Large Language Models)模拟人类的认知过程。具体来说,论文采用了一个由探索到识别的有限状态机(finite state machine)来建模认知过程,并通过在线构建的异构认知地图(heterogeneous cognitive map),包含场景的空间和语义信息,来决定状态间的转换。这种方法显著提升了导航效率,并在开放词汇和零样本设置下,将HM3D基准的SOTA从69.3%提升至87.2%。
链接: https://arxiv.org/abs/2412.10439
作者: Yihan Cao,Jiazhao Zhang,Zhinan Yu,Shuzhen Liu,Zheng Qin,Qin Zou,Bo Du,Kai Xu
机构: 未知
关键词: Object goal navigation, requires the agent, agent to find, find a target, cognitive
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
备注:
点击查看摘要
Abstract:Object goal navigation (ObjectNav) is a fundamental task of embodied AI that requires the agent to find a target object in unseen environments. This task is particularly challenging as it demands both perceptual and cognitive processes for effective perception and decision-making. While perception has gained significant progress powered by the rapidly developed visual foundation models, the progress on the cognitive side remains limited to either implicitly learning from massive navigation demonstrations or explicitly leveraging pre-defined heuristic rules. Inspired by neuroscientific evidence that humans consistently update their cognitive states while searching for objects in unseen environments, we present CogNav, which attempts to model this cognitive process with the help of large language models. Specifically, we model the cognitive process with a finite state machine composed of cognitive states ranging from exploration to identification. The transitions between the states are determined by a large language model based on an online built heterogeneous cognitive map containing spatial and semantic information of the scene being explored. Extensive experiments on both synthetic and real-world environments demonstrate that our cognitive modeling significantly improves ObjectNav efficiency, with human-like navigation behaviors. In an open-vocabulary and zero-shot setting, our method advances the SOTA of the HM3D benchmark from 69.3% to 87.2%. The code and data will be released.
zh
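下面用一个极简的 Python 片段示意"有限状态机 + LLM 决定状态跳转"的控制流程;状态名称与跳转逻辑均为假设(演示中用顺序推进代替真实的 LLM 调用),并非 CogNav 的原始状态定义。

```python
# 极简示意:用有限状态机组织导航认知过程,状态之间的跳转由 LLM 根据认知地图决定
STATES = ["explore", "search_around_context", "verify_candidate", "identify_target"]

def decide_next_state(state: str, cognitive_map_summary: str) -> str:
    """占位:真实系统中应把当前状态与认知地图摘要交给 LLM,由其返回下一状态。"""
    order = {s: i for i, s in enumerate(STATES)}
    return STATES[min(order[state] + 1, len(STATES) - 1)]   # 演示用:顺序推进

def run_episode(max_steps: int = 10) -> str:
    state = "explore"
    for _ in range(max_steps):
        if state == "identify_target":
            return "success"
        # 此处省略:在当前状态下执行局部策略、更新异构认知地图
        state = decide_next_state(state, cognitive_map_summary="...")
    return "timeout"

print(run_episode())
```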
[CV-261] Automatic Image Annotation for Mapped Features Detection
【速读】: 该论文试图解决自动驾驶和定位中的道路特征检测问题,特别是广泛存在于道路环境中的杆状物(poles)的可靠检测。解决方案的关键在于通过多模态自动标注(multi-modal automatic annotation)来提高标注质量和感知模型的学习效果。具体方法包括融合三种自动标注方法:高精度矢量地图与激光雷达(lidar)的特征投影、图像分割和激光雷达分割。实验结果表明,这种多模态标注方法在杆状物检测上显著优于单一方法,并且通过使用未标注数据微调目标检测模型,进一步提升了网络的专精化能力。
链接: https://arxiv.org/abs/2412.10438
作者: Maxime Noizet(UTC, Heudiasyc),Philippe Xu(ENSTA Paris),Philippe Bonnifait(UTC, Heudiasyc)
机构: 未知
关键词: Detecting road features, Detecting road, key enabler, enabler for autonomous, autonomous driving
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
点击查看摘要
Abstract:Detecting road features is a key enabler for autonomous driving and localization. For instance, a reliable detection of poles which are widespread in road environments can improve localization. Modern deep learning-based perception systems need a significant amount of annotated data. Automatic annotation avoids time-consuming and costly manual annotation. Because automatic methods are prone to errors, managing annotation uncertainty is crucial to ensure a proper learning process. Fusing multiple annotation sources on the same dataset can be an efficient way to reduce the errors. This not only improves the quality of annotations, but also improves the learning of perception models. In this paper, we consider the fusion of three automatic annotation methods in images: feature projection from a high accuracy vector map combined with a lidar, image segmentation and lidar segmentation. Our experimental results demonstrate the significant benefits of multi-modal automatic annotation for pole detection through a comparative evaluation on manually annotated images. Finally, the resulting multi-modal fusion is used to fine-tune an object detection model for pole base detection using unlabeled data, showing overall improvements achieved by enhancing network specialization. The dataset is publicly available.
zh
[CV-262] SVGFusion: Scalable Text-to-SVG Generation via Vector Space Diffusion
【速读】: 该论文试图解决从文本数据生成可缩放矢量图形 (Scalable Vector Graphics, SVG) 资产的挑战,主要难点在于高质量矢量数据集的稀缺以及现有矢量表示方法在复杂图形分布建模中的局限性。解决方案的关键在于提出了SVGFusion模型,该模型通过结合矢量-像素融合变分自编码器 (Vector-Pixel Fusion Variational Autoencoder, VP-VAE) 和矢量空间扩散变换器 (Vector Space Diffusion Transformer, VS-DiT),学习一个连续的矢量图形潜在空间。VP-VAE同时处理SVG和对应的栅格化图像,学习连续潜在空间,而VS-DiT则根据文本提示生成该空间中的潜在代码。此外,论文提出了一种新的渲染序列建模策略,使潜在空间能够嵌入SVG的构造逻辑知识,从而实现类似人类的设计能力并避免复杂图形组合中的遮挡问题。通过增加VS-DiT模块,模型的能力可以持续提升。实验结果表明,SVGFusion在生成质量和泛化能力上优于现有方法,为SVG内容创作提供了一个新的框架。
链接: https://arxiv.org/abs/2412.10437
作者: Ximing Xing,Juncheng Hu,Jing Zhang,Dong Xu,Qian Yu
机构: Beihang University(北京航空航天大学); The University of Hong Kong(香港大学)
关键词: scalable vector representations, Scalable Vector Graphics, Scalable Vector, intricate graphic distributions, vector representations required
类目: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR); Machine Learning (cs.LG)
备注: project page: \href{ this https URL }{ this https URL }
点击查看摘要
Abstract:The generation of Scalable Vector Graphics (SVG) assets from textual data remains a significant challenge, largely due to the scarcity of high-quality vector datasets and the limitations in scalable vector representations required for modeling intricate graphic distributions. This work introduces SVGFusion, a Text-to-SVG model capable of scaling to real-world SVG data without reliance on a text-based discrete language model or prolonged SDS optimization. The essence of SVGFusion is to learn a continuous latent space for vector graphics with a popular Text-to-Image framework. Specifically, SVGFusion consists of two modules: a Vector-Pixel Fusion Variational Autoencoder (VP-VAE) and a Vector Space Diffusion Transformer (VS-DiT). VP-VAE takes both the SVGs and corresponding rasterizations as inputs and learns a continuous latent space, whereas VS-DiT learns to generate a latent code within this space based on the text prompt. Based on VP-VAE, a novel rendering sequence modeling strategy is proposed to enable the latent space to embed the knowledge of construction logics in SVGs. This empowers the model to achieve human-like design capabilities in vector graphics, while systematically preventing occlusion in complex graphic compositions. Moreover, our SVGFusion's ability can be continuously improved by leveraging the scalability of the VS-DiT by adding more VS-DiT blocks. A large-scale SVG dataset is collected to evaluate the effectiveness of our proposed method. Extensive experimentation has confirmed the superiority of our SVGFusion over existing SVG generation methods, achieving enhanced quality and generalizability, thereby establishing a novel framework for SVG content creation. Code, model, and data will be released at: this https URL
zh
[CV-263] Benchmarking Federated Learning for Semantic Datasets: Federated Scene Graph Generation
【速读】: 该论文试图解决联邦学习(Federated Learning, FL)在处理复杂语义任务时的挑战,特别是当样本包含多标签语义信息(如Panoptic Scene Graph Generation, PSG)时,如何在客户端之间控制语义异质性。解决方案的关键在于提出一个建立FL基准的过程,包含两个主要步骤:一是基于语义进行数据聚类(data clustering with semantics),二是通过可控的语义异质性进行数据分发(data distributing via controllable semantic heterogeneity across clients)。通过构建一个联邦PSG基准,论文展示了现有PSG方法在FL设置下的有效性,并验证了该基准在处理数据异质性时能够提升学习算法的性能。
链接: https://arxiv.org/abs/2412.10436
作者: SeungBum Ha,Taehwan Lee,Jiyoun Lim,Sung Whan Yoon
机构: Graduate School of Artificial Intelligence, Ulsan National Institute of Science and Technology, Ulsan, Korea; Department of Electrical Engineering, Ulsan National Institute of Science and Technology, Ulsan, Korea; Electronics and Telecommunications Research Institute (ETRI), Daejeon, Korea
关键词: data-decentralized training framework, locally distributed samples, recently garnered attention, keeping data privacy, semantic heterogeneity
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: This work has been submitted to the IEEE for possible publication
点击查看摘要
Abstract:Federated learning (FL) has recently garnered attention as a data-decentralized training framework that enables the learning of deep models from locally distributed samples while keeping data privacy. Built upon the framework, immense efforts have been made to establish FL benchmarks, which provide rigorous evaluation settings that control data heterogeneity across clients. Prior efforts have mainly focused on handling relatively simple classification tasks, where each sample is annotated with a one-hot label, such as MNIST, CIFAR, LEAF benchmark, etc. However, little attention has been paid to demonstrating an FL benchmark that handles complicated semantics, where each sample encompasses diverse semantic information from multiple labels, such as Panoptic Scene Graph Generation (PSG) with objects, subjects, and relations between them. Because the existing benchmark is designed to distribute data in a narrow view of a single semantic, e.g., a one-hot label, managing the complicated semantic heterogeneity across clients when formalizing FL benchmarks is non-trivial. In this paper, we propose a benchmark process to establish an FL benchmark with controllable semantic heterogeneity across clients: two key steps are i) data clustering with semantics and ii) data distributing via controllable semantic heterogeneity across clients. As a proof of concept, we first construct a federated PSG benchmark, demonstrating the efficacy of the existing PSG methods in an FL setting with controllable semantic heterogeneity of scene graphs. We also present the effectiveness of our benchmark by applying robust federated learning algorithms to data heterogeneity to show increased performance. Our code is available at this https URL.
zh
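下面给出一个示意片段,说明"先按语义聚类、再以可控异质性把数据分发到客户端"中常见的一种做法:对每个语义簇用 Dirichlet 分布决定其在各客户端上的占比,用浓度参数 alpha 控制异质性强弱。Dirichlet 划分是联邦学习基准中的常用手段,未必与论文的具体流程完全一致。

```python
# 极简示意:先给每个样本一个"语义簇"标签,再用 Dirichlet 分布把各簇样本划分到客户端
import numpy as np

def partition_by_semantics(cluster_ids: np.ndarray, n_clients: int, alpha: float, seed: int = 0):
    rng = np.random.default_rng(seed)
    clients = [[] for _ in range(n_clients)]
    for c in np.unique(cluster_ids):
        idx = np.where(cluster_ids == c)[0]
        rng.shuffle(idx)
        probs = rng.dirichlet(alpha * np.ones(n_clients))     # 该簇在各客户端上的占比
        cuts = (np.cumsum(probs)[:-1] * len(idx)).astype(int)
        for client, part in enumerate(np.split(idx, cuts)):
            clients[client].extend(part.tolist())
    return clients

cluster_ids = np.random.default_rng(1).integers(0, 8, size=1000)   # 假设 8 个语义簇
parts = partition_by_semantics(cluster_ids, n_clients=4, alpha=0.3)
print([len(p) for p in parts])   # alpha 越小,各客户端的语义分布越不均衡
```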
[CV-264] COEF-VQ: Cost-Efficient Video Quality Understanding through a Cascaded Multimodal LLM Framework
【速读】: 该论文试图解决多模态大语言模型(MLLM)在视频质量理解任务中部署时对GPU资源需求过高的问题。解决方案的关键在于提出了COEF-VQ,一种新颖的级联MLLM框架。该框架通过融合视觉、文本和音频信号的MLLM,并结合一个轻量级模型作为预过滤阶段,以及MLLM作为精细考虑阶段,显著减少了GPU资源的需求,同时保持了仅使用MLLM时的性能。实验结果表明,COEF-VQ在TikTok视频管理平台上的两个视频质量理解任务中,实现了显著的性能提升,同时限制了资源消耗。
链接: https://arxiv.org/abs/2412.10435
作者: Xin Dong,Sen Jia,Hongyu Xiong
机构: TikTok, Inc.(TikTok公司); TikTok, Inc.(TikTok公司); TikTok, Inc.(TikTok公司)
关键词: Multimodal Large Language, recent Multimodal Large, Large Language Model, Multimodal Large, Large Language
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:Recently, with the emergence of recent Multimodal Large Language Model (MLLM) technology, it has become possible to exploit its video understanding capability on different classification tasks. In practice, we face the difficulty of huge requirements for GPU resource if we need to deploy MLLMs online. In this paper, we propose COEF-VQ, a novel cascaded MLLM framework for better video quality understanding on TikTok. To this end, we first propose a MLLM fusing all visual, textual and audio signals, and then develop a cascade framework with a lightweight model as pre-filtering stage and MLLM as fine-consideration stage, significantly reducing the need for GPU resource, while retaining the performance demonstrated solely by MLLM. To demonstrate the effectiveness of COEF-VQ, we deployed this new framework onto the video management platform (VMP) at TikTok, and performed a series of detailed experiments on two in-house tasks related to video quality understanding. We show that COEF-VQ leads to substantial performance gains with limit resource consumption in these two tasks.
zh
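下面用一个极简片段示意级联推理的控制逻辑:轻量模型先给出分数,只有落在不确定区间的样本才调用开销更大的 MLLM 复核。两个打分函数均为占位实现,阈值为假设值,并非 COEF-VQ 的线上配置。

```python
# 极简示意:级联式质量判定——轻量模型先过滤掉明显可判定的样本,
# 只有落在不确定区间的样本才交给开销更大的 MLLM 复核(阈值为假设值)
def light_model_score(video) -> float:
    return 0.42   # 占位:轻量模型给出的低质量概率

def mllm_score(video) -> float:
    return 0.91   # 占位:MLLM 融合视觉/文本/音频后的低质量概率

def cascade_predict(video, low_thr: float = 0.2, high_thr: float = 0.8) -> bool:
    s = light_model_score(video)
    if s < low_thr:
        return False          # 轻量模型足够自信:判为正常,不再调用 MLLM
    if s > high_thr:
        return True           # 轻量模型足够自信:判为低质量
    return mllm_score(video) > 0.5   # 不确定样本才进入 MLLM 精细判定

print(cascade_predict(video=None))
```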
[CV-265] Implicit Neural Compression of Point Clouds
【速读】: 该论文试图解决高精度、非结构化点云数据的压缩问题。解决方案的关键在于提出了一种名为 NeRC³ 的新型点云压缩框架,该框架利用隐式神经表示 (Implicit Neural Representations) 来处理几何和属性信息。具体来说,NeRC³ 使用两个基于坐标的神经网络:第一个网络确定体素的占用状态,第二个网络预测已占用体素的属性。通过将体素坐标输入这些网络,接收端能够高效地重建原始点云的几何和属性。此外,神经网络参数和重建所需的辅助信息被量化和压缩。论文还扩展了该方法以支持动态点云压缩,通过引入 4D 时空表示 (4D-NeRC³) 来减少时间冗余。实验结果表明,NeRC³ 在静态点云压缩方面优于基于八叉树的方法,而 4D-NeRC³ 在动态点云的几何压缩方面表现优于最新的 G-PCC 和 V-PCC 标准,并在联合几何和属性压缩方面取得了有竞争力的结果。
链接: https://arxiv.org/abs/2412.10433
作者: Hongning Ruan,Yulin Shao,Qianqian Yang,Liang Zhao,Zhaoyang Zhang,Dusit Niyato
机构: Zhejiang University; University of Macau; Nanyang Technological University
关键词: numerous applications due, point cloud, objects and scenes, Point, accurately depict
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Signal Processing (eess.SP)
备注: 16 pages, 8 figures
点击查看摘要
Abstract:Point clouds have gained prominence in numerous applications due to their ability to accurately depict 3D objects and scenes. However, compressing unstructured, high-precision point cloud data effectively remains a significant challenge. In this paper, we propose NeRC³, a novel point cloud compression framework leveraging implicit neural representations to handle both geometry and attributes. Our approach employs two coordinate-based neural networks to implicitly represent a voxelized point cloud: the first determines the occupancy status of a voxel, while the second predicts the attributes of occupied voxels. By feeding voxel coordinates into these networks, the receiver can efficiently reconstruct the original point cloud's geometry and attributes. The neural network parameters are quantized and compressed alongside auxiliary information required for reconstruction. Additionally, we extend our method to dynamic point cloud compression with techniques to reduce temporal redundancy, including a 4D spatial-temporal representation termed 4D-NeRC³. Experimental results validate the effectiveness of our approach: for static point clouds, NeRC³ outperforms octree-based methods in the latest G-PCC standard. For dynamic point clouds, 4D-NeRC³ demonstrates superior geometry compression compared to state-of-the-art G-PCC and V-PCC standards and achieves competitive results for joint geometry and attribute compression.
zh
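【补充示例】以下为“基于坐标的隐式表示”的最小 PyTorch 示意:两个 MLP 分别根据体素坐标预测占用状态与属性,接收端只需网络参数和体素坐标即可重建。网络层数、宽度与属性维度均为示意性假设,与论文的实际配置无关。

```python
import torch
import torch.nn as nn

class CoordMLP(nn.Module):
    """坐标 MLP:输入体素坐标 (x, y, z),输出占用 logit 或属性向量。"""
    def __init__(self, out_dim: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(3, 128), nn.ReLU(),
            nn.Linear(128, 128), nn.ReLU(),
            nn.Linear(128, out_dim),
        )

    def forward(self, xyz):
        return self.net(xyz)

occupancy_net = CoordMLP(out_dim=1)   # 第一网络:体素是否被占用
attribute_net = CoordMLP(out_dim=3)   # 第二网络:被占用体素的颜色等属性

coords = torch.rand(1024, 3)                         # 归一化体素坐标
occ_prob = torch.sigmoid(occupancy_net(coords))      # 占用概率
occupied = coords[occ_prob.squeeze(-1) > 0.5]        # 仅对被占用体素解码属性
attrs = attribute_net(occupied)
print(occupied.shape, attrs.shape)
```

实际压缩时,还需对网络参数做量化与熵编码,这部分未在示意中体现。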
[CV-266] CUPS: Improving Human Pose-Shape Estimators with Conformalized Deep Uncertainty
【速读】: 该论文试图解决从RGB视频中学习序列到序列的3D人体形状和姿态,并进行不确定性量化的问题。解决方案的关键在于提出了一种名为CUPS的新方法,通过在训练过程中生成和评分多个假设,将不确定性量化有效地整合到学习过程中。这种方法训练了一个端到端的深度不确定性函数,与3D姿态估计器共同优化。训练后,该深度不确定性模型作为一致性评分,用于校准conformal predictor(保形预测器),从而评估输出预测的质量。论文还提出了两个实用的覆盖率差距界限,为模型的不确定性界限提供了理论支持,最终在多个数据集和指标上实现了最先进的性能,并继承了conformal prediction的概率保证。
链接: https://arxiv.org/abs/2412.10431
作者: Harry Zhang,Luca Carlone
机构: 未知
关键词: introduce CUPS, RGB videos, uncertainty quantification, conformal prediction, integrating uncertainty quantification
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
备注:
点击查看摘要
Abstract:We introduce CUPS, a novel method for learning sequence-to-sequence 3D human shapes and poses from RGB videos with uncertainty quantification. To improve on top of prior work, we develop a method to generate and score multiple hypotheses during training, effectively integrating uncertainty quantification into the learning process. This process results in a deep uncertainty function that is trained end-to-end with the 3D pose estimator. Post-training, the learned deep uncertainty model is used as the conformity score, which can be used to calibrate a conformal predictor in order to assess the quality of the output prediction. Since the data in human pose-shape learning is not fully exchangeable, we also present two practical bounds for the coverage gap in conformal prediction, developing theoretical backing for the uncertainty bound of our model. Our results indicate that by taking advantage of deep uncertainty with conformal prediction, our method achieves state-of-the-art performance across various metrics and datasets while inheriting the probabilistic guarantees of conformal prediction.
zh
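【补充示例】下面用 NumPy 给出分裂式保形预测(split conformal prediction)校准的最小示意:在标定集上取一致性分数的分位数作为阈值,推理时保留分数不超过阈值的候选姿态假设。数据为随机占位,且未包含论文针对非完全可交换数据给出的覆盖率差距修正。

```python
import numpy as np

def calibrate_threshold(scores_calib: np.ndarray, alpha: float = 0.1) -> float:
    """在标定集上取带有限样本校正的 (1 - alpha) 分位数作为保形阈值。"""
    n = len(scores_calib)
    q = np.ceil((n + 1) * (1 - alpha)) / n
    return float(np.quantile(scores_calib, min(q, 1.0), method="higher"))

# 假设 scores_calib 是深度不确定性函数在标定集上给出的一致性分数(越小越一致)
rng = np.random.default_rng(0)
scores_calib = rng.random(500)
tau = calibrate_threshold(scores_calib, alpha=0.1)

# 推理:对同一输入的 20 个 3D 姿态假设打分,保留分数不超过 tau 的假设构成预测集
candidate_scores = rng.random(20)
prediction_set = np.where(candidate_scores <= tau)[0]
print(tau, prediction_set)
```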
[CV-267] Unsupervised Cross-Domain Regression for Fine-grained 3D Game Character Reconstruction
【速读】: 该论文旨在解决从单视图图像中重建细粒度3D游戏角色的问题,特别是在虚拟世界(metaverse)中,确保角色重建的忠实性和沉浸感。解决方案的关键在于提出了一种跨域框架,通过有效的回归器(regressor)显著减少现实世界与游戏领域之间的差异,并采用无监督学习实现目标领域的知识迁移。此外,论文还引入了对比损失(contrastive loss)来解决实例间的差异,保持重建角色的个性化细节,并通过辅助的3D身份感知提取器(3D identity-aware extractor)进一步提升模型结果的精确性。最终,该方法能够生成大量物理意义明确的面部参数,实验表明其在3D游戏角色重建方面达到了最先进的性能。
链接: https://arxiv.org/abs/2412.10430
作者: Qi Wen,Xiang Wen,Hao Jiang,Siqi Yang,Bingfeng Han,Tianlei Hu,Gang Chen,Shuang Li
机构: ByteDanceHangzhouChina; Zhejiang UniversityHangzhouChina; Beijing Institute of TechnologyBeijingChina
关键词: virtual world faithfully, world faithfully, rapid development, virtual world, game character
类目: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
备注: 12 pages, 10 figures
点击查看摘要
Abstract:With the rise of the "metaverse" and the rapid development of games, it has become more and more critical to reconstruct characters in the virtual world faithfully. The immersive experience is one of the most central themes of the "metaverse", while the reducibility of the avatar is the crucial point. Meanwhile, the game is the carrier of the metaverse, in which players can freely edit the facial appearance of the game character. In this paper, we propose a simple but powerful cross-domain framework that can reconstruct fine-grained 3D game characters from single-view images in an end-to-end manner. Different from the previous methods, which do not resolve the cross-domain gap, we propose an effective regressor that can greatly reduce the discrepancy between the real-world domain and the game domain. To figure out the drawbacks of no ground truth, our unsupervised framework has accomplished the knowledge transfer of the target domain. Additionally, an innovative contrastive loss is proposed to solve the instance-wise disparity, which keeps the person-specific details of the reconstructed character. In contrast, an auxiliary 3D identity-aware extractor is activated to make the results of our model more impeccable. Then a large set of physically meaningful facial parameters is generated robustly and exquisitely. Experiments demonstrate that our method yields state-of-the-art performance in 3D game character reconstruction.
zh
[CV-268] GPTDrawer: Enhancing Visual Synthesis through ChatGPT
【速读】: 该论文试图解决AI驱动图像生成领域中,如何提高图像生成对文本提示的精确性和相关性的问题。解决方案的关键在于引入GPTDrawer这一创新流程,通过结合GPT模型(GPT-based models)的生成能力与Stable Diffusion的图像生成技术,采用迭代优化的算法。该算法利用关键词提取、语义分析和图文一致性评估,通过ChatGPT进行自然语言处理,逐步调整输入提示,并使用余弦相似度(cosine similarity)作为指导,直到达到语义对齐的阈值。这种方法显著提升了根据用户定义提示生成的图像的保真度,展示了系统对复杂语义结构的理解和可视化能力。
链接: https://arxiv.org/abs/2412.10429
作者: Kun Li,Xinwei Chen,Tianyou Song,Hansong Zhang,Wenzhe Zhang,Qing Shan
机构: University of Illinois Urbana-Champaign; University of Illinois at Urbana Champaign; Columbia University; University of California San Diego; Northeastern University
关键词: prompts remains paramount, textual prompts remains, AI-driven image generation, remains paramount, burgeoning field
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:In the burgeoning field of AI-driven image generation, the quest for precision and relevance in response to textual prompts remains paramount. This paper introduces GPTDrawer, an innovative pipeline that leverages the generative prowess of GPT-based models to enhance the visual synthesis process. Our methodology employs a novel algorithm that iteratively refines input prompts using keyword extraction, semantic analysis, and image-text congruence evaluation. By integrating ChatGPT for natural language processing and Stable Diffusion for image generation, GPTDrawer produces a batch of images that undergo successive refinement cycles, guided by cosine similarity metrics until a threshold of semantic alignment is attained. The results demonstrate a marked improvement in the fidelity of images generated in accordance with user-defined prompts, showcasing the system’s ability to interpret and visualize complex semantic constructs. The implications of this work extend to various applications, from creative arts to design automation, setting a new benchmark for AI-assisted creative processes.
zh
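【补充示例】以下为“生成、评估、改写提示词”迭代循环的 Python 示意,以图文相似度达到阈值作为停止条件。其中 `generate_image`、`text_image_similarity`、`refine_prompt` 均为占位函数,实际系统分别对应 Stable Diffusion、CLIP 类余弦相似度与 ChatGPT 的关键词抽取/语义改写。

```python
def generate_image(prompt: str):
    """假设的图像生成接口(实际为 Stable Diffusion),此处返回占位对象。"""
    return {"prompt": prompt}

def text_image_similarity(prompt: str, image) -> float:
    """假设的图文一致性打分(实际可用 CLIP 余弦相似度),此处用占位逻辑。"""
    return min(1.0, 0.5 + 0.1 * prompt.count(","))

def refine_prompt(prompt: str, image) -> str:
    """假设的提示词精炼(实际由 ChatGPT 完成),此处仅做占位拼接。"""
    return prompt + ", more detailed"

def gptdrawer_loop(prompt: str, threshold: float = 0.8, max_iters: int = 5):
    """迭代优化:生成图像并评估图文一致性,未达阈值则改写提示词再试。"""
    image = generate_image(prompt)
    for _ in range(max_iters):
        if text_image_similarity(prompt, image) >= threshold:
            break
        prompt = refine_prompt(prompt, image)
        image = generate_image(prompt)
    return image, prompt

image, final_prompt = gptdrawer_loop("a cat sitting on a chair")
print(final_prompt)
```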
[CV-269] Physics Meets Pixels: PDE Models in Image Processing
【速读】: 该论文试图解决图像处理中的各种复杂问题,如去噪、去模糊、锐化、修复和特征提取等,通过引入基于物理的偏微分方程 (Partial Differential Equations, PDEs) 模型来提供新的解决方案。解决方案的关键在于开发和应用创新的PDE模型,这些模型结合了数学原理和方法,据我们所知,这些方法在图像处理领域尚未被应用。通过这些新模型,论文展示了它们在解决传统PDE方法无法应对的挑战方面的潜力,并通过理论分析和大量数值实验验证了其有效性。
链接: https://arxiv.org/abs/2412.11946
作者: Alejandro Garnung Menéndez
机构: University of Oviedo(奥维耶多大学); Gijón Polytechnic School of Engineering(希洪理工学院工程系)
关键词: Partial Differential Equations, Partial Differential, Differential Equations, geometric properties inherent, image processing
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Numerical Analysis (math.NA)
备注: 19 pages, 15 figures, 4 tables
点击查看摘要
Abstract:Partial Differential Equations (PDEs) have long been recognized as powerful tools for image processing and analysis, providing a framework to model and exploit structural and geometric properties inherent in visual data. Over the years, numerous PDE-based models have been developed and refined, inspired by natural analogies between physical phenomena and image spaces. These methods have proven highly effective in a wide range of applications, including denoising, deblurring, sharpening, inpainting, feature extraction, and others. This work provides a theoretical and computational exploration of both fundamental and innovative PDE models applied to image processing, accompanied by extensive numerical experimentation and objective and subjective analysis. Building upon well-established techniques, we introduce novel physical-based PDE models specifically designed for various image processing tasks. These models incorporate mathematical principles and approaches that, to the best of our knowledge, have not been previously applied in this domain, showcasing their potential to address challenges beyond the capabilities of traditional and existing PDE methods. By formulating and solving these mathematical models, we demonstrate their effectiveness in advancing image processing tasks while retaining a rigorous connection to their theoretical underpinnings. This work seeks to bridge foundational concepts and cutting-edge innovations, contributing to the evolution of PDE methodologies in digital image processing and related interdisciplinary fields.
zh
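【补充示例】作为 PDE 图像处理中最经典的例子,下面用 NumPy 给出热传导方程(各向同性扩散)u_t = Δu 的显式迭代去噪示意;这只是传统方法的演示,并非论文提出的新物理模型。

```python
import numpy as np

def heat_diffusion_denoise(img: np.ndarray, n_steps: int = 50, dt: float = 0.1) -> np.ndarray:
    """显式欧拉迭代求解 u_t = Δu,对图像做各向同性平滑去噪(周期边界)。"""
    u = img.astype(np.float64).copy()
    for _ in range(n_steps):
        lap = (np.roll(u, 1, axis=0) + np.roll(u, -1, axis=0) +
               np.roll(u, 1, axis=1) + np.roll(u, -1, axis=1) - 4.0 * u)  # 五点差分拉普拉斯
        u += dt * lap  # dt <= 0.25 保证该显式格式稳定
    return u

noisy = np.random.rand(64, 64)            # 占位噪声图像
smoothed = heat_diffusion_denoise(noisy)
print(noisy.std(), smoothed.std())        # 平滑后标准差应明显下降
```

在此基础上,Perona-Malik 等各向异性扩散模型会使用依赖梯度的扩散系数以在去噪的同时保留边缘。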
[CV-270] Are the Latent Representations of Foundation Models for Pathology Invariant to Rotation?
【速读】: 该论文试图解决数字病理学中自监督基础模型在处理H&E全切片图像时,其潜在表示对图像块旋转的不变性问题。研究的关键在于通过量化非旋转和旋转图像块之间的对齐程度,使用互k近邻和余弦距离来评估十二种基础模型的旋转不变性。研究发现,在自监督训练过程中引入旋转增强的模型表现出显著更高的旋转不变性,这表明在transformer架构中缺乏旋转归纳偏置的情况下,训练时的旋转增强是实现学习到的旋转不变性的关键。
链接: https://arxiv.org/abs/2412.11938
作者: Matouš Elphick,Samra Turajlic,Guang Yang
机构: 未知
关键词: digital pathology encode, pathology encode small, encode small patches, downstream tasks, digital pathology
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注: Samra Turajlic and Guang Yang are joint last authors
点击查看摘要
Abstract:Self-supervised foundation models for digital pathology encode small patches from H&E whole slide images into latent representations used for downstream tasks. However, the invariance of these representations to patch rotation remains unexplored. This study investigates the rotational invariance of latent representations across twelve foundation models by quantifying the alignment between non-rotated and rotated patches using mutual k-nearest neighbours and cosine distance. Models that incorporated rotation augmentation during self-supervised training exhibited significantly greater invariance to rotations. We hypothesise that the absence of rotational inductive bias in the transformer architecture necessitates rotation augmentation during training to achieve learned invariance. Code: this https URL.
zh
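【补充示例】下面用 NumPy 给出论文所述两种度量的简化示意:旋转前后 patch 嵌入的逐样本余弦距离,以及互为 k 近邻的比例。嵌入为随机占位数据,实际应替换为各基础模型对同一批 patch 旋转前后的输出;k 的取值等细节为假设。

```python
import numpy as np

def cosine_distance(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """逐样本余弦距离:1 - cos(a_i, b_i),越小说明旋转后表示越不变。"""
    a_n = a / np.linalg.norm(a, axis=1, keepdims=True)
    b_n = b / np.linalg.norm(b, axis=1, keepdims=True)
    return 1.0 - np.sum(a_n * b_n, axis=1)

def mutual_knn_rate(a: np.ndarray, b: np.ndarray, k: int = 5) -> float:
    """互为 k 近邻的比例:衡量旋转前后嵌入空间邻域结构的一致程度。"""
    a_n = a / np.linalg.norm(a, axis=1, keepdims=True)
    b_n = b / np.linalg.norm(b, axis=1, keepdims=True)
    sim = a_n @ b_n.T
    nn_ab = np.argsort(-sim, axis=1)[:, :k]    # 每个未旋转样本在旋转集合中的近邻
    nn_ba = np.argsort(-sim.T, axis=1)[:, :k]  # 每个旋转样本在未旋转集合中的近邻
    hits = sum(i in nn_ba[j] for i, row in enumerate(nn_ab) for j in row)
    return hits / (len(a) * k)

rng = np.random.default_rng(0)
emb_orig = rng.normal(size=(100, 64))                    # 占位:未旋转 patch 嵌入
emb_rot = emb_orig + 0.05 * rng.normal(size=(100, 64))   # 占位:旋转后 patch 嵌入
print(cosine_distance(emb_orig, emb_rot).mean(), mutual_knn_rate(emb_orig, emb_rot))
```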
[CV-271] Ensemble Learning and 3D Pix2Pix for Comprehensive Brain Tumor Analysis in Multimodal MRI MICCAI
【速读】: 该论文旨在解决多模态磁共振成像(MRI)中胶质瘤受影响脑区域的分割与修复问题。解决方案的关键在于集成多种先进技术,包括结合混合Transformer模型和卷积神经网络(CNN)的集成学习方法,以及创新的3D Pix2Pix生成对抗网络(GAN)应用。具体来说,该方法通过轴向注意力机制和Transformer编码器增强空间关系建模,实现精确的肿瘤分割,同时利用3D Pix2Pix GAN生成生物学上合理的脑组织,从而实现逼真的修复。这种方法不仅在分割任务中表现出优异的性能,如Dice相似系数(DSC)和Hausdorff距离(HD95),还在修复任务中通过结构相似性指数(SSIM)、峰值信噪比(PSNR)和均方误差(MSE)等指标展示了高质量的输出,为临床决策和患者护理提供了重要支持。
链接: https://arxiv.org/abs/2412.11849
作者: Ramy A. Zeineldin,Franziska Mathis-Ullrich
机构: 未知
关键词: Generative Adversarial Network, convolutional neural networks, multi-modal magnetic resonance, Generative Adversarial, Adversarial Network
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: Accepted at the MICCAI BraTS Challenge 2023
点击查看摘要
Abstract:Motivated by the need for advanced solutions in the segmentation and inpainting of glioma-affected brain regions in multi-modal magnetic resonance imaging (MRI), this study presents an integrated approach leveraging the strengths of ensemble learning with hybrid transformer models and convolutional neural networks (CNNs), alongside the innovative application of 3D Pix2Pix Generative Adversarial Network (GAN). Our methodology combines robust tumor segmentation capabilities, utilizing axial attention and transformer encoders for enhanced spatial relationship modeling, with the ability to synthesize biologically plausible brain tissue through 3D Pix2Pix GAN. This integrated approach addresses the BraTS 2023 cluster challenges by offering precise segmentation and realistic inpainting, tailored for diverse tumor types and sub-regions. The results demonstrate outstanding performance, evidenced by quantitative evaluations such as the Dice Similarity Coefficient (DSC), Hausdorff Distance (HD95) for segmentation, and Structural Similarity Index Measure (SSIM), Peak Signal-to-Noise Ratio (PSNR), and Mean-Square Error (MSE) for inpainting. Qualitative assessments further validate the high-quality, clinically relevant outputs. In conclusion, this study underscores the potential of combining advanced machine learning techniques for comprehensive brain tumor analysis, promising significant advancements in clinical decision-making and patient care within the realm of medical imaging.
zh
[CV-272] Point Cloud-Assisted Neural Image Compression
【速读】: 该论文试图解决现有图像编解码器(codecs)在处理多模态数据时未能充分利用其他模态辅助信息,导致压缩效率低下的问题。解决方案的关键在于提出了点云辅助的神经图像编解码器(PCA-NIC),通过统一图像和点云的数据表示,利用点云的高维信息增强图像纹理和结构的保留。此外,引入多模态特征融合变换模块(MMFFT)来捕捉更具代表性的图像特征,并去除与图像内容无关的通道和模态间的冗余信息。该方法首次通过点云提升图像压缩性能,并达到了最先进的性能水平。
链接: https://arxiv.org/abs/2412.11771
作者: Ziqun Li,Qi Zhang,Xiaofeng Huang,Zhao Wang,Siwei Ma,Wei Yan
机构: Peking University(北京大学); Advanced Institute of Information Technology, Peking University(北京大学信息科学技术学院)
关键词: High-efficient image compression, High-efficient image, critical requirement, image compression, image compression performance
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:High-efficient image compression is a critical requirement. In several scenarios where multiple modalities of data are captured by different sensors, the auxiliary information from other modalities is not fully leveraged by existing image-only codecs, leading to suboptimal compression efficiency. In this paper, we increase image compression performance with the assistance of point cloud, which is widely adopted in the area of autonomous driving. We first unify the data representation for both modalities to facilitate data processing. Then, we propose the point cloud-assisted neural image codec (PCA-NIC) to enhance the preservation of image texture and structure by utilizing the high-dimensional point cloud information. We further introduce a multi-modal feature fusion transform module (MMFFT) to capture more representative image features and remove redundant information between channels and modalities that is not relevant to the image content. Our work is the first to improve image compression performance using point cloud and achieves state-of-the-art performance.
zh
[CV-273] Fast-staged CNN Model for Accurate pulmonary diseases and Lung cancer detection
【速读】: 该论文旨在解决肺部病理诊断中由于专业放射科医生资源有限而导致的诊断延迟问题。解决方案的关键在于利用深度学习技术,特别是卷积神经网络 (Convolutional Neural Network, CNN),通过大规模多样化的胸部X光片数据集进行训练,以实现对肺部病变的自动检测。研究采用两阶段分类系统,首先将图像分为正常或异常,然后进一步识别包括肺结节在内的九种特定肺部病理。该模型在外部验证中表现出较高的准确性(77%)、敏感性(0.713)、特异性(0.776)和AUC值(0.888),展示了其在不同患者群体中的泛化潜力。未来改进方向包括引入ETL数据分布策略和扩展数据集以提高诊断精度。
链接: https://arxiv.org/abs/2412.11681
作者: Abdelbaki Souid,Mohamed Hamroun,Soufiene Ben Othman,Hedi Sakli,Naceur Abdelkarim
机构: 未知
关键词: global health concern, significant global health, health concern, treated promptly, significant global
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注: IEEE International Workshop on Mechatronic Systems Supervision 2023
点击查看摘要
Abstract:Pulmonary pathologies are a significant global health concern, often leading to fatal outcomes if not diagnosed and treated promptly. Chest radiography serves as a primary diagnostic tool, but the availability of experienced radiologists remains limited. Advances in Artificial Intelligence (AI) and machine learning, particularly in computer vision, offer promising solutions to address this challenge. This research evaluates a deep learning model designed to detect lung cancer, specifically pulmonary nodules, along with eight other lung pathologies, using chest radiographs. The study leverages diverse datasets comprising over 135,120 frontal chest radiographs to train a Convolutional Neural Network (CNN). A two-stage classification system, utilizing ensemble methods and transfer learning, is employed to first triage images into Normal or Abnormal categories and then identify specific pathologies, including lung nodules. The deep learning model achieves notable results in nodule classification, with a top-performing accuracy of 77%, a sensitivity of 0.713, a specificity of 0.776 during external validation, and an AUC score of 0.888. Despite these successes, some misclassifications were observed, primarily false negatives. In conclusion, the model demonstrates robust potential for generalization across diverse patient populations, attributed to the geographic diversity of the training dataset. Future work could focus on integrating ETL data distribution strategies and expanding the dataset with additional nodule-type samples to further enhance diagnostic accuracy.
zh
[CV-274] Block-Based Multi-Scale Image Rescaling AAAI2025
【速读】: 该论文试图解决高分辨率图像(HR)在缩放过程中信息分布不均的问题,特别是在2K及以上分辨率的图像中。传统图像缩放方法仅关注整体缩放比例,忽略了图像不同区域信息量的差异,导致缩放后的图像质量下降。解决方案的关键在于提出的基于块的多尺度图像缩放框架(Block-Based Multi-Scale Image Rescaling Framework, BBMR)。BBMR通过将HR图像分割为等大小的子块,并在下采样模块中为每个子块动态分配不同的缩放比例,同时保持整体缩放比例不变,从而有效处理信息分布不均的问题。在上采样模块中,采用联合超分辨率方法(Joint Super-Resolution, JointSR)对具有不同缩放比例的子块进行处理,消除块状伪影,显著提升超分辨率(SR)图像的质量。
链接: https://arxiv.org/abs/2412.11468
作者: Jian Li,Siwang Zhou
机构: 未知
关键词: Image rescaling, image rescaling methods, Image, Image Rescaling Framework, seeks to determine
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注: This paper has been accepted by AAAI2025
点击查看摘要
Abstract:Image rescaling (IR) seeks to determine the optimal low-resolution (LR) representation of a high-resolution (HR) image to reconstruct a high-quality super-resolution (SR) image. Typically, HR images with resolutions exceeding 2K possess rich information that is unevenly distributed across the image. Traditional image rescaling methods often fall short because they focus solely on the overall scaling rate, ignoring the varying amounts of information in different parts of the image. To address this limitation, we propose a Block-Based Multi-Scale Image Rescaling Framework (BBMR), tailored for IR tasks involving HR images of 2K resolution and higher. BBMR consists of two main components: the Downscaling Module and the Upscaling Module. In the Downscaling Module, the HR image is segmented into sub-blocks of equal size, with each sub-block receiving a dynamically allocated scaling rate while maintaining a constant overall scaling rate. For the Upscaling Module, we introduce the Joint Super-Resolution method (JointSR), which performs SR on these sub-blocks with varying scaling rates and effectively eliminates blocking artifacts. Experimental results demonstrate that BBMR significantly enhances the SR image quality on the 2K and 4K test datasets compared to initial network image rescaling methods.
zh
[CV-275] Controllable Distortion-Perception Tradeoff Through Latent Diffusion for Neural Image Compression AAAI2025
【速读】: 该论文试图解决神经图像压缩中常见的速率、失真和感知之间的权衡问题。现有方法通常侧重于高像素级保真度或优化感知指标,而本文提出了一种新颖的方法,能够在不改变原始神经图像编解码器的情况下,同时兼顾失真和感知质量。解决方案的关键在于在解码器端引入一个即插即用模块,利用潜在扩散过程(latent diffusion process)对解码特征进行转换,从而在不增加额外训练的情况下,灵活调整失真与感知之间的平衡。实验结果表明,该方法显著提升了预训练编解码器的性能,能够在广泛的失真-感知范围内进行调整,同时保持原有的压缩能力。
链接: https://arxiv.org/abs/2412.11379
作者: Chuqin Zhou,Guo Lu,Jiangchuan Li,Xiangyu Chen,Zhengxue Cheng,Li Song,Wenjun Zhang
机构: 1. Simon Fraser University (西蒙弗雷泽大学); 2. Huawei Technologies (华为技术)
关键词: Neural image, Neural image compression, trade-off among rate, fixed neural image, neural image codec
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by AAAI 2025
点击查看摘要
Abstract:Neural image compression often faces a challenging trade-off among rate, distortion and perception. While most existing methods typically focus on either achieving high pixel-level fidelity or optimizing for perceptual metrics, we propose a novel approach that simultaneously addresses both aspects for a fixed neural image codec. Specifically, we introduce a plug-and-play module at the decoder side that leverages a latent diffusion process to transform the decoded features, enhancing either low distortion or high perceptual quality without altering the original image compression codec. Our approach facilitates fusion of original and transformed features without additional training, enabling users to flexibly adjust the balance between distortion and perception during inference. Extensive experimental results demonstrate that our method significantly enhances the pretrained codecs with a wide, adjustable distortion-perception range while maintaining their original compression capabilities. For instance, we can achieve more than 150% improvement in LPIPS-BDRate without sacrificing more than 1 dB in PSNR.
zh
[CV-276] Improving Automatic Fetal Biometry Measurement with Swoosh Activation Function
【速读】: 该论文试图解决胎儿丘脑直径 (FTD) 和胎儿头围 (FHC) 测量中的问题,特别是由于2D超声图像的高噪声比和模糊边缘导致的测量不准确性。解决方案的关键在于提出了一种新的Swoosh激活函数 (SAF),该函数通过增强热图的正则化来减少预测热图中热点的不均匀性,从而提高FTD和FHC的测量精度。SAF作为正则化项,能够在预测热图之间强制执行最佳的均方误差 (MSE) 水平,显著提高了测量性能,并在实验中表现出比当前最先进的BiometryNet算法更高的类内相关系数和更低的平均差异。此外,SAF具有高度通用性和架构无关性,其系数可根据不同任务进行配置,显示出极大的灵活性和应用潜力。
链接: https://arxiv.org/abs/2412.11377
作者: Shijia Zhou,Euijoon Ahn,Hao Wang,Ann Quinton,Narelle Kennedy,Pradeeba Sridar,Ralph Nanan,Jinman Kim
机构: 未知
关键词: fetal thalamus diameter, abnormal fetal thalamus, fetal thalamus development, fetal head circumference, identifying abnormal fetal
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:The measurements of fetal thalamus diameter (FTD) and fetal head circumference (FHC) are crucial in identifying abnormal fetal thalamus development as it may lead to certain neuropsychiatric disorders in later life. However, manual measurements from 2D-US images are laborious, prone to high inter-observer variability, and complicated by the high signal-to-noise ratio nature of the images. Deep learning-based landmark detection approaches have shown promise in measuring biometrics from US images, but the current state-of-the-art (SOTA) algorithm, BiometryNet, is inadequate for FTD and FHC measurement due to its inability to account for the fuzzy edges of these structures and the complex shape of the FTD structure. To address these inadequacies, we propose a novel Swoosh Activation Function (SAF) designed to enhance the regularization of heatmaps produced by landmark detection algorithms. Our SAF serves as a regularization term to enforce an optimum mean squared error (MSE) level between predicted heatmaps, reducing the dispersiveness of hotspots in predicted heatmaps. Our experimental results demonstrate that SAF significantly improves the measurement performances of FTD and FHC with higher intraclass correlation coefficient scores in FTD and lower mean difference scores in FHC measurement than those of the current SOTA algorithm BiometryNet. Moreover, our proposed SAF is highly generalizable and architecture-agnostic. The SAF's coefficients can be configured for different tasks, making it highly customizable. Our study demonstrates that the SAF activation function is a novel method that can improve measurement accuracy in fetal biometry landmark detection. This improvement has the potential to contribute to better fetal monitoring and improved neonatal outcomes.
zh
[CV-277] VRVVC: Variable-Rate NeRF-Based Volumetric Video Compression
【速读】: 该论文试图解决基于神经辐射场(NeRF)的体积视频在存储和传输中面临的大数据量问题,并提出了一种端到端的联合优化可变比特率框架VRVVC。解决方案的关键在于:1)引入紧凑的三平面隐式残差表示(compact tri-plane implicit residual representation),用于长时动态场景的帧间建模,有效减少时间冗余;2)提出可变比特率的残差表示压缩方案,利用可学习的量化和基于小型MLP的熵模型,通过预定义的拉格朗日乘数管理所有潜在表示的量化误差;3)采用端到端的渐进训练策略和多比特率失真损失函数,优化整个框架。这些创新使得VRVVC在单个模型中实现广泛的可变比特率,并在各种数据集上超越现有方法的率失真性能。
链接: https://arxiv.org/abs/2412.11362
作者: Qiang Hu,Houqiang Zhong,Zihan Zheng,Xiaoyun Zhang,Zhengxue Cheng,Li Song,Guangtao Zhai,Yanfeng Wang
机构: 1. Shanghai Jiao Tong University (上海交通大学); 2. East China Normal University (华东师范大学); 3. Huawei Technologies (华为技术)
关键词: Neural Radiance Field, Neural Radiance, Radiance Field, revolutionized visual media, delivering photorealistic Free-Viewpoint
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Neural Radiance Field (NeRF)-based volumetric video has revolutionized visual media by delivering photorealistic Free-Viewpoint Video (FVV) experiences that provide audiences with unprecedented immersion and interactivity. However, the substantial data volumes pose significant challenges for storage and transmission. Existing solutions typically optimize NeRF representation and compression independently or focus on a single fixed rate-distortion (RD) tradeoff. In this paper, we propose VRVVC, a novel end-to-end joint optimization variable-rate framework for volumetric video compression that achieves variable bitrates using a single model while maintaining superior RD performance. Specifically, VRVVC introduces a compact tri-plane implicit residual representation for inter-frame modeling of long-duration dynamic scenes, effectively reducing temporal redundancy. We further propose a variable-rate residual representation compression scheme that leverages a learnable quantization and a tiny MLP-based entropy model. This approach enables variable bitrates through the utilization of predefined Lagrange multipliers to manage the quantization error of all latent representations. Finally, we present an end-to-end progressive training strategy combined with a multi-rate-distortion loss function to optimize the entire framework. Extensive experiments demonstrate that VRVVC achieves a wide range of variable bitrates within a single model and surpasses the RD performance of existing methods across various datasets.
zh
[CV-278] Macro2Micro: Cross-modal Magnetic Resonance Imaging Synthesis Leveraging Multi-scale Brain Structures
【速读】: 该论文试图解决从宏观解剖结构到微观组织架构的非线性关系映射问题,这一挑战主要源于多模态磁共振成像(MRI)获取的技术限制和高成本。解决方案的关键在于引入了一个名为Macro2Micro的深度学习框架,该框架利用生成对抗网络(GAN)从宏观结构预测脑微观结构。Macro2Micro的核心创新在于其基于脑组织无尺度、自相似的特性,将多尺度脑表示显式编码到不同的处理分支中,并通过引入辅助判别器和学习目标来增强图像保真度和抑制伪影。实验结果表明,该方法在将T1加权MRI转换为分数各向异性(FA)图像时,相较于先前方法在结构相似性指数(SSIM)上提升了6.8%,同时保留了个体神经生物学特征。
链接: https://arxiv.org/abs/2412.11277
作者: Sooyoung Kim,Joonwoo Kwon,Junbeom Kwon,Sangyoon Bae,Yuewei Lin,Shinjae Yoo,Jiook Cha
机构: Seoul National University (首尔国立大学); Brookhaven National Lab (布鲁克海文国家实验室)
关键词: Spanning multiple scales-from, multiple scales-from macroscopic, scales-from macroscopic anatomy, intricate microscopic architecture-the, microscopic architecture-the human
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注: The code will be made available upon acceptance
点击查看摘要
Abstract:Spanning multiple scales, from macroscopic anatomy down to intricate microscopic architecture, the human brain exemplifies a complex system that demands integrated approaches to fully understand its complexity. Yet, mapping nonlinear relationships between these scales remains challenging due to technical limitations and the high cost of multimodal Magnetic Resonance Imaging (MRI) acquisition. Here, we introduce Macro2Micro, a deep learning framework that predicts brain microstructure from macrostructure using a Generative Adversarial Network (GAN). Grounded in the scale-free, self-similar nature of brain organization, where microscale information can be inferred from macroscale patterns, Macro2Micro explicitly encodes multiscale brain representations into distinct processing branches. To further enhance image fidelity and suppress artifacts, we propose a simple yet effective auxiliary discriminator and learning objective. Our results show that Macro2Micro faithfully translates T1-weighted MRIs into corresponding Fractional Anisotropy (FA) images, achieving a 6.8% improvement in the Structural Similarity Index Measure (SSIM) compared to previous methods, while preserving the individual neurobiological characteristics.
zh
[CV-279] Plug-and-Play Priors as a Score-Based Method
【速读】: 该论文试图解决将基于分数的扩散模型(Score-based Diffusion Models, SBMs)与传统的插拔式(Plug-and-Play, PnP)方法结合的问题。解决方案的关键在于提出了一种新的视角,将PnP方法视为一种基于分数的方法,从而可以直接利用预训练的SBMs作为先验,而无需重新训练。通过建立数学关系,论文展示了如何将流行的SBMs适配为PnP中的先验,并实现了PnP与基于SBM的重建方法之间的直接比较,使用相同的神经网络作为先验。
链接: https://arxiv.org/abs/2412.11108
作者: Chicago Y. Park,Yuyang Hu,Michael T. McCann,Cristina Garcia-Cardona,Brendt Wohlberg,Ulugbek S. Kamilov
机构: Washington University in St. Louis(圣路易斯华盛顿大学); Los Alamos National Laboratory(洛斯阿拉莫斯国家实验室)
关键词: solving imaging inverse, imaging inverse problems, integrating physical measurement, physical measurement models, pre-trained deep denoisers
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Plug-and-play (PnP) methods are extensively used for solving imaging inverse problems by integrating physical measurement models with pre-trained deep denoisers as priors. Score-based diffusion models (SBMs) have recently emerged as a powerful framework for image generation by training deep denoisers to represent the score of the image prior. While both PnP and SBMs use deep denoisers, the score-based nature of PnP is unexplored in the literature due to its distinct origins rooted in proximal optimization. This letter introduces a novel view of PnP as a score-based method, a perspective that enables the re-use of powerful SBMs within classical PnP algorithms without retraining. We present a set of mathematical relationships for adapting popular SBMs as priors within PnP. We show that this approach enables a direct comparison between PnP and SBM-based reconstruction methods using the same neural network as the prior. Code is available at this https URL.
zh
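【补充示例】下面给出即插即用(PnP)迭代的极简 NumPy/SciPy 示意:数据一致性的梯度步与去噪步交替进行,去噪器充当先验的近端算子。这里用高斯滤波占位去噪器,论文的观点则是可以在同一位置直接复用预训练的基于分数的扩散去噪网络;算法形式(PnP-ISTA)与所有参数均为示意性假设。

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def denoiser(x: np.ndarray) -> np.ndarray:
    """占位去噪器:实际 PnP 中为预训练的深度去噪网络(亦即图像先验的分数)。"""
    return gaussian_filter(x, sigma=1.0)

def pnp_ista(y, A, At, step: float = 1.0, n_iters: int = 50):
    """PnP-ISTA:物理测量模型的梯度步 + 去噪步,去噪器替代正则项的近端映射。"""
    x = At(y)
    for _ in range(n_iters):
        grad = At(A(x) - y)            # 数据一致性梯度
        x = denoiser(x - step * grad)  # 去噪即先验投影
    return x

# 玩具逆问题:A 为随机像素掩码(部分观测)的前向算子
rng = np.random.default_rng(0)
mask = (rng.random((64, 64)) > 0.5).astype(float)
A = lambda x: mask * x
At = lambda y: mask * y
x_true = gaussian_filter(rng.random((64, 64)), sigma=2.0)
y = A(x_true)
x_rec = pnp_ista(y, A, At)
print(float(np.mean((x_rec - x_true) ** 2)))
```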
[CV-280] Unpaired Multi-Domain Histopathology Virtual Staining using Dual Path Prompted Inversion
【速读】: 该论文试图解决虚拟染色(virtual staining)中结构一致性保持和病理诊断内容保留的问题。解决方案的关键在于提出了一种双路径反演虚拟染色方法,利用提示学习(prompt learning)优化视觉提示(visual prompts)来控制内容和风格,同时确保病理诊断内容的完整性。具体来说,该方法包括两个核心组件:(1)双路径提示策略(Dual Path Prompted Strategy),通过特征适配器生成参考图像进行反演,分别构建风格目标路径和结构目标路径,确保结构一致性和风格信息的保留;(2)染色提示优化(StainPrompt Optimization),仅优化空视觉提示作为“操作符”,在每个时间步围绕关键噪声进行结构和风格轨迹的优化,实现精确的双路径反演重建。该方法在多领域未配对染色数据集上的广泛评估表明,其能够实现高结构一致性和准确的样式转换。
链接: https://arxiv.org/abs/2412.11106
作者: Bing Xiong,Yue Peng,RanRan Zhang,Fuqiang Chen,JiaYe He,Wenjian Qin
机构: 1. School of Computer Science and Engineering, University of Electronic Science and Technology of China (电子科技大学计算机科学与工程学院); 2. Artificial Intelligence Key Laboratory of Sichuan Province (四川省人工智能重点实验室); 3. School of Information and Software Engineering, University of Electronic Science and Technology of China (电子科技大学信息与软件工程学院)
关键词: histochemically stained tissue, stained tissue samples, staining leverages computer-aided, Virtual staining, leverages computer-aided techniques
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Virtual staining leverages computer-aided techniques to transfer the style of histochemically stained tissue samples to other staining types. In virtual staining of pathological images, maintaining strict structural consistency is crucial, as these images emphasize structural integrity more than natural images. Even slight structural alterations can lead to deviations in diagnostic semantic information. Furthermore, the unpaired characteristic of virtual staining data may compromise the preservation of pathological diagnostic content. To address these challenges, we propose a dual-path inversion virtual staining method using prompt learning, which optimizes visual prompts to control content and style, while preserving complete pathological diagnostic content. Our proposed inversion technique comprises two key components: (1) Dual Path Prompted Strategy: we utilize a feature adapter function to generate reference images for inversion, providing style templates for input image inversion, called the Style Target Path. We utilize the inversion of the input image as the Structural Target Path, employing visual prompt images to maintain structural consistency in this path while preserving style information from the Style Target Path. During the deterministic sampling process, we achieve complete content-style disentanglement through a plug-and-play embedding visual prompt approach. (2) StainPrompt Optimization, where we only optimize the null visual prompt as the "operator" for dual-path inversion, rather than fine-tuning the pre-trained model. We optimize the null visual prompt for structural and style trajectories around the pivotal noise at each timestep, ensuring accurate dual-path inversion reconstruction. Extensive evaluations on publicly available multi-domain unpaired staining datasets demonstrate high structural consistency and accurate style transfer results.
zh
[CV-281] A Digitalized Atlas for Pulmonary Airway
【速读】: 该论文试图解决气道解剖结构的自动提取与分级标注问题,特别是针对肺叶、肺段和亚段级别的精确标注。解决方案的关键在于提出了一个端到端的流程——AirwayAtlas,并通过生成基于气道分支多样特征的紧凑表示——AirwaySign,来实现高效的气道解剖结构提取与标注。实验结果表明,AirwayAtlas在多中心数据集上验证了其有效性,并且AirwaySign在肺部疾病的关联分析中表现出强大的工具性。
链接: https://arxiv.org/abs/2412.11039
作者: Minghui Zhang,Chenyu Li,Hanxiao Zhang,Yaoyu Liu,Yun Gu
机构: Institute of Medical Robotics(医疗机器人研究所); Shanghai Jiao Tong University(上海交通大学)
关键词: pipeline for automatic, anatomies with lobar, segmental and subsegmental, subsegmental labeling, automatic extraction
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注: Under Review
点击查看摘要
Abstract:In this work, we proposed AirwayAtlas, which is an end-to-end pipeline for automatic extraction of airway anatomies with lobar, segmental and subsegmental labeling. A compact representation, AirwaySign, is generated based on diverse features of airway branches. Experiments on multi-center datasets validated the effectiveness of AirwayAtlas. We also demonstrated that AirwaySign is a powerful tool for correlation analysis on pulmonary diseases.
zh
[CV-282] Mask Enhanced Deeply Supervised Prostate Cancer Detection on B-mode Micro-Ultrasound
【速读】: 该论文试图解决前列腺癌在微超声图像中的自动检测和分割问题,特别是针对临床上重要的癌症与正常组织的区分。解决方案的关键在于提出了一个名为MedMusNet的新型网络模型,该模型通过利用预测的前列腺癌掩码(masks)来逐层强化学习特征,从而减少噪声影响并提高图像帧间的一致性。MedMusNet在检测临床显著性癌症方面表现优异,Dice相似系数达到0.365,显著优于基线模型Swin-M2F,并且在特异性和准确性上均有提升。尽管在病灶级和患者级分析中未达到统计显著性,但初步结果表明该模型有潜力辅助泌尿科医生在前列腺癌的诊断和治疗决策中。
链接: https://arxiv.org/abs/2412.10997
作者: Lichun Zhang,Steve Ran Zhou,Moon Hyung Choi,Jeong Hoon Lee,Shengtian Sang,Adam Kinnaird,Wayne G. Brisbane,Giovanni Lughezzani,Davide Maffei,Vittorio Fasulo,Patrick Albers,Sulaiman Vesal,Wei Shao,Ahmed N. El Kaffas,Richard E. Fan,Geoffrey A. Sonn,Mirabela Rusu
机构: Stanford University(斯坦福大学); The Catholic University of Korea(韩国天主教大学); University of Alberta(阿尔伯塔大学); University of California Los Angeles(加州大学洛杉矶分校); Humanitas University(Humanitas大学); IRCCS Humanitas Research Hospital(IRCCS Humanitas研究医院); University of Florida(佛罗里达大学); University of California San Diego(加州大学圣地亚哥分校)
关键词: Prostate cancer, deaths among men, cancer-related deaths, cancer, clinically significant cancer
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:
点击查看摘要
Abstract:Prostate cancer is a leading cause of cancer-related deaths among men. The recent development of high frequency, micro-ultrasound imaging offers improved resolution compared to conventional ultrasound and potentially a better ability to differentiate clinically significant cancer from normal tissue. However, the features of prostate cancer remain subtle, with ambiguous borders with normal tissue and large variations in appearance, making it challenging for both machine learning and humans to localize it on micro-ultrasound images. We propose a novel Mask Enhanced Deeply-supervised Micro-US network, termed MedMusNet, to automatically and more accurately segment prostate cancer to be used as potential targets for biopsy procedures. MedMusNet leverages predicted masks of prostate cancer to enforce the learned features layer-wisely within the network, reducing the influence of noise and improving overall consistency across frames. MedMusNet successfully detected 76% of clinically significant cancer with a Dice Similarity Coefficient of 0.365, significantly outperforming the baseline Swin-M2F in specificity and accuracy (Wilcoxon test, Bonferroni correction, p-value < 0.05). While the lesion-level and patient-level analyses showed improved performance compared to human experts and different baselines, the improvements did not reach statistical significance, likely on account of the small cohort. We have presented a novel approach to automatically detect and segment clinically significant prostate cancer on B-mode micro-ultrasound images. Our MedMusNet model outperformed other models, surpassing even human experts. These preliminary results suggest the potential for aiding urologists in prostate cancer diagnosis via biopsy and treatment decision-making.
zh
[CV-283] MorphiNet: A Graph Subdivision Network for Adaptive Bi-ventricle Surface Reconstruction
【速读】: 该论文试图解决心脏磁共振成像 (Cardiac Magnetic Resonance, CMR) 图像在心脏模型重建中的挑战,特别是由于切片间距大和心脏运动导致的图像各向异性问题,这些问题导致数据丢失和测量不准确,进而影响详细解剖结构的捕捉。解决方案的关键在于引入了一种名为 MorphiNet 的新型网络,该网络通过利用与 CMR 图像不配对的高分辨率计算机断层扫描 (Computer Tomography, CT) 图像来学习心脏解剖结构。MorphiNet 将解剖结构编码为梯度场,将模板网格转换为患者特异性几何形状,并通过多层图细分网络进行细化,同时保持密集点对应。这种方法显著提高了解剖结构的保真度,并在 Dice 分数、Hausdorff 距离和表面误差等指标上优于现有最先进的方法,同时具有更高的推理效率。
链接: https://arxiv.org/abs/2412.10985
作者: Yu Deng,Yiyang Xu,Linglong Qian,Charlene Mauger,Anastasia Nasopoulou,Steven Williams,Michelle Williams,Steven Niederer,David Newby,Andrew McCulloch,Jeff Omens,Kuberan Pushprajah,Alistair Young
机构: School of Biomedical Engineering and Imaging Sciences, King’s College London, UK; Department of Biostatistics and Health Informatics, King’s College London; Centre for Cardiovascular Science, University of Edinburgh, UK; Department of Bioengineering, University of California, San Diego; National Heart and Lung Institute (NHLI), Imperial College London, UK
关键词: Cardiac Magnetic Resonance, Magnetic Resonance, visualize soft tissues, Cardiac Magnetic, CMR images
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Cardiac Magnetic Resonance (CMR) imaging is widely used for heart modelling and digital twin computational analysis due to its ability to visualize soft tissues and capture dynamic functions. However, the anisotropic nature of CMR images, characterized by large inter-slice distances and misalignments from cardiac motion, poses significant challenges to accurate model reconstruction. These limitations result in data loss and measurement inaccuracies, hindering the capture of detailed anatomical structures. This study introduces MorphiNet, a novel network that enhances heart model reconstruction by leveraging high-resolution Computer Tomography (CT) images, unpaired with CMR images, to learn heart anatomy. MorphiNet encodes anatomical structures as gradient fields, transforming template meshes into patient-specific geometries. A multi-layer graph subdivision network refines these geometries while maintaining dense point correspondence. The proposed method achieves high anatomy fidelity, demonstrating approximately 40% higher Dice scores, half the Hausdorff distance, and around 3 mm average surface error compared to state-of-the-art methods. MorphiNet delivers superior results with greater inference efficiency. This approach represents a significant advancement in addressing the challenges of CMR-based heart model reconstruction, potentially improving digital twin computational analyses of cardiac structure and functions.
zh
[CV-284] Biological and Radiological Dictionary of Radiomics Features: Addressing Understandable AI Issues in Personalized Prostate Cancer; Dictionary version PM1.0
【速读】: 该论文旨在解决前列腺影像报告和数据系统(PI-RADS)中视觉语义特征与相关风险因素之间的关联问题,并通过创建一个标准化的生物学/放射学放射组学特征(RFs)字典,超越了异常影像发现的局限性。解决方案的关键在于结合多参数前列腺MRI序列(T2加权成像[T2WI]、扩散加权成像[DWI]和表观扩散系数[ADC])与多种特征选择算法(FSAs),如ANOVA F-test、相关系数和Fisher Score,并利用逻辑回归识别关键特征,如T2WI的第90百分位(与癌症风险相关的低强度)、T2WI的方差(病灶异质性)、ADC的形状指标(如最小轴长和表面积与体积比,反映病灶紧凑性)以及ADC的运行熵(纹理一致性)。这种方法的平均准确率达到0.78,优于单一序列方法,并提供了一个通用语言,促进临床专业人员与AI开发者之间的合作,实现可信、可解释的AI,以支持可靠的临床决策。
链接: https://arxiv.org/abs/2412.10967
作者: Mohammad R. Salmanpour,Sajad Amiri,Sara Gharibi,Ahmad Shariftabrizi,Yixi Xu,William B Weeks,Arman Rahmim,Ilker Hacihaliloglu
机构: 未知
关键词: abnormal imaging findings, radiological radiomics features, visual semantic features, predict UCLA scores, moving beyond abnormal
类目: Medical Physics (physics.med-ph); Computer Vision and Pattern Recognition (cs.CV)
备注: 24 pages, 3 Figures, 2 Tables
点击查看摘要
Abstract:This study investigates the connection between visual semantic features in PI-RADS and associated risk factors, moving beyond abnormal imaging findings by creating a standardized dictionary of biological/radiological radiomics features (RFs). Using multiparametric prostate MRI sequences (T2-weighted imaging [T2WI], diffusion-weighted imaging [DWI], and apparent diffusion coefficient [ADC]), six interpretable and seven complex classifiers, combined with nine feature selection algorithms (FSAs), were applied to segmented lesions to predict UCLA scores. Combining T2WI, DWI, and ADC with FSAs such as ANOVA F-test, Correlation Coefficient, and Fisher Score, and utilizing logistic regression, identified key features: the 90th percentile from T2WI (hypo-intensity linked to cancer risk), variance from T2WI (lesion heterogeneity), shape metrics like Least Axis Length and Surface Area to Volume ratio from ADC (lesion compactness), and Run Entropy from ADC (texture consistency). This approach achieved an average accuracy of 0.78, outperforming single-sequence methods (p 0.05). The developed dictionary provides a common language, fostering collaboration between clinical professionals and AI developers to enable trustworthy, interpretable AI for reliable clinical decisions.
zh
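【补充示例】下面用 scikit-learn 给出“ANOVA F-test 特征选择 + 逻辑回归”组合的最小流程示意,对应论文所用 FSA 与可解释分类器的接法;放射组学特征与标签均为随机生成的占位数据,k 值与交叉验证设置亦为假设。

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# 占位数据:每行为一个病灶的放射组学特征(可来自 T2WI/DWI/ADC),标签为风险分级(二分类示意)
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 100))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=200) > 0).astype(int)

# ANOVA F-test 选出 top-k 特征,再交给可解释的逻辑回归
pipe = make_pipeline(SelectKBest(f_classif, k=10), LogisticRegression(max_iter=1000))
acc = cross_val_score(pipe, X, y, cv=5, scoring="accuracy")
print(acc.mean())   # 入选特征的回归系数可用于解读其与风险评分的关联
```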
[CV-285] Integrating Generative and Physics-Based Models for Ptychographic Imaging with Uncertainty Quantification NEURIPS2024
【速读】: 该论文试图解决传统叠层成像(ptychography)技术中,由于迭代图像重建方法需要大量相邻扫描位置的重叠,导致数据量大和采集时间长的问题。解决方案的关键在于提出了一种基于贝叶斯反演的方法,该方法即使在相邻扫描位置重叠较少的情况下也能有效工作,并且能够量化叠层成像对象的固有不确定性。具体而言,该方法首先利用深度生成模型学习对象的先验分布,然后通过马尔可夫链蒙特卡罗(Markov Chain Monte Carlo)算法从对象的后验分布中生成样本。实验结果表明,该框架在减少重叠的情况下仍能优于广泛使用的迭代重建算法,并能提供与真实误差密切相关的不确定性估计。
链接: https://arxiv.org/abs/2412.10882
作者: Canberk Ekmekci,Tekin Bicer,Zichao Wendy Di,Junjing Deng,Mujdat Cetin
机构: University of Rochester(罗切斯特大学); Argonne National Laboratory(阿贡国家实验室)
关键词: diffractive imaging technique, enables imaging nanometer-scale, imaging nanometer-scale features, coherent diffractive imaging, scanning coherent diffractive
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Machine Learning (stat.ML)
备注: Machine Learning and the Physical Sciences Workshop at NeurIPS 2024, 7 pages, 4 figures
点击查看摘要
Abstract:Ptychography is a scanning coherent diffractive imaging technique that enables imaging nanometer-scale features in extended samples. One main challenge is that widely used iterative image reconstruction methods often require significant amount of overlap between adjacent scan locations, leading to large data volumes and prolonged acquisition times. To address this key limitation, this paper proposes a Bayesian inversion method for ptychography that performs effectively even with less overlap between neighboring scan locations. Furthermore, the proposed method can quantify the inherent uncertainty on the ptychographic object, which is created by the ill-posed nature of the ptychographic inverse problem. At a high level, the proposed method first utilizes a deep generative model to learn the prior distribution of the object and then generates samples from the posterior distribution of the object by using a Markov Chain Monte Carlo algorithm. Our results from simulated ptychography experiments show that the proposed framework can consistently outperform a widely used iterative reconstruction algorithm in cases of reduced overlap. Moreover, the proposed framework can provide uncertainty estimates that closely correlate with the true error, which is not available in practice. The project website is available here.
zh
[CV-286] Generative AI: A Pix2pix-GAN-Based Machine Learning Approach for Robust and Efficient Lung Segmentation
【速读】: 该论文旨在解决胸部X光片(CXR)中肺部异常的自动、准确和高效分割问题,以减少放射科医生的工作负担并提高早期疾病检测的效率。解决方案的关键在于采用基于Pix2pix生成对抗网络(GAN)的深度学习框架,结合U-Net架构的生成器和判别器,通过对抗损失和L1距离优化模型。该框架通过预处理和数据增强技术,利用Montgomery和Shenzhen数据集进行训练和测试,验证了其在肺部异常分割中的有效性,为未来的临床应用研究奠定了基础。
链接: https://arxiv.org/abs/2412.10826
作者: Sharmin Akter
机构: 未知
关键词: lead to misdiagnoses, Chest radiography, radiography is climacteric, climacteric in identifying, radiologist workload
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注: 6 pages, 12 figures, 2 tables
点击查看摘要
Abstract:Chest radiography is climacteric in identifying different pulmonary diseases, yet radiologist workload and inefficiency can lead to misdiagnoses. Automatic, accurate, and efficient segmentation of lung from X-ray images of chest is paramount for early disease detection. This study develops a deep learning framework using a Pix2pix Generative Adversarial Network (GAN) to segment pulmonary abnormalities from CXR images. This framework’s image preprocessing and augmentation techniques were properly incorporated with a U-Net-inspired generator-discriminator architecture. Initially, it loaded the CXR images and manual masks from the Montgomery and Shenzhen datasets, after which preprocessing and resizing were performed. A U-Net generator is applied to the processed CXR images that yield segmented masks; then, a Discriminator Network differentiates between the generated and real masks. Montgomery dataset served as the model’s training set in the study, and the Shenzhen dataset was used to test its robustness, which was used here for the first time. An adversarial loss and an L1 distance were used to optimize the model in training. All metrics, which assess precision, recall, F1 score, and Dice coefficient, prove the effectiveness of this framework in pulmonary abnormality segmentation. It, therefore, sets the basis for future studies to be performed shortly using diverse datasets that could further confirm its clinical applicability in medical imaging.
zh
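【补充示例】以下为 Pix2pix 式条件 GAN 训练目标(对抗损失 + L1 距离)的极简 PyTorch 示意:判别器对(输入图, 掩码)对进行真假判别,生成器同时优化对抗项与 L1 项。这里的生成器和判别器是很小的占位网络,并非论文中的 U-Net 风格生成器与完整判别器;λ 取 100 只是 Pix2pix 的常见默认值。

```python
import torch
import torch.nn as nn

# 占位网络:实际为 U-Net 风格生成器与更深的判别器
G = nn.Sequential(nn.Conv2d(1, 8, 3, padding=1), nn.ReLU(),
                  nn.Conv2d(8, 1, 3, padding=1), nn.Sigmoid())
D = nn.Sequential(nn.Conv2d(2, 8, 3, padding=1), nn.ReLU(),
                  nn.Conv2d(8, 1, 3, padding=1))

bce = nn.BCEWithLogitsLoss()
l1 = nn.L1Loss()
lam = 100.0  # L1 项权重(Pix2pix 常用值)

x = torch.rand(4, 1, 64, 64)                  # 占位 CXR 输入
m = (torch.rand(4, 1, 64, 64) > 0.5).float()  # 占位的人工肺部掩码

fake = G(x)
# 判别器:区分 (x, 真实掩码) 与 (x, 生成掩码)
d_real = D(torch.cat([x, m], dim=1))
d_fake = D(torch.cat([x, fake.detach()], dim=1))
loss_D = bce(d_real, torch.ones_like(d_real)) + bce(d_fake, torch.zeros_like(d_fake))
# 生成器:对抗项 + L1 距离
d_fake_for_g = D(torch.cat([x, fake], dim=1))
loss_G = bce(d_fake_for_g, torch.ones_like(d_fake_for_g)) + lam * l1(fake, m)
print(loss_D.item(), loss_G.item())
```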
[CV-287] Boosting ViT-based MRI Reconstruction from the Perspectives of Frequency Modulation Spatial Purification and Scale Diversification
【速读】: 该论文试图解决加速 MRI 重建过程中由于 k-空间欠采样导致的病态逆问题,特别是 Vision Transformers (ViTs) 在处理高频图像成分、计算多尺度信息以及减少计算负担方面的不足。解决方案的关键在于提出 FPS-Former 框架,通过频率调制注意力模块(frequency modulation attention module)增强自注意力图的高频信息捕捉能力,空间净化注意力模块(spatial purification attention module)减少无关特征的干扰,以及基于混合尺度融合策略的高效前馈网络(feed-forward network)来处理多尺度信息。这些创新显著提升了 MRI 重建的性能,同时降低了计算成本。
链接: https://arxiv.org/abs/2412.10776
作者: Yucong Meng,Zhiwei Yang,Yonghong Shi,Zhijian Song
机构: 未知
关键词: reconstruction process presents, challenging ill-posed inverse, ill-posed inverse problem, inverse problem due, accelerated MRI reconstruction
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:The accelerated MRI reconstruction process presents a challenging ill-posed inverse problem due to the extensive under-sampling in k-space. Recently, Vision Transformers (ViTs) have become the mainstream for this task, demonstrating substantial performance improvements. However, three significant issues remain unaddressed: (1) ViTs struggle to capture high-frequency components of images, limiting their ability to detect local textures and edge information, thereby impeding MRI restoration; (2) Previous methods calculate multi-head self-attention (MSA) among both related and unrelated tokens in content, introducing noise and significantly increasing computational burden; (3) The naive feed-forward network in ViTs cannot model the multi-scale information that is important for image restoration. In this paper, we propose FPS-Former, a powerful ViT-based framework, to address these issues from the perspectives of frequency modulation, spatial purification, and scale diversification. Specifically, for issue (1), we introduce a frequency modulation attention module to enhance the self-attention map by adaptively re-calibrating the frequency information in a Laplacian pyramid. For issue (2), we customize a spatial purification attention module to capture interactions among closely related tokens, thereby reducing redundant or irrelevant feature representations. For issue (3), we propose an efficient feed-forward network based on a hybrid-scale fusion strategy. Comprehensive experiments conducted on three public datasets show that our FPS-Former outperforms state-of-the-art methods while requiring lower computational costs.
zh
[CV-288] Rapid Reconstruction of Extremely Accelerated Liver 4D MRI via Chained Iterative Refinement
【速读】: 该论文试图解决高质量4D MRI在密集k空间信号采集中需要过长扫描时间的问题,提出了一种高效的稀疏采样重建方法,同时保持临床可部署的图像质量。解决方案的关键是提出了链式迭代重建网络(CIRNet),该网络采用去噪扩散概率框架,通过随机迭代去噪过程来条件化图像重建。在训练阶段,设计了一个前向马尔可夫扩散过程逐步向密集采样的真实值(GT)添加高斯噪声,而CIRNet则被优化以迭代地逆转该过程。在推理阶段,CIRNet仅执行反向过程以从噪声中恢复信号,并根据欠采样输入进行条件化。该方法在4D数据(3D+t)上处理为时间切片(2D+t),并在48名患者的自由呼吸肝脏4D MRI数据集上进行了评估,结果表明CIRNet在加速比高达30倍的情况下仍能保持可用的图像质量,显著减少了4D MRI的负担。
链接: https://arxiv.org/abs/2412.10629
作者: Di Xu,Xin Miao,Hengjie Liu,Jessica E. Scholey,Wensha Yang,Mary Feng,Michael Ohliger,Hui Lin,Yi Lao,Yang Yang,Ke Sheng
机构: 未知
关键词: Abstract Purpose, impractically long scanning, dense k-space signal, k-space signal acquisition, signal acquisition covering
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Purpose: High-quality 4D MRI requires an impractically long scanning time for dense k-space signal acquisition covering all respiratory phases. Accelerated sparse sampling followed by reconstruction enhancement is desired but often results in degraded image quality and long reconstruction time. We hereby propose the chained iterative reconstruction network (CIRNet) for efficient sparse-sampling reconstruction while maintaining clinically deployable quality. Methods: CIRNet adopts the denoising diffusion probabilistic framework to condition the image reconstruction through a stochastic iterative denoising process. During training, a forward Markovian diffusion process is designed to gradually add Gaussian noise to the densely sampled ground truth (GT), while CIRNet is optimized to iteratively reverse the Markovian process from the forward outputs. At the inference stage, CIRNet performs the reverse process solely to recover signals from noise, conditioned upon the undersampled input. CIRNet processed the 4D data (3D+t) as temporal slices (2D+t). The proposed framework is evaluated on a data cohort consisting of 48 patients (12332 temporal slices) who underwent free-breathing liver 4D MRI. 3-, 6-, 10-, 20- and 30-times acceleration were examined with a retrospective random undersampling scheme. Compressed sensing (CS) reconstruction with a spatiotemporal constraint and a recently proposed deep network, Re-Con-GAN, are selected as baselines. Results: CIRNet consistently achieved superior performance compared to CS and Re-Con-GAN. The inference times of CIRNet, CS, and Re-Con-GAN are 11s, 120s, and 0.15s, respectively. Conclusion: A novel framework, CIRNet, is presented. CIRNet maintains useable image quality for acceleration up to 30 times, significantly reducing the burden of 4D MRI.
zh
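【补充示例】下面给出去噪扩散概率模型(DDPM)前向加噪过程闭式采样的最小 PyTorch 示意:训练时 CIRNet 这类网络学习逆转该过程,推理时再以欠采样输入为条件从噪声中逐步恢复信号。噪声日程、步数与数据尺寸均为占位假设。

```python
import torch

T = 1000
betas = torch.linspace(1e-4, 0.02, T)               # 线性噪声日程(常见默认设置)
alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)  # 累积 alpha_bar_t

def q_sample(x0: torch.Tensor, t: int) -> torch.Tensor:
    """前向马尔可夫扩散的闭式采样:x_t = sqrt(a_bar)*x0 + sqrt(1-a_bar)*noise。"""
    a_bar = alphas_cumprod[t]
    noise = torch.randn_like(x0)
    return a_bar.sqrt() * x0 + (1.0 - a_bar).sqrt() * noise

# 占位数据:密集采样的 2D+t 时间切片作为训练真值
x0 = torch.rand(1, 1, 64, 64)
x_t = q_sample(x0, t=500)
print(x_t.mean().item(), x_t.std().item())
```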
[CV-289] Predictive Pattern Recognition Techniques Towards Spatiotemporal Representation of Plant Growth in Simulated and Controlled Environments: A Comprehensive Review
【速读】: 该论文试图解决植物表型学研究中对植物生长模式在模拟和受控环境下的精确预测和表示问题。解决方案的关键在于采用先进的预测模式识别技术,特别是时空建模(spatiotemporal modeling)和动态环境交互的整合。论文综述了确定性、概率性和生成式建模方法,强调其在高通量表型分析和基于模拟的植物生长预测中的应用。关键点包括回归和基于神经网络的表示模型、现有实验性确定性方法的局限性,以及需要动态框架来整合不确定性和环境反馈的演变。此外,论文还探讨了通过功能-结构植物模型和条件生成模型对2D和3D结构数据的表示,并提出了未来工作的机会,如将领域知识与数据驱动方法结合、改进可用数据集以及将这些技术应用于实际场景。
链接: https://arxiv.org/abs/2412.10538
作者: Mohamed Debbagh,Shangpeng Sun,Mark Lefsrud
机构: McGill University(麦吉尔大学)
关键词: plant phenomics research, Accurate predictions, plant growth patterns, phenomics research, simulated and controlled
类目: Quantitative Methods (q-bio.QM); Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Accurate predictions and representations of plant growth patterns in simulated and controlled environments are important for addressing various challenges in plant phenomics research. This review explores various works on state-of-the-art predictive pattern recognition techniques, focusing on the spatiotemporal modeling of plant traits and the integration of dynamic environmental interactions. We provide a comprehensive examination of deterministic, probabilistic, and generative modeling approaches, emphasizing their applications in high-throughput phenotyping and simulation-based plant growth forecasting. Key topics include regressions and neural network-based representation models for the task of forecasting, limitations of existing experiment-based deterministic approaches, and the need for dynamic frameworks that incorporate uncertainty and evolving environmental feedback. This review surveys advances in 2D and 3D structured data representations through functional-structural plant models and conditional generative models. We offer a perspective on opportunities for future works, emphasizing the integration of domain-specific knowledge to data-driven methods, improvements to available datasets, and the implementation of these techniques toward real-world applications.
zh
[CV-290] Structurally Consistent MRI Colorization using Cross-modal Fusion Learning
【速读】: 该论文试图解决医学图像着色问题,旨在将冷冻切片数据(Cryosection data)中的颜色信息转移到源MRI数据上,同时保持MRI的结构完整性。解决方案的关键在于提出了一种新颖的架构,通过融合冷冻切片图像的分割语义(segmentation semantics)来实现MRI图像中不同器官的稳定上下文着色。该架构不需要MRI与冷冻切片图像之间的精确配准(registration),也不需要对MRI图像进行分割。此外,通过引入特征压缩与激活机制(feature compression-and-activation mechanism),该架构能够捕捉器官级别的全局信息并抑制噪声,从而实现更准确和真实的器官特定着色。实验结果表明,该方法在定量和定性上都优于现有方法。
链接: https://arxiv.org/abs/2412.10452
作者: Mayuri Mathur,Anav Chaudhary,Saurabh Kumar Gupta,Ojaswa Sharma
机构: 未知
关键词: underlying imaging modality, Medical image colorization, MRI, source MRI data, Medical image
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注: 9 pages, 6 figures, 2 Tables
点击查看摘要
Abstract:Medical image colorization can greatly enhance the interpretability of the underlying imaging modality and provide insights into human anatomy. The objective of medical image colorization is to transfer a diverse spectrum of colors distributed across human anatomy from Cryosection data to source MRI data while retaining the structures of the MRI. To achieve this, we propose a novel architecture for structurally consistent color transfer to the source MRI data. Our architecture fuses segmentation semantics of Cryosection images for stable contextual colorization of various organs in MRI images. For colorization, we neither require precise registration between MRI and Cryosection images, nor segmentation of MRI images. Additionally, our architecture incorporates a feature compression-and-activation mechanism to capture organ-level global information and suppress noise, enabling the distinction of organ-specific data in MRI scans for more accurate and realistic organ-specific colorization. Our experiments demonstrate that our architecture surpasses the existing methods and yields better quantitative and qualitative results.
zh
[CV-291] Computational Methods for Breast Cancer Molecular Profiling through Routine Histopathology: A Review
【速读】: 该论文试图解决的问题是如何通过人工智能(AI)技术从常规的苏木精-伊红(HE)染色病理图像中提取多组学(omic)生物标志物,以支持乳腺癌的精准医疗。解决方案的关键在于利用AI驱动的数字病理学技术,分析病理图像中的分子和多组学生物标志物,从而实现无需昂贵分子检测的个性化治疗决策。这一方法不仅能够发现新的生物标志物,还为临床应用提供了更高效、经济的诊断和治疗方案。
链接: https://arxiv.org/abs/2412.10392
作者: Suchithra Kunhoth,Somaya Al-Maadeed,Younes Akbari,Rafif Al Saady
机构: Qatar University (卡塔尔大学)
关键词: Precision medicine, breast cancer management, advancing beyond conventional, individualized therapies, conventional methods
类目: Quantitative Methods (q-bio.QM); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:
点击查看摘要
Abstract:Precision medicine has become a central focus in breast cancer management, advancing beyond conventional methods to deliver more precise and individualized therapies. Traditionally, histopathology images have been used primarily for diagnostic purposes; however, they are now recognized for their potential in molecular profiling, which provides deeper insights into cancer prognosis and treatment response. Recent advancements in artificial intelligence (AI) have enabled digital pathology to analyze histopathologic images for both targeted molecular and broader omic biomarkers, marking a pivotal step in personalized cancer care. These technologies offer the capability to extract various biomarkers such as genomic, transcriptomic, proteomic, and metabolomic markers directly from the routine hematoxylin and eosin (HE) stained images, which can support treatment decisions without the need for costly molecular assays. In this work, we provide a comprehensive review of AI-driven techniques for biomarker detection, with a focus on diverse omic biomarkers that allow novel biomarker discovery. Additionally, we analyze the major challenges faced in this field for robust algorithm development. These challenges highlight areas where further research is essential to bridge the gap between AI research and clinical application.
zh
人工智能
[AI-0] MaxInfoRL: Boosting exploration in reinforcement learning through information gain maximization
链接: https://arxiv.org/abs/2412.12098
作者: Bhavya Sukhija,Stelian Coros,Andreas Krause,Pieter Abbeel,Carmelo Sferrazza
关键词: Reinforcement learning, aim to balance, balance exploiting, exploiting the current, current best strategy
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Robotics (cs.RO)
*备注:
点击查看摘要
Abstract:Reinforcement learning (RL) algorithms aim to balance exploiting the current best strategy with exploring new options that could lead to higher rewards. Most common RL algorithms use undirected exploration, i.e., select random sequences of actions. Exploration can also be directed using intrinsic rewards, such as curiosity or model epistemic uncertainty. However, effectively balancing task and intrinsic rewards is challenging and often task-dependent. In this work, we introduce a framework, MaxInfoRL, for balancing intrinsic and extrinsic exploration. MaxInfoRL steers exploration towards informative transitions, by maximizing intrinsic rewards such as the information gain about the underlying task. When combined with Boltzmann exploration, this approach naturally trades off maximization of the value function with that of the entropy over states, rewards, and actions. We show that our approach achieves sublinear regret in the simplified setting of multi-armed bandits. We then apply this general formulation to a variety of off-policy model-free RL methods for continuous state-action spaces, yielding novel algorithms that achieve superior performance across hard exploration problems and complex scenarios such as visual control tasks.
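To make the intrinsic/extrinsic trade-off concrete, the sketch below folds a scaled information-gain bonus into a Boltzmann (softmax) action rule, using ensemble disagreement as a crude proxy for information gain. The proxy, the coefficients, and the discrete-action setting are illustrative assumptions rather than the MaxInfoRL algorithm itself:

```python
import numpy as np

def boltzmann_action(q_task, q_info, beta=1.0, temperature=0.5):
    """Sample an action from a softmax over task value plus a scaled intrinsic value."""
    logits = (q_task + beta * q_info) / temperature
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    return np.random.choice(len(probs), p=probs)

def disagreement_bonus(ensemble_next_state_preds):
    """Rough information-gain proxy: disagreement of an ensemble of dynamics models
    about the next state (high variance = informative transition)."""
    preds = np.stack(ensemble_next_state_preds)   # (n_models, state_dim)
    return float(preds.var(axis=0).mean())
```

In the paper this idea is instantiated on top of off-policy actor-critic methods for continuous control; the sketch only shows how the two value terms and the temperature interact.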
[AI-1] Revelations: A Decidable Class of POMDPs with Omega-Regular Objectives AAAI2025
链接: https://arxiv.org/abs/2412.12063
作者: Marius Belly,Nathanaël Fijalkow,Hugo Gimbert,Florian Horn,Guillermo A. Pérez,Pierre Vandenhove
关键词: Partially observable Markov, Partially observable, sequential decision making, Markov decision processes, observable Markov decision
类目: Artificial Intelligence (cs.AI); Logic in Computer Science (cs.LO); Systems and Control (eess.SY)
*备注: Extended version of paper accepted to AAAI 2025. 26 pages, 10 figures
点击查看摘要
Abstract:Partially observable Markov decision processes (POMDPs) form a prominent model for uncertainty in sequential decision making. We are interested in constructing algorithms with theoretical guarantees to determine whether the agent has a strategy ensuring a given specification with probability 1. This well-studied problem is known to be undecidable already for very simple omega-regular objectives, because of the difficulty of reasoning on uncertain events. We introduce a revelation mechanism which restricts information loss by requiring that almost surely the agent has eventually full information of the current state. Our main technical results are to construct exact algorithms for two classes of POMDPs called weakly and strongly revealing. Importantly, the decidable cases reduce to the analysis of a finite belief-support Markov decision process. This yields a conceptually simple and exact algorithm for a large class of POMDPs.
[AI-2] Artificial Intelligence in Traffic Systems
链接: https://arxiv.org/abs/2412.12046
作者: Ritwik Raj Saxena
关键词: deep neural networks, traffic management systems, Existing research, traffic management, AI-based traffic management
类目: Artificial Intelligence (cs.AI)
*备注: 35 pages, 17343 words, 6 figures
点击查看摘要
Abstract:Existing research on AI-based traffic management systems, utilizing techniques such as fuzzy logic, reinforcement learning, deep neural networks, and evolutionary algorithms, demonstrates the potential of AI to transform the traffic landscape. This article endeavors to review the topics where AI and traffic management intersect. It comprises areas like AI-powered traffic signal control systems, automatic distance and velocity recognition (for instance, in autonomous vehicles, hereafter AVs), smart parking systems, and Intelligent Traffic Management Systems (ITMS), which use data captured in real-time to keep track of traffic conditions, and traffic-related law enforcement and surveillance using AI. AI applications in traffic management cover a wide range of spheres. The spheres comprise, inter alia, streamlining traffic signal timings, predicting traffic bottlenecks in specific areas, detecting potential accidents and road hazards, managing incidents accurately, advancing public transportation systems, development of innovative driver assistance systems, and minimizing environmental impact through simplified routes and reduced emissions. The benefits of AI in traffic management are also diverse. They comprise improved management of traffic data, sounder route decision automation, easier and speedier identification and resolution of vehicular issues through monitoring the condition of individual vehicles, decreased traffic snarls and mishaps, superior resource utilization, alleviated stress of traffic management manpower, greater on-road safety, and better emergency response time.
[AI-3] The Impact of AI Assistance on Radiology Reporting: A Pilot Study Using Simulated AI Draft Reports
链接: https://arxiv.org/abs/2412.12042
作者: Julián N. Acosta,Siddhant Dogra,Subathra Adithan,Kay Wu,Michael Moritz,Stephen Kwak,Pranav Rajpurkar
关键词: growing imaging volumes, pressures amid growing, amid growing imaging, Radiologists face increasing, face increasing workload
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:Radiologists face increasing workload pressures amid growing imaging volumes, creating risks of burnout and delayed reporting times. While artificial intelligence (AI) based automated radiology report generation shows promise for reporting workflow optimization, evidence of its real-world impact on clinical accuracy and efficiency remains limited. This study evaluated the effect of draft reports on radiology reporting workflows by conducting a three reader multi-case study comparing standard versus AI-assisted reporting workflows. In both workflows, radiologists reviewed the cases and modified either a standard template (standard workflow) or an AI-generated draft report (AI-assisted workflow) to create the final report. For controlled evaluation, we used GPT-4 to generate simulated AI drafts and deliberately introduced 1-3 errors in half the cases to mimic real AI system performance. The AI-assisted workflow significantly reduced average reporting time from 573 to 435 seconds (p=0.003), without a statistically significant difference in clinically significant errors between workflows. These findings suggest that AI-generated drafts can meaningfully accelerate radiology reporting while maintaining diagnostic accuracy, offering a practical solution to address mounting workload challenges in clinical practice.
[AI-4] Learning to Navigate in Mazes with Novel Layouts using Abstract Top-down Maps
链接: https://arxiv.org/abs/2412.12024
作者: Linfeng Zhao,Lawson L.S. Wong
关键词: challenges in decision-making, major challenges, Learning navigation capabilities, navigation capabilities, Learning
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Robotics (cs.RO)
*备注: Published at Reinforcement Learning Conference (RLC) 2024. Website: this http URL
点击查看摘要
Abstract:Learning navigation capabilities in different environments has long been one of the major challenges in decision-making. In this work, we focus on zero-shot navigation ability using given abstract 2-D top-down maps. Like human navigation by reading a paper map, the agent reads the map as an image when navigating in a novel layout, after learning to navigate on a set of training maps. We propose a model-based reinforcement learning approach for this multi-task learning problem, where it jointly learns a hypermodel that takes top-down maps as input and predicts the weights of the transition network. We use the DeepMind Lab environment and customize layouts using generated maps. Our method can adapt better to novel environments in zero-shot and is more robust to noise.
[AI-5] Agentic AI-Driven Technical Troubleshooting for Enterprise Systems: A Novel Weighted Retrieval-Augmented Generation Paradigm
链接: https://arxiv.org/abs/2412.12006
作者: Rajat Khanda
关键词: involves navigating diverse, complex issues effectively, resolve complex issues, heterogeneous data sources, navigating diverse
类目: Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:Technical troubleshooting in enterprise environments often involves navigating diverse, heterogeneous data sources to resolve complex issues effectively. This paper presents a novel agentic AI solution built on a Weighted Retrieval-Augmented Generation (RAG) Framework tailored for enterprise technical troubleshooting. By dynamically weighting retrieval sources such as product manuals, internal knowledge bases, FAQs, and troubleshooting guides based on query context, the framework prioritizes the most relevant data. For instance, it gives precedence to product manuals for SKU-specific queries while incorporating general FAQs for broader issues. The system employs FAISS for efficient dense vector search, coupled with a dynamic aggregation mechanism to seamlessly integrate results from multiple sources. A Llama-based self-evaluator ensures the contextual accuracy and confidence of the generated responses before delivering them. This iterative cycle of retrieval and validation enhances precision, diversity, and reliability in response generation. Preliminary evaluations on large enterprise datasets demonstrate the framework’s efficacy in improving troubleshooting accuracy, reducing resolution times, and adapting to varied technical challenges. Future research aims to enhance the framework by integrating advanced conversational AI capabilities, enabling more interactive and intuitive troubleshooting experiences. Efforts will also focus on refining the dynamic weighting mechanism through reinforcement learning to further optimize the relevance and precision of retrieved information. By incorporating these advancements, the proposed framework is poised to evolve into a comprehensive, autonomous AI solution, redefining technical service workflows across enterprise settings.
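A rough sketch of the weighted retrieval and aggregation step using FAISS dense search; the source names, the per-query weights, and the aggregation-by-merging rule are illustrative assumptions, not the paper's exact framework:

```python
import numpy as np
import faiss

DIM = 384  # embedding dimension (assumed)

def build_index(doc_embeddings):
    index = faiss.IndexFlatIP(DIM)               # inner-product search over normalized vectors
    index.add(doc_embeddings.astype("float32"))
    return index

def weighted_retrieve(query_vec, sources, k=5):
    """sources: dict name -> (faiss index, list of documents, weight chosen for this query)."""
    q = query_vec.astype("float32").reshape(1, -1)
    candidates = []
    for name, (index, docs, weight) in sources.items():
        scores, ids = index.search(q, k)
        for score, i in zip(scores[0], ids[0]):
            if i != -1:
                candidates.append((weight * float(score), name, docs[i]))
    # dynamic aggregation: merge all sources and keep the globally best-weighted passages
    candidates.sort(key=lambda c: c[0], reverse=True)
    return candidates[:k]
```

For an SKU-specific query the caller would assign a high weight to the product-manual source and a low one to the general FAQ index, mirroring the prioritization described in the abstract.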
[AI-6] CP-Guard: Malicious Agent Detection and Defense in Collaborative Bird's Eye View Perception AAAI'25
链接: https://arxiv.org/abs/2412.12000
作者: Senkang Hu,Yihang Tao,Guowen Xu,Yiqin Deng,Xianhao Chen,Yuguang Fang,Sam Kwong
关键词: ego CAV, ego CAV perception, autonomous driving, autonomous vehicles, shown a promising
类目: Artificial Intelligence (cs.AI)
*备注: Accepted by AAAI’25
点击查看摘要
Abstract:Collaborative Perception (CP) has shown a promising technique for autonomous driving, where multiple connected and autonomous vehicles (CAVs) share their perception information to enhance the overall perception performance and expand the perception range. However, in CP, ego CAV needs to receive messages from its collaborators, which makes it easy to be attacked by malicious agents. For example, a malicious agent can send harmful information to the ego CAV to mislead it. To address this critical issue, we propose a novel method, \textbfCP-Guard, a tailored defense mechanism for CP that can be deployed by each agent to accurately detect and eliminate malicious agents in its collaboration network. Our key idea is to enable CP to reach a consensus rather than a conflict against the ego CAV’s perception results. Based on this idea, we first develop a probability-agnostic sample consensus (PASAC) method to effectively sample a subset of the collaborators and verify the consensus without prior probabilities of malicious agents. Furthermore, we define a collaborative consistency loss (CCLoss) to capture the discrepancy between the ego CAV and its collaborators, which is used as a verification criterion for consensus. Finally, we conduct extensive experiments in collaborative bird’s eye view (BEV) tasks and our results demonstrate the effectiveness of our CP-Guard.
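A toy version of the consensus idea behind PASAC: repeatedly sample subsets of collaborators, fuse their messages with the ego result, and flag collaborators that keep breaking consensus. The consistency measure, thresholds, and voting rule are illustrative assumptions, not the paper's PASAC/CCLoss formulation:

```python
import random

def consensus_filter(ego_result, collab_results, fuse, distance,
                     subset_size=3, rounds=20, tol=0.5):
    """Flag collaborators whose messages repeatedly conflict with the ego CAV's perception."""
    suspicion = {cid: 0 for cid in collab_results}
    ids = list(collab_results)
    for _ in range(rounds):
        subset = random.sample(ids, min(subset_size, len(ids)))
        fused = fuse([collab_results[c] for c in subset] + [ego_result])
        if distance(fused, ego_result) > tol:     # conflict rather than consensus
            for c in subset:
                suspicion[c] += 1
    return [cid for cid, count in suspicion.items() if count > rounds // 2]
```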
[AI-7] Combining Large Language Models with Tutoring System Intelligence: A Case Study in Caregiver Homework Support
链接: https://arxiv.org/abs/2412.11995
作者: Devika Venugopalan,Ziwen Yan,Conrad Borchers,Jionghao Lin,Vincent Aleven
关键词: child caring community, parents and members, caring community, underappreciated stakeholders, learning analytics
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
*备注: Full research paper accepted to Learning Analytics and Knowledge (LAK 2025)
点击查看摘要
Abstract:Caregivers (i.e., parents and members of a child’s caring community) are underappreciated stakeholders in learning analytics. Although caregiver involvement can enhance student academic outcomes, many obstacles hinder involvement, most notably knowledge gaps with respect to modern school curricula. An emerging topic of interest in learning analytics is hybrid tutoring, which includes instructional and motivational support. Caregivers assert similar roles in homework, yet it is unknown how learning analytics can support them. Our past work with caregivers suggested that conversational support is a promising method of providing caregivers with the guidance needed to effectively support student learning. We developed a system that provides instructional support to caregivers through conversational recommendations generated by a Large Language Model (LLM). Addressing known instructional limitations of LLMs, we use instructional intelligence from tutoring systems while conducting prompt engineering experiments with the open-source Llama 3 LLM. This LLM generated message recommendations for caregivers supporting their child’s math practice via chat. Few-shot prompting and combining real-time problem-solving context from tutoring systems with examples of tutoring practices yielded desirable message recommendations. These recommendations were evaluated with ten middle school caregivers, who valued recommendations facilitating content-level support and student metacognition through self-explanation. We contribute insights into how tutoring systems can best be merged with LLMs to support hybrid tutoring settings through conversational assistance, facilitating effective caregiver involvement in tutoring systems.
[AI-8] Fairness Shields: Safeguarding against Biased Decision Makers AAAI2025
链接: https://arxiv.org/abs/2412.11994
作者: Filip Cano,Thomas A. Henzinger,Bettina Könighofer,Konstantin Kueffner,Kaushik Mallik
关键词: influence human lives, people sensitive attributes, increasingly influence human, AI-based decision-makers increasingly, decision-makers increasingly influence
类目: Artificial Intelligence (cs.AI)
*备注: To appear in AAAI 2025
点击查看摘要
Abstract:As AI-based decision-makers increasingly influence human lives, it is a growing concern that their decisions are often unfair or biased with respect to people’s sensitive attributes, such as gender and race. Most existing bias prevention measures provide probabilistic fairness guarantees in the long run, and it is possible that the decisions are biased on specific instances of short decision sequences. We introduce fairness shielding, where a symbolic decision-maker – the fairness shield – continuously monitors the sequence of decisions of another deployed black-box decision-maker, and makes interventions so that a given fairness criterion is met while the total intervention costs are minimized. We present four different algorithms for computing fairness shields, among which one guarantees fairness over fixed horizons, and three guarantee fairness periodically after fixed intervals. Given a distribution over future decisions and their intervention costs, our algorithms solve different instances of bounded-horizon optimal control problems with different levels of computational costs and optimality guarantees. Our empirical evaluation demonstrates the effectiveness of these shields in ensuring fairness while maintaining cost efficiency across various scenarios.
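A greatly simplified illustration of the shielding loop: monitor a black-box decision maker over a sliding window and override a decision when a demographic-parity-style gap would otherwise exceed a threshold. The criterion, threshold, and greedy override rule are illustrative assumptions; the paper instead solves bounded-horizon optimal control problems that also minimize intervention cost:

```python
class FairnessShield:
    """Greedy toy shield: track per-group acceptance rates over a window and flip
    the black-box decision when the parity gap would grow too large."""

    def __init__(self, horizon=100, max_gap=0.1):
        self.horizon, self.max_gap = horizon, max_gap
        self.history = []                          # list of (group, final decision in {0, 1})

    def _rate(self, group, extra=None):
        decisions = [d for g, d in self.history if g == group]
        if extra is not None:
            decisions = decisions + [extra]
        return sum(decisions) / len(decisions) if decisions else 0.0

    def intervene(self, group, proposed, other_group):
        gap = abs(self._rate(group, extra=proposed) - self._rate(other_group))
        final = proposed if gap <= self.max_gap else 1 - proposed
        self.history.append((group, final))
        if len(self.history) > self.horizon:
            self.history.pop(0)
        return final
```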
[AI-9] Cost-Effective Label-free Node Classification with LLMs
链接: https://arxiv.org/abs/2412.11983
作者: Taiyan Zhang,Renchi Yang,Mingyu Yan,Xiaochun Ye,Dongrui Fan,Yurui Lai
关键词: Graph neural networks, fusing graph structures, graph data due, neural networks, structures and attributes
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: 15 pages, 5 figures
点击查看摘要
Abstract:Graph neural networks (GNNs) have emerged as go-to models for node classification in graph data due to their powerful abilities in fusing graph structures and attributes. However, such models strongly rely on adequate high-quality labeled data for training, which are expensive to acquire in practice. With the advent of large language models (LLMs), a promising way is to leverage their superb zero-shot capabilities and massive knowledge for node labeling. Despite promising results reported, this methodology either demands considerable queries to LLMs, or suffers from compromised performance caused by noisy labels produced by LLMs. To remedy these issues, this work presents Cella, an active self-training framework that integrates LLMs into GNNs in a cost-effective manner. The design recipe of Cella is to iteratively identify small sets of “critical” samples using GNNs and extract informative pseudo-labels for them with both LLMs and GNNs as additional supervision signals to enhance model training. Particularly, Cella includes three major components: (i) an effective active node selection strategy for initial annotations; (ii) a judicious sample selection scheme to sift out the “critical” nodes based on label disharmonicity and entropy; and (iii) a label refinement module combining LLMs and GNNs with rewired topology. Our extensive experiments over five benchmark text-attributed graph datasets demonstrate that Cella significantly outperforms the state of the arts under the same query budget to LLMs in terms of label-free node classification. In particular, on the DBLP dataset with 14.3k nodes, Cella is able to achieve an 8.08% conspicuous improvement in accuracy over the state-of-the-art at a cost of less than one cent.
[AI-10] The Impact of Generalization Techniques on the Interplay Among Privacy, Utility and Fairness in Image Classification
链接: https://arxiv.org/abs/2412.11951
作者: Ahmad Hassanpour,Amir Zarei,Khawla Mallat,Anderson Santana de Oliveira,Bian Yang
关键词: study investigates, investigates the trade-offs, image classification, generalization techniques, privacy
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: Published as a conference paper at the 25th Privacy Enhancing Technologies Symposium (PETS 2025)
点击查看摘要
Abstract:This study investigates the trade-offs between fairness, privacy, and utility in image classification using machine learning (ML). Recent research suggests that generalization techniques can improve the balance between privacy and utility. One focus of this work is sharpness-aware training (SAT) and its integration with differential privacy (DP-SAT) to further improve this balance. Additionally, we examine fairness in both private and non-private learning models trained on datasets with synthetic and real-world biases. We also measure the privacy risks involved in these scenarios by performing membership inference attacks (MIAs) and explore the consequences of eliminating high-privacy risk samples, termed outliers. Moreover, we introduce a new metric, named harmonic score, which combines accuracy, privacy, and fairness into a single measure. Through empirical analysis using generalization techniques, we achieve an accuracy of 81.11% under (8, 10^-5)-DP on CIFAR-10, surpassing the 79.5% reported by De et al. (2022). Moreover, our experiments show that memorization of training samples can begin before the overfitting point, and generalization techniques do not guarantee the prevention of this memorization. Our analysis of synthetic biases shows that generalization techniques can amplify model bias in both private and non-private models. Additionally, our results indicate that increased bias in training data leads to reduced accuracy, greater vulnerability to privacy attacks, and higher model bias. We validate these findings with the CelebA dataset, demonstrating that similar trends persist with real-world attribute imbalances. Finally, our experiments show that removing outlier data decreases accuracy and further amplifies model bias.
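The abstract does not define the harmonic score beyond "combines accuracy, privacy, and fairness into a single measure"; one natural reading, given purely as an assumption for illustration, is a harmonic mean of the three normalized components:

$$ \mathrm{HS} = \frac{3}{\frac{1}{A} + \frac{1}{P} + \frac{1}{F}}, \qquad A, P, F \in (0, 1] $$

where A, P and F are normalized accuracy, privacy, and fairness scores; a harmonic mean sharply penalizes a model that is strong on two axes but weak on the third. The paper itself should be consulted for the actual definition.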
[AI-11] OpenReviewer: A Specialized Large Language Model for Generating Critical Scientific Paper Reviews
链接: https://arxiv.org/abs/2412.11948
作者: Maximilian Idahl,Zahra Ahmadi
关键词: generating high-quality peer, generating high-quality, machine learning, high-quality peer reviews, high-quality peer
类目: Artificial Intelligence (cs.AI)
*备注: Demo: this https URL Model: this https URL
点击查看摘要
Abstract:We present OpenReviewer, an open-source system for generating high-quality peer reviews of machine learning and AI conference papers. At its core is Llama-OpenReviewer-8B, an 8B parameter language model specifically fine-tuned on 79,000 expert reviews from top ML conferences. Given a PDF paper submission and review template as input, OpenReviewer extracts the full text, including technical content like equations and tables, and generates a structured review following conference-specific guidelines. Our evaluation on 400 test papers shows that OpenReviewer produces significantly more critical and realistic reviews compared to general-purpose LLMs like GPT-4 and Claude-3.5. While other LLMs tend toward overly positive assessments, OpenReviewer’s recommendations closely match the distribution of human reviewer ratings. The system provides authors with rapid, constructive feedback to improve their manuscripts before submission, though it is not intended to replace human peer review. OpenReviewer is available as an online demo and open-source tool.
[AI-12] autrainer: A Modular and Extensible Deep Learning Toolkit for Computer Audition Tasks
链接: https://arxiv.org/abs/2412.11943
作者: Simon Rampp,Andreas Triantafyllopoulos,Manuel Milling,Björn W. Schuller
关键词: computer audition tasks, audition tasks, deep learning training, learning training framework, key operating principles
类目: ound (cs.SD); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
*备注:
点击查看摘要
Abstract:This work introduces the key operating principles for autrainer, our new deep learning training framework for computer audition tasks. autrainer is a PyTorch-based toolkit that allows for rapid, reproducible, and easily extensible training on a variety of different computer audition tasks. Concretely, autrainer offers low-code training and supports a wide range of neural networks as well as preprocessing routines. In this work, we present an overview of its inner workings and key capabilities.
[AI-13] Stepwise Reasoning Error Disruption Attack of LLMs
链接: https://arxiv.org/abs/2412.11934
作者: Jingyu Peng,Maolin Wang,Xiangyu Zhao,Kai Zhang,Wanyu Wang,Pengyue Jia,Qidong Liu,Ruocheng Guo,Qi Liu
关键词: Large language models, made remarkable strides, Large language, processes remain underexplored, complex reasoning tasks
类目: Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:Large language models (LLMs) have made remarkable strides in complex reasoning tasks, but their safety and robustness in reasoning processes remain underexplored. Existing attacks on LLM reasoning are constrained by specific settings or lack of imperceptibility, limiting their feasibility and generalizability. To address these challenges, we propose the Stepwise rEasoning Error Disruption (SEED) attack, which subtly injects errors into prior reasoning steps to mislead the model into producing incorrect subsequent reasoning and final answers. Unlike previous methods, SEED is compatible with zero-shot and few-shot settings, maintains the natural reasoning flow, and ensures covert execution without modifying the instruction. Extensive experiments on four datasets across four different models demonstrate SEED’s effectiveness, revealing the vulnerabilities of LLMs to disruptions in reasoning processes. These findings underscore the need for greater attention to the robustness of LLM reasoning to ensure safety in practical applications.
[AI-14] Hierarchical Meta-Reinforcement Learning via Automated Macro-Action Discovery
链接: https://arxiv.org/abs/2412.11930
作者: Minjae Cho,Chuangchuang Sun
关键词: enables fast adaptation, Meta-Reinforcement Learning, Learning, testing tasks, enables fast
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:Meta-Reinforcement Learning (Meta-RL) enables fast adaptation to new testing tasks. Despite recent advancements, it is still challenging to learn performant policies across multiple complex and high-dimensional tasks. To address this, we propose a novel architecture with three hierarchical levels for 1) learning task representations, 2) discovering task-agnostic macro-actions in an automated manner, and 3) learning primitive actions. The macro-action can guide the low-level primitive policy learning to more efficiently transition to goal states. This can address the issue that the policy may forget previously learned behavior while learning new, conflicting tasks. Moreover, the task-agnostic nature of the macro-actions is enabled by removing task-specific components from the state space. Hence, this makes them amenable to re-composition across different tasks and leads to promising fast adaptation to new tasks. Also, the prospective instability from the tri-level hierarchies is effectively mitigated by our innovative, independently tailored training schemes. Experiments in the MetaWorld framework demonstrate the improved sample efficiency and success rate of our approach compared to previous state-of-the-art methods.
[AI-15] GNN Applied to Ego-nets for Friend Suggestions
链接: https://arxiv.org/abs/2412.11888
作者: Evgeny Zamyatin
关键词: making friend suggestions, billions of connections, making friend, friend suggestions, large size
类目: ocial and Information Networks (cs.SI); Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:A major problem of making friend suggestions in social networks is the large size of social graphs, which can have hundreds of millions of people and tens of billions of connections. Classic methods based on heuristics or factorizations are often used to address the difficulties of scaling more complex models. However, the unsupervised nature of these methods can lead to suboptimal results. In this work, we introduce the Generalized Ego-network Friendship Score framework, which makes it possible to use complex supervised models without sacrificing scalability. The main principle of the framework is to reduce the problem of link prediction on a full graph to a series of low-scale tasks on ego-nets with subsequent aggregation of their results. Here, the underlying model takes an ego-net as input and produces a pairwise relevance matrix for its nodes. In addition, we develop the WalkGNN model which is capable of working effectively in the social network domain, where these graph-level link prediction tasks are heterogeneous, dynamic and featureless. To measure the accuracy of this model, we introduce the Ego-VK dataset that serves as an exact representation of the real-world problem that we are addressing. Offline experiments on the dataset show that our model outperforms all baseline methods, and a live A/B test demonstrates the growth of business metrics as a result of utilizing our approach.
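The framework's central reduction, replacing full-graph link prediction with many small ego-net tasks whose pairwise scores are then aggregated, can be sketched with networkx as follows; the stubbed `model` interface and mean aggregation are illustrative assumptions:

```python
import networkx as nx

def ego_net_friendship_scores(graph, model, radius=1):
    """Run a pairwise-relevance model on each ego-net and aggregate scores per candidate pair."""
    aggregated = {}                            # (u, v) -> list of scores from different ego-nets
    for ego in graph.nodes:
        ego_net = nx.ego_graph(graph, ego, radius=radius)
        relevance = model(ego_net)             # assumed: dict (u, v) -> relevance score
        for pair, score in relevance.items():
            aggregated.setdefault(tuple(sorted(pair)), []).append(score)
    return {pair: sum(s) / len(s) for pair, s in aggregated.items()}
```

Because each ego-net is tiny compared to the full graph, an expensive supervised model (WalkGNN in the paper) stays tractable even on graphs with billions of edges.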
[AI-16] A Variable Occurrence-Centric Framework for Inconsistency Handling (Extended Version)
链接: https://arxiv.org/abs/2412.11868
作者: Yakoub Salhi
关键词: introduce a syntactic, syntactic framework, framework for analyzing, analyzing and handling, handling inconsistencies
类目: Artificial Intelligence (cs.AI); Logic in Computer Science (cs.LO)
*备注:
点击查看摘要
Abstract:In this paper, we introduce a syntactic framework for analyzing and handling inconsistencies in propositional bases. Our approach focuses on examining the relationships between variable occurrences within conflicts. We propose two dual concepts: Minimal Inconsistency Relation (MIR) and Maximal Consistency Relation (MCR). Each MIR is a minimal equivalence relation on variable occurrences that results in inconsistency, while each MCR is a maximal equivalence relation designed to prevent inconsistency. Notably, MIRs capture conflicts overlooked by minimal inconsistent subsets. Using MCRs, we develop a series of non-explosive inference relations. The main strategy involves restoring consistency by modifying the propositional base according to each MCR, followed by employing the classical inference relation to derive conclusions. Additionally, we propose an unusual semantics that assigns truth values to variable occurrences instead of the variables themselves. The associated inference relations are established through Boolean interpretations compatible with the occurrence-based models.
[AI-17] Transformers Use Causal World Models in Maze-Solving Tasks
链接: https://arxiv.org/abs/2412.11867
作者: Alex F. Spies,William Edwards,Michael I. Ivanitskiy,Adrians Skapars,Tilman Räuker,Katsumi Inoue,Alessandra Russo,Murray Shanahan
关键词: networks naturally develop, Recent studies, naturally develop surprisingly, develop surprisingly structured, surprisingly structured representations
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: Main paper: 9 pages, 9 figures. Supplementary material: 10 pages, 17 additional figures. Code and data will be available upon publication. Corresponding author: A. F. Spies (afspies@imperial. this http URL )
点击查看摘要
Abstract:Recent studies in interpretability have explored the inner workings of transformer models trained on tasks across various domains, often discovering that these networks naturally develop surprisingly structured representations. When such representations comprehensively reflect the task domain’s structure, they are commonly referred to as “World Models” (WMs). In this work, we discover such WMs in transformers trained on maze tasks. In particular, by employing Sparse Autoencoders (SAEs) and analysing attention patterns, we examine the construction of WMs and demonstrate consistency between the circuit analysis and the SAE feature-based analysis. We intervene upon the isolated features to confirm their causal role and, in doing so, find asymmetries between certain types of interventions. Surprisingly, we find that models are able to reason with respect to a greater number of active features than they see during training, even if attempting to specify these in the input token sequence would lead the model to fail. Furthermore, we observe that varying positional encodings can alter how WMs are encoded in a model’s residual stream. By analyzing the causal role of these WMs in a toy domain we hope to make progress toward an understanding of emergent structure in the representations acquired by Transformers, leading to the development of more interpretable and controllable AI systems.
[AI-18] Investigating Mixture of Experts in Dense Retrieval
链接: https://arxiv.org/abs/2412.11864
作者: Effrosyni Sokli,Pranav Kasela,Georgios Peikos,Gabriella Pasi
关键词: advanced Information Retrieval, Dense Retrieval Models, advanced Information, Dense Retrieval, Information Retrieval
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:While Dense Retrieval Models (DRMs) have advanced Information Retrieval (IR), one limitation of these neural models is their narrow generalizability and robustness. To cope with this issue, one can leverage the Mixture-of-Experts (MoE) architecture. While previous IR studies have incorporated MoE architectures within the Transformer layers of DRMs, our work investigates an architecture that integrates a single MoE block (SB-MoE) after the output of the final Transformer layer. Our empirical evaluation investigates how SB-MoE compares, in terms of retrieval effectiveness, to standard fine-tuning. In detail, we fine-tune three DRMs (TinyBERT, BERT, and Contriever) across four benchmark collections with and without adding the MoE block. Moreover, since MoE showcases performance variations with respect to its parameters (i.e., the number of experts), we conduct additional experiments to investigate this aspect further. The findings show the effectiveness of SB-MoE especially for DRMs with a low number of parameters (i.e., TinyBERT), as it consistently outperforms the fine-tuned underlying model on all four benchmarks. For DRMs with a higher number of parameters (i.e., BERT and Contriever), SB-MoE requires larger numbers of training samples to yield better retrieval performance.
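A minimal PyTorch sketch of a single MoE block attached after the final Transformer layer (the SB-MoE idea); the expert architecture, expert count, and the dense softmax gate (instead of a sparse top-k router) are illustrative assumptions:

```python
import torch
import torch.nn as nn

class SBMoE(nn.Module):
    """Single Mixture-of-Experts block applied to the pooled output of a dense retriever."""

    def __init__(self, hidden_dim, num_experts=4):
        super().__init__()
        self.gate = nn.Linear(hidden_dim, num_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(hidden_dim, hidden_dim), nn.GELU(),
                          nn.Linear(hidden_dim, hidden_dim))
            for _ in range(num_experts)
        )

    def forward(self, x):                                   # x: (batch, hidden_dim)
        weights = torch.softmax(self.gate(x), dim=-1)       # (batch, num_experts)
        expert_out = torch.stack([e(x) for e in self.experts], dim=1)
        return (weights.unsqueeze(-1) * expert_out).sum(dim=1)   # weighted mixture
```

In the setup studied here such a block would be fine-tuned jointly with the underlying encoder (TinyBERT, BERT, or Contriever) on the retrieval objective.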
[AI-19] A Theory of Formalisms for Representing Knowledge AAAI-25
链接: https://arxiv.org/abs/2412.11855
作者: Heng Zhang,Donghui Quan
关键词: knowledge representation, knowledge representation formalisms, representation formalisms, knowledge, representation
类目: Artificial Intelligence (cs.AI); Computational Complexity (cs.CC); Logic in Computer Science (cs.LO)
*备注: Extended version of a paper to appear in AAAI-25
点击查看摘要
Abstract:There has been a longstanding dispute over which formalism is the best for representing knowledge in AI. The well-known “declarative vs. procedural controversy” is concerned with the choice of utilizing declarations or procedures as the primary mode of knowledge representation. The ongoing debate between symbolic AI and connectionist AI also revolves around the question of whether knowledge should be represented implicitly (e.g., as parametric knowledge in deep learning and large language models) or explicitly (e.g., as logical theories in traditional knowledge representation and reasoning). To address these issues, we propose a general framework to capture various knowledge representation formalisms in which we are interested. Within the framework, we find a family of universal knowledge representation formalisms, and prove that all universal formalisms are recursively isomorphic. Moreover, we show that all pairwise intertranslatable formalisms that admit the padding property are also recursively isomorphic. These imply that, up to an offline compilation, all universal (or natural and equally expressive) representation formalisms are in fact the same, which thus provides a partial answer to the aforementioned dispute.
[AI-20] Does it Chug? Towards a Data-Driven Understanding of Guitar Tone Description
链接: https://arxiv.org/abs/2412.11769
作者: Pratik Sutar,Jason Naradowsky,Yusuke Miyao
关键词: Natural language, language is commonly, describe instrument timbre, Natural, describe instrument
类目: ound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
*备注: Accepted for publication at the 3rd Workshop on NLP for Music and Audio (NLP4MusA 2024)
点击查看摘要
Abstract:Natural language is commonly used to describe instrument timbre, such as a “warm” or “heavy” sound. As these descriptors are based on human perception, there can be disagreement over which acoustic features correspond to a given adjective. In this work, we pursue a data-driven approach to further our understanding of such adjectives in the context of guitar tone. Our main contribution is a dataset of timbre adjectives, constructed by processing single clips of instrument audio to produce varied timbres through adjustments in EQ and effects such as distortion. Adjective annotations are obtained for each clip by crowdsourcing experts to complete a pairwise comparison and a labeling task. We examine the dataset and reveal correlations between adjective ratings and highlight instances where the data contradicts prevailing theories on spectral features and timbral adjectives, suggesting a need for a more nuanced, data-driven understanding of timbre.
[AI-21] No More Adam: Learning Rate Scaling at Initialization is All You Need
链接: https://arxiv.org/abs/2412.11768
作者: Minghao Xu,Lichuan Xiang,Xu Cai,Hongkai Wen
关键词: deep neural networks, training deep neural, neural networks, question the necessity, deep neural
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: 20 pages, 10 figures
点击查看摘要
Abstract:In this work, we question the necessity of adaptive gradient methods for training deep neural networks. SGD-SaI is a simple yet effective enhancement to stochastic gradient descent with momentum (SGDM). SGD-SaI performs learning rate Scaling at Initialization (SaI) to distinct parameter groups, guided by their respective gradient signal-to-noise ratios (g-SNR). By adjusting learning rates without relying on adaptive second-order momentum, SGD-SaI helps prevent training imbalances from the very first iteration and cuts the optimizer’s memory usage by half compared to AdamW. Despite its simplicity and efficiency, SGD-SaI consistently matches or outperforms AdamW in training a variety of Transformer-based tasks, effectively overcoming a long-standing challenge of using SGD for training Transformers. SGD-SaI excels in ImageNet-1K classification with Vision Transformers(ViT) and GPT-2 pretraining for large language models (LLMs, transformer decoder-only), demonstrating robustness to hyperparameter variations and practicality for diverse applications. We further tested its robustness on tasks like LoRA fine-tuning for LLMs and diffusion models, where it consistently outperforms state-of-the-art optimizers. From a memory efficiency perspective, SGD-SaI achieves substantial memory savings for optimizer states, reducing memory usage by 5.93 GB for GPT-2 (1.5B parameters) and 25.15 GB for Llama2-7B compared to AdamW in full-precision training settings.
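A sketch of the idea behind SGD-SaI: estimate a gradient signal-to-noise ratio per parameter from a few warm-up batches, scale each parameter group's learning rate once at initialization, and then run plain SGD with momentum. The g-SNR estimator, normalization, and grouping below are illustrative assumptions rather than the authors' exact recipe:

```python
import torch

def estimate_gsnr(model, loss_fn, warmup_batches):
    """Per-parameter gradient signal-to-noise ratio estimated from a handful of batches."""
    grads = {n: [] for n, p in model.named_parameters() if p.requires_grad}
    for x, y in warmup_batches:
        model.zero_grad()
        loss_fn(model(x), y).backward()
        for n, p in model.named_parameters():
            if p.grad is not None:
                grads[n].append(p.grad.detach().clone())
    gsnr = {}
    for n, gs in grads.items():
        if not gs:
            continue
        g = torch.stack(gs)                        # (num_batches, *param_shape)
        mean, var = g.mean(0), g.var(0) + 1e-12
        gsnr[n] = (mean.pow(2) / var).mean().item()
    return gsnr

def build_sgd_sai(model, gsnr, base_lr=0.1, momentum=0.9):
    """Scale learning rates once at initialization, then use vanilla SGD with momentum."""
    max_snr = max(gsnr.values())
    groups = [{"params": [p], "lr": base_lr * gsnr[n] / max_snr}
              for n, p in model.named_parameters() if n in gsnr]
    return torch.optim.SGD(groups, lr=base_lr, momentum=momentum)
```

The key property promised by the abstract is that no adaptive second-order momentum is kept during training, which is where the memory savings over AdamW come from.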
[AI-22] Harnessing Language for Coordination: A Framework and Benchmark for LLM-Driven Multi-Agent Control
链接: https://arxiv.org/abs/2412.11761
作者: Timothée Anne,Noah Syrkis,Meriem Elhosni,Florian Turati,Franck Legendre,Alain Jaquier,Sebastian Risi
关键词: demonstrated remarkable performance, Large Language Models, Large Language, demonstrated remarkable, remarkable performance
类目: Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:Large Language Models (LLMs) have demonstrated remarkable performance across various tasks. A promising but largely under-explored area is their potential to facilitate human coordination with many agents. Such capabilities would be useful in domains including disaster response, urban planning, and real-time strategy scenarios. In this work, we introduce (1) a real-time strategy game benchmark designed to evaluate these abilities and (2) a novel framework we term HIVE. HIVE empowers a single human to coordinate swarms of up to 2,000 agents using natural language dialog with an LLM. We present promising results on this multi-agent benchmark, with our hybrid approach solving tasks such as coordinating agent movements, exploiting unit weaknesses, leveraging human annotations, and understanding terrain and strategic points. However, our findings also highlight critical limitations of current models, including difficulties in processing spatial visual information and challenges in formulating long-term strategic plans. This work sheds light on the potential and limitations of LLMs in human-swarm coordination, paving the way for future research in this area. The HIVE project page, which includes videos of the system in action, can be found here: this http URL.
[AI-23] On Large Language Models in Mission-Critical IT Governance: Are We Ready Yet?
链接: https://arxiv.org/abs/2412.11698
作者: Matteo Esposito,Francesco Palagiano,Valentina Lenarduzzi,Davide Taibi
关键词: Large Language Models, MCS governance, security, cyber warfare landscape, governance
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET); Software Engineering (cs.SE)
*备注:
点击查看摘要
Abstract:Context. The security of critical infrastructure has been a fundamental concern since the advent of computers, and this concern has only intensified in today’s cyber warfare landscape. Protecting mission-critical systems (MCSs), including essential assets like healthcare, telecommunications, and military coordination, is vital for national security. These systems require prompt and comprehensive governance to ensure their resilience, yet recent events have shown that meeting these demands is increasingly challenging. Aim. Building on prior research that demonstrated the potential of GAI, particularly Large Language Models (LLMs), in improving risk analysis tasks, we aim to explore practitioners’ perspectives, specifically developers and security personnel, on using generative AI (GAI) in the governance of IT MCSs seeking to provide insights and recommendations for various stakeholders, including researchers, practitioners, and policymakers. Method. We designed a survey to collect practical experiences, concerns, and expectations of practitioners who develop and implement security solutions in the context of MCSs. Analyzing this data will help identify key trends, challenges, and opportunities for introducing GAIs in this niche domain. Conclusions and Future Works. Our findings highlight that the safe use of LLMs in MCS governance requires interdisciplinary collaboration. Researchers should focus on designing regulation-oriented models and focus on accountability; practitioners emphasize data protection and transparency, while policymakers must establish a unified AI framework with global benchmarks to ensure ethical and secure LLMs-based MCS governance.
[AI-24] NEST: A Neuromodulated Small-world Hypergraph Trajectory Prediction Model for Autonomous Driving AAAI-25
链接: https://arxiv.org/abs/2412.11682
作者: Chengyue Wang,Haicheng Liao,Bonan Wang,Yanchen Guan,Bin Rao,Ziyuan Pu,Zhiyong Cui,Chengzhong Xu,Zhenning Li
关键词: Accurate trajectory prediction, Accurate trajectory, Neuromodulated Small-world Hypergraph, trajectory prediction, Small-world Hypergraph Trajectory
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: Accepted by AAAI-25
点击查看摘要
Abstract:Accurate trajectory prediction is essential for the safety and efficiency of autonomous driving. Traditional models often struggle with real-time processing, capturing non-linearity and uncertainty in traffic environments, efficiency in dense traffic, and modeling temporal dynamics of interactions. We introduce NEST (Neuromodulated Small-world Hypergraph Trajectory Prediction), a novel framework that integrates Small-world Networks and hypergraphs for superior interaction modeling and prediction accuracy. This integration enables the capture of both local and extended vehicle interactions, while the Neuromodulator component adapts dynamically to changing traffic conditions. We validate the NEST model on several real-world datasets, including nuScenes, MoCAD, and HighD. The results consistently demonstrate that NEST outperforms existing methods in various traffic scenarios, showcasing its exceptional generalization capability, efficiency, and temporal foresight. Our comprehensive evaluation illustrates that NEST significantly improves the reliability and operational efficiency of autonomous driving systems, making it a robust solution for trajectory prediction in complex traffic environments.
[AI-25] Loosely Synchronized Rule-Based Planning for Multi-Agent Path Finding with Asynchronous Actions AAAI2025
链接: https://arxiv.org/abs/2412.11678
作者: Shuai Zhou,Shizhe Zhao,Zhongqiang Ren
关键词: minimizing path costs, seeks collision-free paths, respective starting locations, respective goal locations, Multi-Agent Path Finding
类目: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI)
*备注: AAAI2025
点击查看摘要
Abstract:Multi-Agent Path Finding (MAPF) seeks collision-free paths for multiple agents from their respective starting locations to their respective goal locations while minimizing path costs. Although many MAPF algorithms were developed and can handle up to thousands of agents, they usually rely on the assumption that each action of the agent takes a time unit, and the actions of all agents are synchronized in a sense that the actions of agents start at the same discrete time step, which may limit their use in practice. Only a few algorithms were developed to address asynchronous actions, and they all lie on one end of the spectrum, focusing on finding optimal solutions with limited scalability. This paper develops new planners that lie on the other end of the spectrum, trading off solution quality for scalability, by finding an unbounded sub-optimal solution for many agents. Our method leverages both search methods (LSS) in handling asynchronous actions and rule-based planning methods (PIBT) for MAPF. We analyze the properties of our method and test it against several baselines with up to 1000 agents in various maps. Given a runtime limit, our method can handle an order of magnitude more agents than the baselines with about 25% longer makespan.
[AI-26] UA-PDFL: A Personalized Approach for Decentralized Federated Learning
链接: https://arxiv.org/abs/2412.11674
作者: Hangyu Zhu,Yuxiang Fan,Zhenping Xie
关键词: privacy preserving machine, machine learning paradigm, learning paradigm designed, preserving machine learning, decentralized federated learning
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:Federated learning (FL) is a privacy preserving machine learning paradigm designed to collaboratively learn a global model without data leakage. Specifically, in a typical FL system, the central server solely functions as an coordinator to iteratively aggregate the collected local models trained by each client, potentially introducing single-point transmission bottleneck and security threats. To mitigate this issue, decentralized federated learning (DFL) has been proposed, where all participating clients engage in peer-to-peer communication without a central server. Nonetheless, DFL still suffers from training degradation as FL does due to the non-independent and identically distributed (non-IID) nature of client data. And incorporating personalization layers into DFL may be the most effective solutions to alleviate the side effects caused by non-IID data. Therefore, in this paper, we propose a novel unit representation aided personalized decentralized federated learning framework, named UA-PDFL, to deal with the non-IID challenge in DFL. By adaptively adjusting the level of personalization layers through the guidance of the unit representation, UA-PDFL is able to address the varying degrees of data skew. Based on this scheme, client-wise dropout and layer-wise personalization are proposed to further enhance the learning performance of DFL. Extensive experiments empirically prove the effectiveness of our proposed method.
[AI-27] LLM-DaaS: LLM-driven Drone-as-a-Service Operations from Text User Requests
链接: https://arxiv.org/abs/2412.11672
作者: Lillian Wassim,Kamal Mohamed,Ali Hamdi
关键词: leverages Large Language, Large Language Models, Large Language, leverages Large, framework that leverages
类目: Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
*备注:
点击查看摘要
Abstract:We propose LLM-DaaS, a novel Drone-as-a-Service (DaaS) framework that leverages Large Language Models (LLMs) to transform free-text user requests into structured, actionable DaaS operation tasks. Our approach addresses the key challenge of interpreting and structuring natural language input to automate drone service operations under uncertain conditions. The system is composed of three main components: free-text request processing, structured request generation, and dynamic DaaS selection and composition. First, we fine-tune different LLM models such as Phi-3.5, LLaMA-3.2 7b and Gemma 2b on a dataset of text user requests mapped to structured DaaS requests. Users interact with our model in a free conversational style, discussing package delivery requests, while the fine-tuned LLM extracts DaaS metadata such as delivery time, source and destination locations, and package weight. The DaaS service selection model is designed to select the best available drone capable of delivering the requested package from the delivery point to the nearest optimal destination. Additionally, the DaaS composition model composes a service from a set of the best available drones to deliver the package from the source to the final destination. Second, the system integrates real-time weather data to optimize drone route planning and scheduling, ensuring safe and efficient operations. Simulations demonstrate the system’s ability to significantly improve task accuracy, operational efficiency, and establish LLM-DaaS as a robust solution for DaaS operations in uncertain environments.
[AI-28] Smoothness Really Matters: A Simple yet Effective Approach for Unsupervised Graph Domain Adaptation AAAI2025
链接: https://arxiv.org/abs/2412.11654
作者: Wei Chen,Guo Ye,Yakun Wang,Zhao Zhang,Libang Zhang,Daxin Wang,Zhiqiang Zhang,Fuzhen Zhuang
关键词: Unsupervised Graph Domain, Graph Domain Adaptation, Domain Adaptation, Unsupervised Graph, seeks to bridge
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: 11 pages, Accpected by AAAI2025
点击查看摘要
Abstract:Unsupervised Graph Domain Adaptation (UGDA) seeks to bridge distribution shifts between domains by transferring knowledge from labeled source graphs to given unlabeled target graphs. Existing UGDA methods primarily focus on aligning features in the latent space learned by graph neural networks (GNNs) across domains, often overlooking structural shifts, resulting in limited effectiveness when addressing structurally complex transfer scenarios. Given the sensitivity of GNNs to local structural features, even slight discrepancies between source and target graphs could lead to significant shifts in node embeddings, thereby reducing the effectiveness of knowledge transfer. To address this issue, we introduce a novel approach for UGDA called Target-Domain Structural Smoothing (TDSS). TDSS is a simple and effective method designed to perform structural smoothing directly on the target graph, thereby mitigating structural distribution shifts and ensuring the consistency of node representations. Specifically, by integrating smoothing techniques with neighborhood sampling, TDSS maintains the structural coherence of the target graph while mitigating the risk of over-smoothing. Our theoretical analysis shows that TDSS effectively reduces target risk by improving model smoothness. Empirical results on three real-world datasets demonstrate that TDSS outperforms recent state-of-the-art baselines, achieving significant improvements across six transfer scenarios. The code is available in this https URL.
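The structural smoothing step on the target graph, mixing each node's features with the mean of a sampled neighborhood to damp local structural discrepancies, can be sketched as follows; the single-hop formulation, sample size, and mixing coefficient are illustrative assumptions:

```python
import random
import numpy as np

def target_domain_smoothing(features, adj_list, num_samples=5, alpha=0.5):
    """features: (N, d) array indexed by node id; adj_list: dict node id -> list of neighbor ids.
    Blend each node's feature with the mean of a sampled neighborhood."""
    smoothed = features.copy()
    for v, neighbors in adj_list.items():
        if not neighbors:
            continue
        sampled = random.sample(neighbors, min(num_samples, len(neighbors)))
        smoothed[v] = alpha * features[v] + (1 - alpha) * features[sampled].mean(axis=0)
    return smoothed
```

Sampling a bounded number of neighbors, rather than averaging over the full neighborhood, is what keeps the smoothing from collapsing node representations (the over-smoothing risk mentioned in the abstract).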
[AI-29] A comprehensive GeoAI review: Progress, Challenges and Outlooks
链接: https://arxiv.org/abs/2412.11643
作者: Anasse Boutayeb,Iyad Lahsen-cherif,Ahmed El Khadimi
关键词: Geospatial Artificial Intelligence, applying Artificial Intelligence, Artificial Intelligence, relevant research works, Geospatial Artificial
类目: Artificial Intelligence (cs.AI); Geophysics (physics.geo-ph)
*备注: A comprehensive GeoAI review with 50 pages, 52 figures and 13 tables. This paper explores the synergy between the most advanced artificial intelligence techniques and geospatial data, while highlighting the close relationship between this concept and the notions of GIS and big geodata
点击查看摘要
Abstract:In recent years, Geospatial Artificial Intelligence (GeoAI) has gained traction in the most relevant research works and industrial applications, while also becoming involved in various fields of use. This paper offers a comprehensive review of GeoAI as a synergistic concept applying Artificial Intelligence (AI) methods and models to geospatial data. A preliminary study is carried out, identifying the methodology of the work, the research motivations, the issues and the directions to be tracked, followed by exploring how GeoAI can be used in various interesting fields of application, such as precision agriculture, environmental monitoring, disaster management and urban planning. Next, a statistical and semantic analysis is carried out, followed by a clear and precise presentation of the challenges facing GeoAI. Then, a concrete exploration of the future prospects is provided, based on the information gathered during the census. To sum up, this paper provides a complete overview of the correlation between AI and the geospatial domain, while mentioning the research conducted in this context, and emphasizing the close relationship linking GeoAI with other advanced concepts such as geographic information systems (GIS) and large-scale geospatial data, known as big geodata. This will enable researchers and the scientific community to assess the state of progress in this promising field, and will help other interested parties to gain a better understanding of the issues involved.
[AI-30] Introduction to AI Planning
链接: https://arxiv.org/abs/2412.11642
作者: Marco Aiello,Ilche Georgievski
关键词: University of Stuttgart, Stuttgart that provide, Planning, Artificial Intelligence Planning, provide an introduction
类目: Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:These are notes for lectures presented at the University of Stuttgart that provide an introduction to key concepts and techniques in AI Planning. Artificial Intelligence Planning, also known as Automated Planning, emerged somewhere in 1966 from the need to give autonomy to a wheeled robot. Since then, it has evolved into a flourishing research and development discipline, often associated with scheduling. Over the decades, various approaches to planning have been developed with characteristics that make them appropriate for specific tasks and applications. Most approaches represent the world as a state within a state transition system; then the planning problem becomes that of searching a path in the state space from the current state to one which satisfies the goals of the user. The notes begin by introducing the state model and move on to exploring classical planning, the foundational form of planning, and present fundamental algorithms for solving such problems. Subsequently, we examine planning as a constraint satisfaction problem, outlining the mapping process and describing an approach to solve such problems. The most extensive section is dedicated to Hierarchical Task Network (HTN) planning, one of the most widely used and powerful planning techniques in the field. The lecture notes end with a bonus chapter on the Planning Domain Definition (PDDL) Language, the de facto standard syntax for representing non-hierarchical planning problems.
[AI-31] Multi-Scale Incremental Modeling for Enhanced Human Motion Prediction in Human-Robot Collaboration
链接: https://arxiv.org/abs/2412.11632
作者: Juncheng Zou
关键词: remains challenging due, Accurate human motion, variable human movements, safe human-robot collaboration, Accurate human
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:Accurate human motion prediction is crucial for safe human-robot collaboration but remains challenging due to the complexity of modeling intricate and variable human movements. This paper presents Parallel Multi-scale Incremental Prediction (PMS), a novel framework that explicitly models incremental motion across multiple spatio-temporal scales to capture subtle joint evolutions and global trajectory shifts. PMS encodes these multi-scale increments using parallel sequence branches, enabling iterative refinement of predictions. A multi-stage training procedure with a full-timeline loss integrates temporal context. Extensive experiments on four datasets demonstrate substantial improvements in continuity, biomechanical consistency, and long-term forecast stability by modeling inter-frame increments. PMS achieves state-of-the-art performance, increasing prediction accuracy by 16.3%-64.2% over previous methods. The proposed multi-scale incremental approach provides a powerful technique for advancing human motion prediction capabilities critical for seamless human-robot interaction.
[AI-32] EvoLlama: Enhancing LLMs' Understanding of Proteins via Multimodal Structure and Sequence Representations
链接: https://arxiv.org/abs/2412.11618
作者: Nuowei Liu,Changzhi Sun,Tao Ji,Junfeng Tian,Jianxin Tang,Yuanbin Wu,Man Lan
关键词: Current Large Language, Large Language Models, Current Large, Protein Language Models, Language Models
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:Current Large Language Models (LLMs) for understanding proteins primarily treat amino acid sequences as a text modality. Meanwhile, Protein Language Models (PLMs), such as ESM-2, have learned massive sequential evolutionary knowledge from the universe of natural protein sequences. Furthermore, structure-based encoders like ProteinMPNN learn the structural information of proteins through Graph Neural Networks. However, whether the incorporation of protein encoders can enhance the protein understanding of LLMs has not been explored. To bridge this gap, we propose EvoLlama, a multimodal framework that connects a structure-based encoder, a sequence-based protein encoder and an LLM for protein understanding. EvoLlama consists of a ProteinMPNN structure encoder, an ESM-2 protein sequence encoder, a multimodal projector to align protein and text representations and a Llama-3 text decoder. To train EvoLlama, we fine-tune it on protein-oriented instructions and protein property prediction datasets verbalized via natural language instruction templates. Our experiments show that EvoLlama’s protein understanding capabilities have been significantly enhanced, outperforming other fine-tuned protein-oriented LLMs in zero-shot settings by an average of 1%-8% and surpassing the state-of-the-art baseline with supervised fine-tuning by an average of 6%. On protein property prediction datasets, our approach achieves promising results that are competitive with state-of-the-art task-specific baselines. We will release our code in a future version.
[AI-33] Region-Based Optimization in Continual Learning for Audio Deepfake Detection AAAI2025
链接: https://arxiv.org/abs/2412.11551
作者: Yujie Chen,Jiangyan Yi,Cunhang Fan,Jianhua Tao,Yong Ren,Siding Zeng,Chu Yuan Zhang,Xinrui Yan,Hao Gu,Jun Xue,Chenglong Wang,Zhao Lv,Xiaohui Zhang
关键词: voice conversion bring, conversion bring convenience, audio deepfake detection, Rapid advancements, effective audio deepfake
类目: ound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
*备注: Accepted by AAAI 2025
点击查看摘要
Abstract:Rapid advancements in speech synthesis and voice conversion bring convenience but also new security risks, creating an urgent need for effective audio deepfake detection. Although current models perform well, their effectiveness diminishes when confronted with the diverse and evolving nature of real-world deepfakes. To address this issue, we propose a continual learning method named Region-Based Optimization (RegO) for audio deepfake detection. Specifically, we use the Fisher information matrix to measure important neuron regions for real and fake audio detection, dividing them into four regions. First, we directly fine-tune the less important regions to quickly adapt to new tasks. Next, we apply gradient optimization in parallel for regions important only to real audio detection, and in orthogonal directions for regions important only to fake audio detection. For regions that are important to both, we use sample proportion-based adaptive gradient optimization. This region-adaptive optimization ensures an appropriate trade-off between memory stability and learning plasticity. Additionally, to address the increase of redundant neurons from old tasks, we further introduce the Ebbinghaus forgetting mechanism to release them, thereby promoting the capability of the model to learn more generalized discriminative features. Experimental results show our method achieves a 21.3% improvement in EER over the state-of-the-art continual learning approach RWM for audio deepfake detection. Moreover, the effectiveness of RegO extends beyond the audio deepfake detection domain, showing potential significance in other tasks, such as image recognition. The code is available at this https URL
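To make the Fisher-information-based region partition more concrete, below is a hedged sketch of computing a diagonal Fisher importance score per parameter; the thresholds and four-way region split used by RegO are not reproduced here.

```python
import torch

def fisher_importance(model, data_loader, loss_fn):
    """Hedged sketch: approximate diagonal Fisher information as the mean
    of squared gradients over a dataset, used to rank parameter regions."""
    fisher = {n: torch.zeros_like(p) for n, p in model.named_parameters()}
    model.eval()
    for x, y in data_loader:
        model.zero_grad()
        loss = loss_fn(model(x), y)
        loss.backward()
        for n, p in model.named_parameters():
            if p.grad is not None:
                fisher[n] += p.grad.detach() ** 2
    return {n: f / max(len(data_loader), 1) for n, f in fisher.items()}
```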
[AI-34] Embodied CoT Distillation From LLM To Off-the-shelf Agents ICML2024
链接: https://arxiv.org/abs/2412.11499
作者: Wonje Choi,Woo Kyung Kim,Minjong Yoo,Honguk Woo
关键词: utilizing large language, large language models, decision-making systems operate, systems operate timely, small language model
类目: Artificial Intelligence (cs.AI); Robotics (cs.RO)
*备注: Accepted at ICML 2024
点击查看摘要
Abstract:We address the challenge of utilizing large language models (LLMs) for complex embodied tasks, in the environment where decision-making systems operate timely on capacity-limited, off-the-shelf devices. We present DeDer, a framework for decomposing and distilling the embodied reasoning capabilities from LLMs to efficient, small language model (sLM)-based policies. In DeDer, the decision-making process of LLM-based strategies is restructured into a hierarchy with a reasoning-policy and planning-policy. The reasoning-policy is distilled from the data that is generated through the embodied in-context learning and self-verification of an LLM, so it can produce effective rationales. The planning-policy, guided by the rationales, can render optimized plans efficiently. In turn, DeDer allows for adopting sLMs for both policies, deployed on off-the-shelf devices. Furthermore, to enhance the quality of intermediate rationales, specific to embodied tasks, we devise the embodied knowledge graph, and to generate multiple rationales timely through a single inference, we also use the contrastively prompted attention model. Our experiments with the ALFRED benchmark demonstrate that DeDer surpasses leading language planning and distillation approaches, indicating the applicability and efficiency of sLM-based embodied policies derived through DeDer.
[AI-35] Leveraging Foundation Language Models (FLMs) for Automated Cohort Extraction from Large EHR Databases
链接: https://arxiv.org/abs/2412.11472
作者: Purity Mugambi,Alexandra Meliou,Madalina Fiterau
关键词: cohort, crucial step, cohort extraction, cohort studies, required cohort
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:A crucial step in cohort studies is to extract the required cohort from one or more study datasets. This step is time-consuming, especially when a researcher is presented with a dataset that they have not previously worked with. When the cohort has to be extracted from multiple datasets, cohort extraction can be extremely laborious. In this study, we present an approach for partially automating cohort extraction from multiple electronic health record (EHR) databases. We formulate the guided multi-dataset cohort extraction problem in which selection criteria are first converted into queries, translating them from natural language text to language that maps to database entities. Then, using FLMs, columns of interest identified from the queries are automatically matched between the study databases. Finally, the generated queries are run across all databases to extract the study cohort. We propose and evaluate an algorithm for automating column matching on two large, popular and publicly-accessible EHR databases – MIMIC-III and eICU. Our approach achieves a high top-three accuracy of 92%, correctly matching 12 out of the 13 columns of interest, when using a small, pre-trained general purpose language model. Furthermore, this accuracy is maintained even as the search space (i.e., size of the database) increases.
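As an illustration of the column-matching step, here is a hedged sketch using a small pre-trained sentence encoder; the model name and cosine-similarity scoring are assumptions, not the paper's exact FLM pipeline.

```python
from sentence_transformers import SentenceTransformer, util

def match_columns(source_cols, target_cols, top_k=3):
    """Hypothetical sketch: rank candidate column matches across two EHR
    schemas by embedding similarity of the column names."""
    model = SentenceTransformer("all-MiniLM-L6-v2")
    src = model.encode(source_cols, convert_to_tensor=True)
    tgt = model.encode(target_cols, convert_to_tensor=True)
    sims = util.cos_sim(src, tgt)  # shape: (len(source_cols), len(target_cols))
    matches = {}
    for i, col in enumerate(source_cols):
        best = sims[i].topk(min(top_k, len(target_cols)))
        matches[col] = [(target_cols[int(j)], float(s))
                        for s, j in zip(best.values, best.indices)]
    return matches

# Example: match_columns(["heart_rate", "systolic_bp"], ["hr", "sbp", "age"])
```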
[AI-36] Red Pill and Blue Pill: Controllable Website Fingerprinting Defense via Dynamic Backdoor Learning
链接: https://arxiv.org/abs/2412.11471
作者: Siyuan Liang,Jiajun Gong,Tianmeng Fang,Aishan Liu,Tao Wang,Xianglong Liu,Xiaochun Cao,Dacheng Tao,Chang Ee-Chien
关键词: covertly monitor user, monitor user communications, Controllable Website Fingerprint, Website Fingerprint Defense, Website fingerprint
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
*备注: 18 pages, 7 figures
点击查看摘要
Abstract:Website fingerprint (WF) attacks, which covertly monitor user communications to identify the web pages they visit, pose a serious threat to user privacy. Existing WF defenses attempt to reduce the attacker’s accuracy by disrupting unique traffic patterns; however, they often suffer from the trade-off between overhead and effectiveness, resulting in less usefulness in practice. To overcome this limitation, we introduce Controllable Website Fingerprint Defense (CWFD), a novel defense perspective based on backdoor learning. CWFD exploits backdoor vulnerabilities in neural networks to directly control the attacker’s model by designing trigger patterns based on network traffic. Specifically, CWFD injects only incoming packets on the server side into the target web page’s traffic, keeping overhead low while effectively poisoning the attacker’s model during training. During inference, the defender can influence the attacker’s model through a ‘red pill, blue pill’ choice: traces with the trigger (red pill) lead to misclassification as the target web page, while normal traces (blue pill) are classified correctly, achieving directed control over the defense outcome. We use the Fast Levenshtein-like distance as the optimization objective to compute trigger patterns that can be effectively associated with our target page. Experiments show that CWFD significantly reduces RF’s accuracy from 99% to 6% with 74% data overhead. In comparison, FRONT reduces accuracy to only 97% at similar overhead, while Palette achieves 32% accuracy with 48% more overhead. We further validate the practicality of our method in a real Tor network environment.
[AI-37] Unsupervised Anomaly Detection for Tabular Data Using Noise Evaluation AAAI2025
链接: https://arxiv.org/abs/2412.11461
作者: Wei Dai,Kai Hwang,Jicong Fan
关键词: Unsupervised anomaly detection, guaranteed UAD algorithms, Unsupervised anomaly, modern data analytics, anomaly detection
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: The paper was accepted by AAAI 2025
点击查看摘要
Abstract:Unsupervised anomaly detection (UAD) plays an important role in modern data analytics and it is crucial to provide simple yet effective and guaranteed UAD algorithms for real applications. In this paper, we present a novel UAD method for tabular data by evaluating how much noise is in the data. Specifically, we propose to learn a deep neural network from the clean (normal) training dataset and a noisy dataset, where the latter is generated by adding highly diverse noises to the clean data. The neural network can learn a reliable decision boundary between normal data and anomalous data when the diversity of the generated noisy data is sufficiently high so that the hard abnormal samples lie in the noisy region. Importantly, we provide theoretical guarantees, proving that the proposed method can detect anomalous data successfully, although the method does not utilize any real anomalous data in the training stage. Extensive experiments through more than 60 benchmark datasets demonstrate the effectiveness of the proposed method in comparison to 12 baselines of UAD. Our method obtains a 92.27% AUC score and a 1.68 ranking score on average. Moreover, compared to the state-of-the-art UAD methods, our method is easier to implement.
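A hedged sketch of the noise-evaluation idea: label clean rows as normal, diversely noised copies as anomalous, and fit a small classifier. The noise scales and network size below are illustrative assumptions, not the paper's recipe.

```python
import numpy as np
import torch
import torch.nn as nn

def train_noise_discriminator(x_clean, scales=(0.1, 0.5, 1.0, 2.0), epochs=50):
    """Hedged sketch: learn a decision boundary between clean tabular data
    and copies corrupted with Gaussian noise at several scales."""
    x_noisy = np.concatenate([x_clean + s * np.random.randn(*x_clean.shape) for s in scales])
    x = torch.tensor(np.concatenate([x_clean, x_noisy]), dtype=torch.float32)
    y = torch.cat([torch.zeros(len(x_clean)), torch.ones(len(x_noisy))])
    net = nn.Sequential(nn.Linear(x.shape[1], 64), nn.ReLU(), nn.Linear(64, 1))
    opt = torch.optim.Adam(net.parameters(), lr=1e-3)
    loss_fn = nn.BCEWithLogitsLoss()
    for _ in range(epochs):
        opt.zero_grad()
        loss = loss_fn(net(x).squeeze(-1), y)
        loss.backward()
        opt.step()
    return net  # higher sigmoid(net(x)) suggests a more anomalous sample
```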
[AI-38] Whisper-GPT: A Hybrid Representation Audio Large Language Model
链接: https://arxiv.org/abs/2412.11449
作者: Prateek Verma
关键词: large language model, generative large language, discrete tokens simultaneously, propose WHISPER-GPT, large language
类目: ound (cs.SD); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
*备注: 6 pages, 3 figures. 50th International Conference on Acoustics, Speech and Signal Processing, Hyderabad, India
点击查看摘要
Abstract:We propose WHISPER-GPT: A generative large language model (LLM) for speech and music that allows us to work with continuous audio representations and discrete tokens simultaneously as part of a single architecture. There has been a huge surge in generative audio, speech, and music models that utilize discrete audio tokens derived from neural compression algorithms, e.g. ENCODEC. However, one of the major drawbacks of this approach is handling the context length. It blows up for high-fidelity generative architecture if one has to account for all the audio contents at various frequencies for the next token prediction. By combining continuous audio representation like the spectrogram and discrete acoustic tokens, we retain the best of both worlds: Have all the information needed from the audio at a specific time instance in a single token, yet allow LLM to predict the future token to allow for sampling and other benefits discrete space provides. We show how our architecture improves the perplexity and negative log-likelihood scores for the next token prediction compared to a token-based LLM for speech and music.
[AI-39] TRAIL: Trust-Aware Client Scheduling for Semi-Decentralized Federated Learning
链接: https://arxiv.org/abs/2412.11448
作者: Gangqiang Hu,Jianfeng Lu,Jianmin Han,Shuqin Cao,Jing Liu,Hao Fu
关键词: enable distributed machine, distributed machine learning, safeguarding data privacy, federated learning, semi-decentralized federated learning
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC)
*备注:
点击查看摘要
Abstract:Due to the sensitivity of data, federated learning (FL) is employed to enable distributed machine learning while safeguarding data privacy and accommodating the requirements of various devices. However, in the context of semi-decentralized federated learning (SD-FL), clients’ communication and training states are dynamic. This variability arises from local training fluctuations, heterogeneous data distributions, and intermittent client participation. Most existing studies primarily focus on stable client states, neglecting the dynamic challenges present in real-world scenarios. To tackle this issue, we propose a trust-aware client scheduling mechanism (TRAIL) that assesses client states and contributions, enhancing model training efficiency through selective client participation. Our focus is on a semi-decentralized federated learning framework where edge servers and clients train a shared global model using unreliable intra-cluster model aggregation and inter-cluster model consensus. First, we develop an adaptive hidden semi-Markov model (AHSMM) to estimate clients’ communication states and contributions. Next, we address a client-server association optimization problem to minimize global training loss. Using convergence analysis, we propose a greedy client scheduling algorithm. Finally, our experiments conducted on real-world datasets demonstrate that TRAIL outperforms state-of-the-art baselines, achieving an improvement of 8.7% in test accuracy and a reduction of 15.3% in training loss.
[AI-40] Theoretical Analysis of Quality Diversity Algorithms for a Classical Path Planning Problem
链接: https://arxiv.org/abs/2412.11446
作者: Duc-Cuong Dang,Aneta Neumann,Frank Neumann,Andre Opris,Dirk Sudholt
关键词: high quality solutions, Quality diversity, high quality, combinatorial optimisation, shown to provide
类目: Artificial Intelligence (cs.AI); Neural and Evolutionary Computing (cs.NE)
*备注:
点击查看摘要
Abstract:Quality diversity (QD) algorithms have been shown to provide sets of high quality solutions for challenging problems in robotics, games, and combinatorial optimisation. So far, theoretical foundations explaining their good behaviour in practice lag far behind their practical success. We contribute to the theoretical understanding of these algorithms and study the behaviour of QD algorithms for a classical planning problem seeking several solutions. We study the all-pairs-shortest-paths (APSP) problem which gives a natural formulation of the behavioural space based on all pairs of nodes of the given input graph that can be used by Map-Elites QD algorithms. Our results show that Map-Elites QD algorithms are able to compute a shortest path for each pair of nodes efficiently in parallel. Furthermore, we examine parent selection techniques for crossover that exhibit significant speed ups compared to the standard QD approach.
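To make the Map-Elites formulation concrete, below is a toy sketch in which each node pair (u, v) is an archive cell holding the best path found so far; offspring generation is reduced to fresh random walks, unlike the mutation and crossover operators analysed in the paper.

```python
import random

def map_elites_apsp(graph, iterations=100_000, seed=0):
    """Toy sketch: graph is {u: {v: weight}}; the archive maps each node
    pair (u, v) to the shortest path discovered for that cell."""
    random.seed(seed)
    nodes = list(graph)
    archive = {}  # (u, v) -> (cost, path)

    def cost(path):
        return sum(graph[a][b] for a, b in zip(path, path[1:]))

    def random_walk(u, v, max_len=None):
        max_len = max_len or len(nodes)
        path, cur = [u], u
        for _ in range(max_len):
            if cur == v:
                return path
            if not graph[cur]:
                return None
            cur = random.choice(list(graph[cur]))
            if cur in path:
                return None
            path.append(cur)
        return path if path[-1] == v else None

    for _ in range(iterations):
        u, v = random.sample(nodes, 2)
        candidate = random_walk(u, v)
        if candidate is None:
            continue
        c = cost(candidate)
        # elitism per cell: keep the shortest path found for this (u, v) pair
        if (u, v) not in archive or c < archive[(u, v)][0]:
            archive[(u, v)] = (c, candidate)
    return archive
```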
[AI-41] Bayesian Flow Is All You Need to Sample Out-of-Distribution Chemical Spaces
链接: https://arxiv.org/abs/2412.11439
作者: Nianze Tao
关键词: drug design, molecules with higher, higher properties, training space, Abstract
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Chemical Physics (physics.chem-ph)
*备注:
点击查看摘要
Abstract:Generating novel molecules with higher properties than the training space, namely the out-of-distribution generation, is important for de novo drug design. However, it is not easy for distribution learning-based models, for example diffusion models, to solve this challenge as these methods are designed to fit the distribution of training data as closely as possible. In this paper, we show that Bayesian flow network is capable of effortlessly generating high quality out-of-distribution samples that meet several scenarios. We introduce a semi-autoregressive training/sampling method that helps to enhance the model performance and surpass the state-of-the-art models.
[AI-42] Auto-bidding in real-time auctions via Oracle Imitation Learning
链接: https://arxiv.org/abs/2412.11434
作者: Alberto Chiappa,Briti Gangopadhyay,Zhao Wang,Shingo Takamatsu
关键词: successful business models, internet era, successful business, business models, Online advertising
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:Online advertising has become one of the most successful business models of the internet era. Impression opportunities are typically allocated through real-time auctions, where advertisers bid to secure advertisement slots. Deciding the best bid for an impression opportunity is challenging, due to the stochastic nature of user behavior and the variability of advertisement traffic over time. In this work, we propose a framework for training auto-bidding agents in multi-slot second-price auctions to maximize acquisitions (e.g., clicks, conversions) while adhering to budget and cost-per-acquisition (CPA) constraints. We exploit the insight that, after an advertisement campaign concludes, determining the optimal bids for each impression opportunity can be framed as a multiple-choice knapsack problem (MCKP) with a nonlinear objective. We propose an “oracle” algorithm that identifies a near-optimal combination of impression opportunities and advertisement slots, considering both past and future advertisement traffic data. This oracle solution serves as a training target for a student network which bids having access only to real-time information, a method we term Oracle Imitation Learning (OIL). Through numerical experiments, we demonstrate that OIL achieves superior performance compared to both online and offline reinforcement learning algorithms, offering improved sample efficiency. Notably, OIL shifts the complexity of training auto-bidding agents from crafting sophisticated learning algorithms to solving a nonlinear constrained optimization problem efficiently.
[AI-43] Towards Scientific Discovery with Generative AI: Progress, Opportunities and Challenges AAAI2025
链接: https://arxiv.org/abs/2412.11427
作者: Chandan K Reddy,Parshin Shojaee
关键词: complex cognitive process, driven human knowledge, Scientific discovery, complex cognitive, cognitive process
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: Accepted to AAAI 2025
点击查看摘要
Abstract:Scientific discovery is a complex cognitive process that has driven human knowledge and technological progress for centuries. While artificial intelligence (AI) has made significant advances in automating aspects of scientific reasoning, simulation, and experimentation, we still lack integrated AI systems capable of performing autonomous long-term scientific research and discovery. This paper examines the current state of AI for scientific discovery, highlighting recent progress in large language models and other AI techniques applied to scientific tasks. We then outline key challenges and promising research directions toward developing more comprehensive AI systems for scientific discovery, including the need for science-focused AI agents, improved benchmarks and evaluation metrics, multimodal scientific representations, and unified frameworks combining reasoning, theorem proving, and data-driven modeling. Addressing these challenges could lead to transformative AI tools to accelerate progress across disciplines towards scientific discovery.
[AI-44] RL-LLM-DT: An Automatic Decision Tree Generation Method Based on RL Evaluation and LLM Enhancement
链接: https://arxiv.org/abs/2412.11417
作者: Junjie Lin,Jian Zhao,Yue Deng,Youpeng Zhao,Wengang Zhou,Houqiang Li
关键词: decision tree, decision, tree, two-player zero-sum games, Large Language Models
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: Length:10 pages. Figures:10 figures. Additional Notes:In this paper, we have introduced a novel hybrid approach which leverages the strengths of both RL and LLMs to iteratively refine decision tree tactics, enhancing their performance and adaptability
点击查看摘要
Abstract:Traditionally, AI development for two-player zero-sum games has relied on two primary techniques: decision trees and reinforcement learning (RL). A common approach involves using a fixed decision tree as one player’s strategy while training an RL agent as the opponent to identify vulnerabilities in the decision tree, thereby improving its strategic strength iteratively. However, this process often requires significant human intervention to refine the decision tree after identifying its weaknesses, resulting in inefficiencies and hindering full automation of the strategy enhancement process. Fortunately, the advent of Large Language Models (LLMs) offers a transformative opportunity to automate the process. We propose RL-LLM-DT, an automatic decision tree generation method based on RL Evaluation and LLM Enhancement. Given an initial decision tree, the method involves two important iterative steps: (1) Response Policy Search, where RL is used to discover counter-strategies targeting the decision tree; and (2) Policy Improvement, where LLMs analyze failure scenarios and generate improved decision tree code. In our method, RL focuses on finding the decision tree’s flaws while LLM is prompted to generate an improved version of the decision tree. The iterative refinement process terminates when RL can’t find any flaw of the tree or LLM fails to improve the tree. To evaluate the effectiveness of this integrated approach, we conducted experiments in a curling game. After iterative refinements, our curling AI based on the decision tree ranks first on the Jidi platform among 34 curling AIs in total, which demonstrates that LLMs can significantly enhance the robustness and adaptability of decision trees, representing a substantial advancement in the field of Game AI. Our code is available at this https URL.
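A hedged sketch of the iterative RL-evaluate / LLM-improve loop follows; the two callbacks are placeholders standing in for the paper's components, not real APIs.

```python
def rl_llm_dt(initial_tree_code, train_rl_opponent, llm_improve, max_rounds=10):
    """Hedged sketch of the RL-LLM-DT refinement loop.

    Assumed interfaces (illustrative only):
      train_rl_opponent(tree_code) -> (rl_win_rate, failure_logs)
      llm_improve(tree_code, failure_logs) -> improved tree code or None
    """
    tree = initial_tree_code
    for _ in range(max_rounds):
        rl_win_rate, failures = train_rl_opponent(tree)
        if rl_win_rate <= 0.5:   # RL can no longer exploit the decision tree
            break
        improved = llm_improve(tree, failures)
        if improved is None:     # LLM fails to improve the tree
            break
        tree = improved
    return tree
```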
[AI-45] Federated Domain Generalization with Label Smoothing and Balanced Decentralized Training
链接: https://arxiv.org/abs/2412.11408
作者: Milad Soltany,Farhad Pourpanah,Mahdiyar Molahasani,Michael Greenspan,Ali Etemad
关键词: federated learning framework, Federated Domain Generalization, Balanced Decentralized Training, Label Smoothing, utilizes label smoothing
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:In this paper, we propose a novel approach, Federated Domain Generalization with Label Smoothing and Balanced Decentralized Training (FedSB), to address the challenges of data heterogeneity within a federated learning framework. FedSB utilizes label smoothing at the client level to prevent overfitting to domain-specific features, thereby enhancing generalization capabilities across diverse domains when aggregating local models into a global model. Additionally, FedSB incorporates a decentralized budgeting mechanism which balances training among clients, which is shown to improve the performance of the aggregated global model. Extensive experiments on four commonly used multi-domain datasets, PACS, VLCS, OfficeHome, and TerraInc, demonstrate that FedSB outperforms competing methods, achieving state-of-the-art results on three out of four datasets, indicating the effectiveness of FedSB in addressing data heterogeneity.
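For reference, the client-level label smoothing mentioned above typically amounts to the standard smoothed cross-entropy shown below; the smoothing value is an assumption, not the paper's setting.

```python
import torch
import torch.nn.functional as F

def label_smoothing_loss(logits, targets, smoothing=0.1):
    """Standard label-smoothing cross-entropy, as could be applied at the
    client level to discourage overfitting to domain-specific features."""
    n_classes = logits.size(-1)
    log_probs = F.log_softmax(logits, dim=-1)
    with torch.no_grad():
        true_dist = torch.full_like(log_probs, smoothing / (n_classes - 1))
        true_dist.scatter_(1, targets.unsqueeze(1), 1.0 - smoothing)
    return torch.mean(torch.sum(-true_dist * log_probs, dim=-1))
```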
[AI-46] How Can LLMs and Knowledge Graphs Contribute to Robot Safety? A Few-Shot Learning Approach
链接: https://arxiv.org/abs/2412.11387
作者: Abdulrahman Althobaiti,Angel Ayala,JingYing Gao,Ali Almutairi,Mohammad Deghat,Imran Razzak,Francisco Cruz
关键词: Large Language Models, execute natural language, natural language instructions, Large Language, natural language
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:Large Language Models (LLMs) are transforming the robotics domain by enabling robots to comprehend and execute natural language instructions. The cornerstone benefits of LLM include processing textual data from technical manuals, instructions, academic papers, and user queries based on the knowledge provided. However, deploying LLM-generated code in robotic systems without safety verification poses significant risks. This paper outlines a safety layer that verifies the code generated by ChatGPT before executing it to control a drone in a simulated environment. The safety layer consists of a fine-tuned GPT-4o model using Few-Shot learning, supported by knowledge graph prompting (KGP). Our approach improves the safety and compliance of robotic actions, ensuring that they adhere to the regulations of drone operations.
[AI-47] Individual Bus Trip Chain Prediction and Pattern Identification Considering Similarities
链接: https://arxiv.org/abs/2412.11364
作者: Xiannan Huang,Yixin Chen,Quan Yuan,Chao Yang
关键词: public transit systems, Predicting future bus, Predicting future, bus trip chains, future bus trip
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:Predicting future bus trip chains for an existing user is of great significance for operators of public transit systems. Existing methods always treat this task as a time-series prediction problem, but the 1-dimensional time series structure cannot express the complex relationship between trips. To better capture the inherent patterns in bus travel behavior, this paper proposes a novel approach that synthesizes future bus trip chains based on those from similar days. Key similarity patterns are defined and tested using real-world data, and a similarity function is then developed to capture these patterns. Afterwards, a graph is constructed where each day is represented as a node and edge weight reflects the similarity between days. Besides, the trips on a given day can be regarded as labels for each node, transferring the bus trip chain prediction problem to a semi-supervised classification problem on a graph. To address this, we propose several methods and validate them on a real-world dataset of 10000 bus users, achieving state-of-the-art prediction results. Analyzing the parameters of the similarity function reveals some interesting bus usage patterns, allowing us to cluster bus users into three types: repeat-dominated, evolve-dominated and repeat-evolve balanced. In summary, our work demonstrates the effectiveness of similarity-based prediction for bus trip chains and provides a new perspective for analyzing individual bus travel patterns. The code for our prediction model is publicly available.
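A hedged sketch of the day-similarity graph construction: each day becomes a node and edges carry pairwise similarity weights. The feature encoding, similarity function and threshold are illustrative assumptions.

```python
import numpy as np

def build_day_graph(day_features, similarity, threshold=0.5):
    """Hypothetical sketch: build a weighted adjacency matrix over days,
    connecting days whose pairwise similarity exceeds a threshold."""
    n = len(day_features)
    weights = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            w = similarity(day_features[i], day_features[j])
            if w >= threshold:
                weights[i, j] = weights[j, i] = w
    return weights
```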
[AI-48] An Empirical Study of Fault Localisation Techniques for Deep Learning
链接: https://arxiv.org/abs/2412.11304
作者: Nargiz Humbatova,Jinhan Kim,Gunel Jahangirova,Shin Yoo,Paolo Tonella
关键词: Deep Neural Networks, Neural Networks, Deep Neural, popularity of Deep, testing and debugging
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:With the increased popularity of Deep Neural Networks (DNNs), the need for tools to assist developers in the DNN implementation, testing and debugging process also increases. Several approaches have been proposed that automatically analyse and localise potential faults in DNNs under test. In this work, we evaluate and compare existing state-of-the-art fault localisation techniques, which operate based on both dynamic and static analysis of the DNN. The evaluation is performed on a benchmark consisting of both real faults obtained from bug reporting platforms and faulty models produced by a mutation tool. Our findings indicate that the usage of a single, specific ground truth (e.g., the human defined one) for the evaluation of DNN fault localisation tools results in pretty low performance (maximum average recall of 0.31 and precision of 0.23). However, such figures increase when considering alternative, equivalent patches that exist for a given faulty DNN. Results indicate that \dfd is the most effective tool, achieving an average recall of 0.61 and precision of 0.41 on our benchmark.
[AI-49] Semi-Implicit Neural Ordinary Differential Equations
链接: https://arxiv.org/abs/2412.11301
作者: Hong Zhang,Ying Liu,Romit Maulik
关键词: Classical neural ODEs, Classical neural, neural ODEs trained, stiff learning problems, scientific machine learning
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:Classical neural ODEs trained with explicit methods are intrinsically limited by stability, crippling their efficiency and robustness for stiff learning problems that are common in graph learning and scientific machine learning. We present a semi-implicit neural ODE approach that exploits the partitionable structure of the underlying dynamics. Our technique leads to an implicit neural network with significant computational advantages over existing approaches because of enhanced stability and efficient linear solves during time integration. We show that our approach outperforms existing approaches on a variety of applications including graph classification and learning complex dynamical systems. We also demonstrate that our approach can train challenging neural ODEs where both explicit methods and fully implicit methods are intractable.
[AI-50] A Comparative Study on Dynamic Graph Embedding based on Mamba and Transformers
链接: https://arxiv.org/abs/2412.11293
作者: Ashish Parmanand Pandey,Alan John Varghese,Sarang Patil,Mengjia Xu
关键词: Dynamic graph embedding, diverse domains, Dynamic graph, important technique, graph
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: 18 pages, 6 figures
点击查看摘要
Abstract:Dynamic graph embedding has emerged as an important technique for modeling complex time-evolving networks across diverse domains. While transformer-based models have shown promise in capturing long-range dependencies in temporal graph data, they face scalability challenges due to quadratic computational complexity. This study presents a comparative analysis of dynamic graph embedding approaches using transformers and the recently proposed Mamba architecture, a state-space model with linear complexity. We introduce three novel models: TransformerG2G augmented with graph convolutional networks, DG-Mamba, and GDG-Mamba with graph isomorphism network edge convolutions. Our experiments on multiple benchmark datasets demonstrate that Mamba-based models achieve comparable or superior performance to transformer-based approaches in link prediction tasks while offering significant computational efficiency gains on longer sequences. Notably, DG-Mamba variants consistently outperform transformer-based models on datasets with high temporal variability, such as UCI, Bitcoin, and Reality Mining, while maintaining competitive performance on more stable graphs like SBM. We provide insights into the learned temporal dependencies through analysis of attention weights and state matrices, revealing the models’ ability to capture complex temporal patterns. By effectively combining state-space models with graph neural networks, our work addresses key limitations of previous approaches and contributes to the growing body of research on efficient temporal graph representation learning. These findings offer promising directions for scaling dynamic graph embedding to larger, more complex real-world networks, potentially enabling new applications in areas such as social network analysis, financial modeling, and biological system dynamics.
[AI-51] Wearable Accelerometer Foundation Models for Health via Knowledge Distillation
链接: https://arxiv.org/abs/2412.11276
作者: Salar Abbaspourazad,Anshuman Mishra,Joseph Futoma,Andrew C. Miller,Ian Shapiro
关键词: Modern wearable devices, Modern wearable, daily living, ultimately enabling, continuously record
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Signal Processing (eess.SP)
*备注:
点击查看摘要
Abstract:Modern wearable devices can conveniently and continuously record various biosignals in the many different environments of daily living, ultimately enabling a rich view of individual health. However, not all biosignals are the same: high-fidelity measurements, such as photoplethysmography (PPG), contain more physiological information, but require optical sensors with a high power footprint. In a resource-constrained setting, such biosignals may be unavailable. Alternatively, a lower-fidelity biosignal, such as accelerometry that captures minute cardiovascular information during low-motion periods, has a significantly smaller power footprint and is available in almost any wearable device. Here, we demonstrate that we can distill representational knowledge across biosignals, i.e., from PPG to accelerometry, using 20 million minutes of unlabeled data, collected from ~172K participants in the Apple Heart and Movement Study under informed consent. We first pre-train PPG encoders via self-supervised learning, and then distill their representational knowledge to accelerometry encoders. We demonstrate strong cross-modal alignment on unseen data, e.g., 99.2% top-1 accuracy for retrieving PPG embeddings from accelerometry embeddings. We show that distilled accelerometry encoders have significantly more informative representations compared to self-supervised or supervised encoders trained directly on accelerometry data, observed by at least 23%-49% improved performance for predicting heart rate and heart rate variability. We also show that distilled accelerometry encoders are readily predictive of a wide array of downstream health targets, i.e., they are generalist foundation models. We believe accelerometry foundation models for health may unlock new opportunities for developing digital biomarkers from any wearable device, and help individuals track their health more frequently and conveniently.
[AI-52] Do Tutors Learn from Equity Training and Can Generative AI Assess It?
链接: https://arxiv.org/abs/2412.11255
作者: Danielle R. Thomas,Conrad Borchers,Sanjit Kakarla,Jionghao Lin,Shambhavi Bhushan,Boyuan Guo,Erin Gatz,Kenneth R. Koedinger
关键词: core concern, assess equity skills, Equity, learning analytics, assess equity
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
*备注: Full research paper accepted to Learning Analytics and Knowledge (LAK 2025)
点击查看摘要
Abstract:Equity is a core concern of learning analytics. However, applications that teach and assess equity skills, particularly at scale, are lacking, often due to barriers in evaluating language. Advances in generative AI via large language models (LLMs) are being used in a wide range of applications, with this present work assessing its use in the equity domain. We evaluate tutor performance within an online lesson on enhancing tutors’ skills when responding to students in potentially inequitable situations. We apply a mixed-method approach to analyze the performance of 81 undergraduate remote tutors. We find marginally significant learning gains with increases in tutors’ self-reported confidence in their knowledge in responding to middle school students experiencing possible inequities from pretest to posttest. Both GPT-4o and GPT-4-turbo demonstrate proficiency in assessing tutors’ ability to predict and explain the best approach. Balancing performance, efficiency, and cost, we determine that few-shot learning using GPT-4o is the preferred model. This work makes available a dataset of lesson log data, tutor responses, rubrics for human annotation, and generative AI prompts. Future work involves leveling the difficulty among scenarios and enhancing LLM prompts for large-scale grading and assessment.
[AI-53] Are Expressive Models Truly Necessary for Offline RL?
链接: https://arxiv.org/abs/2412.11253
作者: Guan Wang,Haoyi Niu,Jianxiong Li,Li Jiang,Jianming Hu,Xianyuan Zhan
关键词: offline reinforcement learning, gained increasing popularity, notoriously difficult credit, difficult credit assignment, credit assignment challenge
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: Instead of relying on expressive models, shallow MLPs can also excel in long sequential decision-making tasks with Recursive Skip-Step Planning (RSP)
点击查看摘要
Abstract:Among various branches of offline reinforcement learning (RL) methods, goal-conditioned supervised learning (GCSL) has gained increasing popularity as it formulates the offline RL problem as a sequential modeling task, therefore bypassing the notoriously difficult credit assignment challenge of value learning in conventional RL paradigm. Sequential modeling, however, requires capturing accurate dynamics across long horizons in trajectory data to ensure reasonable policy performance. To meet this requirement, leveraging large, expressive models has become a popular choice in recent literature, which, however, comes at the cost of significantly increased computation and inference latency. Contradictory yet promising, we reveal that lightweight models as simple as shallow 2-layer MLPs can also enjoy accurate dynamics consistency and significantly reduced sequential modeling errors against large expressive models by adopting a simple recursive planning scheme: recursively planning coarse-grained future sub-goals based on current and target information, and then executing the action with a goal-conditioned policy learned from data relabeled with these sub-goal ground truths. We term our method Recursive Skip-Step Planning (RSP). Simple yet effective, RSP enjoys great efficiency improvements thanks to its lightweight structure, and substantially outperforms existing methods, reaching new SOTA performances on the D4RL benchmark, especially in multi-stage long-horizon tasks.
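A minimal sketch of the recursive coarse-to-fine sub-goal idea behind RSP; the planner and policy interfaces are assumed placeholders, not the authors' code.

```python
def recursive_skip_step_plan(state, goal, subgoal_planner, policy, depth=3):
    """Hedged sketch of Recursive Skip-Step Planning (RSP).

    Assumed interfaces (illustrative only):
      subgoal_planner(state, goal) -> a coarser intermediate sub-goal
      policy(state, subgoal)       -> the next action to execute
    """
    subgoal = goal
    for _ in range(depth):                # refine sub-goals coarse-to-fine
        subgoal = subgoal_planner(state, subgoal)
    return policy(state, subgoal)
```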
[AI-54] Transformer-Based Bearing Fault Detection using Temporal Decomposition Attention Mechanism
链接: https://arxiv.org/abs/2412.11245
作者: Marzieh Mirzaeibonehkhater,Mohammad Ali Labbaf-Khaniki,Mohammad Manthouri
关键词: prevent costly downtime, timely fault identification, Bearing fault detection, Traditional attention mechanisms, predictive maintenance
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Signal Processing (eess.SP)
*备注:
点击查看摘要
Abstract:Bearing fault detection is a critical task in predictive maintenance, where accurate and timely fault identification can prevent costly downtime and equipment damage. Traditional attention mechanisms in Transformer neural networks often struggle to capture the complex temporal patterns in bearing vibration data, leading to suboptimal performance. To address this limitation, we propose a novel attention mechanism, Temporal Decomposition Attention (TDA), which combines temporal bias encoding with seasonal-trend decomposition to capture both long-term dependencies and periodic fluctuations in time series data. Additionally, we incorporate the Hull Exponential Moving Average (HEMA) for feature extraction, enabling the model to effectively capture meaningful characteristics from the data while reducing noise. Our approach integrates TDA into the Transformer architecture, allowing the model to focus separately on the trend and seasonal components of the data. Experimental results on the Case Western Reserve University (CWRU) bearing fault detection dataset demonstrate that our approach outperforms traditional attention mechanisms and achieves state-of-the-art performance in terms of accuracy and interpretability. The HEMA-Transformer-TDA model achieves an accuracy of 98.1%, with exceptional precision, recall, and F1-scores, demonstrating its effectiveness in bearing fault detection and its potential for application in other time series tasks with seasonal patterns or trends.
[AI-55] Learning Set Functions with Implicit Differentiation AAAI2025
链接: https://arxiv.org/abs/2412.11239
作者: Gözde Özcan,Chengzhi Shi,Stratis Ioannidis
关键词: optimal subset oracle, learning set functions, Abstract, so-called optimal subset, underlying utility function
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: 19 pages, 1 figure, accepted to AAAI 2025
点击查看摘要
Abstract:Ou et al. (2022) introduce the problem of learning set functions from data generated by a so-called optimal subset oracle. Their approach approximates the underlying utility function with an energy-based model, whose parameters are estimated via mean-field variational inference. Ou et al. (2022) show this reduces to fixed point iterations; however, as the number of iterations increases, automatic differentiation quickly becomes computationally prohibitive due to the size of the Jacobians that are stacked during backpropagation. We address this challenge with implicit differentiation and examine the convergence conditions for the fixed-point iterations. We empirically demonstrate the efficiency of our method on synthetic and real-world subset selection applications including product recommendation, set anomaly detection and compound selection tasks.
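As a reference point for the implicit-differentiation step, the implicit function theorem gives dx*/dθ = (I - ∂f/∂x)^{-1} ∂f/∂θ at a fixed point x* = f(x*, θ). Below is a hedged PyTorch sketch for small dense problems; the paper's mean-field setting and Jacobian structure are not reproduced.

```python
import torch

def implicit_fixed_point_grad(f, x_star, theta):
    """Hedged sketch: differentiate through a fixed point x* = f(x*, theta)
    by solving a linear system instead of backpropagating through the
    unrolled iterations. Assumes x_star (n,) and theta (m,) are 1-D tensors."""
    dfdx = torch.autograd.functional.jacobian(lambda x: f(x, theta), x_star)   # (n, n)
    dfdth = torch.autograd.functional.jacobian(lambda t: f(x_star, t), theta)  # (n, m)
    eye = torch.eye(x_star.numel())
    return torch.linalg.solve(eye - dfdx, dfdth)  # dx*/dtheta, shape (n, m)
```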
[AI-56] Neural Port-Hamiltonian Differential Algebraic Equations for Compositional Learning of Electrical Networks
链接: https://arxiv.org/abs/2412.11215
作者: Cyrus Neary,Nathan Tsao,Ufuk Topcu
关键词: coupled dynamical systems, dynamical systems, develop compositional learning, coupled dynamical, deep learning
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Systems and Control (eess.SY)
*备注:
点击查看摘要
Abstract:We develop compositional learning algorithms for coupled dynamical systems. While deep learning has proven effective at modeling complex relationships from data, compositional couplings between system components typically introduce algebraic constraints on state variables, posing challenges to many existing data-driven approaches to modeling dynamical systems. Towards developing deep learning models for constrained dynamical systems, we introduce neural port-Hamiltonian differential algebraic equations (N-PHDAEs), which use neural networks to parametrize unknown terms in both the differential and algebraic components of a port-Hamiltonian DAE. To train these models, we propose an algorithm that uses automatic differentiation to perform index reduction, automatically transforming the neural DAE into an equivalent system of neural ordinary differential equations (N-ODEs), for which established model inference and backpropagation methods exist. The proposed compositional modeling framework and learning algorithms may be applied broadly to learn control-oriented models of dynamical systems in a variety of application areas, however, in this work, we focus on their application to the modeling of electrical networks. Experiments simulating the dynamics of nonlinear circuits exemplify the benefits of our approach: the proposed N-PHDAE model achieves an order of magnitude improvement in prediction accuracy and constraint satisfaction when compared to a baseline N-ODE over long prediction time horizons. We also validate the compositional capabilities of our approach through experiments on a simulated D.C. microgrid: we train individual N-PHDAE models for separate grid components, before coupling them to accurately predict the behavior of larger-scale networks.
[AI-57] ProFe: Communication-Efficient Decentralized Federated Learning via Distillation and Prototypes
链接: https://arxiv.org/abs/2412.11207
作者: Pedro Miguel Sánchez Sánchez,Enrique Tomás Martínez Beltrán,Miguel Fernández Llamas,Gérôme Bovet,Gregorio Martínez Pérez,Alberto Huertas Celdrán
关键词: Decentralized Federated Learning, Decentralized Federated, Federated Learning, removing model centralization, model centralization risks
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC); Networking and Internet Architecture (cs.NI)
*备注:
点击查看摘要
Abstract:Decentralized Federated Learning (DFL) trains models in a collaborative and privacy-preserving manner while removing model centralization risks and improving communication bottlenecks. However, DFL faces challenges in efficient communication management and model aggregation within decentralized environments, especially with heterogeneous data distributions. Thus, this paper introduces ProFe, a novel communication optimization algorithm for DFL that combines knowledge distillation, prototype learning, and quantization techniques. ProFe utilizes knowledge from large local models to train smaller ones for aggregation, incorporates prototypes to better learn unseen classes, and applies quantization to reduce data transmitted during communication rounds. The performance of ProFe has been validated and compared to the literature by using benchmark datasets like MNIST, CIFAR10, and CIFAR100. Results showed that the proposed algorithm reduces communication costs by up to ~40-50% while maintaining or improving model performance. In addition, it adds ~20% training time due to increased complexity, generating a trade-off.
[AI-58] SoK: On Closing the Applicability Gap in Automated Vulnerability Detection
链接: https://arxiv.org/abs/2412.11194
作者: Ezzeldin Shereen,Dan Ristea,Sanyam Vyas,Shae McFadden,Madeleine Dwyer,Chris Hicks,Vasilios Mavroudis
关键词: proprietary software underscores, DARPA Artificial Intelligence, Artificial Intelligence Cyber, Intelligence Cyber Challenge, Automated Vulnerability Detection
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:The frequent discovery of security vulnerabilities in both open-source and proprietary software underscores the urgent need for earlier detection during the development lifecycle. Initiatives such as DARPA’s Artificial Intelligence Cyber Challenge (AIxCC) aim to accelerate Automated Vulnerability Detection (AVD), seeking to address this challenge by autonomously analyzing source code to identify vulnerabilities. This paper addresses two primary research questions: (RQ1) How is current AVD research distributed across its core components? (RQ2) What key areas should future research target to bridge the gap in the practical applicability of AVD throughout software development? To answer these questions, we conduct a systematization over 79 AVD articles and 17 empirical studies, analyzing them across five core components: task formulation and granularity, input programming languages and representations, detection approaches and key solutions, evaluation metrics and datasets, and reported performance. Our systematization reveals that the narrow focus of AVD research, mainly on specific tasks and programming languages, limits its practical impact and overlooks broader areas crucial for effective, real-world vulnerability detection. We identify significant challenges, including the need for diversified problem formulations, varied detection granularities, broader language support, better dataset quality, enhanced reproducibility, and increased practical impact. Based on these findings, we identify research directions that will enhance the effectiveness and applicability of AVD solutions in software security.
[AI-59] Partial Identifiability in Inverse Reinforcement Learning For Agents With Non-Exponential Discounting
链接: https://arxiv.org/abs/2412.11155
作者: Joar Skalse,Alessandro Abate
关键词: inverse reinforcement learning, reinforcement learning, aim of inverse, inverse reinforcement, IRL
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:The aim of inverse reinforcement learning (IRL) is to infer an agent’s preferences from observing their behaviour. Usually, preferences are modelled as a reward function, R, and behaviour is modelled as a policy, π. One of the central difficulties in IRL is that multiple preferences may lead to the same observed behaviour. That is, R is typically underdetermined by π, which means that R is only partially identifiable. Recent work has characterised the extent of this partial identifiability for different types of agents, including optimal and Boltzmann-rational agents. However, work so far has only considered agents that discount future reward exponentially: this is a serious limitation, especially given that extensive work in the behavioural sciences suggests that humans are better modelled as discounting hyperbolically. In this work, we newly characterise partial identifiability in IRL for agents with non-exponential discounting: our results are in particular relevant for hyperbolic discounting, but they also more generally apply to agents that use other types of (non-exponential) discounting. Significantly, we show that IRL is generally unable to infer enough information about R to identify the correct optimal policy, which entails that IRL alone can be insufficient to adequately characterise the preferences of such agents.
[AI-60] ViSymRe: Vision-guided Multimodal Symbolic Regression
链接: https://arxiv.org/abs/2412.11139
作者: Da Li,Junping Yin,Jin Xu,Xinxin Li,Juan Zhang
关键词: offering enhanced interpretability, Symbolic regression, regression automatically searches, Symbolic regression automatically, reveal underlying mechanisms
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Symbolic Computation (cs.SC)
*备注:
点击查看摘要
Abstract:Symbolic regression automatically searches for mathematical equations to reveal underlying mechanisms within datasets, offering enhanced interpretability compared to black box models. Traditionally, symbolic regression has been considered to be purely numeric-driven, with insufficient attention given to the potential contributions of visual information in augmenting this process. When dealing with high-dimensional and complex datasets, existing symbolic regression models are often inefficient and tend to generate overly complex equations, making subsequent mechanism analysis complicated. In this paper, we propose the vision-guided multimodal symbolic regression model, called ViSymRe, that systematically explores how visual information can improve various metrics of symbolic regression. Compared to traditional models, our proposed model has the following innovations: (1) It integrates three modalities: vision, symbol and numeric to enhance symbolic regression, enabling the model to benefit from the strengths of each modality; (2) It establishes a meta-learning framework that can learn from historical experiences to efficiently solve new symbolic regression problems; (3) It emphasizes the simplicity and structural rationality of the equations rather than merely numerical fitting. Extensive experiments show that our proposed model exhibits strong generalization capability and noise resistance. The equations it generates outperform state-of-the-art numeric-only baselines in terms of fitting effect, simplicity and structural accuracy, thus being able to facilitate accurate mechanism analysis and the development of theoretical models.
[AI-61] Safe Reinforcement Learning using Finite-Horizon Gradient-based Estimation
链接: https://arxiv.org/abs/2412.11138
作者: Juntao Dai,Yaodong Yang,Qian Zheng,Gang Pan
关键词: Safe Reinforcement Learning, Reinforcement Learning, Safe Reinforcement, involves estimating, existing Advantage-based Estimation
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:A key aspect of Safe Reinforcement Learning (Safe RL) involves estimating the constraint condition for the next policy, which is crucial for guiding the optimization of safe policy updates. However, the existing Advantage-based Estimation (ABE) method relies on the infinite-horizon discounted advantage function. This dependence leads to catastrophic errors in finite-horizon scenarios with non-discounted constraints, resulting in safety-violation updates. In response, we propose the first estimation method for finite-horizon non-discounted constraints in deep Safe RL, termed Gradient-based Estimation (GBE), which relies on the analytic gradient derived along trajectories. Our theoretical and empirical analyses demonstrate that GBE can effectively estimate constraint changes over a finite horizon. Constructing a surrogate optimization problem with GBE, we developed a novel Safe RL algorithm called Constrained Gradient-based Policy Optimization (CGPO). CGPO identifies feasible optimal policies by iteratively resolving sub-problems within trust regions. Our empirical results reveal that CGPO, unlike baseline algorithms, successfully estimates the constraint functions of subsequent policies, thereby ensuring the efficiency and feasibility of each update.
[AI-62] Paid with Models: Optimal Contract Design for Collaborative Machine Learning AAAI2025
链接: https://arxiv.org/abs/2412.11122
作者: Bingchen Wang,Zhaoxuan Wu,Fusheng Liu,Bryan Kian Hsiang Low
关键词: Collaborative machine learning, democratizing advanced technologies, Collaborative machine, machine learning, promising paradigm
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Science and Game Theory (cs.GT); Theoretical Economics (econ.TH)
*备注: Accepted for publication in AAAI 2025
点击查看摘要
Abstract:Collaborative machine learning (CML) provides a promising paradigm for democratizing advanced technologies by enabling cost-sharing among participants. However, the potential for rent-seeking behaviors among parties can undermine such collaborations. Contract theory presents a viable solution by rewarding participants with models of varying accuracy based on their contributions. However, unlike monetary compensation, using models as rewards introduces unique challenges, particularly due to the stochastic nature of these rewards when contribution costs are privately held information. This paper formalizes the optimal contracting problem within CML and proposes a transformation that simplifies the non-convex optimization problem into one that can be solved through convex optimization algorithms. We conduct a detailed analysis of the properties that an optimal contract must satisfy when models serve as the rewards, and we explore the potential benefits and welfare implications of these contract-driven CML schemes through numerical experiments.
[AI-63] Latent Reward: LLM-Empowered Credit Assignment in Episodic Reinforcement Learning
链接: https://arxiv.org/abs/2412.11120
作者: Yun Qu,Yuhang Jiang,Boyuan Wang,Yixiu Mao,Cheems Wang,Chang Liu,Xiangyang Ji
关键词: Reinforcement learning, Large Language Model, real-world applications, encounters delayed, delayed and sparse
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:Reinforcement learning (RL) often encounters delayed and sparse feedback in real-world applications, even with only episodic rewards. Previous approaches have made some progress in reward redistribution for credit assignment but still face challenges, including training difficulties due to redundancy and ambiguous attributions stemming from overlooking the multifaceted nature of mission performance evaluation. Hopefully, Large Language Model (LLM) encompasses fruitful decision-making knowledge and provides a plausible tool for reward redistribution. Even so, deploying LLM in this case is non-trivial due to the misalignment between linguistic knowledge and the symbolic form requirement, together with inherent randomness and hallucinations in inference. To tackle these issues, we introduce LaRe, a novel LLM-empowered symbolic-based decision-making framework, to improve credit assignment. Key to LaRe is the concept of the Latent Reward, which works as a multi-dimensional performance evaluation, enabling more interpretable goal attainment from various perspectives and facilitating more effective reward redistribution. We examine that semantically generated code from LLM can bridge linguistic knowledge and symbolic latent rewards, as it is executable for symbolic objects. Meanwhile, we design latent reward self-verification to increase the stability and reliability of LLM inference. Theoretically, reward-irrelevant redundancy elimination in the latent reward benefits RL performance from more accurate reward estimation. Extensive experimental results witness that LaRe (i) achieves superior temporal credit assignment to SOTA methods, (ii) excels in allocating contributions among multiple agents, and (iii) outperforms policies trained with ground truth rewards for certain tasks.
[AI-64] ABC3: Active Bayesian Causal Inference with Cohn Criteria in Randomized Experiments AAAI2025
链接: https://arxiv.org/abs/2412.11104
作者: Taehun Cha,Donghun Lee
关键词: observational study, facto method, method to overcome, issues in observational, causal inference
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Methodology (stat.ME)
*备注: AAAI 2025
点击查看摘要
Abstract:In causal inference, randomized experiment is a de facto method to overcome various theoretical issues in observational study. However, the experimental design requires expensive costs, so an efficient experimental design is necessary. We propose ABC3, a Bayesian active learning policy for causal inference. We show a policy minimizing an estimation error on conditional average treatment effect is equivalent to minimizing an integrated posterior variance, similar to Cohn criteria (Cohn, 1994). We theoretically prove ABC3 also minimizes an imbalance between the treatment and control groups and the type 1 error probability. Imbalance-minimizing characteristic is especially notable as several works have emphasized the importance of achieving balance. Through extensive experiments on real-world data sets, ABC3 achieves the highest efficiency, while empirically showing the theoretical results hold.
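As a rough illustration of the acquisition rule described in this abstract, the sketch below uses scikit-learn Gaussian processes to choose, for the next experimental unit, the treatment arm that minimizes the integrated posterior variance over a covariate pool. The data, kernel, and two-GP setup are assumptions made for illustration; this is not the authors' ABC3 implementation.

```python
# Minimal sketch of an ABC3-style acquisition step: assign the next unit to
# the arm (treatment or control) whose hypothetical addition minimizes the
# integrated posterior variance of the two outcome surfaces over a pool.
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

rng = np.random.default_rng(0)
X_pool = rng.uniform(-1, 1, size=(200, 2))   # covariates to integrate over
X_obs = rng.uniform(-1, 1, size=(10, 2))     # units already assigned
t_obs = np.tile([0, 1], 5)                   # 1 = treatment, 0 = control
y_obs = rng.normal(size=10)                  # observed outcomes (toy)

def integrated_variance(X1, y1, X0, y0):
    """Sum of posterior variances of both arms over the covariate pool."""
    total = 0.0
    for X, y in [(X1, y1), (X0, y0)]:
        gp = GaussianProcessRegressor(kernel=RBF(0.5), alpha=1e-2, optimizer=None)
        gp.fit(X, y)
        _, std = gp.predict(X_pool, return_std=True)
        total += np.sum(std ** 2)
    return total

def choose_assignment(x_new):
    """Try both arms for the new unit. With a fixed kernel, the GP posterior
    variance depends only on the design points, so a placeholder outcome of
    0.0 is used for the hypothetical observation."""
    scores = []
    for t_new in (1, 0):
        t_hyp = np.append(t_obs, t_new)
        X_hyp = np.vstack([X_obs, x_new])
        y_hyp = np.append(y_obs, 0.0)
        scores.append(integrated_variance(
            X_hyp[t_hyp == 1], y_hyp[t_hyp == 1],
            X_hyp[t_hyp == 0], y_hyp[t_hyp == 0]))
    return 1 if scores[0] < scores[1] else 0

print("assign next unit to arm:", choose_assignment(rng.uniform(-1, 1, size=2)))
```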
[AI-65] GraphMoRE: Mitigating Topological Heterogeneity via Mixture of Riemannian Experts AAAI-2025
链接: https://arxiv.org/abs/2412.11085
作者: Zihao Guo,Qingyun Sun,Haonan Yuan,Xingcheng Fu,Min Zhou,Yisen Gao,Jianxin Li
关键词: topological heterogeneity, graph, Riemannian Experts, topological, inherently complex
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: Accepted by the Main Technical Track of the 39th Annual AAAI Conference on Artificial Intelligence (AAAI-2025)
点击查看摘要
Abstract:Real-world graphs have inherently complex and diverse topological patterns, known as topological heterogeneity. Most existing works learn graph representation in a single constant curvature space that is insufficient to match the complex geometric shapes, resulting in low-quality embeddings with high distortion. This also constitutes a critical challenge for graph foundation models, which are expected to uniformly handle a wide variety of diverse graph data. Recent studies have indicated that product manifold gains the possibility to address topological heterogeneity. However, the product manifold is still homogeneous, which is inadequate and inflexible for representing the mixed heterogeneous topology. In this paper, we propose a novel Graph Mixture of Riemannian Experts (GraphMoRE) framework to effectively tackle topological heterogeneity by personalized fine-grained topology geometry pattern preservation. Specifically, to minimize the embedding distortion, we propose a topology-aware gating mechanism to select the optimal embedding space for each node. By fusing the outputs of diverse Riemannian experts with learned gating weights, we construct personalized mixed curvature spaces for nodes, effectively embedding the graph into a heterogeneous manifold with varying curvatures at different points. Furthermore, to fairly measure pairwise distances between different embedding spaces, we present a concise and effective alignment strategy. Extensive experiments on real-world and synthetic datasets demonstrate that our method achieves superior performance with lower distortion, highlighting its potential for modeling complex graphs with topological heterogeneity, and providing a novel architectural perspective for graph foundation models.
[AI-66] RecSys Arena: Pair-wise Recommender System Evaluation with Large Language Models
链接: https://arxiv.org/abs/2412.11068
作者: Zhuo Wu,Qinglin Jia,Chuhan Wu,Zhaocheng Du,Shuai Wang,Zan Wang,Zhenhua Dong
关键词: Evaluating the quality, design and optimization, LLM Chatbot Arena, Evaluating, offline metrics
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:Evaluating the quality of recommender systems is critical for algorithm design and optimization. Most evaluation methods are computed based on offline metrics for quick algorithm evolution, since online experiments are usually risky and time-consuming. However, offline evaluation usually cannot fully reflect users’ preference for the outcome of different recommendation algorithms, and the results may not be consistent with online A/B test. Moreover, many offline metrics such as AUC do not offer sufficient information for comparing the subtle differences between two competitive recommender systems in different aspects, which may lead to substantial performance differences in long-term online serving. Fortunately, due to the strong commonsense knowledge and role-play capability of large language models (LLMs), it is possible to obtain simulated user feedback on offline recommendation results. Motivated by the idea of LLM Chatbot Arena, in this paper we present the idea of RecSys Arena, where the recommendation results given by two different recommender systems in each session are evaluated by an LLM judger to obtain fine-grained evaluation feedback. More specifically, for each sample we use LLM to generate a user profile description based on user behavior history or off-the-shelf profile features, which is used to guide LLM to play the role of this user and evaluate the relative preference for two recommendation results generated by different models. Through extensive experiments on two recommendation datasets in different scenarios, we demonstrate that many different LLMs not only provide general evaluation results that are highly consistent with canonical offline metrics, but also provide rich insight in many subjective aspects. Moreover, it can better distinguish different algorithms with comparable performance in terms of AUC and nDCG.
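The core mechanism, asking an LLM to role-play a profiled user and compare two recommendation lists, can be sketched as a prompt builder. The `call_llm` stub, profile text, and item lists below are placeholders; the actual RecSys Arena prompts and judging protocol are not specified in the abstract.

```python
# A minimal sketch (with a hypothetical call_llm() stub) of the pairwise
# judging idea: build a user-profile prompt and ask an LLM which of two
# recommendation lists this simulated user would prefer.
def build_judge_prompt(profile: str, list_a: list, list_b: list) -> str:
    items_a = "\n".join(f"A{i+1}. {t}" for i, t in enumerate(list_a))
    items_b = "\n".join(f"B{i+1}. {t}" for i, t in enumerate(list_b))
    return (
        "You are role-playing the following user:\n"
        f"{profile}\n\n"
        "Two recommender systems produced these lists for you.\n"
        f"List A:\n{items_a}\n\nList B:\n{items_b}\n\n"
        "As this user, state which list you prefer (A, B, or tie) and briefly "
        "explain, considering relevance, diversity, and novelty."
    )

def call_llm(prompt: str) -> str:
    # Hypothetical stub: replace with any chat-completion API of your choice.
    raise NotImplementedError

if __name__ == "__main__":
    profile = "A 30-year-old who watches documentaries and indie films on weekends."
    prompt = build_judge_prompt(
        profile,
        ["Ocean documentary", "Indie drama", "Space series"],
        ["Action blockbuster", "Reality show", "Sports recap"])
    print(prompt)  # send to call_llm(prompt) to obtain the pairwise verdict
```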
[AI-67] Set-Valued Sensitivity Analysis of Deep Neural Networks
链接: https://arxiv.org/abs/2412.11057
作者: Xin Wang,Feiling Wang,Xuegang (Jeff) Ban
关键词: solution set, DNN, DNN respond, set valued mapping, deep neural networks
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:This paper proposes a sensitivity analysis framework based on set valued mapping for deep neural networks (DNN) to understand and compute how the solutions (model weights) of DNN respond to perturbations in the training data. As a DNN may not exhibit a unique solution (minima) and the algorithm of solving a DNN may lead to different solutions with minor perturbations to input data, we focus on the sensitivity of the solution set of DNN, instead of studying a single solution. In particular, we are interested in the expansion and contraction of the set in response to data perturbations. If the change of solution set can be bounded by the extent of the data perturbation, the model is said to exhibit the Lipschitz like property. This “set-to-set” analysis approach provides a deeper understanding of the robustness and reliability of DNNs during training. Our framework incorporates both isolated and non-isolated minima, and critically, does not require the assumption that the Hessian of loss function is non-singular. By developing set-level metrics such as distance between sets, convergence of sets, derivatives of set-valued mapping, and stability across the solution set, we prove that the solution set of the Fully Connected Neural Network holds Lipschitz-like properties. For general neural networks (e.g., Resnet), we introduce a graphical-derivative-based method to estimate the new solution set following data perturbation without retraining.
[AI-68] Deployment Pipeline from Rockpool to Xylo for Edge Computing
链接: https://arxiv.org/abs/2412.11047
作者: Peng Zhou,Dylan R. Muir
关键词: Deploying Spiking Neural, Spiking Neural Networks, Rockpool framework represents, Neural Networks, high computational efficiency
类目: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:Deploying Spiking Neural Networks (SNNs) on the Xylo neuromorphic chip via the Rockpool framework represents a significant advancement in achieving ultra-low-power consumption and high computational efficiency for edge applications. This paper details a novel deployment pipeline, emphasizing the integration of Rockpool’s capabilities with Xylo’s architecture, and evaluates the system’s performance in terms of energy efficiency and accuracy. The unique advantages of the Xylo chip, including its digital spiking architecture and event-driven processing model, are highlighted to demonstrate its suitability for real-time, power-sensitive applications.
[AI-69] PromptV: Leveraging LLM-powered Multi-Agent Prompting for High-quality Verilog Generation
链接: https://arxiv.org/abs/2412.11014
作者: Zhendong Mi,Renming Zheng,Haowen Zhong,Yue Sun,Shaoyi Huang
关键词: demonstrated remarkable automated, remarkable automated Verilog, Recent advances, automated Verilog code, Verilog code generation
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Hardware Architecture (cs.AR); Software Engineering (cs.SE)
*备注:
点击查看摘要
Abstract:Recent advances in agentic LLMs have demonstrated remarkable automated Verilog code generation capabilities. However, existing approaches either demand substantial computational resources or rely on LLM-assisted single-agent prompt learning techniques, which we observe for the first time has a degeneration issue - characterized by deteriorating generative performance and diminished error detection and correction capabilities. This paper proposes a novel multi-agent prompt learning framework to address these limitations and enhance code generation quality. We show for the first time that multi-agent architectures can effectively mitigate the degeneration risk while improving code error correction capabilities, resulting in higher-quality Verilog code generation. Experimental results show that the proposed method could achieve 96.4% and 96.5% pass@10 scores on VerilogEval Machine and Human benchmarks, respectively while attaining 100% Syntax and 99.9% Functionality pass@5 metrics on the RTLLM benchmark.
[AI-70] Cocoa: Co-Planning and Co-Execution with AI Agents
链接: https://arxiv.org/abs/2412.10999
作者: K. J. Kevin Feng,Kevin Pu,Matt Latzke,Tal August,Pao Siangliulue,Jonathan Bragg,Daniel S. Weld,Amy X. Zhang,Joseph Chee Chang
关键词: interaction design pattern, multi-step tasks, document editor, present Cocoa, Cocoa
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:We present Cocoa, a system that implements a novel interaction design pattern – interactive plans – for users to collaborate with an AI agent on complex, multi-step tasks in a document editor. Cocoa harmonizes human and AI efforts and enables flexible delegation of agency through two actions: Co-planning (where users collaboratively compose a plan of action with the agent) and Co-execution (where users collaboratively execute plan steps with the agent). Using scientific research as a sample domain, we motivate the design of Cocoa through a formative study with 9 researchers while also drawing inspiration from the design of computational notebooks. We evaluate Cocoa through a user study with 16 researchers and find that when compared to a strong chat baseline, Cocoa improved agent steerability without sacrificing ease of use. A deeper investigation of the general utility of both systems uncovered insights into usage contexts where interactive plans may be more appropriate than chat, and vice versa. Our work surfaces numerous practical implications and paves new paths for interactive interfaces that foster more effective collaboration between humans and agentic AI systems.
[AI-71] MedG-KRP: Medical Graph Knowledge Representation Probing ML4H ALT
链接: https://arxiv.org/abs/2412.10982
作者: Gabriel R. Rosenbaum,Lavender Yao Jiang,Ivaxi Sheth,Jaden Stryker,Anton Alyakin,Daniel Alexander Alber,Nicolas K. Goff,Young Joon (Fred) Kwon,John Markert,Mustafa Nasir-Moin,Jan Moritz Niehues,Karl L. Sangwon,Eunice Yang,Eric Karl Oermann
关键词: powerful tools, recently emerged, emerged as powerful, Large language models, medical
类目: Artificial Intelligence (cs.AI)
*备注: Findings paper presented at Machine Learning for Health (ML4H) symposium 2024, December 15-16, 2024, Vancouver, Canada, 19 pages
点击查看摘要
Abstract:Large language models (LLMs) have recently emerged as powerful tools, finding many medical applications. LLMs’ ability to coalesce vast amounts of information from many sources to generate a response, a process similar to that of a human expert, has led many to see potential in deploying LLMs for clinical use. However, medicine is a setting where accurate reasoning is paramount. Many researchers are questioning the effectiveness of multiple choice question answering (MCQA) benchmarks, frequently used to test LLMs. Researchers and clinicians alike must have complete confidence in LLMs’ abilities for them to be deployed in a medical setting. To address this need for understanding, we introduce a knowledge graph (KG)-based method to evaluate the biomedical reasoning abilities of LLMs. Essentially, we map how LLMs link medical concepts in order to better understand how they reason. We test GPT-4, Llama3-70b, and PalmyraMed-70b, a specialized medical model. We enlist a panel of medical students to review a total of 60 LLM-generated graphs and compare these graphs to BIOS, a large biomedical KG. We observe GPT-4 to perform best in our human review but worst in our ground truth comparison; vice-versa with PalmyraMed, the medical model. Our work provides a means of visualizing the medical reasoning pathways of LLMs so they can be implemented in clinical settings safely and effectively.
[AI-72] Hybrid Forecasting of Geopolitical Events
链接: https://arxiv.org/abs/2412.10981
作者: Daniel M. Benjamin,Fred Morstatter,Ali E. Abbas,Andres Abeliuk,Pavel Atanasov,Stephen Bennett,Andreas Beger,Saurabh Birari,David V. Budescu,Michele Catasta,Emilio Ferrara,Lucas Haravitch,Mark Himmelstein,KSM Tozammel Hossain,Yuzhong Huang,Woojeong Jin,Regina Joseph,Jure Leskovec,Akira Matsui,Mehrnoosh Mirtaheri,Xiang Ren,Gleb Satyukov,Rajiv Sethi,Amandeep Singh,Rok Sosic,Mark Steyvers,Pedro A Szekely,Michael D. Ward,Aram Galstyan
关键词: Sound decision-making relies, tangible outcomes ranging, Sound decision-making, disease outbreaks, decision-making relies
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
*备注: 20 pages, 6 figures, 4 tables
点击查看摘要
Abstract:Sound decision-making relies on accurate prediction for tangible outcomes ranging from military conflict to disease outbreaks. To improve crowdsourced forecasting accuracy, we developed SAGE, a hybrid forecasting system that combines human and machine generated forecasts. The system provides a platform where users can interact with machine models and thus anchor their judgments on an objective benchmark. The system also aggregates human and machine forecasts weighting both for propinquity and based on assessed skill while adjusting for overconfidence. We present results from the Hybrid Forecasting Competition (HFC) - larger than comparable forecasting tournaments - including 1085 users forecasting 398 real-world forecasting problems over eight months. Our main result is that the hybrid system generated more accurate forecasts compared to a human-only baseline which had no machine generated predictions. We found that skilled forecasters who had access to machine-generated forecasts outperformed those who only viewed historical data. We also demonstrated the inclusion of machine-generated forecasts in our aggregation algorithms improved performance, both in terms of accuracy and scalability. This suggests that hybrid forecasting systems, which potentially require fewer human resources, can be a viable approach for maintaining a competitive level of accuracy over a larger number of forecasting questions.
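A minimal sketch of skill- and recency-weighted probability pooling with a calibration adjustment is shown below; the concrete weighting formula, half-life, and calibration exponent are illustrative assumptions, not the published SAGE aggregator.

```python
# Toy hybrid aggregation: pool human and machine probability forecasts in
# log-odds space, weighting by skill and recency, then apply a calibration
# factor. All numeric choices here are made up for illustration.
import numpy as np

def aggregate(probs, skills, ages_days, calibration=1.2, half_life=30.0):
    """probs: individual probability forecasts for a binary event.
    skills: per-forecaster skill scores (higher = better).
    ages_days: age of each forecast in days (newer forecasts weigh more).
    calibration > 1 extremizes an underconfident pool; < 1 damps overconfidence."""
    probs = np.clip(np.asarray(probs, float), 1e-4, 1 - 1e-4)
    w = np.asarray(skills, float) * 0.5 ** (np.asarray(ages_days, float) / half_life)
    w = w / w.sum()
    pooled_log_odds = calibration * np.sum(w * np.log(probs / (1 - probs)))
    return 1.0 / (1.0 + np.exp(-pooled_log_odds))

forecasts = [0.60, 0.70, 0.55, 0.80]   # the last entry could come from a machine model
skills = [1.0, 1.4, 0.8, 1.2]
ages = [5, 1, 20, 0]
print(f"aggregated probability: {aggregate(forecasts, skills, ages):.3f}")
```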
[AI-73] Recursive Aggregates as Intensional Functions in Answer Set Programming: Semantics and Strong Equivalence AAAI
链接: https://arxiv.org/abs/2412.10975
作者: Jorge Fandinno,Zachary Hansen
关键词: extended First-Order formulas, programs with aggregates, paper shows, solvers clingo, clingo and dlv
类目: Artificial Intelligence (cs.AI); Logic in Computer Science (cs.LO)
*备注: Accepted for publication in the Proceedings of the 39th Annual AAAI Conference on Artificial Intelligence
点击查看摘要
Abstract:This paper shows that the semantics of programs with aggregates implemented by the solvers clingo and dlv can be characterized as extended First-Order formulas with intensional functions in the logic of Here-and-There. Furthermore, this characterization can be used to study the strong equivalence of programs with aggregates under either semantics. We also present a transformation that reduces the task of checking strong equivalence to reasoning in classical First-Order logic, which serves as a foundation for automating this procedure.
[AI-74] Composers' Evaluations of an AI Music Tool: Insights for Human-Centred Design NEURIPS2024
链接: https://arxiv.org/abs/2412.10968
作者: Eleanor Row,György Fazekas
关键词: tools for music, music composition, present a study, study that explores, explores the role
类目: ound (cs.SD); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
*备注: Accepted to NeurIPS 2024 Workshop on Generative AI and Creativity: A dialogue between machine learning researchers and creative professionals in Vancouver, Canada
点击查看摘要
Abstract:We present a study that explores the role of user-centred design in developing Generative AI (GenAI) tools for music composition. Through semi-structured interviews with professional composers, we gathered insights on a novel generative model for creating variations, highlighting concerns around trust, transparency, and ethical design. The findings helped form a feedback loop, guiding improvements to the model that emphasised traceability, transparency and explainability. They also revealed new areas for innovation, including novel features for controllability and research questions on the ethical and practical implementation of GenAI models.
[AI-75] FlowDock: Geometric Flow Matching for Generative Protein-Ligand Docking and Affinity Prediction
链接: https://arxiv.org/abs/2412.10966
作者: Alex Morehead,Jianlin Cheng
关键词: Powerful generative models, Powerful generative, recently been proposed, support both flexible, flexible protein-ligand docking
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Biomolecules (q-bio.BM); Quantitative Methods (q-bio.QM)
*备注: 10 pages, 2 tables, 2 algorithms, 7 figures. Code, data, pre-trained models, and baseline method predictions are available at this https URL
点击查看摘要
Abstract:Powerful generative models of protein-ligand structure have recently been proposed, but few of these methods support both flexible protein-ligand docking and affinity estimation. Of those that do, none can directly model multiple binding ligands concurrently or have been rigorously benchmarked on pharmacologically relevant drug targets, hindering their widespread adoption in drug discovery efforts. In this work, we propose FlowDock, a deep geometric generative model based on conditional flow matching that learns to directly map unbound (apo) structures to their bound (holo) counterparts for an arbitrary number of binding ligands. Furthermore, FlowDock provides predicted structural confidence scores and binding affinity values with each of its generated protein-ligand complex structures, enabling fast virtual screening of new (multi-ligand) drug targets. For the commonly-used PoseBusters Benchmark dataset, FlowDock achieves a 51% blind docking success rate using unbound (apo) protein input structures and without any information derived from multiple sequence alignments, and for the challenging new DockGen-E dataset, FlowDock matches the performance of single-sequence Chai-1 for binding pocket generalization. Additionally, in the ligand category of the 16th community-wide Critical Assessment of Techniques for Structure Prediction (CASP16), FlowDock ranked among the top-5 methods for pharmacological binding affinity estimation across 140 protein-ligand complexes, demonstrating the efficacy of its learned representations in virtual screening. Source code, data, and pre-trained models are available at this https URL.
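For intuition about the conditional-flow-matching training signal mentioned above, the sketch below builds (x_t, t, target-velocity) triples from a paired apo/holo toy structure. It shows only generic flow matching on raw coordinates; FlowDock's actual featurization, network, and confidence/affinity heads are not reproduced.

```python
# Generic conditional flow matching, heavily simplified: pair unbound (apo)
# and bound (holo) coordinates, sample a time t, and form the regression
# target for the velocity field.
import numpy as np

rng = np.random.default_rng(0)
n_atoms = 50
x_apo = rng.normal(size=(n_atoms, 3))                          # unbound structure (source)
x_holo = x_apo + rng.normal(scale=0.5, size=(n_atoms, 3))      # bound structure (target)

def flow_matching_pair(x0, x1):
    """Return (x_t, t, v_target): the interpolated structure, the sampled time,
    and the constant velocity a network should predict at (x_t, t)."""
    t = rng.uniform()
    x_t = (1.0 - t) * x0 + t * x1
    v_target = x1 - x0
    return x_t, t, v_target

x_t, t, v_target = flow_matching_pair(x_apo, x_holo)
# A trained model v_theta(x_t, t, protein/ligand features) would be regressed
# onto v_target with an MSE loss; at inference, integrating dx/dt = v_theta
# from the apo structure yields a predicted holo structure.
print(round(t, 3), x_t.shape, v_target.shape)
```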
[AI-76] PSMGD: Periodic Stochastic Multi-Gradient Descent for Fast Multi-Objective Optimization AAAI2025
链接: https://arxiv.org/abs/2412.10961
作者: Mingjing Xu,Peizhong Ju,Jia Liu,Haibo Yang
关键词: multi-objective reinforcement learning, multi-objective reinforcement, multi-task learning, potentially conflicting objectives, machine learning
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: Accepted to AAAI 2025
点击查看摘要
Abstract:Multi-objective optimization (MOO) lies at the core of many machine learning (ML) applications that involve multiple, potentially conflicting objectives (e.g., multi-task learning, multi-objective reinforcement learning, among many others). Despite the long history of MOO, recent years have witnessed a surge in interest within the ML community in the development of gradient manipulation algorithms for MOO, thanks to the availability of gradient information in many ML problems. However, existing gradient manipulation methods for MOO often suffer from long training times, primarily due to the need for computing dynamic weights by solving an additional optimization problem to determine a common descent direction that can decrease all objectives simultaneously. To address this challenge, we propose a new and efficient algorithm called Periodic Stochastic Multi-Gradient Descent (PSMGD) to accelerate MOO. PSMGD is motivated by the key observation that dynamic weights across objectives exhibit small changes under minor updates over short intervals during the optimization process. Consequently, our PSMGD algorithm is designed to periodically compute these dynamic weights and utilizes them repeatedly, thereby effectively reducing the computational overload. Theoretically, we prove that PSMGD can achieve state-of-the-art convergence rates for strongly-convex, general convex, and non-convex functions. Additionally, we introduce a new computational complexity measure, termed backpropagation complexity, and demonstrate that PSMGD could achieve an objective-independent backpropagation complexity. Through extensive experiments, we verify that PSMGD can provide comparable or superior performance to state-of-the-art MOO algorithms while significantly reducing training time.
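The key idea, re-solving the min-norm (MGDA) weighting only periodically and reusing the weights in between, can be illustrated on a two-objective toy problem. The closed-form two-gradient solution, step sizes, and period K below are illustrative choices, not the paper's algorithmic details.

```python
# Numpy sketch of periodic dynamic-weight computation: solve the MGDA
# min-norm problem only every K iterations and reuse the weights in between.
import numpy as np

def losses_and_grads(x):
    # two toy conflicting quadratic objectives
    l1, g1 = 0.5 * np.sum((x - 1.0) ** 2), x - 1.0
    l2, g2 = 0.5 * np.sum((x + 1.0) ** 2), x + 1.0
    return (l1, l2), (g1, g2)

def mgda_weights(g1, g2):
    # closed-form min-norm solution for two objectives:
    # argmin_a || a*g1 + (1-a)*g2 ||^2 with a clipped to [0, 1]
    diff = g1 - g2
    alpha = np.clip(np.dot(g2 - g1, g2) / (np.dot(diff, diff) + 1e-12), 0.0, 1.0)
    return np.array([alpha, 1.0 - alpha])

x = np.array([3.0, -2.0])
w = np.array([0.5, 0.5])
K, lr = 10, 0.1                       # recompute weights only every K steps
for step in range(100):
    _, (g1, g2) = losses_and_grads(x)
    if step % K == 0:                 # the periodic (and cheap) part
        w = mgda_weights(g1, g2)
    x -= lr * (w[0] * g1 + w[1] * g2)
print("final x:", x, "final losses:", losses_and_grads(x)[0])
```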
[AI-77] Optimizing AI-Assisted Code Generation
链接: https://arxiv.org/abs/2412.10953
作者: Simon Torka,Sahin Albayrak
关键词: AI-assisted code-generation tools, significantly transformed software, transformed software development, recent years, rise of AI-assisted
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:In recent years, the rise of AI-assisted code-generation tools has significantly transformed software development. While code generators have mainly been used to support conventional software development, their use will be extended to powerful and secure AI systems. Systems capable of generating code, such as ChatGPT, OpenAI Codex, GitHub Copilot, and AlphaCode, take advantage of advances in machine learning (ML) and natural language processing (NLP) enabled by large language models (LLMs). However, it must be borne in mind that these models work probabilistically, which means that although they can generate complex code from natural language input, there is no guarantee for the functionality and security of the generated code. However, to fully exploit the considerable potential of this technology, the security, reliability, functionality, and quality of the generated code must be guaranteed. This paper examines the implementation of these goals to date and explores strategies to optimize them. In addition, we explore how these systems can be optimized to create safe, high-performance, and executable artificial intelligence (AI) models, and consider how to improve their accessibility to make AI development more inclusive and equitable.
[AI-78] ALPACA – Adaptive Learning Pipeline for Comprehensive AI
链接: https://arxiv.org/abs/2412.10950
作者: Simon Torka,Sahin Albayrak
关键词: evaluation and visualisation, technologies has greatly, greatly increased, increased the complexity, include many stages
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Software Engineering (cs.SE)
*备注:
点击查看摘要
Abstract:The advancement of AI technologies has greatly increased the complexity of AI pipelines as they include many stages such as data collection, pre-processing, training, evaluation and visualisation. To provide effective and accessible AI solutions, it is important to design pipelines for different user groups such as experts, professionals from different fields and laypeople. Ease of use and trust play a central role in the acceptance of AI systems. The presented system, ALPACA (Adaptive Learning Pipeline for Advanced Comprehensive AI Analysis), offers a comprehensive AI pipeline that addresses the needs of diverse user groups. ALPACA integrates visual and code-based development and facilitates all key phases of the AI pipeline. Its architecture is based on Celery (with Redis backend) for efficient task management, MongoDB for seamless data storage and Kubernetes for cloud-based scalability and resource utilisation. Future versions of ALPACA will support modern techniques such as federated and continuous learning as well as explainable AI methods to further improve security, usability and trustworthiness. The application is demonstrated by an Android app for similarity recognition, which emphasises ALPACA’s potential for use in everyday life.
[AI-79] APAR: Modeling Irregular Target Functions in Tabular Regression via Arithmetic-Aware Pre-Training and Adaptive-Regularized Fine-Tuning AAAI2025
链接: https://arxiv.org/abs/2412.10941
作者: Hong-Wei Wu,Wei-Yao Wang,Kuang-Da Wang,Wen-Chih Peng
关键词: machine learning applications, common machine learning, Tabular data, ranging from finance, genomics and healthcare
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: AAAI 2025 Main Track
点击查看摘要
Abstract:Tabular data are fundamental in common machine learning applications, ranging from finance to genomics and healthcare. This paper focuses on tabular regression tasks, a field where deep learning (DL) methods are not consistently superior to machine learning (ML) models due to the challenges posed by irregular target functions inherent in tabular data, causing sensitive label changes with minor variations from features. To address these issues, we propose a novel Arithmetic-Aware Pre-training and Adaptive-Regularized Fine-tuning framework (APAR), which enables the model to fit irregular target functions in tabular data while reducing the negative impact of overfitting. In the pre-training phase, APAR introduces an arithmetic-aware pretext objective to capture intricate sample-wise relationships from the perspective of continuous labels. In the fine-tuning phase, a consistency-based adaptive regularization technique is proposed to self-learn appropriate data augmentation. Extensive experiments across 10 datasets demonstrated that APAR outperforms existing GBDT-, supervised NN-, and pretrain-finetune NN-based methods in RMSE (+9.43% ~ 20.37%), and empirically validated the effects of pre-training tasks, including the study of arithmetic operations. Our code and data are publicly available at this https URL.
[AI-80] Predicting Survival of Hemodialysis Patients using Federated Learning
链接: https://arxiv.org/abs/2412.10919
作者: Abhiram Raju,Praneeth Vepakomma
关键词: delaying their wait, kidney transplant, donor lists, Federated Learning, wait time
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: 6 pages, 2 figures, 4 tables, Presented at MIT Undergraduate Research Technology Conference and to be published as conference proceeding at IEEE Xplore
点击查看摘要
Abstract:Hemodialysis patients who are on donor lists for kidney transplant may get misidentified, delaying their wait time. Thus, predicting their survival time is crucial for optimizing waiting lists and personalizing treatment plans. Predicting survival times for patients often requires large quantities of high quality but sensitive data. This data is siloed and since individual datasets are smaller and less diverse, locally trained survival models do not perform as well as centralized ones. Hence, we propose the use of Federated Learning in the context of predicting survival for hemodialysis patients. Federated Learning or FL can have comparatively better performances than local models while not sharing data between centers. However, despite the increased use of such technologies, the application of FL in survival and even more, dialysis patients remains sparse. This paper studies the performance of FL for data of hemodialysis patients from NephroPlus, the largest private network of dialysis centers in India.
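Since the abstract does not specify the model or protocol, the sketch below shows a generic FedAvg round over three synthetic "centers" fitting a linear survival-time model, purely to illustrate how parameters, rather than patient data, are shared.

```python
# Generic FedAvg sketch (an assumption: the paper's exact model and protocol
# are not given in the abstract). Each center fits a local linear model of
# survival time; a server averages parameters weighted by local sample counts.
import numpy as np

rng = np.random.default_rng(1)
true_w = np.array([0.8, -0.5, 0.3])

def make_center(n):
    X = rng.normal(size=(n, 3))                        # toy clinical covariates
    y = X @ true_w + rng.normal(scale=0.3, size=n)     # toy log survival time
    return X, y

centers = [make_center(n) for n in (40, 120, 80)]
global_w = np.zeros(3)

for rnd in range(20):                                  # communication rounds
    local_ws, sizes = [], []
    for X, y in centers:
        w = global_w.copy()
        for _ in range(5):                             # local gradient steps
            w -= 0.05 * X.T @ (X @ w - y) / len(y)
        local_ws.append(w)
        sizes.append(len(y))
    global_w = np.average(local_ws, axis=0, weights=np.array(sizes, float))

print("federated estimate:", np.round(global_w, 3), "true:", true_w)
```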
[AI-81] Adaptive Reward Design for Reinforcement Learning in Complex Robotic Tasks
链接: https://arxiv.org/abs/2412.10917
作者: Minjae Kwon,Ingy ElSayed-Aly,Lu Feng
关键词: Linear Temporal Logic, Temporal Logic, Linear Temporal, derive reward functions, surge of interest
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: 9 pages, 7 figures. Under review at RA-L
点击查看摘要
Abstract:There is a surge of interest in using formal languages such as Linear Temporal Logic (LTL) and finite automata to precisely and succinctly specify complex tasks and derive reward functions for reinforcement learning (RL) in robotic applications. However, existing methods often assign sparse rewards (e.g., giving a reward of 1 only if a task is completed and 0 otherwise), necessitating extensive exploration to converge to a high-quality policy. To address this limitation, we propose a suite of reward functions that incentivize an RL agent to make measurable progress on tasks specified by LTL formulas and develop an adaptive reward shaping approach that dynamically updates these reward functions during the learning process. Experimental results on a range of RL-based robotic tasks demonstrate that the proposed approach is compatible with various RL algorithms and consistently outperforms baselines, achieving earlier convergence to better policies with higher task success rates and returns.
[AI-82] ST-FiT: Inductive Spatial-Temporal Forecasting with Limited Training Data
链接: https://arxiv.org/abs/2412.10912
作者: Zhenyu Lei,Yushun Dong,Jundong Li,Chen Chen
关键词: Graph Neural Networks, Spatial-Temporal Graph Neural, Neural Networks, real-world applications, Spatial-temporal
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
*备注:
点击查看摘要
Abstract:Spatial-temporal graphs are widely used in a variety of real-world applications. Spatial-Temporal Graph Neural Networks (STGNNs) have emerged as a powerful tool to extract meaningful insights from this data. However, in real-world applications, most nodes may not possess any available temporal data during training. For example, the pandemic dynamics of most cities on a geographical graph may not be available due to the asynchronous nature of outbreaks. Such a phenomenon disagrees with the training requirements of most existing spatial-temporal forecasting methods, which jeopardizes their effectiveness and thus blocks broader deployment. In this paper, we propose to formulate a novel problem of inductive forecasting with limited training data. In particular, given a spatial-temporal graph, we aim to learn a spatial-temporal forecasting model that can be easily generalized onto those nodes without any available temporal training data. To handle this problem, we propose a principled framework named ST-FiT. ST-FiT consists of two key learning components: temporal data augmentation and spatial graph topology learning. With such a design, ST-FiT can be used on top of any existing STGNNs to achieve superior performance on the nodes without training data. Extensive experiments verify the effectiveness of ST-FiT in multiple key perspectives.
[AI-83] CEKER: A Generalizable LLM Framework for Literature Analysis with a Case Study in Unikernel Security
链接: https://arxiv.org/abs/2412.10904
作者: Alex Wollman,John Hastings
关键词: Large Language Models, critical component, component of formulating, formulating and justifying, Literature reviews
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
*备注: 7 pages, 2 figures
点击查看摘要
Abstract:Literature reviews are a critical component of formulating and justifying new research, but are a manual and often time-consuming process. This research introduces a novel, generalizable approach to literature analysis called CEKER which uses a three-step process to streamline the collection of literature, the extraction of key insights, and the summarized analysis of key trends and gaps. Leveraging Large Language Models (LLMs), this methodology represents a significant shift from traditional manual literature reviews, offering a scalable, flexible, and repeatable approach that can be applied across diverse research domains. A case study on unikernel security illustrates CEKER’s ability to generate novel insights validated against previous manual methods. CEKER’s analysis highlighted reduced attack surface as the most prominent theme. Key security gaps included the absence of Address Space Layout Randomization, missing debugging tools, and limited entropy generation, all of which represent important challenges to unikernel security. The study also revealed a reliance on hypervisors as a potential attack vector and emphasized the need for dynamic security adjustments to address real-time threats.
[AI-84] Know Unreported Roadway Incidents in Real-time: A Deep Learning Framework for Early Traffic Anomaly Detection
链接: https://arxiv.org/abs/2412.10892
作者: Haocheng Duan,Hao Wu,Sean Qian
关键词: AID, conventional AID, relied heavily, AID models, Conventional automatic incident
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:Conventional automatic incident detection (AID) has relied heavily on all incident reports exclusively for training and evaluation. However, these reports suffer from a number of issues, such as delayed reports, inaccurate descriptions, false alarms, missing reports, and incidents that do not necessarily influence traffic. Relying on these reports to train or calibrate AID models hinders their ability to detect traffic anomalies effectively and timely, even leading to convergence issues in the model training process. Moreover, conventional AID models are not inherently designed to capture the early indicators of any generic incidents. It remains unclear how far ahead an AID model can report incidents. The AID applications in the literature are also spatially limited because the data used by most models is often limited to specific test road segments. To solve these problems, we propose a deep learning framework utilizing prior domain knowledge and model-designing strategies. This allows the model to detect a broader range of anomalies, not only incidents that significantly influence traffic flow but also early characteristics of incidents along with historically unreported anomalies. We specially design the model to target the early-stage detection/prediction of an incident. Additionally, unlike most conventional AID studies, we use widely available data, enhancing our method’s scalability. The experimental results across numerous road segments on different maps demonstrate that our model leads to more effective and early anomaly detection. Our framework does not focus on stacking or tweaking various deep learning models; instead, it focuses on model design and training strategies to improve early detection performance.
[AI-85] Fully Test-time Adaptation for Tabular Data AAAI2025
链接: https://arxiv.org/abs/2412.10871
作者: Zhi Zhou,Kun-Yang Yu,Lan-Zhe Guo,Yu-Feng Li
关键词: finds extensive applications, Tabular data plays, Tabular data, extensive applications, Tabular
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
*备注: Accepted by AAAI 2025. Code is available at: this https URL
点击查看摘要
Abstract:Tabular data plays a vital role in various real-world scenarios and finds extensive applications. Although recent deep tabular models have shown remarkable success, they still struggle to handle data distribution shifts, leading to performance degradation when testing distributions change. To remedy this, a robust tabular model must adapt to generalize to unknown distributions during testing. In this paper, we investigate the problem of fully test-time adaptation (FTTA) for tabular data, where the model is adapted using only the testing data. We identify three key challenges: the existence of label and covariate distribution shifts, the lack of effective data augmentation, and the sensitivity of adaptation, which render existing FTTA methods ineffective for tabular data. To this end, we propose the Fully Test-time Adaptation for Tabular data, namely FTAT, which enables FTTA methods to robustly optimize the label distribution of predictions, adapt to shifted covariate distributions, and suit a variety of tasks and models effectively. We conduct comprehensive experiments on six benchmark datasets, which are evaluated using three metrics. The experimental results demonstrate that FTAT outperforms state-of-the-art methods by a margin.
[AI-86] TinySubNets: An efficient and low capacity continual learning strategy
链接: https://arxiv.org/abs/2412.10869
作者: Marcin Pietroń,Kamil Faber,Dominik Żurek,Roberto Corizzo
关键词: machine learning research, recent machine learning, Continual Learning, setting gaining traction, learning research
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:Continual Learning (CL) is a highly relevant setting gaining traction in recent machine learning research. Among CL works, architectural and hybrid strategies are particularly effective due to their potential to adapt the model architecture as new tasks are presented. However, many existing solutions do not efficiently exploit model sparsity, and are prone to capacity saturation due to their inefficient use of available weights, which limits the number of learnable tasks. In this paper, we propose TinySubNets (TSN), a novel architectural CL strategy that addresses the issues through the unique combination of pruning with different sparsity levels, adaptive quantization, and weight sharing. Pruning identifies a subset of weights that preserve model performance, making less relevant weights available for future tasks. Adaptive quantization allows a single weight to be separated into multiple parts which can be assigned to different tasks. Weight sharing between tasks boosts the exploitation of capacity and task similarity, allowing for the identification of a better trade-off between model accuracy and capacity. These features allow TSN to efficiently leverage the available capacity, enhance knowledge transfer, and reduce computational resource consumption. Experimental results involving common benchmark CL datasets and scenarios show that our proposed strategy achieves better results in terms of accuracy than existing state-of-the-art CL strategies. Moreover, our strategy is shown to provide a significantly improved model capacity exploitation. Code released at: this https URL.
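A toy view of two of the three mechanisms, magnitude pruning followed by uniform quantization of the kept weights, is sketched below; the sparsity level, number of quantization levels, and the omission of weight sharing are simplifying assumptions rather than TSN's actual procedure.

```python
# Prune a weight matrix to a target sparsity, quantize the surviving weights
# to a small codebook, and report how much capacity remains free for later
# tasks. Purely illustrative of the pruning + quantization ingredients.
import numpy as np

def prune_and_quantize(W, keep_ratio=0.3, n_levels=8):
    threshold = np.quantile(np.abs(W).ravel(), 1.0 - keep_ratio)
    mask = np.abs(W) >= threshold                 # weights assigned to this task
    kept = W[mask]
    # uniform quantization of the kept weights into n_levels bins
    edges = np.linspace(kept.min(), kept.max(), n_levels + 1)
    centers = (edges[:-1] + edges[1:]) / 2
    idx = np.clip(np.digitize(kept, edges) - 1, 0, n_levels - 1)
    W_q = np.zeros_like(W)
    W_q[mask] = centers[idx]
    return W_q, mask

rng = np.random.default_rng(0)
W = rng.normal(size=(64, 64))
W_task1, mask1 = prune_and_quantize(W, keep_ratio=0.3)
print(f"capacity still free for later tasks: {1.0 - mask1.mean():.0%}")
```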
[AI-87] AuctionNet: A Novel Benchmark for Decision-Making in Large-Scale Games
链接: https://arxiv.org/abs/2412.10798
作者: Kefan Su,Yusen Huo,Zhilin Zhang,Shuai Dou,Chuan Yu,Jian Xu,Zongqing Lu,Bo Zheng
关键词: significant real-world impact, bid decision-making, bid decision-making algorithms, essential research area, Decision-making
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:Decision-making in large-scale games is an essential research area in artificial intelligence (AI) with significant real-world impact. However, the limited access to realistic large-scale game environments has hindered research progress in this area. In this paper, we present AuctionNet, a benchmark for bid decision-making in large-scale ad auctions derived from a real-world online advertising platform. AuctionNet is composed of three parts: an ad auction environment, a pre-generated dataset based on the environment, and performance evaluations of several baseline bid decision-making algorithms. More specifically, the environment effectively replicates the integrity and complexity of real-world ad auctions through the interaction of several modules: the ad opportunity generation module employs deep generative models to bridge the gap between simulated and real-world data while mitigating the risk of sensitive data exposure; the bidding module implements diverse auto-bidding agents trained with different decision-making algorithms; and the auction module is anchored in the classic Generalized Second Price (GSP) auction but also allows for customization of auction mechanisms as needed. To facilitate research and provide insights into the game environment, we have also pre-generated a substantial dataset based on the environment. The dataset contains trajectories involving 48 diverse agents competing with each other, totaling over 500 million records and occupying 80GB of storage. Performance evaluations of baseline algorithms such as linear programming, reinforcement learning, and generative models for bid decision-making are also presented as part of AuctionNet. We note that AuctionNet is applicable not only to research on bid decision-making algorithms in ad auctions but also to the general area of decision-making in large-scale games.
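As a reference point for the auction module mentioned above, here is a stand-alone sketch of a classic GSP allocation and pricing rule; the bids and slot click-through rates are made up, and none of AuctionNet's generative or bidding modules are reproduced.

```python
# Classic Generalized Second Price auction: rank advertisers by bid, assign
# slots in order, and charge each winner the bid of the advertiser ranked
# just below them.
def gsp_auction(bids, slot_ctrs):
    """bids: {advertiser: bid}; slot_ctrs: click-through rates, best slot first.
    Returns (advertiser, slot, price_per_click, slot_ctr) tuples."""
    ranked = sorted(bids.items(), key=lambda kv: kv[1], reverse=True)
    results = []
    for slot, (adv, _) in enumerate(ranked[: len(slot_ctrs)]):
        next_bid = ranked[slot + 1][1] if slot + 1 < len(ranked) else 0.0
        results.append((adv, slot, next_bid, slot_ctrs[slot]))
    return results

bids = {"ad_A": 3.2, "ad_B": 2.5, "ad_C": 1.9, "ad_D": 0.8}
for adv, slot, price, ctr in gsp_auction(bids, slot_ctrs=[0.10, 0.06, 0.03]):
    print(f"{adv} wins slot {slot}, pays {price:.2f} per click "
          f"(expected cost per impression {price * ctr:.3f})")
```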
[AI-88] ANaGRAM: A Natural Gradient Relative to Adapted Model for efficient PINNs learning ICLR2025
链接: https://arxiv.org/abs/2412.10782
作者: Nilo Schwencke,Cyril Furtlehner
关键词: Physics Informed Neural, Informed Neural Networks, PDE driven systems, solve PDE driven, data assimilation purpose
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Numerical Analysis (math.NA); Optimization and Control (math.OC)
*备注: submitted to ICLR 2025
点击查看摘要
Abstract:In the recent years, Physics Informed Neural Networks (PINNs) have received strong interest as a method to solve PDE driven systems, in particular for data assimilation purpose. This method is still in its infancy, with many shortcomings and failures that remain not properly understood. In this paper we propose a natural gradient approach to PINNs which contributes to speed-up and improve the accuracy of the training. Based on an in depth analysis of the differential geometric structures of the problem, we come up with two distinct contributions: (i) a new natural gradient algorithm that scales as min(P^2 S, S^2 P), where P is the number of parameters, and S the batch size; (ii) a mathematically principled reformulation of the PINNs problem that allows the extension of natural gradient to it, with proved connections to Green’s function theory.
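The min(P^2 S, S^2 P) scaling can be made concrete with a damped Gauss-Newton step that is solved either in parameter space or in sample space, depending on which dimension is smaller. The sketch below only illustrates that complexity argument, not the paper's differential-geometric construction.

```python
# Solve a natural-gradient-like step on residuals r with Jacobian J (S x P)
# either via the P x P system J^T J (cost ~ P^2 S) or via the S x S system
# J J^T (cost ~ S^2 P), picking whichever is cheaper.
import numpy as np

def natural_gradient_step(J, r, damping=1e-6):
    S, P = J.shape
    if P <= S:   # parameter-space solve
        G = J.T @ J + damping * np.eye(P)
        return np.linalg.solve(G, J.T @ r)
    else:        # sample-space solve (typical for PINNs: few collocation points)
        K = J @ J.T + damping * np.eye(S)
        return J.T @ np.linalg.solve(K, r)

rng = np.random.default_rng(0)
J = rng.normal(size=(50, 2000))      # few collocation points, many parameters
r = rng.normal(size=50)
delta = natural_gradient_step(J, r)
print(delta.shape, np.linalg.norm(J @ delta - r))  # residual is (nearly) matched
```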
[AI-89] Control of Overfitting with Physics
链接: https://arxiv.org/abs/2412.10716
作者: Sergei V. Kozyrev,Ilya A Lopatin,Alexander N Pechen
关键词: machine learning, understand the theoretical, theoretical justifications, justifications to explain, gradient Langevin dynamics
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
*备注: 20 pages, 6 figures
点击查看摘要
Abstract:While there are many works on the applications of machine learning, not so many of them are trying to understand the theoretical justifications to explain their efficiency. In this work, overfitting control (or generalization property) in machine learning is explained using analogies from physics and biology. For stochastic gradient Langevin dynamics, we show that the Eyring formula of kinetic theory allows to control overfitting in the algorithmic stability approach - when wide minima of the risk function with low free energy correspond to low overfitting. For the generative adversarial network (GAN) model, we establish an analogy between GAN and the predator-prey model in biology. An application of this analogy allows us to explain the selection of wide likelihood maxima and overfitting reduction for GANs.
[AI-90] RAT: Adversarial Attacks on Deep Reinforcement Agents for Targeted Behaviors AAAI2025
链接: https://arxiv.org/abs/2412.10713
作者: Fengshuo Bai,Runze Liu,Yali Du,Ying Wen,Yaodong Yang
关键词: Evaluating deep reinforcement, deep reinforcement learning, Evaluating deep, reinforcement learning, deep reinforcement
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Robotics (cs.RO)
*备注: Accepted by AAAI 2025
点击查看摘要
Abstract:Evaluating deep reinforcement learning (DRL) agents against targeted behavior attacks is critical for assessing their robustness. These attacks aim to manipulate the victim into specific behaviors that align with the attacker’s objectives, often bypassing traditional reward-based defenses. Prior methods have primarily focused on reducing cumulative rewards; however, rewards are typically too generic to capture complex safety requirements effectively. As a result, focusing solely on reward reduction can lead to suboptimal attack strategies, particularly in safety-critical scenarios where more precise behavior manipulation is needed. To address these challenges, we propose RAT, a method designed for universal, targeted behavior attacks. RAT trains an intention policy that is explicitly aligned with human preferences, serving as a precise behavioral target for the adversary. Concurrently, an adversary manipulates the victim’s policy to follow this target behavior. To enhance the effectiveness of these attacks, RAT dynamically adjusts the state occupancy measure within the replay buffer, allowing for more controlled and effective behavior manipulation. Our empirical results on robotic simulation tasks demonstrate that RAT outperforms existing adversarial attack algorithms in inducing specific behaviors. Additionally, RAT shows promise in improving agent robustness, leading to more resilient policies. We further validate RAT by guiding Decision Transformer agents to adopt behaviors aligned with human preferences in various MuJoCo tasks, demonstrating its effectiveness across diverse tasks.
[AI-91] USM: Unbiased Survey Modeling for Limiting Negative User Experiences in Recommendation Systems
链接: https://arxiv.org/abs/2412.10674
作者: Chenghui Yu,Peiyi Li,Haoze Wu,Bingfeng Deng,Hongyu Xiong
关键词: improve user experience, crucial to guardrail, signals, Negative feedback signals, guardrail content recommendations
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
*备注: 9 pages, 6 figures
点击查看摘要
Abstract:Negative feedback signals are crucial to guardrail content recommendations and improve user experience. When these signals are effectively integrated into recommendation systems, they play a vital role in preventing the promotion of harmful or undesirable content, thereby contributing to a healthier online environment. However, the challenges associated with negative signals are noteworthy. Due to the limited visibility of options for users to express negative feedback, these signals are often sparse compared to positive signals. This imbalance can lead to a skewed understanding of user preferences, resulting in recommendations that prioritize short-term engagement over long-term satisfaction. Moreover, an over-reliance on positive signals can create a filter bubble, where users are continuously exposed to content that aligns with their immediate preferences but may not be beneficial in the long run. This scenario can ultimately lead to user attrition as audiences become disillusioned with the quality of the content provided. Additionally, existing user signals frequently fail to meet specific customized requirements, such as understanding the underlying reasons for a user’s likes or dislikes regarding a video. This lack of granularity hinders our ability to tailor content recommendations effectively, as we cannot identify the particular attributes of content that resonate with individual users.
[AI-92] Proposing and solving olympiad geometry with guided tree search
链接: https://arxiv.org/abs/2412.10673
作者: Chi Zhang,Jiajun Song,Siyu Li,Yitao Liang,Yuxi Ma,Wei Wang,Yixin Zhu,Song-Chun Zhu
关键词: solving highly honored, Mathematics olympiads, highly honored, proposing and solving, geometry
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:Mathematics olympiads are prestigious competitions, with problem proposing and solving highly honored. Building artificial intelligence that proposes and solves olympiads presents an unresolved challenge in automated theorem discovery and proving, especially in geometry for its combination of numerical and spatial elements. We introduce TongGeometry, a Euclidean geometry system supporting tree-search-based guided problem proposing and solving. The efficient geometry system establishes the most extensive repository of geometry theorems to date: within the same computational budget as the existing state-of-the-art, TongGeometry discovers 6.7 billion geometry theorems requiring auxiliary constructions, including 4.1 billion exhibiting geometric symmetry. Among them, 10 theorems were proposed to regional mathematical olympiads with 3 of TongGeometry’s proposals selected in real competitions, earning spots in a national team qualifying exam or a top civil olympiad in China and the US. Guided by fine-tuned large language models, TongGeometry solved all International Mathematical Olympiad geometry in IMO-AG-30, outperforming gold medalists for the first time. It also surpasses the existing state-of-the-art across a broader spectrum of olympiad-level problems. The full capabilities of the system can be utilized on a consumer-grade machine, making the model more accessible and fostering widespread democratization of its use. By analogy, unlike existing systems that merely solve problems like students, TongGeometry acts like a geometry coach, discovering, presenting, and proving theorems.
[AI-93] Hidden Echoes Survive Training in Audio To Audio Generative Instrument Models AAAI
链接: https://arxiv.org/abs/2412.10649
作者: Christopher J. Tralie,Matt Amery,Benjamin Douglas,Ian Utz
关键词: black box behavior, properly licensed data, Realtime Audio Variational, generative techniques pervade, box behavior
类目: ound (cs.SD); Artificial Intelligence (cs.AI); Multimedia (cs.MM); Audio and Speech Processing (eess.AS)
*备注: 8 pages, 11 Figures, Proceedings of 2025 AAAI Workshop on AI for Music
点击查看摘要
Abstract:As generative techniques pervade the audio domain, there has been increasing interest in tracing back through these complicated models to understand how they draw on their training data to synthesize new examples, both to ensure that they use properly licensed data and also to elucidate their black box behavior. In this paper, we show that if imperceptible echoes are hidden in the training data, a wide variety of audio to audio architectures (differentiable digital signal processing (DDSP), Realtime Audio Variational autoEncoder (RAVE), and "Dance Diffusion") will reproduce these echoes in their outputs. Hiding a single echo is particularly robust across all architectures, but we also show promising results hiding longer time spread echo patterns for an increased information capacity. We conclude by showing that echoes make their way into fine tuned models, that they survive mixing/demixing, and that they survive pitch shift augmentation during training. Hence, this simple, classical idea in watermarking shows significant promise for tagging generative audio models.
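The classical echo-hiding idea the paper relies on is easy to demonstrate: add a faint delayed copy of the signal and recover the delay as a peak in the real cepstrum. The delay, amplitude, and signal length below are illustrative, and the detection step is a textbook cepstral method rather than the authors' evaluation pipeline.

```python
# Embed a single faint echo y[n] = x[n] + alpha * x[n - d], then detect the
# delay d as the dominant peak of the real cepstrum.
import numpy as np

def embed_echo(x, delay=250, alpha=0.05):
    y = x.copy()
    y[delay:] += alpha * x[:-delay]
    return y

def detect_echo(y, max_delay=400):
    spectrum = np.fft.rfft(y)
    cepstrum = np.fft.irfft(np.log(np.abs(spectrum) + 1e-12))
    search = cepstrum[50:max_delay]          # skip the low-quefrency region
    return 50 + int(np.argmax(search))

rng = np.random.default_rng(0)
x = rng.normal(size=64000)                   # stand-in for ~4 s of 16 kHz audio
y = embed_echo(x, delay=250, alpha=0.05)
print("recovered echo delay:", detect_echo(y))   # expected: 250
```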
[AI-94] WaveGNN: Modeling Irregular Multivariate Time Series for Accurate Predictions
链接: https://arxiv.org/abs/2412.10621
作者: Arash Hajisafi,Maria Despoina Siampou,Bita Azarijoo,Cyrus Shahabi
关键词: analyzing time series, time series, Accurately modeling, crucial for downstream, downstream applications
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:Accurately modeling and analyzing time series data is crucial for downstream applications across various fields, including healthcare, finance, astronomy, and epidemiology. However, real-world time series often exhibit irregularities such as misaligned timestamps, missing entries, and variable sampling rates, complicating their analysis. Existing approaches often rely on imputation, which can introduce biases. A few approaches that directly model irregularity tend to focus exclusively on either capturing intra-series patterns or inter-series relationships, missing the benefits of integrating both. To this end, we present WaveGNN, a novel framework designed to directly (i.e., no imputation) embed irregularly sampled multivariate time series data for accurate predictions. WaveGNN utilizes a Transformer-based encoder to capture intra-series patterns by directly encoding the temporal dynamics of each time series. To capture inter-series relationships, WaveGNN uses a dynamic graph neural network model, where each node represents a sensor, and the edges capture the long- and short-term relationships between them. Our experimental results on real-world healthcare datasets demonstrate that WaveGNN consistently outperforms existing state-of-the-art methods, with an average relative improvement of 14.7% in F1-score when compared to the second-best baseline in cases with extreme sparsity. Our ablation studies reveal that both intra-series and inter-series modeling significantly contribute to this notable improvement.
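To make the two-branch design concrete, here is a toy PyTorch stand-in (an illustrative sketch, not the released WaveGNN implementation; module names, shapes, and hyperparameters are assumptions) that encodes each sensor's event sequence with a Transformer and then exchanges information across sensors through a learnable graph:

```python
import torch
import torch.nn as nn

class WaveGNNSketch(nn.Module):
    """Toy stand-in: a Transformer encoder per series (intra-series patterns) followed by
    one round of message passing over a learnable sensor graph (inter-series relations)."""
    def __init__(self, d_model=32, n_sensors=8, n_heads=4):
        super().__init__()
        self.value_proj = nn.Linear(2, d_model)   # each event = (observed value, time gap)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.intra = nn.TransformerEncoder(layer, num_layers=2)
        self.adj = nn.Parameter(torch.eye(n_sensors))   # learnable adjacency between sensors
        self.inter = nn.Linear(d_model, d_model)
        self.head = nn.Linear(d_model, 1)

    def forward(self, x, pad_mask):
        # x: (batch, sensors, events, 2); pad_mask: (batch, sensors, events), True = padded slot
        b, s, t, _ = x.shape
        flat_mask = pad_mask.view(b * s, t)
        h = self.value_proj(x).view(b * s, t, -1)
        h = self.intra(h, src_key_padding_mask=flat_mask)
        valid = (~flat_mask).unsqueeze(-1).float()
        h = (h * valid).sum(dim=1) / valid.sum(dim=1).clamp(min=1.0)   # masked mean per series
        h = h.view(b, s, -1)
        h = torch.relu(self.inter(torch.softmax(self.adj, dim=-1) @ h))  # inter-series messages
        return self.head(h).squeeze(-1)   # one prediction per sensor

model = WaveGNNSketch()
x = torch.randn(4, 8, 16, 2)                 # irregular series, padded to 16 events each
mask = torch.zeros(4, 8, 16, dtype=torch.bool)
print(model(x, mask).shape)                  # torch.Size([4, 8])
```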
[AI-95] Client-Side Patching against Backdoor Attacks in Federated Learning
链接: https://arxiv.org/abs/2412.10605
作者: Borja Molina Coronado
关键词: decentralized environments, versatile framework, framework for training, Federated learning, Federated
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:Federated learning is a versatile framework for training models in decentralized environments. However, the trust placed in clients makes federated learning vulnerable to backdoor attacks launched by malicious participants. While many defenses have been proposed, they often fall short when facing heterogeneous data distributions among participating clients. In this paper, we propose a novel defense mechanism for federated learning systems designed to mitigate backdoor attacks on the client side. Our approach leverages adversarial learning techniques and model patching to neutralize the impact of backdoor attacks. Through extensive experiments on the MNIST and Fashion-MNIST datasets, we demonstrate that our defense effectively reduces backdoor accuracy, outperforming existing state-of-the-art defenses, such as LFighter, FLAME, and RoseAgg, in i.i.d. and non-i.i.d. scenarios, while maintaining competitive or superior accuracy on clean data.
[AI-96] Advances in Transformers for Robotic Applications: A Review DATE
链接: https://arxiv.org/abs/2412.10599
作者: Nikunj Sanghai,Nik Bear Brown
关键词: Natural Language Processing, Language Processing, Natural Language, Deep Reinforcement Learning, brought about significant
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
*备注: Early preprint, focusing primarily on general purpose robots, more updates to come
点击查看摘要
Abstract:The introduction of Transformers architecture has brought about significant breakthroughs in Deep Learning (DL), particularly within Natural Language Processing (NLP). Since their inception, Transformers have outperformed many traditional neural network architectures due to their “self-attention” mechanism and their scalability across various applications. In this paper, we cover the use of Transformers in Robotics. We go through recent advances and trends in Transformer architectures and examine their integration into robotic perception, planning, and control for autonomous systems. Furthermore, we review past work and recent research on use of Transformers in Robotics as pre-trained foundation models and integration of Transformers with Deep Reinforcement Learning (DRL) for autonomous systems. We discuss how different Transformer variants are being adapted in robotics for reliable planning and perception, increasing human-robot interaction, long-horizon decision-making, and generalization. Finally, we address limitations and challenges, offering insight and suggestions for future research directions.
[AI-97] Who's the (Multi-)Fairest of Them All: Rethinking Interpolation-Based Data Augmentation Through the Lens of Multicalibration AAAI2025
链接: https://arxiv.org/abs/2412.10575
作者: Karina Halevy,Karly Hou,Charumathi Badrinath
关键词: Fair Mixup, SoTA interpolation-based methods, Mixup, SoTA interpolation-based, widely shown
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
*备注: Expanded version of AAAI 2025 main track paper. 8 pages, 2 figures
点击查看摘要
Abstract:Data augmentation methods, especially SoTA interpolation-based methods such as Fair Mixup, have been widely shown to increase model fairness. However, this fairness is evaluated on metrics that do not capture model uncertainty and on datasets with only one, relatively large, minority group. As a remedy, multicalibration has been introduced to measure fairness while accommodating uncertainty and accounting for multiple minority groups. However, existing methods of improving multicalibration involve reducing initial training data to create a holdout set for post-processing, which is not ideal when minority training data is already sparse. This paper uses multicalibration to more rigorously examine data augmentation for classification fairness. We stress-test four versions of Fair Mixup on two structured data classification problems with up to 81 marginalized groups, evaluating multicalibration violations and balanced accuracy. We find that on nearly every experiment, Fair Mixup worsens baseline performance and fairness, but the simple vanilla Mixup outperforms both Fair Mixup and the baseline, especially when calibrating on small groups. Combining vanilla Mixup with multicalibration post-processing, which enforces multicalibration through post-processing on a holdout set, further increases fairness.
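Since the finding hinges on vanilla Mixup, a minimal sketch of that baseline may help (illustrative only; the Beta parameter and the toy binary-label data are assumptions): each training example is replaced by a convex combination of itself and a randomly paired example, with labels mixed by the same weight.

```python
import numpy as np

def mixup_batch(X, y, alpha=0.4, rng=None):
    """Vanilla Mixup: convex combinations of random example pairs and of their labels."""
    rng = rng or np.random.default_rng()
    lam = rng.beta(alpha, alpha, size=len(X))   # one mixing weight per example
    perm = rng.permutation(len(X))
    lam_x = lam[:, None]
    X_mix = lam_x * X + (1 - lam_x) * X[perm]
    y_mix = lam * y + (1 - lam) * y[perm]       # soft labels for binary targets
    return X_mix, y_mix

X = np.random.rand(128, 10)
y = np.random.randint(0, 2, size=128).astype(float)
X_mix, y_mix = mixup_batch(X, y)
print(X_mix.shape, y_mix[:5])
```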
[AI-98] Edge AI-based Radio Frequency Fingerprinting for IoT Networks
链接: https://arxiv.org/abs/2412.10553
作者: Ahmed Mohamed Hussain,Nada Abughanam,Panos Papadimitratos
关键词: Internet of Things, introduced significant security, significant security challenges, real-time data exchange, smart cities
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Networking and Internet Architecture (cs.NI); Signal Processing (eess.SP)
*备注: 11 pages, and 8 figures
点击查看摘要
Abstract:The deployment of the Internet of Things (IoT) in smart cities and critical infrastructure has enhanced connectivity and real-time data exchange but introduced significant security challenges. While effective, cryptography can often be resource-intensive for small-footprint resource-constrained (i.e., IoT) devices. Radio Frequency Fingerprinting (RFF) offers a promising authentication alternative by using unique RF signal characteristics for device identification at the Physical (PHY)-layer, without resorting to cryptographic solutions. The challenge is two-fold: how to deploy such RFF in a large scale and for resource-constrained environments. Edge computing, processing data closer to its source, i.e., the wireless device, enables faster decision-making, reducing reliance on centralized cloud servers. Considering a modest edge device, we introduce two truly lightweight Edge AI-based RFF schemes tailored for resource-constrained devices. We implement two Deep Learning models, namely a Convolution Neural Network and a Transformer-Encoder, to extract complex features from the IQ samples, forming device-specific RF fingerprints. We convert the models to TensorFlow Lite and evaluate them on a Raspberry Pi, demonstrating the practicality of Edge deployment. Evaluations demonstrate the Transformer-Encoder outperforms the CNN in identifying unique transmitter features, achieving high accuracy (> 0.95) and ROC-AUC scores (> 0.90) while maintaining a compact model size of 73KB, appropriate for resource-constrained devices.
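A rough sketch of the CNN branch of such a pipeline is shown below (an illustration under assumed window sizes, layer widths, and random stand-in data, not the authors' architecture): a small 1D CNN over I/Q windows is trained and then converted to TensorFlow Lite for edge deployment.

```python
import numpy as np
import tensorflow as tf

# Each training example is a window of I/Q samples with shape (WINDOW, 2): in-phase and quadrature.
NUM_DEVICES, WINDOW = 10, 256

model = tf.keras.Sequential([
    tf.keras.layers.Conv1D(16, 7, activation="relu"),
    tf.keras.layers.MaxPooling1D(2),
    tf.keras.layers.Conv1D(32, 5, activation="relu"),
    tf.keras.layers.GlobalAveragePooling1D(),
    tf.keras.layers.Dense(NUM_DEVICES, activation="softmax"),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"])

# Random stand-in data; real RFF training would use captured I/Q windows per transmitter.
X = np.random.randn(512, WINDOW, 2).astype("float32")
y = np.random.randint(0, NUM_DEVICES, size=512)
model.fit(X, y, epochs=1, batch_size=64, verbose=0)

# Convert to TensorFlow Lite for deployment on an edge device such as a Raspberry Pi.
tflite_model = tf.lite.TFLiteConverter.from_keras_model(model).convert()
open("rff_cnn.tflite", "wb").write(tflite_model)
print(f"TFLite model size: {len(tflite_model) / 1024:.1f} KB")
```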
[AI-99] Extracting PAC Decision Trees from Black Box Binary Classifiers: The Gender Bias Study Case on BERT-based Language Models AAAI2025
链接: https://arxiv.org/abs/2412.10513
作者: Ana Ozaki,Roberto Confalonieri,Ricardo Guimarães,Anders Imenes
关键词: machine learning method, popular machine learning, Decision trees, learning method, inherent explainability
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: Accepted at AAAI 2025
点击查看摘要
Abstract:Decision trees are a popular machine learning method, known for their inherent explainability. In Explainable AI, decision trees can be used as surrogate models for complex black box AI models or as approximations of parts of such models. A key challenge of this approach is determining how accurately the extracted decision tree represents the original model and to what extent it can be trusted as an approximation of their behavior. In this work, we investigate the use of the Probably Approximately Correct (PAC) framework to provide a theoretical guarantee of fidelity for decision trees extracted from AI models. Based on theoretical results from the PAC framework, we adapt a decision tree algorithm to ensure a PAC guarantee under certain conditions. We focus on binary classification and conduct experiments where we extract decision trees from BERT-based language models with PAC guarantees. Our results indicate occupational gender bias in these models.
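The surrogate-extraction step can be sketched as follows (a simplified illustration in which a random forest stands in for the BERT-based black box, and the PAC machinery that bounds fidelity is omitted): sample inputs, label them with the black box, fit a shallow decision tree, and measure empirical fidelity on fresh samples.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

# Stand-in black box (in the paper this would be a BERT-based binary classifier).
X_train = rng.normal(size=(2000, 5))
y_train = (X_train[:, 0] + 0.5 * X_train[:, 1] > 0).astype(int)
black_box = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_train, y_train)

# Surrogate extraction: sample inputs from the data distribution, label them with the
# black box, and fit an interpretable decision tree to mimic its decisions.
X_query = rng.normal(size=(5000, 5))
y_query = black_box.predict(X_query)
surrogate = DecisionTreeClassifier(max_depth=4, random_state=0).fit(X_query, y_query)

# Empirical fidelity on fresh samples; a PAC-style analysis bounds how far this estimate
# can be from the true fidelity, given the number of samples drawn.
X_test = rng.normal(size=(5000, 5))
fidelity = (surrogate.predict(X_test) == black_box.predict(X_test)).mean()
print(f"surrogate fidelity to the black box: {fidelity:.3f}")
```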
[AI-100] Enhancing Automated Loop Invariant Generation for Complex Programs with Large Language Models
链接: https://arxiv.org/abs/2412.10483
作者: Ruibang Liu,Guoqiang Li,Minyu Chen,Ling-I Wu,Jingyu Ke
关键词: building trustworthy software, Automated program verification, loop invariant generation, trustworthy software, Large Language Models
类目: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Programming Languages (cs.PL)
*备注: 11 pages, 7 figures
点击查看摘要
Abstract:Automated program verification has always been an important component of building trustworthy software. While the analysis of real-world programs remains a theoretical challenge, the automation of loop invariant analysis has effectively resolved the problem. However, real-world programs that often mix complex data structures and control flows pose challenges to traditional loop invariant generation tools. To enhance the applicability of invariant generation techniques, we proposed ACInv, an Automated Complex program loop Invariant generation tool, which combines static analysis with Large Language Models (LLMs) to generate the proper loop invariants. We utilize static analysis to extract the necessary information for each loop and embed it into prompts for the LLM to generate invariants for each loop. Subsequently, we employ an LLM-based evaluator to assess the generated invariants, refining them by either strengthening, weakening, or rejecting them based on their correctness, ultimately obtaining enhanced invariants. We conducted experiments on ACInv, which showed that ACInv outperformed previous tools on data sets with data structures, and maintained similar performance to the state-of-the-art tool AutoSpec on numerical programs without data structures. For the total data set, ACInv can solve 21% more examples than AutoSpec and can generate reference data structure templates.
[AI-101] Tipping Points, Pulse Elasticity and Tonal Tension: An Empirical Study on What Generates Tipping Points
链接: https://arxiv.org/abs/2412.10481
作者: Canishk Naik(CAM, LSE),Elaine Chew(Repmus, CNRS, STMS)
关键词: characterise crucial turning, crucial turning points, Tipping points, piece of music, moments of change
类目: Sound (cs.SD); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Audio and Speech Processing (eess.AS)
*备注: International Society for Music Information Retrieval Conference, Oct 2017, Suzhou, China
点击查看摘要
Abstract:Tipping points are moments of change that characterise crucial turning points in a piece of music. This study presents a first step towards quantitatively and systematically describing the musical properties of tipping points. Timing information and computationally-derived tonal tension values which correspond to dissonance, distance from key, and harmonic motion are compared to tipping points in Ashkenazy’s recordings of six Chopin Mazurkas, as identified by 35 listeners. The analysis shows that all popular tipping points but one could be explained by statistically significant timing deviations or changepoints in at least one of the three tension parameters.
[AI-102] Benchmarking large language models for materials synthesis: the case of atomic layer deposition
链接: https://arxiv.org/abs/2412.10477
作者: Angel Yanguas-Gil,Matthew T. Dearing,Jeffrey W. Elam,Jessica C. Jones,Sungjoon Kim,Adnan Mohammad,Chi Thang Nguyen,Bratin Sengupta
关键词: atomic layer deposition, thin film growth, film growth technique, open-ended question benchmark, large language models
类目: Machine Learning (cs.LG); Materials Science (cond-mat.mtrl-sci); Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:In this work we introduce an open-ended question benchmark, ALDbench, to evaluate the performance of large language models (LLMs) in materials synthesis, and in particular in the field of atomic layer deposition, a thin film growth technique used in energy applications and microelectronics. Our benchmark comprises questions with a level of difficulty ranging from graduate level to domain expert current with the state of the art in the field. Human experts reviewed the questions along the criteria of difficulty and specificity, and the model responses along four different criteria: overall quality, specificity, relevance, and accuracy. We ran this benchmark on an instance of OpenAI’s GPT-4o. The responses from the model received a composite quality score of 3.7 on a 1 to 5 scale, consistent with a passing grade. However, 36% of the questions received at least one below average score. An in-depth analysis of the responses identified at least five instances of suspected hallucination. Finally, we observed statistically significant correlations between the difficulty of the question and the quality of the response, the difficulty of the question and the relevance of the response, and the specificity of the question and the accuracy of the response as graded by the human experts. This emphasizes the need to evaluate LLMs across multiple criteria beyond difficulty or accuracy.
[AI-103] CONCLAD: COntinuous Novel CLAss Detector
链接: https://arxiv.org/abs/2412.10473
作者: Amanda Rios,Ibrahima Ndiour,Parual Datta,Omesh Tickoo,Nilesh Ahuja
关键词: commonplace albeit unrealistic, relying on so-called, albeit unrealistic, so-called oracles, oracles for novelty
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:In the field of continual learning, relying on so-called oracles for novelty detection is commonplace albeit unrealistic. This paper introduces CONCLAD (“COntinuous Novel CLAss Detector”), a comprehensive solution to the under-explored problem of continual novel class detection in post-deployment data. At each new task, our approach employs an iterative uncertainty estimation algorithm to differentiate between known and novel class(es) samples, and to further discriminate between the different novel classes themselves. Samples predicted to be from a novel class with high-confidence are automatically pseudo-labeled and used to update our model. Simultaneously, a tiny supervision budget is used to iteratively query ambiguous novel class predictions, which are also used during update. Evaluation across multiple datasets, ablations and experimental settings demonstrate our method’s effectiveness at separating novel and old class samples continuously. We will release our code upon acceptance.
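A stripped-down version of the known-vs-novel split and pseudo-labeling loop might look like this (an assumption-laden sketch, not the CONCLAD algorithm itself; the confidence threshold, the clustering choice, and the toy data are mine):

```python
import numpy as np
from sklearn.cluster import KMeans

def split_known_vs_novel(probs, threshold=0.7):
    """Flag samples whose maximum predicted probability is low as candidate novel-class samples."""
    return probs.max(axis=1) >= threshold

# probs: softmax outputs of the current model on post-deployment data (toy values here);
# features: embeddings of the same samples from the model's penultimate layer.
rng = np.random.default_rng(1)
probs = rng.dirichlet(alpha=np.ones(5), size=1000)
features = rng.normal(size=(1000, 16))

known_mask = split_known_vs_novel(probs, threshold=0.7)
novel_features = features[~known_mask]

# Group candidate novel samples into pseudo-classes; high-confidence cluster members would be
# pseudo-labeled and used to update the model, while ambiguous ones would spend the query budget.
if len(novel_features) >= 2:
    pseudo_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(novel_features)
    print(f"{(~known_mask).sum()} candidate novel samples, cluster sizes: {np.bincount(pseudo_labels)}")
```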
[AI-104] EvoSampling: A Granular Ball-based Evolutionary Hybrid Sampling with Knowledge Transfer for Imbalanced Learning
链接: https://arxiv.org/abs/2412.10461
作者: Wenbin Pei,Ruohao Dai,Bing Xue,Mengjie Zhang,Qiang Zhang,Yiu-Ming Cheung,Shuyin Xia
关键词: minority class, imbalance would lead, lead to biased, biased classifiers, classifiers that favor
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Neural and Evolutionary Computing (cs.NE)
*备注:
点击查看摘要
Abstract:Class imbalance would lead to biased classifiers that favor the majority class and disadvantage the minority class. Unfortunately, from a practical perspective, the minority class is of importance in many real-life applications. Hybrid sampling methods address this by oversampling the minority class to increase the number of its instances, followed by undersampling to remove low-quality instances. However, most existing sampling methods face difficulties in generating diverse high-quality instances and often fail to remove noise or low-quality instances on a larger scale effectively. This paper therefore proposes an evolutionary multi-granularity hybrid sampling method, called EvoSampling. During the oversampling process, genetic programming (GP) is used with multi-task learning to effectively and efficiently generate diverse high-quality instances. During the undersampling process, we develop a granular ball-based undersampling method that removes noise in a multi-granular fashion, thereby enhancing data quality. Experiments on 20 imbalanced datasets demonstrate that EvoSampling effectively enhances the performance of various classification algorithms by providing better datasets than existing sampling methods. Besides, ablation studies further indicate that allowing knowledge transfer accelerates the GP’s evolutionary learning process.
[AI-105] Conformal Prediction on Quantifying Uncertainty of Dynamic Systems
链接: https://arxiv.org/abs/2412.10459
作者: Aoming Liang,Qi Liu,Lei Xu,Fahad Sohrab,Weicheng Cui,Changhui Song,Moncef Gaubbouj
关键词: Numerous studies, studies have focused, understanding the dynamics, spatial intelligence, video data
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:Numerous studies have focused on learning and understanding the dynamics of physical systems from video data, such as spatial intelligence. Artificial intelligence requires quantitative assessments of the uncertainty of the model to ensure reliability. However, there is still a relative lack of systematic assessment of the uncertainties, particularly the uncertainties of the physical data. Our motivation is to introduce conformal prediction into the uncertainty assessment of dynamical systems, providing a method supported by theoretical guarantees. This paper uses the conformal prediction method to assess uncertainties with benchmark operator learning methods. We have also compared the Monte Carlo Dropout and Ensemble methods in the partial differential equations dataset, effectively evaluating uncertainty through straight roll-outs, making it ideal for time-series tasks.
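The core conformal step is easy to illustrate. The sketch below (split conformal prediction around a toy forecaster; the noise level and miscoverage rate are assumptions) turns point forecasts into intervals whose width is set by a quantile of held-out calibration residuals:

```python
import numpy as np

def split_conformal_interval(residuals_cal, y_pred_test, alpha=0.1):
    """Split conformal prediction: wrap point forecasts in intervals with (1 - alpha)
    marginal coverage, using absolute residuals from a held-out calibration set."""
    n = len(residuals_cal)
    q_level = np.ceil((n + 1) * (1 - alpha)) / n          # finite-sample-corrected quantile level
    q = np.quantile(np.abs(residuals_cal), min(q_level, 1.0))
    return y_pred_test - q, y_pred_test + q

# Toy setup: a forecaster for a 1-D dynamical state with Gaussian model error.
rng = np.random.default_rng(0)
residuals_cal = rng.normal(scale=0.3, size=500)           # y_true - y_pred on calibration roll-outs
y_pred_test = np.sin(np.linspace(0, 4 * np.pi, 50))       # point forecasts along a trajectory
lo, hi = split_conformal_interval(residuals_cal, y_pred_test, alpha=0.1)
print(f"interval half-width: {(hi - lo).mean() / 2:.3f}")  # roughly 1.64 * 0.3 for this noise level
```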
[AI-106] An Interoperable Machine Learning Pipeline for Pediatric Obesity Risk Estimation
链接: https://arxiv.org/abs/2412.10454
作者: Hamed Fayyaz,Mehak Gupta,Alejandra Perez Ramirez,Claudine Jurkovitz,H. Timothy Bunnell,Thao-Ly T. Phan,Rahmatollah Beheshti
关键词: timely preventive interventions, Reliable prediction, helping them engage, disease is established, offer a valuable
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:Reliable prediction of pediatric obesity can offer a valuable resource to providers, helping them engage in timely preventive interventions before the disease is established. Many efforts have been made to develop ML-based predictive models of obesity, and some studies have reported high predictive performances. However, no commonly used clinical decision support tool based on existing ML models currently exists. This study presents a novel end-to-end pipeline specifically designed for pediatric obesity prediction, which supports the entire process of data extraction, inference, and communication via an API or a user interface. While focusing only on routinely recorded data in pediatric electronic health records (EHRs), our pipeline uses a diverse expert-curated list of medical concepts to predict the 1-3 years risk of developing obesity. Furthermore, by using the Fast Healthcare Interoperability Resources (FHIR) standard in our design procedure, we specifically target facilitating low-effort integration of our pipeline with different EHR systems. In our experiments, we report the effectiveness of the predictive model as well as its alignment with the feedback from various stakeholders, including ML scientists, providers, health IT personnel, health administration representatives, and patient group representatives.
[AI-107] Steganography in Game Actions
链接: https://arxiv.org/abs/2412.10442
作者: Ching-Chun Chang,Isao Echizen
关键词: primarily relying, relying on visual, auditory and linguistic, linguistic media, ongoing evolutionary interplay
类目: Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Multimedia (cs.MM)
*备注:
点击查看摘要
Abstract:The problem of subliminal communication has been addressed in various forms of steganography, primarily relying on visual, auditory and linguistic media. However, the field faces a fundamental paradox: as the art of concealment advances, so too does the science of revelation, leading to an ongoing evolutionary interplay. This study seeks to extend the boundaries of what is considered a viable steganographic medium. We explore a steganographic paradigm, where hidden information is communicated through the episodes of multiple agents interacting with an environment. Each agent, acting as an encoder, learns a policy to disguise the very existence of hidden messages within actions seemingly directed toward innocent objectives. Meanwhile, an observer, serving as a decoder, learns to associate behavioural patterns with their respective agents despite their dynamic nature, thereby unveiling the hidden messages. The interactions of agents are governed by the framework of multi-agent reinforcement learning and shaped by feedback from the observer. This framework encapsulates a game-theoretic dilemma, wherein agents face decisions between cooperating to create distinguishable behavioural patterns or defecting to pursue individually optimal yet potentially overlapping episodic actions. As a proof of concept, we exemplify action steganography through the game of labyrinth, a navigation task where subliminal communication is concealed within the act of steering toward a destination. The stego-system has been systematically validated through experimental evaluations, assessing its distortion and capacity alongside its secrecy and robustness when subjected to simulated passive and active adversaries.
[AI-108] GROOT-2: Weakly Supervised Multi-Modal Instruction Following Agents
链接: https://arxiv.org/abs/2412.10410
作者: Shaofei Cai,Bowei Zhang,Zihao Wang,Haowei Lin,Xiaojian Ma,Anji Liu,Yitao Liang
关键词: Developing agents, remains a fundamental, fundamental challenge, learn diverse behaviors, follow multimodal instructions
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Robotics (cs.RO)
*备注:
点击查看摘要
Abstract:Developing agents that can follow multimodal instructions remains a fundamental challenge in robotics and AI. Although large-scale pre-training on unlabeled datasets (no language instruction) has enabled agents to learn diverse behaviors, these agents often struggle with following instructions. While augmenting the dataset with instruction labels can mitigate this issue, acquiring such high-quality annotations at scale is impractical. To address this issue, we frame the problem as a semi-supervised learning task and introduce GROOT-2, a multimodal instructable agent trained using a novel approach that combines weak supervision with latent variable models. Our method consists of two key components: constrained self-imitating, which utilizes large amounts of unlabeled demonstrations to enable the policy to learn diverse behaviors, and human intention alignment, which uses a smaller set of labeled demonstrations to ensure the latent space reflects human intentions. GROOT-2’s effectiveness is validated across four diverse environments, ranging from video games to robotic manipulation, demonstrating its robust multimodal instruction-following capabilities.
[AI-109] TANGO: Training-free Embodied AI Agents for Open-world Tasks
链接: https://arxiv.org/abs/2412.10402
作者: Filippo Ziliotto,Tommaso Campari,Luciano Serafini,Lamberto Ballan
关键词: Large Language Models, Large Language, perform complex reasoning, demonstrated excellent capabilities, Language Models
类目: Artificial Intelligence (cs.AI); Robotics (cs.RO)
*备注:
点击查看摘要
Abstract:Large Language Models (LLMs) have demonstrated excellent capabilities in composing various modules together to create programs that can perform complex reasoning tasks on images. In this paper, we propose TANGO, an approach that extends the program composition via LLMs already observed for images, aiming to integrate those capabilities into embodied agents capable of observing and acting in the world. Specifically, by employing a simple PointGoal Navigation model combined with a memory-based exploration policy as a foundational primitive for guiding an agent through the world, we show how a single model can address diverse tasks without additional training. We task an LLM with composing the provided primitives to solve a specific task, using only a few in-context examples in the prompt. We evaluate our approach on three key Embodied AI tasks: Open-Set ObjectGoal Navigation, Multi-Modal Lifelong Navigation, and Open Embodied Question Answering, achieving state-of-the-art results without any specific fine-tuning in challenging zero-shot scenarios.
[AI-110] Neural-Symbolic Reasoning over Knowledge Graphs: A Survey from a Query Perspective
链接: https://arxiv.org/abs/2412.10390
作者: Lihui Liu,Zihao Wang,Hanghang Tong
关键词: Knowledge graph reasoning, Knowledge graph, graph reasoning, artificial intelligence, Knowledge
类目: Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:Knowledge graph reasoning is pivotal in various domains such as data mining, artificial intelligence, the Web, and social sciences. These knowledge graphs function as comprehensive repositories of human knowledge, facilitating the inference of new information. Traditional symbolic reasoning, despite its strengths, struggles with the challenges posed by incomplete and noisy data within these graphs. In contrast, the rise of Neural Symbolic AI marks a significant advancement, merging the robustness of deep learning with the precision of symbolic reasoning. This integration aims to develop AI systems that are not only highly interpretable and explainable but also versatile, effectively bridging the gap between symbolic and neural methodologies. Additionally, the advent of large language models (LLMs) has opened new frontiers in knowledge graph reasoning, enabling the extraction and synthesis of knowledge in unprecedented ways. This survey offers a thorough review of knowledge graph reasoning, focusing on various query types and the classification of neural symbolic reasoning. Furthermore, it explores the innovative integration of knowledge graph reasoning with large language models, highlighting the potential for groundbreaking advancements. This comprehensive overview is designed to support researchers and practitioners across multiple fields, including data mining, AI, the Web, and social sciences, by providing a detailed understanding of the current landscape and future directions in knowledge graph reasoning.
[AI-111] Adult learners' recall and recognition performance and affective feedback when learning from an AI-generated synthetic video
链接: https://arxiv.org/abs/2412.10384
作者: Zoe Ruo-Yu Li,Caswell Barry,Mutlu Cukurova
关键词: enhance learning outcomes, potentially enhance learning, human instructor-generated text, human instructor-generated, led to multiple
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
*备注: 13 pages, 9 figures
点击查看摘要
Abstract:The widespread use of generative AI has led to multiple applications of AI-generated text and media to potentially enhance learning outcomes. However, there are a limited number of well-designed experimental studies investigating the impact of learning gains and affective feedback from AI-generated media compared to traditional media (e.g., text from documents and human recordings of video). The current study recruited 500 participants to investigate adult learners' recall and recognition performances as well as their affective feedback on the AI-generated synthetic video, using a mixed-methods approach with a pre- and post-test design. Specifically, four learning conditions, AI-generated framing of human instructor-generated text, AI-generated synthetic videos with human instructor-generated text, human instructor-generated videos, and human instructor-generated text frame (baseline), were considered. The results indicated no statistically significant difference amongst conditions on recall and recognition performance. In addition, the participants' affective feedback was not statistically significantly different between the two video conditions. However, adult learners preferred to learn from the video formats rather than text materials.
[AI-112] Supervised Learning-enhanced Multi-Group Actor Critic for Live-stream Recommendation
链接: https://arxiv.org/abs/2412.10381
作者: Jingxin Liu,Xiang Gao,Yisha Li,Xin Li,Haiyang Lu,Ben Wang
关键词: improving dwelling time, enhancing user retention, capture users’ long-term, Reinforcement Learning, users’ long-term engagement
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:Reinforcement Learning (RL) has been widely applied in recommendation systems to capture users’ long-term engagement, thereby improving dwelling time and enhancing user retention. In the context of a short video live-stream mixed recommendation scenario, the live-stream recommendation system (RS) decides whether to inject at most one live-stream into the video feed for each user request. To maximize long-term user engagement, it is crucial to determine an optimal live-stream injection policy for accurate live-stream allocation. However, traditional RL algorithms often face divergence and instability problems, and these issues are even more pronounced in our scenario. To address these challenges, we propose a novel Supervised Learning-enhanced Multi-Group Actor Critic algorithm (SL-MGAC). Specifically, we introduce a supervised learning-enhanced actor-critic framework that incorporates variance reduction techniques, where multi-task reward learning helps restrict bootstrapping error accumulation during critic learning. Additionally, we design a multi-group state decomposition module for both actor and critic networks to reduce prediction variance and improve model stability. Empirically, we evaluate the SL-MGAC algorithm using offline policy evaluation (OPE) and online A/B testing. Experimental results demonstrate that the proposed method not only outperforms baseline methods but also exhibits enhanced stability in online recommendation scenarios.
[AI-113] Challenges in Human-Agent Communication
链接: https://arxiv.org/abs/2412.10380
作者: Gagan Bansal,Jennifer Wortman Vaughan,Saleema Amershi,Eric Horvitz,Adam Fourney,Hussein Mozannar,Victor Dibia,Daniel S. Weld
关键词: modern generative foundation, generative foundation models, highly capable autonomous, capable autonomous agents, Remarkable advancements
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:Remarkable advancements in modern generative foundation models have enabled the development of sophisticated and highly capable autonomous agents that can observe their environment, invoke tools, and communicate with other agents to solve problems. Although such agents can communicate with users through natural language, their complexity and wide-ranging failure modes present novel challenges for human-AI interaction. Building on prior research and informed by a communication grounding perspective, we contribute to the study of human-agent communication by identifying and analyzing twelve key communication challenges that these systems pose. These include challenges in conveying information from the agent to the user, challenges in enabling the user to convey information to the agent, and overarching challenges that need to be considered across all human-agent communication. We illustrate each challenge through concrete examples and identify open directions of research. Our findings provide insights into critical gaps in human-agent communication research and serve as an urgent call for new design patterns, principles, and guidelines to support transparency and control in these systems.
[AI-114] The Stabilizer Bootstrap of Quantum Machine Learning with up to 10000 qubits
链接: https://arxiv.org/abs/2412.11356
作者: Yuqing Li,Jinglei Cheng,Xulong Tang,Youtao Zhang,Frederic T. Chong,Junyu Liu
关键词: near-term quantum devices, quantum computers, fault-tolerant quantum computers, Quantum, early fault-tolerant quantum
类目: Quantum Physics (quant-ph); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: 15 pages, 14 figures
点击查看摘要
Abstract:Quantum machine learning is considered one of the flagship applications of quantum computers, where variational quantum circuits could be the leading paradigm both in the near-term quantum devices and the early fault-tolerant quantum computers. However, it is not clear how to identify the regime of quantum advantages from these circuits, and there is no explicit theory to guide the practical design of variational ansatze to achieve better performance. We address these challenges with the stabilizer bootstrap, a method that uses stabilizer-based techniques to optimize quantum neural networks before their quantum execution, together with theoretical proofs and high-performance computing with 10000 qubits or random datasets up to 1000 data. We find that, in a general setup of variational ansatze, the possibility of improvements from the stabilizer bootstrap depends on the structure of the observables and the size of the datasets. The results reveal that configurations exhibit two distinct behaviors: some maintain a constant probability of circuit improvement, while others show an exponential decay in improvement probability as qubit numbers increase. These patterns are termed strong stabilizer enhancement and weak stabilizer enhancement, respectively, with most situations falling in between. Our work seamlessly bridges techniques from fault-tolerant quantum computing with applications of variational quantum algorithms. Not only does it offer practical insights for designing variational circuits tailored to large-scale machine learning challenges, but it also maps out a clear trajectory for defining the boundaries of feasible and practical quantum advantages.
[AI-115] From Votes to Volatility: Predicting the Stock Market on Election Day
链接: https://arxiv.org/abs/2412.11192
作者: Igor L.R. Azevedo,Toyotaro Suzumura
关键词: optimal stock recommendations, Stock market forecasting, extensive research, aiming to provide, higher returns
类目: Computational Finance (q-fin.CP); Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:Stock market forecasting has been a topic of extensive research, aiming to provide investors with optimal stock recommendations for higher returns. In recent years, this field has gained even more attention due to the widespread adoption of deep learning models. While these models have achieved impressive accuracy in predicting stock behavior, tailoring them to specific scenarios has become increasingly important. Election Day represents one such critical scenario, characterized by intensified market volatility, as the winning candidate’s policies significantly impact various economic sectors and companies. To address this challenge, we propose the Election Day Stock Market Forecasting (EDSMF) Model. Our approach leverages the contextual capabilities of large language models alongside specialized agents designed to analyze the political and economic consequences of elections. By building on a state-of-the-art architecture, we demonstrate that EDSMF improves the predictive performance of the S&P 500 during this uniquely volatile day.
[AI-116] Decoding Drug Discovery: Exploring A-to-Z In silico Methods for Beginners
链接: https://arxiv.org/abs/2412.11137
作者: Hezha O. Rasul,Dlzar D. Ghafour,Bakhtyar K. Aziz,Bryar A. Hassan,Tarik A. Rashid,Arif Kivrak
关键词: pharmaceutical industry due, drug development, address various ailments, drug, drug development process
类目: Quantitative Methods (q-bio.QM); Artificial Intelligence (cs.AI)
*备注: this https URL
点击查看摘要
Abstract:The drug development process is a critical challenge in the pharmaceutical industry due to its time-consuming nature and the need to discover new drug potentials to address various ailments. The initial step in drug development, drug target identification, often consumes considerable time. While valid, traditional methods such as in vivo and in vitro approaches are limited in their ability to analyze vast amounts of data efficiently, leading to wasteful outcomes. To expedite and streamline drug development, an increasing reliance on computer-aided drug design (CADD) approaches has merged. These sophisticated in silico methods offer a promising avenue for efficiently identifying viable drug candidates, thus providing pharmaceutical firms with significant opportunities to uncover new prospective drug targets. The main goal of this work is to review in silico methods used in the drug development process with a focus on identifying therapeutic targets linked to specific diseases at the genetic or protein level. This article thoroughly discusses A-to-Z in silico techniques, which are essential for identifying the targets of bioactive compounds and their potential therapeutic effects. This review intends to improve drug discovery processes by illuminating the state of these cutting-edge approaches, thereby maximizing the effectiveness and duration of clinical trials for novel drug target investigation.
[AI-117] Deep Learning Models for Colloidal Nanocrystal Synthesis
链接: https://arxiv.org/abs/2412.10838
作者: Kai Gu,Yingping Liang,Jiaming Su,Peihan Sun,Jia Peng,Naihua Miao,Zhimei Sun,Ying Fu,Haizheng Zhong,Jun Zhang
关键词: multi-step crystallization processes, includes complex chemical, complex chemical reactions, Colloidal synthesis, nanocrystal synthesis model
类目: Materials Science (cond-mat.mtrl-sci); Artificial Intelligence (cs.AI); Applied Physics (physics.app-ph)
*备注:
点击查看摘要
Abstract:Colloidal synthesis of nanocrystals usually includes complex chemical reactions and multi-step crystallization processes. Despite the great success in the past 30 years, it remains challenging to clarify the correlations between synthetic parameters of chemical reaction and physical properties of nanocrystals. Here, we developed a deep learning-based nanocrystal synthesis model that correlates synthetic parameters with the final size and shape of target nanocrystals, using a dataset of 3500 recipes covering 348 distinct nanocrystal compositions. The size and shape labels were obtained from transmission electron microscope images using a segmentation model trained with a semi-supervised algorithm on a dataset comprising 1.2 million nanocrystals. By applying the reaction intermediate-based data augmentation method and elaborated descriptors, the synthesis model was able to predict nanocrystal’s size with a mean absolute error of 1.39 nm, while reaching an 89% average accuracy for shape classification. The synthesis model shows knowledge transfer capabilities across different nanocrystals with inputs of new recipes. With that, the influence of chemicals on the final size of nanocrystals was further evaluated, revealing the importance order of nanocrystal composition, precursor or ligand, and solvent. Overall, the deep learning-based nanocrystal synthesis model offers a powerful tool to expedite the development of high-quality nanocrystals.
[AI-118] Combining Priors with Experience: Confidence Calibration Based on Binomial Process Modeling AAAI-25
链接: https://arxiv.org/abs/2412.10658
作者: Jinzong Dong,Zhaohui Jiang,Dong Pan,Haoyang Yu
关键词: ensuring reliable decision-making, calibration, calibration curve, true posterior probability, Confidence calibration
类目: Methodology (stat.ME); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: Accepted by AAAI-25
点击查看摘要
Abstract:Confidence calibration of classification models is a technique to estimate the true posterior probability of the predicted class, which is critical for ensuring reliable decision-making in practical applications. Existing confidence calibration methods mostly use statistical techniques to estimate the calibration curve from data or fit a user-defined calibration function, but often overlook fully mining and utilizing the prior distribution behind the calibration curve. However, a well-informed prior distribution can provide valuable insights beyond the empirical data under the limited data or low-density regions of confidence scores. To fill this gap, this paper proposes a new method that integrates the prior distribution behind the calibration curve with empirical data to estimate a continuous calibration curve, which is realized by modeling the sampling process of calibration data as a binomial process and maximizing the likelihood function of the binomial process. We prove that the calibration curve estimating method is Lipschitz continuous with respect to data distribution and requires a sample size of 3/B of that required for histogram binning, where B represents the number of bins. Also, a new calibration metric ( TCE_bpm ), which leverages the estimated calibration curve to estimate the true calibration error (TCE), is designed. TCE_bpm is proven to be a consistent calibration measure. Furthermore, realistic calibration datasets can be generated by the binomial process modeling from a preset true calibration curve and confidence score distribution, which can serve as a benchmark to measure and compare the discrepancy between existing calibration metrics and the true calibration error. The effectiveness of our calibration method and metric are verified in real-world and simulated data.
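As a much-simplified illustration of estimating a continuous calibration curve by maximizing a binomial/Bernoulli likelihood (this is essentially a Platt-style parametric fit, not the paper's binomial-process estimator; the functional form and toy data are assumptions):

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import expit, logit

def fit_parametric_calibration(conf, correct):
    """Fit p(correct | confidence) = sigmoid(a * logit(confidence) + b) by maximizing
    the Bernoulli/binomial log-likelihood of the observed outcomes."""
    z = logit(np.clip(conf, 1e-6, 1 - 1e-6))

    def neg_log_lik(params):
        a, b = params
        p = np.clip(expit(a * z + b), 1e-12, 1 - 1e-12)
        return -np.sum(correct * np.log(p) + (1 - correct) * np.log(1 - p))

    a, b = minimize(neg_log_lik, x0=np.array([1.0, 0.0]), method="Nelder-Mead").x
    return lambda c: expit(a * logit(np.clip(c, 1e-6, 1 - 1e-6)) + b)

# Toy overconfident classifier: reported confidence exceeds the true probability of being correct.
rng = np.random.default_rng(0)
conf = rng.uniform(0.5, 1.0, size=5000)
true_acc = 0.5 + 0.8 * (conf - 0.5)               # the true (unknown) calibration curve
correct = rng.binomial(1, true_acc)
curve = fit_parametric_calibration(conf, correct)
print(f"confidence 0.95 -> estimated accuracy {curve(0.95):.3f} (true {0.5 + 0.8 * 0.45:.3f})")
```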
[AI-119] Model-driven deep neural network for enhanced direction finding with commodity 5G gNodeB AAAI’2024
链接: https://arxiv.org/abs/2412.10644
作者: Shengheng Liu,Zihuan Mao,Xingkang Li,Mengguan Pan,Peng Liu,Yongming Huang,Xiaohu You
关键词: intelligent connected devices, Pervasive and high-accuracy, increasingly important, fundamental enabler, enabler for intelligent
类目: Signal Processing (eess.SP); Artificial Intelligence (cs.AI)
*备注: To appear in ACM TOSN. A preliminary version of this article was presented at the AAAI’2024 Main Technical Track
点击查看摘要
Abstract:Pervasive and high-accuracy positioning has become increasingly important as a fundamental enabler for intelligent connected devices in mobile networks. Nevertheless, current wireless networks heavily rely on pure model-driven techniques to achieve positioning functionality, often succumbing to performance deterioration due to hardware impairments in practical scenarios. Here we reformulate the direction finding or angle-of-arrival (AoA) estimation problem as an image recovery task of the spatial spectrum and propose a new model-driven deep neural network (MoD-DNN) framework. The proposed MoD-DNN scheme comprises three modules: a multi-task autoencoder-based beamformer, a coarray spectrum generation module, and a model-driven deep learning-based spatial spectrum reconstruction module. Our technique enables automatic calibration of angular-dependent phase error thereby enhancing the resilience of direction-finding precision against realistic system non-idealities. We validate the proposed scheme both using numerical simulations and field tests. The results show that the proposed MoD-DNN framework enables effective spectrum calibration and accurate AoA estimation. To the best of our knowledge, this study marks the first successful demonstration of hybrid data-and-model-driven direction finding utilizing readily available commodity 5G gNodeB.
[AI-120] A recent evaluation on the performance of LLMs on radiation oncology physics using questions of randomly shuffled options
链接: https://arxiv.org/abs/2412.10622
作者: Peilong Wang,Jason Holmes,Zhengliang Liu,Dequan Chen,Tianming Liu,Jiajian Shen,Wei Liu
关键词: radiation oncology physics, oncology physics questions, radiation oncology, updated study evaluating, answering radiation oncology
类目: Medical Physics (physics.med-ph); Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:Purpose: We present an updated study evaluating the performance of large language models (LLMs) in answering radiation oncology physics questions, focusing on the latest released models. Methods: A set of 100 multiple-choice radiation oncology physics questions, previously created by us, was used for this study. The answer options of the questions were randomly shuffled to create “new” exam sets. Five LLMs – OpenAI o1-preview, GPT-4o, LLaMA 3.1 (405B), Gemini 1.5 Pro, and Claude 3.5 Sonnet – with the versions released before September 30, 2024, were queried using these new exams. To evaluate their deductive reasoning abilities, the correct answer options in the questions were replaced with “None of the above.” Then, the explain-first and step-by-step instruction prompt was used to test if it improved their reasoning abilities. The performance of the LLMs was compared to medical physicists in majority-vote scenarios. Results: All models demonstrated expert-level performance on these questions, with o1-preview even surpassing medical physicists in majority-vote scenarios. When substituting the correct answer options with “None of the above,” all models exhibited a considerable decline in performance, suggesting room for improvement. The explain-first and step-by-step instruction prompt helped enhance the reasoning abilities of LLaMA 3.1 (405B), Gemini 1.5 Pro, and Claude 3.5 Sonnet models. Conclusion: These latest LLMs demonstrated expert-level performance in answering radiation oncology physics questions, exhibiting great potential for assisting in radiation oncology physics education.
[AI-121] Regional Weather Variable Predictions by Machine Learning with Near-Surface Observational and Atmospheric Numerical Data
链接: https://arxiv.org/abs/2412.10450
作者: Yihe Zhang,Bryce Turney,Purushottam Sigdel,Xu Yuan,Eric Rappin,Adrian Lago,Sytske Kimball,Li Chen,Paul Darby,Lu Peng,Sercan Aygun,Yazhou Tu,M. Hassan Najafi,Nian-Feng Tzeng
关键词: timely regional weather, weather-related decisions, timely regional, vital for sectors, sectors dependent
类目: Atmospheric and Oceanic Physics (physics.ao-ph); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:Accurate and timely regional weather prediction is vital for sectors dependent on weather-related decisions. Traditional prediction methods, based on atmospheric equations, often struggle with coarse temporal resolutions and inaccuracies. This paper presents a novel machine learning (ML) model, called MiMa (short for Micro-Macro), that integrates both near-surface observational data from Kentucky Mesonet stations (collected every five minutes, known as Micro data) and hourly atmospheric numerical outputs (termed as Macro data) for fine-resolution weather forecasting. The MiMa model employs an encoder-decoder transformer structure, with two encoders for processing multivariate data from both datasets and a decoder for forecasting weather variables over short time horizons. Each instance of the MiMa model, called a modelet, predicts the values of a specific weather parameter at an individual Mesonet station. The approach is extended with Re-MiMa modelets, which are designed to predict weather variables at ungauged locations by training on multivariate data from a few representative stations in a region, tagged with their elevations. Re-MiMa (short for Regional-MiMa) can provide highly accurate predictions across an entire region, even in areas without observational stations. Experimental results show that MiMa significantly outperforms current models, with Re-MiMa offering precise short-term forecasts for ungauged locations, marking a significant advancement in weather forecasting accuracy and applicability.
[AI-122] Pre-trained protein language model for codon optimization
链接: https://arxiv.org/abs/2412.10411
作者: Shashank Pathak,Guohui Lin
关键词: Open Reading Frame, Reading Frame, Open Reading, impacts immune strength, directly impacts immune
类目: Quantitative Methods (q-bio.QM); Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:Motivation: Codon optimization of Open Reading Frame (ORF) sequences is essential for enhancing mRNA stability and expression in applications like mRNA vaccines, where codon choice can significantly impact protein yield which directly impacts immune strength. In this work, we investigate the use of a pre-trained protein language model (PPLM) for getting a rich representation of amino acids which could be utilized for codon optimization. This leaves us with a simpler fine-tuning task over PPLM in optimizing ORF sequences. Results: The ORFs generated by our proposed models outperformed their natural counterparts encoding the same proteins on computational metrics for stability and expression. They also demonstrated enhanced performance against the benchmark ORFs used in mRNA vaccines for the SARS-CoV-2 viral spike protein and the varicella-zoster virus (VZV). These results highlight the potential of adapting PPLM for designing ORFs tailored to encode target antigens in mRNA vaccines.
机器学习
[LG-0] No More Tuning: Prioritized Multi-Task Learning with Lagrangian Differential Multiplier Methods AAAI2025
链接: https://arxiv.org/abs/2412.12092
作者: Zhengxing Cheng,Yuheng Huang,Zhixuan Zhang,Dan Ou,Qingwen Liu
关键词: found widespread application, diverse domains, found widespread, MTL, tasks
类目: Machine Learning (cs.LG); Information Retrieval (cs.IR)
*备注: Accepted by AAAI 2025
点击查看摘要
Abstract:Given the ubiquity of multi-task in practical systems, Multi-Task Learning (MTL) has found widespread application across diverse domains. In real-world scenarios, these tasks often have different priorities. For instance, In web search, relevance is often prioritized over other metrics, such as click-through rates or user engagement. Existing frameworks pay insufficient attention to the prioritization among different tasks, which typically adjust task-specific loss function weights to differentiate task priorities. However, this approach encounters challenges as the number of tasks grows, leading to exponential increases in hyper-parameter tuning complexity. Furthermore, the simultaneous optimization of multiple objectives can negatively impact the performance of high-priority tasks due to interference from lower-priority tasks. In this paper, we introduce a novel multi-task learning framework employing Lagrangian Differential Multiplier Methods for step-wise multi-task optimization. It is designed to boost the performance of high-priority tasks without interference from other tasks. Its primary advantage lies in its ability to automatically optimize multiple objectives without requiring balancing hyper-parameters for different tasks, thereby eliminating the need for manual tuning. Additionally, we provide theoretical analysis demonstrating that our method ensures optimization guarantees, enhancing the reliability of the process. We demonstrate its effectiveness through experiments on multiple public datasets and its application in Taobao search, a large-scale industrial search ranking system, resulting in significant improvements across various business metrics.
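The general primal-dual idea, optimizing a lower-priority loss subject to a bound on the high-priority loss, can be sketched as follows (a generic gradient descent-ascent toy, not the paper's algorithm; the step sizes, the tolerance eps, and the quadratic losses are assumptions):

```python
import torch

# Toy two-task problem sharing one parameter vector: task 1 is the high-priority task.
torch.manual_seed(0)
w = torch.zeros(2, requires_grad=True)
lmbda = torch.tensor(0.0)                 # Lagrange multiplier for the priority constraint
target1, target2 = torch.tensor([1.0, 0.0]), torch.tensor([0.0, 1.0])
eps = 0.05                                # tolerated loss level for the high-priority task
opt = torch.optim.SGD([w], lr=0.05)

for step in range(2000):
    loss1 = ((w - target1) ** 2).sum()    # high-priority task loss
    loss2 = ((w - target2) ** 2).sum()    # lower-priority task loss
    lagrangian = loss2 + lmbda * (loss1 - eps)   # minimize loss2 subject to loss1 <= eps
    opt.zero_grad()
    lagrangian.backward()
    opt.step()                            # primal descent on the shared parameters
    with torch.no_grad():                 # dual ascent on the multiplier
        lmbda = torch.clamp(lmbda + 0.1 * (loss1.detach() - eps), min=0.0)

# Task 1 is held near its tolerance while task 2 is optimized as far as the constraint allows.
print(f"loss1={loss1.item():.3f} (eps={eps}), loss2={loss2.item():.3f}, lambda={lmbda.item():.2f}")
```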
[LG-1] LLMs for Cold-Start Cutting Plane Separator Configuration
链接: https://arxiv.org/abs/2412.12038
作者: Connor Lawless,Yingxi Li,Anders Wikum,Madeleine Udell,Ellen Vitercik
关键词: Mixed integer linear, integer linear programming, expert optimization users, Mixed integer, linear programming
类目: Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:Mixed integer linear programming (MILP) solvers ship with a staggering number of parameters that are challenging to select a priori for all but expert optimization users, but can have an outsized impact on the performance of the MILP solver. Existing machine learning (ML) approaches to configure solvers require training ML models by solving thousands of related MILP instances, generalize poorly to new problem sizes, and often require implementing complex ML pipelines and custom solver interfaces that can be difficult to integrate into existing optimization workflows. In this paper, we introduce a new LLM-based framework to configure which cutting plane separators to use for a given MILP problem with little to no training data based on characteristics of the instance, such as a natural language description of the problem and the associated LaTeX formulation. We augment these LLMs with descriptions of cutting plane separators available in a given solver, grounded by summarizing the existing research literature on separators. While individual solver configurations have a large variance in performance, we present a novel ensembling strategy that clusters and aggregates configurations to create a small portfolio of high-performing configurations. Our LLM-based methodology requires no custom solver interface, can find a high-performing configuration by solving only a small number of MILPs, and can generate the configuration with simple API calls that run in under a second. Numerical results show our approach is competitive with existing configuration approaches on a suite of classic combinatorial optimization problems and real-world datasets with only a fraction of the training data and computation time.
[LG-2] LeARN: Learnable and Adaptive Representations for Nonlinear Dynamics in System Identification
链接: https://arxiv.org/abs/2412.12036
作者: Arunabh Singh,Joyjit Mukherjee
关键词: deriving mathematical models, observed input-output data, basis functions, process of deriving, deriving mathematical
类目: Machine Learning (cs.LG); Robotics (cs.RO)
*备注: This work has been submitted to the 7th Annual Learning for Dynamics Control Conference for review
点击查看摘要
Abstract:System identification, the process of deriving mathematical models of dynamical systems from observed input-output data, has undergone a paradigm shift with the advent of learning-based methods. Addressing the intricate challenges of data-driven discovery in nonlinear dynamical systems, these methods have garnered significant attention. Among them, Sparse Identification of Nonlinear Dynamics (SINDy) has emerged as a transformative approach, distilling complex dynamical behaviors into interpretable linear combinations of basis functions. However, SINDy relies on domain-specific expertise to construct its foundational “library” of basis functions, which limits its adaptability and universality. In this work, we introduce a nonlinear system identification framework called LeARN that transcends the need for prior domain knowledge by learning the library of basis functions directly from data. To enhance adaptability to evolving system dynamics under varying noise conditions, we employ a novel meta-learning-based system identification approach that uses a lightweight deep neural network (DNN) to dynamically refine these basis functions. This not only captures intricate system behaviors but also adapts seamlessly to new dynamical regimes. We validate our framework on the Neural Fly dataset, showcasing its robust adaptation and generalization capabilities. Despite its simplicity, our LeARN achieves competitive dynamical error performance compared to SINDy. This work presents a step toward the autonomous discovery of dynamical systems, paving the way for a future where machine learning uncovers the governing principles of complex systems without requiring extensive domain-specific interventions.
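For context, the classical SINDy step that LeARN builds on can be sketched in a few lines (a toy with a hand-written polynomial library, which is exactly the domain-knowledge requirement LeARN aims to remove; the threshold and library choice are assumptions):

```python
import numpy as np

def sindy_fit(X, dXdt, threshold=0.05, n_iter=10):
    """Sequentially thresholded least squares over a fixed candidate library:
    the classical SINDy step that LeARN replaces with a learned, adaptive library."""
    # Candidate library: [1, x, y, x^2, x*y, y^2] for a 2-D state (x, y).
    x, y = X[:, 0], X[:, 1]
    theta = np.column_stack([np.ones_like(x), x, y, x**2, x * y, y**2])
    xi = np.linalg.lstsq(theta, dXdt, rcond=None)[0]
    for _ in range(n_iter):
        xi[np.abs(xi) < threshold] = 0.0
        for k in range(dXdt.shape[1]):                 # refit only the surviving terms
            big = np.abs(xi[:, k]) >= threshold
            if big.any():
                xi[big, k] = np.linalg.lstsq(theta[:, big], dXdt[:, k], rcond=None)[0]
    return xi

# Toy data from a linear oscillator: dx/dt = -y, dy/dt = x, with a little measurement noise.
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(2000, 2))
dXdt = np.column_stack([-X[:, 1], X[:, 0]]) + 0.01 * rng.standard_normal((2000, 2))
print(np.round(sindy_fit(X, dXdt), 2))   # should recover -1 on the y term and +1 on the x term
```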
[LG-3] Thermodynamics-informed graph neural networks for real-time simulation of digital human twins
链接: https://arxiv.org/abs/2412.12034
作者: Lucas Tesán,David González,Pedro Martins,Elías Cueto
关键词: complex biological systems, biological systems, growing importance, field has exposed, exposed the limitations
类目: Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:The growing importance of real-time simulation in the medical field has exposed the limitations and bottlenecks inherent in the digital representation of complex biological systems. This paper presents a novel methodology aimed at advancing current lines of research in soft tissue simulation. The proposed approach introduces a hybrid model that integrates the geometric bias of graph neural networks with the physical bias derived from the imposition of a metriplectic structure as soft and hard constraints in the architecture, being able to simulate hepatic tissue with dissipative properties. This approach provides an efficient solution capable of generating predictions at a high feedback rate while maintaining a remarkable generalization ability for previously unseen anatomies. This makes these features particularly relevant in the context of precision medicine and haptic rendering. Based on the adopted methodologies, we propose a model that predicts human liver responses to traction and compression loads in as little as 7.3 milliseconds for optimized configurations and as fast as 1.65 milliseconds in the most efficient cases, all in the forward pass. The model achieves relative position errors below 0.15%, with stress tensor and velocity estimations maintaining relative errors under 7%. This demonstrates the robustness of the approach developed, which is capable of handling diverse load states and anatomies effectively. This work highlights the feasibility of integrating real-time simulation with patient-specific geometries through deep learning, paving the way for more robust digital human twins in medical applications.
[LG-4] Memory-Reduced Meta-Learning with Guaranteed Convergence AAAI
链接: https://arxiv.org/abs/2412.12030
作者: Honglin Yang,Ji Ma,Xiao Yu
关键词: gaining increased traction, optimization-based meta-learning approaches, existing optimization-based meta-learning, amounts of data, optimization-based meta-learning
类目: Machine Learning (cs.LG); Optimization and Control (math.OC)
*备注: 18 pages, 2 figures; Accepted by the 39th Annual AAAI Conference on Artificial Intelligence (AAAI)
点击查看摘要
Abstract:The optimization-based meta-learning approach is gaining increased traction because of its unique ability to quickly adapt to a new task using only small amounts of data. However, existing optimization-based meta-learning approaches, such as MAML, ANIL and their variants, generally employ backpropagation for upper-level gradient estimation, which requires using historical lower-level parameters/gradients and thus increases computational and memory overhead in each iteration. In this paper, we propose a meta-learning algorithm that can avoid using historical parameters/gradients and significantly reduce memory costs in each iteration compared to existing optimization-based meta-learning approaches. In addition to memory reduction, we prove that our proposed algorithm converges sublinearly with the iteration number of upper-level optimization, and the convergence error decays sublinearly with the batch size of sampled tasks. In the specific case of deterministic meta-learning, we also prove that our proposed algorithm converges to an exact solution. Moreover, we quantify that the computational complexity of the algorithm is on the order of $\mathcal{O}(\epsilon^{-1})$, which matches existing convergence results on meta-learning even without using any historical parameters/gradients. Experimental results on meta-learning benchmarks confirm the efficacy of our proposed algorithm.
[LG-5] Deep-learning-based identification of individual motion characteristics from upper-limb trajectories towards disorder stage evaluation
链接: https://arxiv.org/abs/2412.12016
作者: Tim Sziburis,Susanne Blex,Tobias Glasmachers,Ioannis Iossifidis
关键词: personal rehabilitation progress, individual movement characteristics, movement characteristics sets, provide diagnostic information, movement disorders
类目: Neural and Evolutionary Computing (cs.NE); Machine Learning (cs.LG); Quantitative Methods (q-bio.QM)
*备注:
点击查看摘要
Abstract:The identification of individual movement characteristics sets the foundation for the assessment of personal rehabilitation progress and can provide diagnostic information on levels and stages of movement disorders. This work presents a preliminary study for differentiating individual motion patterns using a dataset of 3D upper-limb transport trajectories measured in task-space. Identifying individuals by deep time series learning can be a key step to abstracting individual motion properties. In this study, a classification accuracy of about 95% is reached for a subset of nine, and about 78% for the full set of 31 individuals. This provides insights into the separability of patient attributes by exerting a simple standardized task to be transferred to portable systems.
[LG-6] Industrial-scale Prediction of Cement Clinker Phases using Machine Learning
链接: https://arxiv.org/abs/2412.11981
作者: Sheikh Junaid Fayaz,Nestor Montiel-Bohorquez,Shashank Bishnoi,Matteo Romano,Manuele Gatti,N. M. Anoop Krishnan
关键词: faces critical challenges, billion tonnes, faces critical, critical challenges, Cement
类目: Machine Learning (cs.LG); Materials Science (cond-mat.mtrl-sci)
*备注:
点击查看摘要
Abstract:Cement production, exceeding 4.1 billion tonnes and contributing about 2.4 billion tonnes of CO2 annually, faces critical challenges in quality control and process optimization. While traditional process models for cement manufacturing are confined to steady-state conditions with limited predictive capability for mineralogical phases, modern plants operate under dynamic conditions that demand real-time quality assessment. Here, exploiting a comprehensive two-year operational dataset from an industrial cement plant, we present a machine learning framework that accurately predicts clinker mineralogy from process data. Our model achieves unprecedented prediction accuracy for major clinker phases while requiring minimal input parameters, demonstrating robust performance under varying operating conditions. Through post-hoc explainable algorithms, we interpret the hierarchical relationships between clinker oxides and phase formation, providing insights into the functioning of an otherwise black-box model. This digital twin framework can potentially enable real-time optimization of cement production, thereby providing a route toward reducing material waste and ensuring quality while reducing the associated emissions under real plant conditions. Our approach represents a significant advancement in industrial process control, offering a scalable solution for sustainable cement manufacturing.
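以下是一个示意性的表格回归 + 事后可解释性流程草图(梯度提升 + 置换重要性),仅用于说明"从过程变量预测熟料矿物相并解释模型"这一思路;论文使用的具体模型、特征与数据并未公开在摘要中,示例中的特征与目标均为合成数据。

```python
# Hedged illustration only: a generic tabular pipeline (gradient boosting plus
# permutation importance) in the spirit of predicting clinker phases from
# process variables; feature names and data here are synthetic.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
X = rng.normal(size=(2000, 6))                                # synthetic "process variables"
y = 3 * X[:, 0] - 2 * X[:, 2] + 0.5 * rng.normal(size=2000)   # synthetic "phase content"

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
model = GradientBoostingRegressor(random_state=0).fit(X_tr, y_tr)
print("R^2 on held-out data:", round(model.score(X_te, y_te), 3))

imp = permutation_importance(model, X_te, y_te, n_repeats=10, random_state=0)
print("permutation importance per feature:", np.round(imp.importances_mean, 3))
```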
[LG-7] AlphaZero Neural Scaling and Zipf's Law: a Tale of Board Games and Power Laws
链接: https://arxiv.org/abs/2412.11979
作者: Oren Neumann,Claudius Gros
关键词: Neural scaling laws, Neural scaling, power law observed, Neural, Zipf law
类目: Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:Neural scaling laws are observed in a range of domains, to date with no clear understanding of why they occur. Recent theories suggest that loss power laws arise from Zipf’s law, a power law observed in domains like natural language. One theory suggests that language scaling laws emerge when Zipf-distributed task quanta are learned in descending order of frequency. In this paper we examine power-law scaling in AlphaZero, a reinforcement learning algorithm, using a theory of language-model scaling. We find that game states in training and inference data scale with Zipf’s law, which is known to arise from the tree structure of the environment, and examine the correlation between scaling-law and Zipf’s-law exponents. In agreement with quanta scaling theory, we find that agents optimize state loss in descending order of frequency, even though this order scales inversely with modelling complexity. We also find that inverse scaling, the failure of models to improve with size, is correlated with unusual Zipf curves where end-game states are among the most frequent states. We show evidence that larger models shift their focus to these less-important states, sacrificing their understanding of important early-game states.
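一个小的示意草图:用 log-log 秩–频率回归估计 Zipf 指数,对应论文中"博弈状态频率服从 Zipf 律"的分析方式;这里的计数是合成数据,仅代替 AlphaZero 的真实状态访问频率。

```python
# Small sketch: estimate a Zipf exponent from state-visit counts by a
# log-log rank-frequency fit; the counts here are synthetic stand-ins for
# AlphaZero game-state frequencies.
import numpy as np

rng = np.random.default_rng(2)
counts = np.sort(rng.zipf(a=2.0, size=10000))[::-1]  # synthetic frequencies
ranks = np.arange(1, counts.size + 1)

# fit log(freq) ~ -alpha * log(rank) + c on the head of the distribution
top = slice(0, 1000)
alpha, c = np.polyfit(np.log(ranks[top]), np.log(counts[top]), 1)
print("estimated Zipf exponent:", round(-alpha, 2))
```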
[LG-8] A Digital twin for Diesel Engines: Operator-infused PINNs with Transfer Learning for Engine Health Monitoring
链接: https://arxiv.org/abs/2412.11967
作者: Kamaljyoti Nath,Varun Kumar,Daniel J. Smith,George Em Karniadakis
关键词: Improving diesel engine, critical research topics, Improving diesel, efficiency and emission, emission reduction
类目: Machine Learning (cs.LG); Systems and Control (eess.SY)
*备注:
点击查看摘要
Abstract:Improving diesel engine efficiency and emission reduction have been critical research topics. Recent government regulations have shifted this focus to another important area related to engine health and performance monitoring. Although the advancements in the use of deep learning methods for system monitoring have shown promising results in this direction, designing efficient methods suitable for field systems remains an open research challenge. The objective of this study is to develop a computationally efficient neural network-based approach for identifying unknown parameters of a mean value diesel engine model to facilitate physics-based health monitoring and maintenance forecasting. We propose a hybrid method combining physics-informed neural networks (PINNs) and a deep neural operator (DeepONet) to predict unknown parameters and gas flow dynamics in a diesel engine. The operator network predicts independent actuator dynamics learnt through offline training, thereby reducing the PINNs' online computational cost. To address the PINNs' need for retraining with changing input scenarios, we propose two transfer learning (TL) strategies. The first strategy involves multi-stage transfer learning for parameter identification. While this method is computationally efficient as compared to online PINN training, improvements are required to meet field requirements. The second TL strategy focuses solely on training the output weights and biases of a subset of multi-head networks pretrained on a larger dataset, substantially reducing computation time during online prediction. We also evaluate our model for epistemic and aleatoric uncertainty by incorporating dropout in pretrained networks and Gaussian noise in the training dataset. This strategy offers a tailored, computationally inexpensive, and physics-based approach for parameter identification in diesel engine subsystems.
[LG-9] Asynchronous Distributed Gaussian Process Regression for Online Learning and Dynamical Systems: Complementary Document
链接: https://arxiv.org/abs/2412.11950
作者: Zewen Yang,Xiaobing Dai,Sandra Hirche
关键词: Asynchronous Distributed Gaussian, Distributed Gaussian Process, Gaussian Process Regression, Asynchronous Distributed, Dynamical Systems
类目: Machine Learning (cs.LG); Systems and Control (eess.SY)
*备注:
点击查看摘要
Abstract:This is a complementary document for the paper titled “Asynchronous Distributed Gaussian Process Regression for Online Learning and Dynamical Systems”.
[LG-10] SPGL: Enhancing Session-based Recommendation with Single Positive Graph Learning ICONIP2024
链接: https://arxiv.org/abs/2412.11846
作者: Tiantian Liang,Zhe Yang
关键词: Session-based recommendation seeks, Session-based recommendation, seeks to forecast, Single Positive optimization, session-based recommendation model
类目: Information Retrieval (cs.IR); Machine Learning (cs.LG)
*备注: ICONIP 2024
点击查看摘要
Abstract:Session-based recommendation seeks to forecast the next item a user will be interested in, based on their interaction sequences. Because the interaction data within each session is limited, session-based recommendation faces the challenge of data sparsity. Traditional methods enhance feature learning by constructing complex models to generate positive and negative samples. This paper proposes a session-based recommendation model using Single Positive optimization loss and Graph Learning (SPGL) to deal with the problem of data sparsity, high model complexity and weak transferability. SPGL utilizes graph convolutional networks to generate global item representations and batch session representations, effectively capturing intrinsic relationships between items. The use of single positive optimization loss improves uniformity of item representations, thereby enhancing recommendation accuracy. In the intent extractor, SPGL considers the hop count of the adjacency matrix when constructing the directed global graph to fully integrate spatial information. It also takes into account the reverse positional information of items when constructing session representations to incorporate temporal information. Comparative experiments across three benchmark datasets, Tmall, RetailRocket and Diginetica, demonstrate the model’s effectiveness. The source code can be accessed at this https URL.
[LG-11] Optimal Gradient Checkpointing for Sparse and Recurrent Architectures using Off-Chip Memory
链接: https://arxiv.org/abs/2412.11810
作者: Wadjih Bencheikh,Jan Finkbeiner,Emre Neftci
关键词: require high memory-processor, high memory-processor bandwidth, tasks involving long, reduced memory requirements, bandwidth to train
类目: Neural and Evolutionary Computing (cs.NE); Hardware Architecture (cs.AR); Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:Recurrent neural networks (RNNs) are valued for their computational efficiency and reduced memory requirements on tasks involving long sequence lengths but require high memory-processor bandwidth to train. Checkpointing techniques can reduce the memory requirements by only storing a subset of intermediate states, the checkpoints, but are still rarely used due to the computational overhead of the additional recomputation phase. This work addresses these challenges by introducing memory-efficient gradient checkpointing strategies tailored for the general class of sparse RNNs and Spiking Neural Networks (SNNs). SNNs are energy efficient alternatives to RNNs thanks to their local, event-driven operation and potential neuromorphic implementation. We use the Intelligence Processing Unit (IPU) as an exemplary platform for architectures with distributed local memory. We exploit its suitability for sparse and irregular workloads to scale SNN training on long sequence lengths. We find that Double Checkpointing emerges as the most effective method, optimizing the use of local memory resources while minimizing recomputation overhead. This approach reduces dependency on slower large-scale memory access, enabling training on sequences over 10 times longer or networks 4 times larger than previously feasible, with only marginal time overhead. The presented techniques demonstrate significant potential to enhance scalability and efficiency in training sparse and recurrent networks across diverse hardware platforms, and highlight the benefits of sparse activations for scalable recurrent neural network training.
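下面是一个普通梯度检查点(gradient checkpointing)在循环模型上的最小示例,用 PyTorch 自带工具按段重计算中间状态以换取内存;论文提出的 Double Checkpointing 与 IPU 的本地/片外内存放置策略在此并未复现,模型与数据均为假设。

```python
# Sketch of plain gradient checkpointing on a recurrent model with PyTorch's
# built-in utility; the paper's Double Checkpointing and IPU-specific memory
# placement are not reproduced here.
import torch
from torch.utils.checkpoint import checkpoint

rnn_cell = torch.nn.GRUCell(input_size=32, hidden_size=64)
readout = torch.nn.Linear(64, 1)

def run_segment(h, segment):
    # this segment's hidden states are recomputed during backward, not stored
    for t in range(segment.shape[0]):
        h = rnn_cell(segment[t], h)
    return h

x = torch.randn(1024, 8, 32)          # (time, batch, features)
h = torch.zeros(8, 64)
for seg in torch.chunk(x, chunks=16, dim=0):   # one checkpoint every 64 steps
    h = checkpoint(run_segment, h, seg, use_reentrant=False)

loss = readout(h).pow(2).mean()
loss.backward()
print("grad norm of GRU input weights:", rnn_cell.weight_ih.grad.norm().item())
```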
[LG-12] Scalable Temporal Anomaly Causality Discovery in Large Systems: Achieving Computational Efficiency with Binary Anomaly Flag Data
链接: https://arxiv.org/abs/2412.11800
作者: Mulugeta Weldezgina Asres,Christian Walter Omlin, TheCMS-HCAL Collaboration
关键词: Extracting anomaly causality, detect system faults, causality facilitates diagnostics, Extracting anomaly, systems detect system
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: 30 pages, 17 figures, 9 tables
点击查看摘要
Abstract:Extracting anomaly causality facilitates diagnostics once monitoring systems detect system faults. Identifying anomaly causes in large systems involves investigating a more extensive set of monitoring variables across multiple subsystems. However, learning causal graphs comes with a significant computational burden that restrains the applicability of most existing methods in real-time and large-scale deployments. In addition, modern monitoring applications for large systems often generate large amounts of binary alarm flags, and the distinct characteristics of binary anomaly data – the meaning of state transition and data sparsity – challenge existing causality learning mechanisms. This study proposes an anomaly causal discovery approach (AnomalyCD), addressing the accuracy and computational challenges of generating causal graphs from binary flag data sets. The AnomalyCD framework presents several strategies, such as anomaly flag characteristics incorporating causality testing, sparse data and link compression, and edge pruning adjustment approaches. We validate the performance of this framework on two datasets: monitoring sensor data of the readout-box system of the Compact Muon Solenoid experiment at CERN, and a public data set for information technology monitoring. The results demonstrate the considerable reduction of the computation overhead and moderate enhancement of the accuracy of temporal causal discovery on binary anomaly data sets.
[LG-13] Fast and Slow Gradient Approximation for Binary Neural Network Optimization AAAI2025
链接: https://arxiv.org/abs/2412.11777
作者: Xinquan Chen,Junqi Gao,Biqing Qi,Dong Li,Yiang Luo,Fangyuan Li,Pengfei Li
关键词: Binary Neural Networks, garnered significant attention, Binary Neural, significant attention due, Neural Networks
类目: Machine Learning (cs.LG)
*备注: Accepted to AAAI 2025
点击查看摘要
Abstract:Binary Neural Networks (BNNs) have garnered significant attention due to their immense potential for deployment on edge devices. However, the non-differentiability of the quantization function poses a challenge for the optimization of BNNs, as its derivative cannot be backpropagated. To address this issue, hypernetwork based methods, which utilize neural networks to learn the gradients of non-differentiable quantization functions, have emerged as a promising approach due to their adaptive learning capabilities to reduce estimation errors. However, existing hypernetwork based methods typically rely solely on current gradient information, neglecting the influence of historical gradients. This oversight can lead to accumulated gradient errors when calculating gradient momentum during optimization. To incorporate historical gradient information, we design a Historical Gradient Storage (HGS) module, which models the historical gradient sequence to generate the first-order momentum required for optimization. To further enhance gradient generation in hypernetworks, we propose a Fast and Slow Gradient Generation (FSG) method. Additionally, to produce more precise gradients, we introduce Layer Recognition Embeddings (LRE) into the hypernetwork, facilitating the generation of layer-specific fine gradients. Extensive comparative experiments on the CIFAR-10 and CIFAR-100 datasets demonstrate that our method achieves faster convergence and lower loss values, outperforming existing methods. Code is available at this http URL.
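作为对照,下面是经典的直通估计器(straight-through estimator, STE)的最小实现——即论文中基于超网络的梯度生成方法(如 FSG)所要改进的基线;代码仅为示意,与论文实现无关。

```python
# Baseline sketch only: the classic straight-through estimator (STE) that
# hypernetwork-based gradient methods such as FSG aim to improve upon.
import torch

class BinarizeSTE(torch.autograd.Function):
    @staticmethod
    def forward(ctx, w):
        ctx.save_for_backward(w)
        return torch.sign(w)

    @staticmethod
    def backward(ctx, grad_out):
        (w,) = ctx.saved_tensors
        # pass the gradient through, clipped to the linear region |w| <= 1
        return grad_out * (w.abs() <= 1).float()

w = torch.randn(4, 4, requires_grad=True)
x = torch.randn(8, 4)
y = x @ BinarizeSTE.apply(w)           # forward uses binary weights
y.pow(2).mean().backward()             # backward uses the surrogate gradient
print("surrogate gradient:\n", w.grad)
```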
[LG-14] What Matters in Learning A Zero-Shot Sim-to-Real RL Policy for Quadrotor Control? A Comprehensive Study
链接: https://arxiv.org/abs/2412.11764
作者: Jiayu Chen,Chao Yu,Yuqing Xie,Feng Gao,Yinuo Chen,Shu’ang Yu,Wenhao Tang,Shilong Ji,Mo Mu,Yi Wu,Huazhong Yang,Yu Wang
关键词: agile flight maneuvers, Executing precise, precise and agile, agile flight, flight maneuvers
类目: Robotics (cs.RO); Machine Learning (cs.LG)
*备注: The first two authors contribute equally
点击查看摘要
Abstract:Executing precise and agile flight maneuvers is critical for quadrotors in various applications. Traditional quadrotor control approaches are limited by their reliance on flat trajectories or time-consuming optimization, which restricts their flexibility. Recently, RL-based policy has emerged as a promising alternative due to its ability to directly map observations to actions, reducing the need for detailed system knowledge and actuation constraints. However, a significant challenge remains in bridging the sim-to-real gap, where RL-based policies often experience instability when deployed in the real world. In this paper, we investigate key factors for learning robust RL-based control policies that are capable of zero-shot deployment in real-world quadrotors. We identify five critical factors and we develop a PPO-based training framework named SimpleFlight, which integrates these five techniques. We validate the efficacy of SimpleFlight on the Crazyflie quadrotor, demonstrating that it achieves more than a 50% reduction in trajectory tracking error compared to state-of-the-art RL baselines, and achieves a 70% improvement over the traditional MPC. The policy derived by SimpleFlight consistently excels across both smooth polynomial trajectories and challenging infeasible zigzag trajectories on small thrust-to-weight quadrotors. In contrast, baseline methods struggle with high-speed or infeasible trajectories. To support further research and reproducibility, we integrate SimpleFlight into a GPU-based simulator Omnidrones and provide open-source access to the code and model checkpoints. We hope SimpleFlight will offer valuable insights for advancing RL-based quadrotor control. For more details, visit our project website at this https URL.
[LG-15] Asymmetric Learning for Spectral Graph Neural Networks
链接: https://arxiv.org/abs/2412.11739
作者: Fangbing Liu,Qing Wang
关键词: Optimizing spectral graph, graph neural networks, Optimizing spectral, neural networks, remains a critical
类目: Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:Optimizing spectral graph neural networks (GNNs) remains a critical challenge in the field, yet the underlying processes are not well understood. In this paper, we investigate the inherent differences between graph convolution parameters and feature transformation parameters in spectral GNNs and their impact on the optimization landscape. Our analysis reveals that these differences contribute to a poorly conditioned problem, resulting in suboptimal performance. To address this issue, we introduce the concept of the block condition number of the Hessian matrix, which characterizes the difficulty of poorly conditioned problems in spectral GNN optimization. We then propose an asymmetric learning approach, dynamically preconditioning gradients during training to alleviate poorly conditioned problems. Theoretically, we demonstrate that asymmetric learning can reduce block condition numbers, facilitating easier optimization. Extensive experiments on eighteen benchmark datasets show that asymmetric learning consistently improves the performance of spectral GNNs for both heterophilic and homophilic graphs. This improvement is especially notable for heterophilic graphs, where the optimization process is generally more complex than for homophilic graphs. Code is available at this https URL.
[LG-16] Efficiently Achieving Secure Model Training and Secure Aggregation to Ensure Bidirectional Privacy-Preservation in Federated Learning
链接: https://arxiv.org/abs/2412.11737
作者: Xue Yang,Depan Peng,Yan Feng,Xiaohu Tang,Weijun Fang,Jun Shao
关键词: model, accuracy, privacy, model accuracy, local gradients
类目: Machine Learning (cs.LG); Cryptography and Security (cs.CR)
*备注:
点击查看摘要
Abstract:Bidirectional privacy-preservation federated learning is crucial as both local gradients and the global model may leak privacy. However, only a few works attempt to achieve it, and they often face challenges such as excessive communication and computational overheads, or significant degradation of model accuracy, which hinders their practical applications. In this paper, we design an efficient and high-accuracy bidirectional privacy-preserving scheme for federated learning to complete secure model training and secure aggregation. To efficiently achieve bidirectional privacy, we design an efficient and accuracy-lossless model perturbation method on the server side (called $\mathbf{MP}_{Server}$) that can be combined with local differential privacy (LDP) to prevent clients from accessing the model, while ensuring that the local gradients obtained on the server side satisfy LDP. Furthermore, to ensure model accuracy, we customize a distributed differential privacy mechanism on the client side (called $\mathbf{DDP}_{Client}$). When combined with $\mathbf{MP}_{Server}$, it ensures LDP of the local gradients, while ensuring that the aggregated result matches the accuracy of central differential privacy (CDP). Extensive experiments demonstrate that our scheme significantly outperforms state-of-the-art bidirectional privacy-preservation baselines (SOTAs) in terms of computational cost, model accuracy, and defense ability against privacy attacks. Particularly, given target accuracy, the training time of SOTAs is approximately 200 times, or even over 1000 times, longer than that of our scheme. When the privacy budget is set relatively small, our scheme incurs less than 6% accuracy loss compared to the privacy-ignoring method, while SOTAs suffer up to 20% accuracy loss. Experimental results also show that the defense capability of our scheme outperforms that of the SOTAs.
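下面是客户端 LDP 最常见形式(裁剪梯度范数后加高斯噪声)的一个小草图,用来说明"本地梯度满足 LDP、服务器只见噪声梯度"的基本思路;论文中的 $\mathbf{MP}_{Server}$ 与 $\mathbf{DDP}_{Client}$ 机制比这复杂得多,示例参数均为假设。

```python
# Hedged sketch: client-side Gaussian-mechanism noise on a clipped gradient
# vector. The paper's MP_Server perturbation and DDP_Client mechanism are
# more involved than this; clip_norm and sigma here are illustrative only.
import numpy as np

def ldp_gaussian(grad, clip_norm=1.0, sigma=0.8, rng=np.random.default_rng(0)):
    norm = np.linalg.norm(grad)
    clipped = grad * min(1.0, clip_norm / max(norm, 1e-12))
    return clipped + rng.normal(scale=sigma * clip_norm, size=grad.shape)

client_grads = [np.random.default_rng(i).normal(size=10) for i in range(5)]
noisy = [ldp_gaussian(g) for g in client_grads]
aggregate = np.mean(noisy, axis=0)     # the server only ever sees noisy gradients
print("aggregated update:", np.round(aggregate, 3))
```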
[LG-17] CiTrus: Squeezing Extra Performance out of Low-data Bio-signal Transfer Learning
链接: https://arxiv.org/abs/2412.11695
作者: Eloy Geenjaar,Lie Lu
关键词: bio-signal transfer learning, EMG or ECG, improve prediction performance, small bio-signal datasets, Transfer learning
类目: Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:Transfer learning for bio-signals has recently become an important technique to improve prediction performance on downstream tasks with small bio-signal datasets. Recent works have shown that pre-training a neural network model on a large dataset (e.g. EEG) with a self-supervised task, replacing the self-supervised head with a linear classification head, and fine-tuning the model on different downstream bio-signal datasets (e.g., EMG or ECG) can dramatically improve the performance on those datasets. In this paper, we propose a new convolution-transformer hybrid model architecture with masked auto-encoding for low-data bio-signal transfer learning, introduce a frequency-based masked auto-encoding task, employ a more comprehensive evaluation framework, and evaluate how much and when (multimodal) pre-training improves fine-tuning performance. We also introduce a dramatically more performant method of aligning a downstream dataset with a different temporal length and sampling rate to the original pre-training dataset. Our findings indicate that the convolution-only part of our hybrid model can achieve state-of-the-art performance on some low-data downstream tasks. The performance is often improved even further with our full model. In the case of transformer-based models we find that pre-training especially improves performance on downstream datasets, multimodal pre-training often increases those gains further, and our frequency-based pre-training performs the best on average for the lowest and highest data regimes.
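下面是"频域掩码"这一预训练目标的最小示意:随机丢弃部分 FFT 频点后重建信号,自编码器的训练目标就是从被掩码的输入恢复原始信号;论文中的模型结构、掩码比例与多模态设置这里均未体现,信号为合成数据。

```python
# Minimal sketch of a frequency-domain masking step (mask random FFT bins and
# reconstruct), in the spirit of the paper's frequency-based masked
# auto-encoding target; the actual model and masking ratio are not shown.
import numpy as np

rng = np.random.default_rng(3)
signal = np.sin(np.linspace(0, 20 * np.pi, 1024)) + 0.1 * rng.normal(size=1024)

spectrum = np.fft.rfft(signal)
mask = rng.random(spectrum.size) < 0.5        # drop roughly 50% of frequency bins
masked_signal = np.fft.irfft(spectrum * ~mask, n=signal.size)

# an auto-encoder would be trained to recover `signal` from `masked_signal`
print("masked-input MSE vs. original:", round(np.mean((signal - masked_signal) ** 2), 4))
```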
[LG-18] Just a Simple Transformation is Enough for Data Protection in Vertical Federated Learning
链接: https://arxiv.org/abs/2412.11689
作者: Andrei Semenov,Philip Zmushko,Alexander Pichugin,Aleksandr Beznosikov
关键词: Vertical Federated Learning, Vertical Federated, Federated Learning, enable collaborative training, deep learning models
类目: Machine Learning (cs.LG); Cryptography and Security (cs.CR)
*备注: 29 pages, 12 figures, 3 tables
点击查看摘要
Abstract:Vertical Federated Learning (VFL) aims to enable collaborative training of deep learning models while maintaining privacy protection. However, the VFL procedure still has components that are vulnerable to attacks by malicious parties. In our work, we consider feature reconstruction attacks, a common risk targeting input data compromise. We theoretically claim that feature reconstruction attacks cannot succeed without knowledge of the prior distribution on data. Consequently, we demonstrate that even simple model architecture transformations can significantly impact the protection of input data during VFL. Confirming these findings with experimental results, we show that MLP-based models are resistant to state-of-the-art feature reconstruction attacks.
[LG-19] Dual Unscented Kalman Filter Architecture for Sensor Fusion in Water Networks Leak Localization
链接: https://arxiv.org/abs/2412.11687
作者: Luis Romero-Ben,Paul Irofti,Florin Stoican,Vicenç Puig
关键词: degrading service quality, daily water losses, aggravating environmental problems, significant daily water, water systems results
类目: ystems and Control (eess.SY); Machine Learning (cs.LG); Statistics Theory (math.ST)
*备注:
点击查看摘要
Abstract:Leakage in water systems results in significant daily water losses, degrading service quality, increasing costs, and aggravating environmental problems. Most leak localization methods rely solely on pressure data, missing valuable information from other sensor types. This article proposes a hydraulic state estimation methodology based on a dual Unscented Kalman Filter (UKF) approach, which enhances the estimation of both nodal hydraulic heads, critical in localization tasks, and pipe flows, useful for operational purposes. The approach enables the fusion of different sensor types, such as pressure, flow and demand meters. The strategy is evaluated in well-known open source case studies, namely Modena and L-TOWN, showing improvements over other state-of-the-art estimation approaches in terms of interpolation accuracy, as well as more precise leak localization performance in L-TOWN.
[LG-20] Multimodal LLM for Intelligent Transportation Systems
链接: https://arxiv.org/abs/2412.11683
作者: Dexter Le,Aybars Yunusoglu,Karn Tiwari,Murat Isik,I. Can Dikmen
关键词: integrating Large Language, Large Language Models, Large Language, integrating Large, Language Models
类目: Machine Learning (cs.LG)
*备注: Accepted at IEEE Symposium Series on Computational Intelligence (SSCI) 2025
点击查看摘要
Abstract:In the evolving landscape of transportation systems, integrating Large Language Models (LLMs) offers a promising frontier for advancing intelligent decision-making across various applications. This paper introduces a novel 3-dimensional framework that encapsulates the intersection of applications, machine learning methodologies, and hardware devices, particularly emphasizing the role of LLMs. Instead of using multiple machine learning algorithms, our framework uses a single, data-centric LLM architecture that can analyze time series, images, and videos. We explore how LLMs can enhance data interpretation and decision-making in transportation. We apply this LLM framework to different sensor datasets, including time-series data and visual data from sources like Oxford Radar RobotCar, D-Behavior (D-Set), nuScenes by Motional, and Comma2k19. The goal is to streamline data processing workflows, reduce the complexity of deploying multiple models, and make intelligent transportation systems more efficient and accurate. The study was conducted using state-of-the-art hardware, leveraging the computational power of AMD RTX 3060 GPUs and Intel i9-12900 processors. The experimental results demonstrate that our framework achieves an average accuracy of 91.33% across these datasets, with the highest accuracy observed in time-series data (92.7%), showcasing the model’s proficiency in handling sequential information essential for tasks such as motion planning and predictive maintenance. Through our exploration, we demonstrate the versatility and efficacy of LLMs in handling multimodal data within the transportation sector, ultimately providing insights into their application in real-world scenarios. Our findings align with the broader conference themes, highlighting the transformative potential of LLMs in advancing transportation technologies.
[LG-21] Non-Convex Optimization in Federated Learning via Variance Reduction and Adaptive Learning AAAI2025
链接: https://arxiv.org/abs/2412.11660
作者: Dipanwita Thakur,Antonella Guzzo,Giancarlo Fortino,Sajal K. Das
关键词: leverages momentum-based variance, momentum-based variance reduction, heterogeneous data, address non-convex settings, paper proposes
类目: Machine Learning (cs.LG)
*备注: FLUID Workshop@AAAI 2025
点击查看摘要
Abstract:This paper proposes a novel federated algorithm that leverages momentum-based variance reduction with adaptive learning to address non-convex settings across heterogeneous data. We intend to minimize communication and computation overhead, thereby fostering a sustainable federated learning system. We aim to overcome challenges related to gradient variance, which hinders the model’s efficiency, and the slow convergence resulting from learning rate adjustments with heterogeneous data. The experimental results on the image classification tasks with heterogeneous data reveal the effectiveness of our suggested algorithms in non-convex settings with an improved communication complexity of $\mathcal{O}(\epsilon^{-1})$ to converge to an $\epsilon$-stationary point, compared to the existing communication complexity $\mathcal{O}(\epsilon^{-2})$ of most prior works. The proposed federated version maintains the trade-off between the convergence rate, number of communication rounds, and test accuracy while mitigating the client drift in heterogeneous settings. The experimental results demonstrate the efficiency of our algorithms in image classification tasks (MNIST, CIFAR-10) with heterogeneous data.
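作为背景,下面在一个玩具随机二次问题上给出 STORM 式"基于动量的方差缩减"梯度估计的更新形式;论文提出的联邦版本(含自适应学习率与客户端/服务器结构)并未在此复现,所有数值均为示意。

```python
# Sketch of a STORM-style momentum-based variance-reduced gradient estimator
# on a toy stochastic quadratic; the paper's federated, adaptive-learning-rate
# version and its client/server structure are not reproduced here.
import numpy as np

rng = np.random.default_rng(4)
def stoch_grad(x, noise):
    # gradient of 0.5*||x||^2 evaluated with a shared noise sample
    return x + noise

x = np.ones(10)
d = stoch_grad(x, 0.1 * rng.normal(size=x.shape))   # initial estimate
lr, a = 0.1, 0.3
for t in range(200):
    x_new = x - lr * d
    noise = 0.1 * rng.normal(size=x.shape)          # one fresh sample per step
    # STORM update: d <- grad(x_new; xi) + (1 - a) * (d - grad(x_old; xi))
    d = stoch_grad(x_new, noise) + (1.0 - a) * (d - stoch_grad(x, noise))
    x = x_new
print("final distance to optimum:", round(float(np.linalg.norm(x)), 4))
```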
[LG-22] Private Yet Social: How LLM Chatbots Support and Challenge Eating Disorder Recovery
链接: https://arxiv.org/abs/2412.11656
作者: Ryuhaerang Choi,Taehan Kim,Subin Park,Jennifer G Kim,Sung-Ju Lee
关键词: complex mental health, mental health conditions, Eating disorders, require long-term management, complex mental
类目: Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:Eating disorders (ED) are complex mental health conditions that require long-term management and support. Recent advancements in large language model (LLM)-based chatbots offer the potential to assist individuals in receiving immediate support. Yet, concerns remain about their reliability and safety in sensitive contexts such as ED. We explore the opportunities and potential harms of using LLM-based chatbots for ED recovery. We observe the interactions between 26 participants with ED and an LLM-based chatbot, WellnessBot, designed to support ED recovery, over 10 days. We discovered that our participants have felt empowered in recovery by discussing ED-related stories with the chatbot, which served as a personal yet social avenue. However, we also identified harmful chatbot responses, especially concerning individuals with ED, that went unnoticed partly due to participants’ unquestioning trust in the chatbot’s reliability. Based on these findings, we provide design implications for safe and effective LLM-based interventions in ED management.
[LG-23] BA-BFL: Barycentric Aggregation for Bayesian Federated Learning
链接: https://arxiv.org/abs/2412.11646
作者: Nour Jamoussi,Giuseppe Serra,Photios A. Stavrou,Marios Kountouris
关键词: Bayesian Federated Learning, Bayesian Federated, Federated Learning, BFL aggregation step, Bayesian Deep Learning
类目: Machine Learning (cs.LG); Information Theory (cs.IT); Networking and Internet Architecture (cs.NI)
*备注:
点击查看摘要
Abstract:In this work, we study the problem of aggregation in the context of Bayesian Federated Learning (BFL). Using an information geometric perspective, we interpret the BFL aggregation step as finding the barycenter of the trained posteriors for a pre-specified divergence metric. We study the barycenter problem for the parametric family of $\alpha$-divergences and, focusing on the standard case of independent and Gaussian distributed parameters, we recover the closed-form solution of the reverse Kullback-Leibler barycenter and develop the analytical form of the squared Wasserstein-2 barycenter. Considering a non-IID setup, where clients possess heterogeneous data, we analyze the performance of the developed algorithms against state-of-the-art (SOTA) Bayesian aggregation methods in terms of accuracy, uncertainty quantification (UQ), model calibration (MC), and fairness. Finally, we extend our analysis to the framework of Hybrid Bayesian Deep Learning (HBDL), where we study how the number of Bayesian layers in the architecture impacts the considered performance metrics. Our experimental results show that the proposed methodology presents comparable performance with the SOTA while offering a geometric interpretation of the aggregation phase.
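一个小草图,展示"独立高斯后验"这一特殊情形下的 Wasserstein-2 重心的封闭形式:当各客户端后验协方差为对角(两两可交换)时,重心的均值是加权平均均值,标准差是加权平均标准差。这是该特例下的一般事实,不是论文代码;权重与数值均为示例。

```python
# Sketch (assumes independent, diagonal-covariance Gaussian posteriors, i.e.
# commuting covariances): the squared Wasserstein-2 barycenter then reduces to
# weight-averaging the means and the standard deviations.
import numpy as np

def w2_barycenter_diag(means, stds, weights):
    means, stds = np.asarray(means), np.asarray(stds)
    w = np.asarray(weights)[:, None]
    return (w * means).sum(axis=0), (w * stds).sum(axis=0)

# three clients' posteriors over the same 4 parameters (toy numbers)
means = [[0.1, 0.2, 0.0, -0.3], [0.0, 0.1, 0.1, -0.2], [0.2, 0.3, -0.1, -0.4]]
stds  = [[0.5, 0.4, 0.6, 0.5], [0.6, 0.5, 0.5, 0.4], [0.4, 0.4, 0.7, 0.6]]
mu_bar, sigma_bar = w2_barycenter_diag(means, stds, weights=[0.5, 0.3, 0.2])
print("barycenter mean:", mu_bar)
print("barycenter std:", sigma_bar)
```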
[LG-24] A Mapper Algorithm with implicit intervals and its optimization
链接: https://arxiv.org/abs/2412.11631
作者: Yuyang Tao,Shufei Ge
关键词: high dimensional data, topology data analysis, Mapper algorithm, standard Mapper algorithm, Mapper
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:
点击查看摘要
Abstract:The Mapper algorithm is an essential tool for visualizing complex, high dimensional data in topology data analysis (TDA) and has been widely used in biomedical research. It outputs a combinatorial graph whose structure implies the shape of the data. However, the need for manual parameter tuning and fixed intervals, along with fixed overlapping ratios, may impede the performance of the standard Mapper algorithm. Variants of the standard Mapper algorithm have been developed to address these limitations, yet most of them still require manual tuning of parameters. Additionally, many of these variants, including the standard version found in the literature, were built within a deterministic framework and overlooked the uncertainty inherent in the data. To relax these limitations, in this work, we introduce a novel framework that implicitly represents intervals through a hidden assignment matrix, enabling automatic parameter optimization via stochastic gradient descent. Specifically, we develop a soft Mapper framework based on a Gaussian mixture model (GMM) for flexible and implicit interval construction. We further illustrate the robustness of the soft Mapper algorithm by introducing the Mapper graph mode as a point estimation for the output graph. Moreover, a stochastic gradient descent algorithm with a specific topological loss function is proposed for optimizing parameters in the model. Both simulation and application studies demonstrate its effectiveness in capturing the underlying topological structures. In addition, the application to an RNA expression dataset obtained from the Mount Sinai/JJ Peters VA Medical Center Brain Bank (MSBB) successfully identifies a distinct subgroup of Alzheimer’s Disease.
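下面是"隐式区间"思想的一个粗略示意:用一维高斯混合在 filter 值上给出软隶属度以取代固定重叠区间,再在每个软区间内聚类得到 Mapper 节点。论文中的隐藏指派矩阵、基于梯度的参数优化与拓扑损失函数在此并未实现,阈值与数据均为假设。

```python
# Rough sketch of the "implicit intervals" idea: replace Mapper's fixed
# overlapping intervals on a filter value with soft memberships from a 1-D
# Gaussian mixture, then cluster within each soft bin. The paper's hidden
# assignment matrix, SGD tuning, and topological loss are omitted here.
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(5)
X = rng.normal(size=(500, 3))
filter_value = X[:, 0].reshape(-1, 1)          # a simple 1-D lens/filter

gmm = GaussianMixture(n_components=4, random_state=0).fit(filter_value)
membership = gmm.predict_proba(filter_value)   # soft "interval" assignments

nodes = []
for k in range(gmm.n_components):
    idx = np.flatnonzero(membership[:, k] > 0.2)   # points softly in bin k
    if idx.size == 0:
        continue
    labels = DBSCAN(eps=0.8, min_samples=5).fit_predict(X[idx])
    for lab in set(labels) - {-1}:
        nodes.append(idx[labels == lab])            # one Mapper node per cluster
print("number of Mapper nodes:", len(nodes))
```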
[LG-25] QPruner: Probabilistic Decision Quantization for Structured Pruning in Large Language Models
链接: https://arxiv.org/abs/2412.11629
作者: Changhai Zhou,Yuhua Zhou,Shijie Han,Qian Qiao,Hongguang Li
关键词: natural language processing, large language models, language processing, large language, natural language
类目: Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:The rise of large language models (LLMs) has significantly advanced various natural language processing (NLP) tasks. However, the resource demands of these models pose substantial challenges. Structured pruning is an effective approach to reducing model size, but it often results in significant accuracy degradation, necessitating parameter updates to adapt. Unfortunately, such fine-tuning requires substantial memory, which limits its applicability. To address these challenges, we introduce quantization into the structured pruning framework to reduce memory consumption during both fine-tuning and inference. However, the combined errors from pruning and quantization increase the difficulty of fine-tuning, requiring a more refined quantization scheme. To this end, we propose QPruner, a novel framework that employs structured pruning to reduce model size, followed by a layer-wise mixed-precision quantization scheme. Quantization precisions are assigned to each layer based on their importance to the target task, and Bayesian optimization is employed to refine precision allocation strategies, ensuring a balance between model accuracy and memory efficiency. Extensive experiments on benchmark datasets demonstrate that QPruner significantly outperforms existing methods in memory savings while maintaining or improving model performance.
[LG-26] THESAURUS: Contrastive Graph Clustering by Swapping Fused Gromov-Wasserstein Couplings AAAI2025
链接: https://arxiv.org/abs/2412.11550
作者: Bowen Deng,Tong Wang,Lele Fu,Sheng Huang,Chuan Chen,Tao Zhang
关键词: fundamental unsupervised task, cluster, fundamental unsupervised, cluster separability, Uniform Effect
类目: Machine Learning (cs.LG)
*备注: Accepted by AAAI 2025
点击查看摘要
Abstract:Graph node clustering is a fundamental unsupervised task. Existing methods typically train an encoder through self-supervised learning and then apply K-means to the encoder output. Some methods use this clustering result directly as the final assignment, while others initialize centroids based on this initial clustering and then fine-tune both the encoder and these learnable centroids. However, due to their reliance on K-means, these methods inherit its drawbacks when the cluster separability of encoder output is low, facing challenges from the Uniform Effect and Cluster Assimilation. We summarize three reasons for the low cluster separability in existing methods: (1) lack of contextual information prevents discrimination between similar nodes from different clusters; (2) training tasks are not sufficiently aligned with the downstream clustering task; (3) the cluster information in the graph structure is not appropriately exploited. To address these issues, we propose conTrastive grapH clustEring by SwApping fUsed gRomov-wasserstein coUplingS (THESAURUS). Our method introduces semantic prototypes to provide contextual information, and employs a cross-view assignment prediction pretext task that aligns well with the downstream clustering task. Additionally, it utilizes Gromov-Wasserstein Optimal Transport (GW-OT) along with the proposed prototype graph to thoroughly exploit cluster information in the graph structure. To adapt to diverse real-world data, THESAURUS updates the prototype graph and the prototype marginal distribution in OT by using momentum. Extensive experiments demonstrate that THESAURUS achieves higher cluster separability than the prior art, effectively mitigating the Uniform Effect and Cluster Assimilation issues.
[LG-27] Probability-Informed Machine Learning
链接: https://arxiv.org/abs/2412.11526
作者: Mohsen Rashki
关键词: tackling complex regression, Machine learning, powerful tool, tool for tackling, tackling complex
类目: Machine Learning (cs.LG); Probability (math.PR)
*备注:
点击查看摘要
Abstract:Machine learning (ML) has emerged as a powerful tool for tackling complex regression and classification tasks, yet its success often hinges on the quality of training data. This study introduces a novel ML paradigm inspired by domain knowledge of the structure of the output function, akin to physics-informed ML, but rooted in probabilistic principles rather than physical laws. The proposed approach integrates the probabilistic structure of the target variable (such as its cumulative distribution function) into the training process. This probabilistic information is obtained from historical data or estimated using structural reliability methods during experimental design. By embedding domain-specific probabilistic insights into the learning process, the method enhances model accuracy and mitigates risks of overfitting and underfitting. Applications in regression, image denoising, and classification demonstrate the effectiveness of the approach in addressing real-world problems.
[LG-28] On the Ability of Deep Networks to Learn Symmetries from Data: A Neural Kernel Theory
链接: https://arxiv.org/abs/2412.11521
作者: Andrea Perin,Stephane Deny
关键词: holds significant promise, leveraging them holds, holds significant, significant promise, promise for improving
类目: Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:Symmetries (transformations by group actions) are present in many datasets, and leveraging them holds significant promise for improving predictions in machine learning. In this work, we aim to understand when and how deep networks can learn symmetries from data. We focus on a supervised classification paradigm where data symmetries are only partially observed during training: some classes include all transformations of a cyclic group, while others include only a subset. We ask: can deep networks generalize symmetry invariance to the partially sampled classes? In the infinite-width limit, where kernel analogies apply, we derive a neural kernel theory of symmetry learning to address this question. The group-cyclic nature of the dataset allows us to analyze the spectrum of neural kernels in the Fourier domain; here we find a simple characterization of the generalization error as a function of the interaction between class separation (signal) and class-orbit density (noise). We observe that generalization can only be successful when the local structure of the data prevails over its non-local, symmetric, structure, in the kernel space defined by the architecture. This occurs when (1) classes are sufficiently distinct and (2) class orbits are sufficiently dense. Our framework also applies to equivariant architectures (e.g., CNNs), and recovers their success in the special case where the architecture matches the inherent symmetry of the data. Empirically, our theory reproduces the generalization failure of finite-width networks (MLP, CNN, ViT) trained on partially observed versions of rotated-MNIST. We conclude that conventional networks trained with supervision lack a mechanism to learn symmetries that have not been explicitly embedded in their architecture a priori. Our framework could be extended to guide the design of architectures and training procedures able to learn symmetries from data.
[LG-29] Constructing Confidence Intervals for Average Treatment Effects from Multiple Datasets
链接: https://arxiv.org/abs/2412.11511
作者: Yuxin Wang,Maresa Schröder,Dennis Frauen,Jonas Schweisthal,Konstantin Hess,Stefan Feuerriegel
关键词: average treatment effect, Constructing confidence intervals, multiple observational datasets, observational datasets, confidence intervals
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:
点击查看摘要
Abstract:Constructing confidence intervals (CIs) for the average treatment effect (ATE) from patient records is crucial to assess the effectiveness and safety of drugs. However, patient records typically come from different hospitals, thus raising the question of how multiple observational datasets can be effectively combined for this purpose. In our paper, we propose a new method that estimates the ATE from multiple observational datasets and provides valid CIs. Our method makes few assumptions about the observational datasets and is thus widely applicable in medical practice. The key idea of our method is that we leverage prediction-powered inferences and thereby essentially 'shrink' the CIs so that we offer more precise uncertainty quantification as compared to naïve approaches. We further prove the unbiasedness of our method and the validity of our CIs. We confirm our theoretical results through various numerical experiments. Finally, we provide an extension of our method for constructing CIs from combinations of experimental and observational datasets.
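为说明"prediction-powered inference"的基本想法,下面给出其在估计一个简单均值时的草图(而非论文中跨医院的 ATE 估计):用小的有标注样本校正模型在大规模无标注样本上的预测偏差,再用正态近似构造置信区间;模型 `f` 与数据均为合成示例。

```python
# Hedged sketch of the prediction-powered idea for a simple mean (not the full
# ATE estimator across hospitals): predictions on a large unlabeled set are
# debiased with a small labeled set, and a normal-approximation CI follows.
import numpy as np
from scipy import stats

rng = np.random.default_rng(6)
f = lambda x: 0.9 * x + 0.3                        # an imperfect predictive model

x_lab = rng.normal(size=200)                       # small labeled sample
y_lab = x_lab + rng.normal(scale=0.5, size=200)
x_unlab = rng.normal(size=5000)                    # large unlabeled sample

rectifier = y_lab - f(x_lab)                       # measured model bias
theta_pp = f(x_unlab).mean() + rectifier.mean()    # corrected point estimate
se = np.sqrt(f(x_unlab).var(ddof=1) / x_unlab.size
             + rectifier.var(ddof=1) / x_lab.size)
z = stats.norm.ppf(0.975)
print("95% CI:", (round(theta_pp - z * se, 3), round(theta_pp + z * se, 3)))
```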
[LG-30] Explicit and Implicit Graduated Optimization in Deep Neural Networks AAAI-25
链接: https://arxiv.org/abs/2412.11501
作者: Naoki Sato,Hideaki Iiduka
关键词: global optimization technique, graduated optimization algorithm, multimodal nonconvex function, Graduated optimization, refining the solution
类目: Machine Learning (cs.LG)
*备注: Accepted at AAAI-25
点击查看摘要
Abstract:Graduated optimization is a global optimization technique that is used to minimize a multimodal nonconvex function by smoothing the objective function with noise and gradually refining the solution. This paper experimentally evaluates the performance of the explicit graduated optimization algorithm with an optimal noise scheduling derived from a previous study and discusses its limitations. The evaluation uses traditional benchmark functions and empirical loss functions for modern neural network architectures. In addition, this paper extends the implicit graduated optimization algorithm, which is based on the fact that stochastic noise in the optimization process of SGD implicitly smooths the objective function, to SGD with momentum, analyzes its convergence, and demonstrates its effectiveness through experiments on image classification tasks with ResNet architectures.
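下面是显式 graduated optimization 思路的一个玩具示例:用高斯扰动平均来平滑多峰目标函数,并分阶段缩小噪声;论文中的最优噪声调度与带动量 SGD 的收敛分析在此并未复现,目标函数与步长均为假设。

```python
# Toy sketch of explicit graduated optimization: smooth a multimodal function
# by averaging over Gaussian perturbations and shrink the noise over stages.
# The paper's optimal noise schedule and momentum analysis are not shown here.
import numpy as np

rng = np.random.default_rng(7)
f = lambda x: np.sin(5 * x) + 0.5 * x**2        # a multimodal toy objective

def smoothed_grad(x, sigma, n=256):
    # Monte-Carlo gradient of E[f(x + sigma * u)] via central differences
    eps = 1e-3
    u = sigma * rng.normal(size=n)
    return np.mean((f(x + u + eps) - f(x + u - eps)) / (2 * eps))

x = 2.5
for sigma in [2.0, 1.0, 0.5, 0.25, 0.0]:        # gradually reduce smoothing
    for _ in range(200):
        x -= 0.05 * smoothed_grad(x, sigma)
print("final x:", round(x, 3), "f(x):", round(float(f(x)), 3))
```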
[LG-31] “They've Stolen My GPL-Licensed Model!”: Toward Standardized and Transparent Model Licensing
链接: https://arxiv.org/abs/2412.11483
作者: Moming Duan,Rui Zhao,Linshan Jiang,Nigel Shadbolt,Bingsheng He
关键词: Machine Learning, parameter sizes reach, training consumes zettaFLOPs, model parameter sizes, model publishing
类目: Computers and Society (cs.CY); Machine Learning (cs.LG)
*备注: 12 pages, 6 figures. Under review
点击查看摘要
Abstract:As model parameter sizes reach the billion-level range and their training consumes zettaFLOPs of computation, component reuse and collaborative development have become increasingly prevalent in the Machine Learning (ML) community. These components, including models, software, and datasets, may originate from various sources and be published under different licenses, which govern the use and distribution of licensed works and their derivatives. However, commonly chosen licenses, such as GPL and Apache, are software-specific and are not clearly defined or bounded in the context of model publishing. Meanwhile, the reused components may also have free-content licenses and model licenses, which pose a potential risk of license noncompliance and rights infringement within the model production workflow. In this paper, we propose addressing the above challenges along two lines: 1) For license analysis, we have developed a new vocabulary for ML workflow management and encoded license rules to enable ontological reasoning for analyzing rights granting and compliance issues. 2) For standardized model publishing, we have drafted a set of model licenses that provide flexible options to meet the diverse needs of model publishing. Our analysis tool is built on Turtle language and the Notation3 reasoning engine, envisioned as a first step toward Linked Open Model Production Data. We have also encoded our proposed model licenses into rules and demonstrated the effects of GPL and other commonly used licenses in model publishing, along with the flexibility advantages of our licenses, through comparisons and experiments.
[LG-32] Vertical Federated Unlearning via Backdoor Certification
链接: https://arxiv.org/abs/2412.11476
作者: Mengde Han,Tianqing Zhu,Lefeng Zhang,Huan Huo,Wanlei Zhou
关键词: Vertical Federated Learning, Vertical Federated, enabling distinct entities, train models cooperatively, Federated Learning
类目: Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:Vertical Federated Learning (VFL) offers a novel paradigm in machine learning, enabling distinct entities to train models cooperatively while maintaining data privacy. This method is particularly pertinent when entities possess datasets with identical sample identifiers but diverse attributes. Recent privacy regulations emphasize an individual’s “right to be forgotten”, which necessitates the ability for models to unlearn specific training data. The primary challenge is to develop a mechanism to eliminate the influence of a specific client from a model without erasing all relevant data from other clients. Our research investigates the removal of a single client’s contribution within the VFL framework. We introduce an innovative modification to traditional VFL by employing a mechanism that inverts the typical learning trajectory with the objective of extracting specific data contributions. This approach seeks to optimize model performance using gradient ascent, guided by a pre-defined constrained model. We also introduce a backdoor mechanism to verify the effectiveness of the unlearning procedure. Our method avoids fully accessing the initial training data and avoids storing parameter updates. Empirical evidence shows that the results align closely with those achieved by retraining from scratch. Utilizing gradient ascent, our unlearning approach addresses key challenges in VFL, laying the groundwork for future advancements in this domain. All the code and implementations related to this paper are publicly available at this https URL.
[LG-33] Mining In-distribution Attributes in Outliers for Out-of-distribution Detection AAAI2025
链接: https://arxiv.org/abs/2412.11466
作者: Yutian Lei,Luping Ji,Pei Liu
关键词: deploying reliable machine, real-world scenarios, reliable machine learning, machine learning systems, indispensable for deploying
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: Accepted by AAAI2025
点击查看摘要
Abstract:Out-of-distribution (OOD) detection is indispensable for deploying reliable machine learning systems in real-world scenarios. Recent works, using auxiliary outliers in training, have shown good potential. However, they seldom concern the intrinsic correlations between in-distribution (ID) and OOD data. In this work, we discover an obvious correlation that OOD data usually possesses significant ID attributes. These attributes should be factored into the training process, rather than blindly suppressed as in previous approaches. Based on this insight, we propose a structured multi-view-based out-of-distribution detection learning (MVOL) framework, which facilitates rational handling of the intrinsic in-distribution attributes in outliers. We provide theoretical insights on the effectiveness of MVOL for OOD detection. Extensive experiments demonstrate the superiority of our framework to others. MVOL effectively utilizes both auxiliary OOD datasets and even wild datasets with noisy in-distribution data. Code is available at this https URL.
[LG-34] Regional Expected Improvement for Efficient Trust Region Selection in High-Dimensional Bayesian Optimization AAAI2025
链接: https://arxiv.org/abs/2412.11456
作者: Nobuo Namura,Sho Takemori
关键词: involve complex objective, complex objective functions, involve complex, complex objective, costly evaluations
类目: Machine Learning (cs.LG)
*备注: Accepted by AAAI 2025
点击查看摘要
Abstract:Real-world optimization problems often involve complex objective functions with costly evaluations. While Bayesian optimization (BO) with Gaussian processes is effective for these challenges, it suffers in high-dimensional spaces due to performance degradation from limited function evaluations. To overcome this, simplification techniques like dimensionality reduction have been employed, yet they often rely on assumptions about the problem characteristics, potentially underperforming when these assumptions do not hold. Trust-region-based methods, which avoid such assumptions, focus on local search but risk stagnation in local optima. In this study, we propose a novel acquisition function, regional expected improvement (REI), designed to enhance trust-region-based BO in medium to high-dimensional settings. REI identifies regions likely to contain the global optimum, improving performance without relying on specific problem characteristics. We provide a theoretical proof that REI effectively identifies optimal trust regions and empirically demonstrate that incorporating REI into trust-region-based BO outperforms conventional BO and other high-dimensional BO methods in medium to high-dimensional real-world problems.
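作为参照,下面实现的是贝叶斯优化中标准的期望改进(EI)采集函数;论文提出的 regional EI(REI)是对候选信赖域整体打分而非对单点打分,此处并未实现,目标函数与初始采样均为假设。

```python
# Standard expected-improvement (EI) acquisition as a reference point; the
# paper's regional EI (REI) scores whole candidate trust regions rather than
# single points, which is not shown here.
import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor

rng = np.random.default_rng(8)
f = lambda x: np.sin(3 * x) + 0.1 * x**2             # toy objective (minimize)

X = rng.uniform(-3, 3, size=(8, 1))                  # initial design points
y = f(X).ravel()
gp = GaussianProcessRegressor(normalize_y=True).fit(X, y)

candidates = np.linspace(-3, 3, 400).reshape(-1, 1)
mu, sigma = gp.predict(candidates, return_std=True)
best = y.min()
z = (best - mu) / np.maximum(sigma, 1e-9)
ei = (best - mu) * norm.cdf(z) + sigma * norm.pdf(z)  # EI for minimization
print("next point to evaluate:", float(candidates[np.argmax(ei)]))
```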
[LG-35] Data-Dependent Generalization Bounds for Parameterized Quantum Models Under Noise
链接: https://arxiv.org/abs/2412.11451
作者: Bikram Khanal,Pablo Rivas
关键词: solving complex problems, near-term quantum devices, Quantum machine learning, inherent noise hinders, machine learning offers
类目: Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:Quantum machine learning offers a transformative approach to solving complex problems, but the inherent noise hinders its practical implementation in near-term quantum devices. This obstacle makes it challenging to understand the generalization capabilities of quantum circuit models. Designing robust quantum machine learning models under noise requires a principled understanding of complexity and generalization, extending beyond classical capacity measures. This study investigates the generalization properties of parameterized quantum machine learning models under the influence of noise. We present a data-dependent generalization bound grounded in the quantum Fisher information matrix. We leverage statistical learning theory to relate the parameter space volumes and training sizes to estimate the generalization capability of the trained model. By integrating local parameter neighborhoods and effective dimensions defined through quantum Fisher information matrix eigenvalues, we provide a structured characterization of complexity in quantum models. We analyze the tightness of the bound and discuss the trade-off between model expressiveness and generalization performance.
[LG-36] UIBDiffusion: Universal Imperceptible Backdoor Attack for Diffusion Models
链接: https://arxiv.org/abs/2412.11441
作者: Yuning Han,Bingyin Zhao,Rui Chu,Feng Luo,Biplab Sikdar,Yingjie Lao
关键词: Recent studies show, Recent studies, studies show, backdoor, diffusion models
类目: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:Recent studies show that diffusion models (DMs) are vulnerable to backdoor attacks. Existing backdoor attacks impose unconcealed triggers (e.g., a gray box and eyeglasses) that contain evident patterns, rendering remarkable attack effects yet easy detection upon human inspection and defensive algorithms. While it is possible to improve stealthiness by reducing the strength of the backdoor, doing so can significantly compromise its generality and effectiveness. In this paper, we propose UIBDiffusion, the universal imperceptible backdoor attack for diffusion models, which allows us to achieve superior attack and generation performance while evading state-of-the-art defenses. We propose a novel trigger generation approach based on universal adversarial perturbations (UAPs) and reveal that such perturbations, which are initially devised for fooling pre-trained discriminative models, can be adapted as potent imperceptible backdoor triggers for DMs. We evaluate UIBDiffusion on multiple types of DMs with different kinds of samplers across various datasets and targets. Experimental results demonstrate that UIBDiffusion brings three advantages: 1) Universality, the imperceptible trigger is universal (i.e., image and model agnostic) where a single trigger is effective to any images and all diffusion models with different samplers; 2) Utility, it achieves comparable generation quality (e.g., FID) and even better attack success rate (i.e., ASR) at low poison rates compared to the prior works; and 3) Undetectability, UIBDiffusion is plausible to human perception and can bypass Elijah and TERD, the SOTA defenses against backdoors for DMs. We will release our backdoor triggers and code.
[LG-37] MGDA: Model-based Goal Data Augmentation for Offline Goal-conditioned Weighted Supervised Learning
链接: https://arxiv.org/abs/2412.11410
作者: Xing Lei,Xuetao Zhang,Donglin Wang
关键词: Weighted Supervised Learning, Goal-Conditioned Weighted Supervised, Weighted Supervised, goal-conditioned reinforcement learning, Supervised Learning
类目: Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:Recently, a state-of-the-art family of algorithms, known as Goal-Conditioned Weighted Supervised Learning (GCWSL) methods, has been introduced to tackle challenges in offline goal-conditioned reinforcement learning (RL). GCWSL optimizes a lower bound of the goal-conditioned RL objective and has demonstrated outstanding performance across diverse goal-reaching tasks, providing a simple, effective, and stable solution. However, prior research has identified a critical limitation of GCWSL: the lack of trajectory stitching capabilities. To address this, goal data augmentation strategies have been proposed to enhance these methods. Nevertheless, existing techniques often struggle to sample suitable augmented goals for GCWSL effectively. In this paper, we establish unified principles for goal data augmentation, focusing on goal diversity, action optimality, and goal reachability. Based on these principles, we propose a Model-based Goal Data Augmentation (MGDA) approach, which leverages a learned dynamics model to sample more suitable augmented goals. MGDA uniquely incorporates the local Lipschitz continuity assumption within the learned model to mitigate the impact of compounding errors. Empirical results show that MGDA significantly enhances the performance of GCWSL methods on both state-based and vision-based maze datasets, surpassing previous goal data augmentation techniques in improving stitching capabilities.
[LG-38] Formulations and scalability of neural network surrogates in nonlinear optimization problems
链接: https://arxiv.org/abs/2412.11403
作者: Robert B. Parker,Oscar Dowson,Nicole LoGiudice,Manuel Garcia,Russell Bent
关键词: trained neural networks, representing trained neural, neural networks, nonlinear constrained optimization, neural network surrogate
类目: Machine Learning (cs.LG); Optimization and Control (math.OC)
*备注:
点击查看摘要
Abstract:We compare full-space, reduced-space, and gray-box formulations for representing trained neural networks in nonlinear constrained optimization problems. We test these formulations on a transient stability-constrained, security-constrained alternating current optimal power flow (SCOPF) problem where the transient stability criteria are represented by a trained neural network surrogate. Optimization problems are implemented in JuMP and trained neural networks are embedded using a new Julia package: this http URL. To study the bottlenecks of the three formulations, we use neural networks with up to 590 million trained parameters. The full-space formulation is bottlenecked by the linear solver used by the optimization algorithm, while the reduced-space formulation is bottlenecked by the algebraic modeling environment and derivative computations. The gray-box formulation is the most scalable and is capable of solving with the largest neural networks tested. It is bottlenecked by evaluation of the neural network’s outputs and their derivatives, which may be accelerated with a graphics processing unit (GPU). Leveraging the gray-box formulation and GPU acceleration, we solve our test problem with our largest neural network surrogate in $2.5\times$ the time required for a simpler SCOPF problem without the stability constraint.
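 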
[LG-39] Modeling Inter-Intra Heterogeneity for Graph Federated Learning AAAI2025
链接: https://arxiv.org/abs/2412.11402
作者: Wentao Yu,Shuo Chen,Yongxin Tong,Tianlong Gu,Chen Gong
关键词: graph data due, fundamental and challenging, Federated learning method, subgraph data, federated learning
类目: Machine Learning (cs.LG)
*备注: accepted by AAAI 2025
点击查看摘要
Abstract:Heterogeneity is a fundamental and challenging issue in federated learning, especially for the graph data due to the complex relationships among the graph nodes. To deal with the heterogeneity, lots of existing methods perform the weighted federation based on their calculated similarities between pairwise clients (i.e., subgraphs). However, their inter-subgraph similarities estimated with the outputs of local models are less reliable, because the final outputs of local models may not comprehensively represent the real distribution of subgraph data. In addition, they ignore the critical intra-heterogeneity which usually exists within each subgraph itself. To address these issues, we propose a novel Federated learning method by integrally modeling the Inter-Intra Heterogeneity (FedIIH). For the inter-subgraph relationship, we propose a novel hierarchical variational model to infer the whole distribution of subgraph data in a multi-level form, so that we can accurately characterize the inter-subgraph similarities with the global perspective. For the intra-heterogeneity, we disentangle the subgraph into multiple latent factors and partition the model parameters into multiple parts, where each part corresponds to a single latent factor. Our FedIIH not only properly computes the distribution similarities between subgraphs, but also learns disentangled representations that are robust to irrelevant factors within subgraphs, so that it successfully considers the inter- and intra- heterogeneity simultaneously. Extensive experiments on six homophilic and five heterophilic graph datasets in both non-overlapping and overlapping settings demonstrate the effectiveness of our method when compared with nine state-of-the-art methods. Specifically, FedIIH averagely outperforms the second-best method by a large margin of 5.79% on all heterophilic datasets.
[LG-40] Scaled Conjugate Gradient Method for Nonconvex Optimization in Deep Neural Networks
链接: https://arxiv.org/abs/2412.11400
作者: Naoki Sato,Koshiro Izumi,Hideaki Iiduka
关键词: solving nonconvex optimization, nonconvex optimization problems, utilizing stochastic gradients, methods utilizing stochastic, deep neural networks
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: Accepted at JMLR (Dec. 2024)
点击查看摘要
Abstract:A scaled conjugate gradient method that accelerates existing adaptive methods utilizing stochastic gradients is proposed for solving nonconvex optimization problems with deep neural networks. It is shown theoretically that, whether with constant or diminishing learning rates, the proposed method can obtain a stationary point of the problem. Additionally, its rate of convergence with diminishing learning rates is verified to be superior to that of the conjugate gradient method. The proposed method is shown to minimize training loss functions faster than the existing adaptive methods in practical applications of image and text classification. Furthermore, in the training of generative adversarial networks, one version of the proposed method achieved the lowest Frechet inception distance score among those of the adaptive methods.
[LG-41] Quantization of Climate Change Impacts on Renewable Energy Generation Capacity: A Super-Resolution Recurrent Diffusion Model
链接: https://arxiv.org/abs/2412.11399
作者: Xiaochong Dong,Jun Dan,Yingyun Sun,Yang Liu,Xuemin Zhang,Shengwei Mei
关键词: ongoing energy transition, Driven by global, climate data, global climate change, power supply capabilities
类目: Machine Learning (cs.LG); Signal Processing (eess.SP)
*备注:
点击查看摘要
Abstract:Driven by global climate change and the ongoing energy transition, the coupling between power supply capabilities and meteorological factors has become increasingly significant. Over the long term, accurately quantifying the power generation capacity of renewable energy under the influence of climate change is essential for the development of sustainable power systems. However, due to interdisciplinary differences in data requirements, climate data often lacks the necessary hourly resolution to capture the short-term variability and uncertainties of renewable energy resources. To address this limitation, a super-resolution recurrent diffusion model (SRDM) has been developed to enhance the temporal resolution of climate data and model the short-term uncertainty. The SRDM incorporates a pre-trained decoder and a denoising network, that generates long-term, high-resolution climate data through a recurrent coupling mechanism. The high-resolution climate data is then converted into power value using the mechanism model, enabling the simulation of wind and photovoltaic (PV) power generation capacity on future long-term scales. Case studies were conducted in the Ejina region of Inner Mongolia, China, using fifth-generation reanalysis (ERA5) and coupled model intercomparison project (CMIP6) data under two climate pathways: SSP126 and SSP585. The results demonstrate that the SRDM outperforms existing generative models in generating super-resolution climate data. For the Ejina region, under a high-emission pathway, the annual utilization hours of wind power are projected to decrease by 2.82 hours/year, while those for PV power are projected to decrease by 0.26 hours/year. Furthermore, the research highlights the estimation biases introduced when low-resolution climate data is used for power conversion.
[LG-42] STDHL: Spatio-Temporal Dynamic Hypergraph Learning for Wind Power Forecasting
链接: https://arxiv.org/abs/2412.11393
作者: Xiaochong Dong,Xuemin Zhang,Ming Yang,Shengwei Mei
关键词: Leveraging spatio-temporal correlations, Leveraging spatio-temporal, significantly enhance, Leveraging, wind power forecasting
类目: Machine Learning (cs.LG); Signal Processing (eess.SP)
*备注:
点击查看摘要
Abstract:Leveraging spatio-temporal correlations among wind farms can significantly enhance the accuracy of ultra-short-term wind power forecasting. However, the complex and dynamic nature of these correlations presents significant modeling challenges. To address this, we propose a spatio-temporal dynamic hypergraph learning (STDHL) model. This model uses a hypergraph structure to represent spatial features among wind farms. Unlike traditional graph structures, which only capture pair-wise node features, hypergraphs create hyperedges connecting multiple nodes, enabling the representation and transmission of higher-order spatial features. The STDHL model incorporates a novel dynamic hypergraph convolutional layer to model dynamic spatial correlations and a grouped temporal convolutional layer for channel-independent temporal modeling. The model uses spatio-temporal encoders to extract features from multi-source covariates, which are mapped to quantile results through a forecast decoder. Experimental results using the GEFCom dataset show that the STDHL model outperforms existing state-of-the-art methods. Furthermore, an in-depth analysis highlights the critical role of spatio-temporal covariates in improving ultra-short-term forecasting accuracy.
[LG-43] Accurate Robust and Privacy-Preserving Brain-Computer Interface Decoding
链接: https://arxiv.org/abs/2412.11390
作者: Xiaoqing Chen,Tianwang Jia,Dongrui Wu
关键词: based brain-computer interface, enables direct communication, based brain-computer, brain-computer interface, enables direct
类目: Human-Computer Interaction (cs.HC); Machine Learning (cs.LG); Signal Processing (eess.SP)
*备注:
点击查看摘要
Abstract:An electroencephalogram (EEG) based brain-computer interface (BCI) enables direct communication between the brain and external devices. However, EEG-based BCIs face at least three major challenges in real-world applications: data scarcity and individual differences, adversarial vulnerability, and data privacy. While previous studies have addressed one or two of these issues, simultaneous accommodation of all three challenges remains challenging and unexplored. This paper fills this gap, by proposing an Augmented Robustness Ensemble (ARE) algorithm and integrating it into three privacy protection scenarios (centralized source-free transfer, federated source-free transfer, and source data perturbation), achieving simultaneously accurate decoding, adversarial robustness, and privacy protection of EEG-based BCIs. Experiments on three public EEG datasets demonstrated that our proposed approach outperformed over 10 classic and state-of-the-art approaches in both accuracy and robustness in all three privacy-preserving scenarios, even outperforming state-of-the-art transfer learning approaches that do not consider privacy protection at all. This is the first time that three major challenges in EEG-based BCIs can be addressed simultaneously, significantly improving the practicalness of EEG decoding in real-world BCIs.
[LG-44] FinLoRA: Finetuning Quantized Financial Large Language Models Using Low-Rank Adaptation
链接: https://arxiv.org/abs/2412.11378
作者: Dannong Wang,Daniel Kim,Bo Jin,Xingjian Zhao,Tianfan Fu,Steve Yang,Xiao-Yang Liu
关键词: Finetuned large language, Finetuned large, large language models, shown remarkable performance, information retrieval
类目: Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:Finetuned large language models (LLMs) have shown remarkable performance in financial tasks, such as sentiment analysis and information retrieval. Due to privacy concerns, finetuning and deploying Financial LLMs (FinLLMs) locally are crucial for institutions. However, finetuning FinLLMs poses challenges including GPU memory constraints and long input sequences. In this paper, we employ quantized low-rank adaptation (QLoRA) to finetune FinLLMs, which leverages low-rank matrix decomposition and quantization techniques to significantly reduce computational requirements while maintaining high model performance. We also employ data and pipeline parallelism to enable local finetuning using cost-effective, widely accessible GPUs. Experiments on financial datasets demonstrate that our method achieves substantial improvements in accuracy, GPU memory usage, and time efficiency, underscoring the potential of low-rank methods for scalable and resource-efficient LLM finetuning.
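QLoRA itself is a published technique with standard tooling; as a minimal sketch (not the paper's configuration), a typical 4-bit quantization plus LoRA setup with Hugging Face transformers/peft looks like the following. The base model name, rank, and target modules are placeholder assumptions.

```python
# Minimal sketch of a QLoRA-style setup (4-bit base model + low-rank adapters)
# using Hugging Face transformers + peft. Model name, rank and target modules
# are placeholders, not the paper's configuration. Requires bitsandbytes + GPU.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

base = "meta-llama/Llama-2-7b-hf"  # placeholder base model
bnb = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
tok = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base, quantization_config=bnb, device_map="auto")
model = prepare_model_for_kbit_training(model)

lora = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],   # assumed attention projections
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora)
model.print_trainable_parameters()  # only the low-rank adapters are trainable
```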
[LG-45] Deep Random Features for Scalable Interpolation of Spatiotemporal Data
链接: https://arxiv.org/abs/2412.11350
作者: Weibin Chen,Azhir Mahmood,Michel Tsamados,So Takao
关键词: interpolate remote-sensing observations, earth observation systems, observation systems calls, earth observation, observation systems
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:
点击查看摘要
Abstract:The rapid growth of earth observation systems calls for a scalable approach to interpolate remote-sensing observations. These methods, in principle, should acquire more information about the observed field as data grows. Gaussian processes (GPs) are candidate model choices for interpolation. However, due to their poor scalability, they usually rely on inducing points for inference, which restricts their expressivity. Moreover, commonly imposed assumptions such as stationarity prevent them from capturing complex patterns in the data. While deep GPs can overcome this issue, training and making inference with them are difficult, again requiring crude approximations via inducing points. In this work, we instead approach the problem through Bayesian deep learning, where spatiotemporal fields are represented by deep neural networks, whose layers share the inductive bias of stationary GPs on the plane/sphere via random feature expansions. This allows one to (1) capture high frequency patterns in the data, and (2) use mini-batched gradient descent for large scale training. We experiment on various remote sensing data at local/global scales, showing that our approach produces competitive or superior results to existing methods, with well-calibrated uncertainties.
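The random-feature idea the abstract refers to is standard: a stationary kernel can be approximated by random Fourier features, so a GP-like layer reduces to a finite linear model on those features. A minimal numpy sketch for an RBF kernel on the plane follows; the lengthscale and feature count are arbitrary choices, not the paper's.

```python
# Minimal sketch: random Fourier features approximating an RBF (stationary) kernel,
# the kind of expansion a Bayesian-deep-learning layer can share as inductive bias.
import numpy as np

def rff_features(X, n_features=512, lengthscale=1.0, rng=None):
    """Map X (n, d) to phi(X) (n, n_features) so that phi(x) . phi(x') ~= k_RBF(x, x')."""
    rng = np.random.default_rng(rng)
    d = X.shape[1]
    W = rng.normal(0.0, 1.0 / lengthscale, size=(d, n_features))
    b = rng.uniform(0.0, 2.0 * np.pi, size=n_features)
    return np.sqrt(2.0 / n_features) * np.cos(X @ W + b)

# sanity check: feature inner products approximate the exact RBF kernel
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 2))
phi = rff_features(X, n_features=20000, lengthscale=1.0, rng=1)
approx = phi @ phi.T
sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
exact = np.exp(-0.5 * sq)            # RBF kernel with lengthscale 1
print(np.abs(approx - exact).max())  # small, e.g. ~1e-2
```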
[LG-46] Coupling-based Convergence Diagnostic and Stepsize Scheme for Stochastic Gradient Descent AAAI2025
链接: https://arxiv.org/abs/2412.11341
作者: Xiang Li,Qiaomin Xie
关键词: Stochastic Gradient Descent, Gradient Descent, Stochastic Gradient, behavior of Stochastic, crucially depends
类目: Machine Learning (cs.LG); Optimization and Control (math.OC); Machine Learning (stat.ML)
*备注: 13 pages, 30 figures, to be published in AAAI 2025
点击查看摘要
Abstract:The convergence behavior of Stochastic Gradient Descent (SGD) crucially depends on the stepsize configuration. When using a constant stepsize, the SGD iterates form a Markov chain, enjoying fast convergence during the initial transient phase. However, when reaching stationarity, the iterates oscillate around the optimum without making further progress. In this paper, we study the convergence diagnostics for SGD with constant stepsize, aiming to develop an effective dynamic stepsize scheme. We propose a novel coupling-based convergence diagnostic procedure, which monitors the distance of two coupled SGD iterates for stationarity detection. Our diagnostic statistic is simple and is shown to track the transition from transience to stationarity theoretically. We conduct extensive numerical experiments and compare our method against various existing approaches. Our proposed coupling-based stepsize scheme is observed to achieve superior performance across a diverse set of convex and non-convex problems. Moreover, our results demonstrate the robustness of our approach to a wide range of hyperparameters.
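As a rough illustration of the coupling idea only (not the paper's diagnostic statistic or stepsize rule), the sketch below runs two constant-stepsize SGD chains from different initializations on a least-squares problem, shares the minibatch noise between them, and halves the stepsize whenever the inter-chain distance falls below a tolerance; the thresholds are arbitrary.

```python
# Rough sketch of coupling-based stepsize decay for constant-stepsize SGD:
# two chains share minibatches; when they meet, the stepsize is halved and the
# chains are re-split. Tolerances are illustrative, not the paper's procedure.
import numpy as np

rng = np.random.default_rng(0)
n, d = 1000, 10
A = rng.normal(size=(n, d))
b = A @ rng.normal(size=d) + 0.1 * rng.normal(size=n)

def sgd_grad(x, idx):
    return A[idx].T @ (A[idx] @ x - b[idx]) / len(idx)

x1, x2 = rng.normal(size=d), rng.normal(size=d)   # two coupled chains, different inits
step, batch, tol = 0.2, 32, 1e-3
for t in range(20000):
    idx = rng.integers(0, n, size=batch)          # shared minibatch = coupling
    x1 -= step * sgd_grad(x1, idx)
    x2 -= step * sgd_grad(x2, idx)
    if np.linalg.norm(x1 - x2) < tol:             # chains have met: transient phase is over
        step *= 0.5                               # decay the stepsize ...
        x2 = x1 + rng.normal(size=d)              # ... and re-split the chains to keep monitoring

print("final stepsize", step, "loss", 0.5 * np.mean((A @ x1 - b) ** 2))
```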
[LG-47] Regularized Dikin Walks for Sampling Truncated Logconcave Measures Mixed Isoperimetry and Beyond Worst-Case Analysis
链接: https://arxiv.org/abs/2412.11303
作者: Minhui Jiang,Yuansi Chen
关键词: Bayesian statistical models, challenges in Bayesian, Bayesian statistical, Dikin walk, regularized Dikin walk
类目: Data Structures and Algorithms (cs.DS); Machine Learning (cs.LG); Computation (stat.CO); Machine Learning (stat.ML)
*备注: 63 pages, 2 figures
点击查看摘要
Abstract:We study the problem of drawing samples from a logconcave distribution truncated on a polytope, motivated by computational challenges in Bayesian statistical models with indicator variables, such as probit regression. Building on interior point methods and the Dikin walk for sampling from uniform distributions, we analyze the mixing time of regularized Dikin walks. Our contributions are threefold. First, for a logconcave and log-smooth distribution with condition number $\kappa$, truncated on a polytope in $\mathbb{R}^n$ defined with $m$ linear constraints, we prove that the soft-threshold Dikin walk mixes in $\widetilde{O}((m+\kappa)n)$ iterations from a warm initialization. It improves upon prior work which required the polytope to be bounded and involved a bound dependent on the radius of the bounded region. Moreover, we introduce the regularized Dikin walk using Lewis weights for approximating the John ellipsoid. We show that it mixes in $\widetilde{O}(n^{2.5}+\kappa n)$. Second, we extend the mixing time guarantees mentioned above to weakly log-concave distributions truncated on polytopes, provided that they have a finite covariance matrix. Third, going beyond worst-case mixing time analysis, we demonstrate that soft-threshold Dikin walk can mix significantly faster when only a limited number of constraints intersect the high-probability mass of the distribution, improving the $\widetilde{O}((m+\kappa)n)$ upper bound to $\widetilde{O}(m + \kappa n)$. Additionally, per-iteration complexity of regularized Dikin walk and ways to generate a warm initialization are discussed to facilitate practical implementation.
[LG-48] How not to Stitch Representations to Measure Similarity: Task Loss Matching versus Direct Matching AAAI2025
链接: https://arxiv.org/abs/2412.11299
作者: András Balogh,Márk Jelasity
关键词: task loss matching, deep neural networks, task loss, loss matching, matching
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: Accepted at AAAI 2025. For the implementation, see this https URL
点击查看摘要
Abstract:Measuring the similarity of the internal representations of deep neural networks is an important and challenging problem. Model stitching has been proposed as a possible approach, where two half-networks are connected by mapping the output of the first half-network to the input of the second one. The representations are considered functionally similar if the resulting stitched network achieves good task-specific performance. The mapping is normally created by training an affine stitching layer on the task at hand while freezing the two half-networks, a method called task loss matching. Here, we argue that task loss matching may be very misleading as a similarity index. For example, it can indicate very high similarity between very distant layers, whose representations are known to have different functional properties. Moreover, it can indicate very distant layers to be more similar than architecturally corresponding layers. Even more surprisingly, when comparing layers within the same network, task loss matching often indicates that some layers are more similar to a layer than itself. We argue that the main reason behind these problems is that task loss matching tends to create out-of-distribution representations to improve task-specific performance. We demonstrate that direct matching (when the mapping minimizes the distance between the stitched representations) does not suffer from these problems. We compare task loss matching, direct matching, and well-known similarity indices such as CCA and CKA. We conclude that direct matching strikes a good balance between the structural and functional requirements for a good similarity index.
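Direct matching, as described in the abstract, fits the affine stitching map by minimizing the distance between the two layers' representations rather than the downstream task loss; a closed-form least-squares sketch follows. The shapes and data are dummies, and the residual-based similarity readout is an illustrative choice, not the paper's exact index.

```python
# Minimal sketch of "direct matching": fit an affine map (W, c) that minimizes
# || H_a W + c - H_b ||_F between two layers' representations, in closed form,
# instead of training the stitching layer on the task loss.
import numpy as np

def direct_match(H_a, H_b):
    """Least-squares affine map from representation H_a (n, d_a) to H_b (n, d_b)."""
    ones = np.ones((H_a.shape[0], 1))
    X = np.hstack([H_a, ones])                     # append a bias column
    coef, *_ = np.linalg.lstsq(X, H_b, rcond=None)
    W, c = coef[:-1], coef[-1]
    residual = np.linalg.norm(X @ coef - H_b) / np.linalg.norm(H_b)
    return W, c, residual                          # low residual = representations map well

rng = np.random.default_rng(0)
H_a = rng.normal(size=(1000, 64))                                            # dummy layer-A activations
H_b = H_a @ rng.normal(size=(64, 32)) + 0.1 * rng.normal(size=(1000, 32))    # near-affine function of A
W, c, res = direct_match(H_a, H_b)
print("relative residual:", round(res, 3))
```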
[LG-49] Grassmannian Geometry Meets Dynamic Mode Decomposition in DMD-GEN: A New Metric for Mode Collapse in Time Series Generative Models
链接: https://arxiv.org/abs/2412.11292
作者: Amime Mohamed Aboussalah,Yassine Abbahaddou
关键词: Generative Adversarial Networks, Adversarial Networks, Variational Autoencoders, Generative Adversarial, time series
类目: Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:Generative models like Generative Adversarial Networks (GANs) and Variational Autoencoders (VAEs) often fail to capture the full diversity of their training data, leading to mode collapse. While this issue is well-explored in image generation, it remains underinvestigated for time series data. We introduce a new definition of mode collapse specific to time series and propose a novel metric, DMD-GEN, to quantify its severity. Our metric utilizes Dynamic Mode Decomposition (DMD), a data-driven technique for identifying coherent spatiotemporal patterns, and employs Optimal Transport between DMD eigenvectors to assess discrepancies between the underlying dynamics of the original and generated data. This approach not only quantifies the preservation of essential dynamic characteristics but also provides interpretability by pinpointing which modes have collapsed. We validate DMD-GEN on both synthetic and real-world datasets using various generative models, including TimeGAN, TimeVAE, and DiffusionTS. The results demonstrate that DMD-GEN correlates well with traditional evaluation metrics for static data while offering the advantage of applicability to dynamic data. This work offers for the first time a definition of mode collapse for time series, improving understanding, and forming the basis of our tool for assessing and improving generative models in the time series domain.
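DMD-GEN builds on standard exact Dynamic Mode Decomposition; the sketch below computes DMD eigenvalues and modes of a multivariate time series via the usual SVD route. The optimal-transport comparison between the modes of original and generated data, which is the paper's actual metric, is omitted.

```python
# Minimal sketch of exact Dynamic Mode Decomposition (DMD) on one time series;
# DMD-GEN would additionally compare such modes between real and generated data
# via optimal transport, which is omitted here.
import numpy as np

def exact_dmd(X, rank=None):
    """X: (n_features, n_timesteps). Returns DMD eigenvalues and modes."""
    X1, X2 = X[:, :-1], X[:, 1:]                                # snapshot pairs x_t -> x_{t+1}
    U, s, Vh = np.linalg.svd(X1, full_matrices=False)
    if rank is not None:
        U, s, Vh = U[:, :rank], s[:rank], Vh[:rank]
    Atilde = U.conj().T @ X2 @ Vh.conj().T @ np.diag(1.0 / s)   # reduced linear operator
    eigvals, W = np.linalg.eig(Atilde)
    modes = X2 @ Vh.conj().T @ np.diag(1.0 / s) @ W             # exact DMD modes
    return eigvals, modes

# toy data: trajectory of a known 2-D linear system x_{t+1} = A x_t (damped rotation)
A_true = np.array([[0.95, 0.20], [-0.20, 0.95]])
X = np.empty((2, 200))
X[:, 0] = [1.0, 0.0]
for k in range(1, 200):
    X[:, k] = A_true @ X[:, k - 1]
eigvals, modes = exact_dmd(X)
print(np.abs(eigvals))   # both close to |eig(A_true)| ~= 0.97
```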
[LG-50] Wasserstein Bounds for generative diffusion models with Gaussian tail targets
链接: https://arxiv.org/abs/2412.11251
作者: Xixian Wang,Zhongjian Wang
关键词: Gaussian-type tail behavior, score-based generative models, Wasserstein distance, data distribution, Gaussian-type tail
类目: Machine Learning (cs.LG); Analysis of PDEs (math.AP); Numerical Analysis (math.NA)
*备注:
点击查看摘要
Abstract:We present an estimate of the Wasserstein distance between the data distribution and the generation of score-based generative models, assuming an $\epsilon$-accurate approximation of the score and a Gaussian-type tail behavior of the data distribution. The complexity bound in dimension is $O(\sqrt{d})$, with a logarithmic constant. Such a Gaussian tail assumption applies to the distribution of a compact support target with early stopping technique and the Bayesian posterior with a bounded observation operator. Corresponding convergence and complexity bounds are derived. The crux of the analysis lies in the Lipschitz bound of the score, which is related to the Hessian estimate of a viscous Hamilton-Jacobi equation (vHJ). This latter is demonstrated by employing a dimension independent kernel estimate. Consequently, our complexity bound scales linearly (up to a logarithmic constant) with the square root of the trace of the covariance operator, which relates to the invariant distribution of the forward process. Our analysis also extends to the probabilistic flow ODE, as the sampling process.
[LG-51] Concept Learning in the Wild: Towards Algorithmic Understanding of Neural Networks
链接: https://arxiv.org/abs/2412.11205
作者: Elad Shoham,Hadar Cohen,Khalil Wattad,Havana Rika,Dan Vilenchik
关键词: methods typically focus, identifying essential input, essential input features, methods typically, text classification
类目: Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:Explainable AI (XAI) methods typically focus on identifying essential input features or more abstract concepts for tasks like image or text classification. However, for algorithmic tasks like combinatorial optimization, these concepts may depend not only on the input but also on the current state of the network, like in the graph neural networks (GNN) case. This work studies concept learning for an existing GNN model trained to solve Boolean satisfiability (SAT). Our analysis reveals that the model learns key concepts matching those guiding human-designed SAT heuristics, particularly the notion of ‘support.’ We demonstrate that these concepts are encoded in the top principal components (PCs) of the embedding’s covariance matrix, allowing for unsupervised discovery. Using sparse PCA, we establish the minimality of these concepts and show their teachability through a simplified GNN. Two direct applications of our framework are (a) We improve the convergence time of the classical WalkSAT algorithm and (b) We use the discovered concepts to “reverse-engineer” the black-box GNN and rewrite it as a white-box textbook algorithm. Our results highlight the potential of concept learning in understanding and enhancing algorithmic neural networks for combinatorial optimization tasks.
[LG-52] GNNs-to-MLPs by Teacher Injection and Dirichlet Energy Distillation
链接: https://arxiv.org/abs/2412.11180
作者: Ziang Zhou,Zhihao Ding,Jieming Shi,Qing Li,Shiqi Shen
关键词: Graph Neural Networks, Neural Networks, node classification tasks, Graph Neural, Dirichlet Energy Distillation
类目: Machine Learning (cs.LG); Social and Information Networks (cs.SI)
*备注: 14 pages
点击查看摘要
Abstract:Graph Neural Networks (GNNs) are fundamental to graph-based learning and excel in node classification tasks. However, GNNs suffer from scalability issues due to the need for multi-hop data during inference, limiting their use in latency-sensitive applications. Recent studies attempt to distill GNNs into multi-layer perceptrons (MLPs) for faster inference. They typically treat GNN and MLP models as single units for distillation, insufficiently utilizing the fine-grained knowledge within GNN layers. In this paper, we propose TINED, a novel method that distills GNNs to MLPs layer-wise through Teacher Injection with fine-tuning and Dirichlet Energy Distillation techniques. We analyze key operations in GNN layers, feature transformation (FT) and graph propagation (GP), and identify that an FT performs the same computation as a fully-connected (FC) layer in MLPs. Thus, we propose directly injecting valuable teacher parameters of an FT in a GNN into an FC layer of the student MLP, assisted by fine-tuning. In TINED, FC layers in an MLP mirror the order of the corresponding FTs and GPs in GNN. We provide a theoretical bound on the approximation of GPs. Moreover, we observe that in a GNN layer, FT and GP operations often have opposing smoothing effects: GP is aggressive, while FT is conservative, in smoothing. Using Dirichlet energy, we design a DE ratio to quantify these smoothing effects and propose Dirichlet Energy Distillation to distill these characteristics from GNN layers to MLP layers. Extensive experiments demonstrate that TINED achieves superior performance over GNNs and state-of-the-art distillation methods under various settings across seven datasets. The code is in supplementary material.
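The Dirichlet energy used in TINED to characterize smoothing can be computed directly from the graph Laplacian; a small numpy sketch follows. The toy graph and features are dummies, and the DE-ratio readout here is a plausible reading of the abstract, not necessarily the paper's exact formula.

```python
# Minimal sketch: Dirichlet energy of node features on a graph, and a ratio
# comparing energy before/after an operation (e.g. graph propagation).
# The exact DE-ratio definition in TINED may differ; this is a plausible reading.
import numpy as np

def dirichlet_energy(X, A):
    """E(X) = tr(X^T L X) with unnormalized Laplacian L = D - A."""
    L = np.diag(A.sum(axis=1)) - A
    return float(np.trace(X.T @ L @ X))

rng = np.random.default_rng(0)
A = (rng.random((50, 50)) < 0.1).astype(float)
A = np.triu(A, 1); A = A + A.T                       # symmetric adjacency, no self-loops
X = rng.normal(size=(50, 16))                        # random node features

# graph propagation (mean aggregation) is the aggressive smoother: energy drops
deg = np.maximum(A.sum(axis=1, keepdims=True), 1.0)
X_gp = (A @ X) / deg
ratio_gp = dirichlet_energy(X_gp, A) / dirichlet_energy(X, A)
print("DE ratio after propagation:", round(ratio_gp, 3))   # typically well below 1
```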
[LG-53] A Progressive Transformer for Unifying Binary Code Embedding and Knowledge Transfer
链接: https://arxiv.org/abs/2412.11177
作者: Hanxiao Lu,Hongyu Cai,Yiming Liang,Antonio Bianchi,Z. Berkay Celik
关键词: Masked Language Modeling, function signature recovery, function similarity detection, Language Modeling, Masked Language
类目: oftware Engineering (cs.SE); Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:Language model approaches have recently been integrated into binary analysis tasks, such as function similarity detection and function signature recovery. These models typically employ a two-stage training process: pre-training via Masked Language Modeling (MLM) on machine code and fine-tuning for specific tasks. While MLM helps to understand binary code structures, it ignores essential code characteristics, including control and data flow, which negatively affect model generalization. Recent work leverages domain-specific features (e.g., control flow graphs and dynamic execution traces) in transformer-based approaches to improve binary code semantic understanding. However, this approach involves complex feature engineering, a cumbersome and time-consuming process that can introduce predictive uncertainty when dealing with stripped or obfuscated code, leading to a performance drop. In this paper, we introduce ProTST, a novel transformer-based methodology for binary code embedding. ProTST employs a hierarchical training process based on a unique tree-like structure, where knowledge progressively flows from fundamental tasks at the root to more specialized tasks at the leaves. This progressive teacher-student paradigm allows the model to build upon previously learned knowledge, resulting in high-quality embeddings that can be effectively leveraged for diverse downstream binary analysis tasks. The effectiveness of ProTST is evaluated in seven binary analysis tasks, and the results show that ProTST yields an average validation score (F1, MRR, and Recall@1) improvement of 14.8% compared to traditional two-stage training and an average validation score of 10.7% compared to multimodal two-stage frameworks.
[LG-54] Knowledge Migration Framework for Smart Contract Vulnerability Detection
链接: https://arxiv.org/abs/2412.11175
作者: Luqi Wang,Wenbao Jiang
关键词: smart contract vulnerability, contract vulnerability detection, smart contract, smart contracts play, AF-STip smart contract
类目: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:As a cornerstone of blockchain technology in the 3.0 era, smart contracts play a pivotal role in the evolution of blockchain systems. In order to address the limitations of existing smart contract vulnerability detection models with regard to their generalisation capability, an AF-STip smart contract vulnerability detection framework incorporating efficient knowledge migration is proposed. AF-STip employs the teacher network as the main model and migrates the knowledge processed by the smart contract to the student model using a data-free knowledge distillation method. The student model utilises this knowledge to enhance its vulnerability detection capabilities. The approach markedly enhances the model’s capacity for feature extraction and cross-class adaptation, while concurrently reducing computational overhead. In order to further enhance the extraction of vulnerability features, an adaptive fusion module is proposed in this paper, which aims to strengthen the interaction and fusion of feature information. The experimental results demonstrate that the STip model attains an average F1 value detection score of 91.16% for the four vulnerabilities without disclosing the original smart contract data. To validate the viability of the proposed lightweight migration approach, the student model is deployed in a migration learning task targeting a novel vulnerability type, resulting in an accuracy of 91.02% and an F1 score of 90.46%. To the best of our knowledge, AF-STip is the inaugural model to apply data-free knowledge migration to smart contract vulnerability detection. While markedly reducing the computational overhead, the method still demonstrates exceptional performance in detecting novel vulnerabilities.
[LG-55] Semi-Supervised Risk Control via Prediction-Powered Inference
链接: https://arxiv.org/abs/2412.11174
作者: Bat-Sheva Einbinder,Liran Ringel,Yaniv Romano
关键词: machine learning model, error rate control, rigorous error rate, risk-controlling prediction sets, error rate
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:
点击查看摘要
Abstract:The risk-controlling prediction sets (RCPS) framework is a general tool for transforming the output of any machine learning model to design a predictive rule with rigorous error rate control. The key idea behind this framework is to use labeled hold-out calibration data to tune a hyper-parameter that affects the error rate of the resulting prediction rule. However, the limitation of such a calibration scheme is that with limited hold-out data, the tuned hyper-parameter becomes noisy and leads to a prediction rule with an error rate that is often unnecessarily conservative. To overcome this sample-size barrier, we introduce a semi-supervised calibration procedure that leverages unlabeled data to rigorously tune the hyper-parameter without compromising statistical validity. Our procedure builds upon the prediction-powered inference framework, carefully tailoring it to risk-controlling tasks. We demonstrate the benefits and validity of our proposal through two real-data experiments: few-shot image classification and early time series classification.
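For context, the vanilla RCPS calibration step (without the paper's prediction-powered, semi-supervised extension) can be summarized as picking the most permissive hyper-parameter whose upper-confidence-bounded risk on hold-out data stays below the target level. The sketch below uses a simple Hoeffding bound and toy data; the threshold convention and loss are illustrative assumptions.

```python
# Minimal sketch of a vanilla RCPS-style calibration: choose the largest
# (most aggressive) threshold lambda whose Hoeffding upper confidence bound on
# the hold-out risk is <= alpha. The paper's contribution (prediction-powered,
# semi-supervised calibration with unlabeled data) is not shown here.
import numpy as np

def calibrate_rcps(losses_fn, lambdas, n_cal, alpha=0.1, delta=0.05):
    """losses_fn(lam) -> per-example losses in [0, 1]; risk assumed nondecreasing in lam."""
    slack = np.sqrt(np.log(1.0 / delta) / (2.0 * n_cal))       # Hoeffding slack
    for lam in sorted(lambdas, reverse=True):                   # most to least aggressive
        if losses_fn(lam).mean() + slack <= alpha:
            return lam                                          # largest lam certified safe
    return None                                                 # no lambda meets the target risk

# toy example: predicted probabilities; loss = "true class excluded from the set"
rng = np.random.default_rng(0)
n_cal = 2000
p_true = rng.uniform(0.3, 1.0, size=n_cal)          # model prob of the true class
losses = lambda lam: (p_true < lam).astype(float)   # include the class iff prob >= lam
lam_hat = calibrate_rcps(losses, np.linspace(0.0, 0.9, 91), n_cal, alpha=0.1)
print("calibrated threshold:", lam_hat)
```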
[LG-56] Learning Latent Spaces for Domain Generalization in Time Series Forecasting
链接: https://arxiv.org/abs/2412.11171
作者: Songgaojun Deng,Maarten de Rijke
关键词: Time series forecasting, Time series, time series data, unseen relevant domains, remains underexplored
类目: Machine Learning (cs.LG)
*备注: 18 pages with 8 figures
点击查看摘要
Abstract:Time series forecasting is vital in many real-world applications, yet developing models that generalize well on unseen relevant domains – such as forecasting web traffic data on new platforms/websites or estimating e-commerce demand in new regions – remains underexplored. Existing forecasting models often struggle with domain shifts in time series data, as the temporal patterns involve complex components like trends, seasonality, etc. While some prior work addresses this by matching feature distributions across domains or disentangling domain-shared features using label information, they fail to reveal insights into the latent temporal dependencies, which are critical for identifying common patterns across domains and achieving generalization. We propose a framework for domain generalization in time series forecasting by mining the latent factors that govern temporal dependencies across domains. Our approach uses a decomposition-based architecture with a new Conditional $\beta$-Variational Autoencoder (VAE), wherein time series data is first decomposed into trend-cyclical and seasonal components, each modeled independently through separate $\beta$-VAE modules. The $\beta$-VAE aims to capture disentangled latent factors that control temporal dependencies across domains. We enhance the learning of domain-specific information with a decoder-conditional design and introduce domain regularization to improve the separation of domain-shared and domain-specific latent factors. Our proposed method is flexible and can be applied to various time series forecasting models, enabling effective domain generalization with simplicity and efficiency. We validate its effectiveness on five real-world time series datasets, covering web traffic, e-commerce, finance and power consumption, demonstrating improved generalization performance over state-of-the-art methods.
[LG-57] PGD-Imp: Rethinking and Unleashing Potential of Classic PGD with Dual Strategies for Imperceptible Adversarial Attacks
链接: https://arxiv.org/abs/2412.11168
作者: Jin Li,Zitong Yu,Ziqiang He,Z. Jane Wang,Xiangui Kang
关键词: increasing research interests, recently attracted increasing, attracted increasing research, Imperceptible adversarial attacks, research interests
类目: Machine Learning (cs.LG); Cryptography and Security (cs.CR)
*备注:
点击查看摘要
Abstract:Imperceptible adversarial attacks have recently attracted increasing research interests. Existing methods typically incorporate external modules or loss terms other than a simple $l_p$-norm into the attack process to achieve imperceptibility, while we argue that such additional designs may not be necessary. In this paper, we rethink the essence of imperceptible attacks and propose two simple yet effective strategies to unleash the potential of PGD, the common and classical attack, for imperceptibility from an optimization perspective. Specifically, the Dynamic Step Size is introduced to find the optimal solution with minimal attack cost towards the decision boundary of the attacked model, and the Adaptive Early Stop strategy is adopted to reduce the redundant strength of adversarial perturbations to the minimum level. The proposed PGD-Imperceptible (PGD-Imp) attack achieves state-of-the-art results in imperceptible adversarial attacks for both untargeted and targeted scenarios. When performing untargeted attacks against ResNet-50, PGD-Imp attains 100% (+0.3%) ASR, 0.89 (-1.76) $l_2$ distance, and 52.93 (+9.2) PSNR with 57s (-371s) running time, significantly outperforming existing methods.
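The two strategies (a dynamic step size toward the decision boundary and an adaptive early stop once the attack succeeds) can be grafted onto an ordinary PGD loop. The PyTorch sketch below is a loose interpretation only: the step-size schedule and stopping rule are chosen arbitrarily and are not the paper's PGD-Imp algorithm.

```python
# Loose PyTorch sketch of PGD with a shrinking step size and early stopping
# once the example is misclassified -- an interpretation of the two strategies,
# not the paper's exact PGD-Imp algorithm.
import torch
import torch.nn.functional as F

def pgd_imp_sketch(model, x, y, eps=8 / 255, steps=100, step0=2 / 255):
    """Single-example untargeted attack; x in [0, 1], shape (1, C, H, W); y shape (1,)."""
    delta = torch.zeros_like(x, requires_grad=True)
    step = step0
    for _ in range(steps):
        logits = model(x + delta)
        if logits.argmax(dim=1) != y:                # adaptive early stop: already adversarial
            break
        loss = F.cross_entropy(logits, y)
        loss.backward()
        with torch.no_grad():
            delta += step * delta.grad.sign()        # gradient-sign ascent step
            delta.clamp_(-eps, eps)                  # stay inside the l_inf budget
            delta.copy_((x + delta).clamp(0, 1) - x) # keep the perturbed image valid
        delta.grad.zero_()
        step = max(step * 0.9, 1e-4)                 # dynamic (shrinking) step size
    return (x + delta).detach()
```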
[LG-58] Missing data imputation for noisy time-series data and applications in healthcare
链接: https://arxiv.org/abs/2412.11164
作者: Lien P. Le,Xuan-Hien Nguyen Thi,Thu Nguyen,Michael A. Riegler,Pål Halvorsen,Binh T. Nguyen
关键词: monitoring patient activity, Healthcare time series, Healthcare time, vital for monitoring, monitoring patient
类目: Machine Learning (cs.LG); Applications (stat.AP)
*备注:
点击查看摘要
Abstract:Healthcare time series data is vital for monitoring patient activity but often contains noise and missing values due to various reasons such as sensor errors or data interruptions. Imputation, i.e., filling in the missing values, is a common way to deal with this issue. In this study, we compare imputation methods, including Multiple Imputation with Random Forest (MICE-RF) and advanced deep learning approaches (SAITS, BRITS, Transformer) for noisy, missing time series data in terms of MAE, F1-score, AUC, and MCC, across missing data rates (10%-80%). Our results show that MICE-RF can effectively impute missing data compared to deep learning methods, and the improvement in classification of the imputed data indicates that imputation can have denoising effects. Therefore, using an imputation algorithm on time series with missing data can, at the same time, offer denoising effects.
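MICE with random-forest regressors corresponds closely to scikit-learn's IterativeImputer with a RandomForestRegressor estimator; a minimal sketch on synthetic data follows. Hyperparameters and the missingness pattern are arbitrary, not those of the study.

```python
# Minimal sketch of MICE-style imputation with random forests, roughly the
# MICE-RF baseline: scikit-learn's IterativeImputer with an RF estimator.
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X_full = rng.normal(size=(500, 8))
X_full[:, 0] = 0.7 * X_full[:, 1] + 0.3 * rng.normal(size=500)   # correlated columns help imputation

X_miss = X_full.copy()
mask = rng.random(X_miss.shape) < 0.3          # 30% missing completely at random
X_miss[mask] = np.nan

imputer = IterativeImputer(
    estimator=RandomForestRegressor(n_estimators=50, random_state=0),
    max_iter=5, random_state=0,
)
X_imp = imputer.fit_transform(X_miss)
print("imputation RMSE:", np.sqrt(np.mean((X_imp[mask] - X_full[mask]) ** 2)))
```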
[LG-59] Early Concept Drift Detection via Prediction Uncertainty AAAI-2025
链接: https://arxiv.org/abs/2412.11158
作者: Pengqian Lu,Jie Lu,Anjin Liu,Guangquan Zhang
关键词: poses significant challenges, machine learning models, streaming data scenarios, concept drift detectors, Concept drift
类目: Machine Learning (cs.LG)
*备注: Accepted by AAAI-2025
点击查看摘要
Abstract:Concept drift, characterized by unpredictable changes in data distribution over time, poses significant challenges to machine learning models in streaming data scenarios. Although error rate-based concept drift detectors are widely used, they often fail to identify drift in the early stages when the data distribution changes but error rates remain constant. This paper introduces the Prediction Uncertainty Index (PU-index), derived from the prediction uncertainty of the classifier, as a superior alternative to the error rate for drift detection. Our theoretical analysis demonstrates that: (1) The PU-index can detect drift even when error rates remain stable. (2) Any change in the error rate will lead to a corresponding change in the PU-index. These properties make the PU-index a more sensitive and robust indicator for drift detection compared to existing methods. We also propose a PU-index-based Drift Detector (PUDD) that employs a novel Adaptive PU-index Bucketing algorithm for detecting drift. Empirical evaluations on both synthetic and real-world datasets demonstrate PUDD’s efficacy in detecting drift in structured and image data.
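As a rough illustration only (not the paper's PU-index or its Adaptive PU-index Bucketing), prediction uncertainty can be monitored as the mean entropy of the classifier's predicted probabilities over a sliding window, flagging drift when it departs from a reference level. The window size, z-score threshold, and toy stream are assumptions.

```python
# Rough illustration of uncertainty-based drift monitoring: track the mean
# predictive entropy over a sliding window and flag drift when it deviates
# from the reference window. This is NOT the paper's PU-index or PUDD detector.
import numpy as np

def predictive_entropy(probs):
    p = np.clip(probs, 1e-12, 1.0)
    return -(p * np.log(p)).sum(axis=1)

def monitor(prob_stream, window=200, z_thresh=3.0):
    """prob_stream: (T, n_classes) predicted probabilities over time."""
    ent = predictive_entropy(prob_stream)
    ref_mean, ref_std = ent[:window].mean(), ent[:window].std() + 1e-12
    alarms = []
    for t in range(window, len(ent) - window):
        win_mean = ent[t:t + window].mean()
        if abs(win_mean - ref_mean) / (ref_std / np.sqrt(window)) > z_thresh:
            alarms.append(t)
    return alarms

# toy stream: confident predictions that become uncertain after t=1000 (drift)
rng = np.random.default_rng(0)
conf = np.clip(np.concatenate([rng.normal(0.9, 0.03, 1000),
                               rng.normal(0.6, 0.03, 1000)]), 0.51, 0.999)
probs = np.stack([conf, 1 - conf], axis=1)
alarms = monitor(probs)
print("first alarm at t =", alarms[0] if alarms else None)
```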
[LG-60] Modeling the Heterogeneous Duration of User Interest in Time-Dependent Recommendation: A Hidden Semi-Markov Approach
链接: https://arxiv.org/abs/2412.11127
作者: Haidong Zhang,Wancheng Ni,Xin Li,Yiping Yang
关键词: Recommender systems, time-dependent recommender systems, education materials, suggesting books, exploring their behaviors
类目: Information Retrieval (cs.IR); Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:Recommender systems are widely used for suggesting books, education materials, and products to users by exploring their behaviors. In reality, users’ preferences often change over time, leading to studies on time-dependent recommender systems. However, most existing approaches that deal with time information remain primitive. In this paper, we extend existing methods and propose a hidden semi-Markov model to track the change of users’ interests. Particularly, this model allows for capturing the different durations of user stays in a (latent) interest state, which can better model the heterogeneity of user interests and focuses. We derive an expectation maximization algorithm to estimate the parameters of the framework and predict users’ actions. Experiments on three real-world datasets show that our model significantly outperforms the state-of-the-art time-dependent and static benchmark methods. Further analyses of the experiment results indicate that the performance improvement is related to the heterogeneity of state durations and the drift of user interests in the dataset.
[LG-61] Multi-Graph Co-Training for Capturing User Intent in Session-based Recommendation COLING2025
链接: https://arxiv.org/abs/2412.11105
作者: Zhe Yang,Tiantian Liang
关键词: focuses on predicting, interact with based, based on sequences, sequences of anonymous, Session-based recommendation focuses
类目: Information Retrieval (cs.IR); Machine Learning (cs.LG)
*备注: COLING 2025 Main Conference
点击查看摘要
Abstract:Session-based recommendation focuses on predicting the next item a user will interact with based on sequences of anonymous user sessions. A significant challenge in this field is data sparsity due to the typically short-term interactions. Most existing methods rely heavily on users’ current interactions, overlooking the wealth of auxiliary information available. To address this, we propose a novel model, the Multi-Graph Co-Training model (MGCOT), which leverages not only the current session graph but also similar session graphs and a global item relation graph. This approach allows for a more comprehensive exploration of intrinsic relationships and better captures user intent from multiple views, enabling session representations to complement each other. Additionally, MGCOT employs multi-head attention mechanisms to effectively capture relevant session intent and uses contrastive learning to form accurate and robust session representations. Extensive experiments on three datasets demonstrate that MGCOT significantly enhances the performance of session-based recommendations, particularly on the Diginetica dataset, achieving improvements up to 2.00% in P@20 and 10.70% in MRR@20. Resources have been made publicly available in our GitHub repository this https URL.
[LG-62] Dynamic Graph Attention Networks for Travel Time Distribution Prediction in Urban Arterial Roads
链接: https://arxiv.org/abs/2412.11095
作者: Nooshin Yousefzadeh,Rahul Sengupta,Sanjay Ranka
关键词: key performance metric, Traffic Control Systems, Adaptive Traffic Control, reducing costs, performance metric
类目: Machine Learning (cs.LG)
*备注: 11 pages, 4 figures, 3 tables
点击查看摘要
Abstract:Effective congestion management along signalized corridors is essential for improving productivity and reducing costs, with arterial travel time serving as a key performance metric. Traditional approaches, such as Coordinated Signal Timing and Adaptive Traffic Control Systems, often lack scalability and generalizability across diverse urban layouts. We propose Fusion-based Dynamic Graph Neural Networks (FDGNN), a structured framework for simultaneous modeling of travel time distributions in both directions along arterial corridors. FDGNN utilizes attentional graph convolution on dynamic, bidirectional graphs and integrates fusion techniques to capture evolving spatiotemporal traffic dynamics. The framework is trained on extensive hours of simulation data and utilizes GPU computation to ensure scalability. The results demonstrate that our framework can efficiently and accurately model travel time as a normal distribution on arterial roads leveraging a unique dynamic graph representation of corridor traffic states. This representation integrates sequential traffic signal timing plans, local driving behaviors, temporal turning movement counts, and ingress traffic volumes, even when aggregated over intervals as short as a single cycle length. The results demonstrate resilience to effective traffic variations, including cycle lengths, green time percentages, traffic density, and counterfactual routes. Results further confirm its stability under varying conditions at different intersections. This framework supports dynamic signal timing, enhances congestion management, and improves travel time reliability in real-world applications.
[LG-63] BarcodeMamba: State Space Models for Biodiversity Analysis NEURIPS2024
链接: https://arxiv.org/abs/2412.11084
作者: Tiancheng Gao,Graham W. Taylor
关键词: automatic identification systems, building automatic identification, building automatic, systems that recognize, DNA barcodes
类目: Machine Learning (cs.LG); Genomics (q-bio.GN); Quantitative Methods (q-bio.QM)
*备注: 9 pages, 2 figures, accepted at Foundation Models for Science: Progress, Opportunities, and Challenges Workshop (NeurIPS 2024)
点击查看摘要
Abstract:DNA barcodes are crucial in biodiversity analysis for building automatic identification systems that recognize known species and discover unseen species. Unlike human genome modeling, barcode-based invertebrate identification poses challenges in the vast diversity of species and taxonomic complexity. Among Transformer-based foundation models, BarcodeBERT excelled in species-level identification of invertebrates, highlighting the effectiveness of self-supervised pretraining on barcode-specific datasets. Recently, structured state space models (SSMs) have emerged, with a time complexity that scales sub-quadratically with the context length. SSMs provide an efficient parameterization of sequence modeling relative to attention-based architectures. Given the success of Mamba and Mamba-2 in natural language, we designed BarcodeMamba, a performant and efficient foundation model for DNA barcodes in biodiversity analysis. We conducted a comprehensive ablation study on the impacts of self-supervised training and tokenization methods, and compared both versions of Mamba layers in terms of expressiveness and their capacity to identify “unseen” species held back from training. Our study shows that BarcodeMamba has better performance than BarcodeBERT even when using only 8.3% as many parameters, and improves accuracy to 99.2% on species-level accuracy in linear probing without fine-tuning for “seen” species. In our scaling study, BarcodeMamba with 63.6% of BarcodeBERT’s parameters achieved 70.2% genus-level accuracy in 1-nearest neighbor (1-NN) probing for unseen species. The code repository to reproduce our experiments is available at this https URL.
[LG-64] EquiFlow: Equivariant Conditional Flow Matching with Optimal Transport for 3D Molecular Conformation Prediction
链接: https://arxiv.org/abs/2412.11082
作者: Qingwen Tian,Yuxin Xu,Yixuan Yang,Zhen Wang,Ziqi Liu,Pengju Yan,Xiaolin Li
关键词: protein surfaces, play a key, key role, role in determining, conditional flow matching
类目: Machine Learning (cs.LG); Chemical Physics (physics.chem-ph); Biomolecules (q-bio.BM)
*备注: 11 pages,5 figures
点击查看摘要
Abstract:Molecular 3D conformations play a key role in determining how molecules interact with other molecules or protein surfaces. Recent deep learning advancements have improved conformation prediction, but slow training speeds and difficulties in utilizing high-degree features limit performance. We propose EquiFlow, an equivariant conditional flow matching model with optimal transport. EquiFlow uniquely applies conditional flow matching in molecular 3D conformation prediction, leveraging simulation-free training to address slow training speeds. It uses a modified Equiformer model to encode Cartesian molecular conformations along with their atomic and bond properties into higher-degree embeddings. Additionally, EquiFlow employs an ODE solver, providing faster inference speeds compared to diffusion models with SDEs. Experiments on the QM9 dataset show that EquiFlow predicts small molecule conformations more accurately than current state-of-the-art models.
[LG-65] Edge Contrastive Learning: An Augmentation-Free Graph Contrastive Learning Model
链接: https://arxiv.org/abs/2412.11075
作者: Yujun Li,Hongyuan Zhang,Yuan Yuan
关键词: unlabeled graph data, Graph contrastive learning, unlabeled graph, graph data, Graph contrastive
类目: Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:Graph contrastive learning (GCL) aims to learn representations from unlabeled graph data in a self-supervised manner and has developed rapidly in recent years. However, edge-level contrasts are not well explored by most existing GCL methods. Most studies in GCL only regard edges as auxiliary information while updating node features. One of the primary obstacles of edge-based GCL is the heavy computation burden. To tackle this issue, we propose a model that can efficiently learn edge features for GCL, namely Augmentation-Free Edge Contrastive Learning (AFECL), to achieve edge-edge contrast. AFECL depends on no augmentation and consists of two parts. Firstly, we design a novel edge feature generation method, where edge features are computed by embedding concatenation of their connected nodes. Secondly, an edge contrastive learning scheme is developed, where edges connecting the same nodes are defined as positive pairs, and other edges are defined as negative pairs. Experimental results show that compared with recent state-of-the-art GCL methods or even some supervised GNNs, AFECL achieves SOTA performance on link prediction and semi-supervised node classification of extremely scarce labels. The source code is available at this https URL.
[LG-66] Navigating Towards Fairness with Data Selection
链接: https://arxiv.org/abs/2412.11072
作者: Yixuan Zhang,Zhidong Li,Yang Wang,Fang Chen,Xuhui Fan,Feng Zhou
关键词: Machine learning algorithms, Machine learning, inherent data biases, learning algorithms, algorithms often struggle
类目: Machine Learning (cs.LG); Computers and Society (cs.CY)
*备注:
点击查看摘要
Abstract:Machine learning algorithms often struggle to eliminate inherent data biases, particularly those arising from unreliable labels, which poses a significant challenge in ensuring fairness. Existing fairness techniques that address label bias typically involve modifying models and intervening in the training process, but these lack flexibility for large-scale datasets. To address this limitation, we introduce a data selection method designed to efficiently and flexibly mitigate label bias, tailored to more practical needs. Our approach utilizes a zero-shot predictor as a proxy model that simulates training on a clean holdout set. This strategy, supported by peer predictions, ensures the fairness of the proxy model and eliminates the need for an additional holdout set, which is a common requirement in previous methods. Without altering the classifier’s architecture, our modality-agnostic method effectively selects appropriate training data and has proven efficient and effective in handling label bias and improving fairness across diverse datasets in experimental evaluations.
[LG-67] Learning Robust and Privacy-Preserving Representations via Information Theory
链接: https://arxiv.org/abs/2412.11066
作者: Binghui Zhang,Sayedeh Leila Noorbakhsh,Yun Dong,Yuan Hong,Binghui Wang
关键词: Machine learning models, private attribute inference, Machine learning, privacy attacks, attribute inference adversaries
类目: Machine Learning (cs.LG); Cryptography and Security (cs.CR)
*备注:
点击查看摘要
Abstract:Machine learning models are vulnerable to both security attacks (e.g., adversarial examples) and privacy attacks (e.g., private attribute inference). We take the first step to mitigate both the security and privacy attacks, and maintain task utility as well. Particularly, we propose an information-theoretic framework to achieve the goals through the lens of representation learning, i.e., learning representations that are robust to both adversarial examples and attribute inference adversaries. We also derive novel theoretical results under our framework, e.g., the inherent trade-off between adversarial robustness/utility and attribute privacy, and guaranteed attribute privacy leakage against attribute inference adversaries.
[LG-68] DisCo-DSO: Coupling Discrete and Continuous Optimization for Efficient Generative Design in Hybrid Spaces AAAI-25
链接: https://arxiv.org/abs/2412.11051
作者: Jacob F. Pettit,Chak Shing Lee,Jiachen Yang,Alex Ho,Daniel Faissol,Brenden Petersen,Mikel Landajuela
关键词: Discrete-Continuous Deep Symbolic, Deep Symbolic Optimization, variable-length spaces, symbolic regression, challenge of black-box
类目: Machine Learning (cs.LG); Optimization and Control (math.OC)
*备注: Accepted at AAAI-25
点击查看摘要
Abstract:We consider the challenge of black-box optimization within hybrid discrete-continuous and variable-length spaces, a problem that arises in various applications, such as decision tree learning and symbolic regression. We propose DisCo-DSO (Discrete-Continuous Deep Symbolic Optimization), a novel approach that uses a generative model to learn a joint distribution over discrete and continuous design variables to sample new hybrid designs. In contrast to standard decoupled approaches, in which the discrete and continuous variables are optimized separately, our joint optimization approach uses fewer objective function evaluations, is robust against non-differentiable objectives, and learns from prior samples to guide the search, leading to significant improvement in performance and sample efficiency. Our experiments on a diverse set of optimization tasks demonstrate that the advantages of DisCo-DSO become increasingly evident as the complexity of the problem increases. In particular, we illustrate DisCo-DSO’s superiority over the state-of-the-art methods for interpretable reinforcement learning with decision trees.
[LG-69] Understanding and Mitigating Memorization in Diffusion Models for Tabular Data
链接: https://arxiv.org/abs/2412.11044
作者: Zhengyu Fang,Zhimeng Jiang,Huiyuan Chen,Xiao Li,Jing Li
关键词: attracted significant research, significant research interest, tabular diffusion models, diffusion models, models greatly improving
类目: Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:Tabular data generation has attracted significant research interest in recent years, with the tabular diffusion models greatly improving the quality of synthetic data. However, while memorization, where models inadvertently replicate exact or near-identical training data, has been thoroughly investigated in image and text generation, its effects on tabular data remain largely unexplored. In this paper, we conduct the first comprehensive investigation of memorization phenomena in diffusion models for tabular data. Our empirical analysis reveals that memorization appears in tabular diffusion models and increases with larger training epochs. We further examine the influence of factors such as dataset sizes, feature dimensions, and different diffusion models on memorization. Additionally, we provide a theoretical explanation for why memorization occurs in tabular diffusion models. To address this issue, we propose TabCutMix, a simple yet effective data augmentation technique that exchanges randomly selected feature segments between random same-class training sample pairs. Building upon this, we introduce TabCutMixPlus, an enhanced method that clusters features based on feature correlations and ensures that features within the same cluster are exchanged together during augmentation. This clustering mechanism mitigates out-of-distribution (OOD) generation issues by maintaining feature coherence. Experimental results across various datasets and diffusion models demonstrate that TabCutMix effectively mitigates memorization while maintaining high-quality data generation.
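下面用 numpy 粗略示意摘要中 TabCutMix 式的增强:在同类样本间随机配对并交换随机选取的一段特征;交换比例与列的选取方式为笔者假设,TabCutMixPlus 的相关性聚类未在此展示。

```python
# Rough sketch of a TabCutMix-style augmentation: randomly pair samples of the same
# class and swap a randomly chosen subset of feature columns between the pairs.
# The segment-selection rule and swap ratio are assumptions, not the authors' code.
import numpy as np

def tabcutmix(X: np.ndarray, y: np.ndarray, swap_ratio: float = 0.3, seed: int = 0):
    rng = np.random.default_rng(seed)
    X_aug = X.copy()
    n_swap = max(1, int(swap_ratio * X.shape[1]))
    for cls in np.unique(y):
        idx = np.flatnonzero(y == cls)
        partners = rng.permutation(idx)                  # random same-class pairing
        cols = rng.choice(X.shape[1], size=n_swap, replace=False)
        X_aug[np.ix_(idx, cols)] = X[np.ix_(partners, cols)]  # exchange feature segment
    return X_aug

X = np.random.rand(6, 5)
y = np.array([0, 0, 0, 1, 1, 1])
print(tabcutmix(X, y))
```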
[LG-70] FlashSparse: Minimizing Computation Redundancy for Fast Sparse Matrix Multiplications on Tensor Cores PPOPP’25
链接: https://arxiv.org/abs/2412.11007
作者: Jinliang Shi,Shigang Li,Youxuan Xu,Rongtian Fu,Xueying Wang,Tong Wu
关键词: Sampled Dense-dense Matrix, Sampled Dense-dense, Dense-dense Matrix Multiplication, Sparse Matrix-matrix Multiplication, Tensor Core Units
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)
*备注: Accepted by 30th ACM SIGPLAN Annual Symposium on Principles and Practice of Parallel Programming (PPoPP’25)
点击查看摘要
Abstract:Sparse Matrix-matrix Multiplication (SpMM) and Sampled Dense-dense Matrix Multiplication (SDDMM) are important sparse operators in scientific computing and deep learning. Tensor Core Units (TCUs) enhance modern accelerators with superior computing power, which is promising to boost the performance of matrix operators to a higher level. However, due to the irregularity of unstructured sparse data, it is difficult to deliver practical speedups on TCUs. To this end, we propose FlashSparse, a novel approach to bridge the gap between sparse workloads and the TCU architecture. Specifically, FlashSparse minimizes the sparse granularity for SpMM and SDDMM on TCUs through a novel swap-and-transpose matrix multiplication strategy. Benefiting from the minimum sparse granularity, the computation redundancy is remarkably reduced while the computing power of TCUs is fully utilized. Besides, FlashSparse is equipped with a memory-efficient thread mapping strategy for coalesced data access and a sparse matrix storage format to save memory footprint. Extensive experimental results on H100 and RTX 4090 GPUs show that FlashSparse sets a new state-of-the-art for sparse matrix multiplications (geometric mean 5.5x speedup over DTC-SpMM and 3.22x speedup over RoDe).
[LG-71] Optimal Rates for Robust Stochastic Convex Optimization
链接: https://arxiv.org/abs/2412.11003
作者: Changyu Gao,Andrew Lowy,Xingyu Zhou,Stephen J. Wright
关键词: machine learning, high-dimensional spaces, necessitates the development, machine learning algorithms, contamination model
类目: Machine Learning (cs.LG); Optimization and Control (math.OC); Machine Learning (stat.ML)
*备注:
点击查看摘要
Abstract:The sensitivity of machine learning algorithms to outliers, particularly in high-dimensional spaces, necessitates the development of robust methods. Within the framework of the \epsilon -contamination model, where the adversary can inspect and replace up to an \epsilon fraction of the samples, a fundamental open question is determining the optimal rates for robust stochastic convex optimization (robust SCO), provided the samples under \epsilon -contamination. We develop novel algorithms that achieve minimax-optimal excess risk (up to logarithmic factors) under the \epsilon -contamination model. Our approach advances beyond existing algorithms, which are not only suboptimal but also constrained by stringent requirements, including Lipschitzness and smoothness conditions on sample functions. Our algorithms achieve optimal rates while removing these restrictive assumptions, and notably, remain effective for nonsmooth but Lipschitz population risks.
[LG-72] A Staged Deep Learning Approach to Spatial Refinement in 3D Temporal Atmospheric Transport
链接: https://arxiv.org/abs/2412.10945
作者: M. Giselle Fernández-Godino,Wai Tong Chung,Akshay A. Gowardhan,Matthias Ihme,Qingkai Kong,Donald D. Lucas,Stephen C. Myers
关键词: spatiotemporal simulations effectively, simulations effectively capture, High-resolution spatiotemporal simulations, effectively capture, capture the complexities
类目: Machine Learning (cs.LG); Atmospheric and Oceanic Physics (physics.ao-ph)
*备注: 12 pages, 10 figures
点击查看摘要
Abstract:High-resolution spatiotemporal simulations effectively capture the complexities of atmospheric plume dispersion in complex terrain. However, their high computational cost makes them impractical for applications requiring rapid responses or iterative processes, such as optimization, uncertainty quantification, or inverse modeling. To address this challenge, this work introduces the Dual-Stage Temporal Three-dimensional UNet Super-resolution (DST3D-UNet-SR) model, a highly efficient deep learning model for plume dispersion prediction. DST3D-UNet-SR is composed of two sequential modules: the temporal module (TM), which predicts the transient evolution of a plume in complex terrain from low-resolution temporal data, and the spatial refinement module (SRM), which subsequently enhances the spatial resolution of the TM predictions. We train DST3D-UNet-SR using a comprehensive dataset derived from high-resolution large eddy simulations (LES) of plume transport. We propose the DST3D-UNet-SR model to significantly accelerate LES simulations of three-dimensional plume dispersion by three orders of magnitude. Additionally, the model demonstrates the ability to dynamically adapt to evolving conditions through the incorporation of new observational data, substantially improving prediction accuracy in high-concentration regions near the source. Keywords: Atmospheric sciences, Geosciences, Plume transport, 3D temporal sequences, Artificial intelligence, CNN, LSTM, Autoencoder, Autoregressive model, U-Net, Super-resolution, Spatial Refinement.
[LG-73] Linear Programming based Approximation to Individually Fair k-Clustering with Outliers
链接: https://arxiv.org/abs/2412.10923
作者: Binita Maity,Shrutimoy Das,Anirban Dasgupta
关键词: Individual fairness guarantees, Individual fairness, desirable properties, hard to formalize, Machine Learning
类目: Machine Learning (cs.LG); Data Structures and Algorithms (cs.DS); Machine Learning (stat.ML)
*备注: 12 pages
点击查看摘要
Abstract:Individual fairness guarantees are often desirable properties to have, but they become hard to formalize when the dataset contains outliers. Here, we investigate the problem of developing an individually fair k -means clustering algorithm for datasets that contain outliers. That is, given n points and k centers, we want that for each point which is not an outlier, there must be a center within the \frac{n}{k} nearest neighbours of the given point. While a few of the recent works have looked into individually fair clustering, this is the first work that explores this problem in the presence of outliers for k -means clustering. For this purpose, we define and solve a linear program (LP) that helps us identify the outliers. We exclude these outliers from the dataset and apply a rounding algorithm that computes the k centers, such that the fairness constraint of the remaining points is satisfied. We also provide theoretical guarantees that our method leads to a guaranteed approximation of the fair radius as well as the clustering cost. We also demonstrate our techniques empirically on real-world datasets.
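为了把上述个体公平约束写得更具体,下面的片段直接检查该条件:每个非离群点到最近中心的距离不应超过它到第 n/k 个最近邻的距离。这只是按该定义写的约束检查器,并非论文的 LP 加舍入算法。

```python
# Small checker for the individual-fairness condition stated above: every non-outlier
# point must have some center within the distance to its (n/k)-th nearest neighbour.
import numpy as np

def fairness_violations(points, centers, outlier_mask):
    n, k = len(points), len(centers)
    radius_rank = n // k
    violations = []
    for i, p in enumerate(points):
        if outlier_mask[i]:
            continue
        # fair radius = distance from p to its (n/k)-th nearest neighbour
        d_to_points = np.linalg.norm(points - p, axis=1)
        fair_radius = np.sort(d_to_points)[min(radius_rank, n - 1)]
        d_to_centers = np.linalg.norm(centers - p, axis=1)
        if d_to_centers.min() > fair_radius:
            violations.append(i)
    return violations

pts = np.random.rand(20, 2)
ctrs = pts[np.random.choice(20, 4, replace=False)]
print(fairness_violations(pts, ctrs, np.zeros(20, dtype=bool)))
```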
[LG-74] C3: Learning Congestion Controllers with Formal Certificates
链接: https://arxiv.org/abs/2412.10915
作者: Chenxi Yang,Divyanshu Saxena,Rohit Dwivedula,Kshiteej Mahajan,Swarat Chaudhuri,Aditya Akella
关键词: traditional heuristic algorithms, heuristic algorithms, compared to traditional, traditional heuristic, Learning-based congestion controllers
类目: Machine Learning (cs.LG); Networking and Internet Architecture (cs.NI)
*备注:
点击查看摘要
Abstract:Learning-based congestion controllers offer better adaptability compared to traditional heuristic algorithms. However, the inherent unreliability of learning techniques can cause learning-based controllers to behave poorly, creating a need for formal guarantees. While methods for formally verifying learned congestion controllers exist, these methods offer binary feedback that cannot optimize the controller toward better behavior. We improve this state-of-the-art via C3, a new learning framework for congestion control that integrates the concept of formal certification in the learning loop. C3 uses an abstract interpreter that can produce robustness and performance certificates to guide the training process, rewarding models that are robust and performant even on worst-case inputs. Our evaluation demonstrates that unlike state-of-the-art learned controllers, C3-trained controllers provide both adaptability and worst-case reliability across a range of network conditions.
[LG-75] Exploring Grokking: Experimental and Mechanistic Investigations
链接: https://arxiv.org/abs/2412.10898
作者: Hu Qiye,Zhou Hao,Yu RuoXi
关键词: garnered significant interest, over-parameterized neural networks, significant interest, neural network initially, garnered significant
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:
点击查看摘要
Abstract:The phenomenon of grokking in over-parameterized neural networks has garnered significant interest. It involves the neural network initially memorizing the training set with zero training error and near-random test error. Subsequent prolonged training leads to a sharp transition from no generalization to perfect generalization. Our study comprises extensive experiments and an exploration of the research behind the mechanism of grokking. Through experiments, we gained insights into its behavior concerning the training data fraction, the model, and the optimization. The mechanism of grokking has been a subject of various viewpoints proposed by researchers, and we introduce some of these perspectives.
[LG-76] Task Diversity in Bayesian Federated Learning: Simultaneous Processing of Classification and Regression
链接: https://arxiv.org/abs/2412.10897
作者: Junliang Lyu,Yixuan Zhang,Xiaoling Lu,Feng Zhou
关键词: federated learning approaches, current federated learning, federated learning, work addresses, addresses a key
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:
点击查看摘要
Abstract:This work addresses a key limitation in current federated learning approaches, which predominantly focus on homogeneous tasks, neglecting the task diversity on local devices. We propose a principled integration of multi-task learning using multi-output Gaussian processes (MOGP) at the local level and federated learning at the global level. MOGP handles correlated classification and regression tasks, offering a Bayesian non-parametric approach that naturally quantifies uncertainty. The central server aggregates the posteriors from local devices, updating a global MOGP prior redistributed for training local models until convergence. Challenges in performing posterior inference on local devices are addressed through the Pólya-Gamma augmentation technique and mean-field variational inference, enhancing computational efficiency and convergence rate. Experimental results on both synthetic and real data demonstrate superior predictive performance, OOD detection, uncertainty calibration and convergence rate, highlighting the method’s potential in diverse applications. Our code is publicly available at this https URL.
[LG-77] Multi-Class and Multi-Task Strategies for Neural Directed Link Prediction
链接: https://arxiv.org/abs/2412.10895
作者: Claudio Moroni,Claudio Borile,Carolina Mattsson,Michele Starnini,André Panisson
关键词: Graph Representation Learning, Directed Link Prediction, Link Prediction, knowledge graph completion, Representation Learning
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: 15 pages, 2 figures
点击查看摘要
Abstract:Link Prediction is a foundational task in Graph Representation Learning, supporting applications like link recommendation, knowledge graph completion and graph generation. Graph Neural Networks have shown the most promising results in this domain and are currently the de facto standard approach to learning from graph data. However, a key distinction exists between Undirected and Directed Link Prediction: the former just predicts the existence of an edge, while the latter must also account for edge directionality and bidirectionality. This translates to Directed Link Prediction (DLP) having three sub-tasks, each defined by how training, validation and test sets are structured. Most research on DLP overlooks this trichotomy, focusing solely on the “existence” sub-task, where training and test sets are random, uncorrelated samples of positive and negative directed edges. Even in the works that recognize the aforementioned trichotomy, models fail to perform well across all three sub-tasks. In this study, we experimentally demonstrate that training Neural DLP (NDLP) models only on the existence sub-task, using methods adapted from Neural Undirected Link Prediction, results in parameter configurations that fail to capture directionality and bidirectionality, even after rebalancing edge classes. To address this, we propose three strategies that handle the three tasks simultaneously. Our first strategy, the Multi-Class Framework for Neural Directed Link Prediction (MC-NDLP) maps NDLP to a Multi-Class training objective. The second and third approaches adopt a Multi-Task perspective, either with a Multi-Objective (MO-DLP) or a Scalarized (S-DLP) strategy. Our results show that these methods outperform traditional approaches across multiple datasets and models, achieving equivalent or superior performance in addressing the three DLP sub-tasks.
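作为 MC-NDLP 多分类视角的一个直观示例,可以把每个无序节点对映射到四类标签(无边、仅正向、仅反向、双向);具体的类别编码为笔者的假设,并非作者代码。

```python
# Illustrative label construction for a multi-class view of directed link prediction:
# each unordered node pair gets one of four classes. The exact class set and encoding
# are assumptions consistent with the abstract, not the authors' implementation.
def pair_labels(num_nodes, edges):
    """edges: set of directed (u, v) tuples -> dict mapping (u, v) with u<v to a class."""
    labels = {}
    for u in range(num_nodes):
        for v in range(u + 1, num_nodes):
            fwd, bwd = (u, v) in edges, (v, u) in edges
            if fwd and bwd:
                labels[(u, v)] = 3   # bidirectional
            elif fwd:
                labels[(u, v)] = 1   # forward edge only
            elif bwd:
                labels[(u, v)] = 2   # backward edge only
            else:
                labels[(u, v)] = 0   # no edge
    return labels

print(pair_labels(4, {(0, 1), (1, 0), (2, 3)}))
```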
[LG-78] Adaptive Quantization Resolution and Power Control for Federated Learning over Cell-free Networks
链接: https://arxiv.org/abs/2412.10878
作者: Afsaneh Mahmoudi,Emil Björnson
关键词: distributed learning framework, preserving data privacy, Federated learning, exchanging local model, distributed learning
类目: Machine Learning (cs.LG); Networking and Internet Architecture (cs.NI); Signal Processing (eess.SP)
*备注:
点击查看摘要
Abstract:Federated learning (FL) is a distributed learning framework where users train a global model by exchanging local model updates with a server instead of raw datasets, preserving data privacy and reducing communication overhead. However, the latency grows with the number of users and the model size, impeding the successful FL over traditional wireless networks with orthogonal access. Cell-free massive multiple-input multiple-output (CFmMIMO) is a promising solution to serve numerous users on the same time/frequency resource with similar rates. This architecture greatly reduces uplink latency through spatial multiplexing but does not take application characteristics into account. In this paper, we co-optimize the physical layer with the FL application to mitigate the straggler effect. We introduce a novel adaptive mixed-resolution quantization scheme of the local gradient vector updates, where only the most essential entries are given high resolution. Thereafter, we propose a dynamic uplink power control scheme to manage the varying user rates and mitigate the straggler effect. The numerical results demonstrate that the proposed method achieves test accuracy comparable to classic FL while reducing communication overhead by at least 93% on the CIFAR-10, CIFAR-100, and Fashion-MNIST datasets. We compare our methods against AQUILA, Top-q, and LAQ, using the max-sum rate and Dinkelbach power control schemes. Our approach reduces the communication overhead by 75% and achieves 10% higher test accuracy than these benchmarks within a constrained total latency budget.
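下面粗略示意自适应混合分辨率量化的思路:只对幅值最大的梯度分量使用高分辨率,其余分量粗量化。比特宽度与 top 比例为示意取值;论文中与之联合设计的上行功率控制未展示。

```python
# Rough sketch of adaptive mixed-resolution quantization: keep the largest-magnitude
# gradient entries at high resolution and quantize the rest coarsely. Bit widths and
# the top fraction are illustrative choices, not the paper's tuned values.
import numpy as np

def uniform_quantize(x, bits):
    if x.size == 0:
        return x
    lo, hi = x.min(), x.max()
    if hi == lo:
        return x.copy()
    levels = 2 ** bits - 1
    q = np.round((x - lo) / (hi - lo) * levels)
    return lo + q * (hi - lo) / levels

def mixed_resolution_quantize(grad, top_frac=0.1, hi_bits=12, lo_bits=4):
    g = grad.copy()
    k = max(1, int(top_frac * g.size))
    top_idx = np.argsort(np.abs(g))[-k:]                 # most essential entries
    rest_idx = np.setdiff1d(np.arange(g.size), top_idx)
    g[top_idx] = uniform_quantize(g[top_idx], hi_bits)   # fine resolution
    g[rest_idx] = uniform_quantize(g[rest_idx], lo_bits)  # coarse resolution
    return g

grad = np.random.randn(1000)
q = mixed_resolution_quantize(grad)
print(np.abs(grad - q).mean())
```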
[LG-79] DUET: Dual Clustering Enhanced Multivariate Time Series Forecasting KDD2025
链接: https://arxiv.org/abs/2412.10859
作者: Xiangfei Qiu,Xingjian Wu,Yan Lin,Chenjuan Guo,Jilin Hu,Bin Yang
关键词: energy management, financial investment, traffic optimization, weather forecasting, Clustering Module
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: Accepted by KDD 2025
点击查看摘要
Abstract:Multivariate time series forecasting is crucial for various applications, such as financial investment, energy management, weather forecasting, and traffic optimization. However, accurate forecasting is challenging due to two main factors. First, real-world time series often show heterogeneous temporal patterns caused by distribution shifts over time. Second, correlations among channels are complex and intertwined, making it hard to model the interactions among channels precisely and flexibly. In this study, we address these challenges by proposing a general framework called DUET, which introduces DUal clustering on the temporal and channel dimensions to Enhance multivariate Time series forecasting. First, we design a Temporal Clustering Module (TCM) that clusters time series into fine-grained distributions to handle heterogeneous temporal patterns. For different distribution clusters, we design various pattern extractors to capture their intrinsic temporal patterns, thus modeling the heterogeneity. Second, we introduce a novel Channel-Soft-Clustering strategy and design a Channel Clustering Module (CCM), which captures the relationships among channels in the frequency domain through metric learning and applies sparsification to mitigate the adverse effects of noisy channels. Finally, DUET combines TCM and CCM to incorporate both the temporal and channel dimensions. Extensive experiments on 25 real-world datasets from 10 application domains demonstrate the state-of-the-art performance of DUET.
[LG-80] RWKV-edge: Deeply Compressed RWKV for Resource-Constrained Devices
链接: https://arxiv.org/abs/2412.10856
作者: Wonkyo Choe,Yangfeng Ji,Felix Lin
关键词: achieved major breakthroughs, Receptance Weighted Key, robotics and wearables, major breakthroughs, resource-constrained platforms
类目: Machine Learning (cs.LG); Performance (cs.PF)
*备注:
点击查看摘要
Abstract:To deploy LLMs on resource-constrained platforms such as mobile robotics and wearables, non-transformer LLMs have achieved major breakthroughs. Recently, a novel RNN-based LLM family, Receptance Weighted Key Value (RWKV) models, have shown promising results in text generation on resource-constrained devices thanks to their computational efficiency. However, these models remain too large to be deployed on embedded devices due to their high parameter count. In this paper, we propose an efficient suite of compression techniques, tailored to the RWKV architecture. These techniques include low-rank approximation, sparsity predictors, and clustering head, designed to align with the model size. Our methods compress the RWKV models by 4.95–3.8x with only 2.95pp loss in accuracy.
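在所列技术中,低秩近似最容易示意:用截断 SVD 把权重矩阵替换为两个窄矩阵因子。下面的秩为任意选取;稀疏预测器与 clustering head 未展示。

```python
# Minimal sketch of low-rank weight compression (one of the listed techniques):
# replace a weight matrix W by two thin factors from a truncated SVD, so that
# W ~= A @ B with (out+in)*rank parameters instead of out*in. The rank is arbitrary.
import numpy as np

def low_rank_factors(W: np.ndarray, rank: int):
    U, S, Vt = np.linalg.svd(W, full_matrices=False)
    A = U[:, :rank] * S[:rank]        # [out, rank]
    B = Vt[:rank, :]                  # [rank, in]
    return A, B

W = np.random.randn(512, 512)
A, B = low_rank_factors(W, rank=64)
x = np.random.randn(512)
# relative error of the factored forward pass vs. the dense one
print(np.linalg.norm(W @ x - A @ (B @ x)) / np.linalg.norm(W @ x))
```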
[LG-81] Fast and Robust Visuomotor Riemannian Flow Matching Policy
链接: https://arxiv.org/abs/2412.10855
作者: Haoran Ding,Noémie Jaquier,Jan Peters,Leonel Rozo
关键词: Diffusion-based visuomotor policies, effectively combining visual, combining visual data, Diffusion-based visuomotor, multi-modal action distributions
类目: Robotics (cs.RO); Machine Learning (cs.LG)
*备注: 14 pages, 10 figures, 9 tables, project website: this https URL
点击查看摘要
Abstract:Diffusion-based visuomotor policies excel at learning complex robotic tasks by effectively combining visual data with high-dimensional, multi-modal action distributions. However, diffusion models often suffer from slow inference due to costly denoising processes or require complex sequential training arising from recent distilling approaches. This paper introduces Riemannian Flow Matching Policy (RFMP), a model that inherits the easy training and fast inference capabilities of flow matching (FM). Moreover, RFMP inherently incorporates geometric constraints commonly found in realistic robotic applications, as the robot state resides on a Riemannian manifold. To enhance the robustness of RFMP, we propose Stable RFMP (SRFMP), which leverages LaSalle’s invariance principle to equip the dynamics of FM with stability to the support of a target Riemannian distribution. Rigorous evaluation on eight simulated and real-world tasks show that RFMP successfully learns and synthesizes complex sensorimotor policies on Euclidean and Riemannian spaces with efficient training and inference phases, outperforming Diffusion Policies while remaining competitive with Consistency Policies.
[LG-82] Improving Graph Neural Networks via Adversarial Robustness Evaluation
链接: https://arxiv.org/abs/2412.10850
作者: Yongyu Wang
关键词: neural network architectures, Graph Neural Networks, Neural Networks, network architectures, Graph Neural
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:
点击查看摘要
Abstract:Graph Neural Networks (GNNs) are currently one of the most powerful types of neural network architectures. Their advantage lies in the ability to leverage both the graph topology, which represents the relationships between samples, and the features of the samples themselves. However, the given graph topology often contains noisy edges, and GNNs are vulnerable to noise in the graph structure. This issue remains unresolved. In this paper, we propose using adversarial robustness evaluation to select a small subset of robust nodes that are less affected by noise. We then only feed the features of these robust nodes, along with the KNN graph constructed from these nodes, into the GNN for classification. Additionally, we compute the centroids for each class. For the remaining non-robust nodes, we assign them to the class whose centroid is closest to them. Experimental results show that this method significantly improves the accuracy of GNNs.
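下面按摘要描述给出流程的示意版本,其中对抗鲁棒性得分以占位输入代替:保留最鲁棒的节点,在其上构建 KNN 图(供 GNN 使用,此处省略 GNN 本身),其余节点按最近类中心分配。

```python
# Schematic version of the described pipeline, with the adversarial-robustness score
# left as a user-supplied placeholder: keep the most robust nodes, build a KNN graph
# over them (to be fed to a GNN, not shown), and assign the remaining nodes to the
# nearest class centroid computed from the robust subset.
import numpy as np
from sklearn.neighbors import kneighbors_graph

def split_and_assign(X, y, robustness_score, keep_frac=0.5, k=5):
    order = np.argsort(-robustness_score)
    robust = order[: int(keep_frac * len(X))]
    rest = order[int(keep_frac * len(X)):]
    knn = kneighbors_graph(X[robust], n_neighbors=k)       # graph for the GNN
    centroids = {c: X[robust][y[robust] == c].mean(axis=0) for c in np.unique(y[robust])}
    classes = np.array(sorted(centroids))
    dists = np.stack([np.linalg.norm(X[rest] - centroids[c], axis=1) for c in classes])
    y_rest_pred = classes[dists.argmin(axis=0)]            # nearest-centroid assignment
    return robust, knn, rest, y_rest_pred

X, y = np.random.rand(100, 8), np.random.randint(0, 3, 100)
score = np.random.rand(100)   # placeholder for the adversarial robustness evaluation
print(split_and_assign(X, y, score)[3][:10])
```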
[LG-83] A Diagrammatic Approach to Improve Computational Efficiency in Group Equivariant Neural Networks
链接: https://arxiv.org/abs/2412.10837
作者: Edward Pearce-Crump,William J. Knottenbelt
关键词: Group equivariant neural, equivariant neural networks, underlying symmetries, growing in importance, ability to generalise
类目: Machine Learning (cs.LG); Combinatorics (math.CO); Representation Theory (math.RT); Machine Learning (stat.ML)
*备注: 51 pages
点击查看摘要
Abstract:Group equivariant neural networks are growing in importance owing to their ability to generalise well in applications where the data has known underlying symmetries. Recent characterisations of a class of these networks that use high-order tensor power spaces as their layers suggest that they have significant potential; however, their implementation remains challenging owing to the prohibitively expensive nature of the computations that are involved. In this work, we present a fast matrix multiplication algorithm for any equivariant weight matrix that maps between tensor power layer spaces in these networks for four groups: the symmetric, orthogonal, special orthogonal, and symplectic groups. We obtain this algorithm by developing a diagrammatic framework based on category theory that enables us to not only express each weight matrix as a linear combination of diagrams but also makes it possible for us to use these diagrams to factor the original computation into a series of steps that are optimal. We show that this algorithm improves the Big- O time complexity exponentially in comparison to a naïve matrix multiplication.
[LG-84] Diffusion-based Method for Satellite Pattern-of-Life Identification
链接: https://arxiv.org/abs/2412.10814
作者: Yongchao Ye,Xinting Zhu,Xuejin Shen,Xiaoyu Chen,Lishuai Li,S. Joe Qin
关键词: typical satellite behaviors, involving the analysis, Satellite, crucial for space, space safety
类目: Machine Learning (cs.LG); Computational Engineering, Finance, and Science (cs.CE)
*备注:
点击查看摘要
Abstract:Satellite pattern-of-life (PoL) identification is crucial for space safety and satellite monitoring, involving the analysis of typical satellite behaviors such as station-keeping, drift, etc. However, existing PoL identification methods remain underdeveloped due to the complexity of aerospace systems, variability in satellite behaviors, and fluctuating observation sampling rates. In a first attempt, we developed a domain expertise-informed machine learning method (Expert-ML) to combine satellite orbital movement knowledge and machine learning models. The Expert-ML method achieved high accuracy results in simulation data and real-world data with normal sampling rate. However, this approach lacks generality as it requires domain expertise and its performance degraded significantly when the data sampling rate varied. To achieve generality, we propose a novel diffusion-based PoL identification method. Distinct from prior approaches, the proposed method leverages a diffusion model to achieve end-to-end identification without manual refinement or domain-specific knowledge. Specifically, we employ a multivariate time-series encoder to capture hidden representations of satellite positional data. The encoded features are subsequently incorporated as conditional information in the denoising process to generate PoL labels. Through experimentation across real-world satellite settings, our proposed diffusion-based method demonstrates its high identification quality and provides a robust solution even with reduced data sampling rates, indicating its great potential in practical satellite behavior pattern identification, tracking and related mission deployment.
[LG-85] Audio-based Anomaly Detection in Industrial Machines Using Deep One-Class Support Vector Data Description
链接: https://arxiv.org/abs/2412.10792
作者: Sertac Kilickaya,Mete Ahishali,Cansu Celebioglu,Fahad Sohrab,Levent Eren,Turker Ince,Murat Askar,Moncef Gabbouj
关键词: driven increasing interest, effective condition monitoring, condition monitoring sensors, frequent breakdowns, breakdowns and malfunctions
类目: ound (cs.SD); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
*备注: To be published in 2025 IEEE Symposium Series on Computational Intelligence
点击查看摘要
Abstract:The frequent breakdowns and malfunctions of industrial equipment have driven increasing interest in utilizing cost-effective and easy-to-deploy sensors, such as microphones, for effective condition monitoring of machinery. Microphones offer a low-cost alternative to widely used condition monitoring sensors, with their high bandwidth and capability to detect subtle anomalies to which other sensors might be less sensitive. In this study, we investigate malfunctioning industrial machines to evaluate and compare anomaly detection performance across different machine types and fault conditions. Log-Mel spectrograms of machinery sound are used as input, and the performance is evaluated using the area under the curve (AUC) score for two different methods: baseline dense autoencoder (AE) and one-class deep Support Vector Data Description (deep SVDD) with different subspace dimensions. Our results over the MIMII sound dataset demonstrate that the deep SVDD method with a subspace dimension of 2 provides superior anomaly detection performance, achieving average AUC scores of 0.84, 0.80, and 0.69 for 6 dB, 0 dB, and -6 dB signal-to-noise ratios (SNRs), respectively, compared to 0.82, 0.72, and 0.64 for the baseline model. Moreover, deep SVDD requires 7.4 times fewer trainable parameters than the baseline dense AE, emphasizing its advantage in both effectiveness and computational efficiency.
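下面给出与文中方法同类的单类 Deep SVDD 极简示意:训练编码器把正常样本拉向固定中心,异常得分即到中心的平方距离。网络规模、2 维子空间与随机替代特征仅作示意,特征提取(log-Mel)假设已完成。

```python
# Minimal one-class Deep SVDD sketch (illustrative, not the paper's code): an encoder
# maps feature vectors into a low-dimensional subspace, is trained to pull normal
# samples toward a fixed center, and anomaly scores are squared distances to that center.
import torch
import torch.nn as nn

encoder = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 2))  # 2-D subspace
X_train = torch.randn(256, 128)            # stand-in for log-Mel derived feature vectors

with torch.no_grad():
    center = encoder(X_train).mean(dim=0)  # fix the hypersphere center once

opt = torch.optim.Adam(encoder.parameters(), lr=1e-3)
for _ in range(100):
    opt.zero_grad()
    loss = ((encoder(X_train) - center) ** 2).sum(dim=1).mean()  # pull normals to center
    loss.backward()
    opt.step()

with torch.no_grad():
    X_test = torch.randn(16, 128)
    scores = ((encoder(X_test) - center) ** 2).sum(dim=1)        # larger = more anomalous
print(scores)
```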
[LG-86] Scaling Up Graph Propagation Computation on Large Graphs: A Local Chebyshev Approximation Approach
链接: https://arxiv.org/abs/2412.10789
作者: Yichun Yang,Rong-Hua Li,Meihao Liao,Longlong Lin,Guoren Wang
关键词: node similarity queries, graph node ranking, graph node similarity, graph data analysis, graph neural networks
类目: Machine Learning (cs.LG); Data Structures and Algorithms (cs.DS)
*备注: 15 pages
点击查看摘要
Abstract:Graph propagation (GP) computation plays a crucial role in graph data analysis, supporting various applications such as graph node similarity queries, graph node ranking, graph clustering, and graph neural networks. Existing methods, mainly relying on power iteration or push computation frameworks, often face challenges with slow convergence rates when applied to large-scale graphs. To address this issue, we propose a novel and powerful approach that accelerates power iteration and push methods using Chebyshev polynomials. Specifically, we first present a novel Chebyshev expansion formula for general GP functions, offering a new perspective on GP computation and achieving accelerated convergence. Building on these theoretical insights, we develop a novel Chebyshev power iteration method (\ltwocheb) and a novel Chebyshev push method (\chebpush). Our \ltwocheb method demonstrates an approximate acceleration of O(\sqrt{N}) compared to existing power iteration techniques for both personalized PageRank and heat kernel PageRank computations, which are well-studied GP problems. For \chebpush, we propose an innovative subset Chebyshev recurrence technique, enabling the design of a push-style local algorithm with provable error guarantee and reduced time complexity compared to existing push methods. We conduct extensive experiments using 5 large real-world datasets to evaluate our proposed algorithms, demonstrating their superior efficiency compared to state-of-the-art approaches.
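作为参照,下面是论文所加速的图传播计算之一——个性化 PageRank 的朴素幂迭代;论文中的 Chebyshev 展开与 push 变体此处不作复现。

```python
# Reference implementation of the plain power iteration for personalized PageRank,
# one of the graph-propagation computations the paper accelerates; the Chebyshev
# recurrence itself is not reproduced here.
import numpy as np

def personalized_pagerank(P: np.ndarray, s: np.ndarray, alpha=0.15, iters=100):
    """P: column-stochastic transition matrix; s: personalization (restart) vector."""
    pi = s.copy()
    for _ in range(iters):
        pi = alpha * s + (1 - alpha) * (P @ pi)   # fixed-point iteration
    return pi

A = np.array([[0, 1, 1, 0], [1, 0, 1, 0], [1, 1, 0, 1], [0, 0, 1, 0]], dtype=float)
P = A / A.sum(axis=0, keepdims=True)              # column-normalize adjacency
s = np.array([1.0, 0, 0, 0])                      # restart at node 0
print(personalized_pagerank(P, s))
```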
[LG-87] Continual Learning for Behavior-based Driver Identification
链接: https://arxiv.org/abs/2412.10780
作者: Mattia Fanan,Davide Dalle Pezze,Emad Efatinasab,Ruggero Carli,Mirco Rampazzo,Gian Antonio Susto
关键词: Behavior-based Driver Identification, personalized driving experiences, vehicle theft prevention, offering important applications, Behavior-based Driver
类目: Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:Behavior-based Driver Identification is an emerging technology that recognizes drivers based on their unique driving behaviors, offering important applications such as vehicle theft prevention and personalized driving experiences. However, most studies fail to account for the real-world challenges of deploying Deep Learning models within vehicles. These challenges include operating under limited computational resources, adapting to new drivers, and changes in driving behavior over time. The objective of this study is to evaluate if Continual Learning (CL) is well-suited to address these challenges, as it enables models to retain previously learned knowledge while continually adapting with minimal computational overhead and resource requirements. We tested several CL techniques across three scenarios of increasing complexity based on the well-known OCSLab dataset. This work provides an important step forward in scalable driver identification solutions, demonstrating that CL approaches, such as DER, can obtain strong performance, with only an 11% reduction in accuracy compared to the static scenario. Furthermore, to enhance the performance, we propose two new methods, SmooER and SmooDER, that leverage the temporal continuity of driver identity over time to enhance classification accuracy. Our novel method, SmooDER, achieves optimal results with only a 2% reduction compared to the 11% of the DER approach. In conclusion, this study proves the feasibility of CL approaches to address the challenges of Driver Identification in dynamic environments, making them suitable for deployment on cloud infrastructure or directly within vehicles.
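SmooER/SmooDER 所利用的时间连续性,可以用对逐窗口身份预测做滑动窗口多数投票来直观说明;实际方法将该信号整合进持续学习更新中(此处未展示),窗口大小为笔者假设。

```python
# Illustrative temporal smoothing in the spirit of SmooER/SmooDER: since driver
# identity is constant over short horizons, replace each per-window prediction with a
# sliding-window majority vote. Only the smoothing core is shown; the continual-learning
# integration is not.
import numpy as np
from collections import Counter

def smooth_predictions(pred: np.ndarray, window: int = 5) -> np.ndarray:
    half = window // 2
    smoothed = pred.copy()
    for t in range(len(pred)):
        seg = pred[max(0, t - half): t + half + 1]
        smoothed[t] = Counter(seg.tolist()).most_common(1)[0][0]  # majority vote
    return smoothed

pred = np.array([2, 2, 0, 2, 2, 1, 1, 3, 1, 1])   # noisy per-window driver IDs
print(smooth_predictions(pred))
```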
[LG-88] Explainable Fuzzy Neural Network with Multi-Fidelity Reinforcement Learning for Micro-Architecture Design Space Exploration
链接: https://arxiv.org/abs/2412.10754
作者: Hanwei Fan,Ya Wang,Sicheng Li,Tingyuan Liang,Wei Zhang
关键词: modern micro-architecture designs, advancement of processors, modern micro-architecture, increasingly complex, continuous advancement
类目: Machine Learning (cs.LG); Hardware Architecture (cs.AR)
*备注: preprint version, published on DAC24
点击查看摘要
Abstract:With the continuous advancement of processors, modern micro-architecture designs have become increasingly complex. The vast design space presents significant challenges for human designers, making design space exploration (DSE) algorithms a significant tool for \mu -arch design. In recent years, efforts have been made in the development of DSE algorithms, and promising results have been achieved. However, the existing DSE algorithms, e.g., Bayesian Optimization and ensemble learning, suffer from poor interpretability, hindering designers’ understanding of the decision-making process. To address this limitation, we propose utilizing Fuzzy Neural Networks to induce and summarize knowledge and insights from the DSE process, enhancing interpretability and controllability. Furthermore, to improve efficiency, we introduce a multi-fidelity reinforcement learning approach, which primarily conducts exploration using cheap but less precise data, thereby substantially diminishing the reliance on costly data. Experimental results show that our method achieves excellent results with a very limited sample budget and successfully surpasses the current state-of-the-art. Our DSE framework is open-sourced and available at this https URL_MFRL_ArchDSE/\ .
[LG-89] p-Mean Regret for Stochastic Bandits
链接: https://arxiv.org/abs/2412.10751
作者: Anand Krishna,Philips George John,Adarsh Barik,Vincent Y. F. Tan
关键词: multi-armed bandit problems, stochastic multi-armed bandit, social choice theory, regret, regret bound
类目: Machine Learning (cs.LG); Computer Science and Game Theory (cs.GT)
*备注:
点击查看摘要
Abstract:In this work, we extend the concept of the p -mean welfare objective from social choice theory (Moulin 2004) to study p -mean regret in stochastic multi-armed bandit problems. The p -mean regret, defined as the difference between the optimal mean among the arms and the p -mean of the expected rewards, offers a flexible framework for evaluating bandit algorithms, enabling algorithm designers to balance fairness and efficiency by adjusting the parameter p . Our framework encompasses both average cumulative regret and Nash regret as special cases. We introduce a simple, unified UCB-based algorithm (Explore-Then-UCB) that achieves novel p -mean regret bounds. Our algorithm consists of two phases: a carefully calibrated uniform exploration phase to initialize sample means, followed by the UCB1 algorithm of Auer, Cesa-Bianchi, and Fischer (2002). Under mild assumptions, we prove that our algorithm achieves a p -mean regret bound of \tilde{O}\left(\sqrt{\frac{k}{T^{\frac{1}{2|p|}}}}\right) for all p \leq -1 , where k represents the number of arms and T the time horizon. When -1 < p < 0 , we achieve a regret bound of \tilde{O}\left(\sqrt{\frac{k^{1.5}}{T^{\frac{1}{2}}}}\right) . For the range 0 < p \leq 1 , we achieve a p -mean regret scaling as \tilde{O}\left(\sqrt{\frac{k}{T}}\right) , which matches the previously established lower bound up to logarithmic factors (Auer et al. 1995). This result stems from the fact that the p -mean regret of any algorithm is at least its average cumulative regret for p \leq 1 . In the case of Nash regret (the limit as p approaches zero), our unified approach differs from prior work (Barman et al. 2023), which requires a new Nash Confidence Bound algorithm. Notably, we achieve the same regret bound up to constant factors using our more general method.
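下面给出上文 p-mean regret 定义的一个小型数值示例:用被拉动臂的期望收益的 p-均值与最优臂均值作差;p = 1 对应按轮平均的平均遗憾,p → 0 的极限给出 Nash 遗憾。

```python
# Worked example of the p-mean regret defined above: the gap between the best arm's
# mean and the p-mean of the expected rewards collected by the algorithm (toy trajectory).
import numpy as np

def p_mean(values: np.ndarray, p: float) -> float:
    if p == 0:                                    # Nash welfare limit: geometric mean
        return float(np.exp(np.mean(np.log(values))))
    return float(np.mean(values ** p) ** (1.0 / p))

def p_mean_regret(expected_rewards: np.ndarray, arm_means: np.ndarray, p: float) -> float:
    return float(arm_means.max() - p_mean(expected_rewards, p))

arm_means = np.array([0.9, 0.5, 0.2])
# expected reward of the arm pulled at each round (toy trajectory of 5 pulls)
expected_rewards = np.array([0.5, 0.9, 0.9, 0.2, 0.9])
for p in (-1.0, 0.0, 1.0):
    print(p, p_mean_regret(expected_rewards, arm_means, p))
```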
[LG-90] NeuralPLexer3: Physio-Realistic Biomolecular Complex Structure Prediction with Flow Models
链接: https://arxiv.org/abs/2412.10743
作者: Zhuoran Qiao,Feizhi Ding,Thomas Dresselhaus,Mia A. Rosenfeld,Xiaotian Han,Owen Howell,Aniketh Iyengar,Stephen Opalenski,Anders S. Christensen,Sai Krishna Sirumalla,Frederick R. Manby,Thomas F. Miller III,Matthew Welborn
关键词: determination is essential, mechanistic understanding, understanding of diseases, Structure determination, structure prediction methods
类目: Machine Learning (cs.LG); Chemical Physics (physics.chem-ph); Biomolecules (q-bio.BM)
*备注:
点击查看摘要
Abstract:Structure determination is essential to a mechanistic understanding of diseases and the development of novel therapeutics. Machine-learning-based structure prediction methods have made significant advancements by computationally predicting protein and bioassembly structures from sequences and molecular topology alone. Despite substantial progress in the field, challenges remain to deliver structure prediction models to real-world drug discovery. Here, we present NeuralPLexer3 – a physics-inspired flow-based generative model that achieves state-of-the-art prediction accuracy on key biomolecular interaction types and improves training and sampling efficiency compared to its predecessors and alternative methodologies. Examined through newly developed benchmarking strategies, NeuralPLexer3 excels in vital areas that are crucial to structure-based drug design, such as physical validity and ligand-induced conformational changes.
[LG-91] Doubly-Bounded Queue for Constrained Online Learning: Keeping Pace with Dynamics of Both Loss and Constraint AAAI2025
链接: https://arxiv.org/abs/2412.10703
作者: Juncheng Wang,Bingjie Yan,Yituo Liu
关键词: online solution benchmark, Constrained Online Learning, hard constraint violation, online convex optimization, constraint violation
类目: Machine Learning (cs.LG)
*备注: To appear in AAAI 2025
点击查看摘要
Abstract:We consider online convex optimization with time-varying constraints and conduct performance analysis using two stringent metrics: dynamic regret with respect to the online solution benchmark, and hard constraint violation that does not allow any compensated violation over time. We propose an efficient algorithm called Constrained Online Learning with Doubly-bounded Queue (COLDQ), which introduces a novel virtual queue that is both lower and upper bounded, allowing tight control of the constraint violation without the need for the Slater condition. We prove via a new Lyapunov drift analysis that COLDQ achieves O(T^{\frac{1+V_x}{2}}) dynamic regret and O(T^{V_g}) hard constraint violation, where V_x and V_g capture the dynamics of the loss and constraint functions. For the first time, the two bounds smoothly approach the best-known O(T^{\frac{1}{2}}) regret and O(1) violation, as the dynamics of the losses and constraints diminish. For strongly convex loss functions, COLDQ matches the best-known O(\log T) static regret while maintaining the O(T^{V_g}) hard constraint violation. We further introduce an expert-tracking variation of COLDQ, which achieves the same performance bounds without any prior knowledge of the system dynamics. Simulation results demonstrate that COLDQ outperforms the state-of-the-art approaches.
[LG-92] Cluster-Based Multi-Agent Task Scheduling for Space-Air-Ground Integrated Networks
链接: https://arxiv.org/abs/2412.10700
作者: Zhiying Wang,Gang Sun,Yuhui Wang,Hongfang Yu,Dusit Niyato
关键词: Integrated Network, aerial nodes assist, Unmanned Aerial Vehicles, future networks, aerial nodes
类目: Multiagent Systems (cs.MA); Machine Learning (cs.LG); Systems and Control (eess.SY)
*备注:
点击查看摘要
Abstract:The Space-Air-Ground Integrated Network (SAGIN) framework is a crucial foundation for future networks, where satellites and aerial nodes assist in computational task offloading. The low-altitude economy, leveraging the flexibility and multifunctionality of Unmanned Aerial Vehicles (UAVs) in SAGIN, holds significant potential for development in areas such as communication and sensing. However, effective coordination is needed to streamline information exchange and enable efficient system resource allocation. In this paper, we propose a Clustering-based Multi-agent Deep Deterministic Policy Gradient (CMADDPG) algorithm to address the multi-UAV cooperative task scheduling challenges in SAGIN. The CMADDPG algorithm leverages dynamic UAV clustering to partition UAVs into clusters, each managed by a Cluster Head (CH) UAV, facilitating a distributed-centralized control approach. Within each cluster, UAVs delegate offloading decisions to the CH UAV, reducing intra-cluster communication costs and decision conflicts, thereby enhancing task scheduling efficiency. Additionally, by employing a multi-agent reinforcement learning framework, the algorithm leverages the extensive coverage of satellites to achieve centralized training and distributed execution of multi-agent tasks, while maximizing overall system profit through optimized task offloading decision-making. Simulation results reveal that the CMADDPG algorithm effectively optimizes resource allocation, minimizes queue delays, maintains balanced load distribution, and surpasses existing methods by achieving at least a 25% improvement in system profit, showcasing its robustness and adaptability across diverse scenarios.
[LG-93] Stochastic k-Submodular Bandits with Full Bandit Feedback
链接: https://arxiv.org/abs/2412.10682
作者: Guanyu Nie,Vaneet Aggarwal,Christopher John Quinn
关键词: monotone functions, alpha, regret bounds, full-bandit feedback, submodular optimization problems
类目: Machine Learning (cs.LG); Data Structures and Algorithms (cs.DS); Machine Learning (stat.ML)
*备注: 26 pages, 1 figure
点击查看摘要
Abstract:In this paper, we present the first sublinear \alpha -regret bounds for online k -submodular optimization problems with full-bandit feedback, where \alpha is a corresponding offline approximation ratio. Specifically, we propose online algorithms for multiple k -submodular stochastic combinatorial multi-armed bandit problems, including (i) monotone functions and individual size constraints, (ii) monotone functions with matroid constraints, (iii) non-monotone functions with matroid constraints, (iv) non-monotone functions without constraints, and (v) monotone functions without constraints. We transform approximation algorithms for offline k -submodular maximization problems into online algorithms through the offline-to-online framework proposed by Nie et al. (2023a). A key contribution of our work is analyzing the robustness of the offline algorithms.
[LG-94] FairGP: A Scalable and Fair Graph Transformer Using Graph Partitioning AAAI2025
链接: https://arxiv.org/abs/2412.10669
作者: Renqiang Luo,Huafei Huang,Ivan Lee,Chengpei Xu,Jianzhong Qi,Feng Xia
关键词: Recent studies, Graph Transformer, highlighted significant fairness, studies have highlighted, highlighted significant
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: 11 pages, 2 figures, Accepted at AAAI 2025
点击查看摘要
Abstract:Recent studies have highlighted significant fairness issues in Graph Transformer (GT) models, particularly against subgroups defined by sensitive features. Additionally, GTs are computationally intensive and memory-demanding, limiting their application to large-scale graphs. Our experiments demonstrate that graph partitioning can enhance the fairness of GT models while reducing computational complexity. To understand this improvement, we conducted a theoretical investigation into the root causes of fairness issues in GT models. We found that the sensitive features of higher-order nodes disproportionately influence lower-order nodes, resulting in sensitive feature bias. We propose Fairness-aware scalable GT based on Graph Partitioning (FairGP), which partitions the graph to minimize the negative impact of higher-order nodes. By optimizing attention mechanisms, FairGP mitigates the bias introduced by global attention, thereby enhancing fairness. Extensive empirical evaluations on six real-world datasets validate the superior performance of FairGP in achieving fairness compared to state-of-the-art methods. The codes are available at this https URL.
[LG-95] Structured Sampling for Robust Euclidean Distance Geometry
链接: https://arxiv.org/abs/2412.10664
作者: Chandra Kundu,Abiy Tasissa,HanQin Cai
关键词: distance measurements, distance measurements corrupted, corrupted distance measurements, paper addresses, estimating the positions
类目: Machine Learning (cs.LG); Information Theory (cs.IT); Optimization and Control (math.OC); Machine Learning (stat.ML)
*备注:
点击查看摘要
Abstract:This paper addresses the problem of estimating the positions of points from distance measurements corrupted by sparse outliers. Specifically, we consider a setting with two types of nodes: anchor nodes, for which exact distances to each other are known, and target nodes, for which complete but corrupted distance measurements to the anchors are available. To tackle this problem, we propose a novel algorithm powered by Nyström method and robust principal component analysis. Our method is computationally efficient as it processes only a localized subset of the distance matrix and does not require distance measurements between target nodes. Empirical evaluations on synthetic datasets, designed to mimic sensor localization, and on molecular experiments, demonstrate that our algorithm achieves accurate recovery with a modest number of anchors, even in the presence of high levels of sparse outliers.
[LG-96] Centaur: Bridging the Impossible Trinity of Privacy Efficiency and Performance in Privacy-Preserving Transformer Inference
链接: https://arxiv.org/abs/2412.10652
作者: Jinglong Luo,Guanzhong Chen,Yehong Zhang,Shiyu Liu,Hui Wang,Yue Yu,Xun Zhou,Yuan Qi,Zenglin Xu
关键词: privacy concerns surrounding, concerns surrounding model, inference, Centaur, increasingly deployed
类目: Machine Learning (cs.LG); Cryptography and Security (cs.CR)
*备注:
点击查看摘要
Abstract:As pre-trained models, like Transformers, are increasingly deployed on cloud platforms for inference services, the privacy concerns surrounding model parameters and inference data are becoming more acute. Current Privacy-Preserving Transformer Inference (PPTI) frameworks struggle with the “impossible trinity” of privacy, efficiency, and performance. For instance, Secure Multi-Party Computation (SMPC)-based solutions offer strong privacy guarantees but come with significant inference overhead and performance trade-offs. On the other hand, PPTI frameworks that use random permutations achieve inference efficiency close to that of plaintext and maintain accurate results but require exposing some model parameters and intermediate results, thereby risking substantial privacy breaches. Addressing this “impossible trinity” with a single technique proves challenging. To overcome this challenge, we propose Centaur, a novel hybrid PPTI framework. Unlike existing methods, Centaur protects model parameters with random permutations and inference data with SMPC, leveraging the structure of Transformer models. By designing a series of efficient privacy-preserving algorithms, Centaur leverages the strengths of both techniques to achieve a better balance between privacy, efficiency, and performance in PPTI. We comprehensively evaluate the effectiveness of Centaur on various types of Transformer models and datasets. Experimental results demonstrate that the privacy protection capabilities offered by Centaur can withstand various existing model inversion attack methods. In terms of performance and efficiency, Centaur not only maintains the same performance as plaintext inference but also improves inference speed by 5.0-30.4 times.
[LG-97] Ares: Approximate Representations via Efficient Sparsification – A Stateless Approach through Polynomial Homomorphism
链接: https://arxiv.org/abs/2412.10623
作者: Dongfang Zhao
关键词: support modern applications, high-dimensional data demands, data demands efficient, modern applications, increasing prevalence
类目: Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:The increasing prevalence of high-dimensional data demands efficient and scalable compression methods to support modern applications. However, existing techniques like PCA and Autoencoders often rely on auxiliary metadata or intricate architectures, limiting their practicality for streaming or infinite datasets. In this paper, we introduce a stateless compression framework that leverages polynomial representations to achieve compact, interpretable, and scalable data reduction. By eliminating the need for auxiliary data, our method supports direct algebraic operations in the compressed domain while minimizing error growth during computations. Through extensive experiments on synthetic and real-world datasets, we show that our approach achieves high compression ratios without compromising reconstruction accuracy, all while maintaining simplicity and scalability.
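下面粗略示意多项式表示压缩的思想:把按位置索引的向量拟合为低阶 Chebyshev 系数并只保留系数;由于重构对系数是线性的,向量加法可直接在压缩域进行。阶数与基的选择为笔者假设,并非论文的实际编码。

```python
# Rough illustration of a polynomial-representation compressor: fit low-degree Chebyshev
# coefficients to a vector indexed by position and keep only the coefficients. Because
# reconstruction is linear in the coefficients, adding coefficient vectors corresponds to
# adding the reconstructed vectors. Degree and basis are assumptions, not the paper's scheme.
import numpy as np

def compress(v: np.ndarray, degree: int = 8) -> np.ndarray:
    x = np.linspace(-1, 1, len(v))
    return np.polynomial.chebyshev.chebfit(x, v, degree)   # compact coefficient vector

def decompress(coeffs: np.ndarray, n: int) -> np.ndarray:
    x = np.linspace(-1, 1, n)
    return np.polynomial.chebyshev.chebval(x, coeffs)

v = np.sin(np.linspace(0, 3, 512)) + 0.01 * np.random.randn(512)
c = compress(v)                      # 512 values -> 9 coefficients
print(len(c), np.abs(v - decompress(c, 512)).max())
```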
[LG-98] Hybrid Preference Optimization for Alignment: Provably Faster Convergence Rates by Combining Offline Preferences with Online Exploration
链接: https://arxiv.org/abs/2412.10616
作者: Avinandan Bose,Zhihan Xiong,Aadirupa Saha,Simon Shaolei Du,Maryam Fazel
关键词: Reinforcement Learning, aligning large language, large language models, Learning from Human, Human Feedback
类目: Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:Reinforcement Learning from Human Feedback (RLHF) is currently the leading approach for aligning large language models with human preferences. Typically, these models rely on extensive offline preference datasets for training. However, offline algorithms impose strict concentrability requirements, which are often difficult to satisfy. On the other hand, while online algorithms can avoid the concentrability issue, pure online exploration could be expensive due to the active preference query cost and real-time implementation overhead. In this paper, we propose a novel approach: Hybrid Preference Optimization (HPO) which combines online exploration with existing offline preferences by relaxing the stringent concentrability conditions for offline exploration, as well as significantly improving the sample efficiency for its online counterpart. We give the first provably optimal theoretical bound for Hybrid RLHF with preference feedback, providing sample complexity bounds for policy optimization with matching lower bounds. Our results yield improved sample efficiency of hybrid RLHF over pure offline and online exploration.
[LG-99] Adaptive Sampling to Reduce Epistemic Uncertainty Using Prediction Interval-Generation Neural Networks AAAI2025
链接: https://arxiv.org/abs/2412.10570
作者: Giorgio Morales,John Sheppard
关键词: Obtaining high certainty, Obtaining high, high certainty, crucial for making, making informed
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: Accepted to appear in AAAI 2025
点击查看摘要
Abstract:Obtaining high certainty in predictive models is crucial for making informed and trustworthy decisions in many scientific and engineering domains. However, extensive experimentation required for model accuracy can be both costly and time-consuming. This paper presents an adaptive sampling approach designed to reduce epistemic uncertainty in predictive models. Our primary contribution is the development of a metric that estimates potential epistemic uncertainty leveraging prediction interval-generation neural networks. This estimation relies on the distance between the predicted upper and lower bounds and the observed data at the tested positions and their neighboring points. Our second contribution is the proposal of a batch sampling strategy based on Gaussian processes (GPs). A GP is used as a surrogate model of the networks trained at each iteration of the adaptive sampling process. Using this GP, we design an acquisition function that selects a combination of sampling locations to maximize the reduction of epistemic uncertainty across the domain. We test our approach on three unidimensional synthetic problems and a multi-dimensional dataset based on an agricultural field for selecting experimental fertilizer rates. The results demonstrate that our method consistently converges faster to minimum epistemic uncertainty levels compared to Normalizing Flows Ensembles, MC-Dropout, and simple GPs.
[LG-100] Identifying Predictions That Influence the Future: Detecting Performative Concept Drift in Data Streams AAAI2025
链接: https://arxiv.org/abs/2412.10545
作者: Brandon Gower-Winter,Georg Krempl,Sergey Dragomiretskiy,Tineke Jelsma,Arno Siebes
关键词: Drift, Concept Drift, performative drift, drift detection, performative drift detection
类目: Machine Learning (cs.LG); Cryptography and Security (cs.CR); Machine Learning (stat.ML)
*备注: 20 pages, 17 figures. Extended version of paper with the same name accepted to AAAI2025
点击查看摘要
Abstract:Concept Drift has been extensively studied within the context of Stream Learning. However, it is often assumed that the deployed model’s predictions play no role in the concept drift the system experiences. Closer inspection reveals that this is not always the case. Automated trading might be prone to self-fulfilling feedback loops. Likewise, malicious entities might adapt to evade detectors in the adversarial setting resulting in a self-negating feedback loop that requires the deployed models to constantly retrain. Such settings where a model may induce concept drift are called performative. In this work, we investigate this phenomenon. Our contributions are as follows: First, we define performative drift within a stream learning setting and distinguish it from other causes of drift. We introduce a novel type of drift detection task, aimed at identifying potential performative concept drift in data streams. We propose a first such performative drift detection approach, called CheckerBoard Performative Drift Detection (CB-PDD). We apply CB-PDD to both synthetic and semi-synthetic datasets that exhibit varying degrees of self-fulfilling feedback loops. Results are positive with CB-PDD showing high efficacy, low false detection rates, resilience to intrinsic drift, comparability to other drift detection techniques, and an ability to effectively detect performative drift in semi-synthetic datasets. Secondly, we highlight the role intrinsic (traditional) drift plays in obfuscating performative drift and discuss the implications of these findings as well as the limitations of CB-PDD.
[LG-101] Higher Order Transformers: Enhancing Stock Movement Prediction On Multimodal Time-Series Data KDD2024
链接: https://arxiv.org/abs/2412.10540
作者: Soroush Omranpour,Guillaume Rabusseau,Reihaneh Rabbany
关键词: introducing Higher Order, processing multivariate time-series, Higher Order Transformers, multivariate time-series data, Higher Order
类目: Machine Learning (cs.LG); Statistical Finance (q-fin.ST)
*备注: KDD 2024 Workshop on Machine Learning in Finance
点击查看摘要
Abstract:In this paper, we tackle the challenge of predicting stock movements in financial markets by introducing Higher Order Transformers, a novel architecture designed for processing multivariate time-series data. We extend the self-attention mechanism and the transformer architecture to a higher order, effectively capturing complex market dynamics across time and variables. To manage computational complexity, we propose a low-rank approximation of the potentially large attention tensor using tensor decomposition and employ kernel attention, reducing complexity to linear with respect to the data size. Additionally, we present an encoder-decoder model that integrates technical and fundamental analysis, utilizing multimodal signals from historical prices and related tweets. Our experiments on the Stocknet dataset demonstrate the effectiveness of our method, highlighting its potential for enhancing stock movement prediction in financial markets.
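The linear-complexity component mentioned in the abstract (kernel attention) can be illustrated with a short sketch. This shows only the generic feature-map trick that replaces softmax attention with an O(n·d²) product; it does not implement the paper's higher-order attention or tensor decomposition, and all shapes are assumptions.

```python
import numpy as np

def elu_feature_map(x):
    # Positive feature map phi(x) = elu(x) + 1, as used in linear-attention work.
    return np.where(x > 0, x + 1.0, np.exp(x))

def kernel_attention(Q, K, V):
    """softmax(QK^T)V replaced by phi(Q)[phi(K)^T V] / (phi(Q)[phi(K)^T 1]),
    which is linear in the sequence length n."""
    Qf, Kf = elu_feature_map(Q), elu_feature_map(K)        # (n, d)
    KV = Kf.T @ V                                          # (d, d_v), cost O(n d d_v)
    Z = Qf @ Kf.sum(axis=0, keepdims=True).T               # (n, 1) normalizer
    return (Qf @ KV) / (Z + 1e-9)

rng = np.random.default_rng(0)
n, d = 512, 16                      # sequence length, head dimension (illustrative)
Q, K, V = (rng.standard_normal((n, d)) for _ in range(3))
out = kernel_attention(Q, K, V)
print(out.shape)                    # (512, 16)
```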
[LG-102] ExclaveFL: Providing Transparency to Federated Learning using Exclaves
链接: https://arxiv.org/abs/2412.10537
作者: Jinnan Guo,Kapil Vaswani,Andrew Paverd,Peter Pietzuch
关键词: providers jointly train, data providers jointly, malicious data provider, jointly train, providers jointly
类目: Cryptography and Security (cs.CR); Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:In federated learning (FL), data providers jointly train a model without disclosing their training data. Despite its privacy benefits, a malicious data provider can simply deviate from the correct training protocol without being detected, thus attacking the trained model. While current solutions have explored the use of trusted execution environment (TEEs) to combat such attacks, there is a mismatch with the security needs of FL: TEEs offer confidentiality guarantees, which are unnecessary for FL and make them vulnerable to side-channel attacks, and focus on coarse-grained attestation, which does not capture the execution of FL training. We describe ExclaveFL, an FL platform that achieves end-to-end transparency and integrity for detecting attacks. ExclaveFL achieves this by employing a new hardware security abstraction, exclaves, which focus on integrity-only guarantees. ExclaveFL uses exclaves to protect the execution of FL tasks, while generating signed statements containing fine-grained, hardware-based attestation reports of task execution at runtime. ExclaveFL then enables auditing using these statements to construct an attested dataflow graph and then check that the FL training jobs satisfies claims, such as the absence of attacks. Our experiments show that ExclaveFL introduces a less than 9% overhead while detecting a wide-range of attacks.

[LG-103] Towards Using Machine Learning to Generatively Simulate EV Charging in Urban Areas NEURIPS2024
链接: https://arxiv.org/abs/2412.10531
作者: Marek Miltner,Jakub Zíka,Daniel Vašata,Artem Bryksa,Magda Friedjungová,Ondřej Štogl,Ram Rajagopal,Oldřich Starý
关键词: predicting electric vehicle, electric vehicle, limited data, study addresses, addresses the challenge
类目: Machine Learning (cs.LG)
*备注: Accepted to Tackling Climate Change with Machine Learning: workshop at NeurIPS 2024
点击查看摘要
Abstract:This study addresses the challenge of predicting electric vehicle (EV) charging profiles in urban locations with limited data. Utilizing a neural network architecture, we aim to uncover latent charging profiles influenced by spatio-temporal factors. Our model focuses on peak power demand and daily load shapes, providing insights into charging behavior. Our results indicate significant impacts from the type of Basic Administrative Units on predicted load curves, which contributes to the understanding and optimization of EV charging infrastructure in urban settings and allows Distribution System Operators (DSO) to more efficiently plan EV charging infrastructure expansion.
[LG-104] Differentially Private Multi-Sampling from Distributions
链接: https://arxiv.org/abs/2412.10512
作者: Albert Cheu,Debanuj Nayak
关键词: estimate probability distributions, sample complexity, estimate probability, sample, emph
类目: Cryptography and Security (cs.CR); Data Structures and Algorithms (cs.DS); Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: 22 pages
点击查看摘要
Abstract:Many algorithms have been developed to estimate probability distributions subject to differential privacy (DP): such an algorithm takes as input independent samples from a distribution and estimates the density function in a way that is insensitive to any one sample. A recent line of work, initiated by Raskhodnikova et al. (Neurips '21), explores a weaker objective: a differentially private algorithm that approximates a single sample from the distribution. Raskhodnikova et al. studied the sample complexity of DP single-sampling, i.e., the minimum number of samples needed to perform this task. They showed that the sample complexity of DP single-sampling is less than the sample complexity of DP learning for certain distribution classes. We define two variants of multi-sampling, where the goal is to privately approximate m > 1 samples. This better models the realistic scenario where synthetic data is needed for exploratory data analysis. A baseline solution to multi-sampling is to invoke a single-sampling algorithm m times on independently drawn datasets of samples. When the data comes from a finite domain, we improve over the baseline by a factor of m in the sample complexity. When the data comes from a Gaussian, Ghazi et al. (Neurips '23) show that single-sampling can be performed under approximate differential privacy; we show it is possible to single- and multi-sample Gaussians with known covariance subject to pure DP. Our solution uses a variant of the Laplace mechanism that is of independent interest. We also give sample complexity lower bounds, one for strong multi-sampling of finite distributions and another for weak multi-sampling of bounded-covariance Gaussians.
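For readers unfamiliar with the building block mentioned above, the classic Laplace mechanism can be sketched in a few lines. This is a generic illustration (private histogram release followed by sampling), not the paper's variant or its sample-complexity analysis; domain size, epsilon, and sensitivity are assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)

def laplace_mechanism(value, sensitivity, epsilon):
    # Classic Laplace mechanism: add Lap(sensitivity / epsilon) noise.
    return value + rng.laplace(scale=sensitivity / epsilon, size=np.shape(value))

# Private histogram over a finite domain {0, ..., 4} (illustrative only).
data = rng.integers(0, 5, size=1000)
counts = np.bincount(data, minlength=5).astype(float)

epsilon = 1.0
noisy = laplace_mechanism(counts, sensitivity=1.0, epsilon=epsilon)
probs = np.clip(noisy, 0, None)
probs /= probs.sum()

# Draw m > 1 "private" samples from the released distribution.
m = 5
private_samples = rng.choice(5, size=m, p=probs)
print(private_samples)
```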
[LG-105] A Hybrid Real-Time Framework for Efficient Fussell-Vesely Importance Evaluation Using Virtual Fault Trees and Graph Neural Networks
链接: https://arxiv.org/abs/2412.10484
作者: Xingyu Xiao,Peng Chen
关键词: ensuring system reliability, reflects the potential, virtual fault tree, potential impact, crucial for ensuring
类目: Machine Learning (cs.LG); Logic in Computer Science (cs.LO)
*备注:
点击查看摘要
Abstract:The Fussell-Vesely Importance (FV) reflects the potential impact of a basic event on system failure, and is crucial for ensuring system reliability. However, traditional methods for calculating FV importance are complex and time-consuming, requiring the construction of fault trees and the calculation of minimal cut set. To address these limitations, this study proposes a hybrid real-time framework to evaluate the FV importance of basic events. Our framework combines expert knowledge with a data-driven model. First, we use Interpretive Structural Modeling (ISM) to build a virtual fault tree that captures the relationships between basic events. Unlike traditional fault trees, which include intermediate events, our virtual fault tree consists solely of basic events, reducing its complexity and space requirements. Additionally, our virtual fault tree considers the dependencies between basic events rather than assuming their independence, as is typically done in traditional fault trees. We then feed both the event relationships and relevant data into a graph neural network (GNN). This approach enables a rapid, data-driven calculation of FV importance, significantly reducing processing time and quickly identifying critical events, thus providing robust decision support for risk control. Results demonstrate that our model performs well in terms of MSE, RMSE, MAE, and R2, reducing computational energy consumption and offering real-time, risk-informed decision support for complex systems.
[LG-106] Comparative Analysis of Mel-Frequency Cepstral Coefficients and Wavelet Based Audio Signal Processing for Emotion Detection and Mental Health Assessment in Spoken Speech
链接: https://arxiv.org/abs/2412.10469
作者: Idoko Agbo,Dr Hoda El-Sayed,M.D Kamruzzan Sarker
关键词: spurred innovative approaches, Convolutional Neural Network, Mel-frequency Cepstral Coefficients, assessing emotional well-being, computational techniques applied
类目: ound (cs.SD); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
*备注:
点击查看摘要
Abstract:The intersection of technology and mental health has spurred innovative approaches to assessing emotional well-being, particularly through computational techniques applied to audio data analysis. This study explores the application of Convolutional Neural Network (CNN) and Long Short-Term Memory (LSTM) models on wavelet extracted features and Mel-frequency Cepstral Coefficients (MFCCs) for emotion detection from spoken speech. Data augmentation techniques, feature extraction, normalization, and model training were conducted to evaluate the models’ performance in classifying emotional states. Results indicate that the CNN model achieved a higher accuracy of 61% compared to the LSTM model’s accuracy of 56%. Both models demonstrated better performance in predicting specific emotions such as surprise and anger, leveraging distinct audio features like pitch and speed variations. Recommendations include further exploration of advanced data augmentation techniques, combined feature extraction methods, and the integration of linguistic analysis with speech characteristics for improved accuracy in mental health diagnostics. Collaboration for standardized dataset collection and sharing is recommended to foster advancements in affective computing and mental health care interventions.
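A minimal sketch of the two feature types compared above (MFCCs and wavelet features) is given below, assuming the `librosa` and `pywt` packages are available and using a synthetic tone in place of real speech; the resulting vectors would then feed the CNN/LSTM classifiers.

```python
import numpy as np
import librosa          # assumed available; any audio feature library would do
import pywt

# Synthesize a short mono clip; librosa.load would be used for real recordings.
sr = 22050
y = 0.1 * np.sin(2 * np.pi * 220 * np.arange(sr) / sr).astype(np.float32)

# Mel-frequency cepstral coefficients: (n_mfcc, frames) -> mean-pool over time.
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)
mfcc_feat = mfcc.mean(axis=1)

# Wavelet features: energies of the multilevel decomposition coefficients.
coeffs = pywt.wavedec(y, "db4", level=4)
wavelet_feat = np.array([np.sum(c ** 2) for c in coeffs])

features = np.concatenate([mfcc_feat, wavelet_feat])
print(features.shape)   # combined feature vector for a downstream classifier
```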
[LG-107] Elucidating microstructural influences on fatigue behavior for additively manufactured Hastelloy X using Bayesian-calibrated crystal plasticity model
链接: https://arxiv.org/abs/2412.10405
作者: Ajay Kushwaha,Eralp Demir,Amrita Basak
关键词: Crystal plasticity, calibration involves numerous, involves numerous, requiring time-consuming, vital tool
类目: Computational Engineering, Finance, and Science (cs.CE); Materials Science (cond-mat.mtrl-sci); Machine Learning (cs.LG); Numerical Analysis (math.NA)
*备注:
点击查看摘要
Abstract:Crystal plasticity (CP) modeling is a vital tool for predicting the mechanical behavior of materials, but its calibration involves numerous (8) constitutive parameters, often requiring time-consuming trial-and-error methods. This paper proposes a robust calibration approach using Bayesian optimization (BO) to identify optimal CP model parameters under fatigue loading conditions. Utilizing cyclic data from additively manufactured Hastelloy X specimens at 500 degree-F, the BO framework, integrated with a Gaussian process surrogate model, significantly reduces the number of required simulations. A novel objective function is developed to match experimental stress-strain data across different strain amplitudes. Results demonstrate that effective CP model calibration is achieved within 75 iterations, with as few as 50 initial simulations. Sensitivity analysis reveals the influence of CP parameters at various loading points on the stress-strain curve. The results show that the stress-strain response is predominantly controlled by parameters related to yield, with increased influence from backstress parameters during compressive loading. In addition, the effect of introducing twins into the synthetic microstructure on fatigue behavior is studied, and a relationship between microstructural features and the fatigue indicator parameter is established. Results show that larger diameter grains, which exhibit a higher Schmid factor and an average misorientation of approximately 42 degrees +/- 1.67 degree, are identified as probable sites for failure. The proposed optimization framework can be applied to any material system or CP model, streamlining the calibration process and improving the predictive accuracy of such models.
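The calibration loop described above (Bayesian optimization with a GP surrogate over constitutive parameters) follows a standard pattern that can be sketched as below. The objective, parameter ranges, and iteration counts are toy assumptions, not the paper's crystal-plasticity setup.

```python
import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

rng = np.random.default_rng(0)

# Hypothetical objective: mismatch between simulated and experimental
# stress-strain curves as a function of two normalized CP parameters.
def objective(theta):
    return np.sum((theta - np.array([0.3, 0.7])) ** 2) + 0.01 * rng.standard_normal()

bounds = np.array([[0.0, 1.0], [0.0, 1.0]])
X = rng.uniform(bounds[:, 0], bounds[:, 1], size=(10, 2))   # initial simulations
y = np.array([objective(x) for x in X])

gp = GaussianProcessRegressor(kernel=Matern(nu=2.5), normalize_y=True)

def expected_improvement(cand, gp, y_best):
    mu, sd = gp.predict(cand, return_std=True)
    sd = np.maximum(sd, 1e-9)
    z = (y_best - mu) / sd
    return (y_best - mu) * norm.cdf(z) + sd * norm.pdf(z)

for _ in range(25):                                         # BO iterations
    gp.fit(X, y)
    cand = rng.uniform(bounds[:, 0], bounds[:, 1], size=(512, 2))
    x_next = cand[np.argmax(expected_improvement(cand, gp, y.min()))]
    X = np.vstack([X, x_next])
    y = np.append(y, objective(x_next))

print("best parameters:", X[np.argmin(y)], "objective:", y.min())
```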
[LG-108] Scalable Early Childhood Reading Performance Prediction
链接: https://arxiv.org/abs/2412.10401
作者: Zhongkai Shangguan,Zanming Huang,Eshed Ohn-Bar,Ola Ozernov-Palchik,Derek Kosty,Michael Stoolmiller,Hank Fien
关键词: proactively identify at-risk, identify at-risk students, tailored instructional interventions, Core Reading Instruction, Reading Instruction ECRI
类目: Machine Learning (cs.LG)
*备注: The Thirty-eight Conference on Neural Information Processing Systems Datasets and Benchmarks Track
点击查看摘要
Abstract:Models for student reading performance can empower educators and institutions to proactively identify at-risk students, thereby enabling early and tailored instructional interventions. However, there are no suitable publicly available educational datasets for modeling and predicting future reading performance. In this work, we introduce the Enhanced Core Reading Instruction (ECRI) dataset, a novel large-scale longitudinal tabular dataset collected across 44 schools with 6,916 students and 172 teachers. We leverage the dataset to empirically evaluate the ability of state-of-the-art machine learning models to recognize early childhood educational patterns in multivariate and partial measurements. Specifically, we demonstrate a simple self-supervised strategy in which a Multi-Layer Perceptron (MLP) network is pre-trained over masked inputs to outperform several strong baselines while generalizing over diverse educational settings. To facilitate future developments in precise modeling and responsible use of models for individualized and early intervention strategies, our data and code are available at this https URL.
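The self-supervised strategy mentioned above (an MLP pre-trained over masked tabular inputs) can be illustrated with a short PyTorch sketch on synthetic data; the layer sizes, masking rate, and regression head are illustrative assumptions rather than the paper's configuration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
n, d = 1024, 20                           # students x tabular measurements (synthetic)
X = torch.randn(n, d)

mlp = nn.Sequential(nn.Linear(d, 64), nn.ReLU(),
                    nn.Linear(64, 64), nn.ReLU(),
                    nn.Linear(64, d))
opt = torch.optim.Adam(mlp.parameters(), lr=1e-3)

# Self-supervised pretraining: randomly mask entries and reconstruct them.
for step in range(200):
    mask = (torch.rand_like(X) < 0.3).float()     # ~30% of entries hidden
    loss = ((mlp(X * (1 - mask)) - X) * mask).pow(2).mean()
    opt.zero_grad(); loss.backward(); opt.step()

# Fine-tuning: reuse the pretrained trunk, swap the head for score regression.
head = nn.Sequential(*list(mlp.children())[:-1], nn.Linear(64, 1))
print(head(X).shape)   # (1024, 1) predicted reading performance
```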
[LG-109] Extrapolating Jet Radiation with Autoregressive Transformers
链接: https://arxiv.org/abs/2412.12074
作者: Anja Butter,François Charton,Javier Mariño Villadamigo,Ayodele Ore,Tilman Plehn,Jonas Spinner
关键词: LHC event generation, fast LHC event, Generative networks, fast LHC, LHC event
类目: High Energy Physics - Phenomenology (hep-ph); Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:Generative networks are an exciting tool for fast LHC event generation. Usually, they are used to generate configurations with a fixed number of particles. Autoregressive transformers allow us to generate events with variable numbers of particles, very much in line with the physics of QCD jet radiation. We show how they can learn a factorized likelihood for jet radiation and extrapolate in terms of the number of generated jets. For this extrapolation, bootstrapping training data and training with modifications of the likelihood loss can be used.
[LG-110] Bilevel Learning with Inexact Stochastic Gradients
链接: https://arxiv.org/abs/2412.12049
作者: Mohammad Sadegh Salehi,Subhadip Mukherjee,Lindon Roberts,Matthias J. Ehrhardt
关键词: optimizing forward operators, learning data-adaptive regularizers, including hyperparameter optimization, including hyperparameter, data-adaptive regularizers
类目: Optimization and Control (math.OC); Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:Bilevel learning has gained prominence in machine learning, inverse problems, and imaging applications, including hyperparameter optimization, learning data-adaptive regularizers, and optimizing forward operators. The large-scale nature of these problems has led to the development of inexact and computationally efficient methods. Existing adaptive methods predominantly rely on deterministic formulations, while stochastic approaches often adopt a doubly-stochastic framework with impractical variance assumptions, enforces a fixed number of lower-level iterations, and requires extensive tuning. In this work, we focus on bilevel learning with strongly convex lower-level problems and a nonconvex sum-of-functions in the upper-level. Stochasticity arises from data sampling in the upper-level which leads to inexact stochastic hypergradients. We establish their connection to state-of-the-art stochastic optimization theory for nonconvex objectives. Furthermore, we prove the convergence of inexact stochastic bilevel optimization under mild assumptions. Our empirical results highlight significant speed-ups and improved generalization in imaging tasks such as image denoising and deblurring in comparison with adaptive deterministic bilevel methods.
[LG-111] Generalization Analysis for Deep Contrastive Representation Learning AAAI2025
链接: https://arxiv.org/abs/2412.12014
作者: Nong Minh Hieu,Antoine Ledent,Yunwen Lei,Cheng Yeaw Ku
关键词: Deep Contrastive Representation, Representation Learning framework, employs deep neural, Contrastive Representation Learning, deep neural networks
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注: Accepted at AAAI 2025
点击查看摘要
Abstract:In this paper, we present generalization bounds for the unsupervised risk in the Deep Contrastive Representation Learning framework, which employs deep neural networks as representation functions. We approach this problem from two angles. On the one hand, we derive a parameter-counting bound that scales with the overall size of the neural networks. On the other hand, we provide a norm-based bound that scales with the norms of neural networks’ weight matrices. Ignoring logarithmic factors, the bounds are independent of k , the size of the tuples provided for contrastive learning. To the best of our knowledge, this property is only shared by one other work, which employed a different proof strategy and suffers from very strong exponential dependence on the depth of the network which is due to a use of the peeling technique. Our results circumvent this by leveraging powerful results on covering numbers with respect to uniform norms over samples. In addition, we utilize loss augmentation techniques to further reduce the dependency on matrix norms and the implicit dependence on network depth. In fact, our techniques allow us to produce many bounds for the contrastive learning setting with similar architectural dependencies as in the study of the sample complexity of ordinary loss functions, thereby bridging the gap between the learning theories of contrastive learning and DNNs.
[LG-112] Echo State network for coarsening dynamics of charge density waves
链接: https://arxiv.org/abs/2412.11982
作者: Clement Dinh,Yunhao Fan,Gia-Wei Chern
关键词: echo state network, recurrent neural network, sparsely connected hidden, recurrent neural, ESN
类目: atistical Mechanics (cond-mat.stat-mech); Strongly Correlated Electrons (cond-mat.str-el); Machine Learning (cs.LG)
*备注: 13 pages, 8 figures
点击查看摘要
Abstract:An echo state network (ESN) is a type of reservoir computer that uses a recurrent neural network with a sparsely connected hidden layer. Compared with other recurrent neural networks, one great advantage of ESN is the simplicity of its training process. Yet, despite the seemingly restricted learnable parameters, ESN has been shown to successfully capture the spatial-temporal dynamics of complex patterns. Here we build an ESN to model the coarsening dynamics of charge-density waves (CDW) in a semi-classical Holstein model, which exhibits a checkerboard electron density modulation at half-filling stabilized by a commensurate lattice distortion. The inputs to the ESN are local CDW order-parameters in a finite neighborhood centered around a given site, while the output is the predicted CDW order of the center site at the next time step. Special care is taken in the design of couplings between hidden layer and input nodes to ensure lattice symmetries are properly incorporated into the ESN model. Since the model predictions depend only on CDW configurations of a finite domain, the ESN is scalable and transferrable in the sense that a model trained on dataset from a small system can be directly applied to dynamical simulations on larger lattices. Our work opens a new avenue for efficient dynamical modeling of pattern formations in functional electron materials.
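The ESN mechanics described above (fixed random reservoir, trained readout only) can be written down compactly; the sketch below uses a generic sine-prediction task and illustrative sizes, not the CDW order-parameter inputs of the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_res = 3, 200                      # input channels -> reservoir size (assumed)

# Fixed random input and reservoir weights; only the readout is trained.
W_in = rng.uniform(-0.5, 0.5, (n_res, n_in))
W = rng.uniform(-0.5, 0.5, (n_res, n_res))
W *= 0.9 / np.max(np.abs(np.linalg.eigvals(W)))   # rescale spectral radius to 0.9

def run_reservoir(U):
    x = np.zeros(n_res)
    states = []
    for u in U:                                    # ESN state update
        x = np.tanh(W_in @ u + W @ x)
        states.append(x.copy())
    return np.array(states)

# Toy sequence: predict the next value of the first input channel.
T = 500
U = np.column_stack([np.sin(0.1 * np.arange(T + 1) + p) for p in (0, 1, 2)])
S = run_reservoir(U[:-1])
target = U[1:, 0]

# Ridge-regression readout (the only trained parameters).
ridge = 1e-6
W_out = np.linalg.solve(S.T @ S + ridge * np.eye(n_res), S.T @ target)
print("train MSE:", np.mean((S @ W_out - target) ** 2))
```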
[LG-113] Neural general circulation models optimized to predict satellite-based precipitation observations
链接: https://arxiv.org/abs/2412.11973
作者: Janni Yuval,Ian Langmore,Dmitrii Kochkov,Stephan Hoyer
关键词: accurately simulate precipitation, struggle to accurately, accurately simulate, Climate models struggle, diurnal cycle
类目: Atmospheric and Oceanic Physics (physics.ao-ph); Machine Learning (cs.LG)
*备注: 20 pages, 6 figures in Main. 29 pages, 30 figures in SI
点击查看摘要
Abstract:Climate models struggle to accurately simulate precipitation, particularly extremes and the diurnal cycle. Here, we present a hybrid model that is trained directly on satellite-based precipitation observations. Our model runs at 2.8 ^\circ resolution and is built on the differentiable NeuralGCM framework. The model demonstrates significant improvements over existing general circulation models, the ERA5 reanalysis, and a global cloud-resolving model in simulating precipitation. Our approach yields reduced biases, a more realistic precipitation distribution, improved representation of extremes, and a more accurate diurnal cycle. Furthermore, it outperforms the mid-range precipitation forecast of the ECMWF ensemble. This advance paves the way for more reliable simulations of current climate and demonstrates how training on observations can be used to directly improve GCMs.
[LG-114] BetaExplainer: A Probabilistic Method to Explain Graph Neural Networks
链接: https://arxiv.org/abs/2412.11964
作者: Whitney Sloneker,Shalin Patel,Michael Wang,Lorin Crawford,Ritambhara Singh
关键词: extracting meaningful subnetworks, meaningful subnetworks driving, Graph neural networks, driving predictive performance, subnetworks driving predictive
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:Graph neural networks (GNNs) are powerful tools for conducting inference on graph data but are often seen as “black boxes” due to difficulty in extracting meaningful subnetworks driving predictive performance. Many interpretable GNN methods exist, but they cannot quantify uncertainty in edge weights and suffer in predictive accuracy when applied to challenging graph structures. In this work, we proposed BetaExplainer which addresses these issues by using a sparsity-inducing prior to mask unimportant edges during model training. To evaluate our approach, we examine various simulated data sets with diverse real-world characteristics. Not only does this implementation provide a notion of edge importance uncertainty, it also improves upon evaluation metrics for challenging datasets compared to state-of-the art explainer methods.
[LG-115] Bayesian Surrogate Training on Multiple Data Sources: A Hybrid Modeling Strategy
链接: https://arxiv.org/abs/2412.11875
作者: Philipp Reiser,Paul-Christian Bürkner,Anneli Guthke
关键词: computationally efficient approximations, probabilistic forward predictions, solving inverse problems, computationally infeasible, sensitivity analysis
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:Surrogate models are often used as computationally efficient approximations to complex simulation models, enabling tasks such as solving inverse problems, sensitivity analysis, and probabilistic forward predictions, which would otherwise be computationally infeasible. During training, surrogate parameters are fitted such that the surrogate reproduces the simulation model’s outputs as closely as possible. However, the simulation model itself is merely a simplification of the real-world system, often missing relevant processes or suffering from misspecifications e.g., in inputs or boundary conditions. Hints about these might be captured in real-world measurement data, and yet, we typically ignore those hints during surrogate building. In this paper, we propose two novel probabilistic approaches to integrate simulation data and real-world measurement data during surrogate training. The first method trains separate surrogate models for each data source and combines their predictive distributions, while the second incorporates both data sources by training a single surrogate. We show the conceptual differences and benefits of the two approaches through both synthetic and real-world case studies. The results demonstrate the potential of these methods to improve predictive accuracy, predictive coverage, and to diagnose problems in the underlying simulation model. These insights can improve system understanding and future model development.
[LG-116] Causal Invariance Learning via Efficient Optimization of a Nonconvex Objective
链接: https://arxiv.org/abs/2412.11850
作者: Zhenyu Wang,Yifan Hu,Peter Bühlmann,Zijian Guo
关键词: causal outcome model, causal outcome, offer valuable opportunities, uncover causal relationships, outcome model
类目: Methodology (stat.ME); Machine Learning (cs.LG); Statistics Theory (math.ST)
*备注:
点击查看摘要
Abstract:Data from multiple environments offer valuable opportunities to uncover causal relationships among variables. Leveraging the assumption that the causal outcome model remains invariant across heterogeneous environments, state-of-the-art methods attempt to identify causal outcome models by learning invariant prediction models and rely on exhaustive searches over all (exponentially many) covariate subsets. These approaches present two major challenges: 1) determining the conditions under which the invariant prediction model aligns with the causal outcome model, and 2) devising computationally efficient causal discovery algorithms that scale polynomially, instead of exponentially, with the number of covariates. To address both challenges, we focus on the additive intervention regime and propose nearly necessary and sufficient conditions for ensuring that the invariant prediction model matches the causal outcome model. Exploiting the essentially necessary identifiability conditions, we introduce Negative Weight Distributionally Robust Optimization NegDRO a nonconvex continuous minimax optimization whose global optimizer recovers the causal outcome model. Unlike standard group DRO problems that maximize over the simplex, NegDRO allows negative weights on environment losses, which break the convexity. Despite its nonconvexity, we demonstrate that a standard gradient method converges to the causal outcome model, and we establish the convergence rate with respect to the sample size and the number of iterations. Our algorithm avoids exhaustive search, making it scalable especially when the number of covariates is large. The numerical results further validate the efficiency of the proposed method.
[LG-117] Evaluating the Efficacy of Vectocardiographic and ECG Parameters for Efficient Tertiary Cardiology Care Allocation Using Decision Tree Analysis
链接: https://arxiv.org/abs/2412.11839
作者: Lucas José da Costa,Vinicius Ruiz Uemoto,Mariana F. N. de Marchi,Renato de Aguiar Hortegal,Renata Valeri de Freitas
关键词: Standard ECG features, real word data, Standard ECG, risk factor anamnesis, Risk Factors
类目: ignal Processing (eess.SP); Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:We use real-world data to evaluate the performance of the electrocardiographic markers of GEH as features in a machine learning model, alongside standard ECG features and risk factors, for predicting outcomes of patients referred to a tertiary cardiology hospital. Patients forwarded for specific evaluation at a specialized cardiology hospital underwent an ECG and a risk-factor anamnesis. Follow-up attendances at 6, 12, and 15 months checked for cardiovascular-related events (mortality or new nonfatal cardiovascular events: stroke, MI, PCI, CS), as identified during 1-year phone follow-ups. The first-attendance ECG was measured by a specialist and processed to obtain the global electrical heterogeneity (GEH) using the Kors matrix. The ECG measurements, GEH parameters, and risk factors were combined to train multiple instances of XGBoost decision-tree models. Each instance was optimized for the AUCPR, and the instance with the highest AUC was chosen as representative of the model. The importance of each parameter in the winning tree model was compared to better understand the improvement from using GEH parameters. The GEH parameters turned out to have statistical significance for this population, especially the QRST angle and the SVG. The combined model with the three parameter classes had the best performance. The findings suggest that using VCG features can facilitate more accurate identification of patients who require tertiary care, thereby optimizing resource allocation and improving patient outcomes. Moreover, the decision tree model’s transparency and ability to pinpoint critical features make it a valuable tool for clinical decision-making and align well with existing clinical practices.
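The modeling step described above (XGBoost tuned for AUCPR, with feature importances inspected afterwards) can be sketched as follows on synthetic stand-in features; the feature semantics and label rule are assumptions, not the study's clinical data.

```python
import numpy as np
import xgboost as xgb
from sklearn.model_selection import train_test_split
from sklearn.metrics import average_precision_score

rng = np.random.default_rng(0)

# Synthetic stand-ins for standard ECG measurements, GEH parameters
# (e.g. QRST angle, SVG magnitude), and risk factors; labels = 1-year events.
n = 2000
X = rng.standard_normal((n, 12))
y = (0.8 * X[:, 0] + 1.2 * X[:, 5] + rng.standard_normal(n) > 1.5).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

model = xgb.XGBClassifier(
    n_estimators=300, max_depth=3, learning_rate=0.05,
    eval_metric="aucpr",          # same metric used for model selection
)
model.fit(X_tr, y_tr)

prob = model.predict_proba(X_te)[:, 1]
print("AUCPR:", average_precision_score(y_te, prob))
print("feature importances:", np.round(model.feature_importances_, 3))
```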
[LG-118] The Eclipsing Binaries via Artificial Intelligence. II. Need for Speed in PHOEBE Forward Models
链接: https://arxiv.org/abs/2412.11837
作者: Marcin Wrona,Andrej Prša
关键词: advanced artificial intelligence, modern astronomy, techniques to assist, labor-intensive tasks, quantity of data
类目: olar and Stellar Astrophysics (astro-ph.SR); Earth and Planetary Astrophysics (astro-ph.EP); Astrophysics of Galaxies (astro-ph.GA); Machine Learning (cs.LG)
*备注: Submitted to AAS Journals. 26 pages, 21 figures, 3 tables
点击查看摘要
Abstract:In modern astronomy, the quantity of data collected has vastly exceeded the capacity for manual analysis, necessitating the use of advanced artificial intelligence (AI) techniques to assist scientists with the most labor-intensive tasks. AI can optimize simulation codes where computational bottlenecks arise from the time required to generate forward models. One such example is PHOEBE, a modeling code for eclipsing binaries (EBs), where simulating individual systems is feasible, but analyzing observables for extensive parameter combinations is highly time-consuming. To address this, we present a fully connected feedforward artificial neural network (ANN) trained on a dataset of over one million synthetic light curves generated with PHOEBE. Optimization of the ANN architecture yielded a model with six hidden layers, each containing 512 nodes, which provides an optimized balance between accuracy and computational complexity. Extensive testing enabled us to establish ANN’s applicability limits and to quantify the systematic and statistical errors associated with using such networks for EB analysis. Our findings demonstrate the critical role of dilution effects in parameter estimation for EBs, and we outline methods to incorporate these effects in AI-based models. This proposed ANN framework enables a speedup of over four orders of magnitude compared to traditional methods, with systematic errors not exceeding 1%, and often as low as 0.01%, across the entire parameter space.
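The architecture stated above (six hidden layers of 512 nodes) is easy to reproduce as a skeleton; the input and output dimensions below are assumptions (light-curve samples in, EB parameters out), not the paper's exact configuration, and the network is of course untrained here.

```python
import torch
import torch.nn as nn

# Six hidden layers of 512 nodes each, as described in the abstract.
n_in, n_out, width, depth = 200, 5, 512, 6     # I/O sizes are illustrative

layers, d = [], n_in
for _ in range(depth):
    layers += [nn.Linear(d, width), nn.ReLU()]
    d = width
layers.append(nn.Linear(width, n_out))
emulator = nn.Sequential(*layers)

light_curves = torch.randn(32, n_in)          # a batch of synthetic light curves
params = emulator(light_curves)
print(params.shape)                           # (32, 5); one forward pass replaces a PHOEBE run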
[LG-119] Conditional Diffusion Models Based Conditional Independence Testing AAAI2025
链接: https://arxiv.org/abs/2412.11744
作者: Yanfeng Yang,Shuai Li,Yingjie Zhang,Zhuoran Sun,Hai Shu,Ziqi Chen,Renmin Zhang
关键词: machine learning, fundamental task, task in modern, Conditional, Conditional independence
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注: 17 pages, 7 figures, aaai 2025
点击查看摘要
Abstract:Conditional independence (CI) testing is a fundamental task in modern statistics and machine learning. The conditional randomization test (CRT) was recently introduced to test whether two random variables, X and Y , are conditionally independent given a potentially high-dimensional set of random variables, Z . The CRT operates exceptionally well under the assumption that the conditional distribution X|Z is known. However, since this distribution is typically unknown in practice, accurately approximating it becomes crucial. In this paper, we propose using conditional diffusion models (CDMs) to learn the distribution of X|Z . Theoretically and empirically, it is shown that CDMs closely approximate the true conditional distribution. Furthermore, CDMs offer a more accurate approximation of X|Z compared to GANs, potentially leading to a CRT that performs better than those based on GANs. To accommodate complex dependency structures, we utilize a computationally efficient classifier-based conditional mutual information (CMI) estimator as our test statistic. The proposed testing procedure performs effectively without requiring assumptions about specific distribution forms or feature dependencies, and is capable of handling mixed-type conditioning sets that include both continuous and discrete variables. Theoretical analysis shows that our proposed test achieves a valid control of the type I error. A series of experiments on synthetic data demonstrates that our new test effectively controls both type-I and type-II errors, even in high dimensional scenarios.
[LG-120] Generalized Bayesian deep reinforcement learning
链接: https://arxiv.org/abs/2412.11743
作者: Shreya Sinha Roy,Richard G. Everitt,Christian P. Robert,Ritabrata Dutta
关键词: Bayesian reinforcement learning, reinforcement learning, make optimal decisions, Bayesian reinforcement, Bayesian posterior distribution
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Methodology (stat.ME)
*备注:
点击查看摘要
Abstract:Bayesian reinforcement learning (BRL) is a method that merges principles from Bayesian statistics and reinforcement learning to make optimal decisions in uncertain environments. Similar to other model-based RL approaches, it involves two key components: (1) Inferring the posterior distribution of the data generating process (DGP) modeling the true environment and (2) policy learning using the learned posterior. We propose to model the dynamics of the unknown environment through deep generative models assuming Markov dependence. In absence of likelihood functions for these models we train them by learning a generalized predictive-sequential (or prequential) scoring rule (SR) posterior. We use sequential Monte Carlo (SMC) samplers to draw samples from this generalized Bayesian posterior distribution. In conjunction, to achieve scalability in the high dimensional parameter space of the neural networks, we use the gradient based Markov chain Monte Carlo (MCMC) kernels within SMC. To justify the use of the prequential scoring rule posterior we prove a Bernstein-von Misses type theorem. For policy learning, we propose expected Thompson sampling (ETS) to learn the optimal policy by maximizing the expected value function with respect to the posterior distribution. This improves upon traditional Thompson sampling (TS) and its extensions which utilize only one sample drawn from the posterior distribution. This improvement is studied both theoretically and using simulation studies assuming discrete action and state-space. Finally we successfully extend our setup for a challenging problem with continuous action space without theoretical guarantees.
[LG-121] The dark side of the forces: assessing non-conservative force models for atomistic machine learning
链接: https://arxiv.org/abs/2412.11569
作者: Filippo Bigi,Marcel Langer,Michele Ceriotti
关键词: group of atoms, stable configurations, materials discovery, revolutionized the fields, chemistry and materials
类目: Chemical Physics (physics.chem-ph); Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:The use of machine learning to estimate the energy of a group of atoms, and the forces that drive them to more stable configurations, have revolutionized the fields of computational chemistry and materials discovery. In this domain, rigorous enforcement of symmetry and conservation laws has traditionally been considered essential. For this reason, interatomic forces are usually computed as the derivatives of the potential energy, ensuring energy conservation. Several recent works have questioned this physically-constrained approach, suggesting that using the forces as explicit learning targets yields a better trade-off between accuracy and computational efficiency - and that energy conservation can be learned during training. The present work investigates the applicability of such non-conservative models in microscopic simulations. We identify and demonstrate several fundamental issues, from ill-defined convergence of geometry optimization to instability in various types of molecular dynamics. Contrary to the case of rotational symmetry, lack of energy conservation is hard to learn, control, and correct. The best approach to exploit the acceleration afforded by direct force evaluation might be to use it in tandem with a conservative model, reducing - rather than eliminating - the additional cost of backpropagation, but avoiding most of the pathological behavior associated with non-conservative forces.
[LG-122] Learning Massive-scale Partial Correlation Networks in Clinical Multi-omics Studies with HP-ACCORD
链接: https://arxiv.org/abs/2412.11554
作者: Sungdong Lee,Joshua Bang,Youngrae Kim,Hyungwon Choi,Sang-Yun Oh,Joong-Ho Won
关键词: statistical estimation performance, Graphical model estimation, Graphical model, pseudolikelihood-based graphical model, graphical model framework
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Statistics Theory (math.ST)
*备注:
点击查看摘要
Abstract:Graphical model estimation from modern multi-omics data requires a balance between statistical estimation performance and computational scalability. We introduce a novel pseudolikelihood-based graphical model framework that reparameterizes the target precision matrix while preserving sparsity pattern and estimates it by minimizing an \ell_1 -penalized empirical risk based on a new loss function. The proposed estimator maintains estimation and selection consistency in various metrics under high-dimensional assumptions. The associated optimization problem allows for a provably fast computation algorithm using a novel operator-splitting approach and communication-avoiding distributed matrix multiplication. A high-performance computing implementation of our framework was tested in simulated data with up to one million variables demonstrating complex dependency structures akin to biological networks. Leveraging this scalability, we estimated partial correlation network from a dual-omic liver cancer data set. The co-expression network estimated from the ultrahigh-dimensional data showed superior specificity in prioritizing key transcription factors and co-activators by excluding the impact of epigenomic regulation, demonstrating the value of computational scalability in multi-omic data analysis.
[LG-123] datadriftR: An R Package for Concept Drift Detection in Predictive Models
链接: https://arxiv.org/abs/2412.11308
作者: Ugur Dar,Mustafa Cavus
关键词: face performance degradation, performance degradation due, Profile Drift Detection, drift detection, evolving data distributions
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注: 37 pages, 6 figures
点击查看摘要
Abstract:Predictive models often face performance degradation due to evolving data distributions, a phenomenon known as data drift. Among its forms, concept drift, where the relationship between explanatory variables and the response variable changes, is particularly challenging to detect and adapt to. Traditional drift detection methods often rely on metrics such as accuracy or variable distributions, which may fail to capture subtle but significant conceptual changes. This paper introduces drifter, an R package designed to detect concept drift, and proposes a novel method called Profile Drift Detection (PDD) that enables both drift detection and an enhanced understanding of the cause behind the drift by leveraging an explainable AI tool - Partial Dependence Profiles (PDPs). The PDD method, central to the package, quantifies changes in PDPs through novel metrics, ensuring sensitivity to shifts in the data stream without excessive computational costs. This approach aligns with MLOps practices, emphasizing model monitoring and adaptive retraining in dynamic environments. The experiments across synthetic and real-world datasets demonstrate that PDD outperforms existing methods by maintaining high accuracy while effectively balancing sensitivity and stability. The results highlight its capability to adaptively retrain models in dynamic environments, making it a robust tool for real-time applications. The paper concludes by discussing the advantages, limitations, and future extensions of the package for broader use cases.
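The core idea described above (comparing Partial Dependence Profiles across data windows to detect and explain concept drift) can be illustrated in Python; this is only a conceptual sketch of PDP comparison on synthetic windows, not the R package's PDD method or its metrics.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)

def make_window(n, drifted):
    # Two data windows with the same X-distribution but a flipped concept:
    # the effect of feature 0 on y changes sign after the drift.
    X = rng.standard_normal((n, 3))
    slope = -2.0 if drifted else 2.0
    y = slope * X[:, 0] + X[:, 1] + 0.1 * rng.standard_normal(n)
    return X, y

def pdp_curve(model, X, feature, grid):
    # Partial dependence of the model on one feature, evaluated on a fixed grid.
    curve = []
    for v in grid:
        Xv = X.copy()
        Xv[:, feature] = v
        curve.append(model.predict(Xv).mean())
    return np.array(curve)

grid = np.linspace(-2, 2, 21)
curves = []
for drifted in (False, True):
    X, y = make_window(1000, drifted)
    model = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)
    curves.append(pdp_curve(model, X, feature=0, grid=grid))

# A simple profile-distance score: large values flag concept drift, and the two
# curves themselves show *how* the feature-response relationship changed.
print("profile distance:", np.mean(np.abs(curves[0] - curves[1])))
```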
[LG-124] Bayesian inference of mean velocity fields and turbulence models from flow MRI
链接: https://arxiv.org/abs/2412.11266
作者: A. Kontogiannis,P. Nair,M. Loecher,D. B. Ennis,A. Marsden,M. P. Juniper
关键词: Bayesian inverse Reynolds-averaged, inverse Reynolds-averaged Navier-Stokes, unknown RANS parameters, Reynolds-averaged Navier-Stokes, unknown RANS
类目: Fluid Dynamics (physics.flu-dyn); Machine Learning (cs.LG); Optimization and Control (math.OC)
*备注:
点击查看摘要
Abstract:We solve a Bayesian inverse Reynolds-averaged Navier-Stokes (RANS) problem that assimilates mean flow data by jointly reconstructing the mean flow field and learning its unknown RANS parameters. We devise an algorithm that learns the most likely parameters of an algebraic effective viscosity model, and estimates their uncertainties, from mean flow data of a turbulent flow. We conduct a flow MRI experiment to obtain mean flow data of a confined turbulent jet in an idealized medical device known as the FDA (Food and Drug Administration) nozzle. The algorithm successfully reconstructs the mean flow field and learns the most likely turbulence model parameters without overfitting. The methodology accepts any turbulence model, be it algebraic (explicit) or multi-equation (implicit), as long as the model is differentiable, and naturally extends to unsteady turbulent flows.
[LG-125] Prediction-Enhanced Monte Carlo: A Machine Learning View on Control Variate
链接: https://arxiv.org/abs/2412.11257
作者: Fengpei Li,Haoxian Chen,Jiahe Lin,Arkin Gupta,Xiaowei Tan,Gang Xu,Yuriy Nevmyvaka,Agostino Capponi,Henry Lam
关键词: Monte Carlo simulation, hinder straightforward parallelization, Monte Carlo, Prediction-Enhanced Monte Carlo, engineering and finance
类目: Machine Learning (stat.ML); Computational Engineering, Finance, and Science (cs.CE); Machine Learning (cs.LG); Pricing of Securities (q-fin.PR)
*备注:
点击查看摘要
Abstract:Despite being an essential tool across engineering and finance, Monte Carlo simulation can be computationally intensive, especially in large-scale, path-dependent problems that hinder straightforward parallelization. A natural alternative is to replace simulation with machine learning or surrogate prediction, though this introduces challenges in understanding the resulting estimates. We introduce a Prediction-Enhanced Monte Carlo (PEMC) framework where we leverage machine learning prediction as control variates, thus maintaining unbiased evaluations instead of the direct use of ML predictors. Traditional control variate methods require knowledge of means and focus on per-sample variance reduction. In contrast, PEMC aims at overall cost-aware variance reduction, eliminating the need for mean knowledge. PEMC leverages pre-trained neural architectures to construct effective control variates and replaces computationally expensive sample-path generation with efficient neural network evaluations. This allows PEMC to address scenarios where no good control variates are known. We showcase the efficacy of PEMC through two production-grade exotic option-pricing problems: swaption pricing in the HJM model and the variance swap pricing in a stochastic local volatility model.
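The mechanism described above — an ML prediction used as a control variate, with its mean estimated from many cheap surrogate evaluations — can be sketched as follows. The "expensive" payoff and the stand-in surrogate are toy assumptions; in PEMC the surrogate would be a pre-trained neural network.

```python
import numpy as np

rng = np.random.default_rng(0)

# "Expensive" target: a price-like functional of a simulated path (toy example).
def expensive_payoff(z):
    path = np.cumsum(z)                 # pretend this is a costly path simulation
    return np.maximum(path[-1], 0.0)

# Cheap surrogate g(z): a hypothetical pre-trained predictor, imitated here by a
# crude linear rule that merely correlates with the payoff.
def surrogate(z):
    return 0.4 * z.sum() + 0.5

d, n_paired, n_cheap = 50, 2_000, 200_000
Z_paired = rng.standard_normal((n_paired, d))
Y = np.array([expensive_payoff(z) for z in Z_paired])        # expensive evaluations
G = np.array([surrogate(z) for z in Z_paired])                # surrogate on same inputs

# Surrogate mean estimated from many cheap evaluations only.
Z_cheap = rng.standard_normal((n_cheap, d))
g_mean = np.mean([surrogate(z) for z in Z_cheap])

# Control-variate estimator: variance is reduced when the surrogate correlates
# with the payoff; the coefficient c is estimated from the paired data.
c = np.cov(Y, G)[0, 1] / np.var(G)
estimate = Y.mean() - c * (G.mean() - g_mean)
print("plain MC:", Y.mean(), "  prediction-enhanced:", estimate)
```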
[LG-126] Deep Learning-based Approaches for State Space Models: A Selective Review
链接: https://arxiv.org/abs/2412.11211
作者: Jiahe Lin,George Michailidis
关键词: dynamical system analysis, Stochastic Differential Equations, neural Ordinary Differential, State-space models, system analysis
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Other Statistics (stat.OT)
*备注:
点击查看摘要
Abstract:State-space models (SSMs) offer a powerful framework for dynamical system analysis, wherein the temporal dynamics of the system are assumed to be captured through the evolution of the latent states, which govern the values of the observations. This paper provides a selective review of recent advancements in deep neural network-based approaches for SSMs, and presents a unified perspective for discrete time deep state space models and continuous time ones such as latent neural Ordinary Differential and Stochastic Differential Equations. It starts with an overview of the classical maximum likelihood based approach for learning SSMs, reviews variational autoencoder as a general learning pipeline for neural network-based approaches in the presence of latent variables, and discusses in detail representative deep learning models that fall under the SSM framework. Very recent developments, where SSMs are used as standalone architectural modules for improving efficiency in sequence modeling, are also examined. Finally, examples involving mixed frequency and irregularly-spaced time series data are presented to demonstrate the advantage of SSMs in these settings.
[LG-127] Hierarchical Bidirectional Transition Dispersion Entropy-based Lempel-Ziv Complexity and Its Application in Fault-Bearing Diagnosis
链接: https://arxiv.org/abs/2412.11123
作者: Runze Jiang,Pengjian Shang
关键词: Permutation Lempel-Ziv complexity, Lempel-Ziv complexity, Dispersion Entropy-based Lempel-Ziv, Entropy-based Lempel-Ziv complexity, time series
类目: Data Analysis, Statistics and Probability (physics.data-an); Machine Learning (cs.LG); Signal Processing (eess.SP); Mathematical Physics (math-ph)
*备注:
点击查看摘要
Abstract:Lempel-Ziv complexity (LZC) is a key measure for detecting the irregularity and complexity of nonlinear time series and has seen various improvements in recent decades. However, existing LZC-based metrics, such as Permutation Lempel-Ziv complexity (PLZC) and Dispersion-Entropy based Lempel-Ziv complexity (DELZC), focus mainly on patterns of independent embedding vectors, often overlooking the transition patterns within the time series. To address this gap, this paper introduces a novel LZC-based method called Bidirectional Transition Dispersion Entropy-based Lempel-Ziv complexity (BT-DELZC). Leveraging Markov chain theory, this method integrates a bidirectional transition network framework with DELZC to better capture dynamic signal information. Additionally, an improved hierarchical decomposition algorithm is used to extract features from various frequency components of the time series. The proposed BT-DELZC method is first evaluated through four simulated experiments, demonstrating its robustness and effectiveness in characterizing nonlinear time series. Additionally, two fault-bearing diagnosis experiments are conducted by combining the hierarchical BT-DELZC method with various classifiers from the machine learning domain. The results indicate that BT-DELZC achieves the highest accuracy across both datasets, significantly outperforming existing methods such as LZC, PLZC, and DELZC in extracting features related to fault bearings.
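For context, the base quantity that PLZC, DELZC, and the proposed BT-DELZC all build on is the classic LZ76 complexity of a symbolized signal. The sketch below implements the standard Kaspar-Schuster counting on a median-binarized series; it does not implement the dispersion-entropy symbolization, transition network, or hierarchical decomposition of BT-DELZC.

```python
import numpy as np

def lempel_ziv_complexity(seq):
    """Classic (Kaspar-Schuster) exhaustive-parsing LZ76 complexity of a symbol sequence."""
    s, n = list(seq), len(seq)
    i, c, l, k, k_max = 0, 1, 1, 1, 1
    while True:
        if s[i + k - 1] == s[l + k - 1]:
            k += 1
            if l + k > n:
                c += 1
                break
        else:
            k_max = max(k_max, k)
            i += 1
            if i == l:
                c += 1
                l += k_max
                if l + 1 > n:
                    break
                i, k, k_max = 0, 1, 1
            else:
                k = 1
    return c

# Binarize a signal around its median -- the simplest symbolization; LZC variants
# replace this step with richer (e.g. dispersion) patterns.
rng = np.random.default_rng(0)
signals = {"white noise": rng.standard_normal(1000),
           "sine": np.sin(0.05 * np.arange(1000))}
for name, x in signals.items():
    symbols = (x > np.median(x)).astype(int)
    print(name, lempel_ziv_complexity(symbols))   # noise yields higher complexity
```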
[LG-128] Representation learning of dynamic networks
链接: https://arxiv.org/abs/2412.11065
作者: Haixu Wang,Jiguo Cao,Jian Pei
关键词: continuously evolving relationships, learning space, describes the continuously, continuously evolving, evolving relationships
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:This study presents a novel representation learning model tailored for dynamic networks, which describes the continuously evolving relationships among individuals within a population. The problem is encapsulated in the dimension reduction topic of functional data analysis. With dynamic networks represented as matrix-valued functions, our objective is to map this functional data into a set of vector-valued functions in a lower-dimensional learning space. This space, defined as a metric functional space, allows for the calculation of norms and inner products. By constructing this learning space, we address (i) attribute learning, (ii) community detection, and (iii) link prediction and recovery of individual nodes in the dynamic network. Our model also accommodates asymmetric low-dimensional representations, enabling the separate study of nodes’ regulatory and receiving roles. Crucially, the learning method accounts for the time-dependency of networks, ensuring that representations are continuous over time. The functional learning space we define naturally spans the time frame of the dynamic networks, facilitating both the inference of network links at specific time points and the reconstruction of the entire network structure without direct observation. We validated our approach through simulation studies and real-world applications. In simulations, we compared our methods link prediction performance to existing approaches under various data corruption scenarios. For real-world applications, we examined a dynamic social network replicated across six ant populations, demonstrating that our low-dimensional learning space effectively captures interactions, roles of individual ants, and the social evolution of the network. Our findings align with existing knowledge of ant colony behavior.
[LG-129] Generative Modeling with Diffusion
链接: https://arxiv.org/abs/2412.10948
作者: Justin Le
关键词: diffusion, Stable Diffusion, diffusion model, Abstract, models
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Probability (math.PR)
*备注: 16 pages with 5 figures. This work was submitted to SIAM Undergraduate Research Online for consideration in their journal
点击查看摘要
Abstract:We introduce the diffusion model as a method to generate new samples. Generative models have been recently adopted for tasks such as art generation (Stable Diffusion, Dall-E) and text generation (ChatGPT). Diffusion models in particular apply noise to sample data and then “reverse” this noising process to generate new samples. We will formally define the noising and denoising processes, then introduce algorithms to train and generate with a diffusion model. Finally, we will explore a potential application of diffusion models in improving classifier performance on imbalanced data.
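The forward ("noising") process mentioned above has a simple closed form under the usual DDPM-style linear variance schedule; a minimal sketch follows, with the schedule endpoints and number of steps as common but assumed defaults.

```python
import numpy as np

rng = np.random.default_rng(0)

# Linear variance schedule beta_1..beta_T and the closed-form forward process
# q(x_t | x_0) = N(sqrt(alpha_bar_t) * x_0, (1 - alpha_bar_t) * I).
T = 1000
betas = np.linspace(1e-4, 0.02, T)
alpha_bar = np.cumprod(1.0 - betas)

def noise_sample(x0, t):
    eps = rng.standard_normal(x0.shape)
    xt = np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps
    return xt, eps          # (noisy sample, the noise a denoiser is trained to predict)

x0 = rng.standard_normal(2)          # a toy 2-D data point
for t in (0, 250, 999):
    xt, _ = noise_sample(x0, t)
    print(f"t={t:4d}  signal scale={np.sqrt(alpha_bar[t]):.3f}  x_t={np.round(xt, 3)}")
```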
[LG-130] Classification of Financial Data Using Quantum Support Vector Machine
链接: https://arxiv.org/abs/2412.10860
作者: Seemanta Bhattacharjee,MD. Muhtasim Fuad,A.K.M. Fakhrul Hossain
关键词: Support Vector Machine, Quantum Support Vector, Support Vector, Vector Machine, Dhaka Stock Exchange
类目: Quantum Physics (quant-ph); Machine Learning (cs.LG); Statistical Finance (q-fin.ST)
*备注: 5 pages, 6 figures
点击查看摘要
Abstract:Quantum Support Vector Machine is a kernel-based approach to classification problems. We study the applicability of quantum kernels to financial data, specifically our self-curated Dhaka Stock Exchange (DSEx) Broad Index dataset. To the best of our knowledge, this is the very first systematic research work on this dataset on the application of quantum kernel. We report empirical quantum advantage in our work, using several quantum kernels and proposing the best one for this dataset while verifying the Phase Space Terrain Ruggedness Index metric. We estimate the resources needed to carry out these investigations on a larger scale for future practitioners.
[LG-131] Graph Attention Hamiltonian Neural Networks: A Lattice System Analysis Model Based on Structural Learning
链接: https://arxiv.org/abs/2412.10821
作者: Ru Geng,Yixian Gao,Jian Zu,Hong-Kun Zhang
关键词: Graph Attention Hamiltonian, Attention Hamiltonian Neural, Hamiltonian Neural Network, specific performance requirements, propose Graph Attention
类目: High Energy Physics - Lattice (hep-lat); Materials Science (cond-mat.mtrl-sci); Machine Learning (cs.LG); Dynamical Systems (math.DS); Chemical Physics (physics.chem-ph)
*备注: 17 pages, 7 figures
点击查看摘要
Abstract:A deep understanding of the intricate interactions between particles within a system is a key approach to revealing the essential characteristics of the system, whether it is an in-depth analysis of molecular properties in the field of chemistry or the design of new materials for specific performance requirements in materials science. To this end, we propose Graph Attention Hamiltonian Neural Network (GAHN), a neural network method that can understand the underlying structure of lattice Hamiltonian systems solely through the dynamic trajectories of particles. We can determine which particles in the system interact with each other, the proportion of interactions between different particles, and whether the potential energy of interactions between particles exhibits even symmetry or not. The obtained structure helps the neural network model to continue predicting the trajectory of the system and further understand the dynamic properties of the system. In addition to understanding the underlying structure of the system, it can be used for detecting lattice structural abnormalities, such as link defects, abnormal interactions, etc. These insights benefit system optimization, design, and detection of aging or damage. Moreover, this approach can integrate other components to deduce the link structure needed for specific parts, showcasing its scalability and potential. We tested it on a challenging molecular dynamics dataset, and the results proved its ability to accurately infer molecular bond connectivity, highlighting its scientific research potential.
[LG-132] Pretrained Event Classification Model for High Energy Physics Analysis
链接: https://arxiv.org/abs/2412.10665
作者: Joshua Ho,Benjamin Ryan Roberts,Shuo Han,Haichen Wang
关键词: Graph Neural Network, Neural Network architecture, Graph Neural, Neural Network, million simulated proton-proton
类目: High Energy Physics - Phenomenology (hep-ph); Machine Learning (cs.LG)
*备注: 9 pages, 1 figure
点击查看摘要
Abstract:We introduce a foundation model for event classification in high-energy physics, built on a Graph Neural Network architecture and trained on 120 million simulated proton-proton collision events spanning 12 distinct physics processes. The model is pretrained to learn a general and robust representation of collision data using challenging multiclass and multilabel classification tasks. Its performance is evaluated across five event classification tasks, which include both physics processes used during pretraining and new processes not encountered during pretraining. Fine-tuning the pretrained model significantly improves classification performance, particularly in scenarios with limited training data, demonstrating gains in both accuracy and computational efficiency. To investigate the underlying mechanisms behind these performance improvements, we employ a representational similarity evaluation framework based on Centered Kernel Alignment. This analysis reveals notable differences in the learned representations of fine-tuned pretrained models compared to baseline models trained from scratch.
[LG-133] Global Estimation of Subsurface Eddy Kinetic Energy of Mesoscale Eddies Using a Multiple-input Residual Neural Network
链接: https://arxiv.org/abs/2412.10656
作者: Chenyue Xie,An-Kang Gao,Xiyun Lu
关键词: eddy kinetic energy, parameterizing eddy effects, subsurface EKE, Oceanic eddy kinetic, neural network model
类目: Atmospheric and Oceanic Physics (physics.ao-ph); Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:Oceanic eddy kinetic energy (EKE) is a key quantity for measuring the intensity of mesoscale eddies and for parameterizing eddy effects in ocean climate models. Three decades of satellite altimetry observations allow a global assessment of sea surface information. However, the subsurface EKE with spatial filter has not been systematically studied due to the sparseness of subsurface observational data. The subsurface EKE can be inferred both theoretically and numerically from sea surface observations but is limited by the issue of decreasing correlation with sea surface variables as depth increases. In this work, inspired by the Taylor-series expansion of subsurface EKE, a multiple-input neural network approach is proposed to reconstruct the subsurface monthly mean EKE from sea surface variables and subsurface climatological variables (e.g., horizontal filtered velocity gradients). Four neural networks are trained on a high-resolution global ocean reanalysis dataset, namely, surface-input fully connected neural network model (FCNN), surface-input Residual neural network model (ResNet), multiple-input fully connected neural network model (MI-FCNN), and multiple-input residual neural network model (MI-ResNet). The proposed MI-FCNN and MI-ResNet models integrate the surface input variables and the vertical profiles of subsurface variables. The MI-ResNet model outperforms the FCNN, ResNet, and MI-FCNN models, and traditional physics-based models in both regional and global reconstruction of subsurface EKE in the upper 2000 m. In addition, the MI-ResNet model performs well for both regional and global observational data based on transfer learning. These findings reveal the potential of the MI-ResNet model for efficient and accurate reconstruction of subsurface oceanic variables.
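下面是一个仅作示意的 PyTorch 草图,演示"多输入残差网络"这一结构思路:一条分支编码海表变量,另一条编码次表层气候态垂直剖面,再经残差块回归 EKE。其中的输入维度、隐藏层大小等均为假设值,并非论文 MI-ResNet 的真实配置。

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))
    def forward(self, x):
        return torch.relu(x + self.net(x))          # 残差(跳跃)连接

class MultiInputResNet(nn.Module):
    """示意用的多输入残差回归网络:一条分支编码海表变量,
    另一条编码次表层气候态垂直剖面,输出该点的次表层 EKE。维度均为假设值。"""
    def __init__(self, n_surface=6, n_profile=40, hidden=128, n_blocks=4):
        super().__init__()
        self.surface_enc = nn.Linear(n_surface, hidden)
        self.profile_enc = nn.Linear(n_profile, hidden)
        self.blocks = nn.Sequential(*[ResidualBlock(2 * hidden) for _ in range(n_blocks)])
        self.head = nn.Linear(2 * hidden, 1)

    def forward(self, surface, profile):
        h = torch.cat([self.surface_enc(surface), self.profile_enc(profile)], dim=-1)
        return self.head(self.blocks(h))

model = MultiInputResNet()
eke_pred = model(torch.randn(8, 6), torch.randn(8, 40))   # 8 个格点的批量预测
```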
[LG-134] Scientific Realism vs. Anti-Realism: Toward a Common Ground
链接: https://arxiv.org/abs/2412.10643
作者: Hanti Lin
关键词: reconciliation seeming hopeless, anti-realism remains, important work remains, version of Ockham, Ockham razor
类目: Other Statistics (stat.OT); Machine Learning (cs.LG); Methodology (stat.ME)
*备注:
点击查看摘要
Abstract:The debate between scientific realism and anti-realism remains at a stalemate, with reconciliation seeming hopeless. Yet, important work remains: to seek a common ground, even if only to uncover deeper points of disagreement. I develop the idea that everyone values some truths, and use it to benefit both sides of the debate. More specifically, many anti-realists, such as instrumentalists, have yet to seriously engage with Sober’s call to justify their preferred version of Ockham’s razor through a positive epistemology. Meanwhile, realists face a similar challenge: providing a non-circular explanation of how their version of Ockham’s razor connects to truth. Drawing insights from fields that study scientific inference – statistics and machine learning – I propose a common ground that addresses these challenges for both sides. This common ground also isolates a distinctively epistemic root of the irreconcilability in the realism debate.
[LG-135] Upstream flow geometries can be uniquely learnt from single-point turbulence signatures
链接: https://arxiv.org/abs/2412.10630
作者: Mukesh Karunanethy,Raghunathan Rengaswamy,Mahesh V Panchagnula
关键词: geometry-identifiable information pertaining, near-field turbulence downstream, upstream obstruction, sudden contraction, contraction contains geometry-identifiable
类目: Fluid Dynamics (physics.flu-dyn); Machine Learning (cs.LG)
*备注: Manuscript: 10 pages, 4 figures; SI Appendix: 24 pages, 3 figures; Submitted to PNAS
点击查看摘要
Abstract:We test the hypothesis that the microscopic temporal structure of near-field turbulence downstream of a sudden contraction contains geometry-identifiable information pertaining to the shape of the upstream obstruction. We measure a set of spatially sparse velocity time-series data downstream of differently-shaped orifices. We then train random forest multiclass classifier models on a vector of invariants derived from this time-series. We test the above hypothesis with 25 somewhat similar orifice shapes to push the model to its extreme limits. Remarkably, the algorithm was able to identify the orifice shape with 100% accuracy and 100% precision. This outcome is enabled by the uniqueness in the downstream temporal evolution of turbulence structures in the flow past orifices, combined with the random forests’ ability to learn subtle yet discerning features in the turbulence microstructure. We are also able to explain the underlying flow physics that enables such classification by listing the invariant measures in the order of increasing information entropy. We show that the temporal autocorrelation coefficients of the time-series are most sensitive to orifice shape and are therefore informative. The ability to identify changes in system geometry without the need for physical disassembly offers tremendous potential for flow control and system identification. Furthermore, the proposed approach could potentially have significant applications in other unrelated fields as well, by deploying the core methodology of training random forest classifiers on vectors of invariant measures obtained from time-series data.
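摘要的核心做法是:从单点速度时间序列提取不变量特征(其中自相关系数信息量最大),再用随机森林做多分类。下面用 scikit-learn 给出一个极简示意,其中的 AR(1) 合成信号与类别设置均为虚构,仅用来演示"自相关特征 + 随机森林分类"的流程,并非论文的实验代码。

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

def autocorr_features(x, max_lag=10):
    """时间序列的前 max_lag 阶自相关系数,作为分类特征向量。"""
    x = (x - x.mean()) / (x.std() + 1e-12)
    n = len(x)
    return np.array([np.dot(x[:n - k], x[k:]) / (n - k) for k in range(1, max_lag + 1)])

rng = np.random.default_rng(0)

def make_signal(phi, n=2000):
    """用 AR(1) 过程生成合成的"速度信号",不同类别对应不同的相关系数 phi。"""
    x = np.zeros(n)
    for t in range(1, n):
        x[t] = phi * x[t - 1] + rng.normal()
    return x

labels = rng.integers(0, 5, size=300)                       # 5 种假设的"孔口形状"
signals = [make_signal(0.1 + 0.18 * y) for y in labels]
X = np.stack([autocorr_features(s) for s in signals])

X_tr, X_te, y_tr, y_te = train_test_split(X, labels, test_size=0.25, random_state=0)
clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_tr, y_tr)
print("test accuracy:", clf.score(X_te, y_te))
```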
[LG-136] Cardiovascular Disease Detection By Leveraging Semi-Supervised Learning
链接: https://arxiv.org/abs/2412.10567
作者: Shaohan Chen,Zheyan Liu,Huili Zheng,Qimin Zhang,Yiru Gong
关键词: timely detection methods, global scale, CVD detection, requires more effective, effective and timely
类目: Quantitative Methods (q-bio.QM); Computational Engineering, Finance, and Science (cs.CE); Machine Learning (cs.LG); Applications (stat.AP)
*备注: 4 pages, 3 figures, 1 table. This paper has been accepted for publication in the IEEE ITCA 2024 conference
点击查看摘要
Abstract:Cardiovascular disease (CVD) persists as a primary cause of death on a global scale, which requires more effective and timely detection methods. Traditional supervised learning approaches for CVD detection rely heavily on large labeled datasets, which are often difficult to obtain. This paper employs semi-supervised learning models to boost the efficiency and accuracy of CVD detection when there are few labeled samples. By leveraging both labeled and vast amounts of unlabeled data, our approach demonstrates improvements in prediction performance, while reducing the dependency on labeled data. Experimental results on a publicly available dataset show that semi-supervised models outperform traditional supervised learning techniques, providing an intriguing approach for the initial identification of cardiovascular disease within clinical environments.
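下面用 scikit-learn 的自训练(self-training)给出一个半监督分类的最小示例:少量样本有标签,其余标记为 -1 表示无标签。数据与特征均为随机生成的假设数据,论文实际采用的半监督模型可能不同。

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.semi_supervised import SelfTrainingClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 12))                     # 12 个假设的临床特征
y_true = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)  # 合成的患病标签
y = np.full(1000, -1)                               # -1 表示无标签样本
labeled_idx = rng.choice(1000, size=50, replace=False)
y[labeled_idx] = y_true[labeled_idx]                # 只有 50 个样本有标签

# 自训练:基分类器对高置信度的无标签样本打伪标签,再迭代训练
model = SelfTrainingClassifier(LogisticRegression(max_iter=1000), threshold=0.8)
model.fit(X, y)
print("accuracy:", model.score(X, y_true))
```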
[LG-137] Aspen Open Jets: Unlocking LHC Data for Foundation Models in Particle Physics
链接: https://arxiv.org/abs/2412.10504
作者: Oz Amram,Luca Anzalone,Joschka Birk,Darius A. Faroughy,Anna Hallin,Gregor Kasieczka,Michael Krämer,Ian Pang,Humberto Reyes-Gonzalez,David Shih
关键词: Large Hadron Collider, deep learning models, learning models pre-trained, deep learning, capable of generalizing
类目: High Energy Physics - Phenomenology (hep-ph); Machine Learning (cs.LG); High Energy Physics - Experiment (hep-ex); Machine Learning (stat.ML)
*备注: 11 pages, 4 figures, the AspenOpenJets dataset can be found at this http URL
点击查看摘要
Abstract:Foundation models are deep learning models pre-trained on large amounts of data, which are capable of generalizing to multiple datasets and/or downstream tasks. This work demonstrates how data collected by the CMS experiment at the Large Hadron Collider can be useful in pre-training foundation models for HEP. Specifically, we introduce the AspenOpenJets dataset, consisting of approximately 180M high-p_T jets derived from CMS 2016 Open Data. We show how pre-training the OmniJet-α foundation model on AspenOpenJets improves performance on generative tasks with significant domain shift: generating boosted top and QCD jets from the simulated JetClass dataset. In addition to demonstrating the power of pre-training a jet-based foundation model on actual proton-proton collision data, we provide the ML-ready derived AspenOpenJets dataset for further public use.
信息检索
[IR-0] One for Dozens: Adaptive REcommendation for All Domains with Counterfactual Augmentation AAAI2025
链接: https://arxiv.org/abs/2412.11905
作者: Huishi Luo,Yiwen Chen,Yiqing Wu,Fuzhen Zhuang,Deqing Wang
关键词: enhance recommendation performance, Multi-domain recommendation, traditional MDR algorithms, aims to enhance, domains
类目: Information Retrieval (cs.IR)
*备注: Accepted at AAAI 2025
点击查看摘要
Abstract:Multi-domain recommendation (MDR) aims to enhance recommendation performance across various domains. However, real-world recommender systems in online platforms often need to handle dozens or even hundreds of domains, far exceeding the capabilities of traditional MDR algorithms, which typically focus on fewer than five domains. Key challenges include a substantial increase in parameter count, high maintenance costs, and intricate knowledge transfer patterns across domains. Furthermore, minor domains often suffer from data sparsity, leading to inadequate training in classical methods. To address these issues, we propose Adaptive REcommendation for All Domains with counterfactual augmentation (AREAD). AREAD employs a hierarchical structure with a limited number of expert networks at several layers, to effectively capture domain knowledge at different granularities. To adaptively capture the knowledge transfer pattern across domains, we generate and iteratively prune a hierarchical expert network selection mask for each domain during training. Additionally, counterfactual assumptions are used to augment data in minor domains, supporting their iterative mask pruning. Our experiments on two public datasets, each encompassing over twenty domains, demonstrate AREAD’s effectiveness, especially in data-sparse domains. Source code is available at this https URL.
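下面是一段仅作示意的 PyTorch 草图,表达"共享专家网络 + 每个领域一张专家选择掩码"的大致思想:门控权重先与领域掩码相乘再归一化。掩码的迭代剪枝、层级结构与反事实增强均未实现,类名与维度均为假设,并非 AREAD 的官方代码。

```python
import torch
import torch.nn as nn

class MaskedExpertLayer(nn.Module):
    """示意层:一组共享专家 MLP,外加每个领域各自的专家选择掩码;
    门控权重先与该领域的掩码相乘再归一化。掩码剪枝与层级结构未实现。"""
    def __init__(self, n_domains, n_experts, dim):
        super().__init__()
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))
            for _ in range(n_experts))
        self.gate = nn.Linear(dim, n_experts)
        self.register_buffer("mask", torch.ones(n_domains, n_experts))  # 训练中可迭代剪枝

    def forward(self, x, domain_id):
        w = torch.softmax(self.gate(x), dim=-1) * self.mask[domain_id]
        w = w / (w.sum(dim=-1, keepdim=True) + 1e-9)
        expert_out = torch.stack([e(x) for e in self.experts], dim=-1)  # (B, dim, E)
        return (expert_out * w.unsqueeze(1)).sum(dim=-1)

layer = MaskedExpertLayer(n_domains=30, n_experts=8, dim=64)
out = layer(torch.randn(16, 64), domain_id=3)        # 第 3 个领域的一个 batch
```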
[IR-1] A Distributed Collaborative Retrieval Framework Excelling in All Queries and Corpora based on Zero-shot Rank-Oriented Automatic Evaluation
链接: https://arxiv.org/abs/2412.11832
作者: Tian-Yi Che,Xian-Ling Mao,Chun Xu,Cheng-Xin Xin,Heng-Da Xu,Jin-Yu Liu,Heyan Huang
关键词: Numerous retrieval models, demonstrated remarkable performance, queries and corpora, Numerous retrieval, including sparse
类目: Information Retrieval (cs.IR)
*备注:
点击查看摘要
Abstract:Numerous retrieval models, including sparse, dense, and LLM-based methods, have demonstrated remarkable performance in predicting the relevance between queries and corpora. However, preliminary effectiveness analysis experiments indicate that these models fail to achieve satisfactory performance on the majority of queries and corpora, revealing that their effectiveness is restricted to specific scenarios. Thus, to tackle this problem, we propose a novel Distributed Collaborative Retrieval Framework (DCRF), outperforming each single model across all queries and corpora. Specifically, the framework integrates various retrieval models into a unified system and dynamically selects the optimal results for each user’s query. It can easily aggregate any retrieval model and expand to any application scenario, illustrating its flexibility and scalability. Furthermore, to reduce maintenance and training costs, we design four effective prompting strategies with large language models (LLMs) to evaluate the quality of ranks without reliance on labeled data. Extensive experiments demonstrate that the proposed framework, combined with 8 efficient retrieval models, achieves performance comparable to effective listwise methods like RankGPT and ListT5, while offering superior efficiency. Besides, DCRF surpasses all selected retrieval models on most datasets, indicating the effectiveness of our prompting strategies for rank-oriented automatic evaluation.
[IR-2] Leveraging User-Generated Metadata of Online Videos for Cover Song Identification
链接: https://arxiv.org/abs/2412.11818
作者: Simon Hachmeier,Robert Jäschke
关键词: cover song identification, cover song, rich source, song identification, cover
类目: Multimedia (cs.MM); Information Retrieval (cs.IR)
*备注: accepted for presentation at NLP for Music and Audio (NLP4MusA) 2024
点击查看摘要
Abstract:YouTube is a rich source of cover songs. Since the platform itself is organized in terms of videos rather than songs, the retrieval of covers is not trivial. The field of cover song identification addresses this problem and provides approaches that usually rely on audio content. However, including the user-generated video metadata available on YouTube promises improved identification results. In this paper, we propose a multi-modal approach for cover song identification on online video platforms. We combine the entity resolution models with audio-based approaches using a ranking model. Our findings implicate that leveraging user-generated metadata can stabilize cover song identification performance on YouTube.
[IR-3] Establishing a Foundation for Tetun Text Ad-Hoc Retrieval: Indexing Stemming Retrieval and Ranking
链接: https://arxiv.org/abs/2412.11758
作者: Gabriel de Jesus,Sérgio Nunes
关键词: Searching for information, requires effective retrieval, Tetun, Tetun text retrieval, internet and digital
类目: Information Retrieval (cs.IR)
*备注:
点击查看摘要
Abstract:Searching for information on the internet and digital platforms to satisfy an information need requires effective retrieval solutions. However, such solutions are not yet available for Tetun, making it challenging to find relevant documents for text-based search queries in this language. To address these challenges, this study investigates Tetun text retrieval with a focus on the ad-hoc retrieval task. It begins by developing essential language resources – including a list of stopwords, a stemmer, and a test collection – which serve as foundational components for solutions tailored to Tetun text retrieval. Various strategies are then explored using both document titles and content to evaluate retrieval effectiveness. The results show that retrieving document titles, after removing hyphens and apostrophes without applying stemming, significantly improves retrieval performance compared to the baseline. Efficiency increases by 31.37%, while effectiveness achieves an average gain of 9.40% in MAP@10 and 30.35% in nDCG@10 with DFR BM25. Beyond the top-10 cutoff point, Hiemstra LM demonstrates strong performance across various retrieval strategies and evaluation metrics. Contributions of this work include the development of Labadain-Stopwords (a list of 160 Tetun stopwords), Labadain-Stemmer (a Tetun stemmer with three variants), and Labadain-Avaliadór (a Tetun test collection containing 59 topics, 33,550 documents, and 5,900 qrels).
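下面给出一个纯 Python 的 Okapi BM25 打分小示例,演示"去掉连字符与撇号、过滤停用词后再检索"的基本流程;其中的停用词列表只是占位示意,并非 Labadain-Stopwords,示例文档亦为虚构,评分参数 k1、b 取常见默认值。

```python
import math
from collections import Counter

STOPWORDS = {"no", "iha", "ba"}          # 占位用的假设停用词,并非 Labadain-Stopwords

def tokenize(text):
    # 摘要中的做法之一:去掉连字符与撇号后再切分,并过滤停用词
    text = text.lower().replace("-", " ").replace("'", " ")
    return [t for t in text.split() if t not in STOPWORDS]

def bm25_scores(query, docs, k1=1.2, b=0.75):
    """对每篇文档计算 Okapi BM25 相对于 query 的得分。"""
    toks = [tokenize(d) for d in docs]
    avgdl = sum(len(t) for t in toks) / len(toks)
    df = Counter(term for t in toks for term in set(t))
    N, q_terms = len(docs), set(tokenize(query))
    scores = []
    for t in toks:
        tf = Counter(t)
        s = 0.0
        for term in q_terms:
            if term not in tf:
                continue
            idf = math.log(1 + (N - df[term] + 0.5) / (df[term] + 0.5))
            s += idf * tf[term] * (k1 + 1) / (tf[term] + k1 * (1 - b + b * len(t) / avgdl))
        scores.append(s)
    return scores

docs = ["Timor-Leste nia istória", "istória badak kona-ba eskola"]   # 虚构示例文档
print(bm25_scores("istória", docs))
```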
[IR-4] Beyond Graph Convolution: Multimodal Recommendation with Topology-aware MLPs AAAI2025
链接: https://arxiv.org/abs/2412.11747
作者: Junjie Huang,Jiarui Qin,Yong Yu,Weinan Zhang
关键词: richer semantic information, exploit richer semantic, Graph Convolutional Networks, multimodal recommender systems, leveraging Graph Convolutional
类目: Information Retrieval (cs.IR)
*备注: AAAI 2025. 11 pages, 9 figures
点击查看摘要
Abstract:Given the large volume of side information from different modalities, multimodal recommender systems have become increasingly vital, as they exploit richer semantic information beyond user-item interactions. Recent works highlight that leveraging Graph Convolutional Networks (GCNs) to explicitly model multimodal item-item relations can significantly enhance recommendation performance. However, due to the inherent over-smoothing issue of GCNs, existing models benefit only from shallow GCNs with limited representation power. This drawback is especially pronounced when facing complex and high-dimensional patterns such as multimodal data, as it requires large-capacity models to accommodate complicated correlations. To this end, in this paper, we investigate bypassing GCNs when modeling multimodal item-item relationship. More specifically, we propose a Topology-aware Multi-Layer Perceptron (TMLP), which uses MLPs instead of GCNs to model the relationships between items. TMLP enhances MLPs with topological pruning to denoise item-item relations and intra (inter)-modality learning to integrate higher-order modality correlations. Extensive experiments on three real-world datasets verify TMLP’s superiority over nine baselines. We also find that by discarding the internal message passing in GCNs, which is sensitive to node connections, TMLP achieves significant improvements in both training efficiency and robustness against existing models.
[IR-5] STAIR: Manipulating Collaborative and Multimodal Information for E-Commerce Recommendation AAAI2025
链接: https://arxiv.org/abs/2412.11729
作者: Cong Xu,Yunhang He,Jun Wang,Wei Zhang
关键词: multimodal recommendation methods, multimodal information, Vanilla graph convolution, mining of modalities, fully utilize
类目: Information Retrieval (cs.IR)
*备注: Accepted at AAAI 2025
点击查看摘要
Abstract:While the mining of modalities is the focus of most multimodal recommendation methods, we believe that how to fully utilize both collaborative and multimodal information is pivotal in e-commerce scenarios where, as clarified in this work, the user behaviors are rarely determined entirely by multimodal features. In order to combine the two distinct types of information, some additional challenges are encountered: 1) Modality erasure: Vanilla graph convolution, which proves rather useful in collaborative filtering, however erases multimodal information; 2) Modality forgetting: Multimodal information tends to be gradually forgotten as the recommendation loss essentially facilitates the learning of collaborative information. To this end, we propose a novel approach named STAIR, which employs a novel STepwise grAph convolution to enable a co-existence of collaborative and multimodal Information in e-commerce Recommendation. Besides, it starts with the raw multimodal features as an initialization, and the forgetting problem can be significantly alleviated through constrained embedding updates. As a result, STAIR achieves state-of-the-art recommendation performance on three public e-commerce datasets with minimal computational and memory costs. Our code is available at this https URL.
[IR-6] Future Sight and Tough Fights: Revolutionizing Sequential Recommendation with FENRec AAAI2025
链接: https://arxiv.org/abs/2412.11589
作者: Yu-Hsuan Huang,Ling Lo,Hongxia Xie,Hong-Han Shuai,Wen-Huang Cheng
关键词: time-ordered interaction sequences, analyzing time-ordered interaction, systems predict user, predict user preferences, systems predict
类目: Information Retrieval (cs.IR)
*备注: Accepted by AAAI 2025
点击查看摘要
Abstract:Sequential recommendation (SR) systems predict user preferences by analyzing time-ordered interaction sequences. A common challenge for SR is data sparsity, as users typically interact with only a limited number of items. While contrastive learning has been employed in previous approaches to address the challenges, these methods often adopt binary labels, missing finer patterns and overlooking detailed information in subsequent behaviors of users. Additionally, they rely on random sampling to select negatives in contrastive learning, which may not yield sufficiently hard negatives during later training stages. In this paper, we propose Future data utilization with Enduring Negatives for contrastive learning in sequential Recommendation (FENRec). Our approach aims to leverage future data with time-dependent soft labels and generate enduring hard negatives from existing data, thereby enhancing the effectiveness in tackling data sparsity. Experiment results demonstrate our state-of-the-art performance across four benchmark datasets, with an average improvement of 6.16% across all metrics.
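下面是一段仅作示意的 PyTorch 函数,演示"把对比学习的 0/1 标签换成随时间衰减的软标签"这一通用做法:对候选相似度做 softmax,再与软目标求交叉熵。这只是对摘要思路的一种通俗化示意,并非 FENRec 的实际损失函数,变量名与形状均为假设。

```python
import torch
import torch.nn.functional as F

def soft_label_contrastive(anchor, candidates, soft_targets, temperature=0.1):
    """带软标签的对比式损失:anchor (B, d),candidates (B, K, d),
    soft_targets (B, K) 每行和为 1,例如按"未来交互距当前的时间"衰减的权重。"""
    logits = torch.einsum("bd,bkd->bk",
                          F.normalize(anchor, dim=-1),
                          F.normalize(candidates, dim=-1)) / temperature
    return -(soft_targets * F.log_softmax(logits, dim=-1)).sum(dim=-1).mean()

anchor = torch.randn(4, 32)
candidates = torch.randn(4, 10, 32)
soft = torch.softmax(torch.randn(4, 10), dim=-1)     # 虚构的软标签
print(soft_label_contrastive(anchor, candidates, soft))
```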
[IR-7] Enhancing Healthcare Recommendation Systems with a Multimodal LLM s-based MOE Architecture
链接: https://arxiv.org/abs/2412.11557
作者: Jingyu Xu,Yang Wang
关键词: fields urgently require, urgently require advanced, require advanced architectures, advanced architectures capable, address specific problems
类目: Information Retrieval (cs.IR); Databases (cs.DB)
*备注: 10 pages, accepted by the Conf-SMPL conference
点击查看摘要
Abstract:With the increasing availability of multimodal data, many fields urgently require advanced architectures capable of effectively integrating these diverse data sources to address specific problems. This study proposes a hybrid recommendation model that combines the Mixture of Experts (MOE) framework with large language models to enhance the performance of recommendation systems in the healthcare domain. We built a small dataset for recommending healthy food based on patient descriptions and evaluated the model’s performance on several key metrics, including Precision, Recall, NDCG, and MAP@5. The experimental results show that the hybrid model outperforms the baseline models, which use MOE or large language models individually, in terms of both accuracy and personalized recommendation effectiveness. The paper finds image data provided relatively limited improvement in the performance of the personalized recommendation system, particularly in addressing the cold start problem. Then, the issue of reclassification of images also affected the recommendation results, especially when dealing with low-quality images or changes in the appearance of items, leading to suboptimal performance. The findings provide valuable insights into the development of powerful, scalable, and high-performance recommendation systems, advancing the application of personalized recommendation technologies in real-world domains such as healthcare.
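摘要使用 Precision、Recall、NDCG 与 MAP@5 等指标评估推荐效果。下面给出 MAP@K 的一个常见计算方式(纯 Python 实现),示例中的用户与物品均为虚构,论文的具体评测脚本可能在细节上有所不同。

```python
import numpy as np

def average_precision_at_k(ranked_items, relevant, k=5):
    """单个用户的 AP@k:ranked_items 为推荐列表,relevant 为相关物品集合。"""
    hits, score = 0, 0.0
    for i, item in enumerate(ranked_items[:k], start=1):
        if item in relevant:
            hits += 1
            score += hits / i
    return score / min(len(relevant), k) if relevant else 0.0

def map_at_k(all_ranked, all_relevant, k=5):
    return float(np.mean([average_precision_at_k(r, rel, k)
                          for r, rel in zip(all_ranked, all_relevant)]))

# 两个虚构用户的推荐结果与真实偏好
print(map_at_k([["salad", "soup", "fries"], ["tofu", "cake"]],
               [{"salad"}, {"tofu", "cake"}], k=5))          # 输出 1.0
```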
[IR-8] Leveraging Large Vision-Language Model as User Intent-aware Encoder for Composed Image Retrieval AAAI2025
链接: https://arxiv.org/abs/2412.11087
作者: Zelong Sun,Dong Jing,Guoxing Yang,Nanyi Fei,Zhiwu Lu
关键词: Composed Image Retrieval, retrieve target images, Composed Image, Image Retrieval, hybrid-modality query consisting
类目: Information Retrieval (cs.IR)
*备注: Accepted by AAAI 2025
点击查看摘要
Abstract:Composed Image Retrieval (CIR) aims to retrieve target images from a candidate set using a hybrid-modality query consisting of a reference image and a relative caption that describes the user intent. Recent studies attempt to utilize Vision-Language Pre-training Models (VLPMs) with various fusion strategies for addressing the CIR task. However, these methods typically fail to simultaneously meet two key requirements of CIR: comprehensively extracting visual information and faithfully following the user intent. In this work, we propose CIR-LVLM, a novel framework that leverages the large vision-language model (LVLM) as a powerful user intent-aware encoder to better meet these requirements. Our motivation is to explore the advanced reasoning and instruction-following capabilities of LVLMs for accurately understanding and responding to the user intent. Furthermore, we design a novel hybrid intent instruction module to provide explicit intent guidance at two levels: (1) The task prompt clarifies the task requirement and assists the model in discerning user intent at the task level. (2) The instance-specific soft prompt, which is adaptively selected from the learnable prompt pool, enables the model to better comprehend the user intent at the instance level compared to a universal prompt for all instances. CIR-LVLM achieves state-of-the-art performance across three prominent benchmarks with acceptable inference efficiency. We believe this study provides fundamental insights into CIR-related fields.
[IR-9] Why Not Together? A Multiple-Round Recommender System for Queries and Items KDD2025
链接: https://arxiv.org/abs/2412.10787
作者: Jiarui Jin,Xianyu Chen,Weinan Zhang,Yong Yu,Jun Wang
关键词: recommender systems involves, systems involves modeling, modeling user preferences, involves modeling user, fundamental technique
类目: Information Retrieval (cs.IR)
*备注: KDD 2025
点击查看摘要
Abstract:A fundamental technique of recommender systems involves modeling user preferences, where queries and items are widely used as symbolic representations of user interests. Queries delineate user needs at an abstract level, providing a high-level description, whereas items operate on a more specific and concrete level, representing the granular facets of user preference. While practical, both query and item recommendations encounter the challenge of sparse user feedback. To this end, we propose a novel approach named Multiple-round Auto Guess-and-Update System (MAGUS) that capitalizes on the synergies between both types, allowing us to leverage both query and item information to form user interests. This integrated system introduces a recursive framework that could be applied to any recommendation method to exploit queries and items in historical interactions and to provide recommendations for both queries and items in each interaction round. Empirical results from testing 12 different recommendation methods demonstrate that integrating queries into item recommendations via MAGUS significantly enhances the efficiency, with which users can identify their preferred items during multiple-round interactions.
[IR-10] Learned Data Compression: Challenges and Opportunities for the Future
链接: https://arxiv.org/abs/2412.10770
作者: Qiyu Liu,Siyuan Han,Jianwei Liao,Jin Li,Jingshu Peng,Jun Du,Lei Chen
关键词: Compressing integer keys, Compressing integer, emph, multiple communities, high-performance computing
类目: Databases (cs.DB); Information Retrieval (cs.IR)
*备注:
点击查看摘要
Abstract:Compressing integer keys is a fundamental operation among multiple communities, such as database management (DB), information retrieval (IR), and high-performance computing (HPC). Recent advances in learned indexes have inspired the development of learned compressors, which leverage simple yet compact machine learning (ML) models to compress large-scale sorted keys. The core idea behind learned compressors is to losslessly encode sorted keys by approximating them with error-bounded ML models (e.g., piecewise linear functions) and using a residual array to guarantee accurate key reconstruction. While the concept of learned compressors remains in its early stages of exploration, our benchmark results demonstrate that a SIMD-optimized learned compressor can significantly outperform state-of-the-art CPU-based compressors. Drawing on our preliminary experiments, this vision paper explores the potential of learned data compression to enhance critical areas in DBMS and related domains. Furthermore, we outline the key technical challenges that existing systems must address when integrating this emerging methodology.
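下面用 numpy 给出"误差有界模型 + 残差数组"无损压缩有序整型键的一个最小示意:这里只用一个全局线性模型拟合 位置→键值,真实的 learned compressor 通常采用分段线性模型并对残差做位打包,此处仅演示核心思想,并非论文的基准实现。

```python
import numpy as np

def compress_sorted_keys(keys):
    """用一个线性模型拟合 位置 -> 键值,残差数组保证无损重建。
    真实系统通常用分段线性模型并对残差做位打包,此处仅示意核心思想。"""
    pos = np.arange(len(keys), dtype=np.float64)
    slope, intercept = np.polyfit(pos, keys.astype(np.float64), 1)
    pred = np.round(slope * pos + intercept).astype(np.int64)
    residuals = keys - pred                      # 拟合越好,残差越小,所需比特越少
    return (slope, intercept), residuals

def decompress(model, residuals):
    slope, intercept = model
    pos = np.arange(len(residuals), dtype=np.float64)
    pred = np.round(slope * pos + intercept).astype(np.int64)
    return pred + residuals

keys = np.sort(np.random.default_rng(0).integers(0, 10**9, size=100_000))
model, res = compress_sorted_keys(keys)
assert np.array_equal(decompress(model, res), keys)     # 无损重建
print("max |residual|:", int(np.abs(res).max()))
```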
[IR-11] Enhancing Event Extraction from Short Stories through Contextualized Prompts
链接: https://arxiv.org/abs/2412.10745
作者: Chaitanya Kirti(1),Ayon Chattopadhyay(1),Ashish Anand(1),Prithwijit Guha(1) ((1) Indian Institute of Technology Guwahati)
关键词: natural language processing, important natural language, Event extraction, language processing, natural language
类目: Information Retrieval (cs.IR)
*备注: 47 pages, 8 figures, Planning to submit in Elsevier (Computer Speech and Language Journal)
点击查看摘要
Abstract:Event extraction is an important natural language processing (NLP) task of identifying events in an unstructured text. Although a plethora of works deal with event extraction from news articles, clinical text, etc., only a few works focus on event extraction from literary content. Detecting events in short stories presents several challenges to current systems, encompassing a different distribution of events as compared to other domains and the portrayal of diverse emotional conditions. This paper presents Vrittanta-EN, a collection of 1000 English short stories annotated for real events. Exploring this field could result in the creation of techniques and resources that support literary scholars in improving their effectiveness. This could simultaneously influence the field of Natural Language Processing. Our objective is to clarify the intricate idea of events in the context of short stories. Towards the objective, we collected 1,000 short stories written mostly for children in the Indian context. Further, we present fresh guidelines for annotating event mentions and their categories, organized into seven distinct classes. The classes are COGNITIVE-MENTAL-STATE (CMS), COMMUNICATION (COM), CONFLICT (CON), GENERAL-ACTIVITY (GA), LIFE-EVENT (LE), MOVEMENT (MOV), and OTHERS (OTH). Subsequently, we apply these guidelines to annotate the short story dataset. Later, we apply baseline methods for automatically detecting and categorizing events. We also propose a prompt-based method for event detection and classification. The proposed method outperforms the baselines, with a significant improvement of more than 4% for the CONFLICT class in the event classification task.
[IR-12] Sentiment and Hashtag-aware Attentive Deep Neural Network for Multimodal Post Popularity Prediction
链接: https://arxiv.org/abs/2412.10737
作者: Shubhi Bansal,Mohit Kumar,Chandravardhan Singh Raghaw,Nagendra Kumar
关键词: comprising multiple modes, posts comprising multiple, media users articulate, social media platforms, Social media users
类目: Information Retrieval (cs.IR); Social and Information Networks (cs.SI)
*备注:
点击查看摘要
Abstract:Social media users articulate their opinions on a broad spectrum of subjects and share their experiences through posts comprising multiple modes of expression, leading to a notable surge in such multimodal content on social media platforms. Nonetheless, accurately forecasting the popularity of these posts presents a considerable challenge. Prevailing methodologies primarily center on the content itself, thereby overlooking the wealth of information encapsulated within alternative modalities such as visual demographics, sentiments conveyed through hashtags, and adequately modeling the intricate relationships among hashtags, texts, and accompanying images. This oversight limits the ability to capture emotional connection and audience relevance, which significantly influence post popularity. To address these limitations, we propose a seNtiment and hAshtag-aware attentive deep neuRal netwoRk for multimodAl posT pOpularity pRediction, herein referred to as NARRATOR, which extracts visual demographics from faces appearing in images and discerns sentiment from hashtag usage, providing a more comprehensive understanding of the factors influencing post popularity. Moreover, we introduce a hashtag-guided attention mechanism that leverages hashtags as navigational cues, guiding the model’s focus toward the most pertinent features of textual and visual modalities, thus aligning with target audience interests and the broader social media context. Experimental results demonstrate that NARRATOR outperforms existing methods by a significant margin on two real-world datasets. Furthermore, ablation studies underscore the efficacy of integrating visual demographics, sentiment analysis of hashtags, and hashtag-guided attention mechanisms in enhancing the performance of post popularity prediction, thereby facilitating increased audience relevance, emotional engagement, and aesthetic appeal.
[IR-13] Movie Recommendation using Web Crawling
链接: https://arxiv.org/abs/2412.10714
作者: Pronit Raj,Chandrashekhar Kumar,Harshit Shekhar,Amit Kumar,Kritibas Paul,Debasish Jana
关键词: today digital world, streaming platforms offer, find content matching, digital world, streaming platforms
类目: Information Retrieval (cs.IR)
*备注: 12 pages, 3 figures, Accepted and to be published in Proceedings of 2025 International Conference on Applied Algorithms (ICAA), Kolkata, India, Dec 8-10, 2025
点击查看摘要
Abstract:In today’s digital world, streaming platforms offer a vast array of movies, making it hard for users to find content matching their preferences. This paper explores integrating real time data from popular movie websites using advanced HTML scraping techniques and APIs. It also incorporates a recommendation system trained on a static Kaggle dataset, enhancing the relevance and freshness of suggestions. By combining content based filtering, collaborative filtering, and a hybrid model, we create a system that utilizes both historical and real time data for more personalized suggestions. Our methodology shows that incorporating dynamic data not only boosts user satisfaction but also aligns recommendations with current viewing trends.
[IR-14] Beyond Quantile Methods: Improved Top-K Threshold Estimation for Traditional and Learned Sparse Indexes
链接: https://arxiv.org/abs/2412.10701
作者: Jinrui Gou,Yifan Liu,Minghao Shao,Torsten Suel
关键词: k-th highest ranking, highest ranking result, common top-k query, top-k query processing, estimating the score
类目: Information Retrieval (cs.IR)
*备注:
点击查看摘要
Abstract:Top-k threshold estimation is the problem of estimating the score of the k-th highest ranking result of a search query. A good estimate can be used to speed up many common top-k query processing algorithms, and thus a number of researchers have recently studied the problem. Among the various approaches that have been proposed, quantile methods appear to give the best estimates overall at modest computational costs, followed by sampling-based methods in certain cases. In this paper, we make two main contributions. First, we study how to get even better estimates than the state of the art. Starting from quantile-based methods, we propose a series of enhancements that give improved estimates in terms of the commonly used mean under-prediction fraction (MUF). Second, we study the threshold estimation problem on recently proposed learned sparse index structures, showing that our methods also work well for these cases. Our best methods substantially narrow the gap between the state of the art and the ideal MUF of 1.0, at some additional cost in time and space.
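下面给出一个基于抽样的 top-k 阈值估计小示例:只对文档子集打分,再按比例取样本中的次序统计量来近似全量的第 k 大分数。这是摘要中提到的抽样类方法的朴素版本,并非论文改进后的分位数方法;MUF 等指标的精确定义请以原文为准,示例中的分数分布为虚构。

```python
import numpy as np

def sample_threshold_estimate(sample_scores, k, n_total):
    """从打分样本估计全集的第 k 大分数:若样本占比为 f,
    则全集第 k 大大致对应样本中的第 ceil(k*f) 大。"""
    f = len(sample_scores) / n_total
    j = max(1, int(np.ceil(k * f)))
    return np.partition(sample_scores, -j)[-j]

rng = np.random.default_rng(0)
scores = rng.gamma(2.0, 1.0, size=1_000_000)            # 某查询对全部文档的假设得分
true_thr = np.partition(scores, -10)[-10]               # 真实的 top-10 阈值
sample = rng.choice(scores, size=50_000, replace=False)
est = sample_threshold_estimate(sample, k=10, n_total=len(scores))
print(f"estimate={est:.3f}  true={true_thr:.3f}  safe={bool(est <= true_thr)}")
```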
[IR-15] Recommendation and Temptation
链接: https://arxiv.org/abs/2412.10595
作者: Md Sanzeed Anwar,Paramveer S. Dhillon,Grant Schoenebeck
关键词: users’ dual-self nature, capture users’ dual-self, Traditional recommender systems, long-term benefits, instant gratification
类目: Information Retrieval (cs.IR); Computers and Society (cs.CY); Computer Science and Game Theory (cs.GT)
*备注:
点击查看摘要
Abstract:Traditional recommender systems based on utility maximization and revealed preferences often fail to capture users’ dual-self nature, where consumption choices are driven by both long-term benefits (enrichment) and desire for instant gratification (temptation). Consequently, these systems may generate recommendations that fail to provide long-lasting satisfaction to users. To address this issue, we propose a novel user model that accounts for this dual-self behavior and develop an optimal recommendation strategy to maximize enrichment from consumption. We highlight the limitations of historical consumption data in implementing this strategy and present an estimation framework that makes minimal assumptions and leverages explicit user feedback and implicit choice data to overcome these constraints. We evaluate our approach through both synthetic simulations and simulations based on real-world data from the MovieLens dataset. Results demonstrate that our proposed recommender can deliver superior enrichment compared to several competitive baseline algorithms that assume a single utility type and rely solely on revealed preferences. Our work emphasizes the critical importance of optimizing for enrichment in recommender systems, particularly in temptation-laden consumption contexts. Our findings have significant implications for content platforms, user experience design, and the development of responsible AI systems, paving the way for more nuanced and user-centric recommendation approaches.
[IR-16] Agro-STAY : Collecte de donnees et analyse des informations en agriculture alternative issues de YouTube
链接: https://arxiv.org/abs/2412.10576
作者: Laura Maxim,Julien Rabatel,Jean-Marc Douguet,Natalia Grabar,Roberto Interdonato,Sébastien Loustau,Mathieu Roche,Maguelonne Teisseire
关键词: combine energy sobriety, energy sobriety, self-production of food, arouses an increasing, increasing interest
类目: Information Retrieval (cs.IR)
*备注: 8 pages, in French, 3 figures
点击查看摘要
Abstract:To address the current crises (climatic, social, economic), self-sufficiency (a set of practices that combine energy sobriety, self-production of food and energy, and self-construction) arouses increasing interest. The CNRS STAY project (Savoirs Techniques pour l’Auto-suffisance, sur YouTube) explores this topic by analyzing techniques shared on YouTube. We present Agro-STAY, a platform designed for the collection, processing, and visualization of data from YouTube videos and their comments. We use Natural Language Processing (NLP) techniques and language models, which enable a fine-grained analysis of alternative agricultural practices described online.
[IR-17] CRS Arena: Crowdsourced Benchmarking of Conversational Recommender Systems WSDM’25
链接: https://arxiv.org/abs/2412.10514
作者: Nolwenn Bernard,Hideaki Joko,Faegheh Hasibi,Krisztian Balog
关键词: Conversational Recommender Systems, Conversational Recommender, introduce CRS Arena, anonymous conversational recommender, CRS Arena
类目: Information Retrieval (cs.IR)
*备注: Proceedings of the Eighteenth ACM International Conference on Web Search and Data Mining (WSDM '25), March 10–14, 2025, Hannover, Germany
点击查看摘要
Abstract:We introduce CRS Arena, a research platform for scalable benchmarking of Conversational Recommender Systems (CRS) based on human feedback. The platform displays pairwise battles between anonymous conversational recommender systems, where users interact with the systems one after the other before declaring either a winner or a draw. CRS Arena collects conversations and user feedback, providing a foundation for reliable evaluation and ranking of CRSs. We conduct experiments with CRS Arena on both open and closed crowdsourcing platforms, confirming that both setups produce highly correlated rankings of CRSs and conversations with similar characteristics. We release CRSArena-Dial, a dataset of 474 conversations and their corresponding user feedback, along with a preliminary ranking of the systems based on the Elo rating system. The platform is accessible at this https URL.
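摘要提到基于 Elo 等级分对 CRS 两两对战结果进行初步排名。下面给出标准 Elo 更新公式的一个简单实现,系统名称与对战记录均为虚构示例,CRS Arena 实际的排名细节以论文为准。

```python
def elo_update(r_a, r_b, outcome, k=32):
    """一场对战后的标准 Elo 更新:outcome 取 1.0(A 胜)、0.0(B 胜)或 0.5(平局)。"""
    expected_a = 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))
    r_a += k * (outcome - expected_a)
    r_b += k * ((1.0 - outcome) - (1.0 - expected_a))
    return r_a, r_b

ratings = {"crs_alpha": 1000.0, "crs_beta": 1000.0}      # 两个虚构的 CRS
battles = [("crs_alpha", "crs_beta", 1.0), ("crs_alpha", "crs_beta", 0.5),
           ("crs_alpha", "crs_beta", 0.0)]
for a, b, outcome in battles:
    ratings[a], ratings[b] = elo_update(ratings[a], ratings[b], outcome)
print(ratings)
```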
附件下载
点击下载今日全部论文列表