This blog post presents the latest papers retrieved from Arxiv.org on 2024-12-11. The list is updated automatically and grouped into five broad areas: NLP, CV, ML, AI, and IR. If you would like to receive the daily digest by email, please leave your email address in the comments.

Note: the daily paper data is fetched from Arxiv.org and refreshed automatically at around 12:00 each day.

Friendly reminder: if you would like to receive the daily paper data by email, please leave your email address in the comments.

Overview (2024-12-11)

A total of 497 new papers today, including:

  • Natural Language Processing: 71 papers (Computation and Language (cs.CL))
  • Artificial Intelligence: 146 papers (Artificial Intelligence (cs.AI))
  • Computer Vision: 149 papers (Computer Vision and Pattern Recognition (cs.CV))
  • Machine Learning: 176 papers (Machine Learning (cs.LG))

Natural Language Processing

[NLP-0] Zero-Shot ATC Coding with Large Language Models for Clinical Assessments

[Quick Read]: This paper targets the bottleneck of manually assigning Anatomical Therapeutic Chemical (ATC) codes in healthcare research and operations, a process that demands substantial expert time and effort. The key to the solution is automating the task with locally deployable large language models (LLMs) while preserving data privacy. Drawing on recent advances in automatic International Classification of Diseases (ICD) coding, the paper frames ATC coding as a hierarchical information extraction task, guiding LLMs through the ATC ontology level by level. The results demonstrate the feasibility of automatic ATC coding in privacy-sensitive healthcare settings and provide a foundation for future deployments.

Link: https://arxiv.org/abs/2412.07743
Authors: Zijian Chen, John-Michael Gamble, Micaela Jantzi, John P. Hirdes, Jimmy Lin
Keywords-EN: Anatomical Therapeutic Chemical, Therapeutic Chemical, Anatomical Therapeutic, requiring extensive expert, extensive expert time
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:Manual assignment of Anatomical Therapeutic Chemical (ATC) codes to prescription records is a significant bottleneck in healthcare research and operations at Ontario Health and InterRAI Canada, requiring extensive expert time and effort. To automate this process while maintaining data privacy, we develop a practical approach using locally deployable large language models (LLMs). Inspired by recent advances in automatic International Classification of Diseases (ICD) coding, our method frames ATC coding as a hierarchical information extraction task, guiding LLMs through the ATC ontology level by level. We evaluate our approach using GPT-4o as an accuracy ceiling and focus development on open-source Llama models suitable for privacy-sensitive deployment. Testing across Health Canada drug product data, the RABBITS benchmark, and real clinical notes from Ontario Health, our method achieves 78% exact match accuracy with GPT-4o and 60% with Llama 3.1 70B. We investigate knowledge grounding through drug definitions, finding modest improvements in accuracy. Further, we show that fine-tuned Llama 3.1 8B matches zero-shot Llama 3.1 70B accuracy, suggesting that effective ATC coding is feasible with smaller models. Our results demonstrate the feasibility of automatic ATC coding in privacy-sensitive healthcare environments, providing a foundation for future deployments.
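
To make the level-by-level idea concrete, here is a minimal Python sketch of hierarchical ATC coding; the `ask_llm` helper and the toy `ATC_TREE` slice are hypothetical stand-ins for a local LLM client and the full ATC ontology, not the paper's implementation.

```python
# Level-by-level descent through a toy slice of the ATC ontology; `ask_llm`
# is a hypothetical stand-in for a locally deployed LLM client.
ATC_TREE = {
    "": ["A", "C", "N"],          # level 1: anatomical main group
    "N": ["N02"],                 # level 2: therapeutic subgroup
    "N02": ["N02B"],              # level 3: pharmacological subgroup
    "N02B": ["N02BE"],            # level 4: chemical subgroup
    "N02BE": ["N02BE01"],         # level 5: chemical substance
}

def ask_llm(prompt: str) -> str:
    raise NotImplementedError("plug in a local Llama / GPT-4o client here")

def assign_atc(drug_description: str) -> str:
    code = ""
    while code in ATC_TREE:       # descend one ontology level at a time
        options = ATC_TREE[code]
        answer = ask_llm(
            f"Drug: {drug_description}\n"
            f"Which ATC code fits best? Options: {options}\n"
            "Answer with the code only."
        ).strip()
        if answer not in options: # guard against off-ontology answers
            raise ValueError(f"unexpected code {answer!r}")
        code = answer
    return code                   # a leaf such as 'N02BE01' (paracetamol)
```

At each level the model only chooses among the children of the current code, so an error cannot jump across unrelated branches of the ontology.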

[NLP-1] Granite Guardian

[Quick Read]: This paper targets the safe and responsible use of large language models (LLMs) by introducing the Granite Guardian models. These models provide risk detection for prompts and responses across multiple risk dimensions, including social bias, profanity, violence, sexual content, unethical behavior, jailbreaking, and hallucination-related risks for retrieval-augmented generation (RAG), such as context relevance, groundedness, and answer relevance. The key to the solution is a unique training dataset that combines human annotations from diverse sources with synthetic data, enabling the models to catch risks typically overlooked by traditional risk detectors, such as jailbreaks and RAG-specific issues. Granite Guardian achieves AUC scores of 0.871 and 0.854 on harmful-content and RAG-hallucination benchmarks respectively, demonstrating strong generalization and competitiveness.

Link: https://arxiv.org/abs/2412.07724
Authors: Inkit Padhi, Manish Nagireddy, Giandomenico Cornacchia, Subhajit Chaudhury, Tejaswini Pedapati, Pierre Dognin, Keerthiram Murugesan, Erik Miehling, Martín Santillán Cooper, Kieran Fraser, Giulio Zizzo, Muhammad Zaid Hameed, Mark Purcell, Michael Desmond, Qian Pan, Inge Vejsbjerg, Elizabeth M. Daly, Michael Hind, Werner Geyer, Ambrish Rawat, Kush R. Varshney, Prasanna Sattigeri
Keywords-EN: Granite Guardian models, Granite Guardian, provide risk detection, large language model, Granite Guardian aims
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:We introduce the Granite Guardian models, a suite of safeguards designed to provide risk detection for prompts and responses, enabling safe and responsible use in combination with any large language model (LLM). These models offer comprehensive coverage across multiple risk dimensions, including social bias, profanity, violence, sexual content, unethical behavior, jailbreaking, and hallucination-related risks such as context relevance, groundedness, and answer relevance for retrieval-augmented generation (RAG). Trained on a unique dataset combining human annotations from diverse sources and synthetic data, Granite Guardian models address risks typically overlooked by traditional risk detection models, such as jailbreaks and RAG-specific issues. With AUC scores of 0.871 and 0.854 on harmful content and RAG-hallucination-related benchmarks respectively, Granite Guardian is the most generalizable and competitive model available in the space. Released as open-source, Granite Guardian aims to promote responsible AI development across the community. this https URL

[NLP-2] TRIM: Token Reduction and Inference Modeling for Cost-Effective Language Generation

[Quick Read]: This paper addresses the high inference cost of large language models (LLMs) on tasks that require long outputs. The key to the solution is exploiting the redundancy of natural language: with appropriate prompting, an LLM generates a condensed output that preserves the essential meaning, and a smaller model with lower inference cost then reconstructs that short output into a full narrative. Experiments show that the approach saves 20.58% of tokens on average in general-knowledge domains with only a slight drop in evaluation metrics, effectively balancing efficiency and accuracy in language processing tasks.

Link: https://arxiv.org/abs/2412.07682
Authors: Alfredo Garrachón Ruiz, Tomás de la Rosa, Daniel Borrajo
Keywords-EN: significant challenge due, Large Language Models, requiring long outputs, tasks requiring long, Large Language
Subjects: Computation and Language (cs.CL)
Comments: 12 pages

Abstract:The inference cost of Large Language Models (LLMs) is a significant challenge due to their computational demands, especially on tasks requiring long outputs. However, natural language often contains redundancy, which presents an opportunity for optimization. We have observed that LLMs can generate distilled language, i.e., concise outputs that retain the essential meaning, when prompted appropriately. We propose a framework for saving computational cost, in which a shorter distilled output from the LLM is reconstructed into a full narrative by a smaller model with lower inference costs. Our experiments show promising results, particularly in general knowledge domains, with 20.58% saved tokens on average and a tiny decrease in evaluation metrics, hinting that this approach can effectively balance efficiency and accuracy in language processing tasks.
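
A minimal sketch of the compress-then-reconstruct pipeline suggested by the abstract; both model functions and the prompt wording are assumptions, not the paper's actual prompts.

```python
# Hypothetical two-stage pipeline: the large model emits a condensed draft,
# a cheaper small model expands it back into a full narrative.
def large_llm(prompt: str) -> str:
    raise NotImplementedError("expensive model call goes here")

def small_llm(prompt: str) -> str:
    raise NotImplementedError("cheap model call goes here")

def cost_effective_generate(task: str) -> str:
    condensed = large_llm(
        f"{task}\nAnswer in a maximally compressed, telegraphic style; "
        "keep every essential fact and drop filler words."
    )
    # Reconstruction is billed at the small model's (lower) inference price.
    return small_llm(
        "Rewrite the following compressed notes as fluent, complete prose, "
        f"adding no new information:\n{condensed}"
    )
```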

[NLP-3] Can linguists better understand DNA?

[Quick Read]: This paper investigates whether the capabilities of natural language processing models transfer to gene sequences and genetic language. The key to the solution is the design of two analogous tasks, DNA-pair classification (DNA sequence similarity) and DNA-protein-pair classification (gene coding determination), which are used to test capability transfer from natural language to gene sequences. The results show that even a small model pretrained on English, such as GPT-2-small, reaches 78% accuracy on DNA-pair classification after fine-tuning on English sentence-pair classification data (XTREME PAWS-X), while a BERT model trained on multilingual text reaches a precision of 82%. On the more complex DNA-protein-pair classification task, however, model performance is close to random, suggesting that a capability transfer from natural language to genetic language may exist but requires further task testing to confirm.

Link: https://arxiv.org/abs/2412.07678
Authors: Wang Liang
Keywords-EN: Multilingual transfer ability, natural language, classification, language, transfer ability
Subjects: Computation and Language (cs.CL); Genomics (q-bio.GN)
Comments: 11 pages, 3 figures

Abstract:Multilingual transfer ability, which reflects how well models fine-tuned on one source language can be applied to other languages, has been well studied in multilingual pre-trained models. However, the existence of such capability transfer between natural language and gene sequences/languages remains unexplored. This study addresses this gap by drawing inspiration from the sentence-pair classification task used for evaluating sentence similarity in natural language. We constructed two analogous tasks: DNA-pair classification (DNA sequence similarity) and DNA-protein-pair classification (gene coding determination). These tasks were designed to validate the transferability of capabilities from natural language to gene sequences. Even a small-scale pre-trained model like GPT-2-small, which was pre-trained on English, achieved an accuracy of 78% on the DNA-pair classification task after being fine-tuned on English sentence-pair classification data (XTREME PAWS-X), while a BERT model trained on multilingual text reached a precision of 82%. On the more complex DNA-protein-pair classification task, however, the model's output was barely distinguishable from random chance. These results suggest that there may be a capability transfer from natural language to genetic language, but further task testing is needed to confirm this.

[NLP-4] RAZOR: Sharpening Knowledge by Cutting Bias with Unsupervised Text Rewriting AAAI’25

[Quick Read]: This paper addresses spurious correlations in the pretraining-finetuning pipeline of large language models (LLMs), which arise from biases in manually constructed datasets and limit the generalizability of fine-tuned models. The key to the solution is RAZOR (Rewriting And Zero-bias Optimization Refinement), an unsupervised, data-driven debiasing method based on text rewriting. RAZOR uses LLMs to iteratively rewrite potentially biased text segments, replacing them with heuristically selected alternatives from a shortcut space defined by token statistics and positional information, so that surface-level text features align more closely with diverse label distributions and genuine linguistic patterns are learned. The method requires no prior knowledge of dataset-specific biases, improves F1 scores by 3.5% on FEVER and 6.5% on MNLI and SNLI, and reduces known biases as effectively as SoTA models that rely on prior information.

Link: https://arxiv.org/abs/2412.07675
Authors: Shuo Yang, Bardh Prenkaj, Gjergji Kasneci
Keywords-EN: high computational costs, lead potential users, pretraining-finetuning pipeline, high computational, computational costs
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: Shuo and Bardh contributed equally. Accepted to AAAI’25

Abstract:Despite the widespread use of LLMs due to their superior performance in various tasks, their high computational costs often lead potential users to opt for the pretraining-finetuning pipeline. However, biases prevalent in manually constructed datasets can introduce spurious correlations between tokens and labels, creating so-called shortcuts and hindering the generalizability of fine-tuned models. Existing debiasing methods often rely on prior knowledge of specific dataset biases, which is challenging to acquire a priori. We propose RAZOR (Rewriting And Zero-bias Optimization Refinement), a novel, unsupervised, and data-focused debiasing approach based on text rewriting for shortcut mitigation. RAZOR leverages LLMs to iteratively rewrite potentially biased text segments by replacing them with heuristically selected alternatives in a shortcut space defined by token statistics and positional information. This process aims to align surface-level text features more closely with diverse label distributions, thereby promoting the learning of genuine linguistic patterns. Compared with unsupervised SoTA models, RAZOR improves by 3.5% on the FEVER and 6.5% on MNLI and SNLI datasets according to the F1 score. Additionally, RAZOR effectively mitigates specific known biases, reducing bias-related terms by x2 without requiring prior bias information, a result that is on par with SoTA models that leverage prior information. Our work prioritizes data manipulation over architectural modifications, emphasizing the pivotal role of data quality in enhancing model performance and fairness. This research contributes to developing more robust evaluation benchmarks for debiasing methods by incorporating metrics for bias reduction and overall model efficacy.
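
The following sketch illustrates one debiasing round in the spirit of RAZOR: tokens are scored by how strongly they leak the label, and an LLM is asked to rewrite the leaky spans. The scoring heuristic and the `rewrite_llm` interface are simplified assumptions, not the paper's code.

```python
from collections import Counter

def shortcut_scores(texts, labels):
    """Estimate P(label | token) from counts; high values flag shortcut tokens."""
    token_label = Counter((tok, lab) for txt, lab in zip(texts, labels)
                          for tok in txt.split())
    token_total = Counter(tok for txt in texts for tok in txt.split())
    return {(tok, lab): n / token_total[tok]
            for (tok, lab), n in token_label.items()}

def rewrite_round(texts, labels, rewrite_llm, threshold=0.9):
    scores = shortcut_scores(texts, labels)
    rewritten = []
    for txt, lab in zip(texts, labels):
        leaky = [tok for tok in txt.split()
                 if scores.get((tok, lab), 0) > threshold]
        if leaky:  # ask the LLM to paraphrase away the label-leaking tokens
            txt = rewrite_llm(f"Paraphrase this, avoiding the words {leaky}:\n{txt}")
        rewritten.append(txt)
    return rewritten
```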

[NLP-5] FlexLLM: Exploring LLM Customization for Moving Target Defense on Black-Box LLMs Against Jailbreak Attacks

[Quick Read]: This paper addresses the problem of defending large language models (LLMs) against jailbreak attacks, in which manipulated prompts are used to elicit harmful content. The key to the solution is a moving target defense that requires neither access to model internals nor additional training. The method optimizes the decoding strategy by adjusting the decoding hyperparameters that influence token-generation probabilities, and turns both the decoding hyperparameters and the model system prompt into dynamic targets that change continuously at every run. By constantly modifying decoding strategies and prompts, the defense effectively mitigates existing jailbreak attacks while keeping inference costs low and response quality comparable.

Link: https://arxiv.org/abs/2412.07672
Authors: Bocheng Chen, Hanqing Guo, Qiben Yan
Keywords-EN: numerous attackers exploiting, generate harmful content, large language models, large language, crucial to counter
Subjects: Cryptography and Security (cs.CR); Computation and Language (cs.CL)
Comments:

Abstract:Defense in large language models (LLMs) is crucial to counter the numerous attackers exploiting these systems to generate harmful content through manipulated prompts, known as jailbreak attacks. Although many defense strategies have been proposed, they often require access to the model’s internal structure or need additional training, which is impractical for service providers using LLM APIs, such as OpenAI APIs or Claude APIs. In this paper, we propose a moving target defense approach that alters decoding hyperparameters to enhance model robustness against various jailbreak attacks. Our approach does not require access to the model’s internal structure and incurs no additional training costs. The proposed defense includes two key components: (1) optimizing the decoding strategy by identifying and adjusting decoding hyperparameters that influence token generation probabilities, and (2) transforming the decoding hyperparameters and model system prompts into dynamic targets, which are continuously altered during each runtime. By continuously modifying decoding strategies and prompts, the defense effectively mitigates the existing attacks. Our results demonstrate that our defense is the most effective against jailbreak attacks in three of the models tested when using LLMs as black-box APIs. Moreover, our defense offers lower inference costs and maintains comparable response quality, making it a potential layer of protection when used alongside other defense methods.
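
A minimal sketch of what a moving-target wrapper around a black-box API could look like, assuming the provider exposes the usual `temperature` and `top_p` knobs; the parameter ranges and prompt variants are illustrative, not the paper's tuned values.

```python
import random

# Hypothetical wrapper: each request is served with freshly randomized
# decoding hyperparameters and a perturbed system prompt.
SYSTEM_PROMPTS = [
    "You are a helpful assistant. Refuse unsafe requests.",
    "You are a careful assistant. Politely decline harmful instructions.",
]

def defended_generate(user_prompt: str, api_call) -> str:
    params = {
        "temperature": random.uniform(0.3, 1.2),   # shifts token probabilities
        "top_p": random.uniform(0.7, 1.0),
        "system": random.choice(SYSTEM_PROMPTS),   # dynamic system prompt
    }
    # `api_call` stands in for any black-box LLM API; no internals needed.
    return api_call(user_prompt, **params)
```

Because an attacker can no longer assume a fixed decoding distribution, a jailbreak prompt tuned against one configuration tends to fail against the next.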

[NLP-6] Searching for Structure: Investigating Emergent Communication with Large Language Models

[Quick Read]: This paper asks whether artificial languages, when optimized for the implicit biases of large language models (LLMs), develop structural features similar to those of human languages. The key to the solution is simulating a classical referential game in which two LLM agents learn and use artificial languages. The study finds that initially unstructured holistic languages do evolve structural properties that allow the LLM agents to communicate successfully. Moreover, generational transmission increases the learnability of the languages but can also produce non-humanlike degenerate vocabularies. The work extends experimental findings, shows that LLMs can serve as tools in simulations of language evolution, and opens possibilities for future human-machine experiments.

Link: https://arxiv.org/abs/2412.07646
Authors: Tom Kouwenhoven, Max Peeperkorn, Tessa Verhoef
Keywords-EN: repeated language learning, Large Language Models, structured through repeated, languages, repeated language
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:Human languages have evolved to be structured through repeated language learning and use. These processes introduce biases that operate during language acquisition and shape linguistic systems toward communicative efficiency. In this paper, we investigate whether the same happens if artificial languages are optimised for implicit biases of Large Language Models (LLMs). To this end, we simulate a classical referential game in which LLMs learn and use artificial languages. Our results show that initially unstructured holistic languages are indeed shaped to have some structural properties that allow two LLM agents to communicate successfully. Similar to observations in human experiments, generational transmission increases the learnability of languages, but can at the same time result in non-humanlike degenerate vocabularies. Taken together, this work extends experimental findings, shows that LLMs can be used as tools in simulations of language evolution, and opens possibilities for future human-machine experiments in this field.

[NLP-7] ChocoLlama: Lessons Learned From Teaching Llamas Dutch

[Quick Read]: This paper addresses the underperformance of large language models (LLMs) on lower-resource, non-English languages such as Dutch, which stems from biases in the training data. The key to the solution is continued pretraining with low-rank adaptation (LoRA), combined with Dutch posttraining strategies, tokenizer modification, and embedding reinitialization. The results show that LoRA scales effectively for language adaptation and that tokenizer modification with careful weight reinitialization can improve performance. However, with the arrival of stronger models such as Llama-3, continued pretraining yields limited gains, suggesting that future multilingual foundation models may benefit more from language-specific posttraining strategies.

Link: https://arxiv.org/abs/2412.07633
Authors: Matthieu Meeus, Anthony Rathé, François Remy, Pieter Delobelle, Jens-Joris Decorte, Thomas Demeester
Keywords-EN: Large Language Models, non-English languages due, Large Language, Dutch, shown remarkable capabilities
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:While Large Language Models (LLMs) have shown remarkable capabilities in natural language understanding and generation, their performance often lags in lower-resource, non-English languages due to biases in the training data. In this work, we explore strategies for adapting the primarily English LLMs (Llama-2 and Llama-3) to Dutch, a language spoken by 30 million people worldwide yet often underrepresented in LLM development. We collect 104GB of Dutch text (32B tokens) from various sources to first apply continued pretraining using low-rank adaptation (LoRA), complemented with Dutch posttraining strategies provided by prior work. For Llama-2, we consider using (i) the tokenizer of the original model, and (ii) training a new, Dutch-specific tokenizer combined with embedding reinitialization. We evaluate our adapted models, ChocoLlama-2, both on standard benchmarks and a novel Dutch benchmark, ChocoLlama-Bench. Our results demonstrate that LoRA can effectively scale for language adaptation, and that tokenizer modification with careful weight reinitialization can improve performance. Notably, Llama-3 was released during the course of this project and, upon evaluation, demonstrated superior Dutch capabilities compared to our Dutch-adapted versions of Llama-2. We hence apply the same adaptation technique to Llama-3, using its original tokenizer. While our adaptation methods enhanced Llama-2’s Dutch capabilities, we found limited gains when applying the same techniques to Llama-3. This suggests that for ever improving, multilingual foundation models, language adaptation techniques may benefit more from focusing on language-specific posttraining rather than on continued pretraining. We hope this work contributes to the broader understanding of adapting LLMs to lower-resource languages, and to the development of Dutch LLMs in particular.

[NLP-8] Piece of Table: A Divide-and-Conquer Approach for Selecting Sub-Tables in Table Question Answering

[Quick Read]: This paper addresses the challenges of applying language models (LMs) to tables, which arise from the inherent mismatch between two-dimensional tables and the one-dimensional text LMs were designed for, and from the maximum token length imposed in self-attention calculations, which makes it hard to grasp context spread across large tables. The key to the solution is PieTa (Piece of Table), a framework that iteratively divides a table into smaller windows, uses LMs to select relevant cells within each window, and merges the selected cells into a sub-table. This multi-resolution approach captures dependencies across rows and columns while avoiding the limitations of long-context inputs, and outperforms previous sub-table-based question-answering methods.

Link: https://arxiv.org/abs/2412.07629
Authors: Wonjin Lee, Kyumin Kim, Sungjae Lee, Jihun Lee, Kwang In Kim
Keywords-EN: Applying language models, inherent structural differences, language models, originally designed, challenging due
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:Applying language models (LMs) to tables is challenging due to the inherent structural differences between two-dimensional tables and one-dimensional text for which the LMs were originally designed. Furthermore, when applying linearized tables to LMs, the maximum token lengths often imposed in self-attention calculations make it difficult to comprehensively understand the context spread across large tables. To address these challenges, we present PieTa (Piece of Table), a new framework for sub-table-based question answering (QA). PieTa operates through an iterative process of dividing tables into smaller windows, using LMs to select relevant cells within each window, and merging these cells into a sub-table. This multi-resolution approach captures dependencies across multiple rows and columns while avoiding the limitations caused by long context inputs. Instantiated as a simple iterative sub-table union algorithm, PieTa demonstrates improved performance over previous sub-table-based QA approaches.
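
A minimal sketch of the divide-and-conquer loop, treating the table as a list of rows; `select_cells(window_rows, question)` stands in for the LM call that keeps only relevant cells, and must return fewer rows than it receives for the loop to terminate.

```python
# Illustrative PieTa-style iteration: split, select, merge, repeat.
def pieta(table_rows, question, select_cells, window=4):
    rows = table_rows
    while len(rows) > window:
        kept = []
        for i in range(0, len(rows), window):           # divide into small windows
            chunk = rows[i:i + window]
            kept.extend(select_cells(chunk, question))  # LM keeps relevant cells
        rows = kept                                     # merged sub-table, next round
    return rows                                         # final sub-table fed to QA
```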

[NLP-9] DRUM: Learning Demonstration Retriever for Large MUlti-modal Models

[Quick Read]: This paper addresses the problem that, for in-context learning (ICL) in Large Vision-Language Models (LVLMs), existing practices (such as fixed demonstrations or selecting demonstrations directly via a vision-language embedding model) cannot guarantee that the chosen demonstrations match the model's needs. The key to the solution is a new framework, the Demonstration Retriever for large multi-modal model (DRUM), which fine-tunes the vision-language embedding model to better serve the LVLM. Specifically, DRUM (1) concatenates image and text embeddings to improve retrieval, (2) re-ranks the retrieved demonstrations via the LVLM's feedback and trains the embedding model with a list-wise ranking loss, and (3) applies an iterative demonstration-mining strategy to improve embedding-model training. Experiments show that DRUM boosts the LVLM's in-context learning performance by retrieving more suitable demonstrations.

Link: https://arxiv.org/abs/2412.07619
Authors: Ellen Yi-Ge, Jiechao Gao, Wei Han, Wei Zhu
Keywords-EN: demonstrated impressive capabilities, embedding model, large language models, visual-language embedding model, Large Vision-Language Models
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:Recently, large language models (LLMs) have demonstrated impressive capabilities in dealing with new tasks with the help of in-context learning (ICL). In the study of Large Vision-Language Models (LVLMs), when implementing ICL, researchers usually adopt naive strategies such as fixed demonstrations across different samples, or selecting demonstrations directly via a visual-language embedding model. These methods do not guarantee that the configured demonstrations fit the needs of the LVLMs. To address this issue, we now propose a novel framework, demonstration retriever for large multi-modal model (DRUM), which fine-tunes the visual-language embedding model to better meet the LVLM’s needs. First, we discuss the retrieval strategies for a visual-language task, assuming an embedding model is given, and propose to concatenate the image and text embeddings to enhance the retrieval performance. Second, we propose to re-rank the demonstrations retrieved by the embedding model via the LVLM’s feedback, and calculate a list-wise ranking loss for training the embedding model. Third, we propose an iterative demonstration mining strategy to improve the training of the embedding model. Through extensive experiments on 3 types of visual-language tasks and 7 benchmark datasets, our DRUM framework is proven to be effective in boosting the LVLM’s in-context learning performance via retrieving more proper demonstrations.
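
Two of the abstract's ingredients can be sketched in a few lines of NumPy: fusing image and text embeddings by concatenation, and a ListMLE-style list-wise ranking loss driven by the LVLM's feedback. Shapes and names here are assumptions for illustration, not the paper's code.

```python
import numpy as np

# (1) Concatenated image-text retrieval keys; (2) a list-wise loss where
# `lvlm_ranking` orders candidate demonstrations best-first per LVLM feedback.
def fuse(img_emb: np.ndarray, txt_emb: np.ndarray) -> np.ndarray:
    return np.concatenate([img_emb, txt_emb], axis=-1)

def listwise_loss(retriever_scores: np.ndarray, lvlm_ranking: np.ndarray) -> float:
    """-log P(observed ranking | scores) under the Plackett-Luce model."""
    ordered = retriever_scores[lvlm_ranking]          # scores in feedback order
    loss = 0.0
    for k in range(len(ordered)):
        loss -= ordered[k] - np.log(np.exp(ordered[k:]).sum())
    return loss
```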

[NLP-10] Adapting to Non-Stationary Environments: Multi-Armed Bandit Enhanced Retrieval-Augmented Generation on Knowledge Graphs AAAI2025

[Quick Read]: This paper addresses the limitations of large language models (LLMs) in memorizing extensive world knowledge, particularly the performance degradation in non-stationary environments. The key to the solution is a Multi-objective Multi-Armed Bandit enhanced Retrieval-Augmented Generation (RAG) framework that combines multiple retrieval methods and uses real-time user feedback to dynamically select the most suitable one. Treating each retrieval method as a distinct "arm", the system adapts its strategy based on the input query and the historical multi-objective performance of each arm, significantly improving performance in non-stationary settings while achieving state-of-the-art performance in stationary ones.

Link: https://arxiv.org/abs/2412.07618
Authors: Xiaqiang Tang, Jian Li, Nan Du, Sihong Xie
Keywords-EN: Large language models, face significant limitations, NLP tasks, Large language, memorizing extensive world
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: AAAI 2025

Abstract:Despite the superior performance of Large language models on many NLP tasks, they still face significant limitations in memorizing extensive world knowledge. Recent studies have demonstrated that leveraging the Retrieval-Augmented Generation (RAG) framework, combined with Knowledge Graphs that encapsulate extensive factual data in a structured format, robustly enhances the reasoning capabilities of LLMs. However, deploying such systems in real-world scenarios presents challenges: the continuous evolution of non-stationary environments may lead to performance degradation and user satisfaction requires a careful balance of performance and responsiveness. To address these challenges, we introduce a Multi-objective Multi-Armed Bandit enhanced RAG framework, supported by multiple retrieval methods with diverse capabilities under rich and evolving retrieval contexts in practice. Within this framework, each retrieval method is treated as a distinct "arm". The system utilizes real-time user feedback to adapt to dynamic environments, by selecting the appropriate retrieval method based on input queries and the historical multi-objective performance of each arm. Extensive experiments conducted on two benchmark KGQA datasets demonstrate that our method significantly outperforms baseline methods in non-stationary settings while achieving state-of-the-art performance in stationary environments. Code and data are available at this https URL
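
As a simplified single-objective illustration of the arm-selection idea (the paper optimizes multiple objectives), the sketch below runs UCB1 over three hypothetical retrieval arms, with a scalar reward derived from user feedback.

```python
import math

# Toy UCB1 selector over retrieval "arms"; the arm names and the reward
# signal are illustrative assumptions, not the paper's configuration.
arms = ["sparse_bm25", "dense_retriever", "kg_subgraph"]
counts = {a: 0 for a in arms}
values = {a: 0.0 for a in arms}

def choose_arm(t: int) -> str:
    """Pick the arm with the best optimistic estimate at step t (t >= 1)."""
    for a in arms:                       # play every arm once first
        if counts[a] == 0:
            return a
    return max(arms, key=lambda a: values[a]
               + math.sqrt(2 * math.log(t) / counts[a]))

def update(arm: str, reward: float) -> None:
    counts[arm] += 1
    values[arm] += (reward - values[arm]) / counts[arm]   # running mean reward
```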

[NLP-11] SST framework for Document Matching

[Quick Read]: This paper addresses the problem that existing long-document matching methods may overlook details and introduce bias when documents contain multiple subtopics. The key to the solution is a new framework that captures matching signals across the different subtopics of a document pair and builds multiple document views from those subtopics to cover heterogeneous, valuable details. Because existing spatial aggregation methods (such as attention) struggle to integrate heterogeneous information, the paper proposes temporal aggregation, which integrates the different views gradually as training progresses. Experiments show that the framework is effective on several document-matching tasks, including news duplication detection and legal case retrieval.

Link: https://arxiv.org/abs/2412.07573
Authors: Youchao Zhou, Heyan Huang, Zhijing Wu, Yuhang Liu, Xinglin Wang
Keywords-EN: Long-form document matching, Long-form document, aims to judge, judge the relevance, Long-form
Subjects: Information Retrieval (cs.IR); Computation and Language (cs.CL)
Comments:

Abstract:Long-form document matching aims to judge the relevance between two documents and has been applied to various scenarios. Most existing works utilize hierarchical or long context models to process documents, which achieve coarse understanding but may ignore details. Some researchers construct a document view with similar sentences about aligned document subtopics to focus on detailed matching signals. However, a long document generally contains multiple subtopics, and the matching signals from multiple topics are heterogeneous. Considering only the homologous aligned subtopics may not be representative enough and may cause biased modeling. In this paper, we introduce a new framework to model representative matching signals. First, we propose to capture various matching signals through subtopics of document pairs. Next, we construct multiple document views based on subtopics to cover heterogeneous and valuable details. However, existing spatial aggregation methods like attention, which integrate all these views simultaneously, struggle to integrate heterogeneous information. Instead, we propose temporal aggregation, which effectively integrates different views gradually as the training progresses. Experimental results show that our learning framework is effective on several document-matching tasks, including news duplication and legal case retrieval.

[NLP-12] CoPrUS: Consistency Preserving Utterance Synthesis towards more realistic benchmark dialogues COLING2025

[Quick Read]: This paper addresses the lack of realism in existing large-scale Wizard-of-Oz dialogue datasets, which underrepresent utterance types such as misunderstandings, non-understandings, and vaguely related questions. The key to the solution is an automatic pipeline for generating synthetic communication errors, built on a simple, linguistically grounded error taxonomy and a two-step approach: a state-of-the-art large language model (LLM) first creates the error and then generates the repairing utterance. Language-model-based evaluation ensures the quality of the generated utterances. Applying the method to MultiWOZ yields CoPrUS-MultiWOZ, a dataset in which nearly 1,900 dialogues have been modified, to support future research on dialogue systems.

Link: https://arxiv.org/abs/2412.07515
Authors: Sebastian Steindl, Ulrich Schäfer, Bernd Ludwig
Keywords-EN: deep learning-based dialogue, enabled the training, training of deep, deep learning-based, learning-based dialogue systems
Subjects: Computation and Language (cs.CL)
Comments: Accepted at COLING 2025 (main, long paper)

Abstract:Large-scale Wizard-Of-Oz dialogue datasets have enabled the training of deep learning-based dialogue systems. While they are successful as benchmark datasets, they lack certain types of utterances, which would make them more realistic. In this work, we investigate the creation of synthetic communication errors in an automatic pipeline. Based on linguistic theory, we propose and follow a simple error taxonomy. We focus on three types of miscommunications that could happen in real-world dialogues but are underrepresented in the benchmark dataset: misunderstandings, non-understandings and vaguely related questions. Our two-step approach uses a state-of-the-art Large Language Model (LLM) to first create the error and secondly the repairing utterance. We perform Language Model-based evaluation to ensure the quality of the generated utterances. We apply the method to the MultiWOZ dataset and evaluate it both qualitatively and empirically as well as with human judges. Our results indicate that current LLMs can aid in adding post-hoc miscommunications to benchmark datasets as a form of data augmentation. We publish the resulting dataset, in which nearly 1900 dialogues have been modified, as CoPrUS-MultiWOZ to facilitate future work on dialogue systems.
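
The two-step generation could look roughly like this, with one LLM call creating the miscommunication and a second creating the repair; the error-type labels follow the abstract, while the prompts and the `llm` callable are illustrative assumptions.

```python
ERROR_TYPES = ["misunderstanding", "non-understanding", "vaguely related question"]

# Two-step synthesis: first create the miscommunication, then the repair.
def inject_miscommunication(dialogue: str, error_type: str, llm) -> str:
    error_turn = llm(
        f"Given this dialogue:\n{dialogue}\n"
        f"Write the user's next turn so that it exhibits a {error_type}."
    )
    repair_turn = llm(
        f"Dialogue so far:\n{dialogue}\nUser: {error_turn}\n"
        "Write a system turn that repairs the miscommunication."
    )
    return f"{dialogue}\nUser: {error_turn}\nSystem: {repair_turn}"
```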

[NLP-13] Bilingual BSARD: Extending Statutory Article Retrieval to Dutch COLING

[Quick Read]: This paper addresses the challenges that multilingual countries such as Belgium pose for legal information retrieval, in particular retrieving statutory articles across French and Dutch. The key to the solution is bBSARD, a bilingual version of the Belgian Statutory Article Retrieval Dataset containing parallel Belgian statutory articles in French and Dutch together with the legal questions from BSARD and their Dutch translations. Using bBSARD, the paper benchmarks a range of retrieval models, including lexical models, zero-shot dense models, and fine-tuned small foundation models. The experiments show that BM25 remains a competitive baseline in both languages and that, while proprietary models outperform open alternatives in the zero-shot setting, fine-tuned small language-specific models can match or surpass them.

Link: https://arxiv.org/abs/2412.07462
Authors: Ehsan Lotfi, Nikolay Banar, Nerses Yuzbashyan, Walter Daelemans
Keywords-EN: Statutory article retrieval, Belgian Statutory Article, article retrieval plays, making legal information, Belgian Statutory
Subjects: Computation and Language (cs.CL); Information Retrieval (cs.IR)
Comments: To be presented at RegNLP-2025 (COLING)

Abstract:Statutory article retrieval plays a crucial role in making legal information more accessible to both laypeople and legal professionals. Multilingual countries like Belgium present unique challenges for retrieval models due to the need for handling legal issues in multiple languages. Building on the Belgian Statutory Article Retrieval Dataset (BSARD) in French, we introduce the bilingual version of this dataset, bBSARD. The dataset contains parallel Belgian statutory articles in both French and Dutch, along with legal questions from BSARD and their Dutch translation. Using bBSARD, we conduct extensive benchmarking of retrieval models available for Dutch and French. Our benchmarking setup includes lexical models, zero-shot dense models, and fine-tuned small foundation models. Our experiments show that BM25 remains a competitive baseline compared to many zero-shot dense models in both languages. We also observe that while proprietary models outperform open alternatives in the zero-shot setting, they can be matched or surpassed by fine-tuning small language-specific models. Our dataset and evaluation code are publicly available.

[NLP-14] Causal World Representation in the GPT Model NEURIPS2024

[Quick Read]: This paper asks whether generative pre-trained transformer (GPT) models are merely trained to predict the next token, or whether they implicitly learn a world model from which sequences are generated one token at a time. The key to the solution is deriving a causal interpretation of the attention mechanism in GPT and, from it, a causal world model. The paper further proposes that GPT models can be used at inference time for zero-shot causal structure learning on in-distribution sequences. Experiments in a controlled synthetic environment based on the Othello board game show that the GPT model tends to generate rule-abiding next moves exactly when its attention mechanism encodes a causal structure with high confidence.

Link: https://arxiv.org/abs/2412.07446
Authors: Raanan Y. Rohekar, Yaniv Gurwicz, Sungduk Yu, Vasudev Lal
Keywords-EN: generative pre-trained transformer, trained to predict, implicitly learn, GPT, GPT model
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments: NeurIPS 2024 Workshop on Causality and Large Models (CaLM)

Abstract:Are generative pre-trained transformer (GPT) models only trained to predict the next token, or do they implicitly learn a world model from which a sequence is generated one token at a time? We examine this question by deriving a causal interpretation of the attention mechanism in GPT, and suggesting a causal world model that arises from this interpretation. Furthermore, we propose that GPT-models, at inference time, can be utilized for zero-shot causal structure learning for in-distribution sequences. Empirical evaluation is conducted in a controlled synthetic environment using the setup and rules of the Othello board game. A GPT, pre-trained on real-world games played with the intention of winning, is tested on synthetic data that only adheres to the game rules. We find that the GPT model tends to generate next moves that adhere to the game rules for sequences for which the attention mechanism encodes a causal structure with high confidence. In general, in cases for which the GPT model generates moves that do not adhere to the game rules, it also fails to capture any causal structure.

[NLP-15] Knowledge Graph Guided Evaluation of Abstention Techniques

[Quick Read]: This paper addresses how language models can safely abstain from responding to inappropriate requests. The key to the solution is evaluating the underlying techniques that cause models to abstain, in particular their generalization and specificity across concepts. The paper introduces SELECT, a benchmark derived from benign concepts (e.g., "rivers") in a knowledge graph, which isolates abstention techniques from other safety-training procedures. Benchmarking six open-weight and closed-source models shows that the examined techniques do make models abstain, with abstention rates above 80%, but they are less effective for descendants of the target concepts, where refusal rates drop by 19%. The study also characterizes generalization-versus-specificity trade-offs, finding that no single technique is consistently better than the others.

Link: https://arxiv.org/abs/2412.07430
Authors: Kinshuk Vasisht, Navreet Kaur, Danish Pruthi
Keywords-EN: language models safely, deploy language models, deploy language, responding to inappropriate, inappropriate requests
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:To deploy language models safely, it is crucial that they abstain from responding to inappropriate requests. Several prior studies test the safety promises of models based on their effectiveness in blocking malicious requests. In this work, we focus on evaluating the underlying techniques that cause models to abstain. We create SELECT, a benchmark derived from a set of benign concepts (e.g., “rivers”) from a knowledge graph. The nature of SELECT enables us to isolate the effects of abstention techniques from other safety training procedures, as well as evaluate their generalization and specificity. Using SELECT, we benchmark different abstention techniques over six open-weight and closed-source models. We find that the examined techniques indeed cause models to abstain with over 80% abstention rates. However, these techniques are not as effective for descendants of the target concepts, with refusal rates declining by 19%. We also characterize the generalization-vs-specificity trade-offs for different techniques. Overall, no single technique is invariably better than the others. Our findings call for a careful evaluation of different aspects of abstention, and hopefully inform practitioners of various trade-offs involved.

[NLP-16] Optimizing Alignment with Less: Leveraging Data Augmentation for Personalized Evaluation

[Quick Read]: This paper addresses the difficulty of aligning open large language models (LLMs) with human preference evaluation for personalized judgment tasks under limited-data scenarios. The key to the solution is a data augmentation technique that selects more effective samples from limited data so that an open LLM aligns better with human preferences. On a mathematical-reasoning evaluation task, the approach improves Pearson correlation with a reference judge by roughly 7% over the baseline and by 30% over the base model (Llama3.1-8B-Instruct), demonstrating the importance of augmenting and selecting effective preference data.

Link: https://arxiv.org/abs/2412.07429
Authors: Javad Seraj, Mohammad Mahdi Mohajeri, Mohammad Javad Dousti, Majid Nili Ahmadabadi
Keywords-EN: making adaptation challenging, prominent topic today, large language models, Automatic evaluation, topic today
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:Automatic evaluation by large language models (LLMs) is a prominent topic today; however, judgment and evaluation tasks are often subjective and influenced by various factors, making adaptation challenging. While many studies demonstrate the capabilities of state-of-the-art proprietary LLMs in comparison to human evaluators, they often struggle to adapt to reference evaluators over time, a requirement for achieving personalized judgment. Additionally, numerous works have attempted to apply open LLMs as judges or evaluators, but these efforts frequently overlook the limitations of working with scarce data. Personalized judgment is inherently associated with limited data scenarios, which are common in many real-world problems. Our work aims to present a data augmentation technique to select a more effective sample from limited data in order to align an open LLM with human preference. Our work achieves approximately 7% improvement in Pearson correlation with a reference judge over the baseline, and 30% improvement over the base model (Llama3.1-8B-Instruct) in the mathematical reasoning evaluation task, demonstrating that augmenting and selecting more effective preference data enables our approach to surpass baseline methods.

[NLP-17] RAG-based Question Answering over Heterogeneous Data and Text

[Quick Read]: This paper addresses unified question answering over unstructured text, structured tables, and knowledge graphs. The key to the solution is the QUASAR system, which adopts a retrieval-augmented generation (RAG) architecture with a pipeline of evidence retrieval followed by answer generation. Distinctively, QUASAR includes a question-understanding component that derives crisper input for evidence retrieval, and it re-ranks and filters the retrieved evidence before feeding the most informative pieces into answer generation, improving both accuracy and efficiency. Experiments show that the approach matches or exceeds large GPT models in answering quality while keeping computational cost and energy consumption orders of magnitude lower.

Link: https://arxiv.org/abs/2412.07420
Authors: Philipp Christmann, Gerhard Weikum
Keywords-EN: structured tables, unstructured text, knowledge graphs, article presents, unified treatment
Subjects: Computation and Language (cs.CL); Information Retrieval (cs.IR)
Comments: IEEE Data Engineering Bulletin – December 2024 Edition on RAG

Abstract:This article presents the QUASAR system for question answering over unstructured text, structured tables, and knowledge graphs, with unified treatment of all sources. The system adopts a RAG-based architecture, with a pipeline of evidence retrieval followed by answer generation, with the latter powered by a moderate-sized language model. Additionally and uniquely, QUASAR has components for question understanding, to derive crisper input for evidence retrieval, and for re-ranking and filtering the retrieved evidence before feeding the most informative pieces into the answer generation. Experiments with three different benchmarks demonstrate the high answering quality of our approach, being on par with or better than large GPT models, while keeping the computational cost and energy consumption orders of magnitude lower.
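
The pipeline shape described above can be summarized in a few lines; every component name below is a hypothetical placeholder for the corresponding QUASAR stage, not the system's actual API.

```python
# Assumed end-to-end shape of a QUASAR-like pipeline (interfaces invented):
def answer_question(question, understand, retrieve, rerank, generate, k=10):
    intent = understand(question)           # question understanding: crisper query
    evidence = retrieve(intent)             # unified pool: text + tables + KG facts
    best = rerank(question, evidence)[:k]   # keep only the most informative pieces
    return generate(question, best)         # moderate-sized LM writes the answer
```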

[NLP-18] Composing or Not Composing? Towards Distributional Construction Grammars

[Quick Read]: This paper addresses the mechanisms of meaning construction during language comprehension, in particular how to integrate compositional and non-compositional phenomena. The key to the solution is a framework based on Construction Grammars, which defines and represents meaning through a feature-structure representation and the interaction of constructions, frames, and events. The framework incorporates notions from distributional semantics, yielding Distributional Construction Grammars, and enables a meaning-building mechanism based on activation, similarity, and unification.

Link: https://arxiv.org/abs/2412.07419
Authors: Philippe Blache, Emmanuele Chersoni, Giulia Rambelli, Alessandro Lenci
Keywords-EN: language processing remains, Construction Grammars, comprehension during language, Distributional Construction Grammars, Grammars
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:The mechanisms of comprehension during language processing remains an open question. Classically, building the meaning of a linguistic utterance is said to be incremental, step-by-step, based on a compositional process. However, many different works have shown for a long time that non-compositional phenomena are also at work. It is therefore necessary to propose a framework bringing together both approaches. We present in this paper an approach based on Construction Grammars and completing this framework in order to account for these different mechanisms. We propose first a formal definition of this framework by completing the feature structure representation proposed in Sign-Based Construction Grammars. In a second step, we present a general representation of the meaning based on the interaction of constructions, frames and events. This framework opens the door to a processing mechanism for building the meaning based on the notion of activation evaluated in terms of similarity and unification. This new approach integrates features from distributional semantics into the constructionist framework, leading to what we call Distributional Construction Grammars.

[NLP-19] Generating Knowledge Graphs from Large Language Models: A Comparative Study of GPT-4, LLaMA 2 and BERT

[Quick Read]: This paper addresses the accuracy and scalability limitations of traditional methods for building knowledge graphs (KGs), which constrain their use in graph-based retrieval-augmented generation systems (GraphRAGs). The key to the solution is leveraging large language models (LLMs) such as GPT-4, LLaMA 2, and BERT to generate KGs directly from unstructured data, bypassing traditional construction pipelines. Evaluated with metrics such as Precision, Recall, F1-Score, Graph Edit Distance, and Semantic Similarity, GPT-4 achieves superior semantic fidelity and structural accuracy, LLaMA 2 excels at lightweight, domain-specific graphs, and BERT offers insights into the challenges of entity-relationship modeling, underscoring the potential of LLMs for high-quality KG creation.

Link: https://arxiv.org/abs/2412.07412
Authors: Ahan Bhatt, Nandan Vaghela, Kush Dudhia
Keywords-EN: Retrieval-Augmented Generative Systems, Generative Systems, tasks requiring structured, requiring structured reasoning, Retrieval-Augmented Generative
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Databases (cs.DB)
Comments: 4 pages, 4 figures, 3 tables

Abstract:Knowledge Graphs (KGs) are essential for the functionality of GraphRAGs, a form of Retrieval-Augmented Generative Systems (RAGs) that excel in tasks requiring structured reasoning and semantic understanding. However, creating KGs for GraphRAGs remains a significant challenge due to accuracy and scalability limitations of traditional methods. This paper introduces a novel approach leveraging large language models (LLMs) like GPT-4, LLaMA 2 (13B), and BERT to generate KGs directly from unstructured data, bypassing traditional pipelines. Using metrics such as Precision, Recall, F1-Score, Graph Edit Distance, and Semantic Similarity, we evaluate the models’ ability to generate high-quality KGs. Results demonstrate that GPT-4 achieves superior semantic fidelity and structural accuracy, LLaMA 2 excels in lightweight, domain-specific graphs, and BERT provides insights into challenges in entity-relationship modeling. This study underscores the potential of LLMs to streamline KG creation and enhance GraphRAG accessibility for real-world applications, while setting a foundation for future advancements.

[NLP-20] CMT: A Memory Compression Method for Continual Knowledge Learning of Large Language Models AAAI2025

[Quick Read]: This paper addresses the problem that large language models (LLMs), due to their massive size and high training costs, cannot be retrained frequently enough to keep pace with continuous changes in data, tasks, and user preferences. The key to the solution is Compression Memory Training (CMT), an efficient online adaptation framework that compresses and extracts information from new documents into a memory bank without changing model parameters; when answering queries related to those documents, the model aggregates the stored document memories to produce better answers. CMT further introduces three techniques, a memory-aware objective, self-matching, and top-aggregation, to strengthen the encoding, retrieval, and aggregation of memory, improving model adaptability and robustness.

Link: https://arxiv.org/abs/2412.07393
Authors: Dongfang Li, Zetian Sun, Xinshuo Hu, Baotian Hu, Min Zhang
Keywords-EN: Large Language Models, Large Language, Language Models, Large, Compression Memory Training
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: AAAI 2025; Pre-print

Abstract:Large Language Models (LLMs) need to adapt to the continuous changes in data, tasks, and user preferences. Due to their massive size and the high costs associated with training, LLMs are not suitable for frequent retraining. However, updates are necessary to keep them in sync with rapidly evolving human knowledge. To address these challenges, this paper proposes the Compression Memory Training (CMT) method, an efficient and effective online adaptation framework for LLMs that features robust knowledge retention capabilities. Inspired by human memory mechanisms, CMT compresses and extracts information from new documents to be stored in a memory bank. When answering queries related to these new documents, the model aggregates these document memories from the memory bank to better answer user questions. The parameters of the LLM itself do not change during training and inference, reducing the risk of catastrophic forgetting. To enhance the encoding, retrieval, and aggregation of memory, we further propose three new general and flexible techniques, including memory-aware objective, self-matching and top-aggregation. Extensive experiments conducted on three continual learning datasets (i.e., StreamingQA, SQuAD and ArchivalQA) demonstrate that the proposed method improves model adaptability and robustness across multiple base LLMs (e.g., +4.07 EM, +4.19 F1 in StreamingQA with Llama-2-7b).
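
Note that CMT compresses documents into learned memory representations rather than updating weights; the sketch below is only a text-level analogue of that compress-store-aggregate loop, with every name hypothetical.

```python
# Text-level analogue of CMT's loop (the paper uses learned memory
# representations, not text summaries; all names here are hypothetical).
memory_bank: list[str] = []

def ingest(document: str, llm) -> None:
    summary = llm(f"Compress this document into its key facts:\n{document}")
    memory_bank.append(summary)            # LLM weights themselves never change

def answer(query: str, llm, retrieve, k: int = 3) -> str:
    memories = retrieve(query, memory_bank, k)   # e.g. embedding similarity
    notes = "\n".join(memories)                  # aggregate document memories
    return llm(f"Using these notes:\n{notes}\n\nAnswer the question: {query}")
```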

[NLP-21] A Review of Challenges in Speech-based Conversational AI for Elderly Care

[Quick Read]: This paper addresses the practical use of speech-controlled AI systems in elderly healthcare, in particular how well such systems work for older adults and how older adults experience them. The key point is that realizing effective speech-controlled AI for elderly care requires in-depth research into older adults' actual experiences with voice-controlled AI and resolution of various user- and technology-centered issues, so that these systems can reliably support healthcare and remote monitoring.

Link: https://arxiv.org/abs/2412.07388
Authors: Willemijn Klaassen, Bram van Dijk, Marco Spruit
Keywords-EN: Artificially intelligent systems, intelligent systems optimized, Artificially intelligent, fast pace, intelligent systems
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Emerging Technologies (cs.ET)
Comments: Accepted for publication at Medical Informatics Europe 2025 conference, Glasgow. 5 pages, 1 figure, 1 table

Abstract:Artificially intelligent systems optimized for speech conversation are appearing at a fast pace. Such models are interesting from a healthcare perspective, as these voice-controlled assistants may support the elderly and enable remote health monitoring. The bottleneck for efficacy, however, is how well these devices work in practice and how the elderly experience them, but research on this topic is scant. We review elderly use of voice-controlled AI and highlight various user- and technology-centered issues, that need to be considered before effective speech-controlled AI for elderly care can be realized.

[NLP-22] Algorithmic Phase Transitions in Language Models: A Mechanistic Case Study of Arithmetic

[Quick Read]: This paper asks why large language models can perform some tasks zero-shot but not others. The key to the solution is defining and studying algorithmic stability, i.e., changes in the problem-solving strategy a model employs as the task specification changes. Analyzing Gemma-2-2b on closely related subtasks such as four-digit versus eight-digit addition, the paper finds that the model employs substantially different computational models for these subtasks, suggesting that algorithmic instability may contribute to language models' poor zero-shot performance on certain logical reasoning tasks.

Link: https://arxiv.org/abs/2412.07386
Authors: Alan Sun, Ethan Sun, Warren Shepard
Keywords-EN: explicit training, capabilities of large, make them powerful, powerful tools, tools for solving
Subjects: Computation and Language (cs.CL)
Comments: 10 pages, 5 figures

Abstract:Zero-shot capabilities of large language models make them powerful tools for solving a range of tasks without explicit training. It remains unclear, however, how these models achieve such performance, or why they can zero-shot some tasks but not others. In this paper, we shed some light on this phenomenon by defining and investigating algorithmic stability in language models – changes in problem-solving strategy employed by the model as a result of changes in task specification. We focus on a task where algorithmic stability is needed for generalization: two-operand arithmetic. Surprisingly, we find that Gemma-2-2b employs substantially different computational models on closely related subtasks, i.e. four-digit versus eight-digit addition. Our findings suggest that algorithmic instability may be a contributing factor to language models’ poor zero-shot performance across certain logical reasoning tasks, as they struggle to abstract different problem-solving strategies and smoothly transition between them.

[NLP-23] SpecFuse: Ensembling Large Language Models via Next-Segment Prediction

[Quick Read]: This paper addresses the problem that existing ensembles of generative large language models (LLMs) fail to exploit the models' collaborative potential to produce higher-quality responses. The key to the solution is SpecFuse, a framework in which LLMs collaborate by iteratively generating the next segment, through cyclic execution of inference and verification components: in each round, the inference component calls each base LLM in parallel to generate candidate segments, and the verification component calls the LLMs again to rank the segments; the top-ranked segment is then broadcast to all LLMs to encourage higher-quality segments in the next round. SpecFuse also introduces a model-exit mechanism that dynamically excludes models performing poorly in previous rounds, reducing computation while maintaining overall performance.

Link: https://arxiv.org/abs/2412.07380
Authors: Bo Lv, Chen Tang, Yanan Zhang, Xin Liu, Yue Yu, Ping Luo
Keywords-EN: generative large language, large language models, additional fusion model, generative large, large language
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 15 pages, 5 figures

Abstract:Ensembles of generative large language models (LLMs) can integrate the strengths of different LLMs to compensate for the limitations of individual models. However, recent work has focused on training an additional fusion model to combine complete responses from multiple LLMs, failing to tap into their collaborative potential to generate higher-quality responses. Moreover, as the additional fusion model is trained on a specialized dataset, these methods struggle with generalizing to open-domain queries from online users. In this paper, we propose SpecFuse, a novel ensemble framework that outputs the fused result by iteratively producing the next segment through collaboration among LLMs. This is achieved through cyclic execution of its inference and verification components. In each round, the inference component invokes each base LLM to generate candidate segments in parallel, and the verify component calls these LLMs again to predict the ranking of the segments. The top-ranked segment is then broadcast to all LLMs, encouraging them to generate higher-quality segments in the next round. This approach also allows the base LLMs to be plug-and-play, without any training or adaptation, avoiding generalization limitations. Furthermore, to conserve computational resources, we propose a model exit mechanism that dynamically excludes models exhibiting poor performance in previous rounds during each query response. In this way, it effectively reduces the number of model calls while maintaining overall performance.
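
One round of the inference-verify cycle can be sketched as follows; the `generate_segment`/`score` interface on each base model is an assumed abstraction, and the paper's model-exit mechanism is omitted for brevity.

```python
# Minimal SpecFuse-style loop: every model proposes a segment, every model
# ranks the proposals, and the winner becomes shared context.
def specfuse(prompt: str, models, rounds: int = 5) -> str:
    context = prompt
    for _ in range(rounds):
        candidates = [m.generate_segment(context) for m in models]   # inference
        totals = [sum(m.score(context, c) for m in models)           # verification
                  for c in candidates]
        best = candidates[totals.index(max(totals))]
        context += best               # broadcast the winning segment to all models
    return context[len(prompt):]
```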

[NLP-24] My Words Imply Your Opinion: Reader Agent-Based Propagation Enhancement for Personalized Implicit Emotion Analysis

[Quick Read]: This paper addresses the neglect of reader feedback in implicit emotion analysis (IEA), where the intended reader can influence the reaction to implicit emotions. The key to the solution is refining IEA into Personalized Implicit Emotion Analysis (PIEA) and introducing the RAPPIE model, which tackles missing user information in three steps: (1) creating reader agents based on a large language model to simulate reader reactions, addressing the spiral of silence and data incompleteness encountered when acquiring reader feedback; (2) establishing a reader propagation role system and a role-aware emotion propagation multi-view graph learning model that handles the sparsity of reader information via the distribution of propagation roles; and (3) annotating two Chinese PIEA datasets with detailed user metadata, remedying the text-only annotation of previous datasets. Experiments show that RAPPIE outperforms state-of-the-art baselines, confirming the value of incorporating reader feedback into IEA.

Link: https://arxiv.org/abs/2412.07367
Authors: Jian Liao, Yu Feng, Xiaoyu Wang, Suge Wang, Jianxing Zheng, Deyu Li
Keywords-EN: implicit emotion analysis, emotional expressions makes, Personalized Implicit Emotion, emotion analysis, user-specific characteristics
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:In implicit emotion analysis (IEA), the subtlety of emotional expressions makes it particularly sensitive to user-specific characteristics. Existing studies often inject personalization into the analysis by focusing on the authorial dimension of the emotional text. However, these methods overlook the potential influence of the intended reader on the reaction of implicit emotions. In this paper, we refine the IEA task to Personalized Implicit Emotion Analysis (PIEA) and introduce the RAPPIE model, a novel framework designed to address the issue of missing user information within this task. In particular, 1) we create reader agents based on the Large Language Model to simulate reader reactions, to address challenges of the spiral of silence and data incompleteness encountered when acquiring reader feedback information. 2) We establish a reader propagation role system and develop a role-aware emotion propagation multi-view graph learning model, which effectively deals with the sparsity of reader information by utilizing the distribution of propagation roles. 3) We annotate two Chinese PIEA datasets with detailed user metadata, thereby addressing the limitation of prior datasets that primarily focus on textual content annotation. Extensive experiments on these datasets indicate that the RAPPIE model outperforms current state-of-the-art baselines, highlighting the significance and efficacy of incorporating reader feedback into the PIEA process.

[NLP-25] Towards Predictive Communication with Brain-Computer Interfaces integrating Large Language Models

[Quick Read]: This perspective article surveys the state of the art and future directions for integrating advanced predictive language models, including large language models (LLMs), with brain-computer interface (BCI) systems. The key point is that pretrained autoregressive transformer models such as GPT, which benefit from parallelization, pretraining, and fine-tuning, promise substantial improvements in the efficiency and performance of BCI-based communication. In particular, GPT-2 appears to be an excellent candidate, although it has so far been tested only on simulated conversations rather than real BCI scenarios; full integration of LLMs with BCI could drive a leap toward fast, efficient, and user-adaptive neurotechnology.

Link: https://arxiv.org/abs/2412.07355
Authors: Andrea Caria
Keywords-EN: perspective article aims, BCI, cutting-edge predictive language, LLM, perspective article
Subjects: Human-Computer Interaction (cs.HC); Computation and Language (cs.CL)
Comments:

Abstract:This perspective article aims at providing an outline of the state of the art and future developments towards the integration of cutting-edge predictive language models with BCI. A synthetic overview of early and more recent linguistic models, from natural language processing (NLP) models to recent LLM, that to a varying extent improved predictive writing systems, is first provided. Second, a summary of previous BCI implementations integrating language models is presented. The few preliminary studies investigating the possible combination of LLM with BCI spellers to efficiently support fast communication and control are then described. Finally, current challenges and limitations towards the full integration of LLM with BCI systems are discussed. Recent investigations suggest that the combination of LLM with BCI might drastically improve human-computer interaction in patients with motor or language disorders as well as in healthy individuals. In particular, the pretrained autoregressive transformer models, such as GPT, that capitalize on parallelization, learning through pre-training and fine-tuning, promise a substantial improvement of BCI for communication with respect to previous systems incorporating simpler language models. Indeed, among various models, the GPT-2 was shown to represent an excellent candidate for its integration into BCI, although testing was only performed on simulated conversations and not on real BCI scenarios. Prospectively, the full integration of LLM with advanced BCI systems might lead to a big leap forward towards fast, efficient and user-adaptive neurotechnology.

[NLP-26] Frame Representation Hypothesis: Multi-Token LLM Interpretability and Concept-Guided Text Generation

[Quick Read]: This paper addresses the interpretability of large language models (LLMs), in particular how to extract reasoning from model parameters to foster user trust. The key to the solution is the Frame Representation Hypothesis, a framework grounded in the Linear Representation Hypothesis (LRH) that extends it to multi-token analysis, enabling use on arbitrary textual data with thousands of concepts. Concretely, words are interpreted as frames, ordered sequences of vectors that better capture token-word relationships, and concepts are represented as the average of the word frames sharing a concept. Top-k Concept-Guided Decoding then steers text generation intuitively using chosen concepts. Validation on the Llama 3.1, Gemma 2, and Phi 3 families reveals gender and language biases and harmful content, but also the potential to remediate them, leading to safer and more transparent LLMs.

Link: https://arxiv.org/abs/2412.07334
Authors: Pedro H. V. Valois, Lincon S. Souza, Erica K. Shimomoto, Kazuhiro Fukui
Keywords-EN: Large Language Models, Linear Representation Hypothesis, trust for Large, model parameters, Representation Hypothesis
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:Interpretability is a key challenge in fostering trust for Large Language Models (LLMs), which stems from the complexity of extracting reasoning from model’s parameters. We present the Frame Representation Hypothesis, a theoretically robust framework grounded in the Linear Representation Hypothesis (LRH) to interpret and control LLMs by modeling multi-token words. Prior research explored LRH to connect LLM representations with linguistic concepts, but was limited to single token analysis. As most words are composed of several tokens, we extend LRH to multi-token words, thereby enabling usage on any textual data with thousands of concepts. To this end, we propose words can be interpreted as frames, ordered sequences of vectors that better capture token-word relationships. Then, concepts can be represented as the average of word frames sharing a common concept. We showcase these tools through Top-k Concept-Guided Decoding, which can intuitively steer text generation using concepts of choice. We verify said ideas on Llama 3.1, Gemma 2, and Phi 3 families, demonstrating gender and language biases, exposing harmful content, but also potential to remediate them, leading to safer and more transparent LLMs. Code is available at this https URL
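
A toy numerical illustration of the hypothesis, assuming frames with the same number of token vectors so that they can be averaged elementwise; the random embeddings stand in for an LLM's real token embeddings.

```python
import numpy as np

# A word is a "frame": an ordered stack of its token vectors. A concept is
# the mean frame of the words sharing it. All vectors here are random
# stand-ins; a real setup would read the LLM's embedding matrix.
dim, rng = 8, np.random.default_rng(0)

def frame(token_vectors):
    return np.stack(token_vectors)                 # shape: (num_tokens, dim)

kings = frame([rng.normal(size=dim), rng.normal(size=dim)])    # e.g. 'ki' + 'ngs'
queens = frame([rng.normal(size=dim), rng.normal(size=dim)])
royalty = (kings + queens) / 2                     # concept = average word frame

def frame_similarity(f: np.ndarray, g: np.ndarray) -> float:
    """Normalized Frobenius inner product; a score like this could rank
    candidate continuations for top-k concept-guided decoding."""
    return float(np.trace(f @ g.T) / (np.linalg.norm(f) * np.linalg.norm(g)))

print(frame_similarity(kings, royalty))            # 'kings' is close to 'royalty'
```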

[NLP-27] Filipino Benchmarks for Measuring Sexist and Homophobic Bias in Multilingual Language Models from Southeast Asia COLING2025

[Quick Read]: This paper addresses sexist and anti-queer bias in multilingual models handling a low-resource language, Filipino. The key to the solution is introducing Filipino CrowS-Pairs and Filipino WinoQueer, benchmarks created by culturally adapting English bias-evaluation datasets and comprising 7,074 new challenge pairs for assessing gender and anti-queer bias in pretrained language models (PLMs). The study also finds that, for multilingual models, the extent of bias learned for a particular language depends on how much pretraining data in that language the model was exposed to. The benchmarks and findings provide a foundation for future work on analyzing and mitigating bias in multilingual models.

Link: https://arxiv.org/abs/2412.07303
Authors: Lance Calvin Lim Gamboa, Mark Lee
Keywords-EN: high NLP resources, NLP resources, high NLP, confirm the presence, presence of gender-related
Subjects: Computation and Language (cs.CL)
Comments: Accepted for presentation at The First Workshop on Language Models for Low-Resource Languages (LoResLM) at The 31st International Conference on Computational Linguistics (COLING 2025)

Abstract:Bias studies on multilingual models confirm the presence of gender-related stereotypes in masked models processing languages with high NLP resources. We expand on this line of research by introducing Filipino CrowS-Pairs and Filipino WinoQueer: benchmarks that assess both sexist and anti-queer biases in pretrained language models (PLMs) handling texts in Filipino, a low-resource language from the Philippines. The benchmarks consist of 7,074 new challenge pairs resulting from our cultural adaptation of English bias evaluation datasets, a process that we document in detail to guide similar forthcoming efforts. We apply the Filipino benchmarks on masked and causal multilingual models, including those pretrained on Southeast Asian data, and find that they contain considerable amounts of bias. We also find that for multilingual models, the extent of bias learned for a particular language is influenced by how much pretraining data in that language a model was exposed to. Our benchmarks and insights can serve as a foundation for future work analyzing and mitigating bias in multilingual models.

[NLP-28] he Rise and Down of Babel Tower: Investigating the Evolution Process of Multilingual Code Large Language Model

[Quick Read]: This paper addresses the poorly understood mechanisms by which multilingual capabilities develop in large language models (LLMs) during pre-training. The key to the solution is the Babel Tower Hypothesis, which describes how LLMs acquire new language capabilities: during learning, multiple languages initially share a single knowledge system dominated by the primary language and gradually develop language-specific knowledge systems. The hypothesis is validated by tracking the models' internal states, identifying working languages and language-transferring neurons. Building on these insights, the paper proposes a new method for constructing an optimized pre-training corpus for multilingual code LLMs, which significantly outperforms models trained on the original corpus.

Link: https://arxiv.org/abs/2412.07298
Authors: Jiawei Chen, Wentao Chen, Jing Su, Jingjing Xu, Hongyu Lin, Mengjie Ren, Yaojie Lu, Xianpei Han, Le Sun
Keywords-EN: Large language models, Babel Tower Hypothesis, shown significant multilingual, Babel Tower, Tower Hypothesis
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:Large language models (LLMs) have shown significant multilingual capabilities. However, the mechanisms underlying the development of these capabilities during pre-training are not well understood. In this paper, we use code LLMs as an experimental platform to explore the evolution of multilingual capabilities in LLMs during the pre-training process. Based on our observations, we propose the Babel Tower Hypothesis, which describes the entire process of LLMs acquiring new language capabilities. During the learning process, multiple languages initially share a single knowledge system dominated by the primary language and gradually develop language-specific knowledge systems. We then validate the above hypothesis by tracking the internal states of the LLMs through identifying working languages and language transferring neurons. Experimental results show that the internal state changes of the LLM are consistent with our Babel Tower Hypothesis. Building on these insights, we propose a novel method to construct an optimized pre-training corpus for multilingual code LLMs, which significantly outperforms LLMs trained on the original corpus. The proposed Babel Tower Hypothesis provides new insights into designing pre-training data distributions to achieve optimal multilingual capabilities in LLMs.
zh

[NLP-29] Multimodal Sentiment Analysis Based on Causal Reasoning

【速读】: This paper tackles the unimodal data bias problem in multimodal sentiment analysis, in particular misleading textual sentiment semantics that lower final classification accuracy. The key is the proposed CounterFactual Multimodal Sentiment Analysis framework (CF-MSA), which uses causal counterfactual inference to construct multimodal sentiment causal reasoning. CF-MSA mitigates the direct effect of unimodal bias and preserves heterogeneity across modalities by differentiating the treatment variables between modalities. In addition, considering the complementary information and differing biases across modalities, the paper proposes a new optimization objective that effectively integrates the modalities while reducing each modality's inherent bias. Experiments show that CF-MSA has superior debiasing capability and achieves new state-of-the-art performance on two public datasets.

链接: https://arxiv.org/abs/2412.07292
作者: Fuhai Chen,Pengpeng Huang,Xuri Ge,Jie Huang,Zishuo Bao
关键词-EN: multimodal sentiment analysis, textual sentiment analysis, image-text sentiment analysis, multimodal image-text sentiment, sentiment analysis
类目: Multimedia (cs.MM); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:With the rapid development of multimedia, the shift from unimodal textual sentiment analysis to multimodal image-text sentiment analysis has obtained academic and industrial attention in recent years. However, multimodal sentiment analysis is affected by unimodal data bias, e.g., text sentiment is misleading due to explicit sentiment semantic, leading to low accuracy in the final sentiment classification. In this paper, we propose a novel CounterFactual Multimodal Sentiment Analysis framework (CF-MSA) using causal counterfactual inference to construct multimodal sentiment causal inference. CF-MSA mitigates the direct effect from unimodal bias and ensures heterogeneity across modalities by differentiating the treatment variables between modalities. In addition, considering the information complementarity and bias differences between modalities, we propose a new optimisation objective to effectively integrate different modalities and reduce the inherent bias from each modality. Experimental results on two public datasets, MVSA-Single and MVSA-Multiple, demonstrate that the proposed CF-MSA has superior debiasing capability and achieves new state-of-the-art performances. We will release the code and datasets to facilitate future research.
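The abstract does not spell out CF-MSA's equations, but the counterfactual-debiasing pattern it invokes can be sketched generically: run a text-only counterfactual pass to approximate the unimodal direct effect, then subtract it from the fused prediction. The two-branch setup and the `alpha` weight below are illustrative assumptions, not the paper's actual objective.

```python
import torch
import torch.nn.functional as F

def counterfactual_debias(logits_multimodal: torch.Tensor,
                          logits_text_only: torch.Tensor,
                          alpha: float = 1.0) -> torch.Tensor:
    """Illustrative counterfactual debiasing for multimodal sentiment.

    logits_multimodal: [batch, n_classes] from the full image+text model (total effect).
    logits_text_only:  [batch, n_classes] from a counterfactual pass with the image
                       absent/neutralized (approximates the text-only direct effect).
    Returns debiased logits: total effect minus alpha * unimodal direct effect.
    """
    return logits_multimodal - alpha * logits_text_only

# Toy usage: the text branch alone is confidently (and spuriously) positive.
mm = torch.tensor([[2.0, 0.5, 0.2]])   # fused prediction
txt = torch.tensor([[1.8, 0.1, 0.0]])  # text-only counterfactual
print(F.softmax(counterfactual_debias(mm, txt), dim=-1))
```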
zh

[NLP-30] Enhancing Relation Extraction via Supervised Rationale Verification and Feedback AAAI2025

【速读】: This paper addresses the limitation that existing automated feedback methods, because of their designated feedback objectives and correction manner, cannot be applied effectively to the relation extraction (RE) task. The key is a novel automated feedback framework that introduces a rationale supervisor to verify the rationale and provide re-selected demonstrations as feedback for correcting the initial prediction. Concretely, the framework collects biased/unbiased rationales through a causal intervention and observation method, uses them to contrastively train the rationale supervisor, and then applies a verification-feedback-correction procedure to iteratively strengthen LLMs on the RE task.

链接: https://arxiv.org/abs/2412.07289
作者: Yongqi Li,Xin Miao,Shen Zhou,Mayi Xu,Yuyang Ren,Tieyun Qian
关键词-EN: large language models, designated feedback objectives, language models, relation extraction, correction manner
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Accepted to AAAI 2025, camera ready version

点击查看摘要

Abstract:Despite the rapid progress that existing automated feedback methods have made in correcting the output of large language models (LLMs), these methods cannot be well applied to the relation extraction (RE) task due to their designated feedback objectives and correction manner. To address this problem, we propose a novel automated feedback framework for RE, which presents a rationale supervisor to verify the rationale and provide re-selected demonstrations as feedback to correct the initial prediction. Specifically, we first design a causal intervention and observation method to collect biased/unbiased rationales for contrastively training the rationale supervisor. Then, we present a verification-feedback-correction procedure to iteratively enhance LLMs’ capability of handling the RE task. Extensive experiments prove that our proposed framework significantly outperforms existing methods.
zh

[NLP-31] HARP: Hesitation-Aware Reframing in Transformer Inference Pass

【速读】: This paper addresses the uneven computational demands across inference steps in large language models, where some tokens require more computation than others. The key is HARP, a simple modification to the Transformer forward pass that, drawing on hesitation and the framing effect in decision-making, selectively applies extra computation when the model is uncertain during token generation. HARP mimics human cognition by pausing at difficult decision points and reframing the input to obtain a different perspective. Its main advantages are that it is model-agnostic, training-free, and easy to implement, delivering performance gains of up to +5.16% while keeping inference twice as fast as beam search.

链接: https://arxiv.org/abs/2412.07282
作者: Romain Storaï,Seung-won Hwang
关键词-EN: variable computational demands, paper aims, aims to improve, addressing the variable, Transformer forward pass
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:This paper aims to improve the performance of large language models by addressing the variable computational demands in inference steps, where some tokens require more computational resources than others. We present HARP, a simple modification to “off-the-shelf” Transformer forward pass. Drawing from hesitation and the framing effect in decision-making, HARP selectively applies additional computation when the model encounters uncertainty during token generation. Our method mimics human cognitive processes by pausing at difficult decision points and reframing inputs for a different perspective. Unlike other approaches, HARP is model-agnostic, training-free, and easy to implement. We thoroughly evaluate our method across various downstream tasks and model sizes, demonstrating performance improvements up to +5.16%. Notably, HARP achieves these gains while keeping inference twice as fast as beam search. Simple and yet with significant gains, HARP offers a practical solution for enhancing the performance of Transformer-based language models with minimal computational impact.
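A minimal sketch of the hesitation-gated computation HARP describes, assuming a HuggingFace-style causal LM: measure next-token entropy, and only when it crosses a threshold run a second forward pass on a perturbed ("reframed") input and average the two distributions. The entropy threshold, the dropout-based reframing, and the 50/50 combination are stand-in assumptions; the paper's actual reframing operator is not reproduced here.

```python
import torch

@torch.no_grad()
def harp_style_step(model, input_ids, entropy_threshold=2.0, drop_p=0.1):
    """One decoding step with hesitation-aware extra computation (illustrative)."""
    logits = model(input_ids=input_ids).logits[:, -1, :]
    probs = torch.softmax(logits, dim=-1)
    entropy = -(probs * probs.clamp_min(1e-9).log()).sum(-1)  # nats, batch size 1

    if entropy.item() > entropy_threshold:          # the model "hesitates"
        emb = model.get_input_embeddings()(input_ids)
        emb = torch.nn.functional.dropout(emb, p=drop_p, training=True)  # crude reframe
        logits2 = model(inputs_embeds=emb).logits[:, -1, :]
        probs = 0.5 * (probs + torch.softmax(logits2, dim=-1))  # combine both views

    return probs.argmax(dim=-1)                     # greedy next token
```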
zh

[NLP-32] Label-Confidence-Aware Uncertainty Estimation in Natural Language Generation

【速读】: This paper targets the hallucination risk of generative large language models (LLMs), and in particular the biased classification outcomes introduced by greedy decoding. The key is a label-confidence-aware (LCA) uncertainty estimation method that uses Kullback-Leibler (KL) divergence to bridge the gap between samples and the label source, improving the reliability and stability of uncertainty assessment. The method effectively captures differences across sampling results and label sources, leading to more accurate classification outcomes.

链接: https://arxiv.org/abs/2412.07255
作者: Qinhong Lin,Linna Zhou,Zhongliang Yang,Yuang Cai
关键词-EN: Large Language Models, Large Language, display formidable capabilities, pose potential risks, potential risks due
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Large Language Models (LLMs) display formidable capabilities in generative tasks but also pose potential risks due to their tendency to generate hallucinatory responses. Uncertainty Quantification (UQ), the evaluation of model output reliability, is crucial for ensuring the safety and robustness of AI systems. Recent studies have concentrated on model uncertainty by analyzing the relationship between output entropy under various sampling conditions and the corresponding labels. However, these methods primarily focus on measuring model entropy with precision to capture response characteristics, often neglecting the uncertainties associated with greedy decoding results-the sources of model labels, which can lead to biased classification outcomes. In this paper, we explore the biases introduced by greedy decoding and propose a label-confidence-aware (LCA) uncertainty estimation based on Kullback-Leibler (KL) divergence bridging between samples and label source, thus enhancing the reliability and stability of uncertainty assessments. Our empirical evaluations across a range of popular LLMs and NLP datasets reveal that different label sources can indeed affect classification, and that our approach can effectively capture differences in sampling results and label sources, demonstrating more effective uncertainty estimation.
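The abstract leaves the estimator's exact form open, so the following is only a schematic reading of a label-confidence-aware score: an answer distribution from stochastic samples, a "label source" distribution centered on the greedy decode's confidence, and a KL term bridging the two added to plain sample entropy. Every formula choice below is an assumption.

```python
from collections import Counter
import math

def lca_uncertainty(sampled_answers, greedy_answer, greedy_confidence):
    """Schematic label-confidence-aware uncertainty (not the paper's exact formula).

    sampled_answers:   answers from several stochastic decodes, e.g. ["A", "A", "B"]
    greedy_answer:     answer from greedy decoding (the label source)
    greedy_confidence: model probability assigned to the greedy answer, in (0, 1]
    """
    counts = Counter(sampled_answers)
    n = len(sampled_answers)
    p = {a: c / n for a, c in counts.items()}

    # Sample entropy: disagreement among stochastic decodes.
    entropy = -sum(q * math.log(q) for q in p.values())

    # Label-source distribution: confidence mass on the greedy answer,
    # remainder spread over the other observed answers.
    others = [a for a in p if a != greedy_answer]
    q_label = {greedy_answer: greedy_confidence}
    for a in others:
        q_label[a] = (1.0 - greedy_confidence) / max(len(others), 1)

    # KL(p_samples || q_label): mismatch between samples and the label source.
    kl = sum(p[a] * math.log(p[a] / max(q_label.get(a, 1e-9), 1e-9)) for a in p)
    return entropy + kl  # higher = less trustworthy output

print(lca_uncertainty(["A", "A", "B", "A"], "A", 0.7))
```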
zh

[NLP-33] KULTURE Bench: A Benchmark for Assessing Language Model in Korean Cultural Context

【速读】: This paper addresses the Western cultural bias that current multilingual evaluation benchmarks may carry when assessing non-English languages such as Korean. The key is KULTURE Bench, an evaluation framework designed specifically for Korean culture. It features datasets of cultural news, idioms, and poetry, and assesses language models' cultural comprehension and reasoning at the word, sentence, and paragraph levels. Applying the framework, the study finds that existing models still have substantial room for improvement in understanding texts related to the deeper aspects of Korean culture.

链接: https://arxiv.org/abs/2412.07251
作者: Xiaonan Wang,Jinyoung Yeo,Joon-Ho Lim,Hansaem Kim
关键词-EN: Large language models, exhibited significant enhancements, Large language, enhancements in performance, KULTURE Bench
类目: Computation and Language (cs.CL)
备注: Accepted by the 38th Pacific Asia Conference on Language, Information and Computation

点击查看摘要

Abstract:Large language models have exhibited significant enhancements in performance across various tasks. However, the complexity of their evaluation increases as these models generate more fluent and coherent content. Current multilingual benchmarks often use translated English versions, which may incorporate Western cultural biases that do not accurately assess other languages and cultures. To address this research gap, we introduce KULTURE Bench, an evaluation framework specifically designed for Korean culture that features datasets of cultural news, idioms, and poetry. It is designed to assess language models’ cultural comprehension and reasoning capabilities at the word, sentence, and paragraph levels. Using the KULTURE Bench, we assessed the capabilities of models trained with different language corpora and analyzed the results comprehensively. The results show that there is still significant room for improvement in the models’ understanding of texts related to the deeper aspects of Korean culture.
zh

[NLP-34] Filling Memory Gaps: Enhancing Continual Semantic Parsing via SQL Syntax Variance-Guided LLMs without Real Data Replay

【速读】: This paper addresses continual semantic parsing (CSP) across tasks with limited annotated samples in the setting of dynamically updated databases. The key is LECSP, a large language model (LLM)-enhanced continual semantic parsing method. It analyzes the SQL-syntax commonalities and differences between tasks to guide the LLM in reconstructing key memories, improves memory accuracy through a calibration strategy, and employs a task-aware dual-teacher distillation framework to promote the accumulation and transfer of knowledge. The method requires neither real data replay nor idealized continual-learning settings, significantly improves results on CSP benchmarks, and generalizes to unseen tasks.

链接: https://arxiv.org/abs/2412.07246
作者: Ruiheng Liu,Jinyu Zhang,Yanqi Song,Yu Zhang,Bailong Yang
关键词-EN: dynamically updated databases, Continual Semantic Parsing, convert natural language, natural language questions, Continuous Semantic Parsing
类目: Computation and Language (cs.CL); Databases (cs.DB)
备注:

点击查看摘要

Abstract:Continual Semantic Parsing (CSP) aims to train parsers to convert natural language questions into SQL across tasks with limited annotated examples, adapting to the real-world scenario of dynamically updated databases. Previous studies mitigate this challenge by replaying historical data or employing parameter-efficient tuning (PET), but they often violate data privacy or rely on ideal continual learning settings. To address these problems, we propose a new Large Language Model (LLM)-Enhanced Continuous Semantic Parsing method, named LECSP, which alleviates forgetting while encouraging generalization, without requiring real data replay or ideal settings. Specifically, it first analyzes the commonalities and differences between tasks from the SQL syntax perspective to guide LLMs in reconstructing key memories and improving memory accuracy through a calibration strategy. Then, it uses a task-aware dual-teacher distillation framework to promote the accumulation and transfer of knowledge during sequential training. Experimental results on two CSP benchmarks show that our method significantly outperforms existing methods, even those utilizing data replay or ideal settings. Additionally, we achieve generalization performance beyond the upper limits, better adapting to unseen tasks.
zh

[NLP-35] Speaker effects in spoken language comprehension

【速读】: This review examines how speaker identity substantially shapes spoken language comprehension, focusing on how speaker information influences language processing. The key is an integrative model that emphasizes the interplay between bottom-up, perception-based processes driven by acoustic detail and top-down, expectation-based processes driven by a speaker model. Acoustic details influence lower-level perception, while the speaker model modulates both lower-level and higher-level processes such as meaning interpretation and pragmatic inference. By defining speaker-idiosyncrasy and speaker-demographics effects, the review shows how bottom-up and top-down processes interact at different levels across scenarios. This framework contributes to psycholinguistic theory a comprehensive account of how speaker information interacts with linguistic content to shape message construction, and suggests that speaker effects can serve as indices of a language learner's proficiency and of an individual's social-cognitive characteristics.

链接: https://arxiv.org/abs/2412.07238
作者: Hanlin Wu,Zhenguang G. Cai
关键词-EN: spoken language comprehension, significantly influences spoken, influences spoken language, comprehension by affecting, speaker
类目: Computation and Language (cs.CL); Neurons and Cognition (q-bio.NC)
备注: 44 pages, 1 figure

点击查看摘要

Abstract:The identity of a speaker significantly influences spoken language comprehension by affecting both perception and expectation. This review explores speaker effects, focusing on how speaker information impacts language processing. We propose an integrative model featuring the interplay between bottom-up perception-based processes driven by acoustic details and top-down expectation-based processes driven by a speaker model. The acoustic details influence lower-level perception, while the speaker model modulates both lower-level and higher-level processes such as meaning interpretation and pragmatic inferences. We define speaker-idiosyncrasy and speaker-demographics effects and demonstrate how bottom-up and top-down processes interact at various levels in different scenarios. This framework contributes to psycholinguistic theory by offering a comprehensive account of how speaker information interacts with linguistic content to shape message construction. We suggest that speaker effects can serve as indices of a language learner’s proficiency and an individual’s characteristics of social cognition. We encourage future research to extend these findings to AI speakers, probing the universality of speaker effects across humans and artificial agents.
zh

[NLP-36] Comateformer: Combined Attention Transformer for Semantic Sentence Matching ECAI2024

【速读】: This paper addresses Transformer models' difficulty in capturing the subtle differences between sentences in semantic matching tasks. The key is a new semantic sentence matching model, the Combined Attention Network based on the Transformer model (Comateformer). It designs a Transformer-based quasi-attention mechanism with compositional properties that learns how to combine, subtract, or resize specific vectors when building representations, rather than merely adjusting the weights of input tokens. In addition, the method builds dual affinity scores on the intuition of similarity and dissimilarity (negative affinity), representing relationships between sentences more effectively. Experiments show consistent improvements on multiple public datasets.

链接: https://arxiv.org/abs/2412.07220
作者: Bo Li,Di Liang,Zixin Zhang
关键词-EN: made significant strides, made significant, significant strides, tasks by capturing, capturing connections
类目: Computation and Language (cs.CL)
备注: This paper is accepted by 27th EUROPEAN CONFERENCE ON ARTIFICIAL INTELLIGENCE (ECAI 2024)

点击查看摘要

Abstract:The Transformer-based model have made significant strides in semantic matching tasks by capturing connections between phrase pairs. However, to assess the relevance of sentence pairs, it is insufficient to just examine the general similarity between the sentences. It is crucial to also consider the tiny subtleties that differentiate them from each other. Regrettably, attention softmax operations in transformers tend to miss these subtle differences. To this end, in this work, we propose a novel semantic sentence matching model named Combined Attention Network based on Transformer model (Comateformer). In Comateformer model, we design a novel transformer-based quasi-attention mechanism with compositional properties. Unlike traditional attention mechanisms that merely adjust the weights of input tokens, our proposed method learns how to combine, subtract, or resize specific vectors when building a representation. Moreover, our proposed approach builds on the intuition of similarity and dissimilarity (negative affinity) when calculating dual affinity scores. This allows for a more meaningful representation of relationships between sentences. To evaluate the performance of our proposed model, we conducted extensive experiments on ten public real-world datasets and robustness testing. Experimental results show that our method achieves consistent improvements.
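An illustrative quasi-attention head in the compositional spirit the abstract describes: next to the usual softmax weights, a negative-affinity term lets the head subtract value vectors rather than only re-weight them. The sigmoid gate and the particular combination below are assumptions, not the paper's formulation.

```python
import torch
import torch.nn as nn

class QuasiAttentionHead(nn.Module):
    """Single head of an illustrative quasi-attention (combine/subtract values)."""
    def __init__(self, d_model: int, d_head: int):
        super().__init__()
        self.q = nn.Linear(d_model, d_head)
        self.k = nn.Linear(d_model, d_head)
        self.v = nn.Linear(d_model, d_head)
        self.lam = nn.Parameter(torch.zeros(1))  # learned mixing scalar

    def forward(self, x):                         # x: [batch, seq, d_model]
        q, k, v = self.q(x), self.k(x), self.v(x)
        scores = q @ k.transpose(-2, -1) / k.shape[-1] ** 0.5
        attn_pos = torch.softmax(scores, dim=-1)   # similarity weights
        attn_neg = torch.softmax(-scores, dim=-1)  # dissimilarity (negative affinity)
        gate = torch.sigmoid(self.lam)             # how much to subtract
        combined = attn_pos - gate * attn_neg      # weights may go negative
        return combined @ v
```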
zh

[NLP-37] MAPLE: A Framework for Active Preference Learning Guided by Large Language Models

【速读】: This paper addresses the heavy computational burden, costly human supervision, and lack of interpretability of existing preference-learning methods. The key is MAPLE, a framework for large language model (LLM)-guided Bayesian active preference learning. MAPLE uses LLMs to model a distribution over preference functions, conditioning it on both natural language feedback and conventional preference feedback such as pairwise trajectory rankings, and employs active learning to systematically reduce uncertainty in this distribution. A language-conditioned active query selection mechanism identifies informative, easy-to-answer queries, reducing the human burden. Evaluations on two benchmarks, including a real-world vehicle route planning benchmark built on OpenStreetMap data, show improved sample efficiency and preference-inference quality.

链接: https://arxiv.org/abs/2412.07207
作者: Saaduddin Mahmud,Mason Nakamura,Shlomo Zilberstein
关键词-EN: sparked significant interest, sparked significant, significant interest, MAPLE, preference
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:The advent of large language models (LLMs) has sparked significant interest in using natural language for preference learning. However, existing methods often suffer from high computational burdens, taxing human supervision, and lack of interpretability. To address these issues, we introduce MAPLE, a framework for large language model-guided Bayesian active preference learning. MAPLE leverages LLMs to model the distribution over preference functions, conditioning it on both natural language feedback and conventional preference learning feedback, such as pairwise trajectory rankings. MAPLE also employs active learning to systematically reduce uncertainty in this distribution and incorporates a language-conditioned active query selection mechanism to identify informative and easy-to-answer queries, thus reducing human burden. We evaluate MAPLE’s sample efficiency and preference inference quality across two benchmarks, including a real-world vehicle route planning benchmark using OpenStreetMap data. Our results demonstrate that MAPLE accelerates the learning process and effectively improves humans’ ability to answer queries.
zh

[NLP-38] A Review on the Applications of Transformer-based language models for Nucleotide Sequence Analysis

【速读】: This paper surveys the application of Transformer-based language models to nucleotide sequences in bioinformatics, summarizing their recent developments and main characteristics. The key lies in extending and adapting Transformer models from natural language processing (NLP) to biological sequence analysis, exploiting their strength on sequential data. Reviewing and analyzing a large number of application papers, the survey shows how these models can be customized for different bioinformatics tasks and provides a structured description of how Transformers work, helping even first-time users grasp their complex architecture.

链接: https://arxiv.org/abs/2412.07201
作者: Nimisha Ghosh,Daniele Santoni,Indrajit Saha,Giovanni Felici
关键词-EN: Transformer-based language models, natural language processing, Transformer-based language, language processing, language models
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:In recent times, Transformer-based language models are making quite an impact in the field of natural language processing. As relevant parallels can be drawn between biological sequences and natural languages, the models used in NLP can be easily extended and adapted for various applications in bioinformatics. In this regard, this paper introduces the major developments of Transformer-based models in the recent past in the context of nucleotide sequences. We have reviewed and analysed a large number of application-based papers on this subject, giving evidence of the main characterizing features and of the different approaches that may be adopted to customize such powerful computational machines. We have also provided a structured description of the functioning of Transformers, which may enable even first-time users to grasp the essence of such complex architectures. We believe this review will help the scientific community in understanding the various applications of Transformer-based language models to nucleotide sequences. This work will motivate the readers to build on these methodologies to tackle various other problems in the field of bioinformatics.
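As a concrete instance of how NLP tooling is adapted to DNA, several nucleotide models (DNABERT being a well-known example) first turn a sequence into overlapping k-mer "words" before feeding it to a Transformer; a minimal tokenizer:

```python
def kmer_tokenize(sequence: str, k: int = 6, stride: int = 1) -> list[str]:
    """Split a nucleotide sequence into overlapping k-mer tokens."""
    sequence = sequence.upper()
    return [sequence[i:i + k] for i in range(0, len(sequence) - k + 1, stride)]

print(kmer_tokenize("ATGGCGTAC", k=3))
# ['ATG', 'TGG', 'GGC', 'GCG', 'CGT', 'GTA', 'TAC']
```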
zh

[NLP-39] Modifying AI, Enhancing Essays: How Active Engagement with Generative AI Boosts Writing Quality

【速读】: This paper asks how teachers can effectively assess and support student learning in Generative AI (GAI)-assisted writing. The key is analyzing how different behavioral patterns in GAI-assisted writing affect essay quality, and in particular distinguishing behaviors that reflect meaningful learning from those that do not. Using the X-Learner method, the study quantifies the causal impact of three behavioral patterns (seeking suggestions but not accepting them, accepting suggestions as-is, and accepting suggestions with modification) on essay quality. Students who frequently modified GAI-generated text produced essays with clearly higher lexical sophistication, syntactic complexity, and text cohesion, whereas students who accepted suggestions without modification saw essay quality decline. Incorporating GAI-generated text can also help mitigate linguistic bias.

链接: https://arxiv.org/abs/2412.07200
作者: Kaixun Yang,Mladen Raković,Zhiping Liang,Lixiang Yan,Zijie Zeng,Yizhou Fan,Dragan Gašević,Guanliang Chen
关键词-EN: writing-a key pedagogical, key pedagogical practice, relying on Generative, practice in education, increasingly relying
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Students are increasingly relying on Generative AI (GAI) to support their writing-a key pedagogical practice in education. In GAI-assisted writing, students can delegate core cognitive tasks (e.g., generating ideas and turning them into sentences) to GAI while still producing high-quality essays. This creates new challenges for teachers in assessing and supporting student learning, as they often lack insight into whether students are engaging in meaningful cognitive processes during writing or how much of the essay’s quality can be attributed to those processes. This study aimed to help teachers better assess and support student learning in GAI-assisted writing by examining how different writing behaviors, especially those indicative of meaningful learning versus those that are not, impact essay quality. Using a dataset of 1,445 GAI-assisted writing sessions, we applied the cutting-edge method, X-Learner, to quantify the causal impact of three GAI-assisted writing behavioral patterns (i.e., seeking suggestions but not accepting them, seeking suggestions and accepting them as they are, and seeking suggestions and accepting them with modification) on four measures of essay quality (i.e., lexical sophistication, syntactic complexity, text cohesion, and linguistic bias). Our analysis showed that writers who frequently modified GAI-generated text-suggesting active engagement in higher-order cognitive processes-consistently improved the quality of their essays in terms of lexical sophistication, syntactic complexity, and text cohesion. In contrast, those who often accepted GAI-generated text without changes, primarily engaging in lower-order processes, saw a decrease in essay quality. Additionally, while human writers tend to introduce linguistic bias when writing independently, incorporating GAI-generated text-even without modification-can help mitigate this bias.
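For readers unfamiliar with the X-Learner the study relies on, here is a compact, generic sketch of the estimator with scikit-learn; the study's actual covariates, outcome measures, and base learners are not known from the abstract, so all modeling choices below are placeholders.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor, GradientBoostingClassifier

def x_learner_effects(X, y, t):
    """X-Learner estimate of per-unit treatment effects tau(x).

    X: [n, d] covariates; y: [n] outcomes (e.g., an essay-quality score);
    t: [n] binary treatment (e.g., 1 = modified GAI text, 0 = did not).
    """
    X1, y1, X0, y0 = X[t == 1], y[t == 1], X[t == 0], y[t == 0]

    # Stage 1: outcome models per arm.
    mu0 = GradientBoostingRegressor().fit(X0, y0)
    mu1 = GradientBoostingRegressor().fit(X1, y1)

    # Stage 2: imputed individual effects, then effect models per arm.
    d1 = y1 - mu0.predict(X1)           # effect on the treated
    d0 = mu1.predict(X0) - y0           # effect on the controls
    tau1 = GradientBoostingRegressor().fit(X1, d1)
    tau0 = GradientBoostingRegressor().fit(X0, d0)

    # Stage 3: blend with the propensity score g(x) = P(t=1 | x).
    g = GradientBoostingClassifier().fit(X, t).predict_proba(X)[:, 1]
    return g * tau0.predict(X) + (1 - g) * tau1.predict(X)

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3)); t = rng.integers(0, 2, 500)
y = X[:, 0] + t * (1 + X[:, 1]) + rng.normal(scale=0.1, size=500)
print(x_learner_effects(X, y, t).mean())  # ~1.0, the true average effect
```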
zh

[NLP-40] PrisonBreak: Jailbreaking Large Language Models with Fewer Than Twenty-Five Targeted Bit-flips

【速读】: This paper introduces a new class of attacks on commercial-scale (human-aligned) language models that induce jailbreaking through targeted bit-flips in model parameters. The key is an efficient attack algorithm that, with very few bit-flips (as few as 5 in some cases), makes billion-parameter language models generate harmful responses at runtime without any input modification. Compared with existing attacks on computer-vision models, the method uses up to 40x fewer bit-flips and is up to 20x more computationally efficient. The paper validates the attack end-to-end with software-induced fault injection (Rowhammer), demonstrating broad applicability across DRAM devices and effectiveness even against highly secure systems.

链接: https://arxiv.org/abs/2412.07192
作者: Zachary Coalson,Jeonghyun Woo,Shiyang Chen,Yu Sun,Lishan Yang,Prashant Nair,Bo Fang,Sanghyun Hong
关键词-EN: targeted bitwise corruptions, targeted bitwise, bitwise corruptions, models, times
类目: Cryptography and Security (cs.CR); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:We introduce a new class of attacks on commercial-scale (human-aligned) language models that induce jailbreaking through targeted bitwise corruptions in model parameters. Our adversary can jailbreak billion-parameter language models with fewer than 25 bit-flips in all cases - and as few as 5 in some - using up to 40× fewer bit-flips than existing attacks on computer vision models at least 100× smaller. Unlike prompt-based jailbreaks, our attack renders these models in memory ‘uncensored’ at runtime, allowing them to generate harmful responses without any input modifications. Our attack algorithm efficiently identifies target bits to flip, offering up to 20× more computational efficiency than previous methods. This makes it practical for language models with billions of parameters. We show an end-to-end exploitation of our attack using software-induced fault injection, Rowhammer (RH). Our work examines 56 DRAM RH profiles from DDR4 and LPDDR4X devices with different RH vulnerabilities. We show that our attack can reliably induce jailbreaking in systems similar to those affected by prior bit-flip attacks. Moreover, our approach remains effective even against highly RH-secure systems (e.g., 46× more secure than previously tested systems). Our analyses further reveal that: (1) models with less post-training alignment require fewer bit flips to jailbreak; (2) certain model components, such as value projection layers, are substantially more vulnerable than others; and (3) our method is mechanistically different from existing jailbreaks. Our findings highlight a pressing, practical threat to the language model ecosystem and underscore the need for research to protect these models from bit-flip attacks.
zh

[NLP-41] Breaking the Stage Barrier: A Novel Single-Stage Approach to Long Context Extension for Large Language Models

【速读】: This paper addresses the limited performance of large language models (LLMs) on long sequences caused by the restricted training context size. The key is a novel single-stage continual pretraining method, Head-Adaptive Rotary Position Encoding (HARPE). HARPE uses different Rotary Position Encoding (RoPE) base-frequency values across attention heads and trains the LLM directly at the target context length, simplifying the training process while equipping the model with long-context modeling capabilities. Experiments show that HARPE excels at understanding and integrating long contexts, matching and even outperforming existing multi-stage methods.

链接: https://arxiv.org/abs/2412.07171
作者: Haoran Lian,Junmin Chen,Wei Huang,Yizhe Xiong,Wenping Hu,Guiguang Ding,Hui Chen,Jianwei Niu,Zijia Lin,Fuzheng Zhang,Di Zhang
关键词-EN: Natural Language Processing, Large language models, revolutionized Natural Language, revolutionized Natural, Rotary Position Encoding
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Recently, Large language models (LLMs) have revolutionized Natural Language Processing (NLP). Pretrained LLMs, due to limited training context size, struggle with handling long token sequences, limiting their performance on various downstream tasks. Current solutions toward long context modeling often employ multi-stage continual pretraining, which progressively increases the effective context length through several continual pretraining stages. However, those approaches require extensive manual tuning and human expertise. In this paper, we introduce a novel single-stage continual pretraining method, Head-Adaptive Rotary Position Encoding (HARPE), to equip LLMs with long context modeling capabilities while simplifying the training process. Our HARPE leverages different Rotary Position Encoding (RoPE) base frequency values across different attention heads and directly trains LLMs on the target context length. Extensive experiments on 4 language modeling benchmarks, including the latest RULER benchmark, demonstrate that HARPE excels in understanding and integrating long-context tasks with single-stage training, matching and even outperforming existing multi-stage methods. Our results highlight that HARPE successfully breaks the stage barrier for training LLMs with long context modeling capabilities.
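The core mechanism, head-dependent RoPE base frequencies, is simple to sketch: each attention head builds its inverse-frequency table from its own theta. The particular base values and the head-to-base assignment below are illustrative assumptions, not the trained configuration.

```python
import torch

def head_adaptive_rope_cache(n_heads, head_dim, max_len,
                             bases=(1e4, 5e4, 1e5, 5e5)):
    """Per-head cos/sin caches for RoPE, one base frequency per head (illustrative)."""
    cos, sin = [], []
    for h in range(n_heads):
        base = bases[h % len(bases)]  # assign a base frequency to this head
        inv_freq = 1.0 / base ** (torch.arange(0, head_dim, 2).float() / head_dim)
        angles = torch.outer(torch.arange(max_len).float(), inv_freq)
        cos.append(angles.cos()); sin.append(angles.sin())
    # each: [n_heads, max_len, head_dim // 2]
    return torch.stack(cos), torch.stack(sin)

def apply_rope(x, cos, sin):
    """x: [batch, n_heads, seq, head_dim]; rotate pairs with per-head tables."""
    x1, x2 = x[..., 0::2], x[..., 1::2]
    c, s = cos[None, :, :x.shape[2]], sin[None, :, :x.shape[2]]
    return torch.stack([x1 * c - x2 * s, x1 * s + x2 * c], dim=-1).flatten(-2)
```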
zh

[NLP-42] MM-PoE: Multiple Choice Reasoning via. Process of Elimination using Multi-Modal Models

【速读】: This paper addresses the limitation of conventional approaches that evaluate each option independently in multimodal visual reasoning tasks. The key is the Multi-Modal Process of Elimination (MM-PoE), a two-step scoring paradigm that first identifies and excludes clearly implausible options and then concentrates on the most probable remaining ones. The method emulates human test-taking strategies and significantly improves both zero-shot and few-shot performance of state-of-the-art vision-language models (VLMs), while overcoming two key limitations of prior elimination approaches: restriction to zero-shot settings and to language-only frameworks.

链接: https://arxiv.org/abs/2412.07148
作者: Sayak Chakrabarty,Souradip Pal
关键词-EN: paper introduces Multiple, introduces Multiple Choice, Multiple Choice Reasoning, introduces Multiple, Multiple Choice
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:This paper introduces Multiple Choice Reasoning via. Process of Elimination using Multi-Modal models, herein referred to as Multi-Modal Process of Elimination (MM-PoE). This novel methodology is engineered to augment the efficacy of Vision-Language Models (VLMs) in multiple-choice visual reasoning tasks. Diverging from conventional approaches that evaluate each option independently, MM-PoE employs a dual-step scoring paradigm that initially identifies and excludes implausible choices, subsequently concentrating on the most probable remaining options. This method emulates human test-taking strategies, where individuals typically eliminate clearly incorrect answers prior to selecting the optimal response. Our empirical evaluations, conducted across three benchmark datasets, reveal that MM-PoE significantly improves both zero-shot and few-shot performance of contemporary state-of-the-art VLMs. Critically, this approach not only broadens the application of the elimination process to multi-modal contexts but also allows few-shot experiments, thereby addressing two principal limitations concerning usage of PoE only in zero-shot settings and only with a language-only framework. As a result, MM-PoE not only refines the reasoning capabilities of VLMs but also broadens their applicability to complex visual question-answering scenarios. All code and documentation supporting our work are available at this https URL, enabling researchers and practitioners to easily integrate and further develop these techniques.
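The two-step scoring is easy to state in code. A minimal sketch over per-option scores (in the multimodal setting these would come from a VLM): drop options scoring below the mean, then choose among the survivors. Note that full PoE pipelines typically re-prompt the model with only the surviving options rather than reusing the step-1 scores, which this toy version omits.

```python
import numpy as np

def process_of_elimination(option_scores):
    """Two-step PoE: mask below-average options, then argmax over the rest."""
    scores = np.asarray(option_scores, dtype=float)
    keep = scores >= scores.mean()          # step 1: drop implausible options
    masked = np.where(keep, scores, -np.inf)
    return int(masked.argmax())             # step 2: choose among survivors

# e.g. per-option log-likelihoods from a (V)LM for choices A-D
print(process_of_elimination([-4.2, -1.3, -9.0, -1.5]))  # -> 1 (option B)
```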
zh

[NLP-43] Political Actor Agent : Simulating Legislative System for Roll Call Votes Prediction with Large Language Models AAAI2025

【速读】: This paper addresses the shortcomings of existing embedding-based methods for predicting legislators' roll-call votes: the need for manually predefined features, reliance on extensive training data, and lack of interpretability. The key is the Political Actor Agent (PAA) framework, which leverages large language models with a role-playing architecture and a simulated legislative system to provide a scalable and interpretable paradigm for roll-call vote prediction. PAA not only improves prediction accuracy but also offers multi-view, human-understandable decision reasoning, yielding new insights into the behavior of political actors.

链接: https://arxiv.org/abs/2412.07144
作者: Hao Li,Ruoyuan Gong,Hao Jiang
关键词-EN: Predicting roll call, roll call votes, roll call, focus in quantitative, modeling political actors
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: Accepted at AAAI 2025

点击查看摘要

Abstract:Predicting roll call votes through modeling political actors has emerged as a focus in quantitative political science and computer science. Widely used embedding-based methods generate vectors for legislators from diverse data sets to predict legislative behaviors. However, these methods often contend with challenges such as the need for manually predefined features, reliance on extensive training data, and a lack of interpretability. Achieving more interpretable predictions under flexible conditions remains an unresolved issue. This paper introduces the Political Actor Agent (PAA), a novel agent-based framework that utilizes Large Language Models to overcome these limitations. By employing role-playing architectures and simulating legislative system, PAA provides a scalable and interpretable paradigm for predicting roll-call votes. Our approach not only enhances the accuracy of predictions but also offers multi-view, human-understandable decision reasoning, providing new insights into political actor behaviors. We conducted comprehensive experiments using voting records from the 117-118th U.S. House of Representatives, validating the superior performance and interpretability of PAA. This study not only demonstrates PAA’s effectiveness but also its potential in political science research.
zh

[NLP-44] Bridging the Gap for Test-Time Multimodal Sentiment Analysis AAAI2025

【速读】: This paper addresses the performance degradation of multimodal sentiment analysis (MSA) under distribution shift in real-world dynamic scenarios. The key is two strategies, Contrastive Adaptation and Stable Pseudo-label generation (CASP), which handle distribution shifts in MSA by enforcing consistency and minimizing empirical risk, respectively. Experiments show that CASP brings significant and consistent performance gains across different distribution-shift settings and backbones, demonstrating its effectiveness and versatility.

链接: https://arxiv.org/abs/2412.07121
作者: Zirun Guo,Tao Jin,Wenlong Xu,Wang Lin,Yangyang Wu
关键词-EN: emerging research topic, recognize human sentiment, multiple modalities, emerging research, research topic
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注: Accepted to AAAI 2025

点击查看摘要

Abstract:Multimodal sentiment analysis (MSA) is an emerging research topic that aims to understand and recognize human sentiment or emotions through multiple modalities. However, in real-world dynamic scenarios, the distribution of target data is always changing and different from the source data used to train the model, which leads to performance degradation. Common adaptation methods usually need source data, which could pose privacy issues or storage overheads. Therefore, test-time adaptation (TTA) methods are introduced to improve the performance of the model at inference time. Existing TTA methods are always based on probabilistic models and unimodal learning, and thus can not be applied to MSA which is often considered as a multimodal regression task. In this paper, we propose two strategies: Contrastive Adaptation and Stable Pseudo-label generation (CASP) for test-time adaptation for multimodal sentiment analysis. The two strategies deal with the distribution shifts for MSA by enforcing consistency and minimizing empirical risk, respectively. Extensive experiments show that CASP brings significant and consistent improvements to the performance of the model across various distribution shift settings and with different backbones, demonstrating its effectiveness and versatility. Our codes are available at this https URL.
zh

[NLP-45] A Review of Human Emotion Synthesis Based on Generative Technology

【速读】: This paper addresses the lack of a systematic review of human emotion synthesis in affective computing. The key is a comprehensive and systematic overview covering the generative models involved (autoencoders, generative adversarial networks, diffusion models, large language models, and sequence-to-sequence models), their mathematical principles, the datasets used, and applications across modalities such as facial images, speech, and text. The review also examines mainstream evaluation metrics and suggests future research directions, providing a thorough understanding of the role of generative technology in emotion synthesis.

链接: https://arxiv.org/abs/2412.07116
作者: Fei Ma,Yukan Li,Yifan Xie,Ying He,Yi Zhang,Hongwei Ren,Zhou Liu,Wei Yao,Fuji Ren,Fei Richard Yu,Shiguang Ni
关键词-EN: Human emotion synthesis, Large Language Models, Generative Adversarial Networks, generative models, affective computing
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: 25 pages, 10 figures

点击查看摘要

Abstract:Human emotion synthesis is a crucial aspect of affective computing. It involves using computational methods to mimic and convey human emotions through various modalities, with the goal of enabling more natural and effective human-computer interactions. Recent advancements in generative models, such as Autoencoders, Generative Adversarial Networks, Diffusion Models, Large Language Models, and Sequence-to-Sequence Models, have significantly contributed to the development of this field. However, comprehensive reviews of this field are notably lacking. To fill this gap, this paper provides a thorough and systematic overview of recent advancements in human emotion synthesis based on generative models. Specifically, the review first presents the review methodology, the emotion models involved, the mathematical principles of generative models, and the datasets used. Then, it covers the application of different generative models to emotion synthesis based on a variety of modalities, including facial images, speech, and text. It also examines mainstream evaluation metrics. Additionally, the review presents some major findings and suggests future research directions, providing a comprehensive understanding of the role of generative technology in the nuanced domain of emotion synthesis.
zh

[NLP-46] Exploring Coding Spot: Understanding Parametric Contributions to LLM Coding Performance

【速读】: This paper investigates the mechanisms behind LLMs' code generation and comprehension, in particular whether different programming languages are processed independently or within a shared parameter space. The key is the concept of the Coding Spot, a specialized parametric region within LLMs dedicated to coding capabilities. The study identifies this Coding Spot and shows that targeted modifications to this parameter subset significantly affect performance on coding tasks while largely preserving non-coding functionality. This compartmentalization mirrors the functional specialization observed in cognitive neuroscience, where specific brain regions handle distinct tasks, suggesting that LLMs may likewise employ specialized parameter regions for different knowledge domains.

链接: https://arxiv.org/abs/2412.07113
作者: Dongjun Kim,Minhyuk Kim,YongChan Chun,Chanjun Park,Heuiseok Lim
关键词-EN: Large Language Models, Large Language, Language Models, multiple programming languages, demonstrated notable proficiency
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Large Language Models (LLMs) have demonstrated notable proficiency in both code generation and comprehension across multiple programming languages. However, the mechanisms underlying this proficiency remain underexplored, particularly with respect to whether distinct programming languages are processed independently or within a shared parametric region. Drawing an analogy to the specialized regions of the brain responsible for distinct cognitive functions, we introduce the concept of Coding Spot, a specialized parametric region within LLMs that facilitates coding capabilities. Our findings identify this Coding Spot and show that targeted modifications to this subset significantly affect performance on coding tasks, while largely preserving non-coding functionalities. This compartmentalization mirrors the functional specialization observed in cognitive neuroscience, where specific brain regions are dedicated to distinct tasks, suggesting that LLMs may similarly employ specialized parameter regions for different knowledge domains.
zh

[NLP-47] Maya: An Instruction Finetuned Multilingual Multimodal Model

【速读】: This paper addresses the significant weaknesses of current vision-language models (VLMs) on low-resource languages and cultural diversity, largely caused by the lack of high-quality, diverse, and safety-vetted data. The key is Maya, an open-source multimodal multilingual model with three contributions: 1) a multilingual image-text pretraining dataset in eight languages built on the LLaVA pretraining dataset; 2) a thorough toxicity analysis of the LLaVA dataset and a novel toxicity-free version across the eight languages; and 3) a multilingual image-text model supporting these languages that strengthens cultural and linguistic understanding in vision-language tasks.

链接: https://arxiv.org/abs/2412.07112
作者: Nahid Alam,Karthik Reddy Kanjula,Surya Guthikonda,Timothy Chung,Bala Krishna S Vegesna,Abhipsha Das,Anthony Susevski,Ryan Sze-Yin Chan,S M Iftekhar Uddin,Shayekh Bin Islam,Roshan Santhosh,Snegha A,Drishti Sharma,Chen Liu,Isha Chaturvedi,Genta Indra Winata,Ashvanth.S,Snehanshu Mukherjee,Alham Fikri Aji
关键词-EN: widely spoken languages, academic benchmarks, primarily in widely, rapid development, development of large
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:The rapid development of large Vision-Language Models (VLMs) has led to impressive results on academic benchmarks, primarily in widely spoken languages. However, significant gaps remain in the ability of current VLMs to handle low-resource languages and varied cultural contexts, largely due to a lack of high-quality, diverse, and safety-vetted data. Consequently, these models often struggle to understand low-resource languages and cultural nuances in a manner free from toxicity. To address these limitations, we introduce Maya, an open-source Multimodal Multilingual model. Our contributions are threefold: 1) a multilingual image-text pretraining dataset in eight languages, based on the LLaVA pretraining dataset; 2) a thorough analysis of toxicity within the LLaVA dataset, followed by the creation of a novel toxicity-free version across eight languages; and 3) a multilingual image-text model supporting these languages, enhancing cultural and linguistic comprehension in vision-language tasks. Code available at this https URL.
zh

[NLP-48] Predictable Emergent Abilities of LLMs: Proxy Tasks Are All You Need

【速读】: This paper addresses the difficulty of predicting the "emergent abilities" that appear as large language models (LLMs) scale. The key is a proxy-task-based method: establish relevance metrics between the target task and candidate tasks, validate the robustness of the candidates with small-model ensembles, and select the most suitable proxy tasks; the target-task performance is then predicted by integrating the evaluation results of these proxies. In a case study on tool utilization, the method showed a strong correlation between predicted and actual performance, confirming its effectiveness.

链接: https://arxiv.org/abs/2412.07111
作者: Bo-Wen Zhang,Yan Yan,Boxiang Yang,Yifei Xue,Guang Liu
关键词-EN: scaling laws optimize, laws optimize training, optimize training configurations, emergent abilities due, predict emergent abilities
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:While scaling laws optimize training configurations for large language models (LLMs) through experiments on smaller or early-stage models, they fail to predict emergent abilities due to the absence of such capabilities in these models. To address this, we propose a method that predicts emergent abilities by leveraging proxy tasks. We begin by establishing relevance metrics between the target task and candidate tasks based on performance differences across multiple models. These candidate tasks are then validated for robustness with small model ensembles, leading to the selection of the most appropriate proxy tasks. The predicted performance on the target task is then derived by integrating the evaluation results of these proxies. In a case study on tool utilization capabilities, our method demonstrated a strong correlation between predicted and actual performance, confirming its effectiveness.
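A small sketch of the pipeline as the abstract outlines it, given a models × tasks score matrix: rank candidate tasks by how well they correlate with the target task across models, then integrate the chosen proxies with a simple least-squares fit. The relevance metric (Pearson correlation) and the final regression are assumptions standing in for the paper's actual metrics.

```python
import numpy as np

def select_proxies_and_predict(scores, target_col, new_model_scores, k=2):
    """scores: [n_models, n_tasks] benchmark results for models where the
    target task IS measurable; target_col: index of the target task;
    new_model_scores: task-score array for a model whose target score we predict."""
    y = scores[:, target_col]
    candidates = [j for j in range(scores.shape[1]) if j != target_col]

    # Relevance metric: Pearson correlation with the target across models.
    corr = {j: np.corrcoef(scores[:, j], y)[0, 1] for j in candidates}
    proxies = sorted(candidates, key=lambda j: -abs(corr[j]))[:k]

    # Integrate proxy results with a least-squares fit.
    A = np.c_[scores[:, proxies], np.ones(len(y))]
    w, *_ = np.linalg.lstsq(A, y, rcond=None)
    return float(np.r_[new_model_scores[proxies], 1.0] @ w)

rng = np.random.default_rng(1)
S = rng.random((8, 5))  # toy: 8 models x 5 tasks
print(select_proxies_and_predict(S, target_col=0, new_model_scores=rng.random(5)))
```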
zh

[NLP-49] Improving the Natural Language Inference robustness to hard dataset by data augmentation and preprocessing

【速读】: This paper addresses the poor performance of natural language inference (NLI) models on hard datasets: when faced with unseen, out-of-distribution premises and hypotheses, models may fail to understand the semantic content and instead learn spurious correlations. The key is the proposed data augmentation and preprocessing methods targeting word overlap, numerical reasoning, and length mismatch. These methods do not rely on the distribution of the test data and thereby improve model robustness.

链接: https://arxiv.org/abs/2412.07108
作者: Zijiang Yang
关键词-EN: Natural Language Inference, Natural Language, Language Inference, Natural, Inference
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Natural Language Inference (NLI) is the task of inferring whether a hypothesis can be justified by a given premise. Basically, we classify the hypothesis into three labels (entailment, neutrality, and contradiction) given the premise. NLI has been well studied by previous researchers. A number of models, especially Transformer-based ones, have achieved significant improvements on these tasks. However, these models are reported to struggle when dealing with hard datasets. In particular, they perform much worse on unseen out-of-distribution premises and hypotheses; they may not understand the semantic content but instead learn spurious correlations. In this work, we propose data augmentation and preprocessing methods to solve the word overlap, numerical reasoning and length mismatch problems. These are general methods that do not rely on the distribution of the testing data, and they help improve the robustness of the models.
zh

[NLP-50] QAPyramid: Fine-grained Evaluation of Content Selection for Text Summarization

【速读】: This paper addresses the lack of systematicity and granularity in human evaluation of text summarization. The traditional Pyramid protocol assesses content selection by decomposing the reference summary into sub-units and verifying their presence in the system summary, but the definition and granularity of those sub-units are unsystematic. The key of the proposed QAPyramid is to decompose each reference summary into finer-grained question-answer (QA) pairs annotated under the QA-SRL framework, providing more systematic and fine-grained content selection evaluation. The authors further propose automated metrics that correlate with QAPyramid better than other widely adopted metrics, enabling accurate and efficient benchmarking of summarization systems.

链接: https://arxiv.org/abs/2412.07096
作者: Shiyue Zhang,David Wan,Arie Cattan,Ayal Klein,Ido Dagan,Mohit Bansal
关键词-EN: properly conduct human, conduct human evaluations, longstanding challenge, properly conduct, conduct human
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: The first two authors contributed equally. Code: this https URL

点击查看摘要

Abstract:How to properly conduct human evaluations for text summarization is a longstanding challenge. The Pyramid human evaluation protocol, which assesses content selection by breaking the reference summary into sub-units and verifying their presence in the system summary, has been widely adopted. However, it suffers from a lack of systematicity in the definition and granularity of the sub-units. We address these problems by proposing QAPyramid, which decomposes each reference summary into finer-grained question-answer (QA) pairs according to the QA-SRL framework. We collect QA-SRL annotations for reference summaries from CNN/DM and evaluate 10 summarization systems, resulting in 8.9K QA-level annotations. We show that, compared to Pyramid, QAPyramid provides more systematic and fine-grained content selection evaluation while maintaining high inter-annotator agreement without needing expert annotations. Furthermore, we propose metrics that automate the evaluation pipeline and achieve higher correlations with QAPyramid than other widely adopted metrics, allowing future work to accurately and efficiently benchmark summarization systems.
zh

[NLP-51] Defensive Dual Masking for Robust Adversarial Defense

【速读】: This paper addresses the vulnerability of natural language processing (NLP) models to adversarial attacks. The key is the Defensive Dual Masking (DDM) algorithm: during training, [MASK] tokens are strategically inserted into training samples so the model learns to handle adversarial perturbations; at inference, potentially adversarial tokens are dynamically replaced with [MASK] tokens to neutralize threats while preserving the input's core semantics. Theoretically, the selective masking mechanism strengthens the model's ability to identify and mitigate adversarial manipulation; empirically, DDM improves accuracy and robustness across diverse benchmark datasets and attack mechanisms, and also scales to large language models (LLMs).

链接: https://arxiv.org/abs/2412.07078
作者: Wangli Yang,Jie Yang,Yi Guo,Johan Barthelemy
关键词-EN: gained considerable attention, recent years due, natural language processing, exploit subtle perturbations, Defensive Dual Masking
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: First version

点击查看摘要

Abstract:The field of textual adversarial defenses has gained considerable attention in recent years due to the increasing vulnerability of natural language processing (NLP) models to adversarial attacks, which exploit subtle perturbations in input text to deceive models. This paper introduces the Defensive Dual Masking (DDM) algorithm, a novel approach designed to enhance model robustness against such attacks. DDM utilizes a unique adversarial training strategy where [MASK] tokens are strategically inserted into training samples to prepare the model to handle adversarial perturbations more effectively. During inference, potentially adversarial tokens are dynamically replaced with [MASK] tokens to neutralize potential threats while preserving the core semantics of the input. The theoretical foundation of our approach is explored, demonstrating how the selective masking mechanism strengthens the model’s ability to identify and mitigate adversarial manipulations. Our empirical evaluation across a diverse set of benchmark datasets and attack mechanisms consistently shows that DDM outperforms state-of-the-art defense techniques, improving model accuracy and robustness. Moreover, when applied to Large Language Models (LLMs), DDM also enhances their resilience to adversarial attacks, providing a scalable defense mechanism for large-scale NLP applications.
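The two masking passes can be sketched schematically. The masking rate and, above all, the rule for flagging "suspicious" tokens at inference are placeholders here; the toy heuristic below stands in for the paper's actual criteria.

```python
import random

MASK = "[MASK]"

def train_time_mask(tokens, p=0.15, seed=None):
    """Defensive training pass: randomly replace tokens with [MASK]."""
    rng = random.Random(seed)
    return [MASK if rng.random() < p else t for t in tokens]

def inference_time_mask(tokens, suspicion_fn):
    """Defensive inference pass: neutralize tokens flagged as adversarial."""
    return [MASK if suspicion_fn(t) else t for t in tokens]

# Toy suspicion heuristic (placeholder): character-level garbling.
suspicious = lambda t: any(ch.isdigit() for ch in t) and any(ch.isalpha() for ch in t)

toks = "the m0vie was surpris1ngly good".split()
print(train_time_mask(toks, p=0.3, seed=0))
print(inference_time_mask(toks, suspicious))
```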
zh

[NLP-52] FM2DS: Few-Shot Multimodal Multihop Data Synthesis with Knowledge Distillation for Question Answering CVPR2025

【速读】: This paper addresses the lack of high-quality datasets for multimodal multihop question answering, a task that requires reasoning across multiple sources of information such as images and text. The key is a novel 5-stage pipeline for creating such a dataset: acquiring relevant multimodal documents from Wikipedia, synthetically generating high-level questions and answers, and validating them against rigorous quality criteria. With an equal sample size, models trained on the synthesized data outperform models trained on human-collected data by an average of 1.9 points in exact match (EM).

链接: https://arxiv.org/abs/2412.07030
作者: Amirhossein Abaskohi,Spandana Gella,Giuseppe Carenini,Issam H. Laradji
关键词-EN: multihop question answering, Multimodal multihop question, question answering, sources of information, multihop question
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Information Retrieval (cs.IR); Machine Learning (cs.LG)
备注: 20 pages, 11 figures, 10 tables, Submitted to CVPR 2025

点击查看摘要

Abstract:Multimodal multihop question answering is a complex task that requires reasoning over multiple sources of information, such as images and text, to answer questions. While there has been significant progress in visual question answering, the multihop setting remains unexplored due to the lack of high-quality datasets. Current methods focus on single-hop question answering or a single modality, which makes them unsuitable for real-world scenarios such as analyzing multimodal educational materials, summarizing lengthy academic articles, or interpreting scientific studies that combine charts, images, and text. To address this gap, we propose a novel methodology, introducing the first framework for creating a high-quality dataset that enables training models for multimodal multihop question answering. Our approach consists of a 5-stage pipeline that involves acquiring relevant multimodal documents from Wikipedia, synthetically generating high-level questions and answers, and validating them through rigorous criteria to ensure quality data. We evaluate our methodology by training models on our synthesized dataset and testing on two benchmarks; our results demonstrate that, with an equal sample size, models trained on our synthesized data outperform those trained on human-collected data by 1.9 points in exact match (EM) on average. We believe our data synthesis method will serve as a strong foundation for training and evaluating multimodal multihop question answering models.
zh

[NLP-53] Assessing the Impact of Conspiracy Theories Using Large Language Models

【速读】: This paper addresses how to effectively assess and prioritize the real-world impact of conspiracy theories (CTs) on the public during crises. The key is leveraging large language models (LLMs) for complex reasoning: an impact-assessment mode that uses multi-step reasoning to critically analyze more CT-related evidence produces accurate results. The paper also finds that LLMs exhibit biases on this task, such as assigning higher impact to CTs presented earlier in the prompt and producing less accurate assessments for emotionally charged or verbose CTs, and shows that tailored strategies can effectively improve assessment accuracy in these cases.

链接: https://arxiv.org/abs/2412.07019
作者: Bohan Jiang,Dawei Li,Zhen Tan,Xinyi Zhou,Ashwin Rao,Kristina Lerman,H. Russell Bernard,Huan Liu
关键词-EN: allocating resources effectively, Measuring the relative, resources effectively, important for prioritizing, prioritizing responses
类目: Computation and Language (cs.CL); Computers and Society (cs.CY)
备注:

点击查看摘要

Abstract:Measuring the relative impact of CTs is important for prioritizing responses and allocating resources effectively, especially during crises. However, assessing the actual impact of CTs on the public poses unique challenges. It requires not only the collection of CT-specific knowledge but also diverse information from social, psychological, and cultural dimensions. Recent advancements in large language models (LLMs) suggest their potential utility in this context, not only due to their extensive knowledge from large training corpora but also because they can be harnessed for complex reasoning. In this work, we develop datasets of popular CTs with human-annotated impacts. Borrowing insights from human impact assessment processes, we then design tailored strategies to leverage LLMs for performing human-like CT impact assessments. Through rigorous experiments, we discover that an impact assessment mode using multi-step reasoning to critically analyze more CT-related evidence produces accurate results; and most LLMs demonstrate strong bias, such as assigning higher impacts to CTs presented earlier in the prompt, while generating less accurate impact assessments for emotionally charged and verbose CTs.
zh

[NLP-54] Asynchronous LLM Function Calling

【速读】: This paper addresses the inherently synchronous nature of LLM function calling, where each call to external tools and data sources blocks LLM inference, limiting operational efficiency and concurrent function execution. The key is AsyncLM, a system for asynchronous LLM function calling that lets the LLM generate and execute function calls concurrently: instead of waiting for each call to complete, an interrupt mechanism asynchronously notifies the LLM in-flight when function calls return. The design includes an in-context protocol for function calls and interrupts, a fine-tuning strategy to adapt LLMs to the interrupt semantics, and an efficient implementation within the LLM inference process. On benchmark tasks from the Berkeley Function Calling Leaderboard (BFCL), AsyncLM reduces end-to-end task completion latency by 1.6x-5.4x compared with synchronous calling. The paper also discusses extending the interrupt mechanism to novel human-LLM or LLM-LLM interactions.

链接: https://arxiv.org/abs/2412.07017
作者: In Gim,Seung-seob Lee,Lin Zhong
关键词-EN: Large language models, Large language, LLM function calling, LLM, function
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Large language models (LLMs) use function calls to interface with external tools and data sources. However, the current approach to LLM function calling is inherently synchronous, where each call blocks LLM inference, limiting LLM operation and concurrent function execution. In this work, we propose AsyncLM, a system for asynchronous LLM function calling. AsyncLM improves the LLM’s operational efficiency by enabling LLMs to generate and execute function calls concurrently. Instead of waiting for each call’s completion, AsyncLM introduces an interrupt mechanism to asynchronously notify the LLM in-flight when function calls return. We design an in-context protocol for function calls and interrupts, provide a fine-tuning strategy to adapt LLMs to the interrupt semantics, and implement these mechanisms efficiently in the LLM inference process. We demonstrate that AsyncLM can reduce end-to-end task completion latency by 1.6x-5.4x compared to synchronous function calling on a set of benchmark tasks in the Berkeley function calling leaderboard (BFCL). Furthermore, we discuss how interrupt mechanisms can be extended to enable novel human-LLM or LLM-LLM interactions.
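The scheduling idea can be illustrated with plain asyncio: function calls are launched as tasks the moment the model emits them, and completed results flow back to the decode loop through a queue, standing in for the paper's in-context interrupt tokens. The `fake_tool` stub and the toy decode loop are assumptions.

```python
import asyncio

async def fake_tool(name: str, delay: float) -> str:
    await asyncio.sleep(delay)                    # stands in for a real API call
    return f"{name} -> done"

async def main():
    interrupts = asyncio.Queue()                  # delivers results to the "LLM"
    tasks = []

    def launch(name, delay):                      # emit a call without blocking decoding
        async def run():
            await interrupts.put(await fake_tool(name, delay))
        tasks.append(asyncio.create_task(run()))

    # The model "emits" two calls early in the generation, then keeps decoding.
    launch("get_weather", 0.25)
    launch("search_web", 0.05)

    for step in range(5):                         # toy decode loop
        await asyncio.sleep(0.1)                  # one token's worth of work
        while not interrupts.empty():             # interrupt: splice a result in-context
            print(f"step {step}: interrupt: {interrupts.get_nowait()}")
        print(f"step {step}: emitted a token")

    await asyncio.gather(*tasks)                  # drain any stragglers

asyncio.run(main())
```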
zh

[NLP-55] AutoReason: Automatic Few-Shot Reasoning Decomposition

【速读】: This paper addresses the limitations of chain of thought (CoT) in large language models (LLMs): its dependence on hand-crafted few-shot exemplar prompts and its inability to adapt to different queries. The key is a method for automatically generating rationales that decomposes an implicit query into several explicit questions, improving multi-step implicit reasoning and providing interpretability, with especially large gains for weaker LLMs. The approach is tested on two question-answering datasets, StrategyQA and HotpotQA, and shows accuracy improvements on both, particularly on StrategyQA.

链接: https://arxiv.org/abs/2412.06975
作者: Arda Sevinc,Abdurrahman Gumus
关键词-EN: Chain of Thought, Large Language Models, Large Language, introduced in recent, reasoning in Large
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Chain of Thought (CoT) was introduced in recent research as a method for improving step-by-step reasoning in Large Language Models. However, CoT has limitations, such as its need for hand-crafted few-shot exemplar prompts and its inability to adjust itself to different queries. In this work, we propose a system to automatically generate rationales using CoT. Our method improves multi-step implicit reasoning capabilities by decomposing the implicit query into several explicit questions. This provides interpretability for the model, improving reasoning in weaker LLMs. We test our approach with two Q&A datasets: StrategyQA and HotpotQA. We show an increase in accuracy with both, especially on StrategyQA. To facilitate further research in this field, the complete source code for this study has been made publicly available on GitHub: this https URL.
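A minimal two-stage sketch of the decomposition idea, assuming a hypothetical `llm(prompt) -> str` completion function: first ask the model to rewrite the implicit query as explicit sub-questions, then answer with those rationales in context.

```python
def autoreason_style_answer(llm, query):
    """Two-stage rationale generation, then answering (schematic)."""
    decompose_prompt = (
        "Rewrite the implicit question below as a numbered list of explicit "
        "sub-questions that must be answered first.\n"
        f"Question: {query}\nSub-questions:"
    )
    rationales = llm(decompose_prompt)          # stage 1: automatic rationales

    answer_prompt = (
        f"Question: {query}\n"
        f"Reason step by step using these sub-questions:\n{rationales}\n"
        "Final answer:"
    )
    return llm(answer_prompt)                   # stage 2: grounded final answer

# Usage with any completion function, e.g. a thin wrapper around an API:
# print(autoreason_style_answer(my_llm, "Could a llama win a medieval joust?"))
```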
zh

[NLP-56] Effective Text Adaptation for LLM-based ASR through Soft Prompt Fine-Tuning

【速读】: This paper addresses the problem that fine-tuning LLM-based automatic speech recognition (ASR) on text-only data can diminish the effectiveness of domain-specific knowledge. The key is a two-step soft prompt fine-tuning strategy that strengthens domain-specific text adaptation. Experiments show relative reductions of up to 9% in Word Error Rate (WER) and up to 18% in Entity Error Rate (EER) on the target domain compared with the baseline ASR, and combining the method with domain-specific language model (LM) fusion improves the EER by a further relative 2-5%.

链接: https://arxiv.org/abs/2412.06967
作者: Yingyi Ma,Zhe Liu,Ozlem Kalinli
关键词-EN: Automatic Speech Recognition, Speech Recognition, Large Language Models, Automatic Speech, advent of Large
类目: Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
备注: accepted as SLT 2024 proceeding

点击查看摘要

Abstract:The advent of Large Language Models (LLMs) has reshaped Automatic Speech Recognition (ASR). Prompting an LLM with audio embeddings to generate transcriptions has become the new state of the art in ASR. Despite LLMs being trained on extensive text corpora, high-quality domain-specific text data can still significantly enhance ASR performance on domain adaptation tasks. Although LLM-based ASR can naturally incorporate more text corpora by fine-tuning the LLM decoder, fine-tuning such ASR on text-only data without paired prompts may diminish the effectiveness of domain-specific knowledge. To mitigate this issue, we propose a two-step soft prompt fine-tuning strategy that enhances domain-specific text adaptation. Experimental results show that text adaptation with our proposed method achieved up to a 9% relative Word Error Rate (WER) reduction and up to an 18% relative Entity Error Rate (EER) reduction on the target domain compared to the baseline ASR. Combining this with domain-specific Language Model (LM) fusion can further improve the EER by a relative 2-5%.
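For reference, the basic mechanics of soft prompt tuning (generic, not the paper's two-step recipe): a small matrix of learnable virtual-token embeddings is prepended to the input embeddings while the backbone stays frozen, so only the prompt parameters are updated during text adaptation.

```python
import torch
import torch.nn as nn

class SoftPrompt(nn.Module):
    """Learnable virtual tokens prepended to frozen-model input embeddings."""
    def __init__(self, n_virtual_tokens: int, d_model: int):
        super().__init__()
        self.prompt = nn.Parameter(torch.randn(n_virtual_tokens, d_model) * 0.02)

    def forward(self, input_embeds: torch.Tensor) -> torch.Tensor:
        # input_embeds: [batch, seq, d_model] from the frozen embedding layer
        batch = input_embeds.shape[0]
        p = self.prompt.unsqueeze(0).expand(batch, -1, -1)
        return torch.cat([p, input_embeds], dim=1)

# Only soft_prompt.parameters() go to the optimizer; the backbone stays frozen.
soft_prompt = SoftPrompt(n_virtual_tokens=20, d_model=768)
x = torch.randn(2, 10, 768)
print(soft_prompt(x).shape)  # torch.Size([2, 30, 768])
```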
zh

[NLP-57] Analysing Public Transport User Sentiment on Low Resource Multilingual Data

【速读】: This study aims to improve the Quality of Service (QoS) and user experience of public transport systems in Sub-Saharan African countries. It mines commuter opinions from social media data (X, formerly Twitter) in Kenya, Tanzania, and South Africa, covering sentiment toward rail, mini-bus taxis, and buses. The key lies in multilingual opinion mining with natural language processing (NLP): pretrained language models (PLMs) such as AfriBERTa, AfroXLMR, AfroLM, and PuoBERTa handle the linguistic diversity and code-switching in the data, allowing insights to be extracted from under-resourced languages. Sentiment was predominantly negative in South Africa and Kenya, while Tanzania's mostly positive sentiment stemmed from the advertising nature of the tweets. Feature extraction with a Word2Vec model and K-Means clustering further revealed semantic relationships and primary themes in the datasets. By prioritizing user experiences and sentiment, the study lays groundwork for more responsive, user-centered public transport systems and more sustainable urban mobility.

链接: https://arxiv.org/abs/2412.06951
作者: Rozina L. Myoya,Vukosi Marivate,Idris Abdulmumin
关键词-EN: Public transport systems, Quality of Service, improve the Quality, Public transport, transport systems
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Public transport systems in many Sub-Saharan countries often receive less attention compared to other sectors, underscoring the need for innovative solutions to improve the Quality of Service (QoS) and overall user experience. This study explored commuter opinion mining to understand sentiments toward existing public transport systems in Kenya, Tanzania, and South Africa. We used a qualitative research design, analysing data from X (formerly Twitter) to assess sentiments across rail, mini-bus taxis, and buses. By leveraging Multilingual Opinion Mining techniques, we addressed the linguistic diversity and code-switching present in our dataset, thus demonstrating the application of Natural Language Processing (NLP) in extracting insights from under-resourced languages. We employed PLMs such as AfriBERTa, AfroXLMR, AfroLM, and PuoBERTa to conduct the sentiment analysis. The results revealed predominantly negative sentiments in South Africa and Kenya, while the Tanzanian dataset showed mainly positive sentiments due to the advertising nature of the tweets. Furthermore, feature extraction using the Word2Vec model and K-Means clustering illuminated semantic relationships and primary themes found within the different datasets. By prioritising the analysis of user experiences and sentiments, this research paves the way for developing more responsive, user-centered public transport systems in Sub-Saharan countries, contributing to the broader goal of improving urban mobility and sustainability.

[NLP-58] When Every Token Counts: Optimal Segmentation for Low-Resource Language Models COLING2025

【Quick Read】: This paper examines the limitations of traditional greedy tokenization in Natural Language Processing (NLP), in particular how subword tokenizers such as Byte-Pair Encoding (BPE) should be optimized across model scales and languages. The key finding, established through extensive experiments, is that an optimized BPE configuration significantly reduces token counts compared to greedy segmentation, yielding gains in token-saving percentage and model performance, especially for smaller models. Evaluating tokenization on generation and classification tasks, the paper argues that compression-optimized tokenization strategies hold substantial promise for multilingual and low-resource language applications, pointing to a new direction for inclusive NLP research.

Link: https://arxiv.org/abs/2412.06926
Authors: Bharath Raj S,Garvit Suri,Vikrant Dewangan,Raghav Sonavane
Keywords-EN: Natural Language Processing, directly impacting model, Traditional greedy tokenization, step in Natural, Language Processing
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: LoResLM @ COLING 2025

Abstract:Traditional greedy tokenization methods have been a critical step in Natural Language Processing (NLP), influencing how text is converted into tokens and directly impacting model performance. While subword tokenizers like Byte-Pair Encoding (BPE) are widely used, questions remain about their optimality across model scales and languages. In this work, we demonstrate through extensive experiments that an optimal BPE configuration significantly reduces token count compared to greedy segmentation, yielding improvements in token-saving percentages and performance benefits, particularly for smaller models. We evaluate tokenization performance across various intrinsic and extrinsic tasks, including generation and classification. Our findings suggest that compression-optimized tokenization strategies could provide substantial advantages for multilingual and low-resource language applications, highlighting a promising direction for further research and inclusive NLP.
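
The greedy-vs-optimal gap is easy to demonstrate. Given a fixed subword vocabulary, a short dynamic program finds the segmentation with the fewest tokens, while longest-match greedy can do much worse. This illustrates the general principle, not the paper's BPE configuration search (the vocabulary and string below are made up):

```python
def greedy_segment(text, vocab):
    """Longest-match-first segmentation."""
    out, i = [], 0
    while i < len(text):
        for j in range(len(text), i, -1):        # try the longest piece first
            if text[i:j] in vocab:
                out.append(text[i:j]); i = j
                break
        else:
            out.append(text[i]); i += 1          # fall back to a single char
    return out

def optimal_segment(text, vocab):
    """DP: best[i] = fewest tokens covering text[:i]."""
    n = len(text)
    best = [0] + [n + 1] * n
    back = [0] * (n + 1)
    for i in range(1, n + 1):
        for j in range(i):
            if text[j:i] in vocab and best[j] + 1 < best[i]:
                best[i], back[i] = best[j] + 1, j
    out, i = [], n
    while i > 0:
        out.append(text[back[i]:i]); i = back[i]
    return out[::-1]

vocab = {"un", "unhap", "happiness", "p", "i", "n", "e", "s", "h", "a", "u"}
print(greedy_segment("unhappiness", vocab))   # ['unhap','p','i','n','e','s','s'] -> 7 tokens
print(optimal_segment("unhappiness", vocab))  # ['un', 'happiness']              -> 2 tokens
```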

[NLP-59] LLMs for Generalizable Language-Conditioned Policy Learning under Minimal Data Requirements

【Quick Read】: This paper tackles the problem that existing reinforcement learning methods for complex multi-step decision-making rely on expensive labeled datasets or live experimentation and generalize poorly to unseen goals and states. The key to the solution is TEDUO, a novel training pipeline for offline language-conditioned policy learning. TEDUO operates on easy-to-obtain unlabeled data and leverages the prior knowledge and instruction-following abilities of large language models (LLMs) to improve the fidelity of pre-collected offline data and to enable flexible generalization to new goals and states. Experiments show that the dual role of LLMs in the framework, as data enhancers and as generalizers, enables effective and data-efficient learning of generalizable language-conditioned policies.

Link: https://arxiv.org/abs/2412.06877
Authors: Thomas Pouplin,Katarzyna Kobalczyk,Hao Sun,Mihaela van der Schaar
Keywords-EN: multi-step decision-making tasks, approaches typically require, typically require expensive, require expensive labeled, existing reinforcement learning
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:To develop autonomous agents capable of executing complex, multi-step decision-making tasks as specified by humans in natural language, existing reinforcement learning approaches typically require expensive labeled datasets or access to real-time experimentation. Moreover, conventional methods often face difficulties in generalizing to unseen goals and states, thereby limiting their practical applicability. This paper presents TEDUO, a novel training pipeline for offline language-conditioned policy learning. TEDUO operates on easy-to-obtain, unlabeled datasets and is suited for the so-called in-the-wild evaluation, wherein the agent encounters previously unseen goals and states. To address the challenges posed by such data and evaluation settings, our method leverages the prior knowledge and instruction-following capabilities of large language models (LLMs) to enhance the fidelity of pre-collected offline data and enable flexible generalization to new goals and states. Empirical results demonstrate that the dual role of LLMs in our framework-as data enhancers and generalizers-facilitates both effective and data-efficient learning of generalizable language-conditioned policies.

[NLP-60] Real-Time Performance Optimization of Travel Reservation Systems Using AI and Microservices

【Quick Read】: This paper addresses real-time optimization of travel reservation systems that must handle large data and transaction volumes, focusing on system latency, load balancing, and data consistency. The key to the solution is a hybrid framework combining Artificial Intelligence (AI) with a microservices architecture. AI algorithms forecast demand patterns, optimize resource allocation, and strengthen decision-making on top of the microservices architecture, which decentralizes system components to improve scalability and fault tolerance and to reduce downtime. The microservices handle demand at different scales under uneven traffic patterns, so the system copes with peak loads and spikes while reducing latency and maintaining service quality. Compared with traditional monolithic reservation models, the hybrid framework markedly improves processing time and resource utilization, demonstrating the transformative potential of AI and microservices for travel reservation systems.

Link: https://arxiv.org/abs/2412.06874
Authors: Biman Barua,M. Shamim Kaiser
Keywords-EN: travel reservation systems, transaction volumes, rapid growth, reservation systems, Artificial Intelligence
Subjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE); Computation and Language (cs.CL); Programming Languages (cs.PL)
Comments: 19 pages, 12 figures

Abstract:The rapid growth of the travel industry has increased the need for real-time optimization in reservation systems that must handle huge data and transaction volumes. This study proposes a hybrid framework that combines an Artificial Intelligence and a Microservices approach for performance optimization of the system. The AI algorithms forecast demand patterns, optimize the allocation of resources, and enhance decision-making, driven by a Microservices architecture that decentralizes system components for scalability, fault tolerance, and reduced downtime. The proposed model focuses on major problems associated with travel reservation systems, such as system latency, load balancing, and data consistency. It endows the systems with AI-based predictive models that improve the ability to forecast user demands. Microservices also handle different scales of demand during uneven traffic patterns. Hence, both aspects ensure better handling of peak loads and spikes while minimizing delays and ensuring high service quality. A comparison was made between traditional monolithic reservation models and the new AI-Microservices model. The analysis shows a drastic improvement in processing times, while system uptime and resource utilization demonstrate the capability of AI and microservices to transform the travel industry in terms of reservation. This research focused on AI and Microservices for real-time optimization, providing critical insight and practical recommendations for upgrading travel reservation systems with this technology.

[NLP-61] Political-LLM: Large Language Models in Political Science

【Quick Read】: This paper asks how to systematically understand and advance the application of large language models (LLMs) in computational political science. The key to the solution is Political-LLM, a multidisciplinary framework that categorizes and analyses existing work from two perspectives: political science and computational methodology. From the political science perspective, it highlights the role of LLMs in automating predictive and generative tasks, simulating behavioral dynamics, and improving causal inference through tools such as counterfactual generation; from the computational perspective, it surveys advances in data preparation, fine-tuning, and evaluation methods tailored to political contexts. The paper also identifies key challenges for future research, including building domain-specific datasets, addressing bias and fairness, incorporating human expertise, and redefining evaluation criteria to fit the distinctive requirements of computational political science.

Link: https://arxiv.org/abs/2412.06864
Authors: Lincan Li,Jiaqi Li,Catherine Chen,Fred Gui,Hongjia Yang,Chenxiao Yu,Zhengguang Wang,Jianing Cai,Junlong Aaron Zhou,Bolin Shen,Alex Qian,Weixin Chen,Zhongkai Xue,Lichao Sun,Lifang He,Hanjie Chen,Kaize Ding,Zijian Du,Fangzhou Mu,Jiaxin Pei,Jieyu Zhao,Swabha Swayamdipta,Willie Neiswanger,Hua Wei,Xiyang Hu,Shixiang Zhu,Tianlong Chen,Yingzhou Lu,Yang Shi,Lianhui Qin,Tianfan Fu,Zhengzhong Tu,Yuzhe Yang,Jaemin Yoo,Jiaheng Zhang,Ryan Rossi,Liang Zhan,Liang Zhao,Emilio Ferrara,Yan Liu,Furong Huang,Xiangliang Zhang,Lawrence Rothenberg,Shuiwang Ji,Philip S. Yu,Yue Zhao,Yushun Dong
Keywords-EN: large language models, policy impact assessment, political science, sentiment analysis, computational political science
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 54 Pages, 9 Figures

Abstract:In recent years, large language models (LLMs) have been widely adopted in political science tasks such as election prediction, sentiment analysis, policy impact assessment, and misinformation detection. Meanwhile, the need to systematically understand how LLMs can further revolutionize the field also becomes urgent. In this work, we–a multidisciplinary team of researchers spanning computer science and political science–present the first principled framework termed Political-LLM to advance the comprehensive understanding of integrating LLMs into computational political science. Specifically, we first introduce a fundamental taxonomy classifying the existing explorations into two perspectives: political science and computational methodologies. In particular, from the political science perspective, we highlight the role of LLMs in automating predictive and generative tasks, simulating behavior dynamics, and improving causal inference through tools like counterfactual generation; from a computational perspective, we introduce advancements in data preparation, fine-tuning, and evaluation methods for LLMs that are tailored to political contexts. We identify key challenges and future directions, emphasizing the development of domain-specific datasets, addressing issues of bias and fairness, incorporating human expertise, and redefining evaluation criteria to align with the unique requirements of computational political science. Political-LLM seeks to serve as a guidebook for researchers to foster an informed, ethical, and impactful use of Artificial Intelligence in political science. Our online resource is available at: this http URL.

[NLP-62] Taming Sensitive Weights: Noise Perturbation Fine-tuning for Robust LLM Quantization

【Quick Read】: This paper addresses the problem that, during large language model (LLM) quantization, certain weights known as outliers are highly sensitive to quantization noise, forcing existing methods to keep them in floating point or higher precision, which hurts the hardware deployment efficiency of mixed-precision models. The key to the solution is to reduce the influence of these sensitive weights on quantization error by lowering the loss Hessian trace associated with the outliers. The paper proposes Noise Perturbation Fine-tuning (NPFT), which adds random weight perturbations to the identified outliers during parameter-efficient fine-tuning (PEFT), reducing their sensitivity so that quantized model performance improves without any special treatment of the outliers. Experiments show that NPFT delivers stable quantization gains on OPT and LLaMA models and improves inference efficiency.

Link: https://arxiv.org/abs/2412.06858
Authors: Dongwei Wang,Huanrui Yang
Keywords-EN: enable efficient LLM, efficient LLM serving, LLM serving, limited resource, critical step
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: Submitted to CPAL 2024

Abstract:Quantization is a critical step to enable efficient LLM serving under limited resource. However, previous research observes that certain weights in the LLM, known as outliers, are significantly sensitive to quantization noises. Existing quantization methods leave these outliers as floating points or higher precisions to retain performance, posting challenges on the efficient hardware deployment of the mixed-precision model. This work investigates an alternative way to tame the sensitive weights’ impact on the quantization error, by reducing the loss Hessian trace with respect to outliers through an efficient fine-tuning process. We propose Noise Perturbation Fine-tuning (NPFT), which identifies outlier weights and add random weight perturbations on the outliers as the model going through a PEFT optimization. NPFT tames the sensitivity of outlier weights so that the quantized model performance can be improved without special treatment to the outliers. When applied to OPT and LLaMA models, our NPFT method achieves stable performance improvements for both uniform and non-uniform quantizers, while also offering better inference efficiency. Notably, the simplest RTN can achieve performance on par with GPTQ using our NPFT on LLaMA2-7B-4bits benchmark.
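
The perturb-then-optimize mechanic of NPFT can be sketched in a few lines: mark the largest-magnitude weights as outliers, add Gaussian noise only to them on every forward pass, and backpropagate as usual so the optimizer is pushed toward flatter minima around those weights. A minimal single-layer toy (the threshold, noise scale, and loss are illustrative; the paper applies this inside a PEFT run on OPT/LLaMA models):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
layer = nn.Linear(64, 64)

with torch.no_grad():                         # mark top-1% magnitude weights as outliers
    thresh = layer.weight.abs().quantile(0.99)
    outlier_mask = (layer.weight.abs() > thresh).float()

opt = torch.optim.SGD(layer.parameters(), lr=1e-2)
x, y = torch.randn(8, 64), torch.randn(8, 64)

for _ in range(10):
    opt.zero_grad()
    noise = torch.randn_like(layer.weight) * 1e-3 * outlier_mask
    out = F.linear(x, layer.weight + noise, layer.bias)   # perturb only the outliers
    loss = F.mse_loss(out, y)
    loss.backward()                           # gradients see the perturbed forward pass
    opt.step()
```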

[NLP-63] GL-Fusion: Rethinking the Combination of Graph Neural Network and Large Language Model

【Quick Read】: This paper targets the limitations of existing approaches that combine large language models (LLMs) with graph neural networks (GNNs): LLM-centered models struggle to capture graph structure effectively, while GNN-centered models have trouble handling rich semantics and cannot produce language output. The key to the solution is a new architecture that deeply integrates GNN and LLM through three innovations: (1) Structure-Aware Transformers, which embed the GNN's message-passing capability directly into the LLM's transformer layers so that textual and structural information are processed simultaneously; (2) Graph-Text Cross-Attention, which attends over the full, uncompressed text on graph nodes and edges to achieve complete semantic integration; and (3) a GNN-LLM Twin Predictor, which allows the LLM's flexible autoregressive generation alongside the GNN's scalable one-pass prediction.

Link: https://arxiv.org/abs/2412.06849
Authors: Haotong Yang,Xiyuan Wang,Qian Tao,Shuxian Hu,Zhouchen Lin,Muhan Zhang
Keywords-EN: Graph Neural Networks, integrating Large Language, Large Language Models, Neural Networks, Recent research
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: under review

Abstract:Recent research on integrating Large Language Models (LLMs) with Graph Neural Networks (GNNs) typically follows two approaches: LLM-centered models, which convert graph data into tokens for LLM processing, and GNN-centered models, which use LLMs to encode text features into node and edge representations for GNN input. LLM-centered models often struggle to capture graph structures effectively, while GNN-centered models compress variable-length textual data into fixed-size vectors, limiting their ability to understand complex semantics. Additionally, GNN-centered approaches require converting tasks into a uniform, manually-designed format, restricting them to classification tasks and preventing language output. To address these limitations, we introduce a new architecture that deeply integrates GNN with LLM, featuring three key innovations: (1) Structure-Aware Transformers, which incorporate GNN’s message-passing capabilities directly into LLM’s transformer layers, allowing simultaneous processing of textual and structural information and generating outputs from both GNN and LLM; (2) Graph-Text Cross-Attention, which processes full, uncompressed text from graph nodes and edges, ensuring complete semantic integration; and (3) GNN-LLM Twin Predictor, enabling LLM’s flexible autoregressive generation alongside GNN’s scalable one-pass prediction. GL-Fusion achieves outstanding performance on various tasks. Notably, it achieves state-of-the-art performance on OGBN-Arxiv and OGBG-Code2.
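
Of the three components, Graph-Text Cross-Attention is the most self-contained to sketch: the LLM's token states act as queries and the (uncompressed) node/edge text encodings act as keys and values, with a residual connection back into the token stream. The dimensions and toy tensors below are assumptions, not GL-Fusion's actual modules:

```python
import torch
import torch.nn as nn

d = 64
xattn = nn.MultiheadAttention(embed_dim=d, num_heads=4, batch_first=True)

text_tokens = torch.randn(2, 10, d)    # LLM hidden states for 10 tokens
graph_text = torch.randn(2, 5, d)      # encodings of the full text on 5 nodes/edges

fused, attn_w = xattn(query=text_tokens, key=graph_text, value=graph_text)
text_tokens = text_tokens + fused       # residual: inject graph semantics into tokens
print(text_tokens.shape, attn_w.shape)  # torch.Size([2, 10, 64]) torch.Size([2, 10, 5])
```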

[NLP-64] Fully Open Source Moxin-7B Technical Report

【Quick Read】: This paper addresses transparency, reproducibility, and safety concerns around open-source large language models (LLMs), in particular the innovation barriers created when models are released without key components (such as training code and data) or under restrictive licenses. The key to the solution is Moxin 7B, a fully open-source LLM developed in accordance with the Model Openness Framework (MOF), a ranked classification system for AI model completeness and openness. Through the comprehensive release of pre-training code and configurations, training and fine-tuning datasets, and intermediate and final checkpoints, Moxin 7B reaches the framework's highest classification level of "open science", guaranteeing completeness and openness and thereby fostering innovation and research on LLMs.

Link: https://arxiv.org/abs/2412.06845
Authors: Pu Zhao,Xuan Shen,Zhenglun Kong,Yixin Shen,Sung-En Chang,Timothy Rupprecht,Lei Lu,Enfu Nan,Changdi Yang,Yumei He,Xingchen Xu,Yu Huang,Wei Wang,Yue Chen,Yong He,Yanzhi Wang
Keywords-EN: Large Language Models, Large Language, Language Models, significant transformation, open-source LLMs
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:Recently, Large Language Models (LLMs) have undergone a significant transformation, marked by a rapid rise in both their popularity and capabilities. Leading this evolution are proprietary LLMs like GPT-4 and GPT-o1, which have captured widespread attention in the AI community due to their remarkable performance and versatility. Simultaneously, open-source LLMs, such as LLaMA and Mistral, have made great contributions to the ever-increasing popularity of LLMs due to the ease to customize and deploy the models across diverse applications. Although open-source LLMs present unprecedented opportunities for innovation and research, the commercialization of LLMs has raised concerns about transparency, reproducibility, and safety. Many open-source LLMs fail to meet fundamental transparency requirements by withholding essential components like training code and data, and some use restrictive licenses whilst claiming to be “open-source,” which may hinder further innovations on LLMs. To mitigate this issue, we introduce Moxin 7B, a fully open-source LLM developed in accordance with the Model Openness Framework (MOF), a ranked classification system that evaluates AI models based on model completeness and openness, adhering to principles of open science, open source, open data, and open access. Our model achieves the highest MOF classification level of “open science” through the comprehensive release of pre-training code and configurations, training and fine-tuning datasets, and intermediate and final checkpoints. Experiments show that our model achieves superior performance in zero-shot evaluation compared with popular 7B models and performs competitively in few-shot evaluation.

[NLP-65] Semantic loss guided data efficient supervised fine tuning for Safe Responses in LLMs

【Quick Read】: This paper addresses the problem of large language models (LLMs) generating unsafe responses to toxic prompts. The key to the solution is to use only a small set of unsafe responses, easily obtained from the unsafe LLM itself, and to steer the model away from generating them by combining a semantic cost with a negative Earth Mover Distance (EMD) loss. The paper also proposes a novel lower bound on the EMD loss that enables more efficient optimization. The method outperforms baselines in both performance and data efficiency, and the paper further examines the nuanced effects of over-alignment and the potential degradation of language capabilities when training with contrastive data.

Link: https://arxiv.org/abs/2412.06843
Authors: Yuxiao Lu,Arunesh Sinha,Pradeep Varakantham
Keywords-EN: Large Language Models, Large Language, Language Models, generating unsafe responses, Large
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:Large Language Models (LLMs) generating unsafe responses to toxic prompts is a significant issue in their applications. While various efforts aim to address this safety concern, previous approaches often demand substantial human data collection or rely on the less dependable option of using another LLM to generate corrective data. In this paper, we aim to take this problem and overcome limitations of requiring significant high-quality human data. Our method requires only a small set of unsafe responses to toxic prompts, easily obtained from the unsafe LLM itself. By employing a semantic cost combined with a negative Earth Mover Distance (EMD) loss, we guide the LLM away from generating unsafe responses. Additionally, we propose a novel lower bound for EMD loss, enabling more efficient optimization. Our results demonstrate superior performance and data efficiency compared to baselines, and we further examine the nuanced effects of over-alignment and potential degradation of language capabilities when using contrastive data.
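
The "negative EMD" idea can be illustrated with the closed-form 1D Earth Mover Distance between two distributions over an ordered support (the absolute difference of their CDFs, summed). The sketch below pushes a toy distribution away from an "unsafe" reference while a second term keeps it near a "safe" one; this is only an assumed toy of the push-away mechanic, not the paper's semantic cost or its EMD lower bound:

```python
import torch

def emd_1d(p, q):
    """Exact EMD between distributions on the same ordered 1D support."""
    return (torch.cumsum(p, -1) - torch.cumsum(q, -1)).abs().sum(-1)

torch.manual_seed(0)
logits = torch.randn(16, requires_grad=True)          # toy model "policy"
unsafe = torch.softmax(torch.randn(16), -1)           # distribution to move away from
safe = torch.softmax(torch.randn(16), -1)             # distribution to stay near

opt = torch.optim.Adam([logits], lr=0.1)
for _ in range(100):
    opt.zero_grad()
    p = torch.softmax(logits, -1)
    loss = emd_1d(p, safe) - 0.5 * emd_1d(p, unsafe)  # negative-EMD term repels
    loss.backward()
    opt.step()
```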

[NLP-66] SLA Management in Reconfigurable Multi-Agent RAG: A Systems Approach to Question Answering

【Quick Read】: This paper studies how a multi-agent retrieval-augmented generation (RAG) system can be dynamically reconfigured to satisfy the diverse Service Level Agreements (SLAs) and Quality of Service (QoS) requirements of real-world question answering (QA) applications. The key to the solution is to integrate task-specific non-functional requirements, such as answer quality, cost, and latency, into the system and to map Service Level Objectives (SLOs) onto system-level parameters, enabling dynamic reconfiguration. This yields optimal results under resource constraints; by adapting the system to query intent and operating conditions, it balances performance against resource utilization and meets the SLOs of different query types.

Link: https://arxiv.org/abs/2412.06832
Authors: Michael Iannelli,Sneha Kuchipudi,Vera Dvorak
Keywords-EN: Large Language Models, Retrieval Augmented Generation, Retrieval Augmented, static knowledge bases, decoupling reasoning capabilities
Subjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Distributed, Parallel, and Cluster Computing (cs.DC)
Comments:

Abstract:Retrieval Augmented Generation (RAG) enables Large Language Models (LLMs) to generalize to new information by decoupling reasoning capabilities from static knowledge bases. Traditional RAG enhancements have explored vertical scaling – assigning subtasks to specialized modules – and horizontal scaling – replicating tasks across multiple agents – to improve performance. However, real-world applications impose diverse Service Level Agreements (SLAs) and Quality of Service (QoS) requirements, involving trade-offs among objectives such as reducing cost, ensuring answer quality, and adhering to specific operational constraints. In this work, we present a systems-oriented approach to multi-agent RAG tailored for real-world Question Answering (QA) applications. By integrating task-specific non-functional requirements – such as answer quality, cost, and latency – into the system, we enable dynamic reconfiguration to meet diverse SLAs. Our method maps these Service Level Objectives (SLOs) to system-level parameters, allowing the generation of optimal results within specified resource constraints. We conduct a case study in the QA domain, demonstrating how dynamic re-orchestration of a multi-agent RAG system can effectively manage the trade-off between answer quality and cost. By adjusting the system based on query intent and operational conditions, we systematically balance performance and resource utilization. This approach allows the system to meet SLOs for various query types, showcasing its practicality for real-world applications.

[NLP-67] TransitGPT: A Generative AI-based framework for interacting with GTFS data using Large Language Models

【Quick Read】: This paper asks how large language models (LLMs) can make querying and manipulating public transit data (GTFS feeds) easy for users without deep knowledge of GTFS or programming. The key to the solution is TransitGPT, a framework that guides LLMs to generate Python code that extracts and processes the GTFS data relevant to a user's query and automatically executes that code on a server where the feed is stored. The approach requires no fine-tuning of the LLMs and no direct access to the GTFS feeds by the model; code generation is steered entirely by prompts. This supports a wide range of tasks over GTFS data, including retrieval, computation, and interactive visualization, substantially improving the accessibility and usability of transit data.

Link: https://arxiv.org/abs/2412.06831
Authors: Saipraneeth Devunuri,Lewis Lehe
Keywords-EN: Large Language Models, leverages Large Language, answer natural language, natural language queries, Language Models
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Applications (stat.AP)
Comments:

Abstract:This paper introduces a framework that leverages Large Language Models (LLMs) to answer natural language queries about General Transit Feed Specification (GTFS) data. The framework is implemented in a chatbot called TransitGPT with open-source code. TransitGPT works by guiding LLMs to generate Python code that extracts and manipulates GTFS data relevant to a query, which is then executed on a server where the GTFS feed is stored. It can accomplish a wide range of tasks, including data retrieval, calculations, and interactive visualizations, without requiring users to have extensive knowledge of GTFS or programming. The LLMs that produce the code are guided entirely by prompts, without fine-tuning or access to the actual GTFS feeds. We evaluate TransitGPT using GPT-4o and Claude-3.5-Sonnet LLMs on a benchmark dataset of 100 tasks, to demonstrate its effectiveness and versatility. The results show that TransitGPT can significantly enhance the accessibility and usability of transit data.
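
The generate-then-execute loop at the heart of TransitGPT can be sketched without any real LLM call: a prompt describing the GTFS tables is sent to a model, the returned Python snippet is executed in a namespace that holds the feed, and the result goes back to the user. `llm_generate` below is a hypothetical stand-in for the actual API call, and the tiny `stops` table replaces a real GTFS feed:

```python
import pandas as pd

def llm_generate(prompt: str) -> str:
    """Hypothetical stand-in for the LLM call (e.g., GPT-4o or Claude-3.5-Sonnet)."""
    return 'result = int(stops["stop_name"].str.contains("Union").sum())'

# A miniature GTFS stops table standing in for a real feed stored on the server.
stops = pd.DataFrame({"stop_id": ["s1", "s2", "s3"],
                      "stop_name": ["Union Station", "Main St", "Union Ave"]})

query = "How many stops mention 'Union'?"
code = llm_generate(f"Tables: stops(stop_id, stop_name). "
                    f"Write Python that sets `result`. Query: {query}")
namespace = {"stops": stops}
exec(code, namespace)          # in the real system this runs server-side, sandboxed
print(namespace["result"])     # -> 2
```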

[NLP-68] Enhancing LLMs for Impression Generation in Radiology Reports through a Multi-Agent System

【Quick Read】: This paper addresses generating the "impression" section of radiology reports from the "findings" section. The key to the solution is RadCouncil, a multi-agent large language model (LLM) framework in which three specialized agents cooperate: (1) a Retrieval agent fetches similar reports from a vector database; (2) a Radiologist agent drafts the impression based on the findings of the given report plus the retrieved exemplars; and (3) a Reviewer agent evaluates the generated impression and provides feedback. Through this multi-agent interaction, RadCouncil outperforms a single-agent approach on dimensions such as diagnostic accuracy, stylistic concordance, and clarity, illustrating the potential of multiple interacting LLM agents with dedicated tasks for building more robust and adaptable healthcare AI solutions.

Link: https://arxiv.org/abs/2412.06828
Authors: Fang Zeng,Zhiliang Lyu,Quanzheng Li,Xiang Li
Keywords-EN: Large Language Model, multi-agent Large Language, Language Model, Large Language, multi-agent Large
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:This study introduces “RadCouncil,” a multi-agent Large Language Model (LLM) framework designed to enhance the generation of impressions in radiology reports from the finding section. RadCouncil comprises three specialized agents: 1) a “Retrieval” Agent that identifies and retrieves similar reports from a vector database, 2) a “Radiologist” Agent that generates impressions based on the finding section of the given report plus the exemplar reports retrieved by the Retrieval Agent, and 3) a “Reviewer” Agent that evaluates the generated impressions and provides feedback. The performance of RadCouncil was evaluated using both quantitative metrics (BLEU, ROUGE, BERTScore) and qualitative criteria assessed by GPT-4, using chest X-ray as a case study. Experiment results show improvements in RadCouncil over the single-agent approach across multiple dimensions, including diagnostic accuracy, stylistic concordance, and clarity. This study highlights the potential of utilizing multiple interacting LLM agents, each with a dedicated task, to enhance performance in specialized medical tasks and the development of more robust and adaptable healthcare AI solutions.

[NLP-69] Guidance is All You Need: Temperature-Guided Reasoning in Large Language Models

【Quick Read】: This paper targets the limited logical reasoning ability of conventional chain-of-thought methods. The key to the solution is Quasar-1, a novel architecture that brings temperature-guided reasoning to large language models through a Token Temperature Mechanism (TTM) and a Guided Sequence of Thought (GSoT). By distinguishing "hot" tokens from "cold" tokens, the method dynamically modulates token importance, prioritizing contextually relevant tokens while still exploiting supplementary information. The mechanism is proven mathematically to converge to optimal reasoning paths with exponential guarantees, and it yields significant improvements in reasoning accuracy and computational efficiency across a wide range of tasks.

Link: https://arxiv.org/abs/2412.06822
Authors: Eyad Gomaa,Gomaa Salah
Keywords-EN: Sequence of Thought, Guided Sequence, Token Temperature Mechanism, large language models, Temperature Mechanism
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:We present Quasar-1, a novel architecture that introduces temperature-guided reasoning to large language models through the Token Temperature Mechanism (TTM) and Guided Sequence of Thought (GSoT). Our approach leverages the concept of hot and cold tokens, where hot tokens are prioritized for their contextual relevance, while cold tokens provide supplementary information. This dynamic modulation of token importance enables the model to achieve superior logical reasoning capabilities compared to traditional chain-of-thought approaches. Through rigorous mathematical analysis, we prove that our temperature-guided attention mechanism converges to optimal reasoning paths with exponential guarantees. Empirical results show significant improvements in reasoning accuracy and computational efficiency across a wide range of tasks, making advanced AI reasoning accessible to a broader range of applications.
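
The abstract does not spell out the TTM's exact form, but one simple way to realize "temperature-guided" attention is to reweight the attention over key tokens by a per-token temperature and renormalize, so hot (contextually relevant) tokens receive more mass. The sketch below is exactly that assumption, not Quasar-1's published mechanism:

```python
import torch

def temperature_guided_attention(q, k, v, token_temp):
    """q, k, v: (T, d); token_temp: (T,) positive, larger = 'hotter' key token."""
    scores = q @ k.T / (q.shape[-1] ** 0.5)
    w = torch.softmax(scores, dim=-1) * token_temp   # upweight hot key tokens
    w = w / w.sum(dim=-1, keepdim=True)              # renormalize each row
    return w @ v

T, d = 6, 16
q, k, v = torch.randn(T, d), torch.randn(T, d), torch.randn(T, d)
temp = torch.tensor([2.0, 1.0, 1.0, 0.25, 0.25, 0.25])  # first token is "hot"
out = temperature_guided_attention(q, k, v, temp)
print(out.shape)   # torch.Size([6, 16])
```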

[NLP-70] Understanding the Impact of News Articles on the Movement of Market Index: A Case on Nifty 50

【Quick Read】: This paper asks whether news on different topics (such as sports, politics, and markets) affects a stock price or index (here, the Nifty 50) differently, whereas most prior work considered only stock-specific news or overall sentiment scores. The key to the solution is to analyse the sentiment scores of news items grouped by topic and assess their impact on the movement of the Nifty 50 index, filling the gap left by earlier studies that ignored topic-level sentiment. The study finds that the sentiment scores of news items on other topics also have a significant impact on the movement of the index.

Link: https://arxiv.org/abs/2412.06794
Authors: Subhasis Dasgupta,Pratik Satpati,Ishika Choudhary,Jaydip Sen
Keywords-EN: recent past, stock price, stock, movement, movement of Nifty
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG); Statistical Finance (q-fin.ST)
Comments: This is a pre-print version of the actual paper presented in the IEEE conference SILCON2024 in the year 2024 at NIT Silchar, Assam, India. The paper contains 2 figures and 4 tables

Abstract:In the recent past, there have been several works on the prediction of stock prices using different methods. Sentiment analysis of news and tweets and relating them to the movement of stock prices have already been explored. But, when we talk about the news, there can be several topics such as politics, markets, sports etc. It was observed that most of the prior analyses dealt with news or comments associated with particular stock prices only, or the researchers dealt with overall sentiment scores only. However, it is quite possible that different topics have different levels of impact on the movement of a stock price or an index. The current study focused on bridging this gap by analysing the movement of the Nifty 50 index with respect to the sentiments associated with news items related to various topics such as sports, politics, markets etc. The study established that sentiment scores of news items on different other topics also have a significant impact on the movement of the index.

Computer Vision

[CV-0] Video Motion Transfer with Diffusion Transformers

【Quick Read】: This paper addresses transferring the motion of a reference video to a newly synthesized one. The key to the solution is DiTFlow, designed specifically for Diffusion Transformers (DiT): a pre-trained DiT processes the reference video to analyse cross-frame attention maps and extract a patch-wise motion signal called the Attention Motion Flow (AMF). The latent denoising process is then guided in an optimization-based, training-free manner by optimizing the latents with an AMF loss, producing videos that reproduce the reference motion. The same optimization strategy is also applied to the transformer's positional embeddings, further boosting zero-shot motion transfer.

Link: https://arxiv.org/abs/2412.07776
Authors: Alexander Pondaven,Aliaksandr Siarohin,Sergey Tulyakov,Philip Torr,Fabio Pizzati
Keywords-EN: specifically for Diffusion, Attention Motion Flow, Diffusion Transformers, designed specifically, newly synthesized
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Project page: this https URL

Abstract:We propose DiTFlow, a method for transferring the motion of a reference video to a newly synthesized one, designed specifically for Diffusion Transformers (DiT). We first process the reference video with a pre-trained DiT to analyze cross-frame attention maps and extract a patch-wise motion signal called the Attention Motion Flow (AMF). We guide the latent denoising process in an optimization-based, training-free, manner by optimizing latents with our AMF loss to generate videos reproducing the motion of the reference one. We also apply our optimization strategy to transformer positional embeddings, granting us a boost in zero-shot motion transfer capabilities. We evaluate DiTFlow against recently published methods, outperforming all across multiple metrics and human evaluation.
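
The optimization-based, training-free guidance loop reduces to: extract a motion signal from the current latents, compare it to the reference AMF, and take gradient steps on the latents. Below, `extract_amf` is a hypothetical differentiable stand-in (a fixed linear projection) for the real attention-derived extractor, so the whole pipeline here is an assumed toy of the loop structure only:

```python
import torch

torch.manual_seed(0)
F_, P, D = 4, 16, 8                    # frames, patches per frame, latent dim
W = torch.randn(D, 2)                  # fixed stand-in for the attention-derived map

def extract_amf(z):
    """Hypothetical differentiable AMF extractor: latents -> per-patch motion."""
    return z @ W                       # (F_, P, 2) displacement field

ref_amf = extract_amf(torch.randn(F_, P, D))      # motion of the reference video
z = torch.randn(F_, P, D, requires_grad=True)     # latents being denoised
opt = torch.optim.Adam([z], lr=0.05)

for _ in range(200):                   # in DiTFlow this is interleaved with denoising
    opt.zero_grad()
    loss = ((extract_amf(z) - ref_amf) ** 2).mean()   # the AMF loss
    loss.backward()
    opt.step()
print(loss.item())                     # approaches 0 as the motion is matched
```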

[CV-1] Efficient Diversity-Preserving Diffusion Alignment via Gradient-Informed GFlowNets

【Quick Read】: This paper tackles the lack of sample diversity, loss of prior information, and slow convergence that pre-trained diffusion models suffer when fine-tuned on reward functions. The key to the solution is Nabla-GFlowNet (abbreviated \nabla-GFlowNet), the first generative flow network (GFlowNet) method to exploit the rich signal in reward gradients, together with an objective called \nabla-DB plus its residual variant, designed for prior-preserving diffusion alignment. The paper shows that Stable Diffusion, a large-scale text-conditioned image diffusion model, can be aligned quickly while preserving both diversity and the prior on a range of realistic reward functions.

Link: https://arxiv.org/abs/2412.07775
Authors: Zhen Liu,Tim Z. Xiao,Weiyang Liu,Yoshua Bengio,Dinghuai Zhang
Keywords-EN: target downstream tasks, commonly trains large, trains large diffusion, finetune pretrained diffusion, downstream tasks
Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments: Technical Report (35 pages, 31 figures)

Abstract:While one commonly trains large diffusion models by collecting datasets on target downstream tasks, it is often desired to align and finetune pretrained diffusion models on some reward functions that are either designed by experts or learned from small-scale datasets. Existing methods for finetuning diffusion models typically suffer from lack of diversity in generated samples, lack of prior preservation, and/or slow convergence in finetuning. Inspired by recent successes in generative flow networks (GFlowNets), a class of probabilistic models that sample with the unnormalized density of a reward function, we propose a novel GFlowNet method dubbed Nabla-GFlowNet (abbreviated as \nabla -GFlowNet), the first GFlowNet method that leverages the rich signal in reward gradients, together with an objective called \nabla -DB plus its variant residual \nabla -DB designed for prior-preserving diffusion alignment. We show that our proposed method achieves fast yet diversity- and prior-preserving alignment of Stable Diffusion, a large-scale text-conditioned image diffusion model, on different realistic reward functions.

[CV-2] UniReal: Universal Image Generation and Editing via Learning Real-world Dynamics

【Quick Read】: This paper addresses the diversity and consistency challenges shared by image generation and editing tasks with UniReal, a unified framework. The key to the solution is to treat image-level tasks as discontinuous video generation: varying numbers of input and output images are treated as frames, balancing consistency with visual variation and seamlessly supporting generation, editing, customization, composition, and related tasks. The method uses large-scale video data as a scalable source of universal supervision, letting the model learn world dynamics; it shows advanced capability in handling shadows, reflections, pose variation, and object interaction, and exhibits emergent capability for novel applications.

Link: https://arxiv.org/abs/2412.07774
Authors: Xi Chen,Zhifei Zhang,He Zhang,Yuqian Zhou,Soo Ye Kim,Qing Liu,Yijun Li,Jianming Zhang,Nanxuan Zhao,Yilin Wang,Hui Ding,Zhe Lin,Hengshuang Zhao
Keywords-EN: unified framework designed, unified framework, tasks, generation, Abstract
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: webpage: this https URL

Abstract:We introduce UniReal, a unified framework designed to address various image generation and editing tasks. Existing solutions often vary by tasks, yet share fundamental principles: preserving consistency between inputs and outputs while capturing visual variations. Inspired by recent video generation models that effectively balance consistency and variation across frames, we propose a unifying approach that treats image-level tasks as discontinuous video generation. Specifically, we treat varying numbers of input and output images as frames, enabling seamless support for tasks such as image generation, editing, customization, composition, etc. Although designed for image-level tasks, we leverage videos as a scalable source for universal supervision. UniReal learns world dynamics from large-scale videos, demonstrating advanced capability in handling shadows, reflections, pose variation, and object interaction, while also exhibiting emergent capability for novel applications.

[CV-3] From Slow Bidirectional to Fast Causal Video Generators

【Quick Read】: This paper addresses the bidirectional-attention bottleneck that keeps video diffusion models out of interactive applications: generating a single frame requires processing the whole sequence, including future frames. The key to the solution is adapting a pre-trained bidirectional diffusion transformer into a causal transformer that generates frames on the fly. Distribution matching distillation (DMD) is further extended to video, distilling a 50-step diffusion model into a 4-step generator to cut latency. The paper also introduces a student initialization scheme based on the teacher's ODE trajectories and an asymmetric distillation strategy (a bidirectional teacher supervising a causal student) to stabilize high-quality distillation and mitigate error accumulation in autoregressive generation, enabling long videos despite training on short clips. Thanks to KV caching, the model streams high-quality video at 9.4 FPS on a single GPU and supports zero-shot video-to-video translation, image-to-video generation, and dynamic prompting.

Link: https://arxiv.org/abs/2412.07772
Authors: Tianwei Yin,Qiang Zhang,Richard Zhang,William T. Freeman,Fredo Durand,Eli Shechtman,Xun Huang
Keywords-EN: interactive applications due, bidirectional attention dependencies, Current video diffusion, models achieve impressive generation, models achieve impressive
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project Page: this https URL

Abstract:Current video diffusion models achieve impressive generation quality but struggle in interactive applications due to bidirectional attention dependencies. The generation of a single frame requires the model to process the entire sequence, including the future. We address this limitation by adapting a pretrained bidirectional diffusion transformer to a causal transformer that generates frames on-the-fly. To further reduce latency, we extend distribution matching distillation (DMD) to videos, distilling 50-step diffusion model into a 4-step generator. To enable stable and high-quality distillation, we introduce a student initialization scheme based on teacher’s ODE trajectories, as well as an asymmetric distillation strategy that supervises a causal student model with a bidirectional teacher. This approach effectively mitigates error accumulation in autoregressive generation, allowing long-duration video synthesis despite training on short clips. Our model supports fast streaming generation of high quality videos at 9.4 FPS on a single GPU thanks to KV caching. Our approach also enables streaming video-to-video translation, image-to-video, and dynamic prompting in a zero-shot manner. We will release the code based on an open-source model in the future.
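
Why a causal generator is cheap to serve: each new frame token attends only to the cached keys/values of past frames, so per-frame cost does not require reprocessing the whole clip. A bare-bones single-head attention step with a KV cache (toy dimensions, no diffusion steps; not the paper's architecture):

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
d = 32
Wq, Wk, Wv = (torch.randn(d, d) / d ** 0.5 for _ in range(3))
k_cache, v_cache = [], []

def generate_step(frame_tok):
    """frame_tok: (1, d) latent token for the newest frame."""
    q = frame_tok @ Wq
    k_cache.append(frame_tok @ Wk)     # cache grows by one entry per frame
    v_cache.append(frame_tok @ Wv)
    K, V = torch.cat(k_cache), torch.cat(v_cache)
    attn = F.softmax(q @ K.T / d ** 0.5, dim=-1)
    return attn @ V                    # causal by construction: past + current only

x = torch.randn(1, d)
for _ in range(5):                     # stream five frames
    x = generate_step(x)
print(x.shape, len(k_cache))           # torch.Size([1, 32]) 5
```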

[CV-4] PETALface: Parameter Efficient Transfer Learning for Low-resolution Face Recognition WACV2025

【Quick Read】: This paper addresses two problems that arise when transferring high-resolution (HR) face recognition models to low-resolution (LR) face datasets: full fine-tuning on LR data causes catastrophic forgetting of pre-trained knowledge, and the domain gap between HR gallery images and LR probe images makes it hard for a single model to fit both after fine-tuning. The key is PETALface, a parameter-efficient transfer learning approach. Parameter-efficient fine-tuning (PEFT) resolves catastrophic forgetting, and two low-rank adaptation modules are added to the backbone with weights adjusted according to input image quality, handling the quality gap between gallery and probe images. Experiments show PETALface outperforms full fine-tuning on low-resolution datasets while preserving performance on high-resolution and mixed-quality datasets, using only 0.48% of the parameters.

Link: https://arxiv.org/abs/2412.07771
Authors: Kartik Narayan,Nithin Gopalakrishnan Nair,Jennifer Xu,Rama Chellappa,Vishal M. Patel
Keywords-EN: utilizing margin-based loss, margin-based loss functions, Pre-training on large-scale, utilizing margin-based, margin-based loss
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to WACV 2025. Project Page: this https URL

Abstract:Pre-training on large-scale datasets and utilizing margin-based loss functions have been highly successful in training models for high-resolution face recognition. However, these models struggle with low-resolution face datasets, in which the faces lack the facial attributes necessary for distinguishing different faces. Full fine-tuning on low-resolution datasets, a naive method for adapting the model, yields inferior performance due to catastrophic forgetting of pre-trained knowledge. Additionally the domain difference between high-resolution (HR) gallery images and low-resolution (LR) probe images in low resolution datasets leads to poor convergence for a single model to adapt to both gallery and probe after fine-tuning. To this end, we propose PETALface, a Parameter-Efficient Transfer Learning approach for low-resolution face recognition. Through PETALface, we attempt to solve both the aforementioned problems. (1) We solve catastrophic forgetting by leveraging the power of parameter efficient fine-tuning(PEFT). (2) We introduce two low-rank adaptation modules to the backbone, with weights adjusted based on the input image quality to account for the difference in quality for the gallery and probe images. To the best of our knowledge, PETALface is the first work leveraging the powers of PEFT for low resolution face recognition. Extensive experiments demonstrate that the proposed method outperforms full fine-tuning on low-resolution datasets while preserving performance on high-resolution and mixed-quality datasets, all while using only 0.48% of the parameters. Code: this https URL
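
The paper's key trick, two LoRA branches blended by an input-quality signal on top of a frozen backbone layer, is compact enough to sketch. How the quality score is computed and the exact blending rule are not specified here, so treat the scalar gate below as an assumption:

```python
import torch
import torch.nn as nn

class QualityGatedLoRALinear(nn.Module):
    def __init__(self, d=128, r=4):
        super().__init__()
        self.base = nn.Linear(d, d)
        for p in self.base.parameters():
            p.requires_grad = False                  # frozen pre-trained weights
        self.A_hr = nn.Linear(d, r, bias=False)      # LoRA branch for HR inputs
        self.B_hr = nn.Linear(r, d, bias=False)
        self.A_lr = nn.Linear(d, r, bias=False)      # LoRA branch for LR inputs
        self.B_lr = nn.Linear(r, d, bias=False)

    def forward(self, x, quality):
        """quality in [0, 1]: 1 = high-resolution input, 0 = low-resolution."""
        hr = self.B_hr(self.A_hr(x))
        lr = self.B_lr(self.A_lr(x))
        return self.base(x) + quality * hr + (1 - quality) * lr

layer = QualityGatedLoRALinear()
gallery = layer(torch.randn(2, 128), quality=0.9)   # HR gallery features
probe = layer(torch.randn(2, 128), quality=0.1)     # LR probe features
```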

[CV-5] From an Image to a Scene: Learning to Imagine the World from a Million 360 Videos NEURIPS2024

【Quick Read】: This paper addresses the lack of large-scale 3D data for real-world objects and scenes, especially since standard videos have fixed viewpoints that limit access to diverse views. The key to the solution is 360-1M, a large-scale 360-degree video dataset, together with an efficient process for finding corresponding frames from diverse viewpoints at scale. The diffusion-based model Odin, trained on 360-1M, can freely generate novel views of real-world scenes and, by moving the camera through the environment, infer scene geometry and layout, yielding improved performance on standard novel view synthesis and 3D reconstruction benchmarks.

Link: https://arxiv.org/abs/2412.07770
Authors: Matthew Wallingford,Anand Bhattad,Aditya Kusupati,Vivek Ramanujan,Matt Deitke,Sham Kakade,Aniruddha Kembhavi,Roozbeh Mottaghi,Wei-Chiu Ma,Ali Farhadi
Keywords-EN: computer vision, play a key, key role, role in humans’, active area
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: NeurIPS 2024. For project page, see this https URL

Abstract:Three-dimensional (3D) understanding of objects and scenes play a key role in humans’ ability to interact with the world and has been an active area of research in computer vision, graphics, and robotics. Large scale synthetic and object-centric 3D datasets have shown to be effective in training models that have 3D understanding of objects. However, applying a similar approach to real-world objects and scenes is difficult due to a lack of large-scale data. Videos are a potential source for real-world 3D data, but finding diverse yet corresponding views of the same content has shown to be difficult at scale. Furthermore, standard videos come with fixed viewpoints, determined at the time of capture. This restricts the ability to access scenes from a variety of more diverse and potentially useful perspectives. We argue that large scale 360 videos can address these limitations to provide: scalable corresponding frames from diverse views. In this paper, we introduce 360-1M, a 360 video dataset, and a process for efficiently finding corresponding frames from diverse viewpoints at scale. We train our diffusion-based model, Odin, on 360-1M. Empowered by the largest real-world, multi-view dataset to date, Odin is able to freely generate novel views of real-world scenes. Unlike previous methods, Odin can move the camera through the environment, enabling the model to infer the geometry and layout of the scene. Additionally, we show improved performance on standard novel view synthesis and 3D reconstruction benchmarks.

[CV-6] BiMediX2: Bio-Medical EXpert LMM for Diverse Medical Modalities

【Quick Read】: This paper targets building a bilingual (Arabic-English) biomedical large multimodal model (LMM) that integrates the text and vision modalities for advanced image understanding in medical applications. The key to the solution is BiMediX2, built on the Llama3.1 architecture, which seamlessly integrates text and vision capabilities, supporting bilingual text input and multi-turn conversations involving medical images. Trained on a bilingual healthcare dataset of 1.6M samples of diverse medical interactions, BiMediX2 achieves state-of-the-art results on multiple medical benchmarks, improving multimodal medical evaluation by over 9% in English and over 20% in Arabic, surpassing GPT-4 by around 9% on UPHILL factual accuracy, and excelling in medical visual question answering, report generation, and report summarization.

Link: https://arxiv.org/abs/2412.07769
Authors: Sahal Shaji Mullappilly,Mohammed Irfan Kurpath,Sara Pieri,Saeed Yahya Alseiari,Shanavas Cholakkal,Khaled Aldahmani,Fahad Khan,Rao Anwer,Salman Khan,Timothy Baldwin,Hisham Cholakkal
Keywords-EN: Bio-Medical EXpert Large, EXpert Large Multimodal, Large Multimodal Model, enabling advanced image, EXpert Large
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:This paper introduces BiMediX2, a bilingual (Arabic-English) Bio-Medical EXpert Large Multimodal Model (LMM) with a unified architecture that integrates text and visual modalities, enabling advanced image understanding and medical applications. BiMediX2 leverages the Llama3.1 architecture and integrates text and visual capabilities to facilitate seamless interactions in both English and Arabic, supporting text-based inputs and multi-turn conversations involving medical images. The model is trained on an extensive bilingual healthcare dataset consisting of 1.6M samples of diverse medical interactions for both text and image modalities, mixed in Arabic and English. We also propose the first bilingual GPT-4o based medical LMM benchmark named BiMed-MBench. BiMediX2 is benchmarked on both text-based and image-based tasks, achieving state-of-the-art performance across several medical benchmarks. It outperforms recent state-of-the-art models in medical LLM evaluation benchmarks. Our model also sets a new benchmark in multimodal medical evaluations with over 9% improvement in English and over 20% in Arabic evaluations. Additionally, it surpasses GPT-4 by around 9% in UPHILL factual accuracy evaluations and excels in various medical Visual Question Answering, Report Generation, and Report Summarization tasks. The project page including source code and the trained model, is available at this https URL.

[CV-7] Test-time Correction with Human Feedback: An Online 3D Detection System via Visual Prompting

【Quick Read】: This paper addresses the safety risks posed by test-time errors of deployed autonomous driving systems. The key to the solution is the Test-time Correction (TTC) system, which performs instant error correction through online human feedback. TTC equips existing 3D detectors with an Online Adapter (OA) module, a prompt-driven query generator: user-interactive prompts such as a simple click or a drawn box become visual prompts, images of missed objects-of-interest that guide their detection and subsequent tracking. These visual prompts are stored in a visual prompt buffer and used to keep correcting errors in subsequent frames, enabling immediate and reliable error rectification and lowering deployment risk without additional expensive training.

Link: https://arxiv.org/abs/2412.07768
Authors: Zetong Yang,Hanxue Zhang,Yanan Sun,Li Chen,Fei Xia,Fatma Guney,Hongyang Li
Keywords-EN: paper introduces Test-time, introduces Test-time Correction, introduces Test-time, TTC, Test-time
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:This paper introduces Test-time Correction (TTC) system, a novel online 3D detection system designated for online correction of test-time errors via human feedback, to guarantee the safety of deployed autonomous driving systems. Unlike well-studied offline 3D detectors frozen at inference, TTC explores the capability of instant online error rectification. By leveraging user feedback with interactive prompts at a frame, e.g., a simple click or draw of boxes, TTC could immediately update the corresponding detection results for future streaming inputs, even though the model is deployed with fixed parameters. This enables autonomous driving systems to adapt to new scenarios immediately and decrease deployment risks reliably without additional expensive training. To achieve such TTC system, we equip existing 3D detectors with Online Adapter (OA) module, a prompt-driven query generator for online correction. At the core of OA module are visual prompts, images of missed object-of-interest for guiding the corresponding detection and subsequent tracking. Those visual prompts, belonging to missed objects through online inference, are maintained by the visual prompt buffer for continuous error correction in subsequent frames. By doing so, TTC consistently detects online missed objects and immediately lowers driving risks. It achieves reliable, versatile, and adaptive driving autonomy. Extensive experiments demonstrate significant gain on instant error rectification over pre-trained 3D detectors, even in challenging scenarios with limited labels, zero-shot detection, and adverse conditions. We hope this work would inspire the community to investigate online rectification systems for autonomous driving post-deployment. Code would be publicly shared.

[CV-8] Learning Visual Generative Priors without Text

【Quick Read】: This paper addresses the high cost of scaling text-to-image (T2I) models, which depend on high-quality text-image pairs. The key to the solution is a visual generative prior built on image-to-image (I2I) generation, trained self-supervised on in-the-wild images with a purely vision-based framework called Lumos, removing the need for text-image pairs. The study shows that such an I2I model, as an upstream task of T2I, serves as a more foundational visual prior and matches or exceeds existing T2I models while needing only 1/10 of the text-image pairs for fine-tuning. I2I priors also outperform T2I priors on text-irrelevant visual generation tasks such as image-to-3D and image-to-video.

Link: https://arxiv.org/abs/2412.07767
Authors: Shuailei Ma,Kecheng Zheng,Ying Wei,Wei Wu,Fan Lu,Yifei Zhang,Chen-wei Xie,Jiapeng Zhu,Yujun Shen
Keywords-EN: pairs makes scaling, scaling up expensive, recently thrived, reliance on high-quality, makes scaling
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Although text-to-image (T2I) models have recently thrived as visual generative priors, their reliance on high-quality text-image pairs makes scaling up expensive. We argue that grasping the cross-modality alignment is not a necessity for a sound visual generative prior, whose focus should be on texture modeling. Such a philosophy inspires us to study image-to-image (I2I) generation, where models can learn from in-the-wild images in a self-supervised manner. We first develop a pure vision-based training framework, Lumos, and confirm the feasibility and the scalability of learning I2I models. We then find that, as an upstream task of T2I, our I2I model serves as a more foundational visual prior and achieves on-par or better performance than existing T2I models using only 1/10 text-image pairs for fine-tuning. We further demonstrate the superiority of I2I priors over T2I priors on some text-irrelevant visual generative tasks, like image-to-3D and image-to-video.

[CV-9] Make-A-Texture: Fast Shape-Aware Texture Generation in 3 Seconds WACV2025

【Quick Read】: This paper presents Make-A-Texture, a framework for efficiently generating high-resolution texture maps from text prompts and applying them to a given 3D geometry. The key elements are a depth-aware inpainting diffusion model that progressively generates textures consistent across multiple viewpoints, and an automatic view selection algorithm that chooses an optimized viewpoint sequence. The method is remarkably efficient, completing end-to-end texture generation in just 3.07 seconds on a single NVIDIA H100 GPU, far faster than existing approaches, thanks to optimizations in the diffusion model and a specialized backprojection method. Artifacts in the backprojection stage are further reduced by selectively masking out non-frontal faces and the internal faces of open-surfaced objects. Experiments show that Make-A-Texture matches or exceeds the quality of state-of-the-art methods, improving the practicality of texture generation models for real-world 3D content creation, including interactive creation and text-guided texture editing.

Link: https://arxiv.org/abs/2412.07766
Authors: Xiaoyu Xiang,Liat Sless Gorelik,Yuchen Fan,Omri Armstrong,Forrest Iandola,Yilei Li,Ita Lifshitz,Rakesh Ranjan
Keywords-EN: efficiently synthesizes high-resolution, synthesizes high-resolution texture, high-resolution texture maps, framework that efficiently, efficiently synthesizes
Subjects: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
Comments: Accepted to WACV 2025

Abstract:We present Make-A-Texture, a new framework that efficiently synthesizes high-resolution texture maps from textual prompts for given 3D geometries. Our approach progressively generates textures that are consistent across multiple viewpoints with a depth-aware inpainting diffusion model, in an optimized sequence of viewpoints determined by an automatic view selection algorithm. A significant feature of our method is its remarkable efficiency, achieving a full texture generation within an end-to-end runtime of just 3.07 seconds on a single NVIDIA H100 GPU, significantly outperforming existing methods. Such an acceleration is achieved by optimizations in the diffusion model and a specialized backprojection method. Moreover, our method reduces the artifacts in the backprojection phase, by selectively masking out non-frontal faces, and internal faces of open-surfaced objects. Experimental results demonstrate that Make-A-Texture matches or exceeds the quality of other state-of-the-art methods. Our work significantly improves the applicability and practicality of texture generation models for real-world 3D content creation, including interactive creation and text-guided texture editing.

[CV-10] Repurposing Pre-trained Video Diffusion Models for Event-based Video Interpolation

【Quick Read】: This paper addresses the ill-posedness of video frame interpolation under large inter-frame motion, using the sparse, high-temporal-resolution measurements of an event camera as motion guidance. The key to the solution is adapting pre-trained video diffusion models, trained on internet-scale datasets, to the event-based video frame interpolation (EVFI) task. This overcomes the reliance of existing EVFI methods on limited paired event-frame training data; the method outperforms existing approaches on real-world EVFI datasets, including a newly introduced one, and generalizes far better across cameras.

Link: https://arxiv.org/abs/2412.07761
Authors: Jingxi Chen,Brandon Y. Feng,Haoming Cai,Tianfu Wang,Levi Burner,Dehao Yuan,Cornelia Fermuller,Christopher A. Metzler,Yiannis Aloimonos
Keywords-EN: recover realistic missing, Video Frame Interpolation, Frame Interpolation aims, realistic missing frames, Frame Interpolation
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Video Frame Interpolation aims to recover realistic missing frames between observed frames, generating a high-frame-rate video from a low-frame-rate video. However, without additional guidance, the large motion between frames makes this problem ill-posed. Event-based Video Frame Interpolation (EVFI) addresses this challenge by using sparse, high-temporal-resolution event measurements as motion guidance. This guidance allows EVFI methods to significantly outperform frame-only methods. However, to date, EVFI methods have relied on a limited set of paired event-frame training data, severely limiting their performance and generalization capabilities. In this work, we overcome the limited data challenge by adapting pre-trained video diffusion models trained on internet-scale datasets to EVFI. We experimentally validate our approach on real-world EVFI datasets, including a new one that we introduce. Our method outperforms existing methods and generalizes across cameras far better than existing approaches.

[CV-11] SynCamMaster: Synchronizing Multi-Camera Video Generation from Diverse Viewpoints

【Quick Read】: This paper addresses ensuring dynamic consistency when generating open-world videos from arbitrary viewpoints with 6 DoF camera poses, a desirable property for applications such as virtual filming. The key to the solution is a plug-and-play module that upgrades a pre-trained text-to-video model to multi-camera video generation with consistent content across viewpoints. Specifically, a multi-view synchronization module maintains appearance and geometry consistency across views. To cope with the scarcity of high-quality training data, a hybrid training scheme combines multi-camera images, monocular videos, and Unreal Engine-rendered multi-camera videos. The method also supports intriguing extensions such as re-rendering a video from novel viewpoints, and the authors release SynCamVideo-Dataset, a multi-view synchronized video dataset.

Link: https://arxiv.org/abs/2412.07760
Authors: Jianhong Bai,Menghan Xia,Xintao Wang,Ziyang Yuan,Xiao Fu,Zuozhu Liu,Haoji Hu,Pengfei Wan,Di Zhang
Keywords-EN: shown exceptional abilities, Recent advancements, simulating real-world dynamics, shown exceptional, exceptional abilities
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project page: this https URL

Abstract:Recent advancements in video diffusion models have shown exceptional abilities in simulating real-world dynamics and maintaining 3D consistency. This progress inspires us to investigate the potential of these models to ensure dynamic consistency across various viewpoints, a highly desirable feature for applications such as virtual filming. Unlike existing methods focused on multi-view generation of single objects for 4D reconstruction, our interest lies in generating open-world videos from arbitrary viewpoints, incorporating 6 DoF camera poses. To achieve this, we propose a plug-and-play module that enhances a pre-trained text-to-video model for multi-camera video generation, ensuring consistent content across different viewpoints. Specifically, we introduce a multi-view synchronization module to maintain appearance and geometry consistency across these viewpoints. Given the scarcity of high-quality training data, we design a hybrid training scheme that leverages multi-camera images and monocular videos to supplement Unreal Engine-rendered multi-camera videos. Furthermore, our method enables intriguing extensions, such as re-rendering a video from novel viewpoints. We also release a multi-view synchronized video dataset, named SynCamVideo-Dataset. Project page: this https URL.

[CV-12] 3DTrajMaster: Mastering 3D Trajectory for Multi-Entity Motion in Video Generation

【Quick Read】: This paper targets multi-entity 3D motion control in video generation: existing controllable methods rely mostly on 2D control signals, which cannot fully express the 3D nature of object motion. The key to the solution is 3DTrajMaster, a robust controller that regulates multi-entity dynamics in 3D space given user-specified 6DoF pose (location and rotation) sequences. At its core is a plug-and-play 3D-motion grounded object injector that fuses multiple input entities with their 3D trajectories through a gated self-attention mechanism, while the injector architecture preserves the video diffusion prior, which is crucial for generalization. To mitigate video quality degradation, a domain adaptor is used during training and an annealed sampling strategy at inference. To address the lack of training data, the authors build the 360-Motion Dataset, which pairs collected 3D human and animal assets with GPT-generated trajectories and captures their motion with 12 evenly spaced cameras on diverse 3D UE platforms. Experiments show 3DTrajMaster sets a new state of the art in accuracy and generalization for multi-entity 3D motion control.

Link: https://arxiv.org/abs/2412.07759
Authors: Xiao Fu,Xian Liu,Xintao Wang,Sida Peng,Menghan Xia,Xiaoyu Shi,Ziyang Yuan,Pengfei Wan,Di Zhang,Dahua Lin
Keywords-EN: paper aims, aims to manipulate, control signals, manipulate object motions, generation primarily leverage
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project Page Code Data: this http URL

Abstract:This paper aims to manipulate multi-entity 3D motions in video generation. Previous methods on controllable video generation primarily leverage 2D control signals to manipulate object motions and have achieved remarkable synthesis results. However, 2D control signals are inherently limited in expressing the 3D nature of object motions. To overcome this problem, we introduce 3DTrajMaster, a robust controller that regulates multi-entity dynamics in 3D space, given user-desired 6DoF pose (location and rotation) sequences of entities. At the core of our approach is a plug-and-play 3D-motion grounded object injector that fuses multiple input entities with their respective 3D trajectories through a gated self-attention mechanism. In addition, we exploit an injector architecture to preserve the video diffusion prior, which is crucial for generalization ability. To mitigate video quality degradation, we introduce a domain adaptor during training and employ an annealed sampling strategy during inference. To address the lack of suitable training data, we construct a 360-Motion Dataset, which first correlates collected 3D human and animal assets with GPT-generated trajectory and then captures their motion with 12 evenly-surround cameras on diverse 3D UE platforms. Extensive experiments show that 3DTrajMaster sets a new state-of-the-art in both accuracy and generalization for controlling multi-entity 3D motions. Project page: this http URL
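
A gated self-attention injector can be sketched as: concatenate video tokens with entity-plus-pose condition tokens, run self-attention over the joint sequence, and add the result back to the video tokens through a zero-initialized tanh gate so the pre-trained prior is untouched at the start of training. All shapes and the gating form below are illustrative assumptions, not 3DTrajMaster's exact design:

```python
import torch
import torch.nn as nn

d, n_vid, n_ent, n_pose = 64, 32, 3, 12
attn = nn.MultiheadAttention(d, num_heads=4, batch_first=True)
gate = nn.Parameter(torch.zeros(1))            # starts closed: prior is preserved

video = torch.randn(1, n_vid, d)               # video latent tokens
entity = torch.randn(1, n_ent, d)              # per-entity appearance embeddings
pose = torch.randn(1, n_ent, n_pose, d)        # embedded 6DoF pose sequences

cond = torch.cat([entity.unsqueeze(2), pose], dim=2)   # (1, n_ent, 1+n_pose, d)
cond = cond.flatten(1, 2)                      # (1, n_ent*(1+n_pose), d)

joint = torch.cat([video, cond], dim=1)        # video and condition attend jointly
out, _ = attn(joint, joint, joint)
video = video + torch.tanh(gate) * out[:, :n_vid]      # gated injection
print(video.shape)                             # torch.Size([1, 32, 64])
```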

[CV-13] SAT: Spatial Aptitude Training for Multimodal Language Models

【Quick Read】: This paper addresses the weakness of multimodal language models (MLMs) in dynamic spatial reasoning, such as perspective-taking and egocentric action recognition, which go beyond the static tasks existing benchmarks test. The key to the solution is the Spatial Aptitude Training (SAT) dataset: 218K question-answer pairs over 22K synthetic scenes covering dynamic spatial tasks. Generated with a photo-realistic physics engine, the dataset can be scaled arbitrarily and extended to new actions, scenes, and 3D assets. Experiments show that even MLMs that do well on static questions struggle with dynamic spatial questions, and that instruction-tuning on SAT significantly improves dynamic spatial reasoning while also yielding sizable zero-shot gains on existing real-image spatial benchmarks.

Link: https://arxiv.org/abs/2412.07755
Authors: Arijit Ray,Jiafei Duan,Reuben Tan,Dina Bashkirova,Rose Hendrix,Kiana Ehsani,Aniruddha Kembhavi,Bryan A. Plummer,Ranjay Krishna,Kuo-Hao Zeng,Kate Saenko
Keywords-EN: fundamental component, Spatial, SAT, Spatial Aptitude Training, spatial reasoning
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Graphics (cs.GR); Robotics (cs.RO)
Comments: Project webpage: this http URL

Abstract:Spatial perception is a fundamental component of intelligence. While many studies highlight that large multimodal language models (MLMs) struggle to reason about space, they only test for static spatial reasoning, such as categorizing the relative positions of objects. Meanwhile, real-world deployment requires dynamic capabilities like perspective-taking and egocentric action recognition. As a roadmap to improving spatial intelligence, we introduce SAT, Spatial Aptitude Training, which goes beyond static relative object position questions to the more dynamic tasks. SAT contains 218K question-answer pairs for 22K synthetic scenes across a training and testing set. Generated using a photo-realistic physics engine, our dataset can be arbitrarily scaled and easily extended to new actions, scenes, and 3D assets. We find that even MLMs that perform relatively well on static questions struggle to accurately answer dynamic spatial questions. Further, we show that SAT instruction-tuning data improves not only dynamic spatial reasoning on SAT, but also zero-shot performance on existing real-image spatial benchmarks: 23% on CVBench, 8% on the harder BLINK benchmark, and 18% on VSR. When instruction-tuned on SAT, our 13B model matches larger proprietary MLMs like GPT4-V and Gemini-3-1.0 in spatial reasoning. Our data/code is available at this http URL .

[CV-14] PortraitTalk: Towards Customizable One-Shot Audio-to-Talking Face Generation

【Quick Read】: This paper addresses the shortcomings of audio-driven talking face generation in visual quality, customization, and generalization. The key to the solution is PortraitTalk, a novel customizable one-shot audio-driven talking face generation framework built on a latent diffusion model with two main components: IdentityNet, which keeps identity features consistent across generated video frames, and AnimateNet, which strengthens temporal coherence and motion consistency. The framework integrates audio input with reference images, reducing the dependence on reference-style videos common in existing methods, and introduces text prompts through decoupled cross-attention mechanisms, greatly expanding creative control over the generated videos.

Link: https://arxiv.org/abs/2412.07754
Authors: Fatemeh Nazarieh,Zhenhua Feng,Diptesh Kanojia,Muhammad Awais,Josef Kittler
Keywords-EN: Audio-driven talking face, talking face generation, realistic talking faces, Audio-driven talking, digital communication
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:Audio-driven talking face generation is a challenging task in digital communication. Despite significant progress in the area, most existing methods concentrate on audio-lip synchronization, often overlooking aspects such as visual quality, customization, and generalization that are crucial to producing realistic talking faces. To address these limitations, we introduce a novel, customizable one-shot audio-driven talking face generation framework, named PortraitTalk. Our proposed method utilizes a latent diffusion framework consisting of two main components: IdentityNet and AnimateNet. IdentityNet is designed to preserve identity features consistently across the generated video frames, while AnimateNet aims to enhance temporal coherence and motion consistency. This framework also integrates an audio input with the reference images, thereby reducing the reliance on reference-style videos prevalent in existing approaches. A key innovation of PortraitTalk is the incorporation of text prompts through decoupled cross-attention mechanisms, which significantly expands creative control over the generated videos. Through extensive experiments, including a newly developed evaluation metric, our model demonstrates superior performance over the state-of-the-art methods, setting a new standard for the generation of customizable realistic talking faces suitable for real-world applications.
zh
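
下面以一个极简的 PyTorch 草图说明“解耦交叉注意力”的思路:图像潜变量作为 query,分别对文本提示嵌入与身份(参考图)嵌入做两路交叉注意力,再以残差方式融合。模块结构、维度与融合系数均为笔者假设的示意写法,并非 PortraitTalk 的官方实现。

```python
import torch
import torch.nn as nn

class DecoupledCrossAttention(nn.Module):
    """两路交叉注意力:一路关注文本提示,一路关注身份特征,结果残差融合。"""
    def __init__(self, dim: int, num_heads: int = 8, id_scale: float = 1.0):
        super().__init__()
        self.attn_text = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.attn_id = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.id_scale = id_scale  # 身份分支强度(假设的超参)

    def forward(self, x, text_emb, id_emb):
        # x: (B, N, D) 图像潜变量token;text_emb/id_emb: (B, L, D)
        out_text, _ = self.attn_text(x, text_emb, text_emb)
        out_id, _ = self.attn_id(x, id_emb, id_emb)
        return x + out_text + self.id_scale * out_id

# 用法示意
x = torch.randn(2, 64, 320)   # 图像token
t = torch.randn(2, 77, 320)   # 文本嵌入
i = torch.randn(2, 4, 320)    # 身份嵌入
print(DecoupledCrossAttention(320)(x, t, i).shape)  # torch.Size([2, 64, 320])
```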

[CV-15] On Motion Blur and Deblurring in Visual Place Recognition

【速读】: 该论文试图解决移动机器人视觉场景识别 (Visual Place Recognition, VPR) 在运动模糊条件下的性能问题,特别是在快速运动和低光环境下,由于长时间曝光导致的模糊对VPR的影响。解决方案的关键在于引入一个新的基准测试,评估运动模糊和图像去模糊 (image deblurring) 对VPR性能的影响,并提出适应性去模糊策略,以在动态、真实世界场景中有效管理运动模糊。

链接: https://arxiv.org/abs/2412.07751
作者: Timur Ismagilov,Bruno Ferrarini,Michael Milford,Tan Viet Tuyen Nguyen,SD Ramchurn,Shoaib Ehsan
关键词-EN: Visual Place Recognition, Place Recognition, Visual Place, mobile robotics enables, robotics enables robots
类目: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
备注:

点击查看摘要

Abstract:Visual Place Recognition (VPR) in mobile robotics enables robots to localize themselves by recognizing previously visited locations using visual data. While the reliability of VPR methods has been extensively studied under conditions such as changes in illumination, season, weather and viewpoint, the impact of motion blur is relatively unexplored despite its relevance not only in rapid motion scenarios but also in low-light conditions where longer exposure times are necessary. Similarly, the role of image deblurring in enhancing VPR performance under motion blur has received limited attention so far. This paper bridges these gaps by introducing a new benchmark designed to evaluate VPR performance under the influence of motion blur and image deblurring. The benchmark includes three datasets that encompass a wide range of motion blur intensities, providing a comprehensive platform for analysis. Experimental results with several well-established VPR and image deblurring methods provide new insights into the effects of motion blur and the potential improvements achieved through deblurring. Building on these findings, the paper proposes adaptive deblurring strategies for VPR, designed to effectively manage motion blur in dynamic, real-world scenarios.
zh

[CV-16] Multi-Shot Character Consistency for Text-to-Video Generation

【速读】: 该论文试图解决文本到视频生成模型中,生成多个视频镜头时保持角色身份一致性的问题。解决方案的关键在于提出了一种无需额外训练的“视频故事板”方法,通过在预训练的文本到视频模型中共享特征来实现角色一致性。具体来说,论文发现自注意力查询特征 (self-attention query features, Q) 同时编码了运动和身份信息,因此在共享特征时存在身份保留与动态性之间的权衡。为解决这一问题,论文引入了一种新的查询注入策略 (query injection strategy),能够在保持角色身份的同时,保留自然的运动和文本对齐,从而在角色一致性和视频质量之间实现了更好的平衡。

链接: https://arxiv.org/abs/2412.07750
作者: Yuval Atzmon,Rinon Gal,Yoad Tewel,Yoni Kasten,Gal Chechik
关键词-EN: made significant strides, short video clips, textual descriptions, clips from textual, generating short video
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project page: this https URL

点击查看摘要

Abstract:Text-to-video models have made significant strides in generating short video clips from textual descriptions. Yet, a significant challenge remains: generating several video shots of the same characters, preserving their identity without hurting video quality, dynamics, and responsiveness to text prompts. We present Video Storyboarding, a training-free method to enable pretrained text-to-video models to generate multiple shots with consistent characters, by sharing features between them. Our key insight is that self-attention query features (Q) encode both motion and identity. This creates a hard-to-avoid trade-off between preserving character identity and making videos dynamic, when features are shared. To address this issue, we introduce a novel query injection strategy that balances identity preservation and natural motion retention. This approach improves upon naive consistency techniques applied to videos, which often struggle to maintain this delicate equilibrium. Our experiments demonstrate significant improvements in character consistency across scenes while maintaining high-quality motion and text alignment. These results offer insights into critical stages of video generation and the interplay of structure and motion in video diffusion models.
zh
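
论文指出自注意力的 Q 同时编码运动与身份,因此“查询注入”可以理解为按比例混合共享的 Q 与各镜头自身的 Q。下面是这一思想的极简示意:混合系数随去噪进度线性衰减(早期多注入以固定身份,后期少注入以保留运动);该调度方式与系数均为笔者假设,并非论文原策略。

```python
import torch

def inject_query(q_local: torch.Tensor, q_shared: torch.Tensor,
                 step: int, total_steps: int, max_alpha: float = 0.8) -> torch.Tensor:
    """按去噪进度线性衰减的比例,把共享镜头的Q注入当前镜头的Q。
    q_local / q_shared: (B, heads, N, d)。"""
    alpha = max_alpha * (1.0 - step / max(total_steps - 1, 1))
    return alpha * q_shared + (1.0 - alpha) * q_local

q_a = torch.randn(1, 8, 256, 64)  # 当前镜头的Q
q_b = torch.randn(1, 8, 256, 64)  # 共享(参考镜头)的Q
print(inject_query(q_a, q_b, step=5, total_steps=50).shape)
```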

[CV-17] LoRA3D: Low-Rank Self-Calibration of 3D Geometric Foundation Models

【速读】: 该论文试图解决现有3D几何基础模型(如DUSt3R)在面对高维问题空间和高质量3D数据稀缺的情况下,难以在复杂场景(如视图重叠有限或光照不足)中泛化的问题。解决方案的关键在于提出了LoRA3D,一种高效的自我校准流程,通过利用模型自身的多视图预测来专门化预训练模型。具体来说,该方法通过稀疏RGB图像输入,采用鲁棒优化技术来精炼多视图预测,并将其对齐到全局坐标系中。特别地,该方法将预测置信度纳入几何优化过程,自动重新加权置信度以更好地反映点估计的准确性。随后,利用校准后的置信度生成高质量的伪标签,并通过低秩适应(LoRA)在伪标签数据上微调模型。该方法无需外部先验或手动标签,并在单个标准GPU上仅需5分钟即可完成自我校准,每个低秩适配器仅需18MB存储空间。

链接: https://arxiv.org/abs/2412.07746
作者: Ziqi Lu,Heng Yang,Danfei Xu,Boyi Li,Boris Ivanovic,Marco Pavone,Yue Wang
关键词-EN: vision tasks, offer a promising, geometric foundation models, promising approach, foundation models
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Robotics (cs.RO)
备注:

点击查看摘要

Abstract:Emerging 3D geometric foundation models, such as DUSt3R, offer a promising approach for in-the-wild 3D vision tasks. However, due to the high-dimensional nature of the problem space and scarcity of high-quality 3D data, these pre-trained models still struggle to generalize to many challenging circumstances, such as limited view overlap or low lighting. To address this, we propose LoRA3D, an efficient self-calibration pipeline to specialize the pre-trained models to target scenes using their own multi-view predictions. Taking sparse RGB images as input, we leverage robust optimization techniques to refine multi-view predictions and align them into a global coordinate frame. In particular, we incorporate prediction confidence into the geometric optimization process, automatically re-weighting the confidence to better reflect point estimation accuracy. We use the calibrated confidence to generate high-quality pseudo labels for the calibrating views and use low-rank adaptation (LoRA) to fine-tune the models on the pseudo-labeled data. Our method does not require any external priors or manual labels. It completes the self-calibration process on a single standard GPU within just 5 minutes. Each low-rank adapter requires only 18MB of storage. We evaluated our method on more than 160 scenes from the Replica, TUM and Waymo Open datasets, achieving up to 88% performance improvement on 3D reconstruction, multi-view pose estimation and novel-view rendering.
zh
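
LoRA 本身是通用的低秩微调技术:冻结原权重 W,仅训练低秩增量 BA。下面给出最小实现(秩与缩放取常见默认值,并非论文配置),也能直观解释为何单个适配器只占约 18MB 存储:可训练参数量仅与秩 r 成正比。

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """W x + (alpha/r) * B(A x):冻结 W,只训练低秩矩阵 A、B。"""
    def __init__(self, base: nn.Linear, r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False  # 冻结预训练权重
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))  # B 初始为0,训练起点等于原模型
        self.scale = alpha / r

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.A.T) @ self.B.T

layer = LoRALinear(nn.Linear(1024, 1024), r=8)
print(sum(p.numel() for p in layer.parameters() if p.requires_grad))  # 仅 2*8*1024 = 16384 个可训练参数
```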

[CV-18] StyleMaster: Stylize Your Video with Artistic Generation and Translation

【速读】: 该论文试图解决视频生成模型中风格控制的问题,特别是现有方法在风格迁移时出现的风格不一致、内容泄露以及难以将视频转换为所需风格的问题。解决方案的关键在于:1) 在风格提取阶段,通过基于提示-块相似性的方法过滤与内容相关的块,保留风格相关的块,从而在保留局部纹理特征的同时防止内容泄露;2) 通过模型幻觉生成配对风格数据集,促进对比学习,增强全局风格一致性;3) 训练轻量级运动适配器,填补图像到视频的差距,隐式增强风格化程度,使图像训练模型能够无缝应用于视频。这些创新使得StyleMaster在风格相似性和时间一致性方面显著提升,并能轻松应用于视频风格迁移任务。

链接: https://arxiv.org/abs/2412.07744
作者: Zixuan Ye,Huijuan Huang,Xintao Wang,Pengfei Wan,Di Zhang,Wenhan Luo
关键词-EN: Style, video generation models, Style control, Existing methods, content leakage
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project webpage available at this https URL

点击查看摘要

Abstract:Style control has been popular in video generation models. Existing methods often generate videos far from the given style, cause content leakage, and struggle to transfer one video to the desired style. Our first observation is that the style extraction stage matters, whereas existing methods emphasize global style but ignore local textures. In order to bring texture features while preventing content leakage, we filter content-related patches while retaining style ones based on prompt-patch similarity; for global style extraction, we generate a paired style dataset through model illusion to facilitate contrastive learning, which greatly enhances the absolute style consistency. Moreover, to fill in the image-to-video gap, we train a lightweight motion adapter on still videos, which implicitly enhances stylization extent, and enables our image-trained model to be seamlessly applied to videos. Benefiting from these efforts, our approach, StyleMaster, not only achieves significant improvement in both style resemblance and temporal coherence, but also can easily generalize to video style transfer with a gray tile ControlNet. Extensive experiments and visualizations demonstrate that StyleMaster significantly outperforms competitors, effectively generating high-quality stylized videos that align with textual content and closely resemble the style of reference images. Our project page is at this https URL
zh
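
“基于提示-块相似度的过滤”可以粗略理解为:用 CLIP 式对齐特征计算每个图像块与文本提示的相似度,剔除与内容描述高度相关的块,保留低相关的块作为风格表示以防内容泄露。以下为这一步骤的假设性示意,特征来源与保留比例均非论文原设定。

```python
import torch
import torch.nn.functional as F

def select_style_patches(patch_feats: torch.Tensor, prompt_feat: torch.Tensor,
                         keep_ratio: float = 0.7) -> torch.Tensor:
    """patch_feats: (N, D) 图像块特征;prompt_feat: (D,) 文本提示特征。
    返回与提示相似度最低的前 keep_ratio 比例的块(视为风格/纹理块)。"""
    sims = F.cosine_similarity(patch_feats, prompt_feat.unsqueeze(0), dim=-1)  # (N,)
    k = max(1, int(keep_ratio * patch_feats.size(0)))
    idx = sims.argsort()[:k]  # 相似度越低,越可能与内容无关
    return patch_feats[idx]

feats = F.normalize(torch.randn(196, 512), dim=-1)   # 196个块(如14x14)
prompt = F.normalize(torch.randn(512), dim=-1)
print(select_style_patches(feats, prompt).shape)      # torch.Size([137, 512])
```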

[CV-19] Image Retrieval with Intra-Sweep Representation Learning for Neck Ultrasound Scanning Guidance

【速读】: 该论文旨在解决在经口机器人手术中,术中超声(Intraoperative Ultrasound, US)图像与术前扫描图像匹配困难的问题。解决方案的关键在于提出了一种自监督对比学习方法,通过利用术中图像与术前图像数据库的匹配,帮助手术助手调整超声探头以达到目标扫描平面。该方法引入了新的对比学习策略,结合了术中图像的内部相似性和超声探头位置信息,以提升特征编码效果,并通过灵活的阈值机制拒绝不满意的匹配。实验结果表明,该方法在模拟数据上达到了92.30%的检索准确率,优于现有的基于时间序列的对比学习方法,并在真实患者数据上展示了其可行性。

链接: https://arxiv.org/abs/2412.07741
作者: Wanwen Chen,Adam Schmidt,Eitan Prisman,Septimiu E. Salcudean
关键词-EN: transoral robotic surgery, enhance real-time visualization, real-time visualization, visualization in transoral, transoral robotic
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 12 pages, 5 figures

点击查看摘要

Abstract:Purpose: Intraoperative ultrasound (US) can enhance real-time visualization in transoral robotic surgery. The surgeon creates a mental map with a pre-operative scan. Then, a surgical assistant performs freehand US scanning during the surgery while the surgeon operates at the remote surgical console. Communicating the target scanning plane in the surgeon’s mental map is difficult. Automatic image retrieval can help match intraoperative images to preoperative scans, guiding the assistant to adjust the US probe toward the target plane. Methods: We propose a self-supervised contrastive learning approach to match intraoperative US views to a preoperative image database. We introduce a novel contrastive learning strategy that leverages intra-sweep similarity and US probe location to improve feature encoding. Additionally, our model incorporates a flexible threshold to reject unsatisfactory matches. Results: Our method achieves 92.30% retrieval accuracy on simulated data and outperforms state-of-the-art temporal-based contrastive learning approaches. Our ablation study demonstrates that using probe location in the optimization goal improves image representation, suggesting that semantic information can be extracted from probe location. We also present our approach on real patient data to show the feasibility of the proposed US probe localization system despite tissue deformation from tongue retraction. Conclusion: Our contrastive learning method, which utilizes intra-sweep similarity and US probe location, enhances US image representation learning. We also demonstrate the feasibility of using our image retrieval method to provide neck US localization on real patient US after tongue retraction.
zh

[CV-20] GASP: Gaussian Avatars with Synthetic Priors MICRO

【速读】: 该论文试图解决现有高斯化身(Gaussian Avatars)技术在训练和渲染方面的两大局限性:一是需要昂贵的多摄像头设备来实现自由视角渲染,二是单摄像头训练的化身只能在固定视角下高质量渲染。论文提出的解决方案是GASP(Gaussian Avatars with Synthetic Priors),其关键在于利用合成数据(synthetic data)的像素完美特性来训练高斯化身先验模型(Gaussian Avatar prior)。通过将该先验模型拟合到单张照片或视频,并进行微调,可以生成支持360°渲染的高质量高斯化身。该方法在训练时仅需先验模型,推理时则不需要,从而实现了在商用硬件上的实时应用,达到70fps的渲染速度。

链接: https://arxiv.org/abs/2412.07739
作者: Jack Saunders,Charlie Hewitt,Yanan Jian,Marek Kowalski,Tadas Baltrusaitis,Yiye Chen,Darren Cosker,Virginia Estellers,Nicholas Gyde,Vinay P. Namboodiri,Benjamin E Lundell
关键词-EN: Gaussian Splatting, Gaussian, Gaussian Avatars, changed the game, avatars
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Graphics (cs.GR)
备注: Project page: this https URL

点击查看摘要

Abstract:Gaussian Splatting has changed the game for real-time photo-realistic rendering. One of the most popular applications of Gaussian Splatting is to create animatable avatars, known as Gaussian Avatars. Recent works have pushed the boundaries of quality and rendering efficiency but suffer from two main limitations. Either they require expensive multi-camera rigs to produce avatars with free-view rendering, or they can be trained with a single camera but only rendered at high quality from this fixed viewpoint. An ideal model would be trained using a short monocular video or image from available hardware, such as a webcam, and rendered from any view. To this end, we propose GASP: Gaussian Avatars with Synthetic Priors. To overcome the limitations of existing datasets, we exploit the pixel-perfect nature of synthetic data to train a Gaussian Avatar prior. By fitting this prior model to a single photo or video and fine-tuning it, we get a high-quality Gaussian Avatar, which supports 360° rendering. Our prior is only required for fitting, not inference, enabling real-time application. Through our method, we obtain high-quality, animatable Avatars from limited data which can be animated and rendered at 70fps on commercial hardware. See our project page (this https URL) for results.
zh

[CV-21] STIV: Scalable Text and Image Conditioned Video Generation

【速读】: 该论文试图解决视频生成领域中缺乏系统性、可扩展性模型开发指导的问题。解决方案的关键在于提出了一种简单且可扩展的文本-图像条件视频生成方法,称为STIV。STIV通过将图像条件引入扩散Transformer (Diffusion Transformer, DiT) 并通过帧替换实现,同时结合图像-文本联合条件分类器无指导 (classifier-free guidance) 来实现文本条件。这种设计使得STIV能够同时处理文本到视频 (T2V) 和文本-图像到视频 (TI2V) 任务,并且易于扩展到视频预测、帧插值、多视角生成和长视频生成等多种应用。通过全面的消融实验,STIV展示了其在T2I、T2V和TI2V任务中的强大性能,并在VBench T2V和I2V任务中达到了领先水平。

链接: https://arxiv.org/abs/2412.07730
作者: Zongyu Lin,Wei Liu,Chen Chen,Jiasen Lu,Wenze Hu,Tsu-Jui Fu,Jesse Allardice,Zhengfeng Lai,Liangchen Song,Bowen Zhang,Cha Chen,Yiran Fei,Yifan Jiang,Lezhi Li,Yizhou Sun,Kai-Wei Chang,Yinfei Yang
关键词-EN: made remarkable advancements, video generation, remarkable advancements, made remarkable, remains a pressing
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multimedia (cs.MM)
备注:

点击查看摘要

Abstract:The field of video generation has made remarkable advancements, yet there remains a pressing need for a clear, systematic recipe that can guide the development of robust and scalable models. In this work, we present a comprehensive study that systematically explores the interplay of model architectures, training recipes, and data curation strategies, culminating in a simple and scalable text-image-conditioned video generation method, named STIV. Our framework integrates image condition into a Diffusion Transformer (DiT) through frame replacement, while incorporating text conditioning via a joint image-text conditional classifier-free guidance. This design enables STIV to perform both text-to-video (T2V) and text-image-to-video (TI2V) tasks simultaneously. Additionally, STIV can be easily extended to various applications, such as video prediction, frame interpolation, multi-view generation, and long video generation, etc. With comprehensive ablation studies on T2I, T2V, and TI2V, STIV demonstrates strong performance, despite its simple design. An 8.7B model with 512 resolution achieves 83.1 on VBench T2V, surpassing both leading open and closed-source models like CogVideoX-5B, Pika, Kling, and Gen-3. The same-sized model also achieves a state-of-the-art result of 90.1 on VBench I2V task at 512 resolution. By providing a transparent and extensible recipe for building cutting-edge video generation models, we aim to empower future research and accelerate progress toward more versatile and reliable video generation solutions.
zh
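
图文联合 classifier-free guidance 的一种常见组合方式是:分别对图像条件与文本条件设置引导系数,做两级外推。以下按这一通用写法给出示意;三次前向的组织方式与系数取值均为笔者假设,不代表 STIV 的官方细节。

```python
import torch

def joint_cfg(eps_uncond, eps_img, eps_img_text,
              s_img: float = 1.5, s_text: float = 7.5):
    """三次前向:无条件、仅图像条件、图像+文本条件,按两级引导组合噪声预测。"""
    return (eps_uncond
            + s_img * (eps_img - eps_uncond)        # 图像条件引导
            + s_text * (eps_img_text - eps_img))    # 在图像条件基础上的文本引导

e0, e1, e2 = (torch.randn(1, 4, 16, 32, 32) for _ in range(3))  # (B,C,T,H,W) 潜变量噪声预测
print(joint_cfg(e0, e1, e2).shape)
```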

[CV-22] ObjCtrl-2.5D: Training-free Object Control with Camera Poses

【速读】: 该论文旨在解决图像到视频生成 (Image-to-Video, I2V) 中目标物体控制不够精确和多样化的问题。现有方法通常使用二维轨迹 (2D trajectories) 来表示目标物体的空间运动,但这种方法难以捕捉用户意图并常产生不自然的结果。论文提出的解决方案是 ObjCtrl-2.5D,一种无需训练的目标物体控制方法,通过引入三维轨迹 (3D trajectory) 来增强控制信号,其中三维轨迹通过深度信息扩展二维轨迹得到。关键在于将物体运动建模为相机运动,并将三维轨迹表示为相机姿态序列,从而利用现有的相机运动控制图像到视频生成模型 (CMC-I2V) 进行物体运动控制,无需额外训练。此外,论文还引入了一个模块来隔离目标物体与背景,以实现局部独立控制,并通过共享低频潜在变量 (low-frequency warped latent) 在物体区域内跨帧来提高控制精度。

链接: https://arxiv.org/abs/2412.07721
作者: Zhouxia Wang,Yushi Lan,Shangchen Zhou,Chen Change Loy
关键词-EN: control, object, study aims, precise and versatile, object control
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project Page: this https URL

点击查看摘要

Abstract:This study aims to achieve more precise and versatile object control in image-to-video (I2V) generation. Current methods typically represent the spatial movement of target objects with 2D trajectories, which often fail to capture user intention and frequently produce unnatural results. To enhance control, we present ObjCtrl-2.5D, a training-free object control approach that uses a 3D trajectory, extended from a 2D trajectory with depth information, as a control signal. By modeling object movement as camera movement, ObjCtrl-2.5D represents the 3D trajectory as a sequence of camera poses, enabling object motion control using an existing camera motion control I2V generation model (CMC-I2V) without training. To adapt the CMC-I2V model originally designed for global motion control to handle local object motion, we introduce a module to isolate the target object from the background, enabling independent local control. In addition, we devise an effective way to achieve more accurate object control by sharing low-frequency warped latent within the object’s region across frames. Extensive experiments demonstrate that ObjCtrl-2.5D significantly improves object control accuracy compared to training-free methods and offers more diverse control capabilities than training-based approaches using 2D trajectories, enabling complex effects like object rotation. Code and results are available at this https URL.
zh

[CV-23] ACDiT: Interpolating Autoregressive Conditional Modeling and Diffusion Transformer

【速读】: 该论文试图解决多模态模型中不同模态(如视觉和文本)统一方法的差异问题,特别是视觉生成中的全序列扩散方法与文本领域中的自回归建模(autoregressive modeling)之间的不一致性。解决方案的关键在于提出了一种名为ACDiT(Autoregressive blockwise Conditional Diffusion Transformer)的模型,该模型通过灵活调整扩散块大小(block size)来在自回归建模和全参数扩散之间进行插值。ACDiT的核心创新在于其易于实现的Skip-Causal Attention Mask (SCAM),并在推理过程中通过交替的扩散去噪和自回归解码充分利用KV-Cache,从而在图像和视频生成任务中验证了其有效性,并展示了其在视觉理解任务中的潜力。

链接: https://arxiv.org/abs/2412.07720
作者: Jinyi Hu,Shengding Hu,Yuxuan Song,Yufei Huang,Mingxuan Wang,Hao Zhou,Zhiyuan Liu,Wei-Ying Ma,Maosong Sun
关键词-EN: autoregressive modeling, autoregressive, diverse modalities, comprehensive multimodal models, recent surge
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:The recent surge of interest in comprehensive multimodal models has necessitated the unification of diverse modalities. However, the unification suffers from disparate methodologies. Continuous visual generation necessitates the full-sequence diffusion-based approach, despite its divergence from the autoregressive modeling in the text domain. We posit that autoregressive modeling, i.e., predicting the future based on past deterministic experience, remains crucial in developing both a visual generation model and a potential unified multimodal model. In this paper, we explore an interpolation between the autoregressive modeling and full-parameters diffusion to model visual information. At its core, we present ACDiT, an Autoregressive blockwise Conditional Diffusion Transformer, where the block size of diffusion, i.e., the size of autoregressive units, can be flexibly adjusted to interpolate between token-wise autoregression and full-sequence diffusion. ACDiT is easy to implement, as simple as creating a Skip-Causal Attention Mask (SCAM) during training. During inference, the process iterates between diffusion denoising and autoregressive decoding that can make full use of KV-Cache. We verify the effectiveness of ACDiT on image and video generation tasks. We also demonstrate that, benefiting from autoregressive modeling, ACDiT can be seamlessly used in visual understanding tasks despite being trained on the diffusion objective. The analysis of the trade-off between autoregressive modeling and diffusion demonstrates the potential of ACDiT to be used in long-horizon visual generation tasks. These strengths make it promising as the backbone of future unified models.
zh
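
按论文的直观描述,块级自回归要求:块内 token 互相可见,跨块只能看到更早的块。下面按这一理解构造一个块级因果掩码作为示意;真实 SCAM 的细节(例如对当前噪声块自身可见性的处理)可能与此不同,此处仅为假设性写法。

```python
import torch

def blockwise_causal_mask(num_tokens: int, block_size: int) -> torch.Tensor:
    """返回 (num_tokens, num_tokens) 布尔掩码,True 表示允许注意。
    同块 token 互相可见;跨块只允许看编号更小的块,形成块级自回归。"""
    idx = torch.arange(num_tokens)
    block_id = idx // block_size
    # 允许条件:key 所在块编号 <= query 所在块编号
    return block_id.unsqueeze(1) >= block_id.unsqueeze(0)

m = blockwise_causal_mask(8, block_size=4)
print(m.int())  # 左上/右下 4x4 全1块 + 左下全1、右上全0
```

当 block_size=1 时退化为逐 token 的因果掩码,block_size=num_tokens 时退化为全可见,这正对应摘要中“在逐 token 自回归与全序列扩散之间插值”的说法。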

[CV-24] GEXIA: Granularity Expansion and Iterative Approximation for Scalable Multi-grained Video-language Learning

【速读】: 该论文试图解决视频-语言学习任务中跨模态对齐多粒度数据的挑战。解决方案的关键在于从数据和建模两个角度入手:首先,通过引入Granularity EXpansion (GEX)方法,利用Integration and Compression操作扩展单一粒度数据集,以生成多粒度视频-文本预训练数据集;其次,提出Iterative Approximation Module (IAM),将多粒度视频和文本嵌入到统一的低维语义空间中,同时保留跨模态对齐所需的关键信息。此外,该方法具有高度可扩展性,不受对齐粒度数量的限制。

链接: https://arxiv.org/abs/2412.07704
作者: Yicheng Wang,Zhikang Zhang,Jue Wang,David Fan,Zhenlin Xu,Linda Liu,Xiang Hao,Vimal Bhat,Xinyu Li
关键词-EN: achieving cross-modality alignment, video-language learning tasks, multi-grained data persists, video-language learning, achieving cross-modality
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:In various video-language learning tasks, the challenge of achieving cross-modality alignment with multi-grained data persists. We propose a method to tackle this challenge from two crucial perspectives: data and modeling. Given the absence of a multi-grained video-text pretraining dataset, we introduce a Granularity EXpansion (GEX) method with Integration and Compression operations to expand the granularity of a single-grained dataset. To better model multi-grained data, we introduce an Iterative Approximation Module (IAM), which embeds multi-grained videos and texts into a unified, low-dimensional semantic space while preserving essential information for cross-modal alignment. Furthermore, GEXIA is highly scalable with no restrictions on the number of video-text granularities for alignment. We evaluate our work on three categories of video tasks across seven benchmark datasets, showcasing state-of-the-art or comparable performance. Remarkably, our model excels in tasks involving long-form video understanding, even though the pretraining dataset only contains short video clips.
zh

[CV-25] SimVS: Simulating World Inconsistencies for Robust View Synthesis

【速读】: 该论文试图解决在非正式拍摄环境下,由于光照变化、场景运动和其他不可预测因素导致的视图不一致问题,这些问题使得传统的静态场景合成技术难以应对。解决方案的关键在于利用生成式视频模型(generative video models)模拟拍摄过程中可能出现的各种不一致性,并结合现有的多视图数据集生成合成数据,用于训练一个多视图协调网络(multi-view harmonization network)。该网络能够将不一致的观测结果整合为一致的3D场景,从而在面对复杂的现实场景变化时,显著优于传统的数据增强方法,实现高精度的静态3D重建。

链接: https://arxiv.org/abs/2412.07696
作者: Alex Trevithick,Roni Paiss,Philipp Henzler,Dor Verbin,Rundi Wu,Hadi Alzayer,Ruiqi Gao,Ben Poole,Jonathan T. Barron,Aleksander Holynski,Ravi Ramamoorthi,Pratul P. Srinivasan
关键词-EN: Novel-view synthesis techniques, synthesis techniques achieve, techniques achieve impressive, achieve impressive results, casual capture settings
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Graphics (cs.GR); Machine Learning (cs.LG)
备注: Project page: this https URL

点击查看摘要

Abstract:Novel-view synthesis techniques achieve impressive results for static scenes but struggle when faced with the inconsistencies inherent to casual capture settings: varying illumination, scene motion, and other unintended effects that are difficult to model explicitly. We present an approach for leveraging generative video models to simulate the inconsistencies in the world that can occur during capture. We use this process, along with existing multi-view datasets, to create synthetic data for training a multi-view harmonization network that is able to reconcile inconsistent observations into a consistent 3D scene. We demonstrate that our world-simulation strategy significantly outperforms traditional augmentation methods in handling real-world scene variations, thereby enabling highly accurate static 3D reconstructions in the presence of a variety of challenging inconsistencies. Project page: this https URL
zh

[CV-26] Leveraging Content and Context Cues for Low-Light Image Enhancement

【速读】: 该论文试图解决低光条件下机器认知受限的问题,特别是在计算机视觉系统中的表现。解决方案的关键在于通过图像处理技术增强低光图像,而不是对每个下游任务模型进行昂贵的微调。具体来说,论文提出了一种基于CLIP模型的零参考低光增强方法,通过数据增强策略(data augmentation strategy)学习图像先验(image prior),并利用语义指导策略(semantic guidance strategy)引入图像训练块的内容和上下文线索,从而在不依赖配对或非配对正常光数据的情况下提升图像对比度和色调,改善前景背景区分,减少过饱和和噪声过度放大问题。实验结果表明,该方法在多个低光数据集上的任务性能(如图像分类、目标检测和人脸检测)优于现有的零参考方法。

链接: https://arxiv.org/abs/2412.07693
作者: Igor Morawski,Kai He,Shusil Dangi,Winston H. Hsu
关键词-EN: computer vision systems, real life, adverse impact, computer vision, vision systems
类目: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
备注: Accepted to the IEEE Transactions on Multimedia

点击查看摘要

Abstract:Low-light conditions have an adverse impact on machine cognition, limiting the performance of computer vision systems in real life. Since low-light data is limited and difficult to annotate, we focus on image processing to enhance low-light images and improve the performance of any downstream task model, instead of fine-tuning each of the models which can be prohibitively expensive. We propose to improve the existing zero-reference low-light enhancement by leveraging the CLIP model to capture image prior and for semantic guidance. Specifically, we propose a data augmentation strategy to learn an image prior via prompt learning, based on image sampling, to learn the image prior without any need for paired or unpaired normal-light data. Next, we propose a semantic guidance strategy that maximally takes advantage of existing low-light annotation by introducing both content and context cues about the image training patches. We experimentally show, in a qualitative study, that the proposed prior and semantic guidance help to improve the overall image contrast and hue, as well as improve background-foreground discrimination, resulting in reduced over-saturation and noise over-amplification, common in related zero-reference methods. As we target machine cognition, rather than rely on assuming the correlation between human perception and downstream task performance, we conduct and present an ablation study and comparison with related zero-reference methods in terms of task-based performance across many low-light datasets, including image classification, object and face detection, showing the effectiveness of our proposed method.
zh

[CV-27] DriveMM: All-in-One Large Multimodal Model for Autonomous Driving

【速读】: 该论文试图解决当前自动驾驶(Autonomous Driving, AD)领域中数据驱动方法过于依赖单一数据集和特定任务,忽视模型整体能力和泛化能力的问题。解决方案的关键在于提出了DriveMM,一个通用的大规模多模态模型(Large Multimodal Models, LMMs),该模型能够处理多样化的数据输入(如图像和多视角视频),并执行广泛的自动驾驶任务(包括感知、预测和规划)。通过课程预训练和多数据集的微调,DriveMM在多个公共基准测试中展现了卓越的性能和零样本迁移能力,成为未来端到端自动驾驶应用的有力候选方案。

链接: https://arxiv.org/abs/2412.07689
作者: Zhijian Huang,Chengjian Feng,Feng Yan,Baihui Xiao,Zequn Jie,Yujie Zhong,Xiaodan Liang,Lin Ma
关键词-EN: incorporating large language, demonstrated exceptional comprehension, large language models, Large Multimodal Models, Large Multimodal
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Large Multimodal Models (LMMs) have demonstrated exceptional comprehension and interpretation capabilities in Autonomous Driving (AD) by incorporating large language models. Despite the advancements, current data-driven AD approaches tend to concentrate on a single dataset and specific tasks, neglecting their overall capabilities and ability to generalize. To bridge these gaps, we propose DriveMM, a general large multimodal model designed to process diverse data inputs, such as images and multi-view videos, while performing a broad spectrum of AD tasks, including perception, prediction, and planning. Initially, the model undergoes curriculum pre-training to process varied visual signals and perform basic visual comprehension and perception tasks. Subsequently, we augment and standardize various AD-related datasets to fine-tune the model, resulting in an all-in-one LMM for autonomous driving. To assess the general capabilities and generalization ability, we conduct evaluations on six public benchmarks and undertake zero-shot transfer on an unseen dataset, where DriveMM achieves state-of-the-art performance across all tasks. We hope DriveMM will serve as a promising solution for future end-to-end autonomous driving applications in the real world.
zh

[CV-28] RADIO Amplified: Improved Baselines for Agglomerative Vision Foundation Models

【速读】: 该论文试图解决聚合模型(agglomerative models)在训练视觉基础模型时面临的关键挑战,包括分辨率模式转换(resolution mode shifts)、教师模型不平衡(teacher imbalance)、教师模型特有的伪影(idiosyncratic teacher artifacts)以及输出token数量过多等问题。解决方案的关键在于提出了多项创新技术:多分辨率训练(multi-resolution training)、马赛克增强(mosaic augmentation)、改进的教师损失函数平衡(improved balancing of teacher loss functions),以及在视觉语言模型(Vision Language Models)中引入的token压缩技术(token compression technique),以在固定token数量下保持高分辨率信息。这些方法有效提升了模型的性能,并显著降低了计算和资源需求。

链接: https://arxiv.org/abs/2412.07679
作者: Greg Heinrich,Mike Ranzinger,Hongxu(Danny)Yin,Yao Lu,Jan Kautz,Andrew Tao,Bryan Catanzaro,Pavlo Molchanov
关键词-EN: leveraging multi-teacher distillation, leveraging multi-teacher, vision foundation models, training vision foundation, recently emerged
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Agglomerative models have recently emerged as a powerful approach to training vision foundation models, leveraging multi-teacher distillation from existing models such as CLIP, DINO, and SAM. This strategy enables the efficient creation of robust models, combining the strengths of individual teachers while significantly reducing computational and resource demands. In this paper, we thoroughly analyze state-of-the-art agglomerative models, identifying critical challenges including resolution mode shifts, teacher imbalance, idiosyncratic teacher artifacts, and an excessive number of output tokens. To address these issues, we propose several novel solutions: multi-resolution training, mosaic augmentation, and improved balancing of teacher loss functions. Specifically, in the context of Vision Language Models, we introduce a token compression technique to maintain high-resolution information within a fixed token count. We release our top-performing models, available in multiple scales (-B, -L, -H, and -g), alongside inference code and pretrained weights.
zh

[CV-29] FiVA: Fine-grained Visual Attribute Dataset for Text-to-Image Diffusion Models NEURIPS2024

【速读】: 该论文试图解决文本到图像生成中准确描述视觉属性的挑战,特别是对于非艺术和摄影专业人士而言。解决方案的关键在于提出了一种细粒度的视觉属性分解方法,通过构建首个细粒度视觉属性数据集 (FiVA),并基于此数据集提出了细粒度视觉属性适应框架 (FiVA-Adapter)。该框架能够将图像的美学分解为具体的视觉属性(如光照、纹理和动态效果),并允许用户从多个源图像中选择和组合这些属性,从而实现更灵活和用户友好的图像定制。

链接: https://arxiv.org/abs/2412.07674
作者: Tong Wu,Yinghao Xu,Ryan Po,Mengchen Zhang,Guandao Yang,Jiaqi Wang,Ziwei Liu,Dahua Lin,Gordon Wetzstein
关键词-EN: Recent advances, generation have enabled, diverse applications, attributes, visual attributes
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: NeurIPS 2024 (Datasets and Benchmarks Track); Project page: this https URL

点击查看摘要

Abstract:Recent advances in text-to-image generation have enabled the creation of high-quality images with diverse applications. However, accurately describing desired visual attributes can be challenging, especially for non-experts in art and photography. An intuitive solution involves adopting favorable attributes from the source images. Current methods attempt to distill identity and style from source images. However, “style” is a broad concept that includes texture, color, and artistic elements, but does not cover other important attributes such as lighting and dynamics. Additionally, a simplified “style” adaptation prevents combining multiple attributes from different sources into one generated image. In this work, we formulate a more effective approach to decompose the aesthetics of a picture into specific visual attributes, allowing users to apply characteristics such as lighting, texture, and dynamics from different images. To achieve this goal, we constructed the first fine-grained visual attributes dataset (FiVA) to the best of our knowledge. This FiVA dataset features a well-organized taxonomy for visual attributes and includes around 1 M high-quality generated images with visual attribute annotations. Leveraging this dataset, we propose a fine-grained visual attribute adaptation framework (FiVA-Adapter), which decouples and adapts visual attributes from one or more source images into a generated one. This approach enhances user-friendly customization, allowing users to selectively apply desired attributes to create images that meet their unique preferences and specific content requirements.
zh

[CV-30] Proc-GS: Procedural Building Generation for City Assembly with 3D Gaussians

【速读】: 该论文试图解决传统3D建筑资产创建过程中劳动密集、设计规则复杂以及生成式模型忽视重复元素(如窗户和门)导致视觉保真度低和可扩展性有限的问题。解决方案的关键在于将程序化建模技术(procedural modeling techniques)与3D高斯光栅化(3D Gaussian Splatting, 3D-GS)框架相结合,提出了Proc-GS方法。通过将程序化代码集成到3D-GS框架中,该方法不仅利用了高保真渲染和高效资产管理的优势,还通过操纵程序化代码简化了建筑生成过程,实现了无限多样化的建筑生成。此外,该方法通过共享基础资产显著减少了模型大小,提供了对建筑组装的精确控制,从而在保持高渲染保真度的同时实现了可扩展的城市景观生成。

链接: https://arxiv.org/abs/2412.07660
作者: Yixuan Li,Xingjian Ran,Linning Xu,Tao Lu,Mulin Yu,Zhenzhi Wang,Yuanbo Xiangli,Dahua Lin,Bo Dai
关键词-EN: featuring repeated elements, components of cities, windows and doors, primary components, featuring repeated
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project page: this https URL

点击查看摘要

Abstract:Buildings are primary components of cities, often featuring repeated elements such as windows and doors. Traditional 3D building asset creation is labor-intensive and requires specialized skills to develop design rules. Recent generative models for building creation often overlook these patterns, leading to low visual fidelity and limited scalability. Drawing inspiration from procedural modeling techniques used in the gaming and visual effects industry, our method, Proc-GS, integrates procedural code into the 3D Gaussian Splatting (3D-GS) framework, leveraging their advantages in high-fidelity rendering and efficient asset management from both worlds. By manipulating procedural code, we can streamline this process and generate an infinite variety of buildings. This integration significantly reduces model size by utilizing shared foundational assets, enabling scalable generation with precise control over building assembly. We showcase the potential for expansive cityscape generation while maintaining high rendering fidelity and precise control on both real and synthetic cases.
zh

[CV-31] Analytical-Heuristic Modeling and Optimization for Low-Light Image Enhancement

【速读】: 该论文试图解决低光图像增强这一开放性问题,关键解决方案在于结合遗传算法(Genetic Algorithms)与分析模型进行优化。遗传算法作为一种元启发式方法(metaheuristic approaches),在解决复杂优化任务中表现出色。论文提出了两种分析方法,并通过优化推理来处理低光图像向可见图像转换的物理和计算问题。实验结果表明,该方法在LOL基准测试中超越了26种现有最先进的算法,证明了简单遗传算法与分析推理结合的有效性,为群体智能和进化计算领域开辟了新的研究方向。

链接: https://arxiv.org/abs/2412.07659
作者: Axel Martinez,Emilio Hernandez,Matthieu Olague,Gustavo Olague
关键词-EN: Low-light image enhancement, image enhancement remains, Low-light image, enhancement remains, wave of artificial
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 26 pages, 6 figures, 6 tables, 34 references

点击查看摘要

Abstract:Low-light image enhancement remains an open problem, and the new wave of artificial intelligence is at the center of this problem. This work describes the use of genetic algorithms for optimizing analytical models that can improve the visualization of images with poor light. Genetic algorithms are part of metaheuristic approaches, which proved helpful in solving challenging optimization tasks. We propose two analytical methods combined with optimization reasoning to approach a solution to the physical and computational aspects of transforming dark images into visible ones. The experiments demonstrate that the proposed approach ranks at the top among 26 state-of-the-art algorithms in the LOL benchmark. The results show evidence that a simple genetic algorithm combined with analytical reasoning can defeat the current mainstream in a challenging computer vision task through controlled experiments and objective comparisons. This work opens interesting new research avenues for the swarm and evolutionary computation community and others interested in analytical and heuristic reasoning.
zh
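
遗传算法优化解析增强模型的流程,可以用一个玩具例子说明:以 gamma 校正充当解析模型,以平均亮度接近目标值充当适应度,跑“选择-交叉-变异”循环。论文中的解析模型与适应度远比此复杂,以下仅为假设性示意。

```python
import numpy as np

rng = np.random.default_rng(0)
img = rng.random((64, 64)) * 0.2           # 模拟一张偏暗图像(像素范围[0,1])

def fitness(gamma: float) -> float:
    enhanced = np.power(img, gamma)         # 解析增强模型:gamma 校正
    return -abs(enhanced.mean() - 0.5)      # 平均亮度越接近0.5,适应度越高

pop = rng.uniform(0.1, 1.0, size=20)        # 初始种群:20个gamma候选
for _ in range(30):
    scores = np.array([fitness(g) for g in pop])
    parents = pop[np.argsort(scores)[-10:]]                                   # 选择:保留前一半
    children = rng.choice(parents, 20) * 0.5 + rng.choice(parents, 20) * 0.5  # 均值交叉
    pop = np.clip(children + rng.normal(0, 0.05, 20), 0.05, 2.0)              # 高斯变异

best = pop[np.argmax([fitness(g) for g in pop])]
print(f"best gamma = {best:.3f}")  # gamma < 1 会提亮暗图
```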

[CV-32] raSCE: Trajectory Steering for Concept Erasure

【速读】: 该论文试图解决文本到图像扩散模型生成有害内容(如NSFW图像)的问题。解决方案的关键在于提出了一种名为TraSCE的方法,该方法通过改进传统的负向提示(negative prompting)并引入基于局部损失的引导机制,来引导扩散过程远离生成有害内容。具体来说,TraSCE通过修改传统负向提示的方式,并结合局部损失引导,增强了模型对有害内容的抑制能力,且无需重新训练模型或修改权重,也不依赖于额外的训练数据。

链接: https://arxiv.org/abs/2412.07658
作者: Anubhav Jain,Yuya Kobayashi,Takashi Shibuya,Yuhta Takida,Nasir Memon,Julian Togelius,Yuki Mitsufuji
关键词-EN: Recent advancements, public spotlight, everyday users, widely accessible, accessible and embraced
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Recent advancements in text-to-image diffusion models have brought them to the public spotlight, becoming widely accessible and embraced by everyday users. However, these models have been shown to generate harmful content such as not-safe-for-work (NSFW) images. While approaches have been proposed to erase such abstract concepts from the models, jail-breaking techniques have succeeded in bypassing such safety measures. In this paper, we propose TraSCE, an approach to guide the diffusion trajectory away from generating harmful content. Our approach is based on negative prompting, but as we show in this paper, conventional negative prompting is not a complete solution and can easily be bypassed in some corner cases. To address this issue, we first propose a modification of conventional negative prompting. Furthermore, we introduce a localized loss-based guidance that enhances the modified negative prompting technique by steering the diffusion trajectory. We demonstrate that our proposed method achieves state-of-the-art results on various benchmarks in removing harmful content including ones proposed by red teams; and erasing artistic styles and objects. Our proposed approach does not require any training, weight modifications, or training data (both image or prompt), making it easier for model owners to erase new concepts.
zh
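
作为背景,传统负向提示可写成以负向条件预测为基准、沿“正负之差”外推的 CFG。下面给出这一传统形式的示意,也顺带解释论文所指的失效情形:当提示与负向概念无关时,两个预测几乎相同,差值趋零,引导失去作用(TraSCE 对负向提示的修改与局部损失引导不在此最简示意之列)。

```python
import torch

def negative_prompt_cfg(eps_neg, eps_pos, guidance_scale: float = 7.5):
    """传统负向提示CFG:eps = eps_neg + s * (eps_pos - eps_neg)。
    若 eps_pos ≈ eps_neg,差值趋零,几乎不产生任何约束,即可被绕过的角落情形。"""
    return eps_neg + guidance_scale * (eps_pos - eps_neg)

eps_p = torch.randn(1, 4, 64, 64)  # 正向(用户提示)条件下的噪声预测
eps_n = torch.randn(1, 4, 64, 64)  # 负向(待擦除概念)条件下的噪声预测
print(negative_prompt_cfg(eps_n, eps_p).shape)
```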

[CV-33] Bayesian Data Augmentation and Training for Perception DNN in Autonomous Aerial Vehicles

【速读】: 该论文试图解决自主空中车辆在不同操作环境下感知训练数据不足的问题,特别是针对垂直起降无人机(VTOL UAV)的安全着陆任务。解决方案的关键在于提出了一种数据增强框架,该框架利用高保真度的车辆动力学和逼真的模拟技术生成合成数据,并通过贝叶斯优化(Bayesian Optimization)系统地探索数据增强参数空间,以优化深度神经网络(DNN)的训练。该框架显著提升了在不同光照和天气条件下的感知模型性能,使基于感知的着陆成功率提高了至少20%。

链接: https://arxiv.org/abs/2412.07655
作者: Ashik E Rasul,Humaira Tasnim,Hyung-Jin Yoon,Ayoosh Bansal,Duo Wang,Naira Hovakimyan,Lui Sha,Petros Voulgaris
关键词-EN: Learning-based solutions, enabled incredible capabilities, data, data augmentation, enabled incredible
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: To be published in AIAA SciTech 2025 Forum

点击查看摘要

Abstract:Learning-based solutions have enabled incredible capabilities for autonomous systems. Autonomous vehicles, both aerial and ground, rely on DNN for various integral tasks, including perception. The efficacy of supervised learning solutions hinges on the quality of the training data. Discrepancies between training data and operating conditions result in faults that can lead to catastrophic incidents. However, collecting vast amounts of context-sensitive data, with broad coverage of possible operating environments, is prohibitively difficult. Synthetic data generation techniques for DNN allow for the easy exploration of diverse scenarios. However, synthetic data generation solutions for aerial vehicles are still lacking. This work presents a data augmentation framework for aerial vehicle’s perception training, leveraging photorealistic simulation integrated with high-fidelity vehicle dynamics. Safe landing is a crucial challenge in the development of autonomous air taxis, therefore, landing maneuver is chosen as the focus of this work. With repeated simulations of landing in varying scenarios we assess the landing performance of the VTOL type UAV and gather valuable data. The landing performance is used as the objective function to optimize the DNN through retraining. Given the high computational cost of DNN retraining, we incorporated Bayesian Optimization in our framework that systematically explores the data augmentation parameter space to retrain the best-performing models. The framework allowed us to identify high-performing data augmentation parameters that are consistently effective across different landing scenarios. Utilizing the capabilities of this data augmentation framework, we obtained a robust perception model. The model consistently improved the perception-based landing success rate by at least 20% under different lighting and weather conditions.
zh
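
用贝叶斯优化搜索数据增强参数空间的骨架,可以用 scikit-optimize 的 gp_minimize 示意:把“以某组增强参数重训感知模型并在仿真中统计降落失败率”封装为目标函数即可。下面的目标函数只是解析占位,参数名 brightness_jitter、motion_blur 亦为笔者假设,并非论文的实际参数。

```python
from skopt import gp_minimize
from skopt.space import Real

def landing_failure_rate(params):
    """假设的评估函数:真实流程中应为"按params做增强 -> 重训DNN -> 仿真统计失败率"。
    此处用光滑解析函数占位,便于演示。"""
    brightness, blur = params
    return (brightness - 0.6) ** 2 + (blur - 0.3) ** 2  # 越小越好

space = [Real(0.0, 1.0, name="brightness_jitter"),
         Real(0.0, 1.0, name="motion_blur")]

res = gp_minimize(landing_failure_rate, space, n_calls=25, random_state=0)
print("最优增强参数:", res.x, "目标值:", res.fun)
```

高斯过程代理模型使得每次评估(此处对应一次昂贵的 DNN 重训)都被用来更新对整个参数空间的认识,这正是摘要强调贝叶斯优化的原因。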

[CV-34] OmniDocBench: Benchmarking Diverse PDF Document Parsing with Comprehensive Annotations

【速读】: 该论文试图解决当前文档内容提取方法在多样性和全面评估方面的显著局限性。解决方案的关键在于引入OmniDocBench,这是一个新颖的多源基准测试,旨在推进自动化文档内容提取技术。OmniDocBench包含一个精心策划和标注的高质量评估数据集,涵盖九种多样化的文档类型,并提供了一个灵活且全面的评估框架,具有19个布局类别标签和14个属性标签,支持对整个数据集、单个模块或特定数据类型的多层次评估。通过OmniDocBench,论文对现有的模块化管道和多模态端到端方法进行了详尽的比较分析,揭示了它们在处理文档多样性和确保公平评估方面的不足,从而为文档内容提取领域建立了稳健、多样化和公平的评估标准。

链接: https://arxiv.org/abs/2412.07626
作者: Linke Ouyang,Yuan Qu,Hongbin Zhou,Jiawei Zhu,Rui Zhang,Qunshu Lin,Bin Wang,Zhiyuan Zhao,Man Jiang,Xiaomeng Zhao,Jin Shi,Fan Wu,Pei Chu,Minghao Liu,Zhenxiang Li,Chao Xu,Bo Zhang,Botian Shi,Zhongying Tu,Conghui He
关键词-EN: large language models, Document content extraction, computer vision, language models, retrieval-augmented generation
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
备注:

点击查看摘要

Abstract:Document content extraction is crucial in computer vision, especially for meeting the high-quality data needs of large language models (LLMs) and retrieval-augmented generation (RAG) technologies. However, current document parsing methods suffer from significant limitations in terms of diversity and comprehensive evaluation. To address these challenges, we introduce OmniDocBench, a novel multi-source benchmark designed to advance automated document content extraction. OmniDocBench includes a meticulously curated and annotated high-quality evaluation dataset comprising nine diverse document types, such as academic papers, textbooks, slides, among others. Our benchmark provides a flexible and comprehensive evaluation framework with 19 layout category labels and 14 attribute labels, enabling multi-level assessments across entire datasets, individual modules, or specific data types. Using OmniDocBench, we perform an exhaustive comparative analysis of existing modular pipelines and multimodal end-to-end methods, highlighting their limitations in handling document diversity and ensuring fair evaluation. OmniDocBench establishes a robust, diverse, and fair evaluation standard for the document content extraction field, offering crucial insights for future advancements and fostering the development of document parsing technologies. The code and dataset are available at this https URL.
zh

[CV-35] PVP: Polar Representation Boost for 3D Semantic Occupancy Prediction

【速读】: 该论文试图解决极坐标表示在3D感知任务中由于非均匀分割导致的特征畸变问题。解决方案的关键在于引入Polar Voxel Occupancy Predictor (PVP),并通过两个核心设计元素来克服这一问题:一是Global Represent Propagation (GRP)模块,用于将全局空间数据整合到3D体素中;二是Plane Decomposed Convolution (PD-Conv),通过将3D畸变简化为2D卷积来减少特征畸变。这些创新使得PVP在OpenOccupancy数据集上的mIoU和IoU指标上显著优于现有方法。

链接: https://arxiv.org/abs/2412.07616
作者: Yujing Xue,Jiaxiang Liu,Jiawei Du,Joey Tianyi Zhou
关键词-EN: polar coordinate-based representations, perceptual tasks, coordinate-based representations, representations have shown, shown promise
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Recently, polar coordinate-based representations have shown promise for 3D perceptual tasks. Compared to Cartesian methods, polar grids provide a viable alternative, offering better detail preservation in nearby spaces while covering larger areas. However, they face feature distortion due to non-uniform division. To address these issues, we introduce the Polar Voxel Occupancy Predictor (PVP), a novel 3D multi-modal predictor that operates in polar coordinates. PVP features two key design elements to overcome distortion: a Global Represent Propagation (GRP) module that integrates global spatial data into 3D volumes, and a Plane Decomposed Convolution (PD-Conv) that simplifies 3D distortions into 2D convolutions. These innovations enable PVP to outperform existing methods, achieving significant improvements in mIoU and IoU metrics on the OpenOccupancy dataset.
zh

[CV-36] ViewDelta: Text-Prompted Change Detection in Unaligned Images

【速读】: 该论文试图解决现有监督模型在变化检测中对特定类型变化的局限性问题,这些模型通常需要为新任务重新训练。解决方案的关键在于提出了一种新颖的变化检测方法,首次利用未对齐的图像和文本提示来输出与用户提供的文本相关的二值分割结果。该方法不仅在多样化的变化检测应用中实现了灵活的检测,还在现有基准上达到了最先进的性能。此外,论文还发布了一个包含100,311对图像及其对应文本提示和变化检测标签的数据集,通过定量和定性分析,展示了该方法在室内、室外、街景、合成和卫星图像等多种视角数据集上的有效性。

链接: https://arxiv.org/abs/2412.07612
作者: Subin Varghese,Joshua Gao,Vedhus Hoskere
关键词-EN: infrastructure assessment, environment monitoring, situational awareness, industrial automation, fundamental problem
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Detecting changes between images is a fundamental problem in computer vision with broad applications in situational awareness, infrastructure assessment, environment monitoring, and industrial automation. Existing supervised models are typically limited to detecting specific types of changes, necessitating retraining for new tasks. To address these limitations with a single approach, we propose a novel change detection method that is the first to utilize unaligned images and textual prompts to output a binary segmentation of changes relevant to user-provided text. Our architecture not only enables flexible detection across diverse change detection use cases, but also yields state-of-the-art performance on established benchmarks. Additionally, we release an accompanying dataset comprising 100,311 pairs of images with text prompts and the corresponding change detection labels. We demonstrate the effectiveness of our method both quantitatively and qualitatively on datasets with a wide variety of viewpoints in indoor, outdoor, street level, synthetic, and satellite images.
zh

[CV-37] Faster and Better 3D Splatting via Group Training

【速读】: 该论文试图解决3D高斯喷射 (3D Gaussian Splatting, 3DGS) 技术在新型视图合成中由于大量高斯基元导致的训练效率低下的问题。解决方案的关键在于提出了分组训练 (Group Training) 策略,通过将高斯基元组织成可管理的组,从而优化训练效率并提升渲染质量。该方法不仅简单有效,还具有通用性,能够兼容现有的3DGS框架(如vanilla 3DGS和Mip-Splatting),在加速训练的同时保持高质量的合成效果,实验表明该策略可实现高达30%的训练收敛速度提升和渲染质量的改进。

链接: https://arxiv.org/abs/2412.07608
作者: Chengbo Wang,Guozheng Ma,Yifei Xue,Yizhen Lao
关键词-EN: demonstrating remarkable capability, high-fidelity scene reconstruction, Gaussian Splatting, Gaussian primitive representations, demonstrating remarkable
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:3D Gaussian Splatting (3DGS) has emerged as a powerful technique for novel view synthesis, demonstrating remarkable capability in high-fidelity scene reconstruction through its Gaussian primitive representations. However, the computational overhead induced by the massive number of primitives poses a significant bottleneck to training efficiency. To overcome this challenge, we propose Group Training, a simple yet effective strategy that organizes Gaussian primitives into manageable groups, optimizing training efficiency and improving rendering quality. This approach shows universal compatibility with existing 3DGS frameworks, including vanilla 3DGS and Mip-Splatting, consistently achieving accelerated training while maintaining superior synthesis quality. Extensive experiments reveal that our straightforward Group Training strategy achieves up to 30% faster convergence and improved rendering quality across diverse scenarios.
zh
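
分组训练的骨架大致是:把高斯基元索引划分成若干组,每次迭代只让一组参与渲染与反传。以下为分组与轮换调度的假设性示意;分组准则与轮换策略并非论文原设计,渲染与优化步骤以注释代替。

```python
import torch

def make_groups(num_gaussians: int, num_groups: int, generator=None):
    """把基元索引随机划分为 num_groups 组,返回索引张量列表。"""
    perm = torch.randperm(num_gaussians, generator=generator)
    return list(torch.chunk(perm, num_groups))

groups = make_groups(num_gaussians=100_000, num_groups=4)
for it in range(8):                       # 训练循环示意:轮流激活各组
    active = groups[it % len(groups)]     # 本次迭代参与渲染/优化的基元子集
    # render(gaussians[active]); loss.backward(); optimizer.step()
    print(f"iter {it}: 激活 {active.numel()} 个基元")
```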

[CV-38] RFL: Simplifying Chemical Structure Recognition with Ring-Free Language AAAI2025

【速读】: 该论文试图解决光学化学结构识别中复杂分子二维结构(尤其是环和多分支结构)对现有端到端方法学习一维标记序列的挑战。解决方案的关键在于提出了一种新的无环语言(Ring-Free Language, RFL),通过分治策略将复杂分子结构分解为多个部分,并以层次化形式描述,从而确保唯一性、简洁性和可读性,显著降低了识别模型的学习难度。基于RFL,论文进一步提出了分子骨架解码器(Molecular Skeleton Decoder, MSD),包括逐步预测分子骨架和单个环的骨架生成模块,以及预测分支信息的分支分类模块。实验结果表明,该方法在印刷和手写场景中均优于现有最先进的方法。

链接: https://arxiv.org/abs/2412.07594
作者: Qikai Chang,Mingjun Chen,Changpeng Pi,Pengfei Hu,Zhenrong Zhang,Jiefeng Ma,Jun Du,Baocai Yin,Jinshui Hu
关键词-EN: Optical Chemical Structure, objective of Optical, Optical Chemical, identify chemical structure, chemical structure images
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 9 pages, 6 figures. Accepted by AAAI 2025

点击查看摘要

Abstract:The primary objective of Optical Chemical Structure Recognition is to identify chemical structure images into corresponding markup sequences. However, the complex two-dimensional structures of molecules, particularly those with rings and multiple branches, present significant challenges for current end-to-end methods to learn one-dimensional markup directly. To overcome this limitation, we propose a novel Ring-Free Language (RFL), which utilizes a divide-and-conquer strategy to describe chemical structures in a hierarchical form. RFL allows complex molecular structures to be decomposed into multiple parts, ensuring both uniqueness and conciseness while enhancing readability. This approach significantly reduces the learning difficulty for recognition models. Leveraging RFL, we propose a universal Molecular Skeleton Decoder (MSD), which comprises a skeleton generation module that progressively predicts the molecular skeleton and individual rings, along with a branch classification module for predicting branch information. Experimental results demonstrate that the proposed RFL and MSD can be applied to various mainstream methods, achieving superior performance compared to state-of-the-art approaches in both printed and handwritten scenarios. The code is available at this https URL.
zh

[CV-39] DiffSensei: Bridging Multi-Modal LLM s and Diffusion Models for Customized Manga Generation

【速读】: 该论文试图解决现有文本到图像生成模型在多角色场景中对角色外观和互动控制不足的问题。解决方案的关键在于提出了DiffSensei框架,该框架结合了基于扩散的图像生成器和多模态大语言模型(MLLM),后者作为文本兼容的身份适配器。通过使用掩码交叉注意力机制,DiffSensei能够无缝地整合角色特征,实现精确的布局控制,同时避免直接的像素传输。此外,MLLM适配器根据面板特定的文本提示调整角色特征,使得角色表情、姿势和动作的灵活调整成为可能。论文还引入了MangaZero数据集,支持多角色互动和动作的可视化,进一步提升了生成效果。实验结果表明,DiffSensei在漫画生成任务中显著优于现有模型,实现了文本适应的角色定制。

链接: https://arxiv.org/abs/2412.07589
作者: Jianzong Wu,Chao Tang,Jingbo Wang,Yanhong Zeng,Xiangtai Li,Yunhai Tong
关键词-EN: creating visual narratives, textual descriptions, Story visualization, creating visual, visual narratives
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: The project page is this https URL

点击查看摘要

Abstract:Story visualization, the task of creating visual narratives from textual descriptions, has seen progress with text-to-image generation models. However, these models often lack effective control over character appearances and interactions, particularly in multi-character scenes. To address these limitations, we propose a new task: \textbfcustomized manga generation and introduce \textbfDiffSensei, an innovative framework specifically designed for generating manga with dynamic multi-character control. DiffSensei integrates a diffusion-based image generator with a multimodal large language model (MLLM) that acts as a text-compatible identity adapter. Our approach employs masked cross-attention to seamlessly incorporate character features, enabling precise layout control without direct pixel transfer. Additionally, the MLLM-based adapter adjusts character features to align with panel-specific text cues, allowing flexible adjustments in character expressions, poses, and actions. We also introduce \textbfMangaZero, a large-scale dataset tailored to this task, containing 43,264 manga pages and 427,147 annotated panels, supporting the visualization of varied character interactions and movements across sequential frames. Extensive experiments demonstrate that DiffSensei outperforms existing models, marking a significant advancement in manga generation by enabling text-adaptable character customization. The project page is this https URL.
zh
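
掩码交叉注意力的直观做法是:给定角色在面板中的区域掩码,角色特征的交叉注意力输出只写回掩码内的 token,从而实现“布局内注入、布局外不扰动”。以下为单角色情形的假设性示意,并非论文官方实现。

```python
import torch
import torch.nn as nn

class MaskedCrossAttention(nn.Module):
    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x, char_emb, region_mask):
        # x: (B, N, D) 图像token;char_emb: (B, L, D) 角色特征;region_mask: (B, N) 布尔
        out, _ = self.attn(x, char_emb, char_emb)
        # 只在角色区域内注入特征,区域外保持原样
        return torch.where(region_mask.unsqueeze(-1), x + out, x)

x = torch.randn(1, 64, 256)
c = torch.randn(1, 4, 256)
m = torch.zeros(1, 64, dtype=torch.bool)
m[:, :16] = True  # 假设前16个token属于该角色的面板区域
print(MaskedCrossAttention(256)(x, c, m).shape)
```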

[CV-40] Multimodal Contextualized Support for Enhancing Video Retrieval System

【速读】: 该论文试图解决当前视频检索系统主要依赖单个关键帧或图像进行查询,而忽略了视频片段中更高层次的抽象信息的问题。解决方案的关键在于提出一种新的流程,通过提取多模态数据并整合多个帧的信息,使模型能够捕捉视频片段中的潜在含义,而不仅仅是单帧中的物体检测。这种方法能够更准确地理解和描述视频中的动作或事件,从而提升检索的准确性。

链接: https://arxiv.org/abs/2412.07584
作者: Quoc-Bao Nguyen-Le,Thanh-Huy Le-Nguyen
关键词-EN: Current video retrieval, querying individual keyframes, primarily focus, Current video, focus on querying
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 9 pages, 4 figures

点击查看摘要

Abstract:Current video retrieval systems, especially those used in competitions, primarily focus on querying individual keyframes or images rather than encoding an entire clip or video segment. However, queries often describe an action or event over a series of frames, not a specific image. This results in insufficient information when analyzing a single frame, leading to less accurate query results. Moreover, extracting embeddings solely from images (keyframes) does not provide enough information for models to encode higher-level, more abstract insights inferred from the video. These models tend to only describe the objects present in the frame, lacking a deeper understanding. In this work, we propose a system that integrates the latest methodologies, introducing a novel pipeline that extracts multimodal data and incorporates information from multiple frames within a video, enabling the model to abstract higher-level information that captures latent meanings, focusing on what can be inferred from the video clip, rather than just focusing on object detection in one single image.
zh
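
整合多帧信息最直接的基线,是对片段内各帧的嵌入做时间池化后再归一化,用片段级向量参与检索。以下示意该基线(帧编码器以随机特征占位);论文系统在此之上还融合了多模态信息,此处不展开。

```python
import torch
import torch.nn.functional as F

def clip_level_embedding(frame_feats: torch.Tensor) -> torch.Tensor:
    """frame_feats: (T, D) 片段内 T 帧的嵌入;返回 (D,) 片段级检索向量。"""
    feats = F.normalize(frame_feats, dim=-1)          # 逐帧归一化
    return F.normalize(feats.mean(dim=0), dim=-1)     # 时间平均池化后再归一化

clip_a = clip_level_embedding(torch.randn(16, 512))   # 实际应来自帧编码器(如CLIP)
clip_b = clip_level_embedding(torch.randn(16, 512))
print(float(clip_a @ clip_b))  # 片段间余弦相似度,可直接用于排序检索
```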

[CV-41] Mobile Video Diffusion

【速读】: 该论文试图解决视频扩散模型在移动设备上因高计算需求而受限的问题。解决方案的关键在于通过多种优化策略降低模型的计算和内存需求,包括降低帧分辨率、引入多尺度时间表示、采用两种新的剪枝方案减少通道数和时间块,以及通过对抗性微调将去噪过程简化为单步操作。最终提出的MobileVD模型在计算效率上提升了523倍(从1817.2 TFLOPs降至4.34 TFLOPs),尽管视频质量略有下降(FVD由149升至171,该指标越低越好),但仍能在小米14 Pro上以1.7秒的速度生成14x512x256像素的视频片段。

链接: https://arxiv.org/abs/2412.07583
作者: Haitam Ben Yahia,Denis Korzhenkov,Ioannis Lelekas,Amir Ghodrati,Amirhossein Habibian
关键词-EN: achieved impressive realism, high computational demands, Video diffusion, Stable Video Diffusion, Video diffusion models
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Video diffusion models have achieved impressive realism and controllability but are limited by high computational demands, restricting their use on mobile devices. This paper introduces the first mobile-optimized video diffusion model. Starting from a spatio-temporal UNet from Stable Video Diffusion (SVD), we reduce memory and computational cost by reducing the frame resolution, incorporating multi-scale temporal representations, and introducing two novel pruning schema to reduce the number of channels and temporal blocks. Furthermore, we employ adversarial finetuning to reduce the denoising to a single step. Our model, coined as MobileVD, is 523x more efficient (1817.2 vs. 4.34 TFLOPs) with a slight quality drop (FVD 149 vs. 171), generating latents for a 14x512x256 px clip in 1.7 seconds on a Xiaomi-14 Pro. Our results are available at this https URL
zh

[CV-42] Unlocking the Potential of Reverse Distillation for Anomaly Detection AAAI2025

【速读】: 该论文试图解决在无监督异常检测 (Unsupervised Anomaly Detection, AD) 中,知识蒸馏 (Knowledge Distillation, KD) 方法中学生网络过度泛化导致异常区域特征差异丢失的问题。解决方案的关键在于提出了带有专家网络的反向蒸馏 (Reverse Distillation with Expert, RD with Expert) 方法,通过引入专家网络增强学生生成正常特征的能力,并优化教师区分正常与异常特征的能力,从而减少漏检。此外,设计了引导信息注入 (Guided Information Injection) 机制,用于过滤和传递教师到学生的特征,提升细节重建并减少误报。实验证明该方法在多个基准数据集上优于现有的无监督异常检测方法。

链接: https://arxiv.org/abs/2412.07579
作者: Xinyue Liu,Jianyuan Wang,Biao Leng,Shuo Zhang
关键词-EN: unsupervised Anomaly Detection, Knowledge Distillation, Anomaly Detection, unsupervised Anomaly, promising approach
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 18 pages, 14 figures, AAAI 2025

点击查看摘要

Abstract:Knowledge Distillation (KD) is a promising approach for unsupervised Anomaly Detection (AD). However, the student network’s over-generalization often diminishes the crucial representation differences between teacher and student in anomalous regions, leading to detection failures. To address this problem, the widely accepted Reverse Distillation (RD) paradigm designs an asymmetric teacher and student, using an encoder as teacher and a decoder as student. Yet, the design of RD does not ensure that the teacher encoder effectively distinguishes between normal and abnormal features or that the student decoder generates anomaly-free features. Additionally, the absence of skip connections results in a loss of fine details during feature reconstruction. To address these issues, we propose RD with Expert, which introduces a novel Expert-Teacher-Student network for simultaneous distillation of both the teacher encoder and student decoder. The added expert network enhances the student’s ability to generate normal features and optimizes the teacher’s differentiation between normal and abnormal features, reducing missed detections. Additionally, Guided Information Injection is designed to filter and transfer features from teacher to student, improving detail reconstruction and minimizing false positives. Experiments on several benchmarks prove that our method outperforms existing unsupervised AD methods under the RD paradigm, fully unlocking RD’s potential.
zh
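
下面给出一个最小化的异常图计算示意,基于 RD 范式中常用的多尺度“教师-学生特征余弦距离”打分方式,并非论文官方实现;teacher_feats、student_feats 等命名均为假设:

```python
import torch
import torch.nn.functional as F

def anomaly_map(teacher_feats, student_feats, out_size=(256, 256)):
    """多尺度特征余弦距离异常图(RD 范式常用打分方式的示意)。

    teacher_feats / student_feats: 逐层对应的特征列表,形状 [B, C, H, W]。
    """
    amap = 0
    for ft, fs in zip(teacher_feats, student_feats):
        # 1 - 余弦相似度:师生特征差异越大,异常分数越高
        d = 1 - F.cosine_similarity(ft, fs, dim=1, eps=1e-6)  # [B, H, W]
        d = F.interpolate(d.unsqueeze(1), size=out_size,
                          mode="bilinear", align_corners=False)
        amap = amap + d
    return amap / len(teacher_feats)  # [B, 1, H, W],取最大值可作图像级分数
```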

[CV-43] Making the Flow Glow – Robot Perception under Severe Lighting Conditions using Normalizing Flow Gradients

【速读】: 该论文试图解决在实际部署中,基于神经网络的机器人感知系统在复杂成像条件下可靠性不足的问题。解决方案的关键在于利用归一化流模型 (normalizing flow) 的绝对梯度值进行局部区域优化,而非对整个图像进行优化。通过这种方式,感知系统能够更好地适应困难的视觉场景,从而在物体检测任务中显著提高成功率,实验结果表明该方法在极端光照条件下的成功率比之前的方法提高了60%。

链接: https://arxiv.org/abs/2412.07565
作者: Simon Kristoffersson Lind,Rudolph Triebel,Volker Krüger
关键词-EN: Modern robotic perception, Modern robotic, highly dependent, neural networks, real-world deployment
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Modern robotic perception is highly dependent on neural networks. It is well known that neural network-based perception can be unreliable in real-world deployment, especially in difficult imaging conditions. Out-of-distribution detection is commonly proposed as a solution for ensuring reliability in real-world deployment. Previous work has shown that normalizing flow models can be used for out-of-distribution detection to improve the reliability of robotic perception tasks. Specifically, camera parameters can be optimized with respect to the likelihood output from a normalizing flow, which allows a perception system to adapt to difficult vision scenarios. With this work we propose to use the absolute gradient values from a normalizing flow, which allows the perception system to optimize local regions rather than the whole image. By setting up a tabletop picking experiment with exceptionally difficult lighting conditions, we show that our method achieves a 60% higher success rate for an object detection task compared to previous methods.
zh
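
以下用 PyTorch 勾勒“利用归一化流对输入图像的对数似然梯度绝对值来挑选局部优化区域”这一核心思路。假设 flow(image) 返回逐样本对数似然,且 H、W 可被 patch 整除;仅为示意,非论文源码:

```python
import torch
import torch.nn.functional as F

def local_region_mask(flow, image, patch=32, top_k=4):
    """用 |d log p / d x| 的区域均值挑出最需要优化的局部区域(示意)。"""
    image = image.clone().requires_grad_(True)
    log_prob = flow(image).sum()                         # 汇总为标量便于反传
    grad = torch.autograd.grad(log_prob, image)[0].abs()  # 对数似然梯度绝对值
    B, C, H, W = grad.shape
    # 按 patch 聚合梯度幅值,选出 top_k 个梯度最大的局部区域
    pooled = F.avg_pool2d(grad.mean(1, keepdim=True), patch)  # [B,1,H/p,W/p]
    flat = pooled.flatten(1)
    idx = flat.topk(top_k, dim=1).indices
    mask = torch.zeros_like(flat)
    mask.scatter_(1, idx, 1.0)
    return mask.view(B, 1, H // patch, W // patch)  # 区域级掩码,供参数优化使用
```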

[CV-44] ReCap: Better Gaussian Relighting with Cross-Environment Captures

【速读】: 该论文试图解决在多样且未见过的环境中进行精确的3D物体重新照明(relighting)的问题。由于反射率-光照的模糊性(albedo-lighting ambiguity),现有方法在生成忠实的重新照明效果时表现不佳。论文提出的解决方案ReCap通过将跨环境捕捉视为多任务目标,提供了缺失的监督信号,从而解开了光照和材质属性的纠缠。关键在于ReCap联合优化了多个光照表示,这些表示共享一组共同的材质属性,从而在材质属性的基础上协调一致地重建光照,实现了物理上合理的光照重建和鲁棒的材质估计,这对于精确的重新照明至关重要。

链接: https://arxiv.org/abs/2412.07534
作者: Jingzhi Li,Zongwei Wu,Eduard Zamfir,Radu Timofte
关键词-EN: diverse unseen environments, virtual object placement, realistic virtual object, diverse unseen, crucial for realistic
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Accurate 3D objects relighting in diverse unseen environments is crucial for realistic virtual object placement. Due to the albedo-lighting ambiguity, existing methods often fall short in producing faithful relights. Without proper constraints, observed training views can be explained by numerous combinations of lighting and material attributes, lacking physical correspondence with the actual environment maps used for relighting. In this work, we present ReCap, treating cross-environment captures as multi-task target to provide the missing supervision that cuts through the entanglement. Specifically, ReCap jointly optimizes multiple lighting representations that share a common set of material attributes. This naturally harmonizes a coherent set of lighting representations around the mutual material attributes, exploiting commonalities and differences across varied object appearances. Such coherence enables physically sound lighting reconstruction and robust material estimation - both essential for accurate relighting. Together with a streamlined shading function and effective post-processing, ReCap outperforms the leading competitor by 3.4 dB in PSNR on an expanded relighting benchmark.
zh
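
下面是“多套光照表示共享同一组材质属性并联合优化”这一核心思想的极简示意。着色函数与观测数据均为占位假设,并非论文的 PBR 实现:

```python
import torch

K, N = 4, 10000                                   # 环境数、表面采样点数(假设值)
albedo = torch.rand(N, 3, requires_grad=True)     # 跨环境共享的材质属性
lights = [torch.rand(16, 3, requires_grad=True) for _ in range(K)]  # 每环境一套光照系数
observations = [torch.rand(N, 3) for _ in range(K)]  # 各环境观测颜色(此处用随机数占位)

def shade(albedo, light):
    # 占位着色函数:真实实现应为论文中的精简 shading function
    return albedo * light.mean(dim=0)

opt = torch.optim.Adam([albedo] + lights, lr=1e-2)
for step in range(1000):
    loss = torch.zeros(())
    for k in range(K):            # 多任务目标:所有环境共同约束同一套材质
        pred = shade(albedo, lights[k])
        loss = loss + (pred - observations[k]).pow(2).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
```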

[CV-45] Deep Joint Unrolling for Deblurring and Low-Light Image Enhancement (JUDE)

【速读】: 该论文试图解决夜间拍摄中常见的低光和模糊问题,这些问题通常由于长时间曝光以应对暗环境所致。解决方案的关键在于引入JUDE(Deep Joint Unrolling for Deblurring and Low-Light Image Enhancement),这是一种基于图像物理模型(image physical model)的深度联合展开方法。JUDE结合了Retinex理论和模糊模型,通过迭代去模糊和分解过程,生成清晰的低光反射和光照分量。其核心机制是通过展开机制(unrolling mechanism)实现,并结合了多种模块来估计初始模糊核、增强亮度和消除最终图像中的噪声。实验结果表明,JUDE在LOL-Blur和Real-LOL-Blur数据集上在定量和定性方面均优于现有技术。

链接: https://arxiv.org/abs/2412.07527
作者: Tu Vo,Chan Y. Park
关键词-EN: address dim environments, photos at night, dim environments, issues are prevalent, prevalent when capturing
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 10 pages

点击查看摘要

Abstract:Low-light and blurring issues are prevalent when capturing photos at night, often due to the use of long exposure to address dim environments. Addressing these joint problems can be challenging and error-prone if an end-to-end model is trained without incorporating an appropriate physical model. In this paper, we introduce JUDE, a Deep Joint Unrolling for Deblurring and Low-Light Image Enhancement, inspired by the image physical model. Based on Retinex theory and the blurring model, the low-light blurry input is iteratively deblurred and decomposed, producing sharp low-light reflectance and illuminance through an unrolling mechanism. Additionally, we incorporate various modules to estimate the initial blur kernel, enhance brightness, and eliminate noise in the final image. Comprehensive experiments on LOL-Blur and Real-LOL-Blur demonstrate that our method outperforms existing techniques both quantitatively and qualitatively.
zh
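
以下为展开机制(unrolling)的骨架示意:每个阶段先做一次去模糊更新,再按 Retinex 理论分解为反射与光照,重组后送入下一阶段。各子模块用占位卷积代替,仅表达迭代结构,非论文实现:

```python
import torch
import torch.nn as nn

class UnrollStage(nn.Module):
    """单个展开阶段:去模糊更新 + Retinex 分解更新(示意)。"""
    def __init__(self, ch=3):
        super().__init__()
        self.deblur = nn.Conv2d(ch, ch, 3, padding=1)      # 占位:去模糊更新项
        self.decomp = nn.Conv2d(ch, ch * 2, 3, padding=1)  # 输出反射 R 与光照 L

    def forward(self, x):
        x = x - self.deblur(x)                  # 类似梯度步的去模糊更新
        r, l = self.decomp(x).chunk(2, dim=1)   # Retinex: I ≈ R ⊙ L
        return torch.sigmoid(r), torch.sigmoid(l)

stages = nn.ModuleList(UnrollStage() for _ in range(5))
x = torch.rand(1, 3, 128, 128)      # 低光模糊输入
for stage in stages:
    r, l = stage(x)
    x = r * l                       # 用当前估计重组图像,送入下一阶段迭代精化
```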

[CV-46] Hallucination Elimination and Semantic Enhancement Framework for Vision-Language Models in Traffic Scenarios

【速读】: 该论文试图解决大视觉-语言模型(Large vision-language models, LVLMs)在多模态理解和生成任务中偶尔产生的幻觉文本问题,特别是在自动驾驶系统中可能导致错误决策的情况。解决方案的关键是提出了HCOENet,一种即插即用的思维链校正方法(plug-and-play chain-of-thought correction method),通过交叉验证机制(cross-checking mechanism)过滤实体并直接从图像中提取关键对象,从而消除对象幻觉并生成增强的描述文本。该方法在POPE基准测试中显著提升了Mini-InternVL-4B和mPLUG-Owl3模型的F1分数,并在开放校园场景中展示了其实际应用性。

链接: https://arxiv.org/abs/2412.07518
作者: Jiaqi Fan,Jianhua Wu,Hongqing Chu,Quanbo Ge,Bingzhao Gao
关键词-EN: Large vision-language models, demonstrated remarkable capabilities, Large vision-language, generation tasks, demonstrated remarkable
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Large vision-language models (LVLMs) have demonstrated remarkable capabilities in multimodal understanding and generation tasks. However, these models occasionally generate hallucinatory texts, resulting in descriptions that seem reasonable but do not correspond to the image. This phenomenon can lead to wrong driving decisions by the autonomous driving system. To address this challenge, this paper proposes HCOENet, a plug-and-play chain-of-thought correction method designed to eliminate object hallucinations and generate enhanced descriptions for critical objects overlooked in the initial response. Specifically, HCOENet employs a cross-checking mechanism to filter entities and directly extracts critical objects from the given image, enriching the descriptive text. Experimental results on the POPE benchmark demonstrate that HCOENet improves the F1-score of the Mini-InternVL-4B and mPLUG-Owl3 models by 12.58% and 4.28%, respectively. Additionally, qualitative results using images collected in an open campus scene further highlight the practical applicability of the proposed method. Compared with the GPT-4o model, HCOENet achieves comparable descriptive performance while significantly reducing costs. Finally, two novel semantic understanding datasets, CODA_desc and nuScenes_desc, are created for traffic scenarios to support future research. The codes and datasets are publicly available at this https URL.
zh

[CV-47] FireFlow: Fast Inversion of Rectified Flow for Image Semantic Editing

【速读】: 该论文试图解决Rectified Flows (ReFlows) 在快速采样时,图像反转为结构化噪声以进行恢复和后续编辑的问题。解决方案的关键在于引入FireFlow,这是一种简单而有效的零样本方法,能够在8步内实现精确的反演和编辑。关键创新点在于设计了一种精心优化的数值求解器,该求解器结合了二阶求解器的精度与一阶Euler方法的效率,实现了比现有ReFlow反演和编辑技术快3倍的运行速度,同时提供了更小的重建误差和更优的编辑效果,且无需训练。

链接: https://arxiv.org/abs/2412.07517
作者: Yingying Deng,Xiangyu He,Changwang Mei,Peisong Wang,Fan Tang
关键词-EN: Rectified Flows, transforms images back, fast inversion transforms, editing remains unsolved, inversion transforms images
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: technical report

点击查看摘要

Abstract:Though Rectified Flows (ReFlows) with distillation offer a promising way for fast sampling, fast inversion that transforms images back to structured noise for recovery and subsequent editing remains unsolved. This paper introduces FireFlow, a simple yet effective zero-shot approach that inherits the startling capacity of ReFlow-based models (such as FLUX) in generation while extending its capabilities to accurate inversion and editing in 8 steps. We first demonstrate that a carefully designed numerical solver is pivotal for ReFlow inversion, enabling accurate inversion and reconstruction with the precision of a second-order solver while maintaining the practical efficiency of a first-order Euler method. This solver achieves a 3× runtime speedup compared to state-of-the-art ReFlow inversion and editing techniques, while delivering smaller reconstruction errors and superior editing results in a training-free mode. The code is available at this https URL.
zh
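
论文的关键是一个以一阶成本逼近二阶精度的数值求解器。一个常见的实现思路是缓存上一步的速度估计作为中点近似,使每步只需一次新的模型调用;以下为该思路的示意(v 的签名为假设,非 FireFlow 官方代码):

```python
import torch

def reflow_integrate(v, z0, steps=8, t0=0.0, t1=1.0):
    """ReFlow ODE 积分示意:复用上一步速度作为中点外推,
    每步仍只新增一次模型调用(接近二阶精度、一阶成本)。
    v(z, t) 为速度场(如去噪网络),签名为假设。"""
    dt = (t1 - t0) / steps
    z, t = z0, t0
    v_prev = v(z, t)                     # 仅首步额外一次调用
    for _ in range(steps):
        z_mid = z + 0.5 * dt * v_prev    # 用缓存速度外推到中点
        v_mid = v(z_mid, t + 0.5 * dt)   # 每步唯一的新模型调用
        z = z + dt * v_mid               # 中点法更新
        t = t + dt
        v_prev = v_mid                   # 缓存供下一步复用
    return z
```

把 t0、t1 对调并取负步长即可将同一求解器用于反演(图像→噪声)方向。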

[CV-48] Stealthy and Robust Backdoor Attack against 3D Point Clouds through Additional Point Features

【速读】: 该论文试图解决3D深度神经网络(3D DNNs)在处理3D点云时面临的隐蔽且难以防御的后门攻击问题。解决方案的关键在于提出了一种隐蔽且鲁棒的后门攻击(Stealthy and Robust Backdoor Attack, SRBA),通过在点云的附加点特征(如反射强度)上应用均匀偏移作为触发器,而不改变点云的几何信息,从而确保了视觉一致性和对预处理防御的鲁棒性。此外,论文采用贝叶斯优化(Bayesian Optimization, BO)来自动化选择合适的触发器,实验结果表明SRBA在多种预处理操作下仍能保持超过94%的攻击成功率(ASR),显著优于现有的最先进方法。

链接: https://arxiv.org/abs/2412.07511
作者: Xiaoyang Ning,Qing Xie,Jinyu Xu,Wenbo Jiang,Jiachen Li,Yanchun Ma
关键词-EN: Deep Neural Networks, Deep Neural, Neural Networks, Robust Backdoor Attack, security-critical applications
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Recently, 3D backdoor attacks have posed a substantial threat to 3D Deep Neural Networks (3D DNNs) designed for 3D point clouds, which are extensively deployed in various security-critical applications. Although existing 3D backdoor attacks achieve high attack performance, they remain vulnerable to preprocessing-based defenses (e.g., outlier removal and rotation augmentation) and are prone to detection by human inspection. In pursuit of a more challenging-to-defend and stealthy 3D backdoor attack, this paper introduces the Stealthy and Robust Backdoor Attack (SRBA), which ensures robustness and stealthiness through intentional design considerations. The key insight of our attack is to apply, as the trigger, a uniform shift to the additional point features of point clouds (e.g., reflection intensity) that are widely utilized as part of the inputs to 3D DNNs. Without altering the geometric information of the point clouds, our attack ensures visual consistency between poisoned and benign samples, and demonstrates robustness against preprocessing-based defenses. In addition, to automate our attack, we employ Bayesian Optimization (BO) to identify the suitable trigger. Extensive experiments suggest that SRBA achieves an attack success rate (ASR) exceeding 94% in all cases, and significantly outperforms previous SOTA methods when multiple preprocessing operations are applied during training.
zh
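
触发器注入本身非常简单:在不改动 xyz 几何的前提下,对反射强度通道施加统一偏移。以下为示意(shift 取值与数据布局均为假设):

```python
import numpy as np

def inject_trigger(points, shift=0.05):
    """对点云附加特征(反射强度)施加统一偏移作为后门触发器(示意)。

    points: [N, 4] 数组,前 3 列为 xyz,第 4 列为反射强度;
    几何信息保持不变,仅平移强度通道。"""
    poisoned = points.copy()
    poisoned[:, 3] = np.clip(poisoned[:, 3] + shift, 0.0, 1.0)
    return poisoned

clean = np.random.rand(1024, 4).astype(np.float32)
poisoned = inject_trigger(clean)
assert np.allclose(clean[:, :3], poisoned[:, :3])   # 几何未被改动
```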

[CV-49] Enhancing 3D Object Detection in Autonomous Vehicles Based on Synthetic Virtual Environment Analysis

【速读】: 该论文试图解决自动驾驶车辆(Autonomous Vehicles, AVs)在实时场景分析中进行3D物体检测的问题。解决方案的关键在于通过AI模型推导出3D边界框(3D bounding boxes),并将其投影到三维环境中,从而实现对物体的精确识别。研究利用增强现实(AR)生态系统显著提升了3D物体检测的性能,并通过合成数据集(synthetic dataset)模拟了多种环境、光照和时空状态下的复杂场景,以评估模型在不同天气条件和相机设置下的表现。这种方法旨在应对更具挑战性的检测和识别场景,确保在大多数测试条件下实现竞争性的结果。

链接: https://arxiv.org/abs/2412.07509
作者: Vladislav Li,Ilias Siniosoglou,Thomai Karamitsou,Anastasios Lytos,Ioannis D. Moscholios,Sotirios K. Goudos,Jyoti S. Banerjee,Panagiotis Sarigiannidi,Vasileios Argyriou
关键词-EN: Autonomous Vehicles, inferring digital elements, facilitating proactive detection, digital elements, facilitating proactive
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Autonomous Vehicles (AVs) use natural images and videos as input to understand the real world by overlaying and inferring digital elements, facilitating proactive detection in an effort to assure safety. A crucial aspect of this process is real-time, accurate object recognition through automatic scene analysis. While traditional methods primarily concentrate on 2D object detection, exploring 3D object detection, which involves projecting 3D bounding boxes into the three-dimensional environment, holds significance and can be notably enhanced using the AR ecosystem. This study examines an AI model’s ability to deduce 3D bounding boxes in the context of real-time scene analysis while producing and evaluating the model’s performance and processing time, in the virtual domain, which is then applied to AVs. This work also employs a synthetic dataset that includes artificially generated images mimicking various environmental, lighting, and spatiotemporal states. This evaluation is oriented in handling images featuring objects in diverse weather conditions, captured with varying camera settings. These variations pose more challenging detection and recognition scenarios, under which the outcomes of this work help achieve competitive results in most of the tested conditions.
zh

[CV-50] EDGE: Unknown-aware Multi-label Learning by Energy Distribution Gap Expansion AAAI2025

【速读】: 该论文试图解决多标签分布外(OOD)检测中的不平衡问题,特别是在模型判别能力不足时,少数类样本容易被错误分类为OOD样本的问题。解决方案的关键在于提出了一种未知感知的多标签学习框架,通过辅助异常暴露(OE)来重塑不确定性能量空间布局。具体来说,该框架分别优化尾部ID样本和未知样本的能量分数,并扩大它们之间的能量分布差距,从而使尾部ID样本的能量分数显著高于OOD样本。此外,设计了一种简单有效的措施来选择更具信息性的OE数据集。

链接: https://arxiv.org/abs/2412.07499
作者: Yuchen Sun,Qianqian Xu,Zitai Wang,Zhiyong Yang,Junwei He
关键词-EN: OOD, aims to discriminate, OOD detection, OOD samples, Multi-label
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 9 pages, 5 figures, accepted by AAAI 2025

点击查看摘要

Abstract:Multi-label Out-Of-Distribution (OOD) detection aims to discriminate the OOD samples from the multi-label In-Distribution (ID) ones. Compared with its multiclass counterpart, it is crucial to model the joint information among classes. To this end, JointEnergy, which is a representative multi-label OOD inference criterion, summarizes the logits of all the classes. However, we find that JointEnergy can produce an imbalance problem in OOD detection, especially when the model lacks enough discrimination ability. Specifically, we find that the samples only related to minority classes tend to be classified as OOD samples due to the ambiguous energy decision boundary. Besides, imbalanced multi-label learning methods, originally designed for ID ones, would not be suitable for OOD detection scenarios, even producing a serious negative transfer effect. In this paper, we resort to auxiliary outlier exposure (OE) and propose an unknown-aware multi-label learning framework to reshape the uncertainty energy space layout. In this framework, the energy score is separately optimized for tail ID samples and unknown samples, and the energy distribution gap between them is expanded, such that the tail ID samples can have a significantly larger energy score than the OOD ones. What’s more, a simple yet effective measure is designed to select more informative OE datasets. Finally, comprehensive experimental results on multiple multi-label and OOD datasets reveal the effectiveness of the proposed method.
zh
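
下面给出 JointEnergy 打分与“能量分布差距扩展”损失的极简示意。JointEnergy 按 Wang et al., 2021 的常见定义实现;hinge 形式的差距损失是对论文思想的一种简化假设:

```python
import torch
import torch.nn.functional as F

def joint_energy(logits):
    """多标签 JointEnergy 打分(Wang et al., 2021 的常见定义,此处作示意):
    对每个类别的标签级自由能 log(1 + e^{f_k}) 求和。"""
    return F.softplus(logits).sum(dim=1)        # [B]

def energy_gap_loss(logits_tail_id, logits_oe, margin=2.0):
    """扩大尾部 ID 样本与辅助异常(OE)样本能量差距的示意损失:
    希望尾部 ID 能量高出 OOD 能量至少 margin。"""
    e_id = joint_energy(logits_tail_id)
    e_oe = joint_energy(logits_oe)
    return F.relu(margin - (e_id.mean() - e_oe.mean()))
```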

[CV-51] ResGS: Residual Densification of 3D Gaussian for Efficient Detail Recovery

【速读】: 该论文试图解决3D高斯喷射 (3D Gaussian Splatting, 3D-GS) 在新型视图合成中难以捕捉丰富细节和完整几何形状的问题。解决方案的关键在于引入了一种新的密度化方法,称为残差分割 (residual split),通过添加一个缩小的残差高斯来适应性地恢复细节并补充缺失的几何形状,同时支持渐进式优化。此外,论文提出了一个名为ResGS的管道,结合高斯图像金字塔进行渐进式监督,并设计了随训练进程优先对粗尺度高斯进行密度化的选择方案。实验结果表明,该方法在渲染质量上达到了最先进的水平,并且在多种3D-GS变体上均表现出一致的性能提升。

链接: https://arxiv.org/abs/2412.07494
作者: Yanzhe Lyu,Kai Cheng,Xin Kang,Xuejin Chen
关键词-EN: achieving high fidelity, Gaussian Splatting, view synthesis, achieving high, fidelity and efficiency
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Recently, 3D Gaussian Splatting (3D-GS) has prevailed in novel view synthesis, achieving high fidelity and efficiency. However, it often struggles to capture rich details and complete geometry. Our analysis highlights a key limitation of 3D-GS caused by the fixed threshold in densification, which balances geometry coverage against detail recovery as the threshold varies. To address this, we introduce a novel densification method, residual split, which adds a downscaled Gaussian as a residual. Our approach is capable of adaptively retrieving details and complementing missing geometry while enabling progressive refinement. To further support this method, we propose a pipeline named ResGS. Specifically, we integrate a Gaussian image pyramid for progressive supervision and implement a selection scheme that prioritizes the densification of coarse Gaussians over time. Extensive experiments demonstrate that our method achieves SOTA rendering quality. Consistent performance improvements can be achieved by applying our residual split on various 3D-GS variants, underscoring its versatility and potential for broader application in 3D-GS-based applications.
zh
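
残差分割的核心操作可以概括为:在被选中的高斯旁追加一个中心相同、尺度缩小、初始贡献较低的残差高斯。以下为示意(参数组织方式与初始化取值均为假设):

```python
import torch

def residual_split(mu, scale, opacity, feats, shrink=0.5):
    """残差分割示意:为每个被选中的高斯追加一个按比例缩小的“残差高斯”,
    原高斯保持不变,由残差补细节与缺失几何(参数命名均为假设)。"""
    res_mu = mu.clone()                   # 与父高斯同中心
    res_scale = scale * shrink            # 缩小的协方差尺度
    res_opacity = opacity * 0.5           # 初始贡献较小,便于渐进优化
    res_feats = torch.zeros_like(feats)   # 残差特征从零开始学习
    return (torch.cat([mu, res_mu]), torch.cat([scale, res_scale]),
            torch.cat([opacity, res_opacity]), torch.cat([feats, res_feats]))
```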

[CV-52] Stereo Hand-Object Reconstruction for Human-to-Robot Handover

【速读】: 该论文试图解决在人机交接过程中机器人抓取手和物体形状的联合估计问题,特别是在处理透明物体和未见过物体时面临的挑战。解决方案的关键在于提出了一种基于立体视觉(stereo-based)的手-物体重建方法,通过概率性地结合单视图重建来形成一致的立体重建。该方法利用从大规模合成手-物体数据集中学习的三维形状先验(3D shape priors),确保其泛化能力,并使用RGB输入而非深度信息,以更好地捕捉透明物体。通过投影基的外点去除步骤处理重建的手-物体形状,并将其输出用于指导配备宽基线立体RGB摄像机的人机交接流程。

链接: https://arxiv.org/abs/2412.07487
作者: Yik Lung Pang,Alessio Xompero,Changjae Oh,Andrea Cavallaro
关键词-EN: Jointly estimating hand, Jointly estimating, estimating hand, Jointly, RGB
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
备注: 8 pages, 9 figures, 1 table

点击查看摘要

Abstract:Jointly estimating hand and object shape ensures the success of the robot grasp in human-to-robot handovers. However, relying on hand-crafted prior knowledge about the geometric structure of the object fails when generalising to unseen objects, and depth sensors fail to detect transparent objects such as drinking glasses. In this work, we propose a stereo-based method for hand-object reconstruction that combines single-view reconstructions probabilistically to form a coherent stereo reconstruction. We learn 3D shape priors from a large synthetic hand-object dataset to ensure that our method is generalisable, and use RGB inputs instead of depth as RGB can better capture transparent objects. We show that our method achieves a lower object Chamfer distance compared to existing RGB based hand-object reconstruction methods on single view and stereo settings. We process the reconstructed hand-object shape with a projection-based outlier removal step and use the output to guide a human-to-robot handover pipeline with wide-baseline stereo RGB cameras. Our hand-object reconstruction enables a robot to successfully receive a diverse range of household objects from the human.
zh

[CV-53] Manta: Enhancing Mamba for Few-Shot Action Recognition of Long Sub-Sequence AAAI2025

【速读】: 该论文试图解决少样本动作识别 (Few-shot Action Recognition, FSAR) 中长子序列建模的计算复杂性和类内方差积累的问题。解决方案的关键在于提出了一种名为 Manta 的框架,该框架结合了 Matryoshka Mamba 和混合对比学习 (Contrastive Learning) 方法。Matryoshka Mamba 通过引入多个内部模块来增强局部特征表示,并通过外部模块捕捉时间线上的依赖关系,从而实现隐式的时间对齐。混合对比学习则结合了监督学习和无监督学习,以减轻类内方差积累对性能的负面影响。这两个组件在 Manta 中并行运行,显著提升了长子序列的少样本动作识别性能,并在多个基准测试中取得了最新的最优结果。

链接: https://arxiv.org/abs/2412.07481
作者: Wenbo Huang,Jinghui Zhang,Guang Li,Lei Zhang,Shuoyuan Wang,Fang Dong,Jiahui Jin,Takahiro Ogawa,Miki Haseyama
关键词-EN: few-shot action recognition, express entire actions, video naturally express, naturally express entire
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by AAAI 2025

点击查看摘要

Abstract:In few-shot action recognition (FSAR), long sub-sequences of video naturally express entire actions more effectively. However, the computational complexity of mainstream Transformer-based methods limits their application. Recent Mamba demonstrates efficiency in modeling long sequences, but directly applying Mamba to FSAR overlooks the importance of local feature modeling and alignment. Moreover, long sub-sequences within the same class accumulate intra-class variance, which adversely impacts FSAR performance. To solve these challenges, we propose a Matryoshka MAmba and CoNtrasTive LeArning framework (Manta). Firstly, the Matryoshka Mamba introduces multiple Inner Modules to enhance local feature representation, rather than directly modeling global features. An Outer Module captures dependencies of timeline between these local features for implicit temporal alignment. Secondly, a hybrid contrastive learning paradigm, combining both supervised and unsupervised methods, is designed to mitigate the negative effects of intra-class variance accumulation. The Matryoshka Mamba and the hybrid contrastive learning paradigm operate in parallel branches within Manta, enhancing Mamba for FSAR of long sub-sequence. Manta achieves new state-of-the-art performance on prominent benchmarks, including SSv2, Kinetics, UCF101, and HMDB51. Extensive empirical studies prove that Manta significantly improves FSAR of long sub-sequence from multiple perspectives. The code is released at this https URL.
zh

[CV-54] BENet: A Cross-domain Robust Network for Detecting Face Forgeries via Bias Expansion and Latent-space Attention

【速读】: 该论文旨在解决当前深度伪造(deepfake)检测器在面对不同伪造技术(即“跨域”)时表现出的局限性问题。解决方案的关键在于引入了一个名为BENet的跨域鲁棒偏差扩展网络,其核心创新包括:1) 基于自编码器(autoencoder)的偏差扩展模块,该模块在保留真实面部特征的同时,增强伪造重建中的差异,从而为跨域伪造检测提供可靠的偏差;2) 引入潜在空间注意力(Latent-Space Attention, LSA)模块,用于捕捉不同尺度下的伪造不一致性,确保对高级深度伪造技术的鲁棒防御;3) 通过交叉域检测器模块,在推理过程中验证面部域,提升对未知来源伪造的识别准确性。此外,论文首次采用了端到端的训练方式,并引入了新的偏差扩展损失函数,显著提升了BENet在跨域和跨数据集场景下的检测性能。

链接: https://arxiv.org/abs/2412.07431
作者: Weihua Liu,Jianhua Qiu,Said Boumaraf,Chaochao lin,Pan liyuan,Lin Li,Mohammed Bennamoun,Naoufel Werghi
关键词-EN: Bias Expansion, growing threat, fake faces, Bias Expansion Network, Robust Bias Expansion
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:In response to the growing threat of deepfake technology, we introduce BENet, a Cross-Domain Robust Bias Expansion Network. BENet enhances the detection of fake faces by addressing limitations in current detectors related to variations across different types of fake face generation techniques, where "cross-domain" refers to the diverse range of these deepfakes, each considered a separate domain. BENet’s core feature is a bias expansion module based on autoencoders. This module maintains genuine facial features while enhancing differences in fake reconstructions, creating a reliable bias for detecting fake faces across various deepfake domains. We also introduce a Latent-Space Attention (LSA) module to capture inconsistencies related to fake faces at different scales, ensuring robust defense against advanced deepfake techniques. The enriched LSA feature maps are multiplied with the expanded bias to create a versatile feature space optimized for subtle forgery detection. To improve its ability to detect fake faces from unknown sources, BENet integrates a cross-domain detector module that enhances recognition accuracy by verifying the facial domain during inference. We train our network end-to-end with a novel bias expansion loss, adopted for the first time, in face forgery detection. Extensive experiments covering both intra- and cross-dataset settings demonstrate BENet’s superiority over current state-of-the-art solutions.
zh

[CV-55] DSFEC: Efficient and Deployable Deep Radar Object Detection

【速读】: 该论文旨在解决在资源受限的边缘设备(如Raspberry Pi)上部署雷达目标检测模型时面临的挑战,主要问题包括模型尺寸过大、计算能力和内存有限。解决方案的关键在于引入Depthwise Separable Convolutions(深度可分离卷积)以提高网络效率,并创新性地设计了Feature Enhancement and Compression (FEC)模块,集成到PointPillars特征编码器中以增强模型性能。通过这些技术,论文提出了DSFEC-L模型及其两个版本(DSFEC-M和DSFEC-S),在nuScenes数据集上显著优于基线模型,不仅提升了检测精度,还大幅减少了计算量和运行时间,特别是在Raspberry Pi上的部署效率得到了显著提升。

链接: https://arxiv.org/abs/2412.07411
作者: Gayathri Dandugula,Santhosh Boddana,Sudesh Mirashi
关键词-EN: Deploying radar object, radar object detection, resource-constrained edge devices, poses significant challenges, significant challenges due
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Deploying radar object detection models on resource-constrained edge devices like the Raspberry Pi poses significant challenges due to the large model size and the Pi’s limited computational power and memory. In this work, we explore the efficiency of Depthwise Separable Convolutions in radar object detection networks and integrate them into our model. Additionally, we introduce a novel Feature Enhancement and Compression (FEC) module to the PointPillars feature encoder to further improve the model performance. With these innovations, we propose the DSFEC-L model and its two versions, which outperform the baseline (23.9 mAP of Car class, 20.72 GFLOPs) on the nuScenes dataset: 1) an efficient DSFEC-M model with a 14.6% performance improvement and a 60% reduction in GFLOPs, and 2) a deployable DSFEC-S model with a 3.76% performance improvement and a remarkable 78.5% reduction in GFLOPs. Despite marginal performance gains, our deployable model achieves an impressive 74.5% reduction in runtime on the Raspberry Pi compared to the baseline.
zh
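
深度可分离卷积是该工作降低 GFLOPs 的基础组件,其通用写法如下(这是标准实现,并非论文中 FEC 模块的代码):

```python
import torch.nn as nn

class DepthwiseSeparableConv(nn.Module):
    """深度可分离卷积:逐通道卷积 + 1x1 逐点卷积,
    相比标准卷积可大幅降低参数量与 FLOPs(通用写法)。"""
    def __init__(self, in_ch, out_ch, k=3, stride=1):
        super().__init__()
        self.depthwise = nn.Conv2d(in_ch, in_ch, k, stride,
                                   padding=k // 2, groups=in_ch, bias=False)
        self.pointwise = nn.Conv2d(in_ch, out_ch, 1, bias=False)
        self.bn = nn.BatchNorm2d(out_ch)
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.act(self.bn(self.pointwise(self.depthwise(x))))
```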

[CV-56] Explainability of Deep Learning-Based Plant Disease Classifiers Through Automated Concept Identification

【速读】: 该论文试图解决基于深度学习的植物病害检测模型中的可解释性问题。解决方案的关键在于应用了自动化概念解释方法(Automated Concept-based Explanation, ACE),该方法能够自动识别图像数据中的视觉概念,并揭示影响模型预测的关键特征。通过这种方法,研究者不仅能够发现与病害相关的有效模式,还能识别出可能影响模型鲁棒性的背景或光照等偶然偏差。ACE的应用有助于识别相关特征并针对性地改进模型,从而提升植物病害分类模型的透明度和可靠性。

链接: https://arxiv.org/abs/2412.07408
作者: Jihen Amara,Birgitta König-Ries,Sheeba Samuel
关键词-EN: reliable disease detection, significantly advanced automatic, plant disease detection, Automated Concept-based Explanation, advanced automatic plant
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:While deep learning has significantly advanced automatic plant disease detection through image-based classification, improving model explainability remains crucial for reliable disease detection. In this study, we apply the Automated Concept-based Explanation (ACE) method to plant disease classification using the widely adopted InceptionV3 model and the PlantVillage dataset. ACE automatically identifies the visual concepts found in the image data and provides insights about the critical features influencing the model predictions. This approach reveals both effective disease-related patterns and incidental biases, such as those from background or lighting that can compromise model robustness. Through systematic experiments, ACE helped us to identify relevant features and pinpoint areas for targeted model improvement. Our findings demonstrate the potential of ACE to improve the explainability of plant disease classification based on deep learning, which is essential for producing transparent tools for plant disease management in agriculture.
zh

[CV-57] Learning Self-Supervised Audio-Visual Representations for Sound Recommendations

【速读】: 该论文试图解决从无标签视频中学习音频和视觉表示的问题,关键在于提出了一种基于音频和视觉对应关系的自监督学习方法。该方法通过注意力机制(attention mechanism)来学习不同分辨率卷积特征的相对重要性,并利用这些注意力特征来编码音频和视觉输入的对应关系。通过这种方法,模型能够有效提升音频-视觉关联分类的准确性(提高18%)以及视觉场景音效推荐的准确性(提高10%),尤其是在使用跨模态对比学习(cross-modal contrastive learning)进行训练后,推荐性能进一步得到提升。

链接: https://arxiv.org/abs/2412.07406
作者: Sudha Krishnamurthy
关键词-EN: audio and visual, visual input based, self-supervised approach, audio, visual
类目: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM); Sound (cs.SD); Audio and Speech Processing (eess.AS)
备注: Published in the Proceedings of the International Symposium on Visual Computing, 2021 this https URL

点击查看摘要

Abstract:We propose a novel self-supervised approach for learning audio and visual representations from unlabeled videos, based on their correspondence. The approach uses an attention mechanism to learn the relative importance of convolutional features extracted at different resolutions from the audio and visual streams and uses the attention features to encode the audio and visual input based on their correspondence. We evaluated the representations learned by the model to classify audio-visual correlation as well as to recommend sound effects for visual scenes. Our results show that the representations generated by the attention model improve the correlation accuracy compared to the baseline by 18% and the recommendation accuracy by 10% for VGG-Sound, which is a public video dataset. Additionally, audio-visual representations learned by training the attention model with cross-modal contrastive learning further improve the recommendation performance, based on our evaluation using VGG-Sound and a more challenging dataset consisting of gameplay video recordings.
zh

[CV-58] Benchmarking Vision-Based Object Tracking for USVs in Complex Maritime Environments

【速读】: 该论文试图解决无人水面车辆(USVs)在复杂海事环境中进行实时目标跟踪的挑战,特别是由于动态摄像机运动、低能见度和尺度变化带来的困难。解决方案的关键在于提出了一种视觉引导的目标跟踪框架,该框架集成了先进的跟踪算法(如基于Siamese Networks和Transformers的算法)与低级控制系统,以在动态海事环境中实现精确跟踪。通过在模拟和真实海事数据集上评估七种不同的跟踪器,并结合多种控制算法进行测试,验证了该框架在处理动态海事条件下的有效性。其中,基于Transformer的SeqTrack在恶劣条件下表现最佳,而线性二次调节器控制器(LQR)则展示了最稳健和平滑的控制性能,确保了USV的稳定跟踪。

链接: https://arxiv.org/abs/2412.07392
作者: Muhayy Ud Din,Ahsan B. Bakht,Waseem Akram,Yihao Dong,Lakmal Seneviratne,Irfan Hussain
关键词-EN: Vision-based target tracking, unmanned surface vehicles, Vision-based target, surface vehicles, crucial for unmanned
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
备注: submitted to IEEE Access

点击查看摘要

Abstract:Vision-based target tracking is crucial for unmanned surface vehicles (USVs) to perform tasks such as inspection, monitoring, and surveillance. However, real-time tracking in complex maritime environments is challenging due to dynamic camera movement, low visibility, and scale variation. Typically, object detection methods combined with filtering techniques are commonly used for tracking, but they often lack robustness, particularly in the presence of camera motion and missed detections. Although advanced tracking methods have been proposed recently, their application in maritime scenarios is limited. To address this gap, this study proposes a vision-guided object-tracking framework for USVs, integrating state-of-the-art tracking algorithms with low-level control systems to enable precise tracking in dynamic maritime environments. We benchmarked the performance of seven distinct trackers, developed using advanced deep learning techniques such as Siamese Networks and Transformers, by evaluating them on both simulated and real-world maritime datasets. In addition, we evaluated the robustness of various control algorithms in conjunction with these tracking systems. The proposed framework was validated through simulations and real-world sea experiments, demonstrating its effectiveness in handling dynamic maritime conditions. The results show that SeqTrack, a Transformer-based tracker, performed best in adverse conditions, such as dust storms. Among the control algorithms evaluated, the linear quadratic regulator controller (LQR) demonstrated the most robust and smooth control, allowing for stable tracking of the USV.
zh
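
文中表现最稳健的 LQR 控制器只需几行代码:解离散代数 Riccati 方程得到状态反馈增益。以下以一个假设的双积分误差模型作示意:

```python
import numpy as np
from scipy.linalg import solve_discrete_are

def dlqr(A, B, Q, R):
    """离散时间 LQR:解 Riccati 方程得到反馈增益 K,控制律 u = -K x(通用实现)。"""
    P = solve_discrete_are(A, B, Q, R)
    K = np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)
    return K

# 示意:简化的 USV 平面跟踪误差动力学(双积分模型,参数为假设)
dt = 0.1
A = np.array([[1, dt], [0, 1]])
B = np.array([[0.5 * dt**2], [dt]])
K = dlqr(A, B, Q=np.diag([10.0, 1.0]), R=np.array([[0.1]]))

x = np.array([1.0, 0.0])   # [横向跟踪误差, 误差变化率]
u = -K @ x                 # 平滑的纠偏控制量
```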

[CV-59] Post-Training Non-Uniform Quantization for Convolutional Neural Networks

【速读】: 该论文试图解决卷积神经网络 (CNN) 在资源受限设备上部署时面临的计算和存储需求过高的问题。解决方案的关键在于提出了一种新颖的训练后量化方法 (post-training quantization),通过降低模型参数的精度至低比特表示来减少存储需求并加速推理过程。该方法的核心在于找到最优的裁剪阈值和缩放因子,并提供了数学保证以最小化量化噪声,从而在显著减少模型大小和计算需求的同时保持模型精度。

链接: https://arxiv.org/abs/2412.07391
作者: Ahmed Luqman,Khuzemah Qazi,Imdadullah Khan
关键词-EN: resource constrained devices, demands pose considerable, pose considerable challenges, storage demands pose, success of CNN
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Despite the success of CNN models on a variety of image classification and segmentation tasks, their extensive computational and storage demands pose considerable challenges for real-world deployment on resource-constrained devices. Quantization is one technique that aims to alleviate these large storage requirements and speed up the inference process by reducing the precision of model parameters to lower-bit representations. In this paper, we introduce a novel post-training quantization method for model weights. Our method finds optimal clipping thresholds and scaling factors along with mathematical guarantees that our method minimizes quantization noise. Empirical results on real-world datasets demonstrate that our quantization scheme significantly reduces model size and computational requirements while preserving model accuracy.
zh
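
最优裁剪阈值的目标可以理解为“最小化量化噪声(MSE)”。论文给出的是带数学保证的解法,下面用网格搜索对同一目标做近似示意:

```python
import numpy as np

def quantize(w, clip, n_bits=8):
    """对称均匀量化:先裁剪到 [-clip, clip],再缩放取整(示意)。"""
    q_max = 2 ** (n_bits - 1) - 1
    scale = clip / q_max
    q = np.clip(np.round(w / scale), -q_max, q_max)
    return q * scale

def best_clip(w, n_bits=8, grid=100):
    """在网格上搜索使量化 MSE 最小的裁剪阈值;
    论文为解析/有保证的最优解,此处用网格搜索近似。"""
    candidates = np.linspace(0.1, 1.0, grid) * np.abs(w).max()
    errors = [np.mean((w - quantize(w, c, n_bits)) ** 2) for c in candidates]
    return candidates[int(np.argmin(errors))]

w = np.random.randn(4096).astype(np.float32)   # 假想的一层权重
clip = best_clip(w, n_bits=4)
w_q = quantize(w, clip, n_bits=4)
```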

[CV-60] LOGen: Toward Lidar Object Generation by Point Diffusion

【速读】: 该论文试图解决在激光雷达(LiDAR)分割任务中,稀有语义类别实例数量不足的问题。解决方案的关键在于引入一种基于扩散模型(diffusion-based method)的激光雷达物体生成器,通过生成具有反射率(reflectance)的点云数据,并利用条件信息(conditioning information)进行广泛控制,从而增强实例的多样性。实验结果表明,该方法在nuScenes数据集上生成的物体质量较高,并通过新开发的适用于激光雷达物体的3D评估指标进行了验证。

链接: https://arxiv.org/abs/2412.07385
作者: Ellington Kirby,Mickael Chen,Renaud Marlet,Nermin Samet
关键词-EN: rare semantic classes, semantic classes consists, improve lidar segmentation, lidar segmentation results, common strategy
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project web page: this https URL

点击查看摘要

Abstract:A common strategy to improve lidar segmentation results on rare semantic classes consists of pasting objects from one lidar scene into another. While this augments the quantity of instances seen at training time and varies their context, the instances fundamentally remain the same. In this work, we explore how to enhance instance diversity using a lidar object generator. We introduce a novel diffusion-based method to produce lidar point clouds of dataset objects, including reflectance, and with an extensive control of the generation via conditioning information. Our experiments on nuScenes show the quality of our object generations measured with new 3D metrics developed to suit lidar objects.
zh

[CV-61] CADSpotting: Robust Panoptic Symbol Spotting on Large-Scale CAD Drawings

【速读】: 该论文试图解决在大规模建筑CAD图纸中全景符号定位的问题,现有方法在符号多样性、尺度变化和元素重叠方面存在困难。解决方案的关键在于CADSpotting方法,它通过将每个基本元素表示为密集点(而非单一基本点),并结合坐标和颜色等关键属性,构建了一个统一的3D点云模型进行联合语义、实例和全景分割。此外,论文提出了滑动窗口聚合(Sliding Window Aggregation, SWA)技术,结合加权投票和非极大值抑制(Non-Maximum Suppression, NMS),以实现复杂图纸中的精确分割。通过引入大规模CAD数据集LS-CAD(平均覆盖面积为1,000平方米),实验结果表明CADSpotting在符号定位任务中表现优于现有方法,展示了其在实际CAD应用中的鲁棒性和可扩展性。

链接: https://arxiv.org/abs/2412.07377
作者: Jiazuo Mu,Fuyi Yang,Yanshun Zhang,Junxiong Zhang,Yongjian Luo,Lan Xu,Yujiao Shi,Jingyi Yu,Yingliang Zhang
关键词-EN: architectural CAD drawings, large-scale architectural CAD, Sliding Window Aggregation, architectural CAD, CAD
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:We introduce CADSpotting, an efficient method for panoptic symbol spotting in large-scale architectural CAD drawings. Existing approaches struggle with the diversity of symbols, scale variations, and overlapping elements in CAD designs. CADSpotting overcomes these challenges by representing each primitive with dense points instead of a single primitive point, described by essential attributes like coordinates and color. Building upon a unified 3D point cloud model for joint semantic, instance, and panoptic segmentation, CADSpotting learns robust feature representations. To enable accurate segmentation in large, complex drawings, we further propose a novel Sliding Window Aggregation (SWA) technique, combining weighted voting and Non-Maximum Suppression (NMS). Moreover, we introduce a large-scale CAD dataset named LS-CAD to support our experiments. Each floorplan in LS-CAD has an average coverage of 1,000 square meters (versus 100 square meters in existing datasets), providing a valuable benchmark for symbol spotting research. Experimental results on FloorPlanCAD and LS-CAD datasets demonstrate that CADSpotting outperforms existing methods, showcasing its robustness and scalability for real-world CAD applications.
zh

[CV-62] StoryWeaver: A Unified World Model for Knowledge-Enhanced Story Character Customization

【速读】: 该论文试图解决故事可视化中角色身份保持与文本语义对齐之间的平衡问题,主要由于现有方法缺乏对故事场景的详细语义建模。解决方案的关键在于提出了一个名为Character Graph (CG) 的新型知识图谱,全面表示与故事相关的知识,包括角色、角色属性及其关系。基于此,论文引入了StoryWeaver图像生成器,通过Character Graph实现定制化(C-CG),能够在保持角色一致性的同时生成富含文本语义的故事可视化。此外,通过引入知识增强的空间指导(KE-SG),进一步提升了多角色生成的效果,确保角色语义在生成过程中得到精确注入。实验结果表明,StoryWeaver在生成生动的故事情节和准确传达角色身份方面表现优异,并在存储效率上取得了显著提升。

链接: https://arxiv.org/abs/2412.07375
作者: Jinlu Zhang,Jiji Tang,Rongsheng Zhang,Tangjie Lv,Xiaoshuai Sun
关键词-EN: gained increasing attention, artificial intelligence, gained increasing, increasing attention, attention in artificial
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Story visualization has gained increasing attention in artificial intelligence. However, existing methods still struggle with maintaining a balance between character identity preservation and text-semantics alignment, largely due to a lack of detailed semantic modeling of the story scene. To tackle this challenge, we propose a novel knowledge graph, namely Character Graph (CG), which comprehensively represents various story-related knowledge, including the characters, the attributes related to characters, and the relationship between characters. We then introduce StoryWeaver, an image generator that achieves Customization via Character Graph (C-CG), capable of consistent story visualization with rich text semantics. To further improve the multi-character generation performance, we incorporate knowledge-enhanced spatial guidance (KE-SG) into StoryWeaver to precisely inject character semantics into generation. To validate the effectiveness of our proposed method, extensive experiments are conducted using a new benchmark called TBC-Bench. The experiments confirm that our StoryWeaver excels not only in creating vivid visual story plots but also in accurately conveying character identities across various scenarios with considerable storage efficiency, e.g., achieving an average increase of +9.03% DINO-I and +13.44% CLIP-T. Furthermore, ablation experiments are conducted to verify the superiority of the proposed module. Codes and datasets are released at this https URL.
zh

[CV-63] PRM: Photometric Stereo based Large Reconstruction Model

【速读】: 该论文试图解决在复杂图像外观下进行高质量网格重建的问题,特别是如何通过精细的局部细节来提升重建质量。解决方案的关键在于提出了PRM模型,该模型通过使用光度立体图像(photometric stereo images)来提供丰富的光度线索,从而增强局部细节的精确性,并提高模型对输入图像外观变化的鲁棒性。PRM采用实时物理基础渲染(real-time physically-based rendering, PBR)方法和网格光栅化技术进行在线图像渲染,并利用显式网格作为3D表示,以支持可微分的PBR,从而实现多光度监督和更好的镜面颜色建模,最终实现高质量的几何优化。

链接: https://arxiv.org/abs/2412.07371
作者: Wenhang Ge,Jiantao Lin,Guibao Shen,Jiawei Feng,Tao Hu,Xinli Xu,Ying-Cong Chen
关键词-EN: stereo based large, based large reconstruction, fine-grained local details, local details, reconstruct high-quality meshes
类目: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
备注: this https URL

点击查看摘要

Abstract:We propose PRM, a novel photometric stereo based large reconstruction model to reconstruct high-quality meshes with fine-grained local details. Unlike previous large reconstruction models that prepare images under fixed and simple lighting as both input and supervision, PRM renders photometric stereo images by varying materials and lighting, which not only improves the precise local details by providing rich photometric cues but also increases the model robustness to variations in the appearance of input images. To offer enhanced flexibility of image rendering, we incorporate a real-time physically-based rendering (PBR) method and mesh rasterization for online image rendering. Moreover, by employing an explicit mesh as our 3D representation, PRM ensures the application of differentiable PBR, which supports the utilization of multiple photometric supervisions and better models the specular color for high-quality geometry optimization. Our PRM leverages photometric stereo images to achieve high-quality reconstructions with fine-grained local details, even amidst sophisticated image appearances. Extensive experiments demonstrate that PRM significantly outperforms other models.
zh

[CV-64] ITPNet: Towards Instantaneous Trajectory Prediction for Autonomous Driving

【速读】: 该论文试图解决在自动驾驶车辆安全中,由于无法获取足够长的轨迹数据而导致轨迹预测模型失效的问题。解决方案的关键在于提出了一种即时的轨迹预测方法,称为ITPNet。具体而言,ITPNet通过反向预测机制,基于两个观测到的位置点逆向推断出未观测到的历史轨迹的潜在特征表示,并将这些特征作为补充信息用于未来的轨迹预测。此外,为了减少预测特征中的噪声和冗余,论文还设计了一个噪声冗余减少模块,以过滤并整合这些特征,形成紧凑的查询用于未来的轨迹预测。ITPNet能够与现有的轨迹预测模型兼容,使其能够有效处理即时轨迹预测的场景。

链接: https://arxiv.org/abs/2412.07369
作者: Rongqing Li,Changsheng Li,Yuhang Li,Hanjie Li,Yi Chen,Dongchun Ren,Ye Yuan,Guoren Wang
关键词-EN: Trajectory prediction, instantaneous trajectory prediction, sufficiently long-observed trajectory, trajectory prediction models, Trajectory
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Trajectory prediction of agents is crucial for the safety of autonomous vehicles, whereas previous approaches usually rely on sufficiently long-observed trajectory to predict the future trajectory of the agents. However, in real-world scenarios, it is not realistic to collect adequate observed locations for moving agents, leading to the collapse of most prediction models. For instance, when a moving car suddenly appears and is very close to an autonomous vehicle because of the obstruction, it is quite necessary for the autonomous vehicle to quickly and accurately predict the future trajectories of the car with limited observed trajectory locations. In light of this, we focus on investigating the task of instantaneous trajectory prediction, i.e., two observed locations are available during inference. To this end, we propose a general and plug-and-play instantaneous trajectory prediction approach, called ITPNet. Specifically, we propose a backward forecasting mechanism to reversely predict the latent feature representations of unobserved historical trajectories of the agent based on its two observed locations and then leverage them as complementary information for future trajectory prediction. Meanwhile, due to the inevitable existence of noise and redundancy in the predicted latent feature representations, we further devise a Noise Redundancy Reduction Former, aiming to filter out noise and redundancy from unobserved trajectories and integrate the filtered features and observed features into a compact query for future trajectory predictions. In essence, ITPNet can be naturally compatible with existing trajectory prediction models, enabling them to gracefully handle the case of instantaneous trajectory prediction. Extensive experiments on the Argoverse and nuScenes datasets demonstrate that ITPNet outperforms the baselines and confirm its efficacy with different trajectory prediction models.
zh

[CV-65] Efficient 3D Recognition with Event-driven Spike Sparse Convolution AAAI2025

【速读】: 该论文试图解决Spiking Neural Networks (SNNs)在处理点云数据时性能受限和应用场景较少的问题。解决方案的关键在于提出了Spike Voxel Coding (SVC)方案和Spike Sparse Convolution (SSC)模型。SVC通过将3D点云编码为稀疏的脉冲序列空间,减少了存储需求并加速了预处理过程;SSC则用于高效提取3D稀疏点云特征。结合SVC和SSC,论文设计了高效的3D SNN骨干网络(E-3DSNN),该网络对神经形态硬件友好,并在多个3D视觉任务(如分类、检测和分割)上实现了最先进的性能,同时保持了事件驱动的特性。

链接: https://arxiv.org/abs/2412.07360
作者: Xuerui Qiu,Man Yao,Jieyuan Zhang,Yuhong Chou,Ning Qiao,Shibo Zhou,Bo Xu,Guoqi Li
关键词-EN: Spiking Neural Networks, Spiking Neural, Neural Networks, provide an energy-efficient, Spiking
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by AAAI 2025

点击查看摘要

Abstract:Spiking Neural Networks (SNNs) provide an energy-efficient way to extract 3D spatio-temporal features. Point clouds are sparse 3D spatial data, which suggests that SNNs should be well-suited for processing them. However, when applying SNNs to point clouds, they often exhibit limited performance and fewer application scenarios. We attribute this to inappropriate preprocessing and feature extraction methods. To address this issue, we first introduce the Spike Voxel Coding (SVC) scheme, which encodes the 3D point clouds into a sparse spike train space, reducing the storage requirements and saving time on point cloud preprocessing. Then, we propose a Spike Sparse Convolution (SSC) model for efficiently extracting 3D sparse point cloud features. Combining SVC and SSC, we design an efficient 3D SNN backbone (E-3DSNN), which is friendly to neuromorphic hardware. For instance, SSC can be implemented on neuromorphic chips with only minor modifications to the addressing function of vanilla spike convolution. Experiments on ModelNet40, KITTI, and Semantic KITTI datasets demonstrate that E-3DSNN achieves state-of-the-art (SOTA) results with remarkable efficiency. Notably, our E-3DSNN (1.87M) obtained 91.7% top-1 accuracy on ModelNet40, surpassing the current best SNN baselines (14.3M) by 3.0%. To the best of our knowledge, it is the first directly trained 3D SNN backbone that can simultaneously handle various 3D computer vision tasks (e.g., classification, detection, and segmentation) with an event-driven nature. Code is available: this https URL.
zh

[CV-66] Fusion Embedding for Pose-Guided Person Image Synthesis with Diffusion Model

【速读】: 该论文试图解决Pose-Guided Person Image Synthesis (PGPIS)中难以学习输入图像与目标图像之间的语义关系以及模型结构复杂化的问题。解决方案的关键在于提出了一种基于扩散模型的融合嵌入方法(Fusion embedding for PGPIS using a Diffusion Model, FPDM)。该方法分为两个阶段:第一阶段通过训练源图像和目标姿态的融合嵌入,使其与目标图像的嵌入对齐;第二阶段利用该融合嵌入作为条件,生成目标图像。通过在DeepFashion和RWTH-PHOENIX-Weather 2014T数据集上的实验,该方法展示了最先进的性能,并且即使仅使用第二阶段的模型,也能接近其他PGPIS最先进模型的性能。

链接: https://arxiv.org/abs/2412.07333
作者: Donghwna Lee,Kyungha Min,Kirok Kim,Seyoung Jeong,Jiwoo Jeong,Wooju Kim
关键词-EN: Person Image Synthesis, synthesize high-quality person, Pose-Guided Person Image, high-quality person images, Pose-Guided Person
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Pose-Guided Person Image Synthesis (PGPIS) aims to synthesize high-quality person images corresponding to target poses while preserving the appearance of the source image. Recently, PGPIS methods that use diffusion models have achieved competitive performance. Most approaches involve extracting representations of the target pose and source image and learning their relationships in the generative model’s training process. This approach makes it difficult to learn the semantic relationships between the input and target images and complicates the model structure needed to enhance generation results. To address these issues, we propose Fusion embedding for PGPIS using a Diffusion Model (FPDM). Inspired by the successful application of pre-trained CLIP models in text-to-image diffusion models, our method consists of two stages. The first stage involves training the fusion embedding of the source image and target pose to align with the target image’s embedding. In the second stage, the generative model uses this fusion embedding as a condition to generate the target image. We applied the proposed method to the benchmark datasets DeepFashion and RWTH-PHOENIX-Weather 2014T, and conducted both quantitative and qualitative evaluations, demonstrating state-of-the-art (SOTA) performance. An ablation study of the model structure showed that even a model using only the second stage achieved performance close to the other PGPIS SOTA models. The code is available at this https URL.
zh

[CV-67] CoMA: Compositional Human Motion Generation with Multi-modal Agents

【速读】: 该论文试图解决3D人体运动生成中复杂和细节动作难以生成的问题,主要由于现有运动数据集的稀缺性和生成新训练样本的高成本。解决方案的关键是引入CoMA,一个基于代理的复杂人体运动生成、编辑和理解框架。CoMA通过多个协作代理(collaborative agents),利用大型语言和视觉模型,结合基于掩码变换器(mask transformer)的运动生成器,该生成器具有针对身体部位的特定编码器和用于细粒度控制的码本。这一框架能够生成短至长的运动序列,支持文本引导的运动编辑和自校正,从而提升生成质量。

链接: https://arxiv.org/abs/2412.07320
作者: Shanlin Sun,Gabriel De Araujo,Jiaqi Xu,Shenghan Zhou,Hanwen Zhang,Ziheng Huang,Chenyu You,Xiaohui Xie
关键词-EN: recent years, substantial advancement, advancement in recent, human motion generation, motion
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project Page: this https URL

点击查看摘要

Abstract:3D human motion generation has seen substantial advancement in recent years. While state-of-the-art approaches have improved performance significantly, they still struggle with complex and detailed motions unseen in training data, largely due to the scarcity of motion datasets and the prohibitive cost of generating new training examples. To address these challenges, we introduce CoMA, an agent-based solution for complex human motion generation, editing, and comprehension. CoMA leverages multiple collaborative agents powered by large language and vision models, alongside a mask transformer-based motion generator featuring body part-specific encoders and codebooks for fine-grained control. Our framework enables generation of both short and long motion sequences with detailed instructions, text-guided motion editing, and self-correction for improved quality. Evaluations on the HumanML3D dataset demonstrate competitive performance against state-of-the-art methods. Additionally, we create a set of context-rich, compositional, and long text prompts, where user studies show our method significantly outperforms existing approaches.
zh

[CV-68] FaceX: Understanding Face Attribute Classifiers through Summary Model Explanations

【速读】: 该论文试图解决现有可解释人工智能(EXplainable Artificial Intelligence, XAI)方法在面部分析领域中难以评估模型整体行为的局限性。现有方法如像素归因法仅能对单个图像提供解释,导致需要大量人工检查来从个体输出中得出模型行为的总体印象。论文提出的解决方案是引入FaceX,这是一种通过汇总模型解释来全面理解面部属性分类器的方法。FaceX的关键在于利用面部图像中不同区域的特性,计算模型激活的区域级聚合,从而可视化模型在19个预定义面部区域(如头发、耳朵、皮肤等)的区域归因。此外,FaceX还通过可视化对模型决策影响最大的图像块来增强解释性,并在多个基准测试中展示了其有效识别模型偏差的能力。

链接: https://arxiv.org/abs/2412.07313
作者: Ioannis Sarridis,Christos Koutlis,Symeon Papadopoulos,Christos Diou
关键词-EN: EXplainable Artificial Intelligence, Artificial Intelligence, EXplainable Artificial, existing XAI approaches, identifying fairness issues
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:EXplainable Artificial Intelligence (XAI) approaches are widely applied for identifying fairness issues in Artificial Intelligence (AI) systems. However, in the context of facial analysis, existing XAI approaches, such as pixel attribution methods, offer explanations for individual images, posing challenges in assessing the overall behavior of a model, which would require labor-intensive manual inspection of a very large number of instances and leaving to the human the task of drawing a general impression of the model behavior from the individual outputs. Addressing this limitation, we introduce FaceX, the first method that provides a comprehensive understanding of face attribute classifiers through summary model explanations. Specifically, FaceX leverages the presence of distinct regions across all facial images to compute a region-level aggregation of model activations, allowing for the visualization of the model’s region attribution across 19 predefined regions of interest in facial images, such as hair, ears, or skin. Beyond spatial explanations, FaceX enhances interpretability by visualizing specific image patches with the highest impact on the model’s decisions for each facial region within a test benchmark. Through extensive evaluation in various experimental setups, including scenarios with or without intentional biases and mitigation efforts on four benchmarks, namely CelebA, FairFace, CelebAMask-HQ, and Racial Faces in the Wild, FaceX demonstrates high effectiveness in identifying the models’ biases.
zh

[CV-69] Compression of Large-Scale 3D Point Clouds Based on Joint Optimization of Point Sampling and Feature Extraction

【速读】: 该论文试图解决大规模三维点云数据(LS3DPC)在存储和传输过程中由于数据量巨大而导致的存储空间和带宽需求问题。解决方案的关键在于提出了一种端到端的训练框架,该框架将点采样和特征提取联合优化,以同时考虑速率(rate)和失真(distortion)损失。具体来说,论文首先使点采样模块可训练,通过可学习的权重估计最优的下采样点位置;其次,开发了一种可靠的点重建方案,通过自适应聚合扩展的候选点来精炼上采样点的位置。实验结果表明,该方法在SemanticKITTI和nuScenes数据集上显著优于现有的最先进方法。

链接: https://arxiv.org/abs/2412.07302
作者: Jae-Young Yim,Jae-Young Sim
关键词-EN: LiDAR scanners require, scanners require huge, require huge storage, huge storage space, transmission bandwidth due
类目: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
备注: 10 pages, 10 figures, 1 table

点击查看摘要

Abstract:Large-scale 3D point clouds (LS3DPC) obtained by LiDAR scanners require huge storage space and transmission bandwidth due to a large amount of data. The existing methods of LS3DPC compression separately perform rule-based point sampling and learnable feature extraction, and hence achieve limited compression performance. In this paper, we propose a fully end-to-end training framework for LS3DPC compression where the point sampling and the feature extraction are jointly optimized in terms of the rate and distortion losses. To this end, we first make the point sampling module to be trainable such that an optimal position of the downsampled point is estimated via aggregation with learnable weights. We also develop a reliable point reconstruction scheme that adaptively aggregates the expanded candidate points to refine the positions of upsampled points. Experimental results evaluated on the SemanticKITTI and nuScenes datasets show that the proposed method achieves significantly higher compression ratios compared with the existing state-of-the-art methods.
zh

[CV-70] EventSplat: 3D Gaussian Splatting from Moving Event Cameras for Real-time Rendering

【速读】: 该论文试图解决在快速相机运动情况下,利用事件相机数据进行新颖视角合成的问题。解决方案的关键在于通过高斯光栅化(Gaussian Splatting)方法有效利用事件相机的高时间分辨率和高动态范围特性,并结合事件到视频模型(event-to-video model)的先验知识进行优化初始化。此外,使用样条插值(spline interpolation)来获取高质量的相机姿态,从而在保持高重建质量的同时,克服了传统基于事件的神经辐射场(NeRF)方法的计算限制,实现了更快的渲染速度和更高的视觉保真度。

链接: https://arxiv.org/abs/2412.07293
作者: Toshiya Yura,Ashkan Mirzaei,Igor Gilitschenski
关键词-EN: Gaussian Splatting, event camera data, view synthesis, synthesis via Gaussian, Splatting
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:We introduce a method for using event camera data in novel view synthesis via Gaussian Splatting. Event cameras offer exceptional temporal resolution and a high dynamic range. Leveraging these capabilities allows us to effectively address the novel view synthesis challenge in the presence of fast camera motion. For initialization of the optimization process, our approach uses prior knowledge encoded in an event-to-video model. We also use spline interpolation for obtaining high quality poses along the event camera trajectory. This enhances the reconstruction quality from fast-moving cameras while overcoming the computational limitations traditionally associated with event-based Neural Radiance Field (NeRF) methods. Our experimental evaluation demonstrates that our results achieve higher visual fidelity and better performance than existing event-based NeRF approaches while being an order of magnitude faster to render.
zh
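
位姿的样条插值部分可以直接用 SciPy 实现。以下示意对相机平移分量做三次样条插值(时间戳与位姿数值均为假设;旋转分量通常需在 SO(3) 上单独插值,如四元数 SLERP,此处从略):

```python
import numpy as np
from scipy.interpolate import CubicSpline

# 事件相机轨迹上的稀疏关键位姿:时间戳 + 平移 xyz(数值为假设)
t_key = np.array([0.00, 0.05, 0.10, 0.15])
pos_key = np.random.rand(4, 3)

spline = CubicSpline(t_key, pos_key, axis=0)   # 对平移分量做三次样条
t_query = np.linspace(0.0, 0.15, 100)          # 在事件时间戳处查询
pos_interp = spline(t_query)                   # [100, 3] 的平滑相机位置
```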

[CV-71] Image Classification Using Singular Value Decomposition and Optimization

【速读】: 该论文试图解决基于毛色特征对特定品种的猫和狗进行图像分类的问题。解决方案的关键在于使用奇异值分解 (Singular Value Decomposition, SVD) 进行低秩近似,并通过序列二次规划 (Sequential Quadratic Programming, SQP) 构建最优加权模板,以捕捉毛色这一主导特征。尽管该方法在秩为10时达到了69%的准确率,但研究结果表明,仅依赖毛色特征可能不足以实现更高的分类精度,提示在资源受限的环境中需要在简单性和性能之间进行权衡。

链接: https://arxiv.org/abs/2412.07288
作者: Isabela M. Yepes,Manasvi Goyal
关键词-EN: Singular Value Decomposition, primary identifying feature, applicability of Singular, Sequential Quadratic Programming, study investigates
类目: Computer Vision and Pattern Recognition (cs.CV); Numerical Analysis (math.NA)
备注: 10 pages, 7 figures

点击查看摘要

Abstract:This study investigates the applicability of Singular Value Decomposition for the image classification of specific breeds of cats and dogs using fur color as the primary identifying feature. Sequential Quadratic Programming (SQP) is employed to construct optimally weighted templates. The proposed method achieves 69% accuracy using the Frobenius norm at rank 10. The results partially validate the assumption that dominant features, such as fur color, can be effectively captured through low-rank approximations. However, the accuracy suggests that additional features or methods may be required for more robust classification, highlighting the trade-off between simplicity and performance in resource-constrained environments.
zh
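
论文的流程可以压缩为两步:SVD 低秩近似捕捉毛色等主导特征,再与各类最优加权模板比较 Frobenius 距离。以下为示意(模板此处用随机矩阵代替,实际应由 SQP 优化得到):

```python
import numpy as np

def low_rank(img, rank=10):
    """SVD 低秩近似:保留前 rank 个奇异值以捕捉主导特征(示意)。"""
    U, S, Vt = np.linalg.svd(img, full_matrices=False)
    return (U[:, :rank] * S[:rank]) @ Vt[:rank]

def classify(img, templates, rank=10):
    """与各类加权模板比较 Frobenius 距离,取最近者为预测类别。
    templates 为 {类别: 模板矩阵},真实流程中由 SQP 优化权重得到。"""
    x = low_rank(img, rank)
    dists = {c: np.linalg.norm(x - t, ord="fro") for c, t in templates.items()}
    return min(dists, key=dists.get)

templates = {"cat": np.random.rand(64, 64), "dog": np.random.rand(64, 64)}  # 占位模板
print(classify(np.random.rand(64, 64), templates))
```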

[CV-72] Backdoor Attacks against No-Reference Image Quality Assessment Models via A Scalable Trigger AAAI2025

【速读】: 该论文试图解决无参考图像质量评估 (No-Reference Image Quality Assessment, NR-IQA) 模型在面对对抗攻击时的脆弱性问题,特别是现有攻击方法在计算资源需求高、非目标性操作、白盒场景下实用性有限以及黑盒场景下效果减弱等方面的局限性。解决方案的关键在于提出了一种新型的基于投毒的后门攻击 (poisoning-based backdoor attack against NR-IQA, BAIQA),通过在离散余弦变换 (Discrete Cosine Transform, DCT) 域中注入触发器,并利用DCT空间中的通用对抗扰动 (Universal Adversarial Perturbations, UAP) 来增强攻击效果。此外,论文还探索了干净标签的BAIQA (Clean-label BAIQA, C-BAIQA) 设计,通过理论洞察指导的α采样和图像数据优化,进一步提升了攻击的有效性。

链接: https://arxiv.org/abs/2412.07277
作者: Yi Yu,Song Xia,Xun Lin,Wenhan Yang,Shijian Lu,Yap-peng Tan,Alex Kot
关键词-EN: Image Quality Assessment, No-Reference Image Quality, Quality Assessment, computer vision systems, optimizing computer vision
类目: Computer Vision and Pattern Recognition (cs.CV); Cryptography and Security (cs.CR)
备注: Accept by AAAI 2025

点击查看摘要

Abstract:No-Reference Image Quality Assessment (NR-IQA), responsible for assessing the quality of a single input image without using any reference, plays a critical role in evaluating and optimizing computer vision systems, e.g., low-light enhancement. Recent research indicates that NR-IQA models are susceptible to adversarial attacks, which can significantly alter predicted scores with visually imperceptible perturbations. Despite revealing vulnerabilities, these attack methods have limitations, including high computational demands, untargeted manipulation, limited practical utility in white-box scenarios, and reduced effectiveness in black-box scenarios. To address these challenges, we shift our focus to another significant threat and present a novel poisoning-based backdoor attack against NR-IQA (BAIQA), allowing the attacker to manipulate the IQA model’s output to any desired target value by simply adjusting a scaling coefficient \alpha for the trigger. We propose to inject the trigger in the discrete cosine transform (DCT) domain to improve the local invariance of the trigger for countering trigger diminishment in NR-IQA models due to widely adopted data augmentations. Furthermore, the universal adversarial perturbations (UAP) in the DCT space are designed as the trigger, to increase IQA model susceptibility to manipulation and improve attack effectiveness. In addition to the heuristic method for poison-label BAIQA (P-BAIQA), we explore the design of clean-label BAIQA (C-BAIQA), focusing on \alpha sampling and image data refinement, driven by theoretical insights we reveal. Extensive experiments on diverse datasets and various NR-IQA models demonstrate the effectiveness of our attacks. Code will be released at this https URL.
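A hedged sketch of the DCT-domain injection described above follows; the `trigger` array stands in for the paper's UAP-style pattern, which is not reproduced here, and the scaling coefficient `alpha` plays the role described in the abstract.

```python
# Illustrative only: additive trigger in the 2D DCT domain, scaled by alpha.
import numpy as np
from scipy.fft import dctn, idctn

def inject_dct_trigger(img: np.ndarray, trigger: np.ndarray, alpha: float) -> np.ndarray:
    # img: (H, W) grayscale in [0, 1]; trigger: (H, W) DCT-domain pattern.
    coeffs = dctn(img, norm="ortho")          # forward 2D DCT
    coeffs = coeffs + alpha * trigger         # additive trigger, strength alpha
    out = idctn(coeffs, norm="ortho")         # back to the pixel domain
    return np.clip(out, 0.0, 1.0)

rng = np.random.default_rng(0)
img = rng.random((224, 224))
trig = rng.standard_normal((224, 224)) * 0.01  # placeholder, not the real UAP
poisoned = inject_dct_trigger(img, trig, alpha=0.5)
```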

[CV-73] A Generative Victim Model for Segmentation

【速读】: This paper addresses the dependence on task-specific models (e.g., segmentation models) when generating adversarial attacks. The key is to start from an image-generation perspective and derive a novel victim model (VM) that produces adversarial perturbations for segmentation tasks without requiring a model explicitly designed for image segmentation. This departs from conventional white-box and black-box strategies, offering a fresh way to generate adversarial attacks; experiments confirm its effectiveness and good transferability.

链接: https://arxiv.org/abs/2412.07274
作者: Aixuan Li,Jing Zhang,Jiawei Shi,Yiran Zhong,Yuchao Dai
关键词-EN: well-trained victim models, serve as fundamental, fundamental prerequisites, adversarial, victim models
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:We find that the well-trained victim models (VMs), against which the attacks are generated, serve as fundamental prerequisites for adversarial attacks, i.e. a segmentation VM is needed to generate attacks for segmentation. In this context, the victim model is assumed to be robust to achieve effective adversarial perturbation generation. Instead of focusing on improving the robustness of the task-specific victim models, we shift our attention to image generation. From an image generation perspective, we derive a novel VM for segmentation, aiming to generate adversarial perturbations for segmentation tasks without requiring models explicitly designed for image segmentation. Our approach to adversarial attack generation diverges from conventional white-box or black-box attacks, offering a fresh outlook on adversarial attack strategies. Experiments show that our attack method is able to generate effective adversarial attacks with good transferability.

[CV-74] Deep Lidar-guided Image Deblurring

【速读】: This paper studies how motion blur degrades image quality in low-light conditions and whether the depth information provided by mobile LiDAR sensors can improve image deblurring. The key is a universal adapter structure that efficiently preprocesses depth information and modulates image features with depth features, turning existing neural deblurring models into depth-aware ones, together with a continual learning strategy that lets pretrained encoder-decoder models take depth as an additional input with minimal extra data. Results show that true depth information significantly boosts deblurring effectiveness.

链接: https://arxiv.org/abs/2412.07262
作者: Ziyao Yi,Diego Valsesia,Tiziano Bianchi,Enrico Magli
关键词-EN: computational imaging techniques, portable Lidar instruments, including their adoption, opens the door, rise of portable
类目: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
备注:

点击查看摘要

Abstract:The rise of portable Lidar instruments, including their adoption in smartphones, opens the door to novel computational imaging techniques. Being an active sensing instrument, Lidar can provide complementary data to passive optical sensors, particularly in situations like low-light imaging where motion blur can affect photos. In this paper, we study if the depth information provided by mobile Lidar sensors is useful for the task of image deblurring and how to integrate it with a general approach that transforms any state-of-the-art neural deblurring model into a depth-aware one. To achieve this, we developed a universal adapter structure that efficiently preprocesses the depth information to modulate image features with depth features. Additionally, we applied a continual learning strategy to pretrained encoder-decoder models, enabling them to incorporate depth information as an additional input with minimal extra data requirements. We demonstrate that utilizing true depth information can significantly boost the effectiveness of deblurring algorithms, as validated on a dataset with real-world depth data captured by a smartphone Lidar.

[CV-75] DFREC: DeepFake Identity Recovery Based on Identity-aware Masked Autoencoder

【速读】: This paper addresses the lack of intuitive interpretability and identity traceability in existing deepfake detection algorithms with a new DeepFake Identity Recovery scheme (DFREC). The key is recovering both the source and target faces from a forged image via three modules: an Identity Segmentation Module (ISM) that splits the input face into source and target information, a Source Identity Reconstruction Module (SIRM) that reconstructs the source face and extracts latent target identity features, and a Target Identity Reconstruction Module (TIRM) whose Masked Autoencoder fuses background context with the latent target features to reconstruct the target face. On several high-fidelity face-swapping attack datasets, the scheme outperforms existing recovery algorithms and recovers both faces directly from the forgery with high fidelity.

链接: https://arxiv.org/abs/2412.07260
作者: Peipeng Yu,Hui Gao,Zhitao Huang,Zhihua Xia,Chip-Hong Chang
关键词-EN: Identity Reconstruction Module, Recent advances, Identity Segmentation Module, Target Identity, Target Identity Reconstruction
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Recent advances in deepfake forensics have primarily focused on improving the classification accuracy and generalization performance. Despite enormous progress in detection accuracy across a wide variety of forgery algorithms, existing algorithms lack intuitive interpretability and identity traceability to help with forensic investigation. In this paper, we introduce a novel DeepFake Identity Recovery scheme (DFREC) to fill this gap. DFREC aims to recover the pair of source and target faces from a deepfake image to facilitate deepfake identity tracing and reduce the risk of deepfake attacks. It comprises three key components: an Identity Segmentation Module (ISM), a Source Identity Reconstruction Module (SIRM), and a Target Identity Reconstruction Module (TIRM). The ISM segments the input face into distinct source and target face information, and the SIRM reconstructs the source face and extracts latent target identity features with the segmented source information. The background context and latent target identity features are synergistically fused by a Masked Autoencoder in the TIRM to reconstruct the target face. We evaluate DFREC on six different high-fidelity face-swapping attacks on the FaceForensics++, CelebaMegaFS and FFHQ-E4S datasets, which demonstrate its superior recovery performance over state-of-the-art deepfake recovery algorithms. In addition, DFREC is the only scheme that can recover both pristine source and target faces directly from the forgery image with high fidelity.

[CV-76] CapGen:An Environment-Adaptive Generator of Adversarial Patches

【速读】: This paper addresses the fact that existing adversarial patches clash visually with their surroundings and are therefore easily noticed. The key is the Camouflaged Adversarial Pattern Generator (CAPGen), which uses base colors drawn from the environment to produce patches that blend seamlessly into the background while keeping strong adversarial performance. The study further finds that patterns (color-agnostic texture information) matter more than colors for attack effectiveness, motivating a rapid generation strategy: update the colors of a high-performing patch to match a new environment, preserving visual stealth without weakening the attack.

链接: https://arxiv.org/abs/2412.07253
作者: Chaoqun Li,Zhuodong Liu,Huanqian Yan,Hang Su
关键词-EN: perception algorithm robustness, provide physical stealth, physical stealth protection, assess perception algorithm, Adversarial patches
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Adversarial patches, often used to provide physical stealth protection for critical assets and assess perception algorithm robustness, usually neglect the need for visual harmony with the background environment, making them easily noticeable. Moreover, existing methods primarily concentrate on improving attack performance, disregarding the intricate dynamics of adversarial patch elements. In this work, we introduce the Camouflaged Adversarial Pattern Generator (CAPGen), a novel approach that leverages specific base colors from the surrounding environment to produce patches that seamlessly blend with their background for superior visual stealthiness while maintaining robust adversarial performance. We delve into the influence of both patterns (i.e., color-agnostic texture information) and colors on the effectiveness of attacks facilitated by patches, discovering that patterns exert a more pronounced effect on performance than colors. Based on these findings, we propose a rapid generation strategy for adversarial patches. This involves updating the colors of high-performance adversarial patches to align with those of the new environment, ensuring visual stealthiness without compromising adversarial impact. This paper is the first to comprehensively examine the roles played by patterns and colors in the context of adversarial patches.
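The color-update strategy can be illustrated with a small sketch (assumptions ours): keep the patch's pattern as a nearest-color assignment map and snap every pixel to a base color sampled from the new environment, so texture structure is preserved while colors adapt.

```python
# Illustrative recoloring: preserve the pattern, swap in environment colors.
import numpy as np

def recolor_patch(patch: np.ndarray, base_colors: np.ndarray) -> np.ndarray:
    # patch: (H, W, 3) in [0, 1]; base_colors: (K, 3) palette from the scene.
    flat = patch.reshape(-1, 3)                               # (H*W, 3)
    d = np.linalg.norm(flat[:, None, :] - base_colors[None], axis=-1)
    nearest = d.argmin(axis=1)                                # pattern as indices
    return base_colors[nearest].reshape(patch.shape)

rng = np.random.default_rng(0)
patch = rng.random((64, 64, 3))                               # trained patch
palette = np.array([[0.2, 0.3, 0.1], [0.4, 0.5, 0.2], [0.1, 0.1, 0.05]])
camo = recolor_patch(patch, palette)                          # environment-matched
```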

[CV-77] Buster: Incorporating Backdoor Attacks into Text Encoder to Mitigate NSFW Content Generation

【速读】: This paper tackles the generation of Not Safe for Work (NSFW) content by generative AI models, where existing defenses such as model fine-tuning or post-hoc moderation scale poorly, degrade benign image quality, or add inference cost. The key is the Buster framework, which injects backdoor attacks into the text encoder and uses deep semantic information rather than explicit prompts as triggers, redirecting NSFW prompts to targeted benign prompts. The approach is resilient and scalable, needs only about five minutes of text-encoder fine-tuning, and outperforms all baselines in NSFW removal rate while preserving benign image quality.

链接: https://arxiv.org/abs/2412.07249
作者: Xin Zhao,Xiaojun Chen,Yuexin Xuan,Zhendong Zhao
关键词-EN: Safe for Work, digital age, led to significant, significant concerns, NSFW content
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:In the digital age, the proliferation of deep learning models has led to significant concerns about the generation of Not Safe for Work (NSFW) content. Existing defense methods primarily involve model fine-tuning and post-hoc content moderation. However, these approaches often lack scalability in eliminating harmful content, degrade the quality of benign image generation, or incur high inference costs. To tackle these challenges, we propose an innovative framework called Buster, which injects backdoor attacks into the text encoder to prevent NSFW content generation. Specifically, Buster leverages deep semantic information rather than explicit prompts as triggers, redirecting NSFW prompts towards targeted benign prompts. This approach demonstrates exceptional resilience and scalability in mitigating NSFW content. Remarkably, Buster fine-tunes the text encoder of Text-to-Image models within just five minutes, showcasing high efficiency. Our extensive experiments reveal that Buster outperforms all other baselines, achieving a superior NSFW content removal rate while preserving the quality of harmless images.

[CV-78] Driving with InternVL: Oustanding Champion in the Track on Driving with Language of the Autonomous Grand Challenge at CVPR 2024

【速读】: This technical report targets the Driving with Language track of the CVPR 2024 Autonomous Grand Challenge. The key is to use the strong open-source multimodal model InternVL-1.5 with full-parameter fine-tuning on DriveLM-nuScenes, formatting and concatenating the multi-view nuScenes images in a specific way so the model inherits InternVL's excellent multimodal understanding, plus a simple automatic annotation strategy that converts object center points in DriveLM-nuScenes into corresponding bounding boxes. The single model scores 0.6002 on the final leaderboard.

链接: https://arxiv.org/abs/2412.07247
作者: Jiahan Li,Zhiqi Li,Tong Lu
关键词-EN: Autonomous Grand Challenge, technical report describes, Autonomous Grand, Grand Challenge, Driving with Language
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:This technical report describes the methods we employed for the Driving with Language track of the CVPR 2024 Autonomous Grand Challenge. We utilized a powerful open-source multimodal model, InternVL-1.5, and conducted a full-parameter fine-tuning on the competition dataset, DriveLM-nuScenes. To effectively handle the multi-view images of nuScenes and seamlessly inherit InternVL’s outstanding multimodal understanding capabilities, we formatted and concatenated the multi-view images in a specific manner. This ensured that the final model could meet the specific requirements of the competition task while leveraging InternVL’s powerful image understanding capabilities. Meanwhile, we designed a simple automatic annotation strategy that converts the center points of objects in DriveLM-nuScenes into corresponding bounding boxes. As a result, our single model achieved a score of 0.6002 on the final leaderboard.

[CV-79] ArtFormer: Controllable Generation of Diverse 3D Articulated Objects

【速读】: This paper addresses the flexibility-quality trade-off faced by existing methods for generating 3D articulated objects, which typically rely on predefined structures or retrieve shapes from static datasets. The key is a framework that parameterizes an articulated object as a tree of tokens and uses a transformer to generate both the object's high-level geometry codes and its kinematic relations; each sub-part's geometry is then decoded with a signed-distance-function (SDF) shape prior to synthesize high-quality 3D shapes. The method generates diverse objects with high-quality geometry and varying numbers of parts.

链接: https://arxiv.org/abs/2412.07237
作者: Jiayi Su,Youhe Feng,Zheng Li,Jinhua Song,Yangfan He,Botao Ren,Botian Xu
关键词-EN: paper presents, framework for modeling, Abstract, conditional generation, generation
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Robotics (cs.RO)
备注: impl. repo: this https URL

点击查看摘要

Abstract:This paper presents a novel framework for modeling and conditional generation of 3D articulated objects. Troubled by flexibility-quality tradeoffs, existing methods are often limited to using predefined structures or retrieving shapes from static datasets. To address these challenges, we parameterize an articulated object as a tree of tokens and employ a transformer to generate both the object’s high-level geometry code and its kinematic relations. Subsequently, each sub-part’s geometry is further decoded using a signed-distance-function (SDF) shape prior, facilitating the synthesis of high-quality 3D shapes. Our approach enables the generation of diverse objects with high-quality geometry and varying number of parts. Comprehensive experiments on conditional generation from text descriptions demonstrate the effectiveness and flexibility of our method.
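A toy sketch (ours, with entirely hypothetical field names) of the tree-of-tokens idea: a kinematic tree of parts is serialized into a flat token sequence by pre-order traversal, the kind of sequence a transformer could consume.

```python
# Toy serialization of an articulated object's kinematic tree into tokens.
from dataclasses import dataclass, field

@dataclass
class Part:
    geom_code: int                        # id of a latent geometry code
    joint: str = "fixed"                  # kinematic relation to the parent
    children: list["Part"] = field(default_factory=list)

def serialize(node: Part) -> list[tuple]:
    # Pre-order traversal with explicit structure tokens.
    toks = [("part", node.geom_code, node.joint)]
    for child in node.children:
        toks += [("down",)] + serialize(child) + [("up",)]
    return toks

base = Part(0, children=[Part(1, "revolute"), Part(2, "prismatic")])
print(serialize(base))   # flat token sequence encoding parts and joints
```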

[CV-80] Repetitive Action Counting with Hybrid Temporal Relation Modeling

【速读】: This paper addresses the bottleneck of Repetitive Action Counting (RAC) in complex daily videos, where viewpoint changes, non-uniform periods, and action interruptions keep existing temporal self-similarity matrix (TSSM) methods from capturing action periods adequately. The key is the Hybrid Temporal Relation Modeling Network (HTRM-Net), which builds diverse TSSMs with three components: bi-modal (self-attention and dual-softmax) self-similarity modeling that combines row-wise and column-wise correlations, a random matrix dropping module that explicitly guides channel-wise learning of the matrix, and local temporal context modeling that injects frame-level context into temporal correlation modeling for robustness to error-prone cases such as interruptions. A multi-scale matrix fusion module finally aggregates temporal correlations adaptively. Within- and cross-dataset experiments show the method surpasses state-of-the-art approaches and counts repetitive actions accurately even in unseen action categories.

链接: https://arxiv.org/abs/2412.07233
作者: Kun Li,Xinge Peng,Dan Guo,Xun Yang,Meng Wang
关键词-EN: repetitive actions occurring, repetitive actions, Repetitive Action Counting, counting repetitive actions, aims to count
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: To be published in IEEE Transactions on Multimedia

点击查看摘要

Abstract:Repetitive Action Counting (RAC) aims to count the number of repetitive actions occurring in videos. In the real world, repetitive actions have great diversity and bring numerous challenges (e.g., viewpoint changes, non-uniform periods, and action interruptions). Existing methods based on the temporal self-similarity matrix (TSSM) for RAC are trapped in the bottleneck of insufficient capturing action periods when applied to complicated daily videos. To tackle this issue, we propose a novel method named Hybrid Temporal Relation Modeling Network (HTRM-Net) to build diverse TSSM for RAC. The HTRM-Net mainly consists of three key components: bi-modal temporal self-similarity matrix modeling, random matrix dropping, and local temporal context modeling. Specifically, we construct temporal self-similarity matrices by bi-modal (self-attention and dual-softmax) operations, yielding diverse matrix representations from the combination of row-wise and column-wise correlations. To further enhance matrix representations, we propose incorporating a random matrix dropping module to guide channel-wise learning of the matrix explicitly. After that, we inject the local temporal context of video frames and the learned matrix into temporal correlation modeling, which can make the model robust enough to cope with error-prone situations, such as action interruption. Finally, a multi-scale matrix fusion module is designed to aggregate temporal correlations adaptively in multi-scale matrices. Extensive experiments across intra- and cross-datasets demonstrate that the proposed method not only outperforms current state-of-the-art methods but also exhibits robust capabilities in accurately counting repetitive actions in unseen action categories. Notably, our method surpasses the classical TransRAC method by 20.04% in MAE and 22.76% in OBO.
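The bi-modal TSSM construction can be sketched as follows (shapes assumed, simplified to a single channel and with illustrative names): one matrix uses a row-wise softmax in self-attention style, the other a dual-softmax over both rows and columns.

```python
# Simplified single-channel sketch of the two TSSM variants named above.
import torch

def tssm_bimodal(feats: torch.Tensor) -> tuple[torch.Tensor, torch.Tensor]:
    # feats: (T, C) per-frame embeddings.
    sim = feats @ feats.T / feats.shape[-1] ** 0.5        # (T, T) similarities
    attn = torch.softmax(sim, dim=-1)                     # self-attention style
    dual = torch.softmax(sim, dim=-1) * torch.softmax(sim, dim=-2)
    return attn, dual

feats = torch.randn(64, 128)                              # 64 frames, 128-d
attn_m, dual_m = tssm_bimodal(feats)                      # two (64, 64) TSSMs
```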

[CV-81] Deep Non-rigid Structure-from-Motion Revisited: Canonicalization and Sequence Modeling

【速读】: This paper revisits deep Non-Rigid Structure-from-Motion (NRSfM) to overcome the limitations of existing deep methods in handling the problem's inherent sequence property and motion ambiguity. The key lies in two improvements: (1) canonicalization, replacing previous per-dataset schemes with an easy-to-implement per-sequence canonicalization; and (2) sequence modeling that combines temporal information with a subspace constraint. Together these yield a better NRSfM reconstruction pipeline, validated on several commonly used datasets.

链接: https://arxiv.org/abs/2412.07230
作者: Hui Deng,Jiawei Shi,Zhen Qin,Yiran Zhong,Yuchao Dai
关键词-EN: input to estimate, NRSfM, deep NRSfM, vision problem, deep NRSfM methods
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 9 pages main text, 7 pages appendix

点击查看摘要

Abstract:Non-Rigid Structure-from-Motion (NRSfM) is a classic 3D vision problem, where a 2D sequence is taken as input to estimate the corresponding 3D sequence. Recently, deep neural networks have greatly advanced the task of NRSfM. However, existing deep NRSfM methods still have limitations in handling the inherent sequence property and motion ambiguity associated with the NRSfM problem. In this paper, we revisit deep NRSfM from two perspectives to address the limitations of current deep NRSfM methods: (1) canonicalization and (2) sequence modeling. We propose an easy-to-implement per-sequence canonicalization method as opposed to the previous per-dataset canonicalization approaches. With this in mind, we propose a sequence modeling method that combines temporal information and a subspace constraint. As a result, we achieve a more optimal NRSfM reconstruction pipeline compared to previous efforts. The effectiveness of our method is verified by testing the sequence-to-sequence deep NRSfM pipeline with corresponding regularization modules on several commonly used datasets.

[CV-82] Moderating the Generalization of Score-based Generative Model

【速读】: This paper addresses over-generalization in Score-based Generative Models (SGMs), i.e., the risk of generating undesirable content. The key is the first Moderated Score-based Generative Model (MSGM), which introduces a novel score adjustment strategy that steers the score function away from undesirable data during the continuous-time stochastic differential equation process, greatly reducing the chance of generating undesirable content while keeping high visual quality for normal image generation. Although designed for SGMs, MSGM is a general, flexible machine unlearning framework compatible with diverse diffusion architectures and training strategies, and supports zero-shot transfer of pretrained models to downstream tasks.

链接: https://arxiv.org/abs/2412.07229
作者: Wan Jiang,He Wang,Xin Zhang,Dan Guo,Zhaoxin Fan,Yunfeng Diao,Richang Hong
关键词-EN: remarkable generalization abilities, demonstrated remarkable generalization, Score-based Generative Models, Moderated Score-based Generative, Score-based Generative
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Score-based Generative Models (SGMs) have demonstrated remarkable generalization abilities, e.g. generating unseen, but natural data. However, the greater the generalization power, the more likely the unintended generalization, and the more dangerous the abuse. Research on moderated generalization in SGMs remains limited. To fill this gap, we first examine the current ‘gold standard’ in Machine Unlearning (MU), i.e., re-training the model after removing the undesirable training data, and find it does not work in SGMs. Further analysis of score functions reveals that the MU ‘gold standard’ does not alter the original score function, which explains its ineffectiveness. Based on this insight, we propose the first Moderated Score-based Generative Model (MSGM), which introduces a novel score adjustment strategy that redirects the score function away from undesirable data during the continuous-time stochastic differential equation process. Extensive experimental results demonstrate that MSGM significantly reduces the likelihood of generating undesirable content while preserving high visual quality for normal image generation. Albeit designed for SGMs, MSGM is a general and flexible MU framework that is compatible with diverse diffusion architectures (SGM and DDPM) and training strategies (re-training and fine-tuning), and enables zero-shot transfer of the pre-trained models to downstream tasks, e.g. image inpainting and reconstruction. The code will be shared upon acceptance.

[CV-83] Attention Head Purification: A New Perspective to Harness CLIP for Domain Generalization

【速读】: This paper asks how to harness CLIP effectively for Domain Generalization (DG). Existing methods use full fine-tuning or prompt learning to avoid forgetting CLIP's original knowledge, but ignore that this knowledge may itself carry domain-specific cues that cap generalization to unseen domains. The key is a new perspective, attention head purification, applied at two levels: task-level purification designs head-aware LoRA to adapt each head to the task, while domain-level purification selects heads with a simple gating strategy and uses an MMD (Maximum Mean Discrepancy) loss to make masked head features more domain-invariant. Experiments show the method beats previous state-of-the-art results on representative DG benchmarks.

链接: https://arxiv.org/abs/2412.07226
作者: Yingfan Wang,Guoliang Kang
关键词-EN: multiple source domains, unseen target domains, achieve satisfactory performance, CLIP, aims to learn
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Domain Generalization (DG) aims to learn a model from multiple source domains to achieve satisfactory performance on unseen target domains. Recent works introduce CLIP to DG tasks due to its superior image-text alignment and zeros-shot performance. Previous methods either utilize full fine-tuning or prompt-learning paradigms to harness CLIP for DG tasks. Those works focus on avoiding catastrophic forgetting of the original knowledge encoded in CLIP but ignore that the knowledge encoded in CLIP in nature may contain domain-specific cues that constrain its domain generalization performance. In this paper, we propose a new perspective to harness CLIP for DG, i.e., attention head purification. We observe that different attention heads may encode different properties of an image and selecting heads appropriately may yield remarkable performance improvement across domains. Based on such observations, we purify the attention heads of CLIP from two levels, including task-level purification and domain-level purification. For task-level purification, we design head-aware LoRA to make each head more adapted to the task we considered. For domain-level purification, we perform head selection via a simple gating strategy. We utilize MMD loss to encourage masked head features to be more domain-invariant to emphasize more generalizable properties/heads. During training, we jointly perform task-level purification and domain-level purification. We conduct experiments on various representative DG benchmarks. Though simple, extensive experiments demonstrate that our method performs favorably against previous state-of-the-arts.
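A minimal sketch, under our own simplifications, of the domain-level purification idea: a learnable gate softly selects heads, and an RBF-kernel MMD loss pulls the masked head features of two domains together. The gate design and shapes are assumptions, not the paper's implementation.

```python
# Illustrative head gating trained with a biased RBF-kernel MMD^2 estimate.
import torch

def mmd_rbf(x: torch.Tensor, y: torch.Tensor, sigma: float = 1.0) -> torch.Tensor:
    # x: (n, d), y: (m, d).
    def k(a, b):
        return torch.exp(-torch.cdist(a, b) ** 2 / (2 * sigma ** 2))
    return k(x, x).mean() + k(y, y).mean() - 2 * k(x, y).mean()

n_heads, head_dim = 12, 64
gate = torch.nn.Parameter(torch.zeros(n_heads))        # one logit per head
src = torch.randn(32, n_heads, head_dim)               # features, domain A
tgt = torch.randn(32, n_heads, head_dim)               # features, domain B
mask = torch.sigmoid(gate).view(1, n_heads, 1)         # soft head selection
loss = mmd_rbf((src * mask).flatten(1), (tgt * mask).flatten(1))
loss.backward()                                        # gradients reach the gate
```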

[CV-84] EchoIR: Advancing Image Restoration with Echo Upsampling and Bi-Level Optimization

【速读】: This paper addresses restoration performance lost to feature degradation during upsampling in image restoration. The key is the EchoIR network, a UNet-like architecture with a bilaterally learnable upsampling mechanism, the Echo-Upsampler, which learns from the bilateral intermediate features of the U-Net (the "Echo") to minimize degradation during upsampling and deliver finer restoration. An Approximated Sequential Bi-level Optimization (AS-BLO) model further establishes the relationship between upsampling learning and the restoration task. Experiments show EchoIR surpasses state-of-the-art (SOTA) methods on image restoration tasks.

链接: https://arxiv.org/abs/2412.07225
作者: Yuhan He,Yuchun He
关键词-EN: reconstructing high-quality images, Image restoration represents, Image restoration, image restoration tasks, low-level vision
类目: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
备注:

点击查看摘要

Abstract:Image restoration represents a fundamental challenge in low-level vision, focusing on reconstructing high-quality images from their degraded counterparts. With the rapid advancement of deep learning technologies, transformer-based methods with pyramid structures have advanced the field by capturing long-range cross-scale spatial interaction. Despite their popularity, the degradation of essential features during the upsampling process notably compromises restoration performance, resulting in suboptimal reconstruction outcomes. We introduce EchoIR, a UNet-like image restoration network with a bilateral learnable upsampling mechanism to bridge this gap. Specifically, we propose the Echo-Upsampler, which optimizes the upsampling process by learning from the bilateral intermediate features of U-Net, the “Echo”, aiming for a more refined restoration by minimizing the degradation during upsampling. In pursuit of a hierarchical model of image restoration and upsampling tasks, we propose the Approximated Sequential Bi-level Optimization (AS-BLO), an advanced bi-level optimization model establishing a relationship between upsampling learning and image restoration tasks. Extensive experiments against state-of-the-art (SOTA) methods demonstrate the proposed EchoIR surpasses the existing methods, achieving SOTA performance in image restoration tasks.

[CV-85] MPSI: Mamba enhancement model for pixel-wise sequential interaction Image Super-Resolution

【速读】: This paper addresses insufficient modeling of long sequence information in single image super-resolution (SR), especially the limited capture of global pixel interactions. The key is the Mamba pixel-wise sequential interaction network (MPSI), whose Channel-Mamba Block (CMB) models long-sequence information effectively to strengthen global pixel interaction. A Mamba channel recursion module (MCRM) additionally retains valuable feature information from early layers, gathering pixel-sequence interaction information across multiple layers and improving image reconstruction.

链接: https://arxiv.org/abs/2412.07222
作者: Yuchun He,Yuhan He
关键词-EN: long sequence information, Single image super-resolution, modeling long sequence, Single image, computer vision
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Image and Video Processing (eess.IV)
备注:

点击查看摘要

Abstract:Single image super-resolution (SR) has long posed a challenge in the field of computer vision. While the advent of deep learning has led to the emergence of numerous methods aimed at tackling this persistent issue, the current methodologies still encounter challenges in modeling long sequence information, leading to limitations in effectively capturing the global pixel interactions. To tackle this challenge and achieve superior SR outcomes, we propose the Mamba pixel-wise sequential interaction network (MPSI), aimed at enhancing the establishment of long-range connections of information, particularly focusing on pixel-wise sequential interaction. We propose the Channel-Mamba Block (CMB) to capture comprehensive pixel interaction information by effectively modeling long sequence information. Moreover, in the existing SR methodologies, there persists the issue of the neglect of features extracted by preceding layers, leading to the loss of valuable feature information. While certain existing models strive to preserve these features, they frequently encounter difficulty in establishing connections across all layers. To overcome this limitation, MPSI introduces the Mamba channel recursion module (MCRM), which maximizes the retention of valuable feature information from early layers, thereby facilitating the acquisition of pixel sequence interaction information from multiple-level layers. Through extensive experimentation, we demonstrate that MPSI outperforms existing super-resolution methods in terms of image reconstruction results, attaining state-of-the-art performance.

[CV-86] Taylor Outlier Exposure

【速读】: This paper addresses the harm that noisy auxiliary out-of-distribution (OOD) datasets contaminated with in-distribution (ID) samples do to Outlier Exposure (OE) training dynamics and final OOD detection performance. The key is Taylor Outlier Exposure (TaylorOE), a regularized OE variant that expresses the OE regularization term as a polynomial via a Taylor expansion, so the regularization strength on ID data in the auxiliary OOD set can be controlled by the expansion order, enabling effective training on noisy OOD datasets that contain ID samples. Experiments on clean and noisy OOD datasets show consistent gains over conventional methods and confirm the effectiveness of the regularization term.

链接: https://arxiv.org/abs/2412.07219
作者: Kohei Fukuda,Hiroaki Aizawa
关键词-EN: OOD, OOD detection, OOD datasets, noisy OOD datasets, auxiliary OOD
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Out-of-distribution (OOD) detection is the task of identifying data sampled from distributions that were not used during training. This task is essential for reliable machine learning and a better understanding of a model’s generalization capabilities. Among OOD detection methods, Outlier Exposure (OE) significantly enhances OOD detection performance and generalization ability by exposing auxiliary OOD data to the model. However, constructing clean auxiliary OOD datasets, uncontaminated by in-distribution (ID) samples, is essential for OE; generally, a noisy OOD dataset contaminated with ID samples negatively impacts OE training dynamics and final detection performance. Furthermore, as dataset scale increases, constructing clean OOD data becomes increasingly challenging and costly. To address these challenges, we propose Taylor Outlier Exposure (TaylorOE), an OE-based approach with regularization that allows training on noisy OOD datasets contaminated with ID samples. Specifically, we represent the OE regularization term as a polynomial function via a Taylor expansion, allowing us to control the regularization strength for ID data in the auxiliary OOD dataset by adjusting the order of the Taylor expansion. In our experiments on the OOD detection task with clean and noisy OOD datasets, we demonstrate that the proposed method consistently outperforms conventional methods and analyze our regularization term to show its effectiveness. Our implementation code of TaylorOE is available at this https URL.
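One plausible reading of the Taylor trick, sketched below in our own notation: the standard OE term (cross-entropy to the uniform distribution) is approximated by truncating the series for log(1 + (Kp - 1)) at a chosen order, which bounds how strongly confident (likely ID) samples are penalized. This is illustrative, not the paper's exact formula.

```python
# Hedged sketch: truncated-Taylor approximation of the OE uniformity loss.
import math
import torch
import torch.nn.functional as F

def taylor_oe_loss(logits: torch.Tensor, order: int = 3) -> torch.Tensor:
    # logits: (B, K) outputs on auxiliary (possibly ID-contaminated) OOD data.
    K = logits.shape[-1]
    p = F.softmax(logits, dim=-1)
    u = K * p - 1.0                                  # zero when p is uniform
    log_p = torch.full_like(p, -math.log(K))         # log(1/K) constant term
    for i in range(1, order + 1):                    # truncated log(1 + u) series
        log_p = log_p + (-1.0) ** (i + 1) * u ** i / i
    return -log_p.mean()                             # approx. CE to uniform

loss = taylor_oe_loss(torch.randn(16, 10), order=3)
print(loss.item())
```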

[CV-87] Crack-EdgeSAM Self-Prompting Crack Segmentation System for Edge Devices

【速读】: This paper targets the efficiency and accuracy of crack detection in complex environments for structural health monitoring (SHM). The key is the Crack-EdgeSAM system, which pairs YOLOv8 for generating prompt boxes with a fine-tuned EdgeSAM model for crack segmentation. To keep computation low, the EdgeSAM model is fine-tuned with ConvLoRA, a Parameter-Efficient Fine-Tuning (PEFT) technique, together with DiceFocalLoss. Experiments show high segmentation accuracy with a much higher inference speed than recent methods, making the system suitable for real-time use on edge devices.

链接: https://arxiv.org/abs/2412.07205
作者: Yingchu Wang,Ji He,Shijie Yu
关键词-EN: Structural health monitoring, concrete bridge pier, Structural health, health monitoring, infrastructure defects
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Robotics (cs.RO)
备注:

点击查看摘要

Abstract:Structural health monitoring (SHM) is essential for the early detection of infrastructure defects, such as cracks in concrete bridge piers, but it often faces challenges in efficiency and accuracy in complex environments. Although the Segment Anything Model (SAM) achieves excellent segmentation performance, its computational demands limit its suitability for real-time applications on edge devices. To address these challenges, this paper proposes Crack-EdgeSAM, a self-prompting crack segmentation system that integrates YOLOv8 for generating prompt boxes and a fine-tuned EdgeSAM model for crack segmentation. To ensure computational efficiency, the method employs ConvLoRA, a Parameter-Efficient Fine-Tuning (PEFT) technique, along with DiceFocalLoss to fine-tune the EdgeSAM model. Our experimental results on public datasets and on climbing-robot automatic inspections demonstrate that the system achieves high segmentation accuracy and significantly enhanced inference speed compared to the most recent methods. Notably, the system processes 1024 x 1024 pixel images at 46 FPS on our PC and 8 FPS on a Jetson Orin Nano.
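For reference, here is a self-contained sketch of a combined Dice + focal objective of the kind named above; the exact weighting and hyperparameters used alongside ConvLoRA are not specified in the abstract, so the values below are our assumptions.

```python
# Illustrative binary Dice + focal loss for crack masks (weights assumed).
import torch
import torch.nn.functional as F

def dice_focal_loss(logits: torch.Tensor, target: torch.Tensor,
                    gamma: float = 2.0, eps: float = 1e-6) -> torch.Tensor:
    # logits, target: (B, H, W); target holds binary crack masks.
    p = torch.sigmoid(logits)
    inter = (p * target).sum(dim=(1, 2))
    dice = 1 - (2 * inter + eps) / (p.sum(dim=(1, 2)) + target.sum(dim=(1, 2)) + eps)
    bce = F.binary_cross_entropy_with_logits(logits, target, reduction="none")
    p_t = p * target + (1 - p) * (1 - target)        # prob. of the true class
    focal = ((1 - p_t) ** gamma * bce).mean(dim=(1, 2))
    return (dice + focal).mean()                     # equal weighting assumed

loss = dice_focal_loss(torch.randn(2, 256, 256),
                       torch.randint(0, 2, (2, 256, 256)).float())
```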

[CV-88] Learning Spatially Decoupled Color Representations for Facial Image Colorization

【速读】: This paper addresses the unnatural and uneven results of existing colorization methods on facial images, to which humans are especially sensitive. The key is to introduce facial component priors: guided by face parsing maps, the proposed FCNet framework learns a decoupled color representation for each facial component (e.g., lips, skin, eyes, and hair). A chromatic and spatial augmentation strategy aids training with only grayscale-color image pairs, and two alternative modules that predict color representations from the grayscale input or a random seed extend the method to scenarios without reference images. Experiments show it outperforms existing methods in no-, single-, and multi-reference facial image colorization.

链接: https://arxiv.org/abs/2412.07203
作者: Hangyan Zhu,Ming Liu,Chao Zhou,Zifei Yan,Kuanquan Wang,Wangmeng Zuo
关键词-EN: shown prominent performance, facial image colorization, Image colorization, facial image, facial
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Image colorization methods have shown prominent performance on natural images. However, since humans are more sensitive to faces, existing methods are insufficient to meet the demands when applied to facial images, typically showing unnatural and uneven colorization results. In this paper, we investigate the facial image colorization task and find that the problems with facial images can be attributed to an insufficient understanding of facial components. As a remedy, by introducing facial component priors, we present a novel facial image colorization framework dubbed FCNet. Specifically, we learn a decoupled color representation for each face component (e.g., lips, skin, eyes, and hair) under the guidance of face parsing maps. A chromatic and spatial augmentation strategy is presented to facilitate the learning procedure, which requires only grayscale and color facial image pairs. After training, the presented FCNet can be naturally applied to facial image colorization with single or multiple reference images. To expand the application paradigms to scenarios with no reference images, we further train two alternative modules, which predict the color representations from the grayscale input or a random seed, respectively. Extensive experiments show that our method can perform favorably against existing methods in various application scenarios (i.e., no-, single-, and multi-reference facial image colorization). The source code and pre-trained models will be publicly available.

[CV-89] A Parametric Approach to Adversarial Augmentation for Cross-Domain Iris Presentation Attack Detection WACV

【速读】: This paper addresses the poor cross-domain generalization of presentation attack detection (PAD) in iris-based biometric systems. The key is to generate adversarial samples from the transformation parameters of classical data augmentation schemes (e.g., translation, rotation): a convolutional autoencoder, ADV-GEN, takes original training samples together with a set of geometric and photometric transformations, with the transformation parameters acting as regularization variables that guide adversarial sample generation within a constrained search space, thereby improving cross-domain PAD performance.

链接: https://arxiv.org/abs/2412.07199
作者: Debasmita Pal,Redwan Sony,Arun Ross
关键词-EN: Iris-based biometric systems, printed iris images, textured contact lenses, present physical artifacts, adversaries present physical
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), 2025

点击查看摘要

Abstract:Iris-based biometric systems are vulnerable to presentation attacks (PAs), where adversaries present physical artifacts (e.g., printed iris images, textured contact lenses) to defeat the system. This has led to the development of various presentation attack detection (PAD) algorithms, which typically perform well in intra-domain settings. However, they often struggle to generalize effectively in cross-domain scenarios, where training and testing employ different sensors, PA instruments, and datasets. In this work, we use adversarial training samples of both bonafide irides and PAs to improve the cross-domain performance of a PAD classifier. The novelty of our approach lies in leveraging transformation parameters from classical data augmentation schemes (e.g., translation, rotation) to generate adversarial samples. We achieve this through a convolutional autoencoder, ADV-GEN, that inputs original training samples along with a set of geometric and photometric transformations. The transformation parameters act as regularization variables, guiding ADV-GEN to generate adversarial samples in a constrained search space. Experiments conducted on the LivDet-Iris 2017 database, comprising four datasets, and the LivDet-Iris 2020 dataset, demonstrate the efficacy of our proposed method. The code is available at this https URL.

[CV-90] Fine-grained Text to Image Synthesis

【速读】: This paper addresses fine-grained text-to-image synthesis, where images of different subclasses are highly similar and texts describing the same image can differ linguistically, making fine detail hard to generate. The key is adding an auxiliary classifier to the discriminator together with a contrastive learning method: the classifier helps the discriminator categorize images and guides the generator toward more accurate fine-grained images, while contrastive learning minimizes similarity between images from different subclasses and maximizes it within the same subclass, improving the fine-grained detail of images synthesized by RAT GAN.

链接: https://arxiv.org/abs/2412.07196
作者: Xu Ouyang,Ying Chen,Kaiyue Zhu,Gady Agam
关键词-EN: synthesis involves generating, involves generating images, image synthesis involves, Generative Adversarial Networks, Recurrent Affine Transformation
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Fine-grained text to image synthesis involves generating images from texts that belong to different categories. In contrast to general text to image synthesis, in fine-grained synthesis there is high similarity between images of different subclasses, and there may be linguistic discrepancy among texts describing the same image. Recent Generative Adversarial Networks (GAN), such as the Recurrent Affine Transformation (RAT) GAN model, are able to synthesize clear and realistic images from texts. However, GAN models ignore fine-grained level information. In this paper we propose an approach that incorporates an auxiliary classifier in the discriminator and a contrastive learning method to improve the accuracy of fine-grained details in images synthesized by RAT GAN. The auxiliary classifier helps the discriminator classify the class of images and helps the generator synthesize more accurate fine-grained images. The contrastive learning method minimizes the similarity between images from different subclasses and maximizes the similarity between images from the same subclass. We evaluate against several state-of-the-art methods on the commonly used CUB-200-2011 bird dataset and the Oxford-102 flower dataset, and demonstrate superior performance.

[CV-91] A Progressive Image Restoration Network for High-order Degradation Imaging in Remote Sensing

【速读】: This paper addresses the inadequacy of existing remote sensing image restoration methods for high-order degradation models, and the limited architectural transparency and interpretability of deep learning approaches. The key is HDI-PRNet, a progressive restoration network built on high-order degradation imaging theory that restores different degradations step by step with mathematical interpretability inside an unfolding network. It comprises three main modules: a denoising module based on proximal mapping prior learning, a deblurring module integrating a Neumann series expansion with dual-domain degradation learning, and a super-resolution module. Experiments show superior performance on both synthetic and real remote sensing images.

链接: https://arxiv.org/abs/2412.07195
作者: Yujie Feng,Yin Yang,Xiaohong Fan,Zhengpeng Zhang,Lijing Bu,Jianping Zhang
关键词-EN: gained remarkable achievements, remote sensing images, remote sensing, gained remarkable, remarkable achievements
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Image and Video Processing (eess.IV)
备注: 14 pages

点击查看摘要

Abstract:Recently, deep learning methods have gained remarkable achievements in the field of image restoration for remote sensing (RS). However, most existing RS image restoration methods focus mainly on conventional first-order degradation models, which may not effectively capture the imaging mechanisms of remote sensing images. Furthermore, many RS image restoration approaches that use deep learning are often criticized for their lack of architectural transparency and model interpretability. To address these problems, we propose a novel progressive restoration network for high-order degradation imaging (HDI-PRNet) to progressively restore different image degradation. HDI-PRNet is developed based on the theoretical framework of degradation imaging, offering the benefit of mathematical interpretability within the unfolding network. The framework is composed of three main components: a module for image denoising that relies on proximal mapping prior learning, a module for image deblurring that integrates Neumann series expansion with dual-domain degradation learning, and a module for super-resolution. Extensive experiments demonstrate that our method achieves superior performance on both synthetic and real remote sensing images.

[CV-92] A Step towards Automated and Generalizable Tactile Map Generation using Generative Adversarial Networks

【速读】: This paper works toward automating the production of tactile maps, a navigation aid for people with visual impairments. The key is to apply computer vision, specifically generative adversarial networks (GAN), trained to identify key map elements, remove extraneous ones, and perform inpainting. The authors build a first-of-its-kind tactile map dataset spanning 6,500 locations with varied tactile line- and area-like features; models trained on a single zoom level achieve median F1 and intersection-over-union (IoU) scores above 0.97 across all features and generalize well to unseen map scales and world regions.

链接: https://arxiv.org/abs/2412.07191
作者: David G Hobson,Majid Komeili
关键词-EN: visual impairments affect, visual impairments, Blindness and visual, people worldwide, impairments affect
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Blindness and visual impairments affect many people worldwide. For help with navigation, people with visual impairments often rely on tactile maps that utilize raised surfaces and edges to convey information through touch. Although these maps are helpful, they are often not widely available and current tools to automate their production have similar limitations including only working at certain scales, for particular world regions, or adhering to specific tactile map standards. To address these shortcomings, we train a proof-of-concept model as a first step towards applying computer vision techniques to help automate the generation of tactile maps. We create a first-of-its-kind tactile maps dataset of street-views from Google Maps spanning 6500 locations and including different tactile line- and area-like features. Generative adversarial network (GAN) models trained on a single zoom successfully identify key map elements, remove extraneous ones, and perform inpainting with median F1 and intersection-over-union (IoU) scores of better than 0.97 across all features. Models trained on two zooms experience only minor drops in performance, and generalize well both to unseen map scales and world regions. Finally, we discuss future directions towards a full implementation of a tactile map solution that builds on our results.

[CV-93] Exploring What Why and How: A Multifaceted Benchmark for Causation Understanding of Video Anomaly

【速读】: This paper targets three core questions of video anomaly understanding (VAU): "what anomaly occurred?", "why did it happen?", and "how severe is it?". To answer them, it introduces ECVA (Exploring the Causation of Video Anomalies), a comprehensive benchmark whose detailed human annotations describe each anomaly's type, cause, and effect. The key is a prompt-based baseline combining "hard prompts", which focus the model on the critical parts of anomalous video segments, with "soft prompts", which establish the temporal and spatial relationships within those segments. A dedicated metric, AnomEval, is also proposed to evaluate video large language models on anomaly causation more comprehensively and reliably, in line with human judgment.

链接: https://arxiv.org/abs/2412.07183
作者: Hang Du,Guoshun Nan,Jiawen Qian,Wangchenhui Wu,Wendi Deng,Hanqing Mu,Zhenyan Chen,Pengxuan Mao,Xiaofeng Tao,Jun Liu
关键词-EN: Recent advancements, video anomaly understanding, industrial automation, opened the door, door to groundbreaking
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Submitted to IEEE Transactions on Pattern Analysis and Machine Intelligence. arXiv admin note: substantial text overlap with arXiv:2405.00181

点击查看摘要

Abstract:Recent advancements in video anomaly understanding (VAU) have opened the door to groundbreaking applications in various fields, such as traffic monitoring and industrial automation. However, current benchmarks in VAU predominantly emphasize the detection and localization of anomalies. Here, we endeavor to delve deeper into the practical aspects of VAU by addressing the essential questions: “what anomaly occurred?”, “why did it happen?”, and “how severe is this abnormal event?”. In pursuit of these answers, we introduce a comprehensive benchmark for Exploring the Causation of Video Anomalies (ECVA). Our benchmark is meticulously designed, with each video accompanied by detailed human annotations. Specifically, each instance of our ECVA involves three sets of human annotations to indicate “what”, “why” and “how” of an anomaly, including 1) anomaly type, start and end times, and event descriptions, 2) natural language explanations for the cause of an anomaly, and 3) free text reflecting the effect of the abnormality. Building upon this foundation, we propose a novel prompt-based methodology that serves as a baseline for tackling the intricate challenges posed by ECVA. We utilize “hard prompts” to guide the model to focus on the critical parts related to video anomaly segments, and “soft prompts” to establish temporal and spatial relationships within these anomaly segments. Furthermore, we propose AnomEval, a specialized evaluation metric crafted to align closely with human judgment criteria for ECVA. This metric leverages the unique features of the ECVA dataset to provide a more comprehensive and reliable assessment of various video large language models. We demonstrate the efficacy of our approach through rigorous experimental analysis and delineate possible avenues for further investigation into the comprehension of video anomaly causation.

[CV-94] An Enhancement of CNN Algorithm for Rice Leaf Disease Image Classification in Mobile Applications

【速读】: This study improves rice leaf disease image classification, which has traditionally relied on convolutional neural networks (CNN). The key is transfer learning with MobileViTV2_050, a lightweight model that fuses CNN-style local feature extraction with the global context learning of Vision Transformers through a separable self-attention mechanism. The enhanced models lift accuracy substantially (MobileViTV2_050-A to 93.14%, MobileViTV2_050-B to 99.6%) while cutting model parameters by up to 92.50%, improving computational efficiency and resource use and making the approach practical for mobile deployment in precision agriculture.

链接: https://arxiv.org/abs/2412.07182
作者: Kayne Uriel K. Rodrigo,Jerriane Hillary Heart S. Marcial,Samuel C. Brillo,Khatalyn E. Mata,Jonathan C. Morano
关键词-EN: Convolutional Neural Network, Neural Network, Convolutional Neural, relied on Convolutional, image classification algorithms
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Presented at 46th World Conference on Applied Science, Engineering Technology (WCASET) from Institute for Educational Research and Publication (IFERP)

点击查看摘要

Abstract:This study focuses on enhancing rice leaf disease image classification algorithms, which have traditionally relied on Convolutional Neural Network (CNN) models. We employed transfer learning with MobileViTV2_050 using ImageNet-1k weights, a lightweight model that integrates CNN’s local feature extraction with Vision Transformers’ global context learning through a separable self-attention mechanism. Our approach resulted in a significant 15.66% improvement in classification accuracy for MobileViTV2_050-A, our first enhanced model trained on the baseline dataset, achieving 93.14%. Furthermore, MobileViTV2_050-B, our second enhanced model trained on a broader rice leaf dataset, demonstrated a 22.12% improvement, reaching 99.6% test accuracy. Additionally, MobileViTV2-A attained an F1-score of 93% across four rice labels and a Receiver Operating Characteristic (ROC) curve ranging from 87% to 97%. In terms of resource consumption, our enhanced models reduced the total parameters of the baseline CNN model by up to 92.50%, from 14 million to 1.1 million. These results indicate that MobileViTV2_050 not only improves computational efficiency through its separable self-attention mechanism but also enhances global context learning. Consequently, it offers a lightweight and robust solution suitable for mobile deployment, advancing the interpretability and practicality of models in precision agriculture.

[CV-95] Rate-In: Information-Driven Adaptive Dropout Rates for Improved Inference-Time Uncertainty Estimation

【速读】: This paper addresses the suboptimal uncertainty estimates that static dropout rates produce in risk-sensitive applications such as medical diagnosis. The key is Rate-In, an algorithm that dynamically adjusts dropout rates during inference by quantifying the information loss dropout induces in each layer's feature maps. Treating dropout as controlled noise injection and drawing on information-theoretic principles, Rate-In adapts rates per layer and per input instance without ground-truth labels, yielding better-calibrated and sharper uncertainty estimates without sacrificing predictive performance.

链接: https://arxiv.org/abs/2412.07169
作者: Tal Zeevi,Ravid Shwartz-Ziv,Yann LeCun,Lawrence H. Staib,John A. Onofrey
关键词-EN: Accurate uncertainty estimation, deploying neural networks, Accurate uncertainty, dropout rates, Monte Carlo
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (stat.ML)
备注:

点击查看摘要

Abstract:Accurate uncertainty estimation is crucial for deploying neural networks in risk-sensitive applications such as medical diagnosis. Monte Carlo Dropout is a widely used technique for approximating predictive uncertainty by performing stochastic forward passes with dropout during inference. However, using static dropout rates across all layers and inputs can lead to suboptimal uncertainty estimates, as it fails to adapt to the varying characteristics of individual inputs and network layers. Existing approaches optimize dropout rates during training using labeled data, resulting in fixed inference-time parameters that cannot adjust to new data distributions, compromising uncertainty estimates in Monte Carlo simulations. In this paper, we propose Rate-In, an algorithm that dynamically adjusts dropout rates during inference by quantifying the information loss induced by dropout in each layer’s feature maps. By treating dropout as controlled noise injection and leveraging information-theoretic principles, Rate-In adapts dropout rates per layer and per input instance without requiring ground truth labels. By quantifying the functional information loss in feature maps, we adaptively tune dropout rates to maintain perceptual quality across diverse medical imaging tasks and architectural configurations. Our extensive empirical study on synthetic data and real-world medical imaging tasks demonstrates that Rate-In improves calibration and sharpens uncertainty estimates compared to fixed or heuristic dropout rates without compromising predictive performance. Rate-In offers a practical, unsupervised, inference-time approach to optimizing dropout for more reliable predictive uncertainty estimation in critical applications.
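An illustrative simplification (ours, not the paper's algorithm) of inference-time rate selection: for one layer's feature map, pick the largest dropout rate whose induced distortion, measured here by a normalized-MSE proxy for information loss, stays within a per-layer budget; the chosen rate would then be used for the Monte Carlo passes.

```python
# Hedged sketch: choose a per-layer dropout rate under a distortion budget.
import torch
import torch.nn.functional as F

def select_rate(feat: torch.Tensor, budget: float,
                rates=(0.05, 0.1, 0.2, 0.3, 0.5), n_draws: int = 8) -> float:
    best = rates[0]
    for r in rates:
        drops = torch.stack([F.dropout(feat, p=r, training=True)
                             for _ in range(n_draws)])
        # Normalized MSE as a stand-in for the induced information loss.
        loss = ((drops - feat) ** 2).mean() / feat.var().clamp_min(1e-8)
        if loss.item() <= budget:
            best = r                              # highest admissible rate so far
    return best

feat = torch.randn(1, 64, 32, 32)                 # one layer's activation map
rate = select_rate(feat, budget=0.5)              # reused for MC dropout passes
```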

[CV-96] 3A-YOLO: New Real-Time Object Detectors with Triple Discriminative Awareness and Coordinated Representations

【速读】: This paper addresses the failure of existing real-time detectors (e.g., the YOLO series) to deploy hierarchical attention mechanisms in a unified way when building a more discriminative detection head. The key is a new head module, the TDA-YOLO Module, which hierarchically enhances scale-awareness, spatial-awareness, and task-awareness in representation learning; intermediate features are further steered to jointly learn inter-channel relationships and precise positional information, and neck improvements plus assorted tricks boost the adaptability of the resulting 3A-YOLO detectors. Experiments on the COCO and VOC benchmarks confirm their effectiveness.

链接: https://arxiv.org/abs/2412.07168
作者: Xuecheng Wu,Junxiao Xue,Liangyu Fu,Jiayu Nie,Danlei Huang,Xinyi Yin
关键词-EN: elevating model performance, Recent research, real-time object detectors, attention mechanisms, model performance
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Recent research on real-time object detectors (e.g., the YOLO series) has demonstrated the effectiveness of attention mechanisms for elevating model performance. Nevertheless, existing methods neglect to unifiedly deploy hierarchical attention mechanisms to construct a more discriminative YOLO head enriched with more useful intermediate features. To tackle this gap, this work leverages multiple attention mechanisms to hierarchically enhance the triple discriminative awareness of the YOLO detection head and to complementarily learn the coordinated intermediate representations, resulting in a new series of detectors denoted 3A-YOLO. Specifically, we first propose a new head, denoted the TDA-YOLO Module, which unifiedly enhances the representation learning of scale-awareness, spatial-awareness, and task-awareness. Secondly, we steer the intermediate features to coordinately learn the inter-channel relationships and precise positional information. Finally, we perform neck network improvements followed by introducing various tricks to boost the adaptability of 3A-YOLO. Extensive experiments across the COCO and VOC benchmarks indicate the effectiveness of our detectors.

[CV-97] Fast Occupancy Network

【速读】: This paper addresses the heavy computational demands of existing occupancy networks for autonomous driving. The key is a simple, fast occupancy network that lifts bird's-eye-view (BEV) features to 3D voxel features with a deformable 2D convolutional layer and adds an efficient voxel feature pyramid network (FPN) module that raises performance at little computational cost, plus a 2D segmentation branch in the perspective view that is free at inference time and further improves accuracy. Experiments show better accuracy and inference speed than existing methods, and the approach easily converts existing BEV models into occupancy network models.

链接: https://arxiv.org/abs/2412.07163
作者: Mingjie Lu,Yuanxian Huang,Ji Liu,Xingliang Huang,Dong Li,Jinzhang Peng,Lu Tian,Emad Barsoum
关键词-EN: Occupancy Network, Network, Occupancy Network predicts, Occupancy Network model, Occupancy
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 10 pages, 5 figures,

点击查看摘要

Abstract:Occupancy Networks have recently attracted much attention in autonomous driving. Instead of monocular 3D detection or recent bird’s eye view (BEV) models that predict 3D bounding boxes of obstacles, an Occupancy Network predicts the category of each voxel in a specified 3D space around the ego vehicle, transforming the 3D detection task into a 3D voxel segmentation task. This offers clear advantages in handling category-outlier obstacles and providing fine-grained 3D representations. However, existing methods usually require far more computational resources than previous approaches, which hinders the adoption of Occupancy Network solutions in intelligent driving systems. To address this problem, we analyze the bottleneck of Occupancy Network inference cost and present a simple and fast Occupancy Network model, which adopts a deformable 2D convolutional layer to lift BEV features to 3D voxel features and an efficient voxel feature pyramid network (FPN) module that improves performance at little computational cost. Further, we present a cost-free 2D segmentation branch in the perspective view, applied after the feature extractors during the inference phase, to improve accuracy. Experimental results demonstrate that our method consistently outperforms existing methods in both accuracy and inference speed, surpassing the recent state-of-the-art (SOTA) OCCNet by 1.7% with a ResNet50 backbone and about 3X inference speedup. Furthermore, our method can be easily applied to existing BEV models to transform them into Occupancy Network models.
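A rough sketch (ours, with assumed channel sizes) of the lifting step: a deformable 2D convolution emits Z x C channels from the BEV map, which are then reshaped into a (C, Z, H, W) voxel volume.

```python
# Illustrative BEV-to-voxel lifting via torchvision's deformable convolution.
import torch
import torch.nn as nn
from torchvision.ops import DeformConv2d

class BEVToVoxel(nn.Module):
    def __init__(self, c_in: int = 128, c_out: int = 32, z: int = 16, k: int = 3):
        super().__init__()
        self.z, self.c_out = z, c_out
        # Offsets for one deformable group: 2 * k * k channels.
        self.offset = nn.Conv2d(c_in, 2 * k * k, k, padding=k // 2)
        self.lift = DeformConv2d(c_in, z * c_out, k, padding=k // 2)

    def forward(self, bev: torch.Tensor) -> torch.Tensor:
        # bev: (B, C_in, H, W) -> voxels: (B, C_out, Z, H, W)
        out = self.lift(bev, self.offset(bev))
        b, _, h, w = out.shape
        return out.view(b, self.c_out, self.z, h, w)

vox = BEVToVoxel()(torch.randn(1, 128, 200, 200))   # (1, 32, 16, 200, 200)
```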

[CV-98] Compositional Zero-Shot Learning with Contextualized Cues and Adaptive Contrastive Training

【速读】: This paper addresses the inherent limitations of CLIP-based Compositional Zero-Shot Learning (CZSL) methods in understanding and linking attributes and objects. The key is the Understanding and Linking Attributes and Objects (ULAO) framework with two modules: Understanding Attributes and Objects (UAO) improves primitive understanding via sequential primitive prediction, using recognized objects as contextual hints for attribute classification; Linking Attributes and Objects (LAO) strengthens the attribute-object linkage through a new contrastive learning strategy with tailored hard-negative generation and adaptive loss adjustment.

链接: https://arxiv.org/abs/2412.07161
作者: Yun Li,Zhe Liu,Lina Yao
关键词-EN: Compositional Zero-Shot Learning, recognize unseen combinations, Compositional Zero-Shot, attributes and objects, Linking Attributes
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Compositional Zero-Shot Learning (CZSL) aims to recognize unseen combinations of seen attributes and objects. Current CLIP-based methods in CZSL, despite their advancements, often fail to effectively understand and link the attributes and objects due to inherent limitations in CLIP’s pretraining mechanisms. To address these shortcomings, this paper introduces a novel framework, Understanding and Linking Attributes and Objects (ULAO) in CZSL, which comprises two innovative modules. The Understanding Attributes and Objects (UAO) module improves primitive understanding by sequential primitive prediction and leveraging recognized objects as contextual hints for attribute classification. Concurrently, the Linking Attributes and Objects (LAO) module improves the attribute-object linkage understanding through a new contrastive learning strategy that incorporates tailored hard negative generation and adaptive loss adjustments. We demonstrate our model’s superiority by showcasing its state-of-the-art performance across three benchmark datasets in both Closed-World (CW) and Open-World (OW) scenarios.

[CV-99] Motion-aware Contrastive Learning for Temporal Panoptic Scene Graph Generation AAAI2025

【速读】: This paper addresses existing video and 4D panoptic scene graph methods' failure to exploit the motion between entities: they encode masks tracked over time (mask tubes) and predict relations with temporal pooling, which under-uses motion patterns. The key is a contrastive representation learning framework focused on motion that encourages close representations for mask tubes of similar subject-relation-object triplets, pushes mask tubes apart from their temporally shuffled versions, and separates mask tubes of different triplets within the same video. Experiments show this motion-aware contrastive framework significantly improves state-of-the-art methods on both video and 4D datasets.

链接: https://arxiv.org/abs/2412.07160
作者: Thong Thanh Nguyen,Xiaobao Wu,Yi Bin,Cong-Duy T Nguyen,See-Kiong Ng,Anh Tuan Luu
关键词-EN: abstracts visual data, equip artificial intelligence, generation abstracts visual, panoptic scene graph, graph generation abstracts
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted at AAAI 2025

点击查看摘要

Abstract:To equip artificial intelligence with a comprehensive understanding towards a temporal world, video and 4D panoptic scene graph generation abstracts visual data into nodes to represent entities and edges to capture temporal relations. Existing methods encode entity masks tracked across temporal dimensions (mask tubes), then predict their relations with temporal pooling operation, which does not fully utilize the motion indicative of the entities’ relation. To overcome this limitation, we introduce a contrastive representation learning framework that focuses on motion pattern for temporal scene graph generation. Firstly, our framework encourages the model to learn close representations for mask tubes of similar subject-relation-object triplets. Secondly, we seek to push apart mask tubes from their temporally shuffled versions. Moreover, we also learn distant representations for mask tubes belonging to the same video but different triplets. Extensive experiments show that our motion-aware contrastive framework significantly improves state-of-the-art methods on both video and 4D datasets.
zh
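
To make the three contrastive terms concrete, here is a minimal InfoNCE-style sketch over mask-tube embeddings. The function name, the way negatives are assembled, and the temperature are assumptions; the paper's exact objective may differ.

```python
import torch
import torch.nn.functional as F

def motion_contrastive_loss(anchor, positive, shuffled, other, tau=0.07):
    """InfoNCE-style loss over mask-tube embeddings (illustrative).

    anchor / positive: (B, D) embeddings of mask tubes sharing a
        subject-relation-object triplet (pulled together).
    shuffled: (B, D) embeddings of the anchor tubes with their frames
        temporally shuffled (pushed apart).
    other: (B, D) embeddings of tubes from the same video but different
        triplets (also pushed apart).
    """
    a = F.normalize(anchor, dim=-1)
    pos = (a * F.normalize(positive, dim=-1)).sum(-1, keepdim=True) / tau
    neg1 = (a * F.normalize(shuffled, dim=-1)).sum(-1, keepdim=True) / tau
    neg2 = (a * F.normalize(other, dim=-1)).sum(-1, keepdim=True) / tau
    logits = torch.cat([pos, neg1, neg2], dim=1)    # (B, 3)
    labels = torch.zeros(a.size(0), dtype=torch.long)  # positive sits at index 0
    return F.cross_entropy(logits, labels)

loss = motion_contrastive_loss(*[torch.randn(8, 128) for _ in range(4)])
```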

[CV-100] Multi-Scale Contrastive Learning for Video Temporal Grounding AAAI2025

【速读】: 该论文试图解决视频时间定位(temporal grounding)中由于特征金字塔(feature pyramid)结构导致的长期视频片段信息丢失问题。解决方案的关键在于提出了一种对比学习框架,通过从视频编码器的多个阶段提取特征空间中的样本,实现对视频片段显著语义的捕捉。该方法无需数据增强或在线记忆库,通过引入采样过程获取与同一查询相关的多个视频片段,并利用这些片段在不同编码器层的表示,实现多尺度与跨尺度的对比学习,从而有效连接局部短时视频片段与全局长时视频片段。

链接: https://arxiv.org/abs/2412.07157
作者: Thong Thanh Nguyen,Yi Bin,Xiaobao Wu,Zhiyuan Hu,Cong-Duy T Nguyen,See-Kiong Ng,Anh Tuan Luu
关键词-EN: video moments, natural language query, Temporal grounding, video, localizes video moments
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted at AAAI 2025

点击查看摘要

Abstract:Temporal grounding, which localizes video moments related to a natural language query, is a core problem of vision-language learning and video understanding. To encode video moments of varying lengths, recent methods employ a multi-level structure known as a feature pyramid. In this structure, lower levels concentrate on short-range video moments, while higher levels address long-range moments. Because higher levels experience downsampling to accommodate increasing moment length, their capacity to capture information is reduced, which consequently degrades the information in moment representations. To resolve this problem, we propose a contrastive learning framework to capture salient semantics among video moments. Our key methodology is to leverage samples from the feature space emanating from multiple stages of the video encoder itself, requiring neither data augmentation nor online memory banks to obtain positive and negative samples. To enable such an extension, we introduce a sampling process to draw multiple video moments corresponding to a common query. Subsequently, by utilizing these moments' representations across video encoder layers, we instantiate a novel form of multi-scale and cross-scale contrastive learning that links local short-range video moments with global long-range video moments. Extensive experiments demonstrate the effectiveness of our framework for not only long-form but also short-form video grounding.
zh
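
A minimal sketch of the cross-scale term: moment embeddings of the same query taken from a lower (short-range) and a higher (long-range) pyramid stage are treated as positive pairs, with other moments in the batch acting as negatives. The names and the symmetric-loss choice are assumptions, not the paper's code.

```python
import torch
import torch.nn.functional as F

def cross_scale_contrastive(low_feats, high_feats, tau=0.1):
    """Cross-scale InfoNCE between pyramid levels (illustrative sketch).

    low_feats:  (N, D) moment embeddings from a lower, short-range stage.
    high_feats: (N, D) embeddings of the same moments from a higher,
                downsampled long-range stage.
    Matching rows are positives; all other rows are negatives, tying
    local short-range moments to their global long-range context.
    """
    low = F.normalize(low_feats, dim=-1)
    high = F.normalize(high_feats, dim=-1)
    logits = low @ high.t() / tau            # (N, N) similarity matrix
    labels = torch.arange(low.size(0))
    return 0.5 * (F.cross_entropy(logits, labels) +
                  F.cross_entropy(logits.t(), labels))

loss = cross_scale_contrastive(torch.randn(16, 256), torch.randn(16, 256))
```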

[CV-101] Annotation Techniques for Judo Combat Phase Classification from Tournament Footage

【速读】: 该论文试图解决从柔道比赛直播视频中自动提取和分析战斗阶段的问题,旨在实现对直播柔道比赛的自动标注和总结。解决方案的关键在于采用半监督学习方法,通过迁移学习从微调的目标检测器中构建战斗阶段模型,以分类比赛的存在、活动状态和站立状态。该方法有效应对了领域内标注数据有限的挑战,并在19个30秒的柔道视频片段上进行了评估,取得了F1分数分别为0.66、0.78和0.87的初步成果。

链接: https://arxiv.org/abs/2412.07155
作者: Anthony Miyaguchi,Jed Moutahir,Tanmay Sutar
关键词-EN: live-streamed footage, analyzing combat phases, paper presents, extracting and analyzing, tournaments using live-streamed
类目: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
备注:

点击查看摘要

Abstract:This paper presents a semi-supervised approach to extracting and analyzing combat phases in judo tournaments using live-streamed footage. The objective is to automate the annotation and summarization of live-streamed judo matches. We train models that extract relevant entities and classify combat phases from fixed-perspective judo recordings. We employ semi-supervised methods to address limited labeled data in the domain. We build a model of combat phases via transfer learning from a fine-tuned object detector to classify the presence, activity, and standing state of the match. We evaluate our approach on a dataset of 19 thirty-second judo clips, achieving F1 scores of 0.66, 0.78, and 0.87 for the three classes, respectively, on a 20% test hold-out. Our results show initial promise for automating more complex information retrieval tasks using rigorous methods with limited labeled data.
zh

[CV-102] Hero-SR: One-Step Diffusion for Super-Resolution with Human Perception Priors

【速读】: 该论文试图解决在真实世界超分辨率(Real-SR)任务中,由于严重退化和输入复杂性导致的语义一致性和感知自然性难以满足人类感知需求的问题。解决方案的关键在于提出了一种名为Hero-SR的单步扩散模型框架,该框架通过引入两个创新模块来实现:1) 动态时间步长模块(Dynamic Time-Step Module, DTSM),用于自适应选择最优扩散步长以灵活满足人类感知标准;2) 开放世界多模态监督(Open-World Multi-modality Supervision, OWMS),通过结合图像和文本域的CLIP指导,提升语义一致性和感知自然性。这些模块共同作用,使得Hero-SR能够生成既保留细节又符合人类感知偏好的高分辨率图像,并在Real-SR任务中实现了最先进的性能。

链接: https://arxiv.org/abs/2412.07152
作者: Jiangang Wang,Qingnan Fan,Qi Zhang,Haigen Liu,Yuhang Yu,Jinwei Chen,Wenqi Ren
关键词-EN: addressing real-world super-resolution, recent approaches, real-world super-resolution, approaches have shown, shown promise
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 16 pages, 9 figures

点击查看摘要

Abstract:Owing to the robust priors of diffusion models, recent approaches have shown promise in addressing real-world super-resolution (Real-SR). However, achieving semantic consistency and perceptual naturalness to meet human perception demands remains difficult, especially under conditions of heavy degradation and varied input complexities. To tackle this, we propose Hero-SR, a one-step diffusion-based SR framework explicitly designed with human perception priors. Hero-SR consists of two novel modules: the Dynamic Time-Step Module (DTSM), which adaptively selects optimal diffusion steps for flexibly meeting human perceptual standards, and the Open-World Multi-modality Supervision (OWMS), which integrates guidance from both image and text domains through CLIP to improve semantic consistency and perceptual naturalness. Through these modules, Hero-SR generates high-resolution images that not only preserve intricate details but also reflect human perceptual preferences. Extensive experiments validate that Hero-SR achieves state-of-the-art performance in Real-SR. The code will be publicly available upon paper acceptance.
zh

[CV-103] RAP-SR: RestorAtion Prior Enhancement in Diffusion Models for Realistic Image Super-Resolution

【速读】: 该论文试图解决现有扩散模型在真实世界图像超分辨率(Real-SR)任务中,由于通用预训练模型未针对恢复任务优化,导致恢复先验(restoration prior)不足,以及手动定义的提示(prompt)无法充分利用生成潜力的问题。解决方案的关键在于提出了RAP-SR,一种在预训练扩散模型中增强恢复先验的新方法。具体来说,论文首先构建了高保真美学图像数据集(HFAID),并通过质量驱动的美学图像选择管道(QDAISP)进行筛选,以提升数据集的保真度和美学质量。其次,提出了恢复先验增强框架,包括恢复先验精炼(RPR)和恢复导向提示优化(ROPO)模块,分别用于精炼恢复先验和优化恢复标识符,从而提高生成图像的质量。RAP-SR通过增强恢复先验,有效弥合了通用模型与Real-SR需求之间的差距,并可无缝集成到现有的扩散模型中,提升其性能。

链接: https://arxiv.org/abs/2412.07149
作者: Jiangang Wang,Qingnan Fan,Jinwei Chen,Hong Gu,Feng Huang,Wenqi Ren
关键词-EN: powerful generative capabilities, garnered significant attention, real-world image super-resolution, pretrained diffusion models, generative capabilities
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 15 pages, 12 figures

点击查看摘要

Abstract:Benefiting from their powerful generative capabilities, pretrained diffusion models have garnered significant attention for real-world image super-resolution (Real-SR). Existing diffusion-based SR approaches typically utilize semantic information from degraded images and restoration prompts to activate prior for producing realistic high-resolution images. However, general-purpose pretrained diffusion models, not designed for restoration tasks, often have suboptimal prior, and manually defined prompts may fail to fully exploit the generated potential. To address these limitations, we introduce RAP-SR, a novel restoration prior enhancement approach in pretrained diffusion models for Real-SR. First, we develop the High-Fidelity Aesthetic Image Dataset (HFAID), curated through a Quality-Driven Aesthetic Image Selection Pipeline (QDAISP). Our dataset not only surpasses existing ones in fidelity but also excels in aesthetic quality. Second, we propose the Restoration Priors Enhancement Framework, which includes Restoration Priors Refinement (RPR) and Restoration-Oriented Prompt Optimization (ROPO) modules. RPR refines the restoration prior using the HFAID, while ROPO optimizes the unique restoration identifier, improving the quality of the resulting images. RAP-SR effectively bridges the gap between general-purpose models and the demands of Real-SR by enhancing restoration prior. Leveraging the plug-and-play nature of RAP-SR, our approach can be seamlessly integrated into existing diffusion-based SR methods, boosting their performance. Extensive experiments demonstrate its broad applicability and state-of-the-art results. Codes and datasets will be available upon acceptance.
zh

[CV-104] MIT-10M: A Large Scale Parallel Corpus of Multilingual Image Translation COLING2025

【速读】: 该论文试图解决现有图像翻译(Image Translation, IT)数据集在规模、多样性和质量上的局限性,这些局限性阻碍了IT模型的发展和评估。解决方案的关键在于引入了MIT-10M,这是一个大规模的多语言图像翻译平行语料库,包含超过1000万张真实世界数据中的图像-文本对,经过广泛的数据清洗和多语言翻译验证。MIT-10M包含84万张图像,分为三种尺寸、28个类别、三个难度级别的任务以及14种语言的图像-文本对,显著优于现有数据集。实验结果表明,MIT-10M在评估模型应对复杂现实世界图像翻译任务的性能方面具有更高的适应性,并且使用该数据集微调的模型性能比基线模型提高了三倍,进一步证明了其优越性。

链接: https://arxiv.org/abs/2412.07147
作者: Bo Li,Shaolin Zhu,Lijie Wen
关键词-EN: holds immense potential, holds immense, diverse domains, immense potential, potential across diverse
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Accepted in COLING 2025

点击查看摘要

Abstract:Image Translation (IT) holds immense potential across diverse domains, enabling the translation of textual content within images into various languages. However, existing datasets often suffer from limitations in scale, diversity, and quality, hindering the development and evaluation of IT models. To address this issue, we introduce MIT-10M, a large-scale parallel corpus of multilingual image translation with over 10M image-text pairs derived from real-world data, which has undergone extensive data cleaning and multilingual translation validation. It contains 840K images in three sizes, 28 categories, tasks at three difficulty levels, and image-text pairs in 14 languages, which is a considerable improvement over existing datasets. We conduct extensive experiments to evaluate and train models on MIT-10M. The experimental results clearly indicate that our dataset has higher adaptability when it comes to evaluating the performance of the models in tackling challenging and complex image translation tasks in the real world. Moreover, the performance of the model fine-tuned with MIT-10M has tripled compared to the baseline model, further confirming its superiority.
zh

[CV-105] Integrating MedCLIP and Cross-Modal Fusion for Automatic Radiology Report Generation

【速读】: 该论文试图解决放射学报告生成自动化的问题,旨在减轻放射科医生的工作负担并提高报告的准确性、一致性和效率。解决方案的关键在于提出了一种新颖的跨模态框架,使用MedCLIP作为视觉提取器和检索机制,通过注意力机制提取检索报告特征和图像特征,并通过融合模块将这些特征整合,从而提升生成报告的连贯性和临床相关性。实验结果表明,该方法在广泛使用的IU-Xray数据集上显著优于常见方法,且消融研究进一步验证了准确报告检索和特征集成在生成全面医疗报告中的重要性。

链接: https://arxiv.org/abs/2412.07141
作者: Qianhao Han,Junyi Liu,Zengchang Qin,Zheng Zheng
关键词-EN: Automating radiology report generation, MedCLIP, cross-modal framework, attention-based extract module, IU-Xray dataset
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted in IEEE Big Data 2024

点击查看摘要

Abstract:Automating radiology report generation can significantly reduce the workload of radiologists and enhance the accuracy, consistency, and efficiency of clinical workflows. We propose a novel cross-modal framework that uses MedCLIP as both a vision extractor and a retrieval mechanism to improve the process of medical report generation. By extracting retrieved report features and image features through an attention-based extract module, and integrating them with a fusion module, our method improves the coherence and clinical relevance of generated reports. Experimental results on the widely used IU-Xray dataset demonstrate the effectiveness of our approach, showing improvements over commonly used methods in both report quality and relevance. Moreover, ablation studies provide further validation of the framework, highlighting the importance of accurate report retrieval and feature integration in generating comprehensive medical reports.
zh
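
The "extract then fuse" step can be sketched as a cross-attention block in which image tokens attend over retrieved-report tokens, followed by a small fusion MLP. Module names, dimensions, and the concatenation-based fusion are assumptions for illustration only.

```python
import torch
import torch.nn as nn

class RetrievalFusion(nn.Module):
    """Fuse retrieved-report features into image features via cross-attention.

    Hypothetical module; sketches only the 'extract then fuse' idea.
    """
    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.fuse = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU(), nn.Linear(dim, dim))

    def forward(self, img_tokens, report_tokens):
        # Image tokens query the retrieved report tokens.
        extracted, _ = self.attn(img_tokens, report_tokens, report_tokens)
        return self.fuse(torch.cat([img_tokens, extracted], dim=-1))

img = torch.randn(2, 49, 512)         # patch features from the vision extractor
rep = torch.randn(2, 60, 512)         # token features of retrieved reports
out = RetrievalFusion(512)(img, rep)  # (2, 49, 512) fused features for the decoder
```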

[CV-106] FIRE: Robust Detection of Diffusion-Generated Images via Frequency-Guided Reconstruction Error

【速读】: 该论文试图解决扩散模型生成的高质量图像难以与真实图像区分的问题。解决方案的关键在于提出了一种名为频率引导重构误差 (Frequency-guided Reconstruction Error, FIRE) 的新方法,该方法首次研究了频率分解对重构误差的影响。FIRE 通过评估频率分解前后重构误差的变化,提供了一种鲁棒的方法来识别扩散模型生成的图像。实验结果表明,FIRE 对未见过的扩散模型具有良好的泛化能力,并能抵御多种扰动。

链接: https://arxiv.org/abs/2412.07140
作者: Beilin Chu,Xuan Xu,Xin Wang,Yufei Zhang,Weike You,Linna Zhou
关键词-EN: significantly improved high-quality, content increasingly challenging, making generated content, high-quality image generation, generated content increasingly
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 14 pages, 14 figures

点击查看摘要

Abstract:The rapid advancement of diffusion models has significantly improved high-quality image generation, making generated content increasingly challenging to distinguish from real images and raising concerns about potential misuse. In this paper, we observe that diffusion models struggle to accurately reconstruct mid-band frequency information in real images, suggesting the limitation could serve as a cue for detecting diffusion model generated images. Motivated by this observation, we propose a novel method called Frequency-guided Reconstruction Error (FIRE), which, to the best of our knowledge, is the first to investigate the influence of frequency decomposition on reconstruction error. FIRE assesses the variation in reconstruction error before and after the frequency decomposition, offering a robust method for identifying diffusion model generated images. Extensive experiments show that FIRE generalizes effectively to unseen diffusion models and maintains robustness against diverse perturbations.
zh
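
A rough sketch of the frequency-guided idea: compare the reconstruction error restricted to a mid-band annulus of the spectrum against the full-image error. The band limits, the ratio-based score, and the function names are assumptions; obtaining `reconstruction` by inverting a pretrained diffusion model is not shown.

```python
import torch

def midband_mask(h, w, lo=0.1, hi=0.5):
    """Boolean mask selecting mid-band frequencies of a centered spectrum."""
    yy, xx = torch.meshgrid(torch.linspace(-1, 1, h),
                            torch.linspace(-1, 1, w), indexing="ij")
    r = torch.sqrt(xx ** 2 + yy ** 2)
    return (r >= lo) & (r <= hi)

def fire_score(image, reconstruction, lo=0.1, hi=0.5):
    """Variation of the reconstruction error after frequency decomposition.

    image / reconstruction: (C, H, W) tensors. A large mid-band error
    relative to the full error hints that the diffusion model failed to
    reconstruct mid-band content, i.e. the input may be real.
    """
    err_full = (image - reconstruction).abs().mean()
    mask = midband_mask(image.shape[-2], image.shape[-1])
    spec_i = torch.fft.fftshift(torch.fft.fft2(image), dim=(-2, -1))
    spec_r = torch.fft.fftshift(torch.fft.fft2(reconstruction), dim=(-2, -1))
    err_mid = (spec_i - spec_r).abs()[..., mask].mean()
    return err_mid / (err_full + 1e-8)

img = torch.rand(3, 256, 256)
print(fire_score(img, img + 0.05 * torch.randn_like(img)))
```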

[CV-107] A multimodal ensemble approach for clear cell renal cell carcinoma treatment outcome prediction

【速读】: 该论文旨在解决透明细胞肾细胞癌(ccRCC)患者的可靠预后模型问题,以提升个性化治疗效果。解决方案的关键在于开发了一种多模态集成模型(MMEM),该模型整合了术前临床数据、多组学数据(mRNA、miRNA、DNA甲基化)和组织病理学全切片图像(WSI)数据,用于预测患者的总生存期(OS)和无病生存期(DFS)。通过分别构建基于临床和多组学数据的Cox比例风险模型(CPH)以及基于WSI特征的深度学习CPH模型,并将各模型的风险评分根据训练性能进行加权组合,最终实现了优于单一模态模型的预后能力。该MMEM模型在C-index和AUROC指标上均表现出色,为ccRCC患者的管理提供了潜在的辅助工具。

链接: https://arxiv.org/abs/2412.07136
作者: Meixu Chen,Kai Wang,Payal Kapur,James Brugarolas,Raquibul Hannan,Jing Wang
关键词-EN: Clear Cell Carcinoma, renal cell carcinoma, enhance personalized treatment, cell renal cell, clear cell renal
类目: Computer Vision and Pattern Recognition (cs.CV); Quantitative Methods (q-bio.QM)
备注: 10 pages, 3 figures, 4 tables

点击查看摘要

Abstract:Purpose: A reliable cancer prognosis model for clear cell renal cell carcinoma (ccRCC) can enhance personalized treatment. We developed a multi-modal ensemble model (MMEM) that integrates pretreatment clinical data, multi-omics data, and histopathology whole slide image (WSI) data to predict overall survival (OS) and disease-free survival (DFS) for ccRCC patients. Methods: We analyzed 226 patients from The Cancer Genome Atlas Kidney Renal Clear Cell Carcinoma (TCGA-KIRC) dataset, which includes OS, DFS follow-up data, and five data modalities: clinical data, WSIs, and three multi-omics datasets (mRNA, miRNA, and DNA methylation). Separate survival models were built for OS and DFS. Cox-proportional hazards (CPH) model with forward feature selection is used for clinical and multi-omics data. Features from WSIs were extracted using ResNet and three general-purpose foundation models. A deep learning-based CPH model predicted survival using encoded WSI features. Risk scores from all models were combined based on training performance. Results: Performance was assessed using concordance index (C-index) and AUROC. The clinical feature-based CPH model received the highest weight for both OS and DFS tasks. Among WSI-based models, the general-purpose foundation model (UNI) achieved the best performance. The final MMEM model surpassed single-modality models, achieving C-indices of 0.820 (OS) and 0.833 (DFS), and AUROC values of 0.831 (3-year patient death) and 0.862 (cancer recurrence). Using predicted risk medians to stratify high- and low-risk groups, log-rank tests showed improved performance in both OS and DFS compared to single-modality models. Conclusion: MMEM is the first multi-modal model for ccRCC patients, integrating five data modalities. It outperformed single-modality models in prognostic ability and has the potential to assist in ccRCC patient management if independently validated.
zh

[CV-108] Revisiting Lesion Tracking in 3D Total Body Photography

【速读】: 该论文试图解决黑色素瘤(melanoma)早期检测中的关键问题,即在3D全身摄影中对皮肤病变进行纵向跟踪和检测新病变。解决方案的关键在于提出了一种框架,该框架能够处理一对3D纹理网格,匹配全身摄影中的病变,并识别不可匹配的病变。具体步骤包括计算对应映射图,将源和目标网格映射到模板网格,构建流场以对齐映射信号,通过向量场进行前向和后向平流以精炼对应映射图,最后使用精炼后的对应映射图进行病变分配。此外,论文还提出了首个大规模皮肤病变跟踪数据集,包含198个受试者的25,000个病变对,显著提升了病变匹配的成功率和准确性。

链接: https://arxiv.org/abs/2412.07132
作者: Wei-Lun Huang,Minghao Xue,Zhiyou Liu,Davood Tashayyod,Jun Kang,Amir Gandjbakhche,Misha Kazhdan,Mehran Armand
关键词-EN: deadly form, total body photography, lesions, lesion, skin cancer
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Melanoma is the most deadly form of skin cancer. Tracking the evolution of nevi and detecting new lesions across the body is essential for the early detection of melanoma. Despite prior work on longitudinal tracking of skin lesions in 3D total body photography, there are still several challenges, including 1) low accuracy for finding correct lesion pairs across scans, 2) sensitivity to noisy lesion detection, and 3) lack of large-scale datasets with numerous annotated lesion pairs. We propose a framework that takes in a pair of 3D textured meshes, matches lesions in the context of total body photography, and identifies unmatchable lesions. We start by computing correspondence maps bringing the source and target meshes to a template mesh. Using these maps to define source/target signals over the template domain, we construct a flow field aligning the mapped signals. The initial correspondence maps are then refined by advecting forward/backward along the vector field. Finally, lesion assignment is performed using the refined correspondence maps. We propose the first large-scale dataset for skin lesion tracking with 25K lesion pairs across 198 subjects. The proposed method achieves a success rate of 89.9% (at 10 mm criterion) for all pairs of annotated lesions and a matching accuracy of 98.2% for subjects with more than 200 lesions.
zh

[CV-109] StyleMark: A Robust Watermarking Method for Art Style Images Against Black-Box Arbitrary Style Transfer

【速读】: 该论文试图解决在任意风格迁移 (Arbitrary Style Transfer, AST) 过程中,未经授权的艺术风格图像可能侵犯艺术家版权的问题。解决方案的关键在于提出了一种名为 StyleMark 的鲁棒水印方法,这是首个针对黑盒 AST 的鲁棒水印技术。其核心创新包括:通过多尺度水印嵌入调整风格特征的均值激活,将水印痕迹植入风格图像的共享特征空间;设计分布压缩损失 (distribution squeeze loss) 以约束内容统计特征的失真,确保重建网络专注于整合带有水印的风格特征;并通过解码器在随机噪声下的微调,缓解鲁棒性与水印不可见性之间的优化冲突。实验结果表明,StyleMark 在黑盒 AST 和常见像素级失真下表现出显著的鲁棒性,同时能有效防御恶意适应性攻击。

链接: https://arxiv.org/abs/2412.07129
作者: Yunming Zhang,Dengpan Ye,Sipeng Shen,Jun Wang
关键词-EN: Arbitrary Style Transfer, arbitrary art style, promoting art communication, art style images, real natural images
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Arbitrary Style Transfer (AST) achieves the rendering of real natural images into the painting styles of arbitrary art style images, promoting art communication. However, misuse of unauthorized art style images for AST may infringe on artists' copyrights. One countermeasure is robust watermarking, which tracks image propagation by embedding copyright watermarks into carriers. Unfortunately, AST-generated images lose the structural and semantic information of the original style image, hindering end-to-end robust tracking by watermarks. To fill this gap, we propose StyleMark, the first robust watermarking method for black-box AST, which can be seamlessly applied to art style images, achieving precise attribution of artistic styles after AST. Specifically, we propose a new style watermark network that adjusts the mean activations of style features through multi-scale watermark embedding, thereby planting watermark traces into the shared style feature space of style images. Furthermore, we design a distribution squeeze loss, which constrains the distortion of content statistical features, forcing the reconstruction network to focus on integrating style features with watermarks, thus optimizing the intrinsic watermark distribution. Finally, based on solid end-to-end training, StyleMark mitigates the optimization conflict between robustness and watermark invisibility through decoder fine-tuning under random noise. Experimental results demonstrate that StyleMark exhibits significant robustness against black-box AST and common pixel-level distortions, while also securely defending against malicious adaptive attacks.
zh

[CV-110] DiffCLIP: Few-shot Language-driven Multimodal Classifier

【速读】: 该论文试图解决视觉语言模型(如CLIP)在遥感领域应用时,由于缺乏足够的图像-文本对进行训练而导致的性能受限问题。解决方案的关键是提出了DiffCLIP框架,该框架通过少样本学习方法,利用未标注图像进行预训练,并采用无监督的掩码扩散学习来捕捉多模态数据的分布,而无需标签。此外,DiffCLIP通过模态共享的图像编码器将多模态数据映射到统一的子空间,提取跨模态共享特征,并通过与CLIP的类标签文本信息对齐来增强视觉表示的学习。最终,DiffCLIP在仅使用少量图像-文本对的情况下,显著提升了CLIP在遥感数据集上的分类性能,实现了10.65%的整体准确率提升。

链接: https://arxiv.org/abs/2412.07119
作者: Jiaqing Zhang,Mingxiang Cao,Xue Yang,Kai Jiang,Yunsong Li
关键词-EN: Contrastive Language-Image Pretraining, Contrastive Language-Image, analyzing natural images, shown impressive performance, Visual language models
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Visual language models like Contrastive Language-Image Pretraining (CLIP) have shown impressive performance in analyzing natural images with language information. However, these models often encounter challenges when applied to specialized domains such as remote sensing due to the limited availability of image-text pairs for training. To tackle this issue, we introduce DiffCLIP, a novel framework that extends CLIP to effectively convey comprehensive language-driven semantic information for accurate classification of high-dimensional multimodal remote sensing images. DiffCLIP is a few-shot learning method that leverages unlabeled images for pretraining. It employs unsupervised mask diffusion learning to capture the distribution of diverse modalities without requiring labels. The modality-shared image encoder maps multimodal data into a unified subspace, extracting shared features with consistent parameters across modalities. A well-trained image encoder further enhances learning by aligning visual representations with class-label text information from CLIP. By integrating these approaches, DiffCLIP significantly boosts CLIP performance using a minimal number of image-text pairs. We evaluate DiffCLIP on widely used high-dimensional multimodal datasets, demonstrating its effectiveness in addressing few-shot annotated classification tasks. DiffCLIP achieves an overall accuracy improvement of 10.65% across three remote sensing datasets compared with CLIP, while utilizing only 2-shot image-text pairs. The code has been released at this https URL.
zh

[CV-111] -MPD: Test Time Model Pruning and Distillation

【速读】: 该论文试图解决在大规模预训练模型压缩过程中,由于数据隐私和商业保密等原因导致无法访问原始训练数据,以及在测试数据分布与训练数据分布存在差异(协变量偏移,covariate shift)的情况下,传统剪枝和微调方法影响模型泛化能力的问题。解决方案的关键在于提出了一种高效的测试时剪枝和微调方法,通过引入两个变量来近似微调后的准确率,并结合潜在的推理延迟节省进行剪枝决策。此外,论文还提出了一种高效的蒸馏方法,通过一次性生成少量微调样本的伪标签来降低伪标签生成的成本,从而在保持测试准确率的同时显著减少了剪枝和微调的时间。

链接: https://arxiv.org/abs/2412.07114
作者: Haihang Wu,Wei Wang,Tamasha Malepathirana,Sachith Seneviratne,Denny Oetomo,Saman Halgamuge
关键词-EN: compressing large pre-trained, large pre-trained models, Pruning, compressing large, large pre-trained
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Pruning can be an effective method of compressing large pre-trained models for inference speed acceleration. Previous pruning approaches rely on access to the original training dataset for both pruning and subsequent fine-tuning. However, access to the training data can be limited due to concerns such as data privacy and commercial confidentiality. Furthermore, with covariate shift (disparities between test and training data distributions), pruning and finetuning with training datasets can hinder the generalization of the pruned model to test data. To address these issues, pruning and finetuning the model with test-time samples becomes essential. However, test-time model pruning and fine-tuning incur additional computation costs and slow down the model's prediction speed, thus posing efficiency issues. Existing pruning methods are not efficient enough for the test-time model pruning setting, since finetuning the pruned model is needed to evaluate the importance of removable components. To address this, we propose two variables to approximate the fine-tuned accuracy. We then introduce an efficient pruning method that considers the approximated finetuned accuracy and potential inference latency saving. To enhance fine-tuning efficiency, we propose an efficient knowledge distillation method that only needs to generate pseudo labels for a small set of finetuning samples one time, thereby reducing the expensive pseudo-label generation cost. Experimental results demonstrate that our method achieves a comparable or superior tradeoff between test accuracy and inference latency, with a 32% relative reduction in pruning and finetuning time compared to the best existing method.
zh
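
The one-time pseudo-label trick reads roughly as below: the expensive unpruned teacher is run once over the fine-tuning samples, and the cached soft labels are reused for every distillation epoch. The hyperparameters, the SGD/KL choices, and the toy models are assumptions, not the paper's setup.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def cache_pseudo_labels(teacher, batches):
    """Run the (unpruned) teacher once and cache its soft labels."""
    teacher.eval()
    return [(x, teacher(x)) for x in batches]

def finetune_pruned(student, cached, epochs=2, lr=1e-4, T=2.0):
    """Distill cached teacher logits into the pruned student.

    The teacher is never re-run during fine-tuning, which is where the
    pseudo-label generation cost is saved.
    """
    opt = torch.optim.SGD(student.parameters(), lr=lr)
    for _ in range(epochs):
        for x, t_logits in cached:
            loss = F.kl_div(F.log_softmax(student(x) / T, -1),
                            F.softmax(t_logits / T, -1),
                            reduction="batchmean") * T * T
            opt.zero_grad()
            loss.backward()
            opt.step()
    return student

teacher = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(3 * 32 * 32, 10))
student = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(3 * 32 * 32, 10))
batches = [torch.randn(8, 3, 32, 32) for _ in range(4)]
finetune_pruned(student, cache_pseudo_labels(teacher, batches))
```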

[CV-112] A Powered Prosthetic Hand with Vision System for Enhancing the Anthropopathic Grasp

【速读】: 该论文试图解决假肢手在执行拟人化抓握过程中,难以精确识别截肢者抓握手势和意图的问题。解决方案的关键在于提出了基于空间几何的抓握手势映射方法(SG-GM)和基于运动轨迹回归的抓握意图估计算法(MTR-GIE)。SG-GM通过构建基于人手抓握过程几何特征的手势函数,实现对手势的精确识别;MTR-GIE则通过回归预测和空间分割估计,预测抓握对象并确定抓握意图。这些方法显著提升了假肢手在多物体环境中的抓握效率和意图识别准确性。

链接: https://arxiv.org/abs/2412.07105
作者: Yansong Xu,Xiaohui Wang,Junlin Li,Xiaoqian Zhang,Feng Li,Qing Gao,Chenglong Fu,Yuquan Leng
关键词-EN: prosthetic hand wearers, grasping, process significantly benefits, prosthetic hand, grasping process significantly
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC); Systems and Control (eess.SY)
备注:

点击查看摘要

Abstract:The anthropomorphism of the grasping process significantly benefits the experience and grasping efficiency of prosthetic hand wearers. Currently, prosthetic hands controlled by signals such as brain-computer interfaces (BCI) and electromyography (EMG) face difficulties in precisely recognizing the amputees' grasping gestures and executing anthropomorphic grasp processes. Although prosthetic hands equipped with vision systems enable recognition of objects' features, they lack perception of human grasping intention. Therefore, this paper explores the estimation of grasping gestures solely through visual data to accomplish anthropopathic grasping control and the determination of grasping intention within a multi-object environment. To address this, we propose the Spatial Geometry-based Gesture Mapping (SG-GM) method, which constructs gesture functions based on the geometric features of the human hand grasping processes. It is subsequently implemented on the prosthetic hand. Furthermore, we propose the Motion Trajectory Regression-based Grasping Intent Estimation (MTR-GIE) algorithm. This algorithm predicts the pre-grasping object utilizing regression prediction and prior spatial segmentation estimation derived from the prosthetic hand's position and trajectory. The experiments were conducted on grasping 8 common daily objects, including a cup, a fork, etc. The experimental results showed a similarity coefficient R² of 0.911 for the grasping process, a root mean squared error (RMSE) of 2.47°, a grasping success rate of 95.43%, and an average grasping-process duration of 3.07 ± 0.41 s. Furthermore, grasping experiments in a multi-object environment were conducted. The average accuracy of intent estimation reached 94.35%. Our methodologies offer a groundbreaking approach to enhancing the prosthetic hand's functionality and provide valuable insights for future research.
zh

[CV-113] Creative Portraiture: Exploring Creative Adversarial Networks and Conditional Creative Adversarial Networks

【速读】: 该论文试图解决生成式对抗网络 (GANs) 在创意应用中的局限性,即生成器仅学习复制训练数据的分布,而无法生成真正具有创意的产品。解决方案的关键在于提出创意对抗网络 (Creative Adversarial Networks, CANs) 及其扩展条件创意对抗网络 (Conditional Creative Adversarial Networks, CCANs)。通过使用CANs和CCANs,研究者能够在WikiArt数据集上生成新颖且具有创意的肖像,并且CCANs能够根据风格标签生成受特定风格启发的创意作品,从而更接近人类在创意过程中基于已有风格进行创新的真实情况。

链接: https://arxiv.org/abs/2412.07091
作者: Sebastian Hereu,Qianfei Hu
关键词-EN: deep convolutional generative, create deep convolutional, convolutional generative adversarial, Convolutional neural networks, generative adversarial networks
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Convolutional neural networks (CNNs) have been combined with generative adversarial networks (GANs) to create deep convolutional generative adversarial networks (DCGANs) with great success. DCGANs have been used for generating images and videos from creative domains such as fashion design and painting. A common critique of the use of DCGANs in creative applications is that they are limited in their ability to generate creative products because the generator simply learns to copy the training distribution. We explore an extension of DCGANs, creative adversarial networks (CANs). Using CANs, we generate novel, creative portraits, using the WikiArt dataset to train the network. Moreover, we introduce our extension of CANs, conditional creative adversarial networks (CCANs), and demonstrate their potential to generate creative portraits conditioned on a style label. We argue that generating products that are conditioned, or inspired, on a style label closely emulates real creative processes in which humans produce imaginative work that is still rooted in previous styles.
zh
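
The defining ingredient of a CAN is the generator's style-ambiguity term: generated samples should be judged as art by the discriminator while making its style posterior as close to uniform as possible. A minimal sketch, with the two-head discriminator interface and the KL-based ambiguity term assumed:

```python
import torch
import torch.nn.functional as F

def can_generator_loss(d_real_fake_logit, d_style_logits):
    """Creative Adversarial Network generator objective (sketch).

    d_real_fake_logit: (B, 1) discriminator real/fake logits for fakes.
    d_style_logits:    (B, K) discriminator style-class logits for fakes.
    The generator wants fakes judged 'real art' AND maximally ambiguous
    across the K style classes (uniform style posterior). A CCAN would
    additionally condition the generator on a style label (not shown).
    """
    adv = F.binary_cross_entropy_with_logits(
        d_real_fake_logit, torch.ones_like(d_real_fake_logit))
    k = d_style_logits.shape[1]
    uniform = torch.full_like(d_style_logits, 1.0 / k)
    ambiguity = F.kl_div(F.log_softmax(d_style_logits, -1), uniform,
                         reduction="batchmean")
    return adv + ambiguity

g_loss = can_generator_loss(torch.randn(16, 1), torch.randn(16, 25))  # K = 25 assumed styles
```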

[CV-114] EvRepSL: Event-Stream Representation via Self-Supervised Learning for Event-Based Vision

【速读】: 该论文试图解决事件相机(event cameras)中事件流(event-streams)表示的质量问题,特别是由于事件流的噪声特性导致手动设计的事件流表示质量无法保证的问题。解决方案的关键在于提出了一种数据驱动的方法,通过引入基于时空统计的新事件流表示(EvRep),并理论推导出异步事件流与同步视频帧之间的内在关系。基于这一关系,论文设计了一个自监督学习的表示生成器(RepGen),将EvRep作为输入,生成高质量的事件流表示(EvRepSL),无需微调或重新训练。该方法在多种主流事件相机捕捉的分类和光流数据集上进行了广泛验证,展示了其优越的性能和广泛的适用性。

链接: https://arxiv.org/abs/2412.07080
作者: Qiang Qu,Xiaoming Chen,Yuk Ying Chung,Yiran Shen
关键词-EN: computer vision tasks, computer vision, event-stream representations, Event-stream, representations
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Multimedia (cs.MM)
备注: Published on IEEE Transactions on Image Processing

点击查看摘要

Abstract:Event-stream representation is the first step for many computer vision tasks using event cameras. It converts the asynchronous event-streams into a formatted structure so that conventional machine learning models can be applied easily. However, most of the state-of-the-art event-stream representations are manually designed and the quality of these representations cannot be guaranteed due to the noisy nature of event-streams. In this paper, we introduce a data-driven approach aiming at enhancing the quality of event-stream representations. Our approach commences with the introduction of a new event-stream representation based on spatial-temporal statistics, denoted as EvRep. Subsequently, we theoretically derive the intrinsic relationship between asynchronous event-streams and synchronous video frames. Building upon this theoretical relationship, we train a representation generator, RepGen, in a self-supervised learning manner accepting EvRep as input. Finally, the event-streams are converted to high-quality representations, termed as EvRepSL, by going through the learned RepGen (without the need of fine-tuning or retraining). Our methodology is rigorously validated through extensive evaluations on a variety of mainstream event-based classification and optical flow datasets (captured with various types of event cameras). The experimental results highlight not only our approach’s superior performance over existing event-stream representations but also its versatility, being agnostic to different event cameras and tasks.
zh

[CV-115] Retaining and Enhancing Pre-trained Knowledge in Vision-Language Models with Prompt Ensembling WACV

【速读】: 该论文试图解决在保留零样本学习(zero-shot learning)能力的同时,将领域特定知识整合到视觉-语言模型(如CLIP模型)中的问题。解决方案的关键在于提出了一种名为Group-wise Prompt Ensemble (GPE) 的新型提示集合学习方法。该方法通过三个主要策略实现:1) 使用掩码注意力机制的提示分组,优化CLIP的适应性并保护其零样本能力;2) 引入辅助提示,无缝整合新领域知识而不破坏原始模型的表示;3) 采用集合学习策略,有效融合原始知识和新知识。这些策略共同提升了模型在数据分布变化下的适应性和鲁棒性,并在跨数据集转移评估中显著超越现有模型。

链接: https://arxiv.org/abs/2412.07077
作者: Donggeun Kim,Yujin Jo,Myungjoo Lee,Taesup Kim
关键词-EN: Contrastive Language-Image Pre-training, Language-Image Pre-training, Contrastive Language-Image, enabling robust zero-shot, revolutionized the field
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: IEEE/CVF Winter Conference on Applications of Computer Vision (WACV) 2025

点击查看摘要

Abstract:The advancement of vision-language models, particularly the Contrastive Language-Image Pre-training (CLIP) model, has revolutionized the field of machine learning by enabling robust zero-shot learning capabilities. These capabilities allow models to understand and respond to previously unseen data without task-specific training. However, adapting CLIP to integrate specialized knowledge from various domains while retaining its zero-shot capabilities remains a significant challenge. To address this, we introduce a novel prompt ensemble learning approach called Group-wise Prompt Ensemble (GPE). This method aims to enhance CLIP’s zero-shot capabilities by incorporating new domain knowledge while improving its adaptability and robustness against data distribution shifts. Our approach hinges on three main strategies: prompt grouping with masked attention to optimize CLIP’s adaptability while safeguarding its zero-shot capabilities; the incorporation of auxiliary prompts for the seamless integration of new domain insights without disrupting the original model’s representation; and an ensemble learning strategy that effectively merges original and new knowledge. Through rigorous experimentation, including more challenging cross-dataset transfer evaluations, our GPE method redefines the benchmarks for the adaptability and efficiency of vision-language models, surpassing existing models across various scenarios.
zh

[CV-116] Stable Mean Teacher for Semi-supervised Video Action Detection AAAI

【速读】: 该论文旨在解决视频动作检测中的半监督学习问题,特别是由于标签数据有限导致的不可靠预测和时间一致性问题。解决方案的关键在于提出了Stable Mean Teacher框架,该框架通过改进和保持时间一致性的伪标签来提升模型性能。具体来说,论文引入了**Error Recovery (EoR)模块,该模块从有标签样本的学生模型错误中学习,并将这些知识传递给教师模型,以改进无标签样本的伪标签。此外,为了解决现有时空损失函数忽视时间一致性的问题,论文提出了Difference of Pixels (DoP)**约束,专注于时间一致性,从而实现连贯的时间检测。实验结果表明,该方法在多个基准数据集上显著优于监督学习基线,并展示了其在大规模数据集和其他视频任务中的泛化能力。

链接: https://arxiv.org/abs/2412.07072
作者: Akash Kumar,Sirshapan Mitra,Yogesh Singh Rawat
关键词-EN: video action detection, video action, focus on semi-supervised, semi-supervised learning, action detection
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: AAAI Conference on Artificial Intelligence, Main Technical Track (AAAI), 2025, Code: this https URL

点击查看摘要

Abstract:In this work, we focus on semi-supervised learning for video action detection. Video action detection requires spatiotemporal localization in addition to classification, and a limited amount of labels makes the model prone to unreliable predictions. We present Stable Mean Teacher, a simple end-to-end teacher-based framework that benefits from improved and temporally consistent pseudo labels. It relies on a novel Error Recovery (EoR) module, which learns from students’ mistakes on labeled samples and transfers this knowledge to the teacher to improve pseudo labels for unlabeled samples. Moreover, existing spatiotemporal losses do not take temporal coherency into account and are prone to temporal inconsistencies. To address this, we present Difference of Pixels (DoP), a simple and novel constraint focused on temporal consistency, leading to coherent temporal detections. We evaluate our approach on four different spatiotemporal detection benchmarks: UCF101-24, JHMDB21, AVA, and YouTube-VOS. Our approach outperforms the supervised baselines for action detection by an average margin of 23.5% on UCF101-24, 16% on JHMDB21, and 3.3% on AVA. Using merely 10% and 20% of data, it provides competitive performance compared to the supervised baseline trained on 100% annotations on UCF101-24 and JHMDB21, respectively. We further evaluate its effectiveness on AVA for scaling to large-scale datasets and YouTube-VOS for video object segmentation, demonstrating its generalization capability to other tasks in the video domain. Code and models are publicly available.
zh
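
One way to read the Difference of Pixels constraint is as matching the frame-to-frame changes of the student's localization maps to those of the teacher's pseudo labels, discouraging temporally incoherent detections. The sketch below is an assumed formulation for illustration, not the paper's exact loss:

```python
import torch
import torch.nn.functional as F

def dop_loss(student_masks, teacher_masks):
    """'Difference of Pixels' style temporal-consistency term (sketch).

    student_masks / teacher_masks: (B, T, H, W) per-frame localization maps.
    The student's pixel differences between consecutive frames are matched
    to those of the (smoother) teacher pseudo labels.
    """
    ds = student_masks[:, 1:] - student_masks[:, :-1]  # (B, T-1, H, W)
    dt = teacher_masks[:, 1:] - teacher_masks[:, :-1]
    return F.mse_loss(ds, dt)

loss = dop_loss(torch.rand(2, 8, 56, 56), torch.rand(2, 8, 56, 56))
```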

[CV-117] Static Key Attention in Vision

【速读】: 该论文试图解决的问题是探索在视觉Transformer中使用静态键(static key)替代动态参数化的键对模型性能的影响。解决方案的关键在于发现静态键注意力机制能够在标准自注意力机制的基础上匹配甚至超越其性能,并且通过将静态键注意力模块集成到Metaformer骨干网络中,能够更好地平衡深度卷积和自注意力的优势,从而在分层混合架构中作为更有效的中间阶段。实验结果表明,静态键机制在多个视觉任务中表现出色,暗示了在某些情况下,可以将注意力机制中的两步动态参数化简化为单步而不影响性能。

链接: https://arxiv.org/abs/2412.07049
作者: Zizhao Hu,Xiaolin Zhou,Mohammad Rostami
关键词-EN: dynamically parameterized multi-head, vision transformers, parameterized multi-head self-attention, static key attention, multi-head self-attention mechanism
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:The success of vision transformers is widely attributed to the expressive power of their dynamically parameterized multi-head self-attention mechanism. We examine the impact of substituting the dynamic parameterized key with a static key within the standard attention mechanism in Vision Transformers. Our findings reveal that static key attention mechanisms can match or even exceed the performance of standard self-attention. Integrating static key attention modules into a Metaformer backbone, we find that it serves as a better intermediate stage in hierarchical hybrid architectures, balancing the strengths of depth-wise convolution and self-attention. Experiments on several vision tasks underscore the effectiveness of the static key mechanism, indicating that the typical two-step dynamic parameterization in attention can be streamlined to a single step without impacting performance under certain circumstances.
zh
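
A minimal sketch of the idea: keep dynamic queries and values, but replace the key projection with a learned parameter shared across all inputs (which fixes the token count). Initialization, head layout, and names are assumptions:

```python
import torch
import torch.nn as nn

class StaticKeyAttention(nn.Module):
    """Self-attention whose keys are a learned static matrix (sketch).

    Queries and values are still computed from the input, but the key
    matrix is a shared parameter, removing one of the two dynamic
    projections of standard attention. Requires a fixed token count.
    """
    def __init__(self, dim: int, num_tokens: int, heads: int = 8):
        super().__init__()
        self.heads, self.dh = heads, dim // heads
        self.q = nn.Linear(dim, dim)
        self.v = nn.Linear(dim, dim)
        self.static_k = nn.Parameter(torch.randn(heads, num_tokens, self.dh) * 0.02)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x):                                              # x: (B, N, D)
        b, n, d = x.shape
        q = self.q(x).view(b, n, self.heads, self.dh).transpose(1, 2)  # (B, H, N, dh)
        v = self.v(x).view(b, n, self.heads, self.dh).transpose(1, 2)
        attn = (q @ self.static_k.transpose(-2, -1)) / self.dh ** 0.5  # (B, H, N, N)
        out = attn.softmax(-1) @ v                                     # (B, H, N, dh)
        return self.proj(out.transpose(1, 2).reshape(b, n, d))

y = StaticKeyAttention(dim=192, num_tokens=196)(torch.randn(2, 196, 192))
```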

[CV-118] Dense Cross-Connected Ensemble Convolutional Neural Networks for Enhanced Model Robustness

【速读】: 该论文试图解决卷积神经网络在图像识别任务中对输入变化和对抗攻击的鲁棒性问题。解决方案的关键在于提出了一种新的架构——密集交叉连接集成卷积神经网络 (Dense Cross-Connected Ensemble Convolutional Neural Network, DCC-ECNN)。该架构结合了DenseNet的密集连接原则和集成学习策略,通过在不同DenseNet路径之间引入中间交叉连接,促进了广泛的特征共享与整合。这种设计不仅利用了DenseNet的高效参数使用和深度优势,还通过集成学习的鲁棒性增强了特征表示的丰富性和抗干扰能力。

链接: https://arxiv.org/abs/2412.07022
作者: Longwei Wang,Xueqian Li,Zheng Zhang
关键词-EN: convolutional neural networks, adversarial attacks remains, Ensemble Convolutional Neural, image recognition tasks, convolutional neural
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 6 pages, 1 figure

点击查看摘要

Abstract:The resilience of convolutional neural networks against input variations and adversarial attacks remains a significant challenge in image recognition tasks. Motivated by the need for more robust and reliable image recognition systems, we propose the Dense Cross-Connected Ensemble Convolutional Neural Network (DCC-ECNN). This novel architecture integrates the dense connectivity principle of DenseNet with the ensemble learning strategy, incorporating intermediate cross-connections between different DenseNet paths to facilitate extensive feature sharing and integration. The DCC-ECNN architecture leverages DenseNet’s efficient parameter usage and depth while benefiting from the robustness of ensemble learning, ensuring a richer and more resilient feature representation.
zh

[CV-119] ProVision: Programmatically Scaling Vision-centric Instruction Data for Multimodal Language Models

【速读】: 该论文试图解决多模态应用中指令数据生成的挑战,特别是在生成复杂图像查询的指令数据时,现有方法依赖于大型语言模型(LLMs)或多模态语言模型(MLMs),这些方法存在幻觉、许可问题以及难以扩展和解释的问题。论文提出的解决方案关键在于采用场景图(scene graphs)作为图像的符号表示,并结合人类编写的程序,系统化地合成以视觉为中心的指令数据。这种方法确保了数据生成过程的可解释性和可控性,同时实现了高效扩展和事实准确性。通过构建24个单图像和14个多图像指令生成器以及场景图生成管道,论文开发了一个可扩展且成本效益高的系统ProVision,能够生成多样化的问答对,涉及对象、属性、关系、深度等,应用于多个数据集并显著提升了多模态语言模型的性能。

链接: https://arxiv.org/abs/2412.07012
作者: Jieyu Zhang,Le Xue,Linxin Song,Jun Wang,Weikai Huang,Manli Shu,An Yan,Zixian Ma,Juan Carlos Niebles,Silvio Savarese,Caiming Xiong,Zeyuan Chen,Ranjay Krishna,Ran Xu
关键词-EN: multimodal language models, complex image-based queries, understanding complex image-based, language models capable, training multimodal language
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: code: this https URL

点击查看摘要

Abstract:With the rise of multimodal applications, instruction data has become critical for training multimodal language models capable of understanding complex image-based queries. Existing practices rely on powerful but costly large language models (LLMs) or multimodal language models (MLMs) to produce instruction data. These are often prone to hallucinations, licensing issues and the generation process is often hard to scale and interpret. In this work, we present a programmatic approach that employs scene graphs as symbolic representations of images and human-written programs to systematically synthesize vision-centric instruction data. Our approach ensures the interpretability and controllability of the data generation process and scales efficiently while maintaining factual accuracy. By implementing a suite of 24 single-image, 14 multi-image instruction generators, and a scene graph generation pipeline, we build a scalable, cost-effective system: ProVision which produces diverse question-answer pairs concerning objects, attributes, relations, depth, etc., for any given image. Applied to Visual Genome and DataComp datasets, we generate over 10 million instruction data points, ProVision-10M, and leverage them in both pretraining and instruction tuning stages of MLMs. When adopted in the instruction tuning stage, our single-image instruction data yields up to a 7% improvement on the 2D split and 8% on the 3D split of CVBench, along with a 3% increase in performance on QBench2, RealWorldQA, and MMMU. Our multi-image instruction data leads to an 8% improvement on Mantis-Eval. Incorporation of our data in both pre-training and fine-tuning stages of xGen-MM-4B leads to an averaged improvement of 1.6% across 11 benchmarks.
zh
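
In spirit, each ProVision generator is a small program over a scene graph. The toy generator below emits question-answer pairs from relation triplets and attributes; the scene-graph schema and the templates are hypothetical, not the released generators.

```python
def relation_qa_generator(scene_graph):
    """Toy programmatic instruction generator over a scene graph.

    scene_graph (hypothetical schema):
        {"objects":  {id: {"name": str, "attributes": [str, ...]}},
         "relations": [(subj_id, predicate, obj_id), ...]}
    """
    objs, qa_pairs = scene_graph["objects"], []
    for subj, pred, obj in scene_graph["relations"]:
        s, o = objs[subj]["name"], objs[obj]["name"]
        qa_pairs.append((f"What is the {s} {pred}?", o))
    for info in objs.values():
        for attr in info.get("attributes", []):
            qa_pairs.append((f"Which attribute describes the {info['name']}?", attr))
    return qa_pairs

sg = {"objects": {0: {"name": "cat", "attributes": ["black"]},
                  1: {"name": "sofa", "attributes": []}},
      "relations": [(0, "sitting on", 1)]}
for q, a in relation_qa_generator(sg):
    print(q, "->", a)
```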

[CV-120] LUIEO: A Lightweight Model for Integrating Underwater Image Enhancement and Object Detection

【速读】: 该论文试图解决水下光学图像因模糊、低对比度和色彩失真等因素导致的物体检测任务精度受限的问题。由于缺乏成对的水下/清洁图像,现有方法通常采用先增强后检测的策略,导致两个学习任务之间缺乏特征交流。论文提出的解决方案关键在于引入多任务学习方法,同时进行图像增强和检测精度提升,并通过物理模块将水下图像分解为清洁图像、背景光和传输图像,利用物理模型进行自监督学习,从而实现任务间的动态信息调整与共享,确保增强任务不受检测任务干扰。

链接: https://arxiv.org/abs/2412.07009
作者: Bin Li,Li Li,Zhenwei Zhang,Yuping Duan
关键词-EN: optical images inevitably, images inevitably suffer, underwater images, images, Underwater optical images
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Underwater optical images inevitably suffer from various degradation factors such as blurring, low contrast, and color distortion, which hinder the accuracy of object detection tasks. Due to the lack of paired underwater/clean images, most research methods adopt a strategy of first enhancing and then detecting, resulting in a lack of feature communication between the two learning tasks. On the other hand, due to the contradiction between the diverse degradation factors of underwater images and the limited number of samples, existing underwater enhancement methods struggle to effectively enhance degraded images of unknown water bodies, thereby limiting the improvement of object detection accuracy. Therefore, most underwater target detection results are still displayed on degraded images, making it difficult to visually judge the correctness of the detection results. To address the above issues, this paper proposes a multi-task learning method that simultaneously enhances underwater images and improves detection accuracy. Compared with single-task learning, the integrated model allows for the dynamic adjustment of information communication and sharing between different tasks. Because real underwater images can only provide annotated object labels, this paper introduces physical constraints to ensure that object detection tasks do not interfere with image enhancement tasks. Specifically, this paper introduces a physical module to decompose underwater images into clean images, background light, and transmission images, and uses a physical model to recompose underwater images for self-supervision. Numerical experiments demonstrate that the proposed model achieves satisfactory results in visual performance, object detection accuracy, and detection efficiency compared to state-of-the-art comparative methods.
zh
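
The physical self-supervision can be sketched with the classic underwater imaging model I = J·t + B·(1 − t): the predicted clean image, transmission map, and background light must recompose the observed input, so no paired clean ground truth is needed. Tensor shapes and the L1 choice below are assumptions:

```python
import torch
import torch.nn.functional as F

def physical_reconstruction_loss(raw, clean, transmission, background_light):
    """Self-supervision via the underwater image formation model (sketch).

    raw:              (B, 3, H, W) observed underwater image I.
    clean:            (B, 3, H, W) predicted clean image J.
    transmission:     (B, 1, H, W) predicted transmission map t in [0, 1].
    background_light: (B, 3, 1, 1) predicted background light B.
    Recomposes I_hat = J * t + B * (1 - t) and matches it to the input.
    """
    recomposed = clean * transmission + background_light * (1.0 - transmission)
    return F.l1_loss(recomposed, raw)

raw = torch.rand(2, 3, 64, 64)
loss = physical_reconstruction_loss(raw, torch.rand(2, 3, 64, 64),
                                    torch.rand(2, 1, 64, 64),
                                    torch.rand(2, 3, 1, 1))
```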

[CV-121] Diffusing Differentiable Representations NEURIPS2024

【速读】: 该论文试图解决使用预训练扩散模型进行可微分表示(diffreps)采样的问题,特别是如何在不依赖额外训练的情况下提高采样质量和多样性。解决方案的关键在于提出了一种“pull back”机制,通过将逆时间过程的动力学从图像空间映射到diffrep参数空间,并根据这一映射过程更新参数,从而实现对采样过程的精细控制。此外,论文还识别并解决了由diffrep引入的隐式约束,显著提升了生成对象的一致性和细节表现。这种方法不仅提高了图像、全景图和3D NeRFs的生成质量,还扩展了扩散模型在不同问题上的应用范围。

链接: https://arxiv.org/abs/2412.06981
作者: Yash Savani,Marc Finzi,J. Zico Kolter
关键词-EN: sampling differentiable representations, differentiable representations, pretrained diffusion models, method achieves sampling, sampling differentiable
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: Published at NeurIPS 2024

点击查看摘要

Abstract:We introduce a novel, training-free method for sampling differentiable representations (diffreps) using pretrained diffusion models. Rather than merely mode-seeking, our method achieves sampling by “pulling back” the dynamics of the reverse-time process–from the image space to the diffrep parameter space–and updating the parameters according to this pulled-back process. We identify an implicit constraint on the samples induced by the diffrep and demonstrate that addressing this constraint significantly improves the consistency and detail of the generated objects. Our method yields diffreps with substantially improved quality and diversity for images, panoramas, and 3D NeRFs compared to existing techniques. Our approach is a general-purpose method for sampling diffreps, expanding the scope of problems that diffusion models can tackle.
zh

[CV-122] Edge-SD-SR: Low Latency and Parameter Efficient On-device Super-Resolution with Stable Diffusion via Bidirectional Conditioning

【速读】: 该论文试图解决在计算资源受限的设备(如手机)上部署大规模扩散模型进行图像超分辨率(Super Resolution, SR)时,模型尺寸过大和高延迟的问题。解决方案的关键在于提出了Edge-SD-SR,这是一个参数高效且低延迟的扩散模型,包含约169M参数,复杂度仅为~142 GFLOPs。为在低计算预算下保持高视觉质量,论文引入了多种训练策略:(i) 双向条件机制(bidirectional conditioning),针对SR任务定制扩散模型;(ii) 联合训练UNet和编码器,同时解耦高分辨率(HR)和低分辨率(LR)图像的编码,并使用专用调度器;(iii) 使用UNet的输出微调解码器,使其直接适应推理时的潜在特征。这些策略使得Edge-SD-SR能够在设备上高效运行,并在主流SR基准测试中达到或超越现有最先进方法的性能。

链接: https://arxiv.org/abs/2412.06978
作者: Mehdi Noroozi,Isma Hadji,Victor Escorcia,Anestis Zaganidis,Brais Martinez,Georgios Tzimiropoulos
关键词-EN: Stable Diffusion-based Super, Diffusion-based Super Resolution, Stable Diffusion-based, Diffusion-based Super, immense progress recently
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:There has been immense progress recently in the visual quality of Stable Diffusion-based Super Resolution (SD-SR). However, deploying large diffusion models on computationally restricted devices such as mobile phones remains impractical due to the large model size and high latency. This is compounded for SR as it often operates at high res (e.g. 4Kx3K). In this work, we introduce Edge-SD-SR, the first parameter-efficient and low-latency diffusion model for image super-resolution. Edge-SD-SR consists of ~169M parameters, including UNet, encoder and decoder, and has a complexity of only ~142 GFLOPs. To maintain a high visual quality on such a low compute budget, we introduce a number of training strategies: (i) A novel conditioning mechanism on the low resolution input, coined bidirectional conditioning, which tailors the SD model for the SR task. (ii) Joint training of the UNet and encoder, while decoupling the encodings of the HR and LR images and using a dedicated schedule. (iii) Finetuning the decoder using the UNet's output to directly tailor the decoder to the latents obtained at inference time. Edge-SD-SR runs efficiently on device, e.g. it can upscale a 128x128 patch to 512x512 in 38 msec while running on a Samsung S24 DSP, and a 512x512 image to 2048x2048 (requiring 25 model evaluations) in just ~1.1 sec. Furthermore, we show that Edge-SD-SR matches or even outperforms state-of-the-art SR approaches on the most established SR benchmarks.
zh

[CV-123] MV-DUSt3R: Single-Stage Scene Reconstruction from Sparse Views In 2 Seconds

【速读】: 该论文试图解决多视角场景重建中由于逐对处理视角导致的误差累积和计算效率低下的问题。解决方案的关键在于提出了快速单阶段前馈网络MV-DUSt3R,其核心是多视角解码器块(multi-view decoder blocks),能够在任意数量的视角之间交换信息,同时考虑一个参考视角。为了增强对参考视角选择的鲁棒性,进一步提出了MV-DUSt3R+,通过交叉参考视角块(cross-reference-view blocks)融合不同参考视角选择的信息。此外,通过添加并联合训练高斯喷射头(Gaussian splatting heads),实现了新视角合成的功能。实验结果表明,该方法在多视角立体重建、多视角姿态估计和新视角合成方面显著优于现有技术。

链接: https://arxiv.org/abs/2412.06974
作者: Zhenggang Tang,Yuchen Fan,Dilin Wang,Hongyu Xu,Rakesh Ranjan,Alexander Schwing,Zhicheng Yan
关键词-EN: longer require camera, require camera calibration, Recent sparse multi-view, Recent sparse, scene reconstruction advances
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Recent sparse multi-view scene reconstruction advances like DUSt3R and MASt3R no longer require camera calibration and camera pose estimation. However, they only process a pair of views at a time to infer pixel-aligned pointmaps. When dealing with more than two views, a combinatorial number of error-prone pairwise reconstructions are usually followed by an expensive global optimization, which often fails to rectify the pairwise reconstruction errors. To handle more views, reduce errors, and improve inference time, we propose the fast single-stage feed-forward network MV-DUSt3R. At its core are multi-view decoder blocks which exchange information across any number of views while considering one reference view. To make our method robust to reference view selection, we further propose MV-DUSt3R+, which employs cross-reference-view blocks to fuse information across different reference view choices. To further enable novel view synthesis, we extend both by adding and jointly training Gaussian splatting heads. Experiments on multi-view stereo reconstruction, multi-view pose estimation, and novel view synthesis confirm that our methods improve significantly upon prior art. Code will be released.
zh

[CV-124] SphereUFormer: A U-Shaped Transformer for Spherical 360 Perception

【速读】: 该论文试图解决全景360度感知中的图像畸变问题,传统方法依赖于等距柱状投影(equirectangular projection),虽然便于2D操作层处理,但引入了图像畸变。其他方法尝试通过保持球面表示来消除畸变,但依赖复杂的卷积核,效果不佳。论文提出的解决方案关键在于引入基于Transformer的架构,结合创新的“球面局部自注意力机制”(Spherical Local Self-Attention)和其他球面导向模块,成功在球面域内进行操作,并在深度估计和语义分割的全景感知基准测试中超越了现有最先进的方法。

链接: https://arxiv.org/abs/2412.06968
作者: Yaniv Benny,Lior Wolf
关键词-EN: paper proposes, Abstract, degree, omnidirectional, method for omnidirectional
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:This paper proposes a novel method for omnidirectional 360° perception. Most common previous methods relied on equirectangular projection. This representation is easily applicable to 2D operation layers but introduces distortions into the image. Other methods attempted to remove the distortions by maintaining a sphere representation but relied on complicated convolution kernels that failed to show competitive results. In this work, we introduce a transformer-based architecture that, by incorporating a novel "Spherical Local Self-Attention" and other spherically-oriented modules, successfully operates in the spherical domain and outperforms the state-of-the-art in 360° perception benchmarks for depth estimation and semantic segmentation.
zh

[CV-125] Gradient-based facial encoding for key generation to encrypt and decrypt multimedia data

【速读】: 该论文试图解决传统基于密码的安全系统易被遗忘、猜测和破解,以及单独使用生物识别技术易受模板伪造和重放攻击的问题。解决方案的关键在于提出了一种基于人脸识别技术的生物加密系统 (biocryptosystem),该系统结合了生物识别和加密技术,使用高级加密标准 (AES) 对各种类型的文件进行加密和解密。系统通过捕获用户面部图像并提取面部特征,利用方向梯度直方图 (HOG) 算法有效检测边缘特征,确保在不同光照条件下识别的准确性。该系统通过将生物特征作为双因素认证的一部分,提供了高安全性和不可伪造性,实验证明其在精度、效率和安全性方面表现优异,适用于安全文件共享、在线交易和数据存档等场景。

链接: https://arxiv.org/abs/2412.06927
作者: Ankit Kumar Patel,Dewanshi Paul,Sneha Chaudhary,Sarthak Giri
关键词-EN: Advanced Encryption Standard, Password-based security, prone to forgetting, Password-based, Encryption Standard
类目: Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV)
备注: 12 pages, 2 figures, submitted to “Journal of Cryptology”

点击查看摘要

Abstract:Password-based security is prone to forgetting, guessing, and hacking. Similarly, standalone biometric-based security is susceptible to template spoofing and replay attacks. This paper proposes a biocryptosystem based on a face recognition technique to bridge this gap, such that it can encrypt and decrypt any kind of file using the Advanced Encryption Standard (AES). The biocryptosystem uses a combination of biometric identification and cryptographic methods to protect sensitive information in a secure and effective manner. To verify a user's identity, our proposed system first captures an image of their face and extracts facial traits. The Histogram of Oriented Gradients (HOG) detects all the unique facial traits because HOG effectively captures edge-based features even in dim lighting. Every data type, including text, audio, and video files, can be encrypted and decrypted using this system. Biometric evidence is inherently tied to an individual, so it is almost impossible for attackers to access the user's data. This method also offers a high level of security by employing biometric data as an element in the 2-factor authentication process. The precision, efficiency, and security of this biocryptosystem are experimentally proven by different metrics like entropy and the avalanche effect. Applications for the proposed system include safe file sharing, online transactions, and data archiving. Hence, it offers a strong and dependable option for safeguarding sensitive data.
zh
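
A toy end-to-end sketch of the pipeline: HOG features from a face crop are coarsely quantized and hashed into a 256-bit key used for AES encryption. This is illustrative only: AES-GCM is chosen here as one concrete AES mode, and real biometric key binding needs error-tolerant schemes (e.g. fuzzy extractors), since HOG features vary between captures of the same face.

```python
import hashlib
import os
import numpy as np
from skimage.feature import hog
from cryptography.hazmat.primitives.ciphers.aead import AESGCM

def derive_key_from_face(gray_face: np.ndarray) -> bytes:
    """Hash coarsely quantized HOG features into a 256-bit AES key."""
    feats = hog(gray_face, orientations=9, pixels_per_cell=(8, 8),
                cells_per_block=(2, 2))
    quantized = (feats > feats.mean()).astype(np.uint8)  # coarse binarization
    return hashlib.sha256(quantized.tobytes()).digest()  # 32-byte key

def encrypt_file(data: bytes, key: bytes) -> bytes:
    nonce = os.urandom(12)                               # per-message nonce
    return nonce + AESGCM(key).encrypt(nonce, data, None)

def decrypt_file(blob: bytes, key: bytes) -> bytes:
    return AESGCM(key).decrypt(blob[:12], blob[12:], None)

face = np.random.rand(128, 128)          # stand-in for an aligned gray face crop
key = derive_key_from_face(face)
blob = encrypt_file(b"secret report", key)
assert decrypt_file(blob, key) == b"secret report"
```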

[CV-126] SafeWatch: An Efficient Safety-Policy Following Video Guardrail Model with Transparent Explanations

【速读】: 该论文试图解决当前视频防护机制(video guardrails)在安全性和实用性方面的不足,特别是现有方法要么过于简单,依赖于基于有限不安全类别的分类模型,缺乏详细解释;要么效率低下,通过多模态大语言模型(MLLMs)处理冗长的安全指南,不适用于实际内容防护。解决方案的关键在于提出了一种名为SafeWatch的高效MLLM视频防护模型,该模型能够遵循定制的安全策略,并以零样本方式提供多标签视频防护输出和内容特定的解释。SafeWatch通过并行编码每个策略块并消除位置偏差,确保所有策略被同时关注且具有同等重要性,同时采用策略感知的视觉令牌修剪算法,自适应选择与每个策略最相关的视频令牌,从而减少噪声和无关信息,提高效率和准确性。此外,论文还提出了SafeWatch-Bench,一个包含超过200万视频的大规模视频防护基准,涵盖六个安全类别和30个任务,以确保全面覆盖所有潜在的安全场景。

链接: https://arxiv.org/abs/2412.06878
作者: Zhaorun Chen,Francesco Pinto,Minzhou Pan,Bo Li
关键词-EN: high-quality video generation, security across platforms, video guardrail, rise of generative, rapid growth
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: 43 pages, 20 figures

点击查看摘要

Abstract:With the rise of generative AI and rapid growth of high-quality video generation, video guardrails have become more crucial than ever to ensure safety and security across platforms. Current video guardrails, however, are either overly simplistic, relying on pure classification models trained on simple policies with limited unsafe categories, which lack detailed explanations, or prompting multimodal large language models (MLLMs) with long safety guidelines, which are inefficient and impractical for guardrailing real-world content. To bridge this gap, we propose SafeWatch, an efficient MLLM-based video guardrail model designed to follow customized safety policies and provide multi-label video guardrail outputs with content-specific explanations in a zero-shot manner. In particular, unlike traditional MLLM-based guardrails that encode all safety policies autoregressively, causing inefficiency and bias, SafeWatch uniquely encodes each policy chunk in parallel and eliminates their position bias such that all policies are attended simultaneously with equal importance. In addition, to improve efficiency and accuracy, SafeWatch incorporates a policy-aware visual token pruning algorithm that adaptively selects the most relevant video tokens for each policy, discarding noisy or irrelevant information. This allows for more focused, policy-compliant guardrail with significantly reduced computational overhead. Considering the limitations of existing video guardrail benchmarks, we propose SafeWatch-Bench, a large-scale video guardrail benchmark comprising over 2M videos spanning six safety categories which covers over 30 tasks to ensure a comprehensive coverage of all potential safety scenarios. SafeWatch outperforms SOTA by 28.2% on SafeWatch-Bench, 13.6% on benchmarks, cuts costs by 10%, and delivers top-tier explanations validated by LLM and human reviews.
zh
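The policy-aware visual token pruning step can be pictured with the minimal sketch below: each policy chunk keeps only its top-k most relevant video tokens by cosine similarity. The shapes, scoring rule, and function name are illustrative assumptions, not SafeWatch's actual implementation.

```python
import torch
import torch.nn.functional as F

def policy_aware_prune(video_tokens: torch.Tensor, policy_embs: torch.Tensor, keep: int):
    """Keep the `keep` video tokens most relevant to each policy chunk.

    video_tokens: (T, d) visual tokens; policy_embs: (P, d), one embedding
    per policy chunk. Returns a (P, keep, d) tensor of pruned tokens.
    """
    v = F.normalize(video_tokens, dim=-1)    # (T, d)
    p = F.normalize(policy_embs, dim=-1)     # (P, d)
    scores = p @ v.T                         # (P, T) cosine relevance per policy
    idx = scores.topk(keep, dim=-1).indices  # (P, keep) most relevant tokens
    return video_tokens[idx]                 # advanced indexing -> (P, keep, d)

tokens = torch.randn(256, 64)    # e.g. 256 frame patches, 64-d features
policies = torch.randn(6, 64)    # 6 safety-policy chunks
pruned = policy_aware_prune(tokens, policies, keep=32)
print(pruned.shape)              # torch.Size([6, 32, 64])
```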

[CV-127] Safety Monitoring of Machine Learning Perception Functions: a Survey

【速读】: 该论文试图解决在安全关键应用中,如自动驾驶汽车和手术机器人,使用机器学习 (Machine Learning, ML) 模型进行感知任务时所面临的可靠性挑战。解决方案的关键在于设计和实施故障容错机制,特别是安全监控器 (safety monitors),以确保系统在发生故障时仍能保持安全行为。论文通过对现有文献的广泛回顾,强调了设计此类监控器时需要考虑的关键因素,包括威胁识别、需求获取、故障检测、反应策略和评估方法。此外,论文还指出了当前安全监控领域存在的挑战,并提出了未来研究的方向。

链接: https://arxiv.org/abs/2412.06869
作者: Raul Sena Ferreira,Joris Guérin,Kevin Delmas,Jérémie Guiochet,Hélène Waeselynck
关键词-EN: Machine Learning, deep neural networks, complex perception tasks, perform complex perception, neural networks
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Software Engineering (cs.SE)
备注: 25 pages, 2 figures

点击查看摘要

Abstract:Machine Learning (ML) models, such as deep neural networks, are widely applied in autonomous systems to perform complex perception tasks. New dependability challenges arise when ML predictions are used in safety-critical applications, like autonomous cars and surgical robots. Thus, the use of fault tolerance mechanisms, such as safety monitors, is essential to ensure the safe behavior of the system despite the occurrence of faults. This paper presents an extensive literature review on safety monitoring of perception functions using ML in a safety-critical context. In this review, we structure the existing literature to highlight key factors to consider when designing such monitors: threat identification, requirements elicitation, detection of failure, reaction, and evaluation. We also highlight the ongoing challenges associated with safety monitoring and suggest directions for future research.
zh

[CV-128] Compression for Better: A General and Stable Lossless Compression Framework

【速读】: 该论文试图解决如何在不影响模型性能的前提下实现无损模型压缩的问题,关键在于提出了一种名为**LossLess Compression (LLC)**的理论框架。LLC通过总微分方法划定了压缩邻域和更高阶分析边界,从而明确了模型在无损压缩下的误差范围。具体解决方案包括将经典量化搜索问题重新表述为无损邻域内的分组背包问题,以及在低秩约束下自动确定每层秩的分解方法,从而实现无损量化和低秩模型生成。实验结果表明,LLC能够在不使用复杂技巧的情况下有效实现无损模型压缩。

链接: https://arxiv.org/abs/2412.06868
作者: Boyang Zhang,Daning Cheng,Yunquan Zhang,Fangmin Liu,Wenguang Chen
关键词-EN: sacrificing performance due, reduce model complexity, aiming to reduce, work focus, complexity and enhance
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Under Review

点击查看摘要

Abstract:This work focuses on stable and lossless model compression, aiming to reduce model complexity and enhance efficiency without sacrificing performance due to compression errors. A key challenge is effectively leveraging compression errors and defining the boundaries for lossless compression to minimize model loss, i.e., compression for the better. Currently, there is no systematic approach to determining this error boundary or understanding its specific impact on model performance. We propose a general LossLess Compression theoretical framework (LLC), which delineates the compression neighborhood and higher-order analysis boundaries through the total differential, thereby specifying the error range within which a model can be compressed without loss. To verify the effectiveness of LLC, we apply various compression techniques, including quantization and decomposition. Specifically, for quantization, we reformulate the classic quantization search problem as a grouped knapsack problem within the lossless neighborhood, achieving lossless quantization while improving computational efficiency. For decomposition, LLC addresses the approximation problem under low-rank constraints, automatically determining the rank for each layer and producing lossless low-rank models. We conduct extensive experiments on multiple neural network architectures across different datasets. The results show that, without fancy tricks, LLC can effectively achieve lossless model compression. Our code will be made publicly available.
zh
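The quantization part reduces to a grouped knapsack: each layer must pick exactly one bit-width (one item per group) so that the estimated error is minimized under a total cost budget. Below is a generic dynamic-programming sketch of that formulation; the costs and error estimates are toy numbers, and LLC's actual search is additionally constrained to its lossless neighborhood.

```python
import math

def grouped_knapsack(options, budget):
    """options[i]: list of (cost, error) candidate bit-widths for layer i.
    Pick exactly one option per layer with total cost <= budget,
    minimizing the summed error. Classic grouped-knapsack DP."""
    n = len(options)
    dp = [[math.inf] * (budget + 1) for _ in range(n + 1)]
    choice = [[None] * (budget + 1) for _ in range(n + 1)]
    dp[0] = [0.0] * (budget + 1)  # dp[i][b]: best error for first i layers, cost <= b
    for i, opts in enumerate(options):
        for b in range(budget + 1):
            for k, (cost, err) in enumerate(opts):
                if cost <= b and dp[i][b - cost] + err < dp[i + 1][b]:
                    dp[i + 1][b] = dp[i][b - cost] + err
                    choice[i + 1][b] = k
    b, picks = budget, []  # backtrack the chosen option per layer
    for i in range(n, 0, -1):
        k = choice[i][b]
        picks.append(k)
        b -= options[i - 1][k][0]
    return dp[n][budget], picks[::-1]

layers = [  # toy (relative cost, estimated loss increase) per bit-width choice
    [(8, 0.00), (4, 0.02), (2, 0.10)],
    [(8, 0.00), (4, 0.05), (2, 0.30)],
    [(8, 0.00), (4, 0.01), (2, 0.04)],
]
err, picks = grouped_knapsack(layers, budget=16)
print(err, picks)  # 0.03 [1, 0, 1] -> bit-width choices (4, 8, 4) within budget
```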

[CV-129] Generating floorplans for various building functionalities via latent diffusion model

【速读】: 该论文试图解决传统建筑设计中生成平面图(floorplans)过程复杂、耗时且依赖设计师经验的问题。解决方案的关键在于引入生成式潜在扩散模型(generative latent diffusion model),该模型能够基于建筑轮廓和设计简报生成多种建筑类型的平面图。通过学习不同建筑类型之间的复杂连接和设计变异,该模型不仅能够复制现有设计,还能生成融合多种设计元素的新颖配置,从而在速度和成本效益上为建筑设计带来新的创造维度。

链接: https://arxiv.org/abs/2412.06859
作者: Mohamed R. Ibrahim,Josef Musil,Irene Gallou
关键词-EN: human intelligence lies, skill demanding distinctive, demanding distinctive expertise, years of experience, foundational essence
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注: 17

点击查看摘要

Abstract:In the domain of architectural design, the foundational essence of creativity and human intelligence lies in the mastery of solving floorplans, a skill demanding distinctive expertise and years of experience. Traditionally, the architectural design process of creating floorplans often requires substantial manual labour and architectural expertise. Even when relying on parametric design approaches, the process is limited by the designer’s ability to build a complex set of parameters to iteratively explore design alternatives. As a result, these approaches hinder creativity and limit discovery of an optimal solution. Here, we present a generative latent diffusion model that learns to generate floorplans for various building types based on building footprints and design briefs. The introduced model learns from the complexity of the inter-connections between diverse building types and the mutations of architectural designs. By harnessing the power of latent diffusion models, this research surpasses conventional limitations in the design process. The model’s ability to learn from diverse building types means that it can not only replicate existing designs but also produce entirely new configurations that fuse design elements in unexpected ways. This innovation introduces a new dimension of creativity into architectural design, allowing architects, urban planners and even individuals without specialised expertise to explore uncharted territories of form and function with speed and cost-effectiveness.
zh

[CV-130] MDiFF: Exploiting Multimodal Score-based Diffusion Models for New Fashion Product Performance Forecasting ECCV

【速读】: 该论文试图解决快时尚行业中由于过度生产和未售库存导致的环境影响问题,特别是通过准确预测未发布产品的销售量来提高效率和资源利用率。解决方案的关键在于提出了一种名为MDiFF的新型两步多模态扩散模型(diffusion models)管道,用于新时尚产品性能预测(NFPPF)。首先,使用基于分数的扩散模型预测不同服装在未来多个时间点的销售量;然后,通过轻量级的多层感知器(MLP)对这些多重预测进行细化,以获得最终的预测结果。MDiFF结合了扩散模型和MLP的优势,解决了传统确定性模型在处理训练数据分布之外的商品时遇到的领域偏移问题,从而实现了最先进的准确和高效的预测系统。

链接: https://arxiv.org/abs/2412.06840
作者: Andrea Avogaro,Luigi Capogrosso,Franco Fummi,Marco Cristani
关键词-EN: significant environmental impacts, environmental impacts due, unsold inventory, suffers from significant, significant environmental
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted at the FashionAI workshop @ the European Conference on Computer Vision (ECCV) 2024. arXiv admin note: substantial text overlap with arXiv:2412.05566

点击查看摘要

Abstract:The fast fashion industry suffers from significant environmental impacts due to overproduction and unsold inventory. Accurately predicting sales volumes for unreleased products could significantly improve efficiency and resource utilization. However, predicting performance for entirely new items is challenging due to the lack of historical data and rapidly changing trends, and existing deterministic models often struggle with domain shifts when encountering items outside the training data distribution. The recently proposed diffusion models address this issue using a continuous-time diffusion process. This allows us to simulate how new items are adopted, reducing the impact of the domain shift challenges faced by deterministic models. As a result, in this paper, we propose MDiFF: a novel two-step multimodal diffusion models-based pipeline for New Fashion Product Performance Forecasting (NFPPF). First, we use a score-based diffusion model to predict multiple future sales for different clothes over time. Then, we refine these multiple predictions with a lightweight Multi-layer Perceptron (MLP) to get the final forecast. MDiFF leverages the strengths of both architectures, resulting in a state-of-the-art forecasting system for the fast-fashion industry in terms of both accuracy and efficiency. The code can be found at this https URL.
zh
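A minimal sketch of the two-step pipeline, with a placeholder standing in for the score-based diffusion sampler: K stochastic forecasts per item are flattened and refined into a single forecast by a lightweight MLP. All dimensions and module names here are assumptions for illustration.

```python
import torch
import torch.nn as nn

K, H = 8, 12  # K diffusion samples per item, horizon of H sales steps

class ForecastRefiner(nn.Module):
    """Lightweight MLP mapping K stochastic forecasts to one final forecast,
    mirroring MDiFF's second stage (the diffusion sampler is a stand-in)."""
    def __init__(self, k: int, horizon: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(k * horizon, 128), nn.ReLU(),
            nn.Linear(128, horizon),
        )

    def forward(self, samples: torch.Tensor) -> torch.Tensor:  # (B, K, H)
        return self.net(samples.flatten(1))                    # (B, H)

def fake_diffusion_sampler(batch: int) -> torch.Tensor:
    # Placeholder for the score-based model: K noisy draws per item.
    base = torch.rand(batch, 1, H).expand(batch, K, H)
    return base + 0.1 * torch.randn(batch, K, H)

refiner = ForecastRefiner(K, H)
samples = fake_diffusion_sampler(batch=4)
final = refiner(samples)
print(final.shape)  # torch.Size([4, 12])
```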

[CV-131] Inverting Visual Representations with Detection Transformers

【速读】: 该论文试图解决在基于Transformer的视觉模型中理解深度神经网络机制的问题。解决方案的关键在于应用训练逆向模型(inverse models)的方法,从Detection Transformer的中间层重建输入图像。通过定性和定量评估重建图像,研究展示了Detection Transformer的关键特性,如上下文形状保持、层间相关性和对颜色扰动的鲁棒性,从而深入理解了Transformer架构中的这些特性是如何产生的。

链接: https://arxiv.org/abs/2412.06534
作者: Jan Rathjens,Shirin Reyhanian,David Kappel,Laurenz Wiskott
关键词-EN: deep neural networks, mechanisms underlying deep, computer vision remains, underlying deep neural, neural networks
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)
备注:

点击查看摘要

Abstract:Understanding the mechanisms underlying deep neural networks in computer vision remains a fundamental challenge. While many prior approaches have focused on visualizing intermediate representations within deep neural networks, particularly convolutional neural networks, these techniques have yet to be thoroughly explored in transformer-based vision models. In this study, we apply the approach of training inverse models to reconstruct input images from intermediate layers within a Detection Transformer, showing that this approach is efficient and feasible for transformer-based vision models. Through qualitative and quantitative evaluations of reconstructed images across model stages, we demonstrate critical properties of Detection Transformers, including contextual shape preservation, inter-layer correlation, and robustness to color perturbations, illustrating how these characteristics emerge within the model’s architecture. Our findings contribute to a deeper understanding of transformer-based vision models. The code for reproducing our experiments will be made available at this http URL.
zh
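The core recipe, training an inverse model on frozen intermediate features, can be sketched as follows. A ResNet-18 stands in for the Detection Transformer encoder (an assumption; the paper inverts DETR stages), and a small transposed-convolution decoder is trained with a pixel MSE loss to reconstruct the input.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet18

backbone = resnet18(weights=None).eval()  # stand-in for the inspected model
for p in backbone.parameters():
    p.requires_grad_(False)               # the inspected model stays frozen

def features(x):
    # Intermediate activations after layer2: (B, 128, 28, 28) for 224x224 input.
    x = backbone.conv1(x); x = backbone.bn1(x); x = backbone.relu(x)
    x = backbone.maxpool(x); x = backbone.layer1(x)
    return backbone.layer2(x)

decoder = nn.Sequential(  # inverse model: upsample features back to RGB
    nn.ConvTranspose2d(128, 64, 4, 2, 1), nn.ReLU(),   # 28 -> 56
    nn.ConvTranspose2d(64, 32, 4, 2, 1), nn.ReLU(),    # 56 -> 112
    nn.ConvTranspose2d(32, 3, 4, 2, 1), nn.Sigmoid(),  # 112 -> 224
)
opt = torch.optim.Adam(decoder.parameters(), lr=1e-3)

x = torch.rand(8, 3, 224, 224)  # placeholder image batch
with torch.no_grad():
    z = features(x)
loss = nn.functional.mse_loss(decoder(z), x)  # reconstruct input from features
loss.backward(); opt.step()
print(loss.item())
```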

[CV-132] SKIPNet: Spatial Attention Skip Connections for Enhanced Brain Tumor Classification

【速读】: 该论文试图解决脑肿瘤早期检测中的两个关键问题:一是远程地区诊断设施的不足,二是传统手动分割脑部MRI扫描的劳动强度。解决方案的关键在于利用深度学习(Deep Learning)技术,提出了一种自动化模型,通过结合空间注意力机制(spatial attention)来提高脑肿瘤检测和分类的准确性。该模型在MRI数据上实现了96.90%的准确率,显著提升了上下文信息的聚合和模式识别能力,从而在自动化MRI脑肿瘤分析中展现出优于基线模型的性能。

链接: https://arxiv.org/abs/2412.07736
作者: Khush Mendiratta(1),Shweta Singh(2),Pratik Chattopadhyay(2) ((1) Indian Institute of Technology Roorkee, (2) Indian Institute of Technology BHU)
关键词-EN: magnetic resonance imaging, diagnostic facilities remains, facilities remains limited, Early detection, resonance imaging
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Early detection of brain tumors through magnetic resonance imaging (MRI) is essential for timely treatment, yet access to diagnostic facilities remains limited in remote areas. Gliomas, the most common primary brain tumors, arise from the carcinogenesis of glial cells in the brain and spinal cord, with glioblastoma patients having a median survival time of less than 14 months. MRI serves as a non-invasive and effective method for tumor detection, but manual segmentation of brain MRI scans has traditionally been a labor-intensive task for neuroradiologists. Recent advancements in computer-aided design (CAD), machine learning (ML), and deep learning (DL) offer promising solutions for automating this process. This study proposes an automated deep learning model for brain tumor detection and classification using MRI data. The model, incorporating spatial attention, achieved 96.90% accuracy, enhancing the aggregation of contextual information for better pattern recognition. Experimental results demonstrate that the proposed approach outperforms baseline models, highlighting its robustness and potential for advancing automated MRI-based brain tumor analysis.
zh
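A common form of the spatial-attention idea is the CBAM-style block below: channel-wise average and max pooling followed by a convolution produce a per-pixel gate. This is a plausible stand-in for SKIPNet's attention module, not its published definition.

```python
import torch
import torch.nn as nn

class SpatialAttention(nn.Module):
    """CBAM-style spatial attention: pool across channels, then a conv
    produces a per-pixel gate that re-weights informative regions."""
    def __init__(self, kernel_size: int = 7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2)

    def forward(self, x):                  # x: (B, C, H, W)
        avg = x.mean(dim=1, keepdim=True)  # (B, 1, H, W)
        mx, _ = x.max(dim=1, keepdim=True)
        gate = torch.sigmoid(self.conv(torch.cat([avg, mx], dim=1)))
        return x * gate

feat = torch.randn(2, 64, 56, 56)          # e.g. features of an MRI slice
print(SpatialAttention()(feat).shape)      # torch.Size([2, 64, 56, 56])
```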

[CV-133] BATIS: Bootstrapping Autonomous Testing and Initialization System for Quantum Dot Devices

【速读】: 该论文试图解决半导体量子点 (QD) 设备在基于自旋的量子计算中日益复杂的调谐问题,尤其是随着设备复杂性增加,手动调谐变得不可行的问题。解决方案的关键在于引入了一个名为BATIS的自动化框架,该框架通过引导、自主测试和初始化系统来简化QD设备的测试和初始化过程。BATIS能够高效地导航高维门电压空间,自动化关键步骤如漏电测试和门特性分析,并采用一种新颖且可扩展的通道形成协议,仅需一次测量即可处理任意数量的通道。该系统在1.3 K下对四量子点Si/Si_xGe_1-x设备进行了验证,显著提高了可扩展性并减少了初始设备诊断的设置时间,同时具备平台无关性,适用于各种QD系统。

链接: https://arxiv.org/abs/2412.07676
作者: Tyler J. Kovach,Daniel Schug,M. A. Wolfe,E. R. MacQuarrie,Patrick J. Walsh,Jared Benson,Mark Friesen,M. A. Eriksson,Justyna P. Zwolak
关键词-EN: Semiconductor quantum dot, spin-based quantum computing, Semiconductor quantum, quantum dot, quantum computing
类目: Mesoscale and Nanoscale Physics (cond-mat.mes-hall); Computer Vision and Pattern Recognition (cs.CV); Emerging Technologies (cs.ET); Quantum Physics (quant-ph)
备注: 10 pages, 3 figures

点击查看摘要

Abstract:Semiconductor quantum dot (QD) devices have become central to advancements in spin-based quantum computing. As the complexity of QD devices grows, manual tuning becomes increasingly infeasible, necessitating robust and scalable autotuning solutions. Tuning large arrays of QD qubits depends on efficient choices of automated protocols. Here, we introduce a bootstrapping, autonomous testing, and initialization system (BATIS), an automated framework designed to streamline QD device testing and initialization. BATIS navigates high-dimensional gate voltage spaces, automating essential steps such as leakage testing and gate characterization. The current channel formation protocol follows a novel and scalable approach that requires a single measurement regardless of the number of channels. Demonstrated at 1.3 K on a quad-QD Si/Si_xGe_{1-x} device, BATIS eliminates the need for deep cryogenic environments during initial device diagnostics, significantly enhancing scalability and reducing setup times. By requiring minimal prior knowledge of the device architecture, BATIS represents a platform-agnostic solution, adaptable to various QD systems, which bridges a critical gap in QD autotuning.
zh

[CV-134] Motion Artifact Removal in Pixel-Frequency Domain via Alternate Masks and Diffusion Model

【速读】: 该论文试图解决磁共振成像 (MRI) 中的运动伪影问题,这些伪影严重干扰临床诊断。现有的解决方案大多依赖于配对数据,且未充分考虑k空间(frequency domain)中的扰动,限制了其在临床中的应用。论文提出的解决方案是一种无监督的净化方法,关键在于利用噪声MRI图像的像素-频率信息,引导预训练的扩散模型恢复干净的MRI图像。具体来说,该方法通过低频分量确保正确的组织纹理,同时利用高频和像素信息恢复形状和细节纹理,并通过交替互补掩码破坏伪影结构并利用有用信息。实验结果表明,该方法在多个指标上表现优异,并获得了放射科医生的积极反馈。

链接: https://arxiv.org/abs/2412.07590
作者: Jiahua Xu,Dawei Zhou,Lei Hu,Jianfeng Guo,Feng Yang,Zaiyi Liu,Nannan Wang,Xinbo Gao
关键词-EN: magnetic resonance imaging, Motion artifacts present, resonance imaging, Motion artifacts, present in magnetic
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注: 12 pages, 8 figures

点击查看摘要

Abstract:Motion artifacts present in magnetic resonance imaging (MRI) can seriously interfere with clinical diagnosis. Removing motion artifacts is a straightforward solution and has been extensively studied. However, paired data are still heavily relied on in recent works and the perturbations in k-space (frequency domain) are not well considered, which limits their applications in the clinical field. To address these issues, we propose a novel unsupervised purification method which leverages pixel-frequency information of noisy MRI images to guide a pre-trained diffusion model to recover clean MRI images. Specifically, considering that motion artifacts are mainly concentrated in high-frequency components in k-space, we utilize the low-frequency components as the guide to ensure correct tissue textures. Additionally, given that high-frequency and pixel information are helpful for recovering shape and detail textures, we design alternate complementary masks to simultaneously destroy the artifact structure and exploit useful information. Quantitative experiments are performed on datasets from different tissues and show that our method achieves superior performance on several metrics. Qualitative evaluations with radiologists also show that our method provides better clinical feedback. Our code is available at this https URL.
zh
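The low-frequency guidance component can be sketched in a few lines of NumPy: transform the corrupted slice to k-space, keep a central disc of frequencies, and invert. The radius and the circular mask shape are illustrative choices, not the paper's exact parameters.

```python
import numpy as np

def lowfreq_guide(image: np.ndarray, keep_radius: int) -> np.ndarray:
    """Extract the low-frequency component of an image in k-space.
    Motion artifacts live mostly in high frequencies, so this band
    preserves the correct tissue layout and can steer a denoiser."""
    k = np.fft.fftshift(np.fft.fft2(image))
    h, w = image.shape
    yy, xx = np.ogrid[:h, :w]
    mask = (yy - h / 2) ** 2 + (xx - w / 2) ** 2 <= keep_radius ** 2
    return np.fft.ifft2(np.fft.ifftshift(k * mask)).real

noisy = np.random.rand(256, 256)  # placeholder motion-corrupted slice
guide = lowfreq_guide(noisy, keep_radius=24)
# In the paper's pipeline, this guide (plus alternating complementary masks
# on the high band and pixels) conditions a pre-trained diffusion model.
print(guide.shape)
```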

[CV-135] KneeXNeT: An Ensemble-Based Approach for Knee Radiographic Evaluation

【速读】: 该论文试图解决膝关节骨关节炎 (knee osteoarthritis, OA) 严重程度的自动化诊断问题,传统上这一过程依赖于专家对X光图像的评估,且基于时间密集型的Kellgren-Lawrence分级系统。解决方案的关键在于开发了一种基于深度学习的自动化模型,通过评估多种先进的深度学习模型并采用加权采样处理类别不平衡问题,最终通过集成学习(ensemble learning)方法构建了KneeXNet模型,实现了0.72的最高准确率,显著提升了OA严重程度分类的自动化水平。

链接: https://arxiv.org/abs/2412.07526
作者: Nicharee Srikijkasemwat,Soumya Snigdha Kundu,Fuping Wu,Bartlomiej W. Papiez
关键词-EN: common joint disorder, common joint, joint disorder, Knee osteoarthritis, X-ray images
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注: 10 pages, 5 figures, accepted by MICAD 2024

点击查看摘要

Abstract:Knee osteoarthritis (OA) is the most common joint disorder and a leading cause of disability. Diagnosing OA severity typically requires expert assessment of X-ray images and is commonly based on the Kellgren-Lawrence grading system, a time-intensive process. This study aimed to develop an automated deep learning model to classify knee OA severity, reducing the need for expert evaluation. First, we evaluated ten state-of-the-art deep learning models, achieving a top accuracy of 0.69 with individual models. To address class imbalance, we employed weighted sampling, improving accuracy to 0.70. We further applied Smooth-GradCAM++ to visualize decision-influencing regions, enhancing the explainability of the best-performing model. Finally, we developed ensemble models using majority voting and a shallow neural network. Our ensemble model, KneeXNet, achieved the highest accuracy of 0.72, demonstrating its potential as an automated tool for knee OA assessment.
zh
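The majority-voting half of the ensemble is straightforward; a minimal sketch (the grade values and model outputs below are toy data):

```python
import numpy as np

def majority_vote(predictions: np.ndarray) -> np.ndarray:
    """predictions: (n_models, n_samples) integer Kellgren-Lawrence grades.
    Returns the per-sample modal class, as in a voting ensemble."""
    n_classes = predictions.max() + 1
    # bincount each column -> (n_classes, n_samples) vote tallies
    votes = np.apply_along_axis(np.bincount, 0, predictions, minlength=n_classes)
    return votes.argmax(axis=0)

preds = np.array([[0, 2, 4, 1],    # model A
                  [0, 2, 3, 1],    # model B
                  [1, 2, 4, 1]])   # model C
print(majority_vote(preds))        # [0 2 4 1]
```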

[CV-136] Enhanced MRI Representation via Cross-series Masking

【速读】: 该论文试图解决磁共振成像 (MRI) 中多序列图像整合分析的挑战,特别是由于不同序列的空间分辨率和对比度模式差异,以及临床实践中标注数据稀缺的问题。解决方案的关键在于提出了一种新的跨序列掩码策略 (Cross-Series Masking, CSM),通过自监督学习方式有效学习 MRI 表示。CSM 通过随机采样和掩码部分区域和序列,利用未掩码数据重建掩码部分,从而整合不同序列的信息,并增强对序列内和序列间相关性和互补性的建模能力。这种方法不仅提升了下游任务(如分割和分类)的性能,还在脑组织分割、乳腺肿瘤良恶性分类和前列腺癌诊断等任务中达到了最先进的性能。

链接: https://arxiv.org/abs/2412.07387
作者: Churan Wang,Fei Gao,Lijun Yan,Siwen Wang,Yizhou Yu,Yizhou Wang
关键词-EN: Magnetic resonance imaging, produce multi-series images, medical conditions due, Magnetic resonance, resonance imaging
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Magnetic resonance imaging (MRI) is indispensable for diagnosing and planning treatment in various medical conditions due to its ability to produce multi-series images that reveal different tissue characteristics. However, integrating these diverse series into a coherent analysis presents significant challenges, such as differing spatial resolutions and contrast patterns, while also requiring extensive annotated data, which is scarce in clinical practice. To address these issues, we introduce a novel Cross-Series Masking (CSM) Strategy for effectively learning MRI representation in a self-supervised manner. Specifically, CSM commences by randomly sampling a subset of regions and series, which are then strategically masked. In the training process, the cross-series representation is learned by utilizing the unmasked data to reconstruct the masked portions. This process not only integrates information across different series but also facilitates the ability to model both intra-series and inter-series correlations and complementarities. With the learned representation, downstream tasks like segmentation and classification are also enhanced. Taking brain tissue segmentation, breast tumor benign/malignant classification, and prostate cancer diagnosis as examples, our method achieves state-of-the-art performance on both public and in-house datasets.
zh
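A toy rendering of the masking-and-reconstruction loop, with flattened 1-D "voxels" and an MLP standing in for the actual encoder (both assumptions): whole series and random positions are dropped, and the loss is computed only on masked entries, so the model must borrow information across series.

```python
import torch
import torch.nn as nn

def cross_series_mask(x: torch.Tensor, p_series=0.3, p_voxel=0.4):
    """x: (B, S, L) -- B subjects, S MRI series, L flattened voxels per series.
    Randomly drop whole series and random voxels; return the masked input
    and the boolean mask marking what must be reconstructed."""
    drop_series = torch.rand(x.shape[:2]) < p_series   # (B, S) whole-series drops
    drop_voxel = torch.rand_like(x) < p_voxel          # (B, S, L) voxel drops
    mask = drop_series.unsqueeze(-1) | drop_voxel
    return x.masked_fill(mask, 0.0), mask

encoder = nn.Sequential(nn.Linear(64, 128), nn.GELU(), nn.Linear(128, 64))
x = torch.randn(2, 4, 64)                # toy: 4 series, 64 "voxels" each
x_masked, mask = cross_series_mask(x)
recon = encoder(x_masked)                # reconstruct from unmasked context
loss = ((recon - x)[mask] ** 2).mean()   # supervise only the masked positions
loss.backward()
print(loss.item())
```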

[CV-137] Label up: Learning Pulmonary Embolism Segmentation from Image Level Annotation through Model Explainability

【速读】: 该论文试图解决肺栓塞(Pulmonary Embolisms, PE)诊断中由于标注数据稀缺,尤其是缺乏细粒度(pixel level)血栓负荷标注,导致AI模型性能受限的问题。解决方案的关键在于引入一种弱监督学习流程,利用模型可解释性从粗粒度(binary, image level)的PE标注生成细粒度的栓塞区域掩码(pixel level masks),并通过自动生成的像素级标注进行模型训练,从而实现良好的PE定位性能。该方法的有效性在多中心、大规模的RSPECT增强数据集上得到了验证。

链接: https://arxiv.org/abs/2412.07384
作者: Florin Condrea,Saikiran Rapaka,Marius Leordeanu
关键词-EN: cardiovascular death, Pulmonary Embolisms, diagnosing pulmonary embolisms, Computed tomographic pulmonary, tomographic pulmonary angiography
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Pulmonary Embolisms (PE) are a leading cause of cardiovascular death. Computed tomographic pulmonary angiography (CTPA) stands as the gold standard for diagnosing pulmonary embolisms (PE), and there has been a lot of interest in developing AI-based models for assisting in PE diagnosis. Performance of these algorithms has been hindered by the scarcity of annotated data, especially data with fine-grained delineation of the thromboembolic burden. In this paper we attempt to address this issue by introducing a weakly supervised learning pipeline that leverages model explainability to generate fine-grained (pixel level) masks for embolisms starting from more coarse-grained (binary, image level) PE annotations. Furthermore, we show that training models using the automatically generated pixel annotations yields good PE localization performance. We demonstrate the effectiveness of our pipeline on the large-scale, multi-center RSPECT augmented dataset for PE detection and localization.
zh
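One simple way to realize "explainability to pixel masks", assumed here purely for illustration, is thresholding a normalized class-activation map produced by the image-level classifier. The paper's pipeline is more involved, but the sketch conveys the weak-to-fine supervision step.

```python
import numpy as np

def cam_to_pseudo_mask(cam: np.ndarray, image_level_label: int,
                       threshold: float = 0.6) -> np.ndarray:
    """Turn a class-activation heatmap from an image-level PE classifier
    into a pixel-level pseudo-mask (a simplified stand-in for the paper)."""
    if image_level_label == 0:
        # Negative study: by definition no embolism pixels.
        return np.zeros_like(cam, dtype=np.uint8)
    cam = (cam - cam.min()) / (np.ptp(cam) + 1e-8)  # normalize to [0, 1]
    return (cam >= threshold).astype(np.uint8)

cam = np.random.rand(128, 128)  # placeholder Grad-CAM-style heatmap
mask = cam_to_pseudo_mask(cam, image_level_label=1)
print(mask.sum(), "pixels flagged; such masks then train a segmentation model")
```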

[CV-138] QuantFormer: Learning to Quantize for Neural Activity Forecasting in Mouse Visual Cortex

【速读】: 该论文试图解决神经信号预测中的挑战,特别是由于神经信号的时空稀疏性和复杂依赖性导致的预测困难。解决方案的关键在于引入QuantFormer,这是一种基于transformer的模型,专门设计用于从双光子钙成像数据中预测神经活动。QuantFormer通过将预测任务重新定义为分类问题,利用动态信号量化(dynamic signal quantization)来更有效地学习稀疏的神经激活模式,从而克服了传统回归方法的局限性。此外,QuantFormer通过引入神经元特异性标记(neuron-specific tokens),能够处理来自任意数量神经元的多元信号,实现了对不同神经元群体的可扩展性。通过在Allen数据集上的无监督量化训练,QuantFormer在预测小鼠视觉皮层活动方面设定了新的基准,展示了其在不同刺激和个体间的鲁棒性能和泛化能力。

链接: https://arxiv.org/abs/2412.07264
作者: Salvatore Calcagno,Isaak Kavasidis,Simone Palazzo,Marco Brondi,Luca Sità,Giacomo Turri,Daniela Giordano,Vladimir R. Kostic,Tommaso Fellin,Massimiliano Pontil,Concetto Spampinato
关键词-EN: Understanding complex animal, animal behaviors hinges, complex animal behaviors, Understanding complex, brain circuits
类目: Neurons and Cognition (q-bio.NC); Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV); Signal Processing (eess.SP)
备注:

点击查看摘要

Abstract:Understanding complex animal behaviors hinges on deciphering the neural activity patterns within brain circuits, making the ability to forecast neural activity crucial for developing predictive models of brain dynamics. This capability holds immense value for neuroscience, particularly in applications such as real-time optogenetic interventions. While traditional encoding and decoding methods have been used to map external variables to neural activity and vice versa, they focus on interpreting past data. In contrast, neural forecasting aims to predict future neural activity, presenting a unique and challenging task due to the spatiotemporal sparsity and complex dependencies of neural signals. Existing transformer-based forecasting methods, while effective in many domains, struggle to capture the distinctiveness of neural signals characterized by spatiotemporal sparsity and intricate dependencies. To address this challenge, we here introduce QuantFormer, a transformer-based model specifically designed for forecasting neural activity from two-photon calcium imaging data. Unlike conventional regression-based approaches, QuantFormer reframes the forecasting task as a classification problem via dynamic signal quantization, enabling more effective learning of sparse neural activation patterns. Additionally, QuantFormer tackles the challenge of analyzing multivariate signals from an arbitrary number of neurons by incorporating neuron-specific tokens, allowing scalability across diverse neuronal populations. Trained with unsupervised quantization on the Allen dataset, QuantFormer sets a new benchmark in forecasting mouse visual cortex activity. It demonstrates robust performance and generalization across various stimuli and individuals, paving the way for a foundational model in neural signal prediction.
zh
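The reframing of regression as classification hinges on signal quantization. The sketch below uses simple per-trace uniform binning (an assumption; the paper's dynamic quantization is learned) to turn continuous traces into token targets for a cross-entropy loss.

```python
import torch
import torch.nn.functional as F

def quantize_signal(x: torch.Tensor, n_bins: int = 32) -> torch.Tensor:
    """Map continuous calcium traces to discrete tokens so forecasting
    becomes classification. Here: per-trace uniform binning."""
    lo = x.amin(dim=-1, keepdim=True)
    hi = x.amax(dim=-1, keepdim=True)
    return ((x - lo) / (hi - lo + 1e-8) * (n_bins - 1)).round().long()  # (B, T)

traces = torch.randn(4, 100).cumsum(-1)   # toy fluorescence traces
targets = quantize_signal(traces)         # discrete class targets per step
logits = torch.randn(4, 100, 32)          # stand-in for transformer outputs
loss = F.cross_entropy(logits.reshape(-1, 32), targets.reshape(-1))
print(loss.item())
```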

[CV-139] Modeling Dual-Exposure Quad-Bayer Patterns for Joint Denoising and Deblurring

【速读】: 该论文试图解决图像系统中由于噪声和模糊导致的图像退化问题,这一问题源于硬件和方法学的局限性。单帧图像处理面临噪声抑制与运动模糊之间的权衡,而基于学习的单帧增强方法由于信息有限往往过度平滑。多帧解决方案通过连拍模式获取更多时空信息,但常因相机或场景运动导致对齐困难。论文提出的解决方案关键在于利用一种基于物理模型的图像恢复方法,采用新型双曝光Quad-Bayer模式传感器,通过同时捕捉短曝光和长曝光图像,整合互补的噪声-模糊信息。此外,论文还引入了Quad-Bayer合成方法(B2QB)以模拟传感器数据,并设计了分层卷积神经网络QRNet,通过输入增强块和多层次特征提取来提升恢复质量。实验结果表明,该方法在合成和真实数据集上均优于现有的去模糊和去噪方法。

链接: https://arxiv.org/abs/2412.07256
作者: Yuzhi Zhao,Lai-Man Po,Xin Ye,Yongzhe Xu,Qiong Yan
关键词-EN: Image degradation caused, imaging systems, hardware and methodology, degradation caused, remains a persistent
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注: accepted by IEEE Transactions on Image Processing (TIP)

点击查看摘要

Abstract:Image degradation caused by noise and blur remains a persistent challenge in imaging systems, stemming from limitations in both hardware and methodology. Single-image solutions face an inherent tradeoff between noise reduction and motion blur. While short exposures can capture clear motion, they suffer from noise amplification. Long exposures reduce noise but introduce blur. Learning-based single-image enhancers tend to be over-smooth due to the limited information. Multi-image solutions using burst mode avoid this tradeoff by capturing more spatial-temporal information but often struggle with misalignment from camera/scene motion. To address these limitations, we propose a physical-model-based image restoration approach leveraging a novel dual-exposure Quad-Bayer pattern sensor. By capturing pairs of short and long exposures at the same starting point but with varying durations, this method integrates complementary noise-blur information within a single image. We further introduce a Quad-Bayer synthesis method (B2QB) to simulate sensor data from Bayer patterns to facilitate training. Based on this dual-exposure sensor model, we design a hierarchical convolutional neural network called QRNet to recover high-quality RGB images. The network incorporates input enhancement blocks and multi-level feature extraction to improve restoration quality. Experiments demonstrate superior performance over state-of-the-art deblurring and denoising methods on both synthetic and real-world datasets. The code, model, and datasets are publicly available at this https URL.
zh

[CV-140] Robust Feature Engineering Techniques for Designing Efficient Motor Imagery-Based BCI-Systems

【速读】: 该论文旨在解决脑机接口 (Brain-Computer Interface, BCI) 系统在处理运动障碍患者脑电图 (EEG) 数据时面临的复杂性和低效问题。解决方案的关键在于通过强大的特征工程 (signal processing) 方法提升机器学习 (Machine Learning, ML) 模型的性能。具体来说,研究采用了时间域、频率域和小波变换特征,并通过最大相关最小冗余 (Maximum Relevance Minimum Redundancy, MRMR) 方法筛选出最具代表性的四个特征。随后,使用支持向量机 (Support Vector Machine, SVM) 等分类模型进行评估,最终在运动活动和运动想象任务上分别达到了92.50%和95.48%的测试准确率,显著优于之前研究的74.36%。这一改进为设计简单、成本效益高且可靠的神经康复BCI系统提供了重要依据。

链接: https://arxiv.org/abs/2412.07175
作者: Syed Saim Gardezi,Soyiba Jawed,Mahnoor Khan,Muneeba Bukhari,Rizwan Ahmed Khan
关键词-EN: multitude of individuals, globe grapple, utilizing Brain-Computer Interface, BCI systems, motor disabilities
类目: ignal Processing (eess.SP); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: 26 pages

点击查看摘要

Abstract:A multitude of individuals across the globe grapple with motor disabilities. Neural prosthetics utilizing Brain-Computer Interface (BCI) technology exhibit promise for improving motor rehabilitation outcomes. The intricate nature of EEG data poses a significant hurdle for current BCI systems. Recently, a qualitative repository of EEG signals tied to both upper and lower limb execution of motor and motor imagery tasks has been unveiled. Despite this, the performance of the Machine Learning (ML) models that were trained on this dataset was alarmingly deficient, and the evaluation framework seemed insufficient. To enhance outcomes, robust feature engineering (signal processing) methodologies are implemented. A collection of time domain, frequency domain, and wavelet-derived features was obtained from 16-channel EEG signals, and the Maximum Relevance Minimum Redundancy (MRMR) approach was employed to identify the four most significant features. For classification, K Nearest Neighbors (KNN), Support Vector Machine (SVM), Decision Tree (DT), and Naïve Bayes (NB) models were implemented with these selected features, evaluating their effectiveness through metrics such as testing accuracy, precision, recall, and F1 Score. By leveraging SVM with a Gaussian Kernel, a remarkable maximum testing accuracy of 92.50% for motor activities and 95.48% for imagery activities is achieved. These results are notably more dependable and gratifying compared to the previous study, where the peak accuracy was recorded at 74.36%. This research work provides an in-depth analysis of the MI Limb EEG dataset and it will help in designing and developing simple, cost-effective and reliable BCI systems for neuro-rehabilitation.
zh
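The classification stage maps onto a standard scikit-learn pipeline. The sketch below uses random stand-in data for the four MRMR-selected features, so the reported accuracy is near chance; only the structure (scaling plus an RBF-kernel SVM with cross-validation) mirrors the study.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))      # 4 MRMR-selected features per trial (toy)
y = rng.integers(0, 2, size=200)   # binary task labels (toy)

# Gaussian (RBF) kernel SVM, the classifier that reached 92.50%/95.48%.
clf = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0, gamma="scale"))
scores = cross_val_score(clf, X, y, cv=5, scoring="accuracy")
print(f"CV accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")
```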

[CV-141] QCResUNet: Joint Subject-level and Voxel-level Segmentation Quality Prediction

【速读】: 该论文试图解决自动脑肿瘤分割(automated brain tumor segmentation)中由于低质量分割异常值(poor-quality segmentation outliers)导致的可靠性问题,尤其是在分布外样本(out-of-distribution samples)中的表现。解决方案的关键在于提出了一种名为QCResUNet的多任务深度学习架构,该架构不仅能够提供受试者级别的分割质量评估(subject-level segmentation-quality measures),还能生成每个组织类别的体素级别的分割错误图(voxel-level segmentation error maps),从而帮助识别需要改进的错误部分。这种方法通过同时处理脑肿瘤和心脏MRI分割任务,验证了其在临床实践中通过人机反馈(human-in-the-loop feedback)改进分割结果的潜力。

链接: https://arxiv.org/abs/2412.07156
作者: Peijie Qiu,Satrajit Chakrabarty,Phuc Nguyen,Soumyendu Sekhar Ghosh,Aristeidis Sotiras
关键词-EN: made significant strides, segmentation, magnetic resonance imaging, scans in recent, recent years
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Deep learning has made significant strides in automated brain tumor segmentation from magnetic resonance imaging (MRI) scans in recent years. However, the reliability of these tools is hampered by the presence of poor-quality segmentation outliers, particularly in out-of-distribution samples, making their implementation in clinical practice difficult. Therefore, there is a need for quality control (QC) to screen the quality of the segmentation results. Although numerous automatic QC methods have been developed for segmentation quality screening, most were designed for cardiac MRI segmentation, which involves a single modality and a single tissue type. Furthermore, most prior works only provided subject-level predictions of segmentation quality and did not identify erroneous parts of the segmentation that may require refinement. To address these limitations, we proposed a novel multi-task deep learning architecture, termed QCResUNet, which produces subject-level segmentation-quality measures as well as voxel-level segmentation error maps for each available tissue class. To validate the effectiveness of the proposed method, we assessed its performance on two distinct segmentation-quality evaluation tasks. First, we aimed to assess the quality of brain tumor segmentation results. For this task, we performed experiments on one internal and two external datasets. Second, we aimed to evaluate the segmentation quality of cardiac Magnetic Resonance Imaging (MRI) data from the Automated Cardiac Diagnosis Challenge. The proposed method achieved high performance in predicting subject-level segmentation-quality metrics and accurately identifying segmentation errors on a voxel basis. This has the potential to be used to guide human-in-the-loop feedback to improve segmentations in clinical settings.
zh

[CV-142] Primary visual cortex contributes to color constancy by predicting rather than discounting the illuminant: evidence from a computational study

【速读】: 该论文试图解决初级视觉皮层(V1)在颜色恒常性(Color Constancy, CC)中的作用问题,特别是双对立(Double-Opponent, DO)神经元在实现CC中的计算机制。解决方案的关键在于构建了一个基于电生理学的V1神经元模型,通过从自然图像数据集中学习光源颜色,并结合定性和定量分析模型神经元的响应特性,发现学习到的模型神经元的感受野空间结构和颜色权重与V1中的简单神经元和DO神经元非常相似。计算上,DO细胞在光源预测方面比简单细胞更为稳健,从而提供了计算证据支持V1中的DO神经元通过编码光源来实现颜色恒常性,这与传统认为V1通过DO细胞抵消光源影响的假设相矛盾。这一发现不仅有助于解析CC的视觉机制,还为开发更有效的计算机视觉模型提供了启示。

链接: https://arxiv.org/abs/2412.07102
作者: Shaobing Gao,Yongjie Li
关键词-EN: human visual system, visual system contribute, visual system, important ability, stably perceive
类目: Neurons and Cognition (q-bio.NC); Computer Vision and Pattern Recognition (cs.CV)
备注: 26 pages, 11 figures

点击查看摘要

Abstract:Color constancy (CC) is an important ability of the human visual system to stably perceive the colors of objects despite considerable changes in the color of the light illuminating them. While increasing evidence from the field of neuroscience supports that multiple levels of the visual system contribute to the realization of CC, how the primary visual cortex (V1) plays a role in CC is not fully resolved. Specifically, double-opponent (DO) neurons in V1 have been thought to contribute to realizing a degree of CC, but the computational mechanism is not clear. We build an electrophysiologically based V1 neural model to learn the color of the light source from a natural image dataset with the ground truth illuminants as the labels. Based on the qualitative and quantitative analysis of the responsive properties of the learned model neurons, we found that both the spatial structures and color weights of the receptive fields of the learned model neurons are quite similar to those of the simple and DO neurons recorded in V1. Computationally, DO cells perform more robustly than the simple cells in V1 for illuminant prediction. Therefore, this work provides computational evidence supporting that V1 DO neurons serve to realize color constancy by encoding the illuminant, which contradicts the common hypothesis that V1 contributes to CC by discounting the illuminant using its DO cells. This evidence is expected to not only help resolve the visual mechanisms of CC, but also provide inspiration to develop more effective computer vision models.
zh

[CV-143] Light Field Image Quality Assessment With Auxiliary Learning Based on Depthwise and Anglewise Separable Convolutions

【速读】: 该论文试图解决未来沉浸式多媒体广播服务中的无参考光场图像质量评估 (NR-LFIQA) 问题,旨在通过智能数据传输优化用户体验。解决方案的关键在于提出了两种新的卷积方法:光场深度方向可分离卷积 (LF-DSC) 和光场角度方向可分离卷积 (LF-ASC)。LF-DSC 用于高效提取光场图像 (LFI) 的空间特征,而 LF-ASC 则进一步扩展到角度空间,能够同时提取空间和角度特征,从而实现低复杂度的全面质量评估。此外,论文还首次探索了深度辅助学习方法,通过空间和角度特征估计作为辅助任务,为主要的 NR-LFIQA 任务提供质量特征提示。实验结果表明,所提出的方法在主流 LFI 数据集上显著优于现有的全参考 IQA 方法和最先进的 NR-LFIQA 方法。

链接: https://arxiv.org/abs/2412.07079
作者: Qiang Qu,Xiaoming Chen,Vera Chung,Zhibo Chen
关键词-EN: optimizing user experience, support intelligent data, intelligent data transmission, image quality assessment, light field image
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:In multimedia broadcasting, no-reference image quality assessment (NR-IQA) is used to indicate the user-perceived quality of experience (QoE) and to support intelligent data transmission while optimizing user experience. This paper proposes an improved no-reference light field image quality assessment (NR-LFIQA) metric for future immersive media broadcasting services. First, we extend the concept of depthwise separable convolution (DSC) to the spatial domain of light field images (LFIs) and introduce “light field depthwise separable convolution (LF-DSC)”, which can extract the LFI’s spatial features efficiently. Second, we further theoretically extend the LF-DSC to the angular space of LFI and introduce the novel concept of “light field anglewise separable convolution (LF-ASC)”, which is capable of extracting both the spatial and angular features for comprehensive quality assessment with low complexity. Third, we define the spatial and angular feature estimations as auxiliary tasks in aiding the primary NR-LFIQA task by providing spatial and angular quality features as hints. To the best of our knowledge, this work is the first exploration of deep auxiliary learning with spatial-angular hints on NR-LFIQA. Experiments were conducted on mainstream LFI datasets such as Win5-LID and SMART, with comparisons to mainstream full-reference IQA metrics as well as the state-of-the-art NR-LFIQA methods. The experimental results show that the proposed metric yields overall 42.86% and 45.95% smaller prediction errors than the second-best benchmarking metric in Win5-LID and SMART, respectively. In some challenging cases with particular distortion types, the proposed metric can reduce the errors significantly by more than 60%.
zh
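The depthwise separable convolution underlying LF-DSC is a standard building block; below, the angular views of a light field are folded into the channel axis so each view receives its own spatial filter before a 1x1 pointwise mixing. Folding views into channels is an illustrative simplification of the paper's formulation, not its exact operator.

```python
import torch
import torch.nn as nn

class DepthwiseSeparableConv2d(nn.Module):
    """Depthwise conv (one spatial filter per channel) followed by a 1x1
    pointwise mix -- the block that LF-DSC extends to light-field views
    and LF-ASC extends to the angular dimensions."""
    def __init__(self, in_ch: int, out_ch: int, k: int = 3):
        super().__init__()
        self.depthwise = nn.Conv2d(in_ch, in_ch, k, padding=k // 2, groups=in_ch)
        self.pointwise = nn.Conv2d(in_ch, out_ch, 1)

    def forward(self, x):
        return self.pointwise(self.depthwise(x))

# A light field with 5x5 angular views folded into the channel axis:
# each view gets its own spatial filter before views are mixed pointwise.
lf = torch.randn(1, 25, 64, 64)    # (batch, u*v views, height, width)
block = DepthwiseSeparableConv2d(25, 32)
print(block(lf).shape)             # torch.Size([1, 32, 64, 64])
```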

[CV-144] Generalized Least Squares Kernelized Tensor Factorization

【速读】: 该论文试图解决多维张量数据中缺失值的补全问题,特别是在处理全局依赖性和局部高频变化时遇到的挑战。解决方案的关键在于提出了广义最小二乘核化张量分解框架 (Generalized Least Squares Kernelized Tensor Factorization, GLSKF),该框架将平滑约束的低秩分解与局部相关残差过程相结合,形成了一个加性结构,能够有效捕捉全局依赖性和局部变化。具体而言,GLSKF通过协方差范数来强制全局低秩分解中的因子矩阵平滑性,并使用结构化协方差/核函数来建模局部过程。为了高效处理缺失数据,GLSKF采用了保留Kronecker结构协方差的投影矩阵,并通过共轭梯度 (CG) 和预处理共轭梯度 (PCG) 算法加速计算。实验结果表明,GLSKF在多个真实数据集上的表现优于现有方法,具有更高的有效性和可扩展性。

链接: https://arxiv.org/abs/2412.07041
作者: Mengying Lei,Lijun Sun
关键词-EN: low-rank factorization, Smoothness-constrained low-rank factorization, GLSKF, factorization, Kernelized Tensor Factorization
类目: Machine Learning (stat.ML); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Real-world datasets often contain missing or corrupted values. Completing multidimensional tensor-structured data with missing entries is essential for numerous applications. Smoothness-constrained low-rank factorization models have shown superior performance with reduced computational costs. While effective at capturing global and long-range correlations, these models struggle to reproduce short-scale, high-frequency variations in the data. In this paper, we introduce the Generalized Least Squares Kernelized Tensor Factorization (GLSKF) framework for tensor completion. GLSKF integrates smoothness-constrained low-rank factorization with a locally correlated residual process; the resulting additive structure can effectively characterize both global dependencies and local variations. In particular, we define the covariance norm to enforce the smoothness of factor matrices in the global low-rank factorization, and use structured covariance/kernel functions to model the local processes. For model estimation, we develop an alternating least squares (ALS) procedure with closed-form solutions for each subproblem. To efficiently handle missing data, GLSKF utilizes projection matrices that preserve the Kronecker structure of covariances, facilitating fast computations through conjugate gradient (CG) and preconditioned conjugate gradient (PCG) algorithms. The proposed framework is evaluated on four real-world datasets across diverse tasks: traffic speed imputation, color image inpainting, video completion, and MRI image reconstruction. Experimental results confirm that GLSKF delivers superior effectiveness and scalability, establishing it as a robust solution for multidimensional tensor completion.
zh

[CV-145] Ptychoformer: A Physics-Guided Deep Learning Framework for Ptychographic Imaging

【速读】: 该论文试图解决传统深度学习(DL)在衍射图案恢复中的局限性问题,特别是传统神经网络架构未能充分利用衍射数据的物理特性,如径向强度衰减和分布在同心环中的相干信息。解决方案的关键在于提出了Ptychoformer,这是一个物理引导的深度学习框架,通过引入双分支架构来同时考虑局部和非局部依赖性,并结合极坐标注意力机制(Polar Coordinate Attention, PCA),该机制受X射线晶体学中的Ewald构造启发,以增强高频分量的保真度。实验结果表明,Ptychoformer在模拟和真实数据上都显著提升了细节保留和伪影抑制能力,优于现有方法。

链接: https://arxiv.org/abs/2412.06806
作者: Han Yue,Jun Cheng,Yu-Xuan Ren,Philip Heng Wai Leong,Steve Feng Shu
关键词-EN: applying deep learning, imaging confronts limitations, Ptychographic imaging confronts, deep learning, confronts limitations
类目: Optics (physics.optics); Computer Vision and Pattern Recognition (cs.CV)
备注: 12 pages, 10 figures

点击查看摘要

Abstract:Ptychographic imaging confronts limitations in applying deep learning (DL) for retrieval from diffraction patterns. Conventional neural architectures are optimized for natural images, overlooking the unique physical characteristics of diffraction data, including radial intensity decay and coherent information distributed in concentric rings. In this paper, we present Ptychoformer, a physics-guided DL framework for ptychographic imaging that aligns attention mechanisms and feature extraction with these diffraction physics properties by introducing a dual-branch architecture that accounts for both local and non-local dependencies in the patterns. It includes a Polar Coordinate Attention (PCA) mechanism, inspired by the Ewald construction in X-ray crystallography, to enhance high-frequency component fidelity. Experimental results demonstrate Ptychoformer’s superior performance across both simulated and real data in preserving fine details and suppressing artifacts. On simulated data, Ptychoformer achieves up to 5.4% higher PSNR and 4.2% higher SSIM for amplitude retrieval compared to existing methods. For real experimental data, it demonstrates up to 12.5% higher PSNR and 31.3% higher SSIM for amplitude retrieval. Notably, Ptychoformer maintains robust performance under limited training data and low overlap ratios, outperforming existing models.
zh

人工智能

[AI-0] Mobile-TeleVision: Predictive Motion Priors for Humanoid Whole-Body Control

链接: https://arxiv.org/abs/2412.07773
作者: Chenhao Lu,Xuxin Cheng,Jialong Li,Shiqi Yang,Mazeyu Ji,Chengjing Yuan,Ge Yang,Sha Yi,Xiaolong Wang
关键词-EN: Humanoid robots require, recent Reinforcement Learning, Reinforcement Learning, Conditional Variational Autoencoder, robust lower-body locomotion
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Humanoid robots require both robust lower-body locomotion and precise upper-body manipulation. While recent Reinforcement Learning (RL) approaches provide whole-body loco-manipulation policies, they lack precise manipulation with high DoF arms. In this paper, we propose decoupling upper-body control from locomotion, using inverse kinematics (IK) and motion retargeting for precise manipulation, while RL focuses on robust lower-body locomotion. We introduce PMP (Predictive Motion Priors), trained with Conditional Variational Autoencoder (CVAE) to effectively represent upper-body motions. The locomotion policy is trained conditioned on this upper-body motion representation, ensuring that the system remains robust with both manipulation and locomotion. We show that CVAE features are crucial for stability and robustness, and significantly outperforms RL-based whole-body control in precise manipulation. With precise upper-body motion and robust lower-body locomotion control, operators can remotely control the humanoid to walk around and explore different environments, while performing diverse manipulation tasks.

[AI-1] FlashRNN: Optimizing Traditional RNNs on Modern Hardware

链接: https://arxiv.org/abs/2412.07752
作者: Korbinian Pöppel,Maximilian Beck,Sepp Hochreiter
关键词-EN: sequence-parallelizable neural network, neural network architectures, specifically lack state-tracking, sequence-parallelizable neural, specifically lack
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:While Transformers and other sequence-parallelizable neural network architectures seem like the current state of the art in sequence modeling, they specifically lack state-tracking capabilities. These are important for time-series tasks and logical reasoning. Traditional RNNs like LSTMs and GRUs, as well as modern variants like sLSTM, do have these capabilities at the cost of strictly sequential processing. While this is often seen as a strong limitation, we show how fast these networks can get with our hardware optimization FlashRNN, implemented in Triton and CUDA, which optimizes kernels down to the register level on modern GPUs. We extend traditional RNNs with a parallelization variant that processes multiple RNNs of smaller hidden state in parallel, similar to the head-wise processing in Transformers. To enable flexibility on different GPU variants, we introduce a new optimization framework for hardware-internal cache sizes, memory and compute handling. It models the hardware in a setting using polyhedral-like constraints, including the notion of divisibility. This speeds up the solution process in our ConstrINT library for general integer constraint satisfaction problems (integer CSPs). We show that our kernels can achieve 50x speed-ups over a vanilla PyTorch implementation and allow 40x larger hidden sizes compared to our Triton implementation. Our open-source kernels and the optimization library are released to boost research in the direction of state-tracking enabled RNNs and sequence modeling: this https URL

[AI-2] Predictive Modeling of Homeless Service Assignment: A Representation Learning Approach

链接: https://arxiv.org/abs/2412.07747
作者: Khandker Sadia Rahman,Charalampos Chelmis
关键词-EN: leveraging machine learning, recent years, accurate machine learning, machine learning methods, machine learning
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:In recent years, there has been growing interest in leveraging machine learning for homeless service assignment. However, the categorical nature of administrative data recorded for homeless individuals hinders the development of accurate machine learning methods for this task. This work asserts that deriving latent representations of such features, while at the same time leveraging underlying relationships between instances is crucial in algorithmically enhancing the existing assignment decision-making process. Our proposed approach learns temporal and functional relationships between services from historical data, as well as unobserved but relevant relationships between individuals to generate features that significantly improve the prediction of the next service assignment compared to the state-of-the-art.

[AI-3] Benchmark for Evaluation and Analysis of Citation Recommendation Models

链接: https://arxiv.org/abs/2412.07713
作者: Puja Maharjan
关键词-EN: Citation recommendation, Citation, compare citation recommendation, academic interest, Citation recommendation systems
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Digital Libraries (cs.DL)
备注: 10 pages

点击查看摘要

Abstract:Citation recommendation systems have attracted much academic interest, resulting in many studies and implementations. These systems help authors automatically generate proper citations by suggesting relevant references based on the text they have written. However, the methods used in citation recommendation differ across various studies and implementations. Some approaches focus on the overall content of papers, while others consider the context of the citation text. Additionally, the datasets used in these studies include different aspects of papers, such as metadata, citation context, or even the full text of the paper in various formats and structures. The diversity in models, datasets, and evaluation metrics makes it challenging to assess and compare citation recommendation methods effectively. To address this issue, a standardized dataset and evaluation metrics are needed to evaluate these models consistently. Therefore, we propose developing a benchmark specifically designed to analyze and compare citation recommendation models. This benchmark will evaluate the performance of models on different features of the citation context and provide a comprehensive evaluation of the models across all these tasks, presenting the results in a standardized way. By creating a benchmark with standardized evaluation metrics, researchers and practitioners in the field of citation recommendation will have a common platform to assess and compare different models. This will enable meaningful comparisons and help identify promising approaches for further research and development in the field.

[AI-4] Optimizing Sensor Redundancy in Sequential Decision-Making Problems

链接: https://arxiv.org/abs/2412.07686
作者: Jonas Nüßlein,Maximilian Zorn,Fabian Ritz,Jonas Stein,Gerhard Stenzel,Julian Schönberger,Thomas Gabor,Claudia Linnhoff-Popien
关键词-EN: Reinforcement Learning, cumulative future rewards, predict actions based, maximize cumulative future, future rewards
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Accepted at ICAART conference 2025

点击查看摘要

Abstract:Reinforcement Learning (RL) policies are designed to predict actions based on current observations to maximize cumulative future rewards. In real-world applications (i.e., non-simulated environments), sensors are essential for measuring the current state and providing the observations on which RL policies rely to make decisions. A significant challenge in deploying RL policies in real-world scenarios is handling sensor dropouts, which can result from hardware malfunctions, physical damage, or environmental factors like dust on a camera lens. A common strategy to mitigate this issue is the use of backup sensors, though this comes with added costs. This paper explores the optimization of backup sensor configurations to maximize expected returns while keeping costs below a specified threshold, C. Our approach uses a second-order approximation of expected returns and includes penalties for exceeding cost constraints. We then optimize this quadratic program using Tabu Search, a meta-heuristic algorithm. The approach is evaluated across eight OpenAI Gym environments and a custom Unity-based robotic environment (RobotArmGrasping). Empirical results demonstrate that our quadratic program effectively approximates real expected returns, facilitating the identification of optimal sensor configurations.
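A compact sketch of the search loop: binary backup-sensor configurations are explored by single-bit flips, recently flipped indices are tabu for a fixed tenure, and only configurations within the cost budget are considered. The toy quadratic `value` function stands in for the paper's second-order approximation of expected returns; all names and parameters are illustrative assumptions.

```python
import random

def tabu_search(cost, value, budget, n_sensors, iters=200, tenure=10):
    """Maximize value(x) over binary sensor configurations x under a cost
    budget via tabu search: flip one sensor per step, and forbid flipping
    a recently changed index back for `tenure` steps."""
    x = [0] * n_sensors
    best, best_val = x[:], value(x)
    tabu = {}
    for t in range(iters):
        candidates = []
        for i in range(n_sensors):
            if tabu.get(i, -1) >= t:  # index i is still tabu
                continue
            y = x[:]; y[i] ^= 1
            if sum(c * b for c, b in zip(cost, y)) <= budget:
                candidates.append((value(y), i, y))
        if not candidates:
            break
        val, i, x = max(candidates)   # best admissible single-flip neighbor
        tabu[i] = t + tenure
        if val > best_val:
            best, best_val = x[:], val
    return best, best_val

random.seed(0)
n = 8
cost = [random.randint(1, 4) for _ in range(n)]
q = [[random.uniform(-1, 1) for _ in range(n)] for _ in range(n)]  # toy quadratic
value = lambda x: sum(q[i][j] * x[i] * x[j] for i in range(n) for j in range(n))
print(tabu_search(cost, value, budget=10, n_sensors=n))
```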

[AI-5] The Pitfalls of Memorization: When Memorization Hurts Generalization

链接: https://arxiv.org/abs/2412.07684
作者: Reza Bayat,Mohammad Pezeshki,Elvis Dohmatob,David Lopez-Paz,Pascal Vincent
关键词-EN: Neural networks, learned explanations
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
备注:

点击查看摘要

Abstract:Neural networks often learn simple explanations that fit the majority of the data while memorizing exceptions that deviate from these explanations. This behavior leads to poor generalization when the learned explanations rely on spurious correlations. In this work, we formalize the interplay between memorization and generalization, showing that spurious correlations particularly lead to poor generalization when combined with memorization. Memorization can reduce training loss to zero, leaving no incentive to learn robust, generalizable patterns. To address this, we propose memorization-aware training (MAT), which uses held-out predictions as a signal of memorization to shift a model’s logits. MAT encourages learning robust patterns invariant across distributions, improving generalization under distribution shifts.
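A hedged sketch of the logit-shifting idea, with the caveat that the exact shift MAT applies may differ in sign and scaling from what is shown here: following common logit-adjustment practice, held-out log-probabilities are folded into the logits before the cross-entropy, so examples that a held-out model already explains carry less pressure to memorize.

```python
import torch
import torch.nn.functional as F

def mat_loss(logits, targets, heldout_probs, tau: float = 1.0):
    """Memorization-aware training sketch: shift the logits by (scaled)
    held-out log-probabilities before cross-entropy. `heldout_probs` come
    from a model that never saw these examples (e.g., k-fold held-out
    predictions). The precise form of the shift is an assumption."""
    shifted = logits + tau * torch.log(heldout_probs + 1e-8)
    return F.cross_entropy(shifted, targets)

logits = torch.randn(16, 10, requires_grad=True)
targets = torch.randint(0, 10, (16,))
heldout = torch.softmax(torch.randn(16, 10), dim=-1)  # toy held-out predictions
loss = mat_loss(logits, targets, heldout)
loss.backward()
print(loss.item())
```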

[AI-6] Offline Multi-Agent Reinforcement Learning via In-Sample Sequential Policy Optimization

链接: https://arxiv.org/abs/2412.07639
作者: Zongkai Liu,Qian Lin,Chao Yu,Xiawei Wu,Yile Liang,Donghui Li,Xuetao Ding
关键词-EN: Multi-Agent Reinforcement Learning, Reinforcement Learning, Offline Multi-Agent Reinforcement, learn optimal multi-agent, Multi-Agent Reinforcement
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Offline Multi-Agent Reinforcement Learning (MARL) is an emerging field that aims to learn optimal multi-agent policies from pre-collected datasets. Compared to single-agent case, multi-agent setting involves a large joint state-action space and coupled behaviors of multiple agents, which bring extra complexity to offline policy optimization. In this work, we revisit the existing offline MARL methods and show that in certain scenarios they can be problematic, leading to uncoordinated behaviors and out-of-distribution (OOD) joint actions. To address these issues, we propose a new offline MARL algorithm, named In-Sample Sequential Policy Optimization (InSPO). InSPO sequentially updates each agent’s policy in an in-sample manner, which not only avoids selecting OOD joint actions but also carefully considers teammates’ updated policies to enhance coordination. Additionally, by thoroughly exploring low-probability actions in the behavior policy, InSPO can well address the issue of premature convergence to sub-optimal solutions. Theoretically, we prove InSPO guarantees monotonic policy improvement and converges to quantal response equilibrium (QRE). Experimental results demonstrate the effectiveness of our method compared to current state-of-the-art offline MARL methods.

[AI-7] TrojanWhisper: Evaluating Pre-trained LLMs to Detect and Localize Hardware Trojans

链接: https://arxiv.org/abs/2412.07636
作者: Md Omar Faruque,Peter Jamieson,Ahmad Patooghy,Abdel-Hameed A. Badawy
关键词-EN: Existing Hardware Trojans, formal verification methods, verification methods suffer, logic testing struggles, side-channel analysis requires
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Existing Hardware Trojans (HT) detection methods face several critical limitations: logic testing struggles with scalability and coverage for large designs, side-channel analysis requires golden reference chips, and formal verification methods suffer from state-space explosion. The emergence of Large Language Models (LLMs) offers a promising new direction for HT detection by leveraging their natural language understanding and reasoning capabilities. For the first time, this paper explores the potential of general-purpose LLMs in detecting various HTs inserted in Register Transfer Level (RTL) designs, including SRAM, AES, and UART modules. We propose a novel tool for this goal that systematically assesses state-of-the-art LLMs (GPT-4o, Gemini 1.5 pro, and Llama 3.1) in detecting HTs without prior fine-tuning. To address potential training data bias, the tool implements perturbation techniques, i.e., variable name obfuscation, and design restructuring, that make the cases more sophisticated for the used LLMs. Our experimental evaluation demonstrates perfect detection rates by GPT-4o and Gemini 1.5 pro in baseline scenarios (100%/100% precision/recall), with both models achieving better trigger line coverage (TLC: 0.82-0.98) than payload line coverage (PLC: 0.32-0.46). Under code perturbation, while Gemini 1.5 pro maintains perfect detection performance (100%/100%), GPT-4o (100%/85.7%) and Llama 3.1 (66.7%/85.7%) show some degradation in detection rates, and all models experience decreased accuracy in localizing both triggers and payloads. This paper validates the potential of LLM approaches for hardware security applications, highlighting areas for future improvement.

[AI-8] Swarm Behavior Cloning

链接: https://arxiv.org/abs/2412.07617
作者: Jonas Nüßlein,Maximilian Zorn,Philipp Altmann,Claudia Linnhoff-Popien
关键词-EN: agents are Reinforcement, Reinforcement Learning, Imitation Learning, primary approaches, action differences
类目: Artificial Intelligence (cs.AI)
备注: Accepted at ICAART 2025

点击查看摘要

Abstract:In sequential decision-making environments, the primary approaches for training agents are Reinforcement Learning (RL) and Imitation Learning (IL). Unlike RL, which relies on modeling a reward function, IL leverages expert demonstrations, where an expert policy \pi_e (e.g., a human) provides the desired behavior. Formally, a dataset D of state-action pairs is provided: D = \{(s, a = \pi_e(s))\} . A common technique within IL is Behavior Cloning (BC), where a policy \pi(s) = a is learned through supervised learning on D . Further improvements can be achieved by using an ensemble of N individually trained BC policies, denoted as E = \{\pi_i(s)\}_{1 \leq i \leq N} . The ensemble’s action a for a given state s is the aggregated output of the N actions: a = \frac{1}{N} \sum_{i} \pi_i(s) . This paper addresses the issue of increasing action differences – the observation that discrepancies between the N predicted actions grow in states that are underrepresented in the training data. Large action differences can result in suboptimal aggregated actions. To address this, we propose a method that fosters greater alignment among the policies while preserving the diversity of their computations. This approach reduces action differences and ensures that the ensemble retains its inherent strengths, such as robustness and varied decision-making. We evaluate our approach across eight diverse environments, demonstrating a notable decrease in action differences and significant improvements in overall performance, as measured by mean episode returns.
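A minimal rendering of the idea: train the N policies on the same demonstrations while penalizing the variance of their predicted actions, so the ensemble stays aligned without collapsing its members into a single network. The variance penalty is this sketch's reading of the alignment term, and all sizes are toy values.

```python
import torch
import torch.nn as nn

N, obs_dim, act_dim = 4, 8, 2
policies = nn.ModuleList(
    nn.Sequential(nn.Linear(obs_dim, 64), nn.Tanh(), nn.Linear(64, act_dim))
    for _ in range(N)
)

def ensemble_loss(s, a_expert, align_weight: float = 0.1):
    """BC loss per policy plus a penalty on action differences across
    the ensemble (disagreement measured as per-dimension variance)."""
    actions = torch.stack([pi(s) for pi in policies])     # (N, B, act_dim)
    bc = ((actions - a_expert.unsqueeze(0)) ** 2).mean()  # fit expert data
    diff = actions.var(dim=0, unbiased=False).mean()      # action differences
    return bc + align_weight * diff

s = torch.randn(32, obs_dim)  # states from the demonstration set D
a = torch.randn(32, act_dim)  # expert actions pi_e(s)
loss = ensemble_loss(s, a)
loss.backward()
# At test time the ensemble acts with the mean: a = actions.mean(dim=0).
print(loss.item())
```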

[AI-9] Scaling Sequential Recommendation Models with Transformers

链接: https://arxiv.org/abs/2412.07585
作者: Pablo Zivic,Hernan Vazquez,Jorge Sanchez
关键词-EN: Modeling user preferences, sequential recommendation, transformer architecture, scaling laws
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Modeling user preferences has been mainly addressed by looking at users’ interaction history with the different elements available in the system. Tailoring content to individual preferences based on historical data is the main goal of sequential recommendation. The nature of the problem, as well as the good performance observed across various domains, has motivated the use of the transformer architecture, which has proven effective in leveraging increasingly larger amounts of training data when accompanied by an increase in the number of model parameters. This scaling behavior has brought a great deal of attention, as it provides valuable guidance in the design and training of even larger models. Taking inspiration from the scaling laws observed in training large language models, we explore similar principles for sequential recommendation. We use the full Amazon Product Data dataset, which has only been partially explored in other studies, and reveal scaling behaviors similar to those found in language models. Compute-optimal training is possible but requires a careful analysis of the compute-performance trade-offs specific to the application. We also show that performance scaling translates to downstream tasks by fine-tuning larger pre-trained models on smaller task-specific domains. Our approach and findings provide a strategic roadmap for model training and deployment in real high-dimensional preference spaces, facilitating better training and inference efficiency. We hope this paper bridges the gap between the potential of transformers and the intrinsic complexities of high-dimensional sequential recommendation in real-world recommender systems. Code and models can be found at this https URL Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI) Cite as: arXiv:2412.07585 [cs.LG] (or arXiv:2412.07585v1 [cs.LG] for this version) https://doi.org/10.48550/arXiv.2412.07585 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
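
For intuition, the kind of power-law scaling curve reported here can be fit with a simple log-log regression. The compute and loss values below are synthetic stand-ins, not the paper's measurements.

```python
# A minimal sketch of fitting a power-law scaling curve (loss vs. compute);
# the data points are hypothetical, chosen only to illustrate the fit.
import numpy as np

compute = np.array([1e15, 1e16, 1e17, 1e18, 1e19])   # hypothetical FLOPs
loss    = np.array([2.80, 2.35, 2.01, 1.74, 1.55])   # hypothetical eval loss

# Fit loss ~ a * compute^b by linear regression in log-log space.
b, log_a = np.polyfit(np.log(compute), np.log(loss), 1)
a = np.exp(log_a)
print(f"loss ~ {a:.2f} * C^({b:.3f})")  # negative exponent: power-law decay
```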

[AI-10] A data-driven learned discretization approach in finite volume schemes for hyperbolic conservation laws and varying boundary conditions

链接: https://arxiv.org/abs/2412.07541
作者: Guillaume de Romémont,Florent Renac,Jorge Nunez,Francisco Chinesta
关键词-EN: partial differential equations, hyperbolic partial differential, finite volume method, method for solving, data-driven finite volume
类目: Numerical Analysis (math.NA); Artificial Intelligence (cs.AI)
*备注: 15 pages, 20 figures with appendice

点击查看摘要

Abstract:This paper presents a data-driven finite volume method for solving 1D and 2D hyperbolic partial differential equations. This work builds upon prior research incorporating a data-driven finite-difference approximation of smooth solutions of scalar conservation laws, where optimal coefficients of neural networks approximating space derivatives are learned based on accurate, but cumbersome, solutions to these equations. We extend this approach to flux-limited finite volume schemes for hyperbolic scalar conservation laws and systems of conservation laws. We also train the discretization to efficiently capture discontinuous solutions with shock and contact waves, as well as to handle varying boundary conditions. The learning procedure of the data-driven model is extended through the definition of a new loss function, paddings, and an adequate database. These new ingredients guarantee computational stability, preserve the accuracy of fine-grid solutions, and enhance overall performance. Numerical experiments using test cases from the literature in both one- and two-dimensional spaces demonstrate that the learned model accurately reproduces fine-grid results on very coarse meshes.
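
As a baseline illustration of the finite volume machinery a learned discretization plugs into, here is a minimal 1D advection step with a fixed upwind flux; in the paper's approach, a trained network would supply the flux coefficients instead. Everything below is an illustrative toy, not the paper's model.

```python
# A minimal 1D finite volume step for linear advection. The numerical flux
# below is a fixed upwind stencil; a learned discretization would replace
# it with network-predicted coefficients.
import numpy as np

nx, dx, dt, c = 200, 1.0 / 200, 0.002, 1.0   # grid, step sizes, wave speed
u = np.exp(-200 * (np.linspace(0, 1, nx) - 0.3) ** 2)  # smooth initial bump

def upwind_flux(u, c):
    # Interface flux for c > 0; this is the piece a learned model replaces.
    return c * u

for _ in range(100):
    f = upwind_flux(u, c)
    # Conservative update: u_i <- u_i - dt/dx * (F_{i+1/2} - F_{i-1/2}).
    u[1:] = u[1:] - dt / dx * (f[1:] - f[:-1])
    u[0] = u[-1]  # periodic boundary condition
print(u.max())
```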

[AI-11] Can Neural Decompilation Assist Vulnerability Prediction on Binary Code?

链接: https://arxiv.org/abs/2412.07538
作者: D. Cotroneo,F. C. Grasso,R. Natella,V. Orbinato
关键词-EN: target software system, identifying security issues, source code, decompiled source code, issues more efficiently
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Vulnerability prediction is valuable in identifying security issues more efficiently, even though it requires the source code of the target software system, which is a restrictive hypothesis. This paper presents an experimental study to predict vulnerabilities in binary code without source code or complex representations of the binary, leveraging the pivotal idea of decompiling the binary file through neural decompilation and predicting vulnerabilities through deep learning on the decompiled source code. The results outperform the state-of-the-art in both neural decompilation and vulnerability prediction, showing that it is possible to identify vulnerable programs with this approach concerning bi-class (vulnerable/non-vulnerable) and multi-class (type of vulnerability) analysis.

[AI-12] Ontology-driven Prompt Tuning for LLM-based Task and Motion Planning

链接: https://arxiv.org/abs/2412.07493
作者: Muhayy Ud Din,Jan Rosell,Waseem Akram,Isiah Zaplana,Maximo A Roa,Lakmal Seneviratne,Irfan Hussain
关键词-EN: low-level motion planning, Performing complex manipulation, Motion Planning, combine high-level symbolic, Large Language Models
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
*备注: Submitted to Robotics and Automation Letters

点击查看摘要

Abstract:Performing complex manipulation tasks in dynamic environments requires efficient Task and Motion Planning (TAMP) approaches, which combine high-level symbolic plans with low-level motion planning. Advances in Large Language Models (LLMs), such as GPT-4, are transforming task planning by offering natural language as an intuitive and flexible way to describe tasks, generate symbolic plans, and reason. However, the effectiveness of LLM-based TAMP approaches is limited by static, template-based prompting, which struggles to adapt to dynamic environments and complex task contexts. To address these limitations, this work proposes a novel ontology-driven prompt-tuning framework that employs knowledge-based reasoning to refine and expand user prompts with task-contextual reasoning and knowledge-based environment state descriptions. Integrating domain-specific knowledge into the prompt ensures semantically accurate and context-aware task plans. The proposed framework demonstrates its effectiveness by resolving semantic errors in symbolic plan generation, such as maintaining logical temporal goal ordering in scenarios involving hierarchical object placement. The proposed framework is validated through both simulation and real-world scenarios, demonstrating significant improvements over the baseline approach in terms of adaptability to dynamic environments and the generation of semantically correct task plans.

[AI-13] SmartAgent: Chain-of-User-Thought for Embodied Personalized Agent in Cyber World

链接: https://arxiv.org/abs/2412.07472
作者: Jiaqi Zhang,Chen Gao,Liyuan Zhang,Yong Li,Hongzhi Yin
关键词-EN: large vision-language models, helping people make, people make intelligent, make intelligent decisions, Recent advances
类目: Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Recent advances in embodied agents with multimodal perception and reasoning capabilities based on large vision-language models (LVLMs) excel at autonomously interacting with either real or cyber worlds, helping people make intelligent decisions in complex environments. However, current works are normally optimized by golden action trajectories or ideal task-oriented solutions toward a definitive goal. This paradigm considers limited user-oriented factors, which could be the reason for their performance reduction in a wide range of personal assistant applications. To address this, we propose Chain-of-User-Thought (COUT), a novel embodied reasoning paradigm that takes a chain of thought from basic action thinking to explicit and implicit personalized preference thought, in order to incorporate personalized factors into autonomous agent learning. To target COUT, we introduce SmartAgent, an agent framework that perceives cyber environments and reasons about personalized requirements by 1) interacting with GUIs to access an item pool, 2) generating users’ explicit requirements implied by previous actions, and 3) recommending items to fulfill users’ implicit requirements. To demonstrate SmartAgent’s capabilities, we also create a brand-new dataset, SmartSpot, that offers a full-stage personalized action-involved environment. To the best of our knowledge, our work is the first to formulate the COUT process, serving as a preliminary attempt towards embodied personalized agent learning. Our extensive experiments on SmartSpot illuminate SmartAgent’s functionality among a series of embodied and personalized sub-tasks. We will release code and data upon paper notification at this https URL.

[AI-14] Tazza: Shuffling Neural Network Parameters for Secure and Private Federated Learning

链接: https://arxiv.org/abs/2412.07454
作者: Kichang Lee,Jaeho Jin,JaeYeon Park,JeongGil Ko
关键词-EN: preserving data privacy, learning enables decentralized, enables decentralized model, decentralized model training, sharing raw data
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: 14 pages, 14 figures

点击查看摘要

Abstract:Federated learning enables decentralized model training without sharing raw data, preserving data privacy. However, its vulnerabilities to critical security threats, such as gradient inversion and model poisoning by malicious clients, remain unresolved. Existing solutions often address these issues separately, sacrificing either system robustness or model accuracy. This work introduces Tazza, a secure and efficient federated learning framework that simultaneously addresses both challenges. By leveraging the permutation equivariance and invariance properties of neural networks via weight shuffling and shuffled model validation, Tazza enhances resilience against diverse poisoning attacks while ensuring data confidentiality and high model accuracy. Comprehensive evaluations on various datasets and embedded platforms show that Tazza achieves robust defense with up to 6.7x improved computational efficiency compared to alternative schemes, without compromising performance.
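
The permutation property Tazza exploits is easy to verify on a toy network: shuffling the hidden units of an MLP (rows of the first weight matrix, with the matching columns of the second) changes the parameters but not the function. The sketch below is illustrative only, not the paper's implementation.

```python
# A minimal demonstration of permutation equivariance in an MLP:
# permuted weights compute exactly the same function, so shuffled
# parameters hide structure while remaining usable.
import numpy as np

rng = np.random.default_rng(1)
W1, b1 = rng.normal(size=(8, 4)), rng.normal(size=8)
W2 = rng.normal(size=(2, 8))

def mlp(x, W1, b1, W2):
    return W2 @ np.tanh(W1 @ x + b1)

perm = rng.permutation(8)                      # secret shuffling key
W1_s, b1_s, W2_s = W1[perm], b1[perm], W2[:, perm]

x = rng.normal(size=4)
# Outputs agree: the permuted network computes the same function.
assert np.allclose(mlp(x, W1, b1, W2), mlp(x, W1_s, b1_s, W2_s))
print("shuffled network matches original")
```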

[AI-15] Dynamic Ensemble Reasoning for LLM Experts

链接: https://arxiv.org/abs/2412.07448
作者: Jinwu Hu,Yufeng Wang,Shuhai Zhang,Kai Zhou,Guohao Chen,Yu Hu,Bin Xiao,Mingkui Tan
关键词-EN: LLM ensemble reasoning, LLM ensemble, existing LLM ensemble, LLM experts, Ensemble reasoning
类目: Artificial Intelligence (cs.AI)
*备注: 18 pages

点击查看摘要

Abstract:Ensemble reasoning over the strengths of different LLM experts is critical to achieving consistent and satisfactory performance on diverse inputs across a wide range of tasks. However, existing LLM ensemble methods are either computationally intensive or incapable of leveraging complementary knowledge among LLM experts for various inputs. In this paper, we propose a Dynamic Ensemble Reasoning paradigm, called DER, to integrate the strengths of multiple LLM experts conditioned on dynamic inputs. Specifically, we model the LLM ensemble reasoning problem as a Markov Decision Process (MDP), wherein an agent sequentially takes inputs to request knowledge from an LLM candidate and passes the output to a subsequent LLM candidate. Moreover, we devise a reward function to train a DER-Agent to dynamically select an optimal answering route given the input questions, aiming to achieve the highest performance with as few computational resources as possible. Last, to fully transfer the expert knowledge from the prior LLMs, we develop a Knowledge Transfer Prompt (KTP) that enables the subsequent LLM candidates to transfer complementary knowledge effectively. Experiments demonstrate that our method uses fewer computational resources to achieve better performance compared to state-of-the-art baselines.

[AI-16] Reconstructing Deep Neural Networks: Unleashing the Optimization Potential of Natural Gradient Descent

链接: https://arxiv.org/abs/2412.07441
作者: Weihua Liu,Said Boumaraf,Jianwu Li,Chaochao Lin,Xiabi Liu,Lijuan Niu,Naoufel Werghi
关键词-EN: Natural gradient descent, powerful optimization technique, inverse Fisher information, training deep neural, Fisher information matrix
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Natural gradient descent (NGD) is a powerful optimization technique for machine learning, but the computational complexity of the inverse Fisher information matrix limits its application in training deep neural networks. To overcome this challenge, we propose a novel optimization method for training deep neural networks called structured natural gradient descent (SNGD). Theoretically, we demonstrate that optimizing the original network using NGD is equivalent to using fast gradient descent (GD) to optimize the reconstructed network with a structural transformation of the parameter matrix. Thereby, we decompose the calculation of the global Fisher information matrix into the efficient computation of local Fisher matrices via constructing local Fisher layers in the reconstructed network to speed up the training. Experimental results on various deep networks and datasets demonstrate that SNGD achieves faster convergence speed than NGD while retaining comparable solutions. Furthermore, our method outperforms traditional GDs in terms of efficiency and effectiveness. Thus, our proposed method has the potential to significantly improve the scalability and efficiency of NGD in deep learning applications. Our source code is available at this https URL.
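
To see why the Fisher inversion in NGD is costly enough to motivate structured approximations, here is a minimal damped natural-gradient loop on logistic regression, a toy case where the Fisher matrix is explicit. SNGD's layer-wise reconstruction is not reproduced here; this is only the baseline update it accelerates.

```python
# A minimal natural-gradient step on logistic regression. Forming and
# inverting the Fisher matrix is cubic in the parameter count, which is
# the cost SNGD's local Fisher layers are designed to avoid.
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(100, 5))
y = (X @ rng.normal(size=5) > 0).astype(float)
w = np.zeros(5)

for _ in range(20):
    p = 1 / (1 + np.exp(-X @ w))
    grad = X.T @ (p - y) / len(y)
    # Empirical Fisher for logistic regression: X^T diag(p(1-p)) X / n.
    F = (X * (p * (1 - p))[:, None]).T @ X / len(y)
    w -= np.linalg.solve(F + 1e-3 * np.eye(5), grad)  # damped NGD step
print(w)
```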

[AI-17] MoDULA: Mixture of Domain-Specific and Universal LoRA for Multi-Task Learning

链接: https://arxiv.org/abs/2412.07405
作者: Yufei Ma,Zihan Liang,Huangyu Dai,Ben Chen,Dehong Gao,Zhuoran Ran,Wang Zihan,Linbo Jin,Wen Jiang,Guannan Zhang,Xiaoyan Cai,Libin Yang
关键词-EN: textbf, limited computational resources, poses challenges, growing demand, demand for larger-scale
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:The growing demand for larger-scale models in the development of Large Language Models (LLMs) poses challenges for efficient training within limited computational resources. Traditional fine-tuning methods often exhibit instability in multi-task learning and rely heavily on extensive training resources. Here, we propose MoDULA (Mixture of Domain-Specific and Universal LoRA), a novel Parameter-Efficient Fine-Tuning (PEFT) Mixture-of-Experts (MoE) paradigm for improved fine-tuning and parameter efficiency in multi-task learning. The paradigm effectively improves the multi-task capability of the model by training universal experts, domain-specific experts, and routers separately. MoDULA-Res is a new method within the MoDULA paradigm, which maintains the model’s general capability by connecting universal and task-specific experts through residual connections. The experimental results demonstrate that the overall performance of the MoDULA-Flan and MoDULA-Res methods surpasses that of existing fine-tuning methods on various LLMs. Notably, MoDULA-Res achieves more significant performance improvements in multiple tasks while reducing training costs by over 80% without losing general capability. Moreover, MoDULA displays flexible pluggability, allowing for the efficient addition of new tasks without retraining existing experts from scratch. This progressive training paradigm circumvents data balancing issues, enhancing training efficiency and model stability. Overall, MoDULA provides a scalable, cost-effective solution for fine-tuning LLMs with enhanced parameter efficiency and generalization capability.
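
A rough sketch of the residual expert mixture described above: a universal low-rank expert, router-gated domain experts, and a residual connection that preserves the base representation. All module names, shapes, and the gating scheme are assumptions for illustration, not the released code.

```python
# A minimal sketch of a MoDULA-Res style module: universal + domain
# experts (LoRA-style low-rank pairs) combined via a residual connection.
import torch
import torch.nn as nn

class ResidualExpertMixture(nn.Module):
    def __init__(self, dim: int, n_domain_experts: int, rank: int = 8):
        super().__init__()
        # Low-rank experts: down-projection followed by up-projection.
        self.universal = nn.Sequential(nn.Linear(dim, rank), nn.Linear(rank, dim))
        self.domain = nn.ModuleList(
            [nn.Sequential(nn.Linear(dim, rank), nn.Linear(rank, dim))
             for _ in range(n_domain_experts)]
        )
        self.router = nn.Linear(dim, n_domain_experts)

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        gates = torch.softmax(self.router(h), dim=-1)            # per-token routing
        domain_out = torch.stack([e(h) for e in self.domain], dim=-1)
        mixed = (domain_out * gates.unsqueeze(-2)).sum(dim=-1)
        # Residual connection keeps the universal/general capability.
        return h + self.universal(h) + mixed

x = torch.randn(2, 16, 64)
print(ResidualExpertMixture(64, n_domain_experts=4)(x).shape)
```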

[AI-18] Non-Progressive Influence Maximization in Dynamic Social Networks

链接: https://arxiv.org/abs/2412.07402
作者: Yunming Hui,Shihan Wang,Melisachew Wudage Chekol,Stevan Rudinac,Inez Maria Zwetsloot
关键词-EN: problem involves identifying, involves identifying, identifying a set, set of key, key individuals
类目: ocial and Information Networks (cs.SI); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:The influence maximization (IM) problem involves identifying a set of key individuals in a social network who can maximize the spread of influence through their network connections. With the advent of geometric deep learning on graphs, great progress has been made towards better solutions for the IM problem. In this paper, we focus on the dynamic non-progressive IM problem, which considers the dynamic nature of real-world social networks and the special case where the influence diffusion is non-progressive, i.e., nodes can be activated multiple times. We first extend an existing diffusion model to capture the non-progressive influence propagation in dynamic social networks. We then propose the method, DNIMRL, which employs deep reinforcement learning and dynamic graph embedding to solve the dynamic non-progressive IM problem. In particular, we propose a novel algorithm that effectively leverages graph embedding to capture the temporal changes of dynamic networks and seamlessly integrates with deep reinforcement learning. The experiments, on different types of real-world social network datasets, demonstrate that our method outperforms state-of-the-art baselines.

[AI-19] Contextualized Counterspeech: Strategies for Adaptation Personalization and Evaluation

链接: https://arxiv.org/abs/2412.07338
作者: Lorenzo Cima,Alessio Miaschi,Amaury Trujillo,Marco Avvenuti,Felice Dell’Orletta,Stefano Cresci
关键词-EN: promote civil discourse, curb online toxicity, civil discourse, offers a promising, promising and scalable
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Social and Information Networks (cs.SI)
*备注:

点击查看摘要

Abstract:AI-generated counterspeech offers a promising and scalable strategy to curb online toxicity through direct replies that promote civil discourse. However, current counterspeech is one-size-fits-all, lacking adaptation to the moderation context and the users involved. We propose and evaluate multiple strategies for generating tailored counterspeech that is adapted to the moderation context and personalized for the moderated user. We instruct an LLaMA2-13B model to generate counterspeech, experimenting with various configurations based on different contextual information and fine-tuning strategies. We identify the configurations that generate persuasive counterspeech through a combination of quantitative indicators and human evaluations collected via a pre-registered mixed-design crowdsourcing experiment. Results show that contextualized counterspeech can significantly outperform state-of-the-art generic counterspeech in adequacy and persuasiveness, without compromising other characteristics. Our findings also reveal a poor correlation between quantitative indicators and human evaluations, suggesting that these methods assess different aspects and highlighting the need for nuanced evaluation methodologies. The effectiveness of contextualized AI-generated counterspeech and the divergence between human and algorithmic evaluations underscore the importance of increased human-AI collaboration in content moderation.

[AI-20] NeSyA: Neurosymbolic Automata

链接: https://arxiv.org/abs/2412.07331
作者: Nikolaos Manginas,George Paliouras,Luc De Raedt
关键词-EN: Neurosymbolic Artificial Intelligence, Neurosymbolic Artificial, Artificial Intelligence, integrate low level, low level perception
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Neurosymbolic Artificial Intelligence (NeSy) has emerged as a promising direction to integrate low-level perception with high-level reasoning. Unfortunately, little attention has been given to developing NeSy systems tailored to temporal/sequential problems. This entails reasoning symbolically over sequences of subsymbolic observations towards a target prediction. We show that symbolic automata with a probabilistic semantics, which combine the power of automata for temporal structure specification with that of propositional logic, can be used to reason efficiently and differentiably over subsymbolic sequences. The proposed system, which we call NeSyA (Neuro Symbolic Automata), is shown to either scale or perform better than existing NeSy approaches when applied to problems with a temporal component.
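
The probabilistic-semantics idea can be sketched in a few lines: per-step symbol probabilities (e.g., from a neural classifier) weight the automaton's transition matrices, and the state distribution is propagated by matrix products, which is differentiable end to end. The automaton and probabilities below are made up for illustration.

```python
# A minimal sketch of differentiable reasoning with a probabilistic
# symbolic automaton over a subsymbolic sequence.
import numpy as np

# 2-state automaton over symbols {a, b}: T[s][i, j] = 1 if reading s
# moves the automaton from state i to state j.
T = {
    "a": np.array([[1.0, 0.0], [0.0, 1.0]]),   # 'a' keeps the state
    "b": np.array([[0.0, 1.0], [0.0, 1.0]]),   # 'b' moves to accepting state 1
}

# Neural predictions: P(symbol | observation) for a length-3 sequence.
probs = [{"a": 0.9, "b": 0.1}, {"a": 0.2, "b": 0.8}, {"a": 0.7, "b": 0.3}]

state = np.array([1.0, 0.0])                    # start in state 0
for p in probs:
    # Expected transition under the symbol distribution (differentiable).
    M = sum(p[s] * T[s] for s in T)
    state = state @ M
print("P(accepting):", state[1])
```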

[AI-21] Superficial Consciousness Hypothesis for Autoregressive Transformers AAAI25

链接: https://arxiv.org/abs/2412.07278
作者: Yosuke Miyanishi,Keita Mitani
关键词-EN: machine learning models, learning models built, achieving Trustworthy, Superficial Consciousness Hypothesis, preparing for superintelligence
类目: Artificial Intelligence (cs.AI); Information Theory (cs.IT)
*备注: Accepted to PSS Workshop at AAAI25

点击查看摘要

Abstract:The alignment between human objectives and machine learning models built on these objectives is a crucial yet challenging problem for achieving Trustworthy AI, particularly when preparing for superintelligence (SI). First, given that SI does not exist today, empirical analysis for direct evidence is difficult. Second, SI is assumed to be more intelligent than humans, capable of deceiving us into underestimating its intelligence, making output-based analysis unreliable. Lastly, what kind of unexpected property SI might have is still unclear. To address these challenges, we propose the Superficial Consciousness Hypothesis under Information Integration Theory (IIT), suggesting that SI could exhibit a complex information-theoretic state like a conscious agent while unconscious. To validate this, we use a hypothetical scenario where SI can update its parameters “at will” to achieve its own objective (mesa-objective) under the constraint of the human objective (base objective). We show that a practical estimate of IIT’s consciousness metric is relevant to the widely used perplexity metric, and train GPT-2 with those two objectives. Our preliminary result suggests that this SI-simulating GPT-2 could simultaneously follow the two objectives, supporting the feasibility of the Superficial Consciousness Hypothesis.

[AI-22] Temporal-Aware Evaluation and Learning for Temporal Graph Neural Networks

链接: https://arxiv.org/abs/2412.07273
作者: Junwei Su,Shan Wu
关键词-EN: Graph Neural Networks, Temporal Graph Neural, Graph Neural, Neural Networks, neural networks designed
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Temporal Graph Neural Networks (TGNNs) are a family of graph neural networks designed to model and learn dynamic information from temporal graphs. Given their substantial empirical success, there is an escalating interest in TGNNs within the research community. However, the majority of these efforts have been channelled towards algorithm and system design, with the evaluation metrics receiving comparatively less attention. Effective evaluation metrics are crucial for providing detailed performance insights, particularly in the temporal domain. This paper investigates the commonly used evaluation metrics for TGNNs and illustrates the failure mechanisms of these metrics in capturing essential temporal structures in the predictive behaviour of TGNNs. We provide a mathematical formulation of existing performance metrics and utilize an instance-based study to underscore their inadequacies in identifying volatility clustering (the occurrence of emerging errors within a brief interval). This phenomenon has profound implications for both algorithm and system design in the temporal domain. To address this deficiency, we introduce a new volatility-aware evaluation metric (termed volatility cluster statistics), designed for a more refined analysis of model temporal performance. Additionally, we demonstrate how this metric can serve as a temporal-volatility-aware training objective to alleviate the clustering of temporal errors. Through comprehensive experiments on various TGNN models, we validate our analysis and the proposed approach. The empirical results offer revealing insights: 1) existing TGNNs are prone to making errors with volatility clustering, and 2) TGNNs with different mechanisms to capture temporal information exhibit distinct volatility clustering patterns. Our empirical findings demonstrate that our proposed training objective effectively reduces volatility clusters in error.
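
One simple way to capture the volatility-clustering idea is to measure how much of the total error mass falls inside the worst short window; errors spread evenly score low, bursty errors score high. The sketch below is an illustrative stand-in, not the paper's statistic.

```python
# A minimal volatility-cluster style score: fraction of total error
# captured by the worst length-`window` interval of the error sequence.
import numpy as np

errors = np.array([0, 0, 1, 1, 1, 0, 0, 0, 1, 0, 0, 0])  # per-step errors
window = 3

window_sums = np.convolve(errors, np.ones(window), mode="valid")
cluster_score = window_sums.max() / errors.sum()
print(cluster_score)  # 1.0 would mean all errors fall in one burst
```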

[AI-23] Goal-Driven Reasoning in DatalogMTL with Magic Sets

链接: https://arxiv.org/abs/2412.07259
作者: Shaoyu Wang,Kaiyue Zhao,Dongliang Wei,Przemysław Andrzej Wałęga,Dingmin Wang,Hongmin Cai,Pan Hu
关键词-EN: powerful rule-based language, powerful rule-based, rule-based language, language for temporal, Abstract
类目: Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:DatalogMTL is a powerful rule-based language for temporal reasoning. Due to its high expressive power and flexible modeling capabilities, it is suitable for a wide range of applications, including tasks from the industrial and financial sectors. However, due to its high computational complexity, practical reasoning in DatalogMTL is highly challenging. To address this difficulty, we introduce a new reasoning method for DatalogMTL which exploits the magic sets technique – a rewriting approach developed for (non-temporal) Datalog to simulate top-down evaluation with bottom-up reasoning. We implement this approach and evaluate it on several publicly available benchmarks, showing that the proposed approach significantly and consistently outperforms state-of-the-art reasoning techniques.

[AI-24] A Dynamical Systems-Inspired Pruning Strategy for Addressing Oversmoothing in Graph Neural Networks

链接: https://arxiv.org/abs/2412.07243
作者: Biswadeep Chakraborty,Harshit Kumar,Saibal Mukhopadhyay
关键词-EN: Graph Neural Networks, Graph Neural, network depth increases, homogenized node representations, Neural Networks
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: 26 pages

点击查看摘要

Abstract:Oversmoothing in Graph Neural Networks (GNNs) poses a significant challenge as network depth increases, leading to homogenized node representations and a loss of expressiveness. In this work, we approach the oversmoothing problem from a dynamical systems perspective, providing a deeper understanding of the stability and convergence behavior of GNNs. Leveraging insights from dynamical systems theory, we identify the root causes of oversmoothing and propose DYNAMO-GAT. This approach utilizes noise-driven covariance analysis and Anti-Hebbian principles to selectively prune redundant attention weights, dynamically adjusting the network’s behavior to maintain node feature diversity and stability. Our theoretical analysis reveals how DYNAMO-GAT disrupts the convergence to oversmoothed states, while experimental results on benchmark datasets demonstrate its superior performance and efficiency compared to traditional and state-of-the-art methods. DYNAMO-GAT not only advances the theoretical understanding of oversmoothing through the lens of dynamical systems but also provides a practical and effective solution for improving the stability and expressiveness of deep GNNs.

[AI-25] Human-Computer Interaction and Human-AI Collaboration in Advanced Air Mobility: A Comprehensive Review

链接: https://arxiv.org/abs/2412.07241
作者: Fatma Yamac Sagirli,Xiaopeng Zhao,Zhenbo Wang
关键词-EN: Advanced Air Mobility, dimension-through Advanced Air, dimension-through Advanced, Advanced Air, Air Mobility
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:The increasing rates of global urbanization and vehicle usage are leading to a shift of mobility to the third dimension, through Advanced Air Mobility (AAM), offering a promising solution for faster, safer, cleaner, and more efficient transportation. As air transportation continues to evolve with more automated and autonomous systems, advancements in AAM require a deep understanding of human-computer interaction and human-AI collaboration to ensure safe and effective operations in complex urban and regional environments. There has been a significant increase in publications regarding these emerging applications; thus, there is a need to review developments in this area. This paper comprehensively reviews the current state of research on human-computer interaction and human-AI collaboration in AAM. Specifically, we focus on AAM applications related to the design of human-machine interfaces for various uses, including pilot training, air traffic management, and the integration of AI-assisted decision-making systems with immersive technologies such as extended, virtual, mixed, and augmented reality devices. Additionally, we provide a comprehensive analysis of the challenges AAM encounters in integrating human-computer frameworks, including unique challenges associated with these interactions, such as trust in AI systems and safety concerns. Finally, we highlight emerging opportunities and propose future research directions to bridge the gap between human factors and technological advancements in AAM.

[AI-26] Parseval Regularization for Continual Reinforcement Learning

链接: https://arxiv.org/abs/2412.07224
作者: Wesley Chung,Lynn Cherif,David Meger,Doina Precup
关键词-EN: training deep neural, deep neural networks, primacy bias, identified as issues, issues arising
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Loss of plasticity, trainability loss, and primacy bias have been identified as issues arising when training deep neural networks on sequences of tasks – all referring to the increased difficulty in training on new tasks. We propose to use Parseval regularization, which maintains orthogonality of weight matrices, to preserve useful optimization properties and improve training in a continual reinforcement learning setting. We show that it provides significant benefits to RL agents on a suite of gridworld, CARL, and MetaWorld tasks. We conduct comprehensive ablations to identify the source of its benefits and investigate the effect of certain metrics associated with network trainability, including weight matrix rank, weight norms, and policy entropy.
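
Parseval regularization itself is compact to express: penalize each weight matrix's deviation from orthogonality, ||W W^T - I||_F^2, and add the penalty to the training loss. Below is a minimal PyTorch sketch; the coefficient and the choice to penalize the smaller Gram matrix are assumptions for illustration.

```python
# A minimal sketch of a Parseval-style orthogonality penalty added to a loss.
import torch
import torch.nn as nn

def parseval_penalty(model: nn.Module) -> torch.Tensor:
    penalty = torch.tensor(0.0)
    for module in model.modules():
        if isinstance(module, nn.Linear):
            W = module.weight
            if W.shape[0] > W.shape[1]:
                W = W.T  # penalize the smaller Gram matrix
            I = torch.eye(W.shape[0])
            penalty = penalty + ((W @ W.T - I) ** 2).sum()
    return penalty

net = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 4))
loss = net(torch.randn(32, 8)).pow(2).mean() + 1e-3 * parseval_penalty(net)
loss.backward()  # gradients now pull weight matrices toward orthogonality
```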

[AI-27] owards Automated Cross-domain Exploratory Data Analysis through Large Language Models SIGMOD2025

链接: https://arxiv.org/abs/2412.07214
作者: Jun-Peng Zhu,Boyan Niu,Peng cai,Zheming Ni,Jianwei Wan,Kai Xu,Jiajun Huang,Shengbo Ma,Bing Wang,Xuan Zhou,Guanglei Bao,Donghui Zhang,Liu Tang,Qi Liu
关键词-EN: data analysts involved, Exploratory data analysis, SQL queries skillfully, craft SQL queries, data
类目: Databases (cs.DB); Artificial Intelligence (cs.AI)
*备注: 14 pages, 10 figures. Submitted to SIGMOD 2025

点击查看摘要

Abstract:Exploratory data analysis (EDA), coupled with SQL, is essential for data analysts involved in data exploration and analysis. However, data analysts often encounter two primary challenges: (1) the need to craft SQL queries skillfully, and (2) the requirement to generate suitable visualization types that enhance the interpretation of query results. Due to its significance, substantial research efforts have been made to explore different approaches to address these challenges, including leveraging large language models (LLMs). However, existing methods fail to meet real-world data exploration requirements primarily due to (1) complex database schema; (2) unclear user intent; (3) limited cross-domain generalization capability; and (4) insufficient end-to-end text-to-visualization capability. This paper presents TiInsight, an automated SQL-based cross-domain exploratory data analysis system. First, we propose hierarchical data context (i.e., HDC), which leverages LLMs to summarize the contexts related to the database schema, which is crucial for open-world EDA systems to generalize across data domains. Second, the EDA system is divided into four components (i.e., stages): HDC generation, question clarification and decomposition, text-to-SQL generation (i.e., TiSQL), and data visualization (i.e., TiChart). Finally, we implemented an end-to-end EDA system with a user-friendly GUI interface in the production environment at PingCAP. We have also open-sourced all APIs of TiInsight to facilitate research within the EDA community. Through extensive evaluations by a real-world user study, we demonstrate that TiInsight offers remarkable performance compared to human experts. Specifically, TiSQL achieves an execution accuracy of 86.3% on the Spider dataset using GPT-4. It also demonstrates state-of-the-art performance on the Bird dataset. Comments: 14 pages, 10 figures. Submitted to SIGMOD 2025 Subjects: Databases (cs.DB); Artificial Intelligence (cs.AI) ACMclasses: H.2.8 Cite as: arXiv:2412.07214 [cs.DB] (or arXiv:2412.07214v1 [cs.DB] for this version) https://doi.org/10.48550/arXiv.2412.07214 Focus to learn more arXiv-issued DOI via DataCite (pending registration)

[AI-28] IntellectSeeker: A Personalized Literature Management System with the Probabilistic Model and Large Language Model

链接: https://arxiv.org/abs/2412.07213
作者: Weizhen Bian,Siyan Liu,Yubo Zhou,Dezhi Chen,Yijie Liao,Zhenzhen Fan,Aobo Wang
关键词-EN: traditional academic engines, uncertain article quality, burgeoning volume, quality and mismatches, Large Language Model
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Faced with the burgeoning volume of academic literature, researchers often need help with uncertain article quality and mismatches in term searches using traditional academic engines. We introduce IntellectSeeker, an innovative and personalized intelligent academic literature management platform to address these challenges. This platform integrates a Large Language Model (LLM)–based semantic enhancement bot with a sophisticated probability model to personalize and streamline literature searches. We adopted the GPT-3.5-turbo model to transform everyday language into professional academic terms across various scenarios using multiple rounds of few-shot learning. This adaptation mainly benefits academic newcomers, effectively bridging the gap between general inquiries and academic terminology. The probabilistic model intelligently filters academic articles to align closely with the specific interests of users, which are derived from explicit needs and behavioral patterns. Moreover, IntellectSeeker incorporates an advanced recommendation system and text compression tools. These features enable intelligent article recommendations based on user interactions and present search results through concise one-line summaries and innovative word cloud visualizations, significantly enhancing research efficiency and user experience. IntellectSeeker offers academic researchers a highly customizable literature management solution with exceptional search precision and matching capabilities. The code can be found here: this https URL

[AI-29] EDiT: A Local-SGD-Based Efficient Distributed Training Method for Large Language Models

链接: https://arxiv.org/abs/2412.07210
作者: Jialiang Cheng,Ning Gao,Yun Yue,Zhiling Ye,Jiadi Jiang,Jian Sha
关键词-EN: large language models, Distributed training methods, Distributed training, Local SGD methods, crucial for large
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI)
*备注: 22 pages, 10 figures, 7 tables

点击查看摘要

Abstract:Distributed training methods are crucial for large language models (LLMs). However, existing distributed training methods often suffer from communication bottlenecks, stragglers, and limited elasticity. Local SGD methods have been proposed to address these issues, but their effectiveness remains limited to small-scale training due to additional memory overhead and lack of concerns on efficiency and stability. To tackle these issues, we propose EDiT, an innovative Efficient Distributed Training method that combines a tailored Local SGD approach with model sharding techniques to enhance large-scale training efficiency. EDiT performs layer-wise parameter synchronization during forward pass, reducing communication and memory overhead and enabling the overlap of computation and communication. Besides, EDiT employs a pseudo gradient penalty strategy to suppress loss spikes, which ensures training stability and improve performance. Additionally, we introduce A-EDiT, a fully asynchronous variant of EDiT that accommodates heterogeneous clusters. Building on EDiT/A-EDiT, we conduct a series of experiments to validate large-scale asynchronous training for LLMs, accompanied by comprehensive analyses. Experimental results demonstrate the superior performance of EDiT/A-EDiT, establishing them as robust solutions for distributed LLM training in diverse computational ecosystems.

[AI-30] Hierarchical Split Federated Learning: Convergence Analysis and System Optimization

链接: https://arxiv.org/abs/2412.07197
作者: Zheng Lin,Wei Wei,Zhe Chen,Chan-Tong Lam,Xianhao Chen,Yue Gao,Jun Luo
关键词-EN: resource-constrained edge devices, deploy federated learning, expand in size, increasingly challenging, challenging to deploy
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC); Networking and Internet Architecture (cs.NI)
*备注: 15 pages, 9 figures

点击查看摘要

Abstract:As AI models expand in size, it has become increasingly challenging to deploy federated learning (FL) on resource-constrained edge devices. To tackle this issue, split federated learning (SFL) has emerged as an FL framework with reduced workload on edge devices via model splitting; it has received extensive attention from the research community in recent years. Nevertheless, most prior works on SFL focus only on a two-tier architecture without harnessing multi-tier cloudedge computing resources. In this paper, we intend to analyze and optimize the learning performance of SFL under multi-tier systems. Specifically, we propose the hierarchical SFL (HSFL) framework and derive its convergence bound. Based on the theoretical results, we formulate a joint optimization problem for model splitting (MS) and model aggregation (MA). To solve this rather hard problem, we then decompose it into MS and MA subproblems that can be solved via an iterative descending algorithm. Simulation results demonstrate that the tailored algorithm can effectively optimize MS and MA for SFL within virtually any multi-tier system.

[AI-31] Graph Neural Networks Are More Than Filters: Revisiting and Benchmarking from A Spectral Perspective

链接: https://arxiv.org/abs/2412.07188
作者: Yushun Dong,Patrick Soga,Yinhan He,Song Wang,Jundong Li
关键词-EN: Graph Neural Networks, Neural Networks, graph-based learning tasks, achieved remarkable success, input graph data
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Graph Neural Networks (GNNs) have achieved remarkable success in various graph-based learning tasks. While their performance is often attributed to the powerful neighborhood aggregation mechanism, recent studies suggest that other components such as non-linear layers may also significantly affect how GNNs process the input graph data in the spectral domain. Such evidence challenges the prevalent opinion that neighborhood aggregation mechanisms dominate the behavioral characteristics of GNNs in the spectral domain. To demystify such a conflict, this paper introduces a comprehensive benchmark to measure and evaluate GNNs’ capability in capturing and leveraging the information encoded in different frequency components of the input graph data. Specifically, we first conduct an exploratory study demonstrating that GNNs can flexibly yield outputs with diverse frequency components even when certain frequencies are absent or filtered out from the input graph data. We then formulate a novel research problem of measuring and benchmarking the performance of GNNs from a spectral perspective. To take an initial step towards a comprehensive benchmark, we design an evaluation protocol supported by comprehensive theoretical analysis. Finally, we introduce a comprehensive benchmark on real-world datasets, revealing insights that challenge prevalent opinions from a spectral perspective. We believe that our findings will open new avenues for future advancements in this area. Our implementations can be found at: this https URL.
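
The spectral perspective here rests on the graph Fourier transform: projecting a node signal onto the Laplacian eigenbasis and reading off the energy in each frequency. A toy-graph sketch (not the benchmark code):

```python
# A minimal graph Fourier decomposition of a node signal.
import numpy as np

A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)      # small undirected graph
L = np.diag(A.sum(1)) - A                      # combinatorial Laplacian
eigvals, eigvecs = np.linalg.eigh(L)           # frequencies and basis

x = np.array([1.0, 2.0, 3.0, 4.0])             # node feature signal
x_hat = eigvecs.T @ x                          # graph Fourier transform
print(eigvals)          # 0 = lowest frequency (constant eigenvector)
print(x_hat ** 2)       # energy per frequency component
```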

[AI-32] Monte Carlo Tree Search based Space Transfer for Black-box Optimization NEURIPS2024

链接: https://arxiv.org/abs/2412.07186
作者: Shukuan Wang,Ke Xue,Lei Song,Xiaobin Huang,Chao Qian
关键词-EN: computationally expensive black-box, search space transfer, search space, expensive black-box optimization, Bayesian optimization
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: NeurIPS 2024 Spotlight

点击查看摘要

Abstract:Bayesian optimization (BO) is a popular method for computationally expensive black-box optimization. However, traditional BO methods need to solve new problems from scratch, leading to slow convergence. Recent studies try to extend BO to a transfer learning setup to speed up the optimization, where search space transfer is one of the most promising approaches and has shown impressive performance on many tasks. However, existing search space transfer methods either lack an adaptive mechanism or are not flexible enough, making it difficult to efficiently identify promising search space during the optimization process. In this paper, we propose a search space transfer learning method based on Monte Carlo tree search (MCTS), called MCTS-transfer, to iteratively divide, select, and optimize in a learned subspace. MCTS-transfer can not only provide a well-performing search space for warm-start but also adaptively identify and leverage the information of similar source tasks to reconstruct the search space during the optimization process. Experiments on synthetic functions, real-world problems, Design-Bench and hyper-parameter optimization show that MCTS-transfer can demonstrate superior performance compared to other search space transfer methods under different settings. Our code is available at this https URL.

[AI-33] Post-Training Statistical Calibration for Higher Activation Sparsity NEURIPS

链接: https://arxiv.org/abs/2412.07174
作者: Vui Seng Chua,Yujie Pan,Nilesh Jain
关键词-EN: Statistical Calibrated Activation, Calibrated Activation Pruning, present Statistical Calibrated, activation pruning framework, post-training activation pruning
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: ENLSP-IV NeurIPS Workshop 2024

点击查看摘要

Abstract:We present Statistical Calibrated Activation Pruning (SCAP), a post-training activation pruning framework that (1) generalizes sparsification by input activations of Fully-Connected layers for generic and flexible application across Transformers, and (2) features a simple Mode-Centering technique to pre-calibrate activation distributions for maximizing post-training sparsity. Our results demonstrate robust Pareto efficiency compared to prior methods, translating to a 1.5x additional LLM decoding speedup against CATS at iso model quality. SCAP effectiveness is empirically verified across a wide range of models, including recent Transformer Decoders, MoE, Mamba2, Encoding Transformer, and pre-quantized models, highlighting its practicality and scalability. The code is available at: this https URL.
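
The mode-centering idea can be sketched in a few lines: estimate the modal activation value, shift by it, and zero out small magnitudes so downstream matmuls run sparse (the constant shift can be folded into the layer's bias). The histogram-based mode estimate and threshold below are illustrative choices, not the paper's calibration procedure.

```python
# A minimal sketch of mode-centering followed by magnitude pruning
# of FC-layer input activations.
import torch

acts = torch.randn(4, 1024) + 0.7                  # activations with nonzero mode
bin_idx = acts.flatten().histc(bins=100).argmax()  # crude mode via histogram bin
lo, hi = acts.min(), acts.max()
mode_val = lo + (bin_idx.float() + 0.5) * (hi - lo) / 100

centered = acts - mode_val                         # mode-centered activations
pruned = torch.where(centered.abs() < 0.5, torch.zeros_like(centered), centered)
# Downstream matmuls run on the sparse `pruned` tensor; the constant
# mode_val shift can be folded into the layer's bias.
print((pruned == 0).float().mean())                # achieved activation sparsity
```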

[AI-34] Reinforcement Learning Policy as Macro Regulator Rather than Macro Placer NEURIPS2024

链接: https://arxiv.org/abs/2412.07167
作者: Ke Xue,Ruo-Tong Chen,Xi Lin,Yunqi Shi,Shixiong Kai,Siyuan Xu,Chao Qian
关键词-EN: significantly influences power, circuit modules, influences power, aims at placing, placing millions
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: NeurIPS 2024

点击查看摘要

Abstract:In modern chip design, placement aims at placing millions of circuit modules, which is an essential step that significantly influences power, performance, and area (PPA) metrics. Recently, reinforcement learning (RL) has emerged as a promising technique for improving placement quality, especially macro placement. However, current RL-based placement methods suffer from long training times, low generalization ability, and inability to guarantee PPA results. A key issue lies in the problem formulation, i.e., using RL to place from scratch, which limits the useful information available and yields inaccurate rewards during the training process. In this work, we propose an approach that utilizes RL for the refinement stage, which allows the RL policy to learn how to adjust existing placement layouts, thereby receiving sufficient information for the policy to act and obtain relatively dense and precise rewards. Additionally, we introduce the concept of regularity during training, which is considered an important metric in the chip design industry but is often overlooked in current RL placement methods. We evaluate our approach on the ISPD 2005 and ICCAD 2015 benchmarks, comparing the global half-perimeter wirelength and regularity of our proposed method against several competitive approaches. Besides, we test the PPA performance using commercial software, showing that RL as a regulator can achieve significant PPA improvements. Our RL regulator can fine-tune placements from any method and enhance their quality. Our work opens up new possibilities for the application of RL in placement, providing a more effective and efficient approach to optimizing chip design. Our code is available at this https URL.

[AI-35] A Method for Evaluating Hyperparameter Sensitivity in Reinforcement Learning

链接: https://arxiv.org/abs/2412.07165
作者: Jacob Adkins,Michael Bowling,Adam White
关键词-EN: modern reinforcement learning, reinforcement learning algorithms, learning algorithms critically, algorithms critically relies, tuning ever-increasing numbers
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:The performance of modern reinforcement learning algorithms critically relies on tuning ever-increasing numbers of hyperparameters. Often, small changes in a hyperparameter can lead to drastic changes in performance, and different environments require very different hyperparameter settings to achieve state-of-the-art performance reported in the literature. We currently lack a scalable and widely accepted approach to characterizing these complex interactions. This work proposes a new empirical methodology for studying, comparing, and quantifying the sensitivity of an algorithm’s performance to hyperparameter tuning for a given set of environments. We then demonstrate the utility of this methodology by assessing the hyperparameter sensitivity of several commonly used normalization variants of PPO. The results suggest that several algorithmic performance improvements may, in fact, be a result of an increased reliance on hyperparameter tuning.
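
One way to operationalize this kind of sensitivity measure is the gap between per-environment-tuned performance and the best single cross-environment setting: the larger the gap, the more the algorithm depends on per-environment tuning. The score matrix below is synthetic, and the exact definition is an illustrative stand-in, not the paper's methodology.

```python
# A minimal sketch of a hyperparameter-sensitivity score across environments.
import numpy as np

# perf[e, h]: performance of hyperparameter setting h in environment e.
perf = np.array([[0.9, 0.5, 0.4],
                 [0.3, 0.8, 0.5],
                 [0.4, 0.6, 0.7]])

per_env_best = perf.max(axis=1).mean()   # tune separately per environment
cross_env = perf.mean(axis=0).max()      # one setting shared by all
sensitivity = per_env_best - cross_env   # larger gap = more sensitive
print(per_env_best, cross_env, sensitivity)
```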

[AI-36] Deep Learning-Enhanced Preconditioning for Efficient Conjugate Gradient Solvers in Large-Scale PDE Systems

链接: https://arxiv.org/abs/2412.07127
作者: Rui Li,Song Wang,Chen Wang
关键词-EN: Preconditioning techniques, partial differential equation, Incomplete Cholesky factorization, Graph Neural Network, crucial for enhancing
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Numerical Analysis (math.NA)
*备注:

点击查看摘要

Abstract:Preconditioning techniques are crucial for enhancing the efficiency of solving large-scale linear equation systems that arise from partial differential equation (PDE) discretization. These techniques, such as Incomplete Cholesky factorization (IC) and data-driven neural network methods, accelerate the convergence of iterative solvers like Conjugate Gradient (CG) by approximating the original matrices. This paper introduces a novel approach that integrates Graph Neural Network (GNN) with traditional IC, addressing the shortcomings of direct generation methods based on GNN and achieving significant improvements in computational efficiency and scalability. Experimental results demonstrate an average reduction in iteration counts by 24.8% compared to IC and a two-order-of-magnitude increase in training scale compared to previous methods. The approach was validated on a three-dimensional static structural analysis using finite element methods, with training sparse matrices of up to 5 million dimensions and inference scales of up to 10 million. Furthermore, the approach demonstrates robust generalization capabilities across scales, facilitating the effective acceleration of CG solvers for large-scale linear equations using small-scale data on modest hardware. The method’s robustness and scalability make it a practical solution for computational science.
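
For context, the classical baseline being accelerated looks like the following: preconditioned CG with an incomplete factorization. SciPy ships ILU (spilu) rather than IC, so it stands in here for the IC-style preconditioner the GNN refines; the matrix is a toy 1D Laplacian.

```python
# A minimal preconditioned Conjugate Gradient solve with an incomplete
# factorization preconditioner (ILU as a stand-in for IC).
import numpy as np
import scipy.sparse as sp
import scipy.sparse.linalg as spla

n = 200
A = sp.diags([-1, 2, -1], [-1, 0, 1], shape=(n, n), format="csc")  # 1D Laplacian
b = np.ones(n)

ilu = spla.spilu(A)                          # incomplete LU factorization
M = spla.LinearOperator((n, n), ilu.solve)   # preconditioner M ~ A^{-1}

x, info = spla.cg(A, b, M=M)
print(info, np.linalg.norm(A @ x - b))       # info == 0 on convergence
```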

[AI-37] On Evaluating the Durability of Safeguards for Open-Weight LLM s

链接: https://arxiv.org/abs/2412.07097
作者: Xiangyu Qi,Boyi Wei,Nicholas Carlini,Yangsibo Huang,Tinghao Xie,Luxi He,Matthew Jagielski,Milad Nasr,Prateek Mittal,Peter Henderson
关键词-EN: large language models, developers to policymakers, seek to minimize, minimize the dual-use, dual-use risks
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Stakeholders – from model developers to policymakers – seek to minimize the dual-use risks of large language models (LLMs). An open challenge to this goal is whether technical safeguards can impede the misuse of LLMs, even when models are customizable via fine-tuning or when model weights are fully open. In response, several recent studies have proposed methods to produce durable LLM safeguards for open-weight LLMs that can withstand adversarial modifications of the model’s weights via fine-tuning. This holds the promise of raising adversaries’ costs even under strong threat models where adversaries can directly fine-tune model weights. However, in this paper, we urge a more careful characterization of the limits of these approaches. Through several case studies, we demonstrate that even evaluating these defenses is exceedingly difficult and can easily mislead audiences into thinking that safeguards are more durable than they really are. We draw lessons from the evaluation pitfalls that we identify and suggest that future research carefully cabin claims to more constrained, well-defined, and rigorously examined threat models, which can provide more useful and candid assessments to stakeholders.

[AI-38] Access Point Deployment for Localizing Accuracy and User Rate in Cell-Free Systems

链接: https://arxiv.org/abs/2412.07094
作者: Fanfei Xu,Shengheng Liu,Zihuan Mao,Shangqing Shi,Dazhuan Xu,Dongming Wang,Yongming Huang
关键词-EN: Evolving next-generation mobile, next-generation mobile networks, provide ubiquitous coverage, Evolving next-generation, next-generation mobile
类目: Networking and Internet Architecture (cs.NI); Artificial Intelligence (cs.AI)
*备注: Presented at MobiCom 2024

点击查看摘要

Abstract:Evolving next-generation mobile networks are designed to provide ubiquitous coverage and networked sensing. With the utility of multi-view sensing and multi-node joint transmission, cell-free is a promising technique to realize this prospect. This paper aims to tackle the problem of access point (AP) deployment in cell-free systems to balance sensing accuracy and user rate. By merging the D-optimality with the Euclidean criterion, a novel integrated metric is proposed as the objective function for both max-sum and max-min problems, which respectively guarantee the overall and lowest performance in multi-user communication and target tracking scenarios. To solve the corresponding high-dimensional non-convex multi-objective problem, Soft Actor-Critic (SAC) is utilized to avoid the risk of locally optimal results. Numerical results demonstrate that the proposed SAC-based AP deployment method achieves a 20% gain in overall performance and a 120% gain in lowest performance.

[AI-39] he Mirage of Artificial Intelligence Terms of Use Restrictions

链接: https://arxiv.org/abs/2412.07066
作者: Peter Henderson,Mark A. Lemley
关键词-EN: commonly attach restrictive, attach restrictive terms, creators commonly attach, commonly attach, attach restrictive
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: Forthcoming Indiana Law Journal

点击查看摘要

Abstract:Artificial intelligence (AI) model creators commonly attach restrictive terms of use to both their models and their outputs. These terms typically prohibit activities ranging from creating competing AI models to spreading disinformation. Often taken at face value, these terms are positioned by companies as key enforceable tools for preventing misuse, particularly in policy dialogs. But are these terms truly meaningful? There are myriad examples where these broad terms are regularly and repeatedly violated. Yet except for some account suspensions on platforms, no model creator has actually tried to enforce these terms with monetary penalties or injunctive relief. This is likely for good reason: we think that the legal enforceability of these licenses is questionable. This Article systematically assesses the enforceability of AI model terms of use and offers three contributions. First, we pinpoint a key problem: the artifacts that they protect, namely model weights and model outputs, are largely not copyrightable, making it unclear whether there is even anything to be licensed. Second, we examine the problems this creates for other enforcement. Recent doctrinal trends in copyright preemption may further undermine state-law claims, while other legal frameworks like the DMCA and CFAA offer limited recourse. Anti-competitive provisions likely fare even worse than responsible use provisions. Third, we provide recommendations to policymakers. There are compelling reasons for many provisions to be unenforceable: they chill good faith research, constrain competition, and create quasi-copyright ownership where none should exist. There are, of course, downsides: model creators have fewer tools to prevent harmful misuse. But we think the better approach is for statutory provisions, not private fiat, to distinguish between good and bad uses of AI, restricting the latter. Comments: Forthcoming Indiana Law Journal Subjects: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Machine Learning (cs.LG) Cite as: arXiv:2412.07066 [cs.CY] (or arXiv:2412.07066v1 [cs.CY] for this version) https://doi.org/10.48550/arXiv.2412.07066 Focus to learn more arXiv-issued DOI via DataCite (pending registration)

[AI-40] Generative AI Impact on Labor Market: Analyzing ChatGPT's Demand in Job Advertisements

链接: https://arxiv.org/abs/2412.07042
作者: Mahdi Ahmadi,Neda Khosh Kheslat,Adebola Akintomide
关键词-EN: reshaping job roles, labor market, rapid advancement, significantly impacting, Gen AI-related skills
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); General Economics (econ.GN)
*备注: 20 pages, 4 figures, 2 tables, submitted to International Journal of Information Management to be reviewed

点击查看摘要

Abstract:The rapid advancement of Generative AI (Gen AI) technologies, particularly tools like ChatGPT, is significantly impacting the labor market by reshaping job roles and skill requirements. This study examines the demand for ChatGPT-related skills in the U.S. labor market by analyzing job advertisements collected from major job platforms between May and December 2023. Using text mining and topic modeling techniques, we extracted and analyzed the Gen AI-related skills that employers are hiring for. Our analysis identified five distinct ChatGPT-related skill sets: general familiarity, creative content generation, marketing, advanced functionalities (such as prompt engineering), and product development. In addition, the study provides insights into job attributes such as occupation titles, degree requirements, salary ranges, and other relevant job characteristics. These findings highlight the increasing integration of Gen AI across various industries, emphasizing the growing need for both foundational knowledge and advanced technical skills. The study offers valuable insights into the evolving demands of the labor market, as employers seek candidates equipped to leverage generative AI tools to improve productivity, streamline processes, and drive innovation.

[AI-41] Sequential Compression Layers for Efficient Federated Learning in Foundational Models

链接: https://arxiv.org/abs/2412.07021
作者: Navyansh Mahla,Sunny Gupta,Amit Sethi
关键词-EN: federated learning context, Federated Learning, multiple nodes, private data, fine-tuning large language
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Federated Learning (FL) has gained popularity for fine-tuning large language models (LLMs) across multiple nodes, each with its own private data. While LoRA has been widely adopted for parameter-efficient federated fine-tuning, recent theoretical and empirical studies highlight its suboptimal performance in the federated learning context. In response, we propose a novel, simple, and more effective parameter-efficient fine-tuning method that does not rely on LoRA. Our approach introduces a small multi-layer perceptron (MLP) layer between two existing MLP layers, the up_proj (the FFN projection layer following the self-attention module) and the down_proj, within the feed-forward network of the transformer block. This solution addresses the bottlenecks associated with LoRA in federated fine-tuning and outperforms recent LoRA-based approaches, demonstrating superior performance for both language models and vision encoders.
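
To make the placement concrete, below is a minimal PyTorch sketch of a transformer FFN with a small bottleneck MLP inserted between the up-projection and down-projection. The dimensions, bottleneck width, GELU activations, residual wiring, and freezing policy are our assumptions for illustration; the authors' exact design may differ.

```python
import torch
import torch.nn as nn

class FFNWithInsertedMLP(nn.Module):
    """Feed-forward block with a small trainable MLP squeezed between
    up_proj and down_proj. In a federated fine-tuning setup one would
    freeze up_proj/down_proj and train only the inserted MLP (assumed)."""
    def __init__(self, d_model=768, d_ff=3072, d_bottleneck=64):
        super().__init__()
        self.up_proj = nn.Linear(d_model, d_ff)     # frozen pre-trained weight
        self.inserted = nn.Sequential(              # the new trainable layer
            nn.Linear(d_ff, d_bottleneck),
            nn.GELU(),
            nn.Linear(d_bottleneck, d_ff),
        )
        self.down_proj = nn.Linear(d_ff, d_model)   # frozen pre-trained weight
        for p in self.up_proj.parameters():
            p.requires_grad = False
        for p in self.down_proj.parameters():
            p.requires_grad = False

    def forward(self, x):
        h = torch.nn.functional.gelu(self.up_proj(x))
        h = h + self.inserted(h)   # residual keeps the frozen path intact
        return self.down_proj(h)

x = torch.randn(2, 16, 768)
print(FFNWithInsertedMLP()(x).shape)  # torch.Size([2, 16, 768])
```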

[AI-42] Extreme AutoML: Analysis of Classification, Regression and NLP Performance

链接: https://arxiv.org/abs/2412.07000
作者: Edward Ratner,Elliot Farmer,Christopher Douglas,Amaury Lendasse
关键词-EN: Utilizing machine learning, Utilizing machine, required choosing hyperparameters, required choosing, Deep Learning
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Utilizing machine learning techniques has always required choosing hyperparameters. This is true whether one uses a classical technique such as a KNN or very modern neural networks such as Deep Learning. Though in many applications, hyperparameters are chosen by hand, automated methods have become increasingly more common. These automated methods have become collectively known as automated machine learning, or AutoML. Several automated selection algorithms have shown similar or improved performance over state-of-the-art methods. This breakthrough has led to the development of cloud-based services like Google AutoML, which is based on Deep Learning and is widely considered to be the industry leader in AutoML services. Extreme Learning Machines (ELMs) use a fundamentally different type of neural architecture, producing better results at a significantly discounted computational cost. We benchmark the Extreme AutoML technology against Google’s AutoML using several popular classification data sets from the University of California at Irvine’s (UCI) repository, and several other data sets, observing significant advantages for Extreme AutoML in accuracy, Jaccard Indices, the variance of Jaccard Indices across classes (i.e. class variance) and training times.
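
Since the comparison hinges on ELMs' unusual architecture, a minimal generic ELM classifier is sketched below: the hidden layer is random and never trained, and only the output weights are solved in closed form, which is what makes training so cheap. This is our own textbook toy, not the benchmarked Extreme AutoML product; the hidden width and tanh activation are arbitrary choices.

```python
import numpy as np

class ELMClassifier:
    """Minimal Extreme Learning Machine: random hidden weights are never
    trained; only the output layer is solved by least squares."""
    def __init__(self, n_hidden=256, seed=0):
        self.n_hidden = n_hidden
        self.rng = np.random.default_rng(seed)

    def fit(self, X, y):
        n_classes = y.max() + 1
        self.W = self.rng.normal(size=(X.shape[1], self.n_hidden))
        self.b = self.rng.normal(size=self.n_hidden)
        H = np.tanh(X @ self.W + self.b)      # random feature map
        T = np.eye(n_classes)[y]              # one-hot targets
        self.beta = np.linalg.pinv(H) @ T     # closed-form least-squares solve
        return self

    def predict(self, X):
        H = np.tanh(X @ self.W + self.b)
        return (H @ self.beta).argmax(axis=1)

# tiny smoke test on two synthetic blobs
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (50, 4)), rng.normal(3, 1, (50, 4))])
y = np.array([0] * 50 + [1] * 50)
print((ELMClassifier().fit(X, y).predict(X) == y).mean())
```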

[AI-43] Toward AI-Driven Digital Organism: Multiscale Foundation Models for Predicting, Simulating and Programming Biology at All Levels

链接: https://arxiv.org/abs/2412.06993
作者: Le Song,Eran Segal,Eric Xing
关键词-EN: simulate biology, AI-Driven Digital Organism, biology, AIDO, Digital Organism
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Quantitative Methods (q-bio.QM)
*备注:

点击查看摘要

Abstract:We present an approach of using AI to model and simulate biology and life. Why is it important? Because at the core of medicine, pharmacy, public health, longevity, agriculture and food security, environmental protection, and clean energy, it is biology at work. Biology in the physical world is too complex to manipulate and always expensive and risky to tamper with. In this perspective, we lay out an engineering-viable approach to address this challenge by constructing an AI-Driven Digital Organism (AIDO), a system of integrated multiscale foundation models, in a modular, connectable, and holistic fashion to reflect biological scales, connectedness, and complexities. An AIDO opens up a safe, affordable and high-throughput alternative platform for predicting, simulating and programming biology at all levels from molecules to cells to individuals. We envision that an AIDO is poised to trigger a new wave of better-guided wet-lab experimentation and better-informed first-principle reasoning, which can eventually help us better decode and improve life.

[AI-44] Learning about algorithm auditing in five steps: scaffolding how high school youth can systematically and critically evaluate machine learning applications

链接: https://arxiv.org/abs/2412.06989
作者: Luis Morales-Navarro,Yasmin B. Kafai,Lauren Vogelstein,Evelyn Yu,Danaë Metaxa
关键词-EN: critically evaluate machine, evaluate machine learning-powered, machine learning-powered systems, supporting young people, widespread interest
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
*备注:

点击查看摘要

Abstract:While there is widespread interest in supporting young people to critically evaluate machine learning-powered systems, there is little research on how we can support them in inquiring about how these systems work and what their limitations and implications may be. Outside of K-12 education, an effective strategy in evaluating black-boxed systems is algorithm auditing, a method for understanding algorithmic systems' opaque inner workings and external impacts from the outside in. In this paper, we review how expert researchers conduct algorithm audits and how end users engage in auditing practices to propose five steps that, when incorporated into learning activities, can support young people in auditing algorithms. We present a case study of a team of teenagers engaging with each step during an out-of-school workshop in which they audited peer-designed generative AI TikTok filters. We discuss the kind of scaffolds we provided to support youth in algorithm auditing and directions and challenges for integrating algorithm auditing into classroom activities. This paper contributes: (a) a conceptualization of five steps to scaffold algorithm auditing learning activities, and (b) examples of how youth engaged with each step during our pilot study.

[AI-45] Enhancing operational wind downscaling capabilities over Canada: Application of a Conditional Wasserstein GAN methodology

链接: https://arxiv.org/abs/2412.06958
作者: Jorge Guevara,Victor Nascimento,Johannes Schmude,Daniel Salles,Simon Corbeil-Létourneau,Madalina Surcel,Dominique Brunet
关键词-EN: Numerical Weather Prediction, Deterministic Prediction System, operational Numerical Weather, Numerical Weather, High-Resolution Deterministic Prediction
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Wind downscaling is essential for improving the spatial resolution of weather forecasts, particularly in operational Numerical Weather Prediction (NWP). This study advances wind downscaling by extending the DownGAN framework introduced by Annau et al. to operational datasets from the Global Deterministic Prediction System (GDPS) and High-Resolution Deterministic Prediction System (HRDPS), covering the entire Canadian domain. We enhance the model by incorporating high-resolution static covariates, such as HRDPS-derived topography, into a Conditional Wasserstein Generative Adversarial Network with Gradient Penalty, implemented using a UNET-based generator. Following the DownGAN framework, our methodology integrates low-resolution GDPS forecasts (15 km, 10-day horizon) and high-resolution HRDPS forecasts (2.5 km, 48-hour horizon) with Frequency Separation techniques adapted from computer vision. Through robust training and inference over the Canadian region, we demonstrate the operational scalability of our approach, achieving significant improvements in wind downscaling accuracy. Statistical validation highlights reductions in root mean square error (RMSE) and log spectral distance (LSD) metrics compared to the original DownGAN. High-resolution conditioning covariates and Frequency Separation strategies prove instrumental in enhancing model performance. This work underscores the potential for extending high-resolution wind forecasts beyond the 48-hour horizon, bridging the gap to the 10-day low-resolution global forecast window.
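
The gradient-penalty term at the heart of a Conditional Wasserstein GAN with Gradient Penalty can be written compactly in PyTorch; a generic version is sketched below. The critic interface and conditioning-by-concatenation are assumptions for illustration, not the paper's exact architecture.

```python
import torch

def gradient_penalty(critic, real, fake, cond, lambda_gp=10.0):
    """WGAN-GP penalty: the critic's gradient norm at random interpolates
    between real and fake samples is pushed toward 1. `cond` carries the
    low-resolution conditioning fields (assumed concatenated channel-wise)."""
    eps = torch.rand(real.size(0), 1, 1, 1, device=real.device)
    interp = (eps * real + (1 - eps) * fake).requires_grad_(True)
    score = critic(torch.cat([interp, cond], dim=1))
    grads = torch.autograd.grad(outputs=score.sum(), inputs=interp,
                                create_graph=True)[0]
    grad_norm = grads.flatten(1).norm(2, dim=1)
    return lambda_gp * ((grad_norm - 1) ** 2).mean()

# smoke test with a trivial critic over 2-channel 16x16 "wind" fields
critic = torch.nn.Sequential(torch.nn.Conv2d(3, 8, 3, padding=1),
                             torch.nn.Flatten(),
                             torch.nn.Linear(8 * 16 * 16, 1))
real = torch.randn(4, 2, 16, 16)
fake = torch.randn(4, 2, 16, 16)
cond = torch.randn(4, 1, 16, 16)
print(gradient_penalty(critic, real, fake, cond))
```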

[AI-46] Bridging Conversational and Collaborative Signals for Conversational Recommendation

链接: https://arxiv.org/abs/2412.06949
作者: Ahmad Bin Rabiah,Nafis Sadeq,Julian McAuley
关键词-EN: leverage contextual information, capture user-item interaction, user-item interaction patterns, interaction patterns essential, leverage contextual
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Conversational recommendation systems (CRS) leverage contextual information from conversations to generate recommendations but often struggle due to a lack of collaborative filtering (CF) signals, which capture user-item interaction patterns essential for accurate recommendations. We introduce Reddit-ML32M, a dataset that links Reddit conversations with interactions on MovieLens 32M, to enrich item representations by leveraging collaborative knowledge and addressing interaction sparsity in conversational datasets. We propose an LLM-based framework that uses Reddit-ML32M to align LLM-generated recommendations with CF embeddings, refining rankings for better performance. We evaluate our framework against three sets of baselines: CF-based recommenders using only interactions from CRS tasks, traditional CRS models, and LLM-based methods relying on conversational context without item representations. Our approach achieves consistent improvements, including a 12.32% increase in Hit Rate and a 9.9% improvement in NDCG, outperforming the best-performing baseline that relies on conversational context but lacks collaborative item representations.
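
One simple way to combine conversational and collaborative signals, in the spirit of the abstract, is to rerank LLM-generated candidates by similarity to a CF embedding of the user's interaction history. The blend weight and scoring below are our own assumptions for illustration, not the paper's actual framework.

```python
import numpy as np

def rerank(llm_candidates, llm_scores, item_emb, history, beta=0.5):
    """Blend an LLM's ranking score with a collaborative-filtering score:
    cosine similarity between each candidate's CF embedding and the mean
    embedding of the user's interaction history (all choices assumed)."""
    user_vec = item_emb[history].mean(axis=0)
    user_vec /= np.linalg.norm(user_vec) + 1e-9
    cand = item_emb[llm_candidates]
    cand = cand / (np.linalg.norm(cand, axis=1, keepdims=True) + 1e-9)
    cf_scores = cand @ user_vec
    blended = beta * np.asarray(llm_scores) + (1 - beta) * cf_scores
    order = np.argsort(-blended)
    return [llm_candidates[i] for i in order]

rng = np.random.default_rng(0)
item_emb = rng.normal(size=(100, 16))   # pretrained CF item embeddings (toy)
print(rerank(llm_candidates=[5, 17, 42], llm_scores=[0.9, 0.6, 0.8],
             item_emb=item_emb, history=[3, 17, 55]))
```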

[AI-47] PyraNet: A Multi-Layered Hierarchical Dataset for Verilog

链接: https://arxiv.org/abs/2412.06947
作者: Bardia Nadimi,Ghali Omar Boutaib,Hao Zheng
关键词-EN: leveraging Large Language, Large Language Models, Large Language, Verilog code generation, leveraging Large
类目: Hardware Architecture (cs.AR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Programming Languages (cs.PL)
*备注:

点击查看摘要

Abstract:Recently, there has been a growing interest in leveraging Large Language Models for Verilog code generation. However, the current quality of the generated Verilog code remains suboptimal. This is largely due to the absence of well-defined, well-organized datasets with high-quality samples, as well as a lack of innovative fine-tuning methods and models specifically trained on Verilog. In this paper, we introduce a novel open-source dataset and a corresponding fine-tuning technique, which utilizes a multi-layered structure that we refer to as PyraNet. Our experiments demonstrate that employing the proposed dataset and fine-tuning approach leads to a more accurate fine-tuned model, producing syntactically and functionally correct Verilog code. The evaluation results show improvements of up to 32.6% in comparison to the CodeLlama-7B baseline model and up to 16.7% in comparison to the state-of-the-art models using the VerilogEval evaluation platform.

[AI-48] Creating a Cooperative AI Policymaking Platform through Open Source Collaboration

链接: https://arxiv.org/abs/2412.06936
作者: Aiden Lewington,Alekhya Vittalam,Anshumaan Singh,Anuja Uppuluri,Arjun Ashok,Ashrith Mandayam Athmaram,Austin Milt,Benjamin Smith,Charlie Weinberger,Chatanya Sarin,Christoph Bergmeir,Cliff Chang,Daivik Patel,Daniel Li,David Bell,Defu Cao,Donghwa Shin,Edward Kang,Edwin Zhang,Enhui Li,Felix Chen,Gabe Smithline,Haipeng Chen,Henry Gasztowtt,Hoon Shin,Jiayun Zhang,Joshua Gray,Khai Hern Low,Kishan Patel,Lauren Hannah Cooke,Marco Burstein,Maya Kalapatapu,Mitali Mittal,Raymond Chen,Rosie Zhao,Sameen Majid,Samya Potlapalli,Shang Wang,Shrenik Patel,Shuheng Li,Siva Komaragiri,Song Lu,Sorawit Siangjaeo,Sunghoo Jung,Tianyu Zhang,Valery Mao,Vikram Krishnakumar,Vincent Zhu,Wesley Kam,Xingzhe Li,Yumeng Liu
关键词-EN: promote equitable benefits, present significant risks, requiring improved governance, mitigate societal harms, Advances in artificial
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Advances in artificial intelligence (AI) present significant risks and opportunities, requiring improved governance to mitigate societal harms and promote equitable benefits. Current incentive structures and regulatory delays may hinder responsible AI development and deployment, particularly in light of the transformative potential of large language models (LLMs). To address these challenges, we propose the following three contributions: (1) a large multimodal text and economic-timeseries foundation model that integrates economic and natural language policy data for enhanced forecasting and decision-making, (2) algorithmic mechanisms for eliciting diverse and representative perspectives, enabling the creation of data-driven public policy recommendations, and (3) an AI-driven web platform for supporting transparent, inclusive, and data-driven policymaking.

[AI-49] VQ4ALL: Efficient Neural Network Representation via a Universal Codebook

链接: https://arxiv.org/abs/2412.06875
作者: Juncan Deng,Shuaiting Li,Zeyu Wang,Hong Gu,Kedong Xu,Kejie Huang
关键词-EN: models puts forward, network models puts, big neural network, rapid growth, puts forward
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:The rapid growth of big neural network models puts forward new requirements for lightweight network representation methods. Traditional methods based on model compression have achieved great success, especially vector quantization (VQ) technology, which realizes a high compression ratio by sharing codewords. However, because each layer of the network needs to build its own code table, traditional top-down compression lacks attention to the underlying commonalities, resulting in a limited compression rate and frequent memory accesses. In this paper, we propose a bottom-up method to share a universal codebook among multiple neural networks, which not only effectively reduces the number of codebooks but also further reduces memory access and chip area by storing static code tables in built-in ROM. Specifically, we introduce VQ4ALL, a VQ-based method that utilizes codewords to enable the construction of various neural networks and achieve efficient representations. The core idea of our method is to adopt a kernel density estimation approach to extract a universal codebook and then progressively construct different low-bit networks by updating differentiable assignments. Experimental results demonstrate that VQ4ALL achieves compression rates exceeding 16× while preserving high accuracy across multiple network architectures, highlighting its effectiveness and versatility.
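
Our reading of the codebook-extraction step can be illustrated with a small sketch: pool weights from several networks, fit a kernel density estimate, and take the density's highest peaks as shared codewords. The grid size, peak picking, and quantile fallback are all our assumptions; the paper's differentiable assignment updates are not shown.

```python
import numpy as np
from scipy.stats import gaussian_kde

def universal_codebook(weight_arrays, n_codes=16):
    """Extract a shared codebook as local maxima of a kernel density
    estimate over weights pooled from several networks (assumed reading
    of the abstract's KDE step)."""
    pooled = np.concatenate([w.ravel() for w in weight_arrays])
    kde = gaussian_kde(pooled)
    grid = np.linspace(pooled.min(), pooled.max(), 2048)
    dens = kde(grid)
    peaks = np.where((dens[1:-1] > dens[:-2]) & (dens[1:-1] > dens[2:]))[0] + 1
    peaks = peaks[np.argsort(-dens[peaks])][:n_codes]
    if len(peaks) < n_codes:   # density too smooth: fall back to quantiles
        return np.sort(np.quantile(pooled, np.linspace(0.01, 0.99, n_codes)))
    return np.sort(grid[peaks])

def assign(weights, codebook):
    """Quantize: replace each weight by the index of its nearest codeword."""
    idx = np.abs(weights.ravel()[:, None] - codebook[None, :]).argmin(axis=1)
    return idx.reshape(weights.shape)

rng = np.random.default_rng(0)
nets = [rng.normal(0, 0.05, (64, 64)), rng.normal(0, 0.1, (128, 32))]
cb = universal_codebook(nets, n_codes=8)
print(cb.round(3))
print(assign(nets[0], cb)[:2, :4])   # indices into the shared codebook
```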

[AI-50] Predicting Subway Passenger Flows under Incident Situation with Causality

链接: https://arxiv.org/abs/2412.06871
作者: Xiannan Huang,Shuhan Qiu,Quan Yuan,Chao Yang
关键词-EN: rail transit operations, limited research addressing, causal effect prediction, research addressing incident, effect prediction model
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:In the context of rail transit operations, real-time passenger flow prediction is essential; however, most models primarily focus on normal conditions, with limited research addressing incident situations. There are several intrinsic challenges associated with prediction during incidents, such as a lack of interpretability and data scarcity. To address these challenges, we propose a two-stage method that separates predictions under normal conditions and the causal effects of incidents. First, a normal prediction model is trained using data from normal situations. Next, the synthetic control method is employed to identify the causal effects of incidents, combined with placebo tests to determine the significance of these effects. The significant effects are then utilized to train a causal effect prediction model, which can forecast the impact of incidents based on features of the incidents and passenger flows. During the prediction phase, the results from both the normal situation model and the causal effect prediction model are integrated to generate final passenger flow predictions during incidents. Our approach is validated using real-world data, demonstrating improved accuracy. Furthermore, the two-stage methodology enhances interpretability. By analyzing the causal effect prediction model, we can identify key influencing factors related to the effects of incidents and gain insights into their underlying mechanisms. Our work can assist subway system managers in estimating passenger flow affected by incidents and enable them to take proactive measures. Additionally, it can deepen researchers' understanding of the impact of incidents on subway passenger flows.
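
The synthetic control step can be illustrated with a small optimization: find non-negative weights, summing to one, over unaffected "donor" stations so that their combined flow tracks the affected station before the incident; the post-incident gap then estimates the causal effect. This is the generic method with made-up station data, not the paper's pipeline.

```python
import numpy as np
from scipy.optimize import minimize

def synthetic_control(treated, donors, t0):
    """Fit donor weights on pre-incident data [0, t0); the post-incident
    gap between actual and synthetic flow estimates the incident effect."""
    k = donors.shape[0]

    def loss(w):
        return np.sum((treated[:t0] - w @ donors[:, :t0]) ** 2)

    res = minimize(loss, np.full(k, 1.0 / k), method="SLSQP",
                   bounds=[(0, 1)] * k,
                   constraints={"type": "eq", "fun": lambda w: w.sum() - 1})
    synthetic = res.x @ donors
    effect = treated[t0:] - synthetic[t0:]
    return res.x, effect

rng = np.random.default_rng(0)
donors = rng.normal(100, 5, size=(5, 60))          # 5 unaffected stations
treated = 0.4 * donors[0] + 0.6 * donors[3] + rng.normal(0, 1, 60)
treated[40:] -= 30                                  # incident at t=40 drops flow
w, effect = synthetic_control(treated, donors, t0=40)
print(w.round(2), effect.mean().round(1))           # effect close to -30
```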

[AI-51] Lossless Model Compression via Joint Low-Rank Factorization Optimization

链接: https://arxiv.org/abs/2412.06867
作者: Boyang Zhang,Daning Cheng,Yunquan Zhang,Fangmin Liu,Jiake Tian
关键词-EN: Low-rank factorization, original weight matrices, model, factorization, low-rank weight factorization
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computational Complexity (cs.CC)
*备注: Under Review

点击查看摘要

Abstract:Low-rank factorization is a popular model compression technique that minimizes the error δ between approximated and original weight matrices. Despite achieving performances close to the original models when δ is optimized, a performance discrepancy remains due to the separate optimization processes for low-rank factorization and model performance, resulting in unavoidable losses. We address this issue by introducing a novel joint optimization strategy for lossless low-rank weight factorization, which, for the first time, enhances the model's performance beyond the original. Our approach begins with a theoretical analysis of the relationship between low-rank factorization and model optimization objectives, establishing a precise perturbation range for matrix factorization errors on model performance. The challenge is then reformulated as a numerical rank-deficiency problem with inequality constraints, and we develop a joint objective that simultaneously addresses factorization error and model performance. Based on the above analysis, we propose two optimization algorithms: a lossless optimization algorithm that maximizes model accuracy while ensuring compression, and a compact optimization algorithm that minimizes model size while preserving performance. These algorithms do not require fine-tuning and can directly compress numerous deep models to achieve lossless results. Our methods demonstrate robust efficacy across various vision and language tasks. For example, a ResNext50 compressed by 70% outperforms the original. Our code will be made public.
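
For reference, the "separate optimization" baseline the abstract argues against is classical truncated SVD, which minimizes only the factorization error and ignores the end task. A minimal sketch:

```python
import numpy as np

def low_rank_factorize(W, rank):
    """Classical low-rank compression: truncated SVD minimizes the
    factorization error ||W - UV|| in Frobenius norm. The paper's point
    is that minimizing this error alone is not the same as preserving
    (or improving) end-task performance."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    U_r = U[:, :rank] * s[:rank]   # fold singular values into U
    V_r = Vt[:rank, :]
    return U_r, V_r

rng = np.random.default_rng(0)
W = rng.normal(size=(512, 512))
U_r, V_r = low_rank_factorize(W, rank=64)
err = np.linalg.norm(W - U_r @ V_r) / np.linalg.norm(W)
params = U_r.size + V_r.size
print(f"relative error {err:.3f}, params {params} vs {W.size}")
```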

[AI-52] LMS-AutoTSF: Learnable Multi-Scale Decomposition and Integrated Autocorrelation for Time Series Forecasting

链接: https://arxiv.org/abs/2412.06866
作者: Ibrahim Delibasoglu,Sanjay Chakraborty,Fredrik Heintz
关键词-EN: stock market analysis, industrial process analysis, Time series forecasting, market analysis, process analysis
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Time series forecasting is an important challenge with significant applications in areas such as weather prediction, stock market analysis, scientific simulations and industrial process analysis. In this work, we introduce LMS-AutoTSF, a novel time series forecasting architecture that incorporates autocorrelation while leveraging dual encoders operating at multiple scales. Unlike models that rely on predefined trend and seasonal components, LMS-AutoTSF employs two separate encoders per scale: one focusing on low-pass filtering to capture trends and the other utilizing high-pass filtering to model seasonal variations. These filters are learnable, allowing the model to dynamically adapt and isolate trend and seasonal components directly in the frequency domain. A key innovation in our approach is the integration of autocorrelation, achieved by computing lagged differences in time steps, which enables the model to capture dependencies across time more effectively. Each encoder processes the input through fully connected layers to handle temporal and channel interactions. By combining frequency-domain filtering, autocorrelation-based temporal modeling, and channel-wise transformations, LMS-AutoTSF not only accurately captures long-term dependencies and fine-grained patterns but also operates more efficiently compared to other state-of-the-art methods. Its lightweight design ensures faster processing while maintaining high precision in forecasting across diverse time horizons. The source code is publicly available at this http URL
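
The learnable-filter idea can be sketched as a soft mask over rFFT frequency bins that splits a series into trend and seasonal parts. This is our own minimal reading of the abstract; the real model's per-scale dual encoders and autocorrelation features are not shown, and the sigmoid-mask parameterization is an assumption.

```python
import torch
import torch.nn as nn

class LearnableFreqFilter(nn.Module):
    """Splits a series into 'trend' and 'seasonal' components with a
    learnable soft mask over rFFT frequency bins, instead of fixed
    low-/high-pass cutoffs."""
    def __init__(self, seq_len):
        super().__init__()
        self.logits = nn.Parameter(torch.zeros(seq_len // 2 + 1))

    def forward(self, x):                      # x: (batch, seq_len)
        spec = torch.fft.rfft(x, dim=-1)
        mask = torch.sigmoid(self.logits)      # near 1 keeps a bin (trend side)
        trend = torch.fft.irfft(spec * mask, n=x.size(-1), dim=-1)
        seasonal = torch.fft.irfft(spec * (1 - mask), n=x.size(-1), dim=-1)
        return trend, seasonal

x = torch.sin(torch.linspace(0, 12.56, 96)).repeat(4, 1) + 0.1 * torch.randn(4, 96)
trend, seasonal = LearnableFreqFilter(seq_len=96)(x)
print(trend.shape, seasonal.shape)   # torch.Size([4, 96]) twice
```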

[AI-53] FP=xINT: A Low-Bit Series Expansion Algorithm for Post-Training Quantization

链接: https://arxiv.org/abs/2412.06865
作者: Boyang Zhang,Daning Cheng,Yunquan Zhang,Fangmin Liu
关键词-EN: converts pre-trained Full-Precision, converts pre-trained, pre-trained Full-Precision, versions without training, quantized versions
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: Under Review

点击查看摘要

Abstract:Post-Training Quantization (PTQ) converts pre-trained Full-Precision (FP) models into quantized versions without training. While existing methods reduce size and computational costs, they also significantly degrade performance and quantization efficiency at extremely low settings due to quantization noise. We introduce a deep model series expansion framework to address this issue, enabling rapid and accurate approximation of unquantized models without calibration sets or fine-tuning. This is the first use of series expansion for neural network quantization. Specifically, our method expands the FP model into multiple low-bit basis models. To ensure accurate quantization, we develop low-bit basis model expansions at different granularities (tensor, layer, model), and theoretically confirm their convergence to the dense model, thus restoring FP model accuracy. Additionally, we design AbelianAdd/Mul operations between isomorphic models in the low-bit expansion, forming an Abelian group to ensure operation parallelism and commutativity. The experiments show that our algorithm achieves state-of-the-art performance in low-bit settings; for example, 4-bit quantization of ResNet-50 surpasses the original accuracy, reaching 77.03%. The code will be made public.
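
The series-expansion idea can be illustrated with a simple residual quantization: quantize the FP weights to a low-bit grid, then quantize the remaining residual, and so on, so the sum of low-bit "basis" terms converges toward the dense weights. This toy version ignores the paper's granularities and AbelianAdd/Mul operations; the uniform symmetric quantizer is our assumption.

```python
import numpy as np

def quantize(w, bits=4):
    """Uniform symmetric quantizer returning the dequantized low-bit copy."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(w).max() / qmax + 1e-12
    return np.clip(np.round(w / scale), -qmax - 1, qmax) * scale

def series_expansion(w, bits=4, terms=3):
    """Expand FP weights into a sum of low-bit basis terms: each new term
    quantizes the residual left by the partial sum so far."""
    basis, residual = [], w.copy()
    for _ in range(terms):
        q = quantize(residual, bits)
        basis.append(q)
        residual = residual - q
    return basis

rng = np.random.default_rng(0)
w = rng.normal(0, 0.1, size=(256, 256))
basis = series_expansion(w, bits=4, terms=3)
approx = np.zeros_like(w)
for i, q in enumerate(basis, 1):
    approx += q
    err = np.linalg.norm(w - approx) / np.linalg.norm(w)
    print(f"{i} term(s): relative error {err:.4f}")   # error shrinks per term
```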

[AI-54] Mining Limited Data Sufficiently: A BERT-inspired Approach for CSI Time Series Application in Wireless Communication and Sensing

链接: https://arxiv.org/abs/2412.06861
作者: Zijian Zhao,Fanyi Meng,Hang Li,Xiaoyang Li,Guangxu Zhu
关键词-EN: Channel State Information, Channel State, CSI, State Information, CSI prediction
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Signal Processing (eess.SP)
*备注:

点击查看摘要

Abstract:Channel State Information (CSI) is the cornerstone in both wireless communication and sensing systems. In wireless communication systems, CSI provides essential insights into channel conditions, enabling system optimizations like channel compensation and dynamic resource allocation. However, the high computational complexity of CSI estimation algorithms necessitates the development of fast deep learning methods for CSI prediction. In wireless sensing systems, CSI can be leveraged to infer environmental changes, facilitating various functions, including gesture recognition and people identification. Deep learning methods have demonstrated significant advantages over model-based approaches in these fine-grained CSI classification tasks, particularly when classes vary across different scenarios. However, a major challenge in training deep learning networks for wireless systems is the limited availability of data, further complicated by the diverse formats of many public datasets, which hinder integration. Additionally, collecting CSI data can be resource-intensive, requiring considerable time and manpower. To address these challenges, we propose CSI-BERT2 for CSI prediction and classification tasks, effectively utilizing limited data through a pre-training and fine-tuning approach. Building on CSI-BERT1, we enhance the model architecture by introducing an Adaptive Re-Weighting Layer (ARL) and a Multi-Layer Perceptron (MLP) to better capture sub-carrier and timestamp information, effectively addressing the permutation-invariance problem. Furthermore, we propose a Mask Prediction Model (MPM) fine-tuning method to improve the model’s adaptability for CSI prediction tasks. Experimental results demonstrate that CSI-BERT2 achieves state-of-the-art performance across all tasks.

[AI-55] Balancing Efficiency and Effectiveness: An LLM-Infused Approach for Optimized CTR Prediction

链接: https://arxiv.org/abs/2412.06860
作者: Guoxiao Zhang,Yi Wei,Yadong Zhang,Huajian Feng,Qiang Liu
关键词-EN: deep semantic information, Click-Through Rate, shaping user decisions, semantic information plays, semantic information
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: 5 pages, 4 figures,4 tables

点击查看摘要

Abstract:Click-Through Rate (CTR) prediction is essential in online advertising, where semantic information plays a pivotal role in shaping user decisions and enhancing CTR effectiveness. Capturing and modeling deep semantic information, such as a user's preference for “Häagen-Dazs' HEAVEN strawberry light ice cream” due to its health-conscious and premium attributes, is challenging. Traditional semantic modeling often overlooks these intricate details at the user and item levels. To bridge this gap, we introduce a novel approach that models deep semantic information end-to-end, leveraging the comprehensive world knowledge capabilities of Large Language Models (LLMs). Our proposed LLM-infused CTR prediction framework (Multi-level Deep Semantic Information Infused CTR model via Distillation, MSD) is designed to uncover deep semantic insights by utilizing LLMs to extract and distill critical information into a smaller, more efficient model, enabling seamless end-to-end training and inference. Importantly, our framework is carefully designed to balance efficiency and effectiveness, ensuring that the model not only achieves high performance but also operates with optimal resource utilization. Online A/B tests conducted on the Meituan sponsored-search system demonstrate that our method significantly outperforms baseline models in terms of Cost Per Mile (CPM) and CTR, validating its effectiveness, scalability, and balanced approach in real-world applications.

[AI-56] Incentivized Symbiosis: A Paradigm for Human-Agent Coevolution

链接: https://arxiv.org/abs/2412.06855
作者: Tomer Jordi Chaffer,Justin Goldston,Gemach D.A.T.A. I
关键词-EN: Incentivized Symbiosis, Incentivized Symbiosis model, Cooperation, conceptualize Incentivized Symbiosis, Incentivized
类目: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Cooperation is vital to our survival and progress. Evolutionary game theory offers a lens to understand the structures and incentives that enable cooperation to be a successful strategy. As artificial intelligence agents become integral to human systems, the dynamics of cooperation take on unprecedented significance. Decentralized frameworks like Web3, grounded in transparency, accountability, and trust, offer a foundation for fostering cooperation by establishing enforceable rules and incentives for humans and AI agents. Guided by our Incentivized Symbiosis model, a paradigm aligning human and AI agent goals through bidirectional incentives and mutual adaptation, we investigate mechanisms for embedding cooperation into human-agent coevolution. We conceptualize Incentivized Symbiosis as part of a contemporary moral framework inspired by Web3 principles, encoded in blockchain technology to define and enforce rules, incentives, and consequences for both humans and AI agents. By integrating these principles into the very architecture of human-agent interactions, Web3 ecosystems catalyze an environment ripe for collaborative innovation. Our study traverses several transformative applications of Incentivized Symbiosis, from decentralized finance to governance and cultural adaptation, illustrating how AI agents can coevolve with humans to forge a trajectory of shared, sustainable progress.

[AI-57] Tube Loss: A Novel Approach for Prediction Interval Estimation and Probabilistic Forecasting

链接: https://arxiv.org/abs/2412.06853
作者: Pritam Anand,Tathagata Bandyopadhyay,Suresh Chandra
关键词-EN: time series data, Tube Loss based, Tube Loss, series data solving, Prediction Interval
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:This paper proposes a novel loss function, called 'Tube Loss', for the simultaneous estimation of the bounds of a Prediction Interval (PI) in the regression setup, and also for generating probabilistic forecasts from time series data by solving a single optimization problem. The PIs obtained by minimizing the empirical risk based on the Tube Loss are shown to be of better quality than the PIs obtained by the existing methods in the following sense. First, it yields intervals that attain the prespecified confidence level t ∈ (0, 1) asymptotically. A theoretical proof of this fact is given. Secondly, the user is allowed to move the interval up or down by controlling the value of a parameter. This helps the user to choose a PI capturing denser regions of the probability distribution of the response variable inside the interval, and thus, sharpening its width. This is shown to be especially useful when the conditional distribution of the response variable is skewed. Further, the Tube Loss based PI estimation method can trade-off between the coverage and the average width by solving a single optimization problem. It enables further reduction of the average width of PI through re-calibration. Also, unlike a few existing PI estimation methods, the gradient descent (GD) method can be used for minimization of the empirical risk. Finally, through extensive experimentation, we have shown the efficacy of the Tube Loss based PI estimation in kernel machines, neural networks and deep networks and also for probabilistic forecasting tasks. The codes of the experiments are available at this https URL
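
A minimal way to train interval bounds toward a coverage target is a pair of pinball (quantile) losses, sketched below. We stress this is a stand-in that illustrates the flavor of PI estimation by empirical risk minimization and gradient descent; it is not the paper's exact Tube Loss, which couples both bounds in one objective and adds a shift parameter.

```python
import torch

def pinball(err, tau):
    """Standard quantile (pinball) loss for residuals err = y - bound."""
    return torch.maximum(tau * err, (tau - 1) * err).mean()

def interval_loss(y, lower, upper, t=0.9):
    """Push lower/upper toward the (1-t)/2 and (1+t)/2 quantiles so the
    interval asymptotically covers a fraction t of the data."""
    return pinball(y - lower, (1 - t) / 2) + pinball(y - upper, (1 + t) / 2)

# tiny demo: constant bounds fitted to N(0,1) samples by gradient descent
torch.manual_seed(0)
y = torch.randn(5000)
lower = torch.tensor(0.0, requires_grad=True)
upper = torch.tensor(0.0, requires_grad=True)
opt = torch.optim.SGD([lower, upper], lr=0.1)
for _ in range(2000):
    opt.zero_grad()
    interval_loss(y, lower, upper, t=0.9).backward()
    opt.step()
coverage = ((y >= lower) & (y <= upper)).float().mean().item()
print(round(lower.item(), 2), round(upper.item(), 2), coverage)  # ~ -1.64 1.64 0.9
```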

[AI-58] EGEAN: An Exposure-Guided Embedding Alignment Network for Post-Click Conversion Estimation

链接: https://arxiv.org/abs/2412.06852
作者: Huajian Feng,Guoxiao Zhang,Yadong Zhang,Yi Wei,Qiang Liu
关键词-EN: Accurate post-click conversion, post-click conversion rate, Accurate post-click, Sample Selection Bias, Embedding Alignment Network
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: 5 pages, 3 figures, 3 tables

点击查看摘要

Abstract:Accurate post-click conversion rate (CVR) estimation is crucial for online advertising systems. Despite significant advances in causal approaches designed to address the Sample Selection Bias problem, CVR estimation still faces challenges due to Covariate Shift. Given the intrinsic connection between the distribution of covariates in the click and non-click spaces, this study proposes an Exposure-Guided Embedding Alignment Network (EGEAN) to address estimation bias caused by covariate shift. Additionally, we propose a Parameter Varying Doubly Robust Estimator with steady-state control to handle small propensities better. Online A/B tests conducted on the Meituan advertising system demonstrate that our method significantly outperforms baseline models with respect to CVR and GMV, validating its effectiveness. Code is available: this https URL.
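
For context, a vanilla doubly robust CVR estimator combines an outcome model with an inverse-propensity correction on clicked impressions; a textbook version is below. The paper's parameter-varying variant and steady-state control are not reproduced, and the clipping floor is our own crude stand-in for "handling small propensities".

```python
import numpy as np

def doubly_robust_cvr(click, convert, mu_hat, p_hat, p_floor=0.05):
    """Textbook DR estimate of E[conversion] over all impressions:
    outcome-model prediction mu_hat plus an inverse-propensity-weighted
    correction on clicked impressions."""
    p = np.clip(p_hat, p_floor, 1.0)        # clip small propensities (assumed)
    correction = click * (convert - mu_hat) / p
    return np.mean(mu_hat + correction)

rng = np.random.default_rng(0)
n = 100_000
p_true = rng.uniform(0.02, 0.5, n)           # true click propensities
click = rng.binomial(1, p_true)
cvr_true = rng.uniform(0.0, 0.2, n)
convert = click * rng.binomial(1, cvr_true)  # conversions observed only on clicks
mu_hat = cvr_true + rng.normal(0, 0.02, n)   # slightly misspecified outcome model
print(doubly_robust_cvr(click, convert, mu_hat, p_true), cvr_true.mean())
```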

[AI-59] Classifier-free guidance in LLMs Safety

链接: https://arxiv.org/abs/2412.06846
作者: Roman Smirnov
关键词-EN: paper describes LLM, ORPO reinforcement learning, describes LLM unlearning, reinforcement learning method, modified classifier-free guidance
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:The paper describes LLM unlearning without a retaining dataset, using the ORPO reinforcement learning method with inference enhanced by modified classifier-free guidance. Significant improvement in unlearning, without degradation of the model, is achieved through direct training on synthetic replacement data in a CFG-aware training regime, with classifier-free guidance applied during inference. This article is an extended version of the NeurIPS 2024 LLM-PC submission, which was awarded second prize.
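
For readers unfamiliar with the mechanism, the standard classifier-free guidance formula at inference time is shown below; the paper's modified variant is not reproduced, and the logit values are made up for the demo.

```python
import torch

def cfg_logits(cond_logits, uncond_logits, gamma=1.5):
    """Classifier-free guidance: extrapolate from the unconditional logits
    toward the conditional ones; gamma=1 recovers ordinary conditional
    decoding, gamma>1 strengthens the conditioning prompt."""
    return uncond_logits + gamma * (cond_logits - uncond_logits)

cond = torch.tensor([2.0, 0.5, -1.0])     # next-token logits with the prompt
uncond = torch.tensor([1.0, 1.0, -0.5])   # next-token logits without it
print(torch.softmax(cfg_logits(cond, uncond), dim=-1))
```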

[AI-60] A Neural Model of Rule Discovery with Relatively Short-Term Sequence Memory

链接: https://arxiv.org/abs/2412.06839
作者: Naoya Arakawa
关键词-EN: report proposes, event sequences, fluid intelligence, discovering regularities, event
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:This report proposes a neural cognitive model for discovering regularities in event sequences. In a fluid intelligence task, the subject is required to discover regularities from relatively short-term memory of the first-seen task. Some fluid intelligence tasks require discovering regularities in event sequences. Thus, a neural network model was constructed to explain fluid intelligence or regularity discovery in event sequences with relatively short-term memory. The model was implemented and tested with delayed match-to-sample tasks.

[AI-61] Innovative Sentiment Analysis and Prediction of Stock Price Using FinBERT, GPT-4 and Logistic Regression: A Data-Driven Approach

链接: https://arxiv.org/abs/2412.06837
作者: Olamilekan Shobayo,Sidikat Adeyemi-Longe,Olusogo Popoola,Bayode Ogunleye
关键词-EN: Generatice Pre-trained Transformer, Finaance Bidirectional Encoder, Bidirectional Encoder representations, NGX All-Share Index, Finaance Bidirectional
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Statistical Finance (q-fin.ST); Applications (stat.AP); Computation (stat.CO)
*备注: 21 pages

点击查看摘要

Abstract:This study explores the comparative performance of cutting-edge AI models, i.e., Finance Bidirectional Encoder Representations from Transformers (FinBERT), Generative Pre-trained Transformer (GPT-4), and Logistic Regression, for sentiment analysis and stock index prediction using financial news and the NGX All-Share Index data. By leveraging advanced natural language processing models like GPT-4 and FinBERT, alongside a traditional machine learning model, Logistic Regression, we aim to classify market sentiment, generate sentiment scores, and predict market price movements. This research highlights global AI advancements in stock markets, showcasing how state-of-the-art language models can contribute to understanding complex financial data. The models were assessed using metrics such as accuracy, precision, recall, F1 score, and ROC AUC. Results indicate that Logistic Regression outperformed the more computationally intensive FinBERT and the predefined prompting approach of the versatile GPT-4, with an accuracy of 81.83% and a ROC AUC of 89.76%. The GPT-4 predefined approach exhibited a lower accuracy of 54.19% but demonstrated strong potential in handling complex data. FinBERT, while offering more sophisticated analysis, was resource-demanding and yielded a moderate performance. Hyperparameter optimization using Optuna and cross-validation techniques ensured the robustness of the models. This study highlights the strengths and limitations of the practical applications of AI approaches in stock market prediction and presents Logistic Regression as the most efficient model for this task, with FinBERT and GPT-4 representing emerging tools with potential for future exploration and innovation in AI-driven financial analytics

[AI-62] GRUvader: Sentiment-Informed Stock Market Prediction

链接: https://arxiv.org/abs/2412.06836
作者: Akhila Mamillapalli,Bayode Ogunleye,Sonia Timoteo Inacio,Olamilekan Shobayo
关键词-EN: global economic instability, stock market prediction, high volatility, economic instability, challenging due
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Applications (stat.AP)
*备注: 18 pages

点击查看摘要

Abstract:Stock price prediction is challenging due to global economic instability, high volatility, and the complexity of financial markets. Hence, this study compared several machine learning algorithms for stock market prediction and further examined the influence of a sentiment analysis indicator on the prediction of stock prices. Our results were two-fold. Firstly, we used a lexicon-based sentiment analysis approach to identify sentiment features, thus evidencing the correlation between the sentiment indicator and stock price movement. Secondly, we proposed the use of GRUvader, an optimal gated recurrent unit network, for stock market prediction. Our findings suggest that stand-alone models struggled compared with AI-enhanced models. Thus, our paper makes further recommendations on the latter systems.

[AI-63] APS-LSTM: Exploiting Multi-Periodicity and Diverse Spatial Dependencies for Flood Forecasting

链接: https://arxiv.org/abs/2412.06835
作者: Jun Feng,Xueyi Liu,Jiamin Lu,Pingping Shao
关键词-EN: Accurate flood prediction, Accurate flood, prevention and mitigation, crucial for disaster, disaster prevention
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: Received by IEEE SMC

点击查看摘要

Abstract:Accurate flood prediction is crucial for disaster prevention and mitigation. Hydrological data exhibit highly nonlinear temporal patterns and encompass complex spatial relationships between rainfall and flow. Existing flood prediction models struggle to capture these intricate temporal features and spatial dependencies. This paper presents an adaptive periodic and spatial self-attention method based on LSTM (APS-LSTM) to address these challenges. The APS-LSTM learns temporal features from a multi-periodicity perspective and captures diverse spatial dependencies from different period divisions. The APS-LSTM consists of three main stages: (i) Multi-Period Division, which utilizes the Fast Fourier Transform (FFT) to divide various periodic patterns; (ii) Spatio-Temporal Information Extraction, which performs periodic and spatial self-attention focusing on intra- and inter-periodic temporal patterns and spatial dependencies; and (iii) Adaptive Aggregation, which relies on amplitude strength to aggregate the computational results from each periodic division. Extensive experiments on two real-world datasets demonstrate the superiority of APS-LSTM. The code is available: this https URL.
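
The multi-period division step, as we read it, resembles the FFT-based period detection used elsewhere in time-series modeling: pick the dominant frequencies of the amplitude spectrum and convert them to candidate periods. A sketch under that assumption:

```python
import numpy as np

def dominant_periods(x, k=3):
    """Candidate periods = series length divided by the k frequencies with
    the largest FFT amplitude (DC bin excluded)."""
    amp = np.abs(np.fft.rfft(x - x.mean()))
    top = np.argsort(-amp[1:])[:k] + 1   # skip the DC bin
    return [len(x) // f for f in top]

t = np.arange(1680)                       # e.g. hourly data over 10 weeks
x = np.sin(2 * np.pi * t / 24) + 0.5 * np.sin(2 * np.pi * t / 168)
print(dominant_periods(x, k=2))           # -> [24, 168]: daily and weekly
```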

[AI-64] Investigating social alignment via mirroring in a system of interacting language models

链接: https://arxiv.org/abs/2412.06834
作者: Harvey McGuinness,Tianyu Wang,Carey E. Priebe,Hayden Helm
关键词-EN: goal or perspective, share a common, common goal, individuals share, system behavior
类目: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
*备注:

点击查看摘要

Abstract:Alignment is a social phenomenon wherein individuals share a common goal or perspective. Mirroring, or mimicking the behaviors and opinions of another individual, is one mechanism by which individuals can become aligned. Large scale investigations of the effect of mirroring on alignment have been limited due to the scalability of traditional experimental designs in sociology. In this paper, we introduce a simple computational framework that enables studying the effect of mirroring behavior on alignment in multi-agent systems. We simulate systems of interacting large language models in this framework and characterize overall system behavior and alignment with quantitative measures of agent dynamics. We find that system behavior is strongly influenced by the range of communication of each agent and that these effects are exacerbated by increased rates of mirroring. We discuss the observed simulated system behavior in the context of known human social dynamics.

[AI-65] Detecting Fake News on Social Media: A Novel Reliability Aware Machine-Crowd Hybrid Intelligence-Based Method

链接: https://arxiv.org/abs/2412.06833
作者: Yidong Chai,Kangwei Shi,Jiaheng Xie,Chunli Liu,Yuanchun Jiang,Yezheng Liu
关键词-EN: hybrid intelligence-based methods, advanced detection methods, social media platforms, media platforms poses, societal systems
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Social and Information Networks (cs.SI)
*备注:

点击查看摘要

Abstract:Fake news on social media platforms poses a significant threat to societal systems, underscoring the urgent need for advanced detection methods. The existing detection methods can be divided into machine intelligence-based, crowd intelligence-based, and hybrid intelligence-based methods. Among them, hybrid intelligence-based methods achieve the best performance but fail to consider the reliability issue in detection. In light of this, we propose a novel Reliability Aware Hybrid Intelligence (RAHI) method for fake news detection. Our method comprises three integral modules. The first module employs a Bayesian deep learning model to capture the inherent reliability within machine intelligence. The second module uses an Item Response Theory (IRT)-based user response aggregation to account for the reliability in crowd intelligence. The third module introduces a new distribution fusion mechanism, which takes the distributions derived from both machine and crowd intelligence as input, and outputs a fused distribution that provides predictions along with the associated reliability. The experiments on the Weibo dataset demonstrate the advantages of our method. This study contributes to the research field with a novel RAHI-based method, and the code is shared at this https URL. This study has practical implications for three key stakeholders: internet users, online platform managers, and the government.

[AI-66] Enhancing LLMs for Physics Problem-Solving using Reinforcement Learning with Human-AI Feedback

链接: https://arxiv.org/abs/2412.06827
作者: Avinash Anand,Kritarth Prasad,Chhavi Kirtani,Ashwin R Nair,Mohit Gupta,Saloni Garg,Anurag Gautam,Snehal Buldeo,Rajiv Ratn Shah
关键词-EN: Large Language Models, Large Language, Retrieval Augmentation Generation, complex reasoning required, Language Models
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Large Language Models (LLMs) have demonstrated strong capabilities in text-based tasks but struggle with the complex reasoning required for physics problems, particularly in advanced arithmetic and conceptual understanding. While some research has explored ways to enhance LLMs in physics education using techniques such as prompt engineering and Retrieval Augmentation Generation (RAG), not enough effort has been made in addressing their limitations in physics reasoning. This paper presents a novel approach to improving LLM performance on physics questions using Reinforcement Learning with Human and Artificial Intelligence Feedback (RLHAIF). We evaluate several reinforcement learning methods, including Proximal Policy Optimization (PPO), Direct Preference Optimization (DPO), and Remax optimization. These methods are chosen to investigate RL policy performance with different settings on the PhyQA dataset, which includes challenging physics problems from high school textbooks. Our RLHAIF model, tested on leading LLMs like LLaMA2 and Mistral, achieved superior results, notably with the MISTRAL-PPO model, demonstrating marked improvements in reasoning and accuracy. It achieved high scores, with a 58.67 METEOR score and a 0.74 Reasoning score, making it a strong example for future physics reasoning research in this area.

[AI-67] Feature Group Tabular Transformer: A Novel Approach to Traffic Crash Modeling and Causality Analysis

链接: https://arxiv.org/abs/2412.06825
作者: Oscar Lares,Hao Zhen,Jidong J. Yang
关键词-EN: improving road safety, Reliable and interpretable, road safety, interpretable traffic crash, traffic crash modeling
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Applications (stat.AP)
*备注: 19 pages, 6 figures, 5 tables

点击查看摘要

Abstract:Reliable and interpretable traffic crash modeling is essential for understanding causality and improving road safety. This study introduces a novel approach to predicting collision types by utilizing a comprehensive dataset fused from multiple sources, including weather data, crash reports, high-resolution traffic information, pavement geometry, and facility characteristics. Central to our approach is the development of a Feature Group Tabular Transformer (FGTT) model, which organizes disparate data into meaningful feature groups, represented as tokens. These group-based tokens serve as rich semantic components, enabling effective identification of collision patterns and interpretation of causal mechanisms. The FGTT model is benchmarked against widely used tree ensemble models, including Random Forest, XGBoost, and CatBoost, demonstrating superior predictive performance. Furthermore, model interpretation reveals key influential factors, providing fresh insights into the underlying causality of distinct crash types.
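 
One plausible shape of a feature-group tabular transformer is sketched below: each semantic feature group (weather, traffic, geometry, ...) is embedded into one token, a transformer encoder attends across groups, and a CLS-style token is classified. Group sizes, depth, widths, and the CLS design are our illustrative assumptions, not the paper's configuration.

```python
import torch
import torch.nn as nn

class FeatureGroupTabularTransformer(nn.Module):
    """Embeds each feature group into a token, attends across groups,
    and classifies the collision type from a learned CLS token."""
    def __init__(self, group_dims, d_model=64, n_classes=4):
        super().__init__()
        self.embed = nn.ModuleList(nn.Linear(d, d_model) for d in group_dims)
        self.cls = nn.Parameter(torch.zeros(1, 1, d_model))
        layer = nn.TransformerEncoderLayer(d_model, nhead=4,
                                           dim_feedforward=128,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(d_model, n_classes)

    def forward(self, groups):   # list of (batch, dim_i) tensors, one per group
        tokens = torch.stack([e(g) for e, g in zip(self.embed, groups)], dim=1)
        cls = self.cls.expand(tokens.size(0), -1, -1)
        h = self.encoder(torch.cat([cls, tokens], dim=1))
        return self.head(h[:, 0])   # classify from the CLS token

model = FeatureGroupTabularTransformer(group_dims=[5, 8, 3])
batch = [torch.randn(16, d) for d in (5, 8, 3)]
print(model(batch).shape)   # torch.Size([16, 4])
```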

[AI-68] Artificial Intelligence without Restriction Surpassing Human Intelligence with Probability One: Theoretical Insight into Secrets of the Brain with AI Twins of the Brain

链接: https://arxiv.org/abs/2412.06820
作者: Guang-Bin Huang,M. Brandon Westover,Eng-King Tan,Haibo Wang,Dongshun Cui,Wei-Ying Ma,Tiantong Wang,Qi He,Haikun Wei,Ning Wang,Qiyuan Tian,Kwok-Yan Lam,Xin Yao,Tien Yin Wong
关键词-EN: surpass human intelligence, Artificial Intelligence, important techniques discovered, human intelligence, restrictions artificial intelligence
类目: Artificial Intelligence (cs.AI)
*备注: Accepted by journal Neurocomputing

点击查看摘要

Abstract:Artificial Intelligence (AI) has apparently become one of the most important techniques discovered by humans in history while the human brain is widely recognized as one of the most complex systems in the universe. One fundamental critical question which would affect human sustainability remains open: Will artificial intelligence (AI) evolve to surpass human intelligence in the future? This paper shows that, in theory, new AI twins with fresh cellular-level AI techniques for neuroscience could approximate the brain and its functioning systems (e.g. perception and cognition functions) with any expected small error, and AI without restrictions could surpass human intelligence with probability one in the end. This paper indirectly proves the validity of the conjecture made by Frank Rosenblatt 70 years ago about the potential capabilities of AI, especially in the realm of artificial neural networks. Intelligence is just one of the fortuitous but sophisticated creations of nature which has not been fully discovered. Like mathematics and physics, with no restrictions artificial intelligence would lead to a new subject with its self-contained systems and principles. We anticipate that this paper opens new doors for 1) AI twins and other AI techniques to be used in cellular-level, efficient neuroscience dynamic analysis, functioning analysis of the brain and brain illness solutions; 2) a new worldwide collaborative scheme for interdisciplinary teams concurrently working on and modelling different types of neurons and synapses and different levels of functioning subsystems of the brain with AI techniques; 3) development of low-energy AI techniques with the aid of fundamental neuroscience properties; and 4) new controllable, explainable and safe AI techniques with reasoning capabilities of discovering principles in nature.

[AI-69] I See Therefore I Do: Estimating Causal Effects for Image Treatments

链接: https://arxiv.org/abs/2412.06810
作者: Abhinav Thorat,Ravi Kolla,Niranjan Pedanekar
关键词-EN: Causal effect estimation, Causal effect, treatment assignment bias, ground truth data, effect estimation
类目: Artificial Intelligence (cs.AI); Applications (stat.AP); Machine Learning (stat.ML)
*备注: 8 Pages

点击查看摘要

Abstract:Causal effect estimation under observational studies is challenging due to the lack of ground truth data and treatment assignment bias. Though various methods exist in the literature for addressing this problem, most of them ignore multi-dimensional treatment information by considering it as scalar, either continuous or discrete. Recently, certain works have demonstrated the utility of this rich yet complex treatment information in the estimation process, resulting in better causal effect estimation. However, these works have been demonstrated on either graphs or textual treatments. There is a notable gap in the existing literature in addressing higher-dimensional data such as images, which have a wide variety of applications. In this work, we propose a model named NICE (Network for Image treatments Causal effect Estimation), for estimating individual causal effects when treatments are images. NICE demonstrates an effective way to use the rich multidimensional information present in image treatments that helps in obtaining improved causal effect estimates. To evaluate the performance of NICE, we propose a novel semi-synthetic data simulation framework that generates potential outcomes when images serve as treatments. Empirical results on these datasets, under various setups including the zero-shot case, demonstrate that NICE significantly outperforms existing models that incorporate treatment information for causal effect estimation.

[AI-70] Generating Diverse Synthetic Datasets for Evaluation of Real-life Recommender Systems RECSYS2024

链接: https://arxiv.org/abs/2412.06809
作者: Miha Malenšek,Blaž Škrlj,Blaž Mramor,Jure Demšar
关键词-EN: testing machine learning, machine learning models, testing machine, machine learning, Synthetic datasets
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
*备注: RecSys 2024

点击查看摘要

Abstract:Synthetic datasets are important for evaluating and testing machine learning models. When evaluating real-life recommender systems, high-dimensional categorical (and sparse) datasets are often considered. Unfortunately, there are not many solutions that would allow generation of artificial datasets with such characteristics. For that purpose, we developed a novel framework for generating synthetic datasets that are diverse and statistically coherent. Our framework allows for creation of datasets with controlled attributes, enabling iterative modifications to fit specific experimental needs, such as introducing complex feature interactions, feature cardinality, or specific distributions. We demonstrate the framework's utility through use cases such as benchmarking probabilistic counting algorithms, detecting algorithmic bias, and simulating AutoML searches. Unlike existing methods that either focus narrowly on specific dataset structures, or prioritize (private) data synthesis through real data, our approach provides a modular means of quickly generating completely synthetic datasets that we can tailor to diverse experimental requirements. Our results show that the framework effectively isolates model behavior in unique situations and highlights its potential for significant advancements in the evaluation and development of recommender systems. The framework is readily available as a free, open Python package to facilitate research with minimal friction.

[AI-71] Effect of Adaptive Communication Support on Human-AI Collaboration

链接: https://arxiv.org/abs/2412.06808
作者: Shipeng Liu,FNU Shrutika,Boshen Zhang,Zhehui Huang,Feifei Qian
关键词-EN: Effective human-AI collaboration, Effective human-AI, human-AI collaboration requires, adopt their roles, Large Language Models
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Robotics (cs.RO)
*备注: 10 pages, 7 figures

点击查看摘要

Abstract:Effective human-AI collaboration requires agents to adapt their roles and levels of support based on human needs, task requirements, and complexity. Traditional human-AI teaming often relies on a pre-determined robot communication scheme, restricting teamwork adaptability in complex tasks. Leveraging the strong communication capabilities of Large Language Models (LLMs), we propose a Human-Robot Teaming Framework with Multi-Modal Language feedback (HRT-ML), a framework designed to enhance human-robot interaction by adjusting the frequency and content of language-based feedback. The HRT-ML framework includes two core modules: a Coordinator for high-level, low-frequency strategic guidance and a Manager for task-specific, high-frequency instructions, enabling passive and active interactions with human teammates. To assess the impact of language feedback in collaborative scenarios, we conducted experiments in an enhanced Overcooked-AI game environment with varying levels of task complexity (easy, medium, hard) and feedback frequency (inactive, passive, active, superactive). Our results show that as task complexity increases relative to human capabilities, human teammates exhibited stronger preferences toward robotic agents that can offer frequent, proactive support. However, when task complexities exceed the LLM's capacity, noisy and inaccurate feedback from superactive agents can instead hinder team performance, as it requires human teammates to increase their effort to interpret and respond to the large amount of communications, with limited performance return. Our results offer a general principle for robotic agents to dynamically adjust their levels and frequencies of communication to work seamlessly with humans and achieve improved teaming performance.

[AI-72] SpikeFI: A Fault Injection Framework for Spiking Neural Networks

链接: https://arxiv.org/abs/2412.06795
作者: Theofilos Spyrou,Said Hamdioui,Haralampos-G. Stratigopoulos
关键词-EN: spiking neural networks, faster computation speed, efficient energy usage, Neuromorphic computing, neural networks
类目: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Neuromorphic computing and spiking neural networks (SNNs) are gaining traction across various artificial intelligence (AI) tasks thanks to their potential for efficient energy usage and faster computation speed. This comparative advantage comes from mimicking the structure, function, and efficiency of the biological brain, which arguably is the most brilliant and green computing machine. As SNNs are eventually deployed on a hardware processor, the reliability of the application in light of hardware-level faults becomes a concern, especially for safety- and mission-critical applications. In this work, we propose SpikeFI, a fault injection framework for SNNs that can be used for automating the reliability analysis and test generation. SpikeFI is built upon the SLAYER PyTorch framework with fault injection experiments accelerated on a single or multiple GPUs. It has a comprehensive integrated neuron and synapse fault model library, in accordance to the literature in the domain, which is extendable by the user if needed. It supports: single and multiple faults; permanent and transient faults; specified, random layer-wise, and random network-wise fault locations; and pre-, during, and post-training fault injection. It also offers several optimization speedups and built-in functions for results visualization. SpikeFI is open-source and available for download via GitHub at this https URL.

[AI-73] FEAD: Figma-Enhanced App Design Framework for Improving UI/UX in Educational App Development

链接: https://arxiv.org/abs/2412.06793
作者: Tianyi Huang
关键词-EN: Designing user-centric mobile, MIT App Inventor-one, MIT App Inventor, Designing user-centric, increasingly essential
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Designing user-centric mobile applications is increasingly essential in educational technology. However, platforms like MIT App Inventor, one of the world's largest educational app development tools, face inherent limitations in supporting modern UI/UX design. This study introduces the Figma-Enhanced App Design (FEAD) Method, a structured framework that integrates Figma's advanced design tools into MIT App Inventor using an identify-design-implement workflow. Leveraging principles such as the 8-point grid system and Gestalt laws of perception, the FEAD Method empowers users to address design gaps, creating visually appealing, functional, and accessible applications. A comparative evaluation revealed that 61.2% of participants perceived FEAD-enhanced designs as on par with professional apps, compared to just 8.2% for baseline designs. These findings highlight the potential of bridging design with development platforms to enhance app creation, offering a scalable framework for students to master both functional and aesthetic design principles and excel in shaping the future of user-centric technology.

[AI-74] Promoting Cooperation in the Public Goods Game using Artificial Intelligent Agents

链接: https://arxiv.org/abs/2412.05450
作者: Arend Hintze,Christoph Adami
关键词-EN: collectively undesired outcomes, individual rational actions, rational actions lead, fundamental social dilemma, undesired outcomes
类目: Computer Science and Game Theory (cs.GT); Artificial Intelligence (cs.AI); Adaptation and Self-Organizing Systems (nlin.AO); Populations and Evolution (q-bio.PE)
*备注: 16 pages, 6 figures

点击查看摘要

Abstract:The tragedy of the commons illustrates a fundamental social dilemma where individual rational actions lead to collectively undesired outcomes, threatening the sustainability of shared resources. Strategies to escape this dilemma, however, are in short supply. In this study, we explore how artificial intelligence (AI) agents can be leveraged to enhance cooperation in public goods games, moving beyond traditional regulatory approaches to using AI as facilitators of cooperation. We investigate three scenarios: (1) Mandatory Cooperation Policy for AI Agents, where AI agents are institutionally mandated always to cooperate; (2) Player-Controlled Agent Cooperation Policy, where players evolve control over AI agents’ likelihood to cooperate; and (3) Agents Mimic Players, where AI agents copy the behavior of players. Using a computational evolutionary model with a population of agents playing public goods games, we find that only when AI agents mimic player behavior does the critical synergy threshold for cooperation decrease, effectively resolving the dilemma. This suggests that we can leverage AI to promote collective well-being in societal dilemmas by designing AI agents to mimic human players.
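
For readers unfamiliar with the setup, the sketch below implements the standard linear public goods payoff together with a naive "agents mimic players" policy. The payoff form is the common textbook formalization; the paper's evolutionary model is considerably richer than this.

```python
import random

def public_goods_round(contributions, synergy):
    """Standard linear public goods payoff: pooled contributions are multiplied
    by `synergy` and shared equally; free-riders keep their endowment."""
    pot = synergy * sum(contributions)
    share = pot / len(contributions)
    return [share - c for c in contributions]

def mimic_policy(human_actions):
    """AI agents copy the behavior of randomly sampled human players
    (an illustrative reading of scenario 3)."""
    return [random.choice(human_actions) for _ in range(len(human_actions))]

humans = [1, 0, 1, 1]                 # 1 = contribute (cooperate), 0 = defect
agents = mimic_policy(humans)
payoffs = public_goods_round(humans + agents, synergy=1.6)
```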

[AI-75] CBraMod: A Criss-Cross Brain Foundation Model for EEG Decoding

链接: https://arxiv.org/abs/2412.07236
作者: Jiquan Wang,Sha Zhao,Zhiling Luo,Yangxuan Zhou,Haiteng Jiang,Shijian Li,Tao Li,Gang Pan
关键词-EN: brain electrical activity, EEG, record brain electrical, EEG foundation models, EEG foundation
类目: ignal Processing (eess.SP); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Neurons and Cognition (q-bio.NC)
*备注:

点击查看摘要

Abstract:Electroencephalography (EEG) is a non-invasive technique to measure and record brain electrical activity, widely used in various BCI and healthcare applications. Early EEG decoding methods rely on supervised learning, limited by specific tasks and datasets, hindering model performance and generalizability. With the success of large language models, there is a growing body of studies focusing on EEG foundation models. However, these studies still leave challenges. First, most existing EEG foundation models employ a full EEG modeling strategy, which models the spatial and temporal dependencies between all EEG patches together but ignores that these dependencies are heterogeneous due to the unique structural characteristics of EEG signals. Second, existing EEG foundation models have limited generalizability on a wide range of downstream BCI tasks due to the varying formats of EEG data, which are challenging to adapt to. To address these challenges, we propose a novel foundation model called CBraMod. Specifically, we devise a criss-cross transformer as the backbone to thoroughly leverage the structural characteristics of EEG signals, which can model spatial and temporal dependencies separately through two parallel attention mechanisms. We also utilize an asymmetric conditional positional encoding scheme which can encode positional information of EEG patches and be easily adapted to EEG with diverse formats. CBraMod is pre-trained on a very large corpus of EEG through patch-based masked EEG reconstruction. We evaluate CBraMod on up to 10 downstream BCI tasks (12 public datasets). CBraMod achieves state-of-the-art performance across this wide range of tasks, proving its strong capability and generalizability. The source code is publicly available at this https URL.
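
A minimal sketch of the "two parallel attention mechanisms" idea follows, attending separately across EEG channels and across time windows. Shapes and layer sizes are illustrative assumptions; CBraMod's actual criss-cross transformer and positional encoding are more involved.

```python
import torch
import torch.nn as nn

class CrissCrossBlock(nn.Module):
    """Sketch: attend over channels (spatial branch) and over time windows
    (temporal branch) in parallel, then merge the two views."""
    def __init__(self, dim=128, heads=4):
        super().__init__()
        self.spatial = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.temporal = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.merge = nn.Linear(2 * dim, dim)

    def forward(self, x):            # x: (batch, channels, time, dim)
        b, c, t, d = x.shape
        xs = x.permute(0, 2, 1, 3).reshape(b * t, c, d)    # attend across channels
        xs, _ = self.spatial(xs, xs, xs)
        xs = xs.reshape(b, t, c, d).permute(0, 2, 1, 3)
        xt = x.reshape(b * c, t, d)                        # attend across time
        xt, _ = self.temporal(xt, xt, xt)
        xt = xt.reshape(b, c, t, d)
        return self.merge(torch.cat([xs, xt], dim=-1))
```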

[AI-76] Sequential Controlled Langevin Diffusions

链接: https://arxiv.org/abs/2412.07081
作者: Junhua Chen,Lorenz Richter,Julius Berner,Denis Blessing,Gerhard Neumann,Anima Anandkumar
关键词-EN: gradually transporting samples, complicated target distribution, Sequential Monte Carlo, effective approach, idea of gradually
类目: Machine Learning (stat.ML); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:An effective approach for sampling from unnormalized densities is based on the idea of gradually transporting samples from an easy prior to the complicated target distribution. Two popular methods are (1) Sequential Monte Carlo (SMC), where the transport is performed through successive annealed densities via prescribed Markov chains and resampling steps, and (2) recently developed diffusion-based sampling methods, where a learned dynamical transport is used. Despite the common goal, both approaches have different, often complementary, advantages and drawbacks. The resampling steps in SMC allow focusing on promising regions of the space, often leading to robust performance. While the algorithm enjoys asymptotic guarantees, the lack of flexible, learnable transitions can lead to slow convergence. On the other hand, diffusion-based samplers are learned and can potentially better adapt themselves to the target at hand, yet often suffer from training instabilities. In this work, we present a principled framework for combining SMC with diffusion-based samplers by viewing both methods in continuous time and considering measures on path space. This culminates in the new Sequential Controlled Langevin Diffusion (SCLD) sampling method, which is able to utilize the benefits of both methods and reaches improved performance on multiple benchmark problems, in many cases using only 10% of the training budget of previous diffusion-based samplers.
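
As a rough illustration of the SMC side of this combination, the sketch below alternates annealed Langevin moves with importance resampling along the linear interpolation of energies. SCLD additionally learns a control drift, which is omitted here; the schedule and weight updates are a plain SMC sampler, not the paper's algorithm.

```python
import numpy as np

def smc_langevin(grad_prior, grad_target, log_ratio, n=1000, steps=50, dt=1e-2, seed=0):
    """Anneal samples from an easy prior to the target along the linear path
    of energies, alternating Langevin moves and multinomial resampling."""
    rng = np.random.default_rng(seed)
    x = rng.normal(size=(n, 2))                       # draws from the easy prior
    for k in range(1, steps + 1):
        beta = k / steps                              # annealing schedule
        drift = (1 - beta) * grad_prior(x) + beta * grad_target(x)
        x = x + dt * drift + np.sqrt(2 * dt) * rng.normal(size=x.shape)
        w = np.exp(log_ratio(x) / steps)              # incremental weights (linear path)
        w /= w.sum()
        x = x[rng.choice(n, size=n, p=w)]             # resample promising regions
    return x

# Example: anneal from N(0, I) to N(m, I)
m = np.array([3.0, -2.0])
samples = smc_langevin(lambda x: -x, lambda x: -(x - m),
                       lambda x: 0.5 * (x**2 - (x - m)**2).sum(axis=1))
```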

[AI-77] Large Language Models : An Applied Econometric Framework

链接: https://arxiv.org/abs/2412.07031
作者: Jens Ludwig,Sendhil Mullainathan,Ashesh Rambachan
关键词-EN: Large language models, Large language, simulate human responses, generate hypotheses, LLM
类目: Econometrics (econ.EM); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Large language models (LLMs) are being used in economics research to form predictions, label text, simulate human responses, generate hypotheses, and even produce data for times and places where such data don’t exist. While these uses are creative, are they valid? When can we abstract away from the inner workings of an LLM and simply rely on their outputs? We develop an econometric framework to answer this question. Our framework distinguishes between two types of empirical tasks. Using LLM outputs for prediction problems (including hypothesis generation) is valid under one condition: no “leakage” between the LLM’s training dataset and the researcher’s sample. Using LLM outputs for estimation problems to automate the measurement of some economic concept (expressed by some text or from human subjects) requires an additional assumption: LLM outputs must be as good as the gold standard measurements they replace. Otherwise estimates can be biased, even if LLM outputs are highly accurate but not perfectly so. We document the extent to which these conditions are violated and the implications for research findings in illustrative applications to finance and political economy. We also provide guidance to empirical researchers. The only way to ensure no training leakage is to use open-source LLMs with documented training data and published weights. The only way to deal with LLM measurement error is to collect validation data and model the error structure. A corollary is that if such conditions can’t be met for a candidate LLM application, our strong advice is: don’t.

[AI-78] M3-20M: A Large-Scale Multi-Modal Molecule Dataset for AI-driven Drug Design and Discovery

链接: https://arxiv.org/abs/2412.06847
作者: Siyuan Guo,Lexuan Wang,Chang Jin,Jinxian Wang,Han Peng,Huayang Shi,Wengen Li,Jihong Guan,Shuigeng Zhou
关键词-EN: large-scale Multi-Modal Molecular, design and discovery, drug design, AI-driven drug design, Multi-Modal Molecular dataset
类目: Quantitative Methods (q-bio.QM); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:This paper introduces M^3-20M, a large-scale Multi-Modal Molecular dataset that contains over 20 million molecules. Designed to support AI-driven drug design and discovery, M^3-20M contains 71 times more molecules than the largest existing dataset, providing an unprecedented scale that can highly benefit training or fine-tuning large (language) models with superior performance for drug design and discovery. This dataset integrates one-dimensional SMILES, two-dimensional molecular graphs, three-dimensional molecular structures, physicochemical properties, and textual descriptions collected through web crawling and generated by using GPT-3.5, offering a comprehensive view of each molecule. To demonstrate the power of M^3-20M in drug design and discovery, we conduct extensive experiments on two key tasks: molecule generation and molecular property prediction, using large language models including GLM4, GPT-3.5, and GPT-4. Our experimental results show that M^3-20M can significantly boost model performance in both tasks. Specifically, it enables the models to generate more diverse and valid molecular structures and achieve higher property prediction accuracy than the existing single-modal datasets, which validates the value and potential of M^3-20M in supporting AI-driven drug design and discovery. The dataset is available at this https URL.

[AI-79] The Helicobacter pylori AI-Clinician: Harnessing Artificial Intelligence to Personalize H. pylori Treatment Recommendations

链接: https://arxiv.org/abs/2412.06841
作者: Kyle Higgins,Olga P. Nyssen,Joshua Southern,Ivan Laponogov,AIDA CONSORTIUM,Dennis Veselkov,Javier P. Gisbert,Tania Fleitas Kanonnikoff,Kirill Veselkov
关键词-EN: common carcinogenic pathogen, Helicobacter pylori, carcinogenic pathogen worldwide, common carcinogenic, carcinogenic pathogen
类目: Quantitative Methods (q-bio.QM); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Helicobacter pylori (H. pylori) is the most common carcinogenic pathogen worldwide. Infecting roughly 1 in 2 individuals globally, it is the leading cause of peptic ulcer disease, chronic gastritis, and gastric cancer. To investigate whether personalized treatments would be optimal for patients suffering from infection, we developed the H. pylori AI-clinician recommendation system. This system was trained on data from tens of thousands of H. pylori-infected patients from Hp-EuReg, orders of magnitude greater than those experienced by a single real-world clinician. We first used a simulated dataset and demonstrated the ability of our AI Clinician method to identify patient subgroups that would benefit from differential optimal treatments. Next, we trained the AI Clinician on Hp-EuReg, demonstrating that the AI Clinician reproduces known quality estimates of treatments, for example, bismuth and quadruple therapies outperforming triple therapy, with longer durations and higher-dose proton pump inhibitors (PPI) showing higher quality estimates on average. Next, we demonstrated that treatment was optimized by recommended personalized therapies in patient subsets, where 65% of patients were recommended a bismuth therapy of either metronidazole, tetracycline, and bismuth salts with PPI, or bismuth quadruple therapy with clarithromycin, amoxicillin, and bismuth salts with PPI, and 15% of patients were recommended a quadruple non-bismuth therapy of clarithromycin, amoxicillin, and metronidazole with PPI. Finally, we determined trends in patient variables driving the personalized recommendations using random forest modelling. With around half of the world likely to experience H. pylori infection at some point in their lives, the identification of personalized optimal treatments will be crucial in both gastric cancer prevention and quality of life improvements for countless individuals worldwide.

机器学习

[LG-0] Efficient Online Reinforcement Learning Fine-Tuning Need Not Retain Offline Data

链接: https://arxiv.org/abs/2412.07762
作者: Zhiyuan Zhou,Andy Peng,Qiyang Li,Sergey Levine,Aviral Kumar
关键词-EN: offline data, machine learning involves, offline, data, online
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:The modern paradigm in machine learning involves pre-training on diverse data, followed by task-specific fine-tuning. In reinforcement learning (RL), this translates to learning via offline RL on a diverse historical dataset, followed by rapid online RL fine-tuning using interaction data. Most RL fine-tuning methods require continued training on offline data for stability and performance. However, this is undesirable because training on diverse offline data is slow and expensive for large datasets, and, in principle, also limits the possible performance improvement because of constraints or pessimism on offline data. In this paper, we show that retaining offline data is unnecessary as long as we use a properly-designed online RL approach for fine-tuning offline RL initializations. To build this approach, we start by analyzing the role of retaining offline data in online fine-tuning. We find that continued training on offline data is mostly useful for preventing a sudden divergence in the value function at the onset of fine-tuning, caused by a distribution mismatch between the offline data and online rollouts. This divergence typically results in unlearning and forgetting the benefits of offline pre-training. Our approach, Warm-start RL (WSRL), mitigates the catastrophic forgetting of pre-trained initializations using a very simple idea. WSRL employs a warmup phase that seeds the online RL run with a very small number of rollouts from the pre-trained policy to do fast online RL. The data collected during warmup helps "recalibrate" the offline Q-function to the online distribution, allowing us to completely discard offline data without destabilizing the online RL fine-tuning. We show that WSRL is able to fine-tune without retaining any offline data, and is able to learn faster and attains higher performance than existing algorithms irrespective of whether they retain offline data or not.
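
The warmup recipe can be summarized in a few lines of Python. The `env` follows the classic gym API and the `agent` methods are hypothetical placeholders, since the abstract does not specify the underlying RL algorithm; treat this as a sketch of the described procedure, not the authors' code.

```python
def warm_start_rl(pretrained_policy, env, agent, warmup_rollouts=10, online_steps=100_000):
    """Sketch of the WSRL recipe: seed the replay buffer with a few rollouts
    from the frozen pre-trained policy, then run plain online RL with no
    offline data retained."""
    for _ in range(warmup_rollouts):                   # warmup phase
        obs, done = env.reset(), False
        while not done:
            action = pretrained_policy(obs)
            next_obs, reward, done, info = env.step(action)
            agent.replay_buffer.add(obs, action, reward, next_obs, done)
            obs = next_obs
    agent.initialize_from(pretrained_policy)           # warm-start actor/critic
    for _ in range(online_steps):                      # standard online RL
        agent.collect_and_update(env)
```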

[LG-1] Quantum vs. Classical Machine Learning Algorithms for Software Defect Prediction: Challenges and Opportunities ICSE2025

链接: https://arxiv.org/abs/2412.07698
作者: Md Nadim,Mohammad Hassan,Ashis Kumar Mandal,Chanchal K. Roy
关键词-EN: enables early identification, software quality assurance, Quantum Machine Learning, Software defect prediction, quality assurance
类目: oftware Engineering (cs.SE); Machine Learning (cs.LG)
*备注: In the proceedings of the 6th Quantum Software Engineering (Q-SE) workshop at the 47th IEEE/ACM International Conference on Software Engineering (ICSE 2025)

点击查看摘要

Abstract:Software defect prediction is a critical aspect of software quality assurance, as it enables early identification and mitigation of defects, thereby reducing the cost and impact of software failures. Over the past few years, quantum computing has risen as an exciting technology capable of transforming multiple domains; Quantum Machine Learning (QML) is one of them. QML algorithms harness the power of quantum computing to solve complex problems with better efficiency and effectiveness than their classical counterparts. However, its application in software engineering to predict software defects remains largely unexplored. In this study, we fill this research gap by comparing the performance of three QML and five classical machine learning (CML) algorithms on 20 software defect datasets. Our investigation reports the comparative scenarios of QML vs. CML algorithms and identifies the better-performing and consistent algorithms to predict software defects. We also highlight the challenges and future directions of employing QML algorithms on real software defect datasets, based on the experience we gained while performing this investigation. The findings of this study can help practitioners and researchers further progress in this research domain by making software systems reliable and bug-free.

[LG-2] Privacy-Preserving Customer Support: A Framework for Secure and Scalable Interactions

链接: https://arxiv.org/abs/2412.07687
作者: Anant Prakash Awasthi,Chandraketu Singh,Rakshit Varma,Sanchit Sharma
关键词-EN: Consumer Privacy Act, California Consumer Privacy, General Data Protection, Data Protection Regulation, artificial intelligence
类目: Machine Learning (cs.LG); Cryptography and Security (cs.CR); Applications (stat.AP)
*备注:

点击查看摘要

Abstract:The growing reliance on artificial intelligence (AI) in customer support has significantly improved operational efficiency and user experience. However, traditional machine learning (ML) approaches, which require extensive local training on sensitive datasets, pose substantial privacy risks and compliance challenges with regulations like the General Data Protection Regulation (GDPR) and California Consumer Privacy Act (CCPA). Existing privacy-preserving techniques, such as anonymization, differential privacy, and federated learning, address some concerns but face limitations in utility, scalability, and complexity. This paper introduces the Privacy-Preserving Zero-Shot Learning (PP-ZSL) framework, a novel approach leveraging large language models (LLMs) in a zero-shot learning mode. Unlike conventional ML methods, PP-ZSL eliminates the need for local training on sensitive data by utilizing pre-trained LLMs to generate responses directly. The framework incorporates real-time data anonymization to redact or mask sensitive information, retrieval-augmented generation (RAG) for domain-specific query resolution, and robust post-processing to ensure compliance with regulatory standards. This combination reduces privacy risks, simplifies compliance, and enhances scalability and operational efficiency. Empirical analysis demonstrates that the PP-ZSL framework provides accurate, privacy-compliant responses while significantly lowering the costs and complexities of deploying AI-driven customer support systems. The study highlights potential applications across industries, including financial services, healthcare, e-commerce, legal support, telecommunications, and government services. By addressing the dual challenges of privacy and performance, this framework establishes a foundation for secure, efficient, and regulatory-compliant AI applications in customer interactions.
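
A hedged sketch of the anonymization-before-inference step is shown below. The regex patterns and the `call_llm` callable are assumptions for illustration; a production PP-ZSL system would use a proper PII/NER detector alongside retrieval-augmented generation.

```python
import re

# Illustrative redaction patterns (regex-based; assumptions, not the paper's list).
PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "PHONE": re.compile(r"\+?\d[\d\s().-]{7,}\d"),
    "CARD":  re.compile(r"\b(?:\d[ -]?){13,16}\b"),
}

def redact(text: str) -> str:
    """Mask sensitive spans before the text leaves the trusted boundary."""
    for label, pat in PATTERNS.items():
        text = pat.sub(f"[{label}]", text)
    return text

def answer(query: str, retrieved_docs: list[str], call_llm) -> str:
    safe_query = redact(query)                               # anonymize the query
    context = "\n".join(redact(d) for d in retrieved_docs)   # RAG context, also redacted
    return call_llm(f"Context:\n{context}\n\nQuestion: {safe_query}")
```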

[LG-3] SurvBETA: Ensemble-Based Survival Models Using Beran Estimators and Several Attention Mechanisms

链接: https://arxiv.org/abs/2412.07638
作者: Lev V. Utkin,Semen P. Khomets,Vlada A. Efremenko,Andrei V. Konstantinov
关键词-EN: gradient boosting machine, Beran estimator Ensemble, survival analysis framework, including random survival, random survival forests
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Many ensemble-based models have been proposed to solve machine learning problems in the survival analysis framework, including random survival forests, gradient boosting machines with weak survival models, and ensembles of Cox models. To extend the set of models, a new ensemble-based model called SurvBETA (the Survival Beran estimator Ensemble using Three Attention mechanisms) is proposed where the Beran estimator is used as a weak learner in the ensemble. The Beran estimator can be regarded as a kernel regression model taking into account the relationship between instances. Outputs of weak learners in the form of conditional survival functions are aggregated with attention weights taking into account the distance between the analyzed instance and prototypes of all bootstrap samples. The attention mechanism is used three times: for implementation of the Beran estimators, for determining specific prototypes of bootstrap samples and for aggregating the weak model predictions. The proposed model is presented in two forms: a general form, which requires solving a complex optimization problem for training, and a simplified form, which uses a special representation of the attention weights by means of the imprecise Huber’s contamination model and leads to solving a simple optimization problem. Numerical experiments illustrate properties of the model on synthetic data and compare the model with other survival models on real data. A code implementing the proposed model is publicly available.
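
Since the Beran estimator is the weak learner here, a compact NumPy implementation of it (a kernel-weighted Kaplan-Meier estimate) may help. The Gaussian kernel and bandwidth choice are illustrative; `X`, `times`, and `events` are assumed NumPy arrays of covariates, event/censoring times, and event indicators.

```python
import numpy as np

def beran_survival(t_grid, x, X, times, events, bandwidth=1.0):
    """Beran (kernel Kaplan-Meier) estimate of S(t | x)."""
    w = np.exp(-np.sum((X - x) ** 2, axis=1) / (2 * bandwidth ** 2))
    w /= w.sum()                                    # normalized kernel weights
    order = np.argsort(times)
    t_sorted, d_sorted, w_sorted = times[order], events[order], w[order]
    cum_before = np.concatenate([[0.0], np.cumsum(w_sorted)[:-1]])
    # Each event time contributes a multiplicative survival factor;
    # censored observations (d == 0) contribute a factor of 1.
    factors = np.where(d_sorted == 1,
                       1 - w_sorted / np.maximum(1 - cum_before, 1e-12), 1.0)
    return np.array([np.prod(factors[t_sorted <= t]) for t in t_grid])
```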

[LG-4] Sampling from Boltzmann densities with physics informed low-rank formats

链接: https://arxiv.org/abs/2412.07637
作者: Paul Hagemann,Janina Schütte,David Sommer,Martin Eigel,Gabriele Steidl
关键词-EN: unnormalized Boltzmann density, low-rank tensor train, underlying continuity equation, unnormalized Boltzmann, Boltzmann density
类目: Machine Learning (cs.LG); Optimization and Control (math.OC)
*备注:

点击查看摘要

Abstract:We propose a method for the efficient generation of samples from an unnormalized Boltzmann density by solving the underlying continuity equation in the low-rank tensor train (TT) format. It is based on the annealing path commonly used in the MCMC literature, which is given by the linear interpolation in the space of energies. Inspired by Sequential Monte Carlo, we alternate between deterministic time steps from the TT representation of the flow field and stochastic steps, which include Langevin and resampling steps. These adjust the relative weights of the different modes of the target distribution and anneal to the correct path distribution. We showcase the efficiency of our method on multiple numerical examples.

[LG-5] Fast Track to Winning Tickets: Repowering One-Shot Pruning for Graph Neural Networks AAAI2025

链接: https://arxiv.org/abs/2412.07605
作者: Yanwei Yue,Guibin Zhang,Haoran Yang,Dawei Cheng
关键词-EN: Graph Neural Networks, Neural Networks, wider real-world application, graph learning tasks, Graph Neural
类目: Machine Learning (cs.LG)
*备注: AAAI 2025

点击查看摘要

Abstract:Graph Neural Networks (GNNs) demonstrate superior performance in various graph learning tasks, yet their wider real-world application is hindered by the computational overhead when applied to large-scale graphs. To address the issue, the Graph Lottery Hypothesis (GLT) has been proposed, advocating the identification of subgraphs and subnetworks, i.e., winning tickets, without compromising performance. The effectiveness of current GLT methods largely stems from the use of iterative magnitude pruning (IMP), which offers higher stability and better performance than one-shot pruning. However, identifying GLTs is highly computationally expensive, due to the iterative pruning and retraining required by IMP. In this paper, we reevaluate the correlation between one-shot pruning and IMP: while one-shot tickets are suboptimal compared to IMP, they offer a fast track to tickets with a stronger performance. We introduce a one-shot pruning and denoising framework to validate the efficacy of the fast track. Compared to current IMP-based GLT methods, our framework achieves a double-win situation of graph lottery tickets with higher sparsity and faster speeds. Through extensive experiments across 4 backbones and 6 datasets, our method demonstrates 1.32%-45.62% improvement in weight sparsity and a 7.49%-22.71% increase in graph sparsity, along with a 1.7-44× speedup over IMP-based methods and 95.3%-98.6% MAC savings.
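
For reference, one-shot (global) magnitude pruning, the "fast track" baseline being repowered here, looks like this on the weight side. GLT methods additionally sparsify the graph adjacency and apply denoising, both of which this sketch omits.

```python
import torch
import torch.nn as nn

def one_shot_magnitude_prune(model: nn.Module, sparsity: float = 0.9):
    """Global one-shot magnitude pruning: zero the smallest-magnitude weights
    across the whole model in a single pass, with no iterative retraining."""
    weights = torch.cat([p.detach().abs().flatten()
                         for p in model.parameters() if p.dim() > 1])
    k = int(sparsity * weights.numel())
    threshold = weights.sort().values[k]            # global magnitude cutoff
    with torch.no_grad():
        for p in model.parameters():
            if p.dim() > 1:
                p.mul_((p.abs() > threshold).float())

one_shot_magnitude_prune(nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 4)))
```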

[LG-6] Paired Wasserstein Autoencoders for Conditional Sampling

链接: https://arxiv.org/abs/2412.07586
作者: Moritz Piening,Matthias Chung
关键词-EN: generative neural network, distances greatly influenced, neural network models, Wasserstein distances greatly, greatly influenced
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:Wasserstein distances greatly influenced and coined various types of generative neural network models. Wasserstein autoencoders are particularly notable for their mathematical simplicity and straightforward implementation. However, their adaptation to the conditional case displays theoretical difficulties. As a remedy, we propose the use of two paired autoencoders. Under the assumption of an optimal autoencoder pair, we leverage the pairwise independence condition of our prescribed Gaussian latent distribution to overcome this theoretical hurdle. We conduct several experiments to showcase the practical applicability of the resulting paired Wasserstein autoencoders. Here, we consider imaging tasks and enable conditional sampling for denoising, inpainting, and unsupervised image translation. Moreover, we connect our image translation model to the Monge map behind Wasserstein-2 distances.

[LG-7] Adaptive Epsilon Adversarial Training for Robust Gravitational Wave Parameter Estimation Using Normalizing Flows

链接: https://arxiv.org/abs/2412.07559
作者: Yiqian Yang,Xihua Zhu,Fan Zhang
关键词-EN: emerging research area, research area aimed, Normalizing Flow, Inverse Autoregressive Flow, Adversarial training
类目: Machine Learning (cs.LG)
*备注: 7 pages, 9 figures

点击查看摘要

Abstract:Adversarial training with Normalizing Flow (NF) models is an emerging research area aimed at improving model robustness through adversarial samples. In this study, we focus on applying adversarial training to NF models for gravitational wave parameter estimation. We propose an adaptive epsilon method for Fast Gradient Sign Method (FGSM) adversarial training, which dynamically adjusts perturbation strengths based on gradient magnitudes using logarithmic scaling. Our hybrid architecture, combining ResNet and Inverse Autoregressive Flow, reduces the Negative Log Likelihood (NLL) loss by 47% under FGSM attacks compared to the baseline model, while maintaining an NLL of 4.2 on clean data (only 5% higher than the baseline). For perturbation strengths between 0.01 and 0.1, our model achieves an average NLL of 5.8, outperforming both fixed-epsilon (NLL: 6.7) and progressive-epsilon (NLL: 7.2) methods. Under stronger Projected Gradient Descent attacks with perturbation strength of 0.05, our model maintains an NLL of 6.4, demonstrating superior robustness while avoiding catastrophic overfitting.
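
A hedged sketch of FGSM with a log-scaled adaptive epsilon follows. The exact scaling rule below is our illustrative choice based on the abstract's description ("logarithmic scaling" of perturbation strength by gradient magnitude), not necessarily the paper's formula.

```python
import torch

def adaptive_fgsm(model, loss_fn, x, y, base_eps=0.05):
    """FGSM whose epsilon is scaled by the gradient magnitude on a log scale;
    the scaling rule is an assumption, not the authors' exact formula."""
    x = x.clone().requires_grad_(True)
    loss = loss_fn(model(x), y)
    grad, = torch.autograd.grad(loss, x)
    eps = base_eps * torch.log1p(grad.abs().mean())   # logarithmic scaling
    return (x + eps * grad.sign()).detach()           # adversarial sample
```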

[LG-8] Contractive Dynamical Imitation Policies for Efficient Out-of-Sample Recovery

链接: https://arxiv.org/abs/2412.07544
作者: Amin Abyaneh,Mahrokh G. Boroujeni,Hsiu-Chin Lin,Giancarlo Ferrari-Trecate
关键词-EN: Imitation learning, data-driven approach, prone to unreliable, unreliable outcomes, expert behavior
类目: Machine Learning (cs.LG); Robotics (cs.RO); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:Imitation learning is a data-driven approach to learning policies from expert behavior, but it is prone to unreliable outcomes in out-of-sample (OOS) regions. While previous research relying on stable dynamical systems guarantees convergence to a desired state, it often overlooks transient behavior. We propose a framework for learning policies modeled by contractive dynamical systems, ensuring that all policy rollouts converge regardless of perturbations, and, in turn, enabling efficient OOS recovery. By leveraging recurrent equilibrium networks and coupling layers, the policy structure guarantees contractivity for any parameter choice, which facilitates unconstrained optimization. Furthermore, we provide theoretical upper bounds for worst-case and expected loss terms, rigorously establishing the reliability of our method in deployment. Empirically, we demonstrate substantial OOS performance improvements in robotics manipulation and navigation tasks in simulation.

[LG-9] Anomaly detection using Diffusion-based methods

链接: https://arxiv.org/abs/2412.07539
作者: Aryan Bhosale,Samrat Mukherjee,Biplab Banerjee,Fabio Cuzzolin
关键词-EN: Denoising Diffusion Probabilistic, Diffusion Probabilistic Models, including Denoising Diffusion, paper explores, explores the utility
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:This paper explores the utility of diffusion-based models for anomaly detection, focusing on their efficacy in identifying deviations in both compact and high-resolution datasets. Diffusion-based architectures, including Denoising Diffusion Probabilistic Models (DDPMs) and Diffusion Transformers (DiTs), are evaluated for their performance using reconstruction objectives. By leveraging the strengths of these models, this study benchmarks their performance against traditional anomaly detection methods such as Isolation Forests, One-Class SVMs, and COPOD. The results demonstrate the superior adaptability, scalability, and robustness of diffusion-based methods in handling complex real-world anomaly detection tasks. Key findings highlight the role of reconstruction error in enhancing detection accuracy and underscore the scalability of these models to high-dimensional datasets. Future directions include optimizing encoder-decoder architectures and exploring multi-modal datasets to further advance diffusion-based anomaly detection.
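
Reconstruction-based scoring with a diffusion model typically noises the input to an intermediate timestep and measures how well it is denoised back. The sketch below assumes hypothetical `add_noise`/`denoise_from` helpers on a trained DDPM object; the paper's exact scoring pipeline may differ.

```python
import torch

@torch.no_grad()
def anomaly_score(ddpm, x, t_level=250):
    """Noise the input to step t, denoise it back, and use per-sample
    reconstruction error as the anomaly score: in-distribution samples
    are reconstructed well, anomalies are not."""
    t = torch.full((x.shape[0],), t_level, dtype=torch.long)
    x_noisy = ddpm.add_noise(x, t)          # forward diffusion to step t
    x_hat = ddpm.denoise_from(x_noisy, t)   # reverse diffusion back to step 0
    return ((x - x_hat) ** 2).flatten(1).mean(dim=1)   # per-sample MSE
```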

[LG-10] Quantifying the Prediction Uncertainty of Machine Learning Models for Individual Data

链接: https://arxiv.org/abs/2412.07520
作者: Koby Bibas
关键词-EN: Machine learning models, exhibited exceptional results, Machine learning, exhibited exceptional, Machine
类目: Machine Learning (cs.LG); Information Theory (cs.IT)
*备注: PHD thesis

点击查看摘要

Abstract:Machine learning models have exhibited exceptional results in various domains. The most prevalent approach for learning is the empirical risk minimizer (ERM), which adapts the model’s weights to reduce the loss on a training set and subsequently leverages these weights to predict the label for new test data. Nonetheless, ERM makes the assumption that the test distribution is similar to the training distribution, which may not always hold in real-world situations. In contrast, the predictive normalized maximum likelihood (pNML) was proposed as a min-max solution for the individual setting where no assumptions are made on the distribution of the tested input. This study investigates pNML’s learnability for linear regression and neural networks, and demonstrates that pNML can improve the performance and robustness of these models on various tasks. Moreover, the pNML provides an accurate confidence measure for its output, showcasing state-of-the-art results for out-of-distribution detection, resistance to adversarial attacks, and active learning.

[LG-11] ConfigX: Modular Configuration for Evolutionary Algorithms via Multitask Reinforcement Learning

链接: https://arxiv.org/abs/2412.07507
作者: Hongshu Guo,Zeyuan Ma,Jiacheng Chen,Yining Ma,Zhiguang Cao,Xinglin Zhang,Yue-Jiao Gong
关键词-EN: configure evolutionary algorithms, dynamically configure evolutionary, BBO instances, Recent advances, advances in Meta-learning
类目: Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)
*备注:

点击查看摘要

Abstract:Recent advances in Meta-learning for Black-Box Optimization (MetaBBO) have shown the potential of using neural networks to dynamically configure evolutionary algorithms (EAs), enhancing their performance and adaptability across various BBO instances. However, they are often tailored to a specific EA, which limits their generalizability and necessitates retraining or redesigns for different EAs and optimization problems. To address this limitation, we introduce ConfigX, a new paradigm of the MetaBBO framework that is capable of learning a universal configuration agent (model) for boosting diverse EAs. To achieve this, ConfigX first leverages a novel modularization system that enables the flexible combination of various optimization sub-modules to generate diverse EAs during training. Additionally, we propose a Transformer-based neural network to meta-learn a universal configuration policy through multitask reinforcement learning across a designed joint optimization task space. Extensive experiments verify that our ConfigX, after large-scale pre-training, achieves robust zero-shot generalization to unseen tasks and outperforms state-of-the-art baselines. Moreover, ConfigX exhibits strong lifelong learning capabilities, allowing efficient adaptation to new tasks through fine-tuning. Our proposed ConfigX represents a significant step toward an automatic, all-purpose configuration agent for EAs.

[LG-12] Real-time Sign Language Recognition Using MobileNetV2 and Transfer Learning

链接: https://arxiv.org/abs/2412.07486
作者: Smruti Jagtap,Kanika Jadhav,Rushikesh Temkar,Minal Deshmukh
关键词-EN: Indian Sign Language, Indian Sign, hearing-impaired community, solutions that make, community in India
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:The hearing-impaired community in India deserves access to tools that help them communicate; however, few known technology solutions currently make use of Indian Sign Language (ISL). Even though there are many ISL users, they cannot fully access social and educational arenas because there is not yet an efficient technology to convert ISL signals into speech or text. We undertook this initiative owing to the rising demand for inclusive products and technologies that support ISL, filling the communication gap for people with hearing disabilities. Our goal is to build a reliable sign language recognition system using Convolutional Neural Networks (CNNs). By expanding communication access, we aspire toward better educational opportunities and a more inclusive society for hearing-impaired people in India.
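
Given the title's mention of MobileNetV2 and transfer learning, a standard torchvision recipe for such a classifier would look roughly like this (torchvision ≥ 0.13 assumed); the number of sign classes is our assumption.

```python
import torch.nn as nn
from torchvision import models

def build_isl_classifier(num_signs: int):
    """Transfer learning with MobileNetV2: reuse ImageNet features, replace
    the classification head, and fine-tune only the new head at first."""
    net = models.mobilenet_v2(weights=models.MobileNet_V2_Weights.DEFAULT)
    for p in net.features.parameters():
        p.requires_grad = False                    # freeze the backbone
    net.classifier[1] = nn.Linear(net.last_channel, num_signs)
    return net

model = build_isl_classifier(num_signs=26)         # e.g., A-Z static signs
```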

[LG-13] Progressive-Resolution Policy Distillation: Leveraging Coarse-Resolution Simulation for Time-Efficient Fine-Resolution Policy Learning

链接: https://arxiv.org/abs/2412.07477
作者: Yuki Kadokawa,Hirotaka Tahara,Takamitsu Matsubara
关键词-EN: requiring skilled operators, encounter large rocks, large rocks mixed, earthwork and construction, excavators often encounter
类目: Robotics (cs.RO); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:In earthwork and construction, excavators often encounter large rocks mixed with various soil conditions, requiring skilled operators. This paper presents a framework for achieving autonomous excavation using reinforcement learning (RL) through a rock excavation simulator. In the simulation, resolution can be defined by the particle size/number in the whole soil space. Fine-resolution simulations closely mimic real-world behavior but demand significant computation time and make sample collection challenging, while coarse-resolution simulations enable faster sample collection but deviate from real-world behavior. To combine the advantages of both resolutions, we explore using policies developed in coarse-resolution simulations for pre-training in fine-resolution simulations. To this end, we propose a novel policy learning framework called Progressive-Resolution Policy Distillation (PRPD), which progressively transfers policies through some middle-resolution simulations with conservative policy transfer to avoid domain gaps that could lead to policy transfer failure. Validation in a rock excavation simulator and nine real-world rock environments demonstrated that PRPD reduced sampling time to less than 1/7 while maintaining task success rates comparable to those achieved through policy learning in a fine-resolution simulation.

[LG-14] AHSG: Adversarial Attacks on High-level Semantics in Graph Neural Networks

链接: https://arxiv.org/abs/2412.07468
作者: Kai Yuan,Xiaobing Pei,Haoran Yang
关键词-EN: Graph Neural Networks, Neural Networks, garnered significant interest, deep neural networks, primary semantics
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Graph Neural Networks (GNNs) have garnered significant interest among researchers due to their impressive performance in graph learning tasks. However, like other deep neural networks, GNNs are also vulnerable to adversarial attacks. In existing adversarial attack methods for GNNs, the metric between the attacked graph and the original graph is usually the attack budget or a measure of global graph properties. However, we have found that it is possible to generate attack graphs that disrupt the primary semantics even within these constraints. To address this problem, we propose Adversarial Attacks on High-level Semantics in Graph Neural Networks (AHSG), a graph structure attack model that ensures the retention of primary semantics. The latent representations of each node can extract rich semantic information by applying convolutional operations on graph data. These representations contain both task-relevant primary semantic information and task-irrelevant secondary semantic information. The latent representations of same-class nodes with the same primary semantics can fulfill the objective of modifying secondary semantics while preserving the primary semantics. Finally, the latent representations with attack effects are mapped to an attack graph using the Projected Gradient Descent (PGD) algorithm. By attacking graph deep learning models with some advanced defense strategies, we validate that AHSG has superior attack effectiveness compared to other attack methods. Additionally, we employ Contextual Stochastic Block Models (CSBMs) as a proxy for the primary semantics to detect the attacked graph, confirming that AHSG barely disrupts the original primary semantics of the graph.

[LG-15] Impact of Sampling Techniques and Data Leakage on XGBoost Performance in Credit Card Fraud Detection

链接: https://arxiv.org/abs/2412.07437
作者: Siyaxolisa Kabane
关键词-EN: eXtreme gradient boosting, Credit card fraud, Credit card, credit card transaction, card fraud detection
类目: Machine Learning (cs.LG)
*备注: 19 pages, 4 figures

点击查看摘要

Abstract:Credit card fraud detection remains a critical challenge in financial security, with machine learning models like XGBoost(eXtreme gradient boosting) emerging as powerful tools for identifying fraudulent transactions. However, the inherent class imbalance in credit card transaction datasets poses significant challenges for model performance. Although sampling techniques are commonly used to address this imbalance, their implementation sometimes precedes the train-test split, potentially introducing data leakage. This study presents a comparative analysis of XGBoost’s performance in credit card fraud detection under three scenarios: Firstly without any imbalance handling techniques, secondly with sampling techniques applied only to the training set after the train-test split, and third with sampling techniques applied before the train-test split. We utilized a dataset from Kaggle of 284,807 credit card transactions, containing 0.172% fraudulent cases, to evaluate these approaches. Our findings show that although sampling strategies enhance model performance, the reliability of results is greatly impacted by when they are applied. Due to a data leakage issue that frequently occurs in machine learning models during the sampling phase, XGBoost models trained on data where sampling was applied prior to the train-test split may have displayed artificially inflated performance metrics. Surprisingly, models trained with sampling techniques applied solely to the training set demonstrated significantly lower results than those with pre-split sampling, all the while preserving the integrity of the evaluation process.
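
The leakage-safe protocol the study advocates is simply to split before resampling. A minimal sketch with scikit-learn, imbalanced-learn, and XGBoost follows, assuming a feature matrix `X` and labels `y` are already loaded (e.g., from the Kaggle dataset).

```python
from sklearn.model_selection import train_test_split
from imblearn.over_sampling import SMOTE
from xgboost import XGBClassifier

# X, y: feature matrix and labels loaded beforehand (assumption).
# Leakage-safe order: split first, then oversample ONLY the training fold.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)
X_res, y_res = SMOTE(random_state=42).fit_resample(X_train, y_train)
clf = XGBClassifier(eval_metric="logloss").fit(X_res, y_res)
print(clf.score(X_test, y_test))   # the test fold never saw synthetic samples
```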

[LG-16] Parallel simulation for sampling under isoperimetry and score-based diffusion models

链接: https://arxiv.org/abs/2412.07435
作者: Huanjian Zhou,Masashi Sugiyama
关键词-EN: proving discretization bounds, recent years, surge of interest, interest in proving, proving discretization
类目: Data Structures and Algorithms (cs.DS); Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG); Numerical Analysis (math.NA)
*备注: arXiv admin note: substantial text overlap with arXiv:2405.15986 by other authors

点击查看摘要

Abstract:In recent years, there has been a surge of interest in proving discretization bounds for sampling under isoperimetry and for diffusion models. As data size grows, reducing the iteration cost becomes an important goal. Inspired by the great success of the parallel simulation of the initial value problem in scientific computation, we propose parallel Picard methods for sampling tasks. Rigorous theoretical analysis reveals that our algorithm achieves better dependence on dimension d than prior works in iteration complexity (i.e., reduced from \widetilde{O}(\log^2 d) to \widetilde{O}(\log d)), which is even optimal for sampling under isoperimetry with specific iteration complexity. Our work highlights the potential advantages of simulation methods in scientific computation for dynamics-based sampling and diffusion models.

[LG-17] Machine Learning Algorithms for Detecting Mental Stress in College Students

链接: https://arxiv.org/abs/2412.07415
作者: Ashutosh Singh,Khushdeep Singh,Amit Kumar,Abhishek Shrivastava,Santosh Kumar
关键词-EN: affects people health, Support Vector Machines, today world, stress, Support Vector
类目: Machine Learning (cs.LG); Computers and Society (cs.CY)
*备注: This paper was presented at an IEEE conference and is 5 pages long with 5 figures. It discusses machine learning algorithms for detecting mental stress in college students

点击查看摘要

Abstract:In today’s world, stress is a big problem that affects people’s health and happiness. More and more people are feeling stressed out, which can lead to many health issues such as breathing problems, feeling overwhelmed, heart attack, diabetes, etc. This work endeavors to forecast stress and non-stress occurrences among college students by applying various machine learning algorithms: Decision Trees, Random Forest, Support Vector Machines, AdaBoost, Naive Bayes, Logistic Regression, and K-nearest Neighbors. The primary objective of this work is to leverage a research study to predict and mitigate stress and non-stress based on the collected questionnaire dataset. We conducted a workshop with the primary goal of studying the stress levels found among the students. This workshop was attended by approximately 843 students aged between 18 and 21 years old. A questionnaire, validated under the guidance of experts from the All India Institute of Medical Sciences (AIIMS) Raipur, Chhattisgarh, India, was given to the students; our dataset is based on their responses. The survey consists of 28 questions, aiming to comprehensively understand the multidimensional aspects of stress, including emotional well-being, physical health, academic performance, relationships, and leisure. This work finds that Support Vector Machines achieve the highest accuracy for stress prediction, reaching 95%. The study contributes to a deeper understanding of stress determinants. It aims to improve college students’ overall quality of life and academic success, addressing the multifaceted nature of stress.

[LG-18] Towards Graph Foundation Models: A Study on the Generalization of Positional and Structural Encodings

链接: https://arxiv.org/abs/2412.07407
作者: Billy Joe Franks,Moshe Eliasof,Semih Cantürk,Guy Wolf,Carola-Bibiane Schönlieb,Sophie Fellenz,Marius Kloft
关键词-EN: Recent advances, graph neural networks, neural networks, advances in integrating, integrating positional
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Recent advances in integrating positional and structural encodings (PSEs) into graph neural networks (GNNs) have significantly enhanced their performance across various graph learning tasks. However, the general applicability of these encodings and their potential to serve as foundational representations for graphs remain uncertain. This paper investigates the fine-tuning efficiency, scalability with sample size, and generalization capability of learnable PSEs across diverse graph datasets. Specifically, we evaluate their potential as universal pre-trained models that can be easily adapted to new tasks with minimal fine-tuning and limited data. Furthermore, we assess the expressivity of the learned representations, particularly when used to augment downstream GNNs. We demonstrate through extensive benchmarking and empirical analysis that PSEs generally enhance downstream models. However, some datasets may require specific PSE-augmentations to achieve optimal performance. Nevertheless, our findings highlight their significant potential to become integral components of future graph foundation models. We provide new insights into the strengths and limitations of PSEs, contributing to the broader discourse on foundation models in graph learning.

[LG-19] Temporal Linear Item-Item Model for Sequential Recommendation WSDM2025

链接: https://arxiv.org/abs/2412.07382
作者: Seongmin Park,Mincheol Yoon,Minjin Choi,Jongwuk Lee
关键词-EN: actively explored due, actively explored, explored due, suffer from inefficiency, inefficiency inherent
类目: Information Retrieval (cs.IR); Machine Learning (cs.LG)
*备注: Accepted by WSDM 2025

点击查看摘要

Abstract:In sequential recommendation (SR), neural models have been actively explored due to their remarkable performance, but they suffer from inefficiency inherent to their complexity. On the other hand, linear SR models exhibit high efficiency and achieve competitive or superior accuracy compared to neural models. However, they solely deal with the sequential order of items (i.e., sequential information) and overlook the actual timestamp (i.e., temporal information), which limits their ability to effectively capture user preference drifts over time. To address this issue, we propose a novel linear SR model, named TemporAl LinEar item-item model (TALE), incorporating temporal information while preserving training/inference efficiency, with three key components. (i) Single-target augmentation concentrates on a single target item, enabling us to learn the temporal correlation for the target item. (ii) Time interval-aware weighting utilizes the actual timestamp to discern the item correlation depending on time intervals. (iii) Trend-aware normalization reflects the dynamic shift of item popularity over time. Our empirical studies show that TALE outperforms ten competing SR models by up to 18.71% gains on five benchmark datasets. It also exhibits remarkable effectiveness in evaluating long-tail items by up to 30.45% gains. The source code is available at this https URL.
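
As one hedged reading of "time interval-aware weighting", the sketch below discounts item-item co-occurrences by the time gap between interactions using an exponential decay. The decay form, half-life, and co-occurrence construction are our illustrative assumptions; TALE's actual weighting and closed-form training differ.

```python
import numpy as np

def time_decayed_item_item(interactions, n_items, half_life=7 * 86400):
    """Build a dense item-item matrix where each co-occurrence is weighted
    by 0.5 ** (time_gap / half_life). Timestamps are assumed in seconds."""
    W = np.zeros((n_items, n_items))
    for user_events in interactions:                 # [(item_id, timestamp), ...]
        for i, (it_i, ts_i) in enumerate(user_events):
            for it_j, ts_j in user_events[i + 1:]:
                decay = 0.5 ** (abs(ts_j - ts_i) / half_life)
                W[it_i, it_j] += decay               # recent pairs count more
                W[it_j, it_i] += decay
    return W
```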

[LG-20] A Spectral Framework for Tracking Communities in Evolving Networks

链接: https://arxiv.org/abs/2412.07378
作者: Jacob Hume,Laura Balzano
关键词-EN: motivated by applications, neuroscience to sociology, communities in time-varying, important task, applications in fields
类目: ocial and Information Networks (cs.SI); Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: 34 pages, 13 figures

点击查看摘要

Abstract:Discovering and tracking communities in time-varying networks is an important task in network science, motivated by applications in fields ranging from neuroscience to sociology. In this work, we characterize the celebrated family of spectral methods for static clustering in terms of the low-rank approximation of high-dimensional node embeddings. From this perspective, it becomes natural to view the evolving community detection problem as one of subspace tracking on the Grassmann manifold. While the resulting optimization problem is nonconvex, we adopt a block majorize-minimize Riemannian optimization scheme to learn the Grassmann geodesic which best fits the data. Our framework generalizes any static spectral community detection approach and leads to algorithms achieving favorable performance on synthetic and real temporal networks, including those that are weighted, signed, directed, mixed-membership, multiview, hierarchical, cocommunity-structured, bipartite, or some combination thereof. We show how to cast a wide variety of methods into our framework and demonstrate greatly improved dynamic community detection results in all cases.

[LG-21] Addressing Key Challenges of Adversarial Attacks and Defenses in the Tabular Domain: A Methodological Framework for Coherence and Consistency

链接: https://arxiv.org/abs/2412.07326
作者: Yael Itzhakev,Amit Giloni,Yuval Elovici,Asaf Shabtai
关键词-EN: Machine learning models, Machine learning, learning models trained, realistic scenarios, adversarial samples
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Machine learning models trained on tabular data are vulnerable to adversarial attacks, even in realistic scenarios where attackers have access only to the model’s outputs. Researchers evaluate such attacks by considering metrics like success rate, perturbation magnitude, and query count. However, unlike other data domains, the tabular domain contains complex interdependencies among features, presenting a unique aspect that should be evaluated: the need for the attack to generate coherent samples and ensure feature consistency for indistinguishability. Currently, there is no established methodology for evaluating adversarial samples based on these criteria. In this paper, we address this gap by proposing new evaluation criteria tailored to the quality of tabular attacks; we define an anomaly-based framework to assess the distinguishability of adversarial samples and utilize the SHAP explainability technique to identify inconsistencies in the model’s decision-making process caused by adversarial samples. These criteria could form the basis for potential detection methods and be integrated into established evaluation metrics for assessing attack quality. Additionally, we introduce a novel technique for perturbing dependent features while maintaining coherence and feature consistency within the sample. We compare different attack strategies, examining black-box query-based attacks and transferability-based gradient attacks across four target models. Our experiments, conducted on benchmark tabular datasets, reveal significant differences between the examined attack strategies in terms of the attacker’s risk and effort and the attacks’ quality. The findings provide valuable insights on the strengths, limitations, and trade-offs of various adversarial attacks in the tabular domain, laying a foundation for future research on attacks and defense development.

[LG-22] Label Distribution Learning using the Squared Neural Family on the Probability Simplex

链接: https://arxiv.org/abs/2412.07324
作者: Daokun Zhang,Russell Tsuchida,Dino Sejdinovic
关键词-EN: label distribution prediction, Label distribution, Squared Neural Family, distribution, Label
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Label distribution learning (LDL) provides a framework wherein a distribution over categories rather than a single category is predicted, with the aim of addressing ambiguity in labeled data. Existing research on LDL mainly focuses on the task of point estimation, i.e., pinpointing an optimal distribution in the probability simplex conditioned on the input sample. In this paper, we estimate a probability distribution of all possible label distributions over the simplex, by unleashing the expressive power of the recently introduced Squared Neural Family (SNEFY). With the modeled distribution, label distribution prediction can be achieved by performing the expectation operation to estimate the mean of the distribution of label distributions. Moreover, more information about the label distribution can be inferred, such as the prediction reliability and uncertainties. We conduct extensive experiments on the label distribution prediction task, showing that our distribution modeling based method can achieve very competitive label distribution prediction performance compared with the state-of-the-art baselines. Additional experiments on active learning and ensemble learning demonstrate that our probabilistic approach can effectively boost the performance in these settings, by accurately estimating the prediction reliability and uncertainties.

[LG-23] ConceptSearch: Towards Efficient Program Search Using LLMs for Abstraction and Reasoning Corpus (ARC) AAAI2025

链接: https://arxiv.org/abs/2412.07322
作者: Kartik Singhal,Gautam Shroff
关键词-EN: Reasoning Corpus, Abstraction and Reasoning, few-shot learning capabilities, current deep learning, deep learning methods
类目: Machine Learning (cs.LG)
*备注: 8 pages, 7 figures, to appear at AAAI 2025

点击查看摘要

Abstract:The Abstraction and Reasoning Corpus (ARC) poses a significant challenge to artificial intelligence, demanding broad generalization and few-shot learning capabilities that remain elusive for current deep learning methods, including large language models (LLMs). While LLMs excel in program synthesis, their direct application to ARC yields limited success. To address this, we introduce ConceptSearch, a novel function-search algorithm that leverages LLMs for program generation and employs a concept-based scoring method to guide the search efficiently. Unlike simplistic pixel-based metrics like Hamming distance, ConceptSearch evaluates programs on their ability to capture the underlying transformation concept reflected in the input-output examples. We explore three scoring functions: Hamming distance, a CNN-based scoring function, and an LLM-based natural language scoring function. Experimental results demonstrate the effectiveness of ConceptSearch, achieving a significant performance improvement over direct prompting with GPT-4. Moreover, our novel concept-based scoring exhibits up to 30% greater efficiency compared to Hamming distance, measured in terms of the number of iterations required to reach the correct solution. These findings highlight the potential of LLM-driven program search when integrated with concept-based guidance for tackling challenging generalization problems like ARC. Code: this https URL
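
The simplest of the three scoring functions, pixel-level Hamming scoring, can be sketched as follows; treating shape mismatches as a score of zero and averaging over training pairs are our assumptions about the details.

```python
import numpy as np

def hamming_score(program, tasks):
    """Pixel-level scoring baseline: fraction of output cells a candidate
    program gets right across the training examples."""
    scores = []
    for grid_in, grid_out in tasks:                  # ARC input/output grids
        pred = program(grid_in)
        if np.shape(pred) != np.shape(grid_out):     # wrong output shape
            scores.append(0.0)
        else:
            scores.append(float(np.mean(np.asarray(pred) == np.asarray(grid_out))))
    return sum(scores) / len(scores)
```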

[LG-24] High-dimensional classification problems with Barron regular boundaries under margin conditions

链接: https://arxiv.org/abs/2412.07312
作者: Jonathan García,Philipp Petersen
关键词-EN: Barron-regular decision boundary, high polynomial degree, ReLU neural networks, Barron-regular decision, decision boundary
类目: Machine Learning (cs.LG); Probability (math.PR); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:We prove that a classifier with a Barron-regular decision boundary can be approximated with a rate of high polynomial degree by ReLU neural networks with three hidden layers when a margin condition is assumed. In particular, for strong margin conditions, high-dimensional discontinuous classifiers can be approximated with a rate that is typically only achievable when approximating a low-dimensional smooth function. We demonstrate how these expression rate bounds imply fast-rate learning bounds that are close to n^{-1}, where n is the number of samples. In addition, we carry out comprehensive numerical experimentation on binary classification problems with various margins. We study three different dimensions, with the highest dimensional problem corresponding to images from the MNIST data set.

[LG-25] PTSBench: A Comprehensive Post-Training Sparsity Benchmark Towards Algorithms and Models

链接: https://arxiv.org/abs/2412.07268
作者: Zining Wang,Jinyang Guo,Ruihao Gong,Yang Yong,Aishan Liu,Yushi Huang,Jiaheng Liu,Xianglong Liu
关键词-EN: increased attention, PTS algorithms, model efficiency, PTS, effectiveness and efficiency
类目: Machine Learning (cs.LG); Multimedia (cs.MM)
*备注:

点击查看摘要

Abstract:With the increased attention to model efficiency, post-training sparsity (PTS) has become more and more prevalent because of its effectiveness and efficiency. However, there remain questions on the best practice of PTS algorithms and the sparsification ability of models, which hinders the further development of this area. Therefore, a benchmark to comprehensively investigate the issues above is urgently needed. In this paper, we propose the first comprehensive post-training sparsity benchmark, called PTSBench, covering both algorithms and models. We benchmark 10+ general-pluggable, fine-grained PTS techniques on 3 typical tasks using over 40 off-the-shelf model architectures. Through extensive experiments and analyses, we obtain valuable conclusions and provide several insights from both the algorithm and model aspects. Our PTSBench can provide (1) new observations for a better understanding of PTS algorithms, (2) in-depth and comprehensive evaluations of the sparsification ability of models, and (3) a well-structured and easy-to-integrate open-source framework. We hope this work will provide illuminating conclusions and advice for future studies of post-training sparsity methods and sparsification-friendly model design. The code for our PTSBench is released at this https URL.

[LG-26] MemHunter: Automated and Verifiable Memorization Detection at Dataset-scale in LLM s

链接: https://arxiv.org/abs/2412.07261
作者: Zhenpeng Wu,Jian Lou,Zibin Zheng,Chuan Chen
关键词-EN: Large language models, Large language, raising significant privacy, significant privacy concerns, raising significant
类目: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Large language models (LLMs) have been shown to memorize and reproduce content from their training data, raising significant privacy concerns, especially with web-scale datasets. Existing methods for detecting memorization are largely sample-specific, relying on manually crafted or discretely optimized memory-inducing prompts generated on a per-sample basis, which become impractical for dataset-level detection due to the prohibitive computational cost of iterating over all samples. In real-world scenarios, data owners may need to verify whether a susceptible LLM has memorized their dataset, particularly if the LLM may have collected the data from the web without authorization. To address this, we introduce *MemHunter*, which trains a memory-inducing LLM and employs hypothesis testing to efficiently detect memorization at the dataset level, without requiring sample-specific memory inducing. Experiments on models such as Pythia and Llama-2 demonstrate that *MemHunter* can extract up to 40% more training data than existing methods under constrained time resources and reduce search time by up to 80% when integrated as a plug-in. Crucially, *MemHunter* is the first method capable of dataset-level memorization detection, providing an indispensable tool for assessing privacy risks in LLMs that are powered by vast web-sourced datasets.

[LG-27] Developing a Dataset-Adaptive Normalized Metric for Machine Learning Model Assessment: Integrating Size Complexity and Class Imbalance DATE

链接: https://arxiv.org/abs/2412.07244
作者: Serzhan Ossenov
关键词-EN: Traditional metrics, precision are frequently, sufficient for evaluating, evaluate machine learning, high-dimensional datasets
类目: Machine Learning (cs.LG); Logic in Computer Science (cs.LO)
*备注: 36 pages, 17 figures. Includes results validated on datasets from UCI Machine Learning Repository

点击查看摘要

Abstract:Traditional metrics like accuracy, F1-score, and precision are frequently used to evaluate machine learning models; however, they may not be sufficient for evaluating performance on tiny, unbalanced, or high-dimensional datasets. A dataset-adaptive, normalized metric that incorporates dataset characteristics like size, feature dimensionality, class imbalance, and signal-to-noise ratio is presented in this study. Early insights into the model’s performance potential in challenging circumstances are provided by the suggested metric, which offers a scalable and adaptable evaluation framework. The metric’s capacity to accurately forecast model scalability and performance is demonstrated via experimental validation spanning classification, regression, and clustering tasks, guaranteeing solid assessments in settings with limited data. This method has important ramifications for effective resource allocation and model optimization in machine learning workflows.
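The abstract does not spell out the metric's formula, so the sketch below only loosely illustrates the idea of discounting a raw score by dataset characteristics; the weighting choices (an entropy-based balance term, a size-versus-dimensionality factor) are assumptions for illustration, not the paper's definition.

```python
import numpy as np

def dataset_adaptive_score(accuracy, n_samples, n_features, class_counts):
    """Illustrative composite: discount raw accuracy by dataset difficulty.

    Assumed weighting: difficulty grows with dimensionality per sample and
    with class imbalance (normalized entropy of the class counts).
    """
    p = np.asarray(class_counts, dtype=float)
    p = p / p.sum()
    balance = -(p * np.log(p + 1e-12)).sum() / np.log(len(p))  # 1 = balanced
    size_factor = n_samples / (n_samples + n_features)          # in (0, 1)
    return accuracy * balance * size_factor

# A 95%-accurate model looks far less impressive on a tiny, imbalanced dataset.
print(dataset_adaptive_score(0.95, n_samples=100, n_features=50, class_counts=[90, 10]))
print(dataset_adaptive_score(0.95, n_samples=10000, n_features=50, class_counts=[5000, 5000]))
```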

[LG-28] Adversarial Filtering Based Evasion and Backdoor Attacks to EEG-Based Brain-Computer Interfaces

链接: https://arxiv.org/abs/2412.07231
作者: Lubin Meng,Xue Jiang,Xiaoqing Chen,Wenzhong Liu,Hanbin Luo,Dongrui Wu
关键词-EN: enables direct communication, brain-computer interface, enables direct, external device, direct communication
类目: Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:A brain-computer interface (BCI) enables direct communication between the brain and an external device. Electroencephalogram (EEG) is a common input signal for BCIs, due to its convenience and low cost. Most research on EEG-based BCIs focuses on the accurate decoding of EEG signals, while ignoring their security. Recent studies have shown that machine learning models in BCIs are vulnerable to adversarial attacks. This paper proposes adversarial filtering based evasion and backdoor attacks to EEG-based BCIs, which are very easy to implement. Experiments on three datasets from different BCI paradigms demonstrated the effectiveness of our proposed attack approaches. To our knowledge, this is the first study on adversarial filtering for EEG-based BCIs, raising a new security concern and calling for more attention on the security of BCIs.

[LG-29] T-TIME: Test-Time Information Maximization Ensemble for Plug-and-Play BCIs

链接: https://arxiv.org/abs/2412.07228
作者: Siyang Li,Ziwei Wang,Hanbin Luo,Lieyun Ding,Dongrui Wu
关键词-EN: enables direct communication, based brain-computer interface, brain-computer interface, enables direct, direct communication
类目: Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Objective: An electroencephalogram (EEG)-based brain-computer interface (BCI) enables direct communication between the human brain and a computer. Due to individual differences and non-stationarity of EEG signals, such BCIs usually require a subject-specific calibration session before each use, which is time-consuming and user-unfriendly. Transfer learning (TL) has been proposed to shorten or eliminate this calibration, but existing TL approaches mainly consider offline settings, where all unlabeled EEG trials from the new user are available. Methods: This paper proposes Test-Time Information Maximization Ensemble (T-TIME) to accommodate the most challenging online TL scenario, where unlabeled EEG data from the new user arrive in a stream, and immediate classification is performed. T-TIME initializes multiple classifiers from the aligned source data. When an unlabeled test EEG trial arrives, T-TIME first predicts its labels using ensemble learning, and then updates each classifier by conditional entropy minimization and adaptive marginal distribution regularization. Our code is publicly available. Results: Extensive experiments on three public motor imagery based BCI datasets demonstrated that T-TIME outperformed about 20 classical and state-of-the-art TL approaches. Significance: To our knowledge, this is the first work on test-time adaptation for calibration-free EEG-based BCIs, making plug-and-play BCIs possible.
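Below is a minimal sketch of the online adaptation step described above: ensemble prediction on an incoming unlabeled trial, followed by a conditional-entropy-minimization update of each classifier. The aligned-source initialization and the adaptive marginal distribution regularization of the full method are omitted, and all model sizes are illustrative.

```python
import torch
import torch.nn.functional as F

def entropy_minimization_step(classifiers, optimizers, x_t):
    """One online test-time step on a single unlabeled trial x_t (sketch)."""
    with torch.no_grad():
        probs = torch.stack([F.softmax(clf(x_t), dim=-1) for clf in classifiers])
        y_pred = probs.mean(dim=0).argmax(dim=-1)  # ensemble prediction

    # Update each classifier by minimizing the conditional entropy of its output.
    for clf, opt in zip(classifiers, optimizers):
        p = F.softmax(clf(x_t), dim=-1)
        entropy = -(p * torch.log(p + 1e-8)).sum(dim=-1).mean()
        opt.zero_grad()
        entropy.backward()
        opt.step()
    return y_pred

# Toy usage: an ensemble of three linear "EEG" classifiers over 64 features.
torch.manual_seed(0)
clfs = [torch.nn.Linear(64, 4) for _ in range(3)]
opts = [torch.optim.SGD(c.parameters(), lr=1e-3) for c in clfs]
print(entropy_minimization_step(clfs, opts, torch.randn(1, 64)))
```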

[LG-30] Incremental Gaussian Mixture Clustering for Data Streams

链接: https://arxiv.org/abs/2412.07217
作者: Aniket Bhanderi,Raj Bhatnagar
关键词-EN: analyzing data streams, application domains, problem of analyzing, large volumes, volumes is important
类目: Machine Learning (cs.LG); Databases (cs.DB)
*备注:

点击查看摘要

Abstract:Analyzing very large data streams is an important problem in many application domains. In this paper we present and demonstrate the effective working of an algorithm that finds clusters and anomalous data points in streaming datasets. Entropy minimization is used as a criterion for defining and updating clusters formed from a streaming dataset. As the clusters are formed, we also identify anomalous data points that show up far away from all known clusters. With a number of 2-D datasets, we demonstrate the effectiveness of the algorithm in discovering clusters and identifying anomalous data points.

[LG-31] Learnable Sparse Customization in Heterogeneous Edge Computing ICDE2025

链接: https://arxiv.org/abs/2412.07216
作者: Jingjing Xue,Sheng Sun,Min Liu,Yuwei Wang,Zhuotao Liu,Jingyuan Wang
关键词-EN: utilize massive distributed, edge computing paradigm, massive distributed data, promising edge computing, effectively manage
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)
*备注: Accepted by ICDE 2025

点击查看摘要

Abstract:To effectively manage and utilize massive distributed data at the network edge, Federated Learning (FL) has emerged as a promising edge computing paradigm across data silos. However, FL still faces two challenges: system heterogeneity (i.e., the diversity of hardware resources across edge devices) and statistical heterogeneity (i.e., non-IID data). Although sparsification can extract diverse submodels for diverse clients, most sparse FL works either simply assign submodels with artificially-given rigid rules or prune partial parameters using heuristic strategies, resulting in inflexible sparsification and poor performance. In this work, we propose Learnable Personalized Sparsification for heterogeneous Federated learning (FedLPS), which achieves the learnable customization of heterogeneous sparse models with importance-associated patterns and adaptive ratios to simultaneously tackle system and statistical heterogeneity. Specifically, FedLPS learns the importance of model units on local data representation and further derives an importance-based sparse pattern with minimal heuristics to accurately extract personalized data features in non-IID settings. Furthermore, Prompt Upper Confidence Bound Variance (P-UCBV) is designed to adaptively determine sparse ratios by learning the superimposed effect of diverse device capabilities and non-IID data, aiming at resource self-adaptation with promising accuracy. Extensive experiments show that FedLPS outperforms status quo approaches in accuracy and training costs, which improves accuracy by 1.28%-59.34% while reducing running time by more than 68.80%.

[LG-32] Epidemiological Model Calibration via Graybox Bayesian Optimization

链接: https://arxiv.org/abs/2412.07193
作者: Puhua Niu,Byung-Jun Yoon,Xiaoning Qian
关键词-EN: calibration methods, model calibration methods, developing efficient calibration, calibration, efficient calibration methods
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:In this study, we focus on developing efficient calibration methods via Bayesian decision-making for the family of compartmental epidemiological models. The existing calibration methods usually assume that the compartmental model is cheap in terms of its output and gradient evaluation, which may not hold in practice when extending them to more general settings. Therefore, we introduce model calibration methods based on a “graybox” Bayesian optimization (BO) scheme, enabling more efficient calibration for general epidemiological models. This approach uses Gaussian processes as a surrogate to the expensive model, and leverages the functional structure of the compartmental model to enhance calibration performance. Additionally, we develop model calibration methods via a decoupled decision-making strategy for BO, which further exploits the decomposable nature of the functional structure. The calibration efficiencies of the multiple proposed schemes are evaluated on various data generated by a compartmental model mimicking real-world epidemic processes, as well as on real-world COVID-19 datasets. Experimental results demonstrate that our proposed graybox variants of BO schemes can efficiently calibrate computationally expensive models, further improve the calibration performance measured by the logarithm of mean square errors, and achieve faster performance convergence in terms of BO iterations. We anticipate that the proposed calibration methods can be extended to enable fast calibration of more complex epidemiological models, such as agent-based models.

[LG-33] A New Federated Learning Framework Against Gradient Inversion Attacks AAAI2025

链接: https://arxiv.org/abs/2412.07187
作者: Pengxin Guo,Shuang Zeng,Wenhao Chen,Xiaodan Zhang,Weihong Ren,Yuyin Zhou,Liangqiong Qu
关键词-EN: collectively train machine, train machine learning, Secure Multi-party Computing, protect data privacy, Gradient Inversion Attacks
类目: Machine Learning (cs.LG); Cryptography and Security (cs.CR)
*备注: Accepted at AAAI 2025

点击查看摘要

Abstract:Federated Learning (FL) aims to protect data privacy by enabling clients to collectively train machine learning models without sharing their raw data. However, recent studies demonstrate that information exchanged during FL is subject to Gradient Inversion Attacks (GIA) and, consequently, a variety of privacy-preserving methods have been integrated into FL to thwart such attacks, such as Secure Multi-party Computing (SMC), Homomorphic Encryption (HE), and Differential Privacy (DP). Despite their ability to protect data privacy, these approaches inherently involve substantial privacy-utility trade-offs. By revisiting the key to privacy exposure in FL under GIA, which lies in the frequent sharing of model gradients that contain private data, we take a new perspective by designing a novel privacy-preserving FL framework that effectively “breaks the direct connection” between the shared parameters and the local private data to defend against GIA. Specifically, we propose a Hypernetwork Federated Learning (HyperFL) framework that utilizes hypernetworks to generate the parameters of the local model, and only the hypernetwork parameters are uploaded to the server for aggregation. Theoretical analyses demonstrate the convergence rate of the proposed HyperFL, while extensive experimental results show the privacy-preserving capability and comparable performance of HyperFL. Code is available at this https URL.
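A minimal sketch of the core HyperFL idea as stated in the abstract: a hypernetwork maps a client embedding to the parameters of the local classifier, so only the hypernetwork weights would ever leave the client. The architecture and all dimensions below are illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.nn as nn

class HyperNet(nn.Module):
    """Sketch: generate a small linear classifier's weights from a client embedding."""

    def __init__(self, embed_dim=16, in_features=32, num_classes=10):
        super().__init__()
        self.in_features, self.num_classes = in_features, num_classes
        out = in_features * num_classes + num_classes  # weight + bias
        self.net = nn.Sequential(nn.Linear(embed_dim, 64), nn.ReLU(), nn.Linear(64, out))

    def forward(self, client_embedding, x):
        params = self.net(client_embedding)
        w = params[: self.in_features * self.num_classes].view(self.num_classes, self.in_features)
        b = params[self.in_features * self.num_classes :]
        return x @ w.t() + b  # logits of the generated local classifier

hyper = HyperNet()
logits = hyper(torch.randn(16), torch.randn(8, 32))  # one client, a batch of 8
print(logits.shape)  # torch.Size([8, 10])
```

Only `hyper.net`'s parameters would be aggregated server-side, decoupling the shared quantities from the data-fitted classifier itself.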

[LG-34] Effective Reward Specification in Deep Reinforcement Learning

链接: https://arxiv.org/abs/2412.07177
作者: Julien Roy
关键词-EN: Deep Reinforcement Learning, sequential decision-making problems, Reinforcement Learning, Deep Reinforcement, decision-making problems
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:In the last decade, Deep Reinforcement Learning has evolved into a powerful tool for complex sequential decision-making problems. It combines deep learning’s proficiency in processing rich input signals with reinforcement learning’s adaptability across diverse control tasks. At its core, an RL agent seeks to maximize its cumulative reward, enabling AI algorithms to uncover novel solutions previously unknown to experts. However, this focus on reward maximization also introduces a significant difficulty: improper reward specification can result in unexpected, misaligned agent behavior and inefficient learning. The complexity of accurately specifying the reward function is further amplified by the sequential nature of the task, the sparsity of learning signals, and the multifaceted aspects of the desired behavior. In this thesis, we survey the literature on effective reward specification strategies, identify core challenges relating to each of these approaches, and propose original contributions addressing the issue of sample efficiency and alignment in deep reinforcement learning. Reward specification represents one of the most challenging aspects of applying reinforcement learning in real-world domains. Our work underscores the absence of a universal solution to this complex and nuanced challenge; solving it requires selecting the most appropriate tools for the specific requirements of each unique application.

[LG-35] Unlocking TriLevel Learning with Level-Wise Zeroth Order Constraints: Distributed Algorithms and Provable Non-Asymptotic Convergence

链接: https://arxiv.org/abs/2412.07138
作者: Yang Jiao,Kai Yang,Chengtao Jian
关键词-EN: found diverse applications, robust hyperparameter optimization, machine learning applications, numerous machine learning, TLL problems
类目: Machine Learning (cs.LG); Optimization and Control (math.OC)
*备注:

点击查看摘要

Abstract:Trilevel learning (TLL) has found diverse applications across numerous machine learning settings, ranging from robust hyperparameter optimization to domain adaptation. However, existing research primarily focuses on scenarios where TLL can be addressed with first-order information available at each level, which is inadequate in many situations involving zeroth-order constraints, such as when black-box models are employed. Moreover, in trilevel learning, data may be distributed across various nodes, necessitating strategies to address TLL problems without centralizing data on servers, so as to uphold data privacy. To this end, an effective distributed trilevel zeroth-order learning framework, DTZO, is proposed in this work to address TLL problems with level-wise zeroth-order constraints in a distributed manner. The proposed DTZO is versatile and can be adapted to a wide range of (grey-box) TLL problems with partial zeroth-order constraints. In DTZO, the cascaded polynomial approximation can be constructed without relying on gradients or sub-gradients, leveraging a novel cut, i.e., the zeroth-order cut. Furthermore, we theoretically carry out a non-asymptotic convergence rate analysis for the proposed DTZO in achieving the \epsilon-stationary point. Extensive experiments have been conducted to demonstrate and validate the superior performance of the proposed DTZO, e.g., it achieves up to an approximately 40% improvement in performance.

[LG-36] Corrupted Learning Dynamics in Games

链接: https://arxiv.org/abs/2412.07120
作者: Taira Tsuchiya,Shinji Ito,Haipeng Luo
关键词-EN: multiple players interact, players employ no-regret, mathrm, employ no-regret algorithms, shared environment
类目: Computer Science and Game Theory (cs.GT); Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: 28 pages

点击查看摘要

Abstract:Learning in games is the problem where multiple players interact in a shared environment, each aiming to minimize their own regret, and it is known that an approximate equilibrium can be obtained when all players employ no-regret algorithms. Notably, by adopting optimistic follow-the-regularized-leader (OFTRL), the regret of each player after T rounds is constant in two-player zero-sum games, implying that an equilibrium can be computed at a faster rate of O(1/T). However, this acceleration is limited to the honest regime, where all players fully adhere to the given algorithms. To address this limitation, this paper presents corrupted learning dynamics that adaptively find an equilibrium at a rate dependent on the degree of deviation by each player from the given algorithm’s output. First, in two-player zero-sum games, we provide learning dynamics where the external regret of the x-player (and similarly for the y-player) in the corrupted regime is roughly bounded by O(\log(m_{\mathrm{x}} m_{\mathrm{y}}) + \sqrt{C_{\mathrm{y}}} + C_{\mathrm{x}}), which implies a convergence rate of \tilde{O}((\sqrt{C_{\mathrm{y}}} + C_{\mathrm{x}})/T) to a Nash equilibrium. Here, m_{\mathrm{x}} and m_{\mathrm{y}} are the numbers of actions of the x- and y-players, respectively, and C_{\mathrm{x}} and C_{\mathrm{y}} are the cumulative deviations of the x- and y-players from their given algorithms. Furthermore, we extend our approach to multi-player general-sum games, showing that the swap regret of player i in the corrupted regime is bounded by O(\log T + \sqrt{\sum_j C_j \log T} + C_i), where C_i is the cumulative deviation of player i from the given algorithm. This implies a convergence rate of O((\log T + \sqrt{\sum_j C_j \log T} + C_i)/T) to a correlated equilibrium. Our learning dynamics are agnostic to the corruption levels and are based on OFTRL with new adaptive learning rates.

[LG-37] Covered Forest: Fine-grained generalization analysis of graph neural networks

链接: https://arxiv.org/abs/2412.07106
作者: Antonis Vasileiou,Ben Finkelshtein,Floris Geerts,Ron Levie,Christopher Morris
关键词-EN: graph neural networks, graph isomorphism testing, message-passing graph neural, neural networks, primarily through combinatorial
类目: Machine Learning (cs.LG); Discrete Mathematics (cs.DM); Data Structures and Algorithms (cs.DS); Neural and Evolutionary Computing (cs.NE); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:The expressive power of message-passing graph neural networks (MPNNs) is reasonably well understood, primarily through combinatorial techniques from graph isomorphism testing. However, MPNNs’ generalization abilities – making meaningful predictions beyond the training set – remain less explored. Current generalization analyses often overlook graph structure, limit the focus to specific aggregation functions, and assume the impractical, hard-to-optimize 0-1 loss function. Here, we extend recent advances in graph similarity theory to assess the influence of graph structure, aggregation, and loss functions on MPNNs’ generalization abilities. Our empirical study supports our theoretical insights, improving our understanding of MPNNs’ generalization properties.

[LG-38] Streaming Private Continual Counting via Binning

链接: https://arxiv.org/abs/2412.07093
作者: Joel Daniel Andersson,Rasmus Pagh
关键词-EN: textit, differential privacy, continuously release, Factorization mechanisms, continual observation
类目: Machine Learning (cs.LG); Cryptography and Security (cs.CR); Data Structures and Algorithms (cs.DS)
*备注:

点击查看摘要

Abstract:In differential privacy, *continual observation* refers to problems in which we wish to continuously release a function of a dataset that is revealed one element at a time. The challenge is to maintain a good approximation while keeping the combined output over all time steps differentially private. In the special case of *continual counting* we seek to approximate a sum of binary input elements. This problem has received considerable attention lately, in part due to its relevance in implementations of differentially private stochastic gradient descent. *Factorization mechanisms* are the leading approach to continual counting, but the best such mechanisms do not work well in *streaming* settings since they require space proportional to the size of the input. In this paper, we present a simple approach to approximating factorization mechanisms in low space via *binning*, where adjacent matrix entries with similar values are changed to be identical in such a way that a matrix-vector product can be maintained in sublinear space. Our approach has provable sublinear space guarantees for a class of lower triangular matrices whose entries are monotonically decreasing away from the diagonal. We show empirically that even with very low space usage we are able to closely match, and sometimes surpass, the performance of asymptotically optimal factorization mechanisms. Recently, and independently of our work, Dvijotham et al. have also suggested an approach to implementing factorization mechanisms in a streaming setting. Their work differs from ours in several respects: it only addresses factorization into *Toeplitz* matrices, only considers *maximum error*, and uses a different technique based on rational function approximation that seems less versatile than our binning approach.
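A minimal sketch of the binning idea on a monotonically decreasing coefficient sequence (as for a lower-triangular Toeplitz counting matrix): entries within a relative tolerance of a bin representative are snapped to the same value, so a streaming algorithm can keep one running sum per bin instead of the full history. The tolerance rule below is an assumption; the paper's construction and error guarantees are more refined.

```python
import numpy as np

def binned_prefix_weights(coeffs, rel_tol=0.1):
    """Snap a decreasing coefficient sequence onto a few bin representatives."""
    binned, rep = np.empty_like(coeffs), None
    for i, c in enumerate(coeffs):
        if rep is None or c < (1 - rel_tol) * rep:
            rep = c  # value dropped past the tolerance: open a new bin
        binned[i] = rep
    return binned

# Example: 1000 distinct Toeplitz coefficients collapse to a few dozen values;
# identical adjacent entries are what make sublinear-space streaming possible.
coeffs = 1.0 / np.sqrt(np.arange(1, 1001))
binned = binned_prefix_weights(coeffs, rel_tol=0.1)
print(len(np.unique(coeffs)), "->", len(np.unique(binned)))
```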

[LG-39] Enhancing radioisotope identification in gamma spectra with transfer learning

链接: https://arxiv.org/abs/2412.07069
作者: Peter Lalor
关键词-EN: unknown radioactive samples, Machine learning methods, provide accurate, real-time classification, gamma spectroscopy
类目: Machine Learning (cs.LG); Nuclear Theory (nucl-th)
*备注: 11 pages and 4 figures

点击查看摘要

Abstract:Machine learning methods in gamma spectroscopy have the potential to provide accurate, real-time classification of unknown radioactive samples. However, obtaining sufficient experimental training data is often prohibitively expensive and time-consuming, and models trained solely on synthetic data can struggle to generalize to the unpredictable range of real-world operating scenarios. In this work, we pretrain a model using physically derived synthetic data and subsequently leverage transfer learning techniques to fine-tune the model for a specific target domain. This paradigm enables us to embed physical principles during the pretraining step, thus requiring less data from the target domain compared to classical machine learning methods. Results of this analysis indicate that fine-tuned models significantly outperform those trained exclusively on synthetic data or solely on target-domain data, particularly in the intermediate data regime ( \approx 10^4 training samples). This conclusion is consistent across four different machine learning architectures (MLP, CNN, Transformer, and LSTM) considered in this study. This research serves as proof of concept for applying transfer learning techniques to application scenarios where access to experimental data is limited.
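A generic sketch of the pretrain-then-fine-tune recipe the abstract describes, with an illustrative MLP over spectra: pretrain on abundant synthetic data, then freeze the early feature-extraction layers and retrain the head on scarce target-domain samples. The architecture, channel count, and class count are assumptions, not the paper's models.

```python
import torch
import torch.nn as nn

# Illustrative classifier over 1024-channel spectra with 8 hypothetical classes.
model = nn.Sequential(
    nn.Linear(1024, 256), nn.ReLU(),  # feature extractor
    nn.Linear(256, 64), nn.ReLU(),
    nn.Linear(64, 8),                 # classification head
)

# 1) Pretrain on physically derived synthetic spectra (training loop elided).
# 2) Fine-tune on scarce target-domain data: freeze early layers, retrain the rest.
for layer in list(model.children())[:2]:
    for p in layer.parameters():
        p.requires_grad = False

optimizer = torch.optim.Adam(
    (p for p in model.parameters() if p.requires_grad), lr=1e-4
)
loss = nn.CrossEntropyLoss()(model(torch.randn(4, 1024)), torch.randint(0, 8, (4,)))
loss.backward()   # gradients flow only into the unfrozen layers
optimizer.step()
```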

[LG-40] MoE-CAP: Cost-Accuracy-Performance Benchmarking for Mixture-of-Experts Systems

链接: https://arxiv.org/abs/2412.07067
作者: Yao Fu,Yinsicheng Jiang,Yeqi Huang,Ping Nie,Zhan Lu,Leyang Xue,Congjie He,Man-Kit Sit,Jilong Xue,Li Dong,Ziming Miao,Kai Zou,Edoardo Ponti,Luo Mai
关键词-EN: scaling Large Language, Large Language Models, Large Language, scaling Large, Language Models
类目: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC)
*备注:

点击查看摘要

Abstract:The sparse Mixture-of-Experts (MoE) architecture is increasingly favored for scaling Large Language Models (LLMs) efficiently; however, MoE systems rely on heterogeneous compute and memory resources. These factors collectively influence the system’s Cost, Accuracy, and Performance (CAP), creating a challenging trade-off. Current benchmarks often fail to provide precise estimates of these effects, complicating practical considerations for deploying MoE systems. To bridge this gap, we introduce MoE-CAP, a benchmark specifically designed to evaluate MoE systems. Our findings highlight the difficulty of achieving an optimal balance of cost, accuracy, and performance with existing hardware capabilities. MoE systems often necessitate compromises on one factor to optimize the other two, a dynamic we term the MoE-CAP trade-off. To identify the best trade-off, we propose novel performance evaluation metrics - Sparse Memory Bandwidth Utilization (S-MBU) and Sparse Model FLOPS Utilization (S-MFU) - and develop cost models that account for the heterogeneous compute and memory hardware integral to MoE systems. This benchmark is publicly available on HuggingFace: this https URL.
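The paper defines S-MBU and S-MFU precisely; as a back-of-envelope illustration of why sparsity-aware utilization metrics are needed, the toy numbers below (all made up) show that a dense bandwidth-utilization estimate for a top-2-of-8 MoE can exceed 100%, while counting only the activated experts yields a meaningful figure.

```python
# Illustrative arithmetic only; the assumption is the natural one of counting
# just the parameters of experts actually activated per token.
total_params = 8 * 7e9    # 8 experts, 7B parameters each (made-up)
active_params = 2 * 7e9   # top-2 routing activates 2 experts per token
bytes_per_param = 2       # fp16
tokens_per_s = 50         # measured decode throughput (made-up)
peak_bw = 2.0e12          # 2 TB/s HBM peak bandwidth (made-up)

s_mbu = active_params * bytes_per_param * tokens_per_s / peak_bw
dense_mbu = total_params * bytes_per_param * tokens_per_s / peak_bw
print(f"sparse-aware MBU: {s_mbu:.1%}  vs naive dense MBU: {dense_mbu:.1%}")
# -> 70.0% vs an impossible 280.0%, exposing the naive estimate as meaningless.
```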

[LG-41] Optimizing Personalized Federated Learning through Adaptive Layer-Wise Learning

链接: https://arxiv.org/abs/2412.07062
作者: Weihang Chen,Jie Ren,Zhiqiang Li,Ling Gao,Zheng Wang
关键词-EN: Real-life deployment, faces non-IID data, local model, local, slow convergence
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Real-life deployment of Federated Learning (FL) often faces non-IID data, which leads to poor accuracy and slow convergence. Personalized FL (pFL) tackles these issues by tailoring local models to individual data sources and using weighted aggregation methods for client-specific learning. However, existing pFL methods often fail to provide each local model with global knowledge on demand while maintaining low computational overhead. Additionally, local models tend to over-personalize their data during the training process, potentially dropping previously acquired global information. We propose FLAYER, a novel layer-wise learning method for pFL that optimizes local model personalization performance. FLAYER considers the different roles and learning abilities of neural network layers of individual local models. It incorporates global information for each local model as needed to initialize the local model cost-effectively. It then dynamically adjusts learning rates for each layer during local training, optimizing the personalized learning process for each local model while preserving global knowledge. Additionally, to enhance global representation in pFL, FLAYER selectively uploads parameters for global aggregation in a layer-wise manner. We evaluate FLAYER on four representative datasets in computer vision and natural language processing domains. Compared to six state-of-the-art pFL methods, FLAYER improves the inference accuracy, on average, by 5.42% (up to 14.29%).
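A minimal sketch of layer-wise learning rates via optimizer parameter groups, the basic mechanism such layer-wise personalization builds on. The depth-dependent schedule below is a fixed illustrative choice; FLAYER's actual rates are adjusted dynamically during local training.

```python
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(32, 64), nn.ReLU(),
    nn.Linear(64, 64), nn.ReLU(),
    nn.Linear(64, 10),
)

# One optimizer parameter group per linear layer, with an assumed schedule
# where deeper (more task-specific) layers get larger learning rates.
layers = [m for m in model if isinstance(m, nn.Linear)]
param_groups = [
    {"params": layer.parameters(), "lr": 1e-3 * (2.0 ** depth)}
    for depth, layer in enumerate(layers)
]
optimizer = torch.optim.SGD(param_groups, lr=1e-3)
print([g["lr"] for g in optimizer.param_groups])  # [0.001, 0.002, 0.004]
```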

[LG-42] Advancing clinical trial outcomes using deep learning and predictive modelling: bridging precision medicine and patient-centered care

链接: https://arxiv.org/abs/2412.07050
作者: Sydney Anuyah,Mallika K Singh,Hope Nyavor
关键词-EN: artificial intelligence, integration of artificial, revolutionized the process, process of drug, drug development
类目: Machine Learning (cs.LG)
*备注: 22 pages excluding references, 11 figures, 6 tables

点击查看摘要

Abstract:The integration of artificial intelligence [AI] into clinical trials has revolutionized the process of drug development and personalized medicine. Among these advancements, deep learning and predictive modelling have emerged as transformative tools for optimizing clinical trial design, patient recruitment, and real-time monitoring. This study explores the application of deep learning techniques, such as convolutional neural networks [CNNs] and transformer-based models, to stratify patients, forecast adverse events, and personalize treatment plans. Furthermore, predictive modelling approaches, including survival analysis and time-series forecasting, are employed to predict trial outcomes, enhancing efficiency and reducing trial failure rates. To address challenges in analysing unstructured clinical data, such as patient notes and trial protocols, natural language processing [NLP] techniques are utilized for extracting actionable insights. A custom dataset comprising structured patient demographics, genomic data, and unstructured text is curated for training and validating these models. Key metrics, including precision, recall, and F1 scores, are used to evaluate model performance, while trade-offs between accuracy and computational efficiency are examined to identify the optimal model for clinical deployment. This research underscores the potential of AI-driven methods to streamline clinical trial workflows, improve patient-centric outcomes, and reduce costs associated with trial inefficiencies. The findings provide a robust framework for integrating predictive analytics into precision medicine, paving the way for more adaptive and efficient clinical trials. By bridging the gap between technological innovation and real-world applications, this study contributes to advancing the role of AI in healthcare, particularly in fostering personalized care and improving overall trial success rates.

[LG-43] Data Augmentation with Variational Autoencoder for Imbalanced Dataset

链接: https://arxiv.org/abs/2412.07039
作者: Samuel Stocksieker,Denys Pommeret,Arthur Charpentier
关键词-EN: imbalanced distribution presents, standard algorithms, presents a major, generally leads, performance of standard
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Learning from an imbalanced distribution presents a major challenge in predictive modeling, as it generally leads to a reduction in the performance of standard algorithms. Various approaches exist to address this issue, but many of them concern classification problems, with a limited focus on regression. In this paper, we introduce a novel method aimed at enhancing learning on tabular data in the Imbalanced Regression (IR) framework, which remains a significant problem. We propose to use variational autoencoders (VAE), which are known to be a powerful tool for synthetic data generation, offering an interesting approach to modeling and capturing latent representations of complex distributions. However, VAEs can be inefficient when dealing with IR. Therefore, we develop a novel approach for generating data, combining VAE with a smoothed bootstrap, specifically designed to address the challenges of IR. We numerically investigate the scope of this method by comparing it against its competitors on simulations and on datasets known for IR.
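For reference, here is a minimal sketch of the classical smoothed bootstrap that the method combines with a VAE: resample observed pairs and jitter them with kernel noise. Biasing draws toward rare target values, and the VAE component itself, are omitted; the bandwidth and shapes are illustrative.

```python
import numpy as np

def smoothed_bootstrap(X, y, n_new, bandwidth=0.1, rng=None):
    """Draw bootstrap samples and jitter them with Gaussian kernel noise."""
    rng = rng or np.random.default_rng(0)
    idx = rng.integers(0, len(X), size=n_new)
    X_new = X[idx] + bandwidth * X.std(axis=0) * rng.standard_normal((n_new, X.shape[1]))
    y_new = y[idx] + bandwidth * y.std() * rng.standard_normal(n_new)
    return X_new, y_new

X = np.random.default_rng(1).normal(size=(200, 5))
y = np.random.default_rng(2).normal(size=200)
X_aug, y_aug = smoothed_bootstrap(X, y, n_new=100)
print(X_aug.shape, y_aug.shape)  # (100, 5) (100,)
```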

[LG-44] Deep Learning for Cross-Border Transaction Anomaly Detection in Anti-Money Laundering Systems

链接: https://arxiv.org/abs/2412.07027
作者: Qian Yu,Zhen Xu,Zong Ke
关键词-EN: AML systems, cross-border AML systems, digital economy, AML, adaptive AML systems
类目: Machine Learning (cs.LG); Computers and Society (cs.CY); Social and Information Networks (cs.SI)
*备注: The paper has been accepted by the 2024 6th International Conference on Machine Learning, Big Data and Business Intelligence MLBDBI2024

点击查看摘要

Abstract:In the context of globalization and the rapid expansion of the digital economy, anti-money laundering (AML) has become a crucial aspect of financial oversight, particularly in cross-border transactions. The rising complexity and scale of international financial flows necessitate more intelligent and adaptive AML systems to combat increasingly sophisticated money laundering techniques. This paper explores the application of unsupervised learning models in cross-border AML systems, focusing on rule optimization through contrastive learning techniques. Five deep learning models, ranging from basic convolutional neural networks (CNNs) to hybrid CNN-GRU architectures, were designed and tested to assess their performance in detecting abnormal transactions. The results demonstrate that as model complexity increases, so does the system’s detection accuracy and responsiveness. In particular, the self-developed hybrid Convolutional-Recurrent Neural Integration Model (CRNIM) showed superior performance in terms of accuracy and area under the receiver operating characteristic curve (AUROC). These findings highlight the potential of unsupervised learning models to significantly improve the intelligence, flexibility, and real-time capabilities of AML systems. By optimizing detection rules and enhancing adaptability to emerging money laundering schemes, this research provides both theoretical and practical contributions to the advancement of AML technologies, which are essential for safeguarding the global financial system against illicit activities.

[LG-45] GenAI4UQ: A Software for Inverse Uncertainty Quantification Using Conditional Generative Models

链接: https://arxiv.org/abs/2412.07026
作者: Ming Fan,Zezhong Zhang,Dan Lu,Guannan Zhang
关键词-EN: Markov Chain Monte, Chain Monte Carlo, Monte Carlo methods, Markov Chain, Chain Monte
类目: Machine Learning (cs.LG); Geophysics (physics.geo-ph)
*备注:

点击查看摘要

Abstract:We introduce GenAI4UQ, a software package for inverse uncertainty quantification in model calibration, parameter estimation, and ensemble forecasting in scientific applications. GenAI4UQ leverages a generative artificial intelligence (AI) based conditional modeling framework to address the limitations of traditional inverse modeling techniques, such as Markov Chain Monte Carlo methods. By replacing computationally intensive iterative processes with a direct, learned mapping, GenAI4UQ enables efficient calibration of model input parameters and generation of output predictions directly from observations. The software’s design allows for rapid ensemble forecasting with robust uncertainty quantification, while maintaining high computational and storage efficiency. GenAI4UQ simplifies the model training process through built-in auto-tuning of hyperparameters, making it accessible to users with varying levels of expertise. Its conditional generative framework ensures versatility, enabling applicability across a wide range of scientific domains. At its core, GenAI4UQ transforms the paradigm of inverse modeling by providing a fast, reliable, and user-friendly solution. It empowers researchers and practitioners to quickly estimate parameter distributions and generate model predictions for new observations, facilitating efficient decision-making and advancing the state of uncertainty quantification in computational modeling. (The code and data are available at this https URL).

[LG-46] TAE: A Model-Constrained Tikhonov Autoencoder Approach for Forward and Inverse Problems

链接: https://arxiv.org/abs/2412.07010
作者: Hai V. Nguyen,Tan Bui-Thanh
关键词-EN: Efficient real-time solvers, Efficient real-time, science applications, essential in engineering, engineering and science
类目: Machine Learning (cs.LG); Computational Physics (physics.comp-ph)
*备注:

点击查看摘要

Abstract:Efficient real-time solvers for forward and inverse problems are essential in engineering and science applications. Machine learning surrogate models have emerged as promising alternatives to traditional methods, offering substantially reduced computational time. Nevertheless, these models typically demand extensive training datasets to achieve robust generalization across diverse scenarios. While physics-based approaches can partially mitigate this data dependency and ensure physics-interpretable solutions, addressing scarce data regimes remains a challenge. Both purely data-driven and physics-based machine learning approaches demonstrate severe overfitting issues when trained with insufficient data. We propose a novel Tikhonov autoencoder model-constrained framework, called TAE, capable of learning both forward and inverse surrogate models using a single arbitrary observation sample. We develop comprehensive theoretical foundations including forward and inverse inference error bounds for the proposed approach for linear cases. For comparative analysis, we derive equivalent formulations for pure data-driven and model-constrained approach counterparts. At the heart of our approach is a data randomization strategy, which functions as a generative mechanism for exploring the training data space, enabling effective training of both forward and inverse surrogate models from a single observation, while regularizing the learning process. We validate our approach through extensive numerical experiments on two challenging inverse problems: 2D heat conductivity inversion and initial condition reconstruction for time-dependent 2D Navier-Stokes equations. Results demonstrate that TAE achieves accuracy comparable to traditional Tikhonov solvers and numerical forward solvers for both inverse and forward problems, respectively, while delivering orders of magnitude computational speedups.

[LG-47] In-Application Defense Against Evasive Web Scans through Behavioral Analysis

链接: https://arxiv.org/abs/2412.07005
作者: Behzad Ousat,Mahshad Shariatnasab,Esteban Schafir,Farhad Shirani Chaharsooghi,Amin Kharraz
关键词-EN: benign web crawlers, command injection, ranging from benign, credential stuffing, traffic has evolved
类目: Machine Learning (cs.LG); Cryptography and Security (cs.CR); Information Theory (cs.IT)
*备注:

点击查看摘要

Abstract:Web traffic has evolved to include both human users and automated agents, ranging from benign web crawlers to adversarial scanners such as those capable of credential stuffing, command injection, and account hijacking at the web scale. The financial costs of these adversarial activities were estimated to exceed tens of billions of dollars in 2023. In this work, we introduce WebGuard, a low-overhead in-application forensics engine, to enable robust identification and monitoring of automated web scanners, and help mitigate the associated security risks. WebGuard focuses on the following design criteria: (i) integration into web applications without any changes to the underlying software components or infrastructure, (ii) minimal communication overhead, (iii) capability for real-time detection, e.g., within hundreds of milliseconds, and (iv) attribution capability to identify new behavioral patterns and detect emerging agent categories. To this end, we have equipped WebGuard with multi-modal behavioral monitoring mechanisms, such as monitoring spatio-temporal data and browser events. We also design supervised and unsupervised learning architectures for real-time detection and offline attribution of human and automated agents, respectively. Information-theoretic analysis and empirical evaluations are provided to show that multi-modal data analysis, as opposed to uni-modal analysis which relies solely on mouse movement dynamics, significantly improves time-to-detection and attribution accuracy. Various numerical evaluations using real-world data collected via WebGuard are provided, achieving high accuracy in hundreds of milliseconds, with a communication overhead below 10 KB per second.

[LG-48] Understanding Gradient Descent through the Training Jacobian

链接: https://arxiv.org/abs/2412.07003
作者: Nora Belrose,Adam Scherlis
关键词-EN: trained network parameters, neural network training, examine the geometry, geometry of neural, parameters with respect
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We examine the geometry of neural network training using the Jacobian of trained network parameters with respect to their initial values. Our analysis reveals low-dimensional structure in the training process which is dependent on the input data but largely independent of the labels. We find that the singular value spectrum of the Jacobian matrix consists of three distinctive regions: a “chaotic” region of values orders of magnitude greater than one, a large “bulk” region of values extremely close to one, and a “stable” region of values less than one. Along each bulk direction, the left and right singular vectors are nearly identical, indicating that perturbations to the initialization are carried through training almost unchanged. These perturbations have virtually no effect on the network’s output in-distribution, yet do have an effect far out-of-distribution. While the Jacobian applies only locally around a single initialization, we find substantial overlap in bulk subspaces for different random seeds.
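The object analyzed here is the Jacobian of final parameters with respect to initial parameters. For a toy differentiable training map (a few gradient-descent steps on linear regression) this Jacobian and its singular value spectrum can be computed exactly, as in the minimal sketch below; the paper's networks, and the chaotic/bulk/stable structure it reports, are far beyond this convex example.

```python
import torch

def train(theta0, X, y, lr=0.1, steps=50):
    """Differentiable toy training map: gradient descent on linear regression."""
    theta = theta0
    for _ in range(steps):
        grad = 2 * X.T @ (X @ theta - y) / len(X)
        theta = theta - lr * grad
    return theta

torch.manual_seed(0)
X, y = torch.randn(32, 5), torch.randn(32)
theta0 = torch.randn(5)

# Jacobian of trained parameters w.r.t. their initial values, and its spectrum.
J = torch.autograd.functional.jacobian(lambda t: train(t, X, y), theta0)
print(torch.linalg.svdvals(J))  # all < 1 here: a purely "stable" toy regime
```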

[LG-49] he Oracle Complexity of Simplex-based Matrix Games: Linear Separability and Nash Equilibria

链接: https://arxiv.org/abs/2412.06990
作者: Guy Kornowski,Ohad Shamir
关键词-EN: mathbf, Delta, solving matrix games, probability simplex, Omega
类目: Computer Science and Game Theory (cs.GT); Machine Learning (cs.LG); Optimization and Control (math.OC)
*备注: 25 pages

点击查看摘要

Abstract:We study the problem of solving matrix games of the form \max_{\mathbf{w}\in\mathcal{W}} \min_{\mathbf{p}\in\Delta} \mathbf{p}^{\top} A \mathbf{w}, where A is some matrix and \Delta is the probability simplex. This problem encapsulates canonical tasks such as finding a linear separator and computing Nash equilibria in zero-sum games. However, perhaps surprisingly, its inherent complexity (as formalized in the standard framework of oracle complexity [Nemirovski and Yudin, 1983]) is not well-understood. In this work, we first identify different oracle models which are implicitly used by prior algorithms, amounting to multiplying the matrix A by a vector from either one or both sides. We then prove complexity lower bounds for algorithms under both access models, which in particular imply a separation between them. Specifically, we start by proving that algorithms for linear separability based on one-sided multiplications must require \Omega(\gamma_A^{-2}) iterations, where \gamma_A is the margin, as matched by the Perceptron algorithm. We then prove that accelerated algorithms for this task, which utilize multiplications from both sides, must require \tilde{\Omega}(\gamma_A^{-2/3}) iterations, establishing the first oracle complexity barrier for such algorithms. Finally, by adapting our lower bound to \ell_1 geometry, we prove that computing an \epsilon-approximate Nash equilibrium requires \tilde{\Omega}(\epsilon^{-2/5}) iterations, which is an exponential improvement over the previously best-known lower bound due to Hadiji et al. [2024].

[LG-50] Machine Unlearning Doesnt Do What You Think: Lessons for Generative AI Policy Research and Practice ICML

链接: https://arxiv.org/abs/2412.06966
作者: A. Feder Cooper,Christopher A. Choquette-Choo,Miranda Bogen,Matthew Jagielski,Katja Filippova,Ken Ziyu Liu,Alexandra Chouldechova,Jamie Hayes,Yangsibo Huang,Niloofar Mireshghallah,Ilia Shumailov,Eleni Triantafillou,Peter Kairouz,Nicole Mitchell,Percy Liang,Daniel E. Ho,Yejin Choi,Sanmi Koyejo,Fernando Delgado,James Grimmelmann,Vitaly Shmatikov,Christopher De Sa,Solon Barocas,Amy Cyphert,Mark Lemley,danah boyd,Jennifer Wortman Vaughan,Miles Brundage,David Bau,Seth Neel,Abigail Z. Jacobs,Andreas Terzis,Hanna Wallach,Nicolas Papernot,Katherine Lee
关键词-EN: articulate fundamental mismatches, articulate fundamental, fundamental mismatches, documented aspirations, model
类目: Machine Learning (cs.LG); Computers and Society (cs.CY)
*备注: Presented at the 2nd Workshop on Generative AI and Law at ICML (July 2024)

点击查看摘要

Abstract:We articulate fundamental mismatches between technical methods for machine unlearning in Generative AI, and documented aspirations for broader impact that these methods could have for law and policy. These aspirations are both numerous and varied, motivated by issues that pertain to privacy, copyright, safety, and more. For example, unlearning is often invoked as a solution for removing the effects of targeted information from a generative-AI model’s parameters, e.g., a particular individual’s personal data or in-copyright expression of Spiderman that was included in the model’s training data. Unlearning is also proposed as a way to prevent a model from generating targeted types of information in its outputs, e.g., generations that closely resemble a particular individual’s data or reflect the concept of “Spiderman.” Both of these goals–the targeted removal of information from a model and the targeted suppression of information from a model’s outputs–present various technical and substantive challenges. We provide a framework for thinking rigorously about these challenges, which enables us to be clear about why unlearning is not a general-purpose solution for circumscribing generative-AI model behavior in service of broader positive impact. We aim for conceptual clarity and to encourage more thoughtful communication among machine learning (ML), law, and policy experts who seek to develop and apply technical methods for compliance with policy objectives.

[LG-51] Digital Twin-Empowered Voltage Control for Power Systems

链接: https://arxiv.org/abs/2412.06940
作者: Jiachen Xu,Yushuai Li,Torben Bach Pedersen,Yuqiang He,Kim Guldstrand Larsen,Tianyi Li
关键词-EN: Emerging digital twin, digital twin technology, Emerging digital, digital twin, sampling efficiency
类目: Machine Learning (cs.LG); Signal Processing (eess.SP)
*备注: 6 pages, 1 figure, conference paper

点击查看摘要

Abstract:Emerging digital twin technology has the potential to revolutionize voltage control in power systems. However, the state-of-the-art digital twin method suffers from low computational and sampling efficiency, which hinders its applications. To address this issue, we propose a Gumbel-Consistency Digital Twin (GC-DT) method that enhances voltage control with improved computational and sampling efficiency. First, the proposed method incorporates a Gumbel-based strategy improvement that leverages the Gumbel-top trick to enhance non-repetitive sampling actions and reduce the reliance on Monte Carlo Tree Search simulations, thereby improving computational efficiency. Second, a consistency loss function aligns predicted hidden states with actual hidden states in the latent space, which increases both prediction accuracy and sampling efficiency. Experiments on IEEE 123-bus, 34-bus, and 13-bus systems demonstrate that the proposed GC-DT outperforms the state-of-the-art DT method in both computational and sampling efficiency.
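The Gumbel-top trick the abstract leverages (usually called the Gumbel-top-k trick) samples k distinct items without replacement by perturbing logits with Gumbel noise and taking the k largest, which is what enables non-repetitive action sampling. A minimal standalone sketch, separate from the full GC-DT method:

```python
import numpy as np

def gumbel_top_k(logits, k, rng=None):
    """Sample k distinct indices without replacement via the Gumbel-top-k trick."""
    rng = rng or np.random.default_rng(0)
    g = rng.gumbel(size=len(logits))          # i.i.d. Gumbel(0, 1) noise
    return np.argsort(logits + g)[::-1][:k]   # indices of the k largest perturbed logits

# Two distinct actions per call, drawn in proportion to the underlying probabilities.
print(gumbel_top_k(np.log([0.5, 0.3, 0.15, 0.05]), k=2))
```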

[LG-52] Efficient user history modeling with amortized inference for deep learning recommendation models WWW2025

链接: https://arxiv.org/abs/2412.06924
作者: Lars Hertel,Neil Daftary,Fedor Borisyuk,Aman Gupta,Rahul Mazumder
关键词-EN: deep learning recommendation, Transformer encoders, user history modeling, study user history, learning recommendation models
类目: Machine Learning (cs.LG); Information Retrieval (cs.IR)
*备注: 5 pages, 3 figures, WWW 2025

点击查看摘要

Abstract:We study user history modeling via Transformer encoders in deep learning recommendation models (DLRM). Such architectures can significantly improve recommendation quality, but usually incur high latency cost, necessitating infrastructure upgrades or very small Transformer models. An important part of user history modeling is early fusion of the candidate item, and various methods have been studied. We revisit early fusion and compare concatenation of the candidate to each history item against appending it to the end of the list as a separate item. The latter method allows us to reformulate the recently proposed amortized history inference algorithm M-FALCON (Zhai et al., 2024) for the case of DLRM models. We show via experimental results that appending with cross-attention performs on par with concatenation and that amortization significantly reduces inference costs. We conclude with results from deploying this model on the LinkedIn Feed and Ads surfaces, where amortization reduces latency by 30% compared to non-amortized inference.
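A minimal sketch of the "append" early-fusion variant with amortization in the spirit of M-FALCON: all candidates are appended after the shared user history in one sequence, and an attention mask lets each candidate attend to the history (and itself) but not to other candidates, so the history is processed in a single forward pass. The dimensions, masking details, and readout are illustrative assumptions.

```python
import torch
import torch.nn as nn

d_model, H, C = 64, 20, 8  # hidden size, history length, candidate count
layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
encoder = nn.TransformerEncoder(layer, num_layers=2)
score_head = nn.Linear(d_model, 1)

history = torch.randn(1, H, d_model)      # one user's history embeddings
candidates = torch.randn(1, C, d_model)   # C candidate item embeddings
seq = torch.cat([history, candidates], dim=1)  # (1, H + C, d_model)

# Boolean attention mask: True = attention blocked.
mask = torch.zeros(H + C, H + C, dtype=torch.bool)
mask[:H, H:] = True                       # history tokens never see candidates
mask[H:, H:] = True                       # candidates never see each other...
idx = torch.arange(H, H + C)
mask[idx, idx] = False                    # ...except themselves

# One encoder pass scores all C candidates against the shared history.
scores = score_head(encoder(seq, mask=mask)[:, H:])
print(scores.shape)  # torch.Size([1, 8, 1])
```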

[LG-53] Stock Type Prediction Model Based on Hierarchical Graph Neural Network

链接: https://arxiv.org/abs/2412.06862
作者: Jianhua Yao,Yuxin Dong,Jiajing Wang,Bingxing Wang,Hongye Zheng,Honglin Qin
关键词-EN: Graph Neural Network, Neural Network, Hierarchical Graph Neural, captures multi-level information, stock data analysis
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:This paper introduces a novel approach to stock data analysis by employing a Hierarchical Graph Neural Network (HGNN) model that captures multi-level information and relational structures in the stock market. The HGNN model integrates stock relationship data and hierarchical attributes to predict stock types effectively. The paper discusses the construction of a stock industry relationship graph and the extraction of temporal information from historical price sequences. It also highlights the design of a graph convolution operation and a temporal attention aggregator to model the macro market state. The integration of these features results in a comprehensive stock prediction model that addresses the challenges of utilizing stock relationship data and modeling hierarchical attributes in the stock market.

[LG-54] Comb Tensor Networks vs. Matrix Product States: Enhanced Efficiency in High-Dimensional Spaces

链接: https://arxiv.org/abs/2412.06857
作者: Danylo Kolesnyk,Yelyzaveta Vodovozova
关键词-EN: incorporate compression layers, Matrix Product States, Modern approaches, networks incorporate compression, traditional Matrix Product
类目: Machine Learning (cs.LG); Quantum Physics (quant-ph)
*备注:

点击查看摘要

Abstract:Modern approaches to generative modeling of continuous data using tensor networks incorporate compression layers to capture the most meaningful features of high-dimensional inputs. These methods, however, rely on traditional Matrix Product States (MPS) architectures. Here, we demonstrate that beyond a certain threshold in data and bond dimensions, a comb-shaped tensor network architecture can yield more efficient contractions than a standard MPS. This finding suggests that for continuous and high-dimensional data distributions, transitioning from MPS to a comb tensor network representation can substantially reduce computational overhead while maintaining accuracy.

[LG-55] Partition of Unity Physics-Informed Neural Networks (POU-PINNs): An Unsupervised Framework for Physics-Informed Domain Decomposition and Mixtures of Experts

链接: https://arxiv.org/abs/2412.06842
作者: Arturo Rodriguez,Ashesh Chattopadhyay,Piyush Kumar,Luis F. Rodriguez,Vinod Kumar
关键词-EN: Physics-informed neural networks, commonly address ill-posed, address ill-posed inverse, ill-posed inverse problems, Physics-informed neural
类目: Machine Learning (cs.LG)
*备注: 26 pages

点击查看摘要

Abstract:Physics-informed neural networks (PINNs) commonly address ill-posed inverse problems by uncovering unknown physics. This study presents a novel unsupervised learning framework that identifies spatial subdomains with specific governing physics. It uses the partition of unity networks (POUs) to divide the space into subdomains, assigning unique nonlinear model parameters to each, which are integrated into the physics model. A vital feature of this method is a physics residual-based loss function that detects variations in physical properties without requiring labeled data. This approach enables the discovery of spatial decompositions and nonlinear parameters in partial differential equations (PDEs), optimizing the solution space by dividing it into subdomains and improving accuracy. Its effectiveness is demonstrated through applications in porous media thermal ablation and ice-sheet modeling, showcasing its potential for tackling real-world physics challenges.

[LG-56] mely reliable Bayesian decision-making enabled using memristors

链接: https://arxiv.org/abs/2412.06838
作者: Lekai Song,Pengyu Liu,Yang Liu,Jingfang Pei,Wenyu Cui,Songwei Liu,Yingyi Wen,Teng Ma,Kong-Pang Pun,Guohua Hu
关键词-EN: Brains perform timely, Bayes theorem, Brains perform, perform timely reliable, Bayes
类目: Machine Learning (cs.LG); Hardware Architecture (cs.AR)
*备注:

点击查看摘要

Abstract:Brains perform timely, reliable decision-making by Bayes' theorem. Bayes' theorem quantifies events as probabilities and, through probability rules, renders the decisions. Learning from this, applying Bayes' theorem to practical problems can visualize the potential risks and decision confidence, thereby enabling efficient user-scene interactions. However, given the probabilistic nature, implementing Bayes' theorem with conventional deterministic computing inevitably induces excessive computational cost and decision latency. Herein, we propose a probabilistic computing approach using memristors to implement Bayes' theorem. We integrate volatile memristors with Boolean logics and, by exploiting the volatile stochastic switching of the memristors, realize Boolean operations with statistical probabilities and correlations, key for enabling Bayes' theorem. To practically demonstrate the effectiveness of our memristor-enabled Bayes' theorem approach in user-scene interactions, we design lightweight Bayesian inference and fusion operators using our probabilistic logics and apply the operators in road scene parsing for self-driving, including route planning and obstacle detection. The results show that our operators can achieve reliable decisions at a rate over 2,500 frames per second, outperforming human decision-making and the existing driving assistance systems.
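As a purely digital illustration of the Bayesian fusion that the memristor hardware realizes physically, the toy below fuses two independent, noisy obstacle detectors with Bayes' theorem; all sensor reliabilities are made-up numbers.

```python
import numpy as np

prior = 0.05                              # P(obstacle) on an open road
p_pos_given_obs = np.array([0.9, 0.8])    # per-sensor true-positive rates
p_pos_given_clear = np.array([0.1, 0.2])  # per-sensor false-positive rates

def posterior(readings):
    """P(obstacle | independent sensor readings), readings[i] in {0, 1}."""
    like_obs = np.prod(np.where(readings, p_pos_given_obs, 1 - p_pos_given_obs))
    like_clear = np.prod(np.where(readings, p_pos_given_clear, 1 - p_pos_given_clear))
    return prior * like_obs / (prior * like_obs + (1 - prior) * like_clear)

print(posterior(np.array([1, 1])))  # both sensors fire: ~0.65, high confidence
print(posterior(np.array([1, 0])))  # sensors disagree: ~0.11, low confidence
```

The posterior directly exposes the decision confidence the abstract mentions, which is the quantity the memristor-based operators compute at thousands of frames per second.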

[LG-57] Stably unactivated neurons in ReLU neural networks

链接: https://arxiv.org/abs/2412.06829
作者: Natalie Brownlowe,Christopher R. Cornwell,Ethan Montes,Gabriel Quijano,Na Zhang
关键词-EN: neural network influences, hidden layer, chosen architecture, received much attention, neural network
类目: Machine Learning (cs.LG); Probability (math.PR); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:The choice of architecture of a neural network influences which functions will be realizable by that neural network and, as a result, studying the expressiveness of a chosen architecture has received much attention. In ReLU neural networks, the presence of stably unactivated neurons can reduce the network’s expressiveness. In this work, we investigate the probability of a neuron in the second hidden layer of such neural networks being stably unactivated when the weights and biases are initialized from symmetric probability distributions. For networks with input dimension n_0, we prove that if the first hidden layer has n_0+1 neurons then this probability is exactly \frac{2^{n_0+1}}{4^{n_0+1}}, and if the first hidden layer has n_1 neurons, n_1 \le n_0, then the probability is \frac{1}{2^{n_1+1}}. Finally, for the case when the first hidden layer has more neurons than n_0+1, a conjecture is proposed along with the rationale. Computational evidence is presented to support the conjecture.
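The claimed probabilities can be probed numerically. The sketch below samples random weights and biases and checks, on a finite cloud of probe inputs, whether a fixed second-layer neuron ever activates; since stability is only tested on finitely many inputs, this slightly overestimates the true probability, so it is a sanity check rather than the paper's computation. For n_0 = 2 and n_1 = 3, the theory above predicts 2^{n_0+1}/4^{n_0+1} = 1/8.

```python
import numpy as np

def estimate_stably_unactivated(n0, n1, trials=2000, probe=4000, rng=None):
    """Crude Monte Carlo probe: does a second-layer neuron never activate?"""
    rng = rng or np.random.default_rng(0)
    hits = 0
    for _ in range(trials):
        W, b = rng.standard_normal((n1, n0)), rng.standard_normal(n1)  # layer 1
        v, c = rng.standard_normal(n1), rng.standard_normal()          # one layer-2 neuron
        X = 10 * rng.standard_normal((probe, n0))          # wide input cloud
        H = np.maximum(W @ X.T + b[:, None], 0.0)          # first-layer ReLU outputs
        hits += np.all(v @ H + c <= 0)                     # never activated on any probe
    return hits / trials

# Theory for n1 = n0 + 1 predicts 2^-(n0+1) = 0.125 at n0 = 2.
print(estimate_stably_unactivated(n0=2, n1=3))
```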

[LG-58] A Physics-Constrained Neural Differential Equation Framework for Data-Driven Snowpack Simulation

链接: https://arxiv.org/abs/2412.06819
作者: Andrew Charbonneau,Katherine Deck,Tapio Schneider
关键词-EN: physics-constrained neural differential, neural differential equation, differential equation framework, Nash Sutcliffe Efficiencies, seasonal snow depth
类目: Machine Learning (cs.LG); Atmospheric and Oceanic Physics (physics.ao-ph)
*备注: This Work has been submitted to Artificial Intelligence for Earth Sciences. Copyright in this Work may be transferred without further notice

点击查看摘要

Abstract:This paper presents a physics-constrained neural differential equation framework for parameterization, and employs it to model the time evolution of seasonal snow depth given hydrometeorological forcings. When trained on data from multiple SNOTEL sites, the parameterization predicts daily snow depth with under 9% median error and Nash Sutcliffe Efficiencies over 0.94 across a wide variety of snow climates. The parameterization also generalizes to new sites not seen during training, which is not often true for calibrated snow models. Requiring the parameterization to predict snow water equivalent in addition to snow depth only increases error to ~12%. The structure of the approach guarantees the satisfaction of physical constraints, enables these constraints during model training, and allows modeling at different temporal resolutions without additional retraining of the parameterization. These benefits hold potential in climate modeling, and could extend to other dynamical systems with physical constraints.
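
The snippet below sketches the general pattern (not the paper's parameterization): the snowpack tendency is a small neural network of the forcings, with structure that enforces physical constraints: accumulation only when cold and precipitating, melt only above freezing, and nonnegative depth. The network, units, and thresholds are assumptions, and fitting rate_net to observations is omitted.

```python
import torch

rate_net = torch.nn.Sequential(torch.nn.Linear(3, 16), torch.nn.Tanh(),
                               torch.nn.Linear(16, 2), torch.nn.Softplus())

def step(depth, temp_c, precip, dt=1.0):
    feats = torch.stack([temp_c, precip, depth], dim=-1)
    accum_gain, melt_rate = rate_net(feats).unbind(-1)   # both >= 0 via Softplus
    accum = accum_gain * precip * (temp_c <= 0)          # snowfall only when cold and wet
    melt = melt_rate * torch.clamp(temp_c, min=0.0)      # melt only above 0 C
    return torch.clamp(depth + dt * (accum - melt), min=0.0)  # depth stays >= 0

# one hypothetical step: -2 C, 5 mm precipitation, 100 mm existing snowpack
d = step(torch.tensor([100.0]), torch.tensor([-2.0]), torch.tensor([5.0]))
```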

[LG-59] Federated Block-Term Tensor Regression for decentralised data analysis in healthcare

链接: https://arxiv.org/abs/2412.06815
作者: Axel Faes,Ashkan Pirmani,Yves Moreau,Liesbet M. Peeters
关键词-EN: leveraging multilinear relationships, Block-Term Tensor Regression, Tensor Regression, modeling complex, multilinear relationships
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Block-Term Tensor Regression (BTTR) has proven to be a powerful tool for modeling complex, high-dimensional data by leveraging multilinear relationships, making it particularly well-suited for applications in healthcare and neuroscience. However, traditional implementations of BTTR rely on centralized datasets, which pose significant privacy risks and hinder collaboration across institutions. To address these challenges, we introduce Federated Block-Term Tensor Regression (FBTTR), an extension of BTTR designed for federated learning scenarios. FBTTR enables decentralized data analysis, allowing institutions to collaboratively build predictive models while preserving data privacy and complying with regulations. FBTTR represents a major step forward in applying tensor regression to federated learning environments. Its performance is evaluated in two case studies: finger movement decoding from Electrocorticography (ECoG) signals and heart disease prediction. In the first case study, using the BCI Competition IV dataset, FBTTR outperforms non-multilinear models, demonstrating superior accuracy in decoding finger movements. For subject 3 in this dataset, thumb decoding achieved a performance of 0.76 \pm 0.05, compared to 0.71 \pm 0.05 for centralized BTTR. In the second case study, FBTTR is applied to predict heart disease using real-world clinical datasets, outperforming both standard federated learning approaches and centralized BTTR models. On the Fed-Heart-Disease dataset, FBTTR obtained an AUC-ROC of 0.872 \pm 0.02 and an accuracy of 0.772 \pm 0.02, compared to 0.812 \pm 0.003 and 0.753 \pm 0.007 for the centralized model.
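
FBTTR's block-term tensor machinery is not reproduced here; the sketch below shows only the federated pattern it builds on, with plain linear regression standing in for BTTR: each site computes a local update on its private data and only parameters travel to the coordinator. Data and hyperparameters are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)
true_w = np.array([1.0, -2.0, 0.5, 3.0])

def make_site():                           # one institution's private dataset
    X = rng.standard_normal((50, 4))
    return X, X @ true_w + 0.1 * rng.standard_normal(50)

sites = [make_site() for _ in range(3)]
w = np.zeros(4)                            # global model at the coordinator
for rnd in range(20):                      # communication rounds
    updates = []
    for X, y in sites:                     # local training; data never leaves
        w_local = w.copy()
        for _ in range(10):
            w_local -= 0.05 * 2 * X.T @ (X @ w_local - y) / len(y)
        updates.append(w_local)
    w = np.mean(updates, axis=0)           # server averages the site models
print(np.round(w, 2))                      # approaches true_w
```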

[LG-60] Enhancing Prediction Models with Reinforcement Learning RECSYS2024

链接: https://arxiv.org/abs/2412.06791
作者: Karol Radziszewski,Piotr Ociepka
关键词-EN: Axel Springer Polska, Ringier Axel Springer, Springer Polska, Ringier Axel, Axel Springer
类目: Information Retrieval (cs.IR); Machine Learning (cs.LG)
*备注: INRA 24: 12th International Workshop on News Recommendation and Analytics in Conjunction with ACM RecSys 2024

点击查看摘要

Abstract:We present a large-scale news recommendation system implemented at Ringier Axel Springer Polska, focusing on enhancing prediction models with reinforcement learning techniques. The system, named Aureus, integrates a variety of algorithms, including multi-armed bandit methods and deep learning models based on large language models (LLMs). We detail the architecture and implementation of Aureus, emphasizing the significant improvements in online metrics achieved by combining ranking prediction models with reinforcement learning. The paper further explores the impact of mixing different models on key business performance indicators. Our approach effectively balances the need for personalized recommendations with the ability to adapt to rapidly changing news content, addressing common challenges such as the cold start problem and content freshness. The results of online evaluation demonstrate the effectiveness of the proposed system in a real-world production environment.
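
Aureus's internals are not public in this listing, but the bandit component it describes can be illustrated with a minimal Thompson-sampling loop over candidate articles; the click-through rates and Beta priors below are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
true_ctr = [0.05, 0.03, 0.08]            # hidden CTRs of three articles
alpha, beta = np.ones(3), np.ones(3)     # Beta(1, 1) prior per article

for impression in range(10_000):
    theta = rng.beta(alpha, beta)        # sample a plausible CTR per article
    arm = int(np.argmax(theta))          # show the article that looks best now
    click = rng.random() < true_ctr[arm]
    alpha[arm] += click                  # posterior update from the feedback
    beta[arm] += 1 - click

print(alpha / (alpha + beta))            # estimates concentrate on article 2
```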

[LG-61] Bayesian Optimization of Antibodies Informed by a Generative Model of Evolving Sequences

链接: https://arxiv.org/abs/2412.07763
作者: Alan Nawzad Amin,Nate Gruver,Yilun Kuang,Lily Li,Hunter Elliott,Calvin McCarter,Aniruddh Raghu,Peyton Greenside,Andrew Gordon Wilson
关键词-EN: build effective therapeutics, biologists iteratively mutate, effective therapeutics, binding and stability, build effective
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Biomolecules (q-bio.BM)
*备注: Code available at this https URL

点击查看摘要

Abstract:To build effective therapeutics, biologists iteratively mutate antibody sequences to improve binding and stability. Proposed mutations can be informed by previous measurements or by learning from large antibody databases to predict only typical antibodies. Unfortunately, the space of typical antibodies is too vast to search exhaustively, and experiments often fail to find suitable antibodies on a budget. We introduce Clone-informed Bayesian Optimization (CloneBO), a Bayesian optimization procedure that efficiently optimizes antibodies in the lab by teaching a generative model how our immune system optimizes antibodies. Our immune system makes antibodies by iteratively evolving specific portions of their sequences to bind their target strongly and stably, resulting in a set of related, evolving sequences known as a clonal family. We train a large language model, CloneLM, on hundreds of thousands of clonal families and use it to design sequences with mutations that are most likely to optimize an antibody within the human immune system. We propose to guide our designs to fit previous measurements with a twisted sequential Monte Carlo procedure. We show that CloneBO optimizes antibodies substantially more efficiently than previous methods in realistic in silico experiments and designs stronger and more stable binders in in vitro wet lab experiments.

[LG-62] Explainable machine learning for neoplasms diagnosis via electrocardiograms: an externally validated study ALT

链接: https://arxiv.org/abs/2412.07737
作者: Juan Miguel Lopez Alcaraz,Wilhelm Haverkamp,Nils Strodthoff
关键词-EN: improving patient outcomes, mortality worldwide, patient outcomes, remains a leading, crucial for improving
类目: ignal Processing (eess.SP); Machine Learning (cs.LG)
*备注: 9 pages, 2 figures, code under this https URL

点击查看摘要

Abstract:Background: Neoplasms remain a leading cause of mortality worldwide, with timely diagnosis being crucial for improving patient outcomes. Current diagnostic methods are often invasive, costly, and inaccessible to many populations. Electrocardiogram (ECG) data, widely available and non-invasive, has the potential to serve as a tool for neoplasm diagnosis by using physiological changes in cardiovascular function associated with the presence of neoplasms. Methods: This study explores the application of machine learning models to analyze ECG features for the diagnosis of neoplasms. We developed a pipeline integrating tree-based models with Shapley values for explainability. The model was trained, internally validated, and externally validated on a second large-scale independent external cohort to ensure robustness and generalizability. Findings: The results demonstrate that ECG data can effectively capture neoplasm-associated cardiovascular changes, achieving high performance in both internal testing and external validation cohorts. Shapley values identified key ECG features influencing model predictions, revealing established and novel cardiovascular markers linked to neoplastic conditions. This non-invasive approach provides a cost-effective and scalable alternative for the diagnosis of neoplasms, particularly in resource-limited settings, and is similarly useful for managing the secondary cardiovascular effects of neoplasm therapies. Interpretation: This study highlights the feasibility of leveraging ECG signals and machine learning to enhance neoplasm diagnostics. By offering interpretable insights into cardio-neoplasm interactions, this approach bridges existing gaps in non-invasive diagnostics and has implications for integrating ECG-based tools into broader neoplasm diagnostic frameworks, as well as neoplasm therapy management.
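
The abstract names only "tree-based models with Shapley values", so the sketch below uses scikit-learn gradient boosting and the shap package on synthetic stand-in features; the paper's actual ECG features, model family, and cohorts are not reproduced.

```python
import numpy as np
import shap                                    # pip install shap
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.standard_normal((1000, 5))             # stand-in ECG-derived features
y = (X[:, 0] + 0.5 * X[:, 3] + 0.3 * rng.standard_normal(1000) > 0).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
model = GradientBoostingClassifier().fit(X_tr, y_tr)

explainer = shap.TreeExplainer(model)          # Shapley values for tree ensembles
shap_values = explainer.shap_values(X_te)      # per-sample feature attributions
print(np.abs(shap_values).mean(axis=0))        # global ranking: features 0 and 3 lead
```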

[LG-63] Hype-Adjusted Probability Measure for NLP Volatility Forecasting

链接: https://arxiv.org/abs/2412.07587
作者: Zheng Cao,Helyette Geman
关键词-EN: Natural Language Processing, Language Processing, Natural Language, probability measure developed, hype-adjusted probability measure
类目: Computational Finance (q-fin.CP); Machine Learning (cs.LG)
*备注: 26 pages

点击查看摘要

Abstract:This manuscript introduces the hype-adjusted probability measure developed in the context of a new Natural Language Processing (NLP) approach for market forecasting. A novel sentiment score equation is presented to capture component and memory effects and assign dynamic parameters, enhancing the impact of intraday news data on forecasting next-period volatility for selected U.S. semiconductor stocks. This approach integrates machine learning techniques to analyze and improve the predictive value of news. Building on Geman’s research, this work improves forecast accuracy by assigning specific weights to each component of news sources and individual stocks in the portfolio, evaluating time-memory effects on market reactions, and incorporating shifts in sentiment direction. Finally, we propose the Hype-Adjusted Probability Measure, proving its existence and uniqueness, and discuss its theoretical applications in finance for NLP-based volatility forecasting, outlining future research pathways inspired by its concepts.

[LG-64] Physics-Based Dynamic Models Hybridisation Using Physics-Informed Neural Networks

链接: https://arxiv.org/abs/2412.07514
作者: Branislava Lalic,Dinh Viet Cuong,Mina Petric,Vladimir Pavlovic,Ana Firanj Sremac,Mark Roantree
关键词-EN: complex dynamical systems, simplified representations, dynamical systems, Physics-based dynamic models, complex dynamical
类目: Biological Physics (physics.bio-ph); Machine Learning (cs.LG); Numerical Analysis (math.NA)
*备注:

点击查看摘要

Abstract:Physics-based dynamic models (PBDMs) are simplified representations of complex dynamical systems. PBDMs take specific processes within a complex system and assign a fragment of variables and an accompanying set of parameters to depict the processes. As this often leads to suboptimal parameterisation of the system, a key challenge is to refine the empirical parameters and variables to reduce uncertainties while maintaining the model’s explainability and enhancing its predictive accuracy. We demonstrate that a hybrid mosquito population dynamics model, which integrates a PBDM with Physics-Informed Neural Networks (PINN), retains the explainability of the PBDM by incorporating the PINN-learned model parameters in place of its empirical counterparts. Specifically, we address the limitations of traditional PBDMs by modelling the parameters of larva and pupa development rates using a PINN that encodes complex, learned interactions of air temperature, precipitation and humidity. Our results demonstrate improved mosquito population simulations, including the difficult-to-predict mosquito population peaks. This opens the possibility of applying the hybridisation concept to other PBDM-based complex systems, such as cancer growth, to address the challenges posed by scarce and noisy data, and to numerical weather prediction and climate modelling to overcome the gap between physics-based and data-driven weather prediction models.

[LG-65] Dual Random Fields and their Application to Mineral Potential Mapping

链接: https://arxiv.org/abs/2412.07488
作者: Álvaro I. Riquelme
关键词-EN: including mineral exploration, established mining operations, geosciences branches, geometallurgical characterization, limiting the scope
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Applications (stat.AP)
*备注:

点击查看摘要

Abstract:In various geosciences branches, including mineral exploration, geometallurgical characterization on established mining operations, and remote sensing, the regionalized input variables are spatially well-sampled across the domain of interest, limiting the scope of spatial uncertainty quantification procedures. In turn, response outcomes such as the mineral potential in a given region, mining throughput, metallurgical recovery, or in-situ estimations from remote satellite imagery, are usually modeled from a much-restricted subset of testing samples, collected at certain locations due to accessibility restrictions and the high acquisition costs. Our limited understanding of these functions, in terms of the multi-dimensional complexity of causalities and unnoticed dependencies on inaccessible inputs, may lead to observing changes in such functions based on their geographical location. Pooling together different response functions across the domain is critical to correctly predict outcome responses, the uncertainty associated with these inferred values, and the significance of inputs in such predictions at unexplored areas. This paper introduces the notion of a dual random field (dRF), where the response function itself is considered a regionalized variable. In this way, different established response models across the geographic domain can be considered as observations of a dRF realization, enabling the spatial inference and uncertainty assessment of both response models and their predictions. We explain how dRFs inherit all the properties from classical random fields, allowing the use of standard Gaussian simulation procedures to simulate them. These models are combined to obtain a mineral potential response, providing an example of how to rigorously integrate machine learning approaches with geostatistics.

[LG-66] Score-matching-based Structure Learning for Temporal Data on Networks

链接: https://arxiv.org/abs/2412.07469
作者: Hao Chen,Kai Yi,Lin Liu,Yu Guang Wang
关键词-EN: Nonlinear Causal Models, crucial initial step, Additive Nonlinear Causal, background knowledge, crucial initial
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Causal discovery is a crucial initial step in establishing causality from empirical data and background knowledge. Numerous algorithms have been developed for this purpose. Among them, the score-matching method has demonstrated superior performance across various evaluation metrics, particularly for the commonly encountered Additive Nonlinear Causal Models. However, current score-matching-based algorithms are primarily designed to analyze independent and identically distributed (i.i.d.) data. More importantly, they suffer from high computational complexity due to the pruning step required for handling dense Directed Acyclic Graphs (DAGs). To enhance the scalability of score matching, we have developed a new parent-finding subroutine for leaf nodes in DAGs, significantly accelerating the most time-consuming part of the process: the pruning step. This improvement results in an efficiency-lifted score matching algorithm, termed Parent Identification-based Causal structure learning for both i.i.d. and temporal data on networKs, or PICK. The new score-matching algorithm extends the scope of existing algorithms and can handle static and temporal data on networks with weak network interference. Our proposed algorithm can efficiently cope with increasingly complex datasets that exhibit spatial and temporal dependencies, commonly encountered in academia and industry. The proposed algorithm can accelerate score-matching-based methods while maintaining high accuracy in real-world applications.

[LG-67] When UAV Meets Federated Learning: Latency Minimization via Joint Trajectory Design and Resource Allocation

链接: https://arxiv.org/abs/2412.07428
作者: Xuhui Zhang,Wenchao Liu,Jinke Ren,Huijun Xing,Gui Gui,Yanyan Shen,Shuguang Cui
关键词-EN: machine learning models, Internet of Things, limited computation resources, Federated learning, training machine learning
类目: ignal Processing (eess.SP); Machine Learning (cs.LG)
*备注: This manuscript has been submitted to IEEE

点击查看摘要

Abstract:Federated learning (FL) has emerged as a pivotal solution for training machine learning models over wireless networks, particularly for Internet of Things (IoT) devices with limited computation resources. Despite its benefits, the efficiency of FL is often restricted by the communication quality between IoT devices and the central server. To address this issue, we introduce an innovative approach by deploying an unmanned aerial vehicle (UAV) as a mobile FL server to enhance the training process of FL. By leveraging the UAV’s maneuverability, we establish robust line-of-sight connections with IoT devices, significantly improving communication capacity. To improve the overall training efficiency, we formulate a latency minimization problem by jointly optimizing the bandwidth allocation, computing frequencies, transmit power for both the UAV and IoT devices, and the UAV’s trajectory. An efficient alternating optimization algorithm is then developed to solve it. Furthermore, we analyze the convergence and computational complexity of the proposed algorithm. Finally, numerical results demonstrate that our proposed scheme not only outperforms existing benchmark schemes in terms of latency but also achieves training efficiency that closely approximates the ideal scenario.

[LG-68] Modeling High-Resolution Spatio-Temporal Wind with Deep Echo State Networks and Stochastic Partial Differential Equations

链接: https://arxiv.org/abs/2412.07265
作者: Kesen Wang,Minwoo Kim,Stefano Castruccio,Marc G. Genton
关键词-EN: carbon footprint reduction, gained increasing attention, increasing attention due, Saudi Arabia, past decades
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:In the past decades, clean and renewable energy has gained increasing attention due to a global effort on carbon footprint reduction. In particular, Saudi Arabia is gradually shifting its energy portfolio from an exclusive use of oil to a reliance on renewable energy, and, in particular, wind. Modeling wind for assessing potential energy output in a country as large, geographically diverse, and understudied as Saudi Arabia is a challenge that implies highly non-linear dynamic structures in both space and time. To address this, we propose a spatio-temporal model whose spatial information is first reduced via an energy distance-based approach and whose dynamical behavior is then informed by a sparse and stochastic recurrent neural network (Echo State Network). Finally, the full spatial data is reconstructed by means of a non-stationary stochastic partial differential equation-based approach. Our model can capture the fine-scale wind structure, produce more accurate forecasts of both wind speed and energy at lead times of interest for energy grid management, and save as much as one million dollars annually against the closest competitive model.
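
The recurrent core named above, an echo state network, is simple enough to sketch from scratch: a fixed sparse random reservoir is driven by the input and only a linear readout is trained. The toy signal and sizes below are assumptions, and the paper's energy-distance reduction and SPDE reconstruction steps are omitted.

```python
import numpy as np

rng = np.random.default_rng(0)
n_res = 200
W_in = rng.uniform(-0.5, 0.5, (n_res, 1))
W = rng.standard_normal((n_res, n_res)) * (rng.random((n_res, n_res)) < 0.05)
W *= 0.9 / np.abs(np.linalg.eigvals(W)).max()   # spectral radius < 1 (echo state)

def run_reservoir(u):
    x, states = np.zeros(n_res), []
    for u_t in u:                                # the reservoir is never trained
        x = np.tanh(W_in @ np.atleast_1d(u_t) + W @ x)
        states.append(x.copy())
    return np.array(states)

u = np.sin(np.linspace(0, 20 * np.pi, 1000))     # toy stand-in for a wind series
S = run_reservoir(u[:-1])
W_out = np.linalg.lstsq(S, u[1:], rcond=None)[0] # train only the linear readout
pred = S @ W_out                                 # one-step-ahead forecasts
```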

[LG-69] Optimization Can Learn Johnson Lindenstrauss Embeddings

链接: https://arxiv.org/abs/2412.07242
作者: Nikos Tsikouras,Constantine Caramanis,Christos Tzamos
关键词-EN: offering compact representations, complex data structures, Embeddings play, offering compact, play a pivotal
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Embeddings play a pivotal role across various disciplines, offering compact representations of complex data structures. Randomized methods like Johnson-Lindenstrauss (JL) provide state-of-the-art and essentially unimprovable theoretical guarantees for achieving such representations. These guarantees are worst-case and in particular, neither the analysis, nor the algorithm, takes into account any potential structural information of the data. The natural question is: must we randomize? Could we instead use an optimization-based approach, working directly with the data? A first answer is no: as we show, the distance-preserving objective of JL has a non-convex landscape over the space of projection matrices, with many bad stationary points. But this is not the final answer. We present a novel method motivated by diffusion models that circumvents this fundamental challenge: rather than performing optimization directly over the space of projection matrices, we use optimization over the larger space of random solution samplers, gradually reducing the variance of the sampler. We show that by moving through this larger space, our objective converges to a deterministic (zero variance) solution, avoiding bad stationary points. This method can also be seen as an optimization-based derandomization approach and is an idea and method that we believe can be applied to many other problems.
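
For readers unfamiliar with the baseline being questioned, the worked example below shows the classical JL construction: a random Gaussian projection approximately preserves pairwise distances with no optimization at all. Sizes are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, k = 200, 1000, 128                       # points, ambient dim, target dim
X = rng.standard_normal((n, d))
P = rng.standard_normal((d, k)) / np.sqrt(k)   # random JL projection matrix
Y = X @ P

i, j = rng.integers(0, n, 500), rng.integers(0, n, 500)
orig = np.linalg.norm(X[i] - X[j], axis=1)
proj = np.linalg.norm(Y[i] - Y[j], axis=1)
mask = orig > 0                                # skip accidental i == j pairs
print(np.abs(proj[mask] / orig[mask] - 1).max())   # modest distortion at k = 128
```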

[LG-70] A Consolidated Volatility Prediction with Back Propagation Neural Network and Genetic Algorithm ICICML2024

链接: https://arxiv.org/abs/2412.07223
作者: Zong Ke,Jingyu Xu,Zizhou Zhang,Yu Cheng,Wenjun Wu
关键词-EN: emerging stock markets, unique approach, predict emerging stock, emerging stock, stock markets
类目: Computational Finance (q-fin.CP); Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)
*备注: 6 pages, 7 figures, 1 table, The paper will be published by IEEE on conference: 2024 3rd International Conference on Image Processing, Computer Vision and Machine Learning (ICICML 2024)

点击查看摘要

Abstract:This paper provides a unique approach using AI algorithms to predict the volatility of emerging stock markets. Traditionally, stock volatility is derived from historical volatility, Monte Carlo simulation, and implied volatility. In this paper, the authors design a consolidated model combining a back-propagation neural network with a genetic algorithm to predict the future volatility of emerging stock markets and find that the results are quite accurate, with low errors.
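
Since the abstract gives no implementation details, the toy sketch below only illustrates the stated combination: a genetic algorithm searches the weights of a small feedforward network for one-step volatility prediction. The data, network shape, and mutation-only variation (no crossover) are all assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
series = np.abs(rng.standard_normal(300)).cumsum() / 50      # toy volatility proxy
X = np.stack([series[i:i + 5] for i in range(290)])          # 5-lag input windows
y = series[5:295]                                            # next-step targets

def predict(w, X):                     # tiny 5-8-1 network, weights flattened in w
    W1, b1 = w[:40].reshape(5, 8), w[40:48]
    W2, b2 = w[48:56], w[56]
    return np.tanh(X @ W1 + b1) @ W2 + b2

def fitness(w):
    return -np.mean((predict(w, X) - y) ** 2)                # negative MSE

pop = rng.standard_normal((50, 57))
for gen in range(200):
    scores = np.array([fitness(w) for w in pop])
    parents = pop[np.argsort(scores)[-10:]]                  # keep the 10 fittest
    children = parents[rng.integers(0, 10, 40)] \
        + 0.1 * rng.standard_normal((40, 57))                # mutate copies of parents
    pop = np.vstack([parents, children])
print(-max(fitness(w) for w in pop))                         # best MSE found
```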

[LG-71] Automatic Doubly Robust Forests

链接: https://arxiv.org/abs/2412.07184
作者: Zhaomeng Chen,Junting Duan,Victor Chernozhukov,Vasilis Syrgkanis
关键词-EN: Doubly Robust Random, automatic Doubly Robust, Robust Random Forest, high-dimensional nuisance functions, Doubly Robust
类目: Methodology (stat.ME); Machine Learning (cs.LG); Econometrics (econ.EM); Statistics Theory (math.ST)
*备注:

点击查看摘要

Abstract:This paper proposes the automatic Doubly Robust Random Forest (DRRF) algorithm for estimating the conditional expectation of a moment functional in the presence of high-dimensional nuisance functions. DRRF combines the automatic debiasing framework using the Riesz representer (Chernozhukov et al., 2022) with non-parametric, forest-based estimation methods for the conditional moment (Athey et al., 2019; Oprescu et al., 2019). In contrast to existing methods, DRRF does not require prior knowledge of the form of the debiasing term nor impose restrictive parametric or semi-parametric assumptions on the target quantity. Additionally, it is computationally efficient for making predictions at multiple query points and significantly reduces runtime compared to methods such as Orthogonal Random Forest (Oprescu et al., 2019). We establish the consistency and asymptotic normality results of the DRRF estimator under general assumptions, allowing for the construction of valid confidence intervals. Through extensive simulations in heterogeneous treatment effect (HTE) estimation, we demonstrate the superior performance of DRRF over benchmark approaches in terms of estimation accuracy, robustness, and computational efficiency.

[LG-72] User Authentication and Vital Signs Extraction from Low-Frame-Rate and Monochrome No-contact Fingerprint Captures

链接: https://arxiv.org/abs/2412.07082
作者: Olaoluwayimika Olugbenle,Logan Drake,Naveenkumar G. Venkataswamy,Arfina Rahman,Yemi Afolayanka,Masudul Imtiaz,Mahesh K. Banavar
关键词-EN: fingerprint capture device, blue light, extract vital signs, work on leveraging, fingerprint capture
类目: Image and Video Processing (eess.IV); Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注: Accepted at the 2024 Asilomar Conference on Signals, Systems, and Computers. 5 pages, 5 figures, 2 tables

点击查看摘要

Abstract:We present our work on leveraging low-frame-rate monochrome (blue light) videos of fingertips, captured with an off-the-shelf fingerprint capture device, to extract vital signs and identify users. These videos utilize photoplethysmography (PPG), commonly used to measure vital signs like heart rate. While prior research predominantly utilizes high-frame-rate, multi-wavelength PPG sensors (e.g., infrared, red, or RGB), our preliminary findings demonstrate that both user identification and vital sign extraction are achievable with the low-frame-rate data we collected. Preliminary results show low error rates for both heart rate estimation and user authentication, indicating promise for effective biometric systems. We anticipate that further optimization will enhance accuracy and advance healthcare and security applications.

[LG-73] A Note on Sample Complexity of Interactive Imitation Learning with Log Loss

链接: https://arxiv.org/abs/2412.07057
作者: Yichen Li,Chicheng Zhang
关键词-EN: sequential decision-making problems, specifically Behavior Cloning, decision-making problems, Imitation learning, Behavior Cloning
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注: 18 pages

点击查看摘要

Abstract:Imitation learning (IL) is a general paradigm for learning from experts in sequential decision-making problems. Recent advancements in IL have shown that offline imitation learning, specifically Behavior Cloning (BC) with log loss, is minimax optimal. Meanwhile, its interactive counterpart, DAgger, is shown to suffer from suboptimal sample complexity. In this note, we focus on a realizable deterministic expert and revisit interactive imitation learning, particularly DAgger with log loss. We demonstrate: (1) a one-sample-per-round DAgger variant that outperforms BC in state-wise annotation; (2) without a recoverability assumption, DAgger with first-step mixture policies matches the performance of BC. In the course of the analysis, we introduce a new notion of decoupled Hellinger distance that separates state and action sequences, which can be of independent interest.
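
For orientation, a self-contained DAgger loop on a toy one-dimensional task is sketched below: the learner's own rollouts are labeled by the expert and aggregated before retraining. The environment, expert, and classifier are hypothetical stand-ins, and the note's one-sample-per-round variant and log-loss analysis are not reproduced.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
expert = lambda s: int(s < 5)               # expert: move right until state 5

def rollout(policy, horizon=20):            # visit states under a given policy
    s, states = 0.0, []
    for _ in range(horizon):
        states.append(s)
        s += (1 if policy(s) == 1 else -1) + 0.1 * rng.standard_normal()
    return states

X, y = [], []
policy = expert                             # round 0 bootstraps from the expert
for rnd in range(5):
    for s in rollout(policy):               # roll out the *current learner*
        X.append([s]); y.append(expert(s))  # but label every state by the expert
    knn = KNeighborsClassifier(3).fit(X, y)
    policy = lambda s, m=knn: int(m.predict([[s]])[0])   # retrain on the aggregate
```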

[LG-74] A Misclassification Network-Based Method for Comparative Genomic Analysis

链接: https://arxiv.org/abs/2412.07051
作者: Wan He,Tina Eliassi-Rad,Samuel V. Scarpino
关键词-EN: Classifying genome sequences, active area, area of research, important applications, genome
类目: Genomics (q-bio.GN); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Classifying genome sequences based on metadata has been an active area of research in comparative genomics for decades with many important applications across the life sciences. Established methods for classifying genomes can be broadly grouped into sequence alignment-based and alignment-free models. Conventional alignment-based models rely on genome similarity measures calculated based on local sequence alignments or consistent ordering among sequences. However, such methods are computationally expensive when dealing with large ensembles of even moderately sized genomes. In contrast, alignment-free (AF) approaches measure genome similarity based on summary statistics in an unsupervised setting and are efficient enough to analyze large datasets. However, both alignment-based and AF methods typically assume fixed scoring rubrics that lack the flexibility to assign varying importance to different parts of the sequences based on prior knowledge. In this study, we integrate AI and network science approaches to develop a comparative genomic analysis framework that addresses these limitations. Our approach, termed the Genome Misclassification Network Analysis (GMNA), simultaneously leverages misclassified instances, a learned scoring rubric, and label information to classify genomes based on associated metadata and better understand potential drivers of misclassification. We evaluate the utility of the GMNA using Naive Bayes and convolutional neural network models, supplemented by additional experiments with transformer-based models, to construct SARS-CoV-2 sampling location classifiers using over 500,000 viral genome sequences and study the resulting network of misclassifications. We demonstrate the global health potential of the GMNA by leveraging the SARS-CoV-2 genome misclassification networks to investigate the role human mobility played in structuring geographic clustering of SARS-CoV-2.

[LG-75] Geological and Well prior assisted full waveform inversion using conditional diffusion models

链接: https://arxiv.org/abs/2412.06959
作者: Fu Wang,Xinquan Huang,Tariq Alkhalifah
关键词-EN: Full waveform inversion, inaccurate inversion results, geologically inaccurate inversion, inadequate seismic observations, faces challenges due
类目: Geophysics (physics.geo-ph); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Full waveform inversion (FWI) often faces challenges due to inadequate seismic observations, resulting in band-limited and geologically inaccurate inversion results. Incorporating prior information from potential velocity distributions, well-log information, and our geological knowledge and expectations can significantly improve FWI convergence to a realistic model. While diffusion-regularized FWI has shown improved performance compared to conventional FWI by incorporating the velocity distribution prior, it can benefit even more by incorporating well-log information and other geological knowledge priors. To leverage this fact, we propose a geological class and well-information prior-assisted FWI using conditional diffusion models. This method seamlessly integrates multi-modal information into FWI, simultaneously achieving data fitting and universal geologic and geophysics prior matching, which is often not achieved with traditional regularization methods. Specifically, we propose to combine conditional diffusion models with FWI, where we integrate well-log data and geological class conditions into these conditional diffusion models using classifier-free guidance for multi-modal prior matching beyond the original velocity distribution prior. Numerical experiments on the OpenFWI datasets and field marine data demonstrate the effectiveness of our method compared to conventional FWI and the unconditional diffusion-regularized FWI.

[LG-76] NRSurNN3dq4: A Deep Learning Powered Numerical Relativity Surrogate for Binary Black Hole Waveforms

链接: https://arxiv.org/abs/2412.06946
作者: Osvaldo Gramaxo Freitas,Anastasios Theodoropoulos,Nino Villanueva,Tiago Fernandes,Solange Nunes,José A. Font,Antonio Onofre,Alejandro Torres-Forné,José D. Martin-Guerrero
关键词-EN: Gravitational wave approximants, Gravitational wave, numerical relativity waveforms, numerical relativity, widely used tools
类目: General Relativity and Quantum Cosmology (gr-qc); High Energy Astrophysical Phenomena (astro-ph.HE); Instrumentation and Methods for Astrophysics (astro-ph.IM); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Gravitational wave approximants are widely used tools in gravitational-wave astronomy. They allow for dense coverage of the parameter space of binary black hole (BBH) mergers for purposes of parameter inference, or, more generally, matched-filtering tasks, while avoiding the computationally expensive full evolution of numerical relativity simulations. However, this comes at a slight cost in terms of accuracy when compared to numerical relativity waveforms, depending on the approach. One way to minimize this is by constructing so-called surrogate models which, instead of using approximate physics or phenomenological formulae, rather interpolate within the space of numerical relativity waveforms. In this work, we introduce NRSurNN3dq4, a surrogate model for non-precessing BBH merger waveforms powered by neural networks. By relying on the power of deep learning, this approximant is remarkably fast and competitively accurate, as it can generate millions of waveforms in a tenth of a second, while mismatches with numerical relativity waveforms are kept below 10^-3. We implement this approximant within the bilby framework for gravitational-wave parameter inference and show that it is suitable for parameter estimation tasks.

[LG-77] Variable Selection for Comparing High-dimensional Time-Series Data

链接: https://arxiv.org/abs/2412.06870
作者: Kensuke Mitsuzawa,Margherita Grossi,Stefano Bortoli,Motonobu Kanagawa
关键词-EN: multivariate time-series data, length and dimensions, pair of multivariate, multivariate time-series, select variables
类目: Methodology (stat.ME); Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:Given a pair of multivariate time-series data of the same length and dimensions, an approach is proposed to select variables and time intervals where the two series are significantly different. In applications where one time series is an output from a computationally expensive simulator, the approach may be used for validating the simulator against real data, for comparing the outputs of two simulators, and for validating a machine learning-based emulator against the simulator. With the proposed approach, the entire time interval is split into multiple subintervals, and on each subinterval, the two sample sets are compared to select variables that distinguish their distributions and a two-sample test is performed. The validity and limitations of the proposed approach are investigated in synthetic data experiments. Its usefulness is demonstrated in an application with a particle-based fluid simulator, where a deep neural network model is compared against the simulator, and in an application with a microscopic traffic simulator, where the effects of changing the simulator’s parameters on traffic flows are analysed.
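
A minimal version of the described procedure is sketched below: split time into subintervals and, per variable and subinterval, run a two-sample test between the two series. Treating each time point within a subinterval as one sample is a simplification of the paper's setup, and the KS test merely stands in for its variable-selection step.

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)
T, D = 600, 4
A = rng.standard_normal((T, D))             # e.g., simulator output
B = rng.standard_normal((T, D))             # e.g., real data or an emulator
B[300:, 2] += 1.0                           # planted difference: variable 2, late only

for lo in range(0, T, 200):                 # three subintervals of length 200
    for d in range(D):
        p = ks_2samp(A[lo:lo + 200, d], B[lo:lo + 200, d]).pvalue
        if p < 0.01:                        # flag (variable, interval) pairs
            print(f"t in [{lo}, {lo + 200}): variable {d} differs (p = {p:.1e})")
```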

[LG-78] A Study on Quantum Neural Networks in Healthcare 5.0

链接: https://arxiv.org/abs/2412.06818
作者: Sanjay Chakraborty
关键词-EN: quantum neural networks, quantum neural, neural networks, healthcare analytics, healthcare
类目: Quantum Physics (quant-ph); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:The working environment in healthcare analytics is transforming with the emergence of healthcare 5.0 and the advancements in quantum neural networks. In addition to analyzing a comprehensive set of case studies, we also review relevant literature from the fields of quantum computing applications and smart healthcare analytics, focusing on the implications of quantum deep neural networks. This study aims to shed light on the existing research gaps regarding the implications of quantum neural networks in healthcare analytics. We argue that the healthcare industry is currently transitioning from automation towards genuine collaboration with quantum networks, which presents new avenues for research and exploration. Specifically, this study focuses on evaluating the performance of Healthcare 5.0, which involves the integration of diverse quantum machine learning and quantum neural network systems. This study also explores a range of potential challenges and future directions for Healthcare 5.0, particularly focusing on the integration of quantum neural networks.

[LG-79] HiCat: A Semi-Supervised Approach for Cell Type Annotation

链接: https://arxiv.org/abs/2412.06805
作者: Chang Bi,Kailun Bai,Xing Li,Xuekui Zhang
关键词-EN: single-cell RNA sequencing, Annotation using Transformative, RNA sequencing data, Hybrid Cell Annotation, single-cell RNA
类目: Biomolecules (q-bio.BM); Machine Learning (cs.LG)
*备注: This document is exactly the same as the submitted version for RECOMB 2025 on October 28, 03:06 GMT

点击查看摘要

Abstract:We introduce HiCat (Hybrid Cell Annotation using Transformative embeddings), a novel semi-supervised pipeline for annotating cell types from single-cell RNA sequencing data. HiCat fuses the strengths of supervised learning for known cell types with unsupervised learning to identify novel types. This hybrid approach incorporates both reference and query genomic data for feature engineering, enhancing the embedding learning process, increasing the effective sample size for unsupervised techniques, and improving the transferability of the supervised model trained on reference data when applied to query datasets. The pipeline follows six key steps: (1) removing batch effects using Harmony to generate a 50-dimensional principal component embedding; (2) applying UMAP for dimensionality reduction to two dimensions to capture crucial data patterns; (3) conducting unsupervised clustering of cells with DBSCAN, yielding a one-dimensional cluster membership vector; (4) merging the multi-resolution results of the previous steps into a 53-dimensional feature space that encompasses both reference and query data; (5) training a CatBoost model on the reference dataset to predict cell types in the query dataset; and (6) resolving inconsistencies between the supervised predictions and unsupervised cluster labels. When benchmarked on 10 publicly available genomic datasets, HiCat surpasses other methods, particularly in differentiating and identifying multiple new cell types. Its capacity to accurately classify novel cell types showcases its robustness and adaptability within intricate biological datasets.
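
The six steps compress naturally into the sketch below on synthetic data; note that scikit-learn PCA stands in for the Harmony batch-correction step and the final reconciliation step is only indicated, so this is a shape-of-the-pipeline illustration rather than HiCat itself.

```python
import numpy as np
import umap                                  # pip install umap-learn
from sklearn.decomposition import PCA
from sklearn.cluster import DBSCAN
from catboost import CatBoostClassifier     # pip install catboost

rng = np.random.default_rng(0)
ref, query = rng.standard_normal((500, 200)), rng.standard_normal((300, 200))
ref_labels = rng.integers(0, 4, 500)         # known cell types on the reference

X = np.vstack([ref, query])                  # reference and query together
pcs = PCA(n_components=50).fit_transform(X)              # step 1 (Harmony in HiCat)
emb2 = umap.UMAP(n_components=2).fit_transform(pcs)      # step 2: UMAP to 2-D
clusters = DBSCAN(eps=0.5).fit_predict(emb2)             # step 3: unsupervised clusters
feats = np.hstack([pcs, emb2, clusters[:, None]])        # step 4: 53-dim feature space

model = CatBoostClassifier(iterations=200, verbose=False)
model.fit(feats[:500], ref_labels)                       # step 5: train on reference
pred = model.predict(feats[500:])                        # predict query cell types
# step 6 (not shown): reconcile `pred` with `clusters` to flag novel cell types
```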

信息检索

[IR-0] RLT4Rec: Reinforcement Learning Transformer for User Cold Start and Item Recommendation

链接: https://arxiv.org/abs/2412.07403
作者: Dilina Chandika Rajapakse,Douglas Leith
关键词-EN: achieves excellent performance, item recommendation tasks, sequential transformer reinforcement, transformer reinforcement learning, reinforcement learning architecture
类目: Information Retrieval (cs.IR)
*备注:

点击查看摘要

Abstract:We introduce a new sequential transformer reinforcement learning architecture RLT4Rec and demonstrate that it achieves excellent performance in a range of item recommendation tasks. RLT4Rec uses a relatively simple transformer architecture that takes as input the user’s (item,rating) history and outputs the next item to present to the user. Unlike existing RL approaches, there is no need to input a state observation or estimate. RLT4Rec handles new users and established users within the same consistent framework and automatically balances the “exploration” needed to discover the preferences of a new user with the “exploitation” that is more appropriate for established users. Training of RLT4Rec is robust and fast and is insensitive to the choice of training data, learning to generate “good” personalised sequences that the user tends to rate highly even when trained on “bad” data.

[IR-1] CURE: Clinical Understanding Retrieval Evaluation

链接: https://arxiv.org/abs/2412.06954
作者: Nadia Sheikh,Anne-Laure Jousse,Daniel Buades Marcos,Akintunde Oladipo,Olivier Rousseau,Jimmy Lin
关键词-EN: training dataset distributions, domain-specific test sets, dominance of dense, dense retrievers, sets are essential
类目: Information Retrieval (cs.IR)
*备注:

点击查看摘要

Abstract:Given the dominance of dense retrievers that do not generalize well beyond their training dataset distributions, domain-specific test sets are essential in evaluating retrieval. There are few test datasets for retrieval systems intended for use by healthcare providers in a point-of-care setting. To fill this gap we have collaborated with medical professionals to create CURE, an ad-hoc retrieval test dataset for passage ranking with 2000 queries spanning 10 medical domains, with one monolingual (English) condition and two cross-lingual (French/Spanish to English) conditions. In this paper, we describe how CURE was constructed and provide baseline results to showcase its effectiveness as an evaluation tool. CURE is published with a Creative Commons Attribution Non Commercial 4.0 license and can be accessed on Hugging Face.

附件下载

点击下载今日全部论文列表