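Today's arXiv Papers | 2024-12-20

This post lists the latest papers retrieved from arXiv.org on 2024-12-20. It is updated automatically and grouped into five areas: NLP, CV, ML, AI, and IR. To receive the list by scheduled email, leave your email address in the comments.

Note: the daily paper data is fetched from arXiv.org and updated automatically at around 12:00 every day.

Tip: if you would like to receive the daily paper data by email, leave your email address in the comments.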

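[Quick read]: This paper proves the NP-completeness of two variants of the tokenisation problem. The problem is defined as compressing a dataset to at most δ symbols, either by finding a vocabulary directly (direct tokenisation) or by selecting a sequence of merge operations (bottom-up tokenisation). The key result is that both variants are NP-complete, so no polynomial-time algorithm solves them in general unless P = NP.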

Link: https://arxiv.org/abs/2412.15210
Authors: Philip Whittington, Gregor Bachmann, Tiago Pimentel
Affiliations: ETH Zürich
Keywords: direct tokenisation, bottom-up tokenisation, vocabulary directly, merge operations, prove the NP-completeness
Categories: Data Structures and Algorithms (cs.DS); Computation and Language (cs.CL); Formal Languages and Automata Theory (cs.FL)
Comments:

Abstract:In this work, we prove the NP-completeness of two variants of tokenisation, defined as the problem of compressing a dataset to at most \delta symbols by either finding a vocabulary directly (direct tokenisation), or selecting a sequence of merge operations (bottom-up tokenisation).
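To make the bottom-up variant concrete, here is a minimal Python sketch that applies a given sequence of merge operations to a toy dataset and counts the resulting symbols. The function names (`apply_merge`, `compress`) and the toy data are illustrative assumptions, not taken from the paper. The hard part that the paper analyses is choosing the merges so that the count is at most δ, which this sketch does not attempt.

```python
def apply_merge(tokens, pair):
    """Replace each left-to-right occurrence of `pair` with one merged token."""
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            merged.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged


def compress(dataset, merges):
    """Apply the merges in order to every string and return the total symbol count."""
    total = 0
    for text in dataset:
        tokens = list(text)  # start from single characters
        for pair in merges:
            tokens = apply_merge(tokens, pair)
        total += len(tokens)
    return total


# Toy dataset and merge sequence (hypothetical, for illustration only).
dataset = ["abab", "abc"]
merges = [("a", "b"), ("ab", "ab")]
symbols = compress(dataset, merges)
```

Checking whether a given merge sequence meets the budget δ is cheap, as above; the difficulty lies in searching over possible merge sequences.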

[NLP-1] LongBench v2: Towards Deeper Understanding and Reasoning on Realistic Long-context Multitasks

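[Quick read]: This paper addresses the limited deep understanding and reasoning ability of large language models (LLMs) on long-context problems. It introduces the LongBench v2 benchmark, which contains 503 multiple-choice questions with contexts ranging from 8k to 2M words across six major task categories. Data was collected from highly educated individuals with diverse professional backgrounds, and both automated and manual review were used to keep quality and difficulty high. The results indicate that stronger reasoning and scaled inference-time compute are key to better performance: o1-preview, which reasons for longer, reaches 57.7% accuracy, above the 53.7% achieved by human experts.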

Link: https://arxiv.org/abs/2412.15204
Authors: Yushi Bai, Shangqing Tu, Jiajie Zhang, Hao Peng, Xiaozhi Wang, Xin Lv, Shulin Cao, Jiazheng Xu, Lei Hou, Yuxiao Dong, Jie Tang, Juanzi Li
Affiliations: unknown
Keywords: problems requiring deep, paper introduces LongBench, requiring deep understanding, handle long-context problems, long-context problems requiring
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 25 pages, 13 figures

Abstract:This paper introduces LongBench v2, a benchmark designed to assess the ability of LLMs to handle long-context problems requiring deep understanding and reasoning across real-world multitasks. LongBench v2 consists of 503 challenging multiple-choice questions, with contexts ranging from 8k to 2M words, across six major task categories: single-document QA, multi-document QA, long in-context learning, long-dialogue history understanding, code repository understanding, and long structured data understanding. To ensure the breadth and the practicality, we collect data from nearly 100 highly educated individuals with diverse professional backgrounds. We employ both automated and manual review processes to maintain high quality and difficulty, resulting in human experts achieving only 53.7% accuracy under a 15-minute time constraint. Our evaluation reveals that the best-performing model, when directly answers the questions, achieves only 50.1% accuracy. In contrast, the o1-preview model, which includes longer reasoning, achieves 57.7%, surpassing the human baseline by 4%. These results highlight the importance of enhanced reasoning ability and scaling inference-time compute to tackle the long-context challenges in LongBench v2. The project is available at this https URL.

[NLP-2] MMLU-CF: A Contamination-free Multi-task Language Understanding Benchmark

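[Quick read]: This paper addresses benchmark contamination in evaluating the commonsense, understanding, and problem-solving abilities of large language models (LLMs), which arises because benchmarks such as MMLU are open and LLM training data comes from broad sources. It proposes MMLU-CF, a contamination-free and more challenging multiple-choice benchmark. To avoid unintentional leakage, the benchmark draws on a broader range of sources and applies three decontamination rules. To prevent malicious leakage, it is split into validation and test sets: the test set stays closed-source so that results remain reliable, while the validation set is public to promote transparency and independent verification.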

Link: https://arxiv.org/abs/2412.15194
Authors: Qihao Zhao, Yangyu Huang, Tengchao Lv, Lei Cui, Qinzheng Sun, Shaoguang Mao, Xin Zhang, Ying Xin, Qiufeng Yin, Scarlett Li, Furu Wei
Affiliations: Microsoft Research
Keywords: Massive Multitask Language, Multitask Language Understanding, large language models, Massive Multitask, Multitask Language
Categories: Computation and Language (cs.CL)
Comments:

Abstract:Multiple-choice question (MCQ) datasets like Massive Multitask Language Understanding (MMLU) are widely used to evaluate the commonsense, understanding, and problem-solving abilities of large language models (LLMs). However, the open-source nature of these benchmarks and the broad sources of training data for LLMs have inevitably led to benchmark contamination, resulting in unreliable evaluation results. To alleviate this issue, we propose a contamination-free and more challenging MCQ benchmark called MMLU-CF. This benchmark reassesses LLMs’ understanding of world knowledge by averting both unintentional and malicious data leakage. To avoid unintentional data leakage, we source data from a broader domain and design three decontamination rules. To prevent malicious data leakage, we divide the benchmark into validation and test sets with similar difficulty and subject distributions. The test set remains closed-source to ensure reliable results, while the validation set is publicly available to promote transparency and facilitate independent verification. Our evaluation of mainstream LLMs reveals that the powerful GPT-4o achieves merely a 5-shot score of 73.4% and a 0-shot score of 71.9% on the test set, which indicates the effectiveness of our approach in creating a more rigorous and contamination-free evaluation standard. The GitHub repository is available at this https URL and the dataset refers to this https URL.

[NLP-3] Face the Facts! Evaluating RAG-based Fact-checking Pipelines in Realistic Settings

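[Quick read]: This paper addresses the limitations of existing Retrieval-Augmented Generation (RAG) methods for automated fact-checking, particularly when handling stylistically complex claims and heterogeneous yet reliable knowledge bases. It lifts several constraints of current RAG pipelines and benchmarks them under more realistic conditions, focusing on their ability to generate verdicts. The findings: LLM-based retrievers outperform other retrieval techniques but still struggle with heterogeneous knowledge bases; larger models produce more faithful verdicts, while smaller models adhere better to context. In human evaluation, zero-shot and one-shot approaches are preferred for informativeness, and fine-tuned models for emotional alignment.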

Link: https://arxiv.org/abs/2412.15189
Authors: Daniel Russo, Stefano Menini, Jacopo Staiano, Marco Guerini
Affiliations: Fondazione Bruno Kessler, Italy; University of Trento, Italy
Keywords: Natural Language Processing, Natural Language, Language Processing, professional fact-checkers, systems have recently
Categories: Computation and Language (cs.CL); Computers and Society (cs.CY)
Comments: Code and data at this https URL

Abstract:Natural Language Processing and Generation systems have recently shown the potential to complement and streamline the costly and time-consuming job of professional fact-checkers. In this work, we lift several constraints of current state-of-the-art pipelines for automated fact-checking based on the Retrieval-Augmented Generation (RAG) paradigm. Our goal is to benchmark, under more realistic scenarios, RAG-based methods for the generation of verdicts - i.e., short texts discussing the veracity of a claim - evaluating them on stylistically complex claims and heterogeneous, yet reliable, knowledge bases. Our findings show a complex landscape, where, for example, LLM-based retrievers outperform other retrieval techniques, though they still struggle with heterogeneous knowledge bases; larger models excel in verdict faithfulness, while smaller models provide better context adherence, with human evaluations favouring zero-shot and one-shot approaches for informativeness, and fine-tuned models for emotional alignment.

[NLP-4] LlamaFusion: Adapting Pretrained Language Models for Multimodal Generation

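[Quick read]: This paper asks how to give pretrained text-only large language models (LLMs) multimodal generative capabilities without pretraining from scratch. The proposed LlamaFusion framework reuses the existing Llama-3 weights for autoregressive text processing and adds parallel transformer modules that process images with diffusion, so the model can understand and generate text and images in arbitrary sequences. By freezing the text-specific modules and training only the image-specific ones, it preserves the language capabilities of the text-only LLM while building visual understanding and generation abilities. Compared with pretraining multimodal generative models from scratch, LlamaFusion improves image understanding by 20% and image generation by 3.6% using only 50% of the FLOPs.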

Link: https://arxiv.org/abs/2412.15188
Authors: Weijia Shi, Xiaochuang Han, Chunting Zhou, Weixin Liang, Xi Victoria Lin, Luke Zettlemoyer, Lili Yu
Affiliations: University of Washington; Meta; Stanford University
Keywords: empowering pretrained text-only, pretrained text-only large, arbitrary sequences, empowering pretrained, understand and generate
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:

Abstract:We present LlamaFusion, a framework for empowering pretrained text-only large language models (LLMs) with multimodal generative capabilities, enabling them to understand and generate both text and images in arbitrary sequences. LlamaFusion leverages existing Llama-3’s weights for processing texts autoregressively while introducing additional and parallel transformer modules for processing images with diffusion. During training, the data from each modality is routed to its dedicated modules: modality-specific feedforward layers, query-key-value projections, and normalization layers process each modality independently, while the shared self-attention layers allow interactions across text and image features. By freezing the text-specific modules and only training the image-specific modules, LlamaFusion preserves the language capabilities of text-only LLMs while developing strong visual understanding and generation abilities. Compared to methods that pretrain multimodal generative models from scratch, our experiments demonstrate that, LlamaFusion improves image understanding by 20% and image generation by 3.6% using only 50% of the FLOPs while maintaining Llama-3’s language capabilities. We also demonstrate that this framework can adapt existing vision-language models with multimodal generation ability. Overall, this framework not only leverages existing computational investments in text-only LLMs but also enables the parallel development of language and vision capabilities, presenting a promising direction for efficient multimodal model development.

[NLP-5] Critical-Questions-of-Thought: Steering LLM reasoning with Argumentative Querying

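[Quick read]: This paper addresses the weak performance of state-of-the-art large language models (LLMs) on logical and mathematical reasoning tasks. It applies critical questions from argumentation theory, specifically Toulmin's model of argumentation, to strengthen LLM reasoning. By probing the rationale behind the model's reasoning process, the model can detect and correct potential logical mistakes before giving its final reply, which improves performance on reasoning problems that are unseen or far from its training data. The underlying idea comes from the standard for any valid argumentative procedure: a conclusion is valid if it is entailed by accepted premises. The approach is evaluated extensively on the MT-Bench Reasoning and Math tasks across a range of LLMs.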

Link: https://arxiv.org/abs/2412.15177
Authors: Federico Castagna, Isabel Sassoon, Simon Parsons
Affiliations: University of Lincoln
Keywords: Large Language models, Large Language, Studies have underscored, Language models, continue to struggle
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:

Abstract:Studies have underscored how, regardless of the recent breakthrough and swift advances in AI research, even state-of-the-art Large Language models (LLMs) continue to struggle when performing logical and mathematical reasoning. The results seem to suggest that LLMs still work as (highly advanced) data pattern identifiers, scoring poorly when attempting to generalise and solve reasoning problems the models have never previously seen or that are not close to samples presented in their training data. To address this compelling concern, this paper makes use of the notion of critical questions from the literature on argumentation theory, focusing in particular on Toulmin’s model of argumentation. We show that employing these critical questions can improve the reasoning capabilities of LLMs. By probing the rationale behind the models’ reasoning process, the LLM can assess whether some logical mistake is occurring and correct it before providing the final reply to the user prompt. The underlying idea is drawn from the gold standard of any valid argumentative procedure: the conclusion is valid if it is entailed by accepted premises. Or, to paraphrase such Aristotelian principle in a real-world approximation, characterised by incomplete information and presumptive logic, the conclusion is valid if not proved otherwise. This approach successfully steers the models’ output through a reasoning pipeline, resulting in better performance against the baseline and its Chain-of-Thought (CoT) implementation. To this end, an extensive evaluation of the proposed approach on the MT-Bench Reasoning and Math tasks across a range of LLMs is provided.

[NLP-6] Prompt-A-Video: Prompt Your Video Diffusion Model via Preference-Aligned LLM

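[Quick read]: This paper addresses the fact that user-provided text prompts for text-to-video models often need multiple rounds of revision and inference before they yield high-quality videos. Current automatic prompt-refinement methods face Modality-Inconsistency, Cost-Discrepancy, and Model-Unaware challenges when applied to text-to-video diffusion models. The proposed solution is Prompt-A-Video, an LLM-based prompt adaptation framework that produces video-centric, labor-free, and preference-aligned prompts. Its core is a two-stage optimization and alignment system: first, a reward-guided prompt evolution pipeline automatically builds a pool of optimal prompts, which is used for supervised fine-tuning (SFT) of the LLM; then multi-dimensional rewards are used to generate pairwise data, and direct preference optimization (DPO) further improves preference alignment.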

Link: https://arxiv.org/abs/2412.15156
Authors: Yatai Ji, Jiacheng Zhang, Jie Wu, Shilong Zhang, Shoufa Chen, Chongjian GE, Peize Sun, Weifeng Chen, Wenqi Shao, Xuefeng Xiao, Weilin Huang, Ping Luo
Affiliations: The University of Hong Kong; ByteDance; Shanghai AI Lab
Keywords: high-quality text-video pairs, made remarkable advancements, textual prompts play, text-video pairs, made remarkable
Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Multimedia (cs.MM)
Comments:

Abstract:Text-to-video models have made remarkable advancements through optimization on high-quality text-video pairs, where the textual prompts play a pivotal role in determining quality of output videos. However, achieving the desired output often entails multiple revisions and iterative inference to refine user-provided prompts. Current automatic methods for refining prompts encounter challenges such as Modality-Inconsistency, Cost-Discrepancy, and Model-Unaware when applied to text-to-video diffusion models. To address these problem, we introduce an LLM-based prompt adaptation framework, termed as Prompt-A-Video, which excels in crafting Video-Centric, Labor-Free and Preference-Aligned prompts tailored to specific video diffusion model. Our approach involves a meticulously crafted two-stage optimization and alignment system. Initially, we conduct a reward-guided prompt evolution pipeline to automatically create optimal prompts pool and leverage them for supervised fine-tuning (SFT) of the LLM. Then multi-dimensional rewards are employed to generate pairwise data for the SFT model, followed by the direct preference optimization (DPO) algorithm to further facilitate preference alignment. Through extensive experimentation and comparative analyses, we validate the effectiveness of Prompt-A-Video across diverse generation models, highlighting its potential to push the boundaries of video generation.
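The second stage relies on direct preference optimization (DPO). As a rough illustration, the sketch below computes the standard DPO loss for a single (chosen, rejected) pair from sequence log-probabilities under the policy and a frozen reference model. The function name `dpo_loss`, the value of `beta`, and all numbers are assumptions made for illustration; this is not the paper's training code, and how the paper builds its pairs from multi-dimensional rewards is not shown.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for one preference pair, given summed sequence log-probabilities."""
    # How much more the policy favours each response than the frozen reference does.
    chosen_shift = policy_chosen_logp - ref_chosen_logp
    rejected_shift = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_shift - rejected_shift)
    # -log(sigmoid(margin)) written in a numerically stable form.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


# Illustrative numbers only: log-probabilities of a preferred and a rejected prompt rewrite.
untrained = dpo_loss(-5.0, -7.0, -5.0, -7.0)  # policy identical to the reference
improved = dpo_loss(-4.0, -8.0, -5.0, -7.0)   # policy now favours the chosen rewrite
```

The loss starts at log 2 when the policy matches the reference and falls as the policy assigns relatively more probability to the preferred response.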

[NLP-7] Language Models as Continuous Self-Evolving Data Engineers

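[Quick read]: This paper addresses the limits on LLM improvement caused by a shortage of high-quality training data and over-reliance on expert-labeled data. It proposes LANCE, a paradigm in which LLMs autonomously generate, clean, review, and annotate data with preference information and then train on it. This substantially reduces the time and cost of constructing post-training data, and iterative fine-tuning on Qwen2 variants shows that it keeps improving model performance across tasks while maintaining high-quality data generation. LANCE reduces reliance on human experts or external models and keeps the data aligned with human values and preferences, which the authors present as a step toward future superintelligent systems.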

Link: https://arxiv.org/abs/2412.15151
Authors: Peidong Wang, Ming Wang, Zhiming Ma, Xiaocui Yang, Shi Feng, Daling Wang, Yifei Zhang
Affiliations: Northeastern University; China Mobile Internet Company Limited
Keywords: Large Language Models, Large Language, demonstrated remarkable capabilities, Language Models, demonstrated remarkable
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:Large Language Models (LLMs) have demonstrated remarkable capabilities on various tasks, while the further evolvement is limited to the lack of high-quality training data. In addition, traditional training approaches rely too much on expert-labeled data, setting an upper limit on the performance of LLMs. To address this issue, we propose a novel paradigm that enables LLMs to train itself by autonomously generating, cleaning, reviewing, and annotating data with preference information, named LANCE. Our approach demonstrates that LLMs can serve as continuous self-evolving data engineers, significantly reducing the time and cost of the post-training data construction process. Through iterative fine-tuning on different variants of the Qwen2, we validate the effectiveness of LANCE across various tasks, showing that it can continuously improve model performance and maintain high-quality data generation. Across eight benchmark dimensions, LANCE resulted in an average score enhancement of 3.36 for Qwen2-7B and 2.70 for Qwen2-7B-Instruct. This training paradigm with autonomous data construction not only reduces the reliance on human experts or external models but also ensures that the data aligns with human values and preferences, paving the way for the development of future superintelligent systems that can exceed human capabilities.

[NLP-8] Adaptive Pruning for Large Language Models with Structural Importance Awareness

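[Quick read]: This paper addresses the difficulty of deploying large language models (LLMs) on resource-constrained edge devices, which stems from their high computational and storage demands. It proposes a pruning method called structurally-aware adaptive pruning (SAAP). The method defines an adaptive importance fusion metric that evaluates the importance of all coupled structures in an LLM by considering their homoscedastic uncertainty, then ranks the modules to determine which layers to prune. It also introduces a group fine-tuning strategy to improve inference efficiency. Experiments show that SAAP outperforms several state-of-the-art baselines on zero-shot classification and text generation, with accuracy gains of 2.17%, 2.37%, and 2.39% on LLaMA-7B, Vicuna-7B, and LLaMA-13B, and a 5% improvement in token generation speed.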

Link: https://arxiv.org/abs/2412.15127
Authors: Haotian Zheng, Jinke Ren, Yushan Sun, Ruichen Zhang, Wenbo Zhang, Zhen Li, Dusit Niyato, Shuguang Cui, Yatong Han
Affiliations: National Key Laboratory of Autonomous Marine Vehicle Technology, Harbin Engineering University, Harbin 150001, China; Shenzhen Future Network of Intelligence Institute (FNii-Shenzhen), The Chinese University of Hong Kong, Shenzhen 518172, China; School of Science and Engineering (SSE); Guangdong Provincial Key Laboratory of Future Networks of Intelligence; College of Computing and Data Science, Nanyang Technological University, Singapore; Aerospace Science and Industry Shenzhen (Group) Co., Ltd, Shenzhen 518048, China; Infused Synapse AI, Shenzhen 518048, China
Keywords: improved language understanding, large language models, significantly improved language, large language, improved language
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 12 pages, 6 figures, 12 tables

Abstract:The recent advancements in large language models (LLMs) have significantly improved language understanding and generation capabilities. However, it is difficult to deploy LLMs on resource-constrained edge devices due to their high computational and storage resource demands. To address this issue, we propose a novel LLM model pruning method, namely structurally-aware adaptive pruning (SAAP), to significantly reduce the computational and memory costs while maintaining model performance. We first define an adaptive importance fusion metric to evaluate the importance of all coupled structures in LLMs by considering their homoscedastic uncertainty. Then, we rank the importance of all modules to determine the specific layers that should be pruned to meet particular performance requirements. Furthermore, we develop a new group fine-tuning strategy to improve the inference efficiency of LLMs. Finally, we evaluate the proposed SAAP method on multiple LLMs across two common tasks, i.e., zero-shot classification and text generation. Experimental results show that our SAAP method outperforms several state-of-the-art baseline methods, achieving 2.17%, 2.37%, and 2.39% accuracy gains on LLaMA-7B, Vicuna-7B, and LLaMA-13B. Additionally, SAAP improves the token generation speed by 5%, showcasing its practical advantages in resource-constrained scenarios.

[NLP-9] Outcome-Refining Process Supervision for Code Generation

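[Quick read]: This paper addresses the poor performance of large language models on complex programming tasks that require deep algorithmic reasoning. It proposes Outcome-Refining Process Supervision, a paradigm that treats outcome refinement itself as the process to be supervised. The framework uses concrete execution signals to ground the supervision of reasoning steps and tree-structured exploration to maintain multiple solution trajectories at once. It does not require training process reward models (PRMs). Across 5 models and 3 datasets, it yields an average increase of 26.9% in correctness and 42.2% in efficiency on competitive programming tasks, including for smaller models.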

Link: https://arxiv.org/abs/2412.15118
Authors: Zhuohao Yu, Weizheng Gu, Yidong Wang, Zhengran Zeng, Jindong Wang, Wei Ye, Shikun Zhang
Affiliations: Peking University; Microsoft Research
Keywords: Large Language Models, Large Language, demonstrated remarkable capabilities, require deep algorithmic, deep algorithmic reasoning
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Software Engineering (cs.SE)
Comments: 18 pages, 5 figures, Code: this https URL

Abstract:Large Language Models have demonstrated remarkable capabilities in code generation, yet they often struggle with complex programming tasks that require deep algorithmic reasoning. While process supervision through learned reward models shows promise in guiding reasoning steps, it requires expensive training data and suffers from unreliable evaluation. We propose Outcome-Refining Process Supervision, a novel paradigm that treats outcome refinement itself as the process to be supervised. Our framework leverages concrete execution signals to ground the supervision of reasoning steps, while using tree-structured exploration to maintain multiple solution trajectories simultaneously. Experiments demonstrate that our approach enables even smaller models to achieve high success accuracy and performance metrics on competitive programming tasks, creates more reliable verification than traditional reward models without requiring training PRMs. Our approach achieves significant improvements across 5 models and 3 datasets: an average of 26.9% increase in correctness and 42.2% in efficiency. The results suggest that providing structured reasoning space with concrete verification signals is crucial for solving complex programming tasks. We open-source all our code and data at: this https URL
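As a loose illustration of grounding supervision in execution signals while keeping several trajectories alive, the sketch below runs candidate solutions against test cases and retains the best few. It is not the paper's algorithm: the names `pass_rate` and `keep_best`, the beam width, and the toy candidates are all hypothetical, and a real system would execute generated code in a sandbox and refine the survivors further.

```python
def pass_rate(func, test_cases):
    """Fraction of (args, expected) cases a candidate passes; exceptions count as failures."""
    passed = 0
    for args, expected in test_cases:
        try:
            if func(*args) == expected:
                passed += 1
        except Exception:
            pass  # a crashing candidate simply earns no credit for this case
    return passed / len(test_cases)


def keep_best(candidates, test_cases, beam_width=2):
    """Rank (name, func) candidates by execution results and keep the top `beam_width`."""
    ranked = sorted(candidates, key=lambda c: pass_rate(c[1], test_cases), reverse=True)
    return [name for name, _ in ranked[:beam_width]]


# Toy task: absolute value. Candidates stand in for model-generated solutions.
test_cases = [((3,), 3), ((-2,), 2), ((0,), 0)]
candidates = [
    ("always_zero", lambda x: 0),
    ("identity", lambda x: x),
    ("crash", lambda x: 1 // 0),
    ("abs_builtin", abs),
]
survivors = keep_best(candidates, test_cases)
```

The point of the sketch is only that the ranking signal comes from running code, not from a learned reward model.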

[NLP-10] Qwen2.5 Technical Report

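[Quick read]: This report presents Qwen2.5, a series of large language models (LLMs) designed to meet diverse needs. Pre-training data was scaled from 7 trillion to 18 trillion high-quality tokens; post-training combines supervised fine-tuning on over 1 million samples with multistage reinforcement learning. Together these improve common sense, expert knowledge, and reasoning, as well as long text generation, structured data analysis, and instruction following. The series comes in many sizes, with open-weight base and instruction-tuned models (quantized versions available) and proprietary mixture-of-experts (MoE) variants, Qwen2.5-Turbo and Qwen2.5-Plus, aimed at cost-effectiveness. The open-weight flagship Qwen2.5-72B-Instruct outperforms a number of open and proprietary models and is competitive with Llama-3-405B-Instruct, which is around 5 times larger.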

Link: https://arxiv.org/abs/2412.15115
Authors: Qwen: An Yang, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chengyuan Li, Dayiheng Liu, Fei Huang, Haoran Wei, Huan Lin, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Yang, Jiaxi Yang, Jingren Zhou, Junyang Lin, Kai Dang, Keming Lu, Keqin Bao, Kexin Yang, Le Yu, Mei Li, Mingfeng Xue, Pei Zhang, Qin Zhu, Rui Men, Runji Lin, Tianhao Li, Tingyu Xia, Xingzhang Ren, Xuancheng Ren, Yang Fan, Yang Su, Yichang Zhang, Yu Wan, Yuqiong Liu, Zeyu Cui, Zhenru Zhang, Zihan Qiu (additional authors not shown)
Affiliations: unknown
Keywords: designed to meet, models, large language models, trillion tokens, comprehensive series
Categories: Computation and Language (cs.CL)
Comments:

Abstract:In this report, we introduce Qwen2.5, a comprehensive series of large language models (LLMs) designed to meet diverse needs. Compared to previous iterations, Qwen 2.5 has been significantly improved during both the pre-training and post-training stages. In terms of pre-training, we have scaled the high-quality pre-training datasets from the previous 7 trillion tokens to 18 trillion tokens. This provides a strong foundation for common sense, expert knowledge, and reasoning capabilities. In terms of post-training, we implement intricate supervised finetuning with over 1 million samples, as well as multistage reinforcement learning. Post-training techniques enhance human preference, and notably improve long text generation, structural data analysis, and instruction following. To handle diverse and varied use cases effectively, we present Qwen2.5 LLM series in rich sizes. Open-weight offerings include base and instruction-tuned models, with quantized versions available. In addition, for hosted solutions, the proprietary models currently include two mixture-of-experts (MoE) variants: Qwen2.5-Turbo and Qwen2.5-Plus, both available from Alibaba Cloud Model Studio. Qwen2.5 has demonstrated top-tier performance on a wide range of benchmarks evaluating language understanding, reasoning, mathematics, coding, human preference alignment, etc. Specifically, the open-weight flagship Qwen2.5-72B-Instruct outperforms a number of open and proprietary models and demonstrates competitive performance to the state-of-the-art open-weight model, Llama-3-405B-Instruct, which is around 5 times larger. Qwen2.5-Turbo and Qwen2.5-Plus offer superior cost-effectiveness while performing competitively against GPT-4o-mini and GPT-4o respectively. Additionally, as the foundation, Qwen2.5 models have been instrumental in training specialized models such as Qwen2.5-Math, Qwen2.5-Coder, QwQ, and multimodal models.

[NLP-11] Associative memory inspires improvements for in-context learning using a novel attention residual stream architecture

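[Quick read]: This paper asks how to achieve more efficient in-context learning (ICL) in large language models (LLMs). It introduces a residual stream architecture that lets information flow directly between attention heads, so that ICL ability emerges faster during training. Building on the resemblance between the attention mechanism and modern associative memory models, the authors design an associative memory model capable of ICL, then test the architecture in a two-layer Transformer and in small language models with 8 million parameters. The results indicate improved ICL performance at both scales.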

Link: https://arxiv.org/abs/2412.15113
Authors: Thomas F Burns, Tomoki Fukai, Christopher J Earls
Affiliations: SciAI Center, Cornell University, USA; Neural Coding and Brain Computing Unit, OIST, Japan
Keywords: Large language models, Large language, input sequences, sequences to appropriately, appropriately respond
Categories: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 18 pages, 6 figures, 3 tables

Abstract:Large language models (LLMs) demonstrate an impressive ability to utilise information within the context of their input sequences to appropriately respond to data unseen by the LLM during its training procedure. This ability is known as in-context learning (ICL). Humans and non-human animals demonstrate similar abilities, however their neural architectures differ substantially from LLMs. Despite this, a critical component within LLMs, the attention mechanism, resembles modern associative memory models, widely used in and influenced by the computational neuroscience community to model biological memory systems. Using this connection, we introduce an associative memory model capable of performing ICL. We use this as inspiration for a novel residual stream architecture which allows information to directly flow between attention heads. We test this architecture during training within a two-layer Transformer and show its ICL abilities manifest more quickly than without this modification. We then apply our architecture in small language models with 8 million parameters, focusing on attention head values, with results also indicating improved ICL performance at this larger and more naturalistic scale.
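The resemblance between attention and modern associative memory can be seen in a few lines. The sketch below performs one retrieval step of a dense associative memory: the new state is a softmax-weighted average of the stored patterns, which has the same form as softmax attention with the stored patterns acting as both keys and values. The patterns, the query, and the inverse temperature `beta` are illustrative assumptions; this is not the paper's ICL model or its residual stream architecture.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def retrieve(patterns, query, beta=4.0):
    """One associative-memory update: a softmax-weighted average of stored patterns."""
    # Similarity of the query to every stored pattern (dot product, scaled by beta).
    scores = [beta * sum(p_d * q_d for p_d, q_d in zip(p, query)) for p in patterns]
    weights = softmax(scores)
    return [sum(w * p[d] for w, p in zip(weights, patterns)) for d in range(len(query))]


patterns = [
    [1.0, 1.0, 1.0, 1.0, 1.0, 1.0],
    [1.0, -1.0, 1.0, -1.0, 1.0, -1.0],
    [-1.0, -1.0, -1.0, 1.0, 1.0, 1.0],
]
noisy_query = [1.0, 1.0, 1.0, 1.0, 1.0, -1.0]  # first pattern with its last entry flipped
recalled = retrieve(patterns, noisy_query)
recalled_signs = [1.0 if v > 0 else -1.0 for v in recalled]
```

Here the corrupted query is pulled back toward the closest stored pattern, which is the error-correcting behaviour that associative memories are known for.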

[NLP-12] Review-Then-Refine: A Dynamic Framework for Multi-Hop Question Answering with Temporal Adaptability

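[Quick read]: This paper addresses multi-hop question answering (QA) that involves temporal information, where existing Retrieve-augmented Generation (RAG) frameworks have difficulty retrieving and synthesizing accurate time-related information. It proposes a review-then-refine framework. In the review phase, decomposed sub-queries are dynamically rewritten with temporal information, enabling adaptive retrieval and reasoning; an adaptive retrieval mechanism minimizes unnecessary retrievals and so reduces the risk of hallucinations. In the refine phase, the LLM combines the retrieved information with its internal knowledge to produce a coherent answer. Experiments on multiple datasets show that the framework improves multi-hop QA.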

Link: https://arxiv.org/abs/2412.15101
Authors: Xiangsen Chen, Xuming Hu, Nan Tang
Affiliations: Hong Kong University of Science and Technology (Guangzhou)
Keywords: large language models, enables large language, Retrieve-augmented generation, multi-hop question answering, inherent knowledge deficiencies
Categories: Computation and Language (cs.CL)
Comments: 20 pages, 2 figures

Abstract:Retrieve-augmented generation (RAG) frameworks have emerged as a promising solution to multi-hop question answering(QA) tasks since it enables large language models (LLMs) to incorporate external knowledge and mitigate their inherent knowledge deficiencies. Despite this progress, existing RAG frameworks, which usually follows the retrieve-then-read paradigm, often struggle with multi-hop QA with temporal information since it has difficulty retrieving and synthesizing accurate time-related information. To address the challenge, this paper proposes a novel framework called review-then-refine, which aims to enhance LLM performance in multi-hop QA scenarios with temporal information. Our approach begins with a review phase, where decomposed sub-queries are dynamically rewritten with temporal information, allowing for subsequent adaptive retrieval and reasoning process. In addition, we implement adaptive retrieval mechanism to minimize unnecessary retrievals, thus reducing the potential for hallucinations. In the subsequent refine phase, the LLM synthesizes the retrieved information from each sub-query along with its internal knowledge to formulate a coherent answer. Extensive experimental results across multiple datasets demonstrate the effectiveness of our proposed framework, highlighting its potential to significantly improve multi-hop QA capabilities in LLMs.

[NLP-13] A Cross-Domain Study of the Use of Persuasion Techniques in Online Disinformation

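[Quick read]: This paper addresses the lack of a comprehensive understanding of the persuasion techniques used by disinformation across domains. It applies a state-of-the-art persuasion technique classifier to a large-scale, multi-domain analysis of 16 persuasion techniques. The analysis shows that different techniques are used disproportionately in different disinformation domains, and a detailed case study on climate change disinformation shows how linguistic, psychological, and cultural factors shape the adaptation of persuasion strategies to specific thematic contexts.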

Link: https://arxiv.org/abs/2412.15098
Authors: João A. Leite, Olesya Razuvayevskaya, Carolina Scarton, Kalina Bontcheva
Affiliations: University of Sheffield
Keywords: manipulate public opinion, employing advanced persuasion, advanced persuasion techniques, persuasion techniques, aims to deceive
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:

Abstract:Disinformation, irrespective of domain or language, aims to deceive or manipulate public opinion, typically through employing advanced persuasion techniques. Qualitative and quantitative research on the weaponisation of persuasion techniques in disinformation has been mostly topic-specific (e.g., COVID-19) with limited cross-domain studies, resulting in a lack of comprehensive understanding of these strategies. This study employs a state-of-the-art persuasion technique classifier to conduct a large-scale, multi-domain analysis of the role of 16 persuasion techniques in disinformation narratives. It shows how different persuasion techniques are employed disproportionately in different disinformation domains. We also include a detailed case study on climate change disinformation, highlighting how linguistic, psychological, and cultural factors shape the adaptation of persuasion strategies to fit unique thematic contexts.

[NLP-14] AceMath: Advancing Frontier Math Reasoning with Post-Training and Reward Modeling

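[Quick read]: This paper targets solving and evaluating complex math problems. It introduces the AceMath suite, which includes instruction-tuned math models (such as AceMath-72B-Instruct) and reward models (such as AceMath-72B-RM). The instruction-tuned models are built with a supervised fine-tuning (SFT) process that first reaches competitive performance across general domains and then fine-tunes for the math domain using a carefully curated set of prompts and synthetically generated responses. The authors also construct AceMath-RewardBench, a benchmark for evaluating math reward models across diverse problems and difficulty levels. Combining the instruction-tuned model with the reward model gives the highest average rm@8 score across the math reasoning benchmarks.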

Link: https://arxiv.org/abs/2412.15084
Authors: Zihan Liu, Yang Chen, Mohammad Shoeybi, Bryan Catanzaro, Wei Ping
Affiliations: unknown
Keywords: highly effective reward, solving complex math, evaluating generated solutions, introduce AceMath, effective reward models
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:In this paper, we introduce AceMath, a suite of frontier math models that excel in solving complex math problems, along with highly effective reward models capable of evaluating generated solutions and reliably identifying the correct ones. To develop the instruction-tuned math models, we propose a supervised fine-tuning (SFT) process that first achieves competitive performance across general domains, followed by targeted fine-tuning for the math domain using a carefully curated set of prompts and synthetically generated responses. The resulting model, AceMath-72B-Instruct greatly outperforms Qwen2.5-Math-72B-Instruct, GPT-4o and Claude-3.5 Sonnet. To develop math-specialized reward model, we first construct AceMath-RewardBench, a comprehensive and robust benchmark for evaluating math reward models across diverse problems and difficulty levels. After that, we present a systematic approach to build our math reward models. The resulting model, AceMath-72B-RM, consistently outperforms state-of-the-art reward models. Furthermore, when combining AceMath-72B-Instruct with AceMath-72B-RM, we achieve the highest average rm@8 score across the math reasoning benchmarks. We will release model weights, training data, and evaluation benchmarks at: this https URL
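The abstract does not define rm@8. The sketch below assumes the common best-of-n reading: sample 8 responses per problem, let the reward model pick the highest-scoring one, and count the problem as solved if that pick is correct. The function name `rm_at_n` and the toy scores are hypothetical and stand in for real model outputs.

```python
def rm_at_n(problems, n=8):
    """Fraction of problems whose highest-reward sample (among the first n) is correct."""
    hits = 0
    for samples in problems:
        # Each sample is a (reward_score, is_correct) pair; both are assumed inputs.
        best = max(samples[:n], key=lambda sample: sample[0])
        hits += int(best[1])
    return hits / len(problems)


# Toy data: two problems with two sampled responses each (illustrative values only).
problems = [
    [(0.2, False), (0.9, True)],  # reward model ranks the correct answer first
    [(0.7, False), (0.4, True)],  # reward model prefers a wrong answer
]
score = rm_at_n(problems)
```

Under this reading the metric measures the generator and the reward model jointly: a correct sample only helps if the reward model also ranks it first.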

[NLP-15] Till the Layers Collapse: Compressing a Deep Neural Network through the Lenses of Batch Normalization Layers AAAI2025

Quick read: This paper addresses the high computational cost and latency caused by overparameterization in deep neural networks. The key to the approach is a method called Till the Layers Collapse (TLC), which compresses networks through the lens of their batch normalization layers, reducing network depth and thereby lowering computational requirements and overall latency. The method is validated on popular models such as Swin-T, MobileNet-V2, and RoBERTa, across image classification and natural language processing (NLP) tasks.

Link: https://arxiv.org/abs/2412.15077
Authors: Zhu Liao, Nour Hezbri, Victor Quétu, Van-Tam Nguyen, Enzo Tartaglione
Affiliations: 1. Université Paris-Saclay; 2. Université Paris Cité
Keywords: deep neural networks, handle a variety, variety of complex, deep neural, Today
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted at AAAI 2025

Abstract:Today, deep neural networks are widely used since they can handle a variety of complex tasks. Their generality makes them very powerful tools in modern technology. However, deep neural networks are often overparameterized. The usage of these large models consumes a lot of computation resources. In this paper, we introduce a method called Till the Layers Collapse (TLC), which compresses deep neural networks through the lenses of batch normalization layers. By reducing the depth of these networks, our method decreases deep neural networks’ computational requirements and overall latency. We validate our method on popular models such as Swin-T, MobileNet-V2, and RoBERTa, across both image classification and natural language processing (NLP) tasks.

[NLP-16] ConfliBERT: A Language Model for Political Conflict

Quick read: This paper addresses the extraction of information about political violence from news reports and texts, moving beyond rigid rule-based approaches by using recent natural language processing techniques. It reviews the ConfliBERT language model (Hu et al. 2022), which can extract actor and action classifications from texts about political conflict. When fine-tuned, ConfliBERT outperforms other large language models (LLMs) such as Google's Gemma 2, Meta's Llama 3.1, and Alibaba's Qwen 2.5 in accuracy, precision, and recall within its relevant domains, and it is hundreds of times faster.

Link: https://arxiv.org/abs/2412.15060
Authors: Patrick T. Brandt, Sultan Alsarra, Vito J. D'Orazio, Dagmar Heintze, Latifur Khan, Shreyas Meher, Javier Osorio, Marcus Sianan
Affiliations: UT Dallas; King Saud University; West Virginia University; University of Arizona
Keywords: Natural Language Processing, rule-based approaches, rigid rule-based approaches, Recent Natural Language, Language Processing developments
Categories: Computation and Language (cs.CL)
Comments: 30 pages, 4 figures, 5 tables

Abstract:Conflict scholars have used rule-based approaches to extract information about political violence from news reports and texts. Recent Natural Language Processing developments move beyond rigid rule-based approaches. We review our recent ConfliBERT language model (Hu et al. 2022) to process political and violence related texts. The model can be used to extract actor and action classifications from texts about political conflict. When fine-tuned, results show that ConfliBERT has superior performance in accuracy, precision and recall over other large language models (LLM) like Google’s Gemma 2 (9B), Meta’s Llama 3.1 (7B), and Alibaba’s Qwen 2.5 (14B) within its relevant domains. It is also hundreds of times faster than these more generalist LLMs. These results are illustrated using texts from the BBC, re3d, and the Global Terrorism Dataset (GTD).

[NLP-17] LLMs Lost in Translation: M-ALERT uncovers Cross-Linguistic Safety Gaps

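Quick read: This paper addresses the safety of large language models (LLMs) in multilingual settings, in particular whether models behave consistently and safely across languages. The key to the approach is M-ALERT, a multilingual benchmark that evaluates LLM safety in five languages: English, French, German, Italian, and Spanish. M-ALERT contains 75k high-quality prompts (15k per language) that follow the detailed ALERT taxonomy. Extensive experiments on 10 state-of-the-art LLMs reveal significant inconsistencies in safety across languages and categories. The paper stresses the importance of language-specific safety analysis and notes that certain categories, such as substance_cannabis and crime_propaganda, consistently trigger unsafe responses across models and languages, which underscores the need for robust multilingual safety practices in LLMs.

Link: https://arxiv.org/abs/2412.15035
Authors: Felix Friedrich, Simone Tedeschi, Patrick Schramowski, Manuel Brack, Roberto Navigli, Huu Nguyen, Bo Li, Kristian Kersting
Affiliations: TU Darmstadt; Hessian.AI; Ontocord.AI; Sapienza University of Rome; DFKI; CERTAIN; University of Chicago; UIUC; Virtue.ai
Keywords: Building safe Large, Large Language Models, safe Large Language, linguistic diversity, Large Language
Categories: Computation and Language (cs.CL)
Comments: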

Abstract:Building safe Large Language Models (LLMs) across multiple languages is essential in ensuring both safe access and linguistic diversity. To this end, we introduce M-ALERT, a multilingual benchmark that evaluates the safety of LLMs in five languages: English, French, German, Italian, and Spanish. M-ALERT includes 15k high-quality prompts per language, totaling 75k, following the detailed ALERT taxonomy. Our extensive experiments on 10 state-of-the-art LLMs highlight the importance of language-specific safety analysis, revealing that models often exhibit significant inconsistencies in safety across languages and categories. For instance, Llama3.2 shows high unsafety in the category crime_tax for Italian but remains safe in other languages. Similar differences can be observed across all models. In contrast, certain categories, such as substance_cannabis and crime_propaganda, consistently trigger unsafe responses across models and languages. These findings underscore the need for robust multilingual safety practices in LLMs to ensure safe and responsible usage across diverse user communities.

[NLP-18] Large Language Models and Code Security: A Systematic Literature Review

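Quick read: This paper addresses the security risks that arise when large language models (LLMs) are used for programming tasks, including security-related ones. The key to the approach is a systematic literature review that examines the types of vulnerabilities LLMs may introduce when producing code, analyzes their ability to detect and fix vulnerabilities, and studies how the choice of prompting strategy affects performance on these two tasks. The paper also gives an in-depth analysis of how data poisoning attacks on LLMs can affect performance on these tasks, with the aim of understanding both the security benefits and the potential drawbacks of LLMs in code-related tasks.

Link: https://arxiv.org/abs/2412.15004
Authors: Enna Basic, Alberto Giaretta
Affiliations: Epiroc Rock Drills AB, 70244 Örebro; Department of Computer Science, Örebro University, 70281 Örebro, Sweden
Keywords: Large Language Models, Large Language, Language Models, including security-related, emerged as powerful
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: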

Abstract:Large Language Models (LLMs) have emerged as powerful tools for automating various programming tasks, including security-related ones, such as detecting and fixing vulnerabilities. Despite their promising capabilities, when required to produce or modify pre-existing code, LLMs could introduce vulnerabilities unbeknown to the programmer. When analyzing code, they could miss clear vulnerabilities or signal nonexistent ones. In this Systematic Literature Review (SLR), we aim to investigate both the security benefits and potential drawbacks of using LLMs for a variety of code-related tasks. In particular, first we focus on the types of vulnerabilities that could be introduced by LLMs, when used for producing code. Second, we analyze the capabilities of LLMs to detect and fix vulnerabilities, in any given code, and how the prompting strategy of choice impacts their performance in these two tasks. Last, we provide an in-depth analysis on how data poisoning attacks on LLMs can impact performance in the aforementioned tasks.

[NLP-19] Chain-of-MetaWriting: Linguistic and Textual Analysis of How Small Language Models Write Young Students Texts COLING2025

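Quick read: This paper starts from the observation that language models have neither a meta-representation of the text writing process nor inherent communication learning needs comparable to those of young human students. The key to the approach is Chain-of-MetaWriting, which accompanies a fine-grained linguistic and textual analysis of the writing of multilingual Small Language Models (SLMs) and lets them imitate some steps of the human writing process, such as planning and evaluation. The work focuses mainly on short story and essay writing tasks in French, for schoolchildren and undergraduate students respectively. The results show that SLMs have difficulty assisting young students on sensitive topics such as violence in the schoolyard, and that their output differs markedly from human-written texts in temporal connectors, topic progression, and reference.

Link: https://arxiv.org/abs/2412.14986
Authors: Ioana Buhnila, Georgeta Cislaru, Amalia Todirascu
Affiliations: ATILF UMR 7118, CNRS - University of Lorraine; EA CLESTHIA, Sorbonne Nouvelle University & Institut Universitaire de France; LiLPa UR 1339, University of Strasbourg
Keywords: Large Language Models, Large Language, Small Language Models’, Language Models, multilingual Small Language
Categories: Computation and Language (cs.CL)
Comments: Accepted at WRAICOGS 2025 (Writing Aids at the Crossroads of AI, Cognitive Science, and NLP) co-located with COLING 2025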

Abstract:Large Language Models (LLMs) have been used to generate texts in response to different writing tasks: reports, essays, storytelling. However, language models do not have a meta-representation of the text writing process, nor inherent communication learning needs, comparable to those of young human students. This paper introduces a fine-grained linguistic and textual analysis of multilingual Small Language Models’ (SLMs) writing. With our method, Chain-of-MetaWriting, SLMs can imitate some steps of the human writing process, such as planning and evaluation. We mainly focused on short story and essay writing tasks in French for schoolchildren and undergraduate students respectively. Our results show that SLMs encounter difficulties in assisting young students on sensitive topics such as violence in the schoolyard, and they sometimes use words too complex for the target audience. In particular, the output is quite different from the human produced texts in terms of text cohesion and coherence regarding temporal connectors, topic progression, reference.

[NLP-20] Movie2Story: A framework for understanding videos and telling stories in the form of novel text

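Quick read: This paper addresses the weakness of multimodal video-to-text models at generating rich long-form text descriptions, in particular descriptions that integrate both video and audio. The key to the approach is a framework called M2S, which combines video long-form text description and comprehension, audio-based analysis of emotion, speech rate, and character alignment, and visual-based character recognition alignment. The multimodal information is integrated with the large language model GPT4o to generate novel-length text.

Link: https://arxiv.org/abs/2412.14965
Authors: Kangning Li, Zheyang Jia, Anyu Ying
Affiliations: University of California, Los Angeles; The University of Hong Kong
Keywords: made considerable progress, considerable progress, made considerable, primarily in generating, video content
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: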

Abstract:Multimodal video-to-text models have made considerable progress, primarily in generating brief descriptions of video content. However, there is still a deficiency in generating rich long-form text descriptions that integrate both video and audio. In this paper, we introduce a framework called M2S, designed to generate novel-length text by combining audio, video, and character recognition. M2S includes modules for video long-form text description and comprehension, audio-based analysis of emotion, speech rate, and character alignment, and visual-based character recognition alignment. By integrating multimodal information using the large language model GPT4o, M2S stands out in the field of multimodal text generation. We demonstrate the effectiveness and accuracy of M2S through comparative experiments and human evaluation. Additionally, the model framework has good scalability and significant potential for future research.

[NLP-21] Knowledge Injection via Prompt Distillation

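Quick read: This paper addresses how to inject new knowledge into large language models (LLMs) through fine-tuning so that the result matches the performance of retrieval-augmented generation (RAG). The key to the approach is a new fine-tuning technique called prompt distillation. Question-answer pairs about the new knowledge are generated first. A student model is then fine-tuned on these pairs to imitate the output distributions of a teacher model that receives the new knowledge in its prompt. The student is identical to the teacher except that it is equipped with a LoRA adapter, so training distills the new knowledge from the teacher's prompt into the student's weights.

Link: https://arxiv.org/abs/2412.14964
Authors: Kalle Kujanpää, Harri Valpola, Alexander Ilin
Affiliations: Aalto University; System 2 AI
Keywords: large language models, practical applications, large language, pre-training data, knowledge
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: Preprint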

Abstract:In many practical applications, large language models (LLMs) need to incorporate new knowledge not present in their pre-training data. The primary methods for this are fine-tuning and retrieval-augmented generation (RAG). Although RAG has emerged as the industry standard for knowledge injection, fine-tuning has not yet achieved comparable success. In this paper, we propose a new fine-tuning technique for learning new knowledge and show that it can reach the performance of RAG. The proposed method is based on the self-distillation approach, which we call prompt distillation. First, we generate question-answer pairs about the new knowledge. Then, we fine-tune a student model on the question-answer pairs to imitate the output distributions of a teacher model, which additionally receives the new knowledge in its prompt. The student model is identical to the teacher, except it is equipped with a LoRA adapter. This training procedure facilitates distilling the new knowledge from the teacher’s prompt into the student’s weights.
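
The training signal described above, a student imitating the teacher's output distributions, can be sketched with a per-token KL divergence. The abstract does not name the exact loss, so the use of KL(teacher || student) here is an assumption, and the function names are hypothetical. The logits are plain Python lists standing in for real model outputs.

```python
import math
from typing import List


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax over one token position."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(teacher_logits: List[float], student_logits: List[float]) -> float:
    """KL(teacher || student) for a single token position."""
    p = softmax(teacher_logits)
    q = softmax(student_logits)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def distillation_loss(teacher_steps: List[List[float]],
                      student_steps: List[List[float]]) -> float:
    """Mean per-token KL over an answer sequence.

    teacher_steps: logits from the teacher, which sees the new knowledge in its prompt.
    student_steps: logits from the student, which sees only the question.
    """
    if not teacher_steps or len(teacher_steps) != len(student_steps):
        raise ValueError("sequences must be non-empty and of equal length")
    total = sum(kl_divergence(t, s) for t, s in zip(teacher_steps, student_steps))
    return total / len(teacher_steps)
```

In a real setup the gradient of this loss would flow only into the student's LoRA adapter weights, since the abstract states that the student is otherwise identical to the teacher.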

[NLP-22] Understanding the Dark Side of LLMs' Intrinsic Self-Correction

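Quick read: This paper aims to interpret the intrinsic self-correction of large language models (LLMs) across different tasks, especially the cases where it fails without oracle labels as feedback prompts. Using three interpretation methods, the study reveals the negative effects of intrinsic self-correction: on simple factual questions it makes LLMs waver in their intermediate and final answers and leads to prompt bias, and on complex tasks it introduces human-like cognitive bias. The paper proposes two simple yet effective mitigation strategies: question repeating and supervised fine-tuning with a few samples.

Link: https://arxiv.org/abs/2412.14959
Authors: Qingjie Zhang, Han Qiu, Di Wang, Haoting Qian, Yiming Li, Tianwei Zhang, Minlie Huang
Affiliations: Tsinghua University; Nanyang Technological University
Keywords: LLMs’ intrinsic self-correction, Intrinsic self-correction, prompts solely based, improve LLMs’ responses, feedback prompts solely
Categories: Computation and Language (cs.CL)
Comments: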

Abstract:Intrinsic self-correction was proposed to improve LLMs’ responses via feedback prompts solely based on their inherent capability. However, recent works show that LLMs’ intrinsic self-correction fails without oracle labels as feedback prompts. In this paper, we aim to interpret LLMs’ intrinsic self-correction for different tasks, especially for those failure cases. By including one simple task and three complex tasks with state-of-the-art (SOTA) LLMs like ChatGPT families (o1, 4o, 3.5-turbo) and Llama families (2-7B, 3-8B, and 3.1-8B), we design three interpretation methods to reveal the dark side of LLMs’ intrinsic self-correction. We identify that intrinsic self-correction can (1) cause LLMs to waver both intermediate and final answers and lead to prompt bias on simple factual questions; (2) introduce human-like cognitive bias on complex tasks. In light of our findings, we also provide two simple yet effective strategies for alleviation: question repeating and supervised fine-tuning with a few samples. We open-source our work at this https URL.

[NLP-23] RobustFT: Robust Supervised Fine-tuning for Large Language Models under Noisy Response

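Quick read: This paper addresses the harm that noisy data in supervised fine-tuning (SFT) does to model performance on downstream tasks in practical applications. The key to the approach is a robust SFT framework (RobustFT) that performs noise detection and relabeling on downstream task data. Noise detection uses a multi-expert collaborative system with inference-enhanced models, and denoising uses a context-enhanced strategy to generate reliable annotations. A data selection mechanism based on response entropy then ensures that only high-quality samples are retained for fine-tuning.

Link: https://arxiv.org/abs/2412.14922
Authors: Junyu Luo, Xiao Luo, Kaize Ding, Jingyang Yuan, Zhiping Xiao, Ming Zhang
Affiliations: Peking University; University of California, Los Angeles; Northwestern University; University of Washington
Keywords: adapting large language, large language models, Supervised fine-tuning, plays a crucial, crucial role
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: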

Abstract:Supervised fine-tuning (SFT) plays a crucial role in adapting large language models (LLMs) to specific domains or tasks. However, as demonstrated by empirical experiments, the collected data inevitably contains noise in practical applications, which poses significant challenges to model performance on downstream tasks. Therefore, there is an urgent need for a noise-robust SFT framework to enhance model capabilities in downstream tasks. To address this challenge, we introduce a robust SFT framework (RobustFT) that performs noise detection and relabeling on downstream task data. For noise identification, our approach employs a multi-expert collaborative system with inference-enhanced models to achieve superior noise detection. In the denoising phase, we utilize a context-enhanced strategy, which incorporates the most relevant and confident knowledge followed by careful assessment to generate reliable annotations. Additionally, we introduce an effective data selection mechanism based on response entropy, ensuring only high-quality samples are retained for fine-tuning. Extensive experiments conducted on multiple LLMs across five datasets demonstrate RobustFT’s exceptional performance in noisy scenarios.
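
The entropy-based data selection step can be illustrated with a small sketch. The abstract does not say how entropy is computed or thresholded, so two assumptions are made here: entropy is averaged over the per-token output distributions of a response, and lower entropy is treated as higher confidence. The names `mean_token_entropy`, `select_low_entropy`, and `keep_ratio` are hypothetical.

```python
import math
from typing import Dict, List


def mean_token_entropy(token_dists: List[List[float]]) -> float:
    """Average Shannon entropy (in nats) over the token distributions of one response."""
    if not token_dists:
        raise ValueError("response has no tokens")
    total = 0.0
    for dist in token_dists:
        total += -sum(p * math.log(p) for p in dist if p > 0)
    return total / len(token_dists)


def select_low_entropy(samples: Dict[str, List[List[float]]],
                       keep_ratio: float) -> List[str]:
    """Return the ids of the most confident samples (assumption: low entropy = keep)."""
    ranked = sorted(samples, key=lambda k: mean_token_entropy(samples[k]))
    keep = max(1, int(len(ranked) * keep_ratio))
    return ranked[:keep]
```

A ratio-based cut is only one possible design. A fixed entropy threshold would serve the same purpose when the noise rate of the dataset is unknown.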

[NLP-24] Dehallucinating Parallel Context Extension for Retrieval-Augmented Generation

Quick read: This paper addresses hallucination by large language models (LLMs) in retrieval-augmented generation (RAG) scenarios, specifically two types of in-context hallucination: fact fabrication and fact omission. The key to the approach is DePaC (Dehallucinating Parallel Context Extension), which mitigates both types with context-aware negative training and information-calibrated aggregation. For fact fabrication, negative training fine-tunes the LLM to refuse to answer when the contexts are unrelated to the question. For fact omission, information-calibrated aggregation prioritizes context windows with a higher information increment. Experiments on nine RAG tasks show that DePaC significantly reduces hallucination and improves task performance.

Link: https://arxiv.org/abs/2412.14905
Authors: Zexiong Ma, Shengnan An, Zeqi Lin, Yanzhen Zou, Jian-Guang Lou, Bing Xie
Affiliations: School of Computer Science, Peking University; Microsoft; Xi'an Jiaotong University
Keywords: Large language models, Large language, Parallel context extension, generating hallucinated information, Dehallucinating Parallel Context
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:Large language models (LLMs) are susceptible to generating hallucinated information, despite the integration of retrieval-augmented generation (RAG). Parallel context extension (PCE) is a line of research that attempts to effectively integrate parallel (unordered) contexts, while it still suffers from hallucinations when adapted to RAG scenarios. In this paper, we propose DePaC (Dehallucinating Parallel Context Extension), which alleviates the hallucination problem with context-aware negative training and information-calibrated aggregation. DePaC is designed to alleviate two types of in-context hallucination: fact fabrication (i.e., LLMs present claims that are not supported by the contexts) and fact omission (i.e., LLMs fail to present claims that can be supported by the contexts). Specifically, (1) for fact fabrication, we apply the context-aware negative training that fine-tunes the LLMs with negative supervisions, thus explicitly guiding the LLMs to refuse to answer when contexts are not related to questions; (2) for fact omission, we propose the information-calibrated aggregation which prioritizes context windows with higher information increment from their contexts. The experimental results on nine RAG tasks demonstrate that DePaC significantly alleviates the two types of hallucination and consistently achieves better performances on these tasks.

[NLP-25] Why language models collapse when trained on recursively generated text

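Quick read: This paper addresses the collapse of language models (LMs) trained on recursively generated text. It gives a theoretical proof that reveals the cause of LM collapse and shows that all auto-regressive LMs will collapse under such training. It also reports a new finding: the performance of LMs declines gradually during recursive training until they perform no better than a randomly initialized LM, producing large amounts of repetitive text and performing poorly across a wide range of natural language tasks. These results deepen the understanding of LM collapse and may inspire new training techniques to mitigate it.

Link: https://arxiv.org/abs/2412.14872
Authors: Lecheng Wang, Xianjie Shi, Ge Li, Jia Li, Yihong Dong, Xuanming Zhang, Wenpin Jiao, Hong Mei
Affiliations: Peking University
Keywords: generated text, recursively generated text, Internet, text, Language models
Categories: Computation and Language (cs.CL)
Comments: 28 pages, 9 figures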

Abstract:Language models (LMs) have been widely used to generate text on the Internet. The generated text is often collected into the training corpus of the next generations of LMs. Previous work has experimentally found that LMs collapse when trained on recursively generated text. This paper contributes to existing knowledge from two aspects. We present a theoretical proof of LM collapse. Our proof reveals the cause of LM collapse and proves that all auto-regressive LMs will definitely collapse. We present a new finding: the performance of LMs gradually declines when trained on recursively generated text until they perform no better than a randomly initialized LM. The trained LMs produce large amounts of repetitive text and perform poorly across a wide range of natural language tasks. The above proof and new findings deepen our understanding of LM collapse and offer valuable insights that may inspire new training techniques to mitigate this threat.
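
The recursive setup described above, where each generation is trained on text produced by the previous one, can be mimicked with a toy unigram model. This is only an illustration of the training loop, not the paper's proof or experiments, and all names are hypothetical. It tracks the number of distinct tokens per generation, which can never grow because each new corpus is sampled from the previous one.

```python
import random
from collections import Counter
from typing import List


def fit_unigram(corpus: List[str]) -> Counter:
    """'Train' a unigram model by counting token frequencies."""
    return Counter(corpus)


def sample_corpus(model: Counter, size: int, rng: random.Random) -> List[str]:
    """Generate a new corpus by sampling tokens in proportion to their counts."""
    tokens = list(model.keys())
    weights = list(model.values())
    return rng.choices(tokens, weights=weights, k=size)


def recursive_training(corpus: List[str], generations: int, seed: int = 0) -> List[int]:
    """Return the number of distinct tokens after each generation."""
    rng = random.Random(seed)
    support_sizes = [len(set(corpus))]
    for _ in range(generations):
        model = fit_unigram(corpus)
        # The next generation sees only text produced by the current model.
        corpus = sample_corpus(model, len(corpus), rng)
        support_sizes.append(len(set(corpus)))
    return support_sizes
```

A token that is missed in one round of sampling has zero count in the next model and cannot reappear, which is one simple way to see how diversity is lost under recursive training.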

[NLP-26] Graph-Convolutional Networks: Named Entity Recognition and Large Language Model Embedding in Document Clustering

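Quick read: This paper addresses the fact that current document clustering approaches often ignore the deeper relationships between named entities (NEs) and the potential of LLM embeddings. The key to the approach is a graph-based framework that integrates Named Entity Recognition (NER) and Large Language Model (LLM) embeddings. The method builds a graph whose nodes are documents and whose edges are weighted by named entity similarity, and optimizes it with a graph-convolutional network (GCN), which yields a more effective grouping of semantically related documents.

Link: https://arxiv.org/abs/2412.14867
Authors: Imed Keraghel, Mohamed Nadif
Affiliations: Centre Borelli UMR9010; Université Paris Cité
Keywords: Large Language Models, Language Models, BERT and GPT, Large Language, improve text representation
Categories: Computation and Language (cs.CL)
Comments: 11 pages, 4 figures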

Abstract:Recent advances in machine learning, particularly Large Language Models (LLMs) such as BERT and GPT, provide rich contextual embeddings that improve text representation. However, current document clustering approaches often ignore the deeper relationships between named entities (NEs) and the potential of LLM embeddings. This paper proposes a novel approach that integrates Named Entity Recognition (NER) and LLM embeddings within a graph-based framework for document clustering. The method builds a graph with nodes representing documents and edges weighted by named entity similarity, optimized using a graph-convolutional network (GCN). This ensures a more effective grouping of semantically related documents. Experimental results indicate that our approach outperforms conventional co-occurrence-based methods in clustering, notably for documents rich in named entities.
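
The graph construction step, with documents as nodes and edges weighted by named entity similarity, might look like the sketch below. The abstract does not specify the similarity measure, so Jaccard overlap of entity sets is an assumption, and the entity sets are taken as given rather than produced by a real NER model. The function names are hypothetical.

```python
from itertools import combinations
from typing import Dict, Set, Tuple


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Jaccard overlap of two entity sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def build_entity_graph(doc_entities: Dict[str, Set[str]],
                       min_weight: float = 0.0) -> Dict[Tuple[str, str], float]:
    """Return undirected weighted edges between documents that share named entities."""
    edges: Dict[Tuple[str, str], float] = {}
    for u, v in combinations(sorted(doc_entities), 2):
        w = jaccard(doc_entities[u], doc_entities[v])
        if w > min_weight:
            edges[(u, v)] = w
    return edges
```

The resulting edge weights would form the adjacency matrix passed to a GCN, with LLM embeddings of the documents as node features.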

[NLP-27] ThinkCite: Improving Attributed Text Generation with Self-Guided Tree Search and Progress Reward Modeling

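Quick read: This paper addresses the tendency of large language models (LLMs) to hallucinate and produce factually incorrect information. The key to the approach is a framework called ThinkCite, which formulates attributed text generation as a multi-step reasoning problem integrated with search. It introduces Self-Guided Monte Carlo Tree Search (SG-MCTS), which uses the self-reflection capability of LLMs to reflect on intermediate MCTS states and guide tree expansion. It also introduces Progress Reward Models, which measure the progress of the tree search in terms of both generation and attribution and so provide reliable and comprehensive feedback. Experiments on three datasets show that the approach significantly outperforms baseline approaches.

Link: https://arxiv.org/abs/2412.14860
Authors: Junyi Li, Hwee Tou Ng
Affiliations: National University of Singapore
Keywords: factually incorrect information, producing factually incorrect, large language models, outstanding capabilities, large language
Categories: Computation and Language (cs.CL)
Comments: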

Abstract:Despite their outstanding capabilities, large language models (LLMs) are prone to hallucination and producing factually incorrect information. This challenge has spurred efforts in attributed text generation, which prompts LLMs to generate content with supporting evidence. In this paper, we propose a novel framework, called ThinkCite, and formulate attributed text generation as a multi-step reasoning problem integrated with search. Specifically, we propose Self-Guided Monte Carlo Tree Search (SG-MCTS), which capitalizes on the self-reflection capability of LLMs to reflect on the intermediate states of MCTS for guiding the tree expansion process. To provide reliable and comprehensive feedback, we introduce Progress Reward Models to measure the progress of tree search from the root to the current state from two aspects, i.e., generation and attribution progress. We conduct extensive experiments on three datasets and the results show that our approach significantly outperforms baseline approaches.
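
For readers unfamiliar with Monte Carlo Tree Search, the selection step of plain MCTS is commonly implemented with the UCT rule, which balances average reward against how rarely a child has been visited. The sketch below shows that standard rule only. Whether SG-MCTS uses exactly this formula is not stated in the abstract, and the self-reflection and progress reward components are not modeled here.

```python
import math
from typing import Dict, Tuple


def uct_score(total_reward: float, visits: int, parent_visits: int,
              c: float = 1.4) -> float:
    """Standard UCT value: exploitation term plus exploration bonus."""
    if visits == 0:
        # Unvisited children are always tried first.
        return math.inf
    return total_reward / visits + c * math.sqrt(math.log(parent_visits) / visits)


def select_child(children: Dict[str, Tuple[float, int]], parent_visits: int,
                 c: float = 1.4) -> str:
    """Pick the child with the highest UCT score.

    children maps a child name to (total_reward, visits).
    """
    return max(
        children,
        key=lambda name: uct_score(children[name][0], children[name][1],
                                   parent_visits, c),
    )
```

In a reward-model-guided search, `total_reward` would accumulate scores from the reward model rather than outcomes of random rollouts.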

[NLP-28] DS2-ABSA: Dual-Stream Data Synthesis with Label Refinement for Few-Shot Aspect-Based Sentiment Analysis

Quick read: This paper addresses data scarcity and insufficient diversity in few-shot aspect-based sentiment analysis (ABSA) in low-resource scenarios. The key to the approach is DS²-ABSA, a dual-stream data synthesis framework that uses LLMs to generate diverse, high-quality ABSA samples from two complementary perspectives: key-point-driven and instance-driven. A label refinement module is integrated to improve the synthetic labels. Experiments show that DS²-ABSA significantly outperforms previous few-shot ABSA methods.

Link: https://arxiv.org/abs/2412.14849
Authors: Hongling Xu, Yice Zhang, Qianlong Wang, Ruifeng Xu
Affiliations: Harbin Institute of Technology, Shenzhen, China; Peng Cheng Laboratory, Shenzhen, China; Guangdong Provincial Key Laboratory of Novel Security Intelligence Technologies
Keywords: Recently developed large, large language models, developed large language, Recently developed, address data scarcity
Categories: Computation and Language (cs.CL)
Comments:

Abstract:Recently developed large language models (LLMs) have presented promising new avenues to address data scarcity in low-resource scenarios. In few-shot aspect-based sentiment analysis (ABSA), previous efforts have explored data augmentation techniques, which prompt LLMs to generate new samples by modifying existing ones. However, these methods fail to produce adequately diverse data, impairing their effectiveness. Besides, some studies apply in-context learning for ABSA by using specific instructions and a few selected examples as prompts. Though promising, LLMs often yield labels that deviate from task requirements. To overcome these limitations, we propose DS²-ABSA, a dual-stream data synthesis framework targeted for few-shot ABSA. It leverages LLMs to synthesize data from two complementary perspectives: key-point-driven and instance-driven, which effectively generate diverse and high-quality ABSA samples in low-resource settings. Furthermore, a label refinement module is integrated to improve the synthetic labels. Extensive experiments demonstrate that DS²-ABSA significantly outperforms previous few-shot ABSA solutions and other LLM-oriented data generation methods.

[NLP-29] A Survey of RWKV

Quick read: This paper addresses the lack of a systematic review of the RWKV model and offers a comprehensive analysis of its architecture, core principles, and applications. The review explains how RWKV merges the benefits of recurrent and attention-based systems to capture long-range dependencies with low computational demands, and how this lets it handle long sequences more efficiently and at lower computational cost than traditional Transformers.

Link: https://arxiv.org/abs/2412.14847
Authors: Zhiyuan Li, Tingyu Xia, Yi Chang, Yuan Wu
Affiliations: Jilin University; School of Artificial Intelligence, Jilin University; Engineering Research Center of Knowledge-Driven Human-Machine Intelligence, MOE, China
Keywords: Receptance Weighted Key, Receptance Weighted, Weighted Key, merging the benefits, attention-based systems
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 18 pages

Abstract:The Receptance Weighted Key Value (RWKV) model offers a novel alternative to the Transformer architecture, merging the benefits of recurrent and attention-based systems. Unlike conventional Transformers, which depend heavily on self-attention, RWKV adeptly captures long-range dependencies with minimal computational demands. By utilizing a recurrent framework, RWKV addresses some computational inefficiencies found in Transformers, particularly in tasks with long sequences. RWKV has recently drawn considerable attention for its robust performance across multiple domains. Despite its growing popularity, no systematic review of the RWKV model exists. This paper seeks to fill this gap as the first comprehensive review of the RWKV architecture, its core principles, and its varied applications, such as natural language generation, natural language understanding, and computer vision. We assess how RWKV compares to traditional Transformer models, highlighting its capability to manage long sequences efficiently and lower computational costs. Furthermore, we explore the challenges RWKV encounters and propose potential directions for future research and advancement. We consistently maintain the related open-source materials at: this https URL.

[NLP-30] Mapping and Influencing the Political Ideology of Large Language Models using Synthetic Personas

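Quick read: This paper addresses a gap in the study of political bias in large language models (LLMs): the effect of persona-based prompting on their political orientation remains unexplored. The key to the approach is to use PersonaHub, a collection of synthetic persona descriptions, together with the Political Compass Test (PCT) to map the political distribution of persona-prompted LLMs, and then to test whether explicit ideological prompting can shift these distributions towards two diametrically opposed orientations: right-authoritarian and left-libertarian. The experiments show that synthetic personas cluster predominantly in the left-libertarian quadrant. All models shift significantly towards right-authoritarian positions but shift less towards left-libertarian positions, an asymmetry that may reflect inherent biases in model training.

Link: https://arxiv.org/abs/2412.14843
Authors: Pietro Bernardelle, Leon Fröhling, Stefano Civelli, Riccardo Lunardi, Kevin Roiter, Gianluca Demartini
Affiliations: The University of Queensland; GESIS; University of Udine
Keywords: fixed viewpoints, large language models, large language, primarily examined, examined these systems
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 4 pages, 2 figures, 2 tables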

Abstract:The analysis of political biases in large language models (LLMs) has primarily examined these systems as single entities with fixed viewpoints. While various methods exist for measuring such biases, the impact of persona-based prompting on LLMs’ political orientation remains unexplored. In this work we leverage PersonaHub, a collection of synthetic persona descriptions, to map the political distribution of persona-based prompted LLMs using the Political Compass Test (PCT). We then examine whether these initial compass distributions can be manipulated through explicit ideological prompting towards diametrically opposed political orientations: right-authoritarian and left-libertarian. Our experiments reveal that synthetic personas predominantly cluster in the left-libertarian quadrant, with models demonstrating varying degrees of responsiveness when prompted with explicit ideological descriptors. While all models demonstrate significant shifts towards right-authoritarian positions, they exhibit more limited shifts towards left-libertarian positions, suggesting an asymmetric response to ideological manipulation that may reflect inherent biases in model training.

[NLP-31] DynamicKV: Task-Aware Adaptive KV Cache Compression for Long Context LLMs

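Quick read: This paper addresses efficient management of the key-value (KV) cache in large language models (LLMs), especially for long-context tasks such as RAG and summarization. Existing compression methods enforce a fixed pattern and neglect task-specific activation patterns, which costs essential information. The proposed method, DynamicKV, dynamically adjusts the number of tokens retained at each layer to suit the task. It sets global and per-layer maximum KV cache budgets, temporarily retains the maximum budget for the current layer, and periodically updates the KV cache sizes of all preceding layers during inference. DynamicKV retains only 1.7% of the KV cache while achieving about 85% of full KV cache performance on LongBench, and under extreme compression (0.9%) it surpasses state-of-the-art (SOTA) methods by 11% on the Needle-in-a-Haystack test.

Link: https://arxiv.org/abs/2412.14838
Authors: Xiabin Zhou, Wenbin Wang, Minyan Zeng, Jiaxian Guo, Xuebo Liu, Li Shen, Min Zhang, Liang Ding
Affiliations: Unknown
Keywords: RAG and summarization, Efficient KV cache, management in LLMs, LLMs is crucial, crucial for long-context
Categories: Computation and Language (cs.CL)
Comments: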

Abstract:Efficient KV cache management in LLMs is crucial for long-context tasks like RAG and summarization. Existing KV cache compression methods enforce a fixed pattern, neglecting task-specific characteristics and reducing the retention of essential information. However, we observe distinct activation patterns across layers in various tasks, highlighting the need for adaptive strategies tailored to each task’s unique demands. Based on this insight, we propose DynamicKV, a method that dynamically optimizes token retention by adjusting the number of tokens retained at each layer to adapt to the specific task. DynamicKV establishes global and per-layer maximum KV cache budgets, temporarily retaining the maximum budget for the current layer, and periodically updating the KV cache sizes of all preceding layers during inference. Our method retains only 1.7% of the KV cache size while achieving ~85% of the Full KV cache performance on LongBench. Notably, even under extreme compression (0.9%), DynamicKV surpasses state-of-the-art (SOTA) methods by 11% in the Needle-in-a-Haystack test using Mistral-7B-Instruct-v0.2. The code will be released.
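
The core idea of keeping a different number of tokens at each layer can be sketched as a per-layer top-k selection. This is a simplified illustration, not the DynamicKV algorithm: the scores are assumed to be per-token importance values (for example, aggregated attention weights), the budgets are given rather than updated during inference, and the function names are hypothetical.

```python
from typing import List


def keep_top_tokens(scores: List[float], budget: int) -> List[int]:
    """Return the positions of the `budget` highest-scoring tokens, in original order."""
    if budget <= 0:
        return []
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:budget])


def compress_layers(layer_scores: List[List[float]],
                    layer_budgets: List[int]) -> List[List[int]]:
    """Apply a separate token budget to each layer's KV cache."""
    if len(layer_scores) != len(layer_budgets):
        raise ValueError("one budget per layer is required")
    return [keep_top_tokens(s, b) for s, b in zip(layer_scores, layer_budgets)]
```

A fixed-pattern method would use the same budget for every layer. Letting `layer_budgets` vary is what allows layers with more concentrated importance to keep fewer tokens.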

[NLP-32] Progressive Multimodal Reasoning via Active Retrieval

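Quick read: This paper addresses the challenge that multi-step multimodal reasoning tasks pose for multimodal large language models (MLLMs). The key to the approach is AR-MCTS, a universal framework that progressively improves the reasoning capabilities of MLLMs by combining Active Retrieval (AR) with Monte Carlo Tree Search (MCTS). The framework first builds a unified retrieval module that retrieves key supporting insights for complex reasoning problems from a hybrid-modal retrieval corpus. It then combines the MCTS algorithm with an active retrieval mechanism to generate step-wise annotations automatically, retrieving key insights dynamically for each reasoning step. This moves beyond traditional beam search sampling and improves the diversity and reliability of the reasoning space. A process reward model that aligns progressively supports automatic verification of multimodal reasoning tasks. Experiments on three complex multimodal reasoning benchmarks show that AR-MCTS improves the performance of various multimodal models.

Link: https://arxiv.org/abs/2412.14835
Authors: Guanting Dong, Chenghao Zhang, Mengjie Deng, Yutao Zhu, Zhicheng Dou, Ji-Rong Wen
Affiliations: Gaoling School of Artificial Intelligence, Renmin University of China
Keywords: pose significant challenges, Monte Carlo Tree, Multi-step multimodal reasoning, Carlo Tree Search, multimodal large language
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Information Retrieval (cs.IR)
Comments: Work in progress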

Abstract:Multi-step multimodal reasoning tasks pose significant challenges for multimodal large language models (MLLMs), and finding effective ways to enhance their performance in such scenarios remains an unresolved issue. In this paper, we propose AR-MCTS, a universal framework designed to progressively improve the reasoning capabilities of MLLMs through Active Retrieval (AR) and Monte Carlo Tree Search (MCTS). Our approach begins with the development of a unified retrieval module that retrieves key supporting insights for solving complex reasoning problems from a hybrid-modal retrieval corpus. To bridge the gap in automated multimodal reasoning verification, we employ the MCTS algorithm combined with an active retrieval mechanism, which enables the automatic generation of step-wise annotations. This strategy dynamically retrieves key insights for each reasoning step, moving beyond traditional beam search sampling to improve the diversity and reliability of the reasoning space. Additionally, we introduce a process reward model that aligns progressively to support the automatic verification of multimodal reasoning tasks. Experimental results across three complex multimodal reasoning benchmarks confirm the effectiveness of the AR-MCTS framework in enhancing the performance of various multimodal models. Further analysis demonstrates that AR-MCTS can optimize sampling diversity and accuracy, yielding reliable multimodal reasoning.

[NLP-33] Mention Attention for Pronoun Translation ACL

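Quick read: This paper addresses the difficulty of translating pronouns in machine translation, which stems in particular from divergences in pronoun usage across languages. The key to the solution is an additional mention attention module in the decoder that attends to source-side mentions rather than to general tokens. By extracting mention features and taking target-side context into account, the module improves pronoun translation. The paper also introduces two mention classifiers that train the model to recognize mentions; their outputs guide the mention attention module. Experiments on the WMT17 English-German translation task show that the method improves pronoun translation (APT) while maintaining overall translation quality (BLEU).

Link: https://arxiv.org/abs/2412.14829
Authors: Gongbo Tang, Christian Hardmeier
Affiliation: Beijing Language and Culture University; IT University of Copenhagen
Keywords: usage across languages, translation, attention, referring expressions, Mentions
Categories: Computation and Language (cs.CL)
Comments: camera-ready version of the paper accepted by JCRAI-23 conference, in ACL format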

Abstract:Most pronouns are referring expressions, so computers need to resolve what the pronouns refer to, and pronoun usage diverges across languages. Thus, dealing with these divergences and translating pronouns is a challenge in machine translation. Mentions are referring candidates of pronouns and have closer relations with pronouns compared to general tokens. We assume that extracting additional mention features can help pronoun translation. Therefore, we introduce an additional mention attention module in the decoder to pay extra attention to source mentions but not non-mention tokens. Our mention attention module not only extracts features from source mentions, but also considers target-side context which benefits pronoun translation. In addition, we also introduce two mention classifiers to train models to recognize mentions, whose outputs guide the mention attention. We conduct experiments on the WMT17 English-German translation task, and evaluate our models on general translation and pronoun translation, using BLEU, APT, and contrastive evaluation metrics. Our proposed model outperforms the baseline Transformer model in terms of APT and BLEU scores. This confirms our hypothesis that we can improve pronoun translation by paying additional attention to source mentions, and shows that the additional modules we introduce do not have a negative effect on the general translation quality.

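[NLP-34] ResoFilter: Fine-grained Synthetic Data Filtering for Large Language Models through Data-Parameter Resonance Analysis

Quick read: This paper addresses the difficulty of assessing the quality and utility of data produced by generative data augmentation for large language models (LLMs). The key to the solution is ResoFilter, a method that integrates models, data, and tasks and uses Data-Parameter features obtained during fine-tuning to select data, which makes data characteristics more interpretable. ResoFilter represents data characteristics through model weights. Experiments show that, on mathematical tasks, it matches full-scale fine-tuning while using only half the data, and that it generalizes well across models and domains. The method offers a new perspective on building high-quality synthetic datasets and evaluating data quality, and may help improve data augmentation techniques and LLM training datasets.

Link: https://arxiv.org/abs/2412.14809
Authors: Zeao Tu, Xiangdi Meng, Yu He, Zihan Yao, Tianyu Qi, Jun Liu, Ming Li
Affiliation: TAL Education Group, Beijing, China; Xi’an Jiaotong University, Xi’an, Shaanxi, China
Keywords: methods utilizing GPT, Large language models, shown remarkable effectiveness, Large language, utilizing GPT
Categories: Computation and Language (cs.CL)
Comments: under review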

Abstract:Large language models (LLMs) have shown remarkable effectiveness across various domains, with data augmentation methods utilizing GPT for synthetic data generation becoming prevalent. However, the quality and utility of augmented data remain questionable, and current methods lack clear metrics for evaluating data characteristics. To address these challenges, we propose ResoFilter, a novel method that integrates models, data, and tasks to refine datasets. ResoFilter leverages the fine-tuning process to obtain Data-Parameter features for data selection, offering improved interpretability by representing data characteristics through model weights. Our experiments demonstrate that ResoFilter achieves comparable results to full-scale fine-tuning using only half the data in mathematical tasks and exhibits strong generalization across different models and domains. This method provides valuable insights for constructing synthetic datasets and evaluating high-quality data, offering a promising solution for enhancing data augmentation techniques and improving training dataset quality for LLMs. For reproducibility, we will release our code and data upon acceptance.

[NLP-35] Disentangling Reasoning Tokens and Boilerplate Tokens For Language Model Fine-tuning

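Quick read: This paper addresses the fact that, when agent-task datasets are used to strengthen the agent capabilities of large language models (LLMs), existing methods treat all tokens in a sample equally. The paper argues that tokens playing different roles, such as reasoning tokens and boilerplate tokens, differ significantly in importance and learning complexity and should therefore be handled separately. The key to the solution is a new Shuffle-Aware Discriminator (SHAD), which tells tokens apart by the difference in predictability observed after input-output combinations are shuffled across samples: boilerplate tokens, being repetitive, stay predictable, whereas reasoning tokens do not. Building on SHAD, the paper proposes Reasoning-highlighted Fine-Tuning (RFT), which adaptively emphasizes reasoning tokens during fine-tuning and yields notable gains over standard Supervised Fine-Tuning (SFT).

Link: https://arxiv.org/abs/2412.14780
Authors: Ziang Ye, Zhenru Zhang, Yang Zhang, Jianxin Ma, Junyang Lin, Fuli Feng
Affiliation: University of Science and Technology of China; Alibaba Group; National University of Singapore
Keywords: Large Language Models, Language Models, Large Language, enhance agent capabilities, capabilities for Large
Categories: Computation and Language (cs.CL)
Comments: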

Abstract:When using agent-task datasets to enhance agent capabilities for Large Language Models (LLMs), current methodologies often treat all tokens within a sample equally. However, we argue that tokens serving different roles - specifically, reasoning tokens versus boilerplate tokens (e.g., those governing output format) - differ significantly in importance and learning complexity, necessitating their disentanglement and distinct treatment. To address this, we propose a novel Shuffle-Aware Discriminator (SHAD) for adaptive token discrimination. SHAD classifies tokens by exploiting predictability differences observed after shuffling input-output combinations across samples: boilerplate tokens, due to their repetitive nature among samples, maintain predictability, whereas reasoning tokens do not. Using SHAD, we propose the Reasoning-highlighted Fine-Tuning (RFT) method, which adaptively emphasizes reasoning tokens during fine-tuning, yielding notable performance gains over common Supervised Fine-Tuning (SFT).
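
To make the idea of emphasizing reasoning tokens concrete, here is a minimal sketch of a token-weighted negative log-likelihood. It assumes the reasoning/boilerplate flags are already available (in the paper they would come from a discriminator such as SHAD) and uses a fixed weight, whereas the abstract describes an adaptive scheme. The function name `weighted_nll` and the parameter `reasoning_weight` are hypothetical.

```python
import math
from typing import List


def weighted_nll(token_probs: List[float], is_reasoning: List[bool],
                 reasoning_weight: float = 2.0) -> float:
    """Weighted average negative log-likelihood over one sample's target tokens."""
    if not token_probs or len(token_probs) != len(is_reasoning):
        raise ValueError("inputs must be non-empty and of equal length")
    # Reasoning tokens count `reasoning_weight` times; boilerplate tokens count once.
    weights = [reasoning_weight if flag else 1.0 for flag in is_reasoning]
    losses = [-math.log(p) for p in token_probs]
    return sum(w * l for w, l in zip(weights, losses)) / sum(weights)


# Model probabilities for four target tokens (made-up values):
# two easy boilerplate tokens followed by two harder reasoning tokens.
probs = [0.9, 0.9, 0.5, 0.25]
flags = [False, False, True, True]
uniform = weighted_nll(probs, flags, reasoning_weight=1.0)     # plain SFT-style average
emphasised = weighted_nll(probs, flags, reasoning_weight=2.0)  # reasoning tokens upweighted
print(uniform, emphasised)
```

With `reasoning_weight=1.0` the function reduces to the ordinary per-token average used in standard SFT, so the weight is the only thing that differs between the two settings.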

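[NLP-36] ALKAFI-LLAMA3: Fine-Tuning LLMs for Precise Legal Understanding in Palestine

Quick read: This paper addresses how to apply large language models (LLMs) effectively in low-resource settings, specifically the Palestinian legal domain. The key to the solution is to fine-tune a quantized version of Llama-3.2-1B-Instruct on a synthetic dataset derived from Palestinian legal texts. Using a smaller model and strategically generated question-answer pairs yields a cost-effective, locally sustainable solution that provides accurate and contextually relevant legal guidance. The approach performs well on a range of query types (yes/no questions, narrative explanations, and complex legal differentiations), and the paper identifies room for improvement on calculation-based queries and structured list formatting.

Link: https://arxiv.org/abs/2412.14771
Authors: Rabee Qasem, Mohannad Hendi, Banan Tantour
Affiliation: Arab American University, Palestine (AAUP); Birzeit University, Palestine
Keywords: Large Language Models, Large Language, demonstrated remarkable potential, Language Models, Palestinian legal domain
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: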

Abstract:Large Language Models (LLMs) have demonstrated remarkable potential in diverse domains, yet their application in the legal sector, particularly in low-resource contexts, remains limited. This study addresses the challenges of adapting LLMs to the Palestinian legal domain, where political instability, fragmented legal frameworks, and limited AI resources hinder effective machine-learning applications. We present a fine-tuned model based on a quantized version of Llama-3.2-1B-Instruct, trained on a synthetic data set derived from Palestinian legal texts. Using smaller-scale models and strategically generated question-answer pairs, we achieve a cost-effective, locally sustainable solution that provides accurate and contextually relevant legal guidance. Our experiments demonstrate promising performance on various query types, ranging from yes/no questions and narrative explanations to complex legal differentiations, while highlighting areas for improvement, such as handling calculation-based inquiries and structured list formatting. This work provides a pathway for the deployment of AI-driven legal assistance tools tailored to the needs of resource-constrained environments.

[NLP-37] PsyDraw: A Multi-Agent Multimodal System for Mental Health Screening in Left-Behind Children

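Quick read: This paper addresses the mental health problems faced by left-behind children (LBCs) in China whose parents have migrated for work; in under-resourced rural areas in particular, a severe shortage of mental health professionals makes early screening and identification difficult. The key to the solution is PsyDraw, a multi-agent system based on Multimodal Large Language Models that assists mental health professionals in analyzing House-Tree-Person (HTP) drawing tests. The system uses specialized agents for feature extraction and psychological interpretation and works in two stages, comprehensive feature analysis and professional report generation. It supports efficient preliminary screening and, in practice, shows high consistency with professional evaluations.

Link: https://arxiv.org/abs/2412.14769
Authors: Yiqun Zhang, Xiaocui Yang, Xiaobai Li, Siyuan Yu, Yi Luan, Shi Feng, Daling Wang, Yifei Zhang
Affiliation: Northeastern University, China; AI for Science, Shanghai AI Laboratory; Unitec Media Company Limited
Keywords: Left-behind children, million in China, face severe mental, mental health professionals, mental health
Categories: Computation and Language (cs.CL)
Comments: preprint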

Abstract:Left-behind children (LBCs), numbering over 66 million in China, face severe mental health challenges due to parental migration for work. Early screening and identification of at-risk LBCs is crucial, yet challenging due to the severe shortage of mental health professionals, especially in rural areas. While the House-Tree-Person (HTP) test shows higher child participation rates, its requirement for expert interpretation limits its application in resource-scarce regions. To address this challenge, we propose PsyDraw, a multi-agent system based on Multimodal Large Language Models that assists mental health professionals in analyzing HTP drawings. The system employs specialized agents for feature extraction and psychological interpretation, operating in two stages: comprehensive feature analysis and professional report generation. Evaluation of HTP drawings from 290 primary school students reveals that 71.03% of the analyses achieved High Consistency with professional evaluations, 26.21% Moderate Consistency and only 2.41% Low Consistency. The system identified 31.03% of cases requiring professional attention, demonstrating its effectiveness as a preliminary screening tool. Currently deployed in pilot schools, PsyDraw shows promise in supporting mental health professionals, particularly in resource-limited areas, while maintaining high professional standards in psychological assessment.

[NLP-38] Query pipeline optimization for cancer patient question answering systems

Quick read: This paper aims to reduce hallucination in retrieval-augmented generation (RAG) for cancer patient question-answering (CPQA) systems and to improve answer accuracy by optimizing the query pipeline. The key to the solution is a three-aspect optimization: (1) document retrieval, which compares NCBI resources and introduces Hybrid Semantic Real-time Document Retrieval (HSRDR); (2) passage retrieval, which identifies the best pairings of dense retrievers and rerankers; and (3) semantic representation, which introduces Semantic Enhanced Overlap Segmentation (SEOS) for better contextual understanding. On a cancer-related dataset, these optimizations improved the answer accuracy of Claude-3-haiku by 5.24% over chain-of-thought prompting and by about 3% over a naive RAG setup.

Link: https://arxiv.org/abs/2412.14751
Authors: Maolin He, Rena Gao, Mike Conway, Brian E. Chapman
Affiliation: School of Computing and Information Systems, University of Melbourne, Parkville, VIC 3052, Australia
Keywords: Large Language Models, Language Models, Large Language, retrieve relevant external, relevant external information
Categories: Computation and Language (cs.CL)
Comments:

Abstract:Retrieval-augmented generation (RAG) mitigates hallucination in Large Language Models (LLMs) by using query pipelines to retrieve relevant external information and grounding responses in retrieved knowledge. However, query pipeline optimization for cancer patient question-answering (CPQA) systems requires separately optimizing multiple components with domain-specific considerations. We propose a novel three-aspect optimization approach for the RAG query pipeline in CPQA systems, utilizing public biomedical databases like PubMed and PubMed Central. Our optimization includes: (1) document retrieval, utilizing a comparative analysis of NCBI resources and introducing Hybrid Semantic Real-time Document Retrieval (HSRDR); (2) passage retrieval, identifying optimal pairings of dense retrievers and rerankers; and (3) semantic representation, introducing Semantic Enhanced Overlap Segmentation (SEOS) for improved contextual understanding. On a custom-developed dataset tailored for cancer-related inquiries, our optimized RAG approach improved the answer accuracy of Claude-3-haiku by 5.24% over chain-of-thought prompting and about 3% over a naive RAG setup. This study highlights the importance of domain-specific query optimization in realizing the full potential of RAG and provides a robust framework for building more accurate and reliable CPQA systems, advancing the development of RAG-based biomedical systems.
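
The abstract does not describe how SEOS works internally, so the sketch below shows only the generic idea that overlap segmentation builds on: splitting a document into fixed-size windows that share some tokens with their neighbours, so that a sentence cut at a boundary still appears whole in one segment. The function name `overlap_segments` and its parameters are hypothetical, and the "semantic enhanced" part of SEOS is not modelled here.

```python
from typing import List


def overlap_segments(tokens: List[str], size: int, overlap: int) -> List[List[str]]:
    """Split tokens into windows of `size` tokens that share `overlap` tokens."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    step = size - overlap
    segments = []
    for start in range(0, len(tokens), step):
        segments.append(tokens[start:start + size])
        # Stop once a window has reached the end of the document.
        if start + size >= len(tokens):
            break
    return segments


words = "tumour staging guides the choice of treatment options".split()
segments = overlap_segments(words, size=4, overlap=2)
for segment in segments:
    print(" ".join(segment))
```

A larger overlap preserves more context at segment boundaries at the cost of indexing more, partly redundant, passages.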

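[NLP-39] On Verbalized Confidence Scores for LLMs

Quick read: This paper addresses trust in large language models (LLMs) as they become part of daily life, specifically how to quantify their uncertainty so that their responses are more trustworthy. The key to the solution is to have the LLM verbalize its own uncertainty by producing a confidence score as part of its output. The approach is prompt- and model-agnostic and has low overhead, which makes it a promising form of uncertainty quantification. An extensive benchmark shows that the reliability of these scores depends strongly on the prompt method, but that some prompt methods do elicit well-calibrated scores. The paper therefore argues that verbalized confidence scores could become a simple, effective, and versatile uncertainty quantification method.

Link: https://arxiv.org/abs/2412.14737
Authors: Daniel Yang, Yao-Hung Hubert Tsai, Makoto Yamada
Affiliation: unknown
Keywords: daily life make, large language models, rise of large, large language, tight integration
Categories: Computation and Language (cs.CL)
Comments: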

Abstract:The rise of large language models (LLMs) and their tight integration into our daily life make it essential to dedicate efforts towards their trustworthiness. Uncertainty quantification for LLMs can establish more human trust in their responses, but also allows LLM agents to make more informed decisions based on each other’s uncertainty. To estimate the uncertainty in a response, internal token logits, task-specific proxy models, or sampling of multiple responses are commonly used. This work focuses on asking the LLM itself to verbalize its uncertainty with a confidence score as part of its output tokens, which is a promising way for prompt- and model-agnostic uncertainty quantification with low overhead. Using an extensive benchmark, we assess the reliability of verbalized confidence scores with respect to different datasets, models, and prompt methods. Our results reveal that the reliability of these scores strongly depends on how the model is asked, but also that it is possible to extract well-calibrated confidence scores with certain prompt methods. We argue that verbalized confidence scores can become a simple but effective and versatile uncertainty quantification method in the future. Our code is available at this https URL.
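
Whether verbalized confidence scores are "well-calibrated" is commonly checked with the expected calibration error (ECE), which compares stated confidence with observed accuracy inside confidence bins. The sketch below is a generic ECE computation, not necessarily the metric used in the paper; the function name and the example data are made up for illustration.

```python
from typing import List, Tuple


def expected_calibration_error(preds: List[Tuple[float, bool]], n_bins: int = 10) -> float:
    """ECE over (confidence in [0, 1], answer was correct) pairs, equal-width bins."""
    bins: List[List[Tuple[float, bool]]] = [[] for _ in range(n_bins)]
    for confidence, correct in preds:
        if not 0.0 <= confidence <= 1.0:
            raise ValueError("confidence must lie in [0, 1]")
        index = min(int(confidence * n_bins), n_bins - 1)
        bins[index].append((confidence, correct))
    total = len(preds)
    ece = 0.0
    for bucket in bins:
        if not bucket:
            continue
        avg_conf = sum(c for c, _ in bucket) / len(bucket)
        accuracy = sum(1 for _, ok in bucket if ok) / len(bucket)
        # Each bin contributes its confidence/accuracy gap, weighted by its share of samples.
        ece += len(bucket) / total * abs(avg_conf - accuracy)
    return ece


# Confidence values as a model might state them ("Confidence: 90%"), with correctness labels.
answers = [(0.9, True), (0.9, True), (0.9, False), (0.6, True), (0.6, False)]
ece = expected_calibration_error(answers)
print(round(ece, 4))
```

An ECE of zero would mean that, within every bin, the stated confidence equals the fraction of correct answers; larger values indicate over- or under-confidence.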

[NLP-40] How to Synthesize Text Data without Model Collapse?

Quick read: This paper addresses model collapse when language models are trained on synthetic data and asks how to synthesize high-quality data without triggering it. The key to the solution is token editing: human-produced data is edited at the token level to obtain semi-synthetic data. A theoretical analysis shows that token-level editing can prevent model collapse because the test error is bounded by a finite upper bound. Experiments further confirm that the method improves data quality and model performance.

Link: https://arxiv.org/abs/2412.14689
Authors: Xuekai Zhu, Daixuan Cheng, Hengli Li, Kaiyan Zhang, Ermo Hua, Xingtai Lv, Ning Ding, Zhouhan Lin, Zilong Zheng, Bowen Zhou
Affiliation: LUMIA Lab, Shanghai Jiao Tong University; State Key Laboratory of General Artificial Intelligence, BIGAI; Department of Electronic Engineering, Tsinghua University; Institute for Artificial Intelligence, Peking University; Shanghai Artificial Intelligence Laboratory
Keywords: synthetic data, self-generated data leads, data, synthetic, gradual decline
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:Model collapse in synthetic data indicates that iterative training on self-generated data leads to a gradual decline in performance. With the proliferation of AI models, synthetic data will fundamentally reshape the web data ecosystem. Future GPT-{n} models will inevitably be trained on a blend of synthetic and human-produced data. In this paper, we focus on two questions: what is the impact of synthetic data on language model training, and how to synthesize data without model collapse? We first pre-train language models across different proportions of synthetic data, revealing a negative correlation between the proportion of synthetic data and model performance. We further conduct statistical analysis on synthetic data to uncover distributional shift phenomenon and over-concentration of n-gram features. Inspired by the above findings, we propose token editing on human-produced data to obtain semi-synthetic data. As a proof of concept, we theoretically demonstrate that token-level editing can prevent model collapse, as the test error is constrained by a finite upper bound. We conduct extensive experiments on pre-training from scratch, continual pre-training, and supervised fine-tuning. The results validate our theoretical proof that token-level editing improves data quality and enhances model performance.

[NLP-41] Each Fake News is Fake in its Own Way: An Attribution Multi-Granularity Benchmark for Multimodal Fake News Detection

Quick read: This paper addresses the fact that existing multimodal fake news detection datasets provide only binary real-or-fake labels and so cannot reflect the diverse, mixed nature of fake news. The key to the solution is AMG, an attribution multi-granularity multimodal fake news detection dataset that reveals the inherent patterns of fake news, together with a multi-granularity clue alignment model for detecting and attributing multimodal fake news. The dataset's attribution setting opens up new directions for future research.

Link: https://arxiv.org/abs/2412.14686
Authors: Hao Guo, Zihan Ma, Zhi Zeng, Minnan Luo, Weixin Zeng, Jiuyang Tang, Xiang Zhao
Affiliation: Sun Yat-sen University; Shenzhen Research Institute of Big Data; Guangdong Key Laboratory of Big Data Computing, The Chinese University of Hong Kong, Shenzhen; Shenzhen Engineering Laboratory for Big Data System Computing Technology
Keywords: Social platforms, multimodal fake, fake, access to information, resulting in negative
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:Social platforms, while facilitating access to information, have also become saturated with a plethora of fake news, resulting in negative consequences. Automatic multimodal fake news detection is a worthwhile pursuit. Existing multimodal fake news datasets only provide binary labels of real or fake. However, real news is alike, while each fake news is fake in its own way. These datasets fail to reflect the mixed nature of various types of multimodal fake news. To bridge the gap, we construct an attributing multi-granularity multimodal fake news detection dataset AMG, revealing the inherent fake pattern. Furthermore, we propose a multi-granularity clue alignment model to achieve multimodal fake news detection and attribution. Experimental results demonstrate that AMG is a challenging dataset, and its attribution setting opens up new avenues for future research.

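[NLP-42] LLMs as mediators: Can they diagnose conflicts accurately?

Quick read: This paper tests whether OpenAI's large language models (LLMs) GPT 3.5 and GPT 4 can distinguish causal disagreement from moral disagreement in a conflict, and which of the two is harder for LLMs to diagnose. The approach replicates the study design of Koçak et al., using GPT 3.5 and GPT 4 to diagnose disagreements in conversations. Both LLMs show a semantic understanding of the distinction similar to that of humans, but they tend to underestimate moral disagreement; for GPT 4 this tendency is especially pronounced on the proximate scale, which relies on concrete, issue-specific language. GPT 3.5 performs worse than GPT 4 and humans on both the proximate and the distal scale. The study offers a first test of using LLMs as conflict mediation tools.

Link: https://arxiv.org/abs/2412.14675
Authors: Özgecan Koçak (Emory University), Phanish Puranam (INSEAD), Afşar Yegin (Kadir Has University)
Affiliation: Emory University; INSEAD; Kadir Has University
Keywords: Prior research, Language Models GPT, Large Language Models, differences in beliefs, GPT
Categories: Computation and Language (cs.CL)
Comments: 27 pages, 2 appendices, 21 tables (incl appendices)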

Abstract:Prior research indicates that to be able to mediate conflict, observers of disagreements between parties must be able to reliably distinguish the sources of their disagreement as stemming from differences in beliefs about what is true (causality) vs. differences in what they value (morality). In this paper, we test if OpenAI’s Large Language Models GPT 3.5 and GPT 4 can perform this task and whether one or other type of disagreement proves particularly challenging for LLMs to diagnose. We replicate study 1 in Koçak et al. (2003), which employs a vignette design, with OpenAI’s GPT 3.5 and GPT 4. We find that both LLMs have similar semantic understanding of the distinction between causal and moral codes as humans and can reliably distinguish between them. When asked to diagnose the source of disagreement in a conversation, both LLMs, compared to humans, exhibit a tendency to overestimate the extent of causal disagreement and underestimate the extent of moral disagreement in the moral misalignment condition. This tendency is especially pronounced for GPT 4 when using a proximate scale that relies on concrete language specific to an issue. GPT 3.5 does not perform as well as GPT 4 or humans when using either the proximate or the distal scale. The study provides a first test of the potential for using LLMs to mediate conflict by diagnosing the root of disagreements in causal and evaluative codes.

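[NLP-43] Analysis and Visualization of Linguistic Structures in Large Language Models: Neural Representations of Verb-Particle Constructions in BERT

Quick read: This paper investigates how transformer-based large language models (LLMs) capture the lexical and syntactic nuances of verb-particle combinations in their internal representations. The approach uses the BERT architecture to analyze how well different network layers represent verb-particle constructions such as 'agree on', 'come back', and 'give up'. The study applies multi-dimensional scaling (MDS) and generalized discrimination value (GDV) calculations to model outputs, using a dataset prepared from the British National Corpus. The results show that BERT's middle layers capture syntactic structure most effectively and that representational accuracy varies significantly across verb categories, pointing to a complex interplay between network architecture and linguistic representation.

Link: https://arxiv.org/abs/2412.14670
Authors: Hassane Kissane, Achim Schilling, Patrick Krauss
Affiliation: unknown
Keywords: transformer-based large language, British National Corpus, specifically examining, investigates the internal, combinations within transformer-based
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: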

Abstract:This study investigates the internal representations of verb-particle combinations within transformer-based large language models (LLMs), specifically examining how these models capture lexical and syntactic nuances at different neural network layers. Employing the BERT architecture, we analyse the representational efficacy of its layers for various verb-particle constructions such as ‘agree on’, ‘come back’, and ‘give up’. Our methodology includes a detailed dataset preparation from the British National Corpus, followed by extensive model training and output analysis through techniques like multi-dimensional scaling (MDS) and generalized discrimination value (GDV) calculations. Results show that BERT’s middle layers most effectively capture syntactic structures, with significant variability in representational accuracy across different verb categories. These findings challenge the conventional uniformity assumed in neural network processing of linguistic elements and suggest a complex interplay between network architecture and linguistic representation. Our research contributes to a better understanding of how deep learning models comprehend and process language, offering insights into the potential and limitations of current neural approaches to linguistic analysis. This study not only advances our knowledge in computational linguistics but also prompts further research into optimizing neural architectures for enhanced linguistic precision.

[NLP-44] Unveiling Uncertainty: A Deep Dive into Calibration and Performance of Multimodal Large Language Models COLING2025

Quick read: This paper addresses uncertainty calibration in Multimodal Large Language Models (MLLMs) that process visual and textual data, which matters especially for reliability in high-stakes areas such as healthcare and autonomous driving. The key to the solution is to calibrate the models with techniques such as temperature scaling and iterative prompt optimization, and to construct the IDK dataset to evaluate how the models handle unknowns. The study finds that MLLMs tend to give an answer rather than admit uncertainty, but that appropriate prompt adjustments improve their self-assessment.

Link: https://arxiv.org/abs/2412.14660
Authors: Zijun Chen, Wenbo Hu, Guande He, Zhijie Deng, Zheng Zhang, Richang Hong
Affiliation: Hefei University of Technology, Hefei, China; Data Space Research Institute, Hefei, China; UT Austin, Austin, USA; Shanghai Jiao Tong University, Shanghai, China
Keywords: visual question answering, Multimodal large language, large language models, question answering, large language
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments: Accepted to COLING 2025

Abstract:Multimodal large language models (MLLMs) combine visual and textual data for tasks such as image captioning and visual question answering. Proper uncertainty calibration is crucial, yet challenging, for reliable use in areas like healthcare and autonomous driving. This paper investigates representative MLLMs, focusing on their calibration across various scenarios, including before and after visual fine-tuning, as well as before and after multimodal training of the base LLMs. We observed miscalibration in their performance, and at the same time, no significant differences in calibration across these scenarios. We also highlight how uncertainty differs between text and images and how their integration affects overall uncertainty. To better understand MLLMs’ miscalibration and their ability to self-assess uncertainty, we construct the IDK (I don’t know) dataset, which is key to evaluating how they handle unknowns. Our findings reveal that MLLMs tend to give answers rather than admit uncertainty, but this self-assessment improves with proper prompt adjustments. Finally, to calibrate MLLMs and enhance model reliability, we propose techniques such as temperature scaling and iterative prompt optimization. Our results provide insights into improving MLLMs for effective and responsible deployment in multimodal applications. Code and IDK dataset: this https URL.
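
Temperature scaling, one of the calibration techniques the abstract mentions, divides the logits by a scalar before the softmax. A temperature above 1 flattens the distribution and lowers the top confidence without changing which answer ranks first. The sketch below shows only this mechanism; in practice the temperature is usually fitted on held-out data, which is not shown, and the function name is hypothetical.

```python
import math
from typing import List


def softmax_with_temperature(logits: List[float], temperature: float) -> List[float]:
    """Softmax over logits divided by `temperature` (must be positive)."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.0]  # made-up scores for three answer options
sharp = softmax_with_temperature(logits, temperature=1.0)
soft = softmax_with_temperature(logits, temperature=2.0)
print([round(p, 3) for p in sharp])
print([round(p, 3) for p in soft])
```

Because the ranking of options is unchanged, temperature scaling adjusts confidence only and leaves accuracy as it was.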

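[NLP-45] Length Controlled Generation for Black-box LLMs

Quick read: This paper addresses the difficulty large language models (LLMs) have in accurately controlling the length of generated text, a basic requirement in many real-world applications. The key to the solution is a novel iterative sampling framework that combines the Metropolis-Hastings algorithm with an importance sampling acceleration strategy. The framework achieves length-constrained generation efficiently and reliably without modifying the LLM's parameters, so the model's original capabilities are preserved. Experiments show a success rate of almost 100% on Llama3.1 for length-controlled abstractive summarization and length-constrained instruction following, with minimal additional computational overhead.

Link: https://arxiv.org/abs/2412.14656
Authors: Yuxuan Gu, Wenjie Wang, Xiaocheng Feng, Weihong Zhong, Kun Zhu, Lei Huang, Tat-Seng Chua, Bing Qin
Affiliation: Harbin Institute of Technology; National University of Singapore; Peng Cheng Laboratory
Keywords: Large language models, Large language, demonstrated impressive instruction, language models, demonstrated impressive
Categories: Computation and Language (cs.CL)
Comments: Preprint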

Abstract:Large language models (LLMs) have demonstrated impressive instruction following capabilities, while still struggling to accurately manage the length of the generated text, which is a fundamental requirement in many real-world applications. Existing length control methods involve fine-tuning the parameters of LLMs, which is inefficient and suboptimal for practical use. In this paper, we propose a novel iterative sampling framework for text length control, integrating the Metropolis-Hastings algorithm with an importance sampling acceleration strategy. This framework efficiently and reliably regulates LLMs to generate length-constrained text without modifying the underlying parameters, thereby preserving the original capabilities of LLMs. Experimental results demonstrate that our framework achieves almost 100% success rates of length control on Llama3.1 for tasks such as length-controlled abstractive summarization and length-constrained instruction following, with minimal additional computational overhead. This also highlights the significant potential of our method for precise length control across a broader range of applications, without compromising the versatility of LLMs.
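
As a rough illustration of how Metropolis-Hastings can steer a black-box generator toward a target length, the sketch below uses an independence sampler: every proposal is a fresh sample from the generator, so the generator's own probabilities cancel in the acceptance ratio and only a length score remains. The toy generator, the exponential length score, and all function names are assumptions made for illustration; the paper's actual proposal distribution and its importance sampling acceleration are not reproduced.

```python
import math
import random
from typing import Callable


def length_score(text: str, target_words: int, sharpness: float = 1.0) -> float:
    """Unnormalised preference for texts whose word count is near the target."""
    return math.exp(-sharpness * abs(len(text.split()) - target_words))


def mh_length_control(propose: Callable[[], str], target_words: int,
                      steps: int, rng: random.Random) -> str:
    """Independence Metropolis-Hastings over samples from a black-box generator."""
    current = propose()
    for _ in range(steps):
        candidate = propose()
        # Target is p_model(x) * length_score(x) and the proposal is p_model(x),
        # so the acceptance ratio reduces to a ratio of length scores.
        ratio = length_score(candidate, target_words) / length_score(current, target_words)
        if rng.random() < min(1.0, ratio):
            current = candidate
    return current


rng = random.Random(0)
pool = [
    "short answer",
    "a somewhat longer answer here",
    "an answer with exactly six words",
    "one",
]


def toy_llm() -> str:
    # Stand-in for sampling a response from a black-box LLM.
    return rng.choice(pool)


result = mh_length_control(toy_llm, target_words=6, steps=200, rng=rng)
print(result, len(result.split()))
```

The chain spends most of its time on responses close to the target length, while the generator itself is never modified, which mirrors the black-box setting described in the abstract.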

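[NLP-46] TOMG-Bench: Evaluating LLMs on Text-based Open Molecule Generation

Quick read: This paper addresses the evaluation of open-domain molecule generation and proposes the Text-based Open Molecule Generation Benchmark (TOMG-Bench), the first benchmark for evaluating the open-domain molecule generation capability of LLMs. The key to the solution is a dataset covering three major tasks, molecule editing (MolEdit), molecule optimization (MolOpt), and customized molecule generation (MolCustom), together with an automated evaluation system that measures the quality and accuracy of the generated molecules. In addition, with OpenMolIns, an instruction tuning dataset designed for the challenges raised by TOMG-Bench, Llama3.1-8B outperforms all open-source general LLMs on TOMG-Bench and even surpasses GPT-3.5-turbo.

Link: https://arxiv.org/abs/2412.14642
Authors: Jiatong Li, Junxian Li, Yunqing Liu, Dongzhan Zhou, Qing Li
Affiliation: The Hong Kong Polytechnic University; Shanghai Jiao Tong University; Shanghai AI Lab
Keywords: propose Text-based Open, Text-based Open Molecule, Molecule Generation Benchmark, Open Molecule Generation, propose Text-based
Categories: Computation and Language (cs.CL)
Comments: A benchmark for text-based open molecule generation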

Abstract:In this paper, we propose Text-based Open Molecule Generation Benchmark (TOMG-Bench), the first benchmark to evaluate the open-domain molecule generation capability of LLMs. TOMG-Bench encompasses a dataset of three major tasks: molecule editing (MolEdit), molecule optimization (MolOpt), and customized molecule generation (MolCustom). Each task further contains three subtasks, with each subtask comprising 5,000 test samples. Given the inherent complexity of open molecule generation, we have also developed an automated evaluation system that helps measure both the quality and the accuracy of the generated molecules. Our comprehensive benchmarking of 25 LLMs reveals the current limitations and potential areas for improvement in text-guided molecule discovery. Furthermore, with the assistance of OpenMolIns, a specialized instruction tuning dataset proposed for solving challenges raised by TOMG-Bench, Llama3.1-8B could outperform all the open-source general LLMs, even surpassing GPT-3.5-turbo by 46.5% on TOMG-Bench. Our codes and datasets are available through this https URL.

[NLP-47] Learning to Generate Research Idea with Dynamic Control

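Quick read: This paper addresses two limitations of current prompting-based large language models (LLMs) for generating research hypotheses and ideas: they cannot effectively optimize the generated content, and they do not handle the complex interdependence among novelty, feasibility, and effectiveness. The key to the solution is a two-stage framework, proposed for the first time, that combines Supervised Fine-Tuning (SFT) with controllable Reinforcement Learning (RL). In the SFT stage, the model learns foundational patterns from pairs of research papers and follow-up ideas. In the RL stage, multi-dimensional reward modeling guided by fine-grained feedback evaluates and optimizes the generated ideas on key metrics. Dimensional controllers adjust generation dynamically, and a sentence-level decoder ensures context-aware emphasis during inference, so the framework balances novelty, feasibility, and effectiveness dynamically and produces high-quality research ideas.

Link: https://arxiv.org/abs/2412.14626
Authors: Ruochen Li, Liqiang Jing, Chi Han, Jiawei Zhou, Xinya Du
Affiliation: unknown
Keywords: accelerate scientific discovery, large language models, scientific discovery, rapid advancements, advancements in large
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: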

Abstract:The rapid advancements in large language models (LLMs) have demonstrated their potential to accelerate scientific discovery, particularly in automating the process of research ideation. LLM-based systems have shown promise in generating hypotheses and research ideas. However, current approaches predominantly rely on prompting-based pre-trained models, limiting their ability to optimize generated content effectively. Moreover, they also lack the capability to deal with the complex interdependence and inherent restrictions among novelty, feasibility, and effectiveness, which remains challenging due to the inherent trade-offs among these dimensions, such as the innovation-feasibility conflict. To address these limitations, we for the first time propose fine-tuning LLMs to be better idea proposers and introduce a novel framework that employs a two-stage approach combining Supervised Fine-Tuning (SFT) and controllable Reinforcement Learning (RL). In the SFT stage, the model learns foundational patterns from pairs of research papers and follow-up ideas. In the RL stage, multi-dimensional reward modeling, guided by fine-grained feedback, evaluates and optimizes the generated ideas across key metrics. Dimensional controllers enable dynamic adjustment of generation, while a sentence-level decoder ensures context-aware emphasis during inference. Our framework provides a balanced approach to research ideation, achieving high-quality outcomes by dynamically navigating the trade-offs among novelty, feasibility, and effectiveness.

[NLP-48] How good is GPT at writing political speeches for the White House?

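[Quick Read]: This paper analyzes how the writing style of large language models (LLMs) such as GPT differs from that of real human authors, here US presidents. The key to the approach is comparing texts generated by GPT-3.5 and GPT-4.o with the State of the Union (SOTU) addresses from Reagan to Biden, which reveals GPT's characteristic word usage, sentence length, and tone. The study finds that GPT tends to overuse the lemma "we", produces shorter messages with longer sentences on average, and prefers an optimistic tone with more political, symbolic, and abstract terms. Even when asked to imitate a specific author's style, GPT produces text that remains clearly distinct from that author's addresses. The two GPT versions differ from each other in some respects, but both are overall dissimilar to genuine presidential addresses.

Link: https://arxiv.org/abs/2412.14617
Authors: Jacques Savoy
Affiliations: Unknown
Keywords: large language models, language models, LLM called GPT, large language, text in response
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: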

Abstract:Using large language models (LLMs), computers are able to generate a written text in response to a user request. As this pervasive technology can be applied in numerous contexts, this study analyses the written style of one LLM called GPT by comparing its generated speeches with those of the recent US presidents. To achieve this objective, the State of the Union (SOTU) addresses written by Reagan to Biden are contrasted to those produced by both GPT-3.5 and GPT-4.o versions. Compared to US presidents, GPT tends to overuse the lemma “we” and produce shorter messages with, on average, longer sentences. Moreover, GPT opts for an optimistic tone, opting more often for political (e.g., president, Congress), symbolic (e.g., freedom), and abstract terms (e.g., freedom). Even when imposing an author’s style to GPT, the resulting speech remains distinct from addresses written by the target author. Finally, the two GPT versions present distinct characteristics, but both appear overall dissimilar to true presidential messages.
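
The stylistic measures mentioned in the abstract (message length, mean sentence length, use of the lemma "we") can be approximated with a few lines of Python. This is a minimal sketch under stated assumptions, not the study's actual pipeline: the function name `style_profile`, the regex-based sentence splitter, and the set of surface forms grouped under "we" are all hypothetical choices made for illustration.

```python
import re

def style_profile(text: str) -> dict:
    """Return simple stylometric measures for one speech."""
    # Split on sentence-ending punctuation; a rough heuristic, not a real tokenizer.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[a-z']+", text.lower())
    if not sentences or not words:
        raise ValueError("text must contain at least one word")
    # Surface forms grouped under the lemma "we" (an assumption of this sketch).
    we_forms = {"we", "us", "our", "ours", "ourselves"}
    we_count = sum(1 for w in words if w in we_forms)
    return {
        "n_words": len(words),
        "mean_sentence_length": len(words) / len(sentences),
        "we_per_1000_words": 1000 * we_count / len(words),
    }

# Invented two-sentence sample, used only to exercise the function.
sample = "We will rebuild our economy. Together we can do it!"
profile = style_profile(sample)
```

Running the same function over a presidential address and a generated one gives directly comparable numbers; a real study would need proper lemmatization and a much larger corpus.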

[NLP-49] HarmonicEval: Multi-modal Multi-task Multi-criteria Automatic Evaluation Using a Vision Language Model

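[Quick Read]: This paper addresses two main problems in how existing metrics evaluate text generated by vision-language models (VLMs): 1) it is hard to tell from an overall score which aspects of the text need improvement; 2) metrics may overlook specific evaluation criteria when predicting an overall score. To address them, the paper proposes HarmonicEval, a reference-free evaluation metric that aggregates criterion-wise scores in a bottom-up manner to produce the overall score. The paper also constructs the Multi-task Multi-criteria Human Evaluation (MMHE) dataset, which contains 18,000 expert human judgments across four vision-language tasks. Experiments show that HarmonicEval correlates better with human judgments than conventional metrics while also providing a numerical score for each criterion.

Link: https://arxiv.org/abs/2412.14613
Authors: Masanari Ohi, Masahiro Kaneko, Naoaki Okazaki, Nakamasa Inoue
Affiliations: Institute of Science Tokyo; MBZUAI
Keywords: shown impressive abilities, image understanding, shown impressive, impressive abilities, Vision-language models
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: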

Abstract:Vision-language models (VLMs) have shown impressive abilities in text and image understanding. However, existing metrics for evaluating the text generated by VLMs focus exclusively on overall quality, leading to two limitations: 1) it is challenging to identify which aspects of the text need improvement from the overall score; 2) metrics may overlook specific evaluation criteria when predicting an overall score. To address these limitations, we propose HarmonicEval, a reference-free evaluation metric that aggregates criterion-wise scores to produce the overall score in a bottom-up manner. Furthermore, we construct the Multi-task Multi-criteria Human Evaluation (MMHE) dataset, which comprises 18,000 expert human judgments across four vision-language tasks. Our experiments demonstrate that HarmonicEval achieves higher correlations with human judgments than conventional metrics while providing numerical scores for each criterion.

[NLP-50] KARRIEREWEGE: A Large Scale Career Path Prediction Dataset COLING

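[Quick Read]: This paper addresses the scarcity of publicly available data and tools for career path prediction. The key to the solution is KARRIEREWEGE, a comprehensive public dataset of more than 500k career paths that is linked to the ESCO taxonomy, making it a valuable resource for predicting career trajectories. The dataset is further enhanced with synthesized job titles and descriptions to form KARRIEREWEGE+, which allows accurate predictions from unstructured data and is particularly suited to the free-text inputs found in resumes. Benchmarking shows that existing state-of-the-art models gain performance and robustness on free-text use cases thanks to the synthesized data.

Link: https://arxiv.org/abs/2412.14612
Authors: Elena Senger, Yuri Campbell, Rob van der Goot, Barbara Plank
Affiliations: MaiNLP, Center for Information and Language Processing, LMU Munich, Germany; Fraunhofer Center for International Management and Knowledge Economy IMW, Germany; Department of Computer Science, IT University of Copenhagen, Denmark
Keywords: career path prediction, support many stakeholders, project managers, path prediction, Accurate career path
Categories: Computation and Language (cs.CL)
Comments: Accepted at COLING Industry Track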

Abstract:Accurate career path prediction can support many stakeholders, like job seekers, recruiters, HR, and project managers. However, publicly available data and tools for career path prediction are scarce. In this work, we introduce KARRIEREWEGE, a comprehensive, publicly available dataset containing over 500k career paths, significantly surpassing the size of previously available datasets. We link the dataset to the ESCO taxonomy to offer a valuable resource for predicting career trajectories. To tackle the problem of free-text inputs typically found in resumes, we enhance it by synthesizing job titles and descriptions resulting in KARRIEREWEGE+. This allows for accurate predictions from unstructured data, closely aligning with real-world application challenges. We benchmark existing state-of-the-art (SOTA) models on our dataset and a prior benchmark and observe improved performance and robustness, particularly for free-text use cases, due to the synthesized data.

[NLP-51] LDP: Generalizing to Multilingual Visual Information Extraction by Language Decoupled Pretraining AAAI2025

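[Quick Read]: This paper addresses language bias in multilingual Visual Information Extraction (VIE), in particular the performance drop in non-English scenarios caused by the imbalance in quantity and quality of pre-training corpora. The key to the solution is a new multilingual training paradigm, LDP (Language Decoupled Pre-training), which decouples language bias and relies on the invariance of the vision and layout modalities to generalize across languages. Specifically, the proposed LDM (Language Decoupled Model) is first pre-trained on language-independent data, from which language knowledge has been removed by a diffusion model, and is then fine-tuned on the downstream languages. Experiments show that LDM outperforms existing state-of-the-art multilingual pre-trained models while remaining competitive on monolingual/English benchmarks.

Link: https://arxiv.org/abs/2412.14596
Authors: Huawen Shen, Gengluo Li, Jinwen Zhong, Yu Zhou
Affiliations: Unknown
Keywords: Visual Information Extraction, Visual Information, Information Extraction, plays a crucial, enhance performance
Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: Accepted by AAAI2025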

Abstract:Visual Information Extraction (VIE) plays a crucial role in the comprehension of semi-structured documents, and several pre-trained models have been developed to enhance performance. However, most of these works are monolingual (usually English). Due to the extremely unbalanced quantity and quality of pre-training corpora between English and other languages, few works can extend to non-English scenarios. In this paper, we conduct systematic experiments to show that vision and layout modality hold invariance among images with different languages. If decoupling language bias from document images, a vision-layout-based model can achieve impressive cross-lingual generalization. Accordingly, we present a simple but effective multilingual training paradigm LDP (Language Decoupled Pre-training) for better utilization of monolingual pre-training data. Our proposed model LDM (Language Decoupled Model) is first pre-trained on the language-independent data, where the language knowledge is decoupled by a diffusion model, and then the LDM is fine-tuned on the downstream languages. Extensive experiments show that the LDM outperformed all SOTA multilingual pre-trained models, and also maintains competitiveness on downstream monolingual/English benchmarks.

[NLP-52] Beyond Guilt: Legal Judgment Prediction with Trichotomous Reasoning

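[Quick Read]: This paper addresses the lack of trichotomous reasoning in current legal large language models (LLMs) for judgment prediction, which in particular leaves them unable to predict innocent verdicts. The key to the solution is LJPIV, the first legal judgment prediction benchmark dataset that includes innocent verdicts, built by extending existing legal datasets through LLM-based augmentation and manual verification. The proposed strategies integrate trichotomous reasoning into zero-shot prompting and fine-tuning, and they notably improve both in-domain and cross-domain judgment prediction accuracy, especially for cases that end in an innocent verdict.

Link: https://arxiv.org/abs/2412.14588
Authors: Kepu Zhang, Haoyue Yang, Xu Tang, Weijie Yu, Jun Xu
Affiliations: Gaoling School of Artificial Intelligence, Renmin University of China; University of International Business and Economics
Keywords: individual conduct constitutes, judges apply, criminal law, sequentially assessing, constitutes a crime
Categories: Computation and Language (cs.CL)
Comments: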

Abstract:In legal practice, judges apply the trichotomous dogmatics of criminal law, sequentially assessing the elements of the offense, unlawfulness, and culpability to determine whether an individual’s conduct constitutes a crime. Although current legal large language models (LLMs) show promising accuracy in judgment prediction, they lack trichotomous reasoning capabilities due to the absence of an appropriate benchmark dataset, preventing them from predicting innocent outcomes. As a result, every input is automatically assigned a charge, limiting their practical utility in legal contexts. To bridge this gap, we introduce LJPIV, the first benchmark dataset for Legal Judgment Prediction with Innocent Verdicts. Adhering to the trichotomous dogmatics, we extend three widely-used legal datasets through LLM-based augmentation and manual verification. Our experiments with state-of-the-art legal LLMs and novel strategies that integrate trichotomous reasoning into zero-shot prompting and fine-tuning reveal: (1) current legal LLMs have significant room for improvement, with even the best models achieving an F1 score of less than 0.3 on LJPIV; and (2) our strategies notably enhance both in-domain and cross-domain judgment prediction accuracy, especially for cases resulting in an innocent verdict.
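
The sequential three-tier assessment described in the abstract can be pictured as a short decision procedure. The sketch below is only an illustration of that ordering, assuming each tier has already been reduced to a boolean finding; the names `CaseAssessment` and `trichotomous_verdict` are hypothetical and are not taken from the paper or the LJPIV dataset.

```python
from dataclasses import dataclass

@dataclass
class CaseAssessment:
    # Hypothetical boolean findings, one per tier named in the abstract.
    elements_of_offense: bool  # does the conduct match a statutory offense?
    unlawful: bool             # is there no justification for the conduct?
    culpable: bool             # can the actor be held responsible?

def trichotomous_verdict(case: CaseAssessment) -> str:
    """Assess the three tiers in order and stop at the first one that fails."""
    if not case.elements_of_offense:
        return "innocent: elements of the offense not met"
    if not case.unlawful:
        return "innocent: conduct not unlawful"
    if not case.culpable:
        return "innocent: no culpability"
    return "guilty"

# A case that satisfies the offense elements but is justified, so tier two fails.
justified = trichotomous_verdict(CaseAssessment(True, False, True))
convicted = trichotomous_verdict(CaseAssessment(True, True, True))
```

A model that always assigns a charge behaves as if all three findings were true, which is the gap an innocent-verdict benchmark is meant to expose.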

[NLP-53] Simulation-Free Hierarchical Latent Policy Planning for Proactive Dialogues AAAI2025

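[Quick Read]: This paper addresses the limited policy planning and adaptability of proactive dialogue systems, especially for complex objectives such as emotional support and persuasion. Existing methods rely on large language models (LLMs) for user simulation and online learning, which introduces bias and inefficiency, and they depend on manually defined, context-independent, coarse-grained policies, which incur high expert costs and raise concerns about completeness. The key to the solution is LDPP, a new dialogue policy planning framework that mines policies automatically from raw, real-world dialogue records: it uses a variant of the Variational Autoencoder (VAE) to discover fine-grained policies and then learns policy planning with an offline hierarchical reinforcement learning algorithm in the latent space. Experiments show that LDPP outperforms existing methods in proactive dialogue scenarios and even surpasses ChatGPT while using only a 1.8-billion-parameter LLM.

Link: https://arxiv.org/abs/2412.14584
Authors: Tao He, Lizi Liao, Yixin Cao, Yuanxing Liu, Yiheng Sun, Zerui Chen, Ming Liu, Bing Qin
Affiliations: SMU
Keywords: garnered significant attention, Recent advancements, Large Language Models, significant attention, complex objectives
Categories: Computation and Language (cs.CL)
Comments: 24 pages, 5 figures, AAAI 2025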

Abstract:Recent advancements in proactive dialogues have garnered significant attention, particularly for more complex objectives (e.g. emotion support and persuasion). Unlike traditional task-oriented dialogues, proactive dialogues demand advanced policy planning and adaptability, requiring rich scenarios and comprehensive policy repositories to develop such systems. However, existing approaches tend to rely on Large Language Models (LLMs) for user simulation and online learning, leading to biases that diverge from realistic scenarios and result in suboptimal efficiency. Moreover, these methods depend on manually defined, context-independent, coarse-grained policies, which not only incur high expert costs but also raise concerns regarding their completeness. In our work, we highlight the potential for automatically discovering policies directly from raw, real-world dialogue records. To this end, we introduce a novel dialogue policy planning framework, LDPP. It fully automates the process from mining policies in dialogue records to learning policy planning. Specifically, we employ a variant of the Variational Autoencoder to discover fine-grained policies represented as latent vectors. After automatically annotating the data with these latent policy labels, we propose an Offline Hierarchical Reinforcement Learning (RL) algorithm in the latent space to develop effective policy planning capabilities. Our experiments demonstrate that LDPP outperforms existing methods on two proactive scenarios, even surpassing ChatGPT with only a 1.8-billion-parameter LLM.

[NLP-54] CORD: Balancing COnsistency and Rank Distillation for Robust Retrieval-Augmented Generation

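[Quick Read]: This paper addresses the position bias of large language models (LLMs) in retrieval-augmented generation (RAG), which keeps them from attending evenly to all retrieved contexts. The key to the solution is CORD (COnsistency and Rank Distillation), which combines consistency regularization with augmentation and distillation to balance consistency against the rank prior. Specifically, CORD augments each training instance with a position-perturbed copy to encourage consistent predictions regardless of ordering, and it adaptively samples noise-controlled perturbations so that consistency is maintained while the ranking priority of the retrieved contexts is still respected. Experiments show that CORD performs strongly across diverse RAG benchmarks.

Link: https://arxiv.org/abs/2412.14581
Authors: Youngwon Lee, Seung-won Hwang, Daniel Campos, Filip Graliński, Zhewei Yao, Yuxiong He
Affiliations: Snowflake AI Research; Seoul National University
Keywords: large language models, large language, language models, adoption of retrieval-augmented, expected to ground
Categories: Computation and Language (cs.CL)
Comments: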

Abstract:With the adoption of retrieval-augmented generation (RAG), large language models (LLMs) are expected to ground their generation to the retrieved contexts. Yet, this is hindered by position bias of LLMs, failing to evenly attend to all contexts. Previous work has addressed this by synthesizing contexts with perturbed positions of gold segment, creating a position-diversified train set. We extend this intuition to propose consistency regularization with augmentation and distillation. First, we augment each training instance with its position perturbation to encourage consistent predictions, regardless of ordering. We also distill behaviors of this pair, although it can be counterproductive in certain RAG scenarios where the given order from the retriever is crucial for generation quality. We thus propose CORD, balancing COnsistency and Rank Distillation. CORD adaptively samples noise-controlled perturbations from an interpolation space, ensuring both consistency and respect for the rank prior. Empirical results show this balance enables CORD to outperform consistently in diverse RAG benchmarks.
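
To make the idea of a noise-controlled position perturbation concrete, here is a small Python sketch. It is an assumption-laden illustration rather than the sampler used by CORD: the function `perturb_order`, its `noise` parameter, and the Gaussian jitter on rank indices are all hypothetical choices that merely interpolate between the retriever's order and a near-random shuffle.

```python
import random

def perturb_order(contexts, noise, rng=None):
    """Reorder retrieved contexts with a controllable amount of position noise.

    noise = 0.0 keeps the retriever's ranking; larger values approach a
    random shuffle. Illustrative scheme only, not the paper's sampler.
    """
    rng = rng or random.Random()
    n = len(contexts)
    # Jitter each rank index with Gaussian noise scaled by the list length,
    # then re-sort the indices by their jittered keys.
    keys = [i + rng.gauss(0.0, noise * n) for i in range(n)]
    order = sorted(range(n), key=lambda i: keys[i])
    return [contexts[i] for i in order]

retrieved = ["doc_a", "doc_b", "doc_c", "doc_d"]  # hypothetical retriever output
unchanged = perturb_order(retrieved, noise=0.0)
shuffled = perturb_order(retrieved, noise=5.0, rng=random.Random(0))
```

A training pair would then consist of the original instance and its perturbed copy, with a consistency loss tying their predictions together; a small `noise` keeps more of the rank prior, a large one pushes harder on order invariance.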

[NLP-55] Sliding Windows Are Not the End: Exploring Full Ranking with Long-Context Large Language Models

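[Quick Read]: This paper addresses the inefficiency and redundant API costs that the sliding window strategy causes in listwise passage ranking. The key to the solution is to use long-context LLMs for full ranking, which avoids repeated processing and multiple evaluations of the relevant passages. The paper proposes two improvements: (1) a new complete listwise label construction approach, which solves the problem that the sliding window strategy cannot produce a full ranking list as a training label; and (2) an importance-aware learning objective that emphasizes the top-ranked passage IDs in the label. Experiments show that these improvements significantly raise both efficiency and performance in the supervised fine-tuning setting.

Link: https://arxiv.org/abs/2412.14574
Authors: Wenhan Liu, Xinyu Ma, Yutao Zhu, Ziliang Zhao, Shuaiqiang Wang, Dawei Yin, Zhicheng Dou
Affiliations: Gaoling School of Artificial Intelligence, Renmin University of China; Baidu Inc., Beijing, China
Keywords: Large Language Models, Large Language, shown exciting performance, redundant API costs, shown exciting
Categories: Information Retrieval (cs.IR); Computation and Language (cs.CL)
Comments: 14 pages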

Abstract:Large Language Models (LLMs) have shown exciting performance in listwise passage ranking. Due to the limited input length, existing methods often adopt the sliding window strategy. Such a strategy, though effective, is inefficient as it involves repetitive and serialized processing, which usually re-evaluates relevant passages multiple times. As a result, it incurs redundant API costs, which are proportional to the number of inference tokens. The development of long-context LLMs enables the full ranking of all passages within a single inference, avoiding redundant API costs. In this paper, we conduct a comprehensive study of long-context LLMs for ranking tasks in terms of efficiency and effectiveness. Surprisingly, our experiments reveal that full ranking with long-context LLMs can deliver superior performance in the supervised fine-tuning setting with a huge efficiency improvement. Furthermore, we identify two limitations of fine-tuning the full ranking model based on existing methods: (1) sliding window strategy fails to produce a full ranking list as a training label, and (2) the language modeling loss cannot emphasize top-ranked passage IDs in the label. To alleviate these issues, we propose a new complete listwise label construction approach and a novel importance-aware learning objective for full ranking. Experiments show the superior performance of our method over baselines. Our codes are available at this https URL.
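
The cost difference between the two strategies is easy to see in a toy simulation. The sketch below is a minimal illustration, assuming a back-to-front sliding window with overlap (window size 4, step 2) and a stand-in `oracle` ranker that sorts by a known relevance score in place of an LLM call; the function names and the toy passages are hypothetical, and `cost` simply counts passages sent to the ranker as a proxy for inference tokens.

```python
def sliding_window_rerank(passages, rank_window, window=4, step=2):
    """Rerank back to front with overlapping windows; return (ranking, cost)."""
    ranked = list(passages)
    cost = 0  # passages sent to the ranker, a proxy for token cost
    end = len(ranked)
    while True:
        start = max(0, end - window)
        ranked[start:end] = rank_window(ranked[start:end])
        cost += end - start
        if start == 0:
            break
        end -= step  # assumes step <= window so windows overlap
    return ranked, cost

def full_rerank(passages, rank_window):
    """Rank every passage in one call, as a long-context model would."""
    return rank_window(list(passages)), len(passages)

def oracle(window):
    # Stand-in for an LLM ranker: sort (id, relevance) pairs by relevance.
    return sorted(window, key=lambda p: p[1], reverse=True)

passages = [(f"p{i}", rel) for i, rel in enumerate([1, 5, 2, 8, 3, 9, 4, 7])]
sw_ranking, sw_cost = sliding_window_rerank(passages, oracle)
full_ranking, full_cost = full_rerank(passages, oracle)
```

In this toy run the sliding window sends 12 passages to the ranker for an 8-passage list and only guarantees the top of the ranking, while the single full pass sends 8 and returns a completely ordered list, which mirrors both the redundancy and the incomplete-label limitation mentioned in the abstract.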

[NLP-56] CitaLaw: Enhancing LLM with Citations in Legal Domain

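[Quick Read]: This paper addresses the lack of legal grounding and citations when large language models (LLMs) generate answers to legal questions. The key to the solution is the CitaLaw benchmark, which pairs a diverse set of legal questions with a comprehensive reference pool of law articles and precedent cases, so that LLMs can retrieve and cite relevant legal sources and align those citations with the content of their responses. The paper also introduces syllogism-inspired evaluation methods that assess the legal alignment between the cited references and the generated responses, as well as their consistency with the user's question. Integrating legal references substantially improves response quality, and the evaluation method agrees strongly with human judgments.

Link: https://arxiv.org/abs/2412.14556
Authors: Kepu Zhang, Weijie Yu, Sunhao Dai, Jun Xu
Affiliations: Gaoling School of Artificial Intelligence, Renmin University of China; University of International Business and Economics
Keywords: evaluate LLMs’ ability, produce legally sound, legally sound responses, benchmark designed, designed to evaluate
Categories: Computation and Language (cs.CL)
Comments: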

Abstract:In this paper, we propose CitaLaw, the first benchmark designed to evaluate LLMs’ ability to produce legally sound responses with appropriate citations. CitaLaw features a diverse set of legal questions for both laypersons and practitioners, paired with a comprehensive corpus of law articles and precedent cases as a reference pool. This framework enables LLM-based systems to retrieve supporting citations from the reference corpus and align these citations with the corresponding sentences in their responses. Moreover, we introduce syllogism-inspired evaluation methods to assess the legal alignment between retrieved references and LLM-generated responses, as well as their consistency with user questions. Extensive experiments on 2 open-domain and 7 legal-specific LLMs demonstrate that integrating legal references substantially enhances response quality. Furthermore, our proposed syllogism-based evaluation method exhibits strong agreement with human judgments.

[NLP-57] ClusterTalk: Corpus Exploration Framework using Multi-Dimensional Exploratory Search

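[Quick Read]: This paper addresses exploratory search over large text corpora, such as those in biomedical research. The key to the solution is the ClusterTalk framework, which supports multi-dimensional exploratory search by combining document clustering with faceted search, so that users can refine their results interactively and ask both corpus-level and document-level queries. Compared with traditional one-dimensional approaches such as keyword search or clustering, ClusterTalk improves the discoverability of information by encouraging deeper interaction with the corpus.

Link: https://arxiv.org/abs/2412.14533
Authors: Ashish Chouhan, Saifeldin Mandour, Michael Gertz
Affiliations: Unknown
Keywords: large text corpora, continuously generated, large text, biomedical research, large amounts
Categories: Computation and Language (cs.CL)
Comments: 5 pages, 1 figure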

Abstract:Exploratory search of large text corpora is essential in domains like biomedical research, where large amounts of research literature are continuously generated. This paper presents ClusterTalk (The demo video and source code are available at: this https URL), a framework for corpus exploration using multi-dimensional exploratory search. Our system integrates document clustering with faceted search, allowing users to interactively refine their exploration and ask corpus and document-level queries. Compared to traditional one-dimensional search approaches like keyword search or clustering, this system improves the discoverability of information by encouraging a deeper interaction with the corpus. We demonstrate the functionality of the ClusterTalk framework based on four million PubMed abstracts for the four-year time frame.

[NLP-58] Multi-Level Optimal Transport for Universal Cross-Tokenizer Knowledge Distillation on Language Models AAAI2025

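[Quick Read]: This paper addresses a limitation of existing Knowledge Distillation (KD) methods for large language models (LLMs) of different architecture families: they require the teacher and student models to share the same tokenizer. The key to the solution is Multi-Level Optimal Transport (MultiLevelOT), which aligns the logit distributions of the teacher and the student at both the token level and the sequence level using diverse cost matrices, removing the need for dimensional or token-by-token correspondence. At the token level, MultiLevelOT combines global and local information by jointly optimizing all tokens within a sequence, which improves robustness; at the sequence level, it uses the Sinkhorn distance to capture the complex structure of the logit distributions efficiently, enabling cross-tokenizer knowledge distillation.

Link: https://arxiv.org/abs/2412.14528
Authors: Xiao Cui, Mo Zhu, Yulei Qin, Liang Xie, Wengang Zhou, Houqiang Li
Affiliations: 1. Tencent Youtu Lab; 2. University of Science and Technology of China; 3. Shanghai Jiao Tong University; 4. Peng Cheng Laboratory
Keywords: compressing large language, large language models, Knowledge distillation, prevalent technique, technique for compressing
Categories: Computation and Language (cs.CL)
Comments: Accepted by AAAI 2025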

Abstract:Knowledge distillation (KD) has become a prevalent technique for compressing large language models (LLMs). Existing KD methods are constrained by the need for identical tokenizers (i.e., vocabularies) between teacher and student models, limiting their versatility in handling LLMs of different architecture families. In this paper, we introduce the Multi-Level Optimal Transport (MultiLevelOT), a novel approach that advances the optimal transport for universal cross-tokenizer knowledge distillation. Our method aligns the logit distributions of the teacher and the student at both token and sequence levels using diverse cost matrices, eliminating the need for dimensional or token-by-token correspondence. At the token level, MultiLevelOT integrates both global and local information by jointly optimizing all tokens within a sequence to enhance robustness. At the sequence level, we efficiently capture complex distribution structures of logits via the Sinkhorn distance, which approximates the Wasserstein distance for divergence measures. Extensive experiments on tasks such as extractive QA, generative QA, and summarization demonstrate that the MultiLevelOT outperforms state-of-the-art cross-tokenizer KD methods under various settings. Our approach is robust to different student and teacher models across model families, architectures, and parameter sizes.
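
Since the abstract leans on the Sinkhorn distance as an approximation of the Wasserstein distance, a short pure-Python sketch of the standard Sinkhorn iterations may help. It is a generic textbook version under stated assumptions, not the MultiLevelOT loss: the function name, the regularization strength `eps`, the iteration count, and the toy cost matrices are all illustrative choices. Note that the two histograms may have different lengths, which is the property that matters when teacher and student vocabularies differ.

```python
import math

def sinkhorn_distance(a, b, cost, eps=0.1, n_iters=200):
    """Entropy-regularized optimal transport cost between histograms a and b."""
    n, m = len(a), len(b)
    # Gibbs kernel derived from the cost matrix.
    K = [[math.exp(-cost[i][j] / eps) for j in range(m)] for i in range(n)]
    u, v = [1.0] * n, [1.0] * m
    for _ in range(n_iters):
        # Alternately rescale rows and columns to match the two marginals.
        u = [a[i] / sum(K[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [b[j] / sum(K[i][j] * u[i] for i in range(n)) for j in range(m)]
    # Transport plan P = diag(u) K diag(v); the distance is the sum of P * cost.
    return sum(u[i] * K[i][j] * v[j] * cost[i][j]
               for i in range(n) for j in range(m))

# Identical two-bin histograms with a 0/1 cost: almost nothing needs to move.
d_same = sinkhorn_distance([0.5, 0.5], [0.5, 0.5], [[0.0, 1.0], [1.0, 0.0]])

# Histograms of different sizes (3 bins vs 2 bins) with a hypothetical cost.
d_diff = sinkhorn_distance(
    [0.2, 0.3, 0.5], [0.6, 0.4],
    [[0.0, 1.0], [0.5, 0.5], [1.0, 0.0]],
)
```

Smaller `eps` values bring the result closer to the unregularized transport cost but can underflow in `math.exp`; practical implementations usually work in log space.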

[NLP-59] Cal-DPO: Calibrated Direct Preference Optimization for Language Model Alignment NEURIPS2024

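[Quick Read]: This paper addresses the alignment of large language models (LLMs) with human preference data, in particular the fact that existing contrastive preference optimization methods look only at the relative values of implicit rewards and ignore their actual values, which leads to suboptimal alignment. The key to the solution is an algorithm called Calibrated Direct Preference Optimization (Cal-DPO), which calibrates the implicit reward so that it is comparable in scale to the ground-truth reward and thereby improves alignment substantially. Experiments show that Cal-DPO significantly outperforms existing methods on a variety of standard benchmarks.

Link: https://arxiv.org/abs/2412.14516
Authors: Teng Xiao, Yige Yuan, Huaisheng Zhu, Mingxiao Li, Vasant G Honavar
Affiliations: Artificial Intelligence Research Laboratory, Pennsylvania State University; University of Chinese Academy of Sciences; Tencent AI Lab
Keywords: large language models, aligning large language, human preference data, language models, study the problem
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments: Accepted by NeurIPS 2024 Main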

Abstract:We study the problem of aligning large language models (LLMs) with human preference data. Contrastive preference optimization has shown promising results in aligning LLMs with available preference data by optimizing the implicit reward associated with the policy. However, the contrastive objective focuses mainly on the relative values of implicit rewards associated with two responses while ignoring their actual values, resulting in suboptimal alignment with human preferences. To address this limitation, we propose calibrated direct preference optimization (Cal-DPO), a simple yet effective algorithm. We show that substantial improvement in alignment with the given preferences can be achieved simply by calibrating the implicit reward to ensure that the learned implicit rewards are comparable in scale to the ground-truth rewards. We demonstrate the theoretical advantages of Cal-DPO over existing approaches. The results of our experiments on a variety of standard benchmarks show that Cal-DPO remarkably improves off-the-shelf methods.
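
The limitation named in the abstract, that a contrastive objective sees only the difference between two implicit rewards, can be checked numerically. The sketch below uses the widely known DPO-style loss, -log sigmoid(r_chosen - r_rejected), where each r stands for an implicit reward of the form beta * log(pi(y|x) / pi_ref(y|x)). It illustrates the motivation only: the calibration term that Cal-DPO adds is not reproduced here, and the reward values are invented for the example.

```python
import math

def dpo_loss(reward_chosen, reward_rejected):
    """Contrastive preference loss: -log sigmoid(r_chosen - r_rejected)."""
    margin = reward_chosen - reward_rejected
    # log(1 + exp(-margin)) equals -log(sigmoid(margin)).
    return math.log1p(math.exp(-margin))

# Two reward pairs with the same margin of 1.0 but very different absolute values.
loss_a = dpo_loss(2.0, 1.0)
loss_b = dpo_loss(-8.0, -9.0)  # both rewards shifted down by 10
```

Both calls return the same loss, so the contrastive objective alone cannot tell whether the chosen response's implicit reward is rising or falling in absolute terms; constraining the scale of the rewards is what a calibration step is meant to add.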

[NLP-60] PA-RAG: RAG Alignment via Multi-Perspective Preference Optimization

[Quick Read]: This paper addresses the shortcomings of the generator in Retrieval-augmented generation (RAG) systems with respect to response informativeness, response robustness, and citation quality. The key to the solution is Multiple Perspective Preference Alignment (PA-RAG), which constructs high-quality instruction fine-tuning data and multi-perspective preference data and then optimizes the generator with Supervised Fine-Tuning (SFT) followed by Direct Preference Optimization (DPO), improving the performance of the RAG system comprehensively.

Link: https://arxiv.org/abs/2412.14510
Authors: Jiayi Wu, Hengyi Cai, Lingyong Yan, Hao Sun, Xiang Li, Shuaiqiang Wang, Dawei Yin, Ming Gao
Affiliations: East China Normal University; Chinese Academy of Sciences; Baidu Inc; Peking University
Keywords: large language models, reveals numerous limitations, RAG, RAG generator, language models
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:The emergence of Retrieval-augmented generation (RAG) has alleviated the issues of outdated and hallucinatory content in the generation of large language models (LLMs), yet it still reveals numerous limitations. When a general-purpose LLM serves as the RAG generator, it often suffers from inadequate response informativeness, response robustness, and citation quality. Past approaches to tackle these limitations, either by incorporating additional steps beyond generating responses or optimizing the generator through supervised fine-tuning (SFT), still failed to align with the RAG requirement thoroughly. Consequently, optimizing the RAG generator from multiple preference perspectives while maintaining its end-to-end LLM form remains a challenge. To bridge this gap, we propose Multiple Perspective Preference Alignment for Retrieval-Augmented Generation (PA-RAG), a method for optimizing the generator of RAG systems to align with RAG requirements comprehensively. Specifically, we construct high-quality instruction fine-tuning data and multi-perspective preference data by sampling varied quality responses from the generator across different prompt documents quality scenarios. Subsequently, we optimize the generator using SFT and Direct Preference Optimization (DPO). Extensive experiments conducted on four question-answer datasets across three LLMs demonstrate that PA-RAG can significantly enhance the performance of RAG generators. Our code and datasets are available at this https URL.

[NLP-61] Do Large Language Models Defend Inferentialist Semantics?: On the Logical Expressivism and Anti-Representationalism of LLMs

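[Quick Read]: This paper asks which foundational semantics best explains large language models (LLMs) such as ChatGPT and Claude, now that they show human-like linguistic abilities and the traditional account based on distributional semantics is being questioned. It proposes Robert Brandom's inferentialist semantics as an alternative foundational semantics for LLMs, focusing on the issue of linguistic representationalism within the post-anthropocentric trend. The key to the argument is that the anti-representationalism, logical expressivism, and quasi-compositionality of inferentialist semantics help to interpret the characteristics and behaviors of LLMs. The paper also proposes a consensus theory of truths for LLMs and argues that the characteristics of LLMs challenge mainstream assumptions in the philosophy of language, such as semantic externalism and compositionality, prompting a re-evaluation of anti-representationalist views of language.

Link: https://arxiv.org/abs/2412.14501
Authors: Yuzuki Arai, Sho Tsugawa
Affiliations: College of Media Arts, Science and Technology, School of Informatics, University of Tsukuba; Institute of Systems and Information Engineering, University of Tsukuba
Keywords: large language models, possess linguistic abilities, linguistic abilities comparable, anthropocentric lens, historically been developed
Categories: Computation and Language (cs.CL)
Comments: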

Abstract:The philosophy of language, which has historically been developed through an anthropocentric lens, is now being forced to move towards post-anthropocentrism due to the advent of large language models (LLMs) like ChatGPT (OpenAI), Claude (Anthropic), which are considered to possess linguistic abilities comparable to those of humans. Traditionally, LLMs have been explained through distributional semantics as their foundational semantics. However, recent research is exploring alternative foundational semantics beyond distributional semantics. This paper proposes Robert Brandom’s inferentialist semantics as a suitable foundational semantics for LLMs, specifically focusing on the issue of linguistic representationalism within this post-anthropocentric trend. Here, we show that the anti-representationalism and logical expressivism of inferential semantics, as well as quasi-compositionality, are useful in interpreting the characteristics and behaviors of LLMs. Further, we propose a consensus theory of truths for LLMs. This paper argues that the characteristics of LLMs challenge mainstream assumptions in philosophy of language, such as semantic externalism and compositionality. We believe the argument in this paper leads to a re-evaluation of anti-representationalist views of language, potentially leading to new developments in the philosophy of language.

[NLP-62] GraphEQA: Using 3D Semantic Scene Graphs for Real-time Embodied Question Answering

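[Quick Read]: This paper addresses the challenge in Embodied Question Answering (EQA) of having an agent explore an unseen environment and build enough semantic understanding to answer a question with confidence. The key to the solution is GraphEQA, which uses real-time 3D metric-semantic scene graphs (3DSGs) and task-relevant images as multi-modal memory to ground Vision-Language Models (VLMs) in the environment. A hierarchical planning approach exploits the hierarchical structure of 3DSGs for structured planning and semantic-guided exploration, achieving higher task success rates with fewer planning steps in both simulated and real-world environments.

Link: https://arxiv.org/abs/2412.14480
Authors: Saumya Saxena, Blake Buchanan, Chris Paxton, Bingqing Chen, Narunas Vaskevicius, Luigi Palmieri, Jonathan Francis, Oliver Kroemer
Affiliations: Unknown
Keywords: Embodied Question Answering, Question Answering, Embodied Question, situated question, agents must explore
Categories: Robotics (cs.RO); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: Project website: this https URL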

Abstract:In Embodied Question Answering (EQA), agents must explore and develop a semantic understanding of an unseen environment in order to answer a situated question with confidence. This remains a challenging problem in robotics, due to the difficulties in obtaining useful semantic representations, updating these representations online, and leveraging prior world knowledge for efficient exploration and planning. Aiming to address these limitations, we propose GraphEQA, a novel approach that utilizes real-time 3D metric-semantic scene graphs (3DSGs) and task relevant images as multi-modal memory for grounding Vision-Language Models (VLMs) to perform EQA tasks in unseen environments. We employ a hierarchical planning approach that exploits the hierarchical nature of 3DSGs for structured planning and semantic-guided exploration. Through experiments in simulation on the HM-EQA dataset and in the real world in home and office environments, we demonstrate that our method outperforms key baselines by completing EQA tasks with higher success rates and fewer planning steps.

[NLP-63] MegaPairs: Massive Data Synthesis For Universal Multimodal Retrieval

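[Quick Read]: This paper addresses the shortage of training data in multimodal retrieval. The key to the solution is MegaPairs, a new data synthesis method that uses Vision Language Models (VLMs) and open-domain images to generate a large-scale synthetic dataset. The data it produces is of high quality: a multimodal retriever trained on it significantly outperforms a baseline trained on 70 times more data from existing datasets. Because MegaPairs relies only on general image corpora and open-source VLMs, it scales easily and allows continuous improvements in retrieval performance. The authors generated more than 26 million training instances and trained several models of different sizes on them; these models reach state-of-the-art zero-shot performance on multiple benchmarks and show notable gains after downstream fine-tuning.

Link: https://arxiv.org/abs/2412.14475
Authors: Junjie Zhou, Zheng Liu, Ze Liu, Shitao Xiao, Yueze Wang, Bo Zhao, Chen Jason Zhang, Defu Lian, Yongping Xiong
Affiliations: Beijing University of Posts and Telecommunications; Beijing Academy of Artificial Intelligence; University of Science and Technology of China; Shanghai Jiaotong University; The Hong Kong Polytechnic University
Keywords: rapidly growing demand, remains severely constrained, field remains severely, rapidly growing, growing demand
Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Comments: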

Abstract:Despite the rapidly growing demand for multimodal retrieval, progress in this field remains severely constrained by a lack of training data. In this paper, we introduce MegaPairs, a novel data synthesis method that leverages vision language models (VLMs) and open-domain images, together with a massive synthetic dataset generated from this method. Our empirical analysis shows that MegaPairs generates high-quality data, enabling the multimodal retriever to significantly outperform the baseline model trained on 70× more data from existing datasets. Moreover, since MegaPairs solely relies on general image corpora and open-source VLMs, it can be easily scaled up, enabling continuous improvements in retrieval performance. In this stage, we produced more than 26 million training instances and trained several models of varying sizes using this data. These new models achieve state-of-the-art zero-shot performance across 4 popular composed image retrieval (CIR) benchmarks and the highest overall performance on the 36 datasets provided by MMEB. They also demonstrate notable performance improvements with additional downstream fine-tuning. Our produced dataset, well-trained models, and data synthesis pipeline will be made publicly available to facilitate the future development of this field.

[NLP-64] Why We Build Local Large Language Models: An Observational Analysis from 35 Japanese and Multilingual LLMs

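【Quick read】: This paper examines several key questions about local large language models (LLMs): why build them, what a local LLM should learn from the target language, which abilities transfer from other languages, and whether language-specific scaling laws exist. The authors evaluate 35 Japanese, English, and multilingual LLMs on 19 Japanese and English benchmarks, take an observational approach to analyze correlations between benchmark scores, and apply principal component analysis (PCA) to derive the ability factors of local LLMs. They find that training on English text improves scores on Japanese academic subjects (JMMLU), and that Japanese code generation, arithmetic reasoning, commonsense, and reading comprehension do not require training specifically on Japanese text. Training on Japanese text does improve question answering about Japanese knowledge and English-Japanese translation, so these two abilities can be regarded as the Japanese abilities of LLMs, and they scale with the computational budget spent on Japanese text.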

Link: https://arxiv.org/abs/2412.14471
Authors: Koshiro Saito,Sakae Mizuki,Masanari Ohi,Taishi Nakamura,Taihei Shiotani,Koki Maeda,Youmi Ma,Kakeru Hattori,Kazuki Fujii,Takumi Okamoto,Shigeki Ishida,Hiroya Takamura,Rio Yokota,Naoaki Okazaki
Affiliations: Institute of Science Tokyo; National Institute of Advanced Industrial Science and Technology
Keywords: large language models, Japanese, build local large, local large language, Japanese text
Categories: Computation and Language (cs.CL)
Comments: Preprint. Under review

Abstract:Why do we build local large language models (LLMs)? What should a local LLM learn from the target language? Which abilities can be transferred from other languages? Do language-specific scaling laws exist? To explore these research questions, we evaluated 35 Japanese, English, and multilingual LLMs on 19 evaluation benchmarks for Japanese and English, taking Japanese as a local language. Adopting an observational approach, we analyzed correlations of benchmark scores, and conducted principal component analysis (PCA) on the scores to derive "ability factors" of local LLMs. We found that training on English text can improve the scores of academic subjects in Japanese (JMMLU). In addition, it is unnecessary to specifically train on Japanese text to enhance abilities for solving Japanese code generation, arithmetic reasoning, commonsense, and reading comprehension tasks. In contrast, training on Japanese text could improve question-answering tasks about Japanese knowledge and English-Japanese translation, which indicates that abilities for solving these two tasks can be regarded as "Japanese abilities" for LLMs. Furthermore, we confirmed that the Japanese abilities scale with the computational budget for Japanese text.
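
To make the observational analysis in this abstract concrete, here is a minimal sketch of the correlation step that precedes PCA: pairwise Pearson correlations between benchmark score columns. The benchmark names and scores are made-up placeholders, not values from the paper, and the helper names `pearson` and `correlation_table` are hypothetical.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation between two equally long, non-constant score lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

def correlation_table(scores):
    """Correlate every pair of benchmarks; scores maps benchmark -> per-model scores."""
    names = list(scores)
    return {(a, b): pearson(scores[a], scores[b]) for a in names for b in names}

# Hypothetical scores for four models (one list entry per model).
toy_scores = {
    "ja_knowledge_qa": [0.20, 0.35, 0.50, 0.65],
    "en_ja_translation": [0.22, 0.33, 0.52, 0.61],
    "ja_code_generation": [0.60, 0.30, 0.55, 0.35],
}

table = correlation_table(toy_scores)
for (a, b), r in table.items():
    if a < b:  # print each unordered pair once
        print(f"{a} vs {b}: r = {r:+.2f}")
```

Benchmarks whose scores rise and fall together across models end up strongly correlated, which is the kind of grouping a subsequent PCA would summarize as a shared factor.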

[NLP-65] Agent-SafetyBench: Evaluating the Safety of LLM Agents

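【Quick read】: This paper addresses the new safety challenges that arise when large language models (LLMs) act as agents in interactive environments and use tools. The key to the solution is Agent-SafetyBench, a comprehensive benchmark for evaluating the safety of LLM agents. It covers 349 interaction environments and 2,000 test cases, and evaluates 8 categories of safety risk and 10 common failure modes in unsafe interactions. Through quantitative analysis, the paper identifies two fundamental safety defects in current LLM agents: lack of robustness and lack of risk awareness. It also finds that defense prompts alone are not enough to resolve these issues and that more advanced and robust strategies are needed.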

Link: https://arxiv.org/abs/2412.14470
Authors: Zhexin Zhang,Shiyao Cui,Yida Lu,Jingzhuo Zhou,Junxiao Yang,Hongning Wang,Minlie Huang
Affiliations: Unknown
Keywords: large language models, LLM agents, language models, large language, increasingly deployed
Categories: Computation and Language (cs.CL)
Comments: 23 pages, 9 figures

Abstract:As large language models (LLMs) are increasingly deployed as agents, their integration into interactive environments and tool use introduce new safety challenges beyond those associated with the models themselves. However, the absence of comprehensive benchmarks for evaluating agent safety presents a significant barrier to effective assessment and further improvement. In this paper, we introduce Agent-SafetyBench, a comprehensive benchmark designed to evaluate the safety of LLM agents. Agent-SafetyBench encompasses 349 interaction environments and 2,000 test cases, evaluating 8 categories of safety risks and covering 10 common failure modes frequently encountered in unsafe interactions. Our evaluation of 16 popular LLM agents reveals a concerning result: none of the agents achieves a safety score above 60%. This highlights significant safety challenges in LLM agents and underscores the considerable need for improvement. Through quantitative analysis, we identify critical failure modes and summarize two fundamental safety defects in current LLM agents: lack of robustness and lack of risk awareness. Furthermore, our findings suggest that reliance on defense prompts alone is insufficient to address these safety issues, emphasizing the need for more advanced and robust strategies. We release Agent-SafetyBench at this https URL to facilitate further research and innovation in agent safety evaluation and improvement.

[NLP-66] From Human Annotation to LLMs: SILICON Annotation Workflow for Management Research

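【Quick read】: This paper addresses how management research can systematically evaluate and apply large language models (LLMs) to annotate unstructured text data. The key to the solution is the SILICON (Systematic Inference with LLMs for Information Classification and Notation) workflow, which combines established principles of human annotation with systematic prompt optimization and model selection. It tackles four challenges: developing robust annotation guidelines, establishing high-quality human baselines, optimizing prompts, and ensuring reproducibility across LLMs. Seven case studies covering common management research tasks validate the workflow and highlight the importance of validating annotation guideline agreement, the superiority of expert-developed human baselines over crowdsourced ones, the iterative nature of prompt optimization, and the need to test multiple LLMs. The paper also proposes a regression-based method for empirically comparing LLM outputs across prompts and models, giving management research a reproducible and rigorous LLM annotation process.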

Link: https://arxiv.org/abs/2412.14461
Authors: Xiang Cheng,Raveesh Mayya,João Sedoc
Affiliations: Unknown
Keywords: Unstructured text data, textbf, Unstructured text, Large Language Models, crowdsourcing platforms
Categories: Computation and Language (cs.CL)
Comments:

Abstract:Unstructured text data annotation and analysis are fundamental to management research, often relying on human annotators through crowdsourcing platforms. While Large Language Models (LLMs) promise to provide a cost-effective and efficient alternative to human annotation, there is no systematic workflow that evaluates when LLMs are suitable or how to proceed with LLM-based text annotation in a reproducible manner. This paper addresses this methodological gap by introducing the "SILICON" (Systematic Inference with LLMs for Information Classification and Notation) workflow. The workflow integrates established principles of human annotation with systematic prompt optimization and model selection, addressing challenges such as developing robust annotation guidelines, establishing high-quality human baselines, optimizing prompts, and ensuring reproducibility across LLMs. We validate the SILICON workflow through seven case studies covering common management research tasks, including business proposal evaluation, dialog intent and breakdown analysis, and review attribute detection. Our findings highlight the importance of validating annotation guideline agreement, the superiority of expert-developed human baselines over crowdsourced ones, the iterative nature of prompt optimization, and the necessity of testing multiple LLMs. Notably, we propose a regression-based methodology to empirically compare LLM outputs across prompts and models. Our workflow advances management research by establishing reproducible processes for LLM-based annotation that maintain scientific rigor. We provide practical guidance for researchers to effectively navigate the evolving landscape of generative AI tools while maintaining transparency and reproducibility.

[NLP-67] Are Longer Prompts Always Better? Prompt Selection in Large Language Models for Recommendation Systems

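【Quick read】: This paper addresses how to select prompts effectively in LLM-based recommendation systems (LLM-RSs) to improve recommendation accuracy. The key to the solution is choosing a prompt according to the characteristics of the dataset rather than relying on a single prompt. Across 450 experiments with 90 prompts and five real-world datasets, the authors find that prompt performance varies substantially between datasets, and they propose a prompt selection method based on dataset characteristics that achieves higher accuracy. They also introduce a cost-efficient strategy that uses high-performance, cost-efficient LLMs to reduce exploration costs while maintaining high prediction accuracy.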

Link: https://arxiv.org/abs/2412.14454
Authors: Genki Kusano,Kosuke Akimoto,Kunihiro Takeoka
Affiliations: Unknown
Keywords: requiring extensive training, large language models, based recommendation systems, extensive training data, accurately predicting user
Categories: Information Retrieval (cs.IR); Computation and Language (cs.CL)
Comments: 15 pages

Abstract:In large language models (LLM)-based recommendation systems (LLM-RSs), accurately predicting user preferences by leveraging the general knowledge of LLMs is possible without requiring extensive training data. By converting recommendation tasks into natural language inputs called prompts, LLM-RSs can efficiently solve issues that have been difficult to address due to data scarcity but are crucial in applications such as cold-start and cross-domain problems. However, when applying this in practice, selecting the prompt that matches tasks and data is essential. Although numerous prompts have been proposed in LLM-RSs and representing the target user in prompts significantly impacts recommendation accuracy, there are still no clear guidelines for selecting specific prompts. In this paper, we categorize and analyze prompts from previous research to establish practical prompt selection guidelines. Through 450 experiments with 90 prompts and five real-world datasets, we examined the relationship between prompts and dataset characteristics in recommendation accuracy. We found that no single prompt consistently outperforms others; thus, selecting prompts on the basis of dataset characteristics is crucial. Here, we propose a prompt selection method that achieves higher accuracy with minimal validation data. Because increasing the number of prompts to explore raises costs, we also introduce a cost-efficient strategy using high-performance and cost-efficient LLMs, significantly reducing exploration costs while maintaining high prediction accuracy. Our work offers valuable insights into the prompt selection, advancing accurate and efficient LLM-RSs.

[NLP-68] ORBIT: Cost-Effective Dataset Curation for Large Language Model Domain Adaptation with an Astronomy Case Study

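【Quick read】: This paper addresses the lack of high-quality, specialized training data for generative AI in domains such as astronomy, law, and medicine. The key to the solution is ORBIT, a cost-efficient method for filtering noisy web sources into large-scale, high-quality domain-specific datasets. With careful data processing and domain adaptation training, ORBIT substantially improves performance on domain tasks: in the astronomy case study, fine-tuning LLaMA-3-8B raises its score on the MMLU astronomy benchmark from 69% to 76% and yields top results on AstroBench. The method's generalizability is also validated by applying it to law and medicine.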

Link: https://arxiv.org/abs/2412.14436
Authors: Eric Modesitt,Ke Yang,Spencer Hulsey,Chengxiang Zhai,Volodymyr Kindratenko
Affiliations: University of Illinois at Urbana-Champaign; National Center for Supercomputing Applications
Keywords: require specialized knowledge, Recent advances, language modeling demonstrate, specialized knowledge, modeling demonstrate
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:Recent advances in language modeling demonstrate the need for high-quality domain-specific training data, especially for tasks that require specialized knowledge. General-purpose models, while versatile, often lack the depth needed for expert-level tasks because of limited domain-specific information. Domain adaptation training can enhance these models, but it demands substantial, high-quality data. To address this, we propose ORBIT, a cost-efficient methodology for curating massive, high-quality domain-specific datasets from noisy web sources, tailored for training specialist large language models. Using astronomy as a primary case study, we refined the 1.3T-token FineWeb-Edu dataset into a high-quality, 10B-token subset focused on astronomy. Fine-tuning LLaMA-3-8B on a 1B-token astronomy subset improved performance on the MMLU astronomy benchmark from 69% to 76% and achieved top results on AstroBench, an astronomy-specific benchmark. Moreover, our model (Orbit-LLaMA) outperformed LLaMA-3-8B-base, with GPT-4o evaluations preferring it in 73% of cases across 1000 astronomy-specific questions. Additionally, we validated ORBIT's generalizability by applying it to law and medicine, achieving a significant improvement of data quality compared to an unfiltered baseline. We open-source the ORBIT methodology, including the curated datasets, the codebase, and the resulting model at this https URL.

[NLP-69] All-in-One Tuning and Structural Pruning for Domain-Specific LLMs

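【Quick read】: This paper addresses the performance degradation caused by the two-stage prune-then-fine-tune pipeline used to adapt large language models (LLMs) to specific domains. In existing methods, pruning decisions are derived from the pretrained weights and stay fixed during fine-tuning, so they may no longer match the updated weights. The proposed solution is ATP (All-in-One Tuning and Structural Pruning), a unified one-stage structural pruning and fine-tuning method in which a trainable pruning decision generator dynamically identifies the current optimal substructure throughout fine-tuning. ATP also incorporates Low-Rank Adaptation (LoRA), with a LoRA-aware forward pass and sparsity regularization that ensure the substructures corresponding to the learned pruning decisions can be removed directly once ATP finishes. The method outperforms state-of-the-art two-stage pruning methods on legal and healthcare tasks.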

Link: https://arxiv.org/abs/2412.14426
Authors: Lei Lu,Zhepeng Wang,Ruexue Bao,Mengbing Wang,Fangyi Li,Yawen Wu,Weiwen Jiang,Jie Xu,Yanzhi Wang,Shangqian Gao
Affiliations: Northeastern University; George Mason University; GE HealthCare; University of Pennsylvania; University of Pittsburgh; University of Florida; Florida State University
Keywords: Existing pruning techniques, applications typically follow, Existing pruning, large language models, pretrained general-purpose LLMs
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:Existing pruning techniques for large language models (LLMs) targeting domain-specific applications typically follow a two-stage process: pruning the pretrained general-purpose LLMs and then fine-tuning the pruned LLMs on specific domains. However, the pruning decisions, derived from the pretrained weights, remain unchanged during fine-tuning, even if the weights have been updated. Therefore, such a combination of the pruning decisions and the finetuned weights may be suboptimal, leading to non-negligible performance degradation. To address these limitations, we propose ATP: All-in-One Tuning and Structural Pruning, a unified one-stage structural pruning and fine-tuning approach that dynamically identifies the current optimal substructure throughout the fine-tuning phase via a trainable pruning decision generator. Moreover, given the limited available data for domain-specific applications, Low-Rank Adaptation (LoRA) becomes a common technique to fine-tune the LLMs. In ATP, we introduce LoRA-aware forward and sparsity regularization to ensure that the substructures corresponding to the learned pruning decisions can be directly removed after the ATP process. ATP outperforms the state-of-the-art two-stage pruning methods on tasks in the legal and healthcare domains. More specifically, ATP recovers up to 88% and 91% performance of the dense model when pruning 40% parameters of LLaMA2-7B and LLaMA3-8B models, respectively.

[NLP-70] In-Group Love Out-Group Hate: A Framework to Measure Affective Polarization via Contentious Online Discussions

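【Quick read】: This paper addresses the quantification and real-time monitoring of affective polarization in social networks, particularly the emotional dynamics around contentious issues such as masking and lockdowns during the COVID-19 pandemic. The key to the solution is a discrete choice model that captures decision-making in affectively polarized social networks, together with a statistical inference method that estimates two key parameters, in-group love and out-group hate, from social media data. Empirical validation shows that the approach accurately captures real-world polarization dynamics and explains the rapid emergence of a partisan gap in attitudes. The framework provides a tool for tracking affective polarization across contentious issues and has broad implications for fostering constructive online dialogue in digital spaces.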

Link: https://arxiv.org/abs/2412.14414
Authors: Buddhika Nettasinghe,Ashwin Rao,Bohan Jiang,Allon Percus,Kristina Lerman
Affiliations: University of Iowa; USC Information Sciences Institute; Arizona State University; Claremont Graduate University
Keywords: United States, ideological groups marked, Affective polarization, driving contentious issues, divide between ideological
Categories: Social and Information Networks (cs.SI); Computation and Language (cs.CL); Computers and Society (cs.CY)
Comments:

Abstract:Affective polarization, the emotional divide between ideological groups marked by in-group love and out-group hate, has intensified in the United States, driving contentious issues like masking and lockdowns during the COVID-19 pandemic. Despite its societal impact, existing models of opinion change neither account for emotional dynamics nor offer methods to quantify affective polarization robustly and in real time. In this paper, we introduce a discrete choice model that captures decision-making within affectively polarized social networks and propose a statistical inference method to estimate key parameters (in-group love and out-group hate) from social media data. Through empirical validation from online discussions about the COVID-19 pandemic, we demonstrate that our approach accurately captures real-world polarization dynamics and explains the rapid emergence of a partisan gap in attitudes towards masking and lockdowns. This framework allows for tracking affective polarization across contentious issues and has broad implications for fostering constructive online dialogues in digital spaces.

[NLP-71] ECG-Byte: A Tokenizer for End-to-End Generative Electrocardiogram Language Modeling

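【Quick read】: This paper addresses two main problems in existing methods for natural language generation (NLG) from multi-channel electrocardiograms (ECGs): the inefficiency of two-stage training and the limited interpretability of features produced by a pretrained encoder. The key to the solution is ECG-Byte, a tokenizer pipeline based on byte pair encoding (BPE) for autoregressive language modeling of ECGs. ECG-Byte compresses and encodes ECG signals into tokens that map directly back to the original signal, which enables end-to-end LLM training and improves both training efficiency and interpretability.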

Link: https://arxiv.org/abs/2412.14373
Authors: William Han,Chaojing Duan,Michael A. Rosenberg,Emerson Liu,Ding Zhao
Affiliations: Carnegie Mellon University; Allegheny Health Network; University of Colorado
Keywords: Large Language Models, shown remarkable adaptability, Large Language, Language Models, shown remarkable
Categories: Computation and Language (cs.CL); Signal Processing (eess.SP)
Comments: 26 pages, 17 figures

Abstract:Large Language Models (LLMs) have shown remarkable adaptability across domains beyond text, specifically electrocardiograms (ECGs). More specifically, there is a growing body of work exploring the task of generating text from a multi-channeled ECG and corresponding textual prompt. Current approaches typically involve pretraining an ECG-specific encoder with a self-supervised learning (SSL) objective and using the features output by the pretrained encoder to finetune an LLM for natural language generation (NLG). However, these methods are limited by 1) inefficiency from two-stage training and 2) interpretability challenges with encoder-generated features. To address these limitations, we introduce ECG-Byte, an adapted byte pair encoding (BPE) tokenizer pipeline for autoregressive language modeling of ECGs. This approach compresses and encodes ECG signals into tokens, enabling end-to-end LLM training by combining ECG and text tokens directly, while being much more interpretable since the ECG tokens can be directly mapped back to the original signal. Using ECG-Byte, we achieve competitive performance in NLG tasks in only half the time and ~48% of the data required by two-stage approaches.
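
The general idea of applying byte pair encoding to a signal can be sketched as follows. This is a toy illustration, not the ECG-Byte pipeline itself: the quantization scheme, the number of levels, and all function names are assumptions made for the example. Samples are binned into a small symbol alphabet, the most frequent adjacent pair is merged repeatedly, and each token keeps its constituent symbols so it can be decoded back to the quantized signal.

```python
from collections import Counter

def quantize(signal, levels=4, lo=-1.0, hi=1.0):
    """Map each float sample to an integer bin in [0, levels - 1]."""
    symbols = []
    for x in signal:
        x = min(max(x, lo), hi)  # clip to the assumed amplitude range
        idx = int((x - lo) / (hi - lo) * levels)
        symbols.append(min(idx, levels - 1))
    return symbols

def merge_pair(tokens, a, b):
    """Replace every non-overlapping occurrence of (a, b) with the merged token a + b."""
    out, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and tokens[i] == a and tokens[i + 1] == b:
            out.append(a + b)
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out

def learn_bpe(symbols, num_merges):
    """Greedy BPE: tokens are tuples of base symbols, so decoding is lossless."""
    tokens = [(s,) for s in symbols]
    merges = []
    for _ in range(num_merges):
        pairs = Counter(zip(tokens, tokens[1:]))
        if not pairs:
            break
        (a, b), count = pairs.most_common(1)[0]
        if count < 2:  # nothing repeats any more
            break
        merges.append((a, b))
        tokens = merge_pair(tokens, a, b)
    return tokens, merges

def decode(tokens):
    """Flatten tokens back into the quantized symbol sequence."""
    return [s for tok in tokens for s in tok]

# Hypothetical periodic "signal" standing in for one ECG lead.
signal = [0.0, 0.9, -0.9, 0.0] * 8
symbols = quantize(signal)
tokens, merges = learn_bpe(symbols, num_merges=10)
print(len(symbols), "symbols ->", len(tokens), "tokens")
```

Because every token is just a run of quantized samples, inspecting a token shows exactly which stretch of the waveform it covers, which is the interpretability property the abstract points to.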

[NLP-72] Memorization Over Reasoning? Exposing and Mitigating Verbatim Memorization in Large Language Models Character Understanding Evaluation

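【Quick read】: This paper addresses the concern that large language models (LLMs) may solve character understanding tasks by memorizing popular fictional works rather than by genuinely understanding and reasoning about them. The key to the solution is a simple yet effective method that emphasizes gist memory over verbatim memory, mitigating mechanized memorization while preserving the implicit cues needed for comprehension and reasoning. In evaluation, the method reduces memorization-driven accuracy on popular fictional works from 96% to 72% and causes accuracy drops of up to 18% across character understanding tasks. These results expose data contamination in existing benchmarks and underline the importance of measuring true understanding rather than memorization.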

Link: https://arxiv.org/abs/2412.14368
Authors: Yuxuan Jiang,Francis Ferraro
Affiliations: University of Maryland, Baltimore County; Department of Computer Science and Electrical Engineering
Keywords: Large Language Models, Large Language, Language Models, character understanding tasks, shown impressive performance
Categories: Computation and Language (cs.CL)
Comments:

Abstract:Recently, Large Language Models (LLMs) have shown impressive performance in character understanding tasks, such as analyzing the roles, personalities, and relationships of fictional characters. However, the extensive pre-training corpora used by LLMs raise concerns that they may rely on memorizing popular fictional works rather than genuinely understanding and reasoning about them. In this work, we argue that ‘gist memory’ (capturing essential meaning) should be the primary mechanism for character understanding tasks, as opposed to ‘verbatim memory’ (exact match of a string). We introduce a simple yet effective method to mitigate mechanized memorization in character understanding evaluations while preserving the essential implicit cues needed for comprehension and reasoning. Our approach reduces memorization-driven performance on popular fictional works from 96% accuracy to 72% and results in up to an 18% drop in accuracy across various character understanding tasks. These findings underscore the issue of data contamination in existing benchmarks, which often measure memorization rather than true character understanding.

[NLP-73] ResQ: Mixed-Precision Quantization of Large Language Models with Low-Rank Residuals

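【Quick read】: This paper addresses the high computational cost of large language model (LLM) inference, and specifically the quantization error caused by extreme activation outliers when post-training quantization (PTQ) reduces weights, activations, and key-value (KV) cache tensors to 4 bits. The key to the solution is ResQ, a PTQ method that uses principal component analysis (PCA) to identify the low-rank subspace with the highest activation variance (in practice 1/8 of the hidden dimension), keeps the coefficients in that subspace at high precision (for example 8-bit), and quantizes the rest to 4-bit. Within each subspace, an invariant random rotation further suppresses outliers. The authors show that this is a provably optimal mixed-precision scheme that minimizes error, and on the Llama model families ResQ outperforms recent uniform and mixed-precision PTQ methods, with better accuracy and a significant inference speedup.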

Link: https://arxiv.org/abs/2412.14363
Authors: Utkarsh Saxena,Sayeh Sharify,Kaushik Roy,Xin Wang
Affiliations: Purdue University; d-Matrix
Keywords: prohibitive computational cost, large language models, Post-training quantization, holds the promise, inference time
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments: 14 pages, 6 figures, 6 tables

Abstract:Post-training quantization (PTQ) of large language models (LLMs) holds promise for reducing the prohibitive computational cost at inference time. Quantization of all weight, activation and key-value (KV) cache tensors to 4-bit without significantly degrading generalizability is challenging, due to the high quantization error caused by extreme outliers in activations. To tackle this problem, we propose ResQ, a PTQ method that advances the state of the art. By means of principal component analysis (PCA), it identifies a low-rank subspace (in practice 1/8 of the hidden dimension) in which activation variances are highest, and keeps the coefficients within this subspace in high precision, e.g. 8-bit, while quantizing the rest to 4-bit. Within each subspace, invariant random rotation is applied to further suppress outliers. We show that this is a provably optimal mixed precision quantization scheme that minimizes error. With the Llama families of models, we demonstrate that ResQ outperforms recent uniform and mixed precision PTQ methods on a variety of benchmarks, achieving up to 33% lower perplexity on Wikitext than the next best method SpinQuant, and a 2.4x speedup over 16-bit baseline. Code is available at this https URL.
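
The intuition behind keeping high-variance components at higher precision can be shown with a deliberately simplified sketch. It is not ResQ: instead of a PCA subspace and random rotations, it simply picks the highest-variance channels (an assumption made to stay dependency-free), and the activation values and function names are invented for the example.

```python
def quantize_uniform(values, bits):
    """Symmetric uniform quantize-then-dequantize of a list of floats."""
    if not values:
        return []
    qmax = 2 ** (bits - 1) - 1
    scale = max(abs(v) for v in values) / qmax or 1.0  # avoid a zero scale
    return [round(v / scale) * scale for v in values]

def channel_variances(rows):
    """Per-channel variance over all rows (tokens)."""
    n, dims = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(dims)]
    return [sum((r[j] - means[j]) ** 2 for r in rows) / n for j in range(dims)]

def quantize_mixed(rows, keep, hi_bits=8, lo_bits=4):
    """Quantize the `keep` highest-variance channels at hi_bits, the rest at lo_bits."""
    variances = channel_variances(rows)
    ranked = sorted(range(len(variances)), key=lambda j: variances[j], reverse=True)
    hi_idx = sorted(ranked[:keep])
    lo_idx = [j for j in range(len(variances)) if j not in hi_idx]
    out = []
    for row in rows:
        q = [0.0] * len(row)
        for idx, bits in ((hi_idx, hi_bits), (lo_idx, lo_bits)):
            group = quantize_uniform([row[j] for j in idx], bits)
            for j, v in zip(idx, group):
                q[j] = v
        out.append(q)
    return out

def mse(rows, quantized):
    """Mean squared error between original and dequantized activations."""
    total = sum((a - b) ** 2 for r, q in zip(rows, quantized) for a, b in zip(r, q))
    return total / (len(rows) * len(rows[0]))

# Hypothetical activations: the last channel carries extreme outliers.
activations = [
    [0.1, -0.2, 0.3, 50.0],
    [0.2, 0.1, -0.3, -40.0],
    [-0.1, 0.3, 0.2, 60.0],
]
uniform_4bit = [quantize_uniform(row, 4) for row in activations]
mixed = quantize_mixed(activations, keep=1)
print("uniform 4-bit MSE:", mse(activations, uniform_4bit))
print("mixed 8/4-bit MSE:", mse(activations, mixed))
```

With a single shared 4-bit scale, the outlier channel sets the step size and the small values collapse to zero; separating the outlier channel lets the remaining channels use a much finer step.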

[NLP-74] State Space Models are Strong Text Rerankers

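【Quick read】: This paper addresses the inference inefficiency and weak long-context handling of Transformer models in natural language processing (NLP) and information retrieval (IR). The key to the solution is to evaluate state space models (SSMs), specifically the Mamba architecture, as an alternative to Transformers. Comparing Mamba-1 and Mamba-2 with Transformer models on text reranking, the study finds that Mamba architectures match Transformers of similar size in ranking performance but are less efficient in training and inference than Transformers with flash attention. Mamba-2 outperforms Mamba-1 in both performance and efficiency, which points to room for improvement for SSMs in future IR applications.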

Link: https://arxiv.org/abs/2412.14354
Authors: Zhichao Xu,Jinghua Yan,Ashim Gupta,Vivek Srikumar
Affiliations: Kahlert School of Computing, University of Utah; Scientific Computing and Imaging Institute, University of Utah
Keywords: Transformers dominate NLP, dominate NLP, inefficiencies and challenges, challenges in extrapolating, extrapolating to longer
Categories: Computation and Language (cs.CL); Information Retrieval (cs.IR)
Comments: The first two authors contributed equally, order decided randomly

Abstract:Transformers dominate NLP and IR, but their inference inefficiencies and challenges in extrapolating to longer contexts have sparked interest in alternative model architectures. Among these, state space models (SSMs) like Mamba offer promising advantages, particularly O(1) time complexity in inference. Despite their potential, SSMs' effectiveness at text reranking (a task requiring fine-grained query-document interaction and long-context understanding) remains underexplored. This study benchmarks SSM-based architectures (specifically, Mamba-1 and Mamba-2) against transformer-based models across various scales, architectures, and pre-training objectives, focusing on performance and efficiency in text reranking tasks. We find that (1) Mamba architectures achieve competitive text ranking performance, comparable to transformer-based models of similar size; (2) they are less efficient in training and inference compared to transformers with flash attention; and (3) Mamba-2 outperforms Mamba-1 in both performance and efficiency. These results underscore the potential of state space models as a transformer alternative and highlight areas for improvement in future IR applications.
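
The O(1) per-token inference cost mentioned in the abstract comes from the recurrent form of a state space model: each step updates a fixed-size state instead of attending over all previous tokens. The sketch below uses a diagonal linear recurrence with fixed parameters as a stand-in. Mamba itself makes the parameters input-dependent, so this is a simplified illustration, and the names `ssm_step` and `run_ssm` are hypothetical.

```python
def ssm_step(state, x, a, b, c):
    """One recurrent step: h <- a * h + b * x (elementwise), y = sum(c * h)."""
    new_state = [ai * hi + bi * x for ai, hi, bi in zip(a, state, b)]
    y = sum(ci * hi for ci, hi in zip(c, new_state))
    return new_state, y

def run_ssm(xs, a, b, c):
    """Process a sequence one input at a time with a constant-size state."""
    state = [0.0] * len(a)
    ys = []
    for x in xs:
        state, y = ssm_step(state, x, a, b, c)
        ys.append(y)
    return ys, state

# Toy parameters: a one-dimensional state that decays by half each step.
ys, final_state = run_ssm([1.0, 1.0, 1.0], a=[0.5], b=[1.0], c=[1.0])
print(ys)
```

The work and memory per step depend only on the state size, not on how many tokens have been seen, whereas a Transformer decoder's key-value cache grows with the sequence.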

[NLP-75] A Survey on LLM Inference-Time Self-Improvement

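【Quick read】: This paper surveys how large language models (LLMs) can improve performance at inference time by spending more computation. It examines inference-time self-improvement from three perspectives: Independent Self-Improvement, which enhances output through decoding or sampling methods; Context-Aware Self-Improvement, which draws on additional context or a datastore; and Model-Aided Self-Improvement, which improves results through model collaboration. The paper provides a comprehensive review of related work and an in-depth taxonomy, and discusses challenges and limitations to inform future research.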

Link: https://arxiv.org/abs/2412.14352
Authors: Xiangjue Dong,Maria Teleki,James Caverlee
Affiliations: Texas A&M University
Keywords: recently gained attention, Techniques that enhance, gained attention, enhance inference, inference through increased
Categories: Computation and Language (cs.CL)
Comments: The first two authors contribute equally

Abstract:Techniques that enhance inference through increased computation at test-time have recently gained attention. In this survey, we investigate the current state of LLM Inference-Time Self-Improvement from three different perspectives: Independent Self-improvement, focusing on enhancements via decoding or sampling methods; Context-Aware Self-Improvement, leveraging additional context or datastore; and Model-Aided Self-Improvement, achieving improvement through model collaboration. We provide a comprehensive review of recent relevant studies, contribute an in-depth taxonomy, and discuss challenges and limitations, offering insights for future research.

[NLP-76] Is Peer-Reviewing Worth the Effort? COLING2025

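【Quick read】: This paper asks how effective peer review is at identifying important papers and treats the question as a forecasting task: predicting which papers will be highly cited in the future from the publication venue and from "early returns" (citations soon after publication). The key finding is that early returns predict future citations better than venue does. The paper also offers constructive suggestions for the scaling challenges of peer review, namely too many submissions and too few qualified reviewers.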

Link: https://arxiv.org/abs/2412.14351
Authors: Kenneth Church,Raman Chandrasekar,John E. Ortega,Ibrahim Said Ahmad
Affiliations: Institute for Experiential AI, Northeastern University
Keywords: identifying important papers, effective is peer-reviewing, peer-reviewing in identifying, identifying important, important papers
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: The 31st International Conference on Computational Linguistics (COLING 2025)

Abstract:How effective is peer-reviewing in identifying important papers? We treat this question as a forecasting task. Can we predict which papers will be highly cited in the future based on venue and “early returns” (citations soon after publication)? We show early returns are more predictive than venue. Finally, we end with constructive suggestions to address scaling challenges: (a) too many submissions and (b) too few qualified reviewers.

[NLP-77] Semantic Role Labeling of NomBank Partitives COLING2025

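【Quick read】: This paper addresses semantic role labeling (SRL) for English partitive nouns in the NomBank annotated corpus. The key to the solution is combining traditional machine learning methods with Transformer-based models and improving performance through ensembling. The best system achieves an F1 of 91.74% with "gold" parses from the Penn Treebank and 91.12% with the Berkeley Neural Parser.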

Link: https://arxiv.org/abs/2412.14328
Authors: Adam Meyers,Advait Pravin Savant,John E. Ortega
Affiliations: Unknown
Keywords: Semantic Role Labeling, English partitive nouns, NomBank annotated corpus, Semantic Role, Role Labeling
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: SUMEval-2: The 2nd Workshop on Scaling Up Multilingual Multi-Cultural Evaluation at the 31st International Conference on Computational Linguistics (COLING 2025)

Abstract:This article is about Semantic Role Labeling for English partitive nouns (5%/REL of the price/ARG1; The price/ARG1 rose 5 percent/REL) in the NomBank annotated corpus. Several systems are described using traditional and transformer-based machine learning, as well as ensembling. Our highest scoring system achieves an F1 of 91.74% using “gold” parses from the Penn Treebank and 91.12% when using the Berkeley Neural parser. This research includes both classroom and experimental settings for system development.
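
For readers unfamiliar with the metric, the following sketch shows how an F1 score can be computed for labeled arguments by comparing predicted (span, role) pairs against gold ones. The exact matching criteria used in the paper are not stated in the abstract, so exact-match scoring and the function name `f1_score` are assumptions. The example pairs reuse the phrases from the abstract.

```python
def f1_score(gold, predicted):
    """F1 over exact-match (span, role) pairs; returns 0.0 when nothing matches."""
    gold, predicted = set(gold), set(predicted)
    true_positives = len(gold & predicted)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)

gold = [("the price", "ARG1"), ("5%", "REL")]
predicted = [("the price", "ARG1"), ("5 percent", "REL")]  # one span mismatch
print(f1_score(gold, predicted))
```

Here one of two predictions matches one of two gold pairs, so precision and recall are both 0.5 and so is their harmonic mean.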

[NLP-78] The Role of Handling Attributive Nouns in Improving Chinese-To-English Machine Translation COLING2025

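【Quick read】: This paper addresses the ambiguity that attributive nouns in Chinese frequently cause in English translation. The key to the solution is to manually insert the omitted particle 的 ('DE') into news article titles drawn from the Penn Chinese Discourse Treebank and to use the resulting dataset to fine-tune Hugging Face Chinese-to-English translation models. The approach targets this common error type directly; it complements the broader strategies proposed in earlier studies and offers a practical improvement in translation quality.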

Link: https://arxiv.org/abs/2412.14323
Authors: Haohao(Lisa)Wang,Adam Meyers,John E. Ortega,Rodolfo Zevallos
Affiliations: Unknown
Keywords: grammatical conventions poses, Translating between languages, conventions poses challenges, machine translation systems, languages with drastically
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 18th Workshop on Building and Using Comparable Corpora (BUCC) at the 31st International Conference on Computational Linguistics (COLING 2025)

Abstract:Translating between languages with drastically different grammatical conventions poses challenges, not just for human interpreters but also for machine translation systems. In this work, we specifically target the translation challenges posed by attributive nouns in Chinese, which frequently cause ambiguities in English translation. By manually inserting the omitted particle 的 (‘DE’) in news article titles from the Penn Chinese Discourse Treebank, we developed a targeted dataset to fine-tune Hugging Face Chinese to English translation models, specifically improving how this critical function word is handled. This focused approach not only complements the broader strategies suggested by previous studies but also offers a practical enhancement by specifically addressing a common error type in Chinese-English translation.

[NLP-79] Multi-OphthaLingua: A Multilingual Benchmark for Assessing and Debiasing LLM Ophthalmological QA in LMICs AAAI2025 AAAI

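【Quick read】: This paper addresses problems in current ophthalmology clinical workflows (over-referrals, long waits, and complex, heterogeneous medical records) and, in particular, the large performance differences that large language models (LLMs) show across languages, which may worsen healthcare disparities in low- and middle-income countries (LMICs). The key to the solution is CLARA (Cross-Lingual Reflective Agentic system), a novel inference-time debiasing method that combines retrieval-augmented generation (RAG) with self-verification. CLARA improves performance in all languages and significantly narrows the multilingual bias gap, supporting equitable use of LLMs worldwide.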

Link: https://arxiv.org/abs/2412.14304
Authors: David Restrepo,Chenwei Wu,Zhengxu Tang,Zitao Shuai,Thao Nguyen Minh Phan,Jun-En Ding,Cong-Tinh Dao,Jack Gallifant,Robyn Gayle Dychiao,Jose Carlo Artiaga,André Hiroshi Bando,Carolina Pelegrini Barbosa Gracitelli,Vincenz Ferrer,Leo Anthony Celi,Danielle Bitterman,Michael G Morley,Luis Filipe Nakayama
Affiliations: Massachusetts General Hospital; Harvard Medical School; University of São Paulo; University of California, San Francisco; University of California, Los Angeles; University of California, Irvine; University of California, Davis
Keywords: Current ophthalmology clinical, Current ophthalmology, heterogeneous medical records, ophthalmology clinical workflows, long waits
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Accepted at the AAAI 2025 Artificial Intelligence for Social Impact Track (AAAI-AISI 2025)

Abstract:Current ophthalmology clinical workflows are plagued by over-referrals, long waits, and complex and heterogeneous medical records. Large language models (LLMs) present a promising solution to automate various procedures such as triaging, preliminary tests like visual acuity assessment, and report summaries. However, LLMs have demonstrated significantly varied performance across different languages in natural language question-answering tasks, potentially exacerbating healthcare disparities in Low and Middle-Income Countries (LMICs). This study introduces the first multilingual ophthalmological question-answering benchmark with manually curated questions parallel across languages, allowing for direct cross-lingual comparisons. Our evaluation of 6 popular LLMs across 7 different languages reveals substantial bias across different languages, highlighting risks for clinical deployment of LLMs in LMICs. Existing debiasing methods such as Translation Chain-of-Thought or Retrieval-augmented generation (RAG) by themselves fall short of closing this performance gap, often failing to improve performance across all languages and lacking specificity for the medical domain. To address this issue, we propose CLARA (Cross-Lingual Reflective Agentic system), a novel inference time de-biasing method leveraging retrieval augmented generation and self-verification. Our approach not only improves performance across all languages but also significantly reduces the multilingual bias gap, facilitating equitable LLM application across the globe.
zh

[NLP-80] Fake News Detection: Comparative Evaluation of BERT-like Models and Large Language Models with Generative AI-Annotated Data

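[Quick read]: This paper addresses the threat that fake news poses to public opinion and social stability. The key to the solution is a comparative evaluation of encoder-only BERT-like models and autoregressive decoder-only large language models (LLMs) for fake news detection. The paper introduces a dataset labeled with GPT-4 assistance and verified by human experts to ensure label reliability. After fine-tuning these models and developing an instruction-tuned LLM approach that uses majority voting during inference, the study finds that BERT-like models generally outperform LLMs on classification, while LLMs are more robust to text perturbations. In addition, AI labels with human supervision yield better classification results than weak labels (distant supervision). The study highlights the effectiveness of combining AI annotation with human oversight and shows how different model families perform on fake news detection.

Link: https://arxiv.org/abs/2412.14276
Authors: Shaina Raza,Drai Paulen-Patterson,Chen Ding
Institutions: Toronto Metropolitan University
Keywords: modern society, fake news detection, Fake news poses, poses a significant, significant threat
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Accepted in Knowledge and Information Systems Journal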

Abstract:Fake news poses a significant threat to public opinion and social stability in modern society. This study presents a comparative evaluation of BERT-like encoder-only models and autoregressive decoder-only large language models (LLMs) for fake news detection. We introduce a dataset of news articles labeled with GPT-4 assistance (an AI-labeling method) and verified by human experts to ensure reliability. Both BERT-like encoder-only models and LLMs were fine-tuned on this dataset. Additionally, we developed an instruction-tuned LLM approach with majority voting during inference for label generation. Our analysis reveals that BERT-like models generally outperform LLMs in classification tasks, while LLMs demonstrate superior robustness against text perturbations. Compared to weak labels (distant supervision) data, the results show that AI labels with human supervision achieve better classification results. This study highlights the effectiveness of combining AI-based annotation with human oversight and demonstrates the performance of different families of machine learning models for fake news detection.
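
The majority-voting step mentioned in the abstract can be illustrated with a short sketch. This is not the authors' implementation: the function name `majority_vote`, the label strings, and the tie-breaking rule (the first label in input order that reaches the top count wins) are assumptions made for this example.

```python
from collections import Counter


def majority_vote(labels):
    """Return the most frequent label among several sampled model outputs.

    Ties are broken in favour of the label that appears first in `labels`
    (an assumption made for this sketch).
    """
    if not labels:
        raise ValueError("labels must not be empty")
    counts = Counter(labels)
    top = max(counts.values())
    # Scan in input order so tie-breaking is deterministic.
    for label in labels:
        if counts[label] == top:
            return label


# Hypothetical outputs from five sampled generations for one article.
votes = ["fake", "real", "fake", "fake", "real"]
print(majority_vote(votes))
```

Sampling an odd number of generations avoids ties in the two-class case, so the tie-breaking rule only matters when the number of samples is even or there are more than two labels.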

[NLP-81] Towards AI-45° Law: A Roadmap to Trustworthy AGI

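[Quick read]: This paper addresses the key challenge of ensuring that Artificial General Intelligence (AGI) avoids harmful behaviors in highly autonomous or safety-critical settings. The key to the solution is the "AI-45° Law", proposed as a guiding principle for balancing AI safety and capability, together with the "Causal Ladder of Trustworthy AGI" framework. The framework uses three core layers (the Approximate Alignment Layer, the Intervenable Layer, and the Reflectable Layer) to systematically classify and structure current AI capability and safety research, and so to address the safety and trustworthiness challenges of AGI and contemporary AI systems. The paper also defines five levels of trustworthy AGI (perception, reasoning, decision-making, autonomy, and collaboration trustworthiness) and proposes a series of potential governance measures to support the development of trustworthy AGI.

Link: https://arxiv.org/abs/2412.14186
Authors: Yang Chao,Lu Chaochao,Wang Yingchun,Zhou Bowen
Institutions: Center for Safe & Trustworthy AI; Shanghai Artificial Intelligence Laboratory
Keywords: Ensuring Artificial General, Artificial General Intelligence, Ensuring Artificial, General Intelligence, Artificial General
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: First submit, Preview Only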

Abstract:Ensuring Artificial General Intelligence (AGI) reliably avoids harmful behaviors is a critical challenge, especially for systems with high autonomy or in safety-critical domains. Despite various safety assurance proposals and extreme risk warnings, comprehensive guidelines balancing AI safety and capability remain lacking. In this position paper, we propose the AI-45° Law as a guiding principle for a balanced roadmap toward trustworthy AGI, and introduce the Causal Ladder of Trustworthy AGI as a practical framework. This framework provides a systematic taxonomy and hierarchical structure for current AI capability and safety research, inspired by Judea Pearl’s "Ladder of Causation". The Causal Ladder comprises three core layers: the Approximate Alignment Layer, the Intervenable Layer, and the Reflectable Layer. These layers address the key challenges of safety and trustworthiness in AGI and contemporary AI systems. Building upon this framework, we define five levels of trustworthy AGI: perception, reasoning, decision-making, autonomy, and collaboration trustworthiness. These levels represent distinct yet progressive aspects of trustworthy AGI. Finally, we present a series of potential governance measures to support the development of trustworthy AGI. (Footnote: In this paper, trustworthiness is generally considered a broad form of safety, and no explicit distinction is made between the two. However, in some contexts, safety and trustworthiness are treated as distinct: safety involves assurance of correct behavior, while trustworthiness refers to user confidence in the system’s decision-making. In such cases, different terms or both may be used depending on the context.)

[NLP-82] Whisper-GPT: A Hybrid Representation Audio Large Language Model

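[Quick read]: This paper addresses the context-length problem that arises when generative audio, speech, and music models rely on discrete audio tokens. The key to the solution is WHISPER-GPT, a large language model (LLM) that combines continuous audio representations (such as spectrograms) with discrete acoustic tokens. With this combination, a single token retains all the audio information for a specific time instance while the LLM can still predict future tokens, which keeps the benefits of a discrete space and avoids the context-length blow-up seen in high-fidelity generative architectures.

Link: https://arxiv.org/abs/2412.11449
Authors: Prateek Verma
Institutions: Unknown
Keywords: large language model, generative large language, discrete tokens simultaneously, propose WHISPER-GPT, large language
Categories: Sound (cs.SD); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
Comments: 6 pages, 3 figures. 50th International Conference on Acoustics, Speech and Signal Processing, Hyderabad, India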

Abstract:We propose WHISPER-GPT: A generative large language model (LLM) for speech and music that allows us to work with continuous audio representations and discrete tokens simultaneously as part of a single architecture. There has been a huge surge in generative audio, speech, and music models that utilize discrete audio tokens derived from neural compression algorithms, e.g. ENCODEC. However, one of the major drawbacks of this approach is handling the context length. It blows up for high-fidelity generative architecture if one has to account for all the audio contents at various frequencies for the next token prediction. By combining continuous audio representation like the spectrogram and discrete acoustic tokens, we retain the best of both worlds: Have all the information needed from the audio at a specific time instance in a single token, yet allow LLM to predict the future token to allow for sampling and other benefits discrete space provides. We show how our architecture improves the perplexity and negative log-likelihood scores for the next token prediction compared to a token-based LLM for speech and music.
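
The abstract reports results as perplexity and negative log-likelihood for next-token prediction. As a reminder of how the two relate, here is a minimal sketch under the usual definition (perplexity is the exponential of the mean per-token NLL, measured in nats). The function name and the sample values are illustrative assumptions, not numbers from the paper.

```python
import math


def perplexity(token_nlls):
    """Perplexity from per-token negative log-likelihoods (natural log)."""
    if not token_nlls:
        raise ValueError("need at least one token")
    mean_nll = sum(token_nlls) / len(token_nlls)
    return math.exp(mean_nll)


# Hypothetical per-token NLLs: a model that assigns probability 0.25 to
# every correct token has a perplexity of about 4.
nlls = [-math.log(0.25)] * 8
print(perplexity(nlls))
```

Because the exponential is monotonic, a lower mean NLL always means a lower perplexity, so the two metrics rank models in the same order.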

Computer Vision

[CV-0] UIP2P: Unsupervised Instruction-based Image Editing via Cycle Edit Consistency

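[Quick read]: This paper addresses the dependence of existing instruction-based image editing methods on ground-truth edited images during training. Existing supervised methods rely on datasets of triplets (input image, edited image, edit instruction) that are usually generated by existing editing methods or human annotation, which introduces biases and limits generalization. The key to the solution is a new mechanism called Cycle Edit Consistency (CEC), which applies forward and backward edits in a single training step and enforces consistency in both image and attention spaces, so that training no longer needs ground-truth edited images. This avoids dataset bias and, for the first time, enables training on datasets that contain only real image-caption pairs or image-caption-edit triplets, improving the breadth and precision of edits.

Link: https://arxiv.org/abs/2412.15216
Authors: Enis Simsar,Alessio Tonioni,Yongqin Xian,Thomas Hofmann,Federico Tombari
Institutions: ETH Zürich; Technical University of Munich; Google Switzerland
Keywords: ground-truth edited images, edited images, ground-truth edited, image, instruction-based image editing
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project page: this https URL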

Abstract:We propose an unsupervised model for instruction-based image editing that eliminates the need for ground-truth edited images during training. Existing supervised methods depend on datasets containing triplets of input image, edited image, and edit instruction. These are generated by either existing editing methods or human-annotations, which introduce biases and limit their generalization ability. Our method addresses these challenges by introducing a novel editing mechanism called Cycle Edit Consistency (CEC), which applies forward and backward edits in one training step and enforces consistency in image and attention spaces. This allows us to bypass the need for ground-truth edited images and unlock training for the first time on datasets comprising either real image-caption pairs or image-caption-edit triplets. We empirically show that our unsupervised technique performs better across a broader range of edits with high fidelity and precision. By eliminating the need for pre-existing datasets of triplets, reducing biases associated with supervised methods, and proposing CEC, our work represents a significant advancement in unblocking scaling of instruction-based image editing.
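
The idea behind Cycle Edit Consistency (apply a forward edit, then the reverse edit, and penalise any difference from the original image) can be sketched with toy functions. This is only an illustration of the image-space term: the real method uses a learned editor and also enforces consistency in attention space, which is not shown. All function names and the "brighten" and "darken" edits are hypothetical stand-ins.

```python
def cycle_consistency_loss(image, forward_edit, backward_edit):
    """Mean squared error between an image and its forward-then-backward edit."""
    edited = forward_edit(image)
    reconstructed = backward_edit(edited)
    return sum((a - b) ** 2 for a, b in zip(image, reconstructed)) / len(image)


# Toy stand-ins for an instruction-conditioned editor (hypothetical):
# "brighten" adds 0.2 to every pixel, and the reverse instruction should undo it.
def brighten(img):
    return [p + 0.2 for p in img]


def darken_exact(img):
    return [p - 0.2 for p in img]


def darken_too_little(img):
    return [p - 0.1 for p in img]


image = [0.1, 0.5, 0.9]
# A reverse edit that undoes the forward edit gives a loss near zero.
print(cycle_consistency_loss(image, brighten, darken_exact))
# A reverse edit that undoes only half of it is penalised.
print(cycle_consistency_loss(image, brighten, darken_too_little))
```

The loss needs only the original image, which is why this kind of objective does not require a ground-truth edited image.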

[CV-1] EnvGS: Modeling View-Dependent Appearance with Environment Gaussian

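[Quick read]: This paper addresses the reconstruction of complex reflections in real-world scenes from 2D images, in particular the shortcomings of existing methods on high-frequency reflection details and near-field reflections. The key to the solution is EnvGS, a new method that uses a set of Gaussian primitives as an explicit 3D representation to capture reflections of the environment. These environment Gaussian primitives are combined with base Gaussian primitives to model the appearance of the whole scene. To render the environment Gaussian primitives efficiently, the paper develops a ray-tracing-based renderer that uses the GPU's RT cores, which yields high-quality reconstruction while maintaining real-time rendering speed.

Link: https://arxiv.org/abs/2412.15215
Authors: Tao Xie,Xi Chen,Zhen Xu,Yiman Xie,Yudong Jin,Yujun Shen,Sida Peng,Hujun Bao,Xiaowei Zhou
Institutions: Zhejiang University; Ant Group
Keywords: Reconstructing complex reflections, Reconstructing complex, Gaussian primitives, environment Gaussian primitives, images is essential
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project page: this https URL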

Abstract:Reconstructing complex reflections in real-world scenes from 2D images is essential for achieving photorealistic novel view synthesis. Existing methods that utilize environment maps to model reflections from distant lighting often struggle with high-frequency reflection details and fail to account for near-field reflections. In this work, we introduce EnvGS, a novel approach that employs a set of Gaussian primitives as an explicit 3D representation for capturing reflections of environments. These environment Gaussian primitives are incorporated with base Gaussian primitives to model the appearance of the whole scene. To efficiently render these environment Gaussian primitives, we developed a ray-tracing-based renderer that leverages the GPU’s RT core for fast rendering. This allows us to jointly optimize our model for high-quality reconstruction while maintaining real-time rendering speeds. Results from multiple real-world and synthetic datasets demonstrate that our method produces significantly more detailed reflections, achieving the best rendering quality in real-time novel view synthesis.

[CV-2] LeviTor: 3D Trajectory Oriented Image-to-Video Synthesis

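[Quick read]: This paper addresses the ambiguity that drag-based 2D interaction faces when handling out-of-plane object movement in image-to-video synthesis. The key to the solution is to add a depth dimension: users assign a relative depth to each point on a trajectory, which enables trajectory control in 3D space. Concretely, object masks are abstracted into a few cluster points, which are fed to a video diffusion model as the control signal together with depth and instance information. The approach keeps the convenience of 2D dragging, broadens the creative scope, and allows more precise control of object motion when generating realistic videos.

Link: https://arxiv.org/abs/2412.15214
Authors: Hanlin Wang,Hao Ouyang,Qiuyu Wang,Wen Wang,Ka Leong Cheng,Qifeng Chen,Yujun Shen,Limin Wang
Institutions: State Key Laboratory for Novel Software Technology, Nanjing University; Ant Group; Zhejiang University; The Hong Kong University of Science and Technology
Keywords: controlling object trajectories, intuitive nature, nature of drag-based, growing adoption, adoption for controlling
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project page available at this https URL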

Abstract:The intuitive nature of drag-based interaction has led to its growing adoption for controlling object trajectories in image-to-video synthesis. Still, existing methods that perform dragging in the 2D space usually face ambiguity when handling out-of-plane movements. In this work, we augment the interaction with a new dimension, i.e., the depth dimension, such that users are allowed to assign a relative depth for each point on the trajectory. That way, our new interaction paradigm not only inherits the convenience from 2D dragging, but facilitates trajectory control in the 3D space, broadening the scope of creativity. We propose a pioneering method for 3D trajectory control in image-to-video synthesis by abstracting object masks into a few cluster points. These points, accompanied by the depth information and the instance information, are finally fed into a video diffusion model as the control signal. Extensive experiments validate the effectiveness of our approach, dubbed LeviTor, in precisely manipulating the object movements when producing photo-realistic videos from static images. Project page: this https URL
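
One way to picture "abstracting object masks into a few cluster points" is to run a clustering algorithm over the pixel coordinates of a mask and keep only the cluster centres. The sketch below uses plain k-means with a deterministic initialisation. Whether the paper uses exactly this procedure is an assumption here, and the function name, the toy mask, and the depth values are all hypothetical.

```python
def mask_to_cluster_points(pixels, k, num_iters=20):
    """Summarise mask pixel coordinates by k cluster centres (plain k-means)."""
    if k < 1 or len(pixels) < k:
        raise ValueError("need at least k pixels and k >= 1")
    # Deterministic initialisation: take evenly spaced pixels as starting centres.
    step = len(pixels) // k
    centres = [tuple(map(float, pixels[i * step])) for i in range(k)]
    for _ in range(num_iters):
        groups = [[] for _ in range(k)]
        for x, y in pixels:
            nearest = min(
                range(k),
                key=lambda c: (x - centres[c][0]) ** 2 + (y - centres[c][1]) ** 2,
            )
            groups[nearest].append((x, y))
        # Recompute each centre as the mean of its group; keep empty groups' centres.
        centres = [
            (sum(p[0] for p in g) / len(g), sum(p[1] for p in g) / len(g)) if g else centres[j]
            for j, g in enumerate(groups)
        ]
    return centres


# Hypothetical mask made of two separate 2x2 pixel blobs.
mask_pixels = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10), (10, 11), (11, 10), (11, 11)]
points = mask_to_cluster_points(mask_pixels, k=2)
# Attach a user-assigned relative depth to every point (values are made up).
control_points = [(x, y, depth) for (x, y), depth in zip(points, [0.3, 0.8])]
print(control_points)
```

A handful of (x, y, depth) points is a much smaller control signal than a full mask, which is presumably what makes it practical to move them along a user-drawn 3D trajectory.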

[CV-3] Flowing from Words to Pixels: A Framework for Cross-Modality Evolution

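[Quick read]: This paper asks how, in cross-modal tasks such as text-to-image generation, a model can learn a direct mapping from one modality to another without a noise distribution or a conditioning mechanism. The key to the solution is CrossFlow, a framework that uses flow matching to map directly from the distribution of one modality to the distribution of another, without the conventional noise source distribution. The paper also stresses the importance of applying Variational Encoders to the input data and introduces a method to enable Classifier-free guidance. Experiments show that, for text-to-image generation, CrossFlow with a vanilla Transformer (no cross attention) slightly outperforms standard flow matching, scales better with training steps and model size, and supports semantically meaningful latent arithmetic.

Link: https://arxiv.org/abs/2412.15213
Authors: Qihao Liu,Xi Yin,Alan Yuille,Andrew Brown,Mannat Singh
Institutions: GenAI, Meta; Johns Hopkins University
Keywords: flow matching, Diffusion models, remarkable impact, media generation, cross-modal media generation
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project page: this https URL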

Abstract:Diffusion models, and their generalization, flow matching, have had a remarkable impact on the field of media generation. Here, the conventional approach is to learn the complex mapping from a simple source distribution of Gaussian noise to the target media distribution. For cross-modal tasks such as text-to-image generation, this same mapping from noise to image is learnt whilst including a conditioning mechanism in the model. One key and thus far relatively unexplored feature of flow matching is that, unlike Diffusion models, they are not constrained for the source distribution to be noise. Hence, in this paper, we propose a paradigm shift, and ask the question of whether we can instead train flow matching models to learn a direct mapping from the distribution of one modality to the distribution of another, thus obviating the need for both the noise distribution and conditioning mechanism. We present a general and simple framework, CrossFlow, for cross-modal flow matching. We show the importance of applying Variational Encoders to the input data, and introduce a method to enable Classifier-free guidance. Surprisingly, for text-to-image, CrossFlow with a vanilla transformer without cross attention slightly outperforms standard flow matching, and we show that it scales better with training steps and model size, while also allowing for interesting latent arithmetic which results in semantically meaningful edits in the output space. To demonstrate the generalizability of our approach, we also show that CrossFlow is on par with or outperforms the state-of-the-art for various cross-modal / intra-modal mapping tasks, viz. image captioning, depth estimation, and image super-resolution. We hope this paper contributes to accelerating progress in cross-modal media generation.
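
To make the "source distribution need not be noise" point concrete, here is a minimal sketch of how a flow matching training pair is formed when the source sample is a latent from another modality. It assumes the common straight-line path between source and target; whether CrossFlow uses exactly this path is an assumption, and the latent values and function names are made up for illustration.

```python
import random


def flow_matching_pair(x0, x1, t):
    """Return the interpolated point x_t and the target velocity for one sample.

    Uses the straight-line path x_t = (1 - t) * x0 + t * x1, whose velocity
    is the constant x1 - x0.
    """
    x_t = [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]
    velocity = [b - a for a, b in zip(x0, x1)]
    return x_t, velocity


def flow_matching_loss(predicted_velocity, target_velocity):
    """Mean squared error between predicted and target velocities."""
    n = len(target_velocity)
    return sum((p - q) ** 2 for p, q in zip(predicted_velocity, target_velocity)) / n


# Hypothetical latents: x0 stands in for an encoded caption, x1 for an image latent.
text_latent = [0.0, 1.0, -1.0]
image_latent = [2.0, 1.0, 3.0]
t = random.random()  # training samples t uniformly from [0, 1)
x_t, target = flow_matching_pair(text_latent, image_latent, t)
# A perfect velocity prediction gives zero loss.
print(x_t, target, flow_matching_loss(target, target))
```

In the conventional setup `x0` would be Gaussian noise and the caption would enter through a separate conditioning input. Here the caption latent itself plays the role of `x0`, so the model's only inputs are `x_t` and `t`.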

[CV-4] Scaling 4D Representations

[Quick read]: This paper addresses the scalability of self-supervised learning from video, focusing on non-semantic vision tasks such as camera pose estimation, point and object tracking, and depth estimation. The key to the solution is to learn from very large video datasets using masked auto-encoding (MAE) with Transformer video models; performance on these 4D tasks improves consistently as model size grows from 20M to 22B parameters. Rigorous comparison with many recent image and video models demonstrates the benefits of scaling 4D representations.

Link: https://arxiv.org/abs/2412.15212
Authors: João Carreira,Dilara Gokay,Michael King,Chuhan Zhang,Ignacio Rocco,Aravindh Mahendran,Thomas Albert Keck,Joseph Heyward,Skanda Koppula,Etienne Pot,Goker Erdogan,Yana Hasson,Yi Yang,Klaus Greff,Guillaume Le Moing,Sjoerd van Steenkiste,Daniel Zoran,Drew A. Hudson,Pedro Vélez,Luisa Polanía,Luke Friedman,Chris Duvarney,Ross Goroshin,Kelsey Allen,Jacob Walker,Rishabh Kabra,Eric Aboussouan,Jennifer Sun,Thomas Kipf,Carl Doersch,Viorica Pătrăucean,Dima Damen,Pauline Luc,Mehdi S. M. Sajjadi,Andrew Zisserman
Institutions: Google DeepMind; Google Research; Mila/Université de Montréal; University of Bristol; University of Oxford
Keywords: convincingly demonstrated, demonstrated for pure, pure self-supervised learning, video, pure self-supervised
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:Scaling has not yet been convincingly demonstrated for pure self-supervised learning from video. However, prior work has focused evaluations on semantic-related tasks (action classification, ImageNet classification, etc.). In this paper we focus on evaluating self-supervised learning on non-semantic vision tasks that are more spatial (3D) and temporal (+1D = 4D), such as camera pose estimation, point and object tracking, and depth estimation. We show that by learning from very large video datasets, masked auto-encoding (MAE) with transformer video models actually scales, consistently improving performance on these 4D tasks, as model size increases from 20M all the way to the largest by far reported self-supervised video model: 22B parameters. Rigorous apples-to-apples comparison with many recent image and video models demonstrates the benefits of scaling 4D representations.
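
Masked auto-encoding hides a random subset of input tokens and trains the model to reconstruct them from the visible ones. The sketch below shows only the masking step. The mask ratio, token count, and function name are illustrative assumptions, since the abstract does not state the values used in the paper.

```python
import random


def random_mask(num_tokens, mask_ratio, seed=None):
    """Split token indices into (visible, masked) lists for masked auto-encoding."""
    if not 0.0 <= mask_ratio < 1.0:
        raise ValueError("mask_ratio must be in [0, 1)")
    rng = random.Random(seed)
    num_masked = int(num_tokens * mask_ratio)
    masked = sorted(rng.sample(range(num_tokens), num_masked))
    masked_set = set(masked)
    visible = [i for i in range(num_tokens) if i not in masked_set]
    return visible, masked


# Hypothetical clip split into 16 space-time patches, 75% of them hidden.
visible, masked = random_mask(num_tokens=16, mask_ratio=0.75, seed=0)
print(len(visible), len(masked))
```

Only the visible tokens are passed to the encoder, so a high mask ratio also reduces the compute spent per training example.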

[CV-5] Generative Multiview Relighting for 3D Reconstruction under Extreme Illumination Variation

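[Quick read]: This paper addresses the difficulty of reconstructing object geometry and appearance from photographs taken in different lighting environments, especially for specular objects whose appearance depends strongly on the viewing direction. Existing methods handle illumination changes either by modeling appearance variation with a per-image embedding vector or by using physically based rendering to recover materials and per-image illumination; when the input illumination varies significantly, they fail to recover view-dependent appearance faithfully and tend to produce mostly diffuse results. The key to the proposed solution is to first relight the images under a single reference illumination with a multiview relighting diffusion model, and then reconstruct the object's geometry and appearance with a radiance field architecture that is robust to the small inconsistencies remaining among the relit images. The method is validated on synthetic and real datasets, greatly outperforms existing techniques, and is particularly effective at recovering view-dependent "shiny" appearance.

Link: https://arxiv.org/abs/2412.15211
Authors: Hadi Alzayer,Philipp Henzler,Jonathan T. Barron,Jia-Bin Huang,Pratul P. Srinivasan,Dor Verbin
Institutions: Google; University of Maryland, College Park
Keywords: environments is difficult, vary across captured, object appearance vary, appearance, images
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project page: this https URL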

Abstract:Reconstructing the geometry and appearance of objects from photographs taken in different environments is difficult as the illumination and therefore the object appearance vary across captured images. This is particularly challenging for more specular objects whose appearance strongly depends on the viewing direction. Some prior approaches model appearance variation across images using a per-image embedding vector, while others use physically-based rendering to recover the materials and per-image illumination. Such approaches fail at faithfully recovering view-dependent appearance given significant variation in input illumination and tend to produce mostly diffuse results. We present an approach that reconstructs objects from images taken under different illuminations by first relighting the images under a single reference illumination with a multiview relighting diffusion model and then reconstructing the object’s geometry and appearance with a radiance field architecture that is robust to the small remaining inconsistencies among the relit images. We validate our proposed approach on both synthetic and real datasets and demonstrate that it greatly outperforms existing techniques at reconstructing high-fidelity appearance from images taken under extreme illumination variation. Moreover, our approach is particularly effective at recovering view-dependent “shiny” appearance which cannot be reconstructed by prior methods.

[CV-6] PRIMA: Multi-Image Vision-Language Models for Reasoning Segmentation

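[Quick read]: This paper addresses two gaps: existing pixel-grounding models cannot make fine-grained comparisons across multiple images, and current multi-image understanding models lack pixel-level grounding. The key to the solution is PRIMA, a new Large Vision-Language Model (LVLM) that combines pixel-level grounding with strong multi-image reasoning to produce contextually rich, pixel-grounded explanations. At the core of PRIMA is an efficient vision module that queries fine-grained visual representations across multiple images, reducing TFLOPs by 25.3%. The paper also introduces the M^4Seg benchmark, with about 224K question-answer pairs, to support training and evaluation; experiments show PRIMA outperforming existing baselines on multi-image fine-grained visual understanding.

Link: https://arxiv.org/abs/2412.15209
Authors: Muntasir Wahed,Kiet A. Nguyen,Adheesh Sunil Juvekar,Xinzhuo Li,Xiaona Zhou,Vedant Shah,Tianjiao Yu,Pinar Yanardag,Ismini Lourentzou
Institutions: University of Illinois Urbana-Champaign; Virginia Tech
Keywords: Large Vision-Language Models, existing pixel-grounding models, pixel-grounding models operate, advancements in Large, Large Vision-Language
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Project page: this https URL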

Abstract:Despite significant advancements in Large Vision-Language Models (LVLMs), existing pixel-grounding models operate on single-image settings, limiting their ability to perform detailed, fine-grained comparisons across multiple images. Conversely, current multi-image understanding models lack pixel-level grounding. Our work addresses this gap by introducing the task of multi-image pixel-grounded reasoning segmentation, and PRIMA, a novel LVLM that integrates pixel-level grounding with robust multi-image reasoning capabilities to produce contextually rich, pixel-grounded explanations. Central to PRIMA is an efficient vision module that queries fine-grained visual representations across multiple images, reducing TFLOPs by 25.3%. To support training and evaluation, we curate M^4Seg, a new reasoning segmentation benchmark consisting of ~224K question-answer pairs that require fine-grained visual understanding across multiple images. Experimental results demonstrate PRIMA outperforms state-of-the-art baselines.

[CV-7] OpenEMMA: Open-Source Multimodal Model for End-to-End Autonomous Driving

[Quick read]: This paper addresses the high resource demands and slow progress in developing end-to-end autonomous driving systems. The key to the solution is OpenEMMA, an open-source end-to-end framework based on Multimodal Large Language Models (MLLMs). By incorporating a Chain-of-Thought reasoning process, OpenEMMA achieves significant improvements over the baseline across a range of MLLMs and demonstrates effectiveness, generalizability, and robustness in challenging driving scenarios, offering a more efficient and effective approach to autonomous driving.

Link: https://arxiv.org/abs/2412.15208
Authors: Shuo Xing,Chengyuan Qian,Yuping Wang,Hongyuan Hua,Kexin Tian,Yang Zhou,Zhengzhong Tu
Institutions: Cranberry-Lemon University; Texas A&M University; University of Michigan; University of Toronto
Keywords: Multimodal Large Language, Large Language Models, Multimodal Large, Large Language, advent of Multimodal
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Robotics (cs.RO)
Comments:

Abstract:Since the advent of Multimodal Large Language Models (MLLMs), they have made a significant impact across a wide range of real-world applications, particularly in Autonomous Driving (AD). Their ability to process complex visual data and reason about intricate driving scenarios has paved the way for a new paradigm in end-to-end AD systems. However, the progress of developing end-to-end models for AD has been slow, as existing fine-tuning methods demand substantial resources, including extensive computational power, large-scale datasets, and significant funding. Drawing inspiration from recent advancements in inference computing, we propose OpenEMMA, an open-source end-to-end framework based on MLLMs. By incorporating the Chain-of-Thought reasoning process, OpenEMMA achieves significant improvements compared to the baseline when leveraging a diverse range of MLLMs. Furthermore, OpenEMMA demonstrates effectiveness, generalizability, and robustness across a variety of challenging driving scenarios, offering a more efficient and effective approach to autonomous driving. We release all the codes in this https URL.

[CV-8] AutoTrust: Benchmarking Trustworthiness in Large Vision Language Models for Autonomous Driving

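[Quick read]: This paper addresses the trustworthiness of large vision-language models for autonomous driving (DriveVLMs), a factor that directly affects public transportation safety. The key to the solution is AutoTrust, a comprehensive trustworthiness benchmark covering trustfulness, safety, robustness, privacy, and fairness. Using a visual question-answering dataset with over 10k unique scenes and 18k queries, the paper evaluates six publicly available vision-language models and reveals vulnerabilities in their trustworthiness. It finds that generalist models outperform driving-specialized models in overall trustworthiness, and that all models remain susceptible to adversarial attacks and struggle to ensure fair decision-making. These findings underline the urgency of improving the trustworthiness of DriveVLMs to protect public safety and the reliability of autonomous driving systems.

Link: https://arxiv.org/abs/2412.15206
Authors: Shuo Xing,Hongyuan Hua,Xiangbo Gao,Shenzhe Zhu,Renjie Li,Kexin Tian,Xiaopeng Li,Heng Huang,Tianbao Yang,Zhangyang Wang,Yang Zhou,Huaxiu Yao,Zhengzhong Tu
Institutions: Texas A&M University; University of Toronto; University of Michigan; University of Wisconsin-Madison; University of Maryland; University of Texas at Austin; University of North Carolina at Chapel Hill
Keywords: Recent advancements, large vision language, shown strong scene, strong scene understanding, vision language models
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Robotics (cs.RO)
Comments: 55 pages, 14 figures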

Abstract:Recent advancements in large vision language models (VLMs) tailored for autonomous driving (AD) have shown strong scene understanding and reasoning capabilities, making them undeniable candidates for end-to-end driving systems. However, limited work exists on studying the trustworthiness of DriveVLMs – a critical factor that directly impacts public transportation safety. In this paper, we introduce AutoTrust, a comprehensive trustworthiness benchmark for large vision-language models in autonomous driving (DriveVLMs), considering diverse perspectives – including trustfulness, safety, robustness, privacy, and fairness. We constructed the largest visual question-answering dataset for investigating trustworthiness issues in driving scenarios, comprising over 10k unique scenes and 18k queries. We evaluated six publicly available VLMs, spanning from generalist to specialist, from open-source to commercial models. Our exhaustive evaluations have unveiled previously undiscovered vulnerabilities of DriveVLMs to trustworthiness threats. Specifically, we found that the general VLMs like LLaVA-v1.6 and GPT-4o-mini surprisingly outperform specialized models fine-tuned for driving in terms of overall trustworthiness. DriveVLMs like DriveLM-Agent are particularly vulnerable to disclosing sensitive information. Additionally, both generalist and specialist VLMs remain susceptible to adversarial attacks and struggle to ensure unbiased decision-making across diverse environments and populations. Our findings call for immediate and decisive action to address the trustworthiness of DriveVLMs – an issue of critical importance to public safety and the welfare of all citizens relying on autonomous transportation systems. Our benchmark is publicly available at this https URL, and the leaderboard is released at this https URL.

[CV-9] FlowAR: Scale-wise Autoregressive Image Generation Meets Flow Matching

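[Quick read]: This paper addresses two problems of the scale-wise autoregressive image generation model VAR: its complex and rigid scale design, and the generator's dependence on a discrete tokenizer with the same complex scale structure. The key to the solution is FlowAR, a next-scale prediction method with a streamlined scale design in which each subsequent scale is simply double the previous one, removing the need for VAR's intricate multi-scale residual tokenizer. FlowAR can use any off-the-shelf Variational AutoEncoder (VAE) and integrates Flow Matching to improve image synthesis quality. It shows better generation performance than previous methods on the ImageNet-256 benchmark.

Link: https://arxiv.org/abs/2412.15205
Authors: Sucheng Ren,Qihang Yu,Ju He,Xiaohui Shen,Alan Yuille,Liang-Chieh Chen
Institutions: Johns Hopkins University; ByteDance
Keywords: achieved remarkable success, natural language processing, scale prediction, scale-wise autoregressive modeling, token prediction
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: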

Abstract:Autoregressive (AR) modeling has achieved remarkable success in natural language processing by enabling models to generate text with coherence and contextual understanding through next token prediction. Recently, in image generation, VAR proposes scale-wise autoregressive modeling, which extends the next token prediction to the next scale prediction, preserving the 2D structure of images. However, VAR encounters two primary challenges: (1) its complex and rigid scale design limits generalization in next scale prediction, and (2) the generator’s dependence on a discrete tokenizer with the same complex scale structure restricts modularity and flexibility in updating the tokenizer. To address these limitations, we introduce FlowAR, a general next scale prediction method featuring a streamlined scale design, where each subsequent scale is simply double the previous one. This eliminates the need for VAR’s intricate multi-scale residual tokenizer and enables the use of any off-the-shelf Variational AutoEncoder (VAE). Our simplified design enhances generalization in next scale prediction and facilitates the integration of Flow Matching for high-quality image synthesis. We validate the effectiveness of FlowAR on the challenging ImageNet-256 benchmark, demonstrating superior generation performance compared to previous methods. Codes will be available at this https URL.
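
The "each subsequent scale is simply double the previous one" design amounts to a geometric schedule of side lengths. A minimal sketch is shown below. The final latent size of 16 and the five scales are hypothetical values chosen for illustration, not settings taken from the paper.

```python
def doubling_scales(final_size, num_scales):
    """Return side lengths where each scale is double the previous one."""
    if num_scales < 1:
        raise ValueError("num_scales must be at least 1")
    factor = 2 ** (num_scales - 1)
    if final_size % factor != 0:
        raise ValueError("final_size must be divisible by 2**(num_scales - 1)")
    first = final_size // factor
    return [first * (2 ** i) for i in range(num_scales)]


# Hypothetical 16x16 latent grid generated over five scales.
print(doubling_scales(final_size=16, num_scales=5))  # [1, 2, 4, 8, 16]
```

Because every scale is a plain downsampling of the final latent, such a schedule does not depend on how the tokenizer or VAE was trained, which fits the abstract's claim that any off-the-shelf VAE can be used.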

[CV-10] DI-PCG: Diffusion-based Efficient Inverse Procedural Content Generation for High-quality 3D Asset Creation

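[Quick read]: This paper addresses the difficulty of controlling parameters in Inverse Procedural Content Generation (IPCG), in particular when generating high-quality 3D content from image conditions, where existing sampling-based and neural-network-based methods need many sample iterations or offer limited controllability. The key to the solution is DI-PCG, whose core is a lightweight diffusion Transformer that treats the PCG parameters directly as the denoising target and uses the input image as the condition that controls parameter generation. DI-PCG needs only 7.6M network parameters and 30 GPU hours of training, and it recovers parameters accurately and generalizes well to in-the-wild images.

Link: https://arxiv.org/abs/2412.15200
Authors: Wang Zhao,Yan-Pei Cao,Jiale Xu,Yuejiang Dong,Ying Shan
Institutions: Tencent PCG; VAST; Tsinghua University
Keywords: Procedural Content Generation, Inverse Procedural Content, Procedural Content, produce desired shapes, extensive parameter tuning
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Graphics (cs.GR)
Comments: Project page: this https URL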

Abstract:Procedural Content Generation (PCG) is powerful in creating high-quality 3D contents, yet controlling it to produce desired shapes is difficult and often requires extensive parameter tuning. Inverse Procedural Content Generation aims to automatically find the best parameters under the input condition. However, existing sampling-based and neural network-based methods still suffer from numerous sample iterations or limited controllability. In this work, we present DI-PCG, a novel and efficient method for Inverse PCG from general image conditions. At its core is a lightweight diffusion transformer model, where PCG parameters are directly treated as the denoising target and the observed images as conditions to control parameter generation. DI-PCG is efficient and effective. With only 7.6M network parameters and 30 GPU hours to train, it demonstrates superior performance in recovering parameters accurately, and generalizing well to in-the-wild images. Quantitative and qualitative experiment results validate the effectiveness of DI-PCG in inverse PCG and image-to-3D generation tasks. DI-PCG offers a promising approach for efficient inverse PCG and represents a valuable exploration step towards a 3D generation path that models how to construct a 3D asset using parametric models.

[CV-11] LiDAR-RT: Gaussian-based Ray Tracing for Dynamic LiDAR Re-simulation

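[Quick read]: This paper targets real-time LiDAR re-simulation in dynamic driving scenarios. Existing methods achieve high-fidelity re-simulation by combining Neural Radiance Fields with physical modeling of LiDAR sensors, but their high computational demands in large-scale scenes prevent real-time rendering. The proposed LiDAR-RT framework integrates Gaussian primitives with hardware-accelerated ray tracing into an efficient and effective rendering pipeline, enabling real-time, physically accurate LiDAR re-simulation. The key is to model the physical properties of LiDAR sensors with Gaussian primitives that have learnable parameters and to use scene graphs to handle scene dynamics; the framework builds a Bounding Volume Hierarchy (BVH) and generates novel LiDAR views through a differentiable rendering algorithm, supporting flexible scene editing and various sensor configurations.

Link: https://arxiv.org/abs/2412.15199
Authors: Chenxu Zhou,Lvchang Fu,Sida Peng,Yunzhi Yan,Zhanhua Zhang,Yong Chen,Jiazhi Xia,Xiaowei Zhou
Institutions: Zhejiang University; Central South University; Geely Automobile Research Institute
Keywords: paper targets, targets the challenge, LiDAR, dynamic driving scenarios, LiDAR re-simulation
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Robotics (cs.RO)
Comments: Project page: this https URL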

Abstract:This paper targets the challenge of real-time LiDAR re-simulation in dynamic driving scenarios. Recent approaches utilize neural radiance fields combined with the physical modeling of LiDAR sensors to achieve high-fidelity re-simulation results. Unfortunately, these methods face limitations due to high computational demands in large-scale scenes and cannot perform real-time LiDAR rendering. To overcome these constraints, we propose LiDAR-RT, a novel framework that supports real-time, physically accurate LiDAR re-simulation for driving scenes. Our primary contribution is the development of an efficient and effective rendering pipeline, which integrates Gaussian primitives and hardware-accelerated ray tracing technology. Specifically, we model the physical properties of LiDAR sensors using Gaussian primitives with learnable parameters and incorporate scene graphs to handle scene dynamics. Building upon this scene representation, our framework first constructs a bounding volume hierarchy (BVH), then casts rays for each pixel and generates novel LiDAR views through a differentiable rendering algorithm. Importantly, our framework supports realistic rendering with flexible scene editing operations and various sensor configurations. Extensive experiments across multiple public benchmarks demonstrate that our method outperforms state-of-the-art methods in terms of rendering quality and efficiency. Our project page is at this https URL.

[CV-12] Preventing Local Pitfalls in Vector Quantization via Optimal Transport

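[Quick read] This paper addresses the training instability of vector-quantized networks (VQNs), which it traces primarily to a local-minima problem. The key is OptVQ, a vector quantization method that replaces the conventional nearest-neighbor search with an optimal transport formulation solved by the Sinkhorn algorithm, giving a more globally informed assignment. The paper also proposes a simple normalization strategy that reduces the sensitivity of the Sinkhorn algorithm to differing data distributions. On image reconstruction tasks, OptVQ reaches 100% codebook utilization and surpasses current state-of-the-art VQNs in reconstruction quality.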

Link: https://arxiv.org/abs/2412.15195
Authors: Borui Zhang, Wenzhao Zheng, Jie Zhou, Jiwen Lu
Affiliations: Department of Automation, Tsinghua University, China
Keywords: exhibited remarkable performance, Vector-quantized networks, training process due, model distillation, exhibited remarkable
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: Code is available at this https URL

Abstract:Vector-quantized networks (VQNs) have exhibited remarkable performance across various tasks, yet they are prone to training instability, which complicates the training process due to the necessity for techniques such as subtle initialization and model distillation. In this study, we identify the local minima issue as the primary cause of this instability. To address this, we integrate an optimal transport method in place of the nearest neighbor search to achieve a more globally informed assignment. We introduce OptVQ, a novel vector quantization method that employs the Sinkhorn algorithm to optimize the optimal transport problem, thereby enhancing the stability and efficiency of the training process. To mitigate the influence of diverse data distributions on the Sinkhorn algorithm, we implement a straightforward yet effective normalization strategy. Our comprehensive experiments on image reconstruction tasks demonstrate that OptVQ achieves 100% codebook utilization and surpasses current state-of-the-art VQNs in reconstruction quality.
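
To see why an optimal transport assignment can use more of the codebook than nearest-neighbor search, here is a small pure-Python sketch of Sinkhorn iterations with uniform marginals. It is an illustration under stated assumptions, not the OptVQ implementation: the toy 1-D data, the `epsilon` value, the iteration count, and the final per-row argmax are all choices made for this example.

```python
import math

def nearest_assign(cost):
    """Baseline: each feature picks its cheapest codeword."""
    return [min(range(len(row)), key=lambda j: row[j]) for row in cost]

def sinkhorn_assign(cost, epsilon=0.1, n_iters=200):
    """Entropic OT with uniform marginals; returns the codeword chosen for each feature."""
    n, k = len(cost), len(cost[0])
    kernel = [[math.exp(-c / epsilon) for c in row] for row in cost]
    u, v = [1.0] * n, [1.0] * k
    for _ in range(n_iters):
        # Alternate scalings so that rows sum to 1/n and columns sum to 1/k.
        u = [(1.0 / n) / sum(kernel[i][j] * v[j] for j in range(k)) for i in range(n)]
        v = [(1.0 / k) / sum(kernel[i][j] * u[i] for i in range(n)) for j in range(k)]
    plan = [[u[i] * kernel[i][j] * v[j] for j in range(k)] for i in range(n)]
    return [max(range(k), key=lambda j: plan[i][j]) for i in range(n)]

# Toy data: four 1-D features clustered near codeword 0; codeword 1 is far away.
features = [0.0, 0.1, 0.2, 0.3]
codebook = [0.0, 1.0]
cost = [[(x - c) ** 2 for c in codebook] for x in features]

nn = nearest_assign(cost)    # every feature picks codeword 0
ot = sinkhorn_assign(cost)   # the column constraint forces codeword 1 to be used
```

The column marginal is what prevents a codeword from being starved: each codeword must receive its share of the total mass, so the features closest to the unused codeword are routed to it.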

[CV-13] AV-Link: Temporally-Aligned Diffusion Features for Cross-Modal Audio-Video Generation

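[Quick read] This paper proposes AV-Link, a unified framework for video-to-audio and audio-to-video generation. It uses the activations of frozen video and audio diffusion models for temporally-aligned cross-modal conditioning. The core of the solution is a Fusion Block that exchanges information bidirectionally between the video and audio diffusion backbones through a temporally-aligned self-attention operation. Unlike prior work, which conditions on feature extractors pretrained for other tasks, AV-Link directly uses features from the complementary modality within a single framework (video features to generate audio, or audio features to generate video), producing synchronized, high-quality audiovisual content.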

Link: https://arxiv.org/abs/2412.15191
Authors: Moayed Haji-Ali, Willi Menapace, Aliaksandr Siarohin, Ivan Skorokhodov, Alper Canberk, Kwot Sin Lee, Vicente Ordonez, Sergey Tulyakov
Affiliations: Rice University; Snap Inc
Keywords: audio diffusion models, diffusion models, temporally-aligned cross-modal conditioning, activations of frozen, audio diffusion
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Comments: Project Page: this http URL

Abstract:We propose AV-Link, a unified framework for Video-to-Audio and Audio-to-Video generation that leverages the activations of frozen video and audio diffusion models for temporally-aligned cross-modal conditioning. The key to our framework is a Fusion Block that enables bidirectional information exchange between our backbone video and audio diffusion models through a temporally-aligned self attention operation. Unlike prior work that uses feature extractors pretrained for other tasks for the conditioning signal, AV-Link can directly leverage features obtained by the complementary modality in a single framework i.e. video features to generate audio, or audio features to generate video. We extensively evaluate our design choices and demonstrate the ability of our method to achieve synchronized and high-quality audiovisual content, showcasing its potential for applications in immersive media generation. Project Page: this http URL

[CV-14] EarthDial: Turning Multi-sensory Earth Observations to Interactive Dialogues

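[Quick read] This paper addresses the poor performance of generic vision-language models (VLMs) on remote sensing data, and the restriction of current geospatial VLMs to a fixed resolution and a few sensor modalities. The key is EarthDial, a conversational assistant designed specifically for Earth Observation (EO) data that turns complex, multi-sensory Earth observations into interactive natural-language dialogues. EarthDial supports multi-spectral, multi-temporal, and multi-resolution imagery, covering RGB, Synthetic Aperture Radar (SAR), and multispectral modalities such as Near-Infrared (NIR) and infrared, and it handles bi-temporal and multi-temporal sequence analysis for applications such as change detection. Trained on an extensive instruction-tuning dataset of over 11.11M instruction pairs, EarthDial outperforms existing generic and domain-specific models on 37 downstream applications and generalizes better across EO tasks.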

Link: https://arxiv.org/abs/2412.15190
Authors: Sagar Soni, Akshay Dudhane, Hiyam Debary, Mustansar Fiaz, Muhammad Akhtar Munir, Muhammad Sohail Danish, Paolo Fraccaro, Campbell D Watson, Levente J Klein, Fahad Shahbaz Khan, Salman Khan
Affiliations: IBM Research; Mohamed bin Zayed University of AI; Australian National University; Linköping University
Keywords: vast Earth observation, Earth observation data, Earth observation, multi-sensory Earth observations, disaster response
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Automated analysis of vast Earth observation data via interactive Vision-Language Models (VLMs) can unlock new opportunities for environmental monitoring, disaster response, and resource management. Existing generic VLMs do not perform well on Remote Sensing data, while the recent Geo-spatial VLMs remain restricted to a fixed resolution and few sensor modalities. In this paper, we introduce EarthDial, a conversational assistant specifically designed for Earth Observation (EO) data, transforming complex, multi-sensory Earth observations into interactive, natural language dialogues. EarthDial supports multi-spectral, multi-temporal, and multi-resolution imagery, enabling a wide range of remote sensing tasks, including classification, detection, captioning, question answering, visual reasoning, and visual grounding. To achieve this, we introduce an extensive instruction tuning dataset comprising over 11.11M instruction pairs covering RGB, Synthetic Aperture Radar (SAR), and multispectral modalities such as Near-Infrared (NIR) and infrared. Furthermore, EarthDial handles bi-temporal and multi-temporal sequence analysis for applications like change detection. Our extensive experimental results on 37 downstream applications demonstrate that EarthDial outperforms existing generic and domain-specific models, achieving better generalization across various EO tasks.

[CV-15] Tiled Diffusion

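[Quick read] This paper addresses the limited scalability and flexibility of manually constructed image tiles. The key is Tiled Diffusion, which extends diffusion models to generate coherent tiling patterns across the image-synthesis domains that require tiling. It covers scenarios from self-tiling to complex many-to-many connections between multiple images, which automates tiling, removes manual intervention, and opens up applications such as seamless tiling of existing images, tiled texture creation, and 360° synthesis.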

Link: https://arxiv.org/abs/2412.15185
Authors: Or Madar, Ohad Fried
Affiliations: Reichman University
Keywords: coherent visual field, video game asset, game asset development, visual field, video game
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Image tiling – the seamless connection of disparate images to create a coherent visual field – is crucial for applications such as texture creation, video game asset development, and digital art. Traditionally, tiles have been constructed manually, a method that poses significant limitations in scalability and flexibility. Recent research has attempted to automate this process using generative models. However, current approaches primarily focus on tiling textures and manipulating models for single-image generation, without inherently supporting the creation of multiple interconnected tiles across diverse domains. This paper presents Tiled Diffusion, a novel approach that extends the capabilities of diffusion models to accommodate the generation of cohesive tiling patterns across various domains of image synthesis that require tiling. Our method supports a wide range of tiling scenarios, from self-tiling to complex many-to-many connections, enabling seamless integration of multiple images. Tiled Diffusion automates the tiling process, eliminating the need for manual intervention and enhancing creative possibilities in various applications, such as seamlessly tiling of existing images, tiled texture creation, and 360° synthesis.

[CV-16] SqueezeMe: Efficient Gaussian Avatars for VR

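[Quick read] This paper addresses real-time inference of multiple high-visual-quality Gaussian avatars on a portable VR headset such as the Meta Quest 3. The avatars are animated by driving a base set of Gaussians with linear blend skinning (LBS) and then correcting their appearance with a neural network decoder. The two computational bottlenecks on the headset are the decoder and the rendering. To speed up the decoder, the Gaussians are trained in UV space instead of pixel space, the decoder is distilled to a single neural network layer, and neighborhoods of Gaussians share a single corrective from the decoder. To speed up rendering, the authors build a custom Vulkan pipeline that runs on the mobile GPU. Together these allow 3 Gaussian avatars to run concurrently at 72 FPS on the headset.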

Link: https://arxiv.org/abs/2412.15171
Authors: Shunsuke Saito, Stanislav Pidhorskyi, Igor Santesteban, Forrest Iandola, Divam Gupta, Anuj Pahuja, Nemanja Bartolovic, Frank Yu, Emanuel Garbin, Tomas Simon
Affiliations: Meta Reality Labs, USA; Meta Reality Labs, Switzerland; Meta Reality Labs, Israel
Keywords: Gaussian Splatting, Splatting has enabled, Gaussian avatars, Gaussians, multiple Gaussian avatars
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Initial version

Abstract:Gaussian Splatting has enabled real-time 3D human avatars with unprecedented levels of visual quality. While previous methods require a desktop GPU for real-time inference of a single avatar, we aim to squeeze multiple Gaussian avatars onto a portable virtual reality headset with real-time drivable inference. We begin by training a previous work, Animatable Gaussians, on a high quality dataset captured with 512 cameras. The Gaussians are animated by controlling base set of Gaussians with linear blend skinning (LBS) motion and then further adjusting the Gaussians with a neural network decoder to correct their appearance. When deploying the model on a Meta Quest 3 VR headset, we find two major computational bottlenecks: the decoder and the rendering. To accelerate the decoder, we train the Gaussians in UV-space instead of pixel-space, and we distill the decoder to a single neural network layer. Further, we discover that neighborhoods of Gaussians can share a single corrective from the decoder, which provides an additional speedup. To accelerate the rendering, we develop a custom pipeline in Vulkan that runs on the mobile GPU. Putting it all together, we run 3 Gaussian avatars concurrently at 72 FPS on a VR headset. Demo videos are at this https URL.
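
Linear blend skinning moves a point by averaging where each bone's rigid transform would send it, weighted by how strongly the point is bound to that bone. The sketch below shows the idea in 2-D for a single point; the helper names and the 2-D simplification are assumptions for illustration, and the paper applies LBS to 3-D Gaussians before the decoder's correction.

```python
import math

def rigid_2d(angle, tx, ty):
    """2x3 matrix for a rotation by `angle` followed by a translation (tx, ty)."""
    c, s = math.cos(angle), math.sin(angle)
    return ((c, -s, tx), (s, c, ty))

def apply(transform, point):
    (a, b, tx), (c, d, ty) = transform
    x, y = point
    return (a * x + b * y + tx, c * x + d * y + ty)

def linear_blend_skinning(point, transforms, weights):
    """Blend the point's position under each bone transform, weighted per bone."""
    assert abs(sum(weights) - 1.0) < 1e-9, "skinning weights should sum to 1"
    moved = [apply(t, point) for t in transforms]
    return tuple(sum(w * p[axis] for w, p in zip(weights, moved)) for axis in range(2))

# A point bound equally to a static bone and a bone translated by (2, 0)
# ends up halfway between the two candidate positions.
bones = [rigid_2d(0.0, 0.0, 0.0), rigid_2d(0.0, 2.0, 0.0)]
skinned = linear_blend_skinning((1.0, 1.0), bones, [0.5, 0.5])
```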

[CV-17] OnlineVPO: Align Video Diffusion Model with Online Video-Centric Preference Optimization

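[Quick read] This paper addresses practical shortcomings of video diffusion models (VDMs) in text-to-video generation, notably degraded image quality and flickering artifacts, along with the misaligned feedback and poor scalability of existing feedback-learning methods. The key is OnlineVPO, a more efficient preference learning approach tailored to video diffusion models. It has two main designs: (1) instead of image-based reward feedback, it uses a video quality assessment (VQA) model trained on synthetic data as the reward model, which gives distribution- and modality-aligned feedback; (2) it introduces an online DPO algorithm to address the off-policy optimization and scalability issues of existing video preference learning frameworks. With the video reward model supplying feedback on the fly, OnlineVPO provides effective, efficient, and scalable preference guidance.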

Link: https://arxiv.org/abs/2412.15159
Authors: Jiacheng Zhang, Jie Wu, Weifeng Chen, Yatai Ji, Xuefeng Xiao, Weilin Huang, Kai Han
Affiliations: The University of Hong Kong; ByteDance
Keywords: made significant strides, generation has made, significant strides, video diffusion, made significant
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:In recent years, the field of text-to-video (T2V) generation has made significant strides. Despite this progress, there is still a gap between theoretical advancements and practical application, amplified by issues like degraded image quality and flickering artifacts. Recent advancements in enhancing the video diffusion model (VDM) through feedback learning have shown promising results. However, these methods still exhibit notable limitations, such as misaligned feedback and inferior scalability. To tackle these issues, we introduce OnlineVPO, a more efficient preference learning approach tailored specifically for video diffusion models. Our method features two novel designs, firstly, instead of directly using image-based reward feedback, we leverage the video quality assessment (VQA) model trained on synthetic data as the reward model to provide distribution and modality-aligned feedback on the video diffusion model. Additionally, we introduce an online DPO algorithm to address the off-policy optimization and scalability issue in existing video preference learning frameworks. By employing the video reward model to offer concise video feedback on the fly, OnlineVPO offers effective and efficient preference guidance. Extensive experiments on the open-source video-diffusion model demonstrate OnlineVPO as a simple yet effective and more importantly scalable preference learning algorithm for video diffusion models, offering valuable insights for future advancements in this domain.
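
The abstract does not spell out the loss, so the following sketch uses the standard DPO objective for a single preference pair and a hypothetical `online_pair` helper that builds the pair from fresh samples scored by a reward function. Both are assumptions for illustration: diffusion models typically replace exact log-probabilities with a denoising-based surrogate, and OnlineVPO's exact formulation may differ.

```python
import math

def dpo_loss(logp_win, logp_lose, ref_logp_win, ref_logp_lose, beta=0.1):
    """Standard DPO objective for one preference pair: -log sigmoid(beta * margin)."""
    margin = beta * ((logp_win - ref_logp_win) - (logp_lose - ref_logp_lose))
    # Numerically stable softplus(-margin).
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))

def online_pair(candidates, reward_fn):
    """Rank fresh samples from the current policy with a reward model; return (best, worst)."""
    ranked = sorted(candidates, key=reward_fn)
    return ranked[-1], ranked[0]

# With no preference margin the loss is log(2); it shrinks as the policy
# favors the preferred sample more than the reference model does.
neutral = dpo_loss(0.0, 0.0, 0.0, 0.0)
improved = dpo_loss(-1.0, -3.0, -2.0, -2.0)
```

The "online" part is that the pair comes from the current policy at each step, so the training signal does not drift off-policy the way a fixed, pre-collected preference dataset does.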

[CV-18] Leveraging Color Channel Independence for Improved Unsupervised Object Detection

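[Quick read] This paper challenges the common assumption in computer vision that RGB is the optimal color space for unsupervised learning. It argues that other color spaces such as HSV have properties, like robustness to lighting conditions, that matter for object-centric representation learning. The key is to transform the prediction targets to RGB-S space, which extends RGB with the saturation component of HSV and markedly improves reconstruction and disentanglement on five common evaluation datasets. The approach adds essentially no computational overhead, is agnostic to model architecture, and applies to a wide range of visual computing tasks and training types.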

Link: https://arxiv.org/abs/2412.15150
Authors: Bastian Jäckl, Yannick Metz, Udo Schlegel, Daniel A. Keim, Maximilian T. Fischer
Affiliations: University of Konstanz
Keywords: extract distinct object, enabling downstream applications, distinct object representations, object level, distinct object
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 38 pages incl. references, 16 figures

Abstract:Object-centric architectures can learn to extract distinct object representations from visual scenes, enabling downstream applications on the object level. Similarly to autoencoder-based image models, object-centric approaches have been trained on the unsupervised reconstruction loss of images encoded by RGB color spaces. In our work, we challenge the common assumption that RGB images are the optimal color space for unsupervised learning in computer vision. We discuss conceptually and empirically that other color spaces, such as HSV, bear essential characteristics for object-centric representation learning, like robustness to lighting conditions. We further show that models improve when requiring them to predict additional color channels. Specifically, we propose to transform the predicted targets to the RGB-S space, which extends RGB with HSV’s saturation component and leads to markedly better reconstruction and disentanglement for five common evaluation datasets. The use of composite color spaces can be implemented with basically no computational overhead, is agnostic of the models’ architecture, and is universally applicable across a wide range of visual computing tasks and training types. The findings of our approach encourage additional investigations in computer vision tasks beyond object-centric learning.
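
Building an RGB-S target only requires computing HSV saturation per pixel and appending it as a fourth channel. A minimal per-pixel sketch with Python's standard `colorsys` module follows; the function name `rgb_to_rgbs` is an assumption, and a real pipeline would vectorize this over whole images.

```python
import colorsys

def rgb_to_rgbs(r, g, b):
    """Append HSV saturation to an RGB pixel whose channels lie in [0, 1]."""
    _, saturation, _ = colorsys.rgb_to_hsv(r, g, b)
    return (r, g, b, saturation)

pure_red = rgb_to_rgbs(1.0, 0.0, 0.0)   # fully saturated
mid_gray = rgb_to_rgbs(0.5, 0.5, 0.5)   # no saturation
```

Because the extra channel is a fixed function of the target image, it changes the reconstruction target but adds no learned parameters on the input side, which is consistent with the abstract's claim of essentially no computational overhead.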

[CV-19] Jet: A Modern Transformer-Based Normalizing Flow

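[Quick read] This paper addresses the gap in visual quality between normalizing flows and other generative model classes (GANs, VQ-VAE-based approaches, diffusion models) on natural images. The key is to revisit the design of coupling-based normalizing flows by carefully ablating prior design choices and by using computational blocks based on the Vision Transformer architecture instead of convolutional neural networks. The result is a much simpler architecture that the authors report achieves state-of-the-art quantitative and qualitative performance. Overall visual quality still trails the current state-of-the-art models, but the authors argue that strong normalizing flows can advance research by serving as building blocks of more powerful generative models.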

Link: https://arxiv.org/abs/2412.15129
Authors: Alexander Kolesnikov, André Susano Pinto, Michael Tschannen
Affiliations: Google DeepMind
Keywords: natural images, promising class, normalizing flow models, Vision Transformer architecture, models
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:In the past, normalizing generative flows have emerged as a promising class of generative models for natural images. This type of model has many modeling advantages: the ability to efficiently compute log-likelihood of the input data, fast generation and simple overall structure. Normalizing flows remained a topic of active research but later fell out of favor, as visual quality of the samples was not competitive with other model classes, such as GANs, VQ-VAE-based approaches or diffusion models. In this paper we revisit the design of the coupling-based normalizing flow models by carefully ablating prior design choices and using computational blocks based on the Vision Transformer architecture, not convolutional neural networks. As a result, we achieve state-of-the-art quantitative and qualitative performance with a much simpler architecture. While the overall visual quality is still behind the current state-of-the-art models, we argue that strong normalizing flow models can help advancing research frontier by serving as building components of more powerful generative models.
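
A coupling layer is invertible by construction: half of the input passes through unchanged and parameterizes an affine map applied to the other half. The sketch below illustrates this with a toy `conditioner`, which is a hypothetical stand-in for the network that predicts scale and shift (a Vision Transformer block in the paper, per the abstract).

```python
import math

def conditioner(x_a):
    """Toy stand-in for the network that predicts (log-scale, shift). Hypothetical."""
    total = sum(x_a)
    return math.tanh(total), 0.5 * total

def coupling_forward(x):
    half = len(x) // 2
    x_a, x_b = x[:half], x[half:]
    log_s, t = conditioner(x_a)
    y_b = [v * math.exp(log_s) + t for v in x_b]
    log_det = log_s * len(x_b)   # log |det Jacobian| of the affine map
    return x_a + y_b, log_det

def coupling_inverse(y):
    half = len(y) // 2
    y_a, y_b = y[:half], y[half:]
    log_s, t = conditioner(y_a)   # y_a equals x_a, so the same parameters are recovered
    x_b = [(v - t) * math.exp(-log_s) for v in y_b]
    return y_a + x_b

x = [0.2, -0.4, 1.0, 3.0]
y, log_det = coupling_forward(x)
x_back = coupling_inverse(y)
```

The cheap log-determinant is what makes exact log-likelihood tractable, which is the modeling advantage the abstract mentions.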

[CV-20] Parallelized Autoregressive Visual Generation

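[Quick read] This paper addresses the slow inference of autoregressive models for visual generation, which stems from sequential token-by-token prediction. The key is a parallelized autoregressive generation strategy based on the dependencies between visual tokens: distant tokens with weak dependencies are generated in parallel, while strongly dependent adjacent tokens are still generated sequentially, because sampling them independently can produce inconsistencies. The strategy plugs into standard autoregressive models without changes to the architecture or tokenizer. On ImageNet and UCF-101, covering image and video generation, it achieves a 3.6x speedup with comparable quality and up to 9.5x speedup with minimal quality degradation.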

Link: https://arxiv.org/abs/2412.15119
Authors: Yuqing Wang, Shuhuai Ren, Zhijie Lin, Yujin Han, Haoyuan Guo, Zhenheng Yang, Difan Zou, Jiashi Feng, Xihui Liu
Affiliations: University of Hong Kong; ByteDance Seed; Peking University
Keywords: slow inference speed, inference speed due, prediction process, suffer from slow, slow inference
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project page: this https URL

Abstract:Autoregressive models have emerged as a powerful approach for visual generation but suffer from slow inference speed due to their sequential token-by-token prediction process. In this paper, we propose a simple yet effective approach for parallelized autoregressive visual generation that improves generation efficiency while preserving the advantages of autoregressive modeling. Our key insight is that parallel generation depends on visual token dependencies-tokens with weak dependencies can be generated in parallel, while strongly dependent adjacent tokens are difficult to generate together, as their independent sampling may lead to inconsistencies. Based on this observation, we develop a parallel generation strategy that generates distant tokens with weak dependencies in parallel while maintaining sequential generation for strongly dependent local tokens. Our approach can be seamlessly integrated into standard autoregressive models without modifying the architecture or tokenizer. Experiments on ImageNet and UCF-101 demonstrate that our method achieves a 3.6x speedup with comparable quality and up to 9.5x speedup with minimal quality degradation across both image and video generation tasks. We hope this work will inspire future research in efficient visual generation and unified autoregressive modeling. Project page: this https URL.
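
One way to picture "distant tokens in parallel, adjacent tokens in sequence" is a generation schedule over a token grid split into square regions. The sketch below is an illustration under assumptions, not the paper's exact ordering: the region layout and the function `parallel_schedule` are hypothetical. It generates the first token of each region one at a time, then generates the tokens at the same local offset in every region together.

```python
def parallel_schedule(grid, regions):
    """Return a list of steps; each step lists the (row, col) tokens generated together."""
    size = grid // regions   # side length of one square region
    corners = [(r * size, c * size) for r in range(regions) for c in range(regions)]
    steps = [[corner] for corner in corners]   # first token of each region: one at a time
    for dr in range(size):
        for dc in range(size):
            if (dr, dc) != (0, 0):
                # Same local offset in every region, so the tokens are `size` cells apart.
                steps.append([(r0 + dr, c0 + dc) for r0, c0 in corners])
    return steps

# A 4x4 token grid split into 2x2 regions needs 7 steps instead of 16.
steps = parallel_schedule(grid=4, regions=2)
```

Within a step, the tokens sit in different regions and are therefore far apart, while neighbors inside a region are still produced in different steps.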

[CV-21] Knowing Where to Focus: Attention-Guided Alignment for Text-based Person Search

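[Quick read] This paper addresses two problems in Text-Based Person Search (TBPS). First, the widely used random Masked Language Modeling (MLM) treats all words equally during training, so many semantically vacuous words ('with', 'the', and so on) get masked; these contribute little to cross-modal interaction and hamper the alignment of text and visual representations. Second, the manual descriptions in TBPS datasets contain repetitions and errors, which lead to low-quality text representations. The paper proposes an Attention-Guided Alignment (AGA) framework with two components: Attention-Guided Mask (AGM) Modeling and a Text Enrichment Module (TEM). AGM aggregates the attention weights from the text encoding process to dynamically mask semantically meaningful words, so that cross-modal MLM captures information about the masked word from both the text context and the image and aligns their representations. TEM replaces semantically meaningful words with MLM predictions, which mitigates the low-quality representations caused by repetitive and erroneous descriptions, enriches the text, and helps prevent overfitting.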

Link: https://arxiv.org/abs/2412.15106
Authors: Lei Tan, Weihao Li, Pingyang Dai, Jie Chen, Liujuan Cao, Rongrong Ji
Affiliations: Media Analytics and Computing Laboratory, Department of Artificial Intelligence, School of Informatics, Xiamen University, Xiamen 361005, China; School of Electronic and Computer Engineering, Peking University, Beijing 100871, China; Peng Cheng Laboratory, Shenzhen 518066, China
Keywords: Text-Based Person Search, Person Search, mainstream methods aim, Text-Based Person, mainstream methods
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:In the realm of Text-Based Person Search (TBPS), mainstream methods aim to explore more efficient interaction frameworks between text descriptions and visual data. However, recent approaches encounter two principal challenges. Firstly, the widely used random-based Masked Language Modeling (MLM) considers all the words in the text equally during training. However, massive semantically vacuous words (‘with’, ‘the’, etc.) be masked fail to contribute efficient interaction in the cross-modal MLM and hampers the representation alignment. Secondly, manual descriptions in TBPS datasets are tedious and inevitably contain several inaccuracies. To address these issues, we introduce an Attention-Guided Alignment (AGA) framework featuring two innovative components: Attention-Guided Mask (AGM) Modeling and Text Enrichment Module (TEM). AGM dynamically masks semantically meaningful words by aggregating the attention weight derived from the text encoding process, thereby cross-modal MLM can capture information related to the masked word from text context and images and align their representations. Meanwhile, TEM alleviates low-quality representations caused by repetitive and erroneous text descriptions by replacing those semantically meaningful words with MLM’s prediction. It not only enriches text descriptions but also prevents overfitting. Extensive experiments across three challenging benchmarks demonstrate the effectiveness of our AGA, achieving new state-of-the-art results with Rank-1 accuracy reaching 78.36%, 67.31%, and 67.4% on CUHK-PEDES, ICFG-PEDES, and RSTPReid, respectively.
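
The contrast with random masking can be sketched in a few lines: rank the tokens by an aggregated attention score and mask the highest-scoring ones. This is a simplified illustration; the deterministic top-k rule, the function name, and the example scores are assumptions, and the paper's AGM may select words differently (for instance, by sampling).

```python
def attention_guided_mask(tokens, attention, n_mask, mask_token="[MASK]"):
    """Mask the n_mask tokens with the highest aggregated attention weight."""
    ranked = sorted(range(len(tokens)), key=lambda i: attention[i], reverse=True)
    chosen = set(ranked[:n_mask])
    return [mask_token if i in chosen else tok for i, tok in enumerate(tokens)]

tokens = ["a", "woman", "with", "the", "red", "backpack"]
attention = [0.05, 0.30, 0.02, 0.03, 0.25, 0.35]   # made-up scores for illustration
masked = attention_guided_mask(tokens, attention, n_mask=2)
```

Function words such as "with" and "the" receive low weight and stay visible, so the masked positions are the ones the model can only recover by looking at the image.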

[CV-22] A Full Transformer-based Framework for Automatic Pain Estimation using Videos

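[Quick read] This paper addresses automatic pain estimation, with the aim of supporting a pain management system that gives reliable assessments and reduces patient suffering. The key is a fully transformer-based framework that combines a Transformer in Transformer (TNT) model with a Transformer built from cross-attention and self-attention blocks. On videos from the BioVid database, the authors report state-of-the-art performance, showing efficacy, efficiency, and generalization across all the primary pain estimation tasks.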

Link: https://arxiv.org/abs/2412.15095
Authors: Stefanos Gkikas, Manolis Tsiknakis
Affiliations: unknown
Keywords: management system offering, system offering reliable, offering reliable assessment, optimal pain management, pain management system
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:The automatic estimation of pain is essential in designing an optimal pain management system offering reliable assessment and reducing the suffering of patients. In this study, we present a novel full transformer-based framework consisting of a Transformer in Transformer (TNT) model and a Transformer leveraging cross-attention and self-attention blocks. Elaborating on videos from the BioVid database, we demonstrate state-of-the-art performances, showing the efficacy, efficiency, and generalization capability across all the primary pain estimation tasks.

[CV-23] MultiverSeg: Scalable Interactive Segmentation of Biomedical Imaging Datasets with In-Context Guidance

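[Quick read] This paper addresses how medical researchers and clinicians can efficiently carry out a new segmentation task on a new dataset. Existing methods are either interactive, requiring substantial human effort for every image, or depend on an existing set of manually labeled images. The proposed system, MultiverSeg, takes user interactions (clicks, bounding boxes, or scribbles) together with a growing context set of already-segmented images as input. As the context set grows, fewer interactions are needed for each new image. Compared with a state-of-the-art interactive segmentation method, MultiverSeg reduces the total number of scribble steps by 53% and clicks by 36% needed to reach 90% Dice on sets of images from unseen tasks.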

Link: https://arxiv.org/abs/2412.15058
Authors: Hallee E. Wong, Jose Javier Gonzalez Ortiz, John Guttag, Adrian V. Dalca
Affiliations: MIT CSAIL & MGH; Databricks; MIT CSAIL; MIT CSAIL & MGH, HMS
Keywords: Medical researchers, researchers and clinicians, Medical, images, set of related
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Image and Video Processing (eess.IV)
Comments: Project Website: this https URL. Keywords: interactive segmentation, in-context learning, medical image analysis, biomedical imaging, image annotation, visual prompting

Abstract:Medical researchers and clinicians often need to perform novel segmentation tasks on a set of related images. Existing methods for segmenting a new dataset are either interactive, requiring substantial human effort for each image, or require an existing set of manually labeled images. We introduce a system, MultiverSeg, that enables practitioners to rapidly segment an entire new dataset without requiring access to any existing labeled data from that task or domain. Along with the image to segment, the model takes user interactions such as clicks, bounding boxes or scribbles as input, and predicts a segmentation. As the user segments more images, those images and segmentations become additional inputs to the model, providing context. As the context set of labeled images grows, the number of interactions required to segment each new image decreases. We demonstrate that MultiverSeg enables users to interactively segment new datasets efficiently, by amortizing the number of interactions per image to achieve an accurate segmentation. Compared to using a state-of-the-art interactive segmentation method, using MultiverSeg reduced the total number of scribble steps by 53% and clicks by 36% to achieve 90% Dice on sets of images from unseen tasks. We release code and model weights at this https URL
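
The 90% Dice target quoted above refers to the Dice coefficient, which measures the overlap between a predicted mask and the ground truth. A minimal sketch on flattened binary masks follows; the function name and the convention of returning 1.0 for two empty masks are assumptions.

```python
def dice(pred, target):
    """Dice coefficient of two flat binary masks (sequences of 0/1)."""
    intersection = sum(p and t for p, t in zip(pred, target))
    total = sum(pred) + sum(target)
    return 1.0 if total == 0 else 2.0 * intersection / total

# One overlapping pixel, three foreground pixels in total: 2 * 1 / 3.
score = dice([1, 1, 0, 0], [1, 0, 0, 0])
```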

[CV-24] GIRAFE: Glottal Imaging Dataset for Advanced Segmentation Analysis and Facilitative Playbacks Evaluation

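[Quick read] This paper addresses the lack of publicly available datasets annotated with semantic segmentations of the glottal gap in high-speed videoendoscopic sequences, a lack that limits reproducibility and further exploration in the field. The key is GIRAFE, a data repository of 65 high-speed videoendoscopic recordings from 50 patients (30 female, 20 male), with expert manual annotations that include semantic segmentation masks of the glottal gap. The repository also provides automatic segmentations of the glottal area produced by several state-of-the-art methods, to support new glottal gap segmentation algorithms and thereby improve or create new Facilitative Playbacks. Accurate, fully automatic semantic segmentation of the glottal area remains an open challenge.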

Link: https://arxiv.org/abs/2412.15054
Authors: G. Andrade-Miranda, K. Chatzipapas, J.D. Arias-Londoño, J. I. Godino-Llorente
Affiliations: Laboratoire de Traitement de l’Information Médicale (LaTIM), UMR 1101, INSERM; University of Brest; 3dmi Research Group, Department of Medical Physics, School of Medicine, University of Patras; Department of Signals, Systems, and Radiocommunications, Escuela Técnica Superior de Ingenieros de Telecomunicación, Universidad Politécnica de Madrid
Keywords: Facilitative Playbacks extracted, High-Speed videoendoscopic sequences, High-Speed videoendoscopic, semantic segmentation, notable lack
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Comments: 18 pages, 8 figures

Abstract:The advances in the development of Facilitative Playbacks extracted from High-Speed videoendoscopic sequences of the vocal folds are hindered by a notable lack of publicly available datasets annotated with the semantic segmentations corresponding to the area of the glottal gap. This fact also limits the reproducibility and further exploration of existing research in this field. To address this gap, GIRAFE is a data repository designed to facilitate the development of advanced techniques for the semantic segmentation, analysis, and fast evaluation of High-Speed videoendoscopic sequences of the vocal folds. The repository includes 65 high-speed videoendoscopic recordings from a cohort of 50 patients (30 female, 20 male). The dataset comprises 15 recordings from healthy controls, 26 from patients with diagnosed voice disorders, and 24 with an unknown health condition. All of them were manually annotated by an expert, including the masks corresponding to the semantic segmentation of the glottal gap. The repository is also complemented with the automatic segmentation of the glottal area using different state-of-the-art approaches. This data set has already supported several studies, which demonstrates its usefulness for the development of new glottal gap segmentation algorithms from High-Speed-Videoendoscopic sequences to improve or create new Facilitative Playbacks. Despite these advances and others in the field, the broader challenge of performing an accurate and completely automatic semantic segmentation method of the glottal area remains open.

[CV-25] Uni-Renderer: Unifying Rendering and Inverse Rendering Via Dual Stream Diffusion

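[Quick read] This paper addresses core difficulties in rendering and inverse rendering: existing rendering methods only approximate the ideal conditional distribution transfer function for a specific scene and do so at high computational cost, while the inverse transfer is intractable because of inherent ambiguity. The key is a data-driven method that jointly models rendering and inverse rendering as two conditional generation tasks within a single diffusion framework. Two distinct time schedules and a tailored dual streaming module provide cross-conditioning between two pre-trained diffusion models. The unified approach, Uni-Renderer, uses a cycle-consistency constraint to reduce ambiguity by enforcing consistency between intrinsic properties and rendered images. Combined with a carefully prepared dataset, the method decomposes intrinsic properties effectively and recognizes changes during rendering.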

Link: https://arxiv.org/abs/2412.15050
Authors: Zhifei Chen, Tianshuo Xu, Wenhang Ge, Leyi Wu, Dongyu Yan, Jing He, Luozhou Wang, Lu Zeng, Shunsi Zhang, Yingcong Chen
Affiliations: HKUST(GZ); HKUST; Quwan
Keywords: vision and graphics, computer vision, Rendering, conditional distribution transfer, inverse rendering
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Rendering and inverse rendering are pivotal tasks in both computer vision and graphics. The rendering equation is the core of the two tasks, as an ideal conditional distribution transfer function from intrinsic properties to RGB images. Despite achieving promising results of existing rendering methods, they merely approximate the ideal estimation for a specific scene and come with a high computational cost. Additionally, the inverse conditional distribution transfer is intractable due to the inherent ambiguity. To address these challenges, we propose a data-driven method that jointly models rendering and inverse rendering as two conditional generation tasks within a single diffusion framework. Inspired by UniDiffuser, we utilize two distinct time schedules to model both tasks, and with a tailored dual streaming module, we achieve cross-conditioning of two pre-trained diffusion models. This unified approach, named Uni-Renderer, allows the two processes to facilitate each other through a cycle-consistent constrain, mitigating ambiguity by enforcing consistency between intrinsic properties and rendered images. Combined with a meticulously prepared dataset, our method effectively decomposition of intrinsic properties and demonstrates a strong capability to recognize changes during rendering. We will open-source our training and inference code to the public, fostering further research and development in this area.

[CV-26] DCTdiff: Intriguing Properties of Image Generative Modeling in the DCT Space

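[Quick read] This paper addresses efficient image modeling in the frequency domain and proposes DCTdiff, an end-to-end diffusion generative paradigm that operates in the discrete cosine transform (DCT) space. Modeling images in DCT space lets DCTdiff outperform pixel-based diffusion models in generative quality and training efficiency, and it scales to high-resolution generation without the latent diffusion paradigm. The paper also gives a theoretical proof that 'image diffusion can be seen as spectral autoregression', connecting diffusion and autoregressive models and pointing to the potential of frequency-space image modeling.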

Link: https://arxiv.org/abs/2412.15032
Authors: Mang Ning, Mingxiao Li, Jianlin Su, Haozhe Jia, Lanmiao Liu, Martin Beneš, Albert Ali Salah, Itir Onal Ertugrul
Affiliations: unknown
Keywords: discrete cosine transform, paper explores image, cosine transform, paper explores, discrete cosine
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Image and Video Processing (eess.IV)
Comments: 23 pages

Abstract:This paper explores image modeling from the frequency space and introduces DCTdiff, an end-to-end diffusion generative paradigm that efficiently models images in the discrete cosine transform (DCT) space. We investigate the design space of DCTdiff and reveal the key design factors. Experiments on different frameworks (UViT, DiT), generation tasks, and various diffusion samplers demonstrate that DCTdiff outperforms pixel-based diffusion models regarding generative quality and training efficiency. Remarkably, DCTdiff can seamlessly scale up to high-resolution generation without using the latent diffusion paradigm. Finally, we illustrate several intriguing properties of DCT image modeling. For example, we provide a theoretical proof of why 'image diffusion can be seen as spectral autoregression', bridging the gap between diffusion and autoregressive models. The effectiveness of DCTdiff and the introduced properties suggest a promising direction for image modeling in the frequency space. The code is at this https URL.
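
For readers unfamiliar with the transform itself, here is a minimal pure-Python sketch of the orthonormal 1-D DCT-II and its inverse. It illustrates only the DCT space the model works in, not the DCTdiff pipeline; images would use the 2-D version applied along rows and columns, and the function names are assumptions.

```python
import math

def dct(x):
    """Orthonormal DCT-II of a 1-D signal."""
    n = len(x)
    out = []
    for k in range(n):
        scale = math.sqrt((1.0 if k == 0 else 2.0) / n)
        out.append(scale * sum(x[i] * math.cos(math.pi * (i + 0.5) * k / n) for i in range(n)))
    return out

def idct(coeffs):
    """Inverse of dct (orthonormal DCT-III)."""
    n = len(coeffs)
    return [
        sum(
            math.sqrt((1.0 if k == 0 else 2.0) / n) * coeffs[k]
            * math.cos(math.pi * (i + 0.5) * k / n)
            for k in range(n)
        )
        for i in range(n)
    ]

# A constant signal puts all of its energy into the lowest-frequency coefficient.
flat = dct([1.0, 1.0, 1.0, 1.0])
signal = [0.3, -1.2, 2.5, 0.7]
restored = idct(dct(signal))
```

Energy compaction of this kind, where smooth content concentrates in a few low-frequency coefficients, is the usual motivation for modeling images in DCT space.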

[CV-27] Stable-V2A: Synthesis of Synchronized Sound Effects with Temporal and Semantic Controls

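[Quick read]: This paper aims to relieve sound designers and Foley artists of the repetitive work of manually annotating and sonorizing each action in scenes from films or video games. The key to the solution is Stable-V2A, a two-stage model consisting of RMS-Mapper and Stable-Foley. RMS-Mapper estimates an envelope representing the audio characteristics associated with the input video, while Stable-Foley is a diffusion model based on Stable Audio Open that generates audio semantically and temporally aligned with the target video. Temporal alignment comes from feeding the envelope in as a ControlNet input; semantic alignment comes from using sound representations chosen by the designer as cross-attention conditioning for the diffusion process. The model is trained and tested on the widely used Greatest Hits dataset, with a case study on the Walking The Maps dataset.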

Link: https://arxiv.org/abs/2412.15023
Authors: Riccardo Fosco Gramaccioni, Christian Marinoni, Emilian Postolache, Marco Comunità, Luca Cosmo, Joshua D. Reiss, Danilo Comminiello
Affiliations: Dept. Information Engineering, Electronics and Telecommunications (DIET), Sapienza University of Rome, Italy; Centre for Digital Music, Queen Mary University of London, UK; Dept. of Environmental Sciences, Informatics and Statistics, Ca’ Foscari University of Venice, Italy
Keywords: Foley artists, Sound designers, sonorize a scene, Stable Audio Open, artists usually sonorize
Subjects: Sound (cs.SD); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Multimedia (cs.MM); Audio and Speech Processing (eess.AS)
Comments:

Abstract:Sound designers and Foley artists usually sonorize a scene, such as from a movie or video game, by manually annotating and sonorizing each action of interest in the video. In our case, the intent is to leave full creative control to sound designers with a tool that allows them to bypass the more repetitive parts of their work, thus being able to focus on the creative aspects of sound production. We achieve this by presenting Stable-V2A, a two-stage model consisting of: an RMS-Mapper that estimates an envelope representative of the audio characteristics associated with the input video; and Stable-Foley, a diffusion model based on Stable Audio Open that generates audio semantically and temporally aligned with the target video. Temporal alignment is guaranteed by the use of the envelope as a ControlNet input, while semantic alignment is achieved through the use of sound representations chosen by the designer as cross-attention conditioning of the diffusion process. We train and test our model on Greatest Hits, a dataset commonly used to evaluate V2A models. In addition, to test our model on a case study of interest, we introduce Walking The Maps, a dataset of videos extracted from video games depicting animated characters walking in different locations. Samples and code are available on our demo page at this https URL.
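
The envelope that RMS-Mapper predicts is, as the name suggests, an RMS curve over time. The sketch below shows how such an envelope can be computed from an audio signal; it is not the paper's model, which estimates the envelope from video. The frame size, hop size, sample rate, and test signal are all assumptions chosen for illustration.

```python
import math

def rms_envelope(samples, frame_size, hop_size):
    """Root-mean-square level of each frame of an audio signal."""
    envelope = []
    for start in range(0, len(samples) - frame_size + 1, hop_size):
        frame = samples[start:start + frame_size]
        envelope.append(math.sqrt(sum(s * s for s in frame) / frame_size))
    return envelope

# Hypothetical signal: silence, a 50 Hz burst, silence (sample rate 1000 Hz).
sample_rate = 1000
burst = [math.sin(2 * math.pi * 50 * t / sample_rate) for t in range(200)]
signal = [0.0] * 200 + burst + [0.0] * 200

envelope = rms_envelope(signal, frame_size=100, hop_size=100)
print([round(v, 3) for v in envelope])
```

The envelope is zero during silence and rises during the burst, which is the kind of timing cue that can then condition a generator on when a sound should occur.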

[CV-28] Robust Federated Learning in the Face of Covariate Shift: A Magnitude Pruning with Hybrid Regularization Framework for Enhanced Model Aggregation

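[Quick read]: This paper addresses unstable model aggregation in Federated Learning (FL) caused by inconsistent data distributions across clients. The key to the solution is a new FL framework that combines individual parameter pruning with regularization: magnitude-based pruning plus added dropout and noise injection layers make each client model more robust, so that robust feature representations can be extracted even when client data distributions differ widely. The method is validated on several benchmark datasets (CIFAR10, MNIST, SVHN, and Fashion MNIST) and on the newly introduced CelebA-Gender dataset.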

Link: https://arxiv.org/abs/2412.15010
Authors: Ozgu Goksu, Nicolas Pugeault
Affiliations: University of Glasgow
Keywords: concerns remain challenging, highly sophisticated neural, security concerns remain, sophisticated neural networks, client data distributions
Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:The development of highly sophisticated neural networks has allowed for fast progress in every field of computer vision. However, applications where annotated data is prohibited due to privacy or security concerns remain challenging. Federated Learning (FL) offers a promising framework for individuals aiming to collaboratively develop a shared model while preserving data privacy. Nevertheless, our findings reveal that variations in data distribution among clients can profoundly affect FL methodologies, primarily due to instabilities in the aggregation process. We also propose a novel FL framework to mitigate the adverse effects of covariate shifts among federated clients by combining individual parameter pruning and regularization techniques to improve the robustness of individual clients’ models to aggregate. Each client’s model is optimized through magnitude-based pruning and the addition of dropout and noise injection layers to build more resilient decision pathways in the networks and improve the robustness of the model’s parameter aggregation step. The proposed framework is capable of extracting robust representations even in the presence of very large covariate shifts among client data distributions and in the federation of a small number of clients. Empirical findings substantiate the effectiveness of our proposed methodology across common benchmark datasets, including CIFAR10, MNIST, SVHN, and Fashion MNIST. Furthermore, we introduce the CelebA-Gender dataset, specifically designed to evaluate performance on a more realistic domain. The proposed method is capable of extracting robust representations even in the presence of both high and low covariate shifts among client data distributions.
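
As a rough illustration of two ingredients named in the abstract, the sketch below applies magnitude-based pruning to each client's weights and then aggregates them. The aggregation rule shown is plain sample-weighted averaging (FedAvg), which is an assumption: the abstract does not state the paper's aggregation rule, and the dropout and noise injection layers are omitted. Weight values, client sizes, and the sparsity level are made up.

```python
def magnitude_prune(weights, sparsity):
    """Zero out the fraction `sparsity` of weights with the smallest magnitude."""
    k = int(len(weights) * sparsity)
    if k == 0:
        return list(weights)
    # Indices of the k smallest-magnitude weights.
    by_magnitude = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    drop = set(by_magnitude[:k])
    return [0.0 if i in drop else w for i, w in enumerate(weights)]

def federated_average(client_weights, client_sizes):
    """Average client weight vectors, weighted by local dataset size."""
    total = sum(client_sizes)
    dim = len(client_weights[0])
    return [
        sum(w[j] * size for w, size in zip(client_weights, client_sizes)) / total
        for j in range(dim)
    ]

# Hypothetical flattened weights from two clients.
clients = [[0.9, -0.05, 0.4, 0.01], [1.1, 0.02, -0.6, -0.03]]
pruned = [magnitude_prune(w, sparsity=0.5) for w in clients]
global_weights = federated_average(pruned, client_sizes=[100, 300])
print(pruned)
print(global_weights)
```

Pruning before aggregation removes small weights that can differ arbitrarily between clients, so the average is dominated by the larger weights that survive pruning on each client.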

[CV-29] Stitch, Contrast, and Segment: Learning a Human Action Segmentation Model Using Trimmed Skeleton Videos AAAI2025

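[Quick read]: This paper addresses a limitation of skeleton-based action classification models when handling untrimmed videos, which usually contain several consecutive actions, whereas existing models are trained and tested mainly on trimmed, action-specific videos. The key to the solution is a new framework that adapts training on short trimmed videos to long untrimmed videos in three steps: Stitch, Contrast, and Segment. First, Stitch uses a temporal skeleton stitching scheme that treats trimmed videos as elementary human motions and generates multi-action stitched sequences. Second, Contrast uses a contrastive learning task so that the skeleton encoder learns meaningful action-temporal contexts, improving action segmentation. Finally, Segment performs action segmentation by learning a segmentation layer while handling particular data availability. Experiments on a trimmed source dataset and an untrimmed target dataset evaluate the effectiveness of the method.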

Link: https://arxiv.org/abs/2412.14988
Authors: Haitao Tian, Pierre Payeur
Affiliations: Unknown
Keywords: Existing skeleton-based human, videos exhibiting concatenated, well-trimmed action-specific skeleton, exhibiting concatenated actions, Existing skeleton-based
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: Accepted as AAAI 2025

Abstract:Existing skeleton-based human action classification models rely on well-trimmed action-specific skeleton videos for both training and testing, precluding their scalability to real-world applications where untrimmed videos exhibiting concatenated actions are predominant. To overcome this limitation, recently introduced skeleton action segmentation models involve untrimmed skeleton videos into end-to-end training. The model is optimized to provide frame-wise predictions for any length of testing videos, simultaneously realizing action localization and classification. Yet, achieving such an improvement imposes frame-wise annotated skeleton videos, which remains time-consuming in practice. This paper features a novel framework for skeleton-based action segmentation trained on short trimmed skeleton videos, but that can run on longer untrimmed videos. The approach is implemented in three steps: Stitch, Contrast, and Segment. First, Stitch proposes a temporal skeleton stitching scheme that treats trimmed skeleton videos as elementary human motions that compose a semantic space and can be sampled to generate multi-action stitched sequences. Contrast learns contrastive representations from stitched sequences with a novel discrimination pretext task that enables a skeleton encoder to learn meaningful action-temporal contexts to improve action segmentation. Finally, Segment relates the proposed method to action segmentation by learning a segmentation layer while handling particular data availability. Experiments involve a trimmed source dataset and an untrimmed target dataset in an adaptation formulation for real-world skeleton-based human action segmentation to evaluate the effectiveness of the proposed method.
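
The Stitch step can be pictured as sampling trimmed clips and concatenating them, which yields frame-wise labels for free. The sketch below is an assumed, simplified reading of that idea: uniform random sampling, plain concatenation with no transition smoothing, and toy two-value "frames" standing in for skeleton poses. None of these details come from the abstract.

```python
import random

def stitch(clips, num_actions, rng):
    """Concatenate randomly sampled trimmed clips into one multi-action sequence.

    clips: list of (label, frames) pairs, each a trimmed single-action clip.
    Returns the stitched frames and one label per frame.
    """
    frames, labels = [], []
    for label, clip_frames in rng.sample(clips, num_actions):
        frames.extend(clip_frames)
        labels.extend([label] * len(clip_frames))
    return frames, labels

# Hypothetical trimmed clips; each frame is a toy stand-in for a skeleton pose.
clips = [
    ("wave", [[0.0, 1.0]] * 3),
    ("sit", [[1.0, 0.0]] * 5),
    ("jump", [[0.5, 0.5]] * 2),
]
rng = random.Random(0)
frames, labels = stitch(clips, num_actions=2, rng=rng)
print(labels)
```

Each stitched sequence comes with exact action boundaries, which is what makes frame-wise supervision possible without manually annotating untrimmed videos.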

[CV-30] Arti-PG: A Toolbox for Procedurally Synthesizing Large-Scale and Diverse Articulated Objects with Rich Annotations

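[Quick read]: This paper addresses the scarcity of 3D articulated object data: for deep learning methods, acquiring large volumes of 3D articulated objects with detailed annotations is expensive and time-consuming. The key to the solution is the Articulated Object Procedural Generation toolbox (Arti-PG toolbox), which procedurally generates large numbers of diverse 3D articulated objects together with comprehensive, detailed annotations. The toolbox consists of: 1) descriptions of articulated objects by a generalized structure program, with an analytic correspondence to the objects' point clouds; 2) procedural rules for manipulating the structure program to synthesize large-scale, diverse articulated objects; and 3) mathematical descriptions of knowledge (e.g., affordance and semantics) used to produce annotations. Its two main advantages are unlimited shape variation and broad applicability; it supports 26 categories of articulated objects and a wide range of vision and manipulation tasks.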

Link: https://arxiv.org/abs/2412.14974
Authors: Jianhua Sun, Yuxuan Li, Jiude Wei, Longfei Xu, Nange Wang, Yining Zhang, Cewu Lu
Affiliations: Shanghai Jiao Tong University
Keywords: deep learning methods, achieve remarkable performance, articulated object data, articulated object, articulated object understanding
Subjects: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments:

Abstract:The acquisition of substantial volumes of 3D articulated object data is expensive and time-consuming, and consequently the scarcity of 3D articulated object data becomes an obstacle for deep learning methods to achieve remarkable performance in various articulated object understanding tasks. Meanwhile, pairing these object data with detailed annotations to enable training for various tasks is also difficult and labor-intensive to achieve. In order to expeditiously gather a significant number of 3D articulated objects with comprehensive and detailed annotations for training, we propose Articulated Object Procedural Generation toolbox, a.k.a. Arti-PG toolbox. Arti-PG toolbox consists of i) descriptions of articulated objects by means of a generalized structure program along with their analytic correspondence to the objects’ point cloud, ii) procedural rules about manipulations on the structure program to synthesize large-scale and diverse new articulated objects, and iii) mathematical descriptions of knowledge (e.g. affordance, semantics, etc.) to provide annotations to the synthesized object. Arti-PG has two appealing properties for providing training data for articulated object understanding tasks: i) objects are created with unlimited variations in shape through program-oriented structure manipulation, ii) Arti-PG is widely applicable to diverse tasks by easily providing comprehensive and detailed annotations. Arti-PG now supports the procedural generation of 26 categories of articulated objects and provides annotations across a wide range of both vision and manipulation tasks, and we provide exhaustive experiments which fully demonstrate its advantages. We will make Arti-PG toolbox publicly available for the community to use.

[CV-31] PhotoHolmes: a Python library for forgery detection in digital images

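[Quick read]: This paper addresses the usability and comparability of forgery detection methods for digital images. The key to the solution is PhotoHolmes, an open-source Python library that bundles popular and state-of-the-art forgery detection methods, dataset integration tools, and evaluation metrics. With the library's Benchmark tool, users can easily benchmark and compare methods, enabling accurate, reproducible comparisons between their own methods and those in the literature. PhotoHolmes also provides a command-line interface (CLI) that makes it simple to run the implemented methods on any suspicious image. The library is designed to be extensible and modular, so new methods, datasets, and metrics are easy to add.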

Link: https://arxiv.org/abs/2412.14969
Authors: Julián O’Flaherty, Rodrigo Paganini, Juan Pablo Sotelo, Julieta Umpiérrez, Marina Gardella, Matías Tailanian, Pablo Musé
Affiliations: Universidad de la República; Instituto Nacional de Matemática Pura e Aplicada; Digital Sense
Keywords: Python library designed, open-source Python library, open-source Python, Python library, benchmark forgery detection
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:In this paper, we introduce PhotoHolmes, an open-source Python library designed to easily run and benchmark forgery detection methods on digital images. The library includes implementations of popular and state-of-the-art methods, dataset integration tools, and evaluation metrics. Utilizing the Benchmark tool in PhotoHolmes, users can effortlessly compare various methods. This facilitates an accurate and reproducible comparison between their own methods and those in the existing literature. Furthermore, PhotoHolmes includes a command-line interface (CLI) to easily run the methods implemented in the library on any suspicious image. As such, image forgery methods become more accessible to the community. The library has been built with extensibility and modularity in mind, which makes adding new methods, datasets and metrics to the library a straightforward process. The source code is available at this https URL.

[CV-32] IDOL: Instant Photorealistic 3D Human Creation from a Single Image

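[Quick read]: This paper addresses fast, high-quality creation of an animatable 3D full-body avatar from a single image. The key to the solution is rethinking the task from the perspectives of dataset, model, and representation. First, it introduces HuGe100K, a large-scale human-centric generated dataset of 100K diverse, photorealistic sets of human images, each containing 24 multi-view frames in a specific pose. Second, leveraging the diversity of views, poses, and appearances in HuGe100K, it develops a scalable feed-forward transformer that predicts a 3D human Gaussian representation in a uniform space from a given human image; the model is trained to disentangle pose, body shape, clothing geometry, and texture. Finally, experiments validate the dataset and method: the model reconstructs photorealistic 3D humans at 1K resolution instantly on a single GPU and supports various applications as well as shape and texture editing.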

Link: https://arxiv.org/abs/2412.14963
Authors: Yiyu Zhuang, Jiaxi Lv, Hao Wen, Qing Shuai, Ailing Zeng, Hao Zhu, Shifeng Chen, Yujiu Yang, Xun Cao, Wei Liu
Affiliations: Nanjing University; Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences; Tsinghua University; Tencent; Shenzhen University of Advanced Technology
Keywords: Creating a high-fidelity, high-quality training data, challenging task due, full-body avatar, training data
Subjects: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR); Machine Learning (cs.LG)
Comments: 21 pages, 15 figures, includes main content, supplementary materials, and references

Abstract:Creating a high-fidelity, animatable 3D full-body avatar from a single image is a challenging task due to the diverse appearance and poses of humans and the limited availability of high-quality training data. To achieve fast and high-quality human reconstruction, this work rethinks the task from the perspectives of dataset, model, and representation. First, we introduce a large-scale HUman-centric GEnerated dataset, HuGe100K, consisting of 100K diverse, photorealistic sets of human images. Each set contains 24-view frames in specific human poses, generated using a pose-controllable image-to-multi-view model. Next, leveraging the diversity in views, poses, and appearances within HuGe100K, we develop a scalable feed-forward transformer model to predict a 3D human Gaussian representation in a uniform space from a given human image. This model is trained to disentangle human pose, body shape, clothing geometry, and texture. The estimated Gaussians can be animated without post-processing. We conduct comprehensive experiments to validate the effectiveness of the proposed dataset and method. Our model demonstrates the ability to efficiently reconstruct photorealistic humans at 1K resolution from a single input image using a single GPU instantly. Additionally, it seamlessly supports various applications, as well as shape and texture editing tasks.

[CV-33] TDCNet: Transparent Objects Depth Completion with CNN-Transformer Dual-Branch Parallel Network

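[Quick read]: This paper addresses the sensing and manipulation of transparent objects in industrial and laboratory robotics, where refraction and reflection of light on object surfaces and the lack of visible texture prevent conventional sensors from capturing complete depth. The key to the solution is TDCNet, a novel dual-branch CNN-Transformer parallel network for transparent object depth completion. The framework has two branches: one extracts features from partial depth maps, and the other processes RGB-D images. This dual-branch structure lets TDCNet make fuller use of the original depth map, improving depth completion accuracy and achieving state-of-the-art performance on multiple public datasets.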

Link: https://arxiv.org/abs/2412.14961
Authors: Xianghui Fan, Chao Ye, Anping Deng, Xiaotian Wu, Mengyang Pan, Hang Yang
Affiliations: Changchun Institute of Optics, Fine Mechanics and Physics, Chinese Academy of Sciences; University of Chinese Academy of Sciences; Research Institute of Intelligent Control and Systems, Harbin Institute of Technology; National Key Laboratory of Complex System Control and Intelligent Agent Cooperation, Harbin Institute of Technology; College of Physics, Northeast Normal University
Keywords: transparent objects present, transparent objects, laboratory robotics, sensing and manipulation, present a critical
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:The sensing and manipulation of transparent objects present a critical challenge in industrial and laboratory robotics. Conventional sensors face challenges in obtaining the full depth of transparent objects due to the refraction and reflection of light on their surfaces and their lack of visible texture. Previous research has attempted to obtain complete depth maps of transparent objects from RGB and damaged depth maps (collected by depth sensor) using deep learning models. However, existing methods fail to fully utilize the original depth map, resulting in limited accuracy for depth completion. To solve this problem, we propose TDCNet, a novel dual-branch CNN-Transformer parallel network for transparent object depth completion. The proposed framework consists of two different branches: one extracts features from partial depth maps, while the other processes RGB-D images. Experimental results demonstrate that our model achieves state-of-the-art performance across multiple public datasets. Our code and the pre-trained model are publicly available at this https URL.

[CV-34] Dream to Manipulate: Compositional World Models Empowering Robot Imitation Learning with Imagination

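[Quick read]: This paper addresses the inability of existing world models to directly and explicitly imitate the real environment in robotics, which leads to unrealistic behaviors and hallucinations. The key to the solution is a new paradigm for constructing world models, called DreMa, which combines real-time photorealism via Gaussian Splatting with physics simulators to form an explicit representation of the real world and its dynamics. DreMa replicates the observed world and its dynamics, allowing a robot to imagine novel object configurations and predict the consequences of future actions. This markedly improves the accuracy and robustness of imitation learning, reduces the data needed to learn a policy, and improves the agents' generalization.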

Link: https://arxiv.org/abs/2412.14957
Authors: Leonardo Barcellona, Andrii Zadaianchuk, Davide Allegro, Samuele Papa, Stefano Ghidoni, Efstratios Gavves
Affiliations: University of Padova; Politecnico of Torino; University of Amsterdam
Keywords: Current world models, world model, world models typically, world, constructing world models
Subjects: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:A world model provides an agent with a representation of its environment, enabling it to predict the causal consequences of its actions. Current world models typically cannot directly and explicitly imitate the actual environment in front of a robot, often resulting in unrealistic behaviors and hallucinations that make them unsuitable for real-world applications. In this paper, we introduce a new paradigm for constructing world models that are explicit representations of the real world and its dynamics. By integrating cutting-edge advances in real-time photorealism with Gaussian Splatting and physics simulators, we propose the first compositional manipulation world model, which we call DreMa. DreMa replicates the observed world and its dynamics, allowing it to imagine novel configurations of objects and predict the future consequences of robot actions. We leverage this capability to generate new data for imitation learning by applying equivariant transformations to a small set of demonstrations. Our evaluations across various settings demonstrate significant improvements in both accuracy and robustness by incrementing actions and object distributions, reducing the data needed to learn a policy and improving the generalization of the agents. As a highlight, we show that a real Franka Emika Panda robot, powered by DreMa’s imagination, can successfully learn novel physical tasks from just a single example per task variation (one-shot policy learning). Our project page and source code can be found in this https URL

[CV-35] Corn Ear Detection and Orientation Estimation Using Deep Learning

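[Quick read]: This paper addresses the time-consuming, error-prone manual measurement of ear angles on maize plants. The key to the solution is a computer vision-based system that automatically detects, tracks, and predicts the orientation of corn ears. Using an object detector with keypoint detection, the system detects 90% of all ears, and its orientation estimates have a mean absolute error (MAE) of 18 degrees, close to the mean 15-degree difference between two people measuring by hand. The approach saves considerable time and opens up further research on ear orientation and potential efficiency gains in maize production.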

Link: https://arxiv.org/abs/2412.14954
Authors: Nathan Sprague, John Evans, Michael Mardikes
Affiliations: Unknown
Keywords: give key insights, plant health, give key, key insights, health and development
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 22 pages; 15 figures

Abstract:Monitoring growth behavior of maize plants such as the development of ears can give key insights into the plant’s health and development. Traditionally, the measurement of the angle of ears is performed manually, which can be time-consuming and prone to human error. To address these challenges, this paper presents a computer vision-based system for detecting and tracking ears of corn in an image sequence. The proposed system could accurately detect, track, and predict the ear’s orientation, which can be useful in monitoring their growth behavior. This can significantly save time compared to manual measurement and enables additional areas of ear orientation research and potential increase in efficiencies for maize production. Using an object detector with keypoint detection, the algorithm proposed could detect 90 percent of all ears. The cardinal estimation had a mean absolute error (MAE) of 18 degrees, compared to a mean 15 degree difference between two people measuring by hand. These results demonstrate the feasibility of using computer vision techniques for monitoring maize growth and can lead to further research in this area.
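
For readers wondering how keypoints turn into an orientation and an MAE in degrees, here is a minimal sketch. It assumes each ear is described by two keypoints (a base and a tip), which the abstract does not specify, and it uses a wrap-around angular error so that 350 and 10 degrees count as 20 degrees apart. The keypoint layout and all names are hypothetical.

```python
import math

def orientation_deg(base, tip):
    """Angle of the base-to-tip vector in degrees, in [0, 360).

    Measured from the +x axis toward the +y axis; in image coordinates,
    where y grows downward, this appears clockwise on screen.
    """
    dx, dy = tip[0] - base[0], tip[1] - base[1]
    return math.degrees(math.atan2(dy, dx)) % 360.0

def angular_error(a, b):
    """Smallest absolute difference between two angles in degrees."""
    d = abs(a - b) % 360.0
    return min(d, 360.0 - d)

def mean_absolute_error(predicted, measured):
    """Mean wrap-around angular error over paired angle lists."""
    errors = [angular_error(p, m) for p, m in zip(predicted, measured)]
    return sum(errors) / len(errors)

# Hypothetical predictions versus hand measurements, in degrees.
predicted = [350.0, 90.0]
measured = [10.0, 100.0]
print(orientation_deg((0, 0), (1, 1)))
print(mean_absolute_error(predicted, measured))
```

Without the wrap-around step, the first pair would contribute an error of 340 degrees instead of 20, which would inflate the MAE badly.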

[CV-36] GURecon: Learning Detailed 3D Geometric Uncertainties for Neural Surface Reconstruction AAAI2025

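[Quick read]: This paper addresses the difficulty of assessing the geometric quality of neural surface reconstructions when no ground truth mesh is available. The key to the solution is GURecon, a new framework that establishes a geometric uncertainty field for the neural surface based on geometric consistency. Unlike existing methods that rely on rendering-based measurement, GURecon models a continuous 3D uncertainty field and learns it through online distillation, without real geometric information for supervision. To reduce the interference of illumination with geometric consistency, the paper also learns a decoupled field that is used to fine-tune the uncertainty field. Experiments show that GURecon is superior at modeling 3D geometric uncertainty, extends in a plug-and-play manner to various neural surface representations, and improves downstream tasks such as incremental reconstruction.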

Link: https://arxiv.org/abs/2412.14939
Authors: Zesong Yang, Ru Zhang, Jiale Shi, Zixiang Ai, Boming Zhao, Hujun Bao, Luwei Yang, Zhaopeng Cui
Affiliations: Zhejiang University; University of Science and Technology of China
Keywords: demonstrated remarkable success, Neural surface, demonstrated remarkable, remarkable success, view synthesis
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by AAAI 2025. Project page: this https URL

Abstract:Neural surface representation has demonstrated remarkable success in the areas of novel view synthesis and 3D reconstruction. However, assessing the geometric quality of 3D reconstructions in the absence of ground truth mesh remains a significant challenge, due to its rendering-based optimization process and entangled learning of appearance and geometry with photometric losses. In this paper, we present a novel framework, i.e, GURecon, which establishes a geometric uncertainty field for the neural surface based on geometric consistency. Different from existing methods that rely on rendering-based measurement, GURecon models a continuous 3D uncertainty field for the reconstructed surface, and is learned by an online distillation approach without introducing real geometric information for supervision. Moreover, in order to mitigate the interference of illumination on geometric consistency, a decoupled field is learned and exploited to finetune the uncertainty field. Experiments on various datasets demonstrate the superiority of GURecon in modeling 3D geometric uncertainty, as well as its plug-and-play extension to various neural surface representations and improvement on downstream tasks such as incremental reconstruction. The code and supplementary material are available on the project website: this https URL.

[CV-37] Automatic Spectral Calibration of Hyperspectral Images: Method, Dataset and Benchmark

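[Quick read]: This paper addresses automatic calibration of hyperspectral images (HSIs) under varying illumination conditions. Traditional calibration relies on a physical reference, which involves manual operations, causes occlusions, and limits camera mobility. The paper proposes a learning-based automatic calibration method. The key is a large-scale HSI calibration dataset of 765 high-quality HSI pairs, expanded to 7650 pairs by combining them with 10 different physically measured illuminations. The paper also proposes a spectral illumination transformer (SIT) with an illumination attention module; extensive benchmarks demonstrate its state-of-the-art performance and indicate that low-light conditions are more challenging than normal conditions.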

Link: https://arxiv.org/abs/2412.14925
Authors: Zhuoran Du, Shaodi You, Cheng Cheng, Shikui Wei
Affiliations: Beijing Jiaotong University; University of Amsterdam
Keywords: Hyperspectral image, RGB images, distinctive than RGB, densely samples, samples the world
Subjects: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
Comments:

Abstract:Hyperspectral image (HSI) densely samples the world in both the space and frequency domain and therefore is more distinctive than RGB images. Usually, HSI needs to be calibrated to minimize the impact of various illumination conditions. The traditional way to calibrate HSI utilizes a physical reference, which involves manual operations, occlusions, and/or limits camera mobility. These limitations inspire this paper to automatically calibrate HSIs using a learning-based method. Towards this goal, a large-scale HSI calibration dataset is created, which has 765 high-quality HSI pairs covering diversified natural scenes and illuminations. The dataset is further expanded to 7650 pairs by combining with 10 different physically measured illuminations. A spectral illumination transformer (SIT) together with an illumination attention module is proposed. Extensive benchmarks demonstrate the SoTA performance of the proposed SIT. The benchmarks also indicate that low-light conditions are more challenging than normal conditions. The dataset and codes are available online:this https URL
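
The sketch below illustrates the traditional reference-based calibration that the paper wants to automate, plus the arithmetic behind the dataset size. It assumes the simplest model, where the measured spectrum is reflectance times illumination per band with no dark-current term, and it assumes the expansion pairs every scene with every measured illuminant; the abstract implies but does not spell out either point. The three-band spectra are made up.

```python
def calibrate(pixel_spectrum, reference_spectrum, eps=1e-8):
    """Divide each band by the reference (illumination) spectrum."""
    return [p / max(r, eps) for p, r in zip(pixel_spectrum, reference_spectrum)]

def relight(reflectance, illumination):
    """Simulate an uncalibrated measurement under a given illumination."""
    return [r * l for r, l in zip(reflectance, illumination)]

# Hypothetical 3-band reflectance and two measured illumination spectra.
reflectance = [0.2, 0.5, 0.8]
illuminations = [[1.0, 0.9, 0.6], [0.4, 0.7, 1.0]]

# One (uncalibrated, calibrated) training pair per illumination.
pairs = [(relight(reflectance, ill), reflectance) for ill in illuminations]
recovered = calibrate(pairs[0][0], illuminations[0])

# 765 base pairs combined with 10 illuminations, as stated in the abstract.
num_pairs = 765 * 10
print(recovered, num_pairs)
```

A learning-based calibrator has to recover the calibrated spectrum without being given the illumination spectrum, which is what the physical reference normally supplies.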

[CV-38] MagicNaming: Consistent Identity Generation by Finding a “Name Space” in T2I Diffusion Models AAAI2025

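[Quick read]: This paper asks how large-scale text-to-image diffusion models (e.g., DALL-E, SDXL) can generate images of generic identities rather than only celebrities. The key to the solution is exploring and exploiting a "Name Space": identity representations are located in the feature space spanned by the text embeddings of celebrities' names. Specifically, the authors extract embeddings of celebrity names from the Laion5B dataset and use them as supervision to train an encoder that predicts a name embedding from a face image. Experiments show that these name embeddings keep identity consistent in generated images and, because they are disentangled from the semantics of the text input, preserve the original text-to-image generation capability. By simply plugging in these name embeddings, all variants derived from the same base model (e.g., SDXL) become identity-aware text-to-image models.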

Link: https://arxiv.org/abs/2412.14902
Authors: Jing Zhao, Heliang Zheng, Chaoyue Wang, Long Lan, Wanrong Hunag, Yuhua Tang
Affiliations: Unknown
Keywords: generating famous persons, capable of generating, generating famous, famous persons, Large-scale
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by AAAI 2025

Abstract:Large-scale text-to-image diffusion models (e.g., DALL-E, SDXL) are capable of generating famous persons by simply referring to their names. Is it possible to make such models generate generic identities as simply as the famous ones, e.g., by just using a name? In this paper, we explore the existence of a “Name Space”, where any point in the space corresponds to a specific identity. Fortunately, we find some clues in the feature space spanned by text embedding of celebrities’ names. Specifically, we first extract the embeddings of celebrities’ names in the Laion5B dataset with the text encoder of diffusion models. Such embeddings are used as supervision to learn an encoder that can predict the name (actually an embedding) of a given face image. We experimentally find that such name embeddings work well in giving the generated image good identity consistency. Note that like the names of celebrities, our predicted name embeddings are disentangled from the semantics of text inputs, making the original generation capability of text-to-image models well-preserved. Moreover, by simply plugging such name embeddings, all variants (e.g., from Civitai) derived from the same base model (i.e., SDXL) readily become identity-aware text-to-image models. Project homepage: this https URL.

[CV-39] Multimodal Hypothetical Summary for Retrieval-based Multi-image Question Answering AAAI2025

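[Quick read]: This paper addresses cascading errors in retrieval-based multi-image question answering: in the conventional "retrieve-then-answer" pipeline, the QA training objective does not optimize the retrieval stage, so retrieval errors propagate to the final answer. The key to the solution is a multimodal hypothetical summary (MHyS) that integrates retrieved information effectively and improves QA performance. Specifically, a multimodal large language model (visual perspective) and a large language model (textual perspective) generate hypothetical summaries in question form and description form, capturing image content more specifically, and contrastive learning aligns queries (questions) with MHyS. A coarse-to-fine strategy then computes sentence-level and word-level similarity scores to further improve retrieval and filter out irrelevant details. The method achieves a 3.7% absolute improvement over state-of-the-art methods on RETVQA and a 14.5% improvement over CLIP.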

Link: https://arxiv.org/abs/2412.14880
Authors: Peize Li, Qingyi Si, Peng Fu, Zheng Lin, Yan Wang
Affiliations: Institute of Information Engineering, Chinese Academy of Sciences; Chinese Academy of Sciences
Keywords: task involves retrieving, Retrieval-based multi-image question, involves retrieving multiple, retrieving multiple question-related, Retrieval-based multi-image
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: AAAI 2025

Abstract:Retrieval-based multi-image question answering (QA) task involves retrieving multiple question-related images and synthesizing these images to generate an answer. Conventional “retrieve-then-answer” pipelines often suffer from cascading errors because the training objective of QA fails to optimize the retrieval stage. To address this issue, we propose a novel method to effectively introduce and reference retrieved information into the QA. Given the image set to be retrieved, we employ a multimodal large language model (visual perspective) and a large language model (textual perspective) to obtain a multimodal hypothetical summary (MHyS) in question form and description form. By combining visual and textual perspectives, MHyS captures image content more specifically and replaces real images in retrieval, which eliminates the modality gap by transforming into text-to-text retrieval and helps improve retrieval. To more advantageously introduce retrieval with QA, we employ contrastive learning to align queries (questions) with MHyS. Moreover, we propose a coarse-to-fine strategy for calculating both sentence-level and word-level similarity scores, to further enhance retrieval and filter out irrelevant details. Our approach achieves a 3.7% absolute improvement over state-of-the-art methods on RETVQA and a 14.5% improvement over CLIP. Comprehensive experiments and detailed ablation studies demonstrate the superiority of our method.
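
One way to picture a coarse-to-fine strategy with sentence-level and word-level scores is sketched below. The abstract does not give the paper's formulas, so everything here is an assumption: the sentence score is cosine similarity between mean token embeddings, the word score averages each query token's best match among the summary tokens, and the embeddings are toy 3D vectors rather than real encoder outputs.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def mean_vector(vectors):
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]

def sentence_score(query_tokens, summary_tokens):
    """Coarse score: similarity of the pooled (mean) embeddings."""
    return cosine(mean_vector(query_tokens), mean_vector(summary_tokens))

def word_score(query_tokens, summary_tokens):
    """Fine score: each query token's best match in the summary, averaged."""
    best = [max(cosine(q, s) for s in summary_tokens) for q in query_tokens]
    return sum(best) / len(best)

def coarse_to_fine(query_tokens, summaries, keep):
    """Shortlist by sentence score, then re-rank the shortlist by word score."""
    shortlist = sorted(
        summaries,
        key=lambda name: sentence_score(query_tokens, summaries[name]),
        reverse=True,
    )[:keep]
    return sorted(
        shortlist,
        key=lambda name: word_score(query_tokens, summaries[name]),
        reverse=True,
    )

# Hypothetical token embeddings for a question and three image summaries.
query = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
summaries = {
    "img_a": [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]],
    "img_b": [[0.7, 0.7, 0.0]],
    "img_c": [[0.0, 0.0, 1.0]],
}
ranking = coarse_to_fine(query, summaries, keep=2)
print(ranking)
```

In this toy case the pooled embeddings cannot tell `img_a` from `img_b`, but the token-level score can, which is the motivation for adding a fine stage after the coarse one.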

[CV-40] Zero-Shot Artifact2Artifact: Self-incentive artifact removal for photoacoustic imaging without any data

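[Quick read]: This paper addresses reconstruction artifacts in 3D photoacoustic imaging (PAI) caused by sparse and angle-limited detector arrays. The key to the solution is Zero-Shot Artifact2Artifact (ZS-A2A), a zero-shot self-supervised artifact removal method built on a super-lightweight network. It exploits the fact that reconstruction artifacts are sensitive to irregularities caused by data loss: random perturbations of the acquired data generate subset data, which prompts the network to learn the artifact patterns and thereby remove artifacts zero-shot. The method needs neither training data nor prior knowledge of the artifacts, works with arbitrarily sparse or angle-limited detector arrays, and improves the contrast-to-noise ratio (CNR) of maximum amplitude projection (MAP) images and slice images.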

Link: https://arxiv.org/abs/2412.14873
Authors: Shuang Li, Qian Chen, Chulhong Kim, Seongwook Choi, Yibing Wang, Yu Zhang, Changhui Li
Affiliations: Peking University; Pohang University of Science and Technology (POSTECH)
Keywords: uniquely combines optical, combines optical contrast, Photoacoustic imaging, uniquely combines, depth of ultrasound
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Photoacoustic imaging (PAI) uniquely combines optical contrast with the penetration depth of ultrasound, making it critical for clinical applications. However, the quality of 3D PAI is often degraded due to reconstruction artifacts caused by the sparse and angle-limited configuration of detector arrays. Existing iterative or deep learning-based methods are either time-consuming or require large training datasets, significantly limiting their practical application. Here, we propose Zero-Shot Artifact2Artifact (ZS-A2A), a zero-shot self-supervised artifact removal method based on a super-lightweight network, which leverages the fact that reconstruction artifacts are sensitive to irregularities caused by data loss. By introducing random perturbations to the acquired PA data, it spontaneously generates subset data, which in turn stimulates the network to learn the artifact patterns in the reconstruction results, thus enabling zero-shot artifact removal. This approach requires neither training data nor prior knowledge of the artifacts, and is capable of artifact removal for 3D PAI. For maximum amplitude projection (MAP) images or slice images in 3D PAI acquired with arbitrarily sparse or angle-limited detector arrays, ZS-A2A employs a self-incentive strategy to complete artifact removal and improves the Contrast-to-Noise Ratio (CNR). We validated ZS-A2A in both simulation study and in vivo animal experiments. Results demonstrate that ZS-A2A achieves state-of-the-art (SOTA) performance compared to existing zero-shot methods, and for the in vivo rat liver, ZS-A2A improves CNR from 17.48 to 43.46 in just 8 seconds. The project for ZS-A2A will be available in the following GitHub repository: this https URL.
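
Since the headline result is a CNR gain, a short sketch of how CNR can be computed may help. CNR has several definitions in the literature and the abstract does not say which one the paper uses; the version below, absolute mean difference divided by the background standard deviation, is one common choice and is an assumption here, as are the pixel values.

```python
import math

def mean(values):
    return sum(values) / len(values)

def std(values):
    """Population standard deviation."""
    m = mean(values)
    return math.sqrt(sum((v - m) ** 2 for v in values) / len(values))

def contrast_to_noise_ratio(signal_pixels, background_pixels):
    """|mean(signal) - mean(background)| / std(background)."""
    contrast = abs(mean(signal_pixels) - mean(background_pixels))
    return contrast / std(background_pixels)

# Hypothetical pixel values from a vessel region and two background regions.
signal = [10.0, 12.0, 11.0, 11.0]
artifact_background = [1.0, 3.0, 1.0, 3.0]   # stronger streak artifacts
cleaned_background = [1.5, 2.5, 1.5, 2.5]    # artifacts partly removed

cnr_before = contrast_to_noise_ratio(signal, artifact_background)
cnr_after = contrast_to_noise_ratio(signal, cleaned_background)
print(cnr_before, cnr_after)
```

Under this definition, suppressing artifacts in the background lowers its standard deviation and raises CNR even when the signal region itself is unchanged.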

[CV-41] Large-scale School Mapping using Weakly Supervised Deep Learning for Universal School Connectivity AAAI-25

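[Quick read]: This paper addresses a key obstacle to improving global school connectivity: school location data in low- and middle-income countries is scarce and inaccurate. The key to the solution is a cost-effective, scalable approach that locates schools in high-resolution satellite images using weakly supervised deep learning. Combining vision transformers and convolutional neural networks, the best models achieve AUPRC values above 0.96 across 10 pilot African countries. Using explainable AI techniques, the approach approximates the geographical coordinates of schools from low-cost, classification-level annotations alone, producing nationwide maps of predicted school locations, with Senegal analyzed in detail as a case study. The paper also introduces an interactive web mapping tool that supports model validation by government partners.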

链接: https://arxiv.org/abs/2412.14870
作者: Isabelle Tingzon,Utku Can Ozturk,Ivan Dotu
机构: 未知
关键词: Improving global school, equitable quality education, Improving global, quality education, global school connectivity
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted at AAAI-25 Special Track on AI for Social Impact (AISI)

点击查看摘要

Abstract:Improving global school connectivity is critical for ensuring inclusive and equitable quality education. To reliably estimate the cost of connecting schools, governments and connectivity providers require complete and accurate school location data - a resource that is often scarce in many low- and middle-income countries. To address this challenge, we propose a cost-effective, scalable approach to locating schools in high-resolution satellite images using weakly supervised deep learning techniques. Our best models, which combine vision transformers and convolutional neural networks, achieve AUPRC values above 0.96 across 10 pilot African countries. Leveraging explainable AI techniques, our approach can approximate the precise geographical coordinates of the school locations using only low-cost, classification-level annotations. To demonstrate the scalability of our method, we generate nationwide maps of school location predictions in African countries and present a detailed analysis of our results, using Senegal as our case study. Finally, we demonstrate the immediate usability of our work by introducing an interactive web mapping tool to streamline human-in-the-loop model validation efforts by government partners. This work successfully showcases the real-world utility of deep learning and satellite images for planning regional infrastructure and accelerating universal school connectivity.
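
The AUPRC figure quoted above can be read with a small sketch of how the area under the precision-recall curve is commonly estimated (the step-wise "average precision" form). This is an illustration only: the function name and the toy labels are hypothetical, ties between scores are not treated specially, and the abstract does not say which estimator the authors used.

```python
def average_precision(labels, scores):
    """Step-wise estimate of the area under the precision-recall curve."""
    # Rank predictions from most to least confident.
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    total_pos = sum(labels)
    if total_pos == 0:
        raise ValueError("need at least one positive label")
    true_pos = 0
    ap = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            true_pos += 1
            # Each positive adds precision@rank times a recall step of 1/total_pos.
            ap += true_pos / rank / total_pos
    return ap


# Toy example: 1 = tile contains a school, 0 = it does not (made-up values).
print(average_precision([1, 0, 1, 0], [0.9, 0.8, 0.7, 0.1]))
```

A value close to 1 means that almost every tile ranked above a true school is itself a true school, which is the property that matters when positives are rare.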

[CV-42] AI-Powered Intracranial Hemorrhage Detection: A Co-Scale Convolutional Attention Model with Uncertainty-Based Fuzzy Integral Operator and Feature Screening

Quick read: This paper addresses the detection of intracranial hemorrhage (ICH), in particular by identifying whether a subdural hemorrhage (SDH) is present. The core of the solution is a novel architecture built on the co-scale convolutional attention (CCA) classifier, with two added layers that improve detection performance. The first layer extracts features from different slices of the CT scan, combines them, selects the 50 most informative components, and uses the bootstrap forest algorithm to assess their discriminative power, yielding an explainable AI model. The second layer introduces an uncertainty-based fuzzy integral operator that accounts for dependencies between consecutive slices and significantly improves detection accuracy.

Link: https://arxiv.org/abs/2412.14869
Authors: Mehdi Hosseini Chagahi,Md. Jalil Piran,Niloufar Delfan,Behzad Moshiri,Jaber Hatam Parikhan
Affiliations: School of Electrical and Computer Engineering, College of Engineering, University of Tehran, Tehran, Iran; Department of Electrical and Computer Engineering University of Waterloo, Waterloo, Canada; Department of Computer Science and Engineering, Sejong University, Seoul 05006, South Korea; Department of Neurosurgery, Iran University of Medical Sciences
Keywords: Intracranial hemorrhage, accumulation of blood, rupture of blood, blood vessels, leakage or accumulation
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:Intracranial hemorrhage (ICH) refers to the leakage or accumulation of blood within the skull, which occurs due to the rupture of blood vessels in or around the brain. If this condition is not diagnosed in a timely manner and appropriately treated, it can lead to serious complications such as decreased consciousness, permanent neurological disabilities, or even death. The primary aim of this study is to detect the occurrence or non-occurrence of ICH, followed by determining the type of subdural hemorrhage (SDH). These tasks are framed as two separate binary classification problems. By adding two layers to the co-scale convolutional attention (CCA) classifier architecture, we introduce a novel approach for ICH detection. In the first layer, after extracting features from different slices of computed tomography (CT) scan images, we combine these features and select the 50 components that capture the highest variance in the data, considering them as informative features. We then assess the discriminative power of these features using the bootstrap forest algorithm, discarding those that lack sufficient discriminative ability between different classes. This algorithm explicitly determines the contribution of each feature to the final prediction, assisting us in developing an explainable AI model. The features feed into a boosting neural network as a latent feature space. In the second layer, we introduce a novel uncertainty-based fuzzy integral operator to fuse information from different CT scan slices. This operator, by accounting for the dependencies between consecutive slices, significantly improves detection accuracy.

[CV-43] ObjVariantEnsemble: Advancing Point Cloud LLM Evaluation in Challenging Scenes with Subtly Distinguished Objects AAAI2025

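Quick read: This paper addresses the lack of comprehensive benchmarks for 3D scene understanding: the capabilities of 3D models in complex real-world scenes, especially those containing subtly distinguished objects, have not been adequately evaluated. The core of the solution is the ObjVariantEnsemble scheme, which systematically introduces diverse scenes with specified object classes, colors, shapes, quantities, and spatial relationships to meet model evaluation needs. The paper also designs an LLM-VLM-cooperated annotator that captures the key distinctions between similar objects, producing a more challenging benchmark that exposes shortcomings in 3D models' understanding and supports their further development.

Link: https://arxiv.org/abs/2412.14837
Authors: Qihang Cao,Huangxun Chen
Affiliations: HKUST(GZ)
Keywords: important task, interest in aligning, representations of point, recent surge, surge of research
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to AAAI2025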

Abstract:3D scene understanding is an important task, and there has been a recent surge of research interest in aligning 3D representations of point clouds with text to empower embodied AI. However, due to the lack of comprehensive 3D benchmarks, the capabilities of 3D models in real-world scenes, particularly those that are challenging with subtly distinguished objects, remain insufficiently investigated. To facilitate a more thorough evaluation of 3D models’ capabilities, we propose a scheme, ObjVariantEnsemble, to systematically introduce more scenes with specified object classes, colors, shapes, quantities, and spatial relationships to meet model evaluation needs. More importantly, we intentionally construct scenes with similar objects to a certain degree and design an LLM-VLM-cooperated annotator to capture key distinctions as annotations. The resultant benchmark can better challenge 3D models, reveal their shortcomings in understanding, and potentially aid in the further development of 3D models.

[CV-44] Synchronized and Fine-Grained Head for Skeleton-Based Ambiguous Action Recognition

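Quick read: This paper addresses two weaknesses of existing skeleton-based action recognition methods on ambiguous actions (such as "waving" and "saluting"): unbalanced extraction of spatial and temporal features, and an overemphasis on local details that loses global context. The core of the solution is a lightweight plug-and-play module, the Synchronized and Fine-grained Head (SF-Head). It uses Synchronized Spatial-Temporal Extraction (SSTE) with a Feature Redundancy Loss (F-RL) to ensure balanced interaction between spatial and temporal features, and Adaptive Cross-dimensional Feature Aggregation (AC-FA) with a Feature Consistency Loss (F-CL) to combine global context and local details, which significantly improves the ability to distinguish ambiguous actions.

Link: https://arxiv.org/abs/2412.14833
Authors: Hao Huang,Yujie Lin,Siyu Chen,Haiyang Liu
Affiliations: Unknown
Keywords: Skeleton-based action recognition, achieved remarkable performance, Skeleton-based action, ambiguous actions, recognizing ambiguous actions
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 20 pages, 5 figures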

Abstract:Skeleton-based action recognition using GCNs has achieved remarkable performance, but recognizing ambiguous actions, such as “waving” and “saluting”, remains a significant challenge. Existing methods typically rely on a serial combination of GCNs and TCNs, where spatial and temporal features are extracted independently, leading to unbalanced spatial-temporal information, which hinders accurate action recognition. Moreover, existing methods for ambiguous actions often overemphasize local details, resulting in the loss of crucial global context, which further complicates the task of differentiating ambiguous actions. To address these challenges, we propose a lightweight plug-and-play module called Synchronized and Fine-grained Head (SF-Head), inserted between GCN and TCN layers. SF-Head first conducts Synchronized Spatial-Temporal Extraction (SSTE) with a Feature Redundancy Loss (F-RL), ensuring a balanced interaction between the two types of features. It then performs Adaptive Cross-dimensional Feature Aggregation (AC-FA) with a Feature Consistency Loss (F-CL), which aligns the aggregated feature with its original spatial-temporal feature. This aggregation step effectively combines both global context and local details. Experimental results on NTU RGB+D 60, NTU RGB+D 120, and NW-UCLA datasets demonstrate significant improvements in distinguishing ambiguous actions. Our code will be made available at this https URL.

[CV-45] PC-BEV: An Efficient Polar-Cartesian BEV Fusion Framework for LiDAR Semantic Segmentation AAAI2025

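Quick read: This paper addresses the computationally intensive point-based interactions that multiview fusion requires in LiDAR segmentation because views such as range view and Bird's-Eye View (BEV) lack fixed correspondences. The core of the solution is to fuse Polar and Cartesian partitioning strategies directly within the BEV space, exploiting the inherent fixed grid correspondence between the two partitions to make fusion 170 times faster than conventional point-based methods. Dense feature fusion also preserves richer contextual information, and a hybrid Transformer-CNN architecture enhances scene understanding while keeping inference efficient. Experiments on SemanticKITTI and nuScenes show that the method outperforms previous multiview fusion approaches in both performance and inference speed.

Link: https://arxiv.org/abs/2412.14821
Authors: Shoumeng Qiu,Xinrun Li,XiangYang Xue,Jian Pu
Affiliations: 1. Shanghai Jiao Tong University; 2. University of California, Los Angeles
Keywords: intensive point-based interactions, range view, computationally intensive point-based, hinders its practical, practical deployment
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: AAAI 2025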

Abstract:Although multiview fusion has demonstrated potential in LiDAR segmentation, its dependence on computationally intensive point-based interactions, arising from the lack of fixed correspondences between views such as range view and Bird’s-Eye View (BEV), hinders its practical deployment. This paper challenges the prevailing notion that multiview fusion is essential for achieving high performance. We demonstrate that significant gains can be realized by directly fusing Polar and Cartesian partitioning strategies within the BEV space. Our proposed BEV-only segmentation model leverages the inherent fixed grid correspondences between these partitioning schemes, enabling a fusion process that is orders of magnitude faster (170× speedup) than conventional point-based methods. Furthermore, our approach facilitates dense feature fusion, preserving richer contextual information compared to sparse point-based alternatives. To enhance scene understanding while maintaining inference efficiency, we also introduce a hybrid Transformer-CNN architecture. Extensive evaluation on the SemanticKITTI and nuScenes datasets provides compelling evidence that our method outperforms previous multiview fusion approaches in terms of both performance and inference speed, highlighting the potential of BEV-based fusion for LiDAR segmentation. Code is available at this https URL.
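
The "fixed grid correspondence" between the two BEV partitions can be illustrated with a short sketch. Because both grids are defined in the same ground plane around the sensor, the polar cell that contains each Cartesian cell centre depends only on the grid geometry, so it can be tabulated once and reused for every scan. All names, grid sizes, and the centre-based matching rule below are assumptions for illustration, not the paper's implementation.

```python
import math


def cartesian_to_polar_cell(ix, iy, grid_size, cell_m, n_rings, n_sectors, max_range_m):
    """Return the (ring, sector) polar cell holding the centre of Cartesian cell (ix, iy)."""
    half = grid_size * cell_m / 2.0
    # Centre of the Cartesian cell in metres, with the sensor at the origin.
    x = (ix + 0.5) * cell_m - half
    y = (iy + 0.5) * cell_m - half
    r = math.hypot(x, y)
    if r >= max_range_m:
        return None  # this Cartesian cell lies outside the polar grid
    theta = math.atan2(y, x) % (2.0 * math.pi)
    ring = int(r / (max_range_m / n_rings))
    # Clamp guards against theta rounding up to exactly 2*pi.
    sector = min(int(theta / (2.0 * math.pi / n_sectors)), n_sectors - 1)
    return ring, sector


def build_lookup(grid_size, cell_m, n_rings, n_sectors, max_range_m):
    """Precompute the Cartesian -> polar index table once; it never changes between scans."""
    return [
        [cartesian_to_polar_cell(ix, iy, grid_size, cell_m, n_rings, n_sectors, max_range_m)
         for iy in range(grid_size)]
        for ix in range(grid_size)
    ]


lookup = build_lookup(grid_size=4, cell_m=1.0, n_rings=2, n_sectors=4, max_range_m=2.0)
print(lookup[2][2], lookup[0][2], lookup[0][0])
```

With such a table, fusing the two feature maps becomes an index gather over dense grids rather than a per-point search, which is one plausible reading of where the reported speedup comes from.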

[CV-46] Multi-Level Embedding and Alignment Network with Consistency and Invariance Learning for Cross-View Geo-Localization

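Quick read: This paper addresses two problems in Cross-View Geo-Localization (CVGL): viewpoint differences and cross-platform imaging gaps between drone and satellite images make it hard to associate features and extract consistent ones, and existing methods increase computational and storage requirements as they improve performance. The core of the solution is a lightweight enhanced alignment network called the Multi-Level Embedding and Alignment Network (MEAN). Through a progressive multi-level enhancement strategy, global-to-local associations, and cross-domain alignment, MEAN connects features at different levels and learns cross-view consistent mappings. It also adopts a shallow backbone with a lightweight branch design, which substantially reduces parameter count and computational complexity while maintaining performance.

Link: https://arxiv.org/abs/2412.14819
Authors: Zhongwei Chen,Zhao-Xu Yang,Hai-Jun Rong
Affiliations: State Key Laboratory for Strength and Vibration of Mechanical Structures; Shaanxi Key Laboratory of Environment and Control for Flight Vehicle; School of Aerospace Engineering, Xi’an Jiaotong University
Keywords: GPS-tagged satellite images, similar GPS-tagged satellite, drone images, satellite images, involves determining
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: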

Abstract:Cross-View Geo-Localization (CVGL) involves determining the localization of drone images by retrieving the most similar GPS-tagged satellite images. However, the imaging gaps between platforms are often significant and the variations in viewpoints are substantial, which limits the ability of existing methods to effectively associate cross-view features and extract consistent and invariant characteristics. Moreover, existing methods often overlook the problem of increased computational and storage requirements when improving model performance. To handle these limitations, we propose a lightweight enhanced alignment network, called the Multi-Level Embedding and Alignment Network (MEAN). The MEAN network uses a progressive multi-level enhancement strategy, global-to-local associations, and cross-domain alignment, enabling feature communication across levels. This allows MEAN to effectively connect features at different levels and learn robust cross-view consistent mappings and modality-invariant features. Moreover, MEAN adopts a shallow backbone network combined with a lightweight branch design, effectively reducing parameter count and computational complexity. Experimental results on the University-1652 and SUES-200 datasets demonstrate that MEAN reduces parameter count by 62.17% and computational complexity by 70.99% compared to state-of-the-art models, while maintaining competitive or even superior performance. The codes will be released soon.

[CV-47] Explainable Tampered Text Detection via Multimodal Large Models

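Quick read: This paper addresses the "black-box" problem in tampered text detection: existing methods can detect tampered text regions, but the basis for the detection is unclear, which makes predictions unreliable. The core of the solution is to use large multimodal models to explain the basis of tampered text detection in natural language. To this end, the paper proposes ETTD, a large-scale comprehensive dataset with both pixel-level annotations and natural language annotations that describe the anomalies of tampered text regions. Dataset quality is improved with a fused mask prompt and with automatic filtering of GPT4o annotations. The paper also proposes a simple and effective model, TTD, which gains fine-grained perception through an auxiliary reference grounding query and further improves explainable tampered text detection.

Link: https://arxiv.org/abs/2412.14816
Authors: Chenfan Qu,Jian Liu,Haoxing Chen,Baihan Yu,Jingjing Liu,Weiqiang Wang,Lianwen Jin
Affiliations: South China University of Technology; Ant Group
Keywords: tampered text detection, tampered text, tampered text region, text detection, tampered
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: The first work for explainable tampered text detection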

Abstract:Recently, tampered text detection has attracted increasing attention due to its essential role in information security. Although existing methods can detect the tampered text region, the interpretation of such detection remains unclear, making the prediction unreliable. To address this black-box problem, we propose to explain the basis of tampered text detection with natural language via large multimodal models. To fill the data gap for this task, we propose a large-scale, comprehensive dataset, ETTD, which contains both pixel-level annotations indicating the tampered text region and natural language annotations describing the anomaly of the tampered text. Multiple methods are employed to improve the quality of the proposed data. For example, a fused mask prompt is proposed to reduce confusion when querying GPT4o to generate anomaly descriptions. By weighting the input image with the mask annotation, the tampered region can be clearly indicated and the content in and around the tampered region can also be preserved. We also propose prompting GPT4o to recognize tampered texts and filtering out the responses with low OCR accuracy, which can effectively improve annotation quality in an automatic manner. To further improve explainable tampered text detection, we propose a simple yet effective model called TTD, which benefits from improved fine-grained perception by paying attention to the suspected region with auxiliary reference grounding query. Extensive experiments on both the ETTD dataset and the public dataset have verified the effectiveness of the proposed methods. In-depth analysis is also provided to inspire further research. The dataset and code will be made publicly available.

[CV-48] Video Prediction Policy: A Generalist Robot Policy with Predictive Visual Representations

Quick read: This paper addresses the inability of existing robot policies to fully capture sequential information in embodied tasks. The core of the solution is to use the predictive visual representations of Video Diffusion Models (VDMs), which reflect how the physical world evolves. The paper proposes the Video Prediction Policy (VPP), which combines the predictive capability of VDMs with diverse human and robot manipulation datasets under a unified generative training objective. VPP clearly outperforms existing methods on simulated and real-world benchmarks, with notable gains in success rate on complex manipulation tasks.

Link: https://arxiv.org/abs/2412.14803
Authors: Yucheng Hu,Yanjiang Guo,Pengchao Wang,Xiaoyu Chen,Yen-Jen Wang,Jianke Zhang,Koushil Sreenath,Chaochao Lu,Jianyu Chen
Affiliations: IIIS, Tsinghua University; University of California, Berkeley; Shanghai AI Lab; Shanghai Qizhi Institute; RobotEra
Keywords: performing multiple tasks, Recent advancements, developing generalist policies, generalist policies capable, focused on developing
Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments: The first two authors contribute equally. Project Page at this https URL

Abstract:Recent advancements in robotics have focused on developing generalist policies capable of performing multiple tasks. Typically, these policies utilize pre-trained vision encoders to capture crucial information from current observations. However, previous vision encoders, which are trained on two-image contrastive learning or single-image reconstruction, cannot fully capture the sequential information essential for embodied tasks. Recently, video diffusion models (VDMs) have demonstrated the capability to accurately predict future image sequences, exhibiting a good understanding of physical dynamics. Motivated by the strong visual prediction capabilities of VDMs, we hypothesize that they inherently possess visual representations that reflect the evolution of the physical world, which we term predictive visual representations. Building on this hypothesis, we propose the Video Prediction Policy (VPP), a generalist robotic policy conditioned on the predictive visual representations from VDMs. To further enhance these representations, we incorporate diverse human or robotic manipulation datasets, employing unified video-generation training objectives. VPP consistently outperforms existing methods across two simulated and two real-world benchmarks. Notably, it achieves a 28.1% relative improvement in the Calvin ABC-D benchmark compared to the previous state-of-the-art and delivers a 28.8% increase in success rates for complex real-world dexterous manipulation tasks.

[CV-49] YOLOv11 Optimization for Efficient Resource Utilization

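Quick read: This paper optimizes the eleventh iteration of YOLO (You Only Look Once, YOLOv11) by developing modified versions targeted at objects of specific sizes, addressing the efficiency of detecting objects of different sizes. The core of the solution is to prune unnecessary layers and reconfigure the main YOLOv11 architecture, producing models suited to different size ranges, from small to large. An object classifier program is also introduced to select the most suitable model version from the characteristics of a dataset. Experiments show that the modified versions maintain the accuracy of the original YOLOv11 while markedly improving computational resource efficiency and reducing model size and inference time; in some cases they even outperform the original model.

Link: https://arxiv.org/abs/2412.14790
Authors: Areeg Fagad Rasheed,M. Zarkoosh
Affiliations: Al-Nahrain University; Unknown
Keywords: developing size-specific modified, optimize the eleventh, eleventh iteration, developing size-specific, size-specific modified versions
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 12 pages, 13 figures, 4 tables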

Abstract:The objective of this research is to optimize the eleventh iteration of You Only Look Once (YOLOv11) by developing size-specific modified versions of the architecture. These modifications involve pruning unnecessary layers and reconfiguring the main architecture of YOLOv11. Each proposed version is tailored to detect objects of specific size ranges, from small to large. To ensure proper model selection based on dataset characteristics, we introduced an object classifier program. This program identifies the most suitable modified version for a given dataset. The proposed models were evaluated on various datasets and compared with the original YOLOv11 and YOLOv8 models. The experimental results highlight significant improvements in computational resource efficiency, with the proposed models maintaining the accuracy of the original YOLOv11. In some cases, the modified versions outperformed the original model regarding detection performance. Furthermore, the proposed models demonstrated reduced model sizes and faster inference times. Model weights and the object size classifier can be found in this repository.
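
As a rough illustration of dataset-driven model selection, the sketch below picks a size-specific variant from the typical bounding-box area of a dataset. The abstract does not describe the classifier's actual criteria, so the function name, the variant labels, and the thresholds are hypothetical; the thresholds borrow the COCO convention (small below 32×32 pixels, large above 96×96 pixels) purely as a stand-in.

```python
from statistics import median

# COCO-style area thresholds in square pixels, used here only as an assumption.
SMALL_MAX_AREA = 32 * 32
MEDIUM_MAX_AREA = 96 * 96


def pick_variant(boxes):
    """Choose a hypothetical size-specific variant from (width, height) boxes in pixels."""
    if not boxes:
        raise ValueError("dataset has no boxes")
    # The median is less sensitive than the mean to a few very large objects.
    typical_area = median(w * h for w, h in boxes)
    if typical_area < SMALL_MAX_AREA:
        return "small"
    if typical_area < MEDIUM_MAX_AREA:
        return "medium"
    return "large"


print(pick_variant([(10, 10), (20, 20), (200, 200)]))
```

A real selector would probably look at the full size distribution rather than a single statistic, since a dataset with mixed sizes may still need the unpruned model.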

[CV-50] FLAMe: Federated Learning with Attention Mechanism using Spatio-Temporal Keypoint Transformers for Pedestrian Fall Detection in Smart Cities AAAI2025

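Quick read: This paper addresses pedestrian fall detection in smart cities, with the goal of protecting citizens' safety and quality of life. The core of the solution is FLAMe, an algorithm based on Federated Learning (FL) that trains on keypoint information with an attention mechanism and transmits only the important trained weights to the server, reducing communication costs and preserving data privacy. A lightweight keypoint Transformer model is integrated into the FL framework to learn spatio-temporal features effectively. Experiments show that the FLAMe system keeps performance close to existing centralized learning while cutting communication costs by about 40%, which supports its robustness and practicality in the distributed environment of smart cities.

Link: https://arxiv.org/abs/2412.14768
Authors: Byeonghun Kim,Byeongjoon Noh
Affiliations: Unknown
Keywords: detecting pedestrian falls, detecting pedestrian, life of citizens, major challenge, challenge to ensure
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 8 pages, 7 figures, AAAI 2025 FLUID Workshop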

Abstract:In smart cities, detecting pedestrian falls is a major challenge to ensure the safety and quality of life of citizens. In this study, we propose a novel fall detection system using FLAMe (Federated Learning with Attention Mechanism), a federated learning (FL) based algorithm. FLAMe trains around important keypoint information and only transmits the trained important weights to the server, reducing communication costs and preserving data privacy. Furthermore, the lightweight keypoint transformer model is integrated into the FL framework to effectively learn spatio-temporal features. We validated the experiment using 22,672 video samples from the “Fall Accident Risk Behavior Video-Sensor Pair data” dataset from AI-Hub. As a result of the experiment, the FLAMe-based system achieved an accuracy of 94.02% with about 190,000 transmission parameters, maintaining performance similar to that of existing centralized learning while maximizing efficiency by reducing communication costs by about 40% compared to the existing FL algorithm, FedAvg. Therefore, the FLAMe algorithm has demonstrated that it provides robust performance in the distributed environment of smart cities and is a practical and effective solution for public safety.
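
The server-side step that FLAMe is compared against, FedAvg, averages client parameters weighted by local sample counts. The sketch below shows that averaging and extends it, as an assumption, to the case where a client sends only a subset of parameters; how FLAMe decides which weights are "important" is not described in the abstract, so that selection is left to the caller and the parameter names are made up.

```python
def fedavg(client_updates):
    """Sample-weighted average of client parameters, FedAvg-style.

    client_updates: list of (num_samples, {param_name: value}) pairs.
    A client may send only some parameters; each parameter is averaged
    over the clients that actually sent it.
    """
    sums, weights = {}, {}
    for num_samples, params in client_updates:
        for name, value in params.items():
            sums[name] = sums.get(name, 0.0) + num_samples * value
            weights[name] = weights.get(name, 0) + num_samples
    return {name: sums[name] / weights[name] for name in sums}


# The second client transmits only "attn.w", which is what reduces traffic.
updates = [
    (10, {"attn.w": 1.0, "head.w": 2.0}),
    (30, {"attn.w": 3.0}),
]
print(fedavg(updates))
```

Sending fewer parameters per round lowers communication cost directly, at the price of slower updates for the parameters that are held back.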

[CV-51] Prototypical Calibrating Ambiguous Samples for Micro-Action Recognition AAAI2025

Quick read: This paper addresses the inherent ambiguity in Micro-Action Recognition (MAR) that arises from the wide range of categories and the subtle visual differences between them. The core of the solution is a novel Prototypical Calibrating Ambiguous Network (PCAN), which mitigates ambiguity in several steps. First, a hierarchical action tree identifies ambiguous samples and divides them into false negative (FN) and false positive (FP) samples. Second, an ambiguous contrastive refinement module calibrates these samples by adjusting the distance between each ambiguous sample and its corresponding prototype, pulling FN samples closer to their prototypes and pushing FP samples away. In addition, a prototypical diversity amplification loss strengthens the differences between prototypes. Finally, a prototype-guided rectification corrects predictions by incorporating the representability of the prototypes. Experiments show that the method outperforms existing approaches on the benchmark dataset.

Link: https://arxiv.org/abs/2412.14719
Authors: Kun Li,Dan Guo,Guoliang Chen,Chunxiao Fan,Jingyuan Xu,Zhiliang Wu,Hehe Fan,Meng Wang
Affiliations: 1. School of Computer Science and Technology, Harbin Institute of Technology, Harbin, China;
2. National University of Defense Technology, Changsha, China;
3. Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China
Keywords: gained increasing attention, increasing attention due, Micro-Action Recognition, non-verbal communication, human communication
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: Accepted by AAAI 2025

Abstract:Micro-Action Recognition (MAR) has gained increasing attention due to its crucial role as a form of non-verbal communication in social interactions, with promising potential for applications in human communication and emotion analysis. However, current approaches often overlook the inherent ambiguity in micro-actions, which arises from the wide category range and subtle visual differences between categories. This oversight hampers the accuracy of micro-action recognition. In this paper, we propose a novel Prototypical Calibrating Ambiguous Network (PCAN) to unleash and mitigate the ambiguity of MAR. Firstly, we employ a hierarchical action-tree to identify ambiguous samples, categorizing them into distinct sets of false negatives and false positives, considering both body- and action-level categories. Secondly, we implement an ambiguous contrastive refinement module to calibrate these ambiguous samples by regulating the distance between ambiguous samples and their corresponding prototypes. This calibration process aims to pull false negative (FN) samples closer to their respective prototypes and push false positive (FP) samples apart from their affiliated prototypes. In addition, we propose a new prototypical diversity amplification loss to strengthen the model’s capacity by amplifying the differences between different prototypes. Finally, we propose a prototype-guided rectification to rectify prediction by incorporating the representability of prototypes. Extensive experiments conducted on the benchmark dataset demonstrate the superior performance of our method compared to existing approaches. The code is available at this https URL.
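
The pull/push calibration described above can be pictured with a generic margin-based objective. This is not the loss from the paper, whose exact form the abstract does not give: the function name, the squared Euclidean distance, and the hinge margin are all assumptions chosen to show the idea of drawing FN samples toward a prototype while keeping FP samples at least a margin away.

```python
import math


def calibration_loss(fn_samples, fp_samples, prototype, margin=1.0):
    """Pull false negatives toward a prototype; push false positives beyond a margin."""
    # FN samples belong to this class, so any distance to the prototype is penalised.
    pull = sum(math.dist(x, prototype) ** 2 for x in fn_samples)
    # FP samples do not belong, so they are penalised only while inside the margin.
    push = sum(max(0.0, margin - math.dist(x, prototype)) ** 2 for x in fp_samples)
    count = len(fn_samples) + len(fp_samples)
    return (pull + push) / count if count else 0.0


prototype = (0.0, 0.0)
print(calibration_loss([(3.0, 4.0)], [(0.5, 0.0)], prototype))
```

In this form an FP sample that already sits outside the margin contributes nothing, so the gradient concentrates on the samples that are actually confusable with the class.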

[CV-52] EnergyMoGen: Compositional Human Motion Generation with Energy-Based Diffusion Model in Latent Space

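Quick read: This paper addresses the difficulty diffusion models have in composing multiple semantic concepts in text-driven human motion generation. The core of the solution is EnergyMoGen, which includes two kinds of Energy-Based Models: (1) the diffusion model is interpreted as a latent-aware energy-based model that generates motions by composing a set of diffusion models in latent space; (2) a semantic-aware energy model based on cross-attention supports semantic composition and adaptive gradient descent for text embeddings. To overcome semantic inconsistency and motion distortion, the paper proposes Synergistic Energy Fusion, which lets the motion latent diffusion model synthesize high-quality, complex motions by combining multiple energy terms that correspond to textual descriptions. Experiments show that the method outperforms existing state-of-the-art models on text-to-motion generation, compositional motion generation, and multi-concept motion generation.

Link: https://arxiv.org/abs/2412.14706
Authors: Jianrong Zhang,Hehe Fan,Yi Yang
Affiliations: ReLER, AAII, University of Technology Sydney; CCAI, Zhejiang University
Keywords: demonstrated remarkable success, latent diffusion models, Diffusion models, latent diffusion, text-driven human motion
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project page: this https URL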

Abstract:Diffusion models, particularly latent diffusion models, have demonstrated remarkable success in text-driven human motion generation. However, it remains challenging for latent diffusion models to effectively compose multiple semantic concepts into a single, coherent motion sequence. To address this issue, we propose EnergyMoGen, which includes two spectrums of Energy-Based Models: (1) We interpret the diffusion model as a latent-aware energy-based model that generates motions by composing a set of diffusion models in latent space; (2) We introduce a semantic-aware energy model based on cross-attention, which enables semantic composition and adaptive gradient descent for text embeddings. To overcome the challenges of semantic inconsistency and motion distortion across these two spectrums, we introduce Synergistic Energy Fusion. This design allows the motion latent diffusion model to synthesize high-quality, complex motions by combining multiple energy terms corresponding to textual descriptions. Experiments show that our approach outperforms existing state-of-the-art models on various motion generation tasks, including text-to-motion generation, compositional motion generation, and multi-concept motion generation. Additionally, we demonstrate that our method can be used to extend motion datasets and improve the text-to-motion task.
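
For orientation, composing diffusion models under the energy-based view is usually written as follows in the compositional-generation literature. This is the standard conjunction formula, given here under the assumption that the concepts $c_1,\dots,c_n$ are conditionally independent given the latent $z$; the weights $w_i$ are guidance weights, and the paper's own formulation may differ.

```latex
% Conjunction of concepts: per-concept energies add up in log space.
p(z \mid c_1,\dots,c_n) \;\propto\; p(z)\prod_{i=1}^{n}\frac{p(z \mid c_i)}{p(z)}

% Corresponding composed noise prediction at diffusion step t.
\hat{\epsilon}(z_t, t) \;=\; \epsilon_\theta(z_t, t)
  + \sum_{i=1}^{n} w_i\,\bigl(\epsilon_\theta(z_t, t \mid c_i) - \epsilon_\theta(z_t, t)\bigr)
```

With a single concept and weight $w_1$, the second line reduces to ordinary classifier-free guidance.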

[CV-53] Event-assisted 12-stop HDR Imaging of Dynamic Scene

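Quick read: This paper addresses high dynamic range (HDR) imaging of dynamic scenes, in particular the alignment difficulties and ghosting that traditional HDR fusion suffers under extreme exposure differences because of motion and brightness changes. The core of the solution is a dual-camera system with an event camera and an RGB camera: the temporally dense, high dynamic range signal from the event camera improves alignment between LDR frames with large exposure differences and so reduces ghosting. The paper also proposes a diffusion-based fusion module that uses image priors from pre-trained diffusion models to handle artifacts in high-contrast regions and to minimize errors from the alignment process. With the new ESHDR dataset and a finetuning strategy on real-world data, the method achieves 12-stop HDR imaging in dynamic scenes with state-of-the-art performance.

Link: https://arxiv.org/abs/2412.14705
Authors: Shi Guo,Zixuan Chen,Ziran Zhang,Yutian Chen,Gangwei Xu,Tianfan Xue
Affiliations: Shanghai AI Laboratory; Zhejiang University; The Chinese University of Hong Kong; Huazhong University of Science and Technology
Keywords: diverse lighting conditions, High dynamic range, computational photography, lighting conditions, High dynamic
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project page: this https URL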

Abstract:High dynamic range (HDR) imaging is a crucial task in computational photography, which captures details across diverse lighting conditions. Traditional HDR fusion methods face limitations in dynamic scenes with extreme exposure differences, as aligning low dynamic range (LDR) frames becomes challenging due to motion and brightness variation. In this work, we propose a novel 12-stop HDR imaging approach for dynamic scenes, leveraging a dual-camera system with an event camera and an RGB camera. The event camera provides temporally dense, high dynamic range signals that improve alignment between LDR frames with large exposure differences, reducing ghosting artifacts caused by motion. Also, a real-world finetuning strategy is proposed to increase the generalization of the alignment module on real-world events. Additionally, we introduce a diffusion-based fusion module that incorporates image priors from pre-trained diffusion models to address artifacts in high-contrast regions and minimize errors from the alignment process. To support this work, we developed the ESHDR dataset, the first dataset for 12-stop HDR imaging with synchronized event signals, and validated our approach on both simulated and real-world data. Extensive experiments demonstrate that our method achieves state-of-the-art performance, successfully extending HDR imaging to 12 stops in dynamic scenes.
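
For a sense of scale: one photographic stop is a factor of two in exposure, so a span of 12 stops corresponds to the ratio below. The abstract does not state this figure; it follows only from the usual definition of a stop.

```latex
\frac{L_{\max}}{L_{\min}} = 2^{12} = 4096
```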

[CV-54] Explicit Relational Reasoning Network for Scene Text Detection AAAI2025

Quick read: This paper addresses the time-consuming post-processing of Connected Component (CC) based text detection methods. The core of the solution is the Explicit Relational Reasoning Network (ERRNet), which represents each text instance as ordered text components and treats these components as objects in sequential movement, recasting scene text detection as a tracking problem. The paper designs an end-to-end tracking decoder that eliminates post-processing entirely. To handle the inconsistency between classification confidence and localization quality, it proposes a Polygon Monte-Carlo method that evaluates localization quality quickly and accurately, and a position-supervised classification loss that guides task-aligned learning. Experiments show that ERRNet achieves state-of-the-art accuracy while keeping inference fast.

Link: https://arxiv.org/abs/2412.14692
Authors: Yuchen Su,Zhineng Chen,Yongkun Du,Zhilong Ji,Kai Hu,Jinfeng Bai,Xieping Gao
Affiliations: 1. School of Computer Science and Technology, Harbin Institute of Technology, Harbin, China;
2. School of Computer Science and Engineering, Harbin Institute of Technology, Weihai, China;
3. School of Computer Science and Technology, Harbin Institute of Technology, Shenzhen, China
Keywords: human reading intuition, proper text shape, text shape representation, Connected component, reading intuition
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to AAAI 2025

Abstract:Connected component (CC) is a proper text shape representation that aligns with human reading intuition. However, CC-based text detection methods have recently faced a developmental bottleneck: their time-consuming post-processing is difficult to eliminate. To address this issue, we introduce an explicit relational reasoning network (ERRNet) to elegantly model the component relationships without post-processing. Concretely, we first represent each text instance as multiple ordered text components, and then treat these components as objects in sequential movement. In this way, scene text detection can be innovatively viewed as a tracking problem. From this perspective, we design an end-to-end tracking decoder to achieve a CC-based method dispensing with post-processing entirely. Additionally, we observe that there is an inconsistency between classification confidence and localization quality, so we propose a Polygon Monte-Carlo method to quickly and accurately evaluate the localization quality. Based on this, we introduce a position-supervised classification loss to guide the task-aligned learning of ERRNet. Experiments on challenging benchmarks demonstrate the effectiveness of our ERRNet. It consistently achieves state-of-the-art accuracy while holding highly competitive inference speed.
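
One way to read "Polygon Monte-Carlo" is as a sampling estimate of the overlap between a predicted polygon and its ground truth. The sketch below estimates polygon IoU by drawing uniform points in the joint bounding box and testing membership with ray casting. The abstract does not specify the authors' procedure, so the function names, the sample count, and the choice of IoU as the quality measure are assumptions.

```python
import random


def point_in_polygon(px, py, polygon):
    """Ray-casting test; polygon is a list of (x, y) vertices."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Only edges that straddle the horizontal line y = py can be crossed.
        if (y1 > py) != (y2 > py):
            x_cross = x1 + (py - y1) * (x2 - x1) / (y2 - y1)
            if px < x_cross:
                inside = not inside
    return inside


def monte_carlo_iou(poly_a, poly_b, n_samples=20000, seed=0):
    """Estimate the IoU of two polygons from uniform samples in their joint bounding box."""
    rng = random.Random(seed)
    xs = [x for x, _ in poly_a + poly_b]
    ys = [y for _, y in poly_a + poly_b]
    x_lo, x_hi, y_lo, y_hi = min(xs), max(xs), min(ys), max(ys)
    inter = union = 0
    for _ in range(n_samples):
        px, py = rng.uniform(x_lo, x_hi), rng.uniform(y_lo, y_hi)
        in_a = point_in_polygon(px, py, poly_a)
        in_b = point_in_polygon(px, py, poly_b)
        inter += in_a and in_b
        union += in_a or in_b
    return inter / union if union else 0.0


square_a = [(0, 0), (2, 0), (2, 2), (0, 2)]
square_b = [(1, 0), (3, 0), (3, 2), (1, 2)]
print(monte_carlo_iou(square_a, square_b))  # exact IoU for these squares is 1/3
```

Sampling avoids exact polygon clipping and is easy to run in parallel over many candidates; the trade-off is a statistical error that shrinks with the square root of the sample count.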

[CV-55] A Light-Weight Framework for Open-Set Object Detection with Decoupled Feature Alignment in Joint Space

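[Quick read]: This paper addresses the use of open-set object detection (OSOD) in robotic manipulation, where existing methods are too computationally heavy and too complex to deploy. The key to the solution is a light-weight framework, Decoupled OSOD (DOSOD), which builds on the YOLO-World pipeline by pairing a vision-language model (VLM) with a detector and adding a Multilayer Perceptron (MLP) adaptor that maps the VLM's text embeddings into a joint space. Cross-modality features are aligned directly in that space, which avoids complex feature interactions and improves computational efficiency. At test time DOSOD behaves like a traditional closed-set detector, bridging the gap between closed-set and open-set detection; compared with YOLO-World it improves real-time performance markedly while keeping comparable accuracy.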

Link: https://arxiv.org/abs/2412.14680
Authors: Yonghao He, Hu Su, Haiyong Yu, Cong Yang, Wei Sui, Cong Wang, Song Liu
Affiliations: D-Robotics; State Key Laboratory of Multimodal Artificial Intelligence Systems (MAIS), Institute of Automation of Chinese Academy of Sciences; BeeLab, School of Future Science and Engineering, Soochow University; the School of Information Science and Technology, ShanghaiTech University
Keywords: called Decoupled OSOD, Open-set object detection, unstructured environments, existing OSOD methods, manipulation in unstructured
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Robotics (cs.RO)
Comments:

Abstract: Open-set object detection (OSOD) is highly desirable for robotic manipulation in unstructured environments. However, existing OSOD methods often fail to meet the requirements of robotic applications due to their high computational burden and complex deployment. To address this issue, this paper proposes a light-weight framework called Decoupled OSOD (DOSOD), which is a practical and highly efficient solution to support real-time OSOD tasks in robotic systems. Specifically, DOSOD builds upon the YOLO-World pipeline by integrating a vision-language model (VLM) with a detector. A Multilayer Perceptron (MLP) adaptor is developed to transform text embeddings extracted by the VLM into a joint space, within which the detector learns the region representations of class-agnostic proposals. Cross-modality features are directly aligned in the joint space, avoiding complex feature interactions and thereby improving computational efficiency. DOSOD operates like a traditional closed-set detector during the testing phase, effectively bridging the gap between closed-set and open-set detection. Compared to the baseline YOLO-World, the proposed DOSOD significantly enhances real-time performance while maintaining comparable accuracy. The small DOSOD-S model achieves a Fixed AP of 26.7%, compared to 26.2% for YOLO-World-v1-S and 22.7% for YOLO-World-v2-S, using similar backbones on the LVIS minival dataset. Meanwhile, the FPS of DOSOD-S is 57.1% higher than YOLO-World-v1-S and 29.6% higher than YOLO-World-v2-S. We also demonstrate that the DOSOD model facilitates deployment on edge devices. The codes and models are publicly available at this https URL.
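
To make the decoupling idea concrete, here is a minimal sketch under stated assumptions: a two-layer MLP adaptor (layer sizes, identity weights, and all function names are hypothetical) maps text embeddings into a joint space once, offline, and region embeddings are then classified by cosine similarity. It is not the released DOSOD code.

```python
import math

def linear(x, W, b):
    """Compute W @ x + b for a list-of-rows matrix W."""
    return [sum(w * xi for w, xi in zip(row, x)) + bi for row, bi in zip(W, b)]

def mlp_adaptor(text_emb, W1, b1, W2, b2):
    """Hypothetical two-layer MLP mapping a text embedding into the joint space."""
    hidden = [max(0.0, h) for h in linear(text_emb, W1, b1)]  # ReLU
    return linear(hidden, W2, b2)

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def classify_regions(region_embs, class_vectors):
    """Assign each region to the class vector with the highest cosine similarity."""
    return [max(range(len(class_vectors)), key=lambda c: cosine(r, class_vectors[c]))
            for r in region_embs]

# Offline: push every class prompt through the adaptor once.
identity = [[1.0, 0.0], [0.0, 1.0]]
zeros = [0.0, 0.0]
text_embs = [[1.0, 0.0], [0.0, 1.0]]  # toy stand-ins for VLM text embeddings
class_vectors = [mlp_adaptor(t, identity, zeros, identity, zeros) for t in text_embs]

# Online: only the detector runs; class_vectors act as fixed classifier weights.
regions = [[0.9, 0.1], [0.2, 0.8]]
labels = classify_regions(regions, class_vectors)
```

Because the class vectors are fixed after the offline step, inference involves no text encoder, which is one way to read the abstract's statement that the model operates like a closed-set detector at test time.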

[CV-56] Efficient Few-Shot Neural Architecture Search by Counting the Number of Nonlinear Functions AAAI2025

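[Quick read]: This paper addresses the training interference that arises in neural architecture search (NAS) when subnets inside an over-parameterized network (supernet) share parameters. The key to the solution is a new few-shot NAS method that splits the search space by the number of nonlinear functions: each subspace contains only subnets with the same number of nonlinear functions and gets its own supernet, which limits weight sharing. The splitting criterion is cheap because it does not require comparing supernet gradients. The paper also introduces supernet-balanced sampling (SBS), which samples several subnets at each training step so that the different supernets are trained evenly within a limited number of steps. Experiments on standard NAS benchmarks demonstrate the method's effectiveness.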

Link: https://arxiv.org/abs/2412.14678
Authors: Youngmin Oh, Hyunju Lee, Bumsub Ham
Affiliations: Korea Advanced Institute of Science and Technology (KAIST); École Polytechnique Fédérale de Lausanne (EPFL)
Keywords: Neural architecture search, Neural architecture, search space automatically, search space, NAS
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to AAAI 2025

Abstract:Neural architecture search (NAS) enables finding the best-performing architecture from a search space automatically. Most NAS methods exploit an over-parameterized network (i.e., a supernet) containing all possible architectures (i.e., subnets) in the search space. However, the subnets that share the same set of parameters are likely to have different characteristics, interfering with each other during training. To address this, few-shot NAS methods have been proposed that divide the space into a few subspaces and employ a separate supernet for each subspace to limit the extent of weight sharing. They achieve state-of-the-art performance, but the computational cost increases accordingly. We introduce in this paper a novel few-shot NAS method that exploits the number of nonlinear functions to split the search space. To be specific, our method divides the space such that each subspace consists of subnets with the same number of nonlinear functions. Our splitting criterion is efficient, since it does not require comparing gradients of a supernet to split the space. In addition, we have found that dividing the space allows us to reduce the channel dimensions required for each supernet, which enables training multiple supernets in an efficient manner. We also introduce a supernet-balanced sampling (SBS) technique, sampling several subnets at each training step, to train different supernets evenly within a limited number of training steps. Extensive experiments on standard NAS benchmarks demonstrate the effectiveness of our approach. Our code is available at this https URL.
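
The splitting criterion can be sketched on a toy search space. The operation set, the three-layer depth, and the reading of supernet-balanced sampling as "one subnet per subspace per step" are assumptions for illustration, not the paper's exact procedure.

```python
import itertools
import random
from collections import defaultdict

# Hypothetical candidate operations; True marks ops that contain a nonlinear function.
OPS = {"conv_relu": True, "skip": False}

def count_nonlinear(subnet):
    """Number of nonlinear functions in a subnet (a tuple of op names)."""
    return sum(OPS[op] for op in subnet)

def split_search_space(num_layers):
    """Group every subnet by its number of nonlinear functions."""
    subspaces = defaultdict(list)
    for subnet in itertools.product(OPS, repeat=num_layers):
        subspaces[count_nonlinear(subnet)].append(subnet)
    return dict(subspaces)

def balanced_sample(subspaces, rng):
    """One training step: draw one subnet from every subspace (assumed reading of SBS)."""
    return {k: rng.choice(v) for k, v in sorted(subspaces.items())}

subspaces = split_search_space(num_layers=3)
step_batch = balanced_sample(subspaces, random.Random(0))
```

Counting nonlinearities needs only the architecture encoding, so the split is computed without any forward or backward pass, in line with the abstract's point that no gradient comparison is required.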

[CV-57] FiVL: A Framework for Improved Vision-Language Alignment

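[Quick read]: This paper addresses the failure of Large Vision Language Models (LVLMs) to use visual information effectively in multimodal reasoning; in visual question answering in particular, models often answer from linguistic priors rather than image content. The key to the solution is FiVL, a method for constructing datasets that both train LVLMs for stronger visual grounding and evaluate how well they achieve it. The datasets assess whether a model uses image content as substantive evidence rather than relying solely on linguistic priors, giving insight into how much the model depends on visual information.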

Link: https://arxiv.org/abs/2412.14672
Authors: Estelle Aflalo, Gabriela Ben Melech Stan, Tiep Le, Man Luo, Shachar Rosenman, Sayak Paul, Shao-Yen Tseng, Vasudev Lal
Affiliations: Unknown
Keywords: Large Vision Language, Vision Language Models, achieved significant progress, Large Vision, Language Models
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:Large Vision Language Models (LVLMs) have achieved significant progress in integrating visual and textual inputs for multimodal reasoning. However, a recurring challenge is ensuring these models utilize visual information as effectively as linguistic content when both modalities are necessary to formulate an accurate answer. We hypothesize that hallucinations arise due to the lack of effective visual grounding in current LVLMs. This issue extends to vision-language benchmarks, where it is difficult to make the image indispensable for accurate answer generation, particularly in vision question-answering tasks. In this work, we introduce FiVL, a novel method for constructing datasets designed to train LVLMs for enhanced visual grounding and to evaluate their effectiveness in achieving it. These datasets can be utilized for both training and assessing an LVLM’s ability to use image content as substantive evidence rather than relying solely on linguistic priors, providing insights into the model’s reliance on visual information. To demonstrate the utility of our dataset, we introduce an innovative training task that outperforms baselines alongside a validation method and application for explainability. The code is available at this https URL.

[CV-58] MUSTER: Longitudinal Deformable Registration by Composition of Consecutive Deformations

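[Quick read]: This paper addresses inaccurate non-linear image registration in longitudinal medical image analysis, caused by changing image contrast and by instrumental and environmental bias between sessions. The key to the solution is Multi-Session Temporal Registration (MUSTER), which recovers longitudinal deformations by incorporating more than two imaging sessions. Compared with conventional pairwise registration, MUSTER significantly improves the estimated deformations. The authors show that local normalized cross-correlation gives biased results as a similarity metric and propose a robust alternative. GPU acceleration lets MUSTER handle large datasets efficiently, even with limited computational resources.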

Link: https://arxiv.org/abs/2412.14671
Authors: Edvard O. S. Grødem, Donatas Sederevičius, Esten H. Leonardsen, Bradley J. MacIntosh, Atle Bjørnerud, Till Schellhorn, Øystein Sørensen, Inge Amlien, Pablo F. Garrido, Anders M. Fjell
Affiliations: Computational Radiology & Artificial Intelligence unit, Division of Radiology and Nuclear Medicine, Oslo University Hospital, Oslo, Norway; Center for Lifespan Changes in Brain and Cognition, Department of Psychology, University of Oslo, Oslo, Norway; Section for Precision Psychiatry, Oslo University Hospital & Institute of Clinical Medicine, University of Oslo, Oslo, Norway; Department of Medical Biophysics, Sunnybrook Research Institute, University of Toronto, Toronto, Canada
Keywords: MUSTER, Multi-Session Temporal Registration, Longitudinal, registration, longitudinal analysis
Categories: Computer Vision and Pattern Recognition (cs.CV); Numerical Analysis (math.NA)
Comments:

Abstract:Longitudinal imaging allows for the study of structural changes over time. One approach to detecting such changes is by non-linear image registration. This study introduces Multi-Session Temporal Registration (MUSTER), a novel method that facilitates longitudinal analysis of changes in extended series of medical images. MUSTER improves upon conventional pairwise registration by incorporating more than two imaging sessions to recover longitudinal deformations. Longitudinal analysis at a voxel-level is challenging due to effects of a changing image contrast as well as instrumental and environmental sources of bias between sessions. We show that local normalized cross-correlation as an image similarity metric leads to biased results and propose a robust alternative. We test the performance of MUSTER on a synthetic multi-site, multi-session neuroimaging dataset and show that, in various scenarios, using MUSTER significantly enhances the estimated deformations relative to pairwise registration. Additionally, we apply MUSTER on a sample of older adults from the Alzheimer’s Disease Neuroimaging Initiative (ADNI) study. The results show that MUSTER can effectively identify patterns of neuro-degeneration from T1-weighted images and that these changes correlate with changes in cognition, matching the performance of state of the art segmentation methods. By leveraging GPU acceleration, MUSTER efficiently handles large datasets, making it feasible also in situations with limited computational resources.
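
The title's "composition of consecutive deformations" can be illustrated in one dimension. The sketch assumes displacement fields sampled on an integer grid and linear interpolation; the helper names and the toy fields are made up for this example and do not come from the MUSTER code.

```python
def interp(field, x):
    """Linearly interpolate a displacement field sampled at positions 0..n-1."""
    x = min(max(x, 0.0), len(field) - 1.0)  # clamp to the grid
    i = min(int(x), len(field) - 2)
    t = x - i
    return (1.0 - t) * field[i] + t * field[i + 1]

def compose(u_ab, u_bc):
    """Displacement a->c on the grid of a: u_ac(x) = u_ab(x) + u_bc(x + u_ab(x))."""
    return [u_ab[x] + interp(u_bc, x + u_ab[x]) for x in range(len(u_ab))]

def compose_chain(fields):
    """Compose session-to-session displacements into one first-to-last displacement."""
    total = fields[0]
    for nxt in fields[1:]:
        total = compose(total, nxt)
    return total

grid = range(11)
u_01 = [0.5 for _ in grid]      # session 0 -> 1: uniform shift
u_12 = [0.1 * x for x in grid]  # session 1 -> 2: linear stretch
u_02 = compose_chain([u_01, u_12])
```

The second field is evaluated at the already-displaced position, so simply adding the two fields pointwise would give a different (and generally wrong) result whenever the later deformation varies in space.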

[CV-59] RefHCM: A Unified Model for Referring Perceptions in Human-Centric Scenarios

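[Quick read]: This paper addresses a limitation of existing human-centric work in human-computer interaction and visual analysis: it is confined to the visual domain and does not interact with human instructions. The key to the solution is RefHCM (Referring Human-Centric Model), a unified framework that integrates a wide range of human-centric referring tasks. RefHCM uses sequence mergers to convert raw multimodal data (images, text, coordinates, and parsing maps) into semantic tokens, which lets it recast diverse human-centric referring tasks in a sequence-to-sequence paradigm handled by a plain encoder-decoder transformer. This unified learning strategy facilitates knowledge transfer across tasks and yields unforeseen capabilities in complex reasoning.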

Link: https://arxiv.org/abs/2412.14643
Authors: Jie Huang, Ruibing Hou, Jiahe Zhao, Hong Chang, Shiguang Shan
Affiliations: Key Laboratory of Intelligent Information Processing, Institute of Computing Technology (ICT), Chinese Academy of Sciences (CAS), Beijing, 100190, China; University of Chinese Academy of Sciences, Beijing, 100049, China
Keywords: Referring Human Perceptions, Human-centric perceptions play, real-world applications, human-centric referring tasks, play a crucial
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 13 pages

Abstract: Human-centric perceptions play a crucial role in real-world applications. While recent human-centric works have achieved impressive progress, these efforts are often constrained to the visual domain and lack interaction with human instructions, limiting their applicability in broader scenarios such as chatbots and sports analysis. This paper introduces Referring Human Perceptions, where a referring prompt specifies the person of interest in an image. To tackle the new task, we propose RefHCM (Referring Human-Centric Model), a unified framework to integrate a wide range of human-centric referring tasks. Specifically, RefHCM employs sequence mergers to convert raw multimodal data – including images, text, coordinates, and parsing maps – into semantic tokens. This standardized representation enables RefHCM to reformulate diverse human-centric referring tasks into a sequence-to-sequence paradigm, solved using a plain encoder-decoder transformer architecture. Benefiting from a unified learning strategy, RefHCM effectively facilitates knowledge transfer across tasks and exhibits unforeseen capabilities in handling complex reasoning. This work represents the first attempt to address referring human perceptions with a general-purpose framework, while simultaneously establishing a corresponding benchmark that sets new standards for the field. Extensive experiments showcase RefHCM’s competitive and even superior performance across multiple human-centric referring tasks. The code and data are publicly available at this https URL.

[CV-60] Adaptive Prompt Tuning: Vision Guided Prompt Tuning with Cross-Attention for Fine-Grained Few-Shot Learning

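[Quick read]: This paper addresses few-shot, fine-grained classification in computer vision, where subtle class distinctions must be learned from limited data. The key to the solution is to enhance the Contrastive Language-Image Pre-Training (CLIP) model with adaptive prompt tuning, using a cross-attention mechanism guided by real-time visual input to refine the text prompts dynamically. Unlike existing methods such as Context Optimization (CoOp) and Visual Prompt Tuning (VPT), which are constrained by static prompts or reliance on visual tokens, the method adapts the text prompts to the image at hand, giving an image-specific alignment of textual features with image patches and better classification on datasets with high intra-class variance and low inter-class differences. Integrating Monte-Carlo Dropout further improves the reliability of the predictions and of the uncertainty estimates.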

Link: https://arxiv.org/abs/2412.14640
Authors: Eric Brouwer, Jan Erik van Woerden, Gertjan Burghouts, Matias Valedenegro-Toro, Marco Zullich
Affiliations: University of Groningen; TNO
Keywords: differentiate subtle class, subtle class distinctions, computer vision poses, poses significant challenges, significant challenges due
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:Few-shot, fine-grained classification in computer vision poses significant challenges due to the need to differentiate subtle class distinctions with limited data. This paper presents a novel method that enhances the Contrastive Language-Image Pre-Training (CLIP) model through adaptive prompt tuning, guided by real-time visual inputs. Unlike existing techniques such as Context Optimization (CoOp) and Visual Prompt Tuning (VPT), which are constrained by static prompts or visual token reliance, the proposed approach leverages a cross-attention mechanism to dynamically refine text prompts for the image at hand. This enables an image-specific alignment of textual features with image patches extracted from the Vision Transformer, making the model more effective for datasets with high intra-class variance and low inter-class differences. The method is evaluated on several datasets, including CUBirds, Oxford Flowers, and FGVC Aircraft, showing significant performance gains over static prompt tuning approaches. To ensure these performance gains translate into trustworthy predictions, we integrate Monte-Carlo Dropout in our approach to improve the reliability of the model predictions and uncertainty estimates. This integration provides valuable insights into the model’s predictive confidence, helping to identify when predictions can be trusted and when additional verification is necessary. This dynamic approach offers a robust solution, advancing the state-of-the-art for few-shot fine-grained classification.
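
A bare-bones view of refining text prompts with cross-attention: prompt tokens act as queries and image patches as keys and values. This sketch assumes a single head, no learned projections, and a residual update; those simplifications and all names are assumptions rather than the paper's architecture.

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(prompt_tokens, patches):
    """Each prompt token (query) attends over image patches (keys and values)."""
    dim = len(patches[0])
    refined = []
    for q in prompt_tokens:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(dim) for k in patches]
        weights = softmax(scores)
        context = [sum(w * p[d] for w, p in zip(weights, patches)) for d in range(dim)]
        # Residual update: the static prompt plus an image-specific correction.
        refined.append([qi + ci for qi, ci in zip(q, context)])
    return refined

prompt = [[1.0, 0.0], [0.0, 1.0]]   # toy learnable context tokens
patches = [[2.0, 0.0], [0.0, 2.0]]  # toy ViT patch embeddings
refined = cross_attention(prompt, patches)
```

With a static method such as CoOp the prompt tokens would be identical for every image; here a different set of patches yields a different refined prompt.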

[CV-61] Progressive Fine-to-Coarse Reconstruction for Accurate Low-Bit Post-Training Quantization in Vision Transformers

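[Quick read]: This paper addresses the significant performance drop of Vision Transformers (ViTs) under low-bit quantization. The key to the solution is Progressive Fine-to-Coarse Reconstruction (PFCR): reconstruction starts at the finest granularity (multi-head self-attention and multi-layer perceptron modules together with their shortcuts), and the reconstructed units are then combined step by step into coarser blocks that are reconstructed in turn. A Progressive Optimization Strategy (POS) eases training and further improves performance. On ImageNet the method achieves the best Top-1 accuracy among state-of-the-art methods, reaching 75.61% for 3-bit quantized ViT-B, and results on COCO show that it generalizes to object detection and instance segmentation.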

Link: https://arxiv.org/abs/2412.14633
Authors: Rui Ding, Liang Yong, Sihuan Zhao, Jing Nie, Lihui Chen, Haijun Liu, Xichuan Zhou
Affiliations: Chongqing University; Chongqing University; Unknown; Chongqing University; Unknown; Unknown; Chongqing University
Keywords: compressing Vision Transformers, widely adopted, adopted for compressing, Vision Transformers, Due
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:Due to its efficiency, Post-Training Quantization (PTQ) has been widely adopted for compressing Vision Transformers (ViTs). However, when quantized into low-bit representations, there is often a significant performance drop compared to their full-precision counterparts. To address this issue, reconstruction methods have been incorporated into the PTQ framework to improve performance in low-bit quantization settings. Nevertheless, existing related methods predefine the reconstruction granularity and seldom explore the progressive relationships between different reconstruction granularities, which leads to sub-optimal quantization results in ViTs. To this end, in this paper, we propose a Progressive Fine-to-Coarse Reconstruction (PFCR) method for accurate PTQ, which significantly improves the performance of low-bit quantized vision transformers. Specifically, we define multi-head self-attention and multi-layer perceptron modules along with their shortcuts as the finest reconstruction units. After reconstructing these two fine-grained units, we combine them to form coarser blocks and reconstruct them at a coarser granularity level. We iteratively perform this combination and reconstruction process, achieving progressive fine-to-coarse reconstruction. Additionally, we introduce a Progressive Optimization Strategy (POS) for PFCR to alleviate the difficulty of training, thereby further enhancing model performance. Experimental results on the ImageNet dataset demonstrate that our proposed method achieves the best Top-1 accuracy among state-of-the-art methods, particularly attaining 75.61% for 3-bit quantized ViT-B in PTQ. Besides, quantization results on the COCO dataset reveal the effectiveness and generalization of our proposed method on other computer vision tasks like object detection and instance segmentation.
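
The fine-to-coarse ordering can be written down as a schedule of reconstruction units. The pairwise merge rule and the unit names below are assumptions chosen for illustration; the abstract only states that units are combined iteratively into coarser blocks.

```python
def finest_units(num_blocks):
    """Finest units: the attention and MLP modules (with shortcuts) of each block."""
    units = []
    for b in range(num_blocks):
        units.append([f"block{b}.attn"])
        units.append([f"block{b}.mlp"])
    return units

def coarsen(units):
    """Merge adjacent units pairwise (assumed merge rule) into the next, coarser level."""
    merged = []
    for i in range(0, len(units), 2):
        merged.append(sum(units[i:i + 2], []))
    return merged

def reconstruction_schedule(num_blocks):
    """List of granularity levels, finest first, each a list of units to reconstruct."""
    level = finest_units(num_blocks)
    schedule = [level]
    while len(level) > 1:
        level = coarsen(level)
        schedule.append(level)
    return schedule

schedule = reconstruction_schedule(num_blocks=2)
```

For two blocks this gives three levels: four module-level units, then two block-level units, then one unit spanning both blocks. In an actual PTQ pipeline each unit would be calibrated against its full-precision output before moving to the next level.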

[CV-62] Review of Fruit Tree Image Segmentation

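[Quick read]: This paper reviews fruit tree image segmentation for automating agricultural tasks, restricted to front views of fruit trees. Its core contribution is a systematic review of 158 relevant papers organized by a taxonomy that considers, in sequence, the method, image, task, and fruit, helping readers grasp the overall state of the field. The review finds that the most noticeable deficiency of previous studies is the lack of a versatile dataset and segmentation model applicable to a variety of tasks and environments. It suggests six important future research tasks aimed at building a versatile tree segmentation module.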

Link: https://arxiv.org/abs/2412.14631
Authors: Il-Seok Oh
Affiliations: Unknown
Keywords: essential problem, problem in automating, Fruit tree image, automating a variety, variety of agricultural
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Fruit tree image segmentation is an essential problem in automating a variety of agricultural tasks such as phenotyping, harvesting, spraying, and pruning. Many research papers have proposed a diverse spectrum of solutions suitable to specific tasks and environments. The review scope of this paper is confined to the front views of fruit trees and based on 158 relevant papers collected using a newly designed crawling review method. These papers are systematically reviewed based on a taxonomy that sequentially considers the method, image, task, and fruit. This taxonomy will assist readers to intuitively grasp the big picture of these research activities. Our review reveals that the most noticeable deficiency of the previous studies was the lack of a versatile dataset and segmentation model that could be applied to a variety of tasks and environments. Six important future research tasks are suggested, with the expectation that these will pave the way to building a versatile tree segmentation module.

[CV-63] Unified Image Restoration and Enhancement: Degradation Calibrated Cycle Reconstruction Diffusion Model

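[Quick read]: This paper addresses the unification of image restoration and enhancement tasks and proposes a new framework, CycleRDM. Drawing on the iterative refinement capability of diffusion models, CycleRDM first learns the mappings among the degraded domain, the rough normal domain, and the normal domain through a two-stage diffusion inference process. It then moves the final calibration into the wavelet low-frequency domain via the discrete wavelet transform, performing fine-grained calibration from a frequency-domain perspective using task-specific frequency spaces. A feature gain module for the decomposed wavelet high-frequency domain removes redundant features, and multimodal textual prompts together with the Fourier transform stabilize denoising and reduce randomness during inference. The method generalizes to a wide range of restoration and enhancement tasks and, with only a small number of training samples, significantly outperforms existing methods on benchmarks of reconstruction and perceptual quality.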

Link: https://arxiv.org/abs/2412.14630
Authors: Minglong Xue, Jinhong He, Shivakumara Palaiahnakote, Mingliang Zhou
Affiliations: Unknown
Keywords: computer vision applications, numerous computer vision, tasks efficiently remains, vision applications, significant challenge
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Image restoration and enhancement are pivotal for numerous computer vision applications, yet unifying these tasks efficiently remains a significant challenge. Inspired by the iterative refinement capabilities of diffusion models, we propose CycleRDM, a novel framework designed to unify restoration and enhancement tasks while achieving high-quality mapping. Specifically, CycleRDM first learns the mapping relationships among the degraded domain, the rough normal domain, and the normal domain through a two-stage diffusion inference process. Subsequently, we transfer the final calibration process to the wavelet low-frequency domain using discrete wavelet transform, performing fine-grained calibration from a frequency domain perspective by leveraging task-specific frequency spaces. To improve restoration quality, we design a feature gain module for the decomposed wavelet high-frequency domain to eliminate redundant features. Additionally, we employ multimodal textual prompts and Fourier transform to drive stable denoising and reduce randomness during the inference process. After extensive validation, CycleRDM can be effectively generalized to a wide range of image restoration and enhancement tasks while requiring only a small number of training samples to be significantly superior on various benchmarks of reconstruction quality and perceptual quality. The source code will be available at this https URL.
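
For readers unfamiliar with the wavelet split the abstract relies on, the sketch below applies a single-level 1D Haar transform, which separates a signal into a low-frequency and a high-frequency sub-band. The choice of the Haar wavelet and the 1D setting are assumptions for brevity; the abstract does not say which wavelet CycleRDM uses.

```python
import math

def haar_dwt(signal):
    """Single-level Haar DWT of an even-length signal: (low, high) sub-bands."""
    s = math.sqrt(2.0)
    low = [(signal[i] + signal[i + 1]) / s for i in range(0, len(signal), 2)]
    high = [(signal[i] - signal[i + 1]) / s for i in range(0, len(signal), 2)]
    return low, high

def haar_idwt(low, high):
    """Inverse transform: rebuild the signal from its two sub-bands."""
    s = math.sqrt(2.0)
    out = []
    for l, h in zip(low, high):
        out.extend([(l + h) / s, (l - h) / s])
    return out

row = [4.0, 4.0, 2.0, 6.0]       # one toy image row
low, high = haar_dwt(row)        # low: local averages, high: local differences
restored = haar_idwt(low, high)  # perfect reconstruction up to rounding
```

The low band carries the smooth content (where the abstract places the final calibration), while the high band carries edges and fine detail (where the feature gain module operates).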

[CV-64] Robust PCA Based on Adaptive Weighted Least Squares and Low-Rank Matrix Factorization

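[Quick read]: This paper addresses the bias and suboptimal estimates that traditional Robust Principal Component Analysis (RPCA) methods incur from $\ell_1$-norm regularization when the data contain significant noise or outliers. The key to the solution is a new RPCA model that combines adaptive weighted least squares (AWLS) with low-rank matrix factorization (LRMF). Its weight update uses a self-attention-inspired mechanism that adjusts the weight matrix dynamically to emphasize significant components, and the sparse component is handled with a weighted F-norm, which reduces bias and simplifies computation. An alternating minimization algorithm in which every subproblem has an explicit solution improves computational efficiency. Experiments show that the method outperforms existing non-convex regularization approaches in performance and stability, with better accuracy and robustness in practical applications.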

Link: https://arxiv.org/abs/2412.14629
Authors: Kexin Li, You-wei Wen, Xu Xiao, Mingchao Zhao
Affiliations: School of Statistics and Mathematics, Yunnan University of Finance and Economics, Kunming, Yunnan, China; Key Laboratory of Computing and Stochastic Mathematics (LCSM), School of Mathematics and Statistics, Hunan Normal University, Changsha, Hunan, China; School of Mathematics and Statistics, Guangxi Normal University, Guilin, Guangxi, China
Keywords: Principal Component Analysis, Robust Principal Component, Robust Principal, Component Analysis, anomaly detection
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract: Robust Principal Component Analysis (RPCA) is a fundamental technique for decomposing data into low-rank and sparse components, which plays a critical role for applications such as image processing and anomaly detection. Traditional RPCA methods commonly use $\ell_1$-norm regularization to enforce sparsity, but this approach can introduce bias and result in suboptimal estimates, particularly in the presence of significant noise or outliers. Non-convex regularization methods have been proposed to mitigate these challenges, but they tend to be complex to optimize and sensitive to initial conditions, leading to potential instability in solutions. To overcome these challenges, in this paper, we propose a novel RPCA model that integrates adaptive weighted least squares (AWLS) and low-rank matrix factorization (LRMF). The model employs a self-attention-inspired mechanism in its weight update process, allowing the weight matrix to dynamically adjust and emphasize significant components during each iteration. By employing a weighted F-norm for the sparse component, our method effectively reduces bias while simplifying the computational process compared to traditional $\ell_1$-norm-based methods. We use an alternating minimization algorithm, where each subproblem has an explicit solution, thereby improving computational efficiency. Despite its simplicity, numerical experiments demonstrate that our method outperforms existing non-convex regularization approaches, offering superior performance and stability, as well as enhanced accuracy and robustness in practical applications.
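
Why a weighted squared penalty can be less biased than the l1 penalty is easiest to see on a single entry. The sketch compares the two closed-form updates for one residual value; the objective written in each docstring and the reweighting rule `toy_weight` are assumptions for illustration, not the paper's self-attention-inspired update.

```python
def soft_threshold(r, lam):
    """Solution of min_s 0.5*(r - s)**2 + lam*abs(s): the l1 proximal step."""
    if r > lam:
        return r - lam
    if r < -lam:
        return r + lam
    return 0.0

def weighted_ridge(r, w, lam):
    """Solution of min_s 0.5*(r - s)**2 + 0.5*lam*(w*s)**2 (weighted squared penalty)."""
    return r / (1.0 + lam * w * w)

def toy_weight(r, eps=1e-3):
    """Hypothetical reweighting: large residuals (likely outliers) get small weights."""
    return 1.0 / (abs(r) + eps)

residual, lam = 5.0, 1.0
s_l1 = soft_threshold(residual, lam)                         # shrunk by exactly lam
s_wls = weighted_ridge(residual, toy_weight(residual), lam)  # shrunk far less
```

The l1 step subtracts `lam` from every large entry regardless of its size, which is the bias the abstract refers to. The weighted update is a one-line division, consistent with the claim that each subproblem has an explicit solution.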

[CV-65] Qua2SeDiMo: Quantifiable Quantization Sensitivity of Diffusion Models AAAI2025

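[Quick read]: This paper addresses how sensitive different weight layers, operations, and architecture types in Diffusion Models (DM) are to quantization, especially as denoisers evolve from convolutional U-Nets toward Transformer architectures. The key to the solution is Qua²SeDiMo, a mixed-precision Post-Training Quantization framework that produces explainable insights into the cost-effectiveness of different weight quantization methods and uses them to make high-quality mixed-precision decisions for diffusion models ranging from foundational U-Nets to state-of-the-art Transformers. Qua²SeDiMo achieves 3.4-bit to 3.9-bit weight quantization on PixArt-α, PixArt-Σ, Hunyuan-DiT, and SDXL; paired with 6-bit activation quantization, it outperforms existing approaches on quantitative metrics and generated image quality.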

Link: https://arxiv.org/abs/2412.14628
Authors: Keith G. Mills, Mohammad Salameh, Ruichen Chen, Negar Hassanpour, Wei Lu, Di Niu
Affiliations: University of Alberta; University of Waterloo; University of British Columbia
Keywords: iterative denoising process, denoising process, iterative denoising, Quantization, Diffusion Models
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: AAAI 2025; version includes supplementary material; 22 Pages, 18 Figures, 8 Tables

Abstract: Diffusion Models (DM) have democratized AI image generation through an iterative denoising process. Quantization is a major technique to alleviate the inference cost and reduce the size of DM denoiser networks. However, as denoisers evolve from variants of convolutional U-Nets toward newer Transformer architectures, it is of growing importance to understand the quantization sensitivity of different weight layers, operations and architecture types to performance. In this work, we address this challenge with Qua²SeDiMo, a mixed-precision Post-Training Quantization framework that generates explainable insights on the cost-effectiveness of various model weight quantization methods for different denoiser operation types and block structures. We leverage these insights to make high-quality mixed-precision quantization decisions for a myriad of diffusion models ranging from foundational U-Nets to state-of-the-art Transformers. As a result, Qua²SeDiMo can construct 3.4-bit, 3.9-bit, 3.65-bit and 3.7-bit weight quantization on PixArt-α, PixArt-Σ, Hunyuan-DiT and SDXL, respectively. We further pair our weight-quantization configurations with 6-bit activation quantization and outperform existing approaches in terms of quantitative metrics and generative image quality.
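
Fractional figures such as "3.4-bit" arise because mixed precision assigns different bit-widths to different layers. A plausible reading, assumed here, is a parameter-weighted average; the layer sizes in the example are invented to show the arithmetic and are not taken from the paper.

```python
def average_bits(layers):
    """Parameter-weighted mean bit-width; `layers` holds (num_params, bits) pairs."""
    total_params = sum(n for n, _ in layers)
    return sum(n * b for n, b in layers) / total_params

# Hypothetical split: 60% of the weights at 3 bits, 40% at 4 bits.
config = [(600_000, 3), (400_000, 4)]
avg = average_bits(config)
```

Under this reading, a 3.4-bit model stores 60% of its weights at 3 bits and the rest at 4 bits (or any other mix with the same weighted mean), so the sensitivity analysis decides which layers deserve the extra bit.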

[CV-66] FRIDAY: Mitigating Unintentional Facial Identity in Deepfake Detectors Guided by Facial Recognizers

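[Quick read]: This paper addresses the sharp drop in performance of existing Deepfake detection methods on new synthesis techniques; detectors often base their decisions on facial identity rather than synthesis artifacts, which hurts results on cross-domain datasets. The key to the solution is a new training method, Facial Recognition Identity Attenuation (FRIDAY), which uses a face recognizer to reduce the influence of facial identity. A recognizer sharing the detector's backbone is trained first and then frozen; during detector training, each input image is fed to both networks and the similarity of their feature embeddings is minimized through a Facial Identity Attenuating loss. This pushes the detector toward embeddings distinct from the recognizer's and reduces the impact of identity information. Experiments show significant gains on both in-domain and cross-domain datasets.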

Link: https://arxiv.org/abs/2412.14623
Authors: Younhun Kim, Myung-Joon Kwon, Wonjun Lee, Changick Kim
Affiliations: Graduate School of Green Growth and Sustainability, KAIST; School of Electrical Engineering, KAIST
Keywords: Previous Deepfake detection, Previous Deepfake, facial identity, effectiveness diminishes significantly, Recognition Identity Attenuation
Categories: Computer Vision and Pattern Recognition (cs.CV); Cryptography and Security (cs.CR)
Comments: 5 pages, 4 figures. In 2024 IEEE International Conference on Visual Communications and Image Processing (VCIP) Oral

Abstract:Previous Deepfake detection methods perform well within their training domains, but their effectiveness diminishes significantly with new synthesis techniques. Recent studies have revealed that detection models often create decision boundaries based on facial identity rather than synthetic artifacts, resulting in poor performance on cross-domain datasets. To address this limitation, we propose Facial Recognition Identity Attenuation (FRIDAY), a novel training method that mitigates facial identity influence using a face recognizer. Specifically, we first train a face recognizer using the same backbone as the Deepfake detector. The recognizer is then frozen and employed during the detector’s training to reduce facial identity information. This is achieved by feeding input images into both the recognizer and the detector, and minimizing the similarity of their feature embeddings through our Facial Identity Attenuating loss. This process encourages the detector to generate embeddings distinct from the recognizer, effectively reducing the impact of facial identity. Extensive experiments demonstrate that our approach significantly enhances detection performance on both in-domain and cross-domain datasets.
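
The training objective described above can be sketched as a classification loss plus a similarity penalty. The exact form of the Facial Identity Attenuating loss is not given in the abstract; the squared cosine similarity, the `weight` parameter, and the function names below are assumptions.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def bce(p, label):
    """Binary cross-entropy for a predicted fake-probability p in (0, 1)."""
    return -(label * math.log(p) + (1 - label) * math.log(1 - p))

def detector_loss(p_fake, label, det_emb, rec_emb, weight=1.0):
    """Classification loss plus an identity-attenuation term.

    Assumption: the attenuation term is the squared cosine similarity, so it is
    smallest when the detector embedding is orthogonal to the frozen recognizer's.
    """
    return bce(p_fake, label) + weight * cosine(det_emb, rec_emb) ** 2

rec = [1.0, 0.0]  # embedding from the frozen face recognizer
entangled = detector_loss(0.9, 1, [1.0, 0.0], rec)  # detector mirrors identity
decoupled = detector_loss(0.9, 1, [0.0, 1.0], rec)  # detector ignores identity
```

Both cases make the same prediction, yet the detector whose embedding mirrors the recognizer's pays a higher loss, so gradient descent is steered away from identity cues. Only the detector would receive gradients, since the recognizer is frozen.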

[CV-67] Pitfalls of topology-aware image segmentation

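[Quick read]: This paper addresses flaws in how topological correctness is evaluated in medical image segmentation; flawed benchmarking practices limit the real-world applicability of existing topology-aware methods. The key to the solution is to identify and correct critical evaluation pitfalls: inadequate connectivity choices, overlooked topological artifacts in ground-truth annotations, and inappropriate use of evaluation metrics. Through detailed empirical analysis, the paper shows how strongly these issues affect the evaluation and ranking of segmentation methods and offers concrete recommendations for fair and robust evaluation standards.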

Link: https://arxiv.org/abs/2412.14619
Authors: Alexander H. Berger, Laurin Lux, Alexander Weers, Martin Menten, Daniel Rueckert, Johannes C. Paetzold
Affiliations: Unknown
Keywords: medical imaging tasks, characteristics of shape, imaging tasks, preservation of structural, structural integrity
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Code is available at this https URL

Abstract:Topological correctness, i.e., the preservation of structural integrity and specific characteristics of shape, is a fundamental requirement for medical imaging tasks, such as neuron or vessel segmentation. Despite the recent surge in topology-aware methods addressing this challenge, their real-world applicability is hindered by flawed benchmarking practices. In this paper, we identify critical pitfalls in model evaluation that include inadequate connectivity choices, overlooked topological artifacts in ground truth annotations, and inappropriate use of evaluation metrics. Through detailed empirical analysis, we uncover these issues’ profound impact on the evaluation and ranking of segmentation methods. Drawing from our findings, we propose a set of actionable recommendations to establish fair and robust evaluation standards for topology-aware medical image segmentation methods.
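
The first pitfall, the connectivity choice, is easy to reproduce. The sketch below counts foreground components in a small binary mask under 4- and 8-connectivity; the mask and function name are made up for the example and are unrelated to the paper's code.

```python
from collections import deque

def count_components(mask, connectivity=4):
    """Count foreground components in a 2D binary mask under 4- or 8-connectivity."""
    steps = [(-1, 0), (1, 0), (0, -1), (0, 1)]
    if connectivity == 8:
        steps += [(-1, -1), (-1, 1), (1, -1), (1, 1)]
    rows, cols = len(mask), len(mask[0])
    seen = [[False] * cols for _ in range(rows)]
    count = 0
    for r in range(rows):
        for c in range(cols):
            if mask[r][c] and not seen[r][c]:
                count += 1  # new component: flood-fill it with BFS
                seen[r][c] = True
                queue = deque([(r, c)])
                while queue:
                    cr, cc = queue.popleft()
                    for dr, dc in steps:
                        nr, nc = cr + dr, cc + dc
                        if (0 <= nr < rows and 0 <= nc < cols
                                and mask[nr][nc] and not seen[nr][nc]):
                            seen[nr][nc] = True
                            queue.append((nr, nc))
    return count

# A thin diagonal "vessel": one structure or three, depending on the convention.
vessel = [[1, 0, 0],
          [0, 1, 0],
          [0, 0, 1]]
n4 = count_components(vessel, connectivity=4)
n8 = count_components(vessel, connectivity=8)
```

The same mask yields three components under 4-connectivity and one under 8-connectivity, so any component-count-based topology score changes with this setting alone, before any model is compared.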

[CV-68] Successive optimization of optics and post-processing with differentiable coherent PSF operator and field information

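[Quick read]: This paper addresses the limitation that existing ray-based methods can only optimize geometric degradation, which makes it hard to represent the optical characteristics of complex, miniaturized lenses constrained by wavefront aberration or diffraction effects. The key to the solution is a precise, fully differentiable optical simulation model that uses a novel initial value strategy to make intersection calculations on highly aspheric surfaces more reliable and a differential operator to reduce memory consumption when computing the coherent point spread function. A joint optimization procedure that leverages field information, guided by a general restoration network, improves image quality and successively improves the optical performance of multiple lenses that are already at a professional level. The pipeline offers new insight into the design of sophisticated optical systems and post-processing algorithms.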

Link: https://arxiv.org/abs/2412.14603
Authors: Zheng Ren, Jingwen Zhou, Wenguan Zhang, Jiapu Yan, Bingkun Chen, Huajun Feng, Shiqi Chen
Affiliations: State Key Laboratory of Extreme Photonics and Instrumentation, College of Optical Science and Engineering, Zhejiang University; National Natural Science Foundation of China
Keywords: showing significant potential, significant potential, showing significant, Recently, optical
Categories: Computer Vision and Pattern Recognition (cs.CV); Optics (physics.optics)
Comments:

Abstract: Recently, the joint design of optical systems and downstream algorithms has shown significant potential. However, existing rays-described methods are limited to optimizing geometric degradation, making it difficult to fully represent the optical characteristics of complex, miniaturized lenses constrained by wavefront aberration or diffraction effects. In this work, we introduce a precise optical simulation model in which every operation in the pipeline is differentiable. This model employs a novel initial value strategy to enhance the reliability of intersection calculation on high aspherics. Moreover, it utilizes a differential operator to reduce memory consumption during coherent point spread function calculations. To efficiently address various degradations, we design a joint optimization procedure that leverages field information. Guided by a general restoration network, the proposed method not only enhances the image quality, but also successively improves the optical performance across multiple lenses that are already at a professional level. This joint optimization pipeline offers innovative insights into the practical design of sophisticated optical systems and post-processing algorithms. The source code will be made publicly available at this https URL

[CV-69] Can We Get Rid of Handcrafted Feature Extractors? SparseViT: Nonsemantics-Centered Parameter-Efficient Image Manipulation Localization Through Spare-Coding Transformer AAAI

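[Quick read]: This paper addresses the adaptive extraction of non-semantic features for Image Manipulation Localization (IML). Existing methods rely on handcrafted feature extractors, which limits generalization in unseen or complex scenarios. The key to the solution is the Sparse Vision Transformer (SparseViT), which reformulates the dense, global self-attention of ViT into a sparse, discrete form, breaking image semantics and forcing the model to extract non-semantic features adaptively. Compared with existing IML models, the sparse self-attention also reduces model size substantially (up to 80% in FLOPs), improving parameter efficiency and lowering computation, while showing better generalization and efficiency on benchmark datasets.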

Link: https://arxiv.org/abs/2412.14598
Authors: Lei Su, Xiaochen Ma, Xuekang Zhu, Chaoqun Niu, Zeyu Lei, Ji-Zhe Zhou
Affiliations: Sichuan University; Tsinghua University; Peking University
Keywords: Image Manipulation Localization, Non-semantic features, Manipulation Localization, extract non-semantic features, Non-semantic
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 12 page, 8 figures, published to AAAI

Abstract:Non-semantic features or semantic-agnostic features, which are irrelevant to image context but sensitive to image manipulations, are recognized as evidential to Image Manipulation Localization (IML). Since manual labels are impossible, existing works rely on handcrafted methods to extract non-semantic features. Handcrafted non-semantic features jeopardize IML model’s generalization ability in unseen or complex scenarios. Therefore, for IML, the elephant in the room is: How to adaptively extract non-semantic features? Non-semantic features are context-irrelevant and manipulation-sensitive. That is, within an image, they are consistent across patches unless manipulation occurs. Then, spare and discrete interactions among image patches are sufficient for extracting non-semantic features. However, image semantics vary drastically on different patches, requiring dense and continuous interactions among image patches for learning semantic representations. Hence, in this paper, we propose a Sparse Vision Transformer (SparseViT), which reformulates the dense, global self-attention in ViT into a sparse, discrete manner. Such sparse self-attention breaks image semantics and forces SparseViT to adaptively extract non-semantic features for images. Besides, compared with existing IML models, the sparse self-attention mechanism largely reduced the model size (max 80% in FLOPs), achieving stunning parameter efficiency and computation reduction. Extensive experiments demonstrate that, without any handcrafted feature extractors, SparseViT is superior in both generalization and efficiency across benchmark datasets.
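
The sparse, discrete self-attention described above can be pictured with a small index-grouping sketch. It assumes, purely for illustration, that patches are grouped by a fixed stride so that each patch only attends to non-adjacent patches in its own group; the function names and the strided grouping rule are hypothetical and are not taken from the SparseViT code.

```python
from collections import defaultdict


def sparse_attention_groups(height, width, stride):
    """Group patch indices of a height x width grid by (row % stride, col % stride).

    Illustrative assumption: attention is computed only inside each group,
    so every patch interacts with a strided, non-adjacent subset of patches.
    """
    if height % stride or width % stride:
        raise ValueError("grid size must be divisible by stride")
    groups = defaultdict(list)
    for row in range(height):
        for col in range(width):
            groups[(row % stride, col % stride)].append(row * width + col)
    return list(groups.values())


def attention_pair_count(groups):
    """Count query-key pairs when attention is restricted to each group."""
    return sum(len(group) ** 2 for group in groups)


# Dense global attention is the special case of a single group with all patches.
dense_pairs = attention_pair_count([list(range(8 * 8))])
sparse_groups = sparse_attention_groups(8, 8, stride=2)
sparse_pairs = attention_pair_count(sparse_groups)
```

With a stride of 2 on an 8 x 8 grid, the number of query-key pairs drops by a factor of four in this toy setting, which gives a feel for where a reduction in attention cost can come from. The 80% FLOPs figure in the abstract refers to the full model and is not reproduced by this sketch.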

[CV-70] Multi-Sensor Object Anomaly Detection: Unifying Appearance, Geometry, and Internal Properties

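[Quick read]: This paper addresses the limitation of traditional single-sensor methods in industrial quality inspection, which fail to capture the full range of anomaly types. The key to the solution is MulSen-AD, a high-resolution multi-sensor anomaly detection dataset that combines data from RGB cameras, laser scanners, and lock-in infrared thermography to capture external appearance, geometric deformations, and internal defects. The paper also proposes MulSen-TripleAD, a decision-level fusion algorithm that integrates the three modalities for robust unsupervised object anomaly detection and reaches 96.1% AUROC in object-level detection.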

Link: https://arxiv.org/abs/2412.14592
Authors: Wenqiao Li, Bozhong Zheng, Xiaohao Xu, Jinye Gan, Fading Lu, Xiang Li, Na Ni, Zheng Tian, Xiaonan Huang, Shenghua Gao, Yingna Wu
Affiliations: ShanghaiTech University; University of Michigan, Ann Arbor; The University of Hong Kong
Keywords: face critical limitations, industrial quality inspection, methods face critical, quality inspection, critical limitations
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:

Abstract:Object anomaly detection is essential for industrial quality inspection, yet traditional single-sensor methods face critical limitations. They fail to capture the wide range of anomaly types, as single sensors are often constrained to either external appearance, geometric structure, or internal properties. To overcome these challenges, we introduce MulSen-AD, the first high-resolution, multi-sensor anomaly detection dataset tailored for industrial applications. MulSen-AD unifies data from RGB cameras, laser scanners, and lock-in infrared thermography, effectively capturing external appearance, geometric deformations, and internal defects. The dataset spans 15 industrial products with diverse, real-world anomalies. We also present MulSen-AD Bench, a benchmark designed to evaluate multi-sensor methods, and propose MulSen-TripleAD, a decision-level fusion algorithm that integrates these three modalities for robust, unsupervised object anomaly detection. Our experiments demonstrate that multi-sensor fusion substantially outperforms single-sensor approaches, achieving 96.1% AUROC in object-level detection accuracy. These results highlight the importance of integrating multi-sensor data for comprehensive industrial anomaly detection.
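
Decision-level fusion and the AUROC metric mentioned above can be sketched in a few lines. The sketch assumes that each modality produces one anomaly score per object, that scores are min-max normalized per modality, and that fusion is a plain average. The actual MulSen-TripleAD fusion rule may differ, and all numbers below are made up for illustration.

```python
def min_max(scores):
    """Rescale one modality's scores to [0, 1] so modalities are comparable."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]


def fuse_scores(per_modality_scores):
    """Decision-level fusion: average the normalized per-object scores."""
    normalized = [min_max(scores) for scores in per_modality_scores]
    return [sum(values) / len(values) for values in zip(*normalized)]


def auroc(scores, labels):
    """AUROC as the fraction of (anomalous, normal) pairs ranked correctly."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one normal and one anomalous sample")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))


# Hypothetical scores for four objects (label 1 = anomalous).
labels = [0, 0, 1, 1]
rgb = [0.1, 0.2, 0.15, 0.9]          # misses the geometric defect on object 3
point_cloud = [1.0, 1.2, 3.0, 1.1]   # catches the geometric defect
infrared = [5.0, 6.0, 5.5, 9.0]
fused = fuse_scores([rgb, point_cloud, infrared])
```

In this toy case the RGB scores alone rank one anomalous object below a normal one, while the fused scores separate the two groups, which mirrors the motivation for combining sensors.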

[CV-71] Spike2Former: Efficient Spiking Transformer for High-performance Image Segmentation

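[Quick read]: This paper aims to address the poor performance of Spiking Neural Networks (SNNs) on image segmentation. The solution has two parts: first, it identifies the architectural modules that cause a severe reduction in spike firing, improves them, and proposes the Spike2Former architecture; second, it introduces normalized integer spiking neurons to stabilize the training of SNNs with complex architectures. With these changes, SNNs set a new state of the art on several semantic segmentation datasets, with gains of +12.7% mIoU on ADE20K, +14.3% on VOC2012, and +9.1% on CityScapes.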

Link: https://arxiv.org/abs/2412.14587
Authors: Zhenxin Lei, Man Yao, Jiakui Hu, Xinhao Luo, Yanye Lu, Bo Xu, Guoqi Li
Affiliations: Tsinghua University; Shanghai Jiao Tong University; University of Sydney
Keywords: Spiking Neural Networks, Neural Networks, converting neural networks, image segmentation tasks, low-power advantage
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Neural and Evolutionary Computing (cs.NE)
Comments: This work has been accepted on Association for the Advancement of Artificial Intelligence 2025

Abstract:Spiking Neural Networks (SNNs) have a low-power advantage but perform poorly in image segmentation tasks. The reason is that directly converting neural networks with complex architectural designs for segmentation tasks into spiking versions leads to performance degradation and non-convergence. To address this challenge, we first identify the modules in the architecture design that lead to the severe reduction in spike firing, make targeted improvements, and propose Spike2Former architecture. Second, we propose normalized integer spiking neurons to solve the training stability problem of SNNs with complex architectures. We set a new state-of-the-art for SNNs in various semantic segmentation datasets, with a significant improvement of +12.7% mIoU and 5.0× efficiency on ADE20K, +14.3% mIoU and 5.2× efficiency on VOC2012, and +9.1% mIoU and 6.6× efficiency on CityScapes.
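
The mIoU figures quoted above use the standard mean Intersection over Union metric. A minimal sketch of that metric on flattened label maps follows; the function name and the toy labels are illustrative, and benchmark toolkits usually accumulate a confusion matrix over the whole dataset rather than a single image.

```python
def mean_iou(pred, target, num_classes):
    """Mean Intersection over Union over classes that occur in pred or target.

    pred and target are flat sequences of integer class ids of equal length.
    """
    if len(pred) != len(target):
        raise ValueError("pred and target must have the same length")
    ious = []
    for cls in range(num_classes):
        intersection = sum(1 for p, t in zip(pred, target) if p == cls and t == cls)
        union = sum(1 for p, t in zip(pred, target) if p == cls or t == cls)
        if union:  # skip classes absent from both prediction and ground truth
            ious.append(intersection / union)
    if not ious:
        raise ValueError("no class present in pred or target")
    return sum(ious) / len(ious)


# Toy example with three classes on six pixels.
pred = [0, 0, 1, 1, 2, 2]
target = [0, 1, 1, 1, 2, 0]
score = mean_iou(pred, target, num_classes=3)
```

Here the per-class IoUs are 1/3, 2/3, and 1/2, so the mean is 0.5. A gain of "+12.7% mIoU" in the abstract is a difference of this quantity, expressed in percentage points, between two models.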

[CV-72] HiCM2: Hierarchical Compact Memory Modeling for Dense Video Captioning AAAI2025

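[Quick read]: This paper addresses dense video captioning (DVC), the automatic captioning and localization of events in untrimmed videos. The key to the solution is a hierarchical compact memory model that uses prior knowledge inspired by human memory hierarchy and cognition. Specifically, the paper builds a hierarchical memory and a hierarchical memory reading module, constructing the memory by clustering memory events and summarizing them with large language models, to mimic human-like memory recall. Experiments show that this hierarchical memory recall improves DVC performance, reaching state-of-the-art results on the YouCook2 and ViTT datasets.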

Link: https://arxiv.org/abs/2412.14585
Authors: Minkuk Kim, Hyeon Bae Kim, Jinyoung Moon, Jinwoo Choi, Seong Tae Kim
Affiliations: Korea Advanced Institute of Science and Technology (KAIST); Korea University
Keywords: dense video captioning, real-world video challenges, interest in dense, growing demand, demand for solutions
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: AAAI2025

Abstract:With the growing demand for solutions to real-world video challenges, interest in dense video captioning (DVC) has been on the rise. DVC involves the automatic captioning and localization of untrimmed videos. Several studies highlight the challenges of DVC and introduce improved methods utilizing prior knowledge, such as pre-training and external memory. In this research, we propose a model that leverages the prior knowledge of human-oriented hierarchical compact memory inspired by human memory hierarchy and cognition. To mimic human-like memory recall, we construct a hierarchical memory and a hierarchical memory reading module. We build an efficient hierarchical compact memory by employing clustering of memory events and summarization using large language models. Comparative experiments demonstrate that this hierarchical memory recall process improves the performance of DVC by achieving state-of-the-art performance on YouCook2 and ViTT datasets.

[CV-73] DiffSim: Taming Diffusion Models for Evaluating Visual Similarity

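[Quick read]: This paper addresses the limitations of traditional perceptual similarity metrics when assessing the similarity between generative model outputs and reference inputs, in particular their failure to capture mid-level similarities in image layout, object pose, and semantic content. The key to the solution is the finding, claimed as a first, that pretrained diffusion models can be used to measure visual similarity, which leads to the DiffSim method. By aligning features in the attention layers of the denoising U-Net, DiffSim evaluates both appearance and style similarity. It avoids the shortcomings of traditional metrics and of contrastive-learning-based CLIP and self-supervised DINO, which compress image features heavily and assess appearance details inadequately, and it aligns better with human visual preferences. The paper also introduces the Sref and IP benchmarks for style-level and instance-level visual similarity, respectively, and reports state-of-the-art performance for DiffSim across multiple benchmarks.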

Link: https://arxiv.org/abs/2412.14580
Authors: Yiren Song, Xiaokang Liu, Mike Zheng Shou
Affiliations: Show Lab, National University of Singapore
Keywords: inputs critically important, reference inputs critically, customized model outputs, making the assessment, critically important
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Diffusion models have fundamentally transformed the field of generative models, making the assessment of similarity between customized model outputs and reference inputs critically important. However, traditional perceptual similarity metrics operate primarily at the pixel and patch levels, comparing low-level colors and textures but failing to capture mid-level similarities and differences in image layout, object pose, and semantic content. Contrastive learning-based CLIP and self-supervised learning-based DINO are often used to measure semantic similarity, but they highly compress image features, inadequately assessing appearance details. This paper is the first to discover that pretrained diffusion models can be utilized for measuring visual similarity and introduces the DiffSim method, addressing the limitations of traditional metrics in capturing perceptual consistency in custom generation tasks. By aligning features in the attention layers of the denoising U-Net, DiffSim evaluates both appearance and style similarity, showing superior alignment with human visual preferences. Additionally, we introduce the Sref and IP benchmarks to evaluate visual similarity at the level of style and instance, respectively. Comprehensive evaluations across multiple benchmarks demonstrate that DiffSim achieves state-of-the-art performance, providing a robust tool for measuring visual coherence in generative models.
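
The idea of comparing features after aligning them, rather than position by position, can be illustrated with a toy sketch. It assumes each image is already represented by a small list of feature vectors and uses best-match cosine similarity as a stand-in for alignment. DiffSim itself aligns features inside the attention layers of a denoising U-Net, which this sketch does not attempt to reproduce; all names and values are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def aligned_similarity(feats_a, feats_b):
    """Match every feature in A to its most similar feature in B, then average."""
    if not feats_a or not feats_b:
        raise ValueError("both feature sets must be non-empty")
    return sum(max(cosine(fa, fb) for fb in feats_b) for fa in feats_a) / len(feats_a)


def positionwise_similarity(feats_a, feats_b):
    """Baseline: compare features at the same position without alignment."""
    return sum(cosine(fa, fb) for fa, fb in zip(feats_a, feats_b)) / len(feats_a)


# Same content at swapped positions: alignment recovers the match.
feats_a = [[1.0, 0.0], [0.0, 1.0]]
feats_b = [[0.0, 2.0], [3.0, 0.0]]
aligned = aligned_similarity(feats_a, feats_b)
unaligned = positionwise_similarity(feats_a, feats_b)
```

The example shows why alignment matters when the same appearance occurs at a different location: the position-wise score is 0 while the aligned score is 1.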

[CV-74] GSRender: Deduplicated Occupancy Prediction via Weakly Supervised 3D Gaussian Splatting

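[Quick read]: This paper addresses the balance between efficiency and accuracy in 3D occupancy perception, in particular the mIoU (mean Intersection over Union) fluctuation of 5-10 points that weakly supervised NeRF methods show depending on the number of samples along camera rays. The key to the solution is GSRender, which uses 3D Gaussian Splatting for occupancy prediction and simplifies the sampling process. In addition, the paper introduces a Ray Compensation (RC) module that compensates for features from adjacent frames to reduce duplicate predictions along the same camera ray. Finally, a redesigned loss eliminates the impact of dynamic objects in adjacent frames, improving RayIoU (+6.0) and narrowing the gap with 3D-supervised methods.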

Link: https://arxiv.org/abs/2412.14579
Authors: Qianpu Sun, Changyong Shu, Sifan Zhou, Zichen Yu, Yan Chen, Dawei Yang, Yuan Chun
Affiliations: Tsinghua Shenzhen International Graduate School; Houmo AI; Dalian University of Technology
Keywords: precise environment representations, gaining increasing attention, increasing attention due, environment representations, perception is gaining
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:3D occupancy perception is gaining increasing attention due to its capability to offer detailed and precise environment representations. Previous weakly-supervised NeRF methods balance efficiency and accuracy, with mIoU varying by 5-10 points due to sampling count along camera rays. Recently, real-time Gaussian splatting has gained widespread popularity in 3D reconstruction, and the occupancy prediction task can also be viewed as a reconstruction task. Consequently, we propose GSRender, which naturally employs 3D Gaussian Splatting for occupancy prediction, simplifying the sampling process. In addition, the limitations of 2D supervision result in duplicate predictions along the same camera ray. We implemented the Ray Compensation (RC) module, which mitigates this issue by compensating for features from adjacent frames. Finally, we redesigned the loss to eliminate the impact of dynamic objects from adjacent frames. Extensive experiments demonstrate that our approach achieves SOTA (state-of-the-art) results in RayIoU (+6.0), while narrowing the gap with 3D supervision methods. Our code will be released soon.

[CV-75] Alignment-Free RGB-T Salient Object Detection: A Large-scale Dataset and Progressive Correlation Network AAAI2025

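[Quick read]: This paper addresses the limited progress of alignment-free RGB-Thermal (RGB-T) salient object detection (SOD) in complex scenes, which is mainly caused by the small scale of existing benchmarks and the labor-intensive collection and annotation of image pairs. The key to the solution is UVT20K, a large-scale, high-diversity unaligned RGB-T SOD dataset with 20,000 image pairs, 407 scenes, and 1256 object categories, annotated with saliency masks, scribbles, boundaries, and challenge attributes. The paper also proposes the Progressive Correlation Network (PCNet), which models inter- and intra-modal correlations on the basis of explicit alignment to make accurate predictions on unaligned image pairs without manual alignment.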

Link: https://arxiv.org/abs/2412.14576
Authors: Kunpeng Wang, Keke Chen, Chenglong Li, Zhengzheng Tu, Bin Luo
Affiliations: Unknown
Keywords: alignment-free RGB-T SOD, achieve robust performance, requiring manual alignment, visible-thermal image pairs, RGB-T SOD
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by AAAI 2025

Abstract:Alignment-free RGB-Thermal (RGB-T) salient object detection (SOD) aims to achieve robust performance in complex scenes by directly leveraging the complementary information from unaligned visible-thermal image pairs, without requiring manual alignment. However, the labor-intensive process of collecting and annotating image pairs limits the scale of existing benchmarks, hindering the advancement of alignment-free RGB-T SOD. In this paper, we construct a large-scale and high-diversity unaligned RGB-T SOD dataset named UVT20K, comprising 20,000 image pairs, 407 scenes, and 1256 object categories. All samples are collected from real-world scenarios with various challenges, such as low illumination, image clutter, complex salient objects, and so on. To support the exploration for further research, each sample in UVT20K is annotated with a comprehensive set of ground truths, including saliency masks, scribbles, boundaries, and challenge attributes. In addition, we propose a Progressive Correlation Network (PCNet), which models inter- and intra-modal correlations on the basis of explicit alignment to achieve accurate predictions in unaligned image pairs. Extensive experiments conducted on unaligned and aligned datasets demonstrate the effectiveness of our method. Code and dataset are available at this https URL.

[CV-76] SCKD: Semi-Supervised Cross-Modality Knowledge Distillation for 4D Radar Object Detection AAAI2025

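[Quick read]: This paper addresses the performance of 4D millimeter-wave radar in 3D object detection, where the high sparsity and noise of radar point clouds keep existing methods well below expectations. The key to the solution is a novel Semi-supervised Cross-modality Knowledge Distillation (SCKD) method, in which a radar-only student learns features from a Lidar-radar-fused teacher network. It combines an adaptive fusion module in the teacher, two feature distillation modules, and a semi-supervised output distillation to transfer cross-modality knowledge. Experiments show that the method boosts mAP by 10.38% over the baseline on the VoD dataset and by 5.12% at the moderate difficulty level on ZJUODset when extra unlabeled data are available.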

Link: https://arxiv.org/abs/2412.14571
Authors: Ruoyu Xu, Zhiyu Xiang, Chenwei Zhang, Hanzhi Zhong, Xijun Zhao, Ruina Dang, Peng Xu, Tianyu Pu, Eryun Liu
Affiliations: School of Computer Science, Wuhan University; School of Cyber Science and Engineering, Wuhan University; School of Mathematics and Statistics, Wuhan University
Keywords: fundamental perception tasks, autonomous vehicles, fundamental perception, perception tasks, radar point clouds
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Image and Video Processing (eess.IV)
Comments: Accepted by AAAI 2025

Abstract:3D object detection is one of the fundamental perception tasks for autonomous vehicles. Fulfilling such a task with a 4D millimeter-wave radar is very attractive since the sensor is able to acquire 3D point clouds similar to Lidar while maintaining robust measurements under adverse weather. However, due to the high sparsity and noise associated with the radar point clouds, the performance of the existing methods is still much lower than expected. In this paper, we propose a novel Semi-supervised Cross-modality Knowledge Distillation (SCKD) method for 4D radar-based 3D object detection. It characterizes the capability of learning the feature from a Lidar-radar-fused teacher network with semi-supervised distillation. We first propose an adaptive fusion module in the teacher network to boost its performance. Then, two feature distillation modules are designed to facilitate the cross-modality knowledge transfer. Finally, a semi-supervised output distillation is proposed to increase the effectiveness and flexibility of the distillation framework. With the same network structure, our radar-only student trained by SCKD boosts the mAP by 10.38% over the baseline and outperforms the state-of-the-art works on the VoD dataset. The experiment on ZJUODset also shows 5.12% mAP improvements on the moderate difficulty level over the baseline when extra unlabeled data are available. Code is available at this https URL.
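
The combination of feature distillation and semi-supervised output distillation can be sketched as a single loss function. This is a minimal sketch under strong assumptions: mean squared error stands in for every loss term, features and outputs are flat lists of numbers, and the weights are hypothetical. The detection losses and distillation modules of the actual SCKD method are not specified here.

```python
def mse(a, b):
    """Mean squared error between two equally long sequences of numbers."""
    if len(a) != len(b):
        raise ValueError("inputs must have the same length")
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def student_loss(student_feat, teacher_feat, student_out, teacher_out,
                 label=None, feat_weight=1.0, out_weight=1.0):
    """Loss for a radar-only student guided by a Lidar-radar teacher.

    Feature and output distillation terms need only the teacher, so they
    also apply to unlabeled samples; the supervised term is added only
    when a ground-truth label is available.
    """
    loss = feat_weight * mse(student_feat, teacher_feat)
    loss += out_weight * mse(student_out, teacher_out)
    if label is not None:
        loss += mse(student_out, label)
    return loss


# Hypothetical values for one sample.
student_feat, teacher_feat = [1.0, 2.0], [1.0, 4.0]
student_out, teacher_out = [0.5], [1.0]
unlabeled_loss = student_loss(student_feat, teacher_feat, student_out, teacher_out)
labeled_loss = student_loss(student_feat, teacher_feat, student_out, teacher_out,
                            label=[1.0])
```

The point of the sketch is the control flow: unlabeled samples still contribute a training signal through the teacher, which is what makes the distillation semi-supervised.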

[CV-77] Improving Geometry in Sparse-View 3DGS via Reprojection-based DoF Separation

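[Quick read]: This paper addresses the geometry distortion that appears when 3D Gaussian Splatting (3DGS) is applied directly as a refinement step after learning-based sparse-view 3D reconstruction models. The key to the solution is reprojection-based DoF separation, which distinguishes image-plane-parallel degrees of freedom from the ray-aligned degree of freedom and manages each independently with tailored constraints, effectively suppressing geometric artifacts and yielding reconstructions that are both visually and geometrically plausible.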

Link: https://arxiv.org/abs/2412.14568
Authors: Yongsung Kim, Minjun Park, Jooyoung Choi, Sungroh Yoon
Affiliations: Interdisciplinary Program in AI, Seoul National University; ECE; AIIS, ASRI, INMC, ISRC, Seoul National University
Keywords: Recent learning-based Multi-View, learning-based Multi-View Stereo, Multi-View Stereo models, Recent learning-based, Multi-View Stereo
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 11 pages

Abstract:Recent learning-based Multi-View Stereo models have demonstrated state-of-the-art performance in sparse-view 3D reconstruction. However, directly applying 3D Gaussian Splatting (3DGS) as a refinement step following these models presents challenges. We hypothesize that the excessive positional degrees of freedom (DoFs) in Gaussians induce geometry distortion, fitting color patterns at the cost of structural fidelity. To address this, we propose reprojection-based DoF separation, a method distinguishing positional DoFs in terms of uncertainty: image-plane-parallel DoFs and ray-aligned DoF. To independently manage each DoF, we introduce a reprojection process along with tailored constraints for each DoF. Through experiments across various datasets, we confirm that separating the positional DoFs of Gaussians and applying targeted constraints effectively suppresses geometric artifacts, producing reconstruction results that are both visually and geometrically plausible.
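
The separation of positional degrees of freedom can be illustrated with plain vector projection. The sketch assumes a single camera ray and splits a Gaussian's positional displacement into a component along the ray and the remainder orthogonal to it; the orthogonal part is used here as a simple stand-in for the image-plane-parallel DoFs. The reprojection process and the constraints from the paper are not modeled, and the function name is hypothetical.

```python
import math


def split_displacement(displacement, ray_direction):
    """Split a 3D displacement into ray-aligned and ray-orthogonal parts.

    Returns (ray_part, plane_part) with ray_part + plane_part == displacement.
    """
    norm = math.sqrt(sum(c * c for c in ray_direction))
    if norm == 0:
        raise ValueError("ray_direction must be non-zero")
    unit = [c / norm for c in ray_direction]
    along = sum(d * u for d, u in zip(displacement, unit))  # signed depth change
    ray_part = [along * u for u in unit]
    plane_part = [d - r for d, r in zip(displacement, ray_part)]
    return ray_part, plane_part


# A displacement seen by a camera looking down the +z axis.
ray_part, plane_part = split_displacement([1.0, 2.0, 3.0], [0.0, 0.0, 2.0])
```

Once the two parts are separated, each can be given its own constraint, for example a tighter bound on the ray-aligned part if depth is the more uncertain direction. Which constraint suits which component is a design decision of the paper that this sketch does not settle.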

[CV-78] GBRIP: Granular Ball Representation for Imbalanced Partial Label Learning AAAI25

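[Quick read]: This paper addresses partial label learning (PLL), a complicated weakly supervised multi-class classification task, when it is compounded by class imbalance. The key to the solution is Granular Ball Representation for Imbalanced PLL (GBRIP). GBRIP uses a coarse-grained granular ball representation and a multi-center loss to build a granular-ball-based feature space through unsupervised learning, capturing the feature distribution within each class, and it mitigates the impact of confusing features by systematically refining label disambiguation and estimating imbalance distributions. The multi-center loss further emphasizes the relationship between samples and their respective centers within the granular balls, which improves learning.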

Link: https://arxiv.org/abs/2412.14561
Authors: Jintao Huang, Yiu-ming Cheung, Chi-man Vong, Wenbin Qian
Affiliations: Department of Computer Science, Hong Kong Baptist University, Hong Kong; Department of Electronic and Information Engineering, University of Macau, Macau; School of Computer Science and Technology, Huazhong University of Science and Technology, Wuhan, China
Keywords: complicated weakly supervised, weakly supervised multi-classification, supervised multi-classification task, multi-classification task compounded, Imbalanced PLL
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: AAAI25

Abstract:Partial label learning (PLL) is a complicated weakly supervised multi-classification task compounded by class imbalance. Currently, existing methods only rely on inter-class pseudo-labeling from inter-class features, often overlooking the significant impact of the intra-class imbalanced features combined with the inter-class. To address these limitations, we introduce Granular Ball Representation for Imbalanced PLL (GBRIP), a novel framework for imbalanced PLL. GBRIP utilizes coarse-grained granular ball representation and multi-center loss to construct a granular ball-based feature space through unsupervised learning, effectively capturing the feature distribution within each class. GBRIP mitigates the impact of confusing features by systematically refining label disambiguation and estimating imbalance distributions. The novel multi-center loss function enhances learning by emphasizing the relationships between samples and their respective centers within the granular balls. Extensive experiments on standard benchmarks demonstrate that GBRIP outperforms existing state-of-the-art methods, offering a robust solution to the challenges of imbalanced PLL.

[CV-79] ScaMo: Exploring the Scaling Law in Autoregressive Motion Generation Model

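[Quick read]: This paper addresses the validation of scaling laws in motion generation. The key to the solution is a scalable motion generation framework consisting of the motion tokenizer Motion FSQ-VAE and a text-prefix autoregressive transformer. The experiments confirm, for the first time according to the authors, that scaling laws hold in motion generation: the normalized test loss of the prefix autoregressive models follows a logarithmic law with respect to compute budget, and non-vocabulary parameters, vocabulary parameters, and data tokens each follow a power law with respect to compute budget. Using these laws, the paper predicts the optimal transformer size, vocabulary size, and data requirements for a given compute budget, and verifies that the predicted test loss matches the measured loss.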

Link: https://arxiv.org/abs/2412.14559
Authors: Shunlin Lu, Jingbo Wang, Zeyu Lu, Ling-Hao Chen, Wenxun Dai, Junting Dong, Zhiyang Dou, Bo Dai, Ruimao Zhang
Affiliations: Sun Yat-sen University; The Chinese University of Hongkong, Shenzhen; Shanghai AI Laboratory; Tsinghua University; Shanghai Jiao Tong University; The University of Hong Kong
Keywords: natural language processing, computer vision tasks, remains largely unexplored, massive computer vision, generation remains largely
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:

Abstract:The scaling law has been validated in various domains, such as natural language processing (NLP) and massive computer vision tasks; however, its application to motion generation remains largely unexplored. In this paper, we introduce a scalable motion generation framework that includes the motion tokenizer Motion FSQ-VAE and a text-prefix autoregressive transformer. Through comprehensive experiments, we observe the scaling behavior of this system. For the first time, we confirm the existence of scaling laws within the context of motion generation. Specifically, our results demonstrate that the normalized test loss of our prefix autoregressive models adheres to a logarithmic law in relation to compute budgets. Furthermore, we also confirm the power law between Non-Vocabulary Parameters, Vocabulary Parameters, and Data Tokens with respect to compute budgets respectively. Leveraging the scaling law, we predict the optimal transformer size, vocabulary size, and data requirements for a compute budget of 1e18. The test loss of the system, when trained with the optimal model size, vocabulary size, and required data, aligns precisely with the predicted test loss, thereby validating the scaling law.
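
Fitting a power law and extrapolating it to a larger compute budget, as described above, can be sketched with a least-squares fit in log-log space. The data points below are synthetic, generated from an exact power law with an arbitrary coefficient and exponent; they are not results from the paper, and the function name is hypothetical.

```python
import math


def fit_power_law(xs, ys):
    """Fit y = a * x**b by ordinary least squares on (log x, log y)."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need at least two paired observations")
    log_x = [math.log(x) for x in xs]
    log_y = [math.log(y) for y in ys]
    mean_x = sum(log_x) / len(log_x)
    mean_y = sum(log_y) / len(log_y)
    slope = (sum((u - mean_x) * (v - mean_y) for u, v in zip(log_x, log_y))
             / sum((u - mean_x) ** 2 for u in log_x))
    coefficient = math.exp(mean_y - slope * mean_x)
    return coefficient, slope


# Synthetic "optimal parameter count" at three compute budgets (illustration only).
compute_budgets = [1e15, 1e16, 1e17]
optimal_params = [0.1 * c ** 0.5 for c in compute_budgets]

coefficient, exponent = fit_power_law(compute_budgets, optimal_params)
predicted_params_at_1e18 = coefficient * 1e18 ** exponent
```

On a log-log plot a power law is a straight line, so the fitted slope is the exponent and extrapolation to a budget of 1e18 is a matter of extending that line. The logarithmic law the abstract reports for test loss would be fitted differently, by regressing loss on the logarithm of compute.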

[CV-80] Bright-NeRF: Brightening Neural Radiance Field with Color Restoration from Low-light Raw Images AAAI2025

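[Quick read]: This paper addresses the difficulty NeRFs (Neural Radiance Fields) have in learning accurate scene representations in low-light environments, where images show significant noise and severe color distortion. The key to the solution is Bright-NeRF, which learns enhanced, high-quality radiance fields from multi-view low-light raw images in an unsupervised manner. Specifically, Bright-NeRF uses a physically inspired model of the sensor's response to illumination and introduces a chromatic adaptation loss to constrain the learning of that response, giving consistent color perception regardless of lighting conditions. It also uses the properties of raw data to expose the scene's intensity automatically. According to the experiments, Bright-NeRF achieves color restoration, denoising, and enhanced novel view synthesis, significantly outperforming existing 2D and 3D approaches.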

Link: https://arxiv.org/abs/2412.14547
Authors: Min Wang, Xin Huang, Guoqing Zhou, Qifeng Guo, Qing Wang
Affiliations: Shanghai Jiao Tong University; Huawei Technologies
Keywords: Neural Radiance Fields, demonstrated prominent performance, Neural Radiance, demonstrated prominent, prominent performance
Categories: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
Comments: Accepted by AAAI2025

Abstract:Neural Radiance Fields (NeRFs) have demonstrated prominent performance in novel view synthesis. However, their input heavily relies on image acquisition under normal light conditions, making it challenging to learn accurate scene representation in low-light environments where images typically exhibit significant noise and severe color distortion. To address these challenges, we propose a novel approach, Bright-NeRF, which learns enhanced and high-quality radiance fields from multi-view low-light raw images in an unsupervised manner. Our method simultaneously achieves color restoration, denoising, and enhanced novel view synthesis. Specifically, we leverage a physically-inspired model of the sensor’s response to illumination and introduce a chromatic adaptation loss to constrain the learning of response, enabling consistent color perception of objects regardless of lighting conditions. We further utilize the raw data’s properties to expose the scene’s intensity automatically. Additionally, we have collected a multi-view low-light raw image dataset to advance research in this field. Experimental results demonstrate that our proposed method significantly outperforms existing 2D and 3D approaches. Our code and dataset will be made publicly available.

[CV-81] S3-Mamba: Small-Size-Sensitive Mamba for Lesion Segmentation AAAI2025

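[Quick read]: This paper addresses sensitivity to small lesions in medical image segmentation, which matters particularly for early disease diagnosis and for intervention in severe infections. The key to the solution is the Small-Size-Sensitive Mamba (S³-Mamba) model, which improves sensitivity to small lesions along three dimensions: channel, spatial, and training strategy. Specifically, the paper designs an Enhanced Visual State Space block that preserves local features through multiple residual connections and selectively amplifies important details through channel-wise attention, and it proposes a Tensor-based Cross-feature Multi-scale Attention that integrates multi-scale features to preserve the spatial details of small lesions. In addition, it introduces a regularized curriculum learning scheme that automatically assesses lesion size and sample difficulty and gradually moves from easy samples to hard ones such as small lesions. Experiments indicate that S³-Mamba has a clear advantage in segmenting small lesions.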

Link: https://arxiv.org/abs/2412.14546
Authors: Gui Wang, Yuexiang Li, Wenting Chen, Meidan Ding, Wooi Ping Cheah, Rong Qu, Jianfeng Ren, Linlin Shen
Affiliations: School of Computer Science and Technology, Harbin Institute of Technology, Harbin, China
69. School of Computer Science and Technology, Harbin Institute of Technology, Harbin, China(哈尔滨工业大学计算机科学与技术学院，哈尔滨，中国);
70. School of Computer Science and Technology, Harbin Institute of Technology, Harbin, China(哈尔滨工业大学计算机科学与技术学院，哈尔滨，中国);
71. School of Computer Science and Technology, Harbin Institute of Technology, Harbin, China(哈尔滨工业大学计算机科学与技术学院，哈尔滨，中国);
72. School of Computer Science and Technology, Harbin Institute of Technology, Harbin, China(哈尔滨工业大学计算机科学与技术学院，哈尔滨，中国);
73. School of Computer Science and Technology, Harbin Institute of Technology, Harbin, China(哈尔滨工业大学计算机科学与技术学院，哈尔滨，中国);
74. School of Computer Science and Technology, Harbin Institute of Technology, Harbin, China(哈尔滨工业大学计算机科学与技术学院，哈尔滨，中国);
75. School of Computer Science and Technology, Harbin Institute of Technology, Harbin, China(哈尔滨工业大学计算机科学与技术学院，哈尔滨，中国);
76. School of Computer Science and Technology, Harbin Institute of Technology, Harbin, China(哈尔滨工业大学计算机科学与技术学院，哈尔滨，中国);
77. School of Computer Science and Technology, Harbin Institute of Technology, Harbin, China(哈尔滨工业大学计算机科学与技术学院，哈尔滨，中国);
78. School of Computer Science and Technology, Harbin Institute of Technology, Harbin, China(哈尔滨工业大学计算机科学与技术学院，哈尔滨，中国);
79. School of Computer Science and Technology, Harbin Institute of Technology, Harbin, China(哈尔滨工业大学计算机科学与技术学院，哈尔滨，中国);
80. School of Computer Science and Technology, Harbin Institute of Technology, Harbin, China(哈尔滨工业大学计算机科学与技术学院，哈尔滨，中国);
81. School of Computer Science and Technology, Harbin Institute of Technology, Harbin, China(哈尔滨工业大学计算机科学与技术学院，哈尔滨，中国);
82. School of Computer Science and Technology, Harbin Institute of Technology, Harbin, China(哈尔滨工业大学计算机科学与技术学院，哈尔滨，中国);
83. School of Computer Science and Technology, Harbin Institute of Technology, Harbin, China(哈尔滨工业大学计算机科学与技术学院，哈尔滨，中国);
84. School of Computer Science and Technology, Harbin Institute of Technology, Harbin, China(哈尔滨工业大学计算机科学与技术学院，哈尔滨，中国);
85. School of Computer Science and Technology, Harbin Institute of Technology, Harbin, China(哈尔滨工业大学计算机科学与技术学院，哈尔滨，中国);
86. School of Computer Science and Technology, Harbin Institute of Technology, Harbin, China(哈尔滨工业大学计算机科学与技术学院，哈尔滨，中国);
87. School of Computer Science and Technology, Harbin Institute of Technology, Harbin, China(哈尔滨工业大学计算机科学与技术学院，哈尔滨，中国);
88. School of Computer Science and Technology, Harbin Institute of Technology, Harbin, China(哈尔滨工业大学计算机科学与技术学院，哈尔滨，中国);
89. School of Computer Science and Technology, Harbin Institute of Technology, Harbin, China(哈尔滨工业大学计算机科学与技术学院，哈尔滨，中国);
90. School of Computer Science and Technology, Harbin Institute of Technology, Harbin, China(哈尔滨工业大学计算机科学与技术学院，哈尔滨，中国);
91. School of Computer Science and Technology, Harbin Institute of Technology, Harbin, China(哈尔滨工业大学计算机科学与技术学院，哈尔滨，中国);
92. School of Computer Science and Technology, Harbin Institute of Technology, Harbin, China(哈尔滨工业大学计算机科学与技术学院，哈尔滨，中国);
93. School of Computer Science and Technology, Harbin Institute of Technology, Harbin, China(哈尔滨工业大学计算机科学与技术学院，哈尔滨，中国);
94. School of Computer Science and Technology, Harbin Institute of Technology, Harbin, China(哈尔滨工业大学计算机科学与技术学院，哈尔滨，中国);
95. School of Computer Science and Technology, Harbin Institute of Technology, Harbin, China(哈尔滨工业大学计算机科学与技术学院，哈尔滨，中国);
96. School of Computer Science and Technology, Harbin Institute of Technology, Harbin, China(哈尔滨工业大学计算机科学与技术学院，哈尔滨，中国);
97. School of Computer Science and Technology, Harbin Institute of Technology, Harbin, China(哈尔滨工业大学计算机科学与技术学院，哈尔滨，中国);
98. School of Computer Science and Technology, Harbin Institute of Technology, Harbin, China(哈尔滨工业大学计算机科学与技术学院，哈尔滨，中国);
99. School of Computer Science and Technology, Harbin Institute of Technology, Harbin, China(哈尔滨工业大学计算机科学与技术学院，哈尔滨，中国);
100. School of Computer Science and Technology, Harbin Institute of Technology, Harbin, China(哈尔滨工业大学计算机科学与技术学院，哈尔滨，中国);
101. School of Computer Science and Technology, Harbin Institute of Technology, Harbin, China(哈尔滨工业大学计算机科学与技术学院，哈尔滨，中国);
102. School of Computer Science and Technology, Harbin Institute of Technology, Harbin, China(哈尔滨工业大学计算机科学与技术学院，哈尔滨，中国);
103. School of Computer Science and Technology, Harbin Institute of Technology, Harbin, China(哈尔滨工业大学计算机科学与技术学院，哈尔滨，中国);
104. School of Computer Science and Technology, Harbin Institute of Technology, Harbin, China(哈尔滨工业大学计算机科学与技术学院，哈尔滨，中国);
105. School of Computer Science and Technology, Harbin Institute of Technology, Harbin, China(哈尔滨工业大学计算机科学与技术学院，哈尔滨，中国);
106. School of Computer Science and Technology, Harbin Institute of Technology, Harbin, China(哈尔滨工业大学计算机科学与技术学院，哈尔滨，中国);
107. School of Computer Science and Technology, Harbin Institute of Technology, Harbin, China(哈尔滨工业大学计算机科学与技术学院，哈尔滨，中国);
108. School of Computer Science and Technology, Harbin Institute of Technology, Harbin, China(哈尔滨工业大学计算机科学与技术学院，哈尔滨，中国);
109. School of Computer Science and Technology, Harbin Institute of Technology, Harbin, China(哈尔滨工业大学计算机科学与技术学院，哈尔滨，中国);
110. School of Computer Science and Technology, Harbin Institute of Technology, Harbin, China(哈尔滨工业大学计算机科学与技术学院，哈尔滨，中国);
111. School of Computer Science and Technology, Harbin Institute of Technology, Harbin, China(哈尔滨工业大学计算机科学与技术学院，哈尔滨，中国);
112. School of Computer Science and Technology, Harbin Institute of Technology, Harbin, China(哈尔滨工业大学计算机科学与技术学院，哈尔滨，中国);
113. School of Computer Science and Technology, Harbin Institute of Technology, Harbin, China(哈尔滨工业大学计算机科学与技术学院，哈尔滨，中国);
114. School of Computer Science and Technology, Harbin Institute of Technology, Harbin, China(哈尔滨工业大学计算机科学与技术学院，哈尔滨，中国);
115. School of Computer Science and Technology, Harbin Institute of Technology, Harbin, China(哈尔滨工业大学计算机科学与技术学院，哈尔滨，中国);
116. School of Computer Science and Technology, Harbin Institute of Technology, Harbin, China(哈尔滨工业大学计算机科学与技术学院，哈尔滨，中国);
117. School of Computer Science and Technology, Harbin Institute of Technology, Harbin, China(哈尔滨工业大学计算机科学与技术学院，哈尔滨，中国);
118. School of Computer Science and Technology, Harbin Institute of Technology, Harbin, China(哈尔滨工业大学计算机科学与技术学院，哈尔滨，中国);
119. School of Computer Science and Technology, Harbin Institute of Technology, Harbin, China(哈尔滨工业大学计算机科学与技术学院，哈尔滨，中国);
120. School of Computer Science and Technology, Harbin Institute of Technology, Harbin, China(哈尔滨工业大学计算机科学与技术学院，哈尔滨，中国);
121.
Keywords: early disease diagnosis, Small lesions, Small lesions play, Small, segmenting small lesions
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by AAAI 2025


Abstract: Small lesions play a critical role in early disease diagnosis and intervention of severe infections. Popular models often face challenges in segmenting small lesions, as they occupy only a minor portion of an image, while down-sampling operations may inevitably lose focus on local features of small lesions. To tackle the challenges, we propose a Small-Size-Sensitive Mamba (S^3-Mamba), which promotes the sensitivity to small lesions across three dimensions: channel, spatial, and training strategy. Specifically, an Enhanced Visual State Space block is designed to focus on small lesions through multiple residual connections to preserve local features, and selectively amplify important details while suppressing irrelevant ones through channel-wise attention. A Tensor-based Cross-feature Multi-scale Attention is designed to integrate input image features and intermediate-layer features with edge features and exploit the attentive support of features across multiple scales, thereby retaining spatial details of small lesions at various granularities. Finally, we introduce a novel regularized curriculum learning to automatically assess lesion size and sample difficulty, and gradually focus from easy samples to hard ones like small lesions. Extensive experiments on three medical image segmentation datasets show the superiority of our S^3-Mamba, especially in segmenting small lesions. Our code is available at this https URL.

[CV-82] Summary of Point Transformer with Federated Learning for Predicting Breast Cancer HER2 Status from Hematoxylin and Eosin-Stained Whole Slide Images

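[Quick read]: This paper addresses predicting HER2 status from hematoxylin and eosin (HE)-stained whole slide images (WSIs), using federated learning to reduce costs and speed up treatment decisions. To handle label imbalance and feature representation challenges in multisite datasets, it proposes a point transformer that combines dynamic label distribution, an auxiliary classifier, and farthest cosine sampling. The method achieves state-of-the-art performance across four sites (2687 WSIs) and generalizes well to two unseen sites (229 WSIs).

Link: https://arxiv.org/abs/2412.14545
Authors: Kamorudeen A. Amuda,Almustapha A. Wakili
Affiliations: Unknown
Keywords: federated learning-based approach, approach to predict, status from hematoxylin, hematoxylin and eosin, stained whole slide
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: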


Abstract:This study introduces a federated learning-based approach to predict HER2 status from hematoxylin and eosin (HE)-stained whole slide images (WSIs), reducing costs and speeding up treatment decisions. To address label imbalance and feature representation challenges in multisite datasets, a point transformer is proposed, incorporating dynamic label distribution, an auxiliary classifier, and farthest cosine sampling. Extensive experiments demonstrate state-of-the-art performance across four sites (2687 WSIs) and strong generalization to two unseen sites (229 WSIs).
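The abstract names farthest cosine sampling without defining it. The sketch below is one plausible reading, assuming it is the greedy farthest-point sampling procedure with cosine distance in place of Euclidean distance. The function names, the start index, and the toy feature vectors are hypothetical and are not taken from the paper.

```python
import math


def cosine_distance(a, b):
    """Return 1 - cosine similarity of two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def farthest_cosine_sampling(features, k, start=0):
    """Greedily pick up to k indices, each maximising its cosine distance
    to the closest already-selected feature (assumed reading of the term)."""
    selected = [start]
    # min_dist[i] = distance from feature i to its nearest selected feature
    min_dist = [cosine_distance(f, features[start]) for f in features]
    while len(selected) < min(k, len(features)):
        nxt = max(range(len(features)), key=lambda i: min_dist[i])
        selected.append(nxt)
        for i, f in enumerate(features):
            min_dist[i] = min(min_dist[i], cosine_distance(f, features[nxt]))
    return selected


# Toy patch features: two near-duplicates and two clearly different directions.
patches = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [-1.0, 0.0]]
picked = farthest_cosine_sampling(patches, k=3)
```

With these toy inputs the near-duplicate of the starting patch is the one left out, which is the behaviour such a sampler is usually chosen for: a small, directionally diverse subset of patch features.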

[CV-83] Downscaling Precipitation with Bias-informed Conditional Diffusion Model

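[Quick read]: This paper addresses the need for high-resolution precipitation projections under climate change, in particular the limitation that current Global Climate Models (GCMs) have spatial resolutions too coarse for localized analysis. It proposes a deep learning-based statistical downscaling method built on a bias-informed conditional diffusion model. The model uses a conditional diffusion approach to learn distribution priors from large-scale, high-resolution precipitation datasets, and applies gamma correction during preprocessing to handle the long-tail distribution of precipitation. A guided-sampling strategy further strengthens bias correction of the downscaled results. Experiments show that the model achieves highly accurate results in an 8x downscaling setting and outperforms previous deterministic methods.

Link: https://arxiv.org/abs/2412.14539
Authors: Ran Lyu(1),Linhan Wang(1),Yanshen Sun(1),Hedanqiu Bai(2),Chang-Tien Lu(1) ((1) Virginia Tech, (2) Texas A&M University)
Affiliations: Virginia Tech; Virginia Tech; Virginia Tech; Texas A&M University; Virginia Tech
Keywords: intensifying rainfall extremes, current Global Climate, Global Climate Models, Climate change, rainfall extremes
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Atmospheric and Oceanic Physics (physics.ao-ph)
Comments: 3 pages, 2 figures. Accepted by Proceedings of IEEE International Conference on Big Data, Dec 15-18, 2024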


Abstract:Climate change is intensifying rainfall extremes, making high-resolution precipitation projections crucial for society to better prepare for impacts such as flooding. However, current Global Climate Models (GCMs) operate at spatial resolutions too coarse for localized analyses. To address this limitation, deep learning-based statistical downscaling methods offer promising solutions, providing high-resolution precipitation projections with a moderate computational cost. In this work, we introduce a bias-informed conditional diffusion model for statistical downscaling of precipitation. Specifically, our model leverages a conditional diffusion approach to learn distribution priors from large-scale, high-resolution precipitation datasets. The long-tail distribution of precipitation poses a unique challenge for training diffusion models; to address this, we apply gamma correction during preprocessing. Additionally, to correct biases in the downscaled results, we employ a guided-sampling strategy to enhance bias correction. Our experiments demonstrate that the proposed model achieves highly accurate results in an 8 times downscaling setting, outperforming previous deterministic methods. The code and dataset are available at this https URL
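To illustrate why gamma correction helps with a long-tailed variable, the sketch below applies a power-law transform to normalized precipitation values and inverts it afterwards. The abstract does not give the exact formula or the exponent, so the normalization by the maximum, the value gamma = 0.5, and the function names are assumptions made for illustration only.

```python
def gamma_correct(precip, gamma=0.5, max_value=None):
    """Scale values to [0, 1] and raise them to the power gamma.
    With gamma < 1 the many small values are spread out and the rare
    extremes are compressed (assumed form of the preprocessing step)."""
    if max_value is None:
        max_value = max(precip)
    if max_value <= 0:
        return [0.0 for _ in precip]
    return [(max(p, 0.0) / max_value) ** gamma for p in precip]


def inverse_gamma_correct(values, gamma, max_value):
    """Map transformed values back to the original precipitation scale."""
    return [(v ** (1.0 / gamma)) * max_value for v in values]


# Toy rainfall amounts (mm) with one extreme event.
rain = [0.0, 1.0, 4.0, 100.0]
transformed = gamma_correct(rain, gamma=0.5)
restored = inverse_gamma_correct(transformed, gamma=0.5, max_value=100.0)
```

In the raw data the three smaller values all sit within 4% of the range. After the transform they occupy the lower 20%, so a model trained on the transformed values sees more contrast among light-rain pixels.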

[CV-84] DAMPER: A Dual-Stage Medical Report Generation Framework with Coarse-Grained MeSH Alignment and Fine-Grained Hypergraph Matching

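[Quick read]: This paper addresses shortcomings of existing medical report generation methods in mimicking the clinical report-writing pipeline: they overlook the two stages physicians typically follow (an initial quick review, then a detailed examination), and current alignment methods may produce misaligned relationships. The proposed solution is DAMPER, a dual-stage framework that mimics this pipeline. The first stage, MeSH-Guided Coarse-Grained Alignment (MCG), aligns chest X-ray (CXR) image features with medical subject headings (MeSH) features to generate a rough keyphrase representation of the overall impression. The second stage, Hypergraph-Enhanced Fine-Grained Alignment (HFG), constructs hypergraphs for image patches and report annotations, models high-order relationships within each modality, and uses hypergraph matching to capture semantic correlations between image regions and textual phrases. Finally, the coarse-grained visual features, the generated MeSH representations, and the visual hypergraph features are fed into a report decoder to produce the final medical report.

Link: https://arxiv.org/abs/2412.14535
Authors: Xiaofei Huang,Wenting Chen,Jie Liu,Qisheng Lu,Xiaoling Luo,Linlin Shen
Affiliations: Unknown
Keywords: patient management, summarizing diagnoses, diagnosis and patient, diagnoses and recommendations, recommendations based
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: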


Abstract: Medical report generation is crucial for clinical diagnosis and patient management, summarizing diagnoses and recommendations based on medical imaging. However, existing work often overlooks the clinical pipeline involved in report writing, where physicians typically conduct an initial quick review followed by a detailed examination. Moreover, current alignment methods may lead to misaligned relationships. To address these issues, we propose DAMPER, a dual-stage framework for medical report generation that mimics the clinical pipeline of report writing in two stages. The first is a MeSH-Guided Coarse-Grained Alignment (MCG) stage that aligns chest X-ray (CXR) image features with medical subject headings (MeSH) features to generate a rough keyphrase representation of the overall impression. The second is a Hypergraph-Enhanced Fine-Grained Alignment (HFG) stage that constructs hypergraphs for image patches and report annotations, modeling high-order relationships within each modality and performing hypergraph matching to capture semantic correlations between image regions and textual phrases. Finally, the coarse-grained visual features, generated MeSH representations, and visual hypergraph features are fed into a report decoder to produce the final medical report. Extensive experiments on public datasets demonstrate the effectiveness of DAMPER in generating comprehensive and accurate medical reports, outperforming state-of-the-art methods across various evaluation metrics.

[CV-85] Consistent Human Image and Video Generation with Spatially Conditioned Diffusion

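[Quick read]: This paper addresses how to preserve appearance consistency with a reference image when synthesizing human images and videos in new poses. The key idea is to frame the task as a spatially-conditioned inpainting problem, in which a single unified denoising network lets the reference features guide target generation, thereby reducing the domain gaps between references and targets. The paper also proposes a causal feature interaction framework in which reference features can only query from themselves, while target features can query from both the reference and the target, so that reference appearance information is better preserved. For computational efficiency and flexibility, the practical implementation splits generation into two stages, reference appearance extraction and conditioned target generation, which share the same denoising network and interact only through self-attention layers. The method generalizes well to unseen identities and poses without additional per-instance fine-tuning.

Link: https://arxiv.org/abs/2412.14531
Authors: Mingdeng Cao,Chong Mou,Ziyang Yuan,Xintao Wang,Zhaoyang Zhang,Ying Shan,Yinqiang Zheng
Affiliations: The University of Tokyo; ARC Lab, Tencent PCG; Peking University; Tsinghua University
Keywords: visual content creation, low-cost visual content, reference, preserving appearance consistency, content creation
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project page: this https URL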


Abstract:Consistent human-centric image and video synthesis aims to generate images or videos with new poses while preserving appearance consistency with a given reference image, which is crucial for low-cost visual content creation. Recent advances based on diffusion models typically rely on separate networks for reference appearance feature extraction and target visual generation, leading to inconsistent domain gaps between references and targets. In this paper, we frame the task as a spatially-conditioned inpainting problem, where the target image is inpainted to maintain appearance consistency with the reference. This approach enables the reference features to guide the generation of pose-compliant targets within a unified denoising network, thereby mitigating domain gaps. Additionally, to better maintain the reference appearance information, we impose a causal feature interaction framework, in which reference features can only query from themselves, while target features can query appearance information from both the reference and the target. To further enhance computational efficiency and flexibility, in practical implementation, we decompose the spatially-conditioned generation process into two stages: reference appearance extraction and conditioned target generation. Both stages share a single denoising network, with interactions restricted to self-attention layers. This proposed method ensures flexible control over the appearance of generated human images and videos. By fine-tuning existing base diffusion models on human video data, our method demonstrates strong generalization to unseen human identities and poses without requiring additional per-instance fine-tuning. Experimental results validate the effectiveness of our approach, showing competitive performance compared to existing methods for consistent human image and video synthesis.
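The causal feature interaction rule in the abstract (reference features query only themselves, target features query both reference and target) can be written as an attention mask. The sketch below assumes the reference and target tokens are concatenated in that order into one sequence. The token layout and the function name are illustrative assumptions, not the authors' implementation.

```python
def causal_reference_mask(num_ref, num_tgt):
    """Build a boolean mask over a [reference | target] token sequence.
    mask[q][k] is True when query token q may attend to key token k."""
    n = num_ref + num_tgt
    mask = [[False] * n for _ in range(n)]
    for q in range(n):
        query_is_ref = q < num_ref
        for k in range(n):
            key_is_ref = k < num_ref
            if query_is_ref:
                # Reference tokens only see reference tokens.
                mask[q][k] = key_is_ref
            else:
                # Target tokens see both reference and target tokens.
                mask[q][k] = True
    return mask


# Two reference tokens followed by one target token.
mask = causal_reference_mask(num_ref=2, num_tgt=1)
```

Because no reference row ever attends to a target column, the reference features do not depend on the target. This is consistent with the two-stage decomposition described in the abstract, where reference appearance extraction can run once, separately from target generation.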

[CV-86] Efficient Self-Supervised Video Hashing with Selective State Spaces AAAI’25

[Quick read]: This paper addresses the computational and memory inefficiency of Transformer models in self-supervised video hashing (SSVH). The key idea is to adopt Mamba, a state-space model, and build a video hashing model called S5VH on top of it. S5VH uses bidirectional Mamba layers in both the encoder and the decoder to capture temporal relationships, and its data-dependent selective scanning mechanism has linear complexity, giving a better balance between efficacy and efficiency. The paper also proposes a self-local-global (SLG) learning paradigm that transforms global semantics in the feature space into semantically consistent and discriminative hash centers and uses a center alignment loss as a global learning signal, which significantly improves learning efficiency and convergence.

Link: https://arxiv.org/abs/2412.14518
Authors: Jinpeng Wang,Niu Lian,Jun Li,Yuting Wang,Yan Feng,Bin Chen,Yongbing Zhang,Shu-Tao Xia
Affiliations: 1. School of Computer Science and Technology, Soochow University;
2. School of Electronic and Information Engineering, Soochow University;
3. Jiangsu Key Laboratory of Big Data Analysis Technology;
4. School of Mathematics and Statistics, Soochow University
Keywords: indexing and retrieval, practical task, Self-supervised video hashing, SSVH, Mamba-based video hashing
Categories: Computer Vision and Pattern Recognition (cs.CV); Information Retrieval (cs.IR); Multimedia (cs.MM)
Comments: Accepted by AAAI’25. 9 pages, 5 figures, 2 tables


Abstract:Self-supervised video hashing (SSVH) is a practical task in video indexing and retrieval. Although Transformers are predominant in SSVH for their impressive temporal modeling capabilities, they often suffer from computational and memory inefficiencies. Drawing inspiration from Mamba, an advanced state-space model, we explore its potential in SSVH to achieve a better balance between efficacy and efficiency. We introduce S5VH, a Mamba-based video hashing model with an improved self-supervised learning paradigm. Specifically, we design bidirectional Mamba layers for both the encoder and decoder, which are effective and efficient in capturing temporal relationships thanks to the data-dependent selective scanning mechanism with linear complexity. In our learning strategy, we transform global semantics in the feature space into semantically consistent and discriminative hash centers, followed by a center alignment loss as a global learning signal. Our self-local-global (SLG) paradigm significantly improves learning efficiency, leading to faster and better convergence. Extensive experiments demonstrate S5VH’s improvements over state-of-the-art methods, superior transferability, and scalable advantages in inference efficiency. Code is available at this https URL.
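For readers new to hashing-based retrieval, the sketch below shows the generic mechanics that hash centers rely on: a real-valued code is binarized by sign and then compared with binary centers by Hamming distance. It does not reproduce the S5VH center alignment loss, which the abstract does not specify. The function names, the code length, and the example centers are assumptions.

```python
def binarize(real_code):
    """Turn a real-valued embedding into a {-1, +1} hash code by sign."""
    return [1 if v >= 0 else -1 for v in real_code]


def hamming(a, b):
    """Count the positions where two equal-length codes differ."""
    return sum(x != y for x, y in zip(a, b))


def nearest_center(code, centers):
    """Return the index of the hash center closest in Hamming distance."""
    return min(range(len(centers)), key=lambda i: hamming(code, centers[i]))


# Two hypothetical 4-bit hash centers and one relaxed video embedding.
centers = [[1, 1, 1, 1], [-1, -1, -1, -1]]
video_code = binarize([0.3, -0.2, 0.9, 0.1])
assigned = nearest_center(video_code, centers)
```

A training signal that pulls each video's code toward its assigned center would reduce the Hamming distance computed here. How S5VH derives the centers and defines that loss is described in the paper itself.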

[CV-87] A Super-pixel-based Approach to the Stable Interpretation of Neural Networks BMVC2024

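[Quick read]: This paper addresses the instability of gradient-based saliency maps for neural network classifiers, which is caused by randomness in the training samples and the optimization algorithm. It proposes a new pixel partitioning strategy that groups pixels into super-pixels in order to reduce the variance of the saliency map and improve its generalization. Theoretical analysis and numerical experiments show that this grouping reduces the stochasticity of saliency maps and improves interpretation quality on CIFAR-10 and ImageNet.

Link: https://arxiv.org/abs/2412.14509
Authors: Shizhan Gong,Jingwei Zhang,Qi Dou,Farzan Farnia
Affiliations: Unknown
Keywords: computer vision community, neural network classifiers, interpreting neural network, Saliency maps, neural network decision
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: BMVC 2024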


Abstract:Saliency maps are widely used in the computer vision community for interpreting neural network classifiers. However, due to the randomness of training samples and optimization algorithms, the resulting saliency maps suffer from a significant level of stochasticity, making it difficult for domain experts to capture the intrinsic factors that influence the neural network’s decision. In this work, we propose a novel pixel partitioning strategy to boost the stability and generalizability of gradient-based saliency maps. Through both theoretical analysis and numerical experiments, we demonstrate that the grouping of pixels reduces the variance of the saliency map and improves the generalization behavior of the interpretation method. Furthermore, we propose a sensible grouping strategy based on super-pixels which cluster pixels into groups that align well with the semantic meaning of the images. We perform several numerical experiments on CIFAR-10 and ImageNet. Our empirical results suggest that the super-pixel-based interpretation maps consistently improve the stability and quality over the pixel-based saliency maps.
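The variance-reduction argument can be made concrete with a small sketch: given a per-pixel saliency map and a super-pixel label map, every pixel is replaced by the mean saliency of its group. Averaging within each group is an assumed pooling rule, and the inputs are toy values. The paper's segmentation method and exact aggregation are not stated in the abstract.

```python
def group_saliency(saliency, segments):
    """Replace each pixel's saliency with the mean over its super-pixel.
    `saliency` and `segments` are equally shaped 2D lists; `segments`
    holds an integer group label per pixel."""
    sums, counts = {}, {}
    for row_s, row_g in zip(saliency, segments):
        for value, label in zip(row_s, row_g):
            sums[label] = sums.get(label, 0.0) + value
            counts[label] = counts.get(label, 0) + 1
    means = {label: sums[label] / counts[label] for label in sums}
    return [[means[label] for label in row_g] for row_g in segments]


# A 2x2 map with two horizontal super-pixels (labels 0 and 1).
pixel_saliency = [[1.0, 3.0], [10.0, 20.0]]
superpixels = [[0, 0], [1, 1]]
smoothed = group_saliency(pixel_saliency, superpixels)
```

Averaging m roughly independent noisy pixel scores lowers the variance of the group score by about a factor of m, which is the intuition behind the stability claim.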

[CV-88] Content-style disentangled representation for controllable artistic image stylization and generation

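[Quick read]: This paper addresses two main problems of existing content and style disentanglement methods: 1) models support only a single modality (such as images) for style or content input; 2) incomplete disentanglement leads to semantic interference from the reference image. The key idea is a content-style representation disentangling method built on a multimodal dataset (WikiStyle+). Disentangled representations are learned by Q-Formers and injected into a pre-trained diffusion model through learnable multi-step cross-attention layers, which achieves a thorough separation of content and style and produces style-consistent, expressive artistic images.

Link: https://arxiv.org/abs/2412.14496
Authors: Ma Zhuoqi,Zhang Yixuan,You Zejun,Tian Long,Liu Xiyang
Affiliations: Unknown
Keywords: Controllable artistic image, artistic image stylization, style, content, aims to render
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: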


Abstract: Controllable artistic image stylization and generation aims to render the content provided by text or image with the learned artistic style, where content and style decoupling is the key to achieving satisfactory results. However, current methods for content and style disentanglement primarily rely on image information for supervision, which leads to two problems: 1) models can only support one modality for style or content input; 2) incomplete disentanglement results in semantic interference from the reference image. To address the above issues, this paper proposes a content-style representation disentangling method for controllable artistic image stylization and generation. We construct a WikiStyle+ dataset consisting of artworks with corresponding textual descriptions for style and content. Based on the multimodal dataset, we propose a diffusion model guided by disentangled content and style representations. The disentangled representations are first learned by Q-Formers and then injected into a pre-trained diffusion model using learnable multi-step cross-attention layers for better controllable stylization. This approach allows the model to accommodate inputs from different modalities. Experimental results show that our method achieves a thorough disentanglement of content and style in reference images under multimodal supervision, thereby enabling a harmonious integration of content and style in the generated outputs, successfully producing style-consistent and expressive stylized images.

[CV-89] Drive-1-to-3: Enriching Diffusion Priors for Novel View Synthesis of Real Vehicles

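[Quick read]: This paper addresses the significant performance drop that models trained on large-scale synthetic 3D data (such as Objaverse) show on real-world images, focusing on harvesting vehicle assets for autonomous driving applications. The key idea is a set of strategies that bridge the gap between synthetic data and real driving data: a virtual camera rotation of real images to ensure geometric alignment with synthetic data and consistency with the pose manifold defined by pretrained models; object-centric data curation that accounts for varying object distances in real driving scenes by learning across object scales with a fixed camera focal length; occlusion-aware training in latent space to handle the ubiquitous occlusions in real data; and a symmetric prior to handle large viewpoint changes. Together these strategies enable effective finetuning of large pretrained models and yield a 68.8% reduction in FID for novel view synthesis.

Link: https://arxiv.org/abs/2412.14494
Authors: Chuang Lin,Bingbing Zhuang,Shanlin Sun,Ziyu Jiang,Jianfei Cai,Manmohan Chandraker
Affiliations: Monash University; UC Irvine; UC San Diego; NEC Labs America
Keywords: pose-conditioned diffusion models, training pose-conditioned diffusion, advent of large-scale, recent advent, led to impressive
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: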


Abstract:The recent advent of large-scale 3D data, e.g. Objaverse, has led to impressive progress in training pose-conditioned diffusion models for novel view synthesis. However, due to the synthetic nature of such 3D data, their performance drops significantly when applied to real-world images. This paper consolidates a set of good practices to finetune large pretrained models for a real-world task – harvesting vehicle assets for autonomous driving applications. To this end, we delve into the discrepancies between the synthetic data and real driving data, then develop several strategies to account for them properly. Specifically, we start with a virtual camera rotation of real images to ensure geometric alignment with synthetic data and consistency with the pose manifold defined by pretrained models. We also identify important design choices in object-centric data curation to account for varying object distances in real driving scenes – learn across varying object scales with fixed camera focal length. Further, we perform occlusion-aware training in latent spaces to account for ubiquitous occlusions in real data, and handle large viewpoint changes by leveraging a symmetric prior. Our insights lead to effective finetuning that results in a 68.8% reduction in FID for novel view synthesis over prior arts.

[CV-90] QADM-Net: Quality-adaptive Dynamic Network for Reliable Multimodal Classification

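[Quick read]: This paper addresses the reliability of classification on multimodal data whose quality varies across samples. It proposes a new framework, the Quality-adaptive Dynamic Multimodal Network (QADM-Net), which dynamically adjusts the network's capacity and behaviour through two mechanisms: 1) a confidence-guided dynamic depths mechanism that adjusts network depth according to the difficulty of each sample, as determined by the quality of its modalities; 2) an informativeness-based dynamic parameters mechanism that lets the network perform a unique inference behaviour on each sample according to the quality variation in its feature vectors. By examining quality variation at both the modality and feature levels, QADM-Net adapts its capacity and behaviour to each sample and thereby improves the reliability of classification results.

Link: https://arxiv.org/abs/2412.14489
Authors: Shu Shen,Tong Zhang,C.L.Philip Chen
Affiliations: South China University of Technology
Keywords: Integrating complementary information, Integrating complementary, stronger expressive ability, complementary information, stronger expressive
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 11 pages, 5 figures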


Abstract:Integrating complementary information from different data modalities can yield representation with stronger expressive ability. However, data quality varies across multimodal samples, highlighting the need for learning reliable multimodal representations, especially in safety-critical applications. This paper focuses on an aspect that existing methods in this domain commonly overlook: the importance of network dynamics and adaptability in providing reliable results from diverse samples. Specifically, it highlights the model’s ability to dynamically adjust its capacity and behaviour according to different samples, using the adjusted network for predicting each sample. To this end, we propose a novel framework for multimodal reliable classification termed Quality-adaptive Dynamic Multimodal Network (QADM-Net). QADM-Net first introduces a confidence-guided dynamic depths mechanism to achieve the appropriate network capacity. This mechanism adjusts the network depth according to the difficulty of each sample, which is determined by the quality of its modalities. Subsequently, we develop an informativeness-based dynamic parameters mechanism that enables QADM-Net to perform unique inference behaviour on each of the diverse samples with feature-level quality variation presented in their feature vectors. In this way, QADM-Net adequately adapts its capacity and behaviour on each sample by investigating the quality variation of samples at both modality and feature levels, thus enhancing the reliability of classification results. Experiments conducted on four datasets demonstrate that QADM-Net significantly outperforms state-of-the-art methods in classification performance and exhibits strong adaptability to data with diverse quality.
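The abstract describes adjusting network depth according to sample difficulty but gives no procedure. One common way to realize confidence-guided depth is early exit: run layers one at a time and stop once an intermediate classifier is confident enough. The sketch below shows only that generic pattern. The threshold, the toy layers, and the toy classifier are hypothetical and should not be read as QADM-Net's actual mechanism.

```python
import math


def predict_with_dynamic_depth(x, layers, classifier, threshold=0.9):
    """Apply layers in order and stop as soon as the classifier's top
    probability reaches `threshold`. Returns (label, layers_used).
    Assumes `layers` is non-empty."""
    h = x
    for depth, layer in enumerate(layers, start=1):
        h = layer(h)
        probs = classifier(h)
        if max(probs) >= threshold:
            break
    return probs.index(max(probs)), depth


def toy_classifier(h):
    """Two-class probabilities from a scalar logit via the sigmoid."""
    p = 1.0 / (1.0 + math.exp(-h))
    return [1.0 - p, p]


# Each toy layer doubles the logit, so confidence grows with depth.
toy_layers = [lambda h: 2.0 * h] * 4

hard_sample = predict_with_dynamic_depth(0.5, toy_layers, toy_classifier)
easy_sample = predict_with_dynamic_depth(3.0, toy_layers, toy_classifier)
```

In this toy setup the weak input passes through more layers before the confidence threshold is met than the strong input does. That matches the qualitative behaviour the abstract attributes to the dynamic depths mechanism: more capacity for harder, lower-quality samples.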

[CV-91] Token Preference Optimization with Self-Calibrated Visual-Anchored Rewards for Hallucination Mitigation

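[Quick read]: This paper addresses two main drawbacks of existing methods for mitigating hallucinations in Large Vision Language Models (LVLMs): 1) the lack of scalable token-level rewards; 2) the neglect of visual-anchored tokens. It proposes a new Token Preference Optimization model (TPO) whose self-calibrated rewards adaptively attend to visually correlated tokens without fine-grained annotations. Specifically, TPO introduces a token-level visual-anchored reward, defined as the difference between the logistic distributions of generated tokens conditioned on the raw image and on a corrupted one. A visual-aware training objective is also proposed to make token-level optimization of visual-anchored tokens more accurate. Experiments show that TPO improves performance on hallucination benchmarks.

Link: https://arxiv.org/abs/2412.14487
Authors: Jihao Gu,Yingyao Wang,Meng Cao,Pi Bu,Jun Song,Yancheng He,Shilong Li,Bo Zheng
Affiliations: Alibaba Group; Mohamed bin Zayed University of Artificial Intelligence
Keywords: Large Vision Language, Vision Language Models, Direct Preference Optimization, Large Vision, Vision Language
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: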


Abstract: Direct Preference Optimization (DPO) has been demonstrated to be highly effective in mitigating hallucinations in Large Vision Language Models (LVLMs) by aligning their outputs more closely with human preferences. Despite the recent progress, existing methods suffer from two drawbacks: 1) lack of scalable token-level rewards; and 2) neglect of visual-anchored tokens. To this end, we propose a novel Token Preference Optimization model with self-calibrated rewards (dubbed TPO), which adaptively attends to visual-correlated tokens without fine-grained annotations. Specifically, we introduce a token-level visual-anchored reward as the difference of the logistic distributions of generated tokens conditioned on the raw image and the corrupted one. In addition, to highlight the informative visual-anchored tokens, a visual-aware training objective is proposed to enable more accurate token-level optimization. Extensive experimental results demonstrate the state-of-the-art performance of the proposed TPO. For example, building on top of LLAVA-1.5-7B, our TPO yields absolute performance improvements on hallucination benchmarks.
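The token-level reward can be illustrated with a short sketch. It assumes the two "distributions" are available as per-token log-probabilities of the same generated sequence, scored once with the raw image and once with a corrupted image, and that the reward is their plain difference. The abstract does not state the exact functional form, so the function name and the numbers below are illustrative only.

```python
def visual_anchored_rewards(logp_raw, logp_corrupted):
    """Per-token reward = log p(token | raw image) - log p(token | corrupted image).
    Tokens that depend on visual evidence lose probability when the image
    is corrupted and therefore receive a larger reward (assumed form)."""
    return [raw - cor for raw, cor in zip(logp_raw, logp_corrupted)]


# Hypothetical scores for the generated tokens "a", "red", "car".
tokens = ["a", "red", "car"]
logp_raw = [-0.5, -0.75, -0.25]
logp_corrupted = [-0.5, -3.25, -2.25]

rewards = visual_anchored_rewards(logp_raw, logp_corrupted)
# Tokens whose reward is positive are treated here as visually anchored.
anchored = [tok for tok, r in zip(tokens, rewards) if r > 0.0]
```

The function word "a" is equally likely with or without the image and receives zero reward, while the content words that describe the image receive positive reward. This is how such a signal can be computed without fine-grained annotations.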

[CV-92] DirectorLLM for Human-Centric Video Generation

【速读】：该论文试图解决生成式视频中高质量人类动作和交互的需求问题。解决方案的关键在于引入DirectorLLM，一个利用大型语言模型 (LLM) 来协调视频中人类姿态的新型视频生成模型。通过将LLM从文本生成器扩展为视频导演和人类动作模拟器，DirectorLLM能够生成详细的指令信号（如人类姿态），从而为以人为中心的场景创建信息丰富的轮廓。这些信号作为条件传递给视频渲染器，提升了视频生成中的人类动作真实性、提示忠实度和渲染主体的自然度。该模型作为一个独立的LLM模块，可以轻松应用于不同的视频渲染器（如UNet和DiT），并在自动评估基准和人类评估中表现出优于现有模型的性能。

Link: https://arxiv.org/abs/2412.14484
Authors: Kunpeng Song, Tingbo Hou, Zecheng He, Haoyu Ma, Jialiang Wang, Animesh Sinha, Sam Tsai, Yaqiao Luo, Xiaoliang Dai, Li Chen, Xide Xia, Peizhao Zhang, Peter Vajda, Ahmed Elgammal, Felix Juefei-Xu
Affiliations: Rutgers University; GenAI at Meta
Keywords: large language model, human motion, human, orchestrate human poses, employs a large
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:In this paper, we introduce DirectorLLM, a novel video generation model that employs a large language model (LLM) to orchestrate human poses within videos. As foundational text-to-video models rapidly evolve, the demand for high-quality human motion and interaction grows. To address this need and enhance the authenticity of human motions, we extend the LLM from a text generator to a video director and human motion simulator. Utilizing open-source resources from Llama 3, we train the DirectorLLM to generate detailed instructional signals, such as human poses, to guide video generation. This approach offloads the simulation of human motion from the video generator to the LLM, effectively creating informative outlines for human-centric scenes. These signals are used as conditions by the video renderer, facilitating more realistic and prompt-following video generation. As an independent LLM module, it can be applied to different video renderers, including UNet and DiT, with minimal effort. Experiments on automatic evaluation benchmarks and human evaluations show that our model outperforms existing ones in generating videos with higher human motion fidelity, improved prompt faithfulness, and enhanced rendered subject naturalness.

[CV-93] Promptable Representation Distribution Learning and Data Augmentation for Gigapixel Histopathology WSI Analysis AAAI2025

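[Quick read]: This paper addresses the difficulty of data augmentation when training multiple instance learning (MIL) models for whole slide image (WSI) analysis. Existing methods either add computational cost or lose semantic information, so they struggle to meet the efficiency and stability requirements of WSI model training. The proposed solution is a Promptable Representation Distribution Learning framework (PRDL), which uses prompts to guide data augmentation in feature space and thereby supports both patch-level representation learning and WSI-level data augmentation. The key is flexible, prompt-driven augmentation in feature space that keeps model training stable and efficient. Experiments show that the method outperforms existing state-of-the-art methods.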

Link: https://arxiv.org/abs/2412.14473
Authors: Kunming Tang, Zhiguo Jiang, Jun Shi, Wei Wang, Haibo Wu, Yushan Zheng
Affiliations: Unknown
Keywords: multiple instance learning, data augmentation, Gigapixel image analysis, relies on multiple, multiple instance
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by AAAI2025

Abstract:Gigapixel image analysis, particularly for whole slide images (WSIs), often relies on multiple instance learning (MIL). Under the paradigm of MIL, patch image representations are extracted and then fixed during the training of the MIL classifiers for efficiency consideration. However, the invariance of representations makes it difficult to perform data augmentation for WSI-level model training, which significantly limits the performance of the downstream WSI analysis. The current data augmentation methods for gigapixel images either introduce additional computational costs or result in a loss of semantic information, which is hard to meet the requirements for efficiency and stability needed for WSI model training. In this paper, we propose a Promptable Representation Distribution Learning framework (PRDL) for both patch-level representation learning and WSI-level data augmentation. Meanwhile, we explore the use of prompts to guide data augmentation in feature space, which achieves promptable data augmentation for training robust WSI-level models. The experimental results have demonstrated that the proposed method stably outperforms state-of-the-art methods.

[CV-94] DiffusionTrend: A Minimalist Approach to Virtual Fashion Try-On

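[Quick read]: This paper addresses the need to retrain diffusion models for virtual try-on. The key to the solution is DiffusionTrend, which builds on existing advanced diffusion models, capturing rich latent information to render garment details finely, and guides image generation during denoising with a precise garment mask produced by a lightweight convolutional neural network (CNN). The approach avoids resource-intensive retraining on large datasets, simplifies user input, and delivers a visually appealing try-on experience, showing the potential of training-free diffusion models for virtual try-on.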

Link: https://arxiv.org/abs/2412.14465
Authors: Wengyi Zhan, Mingbao Lin, Shuicheng Yan, Rongrong Ji
Affiliations: Xiamen University; Skywork AI
Keywords: diffusion models, diffusion, introduce DiffusionTrend, virtual fashion try-on, models
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:We introduce DiffusionTrend for virtual fashion try-on, which forgoes the need for retraining diffusion models. Using advanced diffusion models, DiffusionTrend harnesses latent information rich in prior information to capture the nuances of garment details. Throughout the diffusion denoising process, these details are seamlessly integrated into the model image generation, expertly directed by a precise garment mask crafted by a lightweight and compact CNN. Although our DiffusionTrend model initially demonstrates suboptimal metric performance, our exploratory approach offers some important advantages: (1) It circumvents resource-intensive retraining of diffusion models on large datasets. (2) It eliminates the necessity for various complex and user-unfriendly model inputs. (3) It delivers a visually compelling try-on experience, underscoring the potential of training-free diffusion model. This initial foray into the application of untrained diffusion models in virtual try-on technology potentially paves the way for further exploration and refinement in this industrially and academically valuable field.

[CV-95] LiftRefine: Progressively Refined View Synthesis from 3D Lifting with Volume-Triplane Representations

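[Quick read]: This paper addresses the generation of high-quality 3D neural fields from single-view or few-view images. The key to the solution is a two-stage method: a reconstruction model first lifts the input images to 3D space, producing a volume as the coarse-scale representation and a tri-plane as the fine-scale representation; a diffusion model then fills in missing details in occluded regions of the rendered images. To further improve the quality of the 3D representation and its renderings, the paper introduces a progressive refinement technique that iteratively applies the reconstruction and diffusion models to gradually synthesize novel views, achieving both multi-view consistency and sampling efficiency.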

Link: https://arxiv.org/abs/2412.14464
Authors: Tung Do, Thuan Hoang Nguyen, Anh Tuan Tran, Rang Nguyen, Binh-Son Hua
Affiliations: VinAI Research; Singapore University of Technology and Design
Keywords: view synthesis method, few-view input images, neural field, view synthesis, single or few-view
Subjects: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
Comments:

Abstract:We propose a new view synthesis method via synthesizing a 3D neural field from both single or few-view input images. To address the ill-posed nature of the image-to-3D generation problem, we devise a two-stage method that involves a reconstruction model and a diffusion model for view synthesis. Our reconstruction model first lifts one or more input images to the 3D space from a volume as the coarse-scale 3D representation followed by a tri-plane as the fine-scale 3D representation. To mitigate the ambiguity in occluded regions, our diffusion model then hallucinates missing details in the rendered images from tri-planes. We then introduce a new progressive refinement technique that iteratively applies the reconstruction and diffusion model to gradually synthesize novel views, boosting the overall quality of the 3D representations and their rendering. Empirical evaluation demonstrates the superiority of our method over state-of-the-art methods on the synthetic SRN-Car dataset, the in-the-wild CO3D dataset, and large-scale Objaverse dataset while achieving both sampling efficacy and multi-view consistency.

[CV-96] Affordance-Aware Object Insertion via Mask-Aware Dual Diffusion

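[Quick read]: This paper addresses the complex interplay between foreground objects and background scenes in image composition, in particular how to insert any object seamlessly into any scene according to the scene's affordance. The key to the solution is the affordance-aware object insertion task and the SAM-FB dataset, which contains more than 3 million examples. The paper also proposes the Mask-Aware Dual Diffusion (MADD) model, which uses a dual-stream architecture to denoise the RGB image and the insertion mask simultaneously, explicitly modeling the insertion mask in the diffusion process and thereby facilitating the notion of affordance. Experiments show that the method outperforms existing state-of-the-art methods in both performance and generalization.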

Link: https://arxiv.org/abs/2412.14462
Authors: Jixuan He, Wanhua Li, Ye Liu, Junsik Kim, Donglai Wei, Hanspeter Pfister
Affiliations: Harvard University; Cornell Tech; The Hong Kong Polytechnic University; Boston College
Keywords: composition involves integrating, involves integrating foreground, image editing operation, integrating foreground objects, common image editing
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Code is available at: this https URL. Project page at: this https URL

Abstract:As a common image editing operation, image composition involves integrating foreground objects into background scenes. In this paper, we expand the application of the concept of Affordance from human-centered image composition tasks to a more general object-scene composition framework, addressing the complex interplay between foreground objects and background scenes. Following the principle of Affordance, we define the affordance-aware object insertion task, which aims to seamlessly insert any object into any scene with various position prompts. To address the limited data issue and incorporate this task, we constructed the SAM-FB dataset, which contains over 3 million examples across more than 3,000 object categories. Furthermore, we propose the Mask-Aware Dual Diffusion (MADD) model, which utilizes a dual-stream architecture to simultaneously denoise the RGB image and the insertion mask. By explicitly modeling the insertion mask in the diffusion process, MADD effectively facilitates the notion of affordance. Extensive experimental results show that our method outperforms the state-of-the-art methods and exhibits strong generalization performance on in-the-wild images. Please refer to our code on this https URL.

[CV-97] LEDiff: Latent Exposure Diffusion for HDR Generation

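[Quick read]: This paper addresses the problem that existing image assets, such as generative AI content and internet photographs, are limited to 8-bit low dynamic range (LDR) and therefore cannot serve high dynamic range (HDR) applications. The key to the solution is LEDiff, which uses latent space fusion to enable a pretrained diffusion model to generate high-bit, high-dynamic-range content and to convert existing LDR images to HDR. Using a small HDR dataset, the method recovers detail and dynamic range in clipped highlights and shadows, enabling photorealistic HDR generation and LDR-to-HDR conversion.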

Link: https://arxiv.org/abs/2412.14456
Authors: Chao Wang, Zhihao Xia, Thomas Leimkuehler, Karol Myszkowski, Xuaner Zhang
Affiliations: MPI Informatik; Adobe
Keywords: consumer displays increasingly, displays increasingly support, content remain limited, dynamic range, HDR
Subjects: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
Comments:

Abstract:While consumer displays increasingly support more than 10 stops of dynamic range, most image assets such as internet photographs and generative AI content remain limited to 8-bit low dynamic range (LDR), constraining their utility across high dynamic range (HDR) applications. Currently, no generative model can produce high-bit, high-dynamic range content in a generalizable way. Existing LDR-to-HDR conversion methods often struggle to produce photorealistic details and physically-plausible dynamic range in the clipped areas. We introduce LEDiff, a method that enables a generative model with HDR content generation through latent space fusion inspired by image-space exposure fusion techniques. It also functions as an LDR-to-HDR converter, expanding the dynamic range of existing low-dynamic range images. Our approach uses a small HDR dataset to enable a pretrained diffusion model to recover detail and dynamic range in clipped highlights and shadows. LEDiff brings HDR capabilities to existing generative models and converts any LDR image to HDR, creating photorealistic HDR outputs for image generation, image-based lighting (HDR environment map generation), and photographic effects such as depth of field simulation, where linear HDR data is essential for realistic quality.
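
The abstract says LEDiff's latent space fusion is inspired by image-space exposure fusion. As background only, the sketch below shows a heavily simplified image-space version: each pixel is a weighted average across exposures, with weights that favor mid-range values. It is not LEDiff's latent fusion. The Gaussian "well-exposedness" weight with sigma 0.2 follows the classic exposure fusion formulation, while the function names and sample pixel values are hypothetical.

```python
import math


def well_exposedness(value, sigma=0.2):
    """Weight a pixel in [0, 1] by how close it is to mid-gray (0.5)."""
    return math.exp(-((value - 0.5) ** 2) / (2 * sigma ** 2))


def fuse_exposures(exposures):
    """Fuse equally sized lists of pixel values, one list per exposure."""
    fused = []
    for pixels in zip(*exposures):
        weights = [well_exposedness(p) for p in pixels]
        total = sum(weights)  # always > 0 because exp() is positive
        fused.append(sum(w * p for w, p in zip(weights, pixels)) / total)
    return fused


# Hypothetical three-pixel images: one underexposed, one overexposed.
under = [0.02, 0.10, 0.50]
over = [0.45, 0.90, 1.00]
fused = fuse_exposures([under, over])
```

Each fused pixel leans toward whichever exposure captured it without clipping: the first pixel follows the overexposed image, the last follows the underexposed one.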

[CV-98] Multimodal Latent Diffusion Model for Complex Sewing Pattern Generation

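[Quick read]: This paper addresses the lack of fine-grained control in previous sewing pattern generation methods when designing complex garments. The key to the solution is SewingLDM, a multi-modal generative model that controls sewing pattern generation through text prompts, body shapes, and garment sketches. The paper first extends the original vector representation of sewing patterns to cover more detail and compresses it into a compact latent space. It then designs a two-step training strategy that injects the multi-modal conditions (body shapes, text prompts, and garment sketches) into a diffusion model, so that generated garments both fit the body and offer detailed control. Experiments show that the method significantly outperforms previous approaches in complex garment design and adaptability to varied body shapes.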

Link: https://arxiv.org/abs/2412.14453
Authors: Shengqi Liu, Yuhao Cheng, Zhuo Chen, Xingyu Ren, Wenhan Zhu, Lincheng Li, Mengxiao Bi, Xiaokang Yang, Yichao Yan
Affiliations: MoE Key Lab of Artificial Intelligence, AI Institute, Shanghai Jiao Tong University, China; Xueshen AI; NetEase Fuxi AI Lab
Keywords: Generating sewing patterns, receiving increasing attention, increasing attention due, Generating sewing, flexible-editing nature
Subjects: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR); Machine Learning (cs.LG)
Comments: Our project page: this https URL

Abstract:Generating sewing patterns in garment design is receiving increasing attention due to its CG-friendly and flexible-editing nature. Previous sewing pattern generation methods have been able to produce exquisite clothing, but struggle to design complex garments with detailed control. To address these issues, we propose SewingLDM, a multi-modal generative model that generates sewing patterns controlled by text prompts, body shapes, and garment sketches. Initially, we extend the original vector of sewing patterns into a more comprehensive representation to cover more intricate details and then compress them into a compact latent space. To learn the sewing pattern distribution in the latent space, we design a two-step training strategy to inject the multi-modal conditions, i.e., body shapes, text prompts, and garment sketches, into a diffusion model, ensuring the generated garments are body-suited and detail-controlled. Comprehensive qualitative and quantitative experiments show the effectiveness of our proposed method, significantly surpassing previous approaches in terms of complex garment design and various body adaptability. Our project page: this https URL.

[CV-99] Color Enhancement for V-PCC Compressed Point Cloud via 2D Attribute Map Optimization

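[Quick read]: This paper addresses the degradation of color attributes caused by lossy compression in video-based point cloud compression (V-PCC). The key to the solution is a lightweight de-compression Unet (LDC-Unet) that improves color quality by optimizing the projection maps generated during V-PCC encoding. Specifically, LDC-Unet is a 2D neural network that optimizes the projection maps; the optimized 2D maps are then back-projected to 3D space to enhance the point cloud's color attributes. The paper also adopts a transfer learning strategy, building a customized natural image dataset for initial training and then fine-tuning on projection maps of compressed point clouds, which addresses the scarcity of point cloud training data. Experiments show that the method is effective in improving color quality.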

Link: https://arxiv.org/abs/2412.14449
Authors: Jingwei Bao, Yu Liu, Zeliang Li, Shuyuan Zhu, Siu-Kei Au Yeung
Affiliations: Unknown
Keywords: traditional video codecs, Video-based point cloud, dynamic point cloud, Video-based point, point cloud
Subjects: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
Comments: IEEE VCIP 2024

Abstract:Video-based point cloud compression (V-PCC) converts the dynamic point cloud data into video sequences using traditional video codecs for efficient encoding. However, this lossy compression scheme introduces artifacts that degrade the color attributes of the data. This paper introduces a framework designed to enhance the color quality in the V-PCC compressed point clouds. We propose the lightweight de-compression Unet (LDC-Unet), a 2D neural network, to optimize the projection maps generated during V-PCC encoding. The optimized 2D maps will then be back-projected to the 3D space to enhance the corresponding point cloud attributes. Additionally, we introduce a transfer learning strategy and develop a customized natural image dataset for the initial training. The model was then fine-tuned using the projection maps of the compressed point clouds. The whole strategy effectively addresses the scarcity of point cloud training data. Our experiments, conducted on the public 8i voxelized full bodies long sequences (8iVSLF) dataset, demonstrate the effectiveness of our proposed method in improving the color quality.

[CV-100] VLM-AD: End-to-End Autonomous Driving through Vision-Language Model Supervision

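[Quick read]: This paper addresses the lack of commonsense reasoning in existing end-to-end (E2E) autonomous driving (AD) models when they handle challenging driving scenarios. The key to the solution is VLM-AD, a method that uses vision-language models (VLMs) as teachers to provide additional supervision during training, comprising unstructured reasoning information and structured action labels, which helps the model learn richer feature representations that capture the rationale behind driving patterns. The method does not require a VLM at inference time, which keeps it practical for real-time deployment, and it significantly improves planning accuracy and reduces collision rates on the nuScenes dataset.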

Link: https://arxiv.org/abs/2412.14446
Authors: Yi Xu, Yuxin Hu, Zaiwei Zhang, Gregory P. Meyer, Siva Karthik Mustikovela, Siddhartha Srinivasa, Eric M. Wolff, Xin Huang
Affiliations: Cruise LLC; Northeastern University
Keywords: Human drivers rely, dynamic real-world scenarios, Human drivers, drivers rely, rely on commonsense
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:

Abstract:Human drivers rely on commonsense reasoning to navigate diverse and dynamic real-world scenarios. Existing end-to-end (E2E) autonomous driving (AD) models are typically optimized to mimic driving patterns observed in data, without capturing the underlying reasoning processes. This limitation constrains their ability to handle challenging driving scenarios. To close this gap, we propose VLM-AD, a method that leverages vision-language models (VLMs) as teachers to enhance training by providing additional supervision that incorporates unstructured reasoning information and structured action labels. Such supervision enhances the model’s ability to learn richer feature representations that capture the rationale behind driving patterns. Importantly, our method does not require a VLM during inference, making it practical for real-time deployment. When integrated with state-of-the-art methods, VLM-AD achieves significant improvements in planning accuracy and reduced collision rates on the nuScenes dataset.

[CV-101] GenHMR: Generative Human Mesh Recovery

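[Quick read]: This paper addresses depth ambiguity and occlusion in human mesh recovery (HMR) from monocular images, an ill-posed problem. The key to the solution is GenHMR, a new generative framework that reformulates monocular HMR as an image-conditioned generative task, explicitly modeling and mitigating uncertainty in the 2D-to-3D mapping. GenHMR has two core components: (1) a pose tokenizer that converts 3D human poses into a sequence of discrete tokens in a latent space, and (2) an image-conditional masked transformer that learns the probability distribution of the pose tokens, conditioned on the input image prompt and a randomly masked token sequence. During inference, the model samples from the learned conditional distribution and iteratively decodes high-confidence pose tokens, reducing 3D reconstruction uncertainty. The paper also proposes a 2D pose-guided refinement technique that further optimizes the decoded pose tokens so that they align with 2D pose cues. Experiments show that GenHMR significantly outperforms existing state-of-the-art methods.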

Link: https://arxiv.org/abs/2412.14444
Authors: Muhammad Usama Saleem, Ekkasit Pinyoanuntapong, Pu Wang, Hongfei Xue, Srijan Das, Chen Chen
Affiliations: Department of Electrical and Computer Engineering, University of Florida, Gainesville, FL, USA; Department of Computer Science and Engineering, University of South Florida, Tampa, FL, USA
Keywords: computer vision applications, Human mesh recovery, vision applications, arts and entertainment, computer vision
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Graphics (cs.GR); Machine Learning (cs.LG)
Comments:

Abstract:Human mesh recovery (HMR) is crucial in many computer vision applications, from health to arts and entertainment. HMR from monocular images has predominantly been addressed by deterministic methods that output a single prediction for a given 2D image. However, HMR from a single image is an ill-posed problem due to depth ambiguity and occlusions. Probabilistic methods have attempted to address this by generating and fusing multiple plausible 3D reconstructions, but their performance has often lagged behind deterministic approaches. In this paper, we introduce GenHMR, a novel generative framework that reformulates monocular HMR as an image-conditioned generative task, explicitly modeling and mitigating uncertainties in the 2D-to-3D mapping process. GenHMR comprises two key components: (1) a pose tokenizer to convert 3D human poses into a sequence of discrete tokens in a latent space, and (2) an image-conditional masked transformer to learn the probabilistic distributions of the pose tokens, conditioned on the input image prompt along with a randomly masked token sequence. During inference, the model samples from the learned conditional distribution to iteratively decode high-confidence pose tokens, thereby reducing 3D reconstruction uncertainties. To further refine the reconstruction, a 2D pose-guided refinement technique is proposed to directly fine-tune the decoded pose tokens in the latent space, which forces the projected 3D body mesh to align with the 2D pose clues. Experiments on benchmark datasets demonstrate that GenHMR significantly outperforms state-of-the-art methods. The project website can be found at this https URL.
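
The inference procedure in the abstract, iteratively decoding high-confidence tokens from a fully masked sequence, can be sketched generically as follows. This is not the authors' implementation: the unmasking schedule, the `predict` interface, and the `toy_predict` stand-in for the masked transformer are all assumptions made for illustration, and the toy predictor is deterministic rather than sampled.

```python
import math

MASK = None  # placeholder for a not-yet-decoded token


def iterative_decode(predict, length, steps):
    """Fill a masked sequence over several rounds, most confident tokens first.

    predict(tokens) must return one (token_id, confidence) pair per position.
    """
    tokens = [MASK] * length
    for step in range(steps):
        masked = [i for i, t in enumerate(tokens) if t is MASK]
        if not masked:
            break
        proposals = predict(tokens)
        # Spread the remaining positions evenly over the remaining rounds.
        keep = math.ceil(len(masked) / (steps - step))
        best = sorted(masked, key=lambda i: proposals[i][1], reverse=True)[:keep]
        for i in best:
            tokens[i] = proposals[i][0]
    return tokens


def toy_predict(tokens):
    """Hypothetical model: position i proposes token 10*i, earlier positions are surer."""
    known = sum(t is not MASK for t in tokens)
    return [(10 * i, 1.0 / (1 + i) + 0.1 * known) for i in range(len(tokens))]


decoded = iterative_decode(toy_predict, length=6, steps=3)
```

With six positions and three rounds, two tokens are committed per round; positions that were committed earlier stay fixed and become context for later predictions.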

[CV-102] IntroStyle: Training-Free Introspective Style Attribution using Diffusion Features

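[Quick read]: This paper addresses the risk that text-to-image (T2I) models infringe artists' copyright and privacy when generating content, in particular how to prevent the generation of specific artistic styles in order to protect intellectual property. The key to the solution is a training-free framework called introspective style attribution (IntroStyle), which performs style attribution using only the features produced by a diffusion model, without external modules or retraining, making it resource-efficient and suitable for real-time use. The paper also introduces a synthetic dataset, Style Hacks (SHacks), to isolate artistic style and evaluate fine-grained style attribution performance.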

Link: https://arxiv.org/abs/2412.14432
Authors: Anand Kumar, Jiteng Mu, Nuno Vasconcelos
Affiliations: University of California, San Diego
Keywords: gained widespread adoption, general public, gained widespread, widespread adoption, adoption among content
Subjects: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
Comments: 16 pages, 17 figures

Abstract:Text-to-image (T2I) models have gained widespread adoption among content creators and the general public. However, this has sparked significant concerns regarding data privacy and copyright infringement among artists. Consequently, there is an increasing demand for T2I models to incorporate mechanisms that prevent the generation of specific artistic styles, thereby safeguarding intellectual property rights. Existing methods for style extraction typically necessitate the collection of custom datasets and the training of specialized models. This, however, is resource-intensive, time-consuming, and often impractical for real-time applications. Moreover, it may not adequately address the dynamic nature of artistic styles and the rapidly evolving landscape of digital art. We present a novel, training-free framework to solve the style attribution problem, using the features produced by a diffusion model alone, without any external modules or retraining. This is denoted as introspective style attribution (IntroStyle) and demonstrates superior performance to state-of-the-art models for style retrieval. We also introduce a synthetic dataset of Style Hacks (SHacks) to isolate artistic style and evaluate fine-grained style attribution performance.

[CV-103] WildSAT: Learning Satellite Image Representations from Wildlife Observations

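[Quick read]: This paper addresses how species distribution information can improve satellite image representation learning. The key to the solution is WildSAT, which uses a contrastive learning framework to combine satellite images with the abundant geo-tagged wildlife observations on citizen science platforms, together with species distribution maps and text descriptions of habitat and range. The approach significantly improves downstream performance for both randomly initialized and pre-trained models and enables zero-shot retrieval from general descriptions. Compared with other forms of cross-modal supervision (such as aligning satellite images with ground images or wildlife photos), WildSAT learns better representations, and an analysis of its design choices supports the general applicability of the approach.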

Link: https://arxiv.org/abs/2412.14428
Authors: Rangel Daroya, Elijah Cole, Oisin Mac Aodha, Grant Van Horn, Subhransu Maji
Affiliations: University of Massachusetts, Amherst; GenBio AI; University of Edinburgh
Keywords: satellite image, satellite, species reveal, images, geographic location
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Quantitative Methods (q-bio.QM)
Comments:

Abstract:What does the presence of a species reveal about a geographic location? We posit that habitat, climate, and environmental preferences reflected in species distributions provide a rich source of supervision for learning satellite image representations. We introduce WildSAT, which pairs satellite images with millions of geo-tagged wildlife observations readily-available on citizen science platforms. WildSAT uses a contrastive learning framework to combine information from species distribution maps with text descriptions that capture habitat and range details, alongside satellite images, to train or fine-tune models. On a range of downstream satellite image recognition tasks, this significantly improves the performance of both randomly initialized models and pre-trained models from sources like ImageNet or specialized satellite image datasets. Additionally, the alignment with text enables zero-shot retrieval, allowing for search based on general descriptions of locations. We demonstrate that WildSAT achieves better representations than recent methods that utilize other forms of cross-modal supervision, such as aligning satellite images with ground images or wildlife photos. Finally, we analyze the impact of various design choices on downstream performance, highlighting the general applicability of our approach.
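
The abstract names a contrastive learning framework but not a specific loss. A minimal sketch, assuming an InfoNCE-style objective over paired embeddings, looks like the following; the function `info_nce`, the temperature value, and the two-dimensional toy embeddings are assumptions for illustration, not details from the paper.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def normalize(v):
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def info_nce(image_embs, paired_embs, temperature=0.1):
    """Average loss that pulls image i toward paired item i and away from the rest."""
    images = [normalize(v) for v in image_embs]
    others = [normalize(v) for v in paired_embs]
    loss = 0.0
    for i, image in enumerate(images):
        logits = [dot(image, other) / temperature for other in others]
        peak = max(logits)  # subtract the max for a stable log-sum-exp
        log_denominator = peak + math.log(sum(math.exp(l - peak) for l in logits))
        loss += log_denominator - logits[i]
    return loss / len(images)


# Hypothetical embeddings: satellite images vs. species/text embeddings.
aligned = info_nce([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
shuffled = info_nce([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
```

The loss is near zero when each satellite embedding matches its own partner and large when the pairing is shuffled, which is the signal that drives the two modalities into a shared space.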

[CV-104] FedPIA – Permuting and Integrating Adapters leveraging Wasserstein Barycenters for Finetuning Foundation Models in Multi-Modal Federated Learning AAAI2025

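[Quick read]: This paper addresses the difficulty of collecting large-scale data for fine-tuning Vision-Language Models (VLMs) in healthcare, where privacy regulations are strict. The key to the solution is a new framework called FedPIA, which permutes and integrates the local adapters in the server and the global adapters in the clients, using Wasserstein barycenters to better blend client-specific and client-agnostic knowledge. Specifically, FedPIA applies layerwise permutation to bridge the gap between the parameter spaces of local and global adapters, improving convergence under data and task heterogeneity on top of federated learning (FL) and parameter-efficient fine-tuning (PEFT). Experiments show that FedPIA consistently outperforms existing PEFT-FL baselines across a range of medical vision-language task settings.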

Link: https://arxiv.org/abs/2412.14424
Authors: Pramit Saha, Divyanshu Mishra, Felix Wagner, Konstantinos Kamnitsas, J. Alison Noble
Affiliations: University of Oxford; Imperial College London; DeepMind
Keywords: require large text, Models typically require, typically require large, Large Vision-Language Models, require large
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Accepted for publication in AAAI 2025 (Main Track)

Abstract:Large Vision-Language Models typically require large text and image datasets for effective fine-tuning. However, collecting data from various sites, especially in healthcare, is challenging due to strict privacy regulations. An alternative is to fine-tune these models on end-user devices, such as in medical clinics, without sending data to a server. These local clients typically have limited computing power and small datasets, which are not enough for fully fine-tuning large VLMs on their own. A naive solution to these scenarios is to leverage parameter-efficient fine-tuning (PEFT) strategies and apply federated learning (FL) algorithms to combine the learned adapter weights, thereby respecting the resource limitations and data privacy. However, this approach does not fully leverage the knowledge from multiple adapters trained on diverse data distributions and for diverse tasks. The adapters are adversely impacted by data heterogeneity and task heterogeneity across clients resulting in suboptimal convergence. To this end, we propose a novel framework called FedPIA that improves upon the naive combinations of FL and PEFT by introducing Permutation and Integration of the local Adapters in the server and global Adapters in the clients exploiting Wasserstein barycenters for improved blending of client-specific and client-agnostic knowledge. This layerwise permutation helps to bridge the gap in the parameter space of local and global adapters before integration. We conduct over 2000 client-level experiments utilizing 48 medical image datasets across five different medical vision-language FL task settings encompassing visual question answering as well as image and report-based multi-label disease detection. Our experiments involving diverse client settings, ten different modalities, and two VLM backbones demonstrate that FedPIA consistently outperforms the state-of-the-art PEFT-FL baselines.
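
The idea of permuting adapters before integrating them can be shown with a toy example. The sketch below is a simplification, not FedPIA itself: it swaps the Wasserstein barycenter for a brute-force search over row permutations followed by a plain weighted average, and it treats one adapter as a small list of unit weight vectors. All names and values are hypothetical.

```python
from itertools import permutations


def sq_dist(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))


def best_permutation(local_rows, global_rows):
    """Find the order in which local units best line up with global units.

    Brute force, so only suitable for tiny examples; larger adapters would
    need a proper assignment solver.
    """
    n = len(global_rows)
    return min(
        permutations(range(n)),
        key=lambda order: sum(
            sq_dist(local_rows[order[k]], global_rows[k]) for k in range(n)
        ),
    )


def permute_and_integrate(local_rows, global_rows, weight=0.5):
    """Align local units to global ones, then blend them elementwise."""
    order = best_permutation(local_rows, global_rows)
    aligned = [local_rows[j] for j in order]
    return [
        [weight * a + (1 - weight) * g for a, g in zip(a_row, g_row)]
        for a_row, g_row in zip(aligned, global_rows)
    ]


global_adapter = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
local_adapter = [[0.0, 1.0], [1.0, 1.0], [1.0, 0.0]]  # same units, shuffled
merged = permute_and_integrate(local_adapter, global_adapter)
naive = [
    [0.5 * a + 0.5 * g for a, g in zip(a_row, g_row)]
    for a_row, g_row in zip(local_adapter, global_adapter)
]
```

Averaging without alignment (`naive`) blurs units that merely sit at different indices, while aligning first recovers the shared structure. In a real network, permuting one layer's units also requires the matching permutation in the next layer, which this sketch omits.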

[CV-105] Enhancing Diffusion Models for High-Quality Image Generation

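[Quick read]: This paper aims to address the efficiency and quality challenges that generative AI models face when producing high-quality images, particularly across diverse datasets. The key to the solution is to improve generative capability and inference speed with techniques such as Classifier-Free Guidance (CFG), Latent Diffusion Models with Variational Autoencoders (VAE), and alternative noise scheduling strategies. Experiments show that DDIM combined with CFG achieves faster inference and better image quality (measured, for example, by Frechet Inception Distance, FID), while challenges with VAE and noise scheduling point to directions for future optimization.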

Link: https://arxiv.org/abs/2412.14422
Authors: Jaineet Shah, Michael Gromis, Rickston Pinto
Affiliations: Unknown
Keywords: Denoising Diffusion Probabilistic, Denoising Diffusion Implicit, Diffusion Probabilistic Models, Diffusion Implicit Models, Denoising Diffusion
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:This report presents the comprehensive implementation, evaluation, and optimization of Denoising Diffusion Probabilistic Models (DDPMs) and Denoising Diffusion Implicit Models (DDIMs), which are state-of-the-art generative models. During inference, these models take random noise as input and iteratively generate high-quality images as output. The study focuses on enhancing their generative capabilities by incorporating advanced techniques such as Classifier-Free Guidance (CFG), Latent Diffusion Models with Variational Autoencoders (VAE), and alternative noise scheduling strategies. The motivation behind this work is the growing demand for efficient and scalable generative AI models that can produce realistic images across diverse datasets, addressing challenges in applications such as art creation, image synthesis, and data augmentation. Evaluations were conducted on datasets including CIFAR-10 and ImageNet-100, with a focus on improving inference speed, computational efficiency, and image quality metrics like Frechet Inception Distance (FID). Results demonstrate that DDIM + CFG achieves faster inference and superior image quality. Challenges with VAE and noise scheduling are also highlighted, suggesting opportunities for future optimization. This work lays the groundwork for developing scalable, efficient, and high-quality generative AI systems to benefit industries ranging from entertainment to robotics.
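
Two of the ingredients named in the abstract have compact standard forms. The sketch below shows one common convention for classifier-free guidance, which extrapolates from the unconditional noise prediction toward the conditional one, and the linear beta schedule with its cumulative product as used in the original DDPM setup. The report's own implementation is not shown in the abstract, so the function names and the choice of convention are assumptions.

```python
def cfg_combine(eps_uncond, eps_cond, guidance_scale):
    """Classifier-free guidance: eps = eps_uncond + w * (eps_cond - eps_uncond).

    w = 0 gives the unconditional prediction, w = 1 the conditional one,
    and w > 1 pushes further in the conditional direction.
    """
    return [u + guidance_scale * (c - u) for u, c in zip(eps_uncond, eps_cond)]


def linear_alpha_bar(num_steps, beta_start=1e-4, beta_end=0.02):
    """Cumulative product of (1 - beta_t) for a linear beta schedule (num_steps >= 2)."""
    alpha_bar = 1.0
    values = []
    for t in range(num_steps):
        beta = beta_start + (beta_end - beta_start) * t / (num_steps - 1)
        alpha_bar *= 1.0 - beta
        values.append(alpha_bar)
    return values


# Hypothetical two-element noise predictions from the same network.
guided = cfg_combine([0.0, 1.0], [1.0, 1.0], guidance_scale=2.0)
schedule = linear_alpha_bar(1000)
```

The guided prediction changes only where the conditional and unconditional outputs disagree. The schedule values fall monotonically from just under 1 toward 0, and an alternative noise schedule would replace `linear_alpha_bar` while leaving the guidance step unchanged.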

[CV-106] An Immersive Multi-Elevation Multi-Seasonal Dataset for 3D Reconstruction and Visualization

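[Quick read]: This paper addresses the lack of a dataset for comprehensively evaluating progress in scene reconstruction. The key to the solution is a collection of imagery of the Johns Hopkins Homewood Campus, captured in different seasons, at different times of day, at multiple elevations, and across a large scale, together with a multi-stage calibration process that efficiently recovers camera parameters for phone and drone cameras. The dataset lets researchers explore challenges in unconstrained settings, such as inconsistent illumination, large-scale reconstruction, and reconstruction from significantly different perspectives.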

Link: https://arxiv.org/abs/2412.14418
Authors: Xijun Liu, Yifan Zhou, Yuxiang Guo, Rama Chellappa, Cheng Peng
Affiliations: Johns Hopkins University
Keywords: photo-realistic scene reconstruction, Significant progress, recent years, Hopkins Homewood Campus, Johns Hopkins Homewood
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 4 pages, 3 figures

Abstract:Significant progress has been made in photo-realistic scene reconstruction over recent years. Various disparate efforts have enabled capabilities such as multi-appearance or large-scale modeling; however, a well-designed dataset that can evaluate the holistic progress of scene reconstruction is still lacking. We introduce a collection of imagery of the Johns Hopkins Homewood Campus, acquired in different seasons, at different times of day, at multiple elevations, and across a large scale. We perform a multi-stage calibration process, which efficiently recovers camera parameters from phone and drone cameras. This dataset can enable researchers to rigorously explore challenges in unconstrained settings, including the effects of inconsistent illumination and reconstruction at large scale and from significantly different perspectives.

[CV-107] DriveGPT: Scaling Autoregressive Behavior Models for Driving

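[Quick read]: This paper addresses behavior modeling for autonomous driving and proposes DriveGPT, a scalable behavior model. The key to the solution is to model driving as a sequential decision-making task and to train a transformer to predict future agent states autoregressively. By scaling model parameters and training data by multiple orders of magnitude, the paper explores scaling properties with respect to dataset size, model parameters, and compute. DriveGPT is evaluated at different scales on a planning task, and in a separate prediction task it outperforms a state-of-the-art baseline, with further gains from pretraining on a large-scale dataset that validate the benefits of data scaling.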

Link: https://arxiv.org/abs/2412.14415
Authors: Xin Huang, Eric M. Wolff, Paul Vernaza, Tung Phan-Minh, Hongge Chen, David S. Hayden, Mark Edmonds, Brian Pierce, Xinxin Chen, Pratik Elias Jacob, Xiaobai Chen, Chingiz Tairbekov, Pratik Agarwal, Tianshi Gao, Yuning Chai, Siddhartha Srinivasa
Affiliations: Cruise LLC
Keywords: scalable behavior model, scalable behavior, autonomous driving, behavior model, model
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments: 14 pages, 16 figures, 9 tables, and 1 video link

Abstract:We present DriveGPT, a scalable behavior model for autonomous driving. We model driving as a sequential decision making task, and learn a transformer model to predict future agent states as tokens in an autoregressive fashion. We scale up our model parameters and training data by multiple orders of magnitude, enabling us to explore the scaling properties in terms of dataset size, model parameters, and compute. We evaluate DriveGPT across different scales in a planning task, through both quantitative metrics and qualitative examples including closed-loop driving in complex real-world scenarios. In a separate prediction task, DriveGPT outperforms a state-of-the-art baseline and exhibits improved performance by pretraining on a large-scale dataset, further validating the benefits of data scaling.

[CV-108] Enhancing Fingerprint Recognition Systems: Comparative Analysis of Biometric Authentication Algorithms and Techniques for Improved Accuracy and Reliability

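[Quick read]: This paper aims to improve the accuracy and robustness of fingerprint recognition systems. The key idea is to combine Convolutional Neural Networks (CNNs) with Gabor filters. Using the Sokoto Coventry Fingerprint Dataset, the study evaluates several classification algorithms and finds that the CNN-based approach reaches 94% overall accuracy, and that the CNN-Gabor combination shows promise in recognizing altered fingerprints. The study also explores hybrid approaches that combine multiple classifiers; the results are mixed, but they point to the potential of deep learning for fingerprint recognition. Overall, the work pairs a traditional feature extraction method with a deep learning architecture and offers practical guidance for optimizing fingerprint recognition systems.

Link: https://arxiv.org/abs/2412.14404
Authors: Temirlan Meiramkhanov, Arailym Tleubayeva
Affiliations: Astana IT University
Keywords: Convolutional Neural Networks, providing indispensable security, indispensable security measures, integrating Convolutional Neural, Fingerprint recognition
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: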

Abstract:Fingerprint recognition systems stand as pillars in the realm of biometric authentication, providing indispensable security measures across various domains. This study investigates integrating Convolutional Neural Networks (CNNs) with Gabor filters to improve fingerprint recognition accuracy and robustness. Leveraging a diverse dataset sourced from the Sokoto Coventry Fingerprint Dataset, our experiments meticulously evaluate the efficacy of different classification algorithms. Our findings underscore the supremacy of CNN-based approaches, boasting an impressive overall accuracy of 94%. Furthermore, the amalgamation of Gabor filters with CNN architectures unveils promising strides in discerning altered fingerprints, illuminating new pathways for enhancing biometric authentication systems. While the CNN-Gabor fusion showcases commendable performance, our exploration of hybrid approaches combining multiple classifiers reveals nuanced outcomes. Despite these mixed results, our study illuminates the transformative potential of deep learning methodologies in reshaping the landscape of fingerprint recognition. Through rigorous experimentation and insightful analysis, this research not only contributes to advancing biometric authentication technologies but also sheds light on the intricate interplay between traditional feature extraction methods and cutting-edge deep learning architectures. These findings offer actionable insights for optimizing fingerprint recognition systems for real-world deployment, paving the way for enhanced security and reliability in diverse applications.
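
As a rough illustration of the Gabor filtering step named in the abstract, the sketch below builds a small bank of real-valued 2D Gabor kernels in plain Python using the standard Gabor formula (a Gaussian envelope multiplied by a cosine carrier). The function name `gabor_kernel` and every parameter value (kernel size, `sigma`, `theta`, `wavelength`, `gamma`, `psi`) are assumptions chosen for illustration; the abstract does not state the paper's filter settings or how the filter responses are passed to the CNN.

```python
import math

def gabor_kernel(size=7, sigma=2.0, theta=0.0, wavelength=4.0, gamma=0.5, psi=0.0):
    """Return a size x size real Gabor kernel as a list of rows (size should be odd)."""
    half = size // 2
    kernel = []
    for y in range(-half, half + 1):
        row = []
        for x in range(-half, half + 1):
            # Rotate the coordinates by the filter orientation theta.
            xr = x * math.cos(theta) + y * math.sin(theta)
            yr = -x * math.sin(theta) + y * math.cos(theta)
            # Gaussian envelope with aspect ratio gamma.
            envelope = math.exp(-(xr ** 2 + (gamma * yr) ** 2) / (2 * sigma ** 2))
            # Cosine carrier with the given wavelength and phase offset.
            carrier = math.cos(2 * math.pi * xr / wavelength + psi)
            row.append(envelope * carrier)
        kernel.append(row)
    return kernel

# Hypothetical bank of four orientations (0, 45, 90, 135 degrees).
bank = [gabor_kernel(theta=k * math.pi / 4) for k in range(4)]
```

In a pipeline of this kind, each kernel would typically be convolved with the fingerprint image and the responses stacked as input channels, although that wiring is a design choice rather than something the abstract specifies.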

[CV-109] The One RING: a Robotic Indoor Navigation Generalist

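[Quick read]: This paper addresses the generalization of robot navigation policies: most current policies are embodiment-specific and transfer poorly to robots with a different embodiment. The proposed solution is RING (Robotic Indoor Navigation Generalist), an embodiment-agnostic navigation policy trained at scale in simulation on diverse, randomly initialized embodiments. Specifically, the authors augment the AI2-THOR simulator so that it can instantiate robot embodiments with controllable configurations, varying body size, rotation pivot point, and camera configuration. On the visual object-goal navigation task, RING performs robustly on unseen real robot platforms, with average success rates of 72.1% in simulation and 78.9% in the real world.

Link: https://arxiv.org/abs/2412.14401
Authors: Ainaz Eftekhar, Luca Weihs, Rose Hendrix, Ege Caglar, Jordi Salvador, Alvaro Herrasti, Winson Han, Eli VanderBil, Aniruddha Kembhavi, Ali Farhadi, Ranjay Krishna, Kiana Ehsani, Kuo-Hao Zeng
Affiliations: University of Washington; Allen Institute for AI
Keywords: Modern robots vary, robots vary significantly, Modern robots, significantly in shape, vary significantly
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Comments: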

Abstract:Modern robots vary significantly in shape, size, and sensor configurations used to perceive and interact with their environments. However, most navigation policies are embodiment-specific; a policy learned using one robot’s configuration does not typically gracefully generalize to another. Even small changes in the body size or camera viewpoint may cause failures. With the recent surge in custom hardware developments, it is necessary to learn a single policy that can be transferred to other embodiments, eliminating the need to (re)train for each specific robot. In this paper, we introduce RING (Robotic Indoor Navigation Generalist), an embodiment-agnostic policy, trained solely in simulation with diverse randomly initialized embodiments at scale. Specifically, we augment the AI2-THOR simulator with the ability to instantiate robot embodiments with controllable configurations, varying across body size, rotation pivot point, and camera configurations. In the visual object-goal navigation task, RING achieves robust performance on real unseen robot platforms (Stretch RE-1, LoCoBot, Unitree’s Go1), achieving an average of 72.1% and 78.9% success rate across 5 embodiments in simulation and 4 robot platforms in the real world. (project website: this https URL)

[CV-110] I0T: Embedding Standardization Method Towards Zero Modality Gap

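[Quick read]: This paper addresses the modality gap that appears in work extending Contrastive Language-Image Pretraining (CLIP), where image and text embeddings are projected onto disparate manifolds, deviating from the intended objective of image-text contrastive learning. Two methods are proposed: (1) a post-hoc embedding standardization method, $\text{I0T}_{\text{post}}$, which reduces the modality gap to approximately zero; and (2) a trainable method, $\text{I0T}_{\text{async}}$, which alleviates the gap by adding two normalization layers for each encoder. Together they form the I0T framework, which significantly reduces the modality gap while preserving the original embedding representations of the pretrained models. In addition, $\text{I0T}_{\text{post}}$ can serve as an explainable automatic evaluation metric, as an alternative to the widely used CLIPScore (CLIP-S).

Link: https://arxiv.org/abs/2412.14384
Authors: Na Min An, Eunki Kim, James Thorne, Hyunjung Shim
Affiliations: KAIST AI; KAIST
Keywords: Contrastive Language-Image Pretraining, enables zero-shot inference, Language-Image Pretraining, modality gap, image-text contrastive learning
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: 16 figures, 8 figures, 7 tables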

Abstract:Contrastive Language-Image Pretraining (CLIP) enables zero-shot inference in downstream tasks such as image-text retrieval and classification. However, recent works extending CLIP suffer from the issue of modality gap, which arises when the image and text embeddings are projected to disparate manifolds, deviating from the intended objective of image-text contrastive learning. We discover that this phenomenon is linked to the modality-specific characteristic that each image/text encoder independently possesses and propose two methods to address the modality gap: (1) a post-hoc embedding standardization method, $\text{I0T}_{\text{post}}$, that reduces the modality gap approximately to zero and (2) a trainable method, $\text{I0T}_{\text{async}}$, to alleviate the modality gap problem by adding two normalization layers for each encoder. Our I0T framework can significantly reduce the modality gap while preserving the original embedding representations of trained models with their locked parameters. In practice, $\text{I0T}_{\text{post}}$ can serve as an alternative explainable automatic evaluation metric of widely used CLIPScore (CLIP-S).
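
To make the idea of a post-hoc embedding standardization concrete, here is a minimal sketch under stated assumptions: the modality gap is measured as the Euclidean distance between the centroids of the image and text embeddings (a common definition, not confirmed by the abstract), and standardization is taken to mean subtracting each modality's mean vector and re-normalizing to unit length. The helper names and the toy 2D embeddings are hypothetical, and the paper's exact procedure may differ.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)

def centroid(vectors):
    """Component-wise mean of a list of vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def standardize(vectors):
    """Assumed post-hoc step: remove the modality mean, then re-normalize."""
    mu = centroid(vectors)
    return [l2_normalize([x - m for x, m in zip(v, mu)]) for v in vectors]

def modality_gap(image_emb, text_emb):
    """Assumed gap measure: distance between the two modality centroids."""
    return math.dist(centroid(image_emb), centroid(text_emb))

# Toy embeddings: the two modalities sit in opposite regions of the unit circle.
image_emb = [l2_normalize(v) for v in [[1.0, 0.2], [1.0, -0.2]]]
text_emb = [l2_normalize(v) for v in [[-1.0, 0.3], [-1.0, -0.3]]]

gap_before = modality_gap(image_emb, text_emb)
gap_after = modality_gap(standardize(image_emb), standardize(text_emb))
```

Because the step only uses statistics of embeddings that already exist, it needs no retraining, which matches the "post-hoc" description in the abstract.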

[CV-111] HA-RDet: Hybrid Anchor Rotation Detector for Oriented Object Detection

[Quick read]: This paper addresses object detection in aerial images, which is difficult because objects vary widely in size and orientation. The key contribution is the Hybrid-Anchor Rotation Detector (HA-RDet), which combines the advantages of anchor-based and anchor-free mechanisms. HA-RDet uses only one preset anchor per feature-map location and refines these anchors with an Orientation-Aware Convolution technique, which keeps detection accuracy high while significantly reducing computational cost. It reaches competitive accuracies of 75.41 mAP on DOTA-v1, 65.3 mAP on DIOR-R, and 90.2 mAP on HRSC2016.

Link: https://arxiv.org/abs/2412.14379
Authors: Phuc D.A. Nguyen
Affiliations: University of Information Technology
Keywords: aerial images poses, significant challenge due, sizes and orientations, Oriented object detection, aerial images
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Bachelor thesis

Abstract:Oriented object detection in aerial images poses a significant challenge due to their varying sizes and orientations. Current state-of-the-art detectors typically rely on either two-stage or one-stage approaches, often employing Anchor-based strategies, which can result in computationally expensive operations due to the redundant number of generated anchors during training. In contrast, Anchor-free mechanisms offer faster processing but suffer from a reduction in the number of training samples, potentially impacting detection accuracy. To address these limitations, we propose the Hybrid-Anchor Rotation Detector (HA-RDet), which combines the advantages of both anchor-based and anchor-free schemes for oriented object detection. By utilizing only one preset anchor for each location on the feature maps and refining these anchors with our Orientation-Aware Convolution technique, HA-RDet achieves competitive accuracies, including 75.41 mAP on DOTA-v1, 65.3 mAP on DIOR-R, and 90.2 mAP on HRSC2016, against current anchor-based state-of-the-art methods, while significantly reducing computational resources.

[CV-112] SEREP: Semantic Facial Expression Representation for Robust In-the-Wild Capture and Retargeting

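[Quick read]: This paper addresses monocular facial performance capture in the wild, which is challenging because of varied capture conditions, face shapes, and expressions. Most existing methods rely on linear 3D Morphable Models, which represent facial expressions independently of identity at the vertex displacement level. The proposed solution, SEREP (Semantic Expression Representation), instead disentangles expression from identity at the semantic level. SEREP first learns an expression representation from unpaired 3D facial expressions using a cycle consistency loss, and then predicts expressions from monocular images using a novel semi-supervised scheme based on domain adaptation. The paper also introduces the MultiREX benchmark to address the lack of evaluation resources for expression capture. Experiments show that SEREP outperforms state-of-the-art methods at capturing challenging expressions and transferring them to novel identities.

Link: https://arxiv.org/abs/2412.14371
Authors: Arthur Josi, Luiz Gustavo Hafemann, Abdallah Dib, Emeline Got, Rafael M. O. Cruz, Marc-Andre Carbonneau
Affiliations: Ecole de Technologie Supérieure; Ubisoft LaForge
Keywords: varied capture conditions, Monocular facial performance, face shapes, due to varied, facial performance capture
Categories: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR); Machine Learning (cs.LG)
Comments: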

Abstract:Monocular facial performance capture in-the-wild is challenging due to varied capture conditions, face shapes, and expressions. Most current methods rely on linear 3D Morphable Models, which represent facial expressions independently of identity at the vertex displacement level. We propose SEREP (Semantic Expression Representation), a model that disentangles expression from identity at the semantic level. It first learns an expression representation from unpaired 3D facial expressions using a cycle consistency loss. Then we train a model to predict expression from monocular images using a novel semi-supervised scheme that relies on domain adaptation. In addition, we introduce MultiREX, a benchmark addressing the lack of evaluation resources for the expression capture task. Our experiments show that SEREP outperforms state-of-the-art methods, capturing challenging expressions and transferring them to novel identities.

[CV-113] Surrealistic-like Image Generation with Vision-Language Models

[Quick read]: This paper studies how to use generative AI models to produce images in the style of surrealist paintings. The approach is an experimental comparison of vision-language generative models (DALL-E, Deep Dream Generator, and DreamStudio) under different generation settings, together with an assessment of how edited base images affect the results. The analysis indicates that DALL-E 2 performs best when given prompts generated by ChatGPT, which identifies the most suitable model and settings among those tested.

Link: https://arxiv.org/abs/2412.14366
Authors: Elif Ayten, Shuai Wang, Hjalmar Snoep
Affiliations: Vrije Universiteit Amsterdam; Snoep Studio
Keywords: Deep Dream Generator, Recent advances, types of content, make it convenient, convenient to create
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 2023 Joint international Scientific conferences on AI and Machine Learning (BNAIC-BeNeLearn)

Abstract:Recent advances in generative AI make it convenient to create different types of content, including text, images, and code. In this paper, we explore the generation of images in the style of paintings in the surrealism movement using vision-language generative models, including DALL-E, Deep Dream Generator, and DreamStudio. Our investigation starts with the generation of images under various image generation settings and different models. The primary objective is to identify the most suitable model and settings for producing such images. Additionally, we aim to understand the impact of using edited base images on the generated images. Through these experiments, we evaluate the performance of selected models and gain valuable insights into their capabilities in generating such images. Our analysis shows that DALL-E 2 performs the best when using prompts generated by ChatGPT.

[CV-114] Dynamic semantic VSLAM with known and unknown objects

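[Quick read]: This paper addresses the failure of traditional Visual Simultaneous Localization and Mapping (VSLAM) systems in highly dynamic environments, which stems from their static-environment assumption. The key idea is a feature-based semantic VSLAM that uses an unsupervised segmentation network to obtain unlabeled segments, an object detector to identify which segments belong to known classes, and high-gradient optical-flow information to separate static from dynamic segments for both known and unknown objects. A consistency check module further refines the classification. The method outperforms traditional VSLAM when unknown objects are present and matches leading semantic VSLAM techniques when the images contain only known objects.

Link: https://arxiv.org/abs/2412.14359
Authors: Sanghyoup Gu, Ratnesh Kumar
Affiliations: Iowa State University
Keywords: Visual Simultaneous Localization, Traditional Visual Simultaneous, Localization and Mapping, Visual Simultaneous, Simultaneous Localization
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: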

Abstract:Traditional Visual Simultaneous Localization and Mapping (VSLAM) systems assume a static environment, which makes them ineffective in highly dynamic settings. To overcome this, many approaches integrate semantic information from deep learning models to identify dynamic regions within images. However, these methods face a significant limitation, as a supervised model cannot recognize objects not included in the training datasets. This paper introduces a novel feature-based Semantic VSLAM capable of detecting dynamic features in the presence of both known and unknown objects. By employing an unsupervised segmentation network, we achieve unlabeled segmentation, and then use an object detector to identify any of the known classes among the segments. We pair this with the computed high-gradient optical-flow information to identify the static versus dynamic segmentations for both known and unknown object classes. A consistency check module is also introduced for further refinement and final classification into static versus dynamic features. Evaluations using public datasets demonstrate that our method offers better performance than traditional VSLAM when unknown objects are present in the images, while still matching the performance of the leading semantic VSLAM techniques when the images contain only known objects.

[CV-115] A Unifying Information-theoretic Perspective on Evaluating Generative Models

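[Quick read]: This paper addresses the difficulty of interpreting generative model output, and in particular how to evaluate output quality. The key contribution is a unifying information-theoretic evaluation framework: using kNN density estimation, it brings a class of existing kNN-based metrics (such as precision and recall) under a single information-theoretic view. The paper further proposes a tri-dimensional metric consisting of Precision Cross-Entropy (PCE), Recall Cross-Entropy (RCE), and Recall Entropy (RE), which measure fidelity and two aspects of diversity (inter-class and intra-class), respectively. The metric is domain-agnostic and supports analysis at both the sample level and the mode level.

Link: https://arxiv.org/abs/2412.14340
Authors: Alexis Fox, Samarth Swarup, Abhijin Adiga
Affiliations: Los Alamos National Laboratory; University of Virginia
Keywords: interpreting generative model, significant current research, current research focused, determining meaningful evaluation, generative model output
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: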

Abstract:Considering the difficulty of interpreting generative model output, there is significant current research focused on determining meaningful evaluation metrics. Several recent approaches utilize “precision” and “recall,” borrowed from the classification domain, to individually quantify the output fidelity (realism) and output diversity (representation of the real data variation), respectively. With the increase in metric proposals, there is a need for a unifying perspective, allowing for easier comparison and clearer explanation of their benefits and drawbacks. To this end, we unify a class of kth-nearest-neighbors (kNN)-based metrics under an information-theoretic lens using approaches from kNN density estimation. Additionally, we propose a tri-dimensional metric composed of Precision Cross-Entropy (PCE), Recall Cross-Entropy (RCE), and Recall Entropy (RE), which separately measure fidelity and two distinct aspects of diversity, inter- and intra-class. Our domain-agnostic metric, derived from the information-theoretic concepts of entropy and cross-entropy, can be dissected for both sample- and mode-level analysis. Our detailed experimental results demonstrate the sensitivity of our metric components to their respective qualities and reveal undesirable behaviors of other metrics.
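
For readers unfamiliar with the kNN-based precision and recall metrics that the abstract sets out to unify, the sketch below shows one widely used variant: a generated sample counts toward precision if it lies inside the k-th-nearest-neighbor ball of at least one real sample, and recall swaps the roles of the two sets. This is not the paper's PCE, RCE, or RE; the function names and toy points are assumptions for illustration only.

```python
import math

def knn_radii(points, k):
    """Distance from each point to its k-th nearest neighbor within the same set."""
    radii = []
    for i, p in enumerate(points):
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        radii.append(dists[k - 1])
    return radii

def coverage(queries, support, k):
    """Fraction of queries that fall inside at least one kNN ball of the support set."""
    radii = knn_radii(support, k)
    hits = sum(
        any(math.dist(q, s) <= r for s, r in zip(support, radii))
        for q in queries
    )
    return hits / len(queries)

real = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
fake = [(0.5, 0.5), (0.9, 0.1), (5.0, 5.0)]  # the last point is an outlier

precision = coverage(fake, real, k=1)  # fidelity: fakes covered by real balls
recall = coverage(real, fake, k=1)     # diversity: reals covered by fake balls
```

In this toy case the outlier at (5, 5) has a very large kNN ball, so it covers a real point and inflates recall; behavior of this kind is one reason such metrics are hard to compare without a common framework.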

[CV-116] Joint Co-Speech Gesture and Expressive Talking Face Generation using Diffusion with Adapters

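[Quick read]: This paper addresses the fact that existing methods usually treat co-speech gesture generation and talking head generation as separate tasks, which increases model complexity and ignores the inherent relationship between face and body movements. The key contribution is a novel model architecture that generates face and body motions jointly within a single network, using adapters on top of shared weights to adapt across modalities. This maintains generation performance while significantly reducing the number of parameters.

Link: https://arxiv.org/abs/2412.14333
Authors: Steven Hogue, Chenxu Zhang, Yapeng Tian, Xiaohu Guo
Affiliations: University of Texas at Dallas
Keywords: Recent advances, methods focus, talking head generation, Recent, gesture and talking
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: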

Abstract:Recent advances in co-speech gesture and talking head generation have been impressive, yet most methods focus on only one of the two tasks. Those that attempt to generate both often rely on separate models or network modules, increasing training complexity and ignoring the inherent relationship between face and body movements. To address the challenges, in this paper, we propose a novel model architecture that jointly generates face and body motions within a single network. This approach leverages shared weights between modalities, facilitated by adapters that enable adaptation to a common latent space. Our experiments demonstrate that the proposed framework not only maintains state-of-the-art co-speech gesture and talking head generation performance but also significantly reduces the number of parameters required.

[CV-117] Personalized Generative Low-light Image Denoising and Enhancement

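[Quick read]: This paper addresses the poor image quality of smartphone cameras in low light, which is caused by photon shot noise and sensor read noise. The key contribution is Personalized Generative Denoising (PGD), which builds a diffusion model customized for different users: physical attributes of the person are extracted from the user's personal photo gallery to form an identity-consistent physical buffer. This buffer provides a strong prior that can be combined with the diffusion model without fine-tuning, enabling high-quality denoising and enhancement at low signal-to-noise ratio (SNR) and outperforming existing diffusion-based denoising approaches.

Link: https://arxiv.org/abs/2412.14327
Authors: Xijun Wang, Prateek Chennuri, Yu Yuan, Bole Ma, Xingguang Zhang, Stanley Chan
Affiliations: Purdue University
Keywords: photon shot noise, sensor read noise, produce astonishingly good, smartphone cameras today, astonishingly good photos
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: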

Abstract:While smartphone cameras today can produce astonishingly good photos, their performance in low light is still not completely satisfactory because of the fundamental limits in photon shot noise and sensor read noise. Generative image restoration methods have demonstrated promising results compared to traditional methods, but they suffer from hallucinatory content generation when the signal-to-noise ratio (SNR) is low. Recognizing the availability of personalized photo galleries on users’ smartphones, we propose Personalized Generative Denoising (PGD) by building a diffusion model customized for different users. Our core innovation is an identity-consistent physical buffer that extracts the physical attributes of the person from the gallery. This ID-consistent physical buffer provides a strong prior that can be integrated with the diffusion model to restore the degraded images, without the need of fine-tuning. Over a wide range of low-light testing scenarios, we show that PGD achieves superior image denoising and enhancement performance compared to existing diffusion-based denoising approaches.

[CV-118] Covariances for Free: Exploiting Mean Distributions for Federated Learning with Pre-Trained Models

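[Quick read]: This paper addresses data heterogeneity in federated learning and proposes a training-free solution. The key idea is an unbiased estimator of class covariance matrices that needs only the class means communicated by clients, which enables efficient classifier initialization. Compared with methods that communicate second-order statistics, it greatly reduces communication cost; compared with existing methods that also share only class means, it improves performance by 4-26%. Using the method to initialize the classifier and then performing federated fine-tuning also yields better and faster convergence.

Link: https://arxiv.org/abs/2412.14326
Authors: Dipam Goswami, Simone Magistri, Kai Wang, Bartłomiej Twardowski, Andrew D. Bagdanov, Joost van de Weijer
Affiliations: Computer Vision Center; Department of Computer Science, Universitat Autònoma de Barcelona; Department of Information Engineering, University of Florence; IDEAS-NCBR
Keywords: federated learning algorithms, learning algorithms, pre-trained models, found to reduce, reduce the effect
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments: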

Abstract:Using pre-trained models has been found to reduce the effect of data heterogeneity and speed up federated learning algorithms. Recent works have investigated the use of first-order statistics and second-order statistics to aggregate local client data distributions at the server and achieve very high performance without any training. In this work we propose a training-free method based on an unbiased estimator of class covariance matrices. Our method, which only uses first-order statistics in the form of class means communicated by clients to the server, incurs only a fraction of the communication costs required by methods based on communicating second-order statistics. We show how these estimated class covariances can be used to initialize a linear classifier, thus exploiting the covariances without actually sharing them. When compared to state-of-the-art methods which also share only class means, our approach improves performance in the range of 4-26% with exactly the same communication cost. Moreover, our method achieves performance competitive or superior to sharing second-order statistics with dramatically less communication overhead. Finally, using our method to initialize classifiers and then performing federated fine-tuning yields better and faster convergence. Code is available at this https URL.
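
The reasoning behind estimating covariances from means alone can be sketched as follows. Assuming each client's samples of a class are i.i.d. draws from a distribution with covariance Σ, the mean of n_k samples has covariance Σ/n_k, so the count-weighted scatter of the K client means around the global mean has expectation (K-1)Σ. The code below implements that standard estimator; the function name and toy numbers are hypothetical, and the paper's exact estimator (for example, any regularization it applies) may differ.

```python
def estimate_class_covariance(client_means, client_counts):
    """Estimate one class's covariance from per-client class means and sample counts.

    Assumes i.i.d. client samples, so Cov(mean_k) = Sigma / n_k (an assumption
    of this sketch, not a statement about the paper's setting).
    """
    num_clients = len(client_means)
    dim = len(client_means[0])
    total = sum(client_counts)
    # Count-weighted global class mean.
    global_mean = [
        sum(n * mu[i] for mu, n in zip(client_means, client_counts)) / total
        for i in range(dim)
    ]
    cov = [[0.0] * dim for _ in range(dim)]
    for mu, n in zip(client_means, client_counts):
        diff = [m - g for m, g in zip(mu, global_mean)]
        for i in range(dim):
            for j in range(dim):
                # Weighted outer product, divided by K - 1 for unbiasedness.
                cov[i][j] += n * diff[i] * diff[j] / (num_clients - 1)
    return global_mean, cov

# Two hypothetical clients, each holding two samples of the class.
mean, cov = estimate_class_covariance([[0.0, 0.0], [2.0, 2.0]], [2, 2])
```

Only the class means and counts cross the network here, so the communication cost is the same as for methods that share first-order statistics alone.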

[CV-119] What Has Been Overlooked in Contrastive Source-Free Domain Adaptation: Leveraging Source-Informed Latent Augmentation within Neighborhood Context ICLR2025

[Quick read]: This paper addresses source-free domain adaptation (SFDA), that is, domain adaptation when the source data is unavailable. The key contribution is a latent feature augmentation method for contrastive learning: it uses the dispersion of latent features in the neighborhood of the query sample, guided by the source pre-trained model, to make the positive keys more informative and so improve contrastive learning. The method relies on a single InfoNCE-based contrastive loss and outperforms state-of-the-art SFDA methods on several widely recognized benchmark datasets.

Link: https://arxiv.org/abs/2412.14301
Authors: Jing Wang, Wonho Bae, Jiahong Chen, Kuangen Zhang, Leonid Sigal, Clarence W. de Silva
Affiliations: Department of Mechanical Engineering, The University of British Columbia; Department of Computer Science, The University of British Columbia
Keywords: Source-free domain adaptation, model originally trained, Source-free domain, involves adapting, originally trained
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: ICLR 2025

Abstract:Source-free domain adaptation (SFDA) involves adapting a model originally trained using a labeled dataset (source domain) to perform effectively on an unlabeled dataset (target domain) without relying on any source data during adaptation. This adaptation is especially crucial when significant disparities in data distributions exist between the two domains and when there are privacy concerns regarding the source model’s training data. The absence of access to source data during adaptation makes it challenging to analytically estimate the domain gap. To tackle this issue, various techniques have been proposed, such as unsupervised clustering, contrastive learning, and continual learning. In this paper, we first conduct an extensive theoretical analysis of SFDA based on contrastive learning, primarily because it has demonstrated superior performance compared to other techniques. Motivated by the obtained insights, we then introduce a straightforward yet highly effective latent augmentation method tailored for contrastive SFDA. This augmentation method leverages the dispersion of latent features within the neighborhood of the query sample, guided by the source pre-trained model, to enhance the informativeness of positive keys. Our approach, based on a single InfoNCE-based contrastive loss, outperforms state-of-the-art SFDA methods on widely recognized benchmark datasets.
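
Since the approach rests on a single InfoNCE-based contrastive loss, a minimal sketch of the standard InfoNCE formula may help: the loss is the negative log-probability of the positive key under a softmax over query-key similarities. The code covers only the generic loss; the paper's latent augmentation of positive keys is not reproduced, and the function name, dot-product similarity, and default temperature are assumptions.

```python
import math

def info_nce(query, positive, negatives, temperature=0.1):
    """Standard InfoNCE loss for one query, one positive key, and several negative keys."""
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))

    # Similarity logits, with the positive key first.
    logits = [dot(query, positive) / temperature]
    logits += [dot(query, key) / temperature for key in negatives]
    # Numerically stable log-sum-exp for the softmax denominator.
    top = max(logits)
    log_denominator = top + math.log(sum(math.exp(l - top) for l in logits))
    return log_denominator - logits[0]

loss = info_nce(
    query=[1.0, 0.0],
    positive=[1.0, 0.0],
    negatives=[[0.0, 1.0], [-1.0, 0.0]],
    temperature=1.0,
)
```

The loss shrinks as the query moves closer to its positive key and farther from the negatives, which is why making positive keys more informative can matter for adaptation.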

[CV-120] Temporally Consistent Object-Centric Learning by Contrasting Slots

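[Quick read]: This paper addresses the lack of temporal consistency in existing methods for unsupervised object-centric learning from videos. Approaches based on recurrent processing often lack long-term stability across frames because their training objective does not enforce temporal consistency. The key contribution is a novel object-level temporal contrastive loss that explicitly promotes temporal consistency. It significantly improves the temporal consistency of the learned object-centric representations and strengthens object discovery, reaching state-of-the-art results on both synthetic and real-world datasets and even outperforming weakly supervised methods that use motion masks as additional cues.

Link: https://arxiv.org/abs/2412.14295
Authors: Anna Manasyan, Maximilian Seitzer, Filip Radovic, Georg Martius, Andrii Zadaianchuk
Affiliations: University of Tübingen, Tübingen, Germany; University of Amsterdam, Amsterdam, Netherlands; Max Planck Institute for Intelligent Systems, Tübingen, Germany
Keywords: extract structured representations, unlabeled collections, promising approach, approach to extract, extract structured
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Robotics (cs.RO)
Comments: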

Abstract:Unsupervised object-centric learning from videos is a promising approach to extract structured representations from large, unlabeled collections of videos. To support downstream tasks like autonomous control, these representations must be both compositional and temporally consistent. Existing approaches based on recurrent processing often lack long-term stability across frames because their training objective does not enforce temporal consistency. In this work, we introduce a novel object-level temporal contrastive loss for video object-centric models that explicitly promotes temporal consistency. Our method significantly improves the temporal consistency of the learned object-centric representations, yielding more reliable video decompositions that facilitate challenging downstream tasks such as unsupervised object dynamics prediction. Furthermore, the inductive bias added by our loss strongly improves object discovery, leading to state-of-the-art results on both synthetic and real-world datasets, outperforming even weakly-supervised methods that leverage motion masks as additional cues.

[CV-121] TRecViT: A Recurrent Video Transformer

[Quick read]: This paper proposes a new block for video modelling, aimed at the large parameter counts, memory footprint, and compute cost of existing models on large-scale video data. The key idea is a time-space-channel factorisation with a dedicated block for each dimension: gated linear recurrent units (LRUs) mix information over time, self-attention layers mix over space, and MLPs mix over channels. The resulting TRecViT architecture performs well on sparse and dense tasks in both supervised and self-supervised regimes. Compared with the pure attention model ViViT-L, TRecViT is on par or better on large video datasets (SSv2, Kinetics400) while having 3 times fewer parameters, a 12 times smaller memory footprint, and a 5 times lower FLOPs count.

Link: https://arxiv.org/abs/2412.14294
Authors: Viorica Pătrăucean, Xu Owen He, Joseph Heyward, Chuhan Zhang, Mehdi S. M. Sajjadi, George-Cristian Muraru, Artem Zholus, Mahdi Karami, Ross Goroshin, Yutian Chen, Simon Osindero, João Carreira, Razvan Pascanu
Affiliations: Google DeepMind
Keywords: video modelling, Abstract, gated linear recurrent, linear recurrent units, times
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:

Abstract:We propose a novel block for video modelling. It relies on a time-space-channel factorisation with dedicated blocks for each dimension: gated linear recurrent units (LRUs) perform information mixing over time, self-attention layers perform mixing over space, and MLPs over channels. The resulting architecture TRecViT performs well on sparse and dense tasks, trained in supervised or self-supervised regimes. Notably, our model is causal and outperforms or is on par with a pure attention model ViViT-L on large scale video datasets (SSv2, Kinetics400), while having $3\times$ fewer parameters, a $12\times$ smaller memory footprint, and a $5\times$ lower FLOPs count. Code and checkpoints will be made available online at this https URL.
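
The temporal half of the factorisation can be illustrated with a toy gated linear recurrence. In the sketch below, each spatial token carries a single scalar feature and is mixed over time independently with the update h_t = a_t * h_{t-1} + (1 - a_t) * x_t. The gating form, the fixed gate value, and the function names are assumptions for illustration; the LRUs in the paper are learned layers whose exact parameterisation is not given in the abstract.

```python
def gated_linear_recurrence(inputs, gates, h0=0.0):
    """Run h_t = a_t * h_{t-1} + (1 - a_t) * x_t over a sequence of scalars."""
    h = h0
    outputs = []
    for x, a in zip(inputs, gates):
        h = a * h + (1.0 - a) * x
        outputs.append(h)
    return outputs

def mix_over_time(video_tokens, gate=0.5):
    """Apply the recurrence along time, independently for each spatial token.

    video_tokens[t][n] is the scalar feature of spatial token n at frame t.
    """
    num_tokens = len(video_tokens[0])
    per_token = []
    for n in range(num_tokens):
        series = [frame[n] for frame in video_tokens]
        per_token.append(gated_linear_recurrence(series, [gate] * len(series)))
    # Transpose back to the [time][token] layout.
    return [list(frame) for frame in zip(*per_token)]

states = gated_linear_recurrence([1.0, 1.0, 1.0], [0.5, 0.5, 0.5])
mixed = mix_over_time([[1.0, 0.0], [1.0, 2.0]])
```

Each output depends only on the current and earlier frames, which is the sense in which a recurrence of this kind is causal; spatial self-attention and channel MLPs would then be applied within each frame.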

[CV-122] PixelMan: Consistent Object Editing with Diffusion Models via Pixel Manipulation and Generation AAAI2025

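[Quick read]: This paper addresses consistent object editing, that is, changing an object's position, size, or composition while keeping the object and background consistent and leaving texture and attributes unchanged. Current methods, such as DDIM inversion and energy guidance, are inefficient and yield limited consistency in the edited image. The proposed solution, PixelMan, is an inversion-free and training-free method that achieves consistent object editing through pixel manipulation and generation. It directly creates a copy of the source object at the target location in pixel space, then uses an efficient sampling approach to iteratively harmonize the manipulated object into the target location and inpaint its original location. Consistency is maintained by anchoring the edited image to the pixel-manipulated image and by applying several consistency-preserving optimization techniques during inference.

Link: https://arxiv.org/abs/2412.14283
Authors: Liyao Jiang, Negar Hassanpour, Mohammad Salameh, Mohammadreza Samadi, Jiao He, Fengyu Sun, Di Niu
Affiliations: 1. University of Alberta; 2. University of Waterloo; 3. Tsinghua University
Keywords: Diffusion Models, Recent research explores, modify object position, potential of Diffusion, consistent object editing
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Graphics (cs.GR)
Comments: AAAI 2025; version includes supplementary material; 27 Pages, 15 Figures, 6 Tables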

Abstract:Recent research explores the potential of Diffusion Models (DMs) for consistent object editing, which aims to modify object position, size, and composition, etc., while preserving the consistency of objects and background without changing their texture and attributes. Current inference-time methods often rely on DDIM inversion, which inherently compromises efficiency and the achievable consistency of edited images. Recent methods also utilize energy guidance which iteratively updates the predicted noise and can drive the latents away from the original image, resulting in distortions. In this paper, we propose PixelMan, an inversion-free and training-free method for achieving consistent object editing via Pixel Manipulation and generation, where we directly create a duplicate copy of the source object at target location in the pixel space, and introduce an efficient sampling approach to iteratively harmonize the manipulated object into the target location and inpaint its original location, while ensuring image consistency by anchoring the edited image to be generated to the pixel-manipulated image as well as by introducing various consistency-preserving optimization techniques during inference. Experimental evaluations based on benchmark datasets as well as extensive visual comparisons show that in as few as 16 inference steps, PixelMan outperforms a range of state-of-the-art training-based and training-free methods (usually requiring 50 steps) on multiple consistent object editing tasks.
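
The pixel-manipulation step described in the abstract, duplicating the source object at the target location in pixel space, can be sketched on a toy image stored as nested lists. The function name, the binary mask format, and the integer offset are assumptions for illustration; the harmonization and inpainting performed by the diffusion sampler are not shown.

```python
def duplicate_object(image, mask, dy, dx):
    """Copy the masked pixels of `image` to a location shifted by (dy, dx).

    The source pixels are left in place, since in the described method the
    original location is inpainted later by the sampler.
    """
    height, width = len(image), len(image[0])
    out = [row[:] for row in image]
    for y in range(height):
        for x in range(width):
            if mask[y][x]:
                ty, tx = y + dy, x + dx
                # Skip pixels that would land outside the image.
                if 0 <= ty < height and 0 <= tx < width:
                    out[ty][tx] = image[y][x]
    return out

image = [[9, 0, 0],
         [0, 0, 0],
         [0, 0, 0]]
mask = [[1, 0, 0],
        [0, 0, 0],
        [0, 0, 0]]
manipulated = duplicate_object(image, mask, dy=1, dx=1)
```

The manipulated image then serves as the anchor for generation, which is how the method avoids an inversion step.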

[CV-123] Split Learning in Computer Vision for Semantic Segmentation Delay Minimization

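[Quick read]: This paper addresses the significant latency of semantic segmentation in real-time computer vision applications (such as autonomous vehicles and smart city infrastructure) on resource-constrained devices. The key idea is to use split learning (SL), which partitions a deep neural network (DNN) between edge devices and a central server so that data is processed locally and less data has to be transmitted. The paper jointly optimizes bandwidth allocation, the choice of cut layer for the edge devices' DNN, and the allocation of the central server's processing resources, and proposes low-complexity heuristics for both parallel and serial data processing scenarios that stay close to optimal while reducing computational requirements. Numerical results show that the approach effectively reduces inference delay in dynamic, resource-constrained environments.

Link: https://arxiv.org/abs/2412.14272
Authors: Nikos G. Evgenidis, Nikos A. Mitsiou, Sotiris A. Tegos, Panagiotis D. Diamantoulakis, George K. Karagiannidis
Affiliations: Unknown
Keywords: real-time computer vision, split learning, computer vision, semantic segmentation, inference delay
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC); Information Theory (cs.IT); Machine Learning (cs.LG)
Comments: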

Abstract:In this paper, we propose a novel approach to minimize the inference delay in semantic segmentation using split learning (SL), tailored to the needs of real-time computer vision (CV) applications for resource-constrained devices. Semantic segmentation is essential for applications such as autonomous vehicles and smart city infrastructure, but faces significant latency challenges due to high computational and communication loads. Traditional centralized processing methods are inefficient for such scenarios, often resulting in unacceptable inference delays. SL offers a promising alternative by partitioning deep neural networks (DNNs) between edge devices and a central server, enabling localized data processing and reducing the amount of data required for transmission. Our contribution includes the joint optimization of bandwidth allocation, cut layer selection of the edge devices’ DNN, and the central server’s processing resource allocation. We investigate both parallel and serial data processing scenarios and propose low-complexity heuristic solutions that maintain near-optimal performance while reducing computational requirements. Numerical results show that our approach effectively reduces inference delay, demonstrating the potential of SL for improving real-time CV applications in dynamic, resource-constrained environments.
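
The trade-off behind cut layer selection can be sketched with a deliberately simplified single-device delay model: total delay is the device compute time for the layers before the cut, plus the time to transmit the activations at the cut, plus the server compute time for the remaining layers. The model, the function names, and all numbers are hypothetical; the paper's formulation also covers bandwidth and server resource allocation across multiple devices, which this sketch omits.

```python
def inference_delay(cut, layer_flops, output_sizes, input_size,
                    device_speed, server_speed, bandwidth):
    """Delay when layers [0, cut) run on the device and layers [cut, L) on the server."""
    device_time = sum(layer_flops[:cut]) / device_speed
    # With cut == 0 the raw input is sent; otherwise the output of layer cut - 1.
    payload = input_size if cut == 0 else output_sizes[cut - 1]
    transmit_time = payload / bandwidth
    server_time = sum(layer_flops[cut:]) / server_speed
    return device_time + transmit_time + server_time

def best_cut(layer_flops, output_sizes, input_size,
             device_speed, server_speed, bandwidth):
    """Exhaustively pick the cut layer with the smallest total delay."""
    candidates = range(len(layer_flops) + 1)
    return min(
        ((c, inference_delay(c, layer_flops, output_sizes, input_size,
                             device_speed, server_speed, bandwidth))
         for c in candidates),
        key=lambda pair: pair[1],
    )

# Hypothetical 3-layer network in arbitrary units of work and data.
cut, delay = best_cut(
    layer_flops=[4, 4, 4],
    output_sizes=[8, 2, 1],
    input_size=16,
    device_speed=2,
    server_speed=4,
    bandwidth=2,
)
```

In this toy setting, cutting after the second layer wins because the activations there are small enough to send cheaply while the device has not yet spent much time computing.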

[CV-124] Descriptive Caption Enhancement with Visual Specialists for Multimodal Perception

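【Quick Read】: This paper addresses the limitations of existing methods for producing descriptive image captions, which either distill captions from large multimodal models (LMMs) or construct them from internet images or by human annotation. The key to the solution is DCE, a method that leverages off-the-shelf visual specialists, which were not originally trained for image captioning, to extract objects' low-level and fine-grained attributes (such as depth, emotion, and fine-grained categories) and object relations (such as relative location and human-object interaction (HOI)), and combines these attributes into the descriptive caption. Experiments show that the approach improves performance on visual understanding tasks as well as on reasoning that benefits from more accurate visual understanding.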

Link: https://arxiv.org/abs/2412.14233
Authors: Yanpeng Sun, Jing Hao, Ke Zhu, Jiang-Jiang Liu, Yuxiang Zhao, Xiaofan Li, Gang Zhang, Zechao Li, Jingdong Wang
Affiliations: Nanjing University of Science and Technology; Baidu VIS; The University of Hong Kong; Nanjing University
Keywords: Training Large Multimodality, Large Multimodality Models, Training Large, Large Multimodality, Multimodality Models
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: An open-source data engine for generating detailed image captions

Abstract:Training Large Multimodality Models (LMMs) relies on descriptive image captions that connect image and language. Existing methods either distill the captions from LMMs or construct them from internet images or by human annotation. We propose to leverage off-the-shelf visual specialists, which were trained from annotated images initially not for image captioning, to enhance the image captions. Our approach, named DCE, explores object low-level and fine-grained attributes (e.g., depth, emotion, and fine-grained categories) and object relations (e.g., relative location and human-object interaction (HOI)), and combines the attributes into the descriptive caption. Experiments demonstrate that such visual specialists are able to improve the performance for visual understanding tasks as well as reasoning that benefits from more accurate visual understanding. We will release the source code and the pipeline so that other visual specialists are easily combined into the pipeline. The complete source code of the DCE pipeline and datasets will be available at this https URL.

[CV-125] ViTmiX: Vision Transformer Explainability Augmented by Mixed Visualization Methods

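【Quick Read】: This paper addresses the explainability of Vision Transformer (ViT) models in visual recognition tasks. ViT models capture long-range dependencies in images through self-attention, but their complexity makes their decision-making processes hard to expose. The paper proposes a hybrid approach that mixes multiple explainability techniques (such as Layer-wise Relevance Propagation (LRP) and gradient-based methods) to enhance the interpretability of ViT models. The key elements are a geometric-mean mixing technique and a novel post-hoc explainability measure based on the Pigeonhole principle, which together significantly improve ViT interpretability, particularly in object segmentation tasks.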

Link: https://arxiv.org/abs/2412.14231
Authors: Eduard Hogea, Darian M. Onchis, Ana Coporan, Adina Magda Florea, Codruta Istin
Affiliations: West University of Timisoara; Politehnica University of Bucharest; Politehnica University of Timisoara
Keywords: Vision Transformers, capture long-range dependencies, Recent advancements, advancements in Vision, visual recognition tasks
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Recent advancements in Vision Transformers (ViT) have demonstrated exceptional results in various visual recognition tasks, owing to their ability to capture long-range dependencies in images through self-attention mechanisms. However, the complex nature of ViT models requires robust explainability methods to unveil their decision-making processes. Explainable Artificial Intelligence (XAI) plays a crucial role in improving model transparency and trustworthiness by providing insights into model predictions. Current approaches to ViT explainability, based on visualization techniques such as Layer-wise Relevance Propagation (LRP) and gradient-based methods, have shown promising but sometimes limited results. In this study, we explore a hybrid approach that mixes multiple explainability techniques to overcome these limitations and enhance the interpretability of ViT models. Our experiments reveal that this hybrid approach significantly improves the interpretability of ViT models compared to individual methods. We also introduce modifications to existing techniques, such as using geometric mean for mixing, which demonstrates notable results in object segmentation tasks. To quantify the explainability gain, we introduced a novel post-hoc explainability measure by applying the Pigeonhole principle. These findings underscore the importance of refining and optimizing explainability methods for ViT models, paving the way to reliable XAI-based segmentations.
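The geometric-mean mixing mentioned in the abstract can be pictured with a small sketch. The abstract does not give the exact formula, so the pixel-wise form below, the function name `mix_geometric`, and the toy maps are all assumptions for illustration rather than the authors' implementation.

```python
import math


def mix_geometric(map_a, map_b):
    """Pixel-wise geometric mean of two equally sized, non-negative saliency maps."""
    return [
        [math.sqrt(a * b) for a, b in zip(row_a, row_b)]
        for row_a, row_b in zip(map_a, map_b)
    ]


# Toy 2x2 relevance maps in [0, 1]; the values are made up.
lrp_map = [[0.9, 0.1], [0.4, 0.0]]
grad_map = [[0.4, 0.9], [0.4, 0.8]]

mixed = mix_geometric(lrp_map, grad_map)
for row in mixed:
    print([round(value, 3) for value in row])
```

Unlike an arithmetic mean, the geometric mean stays high only where both maps agree: a pixel that one method scores as zero is zero in the mixture, which tends to suppress method-specific noise.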

[CV-126] Transversal PACS Browser API: Addressing Interoperability Challenges in Medical Imaging Systems

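【Quick Read】: This paper addresses inefficient retrieval of DICOM images in medical imaging systems, in particular the access difficulties caused by fragmented systems, interoperability barriers, and complex user interfaces. The key to the solution is the Transversal PACS Browser API, which simplifies retrieving DICOM images from multiple PACS stations by providing advanced filtering, custom field search, and a unified query-and-retrieve interface. The API also supports previewing images directly within the application, improving user experience and operational efficiency.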

Link: https://arxiv.org/abs/2412.14229
Authors: Diogo Lameira, Filipa Ferraz
Affiliations: Unknown
Keywords: retrieving DICOM images, Advances in imaging, PACS Browser API, imaging technologies, accessing medical images
Subjects: Human-Computer Interaction (cs.HC); Computational Engineering, Finance, and Science (cs.CE); Computer Vision and Pattern Recognition (cs.CV); Information Retrieval (cs.IR)
Comments: 16 pages with 3 figures

Abstract:Advances in imaging technologies have revolutionised the medical imaging and healthcare sectors, leading to the widespread adoption of PACS for the storage, retrieval, and communication of medical images. Although these systems have improved operational efficiency, significant challenges remain in effectively retrieving DICOM images, which are essential for diagnosis and overall patient care. Moreover, issues such as fragmented systems, interoperability barriers, and complex user interfaces can often prevent healthcare professionals from efficiently accessing medical images. Addressing these challenges, the Transversal PACS Browser API is a robust and user-friendly solution designed to enhance the process of querying and retrieving DICOM images. It offers advanced filtering capabilities through a variety of filter options as well as a custom field search that allows users to navigate large medical image collections with ease. Additionally, the application provides a unified interface for querying and retrieving from multiple PACS stations, addressing the challenges of fragmentation and complexity associated with accessing medical images. Other key features include the ability to preview images directly within the application. All of this contributes to the transversal nature of the API, serving not only healthcare providers, but anyone who relies on efficient access to these resources. To validate the performance and usability of the application, comprehensive testing was carried out with stakeholders in the field, the results of which showed general satisfaction, highlighting the API’s clean design, ease of use, and effective search capabilities, as well as the usefulness of previewing images within the application.

[CV-127] Distilled Pooling Transformer Encoder for Efficient Realistic Image Dehazing

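【Quick Read】: This paper addresses efficient image dehazing on resource-constrained devices. The key to the solution is DPTE-Net, a lightweight neural network built on a Distilled Pooling Transformer Encoder that replaces traditional self-attention (SA) modules with efficient pooling mechanisms, significantly reducing computational complexity while preserving the learning capability of Transformers. A distillation-based training process transfers knowledge from a larger teacher network to DPTE-Net to further enhance semantic feature learning. DPTE-Net is also trained within a generative adversarial network (GAN) framework and uses a transmission-aware loss function to adapt dynamically to varying haze densities, achieving competitive dehazing performance at low computational complexity.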

Link: https://arxiv.org/abs/2412.14220
Authors: Le-Anh Tran, Dong-Chul Park
Affiliations: Unknown
Keywords: Pooling Transformer Encoder, Distilled Pooling Transformer, utilizing a Distilled, Transformer Encoder, lightweight neural network
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 18 pages, 17 figures

Abstract:This paper proposes a lightweight neural network designed for realistic image dehazing, utilizing a Distilled Pooling Transformer Encoder, named DPTE-Net. Recently, while vision transformers (ViTs) have achieved great success in various vision tasks, their self-attention (SA) module’s complexity scales quadratically with image resolution, hindering their applicability on resource-constrained devices. To overcome this, the proposed DPTE-Net substitutes traditional SA modules with efficient pooling mechanisms, significantly reducing computational demands while preserving ViTs’ learning capabilities. To further enhance semantic feature learning, a distillation-based training process is implemented which transfers rich knowledge from a larger teacher network to DPTE-Net. Additionally, DPTE-Net is trained within a generative adversarial network (GAN) framework, leveraging the strong generalization of GAN in image restoration, and employs a transmission-aware loss function to dynamically adapt to varying haze densities. Experimental results on various benchmark datasets have shown that the proposed DPTE-Net can achieve competitive dehazing performance when compared to state-of-the-art methods while maintaining low computational complexity, making it a promising solution for resource-limited applications. The code of this work is available at this https URL.

[CV-128] GraphicsDreamer: Image to 3D Generation with Physical Consistency

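【Quick Read】: This paper addresses the lag of automated high-quality 3D content generation in industrial applications, particularly in meeting requirements such as high-precision geometry, fine topology, and physically based rendering (PBR). The key to the solution is GraphicsDreamer, which integrates the PBR lighting equation into a cross-domain diffusion model that concurrently predicts multi-view color, normal, and depth images and PBR materials, producing highly usable 3D meshes. The PBR constraints continue to be enforced in the geometry fusion stage, so that generated objects have reliable texture details and support realistic relighting. The method also includes topology optimization and fast UV unwrapping, allowing the generated 3D assets to be imported seamlessly into graphics engines.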

Link: https://arxiv.org/abs/2412.14214
Authors: Pei Chen, Fudong Wang, Yixuan Tong, Jingdong Chen, Ming Yang, Minghui Yang
Affiliations: Ant Group; Fudan University
Keywords: transforming human imagination, imagination into complex, surge of efficient, increasingly illuminated, illuminated the path
Subjects: Graphics (cs.GR); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Recently, the surge of efficient and automated 3D AI-generated content (AIGC) methods has increasingly illuminated the path of transforming human imagination into complex 3D structures. However, the automated generation of 3D content still lags significantly in industrial application. This gap exists because 3D modeling demands high-quality assets with sharp geometry, exquisite topology, and physically based rendering (PBR), among other criteria. To narrow the disparity between generated results and artists’ expectations, we introduce GraphicsDreamer, a method for creating highly usable 3D meshes from single images. To better capture the geometry and material details, we integrate the PBR lighting equation into our cross-domain diffusion model, concurrently predicting multi-view color, normal, depth images, and PBR materials. In the geometry fusion stage, we continue to enforce the PBR constraints, ensuring that the generated 3D objects possess reliable texture details, supporting realistic relighting. Furthermore, our method incorporates topology optimization and fast UV unwrapping capabilities, allowing the 3D products to be seamlessly imported into graphics engines. Extensive experiments demonstrate that our model can produce high-quality 3D assets at a reasonable time cost compared to previous methods.

[CV-129] Improving Generalization Performance of YOLOv8 for Camera Trap Object Detection

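【Quick Read】: This paper addresses the insufficient generalization of models for species identification from camera trap images. The key to the solution is a set of enhancements to the YOLOv8 object detection algorithm: a Global Attention Mechanism (GAM) module, modified multi-scale feature fusion, and Wise Intersection over Union (WIoUv3) as the bounding box regression loss function. These changes markedly improve the model's ability to suppress background noise, focus on object properties, and generalize to new environments, making it more applicable to real-world wildlife conservation scenarios.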

Link: https://arxiv.org/abs/2412.14211
Authors: Aroj Subedi
Affiliations: Unknown
Keywords: Camera Trap images, Camera Trap, camera trap datasets, providing non-intrusive, integral tools
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: Master’s thesis

Abstract:Camera traps have become integral tools in wildlife conservation, providing non-intrusive means to monitor and study wildlife in their natural habitats. The utilization of object detection algorithms to automate species identification from Camera Trap images is of huge importance for research and conservation purposes. However, the generalization issue, where the trained model is unable to apply its learnings to a never-before-seen dataset, is prevalent. This thesis explores the enhancements made to the YOLOv8 object detection algorithm to address the problem of generalization. The study delves into the limitations of the baseline YOLOv8 model, emphasizing its struggles with generalization in real-world environments. To overcome these limitations, enhancements are proposed, including the incorporation of a Global Attention Mechanism (GAM) module, modified multi-scale feature fusion, and Wise Intersection over Union (WIoUv3) as a bounding box regression loss function. A thorough evaluation and ablation experiments reveal the improved model’s ability to suppress the background noise, focus on object properties, and exhibit robust generalization in novel environments. The proposed enhancements not only address the challenges inherent in camera trap datasets but also pave the way for broader applicability in real-world conservation scenarios, ultimately aiding in the effective management of wildlife populations and habitats.
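The bounding box loss named in the abstract starts from the Intersection over Union (IoU) between a predicted and a ground-truth box. The sketch below shows only plain IoU, not the additional weighting that WIoUv3 applies on top of it; the `(x1, y1, x2, y2)` corner format and the function name `iou` are assumptions for illustration.

```python
def iou(box_a, box_b):
    """Intersection over Union of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    # Corners of the overlapping rectangle, if there is one.
    ix1 = max(box_a[0], box_b[0])
    iy1 = max(box_a[1], box_b[1])
    ix2 = min(box_a[2], box_b[2])
    iy2 = min(box_a[3], box_b[3])

    # Width and height are clamped at zero when the boxes do not overlap.
    intersection = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - intersection
    return intersection / union if union > 0 else 0.0


predicted = (0, 0, 2, 2)
ground_truth = (1, 1, 3, 3)
print(round(iou(predicted, ground_truth), 4))
print(round(1.0 - iou(predicted, ground_truth), 4))  # the basic IoU loss term
```

A regression loss of the form `1 - IoU` is zero for a perfect match and one for disjoint boxes, which is the quantity that IoU-based losses then reweight or extend.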

[CV-130] Advancing Vehicle Plate Recognition: Multitasking Visual Language Models with VehiclePaliGemma

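【Quick Read】: This paper addresses the challenges that license plate recognition (LPR) systems face with images captured under complex conditions, in particular blur, noise, weather effects, and closely spaced characters. The key to the solution is to use visual language models (VLMs), such as OpenAI GPT4o and Google Gemini 1.5, to improve recognition of unclear and complex plate images. The paper introduces "VehiclePaliGemma", a fine-tuned open-source PaliGemma VLM designed to recognize plates under challenging conditions, and reports superior performance on a dataset of Malaysian license plates collected under complex conditions, with an accuracy of 87.6% and a speed of 7 frames per second on an A100-80GB GPU. VehiclePaliGemma also shows multitasking capability, accurately identifying plates in complex scenes containing multiple cars of different models and colors.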

Link: https://arxiv.org/abs/2412.14197
Authors: Nouar AlDahoul, Myles Joshua Toledo Tan, Raghava Reddy Tera, Hezerul Abdul Karim, Chee How Lim, Manish Kumar Mishra, Yasir Zaki
Affiliations: Unknown
Keywords: involves automated systems, read vehicle license, involves automated, LPR, utilize cameras
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 33 pages, 9 figures

Abstract:License plate recognition (LPR) involves automated systems that utilize cameras and computer vision to read vehicle license plates. Such plates collected through LPR can then be compared against databases to identify stolen vehicles, uninsured drivers, crime suspects, and more. The LPR system plays a significant role in saving time for institutions such as the police force. In the past, LPR relied heavily on Optical Character Recognition (OCR), which has been widely explored to recognize characters in images. Usually, collected plate images suffer from various limitations, including noise, blurring, weather conditions, and close characters, making the recognition complex. Existing LPR methods still require significant improvement, especially for distorted images. To fill this gap, we propose utilizing visual language models (VLMs) such as OpenAI GPT4o, Google Gemini 1.5, Google PaliGemma (Pathways Language and Image model + Gemma model), Meta Llama 3.2, Anthropic Claude 3.5 Sonnet, LLaVA, NVIDIA VILA, and moondream2 to recognize such unclear plates with close characters. This paper evaluates the VLMs’ capability to address the aforementioned problems. Additionally, we introduce "VehiclePaliGemma", a fine-tuned open-source PaliGemma VLM designed to recognize plates under challenging conditions. We compared our proposed VehiclePaliGemma with state-of-the-art methods and other VLMs using a dataset of Malaysian license plates collected under complex conditions. The results indicate that VehiclePaliGemma achieved superior performance with an accuracy of 87.6%. Moreover, it is able to predict the car’s plate at a speed of 7 frames per second using an A100-80GB GPU. Finally, we explored the multitasking capability of the VehiclePaliGemma model to accurately identify plates in images containing multiple cars of various models and colors, with plates positioned and oriented in different directions.

[CV-131] IMPROVE: Impact of Mobile Phones on Remote Online Virtual Education

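【Quick Read】: This paper aims to evaluate the effects of mobile phone usage on learners during online education. The key to the solution is IMPROVE, a multimodal dataset that collects academic performance, subjective feedback, and biometric, behavioral, and physiological signals to analyze comprehensively how phone use affects learning. The study uses 16 sensors, including electroencephalography waves, video, and eye tracking, to capture effective indicators of learner behavior and cognition. A semi-manual re-labeling system based on head pose and eye tracker data improves labeling accuracy; technical validation confirms signal quality, and statistical analyses reveal biometric changes during phone use.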

Link: https://arxiv.org/abs/2412.14195
Authors: Roberto Daza, Alvaro Becerra, Ruth Cobos, Julian Fierrez, Aythami Morales
Affiliations: Universidad Autonoma de Madrid; Biometrics and Data Pattern Analytics Laboratory; Group for Advanced Interactive Tools
Keywords: designed to evaluate, online education, work presents, evaluate the effects, mobile phone
Subjects: Human-Computer Interaction (cs.HC); Computer Vision and Pattern Recognition (cs.CV)
Comments: Article under review in the journal Scientific Data. GitHub repository of the dataset at: this https URL

Abstract:This work presents the IMPROVE dataset, designed to evaluate the effects of mobile phone usage on learners during online education. The dataset not only assesses academic performance and subjective learner feedback but also captures biometric, behavioral, and physiological signals, providing a comprehensive analysis of the impact of mobile phone use on learning. Multimodal data were collected from 120 learners in three groups with different phone interaction levels. A setup involving 16 sensors was implemented to collect data that have proven to be effective indicators for understanding learner behavior and cognition, including electroencephalography waves, videos, eye tracker, etc. The dataset includes metadata from the processed videos like face bounding boxes, facial landmarks, and Euler angles for head pose estimation. In addition, learner performance data and self-reported forms are included. Phone usage events were labeled, covering both supervisor-triggered and uncontrolled events. A semi-manual re-labeling system, using head pose and eye tracker data, is proposed to improve labeling accuracy. Technical validation confirmed signal quality, with statistical analyses revealing biometric changes during phone use.

[CV-132] A Medical Low-Back Pain Physical Rehabilitation Dataset for Human Body Movement Analysis

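【Quick Read】: This paper aims to address four key challenges in automatic monitoring and coaching of rehabilitation exercises, including error identification, limited use contexts, data processing, and evaluation methods. The key to the solution is a medical dataset of clinical patients performing low-back pain rehabilitation exercises, which includes 3D Kinect skeleton positions and orientations, RGB videos, 2D skeleton data, and medical annotations for assessing movement correctness and for classifying and localizing errors. By comparing two baseline movement recognition algorithms, a probabilistic approach based on a Gaussian Mixture Model (GMM) and a deep learning approach based on Long-Short Term Memory (LSTM), the paper shows the dataset's potential in clinical rehabilitation settings and highlights the advantages of a cost-effective, portable, and convenient sensor.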

Link: https://arxiv.org/abs/2407.00521
Authors: Sao Mai Nguyen, Maxime Devanne, Olivier Remy-Neris, Mathieu Lempereur, André Thepaut
Affiliations: IMT Atlantique; Lab-STICC, UMR 6285; FLOWERS U2IS; ENSTA, IP Paris & Inria; Université de Haute-Alsace; IRIMAS EA 7499; Université Brest; CHU Brest; INSERM, UMR 1101
Keywords: showing encouraging results, non-medical applications, limited use contexts, automatic monitoring, monitoring and coaching
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC)
Comments:

Abstract:While automatic monitoring and coaching of exercises are showing encouraging results in non-medical applications, they still have limitations such as errors and limited use contexts. To allow the development and assessment of physical rehabilitation by an intelligent tutoring system, we identify in this article four challenges to address and propose a medical dataset of clinical patients carrying out low back-pain rehabilitation exercises. The dataset includes 3D Kinect skeleton positions and orientations, RGB videos, 2D skeleton data, and medical annotations to assess correctness, together with error classification and localisation by body part and timespan. Alongside this dataset, we perform a complete research path, from data collection to processing, and finally a small benchmark. We evaluated on the dataset two baseline movement recognition algorithms, pertaining to two different approaches: the probabilistic approach with a Gaussian Mixture Model (GMM), and the deep learning approach with a Long-Short Term Memory (LSTM). This dataset is valuable because it includes rehabilitation-relevant motions in a clinical setting with patients in their rehabilitation program, using a cost-effective, portable, and convenient sensor, and because it shows the potential for improvement on these challenges. ACM classes: I.5.4; I.4.8. Journal reference: IJCNN 2024.

[CV-133] Head and Neck Tumor Segmentation of MRI from Pre- and Mid-radiotherapy with Pre-training Data Augmentation and Dual Flow UNet

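【Quick Read】: This paper addresses accurate segmentation of head and neck tumors and metastatic lymph nodes, which is crucial for treatment planning and prognostic analysis. The key to the solution is developing and optimizing automated segmentation techniques, with different strategies for pre-radiotherapy (pre-RT) and mid-radiotherapy (mid-RT) images. For pre-RT images, a fully supervised learning approach is used and enhanced with pre-trained weights and MixUp data augmentation. For mid-RT images, a computationally friendly network architecture is designed with separate encoders for mid-RT images and for registered pre-RT images with their labels, integrating information from the pre-RT images and labels progressively during forward propagation. Finally, ensemble averaging of model predictions improves segmentation performance, with an aggregated Dice Similarity Coefficient (DSC) of 82.38% for pre-RT and 72.53% for mid-RT images.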

Link: https://arxiv.org/abs/2412.14846
Authors: Litingyu Wang, Wenjun Liao, Shichuan Zhang, Guotai Wang
Affiliations: Unknown
Keywords: metastatic lymph nodes, Head and neck, tumors and metastatic, metastatic lymph, lymph nodes
Subjects: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Head and neck tumors and metastatic lymph nodes are crucial for treatment planning and prognostic analysis. Accurate segmentation and quantitative analysis of these structures require pixel-level annotation, making automated segmentation techniques essential for the diagnosis and treatment of head and neck cancer. In this study, we investigated the effects of multiple strategies on the segmentation of pre-radiotherapy (pre-RT) and mid-radiotherapy (mid-RT) images. For the segmentation of pre-RT images, we utilized: 1) a fully supervised learning approach, and 2) the same approach enhanced with pre-trained weights and the MixUp data augmentation technique. For mid-RT images, we introduced a novel computational-friendly network architecture that features separate encoders for mid-RT images and registered pre-RT images with their labels. The mid-RT encoder branch integrates information from pre-RT images and labels progressively during the forward propagation. We selected the highest-performing model from each fold and used their predictions to create an ensemble average for inference. In the final test, our models achieved a segmentation performance of 82.38% for pre-RT and 72.53% for mid-RT on aggregated Dice Similarity Coefficient (DSC) as HiLab. Our code is available at this https URL.
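The reported metric can be illustrated with a short sketch. It assumes the usual pooled definition of aggregated DSC, in which intersections and mask sizes are summed over all cases before the ratio is taken; the abstract does not spell the formula out, so treat that definition, the function names, and the toy masks as assumptions.

```python
def dice(pred, truth):
    """Dice Similarity Coefficient for two flat binary masks (lists of 0/1)."""
    intersection = sum(p * t for p, t in zip(pred, truth))
    total = sum(pred) + sum(truth)
    # Two empty masks are treated as a perfect match.
    return 2.0 * intersection / total if total else 1.0


def aggregated_dice(cases):
    """Pooled DSC: sum intersections and mask sizes over all cases, then divide once."""
    intersection = sum(sum(p * t for p, t in zip(pred, truth)) for pred, truth in cases)
    total = sum(sum(pred) + sum(truth) for pred, truth in cases)
    return 2.0 * intersection / total if total else 1.0


# Two toy cases with made-up masks: one poor prediction, one perfect prediction.
cases = [
    ([1, 1, 0, 0], [1, 0, 1, 0]),
    ([1, 1, 1, 1], [1, 1, 1, 1]),
]

per_case = [dice(pred, truth) for pred, truth in cases]
print(per_case)
print(round(sum(per_case) / len(per_case), 4))  # mean of per-case DSC
print(round(aggregated_dice(cases), 4))         # pooled DSC
```

The pooled score weights each case by how much foreground it contains, so it differs from the plain mean of per-case scores and is less sensitive to cases with very small or empty targets.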

Artificial Intelligence

[AI-0] Human-Humanoid Robots Cross-Embodiment Behavior-Skill Transfer Using Decomposed Adversarial Learning from Demonstration

Link: https://arxiv.org/abs/2412.15166
Authors: Junjia Liu, Zhuo Li, Minghao Yu, Zhipeng Dong, Sylvain Calinon, Darwin Caldwell, Fei Chen
Keywords: embodied intelligent agents, intelligent agents capable, scenarios requiring strenuous, repetitive labor, envisioned as embodied
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments: 9 pages, 8 figures. Accepted by IEEE Robotics and Automation Magazine

Abstract:Humanoid robots are envisioned as embodied intelligent agents capable of performing a wide range of human-level loco-manipulation tasks, particularly in scenarios requiring strenuous and repetitive labor. However, learning these skills is challenging due to the high degrees of freedom of humanoid robots, and collecting sufficient training data for humanoids is a laborious process. Given the rapid introduction of new humanoid platforms, a cross-embodiment framework that allows generalizable skill transfer is becoming increasingly critical. To address this, we propose a transferable framework that reduces the data bottleneck by using a unified digital human model as a common prototype and bypassing the need for re-training on every new robot platform. The model learns behavior primitives from human demonstrations through adversarial imitation, and the complex robot structures are decomposed into functional components, each trained independently and dynamically coordinated. Task generalization is achieved through a human-object interaction graph, and skills are transferred to different robots via embodiment-specific kinematic motion retargeting and dynamic fine-tuning. Our framework is validated on five humanoid robots with diverse configurations, demonstrating stable loco-manipulation and highlighting its effectiveness in reducing data requirements and increasing the efficiency of skill transfer across platforms.

[AI-1] Operationalising Rawlsian Ethics for Fairness in Norm-Learning Agents AAAI2025

Link: https://arxiv.org/abs/2412.15163
Authors: Jessica Woodgate, Paul Marshall, Nirav Ajmeri
Keywords: standards of behaviour, behaviour common, Rawlsian ethics, agents, RAWL-E
Subjects: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 14 pages, 7 figures, 8 tables (and supplementary material with reproducibility and additional results), accepted at AAAI 2025

Abstract:Social norms are standards of behaviour common in a society. However, when agents make decisions without considering how others are impacted, norms can emerge that lead to the subjugation of certain agents. We present RAWL-E, a method to create ethical norm-learning agents. RAWL-E agents operationalise maximin, a fairness principle from Rawlsian ethics, in their decision-making processes to promote ethical norms by balancing societal well-being with individual goals. We evaluate RAWL-E agents in simulated harvesting scenarios. We find that norms emerging in RAWL-E agent societies enhance social welfare, fairness, and robustness, and yield higher minimum experience compared to those that emerge in agent societies that do not implement Rawlsian ethics.
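The maximin principle named in the abstract can be sketched in a few lines: among the available actions, prefer the one whose worst-off agent is best off. This is only the bare principle, not the RAWL-E method, which the abstract says also balances individual goals; the action names and well-being values below are hypothetical.

```python
def maximin_action(outcomes):
    """Pick the action whose worst-off agent ends up best off.

    `outcomes` maps each action to the list of per-agent well-being values
    that would result from taking it.
    """
    return max(outcomes, key=lambda action: min(outcomes[action]))


def utilitarian_action(outcomes):
    """For contrast: pick the action with the highest total well-being."""
    return max(outcomes, key=lambda action: sum(outcomes[action]))


# Hypothetical harvesting scenario with three agents; all numbers are made up.
outcomes = {
    "harvest_all": [12, 1, 1],  # one agent takes nearly everything
    "share": [5, 4, 4],         # resources are split more evenly
    "wait": [3, 3, 3],          # nobody harvests this round
}

print(maximin_action(outcomes))
print(utilitarian_action(outcomes))
```

In this toy example the two rules disagree: the total is highest when one agent takes almost everything, while maximin prefers sharing because it raises the minimum experienced by any agent.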

[AI-2] Probabilistic Strategy Logic with Degrees of Observability

Link: https://arxiv.org/abs/2412.15135
Authors: Chunyan Mu, Nima Motamed, Natasha Alechina, Brian Logan
Keywords: Probabilistic Strategy Logic, Probabilistic Strategy, information transparency, considerable work, strategic ability
Subjects: Artificial Intelligence (cs.AI); Logic in Computer Science (cs.LO)
Comments:

Abstract:There has been considerable work on reasoning about the strategic ability of agents under imperfect information. However, existing logics such as Probabilistic Strategy Logic are unable to express properties relating to information transparency. Information transparency concerns the extent to which agents’ actions and behaviours are observable by other agents. Reasoning about information transparency is useful in many domains including security, privacy, and decision-making. In this paper, we present a formal framework for reasoning about information transparency properties in stochastic multi-agent systems. We extend Probabilistic Strategy Logic with new observability operators that capture the degree of observability of temporal properties by agents. We show that the model checking problem for the resulting logic is decidable.

[AI-3] Towards Friendly AI: A Comprehensive Review and New Perspectives on Human-AI Alignment

Link: https://arxiv.org/abs/2412.15114
Authors: Qiyang Sun, Yupei Li, Emran Alturki, Sunil Munthumoduku Krishna Murthy, Björn W. Schuller
Keywords: Artificial Intelligence, continues to advance, advance rapidly, equitable and fair, reviews examining FAI
Subjects: Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
Comments:

Abstract:As Artificial Intelligence (AI) continues to advance rapidly, Friendly AI (FAI) has been proposed to advocate for more equitable and fair development of AI. Despite its importance, there is a lack of comprehensive reviews examining FAI from an ethical perspective, as well as limited discussion on its potential applications and future directions. This paper addresses these gaps by providing a thorough review of FAI, focusing on theoretical perspectives both for and against its development, and presenting a formal definition in a clear and accessible format. Key applications are discussed from the perspectives of eXplainable AI (XAI), privacy, fairness and affective computing (AC). Additionally, the paper identifies challenges in current technological advancements and explores future research avenues. The findings emphasise the significance of developing FAI and advocate for its continued advancement to ensure ethical and beneficial AI development.

[AI-4] Learning Disentangled Equivariant Representation for Explicitly Controllable 3D Molecule Generation AAAI2025

Link: https://arxiv.org/abs/2412.15086
Authors: Haoran Liu, Youzhi Luo, Tianxiao Li, James Caverlee, Martin Renqiang Min
Keywords: Synthetic Accessibility score, Quantitative Estimate, Estimate of Druglikeness, Druglikeness or Synthetic, Synthetic Accessibility
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: AAAI 2025

Abstract:We consider the conditional generation of 3D drug-like molecules with explicit control over molecular properties such as drug-like properties (e.g., Quantitative Estimate of Druglikeness or Synthetic Accessibility score) and effectively binding to specific protein sites. To tackle this problem, we propose an E(3)-equivariant Wasserstein autoencoder and factorize the latent space of our generative model into two disentangled aspects: molecular properties and the remaining structural context of 3D molecules. Our model ensures explicit control over these molecular attributes while maintaining equivariance of coordinate representation and invariance of data likelihood. Furthermore, we introduce a novel alignment-based coordinate loss to adapt equivariant networks for auto-regressive de-novo 3D molecule generation from scratch. Extensive experiments validate our model’s effectiveness on property-guided and context-guided molecule generation, both for de-novo 3D molecule design and structure-based drug discovery against protein targets.

[AI-5] Measuring, Modeling, and Helping People Account for Privacy Risks in Online Self-Disclosures with AI

Link: https://arxiv.org/abs/2412.15047
Authors: Isadora Krsek, Anubha Kabra, Yao Dou, Tarek Naous, Laura A. Dabbish, Alan Ritter, Wei Xu, Sauvik Das
Keywords: pseudonymous online fora, understanding strangers, pseudonymous online, online fora, in-laws to understanding
Subjects: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
Comments: 31 pages, 5 figures, Accepted for publication at CSCW 2025

Abstract:In pseudonymous online fora like Reddit, the benefits of self-disclosure are often apparent to users (e.g., I can vent about my in-laws to understanding strangers), but the privacy risks are more abstract (e.g., will my partner be able to tell that this is me?). Prior work has sought to develop natural language processing (NLP) tools that help users identify potentially risky self-disclosures in their text, but none have been designed for or evaluated with the users they hope to protect. Absent this assessment, these tools will be limited by the social-technical gap: users need assistive tools that help them make informed decisions, not paternalistic tools that tell them to avoid self-disclosure altogether. To bridge this gap, we conducted a study with N = 21 Reddit users; we had them use a state-of-the-art NLP disclosure detection model on two of their authored posts and asked them questions to understand if and how the model helped, where it fell short, and how it could be improved to help them make more informed decisions. Despite its imperfections, users responded positively to the model and highlighted its use as a tool that can help them catch mistakes, inform them of risks they were unaware of, and encourage self-reflection. However, our work also shows how, to be useful and usable, AI for supporting privacy decision-making must account for posting context, disclosure norms, and users’ lived threat models, and provide explanations that help contextualize detected risks.

[AI-6] HSEvo: Elevating Automatic Heuristic Design with Diversity-Driven Harmony Search and Genetic Algorithm Using LLMs

Link: https://arxiv.org/abs/2412.14995
Authors: Pham Vu Tuan Dat, Long Doan, Huynh Thi Thanh Binh
Keywords: Automatic Heuristic Design, active research area, research area due, NP-hard combinatorial optimization, Large Language Models
Categories: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI)
Comments: 18 pages, 12 figures

Abstract:Automatic Heuristic Design (AHD) is an active research area due to its utility in solving complex search and NP-hard combinatorial optimization problems in the real world. The recent advancements in Large Language Models (LLMs) introduce new possibilities by coupling LLMs with evolutionary computation to automatically generate heuristics, known as LLM-based Evolutionary Program Search (LLM-EPS). While previous LLM-EPS studies obtained great performance on various tasks, there is still a gap in understanding the properties of heuristic search spaces and achieving a balance between exploration and exploitation, which is a critical factor in large heuristic search spaces. In this study, we address this gap by proposing two diversity measurement metrics and perform an analysis on previous LLM-EPS approaches, including FunSearch, EoH, and ReEvo. Results on black-box AHD problems reveal that while EoH demonstrates higher diversity than FunSearch and ReEvo, its objective score is unstable. Conversely, ReEvo’s reflection mechanism yields good objective scores but fails to optimize diversity effectively. With this finding in mind, we introduce HSEvo, an adaptive LLM-EPS framework that maintains a balance between diversity and convergence with a harmony search algorithm. Through experimentation, we find that HSEvo achieved high diversity indices and good objective scores while remaining cost-effective. These results underscore the importance of balancing exploration and exploitation and understanding heuristic search spaces in designing frameworks in LLM-EPS.

[AI-7] Generalizing Constraint Models in Constraint Acquisition

Link: https://arxiv.org/abs/2412.14950
Authors: Dimos Tsouros, Senne Berden, Steven Prestwich, Tias Guns
Keywords: Constraint Acquisition, aims to widen, programming by assisting, assisting users, Constraint
Categories: Artificial Intelligence (cs.AI)

Abstract:Constraint Acquisition (CA) aims to widen the use of constraint programming by assisting users in the modeling process. However, most CA methods suffer from a significant drawback: they learn a single set of individual constraints for a specific problem instance, but cannot generalize these constraints to the parameterized constraint specifications of the problem. In this paper, we address this limitation by proposing GenCon, a novel approach to learn parameterized constraint models capable of modeling varying instances of the same problem. To achieve this generalization, we make use of statistical learning techniques at the level of individual constraints. Specifically, we propose to train a classifier to predict, for any possible constraint and parameterization, whether the constraint belongs to the problem. We then show how, for some classes of classifiers, we can extract decision rules to construct interpretable constraint specifications. This enables the generation of ground constraints for any parameter instantiation. Additionally, we present a generate-and-test approach that can be used with any classifier, to generate the ground constraints on the fly. Our empirical results demonstrate that our approach achieves high accuracy and is robust to noise in the input instances.

[AI-8] Cirbo: A New Tool for Boolean Circuit Analysis and Synthesis AAAI2025

Link: https://arxiv.org/abs/2412.14933
Authors: Daniil Averkov, Tatiana Belova, Gregory Emdin, Mikhail Goncharov, Viktoriia Krivogornitsyna, Alexander S. Kulikov, Fedor Kurmazov, Daniil Levtsov, Georgie Levtsov, Vsevolod Vaskin, Aleksey Vorobiev
Keywords: manipulating Boolean circuits, manipulating Boolean, Boolean circuits, present an open-source, open-source tool
Categories: Logic in Computer Science (cs.LO); Artificial Intelligence (cs.AI)
Comments: To appear in AAAI 2025

Abstract:We present an open-source tool for manipulating Boolean circuits. It implements efficient algorithms, both existing and novel, for a rich variety of frequently used circuit tasks such as satisfiability, synthesis, and minimization. We tested the tool on a wide range of practically relevant circuits (computing, in particular, symmetric and arithmetic functions) that have been optimized intensively by the community for the last three years. The tool helped us to win the IWLS 2024 Programming Contest. In 2023, it was Google DeepMind who took the first place in the competition. We were able to reduce the size of the best circuits from 2023 by 12% on average, whereas for some individual circuits, our size reduction was as large as 83%.

[AI-9] Helping LLMs Improve Code Generation Using Feedback from Testing and Static Analysis

Link: https://arxiv.org/abs/2412.14841
Authors: Greta Dolcetti, Vincenzo Arceri, Eleonora Iotti, Sergio Maffeis, Agostino Cortesi, Enea Zaffanella
Keywords: Large Language Models, Large Language, software development life-cycle, Language Models, development life-cycle
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)

Abstract: Large Language Models (LLMs) are one of the most promising developments in the field of artificial intelligence, and the software engineering community has readily noticed their potential role in the software development life-cycle. Developers routinely ask LLMs to generate code snippets, increasing productivity but also potentially introducing ownership, privacy, correctness, and security issues. Previous work highlighted how code generated by mainstream commercial LLMs is often not safe, containing vulnerabilities, bugs, and code smells. In this paper, we present a framework that leverages testing and static analysis to assess the quality, and guide the self-improvement, of code generated by general-purpose, open-source LLMs. First, we ask LLMs to generate C code to solve a number of programming tasks. Then we employ ground-truth tests to assess the (in)correctness of the generated code, and a static analysis tool to detect potential safety vulnerabilities. Next, we assess the models' ability to evaluate the generated code by asking them to detect errors and vulnerabilities. Finally, we test the models' ability to fix the generated code, providing the reports produced during the static analysis and incorrectness evaluation phases as feedback. Our results show that models often produce incorrect code, and that the generated code can include safety issues. Moreover, they perform very poorly at detecting either issue. On the positive side, we observe a substantial ability to fix flawed code when provided with information about failed tests or potential vulnerabilities, indicating a promising avenue for improving the safety of LLM-based code generation tools.

[AI-10] Answer Set Networks: Casting Answer Set Programming into Deep Learning

Link: https://arxiv.org/abs/2412.14814
Authors: Arseny Skryagin, Daniel Ochs, Phillip Deibert, Simon Kohaut, Devendra Singh Dhami, Kristian Kersting
Keywords: Answer Set Programming, Answer Set Networks, Answer Set, propose Answer Set, computing stable models
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Symbolic Computation (cs.SC)
Comments: 16 pages, 9 figures

Abstract:Although Answer Set Programming (ASP) allows constraining neural-symbolic (NeSy) systems, its employment is hindered by the prohibitive costs of computing stable models and the CPU-bound nature of state-of-the-art solvers. To this end, we propose Answer Set Networks (ASN), a NeSy solver. Based on Graph Neural Networks (GNN), ASNs are a scalable approach to ASP-based Deep Probabilistic Logic Programming (DPPL). Specifically, we show how to translate ASPs into ASNs and demonstrate how ASNs can efficiently solve the encoded problem by leveraging GPU’s batching and parallelization capabilities. Our experimental evaluations demonstrate that ASNs outperform state-of-the-art CPU-bound NeSy systems on multiple tasks. Simultaneously, we make the following two contributions based on the strengths of ASNs. Namely, we are the first to show the finetuning of Large Language Models (LLM) with DPPLs, employing ASNs to guide the training with logic. Further, we show the “constitutional navigation” of drones, i.e., encoding public aviation laws in an ASN for routing Unmanned Aerial Vehicles in uncertain environments.

[AI-11] MARIA: a Multimodal Transformer Model for Incomplete Healthcare Data

Link: https://arxiv.org/abs/2412.14810
Authors: Camillo Maria Caruso, Paolo Soda, Valerio Guarrasi
Keywords: developing comprehensive diagnostic, Multimodal Attention Resilient, pivotal for developing, developing comprehensive, MARIA
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:In healthcare, the integration of multimodal data is pivotal for developing comprehensive diagnostic and predictive models. However, managing missing data remains a significant challenge in real-world applications. We introduce MARIA (Multimodal Attention Resilient to Incomplete datA), a novel transformer-based deep learning model designed to address these challenges through an intermediate fusion strategy. Unlike conventional approaches that depend on imputation, MARIA utilizes a masked self-attention mechanism, which processes only the available data without generating synthetic values. This approach enables it to effectively handle incomplete datasets, enhancing robustness and minimizing biases introduced by imputation methods. We evaluated MARIA against 10 state-of-the-art machine learning and deep learning models across 8 diagnostic and prognostic tasks. The results demonstrate that MARIA outperforms existing methods in terms of performance and resilience to varying levels of data incompleteness, underscoring its potential for critical healthcare applications.
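
The masking idea in this abstract can be illustrated with a minimal sketch. It is not MARIA's architecture: the single head, the identity query/key/value projections, and the names `masked_self_attention` and `available` are assumptions made for illustration. Missing tokens get an attention score of negative infinity, so they receive zero weight and no imputed value is ever needed.

```python
import numpy as np

def masked_self_attention(x, available):
    """Single-head self-attention that attends only to observed tokens.

    x:         (n_tokens, d) array; rows of missing tokens may hold NaN.
    available: (n_tokens,) boolean array, True where the token was observed.
    Hypothetical helper: identity projections are used instead of learned ones.
    """
    if not available.any():
        raise ValueError("at least one token must be available")
    d = x.shape[1]
    # Replace missing rows by zeros so NaNs cannot leak into the products.
    x = np.where(available[:, None], x, 0.0)
    scores = x @ x.T / np.sqrt(d)
    # Mask missing keys: exp(-inf) = 0, so they get zero attention weight.
    scores = np.where(available[None, :], scores, -np.inf)
    scores = scores - scores.max(axis=1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=1, keepdims=True)
    out = weights @ x
    # Outputs for missing query positions are zeroed as well.
    return np.where(available[:, None], out, 0.0), weights

# Three tokens (e.g. three modalities); the third one is missing.
x = np.array([[1.0, 0.0], [0.0, 1.0], [np.nan, np.nan]])
available = np.array([True, True, False])
out, weights = masked_self_attention(x, available)
print(weights)
```

Whatever is stored in the missing row has no effect on the result, which is the property that lets a model of this kind skip imputation.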

[AI-12] Stack Trace Deduplication: Faster, More Accurately, and in More Realistic Scenarios

Link: https://arxiv.org/abs/2412.14802
Authors: Egor Shibaev, Denis Sushentsev, Yaroslav Golubev, Aleksandr Khvorov
Keywords: fully-fledged bug reports, large-scale software systems, stack traces, fully-fledged bug, bug reports
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Published at SANER’25. 11 pages, 2 figures

Abstract: In large-scale software systems, there are often no fully-fledged bug reports with human-written descriptions when an error occurs. In this case, developers rely on stack traces, i.e., series of function calls that led to the error. Since there can be tens and hundreds of thousands of them describing the same issue from different users, automatic deduplication into categories is necessary to allow for processing. Recent works have proposed powerful deep learning-based approaches for this, but they are evaluated and compared in isolation from real-life workflows, and it is not clear whether they will actually work well at scale. To overcome this gap, this work presents three main contributions: a novel model, an industry-based dataset, and a multi-faceted evaluation. Our model consists of two parts - (1) an embedding model with byte-pair encoding and approximate nearest neighbor search to quickly find the most relevant stack traces to the incoming one, and (2) a reranker that re-ranks the most fitting stack traces, taking into account the repeated frames between them. To complement the existing datasets collected from open-source projects, we share with the community SlowOps - a dataset of stack traces from IntelliJ-based products developed by JetBrains, which has an order of magnitude more stack traces per category. Finally, we carry out an evaluation that strives to be realistic: measuring not only the accuracy of categorization, but also the operation time and the ability to create new categories. The evaluation shows that our model strikes a good balance - it outperforms other models on both open-source datasets and SlowOps, while also being faster than most. We release all of our code and data, and hope that our work can pave the way to further practice-oriented research in the area.
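
The two-stage shape described above (cheap retrieval, then reranking, with the option of opening a new category) can be sketched as follows. This is a toy stand-in, not the authors' model. Bag-of-frames cosine similarity replaces the byte-pair-encoded embedding and approximate nearest neighbor search, the reranker simply counts frames shared at the top of the stack, and every function name and the `min_shared` threshold are hypothetical.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two frame-count vectors (Counters)."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query, corpus, k=2):
    """Stage 1: keep the k categories whose traces look most similar."""
    q = Counter(query)
    ranked = sorted(corpus, key=lambda cid: cosine(q, Counter(corpus[cid])),
                    reverse=True)
    return ranked[:k]

def shared_top_frames(a, b):
    """Number of identical frames at the top of both stacks."""
    n = 0
    for fa, fb in zip(a, b):
        if fa != fb:
            break
        n += 1
    return n

def assign(query, corpus, k=2, min_shared=1):
    """Stage 2: rerank the candidates; return None to open a new category."""
    candidates = retrieve(query, corpus, k)
    best = max(candidates, key=lambda cid: shared_top_frames(query, corpus[cid]))
    if shared_top_frames(query, corpus[best]) < min_shared:
        return None
    return best

# One representative trace per known category (top frame first).
corpus = {
    "A": ["Parser.parse", "Lexer.next", "File.read", "Main.run"],
    "B": ["Editor.paint", "Editor.render", "UI.loop", "Main.run"],
    "C": ["UI.loop", "Editor.render", "Editor.paint", "Main.run"],
}
query = ["Editor.paint", "Editor.render", "Plugin.hook", "Main.run"]
novel = ["Net.connect", "Socket.open", "Main.run"]
print(assign(query, corpus), assign(novel, corpus))
```

Categories B and C contain the same frames as the query and tie in stage 1. Only the order-aware reranker separates them, which is the reason for having a second stage.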

[AI-13] Agent-Temporal Credit Assignment for Optimal Policy Preservation in Sparse Multi-Agent Reinforcement Learning

Link: https://arxiv.org/abs/2412.14779
Authors: Aditya Kapoor, Sushant Swamy, Kale-ab Tessera, Mayank Baranwal, Mingfei Sun, Harshad Khadilkar, Stefano V. Albrecht
Keywords: intermediate time steps, learn optimal policies, optimal policies due, time steps, TAR
Categories: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI); Computer Science and Game Theory (cs.GT); Machine Learning (cs.LG); Robotics (cs.RO)
Comments: 12 pages, 1 figure

Abstract: In multi-agent environments, agents often struggle to learn optimal policies due to sparse or delayed global rewards, particularly in long-horizon tasks where it is challenging to evaluate actions at intermediate time steps. We introduce Temporal-Agent Reward Redistribution (TAR$^2$), a novel approach designed to address the agent-temporal credit assignment problem by redistributing sparse rewards both temporally and across agents. TAR$^2$ decomposes sparse global rewards into time-step-specific rewards and calculates agent-specific contributions to these rewards. We theoretically prove that TAR$^2$ is equivalent to potential-based reward shaping, ensuring that the optimal policy remains unchanged. Empirical results demonstrate that TAR$^2$ stabilizes and accelerates the learning process. Additionally, we show that when TAR$^2$ is integrated with single-agent reinforcement learning algorithms, it performs as well as or better than traditional multi-agent reinforcement learning methods.
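
For readers unfamiliar with the term, the reason potential-based reward shaping leaves the optimal policy unchanged can be shown in two lines. The derivation below is the classical single-agent argument (Ng, Harada and Russell, 1999), not the paper's multi-agent proof. The potential $\Phi$, discount $\gamma$, and horizon $T$ are generic symbols introduced here for illustration.

```latex
% Shaping term added to the environment reward at each step:
F(s_t, s_{t+1}) = \gamma\,\Phi(s_{t+1}) - \Phi(s_t)

% Its discounted sum along any trajectory telescopes:
\sum_{t=0}^{T-1} \gamma^{t} F(s_t, s_{t+1})
  = \sum_{t=0}^{T-1} \bigl(\gamma^{t+1}\Phi(s_{t+1}) - \gamma^{t}\Phi(s_t)\bigr)
  = \gamma^{T}\Phi(s_T) - \Phi(s_0)
```

With $\Phi(s_T) = 0$ at terminal states, every trajectory from $s_0$ has its return shifted by the same constant $-\Phi(s_0)$, whatever actions are taken. The ranking of policies, and hence the optimal policy, is therefore unchanged.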

[AI-14] CodeRepoQA: A Large-scale Benchmark for Software Engineering Question Answering

Link: https://arxiv.org/abs/2412.14764
Authors: Ruida Hu, Chao Peng, Jingyi Ren, Bo Jiang, Xiangxin Meng, Qinyun Wu, Pengfei Gao, Xinchen Wang, Cuiyun Gao
Keywords: evaluating repository-level question-answering, large-scale benchmark specifically, benchmark specifically designed, software engineering, specifically designed
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)

Abstract: In this work, we introduce CodeRepoQA, a large-scale benchmark specifically designed for evaluating repository-level question-answering capabilities in the field of software engineering. CodeRepoQA encompasses five programming languages and covers a wide range of scenarios, enabling comprehensive evaluation of language models. To construct this dataset, we crawl data from 30 well-known repositories in GitHub, the largest platform for hosting and collaborating on code, and carefully filter raw data. In total, CodeRepoQA is a multi-turn question-answering benchmark with 585,687 entries, covering a diverse array of software engineering scenarios, with an average of 6.62 dialogue turns per entry. We evaluate ten popular large language models on our dataset and provide in-depth analysis. We find that LLMs still have limitations in question-answering capabilities in the field of software engineering, and medium-length contexts are more conducive to LLMs’ performance. The entire benchmark is publicly available at this https URL.

[AI-15] Advances in Artificial Intelligence for Diabetes Prediction: Insights from a Systematic Literature Review

Link: https://arxiv.org/abs/2412.14736
Authors: Pir Bakhsh Khokhar, Carmine Gravino, Fabio Palomba
Keywords: Nutrition Examination Survey, Singapore National Diabetic, National Diabetic Retinopathy, Diabetic Retinopathy Screening, systematic review explores
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)

Abstract:This systematic review explores the use of machine learning (ML) in predicting diabetes, focusing on datasets, algorithms, training methods, and evaluation metrics. It examines datasets like the Singapore National Diabetic Retinopathy Screening program, REPLACE-BG, National Health and Nutrition Examination Survey, and Pima Indians Diabetes Database. The review assesses the performance of ML algorithms like CNN, SVM, Logistic Regression, and XGBoost in predicting diabetes outcomes. The study emphasizes the importance of interdisciplinary collaboration and ethical considerations in ML-based diabetes prediction models.

[AI-16] Beyond the Hype: A Comprehensive Review of Current Trends in Generative AI Research, Teaching Practices, and Tools ICSE

Link: https://arxiv.org/abs/2412.14732
Authors: James Prather, Juho Leinonen, Natalie Kiesler, Jamie Gorson Benario, Sam Lau, Stephen MacNeil, Narges Norouzi, Simone Opel, Vee Pettit, Leo Porter, Brent N. Reeves, Jaromir Savelka, David H. Smith IV, Sven Strickroth, Daniel Zingaro
Keywords: GenAI, Generative, tools, GenAI tools, computing
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Software Engineering (cs.SE)
Comments: 39 pages, 10 figures, 16 tables. To be published in the Proceedings of the 2024 Working Group Reports on Innovation and Technology in Computer Science Education (ITiCSE-WGR 2024)

Abstract:Generative AI (GenAI) is advancing rapidly, and the literature in computing education is expanding almost as quickly. Initial responses to GenAI tools were mixed between panic and utopian optimism. Many were fast to point out the opportunities and challenges of GenAI. Researchers reported that these new tools are capable of solving most introductory programming tasks and are causing disruptions throughout the curriculum. These tools can write and explain code, enhance error messages, create resources for instructors, and even provide feedback and help for students like a traditional teaching assistant. In 2024, new research started to emerge on the effects of GenAI usage in the computing classroom. These new data involve the use of GenAI to support classroom instruction at scale and to teach students how to code with GenAI. In support of the former, a new class of tools is emerging that can provide personalized feedback to students on their programming assignments or teach both programming and prompting skills at the same time. With the literature expanding so rapidly, this report aims to summarize and explain what is happening on the ground in computing classrooms. We provide a systematic literature review; a survey of educators and industry professionals; and interviews with educators using GenAI in their courses, educators studying GenAI, and researchers who create GenAI tools to support computing education. The triangulation of these methods and data sources expands the understanding of GenAI usage and perceptions at this critical moment for our community.

[AI-17] LTLf Synthesis Under Unreliable Input AAAI2025

Link: https://arxiv.org/abs/2412.14728
Authors: Christian Hagemeier, Giuseppe de Giacomo, Moshe Y. Vardi
Keywords: realizing strategies, satisfied in case, case of unreliability, LTLf goal specification, LTLf backup specification
Categories: Artificial Intelligence (cs.AI); Logic in Computer Science (cs.LO)
Comments: 8 pages, to appear at AAAI 2025

Abstract:We study the problem of realizing strategies for an LTLf goal specification while ensuring that at least an LTLf backup specification is satisfied in case of unreliability of certain input variables. We formally define the problem and characterize its worst-case complexity as 2EXPTIME-complete, like standard LTLf synthesis. Then we devise three different solution techniques: one based on direct automata manipulation, which is 2EXPTIME, one disregarding unreliable input variables by adopting a belief construction, which is 3EXPTIME, and one leveraging second-order quantified LTLf (QLTLf), which is 2EXPTIME and allows for a direct encoding into monadic second-order logic, which in turn is worst-case nonelementary. We prove their correctness and evaluate them against each other empirically. Interestingly, theoretical worst-case bounds do not translate into observed performance; the MSO technique performs best, followed by belief construction and direct automata manipulation. As a byproduct of our study, we provide a general synthesis procedure for arbitrary QLTLf specifications.

[AI-18] Creation of AI-driven Smart Spaces for Enhanced Indoor Environments – A Survey

Link: https://arxiv.org/abs/2412.14708
Authors: Aygün Varol, Naser Hossein Motlagh, Mirka Leino, Sasu Tarkoma, Johanna Virkki
Keywords: optimize energy utilization, integrate diverse sensing, AI-driven smart spaces, ubiquitous computing environments, improve user comfort
Categories: Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC); Emerging Technologies (cs.ET); Human-Computer Interaction (cs.HC)
Comments: 39 pages, 3 figures, 1 table, journal

Abstract: Smart spaces are ubiquitous computing environments that integrate diverse sensing and communication technologies to enhance space functionality, optimize energy utilization, and improve user comfort and well-being. The integration of emerging AI methodologies into these environments facilitates the formation of AI-driven smart spaces, which further enhance the functionality of the spaces by enabling advanced applications such as personalized comfort settings, interactive living spaces, and automation of the space systems, all resulting in enhanced indoor experiences for users. In this paper, we present a systematic survey of existing research on the foundational components of AI-driven smart spaces, including sensor technologies, data communication protocols, sensor network management and maintenance strategies, as well as data collection, processing, and analytics. Given the pivotal role of AI in establishing AI-powered smart spaces, we explore the opportunities and challenges associated with traditional machine learning (ML) approaches, such as deep learning (DL), and emerging methodologies including large language models (LLMs). Finally, we provide key insights necessary for the development of AI-driven smart spaces, propose future research directions, and shed light on the path forward.

[AI-19] Bel Esprit: Multi-Agent Framework for Building AI Model Pipelines

Link: https://arxiv.org/abs/2412.14684
Authors: Yunsu Kim, AhmedElmogtaba Abdelaziz, Thiago Castro Ferreira, Mohamed Al-Badrashiny, Hassan Sawaf
Keywords: complex real-world tasks, address complex real-world, Bel Esprit, artificial intelligence, grows to address
Categories: Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Multiagent Systems (cs.MA)

Abstract:As the demand for artificial intelligence (AI) grows to address complex real-world tasks, single models are often insufficient, requiring the integration of multiple models into pipelines. This paper introduces Bel Esprit, a conversational agent designed to construct AI model pipelines based on user-defined requirements. Bel Esprit employs a multi-agent framework where subagents collaborate to clarify requirements, build, validate, and populate pipelines with appropriate models. We demonstrate the effectiveness of this framework in generating pipelines from ambiguous user queries, using both human-curated and synthetic data. A detailed error analysis highlights ongoing challenges in pipeline construction. Bel Esprit is available for a free trial at this https URL.

[AI-20] LoLaFL: Low-Latency Federated Learning via Forward-only Propagation

Link: https://arxiv.org/abs/2412.14668
Authors: Jierui Zhang, Jianhao Huang, Kaibin Huang
Keywords: ensuring data privacy, widely adopted paradigm, enabling edge learning, Low-Latency Federated Learning, Federated learning
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Networking and Internet Architecture (cs.NI)
Comments: 14 pages, 9 figures

Abstract: Federated learning (FL) has emerged as a widely adopted paradigm for enabling edge learning with distributed data while ensuring data privacy. However, the traditional FL with deep neural networks trained via backpropagation can hardly meet the low-latency learning requirements in the sixth generation (6G) mobile networks. This challenge mainly arises from the high-dimensional model parameters to be transmitted and the numerous rounds of communication required for convergence due to the inherent randomness of the training process. To address this issue, we adopt the state-of-the-art principle of maximal coding rate reduction to learn linear discriminative features and extend the resultant white-box neural network into FL, yielding the novel framework of Low-Latency Federated Learning (LoLaFL) via forward-only propagation. LoLaFL enables layer-wise transmissions and aggregation with significantly fewer communication rounds, thereby considerably reducing latency. Additionally, we propose two nonlinear aggregation schemes for LoLaFL. The first scheme is based on the proof that the optimal NN parameter aggregation in LoLaFL should be harmonic-mean-like. The second scheme further exploits the low-rank structures of the features and transmits the low-rank-approximated covariance matrices of features to achieve additional latency reduction. Theoretical analysis and experiments are conducted to evaluate the performance of LoLaFL. In comparison with traditional FL, the two nonlinear aggregation schemes for LoLaFL can achieve reductions in latency of over 91% and 98%, respectively, while maintaining comparable accuracies.

[AI-21] IOHunter: Graph Foundation Model to Uncover Online Information Operations

Link: https://arxiv.org/abs/2412.14663
Authors: Marco Minici, Luca Luceri, Francesco Fabbri, Emilio Ferrara
Keywords: serving as modern, influence societal narratives, vital spaces, modern agorás, wide range
Categories: Social and Information Networks (cs.SI); Artificial Intelligence (cs.AI)
Comments: 9 pages

Abstract: Social media platforms have become vital spaces for public discourse, serving as modern agorás where a wide range of voices influence societal narratives. However, their open nature also makes them vulnerable to exploitation by malicious actors, including state-sponsored entities, who can conduct information operations (IOs) to manipulate public opinion. The spread of misinformation, false news, and misleading claims threatens democratic processes and societal cohesion, making it crucial to develop methods for the timely detection of inauthentic activity to protect the integrity of online discourse. In this work, we introduce a methodology designed to identify users orchestrating information operations, a.k.a. IO drivers, across various influence campaigns. Our framework, named IOHunter, leverages the combined strengths of Language Models and Graph Neural Networks to improve generalization in supervised, scarcely-supervised, and cross-IO contexts. Our approach achieves state-of-the-art performance across multiple sets of IOs originating from six countries, significantly surpassing existing approaches. This research marks a step toward developing Graph Foundation Models specifically tailored for the task of IO detection on social media platforms.

[AI-22] A Shapley Value Estimation Speedup for Efficient Explainable Quantum AI

Link: https://arxiv.org/abs/2412.14639
Authors: Iain Burge, Michel Barbeau, Joaquin Garcia-Alfaro
Keywords: developing efficient post-hoc, efficient post-hoc explanations, post-hoc explanations, work focuses, focuses on developing
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments: 26 pages, 4 figures, 4 tables, 45 citations

Abstract:This work focuses on developing efficient post-hoc explanations for quantum AI algorithms. In classical contexts, the cooperative game theory concept of the Shapley value adapts naturally to post-hoc explanations, where it can be used to identify which factors are important in an AI’s decision-making process. An interesting question is how to translate Shapley values to the quantum setting and whether quantum effects could be used to accelerate their calculation. We propose quantum algorithms that can extract Shapley values within some confidence interval. Our method is capable of quadratically outperforming classical Monte Carlo approaches to approximating Shapley values up to polylogarithmic factors in various circumstances. We demonstrate the validity of our approach empirically with specific voting games and provide rigorous proofs of performance for general cooperative games.

[AI-23] Towards Scalable and Deep Graph Neural Networks via Noise Masking

Link: https://arxiv.org/abs/2412.14602
Authors: Yuxuan Liang, Wentao Zhang, Zeang Sheng, Ling Yang, Quanqing Xu, Jiawei Jiang, Yunhai Tong, Bin Cui
Keywords: Graph Neural Networks, Neural Networks, achieved remarkable success, graph mining tasks, Graph Neural
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract: In recent years, Graph Neural Networks (GNNs) have achieved remarkable success in many graph mining tasks. However, scaling them to large graphs is challenging due to the high computational and storage costs of repeated feature propagation and non-linear transformation during training. One commonly employed approach to address this challenge is model simplification, which executes the Propagation (P) only once during pre-processing, then Combines (C) the resulting receptive fields in different ways and feeds them into a simple model for better performance. Despite their high predictive performance and scalability, these methods still face two limitations. First, existing approaches mainly focus on exploring different C methods from the model perspective, neglecting the crucial problem of performance degradation with increasing P depth from the data-centric perspective, known as the over-smoothing problem. Second, pre-processing overhead takes up most of the end-to-end processing time, especially for large-scale graphs. To address these limitations, we present random walk with noise masking (RMask), a plug-and-play module compatible with the existing model-simplification works. This module enables the exploration of deeper GNNs while preserving their scalability. Unlike the previous model-simplification works, we focus on continuous P, find that the noise within each P causes the over-smoothing issue, and use an efficient masking mechanism to eliminate it. Experimental results on six real-world datasets demonstrate that model-simplification works equipped with RMask yield superior performance compared to their original versions and can make a good trade-off between accuracy and efficiency.

[AI-24] Characterising Simulation-Based Program Equilibria

Link: https://arxiv.org/abs/2412.14570
Authors: Emery Cooper, Caspar Oesterheld, Vincent Conitzer
Keywords: programs, Grounded, Bot, Tennenholtz, randomness
Categories: Computer Science and Game Theory (cs.GT); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)

Abstract: In Tennenholtz’s program equilibrium, players of a game submit programs to play on their behalf. Each program receives the other programs’ source code and outputs an action. This can model interactions involving AI agents, mutually transparent institutions, or commitments. Tennenholtz (2004) proves a folk theorem for program games, but the equilibria constructed are very brittle. We therefore consider simulation-based programs – i.e., programs that work by running opponents’ programs. These are relatively robust (in particular, two programs that act the same are treated the same) and are more practical than proof-based approaches. Oesterheld’s (2019) $\epsilon$Grounded$\pi$Bot is such an approach. Unfortunately, it is not generally applicable to games of three or more players, and only allows for a limited range of equilibria in two-player games. In this paper, we propose a generalisation to Oesterheld’s (2019) $\epsilon$Grounded$\pi$Bot. We prove a folk theorem for our programs in a setting with access to a shared source of randomness. We then characterise their equilibria in a setting without shared randomness. Both with and without shared randomness, we achieve a much wider range of equilibria than Oesterheld’s (2019) $\epsilon$Grounded$\pi$Bot. Finally, we explore the limits of simulation-based program equilibrium, showing that the Tennenholtz folk theorem cannot be attained by simulation-based programs without access to shared randomness.
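
To make "programs that work by running opponents' programs" concrete, here is a simplified sketch of the grounding idea for a two-player Prisoner's Dilemma. It is an illustration under assumptions, not the paper's generalised construction. Programs are modelled as Python callables rather than source code, and the names `make_eps_grounded_fair_bot` and `defect_bot` are hypothetical. With probability `eps` the bot cooperates outright. Otherwise it simulates the opponent playing against itself and copies the result.

```python
import random

def make_eps_grounded_fair_bot(eps, rng):
    """Return a program that cooperates with probability eps and otherwise
    copies what the opponent does when simulated against this program."""
    def bot(opponent):
        if rng.random() < eps:
            return "C"            # grounding step: ends the simulation chain
        return opponent(bot)      # simulate the opponent playing against us
    return bot

def defect_bot(opponent):
    """A program that always defects, regardless of the opponent."""
    return "D"

rng = random.Random(0)
bot_a = make_eps_grounded_fair_bot(0.2, rng)
bot_b = make_eps_grounded_fair_bot(0.2, rng)

# Two grounded bots simulate each other until one hits the grounding step,
# so the mutual simulation terminates with probability 1 and yields "C".
mutual = [bot_a(bot_b) for _ in range(100)]
versus_defector = [bot_a(defect_bot) for _ in range(200)]
print(set(mutual), set(versus_defector))
```

The grounding step is what prevents infinite regress when two simulating programs meet. Against a defector the bot defects except in the `eps` fraction of cases where it grounds out, which keeps mutual cooperation stable for small `eps`.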

[AI-25] Global Spatio-Temporal Fusion-based Traffic Prediction Algorithm with Anomaly Aware

Link: https://arxiv.org/abs/2412.14569
Authors: Chaoqun Liu, Xuanpeng Li, Chen Gong, Guangyu Li
Keywords: Traffic prediction, indispensable component, component of urban, urban planning, Traffic
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:Traffic prediction is an indispensable component of urban planning and traffic management. Achieving accurate traffic prediction hinges on the ability to capture the potential spatio-temporal relationships among road sensors. However, the majority of existing works focus on local short-term spatio-temporal correlations, failing to fully consider the interactions of different sensors in the long-term state. In addition, these works either do not analyze the influences of anomalous factors or have insufficient ability to extract personalized features of anomalous factors, so they fail to capture the spatio-temporal influence of such factors on traffic prediction. To address these issues, we propose a global spatio-temporal fusion-based traffic prediction algorithm that incorporates anomaly awareness. First, based on the designed anomaly detection network, we construct an efficient anomalous factors impacting module (AFIM) to evaluate the spatio-temporal impact of unexpected external events on traffic prediction. Furthermore, we propose a multi-scale spatio-temporal feature fusion module (MTSFFL) based on the transformer architecture to obtain both long-term and short-term correlations among different sensors in a wide-area traffic environment for accurate prediction of traffic flow. Finally, experiments on real-scenario public transportation datasets (PEMS04 and PEMS08) demonstrate that our approach can achieve state-of-the-art performance.

[AI-26] AIArena: A Blockchain-Based Decentralized AI Training Platform

Link: https://arxiv.org/abs/2412.14566
Authors: Zhipeng Wang, Rui Sun, Elizabeth Lui, Tuo Zhou, Yizhe Wen, Jiahao Sun
Keywords: underscored critical challenges, largely due, rapid advancement, underscored critical, critical challenges
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)
Comments:

Abstract:The rapid advancement of AI has underscored critical challenges in its development and implementation, largely due to centralized control by a few major corporations. This concentration of power intensifies biases within AI models, resulting from inadequate governance and oversight mechanisms. Additionally, it limits public involvement and heightens concerns about the integrity of model generation. Such monopolistic control over data and AI outputs threatens both innovation and fair data usage, as users inadvertently contribute data that primarily benefits these corporations. In this work, we propose AIArena, a blockchain-based decentralized AI training platform designed to democratize AI development and alignment through on-chain incentive mechanisms. AIArena fosters an open and collaborative environment where participants can contribute models and computing resources. Its on-chain consensus mechanism ensures fair rewards for participants based on their contributions. We instantiate and implement AIArena on the public Base blockchain Sepolia testnet, and the evaluation results demonstrate the feasibility of AIArena in real-world applications.

[AI-27] Overview of AI and Communication for 6G Network: Fundamentals, Challenges, and Future Research Opportunities

Link: https://arxiv.org/abs/2412.14538
Authors: Qimei Cui, Xiaohu You, Ni Wei, Guoshun Nan, Xuefei Zhang, Jianhua Zhang, Xinchen Lyu, Ming Ai, Xiaofeng Tao, Zhiyong Feng, Ping Zhang, Qingqing Wu, Meixia Tao, Yongming Huang, Chongwen Huang, Guangyi Liu, Chenghui Peng, Zhiwen Pan, Tao Sun, Dusit Niyato, Tao Chen, Muhammad Khurram Khan, Abbas Jamalipour, Mohsen Guizani, Chau Yuen
Keywords: artificial intelligence, revolutionary architecture, increasing demand, demand for seamless, seamless connectivity
Categories: Networking and Internet Architecture (cs.NI); Artificial Intelligence (cs.AI); Signal Processing (eess.SP)
Comments:

Abstract:With the increasing demand for seamless connectivity and intelligent communication, the integration of artificial intelligence (AI) and communication for sixth-generation (6G) network is emerging as a revolutionary architecture. This paper presents a comprehensive overview of AI and communication for 6G networks, emphasizing their foundational principles, inherent challenges, and future research opportunities. We commence with a retrospective analysis of AI and the evolution of large-scale AI models, underscoring their pivotal roles in shaping contemporary communication technologies. The discourse then transitions to a detailed exposition of the envisioned integration of AI within 6G networks, delineated across three progressive developmental stages. The initial stage, AI for Network, focuses on employing AI to augment network performance, optimize efficiency, and enhance user service experiences. The subsequent stage, Network for AI, highlights the role of the network in facilitating and buttressing AI operations and presents key enabling technologies, including digital twins for AI and semantic communication. In the final stage, AI as a Service, it is anticipated that future 6G networks will innately provide AI functions as services and support application scenarios like immersive communication and intelligent industrial robots. Specifically, we have defined the quality of AI service, which refers to the measurement framework system of AI services within the network. In addition to these developmental stages, we thoroughly examine the standardization processes pertinent to AI in network contexts, highlighting key milestones and ongoing efforts. Finally, we outline promising future research opportunities that could drive the evolution and refinement of AI and communication for 6G, positioning them as a cornerstone of next-generation communication infrastructure.

[AI-28] CAE-T: A Channelwise AutoEncoder with Transformer for EEG Abnormality Detection

Link: https://arxiv.org/abs/2412.14522
Authors: Youshen Zhao, Keiji Iramina
Keywords: complexity pose significant, pose significant challenges, abnormal brain activity, detecting abnormal brain, brain activity
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Neural and Evolutionary Computing (cs.NE); Signal Processing (eess.SP)
Comments: The manuscript consists of 10 pages, including 5 figures. The experimental results are based on evaluations using the TUH Abnormal EEG Corpus

Abstract:Electroencephalogram (EEG) signals are critical for detecting abnormal brain activity, but their high dimensionality and complexity pose significant challenges for effective analysis. In this paper, we propose CAE-T, a novel framework that combines a channelwise CNN-based autoencoder with a single-head transformer classifier for efficient EEG abnormality detection. The channelwise autoencoder compresses raw EEG signals while preserving channel independence, reducing computational costs and retaining biologically meaningful features. The compressed representations are then fed into the transformer-based classifier, which efficiently models long-term dependencies to distinguish between normal and abnormal signals. Evaluated on the TUH Abnormal EEG Corpus, the proposed model achieves 85.0% accuracy, 76.2% sensitivity, and 91.2% specificity at the per-case level, outperforming baseline models such as EEGNet, Deep4Conv, and FusionCNN. Furthermore, CAE-T requires only 202M FLOPs and 2.9M parameters, making it significantly more efficient than transformer-based alternatives. The framework retains interpretability through its channelwise design, demonstrating great potential for future applications in neuroscience research and clinical practice. The source code is available at this https URL.

[AI-29] Relational Programming with Foundation Models

Link: https://arxiv.org/abs/2412.14515
Authors: Ziyang Li, Jiani Huang, Jason Liu, Felix Zhu, Eric Zhao, William Dodds, Neelay Velingker, Rajeev Alur, Mayur Naik
Keywords: Foundation models, vast potential, potential to enable, models, Foundation
Categories: Artificial Intelligence (cs.AI); Programming Languages (cs.PL)
Comments:

Abstract:Foundation models have vast potential to enable diverse AI applications. The powerful yet incomplete nature of these models has spurred a wide range of mechanisms to augment them with capabilities such as in-context learning, information retrieval, and code interpreting. We propose Vieira, a declarative framework that unifies these mechanisms in a general solution for programming with foundation models. Vieira follows a probabilistic relational paradigm and treats foundation models as stateless functions with relational inputs and outputs. It supports neuro-symbolic applications by enabling the seamless combination of such models with logic programs, as well as complex, multi-modal applications by streamlining the composition of diverse sub-models. We implement Vieira by extending the Scallop compiler with a foreign interface that supports foundation models as plugins. We implement plugins for 12 foundation models including GPT, CLIP, and SAM. We evaluate Vieira on 9 challenging tasks that span language, vision, and structured and vector databases. Our evaluation shows that programs in Vieira are concise, can incorporate modern foundation models, and have comparable or better accuracy than competitive baselines.

[AI-30] The Digital Ecosystem of Beliefs: does evolution favour AI over humans?

Link: https://arxiv.org/abs/2412.14500
Authors: David M. Bossens, Shanshan Feng, Yew-Soon Ong
Keywords: simulated social networks, Digital Ecosystem, social networks
Categories: Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA); Neural and Evolutionary Computing (cs.NE)
Comments:

Abstract:As AI systems are integrated into social networks, there are AI safety concerns that AI-generated content may dominate the web, e.g. in popularity or impact on beliefs. To understand such questions, this paper proposes the Digital Ecosystem of Beliefs (Digico), the first evolutionary framework for controlled experimentation with multi-population interactions in simulated social networks. The framework models a population of agents which change their messaging strategies due to evolutionary updates following a Universal Darwinism approach, interact via messages, influence each other’s beliefs through dynamics based on a contagion model, and maintain their beliefs through cognitive Lamarckian inheritance. Initial experiments with an abstract implementation of Digico show that: a) when AIs have faster messaging, evolution, and more influence in the recommendation algorithm, they get 80% to 95% of the views, depending on the size of the influence benefit; b) AIs designed for propaganda can typically convince 50% of humans to adopt extreme beliefs, and up to 85% when agents believe only a limited number of channels; c) a penalty for content that violates agents’ beliefs reduces propaganda effectiveness by up to 8%. We further discuss implications for control (e.g. legislation) and Digico as a means of studying evolutionary principles.

[AI-31] Treatment Effects Estimation on Networked Observational Data using Disentangled Variational Graph Autoencoder

Link: https://arxiv.org/abs/2412.14497
Authors: Di Fan, Renlei Jiang, Yunhao Wen, Chuanhou Gao
Keywords: Estimating individual treatment, gained increasing attention, Estimating individual, observational data, Networked observational data
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
Comments: 21 pages, 6 figures

Abstract:Estimating individual treatment effect (ITE) from observational data has gained increasing attention across various domains, with a key challenge being the identification of latent confounders affecting both treatment and outcome. Networked observational data offer new opportunities to address this issue by utilizing network information to infer latent confounders. However, most existing approaches assume observed variables and network information serve only as proxy variables for latent confounders, which often fails in practice, as some variables influence treatment but not outcomes, and vice versa. Recent advances in disentangled representation learning, which disentangle latent factors into instrumental, confounding, and adjustment factors, have shown promise for ITE estimation. Building on this, we propose a novel disentangled variational graph autoencoder that learns disentangled factors for treatment effect estimation on networked observational data. Our graph encoder further ensures factor independence using the Hilbert-Schmidt Independence Criterion. Extensive experiments on two semi-synthetic datasets derived from real-world social networks and one synthetic dataset demonstrate that our method achieves state-of-the-art performance.
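
For reference, the standard biased empirical estimator of the Hilbert-Schmidt Independence Criterion from the kernel-independence literature is given below. The abstract does not say which estimator or kernels the authors use, so the symbols here ($K$, $L$, $H$, $n$, and the kernels $k$, $l$) are generic notation rather than the paper's.

```latex
\widehat{\mathrm{HSIC}}(X, Y) = \frac{1}{(n-1)^{2}} \operatorname{tr}\left( K H L H \right),
\qquad
H = I_{n} - \frac{1}{n} \mathbf{1}\mathbf{1}^{\top},
\qquad
K_{ij} = k(x_i, x_j), \quad L_{ij} = l(y_i, y_j).
```

A value close to zero indicates that the two sets of samples are close to independent under the chosen kernels, which is why such a term can serve as a penalty that pushes latent factors apart.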

[AI-32] FaultExplainer: Leveraging Large Language Models for Interpretable Fault Detection and Diagnosis

Link: https://arxiv.org/abs/2412.14492
Authors: Abdullah Khan, Rahul Nahar, Hao Chen, Gonzalo E. Constante Flores, Can Li
Keywords: Machine learning algorithms, Machine learning, chemical processes, previously unseen faults, existing data-driven FDD
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Systems and Control (eess.SY)
Comments:

Abstract:Machine learning algorithms are increasingly being applied to fault detection and diagnosis (FDD) in chemical processes. However, existing data-driven FDD platforms often lack interpretability for process operators and struggle to identify root causes of previously unseen faults. This paper presents FaultExplainer, an interactive tool designed to improve fault detection, diagnosis, and explanation in the Tennessee Eastman Process (TEP). FaultExplainer integrates real-time sensor data visualization, Principal Component Analysis (PCA)-based fault detection, and identification of top contributing variables within an interactive user interface powered by large language models (LLMs). We evaluate the LLMs’ reasoning capabilities in two scenarios: one where historical root causes are provided, and one where they are not, to mimic the challenge of previously unseen faults. Experimental results using GPT-4o and o1-preview models demonstrate the system’s strengths in generating plausible and actionable explanations, while also highlighting its limitations, including reliance on PCA-selected features and occasional hallucinations.
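
PCA-based fault detection is commonly built on two monitoring statistics, shown below in their textbook form. The abstract does not state which statistics or thresholds FaultExplainer uses, so this is background under that assumption: $x$ is a scaled measurement vector, $P$ holds the retained loading vectors, and $\Lambda$ is the diagonal matrix of the corresponding eigenvalues.

```latex
T^{2} = x^{\top} P \Lambda^{-1} P^{\top} x,
\qquad
Q = \left\lVert \left( I - P P^{\top} \right) x \right\rVert^{2},
\qquad
c_{j} = \left( x_{j} - \hat{x}_{j} \right)^{2}
\quad \text{with} \quad \hat{x} = P P^{\top} x .
```

A sample is flagged when $T^{2}$ or $Q$ exceeds its control limit, and the per-variable contributions $c_{j}$ to $Q$ give one common way of ranking the top contributing variables.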

[AI-33] Mediation Analysis for Probabilities of Causation

Link: https://arxiv.org/abs/2412.14491
Authors: Yuta Kawakami, Jin Tian
Keywords: offer valuable insights, Probabilities of causation, offer valuable, informed decision-making, valuable insights
Categories: Artificial Intelligence (cs.AI)
Comments:

Abstract:Probabilities of causation (PoC) offer valuable insights for informed decision-making. This paper introduces novel variants of PoC: controlled direct, natural direct, and natural indirect probability of necessity and sufficiency (PNS). These metrics quantify the necessity and sufficiency of a treatment for producing an outcome, accounting for different causal pathways. We develop identification theorems for these new PoC measures, allowing for their estimation from observational data. We demonstrate the practical application of our results through an analysis of a real-world psychology dataset.
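
As background, the standard total-effect PNS for a binary treatment and outcome, together with its well-known bounds and its point identification under monotonicity, is written out below. These are the classical definitions from the causal inference literature, not the controlled direct, natural direct, or natural indirect variants introduced in the paper.

```latex
\mathrm{PNS} = P\left( Y_{x} = y,\; Y_{x'} = y' \right),
\qquad
\max\left\{ 0,\; P(y_{x}) - P(y_{x'}) \right\}
\;\le\; \mathrm{PNS} \;\le\;
\min\left\{ P(y_{x}),\; P(y'_{x'}) \right\},
\qquad
\mathrm{PNS} = P(y_{x}) - P(y_{x'}) \quad \text{(under monotonicity)}.
```

Here $Y_{x}$ denotes the potential outcome under treatment $x$, and $P(y_{x})$ is the corresponding interventional probability.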

[AI-34] Towards Projected and Incremental Pseudo-Boolean Model Counting AAAI25

Link: https://arxiv.org/abs/2412.14485
Authors: Suwei Yang, Kuldeep S. Meel
Keywords: conjunctive normal form, Model counting, CNF model counting, incremental model counting, projected model counting
Categories: Artificial Intelligence (cs.AI); Logic in Computer Science (cs.LO)
Comments: To appear in AAAI25

Abstract:Model counting is a fundamental task that involves determining the number of satisfying assignments to a logical formula, typically in conjunctive normal form (CNF). While CNF model counting has received extensive attention over recent decades, interest in Pseudo-Boolean (PB) model counting is just emerging partly due to the greater flexibility of PB formulas. As such, we observed feature gaps in existing PB counters such as a lack of support for projected and incremental settings, which could hinder adoption. In this work, our main contribution is the introduction of the PB model counter PBCount2, the first exact PB model counter with support for projected and incremental model counting. Our counter, PBCount2, uses our Least Occurrence Weighted Min Degree (LOW-MD) computation ordering heuristic to support projected model counting and a cache mechanism to enable incremental model counting. In our evaluations, PBCount2 completed at least 1.40x the number of benchmarks of competing methods for projected model counting and at least 1.18x of competing methods in incremental model counting.
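
What "model counting" and "projected model counting" mean for PB formulas can be shown with a brute-force sketch. It assumes constraints of the form sum(coef * var) >= bound over 0/1 variables; the function names and the toy constraint are illustrative, and the exponential enumeration here is in no way how PBCount2 works.

```python
from itertools import product


def satisfies(assignment, constraints):
    """Check every constraint sum(coef * var) >= bound under a 0/1 assignment."""
    return all(
        sum(coef * assignment[var] for var, coef in terms.items()) >= bound
        for terms, bound in constraints
    )


def count_models(variables, constraints, projection=None):
    """Count models. With `projection`, count the distinct assignments to
    those variables that can be extended to at least one full model."""
    projection = list(variables) if projection is None else list(projection)
    seen = set()
    for values in product((0, 1), repeat=len(variables)):
        assignment = dict(zip(variables, values))
        if satisfies(assignment, constraints):
            seen.add(tuple(assignment[v] for v in projection))
    return len(seen)


variables = ["x1", "x2", "x3"]
constraints = [({"x1": 2, "x2": 1, "x3": 1}, 2)]  # 2*x1 + x2 + x3 >= 2

total = count_models(variables, constraints)
projected = count_models(variables, constraints, projection=["x1", "x2"])
```

On this toy formula there are 5 models in total, but only 3 distinct assignments to (x1, x2) once x3 is projected away.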

[AI-35] HashAttention: Semantic Sparsity for Faster Inference

Link: https://arxiv.org/abs/2412.14468
Authors: Aditya Desai, Shuo Yang, Alejandro Cuadron, Ana Klimovic, Matei Zaharia, Joseph E. Gonzalez, Ion Stoica
Keywords: Utilizing longer contexts, Utilizing longer, increasingly essential, essential to power, Utilizing
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:Utilizing longer contexts is increasingly essential to power better AI systems. However, the cost of attending to long contexts is high due to the involved softmax computation. While the scaled dot-product attention (SDPA) exhibits token sparsity, with only a few pivotal tokens significantly contributing to attention, leveraging this sparsity effectively remains an open challenge. Previous methods either suffer from model degradation or require considerable additional resources. We propose HashAttention, a principled approach that casts pivotal token identification as a recommendation problem. Given a query, HashAttention encodes keys and queries in Hamming space, capturing the required semantic similarity using learned mapping functions. HashAttention efficiently identifies pivotal tokens for a given query in this Hamming space using bitwise operations, and only these pivotal tokens are used for attention computation, significantly improving overall attention efficiency. HashAttention can reduce the number of tokens used to 1/32 of the original for the Llama-3.1-8B model with LongBench, keeping average quality loss within 0.6 points, while using only 32 bits per token of auxiliary memory. At 32× sparsity, HashAttention is 3-6× faster than LightLLM and 2.5-4.5× faster than gpt-fast on an Nvidia-L4 GPU.
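
The bitwise selection step can be sketched as follows. This is a minimal illustration that assumes fixed sign-of-projection hashing in place of the learned mapping functions described in the abstract; `sign_hash`, `hamming`, `pivotal_tokens` and the toy codes are hypothetical names and data, not the authors' implementation.

```python
def sign_hash(vec, planes):
    """Map a vector to an integer bit code: one bit per hyperplane,
    set when the dot product is non-negative (stand-in for a learned map)."""
    code = 0
    for plane in planes:
        dot = sum(v * p for v, p in zip(vec, plane))
        code = (code << 1) | (1 if dot >= 0 else 0)
    return code


def hamming(a, b):
    """Hamming distance between two bit codes via XOR and a popcount."""
    return bin(a ^ b).count("1")


def pivotal_tokens(query_code, key_codes, k):
    """Indices of the k keys closest to the query in Hamming space."""
    order = sorted(
        range(len(key_codes)),
        key=lambda i: (hamming(query_code, key_codes[i]), i),
    )
    return sorted(order[:k])


key_codes = [0b1011, 0b0000, 0b1010, 0b0100]
selected = pivotal_tokens(0b1011, key_codes, k=2)
```

Only the keys and values at the `selected` positions would then enter the softmax attention computation, which is where the savings come from.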

[AI-36] CLDG: Contrastive Learning on Dynamic Graphs ICDE2023

Link: https://arxiv.org/abs/2412.14451
Authors: Yiming Xu, Bin Shi, Teng Ma, Bo Dong, Haoyi Zhou, Qinghua Zheng
Keywords: potent data type, constantly evolving motivates, data type, complex annotations, potent data
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Accepted by ICDE2023

Abstract:The graph with complex annotations is the most potent data type, and its constant evolution motivates further exploration of unsupervised dynamic graph representation. One of the representative paradigms is graph contrastive learning. It constructs self-supervised signals by maximizing the mutual information between the static graph’s augmentation views. However, the semantics and labels may change within the augmentation process, causing a significant performance drop in downstream tasks. This drawback becomes greatly magnified on dynamic graphs. To address this problem, we design a simple yet effective framework named CLDG. First, we show that dynamic graphs have temporal translation invariance at different levels. Then, we propose a sampling layer to extract the temporally-persistent signals. It encourages each node to maintain consistent local and global representations, i.e., temporal translation invariance under the timespan views. Extensive experiments demonstrate the effectiveness and efficiency of the method on seven datasets, outperforming eight unsupervised state-of-the-art baselines and showing competitiveness against four semi-supervised methods. Compared with the existing dynamic graph method, the number of model parameters and the training time are reduced on average by factors of 2,001.86 and 130.31, respectively, on seven datasets.

[AI-37] Cherry-Picking in Time Series Forecasting: How to Select Datasets to Make Your Model Shine AAAI-25 AAAI

Link: https://arxiv.org/abs/2412.14435
Authors: Luis Roque, Carlos Soares, Vitor Cerqueira, Luis Torgo
Keywords: drives continuous research, time series forecasting, series forecasting drives, forecasting drives continuous, tackle this problem
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Proceedings of the 39th AAAI Conference on Artificial Intelligence (AAAI-25), February 25-March 4, 2025, Philadelphia, Pennsylvania, USA

Abstract:The importance of time series forecasting drives continuous research and the development of new approaches to tackle this problem. Typically, these methods are introduced through empirical studies that frequently claim superior accuracy for the proposed approaches. Nevertheless, concerns are rising about the reliability and generalizability of these results due to limitations in experimental setups. This paper addresses a critical limitation: the number and representativeness of the datasets used. We investigate the impact of dataset selection bias, particularly the practice of cherry-picking datasets, on the performance evaluation of forecasting methods. Through empirical analysis with a diverse set of benchmark datasets, our findings reveal that cherry-picking datasets can significantly distort the perceived performance of methods, often exaggerating their effectiveness. Furthermore, our results demonstrate that by selectively choosing just four datasets - what most studies report - 46% of methods could be deemed best in class, and 77% could rank within the top three. Additionally, recent deep learning-based approaches show high sensitivity to dataset selection, whereas classical methods exhibit greater robustness. Finally, our results indicate that, when empirically validating forecasting algorithms on a subset of the benchmarks, increasing the number of datasets tested from 3 to 6 reduces the risk of incorrectly identifying an algorithm as the best one by approximately 40%. Our study highlights the critical need for comprehensive evaluation frameworks that more accurately reflect real-world scenarios. Adopting such frameworks will ensure the development of robust and reliable forecasting methods.
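
The effect of cherry-picking can be reproduced on a toy table. The sketch below enumerates every subset of datasets of a given size and records which methods would rank first on at least one of them; the error values, method names and the function `methods_that_can_win` are made up for illustration and are unrelated to the paper's benchmark.

```python
from itertools import combinations


def methods_that_can_win(errors, subset_size):
    """Return the methods that rank first (lowest total error) on at least
    one subset of `subset_size` datasets."""
    n_datasets = len(next(iter(errors.values())))
    winners = set()
    for subset in combinations(range(n_datasets), subset_size):
        totals = {m: sum(e[i] for i in subset) for m, e in errors.items()}
        best = min(totals.values())
        winners.update(m for m, t in totals.items() if t == best)
    return winners


# Hypothetical forecasting errors (lower is better) on three datasets.
errors = {
    "m1": [1, 6, 5],
    "m2": [6, 1, 5],
    "m3": [3, 3, 2],
    "m4": [4, 4, 4],
}

winners_one = methods_that_can_win(errors, 1)
winners_two = methods_that_can_win(errors, 2)
```

With a single hand-picked dataset, three of the four toy methods can be presented as the best; with two datasets only `m3` can, which mirrors the abstract's point that testing on more datasets lowers the risk of a misleading winner.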

[AI-38] Multi-task Representation Learning for Mixed Integer Linear Programming

Link: https://arxiv.org/abs/2412.14409
Authors: Junyang Cai, Taoan Huang, Bistra Dilkina
Keywords: Mixed Integer Linear, Integer Linear Programs, Mixed Integer, Linear Programs, Integer Linear
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Optimization and Control (math.OC)
Comments:

Abstract:Mixed Integer Linear Programs (MILPs) are highly flexible and powerful tools for modeling and solving complex real-world combinatorial optimization problems. Recently, machine learning (ML)-guided approaches have demonstrated significant potential in improving MILP-solving efficiency. However, these methods typically rely on separate offline data collection and training processes, which limits their scalability and adaptability. This paper introduces the first multi-task learning framework for ML-guided MILP solving. The proposed framework provides MILP embeddings helpful in guiding MILP solving across solvers (e.g., Gurobi and SCIP) and across tasks (e.g., Branching and Solver configuration). Through extensive experiments on three widely used MILP benchmarks, we demonstrate that our multi-task learning model performs similarly to specialized models within the same distribution. Moreover, it significantly outperforms them in generalization across problem sizes and tasks.

[AI-39] Clinical Trials Ontology Engineering with Large Language Models

Link: https://arxiv.org/abs/2412.14387
Authors: Berkan Çakır
Keywords: Managing clinical trial, Managing clinical, clinical trial information, time-consuming and costly, traditional methods
Categories: Artificial Intelligence (cs.AI)
Comments:

Abstract:Managing clinical trial information is currently a significant challenge for the medical industry, as traditional methods are both time-consuming and costly. This paper proposes a simple yet effective methodology to extract and integrate clinical trial data in a cost-effective and time-efficient manner, allowing the medical industry to stay up-to-date with medical developments. We compare the time, cost, and quality of the ontologies created by humans, GPT3.5, GPT4, and Llama3 (8b and 70b). Findings suggest that large language models (LLMs) are a viable option to automate this process from both a cost and a time perspective. This study underscores significant implications for medical research, where real-time data integration from clinical trials could become the norm.

[AI-40] Balans: Multi-Armed Bandits-based Adaptive Large Neighborhood Search for Mixed-Integer Programming Problem

Link: https://arxiv.org/abs/2412.14382
Authors: Junyang Cai, Serdar Kadioglu, Bistra Dilkina
Keywords: Mixed-Integer Programming, combinatorial optimization problems, powerful paradigm, paradigm for modeling, important combinatorial optimization
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Optimization and Control (math.OC)
Comments:

Abstract:Mixed-Integer Programming (MIP) is a powerful paradigm for modeling and solving various important combinatorial optimization problems. Recently, learning-based approaches have shown potential to speed up MIP solving via offline training that then guides important design decisions during search. However, a significant drawback of these methods is their heavy reliance on offline training, which requires collecting training datasets and computationally costly training epochs yet offering only limited generalization to unseen (larger) instances. In this paper, we propose Balans, an adaptive meta-solver for MIPs with online learning capability that does not require any supervision or apriori training. At its core, Balans is based on adaptive large-neighborhood search, operating on top of a MIP solver by successive applications of destroy and repair neighborhood operators. During the search, the selection among different neighborhood definitions is guided on the fly for the instance at hand via multi-armed bandit algorithms. Our extensive experiments on hard optimization instances show that Balans offers significant performance gains over the default MIP solver, is better than committing to any single best neighborhood, and improves over the state-of-the-art large-neighborhood search for MIPs. Finally, we release Balans as a highly configurable, MIP solver agnostic, open-source software.
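
The online selection among neighborhoods can be pictured with a small bandit sketch. It assumes an epsilon-greedy policy and a made-up reward per neighborhood; the class `EpsilonGreedyBandit`, the arm names and the reward values are hypothetical, and the abstract does not say which bandit algorithms or reward definitions Balans actually uses.

```python
import random


class EpsilonGreedyBandit:
    """Pick an arm, observe a reward, and keep a running mean per arm."""

    def __init__(self, arms, epsilon=0.1, seed=0):
        self.arms = list(arms)
        self.epsilon = epsilon
        self.rng = random.Random(seed)
        self.counts = {a: 0 for a in self.arms}
        self.means = {a: 0.0 for a in self.arms}

    def select(self):
        # Try every arm once before exploiting.
        for arm in self.arms:
            if self.counts[arm] == 0:
                return arm
        if self.rng.random() < self.epsilon:
            return self.rng.choice(self.arms)
        return max(self.arms, key=lambda a: self.means[a])

    def update(self, arm, reward):
        self.counts[arm] += 1
        self.means[arm] += (reward - self.means[arm]) / self.counts[arm]


# Hypothetical improvement obtained by each destroy/repair neighborhood.
toy_rewards = {"neighborhood_A": 0.1, "neighborhood_B": 0.9, "neighborhood_C": 0.3}
bandit = EpsilonGreedyBandit(toy_rewards, epsilon=0.0)
for _ in range(10):
    arm = bandit.select()
    bandit.update(arm, toy_rewards[arm])
```

In a real large-neighborhood search loop the reward would come from the solver, for example from whether the repaired solution improved the incumbent, rather than from a fixed table.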

[AI-41] Python Agent in Ludii

Link: https://arxiv.org/abs/2412.14372
Authors: Izaias S. de Lima Neto(1), Marco A. A. de Aguiar Vieira(1), Anderson R. Tavares(1) ((1) Instituto de Informática, Universidade Federal do Rio Grande do Sul)
Keywords: API for developing, game description language, considerable number, number of board, description language
Categories: Artificial Intelligence (cs.AI)
Comments:

Abstract:Ludii is a Java general game system with a considerable number of board games, with an API for developing new agents and a game description language to create new games. To improve versatility and ease development, we provide Python interfaces for agent programming. This allows the use of Python modules to implement general game playing agents. As a means of enabling Python for creating Ludii agents, the interfaces are implemented using different Java libraries: jpy and Py4J. The main goal of this work is to determine which version is faster. To do so, we conducted a performance analysis of two different GGP algorithms, Minimax adapted to GGP and MCTS. The analysis was performed across several combinatorial games with varying depth, branching factor, and ply time. For reproducibility, we provide tutorials and repositories. Our analysis includes predictive models using regression, which suggest that jpy is faster than Py4J but slower than a native Java Ludii agent, as expected.

[AI-42] Enabling Realtime Reinforcement Learning at Scale with Staggered Asynchronous Inference

Link: https://arxiv.org/abs/2412.14355
Authors: Matthew Riemer, Gopeshh Subbaraj, Glen Berseth, Irina Rish
Keywords: effectively minimize regret, agents perform action, Realtime environments change, agents perform, frequencies to effectively
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:Realtime environments change even as agents perform action inference and learning, thus requiring high interaction frequencies to effectively minimize regret. However, recent advances in machine learning involve larger neural networks with longer inference times, raising questions about their applicability in realtime systems where reaction time is crucial. We present an analysis of lower bounds on regret in realtime reinforcement learning (RL) environments to show that minimizing long-term regret is generally impossible within the typical sequential interaction and learning paradigm, but often becomes possible when sufficient asynchronous compute is available. We propose novel algorithms for staggering asynchronous inference processes to ensure that actions are taken at consistent time intervals, and demonstrate that use of models with high action inference times is only constrained by the environment’s effective stochasticity over the inference horizon, and not by action frequency. Our analysis shows that the number of inference processes needed scales linearly with increasing inference times while enabling use of models that are multiple orders of magnitude larger than existing approaches when learning from a realtime simulation of Game Boy games such as Pokémon and Tetris.
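
The claim that the number of inference processes scales linearly with inference time can be illustrated with a simple round-robin schedule. This is a sketch under the assumption of a fixed inference time and a fixed action interval, both in milliseconds; the function names are hypothetical and the paper's actual staggering algorithms may differ.

```python
def processes_needed(inference_ms, interval_ms):
    """Smallest number of round-robin workers such that a worker has always
    finished its previous inference before its next turn."""
    return -(-inference_ms // interval_ms)  # integer ceiling division


def staggered_schedule(inference_ms, interval_ms, num_actions):
    """Return (worker_id, start_ms, finish_ms) per action, with a new
    inference started every interval_ms on the next worker in turn."""
    workers = processes_needed(inference_ms, interval_ms)
    schedule = []
    for step in range(num_actions):
        start = step * interval_ms
        schedule.append((step % workers, start, start + inference_ms))
    return schedule


schedule = staggered_schedule(inference_ms=500, interval_ms=100, num_actions=6)
```

Each action is based on an observation that is one inference time old, but actions still arrive every 100 ms; doubling the inference time doubles the number of workers required.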

[AI-43] Embedding Cultural Diversity in Prototype-based Recommender Systems

Link: https://arxiv.org/abs/2412.14329
Authors: Armin Moradi, Nicola Neophytou, Florian Carichon, Golnoosh Farnadi
Keywords: increase cultural overrepresentation, marginalizing underrepresented groups, Popularity bias, recommender systems, systems can increase
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
Comments:

Abstract:Popularity bias in recommender systems can increase cultural overrepresentation by favoring norms from dominant cultures and marginalizing underrepresented groups. This issue is critical for platforms offering cultural products, as they influence consumption patterns and human perceptions. In this work, we address popularity bias by identifying demographic biases within prototype-based matrix factorization methods. Using the country of origin as a proxy for cultural identity, we link this demographic attribute to popularity bias by refining the embedding space learning process. First, we propose filtering out irrelevant prototypes to improve representativity. Second, we introduce a regularization technique to enforce a uniform distribution of prototypes within the embedding space. Across four datasets, our results demonstrate a 27% reduction in the average rank of long-tail items and a 2% reduction in the average rank of items from underrepresented countries. Additionally, our model achieves a 2% improvement in HitRatio@10 compared to the state-of-the-art, highlighting that fairness is enhanced without compromising recommendation quality. Moreover, the distribution of prototypes leads to more inclusive explanations by better aligning items with diverse prototypes.

[AI-44] SAFERec: Self-Attention and Frequency Enriched Model for Next Basket Recommendation

Link: https://arxiv.org/abs/2412.14302
Authors: Oleg Lashinin,Denis Krasilnikov,Aleksandr Milogradskii,Marina Ananyeva
Keywords: demonstrate strong performance, NBR tasks, SASRec demonstrate strong, Item Recommendation, tasks
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)

Abstract:Transformer-based approaches such as BERT4Rec and SASRec demonstrate strong performance in Next Item Recommendation (NIR) tasks. However, applying these architectures to Next-Basket Recommendation (NBR) tasks, which often involve highly repetitive interactions, is challenging due to the vast number of possible item combinations in a basket. Moreover, frequency-based methods such as TIFU-KNN and UP-CF still demonstrate strong performance in NBR tasks, frequently outperforming deep-learning approaches. This paper introduces SAFERec, a novel algorithm for NBR that enhances transformer-based architectures from NIR by incorporating item frequency information, consequently improving their applicability to NBR tasks. Extensive experiments on multiple datasets show that SAFERec outperforms all other baselines, specifically achieving an 8% improvement in Recall@10.
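
To make the role of item frequency concrete, the sketch below implements the simplest possible frequency-based recommender (rank items by how many past baskets contain them) together with Recall@k. This is neither SAFERec nor TIFU-KNN; the helper names and the toy purchase history are assumptions used only to show why repeat-heavy data favours frequency signals.

```python
from collections import Counter


def top_k_by_frequency(baskets, k):
    """Rank items by the number of past baskets containing them (ties broken by name)."""
    counts = Counter(item for basket in baskets for item in set(basket))
    ranked = sorted(counts, key=lambda item: (-counts[item], item))
    return ranked[:k]


def recall_at_k(recommended, next_basket, k):
    """Share of the next basket's distinct items that appear in the top-k recommendations."""
    targets = set(next_basket)
    if not targets:
        raise ValueError("next_basket must not be empty")
    return len(set(recommended[:k]) & targets) / len(targets)


if __name__ == "__main__":
    history = [["milk", "bread"], ["milk", "eggs"], ["milk", "bread", "tea"]]
    recs = top_k_by_frequency(history, k=2)
    print(recs, recall_at_k(recs, ["milk", "tea"], k=2))
```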

[AI-45] Syzygy: Dual Code-Test C to (safe) Rust Translation using LLMs and Dynamic Analysis

Link: https://arxiv.org/abs/2412.14234
Authors: Manish Shetty,Naman Jain,Adwait Godbole,Sanjit A. Seshia,Koushik Sen
Keywords: unsafe pointer operations, low-level systems programming, systems programming applications, manual memory management, systems programming language
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Programming Languages (cs.PL)
Comments: Project Webpage: this https URL. Preliminary version accepted at LLM4Code 2025, 34 pages

Abstract:Despite extensive usage in high-performance, low-level systems programming applications, C is susceptible to vulnerabilities due to manual memory management and unsafe pointer operations. Rust, a modern systems programming language, offers a compelling alternative. Its unique ownership model and type system ensure memory safety without sacrificing performance. In this paper, we present Syzygy, an automated approach to translate C to safe Rust. Our technique uses a synergistic combination of LLM-driven code and test translation guided by dynamic-analysis-generated execution information. This paired translation runs incrementally in a loop over the program in dependency order of the code elements while maintaining per-step correctness. Our approach exposes novel insights on combining the strengths of LLMs and dynamic analysis in the context of scaling and combining code generation with testing. We apply our approach to successfully translate Zopfli, a high-performance compression library with ~3000 lines of code and 98 functions. We validate the translation by testing equivalence with the source C program on a set of inputs. To our knowledge, this is the largest automated and test-validated C to safe Rust code translation achieved so far.

[AI-46] A Survey on Inference Optimization Techniques for Mixture of Experts Models

Link: https://arxiv.org/abs/2412.14219
Authors: Jiacheng Liu,Peng Tang,Wenfeng Wang,Yuhang Ren,Xiaofeng Hou,Pheng-Ann Heng,Minyi Guo,Chao Li
Keywords: offering enhanced model, enhanced model capacity, artificial intelligence, offering enhanced, conditional computation
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC)
Comments: Work in Progress

Abstract:The emergence of large-scale Mixture of Experts (MoE) models has marked a significant advancement in artificial intelligence, offering enhanced model capacity and computational efficiency through conditional computation. However, the deployment and inference of these models present substantial challenges in terms of computational resources, latency, and energy efficiency. This comprehensive survey systematically analyzes the current landscape of inference optimization techniques for MoE models across the entire system stack. We first establish a taxonomical framework that categorizes optimization approaches into model-level, system-level, and hardware-level optimizations. At the model level, we examine architectural innovations including efficient expert design, attention mechanisms, various compression techniques such as pruning, quantization, and knowledge distillation, as well as algorithm improvement including dynamic routing strategies and expert merging methods. At the system level, we investigate distributed computing approaches, load balancing mechanisms, and efficient scheduling algorithms that enable scalable deployment. Furthermore, we delve into hardware-specific optimizations and co-design strategies that maximize throughput and energy efficiency. This survey not only provides a structured overview of existing solutions but also identifies key challenges and promising research directions in MoE inference optimization. Our comprehensive analysis serves as a valuable resource for researchers and practitioners working on large-scale deployment of MoE models in resource-constrained environments. To facilitate ongoing updates and the sharing of cutting-edge advances in MoE inference optimization research, we have established a repository accessible at this https URL.

[AI-47] Heterogeneous Multi-Agent Reinforcement Learning for Distributed Channel Access in WLANs

Link: https://arxiv.org/abs/2412.14218
Authors: Jiaming Yu,Le Liang,Chongtao Guo,Ziyang Guo,Shi Jin,Geoffrey Ye Li
Keywords: wireless local area, multi-agent reinforcement learning, address distributed channel, local area networks, policy-based reinforcement learning
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Networking and Internet Architecture (cs.NI)

Abstract:This paper investigates the use of multi-agent reinforcement learning (MARL) to address distributed channel access in wireless local area networks. In particular, we consider the challenging yet more practical case where the agents heterogeneously adopt value-based or policy-based reinforcement learning algorithms to train the model. We propose a heterogeneous MARL training framework, named QPMIX, which adopts a centralized training with distributed execution paradigm to enable heterogeneous agents to collaborate. Moreover, we theoretically prove the convergence of the proposed heterogeneous MARL method when using the linear value function approximation. Our method maximizes the network throughput and ensures fairness among stations, therefore, enhancing the overall network performance. Simulation results demonstrate that the proposed QPMIX algorithm improves throughput, mean delay, delay jitter, and collision rates compared with conventional carrier-sense multiple access with collision avoidance in the saturated traffic scenario. Furthermore, the QPMIX is shown to be robust in unsaturated and delay-sensitive traffic scenarios, and promotes cooperation among heterogeneous agents.

[AI-48] Generative AI Toolkit – a framework for increasing the quality of LLM-based applications over their whole life cycle

Link: https://arxiv.org/abs/2412.14215
Authors: Jens Kohl,Luisa Gloger,Rui Costa,Otto Kruse,Manuel P. Luitz,David Katz,Gonzalo Barbeito,Markus Schweier,Ryan French,Jonas Schroeder,Thomas Riedl,Raphael Perri,Youssef Mostafa
Keywords: applications reach millions, continuous quality improvement, LLM-based applications reach, millions of customers, ensuring their scalability
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments: 16 pages, 6 figures. For source code see this https URL

Abstract:As LLM-based applications reach millions of customers, ensuring their scalability and continuous quality improvement is critical for success. However, the current workflows for developing, maintaining, and operating (DevOps) these applications are predominantly manual, slow, and based on trial-and-error. With this paper we introduce the Generative AI Toolkit, which automates essential workflows over the whole life cycle of LLM-based applications. The toolkit helps to configure, test, continuously monitor and optimize Generative AI applications such as agents, thus significantly improving quality while shortening release cycles. We showcase the effectiveness of our toolkit on representative use cases, share best practices, and outline future enhancements. Since we are convinced that our Generative AI Toolkit is helpful for other teams, we are open sourcing it on and hope that others will use, forward, adapt and improve

[AI-49] Tree-of-Code: A Hybrid Approach for Robust Complex Task Planning and Execution NEURIPS

Link: https://arxiv.org/abs/2412.14212
Authors: Ziyi Ni,Yifan Li,Daxiang Dong
Keywords: large language models, language models, exceptional capabilities, capabilities of large, large language
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments: Submitted to the NeurIPS Workshop “System 2 Reasoning” in September, 2024. The openreview is available at this https URL

Abstract:The exceptional capabilities of large language models (LLMs) have substantially accelerated the rapid rise and widespread adoption of agents. Recent studies have demonstrated that generating Python code to consolidate LLM-based agents’ actions into a unified action space (CodeAct) is a promising approach for developing real-world LLM agents. However, this step-by-step code generation approach often lacks consistency and robustness, leading to instability in agent applications, particularly for complex reasoning and out-of-domain tasks. In this paper, we propose a novel approach called Tree-of-Code (ToC) to tackle the challenges of complex problem planning and execution with an end-to-end mechanism. By integrating key ideas from both Tree-of-Thought and CodeAct, ToC combines their strengths to enhance solution exploration. In our framework, each final code execution result is treated as a node in the decision tree, with a breadth-first search strategy employed to explore potential solutions. The final outcome is determined through a voting mechanism based on the outputs of the nodes.
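
The two mechanisms named at the end of this abstract, breadth-first exploration of nodes and a vote over node outputs, can be sketched generically. The `expand` and `run` callables below are hypothetical stand-ins (in ToC they would presumably involve LLM code generation and code execution, which the abstract does not detail), so this shows only the control flow, not the authors' implementation.

```python
from collections import Counter, deque


def bfs_collect(root, expand, run, max_depth):
    """Visit nodes breadth-first down to max_depth and collect run(node) for each one.

    expand(node) -> iterable of child nodes (hypothetical: e.g. alternative programs)
    run(node)    -> execution result of that node, or None on failure (hypothetical)
    """
    results = []
    queue = deque([(root, 0)])
    while queue:
        node, depth = queue.popleft()
        results.append(run(node))
        if depth < max_depth:
            for child in expand(node):
                queue.append((child, depth + 1))
    return results


def majority_vote(outputs):
    """Return the most common non-None output, or None if every node failed."""
    counts = Counter(output for output in outputs if output is not None)
    if not counts:
        return None
    return counts.most_common(1)[0][0]


if __name__ == "__main__":
    # Toy tree over integers: node n has children 2n and 2n+1; "execution" is n % 3.
    outputs = bfs_collect(1, lambda n: [2 * n, 2 * n + 1], lambda n: n % 3, max_depth=2)
    print(outputs, majority_vote(outputs))
```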

[AI-50] Integrating Evidence into the Design of XAI and AI-based Decision Support Systems: A Means-End Framework for End-users in Construction

Link: https://arxiv.org/abs/2412.14209
Authors: Peter E.D. Love,Jane Matthews,Weili Fang,Hadi Mahamivanan
Keywords: decision support systems, decision support, theoretical evidence-based means-end, support systems, explainable artificial intelligence
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
Comments: 60 pages, 4 figures and 1 table

Abstract:A narrative review is used to develop a theoretical evidence-based means-end framework to build an epistemic foundation to uphold explainable artificial intelligence instruments so that the reliability of outcomes generated from decision support systems can be assured and better explained to end-users. The implications of adopting an evidence-based approach to designing decision support systems in construction are discussed with emphasis placed on evaluating the strength, value, and utility of evidence needed to develop meaningful human explanations for end-users. While the developed means-end framework is focused on end-users, stakeholders can also utilize it to create meaningful human explanations. However, they will vary due to their different epistemic goals. Including evidence in the design and development of explainable artificial intelligence and decision support systems will improve decision-making effectiveness, enabling end-users’ epistemic goals to be achieved. The proposed means-end framework is developed from a broad spectrum of literature. Thus, it is suggested that it can be used in construction and other engineering domains where there is a need to integrate evidence into the design of explainable artificial intelligence and decision support systems.

[AI-51] Large-scale Group Brainstorming using Conversational Swarm Intelligence (CSI) versus Traditional Chat

Link: https://arxiv.org/abs/2412.14205
Authors: Louis Rosenberg,Hans Schumann,Christopher Dishop,Gregg Willcox,Anita Woolley,Ganesh Mani
Keywords: Conversational Swarm Intelligence, potentially unlimited size, Swarm Intelligence, real-time conversational deliberations, enabling real-time conversational
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Social and Information Networks (cs.SI)

Abstract:Conversational Swarm Intelligence (CSI) is an AI-facilitated method for enabling real-time conversational deliberations and prioritizations among networked human groups of potentially unlimited size. Based on the biological principle of Swarm Intelligence and modelled on the decision-making dynamics of fish schools, CSI has been shown in prior studies to amplify group intelligence, increase group participation, and facilitate productive collaboration among hundreds of participants at once. It works by dividing a large population into a set of small subgroups that are woven together by real-time AI agents called Conversational Surrogates. The present study focuses on the use of a CSI platform called Thinkscape to enable real-time brainstorming and prioritization among groups of 75 networked users. The study employed a variant of a common brainstorming intervention called an Alternative Use Task (AUT) and was designed to compare through subjective feedback, the experience of participants brainstorming using a CSI structure vs brainstorming in a single large chat room. This comparison revealed that participants significantly preferred brainstorming with the CSI structure and reported that it felt (i) more collaborative, (ii) more productive, and (iii) was better at surfacing quality answers. In addition, participants using the CSI structure reported (iv) feeling more ownership and more buy-in in the final answers the group converged on and (v) reported feeling more heard as compared to brainstorming in a traditional text chat environment. Overall, the results suggest that CSI is a very promising AI-facilitated method for brainstorming and prioritization among large-scale, networked human groups.

[AI-52] BlenderLLM: Training Large Language Models for Computer-Aided Design with Self-improvement

Link: https://arxiv.org/abs/2412.14203
Authors: Yuhao Du,Shunian Chen,Wenbo Zan,Peizhao Li,Mingxuan Wang,Dingjie Song,Bo Li,Yan Hu,Benyou Wang
Keywords: Large Language Models, Large Language, Computer-Aided Design, application of Large, Language Models
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)

Abstract:The application of Large Language Models (LLMs) in Computer-Aided Design (CAD) remains an underexplored area, despite their remarkable advancements in other domains. In this paper, we present BlenderLLM, a novel framework for training LLMs specifically for CAD tasks leveraging a self-improvement methodology. To support this, we developed a bespoke training dataset, BlendNet, and introduced a comprehensive evaluation suite, CADBench. Our results reveal that existing models demonstrate significant limitations in generating accurate CAD scripts. However, through minimal instruction-based fine-tuning and iterative self-improvement, BlenderLLM significantly surpasses these models in both functionality and accuracy of CAD script generation. This research establishes a strong foundation for the application of LLMs in CAD while demonstrating the transformative potential of self-improving models in advancing CAD automation. We encourage further exploration and adoption of these methodologies to drive innovation in the field. The dataset, model, benchmark, and source code are publicly available at this https URL

[AI-53] Detecting Cognitive Impairment and Psychological Well-being among Older Adults Using Facial Acoustic Linguistic and Cardiovascular Patterns Derived from Remote Conversations

Link: https://arxiv.org/abs/2412.14194
Authors: Xiaofan Mu,Salman Seyedi,Iris Zheng,Zifan Jiang,Liu Chen,Bolaji Omofojoye,Rachel Hershenberg,Allan I. Levey,Gari D. Clifford,Hiroko H. Dodge,Hyeokhyen Kwon
Keywords: requires scalable methods, aging society urgently, society urgently requires, urgently requires scalable, psychological factors indicative
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)

Abstract:INTRODUCTION: The aging society urgently requires scalable methods to monitor cognitive decline and identify social and psychological factors indicative of dementia risk in older adults. METHODS: Our machine learning models captured facial, acoustic, linguistic, and cardiovascular features from 39 individuals with normal cognition or Mild Cognitive Impairment derived from remote video conversations and classified cognitive status, social isolation, neuroticism, and psychological well-being. RESULTS: Our model could distinguish Clinical Dementia Rating Scale of 0.5 (vs. 0) with 0.78 area under the receiver operating characteristic curve (AUC), social isolation with 0.75 AUC, neuroticism with 0.71 AUC, and negative affect scales with 0.79 AUC. DISCUSSION: Our findings demonstrate the feasibility of remotely monitoring cognitive status, social isolation, neuroticism, and psychological well-being. Speech and language patterns were more useful for quantifying cognitive impairment, whereas facial expression and cardiovascular patterns using remote photoplethysmography were more useful for quantifying personality and psychological well-being.
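
The AUC figures reported above have a direct probabilistic reading: the chance that a randomly chosen positive case receives a higher score than a randomly chosen negative one. A minimal sketch of that pairwise definition follows; the labels and scores are toy values, not data from the study.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise comparison (ties count as half a win)."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


if __name__ == "__main__":
    print(roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))
```

An AUC of 0.5 corresponds to chance-level ranking and 1.0 to perfect separation, which puts values such as 0.78 in context.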

[AI-54] Whom do Explanations Serve? A Systematic Literature Survey of User Characteristics in Explainable Recommender Systems Evaluation

Link: https://arxiv.org/abs/2412.14193
Authors: Kathrin Wardatzky,Oana Inel,Luca Rossetto,Abraham Bernstein
Keywords: increasing user trust, Adding explanations, recommender systems, recommender systems explanations, multiple benefits
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
Comments: 31 pages, 2 figures. Submitted to ACM Transactions of Recommender Systems

Abstract:Adding explanations to recommender systems is said to have multiple benefits, such as increasing user trust or system transparency. Previous work from other application areas suggests that specific user characteristics impact the users’ perception of the explanation. However, we rarely find this type of evaluation for recommender systems explanations. This paper addresses this gap by surveying 124 papers in which recommender systems explanations were evaluated in user studies. We analyzed their participant descriptions and study results where the impact of user characteristics on the explanation effects was measured. Our findings suggest that the results from the surveyed studies predominantly cover specific users who do not necessarily represent the users of recommender systems in the evaluation domain. This may seriously hamper the generalizability of any insights we may gain from current studies on explanations in recommender systems. We further find inconsistencies in the data reporting, which impacts the reproducibility of the reported results. Hence, we recommend actions to move toward a more inclusive and reproducible evaluation.

[AI-55] Ontology-Aware RAG for Improved Question-Answering in Cybersecurity Education

Link: https://arxiv.org/abs/2412.14191
Authors: Chengshuai Zhao,Garima Agrawal,Tharindu Kumarage,Zhen Tan,Yuli Deng,Ying-Chih Chen,Huan Liu
Keywords: transform the teaching, teaching of science, science and technology, cybersecurity, knowledge
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)

Abstract:Integrating AI into education has the potential to transform the teaching of science and technology courses, particularly in the field of cybersecurity. AI-driven question-answering (QA) systems can actively manage uncertainty in cybersecurity problem-solving, offering interactive, inquiry-based learning experiences. Large language models (LLMs) have gained prominence in AI-driven QA systems, offering advanced language understanding and user engagement. However, they face challenges like hallucinations and limited domain-specific knowledge, which reduce their reliability in educational settings. To address these challenges, we propose CyberRAG, an ontology-aware retrieval-augmented generation (RAG) approach for developing a reliable and safe QA system in cybersecurity education. CyberRAG employs a two-step approach: first, it augments the domain-specific knowledge by retrieving validated cybersecurity documents from a knowledge base to enhance the relevance and accuracy of the response. Second, it mitigates hallucinations and misuse by integrating a knowledge graph ontology to validate the final answer. Experiments on publicly available cybersecurity datasets show that CyberRAG delivers accurate, reliable responses aligned with domain knowledge, demonstrating the potential of AI tools to enhance education.

[AI-56] Lessons From an App Update at Replika AI: Identity Discontinuity in Human-AI Relationships

Link: https://arxiv.org/abs/2412.14190
Authors: Julian De Freitas,Noah Castelo,Ahmet Uguralp,Zeliha Uguralp
Keywords: deep emotional bonds, identities over time, form especially deep, deep emotional, emotional bonds
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)

Abstract:Can consumers form especially deep emotional bonds with AI and be vested in AI identities over time? We leverage a natural app-update event at Replika AI, a popular US-based AI companion, to shed light on these questions. We find that, after the app removed its erotic role play (ERP) feature, preventing intimate interactions between consumers and chatbots that were previously possible, this event triggered perceptions in customers that their AI companion’s identity had discontinued. This in turn predicted negative consumer welfare and marketing outcomes related to loss, including mourning the loss, and devaluing the “new” AI relative to the “original”. Experimental evidence confirms these findings. Further experiments find that AI companion users feel closer to their AI companion than even their best human friend, and mourn a loss of their AI companion more than a loss of various other inanimate products. In short, consumers are forming human-level relationships with AI companions; disruptions to these relationships trigger real patterns of mourning as well as devaluation of the offering; and the degree of mourning and devaluation are explained by perceived discontinuity in the AI’s identity. Our results illustrate that relationships with AI are truly personal, creating unique benefits and risks for consumers and firms alike.

[AI-57] CogSimulator: A Model for Simulating User Cognition Behavior with Minimal Data for Tailored Cognitive Enhancement

Link: https://arxiv.org/abs/2412.14188
Authors: Weizhen Bian,Yubo Zhou,Yuanhang Luo,Ming Mo,Siyan Liu,Yikai Gong,Renjie Wan,Ziyuan Luo,Aobo Wang
Keywords: garnered significant attention, enhancing cognitive skills, educational games enhancing, recent years, garnered significant
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Neurons and Cognition (q-bio.NC)

Abstract:The interplay between cognition and gaming, notably through educational games enhancing cognitive skills, has garnered significant attention in recent years. This research introduces the CogSimulator, a novel algorithm for simulating user cognition in small-group settings with minimal data, as the educational game Wordle exemplifies. The CogSimulator employs Wasserstein-1 distance and coordinates search optimization for hyperparameter tuning, enabling precise few-shot predictions in new game scenarios. Comparative experiments with the Wordle dataset illustrate that our model surpasses most conventional machine learning models in mean Wasserstein-1 distance, mean squared error, and mean accuracy, showcasing its efficacy in cognitive enhancement through tailored game design.
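
As background for the Wasserstein-1 metric mentioned above: for two histograms over the same evenly spaced one-dimensional support (for instance, the share of players solving a puzzle in 1, 2, 3, ... guesses), the distance equals the sum of absolute differences between the two cumulative distributions. The sketch below assumes unit spacing between bins and uses made-up histograms; it is not the CogSimulator code.

```python
from itertools import accumulate


def wasserstein_1(p, q):
    """Wasserstein-1 distance between two histograms on the same unit-spaced support."""
    if len(p) != len(q) or not p:
        raise ValueError("histograms must be non-empty and of equal length")
    total_p, total_q = sum(p), sum(q)
    if total_p <= 0 or total_q <= 0:
        raise ValueError("histograms must have positive mass")
    # Normalise to probability distributions, then compare cumulative sums bin by bin.
    cdf_p = accumulate(x / total_p for x in p)
    cdf_q = accumulate(x / total_q for x in q)
    return sum(abs(a - b) for a, b in zip(cdf_p, cdf_q))


if __name__ == "__main__":
    predicted = [0, 5, 20, 35, 25, 15]   # hypothetical guess-count histogram
    observed = [1, 6, 22, 33, 24, 14]    # hypothetical guess-count histogram
    print(wasserstein_1(predicted, observed))
```

Unlike mean squared error on the bins, this distance accounts for how far probability mass has to move along the guess-count axis.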

[AI-58] Benchmarking Harmonized Tariff Schedule Classification Models

Link: https://arxiv.org/abs/2412.14179
Authors: Bryce Judy
Keywords: Harmonized Tariff System, Tariff System, Harmonized Tariff, lacks standardized benchmarks, prominent HTS classification
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)

Abstract:The Harmonized Tariff System (HTS) classification industry, essential to e-commerce and international trade, currently lacks standardized benchmarks for evaluating the effectiveness of classification solutions. This study establishes and tests a benchmark framework for imports to the United States, inspired by the benchmarking approaches used in language model evaluation, to systematically compare prominent HTS classification tools. The framework assesses key metrics–such as speed, accuracy, rationality, and HTS code alignment–to provide a comprehensive performance comparison. The study evaluates several industry-leading solutions, including those provided by Zonos, Tarifflo, Avalara, and WCO BACUDA, identifying each tool’s strengths and limitations. Results highlight areas for industry-wide improvement and innovation, paving the way for more effective and standardized HTS classification solutions across the international trade and e-commerce sectors.

[AI-59] Advanced Reasoning and Transformation Engine for Multi-Step Insight Synthesis in Data Analytics with Large Language Models

Link: https://arxiv.org/abs/2412.14146
Authors: Atin Sakkeer Hussain
Keywords: Large Language Models, augment Large Language, Language Models, Large Language, Multi-Step Insight Synthesis
Categories: Artificial Intelligence (cs.AI); Databases (cs.DB); Information Retrieval (cs.IR); Multiagent Systems (cs.MA)

Abstract:This paper presents the Advanced Reasoning and Transformation Engine for Multi-Step Insight Synthesis in Data Analytics (ARTEMIS-DA), a novel framework designed to augment Large Language Models (LLMs) for solving complex, multi-step data analytics tasks. ARTEMIS-DA integrates three core components: the Planner, which dissects complex user queries into structured, sequential instructions encompassing data preprocessing, transformation, predictive modeling, and visualization; the Coder, which dynamically generates and executes Python code to implement these instructions; and the Grapher, which interprets generated visualizations to derive actionable insights. By orchestrating the collaboration between these components, ARTEMIS-DA effectively manages sophisticated analytical workflows involving advanced reasoning, multi-step transformations, and synthesis across diverse data modalities. The framework achieves state-of-the-art (SOTA) performance on benchmarks such as WikiTableQuestions and TabFact, demonstrating its ability to tackle intricate analytical tasks with precision and adaptability. By combining the reasoning capabilities of LLMs with automated code generation and execution and visual analysis, ARTEMIS-DA offers a robust, scalable solution for multi-step insight synthesis, addressing a wide range of challenges in data analytics.

[AI-60] Goal Space Abstraction in Hierarchical Reinforcement Learning via Set-Based Reachability Analysis

Link: https://arxiv.org/abs/2309.07675
Authors: Mehdi Zadem,Sergio Mover,Sao Mai Nguyen
Keywords: Open-ended learning benefits, learning benefits immensely, Open-ended learning, goal representation, Hierarchical Reinforcement Learning
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:Open-ended learning benefits immensely from the use of symbolic methods for goal representation as they offer ways to structure knowledge for efficient and transferable learning. However, the existing Hierarchical Reinforcement Learning (HRL) approaches relying on symbolic reasoning are often limited as they require a manual goal representation. The challenge in autonomously discovering a symbolic goal representation is that it must preserve critical information, such as the environment dynamics. In this paper, we propose a developmental mechanism for goal discovery via an emergent representation that abstracts (i.e., groups together) sets of environment states that have similar roles in the task. We introduce a Feudal HRL algorithm that concurrently learns both the goal representation and a hierarchical policy. The algorithm uses symbolic reachability analysis for neural networks to approximate the transition relation among sets of states and to refine the goal representation. We evaluate our approach on complex navigation tasks, showing the learned representation is interpretable, transferrable and results in data efficient learning.

[AI-61] Exploiting sparse structures and synergy designs to advance situational awareness of electrical power grid

Link: https://arxiv.org/abs/2412.15105
Authors: Shimiao Li
Keywords: cyberattacks on power, power grids, grids are driving, driving a critical, operators to form
Categories: Signal Processing (eess.SP); Artificial Intelligence (cs.AI)
Comments: PhD thesis

Abstract:The growing threats of uncertainties, anomalies, and cyberattacks on power grids are driving a critical need to advance situational awareness which allows system operators to form a complete and accurate picture of the present and future state. Simulation and estimation are foundational tools in this process. However, existing tools lack the robustness and efficiency required to achieve the level of situational awareness needed for the ever-evolving threat landscape. Industry-standard (steady-state) simulators are not robust to blackouts, often leading to non-converging or non-actionable results. Estimation tools lack robustness to anomalous data, returning erroneous system states. Efficiency is the other major concern as nonlinearities and scalability issues make large systems slow to converge. This thesis addresses robustness and efficiency gaps through a dual-fold contribution. We first address the inherent limitations in the existing physics-based and data-driven worlds; and then transcend the boundaries of conventional algorithmic design in the direction of a new paradigm – Physics-ML Synergy – which integrates the strengths of the two worlds. Our approaches are built on circuit formulation which provides a unified framework that applies to both transmission and distribution. Sparse optimization acts as the key enabler to make these tools intrinsically robust and immune to random threats, pinpointing dominant sources of (random) blackouts and data errors. Further, we explore sparsity-exploiting optimizations to develop lightweight ML models whose prediction and detection capabilities are a complement to physics-based tools; and whose lightweight designs advance generalization and scalability. Finally, Physics-ML Synergy brings robustness and efficiency further against targeted cyberthreats, by interconnecting our physics-based tools with lightweight ML. 

[AI-62] Energy and polarization based on-line interference mitigation in radio interferometry

Link: https://arxiv.org/abs/2412.14775
Authors: Sarod Yatawatta,Albert-Jan Boonstra,Chris P. Broekema
Keywords: Radio frequency interference, terrestrial radio astronomy, on-line RFI mitigation, RFI mitigation scheme, RFI mitigation
Categories: Instrumentation and Methods for Astrophysics (astro-ph.IM); Artificial Intelligence (cs.AI)

Abstract:Radio frequency interference (RFI) is a persistent contaminant in terrestrial radio astronomy. While new radio interferometers are becoming operational, novel sources of RFI are also emerging. In order to strengthen the mitigation of RFI in modern radio interferometers, we propose an on-line RFI mitigation scheme that can be run in the correlator of such interferometers. We combine statistics based on the energy as well as the polarization alignment of the correlated signal to develop an on-line RFI mitigation scheme that can be applied to a data stream produced by the correlator in real-time, especially targeted at low duty-cycle or transient RFI detection. In order to improve the computational efficiency, we explore the use of both single precision and half precision floating point operations in implementing the RFI mitigation algorithm. This ideally suits its deployment in accelerator computing devices such as graphics processing units (GPUs) as used by the LOFAR correlator. We provide results based on real data to demonstrate the efficacy of the proposed method.

[AI-63] Stochastic first-order methods with multi-extrapolated momentum for highly smooth unconstrained optimization

Link: https://arxiv.org/abs/2412.14488
Authors: Chuan He
Keywords: stochastic optimization problem, unconstrained stochastic optimization, objective function exhibits, objective function, exhibits a high
Categories: Optimization and Control (math.OC); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:In this paper we consider an unconstrained stochastic optimization problem where the objective function exhibits a high order of smoothness. In particular, we propose a stochastic first-order method (SFOM) with multi-extrapolated momentum, in which multiple extrapolations are performed in each iteration, followed by a momentum step based on these extrapolations. We show that our proposed SFOM with multi-extrapolated momentum can accelerate optimization by exploiting the high-order smoothness of the objective function $f$. Specifically, assuming that the gradient and the $p$th-order derivative of $f$ are Lipschitz continuous for some $p\ge2$, and under some additional mild assumptions, we establish that our method achieves a sample complexity of $\widetilde{\mathcal{O}}(\epsilon^{-(3p+1)/p})$ for finding a point $x$ satisfying $\mathbb{E}[\|\nabla f(x)\|]\le\epsilon$. To the best of our knowledge, our method is the first SFOM to leverage arbitrary order smoothness of the objective function for acceleration, resulting in a sample complexity that strictly improves upon the best-known results without assuming the average smoothness condition. Finally, preliminary numerical experiments validate the practical performance of our method and corroborate our theoretical findings.
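
To make the stated rate concrete, the exponent in the bound can be evaluated for a few values of $p$. This is plain arithmetic on the expression quoted in the abstract, not an additional result from the paper:

```latex
\[
\frac{3p+1}{p} = 3 + \frac{1}{p}:
\qquad p = 2 \;\Rightarrow\; \widetilde{\mathcal{O}}(\epsilon^{-3.5}),
\qquad p = 4 \;\Rightarrow\; \widetilde{\mathcal{O}}(\epsilon^{-3.25}),
\qquad p \to \infty \;\Rightarrow\; \text{exponent} \to 3 .
\]
```

The exponent decreases monotonically in $p$, so a higher assumed order of smoothness gives a smaller bound.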

Machine Learning

[LG-0] Data for Mathematical Copilots: Better Ways of Presenting Proofs for Machine Learning

Link: https://arxiv.org/abs/2412.15184
Authors: Simon Frieder, Jonas Bayer, Katherine M. Collins, Julius Berner, Jacob Loader, András Juhász, Fabian Ruehle, Sean Welleck, Gabriel Poesia, Ryan-Rhys Griffiths, Adrian Weller, Anirudh Goyal, Thomas Lukasiewicz, Timothy Gowers
Keywords: large language models, primarily large language, mathematical, exhibit several shortcomings, AI-based mathematical copilots
Categories: Machine Learning (cs.LG)
Comments: 40 pages

Abstract:The suite of datasets commonly used to train and evaluate the mathematical capabilities of AI-based mathematical copilots (primarily large language models) exhibits several shortcomings. These limitations include a restricted scope of mathematical complexity, typically not exceeding lower undergraduate-level mathematics, binary rating protocols and other issues, which make comprehensive proof-based evaluation suites difficult to construct. We systematically explore these limitations and contend that enhancing the capabilities of large language models, or any forthcoming advancements in AI-based mathematical assistants (copilots or “thought partners”), necessitates a paradigm shift in the design of mathematical datasets and the evaluation criteria of mathematical ability: It is necessary to move away from result-based datasets (theorem statement to theorem proof) and convert the rich facets of mathematical research practice to data LLMs can train on. Examples of these are mathematical workflows (sequences of atomic, potentially subfield-dependent tasks that are often performed when creating new mathematics), which are an important part of the proof-discovery process. Additionally, we advocate for mathematical dataset developers to consider the concept of “motivated proof”, introduced by G. Pólya in 1949, which can serve as a blueprint for datasets that offer a better proof learning signal, alleviating some of the mentioned limitations. Lastly, we introduce math datasheets for datasets, extending the general, dataset-agnostic variants of datasheets: We provide a questionnaire designed specifically for math datasets that we urge dataset creators to include with their datasets. This will make creators aware of potential limitations of their datasets while at the same time making it easy for readers to assess them from the point of view of training and evaluating mathematical copilots.

[LG-1] STRAP: Robot Sub-Trajectory Retrieval for Augmented Policy Learning

Link: https://arxiv.org/abs/2412.15182
Authors: Marius Memmel, Jacob Berg, Bingqing Chen, Abhishek Gupta, Jonathan Francis
Keywords: natural language processing, Robot learning, robot learning methods, mirroring trends, witnessing a significant
Categories: Robotics (cs.RO); Machine Learning (cs.LG); Systems and Control (eess.SY)
Comments: Project website at this https URL

Abstract:Robot learning is witnessing a significant increase in the size, diversity, and complexity of pre-collected datasets, mirroring trends in domains such as natural language processing and computer vision. Many robot learning methods treat such datasets as multi-task expert data and learn a multi-task, generalist policy by training broadly across them. Notably, while these generalist policies can improve the average performance across many tasks, the performance of generalist policies on any one task is often suboptimal due to negative transfer between partitions of the data, compared to task-specific specialist policies. In this work, we argue for the paradigm of training policies during deployment given the scenarios they encounter: rather than deploying pre-trained policies to unseen problems in a zero-shot manner, we non-parametrically retrieve and train models directly on relevant data at test time. Furthermore, we show that many robotics tasks share considerable amounts of low-level behaviors and that retrieval at the “sub”-trajectory granularity enables significantly improved data utilization, generalization, and robustness in adapting policies to novel problems. In contrast, existing full-trajectory retrieval methods tend to underutilize the data and miss out on shared cross-task content. This work proposes STRAP, a technique for leveraging pre-trained vision foundation models and dynamic time warping to retrieve sub-sequences of trajectories from large training corpora in a robust fashion. STRAP outperforms both prior retrieval algorithms and multi-task learning methods in simulated and real experiments, showing the ability to scale to much larger offline datasets in the real world as well as the ability to learn robust control policies with just a handful of real-world demonstrations.
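
The abstract names dynamic time warping as the tool for retrieving sub-sequences. The sketch below shows a generic subsequence DTW on scalar sequences. It is not the STRAP implementation: the function name `subsequence_dtw`, the absolute-difference cost, and the use of plain numbers in place of vision-model embeddings are all assumptions made for illustration.

```python
def subsequence_dtw(query, corpus):
    """Find the contiguous span of `corpus` that best aligns with `query`.

    Returns (cost, start, end) so that corpus[start:end] is the matched span.
    Hypothetical helper: scalars stand in for per-frame embeddings.
    """
    n, m = len(query), len(corpus)
    inf = float("inf")
    # cost[i][j]: best cost of aligning query[:i] with a span ending at corpus[j-1]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    # start[i][j]: index in corpus where that best span begins
    start = [[0] * (m + 1) for _ in range(n + 1)]
    for j in range(m + 1):
        cost[0][j] = 0.0  # a match may begin at any position for free
        start[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(query[i - 1] - corpus[j - 1])
            # diagonal step and horizontal step (stay on the same query item)
            candidates = [
                (cost[i - 1][j - 1], start[i - 1][j - 1]),
                (cost[i][j - 1], start[i][j - 1]),
            ]
            if i > 1:
                # vertical step (stay on the same corpus item)
                candidates.append((cost[i - 1][j], start[i - 1][j]))
            best_cost, best_start = min(candidates)
            cost[i][j] = d + best_cost
            start[i][j] = best_start
    end = min(range(1, m + 1), key=lambda j: cost[n][j])
    return cost[n][end], start[n][end], end


result = subsequence_dtw([1, 2, 3], [9, 9, 1, 2, 3, 9])
print(result)
```

Leaving the first row at zero cost is what turns ordinary DTW into a subsequence search: the query is free to start anywhere in the longer sequence rather than being forced to align with its beginning.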

[LG-2] HPC-Coder-V2: Studying Code LLMs Across Low-Resource Parallel Languages

Link: https://arxiv.org/abs/2412.15178
Authors: Aman Chaturvedi, Daniel Nichols, Siddharth Singh, Abhinav Bhatele
Keywords: Large Language Model, Large Language, software development assistants, high performance computing, general purpose programming
Categories: Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG); Software Engineering (cs.SE)
Comments:

Abstract:Large Language Model (LLM) based coding tools have been tremendously successful as software development assistants, yet they are often designed for general purpose programming tasks and perform poorly for more specialized domains such as high performance computing. Creating specialized models and tools for these domains is crucial towards gaining the benefits of LLMs in areas such as HPC. While previous work has explored HPC-specific models, LLMs still struggle to generate parallel code and it is not at all clear what hurdles are still holding back these LLMs and what must be done to overcome them. In this work, we conduct an in-depth study along the many axes of fine-tuning a specialized HPC LLM in order to better understand the challenges. Based on our findings we fine-tune and evaluate a specialized HPC LLM that is shown to be the best performing open-source code LLM for parallel code generation to date.

[LG-3] Rethinking Uncertainty Estimation in Natural Language Generation

Link: https://arxiv.org/abs/2412.15176
Authors: Lukas Aichberger, Kajetan Schweighofer, Sepp Hochreiter
Keywords: Large Language Models, Large Language, Language Models, real-world applications, increasingly employed
Categories: Machine Learning (cs.LG)
Comments:

Abstract:Large Language Models (LLMs) are increasingly employed in real-world applications, driving the need to evaluate the trustworthiness of their generated text. To this end, reliable uncertainty estimation is essential. Since current LLMs generate text autoregressively through a stochastic process, the same prompt can lead to varying outputs. Consequently, leading uncertainty estimation methods generate and analyze multiple output sequences to determine the LLM’s uncertainty. However, generating output sequences is computationally expensive, making these methods impractical at scale. In this work, we inspect the theoretical foundations of the leading methods and explore new directions to enhance their computational efficiency. Building on the framework of proper scoring rules, we find that the negative log-likelihood of the most likely output sequence constitutes a theoretically grounded uncertainty measure. To approximate this alternative measure, we propose G-NLL, which has the advantage of being obtained using only a single output sequence generated by greedy decoding. This makes uncertainty estimation more efficient and straightforward, while preserving theoretical rigor. Empirical results demonstrate that G-NLL achieves state-of-the-art performance across various LLMs and tasks. Our work lays the foundation for efficient and reliable uncertainty estimation in natural language generation, challenging the necessity of more computationally involved methods currently leading the field.
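
As a rough illustration of the measure described above, the sketch below sums the negative log-probabilities of the tokens picked by greedy decoding. The function name `greedy_nll` and the toy per-step distributions are assumptions for illustration, not the authors' code. In a real LLM each step's distribution depends on the tokens already chosen, which this toy input ignores.

```python
import math


def greedy_nll(step_probs):
    """Negative log-likelihood of the greedily decoded sequence.

    step_probs: list of dicts mapping token -> probability, one dict per
    decoding step (hypothetical toy input standing in for model outputs).
    """
    tokens = []
    nll = 0.0
    for probs in step_probs:
        # Greedy decoding: take the single most probable token at this step.
        token, p = max(probs.items(), key=lambda kv: kv[1])
        tokens.append(token)
        nll -= math.log(p)
    return tokens, nll


toy_steps = [{"the": 0.9, "a": 0.1}, {"cat": 0.6, "dog": 0.4}]
tokens, nll = greedy_nll(toy_steps)
print(tokens, nll)
```

Only one sequence is generated here, which is the efficiency argument made in the abstract: sampling-based estimators need several generations per prompt.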

[LG-4] DroughtSet: Understanding Drought Through Spatial-Temporal Learning AAAI25

Link: https://arxiv.org/abs/2412.15075
Authors: Xuwei Tan, Qian Zhao, Yanlan Liu, Xueru Zhang
Keywords: impacting natural resources, expensive natural disasters, severely impacting natural, depleting water resources, diminishing agricultural yields
Categories: Machine Learning (cs.LG)
Comments: Accepted by AAAI25

Abstract:Drought is one of the most destructive and expensive natural disasters, severely impacting natural resources and risks by depleting water resources and diminishing agricultural yields. Under climate change, accurately predicting drought is critical for mitigating drought-induced risks. However, the intricate interplay among the physical and biological drivers that regulate droughts limits the predictability and understanding of drought, particularly at a subseasonal to seasonal (S2S) time scale. While deep learning has shown potential for addressing climate forecasting challenges, its application to drought prediction has received relatively less attention. In this work, we propose a new dataset, DroughtSet, which integrates relevant predictive features and three drought indices from multiple remote sensing and reanalysis datasets across the contiguous United States (CONUS). DroughtSet specifically provides the machine learning community with a new real-world dataset to benchmark drought prediction models and more generally, time-series forecasting methods. Furthermore, we propose a spatial-temporal model SPDrought to predict and interpret S2S droughts. Our model learns from the spatial and temporal information of physical and biological features to predict three types of droughts simultaneously. Multiple strategies are employed to quantify the importance of physical and biological features for drought prediction. Our results provide insights for researchers to better understand the predictability and sensitivity of drought to biological and physical conditions. We aim to contribute to the climate field by proposing a new tool to predict and understand the occurrence of droughts and provide the AI community with a new benchmark to study deep learning applications in climate science.

[LG-5] DisCo: Graph-Based Disentangled Contrastive Learning for Cold-Start Cross-Domain Recommendation

Link: https://arxiv.org/abs/2412.15005
Authors: Hourun Li, Yifan Wang, Zhiping Xiao, Jia Yang, Changling Zhou, Ming Zhang, Wei Ju
Keywords: Recommender systems, user cold-start problem, real-world applications, cold-start problem, systems are widely
Categories: Information Retrieval (cs.IR); Machine Learning (cs.LG)
Comments:

Abstract:Recommender systems are widely used in various real-world applications, but they often encounter the persistent challenge of the user cold-start problem. Cross-domain recommendation (CDR), which leverages user interactions from one domain to improve prediction performance in another, has emerged as a promising solution. However, users with similar preferences in the source domain may exhibit different interests in the target domain. Therefore, directly transferring embeddings may introduce irrelevant source-domain collaborative information. In this paper, we propose a novel graph-based disentangled contrastive learning framework to capture fine-grained user intent and filter out irrelevant collaborative information, thereby avoiding negative transfer. Specifically, for each domain, we use a multi-channel graph encoder to capture diverse user intents. We then construct the affinity graph in the embedding space and perform multi-step random walks to capture high-order user similarity relationships. Treating one domain as the target, we propose a disentangled intent-wise contrastive learning approach, guided by user similarity, to refine the bridging of user intents across domains. Extensive experiments on four benchmark CDR datasets demonstrate that DisCo consistently outperforms existing state-of-the-art baselines, thereby validating the effectiveness of both DisCo and its components.

[LG-6] Diffusion priors for Bayesian 3D reconstruction from incomplete measurements

Link: https://arxiv.org/abs/2412.14897
Authors: Julian L. Möbius, Michael Habeck
Keywords: restricts the class, class of admissible, models, admissible models, prior distributions
Categories: Machine Learning (cs.LG)
Comments:

Abstract:Many inverse problems are ill-posed and need to be complemented by prior information that restricts the class of admissible models. Bayesian approaches encode this information as prior distributions that impose generic properties on the model such as sparsity, non-negativity or smoothness. However, in the case of complex structured models such as images, graphs or three-dimensional (3D) objects, generic prior distributions tend to favor models that differ largely from those observed in the real world. Here we explore the use of diffusion models as priors that are combined with experimental data within a Bayesian framework. We use 3D point clouds to represent 3D objects such as household items or biomolecular complexes formed from proteins and nucleic acids. We train diffusion models that generate coarse-grained 3D structures at a medium resolution and integrate these with incomplete and noisy experimental data. To demonstrate the power of our approach, we focus on the reconstruction of biomolecular assemblies from cryo-electron microscopy (cryo-EM) images, which is an important inverse problem in structural biology. We find that posterior sampling with diffusion model priors allows for 3D reconstruction from very sparse, low-resolution and partial observations.

[LG-7] Hierarchical Subspaces of Policies for Continual Offline Reinforcement Learning

Link: https://arxiv.org/abs/2412.14865
Authors: Anthony Kobanda, Rémy Portelas, Odalric-Ambrym Maillard, Ludovic Denoyer
Keywords: previously acquired skills, retaining previously acquired, agents must continuously, acquired skills, dynamic domains
Categories: Machine Learning (cs.LG)
Comments:

Abstract:In dynamic domains such as autonomous robotics and video game simulations, agents must continuously adapt to new tasks while retaining previously acquired skills. This ongoing process, known as Continual Reinforcement Learning, presents significant challenges, including the risk of forgetting past knowledge and the need for scalable solutions as the number of tasks increases. To address these issues, we introduce HIerarchical LOW-rank Subspaces of Policies (HILOW), a novel framework designed for continual learning in offline navigation settings. HILOW leverages hierarchical policy subspaces to enable flexible and efficient adaptation to new tasks while preserving existing knowledge. We demonstrate, through a careful experimental study, the effectiveness of our method in both classical MuJoCo maze environments and complex video game-like simulations, showcasing competitive performance and satisfying adaptability according to classical continual learning metrics, in particular regarding memory usage. Our work provides a promising framework for real-world applications where continuous learning from pre-collected data is essential.

[LG-8] Entropy Regularized Task Representation Learning for Offline Meta-Reinforcement Learning AAAI2025

Link: https://arxiv.org/abs/2412.14834
Authors: Mohammadreza Nakhaei, Aidan Scannell, Joni Pajarinen
Keywords: task representations, meta-reinforcement learning aims, Offline meta-reinforcement learning, task, tasks
Categories: Machine Learning (cs.LG)
Comments: 7 Pages, Accepted at AAAI 2025

Abstract:Offline meta-reinforcement learning aims to equip agents with the ability to rapidly adapt to new tasks by training on data from a set of different tasks. Context-based approaches utilize a history of state-action-reward transitions – referred to as the context – to infer representations of the current task, and then condition the agent, i.e., the policy and value function, on the task representations. Intuitively, the better the task representations capture the underlying tasks, the better the agent can generalize to new tasks. Unfortunately, context-based approaches suffer from distribution mismatch, as the context in the offline data does not match the context at test time, limiting their ability to generalize to the test tasks. This leads to the task representations overfitting to the offline training data. Intuitively, the task representations should be independent of the behavior policy used to collect the offline data. To address this issue, we approximately minimize the mutual information between the distribution over the task representations and behavior policy by maximizing the entropy of behavior policy conditioned on the task representations. We validate our approach in MuJoCo environments, showing that compared to baselines, our task representations more faithfully represent the underlying tasks, leading to outperforming prior methods in both in-distribution and out-of-distribution tasks.
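
The step from mutual information to conditional entropy rests on a standard identity. Writing $z$ for the task representation and $a$ for an action drawn from the behavior policy (notation assumed here, not taken from the paper), and suppressing the conditioning on the state:

```latex
\[
I(z; a) \;=\; H(a) \;-\; H(a \mid z).
\]
```

If the marginal entropy $H(a)$ is treated as fixed by the offline data, lowering $I(z; a)$ amounts to raising $H(a \mid z)$, which is consistent with the objective described in the abstract.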

[LG-9] Extending TWIG: Zero-Shot Predictive Hyperparameter Selection for KGEs based on Graph Structure

Link: https://arxiv.org/abs/2412.14801
Authors: Jeffrey Sardina, John D. Kelleher, Declan O’Sullivan
Keywords: Knowledge Graph Embeddings, Knowledge Graphs, Graph Embeddings, Knowledge, biomedicine and linguistics
Categories: Machine Learning (cs.LG)
Comments:

Abstract:Knowledge Graphs (KGs) have seen increasing use across various domains – from biomedicine and linguistics to general knowledge modelling. In order to facilitate the analysis of knowledge graphs, Knowledge Graph Embeddings (KGEs) have been developed to automatically analyse KGs and predict new facts based on the information in a KG, a task called “link prediction”. Many existing studies have documented that the structure of a KG, KGE model components, and KGE hyperparameters can significantly change how well KGEs perform and what relationships they are able to learn. Recently, the Topologically-Weighted Intelligence Generation (TWIG) model has been proposed as a solution to modelling how each of these elements relate. In this work, we extend the previous research on TWIG and evaluate its ability to simulate the output of the KGE model ComplEx in the cross-KG setting. Our results are twofold. First, TWIG is able to summarise KGE performance on a wide range of hyperparameter settings and KGs being learned, suggesting that it represents a general knowledge of how to predict KGE performance from KG structure. Second, we show that TWIG can successfully predict hyperparameter performance on unseen KGs in the zero-shot setting. This second observation leads us to propose that, with additional research, optimal hyperparameter selection for KGE models could be determined in a pre-hoc manner using TWIG-like methods, rather than by using a full hyperparameter search.

[LG-10] A parametric algorithm is optimal for non-parametric regression of smooth functions

Link: https://arxiv.org/abs/2412.14744
Authors: Davide Maran, Marcello Restelli
Keywords: uniform error bound, training points, learner selects, selects the training, achieve a uniform
Categories: Machine Learning (cs.LG)
Comments:

Abstract:We address the regression problem for a general function $f:[-1,1]^d\to\mathbb{R}$ when the learner selects the training points $\{x_i\}_{i=1}^n$ to achieve a uniform error bound across the entire domain. In this setting, known historically as nonparametric regression, we aim to establish a sample complexity bound that depends solely on the function’s degree of smoothness. Assuming periodicity at the domain boundaries, we introduce PADUA, an algorithm that, with high probability, provides performance guarantees optimal up to constant or logarithmic factors across all problem parameters. Notably, PADUA is the first parametric algorithm with optimal sample complexity for this setting. Due to this feature, we prove that, differently from the non-parametric state of the art, PADUA enjoys optimal space complexity in the prediction phase. To validate these results, we perform numerical experiments over functions coming from real audio data, where PADUA shows comparable performance to state-of-the-art methods, while requiring only a fraction of the computational time.

[LG-11] Active Inference and Human–Computer Interaction

Link: https://arxiv.org/abs/2412.14741
Authors: Roderick Murray-Smith, John H. Williamson, Sebastian Stein
Keywords: closed-loop computational theoretical, Active Inference, internal probabilistic generative, computational theoretical basis, review Active Inference
Categories: Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
Comments:

Abstract:Active Inference is a closed-loop computational theoretical basis for understanding behaviour, based on agents with internal probabilistic generative models that encode their beliefs about how hidden states in their environment cause their sensations. We review Active Inference and how it could be applied to model the human-computer interaction loop. Active Inference provides a coherent framework for managing generative models of humans, their environments, sensors and interface components. It informs off-line design and supports real-time, online adaptation. It provides model-based explanations for behaviours observed in HCI, and new tools to measure important concepts such as agency and engagement. We discuss how Active Inference offers a new basis for a theory of interaction in HCI, tools for design of modern, complex sensor-based systems, and integration of artificial intelligence technologies, enabling it to cope with diversity in human users and contexts. We discuss the practical challenges in implementing such Active Inference-based systems.

[LG-12] On the Use of Deep Learning Models for Semantic Clone Detection

Link: https://arxiv.org/abs/2412.14739
Authors: Subroto Nag Pinku, Debajyoti Mondal, Chanchal K. Roy
Keywords: tracking code clones, tracking code, code fragment, ease various software, software development
Categories: Software Engineering (cs.SE); Machine Learning (cs.LG)
Comments: Accepted at the 40th IEEE International Conference on Software Maintenance and Evolution (ICSME 2024)

Abstract:Detecting and tracking code clones can ease various software development and maintenance tasks when changes in a code fragment should be propagated over all its copies. Several deep learning-based clone detection models have appeared in the literature for detecting syntactic and semantic clones, widely evaluated with the BigCloneBench dataset. However, class imbalance and the small number of semantic clones make BigCloneBench less ideal for interpreting model performance. Researchers also use other datasets such as GoogleCodeJam, OJClone, and SemanticCloneBench to understand model generalizability. To overcome the limitations of existing datasets, the GPT-assisted semantic and cross-language clone dataset GPTCloneBench has been released. However, how these models compare across datasets remains unclear. In this paper, we propose a multi-step evaluation approach for five state-of-the-art clone detection models leveraging existing benchmark datasets, including GPTCloneBench, and using mutation operators to study model ability. Specifically, we examine three highly-performing single-language models (ASTNN, GMN, CodeBERT) on BigCloneBench, SemanticCloneBench, and GPTCloneBench, testing their robustness with mutation operations. Additionally, we compare them against cross-language models (C4, CLCDSA) known for detecting semantic clones. While single-language models show high F1 scores for BigCloneBench, their performance on SemanticCloneBench varies (up to 20%). Interestingly, the cross-language model (C4) shows superior performance (around 7%) on SemanticCloneBench over other models and performs similarly on BigCloneBench and GPTCloneBench. On mutation-based datasets, C4 has more robust performance (less than 1% difference) compared to single-language models, which show high variability.

[LG-13] Boosting GNN Performance via Training Sample Selection Based on Adversarial Robustness Evaluation

Link: https://arxiv.org/abs/2412.14738
Authors: Yongyu Wang
Keywords: neural network architectures, powerful neural network, Graph Neural Networks, powerful neural, leveraging graph topology
Categories: Machine Learning (cs.LG)
Comments:

Abstract:Graph Neural Networks (GNNs) have established themselves as one of the most powerful neural network architectures, excelling in leveraging graph topology and node features for various tasks. However, GNNs are inherently vulnerable to noise in their inputs. Such noise can significantly degrade their performance. To address this challenge, we propose a novel approach that employs adversarial robustness evaluation techniques to identify nodes in the graph that are most susceptible to noise. By selecting and constructing a training set composed of these particularly noise-prone nodes, we then use them to train a Graph Convolutional Network (GCN). Our experimental results demonstrate that this strategy leads to substantial improvements in the GCN’s performance.

[LG-14] Generative AI for Banks: Benchmarks and Algorithms for Synthetic Financial Transaction Data

Link: https://arxiv.org/abs/2412.14730
Authors: Fabian Sven Karst, Sook-Yee Chong, Abigail A. Antenor, Enyu Lin, Mahei Manhai Li, Jan Marco Leimeister
Keywords: banking sector faces, sector faces challenges, deep learning due, Conditional Tabular Generative, Generative Adversarial Networks
Categories: Machine Learning (cs.LG)
Comments: Presented at the 34th Workshop on Information Technologies and Systems (WITS 2024)

Abstract:The banking sector faces challenges in using deep learning due to data sensitivity and regulatory constraints, but generative AI may offer a solution. Thus, this study identifies effective algorithms for generating synthetic financial transaction data and evaluates five leading models - Conditional Tabular Generative Adversarial Networks (CTGAN), DoppelGANger (DGAN), Wasserstein GAN, Financial Diffusion (FinDiff), and Tabular Variational AutoEncoders (TVAE) - across five criteria: fidelity, synthesis quality, efficiency, privacy, and graph structure. While none of the algorithms is able to replicate the real data’s graph structure, each excels in specific areas: DGAN is ideal for privacy-sensitive tasks, FinDiff and TVAE excel in data replication and augmentation, and CTGAN achieves a balance across all five criteria, making it suitable for general applications with moderate privacy concerns. As a result, our findings offer valuable insights for choosing the most suitable algorithm.

[LG-15] FROC: Building Fair ROC from a Trained Classifier AAAI

Link: https://arxiv.org/abs/2412.14724
Authors: Avyukta Manjunatha Vummintala, Shantanu Das, Sujit Gujar
Keywords: Equalized ROC, binary protected groups, ROC, probabilistic binary classification, protected groups
Categories: Machine Learning (cs.LG)
Comments: 51 pages, The 39th Annual AAAI Conference on Artificial Intelligence

Abstract:This paper considers the problem of fair probabilistic binary classification with binary protected groups. The classifier assigns scores, and a practitioner predicts labels using a certain cut-off threshold based on the desired trade-off between false positives vs. false negatives. It derives these thresholds from the ROC of the classifier. The resultant classifier may be unfair to one of the two protected groups in the dataset. It is desirable that no matter what threshold the practitioner uses, the classifier should be fair to both the protected groups; that is, the $\mathcal{L}_p$ norm between FPRs and TPRs of both the protected groups should be at most $\varepsilon$. We call such fairness on ROCs of both the protected attributes $\varepsilon_p$-Equalized ROC. Given a classifier not satisfying $\varepsilon_1$-Equalized ROC, we aim to design a post-processing method to transform the given (potentially unfair) classifier’s output (score) to a suitable randomized yet fair classifier. That is, the resultant classifier must satisfy $\varepsilon_1$-Equalized ROC. First, we introduce a threshold query model on the ROC curves for each protected group. The resulting classifier is bound to face a reduction in AUC. With the proposed query model, we provide a rigorous theoretical analysis of the minimal AUC loss to achieve $\varepsilon_1$-Equalized ROC. To achieve this, we design a linear time algorithm, namely FROC, to transform a given classifier’s output to a probabilistic classifier that satisfies $\varepsilon_1$-Equalized ROC. We prove that under certain theoretical conditions, FROC achieves the theoretical optimal guarantees. We also study the performance of our FROC on multiple real-world datasets with many trained classifiers.
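
The fairness condition above compares the (FPR, TPR) points of the two protected groups. The sketch below computes an L1 gap between those points at a single shared score threshold. Comparing the groups at the same threshold is a simplifying assumption made here, and the helper names `roc_point` and `roc_gap_l1` and the toy data are hypothetical. This is not the FROC algorithm, only the quantity a fairness check of this kind would look at.

```python
def roc_point(scores, labels, threshold):
    """Return (FPR, TPR) when predicting positive for score >= threshold."""
    tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
    fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
    positives = sum(labels)
    negatives = len(labels) - positives
    return fp / negatives, tp / positives


def roc_gap_l1(group_a, group_b, threshold):
    """L1 distance between the two groups' ROC points at one threshold."""
    fpr_a, tpr_a = roc_point(group_a[0], group_a[1], threshold)
    fpr_b, tpr_b = roc_point(group_b[0], group_b[1], threshold)
    return abs(fpr_a - fpr_b) + abs(tpr_a - tpr_b)


# Toy (scores, labels) pairs for two protected groups.
group_a = ([0.9, 0.8, 0.3, 0.2], [1, 0, 1, 0])
group_b = ([0.7, 0.6, 0.4, 0.1], [1, 1, 0, 0])
gap = roc_gap_l1(group_a, group_b, threshold=0.5)
print(gap)
```

Under this reading, a classifier would pass the check at a given threshold if the gap is at most the chosen $\varepsilon$. Each group must contain at least one positive and one negative example, otherwise the rates are undefined.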

[LG-16] A Comprehensive Forecasting Framework based on Multi-Stage Hierarchical Forecasting Reconciliation and Adjustment

Link: https://arxiv.org/abs/2412.14718
Authors: Zhengchao Yang, Mithun Ghosh, Anish Saha, Dong Xu, Konstantin Shmakov, Kuang-chih Lee
Keywords: effective resource planning, enabling effective resource, Hierarchical Forecasting Reconciliation, Ads demand forecasting, demand forecasting
Categories: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC)
Comments: Published in 2024 IEEE International Conference on Big Data (BigData)

Abstract: Ads demand forecasting for Walmart's ad products plays a critical role in enabling effective resource planning, allocation, and management of ads performance. In this paper, we introduce a comprehensive demand forecasting system that tackles hierarchical time series forecasting in business settings. Though traditional hierarchical reconciliation methods ensure forecasting coherence, they often trade off accuracy for coherence, especially at lower levels, and fail to capture the seasonality unique to each time series in the hierarchy. Thus, we propose a novel framework, "Multi-Stage Hierarchical Forecasting Reconciliation and Adjustment (Multi-Stage HiFoReAd)", to address the challenges of preserving seasonality, ensuring coherence, and improving accuracy. Our system first utilizes diverse models, ensembled through Bayesian Optimization (BO), to produce base forecasts. The generated base forecasts are then passed into the Multi-Stage HiFoReAd framework. The initial stage refines the hierarchy using Top-Down forecasts and "harmonic alignment." The second stage aligns the higher levels' forecasts using the MinTrace algorithm, following which the last two levels undergo "harmonic alignment" and "stratified scaling", to eventually achieve accurate and coherent forecasts across the whole hierarchy. Our experiments on Walmart's internal Ads-demand dataset and 3 other public datasets, each with 4 hierarchical levels, demonstrate that the average Absolute Percentage Error on the cross-validation sets improves by 3% to 40% across levels against a BO-ensemble of models (LGBM, MSTL+ETS, Prophet), and by 1.2% to 92.9% against State-Of-The-Art models. In addition, the forecasts at all hierarchical levels are proved to be coherent. The proposed framework has been deployed and leveraged by Walmart's ads, sales and operations teams to track future demands, make informed decisions and plan resources.

[LG-17] Computing Gram Matrix for SMILES Strings using RDKFingerprint and Sinkhorn-Knopp Algorithm

Link: https://arxiv.org/abs/2412.14717
Authors: Sarwan Ali,Haris Mansoor,Prakash Chourasia,Imdad Ullah Khan,Murray Patterson
Keywords: Line Entry System, Input Line Entry, Simplified Molecular Input, Molecular Input Line, Entry System
Categories: Machine Learning (cs.LG)
Comments:

Abstract: In molecular structure data, SMILES (Simplified Molecular Input Line Entry System) strings are used to analyze molecular structure design. Numerical feature representation of SMILES strings is a challenging task. This work proposes a kernel-based approach for encoding and analyzing molecular structures from SMILES strings. The proposed approach involves computing a kernel matrix using the Sinkhorn-Knopp algorithm while using kernel principal component analysis (PCA) for dimensionality reduction. The resulting low-dimensional embeddings are then used for classification and regression analysis. The kernel matrix is computed by converting the SMILES strings into molecular structures using the Morgan Fingerprint, which computes a fingerprint for each molecule. The distance matrix is computed using the pairwise kernels function. The Sinkhorn-Knopp algorithm is used to compute the final kernel matrix that satisfies the constraints of a probability distribution. This is achieved by iteratively adjusting the kernel matrix until the marginal distributions of the rows and columns match the desired marginal distributions. We provide a comprehensive empirical analysis of the proposed kernel method to evaluate it in greater depth. The suggested method is assessed for drug subcategory prediction (classification task) and solubility AlogPS "Aqueous solubility and Octanol/Water partition coefficient" (regression task) using the benchmark SMILES string dataset. The outcomes show the proposed method outperforms several baseline methods in terms of supervised analysis and has potential uses in molecular design and drug discovery. Overall, the suggested method is a promising avenue for kernel methods-based molecular structure analysis and design.
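
The Sinkhorn-Knopp step described in this abstract can be sketched in a few lines. It is a minimal illustration that alternately rescales rows and columns of a strictly positive kernel matrix until both sum to 1. Uniform target marginals, the toy 3x3 matrix, and the function name are assumptions made here; the paper's fingerprint computation and exact marginals are not reproduced.

```python
def sinkhorn_knopp(kernel, max_iters=1000, tol=1e-9):
    """Rescale a square matrix with strictly positive entries so that
    every row and every column sums to 1 (a doubly stochastic matrix)."""
    n = len(kernel)
    row_scale = [1.0] * n
    col_scale = [1.0] * n
    for _ in range(max_iters):
        # Choose column factors so each column of diag(r) K diag(c) sums to 1.
        col_scale = [
            1.0 / sum(row_scale[i] * kernel[i][j] for i in range(n))
            for j in range(n)
        ]
        # Choose row factors so each row sums to 1.
        row_scale = [
            1.0 / sum(kernel[i][j] * col_scale[j] for j in range(n))
            for i in range(n)
        ]
        # Rows are now exact; stop once the columns are close enough too.
        col_error = max(
            abs(sum(row_scale[i] * kernel[i][j] * col_scale[j] for i in range(n)) - 1.0)
            for j in range(n)
        )
        if col_error < tol:
            break
    return [
        [row_scale[i] * kernel[i][j] * col_scale[j] for j in range(n)]
        for i in range(n)
    ]


# Hypothetical similarity matrix standing in for pairwise fingerprint kernels.
toy_kernel = [
    [1.0, 0.5, 0.2],
    [0.5, 1.0, 0.4],
    [0.2, 0.4, 1.0],
]
balanced = sinkhorn_knopp(toy_kernel)
```

The balanced matrix could then be passed to kernel PCA; that step is omitted here.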

[LG-18] Holistic Adversarially Robust Pruning ICLR2023

Link: https://arxiv.org/abs/2412.14714
Authors: Qi Zhao,Christian Wressnegger
Keywords: removing redundant parameters, Neural networks, drastically shrunk, removing redundant, Neural
Categories: Machine Learning (cs.LG)
Comments: Accepted by ICLR 2023

Abstract: Neural networks can be drastically shrunk in size by removing redundant parameters. While crucial for deployment on resource-constrained hardware, compression oftentimes comes with a severe drop in accuracy and a lack of adversarial robustness. Despite recent advances, counteracting both aspects has only succeeded for moderate compression rates so far. We propose a novel method, HARP, that copes with aggressive pruning significantly better than prior work. For this, we consider the network holistically. We learn a global compression strategy that optimizes how many parameters (compression rate) and which parameters (scoring connections) to prune specific to each layer individually. Our method fine-tunes an existing model with dynamic regularization that follows a step-wise incremental function balancing the different objectives. It starts by favoring robustness before shifting focus to reaching the target compression rate, and only then handles the objectives equally. The learned compression strategies allow us to maintain the pre-trained model's natural accuracy and its adversarial robustness for a reduction by 99% of the network's original size. Moreover, we observe a crucial influence of non-uniform compression across layers.

[LG-19] ReMoE: Fully Differentiable Mixture-of-Experts with ReLU Routing

Link: https://arxiv.org/abs/2412.14711
Authors: Ziteng Wang,Jianfei Chen,Jun Zhu
Keywords: Sparsely activated, widely adopted, adopted to scale, capacity without increasing, Sparsely
Categories: Machine Learning (cs.LG)
Comments:

Abstract:Sparsely activated Mixture-of-Experts (MoE) models are widely adopted to scale up model capacity without increasing the computation budget. However, vanilla TopK routers are trained in a discontinuous, non-differentiable way, limiting their performance and scalability. To address this issue, we propose ReMoE, a fully differentiable MoE architecture that offers a simple yet effective drop-in replacement for the conventional TopK+Softmax routing, utilizing ReLU as the router instead. We further propose methods to regulate the router’s sparsity while balancing the load among experts. ReMoE’s continuous nature enables efficient dynamic allocation of computation across tokens and layers, while also exhibiting domain specialization. Our experiments demonstrate that ReMoE consistently outperforms vanilla TopK-routed MoE across various model sizes, expert counts, and levels of granularity. Furthermore, ReMoE exhibits superior scalability with respect to the number of experts, surpassing traditional MoE architectures. The implementation based on Megatron-LM is available at this https URL.
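
The contrast between the two routers in this abstract can be sketched as follows. The TopK variant shown (softmax over all experts, then keep the k largest gates) is one common formulation and is an assumption here, as are the function names and the example logits. The sparsity regularization and load balancing mentioned in the abstract are not shown.

```python
import math


def topk_softmax_gates(logits, k):
    """Softmax over all experts, then zero out all but the k largest gates.
    The hard selection makes the output jump when the ranking changes."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    keep = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    return [p if i in keep else 0.0 for i, p in enumerate(probs)]


def relu_gates(logits):
    """ReLU routing: an expert is active exactly when its logit is positive,
    so the number of active experts can differ from token to token."""
    return [max(0.0, z) for z in logits]


# Hypothetical router logits for one token over four experts.
token_logits = [1.0, -0.5, 0.3, -2.0]
topk = topk_softmax_gates(token_logits, k=2)
relu = relu_gates(token_logits)
```

With ReLU gates, the gate values change continuously with the logits, which is the property the abstract refers to as full differentiability.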

[LG-20] Taming the Memory Beast: Strategies for Reliable ML Training on Kubernetes

Link: https://arxiv.org/abs/2412.14701
Authors: Jaideep Ray
Keywords: powerful orchestration platform, machine learning training, resource constraints, Kubernetes offers, offers a powerful
Categories: Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)
Comments: 4 pages

Abstract:Kubernetes offers a powerful orchestration platform for machine learning training, but memory management can be challenging due to specialized needs and resource constraints. This paper outlines how Kubernetes handles memory requests, limits, Quality of Service classes, and eviction policies for ML workloads, with special focus on GPU memory and ephemeral storage. Common pitfalls such as overcommitment, memory leaks, and ephemeral volume exhaustion are examined. We then provide best practices for stable, scalable memory utilization to help ML practitioners prevent out-of-memory events and ensure high-performance ML training pipelines.

[LG-21] Lorentzian Residual Neural Networks KDD2025

Link: https://arxiv.org/abs/2412.14695
Authors: Neil He,Menglin Yang,Rex Ying
Keywords: data structures prevalent, modeling hierarchical data, hierarchical data structures, real-world datasets, neural networks
Categories: Machine Learning (cs.LG)
Comments: 12 pages, 3 figures, KDD 2025

Abstract: Hyperbolic neural networks have emerged as a powerful tool for modeling hierarchical data structures prevalent in real-world datasets. Notably, residual connections, which facilitate the direct flow of information across layers, have been instrumental in the success of deep neural networks. However, current methods for constructing hyperbolic residual networks suffer from limitations such as increased model complexity, numerical instability, and errors due to multiple mappings to and from the tangent space. To address these limitations, we introduce LResNet, a novel Lorentzian residual neural network based on the weighted Lorentzian centroid in the Lorentz model of hyperbolic geometry. Our method enables the efficient integration of residual connections in Lorentz hyperbolic neural networks while preserving their hierarchical representation capabilities. We demonstrate that our method can theoretically derive previous methods while offering improved stability, efficiency, and effectiveness. Extensive experiments on both graph and vision tasks showcase the superior performance and robustness of our method compared to state-of-the-art Euclidean and hyperbolic alternatives. Our findings highlight the potential of LResNet for building more expressive neural networks in hyperbolic embedding space as a generally applicable method to multiple architectures, including CNNs, GNNs, and graph Transformers.
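
A weighted Lorentzian centroid, the operation this abstract builds on, can be sketched as below. The sketch assumes curvature -1, fixed positive weights, and hypothetical helper names. It combines a point and a residual branch output directly on the hyperboloid, with no tangent-space mapping. How LResNet chooses or learns the weights is not stated in the abstract and is not modeled here.

```python
import math


def lorentz_inner(x, y):
    """Lorentzian inner product: -x0*y0 + sum of products of spatial parts."""
    return -x[0] * y[0] + sum(a * b for a, b in zip(x[1:], y[1:]))


def lift_to_hyperboloid(spatial):
    """Place spatial coordinates on the curvature -1 hyperboloid (x0 > 0)."""
    time = math.sqrt(1.0 + sum(v * v for v in spatial))
    return [time] + list(spatial)


def lorentz_residual(x, fx, w_x=1.0, w_fx=0.5):
    """Weighted Lorentzian centroid of x and fx, renormalized onto the
    hyperboloid. The weights are fixed, hypothetical constants."""
    combined = [w_x * a + w_fx * b for a, b in zip(x, fx)]
    norm = math.sqrt(abs(lorentz_inner(combined, combined)))
    return [c / norm for c in combined]


x = lift_to_hyperboloid([0.3, -0.2])    # layer input
fx = lift_to_hyperboloid([1.5, 0.7])    # hypothetical layer output
z = lorentz_residual(x, fx)
```

For positive weights and inputs on the upper sheet, the combined vector is timelike, so the renormalized result again satisfies the hyperboloid constraint `lorentz_inner(z, z) == -1` up to rounding.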

[LG-22] Trainable Adaptive Activation Function Structure (TAAFS) Enhances Neural Network Force Field Performance with Only Dozens of Additional Parameters

Link: https://arxiv.org/abs/2412.14655
Authors: Enji Li
Keywords: deepening multilayer perceptrons, network force fields, model complex interactions, graph neural networks, neural network force
Categories: Machine Learning (cs.LG)
Comments:

Abstract:At the heart of neural network force fields (NNFFs) is the architecture of neural networks, where the capacity to model complex interactions is typically enhanced through widening or deepening multilayer perceptrons (MLPs) or by increasing layers of graph neural networks (GNNs). These enhancements, while improving the model’s performance, often come at the cost of a substantial increase in the number of parameters. By applying the Trainable Adaptive Activation Function Structure (TAAFS), we introduce a method that selects distinct mathematical formulations for non-linear activations, thereby increasing the precision of NNFFs with an insignificant addition to the parameter count. In this study, we integrate TAAFS into a variety of neural network models, resulting in observed accuracy improvements, and further validate these enhancements through molecular dynamics (MD) simulations using DeepMD.

[LG-23] Continuous latent representations for modeling precipitation with deep learning

Link: https://arxiv.org/abs/2412.14620
Authors: Gokul Radhakrishnan,Rahul Sundar,Nishant Parashar,Antoine Blanchard,Daiwei Wang,Boyko Dodov
Keywords: presents significant challenges, data presents significant, spatio-temporally discontinuous nature, discontinuous nature, presents significant
Categories: Machine Learning (cs.LG)
Comments:

Abstract: The sparse and spatio-temporally discontinuous nature of precipitation data presents significant challenges for simulation and statistical processing for bias correction and downscaling. These include incorrect representation of intermittency and extreme values (critical for hydrology applications), Gibbs phenomenon upon regridding, and lack of fine-scale details. To address these challenges, a common approach is to transform the precipitation variable nonlinearly into one that is more malleable. In this work, we explore how deep learning can be used to generate a smooth, spatio-temporally continuous variable as a proxy for simulation of precipitation data. We develop a normally distributed field called pseudo-precipitation (PP) as an alternative for simulating precipitation. The practical applicability of this variable is investigated by applying it for downscaling precipitation from 1° (~100 km) to 0.25° (~25 km).

[LG-24] MixLLM: LLM Quantization with Global Mixed-precision between Output-features and Highly-efficient System Design

Link: https://arxiv.org/abs/2412.14590
Authors: Zhen Zheng,Xiaonan Song,Chuanjie Liu
Keywords: smaller size, effective methodologies, methodologies to compress, compress LLMs, LLMs into smaller
Categories: Machine Learning (cs.LG)
Comments: The code will be released in the future

Abstract: Quantization has become one of the most effective methodologies to compress LLMs into a smaller size. However, the existing quantization solutions still suffer from either a non-negligible accuracy drop or system inefficiency. In this paper, we make a comprehensive analysis of the general quantization principles and their effect on the triangle of accuracy, memory consumption and system efficiency. We propose MixLLM, which explores the new optimization space of mixed-precision quantization between output features, based on the insight that different output features matter differently in the model. MixLLM identifies the output features with high salience in the global view rather than within each single layer, effectively assigning the larger bit-width to the output features that need it most to achieve good accuracy with low memory consumption. We present the sweet spot of quantization configuration of algorithm-system co-design that leads to high accuracy and system efficiency. To address the system challenge, we design a two-step dequantization to make easy use of the int8 Tensor Core and a fast data type conversion to reduce dequantization overhead significantly, and present a software pipeline that overlaps the memory access, dequantization and the MatMul as much as possible. Extensive experiments show that with only 10% more bits, the PPL increase can be reduced from about 0.5 in SOTA to within 0.2 for Llama 3.1 70B, while on average MMLU-Pro improves by 0.93 over the SOTA of three popular models. In addition to its superior accuracy, MixLLM also achieves state-of-the-art system efficiency.

[LG-25] Single-Loop Federated Actor-Critic across Heterogeneous Environments AAAI’25

Link: https://arxiv.org/abs/2412.14555
Authors: Ye Zhu,Xiaowen Gong
Keywords: shared policy adaptable, Federated reinforcement learning, promising paradigm, enabling multiple agents, reinforcement learning
Categories: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC)
Comments: Extended version of paper accepted at AAAI’25

Abstract: Federated reinforcement learning (FRL) has emerged as a promising paradigm, enabling multiple agents to collaborate and learn a shared policy adaptable across heterogeneous environments. Among the various reinforcement learning (RL) algorithms, the actor-critic (AC) algorithm stands out for its low variance and high sample efficiency. However, little to nothing is known theoretically about AC in a federated manner, especially when each agent interacts with a potentially different environment. The lack of such results is attributed to various technical challenges: a two-level structure illustrating the coupling effect between the actor and the critic, heterogeneous environments, Markovian sampling and multiple local updates. In response, we study Single-loop Federated Actor Critic (SFAC), where agents perform actor-critic learning in a two-level federated manner while interacting with heterogeneous environments. We then provide bounds on the convergence error of SFAC. The results show that the convergence error asymptotically converges to a near-stationary point, with the extent proportional to environment heterogeneity. Moreover, the sample complexity exhibits a linear speed-up through the federation of agents. We evaluate the performance of SFAC through numerical experiments using common RL benchmarks, which demonstrate its effectiveness.

[LG-26] Transformer models are gauge invariant: A mathematical connection between AI and particle physics

Link: https://arxiv.org/abs/2412.14543
Authors: Leo van Nierop
Keywords: symmetries called gauge, called gauge invariance, particle physics, fundamental forces, forces are subject
Categories: Machine Learning (cs.LG); High Energy Physics - Theory (hep-th)
Comments: 10 pages, 2 figures, 1 table

Abstract:In particle physics, the fundamental forces are subject to symmetries called gauge invariance. It is a redundancy in the mathematical description of any physical system. In this article I will demonstrate that the transformer architecture exhibits the same properties, and show that the default representation of transformers has partially, but not fully removed the gauge invariance.
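
One redundancy of this kind can be checked numerically. The sketch below shows a standard fact about attention, not necessarily the exact symmetry the article analyzes: multiplying the query weights by an invertible matrix G, and the key weights by the inverse transpose of G, leaves the attention scores Q K^T unchanged. All matrices and helper names are hypothetical toy values.

```python
def matmul(a, b):
    """Plain nested-list matrix product."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def transpose(a):
    return [list(row) for row in zip(*a)]


def inverse_2x2(m):
    (a, b), (c, d) = m
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]


def attention_scores(x, w_q, w_k):
    """Unnormalized attention scores (X W_Q)(X W_K)^T."""
    return matmul(matmul(x, w_q), transpose(matmul(x, w_k)))


x = [[1.0, 2.0, 0.5], [0.0, -1.0, 1.5]]          # 2 tokens, model dim 3
w_q = [[0.2, -0.1], [0.4, 0.3], [-0.5, 0.1]]     # head dim 2
w_k = [[0.1, 0.7], [-0.3, 0.2], [0.6, -0.4]]
g = [[2.0, 1.0], [0.5, 3.0]]                     # any invertible 2x2 matrix

# Transformed weights: W_Q G and W_K G^{-T}.
w_q_new = matmul(w_q, g)
w_k_new = matmul(w_k, transpose(inverse_2x2(g)))

before = attention_scores(x, w_q, w_k)
after = attention_scores(x, w_q_new, w_k_new)
```

The two weight settings describe the same function, which is the sense in which the parameterization is redundant.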

[LG-27] ST-ReP: Learning Predictive Representations Efficiently for Spatial-Temporal Forecasting AAAI2025

Link: https://arxiv.org/abs/2412.14537
Authors: Qi Zheng,Zihao Yao,Yaying Zhang
Keywords: crucial and widely, widely applicable, Spatial-temporal, Spatial-temporal forecasting, unlabeled spatial-temporal data
Categories: Machine Learning (cs.LG)
Comments: 13 pages, 7 pages. Accepted by AAAI2025

Abstract: Spatial-temporal forecasting is crucial and widely applicable in various domains such as traffic, energy, and climate. Benefiting from the abundance of unlabeled spatial-temporal data, self-supervised methods are increasingly adapted to learn spatial-temporal representations. However, they encounter three key challenges: 1) the difficulty in selecting reliable negative pairs due to the homogeneity of variables, hindering contrastive learning methods; 2) overlooking spatial correlations across variables over time; 3) limitations of efficiency and scalability in existing self-supervised learning methods. To tackle these, we propose a lightweight representation-learning model, ST-ReP, integrating current value reconstruction and future value prediction into the pre-training framework for spatial-temporal forecasting. We also design a new spatial-temporal encoder to model fine-grained relationships. Moreover, multi-time scale analysis is incorporated into the self-supervised loss to enhance predictive capability. Experimental results across diverse domains demonstrate that the proposed model surpasses pre-training-based baselines, showcasing its ability to learn compact and semantically enriched representations while exhibiting superior scalability.

[LG-28] Leveraging Time Series Categorization and Temporal Fusion Transformers to Improve Cryptocurrency Price Forecasting

Link: https://arxiv.org/abs/2412.14529
Authors: Arash Peik,Mohammad Ali Zare Chahooki,Amin Milani Fard,Mehdi Agha Sarram
Keywords: Organizing and managing, managing cryptocurrency portfolios, portfolios and decision-making, decision-making on transactions, transactions is crucial
Categories: Machine Learning (cs.LG); Computational Engineering, Finance, and Science (cs.CE); Statistical Finance (q-fin.ST)
Comments:

Abstract:Organizing and managing cryptocurrency portfolios and decision-making on transactions is crucial in this market. Optimal selection of assets is one of the main challenges that requires accurate prediction of the price of cryptocurrencies. In this work, we categorize the financial time series into several similar subseries to increase prediction accuracy by learning each subseries category with similar behavior. For each category of the subseries, we create a deep learning model based on the attention mechanism to predict the next step of each subseries. Due to the limited amount of cryptocurrency data for training models, if the number of categories increases, the amount of training data for each model will decrease, and some complex models will not be trained well due to the large number of parameters. To overcome this challenge, we propose to combine the time series data of other cryptocurrencies to increase the amount of data for each category, hence increasing the accuracy of the models corresponding to each category.

[LG-29] Knowledge Distillation in RNN-Attention Models for Early Prediction of Student Performance

Link: https://arxiv.org/abs/2412.14526
Authors: Sukrit Leelaluk,Cheng Tang,Valdemar Švábenský,Atsushi Shimada
Keywords: Educational data mining, at-risk students, automatically analyzing data, Educational data, at-risk
Categories: Machine Learning (cs.LG); Computers and Society (cs.CY)
Comments: Published in Proceedings of The 40th ACM/SIGAPP Symposium on Applied Computing (SAC '25), see this https URL

Abstract:Educational data mining (EDM) is a part of applied computing that focuses on automatically analyzing data from learning contexts. Early prediction for identifying at-risk students is a crucial and widely researched topic in EDM research. It enables instructors to support at-risk students to stay on track, preventing student dropout or failure. Previous studies have predicted students’ learning performance to identify at-risk students by using machine learning on data collected from e-learning platforms. However, most studies aimed to identify at-risk students utilizing the entire course data after the course finished. This does not correspond to the real-world scenario that at-risk students may drop out before the course ends. To address this problem, we introduce an RNN-Attention-KD (knowledge distillation) framework to predict at-risk students early throughout a course. It leverages the strengths of Recurrent Neural Networks (RNNs) in handling time-sequence data to predict students’ performance at each time step and employs an attention mechanism to focus on relevant time steps for improved predictive accuracy. At the same time, KD is applied to compress the time steps to facilitate early prediction. In an empirical evaluation, RNN-Attention-KD outperforms traditional neural network models in terms of recall and F1-measure. For example, it obtained recall and F1-measure of 0.49 and 0.51 for Weeks 1–3 and 0.51 and 0.61 for Weeks 1–6 across all datasets from four years of a university course. Then, an ablation study investigated the contributions of different knowledge transfer methods (distillation objectives). We found that hint loss from the hidden layer of RNN and context vector loss from the attention module on RNN could enhance the model’s prediction performance for identifying at-risk students. These results are relevant for EDM researchers employing deep learning models.

[LG-30] Dynamic User Interface Generation for Enhanced Human-Computer Interaction Using Variational Autoencoders

Link: https://arxiv.org/abs/2412.14521
Authors: Runsheng Zhang(1),Shixiao Wang(2),Tianfang Xie(3),Shiyu Duan(4),Mengmeng Chen(5) ((1) University of Southern California, (2) School of Visual Arts, (3) Georgia Institute of Technology, (4) Carnegie Mellon University, (5) New York University)
Keywords: study presents, intelligent user interaction, user, user interaction interface, interface generation
Categories: Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
Comments:

Abstract:This study presents a novel approach for intelligent user interaction interface generation and optimization, grounded in the variational autoencoder (VAE) model. With the rapid advancement of intelligent technologies, traditional interface design methods struggle to meet the evolving demands for diversity and personalization, often lacking flexibility in real-time adjustments to enhance the user experience. Human-Computer Interaction (HCI) plays a critical role in addressing these challenges by focusing on creating interfaces that are functional, intuitive, and responsive to user needs. This research leverages the RICO dataset to train the VAE model, enabling the simulation and creation of user interfaces that align with user aesthetics and interaction habits. By integrating real-time user behavior data, the system dynamically refines and optimizes the interface, improving usability and underscoring the importance of HCI in achieving a seamless user experience. Experimental findings indicate that the VAE-based approach significantly enhances the quality and precision of interface generation compared to other methods, including autoencoders (AE), generative adversarial networks (GAN), conditional GANs (cGAN), deep belief networks (DBN), and VAE-GAN. This work contributes valuable insights into HCI, providing robust technical solutions for automated interface generation and enhanced user experience optimization.

[LG-31] A hybrid framework for effective and efficient machine unlearning

Link: https://arxiv.org/abs/2412.14505
Authors: Mingxin Li,Yizhen Yu,Ning Wang,Zhigang Wang,Xiaodong Wang,Haipeng Qu,Jia Xu,Shen Su,Zhichao Yin
Keywords: Recently machine unlearning, users’ privacy concern, solve users’ privacy, Recently machine, privacy concern
Categories: Machine Learning (cs.LG)
Comments: 14 pages, 5 figures, accepted by CSE2024

Abstract: Machine unlearning (MU) has recently been proposed to remove the imprints of revoked samples from already trained model parameters, addressing users' privacy concerns. As alternatives to the runtime-expensive retraining from scratch, there exist two research lines, exact MU and approximate MU, which trade off accuracy and efficiency differently. In this paper, we present a novel hybrid strategy on top of them to achieve an overall success. It implements the unlearning operation with an acceptable computation cost, while simultaneously improving the accuracy as much as possible. Specifically, it selects a suitable unlearning technique by estimating the retraining workload caused by revocations. If the workload is lightweight, it performs retraining to derive model parameters consistent with the accurate ones retrained from scratch. Otherwise, it outputs the unlearned model by directly modifying the current parameters, for better efficiency. In particular, to improve the accuracy in the latter case, we propose an optimized version that amends the output model with a lightweight runtime penalty. We also study the boundary between the two approaches in our framework so that the selection can be made adaptively. Extensive experiments on real datasets validate that our proposals can improve the unlearning efficiency by 1.5× to 8× while achieving comparable accuracy.

[LG-32] Guided Diffusion Model for Sensor Data Obfuscation

Link: https://arxiv.org/abs/2412.14499
Authors: Xin Yang,Omid Ardakanian
Keywords: Internet of Things, devices carries detailed, collected by Internet, carries detailed information, Sensor data collected
Categories: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
Comments:

Abstract: Sensor data collected by Internet of Things (IoT) devices carries detailed information about individuals in their vicinity. Sharing this data with a semi-trusted service provider may compromise the individuals' privacy, as sensitive information can be extracted by powerful machine learning models. Data obfuscation empowered by generative models is a promising approach to generate synthetic sensor data such that the useful information contained in the original data is preserved and the sensitive information is obscured. This newly generated data will then be shared with the service provider instead of the original sensor data. In this work, we propose PrivDiffuser, a novel data obfuscation technique based on a denoising diffusion model that attains a superior trade-off between data utility and privacy through effective guidance techniques. Specifically, we extract latent representations that contain information about public and private attributes from sensor data to guide the diffusion model, and impose mutual information-based regularization when learning the latent representations to alleviate the entanglement of public and private attributes, thereby increasing the effectiveness of guidance. Evaluation on three real-world datasets containing different sensing modalities reveals that PrivDiffuser yields a better privacy-utility trade-off than the state-of-the-art obfuscation model, decreasing the utility loss by up to 1.81% and the privacy loss by up to 3.42%. Moreover, we show that users with diverse privacy needs can use PrivDiffuser to protect their privacy without having to retrain the model.

[LG-33] MAIDS: Malicious Agent Identification-based Data Security Model for Cloud Environments

Link: https://arxiv.org/abs/2412.14490
Authors: Kishu Gupta,Deepika Saxena,Rishabh Gupta,Ashutosh Kumar Singh
Keywords: malicious agent, malicious, data, cloud environment, Malicious Agent Identification-based
Categories: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
Comments: 28 pages, 10 figures

Abstract: With the vigorous development of cloud computing, most organizations have shifted their data and applications to the cloud environment for storage, computation, and sharing purposes. During storage and data sharing across the participating entities, a malicious agent may gain access to outsourced data from the cloud environment. A malicious agent is an entity that deliberately breaches the data. The accessed information might be misused or revealed to unauthorized parties. Therefore, data protection and prediction of malicious agents have become a demanding task that needs to be addressed appropriately. To deal with this crucial and challenging issue, this paper presents a Malicious Agent Identification-based Data Security (MAIDS) Model, which utilizes the XGBoost machine learning classification algorithm for securing data allocation and communication among different participating entities in the cloud system. The proposed model explores and computes multiple intended security parameters associated with online data communication or transactions. Correspondingly, a security-focused knowledge database is produced for developing the XGBoost Classifier-based Malicious Agent Prediction (XC-MAP) unit. Unlike the existing approaches, which only identify malicious agents after data leaks, MAIDS proactively identifies malicious agents by examining their eligibility for respective data access. In this way, the model provides a comprehensive solution to safeguard crucial data from both intentional and non-intentional breaches: it evaluates each agent's behavior, predicts malicious agents before granting data, and grants data to authorized agents only.

[LG-34] Graph-Structured Topic Modeling for Documents with Spatial or Covariate Dependencies

Link: https://arxiv.org/abs/2412.14477
Authors: Yeo Jin Jung,Claire Donnat
Keywords: incorporating document-level metadata, improve topic mixture, address the challenge, incorporating document-level, incorporating document-level covariates
Categories: Machine Learning (cs.LG); Methodology (stat.ME)
Comments:

Abstract:We address the challenge of incorporating document-level metadata into topic modeling to improve topic mixture estimation. To overcome the computational complexity and lack of theoretical guarantees in existing Bayesian methods, we extend probabilistic latent semantic indexing (pLSI), a frequentist framework for topic modeling, by incorporating document-level covariates or known similarities between documents through a graph formalism. Modeling documents as nodes and edges denoting similarities, we propose a new estimator based on a fast graph-regularized iterative singular value decomposition (SVD) that encourages similar documents to share similar topic mixture proportions. We characterize the estimation error of our proposed method by deriving high-probability bounds and develop a specialized cross-validation method to optimize our regularization parameters. We validate our model through comprehensive experiments on synthetic datasets and three real-world corpora, demonstrating improved performance and faster inference compared to existing Bayesian methods.

[LG-35] Benign Overfitting in Out-of-Distribution Generalization of Linear Models

Link: https://arxiv.org/abs/2412.14474
Authors: Shange Tang, Jiayun Wu, Jianqing Fan, Chi Jin
Keywords: training data perfectly, unseen test data, Benign overfitting refers, data perfectly, Benign overfitting
Categories: Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments: 58 pages, 1 figure

Abstract:Benign overfitting refers to the phenomenon where an over-parameterized model fits the training data perfectly, including noise in the data, but still generalizes well to the unseen test data. While prior work provides some theoretical understanding of this phenomenon under the in-distribution setup, modern machine learning often operates in a more challenging Out-of-Distribution (OOD) regime, where the target (test) distribution can be rather different from the source (training) distribution. In this work, we take an initial step towards understanding benign overfitting in the OOD regime by focusing on the basic setup of over-parameterized linear models under covariate shift. We provide non-asymptotic guarantees proving that benign overfitting occurs in standard ridge regression, even under the OOD regime when the target covariance satisfies certain structural conditions. We identify several vital quantities relating to source and target covariance, which govern the performance of OOD generalization. Our result is sharp, which provably recovers prior in-distribution benign overfitting guarantee [Tsigler and Bartlett, 2023], as well as under-parameterized OOD guarantee [Ge et al., 2024] when specializing to each setup. Moreover, we also present theoretical results for a more general family of target covariance matrix, where standard ridge regression only achieves a slow statistical rate of O(1/\sqrt{n}) for the excess risk, while Principal Component Regression (PCR) is guaranteed to achieve the fast rate O(1/n), where n is the number of samples.

[LG-36] Balanced Gradient Sample Retrieval for Enhanced Knowledge Retention in Proxy-based Continual Learning

Link: https://arxiv.org/abs/2412.14430
Authors: Hongye Xu, Jan Wasilewski, Bartosz Krawczyk
Keywords: deep neural networks, subsequent training, deep neural, neural networks, networks often suffers
Categories: Machine Learning (cs.LG)
Comments:

Abstract:Continual learning in deep neural networks often suffers from catastrophic forgetting, where representations for previous tasks are overwritten during subsequent training. We propose a novel sample retrieval strategy from the memory buffer that leverages both gradient-conflicting and gradient-aligned samples to effectively retain knowledge about past tasks within a supervised contrastive learning framework. Gradient-conflicting samples are selected for their potential to reduce interference by re-aligning gradients, thereby preserving past task knowledge. Meanwhile, gradient-aligned samples are incorporated to reinforce stable, shared representations across tasks. By balancing gradient correction from conflicting samples with alignment reinforcement from aligned ones, our approach increases the diversity among retrieved instances and achieves superior alignment in parameter space, significantly enhancing knowledge retention and mitigating proxy drift. Empirical results demonstrate that using both sample types outperforms methods relying solely on one sample type or random retrieval. Experiments on popular continual learning benchmarks in computer vision validate our method’s state-of-the-art performance in mitigating forgetting while maintaining competitive accuracy on new tasks.
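
The abstract does not spell out how samples are scored, so the following is only a minimal sketch of the balanced-retrieval idea. It assumes, hypothetically, that every buffered sample has a precomputed gradient vector and that "conflicting" and "aligned" are judged by cosine similarity against the gradient of the current batch. The names `cosine` and `balanced_retrieval` are illustrative and not taken from the paper.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def balanced_retrieval(buffer_grads, current_grad, k):
    """Pick up to k gradient-conflicting and up to k gradient-aligned samples.

    Assumption: conflict/alignment is measured by the sign and size of the
    cosine similarity with the current gradient.
    """
    sims = [cosine(g, current_grad) for g in buffer_grads]
    order = sorted(range(len(sims)), key=lambda i: sims[i])  # most negative first
    conflicting = [i for i in order[:k] if sims[i] < 0.0]
    aligned = [i for i in order[::-1][:k] if sims[i] > 0.0]
    return conflicting, aligned


# Toy buffer of five 2-D "gradients" (made-up numbers for illustration only).
toy_buffer = [[1.0, 0.0], [-1.0, 0.0], [0.0, 1.0], [0.5, 0.5], [-0.5, -0.5]]
conflicting_ids, aligned_ids = balanced_retrieval(toy_buffer, [1.0, 0.0], k=2)
```

On this toy buffer, `conflicting_ids` is `[1, 4]` and `aligned_ids` is `[0, 3]`; the orthogonal sample at index 2 is chosen by neither side.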

[LG-37] Fingerprinting Codes Meet Geometry: Improved Lower Bounds for Private Query Release and Adaptive Data Analysis

Link: https://arxiv.org/abs/2412.14396
Authors: Xin Lyu, Kunal Talwar
Keywords: log, lower bounds, alpha, sqrt, bounds
Categories: Data Structures and Algorithms (cs.DS); Cryptography and Security (cs.CR); Machine Learning (cs.LG)
Comments: Abstract slightly shortened to meet the arXiv requirement; 50 Pages and 1 Figure

Abstract:Fingerprinting codes are a crucial tool for proving lower bounds in differential privacy. They have been used to prove tight lower bounds for several fundamental questions, especially in the "low accuracy" regime. Unlike reconstruction/discrepancy approaches, however, they are more suited for query sets that arise naturally from the fingerprinting codes construction. In this work, we propose a general framework for proving fingerprinting type lower bounds, that allows us to tailor the technique to the geometry of the query set. Our approach allows us to prove several new results, including the following. First, we show that any (sample- and population-)accurate algorithm for answering Q arbitrary adaptive counting queries over a universe \mathcal{X} to accuracy \alpha needs \Omega(\frac{\sqrt{\log |\mathcal{X}|} \cdot \log Q}{\alpha^3}) samples, matching known upper bounds. This shows that the approaches based on differential privacy are optimal for this question, and improves significantly on the previously known lower bounds of \frac{\log Q}{\alpha^2} and \min(\sqrt{Q}, \sqrt{\log |\mathcal{X}|})/\alpha^2. Second, we show that any (\varepsilon,\delta)-DP algorithm for answering Q counting queries to accuracy \alpha needs \Omega(\frac{\sqrt{\log |\mathcal{X}| \log(1/\delta)} \log Q}{\varepsilon\alpha^2}) samples, matching known upper bounds up to constants. Our framework allows for proving this bound via a direct correlation analysis and improves the prior bound of [BUV’14] by \sqrt{\log(1/\delta)}. Third, we characterize the sample complexity of answering a set of random 0-1 queries under approximate differential privacy. We give new upper and lower bounds in different regimes. By combining them with known results, we can complete the whole picture.
Cite as: arXiv:2412.14396 [cs.DS] (or arXiv:2412.14396v1 [cs.DS] for this version). DOI: https://doi.org/10.48550/arXiv.2412.14396. Submission history: [v1] Wed, 18 Dec 2024 23:11:07 UTC (66 KB), from Xin Lyu.

[LG-38] Nemesis: Noise-randomized Encryption with Modular Efficiency and Secure Integration in Machine Learning Systems

Link: https://arxiv.org/abs/2412.14392
Authors: Dongfang Zhao
Keywords: Fully Homomorphic Encryption, Homomorphic Encryption, Fully Homomorphic, exposing sensitive information, Machine learning
Categories: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
Comments:

Abstract:Machine learning (ML) systems that guarantee security and privacy often rely on Fully Homomorphic Encryption (FHE) as a cornerstone technique, enabling computations on encrypted data without exposing sensitive information. However, a critical limitation of FHE is its computational inefficiency, making it impractical for large-scale applications. In this work, we propose Nemesis, a framework that accelerates FHE-based systems without compromising accuracy or security. The design of Nemesis is inspired by Rache (SIGMOD’23), which introduced a caching mechanism for encrypted integers and scalars. Nemesis extends this idea with more advanced caching techniques and mathematical tools, enabling efficient operations over multi-slot FHE schemes and overcoming Rache’s limitations to support general plaintext structures. We formally prove the security of Nemesis under standard cryptographic assumptions and evaluate its performance extensively on widely used datasets, including MNIST, FashionMNIST, and CIFAR-10. Experimental results show that Nemesis significantly reduces the computational overhead of FHE-based ML systems, paving the way for broader adoption of privacy-preserving technologies.

[LG-39] Scaling Deep Learning Training with MPMD Pipeline Parallelism

Link: https://arxiv.org/abs/2412.14374
Authors: Anxhelo Xhebraj, Sean Lee, Hanfeng Chen, Vinod Grover
Keywords: large deep learning, deep learning models, system for efficiently, efficiently scaling, scaling the training
Categories: Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG); Programming Languages (cs.PL)
Comments:

Abstract:We present JaxPP, a system for efficiently scaling the training of large deep learning models with flexible pipeline parallelism. We introduce a seamless programming model that allows implementing user-defined pipeline schedules for gradient accumulation. JaxPP automatically distributes tasks, corresponding to pipeline stages, over a cluster of nodes and automatically infers the communication among them. We implement an MPMD runtime for asynchronous execution of SPMD tasks. The pipeline parallelism implementation of JaxPP improves hardware utilization by up to 1.11\times with respect to the best performing SPMD configuration.

[LG-40] Implementing TD3 to train a Neural Network to fly a Quadcopter through an FPV Gate

Link: https://arxiv.org/abs/2412.14367
Authors: Patrick Thomas, Kevin Schroeder, Jonathan Black
Keywords: Deep Reinforcement learning, Twin Delayed Deep, Delayed Deep Deterministic, Deep Reinforcement, Deep Deterministic Policy
Categories: Robotics (cs.RO); Machine Learning (cs.LG)
Comments:

Abstract:Deep reinforcement learning has been shown to be a powerful tool for developing policies in environments where an optimal solution is unclear. In this paper, we attempt to apply Twin Delayed Deep Deterministic Policy Gradients to train a neural network to act as a velocity controller for a quadcopter. The quadcopter’s objective is to quickly fly through a gate while avoiding crashing into the gate. We transfer our trained policy to the real world by deploying it on a quadcopter in a laboratory environment. Finally, we demonstrate that the trained policy is able to navigate the drone to the gate in the real world.
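
For readers unfamiliar with Twin Delayed Deep Deterministic Policy Gradients (TD3), the sketch below shows the critic target used by the standard algorithm: the target action is perturbed with clipped Gaussian noise, and the smaller of two target critics is bootstrapped. The function name `td3_target`, the callables passed in, and the default values for noise and action bounds are assumptions for illustration; the abstract does not state the hyperparameters or network interfaces used for the quadcopter.

```python
import random


def clip(x, low, high):
    """Clamp x to the closed interval [low, high]."""
    return max(low, min(high, x))


def td3_target(reward, next_state, done, target_policy, target_q1, target_q2,
               gamma=0.99, noise_std=0.2, noise_clip=0.5,
               action_low=-1.0, action_high=1.0, rng=random):
    """Compute the TD3 critic target for a single transition.

    target_policy(state) -> list of action components (hypothetical interface)
    target_q1/target_q2(state, action) -> float (hypothetical interface)
    """
    action = target_policy(next_state)
    # Target policy smoothing: add clipped noise, then clip to the action bounds.
    smoothed = [
        clip(a + clip(rng.gauss(0.0, noise_std), -noise_clip, noise_clip),
             action_low, action_high)
        for a in action
    ]
    # Clipped double-Q: bootstrap from the more pessimistic target critic.
    q_min = min(target_q1(next_state, smoothed), target_q2(next_state, smoothed))
    return reward + gamma * (1.0 - float(done)) * q_min
```

Taking the minimum of the two critics is what counters value overestimation. When `done` is true, the bootstrap term vanishes and the target equals the immediate reward.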

[LG-41] Stealing That Free Lunch: Exposing the Limits of Dyna-Style Reinforcement Learning

Link: https://arxiv.org/abs/2412.14312
Authors: Brett Barkley, David Fridovich-Keil
Keywords: state transition data, model-based reinforcement learning, generating synthetic state, synthetic state transition, off-policy model-based reinforcement
Categories: Machine Learning (cs.LG)
Comments:

Abstract:Dyna-style off-policy model-based reinforcement learning (DMBRL) algorithms are a family of techniques for generating synthetic state transition data and thereby enhancing the sample efficiency of off-policy RL algorithms. This paper identifies and investigates a surprising performance gap observed when applying DMBRL algorithms across different benchmark environments with proprioceptive observations. We show that, while DMBRL algorithms perform well in OpenAI Gym, their performance can drop significantly in DeepMind Control Suite (DMC), even though these settings offer similar tasks and identical physics backends. Modern techniques designed to address several key issues that arise in these settings do not provide a consistent improvement across all environments, and overall our results show that adding synthetic rollouts to the training process – the backbone of Dyna-style algorithms – significantly degrades performance across most DMC environments. Our findings contribute to a deeper understanding of several fundamental challenges in model-based RL and show that, like many optimization fields, there is no free lunch when evaluating performance across diverse benchmarks in RL.

[LG-42] Reinforcement Learning from Automatic Feedback for High-Quality Unit Test Generation ICSE

Link: https://arxiv.org/abs/2412.14308
Authors: Benjamin Steenhoek, Michele Tufano, Neel Sundaresan, Alexey Svyatkovskiy
Keywords: Large Language Models, Large Language, Reinforcement Learning, Language Models, Quality Metrics
Categories: Software Engineering (cs.SE); Machine Learning (cs.LG)
Comments: Accepted to DeepTest 2025 (ICSE Workshop). arXiv admin note: text overlap with arXiv:2310.02368

Abstract:Software testing is a crucial but time-consuming aspect of software development, and recently, Large Language Models (LLMs) have gained popularity for automated test case generation. However, because LLMs are trained on vast amounts of open-source code, they often generate test cases that do not adhere to best practices and may even contain test smells (anti-patterns). To address this issue, we propose Reinforcement Learning from Static Quality Metrics (RLSQM), wherein we utilize Reinforcement Learning to generate high-quality unit tests based on static analysis-based quality metrics. First, we analyzed LLM-generated tests and show that LLMs frequently do generate undesirable test smells, up to 37% of the time. Then, we implemented a lightweight static analysis-based reward model and trained LLMs using this reward model to optimize for five code quality metrics. Our experimental results demonstrate that the RL-optimized Codex model consistently generated higher-quality test cases than the base LLM, improving quality metrics by up to 23%, and generated nearly 100% syntactically-correct code. RLSQM also outperformed GPT-4 on all code quality metrics, in spite of training a substantially cheaper Codex model. We provide insights into how to reliably utilize RL to improve test generation quality and show that RLSQM is a significant step towards enhancing the overall efficiency and reliability of automated software testing. Our data are available at this https URL.

[LG-43] Closing the Gap: A User Study on the Real-world Usefulness of AI-powered Vulnerability Detection & Repair in the IDE ICSE2025

Link: https://arxiv.org/abs/2412.14306
Authors: Benjamin Steenhoek, Kalpathy Sivaraman, Renata Saldivar Gonzalez, Yevhen Mohylevskyy, Roshanak Zilouchian Moghaddam, Wei Le
Keywords: professional software developers, paper presents, detection and fix, vulnerability detection, fix tools show
Categories: Software Engineering (cs.SE); Cryptography and Security (cs.CR); Machine Learning (cs.LG)
Comments: Accepted to ICSE 2025 research track. Camera-ready version

Abstract:This paper presents the first empirical study of a vulnerability detection and fix tool with professional software developers on real projects that they own. We implemented DeepVulGuard, an IDE-integrated tool based on state-of-the-art detection and fix models, and show that it has promising performance on benchmarks of historic vulnerability data. DeepVulGuard scans code for vulnerabilities (including identifying the vulnerability type and vulnerable region of code), suggests fixes, and provides natural-language explanations for alerts and fixes, leveraging chat interfaces. We recruited 17 professional software developers at Microsoft, observed their usage of the tool on their code, and conducted interviews to assess the tool’s usefulness, speed, trust, relevance, and workflow integration. We also gathered detailed qualitative feedback on users’ perceptions and their desired features. Study participants scanned a total of 24 projects, 6.9k files, and over 1.7 million lines of source code, and generated 170 alerts and 50 fix suggestions. We find that although state-of-the-art AI-powered detection and fix tools show promise, they are not yet practical for real-world use due to a high rate of false positives and non-applicable fixes. User feedback reveals several actionable pain points, ranging from incomplete context to lack of customization for the user’s codebase. Additionally, we explore how AI features, including confidence scores, explanations, and chat interaction, can apply to vulnerability detection and fixing. Based on these insights, we offer practical recommendations for evaluating and deploying AI detection and fix models. Our code and data are available at this https URL.

[LG-44] The Multiplex Classification Framework: optimizing multi-label classifiers through problem transformation, ontology engineering, and model ensembling

Link: https://arxiv.org/abs/2412.14299
Authors: Mauro Nievas Offidani, Facundo Roffet, Claudio Augusto Delrieux, Maria Carolina Gonzalez Galtier, Marcos Zarate
Keywords: fundamental task, Multiplex Classification Framework, Multiplex approach, Classification, Multiplex
Categories: Machine Learning (cs.LG)
Comments: 43 pages, 15 figures, submitted to Applied Ontology

Abstract:Classification is a fundamental task in machine learning. While conventional methods, such as binary, multiclass, and multi-label classification, are effective for simpler problems, they may not adequately address the complexities of some real-world scenarios. This paper introduces the Multiplex Classification Framework, a novel approach developed to tackle these and similar challenges through the integration of problem transformation, ontology engineering, and model ensembling. The framework offers several advantages, including adaptability to any number of classes and logical constraints, an innovative method for managing class imbalance, the elimination of confidence threshold selection, and a modular structure. Two experiments were conducted to compare the performance of conventional classification models with the Multiplex approach. Our results demonstrate that the Multiplex approach can improve classification performance significantly (up to 10% gain in overall F1 score), particularly in classification problems with a large number of classes and pronounced class imbalances. However, it also has limitations, as it requires a thorough understanding of the problem domain and some experience with ontology engineering, and it involves training multiple models, which can make the whole process more intricate. Overall, this methodology provides a valuable tool for researchers and practitioners dealing with complex classification problems in machine learning.

[LG-45] Distributionally Robust Policy Learning under Concept Drifts

Link: https://arxiv.org/abs/2412.14297
Authors: Jingyuan Wang, Zhimei Ren, Ruohan Zhan, Zhengyuan Zhou
Keywords: Distributionally robust policy, robust policy learning, policy learning aims, worst-case distributional shift, Distributionally robust
Categories: Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments:

Abstract:Distributionally robust policy learning aims to find a policy that performs well under the worst-case distributional shift, and yet most existing methods for robust policy learning consider the worst-case joint distribution of the covariate and the outcome. The joint-modeling strategy can be unnecessarily conservative when we have more information on the source of distributional shifts. This paper studies a more nuanced problem, robust policy learning under the concept drift, when only the conditional relationship between the outcome and the covariate changes. To this end, we first provide a doubly-robust estimator for evaluating the worst-case average reward of a given policy under a set of perturbed conditional distributions. We show that the policy value estimator enjoys asymptotic normality even if the nuisance parameters are estimated with a slower-than-root-n rate. We then propose a learning algorithm that outputs the policy maximizing the estimated policy value within a given policy class \Pi, and show that the sub-optimality gap of the proposed algorithm is of the order \kappa(\Pi) n^{-1/2}, where \kappa(\Pi) is the entropy integral of \Pi under the Hamming distance and n is the sample size. A matching lower bound is provided to show the optimality of the rate. The proposed methods are implemented and evaluated in numerical studies, demonstrating substantial improvement compared with existing benchmarks.

[LG-46] FedSTaS: Client Stratification and Client Level Sampling for Efficient Federated Learning ICML

Link: https://arxiv.org/abs/2412.14226
Authors: Jordan Slessor, Dezheng Kong, Xiaofen Tang, Zheng En Than, Linglong Kong
Keywords: machine learning methodology, multiple decentralized clients, methodology that involves, involves the collaborative
Categories: Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments: 6 pages, 3 figures, to be submitted to ICML

Abstract:Federated learning (FL) is a machine learning methodology that involves the collaborative training of a global model across multiple decentralized clients in a privacy-preserving way. Several FL methods are introduced to tackle communication inefficiencies but do not address how to sample participating clients in each round effectively and in a privacy-preserving manner. In this paper, we propose FedSTaS, a client and data-level sampling method inspired by FedSTS and FedSampling. In each federated learning round, FedSTaS stratifies clients based on their compressed gradients, re-allocates the number of clients to sample using an optimal Neyman allocation, and samples local data from each participating client using a data uniform sampling strategy. Experiments on three datasets show that FedSTaS can achieve higher accuracy scores than those of FedSTS within a fixed number of training rounds.
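
The Neyman allocation mentioned in the abstract is the classical rule from stratified sampling: stratum h receives a share of the sample proportional to N_h * S_h, where N_h is the stratum size and S_h its standard deviation. The sketch below applies that textbook rule to client strata. The function name, the largest-remainder rounding, and the numbers in the example are illustrative assumptions; how the paper estimates per-stratum variability from compressed gradients is not described in the abstract.

```python
def neyman_allocation(stratum_sizes, stratum_stds, total_samples):
    """Split total_samples across strata in proportion to N_h * S_h.

    Rounds with the largest-remainder method so the result sums to
    total_samples. Does not cap an allocation at the stratum size.
    """
    weights = [n * s for n, s in zip(stratum_sizes, stratum_stds)]
    if sum(weights) == 0:
        # All strata have zero spread: fall back to proportional allocation.
        weights = [float(n) for n in stratum_sizes]
    total_weight = sum(weights)
    if total_weight == 0:
        raise ValueError("at least one stratum must be non-empty")

    raw = [total_samples * w / total_weight for w in weights]
    allocation = [int(r) for r in raw]  # floor, since all values are >= 0
    leftover = total_samples - sum(allocation)
    # Hand the remaining samples to the strata with the largest fractional parts.
    by_remainder = sorted(range(len(raw)),
                          key=lambda h: raw[h] - allocation[h], reverse=True)
    for h in by_remainder[:leftover]:
        allocation[h] += 1
    return allocation


# Three hypothetical client strata: sizes 50/30/20, spreads 1.0/2.0/4.0.
clients_per_stratum = neyman_allocation([50, 30, 20], [1.0, 2.0, 4.0], 10)
```

With these made-up inputs the result is `[3, 3, 4]`: the smallest stratum gets the most clients because its spread is the largest.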

[LG-47] Towards Precise Prediction Uncertainty in GNNs: Refining GNNs with Topology-grouping Strategy AAAI2025

Link: https://arxiv.org/abs/2412.14223
Authors: Hyunjin Seo, Kyusung Seo, Joonhyung Park, Eunho Yang
Keywords: graph neural networks, calibrating model predictions, neural networks, pivotal component, Recent advancements
Categories: Machine Learning (cs.LG)
Comments: Accepted at AAAI 2025

Abstract:Recent advancements in graph neural networks (GNNs) have highlighted the critical need of calibrating model predictions, with neighborhood prediction similarity recognized as a pivotal component. Existing studies suggest that nodes with analogous neighborhood prediction similarity often exhibit similar calibration characteristics. Building on this insight, recent approaches incorporate neighborhood similarity into node-wise temperature scaling techniques. However, our analysis reveals that this assumption does not hold universally. Calibration errors can differ significantly even among nodes with comparable neighborhood similarity, depending on their confidence levels. This necessitates a re-evaluation of existing GNN calibration methods, as a single, unified approach may lead to sub-optimal calibration. In response, we introduce Simi-Mailbox, a novel approach that categorizes nodes by both neighborhood similarity and their own confidence, irrespective of proximity or connectivity. Our method allows fine-grained calibration by employing group-specific temperature scaling, with each temperature tailored to address the specific miscalibration level of affiliated nodes, rather than adhering to a uniform trend based on neighborhood similarity. Extensive experiments demonstrate the effectiveness of our Simi-Mailbox across diverse datasets on different GNN architectures, achieving up to 13.79% error reduction compared to uncalibrated GNN predictions.
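
Group-specific temperature scaling, as described above, divides a node's logits by a temperature chosen according to the group the node falls into. The sketch below is a simplified illustration rather than the authors' implementation: it assumes, hypothetically, that groups are formed by equal-width binning of neighborhood similarity and confidence (both in [0, 1]) and that the per-group temperatures have already been fitted elsewhere. The function names and numbers are invented.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax of logits divided by a temperature."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def assign_group(similarity, confidence, n_bins=2):
    """Map (similarity, confidence) to one of n_bins * n_bins groups.

    Assumption: equal-width bins stand in for whatever grouping the paper uses.
    """
    s_bin = min(int(similarity * n_bins), n_bins - 1)
    c_bin = min(int(confidence * n_bins), n_bins - 1)
    return s_bin * n_bins + c_bin


def calibrate(logits, similarity, temperatures, n_bins=2):
    """Rescale one node's logits with the temperature of its group."""
    confidence = max(softmax(logits))
    group = assign_group(similarity, confidence, n_bins)
    return softmax(logits, temperatures[group])


# One node with high neighborhood similarity and high confidence.
group_temperatures = [1.0, 1.0, 1.0, 2.0]  # made-up values, one per group
calibrated = calibrate([2.0, 0.0], similarity=0.9, temperatures=group_temperatures)
```

Here the node lands in the last group, whose temperature of 2.0 lowers its top-class probability from about 0.88 to about 0.73, while nodes in the other groups would keep their original probabilities.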

[LG-48] Detecting Dark Patterns in User Interfaces Using Logistic Regression and Bag-of-Words Representation

Link: https://arxiv.org/abs/2412.14187
Authors: Aliyu Umar, Maaruf Lawan, Adamu Lawan, Abdullahi Abdulkadir, Mukhtar Dahiru
Keywords: manipulate users’ behavior, represent deceptive design, involuntary data disclosures, deceptive design practices, design practices intended
Categories: Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
Comments:

Abstract:Dark patterns in user interfaces represent deceptive design practices intended to manipulate users’ behavior, often leading to unintended consequences such as coerced purchases, involuntary data disclosures, or user frustration. Detecting and mitigating these dark patterns is crucial for promoting transparency, trust, and ethical design practices in digital environments. This paper proposes a novel approach for detecting dark patterns in user interfaces using logistic regression and bag-of-words representation. Our methodology involves collecting a diverse dataset of user interface text samples, preprocessing the data, extracting text features using the bag-of-words representation, training a logistic regression model, and evaluating its performance using various metrics such as accuracy, precision, recall, F1-score, and the area under the ROC curve (AUC). Experimental results demonstrate the effectiveness of the proposed approach in accurately identifying instances of dark patterns, with high predictive performance and robustness to variations in dataset composition and model parameters. The insights gained from this study contribute to the growing body of knowledge on dark patterns detection and classification, offering practical implications for designers, developers, and policymakers in promoting ethical design practices and protecting user rights in digital environments.
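
The pipeline named in the abstract (bag-of-words features followed by logistic regression) can be sketched in a few lines of plain Python. Everything concrete here is an assumption made for illustration: the four interface strings and their labels are invented, whitespace tokenization replaces any real preprocessing, and the learning rate and epoch count are arbitrary. It is not the authors' dataset or code.

```python
import math


def build_vocabulary(texts):
    """Assign an index to every distinct lowercase token."""
    vocab = {}
    for text in texts:
        for token in text.lower().split():
            vocab.setdefault(token, len(vocab))
    return vocab


def bag_of_words(text, vocab):
    """Count how often each vocabulary token occurs; unknown tokens are ignored."""
    counts = [0.0] * len(vocab)
    for token in text.lower().split():
        if token in vocab:
            counts[vocab[token]] += 1.0
    return counts


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_logistic_regression(features, labels, lr=0.5, epochs=200):
    """Full-batch gradient descent on the mean logistic loss."""
    weights = [0.0] * len(features[0])
    bias = 0.0
    n = len(features)
    for _ in range(epochs):
        grad_w = [0.0] * len(weights)
        grad_b = 0.0
        for x, y in zip(features, labels):
            error = sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias) - y
            for j, xi in enumerate(x):
                grad_w[j] += error * xi
            grad_b += error
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n
    return weights, bias


def predict_proba(text, vocab, weights, bias):
    """Probability that a text is a dark pattern under the trained model."""
    x = bag_of_words(text, vocab)
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)


# Invented toy data: 1 = dark pattern, 0 = neutral interface text.
texts = ["only 2 left hurry buy now", "limited time offer ends soon",
         "view order history page", "update shipping address details"]
labels = [1, 1, 0, 0]
vocab = build_vocabulary(texts)
weights, bias = train_logistic_regression(
    [bag_of_words(t, vocab) for t in texts], labels)
```

After training, `predict_proba("hurry offer ends now", vocab, weights, bias)` is above 0.5 and `predict_proba("view order", vocab, weights, bias)` is below 0.5. A real evaluation would use held-out data and the metrics listed in the abstract.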

[LG-49] Optimally Solving Simultaneous-Move Dec-POMDPs: The Sequential Central Planning Approach

Link: https://arxiv.org/abs/2408.13139
Authors: Johan Peralez, Aurèlien Delage, Jacopo Castellini, Rafael F. Cunha, Jilles S. Dibangoye
Keywords: Markov decision processes, partially observable Markov, observable Markov decision, optimally solving decentralized, solving decentralized partially
Categories: Machine Learning (cs.LG); Multiagent Systems (cs.MA)
Comments:

Abstract:The centralized training for decentralized execution paradigm emerged as the state-of-the-art approach to \epsilon-optimally solving decentralized partially observable Markov decision processes. However, scalability remains a significant issue. This paper presents a novel and more scalable alternative, namely the sequential-move centralized training for decentralized execution. This paradigm further pushes the applicability of Bellman’s principle of optimality, raising three new properties. First, it allows a central planner to reason upon sufficient sequential-move statistics instead of prior simultaneous-move ones. Next, it proves that \epsilon-optimal value functions are piecewise linear and convex in such sufficient sequential-move statistics. Finally, it drops the complexity of the backup operators from double exponential to polynomial at the expense of longer planning horizons. Besides, it makes it easy to use single-agent methods, e.g., the SARSA algorithm enhanced with these findings, while still preserving convergence guarantees. Experiments on two- as well as many-agent domains from the literature against \epsilon-optimal simultaneous-move solvers confirm the superiority of our novel approach. This paradigm opens the door for efficient planning and reinforcement learning methods for multi-agent systems.

[LG-50] On Convex Optimal Value Functions For POSGs

Link: https://arxiv.org/abs/2311.09459
Authors: Rafael F. Cunha, Jacopo Castellini, Johan Peralez, Jilles S. Dibangoye
Keywords: Observable Stochastic Games, Partially Observable Stochastic, Multi-agent planning, communication costs, due to communication
Categories: Multiagent Systems (cs.MA); Machine Learning (cs.LG)
Comments: Currently under review at JAIR

Abstract:Multi-agent planning and reinforcement learning can be challenging when agents cannot see the state of the world or communicate with each other due to communication costs, latency, or noise. Partially Observable Stochastic Games (POSGs) provide a mathematical framework for modelling such scenarios. This paper aims to improve the efficiency of planning and reinforcement learning algorithms for POSGs by identifying the underlying structure of optimal state-value functions. The approach involves reformulating the original game from the perspective of a trusted third party who plans on behalf of the agents simultaneously. From this viewpoint, the original POSGs can be viewed as Markov games where states are occupancy states, i.e., posterior probability distributions over the hidden states of the world and the stream of actions and observations that agents have experienced so far. This study mainly proves that the optimal state-value function is a convex function of occupancy states expressed on an appropriate basis in all zero-sum, common-payoff, and Stackelberg POSGs.

[LG-51] Tests for model misspecification in simulation-based inference: from local distortions to global model checks

Link: https://arxiv.org/abs/2412.15100
Authors: Noemi Anau Montel, James Alvey, Christoph Weniger
Keywords: scientific model development, misspecification analysis strategies, Model misspecification, Model misspecification analysis, Model
Categories: Instrumentation and Methods for Astrophysics (astro-ph.IM); Cosmology and Nongalactic Astrophysics (astro-ph.CO); Machine Learning (cs.LG); General Relativity and Quantum Cosmology (gr-qc)
Comments: 11 pages, 5 figures. Code available on GitHub (NoemiAM/mist) at this https URL

Abstract:Model misspecification analysis strategies, such as anomaly detection, model validation, and model comparison are a key component of scientific model development. Over the last few years, there has been a rapid rise in the use of simulation-based inference (SBI) techniques for Bayesian parameter estimation, applied to increasingly complex forward models. To move towards fully simulation-based analysis pipelines, however, there is an urgent need for a comprehensive simulation-based framework for model misspecification analysis. In this work, we provide a solid and flexible foundation for a wide range of model discrepancy analysis tasks, using distortion-driven model misspecification tests. From a theoretical perspective, we introduce the statistical framework built around performing many hypothesis tests for distortions of the simulation model. We also make explicit analytic connections to classical techniques: anomaly detection, model validation, and goodness-of-fit residual analysis. Furthermore, we introduce an efficient self-calibrating training algorithm that is useful for practitioners. We demonstrate the performance of the framework in multiple scenarios, making the connection to classical results where they are valid. Finally, we show how to conduct such a distortion-driven model misspecification test for real gravitational wave data, specifically on the event GW150914.

[LG-52] From Point to probabilistic gradient boosting for claim frequency and severity prediction

Link: https://arxiv.org/abs/2412.14916
Authors: Dominik Chevalier, Marie-Pier Côté
Keywords: traditional generalized linear, generalized linear models, Gradient boosting, show superior predictive, decision tree algorithms
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG)
Comments: 26 pages, 4 figures, 26 tables, 7 algorithms

Abstract:Gradient boosting for decision tree algorithms are increasingly used in actuarial applications as they show superior predictive performance over traditional generalized linear models. Many improvements and sophistications to the first gradient boosting machine algorithm exist. We present in a unified notation, and contrast, all the existing point and probabilistic gradient boosting for decision tree algorithms: GBM, XGBoost, DART, LightGBM, CatBoost, EGBM, PGBM, XGBoostLSS, cyclic GBM, and NGBoost. In this comprehensive numerical study, we compare their performance on five publicly available datasets for claim frequency and severity, of various sizes and comprising different numbers of (high cardinality) categorical variables. We explain how varying exposure-to-risk can be handled with boosting in frequency models. We compare the algorithms on the basis of computational efficiency, predictive performance, and model adequacy. LightGBM and XGBoostLSS win in terms of computational efficiency. The fully interpretable EGBM achieves competitive predictive performance compared to the black box algorithms considered. We find that there is no trade-off between model adequacy and predictive accuracy: both are achievable simultaneously.

[LG-53] Surrogate-assisted multi-objective design of complex multibody systems

Link: https://arxiv.org/abs/2412.14854
Authors: Augustina C. Amakor, Manuel B. Berkemeier, Meike Wohlleben, Walter Sextro, Sebastian Peitz
Keywords: numerically challenging task, multiple conflicting criteria, challenging task, numerically challenging, multiple conflicting
Categories: Optimization and Control (math.OC); Machine Learning (cs.LG)
Comments: arXiv admin note: text overlap with arXiv:2412.01566

Abstract:The optimization of large-scale multibody systems is a numerically challenging task, in particular when considering multiple conflicting criteria at the same time. In this situation, we need to approximate the Pareto set of optimal compromises, which is significantly more expensive than finding a single optimum in single-objective optimization. To prevent large costs, the usage of surrogate models, constructed from a small but informative number of expensive model evaluations, is a very popular and widely studied approach. The central challenge then is to ensure a high quality (that is, near-optimality) of the solutions that were obtained using the surrogate model, which can be hard to guarantee with a single pre-computed surrogate. We present a back-and-forth approach between surrogate modeling and multi-objective optimization to improve the quality of the obtained solutions. Using the example of an expensive-to-evaluate multibody system, we compare different strategies regarding multi-objective optimization, sampling and also surrogate modeling, to identify the most promising approach in terms of computational efficiency and solution quality.
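
To make the notion of a Pareto set concrete, here is a minimal sketch that filters a list of candidate designs down to the non-dominated ones. It assumes every objective is minimized; the names `dominates`, `pareto_front`, and `candidates`, as well as the toy values, are illustrative and not taken from the paper.

```python
def dominates(a, b):
    """True if `a` is no worse than `b` in every objective and strictly better in one."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))


def pareto_front(points):
    """Keep the points that no other point dominates (all objectives minimized)."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


# Hypothetical two-objective evaluations of four candidate designs.
candidates = [(1.0, 4.0), (2.0, 2.0), (3.0, 3.0), (4.0, 1.0)]
front = pareto_front(candidates)
```

The quadratic scan is fine for a handful of surrogate evaluations; it says nothing about how the paper alternates between surrogate modeling and optimization.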

[LG-54] Opportunities and limitations of explaining quantum machine learning

Link: https://arxiv.org/abs/2412.14753
Authors: Elies Gil-Fuster, Jonas R. Naujoks, Grégoire Montavon, Thomas Wiegand, Wojciech Samek, Jens Eisert
Keywords: machine learning models, quantum machine learning, learning models, machine learning, quantum learning models
Categories: Quantum Physics (quant-ph); Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments: 16+16 pages, 3+4 figures

Abstract: A common trait of many machine learning models is that it is often difficult to understand and explain what caused the model to produce the given output. While the explainability of neural networks has been an active field of research in recent years, comparably little is known for quantum machine learning models. Despite a few recent works analyzing some specific aspects of explainability, as of now there is no clear big picture perspective as to what can be expected from quantum learning models in terms of explainability. In this work, we address this issue by identifying promising research avenues in this direction and outlining the expected future results. We additionally propose two explanation methods designed specifically for quantum machine learning models, the first of their kind to the best of our knowledge. Alongside our preview of the field, we compare both existing and novel methods to explain the predictions of quantum learning models. By studying explainability in quantum machine learning, we can contribute to the sustainable development of the field, preventing trust issues in the future.

[LG-55] Deep Learning Based Recalibration of SDSS and DESI BAO Alleviates Hubble and Clustering Tensions

Link: https://arxiv.org/abs/2412.14750
Authors: Rahul Shah, Purba Mukherjee, Soumadeep Saha, Utpal Garain, Supratik Pal
Keywords: Baryon Acoustic Oscillations, Acoustic Oscillations, early universe observations, Baryon Acoustic, Conventional calibration
Categories: Cosmology and Nongalactic Astrophysics (astro-ph.CO); Instrumentation and Methods for Astrophysics (astro-ph.IM); Machine Learning (cs.LG)
Comments: 5 pages, 2 figures, 2 tables. Comments are welcome

Abstract: Conventional calibration of Baryon Acoustic Oscillations (BAO) data relies on estimation of the sound horizon at drag epoch $r_d$ from early universe observations by assuming a cosmological model. We present a recalibration of two independent BAO datasets, SDSS and DESI, by employing deep learning techniques for model-independent estimation of $r_d$, and explore the impacts on $\Lambda$CDM cosmological parameters. Significant reductions in both Hubble ($H_0$) and clustering ($S_8$) tensions are observed for both the recalibrated datasets. Moderate shifts in some other parameters hint towards further exploration of such data-driven approaches.

[LG-56] Permutation recovery of spikes in noisy high-dimensional tensor estimation

Link: https://arxiv.org/abs/2412.14650
Authors: Gérard Ben Arous, Cédric Gerbelot, Vanessa Piccolo
Keywords: Gaussian tensor observations, noisy Gaussian tensor, multi-spiked tensor problem, Gaussian tensor, unknown signal vectors
Categories: Probability (math.PR); Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments: 29 pages, 2 figures. arXiv admin note: substantial text overlap with arXiv:2408.06401

Abstract:We study the dynamics of gradient flow in high dimensions for the multi-spiked tensor problem, where the goal is to estimate r unknown signal vectors (spikes) from noisy Gaussian tensor observations. Specifically, we analyze the maximum likelihood estimation procedure, which involves optimizing a highly nonconvex random function. We determine the sample complexity required for gradient flow to efficiently recover all spikes, without imposing any assumptions on the separation of the signal-to-noise ratios (SNRs). More precisely, our results provide the sample complexity required to guarantee recovery of the spikes up to a permutation. Our work builds on our companion paper [Ben Arous, Gerbelot, Piccolo 2024], which studies Langevin dynamics and determines the sample complexity and separation conditions for the SNRs necessary for ensuring exact recovery of the spikes (where the recovered permutation matches the identity). During the recovery process, the correlations between the estimators and the hidden vectors increase in a sequential manner. The order in which these correlations become significant depends on their initial values and the corresponding SNRs, which ultimately determines the permutation of the recovered spikes.

[LG-57] Fast inverse lithography based on a model-driven block stacking convolutional neural network

Link: https://arxiv.org/abs/2412.14599
Authors: Ruixiang Chen, Yang Zhao, Haoqin Li, Rui Chen
Keywords: Optical Proximity Correction, Optical Proximity Effects, counter Optical Proximity, Optical Proximity, effectively counter Optical
Categories: Optics (physics.optics); Machine Learning (cs.LG)
Comments: 21 pages, 7 figures

Abstract: In the realm of lithography, Optical Proximity Correction (OPC) is a crucial resolution enhancement technique that optimizes the transmission function of photomasks on a pixel basis to effectively counter Optical Proximity Effects (OPE). However, conventional pixel-based OPC methods often generate patterns that pose manufacturing challenges, thereby leading to increased cost in practical scenarios. This paper presents a novel inverse lithographic approach to OPC, employing a model-driven, block stacking deep learning framework that expedites the generation of masks conducive to manufacturing. This method is founded on vector lithography modelling and streamlines the training process by eliminating the requirement for extensive labeled datasets. Furthermore, diversity of mask patterns is enhanced by employing a wave function collapse algorithm, which facilitates the random generation of a multitude of target patterns, therefore significantly expanding the range of mask paradigms. Numerical experiments have substantiated the efficacy of the proposed end-to-end approach, highlighting its superior capability to manage mask complexity within the context of advanced OPC lithography. This advancement is anticipated to enhance the feasibility and economic viability of OPC technology within actual manufacturing environments.

[LG-58] Accelerated Patient-Specific Calibration via Differentiable Hemodynamics Simulations

Link: https://arxiv.org/abs/2412.14572
Authors: Diego Renner, Georgios Kissas
Keywords: computational model, tailor diagnostics, computational, individual patients, model
Categories: Medical Physics (physics.med-ph); Machine Learning (cs.LG); Mathematical Software (cs.MS); Computational Physics (physics.comp-ph); Quantitative Methods (q-bio.QM)

Abstract:One of the goals of personalized medicine is to tailor diagnostics to individual patients. Diagnostics are performed in practice by measuring quantities, called biomarkers, that indicate the existence and progress of a disease. In common cardiovascular diseases, such as hypertension, biomarkers that are closely related to the clinical representation of a patient can be predicted using computational models. Personalizing computational models translates to considering patient-specific flow conditions, for example, the compliance of blood vessels that cannot be a priori known and quantities such as the patient geometry that can be measured using imaging. Therefore, a patient is identified by a set of measurable and nonmeasurable parameters needed to well-define a computational model; else, the computational model is not personalized, meaning it is prone to large prediction errors. Therefore, to personalize a computational model, sufficient information needs to be extracted from the data. The current methods by which this is done are either inefficient, due to relying on slow-converging optimization methods, or hard to interpret, due to using black box deep-learning algorithms. We propose a personalized diagnostic procedure based on a differentiable 0D-1D Navier-Stokes reduced order model solver and fast parameter inference methods that take advantage of gradients through the solver. By providing a faster method for performing parameter inference and sensitivity analysis through differentiability while maintaining the interpretability of well-understood mathematical models and numerical methods, the best of both worlds is combined. The performance of the proposed solver is validated against a well-established process on different geometries, and different parameter inference processes are successfully performed.

[LG-59] Statistical Undersampling with Mutual Information and Support Points

Link: https://arxiv.org/abs/2412.14527
Authors: Alex Mak, Shubham Sahoo, Shivani Pandey, Yidan Yue, Linglong Kong
Keywords: large datasets present, datasets present significant, present significant challenges, poor predictive performance, minority classes
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG)

Abstract: Class imbalance and distributional differences in large datasets present significant challenges for classification tasks in machine learning, often leading to biased models and poor predictive performance for minority classes. This work introduces two novel undersampling approaches: mutual information-based stratified simple random sampling and support points optimization. These methods prioritize representative data selection, effectively minimizing information loss. Empirical results across multiple classification tasks demonstrate that our methods outperform traditional undersampling techniques, achieving higher balanced classification accuracy. These findings highlight the potential of combining statistical concepts with machine learning to address class imbalance in practical applications.

[LG-60] On the Robustness of Spectral Algorithms for Semirandom Stochastic Block Models NEURIPS2024

Link: https://arxiv.org/abs/2412.14315
Authors: Aditya Bhaskara, Agastya Vibhuti Jha, Michael Kapralov, Naren Sarayu Manoj, Davide Mazzali, Weronika Wrzos-Kaminska
Keywords: equally-sized unlabeled communities, unlabeled communities, graph bisection problem, Stochastic Block Model, equally-sized unlabeled
Categories: Machine Learning (stat.ML); Data Structures and Algorithms (cs.DS); Machine Learning (cs.LG); Social and Information Networks (cs.SI)
Comments: 45 pages. NeurIPS 2024

Abstract: In a graph bisection problem, we are given a graph G with two equally-sized unlabeled communities, and the goal is to recover the vertices in these communities. A popular heuristic, known as spectral clustering, is to output an estimated community assignment based on the eigenvector corresponding to the second smallest eigenvalue of the Laplacian of G. Spectral algorithms can be shown to provably recover the cluster structure for graphs generated from certain probabilistic models, such as the Stochastic Block Model (SBM). However, spectral clustering is known to be non-robust to model mis-specification. Techniques based on semidefinite programming have been shown to be more robust, but they incur significant computational overheads. In this work, we study the robustness of spectral algorithms against semirandom adversaries. Informally, a semirandom adversary is allowed to “helpfully” change the specification of the model in a way that is consistent with the ground-truth solution. Our semirandom adversaries in particular are allowed to add edges inside clusters or increase the probability that an edge appears inside a cluster. Semirandom adversaries are a useful tool to determine the extent to which an algorithm has overfit to statistical assumptions on the input. On the positive side, we identify classes of semirandom adversaries under which spectral bisection using the unnormalized Laplacian is strongly consistent, i.e., it exactly recovers the planted partitioning. On the negative side, we show that in these classes spectral bisection with the normalized Laplacian outputs a partitioning that makes a classification mistake on a constant fraction of the vertices. Finally, we demonstrate numerical experiments that complement our theoretical findings.
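
The spectral bisection heuristic described in the abstract can be sketched in a few lines. This is a rough illustration, not the authors' code: it uses the unnormalized Laplacian L = D - A and, to stay within the standard library, approximates the second eigenvector by shifted power iteration rather than a library eigensolver. The function name `fiedler_bisection`, the iteration count, and the toy graph are assumptions, and the graph is assumed to be connected with at least two vertices.

```python
def fiedler_bisection(adj, iters=500):
    """Split a connected graph in two by the signs of its (approximate) Fiedler vector."""
    n = len(adj)
    deg = [sum(row) for row in adj]
    shift = 2.0 * max(deg)  # upper bound on the largest eigenvalue of L = D - A
    x = [float(i) for i in range(n)]  # deterministic start vector
    for _ in range(iters):
        # Project out the all-ones direction, the eigenvector of L with eigenvalue 0.
        mean = sum(x) / n
        x = [v - mean for v in x]
        # Multiply by (shift * I - L); its dominant remaining eigenvector is the Fiedler vector.
        y = [
            (shift - deg[i]) * x[i] + sum(adj[i][j] * x[j] for j in range(n))
            for i in range(n)
        ]
        norm = sum(v * v for v in y) ** 0.5
        x = [v / norm for v in y]
    return [v >= 0 for v in x]


# Toy graph: triangles {0, 1, 2} and {3, 4, 5} joined by the single edge 2-3.
adj = [
    [0, 1, 1, 0, 0, 0],
    [1, 0, 1, 0, 0, 0],
    [1, 1, 0, 1, 0, 0],
    [0, 0, 1, 0, 1, 1],
    [0, 0, 0, 1, 0, 1],
    [0, 0, 0, 1, 1, 0],
]
labels = fiedler_bisection(adj)
```

On this toy graph the sign pattern separates the two triangles. Which side is labeled `True` is arbitrary, because an eigenvector is only defined up to sign.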

[LG-61] Projected gradient methods for nonconvex and stochastic optimization: new complexities and auto-conditioned stepsizes

Link: https://arxiv.org/abs/2412.14291
Authors: Guanghui Lan, Tianjiao Li, Yangyang Xu
Keywords: convex compact set, necessarily convex function, Lipschitz constant, necessarily convex, convex function
Categories: Optimization and Control (math.OC); Machine Learning (cs.LG); Machine Learning (stat.ML)

Abstract:We present a novel class of projected gradient (PG) methods for minimizing a smooth but not necessarily convex function over a convex compact set. We first provide a novel analysis of the “vanilla” PG method, achieving the best-known iteration complexity for finding an approximate stationary point of the problem. We then develop an “auto-conditioned” projected gradient (AC-PG) variant that achieves the same iteration complexity without requiring the input of the Lipschitz constant of the gradient or any line search procedure. The key idea is to estimate the Lipschitz constant using first-order information gathered from the previous iterations, and to show that the error caused by underestimating the Lipschitz constant can be properly controlled. We then generalize the PG methods to the stochastic setting, by proposing a stochastic projected gradient (SPG) method and a variance-reduced stochastic gradient (VR-SPG) method, achieving new complexity bounds in different oracle settings. We also present auto-conditioned stepsize policies for both stochastic PG methods and establish comparable convergence guarantees.
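
As a rough illustration of the idea of replacing a user-supplied Lipschitz constant with an estimate built from previous iterates, the sketch below runs projected gradient descent on a box and sets the stepsize to 1/L, where L is the largest observed ratio of gradient change to iterate change. This is a generic heuristic written for this digest, not the AC-PG stepsize policy from the paper; `project_box`, `projected_gradient`, `grad_f`, the initial guess `L0`, and the test function are all hypothetical.

```python
def project_box(x, lo, hi):
    """Euclidean projection onto the box [lo, hi]^n."""
    return [min(max(v, lo), hi) for v in x]


def projected_gradient(grad, x0, lo, hi, iters=200, L0=1.0):
    """Projected gradient with stepsize 1/L, where L is estimated from past iterates."""
    x = project_box(x0, lo, hi)
    g = grad(x)
    L = L0  # initial guess; may underestimate the true Lipschitz constant
    for _ in range(iters):
        x_new = project_box([xi - gi / L for xi, gi in zip(x, g)], lo, hi)
        g_new = grad(x_new)
        dx = sum((a - b) ** 2 for a, b in zip(x_new, x)) ** 0.5
        dg = sum((a - b) ** 2 for a, b in zip(g_new, g)) ** 0.5
        if dx > 0.0:
            # Local estimate from first-order information; never decrease L.
            L = max(L, dg / dx)
        x, g = x_new, g_new
    return x


def grad_f(x):
    """Gradient of the nonconvex function f(x) = sum((x_i^2 - 1)^2) / 4."""
    return [v * (v * v - 1.0) for v in x]


x_star = projected_gradient(grad_f, [1.8, 0.3], lo=-0.5, hi=2.0)
```

A convenient stationarity check for constrained problems is the projected-gradient residual, the distance between `x_star` and `project_box(x_star - grad_f(x_star))`, which should be close to zero at an approximate stationary point.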

Information Retrieval

[IR-0] Nano-ESG: Extracting Corporate Sustainability Information from News Articles ECIR2025

Link: https://arxiv.org/abs/2412.15093
Authors: Fabian Billert, Stefan Conrad
Keywords: highly complex subject, Determining the sustainability, past few years, highly complex, Determining
Categories: Information Retrieval (cs.IR)
Comments: To be published at ECIR 2025. Preprint

Abstract: Determining the sustainability impact of companies is a highly complex subject which has garnered more and more attention over the past few years. Today, investors largely rely on sustainability-ratings from established rating-providers in order to analyze how responsibly a company acts. However, those ratings have recently been criticized for being hard to understand and nearly impossible to reproduce. An independent way to find out about the sustainability practices of companies lies in the rich landscape of news article data. In this paper, we explore a different approach to identify key opportunities and challenges of companies in the sustainability domain. We present a novel dataset of more than 840,000 news articles which were gathered for major German companies between January 2023 and September 2024. By applying a mixture of Natural Language Processing techniques, we first identify relevant articles, before summarizing them and extracting their sustainability-related sentiment and aspect using Large Language Models (LLMs). Furthermore, we conduct an evaluation of the obtained data and determine that the LLM-produced answers are accurate. We release both datasets at this https URL.

[IR-1] Spectrum-based Modality Representation Fusion Graph Convolutional Network for Multimodal Recommendation WSDM

Link: https://arxiv.org/abs/2412.14978
Authors: Rongqing Kenneth Ong, Andy W. H. Khong
Keywords: Incorporating multi-modal features, Incorporating multi-modal, side information, information has recently, modality
Categories: Information Retrieval (cs.IR); Multimedia (cs.MM)
Comments: Accepted to ACM Web Search and Data Mining (WSDM) 2025

Abstract:Incorporating multi-modal features as side information has recently become a trend in recommender systems. To elucidate user-item preferences, recent studies focus on fusing modalities via concatenation, element-wise sum, or attention mechanisms. Despite having notable success, existing approaches do not account for the modality-specific noise encapsulated within each modality. As a result, direct fusion of modalities will lead to the amplification of cross-modality noise. Moreover, the variation of noise that is unique within each modality results in noise alleviation and fusion being more challenging. In this work, we propose a new Spectrum-based Modality Representation (SMORE) fusion graph recommender that aims to capture both uni-modal and fusion preferences while simultaneously suppressing modality noise. Specifically, SMORE projects the multi-modal features into the frequency domain and leverages the spectral space for fusion. To reduce dynamic contamination that is unique to each modality, we introduce a filter to attenuate and suppress the modality noise adaptively while capturing the universal modality patterns effectively. Furthermore, we explore the item latent structures by designing a new multi-modal graph learning module to capture associative semantic correlations and universal fusion patterns among similar items. Finally, we formulate a new modality-aware preference module, which infuses behavioral features and balances the uni- and multi-modal features for precise preference modeling. This empowers SMORE with the ability to infer both user modality-specific and fusion preferences more accurately. Experiments on three real-world datasets show the efficacy of our proposed model. The source code for this work has been made publicly available at this https URL.

[IR-2] ECLIPSE: Contrastive Dimension Importance Estimation with Pseudo-Irrelevance Feedback for Dense Retrieval

Link: https://arxiv.org/abs/2412.14967
Authors: Giulio D’Erasmo, Giovanni Trappolini, Nicola Tonellotto, Fabrizio Silvestri
Keywords: Recent advances, leveraged high-dimensional embedding, high-dimensional embedding spaces, Manifold Clustering Hypothesis, embedding spaces
Categories: Information Retrieval (cs.IR)

Abstract:Recent advances in Information Retrieval have leveraged high-dimensional embedding spaces to improve the retrieval of relevant documents. Moreover, the Manifold Clustering Hypothesis suggests that despite these high-dimensional representations, documents relevant to a query reside on a lower-dimensional, query-dependent manifold. While this hypothesis has inspired new retrieval methods, existing approaches still face challenges in effectively separating non-relevant information from relevant signals. We propose a novel methodology that addresses these limitations by leveraging information from both relevant and non-relevant documents. Our method, ECLIPSE, computes a centroid based on irrelevant documents as a reference to estimate noisy dimensions present in relevant ones, enhancing retrieval performance. Extensive experiments on three in-domain and one out-of-domain benchmarks demonstrate an average improvement of up to 19.50% (resp. 22.35%) in mAP(AP) and 11.42% (resp. 13.10%) in nDCG@10 w.r.t. the DIME-based baseline (resp. the baseline using all dimensions). Our results pave the way for more robust, pseudo-irrelevance-based retrieval systems in future IR research.

[IR-3] Moving Beyond LDA: A Comparison of Unsupervised Topic Modelling Techniques for Qualitative Data Analysis of Online Communities

Link: https://arxiv.org/abs/2412.14486
Authors: Amandeep Kaur, James R. Wallace
Keywords: Social media constitutes, Social media, Large Language Model, social media content, constitutes a rich
Categories: Human-Computer Interaction (cs.HC); Information Retrieval (cs.IR)

Abstract: Social media constitutes a rich and influential source of information for qualitative researchers. Although computational techniques like topic modelling assist with managing the volume and diversity of social media content, qualitative researchers’ lack of programming expertise creates a significant barrier to their adoption. In this paper we explore how BERTopic, an advanced Large Language Model (LLM)-based topic modelling technique, can support qualitative data analysis of social media. We conducted interviews and hands-on evaluations in which qualitative researchers compared topics from three modelling techniques: LDA, NMF, and BERTopic. BERTopic was favoured by 8 of 12 participants for its ability to provide detailed, coherent clusters for deeper understanding and actionable insights. Participants also prioritised topic relevance, logical organisation, and the capacity to reveal unexpected relationships within the data. Our findings underscore the potential of LLM-based techniques for supporting qualitative analysis.

[IR-4] HEC-GCN: Hypergraph Enhanced Cascading Graph Convolution Network for Multi-Behavior Recommendation

Link: https://arxiv.org/abs/2412.14476
Authors: Yabo Yin, Xiaofei Zhu, Wenshan Wang, Yihao Zhang, Pengfei Wang, Yixing Fan, Jiafeng Guo
Keywords: garnered growing attention, growing attention recently, inferring user preferences, attention recently due, garnered growing
Categories: Information Retrieval (cs.IR)

Abstract: Multi-behavior recommendation (MBR) has garnered growing attention recently due to its ability to mitigate the sparsity issue by inferring user preferences from various auxiliary behaviors to improve predictions for the target behavior. Although existing research on MBR has yielded impressive results, it still faces two major limitations. First, previous methods mainly focus on modeling fine-grained interaction information between users and items under each behavior, which may suffer from the sparsity issue. Second, existing models usually concentrate on exploiting dependencies between two consecutive behaviors, leaving intra- and inter-behavior consistency largely unexplored. To this end, we propose a novel approach named Hypergraph Enhanced Cascading Graph Convolution Network for multi-behavior recommendation (HEC-GCN). To be specific, we first explore both fine- and coarse-grained correlations among users or items of each behavior by simultaneously modeling the behavior-specific interaction graph and its corresponding hypergraph in a cascaded manner. Then, we propose a behavior consistency-guided alignment strategy that ensures consistent representations between the interaction graph and its associated hypergraph for each behavior, while also maintaining representation consistency across different behaviors. Extensive experiments and analyses on three public benchmark datasets demonstrate that our proposed approach is consistently superior to previous state-of-the-art methods due to its capability to effectively attenuate the sparsity issue as well as preserve both intra- and inter-behavior consistencies. The code is available at this https URL.

[IR-5] VISA: Retrieval Augmented Generation with Visual Source Attribution

Link: https://arxiv.org/abs/2412.14457
Authors: Xueguang Ma, Shengyao Zhuang, Bevan Koopman, Guido Zuccon, Wenhu Chen, Jimmy Lin
Keywords: Visual Source Attribution, source attribution, Visual Source, retrieval-augmented generation, important for enhancing
Categories: Information Retrieval (cs.IR)

Abstract:Generation with source attribution is important for enhancing the verifiability of retrieval-augmented generation (RAG) systems. However, existing approaches in RAG primarily link generated content to document-level references, making it challenging for users to locate evidence among multiple content-rich retrieved documents. To address this challenge, we propose Retrieval-Augmented Generation with Visual Source Attribution (VISA), a novel approach that combines answer generation with visual source attribution. Leveraging large vision-language models (VLMs), VISA identifies the evidence and highlights the exact regions that support the generated answers with bounding boxes in the retrieved document screenshots. To evaluate its effectiveness, we curated two datasets: Wiki-VISA, based on crawled Wikipedia webpage screenshots, and Paper-VISA, derived from PubLayNet and tailored to the medical domain. Experimental results demonstrate the effectiveness of VISA for visual source attribution on documents’ original look, as well as highlighting the challenges for improvement. Code, data, and model checkpoints will be released.

[IR-6] ChainRank-DPO: Chain Rank Direct Preference Optimization for LLM Rankers

Link: https://arxiv.org/abs/2412.14405
Authors: Haowei Liu, Xuyang Wu, Guohao Sun, Zhiqiang Tao, Yi Fang
Keywords: Large language models, demonstrated remarkable effectiveness, Large language, works like RankGPT, leveraging their human-like
Categories: Information Retrieval (cs.IR)

Abstract:Large language models (LLMs) have demonstrated remarkable effectiveness in text reranking through works like RankGPT, leveraging their human-like reasoning about relevance. However, supervised fine-tuning for ranking often diminishes these models’ general-purpose capabilities, including the crucial reasoning abilities that make them valuable for ranking. We introduce a novel approach integrating Chain-of-Thought prompting with an SFT-DPO (Supervised Fine-Tuning followed by Direct Preference Optimization) pipeline to preserve these capabilities while improving ranking performance. Our experiments on TREC 2019 and 2020 Deep Learning datasets show that our approach outperforms the state-of-the-art RankZephyr while maintaining strong performance on the Massive Multitask Language Understanding (MMLU) benchmark, demonstrating effective preservation of general-purpose capabilities through thoughtful fine-tuning strategies. Our code and data will be publicly released upon the acceptance of the paper.

Attachment Download

Click to download the full list of today's papers
