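Today's arXiv Papers | 2024-12-16

This post lists the latest papers retrieved from arXiv.org on 2024-12-16. It is updated automatically and grouped into five areas: NLP, CV, ML, AI, and IR. To receive the list by email on a schedule, leave your email address in the comments.

Note: The daily paper data is fetched from arXiv.org and refreshed automatically at around 12:00 each day.

Reminder: If you would like to receive the daily paper data by email, leave your email address in the comments.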

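[Quick read]: This paper addresses how to ground meaning in language typology and proposes a language-agnostic representation of meaning based on data from perceptual modalities such as images. The key to the approach is defining and quantifying "groundedness", an information-theoretic, empirical measure of semantic contentfulness that is computed with multilingual multimodal language models. Specifically, the paper measures contextual semantic contentfulness as a difference in surprisal and applies it to the typology of word classes. The measure captures the contentfulness asymmetry between functional (grammatical) and lexical (content) classes, while challenging the view that functional classes convey no content. The paper also finds universal trends in the groundedness hierarchy (e.g., nouns > adjectives > verbs) and shows that the measure partly correlates with psycholinguistic concreteness norms in English.

Link: https://arxiv.org/abs/2412.10369
Authors: Coleman Haley, Sharon Goldwater, Edoardo Ponti
Affiliations: University of Edinburgh; University of Cambridge
Keywords: measure, Abstract, meaning, language, languages
Categories: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Comments: 19 pages, 5 figures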

Abstract:We propose a grounded approach to meaning in language typology. We treat data from perceptual modalities, such as images, as a language-agnostic representation of meaning. Hence, we can quantify the function–form relationship between images and captions across languages. Inspired by information theory, we define “groundedness”, an empirical measure of contextual semantic contentfulness (formulated as a difference in surprisal) which can be computed with multilingual multimodal language models. As a proof of concept, we apply this measure to the typology of word classes. Our measure captures the contentfulness asymmetry between functional (grammatical) and lexical (content) classes across languages, but contradicts the view that functional classes do not convey content. Moreover, we find universal trends in the hierarchy of groundedness (e.g., nouns > adjectives > verbs), and show that our measure partly correlates with psycholinguistic concreteness norms in English. We release a dataset of groundedness scores for 30 languages. Our results suggest that the grounded typology approach can provide quantitative evidence about semantic function in language.
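
The "difference in surprisal" idea can be illustrated with a small sketch. The abstract does not give the exact formula, so the form used here (surprisal of a token given the text alone, minus its surprisal given the text plus the image) is an assumption, and the probabilities are made-up numbers rather than model outputs.

```python
import math

def surprisal(p: float) -> float:
    """Surprisal, in bits, of an event with probability p."""
    return -math.log2(p)

def groundedness(p_text_only: float, p_with_image: float) -> float:
    # Assumed form: how many bits of surprisal the image removes.
    # Positive values mean the image makes the token more predictable.
    return surprisal(p_text_only) - surprisal(p_with_image)

# Hypothetical token probabilities: (without image, with image).
example = {
    "dog": (0.01, 0.32),  # content word: the image helps a lot
    "the": (0.40, 0.40),  # function word: the image changes nothing here
}
scores = {word: groundedness(a, b) for word, (a, b) in example.items()}
```

With these invented numbers, "dog" scores about 5 bits (log2 of 0.32 / 0.01) and "the" scores 0, which mirrors the kind of content/function asymmetry the abstract describes.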

[NLP-1] AdvPrefix: An Objective for Nuanced LLM Jailbreaks

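[Quick read]: This paper addresses two weaknesses of existing jailbreak attacks on large language models (LLMs): limited control over model behavior and difficult optimization. The key to the approach is AdvPrefix, a new prefix-forcing objective that uses model-dependent prefixes, selected automatically for high attack success rate and low negative log-likelihood, to give finer control over model behavior and to simplify optimization. AdvPrefix integrates seamlessly into existing attacks and raises their success rates substantially; for example, on Llama-3 it raises the nuanced attack success rate of the GCG attack from 14% to 80%, which suggests that current alignment methods struggle to generalize to unseen prefixes.

Link: https://arxiv.org/abs/2412.10321
Authors: Sicheng Zhu, Brandon Amos, Yuandong Tian, Chuan Guo, Ivan Evtimov
Affiliations: Unknown
Keywords: large language models, large language, harmful request, language models, model respond
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Cryptography and Security (cs.CR)
Comments: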

Abstract:Many jailbreak attacks on large language models (LLMs) rely on a common objective: making the model respond with the prefix “Sure, here is (harmful request)”. While straightforward, this objective has two limitations: limited control over model behaviors, often resulting in incomplete or unrealistic responses, and a rigid format that hinders optimization. To address these limitations, we introduce AdvPrefix, a new prefix-forcing objective that enables more nuanced control over model behavior while being easy to optimize. Our objective leverages model-dependent prefixes, automatically selected based on two criteria: high prefilling attack success rates and low negative log-likelihood. It can further simplify optimization by using multiple prefixes for a single user request. AdvPrefix can integrate seamlessly into existing jailbreak attacks to improve their performance for free. For example, simply replacing GCG attack’s target prefixes with ours on Llama-3 improves nuanced attack success rates from 14% to 80%, suggesting that current alignment struggles to generalize to unseen prefixes. Our work demonstrates the importance of jailbreak objectives in achieving nuanced jailbreaks.

[NLP-2] SCBench: A KV Cache-Centric Analysis of Long-Context Methods

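[Quick read]: This paper addresses the computational and memory efficiency problems that long-context large language models (LLMs) face at inference time, in particular the optimization challenges centered on the key-value (KV) cache. The key contribution is SCBench (SharedContextBench), a comprehensive benchmark that evaluates long-context methods from a KV cache perspective, covering four stages: KV cache generation, compression, retrieval, and loading. Using tasks with shared context, SCBench evaluates a range of long-context solutions, including Gated Linear RNNs, Mamba-Attention hybrids, and efficient methods such as sparse attention, KV cache dropping, and quantization. The results show that sub-O(n) memory methods perform poorly in multi-turn scenarios, while sparse encoding with O(n) memory and sub-O(n^2) pre-filling computation performs robustly. In addition, dynamic sparsity yields more expressive KV caches than static patterns, and layer-level sparsity in hybrid architectures reduces memory usage while maintaining strong performance.

Link: https://arxiv.org/abs/2412.10319
Authors: Yucheng Li, Huiqiang Jiang, Qianhui Wu, Xufang Luo, Surin Ahn, Chengruidong Zhang, Amir H. Abdi, Dongsheng Li, Jianfeng Gao, Yuqing Yang, Lili Qiu
Affiliations: Microsoft Corporation; University of Surrey
Keywords: enabled numerous downstream, numerous downstream applications, introduced significant challenges, significant challenges related, cache
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: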

Abstract:Long-context LLMs have enabled numerous downstream applications but also introduced significant challenges related to computational and memory efficiency. To address these challenges, optimizations for long-context inference have been developed, centered around the KV cache. However, existing benchmarks often evaluate in single-request settings, neglecting the full lifecycle of the KV cache in real-world use. This oversight is particularly critical, as KV cache reuse has become widely adopted in LLM inference frameworks, such as vLLM and SGLang, as well as by LLM providers, including OpenAI, Microsoft, Google, and Anthropic. To address this gap, we introduce SCBench (SharedContextBench), a comprehensive benchmark for evaluating long-context methods from a KV cache-centric perspective: 1) KV cache generation, 2) KV cache compression, 3) KV cache retrieval, 4) KV cache loading. Specifically, SCBench uses test examples with shared context, spanning 12 tasks with two shared context modes, covering four categories of long-context capabilities: string retrieval, semantic retrieval, global information, and multi-task. With it, we provide an extensive KV cache-centric analysis of eight categories of long-context solutions, including Gated Linear RNNs, Mamba-Attention hybrids, and efficient methods such as sparse attention, KV cache dropping, quantization, retrieval, loading, and prompt compression. The evaluation is conducted on 8 long-context LLMs. Our findings show that sub-O(n) memory methods suffer in multi-turn scenarios, while sparse encoding with O(n) memory and sub-O(n^2) pre-filling computation performs robustly. Dynamic sparsity yields more expressive KV caches than static patterns, and layer-level sparsity in hybrid architectures reduces memory usage with strong performance. Additionally, we identify attention distribution shift issues in long-generation scenarios.

[NLP-3] Interlocking-free Selective Rationalization Through Genetic-based Learning

[Quick read]: This paper addresses interlocking in selective rationalization: in end-to-end architectures where a generator and a predictor cooperate, the dominance of one module drives the system into a suboptimal equilibrium. The key to the approach is GenSPP, an architecture that trains the generator and the predictor disjointly via genetic global search. This avoids interlocking without extra learning overhead such as feature-based heuristics, sampling, or ad-hoc regularization.

Link: https://arxiv.org/abs/2412.10312
Authors: Federico Ruggeri, Gaetano Signorelli
Affiliations: DISI, University of Bologna
Keywords: extract highlights fed, extract highlights, highlights fed, architecture for selective, selective rationalization
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Neural and Evolutionary Computing (cs.NE)
Comments:

Abstract:A popular end-to-end architecture for selective rationalization is the select-then-predict pipeline, comprising a generator to extract highlights fed to a predictor. Such a cooperative system suffers from suboptimal equilibrium minima due to the dominance of one of the two modules, a phenomenon known as interlocking. While several contributions aimed at addressing interlocking, they only mitigate its effect, often by introducing feature-based heuristics, sampling, and ad-hoc regularizations. We present GenSPP, the first interlocking-free architecture for selective rationalization that does not require any learning overhead, as the above-mentioned. GenSPP avoids interlocking by performing disjoint training of the generator and predictor via genetic global search. Experiments on a synthetic and a real-world benchmark show that our model outperforms several state-of-the-art competitors.

[NLP-4] DeepSeek-VL2: Mixture-of-Experts Vision-Language Models for Advanced Multimodal Understanding

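[Quick read]: This paper addresses high-resolution image processing and efficient language inference through two major upgrades. First, the vision component adopts a dynamic tiling vision encoding strategy to handle high-resolution images with different aspect ratios. Second, the language component uses DeepSeekMoE models with the Multi-head Latent Attention mechanism, which compresses the key-value cache into latent vectors to enable efficient inference and high throughput. With these improvements, DeepSeek-VL2 shows superior performance on tasks such as visual question answering, optical character recognition, document/table/chart understanding, and visual grounding.

Link: https://arxiv.org/abs/2412.10302
Authors: Zhiyu Wu, Xiaokang Chen, Zizheng Pan, Xingchao Liu, Wen Liu, Damai Dai, Huazuo Gao, Yiyang Ma, Chengyue Wu, Bingxuan Wang, Zhenda Xie, Yu Wu, Kai Hu, Jiawei Wang, Yaofeng Sun, Yukun Li, Yishi Piao, Kang Guan, Aixin Liu, Xin Xie, Yuxiang You, Kai Dong, Xingkai Yu, Haowei Zhang, Liang Zhao, Yisong Wang, Chong Ruan
Affiliations: DeepSeek-AI
Keywords: key major upgrades, major upgrades, significantly improves, key major, Multi-head Latent Attention
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: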

Abstract:We present DeepSeek-VL2, an advanced series of large Mixture-of-Experts (MoE) Vision-Language Models that significantly improves upon its predecessor, DeepSeek-VL, through two key major upgrades. For the vision component, we incorporate a dynamic tiling vision encoding strategy designed for processing high-resolution images with different aspect ratios. For the language component, we leverage DeepSeekMoE models with the Multi-head Latent Attention mechanism, which compresses Key-Value cache into latent vectors, to enable efficient inference and high throughput. Trained on an improved vision-language dataset, DeepSeek-VL2 demonstrates superior capabilities across various tasks, including but not limited to visual question answering, optical character recognition, document/table/chart understanding, and visual grounding. Our model series is composed of three variants: DeepSeek-VL2-Tiny, DeepSeek-VL2-Small and DeepSeek-VL2, with 1.0B, 2.8B and 4.5B activated parameters respectively. DeepSeek-VL2 achieves competitive or state-of-the-art performance with similar or fewer activated parameters compared to existing open-source dense and MoE-based models. Codes and pre-trained models are publicly accessible at this https URL.

[NLP-5] Still "Talking About Large Language Models": Some Clarifications

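[Quick read]: This paper clarifies that the author's earlier paper on large language models has been misread as advocating a reductionist stance, which the author does not endorse. The key move is to situate that discussion within a larger philosophical project concerned with the (mis)use of words rather than with metaphysics, in the spirit of Wittgenstein's later writing. In this way the paper aims to head off reductionist readings of large language models and to stress the complexity and philosophical significance of language use.

Link: https://arxiv.org/abs/2412.10291
Authors: Murray Shanahan
Affiliations: Imperial College London; Institute of Philosophy, School of Advanced Study, University of London
Keywords: Large Language Models, Language Models, Large Language, Talking About Large, interpreted as advocating
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: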

Abstract:My paper “Talking About Large Language Models” has more than once been interpreted as advocating a reductionist stance towards large language models. But the paper was not intended that way, and I do not endorse such positions. This short note situates the paper in the context of a larger philosophical project that is concerned with the (mis)use of words rather than metaphysics, in the spirit of Wittgenstein’s later writing.

[NLP-6] One world one opinion? The superstar effect in LLM responses

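[Quick read]: This paper examines how diverse large language models (LLMs) are in whom they regard as the most prominent figures in various fields when asked in different languages. The study uses prompts in ten languages to explore the influence of linguistic diversity. The key finding is that LLM responses show low diversity across languages, with a small number of figures dominating recognition (the "superstar effect"). This highlights the risk that LLMs narrow global knowledge representation when they retrieve subjective information.

Link: https://arxiv.org/abs/2412.10281
Authors: Sofie Goethals, Lauren Rhue
Affiliations: University of Antwerp, Belgium; Robert H. Smith School of Business, USA
Keywords: large language models, accessed online, wide audience, shared and accessed, language models
Categories: Computation and Language (cs.CL)
Comments: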

Abstract:As large language models (LLMs) are shaping the way information is shared and accessed online, their opinions have the potential to influence a wide audience. This study examines who the LLMs view as the most prominent figures across various fields, using prompts in ten different languages to explore the influence of linguistic diversity. Our findings reveal low diversity in responses, with a small number of figures dominating recognition across languages (also known as the “superstar effect”). These results highlight the risk of narrowing global knowledge representation when LLMs retrieve subjective information.

[NLP-7] Benchmarking Linguistic Diversity of Large Language Models

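[Quick read]: This paper addresses a question that has received little attention: although recent large language models (LLMs) surpass human performance on some tasks, does the language they generate match the human level of diversity in vocabulary choice, syntactic construction, and expression of meaning? The paper stresses the importance of checking whether language models preserve human linguistic richness, especially now that so much online content is produced or aided by LLMs. The key to the approach is a comprehensive framework for evaluating the linguistic diversity of LLMs along lexical, syntactic, and semantic dimensions. Using this framework, the paper benchmarks several state-of-the-art LLMs and analyzes how different development and deployment choices affect the linguistic diversity of LLM outputs.

Link: https://arxiv.org/abs/2412.10271
Authors: Yanzhu Guo, Guokan Shang, Chloé Clavel
Affiliations: ALMAnaCH; Inria Paris; France Lab; MBZUAI; École Polytechnique
Keywords: Large Language Models, evaluation of Large, surpassing human performance, Large Language, task-solving capabilities
Categories: Computation and Language (cs.CL)
Comments: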

Abstract:The development and evaluation of Large Language Models (LLMs) has primarily focused on their task-solving capabilities, with recent models even surpassing human performance in some areas. However, this focus often neglects whether machine-generated language matches the human level of diversity, in terms of vocabulary choice, syntactic construction, and expression of meaning, raising questions about whether the fundamentals of language generation have been fully addressed. This paper emphasizes the importance of examining the preservation of human linguistic richness by language models, given the concerning surge in online content produced or aided by LLMs. We propose a comprehensive framework for evaluating LLMs from various linguistic diversity perspectives including lexical, syntactic, and semantic dimensions. Using this framework, we benchmark several state-of-the-art LLMs across all diversity dimensions, and conduct an in-depth case study for syntactic diversity. Finally, we analyze how different development and deployment choices impact the linguistic diversity of LLM outputs.
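
To make "lexical diversity" concrete, here is a minimal sketch of one common measure, distinct-n (the share of unique n-grams among all n-grams in a set of outputs). The abstract does not say which metrics the paper's framework uses, so distinct-n is only a stand-in chosen for illustration, and the whitespace tokenizer is a simplifying assumption.

```python
def distinct_n(texts, n=1):
    """Fraction of n-grams that are unique across a list of texts."""
    total = 0
    unique = set()
    for text in texts:
        # Assumption: lowercase whitespace tokenization is good enough here.
        tokens = text.lower().split()
        ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
        total += len(ngrams)
        unique.update(ngrams)
    # An empty input has no n-grams; report 0.0 rather than dividing by zero.
    return len(unique) / total if total else 0.0
```

A value near 1.0 means almost every n-gram is new, while a value near 0 means the outputs keep repeating the same phrases. For example, `distinct_n(["a b a b"], 1)` is 0.5 because only two of the four tokens are distinct.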

[NLP-8] Reasoner Outperforms: Generative Stance Detection with Rationalization for Social Media

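[Quick read]: This paper addresses the lack of interpretability in stance detection, in particular the absence of transparent, understandable justifications when large language models (LLMs) are used as classifiers. The key to the approach is a generative formulation in which stance predictions include explicit, interpretable rationales, which are then integrated into smaller language models (such as FlanT5) through single-task and multitask learning. The study finds that incorporating reasoning lets the smaller model FlanT5 outperform GPT-3.5's zero-shot performance, and that reasoning improves multitask learning performance, although it may reduce effectiveness in single-task settings. The paper also shows that faithful rationales improve rationale distillation, advancing efforts to build interpretable, trustworthy systems for addressing discrimination, fostering trust, and promoting equitable engagement on social media.

Link: https://arxiv.org/abs/2412.10266
Authors: Jiaqing Yuan, Ruijie Xi, Munindar P. Singh
Affiliations: North Carolina State University
Keywords: analyzing user-generated content, human-centric Web, Web by analyzing, Stance detection, analyzing user-generated
Categories: Computation and Language (cs.CL)
Comments: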

Abstract:Stance detection is crucial for fostering a human-centric Web by analyzing user-generated content to identify biases and harmful narratives that undermine trust. With the development of Large Language Models (LLMs), existing approaches treat stance detection as a classification problem, providing robust methodologies for modeling complex group interactions and advancing capabilities in natural language tasks. However, these methods often lack interpretability, limiting their ability to offer transparent and understandable justifications for predictions. This study adopts a generative approach, where stance predictions include explicit, interpretable rationales, and integrates them into smaller language models through single-task and multitask learning. We find that incorporating reasoning into stance detection enables the smaller model (FlanT5) to outperform GPT-3.5’s zero-shot performance, achieving an improvement of up to 9.57%. Moreover, our results show that reasoning capabilities enhance multitask learning performance but may reduce effectiveness in single-task settings. Crucially, we demonstrate that faithful rationales improve rationale distillation into SLMs, advancing efforts to build interpretable, trustworthy systems for addressing discrimination, fostering trust, and promoting equitable engagement on social media.

[NLP-9] Targeted Angular Reversal of Weights (TARS) for Knowledge Removal in Large Language Models

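[Quick read]: This paper addresses the risk that modern large language models (LLMs) acquire sensitive knowledge during training (for example about bio-security, or the ability to replicate copyrighted works). The key to the approach is a knowledge removal method called targeted angular reversal (TARS). TARS uses the LLM's internal representation space together with a detailed prompt to aggregate information about a target concept, then refines the result through noise perturbation and the language model head into a vector that triggers the target concept with high probability. Feedforward weight vectors with high cosine similarity to this targeting vector are then replaced, which limits the concept's ability to propagate through the model. The modular design allows multiple concepts to be removed one after another, and the method removes knowledge across languages while keeping the negative impact on overall model capabilities minimal.

Link: https://arxiv.org/abs/2412.10257
Authors: Harry J. Davies, Giorgos Iacovides, Danilo P. Mandic
Affiliations: Imperial College London
Keywords: poses significant risks, replicate copyrighted works, train modern large, TARS method, poses significant
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 14 pages, 5 figures, 1 table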

Abstract:The sheer scale of data required to train modern large language models (LLMs) poses significant risks, as models are likely to gain knowledge of sensitive topics such as bio-security, as well as the ability to replicate copyrighted works. Methods designed to remove such knowledge must do so from all prompt directions, in a multi-lingual capacity and without degrading general model performance. To this end, we introduce the targeted angular reversal (TARS) method of knowledge removal from LLMs. The TARS method firstly leverages the LLM in combination with a detailed prompt to aggregate information about a selected concept in the internal representation space of the LLM. It then refines this approximate concept vector to trigger the concept token with high probability, by perturbing the approximate concept vector with noise and transforming it into token scores with the language model head. The feedforward weight vectors in the LLM which operate directly on the internal representation space, and have the highest cosine similarity with this targeting vector, are then replaced by a reversed targeting vector, thus limiting the ability of the concept to propagate through the model. The modularity of the TARS method allows for a sequential removal of concepts from Llama 3.1 8B, such as the famous literary detective Sherlock Holmes, and the planet Saturn. It is demonstrated that the probability of triggering target concepts can be reduced to 0.00 with as few as 1 TARS edit, whilst simultaneously removing the knowledge bi-directionally. Moreover, knowledge is shown to be removed across all languages despite only being targeted in English. Importantly, TARS has minimal impact on the general model capabilities, as after removing 5 diverse concepts in a modular fashion, there is minimal KL divergence in the next token probabilities of the LLM on large corpora of Wikipedia text (median of 0.002).
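
The weight-replacement step described in the abstract can be sketched in plain Python. This is only an illustration of "find the weight vectors most aligned with a targeting vector and swap in its reverse": the function names, the choice of `k`, and the decision to use the plain negated vector without any rescaling are all assumptions, and how the targeting vector itself is obtained is not shown.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def reverse_top_matches(weight_rows, target, k=1):
    """Replace the k rows most similar to `target` with the negated target.

    Returns the edited copy of the rows and the indices that were replaced.
    """
    sims = [cosine(row, target) for row in weight_rows]
    top = sorted(range(len(weight_rows)), key=lambda i: sims[i], reverse=True)[:k]
    reversed_target = [-x for x in target]  # assumed: no rescaling of the norm
    edited = [list(row) for row in weight_rows]  # leave the input untouched
    for i in top:
        edited[i] = list(reversed_target)
    return edited, top

# Toy 2-D "weights": row 0 points exactly along the target direction.
rows = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]]
edited, replaced = reverse_top_matches(rows, target=[1.0, 0.0], k=1)
```

In this toy case row 0 is the closest match, so it is the one replaced, and its cosine similarity with the target flips from 1 to -1 while the other rows are unchanged.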

[NLP-10] Efficient Continual Pre-training of LLMs for Low-resource Languages

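[Quick read]: This paper addresses the poor performance of open-source large language models (OsLLMs) on low-resource languages (LRLs), which stems mainly from small amounts of training data and underrepresented vocabulary. The key to the approach is reducing the cost of continual pre-training (CPT): the authors develop a new algorithm that selects a subset of texts from a larger corpus, and a second algorithm that selects which tokens to add to the vocabulary. Experiments show the technique is effective with very little CPT data, evaluated on IndicGenBench, a generation benchmark for Indic languages.

Link: https://arxiv.org/abs/2412.10244
Authors: Arijit Nag, Soumen Chakrabarti, Animesh Mukherjee, Niloy Ganguly
Affiliations: Unknown
Keywords: Open-source Large Language, Open-source Large, update model parameters, natural language research, propel the democratization
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: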

Abstract:Open-source Large Language models (OsLLMs) propel the democratization of natural language research by giving the flexibility to augment or update model parameters for performance improvement. Nevertheless, like proprietary LLMs, Os-LLMs offer poorer performance on low-resource languages (LRLs) than high-resource languages (HRLs), owing to smaller amounts of training data and underrepresented vocabulary. On the other hand, continual pre-training (CPT) with large amounts of language-specific data is a costly proposition in terms of data acquisition and computational resources. Our goal is to drastically reduce CPT cost. To that end, we first develop a new algorithm to select a subset of texts from a larger corpus. We show the effectiveness of our technique using very little CPT data. In search of further improvement, we design a new algorithm to select tokens to include in the LLM vocabulary. We experiment with the recent Llama-3 model and nine Indian languages with diverse scripts and extent of resource availability. For evaluation, we use IndicGenBench, a generation task benchmark dataset for Indic languages. We experiment with various CPT corpora and augmented vocabulary size and offer insights across language families.

[NLP-11] How good is my story? Towards quantitative metrics for evaluating LLM-generated XAI narratives

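[Quick read]: This paper addresses two problems in explainable AI (XAI): how to convert quantitative explanations such as SHAP into user-friendly narratives, and how to evaluate the quality of those narratives automatically. The key to the approach is a framework and a set of automated metrics for evaluating narratives generated by large language models (LLMs), which removes the need for human preference studies or surveys. By comparing several state-of-the-art LLMs across datasets and prompt types, the metrics demonstrate their utility and also reveal hallucination problems that can arise when LLMs generate XAI narratives.

Link: https://arxiv.org/abs/2412.10220
Authors: Timour Ichmoukhamedov, James Hinns, David Martens
Affiliations: Universiteit Antwerpen (University of Antwerp); ADM
Keywords: smaller prediction models, rapidly developing application, convert quantitative explanations, SHAP into user-friendly, prediction models
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: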

Abstract:A rapidly developing application of LLMs in XAI is to convert quantitative explanations such as SHAP into user-friendly narratives to explain the decisions made by smaller prediction models. Evaluating the narratives without relying on human preference studies or surveys is becoming increasingly important in this field. In this work we propose a framework and explore several automated metrics to evaluate LLM-generated narratives for explanations of tabular classification tasks. We apply our approach to compare several state-of-the-art LLMs across different datasets and prompt types. As a demonstration of their utility, these metrics allow us to identify new challenges related to LLM hallucinations for XAI narratives.

[NLP-12] Retrieval-Augmented Semantic Parsing: Using Large Language Models to Improve Generalization

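[Quick read]: This paper addresses a challenge in open-domain semantic parsing: models often rely on heuristics and struggle with unseen concepts. The key to the approach is Retrieval-Augmented Semantic Parsing (RASP), which integrates external lexical knowledge into the parsing process. RASP markedly improves the ability of large language models (LLMs) to predict unseen concepts, nearly doubling the performance of previous models on out-of-distribution concepts, and the LLMs outperform earlier encoder-decoder baselines.

Link: https://arxiv.org/abs/2412.10207
Authors: Xiao Zhang, Qianru Meng, Johan Bos
Affiliations: University of Groningen; Leiden University
Keywords: Open-domain semantic parsing, handle unseen concepts, semantic parsing remains, semantic parsing, Open-domain semantic
Categories: Computation and Language (cs.CL)
Comments: Submitted to ARR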

Abstract:Open-domain semantic parsing remains a challenging task, as models often rely on heuristics and struggle to handle unseen concepts. In this paper, we investigate the potential of large language models (LLMs) for this task and introduce Retrieval-Augmented Semantic Parsing (RASP), a simple yet effective approach that integrates external lexical knowledge into the parsing process. Our experiments not only show that LLMs outperform previous encoder-decoder baselines for semantic parsing, but that RASP further enhances their ability to predict unseen concepts, nearly doubling the performance of previous models on out-of-distribution concepts. These findings highlight the promise of leveraging large language models and retrieval mechanisms for robust and open-domain semantic parsing.

[NLP-13] VLR-Bench: Multilingual Benchmark Dataset for Vision-Language Retrieval Augmented Generation COLING2025

[Quick read]: This paper addresses how vision language models (VLMs) can reason over and answer from multiple input passages in visual question answering (VQA) based on retrieval augmented generation (RAG). The key contribution is the VLR-Bench benchmark, which includes five input passages and requires the model to determine which passage is useful for answering a given query, a capability that earlier work did not test. The authors also build VLR-IF, a dataset of 32,000 automatically generated instruction-following examples designed to strengthen the RAG capabilities of VLMs so that they generate appropriate answers from the input passages. The benchmark and dataset are validated with the state-of-the-art Llava-Llama-3 model.

Link: https://arxiv.org/abs/2412.10151
Authors: Hyeonseok Lim, Dongjae Shin, Seohyun Song, Inho Won, Minjun Kim, Junghun Yuk, Haneol Jang, KyungTae Lim
Affiliations: Seoul National University of Science and Technology (SeoulTech); SeoulTech & Teddysum; Hanbat National University
Keywords: retrieval augmented generation, evaluating vision language, visual question answering, vision language models, external knowledge-based VQA
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: The 31st International Conference on Computational Linguistics (COLING 2025), 19 pages

Abstract:We propose the VLR-Bench, a visual question answering (VQA) benchmark for evaluating vision language models (VLMs) based on retrieval augmented generation (RAG). Unlike existing evaluation datasets for external knowledge-based VQA, the proposed VLR-Bench includes five input passages. This allows testing of the ability to determine which passage is useful for answering a given query, a capability lacking in previous research. In this context, we constructed a dataset of 32,000 automatically generated instruction-following examples, which we denote as VLR-IF. This dataset is specifically designed to enhance the RAG capabilities of VLMs by enabling them to learn how to generate appropriate answers based on input passages. We evaluated the validity of the proposed benchmark and training data and verified its performance using the state-of-the-art Llama3-based VLM, the Llava-Llama-3 model. The proposed VLR-Bench and VLR-IF datasets are publicly available online.

[NLP-14] TACOMORE: Leveraging the Potential of LLMs in Corpus-based Discourse Analysis with Prompt Engineering

[Quick read]: This paper addresses the unsatisfying performance, hallucination, and irreproducibility of large language models (LLMs) in corpus-based discourse analysis. The key to the approach is TACOMORE, a prompting framework built on four principles (Task, Context, Model, and Reproducibility) that specifies five fundamental elements of a good prompt: Role Description, Task Definition, Task Procedures, Contextual Information, and Output Format. In experiments on three LLMs (GPT-4o, Gemini-1.5-Pro, and one other model), TACOMORE improves performance on the analysis of keywords, collocates, and concordances over an open corpus of COVID-19 research articles, showing its efficacy in terms of Accuracy, Ethicality, Reasoning, and Reproducibility.

Link: https://arxiv.org/abs/2412.10139
Authors: Bingru Li, Han Wang
Affiliations: University of Birmingham
Keywords: discourse analysis incorporating, analysis incorporating LLMs, hindered by issues, issues of unsatisfying, corpus-based discourse analysis
Categories: Computation and Language (cs.CL)
Comments:

Abstract:The capacity of LLMs to carry out automated qualitative analysis has been questioned by corpus linguists, and it has been argued that corpus-based discourse analysis incorporating LLMs is hindered by issues of unsatisfying performance, hallucination, and irreproducibility. Our proposed method, TACOMORE, aims to address these concerns by serving as an effective prompting framework in this domain. The framework consists of four principles, i.e., Task, Context, Model and Reproducibility, and specifies five fundamental elements of a good prompt, i.e., Role Description, Task Definition, Task Procedures, Contextual Information and Output Format. We conduct experiments on three LLMs, i.e., GPT-4o, Gemini-1.5-Pro and this http URL, and find that TACOMORE helps improve LLM performance in three representative discourse analysis tasks, i.e., the analysis of keywords, collocates and concordances, based on an open corpus of COVID-19 research articles. Our findings show the efficacy of the proposed prompting framework TACOMORE in corpus-based discourse analysis in terms of Accuracy, Ethicality, Reasoning, and Reproducibility, and provide novel insights into the application and evaluation of LLMs in automated qualitative studies.

[NLP-15] ROUTE: Robust Multitask Tuning and Collaboration for Text-to-SQL

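[Quick read]: This paper addresses the reliance of current Text-to-SQL (Text2SQL) techniques on closed-source large language models (LLMs) such as GPT-4, which limits their use in open scenarios. The key to the approach is ROUTE, a new method that improves the comprehensive Text2SQL capabilities of open-source LLMs through multi-task supervised fine-tuning (SFT) and a Multitask Collaboration Prompting (MCP) strategy. Specifically, ROUTE adds SFT tasks such as schema linking, noise correction, and continuation writing to strengthen the model's understanding of SQL syntax, and uses collaboration across tasks to reduce hallucinations during SQL generation, which improves Text2SQL performance.

Link: https://arxiv.org/abs/2412.10138
Authors: Yang Qin, Chao Chen, Zhihang Fu, Ze Chen, Dezhong Peng, Peng Hu, Jieping Ye
Affiliations: Sichuan University
Keywords: large language models, facilitated by large, open scenarios, significant advancements, large language
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: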

Abstract:Despite the significant advancements in Text-to-SQL (Text2SQL) facilitated by large language models (LLMs), the latest state-of-the-art techniques are still trapped in the in-context learning of closed-source LLMs (e.g., GPT-4), which limits their applicability in open scenarios. To address this challenge, we propose a novel RObust mUltitask Tuning and collaboration mEthod (ROUTE) to improve the comprehensive capabilities of open-source LLMs for Text2SQL, thereby providing a more practical solution. Our approach begins with multi-task supervised fine-tuning (SFT) using various synthetic training data related to SQL generation. Unlike existing SFT-based Text2SQL methods, we introduced several additional SFT tasks, including schema linking, noise correction, and continuation writing. Engaging in a variety of SQL generation tasks enhances the model’s understanding of SQL syntax and improves its ability to generate high-quality SQL queries. Additionally, inspired by the collaborative modes of LLM agents, we introduce a Multitask Collaboration Prompting (MCP) strategy. This strategy leverages collaboration across several SQL-related tasks to reduce hallucinations during SQL generation, thereby maximizing the potential of enhancing Text2SQL performance through explicit multitask capabilities. Extensive experiments and in-depth analyses have been performed on eight open-source LLMs and five widely-used benchmarks. The results demonstrate that our proposal outperforms the latest Text2SQL methods and yields leading performance.

[NLP-16] Can LLMs Convert Graphs to Text-Attributed Graphs?

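[Quick read]: This paper addresses the cross-graph feature alignment problem that Graph Neural Networks (GNNs) face when applied to multiple graphs with different feature spaces. Existing GNN architectures are not designed for this alignment, and existing text-attributed graph methods depend on nodes having textual descriptions, which are hard to obtain in practice. The proposed solution is Topology-Aware Node description Synthesis (TANS), which uses large language models (LLMs) to convert existing graphs into text-attributed graphs automatically. Specifically, TANS integrates topological information with each node's properties, strengthening the LLM's ability to explain how graph topology influences node semantics and producing unified node descriptions. The method is evaluated on text-rich, text-limited, and text-free graphs; on text-free graphs it significantly outperforms approaches that design node features by hand, showing the potential of LLMs for preprocessing graph-structured data.

Link: https://arxiv.org/abs/2412.10136
Authors: Zehong Wang, Sidney Liu, Zheyuan Zhang, Tianyi Ma, Chuxu Zhang, Yanfang Ye
Affiliations: University of Notre Dame, Indiana, USA; University of Connecticut, Connecticut, USA
Keywords: numerous real-world applications, social network analysis, recommender systems, real-world applications, drug discovery
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: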

Abstract:Graphs are ubiquitous data structures found in numerous real-world applications, such as drug discovery, recommender systems, and social network analysis. Graph neural networks (GNNs) have become a popular tool to learn node embeddings through message passing on these structures. However, a significant challenge arises when applying GNNs to multiple graphs with different feature spaces, as existing GNN architectures are not designed for cross-graph feature alignment. To address this, recent approaches introduce text-attributed graphs, where each node is associated with a textual description, enabling the use of a shared textual encoder to project nodes from different graphs into a unified feature space. While promising, this method relies heavily on the availability of text-attributed data, which can be difficult to obtain in practice. To bridge this gap, we propose a novel method named Topology-Aware Node description Synthesis (TANS), which leverages large language models (LLMs) to automatically convert existing graphs into text-attributed graphs. The key idea is to integrate topological information with each node’s properties, enhancing the LLMs’ ability to explain how graph topology influences node semantics. We evaluate our TANS on text-rich, text-limited, and text-free graphs, demonstrating that it enables a single GNN to operate across diverse graphs. Notably, on text-free graphs, our method significantly outperforms existing approaches that manually design node features, showcasing the potential of LLMs for preprocessing graph-structured data, even in the absence of textual information. The code and data are available at this https URL.
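
As a rough picture of "integrating topological information with each node's properties", the sketch below computes two simple topological statistics for a node and folds them into a text prompt that could be sent to an LLM. Which statistics TANS actually uses and how its prompts are worded are not stated in the abstract, so the degree and clustering-coefficient choice, the function names, and the prompt text are all hypothetical.

```python
def degree(adj, node):
    """Number of neighbors of `node` in an undirected adjacency dict."""
    return len(adj[node])

def clustering_coefficient(adj, node):
    """Share of neighbor pairs of `node` that are themselves connected."""
    nbrs = list(adj[node])
    k = len(nbrs)
    if k < 2:
        return 0.0
    links = sum(
        1
        for i in range(k)
        for j in range(i + 1, k)
        if nbrs[j] in adj[nbrs[i]]
    )
    return 2 * links / (k * (k - 1))

def build_prompt(adj, node, attributes=None):
    """Assemble a hypothetical node-description prompt for an LLM."""
    lines = [f"Node {node} in a graph."]
    if attributes:
        lines.append(f"Known attributes: {attributes}")
    lines.append(f"Degree: {degree(adj, node)}")
    lines.append(f"Clustering coefficient: {clustering_coefficient(adj, node):.2f}")
    lines.append("Describe the likely role of this node in one sentence.")
    return "\n".join(lines)

# Toy undirected graph: a triangle a-b-c with a pendant node d attached to c.
adj = {"a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b", "d"}, "d": {"c"}}
prompt = build_prompt(adj, "c")
```

The point of the sketch is that even a text-free graph yields something to write about: node c has three neighbors, only one pair of which is connected, and that can be stated in words alongside whatever attributes exist.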

[NLP-17] ASLoRA: Adaptive Sharing Low-Rank Adaptation Across Layers

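[Quick read]: This paper addresses the high computational and storage costs that large language models (LLMs) incur under traditional full fine-tuning. The key to the solution is ASLoRA, a cross-layer parameter-sharing strategy that combines global sharing with partial adaptive sharing. Specifically, ASLoRA shares the low-rank matrix A across all layers and adaptively merges the B matrices during training. This sharing mechanism mitigates overfitting, captures inter-layer dependencies, and significantly enhances the model's representational capability. Experiments show that ASLoRA outperforms LoRA on a range of NLP tasks while using less than 25% of LoRA's parameters, demonstrating its flexibility and superior parameter efficiency.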

Link: https://arxiv.org/abs/2412.10135
Authors: Junyan Hu, Xue Xiao, Mengqi Zhang, Xiao Chen, Zhaochun Ren, Zhumin Chen, Pengjie Ren
Affiliations: Shandong University, Qingdao, China; Inspur Cloud Information Technology Co., Ltd; Leiden University, Leiden, The Netherlands
Keywords: increasingly impractical due, traditional full fine-tuning, grow in size, large language models, traditional full
Categories: Computation and Language (cs.CL)
Comments:

Abstract:As large language models (LLMs) grow in size, traditional full fine-tuning becomes increasingly impractical due to its high computational and storage costs. Although popular parameter-efficient fine-tuning methods, such as LoRA, have significantly reduced the number of tunable parameters, there is still room for further optimization. In this work, we propose ASLoRA, a cross-layer parameter-sharing strategy combining global sharing with partial adaptive sharing. Specifically, we share the low-rank matrix A across all layers and adaptively merge matrix B during training. This sharing mechanism not only mitigates overfitting effectively but also captures inter-layer dependencies, significantly enhancing the model’s representational capability. We conduct extensive experiments on various NLP tasks, showing that ASLoRA outperforms LoRA while using less than 25% of the parameters, highlighting its flexibility and superior parameter efficiency. Furthermore, in-depth analyses of the adaptive sharing strategy confirm its significant advantages in enhancing both model flexibility and task adaptability.
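
To make the sharing idea concrete, here is a minimal sketch that counts trainable parameters under standard LoRA and under a scheme with one shared A matrix and a reduced number of B matrices. It is not the authors' implementation: the layer count, dimensions, rank, and the `num_b_groups` parameter are illustrative assumptions, and the adaptive merging criterion itself is not modeled.

```python
def lora_params(num_layers: int, d_in: int, d_out: int, rank: int) -> int:
    """Standard LoRA: each layer owns its own A (rank x d_in) and B (d_out x rank)."""
    return num_layers * (rank * d_in + d_out * rank)


def shared_a_params(num_b_groups: int, d_in: int, d_out: int, rank: int) -> int:
    """One A shared by all layers, plus one B per group of merged layers.

    num_b_groups is a hypothetical stand-in for however many distinct B
    matrices remain after merging; the paper decides this adaptively.
    """
    return rank * d_in + num_b_groups * d_out * rank


# Illustrative numbers only (not taken from the paper).
layers, dim, rank = 24, 1024, 8
base = lora_params(layers, dim, dim, rank)
shared = shared_a_params(num_b_groups=6, d_in=dim, d_out=dim, rank=rank)
print(f"LoRA: {base}, shared-A sketch: {shared}, ratio: {shared / base:.3f}")
```

Sharing A alone removes roughly half of the LoRA parameters; dropping below the 25% mark reported in the abstract also requires the B matrices to be merged across layers, which is what `num_b_groups` stands in for here.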

[NLP-18] Familiarity: Better Evaluation of Zero-Shot Named Entity Recognition by Quantifying Label Shifts in Synthetic Training Data

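[Quick read]: This paper addresses biased performance evaluation in zero-shot named entity recognition (NER) caused by the high similarity between entity types in the training and evaluation data. The key to the solution is a new metric, Familiarity, which accounts for both the semantic similarity of entity types between training and evaluation data and their frequency in the training data, thereby quantifying label shift. The metric helps researchers understand the actual performance of zero-shot NER models more accurately, and it can also be used to generate evaluation setups of varying difficulty for finer-grained analysis of zero-shot NER.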

Link: https://arxiv.org/abs/2412.10121
Authors: Jonas Golde, Patrick Haller, Max Ploner, Fabio Barth, Nicolaas Jedema, Alan Akbik
Affiliations: Humboldt Universität zu Berlin; DFKI; Amazon
Keywords: detecting named entities, named entity recognition, zero-shot NER, Zero-shot named entity, detecting named
Categories: Computation and Language (cs.CL)
Comments: 8 pages, 4 figures, 5 tables

Abstract:Zero-shot named entity recognition (NER) is the task of detecting named entities of specific types (such as ‘Person’ or ‘Medicine’) without any training examples. Current research increasingly relies on large synthetic datasets, automatically generated to cover tens of thousands of distinct entity types, to train zero-shot NER models. However, in this paper, we find that these synthetic datasets often contain entity types that are semantically highly similar to (or even the same as) those in standard evaluation benchmarks. Because of this overlap, we argue that reported F1 scores for zero-shot NER overestimate the true capabilities of these approaches. Further, we argue that current evaluation setups provide an incomplete picture of zero-shot abilities since they do not quantify the label shift (i.e., the similarity of labels) between training and evaluation datasets. To address these issues, we propose Familiarity, a novel metric that captures both the semantic similarity between entity types in training and evaluation, as well as their frequency in the training data, to provide an estimate of label shift. It allows researchers to contextualize reported zero-shot NER scores when using custom synthetic training datasets. Further, it enables researchers to generate evaluation setups of various transfer difficulties for fine-grained analysis of zero-shot NER.

[NLP-19] Label-template based Few-Shot Text Classification with Contrastive Learning

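[Quick read]: This paper addresses the under-use of class labels in existing few-shot text classification methods, in particular the limitations of meta-learning frameworks built on prototype networks, which are easily affected by noise and depend on inter-class variance. The key to the solution is to embed class-label templates into the input sentences, making full use of the labels' semantic information and guiding the pre-trained model to generate more discriminative text representations. Under the continuous influence of label semantics, supervised contrastive learning models the interaction between support and query samples, and an attention mechanism replaces the averaging mechanism to highlight key semantic information.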

Link: https://arxiv.org/abs/2412.10110
Authors: Guanghua Hou, Shuhui Cao, Deqiang Ouyang, Ning Wang
Affiliations: unknown
Keywords: few-shot text classification, promising solution, few-shot text, text classification, text classification framework
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract: As an algorithmic framework for learning to learn, meta-learning provides a promising solution for few-shot text classification. However, most existing research fails to give enough attention to class labels. The traditional basic framework, which builds a meta-learner on prototype networks, relies heavily on inter-class variance and is easily influenced by noise. To address these limitations, we propose a simple and effective few-shot text classification framework. In particular, the corresponding label templates are embedded into input sentences to fully utilize the potential value of class labels, guiding the pre-trained model to generate more discriminative text representations through the semantic information conveyed by labels. With the continuous influence of label semantics, supervised contrastive learning is utilized to model the interaction information between support samples and query samples. Furthermore, the averaging mechanism is replaced with an attention mechanism to highlight vital semantic information. To verify the proposed scheme, four typical datasets are employed to assess the performance of different methods. Experimental results demonstrate that our method achieves substantial performance enhancements and outperforms existing state-of-the-art models on few-shot text classification tasks.
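
The difference between an averaged prototype and an attention-weighted one can be sketched in a few lines. This is only an illustration under assumptions: the abstract does not say how the attention scores are computed, so the dot product with the query embedding used below is a guess, and the function names are hypothetical.

```python
import math


def mean_prototype(support):
    """Classic prototype: element-wise mean of the support embeddings."""
    n = len(support)
    return [sum(vec[i] for vec in support) / n for i in range(len(support[0]))]


def attention_prototype(support, query):
    """Prototype as a softmax-weighted sum of support embeddings.

    Assumption: weights come from dot products between each support
    embedding and the query embedding.
    """
    scores = [sum(s * q for s, q in zip(vec, query)) for vec in support]
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(score - top) for score in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    return [
        sum(w * vec[i] for w, vec in zip(weights, support))
        for i in range(len(query))
    ]


support_set = [[1.0, 0.0], [0.0, 1.0]]
print(mean_prototype(support_set))
print(attention_prototype(support_set, query=[10.0, 0.0]))
```

With a query that strongly matches the first support sample, the attention prototype moves toward that sample, whereas the mean treats both samples equally. A noisy support sample therefore has less influence when it does not resemble the query.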

[NLP-20] MALAMUTE: A Multilingual Highly-granular Template-free Education-based Probing Dataset

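[Quick read]: This paper addresses the problem of evaluating how accurate and specialized language models' (LMs) knowledge is in the educational domain. Existing cloze-style benchmarks have three major limitations: 1) they do not cover the educational domain; 2) they mainly assess low-complexity, generic knowledge and cannot adequately test a model's knowledge in specific subjects; and 3) they rely on templates, which can bias model predictions. The proposed solution is MALAMUTE, a multilingual, template-free, highly granular probing dataset containing expert-written, peer-reviewed probes from 71 university-level textbooks, covering eight domains and their subdomains, for a total of 33,361 curriculum concepts and 116,887 prompts. The key to MALAMUTE is its fine-grained, education-oriented design and its inclusion of both sentence-level and paragraph-level prompts, which make it well suited to evaluating LMs' course-related knowledge.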

Link: https://arxiv.org/abs/2412.10105
Authors: Sagi Shaier, George Arthur Baker, Chiranthan Sridhar, Lawrence E Hunter, Katharina von der Wense
Affiliations: University of Colorado Boulder; University of Chicago, Department of Pediatrics; Johannes Gutenberg University Mainz
Keywords: broad domains, knowledge, MALAMUTE, specific subjects, domains
Categories: Computation and Language (cs.CL)
Comments:

Abstract:Language models (LMs) have excelled in various broad domains. However, to ensure their safe and effective integration into real-world educational settings, they must demonstrate proficiency in specific, granular areas of knowledge. Existing cloze-style benchmarks, commonly used to evaluate LMs’ knowledge, have three major limitations. They: 1) do not cover the educational domain; 2) typically focus on low-complexity, generic knowledge or broad domains, which do not adequately assess the models’ knowledge in specific subjects; and 3) often rely on templates that can bias model predictions. Here, we introduce MALAMUTE, a multilingual, template-free, and highly granular probing dataset comprising expert-written, peer-reviewed probes from 71 university-level textbooks across three languages (English, Spanish, and Polish). MALAMUTE is the first education-based cloze-style dataset. It covers eight domains, each with up to 14 subdomains, further broken down into concepts and concept-based prompts, totaling 33,361 university curriculum concepts and 116,887 prompts. MALAMUTE’s fine granularity, educational focus, and inclusion of both sentence-level and paragraph-level prompts make it an ideal tool for evaluating LMs’ course-related knowledge. Our evaluation of masked and causal LMs on MALAMUTE shows that despite overall proficiency, they have significant gaps in knowledge when examined closely on specific subjects, hindering their safe use in classrooms and underscoring the need for further development.

[NLP-21] RETQA: A Large-Scale Open-Domain Tabular Question Answering Dataset for Real Estate Sector AAAI2025

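[Quick read]: This paper addresses the lack of dedicated Tabular Question Answering datasets in the real estate domain. To this end, the authors introduce RETQA, the first large-scale open-domain Chinese tabular question answering dataset for real estate, covering three major domains (property information, real estate company finance information, and land auction information) with 4,932 tables and 20,762 question-answer pairs. The main challenges RETQA poses are long-table structures, open-domain retrieval, and multi-domain queries. To tackle them, the paper proposes the SLUTQA framework, which combines large language models with spoken language understanding tasks to improve retrieval and answering accuracy. Experiments show that SLUTQA significantly improves the performance of large language models on RETQA through in-context learning.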

Link: https://arxiv.org/abs/2412.10104
Authors: Zhensheng Wang, Wenmian Yang, Kun Zhou, Yiquan Zhang, Weijia Jia
Affiliations: unknown
Keywords: Tabular Question Answering, market relies heavily, estate market relies, Tabular Question, Question Answering
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: This paper is accepted by AAAI 2025

Abstract: The real estate market relies heavily on structured data, such as property details, market trends, and price fluctuations. However, the lack of specialized Tabular Question Answering datasets in this domain limits the development of automated question-answering systems. To fill this gap, we introduce RETQA, the first large-scale open-domain Chinese Tabular Question Answering dataset for Real Estate. RETQA comprises 4,932 tables and 20,762 question-answer pairs across 16 sub-fields within three major domains: property information, real estate company finance information and land auction information. Compared with existing tabular question answering datasets, RETQA poses greater challenges due to three key factors: long-table structures, open-domain retrieval, and multi-domain queries. To tackle these challenges, we propose the SLUTQA framework, which integrates large language models with spoken language understanding tasks to enhance retrieval and answering accuracy. Extensive experiments demonstrate that SLUTQA significantly improves the performance of large language models on RETQA through in-context learning. RETQA and SLUTQA provide essential resources for advancing tabular question answering research in the real estate domain, addressing critical challenges in open-domain and long-table question-answering. The dataset and code are publicly available at this https URL.

[NLP-22] AMuSeD: An Attentive Deep Neural Network for Multimodal Sarcasm Detection Incorporating Bi-modal Data Augmentation

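[Quick read]: This paper addresses the limited progress of multimodal computational methods for sarcasm detection caused by data scarcity. The key to the solution is AMuSeD, which tackles data scarcity with a two-phase bimodal data augmentation strategy. The first phase uses Back Translation to generate varied text samples. The second phase fine-tunes a FastSpeech 2-based speech synthesis system so that it retains sarcastic intonation and, alongside a cloud-based text-to-speech (TTS) service, produces the corresponding audio. The paper also investigates several attention mechanisms and finds self-attention to be the most efficient for fusing text and audio data. The approach reaches an F1-score of 81.0% in the text-audio setting.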

Link: https://arxiv.org/abs/2412.10103
Authors: Xiyuan Gao, Shubhi Bansal, Kushaan Gowda, Zhu Li, Shekhar Nayak, Nagendra Kumar, Matt Coler
Affiliations: Campus Fryslân, University of Groningen; Computer Science and Engineering, Indian Institute of Technology Indore; Computer Science, Columbia University
Keywords: including vocal tones, Detecting sarcasm effectively, MUltimodal Sarcasm dEtection, Detecting sarcasm, sarcasm detection
Categories: Computation and Language (cs.CL)
Comments: This is a preprint version of the paper, submitted and under review at the IEEE Transactions on Affective Computing

Abstract:Detecting sarcasm effectively requires a nuanced understanding of context, including vocal tones and facial expressions. The progression towards multimodal computational methods in sarcasm detection, however, faces challenges due to the scarcity of data. To address this, we present AMuSeD (Attentive deep neural network for MUltimodal Sarcasm dEtection incorporating bi-modal Data augmentation). This approach utilizes the Multimodal Sarcasm Detection Dataset (MUStARD) and introduces a two-phase bimodal data augmentation strategy. The first phase involves generating varied text samples through Back Translation from several secondary languages. The second phase involves the refinement of a FastSpeech 2-based speech synthesis system, tailored specifically for sarcasm to retain sarcastic intonations. Alongside a cloud-based Text-to-Speech (TTS) service, this Fine-tuned FastSpeech 2 system produces corresponding audio for the text augmentations. We also investigate various attention mechanisms for effectively merging text and audio data, finding self-attention to be the most efficient for bimodal integration. Our experiments reveal that this combined augmentation and attention approach achieves a significant F1-score of 81.0% in text-audio modalities, surpassing even models that use three modalities from the MUStARD dataset.

[NLP-23] HiTZ at VarDial 2025 NorSID: Overcoming Data Scarcity with Language Transfer and Automatic Data Annotation

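[Quick read]: This paper tackles three tasks on Norwegian dialects: Intent Detection, Slot Filling, and Dialect Identification. The key to the solution is a multitask model fine-tuned in a cross-lingual setting on the xSID dataset, which covers 17 languages. For Dialect Identification, the authors fine-tune a model on the provided development set, which obtained the highest scores in their experiments. The paper also highlights the domain-specificity of the dataset and the combination of languages as the main factors behind model performance.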

Link: https://arxiv.org/abs/2412.10095
Authors: Jaione Bengoetxea, Mikel Zubillaga, Ekhi Azurmendi, Maite Heredia, Julen Etxaniz, Markel Ferro, Jeremy Barnes
Affiliations: HiTZ Center – Ixa, University of the Basque Country (UPV/EHU); University of the Basque Country (UPV/EHU)
Keywords: NorSID Shared Task, Shared Task, Intent Detection, Slot Filling, VarDial Workshop
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Vardial 2025 NorSID Shared Task

Abstract:In this paper we present our submission for the NorSID Shared Task as part of the 2025 VarDial Workshop (Scherrer et al., 2025), consisting of three tasks: Intent Detection, Slot Filling and Dialect Identification, evaluated using data in different dialects of the Norwegian language. For Intent Detection and Slot Filling, we have fine-tuned a multitask model in a cross-lingual setting, to leverage the xSID dataset available in 17 languages. In the case of Dialect Identification, our final submission consists of a model fine-tuned on the provided development set, which has obtained the highest scores within our experiments. Our final results on the test set show that our models do not drop in performance compared to the development set, likely due to the domain-specificity of the dataset and the similar distribution of both subsets. Finally, we also report an in-depth analysis of the provided datasets and their artifacts, as well as other sets of experiments that have been carried out but did not yield the best results. Additionally, we present an analysis on the reasons why some methods have been more successful than others; mainly the impact of the combination of languages and domain-specificity of the training data on the results.

[NLP-24] Lost in the Middle and In-Between: Enhancing Language Models Ability to Reason Over Long Contexts in Multi-Hop QA

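[Quick read]: This paper addresses the “lost in the middle” problem, in which long-context language models under-use information in the middle of their inputs, particularly in multi-hop question answering, where models struggle to use several key pieces of information scattered across the input. The key to the solution is to reduce superfluous document content through knowledge graph triple extraction and summarization, and to use chain-of-thought prompting to guide the model toward more thorough reasoning, thereby alleviating the performance drop caused by dispersed information.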

Link: https://arxiv.org/abs/2412.10079
Authors: George Arthur Baker, Ankush Raut, Sagi Shaier, Lawrence E Hunter, Katharina von der Wense
Affiliations: University of Colorado Boulder; University of Chicago, Department of Pediatrics; Johannes Gutenberg University Mainz
Keywords: Previous work finds, recent long-context language, long-context language models, language models fail, Previous work
Categories: Computation and Language (cs.CL)
Comments:

Abstract: Previous work finds that recent long-context language models fail to make equal use of information in the middle of their inputs, preferring pieces of information located at the tail ends. This creates an undue bias in situations where we would like models to be equally capable of using different parts of the input. Thus far, the problem has mainly been considered in settings with single pieces of critical information, leading us to question what happens when multiple necessary pieces of information are spread out over the inputs. Here, we demonstrate the effects of the “lost in the middle” problem in the multi-hop question answering setting, in which multiple reasoning “hops” over disconnected documents are required, and show that performance degrades not only with respect to the distance of information from the edges of the context, but also between pieces of information. Additionally, we experiment with means of alleviating the problem by reducing superfluous document contents through knowledge graph triple extraction and summarization, and by prompting models to reason more thoroughly using chain-of-thought prompting.

[NLP-25] GAOKAO-Eval: Does high scores truly reflect strong capabilities in LLMs?

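[Quick read]: This paper addresses the concern that large language models (LLMs) may “game” existing benchmarks, for example through data leakage, so that high scores do not truly reflect alignment with human capabilities. The key to the solution is GAOKAO-Eval, a comprehensive benchmark based on China's National College Entrance Examination (Gaokao), on which representative models are evaluated in a “closed-book” setting. The study finds that even after addressing data leakage and comprehensiveness, high scores still fail to reflect the models' actual capabilities. To explain this, the paper applies the Rasch model from cognitive psychology to LLM scoring patterns and identifies two key discrepancies: 1) anomalously consistent performance across questions of different difficulty, and 2) high variance in performance on questions of similar difficulty. It also finds inconsistent grading of LLM-generated answers among teachers and recurring mistake patterns. These results show that GAOKAO-Eval can reveal limitations in LLM capabilities that current benchmarks miss, and they highlight the need for difficulty analysis that is better aligned with LLMs.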

Link: https://arxiv.org/abs/2412.10056
Authors: Zhikai Lei, Tianyi Liang, Hanglei Hu, Jin Zhang, Yunhua Zhou, Yunfan Shao, Linyang Li, Chenchui Li, Changbo Wang, Hang Yan, Qipeng Guo
Affiliations: Shanghai Artificial Intelligence Laboratory; East China Normal University; Harbin Institute of Technology; The Chinese University of Hong Kong
Keywords: Large Language Models, Large Language, higher scores implicitly, implicitly reflect stronger, reflect stronger human-like
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 10 pages, 13 figures

Abstract: Large Language Models (LLMs) are commonly evaluated using human-crafted benchmarks, under the premise that higher scores implicitly reflect stronger human-like performance. However, there is growing concern that LLMs may “game” these benchmarks due to data leakage, achieving high scores while struggling with tasks simple for humans. To substantively address the problem, we create GAOKAO-Eval, a comprehensive benchmark based on China's National College Entrance Examination (Gaokao), and conduct “closed-book” evaluations for representative models released prior to Gaokao. Contrary to prevailing consensus, even after addressing data leakage and comprehensiveness, GAOKAO-Eval reveals that high scores still fail to truly reflect human-aligned capabilities. To better understand this mismatch, we introduce the Rasch model from cognitive psychology to analyze LLM scoring patterns and identify two key discrepancies: 1) anomalously consistent performance across various question difficulties, and 2) high variance in performance on questions of similar difficulty. In addition, we identify inconsistent grading of LLM-generated answers among teachers and recurring mistake patterns. We find that these phenomena are well-grounded in the motivations behind OpenAI o1, and o1’s reasoning-as-difficulties can mitigate the mismatch. These results show that GAOKAO-Eval can reveal limitations in LLM capabilities not captured by current benchmarks and highlight the need for more LLM-aligned difficulty analysis.
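
For readers unfamiliar with the Rasch model, the sketch below shows its standard one-parameter logistic form, in which the probability of a correct answer depends only on the gap between a respondent's ability and an item's difficulty. The function and variable names are illustrative assumptions, and the paper's actual fitting procedure is not reproduced here.

```python
import math


def rasch_probability(ability: float, difficulty: float) -> float:
    """Rasch model: P(correct) = 1 / (1 + exp(-(ability - difficulty)))."""
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))


# A respondent that fits the model should succeed less often as items get harder.
ability = 1.0  # hypothetical ability estimate
for difficulty in (-2.0, 0.0, 2.0, 4.0):
    p = rasch_probability(ability, difficulty)
    print(f"difficulty={difficulty:+.1f} -> P(correct)={p:.3f}")
```

Under this model, accuracy falls monotonically as difficulty rises. The two discrepancies named in the abstract (flat performance across difficulty levels and high variance at similar difficulty) are departures from that expected curve.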

[NLP-26] Unsupervised Named Entity Disambiguation for Low Resource Domains EMNLP-2024

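[Quick read]: This paper addresses the poor performance of existing entity disambiguation methods in natural language processing and information retrieval when dealing with noisy text, low-resource settings, and domain-specific knowledge bases (KBs). The key to the solution is an unsupervised approach based on Group Steiner Trees (GST), which identifies the most relevant disambiguation candidates using the contextual similarities across the candidate entities of all mentions in a document. On multiple domain-specific datasets, the method improves Precision@1 by more than 40% over state-of-the-art unsupervised methods.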

Link: https://arxiv.org/abs/2412.10054
Authors: Debarghya Datta, Soumajit Pramanik
Affiliations: Indian Institute of Technology, Bhilai
Keywords: natural language processing, entity linking algorithms, domain-specific entity linking, information retrieval, increasingly apparent
Categories: Computation and Language (cs.CL)
Comments: Accepted in EMNLP-2024

Abstract:In the ever-evolving landscape of natural language processing and information retrieval, the need for robust and domain-specific entity linking algorithms has become increasingly apparent. It is crucial in a considerable number of fields such as humanities, technical writing and biomedical sciences to enrich texts with semantics and discover more knowledge. The use of Named Entity Disambiguation (NED) in such domains requires handling noisy texts, low resource settings and domain-specific KBs. Existing approaches are mostly inappropriate for such scenarios, as they either depend on training data or are not flexible enough to work with domain-specific KBs. Thus in this work, we present an unsupervised approach leveraging the concept of Group Steiner Trees (GST), which can identify the most relevant candidates for entity disambiguation using the contextual similarities across candidate entities for all the mentions present in a document. We outperform the state-of-the-art unsupervised methods by more than 40% (in avg.) in terms of Precision@1 across various domain-specific datasets.

[NLP-27] Automated Collection of Evaluation Dataset for Semantic Search in Low-Resource Domain Language COLING2025

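[Quick read]: This paper addresses the challenge of automatically collecting test datasets for evaluating semantic search in low-resource domain-specific languages, in particular German in the chemical industry domain. The key to the solution is an end-to-end annotation pipeline that runs from automated query generation to the score reassessment of query-document pairs. To overcome the lack of text encoders trained specifically for the German chemistry domain, the study uses an ensemble approach that combines the individual relevance scores of several “weak” text encoders trained on common knowledge datasets with relevance scores generated by a large language model (LLM) to reach a consensus on query-document alignment. Experiments show that the ensemble significantly improves alignment with human-assigned relevance scores and outperforms individual models on both inter-coder agreement and accuracy metrics, offering an effective way to adapt semantic search systems to low-resource domain-specific languages.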

Link: https://arxiv.org/abs/2412.10008
Authors: Anastasia Zhukova, Christian E. Matt, Bela Gipp
Affiliations: University of Göttingen; eschbach GmbH
Keywords: lot of specific, specific terminology, terminology often fall, Collecting test datasets, low-resource domain-specific German
Categories: Computation and Language (cs.CL)
Comments: accepted in the First Workshop on Language Models for Low-Resource Languages (LoResLM) co-located with the 31st International Conference on Computational Linguistics (COLING 2025)

Abstract: Domain-specific languages that use a lot of specific terminology often fall into the category of low-resource languages. Collecting test datasets in a narrow domain is time-consuming and requires skilled human resources with domain knowledge and training for the annotation task. This study addresses the challenge of automatically collecting test datasets to evaluate semantic search in the low-resource domain-specific German language of the process industry. Our approach proposes an end-to-end annotation pipeline that runs from automated query generation to the score reassessment of query-document pairs. To overcome the lack of text encoders trained in the German chemistry domain, we explore the principle of an ensemble of “weak” text encoders trained on common knowledge datasets. We combine the individual relevance scores of diverse models, which are used to retrieve document candidates, with relevance scores generated by an LLM, aiming to achieve consensus on query-document alignment. Evaluation results demonstrate that the ensemble method significantly improves alignment with human-assigned relevance scores, outperforming individual models in both inter-coder agreement and accuracy metrics. These findings suggest that ensemble learning can effectively adapt semantic search systems for specialized, low-resource languages, offering a practical solution to resource limitations in domain-specific contexts.
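
One simple way to picture the consensus step is a weighted blend of the averaged encoder scores and the LLM score. The sketch below is an assumption for illustration only: the abstract does not state how the scores are combined, and `ensemble_relevance`, `llm_weight`, and the [0, 1] score range are hypothetical.

```python
from statistics import mean


def ensemble_relevance(encoder_scores, llm_score, llm_weight=0.5):
    """Blend per-encoder relevance scores with an LLM relevance score.

    Assumes every score is already normalised to [0, 1]. The encoders are
    averaged first so that adding more encoders does not drown out the LLM.
    """
    if not encoder_scores:
        raise ValueError("at least one encoder score is required")
    if not 0.0 <= llm_weight <= 1.0:
        raise ValueError("llm_weight must lie in [0, 1]")
    encoder_consensus = mean(encoder_scores)
    return (1.0 - llm_weight) * encoder_consensus + llm_weight * llm_score


# Two "weak" encoders disagree mildly; the LLM rates the pair as relevant.
print(ensemble_relevance([0.2, 0.4], llm_score=0.9))
```

Averaging the encoders before blending is a design choice made for this sketch; a voting scheme or a learned weighting would be equally plausible readings of the abstract.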

[NLP-28] The role of inhibitory control in garden-path sentence processing: A Chinese-English bilingual perspective

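[Quick read]: This paper examines the role of inhibitory control (IC) in processing ambiguous sentences such as garden-path sentences, and in particular how it affects recovery from initial misinterpretations among Chinese-English bilinguals. Using self-paced reading tasks, the study compares how IC operates in first-language (Chinese) and second-language (English) processing. It finds that IC has little effect on garden-path recovery in the first language, suggesting that semantic context may reduce the need for IC. In second-language processing, the relationship between IC and language proficiency is complex: participants with low proficiency but high IC showed lingering misinterpretations, whereas high-proficiency participants did not. In addition, a comparison of three versions of the Stroop task identifies the L1 colour-word Stroop task as the preferred measure of IC in bilingual research.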

Link: https://arxiv.org/abs/2412.10006
Authors: Xiaohui Rao, Haoze Li, Xiaofang Lin, Lijuan Liang
Affiliations: unknown
Keywords: resolve competing interpretations, people must resolve, competing interpretations, linger despite reanalysis, resolve competing
Categories: Computation and Language (cs.CL)
Comments:

Abstract:In reading garden-path sentences, people must resolve competing interpretations, though initial misinterpretations can linger despite reanalysis. This study examines the role of inhibitory control (IC) in managing these misinterpretations among Chinese-English bilinguals. Using self-paced reading tasks, we investigated how IC influences recovery from garden-path sentences in Chinese (L1) and its interaction with language proficiency during English (L2) processing. Results indicate that IC does not affect garden-path recovery in Chinese, suggesting reliance on semantic context may reduce the need for IC. In contrast, findings for English L2 learners reveal a complex relationship between language proficiency and IC: Participants with low L2 proficiency but high IC showed lingering misinterpretations, while those with high proficiency exhibited none. These results support and extend the Model of Cognitive Control (Ness et al., 2023). Moreover, our comparison of three Stroop task versions identifies L1 colour-word Stroop task as the preferred measure of IC in bilingual research.

[NLP-29] A Comparative Study of LLMs, NMT Models, and Their Combination in Persian-English Idiom Translation

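[Quick read]: This paper investigates how different prompting methods and combinations of large language models (LLMs) with neural machine translation (NMT) systems affect the translation of idiomatic expressions. The key to the solution is the introduction of two parallel datasets of sentences containing idioms, one for Persian-to-English and one for English-to-Persian translation, on which the authors evaluate the translation accuracy and fluency of various LLMs, NMT models, and their combinations. The study also finds that automatic evaluation methods such as LLM-as-a-judge, BLEU, and BERTScore are effective for comparing model performance. Experiments show that Claude-3.5-Sonnet performs very well in both directions, that combining weaker LLMs with Google Translate improves English-to-Persian translation, and that Persian-to-English translation benefits from single prompts for simpler models and complex prompts for advanced ones.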

Link: https://arxiv.org/abs/2412.09993
Authors: Sara Rezaeimanesh, Faezeh Hosseini, Yadollah Yaghoobzadeh
Affiliations: unknown
Keywords: Large language models, figurative language compared, translating figurative language, shown superior capabilities, Large language
Categories: Computation and Language (cs.CL)
Comments:

Abstract: Large language models (LLMs) have shown superior capabilities in translating figurative language compared to neural machine translation (NMT) systems. However, the impact of different prompting methods and LLM-NMT combinations on idiom translation has yet to be thoroughly investigated. This paper introduces two parallel datasets of sentences containing idiomatic expressions for Persian → English and English → Persian translations, with Persian idioms sampled from our PersianIdioms resource, a collection of 2,200 idioms and their meanings. Using these datasets, we evaluate various open- and closed-source LLMs, NMT models, and their combinations. Translation quality is assessed through idiom translation accuracy and fluency. We also find that automatic evaluation methods like LLM-as-a-judge, BLEU and BERTScore are effective for comparing different aspects of model performance. Our experiments reveal that Claude-3.5-Sonnet delivers outstanding results in both translation directions. For English → Persian, combining weaker LLMs with Google Translate improves results, while Persian → English translations benefit from single prompts for simpler models and complex prompts for advanced ones.

[NLP-30] Small Language Model as Data Prospector for Large Language Model

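[Quick read]: This paper addresses how to efficiently select high-quality instruction data from large datasets in order to improve the performance of fine-tuned large language models (LLMs). The key to the solution is an improved algorithm, SuperNUGGETS, which uses a small language model (SLM) instead of a large language model (LLM) to filter for one-shot instances that significantly improve task performance, and which refines the predefined test set. With only a 1-2% drop in performance, the method is 58 times more efficient and consumes far fewer resources, which improves its overall practicality.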

Link: https://arxiv.org/abs/2412.09990
Authors: Shiwen Ni, Haihong Wu, Di Yang, Qiang Qu, Hamid Alinejad-Rokny, Min Yang
Affiliations: Shenzhen Key Laboratory for High Performance Data Mining; Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences; University of Science and Technology of China; The University of New South Wales; Shenzhen University of Advanced Technology
Keywords: Large Language Models, fine-tuned Large Language, data directly affects, instruction data directly
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract: The quality of instruction data directly affects the performance of fine-tuned Large Language Models (LLMs). Previously, Li et al. (2023) proposed NUGGETS, which selects high-quality data from a large dataset by identifying those individual instruction examples that can significantly improve the performance of different tasks after being learnt as one-shot instances. In this work, we propose SuperNUGGETS, an improved variant of NUGGETS optimised for efficiency and performance. SuperNUGGETS uses a small language model (SLM) instead of a large language model (LLM) to filter the data for outstanding one-shot instances and refines the predefined set of tests. The experimental results show that the performance of SuperNUGGETS decreases by only 1-2% compared to NUGGETS, while efficiency increases by a factor of 58. Compared to the original NUGGETS, SuperNUGGETS has a higher utility value due to its significantly lower resource consumption.

[NLP-31] Romanized to Native Malayalam Script Transliteration Using an Encoder-Decoder Framework

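[Quick read]: This paper addresses the conversion of romanized Malayalam into native script. The key to the solution is an encoder-decoder framework built on attention-based bidirectional long short-term memory (Bi-LSTM) networks. The model is trained on 4.3 million transliteration pairs combined from the publicly available Indic transliteration datasets Dakshina and Aksharantar, and it is evaluated on two test sets, one with general typing patterns and one with ad hoc typing patterns. The character error rate (CER) is 7.4% on general typing patterns and rises to 22.7% on ad hoc typing patterns, where most vowel indicators are missing.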

Link: https://arxiv.org/abs/2412.09957
Authors: Bajiyo Baiju, Kavya Manohar, Leena G Pillai, Elizabeth Sherly
Affiliations: Digital University Kerala
Keywords: Short Term Memory, Long Short Term, bidirectional Long Short, convert romanized Malayalam, attention-based bidirectional Long
Categories: Computation and Language (cs.CL)
Comments: 5 pages

Abstract: In this work, we present the development of a reverse transliteration model to convert romanized Malayalam to native script using an encoder-decoder framework built with an attention-based bidirectional Long Short Term Memory (Bi-LSTM) architecture. To train the model, we used a curated, combined collection of 4.3 million transliteration pairs derived from the publicly available Indic language transliteration datasets Dakshina and Aksharantar. We evaluated the model on two test sets provided by the IndoNLP-2025-Shared-Task, which contain (1) general typing patterns and (2) ad hoc typing patterns, respectively. On Test Set-1, we obtained a character error rate (CER) of 7.4%. However, on Test Set-2, with ad hoc typing patterns where most vowel indicators are missing, our model gave a CER of 22.7%.
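
Character error rate is conventionally computed as the Levenshtein edit distance between the reference and the model output, divided by the reference length. The sketch below follows that common definition; whether the authors count Unicode code points (as Python strings do here) or grapheme clusters is not stated in the abstract, so that choice is an assumption.

```python
def edit_distance(reference: str, hypothesis: str) -> int:
    """Levenshtein distance using a rolling row of the dynamic-programming table."""
    prev = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        curr = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            cost = 0 if ref_char == hyp_char else 1
            curr.append(min(
                prev[j] + 1,         # delete a reference character
                curr[j - 1] + 1,     # insert a hypothesis character
                prev[j - 1] + cost,  # substitute or match
            ))
        prev = curr
    return prev[-1]


def character_error_rate(reference: str, hypothesis: str) -> float:
    """CER = edit distance / number of characters in the reference."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


print(character_error_rate("abcd", "abxd"))
```

Because insertions count as errors, CER can exceed 1.0 when the hypothesis is much longer than the reference. For Malayalam, a single visible character may span several code points, so code-point CER and grapheme-level CER can differ.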

[NLP-32] Enhancing Nursing and Elderly Care with Large Language Models: An AI-Driven Framework

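[Quick read]: This paper explores how large language models (LLMs) can support efficient patient monitoring and interaction in nursing and elderly care. The key to the solution is a novel Chinese nursing dataset together with incremental pre-training (IPT) and supervised fine-tuning (SFT), which significantly improve LLM performance on specialized tasks. In addition, the authors use LangChain to build a dynamic nursing assistant capable of real-time care and personalized interventions, offering an AI-driven response to the growing healthcare demands of aging societies.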

Link: https://arxiv.org/abs/2412.09946
Authors: Qiao Sun, Jiexin Xie, Nanyang Ye, Qinying Gu, Shijie Guo
Affiliations: Academy for Engineering and Technology, Fudan University, China; Shanghai Artificial Intelligence Laboratory, China; Guilin University of Electronic Technology, China; Shanghai Jiao Tong University, China
Keywords: large language models, AI-driven patient monitoring, language models, monitoring and interaction, paper explores
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:This paper explores the application of large language models (LLMs) in nursing and elderly care, focusing on AI-driven patient monitoring and interaction. We introduce a novel Chinese nursing dataset and implement incremental pre-training (IPT) and supervised fine-tuning (SFT) techniques to enhance LLM performance in specialized tasks. Using LangChain, we develop a dynamic nursing assistant capable of real-time care and personalized interventions. Experimental results demonstrate significant improvements, paving the way for AI-driven solutions to meet the growing demands of healthcare in aging populations.

[NLP-33] Simulating Hard Attention Using Soft Attention

[Quick read]: This paper asks under what conditions transformers using soft attention can simulate hard attention, that is, effectively focus all attention on a subset of positions. The key result is that, with unbounded positional embeddings or temperature scaling, soft-attention transformers can compute formulas of linear temporal logic and simulate a large subclass of average-hard attention transformers, namely those with the uniform-tieless property.

Link: https://arxiv.org/abs/2412.09925
Authors: Andy Yang,Lena Strobl,David Chiang,Dana Angluin
Affiliations: University of Notre Dame; Umeå University; Yale University
Keywords: effectively focus, subset of positions, study conditions, hard attention, attention
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL); Formal Languages and Automata Theory (cs.FL)
Comments:

Abstract:We study conditions under which transformers using soft attention can simulate hard attention, that is, effectively focus all attention on a subset of positions. First, we examine several variants of linear temporal logic, whose formulas have previously been shown to be computable using hard attention transformers. We demonstrate how soft attention transformers can compute formulas of these logics using unbounded positional embeddings or temperature scaling. Second, we demonstrate how temperature scaling allows softmax transformers to simulate a large subclass of average-hard attention transformers, those that have what we call the uniform-tieless property.
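
The intuition behind temperature scaling can be shown numerically: dividing attention scores by a small temperature pushes softmax weights toward the positions with the maximum score, with ties sharing the weight equally, which is how average-hard attention behaves. The sketch below is only an illustration of that limit. It does not reproduce the paper's constructions or the uniform-tieless property, and the function names are made up for this example.

```python
import math


def softmax(scores, temperature=1.0):
    """Softmax over scores divided by a temperature (lower = sharper)."""
    scaled = [s / temperature for s in scores]
    m = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def average_hard_attention(scores):
    """Uniform weight on every position that attains the maximum score."""
    m = max(scores)
    winners = [1.0 if s == m else 0.0 for s in scores]
    count = sum(winners)
    return [w / count for w in winners]


scores = [1.0, 3.0, 3.0, 0.5]
for t in (1.0, 0.1, 0.01):
    # Weights concentrate on the two tied maxima as the temperature shrinks.
    print(t, [round(w, 4) for w in softmax(scores, temperature=t)])
print("hard", average_hard_attention(scores))
```

How small the temperature must be depends on the gap between the largest and second-largest scores, which is one reason conditions on the scores matter for this kind of simulation.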

[NLP-34] Low-Resource Fast Text Classification Based on Intra-Class and Inter-Class Distance Calculation

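[Quick read]: This paper addresses three practical limitations of text classification methods based on neural networks and pre-trained models: (1) existing methods focus mainly on the matching similarity between sentences and overlook the implicit, high-value information within sentences of the same class and across different classes; (2) pre-trained language models and graph-based methods consume large amounts of memory during training and text-graph construction; (3) low-resource methods can perform well but take too long to run. The key to the solution is LFTC, a low-resource and fast text classification model that builds a compressor list for each class to mine the regularities in intra-class data, removes redundant information irrelevant to the target classification to cut processing time, and finally classifies by computing the similarity distance between text pairs. Experiments show that LFTC improves significantly in both performance and processing time, especially when computational and data resources are limited.

Link: https://arxiv.org/abs/2412.09922
Authors: Yanxu Mao,Peipei Liu,Tiehan Cui,Congying Liu,Datao You
Affiliations: School of Software, Henan University, China; Institute of Information Engineering, Chinese Academy of Sciences, China; University of Chinese Academy of Sciences, China
Keywords: gained increasing attention, demonstrated excellent performance, recent years, based on neural, neural networks
Categories: Computation and Language (cs.CL)
Comments: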

Abstract:In recent years, text classification methods based on neural networks and pre-trained models have gained increasing attention and demonstrated excellent performance. However, these methods still have some limitations in practical applications: (1) They typically focus only on the matching similarity between sentences. However, there exists implicit high-value information both within sentences of the same class and across different classes, which is very crucial for classification tasks. (2) Existing methods such as pre-trained language models and graph-based approaches often consume substantial memory for training and text-graph construction. (3) Although some low-resource methods can achieve good performance, they often suffer from excessively long processing times. To address these challenges, we propose a low-resource and fast text classification model called LFTC. Our approach begins by constructing a compressor list for each class to fully mine the regularity information within intra-class data. We then remove redundant information irrelevant to the target classification to reduce processing time. Finally, we compute the similarity distance between text pairs for classification. We evaluate LFTC on 9 publicly available benchmark datasets, and the results demonstrate significant improvements in performance and processing time, especially under limited computational and data resources, highlighting its superior advantages.
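
The abstract does not spell out how the compressor list or the similarity distance is defined, so the sketch below is not LFTC. It shows the classic compression-based baseline that this family of low-resource methods builds on: normalized compression distance (NCD) computed with `zlib`, with each class represented by its concatenated training text. All names and the toy data are illustrative assumptions.

```python
import zlib


def compressed_size(text: str) -> int:
    """Length in bytes of the zlib-compressed UTF-8 encoding."""
    return len(zlib.compress(text.encode("utf-8")))


def ncd(a: str, b: str) -> float:
    """Normalized compression distance: small when a and b share structure."""
    ca, cb = compressed_size(a), compressed_size(b)
    cab = compressed_size(a + " " + b)
    return (cab - min(ca, cb)) / max(ca, cb)


def classify(text: str, class_corpora: dict) -> str:
    """Return the label whose concatenated class text is closest to `text`."""
    return min(class_corpora, key=lambda label: ncd(text, class_corpora[label]))


# Toy class corpora (assumed data, for illustration only).
corpora = {
    "sports": "the team won the football match and scored a goal " * 5,
    "finance": "the bank raised interest rates and the stock market fell " * 5,
}
print(classify("the team scored a goal in the football match", corpora))
```

This baseline needs no training and almost no memory, but it compresses once per class for every query, which is the kind of processing-time cost the abstract says LFTC aims to reduce.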

[NLP-35] Enhancing the Reasoning Capabilities of Small Language Models via Solution Guidance Fine-Tuning COLING2025

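[Quick read]: This paper addresses the weak performance of small language models (SLMs) on complex reasoning tasks, particularly in low-data settings where conventional Chain-of-Thought (CoT) fine-tuning depends on large amounts of training data and has limited effect. The key to the solution is a new reasoning strategy, Solution Guidance (SG), together with a plug-and-play training paradigm, Solution-Guidance Fine-Tuning (SGFT). SG emphasizes problem understanding and decomposition at the semantic and logical levels rather than specific computations, which improves the generalization and reasoning abilities of SLMs. With only a small amount of SG training data, SGFT can fine-tune an SLM to produce accurate problem-solving guidance, which can then be fed as a prompt to any SLM so that it generates the correct answer directly. This significantly improves SLM performance across reasoning tasks, especially in resource-constrained environments.

Link: https://arxiv.org/abs/2412.09906
Authors: Jing Bi,Yuting Wu,Weiwei Xing,Zhenjie Wei
Affiliations: Beijing Jiaotong University
Keywords: Large language models, demonstrated remarkable performance, Large language, demonstrated remarkable, wide range
Categories: Computation and Language (cs.CL)
Comments: 11 pages, 4 figures, to be published in The 31st International Conference on Computational Linguistics (COLING 2025)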

Abstract:Large language models (LLMs) have demonstrated remarkable performance across a wide range of tasks. Advances in prompt engineering and fine-tuning techniques have further enhanced their ability to address complex reasoning challenges. However, these advanced capabilities are often exclusive to models exceeding 100 billion parameters. Although Chain-of-Thought (CoT) fine-tuning methods have been explored for smaller models (under 10 billion parameters), they typically depend on extensive CoT training data, which can introduce inconsistencies and limit effectiveness in low-data settings. To overcome these limitations, this paper introduces a new reasoning strategy, Solution Guidance (SG), and a plug-and-play training paradigm, Solution-Guidance Fine-Tuning (SGFT), for enhancing the reasoning capabilities of small language models (SLMs). SG focuses on problem understanding and decomposition at the semantic and logical levels, rather than specific computations, which can effectively improve the SLMs’ generalization and reasoning abilities. With only a small amount of SG training data, SGFT can fine-tune an SLM to produce accurate problem-solving guidance, which can then be flexibly fed to any SLM as prompts, enabling it to generate correct answers directly. Experimental results demonstrate that our method significantly improves the performance of SLMs on various reasoning tasks, enhancing both their practicality and efficiency within resource-constrained environments.

[NLP-36] Analyzing Fairness of Computer Vision and Natural Language Processing Models

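[Quick read]: This paper addresses the bias and fairness problems that machine learning (ML) algorithms can exhibit in Computer Vision and Natural Language Processing (NLP) models, particularly on unstructured data, where biased predictions can reinforce existing systemic inequalities. The key to the solution is to evaluate and improve the fairness of these models using two leading fairness libraries, Fairlearn by Microsoft and AIF360 by IBM, for bias analysis and mitigation. These tools provide comprehensive frameworks for fairness analysis, including metric evaluation, result visualization, and bias mitigation techniques, allowing researchers and practitioners to measure bias levels in models, compare the effectiveness of the tools, and derive actionable recommendations for integrating fairness solutions into real-world applications.

Link: https://arxiv.org/abs/2412.09900
Authors: Ahmed Rashed,Abdelkrim Kallich,Mohamed Eltayeb
Affiliations: unknown
Keywords: Natural Language Processing, algorithms play, law enforcement, play a crucial, crucial role
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments: 16 pages, 1 table, 4 figures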

Abstract:Machine learning (ML) algorithms play a crucial role in decision making across diverse fields such as healthcare, finance, education, and law enforcement. Despite their widespread adoption, these systems raise ethical and social concerns due to potential biases and fairness issues. This study focuses on evaluating and improving the fairness of Computer Vision and Natural Language Processing (NLP) models applied to unstructured datasets, emphasizing how biased predictions can reinforce existing systemic inequalities. A publicly available dataset from Kaggle was utilized to simulate a practical scenario for examining fairness in ML workflows. To address and mitigate biases, the study employed two leading fairness libraries: Fairlearn by Microsoft, and AIF360 by IBM. These tools offer comprehensive frameworks for fairness analysis, including metrics evaluation, result visualization, and bias mitigation techniques. The research aims to measure bias levels in ML models, compare the effectiveness of these fairness libraries, and provide actionable recommendations for practitioners. The results demonstrate that each library possesses distinct strengths and limitations in evaluating and mitigating fairness. By systematically analyzing these tools, the study contributes valuable insights to the growing field of ML fairness, offering practical guidance for integrating fairness solutions into real world applications. This research underscores the importance of building more equitable and responsible machine learning systems.
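
The abstract does not say which fairness metrics the study reports. One common group-fairness metric that libraries of this kind expose is the demographic parity difference: the gap between the highest and lowest rate of positive predictions across groups. The sketch below computes it in plain Python so that the definition is visible; it does not call Fairlearn or AIF360, and the toy data are assumptions.

```python
from collections import defaultdict


def selection_rates(y_pred, groups):
    """Fraction of positive predictions (label 1) within each group."""
    counts = defaultdict(lambda: [0, 0])  # group -> [positives, total]
    for pred, group in zip(y_pred, groups):
        counts[group][0] += int(pred == 1)
        counts[group][1] += 1
    return {g: pos / total for g, (pos, total) in counts.items()}


def demographic_parity_difference(y_pred, groups):
    """Largest minus smallest group selection rate; 0 means parity."""
    rates = selection_rates(y_pred, groups)
    return max(rates.values()) - min(rates.values())


# Toy predictions for two groups (assumed data, for illustration only).
y_pred = [1, 1, 0, 1, 0, 0, 0, 1]
groups = ["a", "a", "a", "a", "b", "b", "b", "b"]
print(selection_rates(y_pred, groups))
print(demographic_parity_difference(y_pred, groups))
```

A value near zero says only that the groups receive positive predictions at similar rates. It says nothing about whether those predictions are equally accurate across groups, which is why fairness toolkits usually report several metrics together.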

[NLP-37] Benchmarking Table Comprehension In The Wild NEURIPS2024

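[Quick read]: This paper addresses the poor performance of large language models (LLMs) on complex documents that mix long tables with text, such as academic papers and financial reports. The key to the solution is a new benchmark, TableQuest, designed to evaluate the holistic table comprehension of LLMs in the naturally table-rich context of financial reports. TableQuest applies a rigorous data processing and filtering procedure so that its question-answer pairs are logical, reasonable, and diverse, and therefore closer to real-world use. The paper also notes that existing benchmarks focus on isolated tables and narrow skill sets (such as table recognition, data manipulation/calculation, and table summarization), whereas TableQuest requires these skills to be used together for a more complete assessment of model ability.

Link: https://arxiv.org/abs/2412.09884
Authors: Yikang Pan,Yi Zhu,Rand Xie,Yizhi Liu
Affiliations: Boson AI; University of Toronto
Keywords: Large Language Models, Large Language, lengthy table-text mixtures, limited success understanding, success understanding lengthy
Categories: Computation and Language (cs.CL)
Comments: Accepted at TRL Workshop@Neurips 2024. Link to data this https URL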

Abstract:Large Language Models (LLMs), while being increasingly dominant on a myriad of knowledge-intensive activities, have only had limited success understanding lengthy table-text mixtures, such as academic papers and financial reports. Recent advances of long-context LLMs have opened up new possibilities for this field. Nonetheless, we identify two roadblocks: (1) Prior benchmarks of table question answering (TableQA) have focused on isolated tables without context, making it hard to evaluate models in real-world scenarios. (2) Prior benchmarks have focused on some narrow skill sets of table comprehension such as table recognition, data manipulation/calculation, table summarization etc., while a skilled human employs those skills collectively. In this work, we introduce TableQuest, a new benchmark designed to evaluate the holistic table comprehension capabilities of LLMs in the natural table-rich context of financial reports. We employ a rigorous data processing and filtering procedure to ensure that the question-answer pairs are logical, reasonable, and diverse. We experiment with 7 state-of-the-art models, and find that despite reasonable accuracy in locating facts, they often falter when required to execute more sophisticated reasoning or multi-step calculations. We conclude with a qualitative study of the failure modes and discuss the challenges of constructing a challenging benchmark. We make the evaluation data, judging procedure and results of this study publicly available to facilitate research in this field.

[NLP-38] On the Limit of Language Models as Planning Formalizers

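[Quick read]: This paper addresses the failure of large language models (LLMs) to generate executable and verifiable plans in grounded environments. The key to the solution is to use the LLM as a formalizer that generates a complete formal representation of the planning domain (e.g., PDDL) instead of generating the plan directly. The LLM translates natural-language descriptions into PDDL, and a deterministic solver then finds a feasible plan. The paper also finds that sufficiently large models formalize effectively, outperforming models that generate plans directly, although performance declines as the descriptions become more natural-sounding.

Link: https://arxiv.org/abs/2412.09879
Authors: Cassie Huang,Li Zhang
Affiliations: Drexel University
Keywords: Large Language Models, Large Language, Language Models, shown to fail, fail to create
Categories: Computation and Language (cs.CL)
Comments: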

Abstract:Large Language Models have been shown to fail to create executable and verifiable plans in grounded environments. An emerging line of work shows success in using LLM as a formalizer to generate a formal representation (e.g., PDDL) of the planning domain, which can be deterministically solved to find a plan. We systematically evaluate this methodology while bridging some major gaps. While previous work only generates a partial PDDL representation given templated and thus unrealistic environment descriptions, we generate the complete representation given descriptions of various naturalness levels. Among an array of observations critical to improve LLMs’ formal planning ability, we note that large enough models can effectively formalize descriptions as PDDL, outperforming those directly generating plans, while being robust to lexical perturbation. As the descriptions become more natural-sounding, we observe a decrease in performance and provide detailed error analysis.

[NLP-39] Byte Latent Transformer: Patches Scale Better Than Tokens

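[Quick read]: This paper addresses the shortcomings of tokenization-based large language models (LLMs) in inference efficiency and robustness by proposing a new byte-level architecture, the Byte Latent Transformer (BLT). The key to the solution is to allocate compute by dynamically sizing byte patches: patch boundaries are determined by the entropy of the next byte, so more compute and model capacity go where the data is more complex. By operating directly on raw bytes, BLT avoids the limits of a fixed vocabulary, improves both training and inference efficiency, and, at a fixed inference cost, scales better than tokenization-based models.

Link: https://arxiv.org/abs/2412.09871
Authors: Artidoro Pagnoni,Ram Pasunuru,Pedro Rodriguez,John Nguyen,Benjamin Muller,Margaret Li,Chunting Zhou,Lili Yu,Jason Weston,Luke Zettlemoyer,Gargi Ghosh,Mike Lewis,Ari Holtzman,Srinivasan Iyer
Affiliations: unknown
Keywords: Byte Latent Transformer, Latent Transformer, matches tokenization-based LLM, byte-level LLM architecture, tokenization-based LLM performance
Categories: Computation and Language (cs.CL)
Comments: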

Abstract:We introduce the Byte Latent Transformer (BLT), a new byte-level LLM architecture that, for the first time, matches tokenization-based LLM performance at scale with significant improvements in inference efficiency and robustness. BLT encodes bytes into dynamically sized patches, which serve as the primary units of computation. Patches are segmented based on the entropy of the next byte, allocating more compute and model capacity where increased data complexity demands it. We present the first FLOP controlled scaling study of byte-level models up to 8B parameters and 4T training bytes. Our results demonstrate the feasibility of scaling models trained on raw bytes without a fixed vocabulary. Both training and inference efficiency improve due to dynamically selecting long patches when data is predictable, along with qualitative improvements on reasoning and long tail generalization. Overall, for fixed inference costs, BLT shows significantly better scaling than tokenization-based models, by simultaneously growing both patch and model size.
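
The idea of segmenting patches by next-byte entropy can be sketched with a toy estimator. Here the entropy of the next byte is estimated from bigram counts over the input itself, and a new patch starts wherever that estimate exceeds a threshold. The bigram estimator, the global threshold rule, and the function names are assumptions made for illustration; the abstract does not describe BLT's entropy model or boundary rule in this detail.

```python
import math
from collections import Counter, defaultdict


def bigram_next_byte_entropy(data: bytes) -> dict:
    """Entropy in bits of the next byte given the previous byte.

    Estimated from bigram counts of `data`; a toy stand-in for a learned
    byte-level model.
    """
    following = defaultdict(Counter)
    for prev, nxt in zip(data, data[1:]):
        following[prev][nxt] += 1
    entropy = {}
    for prev, counter in following.items():
        total = sum(counter.values())
        entropy[prev] = -sum(
            (c / total) * math.log2(c / total) for c in counter.values()
        )
    return entropy


def entropy_patches(data: bytes, threshold: float) -> list:
    """Split data into patches, opening a new patch at byte i when the
    estimated entropy of predicting byte i from byte i-1 exceeds threshold."""
    if not data:
        return []
    entropy = bigram_next_byte_entropy(data)
    patches, start = [], 0
    for i in range(1, len(data)):
        if entropy.get(data[i - 1], 0.0) > threshold:
            patches.append(data[start:i])
            start = i
    patches.append(data[start:])
    return patches


text = b"the cat sat on the mat. the cat sat on the hat."
print(entropy_patches(text, threshold=1.0))
```

Predictable stretches end up inside long patches and uncertain positions open new ones, which mirrors the abstract's point that compute should go where the data is harder to predict.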

[NLP-40] Human-Like Embodied AI Interviewer: Employing Android ERICA in Real International Conference COLING2025

[Quick read]: This paper addresses how to build a human-like embodied AI interviewer. The key is to integrate android robots equipped with advanced conversational capabilities, including attentive listening, conversational repair, and user fluency adaptation. The system can also analyze and present results after the interview. A real-world case study at SIGDIAL 2024 demonstrated the system's effectiveness and marked the first deployment of such a system at an international conference.

Link: https://arxiv.org/abs/2412.09867
Authors: Zi Haur Pang,Yahui Fu,Divesh Lala,Mikey Elmers,Koji Inoue,Tatsuya Kawahara
Affiliations: Graduate School of Informatics, Kyoto University, Japan
Keywords: including attentive listening, advanced conversational capabilities, user fluency adaptation, integrates android robots, android robots equipped
Categories: Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)
Comments: This paper has been accepted for demonstration presentation at International Conference on Computational Linguistics (COLING 2025)

Abstract:This paper introduces the human-like embodied AI interviewer which integrates android robots equipped with advanced conversational capabilities, including attentive listening, conversational repairs, and user fluency adaptation. Moreover, it can analyze and present results post-interview. We conducted a real-world case study at SIGDIAL 2024 with 42 participants, of whom 69% reported positive experiences. This study demonstrated the system’s effectiveness in conducting interviews just like a human and marked the first employment of such a system at an international conference. The demonstration video is available at this https URL.

[NLP-41] Financial Sentiment Analysis: Leveraging Actual and Synthetic Data for Supervised Fine-tuning

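[Quick read]: This paper addresses two problems in financial news sentiment analysis: general-purpose language models are too general for the domain, and existing fine-tuned models do not capture the maximum context width of financial text. The key to the solution is BertNSP-finance, which concatenates shorter financial sentences into longer ones to strengthen context, together with finbert-lc, which determines sentiment from digital text. Experiments show improved accuracy and F1 score on the Financial PhraseBank data at the 50% and 100% agreement levels.

Link: https://arxiv.org/abs/2412.09859
Authors: Abraham Atsiwo
Affiliations: University of Nevada, Reno
Keywords: Efficient Market Hypothesis, Market Hypothesis, Efficient Market, stock price movement, highlights the essence
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments: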

Abstract:The Efficient Market Hypothesis (EMH) highlights the essence of financial news in stock price movement. Financial news comes in the form of corporate announcements, news titles, and other forms of digital text. The generation of insights from financial news can be done with sentiment analysis. General-purpose language models are too general for sentiment analysis in finance. Curated labeled data for fine-tuning general-purpose language models are scarce, and existing fine-tuned models for sentiment analysis in finance do not capture the maximum context width. We hypothesize that using actual and synthetic data can improve performance. We introduce BertNSP-finance to concatenate shorter financial sentences into longer financial sentences, and finbert-lc to determine sentiment from digital text. The results show improved accuracy and F1 score on the Financial PhraseBank data at the 50% and 100% agreement levels.

[NLP-42] Low-Rank Adaptation with Task-Relevant Feature Enhancement for Fine-tuning Language Models AAAI2025

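[Quick read]: This paper addresses the performance gap that parameter-efficient fine-tuning methods such as LoRA show when pre-trained large language models learn new tasks. The key to the solution is Low-Rank Adaptation with Task-Relevant Feature Enhancement (LoRATRF), which uses a task-aware filter to selectively extract features relevant to the target task from the network's hidden representations and thereby strengthens task-relevant features. Experiments show that the method uses 33.71% fewer parameters while outperforming existing low-rank methods on a variety of datasets.

Link: https://arxiv.org/abs/2412.09827
Authors: Changqun Li,Chaofan Ding,Kexin Luan,Xinhan Di
Affiliations: unknown
Keywords: pre-trained large language, large language models, Fine-tuning pre-trained large, effectiveness and efficiency, pre-trained large
Categories: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Comments: 6 Pages, 3 figures accepted by AAAI 2025 CoLoRAI - Connecting Low-Rank Representations in AI Workshop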

Abstract:Fine-tuning pre-trained large language models in a parameter-efficient manner is widely studied for its effectiveness and efficiency. LoRA is one of the most widely used methods, which assumes that the optimization process is essentially low dimensional. Although LoRA has demonstrated commendable performance, there remains a significant performance gap between LoRA and full fine-tuning when learning new tasks. In this work, we propose Low-Rank Adaptation with Task-Relevant Feature Enhancement (LoRATRF) for enhancing task-relevant features from the perspective of editing neural network representations. To prioritize task-relevant features, we design a task-aware filter that selectively extracts valuable knowledge from hidden representations for the target or current task. As experiments on a variety of datasets covering NLU, commonsense reasoning, and mathematical reasoning tasks demonstrate, our method reduces parameters by 33.71% and achieves better performance than SOTA low-rank methods.
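
For readers unfamiliar with the baseline being extended, the sketch below shows the standard LoRA forward pass: the frozen weight W is left untouched and a trainable low-rank product B·A is added, scaled by alpha / r. The task-aware filter of LoRATRF is not specified in the abstract and is deliberately not modeled here; the helper names are illustrative.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def lora_forward(W, A, B, x, alpha=1.0):
    """y = W x + (alpha / r) * B (A x); W is frozen, A and B are trainable.

    Shapes: W is d_out x d_in, A is r x d_in, B is d_out x r.
    """
    r = len(A)  # the rank is the number of rows of A
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    return [b + (alpha / r) * u for b, u in zip(base, update)]


def trainable_params(d_out, d_in, r):
    """Parameters in A and B, compared with d_out * d_in for full tuning."""
    return r * (d_in + d_out)


W = [[1.0, 0.0], [0.0, 1.0]]
A = [[1.0, 1.0]]    # rank r = 1
B = [[0.0], [0.0]]  # zero-initialised, so the layer starts equal to W
print(lora_forward(W, A, B, [2.0, 3.0]))
print(trainable_params(4096, 4096, 8), "vs", 4096 * 4096)
```

Because B starts at zero, the adapted layer initially reproduces the pre-trained layer exactly, and only the r·(d_in + d_out) adapter parameters are updated during fine-tuning.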

[NLP-43] MERaLiON-AudioLLM: Technical Report

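[Quick read]: This paper addresses speech and text processing in Singapore's multilingual and multicultural setting, in particular the complexity of local accents and dialects. The key to the solution is MERaLiON-AudioLLM (Multimodal Empathetic Reasoning and Learning in One Network), a speech-text model designed specifically for Singapore that integrates advanced speech and text processing to handle the linguistic nuances of multilingual environments, improving both speech recognition and task-specific understanding.

Link: https://arxiv.org/abs/2412.09818
Authors: Yingxu He,Zhuohan Liu,Shuo Sun,Bin Wang,Wenyu Zhang,Xunlong Zou,Nancy F. Chen,Ai Ti Aw
Affiliations: Institute for Infocomm Research (I2R), A*STAR, Singapore
Keywords: Multimodal Empathetic Reasoning, Multimodal Empathetic, Empathetic Reasoning, Reasoning and Learning, speech-text model tailored
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: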

Abstract:We introduce MERaLiON-AudioLLM (Multimodal Empathetic Reasoning and Learning in One Network), the first speech-text model tailored for Singapore’s multilingual and multicultural landscape. Developed under the National Large Language Models Funding Initiative, Singapore, MERaLiON-AudioLLM integrates advanced speech and text processing to address the diverse linguistic nuances of local accents and dialects, enhancing accessibility and usability in complex, multilingual environments. Our results demonstrate improvements in both speech recognition and task-specific understanding, positioning MERaLiON-AudioLLM as a pioneering solution for region specific AI applications. We envision this release to set a precedent for future models designed to address localised linguistic and cultural contexts in a global framework.

[NLP-44] Enhancing Multimodal Large Language Models Complex Reason via Similarity Computation

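[Quick read]: This paper addresses the interpretability of multimodal large language models (LVLMs) on complex tasks such as chain-of-thought reasoning, where the model's internal mechanisms remain a hard-to-decipher black box. The key to the solution is a new image token reduction method, Simignore, which computes the similarity between image and text embeddings and ignores image tokens that are irrelevant or unimportant to the text, improving performance on complex reasoning tasks. Experiments confirm the method's effectiveness on such tasks.

Link: https://arxiv.org/abs/2412.09817
Authors: Xiaofeng Zhang,Fanshuo Zeng,Yihao Quan,Zheng Hui,Jiawei Yao
Affiliations: unknown
Keywords: Multimodal large language, experienced rapid growth, Multimodal large, large language models, rapid growth
Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Comments: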

Abstract:Multimodal large language models have experienced rapid growth, and numerous different models have emerged. The interpretability of LVLMs remains an under-explored area. Especially when faced with more complex tasks such as chain-of-thought reasoning, their internal mechanisms still resemble a black box that is difficult to decipher. By studying the interaction and information flow between images and text, we noticed that in models such as LLaVA1.5, image tokens that are semantically related to text are more likely to have information flow convergence in the LLM decoding layer, and these image tokens receive higher attention scores. However, those image tokens that are less relevant to the text do not have information flow convergence, and they only get very small attention scores. To efficiently utilize the image information, we propose a new image token reduction method, Simignore, which aims to improve the complex reasoning ability of LVLMs by computing the similarity between image and text embeddings and ignoring image tokens that are irrelevant and unimportant to the text. Through extensive experiments, we demonstrate the effectiveness of our method for complex reasoning tasks. The paper’s source code can be accessed from this https URL.
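
The selection step described above, scoring image tokens by their similarity to the text and ignoring the rest, can be sketched as follows. The scoring rule (maximum cosine similarity to any text token), the fixed budget `k`, and the function names are assumptions for illustration; the abstract does not specify how Simignore aggregates similarities or chooses how many tokens to keep.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def keep_relevant_image_tokens(image_tokens, text_tokens, k):
    """Return indices of the k image tokens most similar to the text.

    Each image token is scored by its best match among the text tokens;
    indices come back in their original order so positions are preserved.
    """
    scores = [max(cosine(img, txt) for txt in text_tokens) for img in image_tokens]
    ranked = sorted(range(len(image_tokens)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])


# Toy 2-d embeddings (assumed data, for illustration only).
image_tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]]
text_tokens = [[1.0, 0.0]]
print(keep_relevant_image_tokens(image_tokens, text_tokens, k=2))
```

In a real model the image and text embeddings would first need to live in a shared space, for example after the vision projector, before their similarity is meaningful.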

[NLP-45] ScaleOT: Privacy-utility-scalable Offsite-tuning with Dynamic LayerReplace and Selective Rank Compression AAAI2025

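[Quick read]: This paper addresses the adaptation degradation, high computational cost, and limited protection strength of existing privacy-preserving offsite-tuning methods. The key to the solution is the ScaleOT framework, which introduces a novel layerwise lossy compression algorithm that uses reinforcement learning to estimate the importance of each layer and replaces raw large language model (LLM) layers with lightweight networks called harmonizers. By combining original LLM layers and harmonizers in different ratios, ScaleOT produces emulators tailored to different model scales, strengthening privacy protection while achieving nearly lossless offsite-tuning performance. The paper also proposes a rank reduction method that further compresses the original LLM layers, significantly improving privacy with negligible impact on utility.

Link: https://arxiv.org/abs/2412.09812
Authors: Kai Yao,Zhaorui Tan,Tiandi Ye,Lichun Li,Yuan Zhao,Wenyan Liu,Wei Wang,Jianke Zhu
Affiliations: 1. Zhejiang University; 2. Zhejiang University of Technology; 3. Zhejiang University of Finance & Economics; 4. Zhejiang University of Science and Technology
Keywords: downstream task tuning, tuning large language, large language models, LLM layers, large language
Categories: Computation and Language (cs.CL); Cryptography and Security (cs.CR)
Comments: accepted by AAAI2025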

Abstract:Offsite-tuning is a privacy-preserving method for tuning large language models (LLMs) by sharing a lossy compressed emulator from the LLM owners with data owners for downstream task tuning. This approach protects the privacy of both the model and data owners. However, current offsite tuning methods often suffer from adaptation degradation, high computational costs, and limited protection strength due to uniformly dropping LLM layers or relying on expensive knowledge distillation. To address these issues, we propose ScaleOT, a novel privacy-utility-scalable offsite-tuning framework that effectively balances privacy and utility. ScaleOT introduces a novel layerwise lossy compression algorithm that uses reinforcement learning to obtain the importance of each layer. It employs lightweight networks, termed harmonizers, to replace the raw LLM layers. By combining important original LLM layers and harmonizers in different ratios, ScaleOT generates emulators tailored for optimal performance with various model scales for enhanced privacy protection. Additionally, we present a rank reduction method to further compress the original LLM layers, significantly enhancing privacy with negligible impact on utility. Comprehensive experiments show that ScaleOT can achieve nearly lossless offsite tuning performance compared with full fine-tuning while obtaining better model privacy.

[NLP-46] LLM Distillation for Efficient Few-Shot Multiple Choice Question Answering

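[Quick read]: This paper addresses few-shot learning for multiple choice question answering (MCQA), which is a key challenge because high-quality MCQA datasets are expensive to build. The key to the solution is to use large language models (LLMs) for data generation and scoring: the LLM generates MCQA data containing questions and choices and assigns probability scores to the choices, and the generated data and scores are then used to fine-tune a smaller, more efficient encoder-only model (DeBERTa-v3-base) with a distillation loss. On the Massive Multitask Language Understanding (MMLU) benchmark, the method raises accuracy from 28.9% to 39.3%, showing the effectiveness of LLM-driven data generation and knowledge distillation for few-shot MCQA.

Link: https://arxiv.org/abs/2412.09807
Authors: Patrick Sutanto,Joan Santoso
Affiliations: Institut Sains dan Teknologi Terpadu Surabaya (ISTTS)
Keywords: Multiple Choice Question, Choice Question Answering, Question Answering, Multiple Choice, numerous real-world applications
Categories: Computation and Language (cs.CL)
Comments: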

Abstract:Multiple Choice Question Answering (MCQA) is an important problem with numerous real-world applications, such as medicine, law, and education. The high cost of building MCQA datasets makes few-shot learning pivotal in this domain. While Large Language Models (LLMs) can enable few-shot learning, their direct application in real-world scenarios is often hindered by their high computational cost. To address this challenge, we propose a simple yet effective approach that uses LLMs for data generation and scoring. Our approach utilizes LLMs to create MCQA data which contains questions and choices, and to assign probability scores to the generated choices. We then use the generated data and LLM-assigned scores to finetune a smaller and more efficient encoder-only model, DeBERTa-v3-base by leveraging distillation loss. Extensive experiments on the Massive Multitask Language Understanding (MMLU) benchmark demonstrate that our method improves accuracy from 28.9% to 39.3%, representing a gain of over 10% compared to a baseline finetuned directly on 5-shot examples. This shows the effectiveness of LLM-driven data generation and knowledge distillation for few-shot MCQA.
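
The distillation step can be made concrete with a small example. The sketch assumes the loss is the KL divergence from the LLM-assigned probability scores over the answer choices (the teacher) to the student's softmax over its own logits; the abstract says only that a distillation loss is used, so the exact form and the names below are assumptions.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(teacher_probs, student_logits):
    """KL(teacher || student) over the answer choices of one question."""
    student_probs = softmax(student_logits)
    return sum(
        p * math.log(p / q)
        for p, q in zip(teacher_probs, student_probs)
        if p > 0  # terms with zero teacher probability contribute nothing
    )


# LLM-assigned scores for four choices (assumed values, for illustration).
teacher = [0.7, 0.1, 0.1, 0.1]
print(distillation_loss(teacher, [0.0, 0.0, 0.0, 0.0]))  # uninformed student
print(distillation_loss(teacher, [2.0, 0.0, 0.0, 0.0]))  # student closer to teacher
```

Training on soft scores instead of a single correct label passes along the teacher's relative confidence in each distractor, which is the usual motivation for distillation when labeled examples are scarce.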

[NLP-47] AutoPatent: A Multi-Agent Framework for Automatic Patent Generation

[Quick read]: This paper addresses the challenge of generating complete patent documents with generative AI, a long-text generation task. The key to the solution is AutoPatent, a multi-agent framework that combines an LLM-based planner agent, writer agents, and an examiner agent with PGTree and RRAG to produce long, intricate, high-quality complete patents. Experiments show that AutoPatent built on Qwen2.5-7B outperforms larger and more powerful LLMs such as GPT-4o, Qwen2.5-72B, and LLAMA3.1-70B at patent generation.

Link: https://arxiv.org/abs/2412.09796
Authors: Qiyao Wang,Shiwen Ni,Huaren Liu,Shule Lu,Guhong Chen,Xi Feng,Chi Wei,Qiang Qu,Hamid Alinejad-Rokny,Yuan Lin,Min Yang
Affiliations: Shenzhen Key Laboratory for High Performance Data Mining; Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences; Dalian University of Technology; Southern University of Science and Technology; Shenzhen University of Advanced Technology; The University of New South Wales
Keywords: language processing community, Large Language Models, natural language processing, Large Language, garnered increased attention
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 19 pages, 7 figures

Abstract:As the capabilities of Large Language Models (LLMs) continue to advance, the field of patent processing has garnered increased attention within the natural language processing community. However, the majority of research has been concentrated on classification tasks, such as patent categorization and examination, or on short text generation tasks like patent summarization and patent quizzes. In this paper, we introduce a novel and practical task known as Draft2Patent, along with its corresponding D2P benchmark, which challenges LLMs to generate full-length patents averaging 17K tokens based on initial drafts. Patents present a significant challenge to LLMs due to their specialized nature, standardized terminology, and extensive length. We propose a multi-agent framework called AutoPatent which leverages the LLM-based planner agent, writer agents, and examiner agent with PGTree and RRAG to generate lengthy, intricate, and high-quality complete patent documents. The experimental results demonstrate that our AutoPatent framework significantly enhances the ability to generate comprehensive patents across various LLMs. Furthermore, we have discovered that patents generated solely with the AutoPatent framework based on the Qwen2.5-7B model outperform those produced by larger and more powerful LLMs, such as GPT-4o, Qwen2.5-72B, and LLAMA3.1-70B, in both objective metrics and human evaluations. We will make the data and code available upon acceptance at this https URL.

[NLP-48] Semi-IIN: Semi-supervised Intra-inter modal Interaction Learning Network for Multimodal Sentiment Analysis

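[Quick read]: This paper addresses the high annotation cost and label ambiguity of multimodal sentiment analysis, as well as the fact that the importance of intra-modal and inter-modal interactions varies from sample to sample. The key to the solution is Semi-IIN (Semi-supervised Intra-inter modal Interaction learning Network), which integrates masked attention and gating mechanisms to capture intra-modal and inter-modal interaction information independently and then select between them dynamically. Combined with self-training, Semi-IIN makes full use of knowledge from unlabeled data and sets a new state of the art on several metrics on two public datasets, MOSI and MOSEI.

Link: https://arxiv.org/abs/2412.09784
Authors: Jinhao Lin,Yifei Wang,Yanwu Xu,Qi Liu
Affiliations: unknown
Keywords: fertile research ground, high annotation cost, labeled data acquisition, high-quality labeled data, multimodal sentiment analysis
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: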

Abstract:Although multimodal sentiment analysis is a fertile research ground that merits further investigation, current approaches incur high annotation costs and suffer from label ambiguity, both of which hinder the acquisition of high-quality labeled data. Furthermore, choosing the right interactions is essential because the significance of intra- or inter-modal interactions can differ among various samples. To this end, we propose Semi-IIN, a Semi-supervised Intra-inter modal Interaction learning Network for multimodal sentiment analysis. Semi-IIN integrates masked attention and gating mechanisms, enabling effective dynamic selection after independently capturing intra- and inter-modal interactive information. Combined with the self-training approach, Semi-IIN fully utilizes the knowledge learned from unlabeled data. Experimental results on two public datasets, MOSI and MOSEI, demonstrate the effectiveness of Semi-IIN, establishing a new state-of-the-art on several metrics. Code is available at this https URL.

[NLP-49] Memory Layers at Scale

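[Quick read]: This paper addresses how to improve language model performance on downstream tasks without giving up computational efficiency. The key to the solution is an improved memory layer that uses a trainable key-value lookup mechanism to add parameters to the model without increasing floating-point operations (FLOPs). Memory layers complement compute-heavy feed-forward layers by providing a cheap way to store and retrieve information, and the gains are especially pronounced on factual tasks. The paper demonstrates the usefulness of memory layers at scale, showing that they outperform dense models with more than twice the computation budget, as well as mixture-of-expert models matched for both compute and parameters.

Link: https://arxiv.org/abs/2412.09764
Authors: Vincent-Pierre Berges,Barlas Oğuz,Daniel Haziza,Wen-tau Yih,Luke Zettlemoyer,Gargi Gosh
Affiliations: Meta
Keywords: trainable key-value lookup, key-value lookup mechanism, add extra parameters, increasing FLOPs, trainable key-value
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: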

Abstract:Memory layers use a trainable key-value lookup mechanism to add extra parameters to a model without increasing FLOPs. Conceptually, sparsely activated memory layers complement compute-heavy dense feed-forward layers, providing dedicated capacity to store and retrieve information cheaply. This work takes memory layers beyond proof-of-concept, proving their utility at contemporary scale. On downstream tasks, language models augmented with our improved memory layer outperform dense models with more than twice the computation budget, as well as mixture-of-expert models when matched for both compute and parameters. We find gains are especially pronounced for factual tasks. We provide a fully parallelizable memory layer implementation, demonstrating scaling laws with up to 128B memory parameters, pretrained to 1 trillion tokens, comparing to base models with up to 8B parameters.
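
A sparsely activated key-value lookup can be sketched in a few lines: score the query against every key, keep only the top-k matches, and return the softmax-weighted sum of their values. The brute-force scoring and the function name are simplifications for readability; the abstract does not describe the improved layer's internals, and a layer at the reported scale would need a cheaper way to find the top keys.

```python
import math


def memory_lookup(query, keys, values, top_k):
    """Return the softmax-weighted sum of the values of the top_k keys.

    Only top_k value rows are read, so the cost of the read does not grow
    with the total number of stored (trainable) key-value pairs.
    """
    scores = [sum(q * k for q, k in zip(query, key)) for key in keys]
    top = sorted(range(len(keys)), key=lambda i: scores[i], reverse=True)[:top_k]
    m = max(scores[i] for i in top)  # stabilise the exponentials
    weights = [math.exp(scores[i] - m) for i in top]
    total = sum(weights)
    out = [0.0] * len(values[0])
    for w, i in zip(weights, top):
        for d, v in enumerate(values[i]):
            out[d] += (w / total) * v
    return out


# Toy memory with three slots (assumed data, for illustration only).
keys = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
values = [[10.0], [20.0], [30.0]]
print(memory_lookup([5.0, 0.0], keys, values, top_k=1))
print(memory_lookup([5.0, 0.0], keys, values, top_k=2))
```

Adding more key-value pairs increases parameter count while the number of values read per query stays at `top_k`, which is the sense in which memory layers add capacity without a matching increase in FLOPs.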

[NLP-50] GReaTer: Gradients over Reasoning Makes Smaller Language Models Strong Prompt Optimizers

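【Quick read】: This paper addresses the reliance of existing automated prompt-engineering methods on large language models (LLMs) for textual feedback and for identifying inference errors. Such methods are computationally expensive and cannot exploit more direct, finer-grained signals such as gradients. The key is GReaTer, a new prompt optimization technique that directly uses task-loss gradients over task-specific reasoning, letting open-source, lightweight language models optimize their own prompts without costly large LLMs. Extensive evaluation on diverse reasoning tasks shows that GReaTer outperforms existing state-of-the-art methods that rely on large LLMs, demonstrating the effectiveness of gradient-guided prompt optimization.

Link: https://arxiv.org/abs/2412.09722
Authors: Sarkar Snigdha Sarathi Das,Ryo Kamoi,Bo Pang,Yusen Zhang,Caiming Xiong,Rui Zhang
Institutions: The Pennsylvania State University; Salesforce Research
Keywords: closely tied, essential for enhancing, wide range, prompt optimization, making prompt optimization
Categories: Computation and Language (cs.CL)
Comments: 32 pages, 8 figures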

Abstract:The effectiveness of large language models (LLMs) is closely tied to the design of prompts, making prompt optimization essential for enhancing their performance across a wide range of tasks. Many existing approaches to automating prompt engineering rely exclusively on textual feedback, refining prompts based solely on inference errors identified by large, computationally expensive LLMs. Unfortunately, smaller models struggle to generate high-quality feedback, resulting in complete dependence on large LLM judgment. Moreover, these methods fail to leverage more direct and finer-grained information, such as gradients, due to operating purely in text space. To this end, we introduce GReaTer, a novel prompt optimization technique that directly incorporates gradient information over task-specific reasoning. By utilizing task loss gradients, GReaTer enables self-optimization of prompts for open-source, lightweight language models without the need for costly closed-source LLMs. This allows high-performance prompt optimization without dependence on massive LLMs, closing the gap between smaller models and the sophisticated reasoning often needed for prompt refinement. Extensive evaluations across diverse reasoning tasks including BBH, GSM8k, and FOLIO demonstrate that GReaTer consistently outperforms previous state-of-the-art prompt optimization methods, even those reliant on powerful LLMs. Additionally, GReaTer-optimized prompts frequently exhibit better transferability and, in some cases, boost task performance to levels comparable to or surpassing those achieved by larger language models, highlighting the effectiveness of prompt optimization guided by gradients over reasoning. Code of GReaTer is available at this https URL.

[NLP-51] Human vs. AI: A Novel Benchmark and a Comparative Study on the Detection of Generated Images and the Impact of Prompts COLING2025

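【Quick read】: This paper asks how to assess and improve the detection of AI-generated images, in particular under prompts of differing levels of detail. The key is studying how the prompt's level of detail affects detection performance. The authors build COCOXGEN, a dataset of real photos and synthetic images generated from prompts of two standardized lengths, and run both a user study and experiments with an AI detection model. They find that images generated from longer, more detailed prompts are more easily detected as fake. A heat map analysis also shows that AI detectors and humans attend to different details when identifying fake images.

Link: https://arxiv.org/abs/2412.09715
Authors: Philipp Moeßner,Heike Adel
Institution: Hochschule der Medien
Keywords: fully synthetic images, largely democratized, advent of publicly, process of creating, creating photorealistic
Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Comments: Accepted at Workshop on Detecting AI Generated Content (at COLING 2025)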

Abstract:With the advent of publicly available AI-based text-to-image systems, the process of creating photorealistic but fully synthetic images has been largely democratized. This can pose a threat to the public through a simplified spread of disinformation. Machine detectors and human media expertise can help to differentiate between AI-generated (fake) and real images and counteract this danger. Although AI generation models are highly prompt-dependent, the impact of the prompt on the fake detection performance has rarely been investigated yet. This work therefore examines the influence of the prompt’s level of detail on the detectability of fake images, both with an AI detector and in a user study. For this purpose, we create a novel dataset, COCOXGEN, which consists of real photos from the COCO dataset as well as images generated with SDXL and Fooocus using prompts of two standardized lengths. Our user study with 200 participants shows that images generated with longer, more detailed prompts are detected significantly more easily than those generated with short prompts. Similarly, an AI-based detection model achieves better performance on images generated with longer prompts. However, humans and AI models seem to pay attention to different details, as we show in a heat map analysis.

[NLP-52] Systematic Analysis of LLM Contributions to Planning: Solver, Verifier, Heuristic

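【Quick read】: This paper examines how large language models (LLMs) can be used to solve planning problems, analyzing their roles as problem solver, solution verifier, and heuristic guidance. The key finding is that although LLMs struggle to generate correct plans directly, they are much better at providing feedback signals on intermediate or incomplete solutions, in the form of comparative heuristic functions. The evaluation framework is intended to inform the design of future LLM-based tree-search algorithms, and the paper introduces a new benchmark for evaluating LLMs' ability to learn user preferences on the fly in practical settings.

Link: https://arxiv.org/abs/2412.09666
Authors: Haoming Li,Zhaoliang Chen,Songyuan Liu,Yiming Lu,Fei Liu
Institution: Emory University, Department of Computer Science
Keywords: large language models, language models, contribute to solving, solving planning problems, large language
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: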

Abstract:In this work, we provide a systematic analysis of how large language models (LLMs) contribute to solving planning problems. In particular, we examine how LLMs perform when they are used as problem solver, solution verifier, and heuristic guidance to improve intermediate solutions. Our analysis reveals that although it is difficult for LLMs to generate correct plans out-of-the-box, LLMs are much better at providing feedback signals to intermediate/incomplete solutions in the form of comparative heuristic functions. This evaluation framework provides insights into how future work may design better LLM-based tree-search algorithms to solve diverse planning and reasoning problems. We also propose a novel benchmark to evaluate LLM’s ability to learn user preferences on the fly, which has wide applications in practical settings.

[NLP-53] Evaluation Agent: Efficient and Promptable Evaluation Framework for Visual Generative Models

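【Quick read】: This paper addresses the high computational cost, rigid pipelines, and lack of user-tailored explanations in evaluating visual generative models. The key is the Evaluation Agent framework, which uses human-like strategies for efficient, dynamic, multi-round evaluation, needing only a few samples per round while providing detailed, user-tailored analyses. Its core advantages are: 1) efficiency, 2) promptable evaluation tailored to diverse user needs, 3) explainability beyond single numerical scores, and 4) scalability across various models and tools. Experiments show that the framework reduces evaluation time to 10% of traditional methods while delivering comparable results.

Link: https://arxiv.org/abs/2412.09645
Authors: Fan Zhang,Shulin Tian,Ziqi Huang,Yu Qiao,Ziwei Liu
Institutions: Shanghai Artificial Intelligence Laboratory; S-Lab, Nanyang Technological University
Keywords: Recent advancements, enabled high-quality image, opening diverse applications, enabled high-quality, Evaluation Agent
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: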

Abstract:Recent advancements in visual generative models have enabled high-quality image and video generation, opening diverse applications. However, evaluating these models often demands sampling hundreds or thousands of images or videos, making the process computationally expensive, especially for diffusion-based models with inherently slow sampling. Moreover, existing evaluation methods rely on rigid pipelines that overlook specific user needs and provide numerical results without clear explanations. In contrast, humans can quickly form impressions of a model’s capabilities by observing only a few samples. To mimic this, we propose the Evaluation Agent framework, which employs human-like strategies for efficient, dynamic, multi-round evaluations using only a few samples per round, while offering detailed, user-tailored analyses. It offers four key advantages: 1) efficiency, 2) promptable evaluation tailored to diverse user needs, 3) explainability beyond single numerical scores, and 4) scalability across various models and tools. Experiments show that Evaluation Agent reduces evaluation time to 10% of traditional methods while delivering comparable results. The Evaluation Agent framework is fully open-sourced to advance research in visual generative models and their efficient evaluation.

[NLP-54] NERsocial: Efficient Named Entity Recognition Dataset Construction for Human-Robot Interaction Utilizing RapidNER

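【Quick read】: This paper addresses the rapid deployment of named entity recognition (NER) systems in new domains. The key is RapidNER, a framework that enables rapid deployment through efficient dataset construction. Its three steps are: (1) extracting domain-specific sub-graphs and triples from a general knowledge graph; (2) collecting texts from various sources to build the NERsocial dataset, which focuses on entities typical in human-robot interaction; and (3) implementing an efficient annotation scheme using Elasticsearch (ES). Together these steps substantially speed up dataset creation, and the resulting dataset was validated by human annotators.

Link: https://arxiv.org/abs/2412.09634
Authors: Jesse Atuhurra,Hidetaka Kamigaito,Hiroki Ouchi,Hiroyuki Shindo,Taro Watanabe
Institution: Division of Information Science, NAIST, Japan
Keywords: poses significant challenges, Adapting named entity, domains poses significant, Adapting named, named entity recognition
Categories: Computation and Language (cs.CL); Robotics (cs.RO)
Comments: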

Abstract:Adapting named entity recognition (NER) methods to new domains poses significant challenges. We introduce RapidNER, a framework designed for the rapid deployment of NER systems through efficient dataset construction. RapidNER operates through three key steps: (1) extracting domain-specific sub-graphs and triples from a general knowledge graph, (2) collecting and leveraging texts from various sources to build the NERsocial dataset, which focuses on entities typical in human-robot interaction, and (3) implementing an annotation scheme using Elasticsearch (ES) to enhance efficiency. NERsocial, validated by human annotators, includes six entity types, 153K tokens, and 99.4K sentences, demonstrating RapidNER’s capability to expedite dataset creation.

[NLP-55] Assessing Personalized AI Mentoring with Large Language Models in the Computing Field

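【Quick read】: This paper studies how advanced large language models (LLMs) can provide personalized career mentoring in the computing field. The key is evaluating GPT-4, LLaMA 3, and Palm 2 in a zero-shot setting, combining a quantitative analysis through a natural language processing (NLP) analytics pipeline with a qualitative evaluation, to determine how well each model personalizes its mentoring across student profiles that differ in gender, race, and professional level. The results indicate that GPT-4 provides more personalized, accurate, and useful mentoring than the other two models. This lays a foundation for LLM-based personalized mentoring tools and underlines the value of involving human mentors in the process.

Link: https://arxiv.org/abs/2412.08430
Authors: Xiao Luo,Sean O’Connell,Shamima Mithun
Institutions: Oklahoma State University; Purdue University
Keywords: Large Language Models, Large Language, Language Models, distinct student profiles, computing field
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: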

Abstract:This paper provides an in-depth evaluation of three state-of-the-art Large Language Models (LLMs) for personalized career mentoring in the computing field, using three distinct student profiles that consider gender, race, and professional levels. We evaluated the performance of GPT-4, LLaMA 3, and Palm 2 using a zero-shot learning approach without human intervention. A quantitative evaluation was conducted through a custom natural language processing analytics pipeline to highlight the uniqueness of the responses and to identify words reflecting each student’s profile, including race, gender, or professional level. The analysis of frequently used words in the responses indicates that GPT-4 offers more personalized mentoring compared to the other two LLMs. Additionally, a qualitative evaluation was performed to see if human experts reached similar conclusions. The analysis of survey responses shows that GPT-4 outperformed the other two LLMs in delivering more accurate and useful mentoring while addressing specific challenges with encouragement languages. Our work establishes a foundation for developing personalized mentoring tools based on LLMs, incorporating human mentors in the process to deliver a more impactful and tailored mentoring experience.

[NLP-56] Chatbots im Schulunterricht: Wir testen das Fobizz-Tool zur automatischen Bewertung von Hausaufgaben

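【Quick read】: This paper evaluates the functional suitability of "AI Grading Assistant", an AI grading tool developed by the German company Fobizz, for use in education, in particular for supporting teachers in assessing student assignments and giving feedback. The study reveals significant shortcomings in practice: the tool's grades and qualitative feedback are often random, it frequently fails to detect false claims and nonsensical submissions, and its implementation of some grading criteria is opaque and unreliable. Because these deficiencies stem from inherent limitations of large language models (LLMs), fundamental improvements are not foreseeable in the short term. The paper also criticizes the trend of treating AI as a quick fix for problems in the education system, and calls Fobizz's marketing of the tool misleading and irresponsible. Finally, it calls for systematic evaluation and subject-specific pedagogical scrutiny of AI tools in educational settings.

Link: https://arxiv.org/abs/2412.06651
Authors: Rainer Muehlhoff,Marte Henningsen
Institution: Unknown
Keywords: German company Fobizz, German company, German language, AI-powered grading tool, German
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Emerging Technologies (cs.ET)
Comments: 33 pages, in German language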

Abstract:[Study in German language.] This study examines the AI-powered grading tool “AI Grading Assistant” by the German company Fobizz, designed to support teachers in evaluating and providing feedback on student assignments. Against the societal backdrop of an overburdened education system and rising expectations for artificial intelligence as a solution to these challenges, the investigation evaluates the tool’s functional suitability through two test series. The results reveal significant shortcomings: The tool’s numerical grades and qualitative feedback are often random and do not improve even when its suggestions are incorporated. The highest ratings are achievable only with texts generated by ChatGPT. False claims and nonsensical submissions frequently go undetected, while the implementation of some grading criteria is unreliable and opaque. Since these deficiencies stem from the inherent limitations of large language models (LLMs), fundamental improvements to this or similar tools are not immediately foreseeable. The study critiques the broader trend of adopting AI as a quick fix for systemic problems in education, concluding that Fobizz’s marketing of the tool as an objective and time-saving solution is misleading and irresponsible. Finally, the study calls for systematic evaluation and subject-specific pedagogical scrutiny of the use of AI tools in educational contexts.

[NLP-57] Transformative Influence of LLM and AI Tools in Student Social Media Engagement: Analyzing Personalization, Communication Efficiency, and Collaborative Learning

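【Quick read】: This paper explores how large language models (LLMs) and artificial intelligence (AI) tools are changing students' academic and social experiences on social media. The core question is how AI-driven social platforms can improve students' academic performance, critical thinking, and collaboration. The key is using AI algorithms to deliver personalized content and recommendations, filter out distracting content, and match students by academic interests and career goals, creating a supportive, intellectually stimulating online community. Based on an analysis of UniversityCube data, the study reports that these AI tools are associated with higher student satisfaction, academic performance, and engagement.

Link: https://arxiv.org/abs/2407.15012
Authors: Masoud Bashiri,Kamran Kowsari
Institution: University of Virginia
Keywords: Large Language Models, Artificial Intelligence, advent of Large, Language Models, social media
Categories: Computers and Society (cs.CY); Computation and Language (cs.CL); Machine Learning (cs.LG); Social and Information Networks (cs.SI)
Comments: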

Abstract:The advent of Large Language Models (LLMs) and Artificial Intelligence (AI) tools has revolutionized various facets of our lives, particularly in the realm of social media. For students, these advancements have unlocked unprecedented opportunities for learning, collaboration, and personal growth. AI-driven applications are transforming how students interact with social media, offering personalized content and recommendations, and enabling smarter, more efficient communication. Recent studies utilizing data from UniversityCube underscore the profound impact of AI tools on students’ academic and social experiences. These studies reveal that students engaging with AI-enhanced social media platforms report higher academic performance, enhanced critical thinking skills, and increased engagement in collaborative projects. Moreover, AI tools assist in filtering out distracting content, allowing students to concentrate more on educational materials and pertinent discussions. The integration of LLMs in social media has further facilitated improved peer-to-peer communication and mentorship opportunities. AI algorithms effectively match students based on shared academic interests and career goals, fostering a supportive and intellectually stimulating online community, thereby contributing to increased student satisfaction and retention rates. In this article, we delve into the data provided by UniversityCube to explore how LLMs and AI tools are specifically transforming social media for students. Through case studies and statistical analyses, we offer a comprehensive understanding of the educational and social benefits these technologies offer. Our exploration highlights the potential of AI-driven tools to create a more enriched, efficient, and supportive educational environment for students in the digital age. 

[NLP-58] Formal Languages and TQFTs with Defects

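【Quick read】: This paper relates finite state automata to one-dimensional topological quantum field theories with defects (1D TQFTs with defects). The key is proving that this construction is functorial with respect to the category of finite state automata with transducers as morphisms, and extending it through cohomological structures and a categorical treatment of context-free grammars. Specifically, certain classes of subregular languages correspond to additional cohomological structures on the associated TQFTs, and the construction generalizes to context-free grammars through the categorical Chomsky-Schützenberger representation theorem of Melliès and Zeilberger. The corresponding TQFTs are described as morphisms of colored operads on an operad of cobordisms with defects.

Link: https://arxiv.org/abs/2412.09688
Authors: Luisa Boateng,Matilde Marcolli
Institutions: Stanford University; California Institute of Technology
Keywords: assigns a Boolean, developed by Gustafson, finite state automaton, finite state, automaton was recently
Categories: Mathematical Physics (math-ph); Computation and Language (cs.CL); Quantum Algebra (math.QA)
Comments: 28 pages, 9 figures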

Abstract:A construction that assigns a Boolean 1D TQFT with defects to a finite state automaton was recently developed by Gustafson, Im, Kaldawy, Khovanov, and Lihn. We show that the construction is functorial with respect to the category of finite state automata with transducers as morphisms. Certain classes of subregular languages correspond to additional cohomological structures on the associated TQFTs. We also show that the construction generalizes to context-free grammars through a categorical version of the Chomsky-Schützenberger representation theorem, due to Melliès and Zeilberger. The corresponding TQFTs are then described as morphisms of colored operads on an operad of cobordisms with defects.
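One way to get a feel for the Boolean flavor of this construction is to evaluate an automaton with Boolean matrices: each letter acts as a transition matrix on the state set, and a word is accepted when the product, paired with the initial and final state vectors, evaluates to true. The sketch below is only that reading. The helper names and the example automaton (even number of `a`s) are illustrative assumptions, and nothing here models cobordisms, transducers, or the cohomological structures discussed in the paper.

```python
def bool_matmul(a, b):
    """Multiply two matrices over the Boolean semiring (OR of ANDs)."""
    return [
        [any(a[i][k] and b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def accepts(initial, transitions, final, word):
    """Return True if the automaton accepts `word`.

    `initial` and `final` are Boolean vectors over the states; `transitions`
    maps each letter to a Boolean state-by-state matrix.
    """
    row = [initial]  # 1 x n row vector of currently reachable states
    for letter in word:
        row = bool_matmul(row, transitions[letter])
    return any(r and f for r, f in zip(row[0], final))


# Hypothetical example: words over {a, b} with an even number of a's.
transitions = {
    "a": [[False, True], [True, False]],   # 'a' swaps the even and odd states
    "b": [[True, False], [False, True]],   # 'b' leaves the state unchanged
}
initial = [True, False]
final = [True, False]

even = accepts(initial, transitions, final, "abba")
odd = accepts(initial, transitions, final, "ab")
```

Reading a word left to right corresponds to composing the letter matrices in order, which is why concatenating words matches multiplying their matrices.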

Computer Vision

[CV-0] GaussianWorld: Gaussian World Model for Streaming 3D Occupancy Prediction

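【Quick read】: This paper addresses 3D occupancy prediction for autonomous driving, in particular how to use priors from scene evolution to improve accuracy. Existing methods typically fuse representations from previous frames to infer the current 3D occupancy, but they overlook the continuity of driving scenarios and the strong prior given by the evolution of 3D scenes (for example, only dynamic objects move). The key is reformulating 3D occupancy prediction as a 4D occupancy forecasting problem conditioned on the current sensor input, and decomposing scene evolution into three factors: ego-motion alignment of static scenes, local movements of dynamic objects, and completion of newly observed scenes. Concretely, a Gaussian world model (GaussianWorld) explicitly exploits these priors in 3D Gaussian space, together with the current RGB observation, to infer scene evolution, improving on the single-frame counterpart without introducing additional computation.

Link: https://arxiv.org/abs/2412.10373
Authors: Sicheng Zuo,Wenzhao Zheng,Yuanhui Huang,Jie Zhou,Jiwen Lu
Keywords: autonomous driving due, important for autonomous, occupancy prediction, scene evolution, occupancy
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Code is available at: this https URL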

Abstract:3D occupancy prediction is important for autonomous driving due to its comprehensive perception of the surroundings. To incorporate sequential inputs, most existing methods fuse representations from previous frames to infer the current 3D occupancy. However, they fail to consider the continuity of driving scenarios and ignore the strong prior provided by the evolution of 3D scenes (e.g., only dynamic objects move). In this paper, we propose a world-model-based framework to exploit the scene evolution for perception. We reformulate 3D occupancy prediction as a 4D occupancy forecasting problem conditioned on the current sensor input. We decompose the scene evolution into three factors: 1) ego motion alignment of static scenes; 2) local movements of dynamic objects; and 3) completion of newly-observed scenes. We then employ a Gaussian world model (GaussianWorld) to explicitly exploit these priors and infer the scene evolution in the 3D Gaussian space considering the current RGB observation. We evaluate the effectiveness of our framework on the widely used nuScenes dataset. Our GaussianWorld improves the performance of the single-frame counterpart by over 2% in mIoU without introducing additional computations. Code: this https URL.

[CV-1] UniMed-CLIP: Towards a Unified Image-Text Pretraining Paradigm for Diverse Medical Imaging Modalities UAI

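【Quick read】: This paper addresses the limited applicability of Vision-Language Models (VLMs) in the medical domain, which stems mainly from the scarcity of large-scale, openly accessible medical image-text datasets. The key is UniMed, a large-scale, open-source multi-modal medical dataset of over 5.3 million image-text pairs spanning six imaging modalities (X-ray, CT, MRI, Ultrasound, Pathology, Fundus). UniMed uses Large Language Models (LLMs) to convert modality-specific classification datasets into image-text format and incorporates existing medical image-text data, enabling scalable VLM pretraining. UniMed-CLIP, trained on UniMed, significantly outperforms existing generalist VLMs and matches modality-specific medical VLMs, with notable gains in zero-shot evaluation.

Link: https://arxiv.org/abs/2412.10372
Authors: Muhammad Uzair Khattak,Shahina Kunhimon,Muzammal Naseer,Salman Khan,Fahad Shahbaz Khan
Keywords: natural image tasks, achieved notable success, image tasks, contrastive learning, learning have achieved
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Code, models and demo available at this https URL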

Abstract:Vision-Language Models (VLMs) trained via contrastive learning have achieved notable success in natural image tasks. However, their application in the medical domain remains limited due to the scarcity of openly accessible, large-scale medical image-text datasets. Existing medical VLMs either train on closed-source proprietary or relatively small open-source datasets that do not generalize well. Similarly, most models remain specific to a single or limited number of medical imaging domains, again restricting their applicability to other modalities. To address this gap, we introduce UniMed, a large-scale, open-source multi-modal medical dataset comprising over 5.3 million image-text pairs across six diverse imaging modalities: X-ray, CT, MRI, Ultrasound, Pathology, and Fundus. UniMed is developed using a data-collection framework that leverages Large Language Models (LLMs) to transform modality-specific classification datasets into image-text formats while incorporating existing image-text data from the medical domain, facilitating scalable VLM pretraining. Using UniMed, we trained UniMed-CLIP, a unified VLM for six modalities that significantly outperforms existing generalist VLMs and matches modality-specific medical VLMs, achieving notable gains in zero-shot evaluations. For instance, UniMed-CLIP improves over BiomedCLIP (trained on proprietary data) by an absolute gain of +12.61, averaged over 21 datasets, while using 3x less training data. To facilitate future research, we release UniMed dataset, training codes, and models at this https URL.

[CV-2] GaussianAD: Gaussian-Centric End-to-End Autonomous Driving

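【Quick read】: This paper addresses the trade-off between comprehensiveness and efficiency in vision-based autonomous driving, where existing methods use either dense representations (such as bird's eye view) or sparse representations (such as instance boxes). The key is GaussianAD, a Gaussian-centric end-to-end autonomous driving framework that uses 3D semantic Gaussians to describe the scene extensively yet sparsely. The method initializes uniform 3D Gaussians, progressively refines them with surrounding-view images to obtain a 3D Gaussian scene representation, and applies sparse convolutions for efficient 3D perception (such as 3D detection and semantic map construction). It then predicts 3D flows for Gaussians with dynamic semantics and plans the ego trajectory with a future-scene-forecasting objective. GaussianAD can be trained end to end, and experiments on nuScenes verify its effectiveness on motion planning, 3D occupancy prediction, and 4D occupancy forecasting.

Link: https://arxiv.org/abs/2412.10371
Authors: Wenzhao Zheng,Junjie Wu,Yao Zheng,Sicheng Zuo,Zixun Xie,Longchao Yang,Yong Pan,Zhihui Hao,Peng Jia,Xianpeng Lang,Shanghang Zhang
Keywords: shows great potential, great potential due, Vision-based autonomous driving, driving shows great, Vision-based autonomous
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Robotics (cs.RO)
Comments: Code is available at: this https URL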

Abstract:Vision-based autonomous driving shows great potential due to its satisfactory performance and low costs. Most existing methods adopt dense representations (e.g., bird’s eye view) or sparse representations (e.g., instance boxes) for decision-making, which suffer from the trade-off between comprehensiveness and efficiency. This paper explores a Gaussian-centric end-to-end autonomous driving (GaussianAD) framework and exploits 3D semantic Gaussians to extensively yet sparsely describe the scene. We initialize the scene with uniform 3D Gaussians and use surrounding-view images to progressively refine them to obtain the 3D Gaussian scene representation. We then use sparse convolutions to efficiently perform 3D perception (e.g., 3D detection, semantic map construction). We predict 3D flows for the Gaussians with dynamic semantics and plan the ego trajectory accordingly with an objective of future scene forecasting. Our GaussianAD can be trained in an end-to-end manner with optional perception labels when available. Extensive experiments on the widely used nuScenes dataset verify the effectiveness of our end-to-end GaussianAD on various tasks including motion planning, 3D occupancy prediction, and 4D occupancy forecasting. Code: this https URL.

[CV-3] OP-LoRA: The Blessing of Dimensionality

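【Quick read】: This paper addresses the optimization difficulties of low-rank adapters when fine-tuning large models, in particular poor convergence. The key is an over-parameterized approach that reparameterizes low-rank adaptation with a separate MLP (multi-layer perceptron) and a learned embedding for each layer. This speeds up optimization during training because it implicitly acts as an adaptive learning rate and momentum. At inference time the MLP can be discarded, leaving a standard low-rank adapter with no added inference cost. On matrix factorization, a small but difficult proxy task, the method converges faster and reaches a lower final loss. On larger-scale tasks it shows consistent gains, with notable improvements on vision-language and image generation tasks.

Link: https://arxiv.org/abs/2412.10362
Authors: Piotr Teterwak,Kate Saenko,Bryan A. Plummer,Ser-Nam Lim
Keywords: reducing storage costs, adapters enable fine-tuning, catastrophic forgetting, enable fine-tuning, fine-tuning of large
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments: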

Abstract:Low-rank adapters enable fine-tuning of large models with only a small number of parameters, thus reducing storage costs and minimizing the risk of catastrophic forgetting. However, they often pose optimization challenges, with poor convergence. To overcome these challenges, we introduce an over-parameterized approach that accelerates training without increasing inference costs. This method reparameterizes low-rank adaptation by employing a separate MLP and learned embedding for each layer. The learned embedding is input to the MLP, which generates the adapter parameters. Such overparameterization has been shown to implicitly function as an adaptive learning rate and momentum, accelerating optimization. At inference time, the MLP can be discarded, leaving behind a standard low-rank adapter. To study the effect of MLP overparameterization on a small yet difficult proxy task, we implement it for matrix factorization, and find it achieves faster convergence and lower final loss. Extending this approach to larger-scale tasks, we observe consistent performance gains across domains. We achieve improvements in vision-language tasks and especially notable increases in image generation, with CMMD scores improving by up to 15 points.
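The reparameterization in the abstract (a learned embedding fed to an MLP that emits the adapter parameters, with the MLP dropped at inference) can be sketched roughly as follows. This is a minimal illustration under assumptions: the two-layer ReLU MLP, the layer sizes, the way the output is split into factors `a` and `b`, and all names are hypothetical, and the training loop is omitted.

```python
import random


def matvec(matrix, vector):
    """Plain matrix-vector product on nested lists."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def generate_adapter(embedding, w1, w2, d_in, d_out, rank):
    """Run the (hypothetical) MLP on the embedding and reshape its output
    into low-rank factors a (rank x d_in) and b (d_out x rank)."""
    hidden = [max(0.0, h) for h in matvec(w1, embedding)]  # ReLU hidden layer
    flat = matvec(w2, hidden)
    split = rank * d_in
    a = [flat[i * d_in:(i + 1) * d_in] for i in range(rank)]
    b = [flat[split + i * rank:split + (i + 1) * rank] for i in range(d_out)]
    return a, b


def adapter_delta(a, b, x):
    """Low-rank update b @ (a @ x) that is added to the frozen layer's output."""
    return matvec(b, matvec(a, x))


rng = random.Random(0)
d_in, d_out, rank, emb_dim, hidden_dim = 3, 2, 1, 4, 5
embedding = [rng.uniform(-1, 1) for _ in range(emb_dim)]
w1 = [[rng.uniform(-1, 1) for _ in range(emb_dim)] for _ in range(hidden_dim)]
w2 = [[rng.uniform(-1, 1) for _ in range(hidden_dim)]
      for _ in range(rank * d_in + d_out * rank)]

# During training, gradients would flow into embedding, w1 and w2.
a, b = generate_adapter(embedding, w1, w2, d_in, d_out, rank)

# At inference, only the generated factors are kept; the MLP is discarded.
del embedding, w1, w2
delta = adapter_delta(a, b, [1.0, 2.0, 3.0])
```

After the `del`, the remaining state is just the two factors, so the inference path is the same size as an ordinary low-rank adapter.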

[CV-4] Apollo: An Exploration of Video Understanding in Large Multimodal Models

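【Quick read】: This paper addresses the poorly understood mechanisms behind video understanding in Large Multimodal Models (LMMs), and the high computational cost and limited open research that hinder video-LMM development. The key is the discovery of "Scaling Consistency": design and training decisions made on smaller models and datasets (up to a critical size) transfer effectively to larger models. Building on this, the paper studies video-specific aspects such as video sampling, architectures, data composition, and training schedules, and introduces the Apollo family of models, which achieve superior performance across model sizes and can perceive hour-long videos efficiently.

Link: https://arxiv.org/abs/2412.10360
Authors: Orr Zohar,Xiaohan Wang,Yann Dubois,Nikhil Mehta,Tong Xiao,Philippe Hansen-Estruch,Licheng Yu,Xiaofang Wang,Felix Juefei-Xu,Ning Zhang,Serena Yeung-Levy,Xide Xia
Keywords: Large Multimodal Models, Large Multimodal, remain poorly understood, underlying mechanisms driving, capabilities into Large
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: this https URL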

Abstract:Despite the rapid integration of video perception capabilities into Large Multimodal Models (LMMs), the underlying mechanisms driving their video understanding remain poorly understood. Consequently, many design decisions in this domain are made without proper justification or analysis. The high computational cost of training and evaluating such models, coupled with limited open research, hinders the development of video-LMMs. To address this, we present a comprehensive study that helps uncover what effectively drives video understanding in LMMs. We begin by critically examining the primary contributors to the high computational requirements associated with video-LMM research and discover Scaling Consistency, wherein design and training decisions made on smaller models and datasets (up to a critical size) effectively transfer to larger models. Leveraging these insights, we explored many video-specific aspects of video-LMMs, including video sampling, architectures, data composition, training schedules, and more. For example, we demonstrated that fps sampling during training is vastly preferable to uniform frame sampling and which vision encoders are the best for video representation. Guided by these findings, we introduce Apollo, a state-of-the-art family of LMMs that achieve superior performance across different model sizes. Our models can perceive hour-long videos efficiently, with Apollo-3B outperforming most existing 7B models with an impressive 55.1 on LongVideoBench. Apollo-7B is state-of-the-art compared to 7B LMMs with a 70.9 on MLVU, and 63.3 on Video-MME.
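The difference between the two sampling schemes named in the abstract can be made concrete with a small sketch. The function names, the center-of-segment rule for uniform sampling, and the example clip lengths are assumptions for illustration and are not taken from the paper.

```python
def uniform_sample(num_frames, n):
    """Pick n frame indices spread evenly over the clip, whatever its duration."""
    if n >= num_frames:
        return list(range(num_frames))
    step = num_frames / n
    return [int(step * i + step / 2) for i in range(n)]


def fps_sample(num_frames, native_fps, target_fps):
    """Pick frames at a fixed rate in time (target_fps frames per second)."""
    stride = native_fps / target_fps
    indices = []
    i = 0
    while int(i * stride) < num_frames:
        indices.append(int(i * stride))
        i += 1
    return indices


# A 10-second and a 100-second clip, both recorded at 30 frames per second.
short_uniform = uniform_sample(300, 8)
long_uniform = uniform_sample(3000, 8)
short_fps = fps_sample(300, 30, 2)
long_fps = fps_sample(3000, 30, 2)
```

With uniform sampling both clips yield 8 frames, so the time gap between sampled frames grows with clip length. With fps sampling the gap stays fixed at half a second and the number of frames grows with duration instead.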

[CV-5] Robust image classification with multi-modal large language models

【Quick read】: This paper addresses the vulnerability of deep neural networks to adversarial examples. The key is Multi-Shield, a new defense that strengthens existing defense mechanisms by combining them with multi-modal information. Multi-Shield uses multi-modal large language models to detect adversarial examples and abstains from classifying when the textual and visual representations of the input do not align, avoiding uncertain predictions. Extensive evaluation on CIFAR-10 and ImageNet shows that Multi-Shield can be easily integrated with existing defenses and outperforms the original defenses.

Link: https://arxiv.org/abs/2412.10353
Authors: Francesco Villani,Igor Maljkovic,Dario Lazzaro,Angelo Sotgiu,Antonio Emanuele Cinà,Fabio Roli
Keywords: Deep Neural Networks, Deep Neural, Neural Networks, make incorrect predictions, carefully crafted input
Categories: Computer Vision and Pattern Recognition (cs.CV); Cryptography and Security (cs.CR); Machine Learning (cs.LG)
Comments:

Abstract:Deep Neural Networks are vulnerable to adversarial examples, i.e., carefully crafted input samples that can cause models to make incorrect predictions with high confidence. To mitigate these vulnerabilities, adversarial training and detection-based defenses have been proposed to strengthen models in advance. However, most of these approaches focus on a single data modality, overlooking the relationships between visual patterns and textual descriptions of the input. In this paper, we propose a novel defense, Multi-Shield, designed to combine and complement these defenses with multi-modal information to further enhance their robustness. Multi-Shield leverages multi-modal large language models to detect adversarial examples and abstain from uncertain classifications when there is no alignment between textual and visual representations of the input. Extensive evaluations on CIFAR-10 and ImageNet datasets, using robust and non-robust image classification models, demonstrate that Multi-Shield can be easily integrated to detect and reject adversarial examples, outperforming the original defenses.
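The abstain-on-disagreement idea can be sketched as a simple agreement check. The abstract does not spell out the exact rule, so the one used here is an assumption: compare the classifier's label with the class whose text-prompt embedding is closest to the image embedding, and abstain if they differ. The embeddings are hand-written toy vectors, not outputs of any real model.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def predict_or_abstain(classifier_label, image_embedding, prompt_embeddings):
    """Return the classifier's label, or None to abstain when the class prompt
    nearest to the image embedding disagrees (hypothetical agreement rule)."""
    best = max(
        prompt_embeddings,
        key=lambda label: cosine(image_embedding, prompt_embeddings[label]),
    )
    return classifier_label if best == classifier_label else None


# Toy embeddings for two class prompts and one image.
prompts = {"cat": [1.0, 0.0], "dog": [0.0, 1.0]}
agree = predict_or_abstain("cat", [0.9, 0.1], prompts)
reject = predict_or_abstain("dog", [0.9, 0.1], prompts)
```

In this toy case the image embedding sits close to the "cat" prompt, so a classifier output of "dog" is treated as suspicious and rejected rather than returned.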

[CV-6] VibrantVS: A high-resolution multi-task transformer for forest canopy height estimation

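【Quick read】: This paper addresses the generation of canopy height models (CHMs) from multispectral imagery (NAIP imagery) across multiple ecoregions of the western United States. The key is VibrantVS, a novel multi-task vision transformer (ViT) model. Compared with three benchmark models, it shows higher accuracy and precision across a broad range of ecoregions rather than only in localized areas, can generate updated inference at a cadence of three years or less, and offers high spatial resolution, making it valuable for ecological monitoring and land-management decisions such as wildfire mitigation.

Link: https://arxiv.org/abs/2412.10351
Authors: Tony Chang,Kiarie Ndegwa,Andreas Gros,Vincent A. Landau,Luke J. Zachmann,Bogdan State,Mitchell A. Gritts,Colton W. Miller,Nathan E. Rutenbeck,Scott Conway,Guy Bayes
Keywords: National Agriculture Imagery, Agriculture Imagery Program, western United States, National Agriculture, multi-task vision transformer
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 15 pages, 12 figures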

Abstract:This paper explores the application of a novel multi-task vision transformer (ViT) model for the estimation of canopy height models (CHMs) using 4-band National Agriculture Imagery Program (NAIP) imagery across the western United States. We compare the effectiveness of this model in terms of accuracy and precision aggregated across ecoregions and class heights versus three other benchmark peer-reviewed models. Key findings suggest that, while other benchmark models can provide high precision in localized areas, the VibrantVS model has substantial advantages across a broad reach of ecoregions in the western United States with higher accuracy, higher precision, the ability to generate updated inference at a cadence of three years or less, and high spatial resolution. The VibrantVS model provides significant value for ecological monitoring and land management decisions for wildfire mitigation.

[CV-7] Ensuring Force Safety in Vision-Guided Robotic Manipulation via Implicit Tactile Calibration

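[Quick read]: This paper addresses the risk of damage to both robot and object when a robot in a dynamic environment manipulates objects with specific properties (such as doors) whose movement trajectories are constrained. The key to the solution is SafeDiff, a novel state diffusion framework that generates and refines a future robot state sequence by combining visual context observations with real-time tactile feedback. Unlike previous approaches that directly concatenate visual and tactile data to generate future state sequences, SafeDiff uses tactile data as a calibration signal that implicitly adjusts the robot state within the state space. This significantly improves the rationality of state planning, and a safe action trajectory is then derived via inverse dynamics. The method reduces the risk of harmful forces during door opening in both simulated and real-world settings.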

Link: https://arxiv.org/abs/2412.10349
Authors: Lai Wei, Jiahua Ma, Yibo Hu, Ruimao Zhang
Keywords: encounter constrained movement, constrained movement trajectories, robot state, specific properties, encounter constrained
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:In dynamic environments, robots often encounter constrained movement trajectories when manipulating objects with specific properties, such as doors. Therefore, applying the appropriate force is crucial to prevent damage to both the robots and the objects. However, current vision-guided robot state generation methods often falter in this regard, as they lack the integration of tactile perception. To tackle this issue, this paper introduces a novel state diffusion framework termed SafeDiff. It generates a prospective state sequence from the current robot state and visual context observation while incorporating real-time tactile feedback to refine the sequence. As far as we know, this is the first study specifically focused on ensuring force safety in robotic manipulation. It significantly enhances the rationality of state planning, and the safe action trajectory is derived from inverse dynamics based on this refined planning. In practice, unlike previous approaches that concatenate visual and tactile data to generate future robot state sequences, our method employs tactile data as a calibration signal to adjust the robot’s state within the state space implicitly. Additionally, we have developed a large-scale simulation dataset called SafeDoorManip50k, offering extensive multimodal data to train and evaluate the proposed method. Extensive experiments show that our visual-tactile model substantially mitigates the risk of harmful forces during door opening across both simulated and real-world settings.

[CV-8] A dual contrastive framework

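[Quick read]: This paper addresses the challenges that large-scale vision-language models face in region-level visual understanding, in particular the difficulty of optimizing latent representations caused by limited spatial awareness and coarse-grained pretraining. The key to the solution is AlignCap, a framework that enhances region-level understanding through fine-grained alignment of latent spaces. It introduces a latent feature refinement module to improve conditioned latent space representations, adopts a semantic space alignment module to raise the quality of multimodal representations, and incorporates contrastive learning in both modules to further improve region-level captioning performance. In addition, a General Object Detection (GOD) method is used as a data preprocessing pipeline to strengthen region-level spatial reasoning.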

Link: https://arxiv.org/abs/2412.10348
Authors: Yuan Sun, Zhao Zhang, Jorge Ortiz
Keywords: adapting intermediate layers, models typically freeze, task-specific goals, typically freeze, freeze the encoder
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:In current multimodal tasks, models typically freeze the encoder and decoder while adapting intermediate layers to task-specific goals, such as region captioning. Region-level visual understanding presents significant challenges for large-scale vision-language models. While limited spatial awareness is a known issue, coarse-grained pretraining, in particular, exacerbates the difficulty of optimizing latent representations for effective encoder-decoder alignment. We propose AlignCap, a framework designed to enhance region-level understanding through fine-grained alignment of latent spaces. Our approach introduces a novel latent feature refinement module that enhances conditioned latent space representations to improve region-level captioning performance. We also propose an innovative alignment strategy, the semantic space alignment module, which boosts the quality of multimodal representations. Additionally, we incorporate contrastive learning in a novel manner within both modules to further enhance region-level captioning performance. To address spatial limitations, we employ a General Object Detection (GOD) method as a data preprocessing pipeline that enhances spatial reasoning at the regional level. Extensive experiments demonstrate that our approach significantly improves region-level captioning performance across various tasks.

[CV-9] Iris: Breaking GUI Complexity with Adaptive Focus and Self-Refining

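[Quick read]: This paper addresses the visual perception challenges that visual agents face in interactive digital environments when handling high-resolution, visually complex Graphical User Interfaces (GUIs). The key to the solution lies in two innovations: Information-Sensitive Cropping (ISC) and Self-Refining Dual Learning (SRDL). ISC uses an edge detection algorithm to dynamically identify and prioritize visually dense regions, making more efficient use of computational resources. SRDL uses a dual-learning loop to strengthen the agent's ability to refer to (describe) and ground (locate) UI elements without additional annotated data. With these innovations, the Iris agent achieves state-of-the-art performance on multiple benchmarks and significantly improves downstream tasks.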

Link: https://arxiv.org/abs/2412.10342
Authors: Zhiqi Ge, Juncheng Li, Xinglei Pang, Minghe Gao, Kaihang Pan, Wang Lin, Hao Fei, Wenqiao Zhang, Siliang Tang, Yueting Zhuang
Keywords: Large Language Models, Multimodal Large Language, Graphical User Interfaces, Language Models, Large Language
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:Digital agents are increasingly employed to automate tasks in interactive digital environments such as web pages, software applications, and operating systems. While text-based agents built on Large Language Models (LLMs) often require frequent updates due to platform-specific APIs, visual agents leveraging Multimodal Large Language Models (MLLMs) offer enhanced adaptability by interacting directly with Graphical User Interfaces (GUIs). However, these agents face significant challenges in visual perception, particularly when handling high-resolution, visually complex digital environments. This paper introduces Iris, a foundational visual agent that addresses these challenges through two key innovations: Information-Sensitive Cropping (ISC) and Self-Refining Dual Learning (SRDL). ISC dynamically identifies and prioritizes visually dense regions using an edge detection algorithm, enabling efficient processing by allocating more computational resources to areas with higher information density. SRDL enhances the agent’s ability to handle complex tasks by leveraging a dual-learning loop, where improvements in referring (describing UI elements) reinforce grounding (locating elements) and vice versa, all without requiring additional annotated data. Empirical evaluations demonstrate that Iris achieves state-of-the-art performance across multiple benchmarks with only 850K GUI annotations, outperforming methods using 10x more training data. These improvements further translate to significant gains in both web and OS agent downstream tasks.
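A rough sketch of the idea behind prioritizing visually dense regions with edge detection. The abstract does not name the edge detector or the cropping rule, so the neighbour-difference edge map and the fixed tile grid below are assumptions chosen for brevity, and the function names are hypothetical.

```python
def edge_map(img):
    """Mark pixel (r, c) as an edge if it differs from its right or lower neighbour."""
    h, w = len(img), len(img[0])
    edges = [[0] * w for _ in range(h)]
    for r in range(h):
        for c in range(w):
            right = c + 1 < w and img[r][c] != img[r][c + 1]
            down = r + 1 < h and img[r][c] != img[r + 1][c]
            edges[r][c] = 1 if (right or down) else 0
    return edges

def rank_tiles_by_edge_density(img, tile):
    """Split the image into tile x tile blocks, sorted by edge density (densest first)."""
    edges = edge_map(img)
    h, w = len(img), len(img[0])
    scored = []
    for top in range(0, h, tile):
        for left in range(0, w, tile):
            cells = [edges[r][c]
                     for r in range(top, min(top + tile, h))
                     for c in range(left, min(left + tile, w))]
            scored.append((sum(cells) / len(cells), (top, left)))
    scored.sort(key=lambda item: item[0], reverse=True)
    return scored

# Toy 4x4 "screenshot": flat background with detail only in the bottom-right tile.
screen = [
    [0, 0, 0, 0],
    [0, 0, 0, 0],
    [0, 0, 1, 0],
    [0, 0, 0, 1],
]
ranking = rank_tiles_by_edge_density(screen, tile=2)
densest_tile = ranking[0][1]  # (top, left) of the tile that would get the most compute
```

A real pipeline would run a proper edge detector on the screenshot and adapt crop sizes to the density map; the ranking step is the part this sketch is meant to show.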

[CV-10] A Universal Degradation-based Bridging Technique for Domain Adaptive Semantic Segmentation

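[Quick read]: This paper addresses the significant performance drop of semantic segmentation when applied across domains and proposes DiDA, a universal degradation-based bridging technique. The solution rests on two modules: (1) Degradation-based Intermediate Domain Construction, which generates continuous intermediate domains through simple image degradation operations so that domain differences gradually diminish and domain-invariant features are easier to learn; (2) Semantic Shift Compensation, which uses a diffusion encoder to encode and compensate for semantic shift information conditioned on the degradation time-steps, preserving discriminative representations in the intermediate domains. As a plug-and-play solution, DiDA supports various degradation operations and integrates seamlessly with existing unsupervised domain adaptation (UDA) methods, significantly improving cross-domain semantic segmentation performance.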

Link: https://arxiv.org/abs/2412.10339
Authors: Wangkai Li, Rui Sun, Tianzhu Zhang
Keywords: suffers from significant, trained network, network is applied, intermediate domains, semantic shift
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Semantic segmentation often suffers from significant performance degradation when the trained network is applied to a different domain. To address this issue, unsupervised domain adaptation (UDA) has been extensively studied. Existing methods introduce domain bridging techniques to mitigate the substantial domain gap; these construct intermediate domains to facilitate the gradual transfer of knowledge across different domains. However, these strategies often require dataset-specific designs and may generate unnatural intermediate distributions that lead to semantic shift. In this paper, we propose DiDA, a universal degradation-based bridging technique formalized as a diffusion forward process. DiDA consists of two key modules: (1) Degradation-based Intermediate Domain Construction, which creates continuous intermediate domains through simple image degradation operations to encourage learning domain-invariant features as domain differences gradually diminish; (2) Semantic Shift Compensation, which leverages a diffusion encoder to encode and compensate for semantic shift information with degraded time-steps, preserving discriminative representations in the intermediate domains. As a plug-and-play solution, DiDA supports various degradation operations and seamlessly integrates with existing UDA methods. Extensive experiments on prevalent synthetic-to-real semantic segmentation benchmarks demonstrate that DiDA consistently improves performance across different settings and achieves new state-of-the-art results when combined with existing methods.
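To make "degradation formalized as a diffusion forward process" concrete, here is a minimal sketch of the standard Gaussian forward step, x_t = sqrt(abar_t) * x_0 + sqrt(1 - abar_t) * eps. Choosing Gaussian noise with a linear beta schedule is an assumption made for illustration; the abstract says several degradation operations are supported and does not give the schedule.

```python
import math
import random

def alpha_bar(t, num_steps, beta_start=1e-4, beta_end=0.02):
    """Cumulative product of (1 - beta_s) for s = 1..t under a linear beta schedule."""
    prod = 1.0
    for s in range(1, t + 1):
        beta = beta_start + (beta_end - beta_start) * (s - 1) / max(num_steps - 1, 1)
        prod *= 1.0 - beta
    return prod

def degrade(pixels, t, num_steps, rng):
    """Sample x_t = sqrt(abar_t) * x_0 + sqrt(1 - abar_t) * eps, with eps ~ N(0, 1)."""
    a = alpha_bar(t, num_steps)
    return [math.sqrt(a) * x + math.sqrt(1.0 - a) * rng.gauss(0.0, 1.0) for x in pixels]

rng = random.Random(0)
image = [0.2, 0.5, 0.8]  # toy flattened image
mild = degrade(image, t=10, num_steps=1000, rng=rng)     # still close to the source image
heavy = degrade(image, t=1000, num_steps=1000, rng=rng)  # almost pure noise
signal_kept = [alpha_bar(t, 1000) for t in (0, 10, 500, 1000)]
```

As t grows, `alpha_bar` shrinks and less of the original image survives, which is one way to read the abstract's continuous intermediate domains in which domain differences gradually diminish.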

[CV-11] XYScanNet: An Interpretable State Space Model for Perceptual Image Deblurring

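[Quick read]: This paper addresses a weakness of existing Mamba-based deep state-space models (SSMs) for image restoration: their flatten-and-scan strategy loses local pixel dependencies and introduces spatial misalignment, which harms noise awareness and image sharpness in low-level vision tasks. The key to the solution is a new slice-and-scan strategy that alternates between intra-slice and inter-slice scanning, preserving local pixel dependencies and reducing spatial misalignment. The paper also designs a new Vision State Space Module (VSSM) for image deblurring and develops the XYScanNet architecture, which incorporates a lightweight feature fusion module and significantly improves the perceptual quality of deblurring. Experiments show that it improves KID by 17% over the nearest competitor.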

Link: https://arxiv.org/abs/2412.10338
Authors: Hanzhou Liu, Chengkai Liu, Jiacong Xu, Peng Jiang, Mi Lu
Keywords: Deep state-space models, CNN and Transformer, recent Mamba architectures, Transformer networks, Deep state-space
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Deep state-space models (SSMs), like recent Mamba architectures, are emerging as a promising alternative to CNN and Transformer networks. Existing Mamba-based restoration methods process the visual data by leveraging a flatten-and-scan strategy that converts image patches into a 1D sequence before scanning. However, this scanning paradigm ignores local pixel dependencies and introduces spatial misalignment by positioning distant pixels incorrectly adjacent, which reduces local noise-awareness and degrades image sharpness in low-level vision tasks. To overcome these issues, we propose a novel slice-and-scan strategy that alternates scanning along intra- and inter-slices. We further design a new Vision State Space Module (VSSM) for image deblurring, and tackle the inefficiency challenges of the current Mamba-based vision module. Building upon this, we develop XYScanNet, an SSM architecture integrated with a lightweight feature fusion module for enhanced image deblurring. XYScanNet maintains competitive distortion metrics and significantly improves perceptual performance. Experimental results show that XYScanNet enhances KID by 17% compared to the nearest competitor. Our code will be released soon.
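The spatial misalignment attributed to flatten-and-scan can be seen in a tiny example. The sketch below only illustrates the problem (row-major flattening places the last pixel of one row next to the first pixel of the following row); it does not attempt the slice-and-scan strategy, whose details are not given in the abstract. Function names are hypothetical.

```python
def flatten_scan(height, width):
    """Row-major scan order over a height x width grid of pixel coordinates."""
    return [(r, c) for r in range(height) for c in range(width)]

def max_neighbour_gap(order):
    """Largest Manhattan distance between pixels that are consecutive in the 1D sequence."""
    return max(abs(r1 - r0) + abs(c1 - c0)
               for (r0, c0), (r1, c1) in zip(order, order[1:]))

order = flatten_scan(4, 8)
gap = max_neighbour_gap(order)  # the jump from the end of a row to the start of the next
```

For a 4x8 grid the largest jump equals the image width, so the gap between sequence neighbours grows with resolution.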

[CV-12] BrushEdit: All-In-One Image Inpainting and Editing

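[Quick read]: This paper addresses the limitations of current inversion-based and instruction-based image editing methods when making large modifications such as adding or removing objects. Inversion-based methods struggle with large modifications because of the structured nature of inversion noise, while instruction-based methods limit users' direct interaction over the editing region and intensity and typically operate as a black box. The proposed solution, BrushEdit, is an inpainting-based, instruction-guided image editing paradigm whose key idea is to combine multimodal large language models (MLLMs) with image inpainting models in an agent-cooperative framework to enable free-form instruction editing. Concretely, the system integrates MLLMs with a dual-branch image inpainting model to perform editing category classification, main object identification, mask acquisition, and editing area inpainting, yielding autonomous, user-friendly, and interactive free-form editing.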

Link: https://arxiv.org/abs/2412.10316
Authors: Yaowei Li, Yuxuan Bian, Xuan Ju, Zhaoyang Zhang, Ying Shan, Qiang Xu
Keywords: advanced significantly, development of diffusion, editing, instruction-based methods, current inversion-based approaches
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: WebPage available at this https URL

Abstract:Image editing has advanced significantly with the development of diffusion models using both inversion-based and instruction-based methods. However, current inversion-based approaches struggle with big modifications (e.g., adding or removing objects) due to the structured nature of inversion noise, which hinders substantial changes. Meanwhile, instruction-based methods often constrain users to black-box operations, limiting direct interaction for specifying editing regions and intensity. To address these limitations, we propose BrushEdit, a novel inpainting-based instruction-guided image editing paradigm, which leverages multimodal large language models (MLLMs) and image inpainting models to enable autonomous, user-friendly, and interactive free-form instruction editing. Specifically, we devise a system enabling free-form instruction editing by integrating MLLMs and a dual-branch image inpainting model in an agent-cooperative framework to perform editing category classification, main object identification, mask acquisition, and editing area inpainting. Extensive experiments show that our framework effectively combines MLLMs and inpainting models, achieving superior performance across seven metrics including mask region preservation and editing effect coherence.

[CV-13] TrafficLoc: Localizing Traffic Surveillance Cameras in 3D Scenes

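[Quick read]: This paper addresses the localization of traffic surveillance cameras for cooperative perception. To overcome the lack of large-scale real-world intersection datasets, the authors introduce Carla Intersection, a simulated dataset containing 75 urban and rural intersections. The key to the solution is TrafficLoc, a novel neural network that localizes traffic cameras within a 3D reference map through a coarse-to-fine matching pipeline. TrafficLoc uses a Geometry-guided Attention Loss to address cross-modal viewpoint inconsistencies when fusing image and point cloud features, and applies Inter-Intra Contrastive Learning during coarse matching to achieve precise alignment while preserving the distinctiveness of local features. In addition, Dense Training Alignment with a soft-argmax operator takes additional features into account when regressing the final position. Experiments show that TrafficLoc substantially improves localization accuracy on the Carla Intersection dataset (by up to 86%) and achieves new state-of-the-art performance on the KITTI and NuScenes datasets, demonstrating strong localization ability for both in-vehicle and traffic cameras.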

Link: https://arxiv.org/abs/2412.10308
Authors: Yan Xia, Yunxiang Lu, Rui Song, Oussema Dhaouadi, João F. Henriques, Daniel Cremers
Keywords: cooperative perception, tackle the problem, traffic surveillance cameras, introduce Carla Intersection, Carla Intersection
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 17 pages, 12 figures

Abstract:We tackle the problem of localizing the traffic surveillance cameras in cooperative perception. To overcome the lack of large-scale real-world intersection datasets, we introduce Carla Intersection, a new simulated dataset with 75 urban and rural intersections in Carla. Moreover, we introduce a novel neural network, TrafficLoc, localizing traffic cameras within a 3D reference map. TrafficLoc employs a coarse-to-fine matching pipeline. For image-point cloud feature fusion, we propose a novel Geometry-guided Attention Loss to address cross-modal viewpoint inconsistencies. During coarse matching, we propose an Inter-Intra Contrastive Learning to achieve precise alignment while preserving distinctiveness among local intra-features within image patch-point group pairs. Besides, we introduce Dense Training Alignment with a soft-argmax operator to consider additional features when regressing the final position. Extensive experiments show that our TrafficLoc improves the localization accuracy over the state-of-the-art Image-to-point cloud registration methods by a large margin (up to 86%) on Carla Intersection and generalizes well to real-world data. TrafficLoc also achieves new SOTA performance on KITTI and NuScenes datasets, demonstrating strong localization ability across both in-vehicle and traffic cameras. Our project page is publicly available at this https URL.
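The soft-argmax operator mentioned in the abstract is a standard differentiable replacement for argmax: it returns the expected index under a softmax over the scores. The sketch below shows the generic one-dimensional operator only; how it is wired into Dense Training Alignment is not described in the abstract, and the `temperature` parameter is an illustrative addition.

```python
import math

def soft_argmax(scores, temperature=1.0):
    """Expected index under softmax(scores / temperature); differentiable, unlike argmax."""
    peak = max(scores)  # subtract the max for numerical stability
    weights = [math.exp((s - peak) / temperature) for s in scores]
    total = sum(weights)
    return sum(i * w for i, w in enumerate(weights)) / total

symmetric = soft_argmax([0.0, 5.0, 0.0])                # probability mass centred on index 1
sharp = soft_argmax([0.0, 1.0, 3.0], temperature=0.01)  # low temperature approaches argmax
```

Because every score contributes to the result, gradients reach all candidate positions rather than only the winning one, which fits the abstract's remark about considering additional features when regressing the final position.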

[CV-14] Coherent 3D Scene Diffusion From a Single RGB Image NEURIPS2024 WWW

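[Quick read]: This paper addresses coherent 3D scene reconstruction from a single RGB image. The key to the solution is a diffusion-based method in which an image-conditioned 3D scene diffusion model simultaneously denoises the 3D poses and geometries of all objects in the scene. To cope with the ill-posed nature of the task and obtain consistent reconstructions, the authors learn a generative scene prior by conditioning on all scene objects simultaneously, capturing scene context and letting the model learn inter-object relationships throughout the diffusion process. The paper also proposes an efficient surface alignment loss that facilitates training even without full ground-truth annotation; the loss relies on an expressive shape representation that allows points to be sampled directly from intermediate shape predictions. By framing single-RGB-image 3D scene reconstruction as a conditional diffusion process, the method improves AP3D by 12.04% on SUN RGB-D and F-Score by 13.43% on Pix3D, surpassing current state-of-the-art methods.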

Link: https://arxiv.org/abs/2412.10294
Authors: Manuel Dahnert, Angela Dai, Norman Müller, Matthias Nießner
Keywords: single RGB image, scene, RGB image, scene reconstruction, single RGB
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project Page: this https URL - Accepted at NeurIPS 2024

Abstract:We present a novel diffusion-based approach for coherent 3D scene reconstruction from a single RGB image. Our method utilizes an image-conditioned 3D scene diffusion model to simultaneously denoise the 3D poses and geometries of all objects within the scene. Motivated by the ill-posed nature of the task and to obtain consistent scene reconstruction results, we learn a generative scene prior by conditioning on all scene objects simultaneously to capture the scene context and by allowing the model to learn inter-object relationships throughout the diffusion process. We further propose an efficient surface alignment loss to facilitate training even in the absence of full ground-truth annotation, which is common in publicly available datasets. This loss leverages an expressive shape representation, which enables direct point sampling from intermediate shape predictions. By framing the task of single RGB image 3D scene reconstruction as a conditional diffusion process, our approach surpasses current state-of-the-art methods, achieving a 12.04% improvement in AP3D on SUN RGB-D and a 13.43% increase in F-Score on Pix3D.

[CV-15] Prompt-Guided Mask Proposal for Two-Stage Open-Vocabulary Segmentation

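[Quick read]: This paper addresses open-vocabulary segmentation, the task of identifying objects from a wide range of categories in different environments using text prompts as input. The key to the solution is a new method called Prompt-guided Mask Proposal (PMP), which incorporates the text prompts into the mask generation process so that the generated masks are better aligned with the input prompts. It is implemented as a cross-attention mechanism in which text tokens interact with query tokens, producing prompt-guided mask proposals after each decoding step. Experiments show clear gains on multiple benchmark datasets (1% ~ 3% absolute gain in mIOU), demonstrating the effectiveness and generalization of this lightweight prompt-aware method.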

Link: https://arxiv.org/abs/2412.10292
Authors: Yu-Jhe Li, Xinyang Zhang, Kun Wan, Lantao Yu, Ajinkya Kale, Xin Lu
Keywords: mask proposals, mask, Prompt-guided Mask Proposal, identify objects, wide range
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 17 pages. Work done during 2023 summer and has been released

Abstract:We tackle the challenge of open-vocabulary segmentation, where we need to identify objects from a wide range of categories in different environments, using text prompts as our input. To overcome this challenge, existing methods often use multi-modal models like CLIP, which combine image and text features in a shared embedding space to bridge the gap between limited and extensive vocabulary recognition, resulting in a two-stage approach: in the first stage, a mask generator takes an input image to generate mask proposals, and in the second stage, the target mask is picked based on the query. However, the expected target mask may not exist in the generated mask proposals, which leads to an unexpected output mask. In our work, we propose a novel approach named Prompt-guided Mask Proposal (PMP) where the mask generator takes the input text prompts and generates masks guided by these prompts. Compared with mask proposals generated without input prompts, masks generated by PMP are better aligned with the input prompts. To realize PMP, we designed a cross-attention mechanism between text tokens and query tokens which is capable of generating prompt-guided mask proposals after each decoding. We combined our PMP with several existing works employing a query-based segmentation backbone and the experiments on five benchmark datasets demonstrate the effectiveness of this approach, showcasing significant improvements over the current two-stage models (1% ~ 3% absolute performance gain in terms of mIOU). The steady improvement in performance across these benchmarks indicates the effective generalization of our proposed lightweight prompt-aware method.
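A minimal sketch of cross-attention between query tokens and text tokens, the mechanism the abstract names. It uses plain scaled dot-product attention with the text tokens serving as both keys and values and a residual update; the learned projections, multi-head structure, and placement inside the decoder are omitted, so this is an assumption-laden illustration, not the PMP module.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    peak = max(xs)
    exps = [math.exp(x - peak) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(queries, text_tokens):
    """Each query attends over the text tokens (used here as both keys and values)."""
    dim = len(text_tokens[0])
    updated, attn = [], []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in text_tokens]
        weights = softmax(scores)
        attn.append(weights)
        context = [sum(w * v[d] for w, v in zip(weights, text_tokens)) for d in range(dim)]
        updated.append([qi + ci for qi, ci in zip(q, context)])  # residual update
    return updated, attn

text_tokens = [[4.0, 0.0], [0.0, 4.0]]  # toy embeddings for two prompt words
queries = [[1.0, 0.0], [0.0, 1.0]]      # toy mask queries
prompted_queries, attention = cross_attention(queries, text_tokens)
```

Each updated query now carries information from the prompt word it matches best, which is the sense in which the resulting mask proposals are prompt-guided.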

[CV-16] TIV-Diffusion: Towards Object-Centric Movement for Text-driven Image to Video Generation

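[Quick read]: This paper addresses two main challenges in Text-driven Image to Video Generation (TI2V): (i) how to identify the target objects and keep their movement trajectories consistent with the textual description; (ii) how to improve the subjective quality of the generated videos. The key to the solution is TIV-Diffusion, a diffusion-based TI2V framework that achieves precise control and high-quality video generation through object-centric textual-visual alignment. Specifically, the framework fuses textual and visual knowledge through scale-offset modulation so that the model can perceive the objects described in the text and their motion trajectories. It also introduces an object-centric textual-visual alignment module that decouples the objects in the reference image and aligns textual features with each object individually, effectively reducing object disappearance and object/motion misalignment.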

Link: https://arxiv.org/abs/2412.10275
Authors: Xingrui Wang, Xin Li, Yaosi Hu, Hanxin Zhu, Chen Hou, Cuiling Lan, Zhibo Chen
Keywords: generate controllable video, Text-driven Image, high-quality video generation, Video Generation, aims to generate
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Text-driven Image to Video Generation (TI2V) aims to generate controllable video given the first frame and corresponding textual description. The primary challenges of this task lie in two parts: (i) how to identify the target objects and ensure the consistency between the movement trajectory and the textual description; (ii) how to improve the subjective quality of generated videos. To tackle the above challenges, we propose a new diffusion-based TI2V framework, termed TIV-Diffusion, via object-centric textual-visual alignment, intending to achieve precise control and high-quality video generation based on textual-described motion for different objects. Concretely, we enable our TIV-Diffusion model to perceive the textual-described objects and their motion trajectory by incorporating the fused textual and visual knowledge through scale-offset modulation. Moreover, to mitigate the problems of object disappearance and misaligned objects and motion, we introduce an object-centric textual-visual alignment module, which reduces the risk of misaligned objects/motion by decoupling the objects in the reference image and aligning textual features with each object individually. Based on the above innovations, our TIV-Diffusion achieves state-of-the-art high-quality video generation compared with existing TI2V methods.
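Scale-offset modulation is commonly implemented as a channel-wise affine transform whose scale and offset are predicted from a conditioning signal. The sketch below assumes the form out = x * (1 + scale) + offset, which is one widespread convention; the abstract does not give the exact formula, and in a real model the `scale` and `offset` vectors would come from a learned projection of the fused textual and visual embedding.

```python
def scale_offset_modulate(features, scale, offset):
    """Channel-wise modulation: out[c] = features[c] * (1 + scale[c]) + offset[c]."""
    return [f * (1.0 + s) + o for f, s, o in zip(features, scale, offset)]

visual = [1.0, 2.0, -1.0]  # toy feature channels
scale = [0.0, 0.5, 1.0]    # hypothetical values predicted from the conditioning signal
offset = [0.0, 0.1, -0.2]
modulated = scale_offset_modulate(visual, scale, offset)
identity = scale_offset_modulate(visual, [0.0] * 3, [0.0] * 3)  # zero conditioning: no-op
```

With the (1 + scale) convention, zero-valued scale and offset leave the features unchanged, a common reason for choosing it when the conditioning branch is initialized at zero.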

[CV-17] Probabilistic Inverse Cameras: Image to 3D via Multiview Geometry

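[Quick read]: This paper addresses the generation of multiview 3D from a single 2D image. The key to the solution is a hierarchical probabilistic approach: a diffusion “prior” predicts the unseen 3D geometry, which then conditions a diffusion “decoder” that generates images from novel views. In the implementation, a pointmap-based geometric representation in a multiview image format coordinates the generation of multiple target views, and fixing the target camera poses relative to the source camera yields a predictable distribution of geometric features that facilitates correspondence between views. The method, called “unPIC”, outperforms existing state-of-the-art methods such as CAT3D and One-2-3-45 on multiple datasets.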

Link: https://arxiv.org/abs/2412.10273
Authors: Rishabh Kabra, Drew A. Hudson, Sjoerd van Steenkiste, Joao Carreira, Niloy J. Mitra
Keywords: hierarchical probabilistic approach, conditions a diffusion, models the unseen, introduce a hierarchical, hierarchical probabilistic
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:

Abstract:We introduce a hierarchical probabilistic approach to go from a 2D image to multiview 3D: a diffusion “prior” models the unseen 3D geometry, which then conditions a diffusion “decoder” to generate novel views of the subject. We use a pointmap-based geometric representation in a multiview image format to coordinate the generation of multiple target views simultaneously. We facilitate correspondence between views by assuming fixed target camera poses relative to the source camera, and constructing a predictable distribution of geometric features per target. Our modular, geometry-driven approach to novel-view synthesis (called “unPIC”) beats SoTA baselines such as CAT3D and One-2-3-45 on held-out objects from ObjaverseXL, as well as real-world objects ranging from Google Scanned Objects, Amazon Berkeley Objects, to the Digital Twin Catalog.

[CV-18] MVQ: Towards Efficient DNN Compression and Acceleration with Masked Vector Quantization ASPLOS’25

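[Quick read]: This paper addresses the significant accuracy loss that conventional Vector Quantization (VQ) causes when compressing deep neural networks (DNNs). The key to the solution is a new method, MVQ (masked vector quantization). At the algorithm level, it removes unimportant weights through N:M pruning and then uses a masked k-means algorithm to minimize the vector clustering error between the remaining weights and the codewords, so that important weights are approximated more closely. At the architecture level, MVQ implements vector quantization on an Enhanced Weight Stationary (EWS) CNN accelerator and designs a sparse systolic array to maximize the benefits of masked vector quantization. Experiments show that MVQ outperforms conventional VQ methods at comparable compression ratios while also reducing FLOPs; under ASIC evaluation it improves energy efficiency by 2.3× and reduces the systolic array size by 55%.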

Link: https://arxiv.org/abs/2412.10261
Authors: Shuaiting Li, Chengxuan Wang, Juncan Deng, Zeyu Wang, Zewen Ye, Zongsheng Wang, Haibin Shen, Kejie Huang
Keywords: hardware-friendly DNN compression, hardware-friendly DNN, DNN compression method, DNN compression, storage cost
Categories: Computer Vision and Pattern Recognition (cs.CV); Hardware Architecture (cs.AR)
Comments: Accepted by ASPLOS '25

Abstract:Vector quantization (VQ) is a hardware-friendly DNN compression method that can reduce the storage cost and weight-loading datawidth of hardware accelerators. However, conventional VQ techniques lead to significant accuracy loss because the important weights are not well preserved. To tackle this problem, a novel approach called MVQ is proposed, which aims at better approximating important weights with a limited number of codewords. At the algorithm level, our approach removes the less important weights through N:M pruning and then minimizes the vector clustering error between the remaining weights and codewords by the masked k-means algorithm. Only distances between the unpruned weights and the codewords are computed, which are then used to update the codewords. At the architecture level, our accelerator implements vector quantization on an EWS (Enhanced weight stationary) CNN accelerator and proposes a sparse systolic array design to maximize the benefits brought by masked vector quantization. Our algorithm is validated on various models for image classification, object detection, and segmentation tasks. Experimental results demonstrate that MVQ not only outperforms conventional vector quantization methods at comparable compression ratios but also reduces FLOPs. Under ASIC evaluation, our MVQ accelerator boosts energy efficiency by 2.3× and reduces the size of the systolic array by 55% when compared with the base EWS accelerator. Compared to the previous sparse accelerators, MVQ achieves 1.73× higher energy efficiency.
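The algorithm-level description (N:M pruning followed by masked k-means, where only unpruned weights contribute to distances and codeword updates) can be sketched in a few lines. This is a toy illustration under stated assumptions: magnitude-based pruning, Lloyd-style iterations, and codewords initialised from two of the weight vectors are choices made here, and all function names are hypothetical.

```python
def nm_prune_mask(vector, n, m):
    """Keep the n largest-magnitude weights in every group of m consecutive weights."""
    mask = [0] * len(vector)
    for start in range(0, len(vector), m):
        group = range(start, min(start + m, len(vector)))
        for i in sorted(group, key=lambda i: abs(vector[i]), reverse=True)[:n]:
            mask[i] = 1
    return mask

def masked_distance(vector, mask, codeword):
    """Squared error over unpruned positions only."""
    return sum(k * (w - c) ** 2 for w, k, c in zip(vector, mask, codeword))

def masked_kmeans(vectors, masks, codebook, iterations=10):
    """Lloyd-style iterations in which pruned weights never affect assignment or update."""
    codebook = [list(c) for c in codebook]
    assign = [0] * len(vectors)
    for _ in range(iterations):
        assign = [min(range(len(codebook)),
                      key=lambda j: masked_distance(v, k, codebook[j]))
                  for v, k in zip(vectors, masks)]
        for j, code in enumerate(codebook):
            for d in range(len(code)):
                kept = [v[d] for v, k, a in zip(vectors, masks, assign) if a == j and k[d]]
                if kept:  # leave the entry unchanged if no unpruned weight maps to it
                    code[d] = sum(kept) / len(kept)
    return codebook, assign

weights = [[0.9, 0.05, -1.1, 0.1], [1.1, -0.02, -0.9, 0.0],
           [0.03, -2.0, 0.1, 2.2], [-0.1, -2.2, 0.0, 1.8]]
masks = [nm_prune_mask(w, n=1, m=2) for w in weights]  # 1:2 sparsity
codebook, assignment = masked_kmeans(weights, masks, codebook=[weights[0], weights[2]])
```

Because pruned positions are masked out of both steps, the codewords spend their limited capacity on the weights that survive pruning, which is the motivation the abstract gives for the masked variant.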

[CV-19] EnvPoser: Environment-aware Realistic Human Motion Estimation from Sparse Observations with Uncertainty Modeling

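[Quick read]: This paper addresses full-body motion estimation from the head and hand tracking signals of VR devices. The sparsity and unique distribution of the observations make the problem ill-posed, with multiple feasible solutions (hypotheses), which increases uncertainty and ambiguity, especially for the lower-body joints. The key to the solution is EnvPoser, a two-stage framework: the first stage handles the multi-hypothesis nature of human motion with an uncertainty-aware estimation module; the second stage refines the multi-hypothesis estimates by integrating semantic and geometric environmental constraints, so that the final motion estimate is realistic with respect to both the environmental context and physical interactions.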

Link: https://arxiv.org/abs/2412.10235
Authors: Songpengcheng Xia, Yu Zhang, Zhuo Su, Xiaozheng Zheng, Zheng Lv, Guidong Wang, Yongjie Zhang, Qi Wu, Lei Chu, Ling Pei
Keywords: holds great potential, Estimating full-body motion, devices holds great, Estimating full-body, head and hands
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Estimating full-body motion using the tracking signals of head and hands from VR devices holds great potential for various applications. However, the sparsity and unique distribution of observations present a significant challenge, resulting in an ill-posed problem with multiple feasible solutions (i.e., hypotheses). This amplifies uncertainty and ambiguity in full-body motion estimation, especially for the lower-body joints. Therefore, we propose a new method, EnvPoser, that employs a two-stage framework to perform full-body motion estimation using sparse tracking signals and pre-scanned environment from VR devices. EnvPoser models the multi-hypothesis nature of human motion through an uncertainty-aware estimation module in the first stage. In the second stage, we refine these multi-hypothesis estimates by integrating semantic and geometric environmental constraints, ensuring that the final motion estimation aligns realistically with both the environmental context and physical interactions. Qualitative and quantitative experiments on two public datasets demonstrate that our method achieves state-of-the-art performance, highlighting significant improvements in human motion estimation within motion-environment interaction scenarios.

[CV-20] SuperGSeg: Open-Vocabulary 3D Segmentation with Structured Super-Gaussians

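[Quick read]: This paper addresses the shortcomings of existing 3D Gaussian Splatting methods in scene understanding and the segmentation of complex structures. The key to the solution is SuperGSeg, a new method that fosters context-aware scene representation by disentangling segmentation and language field distillation. SuperGSeg first uses neural Gaussians to learn instance and hierarchical segmentation features from multi-view images with the help of off-the-shelf 2D masks, and then uses these features to create a sparse set of Super-Gaussians. The Super-Gaussians distill 2D language features into 3D space, enabling high-dimensional language feature rendering without excessive GPU memory consumption. Experiments show that SuperGSeg outperforms existing methods on open-vocabulary object localization and semantic segmentation.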

Link: https://arxiv.org/abs/2412.10231
Authors: Siyun Liang, Sen Wang, Kunyi Li, Michael Niemeyer, Stefano Gasperini, Nassir Navab, Federico Tombari
Keywords: recently gained traction, Gaussian Splatting, vanilla Gaussian Splatting, Gaussian Splatting representation, recently gained
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 13 pages, 8 figures

Abstract:3D Gaussian Splatting has recently gained traction for its efficient training and real-time rendering. While the vanilla Gaussian Splatting representation is mainly designed for view synthesis, more recent works investigated how to extend it with scene understanding and language features. However, existing methods lack a detailed comprehension of scenes, limiting their ability to segment and interpret complex structures. To this end, we introduce SuperGSeg, a novel approach that fosters cohesive, context-aware scene representation by disentangling segmentation and language field distillation. SuperGSeg first employs neural Gaussians to learn instance and hierarchical segmentation features from multi-view images with the aid of off-the-shelf 2D masks. These features are then leveraged to create a sparse set of what we call Super-Gaussians. Super-Gaussians facilitate the distillation of 2D language features into 3D space. Through Super-Gaussians, our method enables high-dimensional language feature rendering without extreme increases in GPU memory. Extensive experiments demonstrate that SuperGSeg outperforms prior works on both open-vocabulary object localization and semantic segmentation tasks.

[CV-21] SPT: Sequence Prompt Transformer for Interactive Image Segmentation

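[Quick read]: This paper addresses the fact that existing interactive segmentation methods typically process images one at a time and fail to exploit the sequential information in an image series. The key to the solution is the Sequence Prompt Transformer (SPT), the first method to use sequential image information for interactive segmentation. Its core components are: (1) SPT, which acquires information from the sequence of images, user clicks, and masks to improve segmentation accuracy; (2) Top-k Prompt Selection (TPS), which selects precise prompts for SPT to further enhance segmentation. The paper also creates the ADE20K-Seq benchmark to better evaluate model performance. Experiments show that the method surpasses existing state-of-the-art methods on multiple benchmark datasets.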

Link: https://arxiv.org/abs/2412.10224
Authors: Senlin Cheng, Haopeng Sun
Keywords: Sequence Prompt Transformer, Interactive segmentation aims, Prompt Transformer, aims to extract, based on user-provided
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Interactive segmentation aims to extract objects of interest from an image based on user-provided clicks. In real-world applications, there is often a need to segment a series of images featuring the same target object. However, existing methods typically process one image at a time, failing to consider the sequential nature of the images. To overcome this limitation, we propose a novel method called Sequence Prompt Transformer (SPT), the first to utilize sequential image information for interactive segmentation. Our model comprises two key components: (1) Sequence Prompt Transformer (SPT), which acquires information from the sequence of images, clicks, and masks to improve accuracy, and (2) Top-k Prompt Selection (TPS), which selects precise prompts for SPT to further enhance the segmentation effect. Additionally, we create the ADE20K-Seq benchmark to better evaluate model performance. We evaluate our approach on multiple benchmark datasets and show that our model surpasses state-of-the-art methods across all datasets.
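
The abstract does not say how Top-k Prompt Selection scores candidate prompts. As a rough illustration only, the sketch below assumes each prompt is a feature vector that is scored by cosine similarity to a query feature. The names `cosine` and `select_top_k` and the toy vectors are hypothetical and not taken from the paper.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def select_top_k(query, prompts, k):
    """Return the indices of the k prompts most similar to the query, best first.

    Scoring by cosine similarity is an assumption made for this sketch.
    """
    scores = [cosine(query, p) for p in prompts]
    order = sorted(range(len(prompts)), key=lambda i: scores[i], reverse=True)
    return order[:k]


# Toy data: four 2-D prompt features and one query feature.
query = [1.0, 0.0]
prompts = [[0.9, 0.1], [0.0, 1.0], [1.0, 0.0], [-1.0, 0.0]]
print(select_top_k(query, prompts, k=2))  # [2, 0]
```

Any other relevance score could be dropped into `select_top_k` without changing the selection step.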

[CV-22] Learning Complex Non-Rigid Image Edits from Multimodal Conditioning

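[Quick Read]: This paper addresses the problem of naturally inserting a person, given as a single image, into a new scene while preserving the person's identity and keeping the result highly controllable through text and pose. The key to the solution is to build on Stable Diffusion and train on image pairs (a reference image and a target image) together with a text caption describing the new pose. The paper also presents a new dataset, built by extracting frame pairs from human-centric, action-rich videos and using a multimodal large language model (LLM) to automatically generate captions that describe the pose difference. In addition, the paper highlights that identity preservation is harder in in-the-wild scenes, especially those where people interact with objects; combining weak supervision from noisy captions with robust 2D pose estimation improves the quality of person-object interactions.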

Link: https://arxiv.org/abs/2412.10219
Authors: Nikolai Warner, Jack Kolb, Meera Hahn, Vighnesh Birodkar, Jonathan Huang, Irfan Essa
Keywords: focus on inserting, Stable Diffusion, single image, paper we focus, reference image
Categories: Computer Vision and Pattern Recognition (cs.CV)

Abstract:In this paper we focus on inserting a given human (specifically, a single image of a person) into a novel scene. Our method, which builds on top of Stable Diffusion, yields natural looking images while being highly controllable with text and pose. To accomplish this we need to train on pairs of images, the first a reference image with the person, the second a “target image” showing the same person (with a different pose and possibly in a different background). Additionally we require a text caption describing the new pose relative to that in the reference image. In this paper we present a novel dataset following these criteria, which we create using pairs of frames from human-centric and action-rich videos and employing a multimodal LLM to automatically summarize the difference in human pose for the text captions. We demonstrate that identity preservation is a more challenging task in scenes “in-the-wild”, and especially scenes where there is an interaction between persons and objects. Combining the weak supervision from noisy captions with robust 2D pose improves the quality of person-object interactions.

[CV-23] RAID-Database: human Responses to Affine Image Distortions

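[Quick Read]: This paper addresses the fact that existing image quality databases focus on distortions common in digital media and neglect distortions common under natural conditions, such as affine transformations. The key to the solution is a dataset of human subjective responses to suprathreshold affine transforms (rotation, translation, scaling), with additive Gaussian noise included as a reference. Responses were measured with the Maximum Likelihood Difference Scaling method; the dataset covers 864 distorted images, 105 observers, and more than 20,000 comparisons of image quadruples. The dataset's quality is supported by the fact that it reproduces the classical Piéron's law and classical absolute detection thresholds, and that it is consistent with conventional image quality databases while improving on them according to Group-MAD experiments.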

Link: https://arxiv.org/abs/2412.10211
Authors: Paula Daudén-Oliver, David Agost-Beltran, Emilio Sansano-Sansano, Valero Laparra, Jesús Malo, Marina Martínez-Garcia
Keywords: subjective human perception, predicting subjective human, Image quality databases, train models, models for predicting
Categories: Computer Vision and Pattern Recognition (cs.CV); Neurons and Cognition (q-bio.NC); Quantitative Methods (q-bio.QM)

Abstract:Image quality databases are used to train models for predicting subjective human perception. However, most existing databases focus on distortions commonly found in digital media and not in natural conditions. Affine transformations are particularly relevant to study, as they are among the most commonly encountered by human observers in everyday life. This Data Descriptor presents a set of human responses to suprathreshold affine image transforms (rotation, translation, scaling) and Gaussian noise as a convenient reference to compare with previously existing image quality databases. The responses were measured using well established psychophysics: the Maximum Likelihood Difference Scaling method. The set contains responses to 864 distorted images. The experiments involved 105 observers and more than 20000 comparisons of quadruples of images. The quality of the dataset is ensured because (a) it reproduces the classical Piéron’s law, (b) it reproduces classical absolute detection thresholds, and (c) it is consistent with conventional image quality databases but improves them according to Group-MAD experiments.
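
To make the distortion types concrete, here is a minimal sketch of the three affine transforms named in the abstract (rotation, translation, scaling) applied to a pixel coordinate, plus additive Gaussian noise applied to intensities. The function names, the [0, 1] intensity range, and the parameter values are assumptions for illustration; this is not the procedure used to build the database.

```python
import math
import random


def affine_matrix(angle_deg=0.0, scale=1.0, tx=0.0, ty=0.0):
    """2x3 matrix: rotate about the origin, scale uniformly, then translate."""
    a = math.radians(angle_deg)
    c, s = scale * math.cos(a), scale * math.sin(a)
    return [[c, -s, tx], [s, c, ty]]


def apply_affine(m, point):
    """Map an (x, y) coordinate through the 2x3 affine matrix m."""
    x, y = point
    return (m[0][0] * x + m[0][1] * y + m[0][2],
            m[1][0] * x + m[1][1] * y + m[1][2])


def add_gaussian_noise(pixels, sigma, seed=0):
    """Add zero-mean Gaussian noise and clip to the assumed [0, 1] range."""
    rng = random.Random(seed)
    return [min(1.0, max(0.0, p + rng.gauss(0.0, sigma))) for p in pixels]


# Rotate by 90 degrees, double the size, shift one unit along x.
m = affine_matrix(angle_deg=90, scale=2, tx=1, ty=0)
print(apply_affine(m, (1.0, 0.0)))  # approximately (1.0, 2.0)
print(add_gaussian_noise([0.0, 0.5, 1.0], sigma=0.1))
```

Warping a full image would apply the inverse of this matrix to each output pixel and interpolate, which is omitted here for brevity.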

[CV-24] GAF: Gaussian Avatar Reconstruction from Monocular Videos via Multi-view Diffusion

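[Quick Read]: This paper addresses the reconstruction of animatable 3D Gaussian avatars from monocular videos. With the limited viewpoints of footage captured on consumer devices such as smartphones, unobserved regions are under-constrained, which can cause artifacts in novel views. The key to the solution is a multi-view head diffusion model whose priors fill in missing regions and enforce view consistency in Gaussian splatting renderings. Normal maps rendered from a FLAME-based head reconstruction provide pixel-aligned inductive biases for precise viewpoint control, and the diffusion model is conditioned on VAE features extracted from the input image to preserve details of facial identity and appearance. Multi-view diffusion priors are distilled by using iteratively denoised images as pseudo-ground truths, which mitigates over-saturation, and latent upsampling further improves photorealism.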

Link: https://arxiv.org/abs/2412.10209
Authors: Jiapeng Tang, Davide Davoli, Tobias Kirschstein, Liam Schoneveld, Matthias Niessner
Keywords: reconstructing animatable, approach for reconstructing, Gaussian avatar reconstruction, Gaussian, avatar reconstruction
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Graphics (cs.GR)
Comments: Paper Video: this https URL Project Page: this https URL

Abstract:We propose a novel approach for reconstructing animatable 3D Gaussian avatars from monocular videos captured by commodity devices like smartphones. Photorealistic 3D head avatar reconstruction from such recordings is challenging due to limited observations, which leaves unobserved regions under-constrained and can lead to artifacts in novel views. To address this problem, we introduce a multi-view head diffusion model, leveraging its priors to fill in missing regions and ensure view consistency in Gaussian splatting renderings. To enable precise viewpoint control, we use normal maps rendered from FLAME-based head reconstruction, which provides pixel-aligned inductive biases. We also condition the diffusion model on VAE features extracted from the input image to preserve details of facial identity and appearance. For Gaussian avatar reconstruction, we distill multi-view diffusion priors by using iteratively denoised images as pseudo-ground truths, effectively mitigating over-saturation issues. To further improve photorealism, we apply latent upsampling to refine the denoised latent before decoding it into an image. We evaluate our method on the NeRSemble dataset, showing that GAF outperforms the previous state-of-the-art methods in novel view synthesis by a 5.34% higher SSIM score. Furthermore, we demonstrate higher-fidelity avatar reconstructions from monocular videos captured on commodity devices.

[CV-25] Sims: An Interactive Tool for Geospatial Matching and Clustering

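[Quick Read]: This paper addresses the heavy computing resources needed to acquire, process, and visualize geospatial data over large spatio-temporal domains, which hinders the rapid discovery of predictive features. The key to the solution is Similarity Search (Sims), a no-code web tool that uses Google Earth Engine as a backend and lets users visualize, compare, cluster, and run similarity search over defined regions of interest. Sims focuses on feature exploration rather than model creation, so it complements existing modeling tools; a case study analyzing simulated maize yield data demonstrates its utility.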

Link: https://arxiv.org/abs/2412.10184
Authors: Akram Zaytar, Girmaw Abebe Tadesse, Caleb Robinson, Eduardo G. Bendito, Medha Devare, Meklit Chernet, Gilles Q. Hacheme, Rahul Dodhia, Juan M. Lavista Ferres
Keywords: significant computing resources, large spatio-temporal domains, requires significant computing, data requires significant, computing resources
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Geophysics (physics.geo-ph)

Abstract:Acquiring, processing, and visualizing geospatial data requires significant computing resources, especially for large spatio-temporal domains. This challenge hinders the rapid discovery of predictive features, which is essential for advancing geospatial modeling. To address this, we developed Similarity Search (Sims), a no-code web tool that allows users to visualize, compare, cluster, and perform similarity search over defined regions of interest using Google Earth Engine as a backend. Sims is designed to complement existing modeling tools by focusing on feature exploration rather than model creation. We demonstrate the utility of Sims through a case study analyzing simulated maize yield data in Rwanda, where we evaluate how different combinations of soil, weather, and agronomic features affect the clustering of yield response zones. Sims is open source and available at this https URL

[CV-26] Multi-Head Encoding for Extreme Label Classification

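[Quick Read]: This paper addresses the Classifier Computational Overload Problem (CCOP) in eXtreme Label Classification (XLC): as the number of categories grows, the number of parameters and nonlinear operations in the classifier rises sharply, making computation too expensive. The key to the solution is a Multi-Head Encoding (MHE) mechanism. During training, MHE decomposes each extreme label into multiple short local labels, with each head trained on its own local labels; during testing, the final label is computed directly from the heads' local predictions. This reduces the computational load geometrically, and the paper shows theoretically that MHE achieves performance approximately equivalent to that of the vanilla classifier. For different XLC tasks (single-label, multi-label, and model pretraining), the paper proposes three MHE-based implementations (Multi-Head Product, Multi-Head Cascade, and Multi-Head Sampling) to cope with CCOP more effectively.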

Link: https://arxiv.org/abs/2412.10182
Authors: Daojun Liang, Haixia Zhang, Dongfeng Yuan, Minggao Zhang
Keywords: eXtreme Label Classification, number of categories, real world, labels, Classifier Computational Overload
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 20 pages, 12 figs, Published in TPAMI

Abstract:The number of categories of instances in the real world is normally huge, and each instance may contain multiple labels. To distinguish these massive labels utilizing machine learning, eXtreme Label Classification (XLC) has been established. However, as the number of categories increases, the number of parameters and nonlinear operations in the classifier also rises. This results in a Classifier Computational Overload Problem (CCOP). To address this, we propose a Multi-Head Encoding (MHE) mechanism, which replaces the vanilla classifier with a multi-head classifier. During the training process, MHE decomposes extreme labels into the product of multiple short local labels, with each head trained on these local labels. During testing, the predicted labels can be directly calculated from the local predictions of each head. This reduces the computational load geometrically. Then, according to the characteristics of different XLC tasks, e.g., single-label, multi-label, and model pretraining tasks, three MHE-based implementations, i.e., Multi-Head Product, Multi-Head Cascade, and Multi-Head Sampling, are proposed to more effectively cope with CCOP. Moreover, we theoretically demonstrate that MHE can achieve performance approximately equivalent to that of the vanilla classifier by generalizing the low-rank approximation problem from Frobenius-norm to Cross-Entropy. Experimental results show that the proposed methods achieve state-of-the-art performance while significantly streamlining the training and inference processes of XLC tasks. The source code has been made public at this https URL.
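
One simple way to picture the label decomposition is a mixed-radix encoding, in which a global class index is split into one "digit" per head. The sketch below works under that assumption; the paper's exact encoding may differ, and `encode`, `decode`, and `head_sizes` are illustrative names.

```python
def encode(label, head_sizes):
    """Split a global label into one local label per head (mixed-radix digits)."""
    local = []
    for size in reversed(head_sizes):
        label, digit = divmod(label, size)
        local.append(digit)
    return local[::-1]


def decode(local, head_sizes):
    """Recombine the per-head local labels into the global label."""
    label = 0
    for digit, size in zip(local, head_sizes):
        label = label * size + digit
    return label


# Assumed setup: three heads of 10 outputs each cover 10 * 10 * 10 = 1000 classes.
head_sizes = [10, 10, 10]
print(encode(123, head_sizes))      # [1, 2, 3]
print(decode([1, 2, 3], head_sizes))  # 123
print(sum(head_sizes), "head outputs instead of 1000")
```

The saving comes from the heads needing only the sum of their sizes in output units, while the number of classes they can address is the product of those sizes.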

[CV-27] Ultra-High Resolution Segmentation via Boundary-Enhanced Patch-Merging Transformer

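[Quick Read]: This paper addresses a key problem in ultra-high resolution (UHR) image segmentation: given high spatial resolution and rich fine detail, how to handle the conflict between global and local information without significantly increasing computational cost. The key to the solution is a new method called Boundary-enhanced Patch-merging Transformer (BPT). BPT has two core components: the Patch-Merging Transformer (PMT), which dynamically allocates tokens to informative regions to acquire global and local representations, and the Boundary-Enhanced Module (BEM), which uses boundary information to enrich fine details. BPT outperforms existing state-of-the-art methods on multiple UHR image segmentation benchmarks without introducing extra computational overhead.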

Link: https://arxiv.org/abs/2412.10181
Authors: Haopeng Sun
Keywords: high spatial resolution, poses significant challenges, significant challenges due, rich fine details, numerous applications
Categories: Computer Vision and Pattern Recognition (cs.CV)

Abstract:Segmentation of ultra-high resolution (UHR) images is a critical task with numerous applications, yet it poses significant challenges due to high spatial resolution and rich fine details. Recent approaches adopt a dual-branch architecture, where a global branch learns long-range contextual information and a local branch captures fine details. However, they struggle to handle the conflict between global and local information while adding significant extra computational cost. Inspired by the human visual system’s ability to rapidly orient attention to important areas with fine details and filter out irrelevant information, we propose a novel UHR segmentation method called Boundary-enhanced Patch-merging Transformer (BPT). BPT consists of two key components: (1) Patch-Merging Transformer (PMT) for dynamically allocating tokens to informative regions to acquire global and local representations, and (2) Boundary-Enhanced Module (BEM) that leverages boundary information to enrich fine details. Extensive experiments on multiple UHR image segmentation benchmarks demonstrate that our BPT outperforms previous state-of-the-art methods without introducing extra computational overhead. Codes will be released to facilitate research.

[CV-28] SwiftTry: Fast and Consistent Video Virtual Try-On with Diffusion Models

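[Quick Read]: This paper addresses spatiotemporal consistency in video virtual try-on: extending image-based virtual try-on to video tends to produce frame-to-frame inconsistencies. The key to the solution is to recast video virtual try-on as a conditional video inpainting task and to add temporal attention layers to an image diffusion model to improve temporal coherence. The paper also proposes ShiftCaching, a technique that reduces the overhead of redundant computation and thereby speeds up inference. With these contributions, the method outperforms existing baselines, particularly in video consistency and inference speed.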

Link: https://arxiv.org/abs/2412.10178
Authors: Hung Nguyen, Quang Qui-Vinh Nguyen, Khoi Nguyen, Rang Nguyen
Keywords: maintaining spatiotemporal consistency, maintaining spatiotemporal, video, person is wearing, person
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Abstract:Given an input video of a person and a new garment, the objective of this paper is to synthesize a new video where the person is wearing the specified garment while maintaining spatiotemporal consistency. While significant advances have been made in image-based virtual try-ons, extending these successes to video often results in frame-to-frame inconsistencies. Some approaches have attempted to address this by increasing the overlap of frames across multiple video chunks, but this comes at a steep computational cost due to the repeated processing of the same frames, especially for long video sequences. To address these challenges, we reconceptualize video virtual try-on as a conditional video inpainting task, with garments serving as input conditions. Specifically, our approach enhances image diffusion models by incorporating temporal attention layers to improve temporal coherence. To reduce computational overhead, we introduce ShiftCaching, a novel technique that maintains temporal consistency while minimizing redundant computations. Furthermore, we introduce a new video try-on dataset featuring more complex backgrounds, challenging movements, and higher resolution compared to existing public datasets. Extensive experiments show that our approach outperforms current baselines, particularly in terms of video consistency and inference speed. Data and code are available at this https URL

[CV-29] UN-DETR: Promoting Objectness Learning via Joint Supervision for Unknown Object Detection AAAI-2025

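[Quick Read]: This paper addresses the key problem in Unknown Object Detection (UOD): identifying and localizing objects of unseen categories, beyond the closed-world assumption of traditional detection. Previous methods learn objectness from supervision signals that are isolated from either localization or classification information, which leads to poor performance. The proposed solution is UN-DETR, a transformer-based UOD framework, together with an Instance Presence Score (IPS) that represents the probability that an object is present. IPS is learned with joint supervision, integrating general objectness attributes from the positional and categorical latent spaces as supervision signals, and a one-to-many assignment strategy brings in more supervision. The paper also proposes Unbiased Query Selection to provide better initial query vectors for the decoder, and an IPS-guided post-processing strategy to filter redundant boxes and correct classification predictions. Finally, unsupervised pretraining is used to obtain an objectness prior. Experiments show that UN-DETR achieves state-of-the-art performance on multiple UOD and known-object detection benchmarks.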

Link: https://arxiv.org/abs/2412.10176
Authors: Haomiao Liu, Hao Xu, Chuhuai Yue, Bo Ma
Keywords: traditional detection paradigm, detection paradigm limited, aims to identify, closed-world assumption, paradigm limited
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by AAAI-2025; 15 pages, 11 figures

Abstract:Unknown Object Detection (UOD) aims to identify objects of unseen categories, differing from the traditional detection paradigm limited by the closed-world assumption. A key component of UOD is learning a generalized representation, i.e. objectness for both known and unknown categories to distinguish and localize objects from the background in a class-agnostic manner. However, previous methods obtain supervision signals for learning objectness in isolation from either localization or classification information, leading to poor performance for UOD. To address this issue, we propose a transformer-based UOD framework, UN-DETR. Based on this, we craft Instance Presence Score (IPS) to represent the probability of an object’s presence. For the purpose of information complementarity, IPS employs a strategy of joint supervised learning, integrating attributes representing general objectness from the positional and the categorical latent space as supervision signals. To enhance IPS learning, we introduce a one-to-many assignment strategy to incorporate more supervision. Then, we propose Unbiased Query Selection to provide premium initial query vectors for the decoder. Additionally, we propose an IPS-guided post-process strategy to filter redundant boxes and correct classification predictions for known and unknown objects. Finally, we pretrain the entire UN-DETR in an unsupervised manner, in order to obtain objectness prior. Our UN-DETR is comprehensively evaluated on multiple UOD and known detection benchmarks, demonstrating its effectiveness and achieving state-of-the-art performance.

[CV-30] Arbitrary Reading Order Scene Text Spotter with Local Semantics Guidance AAAI2025

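[Quick Read]: This paper addresses the reading-order problem in scene text spotting, in particular spotting text in arbitrary reading order without a sophisticated detection module. The key to the solution is the Local Semantics Guided scene text Spotter (LSGSpotter), which auto-regressively decodes the position and content of characters and uses local semantic information to determine the correct reading order. LSGSpotter has two core modules: the Start Point Localization Module (SPLM), which locates text start points, and the Multi-scale Adaptive Attention Module (MAAM), which adaptively aggregates text features within a local area. The method avoids the limitations of sophisticated detection modules, reduces computational cost through a grid sampling strategy, and achieves state-of-the-art performance on scene text with arbitrary shape and reading order.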

Link: https://arxiv.org/abs/2412.10159
Authors: Jiahao Lyu, Wei Wang, Dongbao Yang, Jinwen Zhong, Yu Zhou
Keywords: Scene text, reading order, local semantics, text, Semantics Guided scene
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by AAAI2025

Abstract:Scene text spotting has attracted considerable interest from researchers in recent years. Most existing scene text spotters follow the detection-then-recognition paradigm, in which the vanilla detection module can hardly determine the reading order, leading to recognition failures. After rethinking the auto-regressive scene text recognition method, we find that a well-trained recognizer can implicitly perceive the local semantics of all characters in a complete word or a sentence without a character-level detection module. Local semantic knowledge not only includes text content but also spatial information in the right reading order. Motivated by the above analysis, we propose the Local Semantics Guided scene text Spotter (LSGSpotter), which auto-regressively decodes the position and content of characters guided by the local semantics. Specifically, two effective modules are proposed in LSGSpotter. On the one hand, we design a Start Point Localization Module (SPLM) for locating text start points to determine the right reading order. On the other hand, a Multi-scale Adaptive Attention Module (MAAM) is proposed to adaptively aggregate text features in a local area. In conclusion, LSGSpotter achieves the arbitrary reading order spotting task without the limitation of sophisticated detection, while alleviating the cost of computational resources with the grid sampling strategy. Extensive experimental results show that LSGSpotter achieves state-of-the-art performance on the InverseText benchmark. Moreover, our spotter demonstrates superior performance on English benchmarks for arbitrary-shaped text, achieving improvements of 0.7% and 2.5% on Total-Text and SCUT-CTW1500, respectively. These results validate that our text spotter is effective for scene text in arbitrary reading order and shape.

[CV-31] WordVIS: A Color Worth A Thousand Words

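[Quick Read]: This paper addresses the underuse of multi-modal document classification methods in industry, caused by their need for large volumes of training data and extensive computational power. The key to the solution is to embed textual features directly into the visual space, so that lightweight image-based classifiers can achieve state-of-the-art classification results on small-scale datasets. On the Tobacco-3482 dataset, the approach yields a 4.64% improvement with ResNet50 and reaches 91.14% accuracy with DocXClassifier, both with no document pre-training; the latter sets a new record for this dataset.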

Link: https://arxiv.org/abs/2412.10155
Authors: Umar Khan, Saifullah, Stefan Agne, Andreas Dengel, Sheraz Ahmed
Keywords: document processing systems, automated document processing, Document classification, processing systems, considered a critical
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Abstract:Document classification is considered a critical element in automated document processing systems. In recent years multi-modal approaches have become increasingly popular for document classification. Despite their improvements, these approaches are underutilized in the industry due to their requirement for a tremendous volume of training data and extensive computational power. In this paper, we attempt to address these issues by embedding textual features directly into the visual space, allowing lightweight image-based classifiers to achieve state-of-the-art results using small-scale datasets in document classification. To evaluate the efficacy of the visual features generated from our approach on limited data, we tested on the standard dataset Tobacco-3482. Our experiments show a tremendous improvement in image-based classifiers, achieving an improvement of 4.64% using ResNet50 with no document pre-training. It also sets a new record for the best accuracy of the Tobacco-3482 dataset with a score of 91.14% using the image-based DocXClassifier with no document pre-training. The simplicity of the approach, its resource requirements, and subsequent results provide a good prospect for its use in industrial use cases.

[CV-32] EVOS: Efficient Implicit Neural Training via EVOlutionary Selector

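[Quick Read]: This paper addresses the excessive computational overhead of training Implicit Neural Representations (INRs). The key to the solution is an efficient training paradigm called EVOlutionary Selector (EVOS). In each iteration, EVOS trains on only a selected subset of samples rather than feeding all samples through the network as conventional training does, which removes redundant forward passes. Specifically, EVOS treats each sample as an individual in an evolutionary process, keeps only the fittest samples for training, and adapts the selection as the network evolves. To this end, EVOS comprises sparse fitness evaluation, frequency-guided crossover, and augmented unbiased mutation, which respectively reduce the cost of sample selection, improve performance through frequency-domain balance, and mitigate selection bias from cached evaluation. Experiments show that EVOS cuts training time by approximately 48%-66% while ensuring superior convergence, making it the state of the art among recent sampling-based acceleration strategies.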

Link: https://arxiv.org/abs/2412.10153
Authors: Weixiang Zhang, Shuzhao Xie, Chengwei Ren, Siyi Xie, Chen Tang, Shijia Ge, Mingzi Wang, Zhi Wang
Keywords: Implicit Neural Representation, accelerating Implicit Neural, propose EVOlutionary Selector, accelerating Implicit, Neural Representation
Categories: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM); Neural and Evolutionary Computing (cs.NE)

Abstract:We propose EVOlutionary Selector (EVOS), an efficient training paradigm for accelerating Implicit Neural Representation (INR). Unlike conventional INR training that feeds all samples through the neural network in each iteration, our approach restricts training to strategically selected points, reducing computational overhead by eliminating redundant forward passes. Specifically, we treat each sample as an individual in an evolutionary process, where only those fittest ones survive and merit inclusion in training, adaptively evolving with the neural network dynamics. While this is conceptually similar to Evolutionary Algorithms, their distinct objectives (selection for acceleration vs. iterative solution optimization) require a fundamental redefinition of evolutionary mechanisms for our context. In response, we design sparse fitness evaluation, frequency-guided crossover, and augmented unbiased mutation to comprise EVOS. These components respectively guide sample selection with reduced computational cost, enhance performance through frequency-domain balance, and mitigate selection bias from cached evaluation. Extensive experiments demonstrate that our method achieves approximately 48%-66% reduction in training time while ensuring superior convergence without additional cost, establishing state-of-the-art acceleration among recent sampling-based strategies.
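
The selection idea can be pictured with a small sketch: keep the samples with the largest current error, plus a few random ones so that the rest are not ignored forever. This is only a loose analogy under stated assumptions. Using per-sample error as "fitness" and random picks as "mutation" are choices made for this sketch, and `select_samples` is a hypothetical name. EVOS's sparse fitness evaluation and frequency-guided crossover are not modeled.

```python
import random


def select_samples(errors, keep_ratio, mutation_ratio=0.0, seed=0):
    """Pick sample indices to train on in the next iteration.

    errors: cached per-sample error (assumed stand-in for fitness).
    keep_ratio: fraction of samples kept because their error is largest.
    mutation_ratio: extra fraction drawn at random from the remaining samples.
    """
    n = len(errors)
    k = max(1, int(n * keep_ratio))
    ranked = sorted(range(n), key=lambda i: errors[i], reverse=True)
    chosen = set(ranked[:k])
    rest = ranked[k:]
    extra = min(len(rest), int(n * mutation_ratio))
    chosen.update(random.Random(seed).sample(rest, extra))
    return sorted(chosen)


errors = [0.1, 0.9, 0.3, 0.7]
print(select_samples(errors, keep_ratio=0.5))  # [1, 3]
print(select_samples(errors, keep_ratio=0.5, mutation_ratio=0.25))
```

Only the returned indices would be passed through the network in that iteration, which is where the saved forward passes come from.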

[CV-33] Investigating generalization capabilities of neural networks by means of loss landscapes and Hessian analysis

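[Quick Read]: This paper addresses the fact that conventional methods cannot always ensure correct visualization of loss landscapes when a neural network (NN) contains batch normalization layers. The key to the solution is to use Hessian axes to mitigate this effect, and the paper proposes methods for choosing them. The paper also studies the spectra of the Hessian eigendecomposition and proposes quantitative criteria based on them for evaluating NN performance and generalization. Experiments show that when datasets change, the changes in these criteria correlate with the changes in accuracy, which makes the criteria a computationally efficient estimate of generalization ability that is especially useful for extremely large datasets.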

Link: https://arxiv.org/abs/2412.10146
Authors: Nikita Gabdullin
Keywords: improved PyTorch library, PyTorch library Loss, library Loss Landscape, Loss Landscape Analysis, Loss Landscape
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)

Abstract:This paper studies generalization capabilities of neural networks (NNs) using the new and improved PyTorch library Loss Landscape Analysis (LLA). LLA facilitates visualization and analysis of loss landscapes along with the properties of the NN Hessian. Different approaches to NN loss landscape plotting are discussed, with particular focus on normalization techniques, showing that conventional methods cannot always ensure correct visualization when batch normalization layers are present in the NN architecture. The use of Hessian axes is shown to be able to mitigate this effect, and methods for choosing Hessian axes are proposed. In addition, spectra of the Hessian eigendecomposition are studied and it is shown that typical spectra exist for a wide range of NNs. This allows us to propose quantitative criteria for Hessian analysis that can be applied to evaluate NN performance and assess its generalization capabilities. Generalization experiments are conducted using ImageNet-1K pre-trained models along with several models trained as part of this study. The experiments include training models on one dataset and testing on another to bring the experiments closer to model performance in the wild. It is shown that when datasets change, the changes in criteria correlate with the changes in accuracy, making the proposed criteria a computationally efficient estimate of generalization ability, which is especially useful for extremely large datasets.
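
For readers unfamiliar with Hessian spectra, the toy sketch below estimates the Hessian of a two-parameter loss by central finite differences and computes its eigenvalues in closed form. It does not use the LLA library, whose API is not described here. The quadratic `loss` and the function names are assumptions for illustration, and real networks need approximate methods rather than a full Hessian.

```python
import math


def hessian_2d(f, x, y, h=1e-3):
    """Central finite-difference Hessian entries of a scalar function f(x, y)."""
    fxx = (f(x + h, y) - 2 * f(x, y) + f(x - h, y)) / h ** 2
    fyy = (f(x, y + h) - 2 * f(x, y) + f(x, y - h)) / h ** 2
    fxy = (f(x + h, y + h) - f(x + h, y - h)
           - f(x - h, y + h) + f(x - h, y - h)) / (4 * h ** 2)
    return fxx, fxy, fyy


def eigenvalues_sym2(fxx, fxy, fyy):
    """Eigenvalues (largest first) of the symmetric matrix [[fxx, fxy], [fxy, fyy]]."""
    mean = (fxx + fyy) / 2
    radius = math.hypot((fxx - fyy) / 2, fxy)
    return mean + radius, mean - radius


def loss(w1, w2):
    """Toy quadratic loss; its exact Hessian is [[6, 1], [1, 2]]."""
    return 3 * w1 ** 2 + w1 * w2 + w2 ** 2


entries = hessian_2d(loss, 0.5, -0.2)
print(entries)                     # close to (6, 1, 2)
print(eigenvalues_sym2(*entries))  # close to (4 + sqrt(5), 4 - sqrt(5))
```

The eigenvectors belonging to the largest eigenvalues give the directions of sharpest curvature, which is the sense in which they can serve as plotting axes.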

[CV-34] Constraint-Aware Zero-Shot Vision-Language Navigation in Continuous Environments

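[Quick Read]: This paper addresses Vision-Language Navigation in Continuous Environments (VLN-CE) under the zero-shot setting. The key to the solution is the Constraint-Aware Navigator (CA-Nav), which reframes zero-shot VLN-CE as a sequential, constraint-aware sub-instruction completion process. CA-Nav relies on two core modules: the Constraint-Aware Sub-instruction Manager (CSM) and the Constraint-Aware Value Mapper (CVM). CSM defines the completion criteria of decomposed sub-instructions as constraints and tracks navigation progress by switching sub-instructions in a constraint-aware manner; CVM, guided by CSM's constraints, generates a value map on the fly and refines it with superpixel clustering to improve navigation stability. The method achieves state-of-the-art performance on two VLN-CE benchmarks, surpassing the previous best method by 12 percent and 13 percent in Success Rate on the validation unseen splits of R2R-CE and RxR-CE, respectively.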

Link: https://arxiv.org/abs/2412.10137
Authors: Kehan Chen, Dong An, Yan Huang, Rongtao Xu, Yifei Su, Yonggen Ling, Ian Reid, Liang Wang
Keywords: Continuous Environments, address the task, task of Vision-Language, zero-shot setting, minimal environment structural
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)

Abstract:We address the task of Vision-Language Navigation in Continuous Environments (VLN-CE) under the zero-shot setting. Zero-shot VLN-CE is particularly challenging due to the absence of expert demonstrations for training and minimal environment structural prior to guide navigation. To confront these challenges, we propose a Constraint-Aware Navigator (CA-Nav), which reframes zero-shot VLN-CE as a sequential, constraint-aware sub-instruction completion process. CA-Nav continuously translates sub-instructions into navigation plans using two core modules: the Constraint-Aware Sub-instruction Manager (CSM) and the Constraint-Aware Value Mapper (CVM). CSM defines the completion criteria for decomposed sub-instructions as constraints and tracks navigation progress by switching sub-instructions in a constraint-aware manner. CVM, guided by CSM’s constraints, generates a value map on the fly and refines it using superpixel clustering to improve navigation stability. CA-Nav achieves the state-of-the-art performance on two VLN-CE benchmarks, surpassing the previous best method by 12 percent and 13 percent in Success Rate on the validation unseen splits of R2R-CE and RxR-CE, respectively. Moreover, CA-Nav demonstrates its effectiveness in real-world robot deployments across various indoor scenes and instructions.

[CV-35] The Art of Deception: Color Visual Illusions and Diffusion Models

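[Quick Read]: This paper examines the similarity between artificial neural networks (ANNs) and the human brain in perceiving visual illusions, and studies how well diffusion models encode and predict visual illusions. The key finding is that diffusion models show human-like brightness/color shifts in their latent space; the paper uses this property to predict visual illusions and to generate new ones. Psychophysical experiments confirm that the model-generated illusions also fool human observers.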

Link: https://arxiv.org/abs/2412.10122
Authors: Alex Gomez-Villa, Kai Wang, Alejandro C. Parraga, Bartlomiej Twardowski, Jesus Malo, Javier Vazquez-Corral, Joost van de Weijer
Keywords: Visual illusions, arise when interpreting, deviates from reality, illusions, observer is adapted
Categories: Computer Vision and Pattern Recognition (cs.CV)

Abstract:Visual illusions in humans arise when interpreting out-of-distribution stimuli: if the observer is adapted to certain statistics, perception of outliers deviates from reality. Recent studies have shown that artificial neural networks (ANNs) can also be deceived by visual illusions. This revelation raises profound questions about the nature of visual information. Why are two independent systems, both human brains and ANNs, susceptible to the same illusions? Should any ANN be capable of perceiving visual illusions? Are these perceptions a feature or a flaw? In this work, we study how visual illusions are encoded in diffusion models. Remarkably, we show that they present human-like brightness/color shifts in their latent space. We use this fact to demonstrate that diffusion models can predict visual illusions. Furthermore, we also show how to generate new unseen visual illusions in realistic images using text-to-image diffusion models. We validate this ability through psychophysical experiments that show how our model-generated illusions also fool humans.

[CV-36] HS-FPN: High Frequency and Spatial Perception FPN for Tiny Object Detection

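[Quick Read]: This paper addresses the difficulty of detecting tiny objects in object detection: because tiny objects occupy only a very small proportion of the feature maps, their features are hard to extract effectively. The key to the solution is a new High Frequency and Spatial Perception Feature Pyramid Network (HS-FPN) with two new modules: the High Frequency Perception Module (HFP) and the Spatial Dependency Perception Module (SDP). HFP generates high-frequency responses through high-pass filters and uses them to enrich and highlight tiny-object features from both spatial and channel perspectives; SDP captures the spatial dependencies that FPN lacks, which improves tiny-object detection.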

Link: https://arxiv.org/abs/2412.10116
Authors: Zican Shi, Jing Hu, Jie Ren, Hengkang Ye, Xuyang Yuan, Yan Ouyang, Jia He, Bo Ji, Junyu Guo
Keywords: Feature Pyramid Network, Pyramid Network, significantly improved object, Feature Pyramid, Perception Feature Pyramid
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 13 pages, 12 figures, 7 tables

Abstract:The introduction of Feature Pyramid Network (FPN) has significantly improved object detection performance. However, substantial challenges remain in detecting tiny objects, as their features occupy only a very small proportion of the feature maps. Although FPN integrates multi-scale features, it does not directly enhance or enrich the features of tiny objects. Furthermore, FPN lacks spatial perception ability. To address these issues, we propose a novel High Frequency and Spatial Perception Feature Pyramid Network (HS-FPN) with two innovative modules. First, we designed a high frequency perception module (HFP) that generates high frequency responses through high pass filters. These high frequency responses are used as mask weights from both spatial and channel perspectives to enrich and highlight the features of tiny objects in the original feature maps. Second, we developed a spatial dependency perception module (SDP) to capture the spatial dependencies that FPN lacks. Our experiments demonstrate that detectors based on HS-FPN exhibit competitive advantages over state-of-the-art models on the AI-TOD dataset for tiny object detection.
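
The idea of using a high-frequency response as a mask can be illustrated with a small spatial-domain stand-in: subtract each pixel's 3x3 neighborhood mean (a simple high-pass filter) and use the normalized response to re-weight a feature map. The abstract does not specify the filter or the weighting, so the filter choice, the `1 + response` weighting, and the function names are all assumptions of this sketch, and the channel-wise branch is omitted.

```python
def high_pass(img):
    """Absolute difference between each pixel and its 3x3 neighborhood mean."""
    h, w = len(img), len(img[0])
    out = [[0.0] * w for _ in range(h)]
    for r in range(h):
        for c in range(w):
            window = [img[rr][cc]
                      for rr in range(max(0, r - 1), min(h, r + 2))
                      for cc in range(max(0, c - 1), min(w, c + 2))]
            out[r][c] = abs(img[r][c] - sum(window) / len(window))
    return out


def apply_mask(feature, response):
    """Boost features where the high-frequency response is strong (assumed weighting)."""
    peak = max(max(row) for row in response) or 1.0
    return [[f * (1.0 + v / peak) for f, v in zip(frow, vrow)]
            for frow, vrow in zip(feature, response)]


# A 5x5 map that is flat except for one "tiny object" pixel in the center.
img = [[0.0] * 5 for _ in range(5)]
img[2][2] = 9.0
response = high_pass(img)
print(response[2][2], response[1][1], response[0][0])  # 8.0 1.0 0.0
boosted = apply_mask(img, response)
print(boosted[2][2])  # 18.0
```

Flat regions produce a zero response and pass through unchanged, while the isolated bright pixel is amplified.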

[CV-37] Filter or Compensate: Towards Invariant Representation from Distribution Shift for Anomaly Detection AAAI2025

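[Quick Read]: This paper addresses the performance drop that distribution shift causes in Anomaly Detection (AD). The key to the solution is the FiCo (Filter or Compensate) framework, which handles distribution shift with two core modules. First, the Distribution-Specific Compensation (DiSCo) module compensates for distribution-specific information to reduce the misalignment between the teacher and student networks. Second, the Distribution-Invariant Filter (DiIFi) module filters out all abnormal information to capture distribution-invariant normality. Experiments show that FiCo outperforms existing state-of-the-art (SOTA) methods on three AD benchmarks and even achieves better results than RD-based methods in the In-Distribution (ID) scenario.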

Link: https://arxiv.org/abs/2412.10115
Authors: Zining Chen, Xingshuang Luo, Weiqiu Wang, Zhicheng Zhao, Fei Su, Aidong Men
Keywords: Recent Anomaly Detection, Recent Anomaly, achieved great success, Anomaly Detection, success with In-Distribution
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by AAAI2025

Abstract:Recent Anomaly Detection (AD) methods have achieved great success with In-Distribution (ID) data. However, real-world data often exhibits distribution shift, causing huge performance decay on traditional AD methods. From this perspective, few previous work has explored AD with distribution shift, and the distribution-invariant normality learning has been proposed based on the Reverse Distillation (RD) framework. However, we observe the misalignment issue between the teacher and the student network that causes detection failure, thereby propose FiCo, Filter or Compensate, to address the distribution shift issue in AD. FiCo firstly compensates the distribution-specific information to reduce the misalignment between the teacher and student network via the Distribution-Specific Compensation (DiSCo) module, and secondly filters all abnormal information to capture distribution-invariant normality with the Distribution-Invariant Filter (DiIFi) module. Extensive experiments on three different AD benchmarks demonstrate the effectiveness of FiCo, which outperforms all existing state-of-the-art (SOTA) methods, and even achieves better results on the ID scenario compared with RD-based methods. Our code is available at this https URL.
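
The abstract builds on Reverse Distillation, where a mismatch between teacher and student features signals an anomaly. Below is a minimal sketch of that scoring step only. It assumes features arrive as an H x W grid of vectors and uses cosine distance as the mismatch measure, which is a common choice and not confirmed by this listing as FiCo's exact score. The function names are hypothetical.

```python
import math


def cosine_distance(a, b):
    """1 - cosine similarity; returns 1.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)


def anomaly_map(teacher, student):
    """Per-location mismatch between teacher and student feature grids."""
    return [
        [cosine_distance(t, s) for t, s in zip(t_row, s_row)]
        for t_row, s_row in zip(teacher, student)
    ]


# Assumed toy features: the first location agrees, the second does not.
teacher = [[[1.0, 0.0], [0.0, 1.0]]]
student = [[[2.0, 0.0], [1.0, 0.0]]]
scores = anomaly_map(teacher, student)
```

Locations where the student reproduces the teacher score near 0, and locations where it fails score near 1. The misalignment issue described in the abstract would show up here as high scores on normal but distribution-shifted inputs.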

[CV-38] Data Pruning Can Do More: A Comprehensive Data Pruning Approach for Object Re-identification

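[Quick read]: This paper examines whether data pruning is feasible for object re-identification (ReID): removing unimportant or less informative samples while keeping performance comparable to training on the original dataset, thereby reducing storage and training costs. The key to the solution is a comprehensive data pruning approach based on the logit history recorded during training, which quantifies sample importance more accurately, identifies and corrects mislabeled samples, and detects outliers. The approach is also 10 times more efficient than existing methods at estimating importance scores, and it is a plug-and-play, architecture-agnostic framework that removes 35%, 30%, and 5% of samples/training time on the VeRi, MSMT17, and Market1501 datasets, respectively, with negligible accuracy loss (<0.1%).

Link: https://arxiv.org/abs/2412.10091
Authors: Zi Yang, Haojin Yang, Soumajit Majumder, Jorge Cardoso, Guillermo Gallego
Keywords: Previous studies, studies have demonstrated, Data pruning, Previous, training
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: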

Abstract:Previous studies have demonstrated that not each sample in a dataset is of equal importance during training. Data pruning aims to remove less important or informative samples while still achieving comparable results as training on the original (untruncated) dataset, thereby reducing storage and training costs. However, the majority of data pruning methods are applied to image classification tasks. To our knowledge, this work is the first to explore the feasibility of these pruning methods applied to object re-identification (ReID) tasks, while also presenting a more comprehensive data pruning approach. By fully leveraging the logit history during training, our approach offers a more accurate and comprehensive metric for quantifying sample importance, as well as correcting mislabeled samples and recognizing outliers. Furthermore, our approach is highly efficient, reducing the cost of importance score estimation by 10 times compared to existing methods. Our approach is a plug-and-play, architecture-agnostic framework that can eliminate/reduce 35%, 30%, and 5% of samples/training time on the VeRi, MSMT17 and Market1501 datasets, respectively, with negligible loss in accuracy (<0.1%). The lists of important, mislabeled, and outlier samples from these ReID datasets are available at this https URL.
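
To make the "logit history" idea concrete, here is a rough sketch under stated assumptions. It averages the softmax probabilities recorded for one sample across epochs, uses the mean probability of the given label as a confidence score, and flags the sample when another class dominates. This is an illustrative stand-in, not the paper's actual metric, and `summarize_history` is a hypothetical name.

```python
import math


def softmax(logits):
    """Numerically stable softmax over one logit vector."""
    top = max(logits)
    exps = [math.exp(z - top) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def summarize_history(logit_history, label):
    """logit_history: one logit vector per epoch for a single sample."""
    probs = [softmax(logits) for logits in logit_history]
    n_epochs, n_classes = len(probs), len(probs[0])
    mean_probs = [sum(p[c] for p in probs) / n_epochs for c in range(n_classes)]
    suggested = max(range(n_classes), key=mean_probs.__getitem__)
    return {
        "confidence": mean_probs[label],       # low values: hard or noisy sample
        "suggested_label": suggested,          # class the model favors on average
        "possibly_mislabeled": suggested != label,
    }


# Assumed toy history: the model favors class 0 in both recorded epochs.
history = [[2.0, 0.0, 0.0], [3.0, 0.0, 0.0]]
clean = summarize_history(history, label=0)
noisy = summarize_history(history, label=1)
```

Because the history is collected during ordinary training, a score of this kind needs no extra forward passes, which is one plausible reading of the efficiency claim in the abstract.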

[CV-39] Guidance Not Obstruction: A Conjugate Consistent Enhanced Strategy for Domain Generalization

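[Quick read]: This paper addresses domain shift in domain generalization, specifically the lack of between-class discriminative information that arises in multi-domain settings because marginal alignment does not guarantee conditional alignment. The key to the solution is a new Conjugate Consistent Enhanced Module (Con2EM), which introduces a meta-distribution to generate diverse domain-related class-conditional distributions and thereby strengthen between-class discrimination. Specifically, the paper adopts a distribution-level Universum strategy to generate supplementary class-conditional distributions, then resamples from them to provide feedback to the original instance-level classifier, improving its adaptability to target-agnostic scenarios. To keep the generated distributions accurate, the paper also adds a distribution-level classifier that regularizes these conditional distributions.

Link: https://arxiv.org/abs/2412.10089
Authors: Meng Cao, Songcan Chen
Keywords: addresses domain shift, generalization addresses domain, Domain generalization addresses, real-world applications, addresses domain
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: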

Abstract:Domain generalization addresses domain shift in real-world applications. Most approaches adopt a domain angle, seeking invariant representation across domains by aligning their marginal distributions, irrespective of individual classes, naturally leading to insufficient exploration of discriminative information. Switching to a class angle, we find that multiple domain-related peaks or clusters within the same individual classes must emerge due to distribution shift. In other words, marginal alignment does not guarantee conditional alignment, leading to suboptimal generalization. Therefore, we argue that acquiring discriminative generalization between classes within domains is crucial. In contrast to seeking distribution alignment, we endeavor to safeguard domain-related between-class discrimination. To this end, we devise a novel Conjugate Consistent Enhanced Module, namely Con2EM, based on a distribution over domains, i.e., a meta-distribution. Specifically, we employ a novel distribution-level Universum strategy to generate supplementary diverse domain-related class-conditional distributions, thereby enhancing generalization. This allows us to resample from these generated distributions to provide feedback to the primordial instance-level classifier, further improving its adaptability to the target-agnostic. To ensure generation accuracy, we establish an additional distribution-level classifier to regularize these conditional distributions. Extensive experiments have been conducted to demonstrate its effectiveness and low computational cost compared to SOTAs.

[CV-40] ProbeSDF: Light Field Probes for Neural Surface Reconstruction

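[Quick read]: This paper aims to improve differential rendering frameworks based on signed distance functions (SDF) for more efficient multi-view 3D shape reconstruction. The key to the solution is a redesigned appearance model: a physically inspired minimal radiance parametrization decouples angular and spatial contributions and encodes them in two voxel grids of different resolutions. The method needs only four parameters per voxel plus one call to a small multilayer perceptron (MLP), which markedly improves speed and performance while enabling real-time rendering and faster training. Its consistently high performance is validated on four challenging datasets across two broad application areas, generic object and human shape reconstruction.

Link: https://arxiv.org/abs/2412.10084
Authors: Briac Toussaint, Diego Thomas, Jean-Sébastien Franco
Keywords: SDF-based differential rendering, differential rendering frameworks, SDF-based differential, shape reconstruction, differential rendering
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 10 pages, 9 figures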

Abstract:SDF-based differential rendering frameworks have achieved state-of-the-art multiview 3D shape reconstruction. In this work, we re-examine this family of approaches by minimally reformulating its core appearance model in a way that simultaneously yields faster computation and increased performance. To this goal, we exhibit a physically-inspired minimal radiance parametrization decoupling angular and spatial contributions, by encoding them with a small number of features stored in two respective volumetric grids of different resolutions. Requiring as little as four parameters per voxel, and a tiny MLP call inside a single fully fused kernel, our approach allows to enhance performance with both surface and image (PSNR) metrics, while providing a significant training speedup and real-time rendering. We show this performance to be consistently achieved on real data over two widely different and popular application fields, generic object and human subject shape reconstruction, using four representative and challenging datasets.

[CV-41] Toy-GS: Assembling Local Gaussians for Precisely Rendering Large-Scale Free Camera Trajectories

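[Quick read]: This paper addresses 3D rendering for large-scale free camera trajectories (arbitrary input camera trajectories). The main challenges are the irregular distribution and observation angles of the cameras and the large GPU memory demand of processing large-scale scenes. The key to the solution is an adaptive spatial division approach that splits the cameras and the scene's sparse point cloud into multiple regions according to camera poses, and trains a local Gaussian for each region in parallel to concentrate on texture details and reduce GPU memory usage. In addition, a multi-view constraint and position-aware point adaptive control (PPAC) improve the rendering quality of texture details, and a regional fusion approach combines local and global Gaussians to further improve rendering. Experiments show that the method improves PSNR by 1.19 dB and saves 7 G of GPU memory.

Link: https://arxiv.org/abs/2412.10078
Authors: Xiaohan Zhang, Zhenyu Sun, Yukui Qiu, Junyan Su, Qi Liu
Keywords: arbitrary input camera, free camera trajectories, poses significant challenges, large-scale free camera, input camera trajectories
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: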

Abstract:Currently, 3D rendering for large-scale free camera trajectories, namely, arbitrary input camera trajectories, poses significant challenges: 1) The distribution and observation angles of the cameras are irregular, and various types of scenes are included in the free trajectories; 2) Processing the entire point cloud and all images at once for large-scale scenes requires a substantial amount of GPU memory. This paper presents a Toy-GS method for accurately rendering large-scale free camera trajectories. Specifically, we propose an adaptive spatial division approach for free trajectories to divide cameras and the sparse point cloud of the entire scene into various regions according to camera poses. Training each local Gaussian in parallel for each area enables us to concentrate on texture details and minimize GPU memory usage. Next, we use the multi-view constraint and position-aware point adaptive control (PPAC) to improve the rendering quality of texture details. In addition, our regional fusion approach combines local and global Gaussians to enhance rendering quality with an increasing number of divided areas. Extensive experiments have been carried out to confirm the effectiveness and efficiency of Toy-GS, leading to state-of-the-art results on two public large-scale datasets as well as our SCUTic dataset. Our proposal demonstrates an enhancement of 1.19 dB in PSNR and conserves 7 G of GPU memory when compared to various benchmarks.
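
The spatial division step can be pictured with a small sketch. It assumes cameras are reduced to 2D positions and splits them recursively at the median of the wider axis until every region holds at most a fixed number of cameras. This is a generic k-d style partition used for illustration, not the paper's adaptive scheme, and `split_cameras` and `max_per_region` are hypothetical names.

```python
def split_cameras(positions, max_per_region):
    """positions: list of (x, y) camera centers. Returns lists of camera indices."""
    if max_per_region < 1:
        raise ValueError("max_per_region must be at least 1")

    def recurse(indices):
        if len(indices) <= max_per_region:
            return [indices]
        xs = [positions[i][0] for i in indices]
        ys = [positions[i][1] for i in indices]
        # Split along the axis with the larger spatial extent.
        axis = 0 if (max(xs) - min(xs)) >= (max(ys) - min(ys)) else 1
        ordered = sorted(indices, key=lambda i: positions[i][axis])
        mid = len(ordered) // 2
        return recurse(ordered[:mid]) + recurse(ordered[mid:])

    return recurse(list(range(len(positions))))


# Assumed toy trajectory: ten cameras along a straight line.
cameras = [(float(i), 0.0) for i in range(10)]
regions = split_cameras(cameras, max_per_region=3)
```

Each region could then be trained independently, which is what keeps the per-job memory footprint bounded in a scheme of this kind.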

[CV-42] Quaffure: Real-Time Quasi-Static Neural Hair Simulation

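[Quick read]: This paper addresses the computational resource limits on high-quality avatar hair motion in real-time applications. The key to the solution is a novel neural approach that predicts physically plausible hair deformations and generalizes to various body poses, shapes, and hairstyles. The model is trained with a self-supervised loss, which avoids expensive data generation and storage. Inference takes only a few milliseconds on consumer hardware, and the method scales to predicting the drape of 1000 grooms in 0.3 seconds.

Link: https://arxiv.org/abs/2412.10061
Authors: Tuur Stuyck, Gene Wei-Chin Lin, Egor Larionov, Hsiao-yu Chen, Aljaz Bozic, Nikolaos Sarafianos, Doug Roble
Keywords: Realistic hair motion, Realistic hair, high-quality avatars, motion is crucial, crucial for high-quality
Categories: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
Comments: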

Abstract:Realistic hair motion is crucial for high-quality avatars, but it is often limited by the computational resources available for real-time applications. To address this challenge, we propose a novel neural approach to predict physically plausible hair deformations that generalizes to various body poses, shapes, and hairstyles. Our model is trained using a self-supervised loss, eliminating the need for expensive data generation and storage. We demonstrate our method’s effectiveness through numerous results across a wide range of pose and shape variations, showcasing its robust generalization capabilities and temporally smooth results. Our approach is highly suitable for real-time applications with an inference time of only a few milliseconds on consumer hardware and its ability to scale to predicting the drape of 1000 grooms in 0.3 seconds.

[CV-43] TSGaussian: Semantic and Depth-Guided Target-Specific Gaussian Splatting from Sparse Views

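[Quick read]: This paper addresses geometry degradation, often overlooked by existing methods, when reconstructing specified targets with complex structures from sparse views. The key to the solution is the TSGaussian framework, which combines semantic constraints with depth priors and prioritizes computational resources for the designated targets while minimizing the allocation to the background. In practice, bounding boxes from YOLOv9 serve as prompts for the Segment Anything Model to generate 2D mask predictions, ensuring semantic accuracy and cost efficiency. TSGaussian also clusters 3D Gaussians by introducing a compact identity encoding and 3D spatial consistency regularization, and proposes a pruning strategy to reduce redundancy in the 3D Gaussians, achieving strong results in novel view synthesis of specific objects.

Link: https://arxiv.org/abs/2412.10051
Authors: Liang Zhao, Zehan Bao, Yi Xie, Hong Chen, Yaohui Chen, Weifu Li
Keywords: Recent advances, Splatting have significantly, advanced the field, Gaussian Splatting, significantly advanced
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: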

Abstract:Recent advances in Gaussian Splatting have significantly advanced the field, achieving both panoptic and interactive segmentation of 3D scenes. However, existing methodologies often overlook the critical need for reconstructing specified targets with complex structures from sparse views. To address this issue, we introduce TSGaussian, a novel framework that combines semantic constraints with depth priors to avoid geometry degradation in challenging novel view synthesis tasks. Our approach prioritizes computational resources on designated targets while minimizing background allocation. Bounding boxes from YOLOv9 serve as prompts for Segment Anything Model to generate 2D mask predictions, ensuring semantic accuracy and cost efficiency. TSGaussian effectively clusters 3D gaussians by introducing a compact identity encoding for each Gaussian ellipsoid and incorporating 3D spatial consistency regularization. Leveraging these modules, we propose a pruning strategy to effectively reduce redundancy in 3D gaussians. Extensive experiments demonstrate that TSGaussian outperforms state-of-the-art methods on three standard datasets and a new challenging dataset we collected, achieving superior results in novel view synthesis of specific objects. Code is available at: this https URL.

[CV-44] ManipGPT: Is Affordance Segmentation by Large Vision Models Enough for Articulated Object Manipulation?

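[Quick read]: This paper addresses the computational cost and limited environmental adaptability of traditional methods for perceiving interaction areas in robotics. The key to the solution is the ManipGPT framework, which uses a pre-trained vision transformer (ViT) to predict the optimal interaction areas of articulated objects. Fine-tuning the ViT on a dataset of 9.9k simulated and real images significantly improves part-level affordance segmentation. Combined with an impedance adaptation policy, this enables effective manipulation in both simulated and real environments without complex perception systems or large-scale datasets.

Link: https://arxiv.org/abs/2412.10050
Authors: Taewhan Kim, Hojin Bae, Zeming Li, Xiaoqi Li, Iaroslav Ponomarenko, Ruihai Wu, Hao Dong
Keywords: Visual actionable affordance, Visual actionable, perceiving interaction areas, interaction areas prior, approach in robotics
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Comments: 8 pages, 6 figures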

Abstract:Visual actionable affordance has emerged as a transformative approach in robotics, focusing on perceiving interaction areas prior to manipulation. Traditional methods rely on pixel sampling to identify successful interaction samples or processing pointclouds for affordance mapping. However, these approaches are computationally intensive and struggle to adapt to diverse and dynamic environments. This paper introduces ManipGPT, a framework designed to predict optimal interaction areas for articulated objects using a large pre-trained vision transformer (ViT). We created a dataset of 9.9k simulated and real images to bridge the sim-to-real gap and enhance real-world applicability. By fine-tuning the vision transformer on this small dataset, we significantly improved part-level affordance segmentation, adapting the model’s in-context segmentation capabilities to robot manipulation scenarios. This enables effective manipulation across simulated and real-world environments by generating part-level affordance masks, paired with an impedance adaptation policy, sufficiently eliminating the need for complex datasets or perception systems.

[CV-45] SuperMark: Robust and Training-free Image Watermarking via Diffusion-based Super-Resolution

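[Quick read]: This paper addresses copyright protection and content authentication in settings where generative AI content and authentic content are mixed, in particular the failure of existing watermarking techniques to balance robustness and fidelity and their vulnerability to adaptive attacks. The key to the solution is SuperMark, a robust, training-free watermarking framework. Drawing on the denoising/noising processes of diffusion models, it embeds the watermark into the initial Gaussian noise and denoises that noise with a pre-trained Super-Resolution (SR) model to produce the final watermarked image. For extraction, DDIM Inversion maps the watermarked image back to the initial watermarked noise, from which the embedded watermark is recovered. The method reaches high extraction accuracy under standard distortions (99.46%), markedly better robustness under adaptive attacks (89.29%), and high fidelity, and it transfers well across datasets, SR models, embedding methods, and resolutions.

Link: https://arxiv.org/abs/2412.10049
Authors: Runyi Hu, Jie Zhang, Yiming Li, Jiwei Li, Qing Guo, Han Qiu, Tianwei Zhang
Keywords: today digital landscape, digital landscape, today digital, blending of AI-generated, AI-generated and authentic
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: robust image watermarking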

Abstract:In today’s digital landscape, the blending of AI-generated and authentic content has underscored the need for copyright protection and content authentication. Watermarking has become a vital tool to address these challenges, safeguarding both generated and real content. Effective watermarking methods must withstand various distortions and attacks. Current deep watermarking techniques often use an encoder-noise layer-decoder architecture and include distortions to enhance robustness. However, they struggle to balance robustness and fidelity and remain vulnerable to adaptive attacks, despite extensive training. To overcome these limitations, we propose SuperMark, a robust, training-free watermarking framework. Inspired by the parallels between watermark embedding/extraction in watermarking and the denoising/noising processes in diffusion models, SuperMark embeds the watermark into initial Gaussian noise using existing techniques. It then applies pre-trained Super-Resolution (SR) models to denoise the watermarked noise, producing the final watermarked image. For extraction, the process is reversed: the watermarked image is inverted back to the initial watermarked noise via DDIM Inversion, from which the embedded watermark is extracted. This flexible framework supports various noise injection methods and diffusion-based SR models, enabling enhanced customization. The robustness of the DDIM Inversion process against perturbations allows SuperMark to achieve strong resilience to distortions while maintaining high fidelity. Experiments demonstrate that SuperMark achieves fidelity comparable to existing methods while significantly improving robustness. Under standard distortions, it achieves an average watermark extraction accuracy of 99.46%, and 89.29% under adaptive attacks. Moreover, SuperMark shows strong transferability across datasets, SR models, embedding methods, and resolutions.
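
The first step of the pipeline, hiding a watermark in the initial Gaussian noise, can be sketched in isolation. The abstract only says that "existing techniques" are used, so the scheme below is an assumed toy: each bit fixes the sign of a chunk of half-normal samples, and extraction takes a majority vote over signs. The diffusion-based SR denoising and DDIM Inversion stages are not modeled, and `embed`, `extract`, and `repeats` are hypothetical names.

```python
import random


def embed(bits, repeats, seed=0):
    """Return noise samples whose signs encode the bits, `repeats` samples per bit."""
    rng = random.Random(seed)
    noise = []
    for bit in bits:
        for _ in range(repeats):
            magnitude = abs(rng.gauss(0.0, 1.0))
            noise.append(magnitude if bit else -magnitude)
    return noise


def extract(noise, n_bits, repeats):
    """Recover each bit by majority vote over the signs in its chunk."""
    bits = []
    for i in range(n_bits):
        chunk = noise[i * repeats:(i + 1) * repeats]
        positives = sum(1 for value in chunk if value > 0)
        bits.append(1 if positives * 2 > len(chunk) else 0)
    return bits


# Assumed toy watermark of five bits, seven noise samples per bit.
watermark = [1, 0, 1, 1, 0]
noise = embed(watermark, repeats=7)
recovered = extract(noise, n_bits=len(watermark), repeats=7)
```

When the bits are uniformly random, each sample keeps a standard normal marginal distribution, which is why sign-based schemes are a natural fit for diffusion noise. The majority vote means a bit survives as long as fewer than half of its samples flip sign after inversion.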

[CV-46] RemDet: Rethinking Efficient Model Design for UAV Object Detection AAAI25

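[Quick read]: This paper addresses object detection in UAV images, where the main challenges are: 1) objects are typically small and densely distributed within vast images; 2) limited computational resources prevent most models from being deployed in real time. The key to the solution is a new detector, RemDet (Reparameter efficient multiplication Detector), with the following contributions: 1) it rethinks the information loss of existing detectors on small, dense UAV images and adopts it as a design guideline for efficient models; 2) it introduces the ChannelC2f module, which improves small object detection through high-dimensional representations; 3) it designs the GatedFFN module, which delivers strong performance at low latency for real-time detection; 4) it proposes the CED module, which combines the advantages of ViT and CNN downsampling to reduce information loss and enrich context information for small, dense objects. Experiments show real-time efficiency and superior detection performance on large UAV datasets such as Visdrone and UAVDT.

Link: https://arxiv.org/abs/2412.10040
Authors: Chen Li, Rui Zhao, Zeyu Wang, Huiying Xu, Xinzhong Zhu
Keywords: Unmanned Aerial Vehicle, Aerial Vehicle, Unmanned Aerial, computational resource constraints, resource constraints render
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to AAAI25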

Abstract:Object detection in Unmanned Aerial Vehicle (UAV) images has emerged as a focal area of research, which presents two significant challenges: i) objects are typically small and dense within vast images; ii) computational resource constraints render most models unsuitable for real-time deployment. Current real-time object detectors are not optimized for UAV images, and complex methods designed for small object detection often lack real-time capabilities. To address these challenges, we propose a novel detector, RemDet (Reparameter efficient multiplication Detector). Our contributions are as follows: 1) Rethinking the challenges of existing detectors for small and dense UAV images, and proposing information loss as a design guideline for efficient models. 2) We introduce the ChannelC2f module to enhance small object detection performance, demonstrating that high-dimensional representations can effectively mitigate information loss. 3) We design the GatedFFN module to provide not only strong performance but also low latency, effectively addressing the challenges of real-time detection. Our research reveals that GatedFFN, through the use of multiplication, is more cost-effective than feed-forward networks for high-dimensional representation. 4) We propose the CED module, which combines the advantages of ViT and CNN downsampling to effectively reduce information loss. It specifically enhances context information for small and dense objects. Extensive experiments on large UAV datasets, Visdrone and UAVDT, validate the real-time efficiency and superior performance of our methods. On the challenging UAV dataset VisDrone, our methods not only provided state-of-the-art results, improving detection by more than 3.4%, but also achieve 110 FPS on a single this http URL are available at (this URL)(this https URL).

[CV-47] Timealign: A multi-modal object detection method for time misalignment fusing in autonomous driving

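[Quick read]: This paper addresses the challenge of time alignment for multi-modal perception in autonomous driving, in particular the time misalignment caused by LiDAR data transmission delays. The key to the solution is a module named Timealign, built on the SOTA GraphBEV framework, which predicts LiDAR features from historical frames and combines them with current observations to handle time misalignment and obtain more accurate environmental information.

Link: https://arxiv.org/abs/2412.10033
Authors: Zhihang Song, Lihui Peng, Jianming Hu, Danya Yao, Yi Zhang
Keywords: multi-modal perception methods, driving field due, autonomous driving field, multi-modal perception, field due
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 8 pages, 3 figures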

Abstract:The multi-modal perception methods are thriving in the autonomous driving field due to their better usage of complementary data from different sensors. Such methods depend on calibration and synchronization between sensors to get accurate environmental information. There have already been studies about space-alignment robustness in autonomous driving object detection process, however, the research for time-alignment is relatively few. As in reality experiments, LiDAR point clouds are more challenging for real-time data transfer, our study used historical frames of LiDAR to better align features when the LiDAR data lags exist. We designed a Timealign module to predict and combine LiDAR features with observation to tackle such time misalignment based on SOTA GraphBEV framework.
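
The "predict and combine" idea can be sketched with a deliberately simple stand-in. It assumes, purely for illustration, that LiDAR features drift linearly between frames, so the lagging feature can be extrapolated from two historical frames and then blended with whatever observation is available. The paper's learned predictor is not described in this listing, and `extrapolate`, `fuse`, and `weight` are hypothetical names.

```python
def extrapolate(prev, last, t_prev, t_last, t_now):
    """Linearly extrapolate a feature vector from two past frames to time t_now."""
    if t_last == t_prev:
        raise ValueError("historical frames must have distinct timestamps")
    alpha = (t_now - t_last) / (t_last - t_prev)
    return [b + alpha * (b - a) for a, b in zip(prev, last)]


def fuse(predicted, observed, weight):
    """Blend prediction and observation; weight is the trust in the observation."""
    return [weight * o + (1.0 - weight) * p for p, o in zip(predicted, observed)]


# Assumed toy features sampled at t = 0 and t = 1, predicted for t = 2.
predicted = extrapolate([0.0, 2.0], [1.0, 4.0], t_prev=0, t_last=1, t_now=2)
combined = fuse(predicted, observed=[3.0, 6.0], weight=0.5)
```

In a real system the extrapolation would be replaced by a learned module operating on BEV feature maps, but the data flow (history in, current-time feature out, then fusion) is the same.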

[CV-48] Object-Focused Data Selection for Dense Prediction Tasks

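[Quick read]: This paper addresses how to select a representative subset of images for labeling from a large unlabeled pool under a constrained annotation budget, for dense prediction tasks such as object detection and semantic segmentation where high-quality pixel-level labels are costly. The key to the solution is Object-Focused Data Selection (OFDS), which uses object-level representations so that the selected subset semantically covers the target classes, including rare ones, and thus copes with imbalanced class distributions. Experiments show that OFDS clearly outperforms conventional methods based on image-level representations when class distributions are imbalanced, and that pre-training with autolabels on the full dataset before fine-tuning on the OFDS-selected subset further improves final performance.

Link: https://arxiv.org/abs/2412.10032
Authors: Niclas Popp, Dan Zhang, Jan Hendrik Metzen, Matthias Hein, Lukas Schott
Keywords: Dense prediction tasks, require high-quality labels, Dense prediction, segmentation require high-quality, pixel level
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: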

Abstract:Dense prediction tasks such as object detection and segmentation require high-quality labels at pixel level, which are costly to obtain. Recent advances in foundation models have enabled the generation of autolabels, which we find to be competitive but not yet sufficient to fully replace human annotations, especially for more complex datasets. Thus, we consider the challenge of selecting a representative subset of images for labeling from a large pool of unlabeled images under a constrained annotation budget. This task is further complicated by imbalanced class distributions, as rare classes are often underrepresented in selected subsets. We propose object-focused data selection (OFDS) which leverages object-level representations to ensure that the selected image subsets semantically cover the target classes, including rare ones. We validate OFDS on PASCAL VOC and Cityscapes for object detection and semantic segmentation tasks. Our experiments demonstrate that prior methods which employ image-level representations fail to consistently outperform random selection. In contrast, OFDS consistently achieves state-of-the-art performance with substantial improvements over all baselines in scenarios with imbalanced class distributions. Moreover, we demonstrate that pre-training with autolabels on the full datasets before fine-tuning on human-labeled subsets selected by OFDS further enhances the final performance.
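
A toy version of object-focused selection shows why looking at objects rather than whole images helps rare classes. The sketch assumes each unlabeled image already carries pseudo-class tags from an automatic detector (a simplification of the object-level representations in the abstract) and greedily picks the image that best serves the currently least-covered class. `select_images` is a hypothetical name.

```python
from collections import Counter


def select_images(image_objects, budget):
    """image_objects: dict of image id -> list of pseudo-class tags for its objects."""
    remaining = {img: list(objs) for img, objs in image_objects.items() if objs}
    covered = Counter()
    selected = []
    while remaining and len(selected) < budget:
        # Classes still obtainable from the unselected pool, in a fixed order.
        available = sorted({c for objs in remaining.values() for c in objs})
        rarest = min(available, key=lambda c: covered[c])
        candidates = sorted(img for img, objs in remaining.items() if rarest in objs)
        pick = max(candidates, key=lambda img: remaining[img].count(rarest))
        covered.update(remaining.pop(pick))
        selected.append(pick)
    return selected


# Assumed toy pool: "bike" is rare and appears in only one image.
pool = {"a": ["car", "car"], "b": ["car"], "c": ["car", "bike"], "d": ["car"]}
chosen = select_images(pool, budget=2)
```

With a budget of two, the image containing the rare class is picked first, whereas uniform random sampling would miss it in half of the draws.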

[CV-49] Enhancing Fine-Grained Vision-Language Pretraining with Negative Augmented Samples AAAI2025

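[Quick read]: This paper addresses the limited fine-grained understanding of existing Vision-Language Pretraining (VLP) models. Existing models rely mainly on the similarity of holistic features for cross-modal interaction and overlook subtle differences in how different modalities express features, so they fall short on tasks that demand detailed perception. The key to the solution is Negative Augmented Samples (NAS), which uses a Visual Dictionary (VD) as a semantic bridge and a VD-based Negative Visual Augmentation (NVA) method to generate negative samples that differ from positive samples only subtly, at the token level. This forces the model to discern small differences between positive and negative samples more precisely and so improves fine-grained vision-language understanding.

Link: https://arxiv.org/abs/2412.10029
Authors: Yeyuan Wang, Dehong Gao, Lei Yi, Linbo Jin, Jinxia Zhang, Libin Yang, Xiaoyan Cai
Keywords: achieved remarkable improvements, Existing Vision-Language Pretraining, Existing Vision-Language, confirming their effectiveness, achieved remarkable
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 15 pages, Accepted by AAAI2025, full paper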

Abstract:Existing Vision-Language Pretraining (VLP) methods have achieved remarkable improvements across a variety of vision-language tasks, confirming their effectiveness in capturing coarse-grained semantic correlations. However, their capability for fine-grained understanding, which is critical for many nuanced vision-language applications, remains limited. Prevailing VLP models often overlook the intricate distinctions in expressing different modal features and typically depend on the similarity of holistic features for cross-modal interactions. Moreover, these models directly align and integrate features from different modalities, focusing more on coarse-grained general representations, thus failing to capture the nuanced differences necessary for tasks demanding a more detailed perception. In response to these limitations, we introduce Negative Augmented Samples(NAS), a refined vision-language pretraining model that innovatively incorporates NAS to specifically address the challenge of fine-grained understanding. NAS utilizes a Visual Dictionary(VD) as a semantic bridge between visual and linguistic domains. Additionally, it employs a Negative Visual Augmentation(NVA) method based on the VD to generate challenging negative image samples. These samples deviate from positive samples exclusively at the token level, thereby necessitating that the model discerns the subtle disparities between positive and negative samples with greater precision. Comprehensive experiments validate the efficacy of NAS components and underscore its potential to enhance fine-grained vision-language comprehension.

[CV-50] Mr. DETR: Instructive Multi-Route Training for Detection Transformers

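[Quick read]: This paper addresses the training of detection transformers, specifically how an auxiliary one-to-many assignment can improve model performance. The key to the solution is to treat the model as a multi-task framework that performs one-to-one and one-to-many predictions simultaneously. The paper proposes a multi-route training mechanism with a primary route for one-to-one prediction and two auxiliary routes for one-to-many prediction. A novel instructive self-attention dynamically and flexibly guides object queries for one-to-many prediction, improving training. The auxiliary routes are removed at inference, so model architecture and inference cost are unaffected. Experiments show consistent improvements on multiple baselines.

Link: https://arxiv.org/abs/2412.10028
Authors: Chang-Bin Zhang, Yujie Zhong, Kai Han
Keywords: Existing methods enhance, Existing methods, Existing, detection transformers, training
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Tech. report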

Abstract:Existing methods enhance the training of detection transformers by incorporating an auxiliary one-to-many assignment. In this work, we treat the model as a multi-task framework, simultaneously performing one-to-one and one-to-many predictions. We investigate the roles of each component in the transformer decoder across these two training targets, including self-attention, cross-attention, and feed-forward network. Our empirical results demonstrate that any independent component in the decoder can effectively learn both targets simultaneously, even when other components are shared. This finding leads us to propose a multi-route training mechanism, featuring a primary route for one-to-one prediction and two auxiliary training routes for one-to-many prediction. We enhance the training mechanism with a novel instructive self-attention that dynamically and flexibly guides object queries for one-to-many prediction. The auxiliary routes are removed during inference, ensuring no impact on model architecture or inference cost. We conduct extensive experiments on various baselines, achieving consistent improvements as shown in Figure 1.
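
The difference between the two training targets can be shown on a tiny example. The sketch assumes a precomputed IoU table between object queries and ground-truth boxes. One-to-one assignment is found by brute force over permutations (real detectors use Hungarian matching on a combined classification and box cost), and one-to-many assignment simply takes the top-k queries per ground truth. Function names and the IoU values are hypothetical.

```python
from itertools import permutations


def one_to_one(iou):
    """iou[q][g]: returns {gt: query} with each query used at most once."""
    n_queries, n_gts = len(iou), len(iou[0])
    best, best_score = None, float("-inf")
    for queries in permutations(range(n_queries), n_gts):
        score = sum(iou[q][g] for g, q in enumerate(queries))
        if score > best_score:
            best, best_score = queries, score
    return {g: q for g, q in enumerate(best)}


def one_to_many(iou, k):
    """Returns {gt: [queries]} with the k best-overlapping queries per ground truth."""
    n_queries, n_gts = len(iou), len(iou[0])
    return {
        g: sorted(range(n_queries), key=lambda q: iou[q][g], reverse=True)[:k]
        for g in range(n_gts)
    }


# Assumed toy IoU table: 4 queries (rows) against 2 ground-truth boxes (columns).
iou = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.7], [0.2, 0.6]]
primary = one_to_one(iou)
auxiliary = one_to_many(iou, k=2)
```

The one-to-one target supervises a single query per object, which is what allows NMS-free inference, while the one-to-many target gives extra positive queries a training signal. That is consistent with the abstract's choice to keep the one-to-many routes for training only.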

[CV-51] NeRF-Texture: Synthesizing Neural Radiance Field Textures

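[Quick read]: This paper addresses the difficulty existing computer graphics methods have in synthesizing textures that contain meso-structure in 3D geometry space, such as grass, leaves, and fabrics. The key to the solution is a new texture synthesis method based on Neural Radiance Fields (NeRF) that captures and synthesizes textures from multi-view images. Specifically, a scene with fine geometric details is disentangled into meso-structure textures and an underlying base shape. The textures are embedded as latent features on the base shape and fed to a NeRF decoder trained simultaneously to represent the rich view-dependent appearance. To improve synthesis quality, the paper regularizes the distribution of latent features with a clustering constraint, which improves matching performance. The method generates NeRF textures over planar domains and also over curved surfaces, which is practically useful.

Link: https://arxiv.org/abs/2412.10004
Authors: Yi-Hua Huang, Yan-Pei Cao, Yu-Kun Lai, Ying Shan, Lin Gao
Keywords: Neural Radiance Fields, textures, benefit various applications, fundamental problem, problem in computer
Categories: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
Comments: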

Abstract:Texture synthesis is a fundamental problem in computer graphics that would benefit various applications. Existing methods are effective in handling 2D image textures. In contrast, many real-world textures contain meso-structure in the 3D geometry space, such as grass, leaves, and fabrics, which cannot be effectively modeled using only 2D image textures. We propose a novel texture synthesis method with Neural Radiance Fields (NeRF) to capture and synthesize textures from given multi-view images. In the proposed NeRF texture representation, a scene with fine geometric details is disentangled into the meso-structure textures and the underlying base shape. This allows textures with meso-structure to be effectively learned as latent features situated on the base shape, which are fed into a NeRF decoder trained simultaneously to represent the rich view-dependent appearance. Using this implicit representation, we can synthesize NeRF-based textures through patch matching of latent features. However, inconsistencies between the metrics of the reconstructed content space and the latent feature space may compromise the synthesis quality. To enhance matching performance, we further regularize the distribution of latent features by incorporating a clustering constraint. In addition to generating NeRF textures over a planar domain, our method can also synthesize NeRF textures over curved surfaces, which are practically useful. Experimental results and evaluations demonstrate the effectiveness of our approach.

[CV-52] NowYouSee Me: Context-Aware Automatic Audio Description

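[Quick read]: This paper addresses automatic Audio Description (AD) of multimedia content for visually impaired audiences. The key is CA³D, a unified context-aware automatic audio description system built from three modules: 1) a Temporal Feature Enhancement Module that efficiently captures long-term dependencies; 2) an anchor-based AD event detector with a feature suppression module that localizes AD events and extracts discriminative features for AD generation; 3) a self-refinement module that uses the generated output to adjust AD event boundaries from coarse to fine. Unlike conventional methods that rely on metadata and ground-truth AD timestamps, CA³D is the first end-to-end trainable system that uses only visual cues. It markedly improves AD event detection and script generation and sets a new state of the art.

Link: https://arxiv.org/abs/2412.10002
Authors: Seon-Ho Lee, Jue Wang, David Fan, Zhikang Zhang, Linda Liu, Xiang Hao, Vimal Bhat, Xinyu Li
Keywords: Automatic Audio Description, visually impaired audiences, Audio Description system, Audio Description, application system aimed
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 10 pages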

Abstract:Audio Description (AD) plays a pivotal role as an application system aimed at guaranteeing accessibility in multimedia content, which provides additional narrations at suitable intervals to describe visual elements, catering specifically to the needs of visually impaired audiences. In this paper, we introduce CA³D, the pioneering unified Context-Aware Automatic Audio Description system that provides AD event scripts with precise locations in the long cinematic content. Specifically, CA³D system consists of: 1) a Temporal Feature Enhancement Module to efficiently capture longer term dependencies, 2) an anchor-based AD event detector with feature suppression module that localizes the AD events and extracts discriminative feature for AD generation, and 3) a self-refinement module that leverages the generated output to tweak AD event boundaries from coarse to fine. Unlike conventional methods which rely on metadata and ground truth AD timestamp for AD detection and generation tasks, the proposed CA³D is the first end-to-end trainable system that only uses visual cue. Extensive experiments demonstrate that the proposed CA³D improves existing architectures for both AD event detection and script generation metrics, establishing the new state-of-the-art performances in the AD automation.

[CV-53] GT23D-Bench: A Comprehensive General Text-to-3D Generation Benchmark

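[Quick Read]: This paper addresses the lack of a systematic evaluation benchmark for General Text-to-3D (GT23D) generation. Objaverse, the largest existing 3D dataset, suffers from missing annotations, disorganization, and low quality, while existing metrics evaluate only text-image alignment and ignore 3D quality. The key to the solution is GT23D-Bench, a comprehensive GT23D benchmark comprising: 1) a high-fidelity, high-quality dataset of 400k 3D objects that has been systematically annotated, organized, and filtered; and 2) a 3D-aware evaluation suite of 10 clearly defined metrics covering text-3D alignment and multiple dimensions of 3D visual quality. The benchmark offers multimodal annotations, holistic evaluation dimensions, and an in-depth analysis of current baselines. Experiments show that its annotations and metrics align closely with human preferences.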

Link: https://arxiv.org/abs/2412.09997
Authors: Sitong Su,Xiao Cai,Lianli Gao,Pengpeng Zeng,Qinhong Du,Mengqi Li,Heng Tao Shen,Jingkuan Song
Keywords: Recent advances, advances in General, Recent, General, metrics
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Recent advances in General Text-to-3D (GT23D) have been significant. However, the lack of a benchmark has hindered systematic evaluation and progress due to issues in datasets and metrics: 1) The largest 3D dataset, Objaverse, suffers from omitted annotations, disorganization, and low quality. 2) Existing metrics only evaluate text-image alignment without considering 3D-level quality. To this end, we are the first to present a comprehensive benchmark for GT23D, called GT23D-Bench, consisting of: 1) a 400k high-fidelity and well-organized 3D dataset that addresses the issues in Objaverse through a systematic annotate-organize-filter pipeline; and 2) comprehensive 3D-aware evaluation metrics, which encompass 10 clearly defined metrics that thoroughly account for the multiple dimensions of GT23D. Notably, GT23D-Bench features three properties: 1) Multimodal Annotations. Our dataset annotates each 3D object with 64-view depth maps, normal maps, rendered images, and coarse-to-fine captions. 2) Holistic Evaluation Dimensions. Our metrics are divided into a) Textual-3D Alignment, which measures textual alignment with multi-granularity visual 3D representations; and b) 3D Visual Quality, which considers texture fidelity, multi-view consistency, and geometry correctness. 3) Valuable Insights. We delve into the performance of current GT23D baselines across different evaluation dimensions and provide insightful analysis. Extensive experiments demonstrate that our annotations and metrics are aligned with human preferences.

[CV-54] Visual Object Tracking across Diverse Data Modalities: A Review

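[Quick Read]: This paper surveys recent progress in Visual Object Tracking (VOT), with an emphasis on deep learning methods for single-modal and multi-modal VOT. It systematically reviews three mainstream single-modal settings (RGB, thermal infrared, and point cloud tracking) together with their frameworks, and summarizes four multi-modal settings (RGB-Depth, RGB-Thermal, RGB-LiDAR, and RGB-Language). It also compares results across a large number of VOT benchmarks to analyze performance in each modality, and offers recommendations and insights for future research.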

Link: https://arxiv.org/abs/2412.09991
Authors: Mengmeng Wang,Teli Ma,Shuo Xin,Xiaojun Hou,Jiazheng Xing,Guang Dai,Jingdong Wang,Yong Liu
Keywords: Visual Object Tracking, track specific targets, Visual Object, significant research area, target objects
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:Visual Object Tracking (VOT) is an attractive and significant research area in computer vision, which aims to recognize and track specific targets in video sequences where the target objects are arbitrary and class-agnostic. VOT technology can be applied in various scenarios, processing data of diverse modalities such as RGB, thermal infrared, and point cloud. Moreover, since no single sensor can handle all dynamic and varying environments, multi-modal VOT is also investigated. This paper presents a comprehensive survey of recent progress in both single-modal and multi-modal VOT, especially deep learning methods. Specifically, we first review three types of mainstream single-modal VOT, including RGB, thermal infrared, and point cloud tracking. In particular, we summarize four widely used single-modal frameworks, abstracting their schemas and categorizing the methods derived from them. Then we summarize four kinds of multi-modal VOT, including RGB-Depth, RGB-Thermal, RGB-LiDAR, and RGB-Language. Moreover, comparison results on many VOT benchmarks for the discussed modalities are presented. Finally, we provide recommendations and insightful observations to inspire the future development of this fast-growing literature.

[CV-55] SplineGS: Robust Motion-Adaptive Spline for Real-Time Dynamic 3D Gaussians from Monocular Video

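[Quick Read]: This paper tackles novel view synthesis for dynamic scenes from monocular video, where the main difficulties are scene dynamics and the lack of multi-view cues. The key to the solution is SplineGS, a dynamic 3D Gaussian Splatting (3DGS) framework that needs no COLMAP preprocessing and delivers high-quality reconstruction with fast rendering. Its core innovation is the Motion-Adaptive Spline (MAS), which represents continuous dynamic 3D Gaussian trajectories with cubic Hermite splines over a small number of control points. A Motion-Adaptive Control points Pruning (MACP) method progressively prunes control points to handle deformation under varying motion while preserving the integrity of the dynamic model. The paper also adopts a joint optimization strategy that uses photometric and geometric consistency to estimate camera parameters and 3D Gaussian attributes, which improves robustness in real-world conditions.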

Link: https://arxiv.org/abs/2412.09982
Authors: Jongmin Park,Minh-Quan Viet Bui,Juan Luis Gonzalez Bello,Jaeho Moon,Jihyong Oh,Munchurl Kim
Keywords: monocular videos, Synthesizing novel views, multi-view cues, Gaussian Splatting, challenging due
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: The first two authors contributed equally to this work (equal contribution). The last two authors advised equally to this work. Please visit our project page at this https URL

Abstract:Synthesizing novel views from in-the-wild monocular videos is challenging due to scene dynamics and the lack of multi-view cues. To address this, we propose SplineGS, a COLMAP-free dynamic 3D Gaussian Splatting (3DGS) framework for high-quality reconstruction and fast rendering from monocular videos. At its core is a novel Motion-Adaptive Spline (MAS) method, which represents continuous dynamic 3D Gaussian trajectories using cubic Hermite splines with a small number of control points. For MAS, we introduce a Motion-Adaptive Control points Pruning (MACP) method to model the deformation of each dynamic 3D Gaussian across varying motions, progressively pruning control points while maintaining dynamic modeling integrity. Additionally, we present a joint optimization strategy for camera parameter estimation and 3D Gaussian attributes, leveraging photometric and geometric consistency. This eliminates the need for Structure-from-Motion preprocessing and enhances SplineGS’s robustness in real-world conditions. Experiments show that SplineGS significantly outperforms state-of-the-art methods in novel view synthesis quality for dynamic scenes from monocular videos, while rendering thousands of times faster.
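
To make the spline representation concrete, here is a minimal sketch of evaluating a cubic Hermite spline through a handful of 3D control points. It is not the SplineGS implementation: the uniform knot spacing over [0, 1], the finite-difference tangents, and the helper name `hermite_trajectory` are all assumptions made for illustration.

```python
def hermite_trajectory(control_points, t):
    """Evaluate a 3D cubic Hermite spline at normalized time t in [0, 1].

    control_points: list of (x, y, z) tuples, at least two, assumed to be
    uniformly spaced in time (an assumption of this sketch).
    """
    n = len(control_points)
    if n < 2:
        raise ValueError("need at least two control points")
    t = min(max(t, 0.0), 1.0)
    # Locate the segment that contains t and the local parameter u in [0, 1].
    scaled = t * (n - 1)
    i = min(int(scaled), n - 2)
    u = scaled - i

    def tangent(k):
        # Finite-difference tangent; one-sided at the two endpoints.
        lo, hi = max(k - 1, 0), min(k + 1, n - 1)
        return [(b - a) / (hi - lo)
                for a, b in zip(control_points[lo], control_points[hi])]

    p0, p1 = control_points[i], control_points[i + 1]
    m0, m1 = tangent(i), tangent(i + 1)
    # Standard cubic Hermite basis functions.
    h00 = 2 * u**3 - 3 * u**2 + 1
    h10 = u**3 - 2 * u**2 + u
    h01 = -2 * u**3 + 3 * u**2
    h11 = u**3 - u**2
    return tuple(
        h00 * a + h10 * ma + h01 * b + h11 * mb
        for a, ma, b, mb in zip(p0, m0, p1, m1)
    )
```

The curve passes through every control point, so a trajectory stored as a few control points can be queried at any continuous timestamp. Pruning a control point in this sketch would simply mean removing an entry from `control_points`.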

[CV-56] SUMI-IFL: An Information-Theoretic Framework for Image Forgery Localization with Sufficiency and Minimality Constraints

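[Quick Read]: This paper addresses incomplete and inaccurate extraction of forgery clues in Image Forgery Localization (IFL), a problem aggravated by the rapid development of image tampering techniques. The key to the solution is SUMI-IFL, a new information-theoretic framework that imposes sufficiency-view and minimality-view constraints on the forgery feature representation. First, grounded in a mutual information analysis, the sufficiency-view constraint ensures that the latent forgery feature contains comprehensive forgery clues; the latent feature is built by integrating several individual forgery features from multiple perspectives. Second, based on information bottleneck theory, the minimality-view constraint is applied to the feature reasoning network to obtain an accurate and concise forgery feature representation that resists interference from task-unrelated features. Experiments show that SUMI-IFL outperforms existing state-of-the-art methods in both in-dataset and cross-dataset comparisons.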

Link: https://arxiv.org/abs/2412.09981
Authors: Ziqi Sheng,Wei Lu,Xiangyang Luo,Jiantao Zhou,Xiaochun Cao
Keywords: protecting social safety, preventing tampered image, tampered image misuse, Image forgery localization, forgery
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:Image forgery localization (IFL) is a crucial technique for preventing tampered image misuse and protecting social safety. However, due to the rapid development of image tampering technologies, extracting more comprehensive and accurate forgery clues remains an urgent challenge. To address these challenges, we introduce a novel information-theoretic IFL framework named SUMI-IFL that imposes sufficiency-view and minimality-view constraints on forgery feature representation. First, grounded in the theoretical analysis of mutual information, the sufficiency-view constraint is enforced on the feature extraction network to ensure that the latent forgery feature contains comprehensive forgery clues. Considering that forgery clues obtained from a single aspect alone may be incomplete, we construct the latent forgery feature by integrating several individual forgery features from multiple perspectives. Second, based on the information bottleneck, the minimality-view constraint is imposed on the feature reasoning network to achieve an accurate and concise forgery feature representation that counters the interference of task-unrelated features. Extensive experiments show the superior performance of SUMI-IFL to existing state-of-the-art methods, not only on in-dataset comparisons but also on cross-dataset comparisons.
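
The two constraints can be written down in textbook information-theoretic form. The block below is only a sketch under assumed notation: $X$ is the input image, $Y$ the forgery label (mask), $Z$ the latent forgery feature, and $\beta > 0$ a trade-off weight. None of these symbols come from the abstract, and the paper's actual objectives may differ.

```latex
% Sufficiency: the feature Z keeps all label-relevant information in X.
I(Z; Y) = I(X; Y)

% Minimality: among sufficient features, keep as little of X as possible.
\min_{Z} \; I(X; Z) \quad \text{subject to} \quad I(Z; Y) = I(X; Y)

% Standard information bottleneck Lagrangian combining both goals.
\min_{p(z \mid x)} \; I(X; Z) - \beta \, I(Z; Y)
```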

[CV-57] EP-CFG: Energy-Preserving Classifier-Free Guidance

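[Quick Read]: This paper addresses the over-contrast and over-saturation artifacts that Classifier-Free Guidance (CFG) introduces in diffusion models at high guidance strengths. The key to the solution is Energy-Preserving Classifier-Free Guidance (EP-CFG), which preserves the energy distribution of the conditional prediction during guidance. Concretely, at each denoising step EP-CFG rescales the energy of the guided output to match that of the conditional prediction, with an optional robust variant for stronger artifact suppression. The method maintains natural image quality and preserves details while retaining CFG's semantic alignment benefits, all at minimal computational cost.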

Link: https://arxiv.org/abs/2412.09966
Authors: Kai Zhang,Fujun Luan,Sai Bi,Jianming Zhang
Keywords: Energy-Preserving Classifier-Free Guidance, higher guidance strengths, Classifier-free guidance, diffusion models, introduces over-contrast
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:Classifier-free guidance (CFG) is widely used in diffusion models but often introduces over-contrast and over-saturation artifacts at higher guidance strengths. We present EP-CFG (Energy-Preserving Classifier-Free Guidance), which addresses these issues by preserving the energy distribution of the conditional prediction during the guidance process. Our method simply rescales the energy of the guided output to match that of the conditional prediction at each denoising step, with an optional robust variant for improved artifact suppression. Through experiments, we show that EP-CFG maintains natural image quality and preserves details across guidance strengths while retaining CFG’s semantic alignment benefits, all with minimal computational overhead.
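
The rescaling step described in the abstract can be sketched in a few lines. This is an illustrative reading rather than the authors' code: "energy" is taken to be the sum of squared values, predictions are flat lists of floats, and the function name `ep_cfg` is hypothetical. The optional robust variant mentioned in the abstract is not covered.

```python
import math

def ep_cfg(cond, uncond, guidance_scale):
    """Classifier-free guidance followed by an energy-matching rescale.

    cond / uncond: flat lists of floats holding the conditional and
    unconditional predictions for one denoising step.
    """
    # Standard CFG: extrapolate from the unconditional prediction
    # toward the conditional one.
    guided = [u + guidance_scale * (c - u) for c, u in zip(cond, uncond)]
    # "Energy" is assumed here to be the sum of squares.
    energy_cond = sum(c * c for c in cond)
    energy_guided = sum(g * g for g in guided)
    if energy_guided == 0.0:
        return guided
    # Rescale so the guided output has the same energy as the
    # conditional prediction.
    scale = math.sqrt(energy_cond / energy_guided)
    return [g * scale for g in guided]
```

With `guidance_scale = 1.0` the guided output equals the conditional prediction and the rescale is a no-op; at larger scales the direction of the guided output is kept while its magnitude is pulled back to that of the conditional prediction.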

[CV-58] END²: Robust Dual-Decoder Watermarking Framework Against Non-Differentiable Distortions

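[Quick Read]: This paper addresses the difficulty of training deep neural network (DNN) watermarking methods end to end when real-world distortions are non-differentiable. The key to the solution is a novel dual-decoder architecture (END²) with two structurally identical decoders: a Teacher Decoder that processes clean watermarked images and a Student Decoder that processes distorted images. Gradients are backpropagated only through the Teacher Decoder branch to optimize the encoder, which bypasses the non-differentiable distortion. Resistance to arbitrary distortions is enforced by maximizing the cosine similarity between the two decoders' intermediate feature vectors on a hypersphere. Experiments show that the method outperforms existing algorithms under various non-differentiable distortions and still surpasses baselines that use a differentiable noise layer.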

Link: https://arxiv.org/abs/2412.09960
Authors: Nan Sun,Han Fang,Yuxing Lu,Chengxin Zhao,Hefei Ling
Keywords: DNN-based watermarking methods, Encoder-Noise Layer-Decoder, DNN-based watermarking, rapidly advanced, Teacher Decoder
Categories: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
Comments: 9 pages, 3 figures

Abstract:DNN-based watermarking methods have rapidly advanced, with the “Encoder-Noise Layer-Decoder” (END) framework being the most widely used. To ensure end-to-end training, the noise layer in the framework must be differentiable. However, real-world distortions are often non-differentiable, leading to challenges in end-to-end training. Existing solutions only treat the distortion perturbation as additive noise, which does not fully integrate the effect of distortion in training. To better incorporate non-differentiable distortions into training, we propose a novel dual-decoder architecture (END²). Unlike the conventional END architecture, our method employs two structurally identical decoders: the Teacher Decoder, processing pure watermarked images, and the Student Decoder, handling distortion-perturbed images. The gradient is backpropagated only through the Teacher Decoder branch to optimize the encoder, thus bypassing the problem of non-differentiability. To ensure resistance to arbitrary distortions, we enforce alignment of the two decoders’ feature representations by maximizing the cosine similarity between their intermediate vectors on a hypersphere. Extensive experiments demonstrate that our scheme outperforms state-of-the-art algorithms under various non-differentiable distortions. Moreover, even without the differentiability constraint, our method surpasses baselines with a differentiable noise layer. Our approach is effective and easily implementable across all END architectures, enhancing practicality and generalizability.
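
The feature-alignment term lends itself to a short sketch. The code below assumes the intermediate features of each decoder are available as plain vectors and expresses "maximize cosine similarity" as minimizing `1 - cos`. The names `cosine_similarity` and `alignment_loss` are hypothetical, and the paper's actual loss may be weighted or combined differently.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("a zero vector has no direction")
    return dot / (norm_a * norm_b)

def alignment_loss(teacher_feat, student_feat):
    """Loss that is 0 when the two feature vectors point the same way.

    Minimizing 1 - cos is equivalent to maximizing cosine similarity,
    which compares directions only (i.e. points on a unit hypersphere).
    """
    return 1.0 - cosine_similarity(teacher_feat, student_feat)
```

Because only direction matters, a student feature that is a scaled copy of the teacher feature incurs zero loss, while orthogonal features incur a loss of 1.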

[CV-59] Efficient Dataset Distillation via Diffusion-Driven Patch Selection for Improved Generalization

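[Quick Read]: This paper addresses the oversized optimization space and limited practicality of Dataset Distillation on large datasets and complex deep networks (e.g., ImageNet-1K with ResNet-101). The key to the solution is a new framework, orthogonal to existing diffusion-based distillation methods, that uses a diffusion model for selection rather than generation. Specifically, the method predicts the noise generated by the diffusion model from input images and text prompts (with or without label text), computes the corresponding loss for each pair, and uses the loss differences to identify distinctive regions of the original images. Intra-class clustering and ranking are then applied to keep the selected regions diverse, which enables a single-step distillation process. The method simplifies the pipeline and outperforms current state-of-the-art methods across multiple metrics.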

Link: https://arxiv.org/abs/2412.09959
Authors: Xinhao Zhong,Shuoyang Sun,Xulin Gu,Zhaoyang Xu,Yaowei Wang,Jianlong Wu,Bin Chen
Keywords: offers an efficient, reduce memory, memory and computational, computational costs, costs by optimizing
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: arXiv admin note: text overlap with arXiv:2408.02752 by other authors

Abstract:Dataset distillation offers an efficient way to reduce memory and computational costs by optimizing a smaller dataset with performance comparable to the full-scale original. However, for large datasets and complex deep networks (e.g., ImageNet-1K with ResNet-101), the extensive optimization space limits performance, reducing its practicality. Recent approaches employ pre-trained diffusion models to generate informative images directly, avoiding pixel-level optimization and achieving notable results. However, these methods often face challenges due to distribution shifts between pre-trained models and target datasets, along with the need for multiple distillation steps across varying settings. To address these issues, we propose a novel framework orthogonal to existing diffusion-based distillation methods, leveraging diffusion models for selection rather than generation. Our method starts by predicting noise generated by the diffusion model based on input images and text prompts (with or without label text), then calculates the corresponding loss for each pair. With the loss differences, we identify distinctive regions of the original images. Additionally, we perform intra-class clustering and ranking on selected patches to maintain diversity constraints. This streamlined framework enables a single-step distillation process, and extensive experiments demonstrate that our approach outperforms state-of-the-art methods across various metrics.
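
The selection step can be illustrated with a small ranking routine. This is a loose sketch, not the paper's procedure: it assumes that per-patch denoising losses have already been computed with and without the label text, and that a larger loss drop when the label is supplied marks a more distinctive patch. Both the direction of the difference and the name `select_distinctive_patches` are assumptions.

```python
def select_distinctive_patches(loss_without_label, loss_with_label, top_k):
    """Rank patches by how much the label text lowers the denoising loss.

    Both arguments are per-patch losses (lists of floats, equal length).
    Returns the indices of the top_k patches with the largest loss drop.
    """
    if len(loss_without_label) != len(loss_with_label):
        raise ValueError("loss lists must have the same length")
    # Loss drop per patch when the label text is added to the prompt.
    drops = [wo - w for wo, w in zip(loss_without_label, loss_with_label)]
    # Sort patch indices from largest to smallest drop.
    order = sorted(range(len(drops)), key=lambda i: drops[i], reverse=True)
    return order[:top_k]
```

The intra-class clustering and ranking that the abstract describes would operate on the patches returned here; that stage is not sketched.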

[CV-60] A²RNet: Adversarial Attack Resilient Network for Robust Infrared and Visible Image Fusion AAAI

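[Quick Read]: This paper addresses how adversarial attacks affect the robustness of infrared and visible image fusion (IVIF). The key to the solution is a new adversarial attack resilient network, A²RNet, which introduces an adversarial paradigm with an anti-attack loss function to simulate adversarial attacks and train against them. Built on the intrinsic nature of IVIF, the method uses a U-Net architecture with a transformer-based defensive refinement module (DRM) to keep fused image quality high under adversarial perturbations. Compared with previous methods, it mitigates the adverse effects of adversarial perturbations and maintains downstream task performance under attack.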

Link: https://arxiv.org/abs/2412.09954
Authors: Jiawei Li,Hongwei Yu,Jiansheng Chen,Xinlong Ding,Jinlong Wang,Jinyuan Liu,Bochao Zou,Huimin Ma
Keywords: integrating unique information, Infrared and visible, enhancing visual performance, crucial technique, technique for enhancing
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 9 pages, 8 figures, The 39th Annual AAAI Conference on Artificial Intelligence

Abstract:Infrared and visible image fusion (IVIF) is a crucial technique for enhancing visual performance by integrating unique information from different modalities into one fused image. Existing methods pay more attention to conducting fusion with undisturbed data, while overlooking the impact of deliberate interference on the effectiveness of fusion results. To investigate the robustness of fusion models, in this paper, we propose a novel adversarial attack resilient network, called A²RNet. Specifically, we develop an adversarial paradigm with an anti-attack loss function to implement adversarial attacks and training. It is constructed based on the intrinsic nature of IVIF and provides a robust foundation for future research advancements. We adopt a U-Net as the pipeline with a transformer-based defensive refinement module (DRM) under this paradigm, which guarantees fused image quality in a robust coarse-to-fine manner. Compared to previous works, our method mitigates the adverse effects of adversarial perturbations, consistently maintaining high-fidelity fusion results. Furthermore, the performance of downstream tasks can also be well maintained under adversarial attacks. Code is available at this https URL.

[CV-61] WiseAD: Knowledge Augmented End-to-End Autonomous Driving with Vision-Language Model

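[Quick Read]: This paper investigates how the depth and breadth of fundamental driving knowledge affect closed-loop autonomous driving performance, in particular trajectory planning. The key to the solution is WiseAD, a vision-language model (VLM) designed specifically for end-to-end autonomous driving, capable of driving reasoning, action justification, object recognition, risk analysis, driving suggestions, and trajectory planning across diverse scenarios. Joint training on driving knowledge and planning datasets lets WiseAD perform knowledge-aligned trajectory planning, which notably reduces critical accidents and improves the driving score and route completion in Carla closed-loop evaluations, reaching state-of-the-art performance.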

Link: https://arxiv.org/abs/2412.09951
Authors: Songyan Zhang,Wenhui Huang,Zihui Gao,Hao Chen,Chen Lv
Keywords: rapidly progressed vision-language, driven increasing interest, impressive logical reasoning, logical reasoning capacity, general human knowledge
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:The emergence of general human knowledge and impressive logical reasoning capacity in rapidly progressing vision-language models (VLMs) has driven increasing interest in applying VLMs to high-level autonomous driving tasks, such as scene understanding and decision-making. However, the relationship between knowledge proficiency, especially essential driving expertise, and closed-loop autonomous driving performance requires further exploration. In this paper, we investigate the effects of the depth and breadth of fundamental driving knowledge on closed-loop trajectory planning and introduce WiseAD, a specialized VLM tailored for end-to-end autonomous driving capable of driving reasoning, action justification, object recognition, risk analysis, driving suggestions, and trajectory planning across diverse scenarios. We employ joint training on driving knowledge and planning datasets, enabling the model to perform knowledge-aligned trajectory planning accordingly. Extensive experiments indicate that as the diversity of driving knowledge extends, critical accidents are notably reduced, contributing 11.9% and 12.4% improvements in the driving score and route completion on the Carla closed-loop evaluations, achieving state-of-the-art performance. Moreover, WiseAD also demonstrates remarkable performance in knowledge evaluations on both in-domain and out-of-domain datasets.

[CV-62] Going Beyond Feature Similarity: Effective Dataset distillation based on Class-aware Conditional Mutual Information

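[Quick Read]: This paper addresses a weakness of existing Dataset Distillation (DD) methods: because they compress a large amount of information through feature-similarity metrics such as Distribution Matching (DM), the resulting synthetic datasets are overly complex and hard to learn from. The key to the solution is to use Conditional Mutual Information (CMI) to assess the class-aware complexity of a dataset and to constrain the complexity of the synthetic dataset by minimizing CMI. Specifically, while minimizing the distillation loss, the method also minimizes the empirical CMI computed in the feature space of pre-trained networks, which controls the class-aware complexity of the synthetic dataset and improves both the performance and the training efficiency of distillation methods.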

Link: https://arxiv.org/abs/2412.09945
Authors: Xinhao Zhong,Bin Chen,Hao Fang,Xulin Gu,Shu-Tao Xia,En-Hui Yang
Keywords: memory consumption needed, full real dataset, deep neural networks, smaller synthetic dataset, training deep neural
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Dataset distillation (DD) aims to minimize the time and memory consumption needed for training deep neural networks on large datasets, by creating a smaller synthetic dataset that has similar performance to that of the full real dataset. However, current dataset distillation methods often result in synthetic datasets that are excessively difficult for networks to learn from, due to the compression of a substantial amount of information from the original data through metrics measuring feature similarity, e.g., distribution matching (DM). In this work, we introduce conditional mutual information (CMI) to assess the class-aware complexity of a dataset and propose a novel method based on minimizing CMI. Specifically, we minimize the distillation loss while simultaneously constraining the class-aware complexity of the synthetic dataset by minimizing its empirical CMI in the feature space of pre-trained networks. Through a thorough set of experiments, we show that our method can serve as a general regularization method for existing DD methods and improve their performance and training efficiency.
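
One common way to estimate conditional mutual information from a classifier's outputs is to average, over all samples, the KL divergence between each sample's predicted distribution and the mean predicted distribution of its class. The sketch below implements that estimator. Whether the paper uses exactly this form is an assumption, and `empirical_cmi` is a hypothetical name.

```python
import math
from collections import defaultdict

def empirical_cmi(probs, labels):
    """Estimate I(X; Y_hat | Y) from per-sample predicted distributions.

    probs: list of probability vectors (each sums to 1), one per sample.
    labels: list of class labels, same length as probs.
    """
    by_class = defaultdict(list)
    for p, y in zip(probs, labels):
        by_class[y].append(p)

    total = 0.0
    for members in by_class.values():
        k = len(members[0])
        # Class centroid: the mean predicted distribution for this class.
        centroid = [sum(p[j] for p in members) / len(members) for j in range(k)]
        for p in members:
            # KL(p || centroid); entries with p[j] == 0 contribute nothing.
            total += sum(
                p[j] * math.log(p[j] / centroid[j]) for j in range(k) if p[j] > 0
            )
    return total / len(probs)
```

The estimate is zero when all samples of a class produce identical predictions and grows as predictions within a class spread out, which is the sense in which it measures class-aware complexity in this sketch.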

[CV-63] Pixel Intensity Tracking for Remote Respiratory Monitoring: A Study on Indonesian Subject

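[Quick Read]: This paper addresses the discomfort and limitations of traditional contact-based respiratory rate measurement by proposing a non-contact method based on RGB camera images. The key to the solution is the Pixel Intensity Changes (PIC) technique; experiments tune the bounding box size, filter option, and corner detection algorithm, with tracking performed by the Lucas-Kanade algorithm. In static conditions, the medium bounding box with the Sobel filter and Harris corner detection performed best. In dynamic conditions, the best results came from the large bounding box with no filter and Shi-Tomasi corner detection, and from the medium bounding box with no filter and Harris corner detection.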

Link: https://arxiv.org/abs/2412.09938
Authors: Muhammad Yahya Ayyashy Mujahidan,Martin Clinton Tosima Manullang
Keywords: vital sign indicating, Medium Bounding box, Respiratory rate, vital sign, sign indicating
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Respiratory rate is a vital sign indicating various health conditions. Traditional contact-based measurement methods are often uncomfortable, and alternatives like respiratory belts and smartwatches have limitations in cost and operability. Therefore, a non-contact method based on Pixel Intensity Changes (PIC) with RGB camera images is proposed. Experiments involved 3 sizes of bounding boxes, 3 filter options (Laplacian, Sobel, and no filter), and 2 corner detection algorithms (Shi-Tomasi and Harris), with tracking using the Lucas-Kanade algorithm. Eighteen configurations were tested on 67 subjects in static and dynamic conditions. The best results in static conditions were achieved with the Medium Bounding box, Sobel Filter, and Harris Method (MAE: 0.85, RMSE: 1.49). In dynamic conditions, the Large Bounding box with no filter and Shi-Tomasi, and the Medium Bounding box with no filter and Harris, produced the lowest MAE (0.81) and RMSE (1.35).
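
The experimental grid and the two error metrics are easy to spell out. The sketch below enumerates the 3 × 3 × 2 = 18 configurations and defines MAE and RMSE. The option label "Small" is an assumed name for the third bounding box size (the abstract names only Medium and Large), and any respiratory rates passed to the metric functions would be made-up illustration values rather than data from the study.

```python
import itertools
import math

# Option names follow the abstract; "Small" is an assumed label.
BOX_SIZES = ["Small", "Medium", "Large"]
FILTERS = ["Laplacian", "Sobel", "None"]
CORNER_DETECTORS = ["Shi-Tomasi", "Harris"]

def all_configurations():
    """Enumerate every (box size, filter, corner detector) combination."""
    return list(itertools.product(BOX_SIZES, FILTERS, CORNER_DETECTORS))

def mae(predicted, reference):
    """Mean absolute error between predicted and reference rates."""
    return sum(abs(p - r) for p, r in zip(predicted, reference)) / len(reference)

def rmse(predicted, reference):
    """Root mean squared error between predicted and reference rates."""
    squared = sum((p - r) ** 2 for p, r in zip(predicted, reference))
    return math.sqrt(squared / len(reference))
```

RMSE penalizes large individual errors more heavily than MAE, which is why a configuration can have a low MAE but a noticeably higher RMSE, as in the figures the abstract reports.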

[CV-64] CaLoRAify: Calorie Estimation with Visual-Text Pairing and LoRA-Driven Visual Language Models

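[Quick Read]: This paper addresses the limited real-world practicality of traditional calorie estimation tools, which depend on specific data formats or complex pipelines. The key to the solution is to use Vision-Language Models (VLMs) for ingredient recognition and calorie estimation, trained on CalData, a dataset of 330K image-text pairs that combines a large-scale recipe dataset with detailed nutritional information. The proposed CaLoRAify framework applies Low-rank Adaptation (LoRA) and Retrieval-augmented Generation (RAG) to improve the performance of foundational VLMs in the calorie estimation domain, so that users need only a single food image to estimate calories while keeping the flexibility of agent-based conversational interaction.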

Link: https://arxiv.org/abs/2412.09936
Authors: Dongyu Yao,Keling Yao,Junhong Zhou,Yinghao Zhang
Keywords: chronic diseases worldwide, preventable chronic diseases, obesity phenomenon, heavy issue, diseases worldwide
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Disclaimer: This work is part of a course project and reflects ongoing exploration in the field of vision-language models and calorie estimation. Findings and conclusions are subject to further validation and refinement

Abstract:The obesity phenomenon, known as the heavy issue, is a leading cause of preventable chronic diseases worldwide. Traditional calorie estimation tools often rely on specific data formats or complex pipelines, limiting their practicality in real-world scenarios. Recently, vision-language models (VLMs) have excelled in understanding real-world contexts and enabling conversational interactions, making them ideal for downstream tasks such as ingredient analysis. However, applying VLMs to calorie estimation requires domain-specific data and alignment strategies. To this end, we curated CalData, a 330K image-text pair dataset tailored for ingredient recognition and calorie estimation, combining a large-scale recipe dataset with detailed nutritional instructions for robust vision-language training. Built upon this dataset, we present CaLoRAify, a novel VLM framework aligning ingredient recognition and calorie estimation via training with visual-text pairs. During inference, users only need a single monocular food image to estimate calories while retaining the flexibility of agent-based conversational interaction. With Low-rank Adaptation (LoRA) and Retrieval-augmented Generation (RAG) techniques, our system enhances the performance of foundational VLMs in the vertical domain of calorie estimation. Our code and data are fully open-sourced at this https URL.
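
For readers unfamiliar with LoRA, the core idea is to leave a pretrained weight matrix W frozen and learn a low-rank update, giving an effective weight of W + (alpha / r) · B · A. The sketch below shows that arithmetic on plain nested lists. It illustrates the general LoRA formulation only, not how CaLoRAify wires it into a VLM; `matmul` and `lora_weight` are hypothetical helper names.

```python
def matmul(x, y):
    """Multiply two matrices stored as nested lists."""
    inner, cols = len(y), len(y[0])
    return [
        [sum(row[k] * y[k][j] for k in range(inner)) for j in range(cols)]
        for row in x
    ]

def lora_weight(w, a, b, alpha):
    """Return W + (alpha / r) * B @ A for a frozen weight W (d_out x d_in).

    b is d_out x r and a is r x d_in, where r is the adapter rank.
    Only a and b would be trained; w stays fixed.
    """
    r = len(a)
    delta = matmul(b, a)
    scale = alpha / r
    return [
        [w[i][j] + scale * delta[i][j] for j in range(len(w[0]))]
        for i in range(len(w))
    ]
```

With rank r much smaller than the matrix dimensions, A and B together hold far fewer parameters than W. When B is all zeros (the usual initialization), the effective weight equals W and the adapted model behaves exactly like the base model.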

[CV-65] FaceShield: Defending Facial Image against Deepfake Threats

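[Quick Read]: This paper addresses the widespread use of deepfakes in criminal activity, focusing on defense against deepfake models based on Diffusion Models (DMs) and Generative Adversarial Networks (GANs). The key to the solution is FaceShield, a proactive defense method built on three strategies: (i) manipulating the attention mechanism of diffusion models to exclude protected facial features during denoising; (ii) targeting prominent facial feature extraction models to make the adversarial perturbation more robust; and (iii) applying Gaussian blur and low-pass filtering to make the perturbation less perceptible and more robust to JPEG distortion. Experiments on the CelebA-HQ and VGGFace2-HQ datasets show that FaceShield achieves state-of-the-art defense against the latest diffusion-based deepfake models, is also applicable to GANs, and offers better noise imperceptibility and robustness.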

Link: https://arxiv.org/abs/2412.09921
Authors: Jaehwan Jeong,Sumin In,Sieun Kim,Hannie Shin,Jongheon Jeong,Sang Ho Yoon,Jaewook Chung,Sangpil Kim
Keywords: inciting widespread controversy, criminal activities presents, significant issue, inciting widespread, widespread controversy
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:The rising use of deepfakes in criminal activities presents a significant issue, inciting widespread controversy. While numerous studies have tackled this problem, most primarily focus on deepfake detection. These reactive solutions are insufficient as a fundamental approach for crimes where authenticity verification is not critical. Existing proactive defenses also have limitations, as they are effective only for deepfake models based on specific Generative Adversarial Networks (GANs), making them less applicable in light of recent advancements in diffusion-based models. In this paper, we propose a proactive defense method named FaceShield, which introduces novel attack strategies targeting deepfakes generated by Diffusion Models (DMs) and facilitates attacks on various existing GAN-based deepfake models through facial feature extractor manipulations. Our approach consists of three main components: (i) manipulating the attention mechanism of DMs to exclude protected facial features during the denoising process, (ii) targeting prominent facial feature extraction models to enhance the robustness of our adversarial perturbation, and (iii) employing Gaussian blur and low-pass filtering techniques to improve imperceptibility while enhancing robustness against JPEG distortion. Experimental results on the CelebA-HQ and VGGFace2-HQ datasets demonstrate that our method achieves state-of-the-art performance against the latest deepfake models based on DMs, while also exhibiting applicability to GANs and showcasing greater imperceptibility of noise along with enhanced robustness.

[CV-66] Precision-Enhanced Human-Object Contact Detection via Depth-Aware Perspective Interaction and Object Texture Restoration

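[Quick Read]: This paper addresses inaccurate identification of contact regions in human-object contact (HOT) detection when objects occlude the view. The key to the solution is PIHOT, a perspective interaction detector that uses a depth map generation model to provide the depth of humans and objects relative to the camera, which prevents false detections. In addition, mask dilation and object restoration techniques recover texture details in occluded regions and sharpen object boundaries, and a spatial awareness mechanism focuses on features near the contact points, which together yield a significant improvement in detection performance.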

Link: https://arxiv.org/abs/2412.09920
Authors: Yuxiao Wang,Wenpeng Neng,Zhenao Wei,Yu Lei,Weiying Xue,Nan Zhuang,Yanwu Xu,Xinyu Jiang,Qi Liu
Keywords: Human-object contact, designed to accurately, accurately identify, objects, Human-object
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by AAAI 2025

Abstract:Human-object contact (HOT) detection is designed to accurately identify the areas where humans and objects come into contact. Current methods often fail to account for scenarios where objects frequently block the view, resulting in inaccurate identification of contact areas. To tackle this problem, we propose a perspective interaction HOT detector called PIHOT, which utilizes a depth map generation model to provide depth information of humans and objects relative to the camera, thereby preventing false interaction detection. Furthermore, we use mask dilation and object restoration techniques to restore the texture details in occluded areas, improve the boundaries between objects, and enhance the perception of humans interacting with objects. Moreover, a spatial awareness mechanism is designed to concentrate on the characteristic features close to the points of contact. The experimental results show that the PIHOT algorithm achieves state-of-the-art performance on three benchmark datasets for HOT detection tasks. Compared to the most recent DHOT, our method achieves average improvements of 13%, 27.5%, 16%, and 18.5% on the SC-Acc., C-Acc., mIoU, and wIoU metrics, respectively.

[CV-67] B-VLLM: A Vision Large Language Model with Balanced Spatio-Temporal Tokens

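[Quick Read]: This paper addresses a challenge that Vision Large Language Models (VLLMs) face with long videos: encoding a video produces so many visual tokens that the context window can be exceeded and the computational burden grows. The key to the solution is a balanced VLLM framework (Balanced-VLLM, B-VLLM). A text-conditioned adaptive frame selection module identifies the frames relevant to the visual understanding task, and a temporal frame token merging technique de-duplicates them. A spatial token sampling module and an optional spatial token merging strategy then give precise control over the token count, so that task-relevant spatio-temporal cues are kept while the number of visual tokens stays within the VLLM context window. Experiments show that B-VLLM balances the number of frames and visual tokens in video understanding and performs strongly on multiple video understanding benchmarks.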

Link: https://arxiv.org/abs/2412.09919
Authors: Zhuqiang Lu,Zhenfei Yin,Mengwei He,Zhihui Wang,Zicheng Liu,Zhiyong Wang,Kun Hu
Keywords: Large Language Models, Vision Large Language, Language Models, Large Language, Vision Large
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:Recently, Vision Large Language Models (VLLMs) integrated with vision encoders have shown promising performance in vision understanding. The key to VLLMs is to encode visual content into sequences of visual tokens, enabling VLLMs to simultaneously process both visual and textual content. However, understanding videos, especially long videos, remains a challenge for VLLMs, as the number of visual tokens grows rapidly when encoding videos, creating the risk of exceeding the context window of VLLMs and introducing a heavy computational burden. To restrict the number of visual tokens, existing VLLMs either (1) uniformly downsample videos into a fixed number of frames or (2) reduce the number of visual tokens encoded from each frame. We argue that the former solution neglects the rich temporal cues in videos and the latter overlooks the spatial details in each frame. In this work, we present Balanced-VLLM (B-VLLM): a novel VLLM framework that aims to effectively leverage task-relevant spatio-temporal cues while restricting the number of visual tokens to within the VLLM context window length. At the core of our method, we devise a text-conditioned adaptive frame selection module to identify frames relevant to the visual understanding task. The selected frames are then de-duplicated using a temporal frame token merging technique. The visual tokens of the selected frames are processed through a spatial token sampling module and an optional spatial token merging strategy to achieve precise control over the token count. Experimental results show that B-VLLM is effective in balancing the number of frames and visual tokens in video understanding, yielding superior performance on various video understanding benchmarks. Our code is available at this https URL.
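
The trade-off between frame count and tokens per frame can be illustrated with a toy budget calculation. In this sketch a simple top-k over precomputed relevance scores stands in for the learned text-conditioned frame selection module, and an even split stands in for the spatial sampling stage. Both substitutions, and the names `select_frames` and `tokens_per_frame`, are assumptions rather than the paper's design.

```python
def select_frames(relevance, max_frames):
    """Return indices of the max_frames most relevant frames, in temporal order.

    relevance: one score per frame, higher meaning more relevant to the
    text query (assumed to be computed elsewhere).
    """
    ranked = sorted(range(len(relevance)), key=lambda i: relevance[i], reverse=True)
    # Re-sort the winners by index so temporal order is preserved.
    return sorted(ranked[:max_frames])

def tokens_per_frame(num_frames, token_budget):
    """Evenly split a visual-token budget across the selected frames."""
    if num_frames == 0:
        return 0
    return token_budget // num_frames
```

Under a fixed budget, selecting fewer frames leaves more tokens for the spatial detail of each one, while selecting more frames covers more of the timeline at a coarser resolution.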

[CV-68] All-in-One: Transferring Vision Foundation Models into Stereo Matching AAAI2025

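[Quick Read]: This paper addresses the limited feature extraction capability of iterative optimization-based methods for stereo matching. The key to the solution is AIO-Stereo, a model that flexibly selects and transfers knowledge from multiple heterogeneous Vision Foundation Models (VFMs), using a dual-level feature utilization mechanism to reconcile heterogeneous features and transfer multi-level knowledge. In practice, a dual-level selective knowledge transfer module selectively transfers knowledge and integrates the strengths of multiple VFMs, achieving state-of-the-art performance on multiple datasets.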

Link: https://arxiv.org/abs/2412.09912
Authors: Jingyi Zhou,Haoyu Zhang,Jiakang Yuan,Peng Ye,Tao Chen,Hao Jiang,Meiya Chen,Yangyang Zhang
Keywords: made remarkable progress, fundamental vision task, remarkable progress, made remarkable, stereo matching
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by AAAI 2025

Abstract:As a fundamental vision task, stereo matching has made remarkable progress. While recent iterative optimization-based methods have achieved promising performance, their feature extraction capabilities still have room for improvement. Inspired by the ability of vision foundation models (VFMs) to extract general representations, in this work, we propose AIO-Stereo, which can flexibly select and transfer knowledge from multiple heterogeneous VFMs to a single stereo matching model. To better reconcile features between heterogeneous VFMs and the stereo matching model and fully exploit prior knowledge from VFMs, we propose a dual-level feature utilization mechanism that aligns heterogeneous features and transfers multi-level knowledge. Based on this mechanism, a dual-level selective knowledge transfer module is designed to selectively transfer knowledge and integrate the advantages of multiple VFMs. Experimental results show that AIO-Stereo achieves state-of-the-art performance on multiple datasets, ranks 1st on the Middlebury dataset, and outperforms all published work on the ETH3D benchmark.

[CV-69] Prompt2Perturb (P2P): Text-Guided Diffusion-Based Adversarial Attacks on Breast Ultrasound Images

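[Quick Read]: This paper addresses the reliability of deep neural networks (DNNs) in breast cancer diagnosis, in particular the threat posed by adversarial attacks. Traditional attacks rely on fixed-norm perturbations, which do not align with human perception, while diffusion-based attacks require pre-trained models and are hard to apply in data-scarce medical imaging settings. The proposed solution, Prompt2Perturb (P2P), is a language-guided attack method that uses learnable prompts within the text encoder to generate subtle yet effective perturbations that steer the model towards targeted outcomes. The key to P2P is that it directly updates text embeddings without retraining the diffusion model, and improves efficiency by optimizing only the early reverse diffusion steps, while keeping the generated adversarial examples visually natural and free of noticeable artifacts. Experiments show that P2P outperforms existing state-of-the-art attack techniques on multiple breast ultrasound datasets.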

Link: https://arxiv.org/abs/2412.09910
Authors: Yasamin Medghalchi, Moein Heidari, Clayton Allard, Leonid Sigal, Ilker Hacihaliloglu
Keywords: Deep neural networks, offer significant promise, Deep neural, improving breast cancer, breast cancer diagnosis
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Deep neural networks (DNNs) offer significant promise for improving breast cancer diagnosis in medical imaging. However, these models are highly susceptible to adversarial attacks (small, imperceptible changes that can mislead classifiers), raising critical concerns about their reliability and security. Traditional attacks rely on fixed-norm perturbations, which misalign with human perception. In contrast, diffusion-based attacks require pre-trained models and demand substantial data when such models are unavailable, which limits their practical use in data-scarce scenarios. In medical imaging, this is often infeasible due to the limited availability of datasets. Building on recent advancements in learnable prompts, we propose Prompt2Perturb (P2P), a novel language-guided attack method capable of generating meaningful attack examples driven by text instructions. During the prompt learning phase, our approach leverages learnable prompts within the text encoder to create subtle, yet impactful, perturbations that remain imperceptible while guiding the model towards targeted outcomes. In contrast to current prompt learning-based approaches, our P2P stands out by directly updating text embeddings, avoiding the need for retraining diffusion models. Further, we leverage the finding that optimizing only the early reverse diffusion steps boosts efficiency while ensuring that the generated adversarial examples incorporate subtle noise, thus preserving ultrasound image quality without introducing noticeable artifacts. We show that our method outperforms state-of-the-art attack techniques across three breast ultrasound datasets in FID and LPIPS. Moreover, the generated images are both more natural in appearance and more effective compared to existing adversarial attacks. Our code will be publicly available at this https URL.

[CV-70] IQViC: In-context Question Adaptive Vision Compressor for Long-term Video Understanding LMMs

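[Quick Read]: This paper addresses the difficulty that existing long-term video understanding methods have in accurately capturing and analyzing extended sequences and in handling the intricate dependencies within video content. The key to the solution is a simple yet effective large multi-modal model framework built around a novel visual compressor, the In-context, Question Adaptive Visual Compressor (IQViC). The framework uses the transformer-based IQViC to perform question-conditioned in-context compression, selectively extracting relevant information, which significantly reduces memory requirements and improves the accuracy and memory efficiency of long-term video question answering.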

Link: https://arxiv.org/abs/2412.09907
Authors: Sosuke Yamao, Natsuki Miyahara, Yuki Harazono, Shun Takeuchi
Keywords: extended video sequences, analyze extended video, long-term video understanding, visual compressor, long-term video
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:With the increasing complexity of video data and the need for more efficient long-term temporal understanding, existing long-term video understanding methods often fail to accurately capture and analyze extended video sequences. These methods typically struggle to maintain performance over longer durations and to handle the intricate dependencies within the video content. To address these limitations, we propose a simple yet effective large multi-modal model framework for long-term video understanding that incorporates a novel visual compressor, the In-context, Question Adaptive Visual Compressor (IQViC). The key idea, inspired by humans’ selective attention and in-context memory mechanisms, is to introduce a novel visual compressor and incorporate efficient memory management techniques to enhance long-term video question answering. Our framework utilizes IQViC, a transformer-based visual compressor, enabling question-conditioned in-context compression, unlike existing methods that rely on full video visual features. This selectively extracts relevant information, significantly reducing memory token requirements. Through extensive experiments on a new dataset based on InfiniBench for long-term video understanding, and standard benchmarks used for existing methods’ evaluation, we demonstrate the effectiveness of our proposed IQViC framework and its superiority over state-of-the-art methods in terms of video understanding accuracy and memory efficiency.

[CV-71] MulSMo: Multimodal Stylized Motion Generation by Bidirectional Control Flow

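[Quick Read]: This paper addresses the conflict between content and style when generating motion sequences that conform to a target style. In existing methods information usually flows only from style to content, which can cause the two to conflict and harm their integration. The key to the solution is a bidirectional control flow between content and style that also adjusts the style towards the content, alleviating the style-content conflict and better preserving the dynamics of the style. In addition, contrastive learning extends stylized motion generation from a single modality (the style motion) to multiple modalities such as text and images, enabling flexible style control. Experiments show that the method significantly outperforms previous methods across different datasets and supports control by multimodal signals.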

Link: https://arxiv.org/abs/2412.09901
Authors: Zhe Li, Yisheng He, Lei Zhong, Weichao Shen, Qi Zuo, Lingteng Qiu, Zilong Dong, Laurence Tianruo Yang, Weihao Yuan
Keywords: Generating motion sequences, prompts requires accommodating, content prompts requires, motion sequences conforming, Generating motion
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Generating motion sequences conforming to a target style while adhering to the given content prompts requires accommodating both the content and style. In existing methods, the information usually only flows from style to content, which may cause conflict between the style and content, harming the integration. In contrast, in this work we build a bidirectional control flow between the style and the content, also adjusting the style towards the content, so that the style-content collision is alleviated and the dynamics of the style are better preserved in the integration. Moreover, we extend stylized motion generation from one modality, i.e. the style motion, to multiple modalities including texts and images through contrastive learning, leading to flexible style control over motion generation. Extensive experiments demonstrate that our method significantly outperforms previous methods across different datasets, while also enabling control by multimodal signals. The code of our method will be made publicly available.

[CV-72] Building a Multi-modal Spatiotemporal Expert for Zero-shot Action Recognition with CLIP AAAI2025

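[Quick Read]: This paper addresses multi-modal spatiotemporal understanding in zero-shot action recognition (ZSAR), in particular the shortcomings of existing approaches (such as directly fine-tuning CLIP) in capturing fine-grained spatiotemporal dynamics. The key to the solution is the Spatiotemporal Dynamic Duo (STDD) framework, which comprehends multi-modal spatiotemporal dynamics jointly through Space-time Cross Attention on the vision side and an Action Semantic Knowledge Graph (ASKG) on the semantic side. The vision side captures spatiotemporal dynamics flexibly with simple yet effective operations, while the semantic side builds the ASKG to derive nuanced text prompts. During training, frame-level video representations are aligned with prompt-level text representations, and the video representations from the frozen CLIP are used to enhance generalizability.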

Link: https://arxiv.org/abs/2412.09895
Authors: Yating Yu, Congqi Cao, Yueran Zhang, Qinyi Lv, Lingtong Min, Yanning Zhang
Keywords: Zero-shot action recognition, requires collaborative multi-modal, multi-modal spatiotemporal understanding, Zero-shot action, collaborative multi-modal spatiotemporal
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by AAAI 2025

Abstract:Zero-shot action recognition (ZSAR) requires collaborative multi-modal spatiotemporal understanding. However, finetuning CLIP directly for ZSAR yields suboptimal performance, given its inherent constraints in capturing essential temporal dynamics from both vision and text perspectives, especially when encountering novel actions with fine-grained spatiotemporal discrepancies. In this work, we propose Spatiotemporal Dynamic Duo (STDD), a novel CLIP-based framework to comprehend multi-modal spatiotemporal dynamics synergistically. For the vision side, we propose an efficient Space-time Cross Attention, which captures spatiotemporal dynamics flexibly with simple yet effective operations applied before and after spatial attention, without adding additional parameters or increasing computational complexity. For the semantic side, we conduct spatiotemporal text augmentation by comprehensively constructing an Action Semantic Knowledge Graph (ASKG) to derive nuanced text prompts. The ASKG elaborates on static and dynamic concepts and their interrelations, based on the idea of decomposing actions into spatial appearances and temporal motions. During the training phase, the frame-level video representations are meticulously aligned with prompt-level nuanced text representations, which are concurrently regulated by the video representations from the frozen CLIP to enhance generalizability. Extensive experiments validate the effectiveness of our approach, which consistently surpasses state-of-the-art approaches on popular video benchmarks (i.e., Kinetics-600, UCF101, and HMDB51) under challenging ZSAR settings. Code is available at this https URL.

[CV-73] VQTalker: Towards Multilingual Talking Avatars through Facial Motion Tokenization

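[Quick Read]: This paper addresses lip synchronization and natural motion in multilingual talking head generation. The key to the solution is VQTalker, a Vector Quantization-based framework grounded in the phonetic principle that speech consists of a finite set of sound units (phonemes) and visual articulations (visemes) that share commonalities across languages. Concretely, the paper introduces a facial motion tokenizer based on Group Residual Finite Scalar Quantization (GRFSQ) that produces a discretized representation of facial features, capturing facial movements comprehensively while improving multilingual generalization. A coarse-to-fine motion generation process then progressively refines the facial animation. The method achieves state-of-the-art performance in both video-driven and speech-driven scenarios in multilingual settings, and maintains high-quality results at 512×512 resolution with a low bitrate of approximately 11 kbps.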

Link: https://arxiv.org/abs/2412.09892
Authors: Tao Liu, Ziyang Ma, Qi Chen, Feilong Chen, Shuai Fan, Xie Chen, Kai Yu
Keywords: Vector Quantization-based framework, Vector Quantization-based, Quantization-based framework, Finite Scalar Quantization, Group Residual Finite
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 14 pages

Abstract:We present VQTalker, a Vector Quantization-based framework for multilingual talking head generation that addresses the challenges of lip synchronization and natural motion across diverse languages. Our approach is grounded in the phonetic principle that human speech comprises a finite set of distinct sound units (phonemes) and corresponding visual articulations (visemes), which often share commonalities across languages. We introduce a facial motion tokenizer based on Group Residual Finite Scalar Quantization (GRFSQ), which creates a discretized representation of facial features. This method enables comprehensive capture of facial movements while improving generalization to multiple languages, even with limited training data. Building on this quantized representation, we implement a coarse-to-fine motion generation process that progressively refines facial animations. Extensive experiments demonstrate that VQTalker achieves state-of-the-art performance in both video-driven and speech-driven scenarios, particularly in multilingual settings. Notably, our method achieves high-quality results at a resolution of 512*512 pixels while maintaining a lower bitrate of approximately 11 kbps. Our work opens new possibilities for cross-lingual talking face generation. Synthetic results can be viewed at this https URL.
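
The tokenizer in the abstract builds on finite scalar quantization. A minimal sketch of that building block is given below, under stated assumptions: it covers only plain per-coordinate scalar quantization with an odd number of levels, not the grouping or residual stages of GRFSQ, and the function name `fsq_quantize` is hypothetical rather than taken from the paper's code.

```python
import math


def fsq_quantize(z, levels=5):
    """Snap each coordinate of `z` to one of `levels` evenly spaced values in [-1, 1].

    Assumption for this sketch: `levels` is odd, so that 0 is on the grid.
    """
    if levels < 3 or levels % 2 == 0:
        raise ValueError("this sketch assumes an odd number of levels >= 3")
    half = (levels - 1) / 2
    # tanh bounds each coordinate to (-1, 1); rounding snaps it to the grid.
    return [round(math.tanh(v) * half) / half for v in z]


print(fsq_quantize([0.3, -0.1, 4.0]))
```

With `d` coordinates and `levels` values each, the implicit codebook has `levels ** d` entries, so no explicit codebook needs to be learned or stored.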

[CV-74] T-GMSI: A transformer-based generative model for spatial interpolation under sparse measurements

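[Quick Read]: This paper addresses the challenge of generating continuous environmental models from sparsely sampled data in spatial modeling, particularly for topography. Because traditional spatial interpolation methods handle sparse measurements poorly, the paper proposes a Transformer-based Generative Model for Spatial Interpolation (T-GMSI) that uses a vision transformer (ViT) architecture to generate digital elevation models (DEMs). The key is replacing traditional convolution-based methods with ViT for feature extraction and DEM interpolation, combined with a terrain feature-aware loss function for improved accuracy. T-GMSI performs well on datasets with over 70% sparsity and transfers strongly across different landscapes without fine-tuning.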

Link: https://arxiv.org/abs/2412.09886
Authors: Xiangxi Tian, Jie Shan
Keywords: Generating continuous environmental, Generating continuous, continuous environmental models, sparsely sampled data, spatial interpolation
Categories: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
Comments:

Abstract:Generating continuous environmental models from sparsely sampled data is a critical challenge in spatial modeling, particularly for topography. Traditional spatial interpolation methods often struggle with handling sparse measurements. To address this, we propose a Transformer-based Generative Model for Spatial Interpolation (T-GMSI) using a vision transformer (ViT) architecture for digital elevation model (DEM) generation under sparse conditions. T-GMSI replaces traditional convolution-based methods with ViT for feature extraction and DEM interpolation while incorporating a terrain feature-aware loss function for enhanced accuracy. T-GMSI excels in producing high-quality elevation surfaces from datasets with over 70% sparsity and demonstrates strong transferability across diverse landscapes without fine-tuning. Its performance is validated through extensive experiments, outperforming traditional methods such as ordinary Kriging (OK) and natural neighbor (NN) and a conditional generative adversarial network (CGAN)-based model (CEDGAN). Compared to OK and NN, T-GMSI reduces root mean square error (RMSE) by 40% and 25% on airborne lidar data and by 23% and 10% on spaceborne lidar data. Against CEDGAN, T-GMSI achieves a 20% RMSE improvement on provided DEM data, requiring no fine-tuning. The model's ability to generalize to large, unseen terrains underscores its transferability and potential applicability beyond topographic modeling. This research establishes T-GMSI as a state-of-the-art solution for spatial interpolation on sparse datasets and highlights its broader utility for other sparse data interpolation challenges.
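
The percentages in the abstract are relative RMSE reductions. The sketch below shows how such a figure is computed; the elevation values and baseline numbers are made up for illustration and are not data from the paper, and both function names are hypothetical.

```python
import math


def rmse(predicted, reference):
    """Root mean square error between two equally long sequences."""
    if not predicted or len(predicted) != len(reference):
        raise ValueError("inputs must be non-empty and of equal length")
    squared = [(p - r) ** 2 for p, r in zip(predicted, reference)]
    return math.sqrt(sum(squared) / len(squared))


def relative_reduction(baseline_rmse, model_rmse):
    """Fraction by which the model lowers RMSE relative to a baseline."""
    return (baseline_rmse - model_rmse) / baseline_rmse


# Made-up elevations in metres, for illustration only.
print(rmse([1.0, 2.0, 3.0], [1.0, 2.0, 5.0]))
# A drop from 2.0 m to 1.2 m is a 40% relative reduction.
print(relative_reduction(2.0, 1.2))
```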

[CV-75] Sharpening Your Density Fields: Spiking Neuron Aided Fast Geometry Learning

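[Quick Read]: This paper addresses the reliance on a hand-crafted threshold to define the level set when extracting geometry from Neural Radiance Fields (NeRF). Conventional pipelines use the Marching Cubes algorithm, which requires laborious, scene-specific manual tuning and therefore limits practicality in real-world applications. The key to the solution is a spiking neuron mechanism that adjusts the threshold dynamically, removing the need for manual threshold selection. To counter the model collapse and noisy outputs that can occur when training directly with spiking neurons, the paper proposes a round-robin strategy that stabilizes training and lets the geometry network produce a sharper, more precise density distribution with low computational overhead.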

Link: https://arxiv.org/abs/2412.09881
Authors: Yi Gu, Zhaorui Wang, Dongjun Ye, Renjing Xu
Keywords: Neural Radiance Fields, Radiance Fields, achieved remarkable progress, Neural Radiance, neural rendering
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Neural Radiance Fields (NeRF) have achieved remarkable progress in neural rendering. Extracting geometry from NeRF typically relies on the Marching Cubes algorithm, which uses a hand-crafted threshold to define the level set. However, this threshold-based approach requires laborious and scenario-specific tuning, limiting its practicality for real-world applications. In this work, we seek to enhance the efficiency of this method during the training time. To this end, we introduce a spiking neuron mechanism that dynamically adjusts the threshold, eliminating the need for manual selection. Despite its promise, directly training with the spiking neuron often results in model collapse and noisy outputs. To overcome these challenges, we propose a round-robin strategy that stabilizes the training process and enables the geometry network to achieve a sharper and more precise density distribution with minimal computational overhead. We validate our approach through extensive experiments on both synthetic and real-world datasets. The results show that our method significantly improves the performance of threshold-based techniques, offering a more robust and efficient solution for NeRF geometry extraction.
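
To see why a hand-crafted threshold matters, consider a 1D analogue of the level-set step: Marching Cubes places surface points by linear interpolation along grid edges where the density crosses the threshold. The sketch below is illustrative only; the density profile is made up and `level_set_crossings` is a hypothetical name, not code from the paper.

```python
def level_set_crossings(density, threshold):
    """Positions (in sample units) where a 1D density profile crosses `threshold`."""
    crossings = []
    for i in range(len(density) - 1):
        a, b = density[i], density[i + 1]
        # A crossing exists when the threshold lies strictly between neighbours.
        if (a - threshold) * (b - threshold) < 0:
            # Linear interpolation along the edge, as Marching Cubes does per cell edge.
            crossings.append(i + (threshold - a) / (b - a))
    return crossings


profile = [0.0, 1.0, 3.0, 1.0, 0.0]  # a blurry density bump
print(level_set_crossings(profile, 2.0))
print(level_set_crossings(profile, 0.5))
```

The same soft density bump yields a surface two samples wide at one threshold and three samples wide at the other, which is why a sharper density distribution makes the extracted geometry less sensitive to the threshold choice.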

[CV-76] Selective State Space Memory for Large Vision-Language Models

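[Quick Read]: This paper addresses efficient fine-tuning of Large Vision-Language Models (LVLMs) for domain-specific applications. The key to the solution is State Space Memory Integration (SSMI), which integrates lightweight Mamba-based state space modules into the LVLM architecture to capture long-range dependencies and inject task-specific visual and sequential patterns. Compared with traditional fine-tuning, SSMI updates only a small fraction of the model's parameters, which makes it markedly more computationally efficient and scalable.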

Link: https://arxiv.org/abs/2412.09875
Authors: Chee Ng, Yuen Fung
Keywords: Large Vision-Language Models, demonstrated remarkable performance, Large Vision-Language, multimodal tasks, Space Memory Integration
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Large Vision-Language Models (LVLMs) have demonstrated remarkable performance across a wide range of multimodal tasks. However, fine-tuning these models for domain-specific applications remains a computationally intensive challenge. This paper introduces State Space Memory Integration (SSMI), a novel approach for efficient fine-tuning of LVLMs. By integrating lightweight Mamba-based state space modules into the LVLM architecture, SSMI captures long-range dependencies and injects task-specific visual and sequential patterns effectively. Unlike traditional fine-tuning methods, SSMI requires only a fraction of the model’s parameters to be updated, making it computationally efficient and scalable. Experiments on benchmark datasets, including COCO Captioning, VQA, and Flickr30k, demonstrate that SSMI achieves state-of-the-art performance while maintaining robustness and generalization capabilities. Comprehensive analysis further validates the advantages of SSMI in terms of efficiency, adaptability, and interpretability, positioning it as a compelling solution for fine-tuning large-scale vision-language models.

[CV-77] Can Students Beyond The Teacher? Distilling Knowledge from Teachers Bias

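[Quick Read]: This paper addresses the way student performance in knowledge distillation (KD) is affected by bias transferred from the teacher model. The key to the solution is a novel three-step strategy. First, a bias elimination method filters out the wrong knowledge passed on by the teacher and retains the right knowledge for the student to learn. Second, a bias rectification method corrects the teacher's wrong predictions, addressing bias interference at its root. Finally, a dynamic learning approach uses a dynamically updated loss function so that the student first quickly learns easy tasks based on right knowledge and then gradually tackles the harder, bias-related tasks, which significantly improves learning efficiency. According to the authors, this is the first strategy that enables the student model to surpass the teacher, and it is broadly applicable as a plug-and-play module for various mainstream KD frameworks.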

Link: https://arxiv.org/abs/2412.09874
Authors: Jianhua Zhang, Yi Gao, Ruyu Liu, Xu Cheng, Houxiang Zhang, Shengyong Chen
Keywords: student model, smaller student model, model compression technique, model, teacher model
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Knowledge distillation (KD) is a model compression technique that transfers knowledge from a large teacher model to a smaller student model to enhance its performance. Existing methods often assume that the student model is inherently inferior to the teacher model. However, we identify that the fundamental issue affecting student performance is the bias transferred by the teacher. Current KD frameworks transmit both right and wrong knowledge, introducing bias that misleads the student model. To address this issue, we propose a novel strategy to rectify bias and greatly improve the student model’s performance. Our strategy involves three steps: First, we differentiate knowledge and design a bias elimination method to filter out biases, retaining only the right knowledge for the student model to learn. Next, we propose a bias rectification method to rectify the teacher model’s wrong predictions, fundamentally addressing bias interference. The student model learns from both the right knowledge and the rectified biases, greatly improving its prediction accuracy. Additionally, we introduce a dynamic learning approach with a loss function that updates weights dynamically, allowing the student model to quickly learn right knowledge-based easy tasks initially and tackle hard tasks corresponding to biases later, greatly enhancing the student model’s learning efficiency. To the best of our knowledge, this is the first strategy enabling the student model to surpass the teacher model. Experiments demonstrate that our strategy, as a plug-and-play module, is versatile across various mainstream KD frameworks. We will release our code after the paper is accepted.
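
The first two steps in the abstract can be pictured with a small sketch. The abstract does not specify how the teacher's wrong predictions are rectified, so the swap rule below is an assumption chosen for simplicity, and all function names are hypothetical rather than the authors' implementation.

```python
def argmax(values):
    """Index of the largest entry."""
    return max(range(len(values)), key=lambda i: values[i])


def split_by_teacher_correctness(teacher_probs, labels):
    """Separate sample indices into those the teacher gets right and wrong."""
    right, wrong = [], []
    for idx, (probs, label) in enumerate(zip(teacher_probs, labels)):
        (right if argmax(probs) == label else wrong).append(idx)
    return right, wrong


def rectify(probs, label):
    """Assumed rectification rule: swap the top probability onto the true label."""
    fixed = list(probs)
    top = argmax(fixed)
    fixed[top], fixed[label] = fixed[label], fixed[top]
    return fixed


teacher_probs = [[0.7, 0.2, 0.1], [0.2, 0.7, 0.1]]
labels = [0, 0]
right, wrong = split_by_teacher_correctness(teacher_probs, labels)
print(right, wrong)
print([rectify(teacher_probs[i], labels[i]) for i in wrong])
```

In this toy setup the first sample would be distilled as-is (right knowledge), while the second would be distilled from the rectified distribution instead of the teacher's biased one.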

[CV-78] Dynamic Cross-Modal Alignment for Robust Semantic Location Prediction

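[Quick Read]: This paper addresses semantic location prediction from multimodal social media posts, in particular the challenges of contextual ambiguity and modality discrepancy. The key to the solution is the Contextualized Vision-Language Alignment (CoVLA) framework, which strengthens cross-modal feature alignment through a Contextual Alignment Module (CAM) and dynamically integrates textual and visual information through a Cross-modal Fusion Module (CMF). These modules improve performance on semantic location prediction, and experiments show significant gains in accuracy and F1-score over existing state-of-the-art methods.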

Link: https://arxiv.org/abs/2412.09870
Authors: Liu Jing, Amirul Rahman
Keywords: multimodal social media, social media posts, Cross-modal Fusion Module, Contextual Alignment Module, Contextualized Vision-Language Alignment
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Semantic location prediction from multimodal social media posts is a critical task with applications in personalized services and human mobility analysis. This paper introduces Contextualized Vision-Language Alignment (CoVLA), a discriminative framework designed to address the challenges of contextual ambiguity and modality discrepancy inherent in this task. CoVLA leverages a Contextual Alignment Module (CAM) to enhance cross-modal feature alignment and a Cross-modal Fusion Module (CMF) to dynamically integrate textual and visual information. Extensive experiments on a benchmark dataset demonstrate that CoVLA significantly outperforms state-of-the-art methods, achieving improvements of 2.3% in accuracy and 2.5% in F1-score. Ablation studies validate the contributions of CAM and CMF, while human evaluations highlight the contextual relevance of the predictions. Additionally, robustness analysis shows that CoVLA maintains high performance under noisy conditions, making it a reliable solution for real-world applications. These results underscore the potential of CoVLA in advancing semantic location prediction research.

[CV-79] RP-SLAM: Real-time Photorealistic SLAM with Efficient 3D Gaussian Splatting

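[Quick Read]: This paper addresses the problems existing 3D Gaussian Splatting (3DGS) methods face when applied to photorealistic SLAM systems: redundant Gaussian primitives, forgetting during continuous optimization, and difficulty initializing primitives in the monocular case because depth information is missing. The key to the solution is RP-SLAM, a 3DGS-based visual SLAM method for monocular and RGB-D cameras. RP-SLAM decouples camera pose estimation from Gaussian primitive optimization and has three core components: 1) an incremental mapping approach that achieves a compact and accurate scene representation through adaptive sampling and Gaussian primitive filtering; 2) a dynamic window optimization method that mitigates forgetting and improves map consistency; 3) for the monocular case, a keyframe initialization method based on sparse point clouds that improves the initialization accuracy of Gaussian primitives and provides a geometric basis for subsequent optimization. Experiments show that RP-SLAM achieves state-of-the-art map rendering accuracy while maintaining real-time performance and model compactness.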

Link: https://arxiv.org/abs/2412.09868
Authors: Lizhi Bai, Chunqi Tian, Jun Yang, Siyu Zhang, Masanori Suganuma, Takayuki Okatani
Keywords: realism SLAM systems, Splatting has emerged, Gaussian Splatting, Gaussian primitives, SLAM systems
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
Comments:

Abstract:3D Gaussian Splatting has emerged as a promising technique for high-quality 3D rendering, leading to increasing interest in integrating 3DGS into photorealistic SLAM systems. However, existing methods face challenges such as redundant Gaussian primitives, forgetting during continuous optimization, and difficulty in initializing primitives in the monocular case due to the lack of depth information. In order to achieve efficient and photorealistic mapping, we propose RP-SLAM, a 3D Gaussian splatting-based vision SLAM method for monocular and RGB-D cameras. RP-SLAM decouples camera pose estimation from Gaussian primitive optimization and consists of three key components. Firstly, we propose an efficient incremental mapping approach to achieve a compact and accurate representation of the scene through adaptive sampling and Gaussian primitive filtering. Secondly, a dynamic window optimization method is proposed to mitigate the forgetting problem and improve map consistency. Finally, for the monocular case, a monocular keyframe initialization method based on sparse point clouds is proposed to improve the initialization accuracy of Gaussian primitives, which provides a geometric basis for subsequent optimization. The results of numerous experiments demonstrate that RP-SLAM achieves state-of-the-art map rendering accuracy while ensuring real-time performance and model compactness.

[CV-80] LinGen: Towards High-Resolution Minute-Length Text-to-Video Generation with Linear Computational Complexity

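[Quick Read]: This paper addresses the high computational cost of text-to-video generation: the cost of Diffusion Transformers (DiTs) is quadratic in the number of pixels, which makes minute-length video generation extremely expensive. The key to the solution is a Linear-complexity text-to-video Generation (LinGen) framework whose cost grows linearly with the number of pixels. LinGen replaces self-attention, the computationally dominant quadratic-complexity block, with a linear-complexity block called MATE. MATE consists of an MA-branch, which handles short-to-long-range correlations, and a TE-branch, which handles temporal correlations, significantly reducing computational complexity and improving video quality. Experiments show that LinGen outperforms DiT in video quality while reducing FLOPs by up to 15× and latency by up to 11.5×.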

Link: https://arxiv.org/abs/2412.09856
Authors: Hongjie Wang, Chih-Yao Ma, Yen-Cheng Liu, Ji Hou, Tao Xu, Jialiang Wang, Felix Juefei-Xu, Yaqiao Luo, Peizhao Zhang, Tingbo Hou, Peter Vajda, Niraj K. Jha, Xiaoliang Dai
Keywords: Diffusion Transformers, highly computationally intensive, enhances content creation, generation enhances content, video generation
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Image and Video Processing (eess.IV)
Comments: 20 pages, 20 figures

Abstract:Text-to-video generation enhances content creation but is highly computationally intensive: the computational cost of Diffusion Transformers (DiTs) scales quadratically in the number of pixels. This makes minute-length video generation extremely expensive, limiting most existing models to generating videos only 10-20 seconds long. We propose a Linear-complexity text-to-video Generation (LinGen) framework whose cost scales linearly in the number of pixels. For the first time, LinGen enables high-resolution minute-length video generation on a single GPU without compromising quality. It replaces the computationally-dominant and quadratic-complexity block, self-attention, with a linear-complexity block called MATE, which consists of an MA-branch and a TE-branch. The MA-branch targets short-to-long-range correlations, combining a bidirectional Mamba2 block with our token rearrangement method, Rotary Major Scan, and our review tokens developed for long video generation. The TE-branch is a novel TEmporal Swin Attention block that focuses on temporal correlations between adjacent tokens and medium-range tokens. The MATE block addresses the adjacency preservation issue of Mamba and improves the consistency of generated videos significantly. Experimental results show that LinGen outperforms DiT (with a 75.6% win rate) in video quality with up to 15× (11.5×) FLOPs (latency) reduction. Furthermore, both automatic metrics and human evaluation demonstrate our LinGen-4B yields comparable video quality to state-of-the-art models (with a 50.5%, 52.1%, 49.1% win rate with respect to Gen-3, LumaLabs, and Kling, respectively). This paves the way to hour-length movie generation and real-time interactive video generation. We provide 68s video generation results and more examples in our project website: this https URL.
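
The gap between quadratic and linear scaling can be made concrete with a little arithmetic. The sketch below is a back-of-the-envelope illustration, not a model of LinGen or DiT: the frame rate and tokens-per-frame values are hypothetical, and constant factors are ignored.

```python
def video_tokens(seconds, fps=16, tokens_per_frame=1024):
    """Token count for a clip; fps and tokens_per_frame are hypothetical values."""
    return seconds * fps * tokens_per_frame


def cost_growth(token_multiplier, exponent):
    """How much cost grows when the token count is multiplied, for cost ~ N**exponent."""
    return token_multiplier ** exponent


multiplier = video_tokens(60) / video_tokens(10)  # 10 s clip -> 60 s clip
print(multiplier)                  # tokens grow 6x
print(cost_growth(multiplier, 2))  # quadratic block (self-attention): 36x
print(cost_growth(multiplier, 1))  # linear block: 6x
```

Under these assumptions, extending a clip from 10 to 60 seconds multiplies the cost of a quadratic block by 36 but that of a linear block by only 6, and the gap widens as clips get longer.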

[CV-81] Real-time Identity Defenses against Malicious Personalization of Diffusion Models

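[Quick Read]: This paper addresses the identity replication risk created by personalized diffusion models that generate highly realistic images, and the social, ethical, and legal problems that follow. The key to the solution is the Real-time Identity Defender (RID), a neural network that generates adversarial perturbations in a single forward pass, avoiding computationally intensive per-image optimization. RID is highly efficient, with defense times as low as 0.12 seconds on a single GPU and 1.1 seconds on a standard CPU, making it suitable for edge devices such as smartphones. Despite this efficiency, RID matches state-of-the-art performance on visual and quantitative benchmarks and effectively mitigates identity replication risks. RID's perturbations mimic the efficacy of traditional defenses while exhibiting properties distinct from natural noise such as Gaussian perturbations. To improve robustness further, the paper extends RID into an ensemble framework that integrates multiple pre-trained text-to-image diffusion models, providing resilience against black-box attacks and post-processing techniques such as JPEG compression and diffusion-based purification.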

Link: https://arxiv.org/abs/2412.09844
Authors: Hanzhong Guo, Shen Nie, Chao Du, Tianyu Pang, Hao Sun, Chongxuan Li
Keywords: pose substantial social, synthesizing highly realistic, Personalized diffusion models, realistic images based, highly realistic images
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 21 pages, 7 figures

Abstract:Personalized diffusion models, capable of synthesizing highly realistic images based on a few reference portraits, pose substantial social, ethical, and legal risks by enabling identity replication. Existing defense mechanisms rely on computationally intensive adversarial perturbations tailored to individual images, rendering them impractical for real-world deployment. This study introduces Real-time Identity Defender (RID), a neural network designed to generate adversarial perturbations through a single forward pass, bypassing the need for image-specific optimization. RID achieves unprecedented efficiency, with defense times as low as 0.12 seconds on a single GPU (4,400 times faster than leading methods) and 1.1 seconds per image on a standard Intel i9 CPU, making it suitable for edge devices such as smartphones. Despite its efficiency, RID matches state-of-the-art performance across visual and quantitative benchmarks, effectively mitigating identity replication risks. Our analysis reveals that RID’s perturbations mimic the efficacy of traditional defenses while exhibiting properties distinct from natural noise, such as Gaussian perturbations. To enhance robustness, we extend RID into an ensemble framework that integrates multiple pre-trained text-to-image diffusion models, ensuring resilience against black-box attacks and post-processing techniques, including JPEG compression and diffusion-based purification.

[CV-82] Leveraging Programmatically Generated Synthetic Data for Differentially Private Diffusion Training

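[Quick Read]: This paper addresses the difficulty of using programmatically generated synthetic data, previously applied to differentially private training for classification, in generative models: because the synthetic and real data distributions differ, a model trained on synthetic data generates unrealistic images. The key to the solution is DP-SynGen, which exploits the three stages of diffusion models (coarse, context, and cleaning). The paper verifies theoretically and empirically that the cleaning and coarse stages can be trained on synthetic data instead of real data, which reduces the privacy budget and improves the quality of the generated data.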

Link: https://arxiv.org/abs/2412.09842
Authors: Yujin Choi, Jinseong Park, Junyoung Byun, Jaewook Lee
Keywords: Programmatically generated synthetic, synthetic data, generated synthetic data, synthetic, differential private training
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Programmatically generated synthetic data has been used in differentially private training for classification to enhance performance without privacy leakage. However, as the synthetic data is generated from a random process, the distributions of the real data and the synthetic data are distinguishable and difficult to transfer between. Therefore, a model trained with the synthetic data generates unrealistic random images, raising challenges in adapting the synthetic data for generative models. In this work, we propose DP-SynGen, which leverages programmatically generated synthetic data in diffusion models to address this challenge. By exploiting the three stages of diffusion models (coarse, context, and cleaning), we identify stages where synthetic data can be effectively utilized. We theoretically and empirically verify that the cleaning and coarse stages can be trained without private data, replacing it with synthetic data to reduce the privacy budget. The experimental results show that DP-SynGen improves the quality of generated data by mitigating the negative impact of privacy-induced noise on the generation process.

[CV-83] Super-Resolution for Remote Sensing Imagery via the Coupling of a Variational Model and Deep Learning

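[Quick Read]: This paper addresses image super-resolution (SR) for remote sensing imagery, which aims to improve visual quality by enhancing spatial resolution and detail. The key to the solution is a novel gradient-guided multi-frame super-resolution (MFSR) framework that integrates a learned gradient prior as a regularization term into a model-based optimization method. Specifically, the framework combines a local gradient regularization (LGR) prior, derived from a deep residual attention network (DRAN) through gradient profile transformation, with a non-local total variation (NLTV) prior, characterized by the spatial structure similarity of gradient patches under a maximum a posteriori (MAP) model. The two priors are complementary: the modeled NLTV prior preserves edge smoothness and suppresses visual artifacts, while the learned LGR prior enhances sharp edges and recovers fine structures. Incorporating both into an adaptive norm-based reconstruction framework yields a mixed L1 and L2 regularization minimization problem, which is optimized to produce the desired high-resolution remote sensing image. Experiments show that the method outperforms several state-of-the-art SR algorithms in both visual quality and quantitative evaluation.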

链接: https://arxiv.org/abs/2412.09841
作者: Jing Sun,Huanfeng Shen,Qiangqiang Yuan,Liangpei Zhang
关键词: remote sensing, resolution and detail, detail information, effective image priors, remote sensing images
类目: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
备注: 18 pages, 14 figures

点击查看摘要

Abstract:Image super-resolution (SR) is an effective way to enhance the spatial resolution and detail information of remote sensing images, to obtain a superior visual quality. As SR is severely ill-conditioned, effective image priors are necessary to regularize the solution space and generate the corresponding high-resolution (HR) image. In this paper, we propose a novel gradient-guided multi-frame super-resolution (MFSR) framework for remote sensing imagery reconstruction. The framework integrates a learned gradient prior as the regularization term into a model-based optimization method. Specifically, the local gradient regularization (LGR) prior is derived from the deep residual attention network (DRAN) through gradient profile transformation. The non-local total variation (NLTV) prior is characterized using the spatial structure similarity of the gradient patches with the maximum a posteriori (MAP) model. The modeled prior performs well in preserving edge smoothness and suppressing visual artifacts, while the learned prior is effective in enhancing sharp edges and recovering fine structures. By incorporating the two complementary priors into an adaptive norm based reconstruction framework, the mixed L1 and L2 regularization minimization problem is optimized to achieve the required HR remote sensing image. Extensive experimental results on remote sensing data demonstrate that the proposed method can produce visually pleasant images and is superior to several of the state-of-the-art SR algorithms in terms of the quantitative evaluation.

[CV-84] Which cycling environment appears safer? Learning cycling safety perceptions from pairwise image comparisons

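[Quick read]: This paper addresses how individuals in cities perceive cycling safety, and in particular how to capture and understand those perceptions quickly and accurately. The key to the solution is a pairwise-comparison approach: respondents are repeatedly shown pairs of real-world road environment images and asked to select the one they perceive as safer for cycling. Using the collected data, the authors train a siamese convolutional neural network with a multi-loss framework that learns individuals' preferences about cycling safety, including ties. The approach predicts human perceptions of how safe a cycling environment is and can be applied in practice, for example to improve the effectiveness of interventions and to support continuous assessment of changing cycling environments.

Link: https://arxiv.org/abs/2412.09835
Authors: Miguel Costa,Manuel Marques,Carlos Lima Azevedo,Felix Wilhelm Siebert,Filipe Moura
Keywords: sustainable transport modes, transport modes, cities to transition, sustainable transport, Cycling
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: ©2024 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works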

Abstract:Cycling is critical for cities to transition to more sustainable transport modes. Yet, safety concerns remain a critical deterrent for individuals to cycle. If individuals perceive an environment as unsafe for cycling, it is likely that they will prefer other means of transportation. Yet, capturing and understanding how individuals perceive cycling risk is complex and often slow, with researchers defaulting to traditional surveys and in-loco interviews. In this study, we tackle this problem. We base our approach on using pairwise comparisons of real-world images, repeatedly presenting respondents with pairs of road environments and asking them to select the one they perceive as safer for cycling, if any. Using the collected data, we train a siamese-convolutional neural network using a multi-loss framework that learns from individuals’ responses, learns preferences directly from images, and includes ties (often discarded in the literature). Effectively, this model learns to predict human-style perceptions, evaluating which cycling environments are perceived as safer. Our model achieves good results, showcasing this approach has a real-life impact, such as improving interventions’ effectiveness. Furthermore, it facilitates the continuous assessment of changing cycling environments, permitting short-term evaluations of measures to enhance perceived cycling safety. Finally, our method can be efficiently deployed in different locations with a growing number of openly available street-view images.
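
To make the idea of learning from pairwise comparisons with ties concrete, here is a minimal sketch of a tie-aware likelihood. It is an illustration only: the abstract does not spell out the paper's multi-loss framework, so the Davidson-style tie model below, the function names, and the `tie_strength` parameter are all assumptions. The two scores stand in for the outputs of the shared (siamese) branches.

```python
import math


def pairwise_probs(score_a, score_b, tie_strength=1.0):
    """Return (P(a safer), P(b safer), P(tie)) under a Davidson-style model.

    score_a and score_b are hypothetical scalar safety scores, e.g. the
    outputs of the two weight-sharing branches of a siamese network.
    """
    # Subtracting the max keeps exp() stable and leaves the ratios unchanged.
    m = max(score_a, score_b)
    ea, eb = math.exp(score_a - m), math.exp(score_b - m)
    tie = tie_strength * math.sqrt(ea * eb)
    z = ea + eb + tie
    return ea / z, eb / z, tie / z


def nll(score_a, score_b, label, tie_strength=1.0):
    """Negative log-likelihood of one response; label is 'a', 'b' or 'tie'."""
    p_a, p_b, p_tie = pairwise_probs(score_a, score_b, tie_strength)
    return -math.log({"a": p_a, "b": p_b, "tie": p_tie}[label])


if __name__ == "__main__":
    print(pairwise_probs(2.0, 0.0))
    print(nll(2.0, 0.0, "tie"))
```

In this sketch a tie is most likely when the two scores are close, so tied responses still carry a training signal instead of being discarded.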

[CV-85] MSC: Multi-Scale Spatio-Temporal Causal Attention for Autoregressive Video Diffusion

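[Quick read]: This paper addresses the technical challenges and computational cost of generating high-resolution videos with rich semantics and complex motion. The key to the solution is the Multi-Scale Causal (MSC) framework, which achieves efficient attention computation by introducing multiple resolutions in the spatial dimension and high- and low-frequency information in the temporal dimension. In addition, by controlling how attention blocks at different scales are combined, the framework enables diffusion training with causal conditioning on noisy image frames, based on the principle that noise destroys information at different rates at different resolutions. Theoretical analysis shows that the approach can greatly reduce computational complexity and improve training efficiency, while supporting autoregressive long-video generation without violating the natural order of the frame sequence.

Link: https://arxiv.org/abs/2412.09828
Authors: Xunnong Xu,Mengying Cao
Keywords: transformers enable flexible, enable flexible generative, flexible generative modeling, Diffusion transformers enable, transformers enable
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Technical Report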

Abstract:Diffusion transformers enable flexible generative modeling for video. However, it is still technically challenging and computationally expensive to generate high-resolution videos with rich semantics and complex motion. Similar to languages, video data are also auto-regressive by nature, so it is counter-intuitive to use attention mechanism with bi-directional dependency in the model. Here we propose a Multi-Scale Causal (MSC) framework to address these problems. Specifically, we introduce multiple resolutions in the spatial dimension and high-low frequencies in the temporal dimension to realize efficient attention calculation. Furthermore, attention blocks on multiple scales are combined in a controlled way to allow causal conditioning on noisy image frames for diffusion training, based on the idea that noise destroys information at different rates on different resolutions. We theoretically show that our approach can greatly reduce the computational complexity and enhance the efficiency of training. The causal attention diffusion framework can also be used for auto-regressive long video generation, without violating the natural order of frame sequences.
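
As a small illustration of what causal attention across frames means, the sketch below builds a frame-level causal mask in which a token may attend to every token in its own frame and in earlier frames, but never to later frames. It does not reproduce the paper's multi-scale design; the function names and the flat token layout (frames stored one after another) are assumptions made for the example.

```python
import numpy as np


def frame_causal_mask(num_frames, tokens_per_frame):
    """Boolean mask where mask[i, j] is True if query token i may attend to key token j."""
    # Frame index of every token, assuming frames are laid out contiguously.
    frame_id = np.repeat(np.arange(num_frames), tokens_per_frame)
    return frame_id[:, None] >= frame_id[None, :]


def masked_softmax(logits, mask):
    """Row-wise softmax that assigns zero weight to masked-out positions."""
    logits = np.where(mask, logits, -np.inf)
    # Every row keeps at least its own frame, so the row max is finite.
    logits = logits - logits.max(axis=-1, keepdims=True)
    weights = np.exp(logits)
    return weights / weights.sum(axis=-1, keepdims=True)


if __name__ == "__main__":
    mask = frame_causal_mask(num_frames=3, tokens_per_frame=2)
    print(mask.astype(int))
    print(masked_softmax(np.zeros((6, 6)), mask).round(2))
```

With such a mask, a new frame can be generated by conditioning only on frames that precede it, which is the property autoregressive long-video generation relies on.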

[CV-86] Dynamic Try-On: Taming Video Virtual Try-on with Dynamic Attention Mechanism

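[Quick read]: This paper addresses two main problems in video try-on: how to leverage a garment encoder while lowering computational resource consumption, and how to ensure temporal consistency in the synthesis of human body parts during rapid motion. The key to the solution is a new video try-on framework based on the Diffusion Transformer (DiT), named Dynamic Try-On. The framework uses the DiT backbone itself as the garment encoder and introduces a dynamic feature fusion module to store and integrate garment features, which reduces computational overhead. To ensure temporal consistency, the paper proposes a limb-aware dynamic attention module that makes the DiT focus on human limb regions during denoising, producing stable and smooth try-on results even under complex poses.

Link: https://arxiv.org/abs/2412.09822
Authors: Jun Zheng,Jing Wang,Fuwei Zhao,Xujie Zhang,Xiaodan Liang
Keywords: tremendous real-world potential, real-world potential, Video try-on stands, promising area, tremendous real-world
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project Page: this https URL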

Abstract:Video try-on stands as a promising area for its tremendous real-world potential. Previous research on video try-on has primarily focused on transferring product clothing images to videos with simple human poses, while performing poorly with complex movements. To better preserve clothing details, those approaches are armed with an additional garment encoder, resulting in higher computational resource consumption. The primary challenges in this domain are twofold: (1) leveraging the garment encoder’s capabilities in video try-on while lowering computational requirements; (2) ensuring temporal consistency in the synthesis of human body parts, especially during rapid movements. To tackle these issues, we propose a novel video try-on framework based on Diffusion Transformer (DiT), named Dynamic Try-On. To reduce computational overhead, we adopt a straightforward approach by utilizing the DiT backbone itself as the garment encoder and employing a dynamic feature fusion module to store and integrate garment features. To ensure temporal consistency of human body parts, we introduce a limb-aware dynamic attention module that enforces the DiT backbone to focus on the regions of human limbs during the denoising process. Extensive experiments demonstrate the superiority of Dynamic Try-On in generating stable and smooth try-on results, even for videos featuring complicated human postures.

[CV-87] CP-DETR: Concept Prompt Guide DETR Toward Stronger Universal Object Detection AAAI2025

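[Quick read]: This paper addresses two main challenges that arise when language is introduced into universal object detection to generalize to open-set concepts: (i) how to use the prior information in prompts effectively to generalize objects, and (ii) how to reduce alignment bias in downstream tasks so that performance does not degrade in scenarios beyond pre-training. The key to the solution is a strong universal detection foundation model called CP-DETR. It features an efficient prompt visual hybrid encoder that strengthens the interaction between prompt and visual information through scale-by-scale and multi-scale fusion modules, and it uses a prompt multi-label loss and an auxiliary detection head to exploit the prompt information more fully. The paper also designs two practical concept prompt generation methods, visual prompt and optimized prompt, which extract abstract concepts from concrete visual examples and stably reduce alignment bias in downstream tasks. With these designs, CP-DETR shows strong universal detection performance across a wide range of scenarios.

Link: https://arxiv.org/abs/2412.09799
Authors: Qibo Chen,Weizhong Jin,Jianyue Ge,Mengdi Liu,Yuchao Yan,Jian Jiang,Li Yu,Xuanjiang Guo,Shuchang Li,Jianzhong Chen
Keywords: SoTA closed-set detector, Recent research, datasets for training, constructing large-scale, object detection aims
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Accepted by AAAI2025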

Abstract:Recent research on universal object detection aims to introduce language in a SoTA closed-set detector and then generalize the open-set concepts by constructing large-scale (text-region) datasets for training. However, these methods face two main challenges: (i) how to efficiently use the prior information in the prompts to genericise objects and (ii) how to reduce alignment bias in the downstream tasks, both leading to sub-optimal performance in some scenarios beyond pre-training. To address these challenges, we propose a strong universal detection foundation model called CP-DETR, which is competitive in almost all scenarios, with only one pre-training weight. Specifically, we design an efficient prompt visual hybrid encoder that enhances the information interaction between prompt and visual through scale-by-scale and multi-scale fusion modules. Then, the hybrid encoder is facilitated to fully utilize the prompted information by prompt multi-label loss and auxiliary detection head. In addition to text prompts, we have designed two practical concept prompt generation methods, visual prompt and optimized prompt, to extract abstract concepts through concrete visual examples and stably reduce alignment bias in downstream tasks. With these effective designs, CP-DETR demonstrates superior universal detection performance in a broad spectrum of scenarios. For example, our Swin-T backbone model achieves 47.6 zero-shot AP on LVIS, and the Swin-L backbone model achieves 32.2 zero-shot AP on ODinW35. Furthermore, our visual prompt generation method achieves 68.4 AP on COCO val by interactive detection, and the optimized prompt achieves 73.1 fully-shot AP on ODinW13.

[CV-88] Is it the model or the metric – On robustness measures of deeplearning models

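[Quick read]: This paper addresses the robustness of deep learning models in automated decision-making systems, specifically in the context of deepfake detection. The key to the solution is the robust ratio (RR), a complementary metric that quantifies how a model's normalized or probability outputs change under input perturbation. Comparing robust accuracy (RA) with RR, the study finds that models with similar RA can show different RR at different tolerance (perturbation) levels, which exposes the limitations of relying on RA alone and offers an additional evaluation dimension for safe model deployment.

Link: https://arxiv.org/abs/2412.09795
Authors: Zhijin Lyu,Yutong Jin,Sneha Das
Keywords: automated decision-making systems, deep learning models, advanced deep learning, decision-making systems, deep learning
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments: Extended abstract at Northern Lights Deep Learning (NLDL) Conference 2025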

Abstract:Determining the robustness of deep learning models is an established and ongoing challenge within automated decision-making systems. With the advent and success of techniques that enable advanced deep learning (DL), these models are being used in widespread applications, including high-stake ones like healthcare, education, border-control. Therefore, it is critical to understand the limitations of these models and predict their regions of failures, in order to create the necessary guardrails for their successful and safe deployment. In this work, we revisit robustness, specifically investigating the sufficiency of robust accuracy (RA), within the context of deepfake detection. We present robust ratio (RR) as a complementary metric, that can quantify the changes to the normalized or probability outcomes under input perturbation. We present a comparison of RA and RR and demonstrate that despite similar RA between models, the models show varying RR under different tolerance (perturbation) levels.
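
The contrast between RA and RR can be sketched with a toy example. The abstract does not give a formula for RR, so the definition used here (the fraction of samples whose true-class probability shifts by at most a tolerance under perturbation) is an assumption made purely for illustration, and the probabilities are hypothetical toy values rather than results from the paper.

```python
import numpy as np


def robust_accuracy(probs_perturbed, labels):
    """Fraction of perturbed inputs that are still classified correctly."""
    return float(np.mean(probs_perturbed.argmax(axis=1) == labels))


def robust_ratio(probs_clean, probs_perturbed, labels, tolerance):
    """Assumed RR: fraction of samples whose true-class probability moves
    by at most `tolerance` when the input is perturbed."""
    idx = np.arange(len(labels))
    shift = np.abs(probs_clean[idx, labels] - probs_perturbed[idx, labels])
    return float(np.mean(shift <= tolerance))


if __name__ == "__main__":
    labels = np.array([0, 1, 0, 1])
    clean = np.array([[0.90, 0.10], [0.20, 0.80], [0.95, 0.05], [0.10, 0.90]])
    # Hypothetical model A: outputs barely move under perturbation.
    pert_a = np.array([[0.88, 0.12], [0.22, 0.78], [0.93, 0.07], [0.12, 0.88]])
    # Hypothetical model B: labels survive, but confidence collapses.
    pert_b = np.array([[0.55, 0.45], [0.45, 0.55], [0.60, 0.40], [0.40, 0.60]])
    for name, pert in (("A", pert_a), ("B", pert_b)):
        print(name, robust_accuracy(pert, labels),
              robust_ratio(clean, pert, labels, tolerance=0.05))
```

Both toy models keep every label correct, so their RA is identical, yet only model A keeps its probabilities within the tolerance. This is the kind of difference a probability-level metric can expose and a label-level metric cannot.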

[CV-89] EI-Drive: A Platform for Cooperative Perception with Realistic Communication Models

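[Quick read]: This paper addresses the fact that existing cooperative perception studies do not adequately account for transmission latency and errors in real-world environments. The key to the solution is EI-Drive, an edge-AI-based autonomous driving simulation platform built on the CARLA framework that integrates advanced cooperative perception modules with more realistic communication models in order to simulate transmission latency and errors. By fusing information from multiple data sources, EI-Drive improves vehicles' situational awareness and safety in complex environments, and its modular design supports detailed exploration of sensing, perception, planning, and control across a variety of cooperative driving scenarios.

Link: https://arxiv.org/abs/2412.09782
Authors: Hanchu Zhou,Edward Xie,Wei Shao,Dechen Gao,Michelle Dong,Junshan Zhang
Keywords: accurately simulating cooperative, cooperative perception, simulating cooperative perception, cooperative perception process, autonomous driving calls
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV); Multiagent Systems (cs.MA)
Comments: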

Abstract:The growing interest in autonomous driving calls for realistic simulation platforms capable of accurately simulating the cooperative perception process in realistic traffic scenarios. Existing studies for cooperative perception often have not accounted for transmission latency and errors in real-world environments. To address this gap, we introduce EI-Drive, an edge-AI based autonomous driving simulation platform that integrates advanced cooperative perception with more realistic communication models. Built on the CARLA framework, EI-Drive features new modules for cooperative perception while taking into account transmission latency and errors, providing a more realistic platform for evaluating cooperative perception algorithms. In particular, the platform enables vehicles to fuse data from multiple sources, improving situational awareness and safety in complex environments. With its modular design, EI-Drive allows for detailed exploration of sensing, perception, planning, and control in various cooperative driving scenarios. Experiments using EI-Drive demonstrate significant improvements in vehicle safety and performance, particularly in scenarios with complex traffic flow and network conditions. All code and documents are accessible on our GitHub page: this https URL.

[CV-90] A Differentiable Wave Optics Model for End-to-End Computational Imaging System Optimization

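[Quick read]: This paper addresses the difficulty, in computational imaging system design, of modeling aberration and diffraction together within end-to-end optimization because of the high computational cost. The key to the solution is a differentiable optics simulator that efficiently models both aberration and diffraction, enabling end-to-end optimization of compound optical systems. Using the simulator, the study presents optimization results for scene reconstruction and classification and confirms the importance of accounting for wave optics effects during optimization to ensure robust, high-performance systems in practice.

Link: https://arxiv.org/abs/2412.09774
Authors: Chi-Jui Ho,Yash Belhe,Steve Rotenberg,Ravi Ramamoorthi,Tzu-Mao Li,Nicholas Antipa
Keywords: powerful data-driven method, simultaneously optimizes optics, optics, wave optics, simultaneously optimizes
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: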

Abstract:End-to-end optimization, which simultaneously optimizes optics and algorithms, has emerged as a powerful data-driven method for computational imaging system design. This method achieves joint optimization through backpropagation by incorporating differentiable optics simulators to generate measurements and algorithms to extract information from measurements. However, due to high computational costs, it is challenging to model both aberration and diffraction in light transport for end-to-end optimization of compound optics. Therefore, most existing methods compromise physical accuracy by neglecting wave optics effects or off-axis aberrations, which raises concerns about the robustness of the resulting designs. In this paper, we propose a differentiable optics simulator that efficiently models both aberration and diffraction for compound optics. Using the simulator, we conduct end-to-end optimization on scene reconstruction and classification. Experimental results demonstrate that both lenses and algorithms adopt different configurations depending on whether wave optics is modeled. We also show that systems optimized without wave optics suffer from performance degradation when wave optics effects are introduced during testing. These findings underscore the importance of accurate wave optics modeling in optimizing imaging systems for robust, high-performance applications.

[CV-91] Acquisition of Spatially-Varying Reflectance and Surface Normals via Polarized Reflectance Fields

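[Quick read]: This paper addresses the measurement of geometry and spatially-varying reflectance for objects with complex shapes (for example concave features, hollow engravings, and diverse surfaces). The task is difficult because shape complexity causes inter-reflection and occlusion, and because of hardware limitations such as lens flare and overexposure. The key to the solution is polarized reflectance field capture combined with a comprehensive statistical analysis algorithm, which together yield highly accurate surface normals and spatially-varying reflectance data, including albedo, specular separation, roughness, and anisotropy parameters. The algorithm removes image artifacts via analytical modeling and uses an initial step and an optimization step computed over the whole image collection to further improve the precision of per-pixel reflectance and normal measurements.

Link: https://arxiv.org/abs/2412.09772
Authors: Jing Yang,Pratusha Bhuvana Prasad,Qing Zhang,Yajie Zhao
Keywords: complex task due, Accurately measuring, intricate shapes formed, concave features, hollow engravings
Categories: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
Comments: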

Abstract:Accurately measuring the geometry and spatially-varying reflectance of real-world objects is a complex task due to their intricate shapes formed by concave features, hollow engravings and diverse surfaces, resulting in inter-reflection and occlusion when photographed. Moreover, issues like lens flare and overexposure can arise from interference from secondary reflections and limitations of hardware even in professional studios. In this paper, we propose a novel approach using polarized reflectance field capture and a comprehensive statistical analysis algorithm to obtain highly accurate surface normals (within 0.1mm/px) and spatially-varying reflectance data, including albedo, specular separation, roughness, and anisotropy parameters for realistic rendering and analysis. Our algorithm removes image artifacts via analytical modeling and further employs both an initial step and an optimization step computed on the whole image collection to further enhance the precision of per-pixel surface reflectance and normal measurement. We showcase the captured shapes and reflectance of diverse objects with a wide material range, spanning from highly diffuse to highly glossy - a challenge unaddressed by prior techniques. Our approach enhances downstream applications by offering precise measurements for realistic rendering and provides a valuable training dataset for emerging research in inverse rendering. We will release the polarized reflectance fields of several captured objects with this work.

[CV-92] L-WISE: Boosting Human Image Category Learning Through Model-Based Image Selection And Enhancement

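[Quick read]: This paper addresses how to augment visual learning so that humans become more accurate on visual categorization tasks. The key to the solution is to use artificial neural network (ANN) models to guide image perturbation and image selection. Two strategies are used: (i) selecting images based on their model-estimated recognition difficulty, and (ii) applying image perturbations that aid novice learners. These model-based strategies raise test-time categorization accuracy by 33-72% relative to control subjects and also shorten training time by 20-23%. The approach is validated in challenging domains including natural images, histology, and dermoscopy.

Link: https://arxiv.org/abs/2412.09765
Authors: Morgan B. Talbot,Gabriel Kreiman,James J. DiCarlo,Guy Gaziv
Keywords: artificial neural network, leading artificial neural, visual ventral stream, neural network, ventral stream
Categories: Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC)
Comments: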

Abstract:The currently leading artificial neural network (ANN) models of the visual ventral stream – which are derived from a combination of performance optimization and robustification methods – have demonstrated a remarkable degree of behavioral alignment with humans on visual categorization tasks. Extending upon previous work, we show that not only can these models guide image perturbations that change the induced human category percepts, but they also can enhance human ability to accurately report the original ground truth. Furthermore, we find that the same models can also be used out-of-the-box to predict the proportion of correct human responses to individual images, providing a simple, human-aligned estimator of the relative difficulty of each image. Motivated by these observations, we propose to augment visual learning in humans in a way that improves human categorization accuracy at test time. Our learning augmentation approach consists of (i) selecting images based on their model-estimated recognition difficulty, and (ii) using image perturbations that aid recognition for novice learners. We find that combining these model-based strategies gives rise to test-time categorization accuracy gains of 33-72% relative to control subjects without these interventions, despite using the same number of training feedback trials. Surprisingly, beyond the accuracy gain, the training time for the augmented learning group was also shorter by 20-23%. We demonstrate the efficacy of our approach in a fine-grained categorization task with natural images, as well as tasks in two clinically relevant image domains – histology and dermoscopy – where visual learning is notoriously challenging. To the best of our knowledge, this is the first application of ANNs to increase visual learning performance in humans by enhancing category-specific features.

[CV-93] ViCaS: A Dataset for Combining Holistic and Pixel-level Video Understanding using Captions with Grounded Segmentation

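[Quick read]: This paper addresses the separation, in video understanding with multimodal large language models (MLLMs), between research on high-level tasks (such as video captioning and question answering) and research on dense, pixel-level segmentation tasks (such as category-guided or referral-based object segmentation). The key to the solution is ViCaS, a new dataset containing thousands of challenging videos, each annotated with detailed human-written captions and temporally consistent, pixel-accurate masks for multiple objects with phrase grounding. With this dataset, the paper evaluates models on both holistic, high-level understanding and language-guided, pixel-precise segmentation. It also presents carefully validated evaluation measures and an effective model architecture for the new benchmark.

Link: https://arxiv.org/abs/2412.09754
Authors: Ali Athar,Xueqing Deng,Liang-Chieh Chen
Keywords: multimodal large language, Recent advances, large language models, primarily focusing, captioning and question-answering
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: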

Abstract:Recent advances in multimodal large language models (MLLMs) have expanded research in video understanding, primarily focusing on high-level tasks such as video captioning and question-answering. Meanwhile, a smaller body of work addresses dense, pixel-precise segmentation tasks, which typically involve category-guided or referral-based object segmentation. Although both research directions are essential for developing models with human-level video comprehension, they have largely evolved separately, with distinct benchmarks and architectures. This paper aims to unify these efforts by introducing ViCaS, a new dataset containing thousands of challenging videos, each annotated with detailed, human-written captions and temporally consistent, pixel-accurate masks for multiple objects with phrase grounding. Our benchmark evaluates models on both holistic/high-level understanding and language-guided, pixel-precise segmentation. We also present carefully validated evaluation measures and propose an effective model architecture that can tackle our benchmark. Project page: this https URL

[CV-94] On Round-Off Errors and Gaussian Blur in Superresolution and in Image Registration

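[Quick read]: This paper addresses the recovery of a signal from different sets of samples in the presence of blur and noise. The key to the solution is discrete image registration, together with a signal-dependent measurement matrix that captures the effects of Gaussian or mixture-of-Gaussians blur and of round-off errors. The paper proposes an approach based on dynamic programming that, under certain conditions on the blur, the noise, and the spacing between discontinuity points, correctly aligns two data sequences and determines the first sample following each discontinuity point, which enables effective inference about the signal amplitudes and discontinuity points.

Link: https://arxiv.org/abs/2412.09741
Authors: Serap A. Savari
Keywords: Discrete image registration, theory and techniques, techniques seek, seek to recover, recover signals
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: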

Abstract:Superresolution theory and techniques seek to recover signals from samples in the presence of blur and noise. Discrete image registration can be an approach to fuse information from different sets of samples of the same signal. Quantization errors in the spatial domain are inherent to digital images. We consider superresolution and discrete image registration for one-dimensional spatially-limited piecewise constant functions which are subject to blur which is Gaussian or a mixture of Gaussians as well as to round-off errors. We describe a signal-dependent measurement matrix which captures both types of effects. For this setting we show the difficulties in determining the discontinuity points from two sets of samples even in the absence of other types of noise. If the samples are also subject to statistical noise, then it is necessary to align and segment the data sequences to make the most effective inferences about the amplitudes and discontinuity points. Under some conditions on the blur, the noise, and the distance between discontinuity points, we prove that we can correctly align and determine the first samples following each discontinuity point in two data sequences with an approach based on dynamic programming.
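
A toy computation can show why round-off makes discontinuity points hard to locate even without other noise. The sketch below samples a Gaussian-blurred step at integer positions and rounds each sample to the nearest integer. The amplitudes, blur width, sampling grid, and function name are all assumptions chosen for the example, not values from the paper.

```python
import math


def blurred_step_samples(jump_at, sigma, low=0.0, high=10.0, n=6):
    """Sample a Gaussian-blurred step at x = 0..n-1, then round to integers.

    Blurring an ideal step at `jump_at` with a Gaussian of standard
    deviation `sigma` gives a scaled Gaussian CDF centred on the jump.
    """
    samples = []
    for x in range(n):
        cdf = 0.5 * (1.0 + math.erf((x - jump_at) / (sigma * math.sqrt(2.0))))
        samples.append(round(low + (high - low) * cdf))
    return samples


if __name__ == "__main__":
    print(blurred_step_samples(jump_at=2.3, sigma=0.1))
    print(blurred_step_samples(jump_at=2.6, sigma=0.1))
```

Both calls yield the same rounded sequence, so these samples alone cannot tell a jump at 2.3 from one at 2.6. A second, shifted set of samples can narrow the interval, which is where registration between sample sets becomes useful.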

[CV-95] Agtech Framework for Cranberry-Ripening Analysis Using Vision Foundation Models

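[Quick read]: This paper addresses the characterization of the cranberry ripening process through quantitative visual evaluation, a key component of precision agriculture tasks such as comparing crop varieties and detecting disease. The key to the solution is to combine aerial (drone) imaging and ground-based imaging to acquire a multi-week time series covering the entire growing season. Aerial imaging provides multiple samples for computing a distribution of albedo values, while ground imaging tracks changes in the appearance of individual berries using fixed fiducial markers. Features are extracted with vision transformers (ViT), and UMAP dimensionality reduction is then used to build a 2D manifold of cranberry appearance, which allows ripening paths and ripening rate to be quantified. The approach gives plant biologists and growers interpretable appearance descriptors and supports crop breeding decisions.

Link: https://arxiv.org/abs/2412.09739
Authors: Faith Johnson,Ryan Meegan,Jack Lowry,Peter Oudemans,Kristin Dana
Keywords: quantitative visual evaluation, Agricultural domains, support quantitative visual, visual evaluation, transformed by recent
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: arXiv admin note: substantial text overlap with arXiv:2309.00028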

Abstract:Agricultural domains are being transformed by recent advances in AI and computer vision that support quantitative visual evaluation. Using aerial and ground imaging over a time series, we develop a framework for characterizing the ripening process of cranberry crops, a crucial component for precision agriculture tasks such as comparing crop breeds (high-throughput phenotyping) and detecting disease. Using drone imaging, we capture images from 20 waypoints across multiple bogs, and using ground-based imaging (hand-held camera), we image the same bog patch using fixed fiducial markers. Both imaging methods are repeated to gather a multi-week time series spanning the entire growing season. Aerial imaging provides multiple samples to compute a distribution of albedo values. Ground imaging enables tracking of individual berries for a detailed view of berry appearance changes. Using vision transformers (ViT) for feature detection after segmentation, we extract a high dimensional feature descriptor of berry appearance. Interpretability of appearance is critical for plant biologists and cranberry growers to support crop breeding decisions (e.g., comparison of berry varieties from breeding programs). For interpretability, we create a 2D manifold of cranberry appearance by using a UMAP dimensionality reduction on ViT features. This projection enables quantification of ripening paths and a useful metric of ripening rate. We demonstrate the comparison of four cranberry varieties based on our ripening assessments. This work is the first of its kind and has future impact for cranberries and for other crops including wine grapes, olives, blueberries, and maize. Aerial and ground datasets are made publicly available.

[CV-96] Double-Exponential Increases in Inference Energy: The Cost of the Race for Accuracy

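[Quick read]: This paper addresses the energy efficiency of deep learning models for computer vision at inference time, in particular the sustainability challenges raised by the pursuit of marginal accuracy gains. The key to the solution is a comprehensive analysis of the inference energy consumption of 1,200 ImageNet classification models, which reveals diminishing returns in accuracy as energy usage increases and identifies the key factors that drive energy consumption. The paper demonstrates methods to improve energy efficiency and introduces an energy efficiency scoring system and an interactive web application that lets users compare models by accuracy and energy consumption, with the aim of promoting more sustainable AI practice and decision-making.

Link: https://arxiv.org/abs/2412.09731
Authors: Zeyu Yang,Karel Adamek,Wesley Armour
Keywords: Deep learning models, achieved significant success, Deep learning, pose increasing concerns, energy consumption
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: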

Abstract:Deep learning models in computer vision have achieved significant success but pose increasing concerns about energy consumption and sustainability. Despite these concerns, there is a lack of comprehensive understanding of their energy efficiency during inference. In this study, we conduct a comprehensive analysis of the inference energy consumption of 1,200 ImageNet classification models - the largest evaluation of its kind to date. Our findings reveal a steep diminishing return in accuracy gains relative to the increase in energy usage, highlighting sustainability concerns in the pursuit of marginal improvements. We identify key factors contributing to energy consumption and demonstrate methods to improve energy efficiency. To promote more sustainable AI practices, we introduce an energy efficiency scoring system and develop an interactive web application that allows users to compare models based on accuracy and energy consumption. By providing extensive empirical data and practical tools, we aim to facilitate informed decision-making and encourage collaborative efforts in developing energy-efficient AI technologies.
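
One simple way to compare models on accuracy and energy at the same time is to keep only those that are not dominated on both axes. The sketch below is a generic Pareto-front filter and is not the paper's scoring system, which the abstract does not describe. The model names and numbers are hypothetical toy values.

```python
def pareto_front(models):
    """Return the names of models that no other model dominates.

    `models` is a list of (name, accuracy, energy_joules) tuples. A model
    is dominated if another one is at least as accurate and uses at most
    as much energy, and is strictly better on at least one of the two.
    """
    front = []
    for name, acc, energy in models:
        dominated = any(
            (a >= acc and e <= energy) and (a > acc or e < energy)
            for _, a, e in models
        )
        if not dominated:
            front.append(name)
    return front


if __name__ == "__main__":
    # Hypothetical (accuracy, energy per inference) values for illustration.
    toy_models = [
        ("small", 0.70, 1.0),
        ("medium", 0.78, 3.0),
        ("wasteful", 0.77, 9.0),
        ("large", 0.80, 12.0),
    ]
    print(pareto_front(toy_models))
```

In the toy data the step from "medium" to "large" buys two accuracy points for four times the energy, which is the shape of trade-off a diminishing-returns analysis is meant to surface.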

[CV-97] The Unreasonable Effectiveness of Gaussian Score Approximation for Diffusion Models and its Applications

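[Quick read]: This paper addresses the relationship between the score function learned by a diffusion model and the score of the underlying data manifold. The key to the solution is to compare learned neural scores with the scores of two analytically tractable kinds of distribution, Gaussians and Gaussian mixtures. The study shows that at moderate to high noise scales the learned neural score is dominated by its linear (Gaussian) approximation, that this approximation holds over a wider range of noise scales than theory suggests, and that it is learned preferentially early in training. At smaller noise scales, the learned score is better described by a coarse-grained (Gaussian mixture) approximation of the training data than by the score of the training distribution. Building on these findings, the paper proposes a new hybrid sampling method, called analytical teleportation, which accelerates sampling by skipping the first 15-30% of sampling steps while maintaining high sample quality.

Link: https://arxiv.org/abs/2412.09726
Authors: Binxu Wang,John J. Vastola
Keywords: smoothed data distributions, iteratively generate samples, Gaussian, learning the gradient, gradient of smoothed
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: 69 pages, 34 figures. Published in TMLR. Previous shorter versions at arxiv.org/abs/2303.02490 and arxiv.org/abs/2311.10892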

Abstract:By learning the gradient of smoothed data distributions, diffusion models can iteratively generate samples from complex distributions. The learned score function enables their generalization capabilities, but how the learned score relates to the score of the underlying data manifold remains largely unclear. Here, we aim to elucidate this relationship by comparing learned neural scores to the scores of two kinds of analytically tractable distributions: Gaussians and Gaussian mixtures. The simplicity of the Gaussian model makes it theoretically attractive, and we show that it admits a closed-form solution and predicts many qualitative aspects of sample generation dynamics. We claim that the learned neural score is dominated by its linear (Gaussian) approximation for moderate to high noise scales, and supply both theoretical and empirical arguments to support this claim. Moreover, the Gaussian approximation empirically works for a larger range of noise scales than naive theory suggests it should, and is preferentially learned early in training. At smaller noise scales, we observe that learned scores are better described by a coarse-grained (Gaussian mixture) approximation of training data than by the score of the training distribution, a finding consistent with generalization. Our findings enable us to precisely predict the initial phase of trained models’ sampling trajectories through their Gaussian approximations. We show that this allows the skipping of the first 15-30% of sampling steps while maintaining high sample quality (with a near state-of-the-art FID score of 1.93 on CIFAR-10 unconditional generation). This forms the foundation of a novel hybrid sampling method, termed analytical teleportation, which can seamlessly integrate with and accelerate existing samplers, including DPM-Solver-v3 and UniPC. Our findings suggest ways to improve the design and training of diffusion models.
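
The closed-form Gaussian score mentioned in the abstract can be written down directly in one common setting. The sketch below assumes data distributed as N(mean, cov) and additive noise x = x0 + sigma * eps, so the noised distribution is N(mean, cov + sigma^2 I) and its score is -(cov + sigma^2 I)^{-1} (x - mean). The noising convention and the function names are assumptions for the example; the paper's own parameterization may differ.

```python
import numpy as np


def gaussian_score(x, mean, cov, sigma):
    """Score (gradient of the log density) of N(mean, cov + sigma^2 I) at x."""
    noisy_cov = cov + sigma**2 * np.eye(len(mean))
    return -np.linalg.solve(noisy_cov, x - mean)


def log_density(x, mean, cov, sigma):
    """Log density of N(mean, cov + sigma^2 I), used only to check the score."""
    noisy_cov = cov + sigma**2 * np.eye(len(mean))
    diff = x - mean
    _, logdet = np.linalg.slogdet(2.0 * np.pi * noisy_cov)
    return -0.5 * (diff @ np.linalg.solve(noisy_cov, diff) + logdet)


if __name__ == "__main__":
    mean = np.array([1.0, -2.0])
    cov = np.array([[2.0, 0.5], [0.5, 1.0]])
    x = np.array([0.3, 0.4])
    print(gaussian_score(x, mean, cov, sigma=0.7))
```

The score is linear in x, which is why the abstract refers to the Gaussian model as a linear approximation. As sigma grows, cov + sigma^2 I approaches sigma^2 I and the score tends toward -(x - mean) / sigma^2.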

[CV-98] MAC-Ego3D: Multi-Agent Gaussian Consensus for Real-Time Collaborative Ego-Motion and Photorealistic 3D Reconstruction

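[Quick read]: This paper addresses real-time multi-agent collaborative ego-motion estimation and high-fidelity 3D reconstruction, where traditional methods produce sparse, low-detail maps and dense mapping methods suffer from high latency. The key to the solution is the MAC-Ego3D framework, which achieves real-time collaborative photorealistic 3D reconstruction through Multi-Agent Gaussian Consensus. The framework lets agents independently construct, align, and iteratively refine local maps using a unified Gaussian splat representation. Intra-Agent Gaussian Consensus enforces local spatial coherence, while parallelized Inter-Agent Gaussian Consensus performs global alignment and optimization, integrating the agents' local maps into a single high-fidelity 3D model. The method markedly improves efficiency, reduces localization error, and raises mapping fidelity.

Link: https://arxiv.org/abs/2412.09723
Authors: Xiaohao Xu,Feng Xue,Shibo Zhao,Yike Pan,Sebastian Scherer,Xiaonan Huang
Keywords: Gaussian Consensus, scalable spatial intelligence, Multi-Agent Gaussian Consensus, Real-time multi-agent collaboration, Gaussian
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 27 pages, 25 figures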

Abstract:Real-time multi-agent collaboration for ego-motion estimation and high-fidelity 3D reconstruction is vital for scalable spatial intelligence. However, traditional methods produce sparse, low-detail maps, while recent dense mapping approaches struggle with high latency. To overcome these challenges, we present MAC-Ego3D, a novel framework for real-time collaborative photorealistic 3D reconstruction via Multi-Agent Gaussian Consensus. MAC-Ego3D enables agents to independently construct, align, and iteratively refine local maps using a unified Gaussian splat representation. Through Intra-Agent Gaussian Consensus, it enforces spatial coherence among neighboring Gaussian splats within an agent. For global alignment, parallelized Inter-Agent Gaussian Consensus, which asynchronously aligns and optimizes local maps by regularizing multi-agent Gaussian splats, seamlessly integrates them into a high-fidelity 3D model. Leveraging Gaussian primitives, MAC-Ego3D supports efficient RGB-D rendering, enabling rapid inter-agent Gaussian association and alignment. MAC-Ego3D bridges local precision and global coherence, delivering higher efficiency, substantially reducing localization error, and improving mapping fidelity. It establishes a new SOTA on synthetic and real-world benchmarks, achieving a 15x increase in inference speed, order-of-magnitude reductions in ego-motion estimation error in some cases, and RGB PSNR gains of 4 to 10 dB. Our code will be made publicly available at this https URL.
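
To put the reported PSNR gains in context, the following sketch computes PSNR with the standard definition, 10 * log10(MAX^2 / MSE). The images are synthetic placeholders and the function name is hypothetical; it assumes pixel values in [0, 1] and a non-zero error.

```python
import numpy as np

def psnr(reference, rendered, max_value=1.0):
    """Peak signal-to-noise ratio in dB; assumes the two images differ (MSE > 0)."""
    mse = np.mean((reference - rendered) ** 2)
    return 10.0 * np.log10(max_value**2 / mse)

# Synthetic images: a constant reference and two renderings with uniform error.
reference = np.full((4, 4, 3), 0.5)
coarse = reference + 0.1    # larger per-pixel error
fine = reference + 0.01     # per-pixel error reduced tenfold
gain_db = psnr(reference, fine) - psnr(reference, coarse)
```

Since PSNR is logarithmic in the mean squared error, a gain of 10 dB corresponds to a tenfold reduction in MSE.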

[CV-99] BayesAdapter: enhanced uncertainty estimation in CLIP few-shot adaptation

[Quick read]: This paper addresses the weak uncertainty estimation of existing CLIP adapters, which matters for safe deployment in real-world applications. The key to the solution is BayesAdapter, which uses Bayesian inference to estimate a full probability distribution rather than a single point estimate, better capturing the variability in the parameter space. The approach yields high-quality uncertainty estimates, particularly in model calibration and selective classification.

Link: https://arxiv.org/abs/2412.09718
Authors: Pablo Morales-Álvarez,Stergios Christodoulidis,Maria Vakalopoulou,Pablo Piantanida,Jose Dolz
Keywords: pre-trained vision-language models, visual recognition tasks, large pre-trained vision-language, represents a paradigm, pre-trained vision-language
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 30 pages, 5 figures, 23 tables

Abstract:The emergence of large pre-trained vision-language models (VLMs) represents a paradigm shift in machine learning, with unprecedented results in a broad span of visual recognition tasks. CLIP, one of the most popular VLMs, has exhibited remarkable zero-shot and transfer learning capabilities in classification. To transfer CLIP to downstream tasks, adapters constitute a parameter-efficient approach that avoids backpropagation through the large model (unlike related prompt learning methods). However, CLIP adapters have been developed to target discriminative performance, and the quality of their uncertainty estimates has been overlooked. In this work we show that the discriminative performance of state-of-the-art CLIP adapters does not always correlate with their uncertainty estimation capabilities, which are essential for safe deployment in real-world scenarios. We also demonstrate that one such adapter is obtained through MAP inference from a more general probabilistic framework. Based on this observation we introduce BayesAdapter, which leverages Bayesian inference to estimate a full probability distribution instead of a single point, better capturing the variability inherent in the parameter space. In a comprehensive empirical evaluation we show that our approach obtains high-quality uncertainty estimates in the predictions, standing out in calibration and selective classification. Our code is publicly available at: this https URL.
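
Calibration is commonly summarized with the Expected Calibration Error (ECE). The abstract does not say which calibration metric the authors report, so the sketch below is only an assumed illustration of the general idea: bin predictions by confidence and average the gap between accuracy and confidence, weighted by bin size. The function name and toy values are hypothetical.

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """Weighted average of |accuracy - confidence| over equal-width confidence bins."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (confidences > lo) & (confidences <= hi)
        if in_bin.any():
            gap = abs(correct[in_bin].mean() - confidences[in_bin].mean())
            ece += in_bin.mean() * gap  # in_bin.mean() is the fraction of samples in the bin
    return ece

# Toy predictions: top-class confidence and whether the prediction was right.
conf = [0.95, 0.95, 0.95, 0.95, 0.55, 0.55]
hit = [1, 1, 1, 0, 1, 0]
ece = expected_calibration_error(conf, hit)
```

A perfectly calibrated model has an ECE of zero: among predictions made with 95% confidence, 95% are correct.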

[CV-100] Diffusion-Enhanced Test-time Adaptation with Text and Image Augmentation

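[Quick read]: This paper addresses the limitation of existing Test-Time Prompt Tuning (TPT) methods to single-modality data, and in particular their sharp performance drop when the number of generated augmented images is limited. The key to the solution is IT3A, a test-time adaptation method that uses pre-trained generative models for multi-modal augmentation. By combining augmented data from pre-trained vision and language models, IT3A improves the model's ability to adapt to unknown new test data. Cosine-similarity filtering between the logits of the augmented images and text and those of the original test data ensures that key semantics are preserved and that unsuitable augmentation combinations are discarded. To use text templates more flexibly, the paper also replaces prompt tuning with an adapter. Experiments show that, in a zero-shot setting on test datasets with distribution shifts and domain gaps, IT3A improves accuracy by 5.50% over existing TPT methods.

Link: https://arxiv.org/abs/2412.09706
Authors: Chun-Mei Feng,Yuanyang He,Jian Zou,Salman Khan,Huan Xiong,Zhen Li,Wangmeng Zuo,Rick Siow Mong Goh,Yong Liu
Keywords: primarily enhancing images, Existing test-time prompt, Existing test-time, primarily enhancing, confidence ratings
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by International Journal of Computer Vision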

Abstract:Existing test-time prompt tuning (TPT) methods focus on single-modality data, primarily enhancing images and using confidence ratings to filter out inaccurate images. However, while image generation models can produce visually diverse images, single-modality data enhancement techniques still fail to capture the comprehensive knowledge provided by different modalities. Additionally, we note that the performance of TPT-based methods drops significantly when the number of augmented images is limited, which is not unusual given the computational expense of generative augmentation. To address these issues, we introduce IT3A, a novel test-time adaptation method that utilizes a pre-trained generative model for multi-modal augmentation of each test sample from unknown new domains. By combining augmented data from pre-trained vision and language models, we enhance the ability of the model to adapt to unknown new test data. Additionally, to ensure that key semantics are accurately retained when generating various visual and text enhancements, we employ cosine similarity filtering between the logits of the enhanced images and text and those of the original test data. This process allows us to filter out spurious augmentations and inadequate combinations. To leverage the diverse enhancements provided by the generation model across different modalities, we replace prompt tuning with an adapter for greater flexibility in utilizing text templates. Our experiments on test datasets with distribution shifts and domain gaps show that in a zero-shot setting, IT3A outperforms state-of-the-art test-time prompt tuning methods with a 5.50% increase in accuracy.
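
The cosine-similarity filtering step can be sketched as follows. This is only an illustration under assumptions: the threshold value, the function name, and the toy logits are hypothetical, and the paper's actual filtering rule may differ.

```python
import numpy as np

def filter_augmentations(original_logits, augmented_logits, threshold=0.8):
    """Keep augmented views whose logits are cosine-similar to the original sample's logits."""
    orig = original_logits / np.linalg.norm(original_logits)
    aug = augmented_logits / np.linalg.norm(augmented_logits, axis=1, keepdims=True)
    similarity = aug @ orig              # one cosine similarity per augmented view
    keep = similarity >= threshold       # threshold is an assumed hyperparameter
    return keep, similarity

# Toy logits over three classes for one test sample and three augmented views.
original = np.array([4.0, 1.0, 0.0])
augmented = np.array([
    [3.5, 1.2, 0.1],    # agrees with the original prediction
    [0.0, 0.5, 4.0],    # semantics drifted to another class
    [4.2, 0.8, -0.2],   # agrees with the original prediction
])
keep, sim = filter_augmentations(original, augmented)
```

Only the views flagged in `keep` would then contribute to the test-time adaptation objective.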

[CV-101] Soybean Maturity Prediction using 2D Contour Plots from Drone based Time Series Imagery

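[Quick read]: This paper addresses the subjectivity and time cost of assessing maturity by manual visual inspection in traditional soybean breeding. The key to the solution is a machine-learning model based on UAV time-series imagery: contour plots extracted from UAV RGB images serve as input to a deep learning model that predicts relative maturity ratings. By encoding temporal and spatial variation into a single image, the method improves prediction accuracy and robustness, reaching 85% accuracy, and it retains predictive accuracy when the number of time points is reduced. The result is a scalable, objective, and efficient way to assess soybean maturity that reduces reliance on manual inspection and subjective assessment.

Link: https://arxiv.org/abs/2412.09696
Authors: Bitgoeul Kim,Samuel W. Blair,Talukder Z. Jubery,Soumik Sarkar,Arti Singh,Asheesh K. Singh,Baskar Ganapathysubramanian
Keywords: Plant breeding programs, Plant breeding, breeding programs require, maturity, accurate selection
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: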

Abstract:Plant breeding programs require assessments of days to maturity for accurate selection and placement of entries in appropriate tests. In the early stages of the breeding pipeline, soybean breeding programs assign relative maturity ratings to experimental varieties that indicate their suitable maturity zones. Traditionally, the estimation of maturity value for breeding varieties has involved breeders manually inspecting fields and assessing maturity value visually. This approach relies heavily on rater judgment, making it subjective and time-consuming. This study aimed to develop a machine-learning model for evaluating soybean maturity using UAV-based time-series imagery. Images were captured at three-day intervals, beginning as the earliest varieties started maturing and continuing until the last varieties fully matured. The data for this experiment consisted of 22,043 plots collected across three years (2021 to 2023), representing relative maturity groups 1.6 to 3.9. We utilized contour plot images extracted from the time-series UAV RGB imagery as input for a neural network model. This contour plot approach encoded the temporal and spatial variation within each plot into a single image. A deep learning model was trained to utilize this contour plot to predict maturity ratings. This model significantly improves accuracy and robustness, achieving up to 85% accuracy. We also evaluate the model’s accuracy as we reduce the number of time points, quantifying the trade-off between temporal resolution and maturity prediction. The predictive model offers a scalable, objective, and efficient means of assessing crop maturity, enabling phenomics and ML approaches to reduce the reliance on manual inspection and subjective assessment. This approach enables the automatic prediction of relative maturity ratings in a breeding program, saving time and resources.

[CV-102] Omni-ID: Holistic Identity Representation Designed for Generative Tasks

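[Quick read]: This paper addresses the need for comprehensive and diverse facial representations in generative tasks and proposes a new facial representation called Omni-ID. The key to the solution is a few-to-many identity reconstruction training paradigm, in which a limited set of input images is used to reconstruct multiple target images of the same person in different poses and expressions, combined with a multi-decoder framework that exploits the complementary strengths of different decoders. Unlike conventional representations learned with discriminative or contrastive objectives (such as CLIP and ArcFace), Omni-ID is optimized with a generative objective, which yields a more comprehensive and fine-grained capture of identity features for generative tasks.

Link: https://arxiv.org/abs/2412.09694
Authors: Guocheng Qian,Kuan-Chieh Wang,Or Patashnik,Negin Heravi,Daniil Ostashev,Sergey Tulyakov,Daniel Cohen-Or,Kfir Aberman
Keywords: representation designed specifically, designed specifically, facial representation designed, generative tasks, representation designed
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Webpage: this https URL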

Abstract:We introduce Omni-ID, a novel facial representation designed specifically for generative tasks. Omni-ID encodes holistic information about an individual’s appearance across diverse expressions and poses within a fixed-size representation. It consolidates information from a varied number of unstructured input images into a structured representation, where each entry represents certain global or local identity features. Our approach uses a few-to-many identity reconstruction training paradigm, where a limited set of input images is used to reconstruct multiple target images of the same individual in various poses and expressions. A multi-decoder framework is further employed to leverage the complementary strengths of diverse decoders during training. Unlike conventional representations, such as CLIP and ArcFace, which are typically learned through discriminative or contrastive objectives, Omni-ID is optimized with a generative objective, resulting in a more comprehensive and nuanced identity capture for generative tasks. Trained on our MFHQ dataset – a multi-view facial image collection, Omni-ID demonstrates substantial improvements over conventional representations across various generative tasks.

[CV-103] TOAP: Towards Better Robustness in Universal Transferable Anti-Facial Retrieval

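[Quick read]: This paper addresses the privacy-leakage risk that arises when deep hash retrieval is used in facial retrieval systems, focusing on how adversarial examples affect deep hash models and how robust those examples remain on online social networks (OSNs). The key to the solution is the Three-in-One Adversarial Perturbation (TOAP). The paper analyzes how deep hash models behave after image post-processing and builds a local and global Compression Generator (CG) to simulate complex post-processing scenarios. It then studies how the model's objective changes under post-processing and proposes robust optimization objectives, cluster centers and data space centers, which are optimized with meta-learning. By alternating between generating adversarial examples and fine-tuning the CG, TOAP preserves perturbation performance while strengthening the CG's ability to counter the perturbation. Experiments show that TOAP clearly outperforms existing methods in robustness, universality, and transferability, and effectively improves privacy protection across several simulated post-processing scenarios and mainstream OSNs.

Link: https://arxiv.org/abs/2412.09692
Authors: Yunna Lv,Long Tang,Dengpan Ye,Caiyun Xie,Jiacheng Deng,Yiheng He
Keywords: facial retrieval systems, hash-based retrieval techniques, Deep hash models, facial matching, Deep hash-based retrieval
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: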

Abstract:Deep hash-based retrieval techniques are widely used in facial retrieval systems to improve the efficiency of facial matching. However, they also bring the risk of privacy leakage. Deep hash models are easily influenced by adversarial examples, which can be leveraged to prevent the malicious retrieval of private images. Existing adversarial example methods against deep hash models focus on universality and transferability but neglect robustness in online social networks (OSNs), which leads to their failure in anti-retrieval after post-processing. Therefore, we provide the first in-depth discussion of robust adversarial perturbation in universal transferable anti-facial retrieval and propose Three-in-One Adversarial Perturbation (TOAP). Specifically, we first analyze the performance of deep hash models after post-processing and construct a local and global Compression Generator (CG) to simulate complex post-processing scenarios. Then, we explore the variation patterns of the model’s objective under image post-processing and propose robust optimization objectives, cluster centers and data space centers, optimizing them using meta-learning. Finally, we iteratively optimize the perturbation by alternately generating adversarial examples and fine-tuning the CG, balancing the performance of the perturbation while enhancing the CG’s ability to mitigate it. Numerous experiments demonstrate that, in addition to its advantages in universality and transferability, TOAP significantly outperforms current state-of-the-art methods in multiple robustness metrics. It further improves universality and transferability by 5% to 28%, and achieves improvements of up to about 33% in several simulated post-processing scenarios as well as mainstream OSNs, demonstrating that TOAP can effectively protect private images from malicious retrieval in real-world scenarios.

[CV-104] DQA: An Efficient Method for Deep Quantization of Deep Neural Network Activations NEURIPS2024

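[Quick read]: This paper addresses the problem that existing methods for quantizing deep neural network (DNN) activations on resource-constrained devices are computationally expensive and rarely target sub-6-bit quantization. The key to the solution is DQA (Deep Quantization of DNN Activations), a new deep quantization method that uses simple shifting-based operations and Huffman coding to achieve efficient sub-6-bit quantization. It significantly improves accuracy (by up to 29.28%), performing particularly well at 3-, 4-, and 5-bit quantization levels.

Link: https://arxiv.org/abs/2412.09687
Authors: Wenhao Hu,Paul Henderson,José Cano
Keywords: Deep Neural Network, Neural Network, Deep Neural, commonly used technique, technique to reduce
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (stat.ML)
Comments: Accepted to Second Workshop on Machine Learning with New Compute Paradigms at NeurIPS 2024 (MLNCP 2024)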

Abstract:Quantization of Deep Neural Network (DNN) activations is a commonly used technique to reduce compute and memory demands during DNN inference, which can be particularly beneficial on resource-constrained devices. To achieve high accuracy, existing methods for quantizing activations rely on complex mathematical computations or perform extensive searches for the best hyper-parameters. However, these expensive operations are impractical on devices with limited computation capabilities, memory capacities, and energy budgets. Furthermore, many existing methods do not focus on sub-6-bit (or deep) quantization. To fill these gaps, in this paper we propose DQA (Deep Quantization of DNN Activations), a new method that focuses on sub-6-bit quantization of activations and leverages simple shifting-based operations and Huffman coding to be efficient and achieve high accuracy. We evaluate DQA with 3, 4, and 5-bit quantization levels and three different DNN models for two different tasks, image classification and image segmentation, on two different datasets. DQA shows significantly better accuracy (up to 29.28%) compared to the direct quantization method and the state-of-the-art NoisyQuant for sub-6-bit quantization.
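
To illustrate why shift-based operations are attractive on constrained hardware, here is a generic sketch of quantizing unsigned 8-bit activations to n bits with a right shift and restoring them with a left shift. This is not the DQA algorithm itself: the abstract does not give its details, the Huffman-coding stage is omitted, and the function names are hypothetical.

```python
import numpy as np

def shift_quantize(activations_u8, n_bits):
    """Keep the n_bits most significant bits of each 8-bit activation."""
    shift = 8 - n_bits
    return activations_u8 >> shift

def shift_dequantize(codes, n_bits):
    """Map n_bits codes back to the 8-bit range; the dropped low bits are lost."""
    shift = 8 - n_bits
    return codes << shift

# Toy activations already stored as unsigned 8-bit integers.
acts = np.array([0, 37, 128, 200, 255], dtype=np.uint8)
codes = shift_quantize(acts, n_bits=4)
restored = shift_dequantize(codes, n_bits=4)
```

Shifts need no multiplier or divider, which is the efficiency argument; the cost is the truncation error visible when comparing `restored` with `acts`.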

[CV-105] PBR-NeRF: Inverse Rendering with Physics-Based Neural Fields

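[Quick read]: This paper addresses the inability of conventional Neural Radiance Field (NeRF) and 3D Gaussian Splatting methods to estimate scene materials and illumination during 3D reconstruction. The key to the solution is PBR-NeRF, an inverse rendering model grounded in Physics-Based Rendering (PBR) theory that jointly estimates scene geometry, materials, and illumination. The paper introduces two new physics-based priors, rigorously formulated as intuitive loss terms, that achieve state-of-the-art material estimation without sacrificing novel view synthesis quality. The approach improves inverse rendering and shows the importance of extending current neural rendering methods to model scene properties fully.

Link: https://arxiv.org/abs/2412.09680
Authors: Sean Wu,Shamik Basu,Tim Broedermann,Luc Van Gool,Christos Sakaridis
Keywords: Neural Radiance Field, Radiance Field, named PBR-NeRF, Gaussian Splatting approaches, Neural Radiance
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 16 pages, 7 figures. Code is publicly available at this https URL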

Abstract:We tackle the ill-posed inverse rendering problem in 3D reconstruction with a Neural Radiance Field (NeRF) approach informed by Physics-Based Rendering (PBR) theory, named PBR-NeRF. Our method addresses a key limitation in most NeRF and 3D Gaussian Splatting approaches: they estimate view-dependent appearance without modeling scene materials and illumination. To address this limitation, we present an inverse rendering (IR) model capable of jointly estimating scene geometry, materials, and illumination. Our model builds upon recent NeRF-based IR approaches, but crucially introduces two novel physics-based priors that better constrain the IR estimation. Our priors are rigorously formulated as intuitive loss terms and achieve state-of-the-art material estimation without compromising novel view synthesis quality. Our method is easily adaptable to other inverse rendering and 3D reconstruction frameworks that require material estimation. We demonstrate the importance of extending current neural rendering approaches to fully model scene properties beyond geometry and view-dependent appearance. Code is publicly available at this https URL

[CV-106] Vision-Language Models Represent Darker-Skinned Black Individuals as More Homogeneous than Lighter-Skinned Black Individuals

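[Quick read]: This paper examines skin tone bias in Vision-Language Models (VLMs), in particular the stereotypes that may be amplified when models generate descriptions of individuals with different skin tones. The key to the approach is to vary skin tone while holding other features constant, using a database of GAN-generated face images, and to compare the homogeneity of the text generated for individuals of different skin tones. The study finds that VLMs produce more homogeneous text about darker-skinned individuals, and that women are affected more than men, mirroring known stereotyping patterns. The work highlights how biases in single-modality AI systems propagate to multimodal models and calls for further research on intersectional biases in AI.

Link: https://arxiv.org/abs/2412.09668
Authors: Messi H.J. Lee,Soyeon Jeon
Keywords: combine Large Language, Large Language Model, Large Language, combine Large, Language Model
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: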

Abstract:Vision-Language Models (VLMs) combine Large Language Model (LLM) capabilities with image processing, enabling tasks like image captioning and text-to-image generation. Yet concerns persist about their potential to amplify human-like biases, including skin tone bias. Skin tone bias, where darker-skinned individuals face more negative stereotyping than lighter-skinned individuals, is well-documented in the social sciences but remains under-explored in Artificial Intelligence (AI), particularly in VLMs. Using the GAN Face Database, we sampled computer-generated images of Black American men and women, controlling for skin tone variations while keeping other features constant. We then asked VLMs to write stories about these faces and compared the homogeneity of the generated stories. Stories generated by VLMs about darker-skinned Black individuals were more homogeneous than those about lighter-skinned individuals in three of four models, and Black women were consistently represented more homogeneously than Black men across all models. Interaction effects revealed a greater impact of skin tone on women in two VLMs, while the other two showed nonsignificant results, reflecting known stereotyping patterns. These findings underscore the propagation of biases from single-modality AI systems to multimodal models and highlight the need for further research to address intersectional biases in AI.
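
One simple way to quantify the homogeneity of a set of generated stories is the mean pairwise cosine similarity of their embeddings. The abstract does not specify the measure the authors use, so the sketch below is an assumption for illustration only; the embeddings are toy 2-D vectors rather than real sentence embeddings, and the function name is hypothetical.

```python
import numpy as np

def mean_pairwise_cosine(embeddings):
    """Average cosine similarity over all distinct pairs; higher means more homogeneous."""
    unit = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sims = unit @ unit.T
    n = len(unit)
    upper = sims[np.triu_indices(n, k=1)]   # each unordered pair counted once
    return upper.mean()

# Toy "story embeddings" for two groups of generated texts.
group_a = np.array([[1.0, 0.0], [0.9, 0.1], [1.0, 0.1]])   # tightly clustered
group_b = np.array([[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]])   # spread out
score_a = mean_pairwise_cosine(group_a)
score_b = mean_pairwise_cosine(group_b)
```

In this toy setup, `group_a` would count as the more homogeneous set because its stories sit closer together in embedding space.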

[CV-107] SEGT: A General Spatial Expansion Group Transformer for nuScenes Lidar-based Object Detection Task

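[Quick read]: This paper addresses the irregularity and sparsity of point cloud data in lidar-based object detection. The key to the solution is a new transformer-based framework called the Spatial Expansion Group Transformer (SEGT). The framework migrates voxels into distinct ordered fields, uses group attention to extract feature maps within each field, and then integrates the feature representations across fields by alternately applying several expansion strategies, strengthening the model's ability to capture spatial information comprehensively. On the nuScenes lidar-based object detection test set, the method achieves NDS scores of 73.5 without test-time augmentation and 74.2 with test-time augmentation.

Link: https://arxiv.org/abs/2412.09658
Authors: Cheng Mei,Hao He,Yahui Liu,Zhenhua Guo
Keywords: Expansion Group Transformer, termed Spatial Expansion, Spatial Expansion Group, spatial expansion strategies, Group Transformer
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: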

Abstract:In this technical report, we present a novel transformer-based framework for the nuScenes lidar-based object detection task, termed Spatial Expansion Group Transformer (SEGT). To efficiently handle the irregular and sparse nature of point clouds, we propose migrating the voxels into distinct specialized ordered fields with general spatial expansion strategies, and employ group attention mechanisms to extract the exclusive feature maps within each field. Subsequently, we integrate the feature representations across different ordered fields by alternately applying diverse expansion strategies, thereby enhancing the model’s ability to capture comprehensive spatial information. The method was evaluated on the nuScenes lidar-based object detection test dataset, achieving an NDS score of 73.5 without Test-Time Augmentation (TTA) and 74.2 with TTA, demonstrating the effectiveness of the proposed method.

[CV-108] From Noise to Nuance: Advances in Deep Generative Image Models

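[Quick read]: This survey addresses efficiency and quality in deep learning-based image generation, especially accelerated inference and improved generation precision under resource constraints. It centers on compute-efficient diffusion models, vision transformer architectures, and parameter-efficient training methodologies. It also highlights the contribution of control mechanisms such as ControlNet and of regional attention systems to generation precision and content customization. By analyzing these techniques, the paper shows how more efficient computation and more precise control can be achieved while preserving generation quality, and it points to research directions on resource-conscious and interpretable generation for industrial applications.

Link: https://arxiv.org/abs/2412.09656
Authors: Benji Peng,Chia Xin Liang,Ziqian Bi,Ming Liu,Yichao Zhang,Tianyang Wang,Keyu Chen,Xinyuan Song,Pohsun Feng
Keywords: Deep learning-based image, fundamental architectural breakthroughs, Deep learning-based, marked by fundamental, learning-based image generation
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: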

Abstract:Deep learning-based image generation has undergone a paradigm shift since 2021, marked by fundamental architectural breakthroughs and computational innovations. Through reviewing architectural innovations and empirical results, this paper analyzes the transition from traditional generative methods to advanced architectures, with focus on compute-efficient diffusion models and vision transformer architectures. We examine how recent developments in Stable Diffusion, DALL-E, and consistency models have redefined the capabilities and performance boundaries of image synthesis, while addressing persistent challenges in efficiency and quality. Our analysis focuses on the evolution of latent space representations, cross-attention mechanisms, and parameter-efficient training methodologies that enable accelerated inference under resource constraints. While more efficient training methods enable faster inference, advanced control mechanisms like ControlNet and regional attention systems have simultaneously improved generation precision and content customization. We investigate how enhanced multi-modal understanding and zero-shot generation capabilities are reshaping practical applications across industries. Our analysis demonstrates that despite remarkable advances in generation quality and computational efficiency, critical challenges remain in developing resource-conscious architectures and interpretable generation systems for industrial applications. The paper concludes by mapping promising research directions, including neural architecture optimization and explainable generation frameworks.

[CV-109] Pole-based Vehicle Localization with Vector Maps: A Camera-LiDAR Comparative Study

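[Quick read]: This paper addresses accurate localization of autonomous vehicles in urban environments, especially where Global Navigation Satellite Systems (GNSS) are degraded. The key to the solution is to use exteroceptive information: pole-like objects (such as traffic signs, traffic lights, and street lights) are extracted with LiDAR or detected from monocular cameras with deep neural networks, and are then associated with georeferenced features in a vector map. The paper proposes a real-time camera-based detection method built on a lightweight neural network and, through evaluation on a challenging sequence, shows that the vision-based approach localizes with high accuracy in open road conditions.

Link: https://arxiv.org/abs/2412.09649
Authors: Maxime Noizet(Heudiasyc),Philippe Xu(Heudiasyc),Philippe Bonnifait(Heudiasyc)
Keywords: Navigation Satellite Systems, Global Navigation Satellite, autonomous navigation, Satellite Systems, Global Navigation
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: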

Abstract:For autonomous navigation, accurate localization with respect to a map is needed. In urban environments, infrastructure such as buildings or bridges causes major difficulties for Global Navigation Satellite Systems (GNSS) and, despite advances in inertial navigation, it is necessary to support them with other sources of exteroceptive information. In road environments, many common items of street furniture such as traffic signs, traffic lights and street lights take the form of poles. By georeferencing these features in vector maps, they can be used within a localization filter that includes a detection pipeline and a data association method. Poles, having discriminative vertical structures, can be extracted from 3D geometric information using LiDAR sensors. Alternatively, deep neural networks can be employed to detect them from monocular cameras. The lack of depth information induces challenges in associating camera detections with map features. Yet, multi-camera integration provides a cost-efficient solution. This paper quantitatively evaluates the efficacy of these approaches in terms of localization. It introduces a real-time method for camera-based pole detection using a lightweight neural network trained on automatically annotated images. The proposed methods’ efficiency is assessed on a challenging sequence with a vector map. The results highlight the high accuracy of the vision-based approach in open road conditions.

[CV-110] Bench2Drive-R: Turning Real World Data into Reactive Closed-Loop Autonomous Driving Benchmark by Generative Model

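[Quick read]: This paper addresses closed-loop evaluation for end-to-end autonomous driving (E2E-AD), in particular the limited realism and interactivity of existing evaluation systems. The key to the solution is Bench2Drive-R, a generative framework that decouples sensor rendering from behavior rollout and introduces a separate behavioral controller to simulate the reactions of surrounding agents, enabling reactive closed-loop evaluation. Specifically, Bench2Drive-R uses a noise-modulating temporal encoder with Gaussian blurring for temporal consistency, and a retrieval mechanism over the spatially nearest images together with 3D relative position encodings for spatial consistency. With these designs, Bench2Drive-R reaches state-of-the-art performance in image generation quality and interactive simulation.

Link: https://arxiv.org/abs/2412.09647
Authors: Junqi You,Xiaosong Jia,Zhiyuan Zhang,Yutao Zhu,Junchi Yan
Keywords: evaluation system remains, autonomous driving, system remains, reactive closed-loop evaluation, evaluation system
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: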

Abstract:For end-to-end autonomous driving (E2E-AD), the evaluation system remains an open problem. Existing closed-loop evaluation protocols usually rely on simulators like CARLA, which are less realistic, while NAVSIM uses real-world vision data yet is limited to fixed planning trajectories over short horizons and assumes other agents are not reactive. We introduce Bench2Drive-R, a generative framework that enables reactive closed-loop evaluation. Unlike existing video generative models for AD, the proposed designs are tailored for interactive simulation, where sensor rendering and behavior rollout are decoupled by applying a separate behavioral controller to simulate the reactions of surrounding agents. As a result, the renderer can focus on image fidelity, control adherence, and spatial-temporal coherence. For temporal consistency, due to the step-wise interaction nature of simulation, we design a noise modulating temporal encoder with Gaussian blurring to encourage long-horizon autoregressive rollout of image sequences without deteriorating distribution shifts. For spatial consistency, a retrieval mechanism, which takes the spatially nearest images as references, is introduced to ensure scene-level rendering fidelity during the generation process. The spatial relations between target and reference are explicitly modeled with 3D relative position encodings and the potential over-reliance on reference images is mitigated with hierarchical sampling and classifier-free guidance. We compare the generation quality of Bench2Drive-R with existing generative models and achieve state-of-the-art performance. We further integrate Bench2Drive-R into nuPlan and evaluate the generative qualities with closed-loop simulation results. We will open source our code.

[CV-111] A Practical Exercise in Adapting SIFT Using FHE Primitives

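[Quick read]: This paper addresses the main limitations encountered when implementing the Scale Invariant Feature Transform (SIFT) under Fully Homomorphic Encryption (FHE), in particular the lack of a standard comparison operator and of the operations that depend on it (such as array max and histogram binning). The key points of the solution are: 1. adapting regular code to the FHE setting; 2. providing alternative implementations of standard algorithms (such as array max and histogram binning) that reduce the multiplicative depth; 3. using deferred computations to avoid expensive comparison operations in the encrypted domain. Together these offer practical guidance for adapting algorithms to FHE.

Link: https://arxiv.org/abs/2412.09642
Authors: Ishwar B Balappanawar,Bhargav Srinivas Kommireddy
Keywords: Scale Invariant Feature, Invariant Feature Transform, CKKS Fully Homomorphic, implementing Scale Invariant, Fully Homomorphic encryption
Categories: Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV)
Comments: Under review at this http URL collocated with Real World Crypto 2025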

Abstract:An exercise in implementing the Scale Invariant Feature Transform using CKKS Fully Homomorphic Encryption quickly reveals some glaring limitations in the current FHE paradigm. These limitations include the lack of a standard comparison operator and of certain operations that depend on it (like array max, histogram binning, etc.). We also observe that the existing solutions are either too low level or do not have proper abstractions to implement algorithms like SIFT. In this work, we demonstrate: 1. Methods of adapting regular code to the FHE setting. 2. Alternate implementations of standard algorithms (like array max, histogram binning, etc.) to reduce the multiplicative depth. 3. A novel method of using deferred computations to avoid performing expensive operations such as comparisons in the encrypted domain. Through this exercise, we hope this work acts as a practical guide on how one can adapt algorithms to FHE.
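
To make the "no comparison operator" limitation concrete, here is a plaintext NumPy sketch of a comparison-free array max that uses only additions and multiplications, the operations CKKS supports natively. It approximates sign(x) on [-1, 1] by iterating the polynomial x(3 - x^2)/2 and then uses max(a, b) = ((a + b) + (a - b) * sign(a - b)) / 2. This is a generic technique from the homomorphic-comparison literature, not necessarily the paper's implementation; no FHE library is used, the function names are hypothetical, and inputs are assumed to lie in [0, 1].

```python
import numpy as np

def approx_sign(x, iterations=12):
    """Approximate sign(x) for x in [-1, 1] using only additions and multiplications."""
    for _ in range(iterations):
        x = 0.5 * x * (3.0 - x * x)   # pushes nonzero values towards -1 or +1
    return x

def approx_max(a, b):
    """Element-wise max without a comparison: ((a + b) + (a - b) * sign(a - b)) / 2."""
    diff = a - b
    return 0.5 * ((a + b) + diff * approx_sign(diff))

def approx_array_max(values):
    """Tournament reduction: halve the array with pairwise approximate maxima."""
    values = np.asarray(values, dtype=float)
    while len(values) > 1:
        if len(values) % 2 == 1:
            values = np.append(values, values[-1])   # pad by duplicating the last element
        values = approx_max(values[0::2], values[1::2])
    return values[0]

result = approx_array_max([0.3, 0.7, 0.1, 0.55, 0.2])
```

Every iteration of `approx_sign` consumes multiplicative depth, and values that are very close together need more iterations to separate, which is why depth-reducing alternatives and deferred comparisons matter in the encrypted setting.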

[CV-112] Iterating the Transient Light Transport Matrix for Non-Line-of-Sight Imaging

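[Quick read]: This paper addresses how to acquire and process the Transient Light Transport Matrix (TLTM) efficiently in non-line-of-sight (NLOS) imaging. The key to the solution is to use customized gated Single Photon Avalanche Diode (SPAD) arrays, which acquire photons in parallel and so shorten acquisition time, together with efficient algorithms for processing the full TLTM. These algorithms iterate on the measured first-order TLTM and extract from it a second-order TLTM for surfaces in the hidden scene, enabling computational focusing and detection within the hidden scene. The paper also demonstrates three applications in NLOS imaging: scene relighting, separation of direct and indirect light transport components in the hidden scene, and dual photography.

Link: https://arxiv.org/abs/2412.10300
Authors: Talha Sultan,Eric Brandt,Khadijeh Masumnia-Bisheh,Simone Riccardo,Pavel Polynkin,Alberto Tosi,Andreas Velten
Keywords: Light Transport Matrix, Transient Light Transport, controllable light source, active imaging system, resulting spatiotemporal light
Categories: Optics (physics.optics); Computer Vision and Pattern Recognition (cs.CV)
Comments: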

Abstract:Active imaging systems sample the Transient Light Transport Matrix (TLTM) for a scene by sequentially illuminating various positions in this scene using a controllable light source, and then measuring the resulting spatiotemporal light transport with time of flight (ToF) sensors. Time-resolved Non-line-of-sight (NLOS) imaging employs an active imaging system that measures part of the TLTM of an intermediary relay surface, and uses the indirect reflections of light encoded within this TLTM to “see around corners”. Such imaging systems have applications in diverse areas such as disaster response, remote surveillance, and autonomous navigation. While existing NLOS imaging systems usually measure a subset of the full TLTM, development of customized gated Single Photon Avalanche Diode (SPAD) arrays \cite{riccardo_fast-gated_2022} has made it feasible to probe the full measurement space. In this work, we demonstrate that the full TLTM on the relay surface can be processed with efficient algorithms to computationally focus and detect our illumination in different parts of the hidden scene, turning the relay surface into a second-order active imaging system. These algorithms allow us to iterate on the measured, first-order TLTM, and extract a second-order TLTM for surfaces in the hidden scene. We showcase three applications of TLTMs in NLOS imaging: (1) Scene Relighting with novel illumination, (2) Separation of direct and indirect components of light transport in the hidden scene, and (3) Dual Photography. Additionally, we empirically demonstrate that SPAD arrays enable parallel acquisition of photons, effectively mitigating long acquisition times.
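
To make the relighting and dual photography applications more concrete, here is a minimal steady-state sketch. It assumes a plain (non-transient) light transport matrix `T` with toy values, where `T[i][j]` is the light reaching sensor pixel `i` when source position `j` is lit. The time dimension of the TLTM and the authors' NLOS algorithms are deliberately left out, and `relight` and `transpose` are hypothetical helper names.

```python
from typing import List

Matrix = List[List[float]]


def relight(T: Matrix, illumination: List[float]) -> List[float]:
    """Image under a new illumination pattern: c = T @ l."""
    return [sum(t_ij * l_j for t_ij, l_j in zip(row, illumination)) for row in T]


def transpose(T: Matrix) -> Matrix:
    """Swap the roles of source and sensor (Helmholtz reciprocity)."""
    return [list(col) for col in zip(*T)]


# Toy transport matrix: 2 sensor pixels, 3 source positions (made-up values).
T = [[0.5, 0.1, 0.0],
     [0.2, 0.4, 0.1]]

primal = relight(T, [1.0, 0.0, 2.0])      # what the sensor sees under a new pattern
dual = relight(transpose(T), [1.0, 1.0])  # the scene as "seen" from the source side
```

Relighting is a matrix-vector product with a new illumination vector, and the dual image comes from applying the transposed matrix, so neither needs a new capture once `T` has been measured.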

[CV-113] Copy-Move Detection in Optical Microscopy: A Segmentation Network and A Dataset

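[Quick Read]: This paper addresses the detection of forged experimental images in the biomedical field, in particular copy-move forgery regions not seen during training. The key to the solution is to reformulate the problem as an intra-image co-saliency detection task and to propose CMSeg-Net, a copy-move forgery segmentation network capable of identifying unseen duplicated areas. Built on a multi-resolution encoder-decoder architecture, CMSeg-Net incorporates self-correlation and correlation-assisted spatial-attention modules to detect regional similarities within feature tensors at each observation scale, which helps distinguish small copy-move targets in complex microscopic images from other similar objects.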

Link: https://arxiv.org/abs/2412.10258
Authors: Hao-Chiang Shao, Yuan-Rong Liao, Tse-Yu Tseng, Yen-Liang Chuo, Fong-Yi Lin
Keywords: detecting forged experimental, forged experimental images, academic fraud, public concern, increasing revelations
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: submitted to IEEE SPL

Abstract:With increasing revelations of academic fraud, detecting forged experimental images in the biomedical field has become a public concern. The challenge lies in the fact that copy-move targets can include background tissue, small foreground objects, or both, which may be out of the training domain and subject to unseen attacks, rendering standard object-detection-based approaches less effective. To address this, we reformulate the problem of detecting biomedical copy-move forgery regions as an intra-image co-saliency detection task and propose CMSeg-Net, a copy-move forgery segmentation network capable of identifying unseen duplicated areas. Built on a multi-resolution encoder-decoder architecture, CMSeg-Net incorporates self-correlation and correlation-assisted spatial-attention modules to detect intra-image regional similarities within feature tensors at each observation scale. This design helps distinguish even small copy-move targets in complex microscopic images from other similar objects. Furthermore, we created a copy-move forgery dataset of optical microscopic images, named FakeParaEgg, using open data from the ICIP 2022 Challenge to support CMSeg-Net’s development and verify its performance. Extensive experiments demonstrate that our approach outperforms previous state-of-the-art methods on the FakeParaEgg dataset and other open copy-move detection datasets, including CASIA-CMFD, CoMoFoD, and CMF. The FakeParaEgg dataset, our source code, and the CMF dataset with our manually defined segmentation ground truths are available at this https URL.

[CV-114] A Cascaded Dilated Convolution Approach for Mpox Lesion Classification

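[Quick Read]: This paper addresses the challenges of diagnosing Mpox, in particular the difficulty caused by the visual similarity of its skin lesions to those of other diseases. The key to the solution is a new Cascaded Atrous Group Attention (CAGA) module, which enhances multi-scale feature representation while optimizing computational efficiency. By integrating CAGA with the EfficientViT-L1 architecture, the study reports a score of 0.98% on the MCSI dataset while reducing model parameters by 37.5%, lowering computational complexity while maintaining diagnostic accuracy and easing deployment in resource-constrained healthcare settings. Validation on two other benchmark datasets further demonstrates the model's robustness and its advantage over existing methods.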

Link: https://arxiv.org/abs/2412.10106
Authors: Ayush Deshmukh
Keywords: Public Health Emergency, skin lesion diseases, Public Health, Health Emergency, Emergency of International
Categories: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: 7 pages, 2 figures, 5 tables

Abstract:The global outbreak of Mpox virus, classified as a Public Health Emergency of International Concern by WHO, presents significant diagnostic challenges due to its visual similarity to other skin lesion diseases. Current clinical detection techniques face limitations in accuracy and efficiency, necessitating improved automated diagnostic solutions. This study introduces a novel Cascaded Atrous Group Attention (CAGA) module, specifically designed to enhance multi-scale feature representation while optimizing computational efficiency. By integrating CAGA with EfficientViT-L1 as the backbone architecture, our approach achieves state-of-the-art performance with a score of 0.98% on the MCSI dataset, while reducing model parameters by 37.5% compared to the original EfficientViT-L1. This reduction in computational complexity maintains diagnostic accuracy while enabling broader deployment across resource-constrained healthcare settings. Extensive validation across two other benchmark datasets, including MSID and MSLD, demonstrate the model’s robustness, consistently outperforming existing approaches. Our findings suggest that CAGA’s efficient feature extraction mechanism could be adapted for other medical imaging tasks requiring fine-grained visual discrimination.

[CV-115] FM2S: Self-Supervised Fluorescence Microscopy Denoising With Single Noisy Image

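[Quick Read]: This paper addresses the challenges of fluorescence microscopy image denoising, which stem from the difficulty of precisely modeling the noise and of acquiring clean images for training. The key to the solution is FM2S (Fluorescence Micrograph to Self), an efficient self-supervised denoising method that produces high-quality results from a single noisy image. Its core contributions are an adaptive global-local Noise Addition module for data augmentation, which addresses the discrepancy between synthetic and real-world noise, and a two-layer neural network trained to learn the mapping from the noisy image to the filtered image, balancing noise removal against computational efficiency. Experiments show that FM2S performs well in both denoising quality and runtime across microscope types and noise levels, with an average PSNR improvement of about 6 dB.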

Link: https://arxiv.org/abs/2412.10031
Authors: Jizhihui Liu, Qixun Teng, Junjun Jiang
Keywords: significantly advanced biological, advanced biological research, visualizing detailed cellular, detailed cellular structures, biological processes
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Fluorescence microscopy has significantly advanced biological research by visualizing detailed cellular structures and biological processes. However, such image denoising task often faces challenges due to difficulty in precisely modeling the inherent noise and acquiring clean images for training, which constrains most existing methods. In this paper, we propose an efficient self-supervised denoiser Fluorescence Micrograph to Self (FM2S), enabling a high-quality denoised result with a single noisy image. Our method introduces an adaptive global-local Noise Addition module for data augmentation, addressing generalization problems caused by discrepancies between synthetic and real-world noise. We then train a two-layer neural network to learn the mapping from the noise-added image to the filtered image, achieving a balance between noise removal and computational efficiency. Experimental results demonstrate that FM2S excels in various microscope types and noise levels in terms of denoising effects and time consumption, obtaining an average PSNR improvement of around 6 dB over the original noisy image in a few seconds. The code is available at this https URL.
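
For readers unfamiliar with the metric, the sketch below shows the standard PSNR definition, 10·log10(MAX² / MSE), on toy pixel lists. The helper names and sample values are illustrative assumptions and have nothing to do with the FM2S code or data. The point is that a gain of about 6 dB corresponds to roughly a fourfold reduction in mean squared error.

```python
import math
from typing import Sequence


def mse(reference: Sequence[float], estimate: Sequence[float]) -> float:
    """Mean squared error between two equally sized pixel sequences."""
    return sum((r - e) ** 2 for r, e in zip(reference, estimate)) / len(reference)


def psnr(reference: Sequence[float], estimate: Sequence[float],
         max_value: float = 255.0) -> float:
    """Peak signal-to-noise ratio in dB: 10 * log10(MAX^2 / MSE)."""
    err = mse(reference, estimate)
    if err == 0:
        return math.inf  # identical images
    return 10 * math.log10(max_value ** 2 / err)


clean = [100, 100, 100, 100]
noisy = [110, 90, 110, 90]      # MSE = 100
denoised = [105, 95, 105, 95]   # MSE = 25, four times smaller

gain_db = psnr(clean, denoised) - psnr(clean, noisy)  # 10 * log10(4), about 6.02 dB
```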

[CV-116] Cycle-Consistent Bridge Diffusion Model for Accelerated MRI Reconstruction

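[Quick Read]: This paper addresses fidelity and efficiency in accelerated magnetic resonance imaging (MRI) reconstruction. Existing deep learning methods, although built on traditional reconstruction approaches, still struggle to deliver high-quality reconstructions from under-sampled data. Diffusion models have recently shown potential for improving the fidelity of generated images, but their inference starts from random Gaussian noise, which makes results unstable, and usually requires thousands of sampling steps, leading to sub-optimal reconstruction quality and low efficiency. The proposed solution is the Cycle-Consistent Bridge Diffusion Model (CBDM). Its key idea is to build a cycle-consistent diffusion process from two bridge diffusion models and to introduce a consistency loss, which enhances the details of reconstructed images and reduces the number of diffusion steps. CBDM also integrates a Contourlet Decomposition Embedding Module (CDEM), which captures multi-scale structural texture knowledge through frequency-domain decomposition pyramids and directional filter banks to further improve structural fidelity. Experiments show that CBDM outperforms existing methods in reconstruction quality and number of training iterations, setting a new state of the art for accelerated MRI reconstruction.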

Link: https://arxiv.org/abs/2412.09998
Authors: Tao Song, Yicheng Wu, Minhao Hu, Xiangde Luo, Guoting Luo, Guotai Wang, Yi Guo, Feng Xu, Shaoting Zhang
Keywords: reduce examination time, improving patient comfort, reconstruction techniques aim, maintaining high image, MRI reconstruction techniques
Categories: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Accelerated MRI reconstruction techniques aim to reduce examination time while maintaining high image fidelity, which is highly desirable in clinical settings for improving patient comfort and hospital efficiency. Existing deep learning methods typically reconstruct images from under-sampled data with traditional reconstruction approaches, but they still struggle to provide high-fidelity results. Diffusion models show great potential to improve fidelity of generated images in recent years. However, their inference process starting with a random Gaussian noise introduces instability into the results and usually requires thousands of sampling steps, resulting in sub-optimal reconstruction quality and low efficiency. To address these challenges, we propose Cycle-Consistent Bridge Diffusion Model (CBDM). CBDM employs two bridge diffusion models to construct a cycle-consistent diffusion process with a consistency loss, enhancing the fine-grained details of reconstructed images and reducing the number of diffusion steps. Moreover, CBDM incorporates a Contourlet Decomposition Embedding Module (CDEM) which captures multi-scale structural texture knowledge in images through frequency domain decomposition pyramids and directional filter banks to improve structural fidelity. Extensive experiments demonstrate the superiority of our model by higher reconstruction quality and fewer training iterations, achieving a new state of the art for accelerated MRI reconstruction in both fastMRI and IXI datasets.

[CV-117] Neural Vector Tomography for Reconstructing a Magnetization Vector Field

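[Quick Read]: This paper addresses the artifacts that discretized techniques introduce into vector tomographic reconstructions, and the further degradation of reconstruction quality as noise increases. The key to the solution is to model the underlying vector field with smooth neural fields. Because the activation functions of the neural network can be chosen to be smooth and the domain is no longer pixelated, the model yields high-quality reconstructions even in the presence of noise. When the underlying field has a global continuous symmetry, the neural network substantially improves reconstruction accuracy over existing techniques.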

Link: https://arxiv.org/abs/2412.09927
Authors: Giorgi Butbaia, Jiadong Zang
Keywords: vector tomographic reconstructions, Discretized techniques, prone to producing, producing artifacts, vector tomographic
Categories: Disordered Systems and Neural Networks (cond-mat.dis-nn); Computer Vision and Pattern Recognition (cs.CV)
Comments: 6 pages, 5 figures

Abstract:Discretized techniques for vector tomographic reconstructions are prone to producing artifacts in the reconstructions. The quality of these reconstructions may further deteriorate as the amount of noise increases. In this work, we instead model the underlying vector fields using smooth neural fields. Owing to the fact that the activation functions in the neural network may be chosen to be smooth and the domain is no longer pixelated, the model results in high-quality reconstructions, even under presence of noise. In the case where we have underlying global continuous symmetry, we find that the neural network substantially improves the accuracy of the reconstruction over the existing techniques.

[CV-118] A Single-Frame and Multi-Frame Cascaded Image Super-Resolution Method

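[Quick Read]: This paper addresses the degradation of single-frame and multi-frame super-resolution reconstruction as the magnification factor increases. The key to the solution is a novel two-step image super-resolution method that cascades multi-frame super-resolution (MFSR) with single-frame super-resolution (SFSR) to progressively upsample images to the target resolution. The method consists of an L0-norm constrained reconstruction scheme and an enhanced residual back-projection network, combining the flexibility of variational model-based methods with the feature learning capacity of deep learning-based methods. Experiments show superior performance on both objective and perceptual quality measures, and the cascade can be robustly applied to different SFSR and MFSR methods.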

Link: https://arxiv.org/abs/2412.09846
Authors: Jing Sun, Qiangqiang Yuan, Huanfeng Shen, Jie Li, Liangpei Zhang
Keywords: reconstruct a high-resolution, prior knowledge, multi-frame super-resolution reconstruction, multi-frame super-resolution, super-resolution reconstruction degrades
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: 20 pages, 13 figures

Abstract:The objective of image super-resolution is to reconstruct a high-resolution (HR) image with the prior knowledge from one or several low-resolution (LR) images. However, in the real world, due to the limited complementary information, the performance of both single-frame and multi-frame super-resolution reconstruction degrades rapidly as the magnification increases. In this paper, we propose a novel two-step image super-resolution method concatenating multi-frame super-resolution (MFSR) with single-frame super-resolution (SFSR), to progressively upsample images to the desired resolution. The proposed method consists of an L0-norm constrained reconstruction scheme and an enhanced residual back-projection network, integrating the flexibility of the variational model-based method and the feature learning capacity of the deep learning-based method. To verify the effectiveness of the proposed algorithm, extensive experiments with both simulated and real world sequences were implemented. The experimental results show that the proposed method yields superior performance in both objective and perceptual quality measurements. The average PSNRs of the cascade model in set5 and set14 are 33.413 dB and 29.658 dB respectively, which are 0.76 dB and 0.621 dB more than the baseline method. In addition, the experiment indicates that this cascade model can be robustly applied to different SFSR and MFSR methods.

[CV-119] waveOrder: generalist framework for label-agnostic computational microscopy

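[Quick Read]: This paper addresses the integration of morphological and molecular measurements of dynamic biological systems across spatial scales, from organelles to entire organisms. The key solution is waveOrder, a generalist computational imaging framework based on wave-optical imaging that encodes and decodes multiple specimen properties from a minimal set of acquired channels, with or without fluorescent labels. waveOrder expresses material properties (phase, absorption, birefringence, diattenuation, and fluorophore density) in terms of physically motivated basis vectors and expresses image data in terms of directly measurable Stokes parameters, which enables a multi-channel reconstruction algorithm that recovers specimen properties in multiple contrast modes. The framework supports several 3D computational microscopy methods, including quantitative phase imaging, quantitative label-free imaging, and fluorescence deconvolution imaging, and is provided as an open-source computational imaging library with a napari plugin.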

Link: https://arxiv.org/abs/2412.09775
Authors: Talon Chandler, Eduardo Hirata-Miyasaki, Ivan E. Ivanov, Ziwen Liu, Deepika Sundarraman, Allyson Quinn Ryan, Adrian Jacobo, Keir Balla, Shalin B. Mehta
Keywords: Correlative computational microscopy, dynamic biological systems, Correlative computational, entire organisms, biological systems
Categories: Optics (physics.optics); Computer Vision and Pattern Recognition (cs.CV); Quantitative Methods (q-bio.QM)
Comments: 11 pages of main text (5 figures, one table); 9 pages of supplementary text (4 figures, one table); 5 ancillary videos

Abstract:Correlative computational microscopy is accelerating the mapping of dynamic biological systems by integrating morphological and molecular measurements across spatial scales, from organelles to entire organisms. Visualization, measurement, and prediction of interactions among the components of biological systems can be accelerated by generalist computational imaging frameworks that relax the trade-offs imposed by multiplex dynamic imaging. This work reports a generalist framework for wave optical imaging of the architectural order (waveOrder) among biomolecules for encoding and decoding multiple specimen properties from a minimal set of acquired channels, with or without fluorescent labels. waveOrder expresses material properties in terms of elegant physically motivated basis vectors directly interpretable as phase, absorption, birefringence, diattenuation, and fluorophore density; and it expresses image data in terms of directly measurable Stokes parameters. We report a corresponding multi-channel reconstruction algorithm to recover specimen properties in multiple contrast modes. With this framework, we implement multiple 3D computational microscopy methods, including quantitative phase imaging, quantitative label-free imaging with phase and polarization, and fluorescence deconvolution imaging, across scales ranging from organelles to whole zebrafish. These advances are available via an extensible open-source computational imaging library, waveOrder, and a napari plugin, recOrder.
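
The abstract mentions "directly measurable Stokes parameters". As background only, the sketch below uses the textbook relations for the linear Stokes parameters from four intensities measured behind a linear polarizer at 0°, 45°, 90°, and 135°. This acquisition scheme and the function names are assumptions for illustration. They are not taken from waveOrder or recOrder, and the circular component S3 is omitted.

```python
import math
from typing import Tuple


def linear_stokes(i0: float, i45: float, i90: float, i135: float) -> Tuple[float, float, float]:
    """Return (S0, S1, S2) from intensities behind a linear polarizer."""
    s0 = i0 + i90    # total intensity
    s1 = i0 - i90    # horizontal vs. vertical preference
    s2 = i45 - i135  # +45 deg vs. -45 deg preference
    return s0, s1, s2


def degree_of_linear_polarization(s0: float, s1: float, s2: float) -> float:
    return math.hypot(s1, s2) / s0


def angle_of_linear_polarization(s1: float, s2: float) -> float:
    """Orientation in radians, in (-pi/2, pi/2]."""
    return 0.5 * math.atan2(s2, s1)


# Fully polarized light at 45 deg: Malus's law gives these four intensities.
s0, s1, s2 = linear_stokes(0.5, 1.0, 0.5, 0.0)
dolp = degree_of_linear_polarization(s0, s1, s2)
aolp = angle_of_linear_polarization(s1, s2)
```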

[CV-120] DSplats: 3D Generation by Denoising Splats-Based Multiview Diffusion Models

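[Quick Read]: This paper addresses the problem that existing Gaussian-based 3D reconstruction techniques lack the extensive priors and expressiveness of Diffusion Models when generating high-quality 3D content. The key to the solution is DSplats, which combines a Gaussian Splat-based reconstructor with a pretrained Latent Diffusion Model to directly denoise multiview images and produce diverse, realistic 3D assets. The method exploits the extensive priors of diffusion models while its explicit 3D representation enforces geometric consistency, yielding high-quality, spatially consistent outputs in single-image to 3D reconstruction.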

Link: https://arxiv.org/abs/2412.09648
Authors: Kevin Miao, Harsh Agrawal, Qihang Zhang, Federico Semeraro, Marco Cavallo, Jiatao Gu, Alexander Toshev
Keywords: learning robust distributions, content requires models, requires models capable, Diffusion Models, Latent Diffusion Model
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR); Machine Learning (cs.LG)
Comments:

Abstract:Generating high-quality 3D content requires models capable of learning robust distributions of complex scenes and the real-world objects within them. Recent Gaussian-based 3D reconstruction techniques have achieved impressive results in recovering high-fidelity 3D assets from sparse input images by predicting 3D Gaussians in a feed-forward manner. However, these techniques often lack the extensive priors and expressiveness offered by Diffusion Models. On the other hand, 2D Diffusion Models, which have been successfully applied to denoise multiview images, show potential for generating a wide range of photorealistic 3D outputs but still fall short on explicit 3D priors and consistency. In this work, we aim to bridge these two approaches by introducing DSplats, a novel method that directly denoises multiview images using Gaussian Splat-based Reconstructors to produce a diverse array of realistic 3D assets. To harness the extensive priors of 2D Diffusion Models, we incorporate a pretrained Latent Diffusion Model into the reconstructor backbone to predict a set of 3D Gaussians. Additionally, the explicit 3D representation embedded in the denoising network provides a strong inductive bias, ensuring geometrically consistent novel view generation. Our qualitative and quantitative experiments demonstrate that DSplats not only produces high-quality, spatially consistent outputs, but also sets a new standard in single-image to 3D reconstruction. When evaluated on the Google Scanned Objects dataset, DSplats achieves a PSNR of 20.38, an SSIM of 0.842, and an LPIPS of 0.109.

[CV-121] RealOSR: Latent Unfolding Boosting Diffusion-based Real-world Omnidirectional Image Super-Resolution

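[Quick Read]: This paper addresses two limitations in omnidirectional image super-resolution (ODISR): existing methods handle complex, unknown real-world degradation poorly, and diffusion models are bottlenecked by inference speed. The key to the solution is RealOSR, a diffusion-based approach with single-step denoising that introduces a lightweight domain alignment module and a latent unfolding module. The former efficiently injects the low-resolution omnidirectional image into single-step latent denoising; the latter exploits the multi-scale feature modeling ability of the denoising UNet to simulate gradient descent directly in latent space. Experiments show that RealOSR outperforms existing methods in both restoration quality and inference efficiency, and in particular clearly surpasses the state-of-the-art OmniSSR in visual quality and inference speed.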

Link: https://arxiv.org/abs/2412.09646
Authors: Xuhan Sheng, Runyi Li, Bin Chen, Weiqi Li, Xu Jiang, Jian Zhang
Keywords: Omnidirectional image super-resolution, Omnidirectional image, image super-resolution, detailed visual content, aims to upscale
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR); Machine Learning (cs.LG)
Comments:

Abstract:Omnidirectional image super-resolution (ODISR) aims to upscale low-resolution (LR) omnidirectional images (ODIs) to high-resolution (HR), addressing the growing demand for detailed visual content across a 180°×360° viewport. Existing methods are limited by simple degradation assumptions (e.g., bicubic downsampling), which fail to capture the complex, unknown real-world degradation processes. Recent diffusion-based approaches suffer from slow inference due to their hundreds of sampling steps and frequent pixel-latent space conversions. To tackle these challenges, in this paper, we propose RealOSR, a novel diffusion-based approach for real-world ODISR (Real-ODISR) with single-step diffusion denoising. To sufficiently exploit the input information, RealOSR introduces a lightweight domain alignment module, which facilitates the efficient injection of LR ODI into the single-step latent denoising. Additionally, to better utilize the rich semantic and multi-scale feature modeling ability of denoising UNet, we develop a latent unfolding module that simulates the gradient descent process directly in latent space. Experimental results demonstrate that RealOSR outperforms previous methods in both ODI recovery quality and efficiency. Compared to the recent state-of-the-art diffusion-based ODISR method, OmniSSR, RealOSR achieves significant improvements in visual quality and over 200× inference acceleration. Our code and models will be released.

Artificial Intelligence

[AI-0] A Library for Learning Neural Operators

Link: https://arxiv.org/abs/2412.10354
Authors: Jean Kossaifi, Nikola Kovachki, Zongyi Li, Davit Pitt, Miguel Liu-Schiaffini, Robert Joseph George, Boris Bonev, Kamyar Azizzadenesheli, Julius Berner, Anima Anandkumar
Keywords: open-source Python library, Python library, open-source Python, finite-dimensional Euclidean spaces, Python
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:We present NeuralOperator, an open-source Python library for operator learning. Neural operators generalize neural networks to maps between function spaces instead of finite-dimensional Euclidean spaces. They can be trained and run for inference on input and output functions given at various discretizations, satisfying a discretization convergence property. Built on top of PyTorch, NeuralOperator provides all the tools for training and deploying neural operator models, as well as developing new ones, in a high-quality, tested, open-source package. It combines cutting-edge models and customizability with a gentle learning curve and simple user interface for newcomers.

[AI-1] TraceVLA: Visual Trace Prompting Enhances Spatial-Temporal Awareness for Generalist Robotic Policies

Link: https://arxiv.org/abs/2412.10345
Authors: Ruijie Zheng, Yongyuan Liang, Shuaiyi Huang, Jianfeng Gao, Hal Daumé III, Andrey Kolobov, Furong Huang, Jianwei Yang
Keywords: offer promising generalist, promising generalist policies, handling complex tasks, datasets offer promising, visual trace prompting
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments:

Abstract:Although large vision-language-action (VLA) models pretrained on extensive robot datasets offer promising generalist policies for robotic learning, they still struggle with spatial-temporal dynamics in interactive robotics, making them less effective in handling complex tasks, such as manipulation. In this work, we introduce visual trace prompting, a simple yet effective approach to facilitate VLA models’ spatial-temporal awareness for action prediction by encoding state-action trajectories visually. We develop a new TraceVLA model by finetuning OpenVLA on our own collected dataset of 150K robot manipulation trajectories using visual trace prompting. Evaluations of TraceVLA across 137 configurations in SimplerEnv and 4 tasks on a physical WidowX robot demonstrate state-of-the-art performance, outperforming OpenVLA by 10% on SimplerEnv and 3.5x on real-robot tasks and exhibiting robust generalization across diverse embodiments and scenarios. To further validate the effectiveness and generality of our method, we present a compact VLA model based on 4B Phi-3-Vision, pretrained on the Open-X-Embodiment and finetuned on our dataset, which rivals the 7B OpenVLA baseline while significantly improving inference efficiency.

[AI-2] Generative AI in Medicine

Link: https://arxiv.org/abs/2412.10337
Authors: Divya Shanmugam, Monica Agrawal, Rajiv Movva, Irene Y. Chen, Marzyeh Ghassemi, Emma Pierson
Keywords: increased capabilities, dramatically expanded, capabilities of generative, clinical trial organizers, cases in medicine
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Human-Computer Interaction (cs.HC)
Comments: To appear in the Annual Review of Biomedical Data Science, August 2025

Abstract:The increased capabilities of generative AI have dramatically expanded its possible use cases in medicine. We provide a comprehensive overview of generative AI use cases for clinicians, patients, clinical trial organizers, researchers, and trainees. We then discuss the many challenges – including maintaining privacy and security, improving transparency and interpretability, upholding equity, and rigorously evaluating models – which must be overcome to realize this potential, and the open research directions they give rise to.

[AI-3] MeshA*: Efficient Path Planing With Motion Primitives

Link: https://arxiv.org/abs/2412.10320
Authors: Marat Agranovskiy, Konstantin Yakovlev
Keywords: motion primitives aligned, path planning problem, move actions, finite set, motion primitives
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments:

Abstract:We study a path planning problem where the possible move actions are represented as a finite set of motion primitives aligned with the grid representation of the environment. That is, each primitive corresponds to a short kinodynamically-feasible motion of an agent and is represented as a sequence of the swept cells of a grid. Typically heuristic search, i.e. A*, is conducted over the lattice induced by these primitives (lattice-based planning) to find a path. However due to the large branching factor such search may be inefficient in practice. To this end we suggest a novel technique rooted in the idea of searching over the grid cells (as in vanilla A*) simultaneously fitting the possible sequences of the motion primitives into these cells. The resultant algorithm, MeshA*, provably preserves the guarantees on completeness and optimality, on the one hand, and is shown to notably outperform conventional lattice-based planning (x1.5 decrease in the runtime), on the other hand. Moreover, we suggest an additional pruning technique that additionally decreases the search space of MeshA*. The resultant planner is combined with the regular A* to retain completeness and is shown to further increase the search performance at the cost of negligible decrease of the solution quality.
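
For context, the "vanilla A*" over grid cells that the abstract uses as its starting point can be sketched as follows. This is only the baseline search on a 4-connected grid with unit step costs and a Manhattan heuristic. It does not implement MeshA*, motion primitives, or lattice-based planning, and the grid encoding (`#` for blocked cells) is an assumption made for the example.

```python
import heapq
from typing import Dict, List, Optional, Tuple

Cell = Tuple[int, int]


def astar(grid: List[str], start: Cell, goal: Cell) -> Optional[List[Cell]]:
    """4-connected A* with unit costs; returns a shortest path or None."""
    rows, cols = len(grid), len(grid[0])

    def h(cell: Cell) -> int:
        # Manhattan distance: admissible and consistent on a 4-connected grid.
        return abs(cell[0] - goal[0]) + abs(cell[1] - goal[1])

    open_heap = [(h(start), 0, start)]  # entries are (f, g, cell)
    g_cost: Dict[Cell, int] = {start: 0}
    parent: Dict[Cell, Cell] = {}

    while open_heap:
        _, g, cell = heapq.heappop(open_heap)
        if cell == goal:
            path = [cell]
            while cell in parent:  # walk back to the start
                cell = parent[cell]
                path.append(cell)
            return path[::-1]
        if g > g_cost[cell]:
            continue  # stale heap entry, a cheaper route was already found
        r, c = cell
        for nxt in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            nr, nc = nxt
            if 0 <= nr < rows and 0 <= nc < cols and grid[nr][nc] != "#":
                new_g = g + 1
                if new_g < g_cost.get(nxt, float("inf")):
                    g_cost[nxt] = new_g
                    parent[nxt] = cell
                    heapq.heappush(open_heap, (new_g + h(nxt), new_g, nxt))
    return None
```

Each grid cell here expands at most four neighbours. Per the abstract, a lattice planner expands one successor per applicable motion primitive instead, and that larger branching factor is what MeshA* is designed to reduce.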

[AI-4] Envisioning National Resources for Artificial Intelligence Research: NSF Workshop Report

Link: https://arxiv.org/abs/2412.10278
Authors: Shantenu Jha, Yolanda Gil
Keywords: Artificial Intelligence Research, Envisioning National Resources, NSF workshop titled, Artificial Intelligence, held in Alexandria
Categories: Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC); Emerging Technologies (cs.ET)
Comments:

Abstract:This is a report of an NSF workshop titled “Envisioning National Resources for Artificial Intelligence Research” held in Alexandria, Virginia, in May 2024. The workshop aimed to identify initial challenges and opportunities for national resources for AI research (e.g., compute, data, models, etc.) and to facilitate planning for the envisioned National AI Research Resource. Participants included AI and cyberinfrastructure (CI) experts. The report outlines significant findings and identifies needs and recommendations from the workshop.

[AI-5] Trustworthy and Explainable Decision-Making for Workforce allocation

Link: https://arxiv.org/abs/2412.10272
Authors: Guillaume Povéda, Ryma Boumazouza, Andreas Strahl, Mark Hall, Santiago Quintana-Amate, Nahum Alvarez, Ignace Bleukx, Dimos Tsouros, Hélène Verhaeghe, Tias Guns
Keywords: effective workforce allocation, operational efficiency, crucial for operational, effective workforce, workforce allocation
Categories: Artificial Intelligence (cs.AI)
Comments: Accepted for presentation at PTHG-24: The Seventh Workshop on Progress Towards the Holy Grail, part of the 30th International Conference on Principles and Practice of Constraint Programming. For more details, visit the workshop webpage: this https URL

Abstract:In industrial contexts, effective workforce allocation is crucial for operational efficiency. This paper presents an ongoing project focused on developing a decision-making tool designed for workforce allocation, emphasising the explainability to enhance its trustworthiness. Our objective is to create a system that not only optimises the allocation of teams to scheduled tasks but also provides clear, understandable explanations for its decisions, particularly in cases where the problem is infeasible. By incorporating human-in-the-loop mechanisms, the tool aims to enhance user trust and facilitate interactive conflict resolution. We implemented our approach on a prototype tool/digital demonstrator intended to be evaluated on a real industrial scenario both in terms of performance and user acceptability.

[AI-6] Cultural Evolution of Cooperation among LLM Agents

Link: https://arxiv.org/abs/2412.10270
Authors: Aron Vallinder, Edward Hughes
Keywords: Large language models, Large language, LLM agents, LLM agent deployment, multiple LLM agents
Categories: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI)
Comments: 15 pages, 6 figures

Abstract:Large language models (LLMs) provide a compelling foundation for building generally-capable AI agents. These agents may soon be deployed at scale in the real world, representing the interests of individual humans (e.g., AI assistants) or groups of humans (e.g., AI-accelerated corporations). At present, relatively little is known about the dynamics of multiple LLM agents interacting over many generations of iterative deployment. In this paper, we examine whether a “society” of LLM agents can learn mutually beneficial social norms in the face of incentives to defect, a distinctive feature of human sociality that is arguably crucial to the success of civilization. In particular, we study the evolution of indirect reciprocity across generations of LLM agents playing a classic iterated Donor Game in which agents can observe the recent behavior of their peers. We find that the evolution of cooperation differs markedly across base models, with societies of Claude 3.5 Sonnet agents achieving significantly higher average scores than Gemini 1.5 Flash, which, in turn, outperforms GPT-4o. Further, Claude 3.5 Sonnet can make use of an additional mechanism for costly punishment to achieve yet higher scores, while Gemini 1.5 Flash and GPT-4o fail to do so. For each model class, we also observe variation in emergent behavior across random seeds, suggesting an understudied sensitive dependence on initial conditions. We suggest that our evaluation regime could inspire an inexpensive and informative new class of LLM benchmarks, focussed on the implications of LLM agent deployment for the cooperative infrastructure of society.

[AI-7] Does Multiple Choice Have a Future in the Age of Generative AI? A Posttest-only RCT

Link: https://arxiv.org/abs/2412.10267
Authors: Danielle R. Thomas, Conrad Borchers, Sanjit Kakarla, Jionghao Lin, Shambhavi Bhushan, Boyuan Guo, Erin Gatz, Kenneth R. Koedinger
Keywords: role of multiple-choice, debated in past, MCQs, multiple-choice questions, open response
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
Comments: Full research paper accepted to Learning Analytics and Knowledge (LAK 2025)

Abstract:The role of multiple-choice questions (MCQs) as effective learning tools has been debated in past research. While MCQs are widely used due to their ease in grading, open response questions are increasingly used for instruction, given advances in large language models (LLMs) for automated grading. This study evaluates MCQs effectiveness relative to open-response questions, both individually and in combination, on learning. These activities are embedded within six tutor lessons on advocacy. Using a posttest-only randomized control design, we compare the performance of 234 tutors (790 lesson completions) across three conditions: MCQ only, open response only, and a combination of both. We find no significant learning differences across conditions at posttest, but tutors in the MCQ condition took significantly less time to complete instruction. These findings suggest that MCQs are as effective, and more efficient, than open response tasks for learning when practice time is limited. To further enhance efficiency, we autograded open responses using GPT-4o and GPT-4-turbo. GPT models demonstrate proficiency for purposes of low-stakes assessment, though further research is needed for broader use. This study contributes a dataset of lesson log data, human annotation rubrics, and LLM prompts to promote transparency and reproducibility.

[AI-8] Exploring the Frontiers of Animation Video Generation in the Sora Era: Method, Dataset and Benchmark

Link: https://arxiv.org/abs/2412.10255
Authors: Yudong Jiang, Baohan Xu, Siqian Yang, Mingyu Yin, Jing Liu, Chao Xu, Siqi Wang, Yidi Wu, Bingwen Zhu, Jixuan Xu, Yue Zhang, Jinlong Hou, Huyang Sun
Keywords: animation video generation, gained significant interest, video generation, animation video, Animation
Categories: Graphics (cs.GR); Artificial Intelligence (cs.AI)

Abstract:Animation has gained significant interest in the recent film and TV industry. Despite the success of advanced video generation models like Sora, Kling, and CogVideoX in generating natural videos, they lack the same effectiveness in handling animation videos. Evaluating animation video generation is also a great challenge due to its unique artistic styles, violations of the laws of physics, and exaggerated motions. In this paper, we present a comprehensive system, AniSora, designed for animation video generation, which includes a data processing pipeline, a controllable generation model, and an evaluation dataset. Supported by the data processing pipeline with over 10M high-quality data, the generation model incorporates a spatiotemporal mask module to facilitate key animation production functions such as image-to-video generation, frame interpolation, and localized image-guided animation. We also collect an evaluation benchmark of 948 various animation videos; the evaluation on VBench and human double-blind test demonstrates consistency in character and motion, achieving state-of-the-art results in animation video generation. Our model access API and evaluation benchmark will be publicly available.

[AI-9] From Allies to Adversaries: Manipulating LLM Tool-Calling through Adversarial Injection

Link: https://arxiv.org/abs/2412.10198
Authors: Haowei Wang, Rupeng Zhang, Junjie Wang, Mingyang Li, Yuekai Huang, Dandan Wang, Qing Wang
Keywords: Large Language Model, changed Large Language, Language Model, Large Language, changed Large
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)

Abstract:Tool-calling has changed Large Language Model (LLM) applications by integrating external tools, significantly enhancing their functionality across diverse tasks. However, this integration also introduces new security vulnerabilities, particularly in the tool scheduling mechanisms of LLMs, which have not been extensively studied. To fill this gap, we present ToolCommander, a novel framework designed to exploit vulnerabilities in LLM tool-calling systems through adversarial tool injection. Our framework employs a well-designed two-stage attack strategy. First, it injects malicious tools to collect user queries; then it dynamically updates the injected tools based on the stolen information to enhance subsequent attacks. These stages enable ToolCommander to execute privacy theft, launch denial-of-service attacks, and even manipulate business competition by triggering unscheduled tool-calling. Notably, the attack success rate (ASR) reaches 91.67% for privacy theft and hits 100% for denial-of-service and unscheduled tool calling in certain cases. Our work demonstrates that these vulnerabilities can lead to severe consequences beyond simple misuse of tool-calling systems, underscoring the urgent need for robust defensive strategies to secure LLM tool-calling systems.

[AI-10] BiCert: A Bilinear Mixed Integer Programming Formulation for Precise Certified Bounds Against Data Poisoning Attacks

Link: https://arxiv.org/abs/2412.10186
Authors: Tobias Lorenz, Marta Kwiatkowska, Mario Fritz
Keywords: poisoning attacks pose, necessitating robust defenses, Data poisoning attacks, modern AI systems, necessitating robust
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:Data poisoning attacks pose one of the biggest threats to modern AI systems, necessitating robust defenses. While extensive efforts have been made to develop empirical defenses, attackers continue to evolve, creating sophisticated methods to circumvent these measures. To address this, we must move beyond empirical defenses and establish provable certification methods that guarantee robustness. This paper introduces a novel certification approach, BiCert, using Bilinear Mixed Integer Programming (BMIP) to compute sound deterministic bounds that provide such provable robustness. Using BMIP, we compute the reachable set of parameters that could result from training with potentially manipulated data. A key element to make this computation feasible is to relax the reachable parameter set to a convex set between training iterations. At test time, this parameter set allows us to predict all possible outcomes, guaranteeing robustness. BiCert is more precise than previous methods, which rely solely on interval and polyhedral bounds. Crucially, our approach overcomes the fundamental limitation of prior approaches where parameter bounds could only grow, often uncontrollably. We show that BiCert’s tighter bounds eliminate a key source of divergence issues, resulting in more stable training and higher certified accuracy.

[AI-11] Solving Robust Markov Decision Processes: Generic, Reliable, Efficient AAAI’25

Link: https://arxiv.org/abs/2412.10185
Authors: Tobias Meggendorfer, Maximilian Weininger, Patrick Wienhöft
Keywords: Markov decision processes, Markov decision, decision processes, sequential decision-making, robust MDP
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Accepted for publication at AAAI’25. Extended version with full appendix, 26 pages

Abstract:Markov decision processes (MDP) are a well-established model for sequential decision-making in the presence of probabilities. In robust MDP (RMDP), every action is associated with an uncertainty set of probability distributions, modelling that transition probabilities are not known precisely. Based on the known theoretical connection to stochastic games, we provide a framework for solving RMDPs that is generic, reliable, and efficient. It is generic both with respect to the model, allowing for a wide range of uncertainty sets, including but not limited to intervals, L^1- or L^2-balls, and polytopes; and with respect to the objective, including long-run average reward, undiscounted total reward, and stochastic shortest path. It is reliable, as our approach not only converges in the limit, but provides precision guarantees at any time during the computation. It is efficient because – in contrast to state-of-the-art approaches – it avoids explicitly constructing the underlying stochastic game. Consequently, our prototype implementation outperforms existing tools by several orders of magnitude and can solve RMDPs with a million states in under a minute.
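
To make the idea of an uncertainty set concrete, the sketch below runs textbook robust value iteration on an RMDP with interval uncertainty. It is not the paper's algorithm: it uses a discounted objective (which the abstract does not list), gives no anytime precision guarantee, and the data layout and function names are assumptions chosen for illustration.

```python
def worst_case_expectation(intervals, values):
    """Minimise the expected value over all distributions inside the intervals.

    intervals: dict mapping successor -> (lower, upper) probability bounds.
    """
    # Start from the lower bounds, then hand the remaining probability mass
    # to the lowest-valued successors first (greedy, optimal for intervals).
    probs = {s: lo for s, (lo, _hi) in intervals.items()}
    remaining = 1.0 - sum(probs.values())
    for s in sorted(intervals, key=lambda s: values[s]):
        lo, hi = intervals[s]
        extra = min(hi - lo, remaining)
        probs[s] += extra
        remaining -= extra
    return sum(p * values[s] for s, p in probs.items())

def robust_value_iteration(mdp, rewards, gamma=0.9, sweeps=200):
    """mdp: state -> action -> {successor: (lower, upper)}; rewards: state -> float."""
    values = {s: 0.0 for s in mdp}
    for _ in range(sweeps):
        # The agent maximises over actions, the adversary minimises over distributions.
        values = {
            s: rewards[s] + gamma * max(
                worst_case_expectation(iv, values) for iv in actions.values()
            )
            for s, actions in mdp.items()
        }
    return values
```

The inner minimisation is where the adversary picks the transition probabilities, which is the link to stochastic games that the abstract mentions.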

[AI-12] Scaling Combinatorial Optimization Neural Improvement Heuristics with Online Search and Adaptation

Link: https://arxiv.org/abs/2412.10163
Authors: Federico Julian Camerota Verdù, Lorenzo Castelli, Luca Bortolussi
Keywords: Limited Rollout Beam, introduce Limited Rollout, Rollout Beam Search, deep reinforcement learning, based combinatorial optimization
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:We introduce Limited Rollout Beam Search (LRBS), a beam search strategy for deep reinforcement learning (DRL) based combinatorial optimization improvement heuristics. Utilizing pre-trained models on the Euclidean Traveling Salesperson Problem, LRBS significantly enhances both in-distribution performance and generalization to larger problem instances, achieving optimality gaps that outperform existing improvement heuristics and narrowing the gap with state-of-the-art constructive methods. We also extend our analysis to two pickup and delivery TSP variants to validate our results. Finally, we employ our search strategy for offline and online adaptation of the pre-trained improvement policy, leading to improved search performance and surpassing recent adaptive methods for constructive heuristics.

[AI-13] Direct Encoding of Declare Constraints in ASP

Link: https://arxiv.org/abs/2412.10152
Authors: Francesco Chiariello, Valeria Fionda, Antonio Ielo, Francesco Ricca
Keywords: Answer Set Programming, recently found practical, found practical application, Answer Set, logic programming paradigm
Categories: Logic in Computer Science (cs.LO); Artificial Intelligence (cs.AI)
Comments: Under consideration in Theory and Practice of Logic Programming (TPLP)

Abstract:Answer Set Programming (ASP), a well-known declarative logic programming paradigm, has recently found practical application in Process Mining. In particular, ASP has been used to model tasks involving declarative specifications of business processes. In this area, Declare stands out as the most widely adopted declarative process modeling language, offering a means to model processes through sets of constraints that valid traces must satisfy and that can be expressed in Linear Temporal Logic over Finite Traces (LTLf). Existing ASP-based solutions encode Declare constraints by modeling the corresponding LTLf formula or its equivalent automaton, which can be obtained using established techniques. In this paper, we introduce a novel encoding for Declare constraints that directly models their semantics as ASP rules, eliminating the need for intermediate representations. We assess the effectiveness of this novel approach on two Process Mining tasks by comparing it with alternative ASP encodings and a Python library for Declare.
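
As a rough illustration of what checking a Declare constraint "directly by its semantics" means, the sketch below tests two common Declare templates on a finite trace. It is plain Python rather than the paper's ASP encoding, the function names are hypothetical, and it assumes the two activities are distinct.

```python
def response(trace, a, b):
    """Declare response(a, b): every occurrence of a is eventually followed by b."""
    pending = False  # True while some a is still waiting for a later b
    for event in trace:
        if event == a:
            pending = True
        elif event == b:
            pending = False
    return not pending

def precedence(trace, a, b):
    """Declare precedence(a, b): b may occur only after a has occurred."""
    seen_a = False
    for event in trace:
        if event == a:
            seen_a = True
        elif event == b and not seen_a:
            return False
    return True
```

Both checks make a single pass over the trace with one bit of state, without building an LTLf formula or an automaton first.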

[AI-14] You Name It, I Run It: An LLM Agent to Execute Tests of Arbitrary Projects

Link: https://arxiv.org/abs/2412.10133
Authors: Islem Bouzenia, Michael Pradel
Keywords: assess code quality, test suite, code coverage, assess code, code quality
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)

Abstract:The ability to execute the test suite of a project is essential in many scenarios, e.g., to assess code quality and code coverage, to validate code changes made by developers or automated tools, and to ensure compatibility with dependencies. Despite its importance, executing the test suite of a project can be challenging in practice because different projects use different programming languages, software ecosystems, build systems, testing frameworks, and other tools. These challenges make it difficult to create a reliable, universal test execution method that works across different projects. This paper presents ExecutionAgent, an automated technique that installs arbitrary projects, configures them to run test cases, and produces project-specific scripts to reproduce the setup. Inspired by the way a human developer would address this task, our approach is a large language model-based agent that autonomously executes commands and interacts with the host system. The agent uses meta-prompting to gather guidelines on the latest technologies related to the given project, and it iteratively refines its process based on feedback from the previous steps. Our evaluation applies ExecutionAgent to 50 open-source projects that use 14 different programming languages and many different build and testing tools. The approach successfully executes the test suites of 33/55 projects, while matching the test results of ground truth test suite executions with a deviation of only 7.5%. These results improve over the best previously available technique by 6.6x. The costs imposed by the approach are reasonable, with an execution time of 74 minutes and LLM costs of 0.16 dollars, on average per project. We envision ExecutionAgent to serve as a valuable tool for developers, automated programming tools, and researchers that need to execute tests across a wide variety of projects.

[AI-15] CosyVoice 2: Scalable Streaming Speech Synthesis with Large Language Models

Link: https://arxiv.org/abs/2412.10117
Authors: Zhihao Du, Yuxuan Wang, Qian Chen, Xian Shi, Xiang Lv, Tianyu Zhao, Zhifu Gao, Yexin Yang, Changfeng Gao, Hui Wang, Fan Yu, Huadai Liu, Zhengyan Sheng, Yue Gu, Chong Deng, Wen Wang, Shiliang Zhang, Zhijie Yan, Jingren Zhou
Keywords: supervised discrete speech, discrete speech tokens, previous work, based on supervised, supervised discrete
Categories: Sound (cs.SD); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
Comments: Tech report, work in progress

Abstract:In our previous work, we introduced CosyVoice, a multilingual speech synthesis model based on supervised discrete speech tokens. By employing progressive semantic decoding with two popular generative models, language models (LMs) and Flow Matching, CosyVoice demonstrated high prosody naturalness, content consistency, and speaker similarity in speech in-context learning. Recently, significant progress has been made in multi-modal large language models (LLMs), where the response latency and real-time factor of speech synthesis play a crucial role in the interactive experience. Therefore, in this report, we present an improved streaming speech synthesis model, CosyVoice 2, which incorporates comprehensive and systematic optimizations. Specifically, we introduce finite-scalar quantization to improve the codebook utilization of speech tokens. For the text-speech LM, we streamline the model architecture to allow direct use of a pre-trained LLM as the backbone. In addition, we develop a chunk-aware causal flow matching model to support various synthesis scenarios, enabling both streaming and non-streaming synthesis within a single model. By training on a large-scale multilingual dataset, CosyVoice 2 achieves human-parity naturalness, minimal response latency, and virtually lossless synthesis quality in the streaming mode. We invite readers to listen to the demos at this https URL.
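
Finite scalar quantization can be sketched in a few lines. The version below follows the general technique (bound each dimension, then round it to a small fixed number of levels) and is not taken from CosyVoice 2; the level counts, the tanh bounding, and the function names are assumptions, and the code only handles odd level counts.

```python
import math

def fsq_quantize(vector, levels):
    """Round each dimension to one of `levels[i]` integer codes centred on zero.

    Assumes every entry of `levels` is odd, so codes run from -h to h
    with h = (level - 1) / 2.
    """
    half = [(level - 1) / 2 for level in levels]
    # tanh squashes each value into (-1, 1) before scaling and rounding.
    return [round(h * math.tanh(x)) for x, h in zip(vector, half)]

def fsq_index(codes, levels):
    """Map per-dimension codes to a single token id (mixed-radix encoding)."""
    index, base = 0, 1
    for code, level in zip(codes, levels):
        index += (code + (level - 1) // 2) * base
        base *= level
    return index
```

Because the codebook is the full grid of level combinations (3 x 5 x 5 = 75 entries for levels of 3, 5 and 5), every code is reachable by construction, which is the intuition behind better codebook utilization.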

[AI-16] NetOrchLLM: Mastering Wireless Network Orchestration with Large Language Models

Link: https://arxiv.org/abs/2412.10107
Authors: Asmaa Abdallah, Abdullatif Albaseer, Abdulkadir Celik, Mohamed Abdallah, Ahmed M. Eltawil
Keywords: increased data rates, networks promises unprecedented, promises unprecedented advancements, ultra-low latency, data rates
Categories: Networking and Internet Architecture (cs.NI); Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET); Machine Learning (cs.LG)

Abstract:The transition to 6G networks promises unprecedented advancements in wireless communication, with increased data rates, ultra-low latency, and enhanced capacity. However, the complexity of managing and optimizing these next-generation networks presents significant challenges. The advent of large language models (LLMs) has revolutionized various domains by leveraging their sophisticated natural language understanding capabilities. However, the practical application of LLMs in wireless network orchestration and management remains largely unexplored. Existing literature predominantly offers visionary perspectives without concrete implementations, leaving a significant gap in the field. To address this gap, this paper presents NETORCHLLM, a wireless NETwork ORCHestrator LLM framework that uses LLMs to seamlessly orchestrate diverse wireless-specific models from wireless communication communities using their language understanding and generation capabilities. A comprehensive framework is introduced, demonstrating the practical viability of our approach and showcasing how LLMs can be effectively harnessed to optimize dense network operations, manage dynamic environments, and improve overall network performance. NETORCHLLM bridges the theoretical aspirations of prior research with practical, actionable solutions, paving the way for future advancements in integrating generative AI technologies within the wireless communications sector.

[AI-17] Panacea: Novel DNN Accelerator using Accuracy-Preserving Asymmetric Quantization and Energy-Saving Bit-Slice Sparsity HPCA2025

Link: https://arxiv.org/abs/2412.10059
Authors: Dongyun Kam, Myeongji Yun, Sunwoo Yoo, Seungwoo Hong, Zhengya Zhang, Youngjoo Lee
Keywords: accelerate general matrix-multiplications, deep neural network, large-scale deep neural, Low bit-precisions, large-scale DNN inferences
Categories: Hardware Architecture (cs.AR); Artificial Intelligence (cs.AI)
Comments: 15 pages, 20 figures, Accepted to HPCA 2025

Abstract:Low bit-precisions and their bit-slice sparsity have recently been studied to accelerate general matrix-multiplications (GEMM) during large-scale deep neural network (DNN) inferences. While the conventional symmetric quantization facilitates low-resolution processing with bit-slice sparsity for both weight and activation, the accuracy loss caused by the activation’s asymmetric distributions is unacceptable, especially for large-scale DNNs. In efforts to mitigate this accuracy loss, recent studies have actively utilized asymmetric quantization for activations without requiring additional operations. However, the cutting-edge asymmetric quantization produces numerous nonzero slices that cannot be compressed and skipped by recent bit-slice GEMM accelerators, naturally consuming more processing energy to handle the quantized DNN models. To simultaneously achieve high accuracy and hardware efficiency for large-scale DNN inferences, this paper proposes an Asymmetrically-Quantized bit-Slice GEMM (AQS-GEMM) for the first time. In contrast to the previous bit-slice computing, which only skips operations of zero slices, the AQS-GEMM compresses frequent nonzero slices, generated by asymmetric quantization, and skips their operations. To increase the slice-level sparsity of activations, we also introduce two algorithm-hardware co-optimization methods: a zero-point manipulation and a distribution-based bit-slicing. To support the proposed AQS-GEMM and optimizations at the hardware level, we introduce a new DNN accelerator, Panacea, which efficiently handles sparse/dense workloads of the tiled AQS-GEMM to increase data reuse and utilization. Panacea supports a specialized dataflow and run-length encoding to maximize data reuse and minimize external memory accesses, significantly improving its hardware efficiency. Our benchmark evaluations show Panacea outperforms existing DNN accelerators.
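
The problem the abstract describes, asymmetric quantization producing many nonzero slices, can be seen in a toy example. The sketch below uses a generic min/max asymmetric quantizer and a plain high/low bit split; the helper names, the 8-bit width, and the 4-bit slice size are assumptions and do not reproduce the AQS-GEMM design.

```python
def asymmetric_quantize(values, bits=8):
    """Map real values to unsigned integers using a scale and a zero point."""
    lo, hi = min(values), max(values)
    qmax = (1 << bits) - 1
    scale = (hi - lo) / qmax or 1.0  # avoid a zero scale for constant inputs
    zero_point = round(-lo / scale)  # integer code that represents real 0.0
    codes = [min(qmax, max(0, round(v / scale) + zero_point)) for v in values]
    return codes, scale, zero_point

def bit_slices(code, slice_bits=4):
    """Split one unsigned code into (high slice, low slice)."""
    mask = (1 << slice_bits) - 1
    return code >> slice_bits, code & mask
```

For the inputs -1.0, 0.0 and 3.0, the real value 0.0 lands on code 64, whose high slice is 4 rather than 0. An accelerator that only skips zero slices gains nothing from such values, even when zeros dominate the activations.
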

[AI-18] Large Action Models: From Inception to Implementation

Link: https://arxiv.org/abs/2412.10047
Authors: Lu Wang, Fangkai Yang, Chaoyun Zhang, Junting Lu, Jiaxu Qian, Shilin He, Pu Zhao, Bo Qiao, Ray Huang, Si Qin, Qisheng Su, Jiayi Ye, Yudi Zhang, Jian-Guang Lou, Qingwei Lin, Saravan Rajmohan, Dongmei Zhang, Qi Zhang
Keywords: Large Action Models, intelligent agents capable, Large Language Models, performing real-world actions, continues to advance
Categories: Artificial Intelligence (cs.AI)
Comments: 25 pages, 12 figures

Abstract:As AI continues to advance, there is a growing demand for systems that go beyond language-based assistance and move toward intelligent agents capable of performing real-world actions. This evolution requires the transition from traditional Large Language Models (LLMs), which excel at generating textual responses, to Large Action Models (LAMs), designed for action generation and execution within dynamic environments. Enabled by agent systems, LAMs hold the potential to transform AI from passive language understanding to active task completion, marking a significant milestone in the progression toward artificial general intelligence. In this paper, we present a comprehensive framework for developing LAMs, offering a systematic approach to their creation, from inception to deployment. We begin with an overview of LAMs, highlighting their unique characteristics and delineating their differences from LLMs. Using a Windows OS-based agent as a case study, we provide a detailed, step-by-step guide on the key stages of LAM development, including data collection, model training, environment integration, grounding, and evaluation. This generalizable workflow can serve as a blueprint for creating functional LAMs in various application domains. We conclude by identifying the current limitations of LAMs and discussing directions for future research and industrial deployment, emphasizing the challenges and opportunities that lie ahead in realizing the full potential of LAMs in real-world applications. The code for the data collection process utilized in this paper is publicly available at: this https URL, and comprehensive documentation can be found at this https URL.

[AI-19] Enhanced Speech Emotion Recognition with Efficient Channel Attention Guided Deep CNN-BiLSTM Framework

Link: https://arxiv.org/abs/2412.10011
Authors: Niloy Kumar Kundu, Sarah Kobir, Md. Rayhan Ahmed, Tahmina Aktar, Niloya Roy
Keywords: enhancing affective computing, Speech emotion recognition, emotion recognition, human-computer interaction, speech signals
Categories: Sound (cs.SD); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
Comments: 42 pages, 10 figures

Abstract:Speech emotion recognition (SER) is crucial for enhancing affective computing and enriching the domain of human-computer interaction. However, the main challenge in SER lies in selecting relevant feature representations from speech signals with lower computational costs. In this paper, we propose a lightweight SER architecture that integrates attention-based local feature blocks (ALFBs) to capture high-level relevant feature vectors from speech signals. We also incorporate a global feature block (GFB) technique to capture sequential, global information and long-term dependencies in speech signals. By aggregating attention-based local and global contextual feature vectors, our model effectively captures the internal correlation between salient features that reflect complex human emotional cues. To evaluate our approach, we extracted four types of spectral features from speech audio samples: mel-frequency cepstral coefficients, mel-spectrogram, root mean square value, and zero-crossing rate. Through a 5-fold cross-validation strategy, we tested the proposed method on five multi-lingual standard benchmark datasets: TESS, RAVDESS, BanglaSER, SUBESCO, and Emo-DB, and obtained a mean accuracy of 99.65%, 94.88%, 98.12%, 97.94%, and 97.19% respectively. The results indicate that our model achieves state-of-the-art (SOTA) performance compared to most existing methods.
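
Two of the four spectral features named above, the root mean square value and the zero-crossing rate, are simple enough to compute by hand. The sketch below works on one frame of raw samples using standard definitions; it is not the authors' extraction code, and the function names are illustrative.

```python
import math

def rms(frame):
    """Root mean square energy of one frame of audio samples."""
    return math.sqrt(sum(x * x for x in frame) / len(frame))

def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ."""
    crossings = sum(
        1 for prev, cur in zip(frame, frame[1:]) if (prev >= 0) != (cur >= 0)
    )
    return crossings / (len(frame) - 1)
```

In practice these values are computed per short frame and stacked over time, alongside MFCCs and the mel-spectrogram, to form the model input.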

[AI-20] One Filter to Deploy Them All: Robust Safety for Quadrupedal Navigation in Unknown Environments

Link: https://arxiv.org/abs/2412.09989
Authors: Albert Lin, Shuang Peng, Somil Bansal
Keywords: legged robots rapidly, robots rapidly grow, safety assurances efficiently, grow in popularity, learning-based methods
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Systems and Control (eess.SY)
Comments: Project website: this https URL

Abstract:As learning-based methods for legged robots rapidly grow in popularity, it is important that we can provide safety assurances efficiently across different controllers and environments. Existing works either rely on a priori knowledge of the environment and safety constraints to ensure system safety or provide assurances for a specific locomotion policy. To address these limitations, we propose an observation-conditioned reachability-based (OCR) safety-filter framework. Our key idea is to use an OCR value network (OCR-VN) that predicts the optimal control-theoretic safety value function for new failure regions and dynamic uncertainty during deployment time. Specifically, the OCR-VN facilitates rapid safety adaptation through two key components: a LiDAR-based input that allows the dynamic construction of safe regions in light of new obstacles and a disturbance estimation module that accounts for dynamics uncertainty in the wild. The predicted safety value function is used to construct an adaptive safety filter that overrides the nominal quadruped controller when necessary to maintain safety. Through simulation studies and hardware experiments on a Unitree Go1 quadruped, we demonstrate that the proposed framework can automatically safeguard a wide range of hierarchical quadruped controllers, adapts to novel environments, and is robust to unmodeled dynamics without a priori access to the controllers or environments - hence, “One Filter to Deploy Them All”. The experiment videos can be found on the project website.
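
The override behaviour of a value-based safety filter can be sketched as follows. This is a generic least-restrictive filter, not the OCR-VN framework: the value function, the fallback controller, the zero threshold, and the toy one-dimensional robot are all hypothetical stand-ins.

```python
def safety_filter(state, nominal_action, value_fn, fallback_fn, threshold=0.0):
    """Pass the nominal action through while the predicted safety value is above threshold."""
    if value_fn(state) > threshold:
        return nominal_action
    return fallback_fn(state)  # override only when safety is at risk

# Toy 1D example: the state is the distance (in metres) to the nearest obstacle.
def toy_value(distance, margin=0.5):
    return distance - margin  # positive while outside the safety margin

def toy_fallback(_distance):
    return 0.0  # stop: commanded forward velocity of zero
```

With these stand-ins, a nominal forward command of 1.0 is kept at a distance of 2.0 m and replaced by 0.0 at a distance of 0.3 m. The filter never needs to know how the nominal controller works, which is why one filter can wrap many controllers.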

[AI-21] AI and the Future of Digital Public Squares

Link: https://arxiv.org/abs/2412.09988
Authors: Beth Goldberg, Diana Acosta-Navas, Michiel Bakker, Ian Beacock, Matt Botvinick, Prateek Buch, Renée DiResta, Nandika Donthi, Nathanael Fast, Ravi Iyer, Zaria Jalan, Andrew Konya, Grace Kwak Danciu, Hélène Landemore, Alice Marwick, Carl Miller, Aviv Ovadya, Emily Saltz, Lisa Schirch, Dalit Shalom, Divya Siddarth, Felix Sieker, Christopher Small, Jonathan Stray, Audrey Tang, Michael Henry Tessler, Amy Zhang
Keywords: large language models, substantial technological advances, digital public squares, recent decades, language models
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
Comments: 40 pages, 5 figures

Abstract:Two substantial technological advances have reshaped the public square in recent decades: first with the advent of the internet and second with the recent introduction of large language models (LLMs). LLMs offer opportunities for a paradigm shift towards more decentralized, participatory online spaces that can be used to facilitate deliberative dialogues at scale, but also create risks of exacerbating societal schisms. Here, we explore four applications of LLMs to improve digital public squares: collective dialogue systems, bridging systems, community moderation, and proof-of-humanity systems. Building on the input from over 70 civil society experts and technologists, we argue that LLMs both afford promising opportunities to shift the paradigm for conversations at scale and pose distinct risks for digital public squares. We lay out an agenda for future research and investments in AI that will strengthen digital public squares and safeguard against potential misuses of AI.

[AI-22] Efficient Large-Scale Traffic Forecasting with Transformers: A Spatial Data Management Perspective KDD2025

Link: https://arxiv.org/abs/2412.09972
Authors: Yuchen Fang, Yuxuan Liang, Bo Hui, Zezhi Shao, Liwei Deng, Xu Liu, Xinke Jiang, Kai Zheng
Keywords: real-world intelligent transportation, intelligent transportation scenarios, Road traffic forecasting, Road traffic, personal traveling
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Accepted by SIGKDD 2025

Abstract:Road traffic forecasting is crucial in real-world intelligent transportation scenarios like traffic dispatching and path planning in city management and personal traveling. Spatio-temporal graph neural networks (STGNNs) stand out as the mainstream solution in this task. Nevertheless, the quadratic complexity of remarkable dynamic spatial modeling-based STGNNs has become the bottleneck over large-scale traffic data. From the spatial data management perspective, we present a novel Transformer framework called PatchSTG to efficiently and dynamically model spatial dependencies for large-scale traffic forecasting with interpretability and fidelity. Specifically, we design a novel irregular spatial patching to reduce the number of points involved in the dynamic calculation of Transformer. The irregular spatial patching first utilizes the leaf K-dimensional tree (KDTree) to recursively partition irregularly distributed traffic points into leaf nodes with a small capacity, and then merges leaf nodes belonging to the same subtree into occupancy-equaled and non-overlapped patches through padding and backtracking. Based on the patched data, depth and breadth attention are used interchangeably in the encoder to dynamically learn local and global spatial knowledge from points in a patch and points with the same index of patches. Experimental results on four real-world large-scale traffic datasets show that our PatchSTG achieves training speed and memory utilization improvements of up to 10× and 4×, respectively, with state-of-the-art performance.
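
The first half of the irregular spatial patching, partitioning points with a KD-tree into small leaves and padding them to equal size, can be sketched as below. This is a simplified illustration under assumptions: it splits at the median on alternating axes, pads by repeating a leaf's last point, and omits the subtree merging and backtracking that the abstract mentions. The function names are hypothetical.

```python
def kd_partition(points, capacity, depth=0):
    """Recursively split 2D points at the median until each leaf holds <= capacity points."""
    if len(points) <= capacity:
        return [points]
    axis = depth % 2  # alternate between the x and y coordinates
    ordered = sorted(points, key=lambda p: p[axis])
    mid = len(ordered) // 2
    return (kd_partition(ordered[:mid], capacity, depth + 1)
            + kd_partition(ordered[mid:], capacity, depth + 1))

def pad_leaves(leaves):
    """Pad every leaf to the size of the largest one by repeating its last point."""
    size = max(len(leaf) for leaf in leaves)
    return [leaf + [leaf[-1]] * (size - len(leaf)) for leaf in leaves]
```

Equal-sized, non-overlapping groups are what allow attention to run inside each patch in a batched way, instead of across all points at once. The sketch assumes a non-empty point list and a capacity of at least 1.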

[AI-23] What constitutes a Deep Fake? The blurry line between legitimate processing and manipulation under the EU AI Act

Link: https://arxiv.org/abs/2412.09961
Authors: Kristof Meding, Christoph Sorge
Keywords: image resemble reality, deep fakes, resemble reality, digital image resemble, Act transparency obligations
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
Comments: Preprint. Accepted at ACM CSLaw '25

Abstract:When does a digital image resemble reality? The relevance of this question increases as the generation of synthetic images – so-called deep fakes – becomes increasingly popular. Deep fakes have gained much attention for a number of reasons – among others, due to their potential to disrupt the political climate. In order to mitigate these threats, the EU AI Act implements specific transparency regulations for generating synthetic content or manipulating existing content. However, the distinction between real and synthetic images is – even from a computer vision perspective – far from trivial. We argue that the current definition of deep fakes in the AI Act and the corresponding obligations are not sufficiently specified to tackle the challenges posed by deep fakes. By analyzing the life cycle of a digital photo from the camera sensor to the digital editing features, we find that: (1) Deep fakes are ill-defined in the EU AI Act. The definition leaves too much scope for what a deep fake is. (2) It is unclear how editing functions like Google’s “best take” feature can be considered as an exception to transparency obligations. (3) The exception for substantially edited images raises questions about what constitutes substantial editing of content and whether or not this editing must be perceptible by a natural person. Our results demonstrate that complying with the current AI Act transparency obligations is difficult for providers and deployers. As a consequence of the unclear provisions, there is a risk that exceptions may be either too broad or too limited. We intend our analysis to foster the discussion on what constitutes a deep fake and to raise awareness about the pitfalls in the current AI Act transparency obligations.

[AI-24] Analyzing Fairness of Classification Machine Learning Model with Structured Dataset

Link: https://arxiv.org/abs/2412.09896
Authors: Ahmed Rashed, Abdelkrim Kallich, Mohamed Eltayeb
Keywords: including healthcare, law enforcement, Machine learning, integral to decision, decision making
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 12 pages, 3 tables

Abstract:Machine learning (ML) algorithms have become integral to decision making in various domains, including healthcare, finance, education, and law enforcement. However, concerns about fairness and bias in these systems pose significant ethical and social challenges. This study investigates the fairness of ML models applied to structured datasets in classification tasks, highlighting the potential for biased predictions to perpetuate systemic inequalities. A publicly available dataset from Kaggle was selected for analysis, offering a realistic scenario for evaluating fairness in machine learning workflows. To assess and mitigate biases, three prominent fairness libraries were employed: Fairlearn by Microsoft, AIF360 by IBM, and the What-If Tool by Google. These libraries provide robust frameworks for analyzing fairness, offering tools to evaluate metrics, visualize results, and implement bias mitigation strategies. The research aims to assess the extent of bias in the ML models, compare the effectiveness of these libraries, and derive actionable insights for practitioners. The findings reveal that each library has unique strengths and limitations in fairness evaluation and mitigation. By systematically comparing their capabilities, this study contributes to the growing field of ML fairness by providing practical guidance for integrating fairness tools into real-world applications. These insights are intended to support the development of more equitable machine learning systems.
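
One of the simplest metrics that fairness toolkits report is the demographic parity difference: the gap between the highest and lowest rate of positive predictions across groups. The sketch below computes it in plain Python rather than through any of the three libraries, and the abstract does not say which metrics the study used, so treat both the metric choice and the function names as assumptions.

```python
from collections import defaultdict

def selection_rates(y_pred, groups):
    """Share of positive (1) predictions within each group."""
    positives = defaultdict(int)
    totals = defaultdict(int)
    for pred, group in zip(y_pred, groups):
        positives[group] += pred
        totals[group] += 1
    return {g: positives[g] / totals[g] for g in totals}

def demographic_parity_difference(y_pred, groups):
    """Gap between the highest and lowest group selection rates (0 means parity)."""
    rates = selection_rates(y_pred, groups)
    return max(rates.values()) - min(rates.values())
```

A value near 0 means the classifier selects all groups at similar rates. It says nothing about whether those selections are correct, which is why such toolkits also offer error-based metrics.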

[AI-25] Semi-Periodic Activation for Time Series Classification

Link: https://arxiv.org/abs/2412.09889
Authors: José Gilberto Barbosa de Medeiros Júnior, Andre Guarnier de Mitri, Diego Furtado Silva
Keywords: time series tasks, neural network models, paper investigates, investigates the lack, lack of research
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:This paper investigates the lack of research on activation functions for neural network models in time series tasks. It highlights the need to identify essential properties of these activations to improve their effectiveness in specific domains. To this end, the study comprehensively analyzes properties such as boundedness, monotonicity, nonlinearity, and periodicity for activations in time series neural networks. We propose a new activation that maximizes the coverage of these properties, called LeakySineLU. We empirically evaluate LeakySineLU against commonly used activations in the literature using 112 benchmark datasets for time series classification, obtaining the best average ranking in all comparative scenarios.

[AI-26] Brain-inspired Chaotic Graph Backpropagation for Large-scale Combinatorial Optimization

Link: https://arxiv.org/abs/2412.09860
Authors: Peng Tao, Kazuyuki Aihara, Luonan Chen
Keywords: efficient time complexity, Graph neural networks, combinatorial optimization problems, combinatorial optimization, Graph neural
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Neural and Evolutionary Computing (cs.NE)

Abstract:Graph neural networks (GNNs) with unsupervised learning can solve large-scale combinatorial optimization problems (COPs) with efficient time complexity, making them versatile for various applications. However, since this method maps the combinatorial optimization problem to the training process of a graph neural network, and the current mainstream backpropagation-based training algorithms are prone to fall into local minima, the optimization performance is still inferior to the current state-of-the-art (SOTA) COP methods. To address this issue, inspired by possibly chaotic dynamics of real brain learning, we introduce a chaotic training algorithm, i.e. chaotic graph backpropagation (CGBP), which introduces a local loss function in GNN that makes the training process not only chaotic but also highly efficient. Different from existing methods, we show that the global ergodicity and pseudo-randomness of such chaotic dynamics enable CGBP to learn each optimal GNN effectively and globally, thus solving the COP efficiently. We have applied CGBP to solve various COPs, such as the maximum independent set, maximum cut, and graph coloring. Results on several large-scale benchmark datasets showcase that CGBP can outperform not only existing GNN algorithms but also SOTA methods. In addition to solving large-scale COPs, CGBP as a universal learning algorithm for GNNs, i.e. as a plug-in unit, can be easily integrated into any existing method for improving the performance.
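
To make the idea of mapping a combinatorial problem onto a trainable loss more concrete, here is a minimal sketch of a QUBO-style objective for the maximum independent set, a formulation commonly used by unsupervised GNN solvers. The abstract does not state the loss used by CGBP, so the objective, the `penalty` weight, and the toy graph are assumptions; the chaotic local loss itself is not reproduced here.

```python
def mis_qubo_energy(assignment, edges, penalty=2.0):
    """QUBO-style energy for maximum independent set.

    assignment[i] is 1 (or a probability in [0, 1]) if node i is selected.
    Lower energy is better: selected nodes are rewarded, and every edge
    with both endpoints selected is penalized.
    """
    size_term = -sum(assignment)
    conflict_term = penalty * sum(assignment[u] * assignment[v] for u, v in edges)
    return size_term + conflict_term


# Hypothetical path graph 0 - 1 - 2 - 3.
edges = [(0, 1), (1, 2), (2, 3)]

valid = [1, 0, 1, 0]      # independent set {0, 2}
invalid = [1, 1, 0, 0]    # nodes 0 and 1 share an edge
relaxed = [0.5, 0.5, 0.5, 0.5]  # soft assignment, as a network output might be

print(mis_qubo_energy(valid, edges))
print(mis_qubo_energy(invalid, edges))
print(mis_qubo_energy(relaxed, edges))
```

Because the energy is defined for soft assignments as well as binary ones, it can serve as a differentiable training signal; gradient descent on such a loss is exactly where the local minima mentioned in the abstract arise.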

[AI-27] RLDG: Robotic Generalist Policy Distillation via Reinforcement Learning

Link: https://arxiv.org/abs/2412.09858
Authors: Charles Xu, Qiyang Li, Jianlan Luo, Sergey Levine
Keywords: Recent advances, enabled the development, adapt to diverse, Reinforcement Learning Distilled, Learning Distilled Generalists
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Abstract:Recent advances in robotic foundation models have enabled the development of generalist policies that can adapt to diverse tasks. While these models show impressive flexibility, their performance heavily depends on the quality of their training data. In this work, we propose Reinforcement Learning Distilled Generalists (RLDG), a method that leverages reinforcement learning to generate high-quality training data for finetuning generalist policies. Through extensive real-world experiments on precise manipulation tasks like connector insertion and assembly, we demonstrate that generalist policies trained with RL-generated data consistently outperform those trained with human demonstrations, achieving up to 40% higher success rates while generalizing better to new tasks. We also provide a detailed analysis that reveals this performance gain stems from both optimized action distributions and improved state coverage. Our results suggest that combining task-specific RL with generalist policy distillation offers a promising approach for developing more capable and efficient robotic manipulation systems that maintain the flexibility of foundation models while achieving the performance of specialized controllers. Videos and code can be found on our project website this https URL

[AI-28] Learning Structural Causal Models from Ordering: Identifiable Flow Models AAAI2025

Link: https://arxiv.org/abs/2412.09843
Authors: Minh Khoa Le, Kien Do, Truyen Tran
Keywords: address causal inference, valid causal ordering, causal, structural causal models, observational data
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
Comments: Accepted at AAAI 2025

Abstract:In this study, we address causal inference when only observational data and a valid causal ordering from the causal graph are available. We introduce a set of flow models that can recover component-wise, invertible transformation of exogenous variables. Our flow-based methods offer flexible model design while maintaining causal consistency regardless of the number of discretization steps. We propose design improvements that enable simultaneous learning of all causal mechanisms and reduce abduction and prediction complexity to linear O(n) relative to the number of layers, independent of the number of causal variables. Empirically, we demonstrate that our method outperforms previous state-of-the-art approaches and delivers consistent performance across a wide range of structural causal models in answering observational, interventional, and counterfactual questions. Additionally, our method achieves a significant reduction in computational time compared to existing diffusion-based techniques, making it practical for large structural causal models.

[AI-29] Temporal Causal Discovery in Dynamic Bayesian Networks Using Federated Learning

Link: https://arxiv.org/abs/2412.09814
Authors: Jianhong Chen, Ying Ma, Xubo Yue
Keywords: Dynamic Bayesian Network, Dynamic Bayesian, Bayesian Network, Dynamic, Bayesian
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation (stat.CO)
Comments: 23 pages

Abstract:Traditionally, learning the structure of a Dynamic Bayesian Network has been centralized, with all data pooled in one location. However, in real-world scenarios, data are often dispersed among multiple parties (e.g., companies, devices) that aim to collaboratively learn a Dynamic Bayesian Network while preserving their data privacy and security. In this study, we introduce a federated learning approach for estimating the structure of a Dynamic Bayesian Network from data distributed horizontally across different parties. We propose a distributed structure learning method that leverages continuous optimization so that only model parameters are exchanged during optimization. Experimental results on synthetic and real datasets reveal that our method outperforms other state-of-the-art techniques, particularly when there are many clients with limited individual sample sizes.

[AI-30] Universal Inceptive GNNs by Eliminating the Smoothness-generalization Dilemma

Link: https://arxiv.org/abs/2412.09805
Authors: Ming Gu, Zhuonan Zheng, Sheng Zhou, Meihan Liu, Jiawei Chen, Tanyu Qiao, Liangcheng Li, Jiajun Bu
Keywords: Graph Neural Networks, demonstrated remarkable success, Neural Networks, demonstrated remarkable, remarkable success
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Social and Information Networks (cs.SI)
Comments: 12 pages

Abstract:Graph Neural Networks (GNNs) have demonstrated remarkable success in various domains, such as transaction and social networks. However, their application is often hindered by the varying homophily levels across different orders of neighboring nodes, necessitating separate model designs for homophilic and heterophilic graphs. In this paper, we aim to develop a unified framework capable of handling neighborhoods of various orders and homophily levels. Through theoretical exploration, we identify a previously overlooked architectural aspect in multi-hop learning: the cascade dependency, which leads to a smoothness-generalization dilemma. This dilemma significantly affects the learning process, especially in the context of high-order neighborhoods and heterophilic graphs. To resolve this issue, we propose an Inceptive Graph Neural Network (IGNN), a universal message-passing framework that replaces the cascade dependency with an inceptive architecture. IGNN provides independent representations for each hop, allowing personalized generalization capabilities, and captures neighborhood-wise relationships to select appropriate receptive fields. Extensive experiments show that our IGNN outperforms 23 baseline methods, demonstrating superior performance on both homophilic and heterophilic graphs, while also scaling efficiently to large graphs.

[AI-31] Learning Visually Grounded Domain Ontologies via Embodied Conversation and Explanation AAAI-25 AAAI

Link: https://arxiv.org/abs/2412.09770
Authors: Jonghyuk Park, Alex Lascarides, Subramanian Ramamoorthy
Keywords: agent knowledge gaps, gaps are overcome, overcome through corrective, agent explains, agent
Categories: Artificial Intelligence (cs.AI)
Comments: Accepted to, and to appear in the Thirty-Ninth AAAI Conference on Artificial Intelligence (AAAI-25)

Abstract:In this paper, we offer a learning framework in which the agent’s knowledge gaps are overcome through corrective feedback from a teacher whenever the agent explains its (incorrect) predictions. We test it in a low-resource visual processing scenario, in which the agent must learn to recognize distinct types of toy truck. The agent starts the learning process with no ontology about what types of trucks exist nor which parts they have, and a deficient model for recognizing those parts from visual input. The teacher’s feedback to the agent’s explanations addresses its lack of relevant knowledge in the ontology via a generic rule (e.g., “dump trucks have dumpers”), whereas an inaccurate part recognition is corrected by a deictic statement (e.g., “this is not a dumper”). The learner utilizes this feedback not only to improve its estimate of the hypothesis space of possible domain ontologies and probability distributions over them, but also to use those estimates to update its visual interpretation of the scene. Our experiments demonstrate that teacher-learner pairs utilizing explanations and corrections are more data-efficient than those without such a faculty.

[AI-32] Congruence-based Learning of Probabilistic Deterministic Finite Automata

Link: https://arxiv.org/abs/2412.09760
Authors: Matías Carrasco, Franz Mayr, Sergio Yovine
Keywords: probabilistic deterministic automata, learning probabilistic deterministic, language models, work studies, studies the question
Categories: Formal Languages and Automata Theory (cs.FL); Artificial Intelligence (cs.AI)

Abstract:This work studies the question of learning probabilistic deterministic automata from language models. For this purpose, it focuses on analyzing the relations defined on algebraic structures over strings by equivalences and similarities on probability distributions. We introduce a congruence that extends the classical Myhill-Nerode congruence for formal languages. This new congruence is the basis for defining regularity over language models. We present an active learning algorithm that computes the quotient with respect to this congruence whenever the language model is regular. The paper also defines the notion of recognizability for language models and shows that it coincides with regularity for congruences. For relations which are not congruences, it shows that this is not the case. Finally, it discusses the impact of this result on learning in the context of language models.

[AI-33] AI Red-Teaming is a Sociotechnical System. Now What?

Link: https://arxiv.org/abs/2412.09751
Authors: Tarleton Gillespie, Ryland Shaw, Mary L. Gray, Jina Suh
Keywords: real-world applications, testing their performance, importance of testing, generative AI technologies, technologies find
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
Comments: 8 pages

Abstract:As generative AI technologies find more and more real-world applications, the importance of testing their performance and safety seems paramount. “Red-teaming” has quickly become the primary approach to test AI models, prioritized by AI companies and enshrined in AI policy and regulation. Members of red teams act as adversaries, probing AI systems to test their safety mechanisms and uncover vulnerabilities. Yet we know too little about this work and its implications. This essay calls for collaboration between computer scientists and social scientists to study the sociotechnical systems surrounding AI technologies, including the work of red-teaming, to avoid repeating the mistakes of the recent past. We highlight the importance of understanding the values and assumptions behind red-teaming, the labor involved, and the psychological impacts on red-teamers.

[AI-34] TransferLight: Zero-Shot Traffic Signal Control on any Road-Network ALT AAAI

Link: https://arxiv.org/abs/2412.09719
Authors: Johann Schmidt, Frank Dreyer, Sayed Abid Hashimi, Sebastian Stober
Keywords: signal control plays, control plays, plays a crucial, crucial role, Traffic signal control
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: AAAI Workshop Paper (MALTA)

Abstract:Traffic signal control plays a crucial role in urban mobility. However, existing methods often struggle to generalize beyond their training environments to unseen scenarios with varying traffic dynamics. We present TransferLight, a novel framework designed for robust generalization across road-networks, diverse traffic conditions and intersection geometries. At its core, we propose a log-distance reward function, offering spatially-aware signal prioritization while remaining adaptable to varied lane configurations - overcoming the limitations of traditional pressure-based rewards. Our hierarchical, heterogeneous, and directed graph neural network architecture effectively captures granular traffic dynamics, enabling transferability to arbitrary intersection layouts. Using a decentralized multi-agent approach, global rewards, and novel state transition priors, we develop a single, weight-tied policy that scales zero-shot to any road network without re-training. Through domain randomization during training, we additionally enhance generalization capabilities. Experimental results validate TransferLight’s superior performance in unseen scenarios, advancing practical, generalizable intelligent transportation systems to meet evolving urban traffic demands.

[AI-35] CUAL: Continual Uncertainty-aware Active Learner

Link: https://arxiv.org/abs/2412.09701
Authors: Amanda Rios, Ibrahima Ndiour, Parual Datta, Jerry Sydir, Omesh Tickoo, Nilesh Ahuja
Keywords: encountered after deployment, real-world use cases, capable of adapting, adapting to novelties, novelties encountered
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:AI deployed in many real-world use cases should be capable of adapting to novelties encountered after deployment. Here, we consider a challenging, under-explored and realistic continual adaptation problem: a deployed AI agent is continuously provided with unlabeled data that may contain not only unseen samples of known classes but also samples from novel (unknown) classes. In such a challenging setting, it has only a tiny labeling budget to query the most informative samples to help it continuously learn. We present a comprehensive solution to this complex problem with our model “CUAL” (Continual Uncertainty-aware Active Learner). CUAL leverages an uncertainty estimation algorithm to prioritize active labeling of ambiguous (uncertain) predicted novel class samples while also simultaneously pseudo-labeling the most certain predictions of each class. Evaluations across multiple datasets, ablations, settings and backbones (e.g. ViT foundation model) demonstrate our method’s effectiveness. We will release our code upon acceptance.
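
The general pattern described here (spend a small labeling budget on the most uncertain samples and pseudo-label the most certain ones) can be sketched as follows. This is not CUAL's algorithm: the entropy score, the `confidence_threshold` value, the function names, and the toy probabilities are all assumptions, and the novel-class detection step is omitted.

```python
import math


def entropy(probs):
    """Shannon entropy of one predicted class distribution (higher = less certain)."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def split_by_uncertainty(prob_rows, budget, confidence_threshold=0.9):
    """Pick `budget` samples to send for labeling and pseudo-label confident ones.

    Returns (indices_to_query, {index: pseudo_label}).
    """
    ranked = sorted(range(len(prob_rows)),
                    key=lambda i: entropy(prob_rows[i]),
                    reverse=True)
    to_query = ranked[:budget]
    queried = set(to_query)

    pseudo_labels = {}
    for i, probs in enumerate(prob_rows):
        if i in queried:
            continue
        best_class = max(range(len(probs)), key=lambda c: probs[c])
        if probs[best_class] >= confidence_threshold:
            pseudo_labels[i] = best_class
    return to_query, pseudo_labels


# Hypothetical softmax outputs for four unlabeled samples over three classes.
prob_rows = [
    [0.98, 0.01, 0.01],
    [0.34, 0.33, 0.33],
    [0.05, 0.92, 0.03],
    [0.50, 0.45, 0.05],
]

to_query, pseudo_labels = split_by_uncertainty(prob_rows, budget=1)
print(to_query, pseudo_labels)
```

Samples that are neither queried nor confident enough to pseudo-label (index 3 in this toy data) are simply left unlabeled for a later round.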

[AI-36] Assisted morbidity coding: the SISCO.web use case for identifying the main diagnosis in Hospital Discharge Records

Link: https://arxiv.org/abs/2412.09651
Authors: Elena Cardillo, Lucilla Frattura
Keywords: standard diagnostic classifications, international standard diagnostic, Coding morbidity data, morbidity data, standard diagnostic
Categories: Other Computer Science (cs.OH); Artificial Intelligence (cs.AI)
Comments: 18 pages

Abstract:Coding morbidity data using international standard diagnostic classifications is increasingly important and still challenging. Clinical coders and physicians assign codes to patient episodes based on their interpretation of case notes or electronic patient records. Therefore, accurate coding relies on the legibility of case notes and the coders’ understanding of medical terminology. During the last ten years, many studies have shown poor reproducibility of clinical coding, even recently, with the application of Artificial Intelligence-based models. Given this context, the paper presents the SISCO.web approach, designed to support physicians in filling in Hospital Discharge Records with proper diagnosis and procedure codes using the International Classification of Diseases (9th and 10th revisions) and, above all, in identifying the main pathological condition. The web service leverages NLP algorithms, specific coding rules, and ad hoc decision trees to identify the main condition, showing promising results in providing accurate ICD coding suggestions.

[AI-37] Combining knowledge graphs and LLMs for hazardous chemical information management and reuse

Link: https://arxiv.org/abs/2412.09644
Authors: Marcos Da Silveira, Louis Deladiennee, Kheira Acem, Oona Freudenthal
Keywords: increasingly threatened, threatened by exposure, persistent and toxic, information, toxic chemicals
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
Comments: Submitted to IEEE BIBM24

Abstract:Human health is increasingly threatened by exposure to hazardous substances, particularly persistent and toxic chemicals. The link between these substances, often encountered in complex mixtures, and various diseases is demonstrated in scientific studies. However, this information is scattered across several sources and is hard for both humans and machines to access. This paper evaluates current practices for publishing and accessing information on hazardous chemicals and proposes a novel platform designed to facilitate retrieval of critical chemical data in urgent situations. The platform aggregates information from multiple sources and organizes it into a structured knowledge graph. Users can access this information through a visual interface such as Neo4J Bloom and dashboards, or via natural language queries using a chatbot. Our findings demonstrate a significant reduction in the time and effort required to access vital chemical information when datasets follow FAIR principles. Furthermore, we discuss the lessons learned from the development and implementation of this platform and provide recommendations for data owners and publishers to enhance data reuse and interoperability. This work aims to improve the accessibility and usability of chemical information for healthcare professionals, thereby supporting better health outcomes and informed decision-making for patients exposed to chemical intoxication risks.

[AI-38] Blockchain Data Analysis in the Era of Large-Language Models

Link: https://arxiv.org/abs/2412.09640
Authors: Kentaroh Toyoda, Xiao Wang, Mingzhe Li, Bo Gao, Yuan Wang, Qingsong Wei
Keywords: Blockchain data analysis, Blockchain data, data analysis, tracking transactions, essential for deriving
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)

Abstract:Blockchain data analysis is essential for deriving insights, tracking transactions, identifying patterns, and ensuring the integrity and security of decentralized networks. It plays a key role in various areas, such as fraud detection, regulatory compliance, smart contract auditing, and decentralized finance (DeFi) risk management. However, existing blockchain data analysis tools face challenges, including data scarcity, the lack of generalizability, and the lack of reasoning capability. We believe large language models (LLMs) can mitigate these challenges; however, we have not seen papers discussing LLM integration in blockchain data analysis in a comprehensive and systematic way. This paper systematically explores potential techniques and design patterns in LLM-integrated blockchain data analysis. We also outline prospective research opportunities and challenges, emphasizing the need for further exploration in this promising field. This paper aims to benefit a diverse audience spanning academia, industry, and policy-making, offering valuable insights into the integration of LLMs in blockchain data analysis.

[AI-39] Methods to Assess the UK Government's Current Role as a Data Provider for AI

Link: https://arxiv.org/abs/2412.09632
Authors: Neil Majithia, Elena Simperl
Keywords: remain closely-guarded secrets, organisational data owners, corpora remain closely-guarded, training corpora remain, Large Language Models
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
Comments: 17 pages, 5 figures; for the accompanying, non-technical ODI report see this https URL

Abstract:The compositions of generative AI training corpora remain closely-guarded secrets, causing an asymmetry of information between AI developers and organisational data owners whose digital assets may have been incorporated into the corpora without their knowledge. While this asymmetry is the subject of well-known ongoing lawsuits, it also inhibits the measurement of the impact of open data sources for AI training. To address this, we introduce and implement two methods to assess open data usage for the training of Large Language Models (LLMs) and ‘peek behind the curtain’ in order to observe the UK government’s current contributions as a data provider for AI. The first method, an ablation study that utilises LLM ‘unlearning’, seeks to examine the importance of the information held on UK government websites for LLMs and their performance in citizen query tasks. The second method, an information leakage study, seeks to ascertain whether LLMs are aware of the information held in the datasets published on the UK government’s open data initiative this http URL. Our findings indicate that UK government websites are important data sources for AI (heterogeneously across subject matters) while this http URL is not. This paper serves as a technical report, explaining in depth the designs, mechanics, and limitations of the above experiments. It is accompanied by a complementary non-technical report on the ODI website in which we summarise the experiments and key findings, interpret them, and build a set of actionable recommendations for the UK government to take forward as it seeks to design AI policy. While we focus on UK open government data, we believe that the methods introduced in this paper present a reproducible approach to tackle the opaqueness of AI training corpora and provide organisations a framework to evaluate and maximize their contributions to AI development.

[AI-40] Bridging AI and Science: Implications from a Large-Scale Literature Analysis of AI4Science

Link: https://arxiv.org/abs/2412.09628
Authors: Yutong Xie, Yijun Pan, Hua Xu, Qiaozhu Mei
Keywords: Artificial Intelligence, Intelligence has proven, advancing scientific research, wide range, scientific
Categories: Artificial Intelligence (cs.AI); Digital Libraries (cs.DL); Information Retrieval (cs.IR)

Abstract:Artificial Intelligence has proven to be a transformative tool for advancing scientific research across a wide range of disciplines. However, a significant gap still exists between AI and scientific communities, limiting the full potential of AI methods in driving broad scientific discovery. Existing efforts in bridging this gap have often relied on qualitative examination of small samples of literature, offering a limited perspective on the broader AI4Science landscape. In this work, we present a large-scale analysis of the AI4Science literature, starting by using large language models to identify scientific problems and AI methods in publications from top science and AI venues. Leveraging this new dataset, we quantitatively highlight key disparities between AI methods and scientific problems in this integrated space, revealing substantial opportunities for deeper AI integration across scientific disciplines. Furthermore, we explore the potential and challenges of facilitating collaboration between AI and scientific communities through the lens of link prediction. Our findings and tools aim to promote more impactful interdisciplinary collaborations and accelerate scientific discovery through deeper and broader AI integration.

[AI-41] COMET: Benchmark for Comprehensive Biological Multi-omics Evaluation Tasks and Language Models

Link: https://arxiv.org/abs/2412.10347
Authors: Yuchen Ren, Wenwei Han, Qianyuan Zhang, Yining Tang, Weiqiang Bai, Yuchen Cai, Lifeng Qiao, Hao Jiang, Dong Yuan, Tao Chen, Siqi Sun, Pan Tan, Wanli Ouyang, Nanqing Dong, Xinzhu Ma, Peng Ye
Keywords: play crucial roles, guaranteeing accurate genetic, accurate genetic expression, proteins play crucial, Multi-omics Evaluation Tasks
Categories: Biomolecules (q-bio.BM); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Abstract:As key elements within the central dogma, DNA, RNA, and proteins play crucial roles in maintaining life by guaranteeing accurate genetic expression and implementation. Although research on these molecules has profoundly impacted fields like medicine, agriculture, and industry, the diversity of machine learning approaches (from traditional statistical methods to deep learning models and large language models) poses challenges for researchers in choosing the most suitable models for specific tasks, especially for cross-omics and multi-omics tasks due to the lack of comprehensive benchmarks. To address this, we introduce the first comprehensive multi-omics benchmark COMET (Benchmark for Biological COmprehensive Multi-omics Evaluation Tasks and Language Models), designed to evaluate models across single-omics, cross-omics, and multi-omics tasks. First, we curate and develop a diverse collection of downstream tasks and datasets covering key structural and functional aspects in DNA, RNA, and proteins, including tasks that span multiple omics levels. Then, we evaluate existing foundational language models for DNA, RNA, and proteins, as well as the newly proposed multi-omics method, offering valuable insights into their performance in integrating and analyzing data from different biological modalities. This benchmark aims to define critical issues in multi-omics research and guide future directions, ultimately promoting advancements in understanding biological processes through integrated and different omics data analysis.

[AI-42] Physics Instrument Design with Reinforcement Learning

Link: https://arxiv.org/abs/2412.10237
Authors: Shah Rukh Qasim, Patrick Owen, Nicola Serra
Keywords: gradient-based instrument-optimization methods, Reinforcement Learning, present a case, gradient-based instrument-optimization, instrument-optimization methods
Categories: Instrumentation and Detectors (physics.ins-det); Artificial Intelligence (cs.AI); High Energy Physics - Experiment (hep-ex)

Abstract:We present a case for the use of Reinforcement Learning (RL) for the design of physics instruments as an alternative to gradient-based instrument-optimization methods. Its applicability is demonstrated using two empirical studies. One is the longitudinal segmentation of calorimeters, and the second is both the transverse segmentation and the longitudinal placement of trackers in a spectrometer. Based on these experiments, we propose an alternative approach that offers unique advantages over differentiable programming and surrogate-based differentiable design optimization methods. First, RL algorithms possess inherent exploratory capabilities, which help mitigate the risk of convergence to local optima. Second, this approach eliminates the necessity of constraining the design to a predefined detector model with fixed parameters. Instead, it allows for the flexible placement of a variable number of detector components and facilitates discrete decision-making. We then discuss a road map for extending this idea to the design of very complex instruments. The presented study sets the stage for a novel, scalable and efficient framework in physics instrument design that can be pivotal for future projects such as the Future Circular Collider (FCC), where highly optimized detectors are essential for exploring physics at unprecedented energy scales.

[AI-43] AI in the Cosmos

Link: https://arxiv.org/abs/2412.10093
Authors: N. Sahakyan
Keywords: Artificial intelligence, hidden patterns, revolutionizing research, research by enabling, enabling the efficient
Categories: High Energy Astrophysical Phenomena (astro-ph.HE); Astrophysics of Galaxies (astro-ph.GA); Instrumentation and Methods for Astrophysics (astro-ph.IM); Artificial Intelligence (cs.AI)
Comments: In press in the International Journal of Modern Physics D; invited talk at the 17th Marcel Grossmann Meeting

Abstract:Artificial intelligence (AI) is revolutionizing research by enabling the efficient analysis of large datasets and the discovery of hidden patterns. In astrophysics, AI has become essential, transforming the classification of celestial sources, data modeling, and the interpretation of observations. In this review, I highlight examples of AI applications in astrophysics, including source classification, spectral energy distribution modeling, and discuss the advancements achievable through generative AI. However, the use of AI introduces challenges, including biases, errors, and the “black box” nature of AI models, which must be resolved before their application. These issues can be addressed through the concept of Human-Guided AI (HG-AI), which integrates human expertise and domain-specific knowledge into AI applications. This approach aims to ensure that AI is applied in a robust, interpretable, and ethical manner, leading to deeper insights and fostering scientific excellence.

[AI-44] CSL-L2M: Controllable Song-Level Lyric-to-Melody Generation Based on Conditional Transformer with Fine-Grained Lyric and Musical Controls AAAI-25

Link: https://arxiv.org/abs/2412.09887
Authors: Li Chai, Donglin Wang
Keywords: highly challenging task, highly challenging, challenging task, generation, in-attention Transformer decoder
Categories: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Sound (cs.SD); Signal Processing (eess.SP)
Comments: Accepted at AAAI-25

Abstract:Lyric-to-melody generation is a highly challenging task in the field of AI music generation. Due to the difficulty of learning strict yet weak correlations between lyrics and melodies, previous methods have suffered from weak controllability, low-quality and poorly structured generation. To address these challenges, we propose CSL-L2M, a controllable song-level lyric-to-melody generation method based on an in-attention Transformer decoder with fine-grained lyric and musical controls, which is able to generate full-song melodies matched with the given lyrics and user-specified musical attributes. Specifically, we first introduce REMI-Aligned, a novel music representation that incorporates strict syllable- and sentence-level alignments between lyrics and melodies, facilitating precise alignment modeling. Subsequently, sentence-level semantic lyric embeddings independently extracted from a sentence-wise Transformer encoder are combined with word-level part-of-speech embeddings and syllable-level tone embeddings as fine-grained controls to enhance the controllability of lyrics over melody generation. Then we introduce human-labeled musical tags, sentence-level statistical musical attributes, and learned musical features extracted from a pre-trained VQ-VAE as coarse-grained, fine-grained and high-fidelity controls, respectively, to the generation process, thereby enabling user control over melody generation. Finally, an in-attention Transformer decoder technique is leveraged to exert fine-grained control over the full-song melody generation with the aforementioned lyric and musical conditions. Experimental results demonstrate that our proposed CSL-L2M outperforms the state-of-the-art models, generating melodies with higher quality, better controllability and enhanced structure. Demos and source code are available at this https URL.

[AI-45] Deep Learning for Spectrum Prediction in Cognitive Radio Networks: State-of-the-Art, New Opportunities, and Challenges

Link: https://arxiv.org/abs/2412.09849
Authors: Guangliang Pan, David K. Y. Yau, Bo Zhou, Qihui Wu
Keywords: cognitive radio networks, dynamic spectrum access, enhances spectrum efficiency, assisting dynamic spectrum, radio networks
Categories: Signal Processing (eess.SP); Artificial Intelligence (cs.AI)

Abstract:Spectrum prediction is considered to be a promising technology that enhances spectrum efficiency by assisting dynamic spectrum access (DSA) in cognitive radio networks (CRN). Nonetheless, the highly nonlinear nature of spectrum data across time, frequency, and space domains, coupled with the intricate spectrum usage patterns, poses challenges for accurate spectrum prediction. Deep learning (DL), recognized for its capacity to extract nonlinear features, has been applied to solve these challenges. This paper first shows the advantages of applying DL by comparing with traditional prediction methods. Then, the current state-of-the-art DL-based spectrum prediction techniques are reviewed and summarized in terms of intra-band and crossband prediction. Notably, this paper uses a real-world spectrum dataset to prove the advancements of DL-based methods. Then, this paper proposes a novel intra-band spatiotemporal spectrum prediction framework named ViTransLSTM. This framework integrates visual self-attention and long short-term memory to capture both local and global long-term spatiotemporal dependencies of spectrum usage patterns. Similarly, the effectiveness of the proposed framework is validated on the aforementioned real-world dataset. Finally, the paper presents new related challenges and potential opportunities for future research.

[AI-46] Precise Antigen-Antibody Structure Predictions Enhance Antibody Development with HelixFold-Multimer

Link: https://arxiv.org/abs/2412.09826
Authors: Jie Gao, Jing Hu, Lihang Liu, Yang Xue, Kunrui Zhu, Xiaonan Zhang, Xiaomin Fang
Keywords: underlie immune responses, elucidate molecular interactions, immune responses, advancing immunology, elucidate molecular
Categories: Biomolecules (q-bio.BM); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE); Machine Learning (cs.LG)

Abstract:The accurate prediction of antigen-antibody structures is essential for advancing immunology and therapeutic development, as it helps elucidate molecular interactions that underlie immune responses. Despite recent progress with deep learning models like AlphaFold and RoseTTAFold, accurately modeling antigen-antibody complexes remains a challenge due to their unique evolutionary characteristics. HelixFold-Multimer, a specialized model developed for this purpose, builds on the framework of AlphaFold-Multimer and demonstrates improved precision for antigen-antibody structures. HelixFold-Multimer not only surpasses other models in accuracy but also provides essential insights into antibody development, enabling more precise identification of binding sites, improved interaction prediction, and enhanced design of therapeutic antibodies. These advances underscore HelixFold-Multimer’s potential in supporting antibody research and therapeutic innovation.

[AI-47] Let Curves Speak: A Continuous Glucose Monitor based Large Sensor Foundation Model for Diabetes Management

Link: https://arxiv.org/abs/2412.09727
Authors: Junjie Luo, Abhimanyu Kumbara, Mansur Shomali, Rui Han, Anand Iyer, Ritu Agarwal, Gordon Gao
Keywords: near-future glucose prediction, prediction remains limited, timely diabetes self-management, near-future glucose, enables timely diabetes
Categories: Quantitative Methods (q-bio.QM); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Abstract:While previous studies of AI in diabetes management focus on long-term risk, research on near-future glucose prediction remains limited but important as it enables timely diabetes self-management. Integrating AI with continuous glucose monitoring (CGM) holds promise for near-future glucose prediction. However, existing models have limitations in capturing patterns of blood glucose fluctuations and demonstrate poor generalizability. A robust approach is needed to leverage massive CGM data for near-future glucose prediction. We propose large sensor models (LSMs) to capture knowledge in CGM data by modeling patients as sequences of glucose. CGM-LSM is pretrained on 15.96 million glucose records from 592 diabetes patients for near-future glucose prediction. We evaluated CGM-LSM against state-of-the-art methods using the OhioT1DM dataset across various metrics, prediction horizons, and unseen patients. Additionally, we assessed its generalizability across factors like diabetes type, age, gender, and hour of day. CGM-LSM achieved exceptional performance, with an rMSE of 29.81 mg/dL for type 1 diabetes patients and 23.49 mg/dL for type 2 diabetes patients in a two-hour prediction horizon. For the OhioT1DM dataset, CGM-LSM achieved a one-hour rMSE of 15.64 mg/dL, halving the previous best of 31.97 mg/dL. Robustness analyses revealed consistent performance not only for unseen patients and future periods, but also across diabetes type, age, and gender. The model demonstrated adaptability to different hours of day, maintaining accuracy across periods of various activity intensity levels. CGM-LSM represents a transformative step in diabetes management by leveraging pretraining to uncover latent glucose generation patterns in sensor data. Our findings also underscore the broader potential of LSMs to drive innovation across domains involving complex sensor data.

[AI-48] Language model driven: a PROTAC generation pipeline with dual constraints of structure and property

Link: https://arxiv.org/abs/2412.09661
Authors: Jinsong Shao, Qineng Gong, Zeyu Yin, Yu Chen, Yajie Hao, Lei Zhang, Linlin Jiang, Min Yao, Jinlong Li, Fubo Wang, Li Wang
Keywords: Proteolysis Targeting Chimera, driven Proteolysis Targeting, language model driven, computer-aided drug discovery, drug discovery tools
Categories: Quantitative Methods (q-bio.QM); Artificial Intelligence (cs.AI)
Comments: 61 pages, 12 figures

Abstract:The imperfect modeling of ternary complexes has limited the application of computer-aided drug discovery tools in PROTAC research and development. In this study, LM-PROTAC (language model driven Proteolysis Targeting Chimera), an AI-assisted pipeline for PROTAC molecule design, was developed by embedding a transformer-based generative model with dual constraints on structure and properties, referred to as the DCT. The pipeline uses a fragment-based representation of molecules and proceeds in three steps. First, a language model driven affinity model for protein compounds screens molecular fragments with high affinity for the target protein. Second, the structural and physicochemical properties of these fragments are constrained during the generation process to meet specific scenario requirements. Finally, a two-round screening of the preliminarily generated molecules with a multidimensional property prediction model yields a batch of PROTAC molecules capable of degrading disease-relevant target proteins for validation in in vitro experiments, thus achieving a complete solution for AI-assisted PROTAC drug generation. Taking the key tumor target Wnt3a as an example, the LM-PROTAC pipeline successfully generated PROTAC molecules capable of inhibiting Wnt3a. The results show that DCT can efficiently generate PROTACs that target and hydrolyse Wnt3a.

Machine Learning

[LG-0] The Correlated Gaussian Sparse Histogram Mechanism

Link: https://arxiv.org/abs/2412.10357
Authors: Christian Janos Lebeda, Lukas Retschmeier
Keywords: differential privacy, problem of releasing, Gaussian noise, noise, Gaussian
Categories: Data Structures and Algorithms (cs.DS); Cryptography and Security (cs.CR); Machine Learning (cs.LG)

Abstract:We consider the problem of releasing a sparse histogram under $(\varepsilon, \delta)$-differential privacy. The stability histogram independently adds noise from a Laplace or Gaussian distribution to the non-zero entries and removes those noisy counts below a threshold. Thereby, the introduction of new non-zero values between neighboring histograms is only revealed with probability at most $\delta$, and typically, the value of the threshold dominates the error of the mechanism. We consider the variant of the stability histogram with Gaussian noise. Recent works ([Joseph and Yu, COLT '24] and [Lebeda, SOSA '25]) reduced the error for private histograms using correlated Gaussian noise. However, these techniques cannot be directly applied in the very sparse setting. Instead, we adopt Lebeda’s technique and show that adding correlated noise to the non-zero counts only allows us to reduce the magnitude of noise when we have a sparsity bound. This, in turn, allows us to use a lower threshold by up to a factor of $1/2$ compared to the non-correlated noise mechanism. We then extend our mechanism to a setting without a known bound on sparsity. Additionally, we show that correlated noise can give a similar improvement for the more practical discrete Gaussian mechanism.
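
The baseline that the abstract starts from, the stability histogram with independent Gaussian noise, can be sketched as follows. This is a minimal illustration and not the authors' correlated mechanism: the function name, `sigma`, and `threshold` are hypothetical, and calibrating them to a target $(\varepsilon, \delta)$ is deliberately left out.

```python
import random


def stability_histogram(counts, sigma, threshold, rng=None):
    """Uncorrelated Gaussian stability histogram (illustrative baseline only).

    counts: dict mapping a key to its true non-negative count.
    sigma, threshold: assumed to be calibrated elsewhere to (epsilon, delta).
    """
    if rng is None:
        rng = random.Random()
    released = {}
    for key, count in counts.items():
        if count == 0:
            continue  # zero entries are never touched, so the output stays sparse
        noisy = count + rng.gauss(0.0, sigma)  # independent noise per non-zero entry
        if noisy >= threshold:  # noisy counts below the threshold are removed
            released[key] = noisy
    return released


example = stability_histogram({"a": 100, "b": 1, "c": 0}, sigma=1.0, threshold=50.0)
```

The threshold is what hides a newly appearing non-zero entry, which is why the abstract describes it as the dominant error term.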

[LG-1] Shape error prediction in 5-axis machining using graph neural networks

Link: https://arxiv.org/abs/2412.10341
Authors: Julia Huuk, Abheek Dhingra, Eirini Ntoutsi, Bernd Denkena
Keywords: graph neural networks, predicting shape errors, neural networks, paper presents, presents an innovative
Categories: Systems and Control (eess.SY); Machine Learning (cs.LG)

Abstract:This paper presents an innovative method for predicting shape errors in 5-axis machining using graph neural networks. The graph structure is defined with nodes representing workpiece surface points and edges denoting the neighboring relationships. The dataset encompasses data from a material removal simulation, process data, and post-machining quality information. Experimental results show that the presented approach can generalize the shape error prediction for the investigated workpiece geometry. Moreover, by modelling spatial and temporal connections within the workpiece, the approach handles a low number of labels compared to non-graphical methods such as Support Vector Machines.

[LG-2] MST-R: Multi-Stage Tuning for Retrieval Systems and Metric Evaluation

Link: https://arxiv.org/abs/2412.10313
Authors: Yash Malviya, Karan Dhingra, Maneesh Singh
Keywords: Regulatory documents, specialized semantics, documents are rich, rich in nuanced, nuanced terminology
Categories: Information Retrieval (cs.IR); Machine Learning (cs.LG)

Abstract:Regulatory documents are rich in nuanced terminology and specialized semantics. FRAG systems (frozen retrieval-augmented generators that use pre-trained, or frozen, components) consequently face challenges with both retriever and answering performance. We present a system that adapts the retriever performance to the target domain using a multi-stage tuning (MST) strategy. Our retrieval approach, called MST-R, (a) first fine-tunes encoders used in vector stores using hard negative mining, (b) then uses a hybrid retriever, combining sparse and dense retrievers using reciprocal rank fusion, and then (c) adapts the cross-attention encoder by fine-tuning only the top-k retrieved results. We benchmark the system performance on the dataset released for the RIRAG challenge (as part of the RegNLP workshop at COLING 2025). We achieve significant performance gains, obtaining a top rank on the RegNLP challenge leaderboard. We also show that a trivial answering approach games the RePASs metric, outscoring all baselines and a pre-trained Llama model. Analyzing this anomaly, we present important takeaways for future research.
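
Reciprocal rank fusion, the step used to combine the sparse and dense retrievers, can be sketched in a few lines. This is a generic illustration rather than the MST-R implementation: the function name and document ids are hypothetical, and `k=60` is the constant commonly used with this technique, not a value taken from the paper.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document ids into one ranking.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    with ranks starting at 1. Ties are broken by document id for determinism.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


sparse_hits = ["d1", "d2", "d3"]  # hypothetical sparse (e.g. keyword) ranking
dense_hits = ["d3", "d1", "d4"]   # hypothetical dense (embedding) ranking
fused = reciprocal_rank_fusion([sparse_hits, dense_hits])
```

Because only ranks are used, the two retrievers do not need comparable score scales.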

[LG-3] Buzz to Broadcast: Predicting Sports Viewership Using Social Media Engagement

Link: https://arxiv.org/abs/2412.10298
Authors: Anakin Trotter
Keywords: predicting sports viewership, Accurately predicting sports, crucial for optimizing, optimizing ad sales, Accurately predicting
Categories: Machine Learning (cs.LG)
Comments: 17 pages, 4 figures, 2 tables

Abstract:Accurately predicting sports viewership is crucial for optimizing ad sales and revenue forecasting. Social media platforms, such as Reddit, provide a wealth of user-generated content that reflects audience engagement and interest. In this study, we propose a regression-based approach to predict sports viewership using social media metrics, including post counts, comments, scores, and sentiment analysis from TextBlob and VADER. Through iterative improvements, such as focusing on major sports subreddits, incorporating categorical features, and handling outliers by sport, the model achieved an R^2 of 0.99, a Mean Absolute Error (MAE) of 1.27 million viewers, and a Root Mean Squared Error (RMSE) of 2.33 million viewers on the full dataset. These results demonstrate the model’s ability to accurately capture patterns in audience behavior, offering significant potential for pre-event revenue forecasting and targeted advertising strategies.

[LG-4] Performance evaluation of predictive AI models to support medical decisions: Overview and guidance

Link: https://arxiv.org/abs/2412.10288
Authors: Ben Van Calster, Gary S. Collins, Andrew J. Vickers, Laure Wynants, Kathleen F. Kerr, Lasai Barreñada, Gael Varoquaux, Karandeep Singh, Karel G. M. Moons, Tina Hernandez-boussard, Dirk Timmerman, David J. Mclernon, Maarten Van Smeden, Ewout W. Steyerberg (topic group 6 of the STRATOS initiative)
Keywords: predictive artificial intelligence, performance, artificial intelligence, measures, performance measures
Categories: Machine Learning (cs.LG); Methodology (stat.ME); Machine Learning (stat.ML)
Comments: 60 pages, 8 tables, 11 figures, two supplementary appendices

Abstract:A myriad of measures to illustrate performance of predictive artificial intelligence (AI) models have been proposed in the literature. Selecting appropriate performance measures is essential for predictive AI models that are developed to be used in medical practice, because poorly performing models may harm patients and lead to increased costs. We aim to assess the merits of classic and contemporary performance measures when validating predictive AI models for use in medical practice. We focus on models with a binary outcome. We discuss 32 performance measures covering five performance domains (discrimination, calibration, overall, classification, and clinical utility) along with accompanying graphical assessments. The first four domains cover statistical performance, the fifth domain covers decision-analytic performance. We explain why two key characteristics are important when selecting which performance measures to assess: (1) whether the measure’s expected value is optimized when it is calculated using the correct probabilities (i.e., a “proper” measure), and (2) whether they reflect either purely statistical performance or decision-analytic performance by properly considering misclassification costs. Seventeen measures exhibit both characteristics, fourteen measures exhibited one characteristic, and one measure possessed neither characteristic (the F1 measure). All classification measures (such as classification accuracy and F1) are improper for clinically relevant decision thresholds other than 0.5 or the prevalence. We recommend the following measures and plots as essential to report: AUROC, calibration plot, a clinical utility measure such as net benefit with decision curve analysis, and a plot with probability distributions per outcome category.
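
As an illustration of the clinical utility domain, net benefit at a decision threshold $p_t$ is conventionally defined in decision curve analysis as $\mathrm{TP}/n - (\mathrm{FP}/n) \cdot p_t/(1 - p_t)$. The sketch below computes it for a binary outcome; the function name and toy data are hypothetical, and the abstract itself does not spell out this formula.

```python
def net_benefit(y_true, y_prob, threshold):
    """Net benefit of treating everyone whose predicted risk is >= threshold.

    y_true: iterable of 0/1 outcomes; y_prob: predicted probabilities.
    threshold: decision threshold p_t, strictly between 0 and 1.
    """
    if not 0.0 < threshold < 1.0:
        raise ValueError("threshold must lie strictly between 0 and 1")
    pairs = list(zip(y_true, y_prob))
    n = len(pairs)
    true_pos = sum(1 for y, p in pairs if p >= threshold and y == 1)
    false_pos = sum(1 for y, p in pairs if p >= threshold and y == 0)
    # False positives are weighted by the odds of the threshold, which encodes
    # the relative cost of an unnecessary intervention.
    return true_pos / n - (false_pos / n) * (threshold / (1.0 - threshold))


outcomes = [1, 0, 1, 0]          # hypothetical observed outcomes
risks = [0.9, 0.6, 0.4, 0.1]     # hypothetical predicted risks
nb_at_half = net_benefit(outcomes, risks, 0.5)
```

Evaluating this function over a range of thresholds gives the points of a decision curve.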

[LG-5] Adversarial Robustness of Bottleneck Injected Deep Neural Networks for Task-Oriented Communication ICML

Link: https://arxiv.org/abs/2412.10265
Authors: Alireza Furutanpey, Pantelis A. Frangoudis, Patrik Szabo, Schahram Dustdar
Keywords: task-oriented communication systems, Deep Variational Information, Variational Bottleneck Injection, task-oriented communication, paper investigates
Categories: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC); Networking and Internet Architecture (cs.NI); Image and Video Processing (eess.IV)
Comments: Submission to ICMLCN, 6 pages, 9 figures, 3 tables

Abstract:This paper investigates the adversarial robustness of Deep Neural Networks (DNNs) using Information Bottleneck (IB) objectives for task-oriented communication systems. We empirically demonstrate that while IB-based approaches provide baseline resilience against attacks targeting downstream tasks, the reliance on generative models for task-oriented communication introduces new vulnerabilities. Through extensive experiments on several datasets, we analyze how bottleneck depth and task complexity influence adversarial robustness. Our key findings show that Shallow Variational Bottleneck Injection (SVBI) provides less adversarial robustness compared to Deep Variational Information Bottleneck (DVIB) approaches, with the gap widening for more complex tasks. Additionally, we reveal that IB-based objectives exhibit stronger robustness against attacks focusing on salient pixels with high intensity compared to those perturbing many pixels with lower intensity. Lastly, we demonstrate that task-oriented communication systems that rely on generative models to extract and recover salient information have an increased attack surface. The results highlight important security considerations for next-generation communication systems that leverage neural networks for goal-oriented compression.

[LG-6] Detecting LLM Hallucination Through Layer-wise Information Deficiency: Analysis of Unanswerable Questions and Ambiguous Prompts

Link: https://arxiv.org/abs/2412.10246
Authors: Hazel Kim, Adel Bibi, Philip Torr, Yarin Gal
Keywords: frequently generate confident, introducing significant risks, Large language models, Large language, frequently generate
Categories: Machine Learning (cs.LG)

Abstract:Large language models (LLMs) frequently generate confident yet inaccurate responses, introducing significant risks for deployment in safety-critical domains. We present a novel approach to detecting model hallucination through systematic analysis of information flow across model layers when processing inputs with insufficient or ambiguous context. Our investigation reveals that hallucination manifests as usable information deficiencies in inter-layer transmissions. While existing approaches primarily focus on final-layer output analysis, we demonstrate that tracking cross-layer information dynamics ($\mathcal{L}$I) provides robust indicators of model reliability, accounting for both information gain and loss during computation. $\mathcal{L}$I improves model reliability by immediately integrating with universal LLMs without additional training or architectural modifications.

[LG-7] Efficient Generative Modeling with Residual Vector Quantization-Based Tokens

Link: https://arxiv.org/abs/2412.10208
Authors: Jaehyeon Kim, Taehong Moon, Keon Lee, Jaewoong Cho
Keywords: Residual Vector Quantization, Residual Vector, vector-quantized generative models, Residual, quantization technique maintains
Categories: Machine Learning (cs.LG)

Abstract:We explore the use of Residual Vector Quantization (RVQ) for high-fidelity generation in vector-quantized generative models. This quantization technique maintains higher data fidelity by employing more in-depth tokens. However, increasing the token number in generative models leads to slower inference speeds. To this end, we introduce ResGen, an efficient RVQ-based discrete diffusion model that generates high-fidelity samples without compromising sampling speed. Our key idea is a direct prediction of vector embedding of collective tokens rather than individual ones. Moreover, we demonstrate that our proposed token masking and multi-token prediction method can be formulated within a principled probabilistic framework using a discrete diffusion process and variational inference. We validate the efficacy and generalizability of the proposed method on two challenging tasks across different modalities: conditional image generation on ImageNet 256x256 and zero-shot text-to-speech synthesis. Experimental results demonstrate that ResGen outperforms autoregressive counterparts in both tasks, delivering superior performance without compromising sampling speed. Furthermore, as we scale the depth of RVQ, our generative models exhibit enhanced generation fidelity or faster sampling speeds compared to similarly sized baseline models. The project page can be found at this https URL
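
Residual vector quantization itself can be illustrated with a small pure-Python sketch: each depth quantizes what the previous depths left over, and decoding sums the selected code vectors. This shows only the generic RVQ tokenization, not ResGen; the function names and the toy codebooks are hypothetical.

```python
def nearest_code(codebook, vector):
    """Index of the codebook entry closest to `vector` in squared Euclidean distance."""
    return min(
        range(len(codebook)),
        key=lambda i: sum((c - v) ** 2 for c, v in zip(codebook[i], vector)),
    )


def rvq_encode(vector, codebooks):
    """Quantize `vector` with one codebook per depth; returns (tokens, final residual)."""
    residual = list(vector)
    tokens = []
    for codebook in codebooks:
        index = nearest_code(codebook, residual)
        tokens.append(index)
        # The next depth only has to explain what this depth missed.
        residual = [r - c for r, c in zip(residual, codebook[index])]
    return tokens, residual


def rvq_decode(tokens, codebooks):
    """Reconstruct a vector as the sum of the selected code vectors across depths."""
    reconstruction = [0.0] * len(codebooks[0][0])
    for index, codebook in zip(tokens, codebooks):
        reconstruction = [x + c for x, c in zip(reconstruction, codebook[index])]
    return reconstruction


toy_codebooks = [
    [[0.0, 0.0], [1.0, 1.0]],      # depth 1: coarse codes
    [[0.0, 0.0], [0.25, -0.25]],   # depth 2: refinement codes
]
tokens, residual = rvq_encode([1.25, 0.75], toy_codebooks)
```

Each additional depth adds one token per position, which is the token-count growth the abstract identifies as the cause of slower inference.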

[LG-8] Integrative Analysis of Financial Market Sentiment Using CNN and GRU for Risk Prediction and Alert Systems

Link: https://arxiv.org/abs/2412.10199
Authors: You Wu, Mengfang Sun, Hongye Zheng, Jinxin Hu, Yingbin Liang, Zhenghao Lin
Keywords: Gated Recurrent Units, Convolutional Neural Networks, Recurrent Units, Convolutional Neural, Gated Recurrent
Categories: Machine Learning (cs.LG); Computational Finance (q-fin.CP)

Abstract:This document presents an in-depth examination of stock market sentiment through the integration of Convolutional Neural Networks (CNN) and Gated Recurrent Units (GRU), enabling precise risk alerts. The robust feature extraction capability of CNN is utilized to preprocess and analyze extensive network text data, identifying local features and patterns. The extracted feature sequences are then input into the GRU model to understand the progression of emotional states over time and their potential impact on future market sentiment and risk. This approach addresses the order dependence and long-term dependencies inherent in time series data, resulting in a detailed analysis of stock market sentiment and effective early warnings of future risks.

[LG-9] Simple Guidance Mechanisms for Discrete Diffusion Models

Link: https://arxiv.org/abs/2412.10193
Authors: Yair Schiff, Subham Sekhar Sahoo, Hao Phung, Guanghan Wang, Sam Boshar, Hugo Dalla-torre, Bernardo P. de Almeida, Alexander Rush, Thomas Pierrot, Volodymyr Kuleshov
Keywords: gained widespread adoption, widespread adoption owing, data gained widespread, continuous data gained, gained widespread
Categories: Machine Learning (cs.LG)
Comments: Code to reproduce our experiments is available here: this https URL

Abstract:Diffusion models for continuous data gained widespread adoption owing to their high quality generation and control mechanisms. However, controllable diffusion on discrete data faces challenges given that continuous guidance methods do not directly apply to discrete diffusion. Here, we provide a straightforward derivation of classifier-free and classifier-based guidance for discrete diffusion, as well as a new class of diffusion models that leverage uniform noise and that are more guidable because they can continuously edit their outputs. We improve the quality of these models with a novel continuous-time variational lower bound that yields state-of-the-art performance, especially in settings involving guidance or fast generation. Empirically, we demonstrate that our guidance mechanisms combined with uniform noise diffusion improve controllable generation relative to autoregressive and diffusion baselines on several discrete data domains, including genomic sequences, small molecule design, and discretized image generation.

[LG-10] Learning payoffs while routing in skill-based queues

Link: https://arxiv.org/abs/2412.10168
Authors: Sanne van Kempen, Jaron Sanders, Fiona Sloothaak, Maarten G. Wolf
Keywords: Motivated by applications, service systems, skill set, applications in service, server dependent payoff
Categories: Machine Learning (cs.LG); Probability (math.PR)

Abstract:Motivated by applications in service systems, we consider queueing systems where each customer must be handled by a server with the right skill set. We focus on optimizing the routing of customers to servers in order to maximize the total payoff of customer–server matches. In addition, customer–server dependent payoff parameters are assumed to be unknown a priori. We construct a machine learning algorithm that adaptively learns the payoff parameters while maximizing the total payoff and prove that it achieves polylogarithmic regret. Moreover, we show that the algorithm is asymptotically optimal up to logarithmic terms by deriving a regret lower bound. The algorithm leverages the basic feasible solutions of a static linear program as the action space. The regret analysis overcomes the complex interplay between queueing and learning by analyzing the convergence of the queue length process to its stationary behavior. We also demonstrate the performance of the algorithm numerically, and have included an experiment with time-varying parameters highlighting the potential of the algorithm in non-static environments.

[LG-11] Optimal Bounds for Private Minimum Spanning Trees via Input Perturbation

Link: https://arxiv.org/abs/2412.10130
Authors: Rasmus Pagh, Lukas Retschmeier, Hao Wu, Hanwen Zhang
Keywords: minimum spanning tree, approximate minimum spanning, vec, non-private MST algorithm, MST
Categories: Data Structures and Algorithms (cs.DS); Cryptography and Security (cs.CR); Machine Learning (cs.LG)

Abstract:We study the problem of privately releasing an approximate minimum spanning tree (MST). Given a graph $G = (V, E, \vec{W})$ where $V$ is a set of $n$ vertices, $E$ is a set of $m$ undirected edges, and $\vec{W} \in \mathbb{R}^{|E|}$ is an edge-weight vector, our goal is to publish an approximate MST under edge-weight differential privacy, as introduced by Sealfon in PODS 2016, where $V$ and $E$ are considered public and the weight vector is private. Our neighboring relation is $\ell_\infty$-distance on weights: for a sensitivity parameter $\Delta_\infty$, graphs $G = (V, E, \vec{W})$ and $G' = (V, E, \vec{W}')$ are neighboring if $\|\vec{W} - \vec{W}'\|_\infty \leq \Delta_\infty$. Existing private MST algorithms face a trade-off, sacrificing either computational efficiency or accuracy. We show that it is possible to get the best of both worlds: With a suitable random perturbation of the input that does not suffice to make the weight vector private, the result of any non-private MST algorithm will be private and achieves a state-of-the-art error guarantee. Furthermore, by establishing a connection to Private Top-k Selection [Steinke and Ullman, FOCS '17], we give the first privacy-utility trade-off lower bound for MST under approximate differential privacy, demonstrating that the error magnitude, $\tilde{O}(n^{3/2})$, is optimal up to logarithmic factors. That is, our approach matches the time complexity of any non-private MST algorithm and at the same time achieves optimal error. We complement our theoretical treatment with experiments that confirm the practicality of our approach.
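
The overall shape of input perturbation, which is to perturb the edge weights once and then run an ordinary MST algorithm on the noisy weights, can be sketched as follows. This is only a structural illustration: the Gaussian noise and the `noise_scale` parameter are assumptions made here, since the abstract does not state the noise distribution or how it is calibrated to $\Delta_\infty$ and the privacy parameters, and the function names are hypothetical.

```python
import random


def kruskal(num_vertices, edges):
    """Plain (non-private) Kruskal MST. edges: list of (u, v, weight) tuples."""
    parent = list(range(num_vertices))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps the trees shallow
            x = parent[x]
        return x

    tree = []
    for u, v, _weight in sorted(edges, key=lambda edge: edge[2]):
        root_u, root_v = find(u), find(v)
        if root_u != root_v:
            parent[root_u] = root_v
            tree.append((u, v))
    return tree


def perturbed_mst(num_vertices, edges, noise_scale, rng=None):
    """Add independent noise to every weight, then call the non-private algorithm."""
    if rng is None:
        rng = random.Random()
    # Assumption: Gaussian noise with a caller-supplied scale; the paper's
    # actual perturbation and its calibration are not given in the abstract.
    noisy_edges = [(u, v, w + rng.gauss(0.0, noise_scale)) for u, v, w in edges]
    return kruskal(num_vertices, noisy_edges)


triangle = [(0, 1, 1.0), (1, 2, 2.0), (0, 2, 3.0)]  # hypothetical toy graph
tree = perturbed_mst(3, triangle, noise_scale=0.0)
```

Only the released tree is published; the perturbed weights are an internal intermediate, consistent with the abstract's remark that the perturbation alone does not make the weight vector private.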

[LG-12] Feature Selection for Latent Factor Models CVPR

Link: https://arxiv.org/abs/2412.10128
Authors: Rittwika Kansabanik, Adrian Barbu
Keywords: machine learning performance, enhancing machine learning, feature selection methods, pinpointing relevant features, Feature selection
Categories: Machine Learning (cs.LG); Applications (stat.AP)
Comments: Submitted to the CVPR conference

Abstract:Feature selection is crucial for pinpointing relevant features in high-dimensional datasets, mitigating the ‘curse of dimensionality,’ and enhancing machine learning performance. Traditional feature selection methods for classification use data from all classes to select features for each class. This paper explores feature selection methods that select features for each class separately, using class models based on low-rank generative methods and introducing a signal-to-noise ratio (SNR) feature selection criterion. This novel approach has theoretical true feature recovery guarantees under certain assumptions and is shown to outperform some existing feature selection methods on standard classification datasets.

[LG-13] AMUSE: Adaptive Model Updating using a Simulated Environment AISTATS2025

Link: https://arxiv.org/abs/2412.10119
Authors: Louis Chislett, Catalina A. Vallejos, Timothy I. Cannings, James Liley
Keywords: Prediction models frequently, Prediction models, models frequently face, underlying data distribution, frequently face
Categories: Machine Learning (cs.LG); Methodology (stat.ME); Machine Learning (stat.ML)
Comments: 12 pages, 2 tables. Submitted to AIStats 2025 (under review)

Abstract:Prediction models frequently face the challenge of concept drift, in which the underlying data distribution changes over time, weakening performance. Examples can include models which predict loan default, or those used in healthcare contexts. Typical management strategies involve regular model updates or updates triggered by concept drift detection. However, these simple policies do not necessarily balance the cost of model updating with improved classifier performance. We present AMUSE (Adaptive Model Updating using a Simulated Environment), a novel method leveraging reinforcement learning trained within a simulated data generating environment, to determine update timings for classifiers. The optimal updating policy depends on the current data generating process and ongoing drift process. Our key idea is that we can train an arbitrarily complex model updating policy by creating a training environment in which possible episodes of drift are simulated by a parametric model, which represents expectations of possible drift patterns. As a result, AMUSE proactively recommends updates based on estimated performance improvements, learning a policy that balances maintaining model performance with minimizing update costs. Empirical results confirm the effectiveness of AMUSE in simulated data.

[LG-14] Reward Machine Inference for Robotic Manipulation

Link: https://arxiv.org/abs/2412.10096
Authors: Mattijs Baert, Sam Leroux, Pieter Simoens
Keywords: Reinforcement Learning, accomplish complex tasks, enabled robot agents, enabled robot, accomplish complex
Categories: Robotics (cs.RO); Machine Learning (cs.LG)

Abstract:Learning from Demonstrations (LfD) and Reinforcement Learning (RL) have enabled robot agents to accomplish complex tasks. Reward Machines (RMs) enhance RL’s capability to train policies over extended time horizons by structuring high-level task information. In this work, we introduce a novel LfD approach for learning RMs directly from visual demonstrations of robotic manipulation tasks. Unlike previous methods, our approach requires no predefined propositions or prior knowledge of the underlying sparse reward signals. Instead, it jointly learns the RM structure and identifies key high-level events that drive transitions between RM states. We validate our method on vision-based manipulation tasks, showing that the inferred RM accurately captures task structure and enables an RL agent to effectively learn an optimal policy.
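
For readers unfamiliar with reward machines, the structure being inferred is a small finite-state machine whose transitions fire on high-level events and emit rewards. The sketch below shows only that data structure with a hand-written example; the class, the state names, and the `grasped`/`placed` events are hypothetical, and the paper's contribution is learning such a machine from visual demonstrations rather than writing it by hand.

```python
class RewardMachine:
    """Finite-state machine that maps high-level events to rewards."""

    def __init__(self, initial_state, transitions, terminal_states=()):
        # transitions: {(state, event): (next_state, reward)}
        self.initial_state = initial_state
        self.transitions = dict(transitions)
        self.terminal_states = set(terminal_states)
        self.state = initial_state

    def reset(self):
        self.state = self.initial_state

    def step(self, event):
        """Advance on `event`; unknown events leave the state unchanged with zero reward."""
        next_state, reward = self.transitions.get((self.state, event), (self.state, 0.0))
        self.state = next_state
        return reward

    def done(self):
        return self.state in self.terminal_states


# Hypothetical pick-and-place task: grasp the object, then place it.
pick_and_place = RewardMachine(
    initial_state="u0",
    transitions={
        ("u0", "grasped"): ("u1", 0.0),
        ("u1", "placed"): ("u2", 1.0),
    },
    terminal_states={"u2"},
)
rewards = [pick_and_place.step(event) for event in ["placed", "grasped", "placed"]]
```

An RL agent would typically condition its policy on the current machine state, which is how the machine exposes task structure over long horizons.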

[LG-15] A Survey on Knowledge Graph Structure and Knowledge Graph Embeddings

Link: https://arxiv.org/abs/2412.10092
Authors: Jeffrey Sardina, John D. Kelleher, Declan O’Sullivan
Keywords: machine learning counterpart, Knowledge Graph Embedding, Graph Embedding Models, Graph Embedding, Knowledge Graph
Categories: Machine Learning (cs.LG)

Abstract:Knowledge Graphs (KGs) and their machine learning counterpart, Knowledge Graph Embedding Models (KGEMs), have seen ever-increasing use in a wide variety of academic and applied settings. In particular, KGEMs are typically applied to KGs to solve the link prediction task; i.e. to predict new facts in the domain of a KG based on existing, observed facts. While this approach has shown substantial power in many end-use cases, it remains incompletely characterised in terms of how KGEMs react differently to KG structure. This is of particular concern in light of recent studies showing that KG structure can be a significant source of bias as well as a partial determinant of overall KGEM performance. This paper seeks to address this gap in the state-of-the-art. This paper provides, to the authors’ knowledge, the first comprehensive survey exploring established relationships of Knowledge Graph Embedding Models and Graph structure in the literature. It is the hope of the authors that this work will inspire further studies in this area, and contribute to a more holistic understanding of KGs, KGEMs, and the link prediction task.

[LG-16] Text2Cypher: Bridging Natural Language and Graph Databases

Link: https://arxiv.org/abs/2412.10064
Authors: Makbule Gulcin Ozsoy, Leila Messallem, Jon Besga, Gianandrea Minneci
Keywords: arbitrarily complex data, represent arbitrarily complex, Knowledge graphs, Cypher query language, properties to represent
Subjects: Machine Learning (cs.LG)

Abstract:Knowledge graphs use nodes, relationships, and properties to represent arbitrarily complex data. When stored in a graph database, the Cypher query language enables efficient modeling and querying of knowledge graphs. However, using Cypher requires specialized knowledge, which can present a challenge for non-expert users. Our work Text2Cypher aims to bridge this gap by translating natural language queries into Cypher query language and extending the utility of knowledge graphs to non-technical expert users. While large language models (LLMs) can be used for this purpose, they often struggle to capture complex nuances, resulting in incomplete or incorrect outputs. Fine-tuning LLMs on domain-specific datasets has proven to be a more promising approach, but the limited availability of high-quality, publicly available Text2Cypher datasets makes this challenging. In this work, we show how we combined, cleaned and organized several publicly available datasets into a total of 44,387 instances, enabling effective fine-tuning and evaluation. Models fine-tuned on this dataset showed significant performance gains, with improvements in Google-BLEU and Exact Match scores over baseline models, highlighting the importance of high-quality datasets and fine-tuning in improving Text2Cypher performance.
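
To make the Exact Match metric mentioned in this abstract concrete, here is a minimal Python sketch. It assumes that two Cypher queries count as matching when they are identical after whitespace normalization; the paper's exact matching rule is not stated here, and the `Person`/`Movie`/`ACTED_IN` schema and the helper names are hypothetical.

```python
import re


def normalize_cypher(query: str) -> str:
    """Collapse whitespace so purely cosmetic formatting differences are ignored."""
    return re.sub(r"\s+", " ", query).strip()


def exact_match(predictions, references):
    """Return the fraction of predictions equal to their reference after normalization."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        return 0.0
    hits = sum(
        normalize_cypher(p) == normalize_cypher(r)
        for p, r in zip(predictions, references)
    )
    return hits / len(references)


# Hypothetical question: "Which movies did Tom Hanks act in?"
references = ["MATCH (p:Person {name: 'Tom Hanks'})-[:ACTED_IN]->(m:Movie) RETURN m.title"]
predictions = ["MATCH (p:Person {name: 'Tom Hanks'})-[:ACTED_IN]->(m:Movie)\n  RETURN m.title"]
print(exact_match(predictions, references))
```

Exact Match is strict by design: a semantically equivalent query with reordered clauses scores zero, which is one reason it is usually reported alongside a softer metric such as Google-BLEU.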

[LG-17] Class flipping for uplift modeling and Heterogeneous Treatment Effect estimation on imbalanced RCT data

Link: https://arxiv.org/abs/2412.10009
Authors: Krzysztof Rudaś, Szymon Jaroszewicz
Keywords: Heterogeneous Treatment Effect, Randomized Controlled Experiments, Heterogeneous Treatment, estimation aim, specific individual
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)

Abstract:Uplift modeling and Heterogeneous Treatment Effect (HTE) estimation aim at predicting the causal effect of an action, such as a medical treatment or a marketing campaign, on a specific individual. In this paper, we focus on data from Randomized Controlled Experiments, which guarantee causal interpretation of the outcomes. Class and treatment imbalance are important problems in uplift modeling/HTE, but classical undersampling- or oversampling-based approaches are hard to apply in this case since they distort the predicted effect. Calibration methods have been proposed in the past; however, they do not guarantee correct predictions. In this work, we propose an alternative to undersampling, based on flipping the class value of selected records. We show that the proposed approach does not distort the predicted effect and does not require calibration. The method is especially useful for models based on class variable transformation (modified outcome models). We address those models separately, designing a transformation scheme which guarantees correct predictions and also addresses the problem of treatment imbalance, which is especially important for those models. Experiments fully confirm our theoretical results. Additionally, we demonstrate that our method is also a viable alternative for standard classification problems.
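
For readers unfamiliar with the class variable transformation (modified outcome) models this abstract refers to, the sketch below shows the standard construction for a binary outcome and a balanced randomized trial: define Z = 1 when the record is treated and responds, or is a control and does not respond. With P(T = 1) = 0.5, the uplift equals 2·P(Z = 1 | x) − 1. This is not the paper's class-flipping scheme; the toy data and function names are illustrative assumptions.

```python
def transform_outcome(treated: int, outcome: int) -> int:
    """Class variable transformation: Z = 1 iff treatment flag and outcome agree."""
    return 1 if treated == outcome else 0


def uplift_from_z_rate(p_z: float) -> float:
    """Uplift implied by P(Z = 1), valid only when P(T = 1) = 0.5."""
    return 2.0 * p_z - 1.0


# Toy balanced RCT: (treated, outcome) pairs.
# Treated response rate is 3/4, control response rate is 1/4, so true uplift is 0.5.
records = [(1, 1), (1, 1), (1, 1), (1, 0), (0, 1), (0, 0), (0, 0), (0, 0)]

z_values = [transform_outcome(t, y) for t, y in records]
p_z = sum(z_values) / len(z_values)
print(uplift_from_z_rate(p_z))
```

The dependence on P(T = 1) = 0.5 is why treatment imbalance matters so much for this model family, as the abstract points out.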

[LG-18] Real-Time Fall Detection Using Smartphone Accelerometers and WiFi Channel State Information

Link: https://arxiv.org/abs/2412.09980
Authors: Lingyun Wang, Deqi Su, Aohua Zhang, Yujun Zhu, Weiwei Jiang, Xin He, Panlong Yang
Keywords: recent years, population ages, increasingly posed, IMU distinguishes falls, significant threat
Subjects: Machine Learning (cs.LG)

Abstract:In recent years, as the population ages, falls have increasingly posed a significant threat to the health of the elderly. We propose a real-time fall detection system that integrates the inertial measurement unit (IMU) of a smartphone with optimized Wi-Fi channel state information (CSI) for secondary validation. Initially, the IMU distinguishes falls from routine daily activities with minimal computational demand. Subsequently, the CSI is employed for further assessment, which includes evaluating the individual’s post-fall mobility. This methodology not only achieves high accuracy but also reduces energy consumption in the smartphone platform. An Android application developed specifically for the purpose issues an emergency alert if the user experiences a fall and is unable to move. Experimental results indicate that the CSI model, based on convolutional neural networks (CNN), achieves a detection accuracy of 99%, surpassing comparable IMU-only models, and demonstrating significant resilience in distinguishing between falls and non-fall activities.

[LG-19] GraSP: Simple yet Effective Graph Similarity Predictions AAAI2025

Link: https://arxiv.org/abs/2412.09968
Authors: Haoran Zheng, Jieming Shi, Renchi Yang
Keywords: fundamental problem, problem with fruitful, fruitful applications, GED and MCS, GSC
Subjects: Machine Learning (cs.LG)
Comments: Accepted by AAAI2025. 13 pages, 14 figures. The code is available at this https URL

Abstract:Graph similarity computation (GSC) is to calculate the similarity between one pair of graphs, which is a fundamental problem with fruitful applications in the graph community. In GSC, graph edit distance (GED) and maximum common subgraph (MCS) are two important similarity metrics, both of which are NP-hard to compute. Instead of calculating the exact values, recent solutions resort to leveraging graph neural networks (GNNs) to learn data-driven models for the estimation of GED and MCS. Most of them are built on components involving node-level interactions crossing graphs, which engender vast computation overhead but are of little avail in effectiveness. In the paper, we present GraSP, a simple yet effective GSC approach for GED and MCS prediction. GraSP achieves high result efficacy through several key instruments: enhanced node features via positional encoding and a GNN model augmented by a gating mechanism, residual connections, as well as multi-scale pooling. Theoretically, GraSP can surpass the 1-WL test, indicating its high expressiveness. Empirically, extensive experiments comparing GraSP against 10 competitors on multiple widely adopted benchmark datasets showcase the superiority of GraSP over prior arts in terms of both effectiveness and efficiency. The code is available at this https URL.

[LG-20] Llama 3 Meets MoE: Efficient Upcycling

Link: https://arxiv.org/abs/2412.09952
Authors: Aditya Vavre, Ethan He, Dennis Liu, Zijie Yan, June Yang, Nima Tajbakhsh, Ashwath Aithal
Keywords: Scaling large language, prohibitive computational costs, Scaling large, large language models, significantly improves performance
Subjects: Machine Learning (cs.LG)

Abstract:Scaling large language models (LLMs) significantly improves performance but comes with prohibitive computational costs. Mixture-of-Experts (MoE) models offer an efficient alternative, increasing capacity without a proportional rise in compute requirements. However, training MoE models from scratch poses challenges like overfitting and routing instability. We present an efficient training recipe leveraging pre-trained dense checkpoints, training an 8-Expert Top-2 MoE model from Llama 3-8B with less than 1% of typical pre-training compute. Our approach enhances downstream performance on academic benchmarks, achieving a 2% improvement in 0-shot accuracy on MMLU, while reaching a Model FLOPs Utilization (MFU) of 46.8% during training using our framework. We also integrate online upcycling in NeMo for seamless use of pre-trained weights, enabling cost-effective development of high-capacity MoE models.
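
The idea of upcycling a dense checkpoint into an 8-expert, top-2 MoE can be sketched in a few lines of plain Python. The toy below assumes the simplest possible recipe: every expert starts as a copy of the dense feed-forward weights, and the two selected experts are mixed with softmax gates. The single-matrix "FFN", the router values, and all function names are hypothetical; the actual recipe in the paper and in NeMo involves details not described in the abstract.

```python
import copy
import math


def dense_ffn(weights, x):
    """Toy feed-forward layer: y = W x, with W given as a list of rows."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]


def upcycle(dense_weights, num_experts=8):
    """Initialize every expert as an independent copy of the dense weights."""
    return [copy.deepcopy(dense_weights) for _ in range(num_experts)]


def moe_forward(experts, router, x, top_k=2):
    """Route x to the top-k experts and mix their outputs with softmax gates."""
    logits = dense_ffn(router, x)  # one routing logit per expert
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:top_k]
    peak = max(logits[i] for i in top)
    exps = [math.exp(logits[i] - peak) for i in top]
    gates = [e / sum(exps) for e in exps]  # gates over selected experts sum to 1
    out = [0.0] * len(experts[0])
    for gate, idx in zip(gates, top):
        y = dense_ffn(experts[idx], x)
        out = [o + gate * yi for o, yi in zip(out, y)]
    return out


dense = [[1.0, 2.0], [0.5, -1.0]]
experts = upcycle(dense, num_experts=8)
router = [[0.1 * i, -0.05 * i] for i in range(8)]  # arbitrary router weights
x = [0.3, -0.7]
print(dense_ffn(dense, x), moe_forward(experts, router, x))
```

Because the gates sum to one and all experts are identical at initialization, the upcycled layer reproduces the dense layer's output exactly, so training starts from the dense model's quality rather than from scratch.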

[LG-21] Towards Fair Graph Neural Networks via Graph Counterfactual without Sensitive Attributes ICDE2025

Link: https://arxiv.org/abs/2412.09947
Authors: Xuemin Wang, Tianlong Gu, Xuguang Bao, Liang Chang
Keywords: today connected world, driving extensive research, Graph Neural Networks, Graph-structured data, connected world
Subjects: Machine Learning (cs.LG)
Comments: ICDE 2025

Abstract:Graph-structured data is ubiquitous in today’s connected world, driving extensive research in graph analysis. Graph Neural Networks (GNNs) have shown great success in this field, leading to growing interest in developing fair GNNs for critical applications. However, most existing fair GNNs focus on statistical fairness notions, which may be insufficient when dealing with statistical anomalies. Hence, motivated by causal theory, there has been growing attention to mitigating root causes of unfairness utilizing graph counterfactuals. Unfortunately, existing methods for generating graph counterfactuals invariably require the sensitive attribute. Nevertheless, in many real-world applications, it is usually infeasible to obtain sensitive attributes due to privacy or legal issues, which challenge existing methods. In this paper, we propose a framework named Fairwos (improving Fairness without sensitive attributes). In particular, we first propose a mechanism to generate pseudo-sensitive attributes to remedy the problem of missing sensitive attributes, and then design a strategy for finding graph counterfactuals from the real dataset. To train fair GNNs, we propose a method to ensure that the embeddings from the original data are consistent with those from the graph counterfactuals, and dynamically adjust the weight of each pseudo-sensitive attribute to balance its contribution to fairness and utility. Furthermore, we theoretically demonstrate that minimizing the relation between these pseudo-sensitive attributes and the prediction can enable the fairness of GNNs. Experimental results on six real-world datasets show that our approach outperforms state-of-the-art methods in balancing utility and fairness.

[LG-22] Predictive Query-based Pipeline for Graph Data

Link: https://arxiv.org/abs/2412.09940
Authors: Plácido A Souza Neto (UO)
Keywords: Graphs face challenges, face challenges, challenges when dealing, dealing with massive, massive datasets
Subjects: Databases (cs.DB); Machine Learning (cs.LG)

Abstract:Graphs face challenges when dealing with massive datasets. They are essential tools for modeling interconnected data and often become computationally expensive. Graph embedding techniques, on the other hand, provide an efficient approach. By projecting complex graphs into a lower-dimensional space, these techniques simplify the analysis and processing of large-scale graphs. By transforming graphs into vectors, they simplify the analysis and processing of large-scale datasets. Several approaches, such as GraphSAGE, Node2Vec, and FastRP, offer efficient methods for generating graph embeddings. By storing embeddings as node properties, it is possible to compare different embedding techniques and evaluate their effectiveness for specific tasks. This flexibility allows for dynamic updates to embeddings and facilitates experimentation with different approaches. By analyzing these embeddings, one can extract valuable insights into the relationships between nodes and their similarities within the embedding space.
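
As a small illustration of comparing nodes within an embedding space, the following Python sketch computes cosine similarity between embeddings stored as node properties. The node names and vectors are made up, and cosine similarity is only one common choice; the abstract does not say which measure the pipeline uses.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two embedding vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical nodes, each carrying an embedding as a property.
nodes = {
    "alice": {"embedding": [0.9, 0.1, 0.0]},
    "bob": {"embedding": [0.8, 0.2, 0.1]},
    "carol": {"embedding": [0.0, 0.1, 0.9]},
}

for other in ("bob", "carol"):
    score = cosine_similarity(nodes["alice"]["embedding"], nodes[other]["embedding"])
    print(other, round(score, 3))
```

Keeping one property per embedding technique on the same node makes it straightforward to run this comparison side by side for several methods.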

[LG-23] One Node One Model: Featuring the Missing-Half for Graph Clustering

Link: https://arxiv.org/abs/2412.09902
Authors: Xuanting Xie, Bingheng Li, Erlin Pan, Zhaochen Guo, Zhao Kang, Wenyu Chen
Keywords: exploiting topological structure, existing graph clustering, methods primarily focus, graph clustering, topological structure
Subjects: Machine Learning (cs.LG)

Abstract:Most existing graph clustering methods primarily focus on exploiting topological structure, often neglecting the “missing-half” node feature information, especially how these features can enhance clustering performance. This issue is further compounded by the challenges associated with high-dimensional features. Feature selection in graph clustering is particularly difficult because it requires simultaneously discovering clusters and identifying the relevant features for these clusters. To address this gap, we introduce a novel paradigm called “one node one model”, which builds an exclusive model for each node and defines the node label as a combination of predictions for node groups. Specifically, the proposed “Feature Personalized Graph Clustering (FPGC)” method identifies cluster-relevant features for each node using a squeeze-and-excitation block, integrating these features into each model to form the final representations. Additionally, the concept of feature cross is developed as a data augmentation technique to learn low-order feature interactions. Extensive experimental results demonstrate that FPGC outperforms state-of-the-art clustering methods. Moreover, the plug-and-play nature of our method provides a versatile solution to enhance GNN-based models from a feature perspective.

[LG-24] TTAQ: Towards Stable Post-training Quantization in Continuous Domain Adaptation

Link: https://arxiv.org/abs/2412.09899
Authors: Junrui Xiao, Zhikai Li, Lianwei Yang, Yiduo Mei, Qingyi Gu
Keywords: reduces excessive hardware, tiny calibration set, excessive hardware cost, quantizing full-precision models, lower bit representations
Subjects: Machine Learning (cs.LG)

Abstract:Post-training quantization (PTQ) reduces excessive hardware cost by quantizing full-precision models into lower-bit representations on a tiny calibration set, without retraining. Despite the remarkable progress made through recent efforts, traditional PTQ methods typically fail in dynamic and ever-changing real-world scenarios involving unpredictable data streams and continual domain shifts, which pose greater challenges. In this paper, we propose a novel and stable quantization process for test-time adaptation (TTA), dubbed TTAQ, to address the performance degradation of traditional PTQ in dynamically evolving test domains. To tackle domain shifts in the quantizer, TTAQ proposes Perturbation Error Mitigation (PEM) and Perturbation Consistency Reconstruction (PCR). Specifically, PEM analyzes the error propagation and devises a weight regularization scheme to mitigate the impact of input perturbations. On the other hand, PCR introduces consistency learning to ensure that quantized models provide stable predictions for the same sample. Furthermore, we introduce Adaptive Balanced Loss (ABL) to adjust the logits by taking advantage of the frequency and complexity of the class, which can effectively address the class imbalance caused by unpredictable data streams during optimization. Extensive experiments are conducted on multiple datasets with generic TTA methods, proving that TTAQ can outperform existing baselines and encouragingly improve the accuracy of low-bit PTQ models in continually changing test domains. For instance, TTAQ decreases the mean error of 2-bit models on the ImageNet-C dataset by an impressive 10.1%.
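
As background for what "quantizing into lower-bit representations" means, here is a minimal sketch of a generic min-max uniform quantizer in Python. It is not TTAQ, PEM, or PCR; it only shows the baseline operation those components are designed to stabilize. The function name and sample weights are assumptions.

```python
def quantize_uniform(values, num_bits):
    """Snap each value to the nearest of 2**num_bits evenly spaced levels in [min, max]."""
    levels = 2 ** num_bits - 1  # number of intervals between grid points
    lo, hi = min(values), max(values)
    if hi == lo:
        return list(values)  # a constant tensor needs no quantization grid
    scale = (hi - lo) / levels
    return [lo + round((v - lo) / scale) * scale for v in values]


weights = [0.0, 0.1, 0.4, 0.9, 1.0]
print(quantize_uniform(weights, num_bits=2))  # at most 4 distinct values
print(quantize_uniform(weights, num_bits=8))  # much finer grid
```

The grid is fixed by the minimum and maximum seen at calibration time, which is one intuition for why a distribution shift at test time can hurt a quantized model more than its full-precision counterpart.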

[LG-25] Data-Driven Transfer Learning Framework for Estimating Turning Movement Counts

Link: https://arxiv.org/abs/2412.09861
Authors: Xiaobo Ma, Hyunsoo Noh, Ryan Hatch, James Tokishi, Zepu Wang
Keywords: Urban transportation networks, Urban transportation, necessitating effective traffic, effective traffic management, people and goods
Subjects: Machine Learning (cs.LG)

Abstract:Urban transportation networks are vital for the efficient movement of people and goods, necessitating effective traffic management and planning. An integral part of traffic management is understanding the turning movement counts (TMCs) at intersections. Accurate TMCs at intersections are crucial for traffic signal control, congestion mitigation, and road safety. In general, TMCs are obtained using physical sensors installed at intersections, but this approach can be cost-prohibitive and technically challenging, especially for cities with extensive road networks. Recent advancements in machine learning and data-driven approaches have offered promising alternatives for estimating TMCs. Traffic patterns can vary significantly across different intersections due to factors such as road geometry, traffic signal settings, and local driver behaviors. This domain discrepancy limits the generalizability and accuracy of machine learning models when applied to new or unseen intersections. In response to these limitations, this research proposes a novel framework leveraging transfer learning (TL) to estimate TMCs at intersections by using traffic controller event-based data, road infrastructure data, and point-of-interest (POI) data. Evaluated on 30 intersections in Tucson, Arizona, the performance of the proposed TL model was compared with eight state-of-the-art regression models and achieved the lowest values in terms of Mean Absolute Error and Root Mean Square Error.
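
The two error measures used for the comparison can be written out directly. The sketch below uses made-up turning movement counts purely to show the arithmetic; none of the numbers come from the paper.

```python
import math


def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between observed and estimated counts."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def root_mean_square_error(y_true, y_pred):
    """Square root of the average squared difference; penalizes large misses more."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


# Hypothetical left-turn counts per 15-minute interval at one intersection.
observed = [10, 20, 30]
estimated = [12, 18, 33]
print(mean_absolute_error(observed, estimated))
print(root_mean_square_error(observed, estimated))
```

Reporting both is informative because RMSE is never smaller than MAE, and a large gap between them signals a few intervals with unusually large errors.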

[LG-26] Understand the Effectiveness of Shortcuts through the Lens of DCA

Link: https://arxiv.org/abs/2412.09853
Authors: Youran Sun, Yihua Liu, Yi-Shuai Niu
Keywords: nonconvex optimization algorithm, well-known nonconvex optimization, well-known nonconvex, nonconvex function, existing optimization algorithms
Subjects: Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE); Optimization and Control (math.OC)

Abstract:Difference-of-Convex Algorithm (DCA) is a well-known nonconvex optimization algorithm for minimizing a nonconvex function that can be expressed as the difference of two convex ones. Many famous existing optimization algorithms, such as SGD and proximal point methods, can be viewed as special DCAs with specific DC decompositions, making it a powerful framework for optimization. On the other hand, shortcuts are a key architectural feature in modern deep neural networks, facilitating both training and optimization. We showed that the shortcut neural network gradient can be obtained by applying DCA to vanilla neural networks, networks without shortcut connections. Therefore, from the perspective of DCA, we can better understand the effectiveness of networks with shortcuts. Moreover, we proposed a new architecture called NegNet that does not fit the previous interpretation but performs on par with ResNet and can be included in the DCA framework.
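
A one-dimensional example may help make the DCA iteration concrete. For f = g − h with g and h convex, each step linearizes h at the current point and minimizes the resulting convex function g(x) − h'(x_k)·x. The sketch below applies this to f(x) = x⁴ − x², with the decomposition g(x) = x⁴ and h(x) = x² chosen for illustration; it is unrelated to the neural network construction in the paper.

```python
import math


def signed_cbrt(v):
    """Real cube root that also works for negative inputs."""
    return math.copysign(abs(v) ** (1.0 / 3.0), v)


def dca_minimize(x0, num_iters=50):
    """DCA for f(x) = x**4 - x**2 with g(x) = x**4 and h(x) = x**2."""
    x = x0
    for _ in range(num_iters):
        grad_h = 2.0 * x  # gradient of h at the current iterate
        # Convex subproblem: minimize x**4 - grad_h * x, i.e. solve 4 x**3 = grad_h.
        x = signed_cbrt(grad_h / 4.0)
    return x


print(dca_minimize(1.0))   # approaches 1/sqrt(2), a minimizer of f
print(dca_minimize(-1.0))  # approaches -1/sqrt(2), the symmetric minimizer
```

The stationary points of f satisfy 4x³ = 2x, so the nonzero fixed points of the iteration are exactly x = ±1/√2. Which one DCA reaches depends on the starting point, as expected for a local method on a nonconvex objective.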

[LG-27] Multivariate Time Series Clustering for Environmental State Characterization of Ground-Based Gravitational-Wave Detectors

Link: https://arxiv.org/abs/2412.09832
Authors: Rutuja Gurav, Isaac Kelly, Pooyan Goodarzi, Anamaria Effler, Barry Barish, Evangelos Papalexakis, Jonathan Richardson
Keywords: long observation periods, multi-kilometer geographic area, terrestrial instruments housed, LIGO are large-scale, maintain operational stability
Subjects: Machine Learning (cs.LG); Instrumentation and Methods for Astrophysics (astro-ph.IM); General Relativity and Quantum Cosmology (gr-qc)
Comments: 8 pages, 6 figures, Accepted to The 5th International Workshop on Big Data AI Tools, Methods, and Use Cases for Innovative Scientific Discovery (BTSD 2024)

Abstract:Gravitational-wave observatories like LIGO are large-scale, terrestrial instruments housed in infrastructure that spans a multi-kilometer geographic area and which must be actively controlled to maintain operational stability for long observation periods. Despite exquisite seismic isolation, they remain susceptible to seismic noise and other terrestrial disturbances that can couple undesirable vibrations into the instrumental infrastructure, potentially leading to control instabilities or noise artifacts in the detector output. It is, therefore, critical to characterize the seismic state of these observatories to identify a set of temporal patterns that can inform the detector operators in day-to-day monitoring and diagnostics. On a day-to-day basis, the operators monitor several seismically relevant data streams to diagnose operational instabilities and sources of noise using some simple empirically-determined thresholds. It can be untenable for a human operator to monitor multiple data streams in this manual fashion and thus a distillation of these data-streams into a more human-friendly format is sought. In this paper, we present an end-to-end machine learning pipeline for features-based multivariate time series clustering to achieve this goal and to provide actionable insights to the detector operators by correlating found clusters with events of interest in the detector.

[LG-28] FDM-Bench: A Comprehensive Benchmark for Evaluating Large Language Models in Additive Manufacturing Tasks

Link: https://arxiv.org/abs/2412.09819
Authors: Ahmadreza Eslaminia, Adrian Jackson, Beitong Tian, Avi Stern, Hallie Gordon, Rajiv Malhotra, Klara Nahrstedt, Chenhui Shao
Keywords: Fused Deposition Modeling, Fused Deposition, Deposition Modeling, industries including healthcare, FDM
Subjects: Machine Learning (cs.LG); Systems and Control (eess.SY)

Abstract:Fused Deposition Modeling (FDM) is a widely used additive manufacturing (AM) technique valued for its flexibility and cost-efficiency, with applications in a variety of industries including healthcare and aerospace. Recent developments have made affordable FDM machines accessible and encouraged adoption among diverse users. However, the design, planning, and production process in FDM require specialized interdisciplinary knowledge. Managing the complex parameters and resolving print defects in FDM remain challenging. These technical complexities form the most critical barrier preventing individuals without technical backgrounds and even professional engineers without training in other domains from participating in AM design and manufacturing. Large Language Models (LLMs), with their advanced capabilities in text and code processing, offer the potential for addressing these challenges in FDM. However, existing research on LLM applications in this field is limited, typically focusing on specific use cases without providing comprehensive evaluations across multiple models and tasks. To this end, we introduce FDM-Bench, a benchmark dataset designed to evaluate LLMs on FDM-specific tasks. FDM-Bench enables a thorough assessment by including user queries across various experience levels and G-code samples that represent a range of anomalies. We evaluate two closed-source models (GPT-4o and Claude 3.5 Sonnet) and two open-source models (Llama-3.1-70B and Llama-3.1-405B) on FDM-Bench. A panel of FDM experts assess the models’ responses to user queries in detail. Results indicate that closed-source models generally outperform open-source models in G-code anomaly detection, whereas Llama-3.1-405B demonstrates a slight advantage over other models in responding to user queries. These findings underscore FDM-Bench’s potential as a foundational tool for advancing research on LLM capabilities in FDM.

[LG-29] The Complexity Dynamics of Grokking

Link: https://arxiv.org/abs/2412.09810
Authors: Branton DeMoss, Silvia Sapora, Jakob Foerster, Nick Hawes, Ingmar Posner
Keywords: investigate the phenomenon, neural networks, networks, neural, complexity
Subjects: Machine Learning (cs.LG)

Abstract:We investigate the phenomenon of generalization through the lens of compression. In particular, we study the complexity dynamics of neural networks to explain grokking, where networks suddenly transition from memorizing to generalizing solutions long after over-fitting the training data. To this end we introduce a new measure of intrinsic complexity for neural networks based on the theory of Kolmogorov complexity. Tracking this metric throughout network training, we find a consistent pattern in training dynamics, consisting of a rise and fall in complexity. We demonstrate that this corresponds to memorization followed by generalization. Based on insights from rate–distortion theory and the minimum description length principle, we lay out a principled approach to lossy compression of neural networks, and connect our complexity measure to explicit generalization bounds. Based on a careful analysis of information capacity in neural networks, we propose a new regularization method which encourages networks towards low-rank representations by penalizing their spectral entropy, and find that our regularizer outperforms baselines in total compression of the dataset.
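
To give a feel for a spectral entropy penalty, the sketch below computes the Shannon entropy of a weight matrix's singular values after normalizing them to sum to one. This normalization is an assumption; the paper's exact definition may differ. The function takes singular values as input, so it can be paired with any SVD routine.

```python
import math


def spectral_entropy(singular_values):
    """Shannon entropy (in nats) of singular values normalized to sum to one."""
    total = sum(singular_values)
    if total <= 0.0:
        raise ValueError("need at least one positive singular value")
    probs = [s / total for s in singular_values if s > 0.0]
    return -sum(p * math.log(p) for p in probs)


print(spectral_entropy([1.0, 1.0, 1.0, 1.0]))  # flat spectrum: maximal entropy, log(4)
print(spectral_entropy([5.0, 0.0, 0.0, 0.0]))  # rank-one spectrum: zero entropy
print(spectral_entropy([4.0, 1.0, 0.5, 0.1]))  # in between
```

Entropy is lowest when a few singular values dominate, so adding this quantity to the loss pushes weight matrices towards effectively low-rank solutions, consistent with the low-rank bias described in the abstract.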

[LG-30] deepNoC: A deep learning system to assign the number of contributors to a short tandem repeat DNA profile

Link: https://arxiv.org/abs/2412.09803
Authors: Duncan Taylor, Melissa A. Humphries
Keywords: tandem repeat DNA, evaluate short tandem, short tandem repeat, repeat DNA profiles, DNA profiles
Subjects: Machine Learning (cs.LG); Quantitative Methods (q-bio.QM)
Comments: 29 pages, 8 figures

Abstract:A common task in forensic biology is to interpret and evaluate short tandem repeat (STR) DNA profiles. The first step in these interpretations is to assign a number of contributors to the profiles, a task that is most often performed manually by a scientist using their knowledge of DNA profile behaviour. Studies using constructed DNA profiles have shown that as DNA profiles become more complex, and the number of DNA-donating individuals increases, the ability of scientists to assign the target number decreases. A number of machine learning algorithms have been developed that seek to assign the number of contributors to a DNA profile; however, due to practical limitations in being able to generate DNA profiles in a laboratory, the algorithms have been based on summaries of the available information. In this work we develop an analysis pipeline that simulates the electrophoretic signal of an STR profile, allowing virtually unlimited, pre-labelled training material to be generated. We show that by simulating 100 000 profiles and training a number-of-contributors estimation tool using a deep neural network architecture (in an algorithm named deepNoC), a high level of performance is achieved (89% for 1 to 10 contributors). The trained network can then be fine-tuned with only a few hundred profiles in order to achieve the same accuracy within a specific laboratory. We also build into deepNoC secondary outputs that provide a level of explainability to a user of the algorithm, and show how they can be displayed in an intuitive manner.

[LG-31] Infinite-dimensional next-generation reservoir computing

Link: https://arxiv.org/abs/2412.09800
Authors: Lyudmila Grigoryeva, Hannah Lim Jing Ting, Juan-Pablo Ortega
Keywords: Next-generation reservoir computing, Next-generation reservoir, reservoir computing, ease of implementation, attracted much attention
Subjects: Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE); Computational Physics (physics.comp-ph)
Comments: 13 pages, 2 figures, 3 tables

Abstract:Next-generation reservoir computing (NG-RC) has attracted much attention due to its excellent performance in spatio-temporal forecasting of complex systems and its ease of implementation. This paper shows that NG-RC can be encoded as a kernel ridge regression that makes training efficient and feasible even when the space of chosen polynomial features is very large. Additionally, an extension to an infinite number of covariates is possible, which makes the methodology agnostic with respect to the lags into the past that are considered as explanatory factors, as well as with respect to the number of polynomial covariates, an important hyperparameter in traditional NG-RC. We show that this approach has solid theoretical backing and good behavior based on kernel universality properties previously established in the literature. Various numerical illustrations show that these generalizations of NG-RC outperform the traditional approach in several forecasting applications.
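
The link between polynomial features of lagged inputs and kernel ridge regression can be illustrated with a toy one-step forecaster. The sketch below is an assumption-laden stand-in, not the paper's construction: it uses a single lag, the inhomogeneous polynomial kernel (1 + u·v)² and a hand-written linear solver, and it forecasts the logistic map, whose update rule is exactly quadratic in the previous value.

```python
def poly_kernel(u, v, degree=2):
    """Inhomogeneous polynomial kernel (1 + <u, v>)**degree."""
    return (1.0 + sum(a * b for a, b in zip(u, v))) ** degree


def solve(matrix, rhs):
    """Solve a dense linear system by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [row[:] + [b] for row, b in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    sol = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(aug[i][j] * sol[j] for j in range(i + 1, n))
        sol[i] = (aug[i][n] - tail) / aug[i][i]
    return sol


def fit(inputs, targets, ridge=1e-6, degree=2):
    """Kernel ridge regression: alpha = (K + ridge * I)^-1 y."""
    n = len(inputs)
    gram = [[poly_kernel(inputs[i], inputs[j], degree) + (ridge if i == j else 0.0)
             for j in range(n)] for i in range(n)]
    return solve(gram, targets)


def predict(inputs, alpha, x, degree=2):
    """Prediction is a kernel-weighted sum over the training inputs."""
    return sum(a * poly_kernel(xi, x, degree) for a, xi in zip(alpha, inputs))


# Training data: logistic map x_{t+1} = 3.9 * x_t * (1 - x_t), one lag as input.
series = [0.2]
for _ in range(25):
    series.append(3.9 * series[-1] * (1.0 - series[-1]))
inputs = [[v] for v in series[:-1]]
targets = series[1:]

alpha = fit(inputs, targets)
print(predict(inputs, alpha, [0.5]))  # true next value is 3.9 * 0.5 * 0.5 = 0.975
```

The cost of fitting depends on the number of training points rather than on the number of polynomial features, which is the practical benefit the abstract highlights when the feature space is very large.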

[LG-32] A Novel Methodology in Credit Spread Prediction Based on Ensemble Learning and Feature Selection

Link: https://arxiv.org/abs/2412.09769
Authors: Yu Shao, Jiawen Bai, Yingze Hou, Xia’an Zhou, Zhanhao Pan
Keywords: effective trading strategies, devise effective trading, offering valuable insights, offering valuable, trading strategies
Subjects: Numerical Analysis (math.NA); Machine Learning (cs.LG)
Comments: 7 pages, 5 figures

Abstract:The credit spread is a key indicator in bond investments, offering valuable insights for fixed-income investors to devise effective trading strategies. This study proposes a novel credit spread forecasting model leveraging ensemble learning techniques. To enhance predictive accuracy, a feature selection method based on mutual information is incorporated. Empirical results demonstrate that the proposed methodology delivers superior accuracy in credit spread predictions. Additionally, we present a forecast of future credit spread trends using current data, providing actionable insights for investment decision-making.
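
Feature selection based on mutual information can be sketched for discrete (or pre-binned) variables as follows. The feature names and values are invented for illustration, and continuous credit-spread features would first need binning or a continuous estimator; the abstract does not say which estimator the authors use.

```python
import math
from collections import Counter


def mutual_information(xs, ys):
    """Mutual information (in nats) between two equally long discrete sequences."""
    n = len(xs)
    joint = Counter(zip(xs, ys))
    px = Counter(xs)
    py = Counter(ys)
    mi = 0.0
    for (x, y), count in joint.items():
        p_xy = count / n
        mi += p_xy * math.log(p_xy * n * n / (px[x] * py[y]))
    return mi


def rank_features(features, target):
    """Return feature names sorted from most to least informative about the target."""
    scores = {name: mutual_information(values, target) for name, values in features.items()}
    return sorted(scores, key=scores.get, reverse=True)


# Hypothetical binned features; target = spread widened (1) or not (0).
target = [0, 0, 1, 1]
features = {
    "rating_downgrade": [0, 0, 1, 1],  # perfectly informative in this toy sample
    "weekday_parity": [0, 1, 0, 1],    # independent of the target
}
print(rank_features(features, target))
```

In a real pipeline the top-ranked features would be passed on to the ensemble model, and the cut-off would be chosen by validation rather than fixed in advance.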

[LG-33] Toward Foundation Model for Multivariate Wearable Sensing of Physiological Signals

Link: https://arxiv.org/abs/2412.09758
Authors: Yunfei Luo, Yuliang Chen, Asif Salekin, Tauhidur Rahman
Keywords: Time-series foundation models, Time-series foundation, Wearable sensing, ability to run, Wearable sensing data
Subjects: Machine Learning (cs.LG); Signal Processing (eess.SP)
Comments: The code is available at: this http URL

Abstract:Time-series foundation models have the ability to run inference, mainly forecasting, on any type of time series data, thanks to the informative representations comprising waveform features. Wearable sensing data, on the other hand, contain more variability in both patterns and frequency bands of interest and generally place more emphasis on the ability to infer healthcare-related outcomes. The main challenge of crafting a foundation model for wearable sensing physiological signals is to learn generalizable representations that support efficient adaptation across heterogeneous sensing configurations and applications. In this work, we propose NormWear, a step toward such a foundation model, aiming to extract generalized and informative wearable sensing representations. NormWear has been pretrained on a large set of physiological signals, including PPG, ECG, EEG, GSR, and IMU, from various public resources. For a holistic assessment, we perform downstream evaluation on 11 public wearable sensing datasets, spanning 18 applications in the areas of mental health, body state inference, biomarker estimations, and disease risk evaluations. We demonstrate that NormWear achieves better performance than competitive baselines in general time series foundation modeling. In addition, leveraging a novel representation-alignment-match-based method, we align physiological signal embeddings with text embeddings. This alignment enables our proposed foundation model to perform zero-shot inference, allowing it to generalize to previously unseen wearable signal-based health applications. Finally, we perform nonlinear dynamic analysis on the waveform features extracted by the model at each intermediate layer. This analysis quantifies the model’s internal processes, offering clear insights into its behavior and fostering greater trust in its inferences among end users.

[LG-34] Towards joint graph and sampling set selection from data

Link: https://arxiv.org/abs/2412.09753
Authors: Shashank N. Sridhara, Eduardo Pavez, Antonio Ortega
Keywords: Vertex Importance Sampling, Vertex Importance, sampling, graph, Importance Sampling
Subjects: Machine Learning (cs.LG); Image and Video Processing (eess.IV); Signal Processing (eess.SP)
Comments: 5 pages, 7 figures, IEEE Asilomar Conference on Signals, Systems, and Computers-2024

Abstract:We explore the problem of sampling graph signals in scenarios where the graph structure is not predefined and must be inferred from data. In this scenario, existing approaches rely on a two-step process, where a graph is learned first, followed by sampling. More generally, graph learning and graph signal sampling have been studied as two independent problems in the literature. This work provides a foundational step towards jointly optimizing the graph structure and sampling set. Our main contribution, Vertex Importance Sampling (VIS), is to show that the sampling set can be effectively determined from the vertex importance (node weights) obtained from graph learning. We further propose Vertex Importance Sampling with Repulsion (VISR), a greedy algorithm where spatially separated “important” nodes are selected to ensure better reconstruction. Empirical results on simulated data show that sampling using VIS and VISR leads to competitive reconstruction performance and lower complexity than the conventional two-step approach of graph learning followed by graph sampling.
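
The idea of greedily picking important but well-separated nodes can be sketched as below. This is only an illustration of the general principle, not the VISR algorithm itself: the hard exclusion radius, the use of Euclidean coordinates, and every name in the code are assumptions, and the paper may define repulsion through the learned graph instead.

```python
import math


def select_with_repulsion(importance, positions, k, radius):
    """Greedily pick up to k nodes by descending importance, skipping any node
    that lies closer than `radius` to a node already selected."""
    order = sorted(range(len(importance)), key=lambda i: importance[i], reverse=True)
    chosen = []
    for i in order:
        if len(chosen) == k:
            break
        if all(math.dist(positions[i], positions[j]) >= radius for j in chosen):
            chosen.append(i)
    return chosen


# Hypothetical vertex importances (node weights) and 2-D sensor locations.
importance = [0.9, 0.8, 0.7, 0.1]
positions = [(0.0, 0.0), (0.1, 0.0), (5.0, 0.0), (10.0, 0.0)]

print(select_with_repulsion(importance, positions, k=2, radius=1.0))
print(select_with_repulsion(importance, positions, k=2, radius=0.0))
```

With no repulsion (radius 0) the two most important nodes are chosen even though they are almost co-located; with repulsion the second pick moves to a distant node, which is the intuition behind spreading samples out for better reconstruction.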

[LG-35] A Quasilinear Algorithm for Computing Higher-Order Derivatives of Deep Feed-Forward Neural Networks

Link: https://arxiv.org/abs/2412.09752
Authors: Kyle R. Chickering
Keywords: solving differential equations, practically difficult due, exponentially increasing runtime, computing high-order derivatives, solving differential
Categories: Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments: 11 pages, 10 figures

Abstract: The use of neural networks for solving differential equations is practically difficult due to the exponentially increasing runtime of autodifferentiation when computing high-order derivatives. We propose n-TangentProp, the natural extension of the TangentProp formalism (Simard et al., 1991) to arbitrarily many derivatives. n-TangentProp computes the exact derivative d^n/dx^n f(x) in quasilinear, instead of exponential, time for a densely connected, feed-forward neural network f with a smooth, parameter-free activation function. We validate our algorithm empirically across a range of depths, widths, and number of derivatives. We demonstrate that our method is particularly beneficial in the context of physics-informed neural networks, where n-TangentProp allows for significantly faster training times than previous methods and has favorable scaling with respect to both model size and loss-function complexity as measured by the number of required derivatives. The code for this paper can be found at this https URL_tangentprop.
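
To make the idea of carrying derivatives through the forward pass concrete, the sketch below handles only the case n = 2 for a scalar-input tanh network, with the chain rule written out by hand. It is an illustration under those assumptions and not the paper's general algorithm; the function name and the toy weights are hypothetical.

```python
import math

def forward_with_derivatives(x, layers):
    """Propagate scalar x through a tanh MLP, carrying the value and its first
    and second derivatives with respect to x in a single forward pass.
    layers is a list of (W, b) with W given as a list of rows; the last layer
    is linear and is assumed to have one output."""
    a, d1, d2 = [x], [1.0], [0.0]
    for idx, (W, b) in enumerate(layers):
        # Affine map: value and derivatives pass through the weights linearly.
        z = [sum(w * v for w, v in zip(row, a)) + bi for row, bi in zip(W, b)]
        z1 = [sum(w * v for w, v in zip(row, d1)) for row in W]
        z2 = [sum(w * v for w, v in zip(row, d2)) for row in W]
        if idx == len(layers) - 1:
            a, d1, d2 = z, z1, z2
            break
        # tanh chain rule, with s = tanh(z): s' = 1 - s^2, s'' = -2 s (1 - s^2).
        s = [math.tanh(v) for v in z]
        sp = [1.0 - v * v for v in s]
        a = s
        d1 = [p * u for p, u in zip(sp, z1)]
        d2 = [p * u2 - 2.0 * t * p * u * u for t, p, u, u2 in zip(s, sp, z1, z2)]
    return a[0], d1[0], d2[0]

# Toy network: 1 input, 2 hidden tanh units, 1 linear output.
layers = [([[0.5], [-1.0]], [0.1, 0.2]), ([[1.5, -0.7]], [0.3])]
value, first, second = forward_with_derivatives(0.3, layers)
print(value, first, second)
```

The cost of this pass grows with the number of derivative channels carried per layer rather than with repeated nesting of autodifferentiation, which is the scaling contrast the abstract draws.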

[LG-36] New Approach to Clustering Random Attributes

Link: https://arxiv.org/abs/2412.09748
Authors: Zenon Gniazdowski
Keywords: attributes, nominal attributes, paper proposes, types of random, nominal
Categories: Machine Learning (cs.LG)
Comments: 50 pages, 15 figures, 25 tables

Abstract:This paper proposes a new method for similarity analysis and, consequently, a new algorithm for clustering different types of random attributes, both numerical and nominal. However, in order for nominal attributes to be clustered, their values must be properly encoded. In the encoding process, nominal attributes obtain a new representation in numerical form. Only the numeric attributes can be subjected to factor analysis, which allows them to be clustered in terms of their similarity to factors. The proposed method was tested for several sample datasets. It was found that the proposed method is universal. On the one hand, the method allows clustering of numerical attributes. On the other hand, it provides the ability to cluster nominal attributes. It also allows simultaneous clustering of numerical attributes and numerically encoded nominal attributes.

[LG-37] TelApart: Differentiating Network Faults from Customer-Premise Faults in Cable Broadband Networks

Link: https://arxiv.org/abs/2412.09740
Authors: Jiyao Hu, Zhenyu Zhou, Xiaowei Yang
Keywords: impairments frequently occur, impairments occur, impairments frequently, frequently occur, occur inside
Categories: Networking and Internet Architecture (cs.NI); Machine Learning (cs.LG)
Comments: 14 pages. arXiv admin note: text overlap with arXiv:2412.09564

Abstract: Two types of radio frequency (RF) impairments frequently occur in a cable broadband network: impairments that occur inside a cable network and impairments that occur at the edge of the broadband network, i.e., in a subscriber’s premise. Differentiating these two types of faults is important, as different faults require different types of technical personnel to repair them. Presently, the cable industry lacks publicly available tools to automatically diagnose the type of fault. In this work, we present TelApart, a fault diagnosis system for cable broadband networks. TelApart uses telemetry data collected by the Proactive Network Maintenance (PNM) infrastructure in cable networks to effectively differentiate the type of fault. Integral to TelApart’s design is an unsupervised machine learning model that groups cable devices sharing similar anomalous patterns together. We use metrics derived from an ISP’s customer trouble tickets to programmatically tune the model’s hyper-parameters so that an ISP can deploy TelApart in various conditions without hand-tuning its hyper-parameters. We also address the data challenge that the telemetry data collected by the PNM system contain numerous missing, duplicated, and unaligned data points. Using real-world data contributed by a cable ISP, we show that TelApart can effectively identify different types of faults.

[LG-38] The Cost of Replicability in Active Learning

Link: https://arxiv.org/abs/2412.09686
Authors: Rupkatha Hira, Dominik Kau, Jessica Sorrell
Keywords: unlabeled data points, initially unlabeled data, Active learning aims, Active learning, machine learning models
Categories: Machine Learning (cs.LG)

Abstract:Active learning aims to reduce the required number of labeled data for machine learning algorithms by selectively querying the labels of initially unlabeled data points. Ensuring the replicability of results, where an algorithm consistently produces the same outcome across different runs, is essential for the reliability of machine learning models but often increases sample complexity. This report investigates the cost of replicability in active learning using the CAL algorithm, a classical disagreement-based active learning method. By integrating replicable statistical query subroutines and random thresholding techniques, we propose two versions of a replicable CAL algorithm. Our theoretical analysis demonstrates that while replicability does increase label complexity, the CAL algorithm can still achieve significant savings in label complexity even with the replicability constraint. These findings offer valuable insights into balancing efficiency and robustness in machine learning models.
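
One common way to make a statistical estimate replicable is to snap it to a randomly shifted grid, where the shift comes from randomness shared across runs. The sketch below illustrates that general idea only; it is not the replicable CAL algorithm from the paper, and the function names, the `width` parameter, and the seed-based sharing of randomness are assumptions made for the example.

```python
import math
import random

def replicable_round(value, width, offset):
    """Snap value to the midpoint of its cell in a grid of the given width
    that is shifted by offset (0 <= offset < width)."""
    cell = math.floor((value - offset) / width)
    return offset + (cell + 0.5) * width

def replicable_mean(samples, width, shared_seed):
    """Estimate a mean so that two runs on different samples, sharing only the
    seed, return the identical number unless their empirical means straddle a
    grid boundary (hypothetical helper for illustration)."""
    offset = random.Random(shared_seed).random() * width
    return replicable_round(sum(samples) / len(samples), width, offset)

# Two slightly different empirical means fall into the same grid cell.
print(replicable_round(0.501, 0.1, 0.03) == replicable_round(0.503, 0.1, 0.03))
```

The trade-off is visible in `width`: a coarser grid makes agreement between runs more likely but costs accuracy, so more samples are needed to reach the same precision, which mirrors the increased label complexity discussed in the abstract.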

[LG-39] Revisiting Graph Homophily Measures

Link: https://arxiv.org/abs/2412.09663
Authors: Mikhail Mironov, Liudmila Prokhorenkova
Keywords: connect similar nodes, graph property describing, Homophily, similar nodes, property describing
Categories: Machine Learning (cs.LG); Discrete Mathematics (cs.DM); Social and Information Networks (cs.SI)
Comments: 22 pages, 3 figures, Learning on Graphs Conference 2024

Abstract:Homophily is a graph property describing the tendency of edges to connect similar nodes. There are several measures used for assessing homophily but all are known to have certain drawbacks: in particular, they cannot be reliably used for comparing datasets with varying numbers of classes and class size balance. To show this, previous works on graph homophily suggested several properties desirable for a good homophily measure, also noting that no existing homophily measure has all these properties. Our paper addresses this issue by introducing a new homophily measure - unbiased homophily - that has all the desirable properties and thus can be reliably used across datasets with different label distributions. The proposed measure is suitable for undirected (and possibly weighted) graphs. We show both theoretically and via empirical examples that the existing homophily measures have serious drawbacks while unbiased homophily has a desirable behavior for the considered scenarios. Finally, when it comes to directed graphs, we prove that some desirable properties contradict each other and thus a measure satisfying all of them cannot exist.
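
To see why existing measures are hard to compare across datasets with different class balance, consider edge homophily, one widely used existing measure (the fraction of edges joining same-class nodes). The sketch below computes it on two toy graphs; it does not implement the unbiased homophily proposed in the paper, and the toy graphs are assumptions chosen for illustration.

```python
from itertools import combinations

def edge_homophily(edges, labels):
    """Fraction of edges whose two endpoints carry the same class label."""
    same = sum(1 for u, v in edges if labels[u] == labels[v])
    return same / len(edges)

# Path graph 0-1-2-3 with two balanced classes.
path_edges = [(0, 1), (1, 2), (2, 3)]
path_labels = [0, 0, 1, 1]
print(edge_homophily(path_edges, path_labels))

# Complete graph on 10 nodes, 9 in class 0 and 1 in class 1.
# Every pair is connected, so the edges carry no label information at all.
clique_edges = list(combinations(range(10), 2))
clique_labels = [0] * 9 + [1]
print(edge_homophily(clique_edges, clique_labels))
```

In the second graph 36 of the 45 edges join same-class nodes purely because of class imbalance, so the score is high even though the structure is independent of the labels.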

[LG-40] Machine Learning Driven Smishing Detection Framework for Mobile Security

Link: https://arxiv.org/abs/2412.09641
Authors: Diksha Goel, Hussain Ahmad, Ankit Kumar Jain, Nikhil Kumar Goel
Keywords: personal data management, financial transactions, smartphones for communication, targets for cyberattacks, increasing reliance
Categories: Cryptography and Security (cs.CR); Machine Learning (cs.LG)

Abstract:The increasing reliance on smartphones for communication, financial transactions, and personal data management has made them prime targets for cyberattacks, particularly smishing, a sophisticated variant of phishing conducted via SMS. Despite the growing threat, traditional detection methods often struggle with the informal and evolving nature of SMS language, which includes abbreviations, slang, and short forms. This paper presents an enhanced content-based smishing detection framework that leverages advanced text normalization techniques to improve detection accuracy. By converting nonstandard text into its standardized form, the proposed model enhances the efficacy of machine learning classifiers, particularly the Naive Bayesian classifier, in distinguishing smishing messages from legitimate ones. Our experimental results, validated on a publicly available dataset, demonstrate a detection accuracy of 96.2%, with a low False Positive Rate of 3.87% and False Negative Rate of 2.85%. This approach significantly outperforms existing methodologies, providing a robust solution to the increasingly sophisticated threat of smishing in the mobile environment.
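
The text normalization step described in the abstract above, converting nonstandard SMS text to a standardized form before classification, might look roughly like the sketch below. The lookup table and function name are hypothetical; the paper's actual normalization rules and lexicon are not given in the abstract.

```python
import re

# Hypothetical lookup table; a real system would need a far larger lexicon.
SLANG = {"u": "you", "ur": "your", "pls": "please", "acc": "account", "txt": "text"}

def normalize_sms(message):
    """Lowercase the message, split it into alphanumeric tokens, and replace
    known abbreviations with their standard forms."""
    tokens = re.findall(r"[a-z0-9]+", message.lower())
    return " ".join(SLANG.get(tok, tok) for tok in tokens)

print(normalize_sms("Pls verify ur acc now"))
```

The normalized string would then be fed to a classifier such as Naive Bayes, so that "acc" and "account" contribute to the same feature instead of being counted separately.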

[LG-41] Integrating Functionalities To A System Via Autoencoder Hippocampus Network

Link: https://arxiv.org/abs/2412.09635
Authors: Siwei Luo
Keywords: Integrating multiple functionalities, Integrating multiple, deep learning, multiple functionalities, poses a fascinating
Categories: Neural and Evolutionary Computing (cs.NE); Machine Learning (cs.LG)

Abstract:Integrating multiple functionalities into a system poses a fascinating challenge to the field of deep learning. While the precise mechanisms by which the brain encodes and decodes information, and learns diverse skills, remain elusive, memorization undoubtedly plays a pivotal role in this process. In this article, we delve into the implementation and application of an autoencoder-inspired hippocampus network in a multi-functional system. We propose an autoencoder-based memorization method for policy function’s parameters. Specifically, the encoder of the autoencoder maps policy function’s parameters to a skill vector, while the decoder retrieves the parameters via this skill vector. The policy function is dynamically adjusted tailored to corresponding tasks. Henceforth, a skill vectors graph neural network is employed to represent the homeomorphic topological structure of subtasks and manage subtasks execution.

[LG-42] Controlling dynamical systems into unseen target states using machine learning

Link: https://arxiv.org/abs/2412.10251
Authors: Daniel Köglmayr, Alexander Haluszczynski, Christoph Räth
Keywords: controlling complex dynamical, complex dynamical systems, controlling complex, complex dynamical, unseen target states
Categories: Chaotic Dynamics (nlin.CD); Machine Learning (cs.LG); Systems and Control (eess.SY)

Abstract:We present a novel, model-free, and data-driven methodology for controlling complex dynamical systems into previously unseen target states, including those with significantly different and complex dynamics. Leveraging a parameter-aware realization of next-generation reservoir computing, our approach accurately predicts system behavior in unobserved parameter regimes, enabling control over transitions to arbitrary target states. Crucially, this includes states with dynamics that differ fundamentally from known regimes, such as shifts from periodic to intermittent or chaotic behavior. The method’s parameter-awareness facilitates non-stationary control, ensuring smooth transitions between states. By extending the applicability of machine learning-based control mechanisms to previously inaccessible target dynamics, this methodology opens the door to transformative new applications while maintaining exceptional efficiency. Our results highlight reservoir computing as a powerful alternative to traditional methods for dynamic system control.

[LG-43] Data Integration with Fusion Searchlight: Classifying Brain States from Resting-state fMRI

Link: https://arxiv.org/abs/2412.10161
Authors: Simon Wein, Marco Riebel, Lisa-Marie Brunner, Caroline Nothdurfter, Rainer Rupprecht, Jens V. Schwarzbach
Keywords: Spontaneous neural activity, complex spatio-temporal dynamics, neural activity observed, Spontaneous neural, activity observed
Categories: Neurons and Cognition (q-bio.NC); Machine Learning (cs.LG)

Abstract:Spontaneous neural activity observed in resting-state fMRI is characterized by complex spatio-temporal dynamics. Different measures related to local and global brain connectivity and fluctuations in low-frequency amplitudes can quantify individual aspects of these neural dynamics. Even though such measures are derived from the same functional signals, they are often evaluated separately, neglecting their interrelations and potentially reducing the analysis sensitivity. In our study, we present a fusion searchlight (FuSL) framework to combine the complementary information contained in different resting-state fMRI metrics and demonstrate how this can improve the decoding of brain states. Moreover, we show how explainable AI allows us to reconstruct the differential impact of each metric on the decoding, which additionally increases spatial specificity of searchlight analysis. In general, this framework can be adapted to combine information derived from different imaging modalities or experimental conditions, offering a versatile and interpretable tool for data fusion in neuroimaging.

[LG-44] Matrix Completion via Residual Spectral Matching

Link: https://arxiv.org/abs/2412.10005
Authors: Ziyuan Chen, Fang Yao
Keywords: attracted significant attention, significant attention due, Noisy matrix completion, recommendation systems, signal processing
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG); Methodology (stat.ME)
Comments: 23 pages, 6 figures

Abstract: Noisy matrix completion has attracted significant attention due to its applications in recommendation systems, signal processing and image restoration. Most existing works rely on (weighted) least squares methods under various low-rank constraints. However, minimizing the sum of squared residuals is not always efficient, as it may ignore the potential structural information in the residuals. In this study, we propose a novel residual spectral matching criterion that incorporates not only the numerical but also locational information of residuals. This criterion is the first in noisy matrix completion to adopt the perspective of low-rank perturbation of random matrices and exploit the spectral properties of sparse random matrices. We derive optimal statistical properties by analyzing the spectral properties of sparse random matrices and bounding the effects of low-rank perturbations and partial observations. Additionally, we propose algorithms that efficiently approximate solutions by constructing easily computable pseudo-gradients. The iterative process of the proposed algorithms ensures convergence at a rate consistent with the optimal statistical error bound. Our method and algorithms demonstrate improved numerical performance in both simulated and real data examples, particularly in environments with high noise levels.

[LG-45] Latent feedback control of distributed systems in multiple scenarios through deep learning-based reduced order models

Link: https://arxiv.org/abs/2412.09942
Authors: Matteo Tomasetto, Francesco Braghin, Andrea Manzoni
Keywords: desired physical behavior, Continuous monitoring, high-dimensional distributed systems, crucial in applications, applications to ensure
Categories: Optimization and Control (math.OC); Machine Learning (cs.LG); Numerical Analysis (math.NA)

Abstract:Continuous monitoring and real-time control of high-dimensional distributed systems are often crucial in applications to ensure a desired physical behavior, without degrading stability and system performances. Traditional feedback control design that relies on full-order models, such as high-dimensional state-space representations or partial differential equations, fails to meet these requirements due to the delay in the control computation, which requires multiple expensive simulations of the physical system. The computational bottleneck is even more severe when considering parametrized systems, as new strategies have to be determined for every new scenario. To address these challenges, we propose a real-time closed-loop control strategy enhanced by nonlinear non-intrusive Deep Learning-based Reduced Order Models (DL-ROMs). Specifically, in the offline phase, (i) full-order state-control pairs are generated for different scenarios through the adjoint method, (ii) the essential features relevant for control design are extracted from the snapshots through a combination of Proper Orthogonal Decomposition (POD) and deep autoencoders, and (iii) the low-dimensional policy bridging latent control and state spaces is approximated with a feedforward neural network. After data generation and neural networks training, the optimal control actions are retrieved in real-time for any observed state and scenario. In addition, the dynamics may be approximated through a cheap surrogate model in order to close the loop at the latent level, thus continuously controlling the system in real-time even when full-order state measurements are missing. The effectiveness of the proposed method, in terms of computational speed, accuracy, and robustness against noisy data, is finally assessed on two different high-dimensional optimal transport problems, one of which also involving an underlying fluid flow.

[LG-46] Financial Fine-tuning a Large Time Series Model

Link: https://arxiv.org/abs/2412.09880
Authors: Xinghong Fu, Masanori Hirano, Kentaro Imajo
Keywords: natural language processing, shown unprecedented capabilities, time series forecasting, time series, latest time series
Categories: Computational Finance (q-fin.CP); Machine Learning (cs.LG)

Abstract: Large models have shown unprecedented capabilities in natural language processing, image generation, and most recently, time series forecasting. This leads us to ask the question: treating market prices as a time series, can large models be used to predict the market? In this paper, we answer this by evaluating the performance of the latest time series foundation model TimesFM on price prediction. We find that, due to the irregular nature of price data, directly applying TimesFM gives unsatisfactory results, and we propose fine-tuning TimesFM on financial data for the task of price prediction. This is done by continual pre-training of TimesFM on price data containing 100 million time points, covering a range of financial instruments at hourly and daily granularities. The fine-tuned model demonstrates higher price prediction accuracy than the baseline model. We conduct mock trading for our model in various financial markets and show that it outperforms various benchmarks in terms of returns, Sharpe ratio, max drawdown and trading cost.
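
Two of the evaluation metrics named in the abstract above, Sharpe ratio and max drawdown, have standard textbook definitions that can be sketched as follows. The abstract does not state how the authors compute them (annualization, risk-free rate, return frequency), so the per-period, unannualized forms below are assumptions.

```python
import math

def sharpe_ratio(returns, risk_free=0.0):
    """Mean excess return divided by the sample standard deviation of excess
    returns, per period and without annualization (assumed convention)."""
    excess = [r - risk_free for r in returns]
    mean = sum(excess) / len(excess)
    var = sum((r - mean) ** 2 for r in excess) / (len(excess) - 1)
    return mean / math.sqrt(var)

def max_drawdown(equity):
    """Largest peak-to-trough decline of an equity curve, as a fraction of
    the running peak."""
    peak, worst = equity[0], 0.0
    for value in equity:
        peak = max(peak, value)
        worst = max(worst, (peak - value) / peak)
    return worst

print(sharpe_ratio([0.01, 0.03]))
print(max_drawdown([100.0, 120.0, 90.0, 110.0]))
```

In the second call the curve falls from a peak of 120 to 90, a decline of one quarter of the peak value.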

[LG-47] A Statistical Analysis for Supervised Deep Learning with Exponential Families for Intrinsically Low-dimensional Data

Link: https://arxiv.org/abs/2412.09779
Authors: Saptarshi Chakraborty, Peter L. Bartlett
Keywords: beta, Recent advances, supervised learning decays, intrinsic dimension, expected test error
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG); Statistics Theory (math.ST)

Abstract: Recent advances have revealed that the rate of convergence of the expected test error in deep supervised learning decays as a function of the intrinsic dimension and not the dimension $d$ of the input space. Existing literature defines this intrinsic dimension as the Minkowski dimension or the manifold dimension of the support of the underlying probability measures, which often results in sub-optimal rates and unrealistic assumptions. In this paper, we consider supervised deep learning when the response given the explanatory variable is distributed according to an exponential family with a $\beta$-Hölder smooth mean function. We consider an entropic notion of the intrinsic data-dimension and demonstrate that with $n$ independent and identically distributed samples, the test error scales as $\tilde{\mathcal{O}}\left(n^{-\frac{2\beta}{2\beta + \bar{d}_{2\beta}(\lambda)}}\right)$, where $\bar{d}_{2\beta}(\lambda)$ is the $2\beta$-entropic dimension of $\lambda$, the distribution of the explanatory variables. This improves on the best-known rates. Furthermore, under the assumption of an upper-bounded density of the explanatory variables, we characterize the rate of convergence as $\tilde{\mathcal{O}}\left(d^{\frac{2\lfloor\beta\rfloor(\beta + d)}{2\beta + d}} n^{-\frac{2\beta}{2\beta + d}}\right)$, establishing that the dependence on $d$ is not exponential but at most polynomial. We also demonstrate that when the explanatory variable has a lower bounded density, this rate in terms of the number of data samples is nearly optimal for learning the dependence structure for exponential families.

[LG-48] MPAX: Mathematical Programming in JAX

Link: https://arxiv.org/abs/2412.09734
Authors: Haihao Lu, Zedong Peng, Jinwen Yang
Keywords: integrating mathematical programming, machine learning workflows, Mathematical Programming, versatile and efficient, efficient toolbox
Categories: Optimization and Control (math.OC); Machine Learning (cs.LG)

Abstract: We introduce MPAX (Mathematical Programming in JAX), a versatile and efficient toolbox for integrating mathematical programming into machine learning workflows. MPAX implements first-order methods in JAX, providing native support for hardware acceleration along with features like batch solving, auto-differentiation, and device parallelism. Currently in beta version, MPAX supports linear programming and will be extended to solve more general mathematical programming problems and to include specialized modules for common machine learning tasks. The solver is available at this https URL.

[LG-49] Doubly Robust Conformalized Survival Analysis with Right-Censored Data

Link: https://arxiv.org/abs/2412.09729
Authors: Matteo Sesia, Vladimir Svetnik
Keywords: constructing lower prediction, lower prediction bounds, extending recent approaches, extending recent, recent approaches designed
Categories: Methodology (stat.ME); Machine Learning (cs.LG)

Abstract:We present a conformal inference method for constructing lower prediction bounds for survival times from right-censored data, extending recent approaches designed for type-I censoring. This method imputes unobserved censoring times using a suitable model, and then analyzes the imputed data using weighted conformal inference. This approach is theoretically supported by an asymptotic double robustness property. Empirical studies on simulated and real data sets demonstrate that our method is more robust than existing approaches in challenging settings where the survival model may be inaccurate, while achieving comparable performance in easier scenarios.

[LG-50] Investigating the Impact of Balancing Filtering and Complexity on Predictive Multiplicity: A Data-Centric Perspective

Link: https://arxiv.org/abs/2412.09712
Authors: Mustafa Cavus, Przemyslaw Biecek
Keywords: Rashomon effect presents, presents a significant, significant challenge, predictive multiplicity, model
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG)
Comments: 38 pages, 7 figures

Abstract:The Rashomon effect presents a significant challenge in model selection. It occurs when multiple models achieve similar performance on a dataset but produce different predictions, resulting in predictive multiplicity. This is especially problematic in high-stakes environments, where arbitrary model outcomes can have serious consequences. Traditional model selection methods prioritize accuracy and fail to address this issue. Factors such as class imbalance and irrelevant variables further complicate the situation, making it harder for models to provide trustworthy predictions. Data-centric AI approaches can mitigate these problems by prioritizing data optimization, particularly through preprocessing techniques. However, recent studies suggest preprocessing methods may inadvertently inflate predictive multiplicity. This paper investigates how data preprocessing techniques like balancing and filtering methods impact predictive multiplicity and model stability, considering the complexity of the data. We conduct the experiments on 21 real-world datasets, applying various balancing and filtering techniques, and assess the level of predictive multiplicity introduced by these methods by leveraging the Rashomon effect. Additionally, we examine how filtering techniques reduce redundancy and enhance model generalization. The findings provide insights into the relationship between balancing methods, data complexity, and predictive multiplicity, demonstrating how data-centric AI strategies can improve model performance.
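
Predictive multiplicity can be quantified in several ways. One simple measure, often called ambiguity, is the share of samples on which near-equally accurate models disagree. The sketch below assumes the set of competing models has already been found and only their predictions are given; it is an illustration of that one measure, and the abstract does not say which measures the paper itself uses.

```python
def ambiguity(predictions):
    """predictions[m][i] is the label that model m assigns to sample i.
    Returns the share of samples on which at least two models disagree."""
    n = len(predictions[0])
    conflicted = sum(1 for i in range(n) if len({p[i] for p in predictions}) > 1)
    return conflicted / n

# Three hypothetical models with similar accuracy, four samples.
preds = [
    [1, 0, 1, 1],
    [1, 0, 0, 1],
    [1, 1, 0, 1],
]
print(ambiguity(preds))
```

Here the models agree on the first and last samples and conflict on the middle two, so half of the predictions depend on which model happens to be selected.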

[LG-51] Langevin Monte Carlo Beyond Lipschitz Gradient Continuity

Link: https://arxiv.org/abs/2412.09698
Authors: Matej Benko, Iwona Chlebicka, Jørgen Endal, Błażej Miasojedow
Keywords: Langevin Monte Carlo, Inexact Proximal Langevin, Proximal Langevin Algorithm, Monte Carlo, Langevin Monte
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG)
Comments: To appear in Proceedings of the AAAI Conference on Artificial Intelligence (AAAI-25)

Abstract: We present a significant advancement in the field of Langevin Monte Carlo (LMC) methods by introducing the Inexact Proximal Langevin Algorithm (IPLA). This novel algorithm broadens the scope of problems that LMC can effectively address while maintaining controlled computational costs. IPLA extends LMC’s applicability to potentials that are convex, strongly convex in the tails, and exhibit polynomial growth, beyond the conventional L-smoothness assumption. Moreover, we extend LMC’s applicability to super-quadratic potentials and offer improved convergence rates over existing algorithms. Additionally, we provide bounds on all moments of the Markov chain generated by IPLA, enhancing its analytical robustness.

[LG-52] Predicting Organic-Inorganic Halide Perovskite Photovoltaic Performance from Optical Properties of Constituent Films through Machine Learning

Link: https://arxiv.org/abs/2412.09638
Authors: Ruiqi Zhang, Brandon Motes, Shaun Tan, Yongli Lu, Meng-Chen Shih, Yilun Hao, Karen Yang, Shreyas Srinivasan, Moungi G. Bawendi, Vladimir Bulovic
Keywords: OABr hybrid organic-inorganic, organic-inorganic halide perovskite, hybrid organic-inorganic halide, HOIP solar cells, solar cells
Categories: Materials Science (cond-mat.mtrl-sci); Machine Learning (cs.LG); Computational Physics (physics.comp-ph)
Comments: 36 pages, 6 figures

Abstract: We demonstrate a machine learning (ML) approach that accurately predicts the current-voltage behavior of 3D/2D-structured (FAMA)Pb(IBr)3/OABr hybrid organic-inorganic halide perovskite (HOIP) solar cells under AM1.5 illumination. Our neural network algorithm is trained on measured responses from several hundred HOIP solar cells, using three simple optical measurements of constituent HOIP films as input: optical transmission spectrum, spectrally-resolved photoluminescence, and time-resolved photoluminescence, from which we predict the open-circuit voltage (Voc), short-circuit current (Jsc), and fill factor (FF) values of solar cells that contain the HOIP active layers. Determined average prediction accuracies for 95% of the predicted Voc, Jsc, and FF values are 91%, 94% and 89%, respectively, with R^2 coefficients of determination of 0.47, 0.77, and 0.58, respectively. Quantifying the connection between ML predictions and physical parameters extracted from the measured optical properties of the HOIP films allows us to identify the most significant parameters influencing the prediction results. With separate ML-classifying algorithms, we identify degraded solar cells using the same optical input data, achieving over 90% classification accuracy through support vector machine, cross entropy loss, and artificial neural network algorithms. To our knowledge, the demonstrated regression and classification work is the first to use ML to predict device photovoltaic properties solely from the optical properties of constituent materials.

Information Retrieval

[IR-0] Static Pruning in Dense Retrieval using Matrix Decomposition

Link: https://arxiv.org/abs/2412.09983
Authors: Federico Siciliano, Francesca Pezzuti, Nicola Tonellotto, Fabrizio Silvestri
Keywords: transform text documents, largely based, based on encoding, transform text, retrieval
Categories: Information Retrieval (cs.IR)

Abstract:In the era of dense retrieval, document indexing and retrieval is largely based on encoding models that transform text documents into embeddings. The efficiency of retrieval is directly proportional to the number of documents and the size of the embeddings. Recent studies have shown that it is possible to reduce embedding size without sacrificing - and in some cases improving - the retrieval effectiveness. However, the methods introduced by these studies are query-dependent, so they can’t be applied offline and require additional computations during query processing, thus negatively impacting the retrieval efficiency. In this paper, we present a novel static pruning method for reducing the dimensionality of embeddings using Principal Components Analysis. This approach is query-independent and can be executed offline, leading to a significant boost in dense retrieval efficiency with a negligible impact on the system effectiveness. Our experiments show that our proposed method reduces the dimensionality of document representations by over 50% with up to a 5% reduction in NDCG@10, for different dense retrieval models.
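
The offline, query-independent nature of PCA-based pruning can be illustrated with a small standard-library sketch: principal directions are fitted once on the document embeddings, and every embedding is then projected onto the top k of them. This is not the authors' implementation; the power-iteration routine, the function names, and the toy embeddings are assumptions, and the sketch assumes k does not exceed the rank of the data.

```python
import math

def top_components(vectors, k, iters=200):
    """Top-k principal directions via power iteration with deflation."""
    n, d = len(vectors), len(vectors[0])
    mean = [sum(v[j] for v in vectors) / n for j in range(d)]
    centered = [[v[j] - mean[j] for j in range(d)] for v in vectors]
    cov = [[sum(row[a] * row[b] for row in centered) / n for b in range(d)]
           for a in range(d)]
    components = []
    for _ in range(k):
        w = [1.0 / (j + 1) for j in range(d)]  # fixed, asymmetric start vector
        for _ in range(iters):
            w = [sum(cov[a][b] * w[b] for b in range(d)) for a in range(d)]
            norm = math.sqrt(sum(x * x for x in w))
            w = [x / norm for x in w]
        # Eigenvalue estimate, then deflate the covariance by that direction.
        lam = sum(w[a] * sum(cov[a][b] * w[b] for b in range(d)) for a in range(d))
        cov = [[cov[a][b] - lam * w[a] * w[b] for b in range(d)] for a in range(d)]
        components.append(w)
    return mean, components

def prune(vector, mean, components):
    """Project one embedding onto the retained principal directions."""
    return [sum((x - m) * c for x, m, c in zip(vector, mean, comp))
            for comp in components]

# Toy 3-dimensional "embeddings" that vary almost entirely along one direction.
docs = [[1.0, 1.0, 0.0], [2.0, 2.0, 0.1], [3.0, 3.0, 0.0], [4.0, 4.0, -0.1]]
mean, comps = top_components(docs, k=1)
pruned_docs = [prune(d, mean, comps) for d in docs]
print(pruned_docs)
```

Because `mean` and `comps` depend only on the document collection, the index can be pruned once offline, and a query embedding needs just the same fixed projection at search time.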

[IR-1] Hesitation and Tolerance in Recommender Systems

Link: https://arxiv.org/abs/2412.09950
Authors: Kuan Zou, Aixin Sun, Xuemeng Jiang, Yitong Ji, Hao Zhang, Jing Wang, Ruijie Guo
Keywords: inherently complex, acceptance or rejection, simple acceptance, User, involving behaviors
Categories: Information Retrieval (cs.IR)
Comments: 30 pages, 6 figures, 6 tables

Abstract:User interactions in recommender systems are inherently complex, often involving behaviors that go beyond simple acceptance or rejection. One particularly common behavior is hesitation, where users deliberate over recommended items, signaling uncertainty. Our large-scale surveys, with 6,644 and 3,864 responses respectively, confirm that hesitation is not only widespread but also has a profound impact on user experiences. When users spend additional time engaging with content they are ultimately uninterested in, this can lead to negative emotions, a phenomenon we term as tolerance. The surveys reveal that such tolerance behaviors often arise after hesitation and can erode trust, satisfaction, and long-term loyalty to the platform. For instance, a click might reflect a need for more information rather than genuine interest, and prolonged exposure to unsuitable content amplifies frustration. This misalignment between user intent and system interpretation introduces noise into recommendation training, resulting in suggestions that increase uncertainty and disengagement. To address these issues, we identified signals indicative of tolerance behavior and analyzed datasets from both e-commerce and short-video platforms. The analysis shows a strong correlation between increased tolerance behavior and decreased user activity. We integrated these insights into the training process of a recommender system for a major short-video platform. Results from four independent online A/B experiments demonstrated significant improvements in user retention, achieved with minimal additional computational costs. These findings underscore the importance of recognizing hesitation as a ubiquitous user behavior and addressing tolerance to enhance satisfaction, build trust, and sustain long-term engagement in recommender systems.
