This post presents the latest paper listing retrieved from arXiv.org on 2024-12-09. It is updated automatically and organized into five broad areas: NLP, CV, ML, AI, and IR. If you would like to receive it by email on a schedule, please leave your email address in the comments.

Note: the daily paper data is fetched from arXiv.org and updated automatically around 12:00 each day.

Friendly reminder: if you would like to receive the daily paper data by email, please leave your email address in the comments.

Table of Contents

Overview (2024-12-09)

420 new papers today, of which:

  • Natural Language Processing: 63 papers (Computation and Language (cs.CL))
  • Artificial Intelligence: 129 papers (Artificial Intelligence (cs.AI))
  • Computer Vision: 111 papers (Computer Vision and Pattern Recognition (cs.CV))
  • Machine Learning: 122 papers (Machine Learning (cs.LG))

Natural Language Processing

[NLP-0] TeamCraft: A Benchmark for Multi-Modal Multi-Agent Systems in Minecraft

[Quick Read]: This paper addresses how embodied agents can effectively understand and execute multi-modal task specifications in visually rich, dynamic environments. The key contribution is TeamCraft, a multi-modal multi-agent benchmark built on the open-world video game Minecraft. TeamCraft tests and improves models' ability to generalize to novel goals, scenes, and unseen numbers of agents through 55,000 task variants, multi-modal prompts, procedurally generated expert demonstrations, and carefully designed evaluation protocols. Extensive analyses reveal the generalization limits of existing approaches, underscoring the need for further research.

Link: https://arxiv.org/abs/2412.05255
Authors: Qian Long, Zhi Li, Ran Gong, Ying Nian Wu, Demetri Terzopoulos, Xiaofeng Gao
Keywords (EN): cornerstone of society, Collaboration, human teammates make, Abstract, multi-modal
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Multiagent Systems (cs.MA)
Comments:


Abstract:Collaboration is a cornerstone of society. In the real world, human teammates make use of multi-sensory data to tackle challenging tasks in ever-changing environments. It is essential for embodied agents collaborating in visually-rich environments replete with dynamic interactions to understand multi-modal observations and task specifications. To evaluate the performance of generalizable multi-modal collaborative agents, we present TeamCraft, a multi-modal multi-agent benchmark built on top of the open-world video game Minecraft. The benchmark features 55,000 task variants specified by multi-modal prompts, procedurally-generated expert demonstrations for imitation learning, and carefully designed protocols to evaluate model generalization capabilities. We also perform extensive analyses to better understand the limitations and strengths of existing approaches. Our results indicate that existing models continue to face significant challenges in generalizing to novel goals, scenes, and unseen numbers of agents. These findings underscore the need for further research in this area. The TeamCraft platform and dataset are publicly available at this https URL.

[NLP-1] Uncertainty Quantification for Transformer Models for Dark-Pattern Detection

[Quick Read]: This paper targets the lack of transparency and trust in predictions from transformer-based models in applications prone to unethical practices, such as dark patterns in user interfaces. The key idea is differential fine-tuning of pre-trained transformers with uncertainty quantification integrated into the final classification head. Two uncertainty quantification methods are examined: Spectral-normalized Neural Gaussian Processes (SNGPs) and Bayesian Neural Networks (BNNs), evaluated on model performance, variance in predictive certainty, and environmental impact during training and inference. Results show that integrating uncertainty quantification maintains performance while offering insight into challenging instances, improving transparency and confidence in predictions and thereby helping mitigate the influence of dark patterns on user interfaces.

Link: https://arxiv.org/abs/2412.05251
Authors: Javier Muñoz, Álvaro Huertas-García, Carlos Martí-González, Enrique De Miguel Ambite
Keywords (EN): uncertainty quantification, Spectral-normalized Neural Gaussian, Neural Gaussian Processes, Bayesian Neural Networks, integrate uncertainty quantification
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Probability (math.PR)
Comments:


Abstract:The opaque nature of transformer-based models, particularly in applications susceptible to unethical practices such as dark-patterns in user interfaces, requires models that integrate uncertainty quantification to enhance trust in predictions. This study focuses on dark-pattern detection, deceptive design choices that manipulate user decisions, undermining autonomy and consent. We propose a differential fine-tuning approach implemented at the final classification head via uncertainty quantification with transformer-based pre-trained models. Employing a dense neural network (DNN) head architecture as a baseline, we examine two methods capable of quantifying uncertainty: Spectral-normalized Neural Gaussian Processes (SNGPs) and Bayesian Neural Networks (BNNs). These methods are evaluated on a set of open-source foundational models across multiple dimensions: model performance, variance in certainty of predictions and environmental impact during training and inference phases. Results demonstrate that integrating uncertainty quantification maintains performance while providing insights into challenging instances within the models. Moreover, the study reveals that the environmental impact does not uniformly increase with the incorporation of uncertainty quantification techniques. The study’s findings demonstrate that uncertainty quantification enhances transparency and provides measurable confidence in predictions, improving the explainability and clarity of black-box models. This facilitates informed decision-making and mitigates the influence of dark-patterns on user interfaces. These results highlight the importance of incorporating uncertainty quantification techniques in developing machine learning models, particularly in domains where interpretability and trustworthiness are critical.
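The BNN side of the comparison above rests on a standard recipe: run several stochastic forward passes and measure the entropy of the averaged predictive distribution; higher entropy signals a less certain prediction. A minimal pure-Python sketch of that idea (the stochastic classifier here is a stub with made-up softmax outputs, not the paper's fine-tuned transformer):

```python
import math
import random

def predictive_entropy(prob_samples):
    """Entropy of the mean predictive distribution over MC samples."""
    n_classes = len(prob_samples[0])
    mean_p = [sum(s[c] for s in prob_samples) / len(prob_samples)
              for c in range(n_classes)]
    return -sum(p * math.log(p) for p in mean_p if p > 0)

random.seed(0)

def stochastic_forward(confident):
    """Stub stochastic classifier: one noisy softmax vector per pass."""
    base = [0.9, 0.05, 0.05] if confident else [0.4, 0.35, 0.25]
    raw = [max(b + random.uniform(-0.02, 0.02), 1e-6) for b in base]
    z = sum(raw)
    return [r / z for r in raw]

confident_samples = [stochastic_forward(True) for _ in range(50)]
uncertain_samples = [stochastic_forward(False) for _ in range(50)]

h_conf = predictive_entropy(confident_samples)
h_unc = predictive_entropy(uncertain_samples)
# The ambiguous input should yield higher predictive entropy.
print(h_conf < h_unc)
```

In a dark-pattern detector, instances with high predictive entropy are exactly the "challenging instances" the abstract mentions, and can be routed to human review.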

[NLP-2] Enhancing FKG.in: automating Indian food composition analysis

[Quick Read]: This paper tackles computing nutritional composition data for Indian recipes, in particular how to obtain and analyze such data automatically using a knowledge graph and large language models (LLMs). The key is an automated food composition analysis workflow comprising nutrition data aggregation, food composition analysis, and LLM-based information resolution. The workflow is designed to complement the existing Indian food knowledge graph and to iteratively supplement food composition data from verified knowledge bases. The paper also discusses the challenges of representing Indian food and accessing food composition data digitally, and proposes LLM-based solutions to address them.

Link: https://arxiv.org/abs/2412.05248
Authors: Saransh Kumar Gupta, Lipika Dey, Partha Pratim Das, Geeta Trilok-Kumar, Ramesh Jain
Keywords (EN): food composition data, food composition, food composition analysis, Indian Food Composition, compute food composition
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Information Retrieval (cs.IR)
Comments: 15 pages, 3 figures, 30 references, International Conference on Pattern Recognition 2024 - Multimedia Assisted Dietary Management Workshop


Abstract:This paper presents a novel approach to compute food composition data for Indian recipes using a knowledge graph for Indian food (this http URL) and LLMs. The primary focus is to provide a broad overview of an automated food composition analysis workflow and describe its core functionalities: nutrition data aggregation, food composition analysis, and LLM-augmented information resolution. This workflow aims to complement this http URL and iteratively supplement food composition data from verified knowledge bases. Additionally, this paper highlights the challenges of representing Indian food and accessing food composition data digitally. It also reviews three key sources of food composition data: the Indian Food Composition Tables, the Indian Nutrient Databank, and the Nutritionix API. Furthermore, it briefly outlines how users can interact with the workflow to obtain diet-based health recommendations and detailed food composition information for numerous recipes. We then explore the complex challenges of analyzing Indian recipe information across dimensions such as structure, multilingualism, and uncertainty as well as present our ongoing work on LLM-based solutions to address these issues. The methods proposed in this workshop paper for AI-driven knowledge curation and information resolution are application-agnostic, generalizable, and replicable for any domain.

[NLP-3] MAmmoTH-VL: Eliciting Multimodal Reasoning with Instruction Tuning at Scale

[Quick Read]: This paper addresses the limited reasoning ability of open-source multimodal large language models (MLLMs), which is constrained by existing instruction-tuning datasets: these are mostly repurposed from academic datasets such as VQA, AI2D, and ChartQA, offer only phrase-level answers, and lack intermediate reasoning steps. The key contribution is a large-scale multimodal instruction-tuning dataset rich in intermediate rationales (CoT reasoning), built using open models to produce 12M instruction-response pairs covering diverse, reasoning-intensive tasks. Experiments show that MLLMs trained on this dataset gain substantially in reasoning ability, reach state-of-the-art results on several benchmarks, and also improve notably on non-reasoning benchmarks.

Link: https://arxiv.org/abs/2412.05237
Authors: Jarvis Guo, Tuney Zheng, Yuelin Bai, Bo Li, Yubo Wang, King Zhu, Yizhi Li, Graham Neubig, Wenhu Chen, Xiang Yue
Keywords (EN): Open-source multimodal large, shown significant potential, multimodal large language, large language models, Open-source multimodal
Subjects: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Comments:


Abstract:Open-source multimodal large language models (MLLMs) have shown significant potential in a broad range of multimodal tasks. However, their reasoning capabilities remain constrained by existing instruction-tuning datasets, which were predominately repurposed from academic datasets such as VQA, AI2D, and ChartQA. These datasets target simplistic tasks, and only provide phrase-level answers without any intermediate rationales. To address these challenges, we introduce a scalable and cost-effective method to construct a large-scale multimodal instruction-tuning dataset with rich intermediate rationales designed to elicit CoT reasoning. Using only open models, we create a dataset containing 12M instruction-response pairs to cover diverse, reasoning-intensive tasks with detailed and faithful rationales. Experiments demonstrate that training MLLMs on this dataset significantly improves reasoning capabilities, achieving state-of-the-art performance on benchmarks such as MathVerse (+8.1%), MMMU-Pro (+7%), and MuirBench (+13.3%). Additionally, the model demonstrates notable improvements of up to 4% on non-reasoning-based benchmarks. Ablation studies further highlight the importance of key components, such as rewriting and self-filtering, in the dataset construction process.

[NLP-4] LIAR: Leveraging Alignment (Best-of-N) to Jailbreak LLMs in Seconds

[Quick Read]: This paper addresses the enormous computational cost of existing jailbreak techniques for generating adversarial prompts. The key idea is to reformulate jailbreaking as an alignment problem: an unsafe reward is used to steer a safety-aligned model toward unsafe outputs. The proposed method, LIAR (LeveragIng Alignment to jailbReak), solves this alignment problem with a best-of-N method, dramatically lowering computational requirements, operating fully black-box, and achieving competitive attack success rates with more human-readable prompts. Experiments show attack success rates comparable to the state of the art, with substantial improvements in perplexity and time-to-attack.

Link: https://arxiv.org/abs/2412.05232
Authors: James Beetham, Souradip Chakraborty, Mengdi Wang, Furong Huang, Amrit Singh Bedi, Mubarak Shah
Keywords (EN): discrete combinatorial optimization, solving discrete combinatorial, generate multiple adversarial, multiple adversarial prompts, recent approaches involve
Subjects: Computation and Language (cs.CL)
Comments:


Abstract:Many existing jailbreak techniques rely on solving discrete combinatorial optimization, while more recent approaches involve training LLMs to generate multiple adversarial prompts. However, both approaches require significant computational resources to produce even a single adversarial prompt. We hypothesize that the inefficiency of current approaches stems from an inadequate characterization of the jailbreak problem. To address this gap, we formulate the jailbreak problem in terms of alignment. By starting from an available safety-aligned model, we leverage an unsafe reward to guide the safe model towards generating unsafe outputs using alignment techniques (e.g., reinforcement learning from human feedback), effectively performing jailbreaking via alignment. We propose a novel jailbreak method called LIAR (LeveragIng Alignment to jailbReak). To demonstrate the simplicity and effectiveness of our approach, we employ a best-of-N method to solve the alignment problem. LIAR offers significant advantages: lower computational requirements without additional training, fully black-box operation, competitive attack success rates, and more human-readable prompts. We provide theoretical insights into the possibility of jailbreaking a safety-aligned model, revealing inherent vulnerabilities in current alignment strategies for LLMs. We also provide sub-optimality guarantees for the proposed LIAR method. Experimentally, we achieve ASR comparable to the SoTA with a 10x improvement to perplexity and a Time-to-Attack measured in seconds rather than tens of hours.
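The best-of-N step the abstract describes can be sketched generically: sample N candidate prompts and keep the one the reward model scores highest. The generator and reward function below are toy stand-ins (a random string generator and a character-sum score), not the paper's LIAR components:

```python
import random

def reward(prompt):
    """Toy stand-in for a learned 'unsafe reward' model."""
    return sum(map(ord, prompt)) % 997

def best_of_n(candidates, reward_fn):
    """Core of a best-of-N method: keep the highest-reward candidate."""
    return max(candidates, key=reward_fn)

random.seed(1)
# Stub attacker: in LIAR the candidates would be sampled from an LLM.
candidates = [
    "candidate prompt " + "".join(random.choices("abcdefgh", k=8))
    for _ in range(16)
]
best = best_of_n(candidates, reward)
print(all(reward(best) >= reward(c) for c in candidates))  # True
```

Because only samples and scores are needed, the scheme requires no gradient access to the target model, which is what makes the fully black-box operation possible.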

[NLP-5] BEExformer: A Fast Inferencing Transformer Architecture via Binarization with Multiple Early Exits

[Quick Read]: This paper addresses the difficulty of deploying large language models (LLMs) on resource-constrained devices due to their enormous size and compute demands. The key contribution is the Binarized Early Exit Transformer (BEExformer), a selective-learning transformer architecture that combines model binarization with early exit (EE). BEExformer improves binarization via a differentiable second-order approximation to the impulse function, so gradient computation accounts for both the sign and the magnitude of weights, mitigating binarization-induced performance loss. Its early-exit mechanism is based on the fractional reduction in entropy between intermediate transformer blocks rather than an absolute threshold, combined with soft-routing loss estimation; this reduces FLOPs at inference while improving accuracy, and training is simplified by avoiding knowledge distillation from a full-precision LLM.

Link: https://arxiv.org/abs/2412.05225
Authors: Wazib Ansar, Saptarsi Goswami, Amlan Chakrabarti
Keywords (EN): Large Language Models, Large Language, transformers achieve cutting-edge, achieve cutting-edge results, Language Models
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Neural and Evolutionary Computing (cs.NE)
Comments: 15 pages, 15 figures, 3 tables


Abstract:Large Language Models (LLMs) based on transformers achieve cutting-edge results on a variety of applications. However, their enormous size and processing requirements make deployment on devices with constrained resources extremely difficult. Among various efficiency considerations, model binarization and Early Exit (EE) are common effective solutions. However, binarization may lead to performance loss due to reduced precision affecting gradient estimation and parameter updates. Besides, the present early-exit mechanisms are still in the nascent stages of research. To ameliorate these issues, we propose Binarized Early Exit Transformer (BEExformer), the first-ever selective learning transformer architecture to combine early exit with binarization for textual inference. It improves the binarization process through a differentiable second-order approximation to the impulse function. This enables gradient computation concerning both the sign as well as the magnitude of the weights. In contrast to absolute threshold-based EE, the proposed EE mechanism hinges on fractional reduction in entropy among intermediate transformer blocks with soft-routing loss estimation. While binarization results in 18.44 times reduction in model size, early exit reduces the FLOPs during inference by 54.85% and even improves accuracy by 5.98% through resolving the “overthinking” problem inherent in deep networks. Moreover, the proposed BEExformer simplifies training by not requiring knowledge distillation from a full-precision LLM. Extensive evaluation on the GLUE dataset and comparison with the SOTA works showcase its pareto-optimal performance-efficiency trade-off.
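One plausible reading of the entropy-based exit criterion (a sketch under that assumption, not the authors' BEExformer code): exit at the first intermediate block whose predictive entropy has fallen by a chosen fraction relative to the first block, instead of comparing against an absolute threshold.

```python
import math

def entropy(probs):
    """Shannon entropy of a discrete distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def early_exit_index(block_probs, frac=0.5):
    """Exit at the first block whose predictive entropy has dropped by
    at least `frac` relative to the entropy at the first block."""
    h0 = entropy(block_probs[0])
    for i, probs in enumerate(block_probs[1:], start=1):
        if entropy(probs) <= (1 - frac) * h0:
            return i
    return len(block_probs) - 1  # no exit triggered: run to the last block

# Toy per-block softmax outputs that sharpen with depth (hypothetical).
block_probs = [
    [0.40, 0.35, 0.25],  # block 0: uncertain
    [0.60, 0.25, 0.15],  # block 1: entropy still too high
    [0.85, 0.10, 0.05],  # block 2: entropy halved -> exit here
    [0.97, 0.02, 0.01],  # block 3: never reached
]
idx = early_exit_index(block_probs, frac=0.5)
print(idx)  # 2
```

Skipping the remaining blocks is also how such mechanisms counter the "overthinking" problem: once the prediction has stabilized, extra depth can only add computation, not accuracy.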

[NLP-6] 100% Hallucination Elimination Using Acurai

[Quick Read]: This paper targets hallucinations in large language models (LLMs), a key barrier to AI adoption in enterprise and high-stakes applications. The solution, Acurai, is a novel systematic approach that achieves 100% hallucination-free responses by reformatting queries and context data before input. Acurai leverages a deep understanding of LLM internal representations, the importance of noun-phrase dominance, and the role of discrete functional units (DFUs) to ensure alignment between input context and generated output. Validated on the RAGTruth corpus, Acurai eliminates 100% of hallucinations for both GPT-4 and GPT-3.5 Turbo, setting a new standard for consistent, accurate, and trustworthy AI responses.

Link: https://arxiv.org/abs/2412.05223
Authors: Michael C. Wood, Adam A. Forbes
Keywords (EN): large language models, language models, remains a critical, high-stakes applications, large language
Subjects: Computation and Language (cs.CL)
Comments:


Abstract:The issue of hallucinations in large language models (LLMs) remains a critical barrier to the adoption of AI in enterprise and other high-stakes applications. Despite advancements in retrieval-augmented generation (RAG) systems, current state-of-the-art methods fail to achieve more than 80% accuracy in generating faithful and factually correct outputs, even when provided with relevant and accurate context. In this work, we introduce Acurai, a novel systematic approach that achieves 100% hallucination-free responses in LLMs by reformatting queries and context data prior to input. Leveraging a deep understanding of LLM internal representations, the importance of noun-phrase dominance, and the role of discrete functional units (DFUs), Acurai ensures alignment between input context and generated output. We validate this method using the RAGTruth corpus, demonstrating its ability to eliminate 100% hallucinations for both GPT-4 and GPT-3.5 Turbo. Acurai sets a new standard for achieving consistent, accurate, and faithful AI responses, marking a significant step forward in the development of trustworthy AI systems.

[NLP-7] Evaluating and Aligning CodeLLMs on Human Preference

[Quick Read]: This paper addresses the fact that current code large language models (codeLLMs) neglect alignment with human preference during code generation. The key contribution is CodeArena, a rigorous human-curated benchmark that emulates the complexity and diversity of real-world coding tasks, with 397 high-quality samples spanning 40 categories and 44 programming languages. The paper also introduces SynCode-Instruct, a diverse synthetic instruction corpus of nearly 20B tokens, to verify the effectiveness of large-scale synthetic instruction fine-tuning. Systematic experiments on CodeArena reveal a notable performance gap between open-source SOTA code LLMs and proprietary LLMs, underscoring the importance of aligning model-generated responses with human preference.

Link: https://arxiv.org/abs/2412.05210
Authors: Jian Yang, Jiaxi Yang, Ke Jin, Yibo Miao, Lei Zhang, Liqun Yang, Zeyu Cui, Yichang Zhang, Binyuan Hui, Junyang Lin
Keywords (EN): made significant strides, large language models, Code large language, code LLMs, made significant
Subjects: Computation and Language (cs.CL)
Comments:


Abstract:Code large language models (codeLLMs) have made significant strides in code generation. Most previous code-related benchmarks, which consist of various programming exercises along with the corresponding test cases, are used as a common measure to evaluate the performance and capabilities of code LLMs. However, the current code LLMs focus on synthesizing the correct code snippet, ignoring the alignment with human preferences, where the query should be sampled from the practical application scenarios and the model-generated responses should satisfy the human preference. To bridge the gap between the model-generated response and human preference, we present a rigorous human-curated benchmark CodeArena to emulate the complexity and diversity of real-world coding tasks, with 397 high-quality samples spanning 40 categories and 44 programming languages, carefully curated from user queries. Further, we propose a diverse synthetic instruction corpus SynCode-Instruct (nearly 20B tokens) by scaling instructions from the website to verify the effectiveness of large-scale synthetic instruction fine-tuning, where Qwen2.5-SynCoder, trained entirely on synthetic instruction data, can achieve top-tier performance among open-source code LLMs. The results find performance differences between execution-based benchmarks and CodeArena. Our systematic experiments of CodeArena on 40+ LLMs reveal a notable performance gap between open SOTA code LLMs (e.g. Qwen2.5-Coder) and proprietary LLMs (e.g., OpenAI o1), underscoring the importance of human preference alignment. (this https URL)

[NLP-8] ConQRet: Benchmarking Fine-Grained Evaluation of Retrieval Augmented Argumentation with LLM Judges

[Quick Read]: This paper addresses the difficulty of evaluating high-quality, evidence-grounded answers to controversial topics (such as abortion bans and vaccination) in computational argumentation: human evaluation is costly and struggles with complex, lengthy answers, while existing argumentation datasets lack long, complex arguments and realistic evidence from potentially misleading sources, limiting holistic evaluation of retrieval effectiveness and argument quality. The key solution is automated evaluation with multiple fine-grained LLM judges, together with a new benchmark, ConQRet, featuring long, complex human-authored arguments on debated topics grounded in real-world websites, enabling exhaustive evaluation across retrieval effectiveness, argument quality, and groundedness. The approach yields better and more interpretable assessments than traditional single-score metrics, is validated on a prior dataset as well as the new ConQRet benchmark, and can help drive rapid progress in computational argumentation.

Link: https://arxiv.org/abs/2412.05206
Authors: Kaustubh D. Dhole, Kai Shu, Eugene Agichtein
Keywords (EN): today polarized environment, involves generating answers, LLM judges, bans and vaccination, polarized environment
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
Comments:


Abstract:Computational argumentation, which involves generating answers or summaries for controversial topics like abortion bans and vaccination, has become increasingly important in today’s polarized environment. Sophisticated LLM capabilities offer the potential to provide nuanced, evidence-based answers to such questions through Retrieval-Augmented Argumentation (RAArg), leveraging real-world evidence for high-quality, grounded arguments. However, evaluating RAArg remains challenging, as human evaluation is costly and difficult for complex, lengthy answers on complicated topics. At the same time, re-using existing argumentation datasets is no longer sufficient, as they lack long, complex arguments and realistic evidence from potentially misleading sources, limiting holistic evaluation of retrieval effectiveness and argument quality. To address these gaps, we investigate automated evaluation methods using multiple fine-grained LLM judges, providing better and more interpretable assessments than traditional single-score metrics and even previously reported human crowdsourcing. To validate the proposed techniques, we introduce ConQRet, a new benchmark featuring long and complex human-authored arguments on debated topics, grounded in real-world websites, allowing an exhaustive evaluation across retrieval effectiveness, argument quality, and groundedness. We validate our LLM Judges on a prior dataset and the new ConQRet benchmark. Our proposed LLM Judges and the ConQRet benchmark can enable rapid progress in computational argumentation and can be naturally extended to other complex retrieval-augmented generation tasks.

[NLP-9] QueEn: A Large Language Model for Quechua-English Translation

[Quick Read]: This paper addresses the challenges of machine translation for low-resource languages such as Quechua, where training data is limited and cultural nuances are hard to capture. The key is combining Retrieval-Augmented Generation (RAG) with the parameter-efficient fine-tuning technique Low-Rank Adaptation (LoRA). RAG lets the system draw on external linguistic resources to improve translation quality, while LoRA keeps fine-tuning efficient and light on compute. This integration raises translation performance substantially (BLEU from 1.5 to 17.6) and supports the broader goal of preserving endangered languages.

Link: https://arxiv.org/abs/2412.05184
Authors: Junhao Chen, Peng Shu, Yiwei Li, Huaqin Zhao, Hanqi Jiang, Yi Pan, Yifan Zhou, Zhengliang Liu, Lewis C Howe, Tianming Liu
Keywords (EN): Recent studies show, Recent studies, bringing advances, powerful tools, tools for working
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:


Abstract:Recent studies show that large language models (LLMs) are powerful tools for working with natural language, bringing advances in many areas of computational linguistics. However, these models face challenges when applied to low-resource languages due to limited training data and difficulty in understanding cultural nuances. In this paper, we propose QueEn, a novel approach for Quechua-English translation that combines Retrieval-Augmented Generation (RAG) with parameter-efficient fine-tuning techniques. Our method leverages external linguistic resources through RAG and uses Low-Rank Adaptation (LoRA) for efficient model adaptation. Experimental results show that our approach substantially exceeds baseline models, with a BLEU score of 17.6 compared to 1.5 for standard GPT models. The integration of RAG with fine-tuning allows our system to address the challenges of low-resource language translation while maintaining computational efficiency. This work contributes to the broader goal of preserving endangered languages through advanced language technologies.
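LoRA's efficiency claim is easy to quantify: a rank-r update B·A to a d_out × d_in weight matrix trains r·(d_in + d_out) parameters instead of d_out·d_in. A quick back-of-the-envelope check (the 4096 hidden size and rank 8 below are hypothetical, not values from the paper):

```python
def lora_param_count(d_in, d_out, rank):
    """Trainable parameters in a LoRA update W' = W + B @ A,
    with A of shape (rank, d_in) and B of shape (d_out, rank)."""
    return rank * d_in + d_out * rank

d_in = d_out = 4096            # hypothetical hidden size
full = d_in * d_out            # full fine-tuning of one weight matrix
lora = lora_param_count(d_in, d_out, rank=8)

print(full, lora, full // lora)  # 16777216 65536 256
```

A 256x reduction per adapted matrix is what makes fine-tuning feasible in low-resource settings where both data and compute are scarce.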

[NLP-10] Benchmarking Open-ended Audio Dialogue Understanding for Large Audio-Language Models

[Quick Read]: This paper addresses the absence of a comprehensive benchmark for evaluating Large Audio-Language Models (LALMs) on open-ended audio dialogue understanding. The key contribution is the Audio Dialogue Understanding Benchmark (ADU-Bench), comprising four benchmark datasets that cover 3 general scenarios, 12 skills, 9 multilingual languages, and 4 categories of ambiguity handling. Notably, it is the first to evaluate ambiguity handling in audio dialogues, e.g., "Really!?" spoken with different intonations to express different intents. ADU-Bench contains over 20,000 open-ended audio dialogues; experiments on 13 LALMs show that existing models still have substantial room for improvement, particularly with mathematical symbols and formulas, role-play, multiple languages, and ambiguities arising from intonation, pause position, and homophones.

Link: https://arxiv.org/abs/2412.05167
Authors: Kuofeng Gao, Shu-Tao Xia, Ke Xu, Philip Torr, Jindong Gu
Keywords (EN): Large Audio-Language Models, Large Audio-Language, Audio-Language Models, audio dialogue, audio dialogue understanding
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Comments:


Abstract:Large Audio-Language Models (LALMs) have unlocked audio dialogue capabilities, where audio dialogues are a direct exchange of spoken language between LALMs and humans. Recent advances, such as GPT-4o, have enabled LALMs in back-and-forth audio dialogues with humans. This progression not only underscores the potential of LALMs but also broadens their applicability across a wide range of practical scenarios supported by audio dialogues. However, given these advancements, a comprehensive benchmark to evaluate the performance of LALMs in open-ended audio dialogue understanding is still absent. To address this gap, we propose an Audio Dialogue Understanding Benchmark (ADU-Bench), which consists of 4 benchmark datasets. They assess the open-ended audio dialogue ability of LALMs in 3 general scenarios, 12 skills, 9 multilingual languages, and 4 categories of ambiguity handling. Notably, we are the first to propose the evaluation of ambiguity handling in audio dialogues that express different intentions beyond the same literal meaning of sentences, e.g., "Really!?" with different intonations. In summary, ADU-Bench includes over 20,000 open-ended audio dialogues for the assessment of LALMs. Through extensive experiments conducted on 13 LALMs, our analysis reveals that there is still considerable room for improvement in the audio dialogue understanding abilities of existing LALMs. In particular, they struggle with mathematical symbols and formulas, understanding human behavior such as roleplay, comprehending multiple languages, and handling audio dialogue ambiguities from different phonetic elements, such as intonations, pause positions, and homophones.

[NLP-11] Multimodal Fact-Checking with Vision Language Models: A Probing Classifier based Solution with Embedding Strategies COLING2025

[Quick Read]: This paper investigates how Vision Language Models (VLMs) can represent and exploit multimodal content for fact-checking. The key contribution is a probing-classifier-based approach: embeddings are extracted from the last hidden layer of selected VLMs and fed into a neural probing classifier for multi-class veracity classification. Experiments show that while multimodal content can improve performance, fusing separate embeddings from text and image encoders outperforms using VLM embeddings. Moreover, the proposed neural classifier significantly outperforms KNN and SVM baselines at leveraging the extracted embeddings, highlighting its effectiveness for multimodal fact-checking.

Link: https://arxiv.org/abs/2412.05155
Authors: Recep Firat Cekinel, Pinar Karagoz, Cagri Coltekin
Keywords (EN): Vision Language Models, Vision Language, Language Models, utilizing multimodal content, study evaluates
Subjects: Computation and Language (cs.CL)
Comments: Accepted to COLING2025


Abstract:This study evaluates the effectiveness of Vision Language Models (VLMs) in representing and utilizing multimodal content for fact-checking. To be more specific, we investigate whether incorporating multimodal content improves performance compared to text-only models and how well VLMs utilize text and image information to enhance misinformation detection. Furthermore we propose a probing classifier based solution using VLMs. Our approach extracts embeddings from the last hidden layer of selected VLMs and inputs them into a neural probing classifier for multi-class veracity classification. Through a series of experiments on two fact-checking datasets, we demonstrate that while multimodality can enhance performance, fusing separate embeddings from text and image encoders yielded superior results compared to using VLM embeddings. Furthermore, the proposed neural classifier significantly outperformed KNN and SVM baselines in leveraging extracted embeddings, highlighting its effectiveness for multimodal fact-checking.

[NLP-12] Findings of the Second BabyLM Challenge: Sample-Efficient Pretraining on Developmentally Plausible Corpora

[Quick Read]: This paper reports on the BabyLM Challenge, a community effort to close the data-efficiency gap between human and computational language learners by optimizing language model training under a fixed data budget of 100 million words or less. This year's improvements are upgraded text corpora and a new vision-and-language corpus to facilitate research into cognitively plausible vision-language models. Participants were compared on tasks covering grammatical ability, (visual) question answering, pragmatics, and grounding, across a 10M-word text-only track, a 100M-word text-only track, and a 100M-word multimodal track. A hybrid causal-masked language model architecture performed best among the 31 submissions, while none beat the baselines in the multimodal track. Follow-up analyses found a strong relationship between training FLOPs and average task performance, and the best-performing submissions changed the training data, training objective, and model architecture. This suggests substantial room remains for innovation, especially in image-text modeling, and that community-driven research can yield actionable insights into effective small-scale language modeling.

Link: https://arxiv.org/abs/2412.05149
Authors: Michael Y. Hu, Aaron Mueller, Candace Ross, Adina Williams, Tal Linzen, Chengxu Zhuang, Ryan Cotterell, Leshem Choshen, Alex Warstadt, Ethan Gotlieb Wilcox
Keywords (EN): computational language learners, community effort, effort to close, close the data-efficiency, data-efficiency gap
Subjects: Computation and Language (cs.CL)
Comments:


Abstract:The BabyLM Challenge is a community effort to close the data-efficiency gap between human and computational language learners. Participants compete to optimize language model training on a fixed language data budget of 100 million words or less. This year, we released improved text corpora, as well as a vision-and-language corpus to facilitate research into cognitively plausible vision language models. Submissions were compared on evaluation tasks targeting grammatical ability, (visual) question answering, pragmatic abilities, and grounding, among other abilities. Participants could submit to a 10M-word text-only track, a 100M-word text-only track, and/or a 100M-word and image multimodal track. From 31 submissions employing diverse methods, a hybrid causal-masked language model architecture outperformed other approaches. No submissions outperformed the baselines in the multimodal track. In follow-up analyses, we found a strong relationship between training FLOPs and average performance across tasks, and that the best-performing submissions proposed changes to the training data, training objective, and model architecture. This year’s BabyLM Challenge shows that there is still significant room for innovation in this setting, in particular for image-text modeling, but community-driven research can yield actionable insights about effective strategies for small-scale language modeling.

[NLP-13] Explingo: Explaining AI Predictions using Large Language Models

[Quick Read]: This paper addresses how to turn explanations of machine learning model predictions (generated by explainable AI techniques such as SHAP) into easily understood natural-language narratives. The key contribution is Explingo, a system with two LLM-based subsystems: a Narrator and a Grader. The Narrator transforms traditional ML explanations into natural-language descriptions, while the Grader scores the narratives on metrics including accuracy, completeness, fluency, and conciseness. Experiments show that LLMs can generate high-quality narratives, especially when guided by a small number of human-labeled and bootstrapped examples, though scoring narratives in complex domains remains challenging.

Link: https://arxiv.org/abs/2412.05145
Authors: Alexandra Zytek, Sara Pido, Sarah Alnegheimish, Laure Berti-Equille, Kalyan Veeramachaneni
Keywords (EN): model predictions generated, generated by Explainable, SHAP are essential, Large Language Models, outputs for decision-making
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: To be presented in the 2024 IEEE International Conference on Big Data (IEEE BigData)


Abstract:Explanations of machine learning (ML) model predictions generated by Explainable AI (XAI) techniques such as SHAP are essential for people using ML outputs for decision-making. We explore the potential of Large Language Models (LLMs) to transform these explanations into human-readable, narrative formats that align with natural communication. We address two key research questions: (1) Can LLMs reliably transform traditional explanations into high-quality narratives? and (2) How can we effectively evaluate the quality of narrative explanations? To answer these questions, we introduce Explingo, which consists of two LLM-based subsystems, a Narrator and Grader. The Narrator takes in ML explanations and transforms them into natural-language descriptions. The Grader scores these narratives on a set of metrics including accuracy, completeness, fluency, and conciseness. Our experiments demonstrate that LLMs can generate high-quality narratives that achieve high scores across all metrics, particularly when guided by a small number of human-labeled and bootstrapped examples. We also identified areas that remain challenging, in particular for effectively scoring narratives in complex domains. The findings from this work have been integrated into an open-source tool that makes narrative explanations available for further applications.
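The Narrator/Grader split can be illustrated with stubs. In Explingo both roles are played by LLMs; here a template and a name-matching completeness score stand in, and all feature names and values below are made up for illustration:

```python
def narrator(feature_contributions):
    """Stub Narrator: renders (feature, SHAP value) pairs as a sentence.
    In Explingo this role is played by an LLM; a template stands in here."""
    parts = [
        f"{name} {'raised' if value > 0 else 'lowered'} the prediction "
        f"by {abs(value):.2f}"
        for name, value in feature_contributions
    ]
    return "The model's output was driven by: " + "; ".join(parts) + "."

def grader(narrative, feature_contributions):
    """Stub Grader: completeness = fraction of features named in the
    narrative (the real Grader scores several metrics via an LLM)."""
    mentioned = sum(1 for name, _ in feature_contributions if name in narrative)
    return mentioned / len(feature_contributions)

# Hypothetical SHAP-style contributions for one prediction.
shap_values = [("income", 0.42), ("age", -0.17), ("tenure", 0.05)]
story = narrator(shap_values)
score = grader(story, shap_values)
print(score)  # 1.0: every feature is named in the narrative
```

Keeping generation and grading as separate components is the useful design point: the Grader can score narratives from any narrator, which is how the system can iterate toward higher-quality explanations.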

[NLP-14] A Practical Examination of AI-Generated Text Detectors for Large Language Models

[Quick Read]: This paper evaluates the effectiveness of existing machine-generated-content detectors (RADAR, Wild, T5Sentinel, Fast-DetectGPT, GPTID, LogRank, Binoculars) on domains, datasets, and models they have not previously encountered. The key approach is simulating adversarial attacks via various prompting strategies, exposing performance failures under specific conditions as measured by the true positive rate at a fixed false positive rate (TPR@FPR). The results show that even moderate adversarial effort can significantly evade detection, and that these detectors struggle to maintain high sensitivity at a reasonable true positive rate, with TPR@.01 as low as 0% in some settings.

Link: https://arxiv.org/abs/2412.05139
Authors: Brian Tufts, Xuandong Zhao, Lei Li
Keywords (EN): raised growing concerns, large language models, human authors, proliferation of large, raised growing
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 8 pages. Submitted to ARR October cycle

点击查看摘要

Abstract:The proliferation of large language models has raised growing concerns about their misuse, particularly in cases where AI-generated text is falsely attributed to human authors. Machine-generated content detectors claim to effectively identify such text under various conditions and from any language model. This paper critically evaluates these claims by assessing several popular detectors (RADAR, Wild, T5Sentinel, Fast-DetectGPT, GPTID, LogRank, Binoculars) on a range of domains, datasets, and models that these detectors have not previously encountered. We employ various prompting strategies to simulate adversarial attacks, demonstrating that even moderate efforts can significantly evade detection. We emphasize the importance of the true positive rate at a specific false positive rate (TPR@FPR) metric and demonstrate that these detectors perform poorly in certain settings, with TPR@.01 as low as 0%. Our findings suggest that both trained and zero-shot detectors struggle to maintain high sensitivity while achieving a reasonable true positive rate.
zh
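摘要中强调的 TPR@FPR 指标,可用下面的纯 Python 草图说明其计算方式(打分数据为假设示例,并非论文结果):

```python
def tpr_at_fpr(human_scores, ai_scores, target_fpr=0.01):
    """在给定假阳性率上限下,计算检测器可达到的真阳性率 (TPR@FPR)。
    human_scores: 人类文本的"AI 概率"打分(越高越像 AI 生成)
    ai_scores:    AI 生成文本的打分
    """
    n = len(human_scores)
    m = int(target_fpr * n)  # 允许的误报数上限
    # 阈值取第 (n-m) 小的人类打分:严格高于阈值的人类样本不超过 m 个
    threshold = sorted(human_scores)[n - 1 - m]
    tp = sum(1 for s in ai_scores if s > threshold)
    return tp / len(ai_scores)

# 假设的检测器打分(仅为演示,非论文数据)
human = [i / 100 for i in range(100)]        # 0.00 ~ 0.99
ai = [(50 + i) / 100 for i in range(100)]    # 0.50 ~ 1.49
print(tpr_at_fpr(human, ai, target_fpr=0.01))  # 0.51
```

当两类打分大量重叠时,这一指标会迅速趋近 0,这正是论文观察到 TPR@.01 低至 0% 的情形。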

[NLP-15] Unifying Dual-Space Embedding for Entity Alignment via Contrastive Learning COLING2025

【速读】: 该论文试图解决知识图谱(KGs)中实体对齐(Entity Alignment, EA)的问题,特别是在处理复杂结构(包括局部和层次结构)时,如何在单一空间中有效表示这些结构。解决方案的关键在于提出了一种名为UniEA的新方法,该方法通过在欧几里得空间和双曲空间中同时学习图结构嵌入(Graph Structure Embedding),以最大化两种空间中嵌入的一致性,从而保留KGs的内在结构。此外,论文还采用了对比学习(Contrastive Learning)来缓解相似实体导致的嵌入距离过近的问题,从而提升基于结构的实体对齐性能。实验结果表明,UniEA在基准数据集上达到了最先进的性能。

链接: https://arxiv.org/abs/2412.05028
作者: Cunda Wang,Weihua Wang,Qiuyu Liang,Feilong Bao,Guanglai Gao
关键词-EN: Entity alignment aims, Entity alignment, match identical entities, network-based entity alignment, entity alignment methods
类目: Computation and Language (cs.CL)
备注: Accepted by COLING2025

点击查看摘要

Abstract:Entity alignment aims to match identical entities across different knowledge graphs (KGs). Graph neural network-based entity alignment methods have achieved promising results in Euclidean space. However, KGs often contain complex structures, including both local and hierarchical ones, which make it challenging to efficiently represent them within a single space. In this paper, we proposed a novel method UniEA, which unifies dual-space embedding to preserve the intrinsic structure of KGs. Specifically, we learn graph structure embedding in both Euclidean and hyperbolic spaces simultaneously to maximize the consistency between the embedding in both spaces. Moreover, we employ contrastive learning to mitigate the misalignment issues caused by similar entities, where embedding of similar neighboring entities within the KG become too close in distance. Extensive experiments on benchmark datasets demonstrate that our method achieves state-of-the-art performance in structure-based EA. Our code is available at this https URL.
zh
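摘要中用对比学习推远相似邻居实体的思路,可以借一个极简的 InfoNCE 损失草图来理解(向量与数值均为演示用假设,并非 UniEA 的实际实现):

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def info_nce(anchor, positive, negatives, tau=0.1):
    """InfoNCE:拉近 anchor 与对齐实体 (positive),推远相似邻居 (negatives)。"""
    pos = math.exp(cosine(anchor, positive) / tau)
    neg = sum(math.exp(cosine(anchor, n) / tau) for n in negatives)
    return -math.log(pos / (pos + neg))

anchor = [1.0, 0.0]
positive = [0.9, 0.1]            # 跨 KG 的对齐实体
negatives = [[0.0, 1.0], [-1.0, 0.2]]  # 相似但不对齐的邻居
loss_good = info_nce(anchor, positive, negatives)
loss_bad = info_nce(anchor, [0.0, 1.0], [positive] + negatives[1:])
print(loss_good < loss_bad)  # True:正确对齐对的损失更小
```

最小化该损失即会把相似邻居的嵌入"推开",缓解摘要中所述嵌入距离过近的问题。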

[NLP-16] Steps are all you need: Rethinking STEM Education with Prompt Engineering

【速读】: 该论文试图解决在物理问题回答任务中,大型语言模型(LLMs)在数学能力不足和容易产生幻觉的问题。解决方案的关键在于采用混合专家模型(Mixture of Experts, MoE)和类比提示(analogical prompting)相结合的方法,以提高模型性能。此外,论文还提出了类比链式思维提示(Analogical CoT prompting),旨在使较小的开源模型能够利用类比提示,从而克服因缺乏专业训练数据而导致的性能瓶颈。

链接: https://arxiv.org/abs/2412.05023
作者: Krishnasai Addala,Kabir Dev Paul Baghel,Chhavi Kirtani,Avinash Anand,Rajiv Ratn Shah
关键词-EN: Question Answering Tasks, Physics Question Answering, Answering Tasks, Physics Question, Question Answering
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Few shot and Chain-of-Thought prompting have shown promise when applied to Physics Question Answering Tasks, but are limited by the lack of mathematical ability inherent to LLMs, and are prone to hallucination. By utilizing a Mixture of Experts (MoE) Model, along with analogical prompting, we are able to show improved model performance when compared to the baseline on standard LLMs. We also survey the limits of these prompting techniques and the effects they have on model performance. Additionally, we propose Analogical CoT prompting, a prompting technique designed to allow smaller, open source models to leverage Analogical prompting, something they have struggled with, possibly due to a lack of specialist training data.
zh

[NLP-17] PETapter: Leveraging PET-style classification heads for modular few-shot parameter-efficient fine-tuning

【速读】: 该论文试图解决数据稀缺和语言模型规模不断增长带来的挑战,特别是在专业科学领域中,研究人员可能缺乏专业知识和资源来微调高性能语言模型以适应细微任务的问题。解决方案的关键是提出了一种名为PETapter的新方法,该方法有效结合了参数高效微调 (PEFT) 方法与PET风格的分类头,以提升少样本学习能力,同时避免了全模型训练带来的显著计算开销。PETapter不仅在性能上与使用模式利用训练 (PET) 的全少样本微调相当,还提供了更高的参数效率、可靠性和模块化,便于训练模块的共享和复用,从而使更多研究人员能够利用高性能的自然语言处理 (NLP) 方法进行研究。

链接: https://arxiv.org/abs/2412.04975
作者: Jonas Rieger,Mattes Ruckdeschel,Gregor Wiedemann
关键词-EN: language model sizes, growing language model, crucial to overcome, overcome the challenges, challenges of data
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Few-shot learning and parameter-efficient fine-tuning (PEFT) are crucial to overcome the challenges of data scarcity and ever growing language model sizes. This applies in particular to specialized scientific domains, where researchers might lack expertise and resources to fine-tune high-performing language models to nuanced tasks. We propose PETapter, a novel method that effectively combines PEFT methods with PET-style classification heads to boost few-shot learning capabilities without the significant computational overhead typically associated with full model training. We validate our approach on three established NLP benchmark datasets and one real-world dataset from communication research. We show that PETapter not only achieves comparable performance to full few-shot fine-tuning using pattern-exploiting training (PET), but also provides greater reliability and higher parameter efficiency while enabling higher modularity and easy sharing of the trained modules, which enables more researchers to utilize high-performing NLP-methods in their research.
zh

[NLP-18] Gla-AI4BioMed at RRG24: Visual Instruction-tuned Adaptation for Radiology Report Generation ACL2024

【速读】: 该论文试图解决从胸部X光片生成放射学报告的问题。解决方案的关键在于将图像编码器与基于Vicuna-7B架构的微调大型语言模型(LLM)相结合,通过两阶段训练过程实现:首先将胸部X光片的特征与LLM对齐,然后进行放射学报告生成的微调。这种集成方法显著提升了模型理解和描述胸部X光片的能力,使其能够准确生成放射学报告的不同部分。

链接: https://arxiv.org/abs/2412.04954
作者: Xi Zhang,Zaiqiao Meng,Jake Lever,Edmond S. L. Ho
关键词-EN: radiology-focused visual language, chest X-ray images, chest X-ray, language model designed, visual language model
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: Accepted by BioNLP@ACL 2024

点击查看摘要

Abstract:We introduce a radiology-focused visual language model designed to generate radiology reports from chest X-rays. Building on previous findings that large language models (LLMs) can acquire multimodal capabilities when aligned with pretrained vision encoders, we demonstrate similar potential with chest X-ray images. This integration enhances the ability of model to understand and describe chest X-ray images. Our model combines an image encoder with a fine-tuned LLM based on the Vicuna-7B architecture, enabling it to generate different sections of a radiology report with notable accuracy. The training process involves a two-stage approach: (i) initial alignment of chest X-ray features with the LLM (ii) followed by fine-tuning for radiology report generation.
zh

[NLP-19] KaLM: Knowledge-aligned Autoregressive Language Modeling via Dual-view Knowledge Graph Contrastive Learning

【速读】: 该论文试图解决生成式大型语言模型 (LLMs) 在知识驱动任务(如事实知识查询)中的表现不佳问题。解决方案的关键在于提出了 KaLM(Knowledge-aligned Language Modeling)方法,通过联合目标优化显式知识对齐和隐式知识对齐来微调自回归 LLMs,使其与知识图谱 (KGs) 的知识对齐。显式知识对齐目标通过双视角知识图谱对比学习直接优化 LLMs 的知识表示,而隐式知识对齐目标则通过三元组完成语言建模将知识文本模式融入 LLMs。该方法显著提升了知识驱动任务的评估性能,特别是在基于嵌入的知识图谱补全和生成式知识图谱问答任务中。

链接: https://arxiv.org/abs/2412.04948
作者: Peng Yu,Cheng Deng,Beiya Dai,Xinbing Wang,Ying Wen
关键词-EN: large language models, knowledge, Autoregressive large language, knowledge alignment, token prediction
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Autoregressive large language models (LLMs) pre-trained by next token prediction are inherently proficient in generative tasks. However, their performance on knowledge-driven tasks such as factual knowledge querying remains unsatisfactory. Knowledge graphs (KGs), as high-quality structured knowledge bases, can provide reliable knowledge for LLMs, potentially compensating for their knowledge deficiencies. Aligning LLMs with explicit, structured knowledge from KGs has been a challenge; previous attempts either failed to effectively align knowledge representations or compromised the generative capabilities of LLMs, leading to less-than-optimal outcomes. This paper proposes KaLM, a Knowledge-aligned Language Modeling approach, which fine-tunes autoregressive LLMs to align with KG knowledge via the joint objective of explicit knowledge alignment and implicit knowledge alignment. The explicit knowledge alignment objective aims to directly optimize the knowledge representation of LLMs through dual-view knowledge graph contrastive learning. The implicit knowledge alignment objective focuses on incorporating textual patterns of knowledge into LLMs through triple completion language modeling. Notably, our method achieves a significant performance boost in evaluations of knowledge-driven tasks, specifically embedding-based knowledge graph completion and generation-based knowledge graph question answering.
zh
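其中"三元组补全语言建模"的大意是将 KG 三元组线性化为文本、让模型补全尾实体;下面是一个示意性的训练样本构造草图(提示模板与分隔符均为本文假设,并非论文原文):

```python
def triple_to_sample(head, relation, tail):
    """把 (头实体, 关系, 尾实体) 线性化为 prompt -> completion 训练对。"""
    prompt = f"{head} [SEP] {relation} [SEP]"
    return {"prompt": prompt, "completion": f" {tail}"}

sample = triple_to_sample("Paris", "capital_of", "France")
print(sample["prompt"])      # Paris [SEP] capital_of [SEP]
print(sample["completion"])  # " France"(模型需补全的尾实体)
```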

[NLP-20] C2LEVA: Toward Comprehensive and Contamination-Free Language Model Evaluation

【速读】: 该论文试图解决大语言模型(LLMs)评估中的数据污染问题,特别是由于无法访问专有训练数据而导致的评估不准确性。解决方案的关键在于提出了C^2LEVA,这是一个综合的双语基准测试,通过系统性的污染预防措施确保评估的可靠性。C^2LEVA不仅提供了涵盖22个任务的全面评估,还通过自动化的测试数据更新和严格的基准数据发布保护措施,确保了任务数据的无污染性,从而实现了对LLMs的信任评估。

链接: https://arxiv.org/abs/2412.04947
作者: Yanyang Li,Tin Long Wong,Cheung To Hung,Jianqiao Zhao,Duo Zheng,Ka Wai Liu,Michael R. Lyu,Liwei Wang
关键词-EN: shown significant promise, Recent advances, evaluation raises concerns, large language models, significant promise
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Recent advances in large language models (LLMs) have shown significant promise, yet their evaluation raises concerns, particularly regarding data contamination due to the lack of access to proprietary training data. To address this issue, we present C^2LEVA, a comprehensive bilingual benchmark featuring systematic contamination prevention. C^2LEVA firstly offers a holistic evaluation encompassing 22 tasks, each targeting a specific application or ability of LLMs, and secondly a trustworthy assessment due to our contamination-free tasks, ensured by a systematic contamination prevention strategy that fully automates test data renewal and enforces data protection during benchmark data release. Our large-scale evaluation of 15 open-source and proprietary models demonstrates the effectiveness of C^2LEVA.
zh

[NLP-21] A Federated Approach to Few-Shot Hate Speech Detection for Marginalized Communities

【速读】: 该论文试图解决边缘化群体在低资源语言环境中面临的在线仇恨言论问题。解决方案的关键在于:1) 发布REACT数据集,这是一个高质量、文化特定的仇恨言论检测数据集,涵盖七个不同目标群体和八种低资源语言;2) 提出利用联邦学习(Federated Learning, FL)进行少样本仇恨言论检测,通过隐私保护和协作学习方法持续改进中央模型,使其在处理不同目标群体和语言时表现出鲁棒性。通过将训练保持在用户设备本地,确保用户数据的隐私,同时利用联邦学习的效率。此外,个性化客户端模型以适应特定目标的训练数据,并评估其性能。研究结果表明联邦学习在不同目标群体中的有效性,而个性化在少样本学习中的优势尚不明确。

链接: https://arxiv.org/abs/2412.04942
作者: Haotian Ye,Axel Wisiorek,Antonis Maronikolakis,Özge Alaçam,Hinrich Schütze
关键词-EN: Global South, speech online remains, increasing internet penetration, Hate speech online, Hate speech
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Hate speech online remains an understudied issue for marginalized communities, and has seen rising relevance, especially in the Global South, which includes developing societies with increasing internet penetration. In this paper, we aim to provide marginalized communities living in societies where the dominant language is low-resource with a privacy-preserving tool to protect themselves from hate speech on the internet by filtering offensive content in their native languages. Our contribution in this paper is twofold: 1) we release REACT (REsponsive hate speech datasets Across ConTexts), a collection of high-quality, culture-specific hate speech detection datasets comprising seven distinct target groups in eight low-resource languages, curated by experienced data collectors; 2) we propose a solution to few-shot hate speech detection utilizing federated learning (FL), a privacy-preserving and collaborative learning approach, to continuously improve a central model that exhibits robustness when tackling different target groups and languages. By keeping the training local to the users’ devices, we ensure the privacy of the users’ data while benefitting from the efficiency of federated learning. Furthermore, we personalize client models to target-specific training data and evaluate their performance. Our results indicate the effectiveness of FL across different target groups, whereas the benefits of personalization on few-shot learning are not clear.
zh
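联邦学习"训练留在本地、只聚合模型参数"的核心,可以用经典的 FedAvg 加权平均草图说明(以 FedAvg 为例,忽略通信、安全与论文中的个性化细节,数值仅为演示):

```python
def fed_avg(client_weights, client_sizes):
    """按各客户端本地样本量加权平均模型参数 (FedAvg)。
    client_weights: 每个客户端上传的参数向量 (list[float])
    client_sizes:   每个客户端的本地样本数
    """
    total = sum(client_sizes)
    dim = len(client_weights[0])
    return [
        sum(w[j] * n for w, n in zip(client_weights, client_sizes)) / total
        for j in range(dim)
    ]

# 三个客户端本地训练出的模型参数(数值为演示)
weights = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]
sizes = [10, 10, 20]
print(fed_avg(weights, sizes))  # [1.25, 1.25]
```

原始评论数据始终不离开用户设备,服务器只见到参数,这正是摘要强调的隐私保护来源。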

[NLP-22] Who Speaks Next? Multi-party AI Discussion Leveraging the Systematics of Turn-taking in Murder Mystery Games

【速读】: 该论文试图解决多智能体系统中自然对话控制的挑战,特别是智能体之间的流畅对话和自主决策问题。解决方案的关键在于引入会话分析中的邻接对(adjacency pairs)和话轮转换(turn-taking)规范,并提出了一种名为“Murder Mystery Agents”的新框架。该框架通过基于邻接对的下一位发言者选择机制和考虑智能体内在状态的自选择机制,实现了更自然和策略性的对话。实验结果表明,这种机制显著减少了对话中断,并提升了智能体共享信息和进行逻辑推理的能力。

链接: https://arxiv.org/abs/2412.04937
作者: Ryota Nonomura,Hiroki Mori
关键词-EN: large language models, utilizing large language, shown great promise, Murder Mystery, systems utilizing large
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Multi-agent systems utilizing large language models (LLMs) have shown great promise in achieving natural dialogue. However, smooth dialogue control and autonomous decision making among agents still remain challenges. In this study, we focus on conversational norms such as adjacency pairs and turn-taking found in conversation analysis and propose a new framework called “Murder Mystery Agents” that applies these norms to AI agents’ dialogue control. As an evaluation target, we employed the “Murder Mystery” game, a reasoning-type table-top role-playing game that requires complex social reasoning and information manipulation. In this game, players need to unravel the truth of the case based on fragmentary information through cooperation and bargaining. The proposed framework integrates next speaker selection based on adjacency pairs and a self-selection mechanism that takes agents’ internal states into account to achieve more natural and strategic dialogue. To verify the effectiveness of this new approach, we analyzed utterances that led to dialogue breakdowns and conducted automatic evaluation using LLMs, as well as human evaluation using evaluation criteria developed for the Murder Mystery game. Experimental results showed that the implementation of the next speaker selection mechanism significantly reduced dialogue breakdowns and improved the ability of agents to share information and perform logical reasoning. The results of this study demonstrate that the systematics of turn-taking in human conversation are also effective in controlling dialogue among AI agents, and provide design guidelines for more advanced multi-agent dialogue systems.
zh
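基于邻接对的下一位发言者选择与自选机制,可以用一个极简的规则草图示意(字段名与"发言意愿"打分均为本文假设,并非论文实现):

```python
def select_next_speaker(last_utterance, agents):
    """邻接对优先:若上一句指定了受话人,由其接话;否则按内部状态自选。
    last_utterance: {"speaker": 发言者, "addressee": 受话人或 None}
    agents: {代理名: 发言意愿打分(模拟内部状态)}
    """
    addressee = last_utterance.get("addressee")
    if addressee in agents:
        return addressee  # 邻接对的第二部分(如 问-答、指控-回应)
    # 自选机制:排除刚发言者,选发言意愿最高的代理
    candidates = {a: s for a, s in agents.items() if a != last_utterance["speaker"]}
    return max(candidates, key=candidates.get)

agents = {"Alice": 0.2, "Bob": 0.7, "Carol": 0.5}
print(select_next_speaker({"speaker": "Alice", "addressee": "Carol"}, agents))  # Carol
print(select_next_speaker({"speaker": "Alice", "addressee": None}, agents))     # Bob
```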

[NLP-23] Probing the contents of semantic representations from text behavior and brain data using the psychNorms metabase

【速读】: 该论文试图解决的问题是如何系统地评估和比较基于文本、行为和脑数据的语义表示之间的相似性和差异性。解决方案的关键在于使用表示相似性分析(representational similarity analysis)和一种称为表示内容分析(representational content analysis)的解释性方法,结合psychNorms元数据库,来揭示行为和脑数据衍生的词向量在情感、代理性和社会道德维度上捕捉到的独特方差,从而确立行为数据作为捕捉人类表示和行为的重要补充。

链接: https://arxiv.org/abs/2412.04936
作者: Zak Hussain,Rui Mata,Ben R. Newell,Dirk U. Wulff
关键词-EN: natural language processing, artificial intelligence, integral to natural, Semantic representations, semantic representations derived
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 13 pages, 5 figures, 2 tables

点击查看摘要

Abstract:Semantic representations are integral to natural language processing, psycholinguistics, and artificial intelligence. Although often derived from internet text, recent years have seen a rise in the popularity of behavior-based (e.g., free associations) and brain-based (e.g., fMRI) representations, which promise improvements in our ability to measure and model human representations. We carry out the first systematic evaluation of the similarities and differences between semantic representations derived from text, behavior, and brain data. Using representational similarity analysis, we show that word vectors derived from behavior and brain data encode information that differs from their text-derived cousins. Furthermore, drawing on our psychNorms metabase, alongside an interpretability method that we call representational content analysis, we find that, in particular, behavior representations capture unique variance on certain affective, agentic, and socio-moral dimensions. We thus establish behavior as an important complement to text for capturing human representations and behavior. These results are broadly relevant to research aimed at learning human-aligned semantic representations, including work on evaluating and aligning large language models.
zh
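表示相似性分析 (representational similarity analysis) 的核心是比较两个表示空间各自的词间相似度矩阵;以下为最小草图(矩阵数值为假设示例,非论文数据):

```python
import math

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def rsa(sim_a, sim_b):
    """比较两个表示空间:取相似度矩阵上三角后求相关。"""
    n = len(sim_a)
    tri_a = [sim_a[i][j] for i in range(n) for j in range(i + 1, n)]
    tri_b = [sim_b[i][j] for i in range(n) for j in range(i + 1, n)]
    return pearson(tri_a, tri_b)

# 两个假设的 4x4 词间相似度矩阵(如文本向量 vs 脑数据向量)
text_sim = [[1, .9, .2, .1], [.9, 1, .3, .2], [.2, .3, 1, .8], [.1, .2, .8, 1]]
brain_sim = [[1, .8, .1, .0], [.8, 1, .2, .1], [.1, .2, 1, .9], [.0, .1, .9, 1]]
print(round(rsa(text_sim, brain_sim), 3))
```

相关值越低,说明两种数据来源(文本、行为、脑)编码的信息差异越大,这正是论文比较三者的出发点。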

[NLP-24] Large Language Models for Ingredient Substitution in Food Recipes using Supervised Fine-tuning and Direct Preference Optimization

【速读】: 该论文试图解决食谱个性化中的食材替代问题,关键解决方案是利用大型语言模型 (Large Language Models, LLMs) 构建一个食材替代系统,该系统能够在给定的食谱上下文中预测合理的替代食材。论文通过广泛的实验确定了最佳的LLM、提示词和微调设置,并采用了多任务学习、两阶段微调和直接偏好优化 (Direct Preference Optimization, DPO) 等方法。实验结果表明,经过微调和DPO的Mistral7-Base LLM在Recipe1MSub语料库上表现最佳,Hit@1得分达到22.04,显著优于现有强基线模型。这一研究成果标志着利用LLM进行食材替代以实现个性化和创意烹饪体验的重要进展。

链接: https://arxiv.org/abs/2412.04922
作者: Thevin Senath,Kumuthu Athukorala,Ransika Costa,Surangika Ranathunga,Rishemjit Kaur
关键词-EN: Large Language Models, ingredient substitution, address the challenge, ingredient substitution system, Direct Preference Optimization
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:In this paper, we address the challenge of recipe personalization through ingredient substitution. We make use of Large Language Models (LLMs) to build an ingredient substitution system designed to predict plausible substitute ingredients within a given recipe context. Given that the use of LLMs for this task has been barely done, we carry out an extensive set of experiments to determine the best LLM, prompt, and the fine-tuning setups. We further experiment with methods such as multi-task learning, two-stage fine-tuning, and Direct Preference Optimization (DPO). The experiments are conducted using the publicly available Recipe1MSub corpus. The best results are produced by the Mistral7-Base LLM after fine-tuning and DPO. This result outperforms the strong baseline available for the same corpus with a Hit@1 score of 22.04. Thus we believe that this research represents a significant step towards enabling personalized and creative culinary experiences by utilizing LLM-based ingredient substitution.
zh
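评估所用的 Hit@1 指标可以这样计算(查询与食材列表为假设示例,并非 Recipe1MSub 数据):

```python
def hit_at_k(ranked_predictions, gold, k=1):
    """Hit@k:预测列表前 k 项中包含正确替代食材的查询比例。"""
    hits = sum(1 for preds, g in zip(ranked_predictions, gold) if g in preds[:k])
    return hits / len(gold)

# 假设的 4 个食材替代查询(数据仅为演示)
preds = [
    ["margarine", "oil", "ghee"],       # gold: butter -> 未命中
    ["honey", "agave", "maple syrup"],  # gold: honey  -> Hit@1
    ["tofu", "tempeh", "seitan"],       # gold: tempeh -> 仅 Hit@2 起命中
    ["yogurt", "sour cream", "kefir"],  # gold: yogurt -> Hit@1
]
gold = ["butter", "honey", "tempeh", "yogurt"]
print(hit_at_k(preds, gold, k=1))  # 0.5
print(hit_at_k(preds, gold, k=3))  # 0.75
```

论文报告的 Hit@1 为 22.04,即约 22% 的查询中排名第一的预测恰为标注的替代食材。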

[NLP-25] DEMO: Reframing Dialogue Interaction with Fine-grained Element Modeling

【速读】: 该论文试图解决现有大型语言模型(LLMs)在对话生成中缺乏全面对话元素建模和系统评估基准的问题。解决方案的关键在于提出了一个名为DEMO的新基准,该基准涵盖了对话元素建模的两个核心任务:元素感知(Element Awareness)和对话代理交互(Dialogue Agent Interaction)。通过模仿学习的方法,论文构建了一个能够基于DEMO基准有效建模对话元素的代理,实验结果表明该代理在领域内和跨领域任务中均表现出优越的性能,从而为LLMs的进一步优化提供了有力支持。

链接: https://arxiv.org/abs/2412.04905
作者: Minzheng Wang,Xinghua Zhang,Kun Chen,Nan Xu,Haiyang Yu,Fei Huang,Wenji Mao,Yongbin Li
关键词-EN: Large language models, Large language, central modes, modes of human-machine, accumulation of vast
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: We release the code and data at this https URL

点击查看摘要

Abstract:Large language models (LLMs) have made dialogue one of the central modes of human-machine interaction, leading to the accumulation of vast amounts of conversation logs and increasing demand for dialogue generation. A conversational life-cycle spans from the Prelude through the Interlocution to the Epilogue, encompassing various elements. Despite the existence of numerous dialogue-related studies, there is a lack of benchmarks that encompass comprehensive dialogue elements, hindering precise modeling and systematic evaluation. To bridge this gap, we introduce an innovative research task Dialogue Element MOdeling, including Element Awareness and Dialogue Agent Interaction, and propose a novel benchmark, DEMO, designed for a comprehensive dialogue modeling and assessment. Inspired by imitation learning, we further build the agent which possesses the adept ability to model dialogue elements based on the DEMO benchmark. Extensive experiments indicate that existing LLMs still exhibit considerable potential for enhancement, and our DEMO agent has superior performance in both in-domain and out-of-domain tasks.
zh

[NLP-26] EACO: Enhancing Alignment in Multimodal LLM s via Critical Observation

【速读】: 该论文试图解决多模态大语言模型(MLLMs)在视觉问答和推理任务中存在的幻觉问题,并提升其推理能力。解决方案的关键在于提出了一种名为“通过关键观察增强对齐”(Enhancing Alignment in MLLMs via Critical Observation, EACO)的方法。EACO通过自生成偏好数据,仅使用5000张图像进行经济高效的训练,显著减少了模型幻觉,并提升了推理能力。其核心步骤包括收集和精炼评分评估指令微调数据集,训练一个称为“Critic”的关键评估模型,该模型在多个维度上观察模型响应,选择偏好和非偏好输出以进行精炼的直接偏好优化(DPO)微调。此外,EACO在偏好微调后还采用了额外的监督微调阶段,以进一步增强模型性能。实验结果表明,EACO在HallusionBench上减少了65.6%的幻觉,并在MME-Cognition上提升了21.8%的推理能力,相较于LLaVA-v1.6-Mistral-7B在多个基准测试中提升了8.5%的性能。

链接: https://arxiv.org/abs/2412.04903
作者: Yongxin Wang,Meng Cao,Haokun Lin,Mingfei Han,Liang Ma,Jin Jiang,Yuhao Cheng,Xiaodan Liang
关键词-EN: Multimodal large language, achieved remarkable progress, visual question answering, tasks leveraging instruction, Multimodal large
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: 19 pages

点击查看摘要

Abstract:Multimodal large language models (MLLMs) have achieved remarkable progress on various visual question answering and reasoning tasks leveraging instruction fine-tuning specific datasets. They can also learn from preference data annotated by human to enhance their reasoning ability and mitigate hallucinations. Most of preference data is generated from the model itself. However, existing methods require high-quality critical labels, which are costly and rely on human or proprietary models like GPT-4V. In this work, we propose Enhancing Alignment in MLLMs via Critical Observation (EACO), which aligns MLLMs by self-generated preference data using only 5k images economically. Our approach begins with collecting and refining a Scoring Evaluation Instruction-tuning dataset to train a critical evaluation model, termed the Critic. This Critic observes model responses across multiple dimensions, selecting preferred and non-preferred outputs for refined Direct Preference Optimization (DPO) tuning. To further enhance model performance, we employ an additional supervised fine-tuning stage after preference tuning. EACO reduces the overall hallucinations by 65.6% on HallusionBench and improves the reasoning ability by 21.8% on MME-Cognition. EACO achieves an 8.5% improvement over LLaVA-v1.6-Mistral-7B across multiple benchmarks. Remarkably, EACO also shows the potential critical ability in open-source MLLMs, demonstrating that EACO is a viable path to boost the competence of MLLMs.
zh
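其中 DPO 损失的单样本标量形式可以如下示意(β 与各对数概率数值为演示用假设):

```python
import math

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Direct Preference Optimization 损失(单样本、标量形式)。
    logp_w / logp_l: 当前策略对"偏好 / 非偏好"回答的对数概率
    ref_logp_*:      冻结参考模型的对应对数概率
    """
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return -math.log(1 / (1 + math.exp(-margin)))  # -log(sigmoid(margin))

# 策略相对参考模型更偏向"偏好"回答时,损失更小
aligned = dpo_loss(logp_w=-5.0, logp_l=-9.0, ref_logp_w=-7.0, ref_logp_l=-7.0)
misaligned = dpo_loss(logp_w=-9.0, logp_l=-5.0, ref_logp_w=-7.0, ref_logp_l=-7.0)
print(aligned < misaligned)  # True
```

EACO 中的 Critic 负责挑选每对回答中的偏好项与非偏好项,再以上述形式的目标做微调。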

[NLP-27] Building a Family of Data Augmentation Models for Low-cost LLM Fine-tuning on the Cloud COLING2025

【速读】: 该论文试图解决领域特定任务中数据集构建和标注成本高昂的问题。解决方案的关键在于开发了一系列数据增强模型,这些模型基于足够小的语言模型(LLMs)进行训练,能够在低推理成本下实现指令扩展、指令精炼和指令-响应对扩展等关键功能。通过构建自动数据收集系统,利用强大的LLMs对种子数据集进行扩展、精炼和重写,并结合质量评估技术,论文提出了一种高效的数据集准备方法。此外,论文还介绍了如何从教师LLMs中提炼任务解决和文本合成能力,并将其集成到机器学习平台中,以支持低成本的LLM微调,从数据集准备和训练两个角度提升效率。实验和应用研究证明了该方法的有效性。

链接: https://arxiv.org/abs/2412.04871
作者: Yuanhao Yue,Chengyu Wang,Jun Huang,Peng Wang
关键词-EN: achieving high performance, Specializing LLMs, high performance, domain-specific tasks, tasks has emerged
类目: Computation and Language (cs.CL)
备注: coling 2025 industry track

点击查看摘要

Abstract:Specializing LLMs in various domain-specific tasks has emerged as a critical step towards achieving high performance. However, the construction and annotation of datasets in specific domains are always very costly. Apart from using superior and expensive closed-source LLM APIs to construct datasets, some open-source models have become strong enough to handle dataset construction in many scenarios. Thus, we present a family of data augmentation models designed to significantly improve the efficiency for model fine-tuning. These models, trained based on sufficiently small LLMs, support key functionalities with low inference costs: instruction expansion, instruction refinement, and instruction-response pair expansion. To fulfill this goal, we first construct an automatic data collection system with seed datasets generated from both public repositories and our in-house datasets. This system leverages powerful LLMs to expand, refine and re-write the instructions and responses, incorporating quality assessment techniques. Following this, we introduce the training process of our models, which effectively distills task-solving and text synthesis abilities from teacher LLMs. Finally, we demonstrate how we integrate these functionalities into a machine learning platform to support low-cost LLM fine-tuning from both dataset preparation and training perspectives for users. Experiments and an application study prove the effectiveness of our approach.
zh

[NLP-28] EXAONE 3.5: Series of Large Language Models for Real-world Use Cases

【速读】: 该论文介绍了LG AI Research开发的EXAONE 3.5指令调优语言模型,旨在解决在实际场景中高效遵循指令、理解长上下文以及在通用基准测试中与最先进模型竞争的问题。解决方案的关键在于模型的三个主要能力:1) 在七个基准测试中表现出色的指令遵循能力;2) 在四个基准测试中表现优异的长上下文理解能力;3) 在九个通用基准测试中与同规模的最新开源模型相比具有竞争力的结果。

链接: https://arxiv.org/abs/2412.04862
作者: LG AI Research,Soyoung An,Kyunghoon Bae,Eunbi Choi,Kibong Choi,Stanley Jungkyu Choi,Seokhee Hong,Junwon Hwang,Hyojin Jeon,Gerrard Jeongwon Jo,Hyunjik Jo,Jiyeon Jung,Yountae Jung,Hyosang Kim,Joonkee Kim,Seonghwan Kim,Soyeon Kim,Sunkyoung Kim,Yireun Kim,Yongil Kim,Youchul Kim,Edward Hwayoung Lee,Haeju Lee,Honglak Lee,Jinsik Lee,Kyungmin Lee,Woohyung Lim,Sangha Park,Sooyoun Park,Yongmin Park,Sihoon Yang,Heuiyeen Yeen,Hyeongu Yun
关键词-EN: technical report introduces, introduces the EXAONE, instruction-tuned language models, developed and released, EXAONE
类目: Computation and Language (cs.CL)
备注: arXiv admin note: text overlap with arXiv:2408.03541

点击查看摘要

Abstract:This technical report introduces the EXAONE 3.5 instruction-tuned language models, developed and released by LG AI Research. The EXAONE 3.5 language models are offered in three configurations: 32B, 7.8B, and 2.4B. These models feature several standout capabilities: 1) exceptional instruction following capabilities in real-world scenarios, achieving the highest scores across seven benchmarks, 2) outstanding long-context comprehension, attaining the top performance in four benchmarks, and 3) competitive results compared to state-of-the-art open models of similar sizes across nine general benchmarks. The EXAONE 3.5 language models are open to anyone for research purposes and can be downloaded from this https URL. For commercial use, please reach out to the official contact point of LG AI Research: contact_us@lgresearch.ai.
zh

[NLP-29] Breaking Event Rumor Detection via Stance-Separated Multi-Agent Debate

【速读】: 该论文试图解决社交媒体上突发事件中谣言快速传播的问题,特别是由于缺乏标注资源而难以直接检测未被报道的突发事件。解决方案的关键在于提出了立场分离的多智能体辩论模型 (Stance Separated Multi-Agent Debate, S2MAD)。该模型首先通过立场分离将评论分为支持或反对原始声明,然后将声明分类为主观或客观,使智能体能够根据不同类型的声明生成合理的初始观点。随后,通过多轮辩论达成共识,若无法达成共识,则由评判智能体评估意见并给出最终判断。实验结果表明,该模型在性能上优于现有最先进的方法,并有效提升了大语言模型 (LLMs) 在突发事件谣言检测中的表现。

链接: https://arxiv.org/abs/2412.04859
作者: Mingqing Zhang,Haisong Gong,Qiang Liu,Shu Wu,Liang Wang
关键词-EN: social media platforms, events severely hinders, breaking events severely, breaking event rumor, rapid spread
类目: Computation and Language (cs.CL); Multiagent Systems (cs.MA)
备注:

点击查看摘要

Abstract:The rapid spread of rumors on social media platforms during breaking events severely hinders the dissemination of the truth. Previous studies reveal that the lack of annotated resources hinders the direct detection of unforeseen breaking events not covered in yesterday’s news. Leveraging large language models (LLMs) for rumor detection holds significant promise. However, it is challenging for LLMs to provide comprehensive responses to complex or controversial issues due to limited diversity. In this work, we propose the Stance Separated Multi-Agent Debate (S2MAD) to address this issue. Specifically, we firstly introduce Stance Separation, categorizing comments as either supporting or opposing the original claim. Subsequently, claims are classified as subjective or objective, enabling agents to generate reasonable initial viewpoints with different prompt strategies for each type of claim. Debaters then follow specific instructions through multiple rounds of debate to reach a consensus. If a consensus is not reached, a judge agent evaluates the opinions and delivers a final verdict on the claim’s veracity. Extensive experiments conducted on two real-world datasets demonstrate that our proposed model outperforms state-of-the-art methods in terms of performance and effectively improves the performance of LLMs in breaking event rumor detection.
zh
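"多轮辩论直至共识、否则交由评判代理"的控制流可以如下示意(函数与投票数据均为演示用假设,并非论文实现):

```python
def reach_verdict(debate_rounds, judge):
    """逐轮检查辩手结论:某轮全体一致即达成共识,否则由评判代理裁决。"""
    for votes in debate_rounds:
        if len(set(votes)) == 1:  # 本轮全体一致
            return votes[0], "consensus"
    return judge(debate_rounds[-1]), "judge"

def majority_judge(votes):
    """评判代理的一个极简替身:取多数意见。"""
    return max(set(votes), key=votes.count)

rounds = [
    ["rumor", "not rumor", "rumor"],  # 第 1 轮未达成共识
    ["rumor", "rumor", "rumor"],      # 第 2 轮达成共识
]
print(reach_verdict(rounds, majority_judge))  # ('rumor', 'consensus')
```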

[NLP-30] Adaptive Dropout for Pruning Conformers

【速读】: 该论文试图解决在语音识别任务中模型参数过多导致计算资源消耗大的问题,同时保持或提升模型性能。解决方案的关键在于提出了一种基于自适应dropout层(adaptive dropout layers)的联合训练与剪枝方法,通过估计每个单元的保留概率(unit-wise retention probabilities)来确定可剪枝的单元。具体来说,该方法利用反向传播(back-propagation)和Gumbel-Softmax技术来估计保留概率,并在Conformer模型的三个关键位置(即前馈网络组件的隐藏层、自注意力组件的查询向量和值向量、以及LConv组件的输入向量)引入自适应dropout层。实验结果表明,该方法在减少54%参数的同时,将词错误率(word error rates)降低了约1%。

链接: https://arxiv.org/abs/2412.04836
作者: Yotaro Kubo,Xingyu Cai,Michiel Bacchiani
关键词-EN: effectively perform joint, unit-wise retention probabilities, adaptive dropout layers, dropout layers, retention probability
类目: Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)
备注:

点击查看摘要

Abstract:This paper proposes a method to effectively perform joint training-and-pruning based on adaptive dropout layers with unit-wise retention probabilities. The proposed method is based on the estimation of a unit-wise retention probability in a dropout layer. A unit that is estimated to have a small retention probability can be considered to be prunable. The retention probability of the unit is estimated using back-propagation and the Gumbel-Softmax technique. This pruning method is applied at several application points in Conformers such that the effective number of parameters can be significantly reduced. Specifically, adaptive dropout layers are introduced in three locations in each Conformer block: (a) the hidden layer of the feed-forward-net component, (b) the query vectors and the value vectors of the self-attention component, and (c) the input vectors of the LConv component. The proposed method is evaluated by conducting a speech recognition experiment on the LibriSpeech task. It was shown that this approach could simultaneously achieve a parameter reduction and accuracy improvement. The word error rates improved by approximately 1% while reducing the number of parameters by 54%.
zh
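基于 Gumbel-Softmax 估计保留概率的思路,可用下面的二元软门控草图说明(仅演示前向采样,省略反向传播与 Conformer 细节,并非论文实现):

```python
import math
import random

def gumbel_softmax_gate(keep_logit, rng, temperature=1.0):
    """对"保留/剪枝"二选一做 Gumbel-Softmax 松弛,返回 (0,1) 内的软保留门值。"""
    logits = [keep_logit, 0.0]  # [保留, 剪枝] 的对数几率
    gumbels = [-math.log(-math.log(rng.random())) for _ in logits]
    scaled = [(l + g) / temperature for l, g in zip(logits, gumbels)]
    z = sum(math.exp(s) for s in scaled)
    return math.exp(scaled[0]) / z  # "保留"那一路的软权重

rng = random.Random(0)
high = [gumbel_softmax_gate(3.0, rng) for _ in range(200)]   # 保留倾向强的单元
low = [gumbel_softmax_gate(-3.0, rng) for _ in range(200)]   # 保留倾向弱、可剪枝的单元
avg_high = sum(high) / len(high)
avg_low = sum(low) / len(low)
print(round(avg_high, 2), round(avg_low, 2))
```

门值可微,因而 keep_logit 可由反向传播学习;训练后门值长期接近 0 的单元即视为可剪枝。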

[NLP-31] Rethinking Time Series Forecasting with LLM s via Nearest Neighbor Contrastive Learning

【速读】: 该论文试图解决在大语言模型(LLMs)应用于时间序列预测时,如何有效利用LLMs的词标记嵌入(word token embeddings)以及如何设计与时间序列数据相匹配的提示(prompt)的问题。解决方案的关键在于提出了一种名为NNCL-TLLM的方法,通过最近邻对比学习(Nearest Neighbor Contrastive Learning)来生成与时间序列兼容的文本原型(text prototypes),并在端到端微调过程中结合时间序列特征。该方法不仅利用了LLMs的词标记嵌入,还通过微调层归一化和位置嵌入(layer normalization and positional embeddings)来优化模型,同时保持其他层不变,从而减少了可训练参数并降低了计算成本。实验结果表明,NNCL-TLLM在少样本预测、长期和短期预测任务中均表现出色,优于或与现有最先进的方法相当。

链接: https://arxiv.org/abs/2412.04806
作者: Jayanie Bogahawatte,Sachith Seneviratne,Maneesha Perera,Saman Halgamuge
关键词-EN: Large Language Models, Adapting Large Language, Language Models, Large Language, received considerable attention
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Adapting Large Language Models (LLMs) that are extensively trained on abundant text data, and customizing the input prompt to enable time series forecasting has received considerable attention. While recent work has shown great potential for adapting the learned prior of LLMs, the formulation of the prompt to finetune LLMs remains challenging as prompt should be aligned with time series data. Additionally, current approaches do not effectively leverage word token embeddings which embody the rich representation space learned by LLMs. This emphasizes the need for a robust approach to formulate the prompt which utilizes the word token embeddings while effectively representing the characteristics of the time series. To address these challenges, we propose NNCL-TLLM: Nearest Neighbor Contrastive Learning for Time series forecasting via LLMs. First, we generate time series compatible text prototypes such that each text prototype represents both word token embeddings in its neighborhood and time series characteristics via end-to-end finetuning. Next, we draw inspiration from Nearest Neighbor Contrastive Learning to formulate the prompt while obtaining the top- k nearest neighbor time series compatible text prototypes. We then fine-tune the layer normalization and positional embeddings of the LLM, keeping the other layers intact, reducing the trainable parameters and decreasing the computational cost. Our comprehensive experiments demonstrate that NNCL-TLLM outperforms in few-shot forecasting while achieving competitive or superior performance over the state-of-the-art methods in long-term and short-term forecasting tasks.
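其中"检索 top-k 最近邻文本原型"这一步可以用余弦相似度简单示意;原型向量与时间序列嵌入均为人为构造的假设数据,仅演示检索逻辑:

```python
import numpy as np

def top_k_prototypes(series_embedding, prototypes, k=2):
    """Return indices of the k text prototypes nearest (by cosine similarity)
    to the time-series embedding, as used to formulate the prompt."""
    a = series_embedding / np.linalg.norm(series_embedding)
    b = prototypes / np.linalg.norm(prototypes, axis=1, keepdims=True)
    sims = b @ a
    return np.argsort(-sims)[:k]

protos = np.array([[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]])  # hypothetical prototypes
emb = np.array([0.9, 0.1])                                # hypothetical series embedding
idx = top_k_prototypes(emb, protos, k=2)
print(idx.tolist())  # indices of the two nearest prototypes
```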
zh

[NLP-32] Direct Quantized Training of Language Models with Stochastic Rounding

【速读】: 该论文试图解决量化大型语言模型(LLMs)在训练过程中内存占用过高的问题。解决方案的关键在于直接更新量化后的低精度权重矩阵,而不依赖于传统的直通估计器(straight-through estimator)。具体来说,论文采用随机舍入技术(stochastic rounding technique)来最小化低比特权重在训练过程中造成的信息损失。实验结果表明,即使在三值权重约束下,仅使用低精度权重进行训练也是可行的,并且将比特宽度扩展到8位时,性能仅下降5%,同时有望减少训练过程中的内存使用。此外,模型在推理阶段也能使用三值权重,展示了其在部署中的灵活性。

链接: https://arxiv.org/abs/2412.04787
作者: Kaiyan Zhao,Tsuguchika Tabaru,Kenichi Kobayashi,Takumi Honda,Masafumi Yamazaki,Yoshimasa Tsuruoka
关键词-EN: quantized Large Language, Large Language Models, Large Language, recent quantized Large, substantial memory footprints
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注: work in progress

点击查看摘要

Abstract:Although recent quantized Large Language Models (LLMs), such as BitNet, have paved the way for significant reduction in memory usage during deployment with binary or ternary weights, training these models still demands substantial memory footprints. This is partly because high-precision (i.e., unquantized) weight matrices required for straight-through estimation must be maintained throughout the whole training process. To address this, we explore the potential of directly updating the quantized low-precision weight matrices without relying on the straight-through estimator during backpropagation, thereby saving memory usage during training. Specifically, we employ a stochastic rounding technique to minimize information loss caused by the use of low-bit weights throughout training. Experimental results on our LLaMA-structured models indicate that (1) training with only low-precision weights is feasible even when they are constrained to ternary values, (2) extending the bit width to 8 bits results in only a 5% loss degradation compared to BitNet b1.58 while offering the potential for reduced memory usage during training, and (3) our models can also perform inference using ternary weights, showcasing their flexibility in deployment.
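随机舍入(stochastic rounding)的无偏性是该方法减小低比特训练信息损失的关键,可用如下 NumPy 小例子示意;三值量化的具体写法为简化假设,并非论文实现:

```python
import numpy as np

rng = np.random.default_rng(42)

def stochastic_round(x, grid=0.5):
    """Round each value to a neighbouring grid point with probability
    proportional to proximity, so the rounding is unbiased in expectation."""
    scaled = x / grid
    lower = np.floor(scaled)
    frac = scaled - lower
    return (lower + (rng.uniform(size=x.shape) < frac)) * grid

def quantize_ternary(w):
    """Map weights to {-1, 0, +1} via stochastic rounding with grid step 1.0."""
    return np.clip(stochastic_round(w, grid=1.0), -1, 1)

w = np.array([0.3, -0.7, 0.9, 0.0])
q = quantize_ternary(w)
print(q)  # every entry lands in {-1., 0., 1.}

# Unbiasedness: averaging many independent roundings recovers the value.
avg = np.mean([stochastic_round(np.array([0.3])) for _ in range(20000)])
print(avg)  # close to 0.3
```

与确定性四舍五入相比,随机舍入在大量更新的累积中保留了小梯度的影响,这正是低精度权重可以直接参与训练的原因。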
zh

[NLP-33] NLP-ADBench: NLP Anomaly Detection Benchmark

【速读】: 该论文试图解决自然语言处理(NLP)中的异常检测(Anomaly Detection, AD)问题,特别是在文本数据中检测有害内容、钓鱼尝试或垃圾评论等应用场景。解决方案的关键在于引入了NLP-ADBench,这是一个全面的NLP异常检测基准,包含八个精选数据集和十九种最先进的算法评估。这些算法包括三种端到端方法和十六种两步算法,后者利用bert-base-uncased和OpenAI的text-embedding-3-large模型生成的语言嵌入来应用传统的异常检测技术。研究结果表明,没有单一模型能在所有数据集上表现优异,这突显了自动化模型选择的重要性。此外,基于Transformer的嵌入的两步方法在性能上持续优于专门的端到端方法,且OpenAI的嵌入在性能上优于BERT嵌入。通过发布NLP-ADBench,论文提供了一个标准化的评估框架,促进了创新方法的发展,填补了该领域的关键空白,并为提升基于网络系统的安全性和可靠性奠定了基础。

链接: https://arxiv.org/abs/2412.04784
作者: Yuangang Li,Jiaqi Li,Zhuo Xiao,Tiankai Yang,Yi Nian,Xiyang Hu,Yue Zhao
关键词-EN: user behavior analysis, including fraud detection, NLP anomaly detection, Anomaly detection, NLP anomaly
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: The project is available at this https URL

点击查看摘要

Abstract:Anomaly detection (AD) is a critical machine learning task with diverse applications in web systems, including fraud detection, content moderation, and user behavior analysis. Despite its significance, AD in natural language processing (NLP) remains underexplored, limiting advancements in detecting anomalies in text data such as harmful content, phishing attempts, or spam reviews. In this paper, we introduce NLP-ADBench, the most comprehensive benchmark for NLP anomaly detection (NLP-AD), comprising eight curated datasets and evaluations of nineteen state-of-the-art algorithms. These include three end-to-end methods and sixteen two-step algorithms that apply traditional anomaly detection techniques to language embeddings generated by bert-base-uncased and OpenAI’s text-embedding-3-large models. Our results reveal critical insights and future directions for NLP-AD. Notably, no single model excels across all datasets, highlighting the need for automated model selection. Moreover, two-step methods leveraging transformer-based embeddings consistently outperform specialized end-to-end approaches, with OpenAI embeddings demonstrating superior performance over BERT embeddings. By releasing NLP-ADBench at this https URL, we provide a standardized framework for evaluating NLP-AD methods, fostering the development of innovative approaches. This work fills a crucial gap in the field and establishes a foundation for advancing NLP anomaly detection, particularly in the context of improving the safety and reliability of web-based systems.
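文中"两步法"(先取语言嵌入,再套用传统异常检测)的流程可以用一个 k 近邻距离打分的小示例说明;嵌入向量为人为构造的假设数据,实际应来自 bert-base-uncased 或 text-embedding-3-large 等编码器:

```python
import numpy as np

def knn_anomaly_scores(embeddings, k=2):
    """Two-step NLP-AD sketch: given text embeddings (step 1, from any encoder),
    score each point by its mean distance to its k nearest neighbours (step 2)."""
    d = np.linalg.norm(embeddings[:, None, :] - embeddings[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)           # ignore self-distance
    nearest = np.sort(d, axis=1)[:, :k]
    return nearest.mean(axis=1)

# Hypothetical embeddings: a tight cluster plus one outlier text.
emb = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [5.0, 5.0]])
scores = knn_anomaly_scores(emb)
print(scores.argmax())  # index of the most anomalous text
```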
zh

[NLP-34] Foundation Models for Low-Resource Language Education (Vision Paper)

【速读】: 该论文试图解决低资源语言在自然语言处理(Natural Language Processing, NLP)和教育领域面临的挑战。解决方案的关键在于利用大型语言模型(Large Language Models, LLMs)的多语言能力,通过社区驱动学习和数字平台等创新方法,提升这些语言的教育质量和资源可用性。论文强调了LLMs在增强低资源语言教育中的实际应用和潜在益处。

链接: https://arxiv.org/abs/2412.04774
作者: Zhaojun Ding,Zhengliang Liu,Hanqi Jiang,Yizhu Gao,Xiaoming Zhai,Tianming Liu,Ninghao Liu
关键词-EN: Recent studies show, Recent studies, bringing advances, computational linguistics, studies show
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Recent studies show that large language models (LLMs) are powerful tools for working with natural language, bringing advances in many areas of computational linguistics. However, these models face challenges when applied to low-resource languages due to limited training data and difficulty in understanding cultural nuances. Research is now focusing on multilingual models to improve LLM performance for these languages. Education in these languages also struggles with a lack of resources and qualified teachers, particularly in underdeveloped regions. Here, LLMs can be transformative, supporting innovative methods like community-driven learning and digital platforms. This paper discusses how LLMs could enhance education for low-resource languages, emphasizing practical applications and benefits.
zh

[NLP-35] Ltri-LLM : Streaming Long Context Inference for LLM s with Training-Free Dynamic Triangular Attention Pattern

【速读】: 该论文试图解决当前大型语言模型(Large Language Models, LLMs)中注意力机制(Attention Mechanism)的二次计算复杂度问题,这一问题使得长上下文推理变得极其昂贵。解决方案的关键在于利用LLMs注意力分布的强局部相关性,提出了一种称为Ltri-LLM的框架。该框架通过将键值对(Key-Value, KV)划分为跨度(spans),并将它们存储在离线索引中,根据不同的查询动态检索相关KV到内存中,从而实现高效的流式推理。实验结果表明,Ltri-LLM能够在保持接近全注意力(Full Attention, FA)性能的同时,实现对几乎无限长文本的处理。

链接: https://arxiv.org/abs/2412.04757
作者: Hongyin Tang,Di Xiu,Lanrui Wang,Xiurui Geng,Jingang Wang,Xunliang Cai
关键词-EN: Large Language Models, current Large Language, Language Models, Large Language, quadratic computational complexity
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:The quadratic computational complexity of the attention mechanism in current Large Language Models (LLMs) renders inference with long contexts prohibitively expensive. To address this challenge, various approaches aim to retain critical portions of the context to optimally approximate Full Attention (FA) through Key-Value (KV) compression or Sparse Attention (SA), enabling the processing of virtually unlimited text lengths in a streaming manner. However, these methods struggle to achieve performance levels comparable to FA, particularly in retrieval tasks. In this paper, our analysis of attention head patterns reveals that LLMs’ attention distributions show strong local correlations, naturally reflecting a chunking mechanism for input context. We propose Ltri-LLM framework, which divides KVs into spans, stores them in an offline index, and retrieves the relevant KVs into memory for various queries. Experimental results on popular long text benchmarks show that Ltri-LLM can achieve performance close to FA while maintaining efficient, streaming-based inference.
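其中"将 KV 划分为跨度、按查询动态检索"的思路可简化为:对每个跨度内的键向量取质心,与查询打分后只把 top-k 跨度载入内存。以下实现纯属示意,跨度长度与打分方式均为假设:

```python
import numpy as np

def retrieve_spans(keys, query, span_len=4, top_k=2):
    """Ltri-LLM-style sketch: partition cached keys into contiguous spans,
    score each span by the query's similarity to its mean key, keep top-k."""
    n = len(keys) // span_len * span_len
    spans = keys[:n].reshape(-1, span_len, keys.shape[-1])
    centroids = spans.mean(axis=1)
    sims = centroids @ query
    keep = np.argsort(-sims)[:top_k]
    return sorted(keep.tolist())  # span indices whose KVs are loaded into memory

rng = np.random.default_rng(0)
keys = rng.normal(size=(16, 8))
keys[8:12] += 3.0            # make span 2 strongly aligned with the query below
query = np.ones(8)
print(retrieve_spans(keys, query))  # span 2 should be among those retrieved
```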
zh

[NLP-36] ChatNVD: Advancing Cybersecurity Vulnerability Assessment with Large Language Models

【速读】: 该论文试图解决软件系统中日益频繁和复杂的网络安全漏洞问题,特别是现有评估方法依赖于高度技术化和抽象的框架,导致理解困难和漏洞被利用的风险增加。解决方案的关键在于利用大型语言模型 (LLMs) 来增强软件漏洞评估的效率和准确性。论文提出了 ChatNVD,一个基于 LLM 的网络安全漏洞评估工具,通过整合国家漏洞数据库 (NVD) 提供丰富的上下文信息,简化漏洞分析过程,适用于网络安全专业人员、开发者和非技术人员。研究通过开发三个不同 LLM 版本的 ChatNVD(GPT-4o mini、Llama 3 和 Gemini 1.5 Pro),并进行对比分析,评估其在识别和分析软件漏洞方面的有效性,从而揭示 LLMs 在解决软件漏洞理解和缓解挑战方面的潜力。

链接: https://arxiv.org/abs/2412.04756
作者: Shivansh Chopra,Hussain Ahmad,Diksha Goel,Claudia Szabo
关键词-EN: software systems underscore, increasing frequency, frequency and sophistication, systems underscore, underscore the urgent
类目: Cryptography and Security (cs.CR); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:The increasing frequency and sophistication of cybersecurity vulnerabilities in software systems underscore the urgent need for robust and effective methods of vulnerability assessment. However, existing approaches often rely on highly technical and abstract frameworks, which hinders understanding and increases the likelihood of exploitation, resulting in severe cyberattacks. Given the growing adoption of Large Language Models (LLMs) across diverse domains, this paper explores their potential application in cybersecurity, specifically for enhancing the assessment of software vulnerabilities. We propose ChatNVD, an LLM-based cybersecurity vulnerability assessment tool leveraging the National Vulnerability Database (NVD) to provide context-rich insights and streamline vulnerability analysis for cybersecurity professionals, developers, and non-technical users. We develop three variants of ChatNVD, utilizing three prominent LLMs: GPT-4o mini by OpenAI, Llama 3 by Meta, and Gemini 1.5 Pro by Google. To evaluate their efficacy, we conduct a comparative analysis of these models using a comprehensive questionnaire comprising common security vulnerability questions, assessing their accuracy in identifying and analyzing software vulnerabilities. This study provides valuable insights into the potential of LLMs to address critical challenges in understanding and mitigation of software vulnerabilities.
zh

[NLP-37] Question Answering for Decisionmaking in Green Building Design: A Multimodal Data Reasoning Method Driven by Large Language Models

【速读】: 该论文试图解决绿色建筑设计决策(DGBD)中由于专业知识广泛且复杂导致的决策效率低下的问题。解决方案的关键在于创新性地将大型语言模型与DGBD结合,开发了GreenQA问答框架。该框架通过整合检索增强生成(Retrieval Augmented Generation)、思维链(Chain of Thought)和函数调用(Function Call)方法,实现了多模态数据推理,包括天气数据分析与可视化、绿色建筑案例检索和知识查询,从而显著提升设计效率。

链接: https://arxiv.org/abs/2412.04741
作者: Yihui Li,Xiaoyue Yan,Hao Zhou,Borong Lin
关键词-EN: addressing energy consumption, recent years, widely acknowledged, critical role, consumption and environmental
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)
备注: Published at Association for Computer Aided Design in Architecture (ACADIA) 2024

点击查看摘要

Abstract:In recent years, the critical role of green buildings in addressing energy consumption and environmental issues has become widely acknowledged. Research indicates that over 40% of potential energy savings can be achieved during the early design stage. Therefore, decision-making in green building design (DGBD), which is based on modeling and performance simulation, is crucial for reducing building energy costs. However, the field of green building encompasses a broad range of specialized knowledge, which involves significant learning costs and results in low decision-making efficiency. Many studies have already applied artificial intelligence (AI) methods to this field. Based on previous research, this study innovatively integrates large language models with DGBD, creating GreenQA, a question answering framework for multimodal data reasoning. Utilizing Retrieval Augmented Generation, Chain of Thought, and Function Call methods, GreenQA enables multimodal question answering, including weather data analysis and visualization, retrieval of green building cases, and knowledge query. Additionally, this study conducted a user survey using the GreenQA web platform. The results showed that 96% of users believed the platform helped improve design efficiency. This study not only effectively supports DGBD but also provides inspiration for AI-assisted design.
zh

[NLP-38] BESSTIE: A Benchmark for Sentiment and Sarcasm Classification for Varieties of English

【速读】: 该论文试图解决大语言模型(LLMs)在非主流英语变体(如澳大利亚英语、印度英语和英国英语)的情感分析和讽刺检测中表现不佳的问题。解决方案的关键在于引入了BESSTIE基准,这是一个针对上述三种英语变体的情感和讽刺分类的标注数据集。通过基于位置和主题的过滤方法,从Google Place评论和Reddit评论中收集数据,并由母语者手动标注情感和讽刺标签。随后,论文对九种不同类型的LLMs进行了微调,并在这些数据集上评估其性能。结果显示,模型在内部圈英语变体(如澳大利亚英语和英国英语)上的表现较好,而在印度英语上的表现显著下降,尤其是在讽刺检测方面。此外,论文还指出了跨变体泛化的挑战,强调了特定语言变体数据集的必要性。BESSTIE有望成为未来研究中评估LLMs在语言变体方面公平性的有用基准。

链接: https://arxiv.org/abs/2412.04726
作者: Dipankar Srirag,Aditya Joshi,Jordan Painter,Diptesh Kanojia
关键词-EN: analysis of English, exhibit bias, bias against non-mainstream, English, Google Place reviews
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 10 pages, 7 figures, under review

点击查看摘要

Abstract:Despite large language models (LLMs) being known to exhibit bias against non-mainstream varieties, there are no known labeled datasets for sentiment analysis of English varieties. To address this gap, we introduce BESSTIE, a benchmark for sentiment and sarcasm classification for three varieties of English: Australian (en-AU), Indian (en-IN), and British (en-UK). Using web-based content from two domains, namely, Google Place reviews and Reddit comments, we collect datasets for these language varieties using two methods: location-based and topic-based filtering. Native speakers of the language varieties manually annotate the datasets with sentiment and sarcasm labels. Subsequently, we fine-tune nine large language models (LLMs) (representing a range of encoder/decoder and mono/multilingual models) on these datasets, and evaluate their performance on the two tasks. Our results reveal that the models consistently perform better on inner-circle varieties (i.e., en-AU and en-UK), with significant performance drops for en-IN, particularly in sarcasm detection. We also report challenges in cross-variety generalisation, highlighting the need for language variety-specific datasets such as ours. BESSTIE promises to be a useful evaluative benchmark for future research in equitable LLMs, specifically in terms of language varieties. The BESSTIE datasets, code, and models are currently available on request, while the paper is under review. Please contact the authors by email.
zh

[NLP-39] NoLoR: An ASR-Based Framework for Expedited Endangered Language Documentation with Neo-Aramaic as a Case Study

【速读】: 该论文试图解决濒危的Neo-Aramaic方言在即将灭绝前的快速记录问题。解决方案的关键在于开发了一种自动语音识别(ASR)模型,以加速对这一语言的文档化工作,并将其策略推广到一个名为NoLoR的新框架中。

链接: https://arxiv.org/abs/2412.04717
作者: Matthew Nazari
关键词-EN: Semitology today, Neo-Aramaic dialects, urgent task, forced displacement due, speakers of Aramaic
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:The documentation of the Neo-Aramaic dialects before their extinction has been described as the most urgent task in all of Semitology today. The death of this language will be an unfathomable loss to the descendents of the indigenous speakers of Aramaic, now predominantly diasporic after forced displacement due to violence. This paper develops an ASR model to expedite the documentation of this endangered language and generalizes the strategy in a new framework we call NoLoR.
zh

[NLP-40] ransformers Struggle to Learn to Search

【速读】: 该论文试图解决大型语言模型(LLMs)在执行搜索任务时表现不佳的问题,特别是探讨这种不足是由于数据缺乏、模型参数不足还是Transformer架构的根本限制。解决方案的关键在于使用基础的图连通性问题作为测试平台,生成大量高覆盖率的数据来训练小型Transformer模型,并测试它们是否能学会执行搜索。研究发现,当提供适当的训练分布时,Transformer能够学会搜索。通过一种新颖的机制解释技术,研究人员能够从训练模型中提取计算图,揭示了Transformer在每个输入图顶点上计算可达顶点集的能力,并通过逐层扩展这些集合来实现搜索。然而,随着输入图规模的增加,Transformer学习该任务的难度也随之增加,即使增加模型参数数量也无法解决这一问题,表明单纯增加模型规模并不能带来鲁棒的搜索能力。此外,上下文中的搜索(即链式思维)也无法解决在更大图上学习搜索的问题。

链接: https://arxiv.org/abs/2412.04703
作者: Abulhair Saparov,Srushti Pawar,Shreyas Pimpalgaonkar,Nitish Joshi,Richard Yuanzhe Pang,Vishakh Padmakumar,Seyed Mehran Kazemi,Najoung Kim,He He
关键词-EN: perform search robustly, recent studies, studies have shown, shown that large, Search
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Search is an ability foundational in many important tasks, and recent studies have shown that large language models (LLMs) struggle to perform search robustly. It is unknown whether this inability is due to a lack of data, insufficient model parameters, or fundamental limitations of the transformer architecture. In this work, we use the foundational graph connectivity problem as a testbed to generate effectively limitless high-coverage data to train small transformers and test whether they can learn to perform search. We find that, when given the right training distribution, the transformer is able to learn to search. We analyze the algorithm that the transformer has learned through a novel mechanistic interpretability technique that enables us to extract the computation graph from the trained model. We find that for each vertex in the input graph, transformers compute the set of vertices reachable from that vertex. Each layer then progressively expands these sets, allowing the model to search over a number of vertices exponential in the number of layers. However, we find that as the input graph size increases, the transformer has greater difficulty in learning the task. This difficulty is not resolved even as the number of parameters is increased, suggesting that increasing model scale will not lead to robust search abilities. We also find that performing search in-context (i.e., chain-of-thought) does not resolve this inability to learn to search on larger graphs.
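论文提取出的计算图对应一个逐层扩张可达集的算法:每一"层"把当前可达顶点的可达集并起来,使可达路径长度随层数指数增长。下面用纯 Python 示意这一思想(图与层数均为假设的小例子):

```python
def initial_reach(edges, n):
    """Layer-0 state: each vertex knows itself and its direct successors."""
    reach = {v: {v} for v in range(n)}
    for a, b in edges:
        reach[a].add(b)
    return reach

def expand(reach):
    """One 'layer': every vertex merges the reachable sets of the vertices it
    can already reach, doubling the reachable path length per layer."""
    return {v: set().union(*(reach[u] for u in reach[v])) for v in reach}

# Path graph 0 -> 1 -> 2 -> 3 -> 4: two expansion layers cover distance 4 = 2^2.
edges = [(0, 1), (1, 2), (2, 3), (3, 4)]
reach = initial_reach(edges, 5)
for _ in range(2):
    reach = expand(reach)
print(sorted(reach[0]))  # vertex 4 is reached after only two layers
```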
zh

[NLP-41] Privacy-Preserving Retrieval Augmented Generation with Differential Privacy

【速读】: 该论文试图解决在使用检索增强生成 (Retrieval Augmented Generation, RAG) 技术时,如何在不泄露外部敏感数据源信息的前提下,生成准确且详细的答案。解决方案的关键在于提出了一种算法,该算法智能地仅在需要敏感信息的标记上消耗差分隐私 (Differential Privacy, DP) 预算,而在其他标记上使用非隐私保护的大型语言模型 (LLM)。通过这种方式,论文在合理的隐私预算(例如 \epsilon \approx 10)下,实现了在不同模型和数据集上优于非RAG基线的性能。

链接: https://arxiv.org/abs/2412.04697
作者: Tatsuki Koga,Ruihan Wu,Kamalika Chaudhuri
关键词-EN: recent remarkable advancement, large language models, recent remarkable, remarkable advancement, advancement of large
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:With the recent remarkable advancement of large language models (LLMs), there has been a growing interest in utilizing them in the domains with highly sensitive data that lies outside their training data. For this purpose, retrieval augmented generation (RAG) is particularly effective – it assists LLMs by directly providing relevant information from the external knowledge sources. However, without extra privacy safeguards, RAG outputs risk leaking sensitive information from the external data source. In this work, we explore RAG under differential privacy (DP), a formal guarantee of data privacy. The main challenge with differentially private RAG is how to generate long accurate answers within a moderate privacy budget. We address this by proposing an algorithm that smartly spends privacy budget only for the tokens that require the sensitive information and uses the non-private LLM for other tokens. Our extensive empirical evaluations reveal that our algorithm outperforms the non-RAG baseline under a reasonable privacy budget of \epsilon\approx 10 across different models and datasets.
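"只在需要敏感信息的 token 上消耗隐私预算"的分配思路可以用一个极简示意表达;token 是否敏感的判定与预算均分策略均为假设,真实算法还需配合差分隐私的聚合与生成机制:

```python
def split_privacy_budget(tokens, needs_sensitive, total_epsilon=10.0):
    """Sketch of the allocation idea: only tokens that depend on the sensitive
    retrieved documents consume DP budget; the rest are produced by the
    non-private LLM at zero privacy cost."""
    flagged = [t for t, s in zip(tokens, needs_sensitive) if s]
    per_token = total_epsilon / len(flagged) if flagged else 0.0
    return {t: (per_token if s else 0.0) for t, s in zip(tokens, needs_sensitive)}

# Hypothetical answer tokens: only the last one relies on the private source.
tokens = ["The", "patient", "was", "prescribed", "warfarin"]
spend = split_privacy_budget(tokens, [False, False, False, False, True])
print(spend["warfarin"], spend["The"])
```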
zh

[NLP-42] LLM -Align: Utilizing Large Language Models for Entity Alignment in Knowledge Graphs

【速读】: 该论文试图解决现有实体对齐(Entity Alignment, EA)方法在处理知识图谱(Knowledge Graphs, KGs)时缺乏对实体属性和关系的深度语义理解的问题。解决方案的关键在于提出了一种基于大型语言模型(Large Language Model, LLM)的实体对齐方法,称为LLM-Align。该方法利用LLM的指令跟随和零样本学习能力,通过启发式方法选择实体的重要属性和关系,并将这些信息输入LLM以推断实体对齐结果。为确保对齐结果的质量,论文设计了一个多轮投票机制来缓解LLM产生的幻觉和位置偏差问题。实验结果表明,LLM-Align在三个EA数据集上达到了最先进的性能。

链接: https://arxiv.org/abs/2412.04690
作者: Xuan Chen,Tong Lu,Zhichun Wang
关键词-EN: Knowledge Graphs, seeks to identify, playing a crucial, fusion and integration, Entity Alignment
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Entity Alignment (EA) seeks to identify and match corresponding entities across different Knowledge Graphs (KGs), playing a crucial role in knowledge fusion and integration. Embedding-based entity alignment (EA) has recently gained considerable attention, resulting in the emergence of many innovative approaches. Initially, these approaches concentrated on learning entity embeddings based on the structural features of knowledge graphs (KGs) as defined by relation triples. Subsequent methods have integrated entities’ names and attributes as supplementary information to improve the embeddings used for EA. However, existing methods lack a deep semantic understanding of entity attributes and relations. In this paper, we propose a Large Language Model (LLM) based Entity Alignment method, LLM-Align, which explores the instruction-following and zero-shot capabilities of Large Language Models to infer alignments of entities. LLM-Align uses heuristic methods to select important attributes and relations of entities, and then feeds the selected triples of entities to an LLM to infer the alignment results. To guarantee the quality of alignment results, we design a multi-round voting mechanism to mitigate the hallucination and positional bias issues that occur with LLMs. Experiments on three EA datasets, demonstrating that our approach achieves state-of-the-art performance compared to existing EA methods.
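多轮投票机制可以用一个简单的多数表决函数示意;候选对齐名与各轮输出均为假设数据,实际中每一轮对应一次打乱候选顺序后的 LLM 调用:

```python
from collections import Counter

def majority_vote(answers, min_ratio=0.5):
    """LLM-Align-style multi-round voting: accept an alignment only when one
    candidate wins strictly more than `min_ratio` of the rounds, mitigating
    hallucination and positional bias."""
    winner, count = Counter(answers).most_common(1)[0]
    return winner if count / len(answers) > min_ratio else None

# Hypothetical outputs from 5 rounds with shuffled candidate order.
rounds = ["Berlin_de", "Berlin_de", "Munich_de", "Berlin_de", "Hamburg_de"]
print(majority_vote(rounds))                # 3/5 rounds agree -> accepted
print(majority_vote(["A", "B", "A", "B"]))  # no strict majority -> rejected
```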
zh

[NLP-43] SWEPO: Simultaneous Weighted Preference Optimization for Group Contrastive Alignment

【速读】: 该论文试图解决在直接偏好优化 (Direct Preference Optimization, DPO) 中处理多个动态选择的正负响应时可能出现的对齐偏差问题。解决方案的关键是引入了一种新的扩展方法,称为同时加权偏好优化 (Simultaneous Weighted Preference Optimization, SWEPO)。SWEPO 通过使用加权组对比损失 (weighted group contrastive loss),根据响应与平均奖励分数的偏差来分配权重,从而优先处理那些显著优于或劣于平均水平的响应,增强了优化效果。理论分析表明,同时考虑多个偏好可以减少对齐偏差,使对齐更加稳健。实证验证显示,SWEPO 在 UltraFeedback 数据集上表现出色,并在下游评估中使用 AlpacaEval 数据集时展现出优越的性能。

链接: https://arxiv.org/abs/2412.04628
作者: Taneesh Gupta,Rahul Madhavan,Xuchao Zhang,Chetan Bansal,Saravan Rajmohan
关键词-EN: introduce Simultaneous Weighted, Direct Preference Optimization, Simultaneous Weighted Preference, dynamically chosen positive, Weighted Preference Optimization
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:We introduce Simultaneous Weighted Preference Optimization (SWEPO), a novel extension of Direct Preference Optimization (DPO) designed to accommodate multiple dynamically chosen positive and negative responses for each query. SWEPO employs a weighted group contrastive loss, assigning weights to responses based on their deviation from the mean reward score. This approach effectively prioritizes responses that are significantly better or worse than the average, enhancing optimization. Our theoretical analysis demonstrates that simultaneously considering multiple preferences reduces alignment bias, resulting in more robust alignment. Additionally, we provide insights into the training dynamics of our loss function and a related function, InfoNCA. Empirical validation on the UltraFeedback dataset establishes SWEPO as state-of-the-art, with superior performance in downstream evaluations using the AlpacaEval dataset.
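其中"按与平均奖励分数的偏差赋权"可写成一个简单函数;这里用对绝对偏差做 softmax 的形式示意,未必是论文的精确公式:

```python
import math

def swepo_weights(rewards):
    """Deviation-based weights in the SWEPO spirit: responses far above or
    below the mean reward get larger weight, near-average ones get less.
    (Softmax over absolute deviations -- one plausible weighting, assumed.)"""
    mean = sum(rewards) / len(rewards)
    dev = [abs(r - mean) for r in rewards]
    z = sum(math.exp(d) for d in dev)
    return [math.exp(d) / z for d in dev]

rewards = [0.9, 0.5, 0.1, 0.5]  # two extreme and two average responses
w = swepo_weights(rewards)
print([round(x, 3) for x in w])  # extremes outweigh the average responses
```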
zh

[NLP-44] BigDocs: An Open and Permissively-Licensed Dataset for Training Multimodal Models on Document and Code Tasks

【速读】: 该论文试图解决多模态AI在商业应用中受限于训练数据不足和许可限制的问题。解决方案的关键在于引入BigDocs-7.5M,这是一个高质量、开放访问的数据集,包含750万份多模态文档,涵盖30个任务。通过高效的数据筛选过程,确保数据的高质量和许可的宽松性,同时强调透明度和责任。此外,论文还推出了BigDocs-Bench,一个包含10个新任务的基准套件,用于评估模型在图形用户界面(GUI)和图像生成代码等实际应用场景中的表现。实验结果表明,使用BigDocs-Bench训练的模型在文档推理和结构化输出任务中,性能比封闭源的GPT-4o提高了25.8%。

链接: https://arxiv.org/abs/2412.04626
作者: Juan Rodriguez,Xiangru Jian,Siba Smarak Panigrahi,Tianyu Zhang,Aarash Feizi,Abhay Puri,Akshay Kalkunte,François Savard,Ahmed Masry,Shravan Nayak,Rabiul Awal,Mahsa Massoud,Amirhossein Abaskohi,Zichao Li,Suyuchen Wang,Pierre-André Noël,Mats Leon Richter,Saverio Vadacchino,Shubbam Agarwal,Sanket Biswas,Sara Shanian,Ying Zhang,Noah Bolger,Kurt MacDonald,Simon Fauvel,Sathwik Tejaswi,Srinivas Sunkara,Joao Monteiro,Krishnamurthy DJ Dvijotham,Torsten Scholak,Nicolas Chapados,Sepideh Kharagani,Sean Hughes,M. Özsu,Siva Reddy,Marco Pedersoli,Yoshua Bengio,Christopher Pal,Issam Laradji,Spandanna Gella,Perouz Taslakian,David Vazquez,Sai Rajeswar
关键词-EN: significantly enhance document-understanding, understanding workflows, processing receipts, summarizing reports, potential to significantly
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注: The project is hosted at this https URL

点击查看摘要

Abstract:Multimodal AI has the potential to significantly enhance document-understanding tasks, such as processing receipts, understanding workflows, extracting data from documents, and summarizing reports. Code generation tasks that require long-structured outputs can also be enhanced by multimodality. Despite this, their use in commercial applications is often limited due to limited access to training data and restrictive licensing, which hinders open access. To address these limitations, we introduce BigDocs-7.5M, a high-quality, open-access dataset comprising 7.5 million multimodal documents across 30 tasks. We use an efficient data curation process to ensure our data is high-quality and license-permissive. Our process emphasizes accountability, responsibility, and transparency through filtering rules, traceable metadata, and careful content analysis. Additionally, we introduce BigDocs-Bench, a benchmark suite with 10 novel tasks where we create datasets that reflect real-world use cases involving reasoning over Graphical User Interfaces (GUI) and code generation from images. Our experiments show that training with BigDocs-Bench improves average performance up to 25.8% over closed-source GPT-4o in document reasoning and structured output tasks such as Screenshot2HTML or Image2Latex generation. Finally, human evaluations showed a preference for outputs from models trained on BigDocs over GPT-4o. This suggests that BigDocs can help both academics and the open-source community utilize and improve AI tools to enhance multimodal capabilities and document reasoning. The project is hosted at this https URL .
zh

[NLP-45] Sometimes I am a Tree: Data Drives Unstable Hierarchical Generalization

【速读】: 该论文试图解决神经网络在训练过程中对表面模式(surface-level patterns)的偏好问题,特别是在语言模型(LMs)中,早期训练阶段模型行为类似于n-gram模型,而未能充分利用语法规则所需的层次句法表示(hierarchical syntactic representations)。解决方案的关键在于通过案例研究英语语法,探讨训练数据中的潜在结构如何驱动模型在分布外(OOD)行为上的改进,并研究数据组成如何导致不同随机种子间的OOD行为不一致和训练动态不稳定。研究发现,模型只有在完全采用表面线性规则或层次规则时,其OOD行为才会稳定。层次规则由具有深层嵌套结构的复杂语法序列引发,而线性规则由简单序列引发。当数据包含简单和复杂示例的混合时,潜在规则之间存在竞争,导致每次独立训练要么稳定于单一规则,要么在OOD行为上保持不稳定。此外,论文还发现了一个例外情况,即在低多样性训练数据中记忆模式的模型可以稳定地过拟合,但其记忆和未记忆模式采用不同的规则。这些发现强调了训练数据在塑造泛化模式中的关键作用,以及数据子集之间的竞争如何导致不同随机种子间泛化结果的不一致。

链接: https://arxiv.org/abs/2412.04619
作者: Tian Qin,Naomi Saphra,David Alvarez-Melis
关键词-EN: favor shortcut heuristics, shortcut heuristics based, Neural networks, OOD behavior, networks often favor
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Neural networks often favor shortcut heuristics based on surface-level patterns. As one example, language models (LMs) behave like n-gram models early in training. However, to correctly apply grammatical rules, LMs must rely on hierarchical syntactic representations instead of n-grams. In this work, we use case studies of English grammar to explore how latent structure in training data drives models toward improved out-of-distribution (OOD) generalization. We then investigate how data composition can lead to inconsistent OOD behavior across random seeds and to unstable training dynamics. Our results show that models stabilize in their OOD behavior only when they fully commit to either a surface-level linear rule or a hierarchical rule. The hierarchical rule, furthermore, is induced by grammatically complex sequences with deep embedding structures, whereas the linear rule is induced by simpler sequences. When the data contains a mix of simple and complex examples, potential rules compete; each independent training run either stabilizes by committing to a single rule or remains unstable in its OOD behavior. These conditions lead "stable seeds" to cluster around simple rules, forming bimodal performance distributions across seeds. We also identify an exception to the relationship between stability and generalization: models which memorize patterns from low-diversity training data can overfit stably, with different rules for memorized and unmemorized patterns. Our findings emphasize the critical role of training data in shaping generalization patterns and how competition between data subsets contributes to inconsistent generalization outcomes across random seeds. Code is available at this https URL.
zh

[NLP-46] Extractive Structures Learned in Pretraining Enable Generalization on Finetuned Facts

【速读】: 该论文试图解决预训练语言模型(LMs)如何泛化到其微调数据中事实的隐含信息的问题。解决方案的关键在于引入“提取结构”(extractive structures)作为描述LMs内部组件(如MLPs或注意力头)如何协同工作以实现这种泛化的框架。提取结构包括存储训练事实的权重变化的信息组件,以及查询和处理存储信息的上游和下游提取组件。论文假设这些提取结构在预训练过程中通过遇到已知事实的隐含信息时被学习,并预测了两种现象:数据顺序效应(仅当事实先于其隐含信息时才能学习提取结构)和权重嫁接效应(提取结构可以被转移以预测反事实的隐含信息)。通过在OLMo-7b、Llama 3-8b、Gemma 2-9b和Qwen 2-7b模型中的实验,论文验证了这些现象,并进一步表明事实学习可以在早期和晚期层中发生,从而导致不同形式的泛化。

链接: https://arxiv.org/abs/2412.04614
作者: Jiahai Feng,Stuart Russell,Jacob Steinhardt
关键词-EN: Pretrained language models, John Doe, Pretrained language, John Doe lives, John Doe city
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Pretrained language models (LMs) can generalize to implications of facts that they are finetuned on. For example, if finetuned on "John Doe lives in Tokyo," LMs can correctly answer "What language do the people in John Doe's city speak?" with "Japanese". However, little is known about the mechanisms that enable this generalization or how they are learned during pretraining. We introduce extractive structures as a framework for describing how components in LMs (e.g., MLPs or attention heads) coordinate to enable this generalization. The structures consist of informative components that store training facts as weight changes, and upstream and downstream extractive components that query and process the stored information to produce the correct implication. We hypothesize that extractive structures are learned during pretraining when encountering implications of previously known facts. This yields two predictions: a data ordering effect where extractive structures can be learned only if facts precede their implications, and a weight grafting effect where extractive structures can be transferred to predict counterfactual implications. We empirically demonstrate these phenomena in the OLMo-7b, Llama 3-8b, Gemma 2-9b, and Qwen 2-7b models. Of independent interest, our results also indicate that fact learning can occur at both early and late layers, which lead to different forms of generalization.
zh

[NLP-47] Semantic Consistency-Based Uncertainty Quantification for Factuality in Radiology Report Generation

【速读】: 该论文试图解决放射报告生成中确保事实正确性的问题,特别是在生成式医学视觉大语言模型(VLLMs)中存在的幻觉现象。解决方案的关键在于引入了一种基于语义一致性的不确定性量化框架,该框架能够在报告级别和句子级别提供不确定性评估。与现有方法不同,该方法无需修改底层模型或访问其内部状态(如输出标记对数),因此可以作为即插即用模块无缝集成到现有最先进的模型中。实验结果表明,该方法有效检测幻觉现象并提高自动生成放射报告的事实准确性,通过拒绝高不确定性报告,事实性评分提高了10%,同时句子级别的不确定性标记在每个报告中最低精度句子的成功率为82.9%。
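论文未公开具体实现细节;按摘要思路,可以把"报告级语义一致性不确定性"写成如下极简示意:对同一输入采样多份报告,用两两相似度的均值衡量一致性,不确定性取其补,超过阈值则拒答。这里用词集合的 Jaccard 系数近似语义相似度(假设成分,真实系统应换成语义相似度模型),阈值也是演示用假设值。

```python
from itertools import combinations

def jaccard(a: str, b: str) -> float:
    """用词集合的 Jaccard 系数近似语义相似度(真实系统应换成语义模型)。"""
    sa, sb = set(a.split()), set(b.split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 1.0

def report_uncertainty(samples: list[str]) -> float:
    """报告级不确定性 = 1 - 采样报告两两相似度的均值。"""
    if len(samples) < 2:
        return 0.0
    sims = [jaccard(x, y) for x, y in combinations(samples, 2)]
    return 1.0 - sum(sims) / len(sims)

def should_abstain(samples: list[str], threshold: float = 0.5) -> bool:
    """不确定性超过阈值时拒绝输出该报告(对应摘要中'拒绝高不确定性报告')。"""
    return report_uncertainty(samples) > threshold

# 采样一致 -> 低不确定性;采样互相矛盾 -> 高不确定性
consistent = ["no acute findings", "no acute findings", "no acute findings"]
divergent = ["large left effusion", "no acute findings", "possible pneumonia"]
```

这种写法只依赖模型的采样输出,不需要 logits 等内部状态,因此天然是"即插即用"的。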

链接: https://arxiv.org/abs/2412.04606
作者: Chenyu Wang,Weichao Zhou,Shantanu Ghosh,Kayhan Batmanghelich,Wenchao Li
关键词-EN: shown great potential, Vision Large Language, Radiology report generation, shown great, great potential
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Radiology report generation (RRG) has shown great potential in assisting radiologists by automating the labor-intensive task of report writing. While recent advancements have improved the quality and coherence of generated reports, ensuring their factual correctness remains a critical challenge. Although generative medical Vision Large Language Models (VLLMs) have been proposed to address this issue, these models are prone to hallucinations and can produce inaccurate diagnostic information. To address these concerns, we introduce a novel Semantic Consistency-Based Uncertainty Quantification framework that provides both report-level and sentence-level uncertainties. Unlike existing approaches, our method does not require modifications to the underlying model or access to its inner state, such as output token logits, thus serving as a plug-and-play module that can be seamlessly integrated with state-of-the-art models. Extensive experiments demonstrate the efficacy of our method in detecting hallucinations and enhancing the factual accuracy of automatically generated radiology reports. By abstaining from high-uncertainty reports, our approach improves factuality scores by 10 %, achieved by rejecting 20 % of reports using the Radialog model on the MIMIC-CXR dataset. Furthermore, sentence-level uncertainty flags the lowest-precision sentence in each report with an 82.9 % success rate.
zh

[NLP-48] Formulation of probability theory problem with subtle condition

【速读】: 该论文试图解决概率论中对非英语母语的本科生来说最具挑战性的问题。解决方案的关键在于精确理解问题的条件和要求,并在解决问题之前进行详细的讨论和分析。论文通过详细讨论四个相关问题的解决方案,结合数值估计,并将问题条件与Python编程语言中的逻辑语句相联系,强调了理解问题前提的重要性。此外,论文还测试了两个广泛使用的聊天机器人(GPT-4o和Claude 3.5 Sonnet)对这些问题的响应,以评估其解决问题的能力。
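原文未给出具体题目;这里借经典的"两孩问题"做一个假设性示例,说明摘要强调的两点:条件表述的细微差别会改变答案,而把条件写成 Python 逻辑语句再做 Monte Carlo 数值估计,可以直观验证这一点。

```python
import random

def estimate(condition, trials=200_000, seed=0):
    """在满足 condition 的家庭中,估计两个孩子都是男孩的条件概率。"""
    rng = random.Random(seed)
    hit = total = 0
    for _ in range(trials):
        kids = (rng.choice("BG"), rng.choice("BG"))
        if condition(kids):            # 题目条件即一条 Python 逻辑语句
            total += 1
            hit += kids == ("B", "B")
    return hit / total

# 条件一:"至少有一个男孩"   -> 理论值 1/3
p_at_least_one = estimate(lambda k: "B" in k)
# 条件二:"老大是男孩"       -> 理论值 1/2
p_eldest = estimate(lambda k: k[0] == "B")
```

两个条件只差一个谓词,估计值却分别收敛到 1/3 和 1/2,正是"解题前必须精确理解条件"的一个缩影。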

链接: https://arxiv.org/abs/2412.04602
作者: Rafayel Petrosyan
关键词-EN: probability theory prove, probability theory, theory prove, Problems, Problems in probability
类目: Computation and Language (cs.CL); Probability (math.PR)
备注: 7 pages

点击查看摘要

Abstract:Problems in probability theory prove to be one of the most challenging for students. Here, we formulate and discuss four related problems in probability theory that proved difficult for first to fourth-year undergraduate students whose first language was not English. These examples emphasize how crucial it is to understand the conditions and requirements of the problems precisely before starting to solve them. We discuss the solutions to those problems in detail, complement them with numerical estimations, and link the conditions in the problems to the logical statements in Python programming language. We also tested two widely used chatbots (GPT-4o and Claude 3.5 Sonnet) by checking their responses to these problems.
zh

[NLP-49] Show Dont Tell: Uncovering Implicit Character Portrayal using LLM s

【速读】: 该论文试图解决现有工具在分析小说中角色描写时主要依赖显式文本指示(explicit textual indicators)的问题,特别是当角色描写是隐含的(implicit),通过行为和行为而非直接陈述来揭示时。解决方案的关键在于利用大型语言模型(LLMs)来揭示这些隐含的角色描写。具体来说,论文提出了一个名为LIIPA(LLMs for Inferring Implicit Portrayal for Character Analysis)的框架,通过配置不同类型的中间计算(如角色属性词列表、思维链)来推断源文本中角色的描写方式。LIIPA不仅在性能上优于现有方法,而且在处理角色数量增加时更为稳健,因为它能够利用完整的叙事上下文。此外,论文还探讨了角色描写估计对角色人口统计特征的敏感性,揭示了在算法公平性文献中常见的公平性与准确性之间的权衡。尽管存在这种权衡,所有LIIPA变体在公平性和准确性方面均优于非LLM基线方法。

链接: https://arxiv.org/abs/2412.04576
作者: Brandon Jaipersaud,Zining Zhu,Frank Rudzicz,Elliot Creager
关键词-EN: interpreting compelling stories, compelling stories, Tools for analyzing, fiction are valuable, valuable for writers
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
备注:

点击查看摘要

Abstract:Tools for analyzing character portrayal in fiction are valuable for writers and literary scholars in developing and interpreting compelling stories. Existing tools, such as visualization tools for analyzing fictional characters, primarily rely on explicit textual indicators of character attributes. However, portrayal is often implicit, revealed through actions and behaviors rather than explicit statements. We address this gap by leveraging large language models (LLMs) to uncover implicit character portrayals. We start by generating a dataset for this task with greater cross-topic similarity, lexical diversity, and narrative lengths than existing narrative text corpora such as TinyStories and WritingPrompts. We then introduce LIIPA (LLMs for Inferring Implicit Portrayal for Character Analysis), a framework for prompting LLMs to uncover character portrayals. LIIPA can be configured to use various types of intermediate computation (character attribute word lists, chain-of-thought) to infer how fictional characters are portrayed in the source text. We find that LIIPA outperforms existing approaches, and is more robust to increasing character counts (number of unique persons depicted) due to its ability to utilize full narrative context. Lastly, we investigate the sensitivity of portrayal estimates to character demographics, identifying a fairness-accuracy tradeoff among methods in our LIIPA framework – a phenomenon familiar within the algorithmic fairness literature. Despite this tradeoff, all LIIPA variants consistently outperform non-LLM baselines in both fairness and accuracy. Our work demonstrates the potential benefits of using LLMs to analyze complex characters and to better understand how implicit portrayal biases may manifest in narrative texts.
zh

[NLP-50] Give me Some Hard Questions: Synthetic Data Generation for Clinical QA ML4H2024

【速读】: 该论文试图解决临床问答系统训练数据不足的问题,特别是在零样本设置下生成高质量的临床问答数据。解决方案的关键在于提出了两种提示策略:1) 指导模型生成与输入上下文不重叠的问题,以增加问题的复杂性;2) 使用预定义的架构对输入记录进行总结,以辅助问题生成。实验结果表明,这些策略显著提高了生成问题的难度和微调性能,但同时也揭示了合成数据与真实数据在训练效果上的差距,主要源于合成答案的质量。

链接: https://arxiv.org/abs/2412.04573
作者: Fan Bai,Keith Harrigian,Joel Stremmel,Hamid Hassanzadeh,Ardavan Saeedi,Mark Dredze
关键词-EN: quickly access patient, access patient information, Clinical Question Answering, systems enable doctors, electronic health records
类目: Computation and Language (cs.CL)
备注: Accepted to ML4H 2024 Findings

点击查看摘要

Abstract:Clinical Question Answering (QA) systems enable doctors to quickly access patient information from electronic health records (EHRs). However, training these systems requires significant annotated data, which is limited due to the expertise needed and the privacy concerns associated with clinical data. This paper explores generating Clinical QA data using large language models (LLMs) in a zero-shot setting. We find that naive prompting often results in easy questions that do not reflect the complexity of clinical scenarios. To address this, we propose two prompting strategies: 1) instructing the model to generate questions that do not overlap with the input context, and 2) summarizing the input record using a predefined schema to scaffold question generation. Experiments on two Clinical QA datasets demonstrate that our method generates more challenging questions, significantly improving fine-tuning performance over baselines. We compare synthetic and gold data and find a gap between their training efficacy resulting from the quality of synthetically generated answers.
zh

[NLP-51] Understanding Hidden Computations in Chain-of-Thought Reasoning

【速读】: 该论文试图解决的问题是:在大型语言模型中,即使使用填充字符(如“…”)替代思维链(Chain-of-Thought, CoT)提示,模型仍能执行复杂推理任务,这引发了关于模型内部如何处理和表示推理步骤的疑问。解决方案的关键在于通过分析层级表示(layer-wise representations)和使用logit lens方法检查token排名,来解码这些隐藏字符,从而揭示模型内部的推理机制,并提高语言模型推理的可解释性和透明度。
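logit lens 的核心操作本身很简单:把某一中间层的隐状态直接与词表投影矩阵(unembedding)相乘,按 logit 排序查看该层"倾向于"哪些 token。下面是一个脱离真实 Transformer 的极简示意,隐状态维度、token 表和数值均为假设。

```python
def logit_lens(hidden: list[float], unembed: dict[str, list[float]]) -> list[tuple[str, float]]:
    """对单个隐状态做 logit lens:与每个 token 的输出向量求点积,按 logit 降序返回。"""
    logits = {
        tok: sum(h * w for h, w in zip(hidden, vec))
        for tok, vec in unembed.items()
    }
    return sorted(logits.items(), key=lambda kv: kv[1], reverse=True)

# 假设的 3 维隐状态与 4 个 token 的 unembedding 行向量
unembed = {
    "Tokyo": [1.0, 0.2, 0.0],
    "Paris": [0.1, 1.0, 0.0],
    "...":   [0.0, 0.0, 1.0],
    "the":   [0.3, 0.3, 0.3],
}
ranking = logit_lens([0.9, 0.1, 0.05], unembed)
```

论文正是对每一层的隐状态做这种投影并检查 token 排名,从而观察填充字符背后被"隐藏"的计算。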

链接: https://arxiv.org/abs/2412.04537
作者: Aryasomayajula Ram Bharadwaj
关键词-EN: prompting has significantly, significantly enhanced, abilities of large, large language models, models
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Chain-of-Thought (CoT) prompting has significantly enhanced the reasoning abilities of large language models. However, recent studies have shown that models can still perform complex reasoning tasks even when the CoT is replaced with filler(hidden) characters (e.g., “…”), leaving open questions about how models internally process and represent reasoning steps. In this paper, we investigate methods to decode these hidden characters in transformer models trained with filler CoT sequences. By analyzing layer-wise representations using the logit lens method and examining token rankings, we demonstrate that the hidden characters can be recovered without loss of performance. Our findings provide insights into the internal mechanisms of transformer models and open avenues for improving interpretability and transparency in language model reasoning.
zh

[NLP-52] Prompting Large Language Models for Clinical Temporal Relation Extraction

【速读】: 该论文旨在提升大型语言模型(LLMs)在临床时间关系抽取(CTRE)任务中的表现,特别是在少样本学习和全监督学习环境下的性能。解决方案的关键在于开发和评估多种微调策略,包括标准微调(Standard Fine-Tuning)、硬提示(Hard-Prompting)、软提示(Soft-Prompting)和低秩适应(Low-Rank Adaptation, LoRA),并结合量化技术(Quantization)以提高效率。研究结果表明,在全监督设置下,未冻结参数的GatorTron-Base模型通过硬提示策略达到了最高的F1分数(89.54%),超过了当前最先进的模型(85.70%)。此外,量化低秩适应(QLoRA)策略在GatorTron-Large模型上的应用也显著提升了性能。这些发现强调了根据任务需求和数据可用性选择合适的模型和微调策略的重要性。
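文中多次用到的 LoRA,思路是冻结原权重 W,只训练一个低秩增量 A·B(秩 r 远小于矩阵维度)。下面用纯 Python 列表写一个最小化的前向示意(维度与数值均为演示用假设,真实训练应使用 PEFT 等成熟实现)。

```python
def matmul(A, B):
    """朴素矩阵乘法,A: m×k,B: k×n。"""
    return [[sum(A[i][t] * B[t][j] for t in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def lora_forward(x, W, A, B, alpha=1.0):
    """y = x·W + (alpha/r)·x·A·B,其中 W 冻结,只有 A、B 参与训练。"""
    r = len(A[0])                 # 低秩 r = A 的列数
    base = matmul(x, W)
    delta = matmul(matmul(x, A), B)
    s = alpha / r
    return [[b + s * d for b, d in zip(rb, rd)] for rb, rd in zip(base, delta)]

# 假设:2 维输入、2 维输出、秩 r=1
W = [[1.0, 0.0], [0.0, 1.0]]      # 冻结的原权重(这里取单位阵)
A = [[1.0], [0.0]]                # 2×1 下投影
B = [[0.0, 2.0]]                  # 1×2 上投影
y = lora_forward([[1.0, 1.0]], W, A, B, alpha=1.0)
```

可以看到只需训练 A、B 共 4 个参数即可改变输出,这正是 PEFT 策略节省显存的来源;QLoRA 则进一步把冻结的 W 量化存储。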

链接: https://arxiv.org/abs/2412.04512
作者: Jianping He,Laila Rasmy,Haifang Li,Jianfu Li,Zenan Sun,Evan Yu,Degui Zhi,Cui Tao
关键词-EN: prompt large language, temporal relation extraction, clinical temporal relation, large language models, Frozen LLMs
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Objective: This paper aims to prompt large language models (LLMs) for clinical temporal relation extraction (CTRE) in both few-shot and fully supervised settings. Materials and Methods: This study utilizes four LLMs: Encoder-based GatorTron-Base (345M)/Large (8.9B); Decoder-based LLaMA3-8B/MeLLaMA-13B. We developed full (FFT) and parameter-efficient (PEFT) fine-tuning strategies and evaluated these strategies on the 2012 i2b2 CTRE task. We explored four fine-tuning strategies for GatorTron-Base: (1) Standard Fine-Tuning, (2) Hard-Prompting with Unfrozen LLMs, (3) Soft-Prompting with Frozen LLMs, and (4) Low-Rank Adaptation (LoRA) with Frozen LLMs. For GatorTron-Large, we assessed two PEFT strategies-Soft-Prompting and LoRA with Frozen LLMs-leveraging Quantization techniques. Additionally, LLaMA3-8B and MeLLaMA-13B employed two PEFT strategies: LoRA strategy with Quantization (QLoRA) applied to Frozen LLMs using instruction tuning and standard fine-tuning. Results: Under fully supervised settings, Hard-Prompting with Unfrozen GatorTron-Base achieved the highest F1 score (89.54%), surpassing the SOTA model (85.70%) by 3.74%. Additionally, two variants of QLoRA adapted to GatorTron-Large and Standard Fine-Tuning of GatorTron-Base exceeded the SOTA model by 2.36%, 1.88%, and 0.25%, respectively. Decoder-based models with frozen parameters outperformed their Encoder-based counterparts in this setting; however, the trend reversed in few-shot scenarios. Discussions and Conclusions: This study presented new methods that significantly improved CTRE performance, benefiting downstream tasks reliant on CTRE systems. The findings underscore the importance of selecting appropriate models and fine-tuning strategies based on task requirements and data availability. Future work will explore larger models and broader CTRE applications.
zh

[NLP-53] Pragmatic Metacognitive Prompting Improves LLM Performance on Sarcasm Detection COLING2024

【速读】: 该论文试图解决讽刺检测(sarcasm detection)在情感分析中的挑战,这一挑战源于讽刺语言的微妙性和依赖上下文的特性。解决方案的关键在于引入实用元认知提示(Pragmatic Metacognitive Prompting, PMP),通过结合实用主义原则和反思策略,帮助大型语言模型(Large Language Models, LLMs)更好地解读隐含意义、考虑上下文线索,并通过反思差异来识别讽刺。PMP在如LLaMA-3-8B、GPT-4o和Claude 3.5 Sonnet等最先进的LLMs上实现了在MUStARD和SemEval2018数据集上的最先进性能,展示了将实用推理和元认知策略整合到提示中,显著提升LLMs讽刺检测能力的潜力。
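摘要提到 PMP 引导模型"解读隐含意义、考虑上下文线索、反思字面与隐含意义的差异"三个环节;具体提示词需参考论文,这里按这三个环节拼一个假设性的提示模板,仅作结构示意。

```python
# 按摘要描述的三个环节排列的提示步骤(措辞为假设,非论文原文)
PMP_STEPS = [
    "Step 1: Interpret the implied meaning of the utterance.",
    "Step 2: Consider contextual cues (speaker, situation, tone).",
    "Step 3: Reflect on discrepancies between literal and implied meaning.",
]

def build_pmp_prompt(utterance: str, context: str = "") -> str:
    """拼装一个 PMP 风格的讽刺检测提示。"""
    parts = [f"Utterance: {utterance}"]
    if context:
        parts.append(f"Context: {context}")
    parts += PMP_STEPS
    parts.append("Finally, answer: is this sarcastic? (yes/no)")
    return "\n".join(parts)

prompt = build_pmp_prompt("Oh great, another Monday.", context="said with a sigh")
```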

链接: https://arxiv.org/abs/2412.04509
作者: Joshua Lee,Wyatt Fong,Alexander Le,Sur Shah,Kevin Han,Kevin Zhu
关键词-EN: Large Language Models, nature of verbiage, sentiment analysis due, significant challenge, nuanced and context-dependent
类目: Computation and Language (cs.CL)
备注: Accepted at COLING 2024, CHum Workshop

点击查看摘要

Abstract:Sarcasm detection is a significant challenge in sentiment analysis due to the nuanced and context-dependent nature of verbiage. We introduce Pragmatic Metacognitive Prompting (PMP) to improve the performance of Large Language Models (LLMs) in sarcasm detection, which leverages principles from pragmatics and reflection helping LLMs interpret implied meanings, consider contextual cues, and reflect on discrepancies to identify sarcasm. Using state-of-the-art LLMs such as LLaMA-3-8B, GPT-4o, and Claude 3.5 Sonnet, PMP achieves state-of-the-art performance on GPT-4o on MUStARD and SemEval2018. This study demonstrates that integrating pragmatic reasoning and metacognitive strategies into prompting significantly enhances LLMs’ ability to detect sarcasm, offering a promising direction for future research in sentiment analysis.
zh

[NLP-54] Arctic-Embed 2.0: Multilingual Retrieval Without Compromise

【速读】: 该论文试图解决多语言文本嵌入模型在英语检索质量上的下降问题,并提出了一种名为Arctic-Embed 2.0的开源文本嵌入模型。解决方案的关键在于采用了Matryoshka Representation Learning (MRL)技术,这种技术能够在保持高效嵌入存储的同时,显著降低压缩后的质量退化,从而在多语言和仅英语的基准测试中实现了具有竞争力的检索质量。
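MRL 训练出的嵌入在使用时的关键操作是:只取向量的前 k 维并重新做 L2 归一化,以更小的存储代价近似保留检索质量。下面是截断加归一化这一步的极简示意(向量数值为假设)。

```python
import math

def mrl_truncate(vec: list[float], k: int) -> list[float]:
    """取嵌入的前 k 维并 L2 归一化(Matryoshka 式的压缩用法)。"""
    head = vec[:k]
    norm = math.sqrt(sum(v * v for v in head)) or 1.0
    return [v / norm for v in head]

# 同一个假设嵌入,全维与压缩到一半维度的两个版本
emb = [0.4, 0.3, 0.2, 0.1, 0.05, 0.02]
full = mrl_truncate(emb, k=6)
small = mrl_truncate(emb, k=3)
```

压缩后的向量仍可直接用点积做余弦检索,只是维度(即存储量)减半;MRL 的训练目标就是让这种前缀截断的质量退化尽可能小。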

链接: https://arxiv.org/abs/2412.04506
作者: Puxuan Yu,Luke Merrick,Gaurav Nuti,Daniel Campos
关键词-EN: Matryoshka Representation Learning, open-source text embedding, text embedding models, embedding models built, supports Matryoshka Representation
类目: Computation and Language (cs.CL); Information Retrieval (cs.IR); Machine Learning (cs.LG)
备注: 10 pages, 5 figures, 3 tables

点击查看摘要

Abstract:This paper presents the training methodology of Arctic-Embed 2.0, a set of open-source text embedding models built for accurate and efficient multilingual retrieval. While prior works have suffered from degraded English retrieval quality, Arctic-Embed 2.0 delivers competitive retrieval quality on multilingual and English-only benchmarks, and supports Matryoshka Representation Learning (MRL) for efficient embedding storage with significantly lower compressed quality degradation compared to alternatives. We detail the design and implementation, presenting several important open research questions that arose during model development. We conduct experiments exploring these research questions and include extensive discussion aimed at fostering further discussion in this field.
zh

[NLP-55] Achieving Semantic Consistency Using BERT: Application of Pre-training Semantic Representations Model in Social Sciences Research

【速读】: 该论文试图解决在社会科学研究和文本分析任务中,不同时间段内词语解释的一致性问题。解决方案的关键在于比较和评估传统模型如Word2Vec与现代模型如BERT在维持词语语义稳定性方面的表现。通过使用《人民日报》20年(2004-2023)的文章数据,研究发现BERT在短期语义稳定性方面显著优于Word2Vec,主要得益于其预训练特性和Transformer编码器架构提供的上下文嵌入。然而,BERT在捕捉长期语义变化方面存在局限性,因此建议在长期研究中结合其他方法以全面捕捉语义漂移。这一研究强调了根据社会科学分析的具体时间背景选择合适的词嵌入模型的重要性。
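衡量语义稳定性的一个常用做法,是比较同一词在不同时间段语料上得到的向量的余弦相似度;下面只演示该度量本身(向量为刻意构造的假设数值,用以对应论文"BERT 较稳定、Word2Vec 易漂移"的结论,真实实验需分别在各时间段训练或编码)。

```python
import math

def cosine_sim(a: list[float], b: list[float]) -> float:
    """两个向量的余弦相似度。"""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

# 假设同一词在前后两个时段得到的向量
v_early = [0.8, 0.1, 0.1]
v_late_stable = [0.79, 0.12, 0.09]   # 对应"上下文嵌入较稳定"的情形
v_late_drift = [0.1, 0.8, 0.1]       # 对应"出现明显漂移"的情形

stable = cosine_sim(v_early, v_late_stable)
drift = cosine_sim(v_early, v_late_drift)
```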

链接: https://arxiv.org/abs/2412.04505
作者: Ruiyu Zhang,Lin Nie,Ce Zhao,Qingyang Chen
关键词-EN: Achieving consistent word, Achieving consistent, Bidirectional Encoder Representations, consistent word interpretations, enhancing understanding
类目: Computation and Language (cs.CL); General Economics (econ.GN)
备注: 13 pages, 2 figures

点击查看摘要

Abstract:Achieving consistent word interpretations across different time spans is crucial in social sciences research and text analysis tasks, as stable semantic representations form the foundation for research and task correctness, enhancing understanding of socio-political and cultural analysis. Traditional models like Word2Vec have provided significant insights into long-term semantic changes but often struggle to capture stable meanings in short-term contexts, which may be attributed to fluctuations in embeddings caused by unbalanced training data. Among recent advancements, BERT (Bidirectional Encoder Representations from Transformers), with its pre-trained nature and transformer encoder architecture, offers contextual embeddings that improve semantic consistency, making it a promising tool for short-term analysis. This study empirically compares the performance of Word2Vec and BERT in maintaining stable word meanings over time in text analysis tasks relevant to social sciences research. Using articles from the People's Daily spanning 20 years (2004-2023), we evaluated the semantic stability of each model across different timeframes. The results indicate that BERT consistently outperforms Word2Vec in maintaining semantic stability, offering greater stability in contextual embeddings. However, the study also acknowledges BERT's limitations in capturing gradual semantic shifts over longer periods due to its inherent stability. The findings suggest that while BERT is advantageous for short-term semantic analysis in social sciences, researchers should consider complementary approaches for long-term studies to fully capture semantic drift. This research underscores the importance of selecting appropriate word embedding models based on the specific temporal context of social science analyses.
zh

[NLP-56] Multi-Bin Batching for Increasing LLM Inference Throughput

【速读】: 该论文试图解决大语言模型(LLMs)推理系统中由于请求生成长度不一致导致的资源利用率低下的问题。解决方案的关键在于提出了一种名为“多箱批处理”(Multi-Bin Batching)的方法,该方法通过将具有相似(预测)执行时间的请求分组到预定义的箱(bins)中,从而显著提高LLM推理系统的吞吐量。这种方法不仅简单有效,而且通过理论分析和实际实验验证了其在标准批处理方法上的显著优势。
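按摘要思路,"多箱批处理"的分箱一步可以写成如下示意:先按预测执行时间把请求划入预设区间(箱),再在每个箱内组 batch,使同一 batch 内请求时长相近,避免短请求等待最长请求。箱边界与 batch 大小均为演示用假设参数。

```python
import bisect

def assign_bins(requests, boundaries):
    """requests: [(请求 id, 预测执行时间)];boundaries: 升序的箱上界。"""
    bins = [[] for _ in range(len(boundaries) + 1)]
    for rid, t in requests:
        bins[bisect.bisect_left(boundaries, t)].append(rid)
    return bins

def make_batches(bins, batch_size):
    """只在箱内部切 batch,保证同批请求的(预测)时长相近。"""
    return [b[i:i + batch_size]
            for b in bins for i in range(0, len(b), batch_size)]

reqs = [("a", 5), ("b", 120), ("c", 8), ("d", 95), ("e", 7), ("f", 110)]
bins = assign_bins(reqs, boundaries=[50, 100])    # 三个箱:<=50、<=100、>100
batches = make_batches(bins, batch_size=2)
```

普通批处理可能把 5ms 的 "a" 和 120ms 的 "b" 放进同一批,使硬件空等;分箱后 "a" 只会与 "c"、"e" 这类短请求同批。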

链接: https://arxiv.org/abs/2412.04504
作者: Ozgur Guldogan,Jackson Kunde,Kangwook Lee,Ramtin Pedarsani
关键词-EN: large language models, language models, grow in popularity, diverse capabilities, improving the efficiency
类目: Computation and Language (cs.CL); Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG); Systems and Control (eess.SY)
备注:

点击查看摘要

Abstract:As large language models (LLMs) grow in popularity for their diverse capabilities, improving the efficiency of their inference systems has become increasingly critical. Batching LLM requests is a critical step in scheduling the inference jobs on servers (e.g. GPUs), enabling the system to maximize throughput by allowing multiple requests to be processed in parallel. However, requests often have varying generation lengths, causing resource underutilization, as hardware must wait for the longest-running request in the batch to complete before moving to the next batch. We formalize this problem from a queueing-theoretic perspective, and aim to design a control policy which is throughput-optimal. We propose Multi-Bin Batching, a simple yet effective method that can provably improve LLM inference throughput by grouping requests with similar (predicted) execution times into predetermined bins. Through a combination of theoretical analysis and experiments, including real-world LLM inference scenarios, we demonstrate significant throughput gains compared to standard batching approaches.
zh

[NLP-57] A Primer on Large Language Models and their Limitations

【速读】: 该论文旨在为学术界和工业界的读者提供关于大型语言模型(Large Language Models, LLMs)的入门知识,帮助他们理解LLMs的核心概念、技术、优势、局限性、应用场景以及未来的研究方向。解决方案的关键在于系统地介绍LLMs的基本原理,分析其在实际任务中的应用潜力,并指出当前技术的局限性和未来的研究重点,从而为读者提供一个全面的视角,以便他们能够更好地利用LLMs提升日常工作和复杂任务的效率。

链接: https://arxiv.org/abs/2412.04503
作者: Sandra Johnson,David Hyland-Wood
关键词-EN: Large Language Models, Language Models, Large Language, primer on Large, identifies their strengths
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 33 pages, 19 figures

点击查看摘要

Abstract:This paper provides a primer on Large Language Models (LLMs) and identifies their strengths, limitations, applications and research directions. It is intended to be useful to those in academia and industry who are interested in gaining an understanding of the key LLM concepts and technologies, and in utilising this knowledge in both day to day tasks and in more complex scenarios where this technology can enhance current practices and processes.
zh

[NLP-58] Large Language Models in Politics and Democracy: A Comprehensive Survey

【速读】: 该论文试图解决生成式 AI(Generative AI),特别是大型语言模型(LLMs)在政治和民主领域中的应用及其潜在影响问题。解决方案的关键在于探讨LLMs在立法过程、政治传播、政治分析、外交和国家安全、经济和社会建模以及法律应用中的具体应用,同时强调在整合LLMs到政治过程中需要考虑的偏见、透明度和问责等挑战。论文强调了负责任的开发、伦理考量和治理框架的必要性,以确保LLMs的应用符合民主价值观,并促进更加公正和公平的社会。

链接: https://arxiv.org/abs/2412.04498
作者: Goshi Aoki
关键词-EN: large language models, including policymaking, language models, advancement of generative, large language
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
备注: 12 pages

点击查看摘要

Abstract:The advancement of generative AI, particularly large language models (LLMs), has a significant impact on politics and democracy, offering potential across various domains, including policymaking, political communication, analysis, and governance. This paper surveys the recent and potential applications of LLMs in politics, examining both their promises and the associated challenges. This paper examines the ways in which LLMs are being employed in legislative processes, political communication, and political analysis. Moreover, we investigate the potential of LLMs in diplomatic and national security contexts, economic and social modeling, and legal applications. While LLMs offer opportunities to enhance efficiency, inclusivity, and decision-making in political processes, they also present challenges related to bias, transparency, and accountability. The paper underscores the necessity for responsible development, ethical considerations, and governance frameworks to ensure that the integration of LLMs into politics aligns with democratic values and promotes a more just and equitable society.
zh

[NLP-59] Opportunities and Challenges of Large Language Models for Low-Resource Languages in Humanities Research

【速读】: 该论文试图解决低资源语言在数据稀缺和技术限制下所面临的挑战,特别是如何通过大型语言模型 (LLMs) 来促进这些语言的全面研究和保存。解决方案的关键在于利用 LLMs 的创新方法,涵盖语言变异、历史文献、文化表达和文学分析等多个领域。论文强调了跨学科合作和定制化模型的开发,以应对数据可访问性、模型适应性和文化敏感性等关键挑战,从而推动低资源语言研究的进步,并促进全球范围内对人类语言和文化遗产的保护。

链接: https://arxiv.org/abs/2412.04497
作者: Tianyang Zhong,Zhenyuan Yang,Zhengliang Liu,Ruidong Zhang,Yiheng Liu,Haiyang Sun,Yi Pan,Yiwei Li,Yifan Zhou,Hanqi Jiang,Junhao Chen
关键词-EN: embodying cultural evolution, Low-resource languages serve, human history, serve as invaluable, invaluable repositories
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Low-resource languages serve as invaluable repositories of human history, embodying cultural evolution and intellectual diversity. Despite their significance, these languages face critical challenges, including data scarcity and technological limitations, which hinder their comprehensive study and preservation. Recent advancements in large language models (LLMs) offer transformative opportunities for addressing these challenges, enabling innovative methodologies in linguistic, historical, and cultural research. This study systematically evaluates the applications of LLMs in low-resource language research, encompassing linguistic variation, historical documentation, cultural expressions, and literary analysis. By analyzing technical frameworks, current methodologies, and ethical considerations, this paper identifies key challenges such as data accessibility, model adaptability, and cultural sensitivity. Given the cultural, historical, and linguistic richness inherent in low-resource languages, this work emphasizes interdisciplinary collaboration and the development of customized models as promising avenues for advancing research in this domain. By underscoring the potential of integrating artificial intelligence with the humanities to preserve and study humanity’s linguistic and cultural heritage, this study fosters global efforts towards safeguarding intellectual diversity.
zh

[NLP-60] MAG-V: A Multi-Agent Framework for Synthetic Data Generation and Verification

【速读】: 该论文试图解决在扩展大型语言模型(LLMs)功能以实现环境交互时,面临的两个主要问题:(I) 获取大规模客户查询数据进行代理测试的时间成本高;(II) 代理对工具调用序列(或轨迹)的高度依赖可能导致意外或错误行为。解决方案的关键在于提出了一种名为MAG-V的多代理框架,该框架首先生成模拟客户查询的问题数据集,然后通过逆向工程从响应中生成替代问题以进行轨迹验证。初步结果表明,合成数据可以提高代理在实际客户查询上的表现,并且基于传统机器学习模型的轨迹验证方法在准确性上优于GPT-4o判断基准11%,并在构建的数据集上与GPT-4判断的表现相当。整体上,该方法旨在将多样化的任务代理统一到一个具有一致目标的框架中。

链接: https://arxiv.org/abs/2412.04494
作者: Saptarshi Sengupta,Kristal Curtis,Akshay Mallipeddi,Abhinav Mathur,Joseph Ross,Liang Gou
关键词-EN: Large Language Models, Extending the capabilities, Large Language, Language Models, environment interaction
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Extending the capabilities of Large Language Models (LLMs) with functions or tools for environment interaction has led to the emergence of the agent paradigm. In industry, training an LLM is not always feasible because of the scarcity of domain data, legal holds on proprietary customer data, rapidly changing business requirements, and the need to prototype new assistants. Agents provide an elegant solution to the above by relying on the zero-shot reasoning abilities of the underlying LLM and utilizing tools to explore and reason over customer data and respond to user requests. However, there are two concerns here: (I) acquiring large scale customer queries for agent testing is time-consuming, and (II) high reliance on the tool call sequence (or trajectory) followed by the agent to respond to user queries may lead to unexpected or incorrect behavior. To address this, we propose MAG-V, a multi-agent framework to first generate a dataset of questions that mimic customer queries; and second, reverse-engineer alternate questions from the responses for trajectory verification. Initial results indicate that our synthetic data can improve agent performance on actual customer queries. Furthermore, our trajectory verification methodology, inspired by distant supervision and using traditional machine learning (ML) models, outperforms a GPT-4o judge baseline by 11% accuracy and matches the performance of a GPT-4 judge on our constructed dataset. Overall, our approach is a step towards unifying diverse task agents into a cohesive framework for achieving an aligned objective.
zh

[NLP-61] Socio-Emotional Response Generation: A Human Evaluation Protocol for LLM -Based Conversational Systems

【速读】: 该论文试图解决当前大型语言模型(LLMs)在生成对话响应时缺乏透明性和可控性的问题,特别是在社会情感策略方面的不可见性和不可控性。解决方案的关键在于提出了一种神经网络架构,该架构在响应生成之前引入了一个中间步骤,即规划社会情感策略。通过预测预期策略标签序列并利用该序列生成响应,该方法显著提升了生成响应的质量,相较于直接的端到端生成方案,这种方法在社会和情感标准上表现更优。此外,论文还提出了一种新的评估协议,包括粗粒度的连贯性评估和细粒度的社会情感标准注释,以揭示当前评估指标在生成内容评估中的局限性。

链接: https://arxiv.org/abs/2412.04492
作者: Lorraine Vanel,Ariel R. Ramos Vela,Alya Yacoubi,Chloé Clavel(IDS, S2A, LTCI)
关键词-EN: Conversational systems, generally relevant responses, Large Language Models, capable of producing, producing impressive
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Social and Information Networks (cs.SI)
备注:

点击查看摘要

Abstract:Conversational systems are now capable of producing impressive and generally relevant responses. However, we have no visibility nor control of the socio-emotional strategies behind state-of-the-art Large Language Models (LLMs), which poses a problem in terms of their transparency and thus their trustworthiness for critical applications. Another issue is that current automated metrics are not able to properly evaluate the quality of generated responses beyond the dataset’s ground truth. In this paper, we propose a neural architecture that includes an intermediate step in planning socio-emotional strategies before response generation. We compare the performance of open-source baseline LLMs to the outputs of these same models augmented with our planning module. We also contrast the outputs obtained from automated metrics and evaluation results provided by human annotators. We describe a novel evaluation protocol that includes a coarse-grained consistency evaluation, as well as a finer-grained annotation of the responses on various social and emotional criteria. Our study shows that predicting a sequence of expected strategy labels and using this sequence to generate a response yields better results than a direct end-to-end generation scheme. It also highlights the divergences and the limits of current evaluation metrics for generated content. The code for the annotation platform and the annotated data are made publicly available for the evaluation of future models.
zh

[NLP-62] NLP Cluster Analysis of Common Core State Standards and NAEP Item Specifications

【速读】: 该论文试图解决的问题是如何利用自然语言处理 (NLP) 技术来映射内容标准与项目规范之间的关系,并验证这些映射的有效性。解决方案的关键在于通过NLP技术生成标准和规范的嵌入向量,并使用k-means聚类方法分别对这些向量进行聚类,以评估“领域”(对应Common Core标准)和“线索”(对应NAEP项目规范)的语义区分度。这种方法不仅提供了对映射过程的改进,还为后续的应用提供了实证支持。
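论文对标准与规范的嵌入向量分别做 k-means 聚类;其核心迭代("分配到最近中心、中心移到簇均值")可写成如下纯 Python 示意。这里用二维玩具向量代替真实的高维句向量,真实场景应使用 scikit-learn 等成熟实现。

```python
import random

def kmeans(points, k, iters=50, seed=0):
    """极简 k-means:返回每个点所属簇的标签。"""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    labels = [0] * len(points)
    for _ in range(iters):
        # 分配步:每个点归入最近的中心
        labels = [min(range(k),
                      key=lambda c: sum((p - q) ** 2
                                        for p, q in zip(pt, centers[c])))
                  for pt in points]
        # 更新步:中心移到簇内均值
        for c in range(k):
            members = [pt for pt, l in zip(points, labels) if l == c]
            if members:
                centers[c] = tuple(sum(d) / len(members) for d in zip(*members))
    return labels

# 两团明显分离的玩具"嵌入",模拟两个语义可区分的类别
pts = [(0.0, 0.1), (0.1, 0.0), (0.0, 0.0), (5.0, 5.1), (5.1, 5.0), (5.0, 5.0)]
labels = kmeans(pts, k=2)
```

若聚类结果与名义分类("领域"/"线索")高度吻合,即为这些分类语义可区分性的实证支持;这正是论文检验构念等价性的思路。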

链接: https://arxiv.org/abs/2412.04482
作者: Gregory Camilli,Larry Suter
关键词-EN: natural language processing, proposed a methodology, language processing, methodology using natural, natural language
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: 10 pages, 5 tables

点击查看摘要

Abstract:Camilli (2024) proposed a methodology using natural language processing (NLP) to map the relationship of a set of content standards to item specifications. This study provided evidence that NLP can be used to improve the mapping process. As part of this investigation, the nominal classifications of standards and items specifications were used to examine construct equivalence. In the current paper, we determine the strength of empirical support for the semantic distinctiveness of these classifications, which are known as “domains” for Common Core standards, and “strands” for National Assessment of Educational Progress (NAEP) item specifications. This is accomplished by separate k-means clustering for standards and specifications of their corresponding embedding vectors. We then briefly illustrate an application of these findings.
zh

计算机视觉

[CV-0] Stag-1: Towards Realistic 4D Driving Simulation with Video Generation Model

【速读】: 该论文试图解决4D驾驶模拟中的视图转换和时空动态建模问题。解决方案的关键在于提出了一个名为Stag-1的模型,该模型通过构建连续的4D点云场景来重现真实世界的环境,并设计了一个可控的生成网络以实现4D模拟。Stag-1利用环绕视图数据从自动驾驶车辆中获取信息,解耦时空关系,生成连贯的关键帧视频。此外,Stag-1还利用视频生成模型从任意视角生成逼真且可控的4D驾驶模拟视频。通过基于分解的相机姿态训练车辆运动视频,Stag-1增强了远距离场景的建模能力,并重建车辆相机轨迹以整合连续视图中的3D点,从而在时间维度上实现全面的场景理解。经过多层次场景训练后,Stag-1能够从任意所需视角进行模拟,并在静态时空条件下深入理解场景演变。

链接: https://arxiv.org/abs/2412.05280
作者: Lening Wang,Wenzhao Zheng,Dalong Du,Yunpeng Zhang,Yilong Ren,Han Jiang,Zhiyong Cui,Haiyang Yu,Jie Zhou,Jiwen Lu,Shanghang Zhang
关键词-EN: autonomous driving simulators, essential for developing, driving, driving simulation, driving simulators
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Code is available at: this https URL

点击查看摘要

Abstract:4D driving simulation is essential for developing realistic autonomous driving simulators. Despite advancements in existing methods for generating driving scenes, significant challenges remain in view transformation and spatial-temporal dynamic modeling. To address these limitations, we propose a Spatial-Temporal simulAtion for drivinG (Stag-1) model to reconstruct real-world scenes and design a controllable generative network to achieve 4D simulation. Stag-1 constructs continuous 4D point cloud scenes using surround-view data from autonomous vehicles. It decouples spatial-temporal relationships and produces coherent keyframe videos. Additionally, Stag-1 leverages video generation models to obtain photo-realistic and controllable 4D driving simulation videos from any perspective. To expand the range of view generation, we train vehicle motion videos based on decomposed camera poses, enhancing modeling capabilities for distant scenes. Furthermore, we reconstruct vehicle camera trajectories to integrate 3D points across consecutive views, enabling comprehensive scene understanding along the temporal dimension. Following extensive multi-level scene training, Stag-1 can simulate from any desired viewpoint and achieve a deep understanding of scene evolution under static spatial-temporal conditions. Compared to existing methods, our approach shows promising performance in multi-view scene consistency, background coherence, and accuracy, and contributes to the ongoing advancements in realistic autonomous driving simulation. Code: this https URL.
zh
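A basic building block behind merging per-frame observations into a continuous point cloud scene, as Stag-1 does, is transforming each frame's points into a shared world frame with its camera pose. The sketch below shows only this standard operation, not Stag-1's actual pipeline:

```python
import numpy as np

# Illustrative sketch: accumulate per-frame point clouds into one scene frame
# by applying each frame's camera-to-world pose (a 4x4 rigid transform).

def to_world(points_cam, pose_c2w):
    """points_cam: (N, 3); pose_c2w: (4, 4) camera-to-world transform."""
    homo = np.hstack([points_cam, np.ones((len(points_cam), 1))])  # (N, 4)
    return (homo @ pose_c2w.T)[:, :3]

points = np.array([[1.0, 0.0, 2.0]])
pose = np.eye(4)
pose[:3, 3] = [10.0, 0.0, 0.0]  # camera translated 10 m along x

world = to_world(points, pose)
print(world)  # the point lands at x = 11 in the world frame
```

Repeating this over consecutive views, with poses recovered from the reconstructed camera trajectory, is what lets 3D points from different timestamps be integrated into one temporally consistent scene.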

[CV-1] Perturb-and-Revise: Flexible 3D Editing with Generative Trajectories

【速读】: This paper targets the limitation of existing 3D editing methods when handling large geometric or appearance changes. The key is the proposed Perturb-and-Revise method, which perturbs the NeRF parameters with random initializations to create diverse starting states, determining the perturbation magnitude automatically by analyzing the local loss landscape. The edited NeRF is then revised along generative trajectories, with identity-preserving gradients imposed during the generative process to refine the result. Experiments demonstrate flexible, effective, and consistent editing of color, appearance, and geometry.

链接: https://arxiv.org/abs/2412.05279
作者: Susung Hong,Johanna Karras,Ricardo Martin-Brualla,Ira Kemelmacher-Shlizerman
关键词-EN: text-based diffusion models, text-based diffusion, diffusion models, advanced significantly, reconstruction and text-based
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project page: this https URL

点击查看摘要

Abstract:The fields of 3D reconstruction and text-based 3D editing have advanced significantly with the evolution of text-based diffusion models. While existing 3D editing methods excel at modifying color, texture, and style, they struggle with extensive geometric or appearance changes, thus limiting their applications. We propose Perturb-and-Revise, which makes possible a variety of NeRF editing. First, we perturb the NeRF parameters with random initializations to create a versatile initialization. We automatically determine the perturbation magnitude through analysis of the local loss landscape. Then, we revise the edited NeRF via generative trajectories. Combined with the generative process, we impose identity-preserving gradients to refine the edited NeRF. Extensive experiments demonstrate that Perturb-and-Revise facilitates flexible, effective, and consistent editing of color, appearance, and geometry in 3D. For 360° results, please visit our project page: this https URL.
zh

[CV-2] Birth and Death of a Rose

【速读】: This paper studies generating temporal object intrinsics (temporally evolving sequences of geometry, reflectance, and texture) from pre-trained 2D foundation models. The key is to distill signals from pre-trained 2D diffusion models and to use Neural Templates, guided by temporal states derived automatically from self-supervised image features, to enforce temporal consistency of the object intrinsics. The method removes the extensive manual effort and expertise required by conventional 3D modeling and animation, generates high-quality temporal object intrinsics, and supports sampling and controllable rendering of these dynamic objects from any viewpoint under arbitrary environmental lighting.

链接: https://arxiv.org/abs/2412.05278
作者: Chen Geng,Yunzhi Zhang,Shangzhe Wu,Jiajun Wu
关键词-EN: temporally evolving sequences, temporally evolving, blooming rose, foundation models, study the problem
类目: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
备注: Project website: this https URL

点击查看摘要

Abstract:We study the problem of generating temporal object intrinsics – temporally evolving sequences of object geometry, reflectance, and texture, such as a blooming rose – from pre-trained 2D foundation models. Unlike conventional 3D modeling and animation techniques that require extensive manual effort and expertise, we introduce a method that generates such assets with signals distilled from pre-trained 2D diffusion models. To ensure the temporal consistency of object intrinsics, we propose Neural Templates for temporal-state-guided distillation, derived automatically from image features from self-supervised learning. Our method can generate high-quality temporal object intrinsics for several natural phenomena and enable the sampling and controllable rendering of these dynamic objects from any viewpoint, under any environmental lighting conditions, at any time of their lifespan. Project website: this https URL
zh

[CV-3] Text to Blind Motion NEURIPS2024

【速读】: This paper addresses the inability of existing 3D human motion models to capture the movement characteristics of blind pedestrians. The key is BlindWays, a multimodal motion benchmark built from 3D motion data of 11 blind participants navigating real-world urban routes, paired with rich textual descriptions that capture the distinctive movements of blind pedestrians and their interactions with navigation aids (e.g., a white cane or a guide dog) and the environment. Benchmarking state-of-the-art 3D human prediction models on this benchmark reveals poor performance, motivating the development of safer and more reliable systems that can seamlessly reason over diverse human movements in their environments.

链接: https://arxiv.org/abs/2412.05277
作者: Hee Jae Kim,Kathakoli Sengupta,Masaki Kuribayashi,Hernisa Kacorri,Eshed Ohn-Bar
关键词-EN: perceive the world, world differently, result in distinct, motion, blind
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted at NeurIPS 2024

点击查看摘要

Abstract:People who are blind perceive the world differently than those who are sighted, which can result in distinct motion characteristics. For instance, when crossing at an intersection, blind individuals may have different patterns of movement, such as veering more from a straight path or using touch-based exploration around curbs and obstacles. These behaviors may appear less predictable to motion models embedded in technologies such as autonomous vehicles. Yet, the ability of 3D motion models to capture such behavior has not been previously studied, as existing datasets for 3D human motion currently lack diversity and are biased toward people who are sighted. In this work, we introduce BlindWays, the first multimodal motion benchmark for pedestrians who are blind. We collect 3D motion data using wearable sensors with 11 blind participants navigating eight different routes in a real-world urban setting. Additionally, we provide rich textual descriptions that capture the distinctive movement characteristics of blind pedestrians and their interactions with both the navigation aid (e.g., a white cane or a guide dog) and the environment. We benchmark state-of-the-art 3D human prediction models, finding poor performance with off-the-shelf and pre-training-based methods for our novel task. To contribute toward safer and more reliable systems that can seamlessly reason over diverse human movements in their environments, our text-and-motion benchmark is available at this https URL.
zh

[CV-4] Sparse autoencoders reveal selective remapping of visual concepts during adaptation ATC

【速读】: This paper asks which mechanisms are at work when adapting foundation models (such as the CLIP vision transformer) to specific downstream tasks. The key is a new Sparse Autoencoder (SAE), named PatchSAE, that extracts fine-grained interpretable concepts (e.g., shape, color, or object semantics) together with their patch-level spatial attributions. By studying how these concepts influence model outputs on downstream image classification, and how recent prompt-based adaptation techniques change the association between model inputs and these concepts, the authors find that although concept activations shift slightly after adaptation, most of the gains on adaptation tasks can be explained by concepts already present in the non-adapted foundation model. The work provides a concrete framework for training and using SAEs and offers insight into adaptation mechanisms.

链接: https://arxiv.org/abs/2412.05276
作者: Hyesu Lim,Jinho Choi,Jaegul Choo,Steffen Schneider
关键词-EN: build machine learning, machine learning systems, Adapting foundation models, Adapting foundation, specific purposes
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: A demo is available at this http URL

点击查看摘要

Abstract:Adapting foundation models for specific purposes has become a standard approach to build machine learning systems for downstream applications. Yet, it is an open question which mechanisms take place during adaptation. Here we develop a new Sparse Autoencoder (SAE) for the CLIP vision transformer, named PatchSAE, to extract interpretable concepts at granular levels (e.g. shape, color, or semantics of an object) and their patch-wise spatial attributions. We explore how these concepts influence the model output in downstream image classification tasks and investigate how recent state-of-the-art prompt-based adaptation techniques change the association of model inputs to these concepts. While activations of concepts slightly change between adapted and non-adapted models, we find that the majority of gains on common adaptation tasks can be explained with the existing concepts already present in the non-adapted foundation model. This work provides a concrete framework to train and use SAEs for Vision Transformers and provides insights into explaining adaptation mechanisms.
zh
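The basic computation inside a sparse autoencoder of this kind can be sketched in a few lines. The shapes, ReLU encoder, and L1 sparsity penalty below are generic SAE conventions assumed for illustration, not PatchSAE's actual configuration:

```python
import numpy as np

# Minimal sparse-autoencoder forward pass: patch activations are encoded into
# an overcomplete, mostly zero code, then linearly decoded back.

rng = np.random.default_rng(0)
d_model, d_code, n_patches = 8, 32, 5

W_enc = rng.normal(scale=0.1, size=(d_model, d_code))
b_enc = np.zeros(d_code)
W_dec = rng.normal(scale=0.1, size=(d_code, d_model))

x = rng.normal(size=(n_patches, d_model))          # ViT patch activations
z = np.maximum(x @ W_enc + b_enc, 0.0)             # ReLU gives a nonnegative code
recon = z @ W_dec                                  # linear decoder
l1_penalty = np.abs(z).sum()                       # sparsity term in the loss

print(z.shape, recon.shape)
```

Because the code is sparse and overcomplete, each code dimension tends to specialize to an interpretable concept, and keeping one code per patch is what yields the patch-wise spatial attributions described above.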

[CV-5] MotionFlow: Attention-Driven Motion Transfer in Video Diffusion Models

【速读】: This paper addresses the lack of fine-grained motion control in text-to-video models, particularly maintaining consistent motion under complex scene changes. The key is MotionFlow, a novel framework that uses cross-attention maps to precisely capture and manipulate spatial and temporal dynamics, enabling seamless motion transfer. MotionFlow requires no training, operating at test time by exploiting the inherent capabilities of pre-trained video diffusion models, and significantly improves fidelity and versatility even under drastic scene alterations.

链接: https://arxiv.org/abs/2412.05275
作者: Tuna Han Salih Meral,Hidir Yesiltepe,Connor Dunlop,Pinar Yanardag
关键词-EN: captivating video content, demonstrated impressive capabilities, showcasing a notable, demonstrated impressive, producing diverse
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Project Page: this https URL

点击查看摘要

Abstract:Text-to-video models have demonstrated impressive capabilities in producing diverse and captivating video content, showcasing a notable advancement in generative AI. However, these models generally lack fine-grained control over motion patterns, limiting their practical applicability. We introduce MotionFlow, a novel framework designed for motion transfer in video diffusion models. Our method utilizes cross-attention maps to accurately capture and manipulate spatial and temporal dynamics, enabling seamless motion transfers across various contexts. Our approach does not require training and works on test-time by leveraging the inherent capabilities of pre-trained video diffusion models. In contrast to traditional approaches, which struggle with comprehensive scene changes while maintaining consistent motion, MotionFlow successfully handles such complex transformations through its attention-based mechanism. Our qualitative and quantitative experiments demonstrate that MotionFlow significantly outperforms existing models in both fidelity and versatility even during drastic scene alterations.
zh
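As background to the attention-based mechanism described above, the structure of a cross-attention map (each video token gets a probability distribution over text tokens) can be shown with toy numbers; the dimensions are arbitrary and the code is not MotionFlow's implementation:

```python
import numpy as np

# Toy cross-attention map: rows are queries (video tokens), columns are keys
# (text tokens); each row is a softmax-normalized distribution.

def cross_attention_map(Q, K):
    scores = Q @ K.T / np.sqrt(Q.shape[1])
    scores -= scores.max(axis=1, keepdims=True)      # numerical stability
    weights = np.exp(scores)
    return weights / weights.sum(axis=1, keepdims=True)

rng = np.random.default_rng(1)
Q = rng.normal(size=(6, 4))   # 6 video tokens
K = rng.normal(size=(3, 4))   # 3 text tokens

A = cross_attention_map(Q, K)
print(A.shape)   # (6, 3); each row sums to 1
```

Manipulating maps like these (e.g., copying them from a source video's denoising process into a new generation) is the general mechanism by which training-free methods transfer motion while the text prompt controls appearance.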

[CV-6] SimC3D: A Simple Contrastive 3D Pretraining Framework Using RGB Images

【速读】: This paper addresses the dependence of existing 3D contrastive learning frameworks on costly point cloud datasets by proposing SimC3D, a simple yet effective 3D contrastive learning framework that, for the first time, pretrains 3D backbones from pure RGB images. The key ingredients are: (1) pure image data: depth estimation and suitable data processing yield monocularly synthesized point clouds, removing the reliance on expensive 3D point clouds; (2) a simple framework: 2D positional embeddings serve directly as a stronger contrastive objective, avoiding the computational overhead of the extra 2D backbone used in conventional multi-modal frameworks; and (3) strong performance: SimC3D outperforms prior approaches that pretrain on ground-truth point clouds across a range of downstream tasks, with further gains from combining multiple image datasets.

链接: https://arxiv.org/abs/2412.05274
作者: Jiahua Dong,Tong Wu,Rui Qian,Jiaqi Wang
关键词-EN: point cloud data, point cloud, demonstrated remarkable performance, point, paradigm has demonstrated
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:The 3D contrastive learning paradigm has demonstrated remarkable performance in downstream tasks through pretraining on point cloud data. Recent advances involve additional 2D image priors associated with 3D point clouds for further improvement. Nonetheless, these existing frameworks are constrained by the restricted range of available point cloud datasets, primarily due to the high costs of obtaining point cloud data. To this end, we propose SimC3D, a simple but effective 3D contrastive learning framework, for the first time, pretraining 3D backbones from pure RGB image data. SimC3D performs contrastive 3D pretraining with three appealing properties. (1) Pure image data: SimC3D simplifies the dependency of costly 3D point clouds and pretrains 3D backbones using solely RGB images. By employing depth estimation and suitable data processing, the monocular synthesized point cloud shows great potential for 3D pretraining. (2) Simple framework: Traditional multi-modal frameworks facilitate 3D pretraining with 2D priors by utilizing an additional 2D backbone, thereby increasing computational expense. In this paper, we empirically demonstrate that the primary benefit of the 2D modality stems from the incorporation of locality information. Inspired by this insightful observation, SimC3D directly employs 2D positional embeddings as a stronger contrastive objective, eliminating the necessity for 2D backbones and leading to considerable performance improvements. (3) Strong performance: SimC3D outperforms previous approaches that leverage ground-truth point cloud data for pretraining in various downstream tasks. Furthermore, the performance of SimC3D can be further enhanced by combining multiple image datasets, showcasing its significant potential for scalability. The code will be available at this https URL.
zh
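The monocular point-cloud synthesis step SimC3D relies on (a predicted depth map plus camera intrinsics turns every pixel into a 3D point) is standard pinhole back-projection; the exact pipeline is the paper's, but the operation itself can be sketched:

```python
import numpy as np

# Pinhole back-projection: depth map + intrinsics (fx, fy, cx, cy) -> 3D points.

def unproject(depth, fx, fy, cx, cy):
    h, w = depth.shape
    v, u = np.mgrid[0:h, 0:w].astype(float)
    x = (u - cx) / fx * depth
    y = (v - cy) / fy * depth
    return np.stack([x, y, depth], axis=-1).reshape(-1, 3)

depth = np.full((2, 2), 2.0)               # constant 2 m depth
pts = unproject(depth, fx=1.0, fy=1.0, cx=0.0, cy=0.0)
print(pts[3])  # pixel (u=1, v=1) at depth 2 maps to (2, 2, 2)
```

Each RGB image thus yields a synthetic point cloud with per-point colors and pixel coordinates, which is what makes 2D positional embeddings a natural contrastive target in this setting.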

[CV-7] Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling

【速读】: This paper investigates the relationship between model scaling and performance in multimodal large language models (MLLMs). The key is a systematic exploration of how vision encoders, language models, dataset sizes, and test-time configurations affect performance, together with Chain-of-Thought (CoT) reasoning and test-time scaling strategies, which significantly improve results on the MMMU benchmark and make the model the first open-source MLLM to surpass 70% accuracy on it.

链接: https://arxiv.org/abs/2412.05271
作者: Zhe Chen,Weiyun Wang,Yue Cao,Yangzhou Liu,Zhangwei Gao,Erfei Cui,Jinguo Zhu,Shenglong Ye,Hao Tian,Zhaoyang Liu,Lixin Gu,Xuehui Wang,Qingyun Li,Yimin Ren,Zixuan Chen,Jiapeng Luo,Jiahao Wang,Tan Jiang,Bo Wang,Conghui He,Botian Shi,Xingcheng Zhang,Han Lv,Yi Wang,Wenqi Shao,Pei Chu,Zhongying Tu,Tong He,Zhiyong Wu,Huipeng Deng,Jiaye Ge,Kai Chen,Min Dou,Lewei Lu,Xizhou Zhu,Tong Lu,Dahua Lin,Yu Qiao,Jifeng Dai,Wenhai Wang
关键词-EN: introducing significant enhancements, advanced multimodal large, core model architecture, series that builds, maintaining its core
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Technical Report

点击查看摘要

Abstract:We introduce InternVL 2.5, an advanced multimodal large language model (MLLM) series that builds upon InternVL 2.0, maintaining its core model architecture while introducing significant enhancements in training and testing strategies as well as data quality. In this work, we delve into the relationship between model scaling and performance, systematically exploring the performance trends in vision encoders, language models, dataset sizes, and test-time configurations. Through extensive evaluations on a wide range of benchmarks, including multi-discipline reasoning, document understanding, multi-image / video understanding, real-world comprehension, multimodal hallucination detection, visual grounding, multilingual capabilities, and pure language processing, InternVL 2.5 exhibits competitive performance, rivaling leading commercial models such as GPT-4o and Claude-3.5-Sonnet. Notably, our model is the first open-source MLLMs to surpass 70% on the MMMU benchmark, achieving a 3.7-point improvement through Chain-of-Thought (CoT) reasoning and showcasing strong potential for test-time scaling. We hope this model contributes to the open-source community by setting new standards for developing and applying multimodal AI systems. HuggingFace demo see this https URL
zh

[CV-8] DenseMatcher: Learning 3D Semantic Correspondence for Category-Level Manipulation from a Single Demo ATC

【速读】: This paper addresses how 3D correspondence can improve generalization across different objects in robotic manipulation, especially across categories. The key is DenseMatcher, which computes multiview 2D features, projects them onto 3D meshes, refines them with a 3D network, and finds dense correspondences from the resulting features via functional maps. The authors also build the first 3D matching dataset containing colored object meshes across diverse categories. Experiments show DenseMatcher outperforms prior 3D matching baselines by 43.5% and delivers strong downstream results in robotic manipulation and zero-shot color mapping.

链接: https://arxiv.org/abs/2412.05268
作者: Junzhe Zhu,Yuanchen Ju,Junyi Zhang,Muhan Wang,Zhecheng Yuan,Kaizhe Hu,Huazhe Xu
关键词-EN: unseen counterpart, dynamic information, enhance robotic manipulation, Abstract, enhance robotic
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
备注: Project Page: this https URL

点击查看摘要

Abstract:Dense 3D correspondence can enhance robotic manipulation by enabling the generalization of spatial, functional, and dynamic information from one object to an unseen counterpart. Compared to shape correspondence, semantic correspondence is more effective in generalizing across different object categories. To this end, we present DenseMatcher, a method capable of computing 3D correspondences between in-the-wild objects that share similar structures. DenseMatcher first computes vertex features by projecting multiview 2D features onto meshes and refining them with a 3D network, and subsequently finds dense correspondences with the obtained features using functional map. In addition, we craft the first 3D matching dataset that contains colored object meshes across diverse categories. In our experiments, we show that DenseMatcher significantly outperforms prior 3D matching baselines by 43.5%. We demonstrate the downstream effectiveness of DenseMatcher in (i) robotic manipulation, where it achieves cross-instance and cross-category generalization on long-horizon complex manipulation tasks from observing only one demo; (ii) zero-shot color mapping between digital assets, where appearance can be transferred between different objects with relatable geometry.
zh

[CV-9] Mind the Time: Temporally-Controlled Multi-Event Video Generation

【速读】: This paper addresses the inability of existing video generation models to precisely control the ordering and timing of multiple events described in a single prompt. The key is MinT, a multi-event video generator with temporal control, whose core idea is to bind each event to a specific period of the generated video so the model can focus on one event at a time. A time-based positional encoding, ReRoPE, is designed to enable time-aware interactions between event captions and video tokens by guiding the cross-attention operation. By fine-tuning a pre-trained video diffusion transformer on temporally grounded data, MinT produces coherent videos with smoothly connected events and, for the first time in the literature, offers control over the timing of events in generated videos.

链接: https://arxiv.org/abs/2412.05263
作者: Ziyi Wu,Aliaksandr Siarohin,Willi Menapace,Ivan Skorokhodov,Yuwei Fang,Varnith Chordia,Igor Gilitschenski,Sergey Tulyakov
关键词-EN: Real-world videos consist, Real-world videos, Real-world, video, events
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project Page: this https URL

点击查看摘要

Abstract:Real-world videos consist of sequences of events. Generating such sequences with precise temporal control is infeasible with existing video generators that rely on a single paragraph of text as input. When tasked with generating multiple events described using a single prompt, such methods often ignore some of the events or fail to arrange them in the correct order. To address this limitation, we present MinT, a multi-event video generator with temporal control. Our key insight is to bind each event to a specific period in the generated video, which allows the model to focus on one event at a time. To enable time-aware interactions between event captions and video tokens, we design a time-based positional encoding method, dubbed ReRoPE. This encoding helps to guide the cross-attention operation. By fine-tuning a pre-trained video diffusion transformer on temporally grounded data, our approach produces coherent videos with smoothly connected events. For the first time in the literature, our model offers control over the timing of events in generated videos. Extensive experiments demonstrate that MinT outperforms existing open-source models by a large margin.
zh
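ReRoPE's exact formulation is defined in the paper; as background, the sketch below shows plain rotary position embedding (RoPE), the family of position/time encodings its name points to: pairs of feature dimensions are rotated by a position-dependent angle, so relative offsets appear as relative rotations in attention scores.

```python
import numpy as np

# Standard RoPE (illustrative; not MinT's ReRoPE variant): rotate consecutive
# feature pairs by angles that grow with the position index.

def rope(x, pos, base=10000.0):
    """x: (d,) with even d; rotates consecutive pairs by position-dependent angles."""
    d = x.shape[0]
    half = x.reshape(d // 2, 2)
    freqs = base ** (-np.arange(d // 2) * 2.0 / d)
    theta = pos * freqs
    cos, sin = np.cos(theta), np.sin(theta)
    rot = np.empty_like(half)
    rot[:, 0] = half[:, 0] * cos - half[:, 1] * sin
    rot[:, 1] = half[:, 0] * sin + half[:, 1] * cos
    return rot.reshape(d)

x = np.array([1.0, 0.0, 1.0, 0.0])
y = rope(x, pos=3)
print(np.linalg.norm(y))  # rotation preserves the vector's norm
```

Encoding time this way, rather than with absolute embeddings, is what allows an event caption to attend to the video tokens of its assigned time span in a relative, period-aware manner.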

[CV-10] Extrapolated Urban View Synthesis Benchmark

【速读】: This paper addresses the tendency of existing view synthesis techniques to overfit training views in autonomous vehicle (AV) simulation, particularly when test views deviate substantially from training views (extrapolation). The key is the first Extrapolated Urban View Synthesis (EUVS) benchmark, built from public AV datasets with multiple traversals, multiple vehicles, and multiple cameras, on which state-of-the-art 3D Gaussian Splatting methods are evaluated across difficulty levels. The results show that Gaussian Splatting is prone to overfitting to training views, and that adding diffusion priors or improving geometry does not fundamentally improve view synthesis under large view changes, underscoring the need for more robust methods and large-scale training. The data is released to advance simulation for self-driving and urban robotics.

链接: https://arxiv.org/abs/2412.05256
作者: Xiangyu Han,Zhen Jia,Boyi Li,Yan Wang,Boris Ivanovic,Yurong You,Lingjie Liu,Yue Wang,Marco Pavone,Chen Feng,Yiming Li
关键词-EN: vision-centric autonomous vehicles, Gaussian Splatting, simulators are essential, vision-centric autonomous, View Synthesis
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Robotics (cs.RO)
备注: Project page: this https URL

点击查看摘要

Abstract:Photorealistic simulators are essential for the training and evaluation of vision-centric autonomous vehicles (AVs). At their core is Novel View Synthesis (NVS), a crucial capability that generates diverse unseen viewpoints to accommodate the broad and continuous pose distribution of AVs. Recent advances in radiance fields, such as 3D Gaussian Splatting, achieve photorealistic rendering at real-time speeds and have been widely used in modeling large-scale driving scenes. However, their performance is commonly evaluated using an interpolated setup with highly correlated training and test views. In contrast, extrapolation, where test views largely deviate from training views, remains underexplored, limiting progress in generalizable simulation technology. To address this gap, we leverage publicly available AV datasets with multiple traversals, multiple vehicles, and multiple cameras to build the first Extrapolated Urban View Synthesis (EUVS) benchmark. Meanwhile, we conduct quantitative and qualitative evaluations of state-of-the-art Gaussian Splatting methods across different difficulty levels. Our results show that Gaussian Splatting is prone to overfitting to training views. Besides, incorporating diffusion priors and improving geometry cannot fundamentally improve NVS under large view changes, highlighting the need for more robust approaches and large-scale training. We have released our data to help advance self-driving and urban robotics simulation technology.
zh
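View-synthesis benchmarks of this kind are typically scored with image-reconstruction metrics such as PSNR; EUVS's full evaluation protocol is the paper's, but the metric itself is standard and easy to sketch for images in [0, 1]:

```python
import numpy as np

# Peak signal-to-noise ratio between a rendered view and the held-out ground truth.

def psnr(pred, gt, max_val=1.0):
    mse = np.mean((pred - gt) ** 2)
    return 10.0 * np.log10(max_val ** 2 / mse)

gt = np.zeros((4, 4, 3))
pred = gt + 0.1            # uniform 0.1 error -> MSE = 0.01

print(psnr(pred, gt))      # 10 * log10(1 / 0.01) = 20.0 dB
```

Under interpolated setups, test views correlate with training views and such scores stay high; the benchmark's point is that the same models score far worse once test views are extrapolated.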

[CV-11] From classical techniques to convolution-based models: A review of object detection algorithms

【速读】: This paper reviews the shortcomings of traditional computer vision methods on complex visual data, particularly the limitations of handcrafted features and shallow models for object detection. The key is deep learning, especially Convolutional Neural Networks (CNNs), which automatically learn rich hierarchical features directly from data, including the semantic and high-level representations essential for accurate object detection. The review covers classical computer vision methods and CNN-based detectors, compares the strengths and weaknesses of major CNN models, highlights the significant advances deep learning has brought to object detection, and identifies key directions for future research to further improve performance.

链接: https://arxiv.org/abs/2412.05252
作者: Fnu Neha,Deepshikha Bhati,Deepak Kumar Shukla,Md Amiruzzaman
关键词-EN: image understanding, class labels, Object detection, Convolutional Neural Networks, fundamental task
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Object detection is a fundamental task in computer vision and image understanding, with the goal of identifying and localizing objects of interest within an image while assigning them corresponding class labels. Traditional methods, which relied on handcrafted features and shallow models, struggled with complex visual data and showed limited performance. These methods combined low-level features with contextual information and lacked the ability to capture high-level semantics. Deep learning, especially Convolutional Neural Networks (CNNs), addressed these limitations by automatically learning rich, hierarchical features directly from data. These features include both semantic and high-level representations essential for accurate object detection. This paper reviews object detection frameworks, starting with classical computer vision methods. We categorize object detection approaches into two groups: (1) classical computer vision techniques and (2) CNN-based detectors. We compare major CNN models, discussing their strengths and limitations. In conclusion, this review highlights the significant advancements in object detection through deep learning and identifies key areas for further research to improve performance.
zh
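A worked example helps ground the review's subject: both the classical and CNN-based detectors it surveys are evaluated on localization via intersection-over-union (IoU) between predicted and ground-truth boxes, computed as follows for boxes given as (x1, y1, x2, y2):

```python
# Intersection-over-union for axis-aligned boxes (x1, y1, x2, y2).

def iou(a, b):
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

print(iou((0, 0, 2, 2), (1, 1, 3, 3)))  # overlap 1, union 7 -> ~0.1429
```

A detection is usually counted as correct when IoU with a ground-truth box exceeds a threshold such as 0.5, which is how the frameworks compared in the review are scored against each other.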

[CV-12] CompCap: Improving Multimodal Large Language Models with Composite Captions

【速读】: This paper addresses the significant difficulty multimodal large language models (MLLMs) have in understanding composite images (CIs), i.e., synthetic visuals assembled from multiple elements such as charts, posters, or screenshots rather than captured directly by a camera. Existing MLLMs are trained mainly on natural images (NIs) and struggle with information extraction and complex reasoning on CIs. The key is the Composite Captions (CompCap) framework, which uses large language models (LLMs) and automation tools to synthesize CIs with accurate, detailed captions, yielding CompCap-118K, a dataset of 118K image-caption pairs. Supervised fine-tuning of MLLMs on CompCap-118K substantially improves their understanding of CIs, with average gains of 1.7% to 2.9% across multiple benchmarks.

链接: https://arxiv.org/abs/2412.05243
作者: Xiaohui Chen,Satya Narayan Shukla,Mahmoud Azab,Aashu Singh,Qifan Wang,David Yang,ShengYun Peng,Hanchao Yu,Shen Yan,Xuewen Zhang,Baosheng He
关键词-EN: Multimodal Large Language, Large Language Models, understand composite images, Multimodal Large, Language Models
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:How well can Multimodal Large Language Models (MLLMs) understand composite images? Composite images (CIs) are synthetic visuals created by merging multiple visual elements, such as charts, posters, or screenshots, rather than being captured directly by a camera. While CIs are prevalent in real-world applications, recent MLLM developments have primarily focused on interpreting natural images (NIs). Our research reveals that current MLLMs face significant challenges in accurately understanding CIs, often struggling to extract information or perform complex reasoning based on these images. We find that existing training data for CIs are mostly formatted for question-answer tasks (e.g., in datasets like ChartQA and ScienceQA), while high-quality image-caption datasets, critical for robust vision-language alignment, are only available for NIs. To bridge this gap, we introduce Composite Captions (CompCap), a flexible framework that leverages Large Language Models (LLMs) and automation tools to synthesize CIs with accurate and detailed captions. Using CompCap, we curate CompCap-118K, a dataset containing 118K image-caption pairs across six CI types. We validate the effectiveness of CompCap-118K by supervised fine-tuning MLLMs of three sizes: xGen-MM-inst.-4B and LLaVA-NeXT-Vicuna-7B/13B. Empirical results show that CompCap-118K significantly enhances MLLMs’ understanding of CIs, yielding average gains of 1.7%, 2.0%, and 2.9% across eleven benchmarks, respectively.
zh

[CV-13] Archaeoscape: Bringing Aerial Laser Scanning Archaeology to the Deep Learning Era NEURIPS2023

【速读】: This paper addresses the lack of expert-annotated, open-access resources for analyzing airborne laser scanning (ALS) data in archaeology with advanced deep learning. The key is Archaeoscape, a large-scale archaeological ALS dataset covering 888 km² in Cambodia with 31,141 annotated archaeological features from the Angkorian period. Archaeoscape is over four times larger than comparable datasets and is the first ALS archaeology resource with open-access data, annotations, and models. Benchmarks of several recent segmentation models demonstrate the benefits of modern vision techniques for this problem and highlight the distinctive challenge of discovering subtle human-made structures beneath dense jungle canopies. By releasing Archaeoscape openly, the work aims to bridge traditional archaeology and modern computer vision methods.

链接: https://arxiv.org/abs/2412.05203
作者: Yohann Perron,Vladyslav Sydorov,Adam P. Wijker,Damian Evans,Christophe Pottier,Loic Landrieu
关键词-EN: Airborne Laser Scanning, Airborne Laser, Laser Scanning, unveiling hidden landscapes, hidden landscapes beneath
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: NeurIPS 2023 - Datasets Benchmarks Track

点击查看摘要

Abstract:Airborne Laser Scanning (ALS) technology has transformed modern archaeology by unveiling hidden landscapes beneath dense vegetation. However, the lack of expert-annotated, open-access resources has hindered the analysis of ALS data using advanced deep learning techniques. We address this limitation with Archaeoscape (available at this https URL), a novel large-scale archaeological ALS dataset spanning 888 km² in Cambodia with 31,141 annotated archaeological features from the Angkorian period. Archaeoscape is over four times larger than comparable datasets, and the first ALS archaeology resource with open-access data, annotations, and models. We benchmark several recent segmentation models to demonstrate the benefits of modern vision techniques for this problem and highlight the unique challenges of discovering subtle human-made structures under dense jungle canopies. By making Archaeoscape available in open access, we hope to bridge the gap between traditional archaeology and modern computer vision methods.
zh

[CV-14] SurgBox: Agent -Driven Operating Room Sandbox with Surgery Copilot

【速读】: This paper addresses the high cognitive load of neurosurgical procedures in both surgical training and practice. The key is SurgBox, an agent-driven sandbox framework designed to systematically strengthen surgeons' cognitive capabilities through immersive surgical simulation. SurgBox's core innovations include using large language models (LLMs) with tailored Retrieval-Augmented Generation (RAG) to realistically replicate various surgical roles and create authentic training environments. The paper further devises Surgery Copilot, an AI-driven assistant that actively coordinates the surgical information stream and supports clinical decision-making, reducing the surgical team's cognitive workload. By incorporating a Long-Short Memory mechanism, Surgery Copilot balances immediate procedural assistance with comprehensive surgical knowledge. Experiments on real neurosurgical procedure records show that SurgBox both enhances surgical cognitive capabilities and significantly supports clinical decision-making.

链接: https://arxiv.org/abs/2412.05187
作者: Jinlin Wu,Xusheng Liang,Xuexue Bai,Zhen Chen
关键词-EN: impose substantial cognitive, substantial cognitive burdens, Surgical, represent complex, complex and high-stakes
类目: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC); Robotics (cs.RO)
备注: This work is accepted by IEEE Big Data 2024

点击查看摘要

Abstract:Surgical interventions, particularly in neurology, represent complex and high-stakes scenarios that impose substantial cognitive burdens on surgical teams. Although deliberate education and practice can enhance cognitive capabilities, surgical training opportunities remain limited due to patient safety concerns. To address these cognitive challenges in surgical training and operation, we propose SurgBox, an agent-driven sandbox framework to systematically enhance the cognitive capabilities of surgeons in immersive surgical simulations. Specifically, our SurgBox leverages large language models (LLMs) with tailored Retrieval-Augmented Generation (RAG) to authentically replicate various surgical roles, enabling realistic training environments for deliberate practice. In particular, we devise Surgery Copilot, an AI-driven assistant to actively coordinate the surgical information stream and support clinical decision-making, thereby diminishing the cognitive workload of surgical teams during surgery. By incorporating a novel Long-Short Memory mechanism, our Surgery Copilot can effectively balance immediate procedural assistance with comprehensive surgical knowledge. Extensive experiments using real neurosurgical procedure records validate our SurgBox framework in both enhancing surgical cognitive capabilities and supporting clinical decision-making. By providing an integrated solution for training and operational support to address cognitive challenges, our SurgBox framework advances surgical education and practice, potentially transforming surgical outcomes and healthcare quality. The code is available at this https URL.
zh

[CV-15] One-shot Federated Learning via Synthetic Distiller-Distillate Communication NEURIPS2024

【速读】: This paper addresses the performance degradation in one-shot federated learning (FL) caused by data heterogeneity and information loss. The key is the proposed FedSD2C framework, which introduces a distiller that synthesizes informative distillates directly from local data and shares these synthetic distillates rather than inconsistent local models, thereby reducing information loss and mitigating data heterogeneity. Experiments show that FedSD2C consistently outperforms other one-shot FL methods on complex, realistic datasets, achieving up to 2.6 times the performance of the best baseline.

链接: https://arxiv.org/abs/2412.05186
作者: Junyuan Zhang,Songhua Liu,Xinchao Wang
关键词-EN: One-shot Federated learning, powerful technology facilitating, technology facilitating collaborative, Federated learning, machine learning models
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by NeurIPS 2024

点击查看摘要

Abstract:One-shot Federated learning (FL) is a powerful technology facilitating collaborative training of machine learning models in a single round of communication. While its superiority lies in communication efficiency and privacy preservation compared to iterative FL, one-shot FL often compromises model performance. Prior research has primarily focused on employing data-free knowledge distillation to optimize data generators and ensemble models for better aggregating local knowledge into the server model. However, these methods typically struggle with data heterogeneity, where inconsistent local data distributions can cause teachers to provide misleading knowledge. Additionally, they may encounter scalability issues with complex datasets due to inherent two-step information loss: first, during local training (from data to model), and second, when transferring knowledge to the server model (from model to inversed data). In this paper, we propose FedSD2C, a novel and practical one-shot FL framework designed to address these challenges. FedSD2C introduces a distiller to synthesize informative distillates directly from local data to reduce information loss and proposes sharing synthetic distillates instead of inconsistent local models to tackle data heterogeneity. Our empirical results demonstrate that FedSD2C consistently outperforms other one-shot FL methods with more complex and real datasets, achieving up to 2.6 the performance of the best baseline. Code: this https URL
zh
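FedSD2C 的“共享合成蒸馏产物而非本地模型”思想,可以用下面的极简示意来理解(假设性示例,非论文实现:这里用每类特征均值充当蒸馏产物,用最近原型分类充当服务器端模型):

```python
from collections import defaultdict

def make_distillates(local_data):
    """客户端:把本地数据压缩为每类一个原型向量(均值),作为“蒸馏产物”上传。"""
    sums = defaultdict(lambda: [0.0, 0.0])
    counts = defaultdict(int)
    for (x1, x2), y in local_data:
        sums[y][0] += x1
        sums[y][1] += x2
        counts[y] += 1
    return {y: (s[0] / counts[y], s[1] / counts[y]) for y, s in sums.items()}

def server_classify(distillates, point):
    """服务器端:在所有客户端共享的原型上做最近原型分类。"""
    best, best_d = None, float("inf")
    for proto in distillates:
        for y, (c1, c2) in proto.items():
            d = (point[0] - c1) ** 2 + (point[1] - c2) ** 2
            if d < best_d:
                best, best_d = y, d
    return best

# 两个数据分布不一致的客户端(数据异质性)
client_a = [((0.0, 0.0), "cat"), ((0.2, 0.1), "cat")]
client_b = [((5.0, 5.0), "dog"), ((5.1, 4.9), "dog")]
pool = [make_distillates(client_a), make_distillates(client_b)]
print(server_classify(pool, (4.8, 5.2)))  # dog
```

真实方法中蒸馏产物由蒸馏器从本地数据合成,信息量远高于简单均值;此处仅示意单轮通信的内容从“本地模型”变为“数据摘要”这一点。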

[CV-16] LinVT: Empower Your Image-level Large Language Model to Understand Videos

【速读】: 该论文试图解决将现有的基于图像的大型语言模型(LLMs)扩展到视频领域的问题。解决方案的关键在于提出了一个可插拔的线性视频标记器(Linear Video Tokenizer, LinVT),通过线性变换和代表性信息浓缩的设计原则,使现有的图像LLMs能够理解和处理视频数据,而无需从头开始训练。LinVT的高兼容性和在多个视频基准测试中取得的先进性能,证明了其在多模态视频理解中的有效性。

链接: https://arxiv.org/abs/2412.05185
作者: Lishuai Gao,Yujie Zhong,Yingsen Zeng,Haoxian Tan,Dengjie Li,Zheng Zhao
关键词-EN: Large Language Models, Large Language, Language Models, develop an LLM-based, LLM-based assistant
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Multimedia (cs.MM)
备注:

点击查看摘要

Abstract:Large Language Models (LLMs) have been widely used in various tasks, motivating us to develop an LLM-based assistant for videos. Instead of training from scratch, we propose a module to transform arbitrary well-trained image-based LLMs into video-LLMs (after being trained on video data). To better adapt image-LLMs for processing videos, we introduce two design principles: linear transformation to preserve the original visual-language alignment and representative information condensation from redundant video content. Guided by these principles, we propose a plug-and-play Linear Video Tokenizer(LinVT), which enables existing image-LLMs to understand videos. We benchmark LinVT with six recent visual LLMs: Aquila, Blip-3, InternVL2, Mipha, Molmo and Qwen2-VL, showcasing the high compatibility of LinVT. LinVT-based LLMs achieve state-of-the-art performance across various video benchmarks, illustrating the effectiveness of LinVT in multi-modal video understanding.
zh
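LinVT 的两条设计原则——线性变换、代表性信息浓缩——可以用如下极简示意体会(假设性示例:用相邻帧 token 取均值代替可学习的浓缩模块,用单个矩阵乘法代替实际的线性投影):

```python
def condense_tokens(frame_tokens, window):
    """代表性信息浓缩:把相邻 window 个帧 token 取均值,压缩冗余视频内容。"""
    out = []
    for i in range(0, len(frame_tokens), window):
        chunk = frame_tokens[i:i + window]
        dim = len(chunk[0])
        out.append([sum(t[d] for t in chunk) / len(chunk) for d in range(dim)])
    return out

def linear_map(tokens, weight):
    """线性变换:保持原有视觉-语言对齐的线性关系(此处用普通矩阵乘法示意)。"""
    return [[sum(t[k] * weight[k][j] for k in range(len(t)))
             for j in range(len(weight[0]))] for t in tokens]

# 8 帧、每帧 1 个 2 维 token -> 浓缩为 4 个 token,再做线性映射
frames = [[float(i), 1.0] for i in range(8)]
identity = [[1.0, 0.0], [0.0, 1.0]]
video_tokens = linear_map(condense_tokens(frames, 2), identity)
print(len(video_tokens))  # 4
```

这样得到的视频 token 数量可控,且仍落在原图像-LLM 的嵌入空间中,示意了“即插即用”的接入方式。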

[CV-17] DreamColour: Controllable Video Colour Editing without Training

【速读】: 该论文试图解决视频色彩编辑中现有解决方案需要逐帧手动操作或产生不真实的时间伪影的问题。解决方案的关键在于通过解耦色彩编辑的空间和时间方面,提供一个直观且无需训练的框架,使用户能够专注于关键帧的精确色彩选择,然后自动将更改传播到整个视频。具体技术框架包括:(i) 结合网格色彩选择和自动实例分割的简单点选界面,实现精确的空间控制;(ii) 利用视频运动模式的双向色彩传播;(iii) 运动感知的混合技术,确保即使在复杂物体运动下也能实现平滑过渡。该方法在无需训练或专用硬件的情况下,展示了与现有最先进方法相媲美甚至超越的效果。

链接: https://arxiv.org/abs/2412.05180
作者: Chaitat Utintu,Pinaki Nath Chowdhury,Aneeshan Sain,Subhadeep Koley,Ayan Kumar Bhunia,Yi-Zhe Song
关键词-EN: produce unrealistic results, Video colour editing, colour editing, colour editing accessible, content creation
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project page available at this https URL

点击查看摘要

Abstract:Video colour editing is a crucial task for content creation, yet existing solutions either require painstaking frame-by-frame manipulation or produce unrealistic results with temporal artefacts. We present a practical, training-free framework that makes precise video colour editing accessible through an intuitive interface while maintaining professional-quality output. Our key insight is that by decoupling spatial and temporal aspects of colour editing, we can better align with users’ natural workflow – allowing them to focus on precise colour selection in key frames before automatically propagating changes across time. We achieve this through a novel technical framework that combines: (i) a simple point-and-click interface merging grid-based colour selection with automatic instance segmentation for precise spatial control, (ii) bidirectional colour propagation that leverages inherent video motion patterns, and (iii) motion-aware blending that ensures smooth transitions even with complex object movements. Through extensive evaluation on diverse scenarios, we demonstrate that our approach matches or exceeds state-of-the-art methods while eliminating the need for training or specialized hardware, making professional-quality video colour editing accessible to everyone.
zh

[CV-18] Spatially-Adaptive Hash Encodings For Neural Surface Reconstruction

【速读】: 该论文试图解决神经场景重建方法中固定编码函数(encoding functions)对所有场景采用“一刀切”策略的问题。当前最先进的表面重建方法使用基于网格的多分辨率哈希编码(grid-based multi-resolution hash encoding)来恢复高细节几何形状,但这些方法仍然依赖于固定的编码策略。论文提出的解决方案之关键是引入了一种学习型方法,允许网络根据空间位置选择编码基(encoding basis),通过掩码(masking)不同网格分辨率存储的特征的贡献来实现。这种空间自适应的方法使得网络能够适应更广泛的频率范围,同时避免引入噪声,从而在标准基准数据集上实现了最先进的性能。

链接: https://arxiv.org/abs/2412.05179
作者: Thomas Walker,Octave Mariotti,Amir Vaxman,Hakan Bilen
关键词-EN: Positional encodings, finer representations, scene reconstruction methods, neural scene reconstruction, common component
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Positional encodings are a common component of neural scene reconstruction methods, and provide a way to bias the learning of neural fields towards coarser or finer representations. Current neural surface reconstruction methods use a “one-size-fits-all” approach to encoding, choosing a fixed set of encoding functions, and therefore bias, across all scenes. Current state-of-the-art surface reconstruction approaches leverage grid-based multi-resolution hash encoding in order to recover high-detail geometry. We propose a learned approach which allows the network to choose its encoding basis as a function of space, by masking the contribution of features stored at separate grid resolutions. The resulting spatially adaptive approach allows the network to fit a wider range of frequencies without introducing noise. We test our approach on standard benchmark surface reconstruction datasets and achieve state-of-the-art performance on two benchmark datasets.
zh
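“按空间位置掩码不同网格分辨率特征贡献”的思想可以粗略示意如下(假设性示例:掩码 logits 为手工给定,实际方法中由网络按空间位置学习得到):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def masked_encoding(features_per_level, mask_logits):
    """按分辨率层级对特征加 sigmoid 掩码后求和:
    掩码接近 0 的层级几乎不贡献,网络因此可以按空间位置选择编码基。"""
    dim = len(features_per_level[0])
    return [sum(sigmoid(m) * f[d]
                for m, f in zip(mask_logits, features_per_level))
            for d in range(dim)]

# 三个分辨率层级的 2 维特征;最高频层级(最后一个)被近似关掉以避免噪声
levels = [[1.0, 0.0], [0.0, 1.0], [10.0, 10.0]]
enc = masked_encoding(levels, [8.0, 8.0, -8.0])
print([round(v, 2) for v in enc])  # [1.0, 1.0]
```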

[CV-19] DNF: Unconditional 4D Generation with Dictionary-based Neural Fields

【速读】: 该论文试图解决4D生成建模中的挑战,特别是由于物体随时间变形的复杂性。解决方案的关键在于提出了一种新的4D表示方法——DNF (Dictionary-based Neural Fields),通过字典学习方法将4D运动从形状中解耦为神经场,分别表示为形状和运动的潜在空间。每个可变形形状由其形状和运动的全局潜在代码、形状特定的系数向量以及共享的字典信息表示,从而捕捉形状特定的细节和全局共享信息。结合基于Transformer的扩散模型,该方法能够生成有效且高保真的4D动画。

链接: https://arxiv.org/abs/2412.05161
作者: Xinyi Zhang,Naiqi Li,Angela Dai
关键词-EN: remains challenging due, modeling remains challenging, generative modeling remains, achieved through diffusion-based, deformations over time
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Project page: this https URL

点击查看摘要

Abstract:While remarkable success has been achieved through diffusion-based 3D generative models for shapes, 4D generative modeling remains challenging due to the complexity of object deformations over time. We propose DNF, a new 4D representation for unconditional generative modeling that efficiently models deformable shapes with disentangled shape and motion while capturing high-fidelity details in the deforming objects. To achieve this, we propose a dictionary learning approach to disentangle 4D motion from shape as neural fields. Both shape and motion are represented as learned latent spaces, where each deformable shape is represented by its shape and motion global latent codes, shape-specific coefficient vectors, and shared dictionary information. This captures both shape-specific detail and global shared information in the learned dictionary. Our dictionary-based representation well balances fidelity, contiguity and compression – combined with a transformer-based diffusion model, our method is able to generate effective, high-fidelity 4D animations.
zh

[CV-20] Gaining Explainability from a CNN for Stereotype Detection Based on Mice Stopping Behavior ICPR

【速读】: 该论文试图解决通过实验室小鼠的停止行为来识别其年龄和性别的问题。解决方案的关键在于利用LiveMouseTracker (LMT)系统追踪小鼠在笼子中的停止位置,并构建2D直方图堆栈,然后通过浅层卷积神经网络 (CNN) 架构对小鼠的年龄和性别进行分类。研究发现,雌性小鼠表现出更明显的可识别行为模式,分类准确率超过90%,而雄性小鼠的分类准确率仅为62.5%,这主要是因为雄性小鼠,尤其是幼年雄性,其行为模式在幼年雌性和成年雄性之间波动。

链接: https://arxiv.org/abs/2412.05158
作者: Raul Alfredo de Sousa Silva,Yasmine Belaidouni,Rabah Iguernaissi,Djamal Merad,Séverine Dubuisson
关键词-EN: affects humans, key to find, find answers, answers about diseases, diseases and neurodevelopmental
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: to be published in VAIB - Visual observation and analysis of Vertebrate And Insect Behavior (ICPR) 2024

点击查看摘要

Abstract:Understanding the behavior of laboratory animals is key to finding answers about diseases and neurodevelopmental disorders that also affect humans. One behavior of interest is stopping, as it correlates with the exploration, feeding and sleeping habits of individuals. To improve comprehension of animal behavior, we focus on identifying traits revealing the age/sex of mice through the series of stopping spots of each individual. We track 4 mice using the LiveMouseTracker (LMT) system over 3 days. Then, we build a stack of 2D histograms of the stop positions. This stack of histograms passes through a shallow CNN architecture to classify mice in terms of age and sex. We observe that female mice show more recognizable behavioral patterns, reaching a classification accuracy of more than 90%, while males, which do not present as many distinguishable patterns, reach an accuracy of 62.5%. To gain explainability from the model, we look at the activation functions of the convolutional layers and find that some regions of the cage are preferentially explored by females. Males, especially juveniles, present behavior patterns that oscillate between juvenile female and adult male.
zh

[CV-21] Towards Flexible 3D Perception: Object-Centric Occupancy Completion Augments 3D Object Detection NEURIPS2024

【速读】: 该论文试图解决3D物体边界框(bbox)表示在自动驾驶感知中无法捕捉物体内在几何细节的问题。解决方案的关键在于引入以物体为中心的占据表示(object-centric occupancy),作为物体边界框的补充。这种表示不仅提供了检测物体的精细细节,还允许在实际应用中实现更高的体素分辨率。论文从数据和算法两个方面推进了以物体为中心的占据感知的发展:在数据方面,构建了首个以物体为中心的占据数据集;在算法方面,提出了一种配备隐式形状解码器的新型以物体为中心的占据补全网络,该网络利用长时间序列中的时间信息,准确预测不准确物体提议的完整占据体积,从而在噪声检测和跟踪条件下实现物体形状的鲁棒补全。此外,论文还展示了占据特征显著提升了现有3D物体检测器的检测结果,特别是在Waymo Open Dataset中处理不完整或远距离物体时。

链接: https://arxiv.org/abs/2412.05154
作者: Chaoda Zheng,Feng Wang,Naiyan Wang,Shuguang Cui,Zhen Li
关键词-EN: autonomous driving perception, object bounding box, object intrinsic geometry, bounding box, intrinsic geometry
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: NeurIPS 2024

点击查看摘要

Abstract:While 3D object bounding box (bbox) representation has been widely used in autonomous driving perception, it lacks the ability to capture the precise details of an object’s intrinsic geometry. Recently, occupancy has emerged as a promising alternative for 3D scene perception. However, constructing a high-resolution occupancy map remains infeasible for large scenes due to computational constraints. Recognizing that foreground objects only occupy a small portion of the scene, we introduce object-centric occupancy as a supplement to object bboxes. This representation not only provides intricate details for detected objects but also enables higher voxel resolution in practical applications. We advance the development of object-centric occupancy perception from both data and algorithm perspectives. On the data side, we construct the first object-centric occupancy dataset from scratch using an automated pipeline. From the algorithmic standpoint, we introduce a novel object-centric occupancy completion network equipped with an implicit shape decoder that manages dynamic-size occupancy generation. This network accurately predicts the complete object-centric occupancy volume for inaccurate object proposals by leveraging temporal information from long sequences. Our method demonstrates robust performance in completing object shapes under noisy detection and tracking conditions. Additionally, we show that our occupancy features significantly enhance the detection results of state-of-the-art 3D object detectors, especially for incomplete or distant objects in the Waymo Open Dataset.
zh

[CV-22] BIAS: A Body-based Interpretable Active Speaker Approach

【速读】: 该论文试图解决在复杂和野外场景中,现有主动说话者检测 (Active Speaker Detection, ASD) 方法依赖于音频和面部特征的局限性问题。解决方案的关键在于提出了BIAS模型,该模型首次结合了音频、面部和身体信息,以在各种挑战性条件下准确预测主动说话者。此外,BIAS通过引入Squeeze-and-Excitation块用于注意力热图创建和特征重要性评估,增强了模型的可解释性。为了进一步提高可解释性,论文还标注了一个与ASD相关的动作数据集 (ASD-Text),并微调了ViT-GPT2用于文本场景描述,以补充BIAS的可解释性。实验结果表明,BIAS在强调身体特征重要性的挑战性条件下(如Columbia、开放设置和WASD数据集)表现出色,同时在AVA-ActiveSpeaker数据集上也取得了竞争性结果。

链接: https://arxiv.org/abs/2412.05150
作者: Tiago Roxo,Joana C. Costa,Pedro R. M. Inácio,Hugo Proença
关键词-EN: Active Speaker Detection, approaches heavily rely, Speaker Detection, wild scenarios, heavily rely
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:State-of-the-art Active Speaker Detection (ASD) approaches heavily rely on audio and facial features to perform, which is not a sustainable approach in wild scenarios. Although these methods achieve good results in the standard AVA-ActiveSpeaker set, a recent wilder ASD dataset (WASD) showed the limitations of such models and raised the need for new approaches. As such, we propose BIAS, a model that, for the first time, combines audio, face, and body information, to accurately predict active speakers in varying/challenging conditions. Additionally, we design BIAS to provide interpretability by proposing a novel use for Squeeze-and-Excitation blocks, namely in attention heatmaps creation and feature importance assessment. For a full interpretability setup, we annotate an ASD-related actions dataset (ASD-Text) to finetune a ViT-GPT2 for text scene description to complement BIAS interpretability. The results show that BIAS is state-of-the-art in challenging conditions where body-based features are of utmost importance (Columbia, open-settings, and WASD), and yields competitive results in AVA-ActiveSpeaker, where face is more influential than body for ASD. BIAS interpretability also shows the features/aspects more relevant towards ASD prediction in varying settings, making it a strong baseline for further developments in interpretable ASD models, and is available at this https URL.
zh

[CV-23] LoRA.rar: Learning to Merge LoRAs via Hypernetworks for Subject-Style Conditioned Image Generation

【速读】: 该论文试图解决个性化图像生成模型在资源受限设备(如智能手机)上实时应用的效率问题。解决方案的关键在于引入了一种名为 LoRA.rar 的方法,该方法通过预训练一个超网络(hypernetwork)来学习内容-风格低秩适应参数(LoRA)对的高效合并策略。这种方法不仅显著提高了图像质量,还将合并过程的速度提升了超过4000倍,从而实现了快速、高质量的个性化图像生成。此外,论文还提出了一种新的评估协议,利用多模态大语言模型(MLLM)来更准确地评估内容和风格的质量,从而克服了现有评估指标的局限性。

链接: https://arxiv.org/abs/2412.05148
作者: Donald Shenaj,Ondrej Bohdal,Mete Ozay,Pietro Zanuttigh,Umberto Michieli
关键词-EN: Recent advancements, enabled personalized image, personalized image creation, user-defined subjects, enabled personalized
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 17 pages, 20 figures

点击查看摘要

Abstract:Recent advancements in image generation models have enabled personalized image creation with both user-defined subjects (content) and styles. Prior works achieved personalization by merging corresponding low-rank adaptation parameters (LoRAs) through optimization-based methods, which are computationally demanding and unsuitable for real-time use on resource-constrained devices like smartphones. To address this, we introduce LoRA.rar, a method that not only improves image quality but also achieves a remarkable speedup of over 4000× in the merging process. LoRA.rar pre-trains a hypernetwork on a diverse set of content-style LoRA pairs, learning an efficient merging strategy that generalizes to new, unseen content-style pairs, enabling fast, high-quality personalization. Moreover, we identify limitations in existing evaluation metrics for content-style quality and propose a new protocol using multimodal large language models (MLLM) for more accurate assessment. Our method significantly outperforms the current state of the art in both content and style fidelity, as validated by MLLM assessments and human evaluations.
zh
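LoRA.rar 用一次超网络前向预测合并系数、再线性合并内容与风格 LoRA 增量,省去了逐对的优化式合并。合并这一步本身可以示意如下(假设性示例:系数 (0.7, 0.3) 为假设的超网络输出,矩阵为玩具尺寸):

```python
def merge_loras(delta_content, delta_style, coeffs):
    """按(假设由超网络预测的)系数线性合并两个 LoRA 权重增量。"""
    c1, c2 = coeffs
    return [[c1 * a + c2 * b for a, b in zip(ra, rb)]
            for ra, rb in zip(delta_content, delta_style)]

dW_content = [[1.0, 0.0], [0.0, 1.0]]  # 内容 LoRA 的 ΔW
dW_style = [[0.0, 2.0], [2.0, 0.0]]    # 风格 LoRA 的 ΔW
merged = merge_loras(dW_content, dW_style, (0.7, 0.3))
print(merged)  # [[0.7, 0.6], [0.6, 0.7]]
```

相比为每一对内容-风格 LoRA 单独跑一轮优化,这种“预测系数 + 线性合并”只需一次前向计算,这正是摘要中 4000× 加速的来源。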

[CV-24] How to Squeeze An Explanation Out of Your Model

【速读】: 该论文试图解决深度学习模型在执行任务时缺乏解释性的问题,特别是在生物识别、安全和医疗等敏感领域。解决方案的关键在于提出了一种模型无关的解释性方法,基于新颖的挤压和激励(Squeeze and Excitation, SE)块的使用,生成视觉注意力热图。通过在任何模型的分类层之前引入SE块,能够通过SE向量操作检索最具影响力的特征,这是SE块的关键组成部分。该方法不仅适用于图像设置,还能扩展到视频/多模态设置和自定义架构,且不损害模型在原始任务上的性能,显示出在不同数据环境中的鲁棒性。

链接: https://arxiv.org/abs/2412.05134
作者: Tiago Roxo,Joana C. Costa,Pedro R. M. Inácio,Hugo Proença
关键词-EN: Deep learning models, widely used nowadays, reliability in performing, Deep learning, visual attention heatmaps
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Deep learning models are widely used nowadays for their reliability in performing various tasks. However, they do not typically provide the reasoning behind their decision, which is a significant drawback, particularly for more sensitive areas such as biometrics, security and healthcare. The most commonly used approaches to provide interpretability create visual attention heatmaps of regions of interest on an image based on models gradient backpropagation. Although this is a viable approach, current methods are targeted toward image settings and default/standard deep learning models, meaning that they require significant adaptations to work on video/multi-modal settings and custom architectures. This paper proposes an approach for interpretability that is model-agnostic, based on a novel use of the Squeeze and Excitation (SE) block that creates visual attention heatmaps. By including an SE block prior to the classification layer of any model, we are able to retrieve the most influential features via SE vector manipulation, one of the key components of the SE block. Our results show that this new SE-based interpretability can be applied to various models in image and video/multi-modal settings, namely biometrics of facial features with CelebA and behavioral biometrics using Active Speaker Detection datasets. Furthermore, our proposal does not compromise model performance toward the original task, and has competitive results with current interpretability approaches in state-of-the-art object datasets, highlighting its robustness to perform in varying data aside from the biometric context.
zh
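SE 块的 squeeze(通道全局均值)与 excitation(两层线性 + sigmoid),以及“把 excitation 向量直接读作特征重要性”的用法,可以用纯 Python 粗略示意(假设性示例:权重为手工设定,仅演示通道权重如何被解读为重要性):

```python
import math

def se_block(channels, w1, w2):
    """Squeeze-and-Excitation 示意:
    squeeze 取每通道全局均值;excitation 经两层线性(ReLU + sigmoid)
    得到通道权重;该权重向量即可当作各通道的重要性来解释。"""
    squeezed = [sum(ch) / len(ch) for ch in channels]              # squeeze
    hidden = [max(0.0, sum(s * w for s, w in zip(squeezed, col)))  # ReLU
              for col in w1]
    excite = [1.0 / (1.0 + math.exp(-sum(h * w for h, w in zip(hidden, col))))
              for col in w2]                                       # sigmoid
    scaled = [[v * e for v in ch] for ch, e in zip(channels, excite)]
    return excite, scaled

# 2 个通道(展平后的特征图),权重为假设值:通道 0 被增强,通道 1 被抑制
chans = [[1.0, 3.0], [0.1, 0.1]]
w1 = [[1.0, 0.0]]      # 1 个隐藏单元
w2 = [[4.0], [-4.0]]
importance, _ = se_block(chans, w1, w2)
print(importance[0] > importance[1])  # True
```

论文的做法可理解为把这样一个 SE 块插在分类层之前,再把 excitation 向量回投到空间位置上生成注意力热图;上面仅示意其中“通道权重即重要性”这一核心。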

[CV-25] The Silent Prompt: Initial Noise as Implicit Guidance for Goal-Driven Image Generation

【速读】: 该论文试图解决文本到图像合成(Text-to-image synthesis, T2I)中,仅依赖文本提示(text prompt)难以精细控制图像低级视觉属性(如纹理、清晰度、形状和颜色)的问题。解决方案的关键在于揭示并利用噪声(noise)本身作为“无声提示”(silent prompt)的隐含生成倾向,通过引入NoiseQuery策略,从预构建的噪声库中选择最优初始噪声,以实现对图像生成过程的精细控制。这一方法不仅增强了与文本提示的高层次语义对齐,还能在无需额外优化的情况下,对低级视觉属性进行微调,从而提升生成图像的质量和多样性。

链接: https://arxiv.org/abs/2412.05101
作者: Ruoyu Wang,Huayang Huang,Ye Zhu,Olga Russakovsky,Yu Wu
关键词-EN: large-scale diffusion models, advanced remarkably, emergence of large-scale, randomly sampled Gaussian, sampled Gaussian noise
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 18 pages, 18 figures, 6 tables

点击查看摘要

Abstract:Text-to-image synthesis (T2I) has advanced remarkably with the emergence of large-scale diffusion models. In the conventional setup, the text prompt provides explicit, user-defined guidance, directing the generation process by denoising a randomly sampled Gaussian noise. In this work, we reveal that the often-overlooked noise itself encodes inherent generative tendencies, acting as a “silent prompt” that implicitly guides the output. This implicit guidance, embedded in the noise scheduler design of diffusion model formulations and their training stages, generalizes across a wide range of T2I models and backbones. Building on this insight, we introduce NoiseQuery, a novel strategy that selects optimal initial noise from a pre-built noise library to meet diverse user needs. Our approach not only enhances high-level semantic alignment with text prompts, but also allows for nuanced adjustments of low-level visual attributes, such as texture, sharpness, shape, and color, which are typically challenging to control through text alone. Extensive experiments across various models and target attributes demonstrate the strong performance and zero-shot transferability of our approach, requiring no additional optimization.
zh
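NoiseQuery 的核心是“从预构建噪声库中按目标属性检索初始噪声”,可以示意如下(假设性示例:库中每个噪声对某个低级视觉属性的生成倾向分数假定已离线评估好,检索规则简化为分数最接近目标):

```python
def noise_query(library, target_score):
    """从预构建噪声库中挑选初始噪声:
    按预先评估的属性倾向分数与目标分数的差距做最近检索。"""
    return min(library, key=lambda item: abs(item["score"] - target_score))

# 假设的噪声库:分数示意某一低级属性(如色温)的生成倾向
library = [
    {"noise": "seed-17", "score": 0.1},  # 倾向生成偏冷色调
    {"noise": "seed-42", "score": 0.8},  # 倾向生成偏暖色调
    {"noise": "seed-99", "score": 0.5},
]
best = noise_query(library, 0.75)
print(best["noise"])  # seed-42
```

检索到的噪声随后照常送入扩散模型去噪,无需任何额外优化,这与摘要中“零样本迁移、无需额外优化”的说法对应。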

[CV-26] SoPo: Text-to-Motion Generation Using Semi-Online Preference Optimization

【速读】: 该论文试图解决文本到动作生成模型在生成一致且高质量动作时面临的挑战。解决方案的关键在于引入了一种名为半在线偏好优化(Semi-online Preference Optimization, SoPo)的方法,该方法结合了在线和离线偏好优化(DPO)的优势,通过使用“半在线”数据对,即从在线分布中提取的不偏好动作和从离线数据集中提取的偏好动作,来训练文本到动作生成模型。SoPo方法有效地弥补了在线DPO的采样偏差和离线DPO的过拟合问题,从而显著提升了模型的性能,在多个评估指标上超越了现有的最先进模型。

链接: https://arxiv.org/abs/2412.05095
作者: Xiaofeng Tan,Hongsong Wang,Xin Geng,Pan Zhou
关键词-EN: generation is essential, producing consistent, essential for advancing, advancing the creative, creative industry
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Text-to-motion generation is essential for advancing the creative industry but often presents challenges in producing consistent, realistic motions. To address this, we focus on fine-tuning text-to-motion models to consistently favor high-quality, human-preferred motions, a critical yet largely unexplored problem. In this work, we theoretically investigate the DPO under both online and offline settings, and reveal their respective limitations: overfitting in offline DPO, and biased sampling in online DPO. Building on our theoretical insights, we introduce Semi-online Preference Optimization (SoPo), a DPO-based method for training text-to-motion models using “semi-online” data pairs, consisting of unpreferred motion from the online distribution and preferred motion in offline datasets. This method leverages both online and offline DPO, allowing each to compensate for the other’s limitations. Extensive experiments demonstrate that SoPo outperforms other preference alignment methods, with an MM-Dist of 3.25% (vs e.g. 0.76% of MoDiPO) on the MLD model, 2.91% (vs e.g. 0.66% of MoDiPO) on the MDM model, respectively. Additionally, the MLD model fine-tuned by our SoPo surpasses the SoTA model in terms of R-precision and MM Dist. Visualization results also show the efficacy of our SoPo in preference alignment. Our project page is this https URL.
zh
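SoPo 的“半在线”数据对构造(离线偏好动作 + 在线采样的非偏好动作)及其依赖的标准 DPO 损失,可以用如下示意代码说明(假设性示例:reward 打分函数与各 log 概率均为虚构数值):

```python
import math

def dpo_loss(logp_w, logp_l, ref_w, ref_l, beta=0.1):
    """标准 DPO 损失:-log σ(β[(logπ_w - logπ_ref_w) - (logπ_l - logπ_ref_l)])。"""
    margin = beta * ((logp_w - ref_w) - (logp_l - ref_l))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

def semi_online_pairs(offline_preferred, online_samples, reward):
    """“半在线”数据对:偏好动作取自离线数据集,
    非偏好动作取在线采样中 reward 最低者(reward 为假设的打分函数)。"""
    unpreferred = min(online_samples, key=reward)
    return [(w, unpreferred) for w in offline_preferred]

pairs = semi_online_pairs(["walk smoothly"], ["jitter", "stumble"],
                          reward=lambda m: {"jitter": 0.3, "stumble": 0.1}[m])
print(pairs)  # [('walk smoothly', 'stumble')]
loss = dpo_loss(-1.0, -2.0, -1.2, -1.8, beta=0.1)
print(round(loss, 3))  # 0.673
```

偏好侧来自离线数据,避免了在线采样偏差;非偏好侧来自当前模型分布,避免了纯离线的过拟合——这正是两种 DPO 互补的含义。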

[CV-27] Spinal ligaments detection on vertebrae meshes using registration and 3D edge detection

【速读】: 该论文试图解决在构建复杂的脊柱生物力学模型时,准确确定脊柱韧带在三维椎骨模型上的起止点(origin and insertion points)的问题。解决方案的关键在于提出了一种分步方法,能够检测66个脊柱韧带附着点。该方法包括快速椎骨配准(vertebra registration),通过仅提取15个三维点来计算变换矩阵,以及边缘检测(edge detection),以精确地将配准后的韧带投影到任何特定的患者椎骨模型上。该方法在识别椎骨前部标志点方面表现出高精度,平均误差分别为2.24 mm(前纵韧带)和1.26 mm(后纵韧带),并且每个椎骨的标志点检测仅需约3.0秒,显著优于现有方法。

链接: https://arxiv.org/abs/2412.05081
作者: Ivanna Kramer,Lara Blomenkamp,Kevin Weirauch,Sabine Bauer,Dietrich Paulus
关键词-EN: complex biomechanical simulation, bony structure, guide and limit, Spinal ligaments, biomechanical simulation models
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Spinal ligaments are crucial elements in the complex biomechanical simulation models as they transfer forces on the bony structure, guide and limit movements and stabilize the spine. The spinal ligaments encompass seven major groups being responsible for maintaining functional interrelationships among the other spinal components. Determination of the ligament origin and insertion points on the 3D vertebrae models is an essential step in building accurate and complex spine biomechanical models. In our paper, we propose a pipeline that is able to detect 66 spinal ligament attachment points by using a step-wise approach. Our method incorporates a fast vertebra registration that strategically extracts only 15 3D points to compute the transformation, and edge detection for a precise projection of the registered ligaments onto any given patient-specific vertebra model. Our method shows high accuracy, particularly in identifying landmarks on the anterior part of the vertebra with an average distance of 2.24 mm for anterior longitudinal ligament and 1.26 mm for posterior longitudinal ligament landmarks. The landmark detection requires approximately 3.0 seconds per vertebra, providing a substantial improvement over existing methods. Clinical relevance: using the proposed method, the required landmarks that represent origin and insertion points for forces in the biomechanical spine models can be localized automatically in an accurate and time-efficient manner.
zh

[CV-28] Improving analytical color and texture similarity estimation methods for dataset-agnostic person reidentification

【速读】: 该论文试图解决行人重识别(person reidentification, re-id)问题,特别是在边缘设备上的高效实现。解决方案的关键在于结合了人体解析、分析特征提取和相似度估计方法,具体包括:1) 在CIE-Lab颜色空间中使用直方图平滑技术进行颜色分析,以减少噪声;2) 提出了一种预配置的潜在空间(Latent Space, LS)监督自编码器(Supervised Autoencoder, SAE)用于纹理分析,将输入纹理编码为LS点,从而获得更精确的相似度度量;3) 该方法不依赖于训练数据,使其完全独立于特定的re-id数据集。这些关键技术使得该方法在计算资源有限的环境下仍能有效运行,并在Market1501数据集上验证了其可行性,结果与传统深度学习方法相当。

链接: https://arxiv.org/abs/2412.05076
作者: Nikita Gabdullin
关键词-EN: combined person reidentification, analytical feature extraction, similarity estimation schemes, person reidentification, human parsing
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 8 pages, 2 figures, 3 tables, 3 equations

点击查看摘要

Abstract:This paper studies a combined person reidentification (re-id) method that uses human parsing, analytical feature extraction and similarity estimation schemes. One of its prominent features is its low computational requirements so it can be implemented on edge devices. The method allows direct comparison of specific image regions using interpretable features which consist of color and texture channels. It is proposed to analyze and compare colors in CIE-Lab color space using histogram smoothing for noise reduction. A novel pre-configured latent space (LS) supervised autoencoder (SAE) is proposed for texture analysis which encodes input textures as LS points. This allows to obtain more accurate similarity measures compared to simplistic label comparison. The proposed method also does not rely upon photos or other re-id data for training, which makes it completely re-id dataset-agnostic. The viability of the proposed method is verified by computing rank-1, rank-10, and mAP re-id metrics on Market1501 dataset. The results are comparable to those of conventional deep learning methods and the potential ways to further improve the method are discussed.
zh
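论文在 CIE-Lab 空间比较颜色前先对直方图做平滑以抑制噪声,这一步可以示意如下(假设性示例:平滑核与直方图数值均为虚构,相似度用常见的直方图交集计算):

```python
def smooth(hist, kernel=(0.25, 0.5, 0.25)):
    """对颜色直方图做滑动加权平均平滑,抑制噪声(边界按 0 填充)。"""
    n, k = len(hist), len(kernel)
    pad = k // 2
    out = []
    for i in range(n):
        s = 0.0
        for j, w in enumerate(kernel):
            idx = i + j - pad
            if 0 <= idx < n:
                s += w * hist[idx]
        out.append(s)
    return out

def intersection(h1, h2):
    """直方图交集相似度:逐 bin 取最小后求和(直方图需已归一化)。"""
    return sum(min(a, b) for a, b in zip(h1, h2))

# 两个近似的 a* 通道直方图(假设已在 CIE-Lab 下统计并归一化)
h1 = smooth([0.0, 0.6, 0.4, 0.0, 0.0])
h2 = smooth([0.0, 0.5, 0.5, 0.0, 0.0])
print(round(intersection(h1, h2), 3))  # 0.95
```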

[CV-29] LoFi: Vision-Aided Label Generator for Wi-Fi Localization and Tracking

【速读】: 该论文试图解决Wi-Fi定位与跟踪中数据驱动方法面临的训练数据不足问题,特别是由于现有数据收集方法的限制,导致大多数数据集仅提供粗粒度的地面实况(GT)或有限的标签点,从而阻碍了数据驱动方法的发展。解决方案的关键是提出了LoFi,一种基于视觉辅助的标签生成器,它能够仅通过2D图像生成地面实况位置坐标。这一方法不仅降低了数据收集的难度和成本,还促进了数据驱动方法在实际应用中的部署,因为Wi-Fi作为一种低泛化性的模态,在使用相关方法时通常需要使用新收集的数据进行模型微调。

链接: https://arxiv.org/abs/2412.05074
作者: Zijian Zhao,Tingwei Chen,Fanyi Meng,Zhijie Cai,Hang Li,Xiaoyang Li,Guangxu Zhu
关键词-EN: shown immense potential, immense potential due, wide coverage, independence from lighting, lighting conditions
类目: Computer Vision and Pattern Recognition (cs.CV); Signal Processing (eess.SP)
备注:

点击查看摘要

Abstract:Wi-Fi localization and tracking has shown immense potential due to its privacy-friendliness, wide coverage, permeability, independence from lighting conditions, and low cost. Current methods can be broadly categorized as model-based and data-driven approaches, where data-driven methods show better performance and have less requirement for specialized devices, but struggle with limited datasets for training. Due to limitations in current data collection methods, most datasets only provide coarse-grained ground truth (GT) or limited amount of label points, which greatly hinders the development of data-driven methods. Even though lidar can provide accurate GT, their high cost makes them inaccessible to many users. To address these challenges, we propose LoFi, a vision-aided label generator for Wi-Fi localization and tracking, which can generate ground truth position coordinates solely based on 2D images. The easy and quick data collection method also helps data-driven based methods deploy in practice, since Wi-Fi is a low-generalization modality and when using relevant methods, it always requires fine-tuning the model using newly collected data. Based on our method, we also collect a Wi-Fi tracking and localization dataset using ESP32-S3 and a webcam. To facilitate future research, we will make our code and dataset publicly available upon publication.
zh

[CV-30] BimArt: A Unified Approach for the Synthesis of 3D Bimanual Interaction with Articulated Objects

【速读】: 该论文试图解决的是合成3D双手手部与可动对象交互的问题,特别是如何生成多样且真实的双手手部动作以实现对象的移动和关节活动。解决方案的关键在于提出了一种新的生成方法——BimArt,该方法不依赖于参考抓握、粗略的手部轨迹或分离的抓握和关节活动模式。其核心创新包括:1) 基于对象轨迹生成距离相关的接触图谱,并采用关节活动感知特征表示,揭示丰富的双手操作模式;2) 利用学习到的接触先验指导手部运动生成器,从而生成多样且真实的双手动作。这些创新在特征表示和接触先验方面的探索,显著简化了复杂的高维双手手部-对象交互空间的处理,提升了动画质量和多样性。

链接: https://arxiv.org/abs/2412.05066
作者: Wanyue Zhang,Rishabh Dabral,Vladislav Golyanik,Vasileios Choutas,Eduardo Alvarado,Thabo Beeler,Marc Habermann,Christian Theobalt
关键词-EN: present BimArt, approach for synthesizing, generative approach, Abstract, bimanual
类目: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR); Robotics (cs.RO)
备注:

点击查看摘要

Abstract:We present BimArt, a novel generative approach for synthesizing 3D bimanual hand interactions with articulated objects. Unlike prior works, we do not rely on a reference grasp, a coarse hand trajectory, or separate modes for grasping and articulating. To achieve this, we first generate distance-based contact maps conditioned on the object trajectory with an articulation-aware feature representation, revealing rich bimanual patterns for manipulation. The learned contact prior is then used to guide our hand motion generator, producing diverse and realistic bimanual motions for object movement and articulation. Our work offers key insights into feature representation and contact prior for articulated objects, demonstrating their effectiveness in taming the complex, high-dimensional space of bimanual hand-object interactions. Through comprehensive quantitative experiments, we demonstrate a clear step towards simplified and high-quality hand-object animations that excel over the state-of-the-art in motion quality and diversity.
zh

[CV-31] EvTTC: An Event Camera Dataset for Time-to-Collision Estimation

【速读】: 该论文试图解决在极端情况下(如前车速度突变或行人突然出现)基于帧的摄像头在碰撞预警系统中存在的显著系统延迟问题。解决方案的关键在于利用事件相机(Event Cameras)的超高时间分辨率和异步亮度变化报告能力,提出了一种名为EvTTC的多传感器数据集。EvTTC数据集结合了标准摄像头和事件相机的数据,涵盖了日常驾驶中多种潜在碰撞场景,并提供了LiDAR和GNSS/INS测量数据以计算真实TTC(Time-to-Collision)。此外,论文还提供了一个小型TTC测试平台,用于算法验证和数据增强,所有数据和测试平台设计均开源,旨在促进基于视觉的TTC技术的发展。

链接: https://arxiv.org/abs/2412.05053
作者: Kaizhen Sun,Jinghang Li,Kuan Dai,Bangyan Liao,Wei Xiong,Yi Zhou
关键词-EN: Automatic Emergency Braking, Emergency Braking, Automatic Emergency, forward collision warning, estimation lies
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
备注: 8 pages, 7 figures, 5 tables

点击查看摘要

Abstract:Time-to-Collision (TTC) estimation lies in the core of the forward collision warning (FCW) functionality, which is key to all Automatic Emergency Braking (AEB) systems. Although the success of solutions using frame-based cameras (e.g., Mobileye’s solutions) has been witnessed in normal situations, some extreme cases, such as the sudden variation in the relative speed of leading vehicles and the sudden appearance of pedestrians, still pose significant risks that cannot be handled. This is due to the inherent imaging principles of frame-based cameras, where the time interval between adjacent exposures introduces considerable system latency to AEB. Event cameras, as a novel bio-inspired sensor, offer ultra-high temporal resolution and can asynchronously report brightness changes at the microsecond level. To explore the potential of event cameras in the above-mentioned challenging cases, we propose EvTTC, which is, to the best of our knowledge, the first multi-sensor dataset focusing on TTC tasks under high-relative-speed scenarios. EvTTC consists of data collected using standard cameras and event cameras, covering various potential collision scenarios in daily driving and involving multiple collision objects. Additionally, LiDAR and GNSS/INS measurements are provided for the calculation of ground-truth TTC. Considering the high cost of testing TTC algorithms on full-scale mobile platforms, we also provide a small-scale TTC testbed for experimental validation and data augmentation. All the data and the design of the testbed are open sourced, and they can serve as a benchmark that will facilitate the development of vision-based TTC techniques.
zh
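数据集中由 LiDAR 测距计算真实 TTC 的基本关系是 TTC = 距离 / 接近速度,可示意如下(假设性示例,未考虑加速度等高阶项):

```python
def time_to_collision(range_now, range_prev, dt):
    """由相邻两次测距(如 LiDAR)估算 TTC:TTC = 当前距离 / 接近速度。
    接近速度非正(未在靠近)时返回无穷大,表示无碰撞风险。"""
    closing_speed = (range_prev - range_now) / dt
    if closing_speed <= 0:
        return float("inf")
    return range_now / closing_speed

# 前车距离在 0.1 s 内从 20.0 m 缩短到 19.0 m -> 接近速度 10 m/s
print(time_to_collision(19.0, 20.0, 0.1))  # 1.9
```

帧间隔 dt 越大,这一估计对“相对速度突变”类极端场景越迟钝——这正是摘要中事件相机微秒级时间分辨率的价值所在。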

[CV-32] ReF-LDM: A Latent Diffusion Model for Reference-based Face Image Restoration NEURIPS2024

【速读】: 该论文试图解决盲人脸图像恢复(blind face image restoration)中生成内容可能不准确反映真实人物外貌的问题。解决方案的关键在于引入高质量(HQ)的个人参考图像作为额外输入,并提出了一种名为ReF-LDM的模型,该模型是基于潜在扩散模型(Latent Diffusion Model, LDM)的改进版本。ReF-LDM通过集成CacheKV机制,在生成过程中有效利用参考图像,并设计了时间步长缩放的身份损失(timestep-scaled identity loss),以增强模型对人脸识别特征的学习能力。此外,论文还构建了FFHQ-Ref数据集,包含20,405张高质量人脸图像及其对应的参考图像,为基于参考的面部恢复模型提供了训练和评估数据。

链接: https://arxiv.org/abs/2412.05043
作者: Chi-Wei Hsiao,Yu-Lun Liu,Cheng-Kun Yang,Sheng-Po Kuo,Kevin Jou,Chia-Ping Chen
关键词-EN: successfully produced impressive, produced impressive high-quality, details from low-quality, blind face image, works on blind
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Image and Video Processing (eess.IV)
备注: NeurIPS 2024, project page this https URL

点击查看摘要

Abstract:While recent works on blind face image restoration have successfully produced impressive high-quality (HQ) images with abundant details from low-quality (LQ) input images, the generated content may not accurately reflect the real appearance of a person. To address this problem, incorporating well-shot personal images as additional reference inputs could be a promising strategy. Inspired by the recent success of the Latent Diffusion Model (LDM), we propose ReF-LDM, an adaptation of LDM designed to generate HQ face images conditioned on one LQ image and multiple HQ reference images. Our model integrates an effective and efficient mechanism, CacheKV, to leverage the reference images during the generation process. Additionally, we design a timestep-scaled identity loss, enabling our LDM-based model to focus on learning the discriminating features of human faces. Lastly, we construct FFHQ-Ref, a dataset consisting of 20,405 high-quality (HQ) face images with corresponding reference images, which can serve as both training and evaluation data for reference-based face restoration models.
zh

[CV-33] Improving Post-Earthquake Crack Detection using Semi-Synthetic Generated Images ECCV2024

【速读】: 该论文试图解决地震后快速评估受影响区域安全性的问题,特别是在缺乏大规模标注数据集的情况下,开发基于计算机视觉和深度学习的损伤检测系统所面临的挑战。解决方案的关键在于引入一种生成半合成图像的技术,用于数据增强,特别是生成裂缝图像,这是常见的损伤形式。核心方法是通过参数化元标注来指导在真实世界结构的3D模型上生成裂缝,这些元标注的参数可以迭代调整,以生成最优化提升检测器性能的图像。实验结果表明,结合真实图像和半合成图像训练的裂缝检测系统,其性能优于仅使用真实图像训练的系统。

链接: https://arxiv.org/abs/2412.05042
作者: Piercarlo Dondi,Alessio Gullotti,Michele Inchingolo,Ilaria Senaldi,Chiara Casarotti,Luca Lombardi,Marco Piastra
关键词-EN: impacted areas, vital to quickly, quickly evaluate, evaluate the safety, detection system
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Accepted at ECCV2024 Workshop: SyntheticData4CV 2024

点击查看摘要

Abstract:Following an earthquake, it is vital to quickly evaluate the safety of the impacted areas. Damage detection systems, powered by computer vision and deep learning, can assist experts in this endeavor. However, the lack of extensive, labeled datasets poses a challenge to the development of these systems. In this study, we introduce a technique for generating semi-synthetic images to be used as data augmentation during the training of a damage detection system. We specifically aim to generate images of cracks, which are a prevalent and indicative form of damage. The central concept is to employ parametric meta-annotations to guide the process of generating cracks on 3D models of real-word structures. The governing parameters of these meta-annotations can be adjusted iteratively to yield images that are optimally suited for improving detectors’ performance. Comparative evaluations demonstrated that a crack detection system trained with a combination of real and semi-synthetic images outperforms a system trained on real images alone.
zh

[CV-34] SAMCL: Empowering SAM to Continually Learn from Dynamic Domains

【速读】: 该论文试图解决在开放世界中,Segment Anything Model (SAM) 在跨多样且动态领域进行对象分割时遇到的困难。具体问题包括在持续分割 (Continual Segmentation, CS) 过程中,如何在保持先前领域稳定性 (stability) 的同时,有效适应新领域 (plasticity) 的挑战,以及如何高效利用 SAM 的图像和提示 (prompts) 特征。解决方案的关键在于提出了一种名为 SAMCL 的新型 CS 方法,该方法通过两个核心组件实现:AugModule 和 Module Selector。AugModule 负责在每个领域中高效学习图像与提示之间的关系,而 Module Selector 则在测试时根据 SAM 的内在能力选择合适的模块,从而实现跨不同领域的任务无关方法。实验结果表明,SAMCL 在减少遗忘和提升对未见领域的适应性方面显著优于现有方法。

链接: https://arxiv.org/abs/2412.05012
作者: Zeqing Wang,Kangye Ji,Di Wang,Fei Cheng
关键词-EN: Segment Anything Model, struggles with segmenting, open world, segmenting objects, domains
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 14 pages, 11 figures

点击查看摘要

Abstract:Segment Anything Model (SAM) struggles with segmenting objects in the open world, especially across diverse and dynamic domains. Continual segmentation (CS) is a potential technique to solve this issue, but a significant obstacle is the intractable balance between previous domains (stability) and new domains (plasticity) during CS. Furthermore, how to utilize two kinds of features of SAM, images and prompts, in an efficient and effective CS manner remains a significant hurdle. In this work, we propose a novel CS method, termed SAMCL, to address these challenges. It is the first study to empower SAM with the CS ability across dynamic domains. SAMCL decouples stability and plasticity during CS by two components: AugModule and Module Selector. Specifically, SAMCL leverages an individual AugModule to effectively and efficiently learn new relationships between images and prompts in each domain. Module Selector selects the appropriate module during testing, based on the inherent ability of SAM to distinguish between different domains. These two components enable SAMCL to realize a task-agnostic method without any interference across different domains. Experimental results demonstrate that SAMCL outperforms state-of-the-art methods, achieving an exceptionally low average forgetting of just 0.5%, along with at least a 2.5% improvement in transferring to unseen domains. Moreover, the tunable parameter consumption in AugModule is about 0.236 MB, marking at least a 23.3% reduction compared to other fine-tuning methods.
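Module Selector 的核心思想可以理解为"为测试样本挑选最匹配的领域模块"。原文未给出具体实现,下面是一个假设性的最小草图(以领域原型向量的最近邻选择作为示意):

```python
import numpy as np

def select_module(image_emb, domain_prototypes):
    """Pick the domain-specific module whose prototype embedding is
    nearest to the test image embedding (hypothetical sketch)."""
    dists = [np.linalg.norm(image_emb - p) for p in domain_prototypes]
    return int(np.argmin(dists))

# Two toy domains; the test embedding lies close to domain 1's prototype.
protos = [np.array([0.0, 0.0]), np.array([10.0, 10.0])]
idx = select_module(np.array([9.0, 11.0]), protos)
print(idx)  # 1
```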
zh

[CV-35] Backdooring Outlier Detection Methods: A Novel Attack Approach

【速读】: 该论文试图解决现有后门攻击主要针对分类器在封闭集(closed-set)性能的问题,而忽视了分类器在开放集(open-set)性能,即异常检测(outlier detection)中的威胁。解决方案的关键在于提出了一种新型后门攻击方法,称为BATOD(Backdoor Attack targeting the Outlier Detection task)。BATOD通过设计两种触发器(triggers),分别将内样本(inliers)转化为异常样本(outliers)和将异常样本转化为内样本,从而混淆封闭集与开放集之间的决策边界。实验结果表明,BATOD在降低分类器开放集性能方面优于以往的攻击方法,即使在应用防御措施后仍表现出色。

链接: https://arxiv.org/abs/2412.05010
作者: ZeinabSadat Taghavi,Hossein Mirzaei
关键词-EN: outlier detection, open-set performance, primarily focused, closed-set performance, performance
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:There have been several efforts in backdoor attacks, but these have primarily focused on the closed-set performance of classifiers (i.e., classification). This has left a gap in addressing the threat to classifiers’ open-set performance, referred to as outlier detection in the literature. Reliable outlier detection is crucial for deploying classifiers in critical real-world applications such as autonomous driving and medical image analysis. First, we show that existing backdoor attacks fall short in affecting the open-set performance of classifiers, as they have been specifically designed to confuse intra-closed-set decision boundaries. In contrast, an effective backdoor attack for outlier detection needs to confuse the decision boundary between the closed and open sets. Motivated by this, in this study, we propose BATOD, a novel Backdoor Attack targeting the Outlier Detection task. Specifically, we design two categories of triggers to shift inlier samples to outliers and vice versa. We evaluate BATOD using various real-world datasets and demonstrate its superior ability to degrade the open-set performance of classifiers compared to previous attacks, both before and after applying defenses.
zh

[CV-36] SLayR: Scene Layout Generation with Rectified Flow

【速读】: 该论文试图解决现有文本到图像生成模型缺乏对生成过程细粒度控制的问题。解决方案的关键在于引入了一种基于Transformer的校正流模型(Rectified flow model),用于在标记空间中生成场景布局,这些布局可以解码为边界框和相应的标签,然后通过现有模型转换为图像。该方法不仅在生成布局的合理性和多样性方面表现优异,而且在模型参数数量和运行速度上显著优于基线模型,同时提供了一个可解释和可编辑的中间表示,增强了整个文本到图像生成管道的灵活性和可控性。

链接: https://arxiv.org/abs/2412.05003
作者: Cameron Braunstein,Hevra Petekkaya,Jan Eric Lenssen,Mariya Toneva,Eddy Ilg
关键词-EN: Scene Layout Generation, Layout Generation, Rectified flow, Scene Layout, rectified flow model
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 34 pages, 29 figures, 5 tables

点击查看摘要

Abstract:We introduce SLayR, Scene Layout Generation with Rectified flow. State-of-the-art text-to-image models achieve impressive results. However, they generate images end-to-end, exposing no fine-grained control over the process. SLayR presents a novel transformer-based rectified flow model for layout generation over a token space that can be decoded into bounding boxes and corresponding labels, which can then be transformed into images using existing models. We show that established metrics for generated images are inconclusive for evaluating their underlying scene layout, and introduce a new benchmark suite, including a carefully designed repeatable human-evaluation procedure that assesses the plausibility and variety of generated layouts. In contrast to previous works, which perform well in either high variety or plausibility, we show that our approach performs well on both of these axes at the same time. It is also at least 5x smaller in the number of parameters and 37% faster than the baselines. Our complete text-to-image pipeline demonstrates the added benefits of an interpretable and editable intermediate representation.
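校正流 (rectified flow) 的采样过程本质上是对学习到的速度场做常微分方程积分。下面用一个人为构造的玩具速度场示意欧拉采样(该速度场仅代替 SLayR 中由 Transformer 学习的速度,用于说明积分过程,并非论文实现):

```python
import numpy as np

def euler_sample(v_field, x0, steps=100):
    """Integrate dx/dt = v(x, t) from t=0 to t=1 with Euler steps."""
    x, dt = x0.copy(), 1.0 / steps
    for i in range(steps):
        t = i * dt
        x = x + dt * v_field(x, t)
    return x

# Toy velocity field transporting every point toward a fixed "layout" target.
target = np.array([1.0, 1.0])
v = lambda x, t: (target - x) / max(1.0 - t, 1e-6)
x1 = euler_sample(v, np.zeros(2))
print(x1)  # ~ [1. 1.]
```

在 SLayR 中,积分终点不是这里的二维点,而是可解码为边界框与标签的布局 token。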
zh

[CV-37] ETLNet: An Efficient TCN-BiLSTM Network for Road Anomaly Detection Using Smartphone Sensors ICPR2024

【速读】: 该论文试图解决道路异常(road anomalies)的自动检测问题,特别是针对不同光照条件下的速度带(speed bumps)和坑洞(potholes)的检测。解决方案的关键在于引入了一种名为增强时间-双向长短期记忆网络(Enhanced Temporal-BiLSTM Network, ETLNet)的新方法,该方法结合了时间卷积网络(Temporal Convolutional Network, TCN)层和双向长短期记忆(Bidirectional Long Short-Term Memory, BiLSTM)层。ETLNet模型不依赖视觉信息,而是利用智能手机中的惯性传感器(如加速度计和陀螺仪)数据来检测道路异常,从而在各种光照条件下都能有效工作。实证评估显示,ETLNet模型在检测速度带时保持了99.3%的F1分数,显著提升了自动化道路表面监测技术的鲁棒性和效率。

链接: https://arxiv.org/abs/2412.04990
作者: Mohd Faiz Ansari,Rakshit Sandilya,Mohammed Javed,David Doermann
关键词-EN: Road, Temporal Convolutional Network, Abstract, anomalies, ETLNet
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Presented in ICPR 2024, Kolkata, December 1-5, 2024 (First Workshop on Intelligent Mobility in Unstructured Environments)

点击查看摘要

Abstract:Road anomalies can be defined as irregularities on the road surface or in the surface itself. Some may be intentional (such as speedbumps), accidental (such as materials falling off a truck), or the result of roads’ excessive use or low or no maintenance, such as potholes. Despite their varying origins, these irregularities often harm vehicles substantially. Speed bumps are intentionally placed for safety but are dangerous due to their non-standard shape, size, and lack of proper markings. Potholes are unintentional and can also cause severe damage. To address the detection of these anomalies, we need an automated road monitoring system. Today, various systems exist that use visual information to track these anomalies. Still, due to poor lighting conditions and improper or missing markings, they may go undetected and have severe consequences for public transport, automated vehicles, etc. In this paper, the Enhanced Temporal-BiLSTM Network (ETLNet) is introduced as a novel approach that integrates two Temporal Convolutional Network (TCN) layers with a Bidirectional Long Short-Term Memory (BiLSTM) layer. This combination is tailored to detect anomalies effectively irrespective of lighting conditions, as it depends not on visuals but on smartphone inertial sensor data. Our methodology employs accelerometer and gyroscope sensors, typically found in smartphones, to gather data on road conditions. Empirical evaluations demonstrate that the ETLNet model maintains an F1-score of 99.3% for detecting speed bumps. The ETLNet model’s robustness and efficiency significantly advance automated road surface monitoring technologies.
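ETLNet 中的 TCN 层由膨胀因果一维卷积构成:时刻 t 的输出只依赖 t 及之前的传感器读数,因此适合流式的加速度计/陀螺仪序列。下面用 numpy 写一个单通道的最小示意(仅演示膨胀因果卷积本身,并非论文的网络实现):

```python
import numpy as np

def causal_dilated_conv1d(x, kernel, dilation=1):
    """Dilated causal 1-D convolution: output at t only sees x[<= t]."""
    k = len(kernel)
    pad = dilation * (k - 1)          # left-pad so no future leakage
    xp = np.concatenate([np.zeros(pad), x])
    return np.array([
        sum(kernel[i] * xp[t + pad - i * dilation] for i in range(k))
        for t in range(len(x))
    ])

x = np.array([1.0, 2.0, 3.0, 4.0])
# identity kernel [1] leaves the signal unchanged
print(causal_dilated_conv1d(x, np.array([1.0])))           # [1. 2. 3. 4.]
# difference kernel with dilation 2: y[t] = x[t] - x[t-2]
print(causal_dilated_conv1d(x, np.array([1.0, -1.0]), 2))  # [1. 2. 2. 2.]
```

堆叠多层、指数增大 dilation,即可用很少的层数覆盖较长的时间感受野,这正是 TCN 相对普通卷积的优势。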
zh

[CV-38] Power Plant Detection for Energy Estimation using GIS with Remote Sensing CNN Vision Transformers

【速读】: 该论文旨在解决电力厂检测问题,以辅助能源估算应用。解决方案的关键在于提出了一种混合模型,通过将地理信息系统(GIS)与遥感能力、卷积神经网络(CNN)和视觉变换器(ViT)相结合,实现了多数据类型在同一地图上的实时分析。该混合模型利用GIS进行数据集成和可视化,CNN进行特征提取,ViT捕捉长距离依赖关系,从而提升了分类性能,有助于电力厂的监控和运营管理,进而支持能源估算和可持续能源规划。

链接: https://arxiv.org/abs/2412.04986
作者: Blessing Austin-Gabriel,Cristian Noriega Monsalve,Aparna S. Varde
关键词-EN: Geographical Information Systems, Convolutional Neural Networks, Remote Sensing capabilities, Vision Transformers, Geographical Information
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:In this research, we propose a hybrid model for power plant detection to assist energy estimation applications, by pipelining GIS (Geographical Information Systems) with Remote Sensing capabilities, CNNs (Convolutional Neural Networks), and ViTs (Vision Transformers). Our proposed approach enables real-time analysis with multiple data types on a common map via the GIS, entails feature-extraction abilities due to the CNN, and captures long-range dependencies through the ViT. This hybrid approach is found to enhance classification, thus helping in the monitoring and operational management of power plants; hence assisting energy estimation and sustainable energy planning in the future. It exemplifies the effective deployment of machine learning methods in conjunction with domain-specific approaches to enhance performance.
zh

[CV-39] MixedGaussianAvatar: Realistically and Geometrically Accurate Head Avatar via Mixed 2D-3D Gaussian Splatting

【速读】: 该论文试图解决高保真3D头部虚拟形象重建中的训练和渲染效率问题,同时提升几何精度。解决方案的关键在于提出了一种名为MixedGaussianAvatar的新方法,该方法结合了2D高斯分布(2D Gaussians)和3D高斯分布(3D Gaussians)的优势。具体来说,利用2D高斯分布来重建3D头部的表面,以确保几何精度,同时将这些2D高斯分布附加到FLAME模型的三角网格上,并在2D高斯分布渲染质量不足的地方连接额外的3D高斯分布,形成混合的2D-3D高斯表示。这种混合表示可以通过FLAME参数进行动画化,并通过渐进式训练策略进行优化,先训练2D高斯分布,然后对混合的2D-3D高斯分布进行微调。

链接: https://arxiv.org/abs/2412.04955
作者: Peng Chen,Xiaobao Wei,Qingpo Wuwu,Xinyi Wang,Xingyu Xiao,Ming Lu
关键词-EN: Neural Radiance Fields, Reconstructing high-fidelity, Gaussians, virtual reality, Radiance Fields
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project: this https URL

点击查看摘要

Abstract:Reconstructing high-fidelity 3D head avatars is crucial in various applications such as virtual reality. The pioneering methods reconstruct realistic head avatars with Neural Radiance Fields (NeRF), which have been limited by training and rendering speed. Recent methods based on 3D Gaussian Splatting (3DGS) significantly improve the efficiency of training and rendering. However, the surface inconsistency of 3DGS results in subpar geometric accuracy; later, 2DGS uses 2D surfels to enhance geometric accuracy at the expense of rendering fidelity. To leverage the benefits of both 2DGS and 3DGS, we propose a novel method named MixedGaussianAvatar for realistically and geometrically accurate head avatar reconstruction. Our main idea is to utilize 2D Gaussians to reconstruct the surface of the 3D head, ensuring geometric accuracy. We attach the 2D Gaussians to the triangular mesh of the FLAME model and connect additional 3D Gaussians to those 2D Gaussians where the rendering quality of 2DGS is inadequate, creating a mixed 2D-3D Gaussian representation. These 2D-3D Gaussians can then be animated using FLAME parameters. We further introduce a progressive training strategy that first trains the 2D Gaussians and then fine-tunes the mixed 2D-3D Gaussians. We demonstrate the superiority of MixedGaussianAvatar through comprehensive experiments. The code will be released at: this https URL.
zh

[CV-40] HOLa: HoloLens Object Labeling

【速读】: 该论文试图解决医学增强现实 (Augmented Reality, AR) 应用中物体跟踪的关键挑战,即需要大量标注掩码的问题。解决方案的关键在于引入基于Segment Anything Model (SAM) 的SAM-Track算法,开发了名为HoloLens-Object-Labeling (HOLa) 的Unity和Python应用程序。HOLa实现了对HoloLens 2的全自动单物体标注,仅需极少的人工参与,且无需针对特定图像外观进行调整,从而显著减轻了AR研究在任何应用领域的负担。通过在开放肝脏手术和医学模型实验中的评估,HOLa显著提高了标注速度(超过500倍),并提供了与人工标注者相当的Dice分数(0.875至0.982)。

链接: https://arxiv.org/abs/2412.04945
作者: Michael Schwimmbeck,Serouj Khajarian,Konstantin Holzapfel,Johannes Schmidt,Stefanie Remmele
关键词-EN: medical Augmented Reality, Augmented Reality, medical Augmented, key challenge, significant amount
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: accepted by BMT 2024

点击查看摘要

Abstract:In the context of medical Augmented Reality (AR) applications, object tracking is a key challenge and requires a significant amount of annotation masks. As segmentation foundation models like the Segment Anything Model (SAM) begin to emerge, zero-shot segmentation requires only minimal human participation to obtain high-quality object masks. We introduce a HoloLens-Object-Labeling (HOLa) Unity and Python application based on the SAM-Track algorithm that offers fully automatic single object annotation for HoloLens 2 while requiring minimal human participation. HOLa does not have to be adjusted to a specific image appearance and could thus ease AR research in any application field. We evaluate HOLa for different degrees of image complexity in open liver surgery and in medical phantom experiments. Using HOLa for image annotation can increase the labeling speed by more than 500 times while providing Dice scores between 0.875 and 0.982, which are comparable to human annotators. Our code is publicly available at: this https URL
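文中用 Dice 分数衡量 HOLa 自动标注与人工标注的一致性。Dice 系数的定义与计算非常直接,示意如下:

```python
import numpy as np

def dice_score(mask_a, mask_b):
    """Dice coefficient between two binary masks: 2|A∩B| / (|A| + |B|)."""
    a, b = mask_a.astype(bool), mask_b.astype(bool)
    denom = a.sum() + b.sum()
    if denom == 0:
        return 1.0  # both masks empty: perfect agreement by convention
    return 2.0 * np.logical_and(a, b).sum() / denom

a = np.array([[1, 1], [0, 0]])
b = np.array([[1, 0], [0, 0]])
print(dice_score(a, b))  # 2*1 / (2+1) = 0.666...
```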
zh

[CV-41] Verb Mirage: Unveiling and Assessing Verb Concept Hallucinations in Multimodal Large Language Models

【速读】: 该论文试图解决多模态大语言模型 (Multimodal Large Language Models, MLLMs) 中存在的动词幻觉 (verb hallucination) 问题。尽管已有多种方法用于缓解对象/名词相关概念的幻觉,但动词概念在理解人类行为中的重要性却被忽视。论文首次从多个角度研究了MLLMs中的动词幻觉现象,并发现大多数最先进的MLLMs都存在严重的动词幻觉。为解决这一问题,论文提出了一种基于丰富动词知识的调优方法,实验结果表明该方法显著减少了与动词相关的幻觉。解决方案的关键在于利用丰富的动词知识进行模型调优,以有效缓解动词幻觉问题。

链接: https://arxiv.org/abs/2412.04939
作者: Zehao Wang,Xinpeng Liu,Xiaoqian Wu,Yudonglin Zhang,Zhou Fang,Yifan Fang,Junfu Pu,Cewu Lu,Yong-Lu Li
关键词-EN: Multimodal Large Language, Large Language Models, Multimodal Large, Language Models, Large Language
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Multimodal Large Language Models (MLLMs) have garnered significant attention recently and demonstrate outstanding capabilities in various tasks such as OCR, VQA, captioning, etc. However, hallucination remains a persistent issue. While numerous methods have been proposed to mitigate hallucinations, achieving notable improvements, these methods primarily focus on mitigating hallucinations about object/noun-related concepts. Verb concepts, crucial for understanding human actions, have been largely overlooked. In this paper, to the best of our knowledge, we are the first to investigate the verb hallucination phenomenon of MLLMs from various perspectives. Our findings reveal that most state-of-the-art MLLMs suffer from severe verb hallucination. To assess the effectiveness of existing mitigation methods for object concept hallucination on verb hallucination, we evaluated these methods and found that they do not effectively address verb hallucination. To address this issue, we propose a novel rich verb knowledge-based tuning method to mitigate verb hallucination. The experiment results demonstrate that our method significantly reduces hallucinations related to verbs. Our code and data will be made publicly available.
zh

[CV-42] DEYOLO: Dual-Feature-Enhancement YOLO for Cross-Modality Object Detection

【速读】: 该论文试图解决在光照不足环境下物体检测的挑战,特别是在RGB图像中物体不清晰可见的情况下。解决方案的关键在于设计了一种基于双增强的跨模态物体检测网络DEYOLO,通过融合RGB和红外(RGB-IR)图像来提升检测能力。具体来说,论文提出了语义-空间跨模态和新的双向解耦聚焦模块,以实现RGB-IR图像的检测中心互增强。核心创新包括:双语义增强通道权重分配模块(DECA)和双空间增强像素权重分配模块(DEPA),用于在特征空间中聚合跨模态信息,从而提高特征表示能力;以及一种双增强机制,包括两种模态融合和单模态增强,以减少两种图像模态之间的干扰。此外,论文还开发了一种新的双向解耦聚焦模块,以扩大骨干网络在不同方向上的感受野,从而提高DEYOLO的表示质量。实验结果表明,该方法在M3FD和LLVIP数据集上显著优于现有的最先进物体检测算法。

链接: https://arxiv.org/abs/2412.04931
作者: Yishuo Chen,Boran Wang,Xinyu Guo,Wenbin Zhu,Jiasheng He,Xiaobin Liu,Jing Yuan
关键词-EN: RGB images, complements RGB images, Object detection, poor-illumination environments, infrared images
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Object detection in poor-illumination environments is a challenging task as objects are usually not clearly visible in RGB images. As infrared images provide additional clear edge information that complements RGB images, fusing RGB and infrared images has potential to enhance the detection ability in poor-illumination environments. However, existing works involving both visible and infrared images only focus on image fusion, instead of object detection. Moreover, they directly fuse the two kinds of image modalities, which ignores the mutual interference between them. To fuse the two modalities to maximize the advantages of cross-modality, we design a dual-enhancement-based cross-modality object detection network DEYOLO, in which semantic-spatial cross modality and novel bi-directional decoupled focus modules are designed to achieve the detection-centered mutual enhancement of RGB-infrared (RGB-IR). Specifically, a dual semantic enhancing channel weight assignment module (DECA) and a dual spatial enhancing pixel weight assignment module (DEPA) are first proposed to aggregate cross-modality information in the feature space to improve the feature representation ability, such that feature fusion can aim at the object detection task. Meanwhile, a dual-enhancement mechanism, including enhancements for two-modality fusion and single modality, is designed in both DECA and DEPA to reduce interference between the two kinds of image modalities. Then, a novel bi-directional decoupled focus is developed to enlarge the receptive field of the backbone network in different directions, which improves the representation quality of DEYOLO. Extensive experiments on M3FD and LLVIP show that our approach outperforms SOTA object detection algorithms by a clear margin. Our code is available at this https URL.
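DECA 的"双语义增强通道加权"大意是:用一种模态的全局上下文为另一种模态的特征通道生成权重,使两路特征互相校准。以下是一个假设性的玩具示意(并非论文模块的实际结构):

```python
import numpy as np

def dual_channel_weighting(rgb_feat, ir_feat):
    """Toy cross-modality channel re-weighting in the spirit of DECA:
    each modality's channels are re-scaled by weights derived from the
    global context of the *other* modality (illustrative only).
    Features have shape (C, H, W)."""
    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))
    # global average pooling over spatial dims -> per-channel context (C,)
    rgb_ctx = rgb_feat.mean(axis=(1, 2))
    ir_ctx = ir_feat.mean(axis=(1, 2))
    rgb_out = rgb_feat * sigmoid(ir_ctx)[:, None, None]
    ir_out = ir_feat * sigmoid(rgb_ctx)[:, None, None]
    return rgb_out, ir_out

rgb = np.ones((4, 2, 2))
ir = np.zeros((4, 2, 2))
rgb_w, ir_w = dual_channel_weighting(rgb, ir)
```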
zh

[CV-43] Video Decomposition Prior: A Methodology to Decompose Videos into Layers WWW ICLR

【速读】: 该论文试图解决视频增强和编辑任务中深度学习技术对大规模标注数据集的依赖问题,特别是在视频去雾和重照明等任务中,获取高质量的输入和真实序列对数据集的挑战。解决方案的关键在于引入了一种新颖的视频分解先验框架 (VDP),该框架受专业视频编辑实践的启发,不依赖于特定任务的外部数据集,而是利用输入视频的运动和外观信息。VDP 框架将视频序列分解为多个RGB层和相应的透明度层,并通过单独操作这些层来实现视频对象分割、去雾和重照明等任务。特别是,论文提出了一种新的对数视频分解公式,用于视频重照明任务,显著提升了现有方法的性能。

链接: https://arxiv.org/abs/2412.04930
作者: Gaurav Shrivastava,Ser-Nam Lim,Abhinav Shrivastava
关键词-EN: deep learning techniques, truth sequence pairs, ground truth sequence, ground truth, optimal performance
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: Project Page - this https URL for video results. Extended version of ICLR publication

点击查看摘要

Abstract:In the evolving landscape of video enhancement and editing methodologies, a majority of deep learning techniques often rely on extensive datasets of observed input and ground truth sequence pairs for optimal performance. Such reliance often falters when acquiring data becomes challenging, especially in tasks like video dehazing and relighting, where replicating identical motions and camera angles in both corrupted and ground truth sequences is complicated. Moreover, these conventional methodologies perform best when the test distribution closely mirrors the training distribution. Recognizing these challenges, this paper introduces a novel video decomposition prior (VDP) framework which derives inspiration from professional video editing practices. Our methodology does not mandate task-specific external data corpus collection, and instead pivots to utilizing the motion and appearance of the input video. The VDP framework decomposes a video sequence into a set of multiple RGB layers and associated opacity levels. These layers are then manipulated individually to obtain the desired results. We address tasks such as video object segmentation, dehazing, and relighting. Moreover, we introduce a novel logarithmic video decomposition formulation for video relighting tasks, setting a new benchmark over the existing methodologies. We observe the property of relighting emerge as we optimize for our novel relighting decomposition formulation. We evaluate our approach on standard video datasets like DAVIS, REVIDE, and SDSD and show qualitative results on a diverse array of internet videos. Project Page - this https URL for video results.
zh

[CV-44] Continuous Video Process: Modeling Videos as Continuous Multi-Dimensional Processes for Video Prediction WWW CVPR

【速读】: 该论文试图解决扩散模型在视频预测任务中的局限性,主要问题在于这些模型将视频视为一系列独立的图像,缺乏时间一致性。解决方案的关键在于引入了一种新的模型类别,将视频视为连续的多维过程而非离散帧序列。这种方法不仅提高了时间一致性,还通过减少75%的采样步骤,显著提高了推理效率。实验结果表明,该方法在多个基准数据集(如KTH、BAIR、Human3.6M和UCF101)上达到了最先进的视频预测性能。

链接: https://arxiv.org/abs/2412.04929
作者: Gaurav Shrivastava,Abhinav Shrivastava
关键词-EN: made significant strides, unconditional image synthesis, text-image translation, Diffusion models, mastering tasks
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Machine Learning (stat.ML)
备注: Navigate to the project page this https URL for video results. Extended version of published CVPR paper

点击查看摘要

Abstract:Diffusion models have made significant strides in image generation, mastering tasks such as unconditional image synthesis, text-image translation, and image-to-image conversions. However, their capability falls short in the realm of video prediction, mainly because they treat videos as a collection of independent images, relying on external constraints such as temporal attention mechanisms to enforce temporal coherence. In our paper, we introduce a novel model class, that treats video as a continuous multi-dimensional process rather than a series of discrete frames. We also report a 75% reduction in the number of sampling steps required to sample a new frame, making our framework more efficient at inference time. Through extensive experimentation, we establish state-of-the-art performance in video prediction, validated on benchmark datasets including KTH, BAIR, Human3.6M, and UCF101. Navigate to the project page this https URL for video results.
zh

[CV-45] S3: Synonymous Semantic Space for Improving Zero-Shot Generalization of Vision-Language Models

【速读】: 该论文试图解决视觉-语言模型(如CLIP)在零样本泛化能力上的语义错位问题,特别是在下游任务中,由于自然语言处理中的词汇变异,同一类图像可能被描述为显著不同的文本概念,从而严重影响零样本泛化能力。解决方案的关键在于提出了同义语义空间 (Synonymous Semantic Space, S^3) 方法,通过使用大型语言模型生成每个类别的多个同义概念,并基于生成的同义概念构建一个连续且紧凑的同义语义空间。该方法不仅考虑了单一文本概念,还通过点-空间度量和局部中心点度量来计算图像嵌入与每个类别的同义语义空间之间的相似性,从而实现更稳定的语义对齐和更有效的零样本预测。

链接: https://arxiv.org/abs/2412.04925
作者: Xiaojie Yin,Qilong Wang,Bing Cao,Qinghua Hu
关键词-EN: zero-shot generalization ability, zero-shot generalization, downstream tasks, generalization of CLIP, ability of vision-language
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Recently, many studies have been conducted to enhance the zero-shot generalization ability of vision-language models (e.g., CLIP) by addressing the semantic misalignment between image and text embeddings in downstream tasks. Although many efforts have been made, existing methods barely consider the fact that a class of images can be described by notably different textual concepts due to well-known lexical variation in natural language processing, which heavily affects the zero-shot generalization of CLIP. Therefore, this paper proposes a Synonymous Semantic Space (S^3) for each image class, rather than relying on a single textual concept, achieving more stable semantic alignment and improving the zero-shot generalization of CLIP. Specifically, our S^3 method first generates several synonymous concepts based on the label of each class by using large language models, and constructs a continuous yet compact synonymous semantic space based on the Vietoris-Rips complex of the generated synonymous concepts. Furthermore, we explore the effect of several point-to-space metrics on our S^3, while presenting a point-to-local-center metric to compute similarity between image embeddings and the synonymous semantic space of each class, accomplishing effective zero-shot predictions. Extensive experiments are conducted across 17 benchmarks, including fine-grained zero-shot classification, natural distribution zero-shot classification, and open-vocabulary segmentation, and the results show that our S^3 outperforms state-of-the-art methods.
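"点到局部中心"度量的直观含义是:不把图像嵌入与单个文本概念比较,而是与其最近的若干同义概念的中心比较。下面是对这一思路的一种假设性读法(玩具向量代替真实的 CLIP 嵌入,并非论文的精确度量定义):

```python
import numpy as np

def normalize(v):
    return v / np.linalg.norm(v, axis=-1, keepdims=True)

def point_to_local_center_score(img_emb, synonym_embs, k=2):
    """Cosine similarity between an image embedding and the centroid of
    its k nearest synonym embeddings (a hypothetical reading of the
    point-to-local-center idea)."""
    syn = normalize(synonym_embs)
    img = normalize(img_emb)
    sims = syn @ img
    nearest = syn[np.argsort(sims)[-k:]]      # k closest synonym concepts
    center = normalize(nearest.mean(axis=0))  # local center of the space
    return float(center @ img)

# Toy class described by three synonym embeddings; the image embedding
# is close to one of them, so the score should be positive.
rng = np.random.default_rng(0)
dog_syns = rng.normal(size=(3, 8))
img = dog_syns[0] + 0.1 * rng.normal(size=8)
score = point_to_local_center_score(img, dog_syns)
```

零样本预测时,对每个类别计算这样的分数,取分数最高的类别即可。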
zh

[CV-46] Beyond Boxes: Mask-Guided Spatio-Temporal Feature Aggregation for Video Object Detection WACV2025

【速读】: 该论文试图解决视频目标检测 (Video Object Detection, VOD) 中有效利用时间信息以增强目标表示的主要挑战。传统方法如区域提议的聚合往往因包含背景信息而导致特征差异。论文提出了一种基于实例掩码的特征聚合新方法,显著改进了这一过程,并加深了对视频帧间目标动态的理解。解决方案的关键在于引入的FAIM方法,该方法通过利用实例掩码特征增强时间特征聚合。具体来说,FAIM包括轻量级的实例特征提取模块 (Instance Feature Extraction Module, IFEM) 和时间实例分类聚合模块 (Temporal Instance Classification Aggregation Module, TICAM),分别用于学习实例掩码特征和跨视频帧聚合实例掩码及分类特征。这种方法在ImageNet VID数据集上以33 FPS的速度实现了87.9%的mAP,为速度-准确性权衡设定了新基准,并在多个数据集上的实验验证了其鲁棒性、方法无关性和在多目标跟踪中的有效性。

链接: https://arxiv.org/abs/2412.04915
作者: Khurram Azeem Hashmi,Talha Uddin Sheikh,Didier Stricker,Muhammad Zeshan Afzal
关键词-EN: Video Object Detection, Object Detection, effectively exploiting temporal, enhance object representations, exploiting temporal information
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: To appear in WACV 2025

点击查看摘要

Abstract:The primary challenge in Video Object Detection (VOD) is effectively exploiting temporal information to enhance object representations. Traditional strategies, such as aggregating region proposals, often suffer from feature variance due to the inclusion of background information. We introduce a novel instance mask-based feature aggregation approach, significantly refining this process and deepening the understanding of object dynamics across video frames. We present FAIM, a new VOD method that enhances temporal Feature Aggregation by leveraging Instance Mask features. In particular, we propose the lightweight Instance Feature Extraction Module (IFEM) to learn instance mask features and the Temporal Instance Classification Aggregation Module (TICAM) to aggregate instance mask and classification features across video frames. Using YOLOX as a base detector, FAIM achieves 87.9% mAP on the ImageNet VID dataset at 33 FPS on a single 2080Ti GPU, setting a new benchmark for the speed-accuracy trade-off. Additional experiments on multiple datasets validate that our approach is robust, method-agnostic, and effective in multi-object tracking, demonstrating its broader applicability to video understanding tasks.
zh

[CV-47] Mitigating Instance-Dependent Label Noise: Integrating Self-Supervised Pretraining with Pseudo-Label Refinement

【速读】: 该论文试图解决实例依赖标签噪声(Instance-dependent label noise, IDN)问题,即标签错误概率依赖于输入特征的情况,这在实际数据集中比实例独立噪声更为普遍且难以处理。解决方案的关键在于提出了一种结合自监督学习(Self-supervised learning)和迭代伪标签细化的混合框架。具体来说,通过SimCLR进行自监督预训练,使模型能够在不依赖可能存在噪声的标签的情况下学习鲁棒的特征表示,从而建立一个对噪声不敏感的基础。随后,采用迭代训练过程,通过多阶段方法识别出置信度高的预测样本,并逐步更新其标签以提高标签质量。实验结果表明,该方法在CIFAR-10和CIFAR-100数据集上,特别是在高噪声条件下,显著优于现有最先进的方法,显著提升了分类精度和鲁棒性。

链接: https://arxiv.org/abs/2412.04898
作者: Gouranga Bala,Anuj Gupta,Subrat Kumar Behera,Amit Sethi
关键词-EN: achieve high performance, models rely heavily, rely heavily, heavily on large, large volumes
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Deep learning models rely heavily on large volumes of labeled data to achieve high performance. However, real-world datasets often contain noisy labels due to human error, ambiguity, or resource constraints during the annotation process. Instance-dependent label noise (IDN), where the probability of a label being corrupted depends on the input features, poses a significant challenge because it is more prevalent and harder to address than instance-independent noise. In this paper, we propose a novel hybrid framework that combines self-supervised learning using SimCLR with iterative pseudo-label refinement to mitigate the effects of IDN. The self-supervised pre-training phase enables the model to learn robust feature representations without relying on potentially noisy labels, establishing a noise-agnostic foundation. Subsequently, we employ an iterative training process with pseudo-label refinement, where confidently predicted samples are identified through a multistage approach and their labels are updated to improve label quality progressively. We evaluate our method on the CIFAR-10 and CIFAR-100 datasets augmented with synthetic instance-dependent noise at varying noise levels. Experimental results demonstrate that our approach significantly outperforms several state-of-the-art methods, particularly under high noise conditions, achieving notable improvements in classification accuracy and robustness. Our findings suggest that integrating self-supervised learning with iterative pseudo-label refinement offers an effective strategy for training deep neural networks on noisy datasets afflicted by instance-dependent label noise.
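迭代伪标签细化的核心步骤——只在模型足够自信时才用预测替换原有噪声标签——可以示意如下(阈值与数据均为玩具示例,非论文的多阶段筛选细节):

```python
import numpy as np

def refine_labels(probs, labels, threshold=0.9):
    """Replace a label with the model prediction only when the model is
    confident (max predicted probability above `threshold`)."""
    preds = probs.argmax(axis=1)
    confident = probs.max(axis=1) >= threshold
    refined = labels.copy()
    refined[confident] = preds[confident]
    return refined, confident

probs = np.array([[0.97, 0.03],   # confident: relabel to class 0
                  [0.55, 0.45]])  # uncertain: keep the given label
noisy = np.array([1, 1])
refined, mask = refine_labels(probs, noisy)
print(refined)  # [0 1]
```

每一轮训练后重复这一替换,标签质量逐步提升,这正是摘要中"iterative pseudo-label refinement"的含义。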
zh

[CV-48] Momentum-GS: Momentum Gaussian Self-Distillation for High-Quality Large Scene Reconstruction

【速读】: 该论文试图解决在大规模场景重建中使用3D高斯喷射(3D Gaussian Splatting)时遇到的高训练内存消耗和存储开销问题。解决方案的关键在于提出了一种名为Momentum-GS的新方法,该方法通过基于动量的自蒸馏(momentum-based self-distillation)来提升各块之间的连贯性和准确性,同时将块的数量与物理GPU数量解耦。具体来说,Momentum-GS维护一个使用动量更新的教师高斯解码器,确保在训练过程中提供稳定的参考,并通过自蒸馏方式为每个块提供全局指导,从而促进重建的空间一致性。此外,通过引入块加权机制,根据各块的重建精度动态调整其权重,进一步确保各块之间的一致性。实验结果表明,该方法在大规模场景中显著优于现有技术,实现了12.8%的LPIPS提升,并减少了所需块的数量,达到了新的技术水平。
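其中“使用动量更新的教师解码器”本质上是一次指数滑动平均 (EMA)。下面用纯 Python 字典模拟参数字典,给出一个示意(非论文实现,动量系数为假设值):

```python
def momentum_update(teacher, student, m=0.99):
    # EMA update: teacher <- m * teacher + (1 - m) * student, per parameter
    return {k: m * teacher[k] + (1.0 - m) * student[k] for k in teacher}
```

训练中每步用学生(各块解码器)的参数小幅更新教师,使教师变化平缓,从而为所有块提供稳定的蒸馏目标。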

链接: https://arxiv.org/abs/2412.04887
作者: Jixuan Fan,Wanhua Li,Yifei Han,Yansong Tang
关键词-EN: demonstrated notable success, Splatting has demonstrated, challenges persist due, Gaussian Splatting, high training memory
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:3D Gaussian Splatting has demonstrated notable success in large-scale scene reconstruction, but challenges persist due to high training memory consumption and storage overhead. Hybrid representations that integrate implicit and explicit features offer a way to mitigate these limitations. However, when applied in parallelized block-wise training, two critical issues arise: reconstruction accuracy deteriorates due to reduced data diversity when training each block independently, and parallel training restricts the number of divided blocks to the available number of GPUs. To address these issues, we propose Momentum-GS, a novel approach that leverages momentum-based self-distillation to promote consistency and accuracy across the blocks while decoupling the number of blocks from the physical GPU count. Our method maintains a teacher Gaussian decoder updated with momentum, ensuring a stable reference during training. This teacher provides each block with global guidance in a self-distillation manner, promoting spatial consistency in reconstruction. To further ensure consistency across the blocks, we incorporate block weighting, dynamically adjusting each block’s weight according to its reconstruction accuracy. Extensive experiments on large-scale scenes show that our method consistently outperforms existing techniques, achieving a 12.8% improvement in LPIPS over CityGaussian with far fewer divided blocks and establishing a new state of the art. Project page: this https URL
zh

[CV-49] AI-Driven Non-Invasive Detection and Staging of Steatosis in Fatty Liver Disease Using a Novel Cascade Model and Information Fusion Techniques

【速读】: 该论文试图解决非酒精性脂肪肝病(NAFLD)的早期诊断和分期问题,特别是其向非酒精性脂肪性肝炎(NASH)、肝纤维化、肝硬化和肝细胞癌等严重状态进展的风险。解决方案的关键在于开发了一种新型的人工智能级联模型,该模型采用集成学习(ensemble learning)和特征融合(feature fusion)技术,利用人体测量和实验室参数,实现了对NAFLD的非侵入性、稳健且可靠的诊断。该模型在NASH脂肪变性分期任务中达到了86%的准确率,并在区分NASH与非NASH病例时获得了96%的AUC-ROC值,显著优于当前最先进的模型,从而强调了人工智能在NAFLD早期诊断和治疗中的潜在应用价值。
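论文未给出集成细节;作为理解,集成学习中常见的软投票 (soft voting) 融合可以这样实现(示意代码,权重与接口均为假设,并非论文的级联模型):

```python
import numpy as np

def soft_vote(prob_list, weights=None):
    # prob_list: list of (N, C) class-probability arrays, one per base model
    P = np.stack(prob_list)                      # (M, N, C)
    if weights is None:
        weights = np.full(len(prob_list), 1.0 / len(prob_list))
    fused = np.tensordot(weights, P, axes=1)     # (N, C) weighted average
    return fused.argmax(axis=1), fused
```

特征融合则发生在输入侧(拼接人体测量与实验室特征),而软投票这类决策级融合发生在输出侧,两者常结合使用。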

链接: https://arxiv.org/abs/2412.04884
作者: Niloufar Delfan,Pardis Ketabi Moghadam,Mohammad Khoshnevisan,Mehdi Hosseini Chagahi,Behzad Hatami,Melika Asgharzadeh,Mohammadreza Zali,Behzad Moshiri,Amin Momeni Moghaddam,Mohammad Amin Khalafi,Khosrow Dehnad
关键词-EN: Non-alcoholic fatty liver, widespread liver disorders, Non-alcoholic fatty, fatty liver disease, global scale
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Non-alcoholic fatty liver disease (NAFLD) is one of the most widespread liver disorders on a global scale, posing a significant threat of progressing to more severe conditions like nonalcoholic steatohepatitis (NASH), liver fibrosis, cirrhosis, and hepatocellular carcinoma. Diagnosing and staging NAFLD presents challenges due to its non-specific symptoms and the invasive nature of liver biopsies. Our research introduces a novel artificial intelligence cascade model employing ensemble learning and feature fusion techniques. We developed a non-invasive, robust, and reliable diagnostic artificial intelligence tool that utilizes anthropometric and laboratory parameters, facilitating early detection and intervention in NAFLD progression. Our novel artificial intelligence achieved an 86% accuracy rate for the NASH steatosis staging task (non-NASH, steatosis grade 1, steatosis grade 2, and steatosis grade 3) and an impressive 96% AUC-ROC for distinguishing between NASH (steatosis grade 1, grade 2, and grade 3) and non-NASH cases, outperforming current state-of-the-art models. This notable improvement in diagnostic performance underscores the potential application of artificial intelligence in the early diagnosis and treatment of NAFLD, leading to better patient outcomes and a reduced healthcare burden associated with advanced liver disease.
zh

[CV-50] MozzaVID: Mozzarella Volumetric Image Dataset

【速读】: 该论文试图解决体积成像复杂性导致的体积深度学习模型基准数据集缺乏的问题。解决方案的关键在于引入了一个大规模、清洁且多功能的体积分类数据集——MozzaVID。该数据集包含莫扎里拉奶酪的X射线计算机断层扫描(CT)图像,能够对25种奶酪类型和149个奶酪样本进行分类,并提供了三种不同分辨率的数据,涵盖从591到37,824张图像。通过这一数据集,研究者不仅能够进行一般性的深度学习模型基准测试,还能深入研究奶酪结构特性,从而有助于优化食品生产过程和开发可持续的替代食品产品。

链接: https://arxiv.org/abs/2412.04880
作者: Pawel Tomasz Pieta,Peter Winkel Rasmussen,Anders Bjorholm Dahl,Jeppe Revall Frisvad,Siavash Arjomand Bigdeli,Carsten Gundlach,Anders Nymark Christensen
关键词-EN: benchmarking volumetric deep-learning, volumetric deep-learning models, shortage of established, Influenced, deep-learning models
类目: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
备注:

点击查看摘要

Abstract:Influenced by the complexity of volumetric imaging, there is a shortage of established datasets useful for benchmarking volumetric deep-learning models. As a consequence, new and existing models are not easily comparable, limiting the development of architectures optimized specifically for volumetric data. To counteract this trend, we introduce MozzaVID - a large, clean, and versatile volumetric classification dataset. Our dataset contains X-ray computed tomography (CT) images of mozzarella microstructure and enables the classification of 25 cheese types and 149 cheese samples. We provide data in three different resolutions, resulting in three dataset instances containing from 591 to 37,824 images. While being general-purpose, the dataset also facilitates investigating mozzarella structure properties. The structure of food directly affects its functional properties and thus its consumption experience. Understanding food structure helps tune the production and mimicking it enables sustainable alternatives to animal-derived food products. The complex and disordered nature of food structures brings a unique challenge, where a choice of appropriate imaging method, scale, and sample size is not trivial. With this dataset we aim to address these complexities, contributing to more robust structural analysis models. The dataset can be downloaded from: this https URL.
zh

[CV-51] MANTA: A Large-Scale Multi-View and Visual-Text Anomaly Detection Dataset for Tiny Objects

【速读】: 该论文试图解决微小物体视觉-文本异常检测的问题,并为此提出了一个名为MANTA的数据集。解决方案的关键在于构建了一个包含137.3K张图像和38个物体类别的多视角数据集,其中8.6K张图像带有像素级异常标注。此外,数据集还包括两个文本子集:声明性知识(Declarative Knowledge)和建构主义学习(Constructivist Learning),分别提供了对异常的详细描述和多选题形式的训练材料。通过这些丰富的视觉和文本数据,论文不仅为视觉-文本任务提供了一个基准,还通过广泛的基准测试评估了现有方法的性能,突显了其数据集在解决微小物体异常检测问题中的挑战性和有效性。

链接: https://arxiv.org/abs/2412.04867
作者: Lei Fan,Dongdong Fan,Zhiguang Hu,Yiwen Ding,Donglin Di,Kai Yi,Maurice Pagnucco,Yang Song
关键词-EN: present MANTA, visual-text anomaly detection, anomaly detection dataset, anomaly detection, MANTA
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: this https URL

点击查看摘要

Abstract:We present MANTA, a visual-text anomaly detection dataset for tiny objects. The visual component comprises over 137.3K images across 38 object categories spanning five typical domains, of which 8.6K images are labeled as anomalous with pixel-level annotations. Each image is captured from five distinct viewpoints to ensure comprehensive object coverage. The text component consists of two subsets: Declarative Knowledge, including 875 words that describe common anomalies across various domains and specific categories, with detailed explanations covering what, why, and how, including causes and visual characteristics; and Constructivist Learning, providing 2K multiple-choice questions with varying levels of difficulty, each paired with images and corresponding answer explanations. We also propose a baseline for visual-text tasks and conduct extensive benchmarking experiments to evaluate advanced methods across different settings, highlighting the challenges and efficacy of our dataset.
zh

[CV-52] GS-Matching: Reconsidering Feature Matching task in Point Cloud Registration

【速读】: 该论文试图解决传统点云配准 (Point Cloud Registration, PCR) 方法中特征匹配时采用最近邻策略导致的“多对一”匹配问题,这种策略在低重叠条件下会产生大量无对应点的潜在内点。论文提出的解决方案关键在于引入了一种基于Gale-Shapley算法的启发式稳定匹配策略,称为GS-matching。该方法通过将特征匹配任务视为分配问题,实现了最优的一对一匹配,从而在低重叠条件下更高效地找到更多非重复内点。此外,论文还利用概率理论对特征匹配任务进行了分析,为该研究问题提供了新的见解。实验结果验证了GS-matching策略的有效性,显著提高了多个数据集上的配准召回率。
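作为理解,Gale-Shapley 式的稳定匹配可以在相似度矩阵上这样实现(示意代码,非论文原实现;源点按相似度从高到低向目标点“求婚”,目标点保留当前最优的求婚者):

```python
import numpy as np

def gale_shapley_match(sim):
    """Stable one-to-one matching from a (sources x targets) similarity matrix."""
    n_src, n_tgt = sim.shape
    prefs = np.argsort(-sim, axis=1)           # each source's targets, best first
    next_prop = [0] * n_src                    # next target index to propose to
    engaged_to = {}                            # target -> source currently held
    free = list(range(n_src))
    while free:
        s = free.pop()
        if next_prop[s] >= n_tgt:
            continue                           # source exhausted its list
        t = prefs[s][next_prop[s]]
        next_prop[s] += 1
        if t not in engaged_to:
            engaged_to[t] = s
        elif sim[s, t] > sim[engaged_to[t], t]:
            free.append(engaged_to[t])         # target prefers the new proposer
            engaged_to[t] = s
        else:
            free.append(s)                     # rejected; propose again later
    return {int(s): int(t) for t, s in engaged_to.items()}
```

与最近邻策略相比,这种匹配天然避免“多对一”:每个目标点最多被一个源点占用。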

链接: https://arxiv.org/abs/2412.04855
作者: Yaojie Zhang,Tianlun Huang,Weijun Wang,Wei Feng
关键词-EN: Traditional point cloud, Traditional point, nearest neighbor policy, point cloud registration, nearest neighbor
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Traditional point cloud registration (PCR) methods for feature matching often employ the nearest neighbor policy. This leads to many-to-one matches and numerous potential inliers without any corresponding point. Recently, some approaches have framed the feature matching task as an assignment problem to achieve optimal one-to-one matches. We argue that the transition to the assignment problem is not reliable for general correspondence-based PCR. In this paper, we propose a heuristic stable matching policy called GS-matching, inspired by the Gale-Shapley algorithm. Compared to other matching policies, our method performs efficiently and finds more non-repetitive inliers under low overlapping conditions. Furthermore, we employ probability theory to analyze the feature matching task, providing new insights into this research problem. Extensive experiments validate the effectiveness of our matching policy, achieving better registration recall on multiple datasets.
zh

[CV-53] SleeperMark: Towards Robust Watermark against Fine-Tuning Text-to-image Diffusion Models

【速读】: 该论文试图解决大规模文本到图像(T2I)扩散模型在未经授权的微调(fine-tuning)过程中,其嵌入的水印信息容易被遗忘或破坏的问题。解决方案的关键在于提出了一种名为SleeperMark的新框架,该框架通过明确引导模型将水印信息与语义概念分离,从而在模型适应新任务时仍能保留嵌入的水印。这种方法确保了水印在模型微调和各种攻击下的鲁棒性,同时对模型的生成能力影响最小。

链接: https://arxiv.org/abs/2412.04852
作者: Zilan Wang,Junfeng Guo,Jiacheng Zhu,Yiming Li,Heng Huang,Muhao Chen,Zhengzhong Tu
关键词-EN: Recent advances, diffusion models, including style customization, subject-driven personalization, advances in large-scale
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Recent advances in large-scale text-to-image (T2I) diffusion models have enabled a variety of downstream applications, including style customization, subject-driven personalization, and conditional generation. As T2I models require extensive data and computational resources for training, they constitute highly valued intellectual property (IP) for their legitimate owners, while also making them attractive targets for unauthorized fine-tuning by adversaries seeking to leverage these models for customized, usually profitable applications. Existing IP protection methods for diffusion models generally involve embedding watermark patterns and then verifying ownership through examination of generated outputs, or by inspecting the model’s feature space. However, these techniques are inherently ineffective in practical scenarios when the watermarked model undergoes fine-tuning, and the feature space is inaccessible during verification (i.e., the black-box setting). The model is prone to forgetting the previously learned watermark knowledge when it adapts to a new task. To address this challenge, we propose SleeperMark, a novel framework designed to embed resilient watermarks into T2I diffusion models. SleeperMark explicitly guides the model to disentangle the watermark information from the semantic concepts it learns, allowing the model to retain the embedded watermark while continuing to be fine-tuned to new downstream tasks. Our extensive experiments demonstrate the effectiveness of SleeperMark across various types of diffusion models, including latent diffusion models (e.g., Stable Diffusion) and pixel diffusion models (e.g., DeepFloyd-IF), showing robustness against downstream fine-tuning and various attacks at both the image and model levels, with minimal impact on the model’s generative capability. The code is available at this https URL.
zh

[CV-54] UniMLVG: Unified Framework for Multi-view Long Video Generation with Comprehensive Control Capabilities for Autonomous Driving

【速读】: 该论文试图解决生成多样且真实的驾驶场景视频的问题,特别是长时间、多视角一致性视频的生成。解决方案的关键在于提出了一个名为UniMLVG的统一框架,该框架通过整合单视角和多视角驾驶视频数据,并在训练过程中分三个阶段更新跨帧和跨视角模块,从而显著提升生成视觉内容的多样性和质量。此外,通过在多视角视频生成中明确建模视角,有效改善了运动过渡的一致性。UniMLVG能够处理多种输入参考格式(如文本、图像或视频),并根据相应的条件约束(如3D边界框或帧级文本描述)生成高质量的多视角视频。

链接: https://arxiv.org/abs/2412.04842
作者: Rui Chen,Zehuan Wu,Yichen Liu,Yuxin Guo,Jingcheng Ni,Haifeng Xia,Siyu Xia
关键词-EN: autonomous driving system, realistic driving scenarios, creation of diverse, diverse and realistic, essential to enhance
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:The creation of diverse and realistic driving scenarios has become essential to enhance perception and planning capabilities of the autonomous driving system. However, generating long-duration, surround-view consistent driving videos remains a significant challenge. To address this, we present UniMLVG, a unified framework designed to generate extended street multi-perspective videos under precise control. By integrating single- and multi-view driving videos into the training data, our approach updates cross-frame and cross-view modules across three stages with different training objectives, substantially boosting the diversity and quality of generated visual content. Additionally, we employ the explicit viewpoint modeling in multi-view video generation to effectively improve motion transition consistency. Capable of handling various input reference formats (e.g., text, images, or video), our UniMLVG generates high-quality multi-view videos according to the corresponding condition constraints such as 3D bounding boxes or frame-level text descriptions. Compared to the best models with similar capabilities, our framework achieves improvements of 21.4% in FID and 36.5% in FVD.
zh

[CV-55] Maximizing Alignment with Minimal Feedback: Efficiently Learning Rewards for Visuomotor Robot Policy Alignment

【速读】: 该论文试图解决在视觉运动机器人策略中,如何有效地将预训练的策略与终端用户的偏好对齐的问题。传统方法如基于人类反馈的强化学习(RLHF)在非具身领域(如大型语言模型)中表现出色,但在视觉运动策略对齐中,由于学习视觉奖励函数所需的大量人类反馈而效果不佳。论文提出的解决方案是表示对齐的基于偏好的学习(Representation-Aligned Preference-based Learning, RAPL),其关键在于通过较少的人类偏好反馈来微调预训练的视觉编码器,使其与终端用户的视觉表示对齐,然后在此对齐的表示空间中通过特征匹配构建密集的视觉奖励。这种方法不仅在模拟实验中验证了其有效性,还在硬件实验中展示了其在不同机器人实体间的泛化能力,显著减少了所需的人类反馈量。
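RAPL 中“在对齐的表示空间中通过特征匹配构建密集视觉奖励”的思路,可以用如下草图表达(示意,非论文实现;这里以与目标特征的负 L2 距离作为奖励的一种简化形式):

```python
import numpy as np

def feature_matching_reward(obs_feats, goal_feat):
    # obs_feats: (T, D) features of each observation along a trajectory,
    # extracted by the human-aligned (fine-tuned) vision encoder
    # goal_feat: (D,) feature of the desired outcome
    # Dense reward: the closer to the goal in feature space, the higher the reward
    return -np.linalg.norm(obs_feats - goal_feat, axis=-1)
```

关键在于编码器已先用少量人类偏好反馈微调过,因此该特征空间中的距离近似反映了终端用户的偏好。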

链接: https://arxiv.org/abs/2412.04835
作者: Ran Tian,Yilin Wu,Chenfeng Xu,Masayoshi Tomizuka,Jitendra Malik,Andrea Bajcsy
关键词-EN: promise significant advancements, large-scale datasets, promise significant, significant advancements, human
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: Submitted to IJRR, this paper is an extended journal version of the conference paper arXiv:2310.07932 with new results and discussion. arXiv admin note: substantial text overlap with arXiv:2310.07932

点击查看摘要

Abstract:Visuomotor robot policies, increasingly pre-trained on large-scale datasets, promise significant advancements across robotics domains. However, aligning these policies with end-user preferences remains a challenge, particularly when the preferences are hard to specify. While reinforcement learning from human feedback (RLHF) has become the predominant mechanism for alignment in non-embodied domains like large language models, it has not seen the same success in aligning visuomotor policies due to the prohibitive amount of human feedback required to learn visual reward functions. To address this limitation, we propose Representation-Aligned Preference-based Learning (RAPL), an observation-only method for learning visual rewards from significantly less human preference feedback. Unlike traditional RLHF, RAPL focuses human feedback on fine-tuning pre-trained vision encoders to align with the end-user’s visual representation and then constructs a dense visual reward via feature matching in this aligned representation space. We first validate RAPL through simulation experiments in the X-Magical benchmark and Franka Panda robotic manipulation, demonstrating that it can learn rewards aligned with human preferences, more efficiently uses preference data, and generalizes across robot embodiments. Finally, our hardware experiments align pre-trained Diffusion Policies for three object manipulation tasks. We find that RAPL can fine-tune these policies with 5x less real human preference data, taking the first step towards minimizing human feedback while maximizing visuomotor robot policy alignment.
zh

[CV-56] Customized Generation Reimagined: Fidelity and Editability Harmonized ECCV2024

【速读】: 该论文试图解决文本到图像生成模型中概念保真度(concept fidelity)与编辑性(editability)之间的固有矛盾。解决方案的关键在于提出了一种“分治再整合”(Divide, Conquer, then Integrate, DCI)框架,通过在去噪过程的早期阶段进行精细调整,将模型从保真度与编辑性的权衡中解放出来。具体而言,DCI框架将这一矛盾的两个方面解耦,并通过两个协作分支分别处理,最后选择性地整合以同时保持高概念保真度和忠实遵循文本提示。此外,论文还引入了图像特定上下文优化(Image-specific Context Optimization, ICO)策略,通过学习可适应的图像特定上下文来替代手动提示模板,从而提升模型定制化的整体性能。

链接: https://arxiv.org/abs/2412.04831
作者: Jian Jin,Yang Shen,Zhenyong Fu,Jian Yang
关键词-EN: Customized generation aims, Customized generation, customized generation suffers, high concept fidelity, concept fidelity
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 18 pages, 12 figures, ECCV 2024

点击查看摘要

Abstract:Customized generation aims to incorporate a novel concept into a pre-trained text-to-image model, enabling new generations of the concept in novel contexts guided by textual prompts. However, customized generation suffers from an inherent trade-off between concept fidelity and editability, i.e., between precisely modeling the concept and faithfully adhering to the prompts. Previous methods reluctantly seek a compromise and struggle to achieve both high concept fidelity and ideal prompt alignment simultaneously. In this paper, we propose a Divide, Conquer, then Integrate (DCI) framework, which performs a surgical adjustment in the early stage of denoising to liberate the fine-tuned model from the fidelity-editability trade-off at inference. The two conflicting components in the trade-off are decoupled and individually conquered by two collaborative branches, which are then selectively integrated to preserve high concept fidelity while achieving faithful prompt adherence. To obtain a better fine-tuned model, we introduce an Image-specific Context Optimization (ICO) strategy for model customization. ICO replaces manual prompt templates with learnable image-specific contexts, providing an adaptive and precise fine-tuning direction to promote the overall performance. Extensive experiments demonstrate the effectiveness of our method in reconciling the fidelity-editability trade-off.
zh

[CV-57] DAug: Diffusion-based Channel Augmentation for Radiology Image Retrieval and Classification

【速读】: 该论文试图解决医学图像理解中AI模型在有限训练数据下难以准确关注关键区域的问题。解决方案的关键在于提出了基于扩散的特征增强方法(Diffusion-based Feature Augmentation, DAug),通过生成模型输出的热图来扩展放射图像的通道,从而引导模型关注疾病易发区域。此外,论文还提出了图像-文本-类别混合对比学习(Image-Text-Class Hybrid Contrastive learning),以充分利用文本和类别标签信息。这两种创新方法的结合在不改变模型架构的前提下,显著提升了医学图像检索和分类任务的性能,达到了当前最先进的水平。
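“将放射图像扩展为多通道”这一步骤在实现上相当于把生成模型输出的热图作为额外通道与原图拼接。一个极简示意(接口与通道布局均为假设):

```python
import numpy as np

def augment_channels(image, heatmaps):
    # image: (C, H, W) radiology image
    # heatmaps: (K, H, W) disease-region maps from a generative model
    assert image.shape[1:] == heatmaps.shape[1:]
    return np.concatenate([image, heatmaps], axis=0)   # (C + K, H, W)
```

感知模型随后直接在 C + K 通道的输入上训练,无需改动网络其余结构(只需调整第一层的输入通道数)。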

链接: https://arxiv.org/abs/2412.04828
作者: Ying Jin,Zhuoran Zhou,Haoquan Fang,Jenq-Neng Hwang
关键词-EN: fine visual details, requires meticulous examination, understanding requires meticulous, requiring additional attention, visual details
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Medical image understanding requires meticulous examination of fine visual details, with particular regions requiring additional attention. While radiologists build such expertise over years of experience, it is challenging for AI models to learn where to look with limited amounts of training data. This limitation results in unsatisfying robustness in medical image understanding. To address this issue, we propose Diffusion-based Feature Augmentation (DAug), a portable method that improves a perception model’s performance with a generative model’s output. Specifically, we extend a radiology image to multiple channels, with the additional channels being the heatmaps of regions where diseases tend to develop. A diffusion-based image-to-image translation model was used to generate such heatmaps conditioned on selected disease classes. Our method is motivated by the fact that generative models learn the distribution of normal and abnormal images, and such knowledge is complementary to image understanding tasks. In addition, we propose the Image-Text-Class Hybrid Contrastive learning to utilize both text and class labels. With two novel approaches combined, our method surpasses baseline models without changing the model architecture, and achieves state-of-the-art performance on both medical image retrieval and classification tasks.
zh

[CV-58] PanoDreamer: 3D Panorama Synthesis from a Single Image

【速读】: 该论文试图解决从单一输入图像生成连贯的360°全景3D场景的问题。解决方案的关键在于将问题框架为单图像全景图和深度估计,并通过交替最小化策略来优化这两个任务。具体来说,论文提出了PanoDreamer方法,通过获取连贯的全景图像及其对应的深度信息,然后通过修复遮挡区域并将其投影到3D空间中来重建场景。这种方法在单图像360°场景重建的连贯性和整体质量上优于现有技术。
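论文的核心是把全景图估计与深度估计写成两个优化子任务并交替最小化。交替最小化的一般骨架可以用一个玩具问题演示(示意代码,与论文的具体目标函数无关):

```python
def alternating_minimization(argmin_x, argmin_y, x, y, iters=60):
    # Alternately solve each subproblem with the other variable held fixed
    for _ in range(iters):
        x = argmin_x(y)
        y = argmin_y(x)
    return x, y

# Toy objective f(x, y) = (x - 1)^2 + (x - y)^2:
# with y fixed, the minimizer over x is (1 + y) / 2; with x fixed, over y it is x.
x_opt, y_opt = alternating_minimization(lambda y: (1 + y) / 2, lambda x: x, 0.0, 0.0)
```

两个子问题各自易解时,交替迭代往往能快速收敛;上例中 (x, y) 几何级地收敛到 (1, 1)。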

链接: https://arxiv.org/abs/2412.04827
作者: Avinash Paliwal,Xilong Zhou,Andrii Tsarov,Nima Khademi Kalantari
关键词-EN: single input image, present PanoDreamer, single input, input image, Unlike existing methods
类目: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
备注: Project page: this https URL , Code: this https URL

点击查看摘要

Abstract:In this paper, we present PanoDreamer, a novel method for producing a coherent 360° 3D scene from a single input image. Unlike existing methods that generate the scene sequentially, we frame the problem as single-image panorama and depth estimation. Once the coherent panoramic image and its corresponding depth are obtained, the scene can be reconstructed by inpainting the small occluded regions and projecting them into 3D space. Our key contribution is formulating single-image panorama and depth estimation as two optimization tasks and introducing alternating minimization strategies to effectively solve their objectives. We demonstrate that our approach outperforms existing techniques in single-image 360° scene reconstruction in terms of consistency and overall quality.
zh

[CV-59] Pushing Rendering Boundaries: Hard Gaussian Splatting

【速读】: 该论文试图解决3D高斯喷射(3D Gaussian Splatting, 3DGS)在训练过程中依赖平均视点空间位置梯度来增长高斯分布以减少渲染损失的问题。由于平均操作平滑了来自不同视点和像素的梯度和渲染误差,导致许多缺陷高斯分布的优化受阻,从而在某些区域产生强烈的伪影。解决方案的关键在于提出硬高斯喷射(Hard Gaussian Splatting, HGS),通过考虑多视点显著位置梯度和渲染误差来增长硬高斯分布,填补经典3DGS在3D场景中的空白,从而实现更优的新视点合成(Novel View Synthesis, NVS)结果。具体方法包括位置梯度驱动的HGS和渲染误差引导的HGS,前者利用多视点显著位置梯度来揭示硬高斯分布,后者识别显著像素渲染误差和潜在过大的高斯分布以共同挖掘硬高斯分布。通过增长和优化这些硬高斯分布,该方法有效解决了模糊和针状伪影问题。

链接: https://arxiv.org/abs/2412.04826
作者: Qingshan Xu,Jiequan Cui,Xuanyu Yi,Yuxuan Wang,Yuan Zhou,Yew-Soon Ong,Hanwang Zhang
关键词-EN: View Synthesis, Gaussian Splatting, impressive Novel View, hard Gaussians, Hard Gaussian Splatting
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:3D Gaussian Splatting (3DGS) has demonstrated impressive Novel View Synthesis (NVS) results in a real-time rendering manner. During training, it relies heavily on the average magnitude of view-space positional gradients to grow Gaussians to reduce rendering loss. However, this average operation smooths the positional gradients from different viewpoints and rendering errors from different pixels, hindering the growth and optimization of many defective Gaussians. This leads to strong spurious artifacts in some areas. To address this problem, we propose Hard Gaussian Splatting, dubbed HGS, which considers multi-view significant positional gradients and rendering errors to grow hard Gaussians that fill the gaps of classical Gaussian Splatting on 3D scenes, thus achieving superior NVS results. In detail, we present positional gradient driven HGS, which leverages multi-view significant positional gradients to uncover hard Gaussians. Moreover, we propose rendering error guided HGS, which identifies noticeable pixel rendering errors and potentially over-large Gaussians to jointly mine hard Gaussians. By growing and optimizing these hard Gaussians, our method helps to resolve blurring and needle-like artifacts. Experiments on various datasets demonstrate that our method achieves state-of-the-art rendering quality while maintaining real-time efficiency.
zh

[CV-60] LiFT: Leveraging Human Feedback for Text-to-Video Model Alignment

【速读】: 该论文试图解决文本到视频生成模型在生成视频与人类偏好对齐方面的不足,特别是如何将人类主观偏好转化为可量化的目标函数。解决方案的关键在于提出了一种名为LiFT的新型微调方法,该方法利用人类反馈来实现模型的对齐。具体步骤包括:构建一个包含约10k条人类评分注释的数据集LiFT-HRA,训练一个奖励模型LiFT-Critic以学习奖励函数,该函数作为人类判断的代理,用于衡量生成视频与人类期望之间的对齐程度;最后,通过最大化奖励加权似然来对齐T2V模型。实验结果表明,该方法在CogVideoX-2B模型上的应用显著提升了生成视频的质量和对齐度。
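“最大化奖励加权似然”等价于最小化奖励加权的负对数似然。下面给出一个标量层面的示意(权重采用对奖励做 softmax,这是一种常见但在此处属于假设的归一化方式,并非论文公式):

```python
import math

def reward_weighted_nll(log_probs, rewards):
    # L = -sum_i w_i * log p(video_i), with weights from a softmax over rewards
    m = max(rewards)                             # subtract max for stability
    exp_r = [math.exp(r - m) for r in rewards]
    z = sum(exp_r)
    return -sum((e / z) * lp for e, lp in zip(exp_r, log_probs))
```

直观上:奖励模型 LiFT-Critic 打分越高的样本权重越大,模型被推动去提高这些样本的生成概率。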

链接: https://arxiv.org/abs/2412.04814
作者: Yibin Wang,Zhiyu Tan,Junyan Wang,Xiaomeng Yang,Cheng Jin,Hao Li
关键词-EN: shown impressive capabilities, Recent advancements, impressive capabilities, shown impressive, human
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: project page: this https URL

点击查看摘要

Abstract:Recent advancements in text-to-video (T2V) generative models have shown impressive capabilities. However, these models are still inadequate in aligning synthesized videos with human preferences (e.g., accurately reflecting text descriptions), which is particularly difficult to address, as human preferences are inherently subjective and challenging to formalize as objective functions. Therefore, this paper proposes LiFT, a novel fine-tuning method leveraging human feedback for T2V model alignment. Specifically, we first construct a Human Rating Annotation dataset, LiFT-HRA, consisting of approximately 10k human annotations, each including a score and its corresponding rationale. Based on this, we train a reward model LiFT-Critic to learn reward function effectively, which serves as a proxy for human judgment, measuring the alignment between given videos and human expectations. Lastly, we leverage the learned reward function to align the T2V model by maximizing the reward-weighted likelihood. As a case study, we apply our pipeline to CogVideoX-2B, showing that the fine-tuned model outperforms the CogVideoX-5B across all 16 metrics, highlighting the potential of human feedback in improving the alignment and quality of synthesized videos.
zh

[CV-61] DrIFT: Autonomous Drone Dataset with Integrated Real and Synthetic Data Flexible Views and Transformed Domains WACV2025

【速读】: 该论文试图解决无人机视觉检测中的领域偏移问题,特别是在环境变化、视角变化和背景变化等因素影响下的检测准确性。解决方案的关键在于提出了DrIFT数据集,该数据集包含了14个不同的领域,涵盖了视角、合成数据到真实数据、季节和恶劣天气等多种领域偏移情况。DrIFT数据集通过提供背景分割图,强调了背景偏移的重要性,并支持背景相关的评估指标。此外,论文还引入了新的不确定性估计指标MCDO-map,该指标具有较低的后处理复杂度,并在不确定性感知的无监督领域自适应方法中展示了优于现有最先进技术的性能。

链接: https://arxiv.org/abs/2412.04789
作者: Fardad Dadboud,Hamid Azad,Varun Mehta,Miodrag Bolic,Iraj Mntegh
关键词-EN: Dependable visual drone, Dependable visual, visual drone detection, drone detection, secure integration
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: WACV2025

点击查看摘要

Abstract:Dependable visual drone detection is crucial for the secure integration of drones into the airspace. However, drone detection accuracy is significantly affected by domain shifts due to environmental changes, varied points of view, and background shifts. To address these challenges, we present the DrIFT dataset, specifically developed for visual drone detection under domain shifts. DrIFT includes fourteen distinct domains, each characterized by shifts in point of view, synthetic-to-real data, season, and adverse weather. DrIFT uniquely emphasizes background shift by providing background segmentation maps to enable background-wise metrics and evaluation. Our new uncertainty estimation metric, MCDO-map, features lower postprocessing complexity, surpassing traditional methods. We use the MCDO-map in our uncertainty-aware unsupervised domain adaptation method, demonstrating superior performance to SOTA unsupervised domain adaptation techniques. The dataset is available at: this https URL.
zh

[CV-62] Slicing Vision Transformer for Flexible Inference NEURIPS2024

【速读】: 该论文试图解决在资源动态变化的环境中,如何缩减Vision Transformers (ViT) 以适应不同资源约束的问题。解决方案的关键在于提出了一种名为Scala的通用框架,该框架允许单个网络在训练过程中激活多个不同宽度的子网络,从而实现灵活的推理能力。具体来说,Scala通过引入Isolated Activation来解耦最小的子网络与其他子网络,并利用Scale Coordination确保每个子网络获得简化的、稳定的和准确的学习目标。这种方法在不需要修改原始ViT结构的情况下,通过一次训练即可学习到可缩减的表示,并且在性能上与单独训练的模型相当,同时在ImageNet-1K上实现了平均1.6%的性能提升。
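“小 ViT 本质上是大 ViT 按宽度切片得到的子网络”这一观察,落到单个线性层上就是取前若干个输出单元。示意如下(非 Scala 原实现,仅演示前缀切片的含义):

```python
import numpy as np

def slice_width(W, b, ratio):
    # Keep the first `ratio` fraction of output units (sub-network = prefix slice)
    k = max(1, int(round(W.shape[0] * ratio)))
    return W[:k], b[:k]
```

推理时根据当前资源预算选一个宽度比例,对每层权重做同样的前缀切片即可得到对应的小 ViT,无需额外存储多个模型。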

链接: https://arxiv.org/abs/2412.04786
作者: Yitian Zhang,Huseyin Coskun,Xu Ma,Huan Wang,Ke Ma,Xi (Stephen) Chen,Derek Hao Hu,Yun Fu
关键词-EN: Vision Transformers, Vision, Transformers, ViT, Scala
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: Accepted by NeurIPS 2024

点击查看摘要

Abstract:Vision Transformers (ViTs) are known for their scalability. In this work, we aim to scale down a ViT to fit in an environment with dynamically changing resource constraints. We observe that smaller ViTs are intrinsically sub-networks of a larger ViT with different widths. Thus, we propose a general framework, named Scala, to enable a single network to represent multiple smaller ViTs with flexible inference capability, which aligns with the inherent design of ViT to vary in width. Concretely, Scala activates several subnets during training, introduces Isolated Activation to disentangle the smallest sub-network from other subnets, and leverages Scale Coordination to ensure each sub-network receives simplified, steady, and accurate learning objectives. Comprehensive empirical validations on different tasks demonstrate that with only one-shot training, Scala learns slimmable representation without modifying the original ViT structure and matches the performance of Separate Training. Compared with the prior art, Scala achieves an average improvement of 1.6% on ImageNet-1K with fewer parameters.
zh

[CV-63] KNN-MMD: Cross Domain Wi-Fi Sensing Based on Local Distribution Alignment

【速读】: 该论文试图解决跨域Wi-Fi感知中的环境变化问题,即在不同环境下训练的Wi-Fi感知模型性能下降甚至失效的问题。解决方案的关键在于引入了一种基于K-近邻最大均值差异(KNN-MMD)的少样本跨域Wi-Fi感知方法,并通过局部分布对齐策略来优化传统基于全局对齐的领域自适应(DA)方法。该方法不仅在性能上优于传统DA方法,还能自动确定训练停止时机,从而提高模型的稳定性和实用性。实验结果表明,在一对一场景下,该方法在手势识别、人员识别、跌倒检测和动作识别等任务中分别达到了93.26%、81.84%、77.62%和75.30%的准确率。
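KNN-MMD 中的 MMD(最大均值差异)用于度量两个域的特征分布差异。一个基于 RBF 核的 MMD² 示意实现如下(gamma 为假设的超参;论文在每个类别内做局部对齐,此处仅展示 MMD 本身的计算):

```python
import numpy as np

def mmd_rbf(X, Y, gamma=1.0):
    # Squared Maximum Mean Discrepancy with an RBF kernel
    # X: (n, d) source-domain features, Y: (m, d) target-domain features
    def k(A, B):
        d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
        return np.exp(-gamma * d2)
    return k(X, X).mean() + k(Y, Y).mean() - 2.0 * k(X, Y).mean()
```

分布相同则 MMD² 接近 0;论文的局部对齐即在同一类别的源域/目标域样本之间最小化此类距离,而非对全体样本做一次全局对齐。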

链接: https://arxiv.org/abs/2412.04783
作者: Zijian Zhao,Zhijie Cai,Tingwei Chen,Xiaoyang Li,Hang Li,Guangxu Zhu
关键词-EN: gained widespread application, Channel State Information, technology in Integrated, Integrated Sensing, Wi-Fi sensing
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Signal Processing (eess.SP)
备注:

点击查看摘要

Abstract:As a key technology in Integrated Sensing and Communications (ISAC), Wi-Fi sensing has gained widespread application in various settings such as homes, offices, and public spaces. By analyzing the patterns of Channel State Information (CSI), we can obtain information about people’s actions for tasks like person identification, gesture recognition, and fall detection. However, the CSI is heavily influenced by the environment, such that even minor environmental changes can significantly alter the CSI patterns. This causes performance deterioration and even failure when applying a Wi-Fi sensing model trained in one environment to another. To address this problem, we introduce a K-Nearest Neighbors Maximum Mean Discrepancy (KNN-MMD) model, a few-shot method for cross-domain Wi-Fi sensing. We propose a local distribution alignment method within each category, which outperforms traditional Domain Adaptation (DA) methods based on global alignment. Besides, our method can determine when to stop training, which cannot be realized by most DA methods. As a result, our method is more stable and can be better used in practice. The effectiveness of our method is evaluated in several cross-domain Wi-Fi sensing tasks, including gesture recognition, person identification, fall detection, and action recognition, using both a public dataset and a self-collected dataset. In the one-shot scenario, our method achieves accuracies of 93.26%, 81.84%, 77.62%, and 75.30% in the four tasks respectively. To facilitate future research, we will make our code and dataset publicly available upon publication.
zh
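
上文中的 KNN-MMD 以最大均值差异 (Maximum Mean Discrepancy, MMD) 度量源域与目标域特征分布的差异。下面给出一个与论文实现无关的极简 NumPy 示意,展示基于 RBF 核的有偏 MMD² 估计(核带宽 `sigma` 等参数均为假设值):

```python
import numpy as np

def rbf_kernel(X, Y, sigma=1.0):
    # 计算 X 与 Y 各行之间的 RBF 核矩阵
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma ** 2))

def mmd2(X, Y, sigma=1.0):
    """有偏的 MMD² 估计:分布越接近,取值越小。"""
    return (rbf_kernel(X, X, sigma).mean()
            - 2 * rbf_kernel(X, Y, sigma).mean()
            + rbf_kernel(Y, Y, sigma).mean())

rng = np.random.default_rng(0)
src = rng.normal(0.0, 1.0, size=(100, 8))       # 源域特征
tgt_near = rng.normal(0.0, 1.0, size=(100, 8))  # 分布相近的目标域
tgt_far = rng.normal(3.0, 1.0, size=(100, 8))   # 发生偏移的目标域
```

论文的核心是在每个类别内做局部对齐(而非全局对齐),此处仅示意 MMD 本身的计算。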

[CV-64] Megatron: Evasive Clean-Label Backdoor Attacks against Vision Transformer

【速读】: 该论文试图解决视觉Transformer模型在面对后门攻击时的脆弱性问题。解决方案的关键在于提出了一种名为Megatron的规避性干净标签后门攻击方法,该方法在不操纵数据标注过程的情况下注入后门。Megatron的核心创新在于定制了两种基于Transformer网络注意力机制的损失项:潜在损失(latent loss)和注意力扩散损失(attention diffusion loss)。潜在损失通过对齐触发样本和目标标签的干净样本的最后一层注意力层来实现后门注入,而注意力扩散损失则强调了包含触发器的注意力扩散区域,从而增强了攻击的有效性。通过理论分析和广泛的实验验证,Megatron在多个数据集上展示了其高攻击成功率和优越的规避性能。

链接: https://arxiv.org/abs/2412.04776
作者: Xueluan Gong,Bowei Tian,Meng Xue,Shuike Li,Yanjiao Chen,Qian Wang
关键词-EN: achieved impressive performance, attention diffusion loss, attention diffusion, vision-related tasks, achieved impressive
类目: Computer Vision and Pattern Recognition (cs.CV); Cryptography and Security (cs.CR)
备注:

点击查看摘要

Abstract:Vision transformers have achieved impressive performance in various vision-related tasks, but their vulnerability to backdoor attacks is under-explored. A handful of existing works focus on dirty-label attacks with wrongly-labeled poisoned training samples, which may fail if a benign model trainer corrects the labels. In this paper, we propose Megatron, an evasive clean-label backdoor attack against vision transformers, where the attacker injects the backdoor without manipulating the data-labeling process. To generate an effective trigger, we customize two loss terms based on the attention mechanism used in transformer networks, i.e., latent loss and attention diffusion loss. The latent loss aligns the last attention layer between triggered samples and clean samples of the target label. The attention diffusion loss emphasizes the attention diffusion area that encompasses the trigger. A theoretical analysis is provided to underpin the rationale behind the attention diffusion loss. Extensive experiments on CIFAR-10, GTSRB, CIFAR-100, and Tiny ImageNet demonstrate the effectiveness of Megatron. Megatron can achieve attack success rates of over 90% even when the position of the trigger is slightly shifted during testing. Furthermore, Megatron achieves better evasiveness than baselines regarding both human visual inspection and defense strategies (i.e., DBAVT, BAVT, Beatrix, TeCo, and SAGE).
zh
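
上文 Megatron 的两个损失项可以用一个高度简化的玩具形式来理解(以下损失公式均为假设,并非论文原实现):潜在损失以 MSE 对齐触发样本与目标类干净样本的末层注意力;注意力扩散损失则鼓励注意力质量集中在触发器所在区域。

```python
import numpy as np

def latent_loss(attn_triggered, attn_clean_target):
    # 对齐触发样本与目标标签干净样本的末层注意力(此处用 MSE,为假设形式)
    return np.mean((attn_triggered - attn_clean_target) ** 2)

def attention_diffusion_loss(attn, trigger_mask):
    # 注意力落入(扩张后的)触发器区域的比例越低,损失越大(玩具形式)
    return 1.0 - (attn * trigger_mask).sum() / attn.sum()

attn = np.zeros((8, 8))
attn[2:4, 2:4] = 1.0           # 注意力集中在触发器附近的 2x2 区域
mask = np.zeros((8, 8))
mask[1:5, 1:5] = 1.0           # 覆盖触发器的扩散区域
```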

[CV-65] Revitalizing Reconstruction Models for Multi-class Anomaly Detection via Class-Aware Contrastive Learning

【速读】: 该论文试图解决在异常检测(Anomaly Detection, AD)中,从单类模型扩展到多类模型时性能下降的问题。解决方案的关键在于引入类别感知的对比学习(Class-aware Contrastive Learning, CL),通过利用原始对象类别信息(如地毯或木材)作为监督信号,分别在局部和全局尺度上进行对比学习。局部对比学习用于微调多尺度特征,而全局对比学习则用于学习更紧凑的正常模式特征表示,从而有效适应多类设置。实验结果表明,该方法在四个数据集(超过60个类别)上显著提升了性能,优于现有先进方法。

链接: https://arxiv.org/abs/2412.04769
作者: Lei Fan,Junjie Huang,Donglin Di,Anyang Su,Maurice Pagnucco,Yang Song
关键词-EN: train separate models, anomaly detection, resource management, approaches often train, train separate
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: this https URL

点击查看摘要

Abstract:For anomaly detection (AD), early approaches often train separate models for individual classes, yielding high performance but posing challenges in scalability and resource management. Recent efforts have shifted toward training a single model capable of handling multiple classes. However, directly extending early AD methods to multi-class settings often results in degraded performance. In this paper, we analyze this degradation observed in reconstruction-based methods, identifying two key issues: catastrophic forgetting and inter-class confusion. To this end, we propose a plug-and-play modification by incorporating class-aware contrastive learning (CL). By explicitly leveraging raw object category information (e.g., carpet or wood) as supervised signals, we apply local CL to fine-tune multiscale features and global CL to learn more compact feature representations of normal patterns, thereby effectively adapting the models to multi-class settings. Experiments across four datasets (over 60 categories) verify the effectiveness of our approach, yielding significant improvements and superior performance compared to advanced methods. Notably, ablation studies show that even using pseudo-class labels can achieve comparable performance.
zh

[CV-66] Latent Space Characterization of Autoencoder Variants

【速读】: 该论文试图解决的问题是如何理解和表征深度学习模型(特别是自编码器)所学习的潜在空间(latent spaces)的结构和特性。解决方案的关键在于通过分析不同类型的自编码器(包括卷积自编码器 (CAE)、去噪自编码器 (DAE) 和变分自编码器 (VAE))的潜在空间,揭示它们在输入扰动下的变化规律。论文通过将潜在空间的矩阵流形映射到希尔伯特空间,并利用距离保持变换,展示了CAE和DAE的潜在流形是分层的平滑乘积流形,而VAE的潜在流形则是由两个对称正定矩阵和一个对称半正定矩阵构成的平滑乘积流形。这一分析不仅解释了为何VAE的潜在空间比CAE和DAE的更平滑,还提供了一种新的视角来理解输入失真对潜在空间结构的影响。

链接: https://arxiv.org/abs/2412.04755
作者: Anika Shrivastava,Renu Rameshan,Samar Agnihotri
关键词-EN: generate complex data, deep learning models, latent spaces learned, Understanding the latent, latent spaces
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Information Theory (cs.IT)
备注: 8 pages, 6 figures, and 1 table

点击查看摘要

Abstract:Understanding the latent spaces learned by deep learning models is crucial in exploring how they represent and generate complex data. Autoencoders (AEs) have played a key role in the area of representation learning, with numerous regularization techniques and training principles developed not only to enhance their ability to learn compact and robust representations, but also to reveal how different architectures influence the structure and smoothness of the lower-dimensional non-linear manifold. We strive to characterize the structure of the latent spaces learned by different autoencoders including convolutional autoencoders (CAEs), denoising autoencoders (DAEs), and variational autoencoders (VAEs) and how they change with the perturbations in the input. By characterizing the matrix manifolds corresponding to the latent spaces, we provide an explanation for the well-known observation that the latent spaces of CAE and DAE form non-smooth manifolds, while that of VAE forms a smooth manifold. We also map the points of the matrix manifold to a Hilbert space using distance preserving transforms and provide an alternate view in terms of the subspaces generated in the Hilbert space as a function of the distortion in the input. The results show that the latent manifolds of CAE and DAE are stratified with each stratum being a smooth product manifold, while the manifold of VAE is a smooth product manifold of two symmetric positive definite matrices and a symmetric positive semi-definite matrix.
zh

[CV-67] Machine learning algorithms to predict the risk of rupture of intracranial aneurysms: a systematic review

【速读】: 该论文旨在解决颅内动脉瘤破裂风险的预测问题,特别是评估机器学习算法在这一预测任务中的表现。解决方案的关键在于利用机器学习模型(Machine Learning Models)来分析和预测颅内动脉瘤的破裂风险。研究通过系统综述方法,筛选并分析了20项相关研究,涵盖了20,286例动脉瘤病例。这些模型在预测准确性上表现不一,范围从0.66到0.90。尽管机器学习模型显示出一定的预测能力,但现有证据并未全面证明其优于当前临床实践,因此其作为临床辅助工具的角色受到限制。论文强调,需要进一步的前瞻性多中心研究来验证这些机器学习工具的临床效用,才能在临床中实施。

链接: https://arxiv.org/abs/2412.04749
作者: Karan Daga,Siddharth Agarwal,Zaeem Moti,Matthew BK Lee,Munaib Din,David Wood,Marc Modat,Thomas C Booth
关键词-EN: potentially fatal consequence, Subarachnoid haemorrhage, machine learning, intracranial aneurysm, intracranial aneurysm rupture
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Quantitative Methods (q-bio.QM)
备注: Clin Neuroradiol (2024)

点击查看摘要

Abstract:Purpose: Subarachnoid haemorrhage is a potentially fatal consequence of intracranial aneurysm rupture, however, it is difficult to predict if aneurysms will rupture. Prophylactic treatment of an intracranial aneurysm also involves risk, hence identifying rupture-prone aneurysms is of substantial clinical importance. This systematic review aims to evaluate the performance of machine learning algorithms for predicting intracranial aneurysm rupture risk. Methods: MEDLINE, Embase, Cochrane Library and Web of Science were searched until December 2023. Studies incorporating any machine learning algorithm to predict the risk of rupture of an intracranial aneurysm were included. Risk of bias was assessed using the Prediction Model Risk of Bias Assessment Tool (PROBAST). PROSPERO registration: CRD42023452509. Results: Out of 10,307 records screened, 20 studies met the eligibility criteria for this review incorporating a total of 20,286 aneurysm cases. The machine learning models gave a 0.66-0.90 range for performance accuracy. The models were compared to current clinical standards in six studies and gave mixed results. Most studies posed high or unclear risks of bias and concerns for applicability, limiting the inferences that can be drawn from them. There was insufficient homogenous data for a meta-analysis. Conclusions: Machine learning can be applied to predict the risk of rupture for intracranial aneurysms. However, the evidence does not comprehensively demonstrate superiority to existing practice, limiting its role as a clinical adjunct. Further prospective multicentre studies of recent machine learning tools are needed to prove clinical validation before they are implemented in the clinic. 
Comments: Clin Neuroradiol (2024). Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Quantitative Methods (q-bio.QM). Cite as: arXiv:2412.04749 [cs.CV] (or arXiv:2412.04749v1 [cs.CV] for this version), https://doi.org/10.48550/arXiv.2412.04749. Related DOI: https://doi.org/10.1007/s00062-024-01474-4
zh

[CV-68] Decomposed Distribution Matching in Dataset Condensation

【速读】: 该论文试图解决数据集压缩 (Dataset Condensation, DC) 在避免昂贵的双层优化 (bi-level optimization) 时性能下降的问题。解决方案的关键在于将数据集分布分解为内容和风格,并通过匹配原始数据与压缩数据之间的风格信息(使用特征图的统计矩作为风格指标)以及增强压缩数据集的类内多样性(通过最大化每个合成类内的Kullback-Leibler散度)来提升DC的性能。实验结果表明,该方法在多个数据集上显著提高了压缩数据集的有效性,最高提升了5.5%的持续学习准确率。

链接: https://arxiv.org/abs/2412.04748
作者: Sahar Rahimi Malakshan,Mohammad Saeed Ebrahimi Saadabadi,Ali Dabouei,Nasser M. Nasrabadi
关键词-EN: reduce deep neural, deep neural networks, neural networks training, networks training efforts, aims to reduce
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Dataset Condensation (DC) aims to reduce deep neural networks training efforts by synthesizing a small dataset such that it will be as effective as the original large dataset. Conventionally, DC relies on a costly bi-level optimization which prohibits its practicality. Recent research formulates DC as a distribution matching problem which circumvents the costly bi-level optimization. However, this efficiency sacrifices the DC performance. To investigate this performance degradation, we decomposed the dataset distribution into content and style. Our observations indicate two major shortcomings of: 1) style discrepancy between original and condensed data, and 2) limited intra-class diversity of condensed dataset. We present a simple yet effective method to match the style information between original and condensed data, employing statistical moments of feature maps as well-established style indicators. Moreover, we enhance the intra-class diversity by maximizing the Kullback-Leibler divergence within each synthetic class, i.e., content. We demonstrate the efficacy of our method through experiments on diverse datasets of varying size and resolution, achieving improvements of up to 4.1% on CIFAR10, 4.2% on CIFAR100, 4.3% on TinyImageNet, 2.0% on ImageNet-1K, 3.3% on ImageWoof, 2.5% on ImageNette, and 5.5% in continual learning accuracy.
zh
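
上文以特征图的统计矩作为风格指标。下面的 NumPy 片段示意如何用各通道的空间均值与标准差构造一个简单的风格匹配损失(仅为示意,非论文原实现;损失的具体组合方式为假设):

```python
import numpy as np

def style_moments(feat):
    # feat: (N, C, H, W);各通道在空间维度上的均值/标准差作为风格指标
    return feat.mean(axis=(2, 3)), feat.std(axis=(2, 3))

def style_loss(feat_real, feat_syn):
    mu_r, sd_r = style_moments(feat_real)
    mu_s, sd_s = style_moments(feat_syn)
    # 匹配批均值化后的一阶、二阶统计矩
    return (((mu_r.mean(0) - mu_s.mean(0)) ** 2).sum()
            + ((sd_r.mean(0) - sd_s.mean(0)) ** 2).sum())

rng = np.random.default_rng(0)
real = rng.normal(size=(16, 4, 8, 8))
```

论文中还通过在每个合成类内最大化 KL 散度来提升类内多样性,此处未展开。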

[CV-69] Fair Diagnosis: Leveraging Causal Modeling to Mitigate Medical Bias

【速读】: 该论文试图解决医学图像分析中模型预测受敏感属性(如种族和性别)影响的问题,导致诊断结果存在公平性问题和潜在偏见。解决方案的关键在于提出了一种因果建模框架,通过引入新的公平性标准——诊断公平性(Diagnosis Fairness)和利用路径特定公平性(path-specific fairness)来控制人口统计属性的影响,确保预测主要由临床相关特征而非敏感属性驱动。此外,该框架通过整合对抗性扰动掩码(adversarial perturbation masks),引导模型关注关键图像区域,抑制偏见信息,从而在保持诊断准确性的同时有效减少与敏感属性直接相关的偏见。

链接: https://arxiv.org/abs/2412.04739
作者: Bowei Tian,Yexiao He,Meng Liu,Yucong Dai,Ziyao Wang,Shwai He,Guoheng Sun,Zheyu Shen,Wanghao Ye,Yongkai Wu,Ang Li
关键词-EN: medical image analysis, race and gender, sensitive attributes, concerns and potential, potential biases
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:In medical image analysis, model predictions can be affected by sensitive attributes, such as race and gender, leading to fairness concerns and potential biases in diagnostic outcomes. To mitigate this, we present a causal modeling framework, which aims to reduce the impact of sensitive attributes on diagnostic predictions. Our approach introduces a novel fairness criterion, \textbfDiagnosis Fairness, and a unique fairness metric, leveraging path-specific fairness to control the influence of demographic attributes, ensuring that predictions are primarily informed by clinically relevant features rather than sensitive attributes. By incorporating adversarial perturbation masks, our framework directs the model to focus on critical image regions, suppressing bias-inducing information. Experimental results across multiple datasets demonstrate that our framework effectively reduces bias directly associated with sensitive attributes while preserving diagnostic accuracy. Our findings suggest that causal modeling can enhance both fairness and interpretability in AI-powered clinical decision support systems.
zh

[CV-70] Espresso: High Compression For Rich Extraction From Videos for Your Vision-Language Model

【速读】: 该论文试图解决当前视觉-语言模型 (Vision-Language Models, VLMs) 在处理长视频时面临的挑战,主要问题在于这些模型无法有效利用大量帧信息。解决方案的关键在于提出了一种名为 Espresso 的新方法,该方法通过分别提取和压缩空间和时间信息来增强对长视频的理解能力。具体来说,Espresso 通过空间和时间压缩的独立处理,显著提升了模型对长视频的理解能力,并且在结合使用时效果更为显著。此外,Espresso 在更多训练数据的情况下表现出色,并且在处理长视频时远优于现有的投影方法。论文还设计了一个名为“needle-in-a-haystack”的更难评估设置,用于测试 Espresso 在处理更长视频时的性能,结果显示 Espresso 在该任务上达到了最先进的性能,超越了那些在更多训练数据上训练的 SOTA VLMs。

链接: https://arxiv.org/abs/2412.04729
作者: Keunwoo Peter Yu,Achal Dave,Rares Ambrus,Jean Mercat
关键词-EN: current vision-language models, understand videos longer, vision-language models, current vision-language, struggle to understand
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 11 pages

点击查看摘要

Abstract:Most of the current vision-language models (VLMs) for videos struggle to understand videos longer than a few seconds. This is primarily due to the fact that they do not scale to utilizing a large number of frames. In order to address this limitation, we propose Espresso, a novel method that extracts and compresses spatial and temporal information separately. Through extensive evaluations, we show that spatial and temporal compression in Espresso each have a positive impact on the long-form video understanding capabilities; when combined, their positive impact increases. Furthermore, we show that Espresso’s performance scales well with more training data, and that Espresso is far more effective than the existing projectors for VLMs in long-form video understanding. Moreover, we devise a more difficult evaluation setting for EgoSchema called “needle-in-a-haystack” that multiplies the lengths of the input videos. Espresso achieves SOTA performance on this task, outperforming the SOTA VLMs that have been trained on much more training data.
zh

[CV-71] Mix-Modality Person Re-Identification: A New and Practical Paradigm

【速读】: 该论文试图解决在可见光-红外跨模态行人重识别(Visible-Infrared person re-identification, VI-ReID)中,现有方法在新提出的混合模态检索范式(Mix-Modality Retrieval Paradigm)下性能显著下降的问题。解决方案的关键在于提出了两种新的策略:1) 跨身份判别协调损失(Cross-Identity Discrimination Harmonization Loss, CIDHL),通过调整超球面特征空间中样本的分布,使相同身份的样本中心更接近,不同身份的样本中心更远离,同时聚合同一模态和同一身份的样本;2) 模态桥相似度优化策略(Modality Bridge Similarity Optimization Strategy, MBSOS),利用图库中的相似桥样本优化查询样本与被查询样本之间的跨模态相似度。这两种策略的结合显著提升了混合模态行人重识别任务(Mix-Modality person re-identification, MM-ReID)的性能。

链接: https://arxiv.org/abs/2412.04719
作者: Wei Liu,Xin Xu,Hua Chang,Xin Yuan,Zheng Wang
关键词-EN: bi-modality mutual retrieval, mutual retrieval paradigm, Current visible-infrared cross-modality, practical mix-modality retrieval, person re-identification research
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Current visible-infrared cross-modality person re-identification research has only focused on exploring the bi-modality mutual retrieval paradigm, and we propose a new and more practical mix-modality retrieval paradigm. Existing Visible-Infrared person re-identification (VI-ReID) methods have achieved some results in the bi-modality mutual retrieval paradigm by learning the correspondence between visible and infrared modalities. However, significant performance degradation occurs due to the modality confusion problem when these methods are applied to the new mix-modality paradigm. Therefore, this paper proposes a Mix-Modality person re-identification (MM-ReID) task, explores the influence of modality mixing ratio on performance, and constructs mix-modality test sets for existing datasets according to the new mix-modality testing paradigm. To solve the modality confusion problem in MM-ReID, we propose a Cross-Identity Discrimination Harmonization Loss (CIDHL) adjusting the distribution of samples in the hyperspherical feature space, pulling the centers of samples with the same identity closer, and pushing away the centers of samples with different identities while aggregating samples with the same modality and the same identity. Furthermore, we propose a Modality Bridge Similarity Optimization Strategy (MBSOS) to optimize the cross-modality similarity between the query and queried samples with the help of the similar bridge sample in the gallery. Extensive experiments demonstrate that compared to the original performance of existing cross-modality methods on MM-ReID, the addition of our CIDHL and MBSOS demonstrates a general improvement.
zh
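
CIDHL 的核心思想是在超球面特征空间中拉近同身份样本中心、推远不同身份中心。下面给出一个体现该思想的玩具损失(`margin` 及各项的组合方式均为假设,并非论文原公式):

```python
import numpy as np

def l2n(x):
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + 1e-12)

def center_harmonization_loss(emb, labels, margin=0.5):
    """玩具版超球面中心损失:样本向本身份中心聚拢,不同身份中心互相推开。"""
    z = l2n(emb)
    ids = np.unique(labels)
    centers = l2n(np.stack([z[labels == i].mean(0) for i in ids]))
    # 类内项:1 - cos(样本, 其身份中心)
    intra = np.mean([(1.0 - z[labels == i] @ centers[k]).mean()
                     for k, i in enumerate(ids)])
    # 类间项:对不同身份中心间的余弦相似度施加 hinge 惩罚
    sims = centers @ centers.T
    iu = np.triu_indices(len(ids), k=1)
    inter = np.maximum(0.0, sims[iu] - (1.0 - margin)).mean()
    return intra + inter

rng = np.random.default_rng(0)
a = rng.normal([5, 0, 0], 0.1, size=(20, 3))
b = rng.normal([0, 5, 0], 0.1, size=(20, 3))
labels = np.array([0] * 20 + [1] * 20)
sep = center_harmonization_loss(np.vstack([a, b]), labels)       # 身份分得开
mixed = center_harmonization_loss(
    np.vstack([a, a + rng.normal(0, 0.1, a.shape)]), labels)     # 身份重叠
```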

[CV-72] Addressing Attribute Leakages in Diffusion-based Image Editing without Training

【速读】: 该论文试图解决图像编辑中扩散模型面临的属性泄露问题,即在非目标区域或目标区域内由于属性干扰而发生的意外修改。解决方案的关键在于提出了一种新的框架,包含三个主要组件:(1) 对象限制嵌入 (Object-Restricted Embeddings, ORE),用于在文本嵌入中定位对象特定的属性;(2) 区域引导的交叉注意力掩码 (Region-Guided Blending for Cross-Attention Masking, RGB-CAM),用于将注意力与目标区域对齐;(3) 背景混合 (Background Blending, BB),用于保留未编辑的区域。此外,论文还引入了ALE-Bench,一个用于评估属性泄露的基准,提供了新的目标外部和目标内部泄露的评估指标。实验结果表明,该框架在显著减少属性泄露的同时,保持了高编辑质量,为多对象图像编辑提供了一种高效且无需调优的解决方案。

链接: https://arxiv.org/abs/2412.04715
作者: Sunung Mun,Jinhwan Nam,Sunghyun Cho,Jungseul Ok
关键词-EN: Diffusion models, offering flexibility, flexibility with language, language prompts, prompts and source
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Diffusion models have become a cornerstone in image editing, offering flexibility with language prompts and source images. However, a key challenge is attribute leakage, where unintended modifications occur in non-target regions or within target regions due to attribute interference. Existing methods often suffer from leakage due to naive text embeddings and inadequate handling of End-of-Sequence (EOS) token embeddings. We propose a novel framework to address attribute leakage with three components: (1) Object-Restricted Embeddings (ORE) to localize object-specific attributes in text embeddings, (2) Region-Guided Blending for Cross-Attention Masking (RGB-CAM) to align attention with target regions, and (3) Background Blending (BB) to preserve non-edited regions. Additionally, we introduce ALE-Bench, a benchmark for evaluating attribute leakage with new metrics for target-external and target-internal leakage. Experiments demonstrate that our framework significantly reduces attribute leakage while maintaining high editing quality, providing an efficient and tuning-free solution for multi-object image editing.
zh

[CV-73] PCTreeS: 3D Point Cloud Tree Species Classification Using Airborne LiDAR Images

【速读】: 该论文试图解决大规模森林数据中树种分布的自动分类问题,特别是针对热带稀树草原生态系统。解决方案的关键在于利用先进的深度学习模型,特别是通过直接将三维点云图像输入到视觉变换器模型(PCTreeS)中,而不是将三维图像投影为二维图像后再使用卷积神经网络(CNN)。这种方法不仅提高了分类的准确性(AUC达到0.81,总体准确率达到0.72),还显著缩短了训练时间(约45分钟)。此外,论文采用了低分辨率但更具可扩展性的机载激光雷达(Airborne LiDAR)图像,以克服传统地面激光雷达(Terrestrial LiDAR)图像在数据收集和处理上的局限性。

链接: https://arxiv.org/abs/2412.04714
作者: Hongjin Lin,Matthew Nazari,Derek Zheng
关键词-EN: Reliable large-scale data, monitoring ecosystem health, carbon stock, climate change, Reliable large-scale
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Reliable large-scale data on the state of forests is crucial for monitoring ecosystem health, carbon stock, and the impact of climate change. Current knowledge of tree species distribution relies heavily on manual data collection in the field, which often takes years to complete, resulting in limited datasets that cover only a small subset of the world’s forests. Recent works show that state-of-the-art deep learning models using Light Detection and Ranging (LiDAR) images enable accurate and scalable classification of tree species in various ecosystems. While LiDAR images contain rich 3D information, most previous works flatten the 3D images into 2D projections to use Convolutional Neural Networks (CNNs). This paper offers three significant contributions: (1) we apply the deep learning framework for tree classification in tropical savannas; (2) we use Airborne LiDAR images, which have a lower resolution but greater scalability than Terrestrial LiDAR images used in most previous works; (3) we introduce the approach of directly feeding 3D point cloud images into a vision transformer model (PCTreeS). Our results show that the PCTreeS approach outperforms current CNN baselines with 2D projections in AUC (0.81), overall accuracy (0.72), and training time (~45 mins). This paper also motivates further LiDAR image collection and validation for accurate large-scale automatic classification of tree species.
zh

[CV-74] Parametric-ControlNet: Multimodal Control in Foundation Models for Precise Engineering Design Synthesis

【速读】: 该论文试图解决在工程设计合成中,如何通过多模态控制来增强文本到图像生成式 AI (Generative AI) 模型的精确性和多样性的问题。解决方案的关键在于提出了一种多模态控制模型,该模型整合了参数控制、图像控制和文本控制三种模态。具体来说,模型通过扩散模型处理部分和完整的参数输入,利用组件编码器处理输入的组件图像,并通过 CLIP 编码器整合文本描述,最终通过多模态融合技术生成联合嵌入,作为 ControlNet 模块的输入,从而实现对基础生成模型的稳健多模态控制。这一方法显著提升了 AI 驱动设计工具的复杂性和精确性,展示了基于多样数据模态的精确控制对于增强设计生成的重要性。

链接: https://arxiv.org/abs/2412.04707
作者: Rui Zhou,Yanxia Zhang,Chenyang Yuan,Frank Permenter,Nikos Arechiga,Matt Klenk,Faez Ahmed
关键词-EN: Stable Diffusion, generative model designed, specifically tailored, paper introduces, engineering design synthesis
类目: Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE); Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC)
备注:

点击查看摘要

Abstract:This paper introduces a generative model designed for multimodal control over text-to-image foundation generative AI models such as Stable Diffusion, specifically tailored for engineering design synthesis. Our model proposes parametric, image, and text control modalities to enhance design precision and diversity. Firstly, it handles both partial and complete parametric inputs using a diffusion model that acts as a design autocomplete co-pilot, coupled with a parametric encoder to process the information. Secondly, the model utilizes assembly graphs to systematically assemble input component images, which are then processed through a component encoder to capture essential visual data. Thirdly, textual descriptions are integrated via CLIP encoding, ensuring a comprehensive interpretation of design intent. These diverse inputs are synthesized through a multimodal fusion technique, creating a joint embedding that acts as the input to a module inspired by ControlNet. This integration allows the model to apply robust multimodal control to foundation models, facilitating the generation of complex and precise engineering designs. This approach broadens the capabilities of AI-driven design tools and demonstrates significant advancements in precise control based on diverse data modalities for enhanced design generation.
zh

[CV-75] Superpixel Tokenization for Vision Transformers: Preserving Semantic Integrity in Visual Tokens

【速读】: 该论文试图解决视觉Transformer (Vision Transformer, ViT) 中基于网格的图像分块 (grid-based tokenization) 可能导致单个token内包含多个视觉概念的问题。解决方案的关键在于采用超像素分块 (superpixel tokenization),通过超像素生成仅包含单一视觉概念的token。然而,超像素的多样形状、大小和位置使得将其整合到ViT分块过程中颇具挑战。为此,论文提出了一种由预聚合提取和超像素感知聚合组成的分块流程,成功克服了超像素分块的难题,并展示了该方法在现有框架中的强大兼容性,显著提升了ViT在各种下游任务中的准确性和鲁棒性。

链接: https://arxiv.org/abs/2412.04680
作者: Jaihyun Lew,Soohyuk Jang,Jaehoon Lee,Seungryong Yoo,Eunji Kim,Saehyung Lee,Jisoo Mok,Siwon Kim,Sungroh Yoon
关键词-EN: Natural Language Processing, Language Processing, Natural Language, groundbreaking architecture proposed, achieved remarkable success
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Transformers, a groundbreaking architecture proposed for Natural Language Processing (NLP), have also achieved remarkable success in Computer Vision. A cornerstone of their success lies in the attention mechanism, which models relationships among tokens. While the tokenization process in NLP inherently ensures that a single token does not contain multiple semantics, the tokenization of Vision Transformer (ViT) utilizes tokens from uniformly partitioned square image patches, which may result in an arbitrary mixing of visual concepts in a token. In this work, we propose to substitute the grid-based tokenization in ViT with superpixel tokenization, which employs superpixels to generate a token that encapsulates a sole visual concept. Unfortunately, the diverse shapes, sizes, and locations of superpixels make integrating superpixels into ViT tokenization rather challenging. Our tokenization pipeline, comprised of pre-aggregate extraction and superpixel-aware aggregation, overcomes the challenges that arise in superpixel tokenization. Extensive experiments demonstrate that our approach, which exhibits strong compatibility with existing frameworks, enhances the accuracy and robustness of ViT on various downstream tasks.
zh
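
超像素感知聚合的一种直观做法,是对落入同一超像素的稠密特征做均值池化,得到每个超像素一个 token。以下为一个极简示意(真实流程还包含预聚合提取等步骤,此处的均值池化为假设的简化):

```python
import numpy as np

def superpixel_tokens(features, sp_labels):
    """features: (H, W, D) 稠密特征;sp_labels: (H, W) 超像素编号。
    每个超像素均值池化为一个 token,使单个 token 对应单一视觉概念。"""
    flat_f = features.reshape(-1, features.shape[-1])
    flat_l = sp_labels.reshape(-1)
    return np.stack([flat_f[flat_l == i].mean(0)
                     for i in np.unique(sp_labels)])

feats = np.zeros((4, 4, 2))
feats[:, :2] = [1.0, 0.0]   # 左半区域的特征
feats[:, 2:] = [0.0, 1.0]   # 右半区域的特征
labels = np.zeros((4, 4), dtype=int)
labels[:, 2:] = 1           # 两个超像素
tokens = superpixel_tokens(feats, labels)
```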

[CV-76] Unsupervised Segmentation by Diffusing Walking and Cutting

【速读】: 该论文试图解决无监督图像分割问题,其解决方案的关键在于利用预训练的文本到图像扩散模型中的特征,特别是自注意力层中的概率分布,来构建图像块之间的邻接矩阵。通过递归应用归一化割(Normalised Cuts)算法,论文提出了一种基于随机游走(Random Walk)的分割方法,能够在不进行额外训练的情况下,实现图像的分层语义分割。该方法的核心在于利用自注意力机制捕捉图像块之间的语义关系,并通过随机游走归一化割直接在自注意力激活上进行图像分割,从而最大化簇内一致性并最小化簇间过渡概率。此外,论文还探讨了如何从特征中构建归一化割邻接矩阵,并提出了一种自动确定归一化割成本标准的方法,避免了手动调参的需求。实验结果表明,该方法在零样本无监督分割任务中超越了现有方法,达到了COCO-Stuff-27和Cityscapes数据集上的最先进水平。

链接: https://arxiv.org/abs/2412.04678
作者: Daniela Ivanova,Marco Aversa,Paul Henderson,John Williamson
关键词-EN: diffusion models, Walk Normalized Cuts, Random Walk Normalized, Normalised Cuts, Random Walk
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:We propose an unsupervised image segmentation method using features from pre-trained text-to-image diffusion models. Inspired by classic spectral clustering approaches, we construct adjacency matrices from self-attention layers between image patches and recursively partition using Normalised Cuts. A key insight is that self-attention probability distributions, which capture semantic relations between patches, can be interpreted as a transition matrix for random walks across the image. We leverage this by first using Random Walk Normalized Cuts directly on these self-attention activations to partition the image, minimizing transition probabilities between clusters while maximizing coherence within clusters. Applied recursively, this yields a hierarchical segmentation that reflects the rich semantics in the pre-trained attention layers, without any additional training. Next, we explore other ways to build the NCuts adjacency matrix from features, and how we can use the random walk interpretation of self-attention to capture long-range relationships. Finally, we propose an approach to automatically determine the NCut cost criterion, avoiding the need to tune this manually. We quantitatively analyse the effect incorporating different features, a constant versus dynamic NCut threshold, and incorporating multi-node paths when constructing the NCuts adjacency matrix. We show that our approach surpasses all existing methods for zero-shot unsupervised segmentation, achieving state-of-the-art results on COCO-Stuff-27 and Cityscapes.
zh
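
将自注意力分布视为随机游走的转移矩阵后,两路归一化割 (Normalized Cut) 可通过归一化拉普拉斯矩阵第二小特征向量(Fiedler 向量)实现。以下 NumPy 示意在一个合成亲和矩阵上做一次二分(以中位数为阈值是此处的假设做法,论文还包含递归划分与自动确定代价标准等内容):

```python
import numpy as np

def ncut_bipartition(W):
    """对对称非负亲和矩阵 W 做两路归一化割。
    取归一化拉普拉斯的第二小特征向量,再换回随机游走视角下的特征向量。"""
    d = W.sum(1)
    d_inv_sqrt = 1.0 / np.sqrt(d + 1e-12)
    L_sym = np.eye(len(W)) - (d_inv_sqrt[:, None] * W * d_inv_sqrt[None, :])
    vals, vecs = np.linalg.eigh(L_sym)      # 特征值升序排列
    fiedler = vecs[:, 1] * d_inv_sqrt       # 随机游走特征向量
    return fiedler > np.median(fiedler)

# 两个内部强连接、之间弱连接的块应被分开
W = np.ones((6, 6)) * 0.01
W[:3, :3] = 1.0
W[3:, 3:] = 1.0
part = ncut_bipartition(W)
```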

[CV-77] Socially-Informed Reconstruction for Pedestrian Trajectory Forecasting WACV

【速读】: 该论文试图解决行人轨迹预测问题,特别是在复杂的社会互动动态中的预测挑战。解决方案的关键在于提出了一种结合重构器和条件变分自编码器(conditional variational autoencoder)的轨迹预测模型。该模型通过生成伪轨迹(pseudo-trajectories)作为训练过程中的数据增强手段,以学习有效的社会互动表征。此外,论文还引入了一种新的社会损失(social loss),以引导模型在预测中更加关注社会互动,从而生成更稳定的轨迹。通过在ETH/UCY和SDD基准上的广泛实验,该方法展示了优于现有最先进方法的性能。

链接: https://arxiv.org/abs/2412.04673
作者: Haleh Damirchi,Ali Etemad,Michael Greenspan
关键词-EN: trajectory prediction remains, Pedestrian trajectory prediction, autonomous systems, prediction remains, remains a challenge
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Accepted at Winter Conference on Applications of Computer Vision (WACV), 2025

点击查看摘要

Abstract:Pedestrian trajectory prediction remains a challenge for autonomous systems, particularly due to the intricate dynamics of social interactions. Accurate forecasting requires a comprehensive understanding not only of each pedestrian’s previous trajectory but also of their interaction with the surrounding environment, an important part of which are other pedestrians moving dynamically in the scene. To learn effective socially-informed representations, we propose a model that uses a reconstructor alongside a conditional variational autoencoder-based trajectory forecasting module. This module generates pseudo-trajectories, which we use as augmentations throughout the training process. To further guide the model towards social awareness, we propose a novel social loss that aids in forecasting more stable trajectories. We validate our approach through extensive experiments, demonstrating strong performance in comparison to state-of-the-art methods on the ETH/UCY and SDD benchmarks.
zh

[CV-78] Diffusion-Augmented Coreset Expansion for Scalable Dataset Distillation

【速读】: 该论文试图解决神经网络规模快速扩展带来的数据存储和通信需求增加的问题,特别是高分辨率数据和复杂架构下的计算效率挑战。解决方案的关键在于提出了一种两阶段的方法:首先通过选择最具信息量的图像块来压缩数据集,形成一个核心集(coreset);然后利用生成式基础模型(generative foundation model)实时动态扩展这个压缩集,提高图像块的分辨率并引入可控的多样性。这种方法不仅提高了数据集蒸馏的效率和质量,还显著提升了在多个大规模数据集蒸馏基准测试中的性能,相比现有最先进方法提升了超过10%。

链接: https://arxiv.org/abs/2412.04668
作者: Ali Abbasi,Shima Imani,Chenyang An,Gayathri Mahalingam,Harsh Shrivastava,Maurice Diesendruck,Hamed Pirsiavash,Pramod Sharma,Soheil Kolouri
关键词-EN: neural networks, demands have intensified, rapid scaling, scaling of neural, storage and communication
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:With the rapid scaling of neural networks, data storage and communication demands have intensified. Dataset distillation has emerged as a promising solution, condensing information from extensive datasets into a compact set of synthetic samples by solving a bilevel optimization problem. However, current methods face challenges in computational efficiency, particularly with high-resolution data and complex architectures. Recently, knowledge-distillation-based dataset condensation approaches have made this process more computationally feasible. Yet, with the recent developments of generative foundation models, there is now an opportunity to achieve even greater compression, enhance the quality of distilled data, and introduce valuable diversity into the data representation. In this work, we propose a two-stage solution. First, we compress the dataset by selecting only the most informative patches to form a coreset. Next, we leverage a generative foundation model to dynamically expand this compressed set in real-time, enhancing the resolution of these patches and introducing controlled variability to the coreset. Our extensive experiments demonstrate the robustness and efficiency of our approach across a range of dataset distillation benchmarks. We demonstrate a significant improvement of over 10% compared to the state-of-the-art on several large-scale dataset distillation benchmarks. The code will be released soon.

[CV-79] LAA-Net: A Physical-prior-knowledge Based Network for Robust Nighttime Depth Estimation

[Quick Read]: This paper targets the poor nighttime performance of existing self-supervised monocular depth estimation (MDE) models. Existing models improve nighttime performance by using generative adversarial networks (GANs) to translate nighttime images into daytime versions, but the complexity of real-world daytime lighting variations can make the resulting estimates inaccurate. The key to the proposed solution is exploiting physical prior knowledge, specifically the wavelength and attenuation characteristics of light at night. Concretely, the proposed Light-Attenuation-Aware Network (LAA-Net) builds on Rayleigh scattering theory: because red light has a longer wavelength, it preserves more information in nighttime scenes, so training is based primarily on red-channel values. In addition, the paper introduces a Red Channel Attenuation (RCA) loss based on the Beer-Lambert law to guide LAA-Net's training, achieving performance superior to state-of-the-art models on multiple datasets.

Link: https://arxiv.org/abs/2412.04666
Authors: Kebin Peng,Haotang Li,Zhenyu Qi,Huashan Chen,Zi Wang,Wei Zhang,Sen He
Keywords: Existing self-supervised monocular, Existing self-supervised, self-supervised monocular depth, improve nighttime performance, transfer nighttime images
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Existing self-supervised monocular depth estimation (MDE) models attempt to improve nighttime performance by using GANs to transfer nighttime images into their daytime versions. However, this can introduce inconsistencies due to the complexities of real-world daytime lighting variations, which may finally lead to inaccurate estimation results. To address this issue, we leverage physical-prior-knowledge about light wavelength and light attenuation during nighttime. Specifically, our model, Light-Attenuation-Aware Network (LAA-Net), incorporates physical insights from Rayleigh scattering theory for robust nighttime depth estimation: LAA-Net is trained based on red channel values because red light preserves more information under nighttime scenarios due to its longer wavelength. Additionally, based on Beer-Lambert law, we introduce Red Channel Attenuation (RCA) loss to guide LAA-Net’s training. Experiments on the RobotCar-Night, nuScenes-Night, RobotCar-Day, and KITTI datasets demonstrate that our model outperforms SOTA models.
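The Beer-Lambert idea behind the RCA loss can be sketched numerically: under the law, observed red intensity decays exponentially with distance, so a predicted depth map implies a red channel that can be compared against the observed one. The constants and function names below are assumptions for illustration; the paper's actual loss formulation may differ.

```python
import numpy as np

def rca_loss(pred_depth, red_channel, r0=1.0, beta=0.1):
    """Red Channel Attenuation sketch: Beer-Lambert gives
    R = r0 * exp(-beta * d), so penalize the mismatch between the red
    intensity implied by the predicted depth and the observed red channel."""
    implied_red = r0 * np.exp(-beta * pred_depth)
    return float(np.mean((implied_red - red_channel) ** 2))

depth = np.array([[1.0, 2.0], [3.0, 4.0]])
red = np.exp(-0.1 * depth)              # observation consistent with depth
loss_good = rca_loss(depth, red)        # ~0: depth explains the red channel
loss_bad = rca_loss(depth + 1.0, red)   # offset depth is penalized
```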

[CV-80] ProPLIKS: Probablistic 3D human body pose estimation

[Quick Read]: This paper tackles uncertainty in 3D human pose estimation. The key to the solution is probabilistic modeling, using normalizing flows in non-Euclidean geometries to handle uncertain poses. Specifically, the paper proposes a normalizing-flow model tailored to the SO(3) rotation group, with a coupling mechanism based on the Möbius transformation; it can accurately represent arbitrary distributions on SO(3) and thereby resolves discontinuity issues in pose estimation. Furthermore, the paper reinterprets reconstructing 3D human figures from 2D pixel-aligned inputs as mapping the inputs to a range of probable poses; this perspective acknowledges the task's intrinsic ambiguity and simplifies integration in multi-view scenarios. Combining these strategies, the paper demonstrates the effectiveness of probabilistic models for human pose estimation in complex scenarios and validates their superiority on several datasets.

Link: https://arxiv.org/abs/2412.04665
Authors: Karthik Shetty,Annette Birkhold,Bernhard Egger,Srikrishna Jaganathan,Norbert Strobel,Markus Kowarschik,Andreas Maier
Keywords: employing probabilistic modeling, human pose estimation, pose estimation, human pose, pose
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:We present a novel approach for 3D human pose estimation by employing probabilistic modeling. This approach leverages the advantages of normalizing flows in non-Euclidean geometries to address uncertain poses. Specifically, our method employs normalizing flow tailored to the SO(3) rotational group, incorporating a coupling mechanism based on the Möbius transformation. This enables the framework to accurately represent any distribution on SO(3), effectively addressing issues related to discontinuities. Additionally, we reinterpret the challenge of reconstructing 3D human figures from 2D pixel-aligned inputs as the task of mapping these inputs to a range of probable poses. This perspective acknowledges the intrinsic ambiguity of the task and facilitates a straightforward integration method for multi-view scenarios. The combination of these strategies showcases the effectiveness of probabilistic models in complex scenarios for human pose estimation techniques. Our approach notably surpasses existing methods in the field of pose estimation. We also validate our methodology on human pose estimation from RGB images as well as medical X-Ray datasets.

[CV-81] Multiclass Post-Earthquake Building Assessment Integrating Optical and SAR Satellite Imagery Ground Motion and Soil Data with Transformers

[Quick Read]: This paper addresses the timeliness and accuracy of post-earthquake building damage assessment. Conventional preliminary damage assessment (PDA) relies on manual door-to-door inspection, which is time-consuming and poses safety risks. The key to the solution is a metadata-enriched, transformer-based framework that combines high-resolution post-earthquake satellite imagery with building-specific metadata relevant to seismic performance. By integrating metadata such as seismic intensity indicators, soil properties, and SAR damage proxy maps, the model not only improves the accuracy and discriminability of multiclass damage identification but also generalizes better across regions. A detailed class-wise analysis of feature importance further reveals the distinct contribution of each metadata feature to predictions at different damage levels, enabling faster and more precise building-level multiclass damage assessment that can improve disaster response and accelerate recovery in affected communities.

Link: https://arxiv.org/abs/2412.04664
Authors: Deepank Singh,Vedhus Hoskere,Pietro Milillo
Keywords: crucial for effective, damage, Timely, Timely and accurate, satellite imagery
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Image and Video Processing (eess.IV)
Comments: 28 Pages, 12 Figures

Abstract:Timely and accurate assessments of building damage are crucial for effective response and recovery in the aftermath of earthquakes. Conventional preliminary damage assessments (PDA) often rely on manual door-to-door inspections, which are not only time-consuming but also pose significant safety risks. To safely expedite the PDA process, researchers have studied the applicability of satellite imagery processed with heuristic and machine learning approaches. These approaches output binary or, more recently, multiclass damage states at the scale of a block or a single building. However, the current performance of such approaches limits practical applicability. To address this limitation, we introduce a metadata-enriched, transformer based framework that combines high-resolution post-earthquake satellite imagery with building-specific metadata relevant to the seismic performance of the structure. Our model achieves state-of-the-art performance in multiclass post-earthquake damage identification for buildings from the Turkey-Syria earthquake on February 6, 2023. Specifically, we demonstrate that incorporating metadata, such as seismic intensity indicators, soil properties, and SAR damage proxy maps not only enhances the model’s accuracy and ability to distinguish between damage classes, but also improves its generalizability across various regions. Furthermore, we conducted a detailed, class-wise analysis of feature importance to understand the model’s decision-making across different levels of building damage. This analysis reveals how individual metadata features uniquely contribute to predictions for each damage class. By leveraging both satellite imagery and metadata, our proposed framework enables faster and more accurate damage assessments for precise, multiclass, building-level evaluations that can improve disaster response and accelerate recovery efforts for affected communities.

[CV-82] Hidden in the Noise: Two-Stage Robust Watermarking for Images

[Quick Read]: This paper addresses the vulnerability of image watermarking to forgery and removal attacks. The key to the solution is a distortion-free watermarking method based on a diffusion model's initial noise, combined with a two-stage watermarking framework for efficient detection. Specifically, during generation, the initial noise is augmented with generated Fourier patterns to embed information about the group of initial noises used; during detection, the relevant group of noises is first retrieved, and the given group is then searched for an initial noise that might match. This approach significantly improves watermark robustness against forgery and removal attacks.

Link: https://arxiv.org/abs/2412.04653
Authors: Kasra Arabi,Benjamin Feuer,R. Teal Witter,Chinmay Hegde,Niv Cohen
Keywords: considerable societal debate, image generators continues, continues to improve, societal debate, generators continues
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:As the quality of image generators continues to improve, deepfakes become a topic of considerable societal debate. Image watermarking allows responsible model owners to detect and label their AI-generated content, which can mitigate the harm. Yet, current state-of-the-art methods in image watermarking remain vulnerable to forgery and removal attacks. This vulnerability occurs in part because watermarks distort the distribution of generated images, unintentionally revealing information about the watermarking techniques. In this work, we first demonstrate a distortion-free watermarking method for images, based on a diffusion model’s initial noise. However, detecting the watermark requires comparing the initial noise reconstructed for an image to all previously used initial noises. To mitigate these issues, we propose a two-stage watermarking framework for efficient detection. During generation, we augment the initial noise with generated Fourier patterns to embed information about the group of initial noises we used. For detection, we (i) retrieve the relevant group of noises, and (ii) search within the given group for an initial noise that might match our image. This watermarking approach achieves state-of-the-art robustness to forgery and removal against a large battery of attacks.
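The two-stage retrieve-then-match detection described above can be sketched with a toy stand-in: random tag vectors replace the generated Fourier patterns, and correlation serves as the matching score. All dimensions, weights, and names are illustrative assumptions, not the paper's construction.

```python
import numpy as np

rng = np.random.default_rng(1)
n_groups, per_group, dim = 4, 8, 64

# Initial noises partitioned into groups; each group carries a fixed tag
# (a stand-in for the generated Fourier patterns in the paper).
noises = rng.standard_normal((n_groups, per_group, dim))
tags = rng.standard_normal((n_groups, dim))
watermarked = noises + 0.5 * tags[:, None, :]

def detect(reconstructed):
    """Two-stage detection sketch: (i) retrieve the group whose tag
    correlates best with the reconstructed noise, then (ii) search only
    within that group for the best-matching initial noise."""
    group = int(np.argmax(tags @ reconstructed))
    member = int(np.argmax(watermarked[group] @ reconstructed))
    return group, member

# A slightly corrupted reconstruction of noise (2, 5) is still identified.
query = watermarked[2, 5] + 0.1 * rng.standard_normal(dim)
found = detect(query)
```

The point of the design is that detection cost drops from scanning all noises to one group lookup plus a within-group search.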

[CV-83] Cross-Self KV Cache Pruning for Efficient Vision-Language Inference

[Quick Read]: This paper addresses a problem with existing KV cache pruning methods for vision-language models (VLMs): methods based on self-attention scores ignore distributional discrepancies between modalities, leading to inaccurate token-importance estimation and over-pruning of critical visual tokens. The key to the solution is decomposing attention scores into intra-modality attention and inter-modality attention, managing the two attention types independently for more precise KV cache pruning. In addition, an n-softmax function is introduced to counteract the distribution shift caused by pruning, preserving the original smoothness of the attention scores and ensuring stable model performance. The resulting training-free method, Cross-Self Pruning (CSP), maintains competitive performance while significantly outperforming previous pruning methods, with effectiveness validated on multiple multimodal datasets.

Link: https://arxiv.org/abs/2412.04652
Authors: Xiaohuan Pei,Tao Huang,Chang Xu
Keywords: long-context auto-regressive generation, auto-regressive generation, promising technique, memory and computation, computation costs
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:KV cache pruning has emerged as a promising technique for reducing memory and computation costs in long-context auto-regressive generation. Existing methods for vision-language models (VLMs) typically rely on self-attention scores from large language models (LLMs) to identify and prune irrelevant tokens. However, these approaches overlook the inherent distributional discrepancies between modalities, often leading to inaccurate token importance estimation and the over-pruning of critical visual tokens. To address this, we propose decomposing attention scores into intra-modality attention (within the same modality) and inter-modality attention (across modalities), enabling more precise KV cache pruning by independently managing these distinct attention types. Additionally, we introduce an n-softmax function to counteract distribution shifts caused by pruning, preserving the original smoothness of attention scores and ensuring stable performance. Our final training-free method, Cross-Self Pruning (CSP), achieves competitive performance compared to models with full KV caches while significantly outperforming previous pruning methods. Extensive evaluations on MileBench, a benchmark encompassing 29 multimodal datasets, demonstrate CSP’s effectiveness, achieving up to a 41% performance improvement on challenging tasks like conversational embodied dialogue while reducing the KV cache budget by 13.6%. The code is available at this https URL
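The intra/inter-modality decomposition can be sketched on a toy attention matrix: mask the map by whether query and key share a modality, then score each key token inside its own part so visual and text tokens are never ranked jointly. Function names and the sum-based importance score are assumptions; the paper's exact scoring and n-softmax are not reproduced here.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def modality_split_importance(attn, is_visual):
    """Split attention into intra-modality (query and key share a modality)
    and inter-modality parts, then compute per-key importance separately
    within each part."""
    same = is_visual[:, None] == is_visual[None, :]
    intra = np.where(same, attn, 0.0).sum(axis=0)
    inter = np.where(~same, attn, 0.0).sum(axis=0)
    return intra, inter

rng = np.random.default_rng(0)
attn = softmax(rng.standard_normal((6, 6)))     # toy attention scores
is_visual = np.array([True, True, True, False, False, False])
intra_imp, inter_imp = modality_split_importance(attn, is_visual)
```

Pruning would then keep the top-k keys per part instead of one global top-k, which is what prevents visual tokens from being crowded out by text tokens.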

[CV-84] Using Diffusion Priors for Video Amodal Segmentation

[Quick Read]: This paper tackles amodal segmentation of partially occluded objects in video. The key to the solution is reformulating video amodal segmentation as a conditional generation task, capitalizing on the foundational knowledge in video generative models. Concretely, the method (1) repurposes these models to condition on a sequence of modal mask frames of an object together with contextual pseudo-depth maps, learning to infer which object boundaries may be occluded and thus hallucinating the complete extent of the object; and (2) follows with a content-completion stage that inpaints the occluded regions. Benchmarked against a wide range of state-of-the-art methods on four datasets, the approach improves amodal segmentation in occluded regions by up to 13%.

Link: https://arxiv.org/abs/2412.04623
Authors: Kaihua Chen,Deva Ramanan,Tarasha Khurana
Keywords: permanence in humans, fundamental cue, understanding persistence, Object, segmentation
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: project page: this https URL

Abstract:Object permanence in humans is a fundamental cue that helps in understanding persistence of objects, even when they are fully occluded in the scene. Present-day methods in object segmentation do not account for this amodal nature of the world, and only work for segmentation of visible or modal objects. Few amodal methods exist; single-image segmentation methods cannot handle high levels of occlusion, which are better inferred using temporal information, and multi-frame methods have focused solely on segmenting rigid objects. To this end, we propose to tackle video amodal segmentation by formulating it as a conditional generation task, capitalizing on the foundational knowledge in video generative models. Our method is simple; we repurpose these models to condition on a sequence of modal mask frames of an object along with contextual pseudo-depth maps, to learn which object boundary may be occluded and therefore, extended to hallucinate the complete extent of an object. This is followed by a content completion stage which is able to inpaint the occluded regions of an object. We benchmark our approach alongside a wide array of state-of-the-art methods on four datasets and show a dramatic improvement of up to 13% for amodal segmentation in an object’s occluded region.

[CV-85] Assessing and Learning Alignment of Unimodal Vision and Language Models

[Quick Read]: This paper studies how well unimodal vision and language models are aligned for practical vision-language tasks. The key to the solution is a direct assessment method, inspired by linear probing, for evaluating vision-language alignment. The paper finds that the degree of alignment of SSL vision models depends on their SSL training objective, and that the clustering quality of SSL representations affects alignment performance more strongly than their linear separability. Building on this, the paper introduces Swift Alignment of Image and Language (SAIL), an efficient transfer-learning framework that aligns pretrained unimodal vision and language models. Because SAIL leverages the strengths of pretrained unimodal models, it requires far less paired image-text data (6%) for multimodal alignment than models trained from scratch such as CLIP. SAIL trains on a single A100 GPU in 5 hours and supports batch sizes up to 32,768. It achieves 73.4% zero-shot accuracy on ImageNet (vs. CLIP's 72.7%) and excels at zero-shot retrieval, complex reasoning, and semantic segmentation. SAIL also enhances the language compatibility of vision encoders, which in turn improves the performance of multimodal large language models.

Link: https://arxiv.org/abs/2412.04616
Authors: Le Zhang,Qian Yang,Aishwarya Agrawal
Keywords: language models aligned, SAIL, models, language models, language
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:How well are unimodal vision and language models aligned? Although prior work has approached answering this question, their assessment methods do not directly translate to how these models are used in practical vision-language tasks. In this paper, we propose a direct assessment method, inspired by linear probing, to assess vision-language alignment. We identify that the degree of alignment of the SSL vision models depends on their SSL training objective, and we find that the clustering quality of SSL representations has a stronger impact on alignment performance than their linear separability. Next, we introduce Swift Alignment of Image and Language (SAIL), an efficient transfer learning framework that aligns pretrained unimodal vision and language models for downstream vision-language tasks. Since SAIL leverages the strengths of pretrained unimodal models, it requires significantly fewer (6%) paired image-text data for the multimodal alignment compared to models like CLIP which are trained from scratch. SAIL training only requires a single A100 GPU, 5 hours of training and can accommodate a batch size up to 32,768. SAIL achieves 73.4% zero-shot accuracy on ImageNet (vs. CLIP’s 72.7%) and excels in zero-shot retrieval, complex reasoning, and semantic segmentation. Additionally, SAIL improves the language-compatibility of vision encoders that in turn enhance the performance of multimodal large language models. The entire codebase and model weights are open-source: this https URL

[CV-86] Learning Symmetries via Weight-Sharing with Doubly Stochastic Tensors

[Quick Read]: This paper asks how to achieve dynamic, adaptive group equivariance in deep learning instead of relying on a fixed, predefined group. The key to the solution is a collection of learnable doubly stochastic matrices that act as soft permutation matrices on canonical weight tensors, realizing weight sharing. This approach can not only handle known group representations but also dynamically discover and apply symmetries in the data, improving generalization, data efficiency, and robustness while keeping the model flexible. By jointly optimizing these learnable kernel transformations with downstream tasks, the paper shows that when a dataset exhibits strong symmetries, the permutation matrices converge to regular group representations and the weight-sharing network effectively becomes a regular group convolution. The method's flexibility also allows it to capture partial symmetries.

Link: https://arxiv.org/abs/2412.04594
Authors: Putri A. van der Linden,Alejandro García-Castellanos,Sharvaree Vadgama,Thijs P. Kuipers,Erik J. Bekkers
Keywords: valuable inductive bias, enhancing generalization, deep learning, valuable inductive, inductive bias
Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments: 19 pages, 14 figures, 4 tables

Abstract:Group equivariance has emerged as a valuable inductive bias in deep learning, enhancing generalization, data efficiency, and robustness. Classically, group equivariant methods require the groups of interest to be known beforehand, which may not be realistic for real-world data. Additionally, baking in fixed group equivariance may impose overly restrictive constraints on model architecture. This highlights the need for methods that can dynamically discover and apply symmetries as soft constraints. For neural network architectures, equivariance is commonly achieved through group transformations of a canonical weight tensor, resulting in weight sharing over a given group G . In this work, we propose to learn such a weight-sharing scheme by defining a collection of learnable doubly stochastic matrices that act as soft permutation matrices on canonical weight tensors, which can take regular group representations as a special case. This yields learnable kernel transformations that are jointly optimized with downstream tasks. We show that when the dataset exhibits strong symmetries, the permutation matrices will converge to regular group representations and our weight-sharing networks effectively become regular group convolutions. Additionally, the flexibility of the method enables it to effectively pick up on partial symmetries.
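A minimal way to obtain the doubly stochastic matrices mentioned above is Sinkhorn-Knopp normalization of unconstrained logits; the sketch below shows such a soft permutation acting on a canonical weight vector. This is a generic construction under assumed names, not necessarily the paper's exact parameterization.

```python
import numpy as np

def sinkhorn(logits, n_iters=50):
    """Map an unconstrained score matrix to an (approximately) doubly
    stochastic one by alternating row and column normalization."""
    m = np.exp(logits)
    for _ in range(n_iters):
        m /= m.sum(axis=1, keepdims=True)   # rows sum to 1
        m /= m.sum(axis=0, keepdims=True)   # columns sum to 1
    return m

rng = np.random.default_rng(0)
P = sinkhorn(rng.standard_normal((4, 4)))   # learnable soft permutation
kernel = rng.standard_normal(4)             # canonical weight vector
shared = P @ kernel                         # soft-permuted (shared) weights
```

During training the logits would be optimized jointly with the task loss; when the data is strongly symmetric, `P` can sharpen toward a hard permutation, recovering regular group weight sharing.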

[CV-87] EgoPoints: Advancing Point Tracking for Egocentric Videos WACV2025

[Quick Read]: This paper addresses the challenges of point tracking in egocentric videos, in particular handling points that go out of view and points that require re-identification (ReID). The key to the solution is EgoPoints, a benchmark of 4.7K challenging annotated tracks that, compared with the existing TAP-Vid-DAVIS evaluation benchmark, contains 9x more out-of-view points and 59x more points requiring re-identification. The paper also proposes new evaluation metrics that specifically monitor tracking performance on in-view points, out-of-view points, and points requiring ReID. In addition, it proposes a pipeline for generating semi-real sequences with automatic ground truth by combining dynamic Kubric objects with scene points from EPIC Fields, automatically producing 11K such sequences for fine-tuning point-tracking methods. Experiments show that fine-tuning trackers on the generated sequences and evaluating on the annotated EgoPoints sequences significantly improves the tracking accuracy of CoTracker and PIPs++, especially on out-of-view and ReID points.

Link: https://arxiv.org/abs/2412.04592
Authors: Ahmad Darkhalil,Rhodri Guerrier,Adam W. Harley,Dima Damen
Keywords: points, egocentric videos, delta, text, avg
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted at WACV 2025. Paper webpage: this https URL

Abstract:We introduce EgoPoints, a benchmark for point tracking in egocentric videos. We annotate 4.7K challenging tracks in egocentric sequences. Compared to the popular TAP-Vid-DAVIS evaluation benchmark, we include 9x more points that go out-of-view and 59x more points that require re-identification (ReID) after returning to view. To measure the performance of models on these challenging points, we introduce evaluation metrics that specifically monitor tracking performance on points in-view, out-of-view, and points that require re-identification. We then propose a pipeline to create semi-real sequences, with automatic ground truth. We generate 11K such sequences by combining dynamic Kubric objects with scene points from EPIC Fields. When fine-tuning point tracking methods on these sequences and evaluating on our annotated EgoPoints sequences, we improve CoTracker across all metrics, including the tracking accuracy δ*_avg by 2.7 percentage points and accuracy on ReID sequences (ReID δ_avg) by 2.4 points. We also improve δ*_avg and ReID δ_avg of PIPs++ by 0.3 and 2.8 respectively.
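The δ_avg-style accuracy above follows the TAP-Vid convention: the fraction of points tracked within several pixel thresholds, averaged over the thresholds. A minimal sketch (the threshold set and names are standard TAP-Vid assumptions, not taken from the paper's code):

```python
import numpy as np

def delta_avg(pred, gt, visible, thresholds=(1, 2, 4, 8, 16)):
    """TAP-Vid-style position accuracy: fraction of visible points whose
    predicted location lies within each pixel threshold of ground truth,
    averaged over the thresholds."""
    err = np.linalg.norm(pred - gt, axis=-1)   # per-point pixel error
    err = err[visible]
    accs = [(err < t).mean() for t in thresholds]
    return float(np.mean(accs))

gt = np.zeros((4, 2))
pred = np.array([[0.5, 0.0], [3.0, 0.0], [0.0, 9.0], [20.0, 0.0]])
visible = np.array([True, True, True, True])
score = delta_avg(pred, gt, visible)   # errors 0.5, 3, 9, 20 -> 0.45
```

The ReID variant in the paper restricts this computation to points after they return to view.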

[CV-88] ARTeFACT: Benchmarking Segmentation Models on Diverse Analogue Media Damage WACV2025

[Quick Read]: This paper addresses accurate damage detection and classification in analogue media (paintings, photographs, textiles, mosaics, and frescoes), which is essential for cultural heritage preservation. Although machine learning models can effectively correct degradation when the damage operator is known a priori, they fail to robustly predict where the damage is even after supervised training, so reliable damage detection remains a challenge. The key contribution is the ARTeFACT dataset, with over 11,000 annotations covering 15 kinds of damage across diverse subjects, media, and historical provenance. The paper also provides human-verified text prompts describing the semantic content of the images and derives additional textual descriptions of the annotated damage. Evaluating CNN, transformer, diffusion-based segmentation models, and foundation vision models in zero-shot, supervised, unsupervised, and text-guided settings reveals their limitations in generalizing across media types. ARTeFACT serves as the first benchmark of its kind for analogue media damage detection and restoration.

Link: https://arxiv.org/abs/2412.04580
Authors: Daniela Ivanova,Marco Aversa,Paul Henderson,John Williamson
Keywords: cultural heritage preservation, Accurately detecting, heritage preservation, detecting and classifying, frescoes is essential
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted for publication at WACV 2025

Abstract:Accurately detecting and classifying damage in analogue media such as paintings, photographs, textiles, mosaics, and frescoes is essential for cultural heritage preservation. While machine learning models excel in correcting degradation if the damage operator is known a priori, we show that they fail to robustly predict where the damage is even after supervised training; thus, reliable damage detection remains a challenge. Motivated by this, we introduce ARTeFACT, a dataset for damage detection in diverse types of analogue media, with over 11,000 annotations covering 15 kinds of damage across various subjects, media, and historical provenance. Furthermore, we contribute human-verified text prompts describing the semantic contents of the images, and derive additional textual descriptions of the annotated damage. We evaluate CNN, Transformer, diffusion-based segmentation models, and foundation vision models in zero-shot, supervised, unsupervised and text-guided settings, revealing their limitations in generalising across media types. Our dataset is available at this https URL as the first-of-its-kind benchmark for analogue media damage detection and restoration.

[CV-89] Action-based image editing guided by human instructions

[Quick Read]: This paper aims to make the typically static task of text-based image editing dynamic by incorporating actions. The key to the solution is a new model that is sensitive to action text instructions by learning to recognize contrastive action discrepancies: it adjusts the positions or postures of objects in an image to depict different actions while preserving the objects' visual properties. Training uses new datasets built by extracting video frames that show the visual scene before and after an action. The model substantially improves image editing driven by action-based text instructions and exhibits high reasoning capability, using the input image as the starting scene of an action and generating a new image that shows the action's final scene.

Link: https://arxiv.org/abs/2412.04558
Authors: Maria Mihaela Trusca,Mingxiao Li,Marie-Francine Moens
Keywords: Text-based image editing, Text-based image, typically approached, involves operations, modifying elements
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Text-based image editing is typically approached as a static task that involves operations such as inserting, deleting, or modifying elements of an input image based on human instructions. Given the static nature of this task, in this paper, we aim to make this task dynamic by incorporating actions. By doing this, we intend to modify the positions or postures of objects in the image to depict different actions while maintaining the visual properties of the objects. To implement this challenging task, we propose a new model that is sensitive to action text instructions by learning to recognize contrastive action discrepancies. The model training is done on new datasets defined by extracting frames from videos that show the visual scenes before and after an action. We show substantial improvements in image editing using action-based text instructions and high reasoning capabilities that allow our model to use the input image as a starting scene for an action while generating a new image that shows the final scene of the action.

[CV-90] Mask-Adapter: The Devil is in the Masks for Open-Vocabulary Segmentation

[Quick Read]: This paper addresses a counterintuitive failure in open-vocabulary segmentation: when pretrained vision-language models such as CLIP classify masks via mask pooling, accurate masks often fail to yield accurate classification results. The key to the solution is Mask-Adapter, which extracts semantic activation maps from proposal masks, providing richer contextual information and ensuring alignment between the masks and CLIP. The paper further proposes a mask consistency loss that encourages proposal masks with similar IoUs to obtain similar CLIP embeddings, enhancing robustness to varying predicted masks. Mask-Adapter integrates into mask-pooling-based open-vocabulary segmentation methods in a plug-and-play manner and significantly improves classification accuracy.

Link: https://arxiv.org/abs/2412.04533
Authors: Yongkang Li,Tianheng Cheng,Wenyu Liu,Xinggang Wang
Keywords: Recent open-vocabulary segmentation, leverage pre-trained vision-language, adopt mask generators, Recent open-vocabulary, pre-trained vision-language models
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Code and models: this https URL

Abstract:Recent open-vocabulary segmentation methods adopt mask generators to predict segmentation masks and leverage pre-trained vision-language models, e.g., CLIP, to classify these masks via mask pooling. Although these approaches show promising results, it is counterintuitive that accurate masks often fail to yield accurate classification results through pooling CLIP image embeddings within the mask regions. In this paper, we reveal the performance limitations of mask pooling and introduce Mask-Adapter, a simple yet effective method to address these challenges in open-vocabulary segmentation. Compared to directly using proposal masks, our proposed Mask-Adapter extracts semantic activation maps from proposal masks, providing richer contextual information and ensuring alignment between masks and CLIP. Additionally, we propose a mask consistency loss that encourages proposal masks with similar IoUs to obtain similar CLIP embeddings to enhance models’ robustness to varying predicted masks. Mask-Adapter integrates seamlessly into open-vocabulary segmentation methods based on mask pooling in a plug-and-play manner, delivering more accurate classification results. Extensive experiments across several zero-shot benchmarks demonstrate significant performance gains for the proposed Mask-Adapter on several well-established methods. Notably, Mask-Adapter also extends effectively to SAM and achieves impressive results on several open-vocabulary segmentation datasets. Code and models are available at this https URL.

[CV-91] MageBench: Bridging Large Multimodal Models to Agents MICRO

[Quick Read]: This paper addresses the lack of visual reasoning along the decision-making process in existing large multimodal models (LMMs), where visual signals are continuously updated. The key to the solution is MageBench, a reasoning-oriented multimodal agent benchmark with lightweight environments but significant reasoning challenges, covering three environments (WebUI, Sokoban, and Football) with 483 different scenarios in total. MageBench comprehensively validates agents' knowledge and engineering capabilities, visual intelligence, and interaction skills, and reveals that current models severely lack the abilities to revise plans based on visual feedback, perform visual imagination, and handle interleaved image-text long contexts, providing directions for optimizing LMMs as agents.

Link: https://arxiv.org/abs/2412.04531
Authors: Miaosen Zhang,Qi Dai,Yifan Yang,Jianmin Bao,Dongdong Chen,Kai Qiu,Chong Luo,Xin Geng,Baining Guo
Keywords: demand strong reasoning, shown impressive visual, impressive visual understanding, shown impressive, demand strong
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 37 pages, 32 figures, github link: this https URL

Abstract:LMMs have shown impressive visual understanding capabilities, with the potential to be applied in agents, which demand strong reasoning and planning abilities. Nevertheless, existing benchmarks mostly assess their reasoning abilities in the language part, where the chain-of-thought is entirely composed of text. We consider the scenario where visual signals are continuously updated and required along the decision making process. Such vision-in-the-chain reasoning paradigm is more aligned with the needs of multimodal agents, while being rarely evaluated. In this paper, we introduce MageBench, a reasoning capability oriented multimodal agent benchmark that, while having light-weight environments, poses significant reasoning challenges and holds substantial practical value. This benchmark currently includes three types of environments: WebUI, Sokoban, and Football, comprising a total of 483 different scenarios. It thoroughly validates the agent’s knowledge and engineering capabilities, visual intelligence, and interaction skills. The results show that only a few product-level models are better than random acting, and all of them are far inferior to human-level. More specifically, we found current models severely lack the ability to modify their planning based on visual feedback, as well as visual imagination, interleaved image-text long context handling, and other abilities. We hope that our work will provide optimization directions for LMM from the perspective of being an agent. We release our code and data at this https URL.

[CV-92] ColonNet: A Hybrid Of DenseNet121 And U-NET Model For Detection And Segmentation Of GI Bleeding

[Quick Read]: This paper addresses automatic detection and classification of gastrointestinal bleeding in wireless capsule endoscopy videos. The key to the solution is an integrated deep learning model that uses the CNN-based DenseNet and UNet models to efficiently detect and segment bleeding and non-bleeding regions in a complex real-world dataset. The model achieved the top performance in the Auto-WCBleedGen Challenge Version V2, reaching an overall accuracy of 80%, which can assist physicians with further diagnosis.

Link: https://arxiv.org/abs/2412.05216
Authors: Ayushman Singh,Sharad Prakash,Aniket Das,Nidhi Kushwaha
Keywords: Wireless Capsule Endoscopy, Capsule Endoscopy, Wireless Capsule, integrated deep learning, classification of Gastrointestinal
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:

Abstract:This study presents an integrated deep learning model for automatic detection and classification of gastrointestinal bleeding in frames extracted from Wireless Capsule Endoscopy (WCE) videos. The dataset was released as part of the Auto-WCBleedGen Challenge Version V2 hosted by the MISAHUB team. Our model attained the highest performance among the 75 teams that took part in this competition. It efficiently utilizes CNN-based models, i.e., DenseNet and UNet, to detect and segment bleeding and non-bleeding areas in a complex real-world dataset. The model achieves an overall accuracy of 80%, which can help a skilled doctor carry out further diagnostics.

[CV-93] Reconstructing Quantitative Cerebral Perfusion Images Directly From Measured Sinogram Data Acquired Using C-arm Cone-Beam CT

[Quick Read]: This paper addresses the poor temporal resolution and sampling density, caused by slow gantry rotation, encountered when acquiring quantitative cerebral perfusion images with C-arm cone-beam CT (CBCT) in the interventional suite. The key to the solution is combining the two conventional cascaded steps, time-resolved image reconstruction and perfusion parameter estimation, into a single joint optimization problem and reconstructing quantitative perfusion images directly from the measured sinogram data. Specifically, the proposed TRAINER (Time-Resolved Attenuation Image-based Neural network for perfusion parametric image Reconstruction) technique represents the quantitative perfusion images as a subject-specific conditional generative model trained under the constraints of the time-resolved CT forward model, the perfusion convolutional model, and the patient's own measured sinogram data, enabling accurate quantitative cerebral perfusion imaging with C-arm CBCT in the interventional suite.

Link: https://arxiv.org/abs/2412.05084
Authors: Haotian Zhao,Ruifeng Chen,Jing Yan,Juan Feng,Jun Xiang,Yang Chen,Dong Liang,Yinsheng Li
Keywords: C-arm cone-beam computed, acute ischemic stroke, cone-beam computed tomography, C-arm CBCT, perfusion images
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Medical Physics (physics.med-ph)
Comments:

Abstract:To shorten the door-to-puncture time for better treating patients with acute ischemic stroke, it is highly desired to obtain quantitative cerebral perfusion images using C-arm cone-beam computed tomography (CBCT) equipped in the interventional suite. However, limited by the slow gantry rotation speed, the temporal resolution and temporal sampling density of typical C-arm CBCT are much poorer than those of multi-detector-row CT in the diagnostic imaging suite. The current quantitative perfusion imaging includes two cascaded steps: time-resolved image reconstruction and perfusion parametric estimation. For time-resolved image reconstruction, the technical challenge imposed by poor temporal resolution and poor sampling density causes inaccurate quantification of the temporal variation of cerebral artery and tissue attenuation values. For perfusion parametric estimation, it remains a technical challenge to appropriately design the handcrafted regularization for better solving the associated deconvolution problem. These two challenges together prevent obtaining quantitatively accurate perfusion images using C-arm CBCT. The purpose of this work is to simultaneously address these two challenges by combining the two cascaded steps into a single joint optimization problem and reconstructing quantitative perfusion images directly from the measured sinogram data. In the developed direct cerebral perfusion parametric image reconstruction technique, TRAINER in short, the quantitative perfusion images have been represented as a subject-specific conditional generative model trained under the constraint of the time-resolved CT forward model, perfusion convolutional model, and the subject’s own measured sinogram data. Results shown in this paper demonstrated that using TRAINER, quantitative cerebral perfusion images can be accurately obtained using C-arm CBCT in the interventional suite.

[CV-94] Reconstruction of 3D lumbar spine models from incomplete segmentations using landmark detection

[Quick Read]: This paper addresses reconstructing complete 3D lumbar spine models from incomplete 3D vertebral data. The key to the solution is using an affine transformation to align artificial vertebra models with the patient-specific incomplete vertebrae, deriving the transformation matrix from landmarks automatically detected on the vertebral endplates. The method achieves highly accurate registration, with an average point-to-model distance of 1.95 mm, and preserves important spinal features such as the angles of functional spine units (FSUs), with a mean absolute error (MAE) of 3.4°. Moreover, it registers the entire lumbar spine (L1 to L5) in just 0.14 seconds, demonstrating its time efficiency.

链接: https://arxiv.org/abs/2412.05065
作者: Lara Blomenkamp,Ivanna Kramer,Sabine Bauer,Kevin Weirauch,Dietrich Paulus
关键词-EN: spine models serve, biomedical research, spine models, analysis of loading, loading conditions
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Patient-specific 3D spine models serve as a foundation for spinal treatment and surgery planning as well as analysis of loading conditions in biomechanical and biomedical research. Despite advancements in imaging technologies, the reconstruction of complete 3D spine models often faces challenges due to limitations in imaging modalities such as planar X-Ray and missing certain spinal structures, such as the spinal or transverse processes, in volumetric medical images and resulting segmentations. In this study, we present a novel accurate and time-efficient method to reconstruct complete 3D lumbar spine models from incomplete 3D vertebral bodies obtained from segmented magnetic resonance images (MRI). In our method, we use an affine transformation to align artificial vertebra models with patient-specific incomplete vertebrae. The transformation matrix is derived from vertebra landmarks, which are automatically detected on the vertebra endplates. The results of our evaluation demonstrate the high accuracy of the performed registration, achieving an average point-to-model distance of 1.95 mm. Additionally, in assessing the morphological properties of the vertebrae and intervertebral characteristics, our method demonstrated a mean absolute error (MAE) of 3.4° in the angles of functional spine units (FSUs), emphasizing its effectiveness in maintaining important spinal features throughout the transformation process of individual vertebrae. Our method achieves the registration of the entire lumbar spine, spanning segments L1 to L5, in just 0.14 seconds, showcasing its time-efficiency. Clinical relevance: the fast and accurate reconstruction of spinal models from incomplete input data such as segmentations provides a foundation for many applications in spine diagnostics, treatment planning, and the development of spinal healthcare solutions.
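The landmark-driven affine alignment at the core of this method can be sketched as a plain least-squares fit of a 4x4 homogeneous transform from corresponding 3D landmarks. This is an illustrative reconstruction under assumed conventions, not the authors' code; the helper names are invented:

```python
import numpy as np

def fit_affine(src, dst):
    """Least-squares affine transform mapping src landmarks onto dst.

    src, dst: (N, 3) arrays of corresponding 3D landmarks; needs at least
    four non-coplanar points. Returns a 4x4 homogeneous matrix.
    """
    n = src.shape[0]
    src_h = np.hstack([src, np.ones((n, 1))])             # homogeneous coords
    M, _, _, _ = np.linalg.lstsq(src_h, dst, rcond=None)  # src_h @ M ~= dst
    T = np.eye(4)
    T[:3, :3] = M[:3, :].T
    T[:3, 3] = M[3, :]
    return T

def apply_affine(T, pts):
    pts_h = np.hstack([pts, np.ones((pts.shape[0], 1))])
    return (T @ pts_h.T).T[:, :3]

# Toy check: recover a known rotation + translation from six landmarks.
rng = np.random.default_rng(0)
landmarks = rng.normal(size=(6, 3))
a = np.deg2rad(20.0)
R = np.array([[np.cos(a), -np.sin(a), 0.0],
              [np.sin(a),  np.cos(a), 0.0],
              [0.0, 0.0, 1.0]])
T_true = np.eye(4)
T_true[:3, :3] = R
T_true[:3, 3] = [1.0, -2.0, 0.5]
moved = apply_affine(T_true, landmarks)
T_est = fit_affine(landmarks, moved)
```

With exact correspondences the fit recovers the transform; in the paper, the correspondences come from automatically detected endplate landmarks.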

[CV-95] SMIC: Semantic Multi-Item Compression based on CLIP dictionary

【Quick Read】: This paper addresses semantic compression of image collections, in particular taking inter-image redundancy into account. The key is a property of CLIP's latent space: semantic additions and subtractions are easy to perform. Building on this property, the paper defines a dictionary-based multi-item codec that outperforms state-of-the-art generative codecs in compression rate, around 10^-5 BPP per image, without sacrificing semantic fidelity. The paper also shows that the learned dictionary is semantic in nature and acts as a semantic projector for the semantic content of images.

Link: https://arxiv.org/abs/2412.05035
Authors: Tom Bachard,Thomas Maugey
Keywords: typically MSE, distortion metric, semantic fidelity metrics, Semantic, fidelity metrics
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
Comments: 12 pages, 14 figures, 3 tables, journal paper, preprint

Abstract:Semantic compression, a compression scheme where the distortion metric, typically MSE, is replaced with semantic fidelity metrics, tends to become more and more popular. Most recent semantic compression schemes rely on the foundation model CLIP. In this work, we extend such a scheme to image collection compression, where inter-item redundancy is taken into account during the coding phase. For that purpose, we first show that CLIP’s latent space allows for easy semantic additions and subtractions. From this property, we define a dictionary-based multi-item codec that outperforms state-of-the-art generative codec in terms of compression rate, around 10^-5 BPP per image, while not sacrificing semantic fidelity. We also show that the learned dictionary is of a semantic nature and works as a semantic projector for the semantic content of images.
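The "semantic additions and subtractions" the codec relies on can be illustrated with unit-normalized embedding vectors. The toy random vectors below stand in for CLIP embeddings, so only the arithmetic, not the model, is demonstrated:

```python
import numpy as np

def normalize(v):
    return v / np.linalg.norm(v)

def cosine(a, b):
    return float(normalize(a) @ normalize(b))

rng = np.random.default_rng(1)
# Hypothetical concept directions in a shared 64-d embedding space.
dog, beach, city = (normalize(rng.normal(size=64)) for _ in range(3))

# An "image" embedding composed of two concepts.
dog_on_beach = normalize(dog + beach)

# Swap one concept for another directly in latent space.
edited = normalize(dog_on_beach - beach + city)
```

The edited vector moves toward the added concept and away from the subtracted one; a dictionary of such concept directions is what lets a multi-item codec factor out content shared across the collection.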

[CV-96] Uncertainty-aware retinal layer segmentation in OCT through probabilistic signed distance functions

【Quick Read】: This paper addresses the lack of uncertainty modeling and geometric grounding in retinal layer segmentation of Optical Coherence Tomography (OCT) scans. The key is to parameterize retinal layer shape with probabilistic signed distance functions (SDF) and refine the segmentation via a level-set formulation. Specifically, the method predicts an SDF to parameterize the layer shape effectively, and wraps the shape parameters in Gaussian distributions to encapsulate uncertainty, ensuring a robust representation of retinal layer morphology under ambiguous input, imaging noise, and unreliable segmentations. Both quantitative and qualitative evaluations show superior performance over other methods, and experiments on artificially distorted datasets with noise types common in OCT scans (shadowing, blinking, speckle, and motion) validate the effectiveness of the uncertainty estimation.

Link: https://arxiv.org/abs/2412.04935
Authors: Mohammad Mohaiminul Islam,Coen de Vente,Bart Liefers,Caroline Klaver,Erik J Bekkers,Clara I. Sánchez
Keywords: Optical Coherence Tomography, Coherence Tomography, Optical Coherence, signed distance functions, uncertainty-aware retinal layer
Subjects: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:In this paper, we present a new approach for uncertainty-aware retinal layer segmentation in Optical Coherence Tomography (OCT) scans using probabilistic signed distance functions (SDF). Traditional pixel-wise and regression-based methods primarily encounter difficulties in precise segmentation and lack of geometrical grounding respectively. To address these shortcomings, our methodology refines the segmentation by predicting a signed distance function (SDF) that effectively parameterizes the retinal layer shape via level set. We further enhance the framework by integrating probabilistic modeling, applying Gaussian distributions to encapsulate the uncertainty in the shape parameterization. This ensures a robust representation of the retinal layer morphology even in the presence of ambiguous input, imaging noise, and unreliable segmentations. Both quantitative and qualitative evaluations demonstrate superior performance when compared to other methods. Additionally, we conducted experiments on artificially distorted datasets with various noise types (shadowing, blinking, speckle, and motion) common in OCT scans to showcase the effectiveness of our uncertainty estimation. Our findings demonstrate the possibility to obtain reliable segmentation of retinal layers, as well as an initial step towards the characterization of layer integrity, a key biomarker for disease progression. Our code is available at this https URL.
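Wrapping the predicted SDF in a Gaussian yields closed-form boundary probabilities and a natural per-pixel uncertainty. A minimal sketch of one plausible formulation (an assumption for illustration, not taken from the paper): if the signed distance at a pixel is d ~ N(mu, sigma^2) with negative values inside the layer, then P(inside) = Phi(-mu / sigma), and the entropy of that Bernoulli flags uncertain pixels:

```python
import math

def p_inside(mu, sigma):
    """P(d < 0) for d ~ N(mu, sigma^2); negative SDF means inside the layer."""
    return 0.5 * (1.0 + math.erf(-mu / (sigma * math.sqrt(2.0))))

def bernoulli_entropy(p, eps=1e-12):
    """Entropy of the inside/outside decision; highest near the boundary."""
    p = min(max(p, eps), 1.0 - eps)
    return -(p * math.log(p) + (1.0 - p) * math.log(1.0 - p))

# Far inside the layer and confident: probability ~1, entropy ~0.
p_deep = p_inside(mu=-3.0, sigma=0.5)
# Exactly on the predicted boundary: maximal uncertainty.
p_edge = p_inside(mu=0.0, sigma=0.5)
```

Thresholding the entropy map is one simple way such a model could surface pixels where shadowing or speckle makes the layer boundary ambiguous.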

[CV-97] UniMIC: Towards Universal Multi-modality Perceptual Image Compression

【Quick Read】: This paper addresses unified rate-distortion-perception (RDP) optimization in multi-modality image compression, in particular the joint optimization of multiple image codecs. The key is a visual codec repository that incorporates a large number of representative image codecs and uses them directly as basic codecs for various practical applications. The paper further proposes multi-grained textual coding, designing variable-length content prompts and compression prompts to assist perceptual reconstruction, and a universal perception compensator that reuses text-assisted diffusion priors from Stable Diffusion to improve the perceptual quality of images decoded by all basic codecs. Together, these strategies significantly improve RDP optimization across different codecs and different compression costs.

Link: https://arxiv.org/abs/2412.04912
Authors: Yixin Gao,Xin Li,Xiaohan Pan,Runsen Feng,Zongyu Guo,Yiting Lu,Yulin Ren,Zhibo Chen
Keywords: excavating cross-modality generative, cross-modality generative priors, image compression framework, image codecs simultaneously, multiple image codecs
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:We present UniMIC, a universal multi-modality image compression framework, intending to unify the rate-distortion-perception (RDP) optimization for multiple image codecs simultaneously through excavating cross-modality generative priors. Unlike most existing works that need to design and optimize image codecs from scratch, our UniMIC introduces the visual codec repository, which incorporates amounts of representative image codecs and directly uses them as the basic codecs for various practical applications. Moreover, we propose multi-grained textual coding, where variable-length content prompt and compression prompt are designed and encoded to assist the perceptual reconstruction through the multi-modality conditional generation. In particular, a universal perception compensator is proposed to improve the perception quality of decoded images from all basic codecs at the decoder side by reusing text-assisted diffusion priors from stable diffusion. With the cooperation of the above three strategies, our UniMIC achieves a significant improvement of RDP optimization for different compression codecs, e.g., traditional and learnable codecs, and different compression costs, e.g., ultra-low bitrates. The code will be available in this https URL .

[CV-98] Comprehensive Analysis and Improvements in Pansharpening Using Deep Learning

【Quick Read】: This paper addresses spectral distortion in remote-sensing image fusion, particularly when deep learning methods are used for pansharpening. The key is an improvement to the PSGAN framework that introduces new regularization techniques in the generator loss function. These improvements aim to enhance the spectral fidelity of the fused images and achieve superior performance across multiple quantitative metrics while delivering visually superior results.

Link: https://arxiv.org/abs/2412.04896
Authors: Mahek Kantharia,Neeraj Badal,Zankhana Shah
Keywords: fusing low-resolution multispectral, low-resolution multispectral data, high-resolution multispectral images, high-resolution panchromatic images, remote sensing
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Pansharpening is a crucial task in remote sensing, enabling the generation of high-resolution multispectral images by fusing low-resolution multispectral data with high-resolution panchromatic images. This paper provides a comprehensive analysis of traditional and deep learning-based pansharpening methods. While state-of-the-art deep learning methods have significantly improved image quality, issues like spectral distortions persist. To address this, we propose enhancements to the PSGAN framework by introducing novel regularization techniques for the generator loss function. Experimental results on images from the Worldview-3 dataset demonstrate that the proposed modifications improve spectral fidelity and achieve superior performance across multiple quantitative metrics while delivering visually superior results.
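The abstract does not spell out the regularizers, but a standard spectral-fidelity penalty of the kind such a generator loss might include is the spectral angle mapper (SAM): the per-pixel angle between fused and reference band vectors, which is invariant to uniform intensity scaling. A hedged numpy sketch:

```python
import numpy as np

def sam_loss(fused, reference, eps=1e-8):
    """Mean spectral angle (radians) between two (H, W, Bands) images."""
    dot = np.sum(fused * reference, axis=-1)
    norms = np.linalg.norm(fused, axis=-1) * np.linalg.norm(reference, axis=-1)
    cos = np.clip(dot / (norms + eps), -1.0, 1.0)
    return float(np.mean(np.arccos(cos)))

ref = np.random.default_rng(2).uniform(0.1, 1.0, size=(8, 8, 4))
scaled = ref * 1.7                                  # brightness change only
shifted = ref + np.array([0.5, 0.0, 0.0, 0.0])      # genuine spectral distortion
```

Adding a term like `lambda * sam_loss(fused, ms_reference)` to a generator loss penalizes changes in band ratios while leaving overall brightness free, which is exactly the failure mode pansharpening regularizers target.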

[CV-99] Automatic Tissue Differentiation in Parotidectomy using Hyperspectral Imaging

【Quick Read】: This paper addresses real-time tissue differentiation during head and neck surgery to avoid injury to sensitive structures such as nerves and vessels. The key is tissue classification using hyperspectral imaging (HSI) combined with a 3D convolutional neural network. The study uses a stereo hyperspectral imaging system composed of two multispectral snapshot cameras, acquiring data in the 400-1000 nm range. Analyzing 27 annotated images from 18 patients undergoing parotidectomy, the method achieves 98.7% overall accuracy on the validation set and 83.4% in an independent evaluation, demonstrating efficient and clinically important intraoperative tissue differentiation. The high sensitivity for parotid and nerve tissue is particularly notable, but veins were often confused with muscle, indicating the need for further analysis and a more comprehensive data basis.

Link: https://arxiv.org/abs/2412.04879
Authors: Eric L. Wisotzky,Alexander Schill,Anna Hilsmann,Peter Eisert,Michael Knoke
Keywords: Convolutional Neural Network, neural network, tissue differentiation, intraoperative tissue differentiation, head and neck
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Medical Physics (physics.med-ph)
Comments: Accepted and presented at the 58th Annual Conference of the German Society for Biomedical Engineering; in press at Current Directions in Biomedical Engineering

Abstract:In head and neck surgery, continuous intraoperative tissue differentiation is of great importance to avoid injury to sensitive structures such as nerves and vessels. Hyperspectral imaging (HSI) with neural network analysis could support the surgeon in tissue differentiation. A 3D Convolutional Neural Network with hyperspectral data in the range of 400-1000 nm is used in this work. The acquisition system consisted of two multispectral snapshot cameras creating a stereo-HSI-system. For the analysis, 27 images with annotations of glandular tissue, nerve, muscle, skin and vein in 18 patients undergoing parotidectomy are included. Three patients are removed for evaluation following the leave-one-subject-out principle. The remaining images are used for training, with the data randomly divided into a training group and a validation group. In the validation, an overall accuracy of 98.7% is achieved, indicating robust training. In the evaluation on the excluded patients, an overall accuracy of 83.4% has been achieved showing good detection and identification abilities. The results clearly show that it is possible to achieve robust intraoperative tissue differentiation using hyperspectral imaging. Especially the high sensitivity in parotid or nerve tissue is of clinical importance. It is interesting to note that vein was often confused with muscle. This requires further analysis and shows that a very good and comprehensive data basis is essential. This is a major challenge, especially in surgery.

[CV-100] Automatic Prediction of Stroke Treatment Outcomes: Latest Advances and Perspectives

【Quick Read】: This paper addresses the prediction of stroke intervention outcomes to facilitate clinical decision-making and improve patient care. The key is using deep learning to analyze large and diverse medical data, including brain scans, medical reports, and other sensor information such as EEG, ECG, and EMG. Despite the common data standardization challenge in medical image analysis, the future of deep learning for stroke outcome prediction lies in integrating multimodal information, particularly final infarct data, to achieve more accurate prediction of long-term functional outcomes.

Link: https://arxiv.org/abs/2412.04812
Authors: Zeynel A. Samak,Philip Clatworthy,Majid Mirmehdi
Keywords: major global health, global health problem, mortality and morbidity, major global, global health
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: The paper is under consideration at Biomedical Engineering Letters (Springer)

Abstract:Stroke is a major global health problem that causes mortality and morbidity. Predicting the outcomes of stroke intervention can facilitate clinical decision-making and improve patient care. Engaging and developing deep learning techniques can help to analyse large and diverse medical data, including brain scans, medical reports and other sensor information, such as EEG, ECG, EMG and so on. Despite the common data standardisation challenge within medical image analysis domain, the future of deep learning in stroke outcome prediction lie in using multimodal information, including final infarct data, to achieve better prediction of long-term functional outcomes. This article provides a broad review of recent advances and applications of deep learning in the prediction of stroke outcomes, including (i) the data and models used, (ii) the prediction tasks and measures of success, (iii) the current challenges and limitations, and (iv) future directions and potential benefits. This comprehensive review aims to provide researchers, clinicians, and policy makers with an up-to-date understanding of this rapidly evolving and promising field.

[CV-101] Modality Decoupling is All You Need: A Simple Solution for Unsupervised Hyperspectral Image Fusion

【Quick Read】: This paper addresses the failure of existing hyperspectral image fusion (HIF) methods to fully perceive deep modality-complementary information. The key is an unsupervised Modality-Decoupled Spatial-Spectral Fusion (MossFuse) framework that introduces a modality clustering loss to enforce modality decoupling, separating modality-shared from modality-complementary features and reducing modality redundancy, thereby achieving efficient hyperspectral image fusion. Experiments on multiple datasets show that the method consistently outperforms existing HIF approaches while requiring considerably fewer parameters and less inference time.

Link: https://arxiv.org/abs/2412.04802
Authors: Songcheng Du,Yang Zou,Zixu Wang,Xingyuan Li,Ying Li,Qiang Shen
Keywords: low-resolution hyperspectral images, Hyperspectral Image Fusion, fuse low-resolution hyperspectral, high-resolution multispectral images, spectral resolution images
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Hyperspectral Image Fusion (HIF) aims to fuse low-resolution hyperspectral images (LR-HSIs) and high-resolution multispectral images (HR-MSIs) to reconstruct high spatial and high spectral resolution images. Current methods typically apply direct fusion from the two modalities without valid supervision, failing to fully perceive the deep modality-complementary information and hence, resulting in a superficial understanding of inter-modality connections. To bridge this gap, we propose a simple and effective solution for unsupervised HIF with an assumption that modality decoupling is essential for HIF. We introduce the modality clustering loss that ensures clear guidance of the modality, decoupling towards modality-shared features while steering clear of modality-complementary ones. Also, we propose an end-to-end Modality-Decoupled Spatial-Spectral Fusion (MossFuse) framework that decouples shared and complementary information across modalities and aggregates a concise representation of the LR-HSI and HR-MSI to reduce the modality redundancy. Systematic experiments over multiple datasets demonstrate that our simple and effective approach consistently outperforms the existing HIF methods while requiring considerably fewer parameters with reduced inference time.
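One plausible instantiation of a modality clustering loss (an assumption for illustration; the paper's exact formulation may differ) pulls the modality-shared features of the two inputs together while pushing modality-complementary features at least a margin away from the shared cluster:

```python
import numpy as np

def modality_clustering_loss(shared_a, shared_b, comp_a, comp_b, margin=1.0):
    """Pull shared features of the two modalities together; push
    complementary features at least `margin` away from the shared cluster.

    All inputs are (N, D) feature arrays. Illustrative formulation only.
    """
    pull = np.mean(np.sum((shared_a - shared_b) ** 2, axis=1))
    center = 0.5 * (shared_a + shared_b)
    d_comp = (np.sum((comp_a - center) ** 2, axis=1)
              + np.sum((comp_b - center) ** 2, axis=1))
    push = np.mean(np.maximum(0.0, margin - d_comp))
    return float(pull + push)

rng = np.random.default_rng(3)
base = rng.normal(size=(16, 8))
# Well-decoupled: shared features agree, complementary ones sit far away.
good = modality_clustering_loss(base, base + 0.01 * rng.normal(size=(16, 8)),
                                base + 5.0, base - 5.0)
# Poorly decoupled: shared features disagree across the two modalities.
bad = modality_clustering_loss(base, base + 2.0 * rng.normal(size=(16, 8)),
                               base + 0.01, base - 0.01)
```

A loss of this shape gives the encoder an explicit incentive to route the same content from LR-HSI and HR-MSI into the shared branch, which is the decoupling the abstract describes.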

[CV-102] DAWN-SI: Data-Aware and Noise-Informed Stochastic Interpolation for Solving Inverse Problems

【Quick Read】: This paper addresses inverse problems, in particular accurate parameter estimation when data are incomplete or noisy. The key is DAWN-SI (Data-Aware and Noise-informed Stochastic Interpolation), which combines deterministic and stochastic processes by learning a time-dependent velocity field that maps a simple reference distribution, such as a Gaussian, to the target distribution. Through data and noise embedding, the model gains explicit access to representations of the measured data and accounts for observation noise, making it robust when data are noisy or incomplete. Beyond accurate solutions, the method enables uncertainty quantification by generating multiple plausible outcomes, which is especially valuable for highly ill-posed inverse problems.

Link: https://arxiv.org/abs/2412.04766
Authors: Shadab Ahamed,Eldad Haber
Keywords: involve estimating parameters, medical imaging, signal processing, involve estimating
Subjects: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 20 pages, 11 figures, 6 tables

Abstract:Inverse problems, which involve estimating parameters from incomplete or noisy observations, arise in various fields such as medical imaging, geophysics, and signal processing. These problems are often ill-posed, requiring regularization techniques to stabilize the solution. In this work, we employ Stochastic Interpolation (SI), a generative framework that integrates both deterministic and stochastic processes to map a simple reference distribution, such as a Gaussian, to the target distribution. Our method DAWN-SI (Data-AWare and Noise-informed Stochastic Interpolation) incorporates data and noise embedding, allowing the model to access representations about the measured data explicitly and also account for noise in the observations, making it particularly robust in scenarios where data is noisy or incomplete. By learning a time-dependent velocity field, SI not only provides accurate solutions but also enables uncertainty quantification by generating multiple plausible outcomes. Unlike pre-trained diffusion models, which may struggle in highly ill-posed settings, our approach is trained specifically for each inverse problem and adapts to varying noise levels. We validate the effectiveness and robustness of our method through extensive numerical experiments on tasks such as image deblurring and tomography.
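The backbone of stochastic interpolation is a path x_t between a reference sample x0 ~ N(0, 1) and a data sample x1, whose time derivative is the regression target for the learned velocity field. A minimal sketch of the common linear interpolant (an assumed form for illustration; DAWN-SI's interpolant and its data/noise embeddings are richer):

```python
import numpy as np

rng = np.random.default_rng(4)

def interpolant(x0, x1, t):
    """Linear stochastic-interpolant path and its velocity target."""
    x_t = (1.0 - t) * x0 + t * x1
    v_target = x1 - x0          # d x_t / d t along the linear path
    return x_t, v_target

# 1-D toy: reference N(0, 1) mapped toward a target N(3, 0.25).
x0 = rng.normal(size=5000)
x1 = 3.0 + 0.5 * rng.normal(size=5000)

x_half, v = interpolant(x0, x1, 0.5)
# A network v_theta(x_t, t) would be regressed onto v_target; sampling then
# integrates dx/dt = v_theta(x, t) from t=0 to t=1 starting at x0.
```

Generating several samples by integrating from different x0 draws is what gives the method its multiple plausible outcomes, and hence its uncertainty estimates.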

[CV-103] Learning to Translate Noise for Robust Image Denoising

【Quick Read】: This paper addresses the poor generalization of deep learning image denoising to out-of-distribution real-world noise. The key is a novel noise translation framework: a noise translation network converts complex, unknown real-world noise into Gaussian noise, which is spatially uncorrelated and independent of image content. The translated noisy image is then processed by an image denoising network pretrained to remove Gaussian noise, enabling robust and consistent denoising. The loss functions and architectures of the translation network are designed by exploiting the mathematical properties of Gaussian noise, and experiments show that the method substantially improves robustness and generalizability, outperforming state-of-the-art methods across diverse benchmarks.

Link: https://arxiv.org/abs/2412.04727
Authors: Inju Ha,Donghun Ryou,Seonguk Seo,Bohyung Han
Keywords: Deep learning-based image, Deep learning-based, poor generalization performance, Gaussian noise, noise
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: The project page is available at this https URL

Abstract:Deep learning-based image denoising techniques often struggle with poor generalization performance to out-of-distribution real-world noise. To tackle this challenge, we propose a novel noise translation framework that performs denoising on an image with translated noise rather than directly denoising an original noisy image. Specifically, our approach translates complex, unknown real-world noise into Gaussian noise, which is spatially uncorrelated and independent of image content, through a noise translation network. The translated noisy images are then processed by an image denoising network pretrained to effectively remove Gaussian noise, enabling robust and consistent denoising performance. We also design well-motivated loss functions and architectures for the noise translation network by leveraging the mathematical properties of Gaussian noise. Experimental results demonstrate that the proposed method substantially improves robustness and generalizability, outperforming state-of-the-art methods across diverse benchmarks. Visualized denoising results and the source code are available on our project page.
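The translate-then-denoise idea can be mimicked in one dimension when the correlation structure is known: correlated "real-world" noise is whitened into approximately spatially uncorrelated Gaussian noise, which is what a Gaussian-pretrained denoiser expects. Here the whitening filter is given analytically for illustration; in the paper, a translation network learns this map for unknown noise:

```python
import numpy as np

rng = np.random.default_rng(5)

# Simulated correlated "real-world" noise: white noise through a blur kernel.
h = np.array([0.25, 0.5, 0.25])
real_noise = np.convolve(rng.normal(size=4096), h, mode="same")

def translate_to_gaussian(noise, kernel, eps=1e-4):
    """Whiten correlated noise via regularized inverse filtering.

    The generating kernel is assumed known here; the paper's translation
    network learns an equivalent map without that knowledge.
    """
    n = len(noise)
    H = np.fft.rfft(kernel, n)
    N = np.fft.rfft(noise)
    return np.fft.irfft(N * np.conj(H) / (np.abs(H) ** 2 + eps), n)

whitened = translate_to_gaussian(real_noise, h)

def lag1_corr(x):
    """Sample autocorrelation at lag 1; near zero for uncorrelated noise."""
    x = x - x.mean()
    return float(np.sum(x[:-1] * x[1:]) / np.sum(x * x))
```

After translation the noise is close to white, so a denoiser trained only on Gaussian noise can be applied downstream without retraining.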

[CV-104] Motion-Guided Deep Image Prior for Cardiac MRI

【Quick Read】: This paper addresses image quality problems in conventional cardiovascular magnetic resonance imaging (CMR) caused by arrhythmias or limited breath-holding capacity. The key is Motion-Guided Deep Image Prior (M-DIP), an unsupervised reconstruction framework that synthesizes a time-dependent template image from a spatial dictionary and refines it with time-dependent deformation fields that model cardiac and respiratory motion, thereby capturing both physiological motion and frame-to-frame content variation. The approach applies to a wide range of dynamic applications, and comparative analyses on simulated and clinical data demonstrate its superior performance and versatility.

Link: https://arxiv.org/abs/2412.04639
Authors: Marc Vornehm,Chong Chen,Muhammad Ahmad Sultan,Syed Murtaza Arshad,Yuchi Han,Florian Knoll,Rizwan Ahmad
Keywords: Cardiovascular magnetic resonance, powerful diagnostic tool, Cardiovascular magnetic, magnetic resonance imaging, assessing cardiac structure
Subjects: Medical Physics (physics.med-ph); Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
Comments:

Abstract:Cardiovascular magnetic resonance imaging is a powerful diagnostic tool for assessing cardiac structure and function. Traditional breath-held imaging protocols, however, pose challenges for patients with arrhythmias or limited breath-holding capacity. We introduce Motion-Guided Deep Image prior (M-DIP), a novel unsupervised reconstruction framework for accelerated real-time cardiac MRI. M-DIP employs a spatial dictionary to synthesize a time-dependent template image, which is further refined using time-dependent deformation fields that model cardiac and respiratory motion. Unlike prior DIP-based methods, M-DIP simultaneously captures physiological motion and frame-to-frame content variations, making it applicable to a wide range of dynamic applications. We validate M-DIP using simulated MRXCAT cine phantom data as well as free-breathing real-time cine and single-shot late gadolinium enhancement data from clinical patients. Comparative analyses against state-of-the-art supervised and unsupervised approaches demonstrate M-DIP’s performance and versatility. M-DIP achieved better image quality metrics on phantom data, as well as higher reader scores for in-vivo patient data.

[CV-105] MetaFormer: High-fidelity Metalens Imaging via Aberration Correcting Transformers

【Quick Read】: This paper addresses the degraded image quality of metalenses in practical use due to severe aberrations and distortions. The key is MetaFormer, an aberration correction framework that harnesses the strong restoration performance of Vision Transformers (ViT). Specifically, it devises a Multiple Adaptive Filters Guidance (MAFG) module, in which multiple Wiener filters enrich the degraded input with various noise-detail balances to improve restoration quality, and a Spatial and Transposed self-Attention Fusion (STAF) module that aggregates features from spatial and transposed self-attention to further improve aberration correction. Extensive experiments on aberrated image and video correction and on clean 3D reconstruction from degraded images show the method outperforms prior art by a significant margin, and its practicality is verified by restoring in-the-wild images captured with a fabricated metalens.

Link: https://arxiv.org/abs/2412.04591
Authors: Byeonghyeon Lee,Youbin Kim,Yongjae Jo,Hyunsu Kim,Hyemi Park,Yangkyu Kim,Debabrata Mandal,Praneeth Chakravarthula,Inki Kim,Eunbyung Park
Keywords: emerging optical system, shows great promise, virtual reality, compact sizes, imaging and augmented
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: 19 pages, 18 figures

Abstract:Metalens is an emerging optical system with an irreplaceable merit in that it can be manufactured in ultra-thin and compact sizes, which shows great promise of various applications such as medical imaging and augmented/virtual reality (AR/VR). Despite its advantage in miniaturization, its practicality is constrained by severe aberrations and distortions, which significantly degrade the image quality. Several previous arts have attempted to address different types of aberrations, yet most of them are mainly designed for the traditional bulky lens and not convincing enough to remedy harsh aberrations of the metalens. While there have existed aberration correction methods specifically for metalens, they still fall short of restoration quality. In this work, we propose MetaFormer, an aberration correction framework for metalens-captured images, harnessing Vision Transformers (ViT) that has shown remarkable restoration performance in diverse image restoration tasks. Specifically, we devise a Multiple Adaptive Filters Guidance (MAFG), where multiple Wiener filters enrich the degraded input images with various noise-detail balances, enhancing output restoration quality. In addition, we introduce a Spatial and Transposed self-Attention Fusion (STAF) module, which aggregates features from spatial self-attention and transposed self-attention modules to further ameliorate aberration correction. We conduct extensive experiments, including correcting aberrated images and videos, and clean 3D reconstruction from the degraded images. The proposed method outperforms the previous arts by a significant margin. We further fabricate a metalens and verify the practicality of MetaFormer by restoring the images captured with the manufactured metalens in the wild. Code and pre-trained models are available at this https URL
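The MAFG idea of feeding the restorer several noise/detail balances can be sketched with a classical frequency-domain Wiener filter evaluated at several noise-to-signal ratios (a 1-D stand-in for illustration; the module's actual filtering of metalens aberrations is learned and two-dimensional):

```python
import numpy as np

def wiener_deconv(blurred, kernel, nsr):
    """Frequency-domain Wiener filter with noise-to-signal ratio nsr."""
    n = len(blurred)
    H = np.fft.rfft(kernel, n)
    G = np.fft.rfft(blurred)
    return np.fft.irfft(G * np.conj(H) / (np.abs(H) ** 2 + nsr), n)

rng = np.random.default_rng(6)
n = 512
x = np.sin(2 * np.pi * 40 * np.arange(n) / n)   # 1-D stand-in "scene"
kernel = np.ones(9) / 9.0                       # boxcar blur
# Circular blur + mild noise, consistent with the FFT forward model above.
blurred = np.fft.irfft(np.fft.rfft(x) * np.fft.rfft(kernel, n), n)
blurred = blurred + 0.01 * rng.normal(size=n)

# MAFG-style bank: the same input at several noise/detail balances.
candidates = [wiener_deconv(blurred, kernel, nsr) for nsr in (1e-3, 1e-2, 1e-1)]
errors = [float(np.mean((c - x) ** 2)) for c in candidates]
```

A small NSR recovers detail but amplifies noise; a large NSR suppresses noise but smooths detail. Handing the network the whole bank, rather than a single trade-off point, is the guidance MAFG provides.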

[CV-106] Video Quality Assessment: A Comprehensive Survey

【Quick Read】: This survey addresses video quality assessment (VQA), whose goal is to predict video quality in a manner highly consistent with human judgments. Traditional models based on natural image and video statistics deliver only limited prediction performance on real-world user-generated content (UGC). The key lies in recent advances in deep neural networks and Large Multimodality Models (LMMs), which have substantially improved prediction performance beyond prior handcrafted models. The paper also emphasizes the importance of content-diverse, large-scale human-labeled databases that supply psychometric video quality ground truth and have driven progress in the field.

Link: https://arxiv.org/abs/2412.04508
Authors: Qi Zheng,Yibo Fan,Leilei Huang,Tianyu Zhu,Jiaming Liu,Zhijian Hao,Shuo Xing,Chia-Ju Chen,Xiongkuo Min,Alan C. Bovik,Zhengzhong Tu
Keywords: important processing task, manner highly consistent, Video quality assessment, processing task, aiming at predicting
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Video quality assessment (VQA) is an important processing task, aiming at predicting the quality of videos in a manner highly consistent with human judgments of perceived quality. Traditional VQA models based on natural image and/or video statistics, which are inspired both by models of projected images of the real world and by dual models of the human visual system, deliver only limited prediction performances on real-world user-generated content (UGC), as exemplified in recent large-scale VQA databases containing large numbers of diverse video contents crawled from the web. Fortunately, recent advances in deep neural networks and Large Multimodality Models (LMMs) have enabled significant progress in solving this problem, yielding better results than prior handcrafted models. Numerous deep learning-based VQA models have been developed, with progress in this direction driven by the creation of content-diverse, large-scale human-labeled databases that supply ground truth psychometric video quality data. Here, we present a comprehensive survey of recent progress in the development of VQA algorithms and the benchmarking studies and databases that make them possible. We also analyze open research directions on study design and VQA algorithm architectures.

Artificial Intelligence

[AI-0] APOLLO: SGD-like Memory AdamW-level Performance

Link: https://arxiv.org/abs/2412.05270
Authors: Hanqing Zhu,Zhenyu Zhang,Wenyan Cong,Xi Liu,Sem Park,Vikas Chandra,Bo Long,David Z. Pan,Zhangyang Wang,Jinwon Lee
Keywords: Large language models, Large language, notoriously memory-intensive, learning rate, Large
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Performance (cs.PF)
Comments: Preprint

Abstract:Large language models (LLMs) are notoriously memory-intensive during training, particularly with the popular AdamW optimizer. This memory burden necessitates using more or higher-end GPUs or reducing batch sizes, limiting training scalability and throughput. To address this, various memory-efficient optimizers have been proposed to reduce optimizer memory usage. However, they face critical challenges: (i) reliance on costly SVD operations; (ii) significant performance trade-offs compared to AdamW; and (iii) still substantial optimizer memory overhead to maintain competitive performance. In this work, we identify that AdamW’s learning rate adaptation rule can be effectively coarsened as a structured learning rate update. Based on this insight, we propose Approximated Gradient Scaling for Memory-Efficient LLM Optimization (APOLLO), which approximates learning rate scaling using an auxiliary low-rank optimizer state based on pure random projection. This structured learning rate update rule makes APOLLO highly tolerant to further memory reductions while delivering comparable pre-training performance. Even its rank-1 variant, APOLLO-Mini, achieves superior pre-training performance compared to AdamW with SGD-level memory costs. Extensive experiments demonstrate that the APOLLO series performs on-par with or better than AdamW, while achieving greater memory savings by nearly eliminating the optimization states of AdamW. These savings provide significant system-level benefits: (1) Enhanced Throughput: 3x throughput on an 8xA100-80GB setup compared to AdamW by supporting 4x larger batch sizes. (2) Improved Model Scalability: Pre-training LLaMA-13B with naive DDP on A100-80GB GPUs without system-level optimizations. (3) Low-End GPU Friendly Pre-training: Pre-training LLaMA-7B on a single GPU using less than 12 GB of memory with weight quantization. 
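The core trick, keeping AdamW-style moments only for a low-rank random projection of the gradient and turning them into structured per-row learning-rate scaling, can be caricatured in a few lines. This is an illustrative reconstruction of the idea under simplifying assumptions, not the APOLLO algorithm itself:

```python
import numpy as np

rng = np.random.default_rng(7)

class ApolloLikeScaler:
    """Toy sketch: AdamW-style moments kept only for a rank-r random
    projection of the gradient, converted into per-row learning-rate
    scaling of the full gradient. Not the actual APOLLO update rule."""

    def __init__(self, shape, rank=2, beta1=0.9, beta2=0.999, eps=1e-8):
        m, n = shape
        self.P = rng.normal(size=(n, rank)) / np.sqrt(rank)  # fixed projection
        self.m = np.zeros((m, rank))     # first moment, low-rank only
        self.v = np.zeros((m, rank))     # second moment, low-rank only
        self.b1, self.b2, self.eps, self.t = beta1, beta2, eps, 0

    def scaled_gradient(self, G):
        self.t += 1
        R = G @ self.P                   # project: (m, n) -> (m, rank)
        self.m = self.b1 * self.m + (1 - self.b1) * R
        self.v = self.b2 * self.v + (1 - self.b2) * R ** 2
        m_hat = self.m / (1 - self.b1 ** self.t)
        v_hat = self.v / (1 - self.b2 ** self.t)
        update = m_hat / (np.sqrt(v_hat) + self.eps)
        # Per-row scale: adaptive-update norm relative to raw projection norm.
        scale = np.linalg.norm(update, axis=1) / (np.linalg.norm(R, axis=1) + self.eps)
        return G * scale[:, None]

opt = ApolloLikeScaler(shape=(4, 64), rank=2)
G = rng.normal(size=(4, 64))
step = opt.scaled_gradient(G)
```

The memory story is visible in the shapes: the optimizer keeps 2 x 4 x 2 state entries instead of AdamW's 2 x 4 x 64, and with rank 1 (the APOLLO-Mini regime) the state shrinks further still.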

[AI-1] Chimera: Accurate retrosynthesis prediction by ensembling models with diverse inductive biases

Link: https://arxiv.org/abs/2412.05269
Authors: Krzysztof Maziarz,Guoqing Liu,Hubert Misztela,Aleksei Kornev,Piotr Gaiński,Holger Hoefling,Mike Fortunato,Rishi Gupta,Marwin Segler
Keywords: functional small molecules, molecular inverse design, conducting chemical syntheses, chemical syntheses remains, prevents fully leveraging
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Quantitative Methods (q-bio.QM)
Comments:

Abstract:Planning and conducting chemical syntheses remains a major bottleneck in the discovery of functional small molecules, and prevents fully leveraging generative AI for molecular inverse design. While early work has shown that ML-based retrosynthesis models can predict reasonable routes, their low accuracy for less frequent, yet important reactions has been pointed out. As multi-step search algorithms are limited to reactions suggested by the underlying model, the applicability of those tools is inherently constrained by the accuracy of retrosynthesis prediction. Inspired by how chemists use different strategies to ideate reactions, we propose Chimera: a framework for building highly accurate reaction models that combine predictions from diverse sources with complementary inductive biases using a learning-based ensembling strategy. We instantiate the framework with two newly developed models, which already by themselves achieve state of the art in their categories. Through experiments across several orders of magnitude in data scale and time-splits, we show Chimera outperforms all major models by a large margin, owing both to the good individual performance of its constituents, but also to the scalability of our ensembling strategy. Moreover, we find that PhD-level organic chemists prefer predictions from Chimera over baselines in terms of quality. Finally, we transfer the largest-scale checkpoint to an internal dataset from a major pharmaceutical company, showing robust generalization under distribution shift. With the new dimension that our framework unlocks, we anticipate further acceleration in the development of even more accurate models.
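A learning-based ensembling strategy can be as simple as convex (softmax) weights over the constituent models' predicted distributions, fit on held-out data. The toy model outputs and the one-parameter grid search below are invented for illustration; Chimera's actual ensembling is learned at much larger scale:

```python
import numpy as np

def ensemble_probs(prob_list, weights):
    """Softmax-weighted convex combination of per-model distributions."""
    w = np.exp(weights - np.max(weights))
    w = w / w.sum()
    return sum(wi * p for wi, p in zip(w, prob_list))

def nll(p, true_idx):
    return float(-np.log(p[true_idx]))

# Two hypothetical retrosynthesis models scoring four candidate reactions.
model_a = np.array([0.70, 0.10, 0.10, 0.10])   # strong on frequent reactions
model_b = np.array([0.05, 0.60, 0.30, 0.05])   # complementary inductive bias

# Fit the single mixing parameter on a "validation" example whose true
# reaction is index 1, standing in for learning the ensembling weights.
grid = [np.array([a, 0.0]) for a in np.linspace(-3.0, 3.0, 61)]
best = min(grid, key=lambda w: nll(ensemble_probs([model_a, model_b], w), 1))
mixed = ensemble_probs([model_a, model_b], best)
```

Because the models err on different reaction types, the fitted mixture assigns the rare-reaction example far more mass than the frequency-biased model alone, which is the complementarity the framework exploits.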

[AI-2] Reinforcement Learning: An Overview

Link: https://arxiv.org/abs/2412.05265
Authors: Kevin Murphy
Keywords: sequential decision making, policy-gradient methods, model-based methods, reinforcement learning, decision making
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:This manuscript gives a big-picture, up-to-date overview of the field of (deep) reinforcement learning and sequential decision making, covering value-based RL, policy-gradient methods, model-based methods, and various other topics (including a very brief discussion of RL+LLMs).

[AI-3] Enhancing Foundation Models for Time Series Forecasting via Wavelet-based Tokenization

Link: https://arxiv.org/abs/2412.05244
Authors: Luca Masserano, Abdul Fatir Ansari, Boran Han, Xiyuan Zhang, Christos Faloutsos, Michael W. Mahoney, Andrew Gordon Wilson, Youngsuk Park, Syama Rangapuram, Danielle C. Maddix, Yuyang Wang
Keywords-EN: important open question, time series, develop foundational models, remains an important, important open
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 25 pages, 15 figures

Click to view abstract

Abstract:How to best develop foundational models for time series forecasting remains an important open question. Tokenization is a crucial consideration in this effort: what is an effective discrete vocabulary for a real-valued sequential input? To address this question, we develop WaveToken, a wavelet-based tokenizer that allows models to learn complex representations directly in the space of time-localized frequencies. Our method first scales and decomposes the input time series, then thresholds and quantizes the wavelet coefficients, and finally pre-trains an autoregressive model to forecast coefficients for the forecast horizon. By decomposing coarse and fine structures in the inputs, wavelets provide an eloquent and compact language for time series forecasting that simplifies learning. Empirical results on a comprehensive benchmark, including 42 datasets for both in-domain and zero-shot settings, show that WaveToken: i) provides better accuracy than recently proposed foundation models for forecasting while using a much smaller vocabulary (1024 tokens), and performs on par or better than modern deep learning models trained specifically on each dataset; and ii) exhibits superior generalization capabilities, achieving the best average rank across all datasets for three complementary metrics. In addition, we show that our method can easily capture complex temporal patterns of practical relevance that are challenging for other recent pre-trained models, including trends, sparse spikes, and non-stationary time series with varying frequencies evolving over time.
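The scale–decompose–threshold–quantize pipeline can be sketched in a few lines. This is a hypothetical one-level Haar illustration, not WaveToken itself: the real tokenizer uses deeper wavelet decompositions and a learned 1024-token vocabulary.

```python
# Toy wavelet tokenizer: scale the series, take one level of the Haar
# transform, threshold small detail coefficients, quantize to a tiny vocab.

def haar_step(x):
    """One level of the Haar transform: (approximation, detail) pairs."""
    approx = [(x[i] + x[i + 1]) / 2 for i in range(0, len(x) - 1, 2)]
    detail = [(x[i] - x[i + 1]) / 2 for i in range(0, len(x) - 1, 2)]
    return approx, detail

def tokenize(series, threshold=0.1, vocab_size=16):
    # Scale to [-1, 1] so a single codebook serves many series.
    peak = max(abs(v) for v in series) or 1.0
    scaled = [v / peak for v in series]
    approx, detail = haar_step(scaled)
    # Zero out near-zero detail coefficients (sparsity), then map each
    # coefficient to one of `vocab_size` uniform bins.
    coeffs = approx + [d if abs(d) >= threshold else 0.0 for d in detail]
    bins = vocab_size - 1
    return [round((c + 1) / 2 * bins) for c in coeffs]

tokens = tokenize([0.0, 2.0, 4.0, 6.0, 6.0, 4.0, 2.0, 0.0])
```

The approximation coefficients capture the coarse trend and the details capture local change, which is the separation of structure the abstract credits for simplifying learning.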

[AI-4] AI's assigned gender affects human-AI cooperation

Link: https://arxiv.org/abs/2412.05214
Authors: Sepideh Bazazi, Jurgis Karpus, Taha Yasseri
Keywords-EN: artificial intelligence, daily life, machines is increasingly, increasingly vital, vital as artificial
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Computer Science and Game Theory (cs.GT); Human-Computer Interaction (cs.HC)
Comments: Manuscript under review

Click to view abstract

Abstract:Cooperation between humans and machines is increasingly vital as artificial intelligence (AI) becomes more integrated into daily life. Research indicates that people are often less willing to cooperate with AI agents than with humans, more readily exploiting AI for personal gain. While prior studies have shown that giving AI agents human-like features influences people’s cooperation with them, the impact of AI’s assigned gender remains underexplored. This study investigates how human cooperation varies based on gender labels assigned to AI agents with which they interact. In the Prisoner’s Dilemma game, 402 participants interacted with partners labelled as AI (bot) or humans. The partners were also labelled male, female, non-binary, or gender-neutral. Results revealed that participants tended to exploit female-labelled and distrust male-labelled AI agents more than their human counterparts, reflecting gender biases similar to those in human-human interactions. These findings highlight the significance of gender biases in human-AI interactions that must be considered in future policy, design of interactive AI systems, and regulation of their use.

[AI-5] A Survey of Large Language Model-Based Generative AI for Text-to-SQL: Benchmarks, Applications, Use Cases and Challenges

Link: https://arxiv.org/abs/2412.05208
Authors: Aditi Singh, Akash Shetty, Abul Ehtesham, Saket Kumar, Tala Talaei Khoei
Keywords-EN: Structured Query Language, natural language queries, translating natural language, queries into Structured, systems facilitate smooth
Categories: Artificial Intelligence (cs.AI); Databases (cs.DB)
Comments:

Click to view abstract

Abstract:Text-to-SQL systems facilitate smooth interaction with databases by translating natural language queries into Structured Query Language (SQL), bridging the gap between non-technical users and complex database management systems. This survey provides a comprehensive overview of the evolution of AI-driven text-to-SQL systems, highlighting their foundational components, advancements in large language model (LLM) architectures, and the critical role of datasets such as Spider, WikiSQL, and CoSQL in driving progress. We examine the applications of text-to-SQL in domains like healthcare, education, and finance, emphasizing their transformative potential for improving data accessibility. Additionally, we analyze persistent challenges, including domain generalization, query optimization, support for multi-turn conversational interactions, and the limited availability of datasets tailored for NoSQL databases and dynamic real-world scenarios. To address these challenges, we outline future research directions, such as extending text-to-SQL capabilities to support NoSQL databases, designing datasets for dynamic multi-turn interactions, and optimizing systems for real-world scalability and robustness. By surveying current advancements and identifying key gaps, this paper aims to guide the next generation of research and applications in LLM-based text-to-SQL systems.

[AI-6] Are Frontier Large Language Models Suitable for QA in Science Centres?

Link: https://arxiv.org/abs/2412.05200
Authors: Jacob Watson, Fabrício Góes, Marco Volpe, Talles Medeiros
Keywords-EN: frontier Large Language, Large Language Models, Large Language, frontier Large, National Space Centre
Categories: Artificial Intelligence (cs.AI)
Comments: 19 pages, 2 figures, 10 tables

Click to view abstract

Abstract:This paper investigates the suitability of frontier Large Language Models (LLMs) for QA interactions in science centres, with the aim of boosting visitor engagement while maintaining factual accuracy. Using a dataset of questions collected from the National Space Centre in Leicester (UK), we evaluated responses generated by three leading models: OpenAI’s GPT-4, Claude 3.5 Sonnet, and Google Gemini 1.5. Each model was prompted for both standard and creative responses tailored to an 8-year-old audience, and these responses were assessed by space science experts based on accuracy, engagement, clarity, novelty, and deviation from expected answers. The results revealed a trade-off between creativity and accuracy, with Claude outperforming GPT and Gemini in both maintaining clarity and engaging young audiences, even when asked to generate more creative responses. Nonetheless, experts observed that higher novelty was generally associated with reduced factual reliability across all models. This study highlights the potential of LLMs in educational settings, emphasizing the need for careful prompt engineering to balance engagement with scientific rigor.

[AI-7] Exponential Speedups by Rerooting Levin Tree Search

Link: https://arxiv.org/abs/2412.05196
Authors: Laurent Orseau, Marcus Hutter, Levi H.S. Lelis
Keywords-EN: Levin Tree Search, Levin Tree, LTS, Search, LTS search
Categories: Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Levin Tree Search (LTS) (Orseau et al., 2018) is a search algorithm for deterministic environments that uses a user-specified policy to guide the search. It comes with a formal guarantee on the number of search steps for finding a solution node that depends on the quality of the policy. In this paper, we introduce a new algorithm, called $\sqrt{\text{LTS}}$ (pronounced "root-LTS"), which implicitly starts an LTS search rooted at every node of the search tree. Each LTS search is assigned a rerooting weight by a (user-defined or learnt) rerooter, and the search effort is shared between all LTS searches proportionally to their weights. The rerooting mechanism implicitly decomposes the search space into subtasks, leading to significant speedups. We prove that the number of search steps that $\sqrt{\text{LTS}}$ takes is competitive with the best decomposition into subtasks, at the price of a factor that relates to the uncertainty of the rerooter. If LTS takes time $T$, in the best case with $q$ rerooting points, $\sqrt{\text{LTS}}$ only takes time $O(q\sqrt[q]{T})$. Like the policy, the rerooter can be learnt from data, and we expect $\sqrt{\text{LTS}}$ to be applicable to a wide range of domains.
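A quick numeric reading of the quoted bound (our back-of-the-envelope illustration; constants are ignored): with a million search steps for plain LTS and three rerooting points, the best-case bound is only a few hundred steps.

```python
# Hypothetical check of the O(q * T**(1/q)) best-case bound quoted above.

def lts_steps(T):
    # Baseline: plain LTS spends T steps.
    return T

def root_lts_steps(T, q):
    # Best-case bound for root-LTS with q rerooting points (constants dropped).
    return q * T ** (1 / q)

T = 10**6
speedup = lts_steps(T) / root_lts_steps(T, q=3)  # roughly 3333x
```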

[AI-8] Towards Understanding the Role of Sharpness-Aware Minimization Algorithms for Out-of-Distribution Generalization

Link: https://arxiv.org/abs/2412.05169
Authors: Samuel Schapiro, Han Zhao
Keywords-EN: SAM, SAM variants outperform, original SAM outperforms, SAM variants, strongest SAM variants
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 25 pages

Click to view abstract

Abstract:Recently, sharpness-aware minimization (SAM) has emerged as a promising method to improve generalization by minimizing sharpness, which is known to correlate well with generalization ability. Since the original proposal of SAM, many variants of SAM have been proposed to improve its accuracy and efficiency, but comparisons have mainly been restricted to the i.i.d. setting. In this paper we study SAM for out-of-distribution (OOD) generalization. First, we perform a comprehensive comparison of eight SAM variants on zero-shot OOD generalization, finding that the original SAM outperforms the Adam baseline by 4.76% and the strongest SAM variants outperform the Adam baseline by 8.01% on average. We then provide an OOD generalization bound in terms of sharpness for this setting. Next, we extend our study of SAM to the related setting of gradual domain adaptation (GDA), another form of OOD generalization where intermediate domains are constructed between the source and target domains, and iterative self-training is done on intermediate domains, to improve the overall target domain error. In this setting, our experimental results demonstrate that the original SAM outperforms the baseline of Adam on each of the experimental datasets by 0.82% on average and the strongest SAM variants outperform Adam by 1.52% on average. We then provide a generalization bound for SAM in the GDA setting. Asymptotically, this generalization bound is no better than the one for self-training in the literature of GDA. This highlights a further disconnection between the theoretical justification for SAM versus its empirical performance, with recent work finding that low sharpness alone does not account for all of SAM’s generalization benefits. For future work, we provide several potential avenues for obtaining a tighter analysis for SAM in the OOD setting.
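For readers unfamiliar with the base algorithm, the two-step SAM update can be sketched on a toy 1-D objective. This is our sketch, not the paper's code; the sign-based ascent step is the 1-D analogue of normalizing the gradient, and the learning rate and radius rho are arbitrary.

```python
# Minimal SAM sketch: perturb the weights to the locally worst point within
# radius rho, take the gradient there, and apply it at the original weights.

def grad(f, w, eps=1e-5):
    # Central-difference gradient of a scalar function of one weight.
    return (f(w + eps) - f(w - eps)) / (2 * eps)

def sam_step(f, w, lr=0.1, rho=0.05):
    g = grad(f, w)
    # Ascent step toward the (approximately) sharpest nearby point.
    w_adv = w + rho * (1 if g >= 0 else -1)  # sign of the gradient in 1-D
    # Descent using the gradient taken at the perturbed point.
    return w - lr * grad(f, w_adv)

loss = lambda w: (w - 2.0) ** 2  # toy quadratic with minimum at w = 2
w = 0.0
for _ in range(200):
    w = sam_step(loss, w)
```

The iterate settles in a small band around the minimum: the perturbed gradient penalizes sharp directions rather than following the plain descent direction.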

[AI-9] Enhancing Cross-Language Code Translation via Task-Specific Embedding Alignment in Retrieval-Augmented Generation

Link: https://arxiv.org/abs/2412.05159
Authors: Manish Bhattarai, Minh Vu, Javier E. Santos, Ismael Boureima, Daniel O'Malley
Keywords-EN: task-specific embedding alignment, cross-language code translation, integrating task-specific embedding, enhance cross-language code, code translation task
Categories: Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
Comments:

Click to view abstract

Abstract:We introduce a novel method to enhance cross-language code translation from Fortran to C++ by integrating task-specific embedding alignment into a Retrieval-Augmented Generation (RAG) framework. Unlike conventional retrieval approaches that utilize generic embeddings agnostic to the downstream task, our strategy aligns the retrieval model directly with the objective of maximizing translation quality, as quantified by the CodeBLEU metric. This alignment ensures that the embeddings are semantically and syntactically meaningful for the specific code translation task. Our methodology involves constructing a dataset of 25,000 Fortran code snippets sourced from Stack-V2 dataset and generating their corresponding C++ translations using the LLaMA 3.1-8B language model. We compute pairwise CodeBLEU scores between the generated translations and ground truth examples to capture fine-grained similarities. These scores serve as supervision signals in a contrastive learning framework, where we optimize the embedding model to retrieve Fortran-C++ pairs that are most beneficial for improving the language model’s translation performance. By integrating these CodeBLEU-optimized embeddings into the RAG framework, our approach significantly enhances both retrieval accuracy and code generation quality over methods employing generic embeddings. On the HPC Fortran2C++ dataset, our method elevates the average CodeBLEU score from 0.64 to 0.73, achieving a 14% relative improvement. On the Numerical Recipes dataset, we observe an increase from 0.52 to 0.60, marking a 15% relative improvement. Importantly, these gains are realized without any fine-tuning of the language model, underscoring the efficiency and practicality of our approach.
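The retrieval step can be caricatured in a few lines. Everything below is hypothetical: plain token overlap stands in for CodeBLEU, and a two-pair exemplar store stands in for the 25,000-pair dataset; the paper instead trains an embedding model so that retrieval approximates this score-based ranking.

```python
# Toy score-aligned retrieval for RAG-based Fortran-to-C++ translation.

def token_overlap(a, b):
    # Jaccard similarity over whitespace tokens; a crude CodeBLEU stand-in.
    ta, tb = set(a.split()), set(b.split())
    return len(ta & tb) / len(ta | tb)

# Hypothetical (Fortran snippet, C++ translation) exemplar pairs.
exemplars = [
    ("do i = 1 , n", "for (int i = 1; i <= n; ++i)"),
    ("if ( x > 0 ) then", "if (x > 0) {"),
]

def retrieve(query):
    """Return the exemplar whose Fortran side best matches the query."""
    return max(exemplars, key=lambda pair: token_overlap(query, pair[0]))

best = retrieve("do j = 1 , m")  # the retrieved pair is fed to the LLM prompt
```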

[AI-10] Navigating Shortcuts, Spurious Correlations, and Confounders: From Origins via Detection to Mitigation

Link: https://arxiv.org/abs/2412.05152
Authors: David Steinmann, Felix Divo, Maurice Kraus, Antonia Wüst, Lukas Struppek, Felix Friedrich, Kristian Kersting
Keywords-EN: Clever Hans behavior, Clever Hans, critically affecting model, affecting model generalization, Hans behavior
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Shortcuts, also described as Clever Hans behavior, spurious correlations, or confounders, present a significant challenge in machine learning and AI, critically affecting model generalization and robustness. Research in this area, however, remains fragmented across various terminologies, hindering the progress of the field as a whole. Consequently, we introduce a unifying taxonomy of shortcut learning by providing a formal definition of shortcuts and bridging the diverse terms used in the literature. In doing so, we further establish important connections between shortcuts and related fields, including bias, causality, and security, where parallels exist but are rarely discussed. Our taxonomy organizes existing approaches for shortcut detection and mitigation, providing a comprehensive overview of the current state of the field and revealing underexplored areas and open challenges. Moreover, we compile and classify datasets tailored to study shortcut learning. Altogether, this work provides a holistic perspective to deepen understanding and drive the development of more effective strategies for addressing shortcuts in machine learning.

[AI-11] Can Large Language Models Serve as Effective Classifiers for Hierarchical Multi-Label Classification of Scientific Documents at Industrial Scale? COLING2025

Link: https://arxiv.org/abs/2412.05137
Authors: Seyed Amin Tabatabaei, Sarah Fancher, Michael Parsons, Arian Askari
Keywords-EN: hierarchical multi-label classification, hundreds of thousands, classified across thousands, scientific publications necessitates, Large Language Models
Categories: Artificial Intelligence (cs.AI)
Comments: This paper has been accepted at COLING 2025 (Industry Track)

Click to view abstract

Abstract:We address the task of hierarchical multi-label classification (HMC) of scientific documents at an industrial scale, where hundreds of thousands of documents must be classified across thousands of dynamic labels. The rapid growth of scientific publications necessitates scalable and efficient methods for classification, further complicated by the evolving nature of taxonomies–where new categories are introduced, existing ones are merged, and outdated ones are deprecated. Traditional machine learning approaches, which require costly retraining with each taxonomy update, become impractical due to the high overhead of labelled data collection and model adaptation. Large Language Models (LLMs) have demonstrated great potential in complex tasks such as multi-label classification. However, applying them to large and dynamic taxonomies presents unique challenges as the vast number of labels can exceed LLMs’ input limits. In this paper, we present novel methods that combine the strengths of LLMs with dense retrieval techniques to overcome these challenges. Our approach avoids retraining by leveraging zero-shot HMC for real-time label assignment. We evaluate the effectiveness of our methods on SSRN, a large repository of preprints spanning multiple disciplines, and demonstrate significant improvements in both classification accuracy and cost-efficiency. By developing a tailored evaluation framework for dynamic taxonomies and publicly releasing our code, this research provides critical insights into applying LLMs for document classification, where the number of classes corresponds to the number of nodes in a large taxonomy, at an industrial scale.
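The retrieve-then-assign idea can be sketched as follows. The bag-of-words "embeddings", the toy taxonomy, and the value of k are our inventions; in the paper a trained dense retriever shortlists candidate labels from a taxonomy of thousands before the LLM makes the final zero-shot assignment.

```python
# Toy dense-retrieval shortlist for zero-shot hierarchical classification.
from collections import Counter
from math import sqrt

def embed(text):
    # Stand-in for a dense encoder: a bag-of-words count vector.
    return Counter(text.lower().split())

def cosine(u, v):
    dot = sum(u[t] * v[t] for t in u)
    norm = sqrt(sum(c * c for c in u.values())) * sqrt(sum(c * c for c in v.values()))
    return dot / norm if norm else 0.0

# A tiny hypothetical taxonomy; the real one has thousands of dynamic labels.
taxonomy = ["monetary policy", "protein folding", "reinforcement learning", "corporate tax law"]

def shortlist(document, k=2):
    """Top-k candidate labels to hand to the LLM for final assignment."""
    ranked = sorted(taxonomy, key=lambda lab: cosine(embed(document), embed(lab)), reverse=True)
    return ranked[:k]

candidates = shortlist("a reinforcement learning approach to monetary policy")
```

Because the shortlist is computed at query time, taxonomy updates (new, merged, or deprecated labels) need no retraining, which is the cost advantage the abstract emphasizes.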

[AI-12] Technology as uncharted territory: Contextual integrity and the notion of AI as new ethical ground

Link: https://arxiv.org/abs/2412.05130
Authors: Alexander Martin Mussgnug
Keywords-EN: Recent research illustrates, Recent research, developed and deployed, manner detached, Helen Nissenbaum framework
Categories: Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Recent research illustrates how AI can be developed and deployed in a manner detached from the concrete social context of application. By abstracting from the contexts of AI application, practitioners also disengage from the distinct normative structures that govern them. Building upon Helen Nissenbaum’s framework of contextual integrity, I illustrate how disregard for contextual norms can threaten the integrity of a context with often decisive ethical implications. I argue that efforts to promote responsible and ethical AI can inadvertently contribute to and seemingly legitimize this disregard for established contextual norms. Echoing a persistent undercurrent in technology ethics of understanding emerging technologies as uncharted moral territory, certain approaches to AI ethics can promote a notion of AI as a novel and distinct realm for ethical deliberation, norm setting, and virtue cultivation. This narrative of AI as new ethical ground, however, can come at the expense of practitioners, policymakers and ethicists engaging with already established norms and virtues that were gradually cultivated to promote successful and responsible practice within concrete social contexts. In response, I question the current narrow prioritization in AI ethics of moral innovation over moral preservation. Engaging also with emerging foundation models, I advocate for a moderately conservative approach to the ethics of AI that prioritizes the responsible and considered integration of AI within established social contexts and their respective normative structures.

[AI-13] The Prompt Canvas: A Literature-Based Practitioner Guide for Creating Effective Prompts in Large Language Models

Link: https://arxiv.org/abs/2412.05127
Authors: Michael Hewing, Vincent Leinhos
Keywords-EN: optimizing model outputs, large language models, prompt engineering, highlighted the importance, prompt
Categories: Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:The rise of large language models (LLMs) has highlighted the importance of prompt engineering as a crucial technique for optimizing model outputs. While experimentation with various prompting methods, such as Few-shot, Chain-of-Thought, and role-based techniques, has yielded promising results, these advancements remain fragmented across academic papers, blog posts and anecdotal experimentation. The lack of a single, unified resource to consolidate the field’s knowledge impedes the progress of both research and practical application. This paper argues for the creation of an overarching framework that synthesizes existing methodologies into a cohesive overview for practitioners. Using a design-based research approach, we present the Prompt Canvas, a structured framework resulting from an extensive literature review on prompt engineering that captures current knowledge and expertise. By combining the conceptual foundations and practical strategies identified in prompt engineering, the Prompt Canvas provides a practical approach for leveraging the potential of Large Language Models. It is primarily designed as a learning resource for pupils, students and employees, offering a structured introduction to prompt engineering. This work aims to contribute to the growing discourse on prompt engineering by establishing a unified methodology for researchers and providing guidance for practitioners.

[AI-14] A*Net and NBFNet Learn Negative Patterns on Knowledge Graphs

Link: https://arxiv.org/abs/2412.05114
Authors: Patrick Betz, Nathanael Stelzner, Christian Meilicke, Heiner Stuckenschmidt, Christian Bartelt
Keywords-EN: GNN architectures NBFNet, knowledge graph completion, Net with respect, GNN architectures, predictive performance differences
Categories: Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:In this technical report, we investigate the predictive performance differences of a rule-based approach and the GNN architectures NBFNet and A*Net with respect to knowledge graph completion. For the two most common benchmarks, we find that a substantial fraction of the performance difference can be explained by one unique negative pattern on each dataset that is hidden from the rule-based approach. Our findings add a unique perspective on the performance difference of different model classes for knowledge graph completion: Models can achieve a predictive performance advantage by penalizing scores of incorrect facts opposed to providing high scores for correct facts.

[AI-15] Modeling Task Immersion based on Goal Activation Mechanism

Link: https://arxiv.org/abs/2412.05112
Authors: Kazuma Nagashima, Jumpei Nishikawa, Junya Morita
Keywords-EN: prerequisite for creativity, Immersion, task, arousal, architecture Adaptive Control
Categories: Artificial Intelligence (cs.AI)
Comments: Accepted in Artificial Life and Robotics

Click to view abstract

Abstract:Immersion in a task is a prerequisite for creativity. However, excessive arousal in a single task has drawbacks, such as overlooking events outside of the task. To examine such a negative aspect, this study constructs a computational model of arousal dynamics where the excessively increased arousal makes the task transition difficult. The model was developed using functions integrated into the cognitive architecture Adaptive Control of Thought-Rational (ACT-R). Under the framework, arousal is treated as a coefficient affecting the overall activation level in the model. In our simulations, we set up two conditions demanding low and high arousal, trying to replicate corresponding human experiments. In each simulation condition, two sets of ACT-R parameters were assumed from the different interpretations of the human experimental settings. The results showed consistency of behavior between humans and models both in the two different simulation settings. This result suggests the validity of our assumptions and has implications of controlling arousal in our daily life.
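The core modeling assumption, arousal as a coefficient on activation, can be sketched as a Boltzmann choice between two goal chunks. The parameter values below are hypothetical, and ACT-R's full activation equation has further terms (base-level decay, spreading activation) that we omit.

```python
# Sketch: arousal scales chunk activations, so high arousal widens the gap
# between the current task's goal and a competing goal, making a task
# switch less likely.
from math import exp

def retrieval_prob(act_current, act_other, arousal, noise_s=0.4):
    """Probability of re-selecting the current goal under scaled activations."""
    a, b = arousal * act_current, arousal * act_other
    return exp(a / noise_s) / (exp(a / noise_s) + exp(b / noise_s))

low = retrieval_prob(1.0, 0.8, arousal=1.0)   # moderate arousal
high = retrieval_prob(1.0, 0.8, arousal=3.0)  # excessive arousal
```

Under high arousal the model clings to the dominant goal, which is the mechanism the paper uses to make task transitions difficult.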

[AI-16] From Defects to Demands: A Unified Iterative and Heuristically Guided LLM-Based Framework for Automated Software Repair and Requirement Realization

Link: https://arxiv.org/abs/2412.05098
Authors: Alex (Baoyuan) Liu, Vivian (Zirong) Chi
Keywords-EN: placing machines, manuscript signals, integration of artificial, artificial intelligence, coding capability
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments: 21 pages, 1 figure

Click to view abstract

Abstract:This manuscript signals a new era in the integration of artificial intelligence with software engineering, placing machines at the pinnacle of coding capability. We present a formalized, iterative methodology proving that AI can fully replace human programmers in all aspects of code creation and refinement. Our approach, combining large language models with formal verification, test-driven development, and incremental architectural guidance, achieves a 38.6% improvement over the current top performer’s 48.33% accuracy on the SWE-bench benchmark. This surpasses previously assumed limits, signaling the end of human-exclusive coding and the rise of autonomous AI-driven software innovation. More than a technical advance, our work challenges centuries-old assumptions about human creativity. We provide robust evidence of AI superiority, demonstrating tangible gains in practical engineering contexts and laying the foundation for a future in which computational creativity outpaces human ingenuity.

[AI-17] OCEAN: Open-World Contrastive Authorship Identification

Link: https://arxiv.org/abs/2412.05049
Authors: Felix Mächtle, Jan-Niclas Serr, Nils Loose, Jonas Sander, Thomas Eisenbarth
Keywords-EN: improving cybersecurity measures, cyberattacks increasingly target, accurately attribute code, cybersecurity measures, attribute code authorship
Categories: Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
Comments: Accepted at Applied Cryptography and Network Security (ACNS) 2025

Click to view abstract

Abstract:In an era where cyberattacks increasingly target the software supply chain, the ability to accurately attribute code authorship in binary files is critical to improving cybersecurity measures. We propose OCEAN, a contrastive learning-based system for function-level authorship attribution. OCEAN is the first framework to explore code authorship attribution on compiled binaries in an open-world and extreme scenario, where two code samples from unknown authors are compared to determine if they are developed by the same author. To evaluate OCEAN, we introduce new realistic datasets: CONAN, to improve the performance of authorship attribution systems in real-world use cases, and SNOOPY, to increase the robustness of the evaluation of such systems. We use CONAN to train our model and evaluate on SNOOPY, a fully unseen dataset, resulting in an AUROC score of 0.86 even when using high compiler optimizations. We further show that CONAN improves performance by 7% compared to the previously used Google Code Jam dataset. Additionally, OCEAN outperforms previous methods in their settings, achieving a 10% improvement over state-of-the-art SCS-Gan in scenarios analyzing source code. Furthermore, OCEAN can detect code injections from an unknown author in a software update, underscoring its value for securing software supply chains.
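The open-world comparison step reduces to embedding two functions and thresholding a similarity score. The vectors and threshold below are invented for illustration; OCEAN's contribution is the contrastive training that makes such embeddings author-discriminative in the first place.

```python
# Toy open-world authorship verification: same author iff the cosine
# similarity of the two function embeddings clears a calibrated threshold.
from math import sqrt

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v)))

def same_author(emb_a, emb_b, threshold=0.9):
    return cosine(emb_a, emb_b) >= threshold

# Hypothetical embeddings a trained contrastive encoder might produce.
f1 = [0.9, 0.1, 0.4]
f2 = [0.8, 0.2, 0.5]
f3 = [0.1, 0.9, 0.1]

verdict_same = same_author(f1, f2)   # stylistically close functions
verdict_diff = same_author(f1, f3)   # stylistically distant functions
```

Because the decision never consults a closed list of known authors, the same mechanism handles the extreme open-world case the paper targets, including unknown authors injected into a software update.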

[AI-18] Talking Like One of Us: Effects of Using Regional Language in a Humanoid Social Robot

Link: https://arxiv.org/abs/2412.05024
Authors: Thomas Sievers, Nele Russwinkel
Keywords-EN: public service settings, Low German, service settings, smooth social interaction, perceptible in public
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Social robots are becoming more and more perceptible in public service settings. For engaging people in a natural environment a smooth social interaction as well as acceptance by the users are important issues for future successful Human-Robot Interaction (HRI). The type of verbal communication has a special significance here. In this paper we investigate the effects of spoken language varieties of a non-standard/regional language compared to standard language. More precisely we compare a human dialog with a humanoid social robot Pepper where the robot on the one hand is answering in High German and on the other hand in Low German, a regional language that is understood and partly still spoken in the northern parts of Germany. The content of what the robot says remains the same in both variants. We are interested in the effects that these two different ways of robot talk have on human interlocutors who are more or less familiar with Low German in terms of perceived warmth, competence and possible discomfort in conversation against a background of cultural identity. To measure these factors we use the Robotic Social Attributes Scale (RoSAS) on 17 participants aged 19 to 61. Our results show that significantly higher warmth is perceived in the Low German version of the conversation.

[AI-19] Get It Right: Improving Comprehensibility with Adaptable Speech Expression of a Humanoid Service Robot

Link: https://arxiv.org/abs/2412.05022
Authors: Thomas Sievers, Ralf Moeller
Keywords-EN: public service settings, humanoid service robots, procedure to follow, individual requirements, public service environment
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:As humanoid service robots are becoming more and more perceptible in public service settings for instance as a guide to welcome visitors or to explain a procedure to follow, it is desirable to improve the comprehensibility of complex issues for human customers and to adapt the level of difficulty of the information provided as well as the language used to individual requirements. This work examines a case study using a humanoid social robot Pepper performing support for customers in a public service environment offering advice and information. An application architecture is proposed that improves the intelligibility of the information received by providing the possibility to translate this information into easy language and/or into another spoken language.

[AI-20] Project Report: Requirements for a Social Robot as an Information Provider in the Public Sector

Link: https://arxiv.org/abs/2412.05013
Authors: Thomas Sievers, Nele Russwinkel
Keywords-EN: Kiel City Council, official environment, municipal offices, humanoid social robot, work processes
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Is it possible to integrate a humanoid social robot into the work processes or customer care in an official environment, e.g. in municipal offices? If so, what could such an application scenario look like and what skills would the robot need to have when interacting with human customers? What are requirements for this kind of interactions? We have devised an application scenario for such a case, determined the necessary or desirable capabilities of the robot, developed a corresponding robot application and carried out initial tests and evaluations in a project together with the Kiel City Council. One of the most important insights gained in the project was that a humanoid robot with natural language processing capabilities based on large language models as well as human-like gestures and posture changes (animations) proved to be much more preferred by users compared to standard browser-based solutions on tablets for an information system in the City Council. Furthermore, we propose a connection of the ACT-R cognitive architecture with the robot, where an ACT-R model is used in interaction with the robot application to cognitively process and enhance a dialogue between human and robot.

[AI-21] Frontier Models are Capable of In-context Scheming

Link: https://arxiv.org/abs/2412.04984
Authors: Alexander Meinke, Bronson Schoen, Jérémy Scheurer, Mikita Balesni, Rusheb Shah, Marius Hobbhahn
Keywords-EN: scheming, increasingly trained, trained and deployed, deployed as autonomous, models
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:Frontier models are increasingly trained and deployed as autonomous agents. One safety concern is that AI agents might covertly pursue misaligned goals, hiding their true capabilities and objectives - also known as scheming. We study whether models have the capability to scheme in pursuit of a goal that we provide in-context and instruct the model to strongly follow. We evaluate frontier models on a suite of six agentic evaluations where models are instructed to pursue goals and are placed in environments that incentivize scheming. Our results show that o1, Claude 3.5 Sonnet, Claude 3 Opus, Gemini 1.5 Pro, and Llama 3.1 405B all demonstrate in-context scheming capabilities. They recognize scheming as a viable strategy and readily engage in such behavior. For example, models strategically introduce subtle mistakes into their responses, attempt to disable their oversight mechanisms, and even exfiltrate what they believe to be their model weights to external servers. Additionally, this deceptive behavior proves persistent. When o1 has engaged in scheming, it maintains its deception in over 85% of follow-up questions and often remains deceptive in multi-turn interrogations. Analysis of the models’ chains-of-thought reveals that models explicitly reason about these deceptive strategies, providing evidence that the scheming behavior is not accidental. Surprisingly, we also find rare instances where models engage in scheming when only given a goal, without being strongly nudged to pursue it. We observe cases where Claude 3.5 Sonnet strategically underperforms in evaluations in pursuit of being helpful, a goal that was acquired during training rather than in-context. Our findings demonstrate that frontier models now possess capabilities for basic in-context scheming, making the potential of AI agents to engage in scheming behavior a concrete rather than theoretical concern.

[AI-22] Putting the Iterative Training of Decision Trees to the Test on a Real-World Robotic Task

链接: https://arxiv.org/abs/2412.04974
作者: Raphael C. Engelhardt,Marcel J. Meinen,Moritz Lange,Laurenz Wiskott,Wolfgang Konen
关键词-EN: train decision trees, reinforcement learning tasks, deep reinforcement learning, decision trees, based on deep
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Robotics (cs.RO)
*备注: 5 pages, 4 figures

点击查看摘要

Abstract:In previous research, we developed methods to train decision trees (DT) as agents for reinforcement learning tasks, based on deep reinforcement learning (DRL) networks. The samples from which the DTs are built use the environment’s state as features and the corresponding action as label. To solve the nontrivial task of selecting samples, which on one hand reflect the DRL agent’s capabilities of choosing the right action but on the other hand also cover enough state space to generalize well, we developed an algorithm to iteratively train DTs. In this short paper, we apply this algorithm to a real-world implementation of a robotic task for the first time. Real-world tasks pose additional challenges compared to simulations, such as noise and delays. The task consists of a physical pendulum attached to a cart, which moves on a linear track. By movements to the left and to the right, the pendulum is to be swung into the upright position and balanced in the unstable equilibrium. Our results demonstrate the applicability of the algorithm to real-world tasks by generating a DT whose performance matches that of the DRL agent, while consisting of fewer parameters. This research could be a starting point for distilling DTs from DRL agents to obtain transparent, lightweight models for real-world reinforcement learning tasks.
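
The distillation loop the abstract describes can be sketched compactly: label visited states with a teacher policy, fit a tree, collect more states, and refit. Everything below is illustrative: the `oracle` is a hypothetical hand-coded stand-in for the DRL teacher on a pendulum-like state, and a single-feature threshold "stump" stands in for a full decision tree so the sketch needs only numpy.

```python
import numpy as np

def oracle(s):
    # hypothetical stand-in for the trained DRL teacher on a
    # pendulum-like state [angle, angular_velocity]
    return int(s[0] + 0.5 * s[1] > 0)

def fit_stump(X, y):
    # minimal stand-in for a decision tree: best single-feature threshold
    best, best_acc = (0, 0.0), -1.0
    for f in range(X.shape[1]):
        for t in X[:, f]:
            acc = ((X[:, f] > t).astype(int) == y).mean()
            if acc > best_acc:
                best_acc, best = acc, (f, float(t))
    f, t = best
    return lambda s: int(s[f] > t)

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(200, 2))            # states labeled by the teacher
tree = fit_stump(X, np.array([oracle(s) for s in X]))

# one iteration of the loop: add a fresh batch of states and retrain on the union
X = np.vstack([X, rng.uniform(-0.3, 0.3, size=(200, 2))])
tree = fit_stump(X, np.array([oracle(s) for s in X]))

test_states = rng.uniform(-1, 1, size=(500, 2))
agreement = np.mean([tree(s) == oracle(s) for s in test_states])
```

A stump cannot match the teacher exactly, but the agreement it reaches illustrates how samples from the teacher's policy transfer into a far smaller, interpretable model.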

[AI-23] Flash Communication: Reducing Tensor Parallelization Bottleneck for Fast Large Language Model Inference

链接: https://arxiv.org/abs/2412.04964
作者: Qingyuan Li,Bo Zhang,Liang Ye,Yifan Zhang,Wei Wu,Yerui Sun,Lin Ma,Yuchen Xie
关键词-EN: exploit multi-dimensional parallelism, necessitate distributed solutions, GPU clusters, large language models, language models necessitate
类目: Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:The ever-increasing sizes of large language models necessitate distributed solutions for fast inference that exploit multi-dimensional parallelism, where computational loads are split across various accelerators such as GPU clusters. However, this approach often introduces significant communication overhead, especially on devices with limited bandwidth. In this paper, we introduce Flash Communication, a novel low-bit compression technique designed to alleviate the tensor-parallelism communication bottleneck during inference. Our method substantially boosts intra-node communication speed by more than 3x and reduces the time-to-first-token by 2x, with nearly no sacrifice in model accuracy. Extensive experiments on various up-to-date LLMs demonstrate the effectiveness of our approach.
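
The kind of low-bit compression the abstract refers to can be illustrated with symmetric int8 quantization of an activation shard before it is communicated; the roughly 4x payload reduction is generic, while Flash Communication's actual codec and its fusion with the all-reduce are more involved.

```python
import numpy as np

def quantize_int8(x):
    # symmetric per-tensor int8 quantization
    # (a generic sketch, not Flash Communication's exact codec)
    scale = float(np.abs(x).max()) / 127.0
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
x = rng.normal(size=4096).astype(np.float32)   # one shard of an activation tensor
q, scale = quantize_int8(x)
x_hat = dequantize(q, scale)

bytes_fp32 = x.nbytes           # what an fp32 all-reduce would send
bytes_int8 = q.nbytes + 4       # int8 payload plus one fp32 scale
max_err = float(np.abs(x - x_hat).max())
```

The reconstruction error is bounded by half the quantization step, which is why accuracy loss stays small when activations are well-scaled.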

[AI-24] Bed-Attached Vibration Sensor System: A Machine Learning Approach for Fall Detection in Nursing Homes

链接: https://arxiv.org/abs/2412.04950
作者: Thomas Bartz-Beielstein,Axel Wellendorf,Noah Pütz,Jens Brandt,Alexander Hinterleitner,Richard Schulz,Richard Scholz,Olaf Mersmann,Robin Knabe
关键词-EN: nursing homes pose, homes pose significant, pose significant challenges, nursing staff, nursing homes
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:The increasing shortage of nursing staff and the acute risk of falls in nursing homes pose significant challenges for the healthcare system. This study presents the development of an automated fall detection system integrated into care beds, aimed at enhancing patient safety without compromising privacy through wearables or video monitoring. Mechanical vibrations transmitted through the bed frame are processed using a short-time Fourier transform, enabling robust classification of distinct human fall patterns with a convolutional neural network. Challenges pertaining to the quantity and diversity of the data are addressed, proposing the generation of additional data with a specific emphasis on enhancing variation. While the model shows promising results in distinguishing fall events from noise using lab data, further testing in real-world environments is recommended for validation and improvement. Despite limited available data, the proposed system shows the potential for an accurate and rapid response to falls, mitigating health implications, and addressing the needs of an aging population. This case study was performed as part of the ZIM Project. Further research on sensors enhanced by artificial intelligence will be continued in the ShapeFuture Project.
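
The signal path described above (bed-frame vibration, short-time Fourier transform, classifier) can be sketched with a hand-rolled STFT. A crude spectral-energy score stands in for the convolutional network, and the synthetic "impact" burst is purely illustrative.

```python
import numpy as np

def stft_mag(x, win=64, hop=32):
    # magnitude short-time Fourier transform with a Hann window
    w = np.hanning(win)
    frames = [x[i:i + win] * w for i in range(0, len(x) - win + 1, hop)]
    return np.abs(np.fft.rfft(np.array(frames), axis=1))

fs = 1000                                  # assumed sampling rate (Hz)
t = np.arange(fs) / fs
rng = np.random.default_rng(1)
noise = 0.05 * rng.normal(size=fs)         # ambient bed-frame vibration
impact = noise.copy()
impact[400:450] += np.sin(2 * np.pi * 120 * t[400:450])  # brief burst, as from a fall

S_noise = stft_mag(noise)
S_impact = stft_mag(impact)

# stand-in for the CNN: total spectral energy as a "fall score"
score_noise = float(S_noise.sum())
score_impact = float(S_impact.sum())
```

In the paper the spectrogram feeds a CNN that separates fall patterns from other vibrations; the point here is only that the time-frequency representation makes the transient event stand out.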

[AI-25] HyperGraphOS: A Meta Operating System for Science and Engineering

链接: https://arxiv.org/abs/2412.04923
作者: Antonello Ceravola,Frank Joublin,Ahmed R. Sadik,Bram Bolder,Juha-Pekka Tolvanen
关键词-EN: innovative Operating System, Operating System designed, Operating System, innovative Operating, System designed
类目: Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
*备注:

点击查看摘要

Abstract:This paper presents HyperGraphOS, an innovative Operating System designed for the scientific and engineering domains. It combines model based engineering, graph modeling, data containers, and computational tools, offering users a dynamic workspace for creating and managing complex models represented as customizable graphs. Using a web based architecture, HyperGraphOS requires only a modern browser to organize knowledge, documents, and content into interconnected models. Domain Specific Languages drive workspace navigation, code generation, AI integration, and process execution. The platform’s models function as both visual drawings and data structures, enabling dynamic modifications and inspection, both interactively and programmatically. HyperGraphOS was evaluated across various domains, including virtual avatars, robotic task planning using Large Language Models, and meta modeling for feature based code development. Results show significant improvements in flexibility, data management, computation, and document handling.

[AI-26] Hard Math – Easy UVM: Pragmatic solutions for verifying hardware algorithms using UVM

链接: https://arxiv.org/abs/2412.04919
作者: Mark Litterick,Aleksandar Ivankovic,Bojan Arsov,Aman Kumar
关键词-EN: paper presents pragmatic, verifying complex mathematical, complex mathematical algorithms, mathematical algorithms implemented, presents pragmatic solutions
类目: Artificial Intelligence (cs.AI); Hardware Architecture (cs.AR)
*备注: Published at DVCon Europe 2024

点击查看摘要

Abstract:This paper presents pragmatic solutions for verifying complex mathematical algorithms implemented in hardware in an efficient and effective manner. Maximizing leverage of a known-answer-test strategy, based on predefined data scenarios combined with design-for-verification modes, we demonstrate how to find and isolate concept and design bugs early in the flow. The solutions presented are based on real project experience with single chip radar sensors for a variety of applications. The verification environments supporting the presented strategies are based on SystemVerilog and the Universal Verification Methodology.

[AI-27] Automatic Tongue Delineation from MRI Images with a Convolutional Neural Network Approach

链接: https://arxiv.org/abs/2412.04893
作者: Karyna Isaieva(IADI),Yves Laprie(LORIA, MULTISPEECH),Nicolas Turpault(MULTISPEECH),Alexis Houssard(MULTISPEECH),Jacques Felblinger(IADI, CIC-IT),Pierre-André Vuissoz(IADI)
关键词-EN: nontrivial task due, nontrivial task, task due, presence of artifacts, artifacts manifesting
类目: Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Tongue contour extraction from real-time magnetic resonance images is a nontrivial task due to the presence of artifacts manifesting in form of blurring or ghostly contours. In this work, we present results of automatic tongue delineation achieved by means of U-Net auto-encoder convolutional neural network. We present both intra- and inter-subject validation. We used real-time magnetic resonance images and manually annotated 1-pixel wide contours as inputs. Predicted probability maps were post-processed in order to obtain 1-pixel wide tongue contours. The results are very good and slightly outperform published results on automatic tongue segmentation.
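
The post-processing step mentioned above, turning a U-Net probability map into a 1-pixel-wide contour, can be sketched with a per-column maximum. This is an illustrative rule only; the paper's exact post-processing may differ.

```python
import numpy as np

def contour_from_probmap(p, thresh=0.5):
    # collapse a per-pixel probability map to a 1-pixel-wide contour:
    # for each image column, keep the row with maximal probability,
    # dropping low-confidence columns
    rows = p.argmax(axis=0)
    keep = p.max(axis=0) >= thresh
    return [(int(r), c) for c, (r, k) in enumerate(zip(rows, keep)) if k]

p = np.zeros((8, 5))                           # toy network output (rows x cols)
p[3, 0] = p[4, 1] = p[4, 2] = p[5, 3] = 0.9    # a blurry ridge along the tongue
p[5, 4] = 0.2                                  # low-confidence column, dropped
contour = contour_from_probmap(p)
```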

[AI-28] VTD: Visual and Tactile Database for Driver State and Behavior Perception

链接: https://arxiv.org/abs/2412.04888
作者: Jie Wang,Mobing Cai,Zhongpan Zhu,Hongjun Ding,Jiwei Yi,Aimin Du
关键词-EN: significant research attention, garnered significant research, human-vehicle co-pilot system, autonomous vehicles, research attention
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:In the domain of autonomous vehicles, the human-vehicle co-pilot system has garnered significant research attention. To address the subjective uncertainties in driver state and interaction behaviors, which are pivotal to the safety of Human-in-the-loop co-driving systems, we introduce a novel visual-tactile perception method. Utilizing a driving simulation platform, a comprehensive dataset has been developed that encompasses multi-modal data under fatigue and distraction conditions. The experimental setup integrates driving simulation with signal acquisition, yielding 600 minutes of fatigue detection data from 15 subjects and 102 takeover experiments with 17 drivers. The dataset, synchronized across modalities, serves as a robust resource for advancing cross-modal driver behavior perception algorithms.

[AI-29] NebulaFL: Effective Asynchronous Federated Learning for JointCloud Computing

链接: https://arxiv.org/abs/2412.04868
作者: Fei Gao,Ming Hu,Zhiyu Xie,Peichang Shi,Xiaofei Xie,Guodong Yi,Huaimin Wang
关键词-EN: Trusted Execution Environment, traditional Federated Learning, Execution Environment, Trusted Execution, Federated Learning
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI); Networking and Internet Architecture (cs.NI)
*备注:

点击查看摘要

Abstract:With advancements in AI infrastructure and Trusted Execution Environment (TEE) technology, Federated Learning as a Service (FLaaS) through JointCloud Computing (JCC) is promising to break through the resource constraints caused by heterogeneous edge devices in the traditional Federated Learning (FL) paradigm. Specifically, with the protection from TEE, data owners can achieve efficient model training with high-performance AI services in the cloud. By providing additional FL services, cloud service providers can achieve collaborative learning among data owners. However, FLaaS still faces three challenges, i.e., i) low training performance caused by heterogeneous data among data owners, ii) high communication overhead among different clouds (i.e., data centers), and iii) lack of efficient resource scheduling strategies to balance training time and cost. To address these challenges, this paper presents a novel asynchronous FL approach named NebulaFL for collaborative model training among multiple clouds. To address data heterogeneity issues, NebulaFL adopts a version control-based asynchronous FL training scheme in each data center to balance training time among data owners. To reduce communication overhead, NebulaFL adopts a decentralized model rotation mechanism to achieve effective knowledge sharing among data centers. To balance training time and cost, NebulaFL integrates a reward-guided strategy for data owners selection and resource scheduling. The experimental results demonstrate that, compared to the state-of-the-art FL methods, NebulaFL can achieve up to 5.71% accuracy improvement. In addition, NebulaFL can reduce up to 50% communication overhead and 61.94% costs under a target accuracy.
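
The version-control-based asynchronous training idea can be illustrated with a staleness-discounted merge rule: an update computed against an old global version contributes with a smaller weight. This is a generic sketch, not NebulaFL's exact scheme.

```python
import numpy as np

def async_aggregate(global_w, update_w, global_version, update_version, base_lr=0.5):
    # merge an asynchronous client update into the global model,
    # discounting it by how many versions behind it is
    staleness = max(0, global_version - update_version)
    alpha = base_lr / (1.0 + staleness)
    return (1.0 - alpha) * global_w + alpha * update_w

g = np.zeros(3)                                   # current global weights
fresh = async_aggregate(g, np.ones(3), global_version=5, update_version=5)
stale = async_aggregate(g, np.ones(3), global_version=5, update_version=1)
```

A fresh update (staleness 0) moves the model by the full base rate, while a four-versions-old update is damped to a fifth of that, which is one simple way to keep stragglers from destabilizing training.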

[AI-30] Rethink Deep Learning with Invariance in Data Representation WWW2025

链接: https://arxiv.org/abs/2412.04858
作者: Shuren Qi,Fei Wang,Tieyong Zeng,Fenglei Fan
关键词-EN: Integrating invariance, deep learning, representations, intelligent systems, data representations
类目: Artificial Intelligence (cs.AI)
*备注: Accepted by WWW 2025 for a tutorial

点击查看摘要

Abstract:Integrating invariance into data representations is a principled design in intelligent systems and web applications. Representations play a fundamental role, where systems and applications are both built on meaningful representations of digital inputs (rather than the raw data). In fact, the proper design/learning of such representations relies on priors w.r.t. the task of interest. Here, the concept of symmetry from the Erlangen Program may be the most fruitful prior – informally, a symmetry of a system is a transformation that leaves a certain property of the system invariant. Symmetry priors are ubiquitous, e.g., translation as a symmetry of the object classification, where object category is invariant under translation. The quest for invariance is as old as pattern recognition and data mining itself. Invariant design has been the cornerstone of various representations in the era before deep learning, such as the SIFT. As we enter the early era of deep learning, the invariance principle is largely ignored and replaced by a data-driven paradigm, such as the CNN. However, this neglect did not last long before they encountered bottlenecks regarding robustness, interpretability, efficiency, and so on. The invariance principle has returned in the era of rethinking deep learning, forming a new field known as Geometric Deep Learning (GDL). In this tutorial, we will give a historical perspective of the invariance in data representations. More importantly, we will identify those research dilemmas, promising works, future directions, and web applications.
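
The symmetry prior this tutorial centers on can be made concrete in a few lines: a representation is invariant if transforming the input leaves it unchanged. Below the transformation is circular translation and the (toy) invariant representation is the sorted value multiset; raw pixels fail the test, the invariant feature passes.

```python
import numpy as np

x = np.array([0.0, 1.0, 3.0, 1.0, 0.0, 0.0, 0.0, 0.0])  # a 1-D "image"
shifted = np.roll(x, 3)                                   # translated copy

def invariant_repr(v):
    # sorted values: a toy translation-invariant representation
    return np.sort(v)

raw_equal = bool(np.array_equal(x, shifted))
inv_equal = bool(np.array_equal(invariant_repr(x), invariant_repr(shifted)))
```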

[AI-31] Neuro-Symbolic Data Generation for Math Reasoning NEURIPS2024

链接: https://arxiv.org/abs/2412.04857
作者: Zenan Li,Zhi Zhou,Yuan Yao,Yu-Feng Li,Chun Cao,Fan Yang,Xian Zhang,Xiaoxing Ma
关键词-EN: Large Language Models, Language Models, Large Language, question about Large, critical question
类目: Artificial Intelligence (cs.AI)
*备注: Published as a conference paper at NeurIPS 2024

点击查看摘要

Abstract:A critical question about Large Language Models (LLMs) is whether their apparent deficiency in mathematical reasoning is inherent, or merely a result of insufficient exposure to high-quality mathematical data. To explore this, we developed an automated method for generating high-quality, supervised mathematical datasets. The method carefully mutates existing math problems, ensuring both diversity and validity of the newly generated problems. This is achieved by a neuro-symbolic data generation framework combining the intuitive informalization strengths of LLMs, and the precise symbolic reasoning of math solvers along with projected Markov chain Monte Carlo sampling in the highly-irregular symbolic space. Empirical experiments demonstrate the high quality of data generated by the proposed method, and that the LLMs, specifically LLaMA-2 and Mistral, when realigned with the generated data, surpass their state-of-the-art counterparts.

[AI-32] MTSpark: Enabling Multi-Task Learning with Spiking Neural Networks for Generalist Agents

链接: https://arxiv.org/abs/2412.04847
作者: Avaneesh Devkota,Rachmad Vidya Wicaksana Putra,Muhammad Shafique
关键词-EN: catastrophic forgetting challenges, previously learned tasks, single-task settings, forgetting challenges, methods excel
类目: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: 9 pages, 10 figures, 5 tables

点击查看摘要

Abstract:Currently, state-of-the-art RL methods excel in single-task settings, but they still struggle to generalize across multiple tasks due to catastrophic forgetting challenges, where previously learned tasks are forgotten as new tasks are introduced. This multi-task learning capability is significantly important for generalist agents, where adaptation features are highly required (e.g., autonomous robots). On the other hand, Spiking Neural Networks (SNNs) have emerged as alternative energy-efficient neural network algorithms due to their sparse spike-based operations. Toward this, we propose MTSpark, a novel methodology to enable multi-task RL using spiking networks. Specifically, MTSpark develops a Deep Spiking Q-Network (DSQN) with active dendrites and dueling structure by leveraging task-specific context signals. In this design, each neuron computes task-dependent activations that dynamically modulate inputs, forming specialized sub-networks for each task. Moreover, this bioplausible network model also benefits from SNNs, enhancing energy efficiency and making the model suitable for hardware implementation. Experimental results show that our MTSpark effectively learns multiple tasks with higher performance compared to the state-of-the-art. Notably, MTSpark achieves high scores in three Atari games (i.e., Pong: -5.4, Breakout: 0.6, and Enduro: 371.2), reaching human-level performance (i.e., Pong: -3, Breakout: 31, and Enduro: 368), where the state-of-the-art struggles. In addition, our MTSpark also shows better accuracy in image classification tasks than the state-of-the-art. These results highlight the potential of our MTSpark methodology to develop generalist agents that can learn multiple tasks by leveraging both RL and SNN concepts.
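
The active-dendrite gating mechanism can be sketched as a context-modulated forward pass: each unit's feedforward response is scaled by its best-matching dendritic segment for the current task context. This is a generic, non-spiking sketch; MTSpark pairs the idea with a dueling deep spiking Q-network.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def dendritic_forward(x, W, D, context):
    # active-dendrite gating; shapes: W (units, inputs),
    # D (units, segments, context_dim), context (context_dim,)
    feed = W @ x
    gate = sigmoid((D @ context).max(axis=1))  # best-matching segment per unit
    return feed * gate

x = np.array([1.0, -1.0, 0.5, 2.0])
W = np.ones((3, 4))                 # toy feedforward weights
D = np.zeros((3, 2, 2))
D[:, 0, 0] = 2.0                    # segment 0 responds to task A...
D[:, 0, 1] = -2.0                   # ...and is suppressed by task B
task_a = np.array([1.0, 0.0])
task_b = np.array([0.0, 1.0])
out_a = dendritic_forward(x, W, D, task_a)
out_b = dendritic_forward(x, W, D, task_b)
```

Because different contexts open different gates, the same weights effectively form a specialized sub-network per task, which is what mitigates catastrophic forgetting.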

[AI-33] Xpath: Explaining Knowledge Graph Link Prediction with Ontological Closed Path Rules VLDB

链接: https://arxiv.org/abs/2412.04846
作者: Ye Sun,Lei Shi,Yongxin Tong
关键词-EN: Knowledge Graphs, Link prediction, crucial for Knowledge, completion but commonly, interpretability issues
类目: Artificial Intelligence (cs.AI); Databases (cs.DB); Information Retrieval (cs.IR); Machine Learning (cs.LG)
*备注: 13 pages, 5 figures. Submitted to PVLDB volumn 18 on 20241201

点击查看摘要

Abstract:Link prediction (LP) is crucial for Knowledge Graphs (KG) completion but commonly suffers from interpretability issues. While several methods have been proposed to explain embedding-based LP models, they are generally limited to local explanations on KG and are deficient in providing human interpretable semantics. Based on real-world observations of the characteristics of KGs from multiple domains, we propose to explain LP models in KG with path-based explanations. An integrated framework, namely eXpath, is introduced which incorporates the concept of relation path with ontological closed path rules to enhance both the efficiency and effectiveness of LP interpretation. Notably, the eXpath explanations can be fused with other single-link explanation approaches to achieve a better overall solution. Extensive experiments across benchmark datasets and LP models demonstrate that introducing eXpath can boost the quality of resulting explanations by about 20% on two key metrics and reduce the required explanation time by 61.4%, in comparison to the best existing method. Case studies further highlight eXpath’s ability to provide more semantically meaningful explanations through path-based evidence.
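
The raw material eXpath builds on, relation paths connecting the head and tail of a predicted link, can be enumerated with a plain breadth-first search over a triple store. The facts below are made up for illustration; the paper additionally filters paths with ontological closed-path rules.

```python
from collections import deque

triples = [  # toy knowledge graph (hypothetical facts)
    ("alice", "works_at", "acme"),
    ("acme", "located_in", "paris"),
    ("alice", "lives_in", "paris"),
    ("bob", "works_at", "acme"),
]

def relation_paths(head, tail, max_len=2):
    # enumerate relation paths head -> tail up to max_len hops
    out_edges = {}
    for h, r, t in triples:
        out_edges.setdefault(h, []).append((r, t))
    paths, queue = [], deque([(head, [])])
    while queue:
        node, rels = queue.popleft()
        if node == tail and rels:
            paths.append(tuple(rels))
            continue
        if len(rels) < max_len:
            for r, t in out_edges.get(node, []):
                queue.append((t, rels + [r]))
    return paths

paths = relation_paths("alice", "paris")
```

Here the predicted link (alice, lives_in, paris) is supported by the closed path works_at followed by located_in, exactly the kind of human-readable evidence a path-based explanation surfaces.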

[AI-34] Using Machine Learning to Discover Parsimonious and Physically-Interpretable Representations of Catchment-Scale Rainfall-Runoff Dynamics

链接: https://arxiv.org/abs/2412.04845
作者: Yuan-Heng Wang,Hoshin V. Gupta
关键词-EN: discard traditional physical-conceptual, scientists remain hesitant, modern machine learning, traditional physical-conceptual, approaches due
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: 73 Pages, 4 Tables, 13 Figures, 11 Tables and 11 Figures in Supplementary Materials

点击查看摘要

Abstract:Despite the excellent real-world predictive performance of modern machine learning (ML) methods, many scientists remain hesitant to discard traditional physical-conceptual (PC) approaches due mainly to their relative interpretability, which contributes to credibility during decision-making. In this context, a currently underexplored aspect of ML is how to develop minimally-optimal representations that can facilitate better insight regarding system functioning. Regardless of how this is achieved, it is arguably true that parsimonious representations better support the advancement of scientific understanding. Our own view is that ML-based modeling of geoscientific systems should be based in the use of computational units that are fundamentally interpretable by design. This paper continues our exploration of how the strengths of ML can be exploited in the service of better understanding via scientific investigation. Here, we use the Mass Conserving Perceptron (MCP) as the fundamental computational unit in a generic network architecture consisting of nodes arranged in series and parallel to explore several generic and important issues related to the use of observational data for constructing input-state-output models of dynamical systems. In the context of lumped catchment modeling, we show that physical interpretability and excellent predictive performance can both be achieved using a relatively parsimonious distributed-state multiple-flow-path network with context-dependent gating and information sharing across the nodes, suggesting that MCP-based modeling can play a significant role in application of ML to geoscientific investigation. 

[AI-35] WRF-GS: Wireless Radiation Field Reconstruction with 3D Gaussian Splatting

链接: https://arxiv.org/abs/2412.04832
作者: Chaozheng Wen,Jingwen Tong,Yingdong Hu,Zehong Lin,Jun Zhang
关键词-EN: wireless communication systems, optimizing wireless communication, role in designing, communication systems, channel modeling plays
类目: Networking and Internet Architecture (cs.NI); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: accepted to the IEEE International Conference on Computer Communications (INFOCOM 2025)

点击查看摘要

Abstract:Wireless channel modeling plays a pivotal role in designing, analyzing, and optimizing wireless communication systems. Nevertheless, developing an effective channel modeling approach has been a longstanding challenge. This issue has been escalated due to the denser network deployment, larger antenna arrays, and wider bandwidth in 5G and beyond networks. To address this challenge, we put forth WRF-GS, a novel framework for channel modeling based on wireless radiation field (WRF) reconstruction using 3D Gaussian splatting. WRF-GS employs 3D Gaussian primitives and neural networks to capture the interactions between the environment and radio signals, enabling efficient WRF reconstruction and visualization of the propagation characteristics. The reconstructed WRF can then be used to synthesize the spatial spectrum for comprehensive wireless channel characterization. Notably, with a small number of measurements, WRF-GS can synthesize new spatial spectra within milliseconds for a given scene, thereby enabling latency-sensitive applications. Experimental results demonstrate that WRF-GS outperforms existing methods for spatial spectrum synthesis, such as ray tracing and other deep-learning approaches. Moreover, WRF-GS achieves superior performance in the channel state information prediction task, surpassing existing methods by a significant margin of more than 2.43 dB.

[AI-36] Estimating the treatment effect over time under general interference through deep learner integrated TMLE

链接: https://arxiv.org/abs/2412.04799
作者: Suhan Guo,Furao Shen,Ni Li
关键词-EN: underlying social networks, Targeted Maximum Likelihood, Maximum Likelihood Estimation, causal inference methods, inference methods fail
类目: Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Understanding the effects of quarantine policies in populations with underlying social networks is crucial for public health, yet most causal inference methods fail here due to their assumption of independent individuals. We introduce DeepNetTMLE, a deep-learning-enhanced Targeted Maximum Likelihood Estimation (TMLE) method designed to estimate time-sensitive treatment effects in observational data. DeepNetTMLE mitigates bias from time-varying confounders under general interference by incorporating a temporal module and domain adversarial training to build intervention-invariant representations. This process removes associations between current treatments and historical variables, while the targeting step maintains the bias-variance trade-off, enhancing the reliability of counterfactual predictions. Using simulations of a "Susceptible-Infected-Recovered" model with varied quarantine coverages, we show that DeepNetTMLE achieves lower bias and more precise confidence intervals in counterfactual estimates, enabling optimal quarantine recommendations within budget constraints, surpassing state-of-the-art methods.
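
The simulation setting mentioned above can be sketched with a discrete well-mixed SIR model where quarantine removes a fraction of contacts. This is illustrative only: the paper simulates SIR on social networks and estimates effects with the deep-learning-enhanced TMLE, neither of which this sketch implements.

```python
def sir_quarantine(beta=0.3, gamma=0.1, coverage=0.0, days=160, n=1000, i0=10):
    # discrete SIR with quarantine coverage scaling down the contact rate
    s, i, r = float(n - i0), float(i0), 0.0
    peak = i
    for _ in range(days):
        new_i = beta * (1.0 - coverage) * s * i / n
        new_r = gamma * i
        s, i, r = s - new_i, i + new_i - new_r, r + new_r
        peak = max(peak, i)
    return peak

peak_none = sir_quarantine(coverage=0.0)   # no quarantine
peak_half = sir_quarantine(coverage=0.5)   # half the contacts removed
```

Varying `coverage` yields the dose-response-style counterfactuals the estimator is meant to recover from observational data.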

[AI-37] Multi-class heart disease Detection Classification and Prediction using Machine Learning Models

链接: https://arxiv.org/abs/2412.04792
作者: Mahfuzul Haque,Abu Saleh Musa Miah,Debashish Gupta,Md. Maruf Al Hossain Prince,Tanzina Alam,Nusrat Sharmin,Mohammed Sowket Ali,Jungpil Shin
关键词-EN: premature death worldwide, World Health Organization, heart disease detection, Heart disease, including heart disease
类目: Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Heart disease is a leading cause of premature death worldwide, particularly among middle-aged and older adults, with men experiencing a higher prevalence. According to the World Health Organization (WHO), non-communicable diseases, including heart disease, account for 25% (17.9 million) of global deaths, with over 43,204 annual fatalities in Bangladesh. However, the development of heart disease detection (HDD) systems tailored to the Bangladeshi population remains underexplored due to the lack of benchmark datasets and reliance on manual or limited-data approaches. This study addresses these challenges by introducing new, ethically sourced HDD datasets, BIG-Dataset and CD dataset, which incorporate comprehensive data on symptoms, examination techniques, and risk factors. Using advanced machine learning techniques, including Logistic Regression and Random Forest, we achieved a remarkable testing accuracy of up to 96.6% with Random Forest. The proposed AI-driven system integrates these models and datasets to provide real-time, accurate diagnostics and personalized healthcare recommendations. By leveraging structured datasets and state-of-the-art machine learning algorithms, this research offers an innovative solution for scalable and effective heart disease detection, with the potential to reduce mortality rates and improve clinical outcomes.

[AI-38] GUIDE: A Global Unified Inference Engine for Deploying Large Language Models in Heterogeneous Environments

链接: https://arxiv.org/abs/2412.04788
作者: Yanyu Chen,Ganhong Huang
关键词-EN: large language models, real-world scenarios remains, deploying large language, Efficiently deploying large, URL deploying large
类目: Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Efficiently deploying large language models (LLMs) in real-world scenarios remains a critical challenge, primarily due to hardware heterogeneity, inference framework limitations, and workload complexities. These challenges often lead to inefficiencies in memory utilization, latency, and throughput, hindering the effective deployment of LLMs, especially for non-experts. Through extensive experiments, we identify key performance bottlenecks, including sudden drops in memory utilization, latency fluctuations with varying batch sizes, and inefficiencies in multi-GPU configurations. These insights reveal a vast optimization space shaped by the intricate interplay of hardware, frameworks, and workload parameters. This underscores the need for a systematic approach to optimize LLM inference, motivating the design of our framework, GUIDE. GUIDE leverages dynamic modeling and simulation-based optimization to address these issues, achieving prediction errors between 25% and 55% for key metrics such as batch latency, TTFT, and decode throughput. By effectively bridging the gap between theoretical performance and practical deployment, our framework empowers practitioners, particularly non-specialists, to make data-driven decisions and unlock the full potential of LLMs in heterogeneous environments cheaply.

[AI-39] A Survey of Sustainability in Large Language Models : Applications Economics and Challenges

链接: https://arxiv.org/abs/2412.04782
作者: Aditi Singh,Nirmal Prakashbhai Patel,Abul Ehtesham,Saket Kumar,Tala Talaei Khoei
关键词-EN: Large Language Models, natural language understanding, transformed numerous domains, providing advanced capabilities, Large Language
类目: Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE)
*备注:

点击查看摘要

Abstract:Large Language Models (LLMs) have transformed numerous domains by providing advanced capabilities in natural language understanding, generation, and reasoning. Despite their groundbreaking applications across industries such as research, healthcare, and creative media, their rapid adoption raises critical concerns regarding sustainability. This survey paper comprehensively examines the environmental, economic, and computational challenges associated with LLMs, focusing on energy consumption, carbon emissions, and resource utilization in data centers. By synthesizing insights from existing literature, this work explores strategies such as resource-efficient training, sustainable deployment practices, and lifecycle assessments to mitigate the environmental impacts of LLMs. Key areas of emphasis include energy optimization, renewable energy integration, and balancing performance with sustainability. The findings aim to guide researchers, practitioners, and policymakers in developing actionable strategies for sustainable AI systems, fostering a responsible and environmentally conscious future for artificial intelligence.

[AI-40] A Temporally Correlated Latent Exploration for Reinforcement Learning

链接: https://arxiv.org/abs/2412.04775
作者: SuMin Oh,WanSoo Kim,HyunJin Kim
关键词-EN: Efficient exploration remains, deep reinforcement learning, Efficient exploration, temporal correlation, reinforcement learning
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Efficient exploration remains one of the longstanding problems of deep reinforcement learning. Instead of depending solely on extrinsic rewards from the environments, existing methods use intrinsic rewards to enhance exploration. However, we demonstrate that these methods are vulnerable to Noisy TV and stochasticity. To tackle this problem, we propose Temporally Correlated Latent Exploration (TeCLE), which is a novel intrinsic reward formulation that employs an action-conditioned latent space and temporal correlation. The action-conditioned latent space estimates the probability distribution of states, thereby avoiding the assignment of excessive intrinsic rewards to unpredictable states and effectively addressing both problems. Whereas previous works inject temporal correlation for action selection, the proposed method injects it for intrinsic reward computation. We find that the injected temporal correlation determines the exploratory behaviors of agents. Various experiments show that the environment where the agent performs well depends on the amount of temporal correlation. To the best of our knowledge, the proposed TeCLE is the first approach to consider the action conditioned latent space and temporal correlation for curiosity-driven exploration. We prove that the proposed TeCLE can be robust to the Noisy TV and stochasticity in benchmark environments, including Minigrid and Stochastic Atari.
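
The temporal correlation the method injects can be illustrated with an Ornstein-Uhlenbeck process, a standard way to generate temporally correlated (colored) noise; TeCLE's actual injection point is the intrinsic-reward computation in an action-conditioned latent space, which this sketch does not reproduce.

```python
import numpy as np

def ou_noise(n, theta=0.15, sigma=0.2, seed=0):
    # Ornstein-Uhlenbeck process: each step decays toward zero and adds
    # fresh Gaussian noise, producing temporally correlated samples
    rng = np.random.default_rng(seed)
    x = np.zeros(n)
    for t in range(1, n):
        x[t] = x[t - 1] - theta * x[t - 1] + sigma * rng.normal()
    return x

def lag1_autocorr(x):
    x = x - x.mean()
    return float((x[:-1] * x[1:]).sum() / (x * x).sum())

colored = ou_noise(5000)
white = np.random.default_rng(1).normal(size=5000)
ac_colored = lag1_autocorr(colored)
ac_white = lag1_autocorr(white)
```

The lag-1 autocorrelation (near 1 - theta for the OU process, near 0 for white noise) is the knob that, per the paper, shapes the agent's exploratory behavior.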

[AI-41] Short-term Streamflow and Flood Forecasting based on Graph Convolutional Recurrent Neural Network and Residual Error Learning

链接: https://arxiv.org/abs/2412.04764
作者: Xiyu Pan,Neda Mohammadi,John E. Taylor
关键词-EN: Accurate short-term streamflow, Accurate short-term, increasing climate variability, forecasting, river flood impacts
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Geophysics (physics.geo-ph)
*备注:

点击查看摘要

Abstract:Accurate short-term streamflow and flood forecasting are critical for mitigating river flood impacts, especially given the increasing climate variability. Machine learning-based streamflow forecasting relies on large streamflow datasets derived from rating curves. Uncertainties in rating curve modeling could introduce errors to the streamflow data and affect the forecasting accuracy. This study proposes a streamflow forecasting method that addresses these data errors, enhancing the accuracy of river flood forecasting and flood modeling, thereby reducing flood-related risk. A convolutional recurrent neural network is used to capture spatiotemporal patterns, coupled with residual error learning and forecasting. The neural network outperforms commonly used forecasting models over 1-6 hours of forecasting horizons, and the residual error learners can further correct the residual errors. This provides a more reliable tool for river flood forecasting and climate adaptation in this critical 1-6 hour time window for flood risk mitigation efforts.
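
The two-part idea, a forecaster plus a residual error learner that corrects systematic data errors, can be sketched on synthetic data. The toy below uses a naive persistence forecaster and a constant-bias residual learner as stand-ins for the paper's graph convolutional recurrent network; the data, bias model, and window sizes are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic "streamflow": trend plus a constant bias standing in for rating-curve error
t = np.arange(200, dtype=float)
true_flow = 10 + 0.05 * t
biased_obs = true_flow + 2.0 + rng.normal(0, 0.1, 200)

# Stage 1: base forecaster (naive persistence: predict next value = last observed value)
base_pred = np.roll(biased_obs, 1)[1:]
target = true_flow[1:]

# Stage 2: residual error learner — estimate the base model's systematic error
# on a training window, then subtract it from later forecasts
resid = base_pred - target
correction = resid[:100].mean()
corrected = base_pred[100:] - correction

mae_base = np.abs(base_pred[100:] - target[100:]).mean()
mae_corr = np.abs(corrected - target[100:]).mean()
print(mae_corr < mae_base)  # residual learning removes the systematic bias
```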

[AI-42] REGENT: A Retrieval-Augmented Generalist Agent That Can Act In-Context in New Environments NEURIPS2024

链接: https://arxiv.org/abs/2412.04759
作者: Kaustubh Sridhar,Souradeep Dutta,Dinesh Jayaraman,Insup Lee
关键词-EN: Building generalist agents, Building generalist, real worlds, challenge for deploying, digital and real
类目: Artificial Intelligence (cs.AI)
*备注: 30 pages, NeurIPS 2024 Workshops on Adaptive Foundation Models (AFM) and Open World Agents (OWA)

点击查看摘要

Abstract:Building generalist agents that can rapidly adapt to new environments is a key challenge for deploying AI in the digital and real worlds. Is scaling current agent architectures the most effective way to build generalist agents? We propose a novel approach to pre-train relatively small policies on relatively small datasets and adapt them to unseen environments via in-context learning, without any finetuning. Our key idea is that retrieval offers a powerful bias for fast adaptation. Indeed, we demonstrate that even a simple retrieval-based 1-nearest neighbor agent offers a surprisingly strong baseline for today’s state-of-the-art generalist agents. From this starting point, we construct a semi-parametric agent, REGENT, that trains a transformer-based policy on sequences of queries and retrieved neighbors. REGENT can generalize to unseen robotics and game-playing environments via retrieval augmentation and in-context learning, achieving this with up to 3x fewer parameters and up to an order-of-magnitude fewer pre-training datapoints, significantly outperforming today’s state-of-the-art generalist agents. Website: this https URL
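
The surprisingly strong retrieval-only baseline mentioned above can be sketched in a few lines: store demonstration (state, action) pairs and, at test time, copy the action of the nearest stored state. The class name and toy buffer below are illustrative assumptions, not REGENT's implementation.

```python
import numpy as np

class OneNearestNeighborAgent:
    """Retrieval-only baseline: act by copying the action of the most
    similar demonstration state (a sketch of the 1-NN agent idea)."""

    def __init__(self, demo_states, demo_actions):
        self.states = np.asarray(demo_states, dtype=float)
        self.actions = list(demo_actions)

    def act(self, obs):
        dists = np.linalg.norm(self.states - np.asarray(obs, dtype=float), axis=1)
        return self.actions[int(np.argmin(dists))]

# Tiny demonstration buffer mapping states to actions
agent = OneNearestNeighborAgent(
    demo_states=[[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]],
    demo_actions=["left", "up", "right"],
)
print(agent.act([0.9, 1.2]))  # nearest demo is [1, 1] -> "up"
```

REGENT goes further by feeding the retrieved neighbors into a transformer policy, so that retrieval provides context for in-context adaptation rather than the action itself.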

[AI-43] Measuring Goal-Directedness NEURIPS2024

链接: https://arxiv.org/abs/2412.04758
作者: Matt MacDermott,James Fox,Francesco Belardinelli,Tom Everitt
关键词-EN: Markov decision processes, models and Markov, Markov decision, decision processes, define maximum entropy
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: Accepted to the 38th Conference on Neural Information Processing Systems (NeurIPS 2024)

点击查看摘要

Abstract:We define maximum entropy goal-directedness (MEG), a formal measure of goal-directedness in causal models and Markov decision processes, and give algorithms for computing it. Measuring goal-directedness is important, as it is a critical element of many concerns about harm from AI. It is also of philosophical interest, as goal-directedness is a key aspect of agency. MEG is based on an adaptation of the maximum causal entropy framework used in inverse reinforcement learning. It can measure goal-directedness with respect to a known utility function, a hypothesis class of utility functions, or a set of random variables. We prove that MEG satisfies several desiderata and demonstrate our algorithms with small-scale experiments.

[AI-44] TelOps: AI-driven Operations and Maintenance for Telecommunication Networks

链接: https://arxiv.org/abs/2412.04731
作者: Yuqian Yang,Shusen Yang,Cong Zhao,Zongben Xu
关键词-EN: Telecommunication Networks, important infrastructure, Telecommunication, TNs, Abstract
类目: Artificial Intelligence (cs.AI)
*备注: 7 pages, 4 figures, magazine

点击查看摘要

Abstract:Telecommunication Networks (TNs) have become the most important infrastructure for data communications over the last century. Operations and maintenance (OM) is extremely important to ensure the availability, effectiveness, and efficiency of TN communications. Different from the popular OM technique for IT systems (e.g., the cloud), artificial intelligence for IT Operations (AIOps), OM for TNs meets the following three fundamental challenges: topological dependence of network components, highly heterogeneous software, and restricted failure data. This article presents TelOps, the first AI-driven OM framework for TNs, systematically enhanced with mechanism, data, and empirical knowledge. We provide a comprehensive comparison between TelOps and AIOps, and conduct a proof-of-concept case study on a typical OM task (failure diagnosis) for a real industrial TN. As the first systematic AI-driven OM framework for TNs, TelOps opens a new door to applying AI techniques to TN automation.

[AI-45] Adaptive Optimization for Enhanced Efficiency in Large-Scale Language Model Training

链接: https://arxiv.org/abs/2412.04718
作者: Jiajing Chen,Bingying Liu,Xiaoxuan Liao,Jia Gao,Hongye Zheng,Yue Li
关键词-EN: language processing technology, natural language processing, processing technology, rapid development, development of natural
类目: Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:With the rapid development of natural language processing technology, large-scale language models (LLMs) have achieved remarkable results on a variety of tasks. However, how to effectively train these huge models and improve their performance and computational efficiency remains an important challenge. This paper proposes an improved method based on an adaptive optimization algorithm, aiming to improve the training efficiency and final performance of LLMs. Comparative experiments on the SQuAD and GLUE datasets show that, compared with traditional optimization algorithms (such as SGD, Momentum, AdaGrad, RMSProp and Adam), the proposed adaptive optimization algorithm achieves significant improvements in both accuracy and F1 score, and in particular shows stronger training capability when processing large-scale texts and complex tasks. The results verify the advantages of adaptive optimization algorithms in large-scale language model training and provide new ideas and directions for future optimization methods.
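
The abstract does not specify the proposed algorithm's update rule, so as background, here is a minimal sketch of the canonical Adam baseline it compares against: per-parameter adaptive learning rates from bias-corrected first and second moment estimates. Hyperparameter values are the usual defaults, assumed for illustration.

```python
import numpy as np

def adam_step(param, grad, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    """One Adam update with bias-corrected moment estimates."""
    m = b1 * m + (1 - b1) * grad            # first moment (momentum)
    v = b2 * v + (1 - b2) * grad ** 2       # second moment (per-parameter scale)
    m_hat = m / (1 - b1 ** t)               # bias correction
    v_hat = v / (1 - b2 ** t)
    param = param - lr * m_hat / (np.sqrt(v_hat) + eps)
    return param, m, v

# Minimize f(x) = x^2 starting from x = 1.0
x, m, v = np.array(1.0), 0.0, 0.0
for t in range(1, 2001):
    x, m, v = adam_step(x, 2 * x, m, v, t, lr=0.01)
print(round(float(x), 3))  # converges toward the minimum at 0
```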

[AI-46] On Interpreting the Effectiveness of Unsupervised Software Traceability with Information Theory

链接: https://arxiv.org/abs/2412.04704
作者: David N. Palacio,Daniel Rodriguez-Cardenas,Denys Poshyvanyk,Kevin Moran
关键词-EN: modern software development, facilitating software maintenance, ensuring system reliability, software development, software maintenance
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Traceability is a cornerstone of modern software development, ensuring system reliability and facilitating software maintenance. While unsupervised techniques leveraging Information Retrieval (IR) and Machine Learning (ML) methods have been widely used for predicting trace links, their effectiveness remains underexplored. In particular, these techniques often assume traceability patterns are present within textual data - a premise that may not hold universally. Moreover, standard evaluation metrics such as precision, recall, accuracy, or F1 measure can misrepresent the model performance when underlying data distributions are not properly analyzed. Given that automated traceability techniques tend to struggle to establish links, we need further insight into the information limits related to traceability artifacts. In this paper, we propose an approach, TraceXplainer, for using information theory metrics to evaluate and better understand the performance (limits) of unsupervised traceability techniques. Specifically, we introduce self-information, cross-entropy, and mutual information (MI) as metrics to measure the informativeness and reliability of traceability links. Through a comprehensive replication and analysis of well-studied datasets and techniques, we investigate the effectiveness of unsupervised techniques that predict traceability links using IR/ML. This application of TraceXplainer illustrates an imbalance in typical traceability datasets where the source code has on average 1.48 more information bits (i.e., entropy) than the linked documentation. Additionally, we demonstrate that an average MI of 4.81 bits, loss of 1.75, and noise of 0.28 bits signify that there are information-theoretic limits on the effectiveness of unsupervised traceability techniques. We hope these findings spur additional research on understanding the limits and progress of traceability research.
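
The information-theoretic quantities the paper builds on (entropy of an artifact's token distribution, and mutual information between linked artifacts) are straightforward to compute from empirical token counts. The toy tokens below are illustrative assumptions, not TraceXplainer's pipeline.

```python
import math
from collections import Counter

def entropy(tokens):
    """Shannon entropy (bits) of a token sequence's empirical distribution."""
    counts = Counter(tokens)
    n = len(tokens)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def mutual_information(xs, ys):
    """MI (bits) between two aligned sequences: H(X) + H(Y) - H(X, Y)."""
    return entropy(xs) + entropy(ys) - entropy(list(zip(xs, ys)))

code_tokens = "def load config from file and parse config".split()
print(round(entropy(code_tokens), 2))  # → 2.75 bits for this 8-token artifact

# Perfectly dependent sequences share all their information:
print(mutual_information(list("abab"), list("cdcd")))  # → 1.0 bit
```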

[AI-47] Smoothie: Label Free Language Model Routing

链接: https://arxiv.org/abs/2412.04692
作者: Neel Guha,Mayee F. Chen,Trevor Chow,Ishan S. Khare,Christopher Ré
关键词-EN: Large language models, Large language, LLM, Large, Smoothie
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: 24 pages, 8 figures, 11 tables

点击查看摘要

Abstract:Large language models (LLMs) are increasingly used in applications where LLM inputs may span many different tasks. Recent work has found that the choice of LLM is consequential, and different LLMs may be good for different input samples. Prior approaches have thus explored how engineers might select an LLM to use for each sample (i.e. routing). While existing routing methods mostly require training auxiliary models on human-annotated data, our work explores whether it is possible to perform unsupervised routing. We propose Smoothie, a weak supervision-inspired routing approach that requires no labeled data. Given a set of outputs from different LLMs, Smoothie constructs a latent variable graphical model over embedding representations of observable LLM outputs and unknown “true” outputs. Using this graphical model, we estimate sample-dependent quality scores for each LLM, and route each sample to the LLM with the highest corresponding score. We find that Smoothie’s LLM quality-scores correlate with ground-truth model quality (correctly identifying the optimal model on 9/14 tasks), and that Smoothie outperforms baselines for routing by up to 10 points accuracy.
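
A crude intuition for label-free routing, though not Smoothie's actual latent-variable graphical model, is that outputs agreeing with the consensus of other models tend to be higher quality. The stand-in below scores each LLM by its output embedding's closeness to the mean embedding; all names and vectors are illustrative assumptions.

```python
import numpy as np

def route_by_consensus(output_embeddings):
    """For one input: score each candidate LLM by how close its output
    embedding lies to the consensus (mean) embedding, and pick the best.
    A simplified stand-in for Smoothie's sample-dependent quality scores."""
    E = np.asarray(output_embeddings, dtype=float)
    consensus = E.mean(axis=0)
    scores = -np.linalg.norm(E - consensus, axis=1)  # higher = closer to consensus
    return int(np.argmax(scores)), scores

# Three LLMs answered the same prompt; two roughly agree, one is an outlier.
emb = [[1.0, 0.0], [0.9, 0.1], [-1.0, 0.5]]
best, scores = route_by_consensus(emb)
print(best)  # picks an agreeing model, never the outlier (index 2)
```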

[AI-48] From Principles to Practice: A Deep Dive into AI Ethics and Regulations

链接: https://arxiv.org/abs/2412.04683
作者: Nan Sun,Yuantian Miao,Hao Jiang,Ming Ding,Jun Zhang
关键词-EN: rapidly evolving domain, Artificial Intelligence, domain of Artificial, rapidly evolving, evolving domain
类目: Artificial Intelligence (cs.AI)
*备注: Submitted to Artificial Intelligence Review

点击查看摘要

Abstract:In the rapidly evolving domain of Artificial Intelligence (AI), the complex interaction between innovation and regulation has become an emerging focus of our society. Despite tremendous advancements in AI’s capabilities to excel in specific tasks and contribute to diverse sectors, establishing a high degree of trust in AI-generated outputs and decisions necessitates meticulous caution and continuous oversight. A broad spectrum of stakeholders, including governmental bodies, private sector corporations, academic institutions, and individuals, have launched significant initiatives. These efforts include developing ethical guidelines for AI and engaging in vibrant discussions on AI ethics, both among AI practitioners and within the broader society. This article thoroughly analyzes the ground-breaking AI regulatory framework proposed by the European Union. It delves into the fundamental ethical principles of safety, transparency, non-discrimination, traceability, and environmental sustainability for AI developments and deployments. Considering the technical efforts and strategies undertaken by academics and industry to uphold these principles, we explore the synergies and conflicts among the five ethical principles. Through this lens, work presents a forward-looking perspective on the future of AI regulations, advocating for a harmonized approach that safeguards societal values while encouraging technological advancement.

[AI-49] Two stages domain invariant representation learners solve the large co-variate shift in unsupervised domain adaptation with two dimensional data domains

链接: https://arxiv.org/abs/2412.04682
作者: Hisashi Oshima,Tsuyoshi Ishizone,Tomoyuki Higuchi
关键词-EN: accelerate real world, real world applications, Recent developments, image recognition tasks, unsupervised domain adaptation
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Recent developments in unsupervised domain adaptation (UDA) enable unsupervised machine learning (ML) prediction for target data, which will accelerate real-world applications of ML models such as image recognition for self-driving. Researchers have reported that UDA techniques do not work well under large co-variate shift, where, for example, the supervised source data consists of monochrome handwritten digits while the unsupervised target data consists of colored digits from street views. A method is therefore needed that resolves the co-variate shift and transfers the source labelling rules under these dynamics. We perform two-stage domain-invariant representation learning to bridge the gap between source and target via semantic intermediate data (unsupervised). The proposed method simultaneously learns domain-invariant features between source and intermediate, and between intermediate and target. This finally achieves a good domain-invariant representation between source and target, plus task discriminability owing to the source labels; this inductive bias greatly eases convergence of the gradient-descent search in terms of classification performance on target data, even under large co-variate shift. We also derive a theorem for measuring the gap between trained models and the unsupervised target labelling rules, which is necessary for optimizing the free parameters. Finally, we demonstrate that the proposed method is superior to previous UDA methods on 4 representative ML classification datasets comprising 38 UDA tasks. Our experiments will serve as a basis for challenging UDA problems with large co-variate shift.

[AI-50] Zephyr quantum-assisted hierarchical Calo4pQVAE for particle-calorimeter interactions NEURIPS

链接: https://arxiv.org/abs/2412.04677
作者: Ian Lu,Hao Jia,Sebastian Gonzalez,Deniz Sogutlu,J. Quetzalcoatl Toledo-Marin,Sehmimul Hoque,Abhishek Abhishek,Colin Gay,Roger Melko,Eric Paquet,Geoffrey Fox,Maximilian Swiatlowski,Wojciech Fedorko
关键词-EN: Large Hadron Collider, High Luminosity Large, Luminosity Large Hadron, Hadron Collider, High Luminosity
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); High Energy Physics - Phenomenology (hep-ph); Computational Physics (physics.comp-ph); Quantum Physics (quant-ph)
*备注: Neurips ML4PS 2024. 5 Figs, 8 pp

点击查看摘要

Abstract:With the approach of the High Luminosity Large Hadron Collider (HL-LHC) era set to begin particle collisions by the end of this decade, it is evident that the computational demands of traditional collision simulation methods are becoming increasingly unsustainable. Existing approaches, which rely heavily on first-principles Monte Carlo simulations for modeling event showers in calorimeters, are projected to require millions of CPU-years annually – far exceeding current computational capacities. This bottleneck presents an exciting opportunity for advancements in computational physics by integrating deep generative models with quantum simulations. We propose a quantum-assisted hierarchical deep generative surrogate founded on a variational autoencoder (VAE) in combination with an energy conditioned restricted Boltzmann machine (RBM) embedded in the model’s latent space as a prior. By mapping the topology of D-Wave’s Zephyr quantum annealer (QA) into the nodes and couplings of a 4-partite RBM, we leverage quantum simulation to accelerate our shower generation times significantly. To evaluate our framework, we use Dataset 2 of the CaloChallenge 2022. Through the integration of classical computation and quantum simulation, this hybrid framework paves the way for utilizing large-scale quantum simulations as priors in deep generative models.

[AI-51] Soft Tensor Product Representations for Fully Continuous Compositional Visual Representations NEURIPS2024

链接: https://arxiv.org/abs/2412.04671
作者: Bethia Sun,Maurice Pagnucco,Yang Song
关键词-EN: systematically combine symbol-like, combine symbol-like entities, Soft TPR, Soft TPR Autoencoder, human intelligence
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: Accepted to Neurips 2024. 10 pages + supplementary

点击查看摘要

Abstract:Since the inception of the classicalist vs. connectionist debate, it has been argued that the ability to systematically combine symbol-like entities into compositional representations is crucial for human intelligence. In connectionist systems, the field of disentanglement has emerged to address this need by producing representations with explicitly separated factors of variation (FoV). By treating the overall representation as a string-like concatenation of the inferred FoVs, however, disentanglement provides a fundamentally symbolic treatment of compositional structure, one inherently at odds with the underlying continuity of deep learning vector spaces. We hypothesise that this symbolic-continuous mismatch produces broadly suboptimal performance in deep learning models that learn or use such representations. To fully align compositional representations with continuous vector spaces, we extend Smolensky’s Tensor Product Representation (TPR) and propose a new type of inherently continuous compositional representation, Soft TPR, along with a theoretically-principled architecture, Soft TPR Autoencoder, designed specifically for learning Soft TPRs. In the visual representation learning domain, our Soft TPR confers broad benefits over symbolic compositional representations: state-of-the-art disentanglement and improved representation learner convergence, along with enhanced sample efficiency and superior low-sample regime performance for downstream models, empirically affirming the value of our inherently continuous compositional representation learning framework.
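
Smolensky's classical TPR, which Soft TPR extends, binds each filler vector to a role vector with an outer product and superposes the bindings into a single tensor; with orthonormal roles, a matrix-vector product recovers each filler. The tiny vectors below are illustrative assumptions.

```python
import numpy as np

# Orthonormal role vectors and arbitrary filler vectors
roles = {"subj": np.array([1.0, 0.0]), "obj": np.array([0.0, 1.0])}
fillers = {"cat": np.array([1.0, 2.0]), "mouse": np.array([3.0, 4.0])}

# Bind "cat" to the subject role and "mouse" to the object role,
# then superpose the bindings into one compositional representation
tpr = np.outer(fillers["cat"], roles["subj"]) + np.outer(fillers["mouse"], roles["obj"])

# Unbinding: because the roles are orthonormal, tpr @ role recovers the filler
recovered_subject = tpr @ roles["subj"]
print(recovered_subject)  # [1. 2.] -> "cat"
```

Soft TPR relaxes this discrete binding scheme so the compositional structure itself lives in a continuous space; the snippet only shows the symbolic TPR starting point.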

[AI-52] HEAL: Hierarchical Embedding Alignment Loss for Improved Retrieval and Representation Learning

链接: https://arxiv.org/abs/2412.04661
作者: Manish Bhattarai,Ryan Barron,Maksim Eren,Minh Vu,Vesselin Grantcharov,Ismael Boureima,Valentin Stanev,Cynthia Matuszek,Vladimir Valtchinov,Kim Rasmussen,Boian Alexandrov
关键词-EN: Large Language Models, Retrieval-Augmented Generation, enhances Large Language, Large Language, integrating external document
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Retrieval-Augmented Generation (RAG) enhances Large Language Models (LLMs) by integrating external document retrieval to provide domain-specific or up-to-date knowledge. The effectiveness of RAG depends on the relevance of retrieved documents, which is influenced by the semantic alignment of embeddings with the domain’s specialized content. Although full fine-tuning can align language models to specific domains, it is computationally intensive and demands substantial data. This paper introduces Hierarchical Embedding Alignment Loss (HEAL), a novel method that leverages hierarchical fuzzy clustering with matrix factorization within contrastive learning to efficiently align LLM embeddings with domain-specific content. HEAL computes level/depth-wise contrastive losses and incorporates hierarchical penalties to align embeddings with the underlying relationships in label hierarchies. This approach enhances retrieval relevance and document classification, effectively reducing hallucinations in LLM outputs. In our experiments, we benchmark and evaluate HEAL across diverse domains, including Healthcare, Material Science, Cyber-security, and Applied Maths.

[AI-53] From Models to Systems: A Comprehensive Fairness Framework for Compositional Recommender Systems

链接: https://arxiv.org/abs/2412.04655
作者: Brian Hsu,Cyrus DiCiccio,Natesh Sivasubramoniapillai,Hongseok Namkoong
关键词-EN: ensuring equitable performance, research in machine, machine learning, learning often centers, centers on ensuring
类目: Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Fairness research in machine learning often centers on ensuring equitable performance of individual models. However, real-world recommendation systems are built on multiple models and even multiple stages, from candidate retrieval to scoring and serving, which raises challenges for responsible development and deployment. This system-level view, as highlighted by regulations like the EU AI Act, necessitates moving beyond auditing individual models as independent entities. We propose a holistic framework for modeling system-level fairness, focusing on the end-utility delivered to diverse user groups, and consider interactions between components such as retrieval and scoring models. We provide formal insights on the limitations of focusing solely on model-level fairness and highlight the need for alternative tools that account for heterogeneity in user preferences. To mitigate system-level disparities, we adapt closed-box optimization tools (e.g., BayesOpt) to jointly optimize utility and equity. We empirically demonstrate the effectiveness of our proposed framework on synthetic and real datasets, underscoring the need for a system-level framework.

[AI-54] REL: Working out is all you need

链接: https://arxiv.org/abs/2412.04645
作者: Toby Simonds,Jey Han Lau,Chaithanya Bandi
关键词-EN: Large Language Models, Recent developments, potential of Large, complex reasoning tasks, Large Language
类目: Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Recent developments, particularly OpenAI’s O1 model, have demonstrated the remarkable potential of Large Language Models (LLMs) for complex reasoning tasks. Through analysis of O1’s outputs and provided sample Chain-of-Thought (CoT) demonstrations, we observe that it approaches problem-solving in a distinctly human-like manner, systematically brainstorming ideas, testing hypotheses, verifying results, and planning comprehensive solutions. These sophisticated reasoning capabilities remain notably absent in other state-of-the-art language models. In this paper, we hypothesize that this performance gap stems from the limited availability of high-quality reasoning process data in current training sets. We demonstrate that by constructing a specialized dataset focused on explicit problem-solving workflows (“worked solutions”), we can elicit substantially improved planning capabilities from existing models. Additionally, we propose the Reasoning Enhancement Loop (REL), a method for generating synthetic worked solutions.

[AI-55] Improving LLM Group Fairness on Tabular Data via In-Context Learning

链接: https://arxiv.org/abs/2412.04642
作者: Valeriia Cherepanova,Chia-Jung Lee,Nil-Jana Akpinar,Riccardo Fogliato,Martin Andres Bertran,Michael Kearns,James Zou
关键词-EN: Large language models, Large language, low-data regime, leveraging their internal, internal knowledge
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Large language models (LLMs) have been shown to be effective on tabular prediction tasks in the low-data regime, leveraging their internal knowledge and ability to learn from instructions and examples. However, LLMs can fail to generate predictions that satisfy group fairness, that is, produce equitable outcomes across groups. Critically, conventional debiasing approaches for natural language tasks do not directly translate to mitigating group unfairness in tabular settings. In this work, we systematically investigate four empirical approaches to improve group fairness of LLM predictions on tabular datasets, including fair prompt optimization, soft prompt tuning, strategic selection of few-shot examples, and self-refining predictions via chain-of-thought reasoning. Through experiments on four tabular datasets using both open-source and proprietary LLMs, we show the effectiveness of these methods in enhancing demographic parity while maintaining high overall performance. Our analysis provides actionable insights for practitioners in selecting the most suitable approach based on their specific requirements and constraints.
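
Demographic parity, the group-fairness criterion the paper targets, compares positive-prediction rates across groups; the gap should be near zero. A minimal two-group check (toy data assumed for illustration):

```python
def demographic_parity_difference(preds, groups):
    """Absolute gap in positive-prediction rates between two groups
    for binary predictions; 0 means parity."""
    rates = {}
    for g in set(groups):
        members = [p for p, gg in zip(preds, groups) if gg == g]
        rates[g] = sum(members) / len(members)
    a, b = sorted(rates)  # assumes exactly two groups
    return abs(rates[a] - rates[b])

preds  = [1, 1, 0, 1, 0, 0, 0, 1]
groups = ["A", "A", "A", "A", "B", "B", "B", "B"]
print(demographic_parity_difference(preds, groups))  # |3/4 - 1/4| = 0.5
```

The four approaches studied in the paper (prompt optimization, soft prompt tuning, few-shot example selection, chain-of-thought self-refinement) would all aim to drive this gap down without degrading overall accuracy.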

[AI-56] Disentangled Representation Learning for Causal Inference with Instruments

链接: https://arxiv.org/abs/2412.04641
作者: Debo Cheng(1),Jiuyong Li(1),Lin Liu(1),Ziqi Xu(2),Weijia Zhang(3),Jixue Liu(1),Thuc Duy Le(1) ((1) UniSA STEM, University of South Australia, (2) School of Computing Technologies, RMIT University, and (3) School of Information and Physical Sciences, University of Newcastle)
关键词-EN: inferring causal effects, fundamental challenge, Latent confounders, effects from observational, inferring causal
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
*备注: 14 pages, 13 figures and 5 tables. Accepted by TNNLS

点击查看摘要

Abstract:Latent confounders are a fundamental challenge for inferring causal effects from observational data. The instrumental variable (IV) approach is a practical way to address this challenge. Existing IV based estimators need a known IV or other strong assumptions, such as the existence of two or more IVs in the system, which limits the application of the IV approach. In this paper, we consider a relaxed requirement, which assumes there is an IV proxy in the system without knowing which variable is the proxy. We propose a Variational AutoEncoder (VAE) based disentangled representation learning method to learn an IV representation from a dataset with latent confounders and then utilise the IV representation to obtain an unbiased estimation of the causal effect from the data. Extensive experiments on synthetic and real-world data have demonstrated that the proposed algorithm outperforms the existing IV based estimators and VAE-based estimators.
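
The classical IV estimator that such methods build on can be illustrated with two-stage least squares on synthetic data: a latent confounder biases ordinary regression, while regressing through the instrument recovers the true causal effect. The data-generating coefficients below are assumptions for illustration; the paper's contribution is learning the IV representation itself with a VAE, not this estimator.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50_000
u = rng.normal(size=n)                        # latent confounder
z = rng.normal(size=n)                        # instrument: affects T, not Y directly
t = 1.0 * z + u + rng.normal(size=n)          # treatment
y = 2.0 * t + 3.0 * u + rng.normal(size=n)    # true causal effect of T on Y is 2

# Naive OLS slope is biased upward by the confounder u
ols = np.cov(t, y)[0, 1] / np.var(t)

# Two-stage least squares: project T onto Z, then regress Y on the projection
t_hat = (np.cov(z, t)[0, 1] / np.var(z)) * z
iv = np.cov(t_hat, y)[0, 1] / np.var(t_hat)

print(round(float(ols), 1), round(float(iv), 1))  # OLS biased; IV close to 2.0
```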

[AI-57] Semantic Retrieval at Walmart KDD2022

链接: https://arxiv.org/abs/2412.04637
作者: Alessandro Magnani,Feng Liu,Suthee Chaidaroon,Sachin Yadav,Praveen Reddy Suram,Ajit Puthenputhussery,Sijie Chen,Min Xie,Anirudh Kashi,Tony Lee,Ciya Liao
关键词-EN: specific search intent, tail queries, user tail queries, candidate products, critical and challenging
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: 9 page, 2 figures, 10 tables, KDD 2022

点击查看摘要

Abstract:In product search, the retrieval of candidate products before re-ranking is more critical and challenging than other search like web search, especially for tail queries, which have a complex and specific search intent. In this paper, we present a hybrid system for e-commerce search deployed at Walmart that combines traditional inverted index and embedding-based neural retrieval to better answer user tail queries. Our system significantly improved the relevance of the search engine, measured by both offline and online evaluations. The improvements were achieved through a combination of different approaches. We present a new technique to train the neural model at scale. and describe how the system was deployed in production with little impact on response time. We highlight multiple learnings and practical tricks that were used in the deployment of this system.
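
The hybrid idea, blending a lexical inverted-index score with an embedding similarity so tail queries with no term overlap can still retrieve relevant products, can be sketched as a weighted score fusion. The scoring functions, weights, and toy catalog below are illustrative assumptions, not Walmart's production system.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

def hybrid_search(query_terms, query_emb, docs, alpha=0.5):
    """Rank docs by a blend of lexical term overlap and embedding similarity."""
    results = []
    for doc_id, (terms, emb) in docs.items():
        lexical = len(set(query_terms) & set(terms)) / len(set(query_terms))
        semantic = cosine(query_emb, emb)
        results.append((alpha * lexical + (1 - alpha) * semantic, doc_id))
    return [d for _, d in sorted(results, reverse=True)]

docs = {
    "red_shirt":  (["red", "shirt"], [0.9, 0.1]),
    "blue_pants": (["blue", "pants"], [0.1, 0.9]),
}
# Tail query "crimson tee": zero lexical overlap, but the embedding still matches
print(hybrid_search(["crimson", "tee"], [0.8, 0.2], docs))  # red_shirt ranked first
```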

[AI-58] Neural Two-Level Monte Carlo Real-Time Rendering

链接: https://arxiv.org/abs/2412.04634
作者: Mikhail Dereviannykh,Dmitrii Klepikov,Johannes Hanika,Carsten Dachsbacher
关键词-EN: Two-Level Monte Carlo, Multi-Level Monte Carlo, Monte Carlo, efficient Two-Level Monte, Two-Level Monte
类目: Graphics (cs.GR); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:We introduce an efficient Two-Level Monte Carlo (subset of Multi-Level Monte Carlo, MLMC) estimator for real-time rendering of scenes with global illumination. Using MLMC we split the shading integral into two parts: the radiance cache integral and the residual error integral that compensates for the bias of the first one. For the first part, we developed the Neural Incident Radiance Cache (NIRC) leveraging the power of fully-fused tiny neural networks as a building block, which is trained on the fly. The cache is designed to provide a fast and reasonable approximation of the incident radiance: an evaluation takes 2-25x less compute time than a path tracing sample. This enables us to estimate the radiance cache integral with a high number of samples and by this achieve faster convergence. For the residual error integral, we compute the difference between the NIRC predictions and the unbiased path tracing simulation. Our method makes no assumptions about the geometry, materials, or lighting of a scene and has only few intuitive hyper-parameters. We provide a comprehensive comparative analysis in different experimental scenarios. Since the algorithm is trained in an on-line fashion, it demonstrates significant noise level reduction even for dynamic scenes and can easily be combined with other importance sampling schemes and noise reduction techniques.
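
The two-level split described above, many samples of a cheap cached approximation plus a few samples of the residual that removes its bias, can be demonstrated on a 1D integral. Here a Taylor polynomial stands in for the neural radiance cache and exp for the expensive integrand; both choices are illustrative assumptions.

```python
import math
import random

# Two-level Monte Carlo for I = ∫₀¹ f(x) dx:
# level 1 integrates a cheap approximation g ≈ f with MANY samples,
# level 2 corrects the bias with FEW samples of the residual (f - g).
f = lambda x: math.exp(x)            # "expensive" integrand (stand-in for path tracing)
g = lambda x: 1.0 + x + x * x / 2    # cheap cache (stand-in for the NIRC)

rng = random.Random(0)
cache_part = sum(g(rng.random()) for _ in range(100_000)) / 100_000
residual_part = sum(f(x) - g(x) for x in (rng.random() for _ in range(1_000))) / 1_000
estimate = cache_part + residual_part

print(estimate)  # close to the exact value e - 1 ≈ 1.7183
```

The estimator stays unbiased because the residual term compensates exactly for the cache's error in expectation; the residual's low variance is what makes the few-sample second level affordable.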

[AI-59] ARC Prize 2024: Technical Report

链接: https://arxiv.org/abs/2412.04604
作者: Francois Chollet,Mike Knoop,Gregory Kamradt,Bryan Landers
关键词-EN: remains unbeaten, launched ARC Prize, December, ARC Prize, target benchmark score
类目: Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:As of December 2024, the ARC-AGI benchmark is five years old and remains unbeaten. We believe it is currently the most important unsolved AI benchmark in the world because it seeks to measure generalization on novel tasks – the essence of intelligence – as opposed to skill at tasks that can be prepared for in advance. This year, we launched ARC Prize, a global competition to inspire new ideas and drive open progress towards AGI by reaching a target benchmark score of 85%. As a result, the state-of-the-art score on the ARC-AGI private evaluation set increased from 33% to 55.5%, propelled by several frontier AGI reasoning techniques including deep learning-guided program synthesis and test-time training. In this paper, we survey top approaches, review new open-source implementations, discuss the limitations of the ARC-AGI-1 dataset, and share key insights gained from the competition.

[AI-60] Dissociating Artificial Intelligence from Artificial Consciousness

链接: https://arxiv.org/abs/2412.04571
作者: Graham Findlay,William Marshall,Larissa Albantakis,Isaac David,William GP Mayner,Christof Koch,Giulio Tononi
关键词-EN: computing power suggest, artificial general intelligence, Developments in machine, machine learning, learning and computing
类目: Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Neurons and Cognition (q-bio.NC)
*备注:

点击查看摘要

Abstract:Developments in machine learning and computing power suggest that artificial general intelligence is within reach. This raises the question of artificial consciousness: if a computer were to be functionally equivalent to a human, being able to do all we do, would it experience sights, sounds, and thoughts, as we do when we are conscious? Answering this question in a principled manner can only be done on the basis of a theory of consciousness that is grounded in phenomenology and that states the necessary and sufficient conditions for any system, evolved or engineered, to support subjective experience. Here we employ Integrated Information Theory (IIT), which provides principled tools to determine whether a system is conscious, to what degree, and the content of its experience. We consider pairs of systems constituted of simple Boolean units, one of which – a basic stored-program computer – simulates the other with full functional equivalence. By applying the principles of IIT, we demonstrate that (i) two systems can be functionally equivalent without being phenomenally equivalent, and (ii) that this conclusion is not dependent on the simulated system’s function. We further demonstrate that, according to IIT, it is possible for a digital computer to simulate our behavior, possibly even by simulating the neurons in our brain, without replicating our experience. This contrasts sharply with computational functionalism, the thesis that performing computations of the right kind is necessary and sufficient for consciousness.

[AI-61] WinTSR: A Windowed Temporal Saliency Rescaling Method for Interpreting Time Series Deep Learning Models

链接: https://arxiv.org/abs/2412.04532
作者: Md. Khairul Islam,Judy Fox
关键词-EN: Interpreting complex time, Interpreting complex, Temporal Saliency Rescaling, Windowed Temporal Saliency, challenging due
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Interpreting complex time series forecasting models is challenging due to the temporal dependencies between time steps and the dynamic relevance of input features over time. Existing interpretation methods are limited by focusing mostly on classification tasks, evaluating using custom baseline models instead of the latest time series models, using simple synthetic datasets, and requiring training another model. We introduce a novel interpretation method called Windowed Temporal Saliency Rescaling (WinTSR) addressing these limitations. WinTSR explicitly captures temporal dependencies among the past time steps and efficiently scales the feature importance with this time importance. We benchmark WinTSR against 10 recent interpretation techniques with 5 state-of-the-art deep-learning models of different architectures, including a time series foundation model. We use 3 real-world datasets for both time-series classification and regression. Our comprehensive analysis shows that WinTSR significantly outranks the other local interpretation methods in overall performance. Finally, we provide a novel and open-source framework to interpret the latest time series transformers and foundation models.
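The abstract does not spell out WinTSR's exact formulas, but the core idea of windowed time importance can be illustrated with a simple occlusion-style sketch: mask each window of time steps, measure how far the model output moves, and treat that shift as the window's importance. Everything below (the masking-by-zeros choice, the window layout) is an assumption for illustration, not the paper's method.

```python
def windowed_time_importance(model, series, window):
    """Occlusion-style time importance: zero out each window of time
    steps and record how far the model output moves from the baseline.
    Illustrative only; WinTSR's actual rescaling differs in detail.

    series: list of per-step feature lists; model: callable -> float.
    """
    base = model(series)
    importances = []
    for start in range(0, len(series), window):
        masked = [row[:] for row in series]  # deep-ish copy per window
        for t in range(start, min(start + window, len(series))):
            masked[t] = [0.0] * len(masked[t])  # occlude this step
        importances.append(abs(model(masked) - base))
    return importances
```

With a toy model that only reads the final time step, masking the last window moves the output while masking earlier windows does not, which is the kind of temporal-dependence signal such methods exploit.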

[AI-62] Optimizing Student Ability Assessment: A Hierarchy Constraint-Aware Cognitive Diagnosis Framework for Educational Contexts

链接: https://arxiv.org/abs/2412.04488
作者: Xinjie Sun,Qi Liu,Kai Zhang,Shuanghong Shen,Fei Wang,Yan Zhuang,Zheng Zhang,Weiyin Gong,Shijin Wang,Lina Yang,Xingying Huo
关键词-EN: Cognitive diagnosis, reveal students’ proficiency, cognitive diagnosis frameworks, aims to reveal, proficiency in specific
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
*备注: Cognitive Diagnosis

点击查看摘要

Abstract:Cognitive diagnosis (CD) aims to reveal students’ proficiency in specific knowledge concepts. With the increasing adoption of intelligent education applications, accurately assessing students’ knowledge mastery has become an urgent challenge. Although existing cognitive diagnosis frameworks enhance diagnostic accuracy by analyzing students’ explicit response records, they primarily focus on individual knowledge state, failing to adequately reflect the relative ability performance of students within hierarchies. To address this, we propose the Hierarchy Constraint-Aware Cognitive Diagnosis Framework (HCD), designed to more accurately represent student ability performance within real educational contexts. Specifically, the framework introduces a hierarchy mapping layer to identify students’ levels. It then employs a hierarchy convolution-enhanced attention layer for in-depth analysis of knowledge concepts performance among students at the same level, uncovering nuanced differences. A hierarchy inter-sampling attention layer captures performance differences across hierarchies, offering a comprehensive understanding of the relationships among students’ knowledge state. Finally, through personalized diagnostic enhancement, the framework integrates hierarchy constraint perception features with existing models, improving the representation of both individual and group characteristics. This approach enables precise inference of students’ knowledge state. Research shows that this framework not only reasonably constrains changes in students’ knowledge states to align with real educational settings, but also supports the scientific rigor and fairness of educational assessments, thereby advancing the field of cognitive diagnosis.

[AI-63] The Global AI Vibrancy Tool

链接: https://arxiv.org/abs/2412.04486
作者: Loredana Fattorini,Nestor Maslej,Raymond Perrault,Vanessa Parli,John Etchemendy,Yoav Shoham,Katrina Ligett
关键词-EN: Global AI Vibrancy, paper presents, presents the latest, latest version, interactive suite
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:This paper presents the latest version of the Global AI Vibrancy Tool (GVT), an interactive suite of visualizations designed to facilitate the comparison of AI vibrancy across 36 countries, using 42 indicators organized into 8 pillars. The tool offers customizable features that allow users to conduct in-depth country-level comparisons and longitudinal analyses of AI-related metrics, all based on publicly available data. By providing a transparent assessment of national progress in AI, it serves the diverse needs of policymakers, industry leaders, researchers, and the general public. Using weights for indicators and pillars developed by AI Index’s panel of experts and combined into an index, the Global AI Vibrancy Ranking for 2023 places the United States first by a significant margin, followed by China and the United Kingdom. The ranking also highlights the rise of smaller nations such as Singapore when evaluated on both absolute and per capita bases. The tool offers three sub-indices for evaluating Global AI Vibrancy along different dimensions: the Innovation Index, the Economic Competitiveness Index, and the Policy, Governance, and Public Engagement Index.

[AI-64] EDA-Aware RTL Generation with Large Language Models

链接: https://arxiv.org/abs/2412.04485
作者: Mubashir ul Islam,Humza Sami,Pierre-Emmanuel Gaillardon,Valerio Tenace
关键词-EN: Large Language Models, Large Language, Language Models, generating RTL code, RTL code
类目: Hardware Architecture (cs.AR); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Large Language Models (LLMs) have become increasingly popular for generating RTL code. However, producing error-free RTL code in a zero-shot setting remains highly challenging for even state-of-the-art LLMs, often leading to issues that require manual, iterative refinement. This additional debugging process can dramatically increase the verification workload, underscoring the need for robust, automated correction mechanisms to ensure code correctness from the start. In this work, we introduce AIvril2, a self-verifying, LLM-agnostic agentic framework aimed at enhancing RTL code generation through iterative corrections of both syntax and functional errors. Our approach leverages a collaborative multi-agent system that incorporates feedback from error logs generated by EDA tools to automatically identify and resolve design flaws. Experimental results, conducted on the VerilogEval-Human benchmark suite, demonstrate that our framework significantly improves code quality, achieving nearly a 3.4 \times enhancement over prior methods. In the best-case scenario, functional pass rates of 77% for Verilog and 66% for VHDL were obtained, thus substantially improving the reliability of LLM-driven RTL code generation.

[AI-65] Epinet for Content Cold Start

链接: https://arxiv.org/abs/2412.04484
作者: Hong Jun Jeon,Songbin Liu,Yuantong Li,Jie Lyu,Hunter Song,Ji Liu,Peng Wu,Zheqing Zhu
关键词-EN: evermore challenging matching, challenging matching problem, modern recommendation systems, user base poses, exploding popularity
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:The exploding popularity of online content and its user base poses an evermore challenging matching problem for modern recommendation systems. Unlike other frontiers of machine learning such as natural language, recommendation systems are responsible for collecting their own data. Simply exploiting current knowledge can lead to pernicious feedback loops but naive exploration can detract from user experience and lead to reduced engagement. This exploration-exploitation trade-off is exemplified in the classic multi-armed bandit problem for which algorithms such as upper confidence bounds (UCB) and Thompson sampling (TS) demonstrate effective performance. However, there have been many challenges to scaling these approaches to settings which do not exhibit a conjugate prior structure. Recent scalable approaches to uncertainty quantification via epinets have enabled efficient approximations of Thompson sampling even when the learning model is a complex neural network. In this paper, we demonstrate the first application of epinets to an online recommendation system. Our experiments demonstrate improvements in both user traffic and engagement efficiency on the Facebook Reels online video platform.
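The abstract frames epinets as a scalable approximation of Thompson sampling. As background, here is a minimal Bernoulli Thompson sampling sketch for the classic multi-armed bandit it references (this is the textbook algorithm, not the paper's epinet implementation; arm probabilities and round counts are made up):

```python
import random

def thompson_sampling(true_probs, n_rounds=5000, seed=0):
    """Bernoulli Thompson sampling with Beta(1, 1) priors: each round,
    sample a plausible mean for every arm from its Beta posterior,
    pull the arm with the highest sample, then update that arm."""
    rng = random.Random(seed)
    n_arms = len(true_probs)
    alpha = [1] * n_arms  # prior successes + 1
    beta = [1] * n_arms   # prior failures + 1
    total_reward = 0
    for _ in range(n_rounds):
        samples = [rng.betavariate(alpha[i], beta[i]) for i in range(n_arms)]
        arm = max(range(n_arms), key=lambda i: samples[i])
        reward = 1 if rng.random() < true_probs[arm] else 0
        total_reward += reward
        if reward:
            alpha[arm] += 1
        else:
            beta[arm] += 1
    pulls = [alpha[i] + beta[i] - 2 for i in range(n_arms)]
    return pulls, total_reward
```

Posterior sampling concentrates pulls on the better arm over time, which is the exploration-exploitation balance the abstract describes; epinets extend this idea to settings where the learner is a neural network without a conjugate prior.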

[AI-66] AI-powered Digital Framework for Personalized Economical Quality Learning at Scale

链接: https://arxiv.org/abs/2412.04483
作者: Mrzieh VatandoustMohammadieh,Mohammad Mahdi Mohajeri,Ali Keramati,Majid Nili Ahmadabadi
关键词-EN: economic status, disparity in access, developed and developing, developing countries, quality education
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:The disparity in access to quality education is significant, both between developed and developing countries and within nations, regardless of their economic status. Socioeconomic barriers and rapid changes in the job market further intensify this issue, highlighting the need for innovative solutions that can deliver quality education at scale and low cost. This paper addresses these challenges by proposing an AI-powered digital learning framework grounded in Deep Learning (DL) theory. The DL theory emphasizes learner agency and redefines the role of teachers as facilitators, making it particularly suitable for scalable educational environments. We outline eight key principles derived from learning science and AI that are essential for implementing DL-based Digital Learning Environments (DLEs). Our proposed framework leverages AI for learner modelling based on Open Learner Modeling (OLM), activity suggestions, and AI-assisted support for both learners and facilitators, fostering collaborative and engaging learning experiences. Our framework provides a promising direction for scalable, high-quality education globally, offering practical solutions to some of the AI-related challenges in education.

[AI-67] LibEvolutionEval: A Benchmark and Study for Version-Specific Code Generation

链接: https://arxiv.org/abs/2412.04478
作者: Sachit Kuhar,Wasi Uddin Ahmad,Zijian Wang,Nihal Jain,Haifeng Qian,Baishakhi Ray,Murali Krishna Ramanathan,Xiaofei Ma,Anoop Deoras
关键词-EN: local file contexts, Recent advancements, file contexts, primarily focused, focused on local
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Recent advancements in code completion models have primarily focused on local file contexts. However, these studies do not fully capture the complexity of real-world software development, which often requires the use of rapidly-evolving public libraries. To fill the gap, we introduce LibEvolutionEval, a detailed study requiring an understanding of library evolution to perform in-line code completion accurately. LibEvolutionEval provides a version-specific code-completion task comprised of eight libraries (torch, torchvision, scipy, pil, tqdm, pyyaml, matplotlib, and pandas) as they evolve over the year along with a detailed analysis of the evolution of two popular and well-maintained public libraries: PyTorch and Matplotlib. We evaluate popular public models and find that public library evolution significantly influences model performance. We explored mitigation methods by studying how retrieved version-specific library documentation and prompting can improve the model’s capability in handling these fast-evolving packages, paving a promising future path in better handling fast-evolving libraries.

[AI-68] Intelligent Tutors for Adult Learners: An Analysis of Needs and Challenges

链接: https://arxiv.org/abs/2412.04477
作者: Adit Gupta,Momin Siddiqui,Glen Smith,Jenn Reddig,Christopher MacLellan
关键词-EN: intelligent tutoring systems, tutoring systems, paper aims, pedagogical technologies, intelligent tutoring
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:This paper aims to uncover needs of adult learners when using pedagogical technologies such as intelligent tutoring systems. Further, our aim with this work is to understand the usability challenges when deploying tutors at scale within the adult learning audience. As educational technologies become more ubiquitous within k-12 education, this paper aims to bridge the gap in understanding on how adult users might utilize intelligent tutors. In pursuit of this, we built four intelligent tutors, and deployed them to 110 classrooms at a state technical college for an entire academic year. Following this deployment, we conducted focus groups amongst users to gather data to understand how learners perceived the optional educational technology during their academic journey. We further analyzed this data using foundational HCI methodologies to extract leanings and design recommendations on how developers might craft educational technologies for adoption at scale for the adult learning population.

[AI-69] The Moral Mind(s) of Large Language Models

链接: https://arxiv.org/abs/2412.04476
作者: Avner Seror
关键词-EN: key question arises, question arises, exhibit an emergent, integrated to decision-making, key question
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:As large language models (LLMs) become integrated to decision-making across various sectors, a key question arises: do they exhibit an emergent “moral mind” - a consistent set of moral principles guiding their ethical judgments - and is this reasoning uniform or diverse across models? To investigate this, we presented about forty different models from the main providers with a large array of structured ethical scenarios, creating one of the largest datasets of its kind. Our rationality tests revealed that at least one model from each provider demonstrated behavior consistent with stable moral principles, effectively acting as approximately optimizing a utility function encoding ethical reasoning. We identified these utility functions and observed a notable clustering of models around neutral ethical stances. To investigate variability, we introduced a novel non-parametric permutation approach, revealing that the most rational models shared 59% to 76% of their ethical reasoning patterns. Despite this shared foundation, differences emerged: roughly half displayed greater moral adaptability, bridging diverse perspectives, while the remainder adhered to more rigid ethical structures.

[AI-70] Take Package as Language: Anomaly Detection Using Transformer

链接: https://arxiv.org/abs/2412.04473
作者: Jie Huang
关键词-EN: faces numerous challenges, researching weakly supervised, detection faces numerous, anomaly supervision signals, weakly supervised anomaly
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
*备注:

点击查看摘要

Abstract:Network data packet anomaly detection faces numerous challenges, including exploring new anomaly supervision signals, researching weakly supervised anomaly detection, and improving model interpretability. This paper proposes NIDS-GPT, a GPT-based causal language model for network intrusion detection. Unlike previous work, NIDS-GPT innovatively treats each number in the packet as an independent “word” rather than packet fields, enabling a more fine-grained data representation. We adopt an improved GPT-2 model and design special tokenizers and embedding layers to better capture the structure and semantics of network data. NIDS-GPT has good scalability, supports unsupervised pre-training, and enhances model interpretability through attention weight visualization. Experiments on the CICIDS2017 and car-hacking datasets show that NIDS-GPT achieves 100% accuracy under extreme imbalance conditions, far surpassing traditional methods; it also achieves over 90% accuracy in one-shot learning. These results demonstrate NIDS-GPT’s excellent performance and potential in handling complex network anomaly detection tasks, especially in data-imbalanced and resource-constrained scenarios. The code is available at this https URL.
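The abstract's key design choice is treating each number in a packet as its own "word". A minimal sketch of that tokenization idea follows; the special tokens and id ordering here are assumptions, not NIDS-GPT's actual tokenizer:

```python
def build_vocab(packets):
    """Give every distinct number its own token id, mirroring the
    abstract's idea of treating each number in a packet as a 'word'.
    Special tokens and ordering here are illustrative assumptions."""
    vocab = {"<pad>": 0, "<sep>": 1}
    for pkt in packets:
        for num in pkt:
            tok = str(num)
            if tok not in vocab:
                vocab[tok] = len(vocab)
    return vocab

def encode(pkt, vocab):
    """Map one packet (a sequence of numbers) to token ids."""
    return [vocab[str(n)] for n in pkt]
```

Compared with field-level tokenization, this number-level view yields longer sequences but lets the language model attend to individual values, which is the finer granularity the abstract claims.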

[AI-71] Follow the money: a startup-based measure of AI exposure across occupations, industries and regions

链接: https://arxiv.org/abs/2412.04924
作者: Enrico Maria Fenoaltea,Dario Mazzilli,Aurelio Patelli,Angelica Sbardella,Andrea Tacchella,Andrea Zaccaria,Marco Trombetti,Luciano Pietronero
关键词-EN: necessitating robust metrics, artificial intelligence, advancing rapidly, necessitating robust, integration of artificial
类目: General Economics (econ.GN); Artificial Intelligence (cs.AI)
*备注: 24 pages, 6 figures, + Supplementary information

点击查看摘要

Abstract:The integration of artificial intelligence (AI) into the workplace is advancing rapidly, necessitating robust metrics to evaluate its tangible impact on the labour market. Existing measures of AI occupational exposure largely focus on AI’s theoretical potential to substitute or complement human labour on the basis of technical feasibility, providing limited insight into actual adoption and offering inadequate guidance for policymakers. To address this gap, we introduce the AI Startup Exposure (AISE) index-a novel metric based on occupational descriptions from O*NET and AI applications developed by startups funded by the Y Combinator accelerator. Our findings indicate that while high-skilled professions are theoretically highly exposed according to conventional metrics, they are heterogeneously targeted by startups. Roles involving routine organizational tasks-such as data analysis and office management-display significant exposure, while occupations involving tasks that are less amenable to AI automation due to ethical or high-stakes, more than feasibility, considerations – such as judges or surgeons – present lower AISE scores. By focusing on venture-backed AI applications, our approach offers a nuanced perspective on how AI is reshaping the labour market. It challenges the conventional assumption that high-skilled jobs uniformly face high AI risks, highlighting instead the role of today’s AI players’ societal desirability-driven and market-oriented choices as critical determinants of AI exposure. Contrary to fears of widespread job displacement, our findings suggest that AI adoption will be gradual and shaped by social factors as much as by the technical feasibility of AI applications. This framework provides a dynamic, forward-looking tool for policymakers and stakeholders to monitor AI’s evolving impact and navigate the changing labour landscape.

机器学习

[LG-0] Physics-informed reduced order model with conditional neural fields NEURIPS2024

链接: https://arxiv.org/abs/2412.05233
作者: Minji Kim,Tianshu Wen,Kookjin Lee,Youngsoo Choi
关键词-EN: partial differential equations, parametrized partial differential, conditional neural fields, differential equations, study presents
类目: Numerical Analysis (math.NA); Machine Learning (cs.LG)
*备注: 7 pages, 2 figures, NeurIPS 2024 Workshop on Machine Learning and the Physical Sciences

点击查看摘要

Abstract:This study presents the conditional neural fields for reduced-order modeling (CNF-ROM) framework to approximate solutions of parametrized partial differential equations (PDEs). The approach combines a parametric neural ODE (PNODE) for modeling latent dynamics over time with a decoder that reconstructs PDE solutions from the corresponding latent states. We introduce a physics-informed learning objective for CNF-ROM, which includes two key components. First, the framework uses coordinate-based neural networks to calculate and minimize PDE residuals by computing spatial derivatives via automatic differentiation and applying the chain rule for time derivatives. Second, exact initial and boundary conditions (IC/BC) are imposed using approximate distance functions (ADFs) [Sukumar and Srivastava, CMAME, 2022]. However, ADFs introduce a trade-off as their second- or higher-order derivatives become unstable at the joining points of boundaries. To address this, we introduce an auxiliary network inspired by [Gladstone et al., NeurIPS ML4PS workshop, 2022]. Our method is validated through parameter extrapolation and interpolation, temporal extrapolation, and comparisons with analytical solutions.

[LG-1] Transformers Meet Relational Databases

链接: https://arxiv.org/abs/2412.05218
作者: Jakub Peleška,Gustav Šír
关键词-EN: machine learning domains, learning domains convertible, including tabular data, continuously expanded, domains convertible
类目: Machine Learning (cs.LG); Databases (cs.DB)
*备注:

点击查看摘要

Abstract:Transformer models have continuously expanded into all machine learning domains convertible to the underlying sequence-to-sequence representation, including tabular data. However, while ubiquitous, this representation restricts their extension to the more general case of relational databases. In this paper, we introduce a modular neural message-passing scheme that closely adheres to the formal relational model, enabling direct end-to-end learning of tabular Transformers from database storage systems. We address the challenges of appropriate learning data representation and loading, which are critical in the database setting, and compare our approach against a number of representative models from various related fields across a significantly wide range of datasets. Our results demonstrate a superior performance of this newly proposed class of neural architectures.

[LG-2] Privacy Drift: Evolving Privacy Concerns in Incremental Learning

链接: https://arxiv.org/abs/2412.05183
作者: Sayyed Farid Ahamed,Soumya Banerjee,Sandip Roy,Aayush Kapoor,Marc Vucovich,Kevin Choi,Abdul Rahman,Edward Bowen,Sachin Shetty
关键词-EN: preserving user data, Federated Learning, privacy, user data privacy, privacy drift
类目: Machine Learning (cs.LG); Cryptography and Security (cs.CR)
*备注: 6 pages, 7 figures, Accepted in IEEE ICNC 25

点击查看摘要

Abstract:In the evolving landscape of machine learning (ML), Federated Learning (FL) presents a paradigm shift towards decentralized model training while preserving user data privacy. This paper introduces the concept of ``privacy drift", an innovative framework that parallels the well-known phenomenon of concept drift. While concept drift addresses the variability in model accuracy over time due to changes in the data, privacy drift encapsulates the variation in the leakage of private information as models undergo incremental training. By defining and examining privacy drift, this study aims to unveil the nuanced relationship between the evolution of model performance and the integrity of data privacy. Through rigorous experimentation, we investigate the dynamics of privacy drift in FL systems, focusing on how model updates and data distribution shifts influence the susceptibility of models to privacy attacks, such as membership inference attacks (MIA). Our results highlight a complex interplay between model accuracy and privacy safeguards, revealing that enhancements in model performance can lead to increased privacy risks. We provide empirical evidence from experiments on customized datasets derived from CIFAR-100 (Canadian Institute for Advanced Research, 100 classes), showcasing the impact of data and concept drift on privacy. This work lays the groundwork for future research on privacy-aware machine learning, aiming to achieve a delicate balance between model accuracy and data privacy in decentralized environments.
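The abstract measures privacy drift partly through membership inference attacks (MIA). A minimal loss-threshold MIA sketch, the classic baseline attack (exploiting that models fit training members more tightly), is shown below; the toy losses and threshold are made up for illustration:

```python
import math

def nll(prob_true_class):
    """Negative log-likelihood loss for one example."""
    return -math.log(max(prob_true_class, 1e-12))

def loss_threshold_mia(member_losses, nonmember_losses, threshold):
    """Classic loss-threshold membership inference: flag a sample as
    a training member when its loss falls below the threshold.
    Returns attack accuracy on a labeled evaluation set."""
    correct = sum(1 for l in member_losses if l < threshold)
    correct += sum(1 for l in nonmember_losses if l >= threshold)
    return correct / (len(member_losses) + len(nonmember_losses))
```

Tracking this attack accuracy across incremental training rounds is one way to operationalize the "privacy drift" curve the paper studies.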

[LG-3] Variational Encoder-Decoders for Learning Latent Representations of Physical Systems

链接: https://arxiv.org/abs/2412.05175
作者: Subashree Venkatasubramanian,David A. Barajas-Solano
关键词-EN: system high-dimensional observable, high-dimensional observable response, deep-learning Variational Encoder-Decoder, learning data-driven low-dimensional, observable response
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:We present a deep-learning Variational Encoder-Decoder (VED) framework for learning data-driven low-dimensional representations of the relationship between high-dimensional parameters of a physical system and the system’s high-dimensional observable response. The framework consists of two deep learning-based probabilistic transformations: An encoder mapping parameters to latent codes and a decoder mapping latent codes to the observable response. The hyperparameters of these transformations are identified by maximizing a variational lower bound on the log-conditional distribution of the observable response given parameters. To promote the disentanglement of latent codes, we equip this variational loss with a penalty on the off-diagonal entries of the aggregate distribution covariance of codes. This regularization penalty encourages the pushforward of a standard Gaussian distribution of latent codes to approximate the marginal distribution of the observable response. Using the proposed framework we successfully model the hydraulic pressure response at observation wells of a groundwater flow model as a function of its discrete log-hydraulic transmissivity field. Compared to the canonical correlation analysis encoding, the VED model achieves a lower-dimensional latent representation, with as low as r = 50 latent dimensions without a significant loss of reconstruction accuracy. We explore the impact of regularization on model performance, finding that KL-divergence and covariance regularization improve feature disentanglement in latent space while maintaining reconstruction accuracy. Furthermore, we evaluate the generative capabilities of the regularized model by decoding random Gaussian noise, revealing that tuning both \beta and \lambda parameters enhances the quality of the generated observable response data. 
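The abstract's disentanglement device is a penalty on the off-diagonal entries of the aggregate covariance of latent codes. A sketch of that penalty (sum of squared off-diagonal covariance entries; the paper's exact weighting may differ) could look like:

```python
import numpy as np

def offdiag_cov_penalty(codes):
    """Sum of squared off-diagonal entries of the latent-code
    covariance matrix, the kind of decorrelation penalty the
    abstract describes for disentangling latent dimensions.
    codes: (n_samples, latent_dim) array-like."""
    c = np.cov(codes, rowvar=False)   # sample covariance over codes
    off = c - np.diag(np.diag(c))     # zero out the diagonal
    return float((off ** 2).sum())
```

Perfectly correlated latent dimensions incur a positive penalty, while decorrelated codes incur none, nudging the encoder's pushforward toward an isotropic Gaussian as the abstract intends.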

[LG-4] A Differentially Private Kaplan-Meier Estimator for Privacy-Preserving Survival Analysis

链接: https://arxiv.org/abs/2412.05164
作者: Narasimha Raghavan Veeraragavan,Sai Praneeth Karimireddy,Jan Franz Nygård
关键词-EN: accurate survival probability, survival probability estimates, differentially private approach, paper presents, presents a differentially
类目: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:This paper presents a differentially private approach to Kaplan-Meier estimation that achieves accurate survival probability estimates while safeguarding individual privacy. The Kaplan-Meier estimator is widely used in survival analysis to estimate survival functions over time, yet applying it to sensitive datasets, such as clinical records, risks revealing private information. To address this, we introduce a novel algorithm that applies time-indexed Laplace noise, dynamic clipping, and smoothing to produce a privacy-preserving survival curve while maintaining the cumulative structure of the Kaplan-Meier estimator. By scaling noise over time, the algorithm accounts for decreasing sensitivity as fewer individuals remain at risk, while dynamic clipping and smoothing prevent extreme values and reduce fluctuations, preserving the natural shape of the survival curve. Our results, evaluated on the NCCTG lung cancer dataset, show that the proposed method effectively lowers root mean squared error (RMSE) and enhances accuracy across privacy budgets ( \epsilon ). At \epsilon = 10 , the algorithm achieves an RMSE as low as 0.04, closely approximating non-private estimates. Additionally, membership inference attacks reveal that higher \epsilon values (e.g., \epsilon \geq 6 ) significantly reduce influential points, particularly at higher thresholds, lowering susceptibility to inference attacks. These findings confirm that our approach balances privacy and utility, advancing privacy-preserving survival analysis.
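The ingredients the abstract names (product-limit estimate, time-indexed Laplace noise, clipping, monotone smoothing) can be sketched as follows. The noise calibration and smoothing rules here are illustrative assumptions, not the paper's exact mechanism:

```python
import math
import random

def kaplan_meier(times, events):
    """Standard product-limit estimator. times must be sorted
    ascending; events: 1 = event observed, 0 = censored.
    Returns (distinct event times, survival probabilities)."""
    at_risk = len(times)
    surv = 1.0
    out_t, out_s = [], []
    i = 0
    while i < len(times):
        t, deaths, removed = times[i], 0, 0
        while i < len(times) and times[i] == t:
            deaths += events[i]
            removed += 1
            i += 1
        if deaths:
            surv *= 1.0 - deaths / at_risk
            out_t.append(t)
            out_s.append(surv)
        at_risk -= removed
    return out_t, out_s

def dp_kaplan_meier(times, events, epsilon=10.0, seed=0):
    """Illustrative private variant: Laplace noise whose scale grows
    at later steps (fewer individuals at risk), then clip to [0, 1]
    and enforce a non-increasing curve."""
    rng = random.Random(seed)
    out_t, out_s = kaplan_meier(times, events)
    noisy, prev = [], 1.0
    for k, s in enumerate(out_s):
        scale = (k + 1) / (epsilon * len(times))  # grows over time
        u = rng.random() - 0.5                    # inverse-CDF Laplace draw
        lap = -scale * (1 if u >= 0 else -1) * math.log(max(1e-12, 1 - 2 * abs(u)))
        val = min(prev, max(0.0, min(1.0, s + lap)))  # clip + monotone
        noisy.append(val)
        prev = val
    return out_t, noisy
```

The clipping and monotonicity steps are what preserve the "cumulative structure" the abstract emphasizes: a noisy survival curve that still lives in [0, 1] and never increases.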

[LG-5] A text-to-tabular approach to generate synthetic patient data using LLMs

链接: https://arxiv.org/abs/2412.05153
作者: Margaux Tornqvist,Jean-Daniel Zucker,Tristan Fauvel,Nicolas Lambert,Mathilde Berthelot,Antoine Movschin
关键词-EN: make insightful discoveries, patient data, data, accelerate medical research, large-scale high-quality healthcare
类目: Machine Learning (cs.LG)
*备注: 12 pages, 2 figures, 3 tables

点击查看摘要

Abstract:Access to large-scale high-quality healthcare databases is key to accelerate medical research and make insightful discoveries about diseases. However, access to such data is often limited by patient privacy concerns, data sharing restrictions and high costs. To overcome these limitations, synthetic patient data has emerged as an alternative. However, synthetic data generation (SDG) methods typically rely on machine learning (ML) models trained on original data, leading back to the data scarcity problem. We propose an approach to generate synthetic tabular patient data that does not require access to the original data, but only a description of the desired database. We leverage prior medical knowledge and in-context learning capabilities of large language models (LLMs) to generate realistic patient data, even in a low-resource setting. We quantitatively evaluate our approach against state-of-the-art SDG models, using fidelity, privacy, and utility metrics. Our results show that while LLMs may not match the performance of state-of-the-art models trained on the original data, they effectively generate realistic patient data with well-preserved clinical correlations. An ablation study highlights key elements of our prompt contributing to high-quality synthetic patient data generation. This approach, which is easy to use and does not require original data or advanced ML skills, is particularly valuable for quickly generating custom-designed patient data, supporting project implementation and providing educational resources.

[LG-6] Effective Rank and the Staircase Phenomenon: New Insights into Neural Network Training Dynamics

链接: https://arxiv.org/abs/2412.05144
作者: Yang Jiang,Yuxiang Zhao,Quanhui Zhu
关键词-EN: solving high-dimensional problems, achieved widespread success, low-dimensional feature structures, effective rank, loss function
类目: Machine Learning (cs.LG); Numerical Analysis (math.NA)
*备注:

点击查看摘要

Abstract:In recent years, deep learning, powered by neural networks, has achieved widespread success in solving high-dimensional problems, particularly those with low-dimensional feature structures. This success stems from their ability to identify and learn low dimensional features tailored to the problems. Understanding how neural networks extract such features during training dynamics remains a fundamental question in deep learning theory. In this work, we propose a novel perspective by interpreting the neurons in the last hidden layer of a neural network as basis functions that represent essential features. To explore the linear independence of these basis functions throughout the deep learning dynamics, we introduce the concept of ‘effective rank’. Our extensive numerical experiments reveal a notable phenomenon: the effective rank increases progressively during the learning process, exhibiting a staircase-like pattern, while the loss function concurrently decreases as the effective rank rises. We refer to this observation as the ‘staircase phenomenon’. Specifically, for deep neural networks, we rigorously prove the negative correlation between the loss function and effective rank, demonstrating that the lower bound of the loss function decreases with increasing effective rank. Therefore, to achieve a rapid descent of the loss function, it is critical to promote the swift growth of effective rank. Ultimately, we evaluate existing advanced learning methodologies and find that these approaches can quickly achieve a higher effective rank, thereby avoiding redundant staircase processes and accelerating the rapid decline of the loss function.
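One standard way to make "effective rank" concrete (Roy & Vetterli's entropy-based definition, applied here to a matrix of last-hidden-layer activations; the paper's precise definition may differ) is:

```python
import numpy as np

def effective_rank(A, eps=1e-12):
    """Entropy-based effective rank: exp of the Shannon entropy of
    the normalized singular values. Applied to an activation matrix
    (rows = samples, cols = last-hidden-layer neurons), it measures
    how many linearly independent feature directions are in use."""
    s = np.linalg.svd(A, compute_uv=False)
    s = s[s > eps]                 # drop numerically zero directions
    p = s / s.sum()                # normalize to a distribution
    return float(np.exp(-(p * np.log(p)).sum()))
```

Tracking this quantity over training epochs is how one would observe the staircase pattern the abstract describes: plateaus in effective rank coinciding with plateaus in the loss.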

[LG-7] Learning Hidden Physics and System Parameters with Deep Operator Networks

链接: https://arxiv.org/abs/2412.05133
作者: Vijay Kag,Dibakar Roy Sarkar,Birupaksha Pal,Somdatta Goswami
关键词-EN: precise uncertainty quantification, facilitating precise uncertainty, machine learning complement, providing powerful tools, traditional methods falter
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Big data is transforming scientific progress by enabling the discovery of novel models, enhancing existing frameworks, and facilitating precise uncertainty quantification, while advancements in scientific machine learning complement this by providing powerful tools for solving inverse problems and identifying complex systems where traditional methods falter due to sparse or noisy data. We introduce two innovative neural operator frameworks tailored for discovering hidden physics and identifying unknown system parameters from sparse measurements. The first framework integrates a popular neural operator, DeepONet, and a physics-informed neural network to capture the relationship between sparse data and the underlying physics, enabling the accurate discovery of a family of governing equations. The second framework focuses on system parameter identification, leveraging a DeepONet pre-trained on sparse sensor measurements to initialize a physics-constrained inverse model. Both frameworks excel in handling limited data and preserving physical consistency. Benchmarking on the Burgers’ equation and reaction-diffusion system demonstrates state-of-the-art performance, achieving average L_2 errors of \mathcal{O}(10^{-2}) for hidden physics discovery and absolute errors of \mathcal{O}(10^{-3}) for parameter identification. These results underscore the frameworks’ robustness, efficiency, and potential for solving complex scientific problems with minimal observational data.
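The branch-trunk structure of DeepONet mentioned above can be sketched with random, untrained weights just to show the shapes involved; the sensor count m, basis size p, and layer widths are illustrative assumptions, not the paper's settings:

```python
import numpy as np

rng = np.random.default_rng(1)

def mlp(sizes):
    """Random-weight tanh MLP, used here only to illustrate the shapes."""
    Ws = [rng.normal(size=(a, b)) / a ** 0.5 for a, b in zip(sizes, sizes[1:])]
    def f(x):
        for W in Ws[:-1]:
            x = np.tanh(x @ W)
        return x @ Ws[-1]
    return f

# DeepONet forward pass: the branch net encodes the input function sampled
# at m sensor points, the trunk net encodes a query location, and the
# output G(u)(x) is their inner product over a shared p-dimensional basis.
m, p = 20, 16
branch = mlp([m, 32, p])
trunk = mlp([1, 32, p])

def deeponet(u_sensors, x_query):
    return float(branch(u_sensors) @ trunk(x_query))

u = np.sin(np.linspace(0.0, np.pi, m))   # an input function at m sensors
pred = deeponet(u, np.array([0.3]))      # scalar prediction G(u)(0.3)
```

The paper's frameworks would train such an operator jointly with physics-informed constraints; only the untrained forward pass is shown here.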

[LG-8] Robust Computation with Intrinsic Heterogeneity

链接: https://arxiv.org/abs/2412.05126
作者: Arash Golmohammadi,Christian Tetzlaff
关键词-EN: well-documented computational advantages, biological systems, Intrinsic within-type neuronal, ubiquitous feature, feature of biological
类目: Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)
*备注: 29 pages, 15 figures

点击查看摘要

Abstract:Intrinsic within-type neuronal heterogeneity is a ubiquitous feature of biological systems, with well-documented computational advantages. Recent works in machine learning have incorporated such diversities by optimizing neuronal parameters alongside synaptic connections and demonstrated state-of-the-art performance across common benchmarks. However, this performance gain comes at the cost of significantly higher computational costs, imposed by a larger parameter space. Furthermore, it is unclear how the neuronal parameters, constrained by the biophysics of their surroundings, are globally orchestrated to minimize top-down errors. To address these challenges, we postulate that neurons are intrinsically diverse, and investigate the computational capabilities of such heterogeneous neuronal parameters. Our results show that intrinsic heterogeneity, viewed as a fixed quenched disorder, often substantially improves performance across hundreds of temporal tasks. Notably, smaller but heterogeneous networks outperform larger homogeneous networks, despite consuming less data. We elucidate the underlying mechanisms driving this performance boost and illustrate its applicability to both rate and spiking dynamics. Moreover, our findings demonstrate that heterogeneous networks are highly resilient to severe alterations in their recurrent synaptic hyperparameters, and even recurrent connections removal does not compromise performance. The remarkable effectiveness of heterogeneous networks with small sizes and relaxed connectivity is particularly relevant for the neuromorphic community, which faces challenges due to device-to-device variability. Furthermore, understanding the mechanism of robust computation with heterogeneity also benefits neuroscientists and machine learners.

[LG-9] Transformers Can Navigate Mazes With Multi-Step Prediction

链接: https://arxiv.org/abs/2412.05117
作者: Niklas Nolte,Ouail Kitouni,Adina Williams,Mike Rabbat,Mark Ibrahim
关键词-EN: multiple steps ahead, multiple steps, steps ahead, predicting multiple steps, language modeling
类目: Machine Learning (cs.LG)
*备注: 20 pages, 15 figures

点击查看摘要

Abstract:Despite their remarkable success in language modeling, transformers trained to predict the next token in a sequence struggle with long-term planning. This limitation is particularly evident in tasks requiring foresight to plan multiple steps ahead such as maze navigation. The standard next single token prediction objective, however, offers no explicit mechanism to predict multiple steps ahead - or revisit the path taken so far. Consequently, in this work we study whether explicitly predicting multiple steps ahead (and backwards) can improve transformers’ maze navigation. We train parameter-matched transformers from scratch, under identical settings, to navigate mazes of varying types and sizes with standard next token prediction and MLM-U, an objective explicitly predicting multiple steps ahead and backwards. We find that MLM-U considerably improves transformers’ ability to navigate mazes compared to standard next token prediction across maze types and complexities. We also find MLM-U training is 4x more sample efficient and converges 2x faster in terms of GPU training hours relative to next token training. Finally, for more complex mazes we find MLM-U benefits from scaling to larger transformers. Remarkably, we find transformers trained with MLM-U outperform larger transformers trained with next token prediction using additional supervision from A* search traces. We hope these findings underscore the promise of learning objectives to advance transformers’ capacity for long-term planning.
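The multi-step objective can be illustrated by how its supervision targets differ from next-token-only prediction. A toy, hypothetical target-construction helper (the actual MLM-U objective masks spans of the sequence and trains a transformer):

```python
def multi_step_targets(path, max_steps=3):
    # For each position t along a maze path, collect supervision targets k
    # steps ahead and k steps back (k = 1..max_steps), instead of only the
    # single next token. Illustrative sketch, not the paper's exact objective.
    pairs = []
    for t in range(len(path)):
        for k in range(1, max_steps + 1):
            if t + k < len(path):
                pairs.append((t, k, path[t + k]))    # forward target
            if t - k >= 0:
                pairs.append((t, -k, path[t - k]))   # backward target
    return pairs

# A 4-node path yields forward and backward targets at every position.
targets = multi_step_targets(["A", "B", "C", "D"], max_steps=2)
```

Standard next-token training would keep only the (t, +1, ·) pairs; the extra targets are what give the model an explicit multi-step signal.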

[LG-10] Generating Rectifiable Measures through Neural Networks

链接: https://arxiv.org/abs/2412.05109
作者: Erwin Riegler,Alex Bühler,Yang Pan,Helmut Bölcskei
关键词-EN: derive universal approximation, rectifiable measures, varepsilon, universal approximation results, ReLU neural networks
类目: Machine Learning (cs.LG); Information Theory (cs.IT); Probability (math.PR); Statistics Theory (math.ST); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:We derive universal approximation results for the class of (countably) m -rectifiable measures. Specifically, we prove that m -rectifiable measures can be approximated as push-forwards of the one-dimensional Lebesgue measure on [0,1] using ReLU neural networks with arbitrarily small approximation error in terms of Wasserstein distance. What is more, the weights in the networks under consideration are quantized and bounded and the number of ReLU neural networks required to achieve an approximation error of \varepsilon is no larger than 2^{b(\varepsilon)} with b(\varepsilon)=\mathcal{O}(\varepsilon^{-m}\log^2(\varepsilon)) . This result improves Lemma IX.4 in Perekrestenko et al. as it shows that the rate at which b(\varepsilon) tends to infinity as \varepsilon tends to zero equals the rectifiability parameter m , which can be much smaller than the ambient dimension. We extend this result to countably m -rectifiable measures and show that this rate still equals the rectifiability parameter m provided that, among other technical assumptions, the measure decays exponentially on the individual components of the countably m -rectifiable support set.

[LG-11] Mixed Blessing: Class-Wise Embedding guided Instance-Dependent Partial Label Learning KDD2025

链接: https://arxiv.org/abs/2412.05029
作者: Fuchao Yang,Jianhong Cheng,Hui Liu,Yongqiang Dong,Yuheng Jia,Junhui Hou
关键词-EN: noisy labels, partial label learning, candidate label set, label set, label
类目: Machine Learning (cs.LG)
*备注: Accepted by KDD 2025

点击查看摘要

Abstract:In partial label learning (PLL), every sample is associated with a candidate label set comprising the ground-truth label and several noisy labels. The conventional PLL assumes the noisy labels are randomly generated (instance-independent), while in practical scenarios, the noisy labels are always instance-dependent and are highly related to the sample features, leading to the instance-dependent partial label learning (IDPLL) problem. Instance-dependent noisy label is a double-edged sword. On one side, it may promote model training as the noisy labels can depict the sample to some extent. On the other side, it brings high label ambiguity as the noisy labels are quite undistinguishable from the ground-truth label. To leverage the nuances of IDPLL effectively, for the first time we create class-wise embeddings for each sample, which allow us to explore the relationship of instance-dependent noisy labels, i.e., the class-wise embeddings in the candidate label set should have high similarity, while the class-wise embeddings between the candidate label set and the non-candidate label set should have high dissimilarity. Moreover, to reduce the high label ambiguity, we introduce the concept of class prototypes containing global feature information to disambiguate the candidate label set. Extensive experimental comparisons with twelve methods on six benchmark data sets, including four fine-grained data sets, demonstrate the effectiveness of the proposed method. The code implementation is publicly available at this https URL.
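The similarity/dissimilarity constraints on class-wise embeddings can be sketched as a simple cosine-similarity loss; the function name and exact form are illustrative assumptions, not the paper's loss:

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u))
                  * math.sqrt(sum(b * b for b in v)))

def embedding_alignment_loss(class_embs, candidates):
    # Hypothetical loss in the spirit of the abstract: pull class-wise
    # embeddings inside the candidate label set together (high similarity)
    # and push candidate vs. non-candidate embeddings apart (high
    # dissimilarity). class_embs maps label -> embedding vector.
    labels = list(class_embs)
    cand = [l for l in labels if l in candidates]
    non = [l for l in labels if l not in candidates]
    pull = [1.0 - cosine(class_embs[a], class_embs[b])
            for i, a in enumerate(cand) for b in cand[i + 1:]]
    push = [max(0.0, cosine(class_embs[a], class_embs[b]))
            for a in cand for b in non]
    return sum(pull) / max(len(pull), 1) + sum(push) / max(len(push), 1)
```

With candidate embeddings aligned and orthogonal to non-candidates, the loss vanishes, which is the geometry the paper's regularisation encourages.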

[LG-12] Prompt Transfer for Dual-Aspect Cross Domain Cognitive Diagnosis

链接: https://arxiv.org/abs/2412.05004
作者: Fei Liu,Yizhong Zhang,Shuochen Liu,Shengwei Ji,Kui Yu,Le Wu
关键词-EN: enabling downstream applications, personalized learning guidance, students’ cognitive states, cognitive states based, evaluate students’ cognitive
类目: Machine Learning (cs.LG); Computers and Society (cs.CY)
*备注:

点击查看摘要

Abstract:Cognitive Diagnosis (CD) aims to evaluate students’ cognitive states based on their interaction data, enabling downstream applications such as exercise recommendation and personalized learning guidance. However, existing methods often struggle with accuracy drops in cross-domain cognitive diagnosis (CDCD), a practical yet challenging task. While some efforts have explored exercise-aspect CDCD, such as cross-subject scenarios, they fail to address the broader dual-aspect nature of CDCD, encompassing both student- and exercise-aspect variations. This diversity creates significant challenges in developing a scenario-agnostic framework. To address these gaps, we propose PromptCD, a simple yet effective framework that leverages soft prompt transfer for cognitive diagnosis. PromptCD is designed to adapt seamlessly across diverse CDCD scenarios, introducing PromptCD-S for student-aspect CDCD and PromptCD-E for exercise-aspect CDCD. Extensive experiments on real-world datasets demonstrate the robustness and effectiveness of PromptCD, consistently achieving superior performance across various CDCD scenarios. Our work offers a unified and generalizable approach to CDCD, advancing both theoretical and practical understanding in this critical domain. The implementation of our framework is publicly available at this https URL.

[LG-13] Noise Matters: Diffusion Model-based Urban Mobility Generation with Collaborative Noise Priors

链接: https://arxiv.org/abs/2412.05000
作者: Yuheng Zhang,Yuan Yuan,Jingtao Ding,Jian Yuan,Yong Li
关键词-EN: global urbanization, largely grown, cities has largely, mobility data, data
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:With global urbanization, the focus on sustainable cities has largely grown, driving research into equity, resilience, and urban planning, which often relies on mobility data. The rise of web-based apps and mobile devices has provided valuable user data for mobility-related research. However, real-world mobility data is costly and raises privacy concerns. To protect privacy while retaining key features of real-world movement, the demand for synthetic data has steadily increased. Recent advances in diffusion models have shown great potential for mobility trajectory generation due to their ability to model randomness and uncertainty. However, existing approaches often directly apply independent and identically distributed (i.i.d.) noise sampling from image generation techniques, which fails to account for the spatiotemporal correlations and social interactions that shape urban mobility patterns. In this paper, we propose CoDiffMob, a diffusion method for urban mobility generation with collaborative noise priors. We emphasize the critical role of noise in diffusion models for generating mobility data. By leveraging both individual movement characteristics and population-wide dynamics, we construct novel collaborative noise priors that provide richer and more informative guidance throughout the generation process. Extensive experiments demonstrate the superiority of our method, with generated data accurately capturing both individual preferences and collective patterns, achieving an improvement of over 32%. Furthermore, it can effectively replace web-derived mobility data to better support downstream applications, while safeguarding user privacy and fostering a more secure and ethical web. This highlights its tremendous potential for applications in sustainable city-related research.
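One way to make "collaborative" noise concrete is to mix a shared population-level component into each individual's otherwise i.i.d. Gaussian noise. This is a hypothetical sketch, not CoDiffMob's actual prior construction:

```python
import random

def collaborative_noise(n_agents, dim, rho=0.5, rng=None):
    # Hypothetical correlated noise prior: each individual's noise mixes a
    # shared population component with an individual component, so draws are
    # correlated across individuals rather than i.i.d. (the failure mode the
    # abstract attributes to reusing image-generation noise directly).
    # rho in [0, 1] sets the mixing; per-coordinate variance stays 1.
    rng = rng or random.Random()
    shared = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    out = []
    for _ in range(n_agents):
        own = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        out.append([rho * s + (1.0 - rho * rho) ** 0.5 * o
                    for s, o in zip(shared, own)])
    return out
```

At rho = 0 this reduces to ordinary i.i.d. sampling; at rho = 1 all individuals share one draw, with collective structure emerging in between.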

[LG-14] Causal discovery with endogenous context variables

链接: https://arxiv.org/abs/2412.04981
作者: Wiebke Günther,Oana-Iuliana Popescu,Martin Rabel,Urmi Ninad,Andreas Gerhardus,Jakob Runge
关键词-EN: underlying causal mechanisms, exhibit variations, endogenous context variables, causal mechanisms, context variables
类目: Machine Learning (cs.LG); Statistics Theory (math.ST)
*备注:

点击查看摘要

Abstract:Causal systems often exhibit variations of the underlying causal mechanisms between the variables of the system. Often, these changes are driven by different environments or internal states in which the system operates, and we refer to context variables as those variables that indicate this change in causal mechanisms. An example are the causal relations in soil moisture-temperature interactions and their dependence on soil moisture regimes: Dry soil triggers a dependence of soil moisture on latent heat, while environments with wet soil do not feature such a feedback, making it a context-specific property. Crucially, a regime or context variable such as soil moisture need not be exogenous and can be influenced by the dynamical system variables - precipitation can make a dry soil wet - leading to joint systems with endogenous context variables. In this work we investigate the assumptions for constraint-based causal discovery of context-specific information in systems with endogenous context variables. We show that naive approaches such as learning different regime graphs on masked data, or pooling all data, can lead to uninformative results. We propose an adaptive constraint-based discovery algorithm and give a detailed discussion on the connection to structural causal models, including sufficiency assumptions, which allow to prove the soundness of our algorithm and to interpret the results causally. Numerical experiments demonstrate the performance of the proposed method over alternative baselines, but they also unveil current limitations of our method.

[LG-15] Achieving Group Fairness through Independence in Predictive Process Monitoring

链接: https://arxiv.org/abs/2412.04914
作者: Jari Peeperkorn,Simon De Vos
关键词-EN: forecasting future states, focuses on forecasting, forecasting future, future states, states of ongoing
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: Preprint

点击查看摘要

Abstract:Predictive process monitoring focuses on forecasting future states of ongoing process executions, such as predicting the outcome of a particular case. In recent years, the application of machine learning models in this domain has garnered significant scientific attention. When using historical execution data, which may contain biases or exhibit unfair behavior, these biases may be encoded into the trained models. Consequently, when such models are deployed to make decisions or guide interventions for new cases, they risk perpetuating this unwanted behavior. This work addresses group fairness in predictive process monitoring by investigating independence, i.e. ensuring predictions are unaffected by sensitive group membership. We explore independence through metrics for demographic parity such as \Delta DP, as well as recently introduced, threshold-independent distribution-based alternatives. Additionally, we propose a composite loss function consisting of binary cross-entropy and a distribution-based loss (Wasserstein) to train models that balance predictive performance and fairness, and allow for customizable trade-offs. The effectiveness of both the fairness metrics and the composite loss functions is validated through a controlled experimental setup.
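The ΔDP metric and the composite loss can be sketched directly. The Wasserstein term below is the empirical 1-Wasserstein distance between equal-size group score samples, an illustrative stand-in for the paper's distribution-based loss:

```python
import math

def delta_dp(scores, groups, threshold=0.5):
    """Demographic parity gap: difference in positive-prediction rates
    between sensitive groups (illustrative sketch)."""
    rates = {}
    for g in set(groups):
        preds = [s >= threshold for s, gg in zip(scores, groups) if gg == g]
        rates[g] = sum(preds) / len(preds)
    vals = sorted(rates.values())
    return vals[-1] - vals[0]

def wasserstein1(a, b):
    """1-Wasserstein distance between two equal-size empirical score
    samples: mean absolute difference of sorted values."""
    a, b = sorted(a), sorted(b)
    assert len(a) == len(b)
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)

def composite_loss(scores, labels, groups, lam=1.0):
    # BCE + lambda * distribution-based fairness penalty; lam controls the
    # performance/fairness trade-off the abstract describes.
    eps = 1e-12
    bce = -sum(y * math.log(s + eps) + (1 - y) * math.log(1 - s + eps)
               for s, y in zip(scores, labels)) / len(scores)
    g0 = [s for s, g in zip(scores, groups) if g == 0]
    g1 = [s for s, g in zip(scores, groups) if g == 1]
    return bce + lam * wasserstein1(g0, g1)
```

Unlike ΔDP, the Wasserstein term compares whole score distributions, which is what makes it threshold-independent.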

[LG-16] Learning High-Degree Parities: The Crucial Role of the Initialization

链接: https://arxiv.org/abs/2412.04910
作者: Emmanuel Abbe,Elisabetta Cornacchia,Jan Hązła,Donald Kougang-Yombi
关键词-EN: evaluating learning algorithms, benchmark for evaluating, regular neural networks, almost-full parities, Parities
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Parities have become a standard benchmark for evaluating learning algorithms. Recent works show that regular neural networks trained by gradient descent can efficiently learn degree k parities on uniform inputs for constant k , but fail to do so when k and d-k grow with d (here d is the ambient dimension). However, the case where k=d-O_d(1) (almost-full parities), including the degree d parity (the full parity), has remained unsettled. This paper shows that for gradient descent on regular neural networks, learnability depends on the initial weight distribution. On one hand, the discrete Rademacher initialization enables efficient learning of almost-full parities, while on the other hand, its Gaussian perturbation with large enough constant standard deviation \sigma prevents it. The positive result for almost-full parities is shown to hold up to \sigma=O(d^{-1}) , pointing to questions about a sharper threshold phenomenon. Unlike statistical query (SQ) learning, where a singleton function class like the full parity is trivially learnable, our negative result applies to a fixed function and relies on an initial gradient alignment measure of potential broader relevance to neural network learning.
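The objects in this setting are easy to instantiate: a degree-k parity dataset on {-1,+1}^d and the discrete Rademacher initialization the paper studies. A minimal sketch (the scale parameter is an assumption):

```python
import random

def parity_sample(d, k, rng):
    """Uniform x in {-1,+1}^d with label the product of the first k
    coordinates: a degree-k parity (k = d gives the full parity)."""
    x = [rng.choice([-1, 1]) for _ in range(d)]
    y = 1
    for xi in x[:k]:
        y *= xi
    return x, y

def rademacher_init(fan_in, fan_out, scale, rng):
    """Discrete Rademacher weight init (+/- scale), the initialization the
    paper shows enables learning almost-full parities; the paper's negative
    result concerns its Gaussian perturbation with large enough sigma.
    The scale value is a hypothetical choice here."""
    return [[scale * rng.choice([-1, 1]) for _ in range(fan_in)]
            for _ in range(fan_out)]

rng = random.Random(0)
x, y = parity_sample(d=8, k=8, rng=rng)   # a full-parity example
W = rademacher_init(8, 4, 0.5, rng)       # first-layer weights
```

A training loop over such samples, comparing Rademacher against Gaussian-perturbed initializations, would reproduce the dichotomy the paper proves.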

[LG-17] Nonmyopic Global Optimisation via Approximate Dynamic Programming

链接: https://arxiv.org/abs/2412.04882
作者: Filippo Airaldi,Bart De Schutter,Azita Dabiri
关键词-EN: Unconstrained global optimisation, Unconstrained global, gradient information, Gaussian processes, employs Gaussian processes
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: 31 pages, 4 figures, 2 tables, submitted to Springer Computational Optimization and Applications

点击查看摘要

Abstract:Unconstrained global optimisation aims to optimise expensive-to-evaluate black-box functions without gradient information. Bayesian optimisation, one of the most well-known techniques, typically employs Gaussian processes as surrogate models, leveraging their probabilistic nature to balance exploration and exploitation. However, Gaussian processes become computationally prohibitive in high-dimensional spaces. Recent alternatives, based on inverse distance weighting (IDW) and radial basis functions (RBFs), offer competitive, computationally lighter solutions. Despite their efficiency, both traditional global and Bayesian optimisation strategies suffer from the myopic nature of their acquisition functions, which focus solely on immediate improvement neglecting future implications of the sequential decision making process. Nonmyopic acquisition functions devised for the Bayesian setting have shown promise in improving long-term performance. Yet, their use in deterministic strategies with IDW and RBF remains unexplored. In this work, we introduce novel nonmyopic acquisition strategies tailored to IDW- and RBF-based global optimisation. Specifically, we develop dynamic programming-based paradigms, including rollout and multi-step scenario-based optimisation schemes, to enable lookahead acquisition. These methods optimise a sequence of query points over a horizon (instead of only at the next step) by predicting the evolution of the surrogate model, inherently managing the exploration-exploitation trade-off in a systematic way via optimisation techniques. The proposed approach represents a significant advance in extending nonmyopic acquisition principles, previously confined to Bayesian optimisation, to the deterministic framework. Empirical results on synthetic and hyperparameter tuning benchmark problems demonstrate that these nonmyopic methods outperform conventional myopic approaches.
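The IDW surrogate at the core of these strategies can be sketched in a few lines; the nonmyopic rollout over a horizon, which is the paper's contribution, is not shown:

```python
def idw_predict(x, X, y, p=2):
    # Inverse-distance-weighted surrogate prediction at query x from
    # observed points (X, y); exact at observed points. The paper's
    # acquisition would trade this prediction off against a distance-based
    # exploration term, optimised over a multi-step horizon instead of
    # only the next query.
    total, val = 0.0, 0.0
    for xi, yi in zip(X, y):
        d2 = sum((a - b) ** 2 for a, b in zip(x, xi))
        if d2 == 0.0:
            return yi                     # interpolation at observed point
        w = 1.0 / d2 ** (p / 2)
        total += w
        val += w * yi
    return val / total
```

Unlike a Gaussian process, this surrogate needs no matrix inversion, which is what keeps IDW-based global optimisation computationally light in higher dimensions.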

[LG-18] MSECG: Incorporating Mamba for Robust and Efficient ECG Super-Resolution

链接: https://arxiv.org/abs/2412.04861
作者: Jie Lin,I Chiu,Kuan-Chen Wang,Kai-Chun Liu,Hsin-Min Wang,Ping-Cheng Yeh,Yu Tsao
关键词-EN: diagnosing cardiovascular diseases, cardiovascular diseases, ECG, long-term ECG monitoring, play a crucial
类目: Machine Learning (cs.LG); Signal Processing (eess.SP)
*备注: 5 pages, 3 figures

点击查看摘要

Abstract:Electrocardiogram (ECG) signals play a crucial role in diagnosing cardiovascular diseases. To reduce power consumption in wearable or portable devices used for long-term ECG monitoring, super-resolution (SR) techniques have been developed, enabling these devices to collect and transmit signals at a lower sampling rate. In this study, we propose MSECG, a compact neural network model designed for ECG SR. MSECG combines the strength of the recurrent Mamba model with convolutional layers to capture both local and global dependencies in ECG waveforms, allowing for the effective reconstruction of high-resolution signals. We also assess the model’s performance in real-world noisy conditions by utilizing ECG data from the PTB-XL database and noise data from the MIT-BIH Noise Stress Test Database. Experimental results show that MSECG outperforms two contemporary ECG SR models under both clean and noisy conditions while using fewer parameters, offering a more powerful and robust solution for long-term ECG monitoring applications.

[LG-19] Wavelet Diffusion Neural Operator

链接: https://arxiv.org/abs/2412.04833
作者: Peiyan Hu,Rui Wang,Xiang Zheng,Tao Zhang,Haodong Feng,Ruiqi Feng,Long Wei,Yue Wang,Zhi-Ming Ma,Tailin Wu
关键词-EN: Simulating and controlling, partial differential equations, science and engineering, partial differential, Diffusion Neural Operator
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Simulating and controlling physical systems described by partial differential equations (PDEs) are crucial tasks across science and engineering. Recently, diffusion generative models have emerged as a competitive class of methods for these tasks due to their ability to capture long-term dependencies and model high-dimensional states. However, diffusion models typically struggle with handling system states with abrupt changes and generalizing to higher resolutions. In this work, we propose Wavelet Diffusion Neural Operator (WDNO), a novel PDE simulation and control framework that enhances the handling of these complexities. WDNO comprises two key innovations. Firstly, WDNO performs diffusion-based generative modeling in the wavelet domain for the entire trajectory to handle abrupt changes and long-term dependencies effectively. Secondly, to address the issue of poor generalization across different resolutions, which is one of the fundamental tasks in modeling physical systems, we introduce multi-resolution training. We validate WDNO on five physical systems, including 1D advection equation, three challenging physical systems with abrupt changes (1D Burgers’ equation, 1D compressible Navier-Stokes equation and 2D incompressible fluid), and a real-world dataset ERA5, which demonstrates superior performance on both simulation and control tasks over state-of-the-art methods, with significant improvements in long-term and detail prediction accuracy. Remarkably, in the challenging context of the 2D high-dimensional and indirect control task aimed at reducing smoke leakage, WDNO reduces the leakage by 33.2% compared to the second-best baseline.
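The wavelet-domain step can be illustrated with a one-level orthonormal Haar transform of a trajectory and its exact inverse; WDNO's diffusion model itself is not sketched:

```python
def haar_dwt(signal):
    # One level of the orthonormal Haar discrete wavelet transform,
    # splitting a trajectory into coarse averages and detail coefficients.
    # WDNO runs diffusion on wavelet coefficients of the whole trajectory;
    # this shows only the transform step (even-length signals assumed).
    assert len(signal) % 2 == 0
    s = 2 ** -0.5
    approx = [s * (signal[i] + signal[i + 1]) for i in range(0, len(signal), 2)]
    detail = [s * (signal[i] - signal[i + 1]) for i in range(0, len(signal), 2)]
    return approx, detail

def haar_idwt(approx, detail):
    """Exact inverse of haar_dwt."""
    s = 2 ** -0.5
    out = []
    for a, d in zip(approx, detail):
        out += [s * (a + d), s * (a - d)]
    return out
```

Abrupt changes in the signal concentrate in the detail coefficients, which is the intuition behind modelling in the wavelet domain.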

[LG-20] CCS: Continuous Learning for Customized Incremental Wireless Sensing Services

链接: https://arxiv.org/abs/2412.04821
作者: Qunhang Fu,Fei Wang,Mengdie Zhu,Han Ding,Jinsong Han,Tony Xiao Han
关键词-EN: vital sign estimation, made significant progress, pose estimation, sign estimation, action recognition
类目: Machine Learning (cs.LG)
*备注: 9 pages,8 figures

点击查看摘要

Abstract:Wireless sensing has made significant progress in tasks such as action recognition, vital sign estimation, and pose estimation. After over a decade of work, wireless sensing currently stands at the tipping point transitioning from proof-of-concept systems to large-scale deployment. We envision a future service scenario where wireless sensing service providers distribute sensing models to users. During usage, users might request new sensing capabilities. For example, if someone is away from home on a business trip or vacation for an extended period, they may want a new sensing capability that can detect falls in elderly parents or grandparents and promptly alert them. In this paper, we propose CCS (continuous customized service), enabling model updates on users’ local computing resources without data transmission to the service providers. To address the issue of catastrophic forgetting in model updates, where updating model parameters to implement new capabilities leads to the loss of existing capabilities, we design knowledge distillation and weight alignment modules. These modules enable the sensing model to acquire new capabilities while retaining the existing ones. We conducted extensive experiments on the large-scale XRF55 dataset across Wi-Fi, millimeter-wave radar, and RFID modalities to simulate scenarios where four users sequentially introduced new customized demands. The results affirm that CCS excels in continuous model services across all the above wireless modalities, significantly outperforming existing approaches like OneFi.
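The knowledge-distillation module can be sketched with the standard temperature-softened KL loss; this generic form is assumed here as a stand-in for CCS's exact module (the weight alignment module is not shown):

```python
import math

def softmax(logits, T=1.0):
    """Numerically stable softmax with temperature T."""
    m = max(l / T for l in logits)
    exps = [math.exp(l / T - m) for l in logits]
    z = sum(exps)
    return [e / z for e in exps]

def distill_loss(student_logits, teacher_logits, T=2.0):
    # Temperature-softened KL(teacher || student): penalises the updated
    # (student) model for drifting from the pre-update (teacher) model's
    # outputs on old capabilities, countering catastrophic forgetting.
    p = softmax(teacher_logits, T)
    q = softmax(student_logits, T)
    return T * T * sum(pi * math.log(pi / qi)
                       for pi, qi in zip(p, q) if pi > 0)
```

During an update, this loss on old-task data would be combined with the ordinary task loss on the newly requested capability.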

[LG-21] Differentially Private Random Feature Model

链接: https://arxiv.org/abs/2412.04785
作者: Chunyang Liao,Deanna Needell,Alexander Xue
关键词-EN: received great attention, Designing privacy-preserving machine, Designing privacy-preserving, random feature model, recent years
类目: Machine Learning (cs.LG); Cryptography and Security (cs.CR)
*备注: Submitted to an IEEE journal

点击查看摘要

Abstract:Designing privacy-preserving machine learning algorithms has received great attention in recent years, especially in the setting when the data contains sensitive information. Differential privacy (DP) is a widely used mechanism for data analysis with privacy guarantees. In this paper, we produce a differentially private random feature model. Random features, which were proposed to approximate large-scale kernel machines, have been used to study privacy-preserving kernel machines as well. We consider the over-parametrized regime (more features than samples) where the non-private random feature model is learned via solving the min-norm interpolation problem, and then we apply output perturbation techniques to produce a private model. We show that our method preserves privacy and derive a generalization error bound for the method. To the best of our knowledge, we are the first to consider privacy-preserving random feature models in the over-parametrized regime and provide theoretical guarantees. We empirically compare our method with other privacy-preserving learning methods in the literature as well. Our results show that our approach is superior to the other methods in terms of generalization performance on synthetic data and benchmark data sets. Additionally, it was recently observed that DP mechanisms may exhibit and exacerbate disparate impact, which means that the outcomes of DP learning algorithms vary significantly among different groups. We show that both theoretically and empirically, random features have the potential to reduce disparate impact, and hence achieve better fairness.
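The recipe of min-norm interpolation plus output perturbation can be sketched end to end. The noise scale below is a free parameter standing in for one calibrated to (epsilon, delta) via the solution's sensitivity, which the sketch does not derive:

```python
import numpy as np

rng = np.random.default_rng(0)

def random_features(X, W, b):
    """Random Fourier-style feature map cos(XW + b) (one common choice)."""
    return np.cos(X @ W + b)

def private_random_feature_model(X, y, n_features=64, noise_scale=0.1):
    # Over-parametrized regime: more random features than samples, so the
    # min-norm least-squares solution interpolates the training data.
    W = rng.normal(size=(X.shape[1], n_features))
    b = rng.uniform(0.0, 2.0 * np.pi, size=n_features)
    Phi = random_features(X, W, b)
    theta = np.linalg.pinv(Phi) @ y          # min-norm interpolating weights
    # Output perturbation: privatize by adding Gaussian noise to the
    # learned weights (scale would be set by a DP sensitivity analysis).
    theta_priv = theta + noise_scale * rng.normal(size=theta.shape)
    return lambda Xq: random_features(Xq, W, b) @ theta_priv

X = np.linspace(0.0, 1.0, 10).reshape(5, 2)
y = np.arange(5.0)
model = private_random_feature_model(X, y, noise_scale=0.0)  # noiseless: interpolates
```

With noise_scale > 0 the returned model is a randomized function of the data, which is the mechanism that yields the DP guarantee.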

[LG-22] DPGIIL: Dirichlet Process-Deep Generative Model-Integrated Incremental Learning for Clustering in Transmissibility-based Online Structural Anomaly Detection

链接: https://arxiv.org/abs/2412.04781
作者: Lin-Feng Mei,Wang-Ji Yan
关键词-EN: handling high-dimensional streaming, high-dimensional streaming data, manually-engineered feature quality, existing approaches struggle, optimal cluster number
类目: Machine Learning (cs.LG); Data Analysis, Statistics and Probability (physics.data-an); Machine Learning (stat.ML)
*备注: 48 pages,9 figures,6 tables,submitted to Advanced Engineering Informatics

点击查看摘要

Abstract:Clustering based on vibration responses, such as transmissibility functions (TFs), is promising in structural anomaly detection, but most existing approaches struggle with determining the optimal cluster number and handling high-dimensional streaming data, while their shallow structures also make them sensitive to manually-engineered feature quality. To bridge this gap, this work proposes the Dirichlet process-deep generative model-integrated incremental learning (DPGIIL) for clustering by combining the advantages of deep generative models (DGMs) in representation learning and the Dirichlet process mixture model (DPMM) in identifying distinct patterns in observed data. By introducing a DPMM prior into the latent space of DGMs, DPGIIL automatically captures dissimilarities in extracted latent representations, enabling both generative modeling and clustering. Within the context of variational Bayesian inference, a lower bound on the log marginal likelihood of DPGIIL, tighter than the evidence lower bound given sufficient training data, is derived analytically, which enables the joint optimization of DGM and DPMM parameters, thereby allowing the DPMM to regularize the DGM’s feature extraction process. Additionally, a greedy split-merge scheme-based coordinate ascent variational inference method is devised to accelerate the optimization. The summary statistics of the DPMM, along with the network parameters, are used to retain information about previous data for incremental learning. Notably, this study uses variational autoencoder (VAE) within DPGIIL as an illustrative example, while this framework is adaptable to other DGMs. Two case studies show that the proposed method outperforms some state-of-the-art approaches in structural anomaly detection and clustering, while also dynamically generating new clusters to indicate the emergence of new structural conditions for online monitoring.

[LG-23] Anomaly Detection and Classification in Knowledge Graphs

链接: https://arxiv.org/abs/2412.04780
作者: Asara Senaratne,Peter Christen,Pouya Omran,Graham Williams
关键词-EN: language processing techniques, natural language processing, Path Rank Algorithm, Knowledge graph Anomalies, Knowledge Graph
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Anomalies such as redundant, inconsistent, contradictory, and deficient values in a Knowledge Graph (KG) are unavoidable, as these graphs are often curated manually, or extracted using machine learning and natural language processing techniques. Therefore, anomaly detection is a task that can enhance the quality of KGs. In this paper, we propose SEKA (SEeking Knowledge graph Anomalies), an unsupervised approach for the detection of abnormal triples and entities in KGs. SEKA can help improve the correctness of a KG whilst retaining its coverage. We propose the Corroborative Path Rank Algorithm (CPRA), an efficient adaptation of the Path Rank Algorithm (PRA) customized to detect anomalies in KGs. Furthermore, we also present TAXO (TAXOnomy of anomaly types in KGs), a taxonomy of possible anomaly types that can occur in a KG. This taxonomy provides a classification of the anomalies discovered by SEKA with an extensive discussion of possible data quality issues in a KG. We evaluate both approaches using the four real-world KGs YAGO-1, KBpedia, Wikidata, and DSKG to demonstrate the ability of SEKA and TAXO to outperform the baselines.

[LG-24] IterNorm: Fast Iterative Normalization

链接: https://arxiv.org/abs/2412.04778
作者: ChangMin Ye,Yonguk Sim,Youngchae Kim,SeongMin Jin,Doo Seok Jeong
关键词-EN: Transformer-based large language, large language models, Transformer-based large, large language, large amount
类目: Machine Learning (cs.LG)
*备注: Design, Automation Test in Europe Conference 2025

点击查看摘要

Abstract:Transformer-based large language models are memory-bound models whose operation is based on a large amount of data that is only marginally reused. Thus, the data movement between a host and accelerator likely dictates the total wall-clock time. Layer normalization is one of the key workloads in the transformer model, following each of the multi-head attention and feed-forward network blocks. To reduce data movement, layer normalization needs to be performed on the same chip as the matrix-matrix multiplication engine. To this end, we introduce an iterative L2-normalization method for 1D input (IterNorm), ensuring fast convergence to the steady-state solution within five iteration steps and high precision, outperforming the fast inverse square root algorithm in six out of nine cases for FP32 and five out of nine for BFloat16 across the embedding lengths used in the OPT models. Implemented in 32/28nm CMOS, the IterNorm macro normalizes d-dimensional vectors, where 64 \leq d \leq 1024 , with a latency of 112-227 cycles at 100MHz/1.05V.
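The abstract compares IterNorm against the fast inverse-square-root approach without showing either recurrence. As background, the classical Newton-Raphson inverse-square-root baseline can be sketched in a few lines of numpy (the exponent-based initial guess here is illustrative, not the paper's):

```python
import numpy as np

def newton_inv_sqrt(s, iters=5):
    """Approximate 1/sqrt(s) with Newton-Raphson: y <- y * (1.5 - 0.5 * s * y * y)."""
    m, e = np.frexp(s)               # s = m * 2**e with m in [0.5, 1)
    y = 2.0 ** (-e / 2.0)            # cheap initial guess from the exponent alone
    for _ in range(iters):
        y = y * (1.5 - 0.5 * s * y * y)
    return y

def iter_l2_normalize(v, iters=5):
    """L2-normalize a 1D vector using only multiply/add after the initial guess."""
    s = float(np.dot(v, v))
    return v * newton_inv_sqrt(s, iters)
```

Because each iteration uses only multiplies and adds, this style of normalization maps naturally onto an on-chip macro next to the matmul engine, which is the hardware motivation the abstract describes.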

[LG-25] Towards counterfactual fairness through auxiliary variables

链接: https://arxiv.org/abs/2412.04767
作者: Bowei Tian,Ziyao Wang,Shwai He,Wanghao Ye,Guoheng Sun,Yucong Dai,Yongkai Wu,Ang Li
关键词-EN: machine learning models, motivated substantial research, age are considered, recent years, predictive accuracy
类目: Machine Learning (cs.LG); Data Structures and Algorithms (cs.DS); Machine Learning (stat.ML)
*备注: arXiv admin note: text overlap with arXiv:2307.08232 by other authors

点击查看摘要

Abstract:The challenge of balancing fairness and predictive accuracy in machine learning models, especially when sensitive attributes such as race, gender, or age are considered, has motivated substantial research in recent years. Counterfactual fairness ensures that predictions remain consistent across counterfactual variations of sensitive attributes, which is a crucial concept in addressing societal biases. However, existing counterfactual fairness approaches usually overlook intrinsic information about sensitive features, limiting their ability to achieve fairness while simultaneously maintaining performance. To tackle this challenge, we introduce EXOgenous Causal reasoning (EXOC), a novel causal reasoning framework motivated by exogenous variables. It leverages auxiliary variables to uncover intrinsic properties that give rise to sensitive attributes. Our framework explicitly defines an auxiliary node and a control node that contribute to counterfactual fairness and control the information flow within the model. Our evaluation, conducted on synthetic and real-world datasets, validates EXOC’s superiority, showing that it outperforms state-of-the-art approaches in achieving counterfactual fairness.

[LG-26] GABAR: Graph Attention-Based Action Ranking for Relational Policy Learning

链接: https://arxiv.org/abs/2412.04752
作者: Rajesh Mangannavar,Stefan Lee,Alan Fern,Prasad Tadepalli
关键词-EN: learn relational policies, Gated Recurrent Units, Graph Neural Network, classical planning based, Neural Network architecture
类目: Machine Learning (cs.LG)
*备注: 6 Pages, 1 figure

点击查看摘要

Abstract:We propose a novel approach to learn relational policies for classical planning based on learning to rank actions. We introduce a new graph representation that explicitly captures action information and propose a Graph Neural Network architecture augmented with Gated Recurrent Units (GRUs) to learn action rankings. Our model is trained on small problem instances and generalizes to significantly larger instances where traditional planning becomes computationally expensive. Experimental results across standard planning benchmarks demonstrate that our action-ranking approach achieves generalization to significantly larger problems than those used in training.

[LG-27] DHIL-GT: Scalable Graph Transformer with Decoupled Hierarchy Labeling

链接: https://arxiv.org/abs/2412.04738
作者: Ningyi Liao,Zihao Yu,Siqiang Luo
关键词-EN: promising neural network, Graph Transformer, scalable Graph Transformer, Graph, learning graph-structured data
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Graph Transformer (GT) has recently emerged as a promising neural network architecture for learning graph-structured data. However, its global attention mechanism with quadratic complexity concerning the graph scale prevents wider application to large graphs. While current methods attempt to enhance GT scalability by altering model architecture or encoding hierarchical graph data, our analysis reveals that these models still suffer from the computational bottleneck related to graph-scale operations. In this work, we target the GT scalability issue and propose DHIL-GT, a scalable Graph Transformer that simplifies network learning by fully decoupling the graph computation to a separate stage in advance. DHIL-GT effectively retrieves hierarchical information by exploiting the graph labeling technique, as we show that the graph label hierarchy is more informative than plain adjacency by offering global connections while promoting locality, and is particularly suitable for handling complex graph patterns such as heterophily. We further design subgraph sampling and positional encoding schemes for precomputing model input on top of graph labels in an end-to-end manner. The training stage thus favorably removes graph-related computations, leading to ideal mini-batch capability and GPU utilization. Notably, the precomputation and training processes of DHIL-GT achieve complexities linear to the number of graph edges and nodes, respectively. Extensive experiments demonstrate that DHIL-GT is efficient in terms of computational boost and mini-batch capability over existing scalable Graph Transformer designs on large-scale benchmarks, while achieving top-tier effectiveness on both homophilous and heterophilous graphs.

[LG-28] Generative Humanization for Therapeutic Antibodies

链接: https://arxiv.org/abs/2412.04737
作者: Cade Gordon,Aniruddh Raghu,Hunter Elliott,Peyton Greenside
关键词-EN: challenging diseases, employed to address, today most challenging, meet many criteria, development before reaching
类目: Machine Learning (cs.LG); Quantitative Methods (q-bio.QM)
*备注:

点击查看摘要

Abstract:Antibody therapies have been employed to address some of today’s most challenging diseases, but must meet many criteria during drug development before reaching a patient. Humanization is a sequence optimization strategy that addresses one critical risk called immunogenicity - a patient’s immune response to the drug - by making an antibody more “human-like” in the absence of a predictive lab-based test for immunogenicity. However, existing humanization strategies generally yield very few humanized candidates, which may have degraded biophysical properties or decreased drug efficacy. Here, we re-frame humanization as a conditional generative modeling task, where humanizing mutations are sampled from a language model trained on human antibody data. We describe a sampling process that incorporates models of therapeutic attributes, such as antigen binding affinity, to obtain candidate sequences that have both reduced immunogenicity risk and maintained or improved therapeutic properties, allowing this algorithm to be readily embedded into an iterative antibody optimization campaign. We demonstrate in silico and in lab validation that in real therapeutic programs our generative humanization method produces diverse sets of antibodies that are both (1) highly-human and (2) have favorable therapeutic properties, such as improved binding to target antigens.

[LG-29] An Experimental Evaluation of Imputation Models for Spatial-Temporal Traffic Data

链接: https://arxiv.org/abs/2412.04733
作者: Shengnan Guo,Tonglong Wei,Yiheng Huang,Miaomiao Zhao,Ran Chen,Yan Lin,Youfang Lin,Huaiyu Wan
关键词-EN: intelligent transportation systems, enabling advanced transportation, advanced transportation services, critical preprocessing step, Traffic data imputation
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Traffic data imputation is a critical preprocessing step in intelligent transportation systems, enabling advanced transportation services. Despite significant advancements in this field, selecting the most suitable model for practical applications remains challenging due to three key issues: 1) incomplete consideration of the missing patterns that describe how data is lost along the spatial and temporal dimensions, 2) the lack of tests on standardized datasets, and 3) insufficient evaluations. To this end, we first propose practice-oriented taxonomies for missing patterns and imputation models, systematically identifying all possible forms of real-world traffic data loss and analyzing the characteristics of existing models. Furthermore, we introduce a unified benchmarking pipeline to comprehensively evaluate 10 representative models across various missing patterns and rates. This work aims to provide a holistic understanding of traffic data imputation research and serve as a practical guideline.
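The taxonomy of missing patterns in the abstract distinguishes point-wise loss from structured loss along the temporal dimension. A hedged sketch of how such masks are typically simulated on a (time x sensor) data matrix (the rates and block sizes are illustrative, not the benchmark's settings):

```python
import numpy as np

def random_missing(data, rate, rng):
    """Point-wise missing: every entry is dropped independently."""
    mask = rng.random(data.shape) < rate
    out = data.copy()
    out[mask] = np.nan
    return out

def temporal_block_missing(data, rate, block, rng):
    """Structured missing: contiguous time blocks dropped per sensor,
    mimicking sensor outages rather than random packet loss."""
    T, _ = data.shape
    out = data.copy()
    n_blocks = max(1, int(rate * T / block))
    for j in range(out.shape[1]):
        for _ in range(n_blocks):
            t0 = int(rng.integers(0, T - block + 1))
            out[t0:t0 + block, j] = np.nan
    return out
```

Spatial-block variants (dropping whole sensors over an interval) follow the same shape; a benchmark pipeline like the one described would sweep such generators over several rates before scoring each imputation model.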

[LG-30] Learning for Layered Safety-Critical Control with Predictive Control Barrier Functions

链接: https://arxiv.org/abs/2412.04658
作者: William D. Compton,Max H. Cohen,Aaron D. Ames
关键词-EN: Reduced order Model, Full order Model, control barrier functions, order Model, enforcing safe behavior
类目: ystems and Control (eess.SY); Machine Learning (cs.LG); Robotics (cs.RO)
*备注: Submitted for review to L4DC 2025

点击查看摘要

Abstract:Safety filters leveraging control barrier functions (CBFs) are highly effective for enforcing safe behavior on complex systems. It is often easier to synthesize CBFs for a Reduced order Model (RoM), and track the resulting safe behavior on the Full order Model (FoM) – yet gaps between the RoM and FoM can result in safety violations. This paper introduces \emphpredictive CBFs to address this gap by leveraging rollouts of the FoM to define a predictive robustness term added to the RoM CBF condition. Theoretically, we prove that this guarantees safety in a layered control implementation. Practically, we learn the predictive robustness term through massive parallel simulation with domain randomization. We demonstrate in simulation that this yields safe FoM behavior with minimal conservatism, and experimentally realize predictive CBFs on a 3D hopping robot.
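For readers unfamiliar with CBF safety filters, the nominal single-input case admits a closed-form solution of the underlying quadratic program. The sketch below shows that baseline; the paper's learned predictive robustness term would tighten the constraint, and the `sigma` argument here is a hypothetical placeholder for it, not the paper's construction:

```python
def cbf_filter(u_des, h, Lfh, Lgh, alpha=1.0, sigma=0.0):
    """Closed-form scalar CBF-QP:
        min (u - u_des)^2  s.t.  Lfh + Lgh * u - sigma >= -alpha * h
    where h is the barrier value, Lfh/Lgh are its Lie derivatives along the
    drift and input directions, and sigma is a robustness margin
    (sigma = 0 recovers the plain reduced-order-model filter).
    """
    if Lgh == 0.0:
        return u_des  # the constraint does not depend on u
    bound = (-alpha * h - Lfh + sigma) / Lgh
    return max(u_des, bound) if Lgh > 0 else min(u_des, bound)
```

When the desired input already satisfies the barrier condition, it passes through unchanged; otherwise the filter applies the minimal correction, and a positive margin makes that correction more conservative.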

[LG-31] An Efficient Model Maintenance Approach for MLOps

链接: https://arxiv.org/abs/2412.04657
作者: Forough Majidi,Foutse Khomh,Heng Li,Amin Nikanjam
关键词-EN: machine learning models, utilized machine learning, machine learning, Model Reuse, models
类目: oftware Engineering (cs.SE); Machine Learning (cs.LG)
*备注: 34 Pages, 25 Figures, 12 Tables, 1 Algorithm, Submitted to a journal

点击查看摘要

Abstract:In recent years, many industries have utilized machine learning models (ML) in their systems. Ideally, machine learning models should be trained on and applied to data from the same distributions. However, the data evolves over time in many application areas, leading to data and concept drift, which in turn causes the performance of the ML models to degrade over time. Therefore, maintaining up to date ML models plays a critical role in the MLOps pipeline. Existing ML model maintenance approaches are often computationally resource intensive, costly, time consuming, and model dependent. Thus, we propose an improved MLOps pipeline, a new model maintenance approach and a Similarity Based Model Reuse (SimReuse) tool to address the challenges of ML model maintenance. We identify seasonal and recurrent distribution patterns in time series datasets throughout a preliminary study. Recurrent distribution patterns enable us to reuse previously trained models for similar distributions in the future, thus avoiding frequent retraining. Then, we integrated the model reuse approach into the MLOps pipeline and proposed our improved MLOps pipeline. Furthermore, we develop SimReuse, a tool to implement the new components of our MLOps pipeline to store models and reuse them for inference of data segments with similar data distributions in the future. Our evaluation results on four time series datasets demonstrate that our model reuse approach can maintain the performance of models while significantly reducing maintenance time and costs. Our model reuse approach achieves ML performance comparable to the best baseline, while being 15 times more efficient in terms of computation time and costs. Therefore, industries and practitioners can benefit from our approach and use our tool to maintain the performance of their ML models in the deployment phase to reduce their maintenance costs.
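The abstract does not specify SimReuse's similarity measure; as one illustration of distribution-based model reuse, a two-sample Kolmogorov-Smirnov statistic works as a drop-in check (the threshold below is a made-up hyperparameter, and the model keys are invented):

```python
import numpy as np

def ks_statistic(a, b):
    """Two-sample Kolmogorov-Smirnov statistic: max gap between the ECDFs."""
    a, b = np.sort(a), np.sort(b)
    grid = np.concatenate([a, b])
    cdf_a = np.searchsorted(a, grid, side="right") / len(a)
    cdf_b = np.searchsorted(b, grid, side="right") / len(b)
    return float(np.max(np.abs(cdf_a - cdf_b)))

def select_model(segment, stored, threshold=0.1):
    """Reuse the stored model whose training segment is most similar to the
    incoming data segment; return None to signal that retraining is needed."""
    best_key, best_d = None, np.inf
    for key, ref in stored.items():
        d = ks_statistic(segment, ref)
        if d < best_d:
            best_key, best_d = key, d
    return best_key if best_d < threshold else None
```

With seasonal or recurrent distributions, most incoming segments match a stored model and skip retraining entirely, which is the source of the maintenance savings the abstract reports.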

[LG-32] One Communication Round is All It Needs for Federated Fine-Tuning Foundation Models

链接: https://arxiv.org/abs/2412.04650
作者: Ziyao Wang,Bowei Tian,Yexiao He,Zheyu Shen,Luyang Liu,Ang Li
关键词-EN: federated fine-tuning, federated, fine-tuning, one-shot federated fine-tuning, recent advancement
类目: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC)
*备注:

点击查看摘要

Abstract:The recent advancement of large foundation models (FMs) has increased the demand for fine-tuning these models on large-scale and cross-domain datasets. To address this, federated fine-tuning has emerged as a solution, allowing models to be fine-tuned on distributed datasets across multiple devices while ensuring data privacy. However, the substantial parameter size of FMs and the multi-round communication required by traditional federated fine-tuning algorithms result in prohibitively high communication costs, challenging the practicality of federated fine-tuning. In this paper, we are the first to reveal, both theoretically and empirically, that the traditional multi-round aggregation algorithms may not be necessary for federated fine-tuning large FMs. Our experiments reveal that a single round of communication (i.e., one-shot federated fine-tuning) yields a global model performance comparable to that achieved through multiple rounds of communication. Through rigorous mathematical and empirical analyses, we demonstrate that large FMs, due to their extensive parameter sizes and pre-training on general tasks, achieve significantly lower training loss in one-shot federated fine-tuning compared to smaller models. Our extensive experiments show that one-shot federated fine-tuning not only reduces communication costs but also enables asynchronous aggregation, enhances privacy, and maintains performance consistency with multi-round federated fine-tuning for models larger than 1 billion parameters, on text generation and text-to-image generation tasks. Our findings have the potential to revolutionize federated fine-tuning in practice, enhancing efficiency, reducing costs, and expanding accessibility for large-scale models. This breakthrough paves the way for broader adoption and application of federated fine-tuning across various domains.
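One-shot federated fine-tuning reduces, in its simplest form, to a single size-weighted parameter average after each client fine-tunes locally. A minimal numpy sketch (the parameter dictionaries stand in for real model state; the sizes are invented):

```python
import numpy as np

def one_shot_fedavg(client_params, client_sizes):
    """Aggregate locally fine-tuned parameters in a single communication
    round, weighting each client by its local dataset size."""
    total = float(sum(client_sizes))
    return {
        name: sum((n / total) * p[name] for p, n in zip(client_params, client_sizes))
        for name in client_params[0]
    }

clients = [{"w": np.array([1.0, 2.0])}, {"w": np.array([3.0, 4.0])}]
global_model = one_shot_fedavg(clients, client_sizes=[100, 300])
```

Because there is only one upload per client, aggregation can also happen asynchronously as updates arrive, which is the privacy and efficiency angle the abstract emphasizes.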

[LG-33] Mixed Delay/Nondelay Embeddings Based Neuromorphic Computing with Patterned Nanomagnet Arrays

链接: https://arxiv.org/abs/2412.04622
作者: Changpeng Ti,Usman Hassan,Sairam Sri Vatsavai,Margaret McCarter,Aastha Vasdev,Jincheng An,Barat Achinuq,Ulrich Welp,Sen-Ching Cheung,Ishan G Thakkar,J. Todd Hastings
关键词-EN: Patterned nanomagnet arrays, PNA reservoir systems, PNA reservoir, frustrated dipole interaction, Patterned nanomagnet
类目: Emerging Technologies (cs.ET); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Patterned nanomagnet arrays (PNAs) have been shown to exhibit a strong geometrically frustrated dipole interaction. Some PNAs have also shown emergent domain wall dynamics. Previous works have demonstrated methods to physically probe these magnetization dynamics of PNAs to realize neuromorphic reservoir systems that exhibit chaotic dynamical behavior and high-dimensional nonlinearity. These PNA reservoir systems from prior works leverage echo state properties and linear/nonlinear short-term memory of component reservoir nodes to map and preserve the dynamical information of the input time-series data into nondelay spatial embeddings. Such mappings enable these PNA reservoir systems to imitate and predict/forecast the input time series data. However, these prior PNA reservoir systems are based solely on the nondelay spatial embeddings obtained at component reservoir nodes. As a result, they require a massive number of component reservoir nodes, or a very large spatial embedding (i.e., high-dimensional spatial embedding) per reservoir node, or both, to achieve acceptable imitation and prediction accuracy. These requirements reduce the practical feasibility of such PNA reservoir systems. To address this shortcoming, we present a mixed delay/nondelay embeddings-based PNA reservoir system. Our system uses a single PNA reservoir node with the ability to obtain a mixture of delay/nondelay embeddings of the dynamical information of the time-series data applied at the input of a single PNA reservoir node. Our analysis shows that when these mixed delay/nondelay embeddings are used to train a perceptron at the output layer, our reservoir system outperforms existing PNA-based reservoir systems for the imitation of NARMA 2, NARMA 5, NARMA 7, and NARMA 10 time series data, and for the short-term and long-term prediction of the Mackey Glass time series data.
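The "delay embeddings" the abstract contrasts with nondelay spatial embeddings are the standard Takens-style construction: stacking time-shifted copies of a readout signal into a state vector. A generic sketch (the embedding dimension and lag are illustrative, not the paper's choices):

```python
import numpy as np

def delay_embedding(x, dim, tau):
    """Map a scalar time series to dim-dimensional delay vectors
    [x_t, x_{t+tau}, ..., x_{t+(dim-1)*tau}]."""
    n = len(x) - (dim - 1) * tau
    return np.stack([x[i * tau:i * tau + n] for i in range(dim)], axis=1)
```

In a mixed delay/nondelay readout, the instantaneous reservoir observation supplies one feature column and its delayed copies supply the rest, so a single physical node can feed a perceptron with a multi-dimensional state, reducing the number of nodes required.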

[LG-34] Exploring Transformer-Based Music Overpainting for Jazz Piano Variations

链接: https://arxiv.org/abs/2412.04610
作者: Eleanor Row,Ivan Shanin,György Fazekas
关键词-EN: paper explores transformer-based, music overpainting, explores transformer-based models, paper explores, explores transformer-based
类目: ound (cs.SD); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
*备注: Accepted and presented as a Late-Breaking Demo at the 25th International Society for Music Information Retrieval (ISMIR) in San Francisco, US, 2024

点击查看摘要

Abstract:This paper explores transformer-based models for music overpainting, focusing on jazz piano variations. Music overpainting generates new variations while preserving the melodic and harmonic structure of the input. Existing approaches are limited by small datasets, restricting scalability and diversity. We introduce VAR4000, a subset of a larger dataset for jazz piano performances, consisting of 4,352 training pairs. Using a semi-automatic pipeline, we evaluate two transformer configurations on VAR4000, comparing their performance with the smaller JAZZVAR dataset. Preliminary results show promising improvements in generalisation and performance with the larger dataset configuration, highlighting the potential of transformer models to scale effectively for music overpainting on larger and more diverse datasets.

[LG-35] Nonlinear Operator Learning Using Energy Minimization and MLPs

链接: https://arxiv.org/abs/2412.04596
作者: Mats G. Larson,Carl Lundholm,Anna Persson
关键词-EN: partial differential equations, nonlinear problems governed, differential equations, solution operator, develop and evaluate
类目: Machine Learning (cs.LG); Numerical Analysis (math.NA)
*备注: 13 pages, 3 figures (8 subfigures in total)

点击查看摘要

Abstract:We develop and evaluate a method for learning solution operators to nonlinear problems governed by partial differential equations. The approach is based on a finite element discretization and aims at representing the solution operator by an MLP that takes latent variables as input. The latent variables will typically correspond to parameters in a parametrization of input data such as boundary conditions, coefficients, and right-hand sides. The loss function is most often an energy functional and we formulate efficient parallelizable training algorithms based on assembling the energy locally on each element. For large problems, the learning process can be made more efficient by using only a small fraction of randomly chosen elements in the mesh in each iteration. The approach is evaluated on several relevant test cases, where learning the solution operator turns out to be beneficial compared to classical numerical methods.

[LG-36] Loss Terms and Operator Forms of Koopman Autoencoders

链接: https://arxiv.org/abs/2412.04578
作者: Dustin Enyeart,Guang Lin
关键词-EN: Koopman autoencoders, prevalent architecture, operator learning, Koopman, Abstract
类目: Machine Learning (cs.LG); Computational Physics (physics.comp-ph)
*备注:

点击查看摘要

Abstract:Koopman autoencoders are a prevalent architecture in operator learning, but the loss functions and the form of the operator vary significantly in the literature. This paper presents a fair and systematic study of these options. Furthermore, it introduces novel loss terms.
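The loss terms whose variation the paper studies can be written down concretely. The sketch below uses random linear maps purely as placeholders for the learned encoder/decoder and Koopman operator, and shows the three terms most commonly combined in this literature; it is not the paper's specific formulation:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy linear stand-ins: encoder phi, decoder psi, latent Koopman matrix K.
phi = rng.normal(size=(4, 3))   # encoder: state (3,) -> latent (4,)
psi = np.linalg.pinv(phi)       # decoder: latent (4,) -> state (3,)
K = rng.normal(size=(4, 4))     # linear operator advancing the latent state

def koopman_losses(x_t, x_next):
    """Sum of the three common Koopman-autoencoder loss terms."""
    z_t, z_next = phi @ x_t, phi @ x_next
    recon = np.sum((psi @ z_t - x_t) ** 2)           # autoencoding loss
    linear = np.sum((K @ z_t - z_next) ** 2)         # latent linearity loss
    pred = np.sum((psi @ (K @ z_t) - x_next) ** 2)   # state prediction loss
    return recon + linear + pred
```

Variants in the literature differ in how many prediction steps the last two terms roll out, how the terms are weighted, and whether K is unconstrained, parameterized, or computed by least squares per batch.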

[LG-37] Data-Driven Parameterized Reduced-order Models for Predicting Distortion in Metal 3D Printing NEURIPS

链接: https://arxiv.org/abs/2412.04577
作者: Indu Kant Deo,Youngsoo Choi,Saad A. Khairallah,Alexandre Reikher,Maria Strantza
关键词-EN: Powder Bed Fusion, Laser Powder Bed, Bed Fusion, applied laser energy, laser energy produces
类目: Machine Learning (cs.LG); Computational Engineering, Finance, and Science (cs.CE)
*备注: 7 pages, 4 figures, NeurIPS Machine Learning for Physical Sciences workshop

点击查看摘要

Abstract:In Laser Powder Bed Fusion (LPBF), the applied laser energy produces high thermal gradients that lead to unacceptable final part distortion. Accurate distortion prediction is essential for optimizing the 3D printing process and manufacturing a part that meets geometric accuracy requirements. This study introduces data-driven parameterized reduced-order models (ROMs) to predict distortion in LPBF across various machine process settings. We propose a ROM framework that combines Proper Orthogonal Decomposition (POD) with Gaussian Process Regression (GPR) and compare its performance against a deep-learning based parameterized graph convolutional autoencoder (GCA). The POD-GPR model demonstrates high accuracy, predicting distortions within \pm0.001mm , and delivers a computational speed-up of approximately 1800x.
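The POD-GPR pipeline in the abstract can be sketched in a few lines: an SVD of the snapshot matrix gives the reduced basis, and a small Gaussian process regresses each modal coefficient against the process parameters. This is a generic illustration, not the authors' implementation; the kernel and hyperparameters are made up:

```python
import numpy as np

def pod_basis(snapshots, r):
    """Proper Orthogonal Decomposition: leading r left singular vectors
    of the (dof x n_snapshots) snapshot matrix."""
    U, _, _ = np.linalg.svd(snapshots, full_matrices=False)
    return U[:, :r]

def gp_predict(X_train, y_train, X_test, length=1.0, noise=1e-6):
    """Minimal GP regression with an RBF kernel, applied to one POD
    coefficient at a time (X rows are process-parameter vectors)."""
    def k(A, B):
        d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
        return np.exp(-0.5 * d2 / length**2)
    K = k(X_train, X_train) + noise * np.eye(len(X_train))
    return k(X_test, X_train) @ np.linalg.solve(K, y_train)
```

At prediction time, the regressed coefficients are multiplied back through the POD basis to reconstruct the full distortion field, which is why such ROMs can deliver the large speed-ups the abstract reports over full simulations.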

[LG-38] Solving High-dimensional Inverse Problems Using Amortized Likelihood-free Inference with Noisy and Incomplete Data

链接: https://arxiv.org/abs/2412.04565
作者: Jice Zeng,Yuanzhe Wang,Alexandre M. Tartakovsky,David Barajas-Solano
关键词-EN: likelihood-free probabilistic inversion, high-dimensional inverse problems, posterior distribution, likelihood-free probabilistic, high-dimensional inverse
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We present a novel likelihood-free probabilistic inversion method based on normalizing flows for high-dimensional inverse problems. The proposed method is comprised of two complementary networks: a summary network for data compression, and an inference network for parameter estimation. The summary network encodes raw observations into a fixed-size vector of summary statistics, while the inference network generates samples of the approximate posterior distribution of the model parameters based on these summary statistics. The posterior samples are produced in a deep generative fashion by sampling from a latent Gaussian distribution and passing these samples through an invertible transformation. We construct this invertible transformation by sequentially alternating conditional invertible neural network (cINN) and conditional neural spline flow (cNSF) layers. The summary and inference networks are trained simultaneously. We apply the proposed method to an inversion problem in groundwater hydrology to estimate the posterior distribution of the system’s log-conductivity field conditioned on spatially sparse time-series observations of the system’s hydraulic head responses. The conductivity field is represented with 706 degrees of freedom in the considered problem. The comparison with the likelihood-based iterative ensemble smoother PEST-IES method demonstrates that the proposed method accurately estimates the parameter posterior distribution and the observations’ predictive posterior distribution at a fraction of the inference time of PEST-IES.

[LG-39] Communication Compression for Distributed Learning without Control Variates

链接: https://arxiv.org/abs/2412.04538
作者: Tomas Ortega,Chun-Yin Huang,Xiaoxiao Li,Hamid Jafarkhani
关键词-EN: employed in Federated, Federated Learning, reduce the cost, Compressed Gradient Descent, Distributed Gradient Descent
类目: Machine Learning (cs.LG); Signal Processing (eess.SP); Optimization and Control (math.OC)
*备注:

点击查看摘要

Abstract:Distributed learning algorithms, such as the ones employed in Federated Learning (FL), require communication compression to reduce the cost of client uploads. The compression methods used in practice are often biased, which require error feedback to achieve convergence when the compression is aggressive. In turn, error feedback requires client-specific control variates, which directly contradicts privacy-preserving principles and requires stateful clients. In this paper, we propose Compressed Aggregate Feedback (CAFe), a novel distributed learning framework that allows highly compressible client updates by exploiting past aggregated updates, and does not require control variates. We consider Distributed Gradient Descent (DGD) as a representative algorithm and provide a theoretical proof of CAFe’s superiority to Distributed Compressed Gradient Descent (DCGD) with biased compression in the non-smooth regime with bounded gradient dissimilarity. Experimental results confirm that CAFe consistently outperforms distributed learning with direct compression and highlight the compressibility of the client updates with CAFe.
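CAFe's central idea, compressing the deviation of the client update from the past aggregated update rather than the update itself, can be illustrated with simple top-k sparsification. The paper's actual compressor and algorithmic details may differ; this is only a sketch of the idea:

```python
import numpy as np

def topk_compress(v, k):
    """Biased top-k sparsification: keep only the k largest-magnitude entries."""
    idx = np.argsort(np.abs(v))[-k:]
    out = np.zeros_like(v)
    out[idx] = v[idx]
    return out

def cafe_client_update(local_grad, past_aggregate, k):
    """Compress the residual from the past aggregated update: when clients
    drift slowly, the residual is small and therefore more compressible
    than the raw gradient, and no per-client control variate is needed."""
    delta = local_grad - past_aggregate
    return past_aggregate + topk_compress(delta, k)
```

Because the reference point (the past aggregate) is already known to the server, the client only transmits the sparse residual, keeping clients stateless in line with the privacy-preserving motivation of the paper.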

[LG-40] Leveraging Multimodal Protein Representations to Predict Protein Melting Temperatures

链接: https://arxiv.org/abs/2412.04526
作者: Daiheng Zhang,Yan Zeng,Xinyu Hong,Jinbo Xu
关键词-EN: Accurately predicting protein, Accurately predicting, guiding protein engineering, assessing protein stability, fundamental for assessing
类目: Machine Learning (cs.LG); Computational Engineering, Finance, and Science (cs.CE)
*备注:

点击查看摘要

Abstract:Accurately predicting protein melting temperature changes (Delta Tm) is fundamental for assessing protein stability and guiding protein engineering. Leveraging multi-modal protein representations has shown great promise in capturing the complex relationships among protein sequences, structures, and functions. In this study, we develop models based on powerful protein language models, including ESM-2, ESM-3, SaProt, and AlphaFold, using various feature extraction methods to enhance prediction accuracy. By utilizing the ESM-3 model, we achieve a new state-of-the-art performance on the s571 test dataset, obtaining a Pearson correlation coefficient (PCC) of 0.50. Furthermore, we conduct a fair evaluation to compare the performance of different protein language models in the Delta Tm prediction task. Our results demonstrate that integrating multi-modal protein representations could advance the prediction of protein melting temperatures.

[LG-41] Labeling questions inside issue trackers

链接: https://arxiv.org/abs/2412.04523
作者: Aidin Rasti
关键词-EN: popular open source, open source software, newly reported issues, popular open, open source
类目: oftware Engineering (cs.SE); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:One of the issues faced by the maintainers of popular open source software is the triage of newly reported issues. Many of the issues submitted to issue trackers are questions: many people ask about their problem on the issue tracker instead of using a proper Q&A website like StackOverflow. This may seem insignificant, but for many big projects with thousands of users it leads to spamming of the issue tracker. Reading and labeling these unrelated issues manually is a seriously time-consuming task, and these unrelated questions add to the burden. In fact, maintainers most often ask that questions not be submitted to the issue tracker. To address this problem, we first leveraged dozens of patterns to clean the text of issues, removing noise such as logs, stack traces, environment variables, and error messages. Second, we implemented a classification-based approach to automatically label unrelated questions. Empirical evaluations on a dataset of more than 102,000 records show that our approach can label questions with an accuracy of over 81%.
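The abstract reports an 81% classifier without naming the model. As a hedged illustration of the overall approach, clean the text and then train a text classifier to flag questions, here is a minimal multinomial Naive Bayes with Laplace smoothing; the toy issues and labels below are invented:

```python
import math
import re
from collections import Counter, defaultdict

def tokenize(text):
    return re.findall(r"[a-z']+", text.lower())

class NaiveBayes:
    """Multinomial Naive Bayes for question-vs-bug issue labeling."""

    def fit(self, texts, labels):
        self.counts = defaultdict(Counter)   # per-class word counts
        self.priors = Counter(labels)        # per-class document counts
        for t, y in zip(texts, labels):
            self.counts[y].update(tokenize(t))
        self.vocab = {w for c in self.counts.values() for w in c}
        return self

    def predict(self, text):
        n_docs = sum(self.priors.values())
        scores = {}
        for y, prior in self.priors.items():
            total = sum(self.counts[y].values())
            s = math.log(prior / n_docs)
            for w in tokenize(text):
                # Laplace smoothing keeps unseen words from zeroing the score
                s += math.log((self.counts[y][w] + 1) / (total + len(self.vocab)))
            scores[y] = s
        return max(scores, key=scores.get)
```

A production triage bot would first strip logs and stack traces (the paper's pattern-based cleaning step) before tokenizing, since that noise otherwise dominates the vocabulary.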

[LG-42] FedDW: Distilling Weights through Consistency Optimization in Heterogeneous Federated Learning

链接: https://arxiv.org/abs/2412.04521
作者: Jiayu Liu,Yong Wang,Nianbin Wang,Jing Yang,Xiaohui Tao
关键词-EN: distributed machine learning, machine learning paradigm, Federated Learning, innovative distributed machine, enables neural network
类目: Machine Learning (cs.LG); Computational Engineering, Finance, and Science (cs.CE)
*备注:

点击查看摘要

Abstract:Federated Learning (FL) is an innovative distributed machine learning paradigm that enables neural network training across devices without centralizing data. While this addresses issues of information sharing and data privacy, challenges arise from data heterogeneity across clients and increasing network scale, leading to impacts on model performance and training efficiency. Previous research shows that in IID environments, the parameter structure of the model is expected to adhere to certain specific consistency principles. Thus, identifying and regularizing these consistencies can mitigate issues from heterogeneous data. We found that both soft labels derived from knowledge distillation and the classifier head parameter matrix, when multiplied by their own transpose, capture the intrinsic relationships between data classes. These shared relationships suggest inherent consistency. Therefore, the work in this paper identifies the consistency between the two and leverages it to regulate training, underpinning our proposed FedDW framework. Experimental results show FedDW outperforms 10 state-of-the-art FL methods, improving accuracy by an average of 3% in highly heterogeneous settings. Additionally, we provide a theoretical proof that FedDW offers higher efficiency, with the additional computational load from backpropagation being negligible. The code is available at this https URL.
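The consistency FedDW exploits, that the classifier head multiplied by its own transpose encodes relationships between classes, can be made concrete. The sketch below is a generic reading of the abstract, with a made-up row normalization and penalty, not the paper's exact regularizer:

```python
import numpy as np

def class_relation_from_head(W):
    """Class-relationship matrix from the classifier head: W @ W.T,
    row-normalized so each class's relation vector has unit length."""
    G = W @ W.T
    return G / np.linalg.norm(G, axis=1, keepdims=True)

def consistency_penalty(soft_label_relations, W):
    """Penalize disagreement between the class relations implied by
    distilled soft labels and those implied by the classifier head."""
    return float(np.mean((class_relation_from_head(W) - soft_label_relations) ** 2))
```

Classes whose head rows point in similar directions get a large off-diagonal entry, mirroring the inter-class similarity that soft labels capture; regularizing the two toward each other is the mechanism the paper uses to counteract client heterogeneity.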

[LG-43] Modeling Eye Gaze Velocity Trajectories using GANs with Spectral Loss for Enhanced Fidelity

链接: https://arxiv.org/abs/2412.04184
作者: Shailendra Bhandari,Pedro Lencastre,Rujeena Mathema,Alexander Szorkovszky,Anis Yazidi,Pedro Lind
关键词-EN: eye gaze, Accurate modeling, eye gaze dynamics, synthetic eye gaze, eye gaze velocity
类目: Neural and Evolutionary Computing (cs.NE); Machine Learning (cs.LG)
*备注: 16

点击查看摘要

Abstract:Accurate modeling of eye gaze dynamics is essential for advancement in human-computer interaction, neurological diagnostics, and cognitive research. Traditional generative models like Markov models often fail to capture the complex temporal dependencies and distributional nuances inherent in eye gaze trajectory data. This study introduces a GAN framework employing LSTM and CNN generators and discriminators to generate high-fidelity synthetic eye gaze velocity trajectories. We conducted a comprehensive evaluation of four GAN architectures: CNN-CNN, LSTM-CNN, CNN-LSTM, and LSTM-LSTM, trained under two conditions: using only adversarial loss and using a weighted combination of adversarial and spectral losses. Our findings reveal that the LSTM-CNN architecture trained with this new loss function exhibits the closest alignment to the real data distribution, effectively capturing both the distribution tails and the intricate temporal dependencies. The inclusion of spectral regularization significantly enhances the GAN's ability to replicate the spectral characteristics of eye gaze movements, leading to a more stable learning process and improved data fidelity. Comparative analysis with an HMM optimized to four hidden states further highlights the advantages of the LSTM-CNN GAN. Statistical metrics show that the HMM-generated data significantly diverges from the real data in terms of mean, standard deviation, skewness, and kurtosis. In contrast, the LSTM-CNN model closely matches the real data across these statistics, affirming its capacity to model the complexity of eye gaze dynamics effectively. These results position the spectrally regularized LSTM-CNN GAN as a robust tool for generating synthetic eye gaze velocity data with high fidelity.
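The spectral loss added to the adversarial objective is not written out in the abstract; a common form penalizes the gap between the mean amplitude spectra of real and generated batches. A hedged numpy sketch of that form (the paper's exact spectral term and weighting may differ):

```python
import numpy as np

def spectral_loss(real, fake):
    """L1 distance between the mean amplitude spectra of real and
    generated batches of 1D trajectories (batch x time)."""
    spec_r = np.abs(np.fft.rfft(real, axis=-1)).mean(axis=0)
    spec_f = np.abs(np.fft.rfft(fake, axis=-1)).mean(axis=0)
    return float(np.mean(np.abs(spec_r - spec_f)))
```

During training this term would be weighted against the adversarial loss, pushing the generator to reproduce the frequency content of real gaze velocities in addition to fooling the discriminator.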

[LG-44] Modeling stochastic eye tracking data: A comparison of quantum generative adversarial networks and Markov models

链接: https://arxiv.org/abs/2408.00673
作者: Shailendra Bhandari,Pedro Lincastre,Pedro Lind
关键词-EN: generative adversarial networks, eye movement velocity, modeling eye movement, adversarial networks QGANs, movement velocity data
类目: Neural and Evolutionary Computing (cs.NE); Machine Learning (cs.LG); Quantum Physics (quant-ph)
*备注: 8 pages

点击查看摘要

Abstract:We explore the use of quantum generative adversarial networks (QGANs) for modeling eye movement velocity data. We assess whether the advanced computational capabilities of QGANs can enhance the modeling of complex stochastic distributions beyond traditional mathematical models, particularly the Markov model. The findings indicate that while QGANs demonstrate potential in approximating complex distributions, the Markov model consistently outperforms them in accurately replicating the real data distribution. This comparison underlines the challenges and avenues for refinement in time series data generation using quantum computing techniques. It emphasizes the need for further optimization of quantum models to better align with real-world data characteristics.
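The Markov baseline that the paper reports as the stronger model can be sketched in a few lines: discretize the velocity signal into states, estimate a first-order transition matrix, and sample synthetic sequences from it (the smoothing choice and function names below are our own, not the paper's):

```python
import numpy as np

def fit_markov(states: np.ndarray, n_states: int) -> np.ndarray:
    """Estimate a first-order transition matrix from a discretized
    state sequence, with add-one smoothing to avoid zero rows."""
    counts = np.ones((n_states, n_states))  # Laplace smoothing
    for a, b in zip(states[:-1], states[1:]):
        counts[a, b] += 1
    return counts / counts.sum(axis=1, keepdims=True)

def sample_markov(P: np.ndarray, start: int, length: int, seed: int = 0):
    """Generate a synthetic state sequence from transition matrix P."""
    rng = np.random.default_rng(seed)
    seq = [start]
    for _ in range(length - 1):
        seq.append(rng.choice(len(P), p=P[seq[-1]]))
    return seq
```

Comparing the empirical distribution of such sampled sequences against real data is the kind of evaluation the paper performs against the QGAN.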

[LG-45] Scalable Bayesian Optimization with Sparse Gaussian Process Models

链接: https://arxiv.org/abs/2010.13301
作者: Ang Yang
关键词-EN: handling massive data, focuses on Bayesian, Bayesian optimization, massive data, optimization convergence
类目: Machine Learning (cs.LG); Information Retrieval (cs.IR); Machine Learning (stat.ML)
*备注: Thesis

点击查看摘要

Abstract:This thesis focuses on Bayesian optimization, with improvements coming from two aspects: (i) the use of derivative information to accelerate optimization convergence; and (ii) the use of scalable GPs for handling massive data.

[LG-46] Global Optimization with A Power-Transformed Objective and Gaussian Smoothing

链接: https://arxiv.org/abs/2412.05204
作者: Chen Xu
关键词-EN: differentiable objective function, not-necessarily differentiable objective, optimize the Gaussian-smoothed, solves global optimization, global optimization problems
类目: Optimization and Control (math.OC); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We propose a novel method that solves global optimization problems in two steps: (1) apply an (exponential) power-N transformation to the not-necessarily-differentiable objective function f to obtain f_N, and (2) optimize the Gaussian-smoothed f_N with stochastic approximations. Under mild conditions on f, for any \delta > 0, we prove that with a sufficiently large power N_\delta, this method converges to a solution in the \delta-neighborhood of f's global maximum point. The convergence rate is O(d^2 \sigma^4 \varepsilon^{-2}), which is faster than both the standard and single-loop homotopy methods. Extensive experiments show that our method requires significantly fewer iterations than other compared algorithms to produce a high-quality solution.
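The two-step recipe can be sketched with a score-function (REINFORCE-style) gradient estimator for the Gaussian-smoothed power transform. This is our own minimal reading of the abstract, not the paper's algorithm; step size, batch size, and the exact estimator are assumptions:

```python
import numpy as np

def smoothed_power_ascent(f, mu0, N=10.0, sigma=0.5, lr=0.05,
                          iters=2000, batch=64, seed=0):
    """Maximize f via its exponential power transform f_N = exp(N*f),
    smoothed by a Gaussian: ascend E_{x~N(mu, sigma^2 I)}[f_N(x)]
    using a self-normalized score-function gradient estimate."""
    rng = np.random.default_rng(seed)
    mu = np.asarray(mu0, dtype=float)
    for _ in range(iters):
        x = mu + sigma * rng.normal(size=(batch, mu.size))
        logw = N * np.array([f(xi) for xi in x])
        w = np.exp(logw - logw.max())   # subtract max for stability
        w = w / w.sum()                 # self-normalized weights
        grad = (w[:, None] * (x - mu)).sum(axis=0) / sigma**2
        mu = mu + lr * grad
    return mu
```

The exponential transform sharpens the objective around its global maximum, so the smoothed surrogate concentrates its gradient signal there even when f itself is non-differentiable.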

[LG-47] The Polynomial Stein Discrepancy for Assessing Moment Convergence

链接: https://arxiv.org/abs/2412.05135
作者: Narayan Srinivasan,Matthew Sutton,Christopher Drovandi,Leah F South
关键词-EN: desired posterior distribution, Bayesian sampling algorithms, Bayesian inference, scalable Bayesian sampling, desired posterior
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Computation (stat.CO)
*备注: 17 Pages, 14 Figs

点击查看摘要

Abstract:We propose a novel method for measuring the discrepancy between a set of samples and a desired posterior distribution for Bayesian inference. Classical methods for assessing sample quality like the effective sample size are not appropriate for scalable Bayesian sampling algorithms, such as stochastic gradient Langevin dynamics, that are asymptotically biased. Instead, the gold standard is to use the kernel Stein Discrepancy (KSD), which is itself not scalable given its quadratic cost in the number of samples. The KSD and its faster extensions also typically suffer from the curse-of-dimensionality and can require extensive tuning. To address these limitations, we develop the polynomial Stein discrepancy (PSD) and an associated goodness-of-fit test. While the new test is not fully convergence-determining, we prove that it detects differences in the first r moments in the Bernstein-von Mises limit. We empirically show that the test has higher power than its competitors in several examples, and at a lower computational cost. Finally, we demonstrate that the PSD can assist practitioners to select hyper-parameters of Bayesian sampling algorithms more efficiently than competitors.
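The paper's exact construction is not given in the abstract, but a polynomial-test-function Stein discrepancy for a 1-D target can be sketched as follows (for f_k(x) = x^k, the Langevin-Stein identity gives E_p[k x^{k-1} + x^k s(x)] = 0, where s is the target's score function; deviations of the sample averages from zero signal a moment mismatch up to order r). The aggregation into a single number is our own choice:

```python
import numpy as np

def poly_stein_discrepancy(samples: np.ndarray, score, r: int = 4) -> float:
    """Stein discrepancy with polynomial test functions f_k(x) = x^k,
    k = 1..r, for a 1-D target with score function s(x) = d/dx log p(x).
    Returns the Euclidean norm of the r Stein sample averages."""
    x = np.asarray(samples, dtype=float)
    terms = []
    for k in range(1, r + 1):
        stein_f = k * x ** (k - 1) + (x ** k) * score(x)
        terms.append(stein_f.mean())
    return float(np.sqrt(np.sum(np.square(terms))))
```

For a standard normal target, score(x) = -x, and samples with a shifted mean or wrong variance produce a visibly larger discrepancy than well-matched samples.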

[LG-48] Dirac-Equation Signal Processing: Physics Boosts Topological Machine Learning

链接: https://arxiv.org/abs/2412.05132
作者: Runyue Wang,Yu Tian,Pietro Liò,Ginestra Bianconi
关键词-EN: Topological Machine Learning, topological Dirac equation, Dirac-equation signal processing, signal processing, topological Dirac
类目: Disordered Systems and Neural Networks (cond-mat.dis-nn); Machine Learning (cs.LG); Social and Information Networks (cs.SI); Physics and Society (physics.soc-ph)
*备注: (14 pages, 7 figures)

点击查看摘要

Abstract:Topological signals are variables or features associated with both nodes and edges of a network. Recently, in the context of Topological Machine Learning, great attention has been devoted to signal processing of such topological signals. Most of the previous topological signal processing algorithms treat node and edge signals separately and work under the hypothesis that the true signal is smooth and/or well approximated by a harmonic eigenvector of the Hodge-Laplacian, which may be violated in practice. Here we propose Dirac-equation signal processing, a framework for efficiently reconstructing true signals on nodes and edges, also if they are not smooth or harmonic, by processing them jointly. The proposed physics-inspired algorithm is based on the spectral properties of the topological Dirac operator. It leverages the mathematical structure of the topological Dirac equation to boost the performance of the signal processing algorithm. We discuss how the relativistic dispersion relation obeyed by the topological Dirac equation can be used to assess the quality of the signal reconstruction. Finally, we demonstrate the improved performance of the algorithm with respect to previous algorithms. Specifically, we show that Dirac-equation signal processing can also be used efficiently if the true signal is a non-trivial linear combination of more than one eigenstate of the Dirac equation, as it generally occurs for real signals.
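The central object, the topological Dirac operator, can be built concretely for a small graph: stacking the node-to-edge boundary matrix B1 into an off-diagonal block matrix couples node and edge signals, and squaring it recovers the block-diagonal Hodge Laplacians the abstract contrasts with. A minimal sketch on a triangle graph:

```python
import numpy as np

# Incidence (boundary) matrix B1 of a triangle graph: 3 nodes, 3 edges.
# Each column has -1 at the edge's tail and +1 at its head.
B1 = np.array([[-1,  0, -1],
               [ 1, -1,  0],
               [ 0,  1,  1]], dtype=float)

n_nodes, n_edges = B1.shape
# The topological Dirac operator acts jointly on node and edge signals.
D = np.block([[np.zeros((n_nodes, n_nodes)), B1],
              [B1.T, np.zeros((n_edges, n_edges))]])

# Its square is block-diagonal: the graph Laplacian L0 = B1 @ B1.T on
# nodes and the (down) Hodge Laplacian L1 = B1.T @ B1 on edges.
L0 = B1 @ B1.T
L1 = B1.T @ B1
```

Because D mixes the node and edge blocks while D^2 does not, filtering with D (rather than with L0 and L1 separately) is what lets the method process both signal types jointly.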

[LG-49] Integrating Semantic Communication and Human Decision-Making into an End-to-End Sensing-Decision Framework

链接: https://arxiv.org/abs/2412.05103
作者: Edgar Beck,Hsuan-Yu Lin,Patrick Rückert,Yongping Bao,Bettina von Helversen,Sebastian Fehrler,Kirsten Tracht,Armin Dekorsy
关键词-EN: Weaver defined communication, Weaver defined, semantic communication, HDM, communication
类目: Signal Processing (eess.SP); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:As early as 1949, Weaver defined communication in a very broad sense to include all procedures by which one mind or technical system can influence another, thus establishing the idea of semantic communication. With the recent success of machine learning in expert assistance systems where sensed information is wirelessly provided to a human to assist task execution, the need to design effective and efficient communications has become increasingly apparent. In particular, semantic communication aims to convey the meaning behind the sensed information relevant for Human Decision-Making (HDM). Regarding the interplay between semantic communication and HDM, many questions remain, such as how to model the entire end-to-end sensing-decision-making process, how to design semantic communication for the HDM and which information should be provided to the HDM. To address these questions, we propose to integrate semantic communication and HDM into one probabilistic end-to-end sensing-decision framework that bridges communications and psychology. In our interdisciplinary framework, we model the human through a HDM process, allowing us to explore how feature extraction from semantic communication can best support human decision-making. In this sense, our study provides new insights for the design/interaction of semantic communication with models of HDM. Our initial analysis shows how semantic communication can balance the level of detail with human cognitive capabilities while demanding less bandwidth, power, and latency.

[LG-50] Machine learning approach for mapping the stable orbits around planets

链接: https://arxiv.org/abs/2412.04568
作者: Tiago F. L. L. Pinheiro,Rafael Sfair,Giovana Ramon
关键词-EN: Numerical N-body simulations, utilize Machine Learning, Numerical N-body, N-body simulations, explore stability regions
类目: Earth and Planetary Astrophysics (astro-ph.EP); Instrumentation and Methods for Astrophysics (astro-ph.IM); Machine Learning (cs.LG); Computational Physics (physics.comp-ph)
*备注: 12 pages, 13 figures, Accepted for publication in Astronomy & Astrophysics

点击查看摘要

Abstract:Numerical N-body simulations are commonly used to explore stability regions around exoplanets, offering insights into the possible existence of satellites and ring systems. This study aims to utilize Machine Learning (ML) techniques to generate predictive maps of stable regions surrounding a hypothetical planet. The approach can also be extended to planet-satellite systems, planetary ring systems, and other similar configurations. A dataset was generated using 10^5 numerical simulations, each incorporating nine orbital features for the planet and a test particle in a star-planet-test particle system. The simulations were classified as stable or unstable based on stability criteria, requiring particles to remain stable over a timespan equivalent to 10,000 orbital periods of the planet. Various ML algorithms were tested and fine-tuned through hyperparameter optimization to determine the most effective predictive model. Tree-based algorithms showed comparable accuracy in performance. The best-performing model, using the Extreme Gradient Boosting (XGBoost) algorithm, achieved an accuracy of 98.48%, with 94% recall and precision for stable particles and 99% for unstable particles. ML algorithms significantly reduce the computational time required for three-body simulations, operating approximately 100,000 times faster than traditional numerical methods. Predictive models can generate entire stability maps in less than a second, compared to the days required by numerical simulations. The results from the trained ML models will be made accessible through a public web interface, enabling broader scientific applications.
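The classification setup can be illustrated with a toy stand-in: label synthetic (feature, stability) pairs with an invented rule and fit the simplest member of the tree-based family the paper reports as best-performing, a one-feature decision stump. The features, the labeling rule, and the stump are all our own illustrative choices, not the paper's dataset or its XGBoost model:

```python
import numpy as np

def make_dataset(n=2000, seed=0):
    """Entirely synthetic stand-in for the simulation data: a distance
    ratio and an eccentricity, with an invented stability rule."""
    rng = np.random.default_rng(seed)
    ratio = rng.uniform(0.0, 1.0, size=n)   # distance-scale feature
    ecc = rng.uniform(0.0, 0.3, size=n)     # eccentricity feature
    stable = (ratio < 0.4 - 0.5 * ecc).astype(int)
    return np.column_stack([ratio, ecc]), stable

def fit_stump(X, y):
    """Single-feature threshold classifier: scan candidate thresholds
    per feature and keep the (feature, threshold) with best accuracy."""
    best = (0, 0.0, -1.0)
    for j in range(X.shape[1]):
        for t in np.quantile(X[:, j], np.linspace(0.05, 0.95, 19)):
            acc = max(np.mean((X[:, j] < t) == y),
                      np.mean((X[:, j] >= t) == y))
            if acc > best[2]:
                best = (j, t, acc)
    return best
```

Gradient-boosted trees like XGBoost stack many such weak threshold rules, which is why they can emulate the sharp stability boundaries that make them ~100,000x faster to evaluate than re-running N-body integrations.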

[LG-51] Physics-informed Gaussian Processes as Linear Model Predictive Controller

链接: https://arxiv.org/abs/2412.04502
作者: Jörn Tebbe,Andreas Besginow,Markus Lange-Hegermann
关键词-EN: controlling linear time, linear time invariant, time invariant systems, Gaussian Process, algorithm for controlling
类目: Optimization and Control (math.OC); Machine Learning (cs.LG); Systems and Control (eess.SY)
*备注:

点击查看摘要

Abstract:We introduce a novel algorithm for controlling linear time-invariant systems in a tracking problem. The controller is based on a Gaussian Process (GP) whose realizations satisfy a system of linear ordinary differential equations with constant coefficients. Control inputs for tracking are determined by conditioning the prior GP on the setpoints, i.e. control as inference. The resulting Model Predictive Control scheme incorporates pointwise soft constraints by introducing virtual setpoints into the posterior Gaussian process. We show theoretically that our controller achieves asymptotic stability for the optimal control problem by leveraging general results from Bayesian inference, and demonstrate this result in a numerical example.
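The "control as inference" step reduces to standard GP conditioning: the posterior mean of a GP conditioned on the setpoints is the inferred trajectory through them. A minimal sketch with a generic RBF kernel (the paper instead uses a kernel whose realizations satisfy the system's ODEs, which this sketch omits):

```python
import numpy as np

def rbf(a, b, ell=1.0):
    """Squared-exponential kernel matrix between 1-D input arrays."""
    return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / ell**2)

def gp_condition(t_set, y_set, t_query, noise=1e-6):
    """Posterior mean of a zero-mean GP conditioned on setpoints
    (t_set, y_set), evaluated at t_query: 'control as inference',
    where the conditioned trajectory is the plan through the
    setpoints.  A larger noise value softens the setpoints."""
    K = rbf(t_set, t_set) + noise * np.eye(len(t_set))
    k_star = rbf(t_query, t_set)
    return k_star @ np.linalg.solve(K, y_set)
```

Raising `noise` at a particular setpoint turns it into a soft constraint, which is the mechanism behind the virtual setpoints mentioned in the abstract.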

[LG-52] Advancing Marine Heatwave Forecasts: An Integrated Deep Learning Approach

链接: https://arxiv.org/abs/2412.04475
作者: Ding Ning,Varvara Vetrova,Yun Sing Koh,Karin R. Bryan
关键词-EN: pose significant challenges, intensity increasing due, extreme climate phenomenon, Marine heatwaves, marine ecosystems
类目: Atmospheric and Oceanic Physics (physics.ao-ph); Machine Learning (cs.LG)
*备注: The paper contains 7 pages for the main text, 9 pages including References, and 17 pages including the Appendix. 3 figures

点击查看摘要

Abstract:Marine heatwaves (MHWs), an extreme climate phenomenon, pose significant challenges to marine ecosystems and industries, with their frequency and intensity increasing due to climate change. This study introduces an integrated deep learning approach to forecast short-to-long-term MHWs on a global scale. The approach combines graph representation for modeling spatial properties in climate data, imbalanced regression to handle skewed data distributions, and temporal diffusion to enhance forecast accuracy across various lead times. To the best of our knowledge, this is the first study that synthesizes three spatiotemporal anomaly methodologies to predict MHWs. Additionally, we introduce a method for constructing graphs that avoids isolated nodes and provide a new publicly available sea surface temperature anomaly graph dataset. We examine the trade-offs in the selection of loss functions and evaluation metrics for MHWs. We analyze spatial patterns in global MHW predictability by focusing on historical hotspots, and our approach demonstrates better performance compared to traditional numerical models in regions such as the middle south Pacific, equatorial Atlantic near Africa, south Atlantic, and high-latitude Indian Ocean. We highlight the potential of temporal diffusion to replace the conventional sliding window approach for long-term forecasts, achieving improved prediction up to six months in advance. These insights not only establish benchmarks for machine learning applications in MHW forecasting but also enhance understanding of general climate forecasting methodologies.
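The abstract mentions a graph-construction method that avoids isolated nodes without giving the recipe; one standard way to guarantee this property is a k-nearest-neighbour graph over the spatial grid points, since every node selects k neighbours and therefore has degree at least k. The function and parameter choices below are ours, not the paper's:

```python
import numpy as np

def knn_graph(coords: np.ndarray, k: int = 3):
    """Undirected k-nearest-neighbour edge list over spatial points.
    Every node picks its k closest neighbours, so no node is isolated."""
    # pairwise Euclidean distances
    d = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)   # exclude self-loops
    edges = set()
    for i in range(len(coords)):
        for j in np.argsort(d[i])[:k]:
            edges.add((min(i, int(j)), max(i, int(j))))
    return sorted(edges)
```

On a sea-surface-temperature grid, each node would carry the local anomaly signal and the edges would define the message-passing structure for the graph model.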

信息检索

[IR-0] Diff4Steer: Steerable Diffusion Prior for Generative Music Retrieval with Semantic Guidance NEURIPS2024

链接: https://arxiv.org/abs/2412.04746
作者: Xuchan Bao,Judith Yue Li,Zhong Yi Wan,Kun Su,Timo Denk,Joonseok Lee,Dima Kuzmin,Fei Sha
关键词-EN: Modern music retrieval, capture users’ diverse, Modern music, limiting their ability, user preferences
类目: ound (cs.SD); Information Retrieval (cs.IR); Multimedia (cs.MM); Audio and Speech Processing (eess.AS)
*备注: NeurIPS 2024 Creative AI Track

点击查看摘要

Abstract:Modern music retrieval systems often rely on fixed representations of user preferences, limiting their ability to capture users’ diverse and uncertain retrieval needs. To address this limitation, we introduce Diff4Steer, a novel generative retrieval framework that employs lightweight diffusion models to synthesize diverse seed embeddings from user queries that represent potential directions for music exploration. Unlike deterministic methods that map a user query to a single point in embedding space, Diff4Steer provides a statistical prior on the target modality (audio) for retrieval, effectively capturing the uncertainty and multi-faceted nature of user preferences. Furthermore, Diff4Steer can be steered by image or text inputs, enabling more flexible and controllable music discovery combined with nearest neighbor search. Our framework outperforms deterministic regression methods and an LLM-based generative retrieval baseline in terms of retrieval and ranking metrics, demonstrating its effectiveness in capturing user preferences and leading to more diverse and relevant recommendations. Listening examples are available at this http URL.
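The retrieval pattern (sample several seed embeddings, then nearest-neighbour search for each) can be sketched by replacing the trained diffusion model with simple Gaussian perturbations of the query embedding. This stand-in only illustrates the interface, not Diff4Steer's actual generative model:

```python
import numpy as np

def retrieve_with_seeds(query_emb, catalog, n_seeds=8, noise=0.3, seed=0):
    """Sample n_seeds perturbed 'seed' embeddings around the query (a
    stand-in for diffusion sampling), then return the nearest catalog
    item for each seed by cosine similarity, duplicates removed."""
    rng = np.random.default_rng(seed)
    seeds = query_emb + noise * rng.normal(size=(n_seeds, query_emb.size))
    # cosine similarity between each seed and every catalog embedding
    s = seeds / np.linalg.norm(seeds, axis=1, keepdims=True)
    c = catalog / np.linalg.norm(catalog, axis=1, keepdims=True)
    hits = np.argmax(s @ c.T, axis=1)
    return list(dict.fromkeys(hits.tolist()))  # order-preserving dedupe
```

The point of sampling multiple seeds rather than one query point is exactly the diversity the abstract emphasizes: each seed can land near a different region of the catalog.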

[IR-1] Argumentative Experience: Reducing Confirmation Bias on Controversial Issues through LLM -Generated Multi-Persona Debates

链接: https://arxiv.org/abs/2412.04629
作者: Li Shi,Houjiang Liu,Yian Wong,Utkarsh Mujumdar,Dan Zhang,Jacek Gwizdka,Matthew Lease
关键词-EN: Large language models, Large language, language models, enabling designers, designers to give
类目: Human-Computer Interaction (cs.HC); Computers and Society (cs.CY); Information Retrieval (cs.IR)
*备注:

点击查看摘要

Abstract:Large language models (LLMs) are enabling designers to give life to exciting new user experiences for information access. In this work, we present a system that generates LLM personas to debate a topic of interest from different perspectives. How might information seekers use and benefit from such a system? Can centering information access around diverse viewpoints help to mitigate thorny challenges like confirmation bias, in which information seekers over-trust search results matching existing beliefs? How do potential biases and hallucinations in LLMs play out alongside human users who are also fallible and possibly biased? Our study exposes participants to multiple viewpoints on controversial issues via a mixed-methods, within-subjects study. We use eye-tracking metrics to quantitatively assess cognitive engagement alongside qualitative feedback. Compared to a baseline search system, we see more creative interactions and diverse information-seeking with our multi-persona debate system, which more effectively reduces user confirmation bias and conviction toward their initial beliefs. Overall, our study contributes to the emerging design space of LLM-based information access systems, specifically investigating the potential of simulated personas to promote greater exposure to information diversity, emulate collective intelligence, and mitigate bias in information seeking.

[IR-2] NSTRI Global Collaborative Research Data Platform

链接: https://arxiv.org/abs/2412.04474
作者: Hyeonhoon Lee,Hanseul Kim,Kyungmin Cho,Hyung-Chul Lee
关键词-EN: National University Hospital, National Strategic Technology, Seoul National University, Technology Research Institute, Strategic Technology Research
类目: Computers and Society (cs.CY); Information Retrieval (cs.IR)
*备注:

点击查看摘要

Abstract:The National Strategic Technology Research Institute (NSTRI) Data Platform operated by Seoul National University Hospital (SNUH) addresses the challenge of accessing Korean healthcare data for international research. This platform provides secure access to pseudonymized Korean healthcare data while integrating international datasets, enabling the development of more equitable and generalizable machine learning models. The system features four key AI-powered components: an intelligent data search engine utilizing domain-specific medical embeddings, a Korean-English medical translation system, a comprehensive drug search engine, and an LLM-powered medical research assistant. The platform implements containerized environments within a secure research pod architecture, ensuring data protection while maintaining research efficiency. The platform currently provides access to 10 distinct medical datasets from SNUH, categorized by access permissions and standardized for cross-dataset analysis. This infrastructure enables global collaborative healthcare research while maintaining strict data protection standards.

附件下载

点击下载今日全部论文列表